Improve earthdata access #29

Merged
merged 13 commits into from
Jan 7, 2023
Conversation

alex-s-gardner
Collaborator

[1] remove n5eil01u.ecs.nsidc.org credentials from .netrc
[2] get urls to earthdatacloud for ICESat2 data
@alex-s-gardner
Collaborator Author

Don't merge this ... I will add support for other sensors as well using the new unified metadata .json once I figure out how to parse the damn thing

@evetion
Owner

Looking very interesting. My main question would be how we can still access the original (non-earthdata cloud) urls. Will there be a s3=true option?

@alex-s-gardner
Collaborator Author

> Looking very interesting. My main question would be how we can still access the original (non-earthdata cloud) urls. Will there be a s3=true option?

I believe the url to the DAAC hosted copy is buried in the .umm_json file. Maybe we could have 3 fields in the Granule types:

https_cloud
s3
https_daac
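
The three-field idea could be sketched roughly as below. This is purely hypothetical: field names and layout follow the proposal in this comment, not the actual Granule types defined in SpaceLiDAR.jl.

```julia
# Hypothetical sketch of a granule type carrying all three url variants;
# not the real SpaceLiDAR.jl definition.
struct Granule{T}
    id::String
    https_cloud::Union{String,Nothing}  # EarthDataCloud HTTPS url
    s3::Union{String,Nothing}           # direct S3 url (usable from us-west-2)
    https_daac::Union{String,Nothing}   # on-prem DAAC HTTPS url
    info::NamedTuple
end
```

A download call could then pick whichever field is populated and preferred for the current environment.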

@evetion
Owner

evetion commented Oct 24, 2022

What is your use-case here precisely, in terms of API calls and the expected result(s)?

Before, we only stored 1 url (either https, or a local filepath). With this code we would store multiple ones, which raises the question of how you would choose the correct url when you call download. And at the moment download only works for https urls.

@alex-s-gardner
Collaborator Author

> What is your use-case here precisely, in terms of API calls and the expected result(s)?
>
> Before, we only stored 1 url (either https, or a local filepath). With this code we would store multiple ones, which raises the question of how you would choose the correct url when you call download. And at the moment download only works for https urls.

I would like to eventually be able to run my code on EC2 and on servers. When on EC2 I would like to give preference to S3. As for the two https paths, downloading from the DAAC is typically 2x faster than downloading from the cloud https, but the DAAC servers have been flaky lately (they were just down for 24 hrs), so I've been using the cloud https.

@evetion
Owner

evetion commented Oct 25, 2022

Ok, then I envisage the API as follows:

  • search takes a provider, either the DAAC (default) or the EarthDataCloud, as I don't see the urls to DAAC and EarthDataCloud in one response.
  • download keeps working normally, but can take an s3=true parameter, which requires a granule created from the correct provider, so it has an s3_url. search also optionally takes s3=true, which automatically switches the provider?
  • download(granule, s3=true) (or s3_download?) then also calls earthdata_cloud_s3 or even earthdata_s3_env! (already implemented) if it's not expired yet (not yet implemented), so any S3 download actually works (whether run from shell, or from within Julia with AWSS3).
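
Putting the envisaged calls together, usage could look like this. A hypothetical sketch only: the provider and s3 keyword names follow the discussion above, not a released SpaceLiDAR.jl API.

```julia
# Hypothetical usage of the API sketched above.
granules = search(:ICESat2, :ATL08)                  # DAAC provider by default
granules = search(:ICESat2, :ATL08; s3 = true)       # switches to the cloud provider

g = granules[1]
download(g)             # plain https download, as before
download(g, s3 = true)  # requires g to carry an s3 url; refreshes S3
                        # credentials first if they have expired
```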

@alex-s-gardner
Collaborator Author

alex-s-gardner commented Oct 25, 2022

@evetion that makes sense to me. DAAC and EarthDataCloud urls and s3 paths are all included in the granules.umm_json.

but @betolink mentioned: "you'll have to know the provider: if you want NSIDC's cloud collections you use NSIDC_CPRD; if you want the DAAC-hosted collections, NSIDC_ECS. It's not intuitive for a new user. At the collection level you can use cloud_hosted and the short name; at the granule level you need the short name and the provider to differentiate cloud vs on-prem."

@betolink

That's correct @alex-s-gardner. To complicate things a little further, cloud hosted collections come with 2 sets of links: the direct S3 links and HTTPS links. HTTPS links are throttled and will be on average 2x slower than getting the same data from their DAACs. My advice is: if we are not running our code in us-west-2, it's better to use the DAAC urls (NSIDC_ECS).

@alex-s-gardner
Collaborator Author

@betolink "if" the DAAC servers are up and running :-)

@evetion
Owner

evetion commented Oct 26, 2022

Some small remarks:

The UMM JSON doesn't contain Polygon bounds for a granule, whereas the normal json does. This would be blocking for #28.

Mimetype of files (which I would filter on instead of "GET DATA") is application/x-hdfeos for DAAC, but application/x-hdf5 for EarthDataCloud.

@evetion
Owner

evetion commented Jan 3, 2023

I've made some changes here, so this is fully backwards compatible:

  • Reverted the s3 fields on granules
  • Added the non-umm json back

However, some big changes:

  • Dropped JSON dep
  • Find is renamed to search
  • Version is now an int
  • Search now takes only mission + product arguments, the rest are kwargs
  • Kwargs can specify the provider (daac or cloud) and s3 (yes/no)

The internal earthdata_search is fully keywords only now and allows finer control over requesting number of items/pages. It also includes a umm option, as we can parse that, but it's not enabled by default, as it is missing polygons in the response.

I will still need to merge with the master branch (which has polygon support) and investigate why earthdata doesn't return all pages in some cases before we can merge this. I also hope to have an s3 download example working by then.

edit: Do you have a preference for s3 downloads? Call aws s3 externally from Julia, or go with AWSS3.jl?

@alex-s-gardner
Collaborator Author

> edit: Do you have a preference for s3 downloads? Call aws s3 externally from Julia, or go with AWSS3.jl?

Keeping this all within Julia would be nice, so AWSS3.jl
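
For reference, a pure-Julia S3 fetch with AWSS3.jl could be sketched as follows. This assumes temporary Earthdata S3 credentials have already been obtained (e.g. via the earthdata_s3_env! helper mentioned above); the credential variables and the example key are placeholders taken from this thread, not working values.

```julia
# Sketch of an S3 download via AWS.jl + AWSS3.jl, assuming temporary
# Earthdata credentials (access_key_id, secret_access_key, session_token)
# are already available in scope.
using AWS, AWSS3

creds = AWSCredentials(access_key_id, secret_access_key, session_token)
config = global_aws_config(; region = "us-west-2", creds = creds)

# Bucket and key as they appear in the ATL08 example later in this thread.
s3_get_file(config, "nsidc-cumulus-prod-protected",
    "ATLAS/ATL08/005/2018/11/01/ATL08_20181101194503_05220107_005_01.h5",
    "ATL08_20181101194503_05220107_005_01.h5")
```

Note the temporary Earthdata credentials expire after roughly an hour, so the config would need to be rebuilt before long-running downloads.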

@evetion
Owner

evetion commented Jan 4, 2023

> Keeping this all within Julia would be nice, so AWSS3.jl

julia> vietnam = (min_x = 102., min_y = 8.0, max_x = 103.0, max_y = 9.0);
julia> granules = SpaceLiDAR.search(:ICESat2, :ATL08; bbox=vietnam, version=5, s3=true);
julia> g = granules[1]
ICESat2_Granule{:ATL08}("ATL08_20181101194503_05220107_005_01.h5", "s3://nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/11/01/ATL08_20181101194503_05220107_005_01.h5", NamedTuple(), (type = :ATL08, date = Dates.DateTime("2018-11-01T19:45:03"), rgt = 522, cycle = 1, segment = 7, version = 5, revision = 1, ascending = false, descending = true))

julia> download!(g)  # takes some time...
ICESat2_Granule{:ATL08}("ATL08_20181101194503_05220107_005_01.h5", "/Users/evetion/code/SpaceLiDAR.jl/ATL08_20181101194503_05220107_005_01.h5", NamedTuple(), (type = :ATL08, date = Dates.DateTime("2018-11-01T19:45:03"), rgt = 522, cycle = 1, segment = 7, version = 5, revision = 1, ascending = false, descending = true))

edit: This now works without getting credentials manually.

@evetion
Owner

evetion commented Jan 7, 2023

I'm gonna go ahead and merge this; I will make documentation updates before a release. In the meantime, could you test the new functionality?

@evetion evetion merged commit 4695caf into evetion:master Jan 7, 2023
3 participants