
Add bbox subsetting for ATL06 data #30

Merged: 5 commits into evetion:master on Nov 12, 2022

Conversation

@alex-s-gardner (Collaborator)

No description provided.

@evetion (Owner) left a comment

Note that there's an in_bbox! function that operates on a DataFrame and does what you want here, and is more generic. It is also slower, since you collect data that is later filtered out, so I understand the need for this feature.

start = findfirst(ind)
stop = findlast(ind)

if isnothing(start)
@evetion (Owner)

I think it's better to just return nothing.

@alex-s-gardner (Collaborator, Author) commented on Oct 24, 2022

Returning empty arrays lets one keep track of which files and tracks were unsuccessful. There are cases where some tracks from the same granule have data and others do not. If you return nothing, it becomes hard to keep straight which data belongs to which granule. The returned data needs to convert nicely into a uniform data frame.

I like to have an empty NamedTuple{} returned so that I can retain it in the dataframe of all granules and later go back to that dataframe and know which granules have already been searched. This is really helpful for building a data lake that will be periodically updated.

On that note, I'm using a modification of points that I call points_plus, which includes the granule information for each track in a variable called granule_info. This way each track can be listed as a row of a table or dataframe and is fully traceable without much overhead.

@evetion (Owner)

I recognize the need for empty granules (I have several zero size granules on disk, just to prevent repeat downloads), but I wonder about the correct approach.

From what I gather, you iterate over each track and store the results, for which you want an empty table, rather than nothing.

Btw, I think most of this code can be written as stop=0, as:

julia> [1,2][1:1:0]
Int64[]

@alex-s-gardner (Collaborator, Author) commented on Oct 25, 2022

> From what I gather, you iterate over each track and store the results, for which you want an empty table, rather than nothing.

Yes, that's exactly it. I'm building data cubes of all ICESat-2 tracks that fall within 2-degree by 2-degree tiles (geotiles):

df = reduce(vcat, DataFrame.(points_plus.(granules, bbox = geotile)))

points_plus is identical to points with the inclusion of granule_info for each row (track) in the dataframe.

I then save the df as an Arrow file. I can then go back at a later date, see which granules have already been subset, and only download, subset, and append the new granules to the df.

Returning nothing would not produce consistent dataframes that could be combined using vcat... at least that is my understanding.
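
As a minimal sketch of why the empty-but-schema-preserving return matters (hypothetical field names, not the actual package code): a NamedTuple of empty vectors still converts to a zero-row DataFrame with the right columns, so vcat works, whereas nothing would have to be filtered out and the empty granule would disappear from the record.

```julia
using DataFrames

# One track with data and one empty track, both with the same schema.
track_with_data = (longitude = [100.1, 100.2], latitude = [-70.0, -70.1], height = [2531.0, 2533.5])
track_without_data = (longitude = Float64[], latitude = Float64[], height = Float64[])

# Both convert to DataFrames with identical columns, so reduce(vcat, ...) just works;
# the empty granule survives as zero rows instead of vanishing.
df = reduce(vcat, DataFrame.([track_with_data, track_without_data]))

# Had the empty track been `nothing`, the broadcast above would error, and the
# bookkeeping of which granules were already searched would be lost.
```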

@alex-s-gardner (Collaborator, Author)

> Btw, I think most of this code can be written as stop=0, as:
>
> julia> [1,2][1:1:0]
> Int64[]

Ohh, I hadn't thought of that. Do you think it would be less efficient, since it would then reside within the h5 call, possibly mapping the variable even though nothing is being returned?

@evetion (Owner)

Hmm, I hadn't thought of that, it is indeed less efficient.

julia> @time h[1:0]  # h = file["$track/land_segments/terrain/h_te_mean"]
  0.000072 seconds (14 allocations: 608 bytes)
Float32[]

julia> @time Vector{Float32}()
  0.000002 seconds (1 allocation: 64 bytes)
Float32[]

function points(
    ::ICESat2_Granule{:ATL06},
    file::HDF5.H5DataStore,
    track::AbstractString,
    t_offset::Float64,
    step = 1,
    bbox = (min_x = -Inf, min_y = -Inf, max_x = Inf, max_y = Inf),
@evetion (Owner)

This means we will now always filter on bbox, even when no bbox is passed. Might be better to pass nothing by default?

@alex-s-gardner (Collaborator, Author)

Ya, that would make more sense. I don't think the search costs us much, but if we used bbox = nothing we could just include an if statement to check whether a search is needed. I'll add that.
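
A minimal sketch of that suggestion (simplified, with hypothetical names; the real points method takes more arguments): default bbox to nothing and only compute the index range when a bbox is actually supplied.

```julia
# Sketch only: `x` and `y` stand in for the longitude/latitude arrays read from the granule.
function subset_range(x, y, bbox::Union{Nothing,NamedTuple} = nothing)
    if bbox === nothing
        return 1, length(x)          # no bbox: read everything
    end
    ind = (bbox.min_x .<= x .<= bbox.max_x) .& (bbox.min_y .<= y .<= bbox.max_y)
    start, stop = findfirst(ind), findlast(ind)
    if isnothing(start)
        return 1, 0                  # empty range -> empty arrays downstream
    end
    return start, stop
end
```

The resulting start:step:stop can then be passed straight to the HDF5 reads, and the 1:0 case falls back to the empty-array behaviour discussed above.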

@alex-s-gardner (Collaborator, Author)

> Note that there's an in_bbox! function that operates on a DataFrame and does what you want here, and is more generic. It is also slower, since you collect data that is later filtered out, so I understand the need for this feature.

Ya, I saw that. I've also written a crop! function for the dataframes but it costs a lot to read in the extra data.

No change in functionality, only improvement in speed
@alex-s-gardner (Collaborator, Author)

@evetion I think this is ready to merge if you're in agreement. At some point we should add this for all the other datasets as well.

@alex-s-gardner (Collaborator, Author) commented on Nov 11, 2022

@evetion are you good with this structure? If so, I'm going to also implement it for GLH06.

@evetion (Owner) commented on Nov 11, 2022

Thanks for the PR, I think it's good to have this feature.

Overall I'd like to replace the bounding box (;min_x=Inf, ..) by Extents.extent(), but that's something that should happen everywhere (like also in search), so that's something I could do this weekend in another PR, but before we release this.

You previously mentioned using bbox = nothing, but now it's using Inf and checking whether any bounds are non-Inf. I think using nothing is clearer.
You could replace any(isfinite.(values(bbox))) by !isnothing(bbox).

Lastly, I think the assumption of having sorted data works for ATL06, but I'm unsure if that's also the case for ATL03 data, as those points are much closer together, and the random geolocation error could exceed that(?).
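
For reference, a minimal sketch of what the Extents.jl-based convention could look like (illustrative values; this is not part of the PR):

```julia
using Extents

# A generalized bounding box: named dimension bounds instead of min_x/max_x fields.
geotile = Extent(X = (100.0, 102.0), Y = (-72.0, -70.0))
granule_bounds = Extent(X = (99.5, 101.0), Y = (-71.0, -69.0))

# Extents.jl already provides the overlap test, which could replace
# hand-rolled checks like any(isfinite.(values(bbox))).
Extents.intersects(geotile, granule_bounds)  # true
```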

@alex-s-gardner (Collaborator, Author) commented on Nov 11, 2022

> I'd like to replace the bounding box (;min_x=Inf, ..) by Extents.extent()

Would an extent be a polygon for searching and subsetting by more complex geometry?

> Extents.extent(), but that's something that should happen everywhere (like also in search), so that's something I could do this weekend in another PR, but before we release this.

Everything is moving over to EarthData cloud, so I think we need to figure out a way to build the extents from the JSON that describes the EarthData files. Surely the extents are in there; we could always ask @betolink where they might be hiding.

> You previously mentioned using bbox = nothing, but now it's using Inf and checking whether any bounds are non-Inf. I think using nothing is clearer.

I used an Inf bbox to be consistent with how search is implemented. I'm happy to adopt either convention.

> Lastly, I think the assumption of having sorted data works for ATL06, but I'm unsure if that's also the case for ATL03 data.

That's a good point: ATL03 is sorted by time, not location, so there will be a few meters of scatter in the locations. I don't think this is a problem though, as the function never excludes data; it only includes a few points that fall outside of the bbox. Secondly, since we don't want to read in all of the data before subsetting (that would be too much overhead), we are limited to (start:step:stop), which means we need the maximum extent in terms of how the data is stored. I think our ultimate goal should be for points to have optimal read time, as this is one of the biggest bottlenecks to working with the data.
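
A hedged sketch of the read pattern being described, assuming HDF5.jl and ATL06-style group paths (the filename, beam group, and variable names are illustrative): read only the coordinate arrays, derive one contiguous start:stop range, and slice every other dataset with that same range rather than reading it in full.

```julia
using HDF5

h5open("granule.h5", "r") do file            # placeholder filename
    x = read(file["gt1l/land_ice_segments/longitude"])
    y = read(file["gt1l/land_ice_segments/latitude"])

    bbox = (min_x = 100.0, min_y = -72.0, max_x = 102.0, max_y = -70.0)
    ind = (bbox.min_x .<= x .<= bbox.max_x) .& (bbox.min_y .<= y .<= bbox.max_y)
    start, stop = findfirst(ind), findlast(ind)

    # Only this index range is read from disk; a slightly too-wide range
    # (e.g. from ATL03 time-ordering scatter) just adds a few extra points.
    height = isnothing(start) ? Float32[] :   # empty, matching the dataset's element type
        file["gt1l/land_ice_segments/h_li"][start:1:stop]
end
```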

@evetion (Owner) commented on Nov 11, 2022

> Would an extent be a polygon for searching and subsetting by more complex geometry?

Extent would be a generalized bounding box that's better defined in the ecosystem. So it's not a polygon, but it would be something that could be derived from any custom polygon that supports GeoInterface.

> Everything is moving over to EarthData cloud, so I think we need to figure out a way to build the extents from the JSON that describes the EarthData files. Surely the extents are in there; we could always ask @betolink where they might be hiding.

Can you describe your envisioned workflow? I understand that you want to search for granules with a bounding box and then want to clip the granules because they are probably bigger than your box. How would the polygon outlines of the granules fit in there?

There are rough polygons in the JSON response (although not in the UMM JSON, but that's for the other PR to discuss, I guess) that we could use.

> I used an Inf bbox to be consistent with how search is implemented. I'm happy to adopt either convention.

It's a nitpick, but it's more Julian, I guess, to have a Union{Nothing, bbox}. But you have a fair point; I've used the world bbox myself in the find method.
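
Regarding deriving an extent from a custom polygon mentioned above, a minimal plain-Julia sketch (using Extents.jl directly rather than the GeoInterface machinery, with made-up vertices):

```julia
using Extents

# Hypothetical polygon given as (x, y) vertices.
poly = [(100.0, -72.0), (102.0, -72.0), (102.0, -70.0), (100.0, -70.0)]

xs, ys = first.(poly), last.(poly)
bbox = Extent(X = extrema(xs), Y = extrema(ys))  # Extent(X = (100.0, 102.0), Y = (-72.0, -70.0))
```

Any geometry that implements GeoInterface could provide the same information, so more complex user shapes would still collapse to an Extent before subsetting.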

@alex-s-gardner (Collaborator, Author)

> Can you describe your envisioned workflow? I understand that you want to search for granules with a bounding box and then want to clip the granules because they are probably bigger than your box. How would the polygon outlines of the granules fit in there?

Polygon outlines of granules would apply to granules_from_folder and are not relevant for this PR. I personally don't need those; Extents would be sufficient for my workflows. Users looking at smaller regions will eventually want to do polygon subsetting, but that can come later (not a priority until someone asks).

1. change `bbox` default to `nothing`
2. set start and end using first/lastindex
firstindex & lastindex do not work on ::HDF5.Dataset... reverting back to:

start = 1;
last = length();
@evetion merged commit 21338c7 into evetion:master on Nov 12, 2022