
Add bbox subsetting for ATL06 data #30

Merged: 5 commits into evetion:master on Nov 12, 2022

Conversation

@alex-s-gardner (Collaborator)

No description provided.

@evetion (Owner) left a comment

Note that there's an in_bbox! function that operates on a DataFrame and does what you want here, and is more generic. It is also slower, since you collect data that is later filtered out, so I understand the need for this feature.

start = findfirst(ind)
stop = findlast(ind)

if isnothing(start)
@evetion (Owner)

I think it's better to just return nothing.

@alex-s-gardner (Collaborator, Author) commented on Oct 24, 2022

Returning empty arrays lets one keep track of which files and tracks were unsuccessful. There are cases where some tracks from the same granule have data and others do not. If you return nothing, it becomes hard to keep straight which data belongs to which granule. The returned data needs to convert nicely into a uniform data frame.

I like to have an empty NamedTuple{} returned so that I can retain it in the dataframe of all granules and later go back to that dataframe and know which granules have already been searched. This is really helpful for building a data lake that will be periodically updated.

On that note, I'm using a modification of points that I call points_plus, which includes the granule information for each track in a variable called granule_info. This way each track can be listed as a row of a table or dataframe and is fully traceable without much overhead.

@evetion (Owner)

I recognize the need for empty granules (I have several zero size granules on disk, just to prevent repeat downloads), but I wonder about the correct approach.

From what I gather, you iterate over each track and store the results, for which you want an empty table, rather than nothing.

Btw, I think most of this code can be written as stop=0, as:

julia> [1,2][1:1:0]
Int64[]

@alex-s-gardner (Collaborator, Author) commented on Oct 25, 2022

> From what I gather, you iterate over each track and store the results, for which you want an empty table, rather than nothing.

Yes, that's exactly it. I'm building data cubes of all ICESat-2 tracks that fall within 2-degree by 2-degree tiles (geotiles):

df = reduce(vcat, DataFrame.(points_plus.(granules, bbox = geotile)))

points_plus is identical to points with the inclusion of granule_info for each row (track) in the dataframe.

I then save the df as an Arrow file. I can then go back at a later date, see which granules have already been subset, and only download, subset, and append the new granules to the df.

Returning nothing would not produce consistent dataframes that could be combined using vcat... at least that is my understanding.
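
As a minimal sketch of why the empty-but-schema-preserving return matters (hypothetical field names, not the actual package code): a NamedTuple of empty vectors still converts to a zero-row DataFrame with the right columns, so vcat works, whereas nothing would have to be filtered out and the empty granule would disappear from the record.

```julia
using DataFrames

# One track with data and one empty track, both with the same schema.
track_with_data = (longitude = [100.1, 100.2], latitude = [-70.0, -70.1], height = [2531.0, 2533.5])
track_without_data = (longitude = Float64[], latitude = Float64[], height = Float64[])

# Both convert to DataFrames with identical columns, so reduce(vcat, ...) just works;
# the empty granule survives as zero rows instead of vanishing.
df = reduce(vcat, DataFrame.([track_with_data, track_without_data]))

# Had the empty track been `nothing`, the broadcast above would error, and the
# bookkeeping of which granules were already searched would be lost.
```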

@alex-s-gardner (Collaborator, Author)

> Btw, I think most of this code can be written as stop=0, as:
>
> julia> [1,2][1:1:0]
> Int64[]

Ohh, I hadn't thought of that. Do you think it would be less efficient, since it would then reside within the h5 call, possibly mapping the variable even though nothing is being returned?

@evetion (Owner)

Hmm, I hadn't thought of that, it is indeed less efficient.

julia> @time h[1:0]  # h = file["$track/land_segments/terrain/h_te_mean"]
  0.000072 seconds (14 allocations: 608 bytes)
Float32[]

julia> @time Vector{Float32}()
  0.000002 seconds (1 allocation: 64 bytes)
Float32[]

function points(
    ::ICESat2_Granule{:ATL06},
    file::HDF5.H5DataStore,
    track::AbstractString,
    t_offset::Float64,
    step = 1,
    bbox = (min_x = -Inf, min_y = -Inf, max_x = Inf, max_y = Inf),
@evetion (Owner)

This means we will now always filter on bbox, even when no bbox is passed. Might be better to pass nothing by default?

@alex-s-gardner (Collaborator, Author)

Ya, that would make more sense. I don't think the search costs us much, but if we used bbox = nothing we could just include an if statement to check whether a search is needed. I'll add that.
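
A minimal sketch of that suggestion (simplified, with hypothetical names; the real points method takes more arguments): default bbox to nothing and only compute the index range when a bbox is actually supplied.

```julia
# Sketch only: `x` and `y` stand in for the longitude/latitude arrays read from the granule.
function subset_range(x, y, bbox::Union{Nothing,NamedTuple} = nothing)
    if bbox === nothing
        return 1, length(x)          # no bbox: read everything
    end
    ind = (bbox.min_x .<= x .<= bbox.max_x) .& (bbox.min_y .<= y .<= bbox.max_y)
    start, stop = findfirst(ind), findlast(ind)
    if isnothing(start)
        return 1, 0                  # empty range -> empty arrays downstream
    end
    return start, stop
end
```

The resulting start:step:stop can then be passed straight to the HDF5 reads, and the 1:0 case falls back to the empty-array behaviour discussed above.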

@alex-s-gardner (Collaborator, Author)

> Note that there's an in_bbox! function that operates on a DataFrame and does what you want here, and is more generic. It is also slower, since you collect data that is later filtered out, so I understand the need for this feature.

Ya, I saw that. I've also written a crop! function for the dataframes but it costs a lot to read in the extra data.

No change in functionality, only improvement in speed
@alex-s-gardner (Collaborator, Author)

@evetion I think this is ready to merge if you're in agreement. At some point we should add this for all the other datasets as well.

@alex-s-gardner (Collaborator, Author) commented on Nov 11, 2022

@evetion are you good with this structure? If so, I'm going to also implement it for GLH06.

@evetion (Owner) commented on Nov 11, 2022

Thanks for the PR, I think it's good to have this feature.

Overall I'd like to replace the bounding box (;min_x=Inf, ..) by Extents.extent(), but that's something that should happen everywhere (like also in search), so that's something I could do this weekend in another PR, but before we release this.

You previously mentioned using bbox = nothing, but now it's using Inf and checking whether any bounds are non-Inf. I think using nothing is clearer.
You could replace any(isfinite.(values(bbox))) by !isnothing(bbox).

Lastly, I think the assumption of having sorted data works for ATL06, but I'm unsure if that's also the case for ATL03 data, as those points are much closer together, and the random geolocation error could exceed that(?).
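
For reference, a minimal sketch of what the Extents.jl-based convention could look like (illustrative values; this is not part of the PR):

```julia
using Extents

# A generalized bounding box: named dimension bounds instead of min_x/max_x fields.
geotile = Extent(X = (100.0, 102.0), Y = (-72.0, -70.0))
granule_bounds = Extent(X = (99.5, 101.0), Y = (-71.0, -69.0))

# Extents.jl already provides the overlap test, which could replace
# hand-rolled checks like any(isfinite.(values(bbox))).
Extents.intersects(geotile, granule_bounds)  # true
```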

@alex-s-gardner (Collaborator, Author) commented on Nov 11, 2022

> I'd like to replace the bounding box (;min_x=Inf, ..) by Extents.extent()

Would an extent be a polygon for searching and subsetting by more complex geometry?

> Extents.extent(), but that's something that should happen everywhere (like also in search), so that's something I could do this weekend in another PR, but before we release this.

Everything is moving over to EarthData cloud, so I think we need to figure out a way to build the extents from the JSON that describes the EarthData files. Surely the extents are in there; we could always ask @betolink where they might be hiding.

> You previously mentioned using bbox = nothing, but now it's using Inf and checking whether any bounds are non-Inf. I think using nothing is clearer.

I used an Inf bbox to be consistent with how search is implemented. I'm happy to adopt either convention.

> Lastly, I think the assumption of having sorted data works for ATL06, but I'm unsure if that's also the case for ATL03 data.

That's a good point: ATL03 is sorted by time, not location, so there will be a few meters of scatter in the locations. I don't think this is a problem though, as the function never excludes data; it only includes a few points that fall outside of the bbox. Secondly, since we don't want to read in all of the data before subsetting (that would be too much overhead), we are limited to (start:step:stop), which means we need the maximum extent in terms of how the data is stored. I think our ultimate goal should be for points to have optimal read time, as this is one of the biggest bottlenecks to working with the data.
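
A hedged sketch of the read pattern being described, assuming HDF5.jl and ATL06-style group paths (the filename, beam group, and variable names are illustrative): read only the coordinate arrays, derive one contiguous start:stop range, and slice every other dataset with that same range rather than reading it in full.

```julia
using HDF5

h5open("granule.h5", "r") do file            # placeholder filename
    x = read(file["gt1l/land_ice_segments/longitude"])
    y = read(file["gt1l/land_ice_segments/latitude"])

    bbox = (min_x = 100.0, min_y = -72.0, max_x = 102.0, max_y = -70.0)
    ind = (bbox.min_x .<= x .<= bbox.max_x) .& (bbox.min_y .<= y .<= bbox.max_y)
    start, stop = findfirst(ind), findlast(ind)

    # Only this index range is read from disk; a slightly too-wide range
    # (e.g. from ATL03 time-ordering scatter) just adds a few extra points.
    height = isnothing(start) ? Float32[] :   # empty, matching the dataset's element type
        file["gt1l/land_ice_segments/h_li"][start:1:stop]
end
```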

@evetion (Owner) commented on Nov 11, 2022

> Would an extent be a polygon for searching and subsetting by more complex geometry?

Extent would be a generalized bounding box that's better defined in the ecosystem. So it's not a polygon, but it would be something that could be derived from any custom polygon that supports GeoInterface.

> Everything is moving over to EarthData cloud, so I think we need to figure out a way to build the extents from the JSON that describes the EarthData files. Surely the extents are in there; we could always ask @betolink where they might be hiding.

Can you describe your envisioned workflow? I understand that you want to search for granules with a bounding box and then want to clip the granules because they are probably bigger than your box. How would the polygon outlines of the granules fit in there?

There are rough polygons in the JSON response (although not in the UMM JSON, but that's for the other PR to discuss, I guess) that we could use.

> I used an Inf bbox to be consistent with how search is implemented. I'm happy to adopt either convention.

It's a nitpick, but it's more Julian, I guess, to have a Union{Nothing, bbox}. But you have a fair point; I've used the world bbox myself in the find method.
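
Regarding deriving an extent from a custom polygon mentioned above, a minimal plain-Julia sketch (using Extents.jl directly rather than the GeoInterface machinery, with made-up vertices):

```julia
using Extents

# Hypothetical polygon given as (x, y) vertices.
poly = [(100.0, -72.0), (102.0, -72.0), (102.0, -70.0), (100.0, -70.0)]

xs, ys = first.(poly), last.(poly)
bbox = Extent(X = extrema(xs), Y = extrema(ys))  # Extent(X = (100.0, 102.0), Y = (-72.0, -70.0))
```

Any geometry that implements GeoInterface could provide the same information, so more complex user shapes would still collapse to an Extent before subsetting.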

@alex-s-gardner (Collaborator, Author)

> Can you describe your envisioned workflow? I understand that you want to search for granules with a bounding box and then want to clip the granules because they are probably bigger than your box. How would the polygon outlines of the granules fit in there?

Polygon outlines of granules would apply to granules_from_folder and are not relevant for this PR. I personally don't need those; Extents would be sufficient for my workflows. Users looking at smaller regions will eventually want to do polygon subsetting, but that can come later (not a priority until someone asks).

1. change `bbox` default to `nothing`
2. set start and end using first/lastindex
firstindex & lastindex do not work on ::HDF5.Dataset... reverting back to:

start = 1;
last = length();
@evetion merged commit 21338c7 into evetion:master on Nov 12, 2022