Add bbox subsetting for ATL06 data #30
Note that there's an `in_bbox!` function that operates on a DataFrame, does what you want here, and is more generic. It is also slower, since you collect data that's later filtered out, so I understand this feature.
src/ICESat-2/ATL06.jl (outdated)
start = findfirst(ind)
stop = findlast(ind)

if isnothing(start)
I think it's better to just return `nothing`.
Returning empty arrays lets one keep track of which files and tracks were unsuccessful. There are cases where some tracks have data and others do not within the same granule. If you return `nothing`, it becomes hard to keep straight which granule the data belong to. The returned data need to convert cleanly to a uniform DataFrame.
I like to have an empty `NamedTuple{}` returned so that I can retain it in the DataFrame of all granules and later go back to the DataFrame and know which granules have already been searched. This is really helpful for building a data lake that will be periodically updated.
On that note, I'm using a modification of `points` that I call `points_plus`, which includes the granule information for each track in a variable called `granule_info`. This way each track can be listed as a row of a table or DataFrame and is fully traceable without much overhead.
I recognize the need for empty granules (I have several zero size granules on disk, just to prevent repeat downloads), but I wonder about the correct approach.
From what I gather, you iterate over each track and store the results, for which you want an empty table rather than `nothing`.
Btw, I think most of this code can be written as `stop=0`, as:
julia> [1,2][1:1:0]
Int64[]
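A minimal sketch of the empty-range idea on an in-memory array, with no HDF5 involved. `bbox_range` is a hypothetical helper, not something from the package, and it assumes `ind` is a Boolean mask marking the points inside the bounding box.

```julia
# Hypothetical helper: turn a Boolean mask into a contiguous index range.
# When no element is true, return the empty range 1:0 instead of `nothing`.
function bbox_range(ind::AbstractVector{Bool})
    start = findfirst(ind)
    isnothing(start) && return 1:0   # indexing with 1:0 yields an empty array
    return start:findlast(ind)
end

h = Float32[10.5, 11.0, 11.5, 12.0]

inside = bbox_range([false, true, true, false])
outside = bbox_range([false, false, false, false])

h_inside = h[inside]    # the two elements covered by the mask
h_outside = h[outside]  # zero-length vector that keeps the Float32 element type
```

Because the empty case goes through the same indexing path, the caller always gets a vector of the same element type and never has to branch on `nothing`.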
> From what I gather, you iterate over each track and store the results, for which you want an empty table rather than `nothing`.
Yes, that's exactly it. I'm building data cubes of all ICESat-2 tracks that fall within 2 degree by 2 degree tiles (geotiles):
df = reduce(vcat, DataFrame.(points_plus.(granules, bbox = geotile)))
`points_plus` is identical to `points`, with the inclusion of `granule_info` for each row (track) in the DataFrame.
I then save the `df` as an Arrow file. At a later date I can go back, see which granules have already been subset, and download, subset, and append only the new granules to the `df`.
Returning `nothing` would not produce consistent DataFrames that could be combined using `vcat`, at least that is my understanding.
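The bookkeeping described here can be sketched in plain Julia, without DataFrames or Arrow. Everything below is an assumption made for illustration: `track_row` and `granule_rows` are hypothetical stand-ins for what `points_plus` returns, and the granule and track names are made up.

```julia
# One row per track; the data column holds a vector that may be empty.
track_row(granule_info, height) = (granule_info = granule_info, height = height)

# Hypothetical stand-in for `points_plus`: one row per track of a granule.
function granule_rows(name::String, tracks::Dict{String,Vector{Float32}})
    return [track_row("$name/$t", tracks[t]) for t in sort(collect(keys(tracks)))]
end

rows_a = granule_rows("granule_A", Dict("gt1l" => Float32[1.0, 2.0], "gt1r" => Float32[]))
rows_b = granule_rows("granule_B", Dict("gt1l" => Float32[3.0]))

# All rows share one type, so they concatenate into a uniform table.
table = reduce(vcat, [rows_a, rows_b])

# Tracks that were searched but held no data in the bbox remain traceable.
searched = [row.granule_info for row in table]
empty_tracks = [row.granule_info for row in table if isempty(row.height)]
```

Since an empty track is still a row with typed, zero-length columns, a later run can compare `searched` against the list of available granules and process only the new ones.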
> Btw, I think most of this code can be written as `stop=0`, as:
> julia> [1,2][1:1:0]
> Int64[]
Ohh, I hadn't thought of that. Do you think it would be less efficient, as it would then reside within the h5 call, possibly mapping the variable even though nothing is being returned?
Hmm, I hadn't thought of that, it is indeed less efficient.
julia> @time h[1:0] # h = file["$track/land_segments/terrain/h_te_mean"]
0.000072 seconds (14 allocations: 608 bytes)
Float32[]
julia> @time Vector{Float32}()
0.000002 seconds (1 allocation: 64 bytes)
Float32[]
src/ICESat-2/ATL06.jl (outdated)
function points(
    ::ICESat2_Granule{:ATL06},
    file::HDF5.H5DataStore,
    track::AbstractString,
    t_offset::Float64,
    step = 1,
    bbox = (min_x = -Inf, min_y = -Inf, max_x = Inf, max_y = Inf),
This means we will now always filter on `bbox`, even when nothing is passed. Might be better to pass `nothing` by default?
Ya, that would make more sense. I don't think the search costs us much, but if we used `bbox = nothing` we could just include an if statement to check whether a search is needed. I'll add that.
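A sketch of what that check could look like, using plain vectors in place of HDF5 datasets. `read_range` is a hypothetical function, not the package's `points` method; the `bbox` field names follow the NamedTuple in the diff above.

```julia
# Hypothetical sketch: choose the index range to read, searching only
# when a bounding box is actually supplied.
function read_range(x::Vector{Float64}, y::Vector{Float64}; bbox = nothing)
    # No bbox given: skip the search and select everything.
    isnothing(bbox) && return 1:length(x)

    ind = (x .>= bbox.min_x) .& (x .<= bbox.max_x) .&
          (y .>= bbox.min_y) .& (y .<= bbox.max_y)
    start = findfirst(ind)
    isnothing(start) && return 1:0   # nothing inside the bbox
    return start:findlast(ind)
end

x = [0.0, 1.0, 2.0, 3.0]
y = [10.0, 11.0, 12.0, 13.0]

everything = read_range(x, y)
clipped = read_range(x, y; bbox = (min_x = 0.5, min_y = 10.5, max_x = 2.5, max_y = 12.5))
nothing_hit = read_range(x, y; bbox = (min_x = 5.0, min_y = 0.0, max_x = 6.0, max_y = 1.0))
```

With `nothing` as the default, a call without `bbox` never builds the mask, so existing callers pay no extra cost.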
Ya, I saw that. I've also written a
No change in functionality, only improvement in speed
@evetion I think this is ready to merge if you're in agreement. At some point we should add this for all other datasets as well.
@evetion are you good with this structure? If so, I'm going to also implement it for GLH06.
Thanks for the PR, I think it's good to have this feature. Overall I'd like to replace the bounding box `(;min_x=Inf, ..)` with `Extents.extent()`, but that should happen everywhere (for example in `search` as well), so I could do it this weekend in another PR, before we release this.
You previously mentioned using
Lastly, I think the assumption of sorted data holds for ATL06, but I'm unsure whether that's also the case for ATL03 data, as those points are much closer together and the random geolocation error could exceed that spacing(?).
Would an
Everything is moving over to EarthData cloud, so I think we need to figure out a way to build the extents from the JSON that includes the EarthData files. Surely the extents are in there; we could always ask @betolink where those might be hiding.
I used an
That's a good point. ATL03 is sorted by time, not location, so there will be a few meters of scatter in the location. I don't think this is a problem though, as the function never excludes data; it only includes a few points that are outside of the bbox. Secondly, since we don't want to read in all of the data before subsetting (this would be too much overhead), we are limited to
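The "never excludes, only over-includes" behaviour can be illustrated with a one-dimensional toy example. The numbers are made up, and the sketch assumes the subset is read as the contiguous range between the first and last in-bbox points, as in the `findfirst`/`findlast` code above.

```julia
# Positions along a track with a little scatter, so they are not strictly sorted.
x = [0.8, 1.1, 0.9, 1.5, 2.1, 1.9, 2.4]
min_x, max_x = 1.0, 2.0

ind = (x .>= min_x) .& (x .<= max_x)     # exact mask of in-bbox points
span = findfirst(ind):findlast(ind)      # contiguous range that would be read

# Every in-bbox point lies inside the contiguous range.
covered = all(i in span for i in findall(ind))

# The range also picks up the out-of-bbox points that sit between them.
extra = [i for i in span if !ind[i]]
```

Here the exact mask selects elements 2, 4, and 6, while the contiguous read also returns elements 3 and 5. A later exact filter can drop those extras if needed.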
Can you describe your envisioned workflow? I understand that you want to search for granules with a bounding box and then want to clip the granules because they are probably bigger than your box. How would the polygon outlines of the granules fit in there? There are rough polygons in the json response (although not in the UMM JSON, but that's for the other PR to discuss I guess) that we could use.
It's a nitpick, but more Julian I guess to have a
Polygon outlines of granules would apply to
1. change `bbox` default to `nothing`
2. set start and end using first/lastindex
`firstindex` & `lastindex` do not work on `::HDF5.Dataset`... reverting back to: start = 1; last = length();
No description provided.