Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add bbox subsetting for ATL06 data #30

Merged
merged 5 commits into from
Nov 12, 2022
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
58 changes: 46 additions & 12 deletions src/ICESat-2/ATL06.jl
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
"""
points(g::ICESat2_Granule{:ATL06}, tracks=icesat2_tracks, step=1)
points(g::ICESat2_Granule{:ATL06}, tracks=icesat2_tracks, step=1, bbox = (min_x = -Inf, min_y = -Inf, max_x = Inf, max_y = Inf))

Retrieve the points for a given ICESat-2 ATL06 (Land Ice) granule as a list of namedtuples, one for each beam.
The names of the tuples are based on the following fields:
Expand All @@ -25,37 +25,71 @@ function points(
granule::ICESat2_Granule{:ATL06};
tracks = icesat2_tracks,
step = 1,
bbox = (min_x = -Inf, min_y = -Inf, max_x = Inf, max_y = Inf),
)
dfs = Vector{NamedTuple}()
HDF5.h5open(granule.url, "r") do file
t_offset = read(file, "ancillary_data/atlas_sdp_gps_epoch")[1]::Float64 + gps_offset
for (i, track) in enumerate(tracks)
if in(track, keys(file)) && in("land_ice_segments", keys(file[track]))
track_nt = points(granule, file, track, t_offset, step)
track_nt.height[track_nt.height.==fill_value] .= NaN
track_nt = points(granule, file, track, t_offset, step, bbox)
if !isempty(track_nt.height)
track_nt.height[track_nt.height.==fill_value] .= NaN
end
push!(dfs, track_nt)
end
end
end
return dfs
end


function points(
::ICESat2_Granule{:ATL06},
file::HDF5.H5DataStore,
track::AbstractString,
t_offset::Float64,
step = 1,
bbox = (min_x = -Inf, min_y = -Inf, max_x = Inf, max_y = Inf),
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means we will now always filter on bbox, even when nothing is passed. Might be better to pass nothing by default?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya that would make more sense. I don't think the search costs us much but if we used bbox = nothing we could just include an if statement to check if a search is needed. I'll add that.

)
z = file["$track/land_ice_segments/h_li"][1:step:end]::Vector{Float32}
sigma_geo_h = file["$track/land_ice_segments/sigma_geo_h"][1:step:end]::Vector{Float32}
h_li_sigma = file["$track/land_ice_segments/h_li_sigma"][1:step:end]::Vector{Float32}
x = file["$track/land_ice_segments/longitude"][1:step:end]::Vector{Float64}
y = file["$track/land_ice_segments/latitude"][1:step:end]::Vector{Float64}
t = file["$track/land_ice_segments/delta_time"][1:step:end]::Vector{Float64}
q = file["$track/land_ice_segments/atl06_quality_summary"][1:step:end]::Vector{Int8}
dem = file["$track/land_ice_segments/dem/dem_h"][1:step:end]::Vector{Float32}
x = file["$track/land_ice_segments/longitude"][:]::Vector{Float64}
y = file["$track/land_ice_segments/latitude"][:]::Vector{Float64}

# find index of points inside of bbox
ind = (x .> bbox.min_x) .& (y .> bbox.min_y) .& (x .< bbox.max_x) .& (y .< bbox.max_y)
start = findfirst(ind)
stop = findlast(ind)

if isnothing(start)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to just return nothing.

Copy link
Collaborator Author

@alex-s-gardner alex-s-gardner Oct 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by returning empty arrays it lets one keep track of what files and tracks were unsuccessful. There are issues where some tracks have data and others do not from the same granule. If you return nothing then It becomes hard to keep things strait as to which data the granules belong. The returned data needs to nicely convert to a uniform data frame.

I like to have an empty NamedTuple{} returned so that I can retain it in the dataframe of all granules and later go back to the dataframe and know which granules have already been searched. This is really helpful for building a datalake that will be periodically updated..

On that note I'm using a modification of points that I call points_plus that includes the granule information for each track in a variable called granule_info. This way each track can be listed as a row of a table or dataframe and is fully traceable without much overhead.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recognize the need for empty granules (I have several zero size granules on disk, just to prevent repeat downloads), but I wonder about the correct approach.

From what I gather, you iterate over each track and store the results, for which you want an empty table, rather than nothing.

Btw, I think most of this code can be written as stop=0, as:

julia> [1,2][1:1:0]
Int64[]

Copy link
Collaborator Author

@alex-s-gardner alex-s-gardner Oct 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I gather, you iterate over each track and store the results, for which you want an empty table, rather than nothing.

Yes, that's exactly it. I'm building data cubes of all ICESat-2 tracks that fall within 2 degree by 2 degree tiles (geotiles):

df = reduce(vcat, DataFrame.(points_plus.(granules, bbox = geotile)))

points_plus is identical to points with the inclusion of granule_info for each row (track) in the dataframe.

I then save the df as an Arrow file.. I can then go back at a later date and see which granules have already been subset and only download and subset and append new granules to the df.

returning nothing would not produce consistent dataframes that could combined using 'vcat'... at least that is my understanding

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, I think most of this code can be written as stop=0, as:

julia> [1,2][1:1:0]
Int64[]

Ohh.. I hadn't thought of that. Do you think it would be less efficient as it would then reside within the h5 call, possibly mapping the variable even though nothing is being returned?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I hadn't thought of that, it is indeed less efficient.

julia> @time h[1:0]  # h = file["$track/land_segments/terrain/h_te_mean"]
  0.000072 seconds (14 allocations: 608 bytes)
Float32[]

julia> @time Vector{Float32}()
  0.000002 seconds (1 allocation: 64 bytes)
Float32[]

@warn "no data found within bbox: $(file.filename)"

spot_number = attrs(file["$track"])["atlas_spot_number"]::String
atlas_beam_type = attrs(file["$track"])["atlas_beam_type"]::String

nt = (;
longitude = Vector{Float64}[],
latitude = Vector{Float64}[],
height = Vector{Float32}[],
height_error = Vector{Float64}[],
datetime =Vector{Dates.DateTime}[],
quality = BitVector[],
track = Fill(track, 0),
strong_beam = Fill(atlas_beam_type == "strong", 0),
detector_id = Fill(parse(Int8, spot_number), 0),
height_reference = Vector{Float32}[],
)
return nt
end

# only include x and y data within bbox
x = x[start:step:stop]
y = y[start:step:stop]

z = file["$track/land_ice_segments/h_li"][start:step:stop]::Vector{Float32}
sigma_geo_h = file["$track/land_ice_segments/sigma_geo_h"][start:step:stop]::Vector{Float32}
h_li_sigma = file["$track/land_ice_segments/h_li_sigma"][start:step:stop]::Vector{Float32}
t = file["$track/land_ice_segments/delta_time"][start:step:stop]::Vector{Float64}
q = file["$track/land_ice_segments/atl06_quality_summary"][start:step:stop]::Vector{Int8}
dem = file["$track/land_ice_segments/dem/dem_h"][start:step:stop]::Vector{Float32}
spot_number = attrs(file["$track"])["atlas_spot_number"]::String
atlas_beam_type = attrs(file["$track"])["atlas_beam_type"]::String
times = unix2datetime.(t .+ t_offset)
Expand Down