HLS data access and spatial subsetting
Posted: Tue Aug 19, 2025 8:33 pm America/New_York
I am trying to programmatically access the "HLSL30_2.0" and "HLSS30_2.0" products. My goal is to efficiently retrieve data (from the oldest available up to the present) for several hundred geometries daily. The geometries are defined as H3 level 5 hexagons (~250 km² each, depending on latitude).
I need all images for bands 2, 4, NIR, and Fmask to calculate EVI time series for each hexagon. I am not directly using the vegetation index products (HLSS30_VI and HLSL30_VI) because they do not yet cover the full historical period I need (2015–present).
The main Python-based approaches I found are:
1 - CMR-STAC API + Earthdata login: Using pystac_client (connected to https://cmr.earthdata.nasa.gov/stac/LPCLOUD) to search assets, then odc.stac.load() to subset/download. I also tried using rasterio to crop granules, but performance did not improve.
2 - Harmony: Seems promising, but the collections I need (C2021957295-LPCLOUD and C2021957657-LPCLOUD) do not support spatial subsetting, forcing full granule downloads, which is slow.
3 - AppEEARS API: Likely the least suitable, as requests can take hours for a single year of data per region.
4 - Direct access to COGs on AWS (us-west-2): I have not fully tested this. Using Dask + Coiled in the correct region failed, giving the error:
"User: arn:aws:sts::643705676985:assumed-role/s3-same-region-access-role/thaleslc is not authorized to perform: s3:GetObject on resource: "arn:aws:s3:::lp-prod-protected/HLSS30.020/HLS.S30.T21LXG.2025227T140101.v2.0/HLS.S30.T21LXG.2025227T140101.v2.0.B12.tif" with an explicit deny in an identity-based policy"
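For context, the "explicit deny" above is what I get when reading the protected bucket without the temporary STS credentials that LP DAAC issues. Below is a minimal sketch of the credential flow I am trying, using the `earthaccess` library; the DAAC name `"LPDAAC"`, the credential key names (`accessKeyId`, etc.), and the `https_to_s3` URL rewrite are my assumptions about the current conventions and may need checking:

```python
def https_to_s3(url):
    # Rewrite an LP DAAC HTTPS asset link to the matching s3:// path
    # (assumption: the HTTPS path mirrors the bucket layout one-to-one).
    prefix = "https://data.lpdaac.earthdatacloud.nasa.gov/"
    return "s3://" + url[len(prefix):] if url.startswith(prefix) else url


def open_hls_cog(s3_url):
    """Open one HLS COG directly from the LP DAAC bucket.

    Only works from compute running in us-west-2; imports are local so the
    helper above stays usable without these packages installed.
    """
    import earthaccess
    import rasterio
    from rasterio.session import AWSSession

    earthaccess.login()  # reads ~/.netrc or EARTHDATA_* env vars
    creds = earthaccess.get_s3_credentials(daac="LPDAAC")  # temporary STS keys
    session = AWSSession(
        aws_access_key_id=creds["accessKeyId"],
        aws_secret_access_key=creds["secretAccessKey"],
        aws_session_token=creds["sessionToken"],
    )
    with rasterio.Env(session):
        return rasterio.open(s3_url)
```

Note that these STS credentials expire (after roughly an hour, as I understand it), so long-running Dask workers would need to refresh them.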
Given these four options, I would like to know:
1 - Are these four methods really the only/best options for my use case?
2 - Which approach is likely to offer the best performance? I suspect that accessing and subsetting the COGs directly in the S3 bucket would be fastest, but my attempt using Dask + Coiled in that region failed, and I have not yet been able to test it otherwise.
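For reference, my current approach (option 1) looks roughly like the sketch below; the helper name and the example bbox are just illustrative, and the band asset names for `odc.stac.load` (e.g. `B02`, `B04`, `Fmask`) are as I understand the HLS v2.0 naming and may need checking:

```python
def build_search_params(bbox, start="2015-01-01", end="2025-12-31"):
    """Assemble CMR-STAC search parameters for both HLS v2.0 collections."""
    return {
        "collections": ["HLSL30_2.0", "HLSS30_2.0"],
        "bbox": bbox,          # [min_lon, min_lat, max_lon, max_lat]
        "datetime": f"{start}/{end}",
    }


def search_hls(bbox):
    # The search itself is open; an Earthdata login (e.g. via ~/.netrc)
    # is needed later when the assets are actually read.
    from pystac_client import Client

    client = Client.open("https://cmr.earthdata.nasa.gov/stac/LPCLOUD")
    return list(client.search(**build_search_params(bbox)).items())


# items = search_hls([-54.0, -10.5, -53.5, -10.0])  # example hexagon bbox
# import odc.stac
# ds = odc.stac.load(items, bands=["B02", "B04", "Fmask"],
#                    bbox=[-54.0, -10.5, -53.5, -10.0], chunks={})
```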