Tips for using HLS with Dask distributed and pystac_client?
Posted: Wed Jul 31, 2024 12:33 pm America/New_York
Good afternoon,
I am currently using Dask distributed to query and process HLS data from CloudSTAC with pystac_client and Open Data Cube. I use client.map() to create percentile composites for 7 bands across several timeframes in parallel. However, it usually takes 5-10 tries to start the processing, and roughly 90% of runs fail prematurely with the error below, on a different file almost every time:
rasterio.errors.RasterioIOError: '/vsicurl/https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/HLSL30.020/HLS.L30.T10TER.2022137T185536.v2.0/HLS.L30.T10TER.2022137T185536.v2.0.Fmask.tif' not recognized as a supported file format.
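For context, here is roughly the shape of the per-task reduction (a simplified stand-in, not my actual code; the real version runs np.nanpercentile over rasterio-read arrays and is fanned out with client.map):

```python
def percentile(values, q):
    """Nearest-rank percentile of a sequence; a stand-in for the
    per-pixel reduction each Dask task performs over its time series."""
    s = sorted(values)
    idx = min(len(s) - 1, round(q / 100 * (len(s) - 1)))
    return s[idx]

# In the real script, each timeframe becomes one task, e.g.:
#   futures = client.map(make_composite, timeframes)
# where make_composite (a placeholder name) reads the 7 bands for that
# timeframe and reduces each pixel's time series as sketched above.
```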
My script does run nearly all the way through if I'm lucky. I have followed several HLS tutorials to ensure the correct GDAL environment variables are set, I am using the rasterio environment (rasterio.Env), and my Earthdata credentials are accessible to the worker nodes.
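For reference, this is roughly how I set the GDAL configuration on each worker, following the HLS tutorials (the cookie path is a placeholder for my setup, and the retry values are ones I've been experimenting with):

```python
import os

# GDAL settings for reading HLS COGs over /vsicurl/, set on every
# Dask worker before any rasterio opens happen.
GDAL_CONFIG = {
    "GDAL_DISABLE_READDIR_ON_OPEN": "EMPTY_DIR",  # skip directory listings on open
    "CPL_VSIL_CURL_ALLOWED_EXTENSIONS": "TIF",    # only fetch .tif files over HTTP
    "GDAL_HTTP_COOKIEFILE": os.path.expanduser("~/cookies.txt"),  # Earthdata auth cookies
    "GDAL_HTTP_COOKIEJAR": os.path.expanduser("~/cookies.txt"),
    "GDAL_HTTP_MAX_RETRY": "5",                   # let GDAL retry transient HTTP errors
    "GDAL_HTTP_RETRY_DELAY": "2",                 # seconds between GDAL-level retries
}
os.environ.update(GDAL_CONFIG)
```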
I understand the sheer amount of data I'm processing is likely to result in more errors, but in the end, I need to be able to reliably generate these percentile composites for several AOIs in parallel with few interruptions.
Is there any unexpected behavior, or anything I should be aware of, when using HLS data from CloudSTAC with Dask and pystac_client? Do they interfere with each other in some way?
Is there anything I can do on my end to reduce the number of retries required to kick off the processing and keep the processing running more consistently?
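One workaround I've been sketching is to retry the flaky reads myself rather than restarting the whole run (I know client.map also accepts a retries= keyword that does this at the scheduler level; the helper below is just a task-level version I'm considering):

```python
import time


def with_retries(fn, attempts=3, delay=1.0):
    """Wrap a flaky task (e.g. a /vsicurl/ read that raises
    RasterioIOError intermittently) so transient failures are
    retried instead of killing the whole composite."""
    def wrapped(*args, **kwargs):
        for attempt in range(1, attempts + 1):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == attempts:
                    raise  # out of attempts: surface the real error
                time.sleep(delay * attempt)  # simple linear backoff
    return wrapped
```

In the Dask setup this would wrap whatever function is passed to client.map, so each worker retries its own reads independently.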
Thank you!