HLS Data Processing with rioxarray: parallel reading and cookie questions
Posted: Tue Nov 26, 2024 5:02 pm America/New_York
TLDR: is it OK to use rioxarray's parallel read option (i.e., lock=False) to read HLS COGs from the cloud?
I'm writing a Python script that aims to do the following:
- Retrieve the HLS HTTP granule links for a given tile ID and time range, using earthaccess
- Build an xarray dataset and subset the data to 8 small chips (366x366 pixels) per HLS tile footprint
- Compute a median composite per month, and save each chip's monthly median to a GeoTIFF file.
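For the compositing step above, the per-chip monthly median reduces to a median over the time axis of each chip's stack. A minimal NumPy sketch with synthetic data (the shapes and the NaN cloud-mask handling are illustrative, not the exact pipeline code):

```python
import numpy as np

# Synthetic stand-in for one 366x366 chip observed on 5 dates within a month;
# in the real pipeline this stack would come from the xarray dataset.
rng = np.random.default_rng(42)
chip_stack = rng.random((5, 366, 366)).astype(np.float32)

# Simulate cloud-masked pixels as NaN, then composite with a NaN-aware median
# so a single cloudy observation doesn't poison the monthly value.
chip_stack[0, :10, :10] = np.nan
monthly_median = np.nanmedian(chip_stack, axis=0)

print(monthly_median.shape)  # (366, 366)
```

The resulting 2D array is what would then be written out per chip with something like rio.to_raster.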
The central piece of this process is building each of the datasets using rioxarray, which I am doing as shown in this gist: https://gist.github.com/parevalo/a23f0fe98abde52f90cfaea337a806c0
I put together that code based on the HLS tutorial notebook (https://github.com/nasa/HLS-Data-Resources/blob/main/python/tutorials/HLS_Tutorial.ipynb) and the COG best-practices notebook (https://github.com/pangeo-data/cog-best-practices/blob/main/2-dask-localcluster.ipynb)
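For concreteness, the open call at the heart of the gist follows this pattern (a sketch, not the exact gist code; the function name and chunk sizes are illustrative, and lock=False is the option my questions are about):

```python
# Sketch of how each granule band is opened lazily over HTTP.
# rioxarray and an authenticated Earthdata session are only needed at call
# time, so the import is deferred into the function body.
def open_hls_band(url, chunks=None, lock=False):
    """Lazily open one HLS COG band from a cloud URL (illustrative helper)."""
    import rioxarray  # requires rioxarray installed and network access

    if chunks is None:
        # Chunking aligned with COG internal tiling enables parallel reads.
        chunks = {"x": 512, "y": 512}
    return rioxarray.open_rasterio(url, chunks=chunks, lock=lock, masked=True)
```

With lock=False, Dask workers issue range requests concurrently instead of serializing reads behind a single lock, which is where the speedup I describe below comes from.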
One thing I have noticed is that if the lock argument of rioxarray.open_rasterio() is set to False, building the dataset is significantly faster because the bands are read in parallel. This makes a major difference when building a dataset from many granules. Since I want to apply this code over many HLS tiles in North America for multiple months, gaining that efficiency without having to download the granules in full would be great. With that in mind, I have the following questions:
1. Is it safe/sensible/recommended to use that option? I don't see it in the HLS notebook, which makes me wonder if it's discouraged. If that's the case, what is the exact reason for it? I couldn't find any other examples online that used this option, and it's not clear to me why.
2. If using that option is OK, do you have any information on how it may affect the writing and reading of the cookies created by libcurl? I'm asking because when the lock argument is left unset there is often a single cookie in the file, but when it is set to False, the cookie file is constantly written and overwritten, sometimes with no content in it at all, and it is not clear to me whether the cookies are being reused properly to avoid repeated authentications. Setting GDAL's CPL_CURL_VERBOSE='ON' didn't help me fully answer this question myself.
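For reference, the GDAL/libcurl configuration I am setting before the reads looks roughly like this (the cookie path is a placeholder; pointing GDAL_HTTP_COOKIEFILE and GDAL_HTTP_COOKIEJAR at the same file is meant to let cookies written by one request be reused by later ones):

```python
import os

# Placeholder path for the Earthdata cookie file.
cookie_file = os.path.expanduser("~/hls_cookies.txt")

os.environ["GDAL_HTTP_COOKIEFILE"] = cookie_file  # read cookies from here
os.environ["GDAL_HTTP_COOKIEJAR"] = cookie_file   # persist cookies here

# Common cloud-read settings (not cookie-specific): avoid directory listings
# and restrict curl probing to the file types actually being read.
os.environ["GDAL_DISABLE_READDIR_ON_OPEN"] = "EMPTY_DIR"
os.environ["CPL_VSIL_CURL_ALLOWED_EXTENSIONS"] = ".tif"

# Uncomment when debugging authentication round-trips:
# os.environ["CPL_CURL_VERBOSE"] = "ON"
```

My concern is whether multiple unlocked worker threads contend on that single cookie jar, which would explain the empty/rewritten file I observe.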
I am trying to avoid a scenario where I accidentally request data too aggressively, or one where authentication by cookies doesn't work as expected, resulting in repeated authorizations that would slow down the system. Any guidance would be greatly appreciated!
I have run the code on my laptop under Ubuntu 20.04.6 LTS and on a Linux HPC running AlmaLinux 8.10, in both cases with rioxarray 0.17 and rasterio 1.4.1.