Different Chunking in MUR SST
While developing a kerchunk reference for MUR-JPL-L4-GLOB-v4.1, I noticed a different chunk shape for certain periods, starting in February 2023. The analysed_sst variable has chunk shape (1, 1023, 2047), except in the following date ranges, where the chunk shape is (1, 3600, 7200):
* 2023-02-24 to 2023-02-28
* 2023-04-22
* 2023-09-04 to 2024-03-23 (final date updated January 27, 2025)
The same appears to be true for the analysis_error variable.
The same appears to be true for the sea_ice_fraction and mask variables, except that their chunk shape was (1, 1447, 2895) and changes to (1, 4500, 9000).
Here is a notebook to reproduce these findings: https://gist.github.com/abarciauskas-bgse/a26f34fd4ed135663712914bc774c540
At this time, kerchunk cannot handle datasets with differing chunk shapes, so it is not possible to create a virtual Zarr store from files spanning these 2 chunk shapes. Hopefully this will change in the future, but I'm posting here in case others run up against this issue.
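One possible workaround is to split the granules into contiguous runs that share a chunk shape and build a separate reference for each run. Below is a minimal sketch of that grouping step. It assumes the per-day chunk shapes have already been collected into a dict (for example by reading `Dataset.chunks` with h5py). The function name `group_by_chunk_shape` and the sample data are hypothetical and only illustrate the idea.

```python
from datetime import date, timedelta


def group_by_chunk_shape(shapes_by_date):
    """Collapse {date: chunk_shape} into contiguous (start, end, shape) runs."""
    runs = []
    for day in sorted(shapes_by_date):
        shape = shapes_by_date[day]
        same_shape = bool(runs) and runs[-1][2] == shape
        next_day = bool(runs) and day - runs[-1][1] == timedelta(days=1)
        if same_shape and next_day:
            # Extend the current run by one day.
            runs[-1] = (runs[-1][0], day, shape)
        else:
            # Shape changed (or there is a gap in dates): start a new run.
            runs.append((day, day, shape))
    return runs


# Hypothetical sample covering the first change reported above
# (analysed_sst, 2023-02-22 through 2023-03-02).
SMALL = (1, 1023, 2047)
LARGE = (1, 3600, 7200)
first_day = date(2023, 2, 22)
sample = {}
for offset in range(9):
    day = first_day + timedelta(days=offset)
    in_large_period = date(2023, 2, 24) <= day <= date(2023, 2, 28)
    sample[day] = LARGE if in_large_period else SMALL

runs = group_by_chunk_shape(sample)
for start, end, shape in runs:
    print(start, end, shape)
```

Each run could then be passed to kerchunk on its own, at the cost of ending up with several reference sets instead of one.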
Questions:
* Is there any plan to reprocess the data so that the chunk shape for each variable is consistent?
* Out of curiosity, what was the reason for the change in chunk shape?
Last edited by aimeeb on Tue Jan 28, 2025 12:45 am America/New_York, edited 3 times in total.
- Subject Matter Expert
- Posts: 14
- Joined: Thu May 27, 2021 2:52 pm America/New_York
Re: Different Chunking in MUR SST
Hi Aimee,
Thanks for this in-depth analysis! We are aware that the chunking changes for certain dates and have it on our roadmap to reprocess to ensure consistency, but right now the resources of the MUR processing system don't allow us to tackle that. The chunk shapes were probably changed mid-stream due to misunderstandings and other errors.
I have not run your notebook yet, but a recent MUR granule (20240828090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc) shows the chunk sizes for the SST variables to be 1023x2047 (4 MB for the 2-byte short) and not 3600x7200 as you report. This is a good size for subsetting requests and cloud data streaming. I use Panoply or ncdump to do quick checks of chunk sizes.
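The 4 MB figure can be checked with a little arithmetic. The sketch below computes the uncompressed size of one chunk for both analysed_sst chunk shapes mentioned in this thread, assuming the 2-byte short noted above. The helper name `chunk_nbytes` is made up for illustration, and sizes on disk will be smaller if the chunks are compressed.

```python
from math import prod


def chunk_nbytes(chunk_shape, itemsize):
    """Uncompressed size in bytes of a single chunk."""
    return prod(chunk_shape) * itemsize


ITEMSIZE = 2  # bytes per value for a 2-byte short

small = chunk_nbytes((1, 1023, 2047), ITEMSIZE)
large = chunk_nbytes((1, 3600, 7200), ITEMSIZE)

for label, nbytes in (("1023x2047", small), ("3600x7200", large)):
    print(f"{label}: {nbytes} bytes, about {nbytes / 2**20:.1f} MiB")
```

The larger shape works out to roughly twelve times as many bytes per chunk, which is why it is less convenient for small subsetting requests.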
Part of our roadmap is to migrate and update the MUR production system to run 100% in the cloud. Once that is done, a reprocessing campaign will be much easier, cheaper, and faster to execute. But I can't predict when that will happen at the moment.
Regards,
Ed Armstrong, PO.DAAC
Re: Different Chunking in MUR SST
Very glad to see this reported. I had heard about it and had emailed about it, so I just wanted to indicate I'm aware of this discussion.
We're working on a reformat into GeoTIFF, and I'd like to discuss further issues that we see with these files. The degenerate rectilinear coords have sufficient noise that they appear "irregular" to some tools. It's a well-known issue, and with double-precision floating-point coords there wouldn't be as many problems. I've got a more complete listing of our reformat motivations here:
https://discourse.pangeo.io/t/fixing-ghrsst-seeking-opinions/3833
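To illustrate the coordinate noise, here is a small sketch that rounds a regular longitude axis to 32-bit floats and measures how far the spacing between neighbours drifts from the nominal step. The grid parameters (0.01 degree step, 36000 points starting at -179.99) are assumptions chosen for illustration, not values read from a granule, and the helper names are hypothetical.

```python
import struct


def to_float32(x):
    """Round a Python float (a double) to the nearest 32-bit float value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def max_step_deviation(values, step):
    """Largest absolute difference between neighbour spacing and `step`."""
    return max(abs((b - a) - step) for a, b in zip(values, values[1:]))


STEP = 0.01      # assumed grid spacing in degrees
N = 36000        # assumed number of longitude points
START = -179.99  # assumed first longitude

lon64 = [START + i * STEP for i in range(N)]  # double-precision axis
lon32 = [to_float32(v) for v in lon64]        # same axis stored as floats

dev32 = max_step_deviation(lon32, STEP)
dev64 = max_step_deviation(lon64, STEP)
print(f"float32 spacing deviates by up to {dev32:.2e} degrees")
print(f"float64 spacing deviates by up to {dev64:.2e} degrees")
```

A tool that tests for a regular grid with a tight tolerance can reject the float32 axis while accepting the double one, which matches the behaviour described above.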
Ideally I'd like to see this rewritten with doubles for lon and lat, and with the data still stored as Int16 but in Celsius (only needs a metadata tweak for unscaling).
I expect there'd be a bit of pushback on the second, but I think it's pretty important to fix the first. At any rate I'm also keen just to discuss this and related topics. Thanks!
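As a sketch of why the Celsius change would only need a metadata tweak: under the CF packing convention, values are unpacked as `packed * scale_factor + add_offset`, so shifting `add_offset` by 273.15 changes the unit without touching the stored Int16 values. The scale and offset below are assumed for illustration and should be checked against the actual file attributes. The `units` attribute would need to change as well.

```python
def unpack(packed, scale_factor, add_offset):
    """CF-style unpacking of a stored integer value."""
    return packed * scale_factor + add_offset


SCALE_FACTOR = 0.001                     # assumed, check the file attributes
ADD_OFFSET_KELVIN = 298.15               # assumed, check the file attributes
ADD_OFFSET_CELSIUS = ADD_OFFSET_KELVIN - 273.15

packed_value = -5000  # hypothetical stored Int16 value

kelvin = unpack(packed_value, SCALE_FACTOR, ADD_OFFSET_KELVIN)
celsius = unpack(packed_value, SCALE_FACTOR, ADD_OFFSET_CELSIUS)
print(f"{kelvin:.2f} K and {celsius:.2f} degC from the same stored integer")
```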
Re: Different Chunking in MUR SST
Thanks for the feedback on the GeoTIFF applications. We do have long-range plans to update the coordinate variables (time, lat, lon) to doubles in our data model. When we designed this model in the early 2000s, file size was still an issue, so we went with floats rather than doubles. Now it is less of an issue, but we are still pretty far off from implementing a change across our vast variety of international GHRSST products.
GHRSST is about standardizing and making SST datasets interoperable so we cannot unilaterally implement a change like this. Will keep you posted though.
regards,
Ed Armstrong
Re: Different Chunking in MUR SST
In response to Ed's comment "a recent MUR granule (20240828090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc) shows that chunks sizes for the SST variables to be 1023x2047 (4 MB for the 2 byte short) and not 3600 x7200 as you report" - that is true, I have updated the original post.
Based on a recent evaluation I found the following, some of which is a repeat of the original post, but all the current information is in one place:
From 2023-02-24 to 2023-02-28, on 2023-04-22, and 2023-09-04 to 2024-03-23:
* `analysed_sst` and `analysis_error` use a chunk shape `(1, 3600, 7200)`. The original chunk shape was `(1, 1023, 2047)`.
* `sea_ice_fraction` and `mask` use a chunk shape `(1, 4500, 9000)`. The original chunk shape was `(1, 1447, 2895)`.
2024-03-24 to date:
* `analysed_sst` and `analysis_error` use a chunk shape `(1, 1023, 2047)`. This is the original chunk shape.
* `sea_ice_fraction` and `mask` use a chunk shape `(1, 1023, 2047)`. **Note:** This shape has never been used for these variables before.
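For anyone scripting around this, the summary above can be written down as a small lookup. This is only a sketch of the findings in this post as of its writing. The function and constant names are hypothetical, and the table should be re-verified against the granules before relying on it.

```python
from datetime import date

SST_VARS = ("analysed_sst", "analysis_error")
ICE_VARS = ("sea_ice_fraction", "mask")

# Original chunk shapes, per the findings above.
ORIGINAL = {**{v: (1, 1023, 2047) for v in SST_VARS},
            **{v: (1, 1447, 2895) for v in ICE_VARS}}

# Chunk shapes seen during the affected periods.
CHANGED = {**{v: (1, 3600, 7200) for v in SST_VARS},
           **{v: (1, 4500, 9000) for v in ICE_VARS}}

# Inclusive date ranges that use the CHANGED shapes.
CHANGED_PERIODS = [
    (date(2023, 2, 24), date(2023, 2, 28)),
    (date(2023, 4, 22), date(2023, 4, 22)),
    (date(2023, 9, 4), date(2024, 3, 23)),
]

# From this date on, all four variables share one shape.
UNIFORM_FROM = date(2024, 3, 24)
UNIFORM_SHAPE = (1, 1023, 2047)


def expected_chunk_shape(variable, day):
    """Chunk shape reported in this thread for a variable on a given day."""
    if day >= UNIFORM_FROM:
        return UNIFORM_SHAPE
    if any(start <= day <= end for start, end in CHANGED_PERIODS):
        return CHANGED[variable]
    return ORIGINAL[variable]


print(expected_chunk_shape("mask", date(2023, 4, 22)))
print(expected_chunk_shape("mask", date(2024, 8, 28)))
```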