User:Markf129/Earth sciences data format interoperability: Difference between revisions
bypass deleted redirect Template:Cite article per WP:Redirects for discussion/Log/2022 August 3#Template:Cite_article |
(36 intermediate revisions by 4 users not shown)
Line 1:
{{Userspace draft|date=July 2010}}
When studying the Earth sciences
| title = Model Data Interoperability for the United States Integrated Ocean Observing System
| author = Richard P. Signell
Line 6:
| url = http://www.usnfra.org/committees/modeling/signell_final%20report_mar8.pdf
}}
</ref>. In some cases, science data has been migrating less rapidly to a standards-based approach<ref>{{cite
| title = Standards-based data interoperability in the climate sciences
| author = Andrew Woolf, Ray Cramer, Marta Gutierrez, Kerstin Kleese van Dam, Siva Kondapalli,
Line 13:
| url = http://journals.cambridge.org/action/displayFulltext?type=1&fid=296181&jid=MAP&volumeId=12&issueId=01&aid=296180
}}
</ref>. Because of these issues, interoperability of data for collaboration is critical in building a continued quantitative understanding of the sciences<ref>{{cite
| title = Achieving interoperability of spatial data
| author = Clemens Portele, Freddy Fierens, Eva Klien
Line 21:
</ref>.
Interoperability of observational or model data must be easy and transparent, without having to reformat the data, write special tools to read or extract the data, or rely on specific proprietary software. Adhering to common formats would bring several benefits. First, it would promote the exchange of models and relevant science data. Second, observational data could be scaled and compared more easily to models. Third, it would eliminate confusion and unnecessary format conversions. The last is perhaps the most important, as considerable time can be spent converting between the different data formats<ref>{{cite news
| title = Background on BUFR and GRIB Formats
| author = Doug McLain
| date = October 5, 2009
| url = http://www.oceanteacher.org/OTMediawiki/index.php/BUFR_and_GRIB_Formats#Background
}}
</ref>. Therefore, it is important to understand the features and limitations of each format.
==Overview and
A [[data model]] (e.g. [[NetCDF]]) describes structured data by providing an unambiguous and neutral view of how the data is organized<ref>{{cite
| title = Differences among the data models used by the geographic information systems and atmospheric science communities
| author = Stefano Nativi, University of Florence, Prato, Italy and M. B. Blumenthal, J. Caron, B. Domenico, T. Habermann, D. Hertzmann, Y. Ho, R. Raskin, and J. Weber
Line 33 ⟶ 40:
# A collection of operations that can be applied to the objects such as retrieval, update, subsetting, and averaging.
# A collection of integrity rules that define the legal states (set of values) or changes of state (operations on values).
A [[file format]] defines how data is encoded for storage using a defined structure such as
For example, data models contain components such as dimensions, variables, types, and attributes. Some models can also organize these components logically into groups. Together, these components capture the meaning of data and the relations among data fields in an array-oriented dataset. In contrast to variables, which are intended for bulk data, attributes are intended for ancillary data
| title =
| url = http://www.unidata.ucar.edu/software/netcdf/docs/netcdf/index.html
Line 42 ⟶ 49:
Interoperability requires that each dataset representation is understood at the core level for each model, so their relationships can be understood. In some cases, models may be inter-compatible simply due to a similar dataset.
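The components described above (dimensions, variables, attributes, groups, and integrity rules) can be illustrated with a minimal Python sketch. It does not use any NetCDF or HDF library; the class names <code>Variable</code> and <code>Group</code>, the method names, and the sample values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union

Scalar = Union[int, float, str]


@dataclass
class Variable:
    """Bulk data: a flat list of values shaped by named dimensions."""
    dimensions: Tuple[str, ...]
    values: List[float]
    # Attributes hold ancillary data such as units.
    attributes: Dict[str, Scalar] = field(default_factory=dict)


@dataclass
class Group:
    """A container of dimensions, variables, attributes, and child groups."""
    dimensions: Dict[str, int] = field(default_factory=dict)  # name -> length
    variables: Dict[str, Variable] = field(default_factory=dict)
    attributes: Dict[str, Scalar] = field(default_factory=dict)
    groups: Dict[str, "Group"] = field(default_factory=dict)

    def expected_size(self, name: str) -> int:
        """Number of values implied by the variable's dimensions.

        Simplification: dimensions are looked up in this group only,
        not in any parent group.
        """
        size = 1
        for dim in self.variables[name].dimensions:
            size *= self.dimensions[dim]
        return size

    def is_consistent(self, name: str) -> bool:
        """Integrity rule: stored values must fill the declared shape."""
        return len(self.variables[name].values) == self.expected_size(name)


# Hypothetical dataset: temperature on a 2 (time) x 3 (lat) grid.
root = Group(dimensions={"time": 2, "lat": 3},
             attributes={"title": "example dataset"})
root.variables["temperature"] = Variable(
    dimensions=("time", "lat"),
    values=[280.0, 281.5, 279.0, 282.0, 283.5, 281.0],
    attributes={"units": "K"},
)
```

Two formats whose models can both be mapped onto a structure like this one are candidates for straightforward conversion; where one model lacks a component (for example, groups), the mapping has to flatten or drop it.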
===
NetCDF is especially useful for gridded data and time series data, although it can be used with satellite swath data.
HDF is very useful for storing complex files with their associated metadata. HDF-EOS provides structural metadata at both the object and file level, making it easier for client programs to read. HDF-EOS defines certain kinds of Earth science data objects, and specifies how to organize them in HDF4 and HDF5. HDF-EOS supports grid, swath, and point data.
GeoTIFF is a specialization of the TIFF format that incorporates geographic information embedded as tags within the file. The geographic information allows data in the TIFF formatted file to be displayed in geographically correct locations.
GRIB files contain one or more messages, or records, each with a single parameter and an accompanying grid location (which can be a standard grid or user defined). Data is equally spaced at a defined latitude or longitude step, which is contained in the message. A single GRIB file can contain separate records for many different parameters. For example, one file could contain humidity data for several elevations over several time periods as well as snow depth for the same elevations and time periods.
BUFR is the primary format used operationally on the World Meteorological Organization (WMO) Global Telecommunications System for real-time global exchange of weather and satellite observations. BUFR is self-describing and table-driven, and can encode a wide variety of meteorological data: land observations, radar data, climatological data, etc.
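Because a GRIB record on a regular grid stores a step rather than every coordinate, a reader has to rebuild the coordinate axes itself. The following Python sketch shows the arithmetic only; it does not decode GRIB, and the function name <code>regular_axis</code> and the 2.5 degree grid values are assumptions made for illustration.

```python
from typing import List


def regular_axis(first: float, step: float, count: int) -> List[float]:
    """Rebuild an equally spaced coordinate axis from its first value,
    step, and number of points."""
    return [first + i * step for i in range(count)]


# Hypothetical global 2.5-degree grid, scanning north to south.
lats = regular_axis(first=90.0, step=-2.5, count=73)
lons = regular_axis(first=0.0, step=2.5, count=144)
```

A format that stores explicit coordinate arrays (such as a latitude and longitude variable) would write out <code>lats</code> and <code>lons</code> in full, which is one of the differences a converter has to bridge.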
===Data model relationships===
It is important to recognize that any given application will have its own data structure and size that may include variables, tables, arrays, meshes, etc. Each application must correctly map its own structure to that of the data model. Each data model will typically include:
Line 49 ⟶ 68:
*** NetCDF-4 can read HDF5
*** HDF5 can read XDR based NetCDF-4
*** GRIB2 is backwards compatible with GRIB
* Group - a collection of objects
** HDF5 and NetCDF-4 use the same hierarchical concept, similar to the directory structure in Unix
** GRIB groups are one field per message only
** GRIB2 groups are more than one field, repeated, in a single message
* Dimension - used to specify variable shapes, common grids, and coordinate systems.
** NetCDF dimensions have a name and a length
** HDF defines a dataset for descriptions, and a dataspace for length
* Variable - an array of values of the same type.
** NetCDF variables are used in the same context as HDF data elements
Line 58 ⟶ 81:
** NetCDF defines this as variables, dimensions, and attributes
** HDF defines this as data elements (variables), and dimensions and attributes are described by the datatype and dataspace
** A GRIB dataset is a collection of self-contained records
* Datatype - a description of a specific class of data element, including its storage layout as a pattern of bits.
** HDF datatypes define the storage format of an element
** NetCDF datatypes are defined in the variable, either as text or numeric
* Dataspace - a description of the dimensions of a multidimensional array.
** A HDF dataspace
** NetCDF defines the dimensions (scalar, vector, or matrix) in the variable as a shape
* Attribute - a named data value associated with a group, dataset, or named datatype.
Line 68 ⟶ 92:
* Property List - a collection of parameters controlling options in the library model.
There are some important differences in how the data formats handle attributes. Global file attributes are written to NetCDF files by assigning attributes to the variable that references the file. HDF will typically store global attributes at the beginning of its file. The GRIB format is a series of independent records with data points. However, space, time, and even the origin of the data must sometimes be derived from outside the file (e.g. external tables). GRIB2 overcomes these challenges, and allows for more diversity in the records.
===Data model representations===
NetCDF is a simple format that works best with gridded or time series data. The NetCDF classic model (NetCDF 64-bit offset, NetCDF-4 classic) represents:
* dimensions, variables, and attributes
Line 98 ⟶ 124:
The GRIB and GRIB2 models:
* table driven
* file format for meteorological data in binary
* header descriptions for data packing, definition, and data representation type
The BUFR model:
* table driven
* file format for meteorological data in binary
* sections include: indicator, identification, optional, description, data, and end
===
[[georeference|Georeferencing]] establishes the relationship between raster or vector images and coordinates, and is also used to determine the spatial location of other geographical features. When translating between different data formats, it is often necessary to establish a common coordinate system reference. In some cases, additional reference information, such as a [[world file]], may be needed in order to do the translation. For example, challenges arise when grid data is encoded in a "thinned" format, usually in the longitudinal dimension, and interoperability algorithms are needed. Translating between the formats always involves trade-offs. Various GIS tools can help transform image data to some geographic control framework, such as [[ArcGIS|ArcMap]], PCI Geomatica, or [[ERDAS Imagine]].
* NetCDF
** No standard for storing georeferencing, some options to use for translating include:
*** Metadata tag 'grid_mapping'
*** Latitude, Longitude grid array
*** Spatial_ref and Geo transform array
* HDF
** No standard for storing georeferencing, subdataset_type may contain swath data
* HDF-EOS2 and HDF-EOS5
** Geolocation and temporal information to spatial data
** Not generally accessible to GIS community (e.g. must be converted to GeoTIFF)
* GeoTIFF
** Georeferencing may be contained within file
** ESRI world file with MapInfo may be used
* GRIB
** Grid coordinates defined in description section
* BUFR
** Coordinates defined in element descriptor section
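As an illustration of the external reference information mentioned above, a world file holds six numbers that define an affine transform from pixel positions to map coordinates. The Python sketch below applies that transform. The function names and the sample values (30-unit pixels and an arbitrary origin) are hypothetical, and the sketch assumes the six values appear in the conventional order A, D, B, E, C, F.

```python
from typing import List, Tuple


def parse_world_file(text: str) -> List[float]:
    """Read the six world file parameters, one per line: A, D, B, E, C, F."""
    values = [float(token) for token in text.split()]
    if len(values) != 6:
        raise ValueError("a world file must contain exactly six numbers")
    return values


def pixel_to_map(params: List[float], col: float, row: float) -> Tuple[float, float]:
    """Map a pixel (col, row) to map coordinates (x, y).

    A and E are the pixel sizes in x and y (E is usually negative),
    B and D are rotation terms, and (C, F) is the map position of the
    centre of the upper-left pixel.
    """
    a, d, b, e, c, f = params
    x = a * col + b * row + c
    y = d * col + e * row + f
    return x, y


# Hypothetical world file: 30-unit pixels, no rotation.
example = "30.0\n0.0\n0.0\n-30.0\n500000.0\n4600000.0\n"
params = parse_world_file(example)
x, y = pixel_to_map(params, col=10, row=20)
```

The world file carries no coordinate system identifier, so the projection itself still has to be known from another source when translating to a format that embeds georeferencing.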
===File formats===
Data models must be stored or encoded in a specific file format. Each format has options for which data types, attributes, dimensions, or variables can be used. A brief overview of each file format's capabilities is shown below.
Line 149 ⟶ 198:
When designing a convention, certain principles are considered. Some principles may include metadata requirements, interpretation of the data, ease of use, descriptions, and naming.
==Conversion
When converting between the various formats, the translating software must assemble the data and records into similar variables, dimensions, and coordinates. In some cases, a format may not contain all the information needed to translate to the other format. For example, when converting from GRIB to NetCDF often all the needed GRIB dimensions are present. In order to assemble related records into NetCDF-like variables, sometimes a single dimension must be used. In this case, the variable is given the same name as the NetCDF dimension.
Dimensions may be established by first sorting the given grid data into a coherent order. Only then, if a dimension is not present, it will be absent in the conversion. In contrast, attributes such as the start time may not change from record to record. In these cases, the same attribute value may be assigned to the subsequent variables.
It is a good practice to still convert data even when elements are missing, but warn the user of potential problems.
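The conversion steps described above (grouping related records, sorting them into a coherent order, dropping a dimension that is not present, and warning the user) can be sketched in Python as follows. This is an illustrative outline under stated assumptions, not the behaviour of any particular conversion tool: the <code>Record</code> type, the <code>assemble</code> function, and the sample humidity and snow depth values are all hypothetical.

```python
import warnings
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional


class Record(NamedTuple):
    """One GRIB-like record: a single parameter on one grid."""
    parameter: str
    time: int
    level: Optional[int]  # None when the record carries no level
    values: List[float]


def assemble(records: List[Record]) -> Dict[str, dict]:
    """Group records by parameter and sort them into variable order."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec.parameter].append(rec)

    variables = {}
    for name, recs in grouped.items():
        if any(r.level is None for r in recs):
            # Still convert, but tell the user what was left out.
            warnings.warn(f"{name}: no level information; "
                          "'level' dimension omitted")
            recs.sort(key=lambda r: r.time)
            dims = ("time",)
        else:
            recs.sort(key=lambda r: (r.time, r.level))
            dims = ("time", "level")
        variables[name] = {
            "dimensions": dims,
            "data": [r.values for r in recs],
        }
    return variables


# Hypothetical records, deliberately out of order.
records = [
    Record("humidity", time=6, level=850, values=[0.6, 0.7]),
    Record("humidity", time=0, level=850, values=[0.5, 0.6]),
    Record("humidity", time=0, level=500, values=[0.2, 0.3]),
    Record("humidity", time=6, level=500, values=[0.3, 0.4]),
    Record("snow_depth", time=6, level=None, values=[0.10, 0.00]),
    Record("snow_depth", time=0, level=None, values=[0.12, 0.00]),
]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    variables = assemble(records)
```

Issuing a warning rather than raising an error keeps the conversion going when elements are missing, while still recording what the output lacks.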
==Conversion tables==
Given the many choices for representing data, it is useful to know quickly whether your data can be accessed, modified, or converted to a different format. The tables below answer a subset of those questions. To avoid ambiguity, each cell states the data model, file format (or file extension), convention, and versions where appropriate, on three lines.
Line 156 ⟶ 212:
{| class="wikitable" style="text-align: center; width: 400px; height: 200px;"
|-
!
! NetCDF classic<br>classic<br>CF
! NetCDF enhanced<br>netCDF-4<br>CF
Line 175 ⟶ 231:
|-
| <b>HDF5<br>HDF5<br>HDF5</b> || No || Yes, but limited<ref>
{{cite
| url = http://www.unidata.ucar.edu/software/netcdf/docs/faq.html#fv15
}}</ref> || Yes, but limited<ref>
{{cite
| url = http://www.hdfgroup.org/h5h4-diff.html
}}</ref> || || || || || || ||
|- valign="top" style="background: #cccccc;"
| <b>HDFEOS2<br>HDF4<br>HDF4</b> || || || || || || || Convert<ref>{{cite
| url = http://newsroom.gsfc.nasa.gov/sdptoolkit/HEG/HEGHome.html
}}</ref> || || ||
Line 190 ⟶ 246:
| <b>[[GeoTIFF]]<br>GeoTIFF<br>TIFF</b> || || || || || || || || || ||
|-
| <b>[[GRIB]]<br>GRIB<br>GRIB</b> || Yes || Yes || || || || || || || Convert ||
|- valign="top" style="background: #cccccc;"
| <b>GRIB2<br>GRIB2<br>GRIB2</b> || || || || || || || || Yes, but limited<ref>
{{cite
| url = http://www.ecmwf.int/publications/manuals/grib_api/conversion.html
}}</ref> || ||
Line 204 ⟶ 260:
{| class="wikitable" style="text-align: center; width: 400px; height: 200px;"
|-
!
! NetCDF classic<br>classic<br>CF
! NetCDF enhanced<br>netCDF-4<br>CF
Line 238 ⟶ 294:
==Data type representations==
For any given data stream there may be ambiguities regarding the appropriate structural data type to be used. As a general rule, the best way to resolve this ambiguity is to choose the most highly ordered data type that could describe the data.<ref>{{cite
| author = U.S. Department of Commerce
| year = 2006
Line 247 ⟶ 303:
The table below lists some of the structural data types, and their respective recommended data formats. The data formats are defined in three lines: the data model, file format, and convention.
{| class="wikitable sortable"
|+ Structural data types and formats
|-
| Profiles || height- or depth-ordered sequence of records at a fixed (or approximately fixed) point in time and position in lat/long ||
|-
| Trajectories || time-ordered sequence of records along a path through space ||
|-
| Geospatial Framework Data || lines, polygonal regions, map annotations ||
|-
| Point Data || scattered points ||
|-
| Metadata || “data about data” – context information needed for the interpretation of data ||
|}
==Interoperability guidelines==
Data interoperability is critical for integrating different models, tools, and perspectives in order to collaborate effectively. Data must be taken from multiple sources in order to study the Earth sciences as a system rather than as individual components. In many cases the chosen data types are the natural consequence of the manner in which the data is collected. However, without some sort of strict standard or policy, the ability to utilize observations and model data diminishes. The next best alternative is to incorporate best practices or established conventions (such as the [[Climate and Forecast Metadata Conventions]] in climatology). For example, the Hierarchical Data Format (HDF) is the standard data format for all NASA Earth Observing System (EOS) data products<ref>{{cite
| title = Hierarchical Data Format - Earth Observing System (HDF-EOS)
| url = http://nsidc.org/data/hdfeos/
Line 281 ⟶ 344:
* [http://www.marinebiodiversity.ca/metadata/readme.htm Using Metadata Standards to Achieve Data Interoperability]
* [http://www.cise.ufl.edu/~rms/HDF-NetCDF%20Report.pdf HDF and NetCDF Introductory Document]
* [http://www.gdal.org Geospatial Data Abstraction Library]
* [http://hdfeos.org/software/tool.php Software conversion tools for HDF]
<!--
[[:Category:Data types]]
[[:Category:Computer file formats]]
[[:Category:Science software]]
-->