When studying the Earth sciences through observation or analytical models, organizing and storing the large volume of information gathered is a persistent challenge. Organizations with specific technical goals, timeline constraints, and model constraints sometimes create their own file conventions, distribution techniques, and architectures. Although such new solutions can meet short-term goals, they often cause more complex long-term problems when standards are not adhered to[1]. In some cases, science data has been slow to migrate to a standards-based approach[2]. For these reasons, data interoperability is critical to collaboration and to building a continued quantitative understanding of the sciences[3].
Interoperability allows data users to view, process, and analyze observational data or science model output easily. Otherwise, users must reformat the data or create new software to read it. Considerable time can be spent converting between data formats, so it is important to understand the features and limitations of each.
Overview and Definition
A data model (e.g. NetCDF) describes structured data by providing an unambiguous and neutral view of how the data is organized[4]. A file format defines how data is encoded for storage using a defined structure, such as chunked, directory-based, or unstructured. The file format can usually be identified from the file name extension (e.g. .jpg, .bufr). Thus, the data model describes how the data is organized, and the file format describes how it is stored. Conventions, in turn, describe which data types, formats, and design principles apply to a given data model and/or format (e.g. the Climate and Forecast Metadata Conventions). Together, these three elements allow data to be described accurately.
Data models are built from components such as dimensions, variables, types, and attributes. Some models can also organize these components logically into groups. Used together, the components capture the meaning of data and the relations among data fields in an array-oriented dataset. In contrast to variables, which are intended for bulk data, attributes are intended for ancillary data, or information about the data[5]. Another difference is that variables may be multidimensional, whereas attributes are either scalars (single-valued) or vectors (a single, fixed dimension).
Interoperability requires understanding how each model represents these components at its core, so that the relationships between models can be identified. In some cases, models are compatible simply because they share a similar representation.
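The components described above can be sketched as plain data structures. This is a minimal illustration only: the class names `Variable` and `Group`, the `shape` helper, and the sample values are hypothetical and do not correspond to the API of any particular library.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union

# Attributes hold ancillary data: a scalar or a one-dimensional vector.
AttributeValue = Union[int, float, str, List[int], List[float]]


@dataclass
class Variable:
    """Bulk data laid out over zero or more named dimensions."""
    name: str
    dimensions: Tuple[str, ...]  # dimension names, in storage order
    dtype: str                   # e.g. "float32"
    attributes: Dict[str, AttributeValue] = field(default_factory=dict)


@dataclass
class Group:
    """A container for dimensions, variables, attributes, and nested groups.

    A model without groups can be represented by a single root group.
    """
    name: str
    dimensions: Dict[str, int] = field(default_factory=dict)  # name -> length
    variables: Dict[str, Variable] = field(default_factory=dict)
    attributes: Dict[str, AttributeValue] = field(default_factory=dict)
    groups: Dict[str, "Group"] = field(default_factory=dict)

    def shape(self, variable_name: str) -> Tuple[int, ...]:
        """Resolve a variable's shape from this group's dimension lengths."""
        var = self.variables[variable_name]
        return tuple(self.dimensions[d] for d in var.dimensions)


# Hypothetical sample dataset: one multidimensional variable with
# a scalar attribute ("units") and a vector attribute ("valid_range").
root = Group(name="/", dimensions={"time": 4, "lat": 3, "lon": 2})
root.attributes["title"] = "example dataset"
root.variables["temperature"] = Variable(
    name="temperature",
    dimensions=("time", "lat", "lon"),
    dtype="float32",
    attributes={"units": "K", "valid_range": [150.0, 350.0]},
)

print(root.shape("temperature"))
```

The variable carries the multidimensional shape, while its attributes stay scalar or one-dimensional, mirroring the distinction drawn in the text.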
Data Model Representations
NetCDF is a simple format that works best with gridded or time series data. The NetCDF classic model (used by the NetCDF classic, NetCDF 64-bit offset, and NetCDF-4 classic formats) represents:
- dimensions, variables, and attributes
- variables also have attributes
- may contain common grids
The NetCDF enhanced model represents:
- classic model representations
- named, hierarchical groups
- user-defined data types
HDF offers a variety of data structures with an API to read and write the data. HDF is well suited to storing complex data together with its metadata, and supports compression for larger files. HDF-EOS implements additional data structures designed to facilitate access to Earth science data, such as geolocation information stored with the data. The HDF4 model represents:
- choice of 8 objects
The HDF5 model represents:
- choice of dataset or group object
- attributes or datasets to describe metadata
The HDF-EOS2 model:
- includes HDF4 representations
- supports data structures grid, point, and swath
The HDF-EOS5 model:
- includes HDF5 representations
- supports data structures grid, point, and swath
TIFF (Tagged Image File Format) is a raster file format for handling images and data within a single file. GeoTIFF is a specialized version of the TIFF format that includes geographic information within the tags of the format. The GeoTIFF model:
- uses GeoKeys, stored in TIFF tags, to describe georeferencing information
The GRIB and GRIB2 models:
- binary file format for meteorological data
- header descriptions for data packing, definition, and data representation type
The BUFR model:
- binary file format for meteorological data
- sections include: indicator, identification, optional, description, data, and end
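The outer framing of a BUFR message (the indicator and end sections) can be checked with a few lines of code. The sketch below assumes the layout used by BUFR editions 2 and later, in which the indicator section holds the characters `BUFR`, a 3-byte big-endian total message length, and a 1-byte edition number, and the end section holds the characters `7777`. The function name `read_bufr_framing` and the sample bytes are hypothetical; the sample contains only the framing and is not a decodable BUFR message.

```python
from typing import NamedTuple


class BufrFraming(NamedTuple):
    total_length: int  # length of the whole message in bytes
    edition: int       # BUFR edition number


def read_bufr_framing(message: bytes) -> BufrFraming:
    """Validate the indicator and end sections of a single BUFR message."""
    if len(message) < 12:
        raise ValueError("too short to hold indicator and end sections")
    if message[:4] != b"BUFR":
        raise ValueError("indicator section does not start with 'BUFR'")

    total_length = int.from_bytes(message[4:7], "big")
    edition = message[7]

    if total_length < 12 or total_length > len(message):
        raise ValueError("declared length is inconsistent with the data")
    if message[total_length - 4:total_length] != b"7777":
        raise ValueError("end section '7777' not found at declared length")
    return BufrFraming(total_length, edition)


# Synthetic framing only: indicator section followed directly by end section.
sample = b"BUFR" + (12).to_bytes(3, "big") + bytes([4]) + b"7777"
print(read_bufr_framing(sample))
```

The identification, optional, description, and data sections sit between these two markers and require the BUFR tables to decode, which is outside the scope of this sketch.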
File Formats
Data models must be stored or encoded in a specific file format. Each format constrains which data types, attributes, dimensions, and variables can be used. A brief overview of each format's capabilities is shown below.
Format | Dimensions | Variables | Attributes | Data Types | Notes |
---|---|---|---|---|---|
NetCDF classic | single | Unicode | derived | 6 primitive | |
NetCDF 64-bit offset | single | Unicode | derived | 6 primitive | larger datasets |
NetCDF-4 | multiple, and unlimited | Unicode, UTF-8 | group level, user defined | 12 primitive including strings, and user defined | per-variable compression, parallel I/O |
NetCDF-4 classic | single | Unicode, UTF-8 | derived | 6 primitive | per-variable compression, parallel I/O |
HDF4 | single | Unicode | predefined | 8 native | |
HDF4 SD | single, unlimited | Unicode, array | predefined, user defined | 8 native | |
HDF5 | multidimensional | Unicode, array | predefined, user defined | 12 native | attributes or datasets for metadata, per-variable compression, parallel I/O |
HDF-EOS2 | multidimensional | Unicode, array | predefined, user defined | 8 native | per-variable compression, parallel I/O |
HDF-EOS5 | multidimensional | Unicode, array | predefined, user defined | 12 native | per-variable compression, parallel I/O |
GeoTIFF | single | Unicode | tag/geoKeys | user defined | Coordinates in raster space, device space, and model space |
GRIB | single | shortname, name, units | MARS, angles, gridType, packingType | coded, computed | standard, non-standard encoding |
GRIB2 | single | shortname, name, units | MARS, angles, gridType, packingType | coded, computed | compression |
BUFR | multidimensional | integer, real, character | centre, categories, version, date, time | observed, other, compressed, non-compressed | |
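Several of the formats in the table can also be told apart by their leading bytes rather than by file name extension. The sketch below is a simplified illustration: the function name `sniff_format` is hypothetical, only the first bytes at offset 0 are inspected, and real files may carry the signature elsewhere (for example an HDF5 file with a user block, or GRIB and BUFR messages preceded by transmission headers). NetCDF-4 files are stored as HDF5 and HDF-EOS files are stored as HDF4 or HDF5, so they cannot be distinguished from their container format by signature alone.

```python
from typing import Optional

# (signature bytes at offset 0, description)
SIGNATURES = [
    (b"\x89HDF\r\n\x1a\n", "HDF5 (also NetCDF-4 and HDF-EOS5)"),
    (b"CDF\x01", "NetCDF classic"),
    (b"CDF\x02", "NetCDF 64-bit offset"),
    (b"\x0e\x03\x13\x01", "HDF4 (also HDF-EOS2)"),
    (b"II*\x00", "TIFF, little-endian (possibly GeoTIFF)"),
    (b"MM\x00*", "TIFF, big-endian (possibly GeoTIFF)"),
    (b"GRIB", "GRIB or GRIB2"),
    (b"BUFR", "BUFR"),
]


def sniff_format(header: bytes) -> Optional[str]:
    """Return a description of the format whose signature starts `header`."""
    for signature, description in SIGNATURES:
        if header.startswith(signature):
            return description
    return None


print(sniff_format(b"CDF\x02" + b"\x00" * 4))
print(sniff_format(b"\x89HDF\r\n\x1a\n" + b"\x00" * 8))
print(sniff_format(b"plain text"))
```

In practice the first few bytes would be read from the file in binary mode and passed to the function; a `None` result means only that no known signature was found at offset 0.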
Conventions
Conventions provide a definitive description of what the data values found in each variable represent. For example, a convention may include descriptions of spatial and temporal properties, grid cell bounds, or averaging methods. This enables users of files from different sources to decide which variables are comparable. A convention should support various data types and formats.
Conventions are designed around guiding principles, which may include metadata requirements, interpretation of the data, ease of use, descriptions, and naming.
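A convention's metadata requirements can be expressed as simple checks on attributes. The sketch below is loosely modelled on the Climate and Forecast Metadata Conventions, which use a global `Conventions` attribute and per-variable attributes such as `units`, `standard_name`, and `long_name`. The rules here are deliberately simplified assumptions (for instance, they demand `units` on every variable, which the real conventions do not), and the function name `check_metadata` and the sample dataset are hypothetical.

```python
from typing import Dict, List


def check_metadata(
    global_attrs: Dict[str, str],
    variables: Dict[str, Dict[str, str]],
) -> List[str]:
    """Return a list of human-readable problems; empty if none are found."""
    problems = []
    if not global_attrs.get("Conventions", "").startswith("CF-"):
        problems.append("global attribute 'Conventions' does not name a CF version")
    for name, attrs in sorted(variables.items()):
        if "units" not in attrs:
            problems.append(f"{name}: missing 'units'")
        if "standard_name" not in attrs and "long_name" not in attrs:
            problems.append(f"{name}: missing 'standard_name' or 'long_name'")
    return problems


# Hypothetical dataset: attribute dictionaries keyed by variable name.
global_attrs = {"Conventions": "CF-1.8"}
variables = {
    "tas": {"units": "K", "standard_name": "air_temperature"},
    "mask": {"long_name": "land mask"},
}

for problem in check_metadata(global_attrs, variables):
    print(problem)
```

Checks of this kind let a user decide whether variables from different sources are described well enough to be compared.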
Conversion Tables
Given the many ways of representing data, it is useful to know quickly whether data can be accessed, modified, or converted to a different format. The tables below answer a subset of those questions. To avoid ambiguity, each header cell states the data model, the file format (or file extension), and the convention, in that order, with versions where appropriate.
For reading data, this conversion table shows the formats that data can be translated to. Columns give the destination and rows the source.
Source / Destination | NetCDFclassic classic CF | NetCDFenhanced netCDF-4 CF | HDF4 SD HDF4 | HDF5 HDF5 HDF5 | HDFEOS2 HDF4 HDF4 | HDFEOS5 HDF5 HDF5 | GeoTIFF GeoTIFF TIFF | GRIB GRIB GRIB | GRIB2 GRIB2 GRIB2 | BUFR BUFR BUFR |
---|---|---|---|---|---|---|---|---|---|---|
NetCDFclassic classic CF | Access, modify, and convert | | | | | | | | | |
NetCDFenhanced netCDF-4 CF | No | | | | | | | | | |
HDF4 SD HDF4 | No | Convert | | | | | | | | |
HDF5 HDF5 HDF5 | No | Yes, but limited[6] | Yes, but limited[7] | | | | | | | |
HDFEOS2 HDF4 HDF4 | Convert[8] | | | | | | | | | |
HDFEOS5 HDF5 HDF5 | Convert | | | | | | | | | |
GeoTIFF GeoTIFF TIFF | | | | | | | | | | |
GRIB GRIB GRIB | Convert | | | | | | | | | |
GRIB2 GRIB2 GRIB2 | Yes, but limited[9] | | | | | | | | | |
BUFR BUFR BUFR | | | | | | | | | | |
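The filled-in cells of the read table above can be held in a small lookup keyed by (source, destination). This is a sketch under stated assumptions: the cell positions are taken exactly as they appear in the table, the keys use only the data model name from each label, blank cells are treated as unknown rather than as "No", and the function name `read_support` is hypothetical.

```python
from typing import Dict, Tuple

# (source data model, destination data model) -> entry from the read table.
# Blank cells are deliberately left out.
READ_SUPPORT: Dict[Tuple[str, str], str] = {
    ("NetCDFclassic", "NetCDFclassic"): "Access, modify, and convert",
    ("NetCDFenhanced", "NetCDFclassic"): "No",
    ("HDF4", "NetCDFclassic"): "No",
    ("HDF4", "NetCDFenhanced"): "Convert",
    ("HDF5", "NetCDFclassic"): "No",
    ("HDF5", "NetCDFenhanced"): "Yes, but limited",
    ("HDF5", "HDF4"): "Yes, but limited",
    ("HDFEOS2", "NetCDFclassic"): "Convert",
    ("HDFEOS5", "NetCDFclassic"): "Convert",
    ("GRIB", "NetCDFclassic"): "Convert",
    ("GRIB2", "NetCDFclassic"): "Yes, but limited",
}


def read_support(source: str, destination: str) -> str:
    """Look up a table entry; pairs with no entry are reported as unknown."""
    return READ_SUPPORT.get((source, destination), "unknown")


print(read_support("HDF4", "NetCDFenhanced"))
print(read_support("BUFR", "GeoTIFF"))
```

Keeping blank cells distinct from "No" avoids claiming that a conversion is impossible when the table simply has no information about it.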
For writing data, this conversion table shows the formats that data can be translated from. Columns give the destination and rows the source.
Source / Destination | NetCDFclassic classic CF | NetCDFenhanced netCDF-4 CF | HDF4 SD HDF4 | HDF5 HDF5 HDF5 | HDFEOS2 HDF4 HDF4 | HDFEOS5 HDF5 HDF5 | GeoTIFF GeoTIFF TIFF | GRIB GRIB GRIB | GRIB2 GRIB2 GRIB2 | BUFR BUFR BUFR |
---|---|---|---|---|---|---|---|---|---|---|
NetCDFclassic classic CF | | | | | | | | | | |
NetCDFenhanced netCDF-4 CF | | | | | | | | | | |
HDF4 SD HDF4 | | | | | | | | | | |
HDF5 HDF5 HDF5 | | | | | | | | | | |
HDFEOS2 HDF4 HDF4 | | | | | | | | | | |
HDFEOS5 HDF5 HDF5 | | | | | | | | | | |
GeoTIFF GeoTIFF TIFF | | | | | | | | | | |
GRIB GRIB GRIB | | | | | | | | | | |
GRIB2 GRIB2 GRIB2 | | | | | | | | | | |
BUFR BUFR BUFR | | | | | | | | | | |
Data type representations
For any given data stream there may be ambiguities regarding the appropriate structural data type to be used. As a general rule, the best way to resolve this ambiguity is to choose the most highly ordered data type that could describe the data.[10]
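The rule above amounts to walking a ranked list of structural types and taking the first one that fits. In the sketch below, the ranking grid, then swath, then point (most to least ordered) is an assumption made for illustration, borrowing the structure names used by HDF-EOS; the source does not state an ordering. The function name `most_ordered_type` is hypothetical, and deciding whether a type can describe a given data stream is left to the caller.

```python
from typing import Dict, Optional

# Assumed ranking, most highly ordered first (illustrative, not from the source).
STRUCTURAL_TYPES = ("grid", "swath", "point")


def most_ordered_type(can_describe: Dict[str, bool]) -> Optional[str]:
    """Return the most highly ordered type flagged as able to describe the data."""
    for structural_type in STRUCTURAL_TYPES:
        if can_describe.get(structural_type, False):
            return structural_type
    return None


# Example: data that fits a swath or a set of points, but not a regular grid.
print(most_ordered_type({"grid": False, "swath": True, "point": True}))
```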
The table below lists some of the structural data types, and their respective recommended data formats. The data formats are defined in three lines: the data model, file format, and convention.
Interoperability guidelines
Data interoperability is critical for integrating different models, tools, and perspectives so that collaboration is effective. Studying the Earth sciences as a system, rather than as individual components, requires data from multiple sources. In many cases the chosen data types are a natural consequence of the manner in which the data is collected. However, without a strict standard or policy, the ability to use observations and model data diminishes. The next best alternative is to adopt best practices or established conventions (in climatology, for example, the Climate and Forecast Metadata Conventions). Similarly, the Hierarchical Data Format (HDF) is the standard data format for all NASA Earth Observing System (EOS) data products[11].
The following list is not exhaustive, but gives best practices that improve interoperability.
- The use of simpler data models.
- The use of an established coordinate system or convention.
References
External links
- Enabling Data Interoperability Through Metadata
- Practical Data Interoperability for Earth Scientists
- Using Metadata Standards to Achieve Data Interoperability