{{Short description|Set of file formats}}
{{distinguish|text = [[Apache_Hadoop#HDFS|HDFS]], the file system used in Apache Hadoop}}
{{Infobox file format
| name = Hierarchical Data Format
| icon = HDF logo (2017).svg
| iconcaption =
| icon_size = 200px
| screenshot =
| caption =
| _noextcode = on
| extension = <code>.hdf</code>, <code>.h4</code>, <code>.hdf4</code>, <code>.he2</code>, <code>.h5</code>, <code>.hdf5</code>, <code>.he5</code>
| mime =
| type code =
| uniform type =
| magic = \211HDF\r\n\032\n
| owner = The HDF Group
| released =
| latest release version = {{LSR/wikidata}}
| latest release date =
| genre = [[Scientific data format]]
| container for =
| open = Yes
| contained by =
| extended from =
| extended to =
| standard =
| url = {{Official URL}}
}}
 
'''Hierarchical Data Format''' ('''HDF''') is a set of [[file format]]s ('''HDF4''', '''HDF5''') designed to store and organize large amounts of data. Originally developed at the U.S. [[National Center for Supercomputing Applications]], it is supported by The HDF Group, a non-profit corporation whose mission is to ensure continued development of HDF5 technologies and the continued accessibility of data stored in HDF.
 
In keeping with this goal, the HDF libraries and associated tools are available under a liberal, [[BSD licenses|BSD-like license]] for general use. HDF is supported by many commercial and non-commercial software platforms and programming languages. The freely available HDF distribution consists of the library, command-line utilities, test suite source, Java interface, and the Java-based HDF Viewer (HDFView).<ref>[http://www.hdfgroup.org/products/java/release/download.html Java-based HDF Viewer (HDFView)]</ref>
 
The current version, HDF5, differs significantly in design and [[API]] from the major legacy version HDF4.
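The two formats are incompatible at the byte level and can be distinguished by their leading file signatures: HDF5 files begin with the eight bytes <code>\211HDF\r\n\032\n</code> (shown in the infobox), while HDF4 files begin with the four bytes <code>0x0E 0x03 0x13 0x01</code>. A minimal Python sketch (not part of either library; the function name is illustrative) that tells the two apart:

```python
# File signatures from the HDF4 and HDF5 format specifications.
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"   # \211HDF\r\n\032\n
HDF4_MAGIC = b"\x0e\x03\x13\x01"

def sniff_hdf(path):
    """Return 'HDF5', 'HDF4', or None based on the file's leading bytes."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(HDF5_MAGIC):
        return "HDF5"
    if head.startswith(HDF4_MAGIC):
        return "HDF4"
    return None
```

Note that this simplified check only inspects offset 0; the HDF5 specification also permits the superblock to start at later power-of-two offsets (512, 1024, ...) when a user block is present.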
 
==Early history==
 
The quest for a portable scientific data format, originally dubbed AEHOO (All Encompassing Hierarchical Object Oriented format), began in 1987 with the Graphics Foundations Task Force (GFTF) at the National Center for Supercomputing Applications (NCSA). NSF grants received in 1990 and 1992 were important to the project. Around this time [[NASA]] investigated 15 different file formats for use in the [[Earth Observing System]] (EOS) project. After a two-year review process, HDF was selected as the standard data and information system.<ref>{{cite web|url=http://www.hdfgroup.org/about/history.html|title=History of HDF Group|archive-url=https://web.archive.org/web/20160821013712/http://www.hdfgroup.org/about/history.html | access-date=15 July 2014|archive-date=21 August 2016 }}</ref>
 
==HDF4==
 
HDF4 is the older version of the format, although still actively supported by The HDF Group. It supports a proliferation of different data models, including multidimensional arrays, [[Raster graphics|raster images]], and tables. Each defines a specific aggregate data type and provides an [[Application Programming Interface|API]] for reading, writing, and organizing the data and metadata. New data models can be added by the HDF developers or users.
 
HDF is self-describing, allowing an application to interpret the structure and contents of a file with no outside information. One HDF file can hold a mix of related objects which can be accessed as a group or as individual objects. Users can create their own grouping structures called "vgroups."<ref name="foldoc">{{foldoc|Hierarchical+Data+Format}}</ref>
The HDF4 format has many limitations.<ref>[http://www.hdfgroup.org/h5h4-diff.html How is HDF5 different from HDF4?] {{webarchive|url=https://web.archive.org/web/20090330052722/http://www.hdfgroup.org/h5h4-diff.html |date=2009-03-30 }}</ref><ref>{{Cite web |url=http://www.hdfgroup.org/HDF-FAQ.html#6b |title=Are there limitations to HDF4 files? |access-date=2009-03-29 |archive-url=https://web.archive.org/web/20160419122423/http://www.hdfgroup.org/HDF-FAQ.html#6b |archive-date=2016-04-19 |url-status=dead }}</ref> It lacks a clear object model, which makes continued support and improvement difficult. Supporting many different interface styles (images, tables, arrays) leads to a complex API. Support for metadata depends on which interface is in use; ''SD'' (Scientific Dataset) objects support arbitrary named attributes, while other types only support predefined metadata. Perhaps most importantly, the use of 32-bit signed integers for addressing limits HDF4 files to a maximum of 2 GB, which is unacceptable in many modern scientific applications.
===Tools===
* [http://hdfgroup.org/hdf-java-html/hdfview/index.html HDFView] – a browser and editor for HDF files
 
==HDF5==
The HDF5 format is designed to address some of the limitations of the HDF4 library, and to address current and anticipated requirements of modern systems and applications. In 2002 it won an [[R&D 100 Award]].<ref>[http://www.rdmag.com/Awards/RD-100-Awards/2002/09/Flexible-Data-Management/ R&D 100 Awards Archives] {{webarchive|url=https://web.archive.org/web/20110104062241/http://www.rdmag.com/Awards/RD-100-Awards/2002/09/Flexible-Data-Management/ |date=2011-01-04 }}</ref>
 
HDF5 simplifies the file structure to include only two major types of object:
[[Image:HDF-Structure-Example.gif|thumb|right|HDF Structure Example]]
*Datasets, which are typed multidimensional arrays
*Groups, which are container structures that can hold datasets and other groups
 
This gives HDF5 a truly hierarchical, filesystem-like layout: groups play the role of directories and datasets the role of files, and groups may be nested to arbitrary depth. Resources in an HDF5 file can be accessed using the [[POSIX]]-like syntax ''/path/to/resource''. Metadata is stored in the form of user-defined, named attributes attached to groups and datasets. More complex storage APIs representing images and tables can then be built up using datasets, groups and attributes.
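The group/dataset/attribute model can be illustrated with the third-party [http://www.h5py.org h5py] bindings for [[Python (programming language)|Python]] (used here purely for illustration; HDF5's native API is C):

```python
import numpy as np
import h5py

# Write: one group, one typed dataset inside it, and one named attribute.
with h5py.File("example.h5", "w") as f:
    grp = f.create_group("measurements")
    dset = grp.create_dataset("temperature", data=np.arange(10, dtype="f8"))
    dset.attrs["units"] = "kelvin"

# Read: objects are addressed with POSIX-like paths.
with h5py.File("example.h5", "r") as f:
    dset = f["/measurements/temperature"]
    print(dset.attrs["units"])  # metadata attribute -> kelvin
    print(dset[:5])             # array slice -> [0. 1. 2. 3. 4.]
```

Because the file is self-describing, the reader needs no schema beyond the path names; the group hierarchy, dataset shapes, types, and attributes are all discoverable from the file itself.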
 
In addition to these advances in the file format, HDF5 includes an improved type system, and dataspace objects which represent selections over dataset regions. The API is also object-oriented with respect to datasets, groups, attributes, types, dataspaces and property lists.
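As a sketch of the type system and dataspace selections, again using the third-party h5py bindings: slicing a dataset is translated into an HDF5 hyperslab (dataspace) selection, so only the requested region of the file is read:

```python
import numpy as np
import h5py

# A dataset with an explicit type: a 4x6 array of 32-bit integers.
with h5py.File("grid.h5", "w") as f:
    f.create_dataset("grid", data=np.arange(24, dtype="i4").reshape(4, 6))

with h5py.File("grid.h5", "r") as f:
    # The slice becomes a hyperslab selection over the dataset's
    # dataspace; only rows 1-2, columns 2-3 are read from disk.
    block = f["grid"][1:3, 2:4]

print(block)  # -> [[ 8  9]
              #     [14 15]]
```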
The latest version of [[NetCDF]], version 4, is based on HDF5.
 
Because it uses [[B-trees]] to index table objects, HDF5 works well for [[time series]] data such as stock price series, network monitoring data, and 3D meteorological data. The bulk of the data goes into straightforward arrays (the table objects) that can be accessed much more quickly than the rows of an [[SQL]] database, but B-tree access is available for non-array data. The HDF5 data storage mechanism can be simpler and faster than an SQL [[star schema]].
 
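A common time-series pattern, sketched here with the third-party h5py bindings, is a chunked dataset with an unlimited maximum shape that is extended in place as new samples arrive (no schema migration or row-by-row insertion, as an SQL table would need):

```python
import numpy as np
import h5py

with h5py.File("ticks.h5", "w") as f:
    # maxshape=(None,) marks the dimension as unlimited; chunked
    # storage is required so the dataset can grow in place.
    d = f.create_dataset("price", shape=(0,), maxshape=(None,),
                         dtype="f8", chunks=True)
    # Append two batches of incoming samples.
    for batch in (np.array([101.0, 101.5]), np.array([102.25])):
        d.resize(d.shape[0] + batch.size, axis=0)
        d[-batch.size:] = batch
```

After the two appends the dataset holds <code>[101.0, 101.5, 102.25]</code> and remains an ordinary contiguous-looking array to readers.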
 
=== Criticism ===
 
Criticism of HDF5 follows from its monolithic design and lengthy specification.
*HDF5 does not enforce the use of [[UTF-8]], so client applications may be expecting ASCII in most places.
*Dataset data cannot be freed in a file without generating a file copy using an external tool (h5repack).<ref>{{cite web|last1=Rossant|first1=Cyrille|title=Moving away from HDF5|url=http://cyrille.rossant.net/moving-away-hdf5/|website=cyrille.rossant.net|access-date=21 April 2016}}</ref>
 
==Officially supported APIs==
* [[C (programming language)|C]]
* [[C++]]
* [[Common Language Infrastructure|CLI]] – .NET
* [[Fortran]], [[Fortran 90]]
* [[IDL (programming language)|IDL]]
* [[Java (programming language)|Java]]
* [[MATLAB]]
* [[Perl]]
* [http://www.pytables.org/ PyTables] – an interface for [[Python (programming language)|Python]]
 
===High-level APIs===
* [http://www.hdfgroup.org/HDF5/Tutor/h5lite.html HDF5 Lite] (H5LT) – a light-weight interface for C
* [http://www.hdfgroup.org/HDF5/Tutor/h5image.html HDF5 Image] (H5IM) – a C interface for images or rasters
* [http://www.hdfgroup.org/HDF5/Tutor/h5table.html HDF5 Table] (H5TB) – a C interface for tables
* [http://www.hdfgroup.org/HDF5/Tutor/h5packet.html HDF5 Packet Table] (H5PT) – interfaces for C and [[C++]] to handle "packet" data, accessed at high speed
* [http://www.hdfgroup.org/HDF5/Tutor/h5dimscale.html HDF5 Dimension Scale (H5DS)] – allows dimension scales to be added to HDF5
 
==See also==
* [[Common Data Format]] (CDF)
* [[FITS]], a data format used in [[astronomy]]
* [[GRIB]] (GRIdded Binary), a data format used in [[meteorology]]
* [[HDF Explorer]]
* [[NetCDF]] – the NetCDF Java library reads HDF5, HDF4, HDF-EOS and other formats using pure Java
* [[Protocol Buffers]] – Google's data interchange format
* [[Q5cost]], a Fortran API to use HDF5 in [[quantum chemistry]]
 
==References==
{{reflist|30em}}
 
==External links==
*{{Official website}}
*[https://web.archive.org/web/20180806024407/https://support.hdfgroup.org/HDF5/whatishdf5.html What is HDF5?]
*[http://hdfeos.org/ HDF-EOS Tools and Information Center]
*[http://www.opennavsurf.org/ Open Navigation Surface]
*[http://www.xi-advies.nl/downloads/AnIntroductionToDistributedVisualization.pdf "An Introduction to Distributed Visualization"]; section 4.2 contains a comparison of CDF, HDF, and netCDF.
*[http://www2.fci.unibo.it/~amonari/talk_834.pdf A presentation on how to handle large datasets in quantum chemistry using HDF5]
{{FOLDOC}}
 
[[Category:C (programming language) libraries]]
[[Category:Computer file formats]]
[[Category:Earth sciences data formats]]
[[Category:Meteorological data and networks]]
 
[[de:Hierarchical Data Format]]