Hierarchical Data Format

{{Short description|Set of file formats}}
{{distinguish|text = [[Apache_Hadoop#HDFS|HDFS]], the file system used in Apache Hadoop}}
{{Infobox file format
| name = Hierarchical Data Format
| icon = HDF logo (2017).svg
| iconcaption =
| icon_size = 200px
| screenshot =
| caption =
| _noextcode = on
| extension = <code>.hdf</code>, <code>.h4</code>, <code>.hdf4</code>, <code>.he2</code>, <code>.h5</code>, <code>.hdf5</code>, <code>.he5</code>
| mime =
| type code =
| uniform type =
| magic = \211HDF\r\n\032\n
| owner = The HDF Group
| released =
| latest release version = {{LSR/wikidata}}
| latest release date =
| genre = [[Scientific data format]]
| container for =
| open = Yes
| contained by =
| extended from =
| extended to =
| standard =
| url = {{Official URL}}
}}
 
'''Hierarchical Data Format''' ('''HDF''') is a set of [[file format]]s ('''HDF4''', '''HDF5''') designed to store and organize large amounts of data. Originally developed at the U.S. [[National Center for Supercomputing Applications]], it is supported by The HDF Group, a non-profit corporation whose mission is to ensure continued development of HDF5 technologies and the continued accessibility of data stored in HDF.
 
In keeping with this goal, the HDF libraries and associated tools are available under a liberal, [[BSD licenses|BSD-like license]] for general use. HDF is supported by many commercial and non-commercial software platforms and programming languages. The freely available HDF distribution consists of the library, command-line utilities, test suite source, Java interface, and the Java-based HDF Viewer (HDFView).<ref>[http://www.hdfgroup.org/products/java/release/download.html Java-based HDF Viewer (HDFView)]</ref>
 
The current version, HDF5, differs significantly in design and [[API]] from the major legacy version HDF4.
 
==Early history==
 
The quest for a portable scientific data format, originally dubbed AEHOO (All Encompassing Hierarchical Object Oriented format), was begun in 1987 by the Graphics Foundations Task Force (GFTF) at the National Center for Supercomputing Applications (NCSA). NSF grants received in 1990 and 1992 were important to the project. Around this time [[NASA]] investigated 15 different file formats for use in the [[Earth Observing System]] (EOS) project. After a two-year review process, HDF was selected as the standard data and information system.<ref>{{cite web|url=http://www.hdfgroup.org/about/history.html|title=History of HDF Group|archive-url=https://web.archive.org/web/20160821013712/http://www.hdfgroup.org/about/history.html | access-date=15 July 2014|archive-date=21 August 2016 }}</ref>
 
==HDF4==
 
HDF4 is the older version of the format, although it is still actively supported by The HDF Group. It supports several different data models, including multidimensional arrays, [[Raster graphics|raster images]], and tables. Each defines a specific aggregate data type and provides an [[Application Programming Interface|API]] for reading, writing, and organizing the data and metadata. New data models can be added by the HDF developers or users.
 
HDF is self-describing, allowing an application to interpret the structure and contents of a file with no outside information. One HDF file can hold a mix of related objects which can be accessed as a group or as individual objects. Users can create their own grouping structures called "vgroups."<ref name="foldoc">{{foldoc|Hierarchical+Data+Format}}</ref>
The HDF4 format has many limitations.<ref>[http://www.hdfgroup.org/h5h4-diff.html How is HDF5 different from HDF4?] {{webarchive|url=https://web.archive.org/web/20090330052722/http://www.hdfgroup.org/h5h4-diff.html |date=2009-03-30 }}</ref><ref>{{Cite web |url=http://www.hdfgroup.org/HDF-FAQ.html#6b |title=Are there limitations to HDF4 files? |access-date=2009-03-29 |archive-url=https://web.archive.org/web/20160419122423/http://www.hdfgroup.org/HDF-FAQ.html#6b |archive-date=2016-04-19 |url-status=dead }}</ref> It lacks a clear object model, which makes continued support and improvement difficult. Supporting many different interface styles (images, tables, arrays) leads to a complex API. Support for metadata depends on which interface is in use; ''SD'' (Scientific Dataset) objects support arbitrary named attributes, while other types only support predefined metadata. Perhaps most importantly, the use of 32-bit signed integers for addressing limits HDF4 files to a maximum of 2 GB, which is unacceptable in many modern scientific applications.
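
The 2 GB ceiling follows from the arithmetic of signed 32-bit integers. The following is a minimal sketch in Python, not part of any HDF library; the function name <code>addressable_with_int32</code> and the example array shape are assumptions chosen purely for illustration, and the limit is treated here simply as a maximum data size.

```python
# Largest value a signed 32-bit integer can hold.
MAX_INT32 = 2**31 - 1  # 2,147,483,647, just under 2 GiB when counted in bytes


def addressable_with_int32(n_bytes):
    """Return True if n_bytes does not exceed the signed 32-bit maximum."""
    return n_bytes <= MAX_INT32


# Illustrative array (an assumption): 20,000 x 20,000 values of 8 bytes each.
array_bytes = 20_000 * 20_000 * 8

print(MAX_INT32)
print(array_bytes, addressable_with_int32(array_bytes))
```

A single array of this shape already needs 3.2 billion bytes, which is more than a signed 32-bit value can express.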
===Tools===
* [http://www.hdfgroup.org/hdf-java-html/hdfview/index.html HDFView], a browser and editor for HDF files
 
==HDF5==
The HDF5 format is designed to address some of the limitations of the HDF4 library, and to address current and anticipated requirements of modern systems and applications. In 2002 it won an [[R&D 100 Award]].<ref>[http://www.rdmag.com/Awards/RD-100-Awards/2002/09/Flexible-Data-Management/ R&D 100 Awards Archives] {{webarchive|url=https://web.archive.org/web/20110104062241/http://www.rdmag.com/Awards/RD-100-Awards/2002/09/Flexible-Data-Management/ |date=2011-01-04 }}</ref>
 
HDF5 simplifies the file structure to include only two major types of object:
[[Image:HDF-Structure-Example.gif|thumb|right|HDF Structure Example]]
*Datasets, which are typed multidimensional arrays
*Groups, which are container structures that can hold datasets and other groups
 
This results in a truly hierarchical, filesystem-like data format.{{clarify|date=November 2018}}{{citation needed|date=November 2018}} In fact, resources in an HDF5 file can be accessed using the [[POSIX]]-like syntax ''/path/to/resource''. Metadata is stored in the form of user-defined, named attributes attached to groups and datasets. More complex storage APIs representing images and tables can then be built up using datasets, groups and attributes.
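
The group, dataset and attribute model described above can be sketched with a toy in-memory structure. This is a hedged illustration in plain Python and not the HDF5 library or its API: the class names <code>Group</code> and <code>Dataset</code>, the methods <code>add_group</code> and <code>resolve</code>, and the sample path and values are all hypothetical.

```python
class Dataset:
    """Toy stand-in for a dataset: array-like data plus named attributes."""

    def __init__(self, data):
        self.data = data
        self.attrs = {}  # user-defined, named attributes


class Group:
    """Toy stand-in for a group: a container of datasets and other groups."""

    def __init__(self):
        self.children = {}
        self.attrs = {}

    def add_group(self, name):
        """Create a child group and return it (hypothetical helper)."""
        child = Group()
        self.children[name] = child
        return child

    def resolve(self, path):
        """Walk a POSIX-like path such as /path/to/resource from this group."""
        node = self
        for part in path.strip("/").split("/"):
            if part:  # skip empty components, so "/" resolves to this group
                node = node.children[part]
        return node


# Build a small hierarchy: /experiment/run1/temperature
root = Group()
run = root.add_group("experiment").add_group("run1")
run.children["temperature"] = Dataset([20.5, 21.0, 21.4])
run.children["temperature"].attrs["units"] = "degC"

node = root.resolve("/experiment/run1/temperature")
print(node.data, node.attrs["units"])
```

Higher-level structures such as images or tables would, in this picture, be conventions layered on top of the same three building blocks.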
 
In addition to these advances in the file format, HDF5 includes an improved type system, and dataspace objects which represent selections over dataset regions. The API is also object-oriented with respect to datasets, groups, attributes, types, dataspaces and property lists.
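
An HDF5 file can be recognized by the 8-byte signature given in the infobox as <code>\211HDF\r\n\032\n</code>. The sketch below, which uses only the Python standard library, looks for that signature at offset 0 and then at offsets 512, 1024, 2048 and so on, since a user block may precede it. The function name <code>looks_like_hdf5</code> and the helper <code>_write_temp</code> are hypothetical, and the check inspects the signature only; it does not validate the rest of the file.

```python
import os
import tempfile

# The signature from the infobox, with the octal escapes written in hexadecimal:
# \211 -> 0x89 and \032 -> 0x1a.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(path):
    """Return True if the signature is found at offset 0, 512, 1024, 2048, ..."""
    size = os.path.getsize(path)
    offset = 0
    with open(path, "rb") as f:
        while offset + len(HDF5_SIGNATURE) <= size:
            f.seek(offset)
            if f.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE:
                return True
            # Move to the next candidate offset: 512 first, then keep doubling.
            offset = 512 if offset == 0 else offset * 2
    return False


def _write_temp(data):
    """Hypothetical helper: write bytes to a temporary file and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path


plain = _write_temp(HDF5_SIGNATURE + b"\x00" * 8)
with_user_block = _write_temp(b"\x00" * 512 + HDF5_SIGNATURE)
other = _write_temp(b"not hdf")

print(looks_like_hdf5(plain), looks_like_hdf5(with_user_block), looks_like_hdf5(other))
```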
 
The latest version of [[NetCDF]], version 4, is based on HDF5.
 
Because it uses [[B-trees]] to index table objects, HDF5 works well for [[time series]] data such as stock price series, network monitoring data, and 3D meteorological data. The bulk of the data goes into straightforward arrays (the table objects) that can be accessed much more quickly than the rows of an [[SQL]] database, but B-tree access is available for non-array data. The HDF5 data storage mechanism can be simpler and faster than an SQL [[star schema]].
 
{{example needed|date=November 2018}}
 
=== Feedback ===
 
Criticism of HDF5 follows from its monolithic design and lengthy specification.
*HDF5 does not enforce the use of [[UTF-8]], so client applications may be expecting ASCII in most places.
*Dataset data cannot be freed in a file without generating a file copy using an external tool (h5repack).<ref>{{cite web|last1=Rossant|first1=Cyrille|title=Moving away from HDF5|url=http://cyrille.rossant.net/moving-away-hdf5/|website=cyrille.rossant.net|access-date=21 April 2016}}</ref>
 
==Officially supported APIs==
* [[C (programming language)|C]]
* [[C++]]
* [[Common Language Infrastructure|CLI]] - .NET
* [[Fortran]], [[Fortran 90]]
* [http://www.hdfgroup.org/HDF5/Tutor/h5lite.html HDF5 Lite] (H5LT) – a light-weight interface for C
* [[Java (programming language)|Java]]
* [http://www.hdfgroup.org/HDF5/Tutor/h5image.html HDF5 Image] (H5IM) – a C interface for images or rasters
* [[Perl]]
* [http://www.hdfgroup.org/HDF5/Tutor/h5table.html HDF5 Table] (H5TB) – a C interface for tables
* [[Matlab]]
* [http://www.hdfgroup.org/HDF5/Tutor/h5packet.html HDF5 Packet Table] (H5PT) – interfaces for C and [[C++]] to handle "packet" data, accessed at high speeds
* [[IDL_programming_language|IDL]]
* [http://www.hdfgroup.org/HDF5/Tutor/h5dimscale.html HDF5 Dimension Scale] (H5DS) – allows dimension scales to be added to HDF5; introduced in the HDF5 1.8 release
* [[PyTables]] – an interface for [[Python (programming language)|Python]]
 
 
==See also==
* [[Common Data Format]] (CDF)
* [[FITS]], a data format used in [[astronomy]]
* [[GRIB]] (GRIdded Binary), a data format used in [[meteorology]]
* [[HDF Explorer]]
* [[NetCDF]]; the netCDF Java library reads HDF5, HDF4, HDF-EOS and other formats using pure Java
* [[Q5cost]], a Fortran API for using HDF5 in [[quantum chemistry]]
* [[Protocol Buffers]], Google's data interchange format
 
== References ==
{{reflist|30em}}
 
==External links==
*{{Official website}}
* [http://www.hdfgroup.org/ HDF home page]
*[https://web.archive.org/web/20180806024407/https://support.hdfgroup.org/HDF5/whatishdf5.html What is HDF5?]
* [http://www.hdfgroup.org/products/hdf5/index.html HDF5 home page]
* [http://www.hdfeos.org/ HDF-EOS Tools and Information Center]
*[http://www.opennavsurf.org/ Open Navigation Surface]
* [http://www.xi-advies.nl/downloads/AnIntroductionToDistributedVisualization.pdf "An Introduction to Distributed Visualization"]; section 4.2 contains a comparison of CDF, HDF, and netCDF.
* [http://www2.fci.unibo.it/~amonari/talk_834.pdf A presentation on how to handle large datasets in Quantum Chemistry using hdf5]
{{FOLDOC}}
 
[[Category:C (programming language) libraries]]
[[Category:Computer file formats]]
[[Category:Earth sciences data formats]]
[[Category:Meteorological data and networks]]
 
[[de:Hierarchical Data Format]]