Array database management systems (DBMSs) provide [[Database management system|database]] services specifically for [[array data structure|array]]s (also called [[Raster graphics|raster data]]), that is, homogeneous collections of data items (often called [[pixel]]s, [[voxel]]s, etc.) sitting on a regular grid of one, two, or more dimensions. Arrays are often used to represent sensor, simulation, image, or statistics data. Such arrays tend to be [[Big data|Big Data]], with single objects frequently ranging into terabyte and soon petabyte sizes; for example, today’s earth and space observation archives typically grow by terabytes a day. Array databases aim to offer flexible, scalable storage and retrieval for this category of information.
[[File:Euclidean neighborhood in n-D arrays.png|thumb|150px|alt=Euclidean neighborhood of elements in arrays|Euclidean neighborhood of elements in arrays]]
== Overview ==
Just as standard [[Database Management System|database systems]] do for sets, Array DBMSs offer scalable, flexible storage and flexible retrieval and manipulation of arrays of (conceptually) unlimited size. Because arrays never appear standalone in practice, the array model is normally embedded into some overall data model, such as the relational model. Some systems implement arrays as an analogy to tables, while others introduce arrays as an additional attribute type.
Management of arrays requires novel techniques, particularly due to the fact that traditional database tuples and objects tend to fit well into a single database page{{snd}}
Array DBMSs offer [[Data Manipulation Language|query languages]] giving [[Declarative programming|declarative]] access to such arrays, allowing users to create, manipulate, search, and delete them. As with [[SQL]], for example, expressions of arbitrary complexity can be built on top of a set of core array operations. Due to the extensions made in the data and query model, Array DBMSs are sometimes subsumed under the [[NoSQL]] category, in the sense of "not only SQL". Query [[Query optimization|optimization]] and [[Parallel computing|parallelization]] are important for achieving [[scalability]]; many array operators lend themselves well to parallel evaluation, with each tile processed on a separate node or core.
Important application domains of Array DBMSs include Earth, Space, Life, and Social sciences, as well as the related commercial applications (such as [[Oil exploration|hydrocarbon exploration]] in industry and [[OLAP]] in business). The variety can be seen, for example, in geo data, which include 1-D environmental sensor time series, 2-D satellite images, 3-D x/y/t image time series and x/y/z geophysics data, as well as 4-D x/y/z/t climate and ocean data.
The [[Relational database|relational data model]], which prevails today, does not directly support the array paradigm to the same extent as sets and tuples. [[International Organization for Standards|ISO]] [[SQL]] lists an array-valued attribute type, but this is only one-dimensional, with almost no operational support, and not usable for the [[#Application_Domains|application domains]] of Array DBMSs. Another option is to resort to [[Binary large object|BLOB]]s ("binary large objects"), which are the equivalent of files: byte strings of (conceptually) unlimited length, but again without any query language functionality such as multi-dimensional subsetting.
The first significant work going beyond BLOBs was PICDMS.<ref>Chock, M., Cardenas, A., Klinger, A.: Database structure and manipulation capabilities of a picture database management system (PICDMS). IEEE ToPAMI, 6(4):
A first declarative query language suitable for multiple dimensions and with an algebra-based semantics was published by [[Peter Baumann (computer scientist)|Baumann]], together with a scalable architecture.<ref>Baumann, P.: [http://www.informatik.uni-trier.de/~ley/db/journals/vldb/vldb3.html#Baumann94 On the Management of Multidimensional Discrete Data]. VLDB Journal 4(3)1994, Special Issue on Spatial Database Systems, pp.
In terms of Array DBMS implementations, the [[rasdaman]] system has the longest implementation track record of n-D arrays with full query support. [[Oracle Spatial|Oracle GeoRaster]] offers chunked storage of 2-D raster maps, albeit without SQL integration. [[TerraLib]] is an open-source GIS software that extends object-relational DBMS technology to handle spatio-temporal data types; while its main focus is on vector data, there is also some support for rasters. Starting with version 2.0, [[Postgis|PostGIS]] embeds support for 2-D rasters; a special function offers declarative raster query functionality. [[SciQL]] is an array query language being added to the [[MonetDB]] DBMS. [[Michael Stonebraker#SciDB|SciDB]] is a more recent initiative to establish array database support. As in SciQL, arrays are treated as an equivalent of tables, rather than as a new attribute type as in rasdaman and PostGIS.
For the special case of [[
Generally, Array DBMSs are an emerging technology. While operationally deployed systems exist, such as [[Oracle Spatial|Oracle GeoRaster]], [[Postgis|PostGIS 2.0]] and [[rasdaman]], there are still many open research questions, including query language design and formalization, query optimization, parallelization and distributed processing, and scalability issues in general. In addition, scientific communities still appear reluctant to take up array database technology and tend to favor specialized, proprietary technology.
== Concepts ==
When adding arrays to databases, all facets of database design need to be reconsidered
=== Conceptual modeling ===
Formally, an array ''A'' is given by a (total or partial) function ''A'': ''X'' → ''V'' where ''X'', the ''domain'', is a ''d''-dimensional integer interval for some ''d''>0 and ''V'', called ''range'', is some (non-empty) value set; in set notation, this can be rewritten as { (''p'',''v'') | ''p'' in ''X'', ''v'' in ''V'' }. Each (''p'',''v'') in ''A'' denotes an array element or ''cell'', and following common notation we write ''A''[''p''] = ''v''. Examples for ''X'' include {0..767} × {0..1023} (for [[Xga#Extended Graphics Array|XGA]] sized images), examples for ''V'' include {0..255} for 8-bit greyscale images and {0..255} × {0..255} × {0..255} for standard [[RGB]] imagery.
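The definition above can be illustrated with a minimal Python sketch that models an array literally as a function from a ''d''-dimensional integer interval to a value set. The helper name <code>make_domain</code>, the 3 × 4 extent, and the cell contents are assumptions chosen for illustration only; real Array DBMSs do not store arrays as dictionaries.

```python
from itertools import product

def make_domain(*bounds):
    """Return all points of a d-dimensional integer interval.

    Each bound is an inclusive (lo, hi) pair, one per dimension.
    (Hypothetical helper, not part of any Array DBMS API.)
    """
    return list(product(*(range(lo, hi + 1) for lo, hi in bounds)))

# Illustrative 2-D domain X = {0..2} x {0..3}; value set V = {0..255}.
X = make_domain((0, 2), (0, 3))

# A total function X -> V, stored as a mapping from point p to value v.
A = {p: (10 * p[0] + p[1]) % 256 for p in X}

print(len(X))      # number of cells in the array
print(A[(2, 3)])   # the cell value A[p] at p = (2, 3)
```

A partial function would simply omit some points of ''X'' from the mapping.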
Following established database practice, an array query language should be [[Declarative programming|declarative]] and safe in evaluation.
As iteration over an array is at the heart of array processing, declarativeness very much centers on this aspect. The requirement, then, is that conceptually all cells should be inspected simultaneously
=== Array querying ===
The [[rasdaman]] algebra and query language can serve as an example of array query operators; they establish an expression language over a minimal set of array primitives. We begin with the generic core operators and then present common special cases and shorthands.
The '''marray''' operator creates an array over some given domain extent and initializes its cells:
</source>
where ''index-range-specification'' defines the result domain and binds an iteration variable to it, without specifying iteration sequence. The ''cell-value-expression'' is evaluated at each location of the domain.
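The semantics just described can be emulated in a few lines of Python. This is a rough sketch under stated assumptions, not rasdaman's implementation or syntax: the function <code>marray</code> below is a hypothetical stand-in that takes inclusive per-dimension bounds and a cell-value function, and the source array contents are made up.

```python
import math
from itertools import product

def marray(domain, cell_value):
    """Emulate an marray-style constructor (hypothetical helper).

    domain: list of inclusive (lo, hi) bounds, one per dimension.
    cell_value: function evaluated once at each point of the domain;
    callers must not rely on any particular iteration order.
    """
    points = product(*(range(lo, hi + 1) for lo, hi in domain))
    return {p: cell_value(p) for p in points}

# Made-up source array A over [0:99, 0:99] with strictly positive values.
A = marray([(0, 99), (0, 99)], lambda p: p[0] * 100 + p[1] + 1)

# A new array over a smaller domain that copies the original values.
cutout = marray([(10, 40), (20, 50)], lambda p: A[p])

# Same domain, but the cell expression manipulates each value.
logged = marray([(10, 40), (20, 50)], lambda p: math.log(A[p]))

print(len(cutout), cutout[(10, 20)])
```

Because the cell expression is a pure function of the iteration variable, the cells could be computed in any order or in parallel, which is the point of leaving the iteration sequence unspecified.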
'''Example:''' “A cutout of array A given by the corner points (10,20) and (40,50).”
The above examples have simply copied the original values; instead, these values may be manipulated.
'''Example:''' “Array A, with a log() applied to each cell value.”
<source lang="sql">
</source>
As with ''marray'' before, the ''index-range-specification'' specifies the domain to be iterated over and binds an iteration variable to it
'''Example:''' "The sum over all values in A."
The next example demonstrates the combination of the ''marray'' and ''condense'' operators by deriving a histogram.
'''Example:''' "A histogram over 8-bit greyscale image A."
<source lang="sql">
The induced comparison, ''A=bucket'', establishes a Boolean array of the same extent as ''A''. The aggregation operator counts the occurrences of ''true'' for each value of ''bucket'', which subsequently is put into the proper array cell of the 1-D histogram array.
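The histogram construction described above can be emulated in Python as follows. This is only a sketch: <code>marray</code> and <code>condense</code> are hypothetical helper functions that mimic the operators' semantics, not rasdaman syntax, and the 4 × 4 test image is invented.

```python
from itertools import product
from operator import add

def points(domain):
    """Enumerate all points of a domain given as inclusive (lo, hi) bounds."""
    return product(*(range(lo, hi + 1) for lo, hi in domain))

def marray(domain, cell_value):
    """Build an array (point -> value) by evaluating cell_value at each point."""
    return {p: cell_value(p) for p in points(domain)}

def condense(op, domain, cell_value, initial):
    """Aggregate cell_value over the domain with a binary operation.

    op is assumed commutative and associative, so the order in which
    cells are visited does not affect the result.
    """
    result = initial
    for p in points(domain):
        result = op(result, cell_value(p))
    return result

# Invented 4 x 4 greyscale image with values in {0..255}.
A_DOMAIN = [(0, 3), (0, 3)]
A = marray(A_DOMAIN, lambda p: (p[0] * p[1]) % 256)

# One histogram cell per bucket: count the cells where A equals the bucket.
histogram = marray(
    [(0, 255)],
    lambda bucket: condense(
        add, A_DOMAIN, lambda p: 1 if A[p] == bucket[0] else 0, 0
    ),
)

print(histogram[(0,)], histogram[(6,)])
```

The inner comparison plays the role of the induced Boolean array, and the sum of ones corresponds to counting the occurrences of ''true''.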
Such languages allow formulating statistical and imaging operations which can be expressed analytically without using loops. It has been proven<ref>Machlin, R.: Index-Based Multidimensional Array Queries: Safety and Equivalence. Proc. ACM PODS'07, Beijing, China, June 2007, pp.
=== Array storage ===
Array storage has to accommodate arrays of different dimensions and typically large sizes. A core task is to maintain spatial proximity on disk so as to reduce the number of disk accesses during subsetting. Note that an emulation of multi-dimensional arrays as nested lists (or 1-D arrays) will not per se accomplish this and, therefore, in general will not lead to scalable architectures.
Commonly arrays are partitioned into sub-arrays which form the unit of access. Regular partitioning where all partitions have the same size (except possibly for boundaries) is referred to as ''chunking''.<ref>Sarawagi, S., Stonebraker, M.: Efficient Organization of Large Multidimensional Arrays. Proc. ICDE'94, Houston, USA, 1994, pp. 328-336</ref> A generalization which removes the restriction to equally sized partitions by supporting any kind of partitioning is ''tiling''.<ref>Furtado, P., Baumann, P.: [http://www.informatik.uni-trier.de/~ley/db/conf/icde/icde99.html#FurtadoB99 Storage of Multidimensional Arrays based on Arbitrary Tiling]. Proc. ICDE'99, March 23–26, 1999, Sydney, Australia, pp.
Compression of tiles can sometimes substantially reduce the amount of storage needed. Compression is also useful for the transmission of results, since for the large amounts of data under consideration network bandwidth often constitutes a limiting factor.
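As a rough illustration of these storage ideas, the Python sketch below computes which regular chunks a subsetting request touches and compresses one chunk with the standard <code>zlib</code> module. The 100 × 100 chunk shape, the helper name <code>chunks_for_subset</code>, and the synthetic cell data are assumptions made for the example; actual systems use their own partitioning schemes and codecs.

```python
import zlib
from itertools import product

def chunks_for_subset(lo, hi, chunk_shape):
    """Return the indices of all regular chunks touched by the box [lo, hi].

    lo, hi: inclusive corner points of the subset, one coordinate per dimension.
    chunk_shape: edge length of a chunk along each dimension.
    Only these chunks need to be read from disk to answer the subset request.
    """
    ranges = [range(l // c, h // c + 1) for l, h, c in zip(lo, hi, chunk_shape)]
    return list(product(*ranges))

# A subset spanning x in [10, 140] and y in [20, 50] on 100 x 100 chunks.
touched = chunks_for_subset((10, 20), (140, 50), (100, 100))
print(touched)

# One chunk of 8-bit cells with smooth, synthetic content, stored row by row.
tile = bytes((i // 10) % 256 for i in range(100 * 100))
packed = zlib.compress(tile)

# Lossless round trip; the compressed chunk is smaller than the raw one.
restored = zlib.decompress(packed)
print(len(tile), len(packed))
```

How much a chunk shrinks depends heavily on its content; smooth or sparse data compresses far better than noisy data.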
== Standardization ==
Many communities have established data exchange formats, such as [[Hierarchical Data Format|HDF]], [[Netcdf|NetCDF]], and [[Tagged Image File Format|TIFF]]. A de facto standard in the Earth Science communities is [[Opendap|OPeNDAP]], a data transport architecture and protocol. While this is not a database specification, it offers important components that characterize a database system, such as a conceptual model and client/server implementations.
A declarative geo raster query language, [[Web Coverage Processing Service]] (WCPS), has been standardized by the [[Open Geospatial Consortium]] (OGC).
In June 2014, ISO/IEC JTC1 SC32 WG3, which maintains the SQL database standard, decided to add multi-dimensional array support to SQL as a new column type,<ref>Chirgwin, R.: [https://www.theregister.co.uk/2014/06/26/sql_to_worlddog_we_doing_big_data_too/ SQL fights back against NoSQL's big data cred with SQL/MDA spec], The Register, 26 Jun 2014</ref> based on the initial array support available since the [[SQL:2003|2003 version of SQL]]. The new standard will be named ''ISO 9075 SQL Part 15: MDA (Multi-Dimensional Arrays)''.
== List of
*[[GeoRaster|Oracle GeoRaster]]
*[[MonetDB#SciQL|MonetDB/SciQL]]