Array DBMS: Difference between revisions

Content deleted Content added
Adding short description: "System that provides database services specifically for arrays" (Shortdesc helper)
m add link
 
(9 intermediate revisions by 7 users not shown)
Line 1:
{{short description|System that provides database services specifically for arrays}}
 
An '''Arrayarray database management systemssystem''' (or '''array DBMSsDBMS''') provideprovides [[Database management system|database]] services specifically for [[array data structure|array]]s (also called [[Raster graphics|raster data]]), that is: homogeneous collections of data items (often called [[pixel]]s, [[voxel]]s, etc.), sitting on a regular grid of one, two, or more dimensions. Often arrays are used to represent sensor, simulation, image, or statistics data. Such arrays tend to be [[Big data|Big Data]], with single objects frequently ranging into Terabyte and soon Petabyte sizes; for example, today’stoday's earth and space observation archives typically grow by Terabytes a day. Array databases aim at offering flexible, scalable storage and retrieval on this information category.
 
[[File:Euclidean neighborhood in n-D arrays.png|thumb|150px|alt=Euclidean neighborhood of elements in arrays|Euclidean neighborhood of elements in arrays]]
Line 18 ⟶ 19:
First significant work in going beyond BLOBs has been established with PICDMS.<ref>Chock, M., Cardenas, A., Klinger, A.: Database structure and manipulation capabilities of a picture database management system (PICDMS). IEEE ToPAMI, 6(4):484–492, 1984</ref> This system offers the precursor of a 2-D array query language, albeit still procedural and without suitable storage support.
 
A first declarative query language suitable for multiple dimensions and with an algebra-based semantics has been published by [[Peter Baumann (computer scientist)|Baumann]], together with a scalable architecture.<ref>Baumann, P.: [http://www.informatik.uni-trier.de/~ley/db/journals/vldb/vldb3.html#Baumann94 On the Management of Multidimensional Discrete Data]. VLDB Journal 4(3)1994, Special Issue on Spatial Database Systems, pp. 401–444</ref><ref>Baumann, P.: [http://www.informatik.uni-trier.de/~ley/db/conf/ngits/ngits99.html#Baumann99 A Database Array Algebra for Spatio-Temporal Data and Beyond]. Proc. NGITS’99NGITS'99, LNCS 1649, Springer 1999, pp.76-93</ref> Another array database language, constrained to 2-D, has been presented by Marathe and Salem.<ref>Marathe, A., Salem, K.: A language for manipulating arrays. Proc. VLDB’97VLDB'97, Athens, Greece, August 1997, pp. 46–55</ref> Seminal theoretical work has been accomplished by Libkin et al.;<ref>Libkin, L., Machlin, R., Wong, L.: A query language for multidimensional arrays: design, implementation and optimization techniques. Proc. ACM SIGMOD’96SIGMOD'96, Montreal, Canada, pp. 228–239</ref> in their model, called NCRA, they extend a nested relational calculus with multidimensional arrays; among the results are important contributions on array query complexity analysis. A map algebra, suitable for 2-D and 3-D spatial raster data, has been published by Mennis et al.<ref>Mennis, J., Viger, R., Tomlin, C.D.: Cubic Map Algebra Functions for Spatio-Temporal Analysis. Cartography and Geographic Information Science 32(1)2005, pp. 17–32</ref>
 
In terms of Array DBMS implementations, the [[rasdaman]] system has the longest implementation track record of n-D arrays with full query support. [[Oracle Spatial|Oracle GeoRaster]] offers chunked storage of 2-D raster maps, albeit without SQL integration. [[TerraLib]] is an open-source GIS software that extends object-relational DBMS technology to handle spatio-temporal data types; while main focus is on vector data, there is also some support for rasters. Starting with version 2.0, [[Postgis|PostGIS]] embeds raster support for 2-D rasters; a special function offers declarative raster query functionality. [[SciQL]] is an array query language being added to the [[MonetDB]] DBMS. [[Michael Stonebraker#SciDB|SciDB]] is a more recent initiative to establish array database support. Like SciQL, arrays are seen as an equivalent to tables, rather than a new attribute type as in rasdaman and PostGIS.
 
For the special case of [[sparse matrix|sparse data]], [[OLAP]] data cubes are well established; they store cell values together with their ___location{{snd}} an adequate compression technique in face of the few locations carrying valid information at all{{snd}} and operate with SQL on them. As this technique does not scale in density, standard databases are not used today for dense data, like satellite images, where most cells carry meaningful information; rather, proprietary ad- hoc implementations prevail in scientific data management and similar situations. Hence, this is where Array DBMSs can make a particular contribution.
 
Generally, Array DBMSs are an emerging technology. While operationally deployed systems exist, like [[Oracle Spatial|Oracle GeoRaster]], [[Postgis|PostGIS 2.0]] and [[rasdaman]], there are still many open research questions, including query language design and formalization, query optimization, parallelization and [[distributed processing]], and scalability issues in general. Besides, scientific communities still appear reluctant in taking up array database technology and tend to favor specialized, proprietary technology.
 
== Concepts ==
Line 30 ⟶ 31:
 
=== Conceptual modeling ===
Formally, an array ''A'' is given by a (total or partial) function ''A'': ''X'' → ''V'' where ''X'', the ''___domain'' is a ''d''-dimensional integer interval for some {{math|''d'' &gt; 0}} and ''V'', called ''range'', is some (non-empty) value set; in set notation, this can be rewritten as {{math|{{mset| (''p'',''v'') | ''p'' in ''X'', ''v'' in ''V'' }}}}. Each (''p'',''v'') in ''A'' denotes an array element or ''cell'', and following common notation we write ''A''[''p''] = ''v''. Examples for ''X'' include {0..767} × {0..1023} (for [[Xga#Extended Graphics Array|XGA]] sized images), examples for ''V'' include {0..255} for 8-bit [[greyscale imagesimage]]s and {0..255} × {0..255} × {0..255} for standard [[RGB]] imagery.
 
Following established database practice, an array query language should be [[Declarative programming|declarative]] and safe in evaluation.
Line 39 ⟶ 40:
 
The '''marray''' operator creates an array over some given ___domain extent and initializes its cells:
<sourcesyntaxhighlight lang="sqltext">
marray index-range-specification
values cell-value-expression
</syntaxhighlight>
</source>
 
where ''index-range-specification'' defines the result ___domain and binds an iteration variable to it, without specifying iteration sequence. The ''cell-value-expression'' is evaluated at each ___location of the ___domain.
 
'''Example:''' “A"A cutout of array A given by the corner points (10,20) and (40,50)."
<sourcesyntaxhighlight lang="sqltext">
marray p in [10:20,40:50]
values A[p]
</syntaxhighlight>
</source>
 
This special case, pure subsetting, can be abbreviated as
<sourcesyntaxhighlight lang="sqltext">
A[10:20,40:50]
</syntaxhighlight>
</source>
This subsetting keeps the dimension of the array; to reduce dimension by extracting slices, a single slicepoint value is indicated in the slicing dimension.
 
'''Example:''' “A"A slice through an x/y/t timeseries at position t=100, retrieving all available data in x and y."
<sourcesyntaxhighlight lang="sqltext">
A[*:*,*:*,100]
</syntaxhighlight>
</source>
 
The wildcard operator ''*'' indicates that the current boundary of the array is to be used; note that arrays where dimension boundaries are left open at definition time may change size in that dimensions over the array's lifetime.
Line 67 ⟶ 68:
The above examples have simply copied the original values; instead, these values may be manipulated.
 
'''Example:''' “Array"Array A, with a log() applied to each cell value."
<sourcesyntaxhighlight lang="sqltext">
marray p in ___domain(A)
values log( A[p] )
</syntaxhighlight>
</source>
 
This can be abbreviated as:
<sourcesyntaxhighlight lang="sqltext">
log( A )
</syntaxhighlight>
</source>
 
Through a principle called ''induced operations'',<ref>Ritter, G. and Wilson, J. and Davidson, J.: Image Algebra: An Overview. Computer Vision, Graphics, and Image Processing, 49(1)1994, 297-336</ref> the query language offers all operations the cell type offers on array level, too. Hence, on numeric values all the usual unary and binary arithmetic, exponential, and trigonometric operations are available in a straightforward manner, plus the standard set of Boolean operators.
 
The '''condense''' operator aggregates cell values into one scalar result, similar to SQL aggregates. Its application has the general form:
<sourcesyntaxhighlight lang="sqltext">
condense condense-op
over index-range-specification
using cell-value-expression
</syntaxhighlight>
</source>
 
As with ''marray'' before, the ''index-range-specification'' specifies the ___domain to be iterated over and binds an iteration variable to it{{snd}} again, without specifying iteration sequence. Likewise, ''cell-value-expression'' is evaluated at each ___domain ___location. The ''condense-op'' clause specifies the aggregating operation used to combine the cell value expressions into one single value.
 
'''Example:''' "The sum over all values in A."
<sourcesyntaxhighlight lang="sqltext">
condense +
over p in sdom(A)
using A[p]
</syntaxhighlight>
</source>
 
A shorthand for this operation is:
<sourcesyntaxhighlight lang="sqltext">
add_cells( A )
</syntaxhighlight>
</source>
 
In the same manner and in analogy to SQL aggregates, a number of further shorthands are provided, including counting, average, minimum, maximum, and Boolean quantifiers.
Line 106 ⟶ 107:
 
'''Example:''' "A histogram over 8-bit greyscale image A."
<sourcesyntaxhighlight lang="sqltext">
marray bucket in [0:255]
values count_cells( A = bucket )
</syntaxhighlight>
</source>
 
The induced comparison, ''A=bucket'', establishes a Boolean array of the same extent as ''A''. The aggregation operator counts the occurrences of ''true'' for each value of ''bucket'', which subsequently is put into the proper array cell of the 1-D histogram array.
Line 118 ⟶ 119:
Array storage has to accommodate arrays of different dimensions and typically large sizes. A core task is to maintain spatial proximity on disk so as to reduce the number of disk accesses during subsetting. Note that an emulation of multi-dimensional arrays as nested lists (or 1-D arrays) will not per se accomplish this and, therefore, in general will not lead to scalable architectures.
 
Commonly arrays are partitioned into sub-arrays which form the unit of access. Regular partitioning where all partitions have the same size (except possibly for boundaries) is referred to as ''chunking''.<ref>[[Sunita Sarawagi|Sarawagi, S.]], [[Michael Stonebraker|Stonebraker, M.]]: Efficient Organization of Large Multidimensional Arrays. Proc. ICDE'94, Houston, USA, 1994, pp. 328-336</ref> A generalization which removes the restriction to equally sized partitions by supporting any kind of partitioning is ''tiling''.<ref>Furtado, P., Baumann, P.: [http://www.informatik.uni-trier.de/~ley/db/conf/icde/icde99.html#FurtadoB99 Storage of Multidimensional Arrays based on Arbitrary Tiling]. Proc. ICDE'99, March 23–26, 1999, Sydney, Australia, pp. 328–336</ref> Array partitioning can improve access to array subsets significantly: by adjusting tiling to the access pattern, the server ideally can fetch all required data with only one disk access.
 
Compression of tiles can sometimes reduce substantially the amount of storage needed. Also for transmission of results compression is useful, as for the large amounts of data under consideration networks bandwidth often constitutes a limiting factor.
Line 133 ⟶ 134:
 
The following are representative domains in which large-scale multi-dimensional array data are handled:
*Earth sciences: geodesy / mapping, [[remote sensing]], geology, oceanography, hydrology, atmospheric sciences, cryospheric sciences
*Space sciences: Planetary sciences, astrophysics (optical and radio telescope observations, cosmological simulations)
*Life sciences: gene data, confocal microscopy, CAT scans