Array DBMS: Difference between revisions

Content deleted Content added
authorlinks
m add link
 
(5 intermediate revisions by 4 users not shown)
Line 1:
{{short description|System that provides database services specifically for arrays}}
 
An '''Arrayarray database management systemssystem''' (or '''array DBMSsDBMS''') provideprovides [[Database management system|database]] services specifically for [[array data structure|array]]s (also called [[Raster graphics|raster data]]), that is: homogeneous collections of data items (often called [[pixel]]s, [[voxel]]s, etc.), sitting on a regular grid of one, two, or more dimensions. Often arrays are used to represent sensor, simulation, image, or statistics data. Such arrays tend to be [[Big data|Big Data]], with single objects frequently ranging into Terabyte and soon Petabyte sizes; for example, today's earth and space observation archives typically grow by Terabytes a day. Array databases aim at offering flexible, scalable storage and retrieval on this information category.
 
[[File:Euclidean neighborhood in n-D arrays.png|thumb|150px|alt=Euclidean neighborhood of elements in arrays|Euclidean neighborhood of elements in arrays]]
Line 18 ⟶ 19:
First significant work in going beyond BLOBs has been established with PICDMS.<ref>Chock, M., Cardenas, A., Klinger, A.: Database structure and manipulation capabilities of a picture database management system (PICDMS). IEEE ToPAMI, 6(4):484–492, 1984</ref> This system offers the precursor of a 2-D array query language, albeit still procedural and without suitable storage support.
 
A first declarative query language suitable for multiple dimensions and with an algebra-based semantics has been published by [[Peter Baumann (computer scientist)|Baumann]], together with a scalable architecture.<ref>Baumann, P.: [http://www.informatik.uni-trier.de/~ley/db/journals/vldb/vldb3.html#Baumann94 On the Management of Multidimensional Discrete Data]. VLDB Journal 4(3)1994, Special Issue on Spatial Database Systems, pp. 401–444</ref><ref>Baumann, P.: [http://www.informatik.uni-trier.de/~ley/db/conf/ngits/ngits99.html#Baumann99 A Database Array Algebra for Spatio-Temporal Data and Beyond]. Proc. NGITS’99NGITS'99, LNCS 1649, Springer 1999, pp.76-93</ref> Another array database language, constrained to 2-D, has been presented by Marathe and Salem.<ref>Marathe, A., Salem, K.: A language for manipulating arrays. Proc. VLDB’97VLDB'97, Athens, Greece, August 1997, pp. 46–55</ref> Seminal theoretical work has been accomplished by Libkin et al.;<ref>Libkin, L., Machlin, R., Wong, L.: A query language for multidimensional arrays: design, implementation and optimization techniques. Proc. ACM SIGMOD’96SIGMOD'96, Montreal, Canada, pp. 228–239</ref> in their model, called NCRA, they extend a nested relational calculus with multidimensional arrays; among the results are important contributions on array query complexity analysis. A map algebra, suitable for 2-D and 3-D spatial raster data, has been published by Mennis et al.<ref>Mennis, J., Viger, R., Tomlin, C.D.: Cubic Map Algebra Functions for Spatio-Temporal Analysis. Cartography and Geographic Information Science 32(1)2005, pp. 17–32</ref>
 
In terms of Array DBMS implementations, the [[rasdaman]] system has the longest implementation track record of n-D arrays with full query support. [[Oracle Spatial|Oracle GeoRaster]] offers chunked storage of 2-D raster maps, albeit without SQL integration. [[TerraLib]] is an open-source GIS software that extends object-relational DBMS technology to handle spatio-temporal data types; while main focus is on vector data, there is also some support for rasters. Starting with version 2.0, [[Postgis|PostGIS]] embeds raster support for 2-D rasters; a special function offers declarative raster query functionality. [[SciQL]] is an array query language being added to the [[MonetDB]] DBMS. [[Michael Stonebraker#SciDB|SciDB]] is a more recent initiative to establish array database support. Like SciQL, arrays are seen as an equivalent to tables, rather than a new attribute type as in rasdaman and PostGIS.
Line 30 ⟶ 31:
 
=== Conceptual modeling ===
Formally, an array ''A'' is given by a (total or partial) function ''A'': ''X'' → ''V'' where ''X'', the ''___domain'' is a ''d''-dimensional integer interval for some {{math|''d'' &gt; 0}} and ''V'', called ''range'', is some (non-empty) value set; in set notation, this can be rewritten as {{math|{{mset| (''p'',''v'') | ''p'' in ''X'', ''v'' in ''V'' }}}}. Each (''p'',''v'') in ''A'' denotes an array element or ''cell'', and following common notation we write ''A''[''p''] = ''v''. Examples for ''X'' include {0..767} × {0..1023} (for [[Xga#Extended Graphics Array|XGA]] sized images), examples for ''V'' include {0..255} for 8-bit [[greyscale imagesimage]]s and {0..255} × {0..255} × {0..255} for standard [[RGB]] imagery.
 
Following established database practice, an array query language should be [[Declarative programming|declarative]] and safe in evaluation.
Line 39 ⟶ 40:
 
The '''marray''' operator creates an array over some given ___domain extent and initializes its cells:
<syntaxhighlight lang="sqltext">
marray index-range-specification
values cell-value-expression
Line 46 ⟶ 47:
where ''index-range-specification'' defines the result ___domain and binds an iteration variable to it, without specifying iteration sequence. The ''cell-value-expression'' is evaluated at each ___location of the ___domain.
 
'''Example:''' “A"A cutout of array A given by the corner points (10,20) and (40,50)."
<syntaxhighlight lang="sqltext">
marray p in [10:20,40:50]
values A[p]
Line 53 ⟶ 54:
 
This special case, pure subsetting, can be abbreviated as
<syntaxhighlight lang="sqltext">
A[10:20,40:50]
</syntaxhighlight>
This subsetting keeps the dimension of the array; to reduce dimension by extracting slices, a single slicepoint value is indicated in the slicing dimension.
 
'''Example:''' “A"A slice through an x/y/t timeseries at position t=100, retrieving all available data in x and y."
<syntaxhighlight lang="sqltext">
A[*:*,*:*,100]
</syntaxhighlight>
Line 67 ⟶ 68:
The above examples have simply copied the original values; instead, these values may be manipulated.
 
'''Example:''' “Array"Array A, with a log() applied to each cell value."
<syntaxhighlight lang="sqltext">
marray p in ___domain(A)
values log( A[p] )
Line 74 ⟶ 75:
 
This can be abbreviated as:
<syntaxhighlight lang="sqltext">
log( A )
</syntaxhighlight>
Line 81 ⟶ 82:
 
The '''condense''' operator aggregates cell values into one scalar result, similar to SQL aggregates. Its application has the general form:
<syntaxhighlight lang="sqltext">
condense condense-op
over index-range-specification
Line 90 ⟶ 91:
 
'''Example:''' "The sum over all values in A."
<syntaxhighlight lang="sqltext">
condense +
over p in sdom(A)
Line 97 ⟶ 98:
 
A shorthand for this operation is:
<syntaxhighlight lang="sqltext">
add_cells( A )
</syntaxhighlight>
Line 106 ⟶ 107:
 
'''Example:''' "A histogram over 8-bit greyscale image A."
<syntaxhighlight lang="sqltext">
marray bucket in [0:255]
values count_cells( A = bucket )
Line 133 ⟶ 134:
 
The following are representative domains in which large-scale multi-dimensional array data are handled:
*Earth sciences: geodesy / mapping, [[remote sensing]], geology, oceanography, hydrology, atmospheric sciences, cryospheric sciences
*Space sciences: Planetary sciences, astrophysics (optical and radio telescope observations, cosmological simulations)
*Life sciences: gene data, confocal microscopy, CAT scans