Array (data structure)

This is an old revision of this page, as edited by Highegg (talk | contribs) at 07:19, 11 April 2007 (Indices into arrays). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

In computer programming, an array is a group of homogeneous elements of a specific data type, and it is one of the simplest data structures. An array is similar to, but distinct from, a vector or list (for one-dimensional arrays) or a matrix (for two-dimensional arrays). Arrays hold a sequence of data elements, usually of the same size and data type. Individual elements are accessed by their position in the array, which is given by an index, also called a subscript. The index usually ranges over consecutive integers (unlike the keys of an associative array), but it can be drawn from any ordinal set of values. Some arrays are multi-dimensional, meaning that they are indexed by a fixed number of integers, for example by a tuple of four integers. One- and two-dimensional arrays are the most common.

Most programming languages have a built-in array data type. Some programming languages (such as APL, newer versions of Fortran, and J) allow array programming, i.e. generalize the available operations and functions to work transparently over arrays as well as scalars, providing a higher-level manipulation than most other languages, which require loops over all the individual members of the arrays.

Advantages and disadvantages

Arrays permit efficient (constant-time, O(1)) random access, but not efficient insertion and deletion of elements (which are O(n), where n is the size of the array). Linked lists have the opposite trade-off. Consequently, arrays are most appropriate for storing a fixed amount of data which will be accessed in an unpredictable fashion, and linked lists are best for a list of data which will be accessed sequentially and updated often with insertions or deletions.
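The O(n) cost of insertion can be seen in a minimal sketch. The function name insert_at and the use of a Python list as a fixed-capacity block of slots are assumptions made for illustration only; reading or writing a single slot is one indexing step, while inserting at the front must move every stored element.

```python
def insert_at(items, count, index, value):
    """Insert value at index in a fixed-capacity array holding count elements.

    Returns the number of elements that had to be shifted.
    """
    if count >= len(items):
        raise OverflowError("array is full")
    if not 0 <= index <= count:
        raise IndexError("insert position out of range")
    shifted = 0
    # Move every element at or after index one slot to the right,
    # working from the end so that nothing is overwritten.
    for i in range(count, index, -1):
        items[i] = items[i - 1]
        shifted += 1
    items[index] = value
    return shifted


storage = [10, 20, 30, 40, None]      # capacity 5, four slots in use
moved = insert_at(storage, 4, 0, 5)   # worst case: insert at the front
```

Here all four stored elements are shifted to make room for the new one, whereas reading storage[2] takes a single step regardless of the array's size.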

Another advantage of arrays that has become very important on modern architectures is that iterating through an array has good locality of reference, and so is much faster than iterating through (say) a linked list of the same size, which tends to jump around in memory. However, an array can also be accessed in a random pattern, as happens with large hash tables, and in that case locality of reference provides no benefit.

Arrays also are among the most compact data structures; storing 100 integers in an array takes only 100 times the space required to store an integer, plus perhaps a few bytes of overhead for the pointer to the array (4 on a 32-bit system). Any pointer-based data structure, on the other hand, must keep its pointers somewhere, and these occupy additional space. This extra space becomes more significant as the data elements become smaller. For example, an array of ASCII characters takes up one byte per character, while on a 32-bit platform, which has 4-byte pointers, a linked list requires at least five bytes per character. Conversely, for very large elements, the space difference becomes a negligible fraction of the total space.
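The arithmetic in this comparison can be written out as a small sketch. The function names are hypothetical, and the model is deliberately simplified: it assumes a singly linked list with one pointer per node and ignores allocator overhead and alignment padding, both of which would make the linked list larger still.

```python
def array_bytes(n, element_size, pointer_size=4):
    # n elements stored back to back, plus one pointer to the array itself.
    return n * element_size + pointer_size


def linked_list_bytes(n, element_size, pointer_size=4):
    # Each node holds one element and one pointer to the next node.
    return n * (element_size + pointer_size)


ints_in_array = array_bytes(100, element_size=4)         # 100 four-byte integers
chars_in_array = array_bytes(100, element_size=1)        # 100 ASCII characters
chars_in_list = linked_list_bytes(100, element_size=1)   # same characters, linked
```

Under these assumptions, 100 characters occupy 104 bytes in an array but 500 bytes in a linked list on a 32-bit platform.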

Because arrays have a fixed size, some indexes refer to invalid elements, for example the index 17 in an array of size 5. What happens when a program attempts to use such an index varies from language to language and platform to platform. For more information, see bounds checking.
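As an illustration of one possible behaviour, the following sketch performs an explicit bounds check before each access. The helper checked_get is a hypothetical name; the check is written out by hand so that the test is visible, although Python's own lists already raise IndexError for an index of 17 in a list of 5 elements.

```python
values = [3, 1, 4, 1, 5]   # an array of size 5; valid indexes are 0 to 4


def checked_get(array, index):
    # Reject any index outside the valid range before touching the array.
    if not 0 <= index < len(array):
        raise IndexError(f"index {index} out of range for size {len(array)}")
    return array[index]


try:
    checked_get(values, 17)
    outcome = "no error"
except IndexError as error:
    outcome = str(error)
```

A language without bounds checking would instead read whatever happens to lie at that position in memory.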

Uses

Although useful in their own right, arrays also form the basis for several more complex data structures, such as heaps, hash tables, and VLists, and can be used to represent strings, stacks and queues. They also play a more minor role in many other data structures. All of these applications benefit from the compactness and locality of arrays.

One of the disadvantages of an array is that it has a single fixed size; although its size can be altered in many environments, this is an expensive operation. Dynamic arrays, or growable arrays, perform this resizing automatically and as late as possible: when the programmer attempts to add an element to the end of the array and there is no more space. To spread the high cost of resizing over many operations (we say it is an amortized cost), they expand by a large amount each time, so that later additions simply use more of the reserved space.
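A minimal sketch of a dynamic array follows. The class name DynamicArray, the starting capacity of one slot, and the growth factor of two are all assumptions chosen for illustration; real implementations use a variety of growth policies.

```python
class DynamicArray:
    """Growable array that doubles its capacity whenever it runs out of room."""

    def __init__(self):
        self._capacity = 1
        self._size = 0
        self._slots = [None] * self._capacity
        self.resizes = 0          # counts how often the expensive step ran

    def append(self, value):
        if self._size == self._capacity:
            self._grow()          # resize as late as possible
        self._slots[self._size] = value
        self._size += 1

    def _grow(self):
        # Allocate a block twice as large and copy every element across.
        self._capacity *= 2
        new_slots = [None] * self._capacity
        for i in range(self._size):
            new_slots[i] = self._slots[i]
        self._slots = new_slots
        self.resizes += 1

    def __len__(self):
        return self._size

    def __getitem__(self, index):
        if not 0 <= index < self._size:
            raise IndexError("index out of range")
        return self._slots[index]


numbers = DynamicArray()
for n in range(100):
    numbers.append(n)
```

With doubling, appending 100 elements triggers only seven resizes, which is why the cost per append is small when averaged over the whole sequence.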

In the C programming language, one-dimensional character arrays are used to store null-terminated strings, so called because the end of the string is indicated with a special reserved character called a null character ('\0') (see also C string).
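The convention can be imitated in a short sketch that uses a Python bytearray to stand in for a C character array. The function name c_string_length is hypothetical, and the code assumes that a terminating null byte is present somewhere in the buffer.

```python
def c_string_length(buffer):
    # Scan forward until the null character (byte value 0) is found.
    length = 0
    while buffer[length] != 0:
        length += 1
    return length


buffer = bytearray(b"cat\0\0\0\0\0")      # an 8-byte array holding "cat"
length = c_string_length(buffer)
text = buffer[:length].decode("ascii")    # the bytes before the terminator
```

The array has room for eight characters, but the string it holds ends at the first null character, so its length is three.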

Finally, in some applications the data are identical or missing for most index values, or for large ranges of indexes. In such cases space is saved by not storing a full array at all and instead using an associative array with integer keys. There are many specialized data structures for this purpose, such as Patricia tries and Judy arrays. Example applications include address translation tables and routing tables.
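The idea can be sketched with an ordinary Python dictionary playing the role of the associative array. The class name SparseArray and its default-value behaviour are assumptions for illustration, and a dictionary is a much simpler structure than a Patricia trie or Judy array.

```python
class SparseArray:
    """Array-like view that stores only the cells differing from a default."""

    def __init__(self, size, default=0):
        self._size = size
        self._default = default
        self._cells = {}          # integer index -> stored value

    def _check(self, index):
        if not 0 <= index < self._size:
            raise IndexError("index out of range")

    def __getitem__(self, index):
        self._check(index)
        return self._cells.get(index, self._default)

    def __setitem__(self, index, value):
        self._check(index)
        if value == self._default:
            self._cells.pop(index, None)   # default values take no space
        else:
            self._cells[index] = value

    def stored(self):
        return len(self._cells)


table = SparseArray(1_000_000)
table[42] = 7
table[999_999] = 3
```

Although the table has a million valid indexes, only two entries are actually stored.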

Indices into arrays

Although abstractions for arrays in most programming languages are very similar, one strong point of contention has arisen: the index used to refer to the first element. There are three main solutions: zero-based, one-based, and n-based arrays, for which the first element has index zero, one, or a programmer-specified value, respectively.

This is mainly a stylistic concern. The zero-based array was made popular by the C programming language, in which the abstraction of an array is very weak: the element at index n is simply located at the address of the first element offset by n units. Accordingly, index 0 refers to the first element of the array. Descendants of C inherit this behavior. One-based arrays follow traditional mathematical notation and simple counting, which begins with one. The last group, n-based arrays, leaves the programmer free to choose the lower bound best suited to the problem at hand.

There is a list of programming languages at the end of the article.

The conflict over the "right" way to do array indexing has impacted programmer culture. When supporters of one-based arrays decried zero-based arrays as unnatural, saying for example that we start numbered lists from 1, supporters of zero-based arrays retaliated by starting their own lists from zero in their daily lives. This practice can still be observed, and is often done for humor.

In 1982 Edsger W. Dijkstra wrote a document, Why numbering should start at zero, putting forth concise reasons why he believed zero-based indexing into arrays should be the natural default definition.

Supporters of zero-based indexing often criticise one-based and n-based arrays for being slower. A naive implementation does need an extra subtraction on each access, but this cost is easily optimized away, for example with common subexpression elimination or a well-defined dope vector. In real-life applications, one-based and n-based arrays are therefore just as fast as zero-based arrays.
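The following sketch illustrates one way the extra work can be removed. The function names are hypothetical, and the precomputed "virtual origin" shown here is only one value that an array descriptor such as a dope vector might hold; it is offered as an assumption about how such an optimization could look, not as a description of any particular compiler.

```python
def element_address(base, lower_bound, element_size, index):
    # Naive form: subtract the lower bound on every access.
    return base + (index - lower_bound) * element_size


def virtual_origin(base, lower_bound, element_size):
    # Computed once per array: the address that index 0 would have.
    return base - lower_bound * element_size


def element_address_fast(origin, element_size, index):
    # Same arithmetic as a zero-based access: one multiply and one add.
    return origin + index * element_size


base = 1000                                 # address of the first element
origin = virtual_origin(base, 1, 4)         # one-based array of 4-byte elements
slow = element_address(base, 1, 4, 3)       # address of the third element
fast = element_address_fast(origin, 4, 3)   # same address, no subtraction
```

Both forms yield the same address, but the second does no more work per access than a zero-based array would.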

Several modern scripting languages use one-based arrays in order to lessen the learning curve for new programmers. In such languages (AppleScript, for example), ease of use for the scripter is considered more important than absolute efficiency. One-based indices can also be a boon to non-programmers who must nevertheless dip into programming code, such as graphic designers or office managers making use of a company's scripting solutions.

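One-based indexing forces nonstandard restatements of common definitions from the literature. A common example is the discrete Fourier transform (DFT), whose standard definition is

$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i k n / N}, \qquad k = 0, \ldots, N-1,$$

with the inverse discrete Fourier transform (IDFT)

$$x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k \, e^{2\pi i k n / N}, \qquad n = 0, \ldots, N-1.$$

Since an index of 0 is not allowed in environments with a hard-wired index base such as MATLAB (nor can arrays there be defined or extended to allow non-positive indices), the one-based definitions of the DFT and IDFT look substantively different from the standard ones:

$$X_k = \sum_{n=1}^{N} x_n \, e^{-2\pi i (k-1)(n-1) / N}, \qquad k = 1, \ldots, N,$$

$$x_n = \frac{1}{N} \sum_{k=1}^{N} X_k \, e^{2\pi i (k-1)(n-1) / N}, \qquad n = 1, \ldots, N.$$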
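The effect of the index shift can be sketched in code. The function names are hypothetical; the one-based version simulates one-based storage by leaving slot 0 unused, and it must subtract one from each index inside the exponent to produce the same transform as the zero-based version.

```python
import cmath


def dft_zero_based(x):
    # Samples x[0..N-1]; the indices k and n appear directly in the exponent.
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]


def dft_one_based(x1):
    # Samples x1[1..N]; slot 0 is unused padding that imitates one-based storage.
    N = len(x1) - 1
    X1 = [None] * (N + 1)
    for k in range(1, N + 1):
        X1[k] = sum(
            x1[n] * cmath.exp(-2j * cmath.pi * (k - 1) * (n - 1) / N)
            for n in range(1, N + 1)
        )
    return X1


samples = [1, 2, 3, 4]
zero = dft_zero_based(samples)
one = dft_one_based([None] + samples)
```

The two results agree element for element once the one-based output is shifted back by one position.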

Negative n-based systems make it possible to concisely model acausal systems (those with impulse responses of non-zero values at negative times). Another usage for negative n-based arrays is for packed storage of a band matrix.
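A minimal sketch of an n-based array with a negative lower bound follows. The class name OffsetArray and the sample filter coefficients are assumptions for illustration; the point is only that index 0 can denote "time zero", with negative indexes referring to earlier times.

```python
class OffsetArray:
    """One-dimensional array whose valid indexes run from lower to upper inclusive."""

    def __init__(self, lower, upper, fill=0.0):
        if upper < lower:
            raise ValueError("upper bound must not be below lower bound")
        self.lower = lower
        self.upper = upper
        self._cells = [fill] * (upper - lower + 1)

    def _position(self, index):
        # Translate a user-visible index into a zero-based storage position.
        if not self.lower <= index <= self.upper:
            raise IndexError("index out of range")
        return index - self.lower

    def __getitem__(self, index):
        return self._cells[self._position(index)]

    def __setitem__(self, index, value):
        self._cells[self._position(index)] = value


# Impulse response of a symmetric smoothing filter, non-zero at time -1.
h = OffsetArray(-2, 2)
h[-1] = 0.25
h[0] = 0.5
h[1] = 0.25
```

Note that the translation is done explicitly here; a plain Python list would interpret a negative index as counting from the end, which is a different convention.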

Multi-dimensional arrays

Ordinary arrays are indexed by a single integer. Also useful, particularly in numerical and graphics applications, is the multi-dimensional array, in which we index into the array using an ordered list of integers, as in a[3,1,5]. The number of integers in this list is always the same for a given array and is called the array's dimensionality; the bounds on each of them are called the array's dimensions. An array with dimensionality k is often called k-dimensional. One-dimensional arrays correspond to the simple arrays discussed thus far; two-dimensional arrays are a particularly common representation for matrices. In practice, the dimensionality of an array rarely exceeds three.

Mapping a one-dimensional array into memory is straightforward, since memory is logically itself a (very large) one-dimensional array. For higher-dimensional arrays, however, the mapping is no longer obvious. Suppose we want to represent this simple two-dimensional array:

    1 2 3
    4 5 6
    7 8 9

It is most common to index this array using the RC-convention, in which an element is referred to by its row first and its column second.

Common ways of laying out a multi-dimensional array in memory include:

  • Row-major order. Used most notably by statically-declared arrays in C. The elements of each row are stored in order:
    1 2 3 4 5 6 7 8 9
  • Column-major order. Used most notably in Fortran. The elements of each column are stored in order:
    1 4 7 2 5 8 3 6 9
  • Arrays of arrays. The multi-dimensional array is represented by a one-dimensional array of references (an Iliffe vector) to other one-dimensional arrays. The subarrays can be either the rows or the columns.
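The first two layouts can be expressed as index formulas, sketched below. The function names are hypothetical and the indexes are assumed to be zero-based; the 3×3 matrix is the one whose elements, stored row by row, give the sequence 1 to 9.

```python
def row_major_offset(row, col, n_rows, n_cols):
    # Skip the complete rows before this one, then step along the row.
    return row * n_cols + col


def column_major_offset(row, col, n_rows, n_cols):
    # Skip the complete columns before this one, then step down the column.
    return col * n_rows + row


def flatten(matrix, offset):
    """Lay out a rectangular matrix in a flat list using the given offset rule."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    flat = [None] * (n_rows * n_cols)
    for r in range(n_rows):
        for c in range(n_cols):
            flat[offset(r, c, n_rows, n_cols)] = matrix[r][c]
    return flat


matrix = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
by_rows = flatten(matrix, row_major_offset)
by_columns = flatten(matrix, column_major_offset)
```

In this sketch the matrix itself is written as an array of arrays, the third layout in the list, so the example also shows a conversion from that form into each flat form.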

 

The first two forms are more compact and have potentially better locality of reference, but are also more limiting; the arrays must be rectangular, meaning that no row can contain more elements than any other. Arrays of arrays, on the other hand, allow the creation of ragged arrays, also called jagged arrays, in which the valid range of one index depends on the value of another, or in this case, simply that different rows can be different sizes. Arrays of arrays are also of value in programming languages that only supply one-dimensional arrays as primitives.
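A ragged array is easy to sketch with nested Python lists, which are themselves arrays of references. The variable names are illustrative only.

```python
# Each row is a separate one-dimensional array, so rows may differ in length.
jagged = [
    [1],
    [2, 3, 4],
    [5, 6],
]

row_lengths = [len(row) for row in jagged]

# The valid range of the column index depends on the row index.
try:
    jagged[0][1]
    second_column_in_row_0 = True
except IndexError:
    second_column_in_row_0 = False
```

Column 1 exists in rows 1 and 2 but not in row 0, which a rectangular row-major or column-major layout could not express.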

In many applications, such as numerical applications working with matrices, we iterate over rectangular two-dimensional arrays in predictable ways. For example, computing an element of the matrix product AB involves iterating over a row of A and a column of B simultaneously. In mapping the individual array indexes into memory, we wish to exploit locality of reference as much as we can. A compiler can sometimes automatically choose the layout for an array so that sequentially accessed elements are stored sequentially in memory; in our example, it might choose row-major order for A, and column-major order for B. Even more exotic orderings can be used, for example if we iterate over the main diagonal of a matrix.
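The matrix product example can be sketched with both matrices held in flat lists. The function name matmul_flat is hypothetical, and the layouts are chosen by hand here rather than by a compiler; the point is that with A in row-major order and B in column-major order, the inner loop walks both arrays through consecutive positions.

```python
def matmul_flat(a_row_major, b_col_major, n, m, p):
    """Multiply an n x m matrix A by an m x p matrix B.

    A is stored in row-major order and B in column-major order.
    The n x p result is returned in row-major order.
    """
    c = [0] * (n * p)
    for i in range(n):
        for j in range(p):
            total = 0
            for k in range(m):
                # Row i of A starts at i * m; column j of B starts at j * m.
                # Both are read at consecutive offsets as k increases.
                total += a_row_major[i * m + k] * b_col_major[j * m + k]
            c[i * p + j] = total
    return c


a = [1, 2, 3, 4]    # A = [[1, 2], [3, 4]] stored row by row
b = [5, 7, 6, 8]    # B = [[5, 6], [7, 8]] stored column by column
product = matmul_flat(a, b, 2, 2, 2)
```

Had B been stored in row-major order as well, the inner loop would have had to jump p elements at a time through B instead of reading it sequentially.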

Array system cross-reference list

Programming language    Base index    Bounds check    Dimensions    Dynamic
Ada                     n             checked         n             init [1]
APL [7]                 0 or 1        checked         n             init [1]
assembly language       0             unchecked       1             no
BASIC                   1             unchecked       1             init [1]
C                       0             unchecked       n [2]         heap [3,4]
C++ [5]                 0             unchecked       n [2]         heap [3]
C#                      0             checked         n [2]         heap [3,9]
Common Lisp             0             checked         n             yes
D                       0             varies [11]     n             yes
Fortran                 n             varies [12]     n             heap [3]
IDL                     0             checked         n             yes
Java [5]                0             checked         1 [2]         heap [3]
Lua                     1             checked         1 [2]         yes
MATLAB                  1             checked         n [8]         yes
Oberon-1                0             checked         n             no
Oberon-2                0             checked         n             yes
Pascal                  n             checked         n             varies [10]
Perl                    n             checked         1 [2]         yes
PL/I                    n             checked
Python                  0             checked         1 [2]         yes
Ruby                    0             checked         1 [2]         yes
Scheme                  0             checked         1 [2]         no
Smalltalk [5]           1             checked         1 [2]         yes [6]
Visual Basic            n             checked         n             yes

Bracketed numbers refer to the numbered notes that follow.
  1. Size can be chosen on initialization/declaration after which it is fixed.
  2. Allows arrays of arrays which can be used to emulate multi-dimensional arrays.
  3. Size can only be chosen when memory is allocated on the heap.
  4. C99 allows for variable-length arrays; however, almost no compiler is available that supports this new feature.
  5. This list strictly compares language features. In every language (even assembler) it is possible to provide improved array handling via add-on libraries. This language has improved array handling as part of its standard library.
  6. The class Array is fixed-size, but OrderedCollection is dynamic.
  7. The indexing base can be 0 or 1, but is set for a whole "workspace".
  8. At least 2 dimensions (scalar numbers are 1×1 arrays, vectors are 1×n or n×1 arrays).
  9. Allows creation of fixed-size arrays in "unsafe" code, allowing for enhanced interoperability with other languages.
  10. Varies by implementation. Newer implementations (FreePascal and Delphi) permit heap-based dynamic arrays.
  11. Behaviour can be tuned using compiler switches. In DMD 1.0, bounds are checked in debug mode and unchecked in release mode for efficiency reasons.
  12. Almost all Fortran implementations offer bounds checking options via compiler switches. By default, however, bounds checking is usually turned off for efficiency reasons.

See also