'''Data-Intensive Computing''' is a class of [[parallel computing]] applications which use a [[data- parallel]] approach to processing large volumes of data typically [[terabytes]] or [[petabytes]] in size and typically referred to as [[Big Data]]. Computing applications which devote most of their execution time to computational requirements are deemed compute-intensive and typically require small volumes of data, whereas computing applications which require large volumes of data and devote most of their processing time to I/O and manipulation of data are deemed data-intensive <ref>[http://www.cse.fau.edu/~borko/HandbookofCloudComputing.html/ Handbook of Cloud Computing], "Data-Intensive Technologies for Cloud Computing," by A.M. Middleton. Handbook of Cloud Computing. Springer, 2010.</ref>.
== Introduction ==
== Data-Parallelism ==
Computer system architectures which can support [[data- parallel]] applications are a potential solution to the terabyte and petabyte scale data processing requirements of data-intensive computing <ref>[http://www.patrickpantel.com/download/papers/2004/kdd-msw04-1.pdf The terascale challenge] by D. Ravichandran, P. Pantel, and E. Hovy. "The terascale challenge," Proceedings of the KDD Workshop on Mining for and from the Semantic Web, 2004</ref>. [[Data-parallelism]] can be defined as a computation applied independently to each data item of a set of data which allows the degree of parallelism to be scaled with the volume of data. The most important reason for developing data-parallel applications is the potential for scalable performance, and may result in several orders of magnitude performance improvement. The key issues with developing applications using data-parallelism are the choice of the algorithm, the strategy for data decomposition, [[load balancing]] on processing nodes, [[message passing]] communications between nodes, and the overall accuracy of the results <ref>[http://www.cs.rochester.edu/u/umit/papers/ppopp01.ps Dynamic adaptation to available resources for parallel computing in an autonomous network of workstations] by U. Rencuzogullari, and S. Dwarkadas. "Dynamic adaptation to available resources for parallel computing in an autonomous network of workstations," Proceedings of the Eighth ACM SIGPLAN Symposium on Principles and Practices of Parallel Programming, 2001</ref>. The development of a [[data- parallel]] application can involve substantial programming complexity to define the problem in the context of available programming tools, and to address limitations of the target architecture. [[Information extraction]] from and indexing of Web documents is typical of data-intensive computing which can derive significant performance benefits from [[data- parallel]] implementations since Web and other types of document collections can typically then be processed in parallel <ref>[http://www.mathcs.emory.edu/~eugene/publications.html Information Extraction to Large Document Collections] by E. Agichtein, "Scaling Information Extraction to Large Document Collections," Microsoft Research, 2004</ref>.
== Characteristics of Data-Intensive Computing Systems ==
(3) A focus on reliability and availability. Large-scale systems with hundreds or thousands of processing nodes are inherently more susceptible to hardware failures, communications errors, and software bugs. Data-intensive computing systems are designed to be fault resilient. This typically includes redundant copies of all data files on disk, storage of intermediate processing results on disk, automatic detection of node or processing failures, and selective re-computation of results.
(4) The inherent scalability of the underlying hardware and [[software architecture]]. Data-intensive computing systems can typically be scaled in a linear fashion to accommodate virtually any amount of data, or to meet time-critical performance requirements by simply adding additional processing nodes. The number of nodes and processing tasks assigned for a specific application can be variable or fixed depending on the hardware, software, communications, and [[distributed file system architecture]] architecture.
== System Architectures ==
|