Content deleted Content added
reduce overlinking and over-use of Capital Letters |
|||
Line 1:
'''Data-intensive computing''' is a class of [[parallel computing]] applications which use a [[data parallel]] approach to processing large volumes of data typically [[terabytes]] or [[petabytes]] in size and typically referred to as [[
== Introduction ==
The rapid growth of the [[Internet]] and [[World Wide Web]]
[[Parallel processing]] approaches can be generally classified as either ''compute-intensive'', or ''data-intensive''.<ref>[http://portal.acm.org/citation.cfm?id=280278 Models and languages for parallel computation], by D.B. Skillicorn, and D. Talia, ACM Computing Surveys, Vol. 30, No. 2, 1998, pp. 123-169.</ref><ref>[http://www.pnl.gov/science/images/highlights/computing/dic_special.pdfData-Intensive Computing in the 21st Century], by I. Gorton, P. Greenfield, A. Szalay, and R. Williams, IEEE Computer, Vol. 41, No. 4, 2008, pp. 30-32.</ref><ref>[http://www.computer.org/portal/web/csdl/doi/10.1109/MC.2008.122 High-Speed, Wide Area, Data Intensive Computing: A Ten Year Retrospective], by W.E. Johnston, IEEE Computer Society, 1998.</ref> Compute-intensive is used to describe application programs that are compute bound. Such applications devote most of their execution time to computational requirements as opposed to I/O, and typically require small volumes of data. [[Parallel processing]] of compute-intensive applications typically involves parallelizing individual algorithms within an application process, and decomposing the overall application process into separate tasks, which can then be executed in parallel on an appropriate computing platform to achieve overall higher performance than serial processing. In compute-intensive applications, multiple operations are performed simultaneously, with each operation addressing a particular part of the problem. This is often referred to as task [[parallel computing|parallelism]].
Data-intensive is used to describe applications that are I/O bound or with a need to process large volumes of data.<ref>[https://computation.llnl.gov/casc/dcca-pub/dcca/Papers_files/data-intensive-ieee-computer-0408.pdf IEEE: Hardware Technologies for High-Performance Data-Intensive Computing], by M. Gokhale, J. Cohen, A. Yoo, and W.M. Miller, IEEE Computer, Vol. 41, No. 4, 2008, pp. 60-68.</ref> Such applications devote most of their processing time to I/O and movement and manipulation of data. [[Parallel processing]] of data-intensive applications typically involves partitioning or subdividing the data into multiple segments which can be processed independently using the same executable application program in parallel on an appropriate computing platform, then reassembling the results to produce the completed output data.<ref>[http://www.agoldberg.org/Publications/DesignMethForDP.pdf IEEE: A Design Methodology for Data-Parallel Applications], by L.S. Nyland, J.F. Prins, A. Goldberg, and P.H. Mills, IEEE Transactions on Software Engineering, Vol. 26, No. 4, 2000, pp. 293-314.</ref> The greater the aggregate distribution of the data, the more benefit there is in parallel processing of the data. Data-intensive processing requirements normally scale linearly according to the size of the data and are very amenable to straightforward parallelization. The fundamental challenges for data-intensive computing are managing and processing exponentially growing data volumes, significantly reducing associated data analysis cycles to support practical, timely applications, and developing new algorithms which can scale to search and process massive amounts of data. Researchers
== Data-
Computer system architectures which can support [[data parallel]] applications are a potential solution to the terabyte and petabyte scale data processing requirements of data-intensive computing.<ref>[http://www.patrickpantel.com/download/papers/2004/kdd-msw04-1.pdf The terascale challenge] by D. Ravichandran, P. Pantel, and E. Hovy. "The terascale challenge," Proceedings of the KDD Workshop on Mining for and from the Semantic Web, 2004</ref> [[Data-parallelism]] can be defined as a computation applied independently to each data item of a set of data which allows the degree of parallelism to be scaled with the volume of data. The most important reason for developing data-parallel applications is the potential for scalable performance, and may result in several orders of magnitude performance improvement. The key issues with developing applications using data-parallelism are the choice of the algorithm, the strategy for data decomposition, [[load balancing (computing)|load balancing]] on processing nodes, [[message passing]] communications between nodes, and the overall accuracy of the results.<ref>[http://www.cs.rochester.edu/u/umit/papers/ppopp01.ps Dynamic adaptation to available resources for parallel computing in an autonomous network of workstations] by U. Rencuzogullari, and S. Dwarkadas. "Dynamic adaptation to available resources for parallel computing in an autonomous network of workstations," Proceedings of the Eighth ACM SIGPLAN Symposium on Principles and Practices of Parallel Programming, 2001</ref> The development of a [[data parallel]] application can involve substantial programming complexity to define the problem in the context of available programming tools, and to address limitations of the target architecture. [[Information extraction]] from and indexing of Web documents is typical of data-intensive computing which can derive significant performance benefits from [[data parallel]] implementations since Web and other types of document collections can typically then be processed in parallel.<ref>[http://www.mathcs.emory.edu/~eugene/publications.html Information Extraction to Large Document Collections] by E. Agichtein, "Scaling Information Extraction to Large Document Collections," Microsoft Research, 2004</ref>
== Characteristics
The [[National Science Foundation]] believes that data-intensive computing requires a “fundamentally different set of principles” than current computing approaches.<ref>[http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=503324&org=IIS Data-Intensive Computing] by NSF. "Data-Intensive Computing," 2009.</ref> Through a funding program within the Computer and Information Science and Engineering area, the NSF is seeking to “increase understanding of the capabilities and limitations of data-intensive computing.” The key areas of focus are:
Line 19:
* Identifying applications that can exploit this computing paradigm and determining how it should evolve to support emerging data-intensive applications
[[Pacific Northwest National Labs]]
== Processing Approach ==
==
(1) The principle of collection of the data and programs or algorithms is used to perform the computation. To achieve high performance in data-intensive computing, it is important to minimize the movement of data.<ref>[http://queue.acm.org/detail.cfm?id=1394131 Distributed Computing Economics] by J. Gray, "Distributed Computing Economics," ACM Queue, Vol. 6, No. 3, 2008, pp. 63-68.</ref> This characteristic allows processing algorithms to execute on the nodes where the data resides reducing system overhead and increasing performance.<ref>[http://www.pnl.gov/science/images/highlights/computing/dic_special.pdfData-Intensive Computing in the 21st Century], by I. Gorton, P. Greenfield, A. Szalay, and R. Williams, IEEE Computer, Vol. 41, No. 4, 2008, pp. 30-32.</ref> Newer technologies such as [[InfiniBand]] allow data to be stored in a separate repository and provide performance comparable to collocated data.
Line 35:
(4) The inherent scalability of the underlying hardware and [[software architecture]]. Data-intensive computing systems can typically be scaled in a linear fashion to accommodate virtually any amount of data, or to meet time-critical performance requirements by simply adding additional processing nodes. The number of nodes and processing tasks assigned for a specific application can be variable or fixed depending on the hardware, software, communications, and [[distributed file system]] architecture.
== System
A variety of [[system]] architectures have been implemented for data-intensive computing and large-scale data analysis applications including parallel and distributed [[relational database management systems]] which have been available to run on shared nothing clusters of processing nodes for more than two decades.<ref>[http://www.cse.nd.edu/~dthain/courses/cse40771/spring2010/benchmarks-sigmod09.pdf A Comparison of Approaches to Large-Scale Data Analysis] by A. Pavlo, E. Paulson, A. Rasin, D.J. Abadi, D.J. Dewitt, S. Madden, and M. Stonebraker. Proceedings of the 35th SIGMOD International conference on Management of Data, 2009.</ref>
However most data growth is with data in unstructured form and new processing paradigms with more flexible data models were needed. Several solutions have emerged including the [[MapReduce]] architecture pioneered by Google and now available in an open-source implementation called [[Hadoop]] used by [[Yahoo]], [[Facebook]], and others. [[LexisNexis|LexisNexis Risk Solutions]] also developed and implemented a scalable platform for data-intensive computing which is used by [[LexisNexis]].
===MapReduce===
The [[MapReduce]] architecture and programming model pioneered by [[Google]] is an example of a modern systems architecture designed for data-intensive computing.<ref>[http://labs.google.com/papers/mapreduce-osdi04.pdf MapReduce: Simplified Data Processing on Large Clusters] by J. Dean, and S. Ghemawat. Proceedings of the Sixth Symposium on Operating System Design and Implementation (OSDI), 2004.</ref> The MapReduce architecture allows programmers to use a functional programming style to create a map function that processes a [[key-value pair]] associated with the input data to generate a set of intermediate [[key-value pair]]s, and a reduce function that merges all intermediate values associated with the same intermediate key. Since the system automatically takes care of details like partitioning the input data, scheduling and executing tasks across a processing cluster, and managing the communications between nodes, programmers with no experience in parallel programming can easily use a large distributed processing environment.
Line 46 ⟶ 45:
===Hadoop===
[[Apache Hadoop]] is an open source software project sponsored by The [[Apache Software Foundation]] which implements the MapReduce architecture. Hadoop now encompasses multiple subprojects in addition to the base core, MapReduce, and HDFS distributed filesystem. These additional subprojects provide enhanced application processing capabilities to the base Hadoop implementation and currently include Avro, [[Pig_(programming_language)|Pig]], [[HBase]], ZooKeeper, [[Apache Hive|Hive]], and Chukwa. The Hadoop MapReduce architecture is functionally similar to the Google implementation except that the base programming language for Hadoop is [[Java (programming language)|Java]] instead of [[C++]]. The implementation is intended to execute on clusters of commodity processors.▼
▲[[Hadoop]] is an open source software project sponsored by The [[Apache Software Foundation]] which implements the MapReduce architecture. Hadoop now encompasses multiple subprojects in addition to the base core, MapReduce, and HDFS distributed filesystem. These additional subprojects provide enhanced application processing capabilities to the base Hadoop implementation and currently include Avro, [[Pig_(programming_language)|Pig]], [[HBase]], ZooKeeper, [[Apache Hive|Hive]], and Chukwa. The Hadoop MapReduce architecture is functionally similar to the Google implementation except that the base programming language for Hadoop is [[Java (programming language)|Java]] instead of [[C++]]. The implementation is intended to execute on clusters of commodity processors.
Hadoop implements a distributed data processing scheduling and execution environment and framework for MapReduce jobs. Hadoop includes a distributed file system called HDFS which is analogous to [[Google File System|GFS]] in the Google MapReduce implementation. The Hadoop execution environment supports additional distributed data processing capabilities which are designed to run using the Hadoop MapReduce architecture. These include [[HBase]], a distributed column-oriented database which provides random access read/write capabilities; Hive which is a data warehouse system built on top of Hadoop that provides [[SQL]]-like query capabilities for data summarization, ad hoc queries, and analysis of large datasets; and Pig – a high-level data-flow programming language and execution framework for data-intensive computing.
Line 54 ⟶ 52:
===HPCC===
[[HPCC]] (High-Performance Computing Cluster) was developed and implemented by [[LexisNexis
▲[[HPCC]] (High-Performance Computing Cluster) was developed and implemented by [[LexisNexis|LexisNexis Risk Solutions]]. The development of this computing platform began in 1999 and applications were in production by late 2000. The [[LexisNexis]] approach also utilizes commodity clusters of hardware running the [[Linux]] operating system. Custom system software and middleware components were developed and layered on the base Linux operating system to provide the execution environment and distributed filesystem support required for data-intensive computing. LexisNexis also implemented a new high-level language for data-intensive computing called ECL.
The [[ECL
To address both batch and online aspects data-intensive computing applications,
== See also ==
|