Content deleted Content added
Citation bot (talk | contribs) Add: bibcode, authors 1-1. Removed URL that duplicated identifier. Removed parameters. Some additions/deletions were parameter name changes. | Use this bot. Report bugs. | Suggested by Headbomb | Linked from Wikipedia:WikiProject_Academic_Journals/Journals_cited_by_Wikipedia/Sandbox | #UCB_webform_linked 595/1032 |
|||
(76 intermediate revisions by 47 users not shown) | |||
Line 1:
{{Short description|Class of parallel computing applications}}
'''Data-
▲'''Data-Intensive Computing''' is a class of [[parallel computing]] applications which use a [[data-parallel]] approach to processing large volumes of data typically [[terabytes]] or [[petabytes]] in size and typically referred to as [[Big Data]]. Computing applications which devote most of their execution time to computational requirements are deemed compute-intensive and typically require small volumes of data, whereas computing applications which require large volumes of data and devote most of their processing time to I/O and manipulation of data are deemed data-intensive <ref>[http://www.cse.fau.edu/~borko/HandbookofCloudComputing.html/ Handbook of Cloud Computing], "Data-Intensive Technologies for Cloud Computing," by A.M. Middleton. Handbook of Cloud Computing. Springer, 2010.</ref>.
== Introduction ==
The rapid growth of the [[Internet]] and [[World Wide Web]]
[[Parallel computing|Parallel processing]] approaches can be generally classified as either
Data-intensive is used to describe applications that are I/O bound or with a need to process large volumes of data
== Data-
Computer system architectures which can support [[data
The US [[National Science Foundation]] (NSF) funded a research program from 2009 through 2010.<ref>{{Cite web |title= Data-intensive Computing |work= Program description |year= 2009 |publisher= NSF |url= https://www.nsf.gov/funding/pgm_summ.jsp?pims_id=503324&org=IIS |accessdate=24 April 2017 }}</ref> Areas of focus were:
<ul><li>Approaches to [[parallel programming]] to address the [[parallel processing]] of data on data-intensive systems▼
<li>Programming abstractions including models, languages, and [[algorithms]] which allow a natural expression of parallel processing of data▼
<li>Design of data-intensive computing platforms to provide high levels of reliability, efficiency, availability, and scalability.▼
<li>Identifying applications that can exploit this computing paradigm and determining how it should evolve to support emerging data-intensive applications</ul>▼
▲
Pacific Northwest National Labs has defined data-intensive computing as “capturing, managing, analyzing, and understanding data at volumes and rates that push the frontiers of current technologies.” <ref>[http://www.cs.cmu.edu/~bryant/presentations/DISC-concept.ppt Data Intensive Computing] by PNNL. "Data Intensive Computing," 2008</ref>. <ref>[http://www.computer.org/portal/web/csdl/doi/10.1109/MC.2009.26 The Changing Paradigm of Data-Intensive Computing] by R.T. Kouzes, G.A. Anderson, S.T. Elbert, I. Gorton, and D.K. Gracio, "The Changing Paradigm of Data-Intensive Computing," Computer, Vol. 42, No. 1, 2009, pp. 26-3</ref>. They believe that to address the rapidly growing data volumes and complexity requires “epochal advances in software, hardware, and algorithm development” which can scale readily with size of the data and provide effective and timely analysis and processing results.▼
▲
▲
▲
▲[[Pacific Northwest National Labs
Current data-intensive computing platforms typically use a [[parallel computing]] approach combining multiple processors and disks in large commodity [[computing clusters]] connected using high-speed communications switches and networks which allows the data to be partitioned among the available computing resources and processed independently to achieve performance and scalability based on the amount of data. A cluster can be defined as a type of parallel and [[distributed system]], which consists of a collection of inter-connected stand-alone computers working together as a single integrated computing resource <ref>[http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6V06-4V47C7R-1&_user=10&_coverDate=06%2F30%2F2009&_rdoc=1&_fmt=high&_orig=gateway&_origin=gateway&_sort=d&_docanchor=&view=c&_rerunOrigin=google&_acct=C000050221&_version=1&_urlVersion=0&_userid=10&md5=824e4c2635a53c6fe068f3f2d11df096&searchtype=a Cloud computing and emerging IT platforms] by R. Buyya, C.S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, "Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility," Future Generation Computer Systems, Vol. 25, No. 6, 2009, pp. 599-616</ref>. This approach to parallel processing is often referred to as a “shared nothing” approach since each node consisting of processor, local memory, and disk resources shares nothing with other nodes in the cluster. In [[parallel computing]] this approach is considered suitable for data-intensive computing and problems which are “embarrassingly parallel” , i.e. where it is relatively easy to separate the problem into a number of parallel tasks and there is no dependency or communication required between the tasks other than overall management of the tasks. These types of data processing problems are inherently adaptable to various forms of [[distributed computing]] including clusters and data grids and [[cloud computing]].▼
==
▲
There are several important common characteristics of data-intensive computing systems that distinguish them from other forms of computing:▼
== Characteristics ==
(1) The principle of collocation of the data and programs or algorithms is used to perform the computation. To achieve high performance in data-intensive computing, it is important to minimize the movement of data <ref>[http://queue.acm.org/detail.cfm?id=1394131 Distributed Computing Economics] by J. Gray, "Distributed Computing Economics," ACM Queue, Vol. 6, No. 3, 2008, pp. 63-68.</ref>. This characteristic allows processing algorithms to execute on the nodes where the data resides reducing system overhead and increasing performance [7]. Newer technologies such as [[InfiniBand]] allow data to be stored in a separate repository and provide performance comparable to collocated data.▼
▲
▲
▲(2) The programming model utilized. Data-intensive computing systems utilize a machine-independent approach in which applications are expressed in terms of high-level operations on data, and the runtime system transparently controls the scheduling, execution, load balancing, communications, and movement of programs and data across the distributed computing cluster <ref>http://www.cs.cmu.edu/~bryant/presentations/DISC-concept.ppt Data Intensive Scalable Computing] by R.E. Bryant. "Data Intensive Scalable Computing," 2008</ref>. The programming abstraction and language tools allow the processing to be expressed in terms of data flows and transformations incorporating new dataflow [[programming languages]] and shared libraries of common data manipulation algorithms such as sorting.
A variety of [[system]] architectures have been implemented for data-intensive computing and large-scale data analysis applications including parallel and distributed [[relational database management systems]] which have been available to run on shared nothing clusters of processing nodes for more than two decades
▲(3) A focus on reliability and availability. Large-scale systems with hundreds or thousands of processing nodes are inherently more susceptible to hardware failures, communications errors, and software bugs. Data-intensive computing systems are designed to be fault resilient. This typically includes redundant copies of all data files on disk, storage of intermediate processing results on disk, automatic detection of node or processing failures, and selective re-computation of results.
▲(4) The inherent scalability of the underlying hardware and [[software architecture]]. Data-intensive computing systems can typically be scaled in a linear fashion to accommodate virtually any amount of data, or to meet time-critical performance requirements by simply adding additional processing nodes. The number of nodes and processing tasks assigned for a specific application can be variable or fixed depending on the hardware, software, communications, and [[distributed file system architecture]].
▲== System Architectures ==
▲A variety of [[system]] architectures have been implemented for data-intensive computing and large-scale data analysis applications including parallel and distributed [[relational database management systems]] which have been available to run on shared nothing clusters of processing nodes for more than two decades <ref>[http://www.cse.nd.edu/~dthain/courses/cse40771/spring2010/benchmarks-sigmod09.pdf A Comparison of Approaches to Large-Scale Data Analysis] by A. Pavlo, E. Paulson, A. Rasin, D.J. Abadi, D.J. Dewitt, S. Madden, and M. Stonebraker. Proceedings of the 35th SIGMOD International conference on Management of Data, 2009.</ref>.
▲. However most data growth is with data in unstructured form and new processing paradigms with more flexible data models were needed. Several solutions have emerged including the [[MapReduce]] architecture pioneered by Google and now available in an open-source implementation called [[Hadoop]] used by [[Yahoo]], [[Facebook]], and others. [LexisNexis Risk Solutions]] also developed and implemented a scalable platform for data-intensive computing which is used by [[LexisNexis]].
===MapReduce===
The [[MapReduce]] architecture and programming model pioneered by [[Google]] is an example of a modern [[systems architecture]] designed for data-intensive computing
The programming model for [[MapReduce]] architecture is a simple abstraction where the computation takes a set of input
▲The [[MapReduce]] architecture and programming model pioneered by [[Google]] is an example of a modern systems architecture designed for data-intensive computing <ref>[http://labs.google.com/papers/mapreduce-osdi04.pdf MapReduce: Simplified Data Processing on Large Clusters] by J. Dean, and S. Ghemawat. Proceedings of the Sixth Symposium on Operating System Design and Implementation (OSDI), 2004.</ref>. The MapReduce architecture allows programmers to use a functional programming style to create a map function that processes a [[key-value pair]] associated with the input data to generate a set of intermediate [[key-value pairs]], and a reduce function that merges all intermediate values associated with the same intermediate key. Since the system automatically takes care of details like partitioning the input data, scheduling and executing tasks across a processing cluster, and managing the communications between nodes, programmers with no experience in parallel programming can easily use a large distributed processing environment.
▲The programming model for [[MapReduce]] architecture is a simple abstraction where the computation takes a set of input key-value pairs associated with the input data and produces a set of output key-value pairs. In the Map phase, the input data is partitioned into input splits and assigned to Map tasks associated with processing nodes in the cluster. The Map task typically executes on the same node containing its assigned partition of data in the cluster. These Map tasks perform user-specified computations on each input key-value pair from the partition of input data assigned to the task, and generates a set of intermediate results for each key. The shuffle and sort phase then takes the intermediate data generated by each Map task, sorts this data with intermediate data from other nodes, divides this data into regions to be processed by the reduce tasks, and distributes this data as needed to nodes where the Reduce tasks will execute. The Reduce tasks perform additional user-specified operations on the intermediate data possibly merging values associated with a key to a smaller set of values to produce the output data. For more complex data processing procedures, multiple MapReduce calls may be linked together in sequence.
===Hadoop===
[[Apache Hadoop]] is an open source software project sponsored by The [[Apache Software Foundation]] which implements the MapReduce architecture. Hadoop now encompasses multiple subprojects in addition to the base core, MapReduce, and HDFS distributed filesystem. These additional subprojects provide enhanced application processing capabilities to the base Hadoop implementation and currently include Avro, [[Pig_(programming_language)|Pig]], [[HBase]], [[Apache ZooKeeper|ZooKeeper]], [[Apache Hive|Hive]], and Chukwa. The Hadoop MapReduce architecture is functionally similar to the Google implementation except that the base programming language for Hadoop is [[Java (programming language)|Java]] instead of [[C++]]. The implementation is intended to execute on clusters of commodity processors.
Hadoop implements a distributed data processing scheduling and execution environment and framework for MapReduce jobs. Hadoop includes a distributed file system called HDFS which is analogous to [[Google File System|GFS]] in the Google MapReduce implementation. The Hadoop execution environment supports additional distributed data processing capabilities which are designed to run using the Hadoop MapReduce architecture. These include [[HBase]], a distributed column-oriented database which provides random access read/write capabilities;
[[Pig_(programming_language)|Pig]] was developed at Yahoo! to provide a specific language notation for data analysis applications and to improve programmer productivity and reduce development cycles when using the Hadoop MapReduce environment. Pig programs are automatically translated into sequences of MapReduce programs if needed in the execution environment. Pig provides capabilities in the language for loading, storing, filtering, grouping, de-duplication, ordering, sorting, aggregation, and joining operations on the data
===HPCC===
[[
The [[ECL
To address both batch and online aspects data-intensive computing applications,
== See
* [[Distributed computing]]▼
* [[Implicit parallelism]]
* [[Massively parallel]]
* [[Supercomputer]]
* [[
== References ==
<!--- See [[Wikipedia:Footnotes]] on how to create references using <ref></ref> tags which will then appear here automatically -->
{{Reflist|2}}▼
▲{{Reflist}}
|