Data-intensive computing: Difference between revisions

== Introduction ==
The rapid growth of the [[Internet]] and [[World Wide Web]] has led to vast amounts of information available online. In addition, business and government organizations create large amounts of both structured and unstructured information which needs to be processed, analyzed, and linked. Vinton Cerf of [[Google]] has described this as an “Information Avalanche” and has stated that “we must harness the Internet’s energy before the information it has unleashed buries us.”<ref>[http://research.google.com/pubs/author32412.html An Information Avalanche], by Vinton Cerf, IEEE Computer, Vol. 40, No. 1, 2007, pp. 104-105.</ref> An [[IDC]] white paper sponsored by [[EMC]] estimated the amount of information stored in digital form in 2007 at 281 exabytes, with an overall compound growth rate of 57% and information in organizations growing at an even faster rate.<ref>[http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf The Expanding Digital Universe], by J.F. Gantz, D. Reinsel, C. Chute, W. Schlichting, J. McArthur, S. Minton, J. Xheneti, A. Toncheva, and A. Manfrediz, [[IDC]], White Paper, 2007.</ref> Another study of the so-called information explosion estimated that 95% of all current information exists in unstructured form, with increased data processing requirements compared to structured information.<ref>[http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/ How Much Information? 2003], by P. Lyman, and H.R. Varian, University of California at Berkeley, Research Report, 2003.</ref> Storing, managing, accessing, and processing this vast amount of data represents a fundamental need and an immense challenge in order to satisfy needs to search, analyze, mine, and visualize this data as information.<ref>[http://www.sdsc.edu/about/director/pubs/communications200812-DataDeluge.pdf Got Data? A Guide to Data Preservation in the Information Age], by F. Berman, Communications of the ACM, Vol. 51, No. 12, 2008, pp. 50-56.</ref> Data-intensive computing is intended to address this need.
 
[[Parallel processing]] approaches can be generally classified as either ''compute-intensive'' or ''data-intensive''.<ref>[http://portal.acm.org/citation.cfm?id=280278 Models and languages for parallel computation], by D.B. Skillicorn, and D. Talia, ACM Computing Surveys, Vol. 30, No. 2, 1998, pp. 123-169.</ref><ref>[http://www.pnl.gov/science/images/highlights/computing/dic_special.pdf Data-Intensive Computing in the 21st Century], by I. Gorton, P. Greenfield, A. Szalay, and R. Williams, IEEE Computer, Vol. 41, No. 4, 2008, pp. 30-32.</ref><ref>[http://www.computer.org/portal/web/csdl/doi/10.1109/MC.2008.122 High-Speed, Wide Area, Data Intensive Computing: A Ten Year Retrospective], by W.E. Johnston, IEEE Computer Society, 1998.</ref> Compute-intensive is used to describe application programs that are compute-bound. Such applications devote most of their execution time to computational requirements as opposed to I/O, and typically require small volumes of data. [[Parallel processing]] of compute-intensive applications typically involves parallelizing individual algorithms within an application process and decomposing the overall application process into separate tasks, which can then be executed in parallel on an appropriate computing platform to achieve overall higher performance than serial processing. In compute-intensive applications, multiple operations are performed simultaneously, with each operation addressing a particular part of the problem. This is often referred to as task [[parallelism]].
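As an illustration only (a minimal sketch, not drawn from the cited sources), the following Python fragment shows task parallelism: two independent, compute-bound subtasks of a hypothetical problem execute simultaneously on separate processor cores. The function names and workloads are invented stand-ins.
<syntaxhighlight lang="python">
# Minimal sketch of task parallelism: two different compute-bound
# operations run concurrently, each addressing a different part of
# the overall problem. The subtasks are hypothetical stand-ins.
from concurrent.futures import ProcessPoolExecutor

def transform_step(n):
    # One compute-bound subtask of the overall problem.
    return sum(i * i for i in range(n))

def analyze_step(n):
    # A second, independent subtask executed at the same time.
    return sum(i * i * i for i in range(n))

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        f1 = pool.submit(transform_step, 1_000_000)
        f2 = pool.submit(analyze_step, 1_000_000)
        print(f1.result(), f2.result())
</syntaxhighlight>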
 
Data-intensive is used to describe applications that are I/O-bound or that need to process large volumes of data.<ref>[https://computation.llnl.gov/casc/dcca-pub/dcca/Papers_files/data-intensive-ieee-computer-0408.pdf IEEE: Hardware Technologies for High-Performance Data-Intensive Computing], by M. Gokhale, J. Cohen, A. Yoo, and W.M. Miller, IEEE Computer, Vol. 41, No. 4, 2008, pp. 60-68.</ref> Such applications devote most of their processing time to I/O and to the movement and manipulation of data. [[Parallel processing]] of data-intensive applications typically involves partitioning or subdividing the data into multiple segments which can be processed independently, using the same executable application program in parallel on an appropriate computing platform, and then reassembling the results to produce the completed output data.<ref>[http://www.agoldberg.org/Publications/DesignMethForDP.pdf IEEE: A Design Methodology for Data-Parallel Applications], by L.S. Nyland, J.F. Prins, A. Goldberg, and P.H. Mills, IEEE Transactions on Software Engineering, Vol. 26, No. 4, 2000, pp. 293-314.</ref> The greater the aggregate distribution of the data, the more benefit there is in parallel processing of the data. Data-intensive processing requirements normally scale linearly according to the size of the data and are very amenable to straightforward parallelization. The fundamental challenges for data-intensive computing are managing and processing exponentially growing data volumes, significantly reducing associated data analysis cycles to support practical, timely applications, and developing new algorithms which can scale to search and process massive amounts of data.
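By contrast, here is a minimal Python sketch (again not from the cited sources; data and function names are illustrative) of the data-parallel pattern just described: partition the data, process every segment with the same program in parallel, and reassemble the results.
<syntaxhighlight lang="python">
# Minimal sketch of data parallelism: the input is partitioned, every
# partition is processed by the same function in parallel, and the
# partial results are reassembled into the completed output.
from concurrent.futures import ProcessPoolExecutor

def process_partition(records):
    # The same executable logic is applied to each data segment.
    return [r.upper() for r in records]

def partition(data, n):
    # Subdivide the data into n roughly equal segments.
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

if __name__ == "__main__":
    data = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta"]
    with ProcessPoolExecutor() as pool:
        partial_results = pool.map(process_partition, partition(data, 3))
    output = [r for part in partial_results for r in part]  # reassemble
    print(output)
</syntaxhighlight>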
* Identifying applications that can exploit this computing paradigm and determining how it should evolve to support emerging data-intensive applications
 
Pacific Northwest National Laboratory has defined data-intensive computing as “capturing, managing, analyzing, and understanding data at volumes and rates that push the frontiers of current technologies.”<ref>[http://www.cs.cmu.edu/~bryant/presentations/DISC-concept.ppt Data Intensive Computing], Pacific Northwest National Laboratory, 2008.</ref><ref>[http://www.computer.org/portal/web/csdl/doi/10.1109/MC.2009.26 The Changing Paradigm of Data-Intensive Computing], by R.T. Kouzes, G.A. Anderson, S.T. Elbert, I. Gorton, and D.K. Gracio, IEEE Computer, Vol. 42, No. 1, 2009, pp. 26-34.</ref> They believe that addressing the rapidly growing data volumes and complexity requires “epochal advances in software, hardware, and algorithm development” which can scale readily with the size of the data and provide effective and timely analysis and processing results.
 
== Processing approach ==
(1) The principle of collocation of the data and the programs or algorithms used to perform the computation. To achieve high performance in data-intensive computing, it is important to minimize the movement of data.<ref>[http://queue.acm.org/detail.cfm?id=1394131 Distributed Computing Economics], by J. Gray, ACM Queue, Vol. 6, No. 3, 2008, pp. 63-68.</ref> This characteristic allows processing algorithms to execute on the nodes where the data resides, reducing system overhead and increasing performance (see the sketch following these characteristics). Newer technologies such as [[InfiniBand]] allow data to be stored in a separate repository and provide performance comparable to collocated data.
 
(2) The programming model utilized. Data-intensive computing systems utilize a machine-independent approach in which applications are expressed in terms of high-level operations on data, and the runtime system transparently controls the scheduling, execution, load balancing, communications, and movement of programs and data across the distributed computing cluster.<ref>[http://www.cs.cmu.edu/~bryant/presentations/DISC-concept.ppt Data Intensive Scalable Computing], by R.E. Bryant, 2008.</ref> The programming abstraction and language tools allow the processing to be expressed in terms of data flows and transformations incorporating new dataflow [[programming languages]] and shared libraries of common data manipulation algorithms such as sorting.
 
(3) A focus on reliability and availability. Large-scale systems with hundreds or thousands of processing nodes are inherently more susceptible to hardware failures, communications errors, and software bugs. Data-intensive computing systems are designed to be fault resilient. This typically includes redundant copies of all data files on disk, storage of intermediate processing results on disk, automatic detection of node or processing failures, and selective re-computation of results.
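The following single-machine Python sketch (an illustration with invented names and data, not taken from the cited sources) ties the three characteristics together: the computation runs against each "node's" local data partition so that only small results move (1), the application is expressed as high-level map and reduce operations on data while a toy runtime handles scheduling and reassembly (2), and a simulated node failure triggers selective re-computation of only the affected partition (3).
<syntaxhighlight lang="python">
# Minimal sketch of the three characteristics above; all names are
# illustrative stand-ins for a real distributed runtime.
import random
from collections import defaultdict

local_partitions = {                      # data already resident on each node
    "node1": ["the quick brown fox", "the lazy dog"],
    "node2": ["the fox", "brown dog"],
}

def map_partition(lines):
    # High-level operation shipped to the data: emit (word, 1) pairs.
    if random.random() < 0.3:
        raise RuntimeError("simulated node failure")
    return [(w, 1) for line in lines for w in line.split()]

def reduce_pairs(pairs):
    # Sum the counts for each word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

def runtime(partitions):
    # Toy "runtime": schedules map tasks where the data lives (1),
    # detects failures and selectively recomputes only the failed
    # partitions (3), then reduces the collected results (2).
    results = {}
    pending = set(partitions)
    while pending:
        for node in list(pending):
            try:
                results[node] = map_partition(partitions[node])
                pending.discard(node)
            except RuntimeError:
                pass  # leave the partition pending; it will be recomputed
    return reduce_pairs([p for r in results.values() for p in r])

print(runtime(local_partitions))
</syntaxhighlight>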
== System architectures ==
A variety of [[system]] architectures have been implemented for data-intensive computing and large-scale data analysis applications, including parallel and distributed [[relational database management systems]], which have been available to run on shared-nothing clusters of processing nodes for more than two decades.<ref>[http://www.cse.nd.edu/~dthain/courses/cse40771/spring2010/benchmarks-sigmod09.pdf A Comparison of Approaches to Large-Scale Data Analysis], by A. Pavlo, E. Paulson, A. Rasin, D.J. Abadi, D.J. Dewitt, S. Madden, and M. Stonebraker, Proceedings of the 35th SIGMOD International Conference on Management of Data, 2009.</ref> However, most data growth is with data in unstructured form, and new processing paradigms with more flexible data models were needed. Several solutions have emerged, including the [[MapReduce]] architecture pioneered by Google and now available in an open-source implementation called [[Hadoop]] used by [[Yahoo]], [[Facebook]], and others. [[LexisNexis Risk Solutions]] also developed and implemented a scalable platform for data-intensive computing which is used by [[LexisNexis]].
 
===MapReduce===
To address both batch and online aspects of data-intensive computing applications, [[HPCC]] includes two distinct cluster environments, each of which can be optimized independently for its parallel data processing purpose. The Thor platform is a cluster whose purpose is to be a data refinery for processing of massive volumes of raw data for applications such as data cleansing and hygiene, [[ETL]] (extract, transform, load), record linking and entity resolution, large-scale ad hoc analysis of data, and creation of keyed data and indexes to support high-performance structured queries and data warehouse applications. A Thor system is similar in its hardware configuration, function, execution environment, filesystem, and capabilities to the Hadoop MapReduce platform, but provides higher performance in equivalent configurations. The Roxie platform provides an online high-performance structured query and analysis system or data warehouse delivering the parallel data access processing requirements of online applications through Web services interfaces, supporting thousands of simultaneous queries and users with sub-second response times. A Roxie system is similar in its function and capabilities to [[Hadoop]] with [[HBase]] and [[Hive]] capabilities added, but provides an optimized execution environment and filesystem for high-performance online processing. Both Thor and Roxie systems utilize the same ECL programming language for implementing applications, increasing programmer productivity.
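As a rough illustration of this two-environment pattern (a Python sketch with invented data and function names, not an actual ECL or HPCC example), a batch refinery pass builds keyed indexes from raw records, and an online layer then answers structured queries against those indexes without rescanning the raw data:
<syntaxhighlight lang="python">
# Minimal sketch of the batch-refinery / online-query split described
# above. The records and field names are illustrative only.
raw_records = [
    {"id": "a1", "name": "Alice", "city": "Atlanta"},
    {"id": "b2", "name": "Bob", "city": "Boca Raton"},
    {"id": "c3", "name": "Carol", "city": "Atlanta"},
]

def batch_build_index(records, key):
    # Batch phase (Thor-like): scan all raw data once, emit a keyed index.
    index = {}
    for rec in records:
        index.setdefault(rec[key], []).append(rec)
    return index

def online_query(index, key_value):
    # Online phase (Roxie-like): keyed lookup, no scan of the raw data.
    return index.get(key_value, [])

city_index = batch_build_index(raw_records, "city")
print(online_query(city_index, "Atlanta"))
</syntaxhighlight>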
 
== See also ==
* [[List of important publications in concurrent, parallel, and distributed computing]]
* [[Parallel computing]]
 
<!--- Categories --->
{{Uncategorized|date=March 2011}}
 
[[Category:Articles created via the Article Wizard]]