Revision as of 14:05, 8 February 2023 edit Eddie891 (talk \| contribs) Autopatrolled, Administrators 61,397 edits m Removing link(s) Wikipedia:Articles for deletion/List of important publications in concurrent, parallel, and distributed computing (2nd nomination) closed as delete (XFDcloser) ← Previous edit		Revision as of 13:53, 6 December 2023 edit undo 143.205.220.37 (talk) No edit summary Next edit →
Line 1: {{Short description\|Class of parallel computing applications}} '''Data-intensive computing''' is a class of [[parallel computing]] applications which use a [[data parallel]] approach to process large volumes of data typically [[terabytes]] or [[petabytes]] in size and typically referred to as [[big data]]. Computing applications ~~which~~that devote most of their execution time to computational requirements are deemed compute-intensive, whereas ~~computing~~ applications ~~which~~are deemed data-intensive require large volumes of data and devote most of their processing time to I/O and manipulation of data ~~are deemed data-intensive~~.<ref>[http://www.cse.fau.edu/~borko/HandbookofCloudComputing.html Handbook of Cloud Computing], "Data-Intensive Technologies for Cloud Computing," by A.M. Middleton. Handbook of Cloud Computing. Springer, 2010.</ref> == Introduction == The rapid growth of the [[Internet]] and [[World Wide Web]] led to vast amounts of information available online. In addition, business and government organizations create large amounts of both structured and [[unstructured information]], which ~~needs~~need to be processed, analyzed, and linked. [[Vinton Cerf]] described this as an “information avalanche” and stated, “we must harness the Internet’s energy before the information it has unleashed buries us”.<ref>[http://research.google.com/pubs/author32412.html An Information Avalanche], by Vinton Cerf, IEEE Computer, Vol. 40, No. 1, 2007, pp. 104-105.</ref> An [[International Data Corporation\|IDC]] white paper sponsored by [[EMC Corporation]] estimated the amount of information currently stored in a digital form in 2007 at 281 exabytes and the overall compound growth rate at 57% with information in organizations growing at even a faster rate.<ref>[http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf The Expanding Digital Universe] {{webarchive \|url=https://web.archive.org/web/20130627193204/http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf \|date=June 27, 2013 }}, by J.F. Gantz, D. Reinsel, C. Chute, W. Schlichting, J. McArthur, S. Minton, J. Xheneti, A. Toncheva, and A. Manfrediz, [[International Data Corporation\|IDC]], White Paper, 2007.</ref> In a 2003 study of the so-called information explosion it was estimated that 95% of all current information exists in unstructured form with increased data processing requirements compared to structured information.<ref>[http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/ How Much Information? 2003], by P. Lyman, and H.R. Varian, University of California at Berkeley, Research Report, 2003.</ref> The storing, managing, accessing, and processing of this vast amount of data represents a fundamental need and an immense challenge in order to satisfy needs to search, analyze, mine, and visualize this data as information.<ref>[http://www.sdsc.edu/about/director/pubs/communications200812-DataDeluge.pdf Got Data? A Guide to Data Preservation in the Information Age] {{Webarchive\|url=https://web.archive.org/web/20110718061155/http://www.sdsc.edu/about/director/pubs/communications200812-DataDeluge.pdf \|date=2011-07-18 }}, by F. Berman, Communications of the ACM, Vol. 51, No. 12, 2008, pp. 50-56.</ref> Data-intensive computing is intended to address this need. [[Parallel computing\|Parallel processing]] approaches can be generally classified as either ''compute-intensive'', or ''data-intensive''.<ref>[http://portal.acm.org/citation.cfm?id=280278 Models and languages for parallel computation], by D.B. Skillicorn, and D. Talia, ACM Computing Surveys, Vol. 30, No. 2, 1998, pp. 123-169.</ref><ref>[http://www.pnl.gov/science/images/highlights/computing/dic_special.pdfData-Intensive Computing in the 21st Century]{{Dead link\|date=July 2019 \|bot=InternetArchiveBot \|fix-attempted=yes }}, by I. Gorton, P. Greenfield, A. Szalay, and R. Williams, IEEE Computer, Vol. 41, No. 4, 2008, pp. 30-32.</ref><ref>[http://www.computer.org/portal/web/csdl/doi/10.1109/MC.2008.122 High-Speed, Wide Area, Data Intensive Computing: A Ten Year Retrospective], by W.E. Johnston, IEEE Computer Society, 1998.</ref> Compute-intensive is used to describe application programs that are compute -bound. Such applications devote most of their execution time to computational requirements as opposed to I/O, and typically require small volumes of data. Parallel processing of compute-intensive applications typically involves parallelizing individual algorithms within an application process, and decomposing the overall application process into separate tasks, which can then be executed in parallel on an appropriate computing platform to achieve overall higher performance than serial processing. In compute-intensive applications, multiple operations are performed simultaneously, with each operation addressing a particular part of the problem. This is often referred to as [[task parallelism]]. Data-intensive is used to describe applications that are I/O bound or with a need to process large volumes of data.<ref>[https://computation.llnl.gov/casc/dcca-pub/dcca/Papers_files/data-intensive-ieee-computer-0408.pdf IEEE: Hardware Technologies for High-Performance Data-Intensive Computing], by M. Gokhale, J. Cohen, A. Yoo, and W.M. Miller, IEEE Computer, Vol. 41, No. 4, 2008, pp. 60-68.</ref> Such applications devote most of their processing time to I/O and movement and manipulation of data. [[Parallel computing\|Parallel processing]] of data-intensive applications typically involves partitioning or subdividing the data into multiple segments which can be processed independently using the same executable application program in parallel on an appropriate computing platform, then reassembling the results to produce the completed output data.<ref>[http://www.agoldberg.org/Publications/DesignMethForDP.pdf IEEE: A Design Methodology for Data-Parallel Applications] {{Webarchive\|url=https://web.archive.org/web/20110724225852/http://www.agoldberg.org/Publications/DesignMethForDP.pdf \|date=2011-07-24 }}, by L.S. Nyland, J.F. Prins, A. Goldberg, and P.H. Mills, IEEE Transactions on Software Engineering, Vol. 26, No. 4, 2000, pp. 293-314.</ref> The greater the aggregate distribution of the data, the more benefit there is in parallel processing of the data. Data-intensive processing requirements normally scale linearly according to the size of the data and are very amenable to straightforward parallelization. The fundamental challenges for data-intensive computing are managing and processing exponentially growing data volumes, significantly reducing associated data analysis cycles to support practical, timely applications, and developing new algorithms which can scale to search and process massive amounts of data. Researchers coined the term BORPS for "billions of records per second" to measure record processing speed in a way analogous to how the term [[Million instructions per second\|MIPS]] applies to describe computers' processing speed.<ref>[http://www.cse.fau.edu/~borko/HandbookofCloudComputing.html/ Handbook of Cloud Computing] {{Webarchive\|url=https://web.archive.org/web/20101125065304/http://www.cse.fau.edu/~borko/HandbookofCloudComputing.html \|date=2010-11-25 }}, "Data-Intensive Technologies for Cloud Computing," by A.M. Middleton. Handbook of Cloud Computing. Springer, 2010, pp. 83-86.</ref> Line 33: == System architectures == A variety of [[system]] architectures have been implemented for data-intensive computing and large-scale data analysis applications including parallel and distributed [[relational database management systems]] which have been available to run on shared nothing clusters of processing nodes for more than two decades.<ref>[http://www.cse.nd.edu/~dthain/courses/cse40771/spring2010/benchmarks-sigmod09.pdf A Comparison of Approaches to Large-Scale Data Analysis] by A. Pavlo, E. Paulson, A. Rasin, D.J. Abadi, D.J. Dewitt, S. Madden, and M. Stonebraker. Proceedings of the 35th SIGMOD International conference on Management of Data, 2009.</ref> However, most data growth is with data in unstructured form and new processing paradigms with more flexible data models were needed. Several solutions have emerged including the [[MapReduce]] architecture pioneered by Google and now available in an open-source implementation called [[Hadoop]] used by [[Yahoo]], [[Facebook]], and others. [[LexisNexis\|LexisNexis Risk Solutions]] also developed and implemented a scalable platform for data-intensive computing which is used by [[LexisNexis]]. ===MapReduce=== The [[MapReduce]] architecture and programming model pioneered by [[Google]] is an example of a modern systems architecture designed for data-intensive computing.<ref>[http://labs.google.com/papers/mapreduce-osdi04.pdf MapReduce: Simplified Data Processing on Large Clusters] {{Webarchive\|url=https://web.archive.org/web/20091223010101/http://labs.google.com/papers/mapreduce-osdi04.pdf \|date=2009-12-23 }} by J. Dean, and S. Ghemawat. Proceedings of the Sixth Symposium on Operating System Design and Implementation (OSDI), 2004.</ref> The MapReduce architecture allows programmers to use a functional programming style to create a map function that processes a [[attribute–value pair\|key–value pair]] associated with the input data to generate a set of intermediate [[attribute–value pair\|key–value pairs]], and a reduce function that merges all intermediate values associated with the same intermediate key. Since the system automatically takes care of details like partitioning the input data, scheduling and executing tasks across a processing cluster, and managing the communications between nodes, programmers with no experience in parallel programming can easily use a large distributed processing environment. The programming model for [[MapReduce]] architecture is a simple abstraction where the computation takes a set of input key–value pairs associated with the input data and produces a set of output key–value pairs. In the Map phase, the input data is partitioned into input splits and assigned to Map tasks associated with processing nodes in the cluster. The Map task typically executes on the same node containing its assigned partition of data in the cluster. These Map tasks perform user-specified computations on each input key–value pair from the partition of input data assigned to the task, and generates a set of intermediate results for each key. The shuffle and sort phase then takes the intermediate data generated by each Map task, sorts this data with intermediate data from other nodes, divides this data into regions to be processed by the reduce tasks, and distributes this data as needed to nodes where the Reduce tasks will execute. The Reduce tasks perform additional user-specified operations on the intermediate data possibly merging values associated with a key to a smaller set of values to produce the output data. For more complex data processing procedures, multiple MapReduce calls may be linked together in sequence. Line 43: [[Apache Hadoop]] is an open source software project sponsored by The [[Apache Software Foundation]] which implements the MapReduce architecture. Hadoop now encompasses multiple subprojects in addition to the base core, MapReduce, and HDFS distributed filesystem. These additional subprojects provide enhanced application processing capabilities to the base Hadoop implementation and currently include Avro, [[Pig_(programming_language)\|Pig]], [[HBase]], [[Apache ZooKeeper\|ZooKeeper]], [[Apache Hive\|Hive]], and Chukwa. The Hadoop MapReduce architecture is functionally similar to the Google implementation except that the base programming language for Hadoop is [[Java (programming language)\|Java]] instead of [[C++]]. The implementation is intended to execute on clusters of commodity processors. Hadoop implements a distributed data processing scheduling and execution environment and framework for MapReduce jobs. Hadoop includes a distributed file system called HDFS which is analogous to [[Google File System\|GFS]] in the Google MapReduce implementation. The Hadoop execution environment supports additional distributed data processing capabilities which are designed to run using the Hadoop MapReduce architecture. These include [[HBase]], a distributed column-oriented database which provides random access read/write capabilities; Hive, which is a [[data warehouse]] system built on top of Hadoop that provides [[SQL]]-like query capabilities for data summarization, ad hoc queries, and analysis of large datasets; and Pig – a high-level data-flow programming language and execution framework for data-intensive computing. [[Pig_(programming_language)\|Pig]] was developed at Yahoo! to provide a specific language notation for data analysis applications and to improve programmer productivity and reduce development cycles when using the Hadoop MapReduce environment. Pig programs are automatically translated into sequences of MapReduce programs if needed in the execution environment. Pig provides capabilities in the language for loading, storing, filtering, grouping, de-duplication, ordering, sorting, aggregation, and joining operations on the data.<ref>[http://i.stanford.edu/~usriv/talks/sigmod08-pig-latin.ppt#283,18,User-Code as a First-Class Citizen Pig Latin: A Not-So-Foreign Language for Data Processing] {{Webarchive\|url=https://web.archive.org/web/20110720045445/http://i.stanford.edu/~usriv/talks/sigmod08-pig-latin.ppt#283,18,User-Code \|date=2011-07-20 }} by C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. (Presentation at SIGMOD 2008)," 2008</ref> Line 52: The [[ECL (data-centric programming language)\|ECL programming language]] is a high-level, declarative, data-centric, [[Implicit parallelism\|implicitly parallel]] language that allows the programmer to define what the data processing result should be and the dataflows and transformations that are necessary to achieve the result. The ECL language includes extensive capabilities for data definition, filtering, data management, and data transformation, and provides an extensive set of built-in functions to operate on records in datasets which can include user-defined transformation functions. ECL programs are compiled into optimized [[C++]] source code, which is subsequently compiled into executable code and distributed to the nodes of a processing cluster. To address both batch and online aspects data-intensive computing applications, HPCC includes two distinct cluster environments, each of which can be optimized independently for its parallel data processing purpose. The Thor platform is a cluster whose purpose is to be a data refinery for processing of massive volumes of raw data for applications such as [[data cleansing]] and hygiene, [[extract, transform, load]] (ETL), [[record linking]] and entity resolution, large-scale ad hoc analysis of data, and creation of ~~keyed~~key data and indexes to support high-performance structured queries and data warehouse applications. A Thor system is similar to the Hadoop MapReduce platform in its hardware configuration, function, execution environment, filesystem, and capabilities ~~to the Hadoop MapReduce platform,~~ but provides higher performance in equivalent configurations. The Roxie platform provides an online high-performance structured query and analysis system or data warehouse delivering the parallel data access processing requirements of online applications through Web services interfaces supporting thousands of simultaneous queries and users with sub-second response times. A Roxie system is similar in its function and capabilities to [[Hadoop]] with [[HBase]] and [[Apache Hive\|Hive]] capabilities added, but provides an optimized execution environment and filesystem for high-performance online processing. Both Thor and Roxie systems utilize the same ECL programming language for implementing applications, increasing programmer productivity. == See also ==

Data-intensive computing: Difference between revisions