Data-intensive computing: Difference between revisions

Line 5:
The rapid growth of the [[Internet]] and [[World Wide Web]] has made vast amounts of information available online. In addition, business and government organizations create large amounts of both structured and [[unstructured information]], which need to be processed, analyzed, and linked. [[Vinton Cerf]] described this as an “information avalanche” and stated, “we must harness the Internet’s energy before the information it has unleashed buries us”.<ref>[http://research.google.com/pubs/author32412.html An Information Avalanche], by Vinton Cerf, IEEE Computer, Vol. 40, No. 1, 2007, pp. 104-105.</ref> An [[International Data Corporation|IDC]] white paper sponsored by [[EMC Corporation]] estimated the amount of information stored in digital form in 2007 at 281 exabytes and the overall compound growth rate at 57%, with information in organizations growing at an even faster rate.<ref>[http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf The Expanding Digital Universe] {{webarchive |url=https://web.archive.org/web/20130627193204/http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf |date=June 27, 2013 }}, by J.F. Gantz, D. Reinsel, C. Chute, W. Schlichting, J. McArthur, S. Minton, J. Xheneti, A. Toncheva, and A. Manfrediz, [[International Data Corporation|IDC]], White Paper, 2007.</ref> A 2003 study of the so-called [[information explosion]] estimated that 95% of all information at that time existed in unstructured form, which carries greater data processing requirements than structured information.<ref>[http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/ How Much Information? 2003], by P. Lyman, and H.R. Varian, University of California at Berkeley, Research Report, 2003.</ref> Storing, managing, accessing, and processing this vast amount of data is both a fundamental need and an immense challenge for anyone who must search, analyze, mine, and visualize the data as information.<ref>[http://www.sdsc.edu/about/director/pubs/communications200812-DataDeluge.pdf Got Data? A Guide to Data Preservation in the Information Age] {{Webarchive|url=https://web.archive.org/web/20110718061155/http://www.sdsc.edu/about/director/pubs/communications200812-DataDeluge.pdf |date=2011-07-18 }}, by F. Berman, Communications of the ACM, Vol. 51, No. 12, 2008, pp. 50-56.</ref> Data-intensive computing is intended to address this need.
 
[[Parallel computing|Parallel processing]] approaches can be generally classified as either ''compute-intensive'' or ''data-intensive''.<ref>[http://portal.acm.org/citation.cfm?id=280278 Models and languages for parallel computation], by D.B. Skillicorn, and D. Talia, ACM Computing Surveys, Vol. 30, No. 2, 1998, pp. 123-169.</ref><ref name=":0">{{Cite journal |last1=Gorton |first1=Ian |last2=Greenfield |first2=Paul |last3=Szalay |first3=Alex |last4=Williams |first4=Roy |date=2008 |title=Data-Intensive Computing in the 21st Century |url=https://ieeexplore.ieee.org/document/4488246 |journal=Computer |volume=41 |issue=4 |pages=30–32 |doi=10.1109/MC.2008.122 |url-access=subscription |bibcode=2008Compr..41d..30G }}</ref><ref>[http://www.computer.org/portal/web/csdl/doi/10.1109/MC.2008.122 High-Speed, Wide Area, Data Intensive Computing: A Ten Year Retrospective], by W.E. Johnston, IEEE Computer Society, 1998.</ref> ''Compute-intensive'' describes application programs that are compute-bound. Such applications devote most of their execution time to computational requirements as opposed to I/O, and typically require small volumes of data. Parallel processing of compute-intensive applications typically involves parallelizing individual algorithms within an application process, and decomposing the overall application process into separate tasks, which can then be executed in parallel on an appropriate [[computing platform]] to achieve higher overall performance than serial processing. In compute-intensive applications, multiple operations are performed simultaneously, with each operation addressing a particular part of the problem. This is often referred to as [[task parallelism]].
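
Task parallelism as described above can be illustrated with a minimal Python sketch. The two task functions and <code>run_tasks</code> are hypothetical names chosen for illustration, and threads merely stand in for the processors of a real computing platform.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(n):
    """Hypothetical compute task: sum of the squares of 0..n-1."""
    return sum(i * i for i in range(n))


def count_primes(n):
    """Hypothetical compute task: count the primes below n by trial division."""
    return sum(
        1
        for k in range(2, n)
        if all(k % d for d in range(2, int(k ** 0.5) + 1))
    )


def run_tasks(n):
    """Run two different tasks at the same time, one per worker."""
    # Each task addresses a different part of the problem. In CPython,
    # threads do not speed up CPU-bound work; a real deployment would use
    # separate processes or machines. Threads keep the sketch simple.
    with ThreadPoolExecutor(max_workers=2) as pool:
        squares = pool.submit(sum_of_squares, n)
        primes = pool.submit(count_primes, n)
        return squares.result(), primes.result()
```

The point of the sketch is that the workers run ''different'' code on a small input, in contrast to the data-intensive case where the same code runs on different pieces of a large input.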
 
''Data-intensive'' describes applications that are I/O-bound or that need to process large volumes of data.<ref>[https://computation.llnl.gov/casc/dcca-pub/dcca/Papers_files/data-intensive-ieee-computer-0408.pdf IEEE: Hardware Technologies for High-Performance Data-Intensive Computing], by M. Gokhale, J. Cohen, A. Yoo, and W.M. Miller, IEEE Computer, Vol. 41, No. 4, 2008, pp. 60-68.</ref> Such applications devote most of their processing time to I/O and to the movement and manipulation of data. [[Parallel computing|Parallel processing]] of data-intensive applications typically involves partitioning or subdividing the data into multiple segments which can be processed independently using the same executable application program in parallel on an appropriate computing platform, then reassembling the results to produce the completed output data.<ref>[http://www.agoldberg.org/Publications/DesignMethForDP.pdf IEEE: A Design Methodology for Data-Parallel Applications] {{Webarchive|url=https://web.archive.org/web/20110724225852/http://www.agoldberg.org/Publications/DesignMethForDP.pdf |date=2011-07-24 }}, by L.S. Nyland, J.F. Prins, A. Goldberg, and P.H. Mills, IEEE Transactions on Software Engineering, Vol. 26, No. 4, 2000, pp. 293-314.</ref> The greater the aggregate distribution of the data, the more benefit there is in parallel processing of the data. Data-intensive processing requirements normally scale linearly according to the size of the data and are very amenable to straightforward parallelization. The fundamental challenges for data-intensive computing are managing and processing exponentially growing data volumes, significantly reducing associated data analysis cycles to support practical, timely applications, and developing new algorithms which can scale to search and process massive amounts of data. 
Researchers coined the term BORPS, for "billions of records per second", to measure record processing speed, analogous to the way the term [[Million instructions per second|MIPS]] describes a computer's processing speed.<ref>[http://www.cse.fau.edu/~borko/HandbookofCloudComputing.html/ Handbook of Cloud Computing] {{Webarchive|url=https://web.archive.org/web/20101125065304/http://www.cse.fau.edu/~borko/HandbookofCloudComputing.html |date=2010-11-25 }}, "Data-Intensive Technologies for Cloud Computing," by A.M. Middleton. Handbook of Cloud Computing. Springer, 2010, pp. 83-86.</ref>
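
The partition, process, and reassemble pattern described above for data-intensive applications can be sketched in Python as follows. All function names are hypothetical, the per-record work (upper-casing a string) is a placeholder, and threads stand in for the nodes of a computing platform.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_segments):
    """Split records into n_segments contiguous segments of near-equal size."""
    size, extra = divmod(len(records), n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        # The first `extra` segments take one additional record each.
        end = start + size + (1 if i < extra else 0)
        segments.append(records[start:end])
        start = end
    return segments


def process_segment(segment):
    """The same program applied to every segment (placeholder work)."""
    return [record.upper() for record in segment]


def process_in_parallel(records, n_segments=4):
    """Partition the data, process each segment independently, reassemble."""
    segments = partition(records, n_segments)
    with ThreadPoolExecutor(max_workers=n_segments) as pool:
        # map() yields results in segment order, so reassembly is a
        # plain concatenation of the per-segment outputs.
        results = list(pool.map(process_segment, segments))
    return [item for part in results for item in part]
```

Because no segment depends on another, adding segments and workers as the data grows is the only change needed, which is the sense in which such workloads are amenable to straightforward parallelization.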
 
== Data-parallelism ==
Computer system architectures which can support [[data parallel]] applications were promoted in the early 2000s for large-scale data processing requirements of data-intensive computing.<ref>[http://www.patrickpantel.com/download/papers/2004/kdd-msw04-1.pdf The terascale challenge] by D. Ravichandran, P. Pantel, and E. Hovy. "The terascale challenge," Proceedings of the KDD Workshop on Mining for and from the Semantic Web, 2004</ref> Data parallelism applies computation independently to each data item of a set of data, which allows the degree of parallelism to be scaled with the volume of data. The most important reason for developing data-parallel applications is the potential for scalable performance, which may amount to an improvement of several orders of magnitude. The key issues with developing applications using data parallelism are the choice of the algorithm, the strategy for data decomposition, [[load balancing (computing)|load balancing]] on processing nodes, [[message passing]] communications between nodes, and the overall accuracy of the results.<ref>[http://www.cs.rochester.edu/u/umit/papers/ppopp01.ps Dynamic adaptation to available resources for parallel computing in an autonomous network of workstations] {{Webarchive|url=https://web.archive.org/web/20110720035435/http://www.cs.rochester.edu/u/umit/papers/ppopp01.ps |date=2011-07-20 }} by U. Rencuzogullari, and [[Sandhya Dwarkadas|S. Dwarkadas]]. "Dynamic adaptation to available resources for parallel computing in an autonomous network of workstations," Proceedings of the Eighth ACM SIGPLAN Symposium on Principles and Practices of Parallel Programming, 2001</ref> The development of a data-parallel application can involve substantial programming complexity to define the problem in the context of available programming tools, and to address limitations of the target architecture. 
[[Information extraction]] from and indexing of Web documents is a typical data-intensive computing task that can derive significant performance benefits from data-parallel implementations, since Web and other types of document collections can typically be processed in parallel.<ref>[http://www.mathcs.emory.edu/~eugene/publications.html Information Extraction to Large Document Collections] {{Webarchive|url=https://web.archive.org/web/20110415003825/http://www.mathcs.emory.edu/~eugene/publications.html |date=2011-04-15 }} by E. Agichtein, "Scaling Information Extraction to Large Document Collections," Microsoft Research, 2004</ref>
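
Two of the issues listed above, data decomposition and load balancing, can be illustrated with a small Python sketch. It uses a greedy "largest item first" heuristic, which is one common strategy chosen here for illustration and not one prescribed by the cited sources. The function name and the document sizes are hypothetical.

```python
import heapq


def balance_load(doc_sizes, n_nodes):
    """Assign documents to nodes so that total size per node is roughly even.

    doc_sizes maps a document name to its size. Documents are taken
    largest first and each goes to the currently least-loaded node.
    """
    # Min-heap of (current load, node id); the lightest node is on top.
    heap = [(0, node) for node in range(n_nodes)]
    heapq.heapify(heap)
    assignment = {node: [] for node in range(n_nodes)}
    # Sort by descending size; ties are broken by name for determinism.
    for doc, size in sorted(doc_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, node = heapq.heappop(heap)
        assignment[node].append(doc)
        heapq.heappush(heap, (load + size, node))
    return assignment
```

The heuristic does not guarantee an optimal split, but it is cheap and keeps any single node from becoming the bottleneck when document sizes vary widely.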
 
The US [[National Science Foundation]] (NSF) funded a research program from 2009 through 2010.<ref>{{Cite web |title= Data-intensive Computing |work= Program description |year= 2009 |publisher= NSF |url= https://www.nsf.gov/funding/pgm_summ.jsp?pims_id=503324&org=IIS |accessdate=24 April 2017 }}</ref> Areas of focus were:
Line 22:
 
== Approach ==
Data-intensive computing platforms typically use a [[parallel computing]] approach that combines multiple processors and disks in large commodity [[Cluster (computing)|computing clusters]] connected by high-speed communications switches and networks. This allows the data to be partitioned among the available computing resources and processed independently to achieve performance and scalability based on the amount of data. A cluster can be defined as a type of parallel and [[distributed system]], which consists of a collection of inter-connected stand-alone computers working together as a single integrated computing resource.<ref>{{Cite journal |last1=Buyya |first1=Rajkumar |last2=Yeo |first2=Chee Shin |last3=Venugopal |first3=Srikumar |last4=Broberg |first4=James |last5=Brandic |first5=Ivona |author-link5=Ivona Brandić |date=2009 |title=Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility |url=http://www.sciencedirect.com/science/article/pii/S0167739X08001957 |journal=Future Generation Computer Systems |volume=25 |issue=6 |pages=599–616 |doi=10.1016/j.future.2008.12.001|url-access=subscription }}</ref> This approach to parallel processing is often referred to as a “shared nothing” approach, since each node, consisting of processor, local memory, and disk resources, shares nothing with other nodes in the cluster. In [[parallel computing]] this approach is considered suitable for data-intensive computing and for problems which are “[[embarrassingly parallel]]”, i.e. where it is relatively easy to separate the problem into a number of parallel tasks and there is no dependency or communication required between the tasks other than overall management of the tasks. These types of data processing problems are inherently adaptable to various forms of [[distributed computing]] including clusters, data grids, and [[cloud computing]].
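
The “shared nothing” arrangement described above can be sketched in Python as follows. The <code>Node</code> class, the <code>owner</code> and <code>distribute</code> functions, and the choice of a CRC32 hash for partitioning are all assumptions made for illustration. Real clusters place partitions on separate machines.

```python
import zlib


class Node:
    """A hypothetical shared-nothing node with its own local store."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.local_records = []  # visible to this node only

    def process(self):
        # Works purely on local data; no communication with other nodes.
        return sum(value for _key, value in self.local_records)


def owner(key, n_nodes):
    """Map a key to the node that owns it, using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % n_nodes


def distribute(records, n_nodes):
    """Partition (key, value) records among n_nodes independent nodes."""
    nodes = [Node(i) for i in range(n_nodes)]
    for key, value in records:
        nodes[owner(key, n_nodes)].local_records.append((key, value))
    return nodes
```

Each node's <code>process</code> call can run without coordination. Only the final combination of per-node results, the "overall management of the tasks", touches more than one node.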
 
== Characteristics ==
Line 36:
 
===MapReduce===
The [[MapReduce]] architecture and programming model pioneered by [[Google]] is an example of a modern [[systems architecture]] designed for data-intensive computing.<ref>[http://labs.google.com/papers/mapreduce-osdi04.pdf MapReduce: Simplified Data Processing on Large Clusters] {{Webarchive|url=https://web.archive.org/web/20091223010101/http://labs.google.com/papers/mapreduce-osdi04.pdf |date=2009-12-23 }} by J. Dean, and S. Ghemawat. Proceedings of the Sixth Symposium on Operating System Design and Implementation (OSDI), 2004.</ref> The MapReduce architecture allows programmers to use a [[functional programming]] style to create a map function that processes a [[attribute–value pair|key–value pair]] associated with the input data to generate a set of intermediate [[attribute–value pair|key–value pairs]], and a reduce function that merges all intermediate values associated with the same intermediate key. Since the system automatically takes care of details like partitioning the input data, scheduling and executing tasks across a processing cluster, and managing the communications between nodes, programmers with no experience in parallel programming can easily use a large distributed processing environment.
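
The map and reduce functions described above can be illustrated with the familiar word-count example, sketched here in Python. The names <code>map_fn</code>, <code>reduce_fn</code>, and <code>run_mapreduce</code> are hypothetical, and the driver is a sequential stand-in for the framework, which in a real system would partition the input and schedule tasks across a cluster.

```python
from collections import defaultdict


def map_fn(key, value):
    """key: document name; value: document text. Emits (word, 1) pairs."""
    for word in value.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """key: a word; values: every count emitted for it. Merges them."""
    return key, sum(values)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Sequential stand-in for the framework: group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))
```

The programmer writes only <code>map_fn</code> and <code>reduce_fn</code>. Neither contains any parallel code, which is why the model is accessible to programmers without experience in parallel programming.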
 
The programming model for the [[MapReduce]] architecture is a simple abstraction where the computation takes a set of input key–value pairs associated with the input data and produces a set of output key–value pairs. In the Map phase, the input data is partitioned into input splits and assigned to Map tasks associated with processing nodes in the cluster. A Map task typically executes on the same node that holds its assigned partition of data. The Map tasks perform user-specified computations on each input key–value pair from the partition of input data assigned to the task, and generate a set of intermediate results for each key. The shuffle and sort phase then takes the intermediate data generated by each Map task, sorts this data with intermediate data from other nodes, divides this data into regions to be processed by the Reduce tasks, and distributes this data as needed to the nodes where the Reduce tasks will execute. The Reduce tasks perform additional user-specified operations on the intermediate data, possibly merging the values associated with a key into a smaller set of values, to produce the output data. For more complex data processing procedures, multiple MapReduce calls may be linked together in sequence.
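
The three phases described above, and the chaining of two jobs, can be sketched in Python as follows. This is a single-process illustration under stated assumptions: all function names are hypothetical, a CRC32 hash is assumed for dividing intermediate data into regions, and the Map and Reduce tasks run one after another instead of on separate nodes.

```python
from itertools import groupby
from operator import itemgetter
import zlib


def map_phase(splits, map_fn):
    """Run one Map task per input split; each returns intermediate pairs."""
    return [[pair for k, v in split for pair in map_fn(k, v)] for split in splits]


def shuffle_and_sort(map_outputs, n_reducers):
    """Divide intermediate pairs into one key-sorted region per Reduce task."""
    regions = [[] for _ in range(n_reducers)]
    for output in map_outputs:
        for key, value in output:
            index = zlib.crc32(str(key).encode("utf-8")) % n_reducers
            regions[index].append((key, value))
    return [sorted(region, key=itemgetter(0)) for region in regions]


def reduce_phase(regions, reduce_fn):
    """Run one Reduce task per region, merging the values of each key."""
    output = []
    for region in regions:
        for key, group in groupby(region, key=itemgetter(0)):
            output.append(reduce_fn(key, [v for _k, v in group]))
    return output


def run_job(splits, map_fn, reduce_fn, n_reducers=2):
    intermediate = map_phase(splits, map_fn)
    return reduce_phase(shuffle_and_sort(intermediate, n_reducers), reduce_fn)


# Job 1: count words. Job 2: group words by their count.
def count_words(_name, text):
    for word in text.split():
        yield word, 1


def total(word, counts):
    return word, sum(counts)


def by_count(word, count):
    yield count, word


def collect(count, words):
    return count, sorted(words)


splits = [[("d1", "a b a")], [("d2", "b c a")]]
word_counts = run_job(splits, count_words, total)
# The output of the first job is the single input split of the second.
histogram = dict(run_job([word_counts], by_count, collect))
```

All pairs with the same key hash to the same region, so each Reduce task sees every value for the keys it owns. The second call to <code>run_job</code> shows how MapReduce calls are linked in sequence.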