Content deleted Content added
m →Processing Approach: Typo fixing, typos fixed: , → , using AWB |
Magioladitis (talk | contribs) m clean up using AWB (8309) |
||
Line 9:
== Data-Parallelism ==
Computer system architectures which can support [[data parallel]] applications are a potential solution to the terabyte and petabyte scale data processing requirements of data-intensive computing.<ref>[http://www.patrickpantel.com/download/papers/2004/kdd-msw04-1.pdf The terascale challenge] by D. Ravichandran, P. Pantel, and E. Hovy. "The terascale challenge," Proceedings of the KDD Workshop on Mining for and from the Semantic Web, 2004</ref> [[Data-parallelism]] can be defined as a computation applied independently to each data item of a set of data which allows the degree of parallelism to be scaled with the volume of data. The most important reason for developing data-parallel applications is the potential for scalable performance, and may result in several orders of magnitude performance improvement. The key issues with developing applications using data-parallelism are the choice of the algorithm, the strategy for data decomposition, [[load balancing (computing)|load balancing]] on processing nodes, [[message passing]] communications between nodes, and the overall accuracy of the results.
== Characteristics of Data-Intensive Computing Systems ==
Line 41:
===MapReduce===
The [[MapReduce]] architecture and programming model pioneered by [[Google]] is an example of a modern systems architecture designed for data-intensive computing.<ref>[http://labs.google.com/papers/mapreduce-osdi04.pdf MapReduce: Simplified Data Processing on Large Clusters] by J. Dean, and S. Ghemawat. Proceedings of the Sixth Symposium on Operating System Design and Implementation (OSDI), 2004.</ref> The MapReduce architecture allows programmers to use a functional programming style to create a map function that processes a [[key-value pair]] associated with the input data to generate a set of intermediate [[key-value pair
The programming model for [[MapReduce]] architecture is a simple abstraction where the computation takes a set of input key-value pairs associated with the input data and produces a set of output key-value pairs. In the Map phase, the input data is partitioned into input splits and assigned to Map tasks associated with processing nodes in the cluster. The Map task typically executes on the same node containing its assigned partition of data in the cluster. These Map tasks perform user-specified computations on each input key-value pair from the partition of input data assigned to the task, and generates a set of intermediate results for each key. The shuffle and sort phase then takes the intermediate data generated by each Map task, sorts this data with intermediate data from other nodes, divides this data into regions to be processed by the reduce tasks, and distributes this data as needed to nodes where the Reduce tasks will execute. The Reduce tasks perform additional user-specified operations on the intermediate data possibly merging values associated with a key to a smaller set of values to produce the output data. For more complex data processing procedures, multiple MapReduce calls may be linked together in sequence.
Line 50:
Hadoop implements a distributed data processing scheduling and execution environment and framework for MapReduce jobs. Hadoop includes a distributed file system called HDFS which is analogous to [[Google File System|GFS]] in the Google MapReduce implementation. The Hadoop execution environment supports additional distributed data processing capabilities which are designed to run using the Hadoop MapReduce architecture. These include [[HBase]], a distributed column-oriented database which provides random access read/write capabilities; Hive which is a data warehouse system built on top of Hadoop that provides [[SQL]]-like query capabilities for data summarization, ad-hoc queries, and analysis of large datasets; and Pig – a high-level data-flow programming language and execution framework for data-intensive computing.
[[Pig]] was developed at Yahoo! to provide a specific language notation for data analysis applications and to improve programmer productivity and reduce development cycles when using the Hadoop MapReduce environment. Pig programs are automatically translated into sequences of MapReduce programs if needed in the execution environment. Pig provides capabilities in the language for loading, storing, filtering, grouping, de-duplication, ordering, sorting, aggregation, and joining operations on the data.
===HPCC===
Line 73:
<!--- See [[Wikipedia:Footnotes]] on how to create references using <ref></ref> tags which will then appear here automatically -->
{{Reflist|2}}
[[Category:Emerging technologies]]
|