'''Data-centric programming language''' defines a category of programming languages where the primary function is the management and manipulation of data. A data-centric programming language includes built-in processing primitives for accessing data stored in sets, tables, lists, and other data structures and databases, and for the specific manipulation and transformation of data required by a programming application. Data-centric programming languages are typically [[declarative programming language|declarative]] and often dataflow-oriented, and define the processing result desired; the specific processing steps required to achieve that result are left to the language compiler. The [[SQL]] relational database language is an example of a declarative, data-centric language. Declarative, data-centric programming languages are ideal for [[data-intensive computing]] applications.
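As a simple illustration of this distinction, the following Python sketch (with hypothetical records and field names, not drawn from any particular system) contrasts an imperative formulation, which spells out each processing step, with a declarative-style expression that states only the desired result:

<syntaxhighlight lang="python">
# Hypothetical employee records; the field names are illustrative only.
employees = [
    {"name": "Alice", "dept": "Sales", "salary": 52000},
    {"name": "Bob", "dept": "IT", "salary": 61000},
    {"name": "Carol", "dept": "Sales", "salary": 47000},
]

# Imperative style: the programmer specifies each processing step.
high_paid = []
for employee in employees:
    if employee["salary"] > 50000:
        high_paid.append(employee["name"])
high_paid.sort()

# Declarative style: only the desired result is stated, much as the SQL query
#   SELECT name FROM employees WHERE salary > 50000 ORDER BY name
# states the result and leaves the processing steps to the database engine.
high_paid_declarative = sorted(e["name"] for e in employees if e["salary"] > 50000)

assert high_paid == high_paid_declarative
</syntaxhighlight>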
== Background ==
The rapid growth of the [[Internet]] and [[World Wide Web]] has made huge amounts of information available online and created the need for [[Big Data]] processing capabilities. Business and government organizations create large amounts of both structured and unstructured information, which needs to be processed, analyzed, and linked.<ref>[http://www.springer.com/computer/communication+networks/book/978-1-4419-6523-3/ Handbook of Cloud Computing], "Data-Intensive Technologies for Cloud Computing," by A.M. Middleton. Handbook of Cloud Computing. Springer, 2010.</ref> Storing, managing, accessing, and processing this vast amount of data is a fundamental challenge in satisfying the need to search, analyze, mine, and visualize it as information.<ref>"Got Data? A Guide to Data Preservation in the Information Age," by F. Berman. Communications of the ACM, Vol. 51, No. 12, 2008, pp. 50-66.</ref> Declarative, data-centric languages are increasingly used to address these problems, because focusing on the data makes them much simpler to express.<ref>[http://www.cccblog.org/2008/10/20/the-data-centric-gambit/ The Data Centric Gambit], by J. Hellerstein, 2008.</ref>
Computer system architectures such as [[Hadoop]] and [[HPCC]], which can support data-parallel applications, are a potential solution to the terabyte- and petabyte-scale data processing requirements of [[data-intensive computing]].<ref>"A Design Methodology for Data-Parallel Applications," by L.S. Nyland, J.F. Prins, A. Goldberg, and P.H. Mills. Handbook of Cloud Computing. Springer, 2010.</ref><ref>"The terascale challenge," by D. Ravichandran, P. Pantel, and E. Hovy. Proceedings of the KDD Workshop on Mining for and from the Semantic Web, 2004.</ref> Clusters of commodity hardware are commonly used to address Big Data problems.<ref>"BOOM: Data-Centric Programming in the Datacenter," by P. Alvaro, T. Condie, N. Conway, K. Elmeleegy, J. Hellerstein, and R. Sears. Electrical Engineering and Computer Sciences Department, University of California at Berkeley, Technical Report, 2009</ref> The fundamental challenges for Big Data applications and data-intensive computing<ref>"Data-Intensive Computing in the 21st Century," by I. Gorton, P. Greenfield, A. Szalay, and R. Williams. IEEE Computer, Vol. 41, No. 4, 2008, pp. 30-32</ref> are managing and processing exponentially growing data volumes, significantly reducing the associated data analysis cycles to support practical, timely applications, and developing new algorithms that can scale to search and process massive amounts of data. The National Science Foundation has identified key issues related to data-intensive computing, such as programming abstractions (including models, languages, and algorithms) that allow a natural expression of parallel processing of data.<ref>[http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=503324&org=IIS Data-Intensive Computing], NSF, 2009</ref> Declarative, data-centric programming languages are well suited to this class of problems.
Data-centric programming languages provide a processing approach in which applications are expressed in terms of high-level operations on data, and the runtime system transparently controls the scheduling, execution, load balancing, communications, and movement of programs and data across the computing cluster.<ref>[http://www.cs.cmu.edu/~bryant/presentations/DISC-concept.ppt Data Intensive Scalable Computing], by R.E. Bryant, 2008</ref> The programming abstraction and language tools allow the processing to be expressed in terms of data flows and transformations incorporating shared libraries of common data manipulation algorithms such as sorting.
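For example, the following minimal, single-machine Python sketch (the dataset and the helper names <code>parse_record</code> and <code>by_country</code> are hypothetical) expresses a job as a flow of high-level transformations over records; in a data-centric system, the same operations would be scheduled and distributed across a cluster by the runtime:

<syntaxhighlight lang="python">
def parse_record(line):
    """Transform a raw CSV line into a (country, amount) pair."""
    country, amount = line.split(",")
    return country, float(amount)

def by_country(pairs):
    """Aggregate amounts per country, a common shared 'group and sum' step."""
    totals = {}
    for country, amount in pairs:
        totals[country] = totals.get(country, 0.0) + amount
    return totals

raw_lines = ["US,120.5", "DE,80.0", "US,35.25", "FR,99.9"]

# The job is expressed as a flow of transformations over the data:
# parse -> aggregate -> sort (sorting being a typical shared library step).
totals = by_country(map(parse_record, raw_lines))
report = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(report)  # [('US', 155.75), ('FR', 99.9), ('DE', 80.0)]
</syntaxhighlight>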
== Data-centric language examples ==
[[SQL]], the best-known declarative, data-centric programming language, has been in use since the 1980s and is the de facto standard for working with relational databases. However, a variety of new system architectures and associated programming languages have been implemented for [[data-intensive computing]], Big Data applications, and large-scale data analysis applications. Most data growth is in unstructured data,<ref>"The Expanding Digital Universe," by J.F. Gantz, D. Reinsel, C. Chute, W. Schlichting, J. McArthur, S. Minton, J. Xheneti, A. Toncheva, and A. Manfrediz. IDC, White Paper, 2007</ref> and new processing paradigms with more flexible data models were needed. Several solutions have emerged, including the MapReduce architecture pioneered by Google and now available in the open-source implementation Hadoop, used by Yahoo!, Facebook, and others, and the HPCC system architecture offered by LexisNexis Risk Solutions.
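The following minimal, single-process Python sketch illustrates the MapReduce programming model with the canonical word-count example; an actual MapReduce or Hadoop runtime would partition the input, execute many map and reduce tasks in parallel, and perform the intermediate shuffle and sort itself:

<syntaxhighlight lang="python">
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in the input document."""
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    """Sum the counts emitted for a single word."""
    return word, sum(counts)

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Local stand-in for the shuffle step: group intermediate pairs by key.
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

word_counts = dict(reduce_phase(w, c) for w, c in grouped.items())
print(word_counts)  # e.g. {'the': 3, 'fox': 2, 'quick': 1, ...}
</syntaxhighlight>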
=== Hadoop Pig ===
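[[Apache Pig|Pig]] is a high-level, data-flow-oriented programming language and execution framework for Hadoop, originally developed at Yahoo!. Programs written in its language, Pig Latin, are automatically translated into sequences of MapReduce jobs that run on a Hadoop cluster.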
== See also ==
* [[Programming language]]
* [[Declarative programming]]
* [[Declarative language]]
* [[Data-intensive computing]]
* [[Parallel computing]]
* [[Distributed computing]]
== References ==
{{reflist}}
[[Category:Articles created via the Article Wizard]]