Line 1:
{{Short description|Category of programming languages}}
'''Data-centric programming language''' defines a category of programming languages where the primary function is the management and manipulation of data. A data-centric programming language includes built-in processing primitives for accessing data stored in sets, tables, lists, and other data structures and databases, and for the specific manipulation and transformation of data required by a programming application. Data-centric programming languages are typically [[declarative programming language|declarative]] and often dataflow-oriented: they define the desired processing result, and the specific steps required to produce it are left to the language compiler. The [[SQL]] relational database language is an example of a declarative, data-centric language. Declarative, data-centric programming languages are well suited to [[data-intensive computing]] applications.
Line 4 ⟶ 5:
The rapid growth of the [[Internet]] and [[World Wide Web]] has made huge amounts of information available online and created the need for [[Big Data]] processing capabilities. Business and government organizations create large amounts of both structured and [[unstructured data|unstructured]] information, which needs to be processed, analyzed, and linked.<ref>[https://www.springer.com/computer/communication+networks/book/978-1-4419-6523-3/ Handbook of Cloud Computing], "Data-Intensive Technologies for Cloud Computing" by A. M. Middleton. Handbook of Cloud Computing. Springer, 2010.</ref> Storing, managing, accessing, and processing this vast amount of data is both a fundamental need and an immense challenge for anyone who must search, analyze, mine, and visualize the data as information.<ref>"[http://www.csc.liv.ac.uk/~leszek/COMP526/2009/Akadej.pdf Got Data? A Guide to Data Preservation in the Information Age]" by F. Berman. Communications of the ACM, Vol. 51, No. 12, 2008, pp. 50–66.</ref> Declarative, data-centric languages are increasingly used to address these problems, because focusing on the data makes them much simpler to express.<ref>[http://www.cccblog.org/2008/10/20/the-data-centric-gambit/ The Data Centric Gambit], by J. Hellerstein, 2008.</ref>
Computer system architectures such as [[Hadoop]] and [[HPCC]] which can support data-parallel applications are a potential solution to the terabyte and petabyte scale data processing requirements of [[data-intensive computing]].<ref>"A Design Methodology for Data-Parallel Applications" by L. S. Nyland, J. F. Prins, A. Goldberg, and P. H. Mills. Handbook of Cloud Computing. Springer, 2010.</ref><ref>"[http://www.academia.edu/download/30742657/msw2004_proceedings.pdf#page=7 The terascale challenge]{{dead link|date=July 2022|bot=medic}}{{cbignore|bot=medic}}" by D. Ravichandran, P. Pantel, and E. Hovy. Proceedings of the KDD Workshop on Mining for and from the Semantic Web, 2004.</ref> Clusters of commodity hardware are commonly being used to address Big Data problems.<ref>"[http://www.academia.edu/download/5493555/eecs-2009-98.pdf BOOM: Data-Centric Programming in the Datacenter]{{dead link|date=July 2022|bot=medic}}{{cbignore|bot=medic}}" by P. Alvaro, T. Condie, N. Conway, K. Elmeleegy, J. Hellerstein, and R. Sears. Electrical Engineering and Computer Sciences Department, University of California at Berkeley, Technical Report, 2009.</ref> The fundamental challenges for Big Data applications and data-intensive computing<ref>"[https://ieeexplore.ieee.org/
Data-centric programming languages provide a processing approach in which applications are expressed in terms of high-level operations on data, and the runtime system transparently controls the scheduling, execution, load balancing, communications, and movement of programs and data across the computing cluster.<ref>[https://www.cs.cmu.edu/~bryant/presentations/DISC-concept.ppt Data Intensive Scalable Computing], by R. E. Bryant, 2008.</ref> The programming abstraction and language tools allow the processing to be expressed in terms of data flows and transformations incorporating shared libraries of common data manipulation algorithms such as sorting.
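As a rough illustration of expressing an application as a flow of high-level data operations, the following Python sketch chains a filter, a sort, and a grouped count. It is only an analogy run on a single machine: the record fields and the helper names <code>filter_records</code>, <code>sort_records</code>, and <code>group_count</code> are hypothetical, and nothing here models the scheduling or data movement that a real cluster runtime would handle.

```python
from collections import defaultdict
from operator import itemgetter

# Hypothetical input records; the field names are illustrative only.
visits = [
    {"user": "amy", "url": "a.example", "ms": 120},
    {"user": "bob", "url": "b.example", "ms": 45},
    {"user": "amy", "url": "c.example", "ms": 300},
    {"user": "cat", "url": "a.example", "ms": 80},
]


def filter_records(records, predicate):
    """Keep only the records for which predicate(record) is true."""
    return [r for r in records if predicate(r)]


def sort_records(records, key):
    """Return the records ordered by the named field."""
    return sorted(records, key=itemgetter(key))


def group_count(records, key):
    """Count how many records share each value of the named field."""
    counts = defaultdict(int)
    for r in records:
        counts[r[key]] += 1
    return dict(counts)


# The application is written as a sequence of whole-dataset transformations.
slow = filter_records(visits, lambda r: r["ms"] >= 80)
ordered = sort_records(slow, "ms")
per_user = group_count(ordered, "user")
```

Each step names what should happen to the whole dataset rather than how individual records are visited, which is the property a data-centric runtime relies on when it distributes the work.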
Line 30 ⟶ 31:
ECL includes built-in data transform operations that operate on entire datasets, including PROJECT, ITERATE, ROLLUP, JOIN, COMBINE, FETCH, NORMALIZE, DENORMALIZE, and PROCESS. For example, the transform function defined for a JOIN operation receives two records, one from each dataset being joined, can perform any operations on the fields in the pair of records, and returns an output record which can be completely different from either of the input records. Example syntax for the JOIN operation from the ECL Language Reference Manual is shown in Figure 3. Figure 4 shows an example of the equivalent ECL code for the Pig example program shown in Figure 1.
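The idea of a join driven by a transform function can be sketched in Python. This is not ECL and does not reproduce the syntax in the ECL Language Reference Manual; <code>join_with_transform</code>, <code>summarize</code>, and the sample records are hypothetical names chosen for illustration. The sketch performs an inner join on a shared key and lets the transform build an output record whose layout differs from both inputs.

```python
def join_with_transform(left, right, key, transform):
    """Inner-join two lists of dict records on `key`.

    `transform` receives one record from each side and returns the
    output record, which may have any layout.
    """
    # Index the right-hand dataset by key so each left record finds its matches.
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [transform(l, r) for l in left for r in index.get(l[key], [])]


# Hypothetical datasets for illustration.
people = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Grace"},
    {"id": 3, "name": "Edsger"},
]
orders = [
    {"id": 1, "total": 30.0},
    {"id": 1, "total": 12.5},
    {"id": 2, "total": 7.0},
]


def summarize(person, order):
    """Build an output record unlike either input record."""
    return {"label": f"{person['name']}:{order['total']:.2f}"}


joined = join_with_transform(people, orders, "id", summarize)
```

The record with <code>id</code> 3 has no matching order and therefore produces no output, as expected for an inner join.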
The ECL programming language also provides built-in primitives for [[natural language processing]] (NLP) with PATTERN statements and the built-in PARSE operation. PATTERN statements allow matching patterns, including regular expressions, to be defined and used to parse information from unstructured data such as raw text. PATTERN statements can be combined to implement complex parsing operations or complete grammars from [[Backus–Naur form]] (BNF) definitions. The PARSE operation operates across a dataset of records on a specific field within a record; this field could be an entire line in a text file, for example. Using this capability of the ECL language, it is possible to implement parallel processing for [[information extraction]] applications across document files and all types of unstructured and semi-structured data, including XML-based documents or Web pages. Figure 5 shows an example of ECL code used in a log analysis application which incorporates NLP.
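The general approach of applying a pattern to one field of every record in a dataset can be sketched in Python with the standard <code>re</code> module. This is an analogy only, not ECL PATTERN or PARSE syntax; the log line layout, the field name <code>line</code>, and the helper <code>parse_field</code> are assumptions made for the example.

```python
import re

# Assumed log layout for illustration: "<ip> <method> <path> <status>".
LOG_PATTERN = re.compile(
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) "
    r"(?P<method>GET|POST) "
    r"(?P<path>\S+) "
    r"(?P<status>\d{3})"
)

# Each record holds one raw text line in a single field.
records = [
    {"line": "10.0.0.1 GET /index.html 200"},
    {"line": "malformed entry"},
    {"line": "10.0.0.2 POST /login 403"},
]


def parse_field(dataset, field, pattern):
    """Apply `pattern` to `field` of every record and keep the matches.

    Returns one dict of named groups per matching record; records that
    do not match are skipped.
    """
    out = []
    for rec in dataset:
        m = pattern.search(rec[field])
        if m:
            out.append(m.groupdict())
    return out


parsed = parse_field(records, "line", LOG_PATTERN)
```

Because each record is parsed independently of the others, this kind of operation can in principle be split across many nodes, each handling its own portion of the dataset.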
== See also ==
Line 46 ⟶ 47:
[[Category:Parallel computing]]
[[Category:Distributed computing]]
[[Category:Data-centric programming languages]]