Data transformation (computing): Difference between revisions

Rescuing 15 sources and tagging 0 as dead.) #IABot (v2.0.9.5
Transformational languages: This paragraph was been questioned since 2007, yet no clarifications or references
 
(14 intermediate revisions by 11 users not shown)
Line 1:
{{Short description|Converting data between different formats}}
{{redirect-distinguish|Data transformation|Data transformation (statistics)}}
{{COI|date=October 2017}}
{{Data transformation}}
 
In [[computing]], '''data transformation''' is the process of converting data from one format or structure into another format or structure. It is a fundamental aspect of most [[data integration]]<ref name="cio.com">CIO.com. Agile Comes to Data Integration. Retrieved from: https://www.cio.com/article/2378615/data-management/agile-comes-to-data-integration.html {{Webarchive|url=https://web.archive.org/web/20170829035436/https://www.cio.com/article/2378615/data-management/agile-comes-to-data-integration.html |date=2017-08-29 }}</ref> and [[data management]] tasks such as [[data wrangling]], [[data warehousing]], [[data integration]] and application integration.
 
Data transformation can be simple or complex based on the required changes to the data between the source (initial) data and the target (final) data. Data transformation is typically performed via a mixture of manual and automated steps.<ref name="livinglab.mit.edu">DataXFormer. Morcos, Abedjan, Ilyas, Ouzzani, Papotti, Stonebraker. An interactive data transformation tool. Retrieved from: http://livinglab.mit.edu/wp-content/uploads/2015/12/DataXFormer-An-Interactive-Data-Transformation-Tool.pdf {{Webarchive|url=https://web.archive.org/web/20190805211122/http://livinglab.mit.edu/wp-content/uploads/2015/12/DataXFormer-An-Interactive-Data-Transformation-Tool.pdf |date=2019-08-05 }}</ref> Tools and technologies used for data transformation can vary widely based on the format, structure, complexity, and volume of the data being transformed.
 
A [[master data]] recast is another form of data transformation where the entire database of data values is transformed or recast without extracting the data from the database. All data in a well-designed database is directly or indirectly related to a limited set of master [[database table]]s by a network of [[foreign key]] constraints. Each foreign key constraint is dependent upon a unique [[database index]] from the parent database table. Therefore, when the proper master database table is recast with a different unique index, the directly and indirectly related data are also recast or restated. The directly and indirectly related data may also still be viewed in the original form since the original unique index still exists with the master data. Also, the database recast must be done in such a way as to not impact the [[applications architecture]] software.
 
When the data mapping is indirect via a mediating [[data model]], the process is also called '''data mediation'''.
 
==Data transformation process==
Data transformation can be divided into the following steps, each applicable as needed based on the complexity of the transformation required.<br>
 
* [[Data discovery]]
Line 29 ⟶ 28:
'''Data mapping''' is the process of defining how individual fields are mapped, modified, joined, filtered, aggregated etc. to produce the final desired output. Developers or technical data analysts traditionally perform data mapping since they work in the specific technologies to define the transformation rules (e.g. visual [[Extract, transform, load|ETL]] tools,<ref>DWBIMASTER. Top 10 ETL Tools. Retrieved from: http://dwbimaster.com/top-10-etl-tools/ {{Webarchive|url=https://web.archive.org/web/20170829035105/http://dwbimaster.com/top-10-etl-tools/ |date=2017-08-29 }}</ref> transformation languages).
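
As a minimal sketch of what such mapping rules might look like when written by hand, the following Python example renames and converts fields, filters rows, and aggregates the result. The field names, the sample rows and the <code>apply_mapping</code> and <code>transform</code> helpers are assumptions made for illustration; they are not taken from any particular tool.

```python
from collections import defaultdict

# Hypothetical source records; field names are illustrative only.
source_rows = [
    {"cust_id": 1, "region": "north", "amount": "10.50"},
    {"cust_id": 2, "region": "south", "amount": "4.00"},
    {"cust_id": 1, "region": "north", "amount": "2.25"},
    {"cust_id": 3, "region": "north", "amount": "-1.00"},
]

# Mapping rules: target field -> (source field, conversion function).
field_mapping = {
    "customer_id": ("cust_id", int),
    "region": ("region", str.upper),
    "amount": ("amount", float),
}


def apply_mapping(row, mapping):
    """Rename and convert the fields of one source row."""
    return {
        target: convert(row[source])
        for target, (source, convert) in mapping.items()
    }


def transform(rows, mapping):
    """Map each row, drop non-positive amounts, and total amounts per customer."""
    totals = defaultdict(float)
    for row in rows:
        mapped = apply_mapping(row, mapping)
        if mapped["amount"] > 0:  # filter rule
            totals[(mapped["customer_id"], mapped["region"])] += mapped["amount"]
    return [
        {"customer_id": cid, "region": region, "total_amount": total}
        for (cid, region), total in sorted(totals.items())
    ]


result = transform(source_rows, field_mapping)
```

Keeping the rules in a plain data structure (<code>field_mapping</code>) separates the definition of the mapping from the code that applies it.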
 
'''Code generation''' is the process of generating executable code (e.g. SQL, Python, R, or other executable instructions) that will transform the data based on the desired and defined data mapping rules.<ref>Petr Aubrecht, Zdenek Kouba. Metadata-driven data transformation. Retrieved from: http://labe.felk.cvut.cz/~aubrech/bin/Sumatra.pdf {{Webarchive|url=https://web.archive.org/web/20210416121323/http://labe.felk.cvut.cz/~aubrech/bin/Sumatra.pdf |date=2021-04-16 }}</ref> Typically, the data transformation technologies generate this code<ref>LearnDataModeling.com. Code Generators. Retrieved from: http://www.learndatamodeling.com/tm_code_generator.php {{Webarchive|url=https://web.archive.org/web/20170802064905/http://www.learndatamodeling.com/tm_code_generator.php |date=2017-08-02 }}</ref> based on the definitions or metadata defined by the developers.
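
The idea of generating code from mapping metadata can be sketched in a few lines of Python. The example below is only an illustration: the <code>generate_sql</code> function, the table names and the column expressions are hypothetical, and real tools generate far more elaborate code.

```python
# Hypothetical mapping metadata: target column -> SQL expression over the source.
mapping = {
    "customer_id": "cust_id",
    "region": "UPPER(region)",
    "total_amount": "SUM(amount)",
}


def generate_sql(source_table, target_table, mapping, group_by=(), where=None):
    """Build an INSERT ... SELECT statement from mapping metadata.

    Identifiers are interpolated as-is, so the metadata is assumed to be
    trusted input written by the developer, not end-user data.
    """
    select_list = ", ".join(f"{expr} AS {col}" for col, expr in mapping.items())
    columns = ", ".join(mapping)
    sql = (
        f"INSERT INTO {target_table} ({columns}) "
        f"SELECT {select_list} FROM {source_table}"
    )
    if where:
        sql += f" WHERE {where}"
    if group_by:
        sql += " GROUP BY " + ", ".join(group_by)
    return sql


sql = generate_sql(
    "raw_orders",
    "customer_totals",
    mapping,
    group_by=("cust_id", "region"),
    where="amount > 0",
)
print(sql)
```

Changing the metadata changes the generated statement without touching the generator itself.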
 
'''Code execution''' is the step whereby the generated code is executed against the data to create the desired output. The executed code may be tightly integrated into the transformation tool, or it may require separate steps by the developer to manually execute the generated code.
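
A minimal sketch of the execution step, assuming the generated code is a SQL statement and the data lives in an in-memory SQLite database (both assumptions made for illustration; the table and column names are hypothetical):

```python
import sqlite3

# Hypothetical generated statement; table and column names are illustrative.
generated_sql = """
INSERT INTO customer_totals (customer_id, total_amount)
SELECT cust_id, SUM(amount) FROM raw_orders
WHERE amount > 0
GROUP BY cust_id
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (cust_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE customer_totals (customer_id INTEGER, total_amount REAL)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [(1, 10.5), (2, 4.0), (1, 2.25), (3, -1.0)],
)

# Code execution step: run the generated statement against the source data.
with conn:  # commits on success, rolls back on error
    conn.execute(generated_sql)

totals = conn.execute(
    "SELECT customer_id, total_amount FROM customer_totals ORDER BY customer_id"
).fetchall()
conn.close()
```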
Line 42 ⟶ 41:
Batch data transformation is the cornerstone of virtually all data integration technologies such as data warehousing, data migration and application integration.<ref name="cio.com"/>
 
When data must be transformed and delivered with low latency, the term "microbatch" is often used.<ref name="tdwi.org"/> This refers to small batches of data (e.g. a small number of rows or a small set of data objects) that can be processed very quickly and delivered to the target system when needed.
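
The following Python sketch illustrates the microbatch idea under simple assumptions: rows arrive as an iterable, and each small batch is transformed and delivered as soon as it is filled. The <code>microbatches</code> and <code>transform</code> functions and the sample rows are hypothetical.

```python
from itertools import islice


def microbatches(rows, batch_size):
    """Yield successive small lists of at most batch_size rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def transform(row):
    """Hypothetical per-row transformation: normalise a name field."""
    return {**row, "name": row["name"].strip().title()}


incoming = [{"id": i, "name": f"  user {i} "} for i in range(5)]
delivered = []
for batch in microbatches(incoming, batch_size=2):
    # Each microbatch is transformed and handed to the target as soon as it is ready.
    delivered.append([transform(row) for row in batch])
```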
 
===Benefits of batch data transformation===
Line 65 ⟶ 64:
Although interactive data transformation follows the same data integration process steps as batch data integration, the key difference is that the steps are not necessarily followed in a linear fashion and typically don't require significant technical skills for completion.<ref>Peng Cong, Zhang Xiaoyi. Research and Design of Interactive Data Transformation and Migration System for Heterogeneous Data Sources. Retrieved from: https://ieeexplore.ieee.org/document/5211525/ {{Webarchive|url=https://web.archive.org/web/20180607184030/https://ieeexplore.ieee.org/document/5211525/ |date=2018-06-07 }}</ref>
 
There are a number of companies that provide interactive data transformation tools, including Trifacta, Alteryx and Paxata. They aim to efficiently analyze, map and transform large volumes of data while abstracting away some of the technical complexity and processes that take place under the hood.
 
Interactive data transformation solutions provide an integrated visual interface that combines the previously disparate steps of data analysis, data mapping, code generation/execution and data inspection.<ref name="The Value of Data Transformation"/> That is, if changes are made at one step (for example, renaming), the software automatically updates the preceding or following steps accordingly. Interfaces for interactive data transformation incorporate visualizations to show the user patterns and anomalies in the data so they can identify erroneous or outlying values.<ref name="digital.lib.washington.edu"/>
Line 73 ⟶ 70:
Once they've finished transforming the data, the system can generate executable code/logic, which can be executed or applied to subsequent similar data sets.
 
By removing the developer from the process, interactive data transformation systems shorten the time needed to prepare and transform the data, eliminate costly errors in the interpretation of user requirements and empower business users and analysts to control their data and interact with it as needed.<ref name="ReferenceA"/>
 
==Transformational languages==
There are numerous languages available for performing data transformation, varying in their accessibility (cost) and general usefulness.<ref>DMOZ. Extraction and Transformation. Retrieved from: https://dmoztools.net/Computers/Software/Databases/Data_Warehousing/Extraction_and_Transformation/ {{Webarchive|url=https://web.archive.org/web/20170829041136/https://dmoztools.net/Computers/Software/Databases/Data_Warehousing/Extraction_and_Transformation/ |date=2017-08-29 }}</ref> Many [[transformation language]]s require a [[grammar]] to be provided. In many cases, the grammar is structured using something closely resembling [[Backus–Naur form]] (BNF). Examples of such languages include:
* [[AWK]] - one of the oldest and most popular textual data transformation languages;
* [[Perl]] - a high-level language with both procedural and object-oriented syntax capable of powerful operations on binary or text data;
* [[Web template|Template languages]] - specialized to transform data into documents (see also [[template processor]]);
* [[TXL (programming language)|TXL]] - prototyping language-based descriptions, used for source code or data transformation;
* [[XSLT]] - the standard XML data transformation language (substitutable by [[XQuery]] in many applications).
Additionally, companies such as Trifacta and Paxata have developed domain-specific transformational languages (DSLs) for servicing and transforming datasets. The development of domain-specific languages has been linked to increased productivity and accessibility for non-technical users.<ref>{{Cite web|url=https://docs.trifacta.com/display/PE/Wrangle+Language|title=Wrangle Language - Trifacta Wrangler - Trifacta Documentation|website=docs.trifacta.com|access-date=2017-09-20|archive-date=2017-09-21|archive-url=https://web.archive.org/web/20170921045735/https://docs.trifacta.com/display/PE/Wrangle+Language|url-status=live}}</ref> Trifacta's "Wrangle" is an example of such a domain-specific language.<ref name=":0">{{Cite web|url=https://conferences.oreilly.com/strata/stratany2014/public/schedule/detail/36612|title=Advantages of a Domain-Specific Language Approach to Data Transformation - Strata + Hadoop World in New York 2014|last=Kandel|first=Joe Hellerstein, Sean|website=conferences.oreilly.com|access-date=2017-09-20|archive-date=2017-09-21|archive-url=https://web.archive.org/web/20170921000834/https://conferences.oreilly.com/strata/stratany2014/public/schedule/detail/36612|url-status=live}}</ref>
 
Another advantage of the recent trend toward domain-specific transformational languages is that such a language can abstract the underlying execution of the logic defined in it. The same logic can also be run on various processing engines, such as [[Apache Spark|Spark]], [[MapReduce]], and [[Microsoft Dataflow|Dataflow]]. In other words, with a domain-specific transformational language, the transformation language is not tied to the underlying engine.<ref name=":0" />
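
The separation between transformation logic and execution engine can be illustrated with a deliberately tiny, hypothetical language. The sketch below is not Wrangle or any vendor's DSL: the step names, <code>run_in_python</code> and <code>compile_to_sql</code> are invented for illustration, and it assumes that filters refer to source column names and appear before any rename of the same column.

```python
# A hypothetical, minimal transformation "language": an ordered list of steps.
recipe = [
    ("filter", "amount", ">", 0),
    ("rename", "cust_id", "customer_id"),
]


def run_in_python(recipe, rows):
    """Engine 1: interpret the recipe directly over a list of dicts."""
    for step in recipe:
        if step[0] == "filter":
            _, field, op, value = step
            if op != ">":  # only one operator is supported in this sketch
                raise ValueError(f"unsupported operator: {op}")
            rows = [r for r in rows if r[field] > value]
        elif step[0] == "rename":
            _, old, new = step
            rows = [{(new if k == old else k): v for k, v in r.items()} for r in rows]
    return rows


def compile_to_sql(recipe, table, columns):
    """Engine 2: compile the same recipe to a SQL SELECT statement."""
    names = {c: c for c in columns}  # source column -> output name
    conditions = []
    for step in recipe:
        if step[0] == "filter":
            _, field, op, value = step
            conditions.append(f"{field} {op} {value}")
        elif step[0] == "rename":
            _, old, new = step
            names[old] = new
    select = ", ".join(c if c == n else f"{c} AS {n}" for c, n in names.items())
    where = f" WHERE {' AND '.join(conditions)}" if conditions else ""
    return f"SELECT {select} FROM {table}{where}"


rows = [{"cust_id": 1, "amount": 5}, {"cust_id": 2, "amount": -3}]
python_result = run_in_python(recipe, rows)
sql_text = compile_to_sql(recipe, "orders", ["cust_id", "amount"])
```

The same <code>recipe</code> is either interpreted in memory or translated to SQL for a database engine, without the recipe itself changing.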
Line 99 ⟶ 96:
 
In other words, all instances of a function invocation of foo with three arguments, followed by a function invocation with two arguments, would be replaced with a single function invocation using some or all of the original set of arguments.
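
A rewrite of this kind can be sketched with Python's <code>re</code> module. The pattern below is an assumption made for illustration, not a reproduction of any expression given elsewhere: the names <code>bar</code> and <code>foobar</code> are placeholders for the second and the merged function, and arguments are assumed to contain no commas or parentheses.

```python
import re

# Assumed input shape: foo(a, b, c); immediately followed by bar(d, e);
# Arguments containing commas or parentheses are not handled by this sketch.
PATTERN = re.compile(
    r"\bfoo\s*\(\s*([^,()]+?)\s*,\s*([^,()]+?)\s*,\s*([^,()]+?)\s*\)\s*;\s*"
    r"bar\s*\(\s*([^,()]+?)\s*,\s*([^,()]+?)\s*\)\s*;"
)


def merge_calls(source):
    """Replace each foo(...); bar(...); pair with one foobar(...) call."""
    # Keep the first two foo arguments and both bar arguments; drop the third.
    return PATTERN.sub(r"foobar(\1, \2, \4, \5);", source)


code = 'foo("some string", 42, gCommon);\nbar(someObj, anotherObj);\n'
merged = merge_calls(code)
print(merged)
```

Source text that does not match the assumed shape, such as a call to foo with only two arguments, is left unchanged.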
 
Another advantage to using regular expressions is that they will not fail the null transform test. That is, using your transformational language of choice, run a sample program through a transformation that doesn't perform any transformations. Many transformational languages will fail this test.
 
==See also==
Line 116 ⟶ 111:
==External links==
* [https://en.wikiversity.org/wiki/Digital_Libraries/File_formats,_transformation,_migration File Formats, Transformation, and Migration], a related Wikiversity article
* {{dmoz|Computers/Software/Databases/Data_Warehousing/Extraction_and_Transformation|Extraction and Transformation}}
 
{{Data warehouse}}
 
{{DEFAULTSORT:Data Transformation}}
[[Category:Metadata]]
[[Category:Articles with example C++ code]]