{{Short description|Process of converting data from one format or structure into another}}
{{redirect-distinguish|Data transformation|Data transformation (statistics)}}
{{Data transformation}}
In [[computing]], '''data transformation''' is the process of converting data from one format or structure into another format or structure. It is a fundamental aspect of most [[data integration]]<ref name="cio.com">CIO.com. Agile Comes to Data Integration. Retrieved from: https://www.cio.com/article/2378615/data-management/agile-comes-to-data-integration.html {{Webarchive|url=https://web.archive.org/web/20170829035436/https://www.cio.com/article/2378615/data-management/agile-comes-to-data-integration.html |date=2017-08-29 }}</ref> and [[data management]] tasks such as [[data wrangling]], [[data warehousing]], and application integration.
Data transformation can be simple or complex based on the required changes to the data between the source (initial) data and the target (final) data. Data transformation is typically performed via a mixture of manual and automated steps.<ref name="livinglab.mit.edu">DataXFormer. Morcos, Abedjan, Ilyas, Ouzzani, Papotti, Stonebraker. An interactive data transformation tool. Retrieved from: http://livinglab.mit.edu/wp-content/uploads/2015/12/DataXFormer-An-Interactive-Data-Transformation-Tool.pdf {{Webarchive|url=https://web.archive.org/web/20190805211122/http://livinglab.mit.edu/wp-content/uploads/2015/12/DataXFormer-An-Interactive-Data-Transformation-Tool.pdf |date=2019-08-05 }}</ref> Tools and technologies used for data transformation can vary widely based on the format, structure, complexity, and volume of the data being transformed.
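To illustrate the simple end of this spectrum, the following minimal Python sketch (the file names and field layout are hypothetical) converts records from CSV into JSON, a pure format change that leaves the values themselves untouched:
<pre>
# Minimal sketch of a simple format transformation: CSV records in,
# a JSON document out. "customers.csv"/"customers.json" are hypothetical.
import csv
import json

with open("customers.csv", newline="") as src:
    records = list(csv.DictReader(src))

# Structural change only: a flat CSV table becomes a nested JSON document.
with open("customers.json", "w") as dst:
    json.dump({"customers": records}, dst, indent=2)
</pre>
A complex transformation would add further steps such as joining multiple sources, standardizing values, and aggregating, typically following the process described below.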
A [[master data]] recast is another form of data transformation where the entire database of data values is transformed or recast without extracting the data from the database. All data in a well-designed database is directly or indirectly related to a limited set of master database tables by a network of [[foreign key]] constraints. Each foreign key constraint is dependent upon a unique database index from the parent database table. Therefore, when the proper master database table is recast with a different unique index, the directly and indirectly related data are also recast or restated.
When the data mapping is indirect via a mediating [[data model]], the process is also called '''data mediation'''.
==Data transformation process==
Data transformation can be divided into the following steps, each applicable as needed based on the complexity of the transformation required.
* [[Data discovery]]
* [[Data mapping]]
* [[Automatic programming|Code generation]]
* [[Execution (computing)|Code execution]]
* Data review
'''Data discovery''' is the first step in the data transformation process. Typically the data is profiled using profiling tools or sometimes using manually written profiling scripts to better understand the structure and characteristics of the data and decide how it needs to be transformed.
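As a rough illustration of profiling, the sketch below (the file name and columns are assumptions) summarizes each column's distinct and missing values, the kind of information used to decide how the data must be transformed:
<pre>
# Minimal profiling sketch over a hypothetical "source.csv": report each
# column's distinct-value count, missing-value count, and a few examples.
import csv

with open("source.csv", newline="") as f:
    rows = list(csv.DictReader(f))

for column in rows[0]:
    values = [row[column] for row in rows]
    missing = sum(1 for v in values if v in ("", None))
    print(f"{column}: {len(set(values))} distinct, {missing} missing, "
          f"examples: {[v for v in values if v][:3]}")
</pre>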
'''Data mapping''' is the process of defining how individual fields are mapped, modified, joined, filtered, aggregated etc. to produce the final desired output. Developers or technical data analysts traditionally perform data mapping since they work in the specific technologies to define the transformation rules (e.g. visual [[Extract, transform, load|ETL]] tools,<ref>DWBIMASTER. Top 10 ETL Tools. Retrieved from: http://dwbimaster.com/top-10-etl-tools/ {{Webarchive|url=https://web.archive.org/web/20170829035105/http://dwbimaster.com/top-10-etl-tools/ |date=2017-08-29 }}</ref> transformation languages).
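A mapping can be expressed declaratively. The sketch below (field names and conversions are illustrative, not drawn from any particular tool) renames source fields and applies per-field conversions:
<pre>
# Hypothetical declarative field mapping: source field -> (target field,
# conversion). Real tools express the same idea visually or in a DSL.
FIELD_MAP = {
    "cust_name": ("customer_name", str.strip),
    "dob":       ("date_of_birth", lambda v: v.replace("/", "-")),
    "bal":       ("balance",       float),
}

def apply_mapping(row):
    """Rename and convert the fields of one source record."""
    return {target: convert(row[source])
            for source, (target, convert) in FIELD_MAP.items()}

print(apply_mapping({"cust_name": " Ada ", "dob": "1990/01/02", "bal": "3.50"}))
# {'customer_name': 'Ada', 'date_of_birth': '1990-01-02', 'balance': 3.5}
</pre>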
'''Code generation''' is the process of generating executable code (e.g. SQL, Python, R, or other executable instructions) that will transform the data based on the desired and defined data mapping rules.<ref>Petr Aubrecht, Zdenek Kouba. Metadata Driven Data Transformation.</ref>
'''Code execution''' is the step whereby the generated code is executed against the data to create the desired output. The executed code may be tightly integrated into the transformation tool, or it may require separate steps by the developer to manually execute the generated code.
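The two steps can be illustrated together. In this minimal sketch (the mapping rules and table names are assumptions), Python generates a SQL statement from mapping rules and then executes it against an in-memory SQLite database:
<pre>
# Code generation + execution sketch: build SQL from hypothetical mapping
# rules, then run it against an in-memory SQLite database.
import sqlite3

rules = {"customer_name": "TRIM(cust_name)", "balance": "CAST(bal AS REAL)"}

# Code generation: turn the mapping rules into an executable statement.
select_list = ", ".join(f"{expr} AS {col}" for col, expr in rules.items())
sql = f"CREATE TABLE target AS SELECT {select_list} FROM source"

# Code execution: run the generated code against the data.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE source (cust_name TEXT, bal TEXT)")
con.execute("INSERT INTO source VALUES ('  Ada  ', '3.50')")
con.execute(sql)
print(con.execute("SELECT * FROM target").fetchall())  # [('Ada', 3.5)]
</pre>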
===Batch data transformation===
Traditionally, data transformation has been a bulk or batch process,<ref name="tdwi.org">TDWI. 10 Rules for Real-Time Data Integration. Retrieved from: https://tdwi.org/Articles/2012/12/11/10-Rules-Real-Time-Data-Integration.aspx?Page=1 {{Webarchive|url=https://web.archive.org/web/20170829032504/https://tdwi.org/Articles/2012/12/11/10-Rules-Real-Time-Data-Integration.aspx?Page=1 |date=2017-08-29 }}</ref> whereby developers write code or implement transformation rules in a data integration tool, and then execute that code or those rules on large volumes of data.<ref name="andrefreitas.org">Tope Omitola, Andr´e Freitas, Edward Curry, Sean O'Riain, Nicholas Gibbins, and Nigel Shadbolt. Capturing Interactive Data Transformation Operations using Provenance Workflows Retrieved from: http://andrefreitas.org/papers/preprint_capturing%20interactive_data_transformation_eswc_highlights.pdf {{Webarchive|url=https://web.archive.org/web/20160131145724/http://andrefreitas.org/papers/preprint_capturing%20interactive_data_transformation_eswc_highlights.pdf |date=2016-01-31 }}</ref> This process can follow the linear set of steps as described in the data transformation process above.
Batch data transformation is the cornerstone of virtually all data integration technologies such as data warehousing, data migration and application integration.<ref name="cio.com"/>
When data must be transformed and delivered with low latency, the term "microbatch" is often used.<ref name="tdwi.org"/> This refers to small batches of data (e.g. a small number of rows or a small set of data objects) that can be processed very quickly and delivered to the target system when needed.
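As a rough sketch of the micro-batch idea (the batch size and record source are illustrative), incoming records are grouped into small batches and each batch is delivered as soon as it is processed:
<pre>
# Micro-batch sketch: deliver small batches quickly rather than waiting
# to process one large bulk load. The record source is a stand-in.
from itertools import islice

def micro_batches(records, size=100):
    """Yield successive small batches from an iterable of records."""
    it = iter(records)
    while batch := list(islice(it, size)):
        yield batch

incoming = ({"id": i} for i in range(250))  # hypothetical record stream
for batch in micro_batches(incoming, size=100):
    print(f"delivered {len(batch)} records")  # 100, 100, 50
</pre>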
===Limitations of traditional data transformation===
This traditional process also has limitations that hamper its overall efficiency and effectiveness.<ref name="cio.com"/><ref name="livinglab.mit.edu"/><ref name="andrefreitas.org"/>
The people who need to use the data (e.g. business users) do not play a direct role in the data transformation process.<ref name="digital.lib.washington.edu">Morton, Kristi -- Interactive Data Integration and Entity Resolution for Exploratory Visual Data Analytics. Retrieved from: https://digital.lib.washington.edu/researchworks/handle/1773/35165 {{Webarchive|url=https://web.archive.org/web/20170907043519/https://digital.lib.washington.edu/researchworks/handle/1773/35165 |date=2017-09-07 }}</ref> Typically, users hand over the data transformation task to developers who have the necessary coding or technical skills to define the transformations and execute them on the data.<ref name="The Value of Data Transformation"/>
This process leaves the bulk of the work of defining the required transformations to the developer, who often does not have the same ___domain knowledge as the business user. The developer interprets the business user's requirements and implements them in code, which can introduce errors (through misinterpreted requirements) and increases the time needed to arrive at a solution.
This problem has given rise to the need for agility and self-service in data integration (i.e. empowering the user of the data and enabling them to transform the data themselves interactively).<ref name="andrefreitas.org"/><ref name="ReferenceA"/>
Several companies provide self-service data transformation tools, which aim to let users efficiently analyze, map, and transform large volumes of data without the technical knowledge and process complexity that currently exist. While these tools still use traditional batch transformation, they enable more interactivity for users through visual platforms and easily repeated scripts.<ref>{{Cite news|url=https://www.datanami.com/2016/05/31/self-service-prep-killer-app-big-data/|title=Why Self-Service Prep Is a Killer App for Big Data|date=2016-05-31|work=Datanami|access-date=2017-09-20|language=en-US|archive-date=2017-09-21|archive-url=https://web.archive.org/web/20170921001724/https://www.datanami.com/2016/05/31/self-service-prep-killer-app-big-data/|url-status=live}}</ref>
Still, there might be some compatibility issues (e.g. new data sources like [[Internet of Things|IoT]] may not work correctly with older tools) and compliance limitations due to the difference in [[data governance]], preparation and audit practices.<ref>{{Cite web |last=Sergio |first=Pablo |date=2022-05-27 |title=Your Practical Guide to Data Transformation |url=https://blog.coupler.io/what-is-data-transformation/ |access-date=2022-07-08 |website=Coupler.io Blog |language=en-US |archive-date=2022-05-17 |archive-url=https://web.archive.org/web/20220517173509/https://blog.coupler.io/what-is-data-transformation/ |url-status=live }}</ref>
===Interactive data transformation===
Interactive data transformation (IDT)<ref name="andrefreitas.org"/> is an emerging capability that allows business analysts and business users to interact directly with large datasets through a visual interface,<ref name="digital.lib.washington.edu"/> understand the characteristics of the data (via automated data profiling or visualization), and change or correct the data through simple interactions such as clicking or selecting certain elements of the data.<ref name="livinglab.mit.edu"/>
Although IDT follows the same data integration process steps as batch data integration, the key difference is that the steps are not necessarily followed in a linear fashion, and the process typically does not require significant technical skills for completion.
Once users have finished transforming the data, the system can generate executable code/logic, which can be executed or applied to subsequent similar data sets.
By removing the developer from the process, interactive data transformation systems shorten the time needed to prepare and transform the data, eliminate costly errors in the interpretation of user requirements, and empower business users and analysts to control their data and interact with it as needed.
==Transformational languages==
There are numerous languages available for performing data transformation. Many [[transformation language]]s require a [[grammar]] to be provided. In many cases, the grammar is structured using something closely resembling [[Backus–Naur form]] (BNF). Such languages vary in their accessibility (cost) and general usefulness; examples include:
* [[AWK]] - one of the oldest and most popular textual data transformation languages;
* [[Perl]] - a high-level language with both procedural and object-oriented syntax, capable of powerful operations on binary or text data;
* [[Web template|Template languages]] - specialized to transform data into documents (see also [[template processor]]);
* [[TXL (programming language)|TXL]] - prototyping language-based descriptions, used for source code or data transformation.
* [[XSLT]] - the standard XML data transformation language (complemented by [[XQuery]] in many applications).
Additionally, companies such as Trifacta and Paxata have developed ___domain-specific transformational languages (DSLs) for servicing and transforming datasets. The development of ___domain-specific languages has been linked to increased productivity and accessibility for non-technical users.<ref>{{Cite web|url=https://docs.trifacta.com/display/PE/Wrangle+Language|title=Wrangle Language - Trifacta Wrangler - Trifacta Documentation|website=docs.trifacta.com|access-date=2017-09-20|archive-date=2017-09-21|archive-url=https://web.archive.org/web/20170921045735/https://docs.trifacta.com/display/PE/Wrangle+Language|url-status=live}}</ref> Trifacta's "Wrangle" is an example of such a ___domain-specific language.
Another advantage of the recent trend toward ___domain-specific transformational languages is that such a language can abstract the underlying execution of the logic it defines: the same logic can be reused across various processing engines, such as Spark, MapReduce, and Dataflow. In other words, the transformation logic is not tied to a particular underlying engine.
Although transformational languages are typically best suited for transformation, something as simple as [[regular expression]]s can be used to achieve useful transformation. A text editor like [[Vim (text editor)|vim]], [[emacs]] or [[TextPad]] supports the use of regular expressions with arguments. This would allow all instances of a particular pattern to be replaced with another pattern using parts of the original pattern. For example:
<pre>
foo ("some string", 42, gCommon);
bar (someObj, anotherObj);

foo ("another string", 24, gCommon);
bar (myObj, myOtherObj);
</pre>
might both be transformed into a more compact form like:
<pre>
foobar ("some string", 42, gCommon, someObj, anotherObj);
foobar ("another string", 24, gCommon, myObj, myOtherObj);
</pre>
In other words, all instances of a function invocation of foo with three arguments, followed by a function invocation of bar with two arguments, would be replaced with a single function invocation of foobar using some or all of the original set of arguments.
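For instance, the replacement described above could be approximated with Python's regular expression support; the pattern below is a sketch that assumes the calls appear in the adjacent form shown:
<pre>
# Regex sketch of the foo/bar consolidation described above. Capture
# foo's argument list and bar's, then emit a single foobar call.
import re

source = '''foo ("some string", 42, gCommon);
bar (someObj, anotherObj);

foo ("another string", 24, gCommon);
bar (myObj, myOtherObj);'''

pattern = r'foo \(([^;]*)\);\s*bar \(([^;]*)\);'
print(re.sub(pattern, r'foobar (\1, \2);', source))
# foobar ("some string", 42, gCommon, someObj, anotherObj);
# foobar ("another string", 24, gCommon, myObj, myOtherObj);
</pre>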
==See also==
* [[Data cleansing]]
* [[Data mapping]]
==External links==
* [https://en.wikiversity.org/wiki/Digital_Libraries/File_formats,_transformation,_migration File Formats, Transformation, and Migration], a related Wikiversity article
{{Data warehouse}}
[[Category:Metadata]]
[[Category:Articles with example C++ code]]