Data transformation (computing)
{{COI|date=October 2017}}
{{about|metadata transformation in computer science|the statistical concept|Data transformation (statistics)}}
{{Data transformation}}
 
In computing, '''data transformation''' is the process of converting data from one format or structure into another. It is a fundamental aspect of most [[data integration]]<ref name="cio.com">CIO.com. Agile Comes to Data Integration. Retrieved from: https://www.cio.com/article/2378615/data-management/agile-comes-to-data-integration.html</ref> and [[data management]] tasks, such as [[data wrangling]], [[data warehousing]], and application integration.
 
Data transformation can be simple or complex based on the required changes to the data between the source (initial) data and the target (final) data. Data transformation is typically performed via a mixture of manual and automated steps.<ref name="livinglab.mit.edu">DataXFormer. Morcos, Abedjan, Ilyas, Ouzzani, Papotti, Stonebraker. An interactive data transformation tool. Retrieved from: http://livinglab.mit.edu/wp-content/uploads/2015/12/DataXFormer-An-Interactive-Data-Transformation-Tool.pdf</ref> Tools and technologies used for data transformation can vary widely based on the format, structure, complexity, and volume of the data being transformed.
 
A [[master data]] recast is another form of data transformation where the entire [[database]] of data values is transformed or recast without extracting the data from the database. All data in a well designed database is directly or indirectly related to a limited set of master [[database table]]s by a network of [[foreign key]] constraints. Each foreign key constraint is dependent upon a unique [[database index]] from the parent database table. Therefore, when the proper master database table is recast with a different unique index, the directly and indirectly related data are also recast or restated. The directly and indirectly related data may also still be viewed in the original form since the original unique index still exists with the master data. Also, the database recast must be done in such a way as to not impact the [[applications architecture]] software.
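The cascading behaviour described above can be sketched with Python's standard <code>sqlite3</code> module; the customer/orders schema and key values below are invented for illustration, not taken from any particular system:

```python
import sqlite3

# Illustrative schema: a master table and a dependent table linked by a
# foreign key that cascades updates, so recasting the master key in place
# also restates the directly related rows.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE customer (customer_id TEXT PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id TEXT REFERENCES customer(customer_id) ON UPDATE CASCADE
    )
""")
conn.execute("INSERT INTO customer VALUES ('C-001', 'Acme')")
conn.execute("INSERT INTO orders VALUES (1, 'C-001')")

# Recast the master key; no data is extracted from the database.
conn.execute(
    "UPDATE customer SET customer_id = 'CUST-0001' WHERE customer_id = 'C-001'"
)
restated = conn.execute("SELECT customer_id FROM orders").fetchone()[0]
print(restated)  # the related order row now carries the recast key
```

The foreign-key constraint does the restating: only the master table is updated directly, and the dependent data follows.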
 
==Data Transformation Process==
Data transformation can be divided into the following steps, each applicable as needed based on the complexity of the transformation required:
 
* Data discovery
* Data mapping
* Code generation
* Code execution
* Data review
'''Data discovery''' is the first step in the data transformation process. Typically, the data is profiled using profiling tools, or sometimes using manually written profiling scripts, to better understand its structure and characteristics and to decide how the data needs to be transformed.
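As an illustration, a minimal profiling script can summarize each column before any transformation decisions are made; the type heuristic and the sample data below are invented for the sketch:

```python
import csv
import io
from collections import Counter

def profile(csv_text):
    """For each column, count empty values and roughly classify the rest,
    to inform later transformation decisions (a toy heuristic)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    report = {}
    for col in rows[0].keys():
        types = Counter(
            "empty" if v == "" else
            "int" if v.lstrip("-").isdigit() else
            "text"
            for v in (r[col] for r in rows)
        )
        report[col] = dict(types)
    return report

sample = "id,amount,note\n1,100,ok\n2,,late\n3,250,\n"
report = profile(sample)
print(report)
```

A real discovery step would use a profiling tool and cover distributions, key candidates and value ranges; the point here is only that profiling output drives the mapping decisions that follow.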
 
'''Data mapping''' is the process of defining how individual fields are mapped, modified, joined, filtered, aggregated, etc., to produce the final desired output. Developers or technical data analysts traditionally perform data mapping since they work in the specific technologies to define the transformation rules (e.g. visual ETL tools,<ref>DWBIMASTER. Top 10 ETL Tools. Retrieved from: http://dwbimaster.com/top-10-etl-tools/</ref> transformation languages).
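Such mapping rules can be sketched as a declarative structure; the field names and transformations below are hypothetical examples of the kind of rules an analyst would specify:

```python
# Hypothetical declarative mapping: each target field is defined by a rule
# over the source record (rename, combine, convert units, normalize case).
MAPPING = {
    "full_name":  lambda r: f"{r['first']} {r['last']}",
    "amount_usd": lambda r: round(float(r["amount_cents"]) / 100, 2),
    "country":    lambda r: r["country_code"].upper(),
}

def apply_mapping(record):
    """Produce one output record from one source record."""
    return {target: rule(record) for target, rule in MAPPING.items()}

src = {"first": "Ada", "last": "Lovelace",
       "amount_cents": "1999", "country_code": "gb"}
out = apply_mapping(src)
print(out)
```

In practice such rules live in an ETL tool or a transformation language rather than in hand-written lambdas, but the shape of the definition, target field to rule, is the same.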
 
'''Code generation''' is the process of generating executable code (e.g. SQL, Python, R, or other executable instructions) that will transform the data based on the desired and defined data mapping rules.<ref>Petr Aubrecht, Zdenek Kouba. Metadata driven data transformation. Retrieved from: http://labe.felk.cvut.cz/~aubrech/bin/Sumatra.pdf</ref> Typically, the data transformation technologies generate this code<ref>LearnDataModeling.com. Code Generators. Retrieved from: http://www.learndatamodeling.com/tm_code_generator.php</ref> based on the definitions or metadata defined by the developers.
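A minimal sketch of metadata-driven code generation: a mapping definition is turned into executable SQL text. The table names and column expressions are invented for illustration:

```python
# Mapping metadata: (target column, SQL expression over the source table).
mapping = [
    ("customer_name", "UPPER(name)"),
    ("year_of_birth", "CAST(strftime('%Y', birth_date) AS INTEGER)"),
]

def generate_sql(source_table, target_table, mapping):
    """Generate an INSERT ... SELECT statement from the mapping metadata."""
    cols = ", ".join(f"{expr} AS {target}" for target, expr in mapping)
    return f"INSERT INTO {target_table} SELECT {cols} FROM {source_table}"

sql = generate_sql("raw_customers", "clean_customers", mapping)
print(sql)
```

Real transformation tools generate far richer code (joins, error handling, logging), but the principle is the same: the executable artifact is derived from the mapping definitions rather than written by hand.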
 
'''Code execution''' is the step whereby the generated code is executed against the data to create the desired output. The executed code may be tightly integrated into the transformation tool, or it may require separate steps by the developer to manually execute the generated code.
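The execution step can be sketched with Python's standard <code>sqlite3</code> module; the "generated" SQL is written inline here, and the tables and data are invented:

```python
import sqlite3

# SQL that a generation step might have produced (inlined for the sketch).
generated_sql = (
    "INSERT INTO clean_customers SELECT id, UPPER(name) FROM raw_customers"
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE clean_customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO raw_customers VALUES (?, ?)",
                 [(1, "alice"), (2, "bob")])

# Execute the generated code against the source data to produce the target.
conn.execute(generated_sql)
result = conn.execute("SELECT name FROM clean_customers ORDER BY id").fetchall()
print(result)
```

Whether this execution happens inside the transformation tool or as a manually run script is exactly the distinction the paragraph above draws.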
 
'''Data review''' is the final step in the process, which focuses on ensuring the output data meets the transformation requirements. It is typically the business user or final end-user of the data that performs this step. Any anomalies or errors in the data that are found are communicated back to the developer or data analyst as new requirements to be implemented in the transformation process.<ref name="cio.com"/>
 
==Types of Data Transformation==
 
===Batch Data Transformation===
Traditionally, data transformation has been a bulk or batch process,<ref name="tdwi.org">TDWI. 10 Rules for Real-Time Data Integration. Retrieved from: https://tdwi.org/Articles/2012/12/11/10-Rules-Real-Time-Data-Integration.aspx?Page=1</ref> whereby developers write code or implement transformation rules in a data integration tool, and then execute that code or those rules on large volumes of data.<ref name="andrefreitas.org">Tope Omitola, André Freitas, Edward Curry, Sean O'Riain, Nicholas Gibbins, and Nigel Shadbolt. Capturing Interactive Data Transformation Operations using Provenance Workflows. Retrieved from: http://andrefreitas.org/papers/preprint_capturing%20interactive_data_transformation_eswc_highlights.pdf</ref> This process can follow the linear set of steps described in the data transformation process above.
 
Batch data transformation is the cornerstone of virtually all data integration technologies such as data warehousing, data migration and application integration.<ref name="cio.com"/>
 
When data must be transformed and delivered with low latency, the term "microbatch" is often used.<ref>TDWI. 10 Rules for Real-Time Data Integration. Retrieved from: https://tdwi.org/Articles/2012/12/11/10-Rules-Real-Time-Data-Integration.aspx?Page=1</ref> This refers to small batches of data (e.g. a small number of rows or small set of data objects) that can be processed very quickly and delivered to the target system when needed.
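A microbatching loop can be sketched as follows; the batch size and the data stream are illustrative:

```python
from itertools import islice

def microbatches(stream, size):
    """Cut an incoming stream into small batches that can each be
    transformed and delivered with low latency."""
    it = iter(stream)
    while batch := list(islice(it, size)):
        yield batch

events = range(7)  # stand-in for a stream of rows or data objects
batches = [b for b in microbatches(events, 3)]
print(batches)
```

Each batch can then be passed through the same transformation logic a full batch run would use, just on a much smaller and more frequent scale.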
 
===Benefits of Batch Data Transformation===
Traditional data transformation processes have served companies well for decades. The various tools and technologies (data profiling, data visualization, data cleansing, data integration etc.) have matured and most (if not all) enterprises transform enormous volumes of data that feed internal and external applications, data warehouses and other data stores.<ref name="The Value of Data Transformation">The Value of Data Transformation</ref>
 
===Limitations of Traditional Data Transformation===
This traditional process also has limitations that hamper its overall efficiency and effectiveness.<ref name="cio.com"/><ref name="livinglab.mit.edu"/><ref name="andrefreitas.org"/>
 
The people who need to use the data (e.g. business users) do not play a direct role in the data transformation process.<ref name="digital.lib.washington.edu">Morton, Kristi. Interactive Data Integration and Entity Resolution for Exploratory Visual Data Analytics. Retrieved from: https://digital.lib.washington.edu/researchworks/handle/1773/35165</ref> Typically, users hand over the data transformation task to developers who have the necessary coding or technical skills to define the transformations and execute them on the data.<ref name="The Value of Data Transformation"/>
 
This process leaves the bulk of the work of defining the required transformations to the developer, who interprets the business user's requirements and implements the related code/logic. This has the potential of introducing errors into the process (through misinterpreted requirements), and also increases the time to arrive at a solution.<ref>Morton, Kristi. Interactive Data Integration and Entity Resolution for Exploratory Visual Data Analytics. Retrieved from: https://digital.lib.washington.edu/researchworks/handle/1773/35165</ref><ref name="ReferenceA">McKinsey.com. Using Agile to Accelerate Data Transformation</ref>
 
This problem has given rise to the need for agility and self-service in data integration (i.e. empowering the user of the data and enabling them to transform the data themselves interactively).<ref>Tope Omitola, André Freitas, Edward Curry, Sean O'Riain, Nicholas Gibbins, and Nigel Shadbolt. Capturing Interactive Data Transformation Operations using Provenance Workflows. Retrieved from: http://andrefreitas.org/papers/preprint_capturing%20interactive_data_transformation_eswc_highlights.pdf</ref><ref>McKinsey.com. Using Agile to Accelerate Data Transformation</ref>
 
There are companies that provide self-service data transformation tools. They are aiming to efficiently analyze, map and transform large volumes of data without the technical and process complexity that currently exists. While these companies use traditional batch transformation, their tools enable more interactivity for users through visual platforms and easily repeated scripts.<ref>{{Cite news|url=https://www.datanami.com/2016/05/31/self-service-prep-killer-app-big-data/|title=Why Self-Service Prep Is a Killer App for Big Data|date=2016-05-31|work=Datanami|access-date=2017-09-20|language=en-US}}</ref>
 
===Interactive Data Transformation===
Interactive data transformation (IDT)<ref>Tope Omitola, André Freitas, Edward Curry, Sean O'Riain, Nicholas Gibbins, and Nigel Shadbolt. Capturing Interactive Data Transformation Operations using Provenance Workflows. Retrieved from: http://andrefreitas.org/papers/preprint_capturing%20interactive_data_transformation_eswc_highlights.pdf</ref> is an emerging capability that allows business analysts and business users to interact directly with large datasets through a visual interface,<ref>Morton, Kristi. Interactive Data Integration and Entity Resolution for Exploratory Visual Data Analytics. Retrieved from: https://digital.lib.washington.edu/researchworks/handle/1773/35165</ref> understand the characteristics of the data (via automated data profiling or visualization), and change or correct the data through simple interactions such as clicking or selecting certain elements of the data.<ref>DataXFormer. Morcos, Abedjan, Ilyas, Ouzzani, Papotti, Stonebraker. An interactive data transformation tool. Retrieved from: http://livinglab.mit.edu/wp-content/uploads/2015/12/DataXFormer-An-Interactive-Data-Transformation-Tool.pdf</ref>
 
Although IDT follows the same data integration process steps as batch data integration, the key difference is that the steps are not necessarily followed in a linear fashion and typically don’t require significant technical skills for completion.<ref>Peng Cong, Zhang Xiaoyi. Research and Design of Interactive Data Transformation and Migration System for Heterogeneous Data Sources. Retrieved from: http://ieeexplore.ieee.org/document/5211525/</ref>
 
A number of companies, primarily start-ups such as Trifacta, Alteryx and Paxata, provide interactive data transformation tools, aiming to let users efficiently analyze, map and transform large volumes of data without the technical and process complexity that currently exists.
 
IDT solutions provide an integrated visual interface that combines the previously disparate steps of data analysis, data mapping, code generation/execution and data inspection.<ref name="The Value of Data Transformation"/> IDT interfaces incorporate visualization to show the user patterns and anomalies in the data so they can identify erroneous or outlying values.<ref>Morton, Kristi. Interactive Data Integration and Entity Resolution for Exploratory Visual Data Analytics. Retrieved from: https://digital.lib.washington.edu/researchworks/handle/1773/35165</ref>
 
Once users have finished transforming the data, the system can generate executable code/logic, which can be executed or applied to subsequent similar data sets.
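One way to sketch this replay of recorded interactions: each interactive edit is captured as a named operation, and the resulting list is the reusable logic applied to later data sets. The operations below are invented examples:

```python
# Interactive edits recorded as (label, operation) pairs; replaying the
# list reproduces the user's transformation on a new, similar data set.
recorded_ops = [
    ("strip_whitespace", lambda v: v.strip()),
    ("title_case",       lambda v: v.title()),
]

def replay(values, ops):
    """Apply the recorded operations, in order, to every value."""
    for _, op in ops:
        values = [op(v) for v in values]
    return values

out = replay(["  ada lovelace ", "alan turing"], recorded_ops)
print(out)
```

Production IDT systems record far richer operations (joins, splits, type casts) and typically compile them to SQL or a dataflow job, but the capture-then-replay structure is the essential point.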
 
By removing the developer from the process, IDT systems shorten the time needed to prepare and transform the data, eliminate costly errors in interpretation of user requirements, and empower business users and analysts to control their data and interact with it as needed.<ref>McKinsey.com. Using Agile to Accelerate Data Transformation</ref>
 
==Transformational languages==
Numerous languages are available for performing data transformation, varying in their accessibility (cost) and general usefulness.<ref>DMOZ. Extraction and Transformation. Retrieved from: https://dmoztools.net/Computers/Software/Databases/Data_Warehousing/Extraction_and_Transformation/</ref> Many [[transformation language]]s require a [[grammar]] to be provided; in many cases, the grammar is structured using something closely resembling [[Backus–Naur form|Backus–Naur Form (BNF)]]. Examples of such languages include:
* [[AWK]] - one of the oldest and most popular textual data transformation languages;
* [[Perl]] - a high-level language with both procedural and object-oriented syntax capable of powerful operations on binary or text data.
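For illustration, the kind of record-oriented text transformation these languages excel at (split each line into fields, filter on a field, reformat the output) can be sketched in Python; the input format and the threshold are invented:

```python
import io

# Stand-in for a text file of whitespace-separated records:
# name, year, score (an invented format).
data = io.StringIO("alice 2024 17\nbob 2024 42\ncarol 2023 99\n")

# Select records whose third field exceeds a threshold and reformat them,
# in the spirit of a one-line AWK program such as: $3 > 40 { print $1 ":" $3 }
selected = []
for line in data:
    name, year, score = line.split()
    if int(score) > 40:
        selected.append(f"{name}:{score}")

print(selected)
```

A dedicated transformation language expresses this far more tersely; the sketch only shows the common pattern of per-record field splitting, filtering and reformatting.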
 
* [[v:2-c (8-d): File formats, transformation, migration|File Formats, Transformation, and Migration]] (related wikiversity article)
 
* [[Data cleansing|Data Cleansing]]
* [[Data mapping|Data Mapping]]