Data preprocessing: Difference between revisions

Revision as of 15:58, 13 June 2023 by Nmacpherson (talk \| contribs): m Reverted edits by 208.104.252.252 (talk) (AV). Tags: AntiVandal, Rollback.
Latest revision as of 14:17, 25 August 2025 by Citation bot (talk \| contribs): Added article-number. Removed URL that duplicated identifier. Removed parameters. Some additions/deletions were parameter name changes. Suggested by Headbomb. Linked from Wikipedia:WikiProject_Academic_Journals/Journals_cited_by_Wikipedia/Sandbox. #UCB_webform_linked 602/1032
(39 intermediate revisions by 16 users not shown)
Line 1:
{{Short description\|Manipulation of data before it is analyzed}}
'''Data preprocessing''' can refer to manipulation or dropping of data before it is used in order to ensure or enhance performance,<ref>{{Cite web\|title=Guide To Data Cleaning: Definition, Benefits, Components, And How To Clean Your Data\|url=https://www.tableau.com/learn/articles/what-is-data-cleaning\|access-date=2021-10-17\|website=Tableau\|language=en-US}}</ref> and is an important step in the [[data mining]] process. The phrase [[GIGO\|"garbage in, garbage out"]] is particularly applicable to [[data mining]] and [[machine learning]] projects. [[Data collection\|Data-gathering]] methods are often loosely controlled, resulting in [[range error\|out-of-range]] values (e.g., Income: −100), impossible data combinations (e.g., Sex: Male, Pregnant: Yes), and [[missing values]], etc.
{{Citations needed\|date=August 2023}}
'''Data preprocessing''' can refer to manipulation, filtration or augmentation of data before it is analyzed,<ref>{{Cite web\|title=Guide To Data Cleaning: Definition, Benefits, Components, And How To Clean Your Data\|url=https://www.tableau.com/learn/articles/what-is-data-cleaning\|access-date=2021-10-17\|website=Tableau\|language=en-US}}</ref> and is often an important step in the [[data mining]] process. [[Data collection]] methods are often loosely controlled, resulting in out-of-range values, impossible data combinations, and [[missing values]], amongst other issues. Preprocessing is the process by which unstructured data is transformed into intelligible representations suitable for machine-learning models. This phase deals with the noise and missing values present in the original data set in order to arrive at better results.
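As a minimal sketch of how the problems named above might be detected, the following Python function flags out-of-range values, impossible combinations and missing values in a list of records. The field names (`income`, `sex`, `pregnant`), the sample records and the validity rules are illustrative assumptions, not part of any standard.

```python
# Illustrative sketch: field names and validity rules are assumptions.

def find_problems(record):
    """Return a list of data-quality problems found in one record (a dict)."""
    problems = []
    # Missing values: a field is absent or explicitly None.
    for field in ("income", "sex", "pregnant"):
        if record.get(field) is None:
            problems.append(f"missing:{field}")
    # Out-of-range value: an income cannot be negative.
    income = record.get("income")
    if income is not None and income < 0:
        problems.append("out_of_range:income")
    # Impossible combination of two individually valid values.
    if record.get("sex") == "Male" and record.get("pregnant") == "Yes":
        problems.append("impossible:sex/pregnant")
    return problems


records = [
    {"income": 52000, "sex": "Female", "pregnant": "No"},
    {"income": -100, "sex": "Male", "pregnant": "No"},
    {"income": 41000, "sex": "Male", "pregnant": "Yes"},
    {"income": None, "sex": "Female", "pregnant": "Yes"},
]

# Map each record index to the problems detected in that record.
report = {i: find_problems(r) for i, r in enumerate(records)}
```

Keeping detection separate from any correction step leaves open the choice of whether a flagged record is later deleted, edited or imputed.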
~~Analyzing data that has not been carefully screened for such problems can produce misleading results.~~ The preprocessing pipeline used can often have large effects on the conclusions drawn from the downstream analysis. Thus, ~~the~~ representation and [[data quality\|quality of data]] is ~~first and foremost~~ necessary before running any analysis.<ref>Pyle, D., 1999. ''Data Preparation for Data Mining.'' Morgan Kaufmann Publishers, [[Los Altos, California]].</ref> Often, data preprocessing is the most important phase of a [[machine learning]] project, especially in [[computational biology]].<ref>{{cite journal \| vauthors = Chicco D Line 8 ⟶ 11: \| volume = 10 \| issue = 35 \| ~~pages~~ article-number = 35 \| date = December 2017 \| pmid = 29234465 \| doi = 10.1186/s13040-017-0155-3 \| pmc= 5721660 \| pmc= 5721660}}</ref> If there is much irrelevant and redundant information present or noisy and unreliable data, then [[knowledge discovery]] during the training phase is more difficult. [[Data preparation]] and filtering steps can take considerable amount of processing time. Examples of data preprocessing include [[Data cleaning\|cleaning]], [[instance selection]], [[data normalization\|normalization]], [[One-hot\|one hot encoding]], [[data transformation\|transformation]], [[feature extraction]] and [[Feature selection\|selection]], etc. The product of data preprocessing is the final [[training set]]. \| doi-access = free }}</ref> If there is a high proportion of irrelevant and redundant information present or noisy and unreliable data, then [[knowledge discovery]] during the training phase may be more difficult. [[Data preparation]] and filtering steps can take a considerable amount of processing time. 
Examples of methods used in data preprocessing include [[Data cleaning\|cleaning]], [[instance selection]], [[data normalization\|normalization]], [[One-hot\|one-hot encoding]], [[Data transformation (statistics)\|data transformation]], [[feature extraction]] and [[feature selection]].
==Applications==
Data preprocessing may affect the way in which outcomes of the final data processing can be interpreted.<ref>{{Cite journal\|last1=Oliveri\|first1=Paolo\|last2=Malegori\|first2=Cristina\|last3=Simonetti\|first3=Remo\|last4=Casale\|first4=Monica\|date=2019\|title=The impact of signal preprocessing on the final interpretation of analytical outcomes – A tutorial\|journal=Analytica Chimica Acta\|language=en\|volume=1058\|pages=9–17\|doi=10.1016/j.aca.2018.10.055\|pmid=30851858\|s2cid=73727614}}</ref> This aspect should be carefully considered when interpretation of the results is a key point, such as in the multivariate processing of chemical data ([[chemometrics]]).
===Data mining===
{{Cleanup section\|date=August 2023\|reason=This section requires grammar and capitalisation fixes}}
Data preprocessing allows for the removal of unwanted data through data cleaning, leaving a dataset that contains more valuable information for data manipulation later in the data mining process. Editing such a dataset to correct data corruption or human error is a crucial step in obtaining accurate quantifiers such as true positives, true negatives, [[false positives and false negatives]], which are found in a [[confusion matrix]] and are commonly used in medical diagnosis. Users are able to join data files together and use preprocessing to filter unnecessary noise from the data, which can allow for higher accuracy. Users can write Python scripts with the pandas library, which gives them the ability to import data from a [[comma-separated values]] file as a data-frame.
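The workflow described above (importing comma-separated values as data-frames, joining files and filtering noise) might look like the following sketch. The file contents, the column names and the plausibility rule for `age` are made up for illustration; the CSV data is held in memory so that the example is self-contained.

```python
import io

import pandas as pd

# Two small CSV "files" held in memory; contents and column names are
# illustrative assumptions.
patients_csv = io.StringIO("patient_id,age\n1,34\n2,-5\n3,61\n")
results_csv = io.StringIO("patient_id,test_positive\n1,True\n2,False\n3,True\n")

# Import each CSV source as a data-frame.
patients = pd.read_csv(patients_csv)
results = pd.read_csv(results_csv)

# Join the two files on their shared key column.
merged = patients.merge(results, on="patient_id", how="inner")

# Filter out rows with implausible ages (a crude, assumed noise rule).
cleaned = merged[merged["age"].between(0, 120)].reset_index(drop=True)
```

Here the row with the negative age is dropped, so `cleaned` keeps only the records that pass the assumed rule.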
The data-frame is then used for data manipulation that would otherwise be challenging to do in Excel. [[Pandas (software)\|pandas]] is a powerful tool for data analysis and manipulation that makes data visualization, statistical operations and much more a lot easier. Many also use the [[R (programming language)\|R programming language]] for such tasks. Users transform existing files into new ones for many reasons. Aspects of data preprocessing may include imputing missing values, aggregating numerical quantities and transforming continuous data into categories ([[data binning]]).<ref>{{Cite book \|last1=Hastie \|first1=Trevor \|url=https://books.google.com/books?id=eBSgoAEACAAJ \|title=The Elements of Statistical Learning: Data Mining, Inference, and Prediction \|last2=Tibshirani \|first2=Robert \|last3=Friedman \|first3=Jerome H. \|date=2009 \|publisher=Springer \|isbn=978-0-387-84884-6 \|language=en}}</ref> More advanced techniques like principal component analysis and [[feature selection]] rely on statistical formulas and are applied to complex datasets that are recorded by GPS trackers and motion capture devices. ~~==Tasks of data preprocessing==~~ [[Data cleansing]] [[Data editing]] [[Data reduction]] [[Data wrangling]] ~~==Example==~~ ~~In this example we have 5 Adults in our dataset who have the Sex of Male or Female and whether they are pregnant or not. 
We can detect that Adult 3 and 5 are impossible data combinations.~~ {\| \|- \| ~~{\| class="wikitable" style="border:none; float:left; margin-top:0; text-align:center;"~~ ~~!style="background:white; border:none;" colspan="2" rowspan="2"\|~~ ~~!colspan="2" style="background:none;"\|~~ \|- ~~!Sex~~ ~~!Pregnant~~ \|- ~~!rowspan="5" style="height:6em;background:none;"\|<div>Adult </div>~~ !1 ~~\|Male~~ ~~\|No~~ \|- !2 ~~\|Female~~ ~~\|Yes~~ \|- ~~!<span style="color:red">3</span>~~ ~~\|'''Male'''~~ ~~\|'''Yes'''~~ \|- !4 ~~\|Female~~ ~~\|No~~ \|- ~~!<span style="color:red">5</span>~~ ~~\|'''Male'''~~ ~~\|'''Yes'''~~ \|- \|} \| \|} We can perform a [[Data cleansing]] and choose to delete such data from our table. We remove such data because we can determine that such data existing in the dataset is caused by user entry errors or data corruption. A reason that one might have to delete such data is because the impossible data will affect the calculation or data manipulation process in the later steps of the data mining process. {\| \|- \| ~~{\| class="wikitable" style="border:none; float:left; margin-top:0; text-align:center;"~~ ~~!style="background:white; border:none;" colspan="2" rowspan="2"\|~~ ~~!colspan="2" style="background:none;"\|~~ \|- ~~!Sex~~ ~~!Pregnant~~ \|- ~~!rowspan="3" style="height:6em;background:none;"\|<div>Adult </div>~~ !1 ~~\|Male~~ ~~\|No~~ \|- !2 ~~\|Female~~ ~~\|Yes~~ \|- !4 ~~\|Female~~ ~~\|No~~ \|- \|} \| \|} We can perform a [[Data editing]] and change the Sex of the Adult by knowing that the Adult is Pregnant we can make the assumption that the Adult is Female and make changes accordingly. We edit the dataset to have a clearer analysis of the data when performing data manipulation in the later steps within the data mining process. 
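The two options described above, deleting the impossible records (data cleansing) or correcting them (data editing), can be sketched in Python as follows. The dictionary layout is an assumption made for illustration; the values mirror the five-adult example, and the correction rule (a pregnant adult is assumed to be female) is the assumption stated in the text.

```python
# The five adults from the example, keyed by adult number.
adults = {
    1: {"sex": "Male", "pregnant": "No"},
    2: {"sex": "Female", "pregnant": "Yes"},
    3: {"sex": "Male", "pregnant": "Yes"},
    4: {"sex": "Female", "pregnant": "No"},
    5: {"sex": "Male", "pregnant": "Yes"},
}


def is_impossible(row):
    """True for the impossible combination Sex: Male, Pregnant: Yes."""
    return row["sex"] == "Male" and row["pregnant"] == "Yes"


# Data cleansing: drop the impossible records entirely.
cleansed = {k: v for k, v in adults.items() if not is_impossible(v)}

# Data editing: keep every record, but correct the sex of impossible ones,
# assuming that a pregnant adult is female. Copies are made so that the
# original data is left unchanged.
edited = {
    k: ({**v, "sex": "Female"} if is_impossible(v) else dict(v))
    for k, v in adults.items()
}
```

Cleansing leaves adults 1, 2 and 4, while editing keeps all five adults and changes adults 3 and 5 to female.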
{\| \|- \| ~~{\| class="wikitable" style="border:none; float:left; margin-top:0; text-align:center;"~~ ~~!style="background:white; border:none;" colspan="2" rowspan="2"\|~~ ~~!colspan="2" style="background:none;"\|~~ \|- ~~!Sex~~ ~~!Pregnant~~ \|- ~~!rowspan="5" style="height:6em;background:none;"\|<div>Adult </div>~~ !1 ~~\|Male~~ ~~\|No~~ \|- !2 ~~\|Female~~ ~~\|Yes~~ \|- ~~!<span style="color:blue">3</span>~~ ~~\|'''Female'''~~ ~~\|'''Yes'''~~ \|- !4 ~~\|Female~~ ~~\|No~~ \|- ~~!<span style="color:blue">5</span>~~ ~~\|'''Female'''~~ ~~\|'''Yes'''~~ \|- \|} \| \|} ~~We can use a form of [[Data reduction]] and sort the data by Sex and by doing this we can simplify our dataset and choose what Sex we want to focus on more.~~ {\| \|- \| ~~{\| class="wikitable" style="border:none; float:left; margin-top:0; text-align:center;"~~ ~~!style="background:white; border:none;" colspan="2" rowspan="2"\|~~ ~~!colspan="2" style="background:none;"\|~~ \|- ~~!Sex~~ ~~!Pregnant~~ \|- ~~!rowspan="5" style="height:6em;background:none;"\|<div>Adult </div>~~ !2 ~~\|Female~~ ~~\|Yes~~ \|- !4 ~~\|Female~~ ~~\|No~~ \|- !1 ~~\|Male~~ ~~\|No~~ \|- !3 ~~\|Male~~ ~~\|Yes~~ \|- !5 ~~\|Male~~ ~~\|Yes~~ \|- \|} \| \|} ~~==Data mining==~~ The origins of data preprocessing are located in [[data mining]].{{cn\|date=March 2021}} The idea is to aggregate existing information and search in the content. Later it was recognized, that for machine learning and neural networks a data preprocessing step is needed too. So it has become to a universal technique which is used in computing in general. ===Semantic data preprocessing=== Data preprocessing allows for the removal of unwanted data with the use of data cleaning, this allows the user to have a dataset to contain more valuable information after the preprocessing stage for data manipulation later in the data mining process. 
Editing such dataset to either correct data corruption or human error is a crucial step to get accurate quantifiers like true positives, true negatives, [[False positives and false negatives]] found in a [[Confusion matrix]] that are commonly used for a medical diagnosis. Users are able to join data files together and use preprocessing to filter any unnecessary noise from the data which can allow for higher accuracy. Users use Python programming scripts accompanied by the pandas library which gives them the ability to import data from a [[Comma-separated values]] as a data-frame. The data-frame is then used to manipulate data that can be challenging otherwise to do in Excel. [[pandas (software)]] which is a powerful tool that allows for data analysis and manipulation; which makes data visualizations, statistical operations and much more, a lot easier. Many also use the [[R (programming language)]] to do such tasks as well. The reason why a user transforms existing files into a new one is because of many reasons. Data preprocessing has the objective to add missing values, aggregate information, label data with categories ([[Data binning]]) and smooth a trajectory.{{cn\|date=March 2021}} More advanced techniques like principal component analysis and [[feature selection]] are working with statistical formulas and are applied to complex datasets which are recorded by GPS trackers and motion capture devices. ~~==Semantic data preprocessing==~~ Semantic data mining is a subset of data mining that specifically seeks to incorporate [[___domain knowledge]], such as formal semantics, into the data mining process. Domain knowledge is the knowledge of the environment the data was processed in. 
Domain knowledge can have a positive influence on many aspects of data mining, such as filtering out redundant or inconsistent data during the preprocessing phase.<ref>{{cite web \|title=Semantic Data Mining: A Survey of Ontology-based Approaches \|author=Dou, Deijing and Wang, Hao and Liu, Haishan \|publisher=University of Oregon \|url=http://ix.cs.uoregon.edu/~dou/research/papers/icsc15_invited.pdf \|language=en-US}}</ref> Domain knowledge also works as a constraint: it serves as a set of prior knowledge that reduces the space required for searching and acts as a guide to the data. Simply put, semantic preprocessing seeks to filter data more correctly and efficiently by using the original environment of said data. Increasingly complex problems call for more elaborate techniques to better analyze existing information.{{Fact or opinion\|date=August 2023}} Instead of creating a simple script for aggregating different numerical values into a single value, it makes sense to focus on semantic-based data preprocessing.<ref>{{cite conference \|title=An ontology-based framework for semantic data preprocessing aimed at human activity recognition \|author=Culmone, Rosario and Falcioni, Marco and Quadrini, Michela \|s2cid=196091422 \|conference=SEMAPRO 2014: The Eighth International Conference on Advances in Semantic Processing. 
Alexey Cheptsov, High Performance Computing Center Stuttgart (HLRS) \|year=2014 }}</ref> The idea is to build a dedicated [[Ontology (information science)\|ontology]], which explains on a higher level what the problem is about.<ref>{{cite conference \|doi=10.1007/11946465_24 \|year=2006 \|publisher=Springer Berlin Heidelberg \|pages=262–272 \|author=David Perez-Rey and Alberto Anguita and Jose Crespo \|title=OntoDataClean: Ontology-Based Integration and Preprocessing of Distributed Data \|conference=Biological and Medical Data Analysis }}</ref> In regards to semantic data mining and semantic pre-processing, ontologies are a way to conceptualize and formally define semantic knowledge and data. The [[Protégé (software)]] is the standard tool for constructing an ontology.{{cn\|date=July 2022}} In general, the use of ontologies bridges the gaps between data, applications, algorithms, and results that occur from semantic mismatches. As a result, semantic data mining combined with ontology has many applications where semantic ambiguity can impact the usefulness and efficiency of data systems.{{cn\|date=August 2023}} Applications include the medical field, language processing, banking,<ref>{{cite book \|chapter=Semantic Data Pre-Processing for Machine Learning Based Bankruptcy Prediction Computational Model \|author=Yerashenia, Natalia and Bolotov, Alexander and Chan, David and Pierantoni, Gabriele \|title=2020 IEEE 22nd Conference on Business Informatics (CBI) \|year=2020 \|pages=66–75 \|publisher=IEEE \|doi=10.1109/CBI49978.2020.00015 \|isbn=978-1-7281-9926-9 \|s2cid=219499599 \|url=https://westminsterresearch.westminster.ac.uk/download/6b3387bc3e53e8c935cb4267be3c7b04fe410b5e5019edbc692a53d0b6ae4d65/3538863/CBI_2020_Yereashenia_et_al.pdf \|chapter-url=https://ieeexplore.ieee.org/document/9140238}}</ref> and even tutoring,<ref>{{cite journal \|title=Building Ontology-Driven Tutoring Models for Intelligent Tutoring Systems Using Data Mining \|~~author~~last1=Chang, 
\|first1=Maiga ~~and~~ \|last2=D'Aniello, \|first2=Giuseppe ~~and~~ \|last3=Gaeta, \|first3=Matteo ~~and~~ \|last4=Orciuoli, ~~Franceso and~~ \|first4=Francesco \|last5=Sampson, \|first5=Demetrois ~~and~~ \|last6=Simonelli, \|first6=Carmine \|journal=IEEE Access \|year=2020 \|volume=8 \|pages=48151–48162 \|publisher=IEEE \|doi=10.1109/ACCESS.2020.2979281 \|s2cid=214594754 \|doi-access=free \|bibcode=2020IEEEA...848151C }}</ref> among many more. There are various strengths to using a semantic data mining and ontology-based approach. As previously mentioned, these tools can help during the pre-processing phase by filtering out undesirable data from the data set. Additionally, well-structured formal semantics integrated into well-designed ontologies can return powerful data that can be easily read and processed by machines.<ref>{{cite web \|title=Semantic Data Mining: A Survey of Ontology-based Approaches \|author=Dou, Deijing and Wang, Hao and Liu, Haishan \|publisher=University of Oregon \|url=http://ix.cs.uoregon.edu/~dou/research/papers/icsc15_invited.pdf \|language=en-US}}</ref> A particularly useful example of this exists in the medical use of semantic data processing. Suppose a patient is having a medical emergency and is being rushed to hospital. The emergency responders are trying to figure out the best medicine to administer to help the patient. Under normal data processing, scouring all the patient’s medical data to ensure they are getting the best treatment could take too long and risk the patient’s health or even life. However, using semantically processed ontologies, the first responders could save the patient’s life. 
Tools like a semantic reasoner can use [[ontology (information science)\|ontology]] to infer what the best medicine to administer to the patient is based on their medical history, such as if they have a certain cancer or other conditions, simply by examining the natural language used in the patient's medical records.<ref>{{cite web \|title=AN ONTOLOGICAL APPROACH TO DATA MINING FOR EMERGENCY MEDICINE \|author =Kahn, Atif and Doucette, John A. and Jin, Changjiu and Fu Lijie and Cohen, Robin \|publisher=University of Waterloo \|url=https://cs.uwaterloo.ca/~j3doucet/papers/OntApproachToDataMining.pdf \|access-date=2021-12-10 \|archive-date=2023-05-19 \|archive-url=https://web.archive.org/web/20230519161412/https://cs.uwaterloo.ca/~j3doucet/papers/OntApproachToDataMining.pdf \|url-status=dead }}</ref> This would allow the first responders to quickly and efficiently search for medicine without having to worry about the patient’s medical history themselves, as the semantic reasoner would already have analyzed this data and found solutions. In general, this illustrates the strength of using semantic data mining and ontologies. They allow for quicker and more efficient data extraction on the user side, as the user has fewer variables to account for, since the semantically pre-processed data and the ontology built for the data have already accounted for many of these variables. However, there are some drawbacks to this approach. 
Namely, it requires a high amount of computational power and complexity, even with relatively small data sets.<ref>{{cite journal\|title=Semantic data mining in the information age: A systematic review \|author=Sirichanya, Chanmee and Kraisak Kesorn \|year=2021 \|journal=International Journal of Intelligent Systems\|volume=36 \|issue=8 \|pages=3880–3916 \|doi=10.1002/int.22443 \|s2cid=235506360 ~~\|url=https://onlinelibrary.wiley.com/doi/10.1002/int.22443 \|language=en~~ \|language=en \|doi-access=free}}</ref> This could result in higher costs and increased difficulties in building and maintaining semantic data processing systems. This can be mitigated somewhat if the data set is already well organized and formatted, but even then, the complexity is still higher when compared to standard data processing.{{tone inline\|date=August 2023}} Below is a simple diagram combining some of the processes, in particular semantic data mining and their use in ontology.
Line 174 ⟶ 39:
The diagram depicts a data set being broken up into two parts: the characteristics of its ___domain, or ___domain knowledge, and then the actual acquired data. The ___domain characteristics are then processed to become user-understood ___domain knowledge that can be applied to the data. Meanwhile, the data set is processed and stored so that the ___domain knowledge can be applied to it, so that the process may continue. This application forms the ontology. From there, the ontology can be used to analyze data and process results. Fuzzy preprocessing is another, more advanced technique for solving complex problems. Fuzzy preprocessing and ~~Fuzzy~~ fuzzy data mining make use of [[fuzzy sets]]. A fuzzy set is composed of two elements: a set and a membership function for the set, which takes values between 0 and 1. Fuzzy preprocessing uses this fuzzy data set to ground numerical values with linguistic information. Raw data is then transformed into [[natural language]].
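The grounding of numerical values in linguistic terms can be illustrated with a small Python sketch. It assumes triangular membership functions and three hypothetical temperature categories (`cold`, `warm`, `hot`) with made-up boundaries in degrees Celsius; real systems choose the shapes and boundaries of the membership functions to suit their ___domain.

```python
def triangular(x, left, peak, right):
    """Triangular membership: 0 at or beyond left/right, rising to 1 at peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical fuzzy sets for a temperature reading, as (left, peak, right).
TEMPERATURE_SETS = {
    "cold": (-10.0, 0.0, 15.0),
    "warm": (10.0, 20.0, 30.0),
    "hot": (25.0, 35.0, 50.0),
}


def fuzzify(value):
    """Return the degree of membership of value in each linguistic category."""
    return {
        label: triangular(value, *params)
        for label, params in TEMPERATURE_SETS.items()
    }


def to_label(value):
    """Replace a raw number with the linguistic term it belongs to most."""
    memberships = fuzzify(value)
    return max(memberships, key=memberships.get)
```

For instance, a reading of 12.5 belongs partly to both `cold` and `warm`; `to_label` picks the term with the higher degree of membership.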
Ultimately, fuzzy data mining's goal is to help deal with inexact information, such as an incomplete database. Currently, fuzzy preprocessing, as well as other fuzzy-based data mining techniques, sees frequent use with neural networks and artificial intelligence.<ref>{{cite book\| chapter=Fuzzy preprocessing rules for the improvement of an artificial neural network well log interpretation model\| author=Wong, Kok Wai and Fung, Chun Che and Law, Kok Way\| title=2000 TENCON Proceedings. Intelligent Systems and Technologies for the New Millennium (Cat. No.00CH37119)\| year=2000\| volume=1\| pages=400–405\| publisher = IEEE \| doi=10.1109/TENCON.2000.893697\| isbn=0-7803-6355-8\| s2cid=10384426 ~~\|chapter-url=https://ieeexplore.ieee.org/document/893697~~ \| language=en}}</ref>
==References==
Line 180 ⟶ 45:
==External links==
[http://dataprocessing.aixcape.org Online Data Processing Compendium]
[https://www.cambridge.org/core/journals/knowledge-engineering-review/article/data-preprocessing-in-predictive-data-mining/F7F2D7AC540D2815C613BA6575359AAA/share/92b3b50e7ed7363e5946baf406025281d2eb8c02 Data preprocessing in predictive data mining. Knowledge Eng. Review 34: e1 (2019)]
Line 186 ⟶ 52:
[[Category:Machine learning]] [[Category:Data mining]]