Data preprocessing

'''Data preprocessing''' can refer to manipulation or dropping of data before it is used in order to ensure or enhance performance,<ref>{{Cite web|title=Guide To Data Cleaning: Definition, Benefits, Components, And How To Clean Your Data|url=https://www.tableau.com/learn/articles/what-is-data-cleaning|access-date=2021-10-17|website=Tableau|language=en-US}}</ref> and is an important step in the [[data mining]] process. The phrase [[GIGO|"garbage in, garbage out"]] is particularly applicable to [[data mining]] and [[machine learning]] projects. [[Data collection]] methods are often loosely controlled, resulting in out-of-range values, impossible data combinations, and [[missing values]], amongst other issues.
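The three kinds of problems named above (out-of-range values, impossible data combinations, and missing values) can be screened for with simple rule checks. The following is a minimal sketch, not a standard method: the records, field names, and the valid age range of 0 to 120 are assumptions chosen purely for illustration.

```python
# Hypothetical records; field names and valid ranges are illustrative assumptions.
records = [
    {"age": 34, "sex": "F", "pregnant": True},
    {"age": -5, "sex": "M", "pregnant": False},    # out-of-range value
    {"age": 41, "sex": "M", "pregnant": True},     # impossible combination
    {"age": None, "sex": "F", "pregnant": False},  # missing value
]


def screen(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    age = record.get("age")
    if age is None:
        problems.append("missing age")
    elif not 0 <= age <= 120:  # assumed plausible range
        problems.append("age out of range")
    if record.get("sex") == "M" and record.get("pregnant"):
        problems.append("impossible combination")
    return problems


# One problem list per record, in the same order as the input.
report = [screen(r) for r in records]

# Keep only the records that passed every check.
clean = [r for r, p in zip(records, report) if not p]
```

Whether a flagged record is dropped, corrected, or imputed depends on the project; this sketch simply drops it.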
 
Analyzing data that has not been carefully screened for such problems can produce misleading results. Thus, ensuring the representation and [[data quality|quality of data]] is necessary before running any analysis.<ref>Pyle, D., 1999. ''Data Preparation for Data Mining.'' Morgan Kaufmann Publishers, [[Los Altos, California]].</ref>
Often, data preprocessing is the most important phase of a [[machine learning]] project, especially in [[computational biology]].<ref>{{cite journal
| vauthors = Chicco D
| pmid = 29234465
| doi = 10.1186/s13040-017-0155-3
| pmc= 5721660}}</ref> If a high proportion of the data is irrelevant, redundant, noisy, or unreliable, then [[knowledge discovery]] during the training phase may be more difficult. [[Data preparation]] and filtering steps can take a considerable amount of processing time. Examples of methods used in data preprocessing include [[Data cleaning|cleaning]], [[instance selection]], [[data normalization|normalization]], [[One-hot|one-hot encoding]], [[data transformation|transformation]], [[feature extraction]] and [[feature selection|selection]]. The product of data preprocessing is the final [[training set]].
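Two of the methods listed above, normalization and one-hot encoding, can be sketched in a few lines. This is only an illustration under stated assumptions: it uses min-max scaling as one possible form of normalization, and the function names (<code>min_max_normalize</code>, <code>one_hot</code>) and sample values are hypothetical.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(labels):
    """Encode categorical labels as 0/1 vectors, one position per category."""
    categories = sorted(set(labels))
    index = {c: i for i, c in enumerate(categories)}
    encoded = [
        [1 if index[label] == i else 0 for i in range(len(categories))]
        for label in labels
    ]
    return categories, encoded


# Hypothetical sample data.
heights = [150.0, 160.0, 170.0, 190.0]
colors = ["red", "green", "red", "blue"]

scaled = min_max_normalize(heights)     # [0.0, 0.25, 0.5, 1.0]
categories, encoded = one_hot(colors)   # categories: ['blue', 'green', 'red']
```

In practice the minimum, maximum, and category list are usually computed on the training data only and then reused for any later data, so that all data is transformed consistently.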
 
Data preprocessing may affect the way in which outcomes of the final data processing can be interpreted.<ref>{{Cite journal|last1=Oliveri|first1=Paolo|last2=Malegori|first2=Cristina|last3=Simonetti|first3=Remo|last4=Casale|first4=Monica|date=2019|title=The impact of signal preprocessing on the final interpretation of analytical outcomes – A tutorial|journal=Analytica Chimica Acta|language=en|volume=1058|pages=9–17|doi=10.1016/j.aca.2018.10.055|pmid=30851858|s2cid=73727614}}</ref> This aspect should be carefully considered when interpretation of the results is a key point, such as in the multivariate processing of chemical data ([[chemometrics]]).
 
==Tasks of data preprocessing==
*[[Data cleansing]]
*[[Data editing]]
*[[Data reduction]]
*[[Data wrangling]]
 
==Data mining==