Revision as of 11:18, 7 August 2023: EditorOnOccasion (talk | contribs), 436 edits. →top: Add citations needed tag. Tags: Mobile edit, Mobile app edit, Android app edit.
Revision as of 11:22, 7 August 2023: EditorOnOccasion (talk | contribs), 436 edits. →top: Fix grammar and remove duplicated content. Tags: Mobile edit, Mobile app edit, Android app edit.
'''Data preprocessing''' can refer to manipulation or dropping of data before it is used in order to ensure or enhance performance,<ref>{{Cite web|title=Guide To Data Cleaning: Definition, Benefits, Components, And How To Clean Your Data|url=https://www.tableau.com/learn/articles/what-is-data-cleaning|access-date=2021-10-17|website=Tableau|language=en-US}}</ref> and is an important step in the [[data mining]] process. The phrase [[GIGO|"garbage in, garbage out"]] is particularly applicable to [[data mining]] and [[machine learning]] projects. [[Data collection]] methods are often loosely controlled, resulting in out-of-range values, impossible data combinations, and [[missing values]], amongst other issues. Analyzing data that has not been carefully screened for such problems can produce misleading results. Thus, the representation and [[data quality|quality of data]] must be assessed before running any analysis.<ref>Pyle, D., 1999. ''Data Preparation for Data Mining.'' Morgan Kaufmann Publishers, [[Los Altos, California]].</ref> Often, data preprocessing is the most important phase of a [[machine learning]] project, especially in [[computational biology]].<ref>{{cite journal | vauthors = Chicco D … | pmid = 29234465 | doi = 10.1186/s13040-017-0155-3 | pmc= 5721660}}</ref> If there is a high proportion of irrelevant and redundant information present or noisy and unreliable data, then [[knowledge discovery]] during the training phase may be more difficult. [[Data preparation]] and filtering steps can take a considerable amount of processing time. Examples of methods used in data preprocessing include [[Data cleaning|cleaning]], [[instance selection]], [[data normalization|normalization]], [[One-hot|one-hot encoding]], [[data transformation]], [[feature extraction]] and [[feature selection]].
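As an illustration only, several of the methods named above (cleaning out-of-range values, handling missing values, normalization and one-hot encoding) can be sketched in a few lines of plain Python. The records, field names, the valid age range of 0 to 120 and the choice of mean imputation and min-max scaling are all assumptions made for this sketch, not part of any cited source; practical projects would normally rely on a dedicated library.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
raw = [
    {"age": 34, "color": "red"},
    {"age": None, "color": "blue"},
    {"age": 212, "color": "red"},   # out-of-range value
    {"age": 50, "color": "green"},
]

def clean(records, lo=0, hi=120):
    """Cleaning: drop records whose age is present but outside [lo, hi]."""
    return [r for r in records if r["age"] is None or lo <= r["age"] <= hi]

def impute_mean(records):
    """Missing values: replace a missing age with the mean of observed ages."""
    fill = mean(r["age"] for r in records if r["age"] is not None)
    return [{**r, "age": fill if r["age"] is None else r["age"]} for r in records]

def min_max(values):
    """Normalization: rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(labels):
    """One-hot encoding: a 0/1 vector over the sorted distinct labels."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]

def preprocess(records):
    """Chain the steps and return numeric rows usable as a training set."""
    rows = impute_mean(clean(records))
    ages = min_max([r["age"] for r in rows])
    colors = one_hot([r["color"] for r in rows])
    return [[a] + c for a, c in zip(ages, colors)]

training_set = preprocess(raw)
```

The order of the steps is itself a design choice: screening out-of-range values before imputation keeps the implausible age of 212 from distorting the mean used to fill the missing entry.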
The product of data preprocessing is the final [[training set]]. Data preprocessing may affect the way in which outcomes of the final data processing can be interpreted.<ref>{{Cite journal|last1=Oliveri|first1=Paolo|last2=Malegori|first2=Cristina|last3=Simonetti|first3=Remo|last4=Casale|first4=Monica|date=2019|title=The impact of signal preprocessing on the final interpretation of analytical outcomes – A tutorial|journal=Analytica Chimica Acta|language=en|volume=1058|pages=9–17|doi=10.1016/j.aca.2018.10.055|pmid=30851858|s2cid=73727614}}</ref> This aspect should be carefully considered when interpretation of the results is a key point, such as in the multivariate processing of chemical data ([[chemometrics]]).

~~==Tasks of data preprocessing==~~
[[Data cleansing]]
[[Data editing]]
[[Data reduction]]
[[Data wrangling]]

==Data mining==

Data preprocessing: Difference between revisions