Revision as of 08:04, 24 August 2017 by Anomalocaris: Undid revision 789082969 by 117.232.96.253 (restore wikilink)
Revision as of 15:00, 14 December 2017 by Larry.europe: No edit summary
'''Data pre-processing''' is an important step in the [[data mining]] process. The phrase [[GIGO|"garbage in, garbage out"]] is particularly applicable to data mining and [[machine learning]] projects. Data-gathering methods are often loosely controlled, resulting in [[range error|out-of-range]] values (e.g., Income: −100), impossible data combinations (e.g., Sex: Male, Pregnant: Yes), [[missing values]], etc. Analyzing data that has not been carefully screened for such problems can produce misleading results. Thus, the representation and [[data quality|quality of data]] must be assessed first, before running an analysis.<ref>Pyle, D., 1999. ''Data Preparation for Data Mining.'' Morgan Kaufmann Publishers, [[Los Altos, California]].</ref> Often, data pre-processing is the most important phase of a [[machine learning]] project.<ref>{{cite journal | vauthors = Chicco D | title = Ten quick tips for machine learning in computational biology | journal = BioData Mining | volume = 10 | issue = 35 | pages = 1-17 | date = December 2017 | pmid = 29234465 | doi = 10.1186/s13040-017-0155-3 | pmc= 5721660}}</ref> If the data contain much irrelevant or redundant information, or are noisy and unreliable, then [[knowledge discovery]] during the training phase is more difficult. Data preparation and filtering steps can take a considerable amount of processing time. Data pre-processing includes [[data cleaning|cleaning]], [[instance selection]], [[data normalization|normalization]], [[data transformation|transformation]], [[feature extraction]] and [[feature selection|selection]], etc. The product of data pre-processing is the final [[training set]]. Kotsiantis et al. (2006) present a well-known algorithm for each step of data pre-processing.<ref>S. Kotsiantis, D. Kanellopoulos, P. Pintelas, "Data Preprocessing for Supervised Learning", ''International Journal of Computer Science'', 2006, Vol 1 N. 2, pp 111–117.</ref>
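The screening problems named above (out-of-range values, impossible combinations, missing values) and a simple normalization step can be illustrated with a minimal Python sketch. The field names (<code>income</code>, <code>sex</code>, <code>pregnant</code>), the sample records, and the helper functions <code>screen_record</code> and <code>min_max_normalize</code> are hypothetical and chosen only to mirror the examples in the text; min-max scaling is one of several possible normalization methods, not the one prescribed by the cited sources.

```python
def screen_record(record):
    """Return a list of data-quality problems found in one record.

    The field names and rules are illustrative assumptions.
    """
    problems = []
    income = record.get("income")
    if income is None:
        problems.append("missing income")
    elif income < 0:
        problems.append("out-of-range income")
    if record.get("sex") == "Male" and record.get("pregnant") == "Yes":
        problems.append("impossible combination: Male and pregnant")
    return problems


def min_max_normalize(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw records, including the kinds of errors described above.
records = [
    {"income": 30000, "sex": "Female", "pregnant": "Yes"},
    {"income": -100, "sex": "Male", "pregnant": "No"},     # out-of-range value
    {"income": 50000, "sex": "Male", "pregnant": "Yes"},   # impossible combination
    {"income": None, "sex": "Female", "pregnant": "No"},   # missing value
    {"income": 70000, "sex": "Male", "pregnant": "No"},
]

# Cleaning: keep only records with no detected problems.
clean = [r for r in records if not screen_record(r)]

# Normalization: rescale the remaining incomes to [0, 1].
normalized = min_max_normalize([r["income"] for r in clean])
```

In this sketch, problematic records are simply discarded. In practice, missing or invalid values may instead be imputed or corrected, depending on the application.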

Data preprocessing: Difference between revisions