{{Context|date=October 2009}}
'''Data pre-processing''' is an often neglected but important step in the data mining process. The phrase [[GIGO|"Garbage In, Garbage Out"]] is particularly applicable to [[data mining]] and [[machine learning]] projects. Data gathering methods are often loosely controlled, resulting in out-of-range values (e.g., Income: -100), impossible data combinations (e.g., Gender: Male, Pregnant: Yes), [[missing values]], etc. Analyzing data that has not been carefully screened for such problems can produce misleading results. Thus, the representation and quality of the [[data]] must be addressed first, before running an analysis.<ref>Pyle, D., 1999. ''Data Preparation for Data Mining.'' Morgan Kaufmann Publishers, [[Los Altos]], CA.</ref>

If much irrelevant and redundant information is present, or the data is noisy and unreliable, then [[knowledge discovery]] during the training phase is more difficult. Data preparation and filtering steps can take a considerable amount of processing time. Data pre-processing includes [[Data cleaning|cleaning]], normalization, transformation, [[feature extraction]] and selection, etc. The product of data pre-processing is the final [[training set]].

Kotsiantis et al. (2006) present a well-known algorithm for each step of data pre-processing.<ref>S. Kotsiantis, D. Kanellopoulos, P. Pintelas, "Data Preprocessing for Supervised Learning", ''International Journal of Computer Science'', 2006, Vol 1 N. 2, pp 111-117.</ref>

==References==

Data preprocessing: Difference between revisions