Revision as of 16:06, 23 July 2009 by Boxplot (added intro paragraph)
Revision as of 16:10, 23 July 2009 by Boxplot (m, wikilinks)
{{context}}
'''Data pre-processing''' is an often neglected but important step in the data mining process. The phrase [[GIGO|"Garbage In, Garbage Out"]] is particularly applicable to [[data mining]] and [[machine learning]] projects. Data gathering methods are often loosely controlled, resulting in out-of-range values (e.g., Income: -100), impossible data combinations (e.g., Gender: Male, Pregnant: Yes), [[missing values]], etc. Analyzing data that has not been carefully screened for such problems can produce misleading results. Thus, the representation and quality of the [[data]] come first and foremost, before any analysis is run.<ref>Pyle, D., 1999. ''Data Preparation for Data Mining.'' Morgan Kaufmann Publishers, [[Los Altos]], CA.</ref>

If the data contain much irrelevant and redundant information, or are noisy and unreliable, then [[knowledge discovery]] during the training phase is more difficult. Data preparation and filtering steps can take a considerable amount of processing time. Data pre-processing includes [[Data_cleaning|cleaning]], normalization, transformation, [[feature extraction]] and selection, etc. The product of data pre-processing is the final [[training set]]. Kotsiantis et al. (2006) present a well-known algorithm for each step of data pre-processing.<ref>S. Kotsiantis, D. Kanellopoulos, P. Pintelas, "Data Preprocessing for Supervised Leaning", ''International Journal of Computer Science'', 2006, Vol 1 N. 2, pp 111-117.</ref>
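The screening problems named above (out-of-range values, impossible combinations, missing values) and the normalization step can be illustrated with a minimal sketch. It is not taken from the cited sources: the record layout, the field names <code>income</code>, <code>gender</code> and <code>pregnant</code>, the rule that income must be non-negative, and the helper names <code>screen_record</code> and <code>min_max_normalize</code> are all assumptions made for illustration.

```python
# Hypothetical record layout: each record is a dict holding the example
# fields from the text. None marks a missing value.
FIELDS = ("income", "gender", "pregnant")


def screen_record(record: dict) -> list[str]:
    """Return the data-quality problems found in one record (empty if none)."""
    problems = []

    # Missing values: a field that is absent or set to None.
    for field in FIELDS:
        if record.get(field) is None:
            problems.append(f"missing value: {field}")

    # Out-of-range value: income is assumed here to be non-negative.
    income = record.get("income")
    if income is not None and income < 0:
        problems.append(f"out-of-range value: income={income}")

    # Impossible combination: Gender: Male together with Pregnant: Yes.
    if record.get("gender") == "Male" and record.get("pregnant") == "Yes":
        problems.append("impossible combination: gender=Male, pregnant=Yes")

    return problems


def min_max_normalize(values: list[float]) -> list[float]:
    """Rescale values linearly to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


records = [
    {"income": 52000, "gender": "Female", "pregnant": "No"},
    {"income": -100, "gender": "Male", "pregnant": "Yes"},
    {"income": None, "gender": "Female", "pregnant": "Yes"},
]

# Keep only the records that pass every check.
clean = [r for r in records if not screen_record(r)]
```

Dropping flagged records is only one possible policy. Depending on the data set, a flagged value might instead be corrected or imputed, and min-max scaling is only one of several normalization schemes.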
==References==
{{reflist}}

[[Category:Machine learning]]

Data preprocessing: Difference between revisions