Book Chapter
Data Cleansing Meets Feature Selection: A Supervised Machine Learning Approach
Author(s) | Tallón Ballesteros, Antonio Javier
Riquelme Santos, José Cristóbal |
Department | Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos |
Publication date | 2015 |
Deposit date | 2016-06-27 |
Published in |
|
ISBN/ISSN | 978-3-319-18832-4 0302-9743 |
Abstract | This paper presents a novel procedure that applies sequentially two data preparation techniques of a different nature: data cleansing and feature selection. For the former we experimented with a partial removal of outliers via the inter-quartile range, whereas for the latter we chose relevant attributes with two widespread feature subset selectors, CFS (Correlation-based Feature Selection) and CNS (Consistency-based Feature Selection), which are founded on correlation and consistency measures, respectively. Empirical results are outlined on seven difficult binary and multi-class data sets, that is, data sets with a test error rate of at least 10%, in terms of accuracy, with the C4.5 or 1-nearest neighbour classifiers without any kind of prior data pre-processing. Non-parametric statistical tests show that combining the two data preparation strategies, using a correlation measure for feature selection with the C4.5 algorithm, is significantly better, as measured by the ROC measure, than applying the data cleansing approach alone. Last but not least, a weak and not very powerful learner such as PART achieved promising results with the new proposal based on a consistency measure and is able to compete with the best configuration of C4.5. To sum up, under the new approach and for the ROC measure, the PART classifier with a consistency metric behaves slightly better than C4.5 with a correlation measure. |
Project identifier | TIN2007-68084-C02-02
TIN2011-28956-C02-02 P11-TIC-7528 |
Files | Size | Format | View | Description |
---|---|---|---|---|
Data cleansing.pdf | 239.3Kb | [PDF] | View | |
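
The abstract mentions removing outliers via the inter-quartile range, but this record does not give the exact rule used in the chapter. The sketch below is only an illustration under stated assumptions: it uses the conventional Tukey fences with a multiplier `k = 1.5`, and it drops every flagged row, which may differ from the authors' partial removal. The function names `iqr_bounds` and `remove_outliers` and the sample data are hypothetical.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences for one numeric attribute.

    Assumption: fences are Q1 - k*IQR and Q3 + k*IQR with k = 1.5.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(rows, k=1.5):
    """Drop every row with at least one attribute outside its fences."""
    columns = list(zip(*rows))  # transpose rows into attribute columns
    bounds = [iqr_bounds(col, k) for col in columns]
    return [
        row
        for row in rows
        if all(lo <= v <= hi for v, (lo, hi) in zip(row, bounds))
    ]


# Hypothetical toy data: the last row has an extreme first attribute.
rows = [(1.0, 10.0), (2.0, 11.0), (3.0, 12.0), (4.0, 13.0), (100.0, 14.0)]
cleaned = remove_outliers(rows)
```

In a pipeline like the one the abstract describes, a step of this kind would run on the training data before feature selection; the selectors themselves (CFS and CNS) are not sketched here.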