Tesis Doctoral
New models and methods for classification and feature selection. a mathematical optimization perspective
Autor/es | Benítez Peña, Sandra |
Director | Blanquero Bravo, Rafael
Carrizosa Priego, Emilio José Ramírez Cobo, Josefa |
Departamento | Universidad de Sevilla. Departamento de Estadística e Investigación Operativa |
Fecha de publicación | 2021-07-27 |
Fecha de depósito | 2021-12-02 |
Resumen | The objective of this PhD dissertation is the development of new models for Supervised
Classification and Benchmarking, making use of Mathematical Optimization and Statistical
tools. Particularly, we address the fusion ... The objective of this PhD dissertation is the development of new models for Supervised Classification and Benchmarking, making use of Mathematical Optimization and Statistical tools. Particularly, we address the fusion of instruments from both disciplines, with the aim of extracting knowledge from data. In such a way, we obtain innovative methodologies that overcome to those existing ones, bridging theoretical Mathematics with real-life problems. The developed works along this thesis have focused on two fundamental methodologies in Data Science: support vector machines (SVM) and Benchmarking. Regarding the first one, the SVM classifier is based on the search for the separating hyperplane of maximum margin and it is written as a quadratic convex problem. In the Benchmarking context, the goal is to calculate the different efficiencies through a non-parametric deterministic approach. In this thesis we will focus on Data Envelopment Analysis (DEA), which consists on a Linear Programming formulation. This dissertation is structured as follows. In Chapter 1 we briefly present the different challenges this thesis faces on, as well as their state-of-the-art. In the same vein, the different formulations used as base models are exposed, together with the notation used along the chapters in this thesis. In Chapter 2, we tackle the problem of the construction of a version of the SVM that considers misclassification errors. To do this, we incorporate new performance constraints in the SVM formulation, imposing upper bounds on the misclassification errors. The resulting formulation is a quadratic convex problem with linear constraints. Chapter 3 continues with the SVM as the basis, and sets out the problem of providing not only a hard-labeling for each of the individuals belonging to the dataset, but a class probability estimation. Furthermore, confidence intervals for both the score values and the posterior class probabilities will be provided. In addition, as in the previous chapter, we will carry the obtained results to the field in which misclassified errors are considered. With such a purpose, we have to solve either a quadratic convex problem or a quadratic convex problem with linear constraints and integer variables, and always taking advantage of the parameter tuning of the SVM, that is usually wasted. Based on the results in Chapter 2, in Chapter 4 we handle the problem of feature selection, taking again into account the misclassification errors. In order to build this technique, the feature selection is embedded in the classifier model. Such a process is divided in two different steps. In the first step, feature selection is performed while at the same time data is separated via an hyperplane or linear classifier, considering the performance constraints. In the second step, we build the maximum margin classifier (SVM) using the selected features from the first step, and again taking into account the same performance constraints. In Chapter 5, we move to the problem of Benchmarking, where the practices of different entities are compared through the products or services they provide. This is done with the aim of make some changes or improvements in each of them. Concretely, in this chapter we propose a Mixed Integer Linear Programming formulation based in Data Envelopment Analysis (DEA), with the aim of perform feature selection, improving the interpretability and comprehension of the obtained model and efficiencies. Finally, in Chapter 6 we collect the conclusions of this thesis as well as future lines of research. |
Cita | Benítez Peña, S. (2021). New models and methods for classification and feature selection. a mathematical optimization perspective. (Tesis Doctoral Inédita). Universidad de Sevilla, Sevilla. |
Ficheros | Tamaño | Formato | Ver | Descripción |
---|---|---|---|---|
BENÍTEZ PEÑA, Sandra TESIS.pdf | 4.581Mb | [PDF] | Ver/ | |