dc.description.abstract | The objective of this PhD dissertation is the development of new models for Supervised
Classification and Benchmarking, making use of Mathematical Optimization and Statistical
tools. In particular, we address the fusion of instruments from both disciplines,
with the aim of extracting knowledge from data. In this way, we obtain innovative
methodologies that outperform existing ones, bridging theoretical Mathematics
with real-life problems.
The works developed in this thesis focus on two fundamental methodologies
in Data Science: support vector machines (SVM) and Benchmarking. Regarding
the first, the SVM classifier is based on the search for the separating hyperplane of
maximum margin and is formulated as a convex quadratic problem. In the Benchmarking
context, the goal is to compute the efficiencies of different entities through a non-parametric
deterministic approach. In this thesis we focus on Data Envelopment Analysis
(DEA), which consists of a Linear Programming formulation.
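As an illustration of the first methodology, the maximum-margin classifier described above can be fit on a toy dataset with an off-the-shelf solver; the sketch below uses scikit-learn (an assumption of this example, not the software of the thesis), whose linear SVC solves exactly the convex quadratic problem mentioned.

```python
# Illustrative sketch (not the thesis code): a maximum-margin linear SVM
# fit on a tiny, made-up, linearly separable toy dataset.
from sklearn.svm import SVC

X = [[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]]  # two points per class
y = [0, 0, 1, 1]

clf = SVC(kernel="linear", C=1.0)  # solves the convex quadratic SVM problem
clf.fit(X, y)

# the separating hyperplane is w.x + b = 0
print("w =", clf.coef_, "b =", clf.intercept_)
print(clf.predict([[0.5, 0.5], [2.5, 2.5]]))
```

Points near the class-0 cluster are labeled 0 and points near the class-1 cluster are labeled 1, since the learned hyperplane lies between the two groups.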
This dissertation is structured as follows. In Chapter 1 we briefly present the
different challenges this thesis addresses, as well as their state of the art. In the same
vein, the formulations used as base models are presented, together with the
notation used throughout the chapters of this thesis.
In Chapter 2, we tackle the problem of constructing a version of the SVM
that controls misclassification errors. To do this, we incorporate new performance
constraints into the SVM formulation, imposing upper bounds on the misclassification
errors. The resulting formulation is a convex quadratic problem with linear constraints.
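A minimal sketch of how such performance constraints might enter the standard soft-margin SVM primal is given below; the bound on the slacks $\xi_i$ is purely illustrative, and the symbols $I_k$ and $\mu_k$ are assumptions of this sketch, not the precise constraints derived in the chapter:

\[
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i \\
\text{s.t.}\quad & y_i\,(w^\top x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1,\dots,n, \\
& \sum_{i \in I_k} \xi_i \le \mu_k\,|I_k|, \qquad k \in \{-1,+1\},
\end{aligned}
\]

where $I_k$ indexes the individuals of class $k$ and $\mu_k$ is a user-chosen tolerance on the aggregate classification error of that class. Bounding the slacks keeps the problem a convex quadratic program with linear constraints.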
Chapter 3 continues with the SVM as the basis, and sets out the problem of providing
not only a hard-labeling for each of the individuals belonging to the dataset, but a
class probability estimation. Furthermore, confidence intervals for both the score values
and the posterior class probabilities will be provided. In addition, as in the previous
chapter, we will carry the obtained results to the field in which misclassified errors are
considered. With such a purpose, we have to solve either a quadratic convex problem
or a quadratic convex problem with linear constraints and integer variables, and always
taking advantage of the parameter tuning of the SVM, that is usually wasted.
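To illustrate the general idea of turning SVM scores into posterior class probabilities, the sketch below applies Platt-style sigmoid calibration as implemented in scikit-learn; this is a standard technique shown for context only, not the estimation procedure derived in the chapter, and the dataset is synthetic.

```python
# Illustrative sketch: class probability estimates from SVM scores via
# sigmoid (Platt-style) calibration. Not the thesis methodology.
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

base = LinearSVC()  # produces hard scores (signed distances) only
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=3)
calibrated.fit(X, y)

probs = calibrated.predict_proba(X[:4])  # posterior class probabilities
print(probs)  # one row per individual, one column per class
```

Each row of `probs` sums to one, so the hard label can be recovered as the column of maximum probability while the probability itself quantifies the confidence of the assignment.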
Building on the results of Chapter 2, in Chapter 4 we address the problem of feature selection, again taking into account the misclassification errors. In order to build this
technique, the feature selection is embedded in the classifier model. The process is
divided into two steps. In the first step, feature selection is performed while, at
the same time, the data are separated via a hyperplane or linear classifier, considering the
performance constraints. In the second step, we build the maximum-margin classifier
(SVM) using the features selected in the first step, again taking into account
the same performance constraints.
In Chapter 5, we move to the problem of Benchmarking, where the practices of
different entities are compared through the products or services they provide. This is
done with the aim of making changes or improvements in each of them. Concretely,
in this chapter we propose a Mixed Integer Linear Programming formulation based on
Data Envelopment Analysis (DEA), with the aim of performing feature selection, improving
the interpretability and comprehension of the obtained model and efficiencies.
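For context, the classical input-oriented CCR DEA model on which such extensions build can be solved as a plain linear program; the sketch below does so with SciPy on made-up data (the integer variables for feature selection proposed in the chapter are not reproduced here).

```python
# Illustrative sketch: classical input-oriented CCR DEA efficiencies via
# linear programming. Toy data; not the MILP formulation of the thesis.
import numpy as np
from scipy.optimize import linprog

# 4 decision-making units (DMUs), 2 inputs, 1 output (made-up values)
X = np.array([[2.0, 3.0], [4.0, 1.0], [4.0, 4.0], [5.0, 5.0]])  # inputs
Y = np.array([[1.0], [1.0], [1.0], [1.0]])                      # outputs

def ccr_efficiency(j0):
    """Efficiency of DMU j0: min theta s.t. a convex combination of all
    DMUs uses at most theta times j0's inputs and at least j0's outputs."""
    n, m = X.shape
    s = Y.shape[1]
    c = np.r_[1.0, np.zeros(n)]                 # variables [theta, lambda_1..n]
    A_in = np.hstack([-X[j0].reshape(m, 1), X.T])   # sum lam*x - theta*x_j0 <= 0
    A_out = np.hstack([np.zeros((s, 1)), -Y.T])     # -sum lam*y <= -y_j0
    res = linprog(c, A_ub=np.vstack([A_in, A_out]),
                  b_ub=np.r_[np.zeros(m), -Y[j0]])  # theta, lambda >= 0 by default
    return res.fun

for j in range(4):
    print("DMU", j, "efficiency:", round(ccr_efficiency(j), 3))
```

Units on the efficient frontier obtain a score of 1, while dominated units obtain a score strictly below 1 that indicates how much their inputs could be proportionally reduced.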
Finally, in Chapter 6 we collect the conclusions of this thesis as well as future lines
of research. | es |