dc.creator | Estepa Alonso, Rafael María | es |
dc.creator | Díaz Verdejo, Jesús | es |
dc.creator | Estepa Alonso, Antonio José | es |
dc.creator | Madinabeitia Luque, Germán | es |
dc.date.accessioned | 2021-04-19T11:03:07Z | |
dc.date.available | 2021-04-19T11:03:07Z | |
dc.date.issued | 2020 | |
dc.identifier.citation | Estepa Alonso, R.M., Díaz Verdejo, J., Estepa Alonso, A.J. and Madinabeitia Luque, G. (2020). How much training data is enough? A case study for HTTP anomaly-based intrusion detection. IEEE Access, 4, 44410-44425. | |
dc.identifier.issn | 2169-3536 | es |
dc.identifier.uri | https://hdl.handle.net/11441/107307 | |
dc.description.abstract | Most anomaly-based intrusion detectors rely on models that learn from a training dataset
whose quality is crucial to their performance. Although the properties of suitable datasets have been formulated,
the influence of the dataset size on the performance of the anomaly-based detector has received scarce
attention so far. In this work, we investigate the optimal size of a training dataset. This size should be
large enough so that training data is representative of normal behavior, but after that point, collecting more
data may result in unnecessary waste of time and computational resources, not to mention an increased
risk of overtraining. In this spirit, we provide a method to find out when the amount of data collected
at the production environment is representative of normal behavior in the context of a detector of HTTP
URI attacks based on 1-grammar. Our approach is founded on a set of indicators related to the statistical
properties of the data. These indicators are periodically calculated during data collection, producing time
series that stabilize when more training data is not expected to translate to better system performance, which
indicates that data collection can be stopped. We present a case study with real-life datasets collected at the
University of Seville (Spain) and a public dataset from the University of Saskatchewan. The application
of our method to these datasets showed that more than 42% of one trace, and almost 20% of another,
were unnecessarily collected, thereby showing that our proposed method can be an efficient approach for
collecting training data at the production environment. | es |
dc.format | application/pdf | es |
dc.format.extent | 16 p. | es |
dc.language.iso | eng | es |
dc.publisher | Institute of Electrical and Electronics Engineers | es |
dc.relation.ispartof | IEEE Access, 4, 44410-44425. | |
dc.rights | Attribution-NonCommercial-NoDerivatives 4.0 International | * |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/4.0/ | * |
dc.subject | anomaly-based intrusion detection | es |
dc.subject | dataset assessment | es |
dc.subject | training | es |
dc.title | How much training data is enough? A case study for HTTP anomaly-based intrusion detection | es |
dc.type | info:eu-repo/semantics/article | es |
dcterms.identifier | https://ror.org/03yxnpp24 | |
dc.type.version | info:eu-repo/semantics/publishedVersion | es |
dc.rights.accessRights | info:eu-repo/semantics/openAccess | es |
dc.contributor.affiliation | Universidad de Sevilla. Departamento de Ingeniería Telemática | es |
dc.relation.publisherversion | https://ieeexplore.ieee.org/document/9019687 | es |
dc.identifier.doi | 10.1109/ACCESS.2020.2977591 | es |
dc.contributor.group | Universidad de Sevilla. PI-1669/22/2017: Sistema Integral para Vigilancia y Auditoría de Ciberseguridad Corporativa (SIVA) | es |
dc.contributor.group | Universidad de Sevilla. PI-1786/22/2018: Sistema de Ciberportección para servidores web de la Universidad de Sevilla (CiberwebUS) | es |
dc.contributor.group | Universidad de Sevilla. PI-1736/22/2017: Detección Temprana de Ataques de Ciberseguridad en Servidores Web de la biblioteca de la US | es |
dc.journaltitle | IEEE Access | es |
dc.publication.volumen | 4 | es |
dc.publication.initialPage | 44410 | es |
dc.publication.endPage | 44425 | es |