How much training data is enough? A case study for HTTP anomaly-based intrusion detection

Estepa Alonso, Rafael María; Díaz Verdejo, Jesús; Estepa Alonso, Antonio José; Madinabeitia Luque, Germán

doi:10.1109/ACCESS.2020.2977591

Article

dc.creator	Estepa Alonso, Rafael María	es
dc.creator	Díaz Verdejo, Jesús	es
dc.creator	Estepa Alonso, Antonio José	es
dc.creator	Madinabeitia Luque, Germán	es
dc.date.accessioned	2021-04-19T11:03:07Z
dc.date.available	2021-04-19T11:03:07Z
dc.date.issued	2020
dc.identifier.citation	Estepa Alonso, R.M., Díaz Verdejo, J., Estepa Alonso, A.J. and Madinabeitia Luque, G. (2020). How much training data is enough? A case study for HTTP anomaly-based intrusion detection. IEEE Access, 4, 44410-44425.
dc.identifier.issn	2169-3536	es
dc.identifier.uri	https://hdl.handle.net/11441/107307
dc.description.abstract	Most anomaly-based intrusion detectors rely on models that learn from a training dataset whose quality is crucial to their performance. Although the properties of suitable datasets have been formulated, the influence of dataset size on the performance of an anomaly-based detector has received little attention so far. In this work, we investigate the optimal size of a training dataset. This size should be large enough for the training data to be representative of normal behavior, but beyond that point, collecting more data may waste time and computational resources, not to mention increase the risk of overtraining. In this spirit, we provide a method to determine when the amount of data collected in the production environment is representative of normal behavior, in the context of a detector of HTTP URI attacks based on 1-grammar. Our approach is founded on a set of indicators related to the statistical properties of the data. These indicators are periodically calculated during data collection, producing time series that stabilize when more training data is not expected to translate into better system performance, which indicates that data collection can be stopped. We present a case study with real-life datasets collected at the University of Seville (Spain) and a public dataset from the University of Saskatchewan. The application of our method to these datasets showed that more than 42% of one trace and almost 20% of another were unnecessarily collected, thereby showing that our proposed method can be an efficient approach for collecting training data in the production environment.	es
dc.format	application/pdf	es
dc.format.extent	16 p.	es
dc.language.iso	eng	es
dc.publisher	Institute of Electrical and Electronics Engineers	es
dc.relation.ispartof	IEEE Access, 4, 44410-44425.
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/	*
dc.subject	anomaly-based intrusion detection	es
dc.subject	dataset assessment	es
dc.subject	training	es
dc.title	How much training data is enough? A case study for HTTP anomaly-based intrusion detection	es
dc.type	info:eu-repo/semantics/article	es
dcterms.identifier	https://ror.org/03yxnpp24
dc.type.version	info:eu-repo/semantics/publishedVersion	es
dc.rights.accessRights	info:eu-repo/semantics/openAccess	es
dc.contributor.affiliation	Universidad de Sevilla. Departamento de Ingeniería Telemática	es
dc.relation.publisherversion	https://ieeexplore.ieee.org/document/9019687	es
dc.identifier.doi	10.1109/ACCESS.2020.2977591	es
dc.contributor.group	Universidad de Sevilla. PI-1669/22/2017: Sistema Integral para Vigilancia y Auditoría de Ciberseguridad Corporativa (SIVA)	es
dc.contributor.group	Universidad de Sevilla. PI-1786/22/2018: Sistema de Ciberprotección para servidores web de la Universidad de Sevilla (CiberwebUS)	es
dc.contributor.group	Universidad de Sevilla. PI-1736/22/2017: Detección Temprana de Ataques de Ciberseguridad en Servidores Web de la biblioteca de la US	es
dc.journaltitle	IEEE Access	es
dc.publication.volumen	4	es
dc.publication.initialPage	44410	es
dc.publication.endPage	44425	es
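
The stopping idea summarized in the abstract (compute an indicator periodically during collection and stop once its time series stabilizes) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the article: the indicator (total variation distance between consecutive cumulative single-character frequency distributions of the URIs), the `threshold` and `window` parameters, and the function names are all hypothetical. The article's actual indicators and stabilization criterion may differ.

```python
from collections import Counter


def relative_freqs(counts):
    """Turn raw symbol counts into relative frequencies."""
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()} if total else {}


def total_variation(p, q):
    """Total variation distance between two relative-frequency dicts."""
    symbols = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in symbols)


def stabilization_point(batches, threshold=0.01, window=3):
    """Return the 1-based index of the batch where collection could stop.

    batches: iterable of lists of URI strings, one list per collection period.
    The (hypothetical) rule: stop once the drift between consecutive
    cumulative distributions stays below `threshold` for `window`
    consecutive periods. Returns None if that never happens.
    """
    counts = Counter()
    previous = {}
    calm = 0
    for index, batch in enumerate(batches, start=1):
        for uri in batch:
            counts.update(uri)  # accumulate single-character counts
        current = relative_freqs(counts)
        if previous:
            drift = total_variation(previous, current)
            calm = calm + 1 if drift < threshold else 0
            if calm >= window:
                return index
        previous = current
    return None
```

With one list of URIs per collection period (for example, per day), `stabilization_point(daily_uris)` returns the first period after which further data changed the cumulative distribution by less than the chosen threshold for several periods in a row. Both parameters would need tuning against the detector's actual performance.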

Files	Size	Format	View	Description
How Mach Training Data.pdf	5.780Mb	[PDF]	View/Open

This record appears in the following collections

Articles (Ingeniería Telemática)

Except where otherwise noted, this item's license is described as: Attribution-NonCommercial-NoDerivatives 4.0 International