2025-06-09
2025
Díaz Verdejo, J., Estepa Alonso, R.M., Estepa Alonso, A.J., Muñoz Calle, F.J. and Madinabeitia Luque, G. (2025). Building a large, realistic and labeled HTTP URI dataset for anomaly-based intrusion detection systems: Biblio-US17. Cybersecurity, 8, 38. https://doi.org/10.1186/s42400-024-00336-3.
2523-3246
https://hdl.handle.net/11441/174115

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

This paper introduces Biblio-US17, a labeled dataset collected over 6 months from the log files of a popular public website at the University of Seville. It contains 47 million records, each including the method, uniform resource identifier (URI), response code, and response size of a request received by the web server. Records have been classified as either normal or attack using a comprehensive semi-automated process, which involved signature-based detection, assisted inspection of the URI vocabulary, and substantial expert manual supervision. Unlike comparable datasets, this one offers a genuine real-world perspective on the normal operation of an active website, along with an unbiased proportion of actual (i.e., non-synthetic) attacks.
This makes it ideal for evaluating and comparing anomaly-based approaches in a realistic environment. Its extensive size and duration also make it valuable for addressing challenges like data shift and insufficient training. This paper describes the collection and labeling processes, dataset structure, and most relevant properties. We also include an example of an application for assessing the performance of a simple anomaly detector. Biblio-US17, now available to the scientific community, can also be used to model the URIs used by current web servers.

application/pdf
21 p.
eng
Attribution 4.0 International
http://creativecommons.org/licenses/by/4.0/
Anomaly detection
Intrusion detection systems
Data acquisition
Training datasets
Web application filters
Biblio-US17 dataset
Building a large, realistic and labeled HTTP URI dataset for anomaly-based intrusion detection systems: Biblio-US17
info:eu-repo/semantics/article
info:eu-repo/semantics/openAccess
10.1186/s42400-024-00336-3