dc.creator | Roldán Salvador, Juan Carlos | es |
dc.creator | Jiménez Aguirre, Patricia | es |
dc.creator | Szekely, Pedro | es |
dc.creator | Corchuelo Gil, Rafael | es |
dc.date.accessioned | 2022-04-08T10:04:03Z | |
dc.date.available | 2022-04-08T10:04:03Z | |
dc.date.issued | 2021 | |
dc.identifier.citation | Roldán Salvador, J.C., Jiménez Aguirre, P., Szekely, P. and Corchuelo Gil, R. (2021). TOMATE: A heuristic-based approach to extract data from HTML tables. Information Sciences, 577 (October 2021), 49-68. | |
dc.identifier.issn | 0020-0255 | es |
dc.identifier.uri | https://hdl.handle.net/11441/131986 | |
dc.description.abstract | Extracting data from user-friendly HTML tables is difficult because of their different layouts, formats, and encoding problems. In this article, we present a new proposal that first
applies several pre-processing heuristics to clean the tables, then performs functional analysis, and finally applies some post-processing heuristics to produce the output. Our most
important contribution is regarding functional analysis, which we address by projecting
the cells onto a high-dimensional feature space in which a standard clustering technique
is used to tell the meta-data cells apart from the data cells. We experimented with
two large repositories of real-world HTML tables and our results confirm that our proposal
can extract data from them with an F1 score of 89.50% in just 0.09 CPU seconds per table.
We compared our proposal with several competitors and the statistical analysis confirmed
its superiority in terms of effectiveness, while it remains very competitive in terms of
efficiency. | es |
dc.description.sponsorship | Ministerio de Economía y Competitividad TIN2013-40848-R | es |
dc.description.sponsorship | Ministerio de Economía y Competitividad TIN2016-75394-R | es |
dc.description.sponsorship | Junta de Andalucía P18-RT-1060 | es |
dc.description.sponsorship | Ministerio de Ciencia e Innovación PID2020-112540RB-C44 | es |
dc.format | application/pdf | es |
dc.format.extent | 20 | es |
dc.language.iso | eng | es |
dc.publisher | Elsevier | es |
dc.relation.ispartof | Information Sciences, 577 (October 2021), 49-68. | |
dc.rights | Attribution-NonCommercial-NoDerivatives 4.0 International | * |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/4.0/ | * |
dc.subject | HTML tables | es |
dc.subject | Data extraction | es |
dc.title | TOMATE: A heuristic-based approach to extract data from HTML tables | es |
dc.type | info:eu-repo/semantics/article | es |
dcterms.identifier | https://ror.org/03yxnpp24 | |
dc.type.version | info:eu-repo/semantics/publishedVersion | es |
dc.rights.accessRights | info:eu-repo/semantics/openAccess | es |
dc.contributor.affiliation | Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos | es |
dc.relation.projectID | TIN2013-40848-R | es |
dc.relation.projectID | TIN2016-75394-R | es |
dc.relation.projectID | P18-RT-1060 | es |
dc.relation.projectID | PID2020-112540RB-C44 | es |
dc.relation.publisherversion | https://www.sciencedirect.com/science/article/pii/S002002552100428X?via%3Dihub | es |
dc.identifier.doi | 10.1016/j.ins.2021.04.087 | es |
dc.contributor.group | Universidad de Sevilla. TIC258: Data-centric Computing Research Hub | es |
dc.journaltitle | Information Sciences | es |
dc.publication.volumen | 577 | es |
dc.publication.issue | October 2021 | es |
dc.publication.initialPage | 49 | es |
dc.publication.endPage | 68 | es |
dc.contributor.funder | Ministerio de Economía y Competitividad (MINECO). España | es |
dc.contributor.funder | Junta de Andalucía | es |
dc.contributor.funder | Ministerio de Ciencia e Innovación (MICIN). España | es |