dc.creator | Jiménez Aguirre, Patricia | es |
dc.creator | Roldán Salvador, Juan Carlos | es |
dc.creator | Corchuelo Gil, Rafael | es |
dc.date.accessioned | 2022-04-07T07:57:01Z | |
dc.date.available | 2022-04-07T07:57:01Z | |
dc.date.issued | 2021 | |
dc.identifier.citation | Jiménez Aguirre, P., Roldán Salvador, J.C. and Corchuelo Gil, R. (2021). A clustering approach to extract data from HTML tables. Information Processing and Management, 58 (6, art.nº102683) | |
dc.identifier.issn | 0306-4573 | es |
dc.identifier.uri | https://hdl.handle.net/11441/131911 | |
dc.description.abstract | HTML tables have become pervasive on the Web. Extracting their data automatically is difficult
because finding the relationships between their cells is not trivial due to the many different
layouts, encodings, and formats available. In this article, we introduce Melva, which is an
unsupervised domain-agnostic proposal to extract data from HTML tables without requiring any
external knowledge bases. It relies on a clustering approach that helps tell label cells apart
from value cells and establish their relationships. We compared Melva to four competitors on
more than 3 000 HTML tables from Wikipedia and the Dresden Web Table Corpus. The
conclusion is that our proposal is 21.70% better than the best unsupervised competitor and
equals the best supervised competitor regarding effectiveness, but it is 99.14% better regarding
efficiency. | es |
dc.description.sponsorship | Ministerio de Ciencia e Innovación PID2020-112540RB-C44 | es |
dc.description.sponsorship | Ministerio de Economía y Competitividad TIN2016-75394-R | es |
dc.description.sponsorship | Junta de Andalucía P18-RT-1060 | es |
dc.format | application/pdf | es |
dc.format.extent | 13 | es |
dc.language.iso | eng | es |
dc.publisher | Elsevier | es |
dc.relation.ispartof | Information Processing and Management, 58 (6, art.nº102683) | |
dc.rights | Attribution-NonCommercial-NoDerivatives 4.0 International | * |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/4.0/ | * |
dc.subject | HTML tables | es |
dc.subject | Data extraction | es |
dc.subject | Clustering | es |
dc.subject | Genetic algorithms | es |
dc.title | A clustering approach to extract data from HTML tables | es |
dc.type | info:eu-repo/semantics/article | es |
dcterms.identifier | https://ror.org/03yxnpp24 | |
dc.type.version | info:eu-repo/semantics/submittedVersion | es |
dc.rights.accessRights | info:eu-repo/semantics/openAccess | es |
dc.contributor.affiliation | Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos | es |
dc.relation.projectID | PID2020-112540RB-C44 | es |
dc.relation.projectID | TIN2016-75394-R | es |
dc.relation.projectID | P18-RT-1060 | es |
dc.relation.publisherversion | https://www.sciencedirect.com/science/article/pii/S0306457321001680?via%3Dihub | es |
dc.identifier.doi | 10.1016/j.ipm.2021.102683 | es |
dc.contributor.group | Universidad de Sevilla. TIC258: Data-centric Computing Research Hub | es |
dc.journaltitle | Information Processing and Management | es |
dc.publication.volumen | 58 | es |
dc.publication.issue | 6, art.nº102683 | es |
dc.contributor.funder | Ministerio de Ciencia e Innovación (MICIN). España | es |
dc.contributor.funder | Ministerio de Economía y Competitividad (MINECO). España | es |
dc.contributor.funder | Junta de Andalucía | es |