Article
A clustering approach to extract data from HTML tables
Author(s) | Jiménez Aguirre, Patricia; Roldán Salvador, Juan Carlos; Corchuelo Gil, Rafael |
Department | Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos |
Publication date | 2021 |
Deposit date | 2022-04-07 |
Published in | Information Processing and Management, 58 (6) |
Abstract | HTML tables have become pervasive on the Web. Extracting their data automatically is difficult because finding the relationships between their cells is not trivial due to the many different layouts, encodings, and formats available. In this article, we introduce Melva, which is an unsupervised domain-agnostic proposal to extract data from HTML tables without requiring any external knowledge bases. It relies on a clustering approach that helps tell label cells apart from value cells and establish their relationships. We compared Melva to four competitors on more than 3 000 HTML tables from Wikipedia and the Dresden Web Table Corpus. The conclusion is that our proposal is 21.70% better than the best unsupervised competitor and equals the best supervised competitor regarding effectiveness, but it is 99.14% better regarding efficiency. |
Funding agencies | Ministerio de Ciencia e Innovación (MICIN). España; Ministerio de Economía y Competitividad (MINECO). España; Junta de Andalucía |
Project identifier | PID2020-112540RB-C44; TIN2016-75394-R; P18-RT-1060 |
Citation | Jiménez Aguirre, P., Roldán Salvador, J.C. and Corchuelo Gil, R. (2021). A clustering approach to extract data from HTML tables. Information Processing and Management, 58 (6), art. no. 102683. |
Files | Size | Format | View | Description |
---|---|---|---|---|
A clustering approach to extract ... | 1.440 MB | [PDF] | View/ | |