A clustering approach to extract data from HTML tables

Authors: Jiménez Aguirre, Patricia; Roldán Salvador, Juan Carlos; Corchuelo Gil, Rafael
Published: 2021
Record date: 2022-04-07
Citation: Jiménez Aguirre, P., Roldán Salvador, J.C. and Corchuelo Gil, R. (2021). A clustering approach to extract data from HTML tables. Information Processing and Management, 58 (6, art. no. 102683).
ISSN: 0306-4573
Handle: https://hdl.handle.net/11441/131911
DOI: https://doi.org/10.1016/j.ipm.2021.102683
Type: info:eu-repo/semantics/article
Access: info:eu-repo/semantics/openAccess
Format: application/pdf
Extent: 13
Language: eng
License: Attribution-NonCommercial-NoDerivatives 4.0 International (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Keywords: HTML tables; Data extraction; Clustering; Genetic algorithms

Abstract

HTML tables have become pervasive on the Web. Extracting their data automatically is difficult because finding the relationships between their cells is not trivial, given the many different layouts, encodings, and formats available. In this article, we introduce Melva, an unsupervised, domain-agnostic proposal to extract data from HTML tables without requiring any external knowledge bases. It relies on a clustering approach that helps tell label cells apart from value cells and establish their relationships. We compared Melva to four competitors on more than 3,000 HTML tables from Wikipedia and the Dresden Web Table Corpus. The conclusion is that our proposal is 21.70% better than the best unsupervised competitor and equals the best supervised competitor regarding effectiveness, but it is 99.14% better regarding efficiency.