dc.creator | Jiménez Aguirre, Patricia | es |
dc.creator | Roldán Salvador, Juan Carlos | es |
dc.creator | Corchuelo Gil, Rafael | es |
dc.date.accessioned | 2022-04-08T11:12:59Z | |
dc.date.available | 2022-04-08T11:12:59Z | |
dc.date.issued | 2022 | |
dc.identifier.citation | Jiménez Aguirre, P., Roldán Salvador, J.C. and Corchuelo Gil, R. (2022). A hybrid quantum approach to leveraging data from HTML tables. Knowledge and Information Systems, 64 (2), 441-474. | |
dc.identifier.issn | 0219-1377 | es |
dc.identifier.uri | https://hdl.handle.net/11441/131991 | |
dc.description.abstract | The Web provides many data that are encoded using HTML tables. This facilitates
rendering them, but obfuscates their structure and makes it difficult for automated business
processes to leverage them. This has motivated many authors to work on proposals to
extract them as automatically as possible. In this article, we present a new unsupervised
proposal that uses a hybrid approach in which a standard computer is used to perform pre- and post-processing tasks and a quantum computer is used to perform the core task:
guessing whether the cells have labels or values. The problem is addressed using a
clustering approach that is known to be NP using standard computers, but our proposal can
solve it in polynomial time, which implies a significant performance improvement. It is
novel in that it relies on an entropy-preservation metaphor that has proven to work very
well on two large collections of real-world tables from Wikipedia and the Dresden Web
Table Corpus. Our experiments prove that our proposal can beat the state-of-the-art
proposal in terms of both effectiveness and efficiency; the key difference is that our
proposal is totally unsupervised, whereas the state-of-the-art proposal is supervised. | es |
dc.description.sponsorship | Ministerio de Economía y Competitividad TIN2016-75394-R | es |
dc.description.sponsorship | Ministerio de Ciencia e Innovación PID2020-112540RB-C44 | es |
dc.description.sponsorship | Junta de Andalucía P18-RT-1060 | es |
dc.format | application/pdf | es |
dc.format.extent | 34 | es |
dc.language.iso | eng | es |
dc.publisher | Springer | es |
dc.relation.ispartof | Knowledge and Information Systems, 64 (2), 441-474. | |
dc.rights | Attribution-NonCommercial-NoDerivatives 4.0 International | * |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/4.0/ | * |
dc.subject | HTML tables | es |
dc.subject | Data extraction | es |
dc.subject | Quantum computing | es |
dc.title | A hybrid quantum approach to leveraging data from HTML tables | es |
dc.type | info:eu-repo/semantics/article | es |
dcterms.identifier | https://ror.org/03yxnpp24 | |
dc.type.version | info:eu-repo/semantics/submittedVersion | es |
dc.rights.accessRights | info:eu-repo/semantics/openAccess | es |
dc.contributor.affiliation | Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos | es |
dc.relation.projectID | TIN2016-75394-R | es |
dc.relation.projectID | PID2020-112540RB-C44 | es |
dc.relation.projectID | P18-RT-1060 | es |
dc.relation.publisherversion | https://link.springer.com/article/10.1007/s10115-021-01636-7 | es |
dc.identifier.doi | 10.1007/s10115-021-01636-7 | es |
dc.contributor.group | Universidad de Sevilla. TIC258: Data-centric Computing Research Hub | es |
dc.journaltitle | Knowledge and Information Systems | es |
dc.publication.volumen | 64 | es |
dc.publication.issue | 2 | es |
dc.publication.initialPage | 441 | es |
dc.publication.endPage | 474 | es |
dc.contributor.funder | Ministerio de Economía y Competitividad (MINECO). España | es |
dc.contributor.funder | Ministerio de Ciencia e Innovación (MICIN). España | es |
dc.contributor.funder | Junta de Andalucía | es |