dc.creator | Jiménez Aguirre, Patricia | es |
dc.creator | Corchuelo Gil, Rafael | es |
dc.date.accessioned | 2022-04-08T08:05:32Z | |
dc.date.available | 2022-04-08T08:05:32Z | |
dc.date.issued | 2016 | |
dc.identifier.citation | Jiménez Aguirre, P. and Corchuelo Gil, R. (2016). On Learning Web Information Extraction Rules with TANGO. Information Systems, 62 (December 2016), 74-103. | |
dc.identifier.issn | 0306-4379 | es |
dc.identifier.uri | https://hdl.handle.net/11441/131977 | |
dc.description.abstract | The research on Enterprise Systems Integration focuses on proposals to support
business processes by re-using existing systems. Wrappers help re-use web applications that provide a user interface only. They emulate a human user who
interacts with them and extracts the information of interest in a structured format. In this article, we present TANGO, which is our proposal to learn rules
to extract information from semi-structured web documents with high precision
and recall, which is a must in the context of Enterprise Systems Integration. It
relies on an open catalogue of features that helps map the input documents into
a knowledge base in which every DOM node is represented by means of HTML,
DOM, CSS, relational, and user-defined features. Then a procedure with many
variation points is used to learn extraction rules from that knowledge base; the
variation points include heuristics that range from how to select a condition to
how to simplify the resulting rules. We also provide a systematic method to help
re-configure our proposal. Our exhaustive experimentation proves that it beats
others regarding effectiveness and is efficient enough for practical purposes. Our
proposal was devised to be as configurable as possible, which helps adapt it to
particular web sites and evolve it when necessary. | es |
dc.description.sponsorship | Ministerio de Educación y Ciencia TIN2007-64119 | es |
dc.description.sponsorship | Junta de Andalucía P07-TIC-2602 | es |
dc.description.sponsorship | Junta de Andalucía P08-TIC-4100 | es |
dc.description.sponsorship | Ministerio de Ciencia e Innovación TIN2008-04718-E | es |
dc.description.sponsorship | Ministerio de Ciencia e Innovación TIN2010-21744 | es |
dc.description.sponsorship | Ministerio de Economía, Industria y Competitividad TIN2010-09809-E | es |
dc.description.sponsorship | Ministerio de Ciencia e Innovación TIN2010-10811-E | es |
dc.description.sponsorship | Ministerio de Ciencia e Innovación TIN2010-09988-E | es |
dc.description.sponsorship | Ministerio de Economía y Competitividad TIN2011-15497-E | es |
dc.description.sponsorship | Ministerio de Economía y Competitividad TIN2013-40848-R | es |
dc.format | application/pdf | es |
dc.format.extent | 50 | es |
dc.language.iso | eng | es |
dc.publisher | Elsevier | es |
dc.relation.ispartof | Information Systems, 62 (December 2016), 74-103. | |
dc.rights | Attribution-NonCommercial-NoDerivatives 4.0 International | * |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/4.0/ | * |
dc.subject | Web information extraction | es |
dc.subject | Semi-structured documents | es |
dc.subject | Open catalogues of features | es |
dc.subject | Learning rules | es |
dc.subject | Variation points | es |
dc.subject | Configuration method | es |
dc.title | On Learning Web Information Extraction Rules with TANGO | es |
dc.type | info:eu-repo/semantics/article | es |
dcterms.identifier | https://ror.org/03yxnpp24 | |
dc.type.version | info:eu-repo/semantics/submittedVersion | es |
dc.rights.accessRights | info:eu-repo/semantics/openAccess | es |
dc.contributor.affiliation | Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos | es |
dc.relation.projectID | TIN2007-64119 | es |
dc.relation.projectID | P07-TIC-2602 | es |
dc.relation.projectID | P08-TIC-4100 | es |
dc.relation.projectID | TIN2008-04718-E | es |
dc.relation.projectID | TIN2010-21744 | es |
dc.relation.projectID | TIN2010-09809-E | es |
dc.relation.projectID | TIN2010-10811-E | es |
dc.relation.projectID | TIN2010-09988-E | es |
dc.relation.projectID | TIN2011-15497-E | es |
dc.relation.projectID | TIN2013-40848-R | es |
dc.relation.publisherversion | https://www.sciencedirect.com/science/article/pii/S0306437915300405?via%3Dihub | es |
dc.identifier.doi | 10.1016/j.is.2016.05.003 | es |
dc.contributor.group | Universidad de Sevilla. TIC258: Data-centric Computing Research Hub | es |
dc.journaltitle | Information Systems | es |
dc.publication.volumen | 62 | es |
dc.publication.issue | December 2016 | es |
dc.publication.initialPage | 74 | es |
dc.publication.endPage | 103 | es |
dc.identifier.sisius | 20928471 | es |
dc.contributor.funder | Ministerio de Educación y Ciencia (MEC). España | es |
dc.contributor.funder | Junta de Andalucía | es |
dc.contributor.funder | Ministerio de Ciencia e Innovación (MICIN). España | es |
dc.contributor.funder | Ministerio de Economía, Industria y Competitividad (MINECO). España | es |
dc.contributor.funder | Ministerio de Economía y Competitividad (MINECO). España | es |