Article
On Learning Web Information Extraction Rules with TANGO
Author/s | Jiménez Aguirre, Patricia; Corchuelo Gil, Rafael |
Department | Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos |
Publication Date | 2016 |
Deposit Date | 2022-04-08 |
Published in | Information Systems, 62 (December 2016), 74-103 |
Abstract | The research on Enterprise Systems Integration focuses on proposals to support business processes by re-using existing systems. Wrappers help re-use web applications that provide a user interface only. They emulate a human user who interacts with them and extracts the information of interest in a structured format. In this article, we present TANGO, which is our proposal to learn rules to extract information from semi-structured web documents with high precision and recall, which is a must in the context of Enterprise Systems Integration. It relies on an open catalogue of features that helps map the input documents into a knowledge base in which every DOM node is represented by means of HTML, DOM, CSS, relational, and user-defined features. Then a procedure with many variation points is used to learn extraction rules from that knowledge base; the variation points include heuristics that range from how to select a condition to how to simplify the resulting rules. We also provide a systematic method to help re-configure our proposal. Our exhaustive experimentation proves that it beats others regarding effectiveness and is efficient enough for practical purposes. Our proposal was devised to be as configurable as possible, which helps adapt it to particular web sites and evolve it when necessary. |
Funding agencies | Ministerio de Educación y Ciencia (MEC). España; Junta de Andalucía; Ministerio de Ciencia e Innovación (MICIN). España; Ministerio de Economía, Industria y Competitividad (MINECO). España; Ministerio de Economía y Competitividad (MINECO). España |
Project ID. | TIN2007-64119, P07-TIC-2602, P08-TIC-4100, TIN2008-04718-E, TIN2010-21744, TIN2010-09809-E, TIN2010-10811-E, TIN2010-09988-E, TIN2011-15497-E, TIN2013-40848-R |
Citation | Jiménez Aguirre, P. and Corchuelo Gil, R. (2016). On Learning Web Information Extraction Rules with TANGO. Information Systems, 62 (December 2016), 74-103. |
Files | Size | Format | View | Description |
---|---|---|---|---|
On_learning_web_information_ex ... | 546.4Kb | [PDF] | View/ | |