Presentation
Extracting Web Information using Representation Patterns
Author/s | Roldán Salvador, Juan Carlos
Jiménez Aguirre, Patricia
Corchuelo Gil, Rafael |
Department | Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos |
Publication Date | 2017 |
Deposit Date | 2022-04-07 |
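Published in | HotWeb 2017: 5th ACM/IEEE Workshop on Hot Topics in Web Systems and Technologies (4:1-4:5), San Jose, CA, USA: Association for Computing Machinery (ACM) |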
ISBN/ISSN | 978-1-4503-5527-8 |
Abstract | Feeding decision support systems with Web information typically requires sifting through an unwieldy amount of information that is available in human-friendly formats only. Our focus is on a scalable proposal to extract information from semi-structured documents in a structured format, with an emphasis on it being scalable and open. By semi-structured, we mean that it must focus on information that is rendered using regular formats, not free text; by scalable, we mean that the system must require a minimum amount of human intervention and it must not be targeted to extracting information from a particular domain or web site; by open, we mean that it must extract as much useful information as possible and not be subject to any pre-defined data model. In the literature, there is only one open but not scalable proposal, since it requires human supervision on a per-domain basis. In this paper, we present a new proposal that relies on a number of heuristics to identify patterns that are typically used to represent the information in a web document. Our experimental results confirm that our proposal is very competitive in terms of effectiveness and efficiency. |
Funding agencies | Ministerio de Economía y Competitividad (MINECO). España |
Project ID. | TIN2016-75394-R
TIN2013-40848-R |
Citation | Roldán Salvador, J.C., Jiménez Aguirre, P. and Corchuelo Gil, R. (2017). Extracting Web Information using Representation Patterns. In HotWeb 2017: 5th ACM/IEEE Workshop on Hot Topics in Web Systems and Technologies (4:1-4:5), San Jose, CA, USA: Association for Computing Machinery (ACM). |
Files | Size | Format | View | Description |
---|---|---|---|---|
Extracting web information using ... | 294.5Kb | [PDF] | View/ | |