Conference paper
Extracting Web Information using Representation Patterns
Author(s) | Roldán Salvador, Juan Carlos
Jiménez Aguirre, Patricia
Corchuelo Gil, Rafael |
Department | Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos |
Publication date | 2017 |
Deposit date | 2022-04-07 |
Published in |
|
ISBN/ISSN | 978-1-4503-5527-8 |
Abstract | Feeding decision support systems with Web information typically requires sifting through an unwieldy amount of information that is available in human-friendly formats only. Our focus is on a scalable proposal to extract information from semi-structured documents in a structured format, with an emphasis on it being scalable and open. By semi-structured, we mean that it must focus on information that is rendered using regular formats, not free text; by scalable, we mean that the system must require a minimum amount of human intervention and it must not be targeted to extracting information from a particular domain or web site; by open, we mean that it must extract as much useful information as possible and not be subject to any pre-defined data model. In the literature, there is only one open but not scalable proposal, since it requires human supervision on a per-domain basis. In this paper, we present a new proposal that relies on a number of heuristics to identify patterns that are typically used to represent the information in a web document. Our experimental results confirm that our proposal is very competitive in terms of effectiveness and efficiency. |
Funding agencies | Ministerio de Economía y Competitividad (MINECO). Spain |
Project identifier | TIN2016-75394-R
TIN2013-40848-R |
Citation | Roldán Salvador, J.C., Jiménez Aguirre, P. and Corchuelo Gil, R. (2017). Extracting Web Information using Representation Patterns. In HotWeb 2017: 5th ACM/IEEE Workshop on Hot Topics in Web Systems and Technologies (4:1-4:5), San Jose, CA, USA: Association for Computing Machinery (ACM). |
Files | Size | Format | View | Description |
---|---|---|---|---|
Extracting web information using ... | 294.5Kb | [PDF] | View | |