Article
Trinity: On Using Trinary Trees for Unsupervised Web Data Extraction
Author(s) | Sleiman, Hassan A.; Corchuelo Gil, Rafael |
Department | Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos |
Publication date | 2014-06 |
Deposit date | 2023-03-30 |
Abstract | Web data extractors are used to extract data from web documents in order to feed automated processes. In this article, we propose a technique that works on two or more web documents generated by the same server-side template and learns a regular expression that models it and can later be used to extract data from similar documents. The technique builds on the hypothesis that the template introduces some shared patterns that do not provide any relevant data and can thus be ignored. We have evaluated and compared our technique to others in the literature on a large collection of web documents; our results demonstrate that our proposal performs better than the others and that input errors do not have a negative impact on its effectiveness; furthermore, its efficiency can be easily boosted by means of a couple of parameters, without sacrificing its effectiveness. |
Funding agencies | Ministerio de Ciencia y Tecnología (MCYT). España; Junta de Andalucía; Ministerio de Ciencia e Innovación (MICIN). España; Ministerio de Economía, Industria y Competitividad; Ministerio de Economía y Competitividad (MINECO). España |
Project identifier | TIN2007-64119; P07-TIC-2602; P08-TIC-4100; TIN2008-04718-E; TIN2010-21744; TIN2010-09809-E; TIN2010-10811-E; TIN2010-09988-E; TIN2011-15497-E |
Files | Size | Format | View | Description |
---|---|---|---|---|
Trinity on using trinary trees ... | 3.099Mb | [PDF] | View | |
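
The hypothesis stated in the abstract (text shared by documents generated from the same template carries no relevant data) can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it compares characters, handles only two documents, and uses the standard `difflib` module to find the longest shared substring, then recurses on the text before and after it. The function name `learn_pattern`, the `min_len` threshold, and the sample documents are assumptions made purely for illustration.

```python
import re
from difflib import SequenceMatcher


def learn_pattern(a: str, b: str, min_len: int = 3) -> str:
    """Build a regex from two documents assumed to share a template.

    Shared text becomes a literal; text that differs becomes a lazy
    capture group. `min_len` (hypothetical parameter) is the shortest
    shared substring accepted as part of the template.
    """
    if a == b:
        # Identical fragments (including two empty ones) are pure template.
        return re.escape(a)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    if m.size < min_len:
        # Nothing substantial is shared: treat the fragment as data.
        return "(.*?)"
    shared = a[m.a:m.a + m.size]
    # Recurse on the text before and after the shared pattern.
    before = learn_pattern(a[:m.a], b[:m.b], min_len)
    after = learn_pattern(a[m.a + m.size:], b[m.b + m.size:], min_len)
    return before + re.escape(shared) + after


# Two made-up documents produced by the same (made-up) template.
doc1 = "<li><b>Dune</b> by Frank Herbert</li>"
doc2 = "<li><b>Emma</b> by Jane Austen</li>"
pattern = learn_pattern(doc1, doc2)

# Apply the learned expression to a third, unseen document.
doc3 = "<li><b>Ulysses</b> by James Joyce</li>"
match = re.fullmatch(pattern, doc3, re.DOTALL)
print(match.groups())  # ('Ulysses', 'James Joyce')
```

With a very small `min_len`, accidental overlaps inside the data (such as "an" in "Frank" and "Jane") would be mistaken for template text, which is why the sketch rejects very short shared substrings.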