dc.creator | Sleiman, Hassan A. | es |
dc.creator | Corchuelo Gil, Rafael | es |
dc.date.accessioned | 2023-03-28T10:43:19Z | |
dc.date.available | 2023-03-28T10:43:19Z | |
dc.date.issued | 2013-02 | |
dc.identifier.citation | Sleiman, H.A. and Corchuelo Gil, R. (2013). TEX: An efficient and effective unsupervised Web information extractor. Knowledge-Based Systems, 39, 109-123. https://doi.org/10.1016/j.knosys.2012.10.009. | |
dc.identifier.issn | 0950-7051 (print) | es |
dc.identifier.issn | 1872-7409 (online) | es |
dc.identifier.uri | https://hdl.handle.net/11441/143639 | |
dc.description.abstract | The World Wide Web is an immense information resource. Web information extraction is the task that transforms human-friendly Web information into structured information that can be consumed by automated business processes. In this article, we propose an unsupervised information extractor that works on two or more web documents generated by the same server-side template. It finds and removes shared token sequences amongst these web documents until finding the relevant information that should be extracted from them. The technique is completely unsupervised and does not require maintenance; it works on malformed web documents and does not require the relevant information to be formatted using repetitive patterns. Our complexity analysis reveals that our proposal is computationally tractable, and our empirical study on real-world web documents demonstrates that it performs very fast and has a very high precision and recall. | es |
dc.description.sponsorship | Ministerio de Ciencia y Tecnología TIN2007-64119 | es |
dc.description.sponsorship | Junta de Andalucía P07-TIC-2602 | es |
dc.description.sponsorship | Junta de Andalucía P08-TIC-4100 | es |
dc.description.sponsorship | Ministerio de Ciencia e Innovación TIN2008-04718-E | es |
dc.description.sponsorship | Ministerio de Ciencia e Innovación TIN2010-21744 | es |
dc.description.sponsorship | Ministerio de Economía, Industria y Competitividad TIN2010-09809-E | es |
dc.description.sponsorship | Ministerio de Ciencia e Innovación TIN2010-10811-E | es |
dc.description.sponsorship | Ministerio de Ciencia e Innovación TIN2010-09988-E | es |
dc.description.sponsorship | Ministerio de Economía y Competitividad TIN2011-15497-E | es |
dc.format | application/pdf | es |
dc.format.extent | 15 | es |
dc.language.iso | eng | es |
dc.publisher | ScienceDirect | es |
dc.relation.ispartof | Knowledge-Based Systems, 39, 109-123. | |
dc.rights | Attribution-NonCommercial-NoDerivatives 4.0 International | * |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/4.0/ | * |
dc.subject | Information extraction | es |
dc.subject | Semi-structured web documents | es |
dc.subject | Malformed documents | es |
dc.subject | Unsupervised technique | es |
dc.subject | Heuristic-based technique | es |
dc.title | TEX: An efficient and effective unsupervised Web information extractor | es |
dc.type | info:eu-repo/semantics/article | es |
dcterms.identifier | https://ror.org/03yxnpp24 | |
dc.type.version | info:eu-repo/semantics/publishedVersion | es |
dc.rights.accessRights | info:eu-repo/semantics/openAccess | es |
dc.contributor.affiliation | Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos | es |
dc.relation.projectID | TIN2007-64119 | es |
dc.relation.projectID | P07-TIC-2602 | es |
dc.relation.projectID | P08-TIC-4100 | es |
dc.relation.projectID | TIN2008-04718-E | es |
dc.relation.projectID | TIN2010-21744 | es |
dc.relation.projectID | TIN2010-09809-E | es |
dc.relation.projectID | TIN2010-10811-E | es |
dc.relation.projectID | TIN2010-09988-E | es |
dc.relation.projectID | TIN2011-15497-E | es |
dc.relation.publisherversion | https://www.sciencedirect.com/science/article/pii/S0950705112002900 | es |
dc.identifier.doi | 10.1016/j.knosys.2012.10.009 | es |
dc.journaltitle | Knowledge-Based Systems | es |
dc.publication.volumen | 39 | es |
dc.publication.initialPage | 109 | es |
dc.publication.endPage | 123 | es |
dc.contributor.funder | Ministerio de Ciencia y Tecnología (MCYT). España | es |
dc.contributor.funder | Junta de Andalucía | es |
dc.contributor.funder | Ministerio de Ciencia e Innovación (MICIN). España | es |
dc.contributor.funder | Ministerio de Economía, Industria y Competitividad. España | es |
dc.contributor.funder | Ministerio de Economía y Competitividad (MINECO). España | es |