dc.creator | Roldán Salvador, Juan Carlos | es |
dc.creator | Jiménez Aguirre, Patricia | es |
dc.creator | Corchuelo Gil, Rafael | es |
dc.date.accessioned | 2022-04-07T11:04:44Z | |
dc.date.available | 2022-04-07T11:04:44Z | |
dc.date.issued | 2017 | |
dc.identifier.citation | Roldán Salvador, J.C., Jiménez Aguirre, P. and Corchuelo Gil, R. (2017). Extracting Web Information using Representation Patterns. In HotWeb 2017 : 5th ACM/IEEE Workshop on Hot Topics in Web Systems and Technologies (4:1-4:5), San Jose, CA, USA: Association for Computing Machinery (ACM). | |
dc.identifier.isbn | 978-1-4503-5527-8 | es |
dc.identifier.uri | https://hdl.handle.net/11441/131931 | |
dc.description.abstract | Feeding decision support systems with Web information typically
requires sifting through an unwieldy amount of information that is
available in human-friendly formats only. Our focus is on a scalable
proposal to extract information from semi-structured documents
in a structured format, with an emphasis on it being scalable and
open. By semi-structured we mean that it must focus on information that is rendered using regular formats, not free text; by scalable, we mean that the system must require a minimum amount of
human intervention and it must not be targeted to extracting information from a particular domain or web site; by open, we mean
that it must extract as much useful information as possible and not
be subject to any pre-defined data model. In the literature, there is
only one open but not scalable proposal, since it requires human
supervision on a per-domain basis. In this paper, we present a new
proposal that relies on a number of heuristics to identify patterns
that are typically used to represent the information in a web document. Our experimental results confirm that our proposal is very
competitive in terms of effectiveness and efficiency. | es |
dc.description.sponsorship | Ministerio de Economía y Competitividad TIN2016-75394-R | es |
dc.description.sponsorship | Ministerio de Economía y Competitividad TIN2013-40848-R | es |
dc.format | application/pdf | es |
dc.format.extent | 5 | es |
dc.language.iso | eng | es |
dc.publisher | Association for Computing Machinery (ACM) | es |
dc.relation.ispartof | HotWeb 2017 : 5th ACM/IEEE Workshop on Hot Topics in Web Systems and Technologies (2017), pp. 4:1-4:5. | |
dc.rights | Attribution-NonCommercial-NoDerivatives 4.0 International | * |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/4.0/ | * |
dc.subject | Web information extraction | es |
dc.subject | open information extraction | es |
dc.subject | Web representation patterns | es |
dc.subject | Semi-structured documents | es |
dc.subject | Scalability | es |
dc.title | Extracting Web Information using Representation Patterns | es |
dc.type | info:eu-repo/semantics/conferenceObject | es |
dcterms.identifier | https://ror.org/03yxnpp24 | |
dc.type.version | info:eu-repo/semantics/submittedVersion | es |
dc.rights.accessRights | info:eu-repo/semantics/openAccess | es |
dc.contributor.affiliation | Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos | es |
dc.relation.projectID | TIN2016-75394-R | es |
dc.relation.projectID | TIN2013-40848-R | es |
dc.relation.publisherversion | https://dl.acm.org/doi/10.1145/3132465.3133840 | es |
dc.identifier.doi | 10.1145/3132465.3133840 | es |
dc.contributor.group | Universidad de Sevilla. TIC258: Data-centric Computing Research Hub | es |
dc.publication.initialPage | 4:1 | es |
dc.publication.endPage | 4:5 | es |
dc.eventtitle | HotWeb 2017 : 5th ACM/IEEE Workshop on Hot Topics in Web Systems and Technologies | es |
dc.eventinstitution | San Jose, CA, USA | es |
dc.relation.publicationplace | New York, USA | es |
dc.contributor.funder | Ministerio de Economía y Competitividad (MINECO). España | es |