dc.creator | Jiménez Aguirre, Patricia | es |
dc.creator | Roldán Salvador, Juan Carlos | es |
dc.creator | Corchuelo Gil, Rafael | es |
dc.date.accessioned | 2022-04-11T07:31:38Z | |
dc.date.available | 2022-04-11T07:31:38Z | |
dc.date.issued | 2022 | |
dc.identifier.citation | Jiménez Aguirre, P., Roldán Salvador, J.C. and Corchuelo Gil, R. (2022). On exploring data lakes by finding compact, isolated clusters. Information Sciences, 591 (April 2022), 103-127. | |
dc.identifier.issn | 0020-0255 | es |
dc.identifier.uri | https://hdl.handle.net/11441/131996 | |
dc.description.abstract | Data engineers are very interested in data lake technologies due to the incredible abundance of datasets. They typically use clustering to understand the structure of the datasets
before applying other methods to infer knowledge from them. This article presents the first
proposal that explores how to use a meta-heuristic to address the problem of multi-way
single-subspace automatic clustering, which is very appropriate in the context of data
lakes. It was confronted with five strong competitors that combine the state-of-the-art
attribute selection proposal with three classical single-way clustering proposals, a recent
quantum-inspired one, and a recent deep-learning one. The evaluation focused on exploring their ability to find compact and isolated clusterings as well as the extent to which such
clusterings can be considered good classifications. The statistical analyses conducted on
the experimental results prove that it ranks the first regarding effectiveness using six standard coefficients and it is very efficient in terms of CPU time, not to mention that it did not
result in any degraded clusterings or timeouts. Summing up: this proposal contributes to
the array of techniques that data engineers can use to explore their data lakes. | es |
dc.description.sponsorship | Ministerio de Economía y Competitividad TIN2016-75394-R | es |
dc.description.sponsorship | Ministerio de Ciencia e Innovación PID2020-112540RB-C44 | es |
dc.description.sponsorship | Junta de Andalucía P18-RT-1060 | es |
dc.description.sponsorship | Junta de Andalucía US-1381375 | es |
dc.format | application/pdf | es |
dc.format.extent | 25 | es |
dc.language.iso | eng | es |
dc.publisher | Elsevier | es |
dc.relation.ispartof | Information Sciences, 591 (April 2022), 103-127. | |
dc.rights | Attribution-NonCommercial-NoDerivatives 4.0 International | * |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/4.0/ | * |
dc.subject | Data lakes | es |
dc.subject | Clustering | es |
dc.subject | Meta-heuristics | es |
dc.subject | Genetic algorithms | es |
dc.title | On exploring data lakes by finding compact, isolated clusters | es |
dc.type | info:eu-repo/semantics/article | es |
dcterms.identifier | https://ror.org/03yxnpp24 | |
dc.type.version | info:eu-repo/semantics/publishedVersion | es |
dc.rights.accessRights | info:eu-repo/semantics/openAccess | es |
dc.contributor.affiliation | Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos | es |
dc.relation.projectID | TIN2016-75394-R | es |
dc.relation.projectID | PID2020-112540RB-C44 | es |
dc.relation.projectID | P18-RT-1060 | es |
dc.relation.projectID | US-1381375 | es |
dc.relation.publisherversion | https://www.sciencedirect.com/science/article/pii/S0020025521012664?via%3Dihub | es |
dc.identifier.doi | 10.1016/j.ins.2021.12.045 | es |
dc.contributor.group | Universidad de Sevilla. TIC258: Data-centric Computing Research Hub | es |
dc.journaltitle | Information Sciences | es |
dc.publication.volumen | 591 | es |
dc.publication.issue | April 2022 | es |
dc.publication.initialPage | 103 | es |
dc.publication.endPage | 127 | es |
dc.contributor.funder | Ministerio de Economía y Competitividad (MINECO). España | es |
dc.contributor.funder | Ministerio de Ciencia e Innovación (MICIN). España | es |
dc.contributor.funder | Junta de Andalucía | es |