A modular approach for lexical normalization applied to Spanish tweets
|Author/s||Cotelo Moya, Juan Manuel
Cruz Mata, Fermín
Troyano Jiménez, José Antonio
Ortega Rodríguez, Francisco Javier
|Department||Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos|
|Abstract||Twitter is a widely successful social media platform where millions of people continuously express ideas and opinions on a myriad of topics. It is a huge and interesting source of data, but most of these texts are written hastily and heavily abbreviated, rendering them unsuitable for traditional Natural Language Processing (NLP). The two main contributions of this work are the characterization of textual error phenomena in Twitter and the proposal of a modular normalization system that improves the textual quality of tweets. Instead of focusing on a single technique, we propose an extensible normalization system that relies on the combination of several independent “expert modules”, each addressing a very specific error phenomenon in its own way, thus increasing module accuracy and lowering module building costs. Broadly speaking, the system resembles an “expert board”: modules independently propose correction candidates for each Out of Vocabulary (OOV) word, the candidates are ranked, and the best one is selected. To evaluate our proposal, we perform several experiments using Spanish-language texts from Twitter about a specific topic. The flexibility of defining resources at different language levels (core language, domain, genre), combined with the modular architecture, leads to lower costs and good performance: building the resources requires minimal effort, and the system achieves more than 82% accuracy compared to the 31% yielded by the baseline.
|Funding agencies||Ministerio de Economía y Competitividad (MINECO). Spain
Junta de Andalucía
|Citation||Cotelo Moya, J.M., Cruz Mata, F., Troyano Jiménez, J.A. and Ortega Rodríguez, F.J. (2015). A modular approach for lexical normalization applied to Spanish tweets. Expert Systems with Applications, 42 (10), 4743-4754.|
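The “expert board” idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy lexicon, the two modules (`abbreviation_module`, `repetition_module`), the confidence scores, and the `normalize` function are all hypothetical names and values chosen for illustration. Each module independently proposes scored candidates for an OOV word, and the highest-scoring candidate is selected.

```python
import re
from typing import Callable, List, Tuple

# A candidate is (proposed correction, confidence score); both are illustrative.
Candidate = Tuple[str, float]
Module = Callable[[str], List[Candidate]]

# Hypothetical toy resources; a real system would load much larger lexicons.
LEXICON = {"hola", "que", "gol", "por", "favor"}
ABBREVIATIONS = {"q": "que", "xfa": "por favor"}


def abbreviation_module(word: str) -> List[Candidate]:
    """Expand known abbreviations (assumed fixed confidence of 0.9)."""
    expansion = ABBREVIATIONS.get(word)
    return [(expansion, 0.9)] if expansion else []


def repetition_module(word: str) -> List[Candidate]:
    """Collapse runs of repeated characters, e.g. 'holaaaa' -> 'hola'."""
    collapsed = re.sub(r"(.)\1+", r"\1", word)
    if collapsed != word and collapsed in LEXICON:
        return [(collapsed, 0.8)]
    return []


MODULES: List[Module] = [abbreviation_module, repetition_module]


def normalize(tokens: List[str]) -> List[str]:
    """Replace each OOV token with the best-ranked candidate, if any."""
    result = []
    for token in tokens:
        word = token.lower()
        if word in LEXICON:
            result.append(token)  # in-vocabulary words are left untouched
            continue
        # Every module proposes candidates independently.
        candidates = [c for module in MODULES for c in module(word)]
        if candidates:
            best, _score = max(candidates, key=lambda c: c[1])
            result.append(best)
        else:
            result.append(token)  # no module had a proposal; keep the token
    return result


print(normalize("holaaaa q goool".split()))
```

Adding a new error phenomenon in this sketch only requires writing another function with the `Module` signature and appending it to `MODULES`, which mirrors the extensibility the abstract emphasizes.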