Machine Learning for handwriting text recognition in historical documents

Aradillas Jaramillo, Jose Carlos

Doctoral Thesis

dc.contributor.advisor	Murillo Fuentes, Juan José	es
dc.contributor.advisor	Martínez Olmos, Pablo	es
dc.creator	Aradillas Jaramillo, Jose Carlos	es
dc.date.accessioned	2022-03-10T10:20:03Z
dc.date.available	2022-03-10T10:20:03Z
dc.date.issued	2021-12-17
dc.identifier.citation	Aradillas Jaramillo, J.C. (2021). Machine Learning for handwriting text recognition in historical documents. (Unpublished doctoral thesis). Universidad de Sevilla, Sevilla.
dc.identifier.uri	https://hdl.handle.net/11441/130648
dc.description.abstract	In this thesis, we focus on the handwriting text recognition task over historical documents that are difficult to read for anyone who is not an expert in ancient languages and writing styles. We aim to take advantage of, and improve on, the neural network architectures and techniques that other authors have proposed for handwriting text recognition in modern handwritten documents. These models perform the task very accurately when a large amount of data is available. However, the scarcity of labeled data is a widespread problem in historical documents: the type of writing is distinctive, and it is expensive to hire an expert to transcribe a large number of pages. After investigating and analyzing the state of the art, we propose the efficient application of methods such as transfer learning and data augmentation. We also contribute an algorithm for purging mislabeled samples that impair model training. Finally, we develop a variational autoencoder method that generates synthetic images of handwritten text for data augmentation. Experiments are performed on several historical handwritten text databases to validate the performance of the proposed algorithms. The analyses focus on how the character error rate (CER) and word error rate (WER) evolve as the training dataset grows. One of the most important results is our participation in a contest on the transcription of historical handwritten text. The organizers provided a dataset of documents to train the model, and then handed over just a few labeled pages from 5 new documents to adjust the solution further. Finally, the transcription of unlabeled images was requested to evaluate the algorithm. Our method ranked second in this contest.	es
dc.format	application/pdf	es
dc.format.extent	139 p.	es
dc.language.iso	eng	es
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/	*
dc.title	Machine Learning for handwriting text recognition in historical documents	es
dc.type	info:eu-repo/semantics/doctoralThesis	es
dcterms.identifier	https://ror.org/03yxnpp24
dc.type.version	info:eu-repo/semantics/publishedVersion	es
dc.rights.accessRights	info:eu-repo/semantics/openAccess	es
dc.contributor.affiliation	Universidad de Sevilla. Departamento de Teoría de la Señal y Comunicaciones	es
dc.publication.endPage	121	es

Files	Size	Format	View	Description
Aradillas Jaramillo, José Carlos ...	34.81Mb	[PDF]	View/Open

Except where otherwise noted, this item's license is described as: Attribution-NonCommercial-NoDerivatives 4.0 International