Automatic text readability classification: resources and models for Galician

DOI:

https://doi.org/10.21814/lm.17.2.488

Keywords:

readability corpus, automatic readability assessment, text classification, Galician, fine-tuning, adult learning

Abstract

Automatic readability assessment is a growing field within Natural Language Processing, with significant implications for areas such as language teaching and learning and accessibility. In this context, this paper presents Corlega, the first corpus of Galician texts classified by readability level, consisting of 480 texts aimed at adult readers. The corpus covers 11 categories and 36 subcategories, including a variety of text types, genres and subgenres. The selection, compilation and classification of documents follow the standards of the iRead4Skills project, which develops resources and computational models for Portuguese, Spanish and French. To compile Corlega, this work defines six levels of readability in Galician and proposes a set of linguistic descriptors for each level. Using this taxonomy, we describe the compilation process of the corpus and its current distribution, which spans four of the six readability levels, as well as the main features of this new resource. Additionally, we used the corpus to train and evaluate automatic readability classification tools by fine-tuning monolingual and multilingual Transformer models and by implementing hybrid models. The results suggest that, with small training corpora, feature extraction from pre-trained models is an efficient way to achieve results competitive with supervised fine-tuning. However, combining corpora from different languages enables the fine-tuning of multilingual models with better performance. Both the corpus and the models are available to the scientific community.

Published

2025-11-23

Section

Research Articles

How to Cite

Automatic text readability classification: resources and models for Galician. (2025). Linguamática, 17(2), 33-56. https://doi.org/10.21814/lm.17.2.488