Reproducing Monolingual, Multilingual and Cross-Lingual CEFR Predictions

Yves Bestgen


Abstract
This study aims to reproduce the research of Vajjala and Rama (2018), which showed that it is possible to predict the quality of a text written by learners of a given language by means of a model built on texts written by learners of another language. These authors also pointed out that POS-tag and dependency n-grams were significantly more effective than text length and the global linguistic indices frequently used for this kind of task. The analyses performed show that some important points of their code did not correspond to the explanations given in the paper. These analyses confirm that syntactic n-gram features can be used in cross-lingual experiments to categorize texts according to their CEFR level (Common European Framework of Reference for Languages). However, text length and some classical readability indices are much more effective in the monolingual and multilingual experiments than Vajjala and Rama concluded, and are even the best-performing features when the cross-lingual task is treated as a regression problem. This study emphasizes the importance, for reproducibility, of explicitly setting the reading order of the instances when using a K-fold CV procedure and, more generally, the need to properly randomize these instances beforehand. It also evaluates a two-step procedure to determine the degree of statistical significance of the differences observed in a K-fold cross-validation schema and argues against the use of a Bonferroni-type correction in this context.
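The point about instance order in K-fold cross-validation can be illustrated with a minimal Python sketch. It is not the code of either paper: the function name `shuffled_kfold_indices`, the seed value, and the fold-assignment scheme are assumptions made for illustration only. It shows the general idea of shuffling the instances with a fixed seed before splitting, so that the folds are both randomized and identical on every run.

```python
import random


def shuffled_kfold_indices(n_instances, k=10, seed=1234):
    """Return k (train, test) index splits built after a seeded shuffle.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    indices = list(range(n_instances))
    # A fixed seed makes the order random but identical on every run,
    # independent of the order in which the files were read from disk.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        # Training set: every instance that is not in the held-out fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


splits = shuffled_kfold_indices(n_instances=25, k=5, seed=1234)
```

In practice the instances would first be sorted by a stable key (for example the file name) before shuffling, so that the seeded permutation is applied to the same starting order on every machine.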
Anthology ID:
2020.lrec-1.687
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
5595–5602
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.687
Cite (ACL):
Yves Bestgen. 2020. Reproducing Monolingual, Multilingual and Cross-Lingual CEFR Predictions. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 5595–5602, Marseille, France. European Language Resources Association.
Cite (Informal):
Reproducing Monolingual, Multilingual and Cross-Lingual CEFR Predictions (Bestgen, LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.687.pdf