Construction of an Evaluation Corpus for Grammatical Error Correction for Learners of Japanese as a Second Language

Aomi Koyama, Tomoshige Kiyuna, Kenji Kobayashi, Mio Arai, Mamoru Komachi


Abstract
The NAIST Lang-8 Learner Corpora (Lang-8 corpus) is one of the largest second-language learner corpora. The Lang-8 corpus is suitable as a training dataset for machine translation-based grammatical error correction systems. However, it is not suitable as an evaluation dataset because its corrected sentences are sometimes inappropriate. Therefore, we created and released an evaluation corpus for correcting grammatical errors made by learners of Japanese as a Second Language (JSL). Because our corpus has less noise and its annotation scheme reflects the characteristics of the dataset, it is well suited as an evaluation corpus for correcting grammatical errors in sentences written by JSL learners. In addition, we applied neural machine translation (NMT) and statistical machine translation (SMT) techniques to correct the grammar of the JSL learners’ sentences, evaluated the results using our corpus, and compared the performance of the NMT system with that of the SMT system.
Anthology ID:
2020.lrec-1.26
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
204–211
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.26
Cite (ACL):
Aomi Koyama, Tomoshige Kiyuna, Kenji Kobayashi, Mio Arai, and Mamoru Komachi. 2020. Construction of an Evaluation Corpus for Grammatical Error Correction for Learners of Japanese as a Second Language. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 204–211, Marseille, France. European Language Resources Association.
Cite (Informal):
Construction of an Evaluation Corpus for Grammatical Error Correction for Learners of Japanese as a Second Language (Koyama et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.26.pdf