To Case or not to case: Evaluating Casing Methods for Neural Machine Translation

Thierry Etchegoyhen, Harritxu Gete


Abstract
We present a comparative evaluation of casing methods for Neural Machine Translation, to help establish an optimal pre- and post-processing methodology. We trained and compared system variants on data prepared with the main casing methods available, namely translation of raw data without case normalisation, lowercasing with recasing, truecasing, case factors and inline casing. Machine translation models were prepared on WMT 2017 English-German and English-Turkish datasets, for all translation directions, and the evaluation includes reference metric results as well as a targeted analysis of case preservation accuracy. Inline casing, where case information is marked along lowercased words in the training data, proved to be the optimal approach overall in these experiments.
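As an illustration of the inline casing idea described in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes whitespace-tokenised input and uses two hypothetical marker tokens, `<T>` and `<U>`, placed before a lowercased word to record title case and full upper case; the paper's actual markers and their placement may differ.

```python
# Hypothetical case markers; the scheme used in the paper may differ.
TITLE_TAG = "<T>"  # the following word was Title-cased
UPPER_TAG = "<U>"  # the following word was fully upper-cased


def encode_inline_case(sentence: str) -> str:
    """Lowercase title/upper-cased tokens and mark their case with a tag."""
    out = []
    for tok in sentence.split():
        if len(tok) > 1 and tok.isupper():
            out += [UPPER_TAG, tok.lower()]
        elif tok[0].isupper() and tok[1:] == tok[1:].lower():
            out += [TITLE_TAG, tok.lower()]
        else:
            # Lowercase, mixed-case and uncased tokens are left unchanged.
            out.append(tok)
    return " ".join(out)


def decode_inline_case(tagged: str) -> str:
    """Restore the original casing from the inline tags."""
    out = []
    pending = None
    for tok in tagged.split():
        if tok in (TITLE_TAG, UPPER_TAG):
            pending = tok  # remember the tag and apply it to the next word
            continue
        if pending == UPPER_TAG:
            tok = tok.upper()
        elif pending == TITLE_TAG:
            tok = tok[0].upper() + tok[1:]
        pending = None
        out.append(tok)
    return " ".join(out)


tagged = encode_inline_case("The NATO summit in Berlin")
restored = decode_inline_case(tagged)
```

In this sketch the translation model would be trained on the tagged, lowercased text, and the tags in its output would be resolved by the decoding step. Python's `str.lower()` and `str.upper()` are not locale-aware, so Turkish dotted and dotless I would need separate handling.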
Anthology ID:
2020.lrec-1.463
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
3752–3760
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.463
Cite (ACL):
Thierry Etchegoyhen and Harritxu Gete. 2020. To Case or not to case: Evaluating Casing Methods for Neural Machine Translation. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3752–3760, Marseille, France. European Language Resources Association.
Cite (Informal):
To Case or not to case: Evaluating Casing Methods for Neural Machine Translation (Etchegoyhen & Gete, LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.463.pdf