Corpora for Document-Level Neural Machine Translation

Siyou Liu, Xiaojun Zhang


Abstract
Instead of translating sentences in isolation, document-level machine translation aims to capture discourse dependencies across sentences by considering a document as a whole. In recent years, there has been growing interest in modelling larger context for state-of-the-art neural machine translation (NMT). Although various document-level NMT models have shown significant improvements, three main problems remain: 1) compared with sentence-level translation tasks, the data for training robust document-level models are relatively scarce; 2) experiments in previous work are conducted on each work's own datasets, which vary in size, domain and language; 3) proposed approaches are implemented on distinct NMT architectures such as recurrent neural networks (RNNs) and self-attention networks (SANs). In this paper, we aim to alleviate the low-resource and under-universality problems for document-level NMT. First, we collect a large number of existing document-level corpora, which cover 7 language pairs and 6 domains. To address resource sparsity, we construct a novel document-level parallel corpus in Chinese-Portuguese, a non-English-centred and low-resource language pair. In addition, we implement and evaluate the commonly cited document-level method on top of the advanced Transformer model with universal settings. Finally, we not only demonstrate the effectiveness and universality of document-level NMT, but also release the preprocessed data, source code and trained models for comparison and reproducibility.
Anthology ID:
2020.lrec-1.466
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
3775–3781
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.466
Cite (ACL):
Siyou Liu and Xiaojun Zhang. 2020. Corpora for Document-Level Neural Machine Translation. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3775–3781, Marseille, France. European Language Resources Association.
Cite (Informal):
Corpora for Document-Level Neural Machine Translation (Liu & Zhang, LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.466.pdf