Japanese-Russian TMU Neural Machine Translation System using Multilingual Model for WAT 2019

We introduce our system submitted to the News Commentary task (Japanese↔Russian) of the 6th Workshop on Asian Translation. The goal of this shared task is to study extremely low-resource situations for distant language pairs. It is known that using parallel corpora of other language pairs as training data is effective for multilingual neural machine translation models in extremely low-resource scenarios. Therefore, to improve the translation quality of the Japanese↔Russian language pair, our method leverages other in-domain Japanese-English and English-Russian parallel corpora as additional training data for our multilingual NMT model.


Introduction
The News Commentary shared task of the 6th Workshop on Asian Translation (Nakazawa et al., 2019) addresses Japanese↔Russian (Ja↔Ru) news translation. It is a very challenging task considering: (a) the extremely low-resource setting, with only 12k parallel sentences of training data; (b) the distance between the given languages in terms of writing system, phonology, morphology, grammar, and syntax; (c) the difficulty of translating news covering various topics, which leads to a large number of unknown tokens in such an extremely low-resource scenario.
Neural machine translation (NMT) (Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017) enables end-to-end training of a translation system but usually requires a large amount of parallel training data (Koehn and Knowles, 2017). Therefore, various techniques involve other pivot languages to increase the accuracy of low-resource MT, such as pivot-based SMT (Utiyama and Isahara, 2007), transfer learning (Zoph et al., 2016; Kocmi and Bojar, 2018), and multilingual modeling (Firat et al., 2016). Recently, a simple multilingual modeling approach (MultiNMT) was proposed by Johnson et al. (2017), which translates between multiple languages using a single model and an artificial token indicating the target language, taking advantage of multilingual data to improve NMT for all languages involved. Imankulova et al. (2019) showed that incorporating MultiNMT (Johnson et al., 2017) yields better BLEU scores than unidirectional and pivot-based PBSMT approaches, and that domain mismatch has a negative effect on low-resource NMT.
Therefore, we use MultiNMT modeling for extremely low-resource Ja↔Ru translation, involving English (En) as a third, pivot language (Utiyama and Isahara, 2007). Considering the importance of domain matching, we focus only on the news domain of the additional Ja↔En and Ru↔En auxiliary parallel corpora, which we refer to as pivot parallel corpora, and we investigate how translation results improve when in-domain pivot parallel corpora (Ja↔En and Ru↔En) are used in MultiNMT modeling. As a result, the in-domain pivot parallel corpora increase the coverage of the Ja and Ru vocabularies, and we show that the new tokens introduced by the in-domain pivot corpora can be translated successfully.

Related Work
The existing state-of-the-art NMT model, the Transformer (Vaswani et al., 2017), works well in different scenarios (Lakew et al., 2018; Imankulova et al., 2019). MultiNMT using the artificial-token approach (Johnson et al., 2017) is known to help language pairs with relatively less data (Lakew et al., 2018; Rikters et al., 2018) and to outperform bi-directional and uni-directional translation approaches (Imankulova et al., 2019). Similarly, we exploit the MultiNMT approach with the Transformer architecture. Our work is heavily based on Imankulova et al. (2019), who proposed a multi-stage fine-tuning approach that combines multilingual modeling and domain adaptation: they utilize out-of-domain pivot parallel corpora to perform domain adaptation on in-domain pivot parallel corpora and then perform multilingual transfer for the language pair of interest. However, instead of utilizing out-of-domain pivot parallel corpora, we investigate the impact of other in-domain pivot parallel corpora.
Pseudo-parallel data can be used to augment existing parallel corpora for training, and previous work has reported that such data generated by so-called back-translation can substantially improve the quality of NMT (Sennrich et al., 2016). However, this approach requires base MT systems that can generate somewhat accurate translations (Imankulova et al., 2017). Therefore, instead of creating noisy pseudo-parallel corpora, we take advantage of other in-domain pivot parallel corpora.

Data
To train MultiNMT systems, we used the news-domain data provided by WAT2019 [1]. More specifically, we used Global Voices [2] as training data for Ja↔Ru, Ja↔En, and Ru↔En, and manually aligned, cleaned, and filtered News Commentary data as development and test sets [3]. Additionally, we utilized Jiji [4] and News Commentary [5] data for Ja↔En and Ru↔En, respectively. Table 1 summarizes the size of the train/development/test splits used in our experiments.
[1] http://lotus.kuee.kyoto-u.ac.jp/WAT/WAT2019/index.html
[2] https://globalvoices.org/
[3] https://github.com/aizhanti/JaRuNC
[4] http://lotus.kuee.kyoto-u.ac.jp/WAT/jiji-corpus/
We tokenized English and Russian sentences using tokenizer.perl of Moses (Koehn et al., 2007) [6]. To tokenize Japanese sentences, we used MeCab [7] with the IPA dictionary. After tokenization, we eliminated duplicated sentence pairs and sentences with more than 100 tokens, for all the languages.
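The deduplication and length-filtering step described above can be sketched as follows; this is a minimal illustration, not the authors' actual preprocessing script, and the function name and data format (whitespace-tokenized sentence pairs) are our own assumptions.

```python
def clean_corpus(pairs, max_len=100):
    """Remove duplicated sentence pairs and pairs in which either side
    exceeds max_len tokens (tokens are whitespace-separated)."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        if (src, tgt) in seen:
            continue  # exact duplicate pair
        if len(src.split()) > max_len or len(tgt.split()) > max_len:
            continue  # over-long sentence on either side
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept

pairs = [
    ("彼 は 走る", "он бежит"),
    ("彼 は 走る", "он бежит"),  # duplicate, dropped
    ("a " * 101, "b"),           # 101 tokens on the source side, dropped
]
print(clean_corpus(pairs))  # only the first pair survives
```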

Systems
This section describes our system, TMU, and our baseline, which are based on the same MultiNMT architecture (Johnson et al., 2017) but trained on different training corpora (Table 1). MultiNMT translates from multiple source languages into different target languages within a single model. To realize such translation, an artificial token is introduced at the beginning of the input sentence to indicate the target language the model should translate into. Since we have 3 language pairs, we concatenate all pairs in both directions, oversampling each to match the largest parallel corpus. We add a target-language token to the source side of each pair and treat the result like a single language-pair case.
We experiment with the following systems:
• TMU: Our system, trained on a balanced concatenation of the Global Voices, Jiji, and News Commentary corpora over 6 translation directions.
• Only GV: A comparative model trained only on Global Voices, used to investigate the effect of the additional pivot corpora.
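The target-token and oversampling scheme described above can be sketched as follows; the `<2xx>` token format and function names are illustrative assumptions on our part, not the exact tensor2tensor convention.

```python
import random

def add_target_token(pairs, tgt_lang):
    """Prepend an artificial token (e.g. <2ru>) to each source sentence,
    telling the multilingual model which language to translate into."""
    return [(f"<2{tgt_lang}> {src}", tgt) for src, tgt in pairs]

def oversample(pairs, target_size):
    """Repeat a corpus (plus a random remainder) so that each direction
    contributes roughly target_size examples to the concatenation."""
    reps = target_size // len(pairs)
    rest = target_size - reps * len(pairs)
    return pairs * reps + random.sample(pairs, rest)

ja_ru = add_target_token([("こんにちは", "привет")], "ru")
print(ja_ru[0][0])  # → "<2ru> こんにちは"
```

All six tagged, oversampled direction corpora are then concatenated and shuffled into a single training set, as with an ordinary bilingual model.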

Implementation
We used the open-source tensor2tensor implementation of the Transformer model [8]. Table 2 contains the main hyper-parameters.
The hyper-parameters not mentioned in this table were left at the tensor2tensor default values.
We over-sampled the Ja→Ru and Ja→En training data so that their sizes matched the largest, Ru→En, data for each model. However, the development set was created by concatenating those of the individual translation directions without any over-sampling. We also used tensor2tensor's internal sub-word segmentation mechanism; the size of the shared sub-word vocabulary was set to 32k. By default, tensor2tensor truncates sentences longer than 256 sub-words to prevent out-of-memory errors during training. We incorporated early stopping, halting training if the BLEU score on the development set did not improve for 10,000 updates (10 checkpoints).
At inference time, we averaged the last 10 checkpoints and decoded the test sets with a beam size and length penalty tuned by linear search on the development-set BLEU score. The length penalty was 1.0 for Ja→Ru and 1.1 for Ru→Ja; the beam size was set to 12 and 3 for Ja→Ru and Ru→Ja, respectively. Although we train our models on 6 translation directions, we report BLEU scores only on the Ja→Ru and Ru→Ja test sets.
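Checkpoint averaging takes the element-wise mean of the model parameters across the saved checkpoints. tensor2tensor ships its own averaging utility; the toy function below only illustrates the idea, representing each checkpoint as a plain dict mapping variable names to lists of floats.

```python
def average_checkpoints(checkpoints):
    """Element-wise mean of parameter values across several checkpoints.
    Each checkpoint is a dict: variable name -> list of float values."""
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(ckpt[name] for ckpt in checkpoints))]
        for name in checkpoints[0]
    }

ckpts = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
print(average_checkpoints(ckpts))  # → {'w': [2.0, 3.0]}
```

Averaging the final checkpoints smooths out the noise of individual updates and typically gives a small but consistent BLEU gain over decoding with the last checkpoint alone.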

Discussion
We investigate the effect of adding the Jiji and News Commentary corpora as pivot parallel corpora to the original Global Voices training data. In extremely low-resource machine translation in the news domain, unknown tokens become a serious issue due to limited vocabulary coverage. Adding the pivot parallel corpora to the training data can be expected to increase this coverage, so we investigate how much vocabulary coverage improves when pivot parallel corpora are used. For that purpose, we define the following vocabulary sets: T is the set of unknown test-data tokens not included in the direct 12k Ja↔Ru training data, G is the pivot Global Voices vocabulary, and P is the Jiji and News Commentary training vocabulary.
A is the set of unknown test-data tokens covered by the pivot Global Voices training data, and B is the set of unknown test-data tokens covered once the concatenated vocabulary of the Jiji and News Commentary pivot parallel corpora is added to A; that is, A = T ∩ G and B = T ∩ (G ∪ P). By comparing the numbers of tokens and of distinct word types in A and B, we can see how much the coverage has increased. In addition, we investigate how correctly the tokens added by the Jiji and News Commentary corpora are translated: if a token from A or B appears in both the gold sentence and the system translation, it is counted as correctly translated. Table 4 shows token and type coverage on the test data, together with the numbers of correctly translated tokens and types, for A and B. Both Ru and Ja show improved coverage for B compared to A; in particular, the coverage of Ru improves greatly. Moreover, adding the Jiji and News Commentary corpora to the training data increases the number of correctly translated tokens, showing that vocabulary coverage has increased and translation accuracy has improved. On the other hand, the number of correctly translated tokens is small compared to the coverage gained from the additional parallel data. We consider this to be due to the difficulty of directly learning Ja↔Ru translation from the added indirect Ja↔En and Ru↔En pivot corpora.
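Given the definitions of T, G, and P above, the coverage sets A and B reduce to simple set operations over the vocabularies; the sketch below is our own illustration, with hypothetical variable names.

```python
def coverage_sets(test_vocab, direct_vocab, gv_vocab, pivot_vocab):
    """T: test tokens unseen in the direct Ja-Ru training data;
    A: those covered by the Global Voices pivot data;
    B: those covered by Global Voices plus Jiji/News Commentary."""
    T = test_vocab - direct_vocab
    A = T & gv_vocab
    B = T & (gv_vocab | pivot_vocab)
    return A, B

# "x" is in the direct data; "y" only in Global Voices; "z" only in Jiji/NC.
A, B = coverage_sets({"x", "y", "z"}, {"x"}, {"y"}, {"z"})
print(A, B)  # A = {'y'}, B = {'y', 'z'}
```

The per-token accuracy check then amounts to intersecting each of these sets with the tokens shared between the gold sentence and the system output.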
Furthermore, to better understand the tokens covered by the pivot corpora, we analyze cases where the tokens newly added by the Jiji and News Commentary corpora are translated correctly and incorrectly. We define C as the set of test-data tokens newly covered by adding the Jiji and News Commentary corpora (i.e., the tokens in B but not in A). Table 5 shows translation examples from the Only GV and TMU systems; the [unknown tokens] in each sentence belong to C. The first sentence is an example (a) where TMU was able to correctly translate "株主", unlike Only GV. On the other hand, the second example shows that neither TMU nor Only GV could correctly translate the unknown token "表立っ", even though it is included in the pivot parallel corpora. We consider that it could not be translated because the whole sentence was translated incorrectly.

Conclusion
In this paper, we introduced our system submitted to the News Commentary task (Ja↔Ru) of the 6th Workshop on Asian Translation. The difficult part of this shared task is handling unknown tokens, which arise from the challenging news domain covering various topics and from the extremely small amount of available parallel data. To address this issue, we investigated the coverage of translatable tokens when training MultiNMT with in-domain pivot parallel corpora. As a result, we found that our system can translate more tokens by taking advantage of the additional pivot parallel corpora. In the future, we will explore whether translation results improve further by using other Ja↔Ru (e.g., Tatoeba) and Ru↔En (e.g., UN) corpora.
In the news domain, there is also the problem of completely new tokens, a type of unknown token that cannot be dealt with simply by increasing training-data coverage, since new information appears every day. Therefore, we plan to tackle the problem of new tokens that cannot be introduced through additional corpora.