The University of Maryland’s Kazakh-English Neural Machine Translation System at WMT19

This paper describes the University of Maryland’s submission to the WMT 2019 Kazakh-English news translation task. We study the impact of transfer learning from another low-resource but related language. We experiment with different ways of encoding lexical units to maximize lexical overlap between the two language pairs, as well as back-translation and ensembling. The submitted system improves over a Kazakh-only baseline by +5.45 BLEU on newstest2019.

While much work addresses this problem via semi-supervised learning from monolingual text (Sennrich et al., 2016; He et al., 2016), we focus on transfer learning from another language pair (Zoph et al., 2016; Nguyen and Chiang, 2017; Lakew et al., 2018). In this setting, an NMT system is first trained on auxiliary parallel data from a so-called "parent" language pair, and the trained model is then used to initialize a "child" model, which is further trained on a low-resource language pair. Similar approaches that support cross-lingual transfer learning for multilingual NMT train a single model on the concatenation of all data instead of training sequentially (Gu et al., 2018; Zhou et al., 2018; Wang et al., 2019).
Transfer learning has been found effective in submissions to WMT in previous years: improvements of +2.4 BLEU were reported on the low-resource Estonian→English translation task by transfer learning from Finnish→English.
Interestingly, the transfer learning approach has been observed to remain effective when there is no relatedness between the "child" and "parent" language pairs, with the size of the parent training set hypothesized to be the most important factor behind the translation quality improvements. However, previous work has also empirically validated that transfer learning benefits most when the "child" and "parent" languages belong to the same or to linguistically similar language families (Dabre et al., 2017). Specifically, Nguyen and Chiang (2017) showed consistent improvements in two Turkic languages by transferring from another related, low-resource language.
Taking those recent results into consideration, our main focus at WMT19 is to examine transfer learning for the Kazakh-English language pair using additional parallel data from Turkish-English. While using distinct writing systems, both source languages belong to the Turkic language family and preserve many morphological and syntactic features common to that group (Kessikbayeva and Cicekli, 2014). As a result, they constitute a suitable "child"-"parent" language-pair choice for exploring transfer learning between related low-resource languages. In this direction, we conduct experiments to address the following questions:
• How can we represent lexical units to exploit vocabulary overlap between languages? We compare bilingual and monolingual byte-pair encoding models with the recently proposed soft decoupled encoding model.
• How can we leverage both "child" and "parent" parallel data to obtain synthetic back-translated data from monolingual resources?

Approach
Our method follows a simple strategy used in Wang et al. (2019) for multilingual training: we directly train NMT models on the concatenation of parallel data covering both the "child" and "parent" languages, with no metadata to distinguish between them. Within this framework, we study the impact of (a) different lexical representations that attempt to maximize parameter sharing across related languages, (b) romanization to increase overlap between Turkish and Kazakh, which are originally written in distinct scripts, and (c) synthetic training data obtained by back-translation.

Lexical Units
How can we define lexical units to maximize information sharing across related source languages? We compare different configurations of sub-word segmentations using different variants of the standard Byte-Pair Encoding (BPE) framework (Sennrich et al., 2016), and compare them with the Soft Decoupled Encoding framework that exploits character n-gram representations of words instead of sub-words (Wang et al., 2019).
Joint BPEs (JBPEs) BPEs are learned jointly from the concatenation of "child" and "parent" parallel data. The advantage of this strategy is that the sub-word segmentations of related words in the two languages are encouraged to be more aligned, which enables sharing of their representations on the source side through a larger vocabulary overlap. However, the "child" language might be "overwhelmed" by the "parent" language when there is a significant difference in the amount of their data (Neubig and Hu, 2018). This could lead to over-segmentation of the "child" language and subsequently limit the expressive power of the NMT system.
Separate BPEs (SBPEs) BPEs are learned separately for each language. This framework was found to be effective in the multilingual setting, especially for translation from extremely low-resource languages (Neubig and Hu, 2018). However, learning the merging operations separately might lead to unaligned sub-units between the two languages that fail to exploit relationships between their lexical representations.
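The difference between the two strategies can be illustrated with a minimal sketch of BPE learning. This is a simplified pure-Python learner, not the tooling used for the experiments; the function names and the toy romanized word lists are hypothetical and serve only to show where the joint and separate settings differ.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from a list of word tokens."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken deterministically by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a word by applying the learned merges in the order they were learned."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


if __name__ == "__main__":
    child = ["kitap", "kitaptar", "su"]            # toy romanized "child" words (hypothetical)
    parent = ["kitap", "kitaplar", "su", "sular"]  # toy "parent" words (hypothetical)

    joint = learn_bpe(child + parent, 10)   # JBPE: one merge table for both languages
    sep_child = learn_bpe(child, 10)        # SBPE: one merge table per language
    sep_parent = learn_bpe(parent, 10)

    print(segment("kitaptar", joint), segment("kitaptar", sep_child))
    print(segment("kitaplar", joint), segment("kitaplar", sep_parent))
```

In the joint setting a single merge table segments both languages, so frequent parent-language merges can dominate; in the separate setting each language keeps its own table, at the cost of possibly unaligned sub-words.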
Soft Decoupled Encoding (SDE) Small discrepancies in the spelling of words that share the same semantics across the two languages could lead to different segmented sub-units and hinder lexical-level sharing between them. To take those spelling differences into account, we further experiment with the SDE framework, which does not rely on any segmentation at pre-processing time. Specifically, SDE represents a word as a combination of two components: a character encoding that models the language-specific spelling of the word and a latent semantic embedding that captures its language-agnostic semantics. Below, we briefly summarize the main SDE components as proposed in Wang et al. (2019).
Lexical embedding Each word w is first decomposed into its bag of character n-grams, BoN(w). Let C be the number of character n-grams in the vocabulary and D be the dimension of the corresponding character n-gram embeddings. To obtain a lexical representation c(w), the bag of n-grams is multiplied by an embedding matrix $W_c \in \mathbb{R}^{C \times D}$:
$$c(w) = \tanh\left(\mathrm{BoN}(w) \cdot W_c\right)$$
Language Specific Transformation Next, each word is passed through a language-dependent transformation. For each language $L_i$ a matrix $W_{L_i} \in \mathbb{R}^{D \times D}$ is learned, and the transformed embedding $c_i(w)$ is computed as:
$$c_i(w) = \tanh\left(c(w) \cdot W_{L_i}\right)$$
Latent Semantic Embedding The semantic concepts shared among languages are represented by a matrix $W_s \in \mathbb{R}^{S \times D}$, where S corresponds to the number of semantic concepts a language can express. The latent embedding of a word w is then given by:
$$e_{\mathrm{latent}}(w) = \mathrm{Softmax}\left(c_i(w) \cdot W_s^{\top}\right) \cdot W_s$$
Finally, the SDE embedding of word w is obtained as a combination of the language-dependent lexical encoding and the latent embedding:
$$e_{\mathrm{SDE}}(w) = e_{\mathrm{latent}}(w) + c_i(w)$$
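The SDE forward pass for a single word can be sketched in a few lines. This is a minimal pure-Python illustration following the formulation of Wang et al. (2019), assuming tanh non-linearities, a softmax attention over the latent concept matrix, and n-grams of all lengths from 1 to a maximum order. The helper names, the random initialization and the toy dimensions are hypothetical and are not taken from the released code.

```python
import math
import random


def char_ngrams(word, n_max):
    """All character n-grams of length 1..n_max contained in `word`."""
    return [word[i:i + n] for n in range(1, n_max + 1) for i in range(len(word) - n + 1)]


def bag_of_ngrams(word, ngram_index, n_max):
    """Count vector BoN(w) over the n-gram vocabulary; unknown n-grams are skipped."""
    bon = [0.0] * len(ngram_index)
    for gram in char_ngrams(word, n_max):
        if gram in ngram_index:
            bon[ngram_index[gram]] += 1.0
    return bon


def vec_mat(v, matrix):
    """Row vector (length r) times matrix (r x c), returned as a list of length c."""
    return [sum(v[i] * matrix[i][j] for i in range(len(v))) for j in range(len(matrix[0]))]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng, scale=0.1):
    """Toy parameter initialization (an assumption, for illustration only)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def sde_embed(word, ngram_index, W_c, W_L, W_s, n_max):
    """SDE embedding of one word; W_c is C x D, W_L is D x D, W_s is S x D."""
    bon = bag_of_ngrams(word, ngram_index, n_max)
    c = [math.tanh(x) for x in vec_mat(bon, W_c)]    # lexical embedding c(w)
    c_i = [math.tanh(x) for x in vec_mat(c, W_L)]    # language-specific transformation
    scores = [sum(a * b for a, b in zip(c_i, concept)) for concept in W_s]  # c_i(w) W_s^T
    e_latent = vec_mat(softmax(scores), W_s)         # attention-weighted mix of concepts
    return [a + b for a, b in zip(e_latent, c_i)]    # residual combination


if __name__ == "__main__":
    rng = random.Random(0)
    grams = sorted(set(char_ngrams("kitap", 4) + char_ngrams("kitaplar", 4)))
    index = {g: k for k, g in enumerate(grams)}
    D, S = 8, 5
    W_c = random_matrix(len(index), D, rng)
    W_L = random_matrix(D, D, rng)
    W_s = random_matrix(S, D, rng)
    print(sde_embed("kitap", index, W_c, W_L, W_s, n_max=4))
```

Skipping the language-specific transformation (using c(w) directly in place of c_i(w)) gives the variant without the separate per-language projection.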

Romanization
Given that the provided Kazakh and Turkish data are written in the Cyrillic and Latin scripts respectively, we investigate the impact of mapping text in the two languages into a common orthography. We transliterate both the "child" and the "parent" data using a transliteration tool that applies the same romanization rules to both, in order to encourage more overlap between child and parent data. Table 1 illustrates how romanization makes shared vocabulary and similarity between the two languages more explicit than the original scripts do. Table 2 summarizes the overlap between the source-side vocabularies of the two languages for different lexical encodings, with and without romanization. This analysis indicates that using the original script amounts to exploring transfer learning when lexical-level sharing between the two languages is limited. In contrast, the vocabulary overlap between them increases significantly once we romanize the data.
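The effect of romanization on vocabulary overlap can be illustrated with a toy example. The character table below is a hypothetical seven-letter mapping written for this sketch; it is not the rule set of the transliteration tool used in the experiments. The overlap measure (the fraction of child vocabulary types also found in the parent vocabulary) is likewise an assumption and may differ from the statistic reported in Table 2.

```python
# Toy Cyrillic-to-Latin table (hypothetical, illustration only).
TOY_ROMANIZATION = {
    "а": "a", "і": "i", "к": "k", "п": "p", "с": "s", "т": "t", "у": "u",
}

KAZAKH = ["су", "ат", "кітап"]    # toy "child" vocabulary in Cyrillic script
TURKISH = ["su", "at", "kitap"]   # toy "parent" vocabulary in Latin script


def romanize(text, table=TOY_ROMANIZATION):
    """Map each character through the table; unmapped characters are kept as they are."""
    return "".join(table.get(ch, ch) for ch in text)


def type_overlap(child_tokens, parent_tokens):
    """Fraction of child vocabulary types that also occur in the parent vocabulary."""
    child, parent = set(child_tokens), set(parent_tokens)
    return len(child & parent) / len(child) if child else 0.0


if __name__ == "__main__":
    print("original script:", type_overlap(KAZAKH, TURKISH))
    print("romanized:      ", type_overlap([romanize(w) for w in KAZAKH], TURKISH))
```

The same function can be applied to sub-word or n-gram vocabularies instead of word types to compare overlap across lexical encodings.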

Synthetic Data
We further explore different ways to incorporate the target-side English monolingual data provided by the competition into low-resource NMT. Following the widely used back-translation approach (Sennrich et al., 2016), we create synthetic parallel data and then train new NMT models on the mixture of real and synthetic parallel data.
Back-translation+transfer Given the scarcity of Kazakh parallel data, we also attempt to incorporate both Kazakh and Turkish data to train a model that translates in the opposite direction. In order to produce output that is more similar to our main language of interest, we introduce two artificial tokens (<2kk>, <2tr>) at the beginning of the input sentence to indicate the target language the model should translate to (Johnson et al., 2017). After the reversed system is trained, we back-translate each target sentence into a synthetic Kazakh sentence.
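The target-language tokens can be added with a short helper. The sketch below shows one way to build training pairs for the reversed (English-to-source) system and to prepare monolingual English input at back-translation time. Only the token strings <2kk> and <2tr> come from the description above; the function names and the (source, English) pair layout are hypothetical.

```python
TARGET_TAGS = {"kk": "<2kk>", "tr": "<2tr>"}


def add_target_tag(english_sentence, target_lang):
    """Prepend the artificial token that tells the model which language to produce."""
    if target_lang not in TARGET_TAGS:
        raise ValueError(f"unsupported target language: {target_lang}")
    return f"{TARGET_TAGS[target_lang]} {english_sentence}"


def build_reverse_training_pairs(kk_en_pairs, tr_en_pairs):
    """Turn (source, English) pairs into (tagged English, source) pairs for the reversed model."""
    pairs = [(add_target_tag(en, "kk"), kk) for kk, en in kk_en_pairs]
    pairs += [(add_target_tag(en, "tr"), tr) for tr, en in tr_en_pairs]
    return pairs


def prepare_for_back_translation(english_sentences):
    """At back-translation time every English sentence requests Kazakh output."""
    return [add_target_tag(en, "kk") for en in english_sentences]


if __name__ == "__main__":
    toy = build_reverse_training_pairs([("su", "water")], [("kitap", "book")])
    print(toy)
    print(prepare_for_back_translation(["a new sentence"]))
```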

Model Configuration
Our NMT systems are built upon the publicly available code of Wang et al. (2019) and are sequence-to-sequence, 1-layer attentional long short-term memory (LSTM) networks with a hidden dimension of 512 for both the encoder and the decoder. The word embedding dimension is kept at 128, and all other layer dimensions are set to 512.
We use a dropout rate of 0.3 for the word embedding and for the output vector before the decoder Softmax layer. The batch size is set to 1500 words.
Monolingual For the Empty source and Back-translation methods of creating synthetic data, we used the target side of the Turkish-English parallel corpus as monolingual data. For the Back-translation+transfer experiment, we used 100K randomly selected sentences from the News Commentary corpus, excluding sentences with fewer than 5 or more than 50 words.
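The selection of monolingual sentences described above can be sketched as a length filter followed by random sampling. This is an illustrative sketch only: whitespace splitting as the word count, the fixed seed and the function name are assumptions, and the actual selection script is not described in the text.

```python
import random


def select_monolingual(sentences, k, min_len=5, max_len=50, seed=0):
    """Keep sentences with min_len..max_len words, then sample at most k of them."""
    kept = [s for s in sentences if min_len <= len(s.split()) <= max_len]
    if len(kept) <= k:
        return kept
    return random.Random(seed).sample(kept, k)


if __name__ == "__main__":
    corpus = [
        "too short",
        "this sentence has exactly six words",
        "another sentence that is long enough to keep",
    ]
    print(select_monolingual(corpus, k=100_000))
```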
Pre-processing We process all corpora consistently. We tokenize the sentences and perform truecasing with the Moses scripts (Koehn et al., 2007). For all experiments we consistently use 8K BPEs on the English target side. We experiment with {32, 64}K merge operations for the models using BPE encoding and with n-grams of order {4, 5} for the SDE framework. To establish a fair comparison between the source language representations, we consistently use the same encoding for English words (target side), using BPEs learned on the concatenation of all the English data.
Tuning and Testing Data The official newsdev2019 is used as the validation set, and newstest2019 is used as the test set.

Experiments
Starting from Baseline BPE-based NMT systems trained using only the Kazakh data provided by the competition, we conduct the following experiments. Table 3 presents our results of 3 runs using {32, 64}K merge operations in total for each experiment. Generally, both Joint and Separate BPE segmentation strategies, with and without romanization, improve BLEU over the Baseline. Previous empirical results on transfer learning for extremely low-resource languages indicated that training the BPE operations separately for the "child" and "parent" languages has a large positive effect on the performance of the model (Wang et al., 2019). By contrast, JBPEs and SBPEs perform comparably well in almost all configurations here. This could be attributed to our less imbalanced setting, where the ratio of "child" to "parent" data is 1 : 2 and the child language therefore contributes more to the sub-word segmentation rules. The best BLEU score is achieved using 32K JBPEs on the romanized data, which is the configuration with the largest vocabulary overlap according to Table 2. However, using {32, 64}K SBPEs on the original data only hurts BLEU by 0.5 and 1.24, despite the lack of lexical overlap. This suggests that most of the improvement does not come from the shared encoder vocabulary.

Table 4: SDE experiments using 64K n-grams of the concatenated corpora. The last line refers to the best BLEU score using 64K BPEs for comparison.

Soft-Decoupled Encoding
We compare the BPE results with different configurations of the SDE model. Table 4 presents average results of 3 runs with different random seeds, where we use 64K character n-grams as our vocabulary. The Language Specific Transformation consistently harms the BLEU score for both n = 4, 5. This result validates the empirical observations of Wang et al. (2019); the separate projection does not help when the "child"-"parent" languages have a significant surface lexical overlap. We also observe comparable BLEU results when we use SDE embeddings or lexical embeddings (where the latent embedding is not taken into account) to encode the semantics of words. The best BLEU scores are achieved for the lexical encoding using either 4-grams or 5-grams of words.
In both cases we observe that the n-gram models perform slightly better than the best BPE model that uses the same number of merge operations as the n-gram vocabulary size (we refer to that model as Baseline-BPE in Table 4). However, we do not adopt SDE in our submitted system, as the small BLEU improvement comes with a higher computational cost than the BPE models.

Synthetic Data
Finally we experiment with back-translation of monolingual English corpora. All experiments used romanized text segmented with 32K BPE merge operations. Table 5 compares 3 different ways of using the same English data extracted from the target side of the Turkish-English parallel corpus. Each target sentence is coupled with a synthetic Kazakh sentence (Back-translation), an empty source sentence as a control (Empty) or a real Turkish sentence (Transfer). The ratio of real to additional data is kept to 1 : 2 in all cases.
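The three conditions differ only in what is placed on the source side of each additional English sentence. A minimal sketch, assuming the Turkish-English corpus is available as (Turkish, English) pairs and that `back_translate` is some English-to-Kazakh function supplied by the caller (both hypothetical names):

```python
def build_additional_pairs(turkish_english_pairs, mode, back_translate=None):
    """Pair every English target sentence with a source side chosen by `mode`."""
    pairs = []
    for turkish, english in turkish_english_pairs:
        if mode == "back-translation":
            if back_translate is None:
                raise ValueError("back-translation mode needs a back_translate function")
            source = back_translate(english)   # synthetic Kazakh sentence
        elif mode == "empty":
            source = ""                        # empty source sentence as a control
        elif mode == "transfer":
            source = turkish                   # real Turkish sentence
        else:
            raise ValueError(f"unknown mode: {mode}")
        pairs.append((source, english))
    return pairs


if __name__ == "__main__":
    toy = [("kitap", "book"), ("su", "water")]
    fake_bt = lambda en: f"<kk:{en}>"   # stand-in for a real English-to-Kazakh model
    for mode in ("back-translation", "empty", "transfer"):
        print(mode, build_additional_pairs(toy, mode, back_translate=fake_bt))
```

Because the English side is identical in all three conditions, any difference in BLEU between them can be attributed to the source side.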
NMT training does not benefit from the back-translated data, as it achieves nearly the same BLEU as the baseline model. Surprisingly, empty source sentences yield better results than back-translation, suggesting that the synthetic back-translations are of low quality. Translating into Kazakh is challenging given the small amount of data available, especially when translating from a morphologically poor to a morphologically rich language. Finally, using real Turkish data on the source side achieves the best improvement over the baseline system (+4.4 BLEU). Given that in all 3 experiments the decoder was trained on the exact same English data, these results suggest that transfer learning benefits both the encoder and the decoder.

Table 6:
Method (synthetic data)    BLEU
Baseline-Transfer          9.89
Empty                      9.17
Back-Translation           9.38
  + ensemble(4)            9.94

Finally, we combine Kazakh and Turkish parallel data to back-translate 100K additional monolingual sentences into Kazakh by training an NMT model that has control over the output language; the results are shown in Table 6. In this experiment, our Baseline-Transfer system refers to the best model trained on the concatenation of "child" and "parent" data. In contrast to the previous experiment, we now combine Kazakh, Turkish and synthetic data with a ratio of 1 : 2 : 1. We observe that in both cases (Back-translation, Empty) the system trained on the augmented data fails to outperform the Baseline-Transfer BLEU score, possibly because the real Kazakh data have been "overwhelmed" by the auxiliary data (Poncelas et al., 2018). However, the back-translated data appear to be of slightly better quality once the Turkish data are used, given that Back-translation now outperforms the Empty control. Finally, the last row of Table 6 reports the BLEU score of our primary submission. Specifically, the submitted model is an ensemble obtained by averaging the output distributions of 4 models trained on Kazakh, Turkish and back-translated data using different random seeds.
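Ensembling by averaging output distributions can be sketched for a single decoding step as follows. This is an illustrative sketch that assumes an arithmetic mean of the per-model probabilities over a shared target vocabulary; the text above does not specify the exact averaging used, and the function names are hypothetical.

```python
def average_distributions(distributions):
    """Arithmetic mean of per-model probability vectors over the same vocabulary."""
    if not distributions:
        raise ValueError("need at least one distribution")
    vocab_size = len(distributions[0])
    if any(len(d) != vocab_size for d in distributions):
        raise ValueError("all models must share the same target vocabulary")
    n = len(distributions)
    return [sum(d[j] for d in distributions) / n for j in range(vocab_size)]


def greedy_choice(distribution):
    """Index of the most probable target token under the averaged distribution."""
    return max(range(len(distribution)), key=distribution.__getitem__)


if __name__ == "__main__":
    # Next-token distributions from 4 models trained with different random seeds (toy values).
    step = [
        [0.6, 0.3, 0.1],
        [0.2, 0.7, 0.1],
        [0.3, 0.6, 0.1],
        [0.5, 0.4, 0.1],
    ]
    averaged = average_distributions(step)
    print(averaged, greedy_choice(averaged))
```

In a full decoder this averaging would be applied at every step of beam search, with each model conditioned on the same partial hypothesis.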

Conclusion
This paper presents the University of Maryland's NMT system for the WMT 2019 Kazakh→English news translation task. Specifically, we explored how to improve neural machine translation of a low-resource language by incorporating parallel data from a related, also low-resource language.
Our empirical results validate that transfer learning benefits BLEU even when transferring from a low-resource language pair. Furthermore, our results suggest that the translation quality (in terms of BLEU score) of the language pair of focus benefits most when the surface-level parameter sharing between the lexical representations of the two related languages is maximized. Finally, we observed that NMT training with synthetic data is sensitive to the quality of the back-translation.