Improving Target-side Lexical Transfer in Multilingual Neural Machine Translation

To improve the performance of Neural Machine Translation (NMT) for low-resource languages (LRL), one effective strategy is to leverage parallel data from a related high-resource language (HRL). However, multilingual data has been found more beneficial for NMT models that translate from the LRL to a target language than for those that translate into the LRL. In this paper, we aim to improve the effectiveness of multilingual transfer for NMT models that translate into the LRL by designing a better decoder word embedding. Extending Soft Decoupled Encoding (Wang et al., 2019), a general-purpose multilingual encoding method, we propose DecSDE, an efficient character n-gram based embedding specifically designed for the NMT decoder. Our experiments show that DecSDE leads to consistent gains of up to 1.8 BLEU on translation from English to four different languages.


Introduction
The performance of Neural Machine Translation (NMT; Sutskever et al., 2014) tends to degrade on low-resource languages (LRL) due to a paucity of parallel data (Koehn and Knowles, 2017; Sennrich and Zhang, 2019). One effective strategy to improve translation in LRLs is multilingual training using parallel data from related high-resource languages (HRL) (Zoph et al., 2016; Neubig and Hu, 2018). The assumption underlying cross-lingual transfer is that by sharing parameters between multiple languages, the LRL can benefit from the extra training signal from data in other languages. One of the most popular strategies for multilingual training is to train a single NMT model that translates in many directions by simply appending a flag to each source sentence to indicate which target language to translate into (Ha et al., 2016; Johnson et al., 2017).
Many works focus on using multilingual training to improve many-to-one NMT models that translate from both an HRL and an LRL to a single target language (Zoph et al., 2016; Neubig and Hu, 2018; Gu et al., 2018). In this situation, sentences from the HRL-target corpus provide an extra training signal for the decoder language model, on top of cross-lingual transfer on the source side. When training an NMT model that translates into an LRL, however, multilingual data tends to lead to smaller improvements (Lakew et al., 2019; Arivazhagan et al., 2019; Aharoni et al., 2019).
In this paper, we aim to improve the effectiveness of multilingual training for NMT models that translate into LRLs. Prior work has found vocabulary overlap to be an important indicator of whether data from other languages will be effective in improving NMT accuracy (Lin et al., 2019). Therefore, we hypothesize that one of the main problems limiting multilingual transfer on the target side is that the LRL and the HRL may have limited vocabulary overlap, and standard methods for embedding target words via lookup tables would map corresponding vocabulary from these languages to different representations.
To overcome this problem, we design a target word embedding method for multilingual NMT that encourages similar words from the HRLs and the LRLs to have similar representations, facilitating positive transfer to the LRLs. While there are many methods to embed words from characters (Ling et al., 2015; Kim et al., 2016; Wieting et al., 2016; Ataman and Federico, 2018), we build our model upon Soft Decoupled Encoding (SDE; Wang et al., 2019), a recently proposed general-purpose multilingual word embedding method that has demonstrated superior performance to other alternatives. SDE represents a word by combining a character-based representation of its spelling and a lookup-based representation of its meaning. We propose DecSDE, an efficient adaptation of SDE to NMT decoders. DecSDE uses a low-rank transformation to assist multilingual transfer, and it precomputes the embeddings for a fixed vocabulary to speed up training and inference. We test our method on translation from English to four different low-resource languages, and DecSDE brings consistent gains of up to 1.8 BLEU.

Translating into Low-resource Languages
Standard NMT training is performed solely on parallel corpora from a source language $S$ to a target language $T$. However, in the case that $T$ is an LRL, we can use parallel data from $S$ and a related HRL $T'$ to assist learning. The standard look-up embedding in NMT turns words from both the LRL and the HRL into vectors by mapping their indices in the vocabulary to the corresponding entries in the embedding matrix. This is harmful for positive transfer, because different words with similar spellings from the LRL and the HRL are mapped to independent embeddings. For example, "Ola" in Galician and "Olá" in Portuguese both mean "hello", but they would have separate representations through the look-up embedding. We give a demonstration of this embedding (mis-)alignment in § 5.2. Since the target-side data is essential for training the decoder's language model, representing the lexicons of the LRL and the HRL in a shared space is especially important to improve positive transfer for NMT models that translate into LRLs.
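The "Ola"/"Olá" example can be made concrete with a small sketch. It is only an illustration: the helper `char_ngrams`, the toy vocabulary, and the choice of n-grams up to length 4 are assumptions, not part of the described system. It shows that a look-up table gives the two spellings unrelated indices, while their character n-gram sets overlap.

```python
def char_ngrams(word, max_n=4):
    """Return the set of character n-grams (n = 1..max_n) of a word."""
    return {word[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(word) - n + 1)}

galician = "Ola"
portuguese = "Ol\u00e1"  # "Olá" with a precomposed a-acute

# A look-up table assigns each spelling its own, unrelated row index.
vocab = {galician: 0, portuguese: 1}

# Character n-grams expose the shared spelling.
shared = char_ngrams(galician) & char_ngrams(portuguese)
union = char_ngrams(galician) | char_ngrams(portuguese)
jaccard = len(shared) / len(union)

print(sorted(shared), round(jaccard, 3))
```

Any embedding built from these n-grams starts with part of its representation shared between the two words, which the look-up table cannot provide.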

Soft Decoupled Encoding
To address the limitation of the standard word representation for target-side multilingual transfer, we turn to Soft Decoupled Encoding (SDE; Wang et al., 2019), a word embedding method designed for multilingual data. SDE decomposes a word embedding into two components: a character n-gram embedding with a language-specific transformation that represents its spelling, and a semantic embedding that represents its meaning. Given a word $w$ from the target language $L_i$, SDE embeds the word in three steps.
Character-aware embeddings are first used to calculate the lexical representation of $w$. We extract a bag-of-n-grams frequency vector from $w$, denoted as $\mathrm{BoN}(w)$, where each row corresponds to the number of times a character n-gram in the vocabulary appears in $w$. The character-aware embedding of $w$ is then computed as
$$c(w) = \tanh\big(W_c \cdot \mathrm{BoN}(w)\big) \qquad (1)$$
where $\tanh$ is the activation function and $W_c \in \mathbb{R}^{d \times n}$ is an embedding matrix of dimension $d$ for the $n$ character n-grams in the vocabulary.
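The first step can be sketched in a few lines of plain Python. This is an illustrative sketch, assuming the character-aware embedding has the form c(w) = tanh(W_c · BoN(w)) implied by the stated shapes. The toy n-gram vocabulary, the weights in `W_c`, and the maximum n-gram length of 4 are all made up for the example.

```python
import math
from collections import Counter

def char_ngrams(word, max_n=4):
    """List every character n-gram (n = 1..max_n) of `word`, with repeats."""
    return [word[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(word) - n + 1)]

def bon_vector(word, ngram_vocab, max_n=4):
    """Count vector over `ngram_vocab`; n-grams outside it are ignored."""
    counts = Counter(char_ngrams(word, max_n))
    return [counts[g] for g in ngram_vocab]

def char_embedding(word, W_c, ngram_vocab):
    """c(w) = tanh(W_c . BoN(w)), with W_c stored as d rows of length n."""
    bon = bon_vector(word, ngram_vocab)
    return [math.tanh(sum(w * b for w, b in zip(row, bon))) for row in W_c]

ngram_vocab = ["a", "n", "an", "na", "ana"]   # toy vocabulary, n = 5
W_c = [[0.1, 0.0, 0.2, 0.0, 0.0],             # toy weights, d = 2
       [0.0, 0.3, 0.0, 0.1, 0.5]]

bon = bon_vector("banana", ngram_vocab)
c = char_embedding("banana", W_c, ngram_vocab)
print(bon, c)
```

Because BoN(w) holds counts, a repeated n-gram such as "ana" in "banana" contributes its column of W_c twice before the nonlinearity.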
Language-specific transformation is then applied to the lexical embedding $c(w)$ to account for the divergence between the HRL and the LRL:
$$c_i(w) = \tanh\big(W_{L_i} \cdot c(w)\big) \qquad (2)$$
where the matrix $W_{L_i} \in \mathbb{R}^{d \times d}$ is a linear transformation specific to the language $L_i$.
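Latent semantic embeddings of $w$ are calculated using an embedding matrix $W_s \in \mathbb{R}^{d \times s}$ with $s$ entries, which is shared between the languages. We use $c_i(w)$ as the query vector to perform attention (Luong et al., 2015) over the embeddings:
$$e_{\mathrm{latent}}(w) = W_s \cdot \mathrm{softmax}\big(W_s^{\top} \cdot c_i(w)\big) \qquad (3)$$
The final embedding of $w$ is obtained by summing the lexical and semantic representations:
$$e_{\mathrm{SDE}}(w) = e_{\mathrm{latent}}(w) + c_i(w) \qquad (4)$$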
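The three steps can be put together in a short NumPy sketch. It is a reading of the description above rather than the reference implementation: the tanh on the language-specific step and the dot-product form of the attention are assumptions, and all sizes and random weights are placeholders.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - x.max())
    return z / z.sum()

def sde_embed(bon, W_c, W_lang, W_s):
    """Embed one word from its bag-of-n-grams count vector `bon`."""
    c = np.tanh(W_c @ bon)          # step 1: lexical embedding, shape (d,)
    c_i = np.tanh(W_lang @ c)       # step 2: language-specific transform (assumed tanh)
    attn = softmax(W_s.T @ c_i)     # step 3: attention over the s shared latent entries
    e_latent = W_s @ attn           # convex combination of the columns of W_s
    return c_i + e_latent, attn     # final embedding = lexical + semantic

rng = np.random.default_rng(0)
d, n, s = 8, 20, 5                  # placeholder sizes
W_c = rng.normal(size=(d, n))
W_lang = rng.normal(size=(d, d))
W_s = rng.normal(size=(d, s))
bon = rng.integers(0, 3, size=n).astype(float)

e, attn = sde_embed(bon, W_c, W_lang, W_s)
print(e.shape, attn.sum())
```

Only `W_lang` depends on the language; `W_c` and `W_s` are shared, which is what lets similarly spelled words from the HRL and the LRL land near each other.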

DecSDE for NMT Decoders
In this section, we build upon the previously described SDE, and design a new method for multilingual word representation on the target side. There are two aspects to consider when incorporating character-based representations like SDE in decoders: 1) the embedding method should be efficient during both training and inference time, as it needs to be calculated over the entire vocabulary; 2) it should support popular decoder design decisions, such as weight tying (Press and Wolf, 2017), which allows the decoder to share the parameters of the target embedding matrix and the decoder projection before the softmax operation. With these considerations in mind, we introduce DecSDE, a multilingual target word embedding method based on SDE for NMT decoders.
Fixed Vocabulary and Weight Tying The standard SDE is designed to encode words directly without segmenting them into subwords. This design choice works well for encoding words on the source side, but it can cause problems for the decoder, which requires a finite vocabulary to generate words at each time step. Therefore, we choose to segment the target sentences into subwords (Kudo and Richardson, 2018), and encode each subword using DecSDE.
The use of a fixed vocabulary also allows us to perform weight tying. Specifically, we construct an embedding matrix for the decoder by precomputing the DecSDE embedding for each subword in the target vocabulary. This embedding matrix can then be used both as the decoder's embedding lookup table and as the projection matrix before the decoder softmax.
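Weight tying with a precomputed matrix can be sketched as follows. The matrix `E` here is random and merely stands in for the precomputed DecSDE embeddings of a toy vocabulary; the function names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4                      # toy vocabulary size and embedding dimension
E = rng.normal(size=(V, d))      # stand-in for precomputed DecSDE embeddings

def embed(token_ids):
    """Input side: look up the rows of E for the previous target tokens."""
    return E[token_ids]

def output_logits(hidden):
    """Output side: project decoder states onto the same matrix E."""
    return hidden @ E.T

prev_tokens = np.array([2, 5])
x = embed(prev_tokens)                 # shape (2, d)
hidden = rng.normal(size=(2, d))       # placeholder decoder states
logits = output_logits(hidden)         # shape (2, V)
print(x.shape, logits.shape)
```

Since both directions read the same `E`, any spelling-based similarity between two subwords affects how they are consumed as input and how they are scored as output.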
Efficient Training and Inference One drawback of the standard SDE is that it requires more computation than standard look-up table embeddings, because the lexical embedding requires one to extract and embed all character n-grams for each word. This problem is especially important for the decoder, since it needs to embed all target words in the vocabulary at each time step to calculate the probability distribution over the vocabulary.
To make training more efficient, we extract the character n-grams for all words in the target vocabulary, and use an optimized embedding bag layer (footnote 2) to parallelize the calculation of lexical embeddings for all words in a batch. For inference, we precompute the DecSDE embedding for all subwords, effectively making inference as fast as the regular look-up table embedding. An analysis of training and inference speed can be found in § 5.2.
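The source notes that the implementation relies on torch.nn.functional.embedding_bag. To keep the example dependency-light, the sketch below is a NumPy stand-in that mirrors what a sum-mode embedding bag computes. The function `embedding_bag_sum`, the flat-ids-plus-offsets layout, and the toy data are assumptions for illustration.

```python
import numpy as np

def embedding_bag_sum(flat_ids, offsets, weight):
    """Sum the rows of `weight` selected by each bag of n-gram ids.

    flat_ids: 1-D array concatenating the n-gram ids of all subwords
    offsets:  start position of each subword's bag inside flat_ids
    weight:   (n, d) table of n-gram embeddings (the transpose of W_c)
    """
    bag_sizes = np.diff(np.append(offsets, len(flat_ids)))
    bag_ids = np.repeat(np.arange(len(offsets)), bag_sizes)
    out = np.zeros((len(offsets), weight.shape[1]))
    np.add.at(out, bag_ids, weight[flat_ids])   # unbuffered scatter-add per bag
    return out

rng = np.random.default_rng(0)
weight = rng.normal(size=(4, 3))                # n = 4 n-grams, d = 3

# Three subwords: n-gram 1 occurs twice in the first one.
flat_ids = np.array([0, 1, 1, 2, 3, 0, 3])
offsets = np.array([0, 3, 5])

lexical = np.tanh(embedding_bag_sum(flat_ids, offsets, weight))
print(lexical.shape)
```

Repeating an id inside a bag has the same effect as a count greater than one in the bag-of-n-grams vector, so the result matches the dense product with the count matrix while touching only the n-grams that actually occur.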

Low-rank Language-Specific Transformation
The language-specific transformation in the standard SDE, used on the encoder side, sometimes hurts model performance. Our experiments confirm that this phenomenon also happens on the decoder side. We hypothesize that this is because the full-rank transformation matrix $W_{L_i}$ in Eq. 2 might overfit the training data and project the lexical embeddings from different languages too far from each other, which could hurt multilingual transfer. Therefore, we introduce a novel low-rank language-specific transformation for DecSDE: we upper-bound the rank of the transformation matrix so that it is less complex, which can encourage generalization. Specifically, we replace the language-specific transformation matrix $W_{L_i}$ in Eq. 2 with two components, an identity matrix and a low-rank factorized matrix:
$$c_i(w) = \tanh\big((I + U_{L_i} V_{L_i}) \cdot c(w)\big) \qquad (5)$$
where $U_{L_i} \in \mathbb{R}^{d \times u}$ and $V_{L_i} \in \mathbb{R}^{u \times d}$ are the low-rank matrices with dimension $u < d$. Thus, the identity matrix $I$ passes the lexical embedding through as-is, and the low-rank matrix performs a simple transformation to account for the divergence between languages without amplifying the difference.
Footnote 2: Implementation with torch.nn.functional.embedding_bag.
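A minimal NumPy sketch of the low-rank transformation follows, assuming the outer tanh of the standard SDE transformation is kept. The sizes `d` and `u`, the initialization scale, and the function name are illustrative choices, not values from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
d, u = 16, 2                               # placeholder sizes with u < d
U = 0.1 * rng.normal(size=(d, u))          # language-specific factor, d x u
V = 0.1 * rng.normal(size=(u, d))          # language-specific factor, u x d

def low_rank_transform(c, U, V):
    """Apply (I + U V) to c, then tanh, without forming the d x d matrix."""
    return np.tanh(c + U @ (V @ c))

c = np.tanh(rng.normal(size=d))            # stand-in lexical embedding c(w)
c_i = low_rank_transform(c, U, V)

full_params = d * d                        # full-rank W_{L_i}
low_rank_params = 2 * d * u                # U_{L_i} and V_{L_i} together
rank_uv = np.linalg.matrix_rank(U @ V)     # bounded above by u
print(full_params, low_rank_params, rank_uv)
```

With U set to zero the transformation reduces to tanh(c(w)), so every language starts from the shared lexical embedding and the low-rank term only adds a bounded-rank correction.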
Extension to Multiple Target Languages Note that although in this work we focus on HRL and LRL pairs, one can easily extend the framework to multiple (> 2) target languages. In particular, the only language-dependent components of DecSDE are the matrices $W_{L_i}$, while the rest of the DecSDE parameters, as well as the transformer encoder-decoder parameters, are shared. We can add and train $W_{L_j}$ for each additional language $L_j$.

Setup
Datasets To validate our method, we use the 58-language-to-English TED corpus for experiments (Qi et al., 2018). We use three LRL-to-English datasets: Azerbaijani (aze), Belarusian (bel), and Galician (glg), as well as a slightly higher-resource dataset, Slovak (slk). Each LRL is paired with a related HRL: Turkish (tur), Russian (rus), Portuguese (por), and Czech (ces), respectively. We translate from English to each of the four LRLs, and train together with the corresponding HRL. For simplicity, as a research setup, we do not use back-translation with monolingual data, which is also hard to come by for the low-resource languages we experiment with.
Implementation We implement our method using fairseq (Ott et al., 2019).
Performance We measure model performance using SacreBLEU (Post, 2018) and summarize the results in Tab. 1. DecSDE consistently improves over the best baseline for all languages, outperforming LookUp-piece by up to 1.8 BLEU. Meanwhile, the word-level baseline performs worse, likely because there is little word-level overlap between the HRL and the LRL.
Ablation We examine the effect of the DecSDE components by removing each of them in turn, as shown in Tab. 1. First, we can see that removing weight tying degrades the model performance by a large margin for all four languages. Next, comparing the standard linear transformation (-low-rank transform) with the method without the entire language-specific transform component (-transform), we can see that using the regular transform without low-rank factorization actually degrades the model performance for three out of the four languages, indicating that a full linear transformation might hinder multilingual transfer. Using the low-rank transform achieves the best performance for all four languages.
We observe that using n-grams of up to 4 characters gives a large performance improvement, while adding 5-grams leads to a small improvement for aze and glg but a small decrease for bel and slk. This suggests that character n-grams up to size 4 provide enough discriminative power for our model.
[...] is sufficient, while going larger is likely to incur an over-fitting problem.
Embedding Analysis One main advantage of DecSDE is its ability to capture spelling similarity between the LRL and the HRL. To show this, we pick word pairs from the HRL and the LRL with edit distance from 1 to 4, and compare their embeddings. For each word pair, we take the LRL word and use the cosine similarity between embeddings to retrieve words from the HRL. Retrieval success is measured by mean reciprocal rank (MRR; higher is better). The gain of DecSDE over LookUp-piece with respect to edit distance is plotted in the top of Fig. 1, which shows that DecSDE embeds similarly spelled words closer together in the embedding space. Next, we examine the performance of DecSDE for rare words in the LRLs. We calculate the word F-1 of rare words for DecSDE and LookUp-piece using compare-mt, and plot word frequency vs. gain in word F-1 in the bottom of Fig. 1. DecSDE brings more significant gains for less frequent words, likely because it encodes similar words in the HRL and the LRL closer together, thus assisting positive transfer.
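The retrieval metric can be sketched as below. This is an illustrative sketch of cosine-similarity retrieval scored by MRR, not the evaluation script used for Fig. 1; the function name, the optimistic handling of ties, and the toy vectors are assumptions.

```python
import numpy as np

def mean_reciprocal_rank(lrl_vecs, hrl_vecs, gold):
    """MRR of retrieving the paired HRL word for each LRL word.

    lrl_vecs: (m, d) query embeddings of LRL words
    hrl_vecs: (k, d) candidate embeddings of HRL words
    gold:     gold[i] is the row of hrl_vecs paired with query i
    """
    q = lrl_vecs / np.linalg.norm(lrl_vecs, axis=1, keepdims=True)
    c = hrl_vecs / np.linalg.norm(hrl_vecs, axis=1, keepdims=True)
    sims = q @ c.T                                   # cosine similarities, (m, k)
    gold_sims = sims[np.arange(len(gold)), gold]
    # Rank = 1 + number of candidates scored strictly above the gold one.
    ranks = 1 + (sims > gold_sims[:, None]).sum(axis=1)
    return float(np.mean(1.0 / ranks))

hrl = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
lrl = np.array([[1.0, 0.1], [0.2, 1.0]])
gold = np.array([0, 2])

mrr = mean_reciprocal_rank(lrl, hrl, gold)
print(mrr)
```

In this toy case the first query retrieves its pair at rank 1 and the second at rank 2, so the MRR is 0.75.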

Implications and Future Work
In this paper, we have demonstrated that DecSDE, a multilingual character-sensitive embedding method, improves translation accuracy into low-resource languages. This implies, on a higher level, that looking into the character-level structure of the target-side vocabulary when creating word or subword embeddings is a promising way to improve cross-lingual transfer. While ablations have shown that the proposed design decisions (such as the low-rank language-specific transformation and weight tying) are reasonable ones, this is just a first step in this direction. Future work could examine even more effective methods for target-side lexical sharing in MT or other language generation tasks.