Multilingual word translation using auxiliary languages

Current multilingual word translation methods focus on jointly learning mappings from each language to a shared space. The actual translation, however, is still performed as an isolated bilingual task. In this study we propose a multilingual translation procedure that uses all the learned mappings to translate a word from one language to another. For each source word, we first search for the most relevant auxiliary languages. We then use the translations to these languages to form an improved representation of the source word. Finally, this representation is used for the actual translation to the target language. Experiments on a standard multilingual word translation benchmark demonstrate that our model outperforms state-of-the-art methods.


Introduction
Monolingual continuous word embeddings are standard building blocks of many natural language tasks. The embedding spaces can exhibit similar structures across languages. Several studies (Mikolov et al., 2013; Klementiev et al., 2012) proposed to exploit this similarity by learning a linear mapping from a source to a target embedding space, and demonstrated this approach on a word translation task. Xing et al. (2015) showed that restricting the mapping to orthogonal matrices can significantly improve performance.
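As an illustration of the bilingual case, an orthogonal mapping between two embedding spaces can be fitted in closed form from a dictionary by solving the orthogonal Procrustes problem. The following is a minimal sketch under assumptions that are ours rather than the cited papers': the function name, the random toy data and the noiseless setup are hypothetical.

```python
import numpy as np

def orthogonal_procrustes_map(X, Y):
    """Return the orthogonal matrix W minimizing sum_t ||W x_t - y_t||^2.

    X, Y: (n, d) arrays; row t of X and row t of Y hold the embeddings
    of one dictionary pair (hypothetical inputs for illustration).
    """
    # The optimum is W = U V^T, where U S V^T is the SVD of Y^T X.
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt

# Toy check: recover a known orthogonal mapping from noiseless pairs.
rng = np.random.default_rng(0)
d, n = 5, 50
X = rng.normal(size=(n, d))
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal "true" mapping
Y = X @ Q.T                                   # y_t = Q x_t
W = orthogonal_procrustes_map(X, Y)
print(np.allclose(W, Q))
```

With real embeddings the dictionary pairs are noisy, so the recovered mapping is only the best orthogonal fit rather than an exact match.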
Bilingual embedding can be extended to a multilingual setup by jointly learning mappings from each monolingual space to a shared word space. In recent years several studies have proposed aligning multiple languages simultaneously in a shared space by enforcing (or at least encouraging) transitive relations between the mappings (Chen and Cardie, 2018; Kementchedjhieva et al., 2018; Alaux et al., 2019; Jawanpuria et al., 2019; Taitelbaum et al., 2019). Dealing with multiple languages simultaneously has been shown to improve performance on some bilingual tasks by using knowledge learned from other languages (Ammar et al., 2016; Duong et al., 2017).
Once the multilingual mappings are learned, one can infer word correspondences for words that are not in the initial lexicons. Previous multilingual methods have focused on training procedures that benefit from the multilingual setting. The actual translation, however, is still done as an isolated bilingual task. It makes sense, however, to go further and exploit the relations between all the languages at inference time, just as is done when learning the mappings.
In this study, we propose a new inference method for multilingual word translation that uses all the learned mappings to translate a word from one language to another. For each source word we first search for the most relevant languages that help translate the source word to the target language. We then use the auxiliary translations to form an improved representation of the source word. Finally, this multilingual-dependent representation is used for the actual translation to the target language.
Our main contribution is twofold. First, we propose a new word translation inference method, which takes advantage of the multilingual setup not only in the training phase but also in the test phase. Second, we evaluate it on a recently released multilingual word translation dataset covering six languages (Lample et al., 2018), showing that our method outperforms state-of-the-art methods on this task.

Multilingual Word Mapping
In this section we review the concept of multilingual mapping to a shared space, and in the next section we use it to form a multilingual translation method. Assume we are given $d$-dimensional word embeddings for a set of $k$ languages. The task is to learn linear mappings between every pair of languages. In order to learn these mappings, we are also given a dictionary for each pair of languages, which contains pairs of corresponding words from the two languages. These dictionaries can be obtained in either a supervised or an unsupervised manner, and can also be created at each iteration of a dictionary refinement method (Artetxe et al., 2017; Lample et al., 2018). We can learn cross-language mappings independently from each source to each target language. This approach fails to benefit from the multilingual setup and is very expensive, because it requires learning $k^2$ mappings, each with its own independent parameters. Another approach consists of choosing one language as a "pivot" and learning a mapping from each language to the pivot independently. This strategy, however, does not guarantee good indirect word translations between pairs of languages that do not include the pivot.
Recent studies proposed to jointly map all languages into a shared space. This approach was shown to outperform the above two approaches (Chen and Cardie, 2018). Let $T_1, \dots, T_k$ be a set of mappings that correspond to the $k$ different languages. The mapping $T_i$ is used to translate the words from language $i$ to a shared space that can be viewed as an embedding space of a universal language.
The translation matrices $T_1, \dots, T_k$ can be found by minimizing the following mean-square error:
$$\sum_{i<j} \sum_{t} \| T_i x_{it} - T_j x_{jt} \|^2 \qquad (1)$$
where $x_{it}$ and $x_{jt}$ are the embeddings of two corresponding words in languages $i$ and $j$, respectively, and $t$ ranges over the entries of the dictionary for that language pair. The optimal transformations map pairs of words with similar meanings to vectors in the shared space that are close to one another.
When more than two languages are involved, there is no closed-form solution for the global minimum of Eq. (1). Recently, several studies addressed this optimization challenge.
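To make the joint objective concrete, the sketch below evaluates the squared-error criterion for a given set of mappings: the sum, over all language pairs and dictionary entries, of the squared distance between the two mapped embeddings. It only computes the loss and does not attempt to minimize it. The data layout (dictionaries keyed by language pair) and the function name are assumptions made for illustration.

```python
import numpy as np

def multilingual_mse(mappings, dictionaries):
    """Sum of squared distances between mapped dictionary pairs.

    mappings:     dict language -> (d, d) matrix T_i.
    dictionaries: dict (i, j) -> (X_i, X_j), two (n, d) arrays whose rows
                  are embeddings of corresponding words (hypothetical layout).
    """
    total = 0.0
    for (i, j), (Xi, Xj) in dictionaries.items():
        # Row t of the difference is T_i x_it - T_j x_jt.
        diff = Xi @ mappings[i].T - Xj @ mappings[j].T
        total += float(np.sum(diff ** 2))
    return total

# Toy usage with three languages and identity mappings.
I2 = np.eye(2)
A = np.array([[1.0, 0.0]])
B = np.array([[0.0, 1.0]])
mappings = {"en": I2, "de": I2, "fr": I2}
dictionaries = {("en", "de"): (A, A), ("en", "fr"): (A, B)}
print(multilingual_mse(mappings, dictionaries))
```

Because every mapping appears in several pairwise terms, changing one matrix affects the error of all pairs that involve its language, which is why the mappings have to be optimized jointly.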

Multilingual Translation Method
In this section we propose an inference procedure for multilingual translation that uses all the learned mappings $T_1, \dots, T_k$ to translate from one language to another.
Before describing our method, consider first a generic formulation of the inference procedure. The translation of a word embedding $x$ from language $i$ to language $j$ is obtained by:
$$\hat{y} = \arg\max_{y \in V_j} \mathrm{sim}(x, y) \qquad (2)$$
where $V_j$ is the vocabulary of language $j$. Here $\mathrm{sim}(x, y)$ can be, for example, the cosine similarity in the embedded space: $\mathrm{sim}(x, y) = \cos(T_i x, T_j y)$. It is commonly observed that inference using the nearest embedded neighbor suffers from the hubness problem (Dinu and Baroni, 2014). Hubs are words that appear too frequently in the neighborhoods of other words. To mitigate this effect, one can replace the cosine similarity with another criterion, such as cross-domain similarity local scaling (CSLS) (Lample et al., 2018). This study uses the CSLS metric, namely $\mathrm{sim}(x, y) = \mathrm{CSLS}(T_i x, T_j y)$. For two points $w$ and $z$ in the shared space, the CSLS similarity is calculated as follows:
$$\mathrm{CSLS}(w, z) = 2\cos(w, z) - \frac{1}{n} \sum_{z' \in N_z(w)} \cos(w, z') - \frac{1}{n} \sum_{w' \in N_w(z)} \cos(w', z) \qquad (3)$$
where $\cos$ is the cosine similarity, $N_z(w)$ is the set of the $n$ nearest neighbors of the point $w$ among the mapped word vectors of the language that $z$ belongs to, and $N_w(z)$ is defined similarly. In practice, we used $n = 10$.
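The CSLS criterion can be sketched in a few lines of NumPy: each cosine score is doubled and then penalized by the mean cosine of each of the two words to its n nearest neighbors on the other side, so that hubs, which are close to many words, are scored down. This is an illustrative sketch that assumes both vocabularies are already mapped into the shared space and fit in memory; the function name is ours.

```python
import numpy as np

def csls_matrix(src, tgt, n=10):
    """CSLS score between every row of src and every row of tgt.

    src: (a, d) source-side vectors, already mapped to the shared space.
    tgt: (b, d) target-side vectors, already mapped to the shared space.
    """
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    cos = src @ tgt.T                                   # (a, b) cosine similarities
    # Mean cosine of each source vector to its n nearest target neighbors.
    r_src = np.sort(cos, axis=1)[:, -n:].mean(axis=1)   # shape (a,)
    # Mean cosine of each target vector to its n nearest source neighbors.
    r_tgt = np.sort(cos, axis=0)[-n:, :].mean(axis=0)   # shape (b,)
    return 2 * cos - r_src[:, None] - r_tgt[None, :]

# Translation by CSLS retrieval: pick the best-scoring target row per source row.
scores = csls_matrix(np.eye(3), np.eye(3), n=1)
print(scores.argmax(axis=1))
```

For large vocabularies the full score matrix would usually be computed in batches, but the scoring rule itself stays the same.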
To describe our multilingual word translation method, note that current word translation methods take into account only the source and target languages at inference time. Since the translation is performed via the shared embedding space, one can potentially design a better representation of the source word in the shared space to be translated into the target word. We can translate $x$ to all other languages (except the target language $j$): $y_m = \arg\max_{y \in V_m} \mathrm{CSLS}(T_i x, T_m y)$, and then compute the average word in the shared space:
$$z = \frac{1}{k-1} \sum_{m \neq j} T_m y_m \qquad (4)$$
(for the source word we set $y_i = x$). Here, all languages except the target are used as auxiliary sources of information about the correct embedding in the shared space. Then, we use $z$ as a new multilingual representation of $T_i x$, and translate $x$ according to:
$$\hat{y} = \arg\max_{y \in V_j} \mathrm{CSLS}(z, T_j y) \qquad (5)$$
Unfortunately, averaging across all auxiliary languages may hurt performance for some language pairs. For example, a German translation of a Spanish word may not help with translating that Spanish word to Portuguese. More generally, translations to auxiliary languages can yield words that are far from the source word in the shared space and may therefore lead to an incorrect translation. Here we describe an approach to select the relevant auxiliary languages for a given source word and target language. The main idea is to apply CSLS to select those languages that would be helpful for translating the desired source word. Specifically, a language $m$ is selected as an auxiliary language only if the translated word $y_m$ is closer than the target word $y_j$ to the source word $x$ in the shared space, where $y_j = \arg\max_{y \in V_j} \mathrm{CSLS}(T_i x, T_j y)$ is the bilingual translation of $x$. A language $m$ is thus included in the summation of Eq. (4) only if:
$$\mathrm{CSLS}(T_i x, T_m y_m) > \mathrm{CSLS}(T_i x, T_j y_j) \qquad (6)$$
When averaging the auxiliary translations, all these languages are equally weighted. Because the source word is more important than its auxiliary translations, we set the weight of the source $T_i x$ to be the sum of all the weights of the auxiliary words. The proposed Multilingual Word Translation (MWT) procedure is depicted in Algorithm 1.

Algorithm 1 Multilingual Word Translation
Required: A set of mappings $T_1, \dots, T_k$.
Task: Translate the word $x \in V_i$ to language $j$.
for $m = 1, \dots, k$ ($m \neq i$) do
    $y_m = \arg\max_{y \in V_m} \mathrm{CSLS}(T_i x, T_m y)$
end
$S = \{ m \neq i, j : \mathrm{CSLS}(T_i x, T_m y_m) > \mathrm{CSLS}(T_i x, T_j y_j) \}$
if $S = \emptyset$ then
    $z = T_i x$
else
    $z = T_i x + \frac{1}{|S|} \sum_{m \in S} T_m y_m$
end
$\hat{y} = \arg\max_{y \in V_j} \mathrm{CSLS}(z, T_j y)$
return $\hat{y}$

Several studies have recently proposed averaging the word vectors of the source and target embeddings to obtain an improved shared word representation. This can be done when the monolingual word embeddings are mapped to the same space (Doval et al., 2018). Meta-embedding using several embeddings of the same language can be achieved even without mapping the embeddings to a shared space (Coates and Bollegala, 2018). In this study, we also build an improved source word representation by averaging. However, the averaging is done with translations of the source word to suitable auxiliary languages instead of with the target word.
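The inference procedure can be sketched as follows: translate the source word to every auxiliary language, keep only those translations that are closer to the source than the bilingual target candidate, add their mean to the source vector, and retrieve the target word from the resulting representation. To keep the sketch short and self-contained, plain cosine similarity is used in place of CSLS, which is a simplification and not the setting used in this study. All function names, the dictionary-based data layout and the two-dimensional toy vocabularies are hypothetical, and the embeddings are assumed to be unit-normalized.

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def nearest(query, vocab):
    """Index and cosine score of the row of vocab closest to query."""
    scores = unit(vocab) @ unit(query)
    best = int(np.argmax(scores))
    return best, float(scores[best])

def mwt_translate(x, i, j, mappings, vocabs):
    """Translate embedding x from language i to language j.

    mappings: dict language -> (d, d) matrix T_m.
    vocabs:   dict language -> (|V_m|, d) monolingual embeddings.
    Returns the index of the selected word in vocabs[j].
    """
    shared = {m: vocabs[m] @ mappings[m].T for m in vocabs}  # T_m y for all y
    src = mappings[i] @ x                                    # T_i x
    _, target_score = nearest(src, shared[j])                # bilingual candidate
    aux = []
    for m in shared:
        if m in (i, j):
            continue
        idx, score = nearest(src, shared[m])
        if score > target_score:  # keep m only if closer than the target word
            aux.append(shared[m][idx])
    # The source weight equals the total weight of the auxiliary words.
    z = src + np.mean(aux, axis=0) if aux else src
    return nearest(z, shared[j])[0]

def ang(deg):
    r = np.deg2rad(deg)
    return [np.cos(r), np.sin(r)]

# Toy data: the bilingual nearest neighbor is target word 0, but a close
# auxiliary translation pulls the representation toward target word 1.
vocabs = {"src": np.array([ang(0)]),
          "tgt": np.array([ang(30), ang(-35)]),
          "aux": np.array([ang(-20), ang(90)])}
mappings = {m: np.eye(2) for m in vocabs}
x = vocabs["src"][0]
print(nearest(x, vocabs["tgt"])[0], mwt_translate(x, "src", "tgt", mappings, vocabs))
```

In the toy data the auxiliary word lies at -20 degrees, closer to the source than either target word, so the combined representation points to -10 degrees and the retrieved target changes from word 0 to word 1.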

Experiments
To evaluate the proposed MWT algorithm, we used a recently released multilingual word translation dataset (MUSE) covering six European languages: English, German, French, Spanish, Italian and Portuguese (Lample et al., 2018). In addition, we conducted another experiment mixing European languages (English, German, French, Spanish and Italian) with Asian languages (Japanese, Chinese and Korean). This experiment demonstrates the power of MWT even for distant languages. The dictionaries available in the MUSE dataset are between the European languages, and between English and each of the Asian languages. For any available pair of languages, a ground-truth bilingual dictionary is provided with a train-test split of 5000 and 1500 unique source words, respectively. All systems are tested on the 1500 test word pairs for each pair of languages.
Monolingual Embeddings. Pre-trained 300d fastText (monolingual) embeddings (Bojanowski et al., 2017) were used.
Implementation details. For the experiment with six European languages, the multilingual mapping set was trained using the state-of-the-art unsupervised Multilingual Adversarial Training (MAT) + Multilingual Pseudo-Supervised Refinement (MPSR) method (Chen and Cardie, 2018). We used their source code with their default hyper-parameters, and obtained results similar to those reported (+0.1% on average). For the European-Asian experiment, MAT failed to converge for some language pairs, so the multilingual mapping set was trained using supervised MPSR, where the supervision was obtained from pairs of words with identical strings. For each experiment, we used the same mappings for all the methods we compare. Our code and mapping matrices will be made publicly available.
Compared methods. All methods retrieve word translations using their CSLS similarity in the learned embedding space.
(1) BI (Bilingual Inference). A standard inference process which does not take the multilingual setup into account, as in Chen and Cardie (2018).
We implemented three translation variants that use auxiliary languages:
(2) NT (Nearest Translation). Average the source word with the closest auxiliary translation.
(3) CNT (Conditional Nearest Translation). Average the source word with the closest auxiliary translation only if it is closer to the source word than the translation to the target language. In effect, CNT chooses for each source word one of {BI, NT}, depending on whether the closest auxiliary translation is closer than the target translation (NT) or not (BI).
(4) CAT (Conditional All Translations). A weighted average of the source word and all auxiliary translations that are closer than the target-language translation. CAT is formally described in Algorithm 1.
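The four compared methods differ only in how the shared-space representation of the source word is built before the target vocabulary is queried. The sketch below makes this difference explicit. It assumes the auxiliary translations and their similarity scores to the source have already been computed; the function name, argument layout and toy values are illustrative and not taken from the released code.

```python
import numpy as np

def source_representation(src, aux_vecs, aux_scores, target_score, variant):
    """Return the shared-space vector used to query the target vocabulary.

    src:          T_i x, shape (d,).
    aux_vecs:     (a, d) best translation of x in each auxiliary language.
    aux_scores:   (a,) similarity of each auxiliary translation to src.
    target_score: similarity of the bilingual (BI) translation to src.
    """
    aux_vecs = np.asarray(aux_vecs, dtype=float)
    aux_scores = np.asarray(aux_scores, dtype=float)
    nearest = int(np.argmax(aux_scores))   # closest auxiliary translation
    closer = aux_scores > target_score     # auxiliaries beating the target word
    if variant == "BI":
        return src
    if variant == "NT":
        return 0.5 * (src + aux_vecs[nearest])
    if variant == "CNT":
        if closer[nearest]:
            return 0.5 * (src + aux_vecs[nearest])
        return src
    if variant == "CAT":
        if closer.any():
            # The source gets the same total weight as all selected auxiliaries.
            return 0.5 * src + 0.5 * aux_vecs[closer].mean(axis=0)
        return src
    raise ValueError(f"unknown variant: {variant}")

src = np.array([1.0, 0.0])
aux_vecs = [[0.0, 1.0], [1.0, 1.0]]
aux_scores = [0.9, 0.5]
for variant in ("BI", "NT", "CNT", "CAT"):
    print(variant, source_representation(src, aux_vecs, aux_scores, 0.4, variant))
```

Since retrieval relies on cosine-based similarity, the overall scale of the returned vector does not affect which target word is selected; only the relative weights of the source and auxiliary vectors matter.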
Results. Table 1 presents detailed results for all 30 language pairs and the average results. It shows that using all relevant auxiliary languages (CAT) increases performance significantly (+1.7% on average, top method in 19/30 tasks, p < 0.001). The largest performance boost of CAT over BI was in language pairs involving German (+3.17% on average), which is the most distant language in this set and thus gains the most from using other languages. This was found in particular for the translations between German and Portuguese (de-pt: +6.8%, pt-de: +4.5%), which are the most distant languages in this language set. This suggests that MWT is particularly helpful for distant languages. However, for close language pairs the best approach is still to translate directly (BI), as can be seen for Spanish and Portuguese. Table 2 presents three examples of erroneous bilingual translations that were corrected using auxiliary languages.
We next show a more detailed analysis of CNT, which uses at most one auxiliary language, for the experiment with six European languages. Table 4 shows the auxiliary language that is most commonly selected for each source-target language pair. Interestingly, Spanish and Portuguese often help each other. Also, German often uses English as an auxiliary language for translating better into all other languages.
For CAT, each source word may use a different number of auxiliary languages. We can view the number of auxiliary languages as a means to qualitatively measure the closeness of languages, by looking at the average number of auxiliary languages used for each source-target pair. We found Portuguese and Spanish to be the closest (pt-es 0.6), and German the farthest from them (pt-de 3.1, es-de 3.2).