Multilingual Neural Machine Translation with Language Clustering

Multilingual neural machine translation (NMT), which translates multiple languages using a single model, is of great practical importance due to its advantages in simplifying the training process, reducing online maintenance costs, and enhancing low-resource and zero-shot translation. Given there are thousands of languages in the world and some of them are very different, it is extremely burdensome to handle them all in a single model or use a separate model for each language pair. Therefore, given a fixed resource budget, e.g., the number of models, how to determine which languages should be supported by one model is critical to multilingual NMT, which, unfortunately, has been ignored by previous work. In this work, we develop a framework that clusters languages into different groups and trains one multilingual model for each cluster. We study two methods for language clustering: (1) using prior knowledge, where we cluster languages according to language family, and (2) using language embedding, in which we represent each language by an embedding vector and cluster them in the embedding space. In particular, we obtain the embedding vectors of all the languages by training a universal neural machine translation model. Our experiments on 23 languages show that the first clustering method is simple and easy to understand but leading to suboptimal translation accuracy, while the second method sufficiently captures the relationship among languages well and improves the translation accuracy for almost all the languages over baseline methods.

Although a conventional NMT model can handle a single language translation pair (e.g., German→English, Spanish→French) well, training a separate model for each language pair is unaffordable considering there are thousands of languages in the world. A straightforward solution to reduce computational cost is using one model to handle the translations of multiple languages, i.e., multilingual translation. Johnson et al. (2017); Firat et al. (2016); Ha et al. (2016); Lu et al. (2018) propose to share part of (e.g., attention mechanism) or all models for multiple language pairs and achieve considerable accuracy improvement. While they focus on how to translate multiple language pairs in a single model and improve the performance of the multilingual model, they do not investigate which language pairs should be trained in the same model.
Clearly, it is extremely heavy to translate all language pairs in a single model due to the diverse and large amount of languages; instead we cluster language pairs into multiple clusters and train one model for each cluster, considering that: (1) language pairs that differ a lot (e.g., German→English and Chinese→English) may negatively impact the training process if handling them by one model, and (2) similar language pairs (e.g., German→English and French→English) are likely to boost each other in model training. Then the key challenge is how to cluster language pairs, which is our focus in this paper.
In this paper, we consider the many-to-one settings where there are multiple source languages and one target language (English), and one-to-many settings where there are multiple target languages and one source language (English) 1 . In this way, we only need to consider the languages in the source or target side for the determination of training in the same model, instead of language pairs. We consider two methods for language clustering. The first one is clustering based on prior knowledge, where we use the knowledge of language family to cluster languages. The second one is a purely learning based method that uses language embeddings for similarity measurement and clustering. We obtain the language embeddings by training all the languages in a universal NMT model, and add each language with a tag to give the universal model a sense of which language it currently processes. The tag of each language (language embedding) is learned end-toend and used for language clustering. Language clustering based on language family is easy to obtain and understand, while the end-to-end learning method may boost multilingual NMT training, since the language clustering is derived directly from the learning task and will be more useful to the task itself.
Through empirical studies on 23 languages→English and English→23 languages in IWSLT dataset, we have several findings: (1) Language embeddings can capture the similarity between languages, and correlate well with the fine-grained hierarchy (e.g., language branch) in a language family. (2) Similar languages, if clustered together, can indeed boost multilingual performance. (3) Language embeddings based clustering outperforms the language family based clustering on the 23 languages↔English.

Background
Neural Machine Translation Given the bilingual translation pair (x, y), an NMT model learns the parameter θ by maximizing the loglikelihood log P (y|x, θ). The encoder-decoder framework (Bahdanau et al., 2015;Luong et al., 2015b;Sutskever et al., 2014;Wu et al., 2016;Gehring et al., 2017;Vaswani et al., 2017;Shen et al., 2018;) is adopted to model the conditional probability P (y|x, θ), where the encoder maps the input to a set of hidden representations h and the decoder generates each target 1 Many-to-many translation can be bridged through manyto-one and one-to-many translations. Our methods can be also extended to the many-to-many setting with some modifications. We leave this to future work. token y t using the previous generated tokens y <t and the representations h.
Multilingual NMT NMT has recently been extended from the translation of a language pair to multilingual translation (Dong et al., 2015;Luong et al., 2015a;Firat et al., 2016;Lu et al., 2018;Johnson et al., 2017;Ha et al., 2016;Platanios et al., 2018;. Dong et al. (2015) use a single encoder but separate decoders to translate one source language to multiple target languages (i.e., one-tomany translation). Luong et al. (2015a) combine multiple encoders and decoders, one encoder for each source language and one decoder for each target language respectively, to translate multiple source languages to multiple target languages (i.e., many-to-many translation). Firat et al. (2016) use different encoders and decoders but share the attention mechanism for many-to-many translation. Similarly, Lu et al. (2018) propose the neural interlingua, which is an attentional LSTM encoder to link multiple encoders and decoders for different language pairs. Johnson et al. (2017); Ha et al. (2016) use a universal encoder and decoder to handle multiple source and target languages, with a special tag in the encoder to determine which target languages to output. While all the above works focus on the design of better models for multilingual translation, they implicitly assume that a set of languages are pre-given, and do not consider which languages (language pairs) should be in a set and share one model. In this work we focus on determining which languages (language pairs) should be shared in one model.

Multilingual NMT with Language Clustering
Previous work trains a set of language pairs (usually the number is small, e.g., < 10 languages) using a single model and focuses on improving this multilingual model. When facing a large amount of languages (dozens or even hundreds), one single model can hardly handle them all, considering that a single model has limited capacity, some languages are quite diverse and far different, and thus one model will lead to degraded accuracy. In this paper, we first group the languages into several clusters and then train one multilingual NMT model (Johnson et al., 2017) for the translations in each cluster. By controlling the number of clusters, we can balance translation accuracy and computational cost: compared to using one model for each language pair, our approach greatly reduces computational cost, and compared to using one model for all language pairs, our approach delivers better translation accuracy. In this section, we present two methods of language clustering to enhance multilingual NMT.

Prior Knowledge Based Clustering
Several kinds of prior knowledge can be leveraged to cluster languages (Paul et al., 2009;Comrie, 1989;Bakker et al., 2009;Holman et al., 2008;Liu and Li, 2010;Chen and Gerdes, 2017;Leng et al., 2019), such as language family, language typology or other language characteristics from the URIEL database (Levin et al., 2017). Here we do not aim to give a comprehensive study of diverse prior knowledge based clusterings. Instead, we choose the commonly used language family taxonomy (Paul et al., 2009) as the prior knowledge for clustering, to provide a comparison with the language embedding based clustering.
Language family is a group of languages related through descent from a common ancestral language or parental language 2 . There are different kinds of taxonomies for language family in the world, among which, Ethnologue (Paul et al., 2009) 3 is one of the most authoritative and commonly accepted taxonomies. The 7,472 known languages in the world fall into 152 families according to Paul et al. (2009). We regard the languages in the same family as similar languages and group them into one cluster.

Language Embedding Based Clustering
When training a multilingual model, it is common practice to add a tag to the input of encoder to indicate which language the model is currently processing (Johnson et al., 2017;Firat et al., 2016;Malaviya et al., 2017;Tiedemann andÖstling, 2017). The embeddings of the tag is learned end-to-end and depicts the characteristics of the corresponding language, which is called language embeddings, analogous to word embeddings (Mikolov et al., 2013).
We first train a universal model to translate all the language pairs. As shown in Figure 1, the 2 https://en.wikipedia.org/wiki/Language family 3 https://www.ethnologue.com/browse/families  Figure 1: The illustration of learning language embeddings for clustering. For both many-to-one (other languages to English) and one-to-many (English to other languages) setting, we add the language embeddings to the encoder. encoder of the universal NMT model takes both word embeddings and language embedding as inputs. After model training, we get an embedding vector for each language. We regard the embedding vector as the representation of a language and cluster all languages using the embeddings 4 . There exist many clustering methods. Without loss of generality, we choose hierarchical clustering (Rokach and Maimon, 2005) to cluster the embedding vectors in our experiments.

Discussions
We analyze and compare the two clustering methods proposed in this section. Clustering based on prior knowledge (language family) is simple. The taxonomy based on language family is consistent with the human knowledge, easy to understand, and does not change with respect to data/time. This method also has drawbacks. First, the taxonomy built on language family does not cover all the languages in the world since some languages are isolate 5 . Second, there are many language families (152 according to Paul et al. (2009)), which means that we still need a large number of models to handle all the languages in the world. Third, language family cannot characterize all the features of a language to fully capture the similarity between languages.
Since language embeddings are learnt in a universal NMT model, which is consistent with the downstream multilingual NMT task, clustering based on language embeddings is supposed to capture the similarity between different languages well for NMT. As is shown in our experiments, the clustering results implicitly capture and combine multiple aspects of language characteristics and boost the performance of the multilingual model.

Experiment Setup
Datasets We evaluate our method on the IWSLT datasets which contain multiple languages from TED talks. We collect datasets from the IWSLT evaluation campaign 6 from years 2011 to 2018, which consist of the translation pairs of 23 languages↔English. The detailed description of the training/validation/test set of the 23 translation pairs can be found in Supplementary Materials (Section 1). All the data has been tokenized and segmented into sub-word symbols using Byte Pair Encoding (BPE) (Sennrich et al., 2016). We learn the BPE operations for all languages together, which results in a shared vocabulary of 90K BPE tokens.

Model
Configurations We use Transformer (Vaswani et al., 2017) as the basic NMT model considering that it achieves state-ofthe-art performance on multiple NMT benchmark tasks and is a popular choice for recent research on NMT. We use the same model configuration for each cluster with model hidden size d model = 256, feed-forward hidden size d ff = 1024 and the layer number is 2. The size of language embeddings is set to 256.
Training and Inference We up-sample the training data of each language to be roughly the same during training. For the multilingual model training, we concatenate training sentence pairs of all the languages in the same mini-batch. We set the batch size of each language to roughly 4096 tokens, and thus the total batch size is 4096 * |C k |, where |C k | is the number of languages in cluster k. The corresponding language embedding is added on the word embedding of each source token. We train the model with 8 NVIDIA Tesla M40 GPU cards, each GPU card with roughly 512 * |C k | tokens in terms of batch size. We use Adam optimizer (Kingma and Ba, 2014) with β 1 = 0.9, β 2 = 0.98, ε = 10 −9 and follow the learning rate schedule in Vaswani et al. (2017). During inference, each source token is also added with the corresponding language embedding in order to give the model a sense of the language it is currently processing. We decode with beam search and set beam size to 6 and length penalty α = 1.1 for all the languages. We evaluate the translation quality by tokenized case sensitive BLEU (Papineni et al., 2002) with multi-bleu.pl 7 .

Tr
Our codes are implemented based on ten-sor2tensor (Vaswani et al., 2018) 8 and we will release the codes once the paper is open to the public.

Results
In this section, we mainly show the experiment results and analyses on the many-to-one setting in Section 5.1-5.3. The results on the one-to-many setting are similar and we briefly show the results in Section 5.4 due to space limitations.

Results of Language Clustering
The language clustering based on language family is shown in Figure 2, which results in 8 groups given our 23 languages. All the language names and their corresponding ISO-639-1 code 9 can be found in Supplementary Materials (Section 2).
We use hierarchical clustering (Rokach and Maimon, 2005) 10 method to group the languages based on language embeddings. We use the elbow method (Thorndike, 1953) to automatically decide the optimal number of clusters K. Note that we have tried to extract the language embeddings from multiple model checkpoints randomly Figure 3: The hierarchical clustering based on language embeddings. The Y-axis represents the distance between two languages or clusters. Languages in the same color are divided into the same cluster, where blue color agglomerates different clusters together. If a language is marked as blue, then it forms a cluster itself. Cluster #1: Vi; Cluster #2: Th; Cluster #3: Zh; Cluster #4: Ja, Tr, Hu; Cluster #5: He, Ar, Fa; Cluster #6: Ru, El, Sk, Sl, Bg, Cs, Pl; Cluster #7: De, Nl, Ro, It, Fr, Es, Pt. Figure 4: The optimal number of clusters determined by the elbow method for 23 languages→English based on language embeddings. The elbow method plots the curve of clustering performance (which is defined as the intra-cluster variation, i.e., the within-cluster sum of squares) according to different number of clusters, and find the location of a bend (knee) in the curve as the optimal number of clusters (7 as shown in the figure). chosen in the later training process, and found that the clustering results based on these language embeddings are stable. Figure 3 demonstrates the clustering results based on language embeddings. Each color represents a language cluster and there are 7 clusters according to the elbow method (the details of how this method determines the optimal number of clusters are shown in Figure 7). Figure 3 clearly shows the agglomerative process of the languages and demonstrates the fine-grained relationship between languages.
We have several interesting findings from Figure 2 and 3: • Language embeddings capture the relationship in the language family well. Cluster #7 in Figure 3  • Language embeddings can also capture the knowledge of morphological typology (Comrie, 1989). Ja (Japanese), Tr (Turkish), Hu (Hungarian) in Cluster #4 of Figure 3 are all Synthetic-Fusional languages. Language embedding based method can cluster them together by learning their language features end-to-end with the embeddings despite they are in different language families.
• Language embeddings capture the regional, cultural, and historical influences. The languages in Cluster #5 (He, Ar, Fa) of Figure 3 are close to each other in geographical location (West Asia). Ar (Arabic) and He (Hebrew) share the same Semitic language branch in Afroasiatic language family, while Fa (Persian) has been influenced much by Arabic due to history and religion in West Asia 12 . Language embedding can implicitly learn this relationship and cluster them together. 11 It also contains El (Greek) probably due to that most Slavic languages are influenced by Middle Greek through the Eastern Orthodox Church (Horrocks, 2009  These findings show that language embeddings can incorporate different prior knowledge such as language family, or some other implicit information like regional, cultural, and historical influence, and can well fuse the information together for clustering. In the next subsection, we show how our language embedding based clustering boosts the performance of multilingual NMT compared with the language family based clustering.

Translation Accuracy
After dividing the languages into several clusters, we train the models for the languages in each cluster with a multilingual NMT model (Johnson et al., 2017) based on different clustering methods and list the BLEU scores of each language→English in Table 1. We also show the results of random clustering (Random) and use the same number of clusters as the language embedding based clustering, and average the BLEU scores on multiple times of random clustering (3 times in our experiments) for comparison. It can be seen that Random performs worst, and language embedding (Embedding) based clustering outperforms the language family (Family) based clustering for most languages 13 , demonstrating the superiority of the learning based method over prior knowledge.
We further compare the language embedding based clustering with two extreme settings: 1) Each language trained with a separate model (Individual), 2) All languages trained in a universal model (Universal), to analyze how language clustering performs on each language. As shown in Table 2, we have several observations: • First, Universal in general performs the worst due to the diverse languages and limited 13 Embedding performs worse than Family on only 2 languages. model capacity. Universal only performs better on Ja (Japanese) and Sl (slovene) due to their small amount of training data, which will be discussed in the next observation.
• Second, Embedding outperforms Individual (mostly by more than 1 BLEU score) on 12 out of 23 languages 14 , which demonstrates that clustering can indeed bring performance gains with fewer models, especially for languages with relatively small data size. For these languages, clustering together with similar languages may act as some kind of data augmentation, which will benefit the model training. An extreme case is that Sl (slovene) training together with all the languages gets the best accuracy as shown in Table 2, since Sl is extremely low-resource, which benefits a lot from more training data. This can be also verified by the next observation.
• Third, similar languages can help boost performance. For example, Hu (Hungarian) and Ja (Japanese) are similar as they are typical agglutinative languages in terms of morphological typology (Comrie, 1989). When they are clustered together based on Embedding, the performance of both languages improves (Hu: 19.07→21.89, Ja: 9.90→11.57).

Clustering Results with respect to Training Data Scale
We further study how the clustering results based on language embeddings change with varying training data, in order to study the robustness of  Table 2: BLEU score of 23 languages→English with different number of clusters: Universal (all the languages share one model), Individual (each language with separate models, totally 23 models)), Embedding (Language Embedding with 7 models). Data size shows the training data for each language→English. our language embedding based clustering, which is important in the low-resource multilingual setting. We reduce the training data of each language to 50%, 20% and 5% to check how stable the language embeddings based clustering is, as shown in Figure 5. It can be seen that most languages are clustered stably when training data is relatively large (50% or even 20%), except for Sl (Slovene) which has the least training data (14K) among the 23 languages and thus reducing the training data of Sl to 50% and 20% will influence the learning of language embeddings and further influence clustering results. If further reducing the training data to 5%, more languages are influenced and clustered abnormally as most languages have less than 10K training data. Even so, the similar languages such as Fr (French), It (Italian), Es (Spanish), and Pt (Portuguese) are still clustered together.

Results of One-to-Many Translation
We finally show experiment results of one-tomany translation, i.e., English→23 languages. We first train a universal model that covers all languages to get the language embeddings, and then perform clustering like in the many-to-one setting. The optimal number of clusters is 5, which is automatically determined by the elbow method (we show the figure of the elbow method in Figure 1 in the Supplementary Materials). Figure 6 shows the clustering results based on language embeddings, and Table 3 shows the BLEU score of multilingual models clustered by different methods. We also have several findings: • In one-to-many setting, language embeddings still depict the relationship among language families. Cluster #1 in Figure 6 covers the Romance (Ro, Fr, It, Es, Pt) branch in Indo-European language family while Cluster #2 contains the Germanic (De, Nl) branch in Indo-European language family. The languages in Cluster #3 (Sl, Bg, El, Ru, Sk, Cs, Pl) are mostly from the Slavic branch in Indo-European language family. We find that the cluster results of Indo-European language family is more fine-grained than manyto-one setting, which divides each language into their own branches in a language family.
• Language embedding can capture regional, cultural, and historic influence. For Cluster #4 in Figure 6, Zh (Chinese) and Ja (Japanese) are in different language families Figure 6: The hierarchical clustering based on language embedding in one-to-many setting. The Y-axis represents the distance between two languages or clusters. Languages in the same color are divided into the same cluster.  but influence each other through the history and culture. Th (Thai) and Vi (Vietnamese) in Cluster #5 that are close to each other in geographical location can be captured by their language embeddings. These findings show that language embeddings can stably characterize relationships among languages no matter in one-to-many setting or in many-to-one setting. Although there exist some differences in the clustering results between oneto-many setting and many-to-one setting due to the nature of the learning based clustering method, most of the clustering results are reasonable. Our experiment results shown in Table 3 demonstrate that the language embedding can provide more reasonable clustering than prior knowledge, and thus result in higher BLEU score for most of the languages.
We also find from Table 3 that each language with separate model (Individual) perform best on nearly 10 languages, due to the abundant model capacity. However, there are still several languages on which our clustering methods (Family or Embedding) largely outperform Individual. The reason is similar as the many-to-one setting that some languages are of smaller data sizes. In this situation, similar languages clustered together will boost the accuracy compared with separate model.

Conclusion
In this work, we have studied language clustering for multilingual neural machine translation. Experiments on 23 languages→English and English→23 languages show that language embeddings can sufficiently characterize the similarity between languages and outperform prior knowledge (language family) for language clustering in terms of the BLEU scores.
For future work, we will test our methods for many-to-many translation. We will consider more languages (hundreds or thousands) to study our methods in larger scale setting. We will also study   how to obtain language embeddings from monolingual data, which will make our method scalable to cover those languages with little or without bilingual data. On the other hand, we will also consider other pre-training methods (Song et al., 2019) for multilingual and low-resource NMT.

A Dataset Description
We evaluate our experiments on IWSLT datasets. We collect 23 languages↔English translation pairs form IWSLT evaluation campaign from year 2011 to 2018. The training data sizes of each languages are shown in Table 4 and we use the default validation and test set for each language pairs. For Bg (Bulgarian), El (Greek), Hu (Hungarian) and Ja (Japanese), there are no available validation and test data in IWSLT, we randomly split 1K sentence pairs from the corresponding training set as the validation and test data respectively.

B Language Name and Code
The language names and their corresponding language codes according to ISO 639-1 standard are listed in Table 5.

C The Elbow Method for One-to-Many Setting
The optimal number of clusters is 5 in one-tomany setting, which is automatically determined by the elbow method as shown in Figure 7. Figure 7: The optimal number of clusters determined by the elbow method for English→23 languages based on language embeddings. The detailed description of the elbow method can be seen in Figure 4 in the main paper.