Revisiting Modularized Multilingual NMT to Meet Industrial Demands

The complete sharing of parameters for multilingual translation (1-1) has been the mainstream approach in current research. However, degraded performance due to the capacity bottleneck and low maintainability hinders its extensive adoption in industries. In this study, we revisit the multilingual neural machine translation model that shares modules only among the same languages (M2) as a practical alternative to 1-1 to satisfy industrial requirements. Through comprehensive experiments, we identify the benefits of multi-way training and demonstrate that the M2 can enjoy these benefits without suffering from the capacity bottleneck. Furthermore, the interlingual space of the M2 allows convenient modification of the model. By leveraging trained modules, we find that incrementally added modules exhibit better performance than singly trained models. The zero-shot performance of the added modules is even comparable to supervised models. Our findings suggest that the M2 can be a competent candidate for multilingual translation in industries.


Introduction
With the current increase in the demand for neural machine translation (NMT), serving an increasing number of languages poses a practical problem for the industry. A naive approach for multilingual NMT is to have multiple single-directional models, which is unsustainable owing to the quadratic increase of models as more languages are introduced. A more practical approach is to limit the number of models by sharing the components among the models (Dong et al., 2015;Firat et al., 2016a;Ha et al.;Johnson et al., 2017). In addition to reducing the number of parameters, sharing the components is also regarded as an effective method to enhance the performance. A fully shared model (henceforth 1-1), which only uses one encoder and one decoder to translate all directions (Ha et al.;Johnson et al., 2017), has been the most popular method because of its compactness.
However, introducing a large number of tasks into a 1-1 model is known to cause a capacity bottleneck. Aharoni et al. (2019) suggested that, given a fixed model capacity, a 1-1 model is bound to a tradeoff between the number of languages and translation accuracy. Zhang et al. (2020) explicitly identified the capacity bottleneck problem of the 1-1 model by showing a clear decrease in performance when translation directions are doubled. Moreover, data imbalance complicates the problem. Arivazhagan et al. (2019b) presented the transfer and interference dilemma among low and high resource languages in an unbalanced environment.
The capacity bottleneck observed in the 1-1 model is particularly undesirable for the industry. Unlimited scaling of the model size (Zhang et al., 2020) is impossible in practice, where inference cost and latency are crucial. With limited capacity, it is difficult to gain from multilingual translation training (henceforth multi-way training) without being subject to the losses of the capacity bottleneck. Furthermore, modification of the 1-1 model, such as the simple addition of a language, is troublesome because the entire model must be retrained from the beginning as a single module, thus requiring a considerable amount of time and effort. This low maintainability makes 1-1 less attractive for industrial use. Still, the benefit from multi-way training is difficult to pass up.
These problems lead us to revisit the multilingual neural machine translation model that shares parameters among the same languages (Firat et al., 2016a). We name this architecture the modularized multilingual NMT model (henceforth M2) since the model shares language-specific modules (encoders or decoders) instead of the whole model. Figure 1 illustrates the architectural overview of multilingual translation using single models, the 1-1 and the M2. Although the M2 has not been given substantial attention owing to the linear increase in its parameters as the number of languages increases, it is relatively free from the capacity bottleneck problem while maintaining a reasonable inference cost. In this study, we explore the possibility of M2 as an alternative to the 1-1 model in industrial settings.
Figure 1: Left is a collection of single models for 6 translation directions. Middle is the 1-1 model that shares all parameters of the model for 6 directions. Right is the M2 model that only shares language-specific modules.
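The quadratic, constant, and linear growth of the three architectures can be made concrete with a small counting sketch. It assumes that every ordered pair of distinct languages is one translation direction and that each single model owns one encoder and one decoder; the function name `module_counts` is ours and purely illustrative.

```python
def module_counts(num_languages: int) -> dict:
    """Count encoders/decoders needed to cover all directions among N languages."""
    n = num_languages
    directions = n * (n - 1)  # ordered pairs of distinct languages
    return {
        "directions": directions,
        # One full encoder-decoder model per direction.
        "single": {"encoders": directions, "decoders": directions},
        # One shared encoder and one shared decoder for everything.
        "1-1": {"encoders": 1, "decoders": 1},
        # One encoder and one decoder per language.
        "M2": {"encoders": n, "decoders": n},
    }


if __name__ == "__main__":
    for n in (3, 4, 6):
        print(n, module_counts(n))
```

With three languages this gives the 6 directions of Figure 1. At inference time, every architecture still runs exactly one encoder and one decoder per request, which is why the M2 keeps a reasonable inference cost despite its larger total parameter count.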
To resolve the capacity bottleneck problem while enjoying the benefits, we identify the effects of multi-way training in a carefully controlled environment. We find that the data-diversification and regularization of multi-way training enable the M2 to outperform both single and 1-1 models with less suffering from capacity bottlenecks. Additionally, the M2 demonstrates a comparable performance increase to 1-1 for low resource pairs in an unbalanced environment.
Combined with its modularizable architecture, interlingual space learned by the M2 allows convenient and effective modification of the model. The simple addition of language-specific modules to the M2 outperformed an individually trained model. The zero-shot learning of the incremented language module outperforms English pivoted translation and is even comparable to a supervised model. Finally, we show that the language invariance of such space improves with more languages.
In summary, our contribution is threefold. 1) We conceptually specify the effects of multi-way training and verify them with comprehensive experiments. 2) We show that the M2 can leverage those effects as the 1-1 does, without the constraint of the capacity bottleneck. 3) Finally, we find that multi-way training of the M2 forms an interlingual space, which allows simple yet effective extension of languages.
The most popular framework for NMT is the encoder-decoder model (Cho et al., 2014;Sutskever et al., 2014;Bahdanau et al., 2014;Luong et al., 2015;Vaswani et al., 2017). Adopting an attention module greatly improved the performance of the encoder-decoder model by using a context vector instead of a fixed-length vector (Bahdanau et al., 2014;Luong et al., 2015). By exploiting multiple attention heads, the Transformer model has become the de facto standard model in NMT (Vaswani et al., 2017;Ott et al., 2018;So et al., 2019).

Multilingual neural machine translation
Dabre et al. (2019) categorized the architectures of multilingual NMTs according to their degrees of parameter sharing. We briefly introduce the models under their criteria.
Early multilingual NMT models minimally shared the parameters by sharing language-specific encoder (Dong et al., 2015;Lee et al., 2017) or decoder (Zoph and Knight, 2016). Firat et al. (2016a) extended this to sharing both language-specific encoders and decoders with a shared attention module.
The fully shared 1-1 model uses only one encoder and one decoder to translate all directions (Ha et al.;Johnson et al., 2017). The target language is indicated by prepending a reserved target language token to the source text. Being compact, the 1-1 model has become the mainstream of multilingual NMT research (Ha et al.;Johnson et al., 2017;Aharoni et al., 2019;Arivazhagan et al., 2019b;Wang et al., 2019;Liu et al., 2020). However, subsequent studies tried to solve the capacity bottleneck problem of the 1-1 through knowledge compression (Tan et al., 2019b), language clustering (Tan et al., 2019a) or increased capacity (Zhang et al., 2020).
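As a minimal sketch of how a target language token can be prepended in the 1-1 setting, consider the helper below. The `<2xx>` token format and the function name `add_target_token` are assumptions made for illustration; the actual reserved tokens depend on the implementation and its vocabulary.

```python
def add_target_token(source_text: str, target_lang: str) -> str:
    """Prepend a reserved token telling the shared model which language to produce.

    The "<2xx>" spelling is a hypothetical convention, not a fixed standard.
    """
    return f"<2{target_lang}> {source_text}"


if __name__ == "__main__":
    # The same English input is routed to different targets by the token alone.
    print(add_target_token("Hello world", "fr"))
    print(add_target_token("Hello world", "de"))
```

In contrast, the M2 needs no such token: the target language is selected by choosing which decoder module to run.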
Partially shared models are extensively studied as a compromise between the capacity bottleneck and model size (Blackwood et al., 2018;Sachan and Neubig, 2018;Platanios et al., 2018;Zaremoodi et al., 2018;Bapna and Firat, 2019). Despite their popularity, we do not compare them in this work because partial sharing essentially relaxes the capacity constraint of full sharing. Also, Sachan and Neubig (2018) reported that the performance of partially shared models is language-specific, which is not the focus of our study. Instead, we focus on the general trade-off of parameter sharing.

Interlingual representation
Building interlingual 1 representation is another interest in multilingual language modeling (Schwenk and Douze, 2017). Interlingual space is the ground for zero-shot translation (Johnson et al., 2017;Arivazhagan et al., 2019a;Al-Shedivat and Parikh, 2019). [...] shared English encoder and decoder as they used English-centered data (parallel corpora that include English). Instead, we show that sharing modules of all languages using diverse directions of data further increases the performance and is the key to building interlingual representation without any explicit regularization.
Our motivation to rediscover the M2 is concurrently shared with Escolano et al. (2020). Escolano et al. (2020) empirically show that M2 is capable of quickly deploying new languages with incrementally added modules, and found it outperforms 1-1. We also experiment on incremental learning and reach a similar conclusion, and further interpret the results as an indication that M2 effectively forms an interlingual space. Regarding the comparison of M2 and 1-1 in general, we deliver an in-depth understanding of the less-studied M2 model, focusing on how to maximize its utility in industry. Experiments on incremental learning are to check whether M2 is a maintainable alternative to 1-1 (which requires expensive re-training from scratch).
1 We prefer the term 'interlingual' to 'language-agnostic' because we expect it to be better if the space is shared while maintaining the language-specific features instead of removing them.

Effects of multi-way training
Because of its complexity, the effects of multi-way training are yet to be identified. Various factors may affect the performance of multilingual translation: model size compared to the amount of data, the number of training directions, the degree of data imbalance among different directions, and the portion of multi-parallel data. In this section, we discuss the possible effects on performance resulting from these factors.
Capacity bottleneck A capacity bottleneck is the most plausible cause of performance degradation in multi-way training. For a fixed size model, the capacity bottleneck is more prominent with the increase in training directions (especially target languages) and the amount of data (Johnson et al., 2017;Aharoni et al., 2019;Arivazhagan et al., 2019b;Zhang et al., 2020).
Cross-language effect Cross-language effect occurs when multiple languages are shared in a module. Low resource languages reportedly benefit from multi-way training when trained along with high resource pairs (Zoph et al., 2016;Nguyen and Chiang, 2017;Neubig and Hu, 2018). The interaction among languages in a module can either be positive (transfer) or negative (interference) on the performance according to their similarity in linguistic patterns.
Data-diversification Data-diversification is associated with the portion of multi-parallel data among the multi-way data. If either the source-side or the target-side language is shared across two directions and the data of the directions is not multi-parallel, the shared module learns more diverse samples of the language. For example, if an English encoder is shared between En-De and En-Fr directions (and the English sentences of the two are not completely shared), the encoder learns more diverse English sentences from both pairs. Few studies have distinguished this effect (Firat et al., 2016a,b). We refer to the improvement resulting from this factor as the data-diversification effect.
Regularization Learning to encode or decode the same language in various directions may result in better representation learning and less overfitting in a single direction. This effect has already been observed by Firat et al. (2016a) as the benefit of generalization and suggested by Aharoni et al. (2019) to benefit many-to-many models compared to many-to-one models.

4 Comparison of single models, 1-1 and M2
We compared the models with the same inference capacity in a series of conditions. Note that most of the multilingual NMT research was conducted in a joint one-to-many and many-to-one environment (JM2M): collected data are English-centered. Despite its simplicity, observations under such a setting may not reliably speak for the many-to-many (M2M) environment, which is also clearly in demand in the industry. Therefore, we set M2M training as the default. We also distinguish between two different dataset compositions: the sharing case where all language pairs share the same sentence set, and the non-sharing case where there is no overlap between different pairs. To illustrate, a multi-parallel set 'En -Es -Ko' can be shared for all possible three pairs (En -Es, En -Ko, Es -Ko) or used only once for one pair. Considering that multi-parallel data is rare in practice, we compared the models in a strictly non-sharing environment.

Settings
Dataset We collected multi-parallel data from Europarl (Koehn, 2005) and selected four languages: German, English, Finnish, and French. To construct a completely balanced environment, we created 500K, 10K, and 10K (train, valid, and test) non-sharing pairs for each of the twelve possible directions from 1.56M multi-parallel data. For the unbalanced environment, we synthetically reduced the amount of data for some pairs to match a specific ratio of the data amounts for low, medium, and high resource pairs. For further details on data division, see appendix A.
Model For the 1-1 model, we used the model of Aharoni et al. (2019), which is a transformer implementation of Johnson et al. (2017). For the M2, we modified Firat et al. (2016a) to not share the attention module. Language-specific embeddings are shared between the encoder and decoder. We implemented all models using the transformer (Vaswani et al., 2017). We used the transformer with a hidden [...]
Training We used the fairseq framework 3 (Ott et al., 2019) to train and test all models. We set the batch size so that every encoder/decoder module learned at a maximum of 6144 tokens/GPU. All models were trained using 4 NVIDIA Tesla V100 GPUs. We followed the default parameters of the Adam optimizer (Kingma and Ba, 2014). For the learning rate schedule, we used 2K warm-up steps until 1e-3, after which we used the inverse square root learning rate schedule (Vaswani et al., 2017). The best model was selected using the best validation loss within the same maximum number of epochs. All the performance was measured in sacreBLEU4 (Post, 2018) using a beam size of 4 and a length penalty of 0.6. Appendix B provides more details of training.
Table 2: Averaged SacreBLEU test scores of single models, 1-1, and M2 trained using a balanced dataset of different configurations. M2M indicates the training of full many-to-many directions among languages (12 directions), whereas JM2M represents the training of directions that only include English on one side (6 directions). * indicates that the score is averaged only on English-centric directions.
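The learning rate schedule described in the training settings can be sketched as follows. This is a minimal illustration, assuming a linear warm-up that starts from zero and reaches the peak of 1e-3 at step 2000; the starting value of the warm-up and the function name `inverse_sqrt_lr` are our assumptions and are not stated in the text.

```python
def inverse_sqrt_lr(step: int, peak_lr: float = 1e-3, warmup_steps: int = 2000) -> float:
    """Learning rate at a given update step.

    Linear warm-up from 0 to peak_lr (assumed), then decay proportional
    to the inverse square root of the step number.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5


if __name__ == "__main__":
    for step in (0, 1000, 2000, 8000, 32000):
        print(step, inverse_sqrt_lr(step))
```

Under these assumptions, the rate peaks at the end of warm-up and halves every time the step count quadruples thereafter.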

Balanced environment
We first compared the performance of multi-way directions in a balanced and non-sharing environment, which is the most strictly controlled.
The results are shown in Table 1. The 1-1 model performed worse than both the single models and the M2 in every direction, clearly indicating a capacity bottleneck. In contrast, the M2 consistently outperformed not only the 1-1 model but also the single models in all directions. As the M2 cannot benefit from the cross-language effect, because no module is shared between different languages, we hypothesize that the following two effects are responsible: data-diversification and regularization. We verify this hypothesis using ablation studies.
Note that the 1-1 model's variation of degradation is higher with target languages than with source languages, even though all the directions are trained using the same amount. The translation to English (-1.96, -1.94, and -1.68) consistently degraded the least, whereas that to French (-2.74, -2.75, and -2.97) degraded the most, given the same source languages. This finding is consistent with previous observations that the capacity bottleneck is more prominent in the decoders (Johnson et al., 2017;Arivazhagan et al., 2019b).
Ablation We compare models in a series of conditions (see IDs in Table 2). (1) We denote the summarized performance demonstrated in Table 1 for reference. (2) To establish whether data-diversification was responsible for the performance improvement of the M2, we experimented using fully shared data. (3) To observe the behavior under alleviated capacity constraints, we experimented using bigger models. We used a transformer with a hidden dimension of 512 and a feed-forward dimension of 2048 for our large model. The training settings are the same except for a larger batch size (x4). (4) Finally, we compared the models trained using the JM2M (6 directions instead of 12) to observe the behavior of the models with fewer directions. (5) We averaged the scores of English-centric directions in (1) to compare with (4). Appendix C presents the individual score for each direction.
Table 2 shows the results of each environment. When we completely shared the data (2), the performance gain of the M2 versus that of the single models (0.31) decreased. Given that (2) eliminates the chance of data-diversification, the degraded performance (0.3) can be attributed to it. However, the fact that the M2 still outperforms the single models (0.31) implies that the M2 can still benefit from the regularization effect of multi-way training. The minor increase in performance of 1-1 (0.18) seems to imply that data-diversification can be detrimental under a severe capacity bottleneck.
2 https://github.com/google/sentencepiece
3 https://github.com/pytorch/fairseq
(3) shows the performance of a larger model trained using the same data. Single models barely improved with the use of larger models, indicating the absence of a capacity bottleneck. On the contrary, the 1-1 model and the M2 both showed an increase in performance. The 1-1 model exhibits a gain from multi-way training only with enough capacity (1.47). This indicates that the benefit of multi-way training can only be achieved with enough capacity for the 1-1 model. Although the M2 is less affected by the capacity bottleneck, the larger capacity is also beneficial for the M2 (1.74) to fully leverage the benefits of multi-way training.
To compare with the models trained with JM2M (4), (5) shows the score of (1) averaged only over directions from and to English. The JM2M scheme is likely to have mixed results: there is less pressure from the capacity bottleneck due to fewer training directions. However, possible gains from data-diversification or regularization are also smaller. Both the 1-1 model and the M2 perform better when trained using M2M (5) than when trained using JM2M (4). However, the performance difference is more significant in the M2 (0.72) than in the 1-1 model (0.16). We assume that while both models benefit from the data-diversification and regularization that accompany training with more directions, the capacity bottleneck in 1-1 counterbalances those positive effects.

Unbalanced environment
We also compared the models with unbalanced training data, which is a natural condition in practice. To synthetically create an unbalanced environment, we first divided the pairs into low (De-En, De-Fi, De-Fr), medium (En-Fi, Fi-Fr), and high (En-Fr) resource pairs. Next, we reduced the amount of data for low and medium pairs, setting the ratio of low:medium:high to 1:2:4 and 1:5:25 in two separate settings. The detailed division of the dataset can be found in appendix A. Note that the models learn with fewer data in the unbalanced environment. We first trained the models without up-sampling. Table 3 shows the scores of the 1-1 model and the M2 in each setting (1:1:1, 1:2:4, 1:5:25). Both models show similar trends with unbalanced data. Compared to the balanced environment, medium and low resource pairs tend to benefit from multi-way training, with gains more prominent for lower resource pairs as the data get more unbalanced. Interestingly, the M2 exhibits a similar level of improvement to that of the 1-1 model in low and medium resource pairs. Considering that the M2 is not subject to cross-language transfer, the performance increase in lower resource pairs may be better explained by data-diversification and regularization. This indicates that the cross-language effect of the 1-1 model may be more subtle than expected.
On the other hand, the M2 barely showed performance degradation in high resource pairs. This implies that the performance boost of low resource pairs and the drop of high resource pairs may not necessarily be a trade-off when there is no capacity bottleneck.
Ablation The sampling method in an unbalanced setting is known to affect the performance (Arivazhagan et al., 2019b). We compared two models in the most unbalanced environment (1:5:25) with and without up-sampling. Table 4 shows the results. As previously reported, we confirm that up-sampling makes the results extreme in the 1-1 model: low resource pairs improve more (from 12.8 to 13.68), whereas high resource pairs degrade more (from -0.74 to -3.83). On the other hand, up-sampling in the M2 harmed performance in all the low, medium, and high resource pairs. The difference in convergence rates among modules may be the cause; models overfit in low-resource pairs and underfit in high-resource pairs. This is supported by the changes in the M2's performance with more training epochs (Appendix C).
Because the input of the M2 does not contain any information regarding the target language, encoders need to encode it so that any decoder can translate. At the same time, decoders of the M2 should be able to generate from the output of any M2 encoder. For this reason, we assume that the output space of M2 encoders is interlingual. Figure 2 illustrates the interlingual space of an M2. Multi-way training of 3 languages (En, Es and Ko) forms the interlingual space, which is shared by 6 modules. This space is preserved as long as the weights of the M2 are frozen. Training a new module (Ja) with a single parallel corpus (En -Ja) using one of the frozen modules (En) adapts the module to the interlingual space. We speculate that the new module (Ja) would be compatible with the other modules (Es and Ko) if the interlingual space is formed well. We verify this using incremental zero-shot learning. Additionally, we measure how the language invariance of the space changes as the number of languages involved in the M2 varies. Since maintainability is one of the critical needs in practice, high performance on incremental learning would be a desirable trait in industrial settings.

Setting
To increase the number of languages, we modified the multi-parallel corpus of Europarl differently. We selected six languages (German, English, Spanish, Finnish, French, and Dutch) and divided a 1.25M multi-parallel corpus into 250K for each direction without sharing. Other details are mostly the same as in former experiments. The detailed division of the dataset and training details can be found in appendix A and B.

Incremental training
We added French to an M2 model trained using all directions among four languages (German, English, Spanish, and Dutch). An additional French encoder and decoder were trained using English-French pairs while the parameters of the English modules remained frozen (1). We also tested two methods to help incremental training as follows. 1) Initialize the new module using one of the modules trained using other languages. In the experiment, we used the weights copied from the English module as the initialization for French (2). Note that the English and French modules do not share any information, such as embeddings. 2) Train the module with auxiliary directions. We incrementally added auxiliary directions of De-Fr (3), Es-Fr (4), and Nl-Fr (5). We compared the models with a singly trained model (6) and the M2 model trained using five languages from scratch (7). Model (7) served as an upper bound for the incremental training.
Table 5 shows the performance of En-Fr and Fr-En with incremental training. The incrementally trained model without any additional method (1) outperformed a single model (6) even though half of the model was frozen. This not only indicates that the language-agnostic space is well-formed but also shows that the incremented direction can benefit from a well-trained frozen module.
Table 6: SacreBLEU zero-shot test scores of the English-pivoted single models and incremented modules from Table 5. * means that the model is trained using the supervision of 250 thousand pairs.
We also found that our two methods are effective in incremental training. Even though French does not share any information with the trained English module, initializing the French module with the weights learned by the English module improves the performance marginally. Incrementally training the new module using multiple directions helps more as the number of directions increases. Note that the two methods are orthogonal and can be applied together. Although none of the incrementally trained models outperform the M2 model trained from scratch, this still shows that simple incremental training for the M2 can be a good alternative to expensive training from scratch.
We examined whether a module incremented in one direction can generalize to the other directions. We compared the zero-shot performance of the models in Table 5 with the English-pivoted translation performance using two single models. We also denoted the supervised performance of single models and the jointly trained M2 for reference (250K for each direction). Table 6 shows the zero-shot performance of incrementally trained modules. Notably, most of the incremented modules demonstrated better performance than the English-pivoted translation. The only exception was in the Fr-De direction of the naively incremented module (1), where the gap seemed to be marginal (-0.45). Our methods for incremental training were also effective for zero-shot performance. The results were even comparable to the single supervised models trained with a 250K parallel corpus. This shows that multi-way training creates a shared (interlingual) space instead of a pair-specific space.

The language invariance of the interlingual space
The interlingual space established by the M2 was surprising, considering that no additional regularization or method was adopted. We measured the language invariance of the interlingual space while varying the number of languages of the M2 model. We trained a series of M2 models that included 3 to 6 languages (6, 12, 20, and 30 directions) and found that the use of more languages to train the M2 also improved its performance in all directions (appendix D). We used two metrics to measure the language invariance of the interlingual space.
Cosine Similarity We measured the representation similarity of parallel sentences from a parallel corpus. To obtain the fixed-size representation, we average-pooled the output of encoders over the time steps. We averaged the cosine similarity of 10K pairs from the test set.
[...] modules tend to learn to simply copy the input, which hinders translation training (Firat et al., 2016a). Meanwhile, the interlingual output representation of the encoders should be translatable by any decoder, including the decoder of the source language. Therefore, the score of mono-direction translation shows how well the information of the source sentence is preserved.
Table 7 shows the cosine similarity and mono-direction translation scores of the M2. As the M2 trains using more languages, the cosine similarity of all three pairs increases, which implies higher language invariance in the interlingual space. However, the marginal gain from each additional language decreases as the number of languages increases. Mono-direction translation scores mostly increase with the number of languages, except for the M2(6), which degraded a little from M2(5). As a result, we reasonably conclude that the language invariance of the interlingual space improves with more languages.
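The cosine similarity metric used in this section can be sketched in plain Python as follows. This is a minimal illustration under our own assumptions: encoder outputs are represented as lists of per-time-step vectors, and the names `mean_pool`, `cosine_similarity`, and `average_similarity` are hypothetical helpers, not functions from fairseq or from the original implementation.

```python
import math


def mean_pool(states):
    """Average per-time-step vectors into one fixed-size sentence vector."""
    length = len(states)
    dim = len(states[0])
    return [sum(vec[i] for vec in states) / length for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal dimension."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_similarity(pairs):
    """Mean cosine similarity over (source_states, target_states) pairs."""
    sims = [cosine_similarity(mean_pool(src), mean_pool(tgt)) for src, tgt in pairs]
    return sum(sims) / len(sims)


if __name__ == "__main__":
    # Toy encoder outputs: sentences may differ in length but share a dimension.
    en_states = [[1.0, 0.0], [0.0, 1.0]]
    fr_states = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
    print(average_similarity([(en_states, fr_states)]))
```

Pooling over time steps is what allows sentences of different lengths in two languages to be compared with a single vector each.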

Conclusion
In this study, we re-evaluate the M2 model and suggest it as an appropriate choice for multilingual translation in industries. By extensively comparing the single models, the 1-1 model, and the M2 in varying conditions, we find that the M2 can benefit from multi-way training through data-diversification and regularization while suffering less from capacity bottlenecks. Additionally, we demonstrate that the M2 can also benefit low resource pairs in an unbalanced environment, as a 1-1 model does, without being subject to the cross-language effect. Next, we suggest that the M2 model is easily maintainable because of its interlingual space. The interlingual space not only enables incremental training in a simple manner but also brings competitive incremental zero-shot performance. Furthermore, we validate that the language invariance of the space improves as the number of languages in the M2 increases. We hope that this study sheds light on the relatively disregarded M2 model and provides a benchmark for selecting a model among varying levels of shared components.

Table 8 (part assigned to each pair, section 4):
      De   En   Fi   Fr
De    -    1    2    3
En    -    -    3    2
Fi    -    -    -    1
Fr    -    -    -    -

A Dataset
A.1 Division of multi-parallel dataset
In order to create a completely non-sharing dataset and make the best use of the multi-parallel corpus, we divided the 1.5M multi-parallel corpus into 3 parts (500K) for section 4 and 5 parts (250K) for section 5. We then assigned the parts to pairs so that no two directions sharing a language on the same side use the same part. The assignments for sections 4 and 5 are stated in tables 8 and 9, respectively. Validation and test sets are divided in the same manner. For the complete-sharing dataset, training data for all pairs are created only from part 1. However, the validation and test sets remain the same as in the completely non-sharing dataset.

Table 9 (part assigned to each pair, section 5):
      De   En   Es   Fi   Fr   Nl
De    -    1    2    3    4    5
En    -    -    3    4    5    2
Es    -    -    -    5    1    4
Fi    -    -    -    -    2    1
Fr    -    -    -    -    -    3
Nl    -    -    -    -    -    -
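The non-sharing constraint on the part assignment can be checked mechanically. The sketch below is illustrative only: the part numbers are our reading of tables 8 and 9, and the helper name `is_non_sharing` is hypothetical. It verifies that no language sees the same corpus part in two different pairs.

```python
# Part assigned to each unordered language pair (our reading of table 8).
TABLE_8 = {
    ("De", "En"): 1, ("De", "Fi"): 2, ("De", "Fr"): 3,
    ("En", "Fi"): 3, ("En", "Fr"): 2, ("Fi", "Fr"): 1,
}

# Part assigned to each unordered language pair (our reading of table 9).
TABLE_9 = {
    ("De", "En"): 1, ("De", "Es"): 2, ("De", "Fi"): 3, ("De", "Fr"): 4, ("De", "Nl"): 5,
    ("En", "Es"): 3, ("En", "Fi"): 4, ("En", "Fr"): 5, ("En", "Nl"): 2,
    ("Es", "Fi"): 5, ("Es", "Fr"): 1, ("Es", "Nl"): 4,
    ("Fi", "Fr"): 2, ("Fi", "Nl"): 1,
    ("Fr", "Nl"): 3,
}


def is_non_sharing(assignment):
    """Return True if every language uses each corpus part in at most one pair."""
    parts_by_language = {}
    for (lang_a, lang_b), part in assignment.items():
        for lang in (lang_a, lang_b):
            seen = parts_by_language.setdefault(lang, set())
            if part in seen:
                return False  # this language would see the same sentences twice
            seen.add(part)
    return True


if __name__ == "__main__":
    print(is_non_sharing(TABLE_8), is_non_sharing(TABLE_9))
```

Because each language takes part in N-1 pairs, at least N-1 parts are needed, which matches the 3 parts used for four languages and the 5 parts used for six.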

A.2 Amount of data for each pair
In order to create the unbalanced environment in section 4, we limited the amount of data for some directions. Table 10 shows the amount of data for each pair in the balanced and unbalanced environments in section 4. For section 5, the amount is the same for all directions, 250K. All the validation and test sets have the same size, 10K.
Though our dataset can easily be reconstructed from the open dataset (Europarl) with the described process, we also made our dataset available online 4 for the convenience of readers. We only uploaded the dataset of the balanced environment since the unbalanced environment can be made from it trivially. The dataset is binarized with the fairseq-preprocess command of the fairseq framework.
We selected the batch size of 6144 max tokens based on the best validation loss of a single model (En-De) among {1536, 3072, 6144, 12288, 24576} max tokens per GPU (4 GPUs). While the total number of parameters and the training directions differ among the single model, 1-1 and M2, we set the batch size for each direction so that each module learns with the same batch size (6144 tokens). Specifically, one step of a single model includes a single direction, while that of 1-1(4) and M2(4) includes 12 directions. However, the number of training directions per module differs between 1-1(4) and M2(4), with 12 and 3 directions, respectively. Therefore, the batch size per direction of 1-1 is 512 (1/12 of 6144) and that of M2 is 1536 (1/4 of 6144). Since we accumulate the gradients of all directions, all the compared modules learn with the same batch size of data.

B.2 Sampling
To train on balanced data, we used round-robin scheduling of all directions. We compared two sampling methods in the ablation of the unbalanced environment: up-sampling and proportional sampling. Round-robin scheduling is equivalent to up-sampling low-resource data in an unbalanced environment. For efficient proportional sampling, we sampled several small batches of pairs in proportion to the amount of data of each pair. We accumulated the gradients of several batches so that the expected batch size of each module meets the total batch size.
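Proportional sampling of directions can be sketched as below. This is only an illustration under stated assumptions: the pair counts in the usage example are hypothetical placeholders rather than the amounts in Table 10, the helper names are ours, and gradient accumulation over the sampled batches is not implemented here.

```python
import random


def sampling_probabilities(pair_counts):
    """Probability of picking each direction, proportional to its data size."""
    total = sum(pair_counts.values())
    return {direction: count / total for direction, count in pair_counts.items()}


def sample_directions(pair_counts, num_batches, seed=0):
    """Draw the directions of the next small batches with proportional sampling."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    directions = list(pair_counts)
    weights = [pair_counts[d] for d in directions]
    return rng.choices(directions, weights=weights, k=num_batches)


if __name__ == "__main__":
    # Hypothetical sentence counts for a high, a medium, and a low resource pair.
    counts = {"En-Fr": 500_000, "Fi-Fr": 100_000, "De-En": 20_000}
    print(sampling_probabilities(counts))
    print(sample_directions(counts, num_batches=8))
```

Round-robin scheduling, by contrast, visits every direction equally often regardless of its size, which is why it behaves like up-sampling for low-resource pairs.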

B.3 Early stopping
Since fixing the maximum tokens of a batch per module results in different step size among models, we stopped the training of models based on the maximum number of epochs. All the best models were chosen based on the best validation loss (averaged) within 100 epochs.

C Detailed scores of ablations
This section provides detailed scores of the ablation parts of sections 4 and 5. Table 11 shows detailed scores under complete sharing ((2) of table 2) and increased capacity ((3) of table 2). Table 12 shows detailed scores under JM2M ((4) of table 2) and M2M ((5) of table 2) training. Table 13 shows detailed scores of the models under proportional sampling and up-sampling in table 4. M2(+10) indicates the scores of the M2 trained for 10 more epochs after the best validation loss. M2(+10) shows increased performance in medium and high resource pairs and degradation in low resource pairs. This indicates that up-sampling causes a difference in convergence rates among pairs of different resource levels for the M2.