Multi-Agent Mutual Learning at Sentence-Level and Token-Level for Neural Machine Translation

Mutual learning, where multiple agents learn collaboratively and teach one another, has been shown to be an effective way to distill knowledge for image classification tasks. In this paper, we extend mutual learning to the machine translation task and operate at both the sentence-level and the token-level. Firstly, we co-train multiple agents by using the same parallel corpora. After convergence, each agent selects and learns its poorly predicted tokens from other agents. The poorly predicted tokens are determined by the acceptance-rejection sampling algorithm. Our experiments show that sequential mutual learning at the sentence-level and the token-level improves the results cumulatively. Absolute improvements compared to strong baselines are obtained on various translation tasks. On the IWSLT’14 German-English task, we get a new state-of-the-art BLEU score of 37.0. We also report a competitive result, 29.9 BLEU score, on the WMT’14 English-German task.


Introduction
Neural machine translation (NMT) has achieved significant progress over recent years (Sutskever et al., 2014; Bahdanau et al., 2015; Gehring et al., 2017; Vaswani et al., 2017). Conventional training of NMT models with hard targets limits the models' generalization ability (Szegedy et al., 2016; Pereyra et al., 2017). This has led to a rapid growth of research on more regularized models. Teacher-student (T/S) learning (Li et al., 2014; Hinton et al., 2015; Meng et al., 2019) is an effective method to handle this problem. It has been widely applied in many cases, e.g. model compression (Li et al., 2014; Hinton et al., 2015), domain adaptation (Meng et al., 2018) and low-resource machine translation (Chen et al., 2017).
T/S learning is a strategy that trains a student model with both hard targets and soft posteriors produced by a pre-trained teacher model (Li et al., 2014). Because training with soft targets provides a smoother output distribution, T/S learning can outperform single-model training (Li et al., 2014; Hinton et al., 2015; Meng et al., 2018).
However, does a teacher always outperform a student? To evaluate the strengths and weaknesses of different models, we conduct experiments on two different architectures. Table 1 shows translations from various models. Arch 1 outperforms Arch 2 in translating tense, whereas Arch 2 outperforms Arch 1 in translating certain words. Besides, models with the same architecture but different initializations also show small translation differences. This phenomenon shows that a certain architecture may not always be suitable as a teacher.
In this paper, we propose a two-step multi-agent mutual learning scheme. In the first step, one agent learns from other agents at the sentence-level, which we call sentence-wise mutual learning. Once it becomes "smart", in the second step it compares its own predicted tokens with those of other agents and only learns the poorly predicted tokens, which we call token-wise mutual learning. Mutual learning was first proposed by Zhang et al. (2018) for image classification tasks. Compared to T/S learning, there is no fixed teacher model. The co-trained agents are teachers to one another.
For sentence-wise mutual learning, each agent learns from other agents at the sentence-level. When all of the agents converge after the first step, they continue to learn from other agents at the token-level. Each agent selects and learns the tokens that it predicts poorly. The poorly predicted tokens are determined by acceptance-rejection sampling (Chib and Greenberg, 1995). For every two agents, the target tokens can be divided into two subsets, where one agent performs poorly in one subset and learns those tokens from the other agent. We train our agents on the small-scale IWSLT'14 German-English and IWSLT'14 Dutch-English, middle-scale WMT'16 Romanian-English and large-scale WMT'14 English-German datasets. We obtain significant improvements compared to strong baselines. Up to +2.3, +2.2, +2.0 and +1.6 absolute BLEU scores are achieved on these four tasks.
Src: die evolution ist ein andauerndes thema hier auf der ted-konferenz gewesen, aber heute möchte ich ihnen die ansicht eines zu dem thema geben.
Ref: evolution has been a perennial topic here at the ted conference, but i want to give you today one doctor's take on the subject.
Arch 1 (Init 1): evolution has been a serious topic here at the ted conference, but i want to give you today the view of an ark on the subject.
Arch 1 (Init 2): evolution has been a severe subject here at the ted conference, but today i want to give you the view of a doctor.
Arch 2: now, evolution is a continuous topic in the ted conference today, but today i want to give you the view of a doctor on the subject.
Table 1: Arch 1 and Arch 2 denote the Transformer (Vaswani et al., 2017) and ConvS2S (Gehring et al., 2017), respectively. Init 1 and Init 2 denote two random initializations. The models with different architectures tend to translate diversely. The models with the same architecture but initialized differently have tiny differences.
To the best of our knowledge, this is the first work using multi-agent mutual learning for NMT tasks. The token-level knowledge distillation is also applied for the first time.
Our contributions are summarized as follows:
• We extend mutual learning to MT tasks and develop a sentence-level and token-level training scheme. Performance is improved significantly and consistently on various MT tasks.
• We compare our method with similar training methods, i.e. T/S learning and conditional T/S learning (Meng et al., 2019), and provide theoretical insights and practical evidence for why our method performs well.
• We further delve into the effect of various factors, including the architecture diversity, different methods for interpolation weight and the number of co-trained agents.

Related Work
T/S Learning Knowledge distillation was first introduced by Buciluǎ et al. (2006) and regained popularity due to Li et al. (2014) and Hinton et al. (2015). Currently, T/S learning and its variants can be roughly divided into two paradigms: a fixed pre-trained teacher model (Li et al., 2014; Hinton et al., 2015; Meng et al., 2019) and a dynamic co-trained teacher model (Zhang et al., 2018; Bi et al., 2019). For the former, the student learns from both hard targets and soft posteriors generated by a fixed teacher. For the latter, multiple co-trained students are considered as teachers to one another, also known as mutual learning (Zhang et al., 2018). Alternatively, an ensemble of multiple co-trained agents can also be treated as a teacher to all agents (Bi et al., 2019).
Dual Learning Dual learning (He et al., 2016) or multi-agent dual learning (Wang et al., 2018) leverages the duality between the primal task and the dual task. The source and target domains of these two tasks are swapped. Even though both dual learning and mutual learning introduce extra models compared to the traditional training method, the source and target sentences for all agents in mutual learning stay the same. There is only one task for mutual learning, i.e. translation from source sentences to target sentences.
MT at Sentence-Level and Token-Level Chen et al. (2017) propose a training method at the sentence-level and token-level for pivot-based zero-resource NMT. However, we have different definitions for the sentence-level and token-level translation. Both the sentence-level and token-level translation in Chen et al. (2017) are considered as sentence-level translation in our work. The token-level translation in this paper means one agent only learns the poorly predicted tokens from other agents.

General Mutual Learning
We consider a parallel sentence pair: a source sentence $f_1^J$ with sentence length $J$ and a target sentence $e_1^I$ with sentence length $I$. The indexed target token $e_i$ takes a value from $\{1, 2, ..., V\}$ in the target vocabulary, whose size is $V$. The probability of the token $e_i$ being generated is conditioned on the whole source sentence $f_1^J$ and the previously generated tokens $\hat{e}_1^{i-1}$:
\[ p(e_i \mid \hat{e}_1^{i-1}, f_1^J) \tag{1} \]
The conventional training criterion is cross entropy. For a given sentence pair, we minimize the cross entropy loss between the empirical distribution and the model distribution $p$, which can be written as:
\[ L_{\mathrm{CE}} = -\sum_{i=1}^{I} \sum_{v=1}^{V} \mathbb{1}\{e_i = v\} \, \log p(v \mid \hat{e}_1^{i-1}, f_1^J) \tag{2} \]
where $\mathbb{1}\{\cdot\}$ is the indicator function.
Figure 1: (b): Sentence-wise mutual learning is the first training step. Without loss of generality, we assume Agent 1 performs best and Agent 3 performs worst for a certain training step here. Bidirectional arrows denote that these two agents distill knowledge to each other for the whole sentence with the same cross entropy loss, i.e. symmetric SML. (c): The second training step is token-wise mutual learning. Each agent learns its poorly predicted tokens from other agents. Unidirectional arrows denote that one agent is only a teacher for another agent in one subset of the tokens.
The objective function only takes care of the probabilities of target tokens and omits the probabilities of rival tokens according to Equation (2), so no explicit regularization is introduced. One could make the model generalize better by discounting a certain probability mass from the one-hot target distribution and interpolating it with a uniform prior over the vocabulary (Szegedy et al., 2016; Pereyra et al., 2017; Gao et al., 2020a,b), which is also known as label smoothing. Then the loss function is:
\[ L = -\sum_{i=1}^{I} \sum_{v=1}^{V} q(v \mid e_i) \, \log p(v \mid \hat{e}_1^{i-1}, f_1^J) \tag{3} \]
with:
\[ q(v \mid e_i) = (1 - \alpha) \, \mathbb{1}\{e_i = v\} + \frac{\alpha}{V} \tag{4} \]
with the discounted probability mass $\alpha$, where $0 \le \alpha \le 1$. Empirically, we choose $\alpha = 0.1$. Compared to using hard targets, label smoothing assigns some probability mass to the rival labels.
Apart from label smoothing, mutual learning (ML) (Zhang et al., 2018) is another method to regularize the models. Multi-agent ML with three agents is illustrated in Figure 1a. Some studies have shown that one agent could perform better by learning soft posteriors from other agents (Li et al., 2014; Hinton et al., 2015; Meng et al., 2018). This is because soft posteriors provide a smoother distribution than hard targets.
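As an illustration of the label smoothing loss discussed above, the following is a minimal sketch in plain Python. It assumes the smoothed target is the one-hot distribution discounted by α and mixed with a uniform prior α/V over the vocabulary, as the text describes. The function names, the 0-based vocabulary indices and the toy probabilities are hypothetical and not taken from the paper.

```python
import math

def smoothed_targets(target, vocab_size, alpha=0.1):
    """Return (1 - alpha) * one_hot(target) + alpha / vocab_size for each vocabulary entry."""
    uniform = alpha / vocab_size
    return [(1.0 - alpha) * (1.0 if v == target else 0.0) + uniform
            for v in range(vocab_size)]

def label_smoothing_loss(probs, targets, alpha=0.1):
    """Cross entropy between smoothed targets and model distributions, summed over positions.

    probs[i][v] is the model probability of vocabulary entry v at target position i.
    targets[i] is the (0-based) index of the reference token at position i.
    """
    loss = 0.0
    for p_i, e_i in zip(probs, targets):
        q_i = smoothed_targets(e_i, len(p_i), alpha)
        loss -= sum(q * math.log(p) for q, p in zip(q_i, p_i))
    return loss

# Toy example (assumed values): two target positions, vocabulary of size 4.
probs = [[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]]
targets = [0, 2]
hard_loss = label_smoothing_loss(probs, targets, alpha=0.0)    # plain cross entropy
smooth_loss = label_smoothing_loss(probs, targets, alpha=0.1)  # rival tokens now contribute
```

With α = 0 the sketch reduces to the hard-target cross entropy. With α > 0 the probabilities of rival tokens enter the loss, which is the regularization effect described in the text.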
Building on top of ML, we propose a two-step ML method: sentence-wise mutual learning and token-wise mutual learning. Firstly, we co-train multiple agents with sentence-wise mutual learning until convergence. Each agent learns from both hard targets and soft posteriors at the sentence-level. The agents are then co-trained again with token-wise mutual learning until convergence. Even though they still learn from both hard and soft targets, they only learn their poorly predicted tokens from other agents. With both sentence-wise mutual learning and token-wise mutual learning, we can improve the performance cumulatively.

Sentence-wise Mutual Learning
Sentence-wise mutual learning (SML) with three agents is illustrated in Figure 1b. The cross entropy losses between the empirical distribution and the model distributions, and among the different model distributions, are minimized together with different interpolation weights. Thus, each agent learns from both hard targets and soft posteriors.
Each agent sees the same source sentence and the same target sentence at each step. Suppose we have $K$ agents (with the same or different architectures), and $p_1, p_2, ..., p_K$ are the probability distributions of the agents at a certain time step. $L_1, L_2, ..., L_K$ are the label smoothing losses (see Equation (3)) of the agents. We introduce an extra loss $L^{\mathrm{SML}}_{k,l}$ between different agents $k$ and $l$ (Equation (5)), which is built on the cross entropy loss $H(\cdot, \cdot)$:
\[ H(f, g) = -\sum_{v=1}^{V} f(v) \, \log g(v) \tag{6} \]
The better performing agent is always used as $f$ and serves as a teacher. The overall loss function of the $k$-th agent (Equation (7)) combines $L_k$ and the extra losses $L^{\mathrm{SML}}_{k,l}$ with an interpolation weight $\lambda$, where $0 \le \lambda \le 1$. The interpolation weight can be a static hyper-parameter that stays the same for the whole SML procedure, or a dynamic hyper-parameter that decreases with each epoch (Equation (8)), depending on the number of training epochs $n$ and the decreasing rate $\beta$, where $0 < \beta \le 1$.
The interpolation weight λ is always larger than 0.5 and decreases over the whole SML procedure according to Equation (8). So the agent focuses more on the soft posteriors as the training progresses. The motivation for this is that soft posteriors from agents contain little useful knowledge about the data at the beginning of training. As training goes on, the agents learn information from hard targets and their posteriors preserve more useful information. For the static interpolation weight λ, we suggest setting it larger than 0.5, so the agents learn more from hard targets than from other agents.
Each agent learns from all other agents, even though some agents perform worse than it (see Equations (5) and (7)). Zhang et al. (2018) propose an asymmetric learning method for image classification, i.e. other agents are always used as f in Equation (6). However, learning from better agents and from worse agents is symmetric in our work, i.e. the better performing agent is always used as f and the worse performing agent is used as g. Empirically, we obtain better results with such symmetric learning for machine translation tasks.
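The symmetric extra loss and the interpolation described in this section can be sketched for a single time step as follows. This is only an illustration under stated assumptions: the criterion for deciding which agent is "better" (here, the higher probability on the reference token), the averaging of the extra losses over the other agents, and the use of a plain negative log-likelihood in place of the label smoothing loss are all hypothetical choices, as are the function names and toy distributions.

```python
import math

def cross_entropy(f, g):
    """H(f, g) = -sum_v f(v) * log g(v) for two distributions over the vocabulary."""
    return -sum(fv * math.log(gv) for fv, gv in zip(f, g))

def sml_pair_loss(p_k, p_l, k_is_better):
    """Symmetric extra loss: the better agent always takes the role of f (the teacher)."""
    return cross_entropy(p_k, p_l) if k_is_better else cross_entropy(p_l, p_k)

def overall_loss(own_loss, pair_losses, lam):
    """Interpolate an agent's own loss with its mean extra loss (averaging is an assumption)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * own_loss + (1.0 - lam) * sum(pair_losses) / len(pair_losses)

# Toy example (assumed values): two agents, vocabulary of size 3, reference token 0.
p_1 = [0.8, 0.1, 0.1]
p_2 = [0.5, 0.3, 0.2]
target = 0
one_is_better = p_1[target] >= p_2[target]         # hypothetical "better agent" criterion
loss_12 = sml_pair_loss(p_1, p_2, one_is_better)   # extra loss seen by agent 1
loss_21 = sml_pair_loss(p_2, p_1, not one_is_better)  # extra loss seen by agent 2
own_loss_1 = -math.log(p_1[target])                # stand-in for the label smoothing loss
total_1 = overall_loss(own_loss_1, [loss_12], lam=0.7)
```

In this sketch both agents receive the same extra loss value, which mirrors the symmetric scheme in which the better agent is always f. In an asymmetric variant, each agent would instead always place the other agent in the role of f.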

Token-wise Mutual Learning
After SML, each agent becomes "smart". Instead of learning from other agents at sentence-level, they only learn the poorly predicted tokens from other agents. The token-wise mutual learning (TML) scheme is illustrated in Figure 1c.
Algorithm 1 Acceptance-Rejection Sampling
1: for i = 1, ..., I do
2: compute the probability ratio γ_i of the target token e_i between agent k and agent l, scaled by c
3: draw u_i from the uniform distribution on [0, 1]
4: if u_i ≤ γ_i then
5: i ∈ S_{l,k}
6: else
7: i ∈ S_{k,l}
8: end if
9: end for
Inspired by the acceptance-rejection sampling method (Chib and Greenberg, 1995), the poorly predicted tokens are determined by the probability ratios, γ_i, of the target tokens between two agents as in Algorithm 1. If the value u_i obtained by uniform sampling fulfills u_i ≤ γ_i, we consider agent k to be performing better on the target token e_i than agent l. Then for this time step, agent l needs to learn from agent k. We normally set the scale factor c according to Equation (9). With the acceptance-rejection sampling method, we obtain two target token subsets for each parallel sentence pair between agent k and agent l: S_{k,l} and S_{l,k}. Agent k predicts poorly in the subset S_{k,l} and needs to learn these tokens from agent l. Agent l predicts poorly in the subset S_{l,k} and needs to learn these tokens from agent k. Different from SML, each agent only learns its poorly predicted tokens from other agents for TML. Other agents are always used as f in Equation (6).
The extra loss function L^TML_{k,l} is defined in Equation (10). The overall loss definition for agent k at this step stays the same as Equation (7) for SML (replace L^SML_{k,l} by L^TML_{k,l}). Since all agents preserve much more useful information than hard targets after SML has converged, they simply need to be fine-tuned for some iterations with TML. All agents are also more stable and reliable after SML, so we only need to use a static interpolation weight and set λ < 0.5. As a result, the learning focuses more on the soft posteriors than on the hard targets.
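The token selection described above can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes the ratio takes the usual acceptance-rejection form γ_i = p_k(e_i) / (c · p_l(e_i)), and the default c = 1.0 is a placeholder, because the paper's choice of c is not reproduced here. The function name and the toy probabilities are hypothetical.

```python
import random

def split_tokens(p_k, p_l, c=1.0, rng=None):
    """Split target positions into the subsets (S_kl, S_lk).

    p_k[i], p_l[i]: probabilities that agents k and l assign to the reference token e_i.
    S_kl: positions that agent k predicts poorly (k learns them from l).
    S_lk: positions that agent l predicts poorly (l learns them from k).
    """
    if rng is None:
        rng = random.Random()
    s_kl, s_lk = [], []
    for i, (pk, pl) in enumerate(zip(p_k, p_l)):
        gamma = pk / (c * pl)   # assumed form of the scaled probability ratio
        u = rng.random()        # u_i drawn uniformly from [0, 1)
        if u <= gamma:
            s_lk.append(i)      # agent k is considered better on e_i
        else:
            s_kl.append(i)      # agent l is considered better on e_i
    return s_kl, s_lk

# Toy example (assumed values): reference-token probabilities for a 4-token sentence.
p_k = [0.90, 0.05, 0.40, 0.60]
p_l = [0.30, 0.80, 0.45, 0.10]
s_kl, s_lk = split_tokens(p_k, p_l, rng=random.Random(0))
```

Every position lands in exactly one of the two subsets. Positions where agent k assigns at least c times the probability of agent l are always assigned to S_lk, while the remaining positions are assigned stochastically, so a token on which the two agents perform similarly can fall into either subset.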

Experimental Setup
Datasets In this paper, we run experiments on multiple benchmark MT datasets to evaluate the effectiveness of the proposed method, including IWSLT'14 German-English (De-En), IWSLT'14 Dutch-English (Nl-En), WMT'16 Romanian-English (Ro-En) and WMT'14 English-German (En-De). The number of parallel sentence pairs for the different MT tasks is shown in Table 2. For WMT'14 En-De, we employ newstest 2013 as the validation set and use newstest 2014 as the testing set (Vaswani et al., 2017). For all language pairs, we use byte-pair encoding (Sennrich et al., 2015) with shared vocabularies.
Model Architecture We mainly employ three types of the Transformer model (Vaswani et al., 2017), i.e. Transformer small, Transformer base and Transformer big, implemented in the fairseqpy toolkit (Ott et al., 2019). Transformer base and Transformer big stay the same as Vaswani et al. (2017). The difference between Transformer small and Transformer base is that each encoder and decoder layer in Transformer small employs a word representation size of 512, a feed-forward layer dimension of 1024 and 4 attention heads.
Transformer small is used for the small-scale IWSLT'14 De-En and IWSLT'14 Nl-En datasets with a dropout rate of 0.3. Transformer base is applied to the middle-scale WMT'16 Ro-En and large-scale WMT'14 En-De datasets with a dropout rate of 0.1. Transformer big is also employed for the large-scale WMT'14 En-De dataset with a dropout rate of 0.3. We also train a convolutional sequence-to-sequence network (ConvS2S) (Gehring et al., 2017) and a Transformer small with seven encoder and decoder layers (Transformer small7) on IWSLT'14 De-En and IWSLT'14 Nl-En as our baselines.
Optimization and Evaluation We use the same optimization settings and learning rate decay rule as Vaswani et al. (2017) for Transformer small, Transformer small7, Transformer base and Transformer big, with an initial learning rate of 5e-4. We use a batch size (in number of tokens) of 4K for Transformer small and Transformer small7, and a batch size of 25K for Transformer base and Transformer big. If the batch size cannot be set to the number above because of memory limits, we accumulate gradients to match it. We use the same optimization and learning rate settings as Gehring et al. (2017) for ConvS2S, with a batch size of 4K. We use beam search with a beam size of five and a length penalty of 0.6 to generate translations for all of the models. The evaluation metric is BLEU (Papineni et al., 2002).
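To make the gradient accumulation remark concrete, the following small sketch computes how many forward and backward passes would be accumulated before each parameter update to reach a target batch size in tokens. The helper name and the per-GPU token budget in the example are hypothetical. In fairseq, this kind of accumulation is typically controlled with the `--update-freq` option.

```python
import math

def accumulation_steps(target_tokens, tokens_per_gpu, num_gpus):
    """Number of accumulated steps so the effective batch reaches at least target_tokens."""
    tokens_per_step = tokens_per_gpu * num_gpus
    return max(1, math.ceil(target_tokens / tokens_per_step))

# Hypothetical setting: 25K-token target batch, 4 GPUs that each fit 3,500 tokens.
steps = accumulation_steps(25000, 3500, 4)
effective_batch = steps * 3500 * 4
```

Because the step count is rounded up, the effective batch size is at least the target and may slightly exceed it.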
For IWSLT'14 De-En and IWSLT'14 Nl-En, we use a single Nvidia GTX 1080Ti GPU to train 2, 3 and 4 co-trained Transformer small models for 1.5, 2 and 3 days, respectively. For WMT'16 Ro-En, we use a single GPU to train 2, 3 and 4 co-trained Transformer base models for 2, 3.5 and 4.5 days, respectively. For WMT'14 En-De, we use four GPUs to train 2, 3 and 4 Transformer base models for 10, 13 and 18 days, respectively. For WMT'14 En-De, we also use four GPUs to train 2, 3 and 4 Transformer big models for 12, 18 and 21 days, respectively.
Table 3: For T/S and conditional T/S, the scores are from the student model. For asymmetric and symmetric SML, the scores are from the best co-trained agent.

Results
SML and TML are conducted sequentially. First, we train multiple agents at the sentence-level until convergence. Second, we take the agents with the best performance from SML and train them again at the token-level until convergence.

Results for SML
Different Training Methods T/S learning and its variants, i.e. conditional T/S learning (Meng et al., 2019) and asymmetric ML (Zhang et al., 2018), perform well on various tasks. We reproduce these methods on our tasks to study the effectiveness of SML. As shown in Table 3, our proposed method (symmetric SML) outperforms the other methods and achieves gains of +1.5 and +1.2 BLEU on IWSLT'14 De-En and IWSLT'14 Nl-En, respectively. T/S and conditional T/S employ a fixed pre-trained teacher model. Empirically, the teacher model needs to be much bigger than the student model to obtain a better student model. In our case, we observe no significant improvement when the teacher and student share the same architecture for T/S and conditional T/S (see Table 3). Compared to asymmetric SML, symmetric SML gains +0.9 and +0.5 BLEU, which differs from the results reported in Zhang et al. (2018), who obtain similar results from asymmetric and symmetric ML on image classification tasks.
Agents with Different or Identical Architectures We assess the impact of the architecture diversity of agents. From Table 4, we observe that agents with an identical architecture outperform agents with different architectures. For the co-training of the agents with different architectures, [...] (2019) first pretrain multiple students independently and co-train them as the second step. All of the student models have converged after the first step. They are only fine-tuned in the second step. In this paper, we co-train multiple agents from scratch. It is difficult to balance their performance in each iteration when the architectures are different. The agents with an identical architecture but different initializations obtain at most +1.4 and +1.2 BLEU on IWSLT'14 De-En and IWSLT'14 Nl-En. Figure 2 shows the training procedure of these models. When the architectures vary significantly, their performance diverges. When the architectures are similar, their performance is close at the beginning and the end of training. When the agents share the same architecture, their performance is always close. These phenomena imply that agents with different architectures cannot effectively distill knowledge to one another, since different architectures have different learning capabilities. One could obtain better results with independent learning schedules for the agents, as in the training of generative adversarial networks (Goodfellow et al., 2014). This is our future work. The results below are obtained by training agents that share an identical architecture.
Static or Dynamic Interpolation Weight We employ both static and dynamic interpolation weights (see Equation (8)). Figure 3 shows that the performance of the agent is less sensitive to the dynamic interpolation weight. For all values of the decreasing rate β (= 0.1, 0.2, ..., 0.9), the results are better than the baseline. We believe the reason is that the dynamic interpolation weight gets smaller and smaller with an increasing number of epochs. Compared to the scale of the epoch number, the difference in β does not matter significantly. Besides, the amount of distilled knowledge becomes less and less with an increasing number of epochs. The agent does not learn much from other agents when it becomes "smart". Even though the results are sensitive to the static interpolation weight, we can obtain the best result by fine-tuning it. Empirically, we get better results with a static interpolation weight λ = 0.6 or 0.7, or with a decreasing rate β = 1.0 or 0.7 for the dynamic interpolation weight.
Number of Agents We also investigate the influence of the number of co-trained agents for SML. As shown in

Results for SML + TML
Number of Agents After SML, the agents become "smarter". There is only a slight difference between them (see Figure 2). For further improvement, they only learn their poorly predicted tokens from one another. As shown in Table 5 [...]
Figure 4 illustrates the sensitivity of the agent to the beam size. Without ML, the agent obtains better results with increasing beam size. After the two-step ML, the performance of the agent is less sensitive to the beam size. The curve becomes stable once the beam size reaches three.
Effect of Ensemble Figure 5 and Table 5 show the performance of independent ensembles and ensembles with ML. We observe that ensembles with ML consistently outperform independent ensembles. Compared to four co-trained agents with ML, the ensembling results are improved less significantly for larger datasets.
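For readers unfamiliar with model ensembling, the following sketch shows one common way to combine several agents at a single decoding step: averaging their output distributions over the vocabulary. The paper does not state how its ensembles are formed, so the averaging rule, the function name and the toy distributions are assumptions made for illustration only.

```python
def ensemble_distribution(distributions):
    """Average the per-token distributions of several agents (assumed ensembling rule)."""
    num_agents = len(distributions)
    return [sum(column) / num_agents for column in zip(*distributions)]

# Toy example (assumed values): three agents, vocabulary of size 3, one decoding step.
agent_probs = [
    [0.6, 0.3, 0.1],
    [0.5, 0.2, 0.3],
    [0.7, 0.2, 0.1],
]
avg = ensemble_distribution(agent_probs)
best_token = max(range(len(avg)), key=avg.__getitem__)
```

The averaged vector remains a valid probability distribution, and beam search would then score hypotheses with it in place of a single agent's distribution.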

Conclusion
We extend mutual learning to machine translation tasks at both the sentence-level and the token-level, where multiple agents are co-trained and distill knowledge to one another throughout the training procedure. Firstly, the agents learn the whole sentence from one another. After convergence, they only learn the poorly predicted tokens from other agents. Sampling of poorly predicted tokens is done with acceptance-rejection sampling.
1 The results in this row are obtained with the average checkpoint from the top 10 checkpoints. In this way, we can have strong baselines. The other results are obtained from the best checkpoint of the best agent. The trick of checkpoint averaging does not improve the results for ML.
With our two-step mutual learning, agents could learn from one another at different levels and are improved cumulatively. Extensive experiments show significant improvements for both steps. We improve the state-of-the-art for IWSLT'14 German-English from 36.3 (Bi et al., 2019) to 37.0 points without using additional data. On WMT'14 English-German, we report 28.7 and 29.9 vs. 27.3 and 28.4 (Vaswani et al., 2017) with Transformer base and Transformer big, respectively.
We plan to extend this work by looking into more sophisticated training schedules for agents with different architectures and by applying back-translation to ML.