Multi-agent Learning for Neural Machine Translation

Conventional Neural Machine Translation (NMT) models benefit from training with an additional agent, e.g., dual learning, and bidirectional decoding with one agent decoding from left to right and the other decoding in the opposite direction. In this paper, we extend the training framework to the multi-agent scenario by introducing diverse agents in an interactive updating process. At training time, each agent learns advanced knowledge from others, and they work together to improve translation quality. Experimental results on NIST Chinese-English, IWSLT 2014 German-English, WMT 2014 English-German and large-scale Chinese-English translation tasks indicate that our approach achieves absolute improvements over the strong baseline systems and shows competitive performance on all tasks.


Introduction
Training with more than one agent has attracted intensive research interest in recent years, for example, dual learning (He et al., 2016; Xia et al., 2017, 2018) and bidirectional decoding (Zhang et al., 2019b). The former leverages the duality between two related agents as a feedback signal to regularize training, while the latter targets the agreement between one agent decoding from left to right (L2R) and another decoding in the opposite direction (R2L). Both methods enhance the translation models by introducing a regularization term into the training objective.
The effectiveness of these two methods lies in the fact that appropriate regularization can help each agent to learn from the superior models while integrating their advantages (e.g., good translations of prefixes for L2R, and good translations of suffixes for R2L). As shown in Table 1, due to the exposure bias problem (Ranzato et al., 2016), the agent trained to decode from left to right tends to generate better prefixes and bad suffixes, and the agent decoding in the reverse direction demonstrates the opposite preference. By introducing additional Kullback-Leibler (KL) divergences between the probability distributions defined by the L2R and R2L models into the NMT training objective (Zhang et al., 2019b), it is possible for the two models to learn advantages from each other.
Table 1:
Src: 此次 豪雨 灾情 , 半数 遇难 者 遭 洪水 冲走 溺 毙 ; 其他 则 因 房屋 倒塌 或 电线 走火 触电 致死 。
Ref: In this disaster caused by torrential rains, half of the victims were carried away and drowned in the floods; the others died because their houses collapsed or from electrocution caused by short circuits.
L2R: In the torrential rain, half of the victims were washed away by floods, while others died of electricity caused by collapsed houses or electric wires.
R2L: Half of the victims were drowned by floods, while others were killed by collapsed houses or electrocuted by sparkling wires.
Given the empirical success of previous studies on training with two agents, it is natural to consider training with more than two agents, and to extend our study to the multi-agent scenario. However, training with more than two agents is more complex, and we face two critical problems. First, when deploying the multi-agent system, should we focus on the diversity or the strength of each agent (Wong et al., 2005; Marcolino et al., 2013)? Second, learning in the multi-agent scenario is many-to-many, as opposed to the relatively simpler one-to-one learning in two-agent training, and requires an effective learning strategy.
There are many ways to improve the diversity of models, even when all of them are based on the Transformer model (Vaswani et al., 2017). For example, decoding in opposite directions usually results in different preferences: good prefixes for L2R and good suffixes for R2L (Zhang et al., 2019b). Alternatively, self-attention with relative position representations enhances generalization to sequence lengths unseen during training (Shaw et al., 2018). Furthermore, increasing the number of layers in the encoder is expected to help the model specialize in word sense disambiguation (Tang et al., 2018; Domhan, 2018). In this paper, we investigate the effects of two kinds of multi-agent teams: a diverse team of the alternative agents mentioned above, and a uniform team of the same model trained from different initializations.
To resolve the second problem, we simplify many-to-many learning to one-to-many (one teacher vs. many students) learning, extending ensemble knowledge distillation (Fukuda et al., 2017; Freitag et al., 2017; Zhu et al., 2018). During training, each agent improves by learning from the ensemble model (Teacher) of all agents, with knowledge distillation (Hinton et al., 2015; Kim and Rush, 2016) integrated into the training objective. This procedure can be viewed as introducing an additional regularization term into the training objective, with which each agent gradually learns advantages from the ensemble model. With this method, each agent is optimized not only to maximize the likelihood of the training data, but also to minimize the divergence between its own model and the ensemble model.
However, in our empirical studies the above learning strategy converges rapidly to a local optimum. It seems that each agent tends to focus on learning from the ensemble model while completely ignoring its own exploration. To alleviate this problem, we train each agent to learn from the ensemble model only when necessary, and to distill knowledge based on the translation quality (BLEU score) of the ensemble model. That is, we evaluate the quality of the ensemble model to see whether it is good enough to learn from. Consequently, the knowledge distilled from the ensemble model to the agent stems from the better translations generated by the ensemble model.
We evaluate our model on the NIST Chinese-English translation task, the IWSLT 2014 German-English translation task, the large-scale WMT 2014 English-German translation task and a large-scale Chinese-English translation task. Extensive experimental results indicate that our model significantly improves the translation quality compared to the strong baseline systems.
Moreover, our model also reports competitive performance on the German-English and English-German translation tasks, achieving 36.27 and 29.67 BLEU scores, respectively.
To the best of our knowledge, this is the first work on training NMT models with more than two agents. The contributions of this paper are summarized as follows:
• We extend the study on training with two agents to the multi-agent scenario, and propose a general learning strategy to train multiple agents.
• We investigate the effects of the diversity and the strength of each agent in the multi-agent training scenario.
• We simplify the complex many-to-many learning in multi-agent learning to one-to-many learning by having each agent learn from the ensemble model only when necessary.
• Extensive experiments on multiple translation tasks confirm that our model significantly improves the translation quality, reporting new state-of-the-art results on IWSLT 2014 German-English and competitive results on WMT 2014 English-German translation tasks.

Background
Conventional autoregressive NMT models (Bahdanau et al., 2015; Sutskever et al., 2014) decode target tokens sequentially, so the choice of the current token is conditioned on the previously generated sequence. Formally, at time step $t$, the generation of the current token $y_t$ is determined by the conditional distribution $p(y_t \mid y_{<t}, x; \theta)$, where $y_{<t}$ represents the previously generated sequence, and $\theta$ are the parameters of the representation of the source sequence and the partial target sequence.
Furthermore, the usual training criterion is to minimize the negative log-likelihood for each sample from the training data,
$$\mathcal{L}_{\mathrm{NLL}}(\theta) = -\sum_{t=1}^{T} \sum_{k=1}^{|V|} \mathbb{1}\{y_t^{*} = k\} \log p(y_t = k \mid y_{<t}, x; \theta),$$
where $T$ is the length of the target sequence, $|V|$ is the size of the vocabulary and $\mathbb{1}\{\cdot\}$ is the indicator function. This objective can be viewed as minimizing the cross-entropy between the correct translation $y^{*}$ and the model distribution $p(y_t \mid y_{<t}, x; \theta)$.
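The negative log-likelihood criterion described above can be illustrated with a minimal sketch. The function name `nll_loss`, the list-of-lists layout of the probabilities and the toy numbers are assumptions made for illustration only; they are not taken from the paper's implementation.

```python
import math


def nll_loss(step_probs, reference_ids):
    """Negative log-likelihood of one reference sentence.

    step_probs[t][k] stands for p(y_t = k | y_<t, x; theta) and
    reference_ids[t] is the index of the correct token at step t.
    The indicator function simply selects one probability per step.
    """
    if len(step_probs) != len(reference_ids):
        raise ValueError("one distribution is needed per target position")
    return -sum(math.log(probs[k]) for probs, k in zip(step_probs, reference_ids))


# Hypothetical toy input: vocabulary of 3 tokens, target length T = 2.
toy_probs = [[0.7, 0.2, 0.1],
             [0.1, 0.8, 0.1]]
toy_reference = [0, 1]
toy_loss = nll_loss(toy_probs, toy_reference)
```

Only the probability assigned to the reference token contributes at each step, which is why the double sum over positions and vocabulary entries collapses to a single sum over positions.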

Multi-agent Learning
Empirical studies indicate that, in the two-agent scenario, one agent can be trained to perform better by learning advantages from the other agent, namely one-to-one learning (Figure 1.(a)). The training objective of the agent is regularized so as to learn better models by leveraging the relationship between the two agents as feedback, i.e., duality in the dual learning problem, and agreement in bidirectional decoding.
Extending the study to the multi-agent scenario is feasible and desirable, as multiple agents might supply more reliable and diverse advantages than the two-agent scenario. However, the agents are expected to learn advantages from each other in the multi-agent scenario, which results in a complex many-to-many learning problem (Figure 1.(b)).
Instead of tackling many-to-many learning, we force each agent to learn from a common Teacher by introducing ensemble knowledge distillation, thus reducing the learning to one-to-many (Figure 1.(c)). With this learning strategy, each agent can learn to improve its performance in an interactive updating process. The important difference from ensemble Knowledge Distillation (KD) (Figure 1.(d)) is that in ensemble KD the Teacher network is fixed after pre-training and stays unchanged during the training process, whereas in our framework the state of the Teacher network is updated at each iteration, so its performance improves explicitly as each agent improves. In some ways, ensemble KD can be viewed as a particular case of our model in which the Teacher network is never updated.

Overall Framework
For the sake of simplicity, four variant agents are referred to as Agent1, Agent2, Agent3 and Agent4. We begin by pre-training each agent independently (Figure 2.(a)), and then enhance the model in the multi-agent scenario in two steps: 1) Generating Ensemble Model (Figure 2.(b)), and 2) One-to-Many Learning (Figure 2.(c)). The performance of each agent is improved in an interactive updating process, through repeating the above two steps.

Generate Ensemble Model
As pointed out in previous work, ensemble models can empirically alleviate problems existing in the standard NMT model, such as ambiguities in the training data and the discrepancy between training and testing.
Given these practical advantages of ensemble models, it is reasonable to force agents to learn from the ensemble model, instead of learning from each other separately. Following previous work, we develop our ensemble model by averaging the model distributions of all agents. Formally, the model distribution of the $i$-th agent $\mathrm{agent}_i$ is defined as $p(y_t^i \mid y_{<t}^i, x; \theta)$. Assuming we have $N$ agents, the model distribution of the ensemble model can be formulated as
$$q(y_t \mid y_{<t}, x; \theta_t) = \frac{1}{N} \sum_{i=1}^{N} p(y_t^i \mid y_{<t}^i, x; \theta),$$
where $\theta_t$ are the parameters representing the ensemble model. Notably, we do not train the ensemble model itself in the training process.
In the above formula, the probability $q(y_t \mid y_{<t}, x; \theta_t)$ is a reliable estimator of the model distribution, as the majority of the agents are likely to generate the correct sequence. From this perspective, we expect that more agents will lead to better and more robust performance.
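The averaging step can be sketched as follows. This is a minimal illustration under the assumption that all agents expose a next-token distribution over the same vocabulary for the same target position; the function name `ensemble_distribution` and the toy numbers are hypothetical, and how distributions of an R2L agent are aligned with those of L2R agents is not covered here.

```python
def ensemble_distribution(agent_probs):
    """Average the next-token distributions of N agents.

    agent_probs[i][k] is the probability that agent i assigns to token k
    at the current step. The result plays the role of the Teacher
    distribution q; no parameters are trained for it.
    """
    n_agents = len(agent_probs)
    vocab_size = len(agent_probs[0])
    if any(len(p) != vocab_size for p in agent_probs):
        raise ValueError("all agents must share one vocabulary")
    return [sum(p[k] for p in agent_probs) / n_agents for k in range(vocab_size)]


# Hypothetical distributions of four agents over a 3-token vocabulary.
agents = [[0.6, 0.3, 0.1],
          [0.2, 0.7, 0.1],
          [0.5, 0.4, 0.1],
          [0.7, 0.2, 0.1]]
q = ensemble_distribution(agents)
```

In this toy case one agent prefers token 1, but the averaged distribution follows the majority and prefers token 0, which is the sense in which the ensemble acts as a majority vote.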

One-to-Many Learning
In the one-to-many learning framework, the ensemble model acts as the Teacher network, which distills knowledge to each agent iteratively.
Rather than minimizing the cross-entropy with the observed data, we minimize, for each agent, the cross-entropy with the probability distribution from the ensemble model:
$$-\sum_{t=1}^{T} \sum_{k=1}^{|V|} q(y_t = k \mid y_{<t}, x; \theta_t) \log p(y_t^i = k \mid y_{<t}^i, x; \theta).$$
With the above formula, the agent is optimized to minimize the divergence between its own model and the ensemble model.
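A minimal sketch of this distillation term is given below. The function name `distillation_loss`, the `eps` clipping constant and the toy distributions are assumptions for illustration; the sketch only shows that the cross-entropy is smallest when the agent matches the Teacher.

```python
import math


def distillation_loss(teacher_probs, student_probs, eps=1e-12):
    """Cross-entropy between Teacher and Student, summed over positions.

    teacher_probs[t][k] is the ensemble probability of token k at step t,
    student_probs[t][k] the corresponding probability of one agent.
    eps guards against log(0) and is an implementation detail of this sketch.
    """
    loss = 0.0
    for q_t, p_t in zip(teacher_probs, student_probs):
        loss -= sum(q * math.log(max(p, eps)) for q, p in zip(q_t, p_t))
    return loss


teacher = [[0.5, 0.4, 0.1]]
close_student = [[0.5, 0.4, 0.1]]   # agrees with the Teacher
far_student = [[0.1, 0.1, 0.8]]     # disagrees with the Teacher
loss_close = distillation_loss(teacher, close_student)
loss_far = distillation_loss(teacher, far_student)
```

Because the Teacher distribution is treated as a fixed target inside this term, minimizing it is equivalent to minimizing the KL divergence from the Teacher to the agent, up to the constant entropy of the Teacher.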
However, integrating the above regularization term into the training objective straightforwardly is problematic in practice. The agent tends to focus on learning from the ensemble model rather than exploring its own predictions, and therefore converges rapidly. Consequently, the model converges at a suboptimal point, and the agents fail to improve by learning from each other because the quality of the ensemble model is never evaluated.
To alleviate this problem, we train each agent to learn from the ensemble model when necessary, distilling the knowledge conditioned by the translation quality of the ensemble model. We evaluate the quality of the ensemble model to see whether it is good enough to be studied. The knowledge distillation from the ensemble model to the agent comes from the better translation generated by the ensemble model. Otherwise, the agent is forced to learn its own distribution.
Let $\langle X_g, Y_g \rangle$ be one sentence pair in the training corpus. The quality of the translation sequence generated by the ensemble model is measured by its BLEU score against $Y_g$, where the sequence $Y_t$ is generated by the model distribution $q(y_t \mid y_{<t}, x; \theta_t)$. The quality of the translation sequence generated by each agent is defined with the same metric, where $Y_s^i$ is generated by the model distribution $p(y_t^i \mid y_{<t}^i, x; \theta)$ of each agent. Conditioned on the translation quality of the ensemble model, we modify the training objective for each agent by replacing the ensemble distribution in the cross-entropy with a target $S$. When the translation of the ensemble model is better than that of the agent, $S$ is the ensemble distribution; otherwise, $S$ is an indicator $\mathbb{1}_s\{\cdot\}$ conditioned on the existence of $y_t^i = k$ in the sequence $Y_s^i$, so that the agent learns its own distribution. In practice, each agent is optimized not only to maximize the likelihood of the training data, but also to minimize the divergence between its own model and the ensemble model: the two terms are combined, with a hyperparameter $\lambda_i$ balancing their weights.
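The quality-conditioned objective can be sketched as below. Several parts are assumptions rather than the paper's definitions: `unigram_precision` is a toy stand-in for the sentence-level BLEU score, the strict comparison used to decide that the ensemble is better is a guess, and `total_loss` combines the two terms as a plain weighted sum with `lam` in the role of the balancing hyperparameter.

```python
import math
from collections import Counter


def unigram_precision(hypothesis, reference):
    """Toy stand-in for sentence-level BLEU: clipped unigram precision."""
    if not hypothesis:
        return 0.0
    ref_counts = Counter(reference)
    hyp_counts = Counter(hypothesis)
    matched = sum(min(c, ref_counts[w]) for w, c in hyp_counts.items())
    return matched / len(hypothesis)


def select_targets(teacher_probs, agent_ids, teacher_is_better):
    """Distill from the ensemble if it is better, else from the agent's own output."""
    if teacher_is_better:
        return teacher_probs
    vocab_size = len(teacher_probs[0])
    # One-hot targets on the tokens the agent generated itself.
    return [[1.0 if k == tok else 0.0 for k in range(vocab_size)] for tok in agent_ids]


def cross_entropy(targets, student_probs, eps=1e-12):
    return -sum(s * math.log(max(p, eps))
                for s_t, p_t in zip(targets, student_probs)
                for s, p in zip(s_t, p_t))


def total_loss(nll, agent_loss, lam):
    """Assumed combination: NLL plus lam times the agent loss."""
    return nll + lam * agent_loss


# Hypothetical toy data with token ids from a 3-token vocabulary.
reference = [0, 1]
teacher_hyp, agent_hyp = [0, 1], [0, 2]
teacher_probs = [[0.5, 0.4, 0.1], [0.2, 0.7, 0.1]]
student_probs = [[0.6, 0.3, 0.1], [0.3, 0.6, 0.1]]

better = unigram_precision(teacher_hyp, reference) > unigram_precision(agent_hyp, reference)
targets_used = select_targets(teacher_probs, agent_hyp, better)
own_targets = select_targets(teacher_probs, agent_hyp, False)
own_loss = cross_entropy(own_targets, student_probs)
```

In a real system the quality function would be an actual BLEU implementation and the sequences would come from decoding; the sketch only shows how the choice of target switches with the measured quality.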
From the perspective of learning, the agent learns to minimize the training objective by generating competitive translations, which leads to a global improvement for all agents. On the other hand, a sequence-level training objective might alleviate the exposure bias problem implicitly (Ranzato et al., 2016).

Joint Learning
In the work of Zhang et al. (2019b), a relatively complex joint learning framework was proposed for training two agents. In this paper, according to the previous experiments, we find that a simple multi-task learning technique without sharing any modules presents promising performance. In Algorithm 1, we describe the overall procedure of our approach. It is worth noting that in line 3, we assume the model reads a single pair of training examples per time step for simplicity of description, while in practice the model reads a batch of samples at each step.
Algorithm 1:
   ...
   Ensemble Model: q(y_t | y_<t, x; θ_t);
 5 Generate sequence: Y_t ← arg max_{0≤t<T} q(y_t | y_<t, x; θ_t);
 6 for each agent agent_i do
 7   Generate sequence: ...
     Compute NLL loss: ...
     Compute Agent loss: ...
     Model loss: ...
     Update gradients for each agent;
14 end
15 until convergence;
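The interactive updating process can be illustrated with a deliberately small toy, not the paper's implementation. Each "agent" here is only a vector of logits over a 3-token vocabulary for a single position, the quality-based gating is omitted, and the names `train`, `refresh_teacher`, `lam` and `lr` are hypothetical. Setting `refresh_teacher=False` freezes the Teacher after initialization, which corresponds to the ensemble KD special case described earlier.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble(distributions):
    n = len(distributions)
    return [sum(d[k] for d in distributions) / n for k in range(len(distributions[0]))]


def train(agent_logits, gold, lam=0.5, lr=0.5, steps=200, refresh_teacher=True):
    """Repeat: rebuild the Teacher, then update every agent against it."""
    agents = [list(z) for z in agent_logits]  # copy so the inputs stay untouched
    teacher = ensemble([softmax(z) for z in agents])
    for _ in range(steps):
        probs = [softmax(z) for z in agents]
        if refresh_teacher:
            teacher = ensemble(probs)  # Teacher follows the current agents
        for z, p in zip(agents, probs):
            for k in range(len(z)):
                target = 1.0 if k == gold else 0.0
                grad_nll = p[k] - target      # gradient of the NLL term w.r.t. logit k
                grad_kd = p[k] - teacher[k]   # gradient of the distillation term
                z[k] -= lr * (grad_nll + lam * grad_kd)
    return [softmax(z) for z in agents]


# Hypothetical starting points for three agents; token 0 is the reference.
initial_logits = [[1.0, 0.5, 0.0], [0.2, 0.8, 0.0], [0.0, 0.0, 0.0]]
before = [softmax(z)[0] for z in initial_logits]
after = [p[0] for p in train(initial_logits, gold=0)]
after_fixed = [p[0] for p in train(initial_logits, gold=0, refresh_teacher=False)]
```

The Teacher is treated as a constant when computing gradients, in line with the statement that the ensemble model itself is not trained. In this toy setting the frozen Teacher keeps pulling agents back toward the initial ensemble, while the refreshed Teacher improves together with the agents; this observation is specific to the toy and is not a reproduction of the reported results.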

Experiments
In this paper, we evaluate our model on four translation tasks: NIST Chinese-English Translation Task, IWSLT 2014 German-English Translation Task, WMT 2014 English-German Translation Task and large-scale Chinese-English Translation Task.

Data Preprocessing
To compare with previous studies, we conduct byte-pair encoding (Sennrich et al., 2016) for Chinese, English and German sentences, setting the

Agent Variants
To increase the diversity of agents, we implement the following variants of the Transformer-based system: L2R, the officially released open-source toolkit for running the Transformer model; R2L, the standard Transformer decoding in the reversed direction; Enc, the standard Transformer with 30 layers in the encoder; and Rel, a reimplementation of self-attention with relative position (Shaw et al., 2018).

Training Details
We implement our models using PaddlePaddle, an end-to-end open source deep learning platform developed by Baidu. It provides a complete suite of deep learning libraries, tools and service platforms to make the research and development of deep learning simple and reliable.
We use the hyperparameters of the base version of the standard Transformer model for the NIST Chinese-English and IWSLT German-English translation tasks, except for the smaller token size (batch size = 320). For the WMT English-German and large-scale Chinese-English translation tasks, we use the hyperparameters of the big version. As described in the previous section, we use a hyperparameter $\lambda_i$ for each agent to balance the preference between learning from the observed data and learning from the ensemble model. We set this value according to the performance of each agent after pre-training, based on $B_i$, the BLEU score obtained by pre-training the $i$-th agent, and $B_{avg}$, the average BLEU score of all agents. Under this setting, an agent learns more from the ensemble model when its performance is worse than the majority vote, rather than focusing on exploring its own predictions.
Table 3: BLEU score on IWSLT 2014 German-English translation. KD-4 stands for ensemble knowledge distillation with four agents. Dual-5 is the SOTA model from the work of Wang et al. (2019), and Rel-4 is our best model (Rel) trained with four diverse agents.
We train our model with parallelization at the data batch level. For the NIST Chinese-English task, it takes about 1.5 days to train models on 8 NVIDIA P40 GPUs, 5 days for the WMT English-German task, 8 hours for the IWSLT German-English task and 7 days for the large-scale Chinese-English task. The detailed training process is as follows: first, we train each agent until its BLEU score no longer improves, and then we execute Algorithm 1 for one-to-many learning, which takes 30K-40K steps to converge.

Chinese-English Results
Study on different numbers of agents. We first assess the impact of diverse agents on translation quality. From Table 2, we see that the model's performance consistently improves as the number of agents increases, and we observe that 1) The four baseline systems with different implementations present diverse translation quality, in particular Rel with refined position encoding achieves the best performance. 2) The improvement of each agent after multi-agent learning is dependent on the performance of the co-trained agent (+0.33, +0.8, +0.63 improvements obtained by L2R when training with R2L, Enc and Rel). 3) Better results can be obtained by increasing the number of training agents (e.g., 47.3 → 47.73 → 47.95 for L2R).
From the overall results, increasing the number of agents in multi-agent learning significantly improves the performance of each agent (at most +1.39, +0.9, +1.33, +1.17 in a+b+c+d), which suggests that each agent learns advantages from the other agents, and that more agents might lead to further improvement. More importantly, our improvements come from the advanced training strategy without any modification to the decoding stage, which makes the approach practical to deploy.
Study on uniform agents. To measure the importance of diversity in multi-agents, we conduct mul-

Models                                   En-De
ConvS2S Gehring et al. (2017)            25.2
Transformer Vaswani et al. (2017)        28.4
Rel Shaw et al. (2018)                   29.2
DynamicConv Wu et al. (2019)             29.7
Back-translation Sergey et al. (2018)

These results suggest that multi-agent learning is better conducted with diverse agents than with uniform agents that each have excellent single-model performance. On the other hand, in our multi-agent learning, training with more agents brings consistent improvements even when the agents are identical models trained from different initialization seeds.

German-English and English-German Results
We work with the IWSLT German-English translation task to compare our model to the SOTA model. From Table 3, we observe that the KD-4 and Rel-4 models achieve the best results on this dataset (35.53 and 36.27), which demonstrates the effectiveness of using multiple agents. Even though it is trained with only four agents, our model outperforms the dual model trained with five agents (34.70), and reports a SOTA score on this translation task. Moreover, we argue that fine-tuning L2R to a better performance could further improve Rel-4, as the large gap between L2R and Rel affects the learning efficiency.
We further investigate the performance of our model on the WMT English-German translation task, where it achieves a competitive result of 29.67 BLEU on newstest2014. From Table 4, we can see that another related work, multi-agent dual learning (Wang et al., 2019), also achieves promising results. This confirms that training with more agents leads to better translation quality, and reveals the relevance of using multiple agents. Although Sergey et al. (2018) report a BLEU score of 35.0 on this dataset, they leverage a refined training corpus and a large amount of monolingual data. We argue that our model could be further improved with their back-translation technique. Moreover, the goal of this paper is to introduce a general learning framework for multi-agent learning rather than to fine-tune exhaustively in order to report SOTA results. We argue that the performance of our model can be further improved by using a more advanced single agent, such as DynamicConv (Wu et al., 2019) or Dual (Wang et al., 2019).

Large-Scale Chinese-English Results
To verify that multi-agent learning also yields improvements on a large-scale corpus, we conducted experiments on 40M Chinese-English sentence pairs with the same data processing method as for NIST. We built test sets of more than 10K sentences covering various domains such as news, spoken language and query logs. From Table 5, we can see that multi-agent learning increases the average baseline BLEU score over these test sets by 0.86. This technique has already been applied to Baidu Translate.

Contrastive Evaluation
In this paper, we evaluate our model on two contrastive evaluation datasets: Lingeval97 (https://github.com/rsennrich/lingeval97) and ContraWSD (https://github.com/a-rios/ContraWSD). The former is used to measure the performance of different models in dealing with subject-verb agreement in English-German translation, while the latter evaluates the performance on word sense disambiguation in German-English translation. We refer readers to Sennrich (2017) and Tang et al. (2018) for detailed descriptions of the two datasets.
Table 6: Accuracy of subject-verb agreement (SVA) and word sense disambiguation (WSD) for different models. We report the performance of L2R in different models for comparison of using different agents.
From Table 6, we can see that the L2R training with multiple agents improves the accuracy both on SVA and WSD, and we observe that: 1) The L2R improves its ability of SVA with the help of the other agents significantly. 2) The Enc presents better advantage in resolving WSD than in SVA, while Rel has the opposite preference. 3) Although the R2L presents ordinary BLEU score, it still does help the L2R resolve the SVA.

Related Work
The most related work was recently proposed by Wang et al. (2019), who introduced a multi-agent algorithm for dual learning. The major differences are: 1) the training in their work is conducted on two agents while the parameters of the other agents are fixed; 2) they use identical agents with different initialization seeds; 3) our model is simple yet effective. In fact, it is easy to incorporate an additional agent trained with their dual learning strategy into our model to further improve the performance. More importantly, both our work and theirs indicate that using more agents can improve the translation quality significantly.
Another related work is ensemble knowledge distillation (Fukuda et al., 2017;Freitag et al., 2017;Zhu et al., 2018), in which the ensemble model of all agents is leveraged as the Teacher network, to distill the knowledge for the corresponding Student network. However, as described in the previous section, the knowledge distillation is one particular case of our model, as the performance of the Teacher in their model is fixed, and cannot be further improved by the learning process.
Our work is also motivated by work on training with two agents, including dual learning (He et al., 2016; Xia et al., 2017, 2018) and bidirectional decoding (Zhang et al., 2018, 2019b). Our method can be viewed as a general learning framework to train multiple agents, which explores the relationship among all agents to enhance the performance of each agent efficiently.

Conclusions and Future Work
In this paper, we propose a universal yet effective learning method for training multiple agents. In particular, each agent learns advantages from the ensemble model when necessary. The knowledge distillation from the ensemble model to the agent stems from the better translation generated by the ensemble model, which enables each agent to learn high-quality knowledge while retaining its own exploration.
Extensive experimental results prove that our model brings an absolute improvement over the baseline system, reporting SOTA results on IWSLT 2014 German-English and competitive results on WMT 2014 English-German translation tasks.
In the future, we will focus on training with more agents by having each agent translate a part of one sentence according to its own advantage, rather than translating the whole sentence.