Reference Language based Unsupervised Neural Machine Translation

Exploiting a common language as an auxiliary for better translation has a long tradition in machine translation and lets supervised learning-based machine translation enjoy the enhancement delivered by the well-used pivot language in the absence of a source language to target language parallel corpus. The rise of unsupervised neural machine translation (UNMT) almost completely relieves the parallel corpus curse, though UNMT is still subject to unsatisfactory performance due to the vagueness of the clues available for its core back-translation training. Further enriching the idea of pivot translation by extending the use of parallel corpora beyond the source-target paradigm, we propose a new reference language-based framework for UNMT, RUNMT, in which the reference language only shares a parallel corpus with the source, but this corpus still indicates a signal clear enough to help the reconstruction training of UNMT through a proposed reference agreement mechanism. Experimental results show that our methods improve the quality of UNMT over that of a strong baseline that uses only one auxiliary language, demonstrating the usefulness of the proposed reference language-based UNMT and establishing a good start for the community.


Introduction
Recently, the application of neural machine translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2015) to standard benchmarks has achieved great success (Wu et al., 2016; Gehring et al., 2017; Vaswani et al., 2017) because of advances in deep learning and the availability of large-scale parallel corpora; however, the applicability of MT systems is limited because of their reliance on large parallel corpora for the majority of language pairs. In real-world situations, the majority of language pairs have very little parallel data, although large volumes of monolingual data are available for each language. UNMT removes the dependence on parallel corpora, relying only on monolingual corpora in each language (Reddi et al., 2018; Lample et al., 2018a,b; Conneau and Lample, 2019). UNMT uses translation symmetry for dual learning in each language direction. Existing UNMT models are mainly built on the encoder-decoder schema. The essence of UNMT is to learn unsupervised cross-lingual word alignment and/or sentence alignment. For unsupervised word alignment, the most popular methods are word embedding mapping (Conneau et al., 2017; Lample et al., 2018a; Sun et al., 2019), vocabulary sharing (Lample et al., 2018b), and language modeling (Conneau and Lample, 2019). Weight sharing can also be adopted in the encoder/decoder, adversarial training, and back-translation (BT) processes for unsupervised sentence alignment.
Figure 1: Schemas of (a) pivot supervised NMT, (b) MUNMT, (c) our proposed RUNMT, where S stands for source language, T for target language, P for pivot language in pivot translation, and R for the reference language in RUNMT.
BT aims to train models using iteratively generated pseudo-parallel data, thus overcoming the lack of cross-language signals. Specifically, monolingual data in the source language is translated to the target language using a source-to-target translation model, and then the pseudo-parallel data (including both the generated and the original data) is used to train the target-to-source translation model, and vice versa.
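One round of the back-translation procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the word-by-word toy "translators", and the tiny lexicon are all hypothetical stand-ins for real NMT models.

```python
from typing import Callable, Dict, List, Tuple

Translator = Callable[[str], str]
Pair = Tuple[str, str]  # (model input, training target)


def back_translation_pairs(
    mono_src: List[str],
    mono_tgt: List[str],
    src_to_tgt: Translator,
    tgt_to_src: Translator,
) -> Tuple[List[Pair], List[Pair]]:
    """Build pseudo-parallel data for one BT round (hypothetical helper).

    The generated sentence is always the model input and the original
    monolingual sentence is always the training target.
    """
    # Training data for the S -> T model: inputs come from the T -> S model.
    s2t_pairs = [(tgt_to_src(t), t) for t in mono_tgt]
    # Training data for the T -> S model: inputs come from the S -> T model.
    t2s_pairs = [(src_to_tgt(s), s) for s in mono_src]
    return s2t_pairs, t2s_pairs


def word_by_word(lexicon: Dict[str, str]) -> Translator:
    """Toy translator (assumption): replaces each word using a lexicon."""
    def translate(sentence: str) -> str:
        return " ".join(lexicon.get(w, w) for w in sentence.split())
    return translate


# Toy English/French lexicon, used only to make the sketch runnable.
EN_FR = {"the": "le", "cat": "chat", "sleeps": "dort"}
FR_EN = {fr: en for en, fr in EN_FR.items()}
en2fr = word_by_word(EN_FR)
fr2en = word_by_word(FR_EN)

s2t_pairs, t2s_pairs = back_translation_pairs(
    ["the cat sleeps"], ["le chat dort"], en2fr, fr2en
)
```

In an actual UNMT system the two translators are the current model checkpoints, so the pairs are regenerated at every round as the models improve.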
Unfortunately, as the input sentences in the pseudo-parallel data are generated by unsupervised models, random errors and noise are inevitably introduced, resulting in low-quality parallel data for model training and bad translation performance. In addition, when vocabulary sharing UNMT models for two distant languages (that is, very little vocabulary overlap between the source language and target language) are trained with BT, the unsupervised model may generate the words in the source language instead of in the target language under source-to-target forward translation. As a result, although the reconstruction loss is small if the forward translation generation is very similar to the input, the model is not sufficiently optimized because the pseudo-parallel corpus contains very little cross-lingual sentence alignment information.
Multilingualism (Edwards, 2002; Clyne, 2017) is a powerful fact of communication across speech communities. In multilingualism, an important "lingua franca" (or common language) often serves as an aid to cross-group understanding, usually representing the language of a potent and prestigious society with a large number of users. For machine translation, the parallel corpora between languages and some lingua franca are usually more abundant. Thus, conventional Pivot Translation (PT) usually leverages a resource-rich language (mainly English) as the pivot to help the low/zero-resource translation (see Appendix A.1 for a detailed analysis). Although UNMT no longer requires parallel corpora, this feature is still worth exploring and can be used to enhance current UNMT systems under low- or zero-resource scenarios. In addition, we can further use the transfer learning capabilities of the model to transfer the translation capabilities of languages and lingua francas to any two languages that need to be learned.
In this work, taking the merits of pivot language translation in both supervised NMT and UNMT as shown in Figure 1, we propose the reference language-based UNMT framework in which the reference language shares a parallel corpus with only the source language (using only the target language follows a similar pattern). In the framework, we use multilingualism and propose a reference agreement mechanism. Exploiting the accurate alignment clues between source and reference languages, we can more confidently enhance source-target UNMT by taking into account the translation agreement within the source, reference, and target languages. Specifically, this previously irrelevant parallel data plays a role in controlling the quality of the pseudo-sentence pairs through a cross-lingual equivalence (translation agreement). The proposed mechanism is orthogonal to the common multilingual transfer learning methods and different from the general pivot translation method.
Empirical results on popular benchmarks and distant languages show that the reference agreement mechanism consistently improves the performance of UNMT systems. In addition, we explore the impact of multilingual information on the basis of our multilingual UNMT baseline and proposed method.

UNMT
UNMT is a recently proposed MT paradigm that attempts to achieve the co-growth of MT models in two directions while relying solely on monolingual data; for example, it would benefit both English-to-French and French-to-English translation. It is a special kind of dual learning (He et al., 2016; Xia et al., 2017a,b) in both directions of language pairs. Currently, state-of-the-art UNMT models are based on a sequence-to-sequence encoder-decoder architecture using Transformer (Vaswani et al., 2017), similar to supervised NMT models.
For ease of expression, in the remainder of this paper, we denote the monolingual training data spaces of the source language $S$ and the target language $T$ as $\phi_S$ and $\phi_T$. The parallel training data space between languages $S$ and $T$ is represented as $\phi_{S-T}$. The translation direction symmetry of UNMT model training implies that the translation direction problem $S \to T$ is the same as $T \to S$. (In UNMT, translation is bidirectional, so the "source" and "target" languages only indicate the translation direction in which the model is used; essentially, $S$ and $T$ are symmetrical and exchangeable.)
In general, the NMT model with parameters $\theta_{S \to T}$ models the conditional probability $P(t|s)$ of the translated sequence $t$. The model parameters $\theta_{S \to T}$ are trained to maximize the likelihood of the translations in the parallel training data space $\phi_{S-T}$.
As there is a lack of cross-lingual sentence alignment information, the current UNMT models, despite their differences in training methods and structure, reach a consensus over the use of parallel data that is iteratively generated by the BT method. Specifically, for a monolingual sentence of the target language $t \in \phi_T$, a source translation $\hat{s}$ is generated using the primal $T \to S$ translation model $P(\cdot|t; \theta_{T \to S})$; then $\hat{s}$ and $t$ form a pseudo-parallel pair $\langle \hat{s}, t \rangle$ for $S \to T$ model training. Similarly, the generated pseudo-parallel pair $\langle \hat{t}, s \rangle$ for a monolingual sentence $s$ in the source language is also used for training the $T \to S$ model.
The likelihood of the reconstructions $t \to \hat{s} \to t$ and $s \to \hat{t} \to s$ for the UNMT model is maximized over the BT process, and the BT process is finally optimized by minimizing the corresponding reconstruction loss in both directions.
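The likelihoods and objectives named in this section can be sketched as follows. This is a reconstruction from the surrounding definitions, not the original typeset equations: the symbols $\mathcal{L}$ and $\mathcal{J}_{BT}$, the use of sums and expectations, and the absence of any weighting between directions are assumptions.

```latex
% Supervised likelihood on parallel data (assumed standard maximum-likelihood form)
\mathcal{L}(\theta_{S \to T}) = \sum_{\langle s, t \rangle \in \phi_{S-T}} \log P(t \mid s;\, \theta_{S \to T})

% Reconstruction likelihoods t -> \hat{s} -> t and s -> \hat{t} -> s
\mathcal{L}^{BT}(\theta_{S \to T}) = \mathbb{E}_{t \sim \phi_T}
  \big[ \log P(t \mid \hat{s};\, \theta_{S \to T}) \big],
  \qquad \hat{s} \sim P(\cdot \mid t;\, \theta_{T \to S})

\mathcal{L}^{BT}(\theta_{T \to S}) = \mathbb{E}_{s \sim \phi_S}
  \big[ \log P(s \mid \hat{t};\, \theta_{T \to S}) \big],
  \qquad \hat{t} \sim P(\cdot \mid s;\, \theta_{S \to T})

% BT objective to be minimized (assumed: negative sum of the two likelihoods)
\mathcal{J}_{BT} = -\big( \mathcal{L}^{BT}(\theta_{S \to T}) + \mathcal{L}^{BT}(\theta_{T \to S}) \big)
```

In this sketch the generated sentences $\hat{s}$ and $\hat{t}$ are treated as fixed inputs, so gradients flow only through the model that reconstructs the original sentence.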

Reference Language based UNMT
In this section, we introduce the reference language-based UNMT framework and present our three kinds of reference agreement utilization approaches: reference agreement translation (RAT), reference agreement back-translation (RABT), and cross-lingual back-translation (XBT). These approaches are illustrated in Figure 2.

Framework and Reference Agreement
Figure 1(a) demonstrates the traditional pivot translation schema in supervised NMT, subfigure 1(b) shows the multilingual UNMT, and subfigure 1(c) is our proposed reference language-based UNMT framework. When applying pivot translation to UNMT, any language pair in UNMT can be directly trained without any parallel data, which allows translation in both directions due to the nature of UNMT. Thus, the traditional pivot schema (S → P → T ) is not necessary when applying pivot translation to UNMT; using a third language (usually a common language) is a more suitable practice for UNMT. In order to distinguish from the pivot language in traditional pivot translation, we define the language used to enhance the performance of translation S → T in UNMT as the reference language R, regardless of whether the translation schema is S → R → T as the bridge or S → T directly.
In this paper, the reference agreement refers to the cross-lingual equivalence (i.e., translation agreement) provided by bilingual parallel sentence pairs between the reference language and the source or target language of the translation.

Reference Agreement Translation
In the absence of supervision signals, the quality of machine translation across languages cannot be effectively evaluated. That is, a suitable cross-lingual quality evaluation function $\mathrm{quality}(s, \hat{t})$ cannot be defined in cases where only the source and the generated target are provided. As a result, the quality of the synthetic pseudo-parallel pairs $\langle \hat{s}, t \rangle$ and $\langle \hat{t}, s \rangle$ in BT cannot be guaranteed, which limits the performance of UNMT.
RAT refers to the simultaneous translation of the parallel sentences of languages S and R into the target language T. The two translations should be in agreement (i.e., the same). Therefore, this agreement in the translations from different sources can be used to collaboratively evaluate the generated quality, and it thus forms a new quality evaluation function $\mathrm{quality}(s, r, \hat{s}, \hat{r})$, defined over the two inputs and the two translations generated from them.
Based on this premise, we propose a detailed implementation for the RAT approach, enabling reference agreement functions with BT during the UNMT training process and resulting in improved translation agreement, as shown in Figure 2(b). Specifically, RAT requires the two translation models to generate an agreed-upon translation by taking votes. We use this agreed-upon translation as the target and form pseudo-parallel data from the input of each language to train both of the models.
Specifically, for a parallel sentence pair $\langle s, r \rangle$, we would ideally have $P(\cdot|s; \theta_{S \to T}) = P(\cdot|r; \theta_{R \to T})$, as stated for RAT; however, as the two models $\theta_{S \to T}$ and $\theta_{R \to T}$ are trained on different data, the agreement may be corrupted. Therefore, we combine the two models to obtain the agreed-upon translation output $\hat{t}_a$:
$$\hat{t}_a \sim P(\cdot|s, r; \theta_{S \to T}, \theta_{R \to T}),$$
where
$$P(\cdot|s, r; \theta_{S \to T}, \theta_{R \to T}) = \prod_i \Big[ \tfrac{1}{2} \big( P(\cdot|s, \hat{t}_{<i}; \theta_{S \to T}) + P(\cdot|r, \hat{t}_{<i}; \theta_{R \to T}) \big) \Big], \quad (6)$$
and $\hat{t}_{<i}$ stands for the tokens that have been generated prior to the $i$-th generation step. Finally, the two synthetic sentence pairs $\langle s, \hat{t}_a \rangle$ and $\langle r, \hat{t}_a \rangle$ are used to train the models $S \to T$ and $R \to T$. Because the learning target is a silver (model-generated) target rather than a gold one, a smoothed cross-entropy loss is used instead of the ordinary cross-entropy loss. The learning objective for RAT applies this smoothed loss to both synthetic pairs, with a smoothing control value that indicates the uncertainty of the target for the model.
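The agreement step described above can be illustrated with a small decoding sketch. It is written under several assumptions that are not taken from the paper: each model is represented by a hypothetical function returning a next-token distribution, the two distributions are averaged with equal weights, and decoding is greedy rather than sampled.

```python
from typing import Callable, Dict, List, Sequence

Dist = Dict[str, float]                         # token -> probability
StepModel = Callable[[str, Sequence[str]], Dist]  # (input sentence, prefix) -> Dist
EOS = "</s>"


def agreed_step(p_s: Dist, p_r: Dist) -> Dist:
    """Average two next-token distributions with equal weights (assumption)."""
    vocab = set(p_s) | set(p_r)
    return {w: 0.5 * (p_s.get(w, 0.0) + p_r.get(w, 0.0)) for w in vocab}


def agreed_translation(
    s: str, r: str, model_s2t: StepModel, model_r2t: StepModel, max_len: int = 20
) -> List[str]:
    """Greedy decoding from the averaged distribution (a simplification)."""
    prefix: List[str] = []
    for _ in range(max_len):
        dist = agreed_step(model_s2t(s, prefix), model_r2t(r, prefix))
        # sorted() makes tie-breaking deterministic.
        token = max(sorted(dist), key=lambda w: dist[w])
        if token == EOS:
            break
        prefix.append(token)
    return prefix


# Toy step models (hypothetical): they ignore the input and depend only on
# the decoding step, which is enough to show how the vote works.
def toy_s2t(src: str, prefix: Sequence[str]) -> Dist:
    table = [{"pisica": 0.6, "caine": 0.4}, {"doarme": 0.9, EOS: 0.1}, {EOS: 1.0}]
    return table[min(len(prefix), 2)]


def toy_r2t(ref: str, prefix: Sequence[str]) -> Dist:
    table = [{"pisica": 0.3, "caine": 0.7}, {"doarme": 0.8, EOS: 0.2}, {EOS: 1.0}]
    return table[min(len(prefix), 2)]


agreed = agreed_translation("the dog sleeps", "le chien dort", toy_s2t, toy_r2t)
```

In this toy case the S → T model alone would prefer "pisica" at the first step, but the averaged distribution selects "caine", which shows how the second model can overrule a mistake of the first.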

Reference Agreement Back-translation
Motivated by the RAT approach, the input language sentences and agreed-upon translations form two synthetic parallel sentence pairs. With these regularized pseudo-parallel sentences, we not only train the S → T and R → T forward-translation models (as the generation direction is the same as the training direction), but also train the BT models, i.e., T → S and T → R. This gives the RABT training approach shown in Figure 2(c). The learning objective of RABT mirrors that of RAT in the reverse direction: the agreed-upon translation serves as the input, and the original sentences in S and R serve as the targets.
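The RAT and RABT objectives can be sketched in the notation of this section. This is a reconstruction under assumptions, not the original equations: the symbols $\mathcal{J}$, $\tilde{\mathcal{L}}_{\epsilon}$, $\epsilon$, and the vocabulary $V$ are introduced here, the smoothing is assumed to be standard uniform label smoothing, and whether RABT also keeps the forward RAT terms is not recoverable from the text.

```latex
% Ordinary and label-smoothed cross-entropy for an input x and target y (assumed forms)
\mathcal{L}(y, x; \theta) = -\sum_i \log P(y_i \mid x, y_{<i};\, \theta)

\tilde{\mathcal{L}}_{\epsilon}(y, x; \theta) = -\sum_i \Big[ (1-\epsilon) \log P(y_i \mid x, y_{<i};\, \theta)
  + \frac{\epsilon}{|V|} \sum_{v \in V} \log P(v \mid x, y_{<i};\, \theta) \Big]

% RAT: the silver target \hat{t}_a is learned from both inputs with the smoothed loss
\mathcal{J}_{RAT} = \tilde{\mathcal{L}}_{\epsilon}(\hat{t}_a, s;\, \theta_{S \to T})
  + \tilde{\mathcal{L}}_{\epsilon}(\hat{t}_a, r;\, \theta_{R \to T})

% RABT: the agreed translation is the input, and the gold sentences are the targets
\mathcal{J}_{RABT} = \mathcal{L}(s, \hat{t}_a;\, \theta_{T \to S})
  + \mathcal{L}(r, \hat{t}_a;\, \theta_{T \to R})
```

With $\epsilon = 0$ the smoothed loss reduces to the ordinary cross-entropy, so $\epsilon$ directly expresses how much the silver target is distrusted.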

Cross-lingual Back-translation
The traditional BT analyzed in Section 2 and illustrated in Figure 2(a) allows us to train a T → S model with the help of an S → T model, and vice versa; however, this mutually beneficial training is performed entirely within one language pair. Multilingual UNMT (MUNMT) (Sun et al., 2020) is a special case of UNMT that is capable of translating between multiple source and target languages. Although multiple language pairs are trained jointly in MUNMT, there is an obvious shortcoming for BT: it does not translate between language pairs that do not occur together during training, i.e., there is a lack of optimization across language pairs. Joint training across language pairs can be performed through forced high-order BT in UNMT, which chains the translation steps L_i → L_{i+1} through intermediate languages before reconstructing the original sentence, where O is the translation order indicating the number of bridge languages in BT. This approach may fail because decoding through multiple noisy channels (L_i → L_{i+1}) accumulates latency and compounds errors, resulting in low-quality final pseudo-parallel data between L_{O+1} and L_1. Although this high-order BT can expose multiple language pairs for simultaneous training, it also introduces the problem of uncontrollable intermediate translation quality.
Therefore, we propose XBT based on the reference agreement. This method allows BT to remain first order while training across language pairs. XBT is a new training approach for UNMT that translates language S to T and then back-translates it to R, or from R to T and then to S, based on the reference agreement provided by the bilingual parallel data φ_{S−R} between languages S and R. This training approach is illustrated in Figure 2(d). The objective function of XBT is defined over T_S and T_R, which indicate the target-language sentences translated from S and R, respectively.
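The data flow of XBT can be sketched as follows. This is an illustration only: `xbt_pairs` and the tagging "translators" are hypothetical names introduced here, and a real system would use the current S → T and R → T models.

```python
from typing import Callable, List, Tuple

Translator = Callable[[str], str]
Pair = Tuple[str, str]  # (model input, training target)


def xbt_pairs(
    parallel_s_r: List[Tuple[str, str]],
    s_to_t: Translator,
    r_to_t: Translator,
) -> Tuple[List[Pair], List[Pair]]:
    """Build cross-lingual BT pairs from S-R parallel data (hypothetical helper).

    For a parallel pair (s, r): the translation of s into T is paired with r
    to train T -> R, and the translation of r into T is paired with s to
    train T -> S. Each pair involves a single generation step, so BT stays
    first order while the training signal crosses language pairs.
    """
    t2r_pairs: List[Pair] = []
    t2s_pairs: List[Pair] = []
    for s, r in parallel_s_r:
        t_from_s = s_to_t(s)  # S -> T
        t_from_r = r_to_t(r)  # R -> T
        t2r_pairs.append((t_from_s, r))  # reconstruct R from the S-side translation
        t2s_pairs.append((t_from_r, s))  # reconstruct S from the R-side translation
    return t2r_pairs, t2s_pairs


def tagger(label: str) -> Translator:
    """Toy translator (assumption): wraps the input in a label."""
    return lambda sentence: f"{label}[{sentence}]"


t2r_pairs, t2s_pairs = xbt_pairs(
    [("the cat", "le chat")], tagger("T<-S"), tagger("T<-R")
)
```

The gold sentence from the parallel corpus is always the training target, which is why the parallel data can control the quality of the pseudo-parallel pairs.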

Datasets
We consider multilingual UNMT for four languages: English (en), French (fr), Romanian (ro), and Chinese (zh). To compare the impact of the relationship between the chosen reference language and the considered language pairs on the UNMT performance, we constructed two language scenarios: English-French-Romanian (en-fr-ro) and English-Chinese-Romanian (en-zh-ro), where English-Romanian (en-ro) is the main language pair considered. French and Chinese are used as the reference languages, providing the parallel corpora of English-French (en-fr) and English-Chinese (en-zh), respectively, to aid the UNMT of English-Romanian. English and Romanian belong to the Indo-European language family, but English belongs to the Germanic branch, whereas Romanian and French belong to the Romance branch. French is selected to evaluate the effect of the reference language being in the homologous family. Chinese belongs to the Sino-Tibetan language family, which is distant from Romanian, and is selected to study a reference language from a different language family.
For English, French, and Romanian, we used the same monolingual sentences as those extracted from the WMT News Crawl datasets for the period 2007-2017 by Conneau and Lample (2019) for a fair comparison and limited the maximum number of sentences in each language to 50 million (M), which results in 50M, 50M, and 14M sentences, respectively. For Chinese, we combined all of the sentences available in the WMT News Crawl datasets with the source sentences from the WMT'17 Chinese-English translation task, leading to 26M sentences. For the parallel data of en-fr and en-zh introduced by the two experimental settings, we only use those provided by MultiUN (Ziemski et al., 2016). Finally, the size of the resulting language pair parallel dataset is about 10M.
In both scenarios, we evaluated each language pair except for en-fr and en-zh, for which the relevant parallel data was used for reference agreement. Following previous studies, newstest 2016 was used to evaluate the en-ro language pair. For fr-ro, we sampled 5K sentence pairs from OPUS (Tiedemann, 2012) for evaluation; in detail, we used GlobalVoices, OpenSubtitles (Lison and Tiedemann, 2016), and MultiParaCrawl. For zh-ro, we used Bible-uedin (Christodouloupoulos and Steedman, 2015), Tanzil, and the QCRI Educational Domain Corpus (QED) (Abdelali et al., 2014) for out-of-domain evaluation. Because these zh-ro parallel corpora cover only the religious and educational domains, which are far from the news domain of the training data, we also collected a zh-ro parallel corpus of 2K news sentence pairs for in-domain evaluation.
The Moses scripts (Koehn and Knowles, 2017) were used for tokenization of en, fr, and ro, and the jieba toolkit was used for word segmentation on zh. In particular, following Sennrich et al. (2016), we removed diacritics from ro. For zh, to avoid confusion between Hong Kong Standard Traditional Chinese (zh_hk: QED), Taiwan Standard Traditional Chinese (zh_tw: Bible-uedin), and Simplified Chinese (zh: Tanzil and monolingual training data), we used opencc to convert zh_hk and zh_tw to Simplified Chinese.
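The diacritic removal mentioned above can be approximated with the Python standard library. This is a minimal sketch, not the script used by the authors or by Sennrich et al. (2016): it assumes that Unicode decomposition followed by dropping combining marks is an acceptable stand-in, and the function name is hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD).

    Romanian letters such as ă, â, î, ș, ț (and the legacy cedilla forms
    ş, ţ) decompose into a base letter plus a combining mark of Unicode
    category "Mn", which is dropped here.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose whatever is left so the output is in a canonical form.
    return unicodedata.normalize("NFC", stripped)


example = strip_diacritics("Țară și învățământ")
```

Because this rule removes every combining mark, it would also strip accents from French text, so it should only be applied to the Romanian side of the data.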

Baselines
Our baseline models follow XLM (Conneau and Lample, 2019), with the following refinements. Conneau and Lample (2019) used masked language modeling (MLM) to pretrain the full model for the initialization step before applying a denoising autoencoder and BT training step. Therefore, we take the XLM architecture proposed by Conneau and Lample (2019) as our backbone baseline model.
MUNMT Our method studies the impact of adding a reference language to the existing UNMT language pair, which makes our model essentially multilingual. Therefore, MUNMT is the baseline for comparison. We adopt a multi-language joint vocabulary and training with a shared encoder and decoder for language model pre-training, denoising, and BT as the basis of our backbone, UNMT (XLM). Thus, with these settings, the MUNMT model can take advantage of multilingualism.
MUNMT + RNMT Furthermore, as we use a parallel corpus that exists between the reference language and the unsupervised translation language, for a fairer comparison, we consider adding a supervised neural machine translation between the source and reference language (RNMT) as an extra training step on the basis of MUNMT so that supervised and unsupervised training are performed jointly. This baseline is named MUNMT + RNMT. In all our baselines, the byte pair encoding (BPE) code size is set to 60K, and the model hyperparameters are consistent with those of XLM. The smoothing value in RAT is set to 0.1.

Main Results and Analysis
This section examines the effectiveness of the proposed RUNMT framework (code available at https://github.com/bcmi220/runmt). The main results are presented in Table 1. Row #4 reports the replicated results of the XLM architecture (Conneau and Lample, 2019) based on the training of each language pair individually. Our UNMT basically reproduces XLM's results, and it also makes some improvements over the original (probably because of differences in data sampling). Thus, our approach offers a strong baseline performance. Compared with the current state-of-the-art method MASS (Song et al., 2019), our baseline performance is slightly lower. This is because MASS adopts the new masked sequence-to-sequence pre-training method, and the improvement of our method is orthogonal to the pre-training improvement.
Notably, concurrent works (Liu et al., 2020; Bai et al., 2020; Garcia et al., 2020) also explore the use of auxiliary parallel data under the MUNMT setting, and all of these works share a similar multilingualism motivation. Because the parallel corpora used are inconsistent, the results are not directly comparable, so we do not include their results in the table.
For the MUNMT baseline, as shown in #5, the results are basically consistent with the UNMT results we replicated in #4, with some slight fluctuations, indicating the joint training of language pairs alone cannot make full use of multilingualism. Compared with MUNMT, MUNMT + RNMT (#10) is a very strong method for using an otherwise irrelevant corpus through a reference language.
As shown in Table 2, the performance (perplexity/accuracy) of joint pre-training on all languages is worse than that of pre-training on individual language pairs; however, for distant language pairs, adding a close reference language for joint pre-training will improve performance compared to pre-training on only the distant language pair. Therefore, in #5 and #10, the performance of en-ro in en-fr-ro and en-zh-ro is inconsistent in part due to pre-training. Similarly, comparing the performance of en-ro and zh-ro in UNMT and MUNMT, the performance of zh-ro in MUNMT is better than that in UNMT, indicating that transfer learning plays a role in joint training and the performance in en-ro worsens, indicating that joint training a close language pair with a distant language will result in a decline in its UNMT results.
The three specific approaches (RAT, RABT, and XBT) of the proposed RUNMT framework have achieved performance improvements over strong baselines, showing the effectiveness of our proposed approaches. Among them, RAT and RABT both use agreed-upon translations and their inputs to form pseudo-parallel data for training the model: RAT uses the noisy synthetic data as the target, while RABT uses the noisy synthetic data as the source. The results in #6 and #10 show that although RAT with a smoothing mechanism can improve the baselines' performance, the improved result is weaker than RABT in #7 and #12, which use the golden sentences as the target. Comparing RABT and XBT, the gap in performance is relatively small. XBT has a greater average improvement (#8 and #13), indicating that agreement across language pairs is more effective in MUNMT than agreement within a language pair. In addition, combining the three approaches by optimizing them one by one in an update step, with the results shown in #9 and #14, further improved the performance, indicating that the agreement across language pairs and internal agreement within a language pair are complementary.
In Table 1, we also report the results of different domains within zh-ro, where the in-domain results are significantly higher than the out-of-domain results, indicating that the domain problem is also important for UNMT. Our approaches have also obtained consistent improvements over different domains, further verifying the effectiveness of the method.

Comparison with Pivot Translation
To alleviate the lack of bilingual corpora, there are two solutions: the more recent one uses UNMT in an NMT framework, while the earlier one is pivot translation (usually in an SMT setting), in which the pivot language acts as a bridge creating a path from the source to the target language, i.e., S → P and P → T across parallel corpora. Our proposed RUNMT is similar to pivot translation, as both seek help from a third language when there is a lack of parallel corpora between the source language and target language. The difference is that our RUNMT requires only one parallel corpus between source and reference languages, while pivot translation requires two: between source and pivot languages, and between pivot and target languages.
In order to make a fairer comparison between the proposed RUNMT framework and the pivot translation framework, we conducted the following experiments on zh-ro translation, choosing en as the reference language (in RUNMT) or the pivot language (in PT). The two frameworks are evaluated in two settings: one in which only one parallel corpus (zh − en) is provided as claimed in RUNMT, and the other in which two parallel corpora (zh − en and en − ro) are provided as required in PT. Since adding a parallel corpus in our proposed RUNMT framework requires only adding additional training techniques without modifying the structure or training a new model, our RUNMT can also conveniently adapt to the setting of two parallel corpora. In order to adapt to the setting where only one parallel corpus is provided, the PT framework adopts a supervised NMT model for S and P (zh → en or en → zh in this experiment) and a UNMT model for P and T (en and ro). For the en − ro parallel corpus added in the two-corpus setting, since MultiUN does not contain this pair, we use the training set provided by WMT'16.
Table 3: Comparison of RUNMT with the traditional PT framework, where "→" represents the direction of translation, "−" represents supervised NMT with a parallel corpus, and "· · ·" represents UNMT with only monolingual data. Settings listed in the table: for zh → ro, zh − en · · · ro and zh − en − ro; for ro → zh, ro · · · en − zh and ro − en − zh.
The experimental results show that RUNMT is effective in not only the new case of only one parallel corpus provided, but also the traditional case of two parallel corpora provided, indicating that RUNMT generally makes better use of multilingualism. Additionally, it can be seen from the results that if the first pass of pivot translation is performed by a worse-performing model, error propagation will affect the overall performance, while direct translation in RUNMT will not be affected by this.

Ablation
Effects of Parallel Data Scale In order to analyze the influence of the scale of reference and source language parallel data on the performance of MUNMT and our proposed approaches, we compared the performance of en → ro on five different parallel corpus sizes: 1K, 10K, 100K, 1M, 10M together with UNMT baseline, and the results are shown in Figure 3.
It shows that although MUNMT + RNMT has been a very strong method for using an otherwise irrelevant corpus through a reference language compared to MUNMT, our proposed RUNMT framework can still improve on various parallel data scales, which verifies the generalization of our method. In addition, in the setting with low parallel data, RABT shows a better growth effect than XBT, and when the parallel resources reach a certain scale, XBT surpasses RABT, indicating that agreement training for cross-language pairs requires more parallel data than does agreement within a language pair. Furthermore, the effect of back-translation enhancement in all cases is better than that of forward-translation, which shows that the golden target is better than the silver target in UNMT. Finally, in low-resource settings, our methods have achieved a greater relative improvement, indicating that our methods mine the information of partially relevant parallel data to a greater extent for enhancing UNMT.
Analysis of Intermediate Translation Quality in BT To verify the problem of uncontrollable intermediate quality in back-translation, we perform experiments on the distant language pair zh-ro and report the results for the translation direction ro → zh. The reason for choosing zh-ro is that Chinese and Romanian characters can be directly distinguished by their Unicode encoding.
We define BT-BLEU as the BLEU between $s \in S$ and the $\hat{s}$ generated in the S → T → S back-translation process, and we introduce this metric in the evaluation phase. We also calculate the ratio of generated Chinese tokens (subwords) to the total number of generated tokens to indirectly reflect the intermediate quality of back-translation. The experimental results are shown in Table 4. The growth trend of BLEU is consistent with the downward trend of the ratio of Chinese tokens in the Romanian translations, and the two are notably correlated, indicating that this ratio can indeed reflect the training effect of the model to a certain extent. In addition, compared with MUNMT, our methods improve the quality of intermediate translations, bring BT-BLEU improvement, and reduce the proportion of Chinese tokens in Romanian translations, thus verifying the effectiveness of our methods.
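The Chinese-token ratio described above can be computed with a few lines of code. This is a sketch under assumptions: tokens are taken to be whitespace-separated, a token counts as Chinese if it contains any character in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), and the function names are hypothetical.

```python
from typing import Iterable


def contains_cjk(token: str) -> bool:
    """True if the token has a character in the basic CJK block (assumption)."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in token)


def chinese_token_ratio(sentences: Iterable[str]) -> float:
    """Share of whitespace-separated tokens that contain Chinese characters."""
    total = 0
    chinese = 0
    for sentence in sentences:
        for token in sentence.split():
            total += 1
            chinese += contains_cjk(token)
    # An empty input yields 0.0 rather than a division error.
    return chinese / total if total else 0.0


# Toy "Romanian" outputs contaminated with Chinese tokens.
ratio = chinese_token_ratio(["pisica 猫 doarme", "这 este bine 的"])
```

A lower ratio on intermediate Romanian translations means fewer source-language tokens leak into the output, which is the behavior the analysis above associates with better training.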

Related Work
With the development of deep neural networks (Zhang et al., 2019b,a), UNMT (Artetxe et al., 2017; Lample et al., 2018a,b; Conneau and Lample, 2019; Song et al., 2019) has attracted widespread attention in academic research, as only large-scale monolingual corpora are required for training. The performance of UNMT has benefited from language model pre-training, denoising autoencoders, and BT techniques between similar languages such as English and French, but still lags behind that of supervised NMT for distant languages such as Chinese and English. Conneau and Lample (2019) extended the generative language model pre-training approach to multiple languages and showed that cross-lingual pre-training could be effective for MUNMT. Aside from the convenience of translation among multiple language pairs, including unseen language pairs, transfer learning should be considered when low-resource languages are trained together with rich-resource ones. As discussed by Arivazhagan et al. (2019), MUNMT usually performs worse than pivot-based supervised NMT; however, the pivot-based method easily experiences a computationally expensive quadratic growth in the number of source languages and suffers from the error propagation problem. Arivazhagan et al. (2019) addressed the zero-shot generalization problem that some translation directions have not been optimized well due to a lack of parallel data. Al-Shedivat and Parikh (2019) introduced a consistent agreement-based training method that encourages the model to produce equivalent translations of parallel sentences in zero-shot translation, which shares similarities with our RAT approach. However, in terms of a specific implementation, because of the differences between UNMT and NMT, we have provided three new UNMT methods, and have alleviated the problem of uncontrollable intermediate BT quality in UNMT. Arivazhagan et al. (2019) addressed the issue of transfer learning between language pairs with parallel data where there is a lack of parallel corpora in multilingual supervised NMT.
As for the agreement in UNMT, Sun et al. (2019) investigated the enhancement of unsupervised bilingual word embedding agreement in UNMT training. Leng et al. (2019) proposed a multi-hop UNMT that automatically selects a good translation path for a distant language pair during UNMT. Baijun et al. (2019) proposed a cross-lingual pre-training approach that makes use of the source-pivot data to pre-train the language model.
As for multilingualism, Liu et al. (2020) proposed a multilingual denoising pre-training technique to improve machine translation tasks. Bai et al. (2020) and Garcia et al. (2020) both studied the agreement across language pairs. Their method is much the same as one of our proposed approaches, XBT, which relies on the supervision signals from a parallel corpus to build a bridge between language pairs in MUNMT. Compared with these two concurrent works, the other two of our proposed approaches, RAT and RABT, which use the internal agreement within a language pair to improve the translations, can be used not only for MUNMT, but also for semi-supervised NMT to enhance translation when only two languages are involved.

Conclusion
In this work, we capitalize on the supervised NMT and UNMT use of the pivot language in pivot translation. We propose the reference language-based UNMT framework, in which a reference agreement mechanism is introduced in several implementations to better leverage the reference agreement in parallel data brought by the reference language to reduce the uncontrollable intermediate quality problem in back-translation. The experimental results show that we achieved an improvement over our strong baseline, and our proposed RUNMT framework is compatible with and exceeds the traditional pivot translation framework.