Almost Free Semantic Draft for Neural Machine Translation

Translation quality can be improved by global information from the required target sentence because the decoder can then understand both past and future information. However, producing and considering such global information incurs additional cost. In this work, to inject global information while keeping the cost low, we present an efficient method that samples a semantic draft from a semantic space and considers it as global information for decoding at almost no cost. Unlike other successful adaptations, we do not have to perform an EM-like process that repeatedly samples a possible semantic from the semantic space. Empirical experiments show that the presented method achieves competitive performance on common language pairs with a clear advantage in inference efficiency. We will release all our source code on GitHub.


Introduction
Successful NMT (Neural Machine Translation) systems (Vaswani et al., 2017; Bahdanau et al., 2015; Johnson et al., 2017; Ng et al., 2019) translate sentences either left to right or right to left. However, this paradigm has one critical limitation: the decoder only has access to directional information (left-to-right or right-to-left) during auto-regressive decoding (Graves, 2013).
To alleviate this pain, there have been three successful lines of work. 1) Generative NMT: (Zheng et al., 2020; Shah and Barber, 2018; Su et al., 2018; Zhang et al., 2016; Eikema and Aziz, 2019) adapt the VAE (variational auto-encoder) (Altieri and Duvenaud, 2015; Kingma and Ba, 2015; Bowman et al., 2016) to NMT, training it in a generative setting that models the semantics of the source and target sentences in a latent space. 2) Deliberation: since the problem is caused by the one-pass decoding of the auto-regressive process, (Xia et al., 2017) present a framework that predicts a guess target sentence in the first pass and jointly considers the encoding and the guess target sentence in the second pass. 3) Soft-prototype: (Wang et al., 2019) present a framework that generates a prototype on the encoder side so that the decoder can jointly use the encoding and the prototype. Although empirical results show that these methods can successfully inject global information into the decoder, they either add computational complexity to the encoder-decoder architecture or employ an EM-like process at inference time, thus requiring even more than 100% additional time to produce and consider global information during inference.
In this work, following the line of generative NMT, we present an efficient method that samples a semantic draft as global information for decoding at almost no cost. Concretely, we sample the semantic draft from a semantic space that is a Gaussian inference model with learnable parameters. In the classic use of the semantic space, e.g., generative NMT, inference must run an EM-like process that can degrade inference efficiency significantly. To mitigate this degradation while still using the semantic space, we train the NMT encoder in a multilingual setting and simultaneously train a cross-lingual generator to obtain an approximation of the target-sentence semantic, modeling the required semantic space from this approximation and the source-sentence semantic. At inference time, based on the source-sentence semantic and the approximation made by the cross-lingual generator, the semantic draft can be sampled from the semantic space in a one-shot style. Once the semantic draft has been sampled, we aggregate it with the encoding so that the variational decoder can simply decompose the aggregation. We train the model in a generative setting with an additional KL-divergence loss that optimizes the semantic space, similar to generative NMT training (Zheng et al., 2020; Shah and Barber, 2018; Su et al., 2018; Zhang et al., 2016; Eikema and Aziz, 2019) and VAE training (Altieri and Duvenaud, 2015; Kingma and Ba, 2015; Bowman et al., 2016). Our work can build upon Transformer (Vaswani et al., 2017), LSTM/GRU (Hochreiter and Schmidhuber, 1997; Chung et al., 2014) and convolutional sequence models (Gehring et al., 2017). In this work, we use Transformer as an example to present our idea, evaluating our method on common translation tasks and in 5 further comprehensive experiments.
Our empirical study shows that, compared to previously successful methods, our method achieves competitive performance and has a clear advantage in inference efficiency. Since we do not change the architecture of the NMT model, our model is compatible with common techniques in NMT.

Background
Notation: x and y denote word embeddings in the source language L1 and the target language L2, respectively. X = (x_1, x_2, ..., x_N) ∈ R^{N×d} and Y = (y_1, y_2, ..., y_M) ∈ R^{M×d} are sentences sampled from corpora in L1 and L2 respectively, where N and M are the sequence lengths and d is the word embedding dimension. X and Y are parallel sentences used in our supervised training. The translation task X → Y is denoted as Y = Dec(Enc(X)), where Dec and Enc jointly constitute an encoder-decoder model. s and t represent the source-sentence semantic of X and the target-sentence semantic of Y, respectively. z is a latent variable representing a semantic draft, sampled from the semantic space.
NMT (Vaswani et al., 2017; Bahdanau et al., 2015; Johnson et al., 2017; Ng et al., 2019) utilizes seq2seq learning (Sutskever et al., 2014) and auto-regression (Graves, 2013) to facilitate training and inference. Concretely, the current translation y_j at time-step j is conditioned on Enc(X) and y_<j, the previous translations before j. The intrinsic problem arises because the translation y_j can only consider y_<j without considering y_>j. Intuitively, a semantic draft, i.e., global information covering both y_<j and y_>j, can benefit the translation y_j because the translation can then be consistent with neighboring information.
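The left-to-right conditioning described above can be sketched as a minimal greedy decoding loop; `decoder_step`, the token ids, and the scoring interface are hypothetical stand-ins, not the paper's implementation:

```python
def greedy_decode(enc_x, decoder_step, bos_id, eos_id, max_len=50):
    """Left-to-right autoregressive decoding: token y_j is chosen
    conditioned only on Enc(X) and the prefix y_<j; tokens y_>j are
    never visible, which is the limitation the semantic draft addresses.
    `decoder_step` is a hypothetical callable returning next-token scores."""
    ys = [bos_id]
    for _ in range(max_len):
        scores = decoder_step(enc_x, ys)  # scores over the vocabulary
        nxt = max(range(len(scores)), key=scores.__getitem__)
        ys.append(nxt)
        if nxt == eos_id:
            break
    return ys
```

The same loop underlies beam search; only the per-step selection changes.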
Some impressive methods have been proposed to produce and consider a draft providing global information for better translation quality. 1) Generative NMT (including variational NMT) (Shah and Barber, 2018; Zheng et al., 2020; Su et al., 2018; Zhang et al., 2016; Eikema and Aziz, 2019) studies a latent, continuous semantic space (Bowman et al., 2016) for NMT from which z can be sampled. These methods inject z into NMT to provide global information for better translation, and the encoder is encouraged to consider z. In this manner, generative NMT models the joint probability P_nmt(X, Y, z) = p(z)p(X|z)p_nmt(Y|X, z) in training. For inference, the model runs the EM-like process to maximize a lower bound on log p(X, Y) by repeatedly guessing or predicting a possible Y and re-sampling z. However, compared to NMT without z, generative NMT typically costs over 100% additional inference time. 2) Sharing the same idea of reconsidering the current translation, Deliberation (Xia et al., 2017) deliberates on the complete output of first-pass decoding as the attention context of second-pass decoding. With Deliberation, the final translation is based on an understanding of a possible translation in the target language. Although Deliberation does not employ the EM-like process and is therefore more efficient than generative NMT at inference time, the doubled pass increases the auto-regressive decoding time, costing 80% additional inference time. 3) (Wang et al., 2019) further consider inference efficiency and storage cost, proposing the Soft-prototype framework to use a prototype. The prototype is an approximation of the target sentence, Y' = (y'_0, ..., y'_i), produced by a probability generator R that accepts any x to generate a probability p(y') over the target vocabulary used to search for y'.
These successful methods, although using different settings and frameworks, share the same idea: inject a draft of the required target sentence to introduce global information to the decoder, so that the decoder can understand the target globally. Concretely, this idea can be formulated as the general framework Y = Dec(Enc(X), draft). However, these methods either introduce computational complexity to NMT (Wang et al., 2019; Xia et al., 2017) or employ the EM-like process, showing significant degradation in inference efficiency; e.g., GNMT (Shah and Barber, 2018) needs 110% additional inference time. Intuitively, a high-quality draft should satisfy two main requirements: 1) a good draft should include a global semantic for the target sentence; 2) a draft should not degrade inference efficiency significantly.

NMT with Semantic Draft
In this section, we present our framework and method. We then discuss how to train the model in generative settings and how to tackle optimization challenges in practice.

Framework
Inspired by previously successful models, we employ the general framework Y = Dec(Enc(X), draft) for our model, presenting the high-level architecture in Figure 1. Concretely, draft is instantiated as z, so the framework becomes Y = Dec(Enc(X), z). Since z is sampled from the semantic space, our decoder is a variational decoder (Altieri and Duvenaud, 2015; Kingma and Ba, 2015; Bowman et al., 2016).

Generative Semantic Draft
To obtain z, we leverage a generative process similar to that of GNMT (Shah and Barber, 2018), sampling z from the semantic space, a Gaussian inference model trained on s and t, or at the very least on approximations of s and t. Typically, s and t are obtained by modeling the semantics of X and Y with the same parameters.
Semantic for Source Sentence: s ∈ R^d is computed by averaging a set of vector representations. Specifically, we first pass X through the NMT encoder, obtaining Enc(X), and then compute s = (1/N) Σ_{k=1}^{N} Enc(X)_k.
Semantic for Target Sentence: We encourage the model to learn an approximation of t instead of the "ground-truth target semantic". We assume G(s) ≈ t, where G is a two-layer cross-lingual generator. In other words, we compute a dummy target-sentence semantic G(s) based on s. We discuss this assumption in §4 Multilingual Encoder and Cross-lingual Generator and explain how to train the cross-lingual generator G in §3.2 Encoder and Generator Tweaking.
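The mean-pooled sentence semantic and the two-layer generator G can be sketched as follows; the hidden size, weight shapes, and tanh activation are assumptions, since the paper only states that G has two layers:

```python
import numpy as np

def sentence_semantic(enc_x):
    """s = (1/N) * sum_k Enc(X)_k : mean-pool the encoder states
    (shape [N, d]) into a single d-dimensional sentence semantic."""
    return enc_x.mean(axis=0)

def cross_lingual_generator(s, w1, b1, w2, b2):
    """Minimal sketch of the two-layer generator G. The weight shapes
    ([d, d_h], [d_h], [d_h, d], [d]) and the tanh non-linearity are
    assumptions; the paper does not specify the activation."""
    h = np.tanh(s @ w1 + b1)
    return h @ w2 + b2
```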
Semantic Space: Typically, a Gaussian inference model is used as the semantic space, representing a variational distribution q_z(z|s, t) for sampling (Shah and Barber, 2018; Zhang et al., 2016; Zheng et al., 2020); it serves as an approximate posterior. Instead of q_z(z|s, t), our model uses q_z(z|s, G(s)) as the required semantic space, because G(s) is encouraged to approximate t. Specifically, we concatenate s and G(s) to compute the mean and variance of the diagonal Gaussian.
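A minimal sketch of this sampling step, assuming linear projections to the mean and log-variance and the usual reparameterization trick (the projection parameterization is our assumption; the paper only states that the mean and variance are computed from the concatenation):

```python
import numpy as np

def sample_draft(s, g_s, w_mu, b_mu, w_logvar, b_logvar, rng):
    """Sketch of q_z(z | s, G(s)): concatenate s and G(s), project to
    the mean and log-variance of a diagonal Gaussian, and sample z
    with the reparameterization trick."""
    h = np.concatenate([s, g_s])          # [2d]
    mu = h @ w_mu + b_mu                  # mean of the diagonal Gaussian
    logvar = h @ w_logvar + b_logvar      # log-variance (diagonal)
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps   # z = mu + sigma * eps
    return z, mu, logvar
```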

Decoding with Draft
As aforementioned, z is sampled from the semantic space q_z(z|s, G(s)). We then aggregate z and Enc(X) and pass the aggregation to the decoder for decoding. In other words, we add generative context to the encoding for the encoder-decoder attention in the decoder. The decoder is therefore a variational decoder conditioned on z and X.

Training
NMT Training: To train the parameters of both the NMT model and the semantic space in a generative setting, we follow the training strategy of previous works (Bowman et al., 2016; Zhang et al., 2016), using SGVB (stochastic gradient variational Bayes) (Kingma and Welling, 2014; Rezende et al., 2014) to perform approximate maximum likelihood estimation, where λ weighs the KL-divergence term and the prior is p(z) = N(0, I).
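A plausible form of this SGVB objective, assuming the standard evidence lower bound with the stated λ-weighted KL term and prior (the exact equation is our reconstruction, not quoted from the paper):

```latex
\mathcal{L}(\theta,\phi)
  = \mathbb{E}_{q_z(z\mid s,\,G(s))}\!\left[\log p_{\theta}(Y \mid X, z)\right]
  - \lambda \, D_{\mathrm{KL}}\!\left(q_z(z \mid s, G(s)) \,\|\, p(z)\right),
\qquad p(z) = \mathcal{N}(0, I)
```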
Encoder and Generator Tweaking: Intuitively, the semantic space should consider the semantics shared between s and t. Ideally, s and t should be obtained from a shared model by processing X and Y, as discussed in generative NMT (Shah and Barber, 2018; Zheng et al., 2020; Eikema and Aziz, 2019). Inspired by this idea, we use the same NMT encoder to compute Enc(Y), obtaining the "ground-truth target semantic" t = (1/M) Σ_{k=1}^{M} Enc(Y)_k ∈ R^d. As aforementioned, and unlike generative NMT, we do not directly use t for our semantic space. Instead, we only use t to enforce and regularize G(s) in training. Concretely, we train the cross-lingual generator G to restore t from s so that G(s) ≈ t.

Figure 1: High-level view of NMT with a semantic draft. Note that the "dotted line" is only used in training. s and t represent the sentence semantics. The semantic draft z is sampled from the semantic space, a parameterized space modeling the Gaussian inference distribution q_z(z|s, G(s)), where G is our cross-lingual generator. σ linearly increases over the course of training so that the model learns to predict without t. cos denotes the similarity between G(s) and t, through which we encourage G(s) ≈ t. The variational decoder decomposes the sum of a draft and the encoding.

Inferring with Almost Free Draft
Costly Draft: In traditional generative NMT, the inference mode, i.e., the translation-generating process, makes an initial guess z_init from the semantic space, i.e., the variational distribution q_z(z|s, t_random), where s is computed from X and t_random is obtained from a random Y_random. It can then generate a possible translation Y' and its semantic t'. To obtain a good translation, the inference mode re-samples a better semantic from the semantic space based on the last translation and regenerates a new translation, maximizing a lower bound on log p(X, Y) in the EM-like process. Readers can refer to Algorithm 1 in GNMT (Shah and Barber, 2018) for more details.
Almost Free Draft: Unlike traditional generative NMT, we need neither an initial guess nor the EM-like process to sample z at inference time, which improves inference efficiency. In our model, G(s), the dummy target semantic, plays the prominent role: it aims to approximate t, replacing the initial guess. Therefore, we do not have to make an initial guess, and we can also eliminate the whole EM-like process because z is not randomly sampled, resulting in one-shot sampling. Since G is a simple generator, sampling z from q_z(z|s, G(s)) does not hurt inference efficiency significantly and is almost free of cost.
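The one-shot inference path can be summarized schematically; every callable here (`encoder`, `generator_g`, `sample_z`, `decode`) is a hypothetical stand-in, and the additive aggregation follows the description above:

```python
def one_shot_translate(x, encoder, generator_g, sample_z, decode):
    """One-shot inference sketch: unlike the EM-like process, the draft
    is sampled exactly once from q_z(z | s, G(s)) and decoding runs a
    single pass. All callables are hypothetical stand-ins."""
    enc_x = encoder(x)          # [N, d] encoder states
    s = enc_x.mean(axis=0)      # source-sentence semantic
    g_s = generator_g(s)        # dummy target semantic, G(s) ~ t
    z = sample_z(s, g_s)        # one-shot draft, no EM iterations
    return decode(enc_x + z)    # decoder consumes the aggregation
```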

Multilingual Encoder and Cross-lingual Generator
Approximation of t: In the Encoder and Generator Tweaking operation, we jointly train the encoder and the cross-lingual generator G to make G(s) and t as similar as possible. Since we feed parallel sentences to the encoder, the encoder is encouraged to learn multilingual properties. Specifically, we notice that s ≈ t potentially, as studied and reported empirically in previous work on multilingual BERT (Devlin et al., 2019; Karthikeyan et al., 2020; Wu and Dredze, 2019). Meanwhile, Soft-prototype (Wang et al., 2019) and multilingual NMT (Wu et al., 2016; Johnson et al., 2017) also explore this aspect in the NMT scenario. We further introduce the cross-lingual generator G to tweak/fine-tune this property, observing significant benefits from the regularization. Most importantly, with the cross-lingual generator G, the model can cheaply obtain a dummy t via G(s), so that the semantic draft can be sampled in a one-shot generative style without the EM-like process.
Potential of s and G(s): Besides, we are aware that injecting only s or G(s), without passing through the semantic space, may also provide global information or the shared semantic for decoding, because s ≈ t and G(s) ≈ t potentially. We present an ablation study in §6.5 Necessity of Semantic Space and Multilingual Encoder to show the significance of G, the semantic space, and their combination.

Semantic in Encoder and Decoder
On the other hand, whereas generative NMT employs an auxiliary network, fed with the parallel sentences, to help the semantic space, our method simply passes the parallel sentences through the NMT encoder, which plays the role of that auxiliary network. In this way, there is no need to pass z to the encoder to model the joint probability P_nmt(X, Y, z) = p(z)p(X|z)p_nmt(Y|X, z). Specifically, as discussed in the VAE literature (Altieri and Duvenaud, 2015; Kingma and Ba, 2015; Bowman et al., 2016; Zhang et al., 2016), if z is involved in the encoding process, z can guide and regularize the encoder to consider the shared semantic. Therefore, generative NMT models the joint probability in training, encouraging both the encoder and the decoder to consider z. In our model, however, we let the multilingual encoder consider the implicitly shared semantic itself, and we inject z only into the decoder, which is encouraged to consider the shared semantic.

Figure 2: Comparison between our model and previous models. The "dotted line" indicates the flow of global information. Z denotes the Gaussian semantic space. Net denotes an auxiliary network. G is a generator. 1st denotes first-pass decoding. soft denotes soft-prototype.

Comparison
In Figure 2, we compare our framework with previous successful models: GNMT (Shah and Barber, 2018), Deliberation (Xia et al., 2017) and Soft-prototype (Wang et al., 2019). We observe some significant differences from the perspective of our design: • vs GNMT 1) The semantic space is built upon the multilingual encoder and the cross-lingual generator in our model; 2) the semantic/global information is only used in the decoder.
• vs Deliberation The global information comes from the semantic space instead of first-pass decoding.
• vs Soft-prototype The global information is sampled from the semantic space instead of target prototypes.
• Additionally, we note an optimization alternative to the EM-like process. (Eikema and Aziz, 2019) study an approximation method that maximizes the lower bound on log p(X, Y) by employing an auxiliary distribution using only the source s, which boosts inference efficiency with a single call (without the EM-like process) to an argmax solver. Compared to their work, our model has three major differences: 1) our model depends on both s and G(s); 2) an auxiliary distribution is not necessary in our model; 3) we focus on the process of draft generation.

Optimization Challenges
Collapse of D_KL: (Bowman et al., 2016) report the collapse of the D_KL term in the objective function Eq. 3. Following the instructions of (Bowman et al., 2016; Shah and Barber, 2018), we apply two common strategies: 1) λ linearly increases from 0 to 1 over the initial 50k training steps; 2) we randomly drop a constant 30% of words when encoding X.
Warm-up of Generator: Training is somewhat tricky when using the cross-lingual generator G. We apply a weight σ ∈ [0, 1] to G(s) and a weight 1 − σ to t, as presented in Figure 1. σ linearly increases from 0 to 1 over 50k steps after λ reaches 1. With this strategy, the semantic space is encouraged to rely on t during warm-up; significantly, this avoids cos(G(s), t) being close to 0 at the beginning of training. After warm-up, i.e., once G(s) ≈ t, we use G(s) for the rest of training.
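The two warm-up schedules can be sketched as simple step functions; we take the text literally (linear increase over 50k steps, with σ starting only after λ = 1), so the boundary handling is an assumption:

```python
def kl_weight(step, warmup_steps=50_000):
    """lambda: linearly increases from 0 to 1 over the first 50k steps."""
    return min(step / warmup_steps, 1.0)

def generator_weight(step, kl_warmup=50_000, g_warmup=50_000):
    """sigma: stays at 0 until lambda reaches 1, then linearly increases
    from 0 to 1 over the next 50k steps. During this warm-up the semantic
    space sees the mixture sigma * G(s) + (1 - sigma) * t."""
    return min(max(step - kl_warmup, 0) / g_warmup, 1.0)
```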

Dataset
To be comparable, we train our model on the language pairs {French, German} ↔ English and the relatively low-resource language pair Romanian ↔ English, which are commonly used in previous work (Shah and Barber, 2018; Vaswani et al., 2017; Bahdanau et al., 2015; Zheng et al., 2020). Concretely, we download parallel corpora for {French, German} ↔ English from WMT 2014 (Bojar et al., 2014). For Romanian ↔ English, we retrieve parallel corpora from WMT 2016 (Bojar et al., 2016). Preprocessing is simple in our case: we only remove sentences longer than 50 words from the training datasets. Following standard evaluation, the model is evaluated on newstest2014 for {French, German} ↔ English and newstest2016 for Romanian ↔ English. Case-sensitive BLEU computed by multi-bleu.perl is reported as the performance metric. We employ beam search with beam size 4 and length penalty 0.6.

Model Settings
We implement the presented model in TensorFlow 2.0 (Abadi et al., 2016). To be comparable with other models and baselines, the NMT settings are identical to big-Transformer (Vaswani et al., 2017). Specifically, we set the model dimension, word embedding dimension, number of heads, encoder layers, decoder layers and FFN filter size to 1024, 1024, 16, 6, 6 and 4096, respectively. The Adam optimizer (Kingma and Ba, 2015) is employed with parameters β1 = 0.9, β2 = 0.98 and ε = 10^-9. We use a dynamic learning rate over the course of NMT training (Smith, 2017; Vaswani et al., 2017). The dropout rate is set to 0.1, and label smoothing is used with γ = 0.1 (Mezzini, 2018). Parallel corpora for each translation task (e.g., Romanian ↔ English) are concatenated to train BPE (Sennrich et al., 2016b) with a balancing strategy (Lample and Conneau, 2019), forming a shared vocabulary of 40,000 subtokens. For data-feeding efficiency, each mini-batch of similar-length sentences is padded to the same length; mini-batches may contain different numbers of elements, such that batch_size × padded_length ≤ 3000.
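The cited dynamic learning rate is presumably the schedule of Vaswani et al. (2017); a sketch follows, with the warmup of 4000 steps as an assumption since the paper does not state its value:

```python
def transformer_lr(step, d_model=1024, warmup=4000):
    """The dynamic learning-rate schedule of Vaswani et al. (2017):
    lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5).
    It rises linearly for `warmup` steps, then decays as step^-0.5.
    The warmup value of 4000 is an assumption, not from the paper."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```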

Reimplementation and Reconfiguration
To be fair, we reimplement some models on our machine with the same mini-batch size. We compare the reimplemented results with the reported results on the same test set to ensure the difference is less than 5% (or 1.5 BLEU). Having confirmed this, we can trust the reimplementation and reconfiguration.

Translation Task
We study methods for producing and considering global information for NMT. Since we have discussed three successful directions, we compare our method with the baselines of Transformer (Vaswani et al., 2017), generative NMT including GNMT (Shah and Barber, 2018) and Mirror-GNMT (Zheng et al., 2020), Deliberation (Xia et al., 2017) and Soft-prototype (Wang et al., 2019). Meanwhile, we have introduced some additional parameters to the model, as have the comparable models; we therefore evaluate not only performance but also inference efficiency. The comparison of inference efficiency is based on the inference speed of the vanilla big-Transformer. Besides, we reconfigure Mirror-GNMT and GNMT to big-Transformer settings, and we additionally reimplement Soft-prototype on English → Romanian. Table 1 presents the performance of our model and the baselines. We summarize the results as follows: • Competitive Translation Quality Our method outperforms the baselines of big-Transformer and GNMT on all the language pairs. Compared to state-of-the-art models, our model achieves competitive performance on all the language pairs.

• Clear Advantage in Inference Efficiency
Besides competitive performance on all the language pairs, our model has a clear advantage in inference efficiency. Specifically, GNMT, Mirror-GNMT and Deliberation add computational complexity to the decoder, needing more than 1 iteration to produce a translation (at least +80% additional time), and Soft-prototype increases the computational complexity on the encoder side (+34% additional time). In contrast, our method only introduces a generator, so the computational complexity of the encoder and the decoder is the same as in vanilla big-Transformer, resulting in efficient inference and an almost free draft (only +5% additional time).
• Improvement from EM-like process In the last row, we report a result obtained by employing the EM-like process with our model. Although there is noticeable room for improvement, it degrades inference efficiency significantly, so we do not recommend this combination. We discuss this result and integration in §6.2 Drafting with EM-like process.

Table 1: Performance of our method. Our method is competitive on translation quality and has a clear advantage in inference efficiency. DM baseline: discriminative model baseline. GM baseline: generative model baseline.

Drafting with EM-like process
Throughout most of this work, we sample z from q_z(z|s, G(s)) in a one-shot generative style for the sake of inference efficiency, and the previous evaluation shows that this idea is feasible. Meanwhile, our model shares some properties with generative NMT, which makes us interested in integrating the EM-like process purely for the sake of the best translation quality.
In this scenario, we translate X in two steps: 1. We sample a semantic draft z from q_z(z|s, G(s)) and obtain a possible translation Y'.
2. We then sample a new semantic draft z' from q_z(z|s, t') to predict a possible new translation, where t' is the semantic of Y'. The second step can be repeated to maximize a lower bound on log p_nmt(Y|X). We observe some improvement from employing the EM-like process, reporting the result in the last row of Table 1, where we achieve the best performance on all the language pairs. However, most significantly, the translation converges in 2 to 3 iterations, which increases the inference time by 137% (from 1.05× to 2.42×). Concretely, the model needs to re-encode the last translation to obtain a new draft and re-decode with the new draft to generate a new translation, e.g., re-encode Y' to obtain Enc(Y') and its t', re-sample the draft z' from q_z(z|s, t'), and re-decode the aggregation of Enc(X) and z'. Thus, we suggest the one-shot generative style in practice.
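The two-step procedure above can be sketched as a refinement loop; all callables are hypothetical stand-ins, and the fixed iteration count (instead of a convergence check) is a simplification:

```python
def em_refine(x, encoder, generator_g, sample_z, decode, iters=3):
    """Sketch of the optional EM-like refinement: after the one-shot
    first pass using G(s), the previous translation is re-encoded to
    obtain t', a new draft z' is sampled from q_z(z | s, t'), and
    decoding is repeated. All callables are hypothetical stand-ins."""
    enc_x = encoder(x)
    s = enc_x.mean(axis=0)
    t_approx = generator_g(s)               # first pass uses G(s)
    y = None
    for _ in range(iters):
        z = sample_z(s, t_approx)
        y = decode(enc_x + z)
        t_approx = encoder(y).mean(axis=0)  # re-encode Y' to get t'
    return y
```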
Additionally, we note that in this case the improvement may come not only from the re-sampled draft but also from adapting two ideas: 1) "double encoding" from Soft-prototype (Wang et al., 2019), because we encode the previously completed translation/prototype for the next translation; 2) "double decoding" from Deliberation (Xia et al., 2017), because we make more than one complete translation. We justify the significance of the draft in §6.3 Test for Draft and §6.4 Draft Reliance Test.

Test for Draft
We are interested in whether the draft indeed provides useful semantics/global information. In the last section, the improvement from the EM-like process can intuitively show the effect of the draft, because a better-quality draft re-sampled from the last translation continuously improves performance; however, that improvement might also come only from "double encoding" and "double decoding". Therefore, we conduct a test to demonstrate that the generative draft learns the desired semantics.
In this test, we use the same missing-word translation task as GNMT (Shah and Barber, 2018). Concretely, the model is forced to produce a translation relying heavily on the draft. We share the same setting in which each word has a 30% chance of being missing, independently. Note that we do not conduct this experiment for Deliberation (Xia et al., 2017) and Soft-prototype (Wang et al., 2019) because such discriminative models do not sample semantics from a semantic space. Table 2 shows the test results for models trained on German ↔ English and tested on newstest2014. We observe that our model outperforms GNMT and achieves performance competitive with Mirror-GNMT (Zheng et al., 2020). Specifically, compared to GNMT, our method trains a multilingual encoder and a cross-lingual generator to encourage shared semantics in the semantic space. Compared to Mirror-GNMT, which gains its improvement from the simultaneously used LM (language model) and the back-translation technique (Sennrich et al., 2016a), our model is not integrated with an LM to counter noisy input, so Mirror-GNMT attains slightly better performance. We leave the integration with denoising language modeling (Vincent, 2010) for future experiments.
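The missing-word corruption can be sketched as independent per-word dropping; the `<unk>` replacement marker is an assumption about how missing words are fed to the model:

```python
import random

def drop_words(tokens, p=0.3, unk="<unk>", seed=None):
    """Missing-word test input: each word is independently replaced
    with an unknown marker with probability p (30% in the paper),
    forcing the decoder to lean on the semantic draft."""
    rng = random.Random(seed)
    return [unk if rng.random() < p else tok for tok in tokens]
```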

Draft Reliance Test
We have demonstrated that the semantic draft is useful for the translation task. We further investigate how much the model relies on the semantic draft.
Since the objective function Eq. 3 is the same as in GNMT (Shah and Barber, 2018) and Mirror-GNMT (Zheng et al., 2020), we report a comparison of the term D_KL(q_z(z|X, Y) || p(z)), presenting the result in Table 3. The test is conducted with models trained on German ↔ English and tested on newstest2014 by averaging the value of D_KL(q_z(z|X, Y) || p(z)). Our method relies on the semantic draft (i.e., the latent variable from the semantic space) more heavily than GNMT does; with the EM-like process, the reliance is higher than that of Mirror-GNMT.

Necessity of Semantic Space and G
Although the semantic draft does indeed provide useful global information (§6.3 Test for Draft and §6.4 Draft Reliance Test), we still question the necessity of the semantic space, because G(s) ≈ t and s ≈ t. In other words, we could simply pass G(s) or s to the decoder, which could potentially provide global information for decoding. To justify the design, we train the model on German ↔ English, test on newstest2014, and compare 4 different types of draft within the framework Dec(Enc(X), draft): • We use our full-packaged model with draft = z, where z comes from q_z(z|s, G(s)).
• draft = G(s) is used for translation to test the significance of the semantic space.
• To test the significance of G, we set draft = z', where z' comes from q_z(z'|s).
• We test the significance of both G and the semantic space by setting draft = s for translation.
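The four ablation settings can be summarized schematically; `sample_from` is a hypothetical sampler standing in for the Gaussian semantic space, and everything here is a schematic stand-in rather than the experimental code:

```python
def make_draft(variant, s, g_s, sample_from):
    """Select the draft used by Dec(Enc(X), draft) in the ablation.
    `sample_from` is a hypothetical sampler for the indicated Gaussian."""
    if variant == "full":      # z ~ q_z(z | s, G(s)): full model
        return sample_from(s, g_s)
    if variant == "no_space":  # draft = G(s): no semantic space
        return g_s
    if variant == "no_G":      # z' ~ q_z(z' | s): no generator
        return sample_from(s, None)
    if variant == "plain_s":   # draft = s: neither G nor the space
        return s
    raise ValueError(variant)
```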
Apart from the difference in draft, all other configurations are the same for this test. We report the results in Table 4; our observations are: • According to "row 2 vs row 4", we can see the significance of the cross-lingual generator G.
• "row 3 vs row 4" indicates the significance of the semantic space.
• Focusing on "row 2 vs row 3", G improves general translation performance (columns 2 & 4), and the semantic space improves noisy translation (columns 3 & 5). We intuitively conclude that the semantic space and the cross-lingual generator G can further smooth and regularize the semantic for decoding, similar to what is found in GNMT (Shah and Barber, 2018) and (Bowman et al., 2016). Moreover, the cross-lingual generator G can only restore a coarse semantic, so the model cannot rely only on G(s) to maintain translation quality in the missing-word translation task.

Improvement from Non-parallel Data
We have mentioned the multilingual property of the encoder in our design, using the NMT encoder to process both X and Y. As reported for multilingual BERT (Devlin et al., 2019; Karthikeyan et al., 2020; Wu and Dredze, 2019), sharing an encoder for non-parallel sentences in different languages can still build a shared semantic space implicitly. This leads us to experiment with jointly training the encoder with the multilingual BERT objective. We train on the relatively low-resource language pair Romanian ↔ English, using additional monolingual data (News Crawl articles 2015 from WMT 2016) to jointly train the multilingual encoder with the multilingual BERT objective.

Table 5: Performance of training with additional non-parallel data. The performance of our method is competitive and is significantly improved by non-parallel data.
In Table 5, we report competitive results; performance is significantly improved by simultaneously using non-parallel data. Note that, when training on non-parallel data, we could also pre-train the multilingual encoder with the BERT objective instead of joint training. We leave this idea for further experiments.

Conclusion and Future Work
Translation quality can be further improved by global information from the target sentence. Although there have been three feasible lines of solutions, successful methods do not consider inference efficiency carefully, which leads to high inference cost. In this work, we present a method/framework to improve the performance of NMT: we sample a semantic draft from a semantic space so that the decoder can consider the draft to obtain the required global information with high inference efficiency. Our empirical study shows that, compared to previously successful methods, our method achieves competitive performance and has a clear advantage in inference efficiency. Since we do not change the architecture of the NMT model, our model can be further improved by employing pre-training (Lample and Conneau, 2019; Devlin et al., 2019; Radford et al., 2018), back-translation (Sennrich et al., 2016a) and other fine-tuning methods with non-parallel data. Our model can also be used in unsupervised NMT (Artetxe et al., 2018; Lample et al., 2018). We leave all these experiments for future work.