Synchronously Generating Two Languages with Interactive Decoding

In this paper, we introduce a novel interactive approach that translates a source sentence into two different target languages simultaneously and interactively. Specifically, the generation of one language relies not only on its own previously generated outputs, but also on the outputs predicted so far in the other language. Experimental results on IWSLT and WMT datasets demonstrate that our method obtains significant improvements over both the conventional Neural Machine Translation (NMT) model and the multilingual NMT model.

Although multilingual NMT attempts to utilize the complementary information of different languages (Lu et al., 2018; Neubig and Hu, 2018; Platanios et al., 2018), all of these models handle one language pair at a time for both training and testing. However, we find that the generation processes of different target languages can help each other. For example, in Figure 1, when decoding the Chinese word "书" (meaning "book") at step t = 5, the Japanese word "本" with the same meaning, already predicted at step t = 4, can provide helpful context. The reason is that the sentence structure of the two languages differs: it is subject-verb-object in Chinese, while it is subject-object-verb in Japanese. Moreover, we find that the two languages are complementary, and if the decoders of the two languages can interact with each other, translation quality will be improved.

Figure 1: An example of an English sentence translated into Chinese and Japanese sentences, in which the two targets can interact with each other.
In this work, we present a novel interactive decoding algorithm to generate two target languages simultaneously and interactively. To this end, we propose a synchronous attention model, in which the generation of one language can attend to the already generated outputs of the other language. As shown in Figure 1, the two decoders predict their outputs at the same time. At each moment, the word prediction of each language relies not only on its own previously generated targets but also on the outputs of the other language.
We conduct extensive experiments to verify the effectiveness of our proposed approaches on English-to-German/French and English-to-Chinese/Japanese translation tasks.
Our contributions in this work are two-fold: (1) We propose a novel synchronous translation model that can predict outputs of two different languages simultaneously and interactively, which can enhance the translation quality of both languages.
(2) Extensive experiments show the superiority of our proposed method. Specifically, this synchronous approach can significantly outperform both the conventional NMT model and the multilingual NMT model.

Background
Owing to its powerful modeling ability, our synchronous method builds on the Transformer architecture (Vaswani et al., 2017), which is based entirely on the attention mechanism detailed below.
Scaled Dot-Product Attention: The inputs of the attention mechanism contain queries Q, keys K, and values V. This function maps a query and a set of key-value pairs to an output, where the output is computed as a weighted sum of the values:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V \tag{1}$$

where d_k is the dimension of the keys, and Q, K, V are obtained by linearly transforming the input hidden states with projection matrices.
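For concreteness, a minimal NumPy sketch of scaled dot-product attention (Eq. 1) is shown below; the function and variable names are ours, not taken from any particular implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v); returns (n_q, d_v)."""
    d_k = Q.shape[-1]
    # Similarity scores between queries and keys, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the key dimension yields the attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # The output is the weighted sum of the values.
    return weights @ V
```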

Synchronous Translation Method
As discussed in Sec. 1, outputs in different languages can be complementary and can help each other. Thus, it is reasonable to improve translation quality through the interaction of the two decoders.
In this section, we propose an interactive decoding algorithm and then describe how to implement it with a new attention mechanism, named the synchronous attention model.

Interactive Decoding Algorithm
The interactive decoding algorithm generates translations in the two target languages within the same beam. At each step, each half of the beam produces translations in one target language, conditioned on the source sentence and the previously predicted tokens in both target languages. Here, we use two decoders with separate embeddings and softmax layers to handle the two languages. The two decoders predict each token in parallel and keep interacting with each other, which can be formalized as follows:

$$P(y^1, y^2 \mid x) = \prod_{i} P(y^1_i \mid y^1_{<i}, y^2_{<i}, x) \cdot P(y^2_i \mid y^2_{<i}, y^1_{<i}, x) \tag{2}$$

where x is the source sentence and y^1, y^2 are the target sentences in the two different languages. At time-step i, we have generated the first i − 1 tokens y^1_{<i} of language-1 and the first i − 1 tokens y^2_{<i} of language-2. The predictions in both languages are then utilized together with the source sentence to generate the tokens y^1_i and y^2_i. This interaction between the two languages is realized by the synchronous attention model, which is detailed in the following subsection.
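To make the interaction concrete, the following is a simplified greedy sketch of one interactive decoding loop (beam search and batching are omitted; decoder1, decoder2, and the predict_next interface are hypothetical stand-ins for the two decoders described above):

```python
def interactive_decode(x, decoder1, decoder2, max_len, bos_id, eos_id):
    """Greedy sketch of Eq. (2): at each step, each decoder conditions
    on the source x and on the prefixes of BOTH target languages."""
    y1, y2 = [bos_id], [bos_id]
    for _ in range(max_len):
        # Each decoder sees its own prefix and the other language's prefix.
        probs1 = decoder1.predict_next(x, own_prefix=y1, other_prefix=y2)
        probs2 = decoder2.predict_next(x, own_prefix=y2, other_prefix=y1)
        # The two step-i tokens are produced in parallel.
        y1.append(int(probs1.argmax()))
        y2.append(int(probs2.argmax()))
        if y1[-1] == eos_id and y2[-1] == eos_id:
            break
    return y1, y2
```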
It should be noted that the two target sentences can be generated in different directions (Liu et al., 2016), which means language-1 can be produced in a left-to-right (L2R) manner while language-2 is produced in a right-to-left (R2L) manner. We analyze the effect of different decoding manners in Sec. 5.2.

Synchronous Attention Model
The synchronous attention model (SyncAtt) is shown in Figure 2, in which the inputs of the two decoders contain separate queries (Q_1, Q_2), keys (K_1, K_2), and values (V_1, V_2). The new hidden states H_1 and H_2 are computed by our proposed synchronous attention as follows:

$$H_1 = \mathrm{SyncAtt}(Q_1, K_1, V_1, K_2, V_2), \qquad H_2 = \mathrm{SyncAtt}(Q_2, K_2, V_2, K_1, V_1) \tag{3}$$

where the synchronous attention (SyncAtt) combines the attention over a decoder's own history with the attention over the other decoder's outputs:

$$\mathrm{SyncAtt}(Q_1, K_1, V_1, K_2, V_2) = (1 - \lambda)\,\mathrm{Attention}(Q_1, K_1, V_1) + \lambda\,\mathrm{Attention}(Q_1, K_2, V_2) \tag{4}$$

where λ is a balance weight between the hidden states of the two decoders, tuned on the development set.
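A minimal sketch of this combination, reusing the scaled_dot_product_attention function from Sec. 2, is given below; it reflects our reading of Eqs. (3)-(4) rather than the authors' actual code:

```python
def sync_attention(Q1, K1, V1, K2, V2, lam=0.1):
    # Attend to the decoder's own previously generated hidden states...
    own = scaled_dot_product_attention(Q1, K1, V1)
    # ...and to the other decoder's already generated hidden states.
    other = scaled_dot_product_attention(Q1, K2, V2)
    # lam balances the two context sources; lam = 0.1 performed best on
    # the development sets in our experiments (see Sec. 4.2).
    return (1 - lam) * own + lam * other
```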
We replace the self-attention sub-layer in the Transformer decoder with our synchronous attention model, keeping the residual connections (He et al., 2016) around each sub-layer, followed by layer normalization (Ba et al., 2016).

Training
Since our synchronous method decodes two languages at the same time, the different decoders can be optimized simultaneously.
Supposing we have a trilingual dataset D = {(x, y^1, y^2)}, the objective function is to maximize the log-likelihood over the two target sentences:

$$L(\theta) = \sum_{(x, y^1, y^2) \in D} \sum_{i} \left( \log P(y^1_i \mid y^1_{<i}, y^2_{<i}, x) + \log P(y^2_i \mid y^2_{<i}, y^1_{<i}, x) \right) \tag{5}$$

When calculating P(y^1_i | ·), in addition to the source-side context x, our synchronous method conditions not only on the previous reference tokens y^1_{<i} of its own language but also on the previous reference tokens y^2_{<i} of the other decoder. The calculation of P(y^2_i | ·) is analogous.
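As an illustration, a short PyTorch-style sketch of this joint objective follows (the tensor shapes and helper names are assumptions for exposition, not the paper's implementation):

```python
import torch.nn.functional as F

def joint_nll_loss(logits1, logits2, ref1, ref2, pad_id):
    """Negative log-likelihood summed over both target sentences.
    logits*: (batch, length, vocab); ref*: (batch, length)."""
    # cross_entropy expects (batch, vocab, length) for sequence inputs.
    loss1 = F.cross_entropy(logits1.transpose(1, 2), ref1, ignore_index=pad_id)
    loss2 = F.cross_entropy(logits2.transpose(1, 2), ref2, ignore_index=pad_id)
    # Minimizing this sum maximizes the log-likelihood in Eq. (5).
    return loss1 + loss2
```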
However, in practice such triple data is limited and hard to collect. In this work, we construct the trilingual training corpus with a data augmentation method (Sennrich et al., 2016a; Zhang and Zong, 2016). To achieve this, we first train two independent translation models, Model-1 and Model-2, on the bilingual training data (x_1, y^1) and (x_2, y^2) separately. Then, Model-1 and Model-2 are employed to decode the input sentences x_2 and x_1, resulting in the pseudo training data (x_2, y^1) and (x_1, y^2), respectively. Thus, we obtain the triple parallel training data D = (x_1, y^1, y^2) ∪ (x_2, y^1, y^2), which is used to train our synchronous translation model described above.
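Schematically, this construction can be sketched as follows (the model objects and their translate method are hypothetical placeholders):

```python
def build_pseudo_triples(bitext1, bitext2, model1, model2):
    """bitext1: (x1, y1) pairs; bitext2: (x2, y2) pairs.
    model1 translates source -> language-1; model2 -> language-2."""
    triples = []
    # Complete each (x1, y1) pair with a pseudo language-2 target.
    for x1, y1 in bitext1:
        triples.append((x1, y1, model2.translate(x1)))
    # Complete each (x2, y2) pair with a pseudo language-1 target.
    for x2, y2 in bitext2:
        triples.append((x2, model1.translate(x2), y2))
    return triples
```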

Data
We evaluate our proposed synchronous method on two translation tasks: English→Chinese/Japanese (briefly, En→Zh/Ja) and English→German/French (briefly, En→De/Fr) on the IWSLT datasets. IWSLT.TED.tst2013 and IWSLT.TED.tst2014 are employed as the development set and test set, respectively. Besides, we also perform En→De/Fr translation on the large-scale WMT14 datasets, using newstest2014 as the test set.
En→Zh/Ja: For this translation task, the training sets of En→Zh and En→Ja consist of 231K and 223K sentence pairs, respectively. We tokenize the English sentences using a script from Moses (Koehn et al., 2007), and segment the Chinese and Japanese data with jieba and mecab. We use the BPE method (Sennrich et al., 2016b) to encode the source-side sentences and the combined target-side sentences respectively, and limit the vocabularies on both sides to the most frequent 10K tokens.
En→De/Fr: We conduct this translation task under two different settings. One setting uses the training sets of the IWSLT datasets, which contain 206K sentence pairs for En→De and 233K sentence pairs for En→Fr. We follow common practice to tokenize and lowercase all words. Sentences are encoded using BPE with a shared vocabulary of 10K tokens. Finally, we construct pseudo triple data with the method described in Sec. 3.3. For the other setting, we extract the trilingual subset of WMT14, inspired by Zoph and Knight (2016), which includes about 2.43M sentence triples. We use a shared BPE vocabulary of 37K tokens.

Training Details
We implement our synchronous translation model based on the tensor2tensor library. We train our models using the transformer_base configuration adopted by Vaswani et al. (2017), which contains a 6-layer encoder and a 6-layer decoder with 512-dimensional hidden representations. During training, each mini-batch contains roughly 4,096 tokens for both the source and target sides. We use the Adam optimizer (Kingma and Ba, 2014) with β_1 = 0.9, β_2 = 0.98, and ε = 10^{-9}. For decoding, we set the beam size to k = 4 and the length penalty to α = 0.6. All our models are trained and tested on a single NVIDIA P40 GPU. We also investigate the impact of different values of λ in our synchronous attention model. As shown in Table 1, the translation results with λ = 0.1 are best on the development sets for both the En→Zh/Ja and En→De/Fr tasks, and we use this setting in the subsequent experiments.
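For reference, the settings above can be summarized in a small configuration dictionary (the keys are descriptive labels of our own, not necessarily tensor2tensor's actual hyperparameter names):

```python
TRAIN_CONFIG = {
    "model": "transformer_base",   # 6-layer encoder and 6-layer decoder
    "hidden_size": 512,
    "batch_tokens": 4096,          # approximate tokens per mini-batch
    "optimizer": "Adam",
    "adam_beta1": 0.9,
    "adam_beta2": 0.98,
    "adam_epsilon": 1e-9,
    "beam_size": 4,
    "length_penalty_alpha": 0.6,
    "sync_lambda": 0.1,            # balance weight, tuned on the dev set
}
```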

Results and Analysis
Translation performance on the IWSLT datasets is evaluated by case-insensitive BLEU4 (Papineni et al., 2002) for the En→De/Fr task and character-level BLEU5 for the En→Zh/Ja task. For the WMT14 datasets, we calculate case-sensitive BLEU4, the same as previous work. In our experiments, the NMT models trained on an individual language pair are denoted by Indiv. Table 2 shows the main translation results of En→Zh/Ja and En→De/Fr on the IWSLT datasets. We also train a typical one-to-many multilingual model, applying the method of Johnson et al. (2017) to the Transformer, as another baseline, referred to as Multi. Compared with Indiv, we can see that Multi achieves better results in all cases, which can be attributed to the encoder being enhanced by extra training data from the other language pair.

Results on IWSLT
As for our proposed method, the synchronous translation method performs significantly better than both the Indiv and Multi baselines, achieving improvements of up to 2.75 BLEU points (19.31 vs. 16.56) on En→Ja. To perform synchronous translation, the triple parallel corpus contains pseudo training data that we construct. For a fair comparison, we also run our baseline methods on the training sentences augmented with the pseudo corpus. From rows 2 and 4 in Table 2, our method achieves better performance than both the Multi + pseudo and Indiv + pseudo methods, with average gains of 0.82 and 1.09 BLEU points respectively, which demonstrates the effectiveness of our method.

L2R or R2L Manner
As described in Sec. 3.1, the two target languages can be generated in L2R or R2L manner, which allows them to provide future context for each other. We conduct a further experiment to investigate different decoding manners. Figure 3 reports the results. We observe that for En→De/Fr translation, a language generated in R2L manner is helpful for the other language but harms itself. However, for En→Zh/Ja translation, Japanese achieves improvements under both the L2R and R2L decoding settings. The reason is that Japanese is well suited to right-to-left generation, and it can take advantage of the future outputs from Chinese.

Results on WMT
We also apply our method to the real triple training data collected from the WMT14 En→De/Fr datasets, as described in Sec. 4.1. From Table 3, we observe that our method consistently outperforms the baseline models. Note that, in contrast to the results on the IWSLT datasets, Multi cannot perform on par with Indiv, because the source-side data for the two language pairs are the same, so the encoder network cannot be enhanced as it is for the Multi method in Sec. 5.1.
Table 3: The significance values with respect to the baseline methods Indiv and Multi are denoted by "*" and "†" respectively, indicating that our proposed Sync-Trans is significantly better than both Indiv (p < 0.05) and Multi (p < 0.01).

Moreover, we construct a large-scale pseudo triple dataset of about 4.5M triples. The result is shown in the last column of Table 3, where our synchronous method performs better than the baseline methods as well.

Conclusion
In this paper, we propose an interactive decoding algorithm to generate two target languages simultaneously and interactively. Empirical experiments on four language pairs demonstrate that our approach obtains significant improvements over both the NMT models trained on individual language pairs and the multilingual NMT model. In future work, we plan to extend our method to more than two target languages and to explore other effective interactive approaches to further improve translation quality.