Improving Neural Machine Translation with Soft Template Prediction

Although neural machine translation (NMT) has achieved significant progress in recent years, most previous NMT models depend only on the source text to generate the translation. Inspired by the success of template-based and syntax-based approaches in other fields, we propose to use templates extracted from tree structures as soft target templates to guide the translation procedure. In order to learn the syntactic structure of the target sentences, we adopt the constituency-based parse tree to generate candidate templates. We incorporate the template information into the encoder-decoder framework to jointly utilize the templates and the source text. Experiments show that our model significantly outperforms the baseline models on four benchmarks, demonstrating the effectiveness of soft target templates.

Inspired by these works and the successful application of templates to other intriguing tasks, including semantic parsing (Dong and Lapata, 2018), summarization (Cao et al., 2018; Wang et al., 2019a), question answering (Duan et al., 2017; Pandey et al., 2018), and other text generation tasks (Wiseman et al., 2018; Guu et al., 2018), we assume that candidate templates of the target sentences can guide the translation process. We denote these templates extracted from the constituency-based parse tree as soft templates, which consist of tags and target words. The templates are soft because no explicit paradigms are imposed to build new translations from them, and the target tokens may be modified. In order to use the templates effectively, we introduce soft template-based neural machine translation (ST-NMT), which uses the source text and soft templates to predict the final translation. Our approach can be split into two phases. In the first phase, a standard Transformer model is trained to predict soft target templates, using the source text and templates extracted from the constituency-based parse tree. In the second phase, we use two encoders, a soft target template encoder and a source language encoder, to encode the source text and the templates and generate the final translation. As shown in Figure 1, given the source text "我喜欢打篮球" and the target template "S like to VP", the final translation "I like to play basketball" is generated from the two encoders. In this work, the templates play a guiding role, and some target tokens in the template may also be modified.

Figure 2: Overview of our ST-NMT. Given the source text and the soft target template predicted by P_{θ_{X→T}}, the source language Transformer encoder and the target template Transformer encoder map the two sequences X = (x_1, x_2, x_3, x_4, x_5) and T = (t_1, y_2, t_3, t_4, t_5) into hidden states Z_X and Z_T. Here x_i denotes a source word, t_i denotes a template tag, and y_i denotes a target word, which may be modified to other target words. The ultimate translation Y is generated by a Transformer decoder which incorporates the contexts Z_X and Z_T in the second phase.

In order to prove the effectiveness of our approach, we conduct main experiments on popular benchmarks, including the IWSLT14 German-English, WMT14 English-German, LDC Chinese-English, and ASPEC Japanese-Chinese translation tasks. Experiments show that our approach achieves significant improvement over the baselines, which demonstrates that soft target templates can effectively guide the translation procedure. Our approach is applicable to data sets of diverse scales, different styles, and multiple language pairs.

Our Approach
Our model first reads the source language sequence X = (x_1, x_2, x_3, ..., x_n) in the conventional way with a source language Transformer encoder and generates the template sequence T = (t_1, t_2, t_3, ..., t_m) with a template Transformer decoder. As shown in Figure 2, our model then uses a source language Transformer encoder and a template Transformer encoder, which encode the source language sequence X and the template sequence T separately, and deploys the target language decoder to generate the final translation. In this section, we present the details of the proposed template-based approach. Our method mainly includes two phases: (1) The training data is constructed from the constituency-based parse tree; then, we adopt a standard Transformer to convert the source text into the soft target template for the next generation step. (2) Based on the source text and the predicted soft target template, we utilize two encoders to encode the two sequences into hidden states separately and a target language decoder to generate the ultimate translation.

Soft Template Prediction
In this procedure, we model P_{θ_{X→T}}(T | X) to predict soft target templates on top of the constructed training data D_{X,T}. To construct D_{X,T}, we use a constituency-based parser to parse the target sequence into a tree structure. Then, we prune the nodes deeper than a specific depth and recover the remaining leaf nodes into an ordered template sequence. Through these operations, we obtain the parallel training data D_{X,T} and train a standard Transformer model P_{θ_{X→T}}(T | X) to predict the soft target template.
The constituency-based parse tree reveals the structural information of the whole sentence, using constituency grammar to distinguish terminal and non-terminal nodes. More specifically, the interior nodes are labeled by non-terminal categories, which belong to the set of non-terminal tokens S, while the leaf nodes are labeled by terminal categories V, where S = {S, VP, NP, ..., SBAR} and V is the vocabulary set of the target language. For example, the sentence "There are some people running" can be expressed as in Figure 3. In this case, the non-terminal tokens consist of S_0 = {S, NP, VP, EX, VBP, NP, DT, NNS, VBG} and the terminal tokens are composed of V_0 = {There, are, some, people, running}. Our template T = {t_1, t_2, t_3, t_4} is the ordered sequence composed of non-terminal and terminal tokens; in this case, t_1 = There, t_2 = are, t_3 = NP, and t_4 = VP. Our template extraction aims to extract the sub-tree of a specific depth and use the non-terminal and terminal tokens located at the leaf nodes of the sub-tree.

Figure 3: The constituency-based parse tree of the example sentence. Given the target sentence and a definite depth of the tree, we obtain the sub-tree by pruning the nodes deeper than 4 in this case. Then, the sub-tree can be converted to the soft target template "There are NP VP" from left to right.
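The pruning step above can be sketched in a few lines. The following is a hypothetical illustration (not the paper's implementation): parse trees are represented as `(label, children)` tuples with plain strings as terminals, and nodes deeper than a given depth are cut off, each contributing its non-terminal tag to the template.

```python
# Sketch of template extraction: prune a constituency parse at a given
# depth and read the leaves of the pruned sub-tree from left to right.
# A node is a (label, children) tuple; a leaf is a string (a terminal).

def extract_template(node, depth):
    """Return the ordered template tokens for a pruned sub-tree.

    Nodes deeper than `depth` are cut off; a non-terminal whose children
    were pruned away contributes its own tag, while surviving terminals
    contribute the target word itself.
    """
    label, children = node
    if depth == 1:
        return [label]            # cut here: emit the non-terminal tag
    tokens = []
    for child in children:
        if isinstance(child, str):
            tokens.append(child)  # terminal: keep the target word
        else:
            tokens.extend(extract_template(child, depth - 1))
    return tokens

# Parse of "There are some people running" (structure assumed for
# illustration; a real parse would come from the Stanford parser).
tree = ("S", [
    ("NP", [("EX", ["There"])]),
    ("VP", [
        ("VBP", ["are"]),
        ("NP", [
            ("NP", [("DT", ["some"]), ("NNS", ["people"])]),
            ("VP", [("VBG", ["running"])]),
        ]),
    ]),
])

print(extract_template(tree, 1))  # ['S']
print(extract_template(tree, 4))  # ['There', 'are', 'NP', 'VP']
```

At depth 1 the template degenerates to the single symbol "S", and at a large depth it reproduces the full sentence, matching the two special cases discussed below.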
In order to predict the soft target templates, we train a standard Transformer model given the training data of the source text and extracted templates. The Transformer model reads the source text and predicts the soft target templates using beam search. Then, we select the top-K results of the beam search as templates.
The depth of the tree is a trade-off. In Figure 3, one special case is that when the depth equals 1, the template has only the single symbol "S", which cannot provide any useful information. Another special case is that when the depth is greater than 6, the template "There are some people running" consists only of terminal tokens; such a template contains only target words and thus provides no additional information. When the depth equals 4, the template is "There are NP VP", which contains syntactic and structural information about the sentence and is suitable for our method.
With the Transformer model P_{θ_{X→T}}(T | X), we construct the pseudo training data D_{X,T,Y} instead of directly using the extracted templates. Given the source text X, we use P_{θ_{X→T}}(T | X) to predict the top-1 soft target template T with beam search. We thereby obtain the triple training data prepared for the next phase.

Machine Translation via Soft Templates
The triple training data D_{X,T,Y} is used to model the probability P_{(X,T)→Y} from the two sequences to the ultimate translation. Our approach generates the target sentence Y given the source sequence X and the template T.
Formulation Formally, we model the whole procedure on top of P_{θ_{X→T}}(T | X) and P_{θ_{(X,T)→Y}}(Y | X, T):

P(Y | X) = P_{θ_{(X,T)→Y}}(Y | X, T̂),  where T̂ = argmax_T P_{θ_{X→T}}(T | X)   (1)

where θ_{X→T} and θ_{(X,T)→Y} are the parameters for the first and second phases. The source language Transformer encoder and the soft template Transformer encoder map the input sequence X and the template T, composed of target language words and tags, to hidden states. Then, a Transformer decoder interacting with the two encoders generates the final translation Y, as described by Equation 1.
Encoder In the second phase, our template Transformer encoder and the source language Transformer encoder are stacks of blocks containing self-attention layers with residual connections, layer normalization, and a fully connected feed-forward network (FFN). The hidden states of the source language Transformer encoder and the template Transformer encoder are calculated by:

h̃_l = LayerNorm(h_{l−1} + SelfAtt(h_{l−1}))
h_l = LayerNorm(h̃_l + FFN(h̃_l))

where h_l = h^X_l for the source language Transformer encoder and h_l = h^T_l for the template Transformer encoder, N is the number of layers, and l ∈ [1, N].
Decoder Based on the hidden states h^X_l and h^T_l, the target language Transformer decoder uses encoder-decoder multi-head attention to jointly exploit the source language and template information when generating the ultimate translation Y. Besides, the target sequence decoder uses masked multi-head self-attention to obtain the decoder representations with the parameters θ_{(X,T)→Y}. We obtain the two encoder contexts Z_X = (z^X_1, ..., z^X_m) and Z_T = (z^T_1, ..., z^T_n) from the source language Transformer encoder and the template Transformer encoder.
On top of Z_X and Z_T, the decoder separately calculates multi-head attention over the source sentence context X = (x_1, ..., x_m) and the target template context T = (t_1, ..., t_n), obtaining two hidden states Z_{X,Y} and Z_{T,Y} by attending to the source context and the template context. We then incorporate Z_{X,Y}, containing source language information, and Z_{T,Y}, containing template information:

Z_Y = β · Z_{X,Y} + (1 − β) · Z_{T,Y}

where Z_Y is the resulting decoder hidden state and β is the parameter controlling the degree of incorporation between the source text and the template. In order to incorporate the source and template information effectively, we calculate β as:

β = σ(W_Y Z_{X,Y} + U_T Z_{T,Y})

where W_Y and U_T are parameter matrices and σ is the sigmoid activation function.
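The gated combination of the two attention contexts can be sketched numerically. This is a minimal numpy sketch under assumed shapes and parameter names; the paper's exact parametrization of the gate may differ.

```python
import numpy as np

# Toy gated fusion of source-side and template-side decoder states.
# Z_xy: decoder states after attending to the source context.
# Z_ty: decoder states after attending to the template context.
rng = np.random.default_rng(0)
d = 8                                   # toy hidden size
Z_xy = rng.normal(size=(5, d))
Z_ty = rng.normal(size=(5, d))
W_y = rng.normal(size=(d, d)) * 0.1     # learned parameter matrices
U_t = rng.normal(size=(d, d)) * 0.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Gate controlling how much each position relies on the source vs. the
# template; computed here per position and per hidden dimension.
beta = sigmoid(Z_xy @ W_y + Z_ty @ U_t)
Z_y = beta * Z_xy + (1.0 - beta) * Z_ty

assert Z_y.shape == (5, d)
```

Because β is a sigmoid output, the fused state always lies between the two contexts, so a poor template degrades the representation gracefully rather than replacing the source signal outright.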

Training Strategy
Similar to conventional NMT, in order to make the model predict the target sequence, we use the maximum likelihood estimation (MLE) loss function to update the model parameters by maximizing the log-likelihood of the translations over the training set D.
When we train P_{θ_{X→Y}} without the template Transformer encoder, we only need to optimize the following loss function:

L_{θ_{X→Y}}(D) = Σ_{(X,Y)∈D} log P(Y | X; θ_{X→Y})

where θ_{X→Y} are the parameters of the source language Transformer encoder and the target language Transformer decoder. When we train P_{θ_{(X,T)→Y}} with the template Transformer encoder, the loss function is:

L_{θ_{(X,T)→Y}}(D) = Σ_{(X,T,Y)∈D} log P(Y | X, T; θ_{(X,T)→Y})

where θ_{(X,T)→Y} are the parameters of the source language Transformer encoder, the template Transformer encoder, and the target language Transformer decoder.
To balance the two objectives, our model is trained on the L_{θ_{X→Y}}(D) objective for α% of the iterations and on the L_{θ_{(X,T)→Y}}(D) objective for the remaining (1 − α)% of the iterations. This procedure is therefore equivalent to the following formula:

L(D) = α · L_{θ_{X→Y}}(D) + (1 − α) · L_{θ_{(X,T)→Y}}(D)

where α is a scaling factor accounting for the difference in magnitude between L_{θ_{X→Y}}(D) and L_{θ_{(X,T)→Y}}(D).
In practice, we find that optimizing these two objectives makes the training procedure easier and yields a higher BLEU score, since a few low-quality templates would otherwise harm translation quality. By optimizing the two objectives simultaneously, we reduce the effect of low-quality templates and improve the stability of our model.
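The alternation between the two objectives can be sketched as a per-iteration coin flip. This is an assumed scheduling scheme for illustration (the paper does not specify whether the split is random or block-wise); `"X->Y"` and `"(X,T)->Y"` stand in for one optimizer update on each loss.

```python
import random

# Sketch of the two-objective schedule: with probability alpha, train
# on the plain X->Y loss (no template encoder); otherwise train on the
# (X, T)->Y loss with the template encoder.

def schedule(num_iters, alpha, seed=0):
    """Return the objective chosen at each training iteration."""
    rng = random.Random(seed)
    return ["X->Y" if rng.random() < alpha else "(X,T)->Y"
            for _ in range(num_iters)]

choices = schedule(10000, alpha=0.5)
ratio = choices.count("X->Y") / len(choices)
print(round(ratio, 2))  # close to alpha
```

Because the template encoder's parameters are untouched during X->Y iterations, the shared source encoder and decoder still see every example, which is what keeps training stable when some templates are noisy.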

Experiments
We conducted experiments on four benchmarks, including the LDC Chinese-English, WMT14 English-German, IWSLT14 German-English, and ASPEC Japanese-Chinese translation tasks. These settings show that our approach is suitable for diverse situations: (1) The four benchmarks provide wide coverage of both scale and genre, varying from small scale to large scale. (2) They cover different domains, including news, science, and talks. (3) They cover different language pairs: German-English, English-German, Chinese-English, and Japanese-Chinese.

Datasets
In order to verify the effectiveness of our method, we conduct experiments on four benchmarks. The WMT14 and LDC datasets are from the news domain. The IWSLT14 dataset is from TED talks. The ASPEC dataset is from a scientific paper excerpt corpus.
LDC Chinese-English We use a subset of the LDC corpus 1, which originally has nearly 1.4M sentences. The training set, selected from the LDC corpus, consists of 1.2M sentence pairs after dropping the low-quality sentence pairs whose length ratio is more than 2. We used the NIST 2006 dataset as the validation set for evaluating performance during training, and NIST 2003, 2005, 2008, and 2012 as test sets, all of which have 4 English references for each Chinese sentence.
IWSLT14 German-English This dataset contains 160K training sentence pairs. We randomly sample 5% of the training data as the validation set. Besides, we merge the multiple test sets dev2010, dev2012, tst2010, tst2011, and tst2012 for testing.
WMT14 English-German The training data consists of 4.5M sentence pairs. The validation set is devtest2014, and the test set is newstest2014.
ASPEC Japanese-Chinese We use 0.67M sentence pairs from ASPEC Japanese-Chinese corpus (Nakazawa et al., 2016) 2 . We use the devtest as the development data, which contains 2090 sentences, and the test data contains 2107 sentences with a single reference per source sentence.

IWSLT14 German-English
We adopt the small setup of the Transformer model. The model has 6 layers with an embedding size of 512, a feed-forward size of 1024, and 4 attention heads. In order to prevent overfitting, we use a dropout of 0.3, an L2 weight decay of 10^-4, and a label smoothing of 0.1. We use BPE to encode sentences with a shared vocabulary of 10K symbols.
WMT14 English-German We use the big setting of the Transformer (Vaswani et al., 2017), in which both the encoder and the decoder have 6 layers, with an embedding size of 1024, a feed-forward size of 4096, and 16 attention heads. The dropout rate is fixed at 0.3. We adopt the Adam (Kingma and Ba, 2015) optimizer with a learning rate of 0.1 and a learning rate schedule similar to that of the Transformer (Vaswani et al., 2017). We set the batch size to 6000 and the update frequency to 16 on 8 GPUs (Ott et al., 2018) to imitate training on 128 GPUs. The datasets are encoded by BPE with a shared vocabulary (Sennrich et al., 2016) of 40K symbols.
ASPEC Japanese-Chinese We use the base setting of the Transformer, the same as for the Chinese-English translation task. Following a similar learning rate schedule (Vaswani et al., 2017), we set the learning rate to 0.1. Chinese and Japanese sentences are tokenized with our in-house tools and encoded by BPE with a shared vocabulary of 10K symbols.

Evaluation
We evaluate the performance of the translation results with BLEU (Papineni et al., 2002). For the Chinese-English and German-English translation tasks, we use case-insensitive tokenized BLEU scores. For the English-German translation task, we use case-sensitive tokenized BLEU scores. All experiments run for 150 epochs and use the Stanford parser to generate templates (Manning et al., 2014). For all translation tasks, we use the checkpoint with the best performance on the validation set. For different test sets, we adapt the beam size and the length penalty to obtain better performance. In order to avoid differences among tokenizers when evaluating Chinese translation results, we adopt character-level BLEU for testing. Checkpoint averaging is not used unless otherwise noted.

Baselines
We compare our approach with two types of baselines: one-pass baselines and multi-pass baselines.
One-pass Baselines: ConvS2S (Gehring et al., 2017) is a strong CNN-based baseline; we report the results given in the convolutional sequence to sequence (ConvS2S) paper.

RNMT+ (Chen et al., 2018) is a state-of-the-art RNN-based NMT model. GNMT (Wu et al., 2016) is the typical encoder-decoder framework; we use a similar setting 3 for all experiments. Transformer (Vaswani et al., 2017) is a strong baseline with state-of-the-art performance, which we reimplement 4. LightConv and DynamicConv (Wu et al., 2019) are simpler but effective baselines; we directly report the results from the paper.
The results of our model are statistically significant compared to the other baselines (p < 0.05). Our model outperforms all baselines and exceeds the Transformer baseline by 1.14 BLEU points on average, which shows that the template can effectively improve performance. More specifically, our model outperforms the Transformer model by 0.76 BLEU on NIST 2003, 1.52 BLEU on NIST 2005, 0.91 BLEU on NIST 2008, and 1.39 BLEU on NIST 2012. We further demonstrate the effectiveness of our model on the WMT14 English-German translation task, comparing against other competitive models, including ABD-NMT (Zhang et al., 2018), Deliberation Network (Xia et al., 2017), SoftPrototype (Wang et al., 2019b), SB-NMT (Zhou et al., 2019a), and SBSG (Zhou et al., 2019b). As shown in Table 3, our model also significantly outperforms the others, with an improvement of 0.43 BLEU points over a strong Transformer model.
To investigate the effect of our approach on different language pairs, we also evaluate our model on the Japanese-Chinese translation task. According to Table 4, ST-NMT outperforms GNMT by 3.72 BLEU points, ConvS2S by 2.52 BLEU points, and the Transformer model by 0.82 BLEU points, which demonstrates that the soft templates extracted from the constituency-based parse tree also bring strong positive effects.

Figure 4: The effect of multiple templates. We feed the top-K results of the beam search as multiple templates, together with the source sentence, to generate the target translation.

Multiple Templates
Because of the diversity of the templates, we investigate performance with different numbers of templates. On top of the original parallel training data D = {(x^(i), y^(i))} (i = 1, ..., N), we construct the training data from the source text to the soft target template, D_{X→T} = {(x^(i), t^(i))} (i = 1, ..., N), with the model P_{θ_{X→T}}. Through this construction procedure, we can use the top-K results of the beam search from P_{θ_{X→T}} as multiple templates and thereby expand the training data from the source text to the target template. As shown in Figure 4, our model obtains the best performance using only a single template; when the number of templates is 8, it obtains the worst BLEU score of 29.22. We conclude that more templates can make the model more robust but may degrade performance as the number of templates rises. Besides, in order to further improve the stability of our model, we expanded the dataset by selecting random templates for each source sentence; although this can make the model more robust, the differing templates confuse it.
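The top-K expansion above can be sketched as follows. This is an illustrative sketch only: `predict_templates` is a hypothetical stand-in for decoding the trained X→T Transformer with beam search and keeping the K best hypotheses.

```python
# Expand (source, target) pairs into (source, template, target)
# triples, one triple per top-K template hypothesis.

def predict_templates(x, k):
    # Placeholder: a real system would return the k beam-search
    # hypotheses of the X->T template prediction model.
    return [f"<template-{i} for {x}>" for i in range(k)]

def expand(pairs, k):
    """Turn (x, y) pairs into (x, t, y) triples for the second phase."""
    return [(x, t, y) for x, y in pairs for t in predict_templates(x, k)]

data = expand([("src1", "tgt1"), ("src2", "tgt2")], k=3)
print(len(data))  # 6: each pair yields k triples
```

Note that K multiplies the size of the second-phase training set, which is consistent with the observation that larger K trades translation quality for robustness.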

Balance of Two Objectives
To further control how much our model leverages templates for translation, we tune the hyper-parameter α. As α rises, the contribution of the template information gradually decreases. To investigate the effect of this hyper-parameter, we evaluate the discrete values α = {10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%}. According to Figure 5, when α ranges from 0.4 to 0.9, our model achieves better performance of at least 29.3 BLEU. The results show that we can set the hyper-parameter α within a reasonable interval (0.4 ≤ α ≤ 0.9) to keep the balance between the source text and the template.

Depth of Parsing Tree
Considering that templates derived from different depths can lead to divergent performance, we examine our model with different depths. The effect of the template extraction described in Section 3 is determined by the sub-tree, which is controlled by the chosen depth d: for the same constituency-based parse tree, different sub-trees are obtained for different depths, and the template is then derived from the sub-tree. The depth of the constituency-based parse tree is decided by a simple but effective strategy:

d = min(max(⌈λ · L⌉, γ_1), γ_2)

where L is the length of the input sentence, γ_1 is the lower bound and γ_2 is the upper bound on the depth of the sub-tree, and λ is a ratio applied to the length of the source sentence. When λ approaches 1.0, the template contains more target tokens and fewer tags.
In addition, we tune the depth on the LDC training data and list the results. According to Table 5, the soft templates of a specific depth provide helpful information to the translation procedure when λ = 0.15 on the LDC dataset.
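The depth heuristic can be written as a one-line function. The exact clamping form is an assumed reconstruction based on the roles the text assigns to γ_1, γ_2, and λ (length-proportional depth, clamped to a bounded interval).

```python
import math

# Sub-tree depth heuristic: depth grows with sentence length L, scaled
# by lam and clamped to the interval [gamma1, gamma2].

def subtree_depth(length, lam, gamma1, gamma2):
    return min(max(math.ceil(lam * length), gamma1), gamma2)

print(subtree_depth(20, 0.15, 0.15 and 2, 6) if False else subtree_depth(20, 0.15, 2, 6))  # 3
print(subtree_depth(100, 0.15, 2, 6))  # clamped to the upper bound, 6
print(subtree_depth(5, 0.15, 2, 6))    # clamped to the lower bound, 2
```

With λ = 0.15 (the value reported for the LDC dataset), a 20-word sentence yields a depth-3 sub-tree, while very short and very long sentences are caught by the bounds.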

Ratio of Overlapping Words
To measure the contribution of the predicted soft target template to the final translation, we calculate the overlapping words between the template and the translation. Table 6 gives the specific overlapping-word ratios on the different test sets, including NIST 2003, NIST 2005, NIST 2008, and NIST 2012. The overlapping ratio is calculated by the following formula:

ratio = Σ_w min(Count_y(w), Count_t(w)) / Σ_w Count_t(w)

where Count_y(·) and Count_t(·) denote the number of occurrences of w in the target translation Y and in the template T, respectively, and w ranges over the words of the target language. The overlapping ratio represents the correlation between the predicted template T and the target translation Y. According to Table 6, the template T and the translation Y are highly correlated, which demonstrates the contribution of our templates to the final translation.
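The ratio can be computed with clipped counts over the template's target-language words. The clipped-count form and the exclusion of non-terminal tags are assumptions of this sketch, made so that only target words (not tags like NP or VP) contribute to the overlap.

```python
from collections import Counter

# Fraction of (non-tag) template tokens that also appear in the final
# translation, with counts clipped at the translation's counts.

def overlap_ratio(translation, template, tags=("NP", "VP", "S")):
    trans_counts = Counter(translation.split())
    tmpl_counts = Counter(w for w in template.split() if w not in tags)
    clipped = sum(min(c, trans_counts[w]) for w, c in tmpl_counts.items())
    total = sum(tmpl_counts.values())
    return clipped / total if total else 0.0

y = "on the other hand , if we react too much , we will be hit by them ."
t = "on the other hand , if NP VP , we will VP ."
print(round(overlap_ratio(y, t), 2))  # 1.0: every template word survives
```

On the example from Table 7, every target word of the template reappears in the translation, illustrating why the reported ratios are high.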
Reference: on the other hand , if we overreact , we will be deceived by their trick .
Template: on the other hand , if NP VP , we will VP .
Ours: on the other hand , if we react too much , we will be hit by them .

Example Study
To further illustrate which aspects of NMT are improved by the soft target template, we provide a Chinese-English translation example shown in Table 7.
Templates provide structural and grammatical information about the target sentence. For instance, given the Chinese source sentence "另一方面 , 如果 我们 反应 过度 , 将 会 被 他们 欺骗" ("on the other hand, if we overreact, we will be deceived by their trick"), our model first predicts the target template "on the other hand , if NP VP , we will VP" and then generates the final translation "on the other hand , if we react too much , we will be hit by them". Our target template provides the sentence pattern "If sb. does sth., sb. will be done." Our method introduces the constituency-based parse tree and utilizes constituency grammar to distinguish terminal and non-terminal nodes; therefore, our model can automatically learn sentence patterns, including grammatical and structural information.

Related Work
Many types of encoder-decoder architecture (Bahdanau et al., 2015; Wu et al., 2016; Gehring et al., 2017; Vaswani et al., 2017; Chen et al., 2018) have been proposed in the past few years, spanning RNN-based, CNN-based, and Transformer-based backbones. Among them, the Transformer enhances the capability of NMT in capturing long-distance dependencies.
To improve translation quality, many authors have adopted multi-pass decoding methods: their models first predict a rough translation and then generate the final translation based on the previous draft (Niehues et al., 2016; Chatterjee et al., 2016; Junczys-Dowmunt and Grundkiewicz, 2017; Xia et al., 2017; Geng et al., 2018; Wang et al., 2019b).
Besides, some works (Zhang et al., 2018; Zhou et al., 2019b,a) use right-to-left (R2L) and left-to-right (L2R) decoding to improve the quality of machine translation. Non-autoregressive decoding (Ghazvininejad et al., 2019) first predicts the target tokens and masked tokens, which are filled in over subsequent iterations; the model then predicts the unmasked tokens on top of the source text and a mixed translation consisting of masked and unmasked tokens. Semi-autoregressive decoding (Akoury et al., 2019) likewise predicts chunked fragments or unmasked tokens based on the tree structure before producing the final translation. In addition, many existing works (Eriguchi et al., 2016; Aharoni and Goldberg, 2017; Dong and Lapata, 2018; Gu et al., 2018) incorporate syntax information or tree structures into NMT to improve translation quality.

Conclusion
In this work, we propose a novel approach that utilizes the source text and additional soft templates. More specifically, our approach extracts templates from the sub-tree derived from a specific depth of the constituency-based parse tree. Then, we use a Transformer model to predict the soft target templates conditioned on the source text. On top of the soft templates and the source text, we incorporate the template information to guide the translation procedure. We compare our soft template-based neural machine translation (ST-NMT) with other baselines on four benchmarks and multiple language pairs. Experimental results show that our ST-NMT significantly improves performance on these datasets.