Translating Phrases in Neural Machine Translation

Phrases play an important role in natural language understanding and machine translation (Sag et al., 2002; Villavicencio et al., 2005). However, it is difficult to integrate them into current neural machine translation (NMT), which reads and generates sentences word by word. In this work, we propose a method to translate phrases in NMT by integrating a phrase memory, which stores target phrases from a phrase-based statistical machine translation (SMT) system, into the encoder-decoder architecture of NMT. At each decoding step, the phrase memory is first re-written by the SMT model, which dynamically generates relevant target phrases with contextual information provided by the NMT model. Then the proposed model reads the phrase memory to make probability estimations for all phrases in the phrase memory. If phrase generation is carried out, the NMT decoder selects an appropriate phrase from the memory to perform phrase translation and updates its decoding state by consuming the words in the selected phrase. Otherwise, the NMT decoder generates a word from the vocabulary as the general NMT decoder does. Experimental results on Chinese-to-English translation show that the proposed model achieves significant improvements over the baseline on various test sets.


Introduction
Neural machine translation (NMT) has been receiving increasing attention due to its impressive translation performance (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015). Significantly different from conventional statistical machine translation (SMT) (Brown et al., 1993; Koehn et al., 2003; Chiang, 2005), NMT adopts a large neural network to perform the entire translation process in one shot, for which an encoder-decoder architecture is widely used. Specifically, the encoder encodes a source sentence into a continuous vector representation, and the decoder then uses this representation to generate the corresponding target translation word by word.
The word-by-word generation philosophy in NMT makes it difficult to translate multi-word phrases. Phrases, especially multi-word expressions, are crucial for natural language understanding and machine translation (Sag et al., 2002; Villavicencio et al., 2005), as the meaning of a phrase cannot always be deduced from the meanings of its individual words or parts. Unfortunately, current NMT is essentially a word-based or character-based (Chung et al., 2016; Costa-jussà and Fonollosa, 2016; Luong and Manning, 2016) translation system in which phrases are not considered as translation units. In contrast, phrases serve much better than words as translation units in SMT and have led to significant advances in translation quality. Therefore, a natural question arises: Can we translate phrases in NMT?
Recently, there have been some attempts at multi-word phrase generation in NMT (Stahlberg et al., 2016b; Zhang and Zong, 2016). However, these efforts constrain NMT to generate either syntactic phrases or domain phrases within the word-by-word generation framework. To explore phrase generation in NMT beyond the word-by-word generation framework, we propose a novel architecture that integrates a phrase-based SMT model into NMT. Specifically, we add an auxiliary phrase memory to store target phrases in symbolic form. At each decoding step, guided by the decoding information from the NMT decoder, the SMT model dynamically generates relevant target phrase translations and writes them to the memory. Then the NMT decoder scores the phrases in the phrase memory and selects the phrase or word with the highest probability. If phrase generation is carried out, the NMT decoder generates a multi-word phrase and updates its decoding state by consuming the words in the selected phrase.
Furthermore, in order to enhance the ability of the NMT decoder to effectively select appropriate target phrases, we modify the encoder of NMT to make it fit for exploring structural information of source sentences. Particularly, we integrate syntactic chunk information into the NMT encoder, to enrich the source-side representation. We validate our proposed model on the Chinese→English translation task. Experiment results show that the proposed model significantly outperforms the conventional attention-based NMT by 1.07 BLEU points on multiple NIST test sets.
The rest of this paper is organized as follows. Section 2 briefly introduces attention-based NMT as background knowledge. Section 3 presents our proposed model, which incorporates the phrase memory into the NMT encoder-decoder architecture, as well as the reading and writing procedures of the phrase memory. Section 4 presents our experiments on the Chinese→English translation task and reports the experiment results. Finally, we discuss related work in Section 5 and conclude the paper in Section 6.

Background
Neural machine translation often adopts the encoder-decoder architecture with recurrent neural networks (RNN) to model the translation process. The bidirectional RNN encoder, which consists of a forward RNN and a backward RNN, reads a source sentence $x = x_1, x_2, \ldots, x_{T_x}$ and transforms it into word annotations of the entire source sentence $h = h_1, h_2, \ldots, h_{T_x}$. The decoder uses the annotations to emit a target sentence $y = y_1, y_2, \ldots, y_{T_y}$ in a word-by-word manner.
In the training phase, given a parallel sentence pair $(x, y)$, NMT models the conditional probability as follows:

$$P(y \mid x) = \prod_{i=1}^{T_y} P(y_i \mid y_{<i}, x) \tag{1}$$

where $y_i$ is the target word emitted by the decoder at step $i$ and $y_{<i} = y_1, y_2, \ldots, y_{i-1}$. The conditional probability $P(y_i \mid y_{<i}, x)$ is computed as

$$P(y_i \mid y_{<i}, x) = \mathrm{softmax}(f(s_i, y_{i-1}, c_i)) \tag{2}$$

where $f(\cdot)$ is a non-linear function and $s_i$ is the hidden state of the decoder at step $i$:

$$s_i = g(s_{i-1}, y_{i-1}, c_i) \tag{3}$$

where $g(\cdot)$ is a non-linear function. Here we adopt the Gated Recurrent Unit (Cho et al., 2014) as the recurrent unit for the encoder and decoder. $c_i$ is the context vector, computed as a weighted sum of the annotations $h$:

$$c_i = \sum_{j=1}^{T_x} \alpha_{i,j} h_j \tag{4}$$

where $h_j$ is the annotation of source word $x_j$ and its weight $\alpha_{i,j}$ is computed by the attention model. We train the attention-based NMT model by maximizing the log-likelihood:

$$\mathcal{L} = \sum_{n=1}^{N} \log P(y^n \mid x^n) \tag{5}$$

given the training data with $N$ bilingual sentences (Cho, 2015). In the testing phase, given a source sentence $x$, we use a beam search strategy to search for a target sentence $\hat{y}$ that approximately maximizes the conditional probability $P(y \mid x)$:

$$\hat{y} = \operatorname{argmax}_{y} P(y \mid x) \tag{6}$$
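As an illustration of how the context vector is formed, the following is a minimal sketch in plain Python. It assumes that the attention weights are obtained by a softmax over unnormalized attention scores, which is the usual choice in attention-based NMT but is not spelled out in this section; the function names and toy values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(attention_scores, annotations):
    """Weighted sum of the annotations h_j, with weights from a softmax (assumed)."""
    alphas = softmax(attention_scores)
    dim = len(annotations[0])
    return [sum(a * h[d] for a, h in zip(alphas, annotations)) for d in range(dim)]


# Toy example: three source annotations of dimension 2 (hypothetical values).
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 0.0, 0.0]  # equal scores give uniform attention weights
c_i = context_vector(scores, h)
```

With uniform weights, each component of the context vector is simply the average of the corresponding annotation components.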

Approach
In this section, we introduce the proposed model, which incorporates a phrase memory into the encoder-decoder architecture of NMT. Inspired by recent work on attaching an external structure to the encoder-decoder architecture (Gulcehre et al., 2016; Gu et al., 2016; Tang et al., 2016; Wang et al., 2017), we adopt a similar approach to incorporate the phrase memory into NMT.

Figure 1: Architecture of the NMT decoder with the phrase memory. The NMT decoder performs phrase generation using the balancer and the phrase memory.

Framework
Figure 1 shows an example. Given the generated words "President Bush emphasized that", the model generates the next fragment in either a word generation mode or a phrase generation mode. If the model selects the word generation mode, it generates a word with the NMT decoder as in the standard NMT framework. Otherwise, it generates a multi-word phrase by querying a phrase memory, which is written by an SMT decoder based on the dynamic decoding information from the NMT model at each step. The trade-off between the word generation mode and the phrase generation mode is balanced by a weight $\lambda$, which is produced by a neural network based balancer. Formally, a generated translation $y = \{y_1, y_2, \ldots, y_{T_y}\}$ consists of two sets of fragments: words generated by the NMT decoder, $w = \{w_1, w_2, \ldots, w_K\}$, and phrases generated from the phrase memory, $p = \{p_1, p_2, \ldots, p_L\}$. The probability of generating $y$ is calculated by

$$P(y) = \prod_{w_k \in w} \left(1 - \lambda_{t(w_k)}\right) P_{word}(w_k) \times \prod_{p_l \in p} \lambda_{t(p_l)} P_{phrase}(p_l) \tag{7}$$

where $P_{word}(w_k)$ is the probability of generating the word $w_k$ (see Equation 2), $P_{phrase}(p_l)$ is that of generating the phrase $p_l$, which will be described in Section 3.2, and $t(\cdot)$ is the decoding step at which the corresponding fragment is generated.
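The way word and phrase fragments contribute to the probability of a translation can be sketched as follows. This is a minimal illustration, assuming that each fragment's probability is scaled by the balancing weight λ for phrases and by 1 − λ for words, as in the decoding procedure described later; the function name, the tuple layout and the toy numbers are hypothetical.

```python
import math


def translation_log_prob(fragments):
    """Sum log-probabilities over fragments.

    Each fragment is (kind, prob, lam): kind is "word" or "phrase", prob is
    P_word or P_phrase for that fragment, and lam is the balancing weight at
    the decoding step where the fragment was generated.
    """
    total = 0.0
    for kind, prob, lam in fragments:
        weight = lam if kind == "phrase" else 1.0 - lam
        total += math.log(weight * prob)
    return total


# Toy translation: one word (lam = 0.2) followed by one phrase (lam = 0.8).
fragments = [("word", 0.5, 0.2), ("phrase", 0.5, 0.8)]
log_p = translation_log_prob(fragments)
p = math.exp(log_p)  # (1 - 0.2) * 0.5 * 0.8 * 0.5
```

Working in log space avoids numerical underflow when the translation contains many fragments.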
The balancing weight $\lambda$ is produced by the balancer, a multi-layer network. The balancer network takes as input the decoding information, including the context vector $c_i$, the previous decoding state $s_{i-1}$ and the previously generated word $y_{i-1}$:

$$\lambda_i = \sigma(f_b(s_{i-1}, y_{i-1}, c_i)) \tag{8}$$

where $\sigma(\cdot)$ is a sigmoid function and $f_b(\cdot)$ is the activation function. Intuitively, the weight $\lambda$ can be treated as the estimated importance of the phrase to be generated. We expect $\lambda$ to be high if the phrase is appropriate at the current decoding step.
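A balancer of this kind can be sketched as a small feed-forward network with a sigmoid output. The sketch below is illustrative only: the tanh hidden activation, the layer sizes, the random initialization and all names are assumptions made for the example, not details given in the text.

```python
import math
import random


def dense(x, W, b):
    """Affine layer: returns W x + b for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def init_matrix(rows, cols, rng):
    """Small random weights (hypothetical initialization)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def balancer(c_i, s_prev, y_prev_emb, params):
    """Map decoding information to a weight in (0, 1)."""
    h = c_i + s_prev + y_prev_emb  # concatenate the three input vectors
    for W, b in params[:-1]:
        h = [math.tanh(v) for v in dense(h, W, b)]  # assumed hidden activation
    W_out, b_out = params[-1]
    z = dense(h, W_out, b_out)[0]
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid


rng = random.Random(0)
sizes = [6, 4, 3, 1]  # input size 6, two hidden layers, scalar output (toy sizes)
params = [(init_matrix(o, i, rng), [0.0] * o) for i, o in zip(sizes[:-1], sizes[1:])]
lam = balancer([0.1, 0.2], [0.3, 0.4], [0.5, 0.6], params)
```

Because the output passes through a sigmoid, the weight always lies strictly between 0 and 1, so both generation modes keep non-zero probability mass.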

Well-Formed Phrases
We employ a source-side chunker to chunk the source sentence, and only phrases that correspond to a source chunk are used in our model. We restrict ourselves to well-formed chunk phrases based on the following considerations: (1) In order to take advantage of dynamic programming, we restrict ourselves to non-overlapping phrases. (2) We explicitly utilize the boundary information of the source-side chunk phrases to better guide the proposed model to adopt a target phrase at an appropriate decoding step. (3) We enable the model to exploit the syntactic categories of chunk phrases, which enhances its selection preference for particular target phrases.
With this information, we enrich the context vector $c_i$ to enable the proposed model to make better decisions, as described below.
Following the commonly used strategy in sequence tagging tasks (Xue and Shen, 2003), we allow the words in a phrase to share the same chunk tag and introduce a special tag for the beginning word. For example, the two-word Chinese phrase meaning "information security" is tagged as a noun phrase "NP", and its tag sequence is "NP_B NP". Partially motivated by work on integrating linguistic features into NMT, we represent the encoder input as the combination of word embeddings and chunking tag embeddings, instead of word embeddings alone as in conventional NMT. The new input is formulated as follows:

$$[E^w x_i ; E^t t_i], \quad i \in [1, T_x] \tag{9}$$

where $x_i$ is the $i$-th source word and $t_i$ is its chunking tag, $E^w \in \mathbb{R}^{d_w \times |V_{NMT}|}$ is a word embedding matrix and $d_w$ is the word embedding dimensionality, and $E^t \in \mathbb{R}^{d_t \times |V_{TAG}|}$ is a tag embedding matrix and $d_t$ is the tag embedding dimensionality.
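The tagging scheme and the enriched encoder input can be illustrated with a short sketch. It assumes that the "combination" of the two embeddings is a concatenation (the two dimensionalities differ, so element-wise addition would not apply) and that the begin tag is spelled with a "_B" suffix; the helper names and toy embedding values are hypothetical.

```python
def tag_chunk(words, category):
    """Tag a chunk: the first word gets a begin tag, the rest share the category tag."""
    if not words:
        return []
    return [category + "_B"] + [category] * (len(words) - 1)


def encoder_input(word_vec, tag_vec):
    """Concatenate a word embedding (d_w values) with a tag embedding (d_t values)."""
    return list(word_vec) + list(tag_vec)


# Hypothetical toy embeddings: d_w = 4 for words, d_t = 2 for tags.
word_emb = {"information": [0.1, 0.2, 0.3, 0.4], "security": [0.5, 0.6, 0.7, 0.8]}
tag_emb = {"NP_B": [1.0, 0.0], "NP": [0.0, 1.0]}

words = ["information", "security"]  # English glosses stand in for the source words
tags = tag_chunk(words, "NP")
inputs = [encoder_input(word_emb[w], tag_emb[t]) for w, t in zip(words, tags)]
```

Each encoder input then has d_w + d_t components, so the tag information reaches the encoder without changing the word embedding table.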

Phrase Memory
The phrase memory stores relevant target phrases provided by an SMT model, which is trained on the same bilingual corpora. At each decoding step, the memory is first erased and re-written by the SMT model, the decoding of which is based on the translation information provided by the NMT model. Then, the proposed model retrieves phrases along with their probabilities $P_{phrase}$ from the memory.
Writing to Phrase Memory
Given a partial translation $y_{<t} = \{y_1, y_2, \ldots, y_{t-1}\}$ generated by NMT, the SMT model picks potential phrases extracted from the translation table. The phrases are scored with multiple SMT features, including the language model score, the translation probabilities, the reordering score, and so on. In particular, the reordering score depends on alignment information between source and target words, which is derived from the attention distribution produced by the NMT model (Wang et al., 2017). The SMT coverage vector of Wang et al. (2017) is also introduced to avoid repeated phrasal recommendations. In our work, a potential phrase is a phrase with a high SMT score, which is defined as follows:

$$\mathrm{SMT}_{score}(p_l \mid y_{<t}, x) = \sum_{m=1}^{M} w_m h_m(p_l, x(p_l)) \tag{10}$$

where $p_l$ is a target phrase and $x(p_l)$ is its corresponding source span, $h_m(p_l, x(p_l))$ is an SMT feature function, $w_m$ is its weight, and $M$ is the number of feature functions. The feature weights can be tuned by the minimum error rate training (MERT) algorithm (Och, 2003).
This leads to a better interaction between SMT and NMT models. It should be emphasized that our memory is dynamically updated at each decoding step based on the decoding history from both SMT and NMT models.
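The scoring used when writing to the memory is a weighted sum of SMT feature values, after which only the highest-scoring candidates are kept. The sketch below illustrates this under stated assumptions: the feature names, feature values and weights are invented for the example, and in practice the weights would come from tuning rather than being set by hand.

```python
def smt_score(features, weights):
    """Weighted sum of SMT feature values for one candidate phrase."""
    return sum(weights[name] * value for name, value in features.items())


def write_memory(candidates, weights, k):
    """Score all candidates and keep the k best as (phrase, score) pairs."""
    scored = [(smt_score(feats, weights), phrase) for phrase, feats in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(phrase, score) for score, phrase in scored[:k]]


# Hypothetical candidates with log-domain feature values.
candidates = [
    ("information security", {"lm": -2.0, "tm": -1.0, "reorder": -0.5}),
    ("security information", {"lm": -4.0, "tm": -1.5, "reorder": -2.0}),
    ("information safe", {"lm": -5.0, "tm": -2.0, "reorder": -0.5}),
]
weights = {"lm": 0.5, "tm": 0.3, "reorder": 0.2}  # hypothetical feature weights
memory = write_memory(candidates, weights, k=2)
```

Because the memory is rebuilt at every decoding step, a function like `write_memory` would be called once per step with candidates that reflect the current decoding history.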
The proposed model is very flexible: the phrase memory can be either fully dynamically generated by an SMT model or directly extracted from a bilingual dictionary or any other bilingual resource storing idiomatic translations or bilingual multi-word expressions, which may lead to a further improvement.
Reading Phrase Memory
When phrases are read from the memory, they are rescored by a neural network based score function. The score function takes as input the phrase itself and decoding information from NMT ($i = t(p_l)$ denotes the current decoding step):

$$\mathrm{score}_{phrase}(p_l) = g_s(f_s(e(p_l), s_i, y_{i-1}, c_i)) \tag{11}$$

where $f_s(\cdot)$ is the scoring network and $g_s(\cdot)$ is either an identity or a non-linear function. $e(p_l)$ is the representation of phrase $p_l$, which is modeled by a recurrent neural network. Again, $s_i$ is the decoder state, $y_{i-1}$ is the last generated word, and $c_i$ is the context vector. The scores are normalized over all phrases in the phrase memory, and the probability for phrase $p_l$ is calculated as

$$P_{phrase}(p_l) = \frac{\exp(\mathrm{score}_{phrase}(p_l))}{\sum_{l'} \exp(\mathrm{score}_{phrase}(p_{l'}))} \tag{12}$$

The probability calculation is controlled by parameters that are trained together with the parameters of the NMT model.
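The normalization of phrase scores into probabilities can be sketched as below. The sketch assumes a softmax normalization, which is one natural reading of "normalized" here; the function name and the example scores are hypothetical, and the scores stand in for the output of the neural score function.

```python
import math


def phrase_probabilities(scores):
    """Normalize phrase scores into a distribution (softmax, assumed)."""
    m = max(scores.values())
    exps = {phrase: math.exp(s - m) for phrase, s in scores.items()}
    z = sum(exps.values())
    return {phrase: e / z for phrase, e in exps.items()}


# Hypothetical rescored phrases currently held in the memory.
scores = {"national laboratory": 2.0, "state laboratory": 1.0, "country lab": 0.0}
p_phrase = phrase_probabilities(scores)
```

The normalization runs only over the phrases currently in the memory, so the resulting distribution changes whenever the memory is re-written.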

Training
Formally, we train both the default parameters of standard NMT and the new parameters associated with phrase generation on a set of training examples $\{[x^n, y^n]\}_{n=1}^{N}$ by maximizing the log-likelihood:

$$\mathcal{L} = \sum_{n=1}^{N} \log P(y^n \mid x^n) \tag{13}$$

where $P(y^n \mid x^n)$ is defined in Equation 7. Ideally, the trained model is expected to produce a higher balancing weight $\lambda$ and phrase probability $P_{phrase}$ when a phrase is selected from the memory, and lower scores in other cases.

Decoding
During testing, the NMT decoder generates a target sentence that consists of a mixture of words and phrases. Due to the different granularities of words and phrases, we design a variant of the beam search strategy. At decoding step $i$, we first compute $P_{phrase}$ for all phrases in the phrase memory and $P_{word}$ for all words in the NMT vocabulary. Then the balancer outputs a balancing weight $\lambda_i$, which is used to scale the phrase and word probabilities: $\lambda_i \times P_{phrase}$ and $(1 - \lambda_i) \times P_{word}$. The outputs are now normalized probabilities over the concatenation of the phrase memory and the general NMT vocabulary. Finally, the NMT decoder generates the phrase or word with the highest probability. If a target phrase in the phrase memory has the highest probability, the decoder generates the target phrase to complete the multi-word phrase generation process, and updates its decoding state by consuming the words in the selected phrase as described in Equation 3. All translation hypotheses are placed in the corresponding beams according to the number of generated target words.
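One decoding step of this procedure can be sketched as follows. This is a simplified, greedy illustration rather than a full beam search: it mixes the two distributions with the balancing weight, picks the best candidate, and computes which beam a hypothesis would fall into based on its number of generated target words. All names and probability values are hypothetical.

```python
def mix_step(p_word, p_phrase, lam):
    """Scale word probabilities by (1 - lam) and phrase probabilities by lam."""
    combined = {("word", w): (1.0 - lam) * p for w, p in p_word.items()}
    combined.update({("phrase", ph): lam * p for ph, p in p_phrase.items()})
    return combined


def best_candidate(combined):
    """Return ((kind, text), probability) for the most probable candidate."""
    return max(combined.items(), key=lambda item: item[1])


def beam_index(words_so_far, kind, text):
    """Hypotheses are grouped by the number of generated target words."""
    return words_so_far + (len(text.split()) if kind == "phrase" else 1)


# Hypothetical distributions at one decoding step.
p_word = {"the": 0.6, "national": 0.4}
p_phrase = {"national laboratory": 0.9, "state lab": 0.1}
combined = mix_step(p_word, p_phrase, lam=0.7)
(kind, text), prob = best_candidate(combined)
next_beam = beam_index(4, kind, text)  # the hypothesis already has 4 target words
```

Since both input distributions sum to one, the mixed distribution also sums to one, which is what allows words and phrases to compete directly within a single step.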

Experiments
In this section, we evaluate the effectiveness of our model on the Chinese→English machine translation task. The training corpora consisted of about 1.25 million sentence pairs, with 27.9 million Chinese words and 34.5 million English words respectively. We used the NIST 2006 (NIST06) dataset as the development set, and the NIST 2004 (NIST04), 2005 (NIST05) and 2008 (NIST08) datasets as test sets. We report experiment results with case-insensitive BLEU scores.
We compared our proposed model with two state-of-the-art systems:
* Moses: a state-of-the-art phrase-based SMT system (Koehn et al., 2007) with its default settings, where feature function weights are tuned by the minimum error rate training (MERT) algorithm (Och, 2003).
* RNNSearch: an in-house implementation of the attention-based NMT system (Bahdanau et al., 2015) with its default settings.
For Moses, we used the full bilingual training data to train the phrase-based SMT model and the target portion of the bilingual training data to train a 4-gram language model using KenLM. We ran GIZA++ on the training data in both Chinese-to-English and English-to-Chinese directions and applied the "grow-diag-final" refinement rule (Koehn et al., 2003) to obtain word alignments. The maximum phrase length was set to 7.
For RNNSearch, we generally followed the settings in previous work (Bahdanau et al., 2015; Tu et al., 2017a,b). We only kept a shortlist of the 30,000 most frequent words in Chinese and English, covering approximately 97.7% and 99.3% of the data in the two languages respectively. We constrained our source and target sequences to a maximum length of 50 words in the training data. The size of the embedding layer on both sides was set to 620 and the size of the hidden layer was set to 1000. We used minibatch stochastic gradient descent (SGD) with a batch size of 80 together with Adadelta (Zeiler, 2012) to train the NMT models. The Adadelta hyperparameters ρ and ε were set to 0.95 and 10^-6. We clipped the gradient norm to 1.0 (Pascanu et al., 2013). We also adopted the dropout technique. Dropout was applied only on the output layer and the dropout rate was set to 0.5. We used a simple beam search decoder with beam size 10 to find the most likely translation.
For the proposed model, we used a Chinese chunker (Zhu et al., 2015) to chunk the source-side Chinese sentences. 13 chunking tags appeared in our chunked sentences and the size of the chunking tag embedding was set to 10. We used the trained phrase-based SMT model to translate the source-side chunks. The top 5 translations according to their translation scores (Equation 10) were kept, and among them multi-word phrases were used as phrasal recommendations for each source chunk phrase. For a source-side chunk phrase, if there were phrasal recommendations from SMT, the output chunk tag was used as its chunking tag feature as described in Section 3.1. Otherwise, the words in the chunk were treated as general words by being tagged with the default tag. In the phrase memory, we only kept the top 7 target translations with the highest SMT scores at each decoding step. We used a feed-forward neural network with two hidden layers for both the balancer (Equation 8) and the scoring function (Equation 11). The numbers of units in the hidden layers were set to 2000 and 500 respectively. We used a backward RNN encoder to learn the phrase representations of target phrases in the phrase memory.

Number of Sentences Affected by Generated Phrases
We also check the number of translations that contain phrases generated by the proposed model, as shown in Table 2. As seen, a large portion of translations take the recommended phrases, and the number increases when the chunking tag feature is used. 7 Considering BLEU scores reported in Table 1, we believe that the chunking tag feature benefits the proposed model on its phrase generation.

Syntactic Categories of Generated Phrases
We first investigate which category of phrases is more likely to be selected by the proposed approach. There are some phrases, such as noun phrases (NPs, e.g., "national laboratory" and "vietnam airlines") and quantifier phrases (QPs, e.g., "15 seconds" and "two weeks"), that we expect to be favored by our approach. Statistics shown in Table 3 confirm our hypothesis. Let us first consider all generated phrases (i.e., column "All"): most selected phrases are noun phrases (81.0%) and quantifier phrases (10.8%). Among all generated phrases, 44.5% are fully correct. Specifically, NPs have relatively higher generation accuracy (i.e., 47.8% = 38.7%/81.0%), while VPs have lower accuracy (i.e., 21.2% = 1.7%/8.0%). By looking into the wrong cases, we found that most errors are related to verb tense, which is a drawback of SMT models. Concerning the newly introduced phrases that cannot be found in baseline translations (i.e., column "New"), 13.2% of generated phrases are both new and fully correct, and these contribute most to the performance improvement. We also find that most newly introduced verb phrases and quantifier phrases are not correct; their patterns can be well learned by word-based NMT models.

Table 3: Percentages of phrase categories to the total number of generated ones. "All" denotes all generated phrases, and "New" means new phrases that cannot be found in translations generated by the baseline system. "Total" is the total number of generated phrases and "Correct" denotes the fully correct ones.

7 The numbers on NIST08 are relatively lower since part of the test set contains sentences from Web forums, which contain fewer multi-word expressions.

Number of Words in Generated Phrases
Table 4 lists the distribution of generated phrases based on the number of words they contain. As seen, most generated phrases are short phrases (e.g., 2-gram and 3-gram phrases), which also contribute most to the new and fully correct phrases (i.e., 12.3% = 9.1%+3.2%). Focusing on long phrases (e.g., order 4), most of them are newly introduced (10.6% out of 13.1%). Unfortunately, only a small portion of these phrases are fully correct, since long phrases have a higher chance of containing one or two unmatched words.

Table 4: Percentages of phrases with different word counts to the total number of generated ones.

SYSTEM                          Test
+memory                         32.95
+memory +NULL                   31.63
+memory +chunking tag           33.55
+memory +chunking tag +NULL     30.81

Table 5: Additional experiment results on the translation task to directly measure the improvement obtained by the phrase generation. "+NULL" denotes that we replace the generated target phrases with a special symbol "NULL" in test sets. BLEU scores in the table are case insensitive.

Effect of Generated Phrases on Translation Performance
Note that the proposed model benefits not only from fully matched phrases, but also from partially matched phrases. For example, the baseline system translates one Chinese source phrase in a word-by-word manner and outputs "state aviation and space department". The generated phrase provided by SMT is "national aviation and space administration", but the only correct reference is "national aeronautics and space administration". The generated phrase is not fully correct but is still useful.
To directly measure the improvement obtained by phrase generation, we replace the generated target phrases with a special symbol "NULL" in the test sets. As shown in Table 5, when the generated target phrases are deleted, the translation performance of "+memory +chunking tag" and "+memory" decreases by 2.74 and 1.32 BLEU points respectively. Moreover, translation performance on NIST08 decreases less than on NIST04 and NIST05 in both settings. The reason is that NIST08 contains sentences from web data, so target phrases provided from a different domain have little influence on it (see footnote 9). The overall results demonstrate that neural machine translation benefits from phrase translation.

Effect of Balancer
Weight                  Test
Dynamic                 33.55
Constant (λ = 0.1)      31.35

Table 6: Translation performance with a variety of balancing weight strategies. "Dynamic" is the proposed approach and "Constant (λ = 0.1)" denotes fixing the balancing weight to 0.1. BLEU scores in the table are case insensitive.
The balancer, which coordinates phrase generation and word generation, is crucial for the proposed model. We conducted an additional experiment to validate the effectiveness of the neural network based balancer, using the setting "+memory +chunking tag" as the baseline system. In this experiment, we fixed the balancing weight λ (Equation 8) to 0.1 during training and testing. As shown in Table 6, using a fixed value for the balancing weight (Constant (λ = 0.1)) decreases the translation performance sharply, from 33.55 to 31.35 BLEU. This demonstrates that the neural network based balancer is an essential component of the proposed model.

Comparison to Word-Level Recommendations and Discussions
Our approach is related to our previous work (Wang et al., 2017), which integrates SMT word-level knowledge into NMT. To make a comparison, we conducted experiments following the settings in Wang et al. (2017). The comparison results are reported in Table 7. We find that our approach is marginally better than the word-level model proposed in Wang et al. (2017), by 0.28 BLEU points.

SYSTEM                          Test
+word level recommendation      33.27
+memory +chunking tag           33.55

Table 7: Experiment results on the translation task. "+word level recommendation" is the proposed model in (Wang et al., 2017). BLEU scores in the table are case insensitive.

9 The parallel training data are mainly from the news domain.

In our approach, the SMT model translates source-side chunk phrases using the NMT decoding information. Although we use high-quality target phrases as phrasal recommendations, our approach still suffers from errors in segmentation and chunking. For example, the target phrase "laptop computers" cannot be recommended by the SMT model if the corresponding Chinese phrase is not chunked as a phrase unit. This is the reason why some sentences do not have corresponding phrasal recommendations (Table 2). Therefore, our approach can be further enhanced if we can reduce the error propagation from the segmenter or chunker, for example, by using n-best chunk sequences instead of the single best chunk sequence.
Additionally, we observe that some target phrasal recommendations are also generated by the baseline system in a word-by-word manner. These phrases, even when taken as parts of final translations by the proposed model, do not lead to improvements in terms of BLEU, as they already occur in translations from the baseline system. For example, the proposed model successfully carries out the phrase generation mode to generate the target phrase "guangdong province" (the translation of the corresponding Chinese phrase), which already appears in the output of the baseline system.
External resources such as bilingual dictionaries are complementary to the SMT phrasal recommendations and compatible with the proposed model, so we believe that the proposed model can be further improved by using them.

Related work
Our work is related to the following research topics on NMT:

Generating phrases for NMT
In these studies, the generated NMT multi-word phrases come either from an SMT model or from a bilingual dictionary. In syntactically guided neural machine translation (SGNMT), the NMT decoder uses phrase translations produced by the hierarchical phrase-based SMT system Hiero as hard decoding constraints. In this way, syntactic phrases are generated by the NMT decoder (Stahlberg et al., 2016b). Zhang and Zong (2016) use an SMT translation system, into which an additional bilingual dictionary is integrated, to synthesize pseudo-parallel sentences and feed the sentences into the training of NMT in order to translate low-frequency words or phrases. Tang et al. (2016) propose an external phrase memory that stores phrase pairs in symbolic form for NMT. During decoding, the NMT decoder queries the phrase memory and properly generates phrase translations. The significant differences between these efforts and ours are 1) that we dynamically generate phrase translations via an SMT model, and 2) that at the same time we modify the encoder to incorporate structural information to enhance the capability of NMT in phrase translation.

Incorporating linguistic information into NMT
NMT is essentially a sequence-to-sequence mapping network that treats the input/output units, e.g., words, subwords, or characters (Chung et al., 2016; Costa-jussà and Fonollosa, 2016), as non-linguistic symbols. However, linguistic information can be viewed as task-specific knowledge, which may be a useful supplement to the sequence-to-sequence mapping network. To this end, various kinds of linguistic annotations have been introduced into NMT to improve its translation performance. One line of work enriches the input units of NMT with various linguistic features, including lemmas, part-of-speech tags, syntactic dependency labels and morphological features. García-Martínez et al. (2016) propose factored NMT using the morphological and grammatical decomposition of the words (factors) in output units. Eriguchi et al. (2016) explore the phrase structures of input sentences and propose a tree-to-sequence attention model for the vanilla NMT model. Other work linearizes source-side parse trees to obtain structural label sequences and explicitly incorporates the structural sequences into NMT, while Aharoni and Goldberg (2017) propose to incorporate target-side syntactic information into NMT by serializing the target sequences into linearized, lexicalized constituency trees. Topic knowledge has also been integrated into NMT for domain/topic adaptation.

Combining NMT and SMT
A variety of approaches have been explored for leveraging the advantages of both NMT and conventional SMT. He et al. (2016) integrate SMT features with the NMT model under the log-linear framework in order to help NMT alleviate the limited vocabulary problem (Luong et al., 2015; Jean et al., 2015) and the coverage problem (Tu et al., 2016). Arthur et al. (2016) observe that NMT is prone to making mistakes in translating low-frequency content words and therefore attempt to incorporate discrete translation lexicons into the NMT model, to alleviate the imprecise translation problem (Wang et al., 2017). Motivated by the complementary strengths of syntactical SMT and NMT, different combination schemes of Hiero and NMT have been exploited to form SGNMT (Stahlberg et al., 2016a,b). Wang et al. (2017) propose an approach to incorporate the SMT model into attention-based NMT. They combine NMT posteriors with SMT word recommendations through linear interpolation implemented by a gating function which dynamically assigns the weights. Niehues et al. (2016) propose to use SMT to pre-translate the inputs into target translations and employ the target pre-translations as input sequences in NMT. A neural system combination framework has also been proposed to directly combine NMT and SMT outputs. The combination of NMT and SMT has also been introduced in interactive machine translation to improve the system's suggestion quality (Wuebker et al., 2016). In addition, word alignments from the traditional SMT pipeline are also used to improve the attention mechanism in NMT (Cohn et al., 2016; Mi et al., 2016).

Conclusion
In this paper, we have presented a novel model to translate source phrases and generate target phrase translations in NMT by integrating a phrase memory into the encoder-decoder architecture. During decoding, the SMT model dynamically generates relevant target phrases with contextual information provided by the NMT model and writes them to the phrase memory. Then the proposed model reads the phrase memory and uses the balancer to make probability estimations for the phrases in the phrase memory. Finally, the NMT decoder generates whichever has the highest probability: a phrase from the phrase memory or a word from the vocabulary. Experiment results on Chinese→English translation have demonstrated that the proposed model can significantly improve the translation performance.