Neural Machine Translation Leveraging Phrase-based Models in a Hybrid Search

In this paper, we introduce a hybrid search for attention-based neural machine translation (NMT). A target phrase learned with statistical MT models extends a hypothesis in the NMT beam search when the attention of the NMT model focuses on the source words translated by this phrase. Phrases added in this way are scored with the NMT model, but also with SMT features including phrase-level translation probabilities and a target language model. Experimental results on German-to-English news domain and English-to-Russian e-commerce domain translation tasks show that using phrase-based models in NMT search improves MT quality by up to 2.3% BLEU absolute as compared to a strong NMT baseline.


Introduction
Neural machine translation has become state-of-the-art in recent years, reaching higher translation quality than statistical phrase-based machine translation (PBMT) on many tasks. Human analysis (Bentivogli et al., 2016) showed that NMT makes significantly fewer reordering errors and selects correct word forms more often than PBMT when the target language is morphologically rich. Overall, the fluency of the MT output improves when NMT is used, and the number of lexical choice errors is also reduced. However, state-of-the-art NMT approaches based on an encoder-decoder architecture with an attention mechanism exhibit weaknesses that sometimes lead to MT errors which a phrase-based MT system does not make. In particular, PBMT can usually better translate rare words (e.g. singletons), as well as memorize and use phrasal translations. NMT has problems translating rare words because of limitations on the vocabulary size, and because word embeddings are used to represent both source and target words: the embedding of a rare word cannot be trained reliably.
Another handicap of NMT is the general difficulty of fixing errors made by a neural MT system. NMT does not explicitly use or save word-to-word or phrase-to-phrase mappings, and its search is a target word beam search with almost no constraints, so there is no single component that can be corrected once an error has been found. In real-life applications of MT systems, it is important to fix certain errors quickly to avoid negative user feedback or other (e.g. legal) consequences. An error identified in the output of a PBMT system can be fixed by tracing the phrase pair that caused it and then down-weighting or even removing that phrase pair. It is also easy to add an "override" translation in PBMT.
In this work, we combine the strengths of NMT and PBMT approaches by introducing a novel hybrid search algorithm. In this algorithm, the standard NMT beam search is extended with phrase translation hypotheses from a statistical phrase table. The decision on when to use what phrasal translations is taken based on the attention mechanism of the NMT model, which provides a soft coverage of the source sentence words. All partial phrasal translations are scored with the NMT decoder and can be continued with a word-based NMT translation candidate or another phrasal translation candidate.
The proposed search algorithm uses a log-linear model in which the NMT translation score is combined with standard phrase translation scores, including a target n-gram language model (LM) score. Thus, an LM trained on additional monolingual data can be used. The decisions on the word order in the produced target translation are taken based only on the states of the NMT decoder.
This paper is structured as follows. We review related work in Section 1.1. The baseline NMT model we use is described in Section 2, where we also recap the log-linear model combination used in PBMT. Section 3 presents the details of the proposed hybrid search. Experimental results are presented in Section 4, followed by conclusions and outlook in Section 5.

Related Work
In the line of research closely related to our approach, neural models are used as additional features in vanilla phrase-based systems. Examples include the work of (Devlin et al., 2014) and (Junczys-Dowmunt et al., 2016). Such approaches have certain limitations. First, the search space of the model is still restricted by what can be produced using a phrase table extracted from parallel data based on word alignments. Second, the organization of the search, in which only a limited target word history (e.g. the last 4 target words) is available for each partial hypothesis, makes it difficult to integrate recurrent neural network LMs and translation models which take all previously generated target words into account. That is why, for instance, the attention-based NMT models were usually applied only in rescoring (Peter et al., 2016).
In (Stahlberg et al., 2017), a two-step translation process is used, where in the first step an SMT translation lattice is generated, and in the second step the NMT decoder combines NMT scores with the Bayes-risk of the translations according to the lattice. In contrast, we explicitly use phrasal translations and language model scores in an integrated search.
In (Arthur et al., 2016), a statistical word lexicon is used to influence NMT hypotheses, also based on the attention mechanism. (Gülçehre et al., 2015) combine target n-gram LM scores with NMT scores to find the best translation. (He et al., 2016) also use a target LM, but add further SMT features such as word penalty and word lexica to the NMT beam search. To the best of our knowledge, no previous work extends the beam search with phrasal translation hypotheses of PBMT, as we propose in this paper.
In (Tang et al., 2016), the NMT decoder is modified to switch between using externally defined phrases and standard NMT word hypotheses. However, only one target phrase per source phrase is considered, and the reported improvements are significant only when manually selected phrase pairs (mostly for rare named entities) are used.
Somewhat related to our work is the concept of coverage-based NMT (Tu et al., 2016), where the model architecture is changed to explicitly account for source coverage. In our work, we use a standard NMT architecture, but track coverage with accumulated attention weights.

Neural MT
Neural MT maximizes the conditional log-likelihood of the target sentence $E: e_1, \dots, e_I$ given the source sentence $F: f_1, \dots, f_J$:
\[ \hat{\theta} = \operatorname*{argmax}_{\theta} \sum_{n=1}^{N} \log p_{\theta}(E_n \mid F_n), \]
where $(E_n, F_n)$ refers to the $n$-th training sentence pair in a dataset $D$, and $N$ denotes the total number of sentence pairs in the training corpus. With the encoder-decoder architecture, the conditional probability can be written as
\[ p(e_1 \cdots e_I \mid f_1 \cdots f_J) = \prod_{i=1}^{I} p(e_i \mid e_{i-1} \cdots e_1, c) \quad \text{with} \quad p(e_i \mid e_{i-1} \cdots e_1, c) = g(s_i, e_{i-1}, c), \]
where $I$ is the length of the target sentence and $J$ is the length of the source sentence, $c$ is a fixed-length vector to encode the source sentence, $s_i$ is a hidden state of the RNN at time step $i$, and $g(\cdot)$ is a nonlinear function to approximate the word probability. When the attention mechanism is used, the vector $c$ is replaced by a time-variant representation $c_i$ that is a weighted summary over a sequence of annotations $(h_1, \dots, h_J)$, where $h_j$ contains information about the whole input sentence, but with a strong focus on the parts surrounding the $j$-th word. Then, the context vector can be defined as
\[ c_i = \sum_{j=1}^{J} \alpha_{ij} h_j \quad \text{with} \quad \alpha_{ij} = \frac{\exp(r_{ij})}{\sum_{j'=1}^{J} \exp(r_{ij'})}. \]
Therefore, $\alpha_{ij}$ is normalized over all source positions $j$. Also, $r_{ij} = a(s_{i-1}, h_j)$ is the attention model used to calculate the log-likelihood of aligning the $i$-th target word to the $j$-th source word.
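The attention step can be illustrated with a minimal sketch in plain Python. It assumes the alignment scores $r_{ij}$ for one target position have already been computed by the attention model; the function name `attention_context` and the toy inputs are hypothetical and serve only as illustration.

```python
import math

def attention_context(scores, annotations):
    """Return (alpha, context) for one decoder time step.

    scores      -- r_ij for all source positions j (unnormalized)
    annotations -- h_j for all source positions j, lists of equal length
    """
    # Softmax over source positions; subtracting the maximum keeps exp() stable.
    top = max(scores)
    exps = [math.exp(r - top) for r in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # Context vector c_i: attention-weighted sum of the annotations.
    dim = len(annotations[0])
    context = [sum(a * h[d] for a, h in zip(alpha, annotations)) for d in range(dim)]
    return alpha, context

# Toy input (assumed values): three source positions, two-dimensional annotations.
alpha, c = attention_context([2.0, 0.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
```

Because the weights sum to one, the context vector is a convex combination of the annotations, which is what later allows accumulated attention to be read as a soft coverage of the source sentence.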

Phrase-based MT
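The log-linear model, as introduced in (Och and Ney, 2002), allows decomposing the translation probability $Pr(e_1^I \mid f_1^J)$ by using an arbitrary number $M$ of features $h_m(f_1^J, e_1^I)$. Each feature is multiplied by a corresponding scaling factor $\lambda_m$:
\[ Pr(e_1^I \mid f_1^J) = \frac{\exp\left(\sum_{m=1}^{M} \lambda_m h_m(f_1^J, e_1^I)\right)}{\sum_{e'^{I'}_1} \exp\left(\sum_{m=1}^{M} \lambda_m h_m(f_1^J, e'^{I'}_1)\right)}. \]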
The standard PBMT approach uses a log-linear model in which bidirectional phrasal and lexical scores, language model scores, distortion scores, word penalties and phrase penalties are combined as features.
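As a minimal sketch of the log-linear combination, the following Python snippet computes the weighted feature sum and normalizes it over an explicit candidate list. The feature names, weights, and values are invented for illustration; a real decoder never enumerates all translations and only compares unnormalized scores.

```python
import math

def loglinear_score(features, weights):
    """Weighted feature sum: sum over m of lambda_m * h_m."""
    return sum(weights[name] * value for name, value in features.items())

def loglinear_posterior(candidates, weights):
    """Normalize exp(score) over an explicit list of candidate translations."""
    scores = [loglinear_score(f, weights) for f in candidates]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical scaling factors and feature values (log-probabilities and counts).
weights = {"tm": 1.0, "lm": 0.5, "wp": -0.1}
candidates = [
    {"tm": -1.0, "lm": -2.0, "wp": 3.0},
    {"tm": -2.0, "lm": -4.0, "wp": 4.0},
]
posterior = loglinear_posterior(candidates, weights)
best = max(range(len(candidates)), key=lambda i: posterior[i])
```

Since the denominator is the same for every candidate, the argmax can be taken directly over the weighted sums, which is why search and tuning operate on unnormalized scores.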

Hybrid Approach
In this section we describe our proposed hybrid NMT approach. The algorithm allows translations to be generated partially by phrases and partially by words. As in SMT, phrases can consist of only a single token. Section 3.1 describes the models we use to score hypotheses. The search algorithm is presented in Section 3.2.

Log-linear Combination
We use a log-linear model combination to introduce SMT models into the NMT search. Since translations can be partially generated by phrases, we introduce the phrase segmentation $s_1^K$ as a hidden variable into the models similarly to (Zens and Ney, 2008), where $K$ is the number of phrases used in the translation. Note that, unlike standard PBMT, $s_1^K$ does not need to cover the whole source sentence, as parts of the translation can be generated by words. Using the maximum approximation, the search criterion then is
\[ \hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{I, e_1^I} \max_{K, s_1^K} \sum_{m=1}^{M} \lambda_m h_m(e_1^I, s_1^K, f_1^J). \tag{1} \]
Let $\tilde{f}_k, \tilde{e}_k$ be the chosen phrase pairs in the segmentation $s_1^K$ for $k = 1, \dots, K$. In our experiments with the proposed hybrid search, we use the following features:
1. The NMT feature $h_{\text{NMT}}$.
2. The word penalty feature $h_{\text{WP}}$.
3. The source word coverage feature $h_{\text{SWC}}$. The purpose of this feature is to control the usage of phrases.
4. The phrase penalty feature $h_{\text{PP}}$ counts the number of phrases used. Together with the word penalty and the source word coverage feature, the phrase penalty can control the length of chosen phrases.
5. The n-gram language model feature $h_{\text{LM}}$.
6. The bidirectional phrase features $h_{\text{Phr}}$ and $h_{\text{iPhr}}$. Note that these features are only applied for those parts of the translation that are generated by phrases. The other parts get a phrase score of zero.
The scaling factors $\lambda_m$ are tuned with minimum error rate training (MERT) (Och, 2003) on n-best lists of the development set.
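The following Python sketch shows how such feature values could be collected for a translation that mixes word and phrase segments. It rests on assumptions that are not spelled out in the text: the word penalty is taken to count target words, the source word coverage feature is taken to count source words translated by phrases, and the `Segment` class, the function names, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    target: List[str]            # target words produced by this segment
    nmt_logprob: float           # NMT log-probability of these words
    lm_logprob: float            # n-gram LM log-probability of these words
    source_len: int = 0          # number of source words; > 0 only for phrases
    phr_logprob: float = 0.0     # phrase score, zero for word segments
    iphr_logprob: float = 0.0    # inverse phrase score, zero for word segments

def hybrid_features(segments):
    """Collect the feature values h_m of one (partial) translation."""
    phrases = [s for s in segments if s.source_len > 0]
    return {
        "NMT": sum(s.nmt_logprob for s in segments),
        "WP": sum(len(s.target) for s in segments),    # assumed: target word count
        "SWC": sum(s.source_len for s in phrases),     # assumed: source words covered by phrases
        "PP": len(phrases),
        "LM": sum(s.lm_logprob for s in segments),
        "Phr": sum(s.phr_logprob for s in phrases),
        "iPhr": sum(s.iphr_logprob for s in phrases),
    }

def hybrid_score(segments, weights):
    """Log-linear score: sum over m of lambda_m * h_m."""
    feats = hybrid_features(segments)
    return sum(weights[name] * value for name, value in feats.items())

# One word segment followed by one two-word phrase segment (invented values).
segments = [
    Segment(["the"], -0.2, -1.0),
    Segment(["system", "is"], -0.5, -2.0, source_len=2, phr_logprob=-0.7, iphr_logprob=-0.9),
]
weights = {"NMT": 1.0, "WP": 0.1, "SWC": 0.2, "PP": -0.3, "LM": 0.5, "Phr": 0.1, "iPhr": 0.1}
feats = hybrid_features(segments)
score = hybrid_score(segments, weights)
```

Word segments contribute zero to the phrase-specific features, which mirrors the statement that parts of the translation generated by words get a phrase score of zero.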

Search
The algorithm is based on the beam search for NMT, which generates translations one word per time step in a left-to-right fashion. We modify this search to allow hypothesizing phrases in addition to normal word hypotheses. The phrases are suggested based on the neural attention, starting from the source position with the maximal current attention. We only suggest phrases if a source position is focused. We check that suggested phrases do not overlap with already translated source words by keeping track of the sum of attention in previous time steps for each source position. Thus, the problem of global reordering is left entirely to the NMT model and we follow the attention when hypothesizing phrases.
Hypotheses are scored by NMT and SMT models. The beam is divided into two parts of fixed size: the word beam and the phrase beam. The phrase beam is used to score target phrases which were hypothesized from an entry in a previous word beam. In order to score a target phrase consisting of k words with the NMT model, we use k time steps, allowing us to keep the efficiency of batched NMT scoring. Once a target phrase has been fully scored (and if the hypothesis has not been pruned), the hypothesis is returned to the word beam. Both beams are generated and pruned independently in each time step.
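A minimal sketch of the independent pruning of the two beams is given below. Hypotheses are represented as hypothetical `(score, text)` pairs, and the helper names `prune` and `prune_beams` are assumptions made for this illustration only.

```python
import heapq

def prune(hypotheses, beam_size):
    """Keep the beam_size highest-scoring (score, hypothesis) pairs."""
    return heapq.nlargest(beam_size, hypotheses, key=lambda pair: pair[0])

def prune_beams(word_candidates, phrase_candidates, n_w, n_p):
    """Prune the word beam and the phrase beam independently of each other."""
    return prune(word_candidates, n_w), prune(phrase_candidates, n_p)

# Invented candidates: scores are partial log-linear scores.
word_candidates = [(-1.0, "a"), (-3.0, "b"), (-2.0, "c")]
phrase_candidates = [(-0.5, "x y"), (-4.0, "z w")]
word_beam, phrase_beam = prune_beams(word_candidates, phrase_candidates, n_w=2, n_p=1)
```

Keeping two fixed-size beams means that a partially scored phrase never competes directly with a word hypothesis for the same slot, so phrase candidates are not pruned merely because only their first word has been scored so far.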
The algorithm has some hyper-parameters that need to be set manually. First, we have the beam size $N_p$ for phrase hypotheses and the beam size $N_w$ for word hypotheses. Second, $\tau_{\text{focus}}$ is the minimum attention that needs to be on a source position to consider it for extending with a phrase translation candidate whose source phrase starts on that position. Third, $\tau_{\text{cov}}$ is the minimum sum of attention of a source position over previous time steps at which it is considered to be covered. We do not hypothesize phrases that overlap with covered positions.
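In the following, we describe the search in detail. Let $f_1^J$ be the source sentence. Before search, we run the standard phrase matching algorithm on the source sentence to retrieve the translation options $E(j, j')$ for source positions $1 \le j < j' \le J$ from a given SMT phrase table. With each hypothesis $h$, we associate the following items:
• $C(h, j)$ is the sum of the NMT attention to source position $j$ involved in generating the target words of $h$. This can be considered as a soft coverage vector for $h$.
• $Q(h)$ is the partial log-linear score of $h$ according to Equation 1.
• $E(h)$ is the n-gram target word history of $h$.
• If $h$ is a phrase hypothesis with target phrase $\tilde{e}$, of which $k$ words already have been scored by NMT, then $P(h) := (\tilde{e}, k)$ is the phrase state.
Also, each hypothesis is associated with its corresponding NMT hidden state. We initialize the beam to consist of an empty word hypothesis. In the steps below, $B_w$ denotes the word beam, $B_p$ the phrase beam, and $\alpha_{h,j}$ the current NMT attention of hypothesis $h$ on source position $j$. Each step of the beam search proceeds as follows:
3. Generate new word hypotheses: for each previous word hypothesis $h \in B_w$ and each target word $e$, create a new hypothesis $h' = (h, e)$ with the score
\[ Q(h') = Q(h) + \lambda_{\text{NMT}} \cdot \log p_{\text{NMT}}(e \mid h) + \lambda_{\text{LM}} \cdot \log p_{\text{LM}}(e \mid E(h)) + \lambda_{\text{WP}} \]
and insert $h'$ into $B_w$. Update the soft coverage $C(h', j) = C(h, j) + \alpha_{h,j}$ for $1 \le j \le J$.
4. Generate new phrase hypotheses: for each previous word hypothesis $h \in B_w$, convert the soft attention $C(h, \cdot)$ into a binary coverage set $\mathcal{C}$, such that $j \in \mathcal{C}$ iff $C(h, j) > \tau_{\text{cov}}$. Identify the current NMT focus as
\[ \hat{j} = \operatorname*{argmax}_{1 \le j \le J} \{ \alpha_{h,j} \mid \alpha_{h,j} > \tau_{\text{focus}} \}. \]
If there is no such $j$ with $\alpha_{h,j} > \tau_{\text{focus}}$, no phrase hypotheses are generated from $h$ in this step. Otherwise, for each source phrase length $l$ with $\mathcal{C} \cap \{\hat{j}, \hat{j}+1, \dots, \hat{j}+l-1\} = \emptyset$ and each target phrase $\tilde{e} \in E(\hat{j}, \hat{j}+l)$, create a new hypothesis $h' = (h, \tilde{e}_1)$ with the score
\[ Q(h') = Q(h) + \lambda_{\text{NMT}} \cdot \log p_{\text{NMT}}(\tilde{e}_1 \mid h) + \lambda_{\text{LM}} \cdot \log p_{\text{LM}}(\tilde{e} \mid E(h)) + \lambda_{\text{WP}} \cdot |\tilde{e}| + \lambda_{\text{PP}} + \lambda_{\text{SWC}} \cdot l. \tag{2} \]
Note that, in this step, the full target phrase is scored using the language model, while only the first target word is scored using NMT. Initialize the phrase state of $h'$: $P(h') = (\tilde{e}, 1)$. As in step 3, update the soft coverage. If $|\tilde{e}| = 1$, insert $h'$ into $B_w$, otherwise insert it into $B_p$.
5. Advance previous phrase hypotheses: for each $h \in B_p$ with phrase state $P(h) = (\tilde{e}, k)$, score the $(k+1)$-th target word of $\tilde{e}$ using NMT, setting $h' = (h, \tilde{e}_{k+1})$ and
\[ Q(h') = Q(h) + \lambda_{\text{NMT}} \cdot \log p_{\text{NMT}}(\tilde{e}_{k+1} \mid h). \]
As in step 3, update the soft coverage. Set the new phrase state as $P(h') = (\tilde{e}, k+1)$.
If phrase scores from a phrase table are to be included in the search, Equation 2 needs to be modified by adding $\lambda_{\text{Phr}} \log p(\tilde{f} \mid \tilde{e})$ and $\lambda_{\text{iPhr}} \log p(\tilde{e} \mid \tilde{f})$.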
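The phrase suggestion logic of step 4 can be sketched in Python as follows. The sketch assumes 0-based source positions and a hypothetical dictionary `options` that maps a `(start, length)` source span to its target phrases; the function names and toy values are illustrative and not taken from an actual implementation.

```python
def update_coverage(coverage, attention):
    """Soft coverage update: C(h', j) = C(h, j) + alpha_{h,j}."""
    return [c + a for c, a in zip(coverage, attention)]

def suggest_phrases(coverage, attention, options, tau_focus=0.3, tau_cov=0.7):
    """Return (start, length, target_phrase) candidates for one word hypothesis.

    coverage  -- accumulated attention C(h, j) per source position
    attention -- current attention alpha_{h,j} per source position
    options   -- assumed layout: {(start, length): [target phrases]}
    """
    # Binary coverage set: positions whose accumulated attention exceeds tau_cov.
    covered = {j for j, c in enumerate(coverage) if c > tau_cov}
    # Current focus: position with maximal attention, if it exceeds tau_focus.
    focus = max(range(len(attention)), key=lambda j: attention[j])
    if attention[focus] <= tau_focus:
        return []
    suggestions = []
    for (start, length), targets in options.items():
        if start != focus:
            continue  # source phrase must start at the focused position
        if covered & set(range(start, start + length)):
            continue  # source phrase overlaps already translated words
        suggestions.extend((start, length, t) for t in targets)
    return suggestions

# Toy source sentence "wird nach und nach" with position 0 already covered.
options = {(1, 1): ["after"], (1, 3): ["gradually"], (0, 2): ["is after"]}
attention = [0.05, 0.7, 0.15, 0.1]
found = suggest_phrases([0.9, 0.1, 0.0, 0.0], attention, options)
blocked = suggest_phrases([0.9, 0.1, 0.9, 0.0], attention, options)
unfocused = suggest_phrases([0.9, 0.1, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25], options)
```

In this toy setting the three-word source phrase is only suggested while none of its positions counts as covered, and nothing is suggested when the attention is spread too evenly to exceed the focus threshold.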
As in the pure NMT beam search, this procedure is repeated until either the last word of all hypotheses in a step is the sentence end token, or $2 \cdot J$ beam steps have been performed. Finally, the best translation is chosen as the one in $B_f$ with the highest score.
Note that the same target sequence can be generated with different phrasal segmentations. During search, if two hypotheses have the same full target history in a beam, we recombine them and discard the hypothesis with the lower score.
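Recombination can be sketched as a dictionary keyed by the full target history. The tuple layout `(score, target_words, segmentation)` is a hypothetical representation chosen for this example.

```python
def recombine(hypotheses):
    """Keep only the best-scoring hypothesis for each full target history."""
    best = {}
    for score, target, segmentation in hypotheses:
        key = tuple(target)
        if key not in best or score > best[key][0]:
            best[key] = (score, target, segmentation)
    return list(best.values())

# Two segmentations of the same target sequence and one different sequence.
beam = [
    (-2.0, ["step", "by", "step"], "words"),
    (-1.5, ["step", "by", "step"], "phrase"),
    (-3.0, ["gradually"], "phrase"),
]
recombined = recombine(beam)
```

Only the higher-scoring of the two identical target sequences survives, so beam slots are not spent on duplicates that differ only in their phrasal segmentation.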

Experiments
We perform experiments comparing the translation quality of our hybrid approach to phrase-based and pure end-to-end NMT baselines. We present results on two tasks: an in-house English→Russian e-commerce task (translation of real product/item descriptions from an e-commerce site), and the WMT 2016 German→English task (news domain). The corpus statistics are shown in Table 1. For the English→Russian task, the parallel training data consists of an in-domain part (ca. 5.5M running words) of product/item titles and descriptions and other e-commerce content. The rest is out-of-domain data (UN, subtitles, TAUS data collections, etc.) sampled to have significant n-gram overlap with the in-domain description data. Item descriptions are provided by private sellers and, like any user-generated content, may contain ungrammatical sentences, spelling errors, and other noise. Product descriptions usually originate from product catalogs and are "cleaner", but are difficult to translate because of rare domain-specific terminology. Both types of text contain itemizations, measurement units, and other structures which are usually not found in normal sentences. We tune the system on a development set that is a mix of product and item descriptions, and evaluate on separate product/item description test sets. For development and test sets, two reference translations are used.
The German→English system is trained on parallel corpora provided for the constrained WMT 2017 evaluation (Europarl, Common Crawl, and others). We use the WMT 2015 evaluation data as development set, and the evaluation is performed on two sets from the WMT evaluations in 2014 and 2016. Only a single human reference translation is provided.
For the phrase-based baselines, we use an in-house phrase decoder (Matusov and Köprü, 2010) which is similar to the Moses decoder (Koehn et al., 2007). We use standard SMT features, including word-level and phrase-level translation probabilities, the distortion model, 5-gram LMs, and a 7-gram joint translation and reordering model reimplemented based on the work of (Guta et al., 2015). The language model for the e-commerce task is trained on additional monolingual Russian item description data containing 28.2M words. For the WMT task, we use the English News Crawl data containing 3.8B words as additional language model data. The tuning is performed using MERT (Och, 2003) to increase the BLEU score on the development set. To stabilize the optimization on the English→Russian task, we detach Russian morphological suffixes from the word stems both in hypotheses and references using a context-independent "poor man's" morphological analysis. We prefix each suffix with a special symbol and treat the suffixes as separate tokens.
We have implemented our NMT model in  Gal and Ghahramani (2016). We set the dropout probability for input and recurrent connections of the RNN to 0.2 and the word embedding dropout probability to 0.1. On the English→Russian task, the model is then fine-tuned on in-domain data for 10 epochs. The vocabulary is limited using byte pair encoding (BPE) (Sennrich et al., 2016b) with 40K splits separately for each language. To speed up training, we use the approximate loss described in (Jean et al., 2015). For pure NMT experiments, we employ length normalization, as otherwise short translations would be favored.
For the hybrid approach, we use the same trained end-to-end model as in the NMT baseline. We use all the phrase-based model features plus the NMT score and run MERT as described in Section 3.1. Language models are trained on the level of BPE tokens. We consider at most 100 translation options for each source phrase. If not specified otherwise, we use a beam size of 96 for phrase hypotheses and a beam size of 32 for word hypotheses, resulting in a combined beam size of 128. Furthermore, we set the focus threshold $\tau_{\text{focus}} = 0.3$ and the coverage threshold $\tau_{\text{cov}} = 0.7$ by default. We also perform experiments where these hyper-parameters are varied.

E-commerce English→Russian
The results on the e-commerce English→Russian task are summarized in Table 2.

NMT vs. phrase-based SMT
The pure NMT system exhibits large improvements over the phrase-based baseline. These improvements are also significantly larger than when we use the NMT model to rescore PBMT 1000-best lists. NMT results are not improved when the beam size is increased from 12 to 128.

Hybrid search vs. pure NMT search
For the hybrid approach, we train a phrase-table on the in-domain data and split the source and target phrases with BPE afterwards for compatibility with the NMT vocabulary. With the hybrid approach, when using a LM trained only on the target side of bilingual data, we get an improvement of 0.3% BLEU on item descriptions and 1.4% BLEU on product descriptions over the pure NMT system. When we use the LM trained on extra monolingual data, we get total improvements of 1.0% BLEU and 2.3% BLEU with the hybrid approach. In contrast, when we add this language model and a word penalty on top of the pure NMT system and tune scaling factors with MERT, we get small improvements (last row of Table 2) only on product descriptions. This shows that the hybrid approach can exploit the LM better than a purely word-based NMT approach. We have also performed experiments utilizing the additional monolingual data for synthetic training data for NMT as in (Sennrich et al., 2016a), but did not get improvements.
To analyze the improvements of the hybrid system, we perform experiments in which we either disable or limit some of the SMT models. The results are shown in Table 3. Without the language model, the hybrid approach has almost no improvements over the NMT baseline. This indicates that the language model is crucial in selecting appropriate phrase candidates. Similarly, when we disable the source word coverage feature, the translation quality is degraded, suggesting that this feature helps choose between phrase hypotheses and word hypotheses during the search. Next, we do not use phrase-level scores. Here, we observe only a small degradation of translation quality. Finally, we limit the source length of phrases used in the search, allowing only one-word source phrases in one experiment and only source phrases with two or more words in another experiment. In both cases, the translation quality decreases. Thus, both one-word phrases and longer phrases are necessary to obtain the best results.
Table 3: Translation results of the hybrid approach on the e-commerce English→Russian task with different SMT model combinations. The first row shows results with all models enabled. In the following rows, we either remove or limit exactly one model compared to the full system.
Tuning the beam size
Next, we study the effect of different beam sizes on translation quality. The results are shown in Table 4. Note that we retune the system for each choice. With a total beam size of 128, we get the best results by using a phrase beam size of 96 and a word beam size of 32. When we use a phrase beam size of 116 or 64 instead, the translation quality worsens. In another experiment, we decrease the total beam size to 64. The translation quality degrades only slightly, which means that we can still expect MT quality improvements with hybrid search even if we optimize the system for speed. To further test this, we reduce the beam sizes to $N_w = 12$ and $N_p = 4$ after tuning with $N_w = 32$ and $N_p = 96$. We get BLEU scores of 27.1% on item descriptions and 30.1% on product descriptions, losing 0.3% and 0.7% BLEU respectively compared to the full beam size.
Table 4: Effect of the beam size (word beam size $N_w$ + phrase beam size $N_p$) for the hybrid approach on the e-commerce English→Russian task.
Tuning the attention focus/coverage thresholds
Table 5 shows results with different values for the coverage threshold $\tau_{\text{cov}}$. Again, we retune the system for each choice. Setting the coverage threshold to 1.0 or even disabling the coverage check (by setting $\tau_{\text{cov}} = \infty$) has little effect on the translation scores on this task. This can be explained by the fact that translation from English to Russian is mostly monotonic. We also tried varying the focus threshold $\tau_{\text{focus}}$ between 0.0 and 0.3 but did not notice any significant effect on this task.
Further human analysis by a native Russian speaker of the pure NMT vs. hybrid search translations shows that hybrid search is often able to correct the following known NMT handicaps:
• incorrect translation of rare words (among other reasons, due to incorrect sub-word unit translation in which rare words are aggressively segmented).
• repetition of same or similar words as a result of multiple attention to the same source word, as well as untranslated words that received no attention.
• incorrect or partially correct word-by-word translation when a phrasal (non-literal) translation should be used instead.
In all of these cases, the usage of phrasal translations is able to better enforce the coverage, and this, in turn, leads to improved lexical choice. The fact that not many long phrase pairs are selected indicates, in our opinion, that the search and modeling problem in NMT is far from being solved: with the right, diverse model scores, the proposed hybrid search is able to select and extend better hypotheses with words, most of which already had a high NMT probability. Yet they are not always selected in the pure NMT beam search, among other reasons, due to competition from words erroneously placed near them in the embedding space.

WMT 2016 German→English
The results on the WMT German→English task are shown in Table 6. The initial phrase-based baseline uses the 5-gram language model estimated on the target side of bilingual data. By adding the News Crawl LM data, we gain 2.5% and 2.3% BLEU on the test sets, but PBMT still is behind NMT.
For the hybrid approach, we use a beam size of 64 and a maximal number of beam steps of $1.5 \cdot J$ (instead of $2 \cdot J$) to speed up experiments. We use separate word penalty features, one for word-based hypotheses and one for phrase-based hypotheses, to allow for more control of translation lengths. With the hybrid approach, using the 5-gram language model estimated on the target side of bilingual data, and phrase scores, we get small improvements in BLEU over the NMT baseline. However, the TER increases. We experiment with different thresholds, setting $\tau_{\text{focus}} = 0.1$ and $\tau_{\text{cov}} = 1.0$. With this hybrid system, we get improvements of 1.0% and 1.1% BLEU over pure NMT. Finally, we add the News Crawl LM data on top. This significantly improves the results by 1.7% and 2.0% BLEU. In total, we gain 2.7% and 3.1% BLEU over pure NMT. These results confirm that, similar to PBMT, language model quality is important for the proposed hybrid search. In contrast, we have also tried applying only the LM (including News Crawl data) with a word penalty on top of NMT, but did not get consistent improvements.
Figure 1 shows an example of the phrase pairs chosen by the hybrid system on top of the NMT attention. The hybrid approach correctly translates the German idiom "nach und nach" as "gradually", while the pure NMT system incorrectly translates it word-by-word as "after and after". The pure NMT translation is "the system is tested after and after testing and improved by testing programs."

Conclusion
In this work, we proposed a novel hybrid search that extends NMT with phrase-based models. The NMT beam search was modified to insert phrasal translations based on the current and accumulated attention weights of the NMT decoder RNN. The NMT model score was used in a log-linear model with standard phrase-based scores as well as an n-gram language model. We described the algorithm in detail, in which we keep separate beams for NMT word hypotheses and hypotheses with an incomplete phrasal translation, as well as introduce parameters which control the source sentence coverage. Numerous experiments on two large vocabulary translation tasks showed that the hybrid search improves BLEU scores significantly as compared to a strong NMT baseline that already outperforms phrase-based SMT by a large margin.
In the future, we plan to focus on integration of phrasal components into NMT training, including better coverage constraints, as well as methods for context-dependent translation override within our hybrid search algorithm.