Exploring Recombination for Efficient Decoding of Neural Machine Translation

In Neural Machine Translation (NMT), the decoder can capture the features of the entire prediction history with neural connections and representations. This means that partial hypotheses with different prefixes will be regarded differently no matter how similar they are. However, this might be inefficient, since some partial hypotheses contain only local differences that will not influence future predictions. In this work, we introduce recombination in NMT decoding based on the concept of the “equivalence” of partial hypotheses. Heuristically, we use a simple n-gram suffix based equivalence function and adapt it to beam search decoding. Through experiments on large-scale Chinese-to-English and English-to-German translation tasks, we show that the proposed method can obtain similar translation quality with a smaller beam size, making NMT decoding more efficient.


Introduction
Recently, end-to-end Neural Machine Translation (NMT) models (Sutskever et al., 2014; Bahdanau et al., 2015) have achieved notable success. A remarkable characteristic of NMT is that the decoder, which is typically implemented using a Recurrent Neural Network (RNN), can capture the features of the entire decoding history. This model does not depend on any independence assumptions and treats sequences with different prefixes as totally different hypotheses. However, many of the NMT output sequences are quite similar, and they typically contain only local differences that do not influence future decoding significantly. Table 1 and Figure 1 present an example of such a pattern of local differences in NMT decoding. As shown in Table 1, the three partial hypotheses that end with "cities" share similar patterns. First, as shown in Figure 1, their hidden layer features are close in the latent space. Moreover, for future predictions, the model predicts identical sequences and gives similar scores for them. Although they go through different paths, these partial hypotheses appear to be similar or likely equivalent.

* Zhisong Zhang was a graduate student at SJTU and a research intern at NICT when conducting this work. This work is partially supported by the program "Promotion of Global [...]

Table 1: [...] search. The hidden layers of the partial hypotheses ending with "cities" correspond to the nodes boxed in Figure 1 (only three hypotheses are listed for brevity). The negative log probabilities calculated by the model for the words predicted after "cities" are given in parentheses.

Figure 1: [...] Table 1. Reference and prediction hypotheses are presented as red and blue nodes, respectively. The nodes inside the box represent the hidden features of partial hypotheses ending with "cities".

Algorithm 1: Merging-enhanced pruning (only Lines 6-11 and 15-17 are recoverable)
 6:     for s in C':
 7:         # Check with candidate merger states.
 8:         for s' in sequence(s):
 9:             if Eq(c, s') and score(c) < score(s'):
10:                 merge_flag = True
11:     # Pruning by the merger.
    [...]
15:     if len(C') >= k:
16:         break
17: return C'
Intuitively, for efficiency, we do not need to expand all of these partial hypotheses (states), since they have similar future predictions. In fact, this corresponds to the idea of hypothesis recombination (also known as state merging; the two terms are used interchangeably here) from traditional Phrase-Based Statistical Machine Translation (PBSMT) (Koehn et al., 2003). Given a method to find mergeable states, we can employ recombination in NMT decoding as well.
In this paper, we adopt the mechanism of recombination in NMT decoding based on the definition of "equivalence" of partial hypotheses. Heuristically, we try a simple n-gram suffix based equivalence function and apply it to beam search without adding any neural computation cost. Through experiments on two large-scale translation tasks, we show that it can help to make the decoding more efficient.
Most recent NMT studies have focused on model improvement (Luong et al., 2015; Tu et al., 2016b; Gehring et al., 2017; Vaswani et al., 2017), and only a few have studied the search problem directly. For example, Khayrallah et al. (2017) and Stahlberg et al. (2016) explored searching on lattices generated by traditional Statistical Machine Translation (SMT). In addition, Freitag and Al-Onaizan (2017) investigated different beam search pruning strategies; however, they primarily focused on pruning candidates locally. Niehues et al. (2017) analyzed the effects of modeling and searching, but focused on re-ranking analysis. Rather than considering candidates from other models' k-best lists, we focus on the exploration space of a single NMT model itself and provide a method for more efficient searching.

Method
For state merging, "equivalence" should be defined from the aspect of future predictions: states with the same predictions in the future decoding process can be regarded as equivalent. We use an equivalence function $\mathrm{Eq}(s_1, s_2)$ to denote that the two states $s_1$ and $s_2$ can be regarded as equivalent.
With the concept of equivalence, we can build the method of recombination over it. There are mainly two problems to solve:
1. How to merge states given the function Eq? (§2.1)
2. How to obtain this equivalence function? (§2.2)

Search with Merging
To adopt an equivalence function $\mathrm{Eq}(s_1, s_2)$ to merge states in a search process, we need to specify the logic of the merging mechanism. Here, without loss of generality, we specifically focus on the typical beam search.
We adopt merging in NMT beam search with a simple method: retaining the word-level search process and adding a state merger when pruning the beam at each time step. Algorithm 1 shows the proposed merging-enhanced pruning method.
Ordinary beam search only prunes candidates based on beam size (Lines 15-16), while the proposed method adds a merger to prune extra equivalent states (Lines 6-10). To manage the merging process, the candidate list $C$ is ordered by model score and its states are considered in turn. When checking equivalence for a candidate state $c$, we consider all current-step surviving states and their previous-step antecedents. We include previous-step states because equivalent states may have different sequence lengths and thus may not be in the same beam-search step. In Line 8, "sequence" denotes a function that returns the states that can possibly merge the current candidate $c$. If a candidate state $c$ is not merged with any higher-ranked state, it is added to the surviving list $C'$ (Line 13) and can possibly merge lower-ranked ones later.
When deciding whether to merge, we also consider a criterion on model scores: we only merge state $c$ into a state $s'$ when the score of $c$ is lower than that of $s'$. Since we also consider previous-step states with different sequence lengths, a length reward $\lambda$ is added for this comparison of partial hypotheses:
$$\mathrm{score}(s) = \sum_{y \in s} \left( \lambda + \log p(y) \right)$$
We also attempted length normalization, but found that it performed slightly worse.
The merged partial hypotheses can be stored, and by assuming that their future predictions will be the same as those of their mergers, a lattice-like translation graph can be obtained. We can further extract a k-best list from this structure using another beam search on the lattice (also with the length reward when comparing partial hypotheses). Note that this beam search process can be fast, since we reuse the model scores from the previous search and no extra neural computations are required.
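The pruning step described in this section can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the `State` container, the helper `merger_states` (standing in for "sequence", here taken to return a surviving state and its immediate antecedent), and the function `prune_with_merging` are hypothetical names, and the equivalence function is passed in as `eq`.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class State:
    """A partial hypothesis (hypothetical container for illustration)."""
    tokens: Tuple[str, ...]
    logprob: float                    # sum of log p(y) over the tokens
    prev: Optional["State"] = None    # previous-step antecedent, if any


def score(s: State, lam: float) -> float:
    """Length-rewarded score: sum over tokens of (lambda + log p(y))."""
    return lam * len(s.tokens) + s.logprob


def merger_states(s: State) -> List[State]:
    """Stand-in for "sequence": a surviving state plus its antecedent."""
    return [s] if s.prev is None else [s, s.prev]


def prune_with_merging(
    candidates: List[State],
    k: int,
    eq: Callable[[State, State], bool],
    lam: float,
) -> Tuple[List[State], List[Tuple[State, State]]]:
    survivors: List[State] = []
    merged: List[Tuple[State, State]] = []   # (merged state, its merger)
    # Consider candidates in turn, best model score first.
    for c in sorted(candidates, key=lambda x: x.logprob, reverse=True):
        merger = next(
            (t for s in survivors for t in merger_states(s)
             if eq(c, t) and score(c, lam) < score(t, lam)),
            None,
        )
        if merger is not None:
            merged.append((c, merger))   # pruned by the merger
            continue
        survivors.append(c)
        if len(survivors) >= k:          # ordinary beam-size pruning
            break
    return survivors, merged
```

The returned `merged` pairs are the kind of record from which a lattice-like graph could later be assembled; how the graph is actually stored is not specified in the text and is left open here.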

Equivalence Function
Finding an exact equivalence function for NMT is difficult, because future predictions rely on the features from the entire previous sequence, and any two different sequences are not the same according to the NMT model. Here, we consider an n-gram suffix based heuristic approximation for this problem.
We adopt an approximate equivalence function:
$$\mathrm{Eq}(s_1, s_2) \equiv \big( s_1.\mathrm{suffix}(n) = s_2.\mathrm{suffix}(n) \big) \wedge \big( |s_1.\mathrm{length} - s_2.\mathrm{length}| < r \big)$$
Here, $\mathrm{suffix}(n)$ represents the n-gram suffix of the sequence of a state, and $r$ is the threshold for the length difference of the two states. This definition of equivalence only considers a subset of state features, which is inspired by PBSMT. In PBSMT, different sequences could lead to states with identical features based on the n-gram suffix, and these states are exactly equivalent. Although this is not the case for NMT, the subset may encode important and relevant features.
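As a minimal sketch, the approximate equivalence function can be written over plain token lists as shown here. The names `suffix` and `is_equivalent` are illustrative, and the defaults `n=4` and `r=2` simply mirror the values reported in the experiments. How sequences shorter than n tokens are treated is an assumption of this sketch: their whole sequence is used as the suffix.

```python
from typing import Sequence, Tuple


def suffix(tokens: Sequence[str], n: int) -> Tuple[str, ...]:
    """Return the last n tokens (the whole sequence if it is shorter; n >= 1)."""
    return tuple(tokens[-n:])


def is_equivalent(s1: Sequence[str], s2: Sequence[str],
                  n: int = 4, r: int = 2) -> bool:
    """Same n-gram suffix and a length difference strictly below r."""
    return suffix(s1, n) == suffix(s2, n) and abs(len(s1) - len(s2)) < r


a = "we visited many of the big cities".split()
b = "we visited a lot of the big cities".split()
print(is_equivalent(a, b))  # same 4-gram suffix, lengths differ by 1
```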
Although this function is simple and introduces extra approximation, it has the merit of efficiency. In Algorithm 1, we can store the n-gram features of the surviving states in a hash map and replace the for-loop checking (Lines 6-10) with hashing, making the extra time complexity O(1) for each state. In our experiments, we found that the extra cost of feature matching is far less than the cost of the original neural computation.
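One way the hash-map lookup could look is sketched below. The class `SuffixIndex` and its methods are hypothetical, not taken from the authors' code. The sketch assumes that each bucket stays small, since equivalent states with lower scores are merged rather than stored, so a lookup is expected constant time rather than strictly O(1).

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple


class SuffixIndex:
    """Maps an n-gram suffix to the (length, score) pairs of surviving states."""

    def __init__(self, n: int = 4, r: int = 2) -> None:
        self.n, self.r = n, r
        self._buckets: Dict[Tuple[str, ...], List[Tuple[int, float]]] = defaultdict(list)

    def _key(self, tokens: Sequence[str]) -> Tuple[str, ...]:
        return tuple(tokens[-self.n:])

    def add(self, tokens: Sequence[str], score: float) -> None:
        """Register a surviving state under its suffix."""
        self._buckets[self._key(tokens)].append((len(tokens), score))

    def find_merger(self, tokens: Sequence[str],
                    score: float) -> Optional[Tuple[int, float]]:
        """Return a better-scoring equivalent survivor, or None."""
        for length, other in self._buckets.get(self._key(tokens), ()):
            if abs(length - len(tokens)) < self.r and score < other:
                return (length, other)
        return None
```

A candidate is looked up with `find_merger` first and only added with `add` when no merger is found, which replaces the nested loop over surviving states.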

Experiments and Analysis
The proposed method was evaluated on two translation tasks: NIST Chinese-English (Zh-En) and WMT English-German (En-De). For Zh-En, the training set comprised 1.4M sentence pairs from LDC corpora. NIST 02 was selected as the development set, and NIST 03 to 06 were used for testing. For En-De, 4.5M sentence pairs of WMT training data were used, the concatenation of newstest 2012 and 2013 was adopted as the development set, and newstest 2014 to 2016 were adopted as the test sets.
We implemented an attentional RNN-based NMT model and its decoder in Python with the DyNet toolkit (Neubig et al., 2017). All the experiments were carried out on one P100 GPU. For Zh-En, we set the vocabulary size of both sides to 30K, and for En-De, we adopted 50K BPE operations (Sennrich et al., 2016). The evaluation metric was tokenized BLEU (Papineni et al., 2002) calculated by multi-bleu.perl. Detailed settings can be found in the supplementary material.
We added a local threshold pruner to exclude unlikely words whose probabilities were less than 10% of the highest, and adopted length normalization for the final ranking of hypotheses. For comparing partial hypotheses, the length reward $\lambda$ was set to 1.0 and 0.4 for Zh-En and En-De, respectively. For the equivalence function, we used a 4-gram suffix and a length difference threshold $r$ of 2.
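The local threshold pruner can be illustrated with a short sketch. The function name `local_threshold_prune` and the dictionary representation of the next-word distribution are assumptions made for illustration; only the 10% ratio comes from the text.

```python
from typing import Dict


def local_threshold_prune(probs: Dict[str, float],
                          ratio: float = 0.1) -> Dict[str, float]:
    """Keep words whose probability is at least `ratio` times the highest one."""
    best = max(probs.values())
    return {w: p for w, p in probs.items() if p >= ratio * best}


next_word = {"cities": 0.5, "towns": 0.2, "villages": 0.04}
print(local_threshold_prune(next_word))  # "villages" falls below 10% of 0.5
```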
These hyper-parameters were set by preliminary experiments. For the length difference threshold $r$, we found that a relatively small $r$ such as 1 or 2 was better than larger ones, which is reasonable: if the merged hypotheses differ too much in length, there is a higher chance that they cover different information. For the n-gram suffix, we found that smaller n-grams produced more bad merges and that 4-gram is a reasonably good choice; slightly larger ones gave slightly worse results and also fewer chances for recombination.

Results
[...] the same beam size. Moreover, since it brings no extra neural computations, the proposed merging mechanism is transparent to neural architectures and easy to adopt. In our experiments, we used batched decoding on GPU, and merging did not influence the efficiency of this implementation. For translation quality, the results indicate that the proposed method can yield improvements at various beam sizes for Zh-En and at small beam sizes for En-De. Moreover, merging can make the search more efficient. For example, on both datasets, merge-enhanced searchers with beam size 6 can obtain results comparable to or better than those of ordinary searchers with beam size 12 (on BLEU, 37.17 vs. 37.11 for Zh-En, 24.64 vs. 24.67 for En-De). As for decoding speed, decoding with beam size 6 can be more than twice as fast as decoding with beam size 12 (over 200 tokens/second vs. around 100 tokens/second). That is to say, with merging, we can achieve similar translation quality with a smaller beam size, which leads to higher decoding speed.
The results show that for large beam sizes, expanding the explored search space by increasing the beam size or adopting merging helps more in Zh-En than in En-De. A possible explanation for this is that in the NIST Zh-En dataset, each source sentence has four references for evaluation, which rewards the diversity brought by expanding the reached search space. In Table 2, we compare the BLEU scores with multiple and single references at several beam sizes, and the single-reference results do not always increase with the beam size as the multiple-reference ones do. The En-De dataset also has only one reference and is similar to this case.
The results also show that expanding the explored search space does not always bring improvements. This is more a matter of modeling than of searching and corresponds with previous findings on the relations between NMT searching and modeling (Tu et al., 2016a; Niehues et al., 2017; Li et al., 2018). The potential of the proposed method might be better realized with improved NMT models.

Analysis
We further analyzed the merge-enhanced search process. For these analyses, we mainly checked decoding with a beam size of 10 on the Zh-En dataset.

Frequency of Merging
First, we investigated how often recombination occurs and how much it expands the explored output space. For a beam size of 10, with influences from the local pruner and the proposed merger, the average number of expanded hypotheses is 7.60 per step, and the average number of merger-pruned partial hypotheses is 0.61 per step (22.5 per sentence). This indicates that a partial hypothesis is recombined roughly once every two steps. The output translation graph can hold a much larger output space than the original k-best list, and we found that the number of possible output sequences was on average 200 times the beam size. Figure 3 shows an example of the output translation graph.
[...] threshold) in the equivalence function; however, we found that this does not bring obvious additional benefits.

Table 3: Comparisons of prediction model scores between different searching settings and a basic setting, which is "Beam=10, w/o merge". The pattern "a% / b%" means that, compared with the basic setting, a% of the sentences get higher model scores and b% get lower ones. The remaining sentences (1-a%-b%) receive identical predictions.

Effects of Merging
We further conducted comparisons between the predictions of ordinary and merge-enhanced beam search. First, we investigated the model scores of their predictions. As shown in Table 3, we selected "Beam=10, no merge" as the basic setting and compared the predictions of the other settings with it. Overall, the merge-enhanced searcher can obtain predictions with higher model scores, which suggests a stronger search ability, because the goal of searching is to return hypotheses with higher model scores. Moreover, we tried a re-ranking experiment on 100-best lists with a 4-checkpoint model ensemble and found only similar, slight improvements for both plain and merge-enhanced search. Nevertheless, since merge-enhanced search can produce an output translation graph, we expect that the graph contains more diverse hypotheses.
To verify this, we compared the oracle BLEU scores within the reached space. To extract oracle hypotheses from the translation graphs, we simply adopted the approximate Partial BLEU Oracle (Dreyer et al., 2007; Sokolov et al., 2012). The merge-based searcher obtained an oracle score of 47.83, while the ordinary beam searcher only reached 42.57. Only by increasing the beam size up to 100 could the ordinary beam searcher achieve a better result of 48.74. This indicates that recombination helps to reach more of the output space.

Conclusion and Discussion
In this work, 1) we show that decoding with heuristic recombination can obtain similar translation quality with smaller beam sizes, thus increasing efficiency, and 2) we empirically explore the decoding process and analyze the influences of recombination from various aspects.
Although the improvements brought by recombination depend on careful refinements of the model, this is more a matter of modeling than of searching, since the goal of decoding is to find hypotheses with higher model scores. The potential of recombination may be further realized by improving how the output sequences are modeled. Another interesting topic is the combination with SMT or extra larger language models (Wang et al., 2013, 2014). The equivalence function can also be extended. For example, a model-based equivalence function can be trained by using the neural features (hidden layers in the RNN). However, model-based equivalence functions may bring extra neural computation cost and be harder to implement efficiently. In this work, we focus on the merging mechanism and leave the study of the equivalence function for future work.