On NMT Search Errors and Model Errors: Cat Got Your Tongue?

We report on search errors and model errors in neural machine translation (NMT). We present an exact inference procedure for neural sequence models based on a combination of beam search and depth-first search. We use our exact search to find the global best model scores under a Transformer base model for the entire WMT15 English-German test set. Surprisingly, beam search fails to find these global best model scores in most cases, even with a very large beam size of 100. For more than 50% of the sentences, the model in fact assigns its global best score to the empty translation, revealing a massive failure of neural models to properly account for adequacy. By constraining search with a minimum translation length, we show that an inherent bias towards shorter translations lies at the root of the empty-translation problem. We conclude that vanilla NMT in its current form requires just the right amount of beam search errors to mask these model errors, which, from a modelling perspective, is a highly unsatisfactory conclusion indeed.


Introduction
Neural machine translation (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015, NMT) assigns the probability P(y|x) of a translation y = y_1^J ∈ T^J of length J over the target language vocabulary T for a source sentence x ∈ S^I of length I over the source language vocabulary S via a left-to-right factorization using the chain rule:

    log P(y|x) = Σ_{j=1}^{J} log P(y_j | y_1^{j-1}, x).   (1)

The task of finding the most likely translation ŷ ∈ T* for a given source sentence x is known as the decoding or inference problem:

    ŷ = argmax_{y ∈ T*} log P(y|x).   (2)

The NMT search space is vast as it grows exponentially with the sequence length. For example, for a common vocabulary size of |T| = 32,000, there are already more possible translations with 20 words or less than atoms in the observable universe (32,000^20 ≫ 10^82). Thus, complete enumeration of the search space is impossible. The size of the NMT search space is perhaps the main reason why, besides some preliminary studies (Niehues et al., 2017; Stahlberg et al., 2018b; Ott et al., 2018), analyzing search errors in NMT has received only limited attention. To the best of our knowledge, none of the previous studies were able to quantify the number of search errors in unconstrained NMT due to the lack of an exact inference scheme that, although too slow for practical MT, guarantees to find the global best model score for analysis purposes.

* Now at Google.
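The search-space estimate above is easy to sanity-check with arbitrary-precision integer arithmetic; the snippet below is a back-of-envelope verification, with the vocabulary size and length cap taken from the text.

```python
import math

# Sanity-check the search-space estimate: with a vocabulary of
# |T| = 32,000, the sequences of length exactly 20 alone already
# outnumber the roughly 10^80-10^82 atoms in the observable universe.
vocab_size = 32_000
max_len = 20

exact_20 = vocab_size ** max_len                        # 32,000^20
up_to_20 = sum(vocab_size ** j for j in range(1, max_len + 1))

print(f"log10(32000^20) = {math.log10(exact_20):.1f}")  # ≈ 90.1
print(exact_20 > 10 ** 82)                              # True
print(up_to_20 > exact_20)                              # True, trivially
```

Python's `math.log10` accepts arbitrarily large integers, so no overflow handling is needed here.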
In this work we propose such an exact decoding algorithm for NMT that exploits the monotonicity of NMT scores: Since the conditional log-probabilities in Eq. 1 are always negative, partial hypotheses can be safely discarded once their score drops below the log-probability of any complete hypothesis. Using our exact inference scheme we show that beam search does not find the global best model score for more than half of the sentences. However, these search errors, paradoxically, often prevent the decoder from suffering from a frequent but very serious model error in NMT, namely that the empty hypothesis often gets the global best model score. Our findings suggest a reassessment of the amount of model and search errors in NMT, and we hope that they will spark new efforts in improving NMT modeling capabilities, especially in terms of adequacy.

Exact Inference for Neural Models
Decoding in NMT (Eq. 2) is usually tackled with beam search, a time-synchronous approximate search algorithm that builds up hypotheses from left to right. A formal algorithm description is given in Alg. 1. Beam search maintains a set of active hypotheses H_cur. In each iteration, all hypotheses in H_cur that do not end with the end-of-sentence symbol </s> are expanded and collected in H_next. The best n items in H_next constitute the set of active hypotheses H_cur in the next iteration (line 11 in Alg. 1), where n is the beam size. The algorithm terminates when the best hypothesis in H_cur ends with the end-of-sentence symbol </s>. Hypotheses are called complete if they end with </s> and partial if they do not.
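As an illustration of the procedure described above, here is a minimal Python sketch of time-synchronous beam search. The toy scoring function and three-token vocabulary are invented purely for the example; in real NMT, `score_fn` would query the network for token log-probabilities.

```python
import math

EOS = "</s>"

def beam_search(score_fn, vocab, beam_size, max_len=20):
    """Time-synchronous beam search in the spirit of Alg. 1.

    score_fn(prefix, token) returns log P(token | prefix, x).
    Hypotheses are (prefix, accumulated_log_prob) pairs; complete
    hypotheses (ending in </s>) are carried over unexpanded.
    """
    H_cur = [((), 0.0)]
    for _ in range(max_len):
        H_next = []
        for prefix, score in H_cur:
            if prefix and prefix[-1] == EOS:   # complete: keep as-is
                H_next.append((prefix, score))
                continue
            for tok in vocab:                  # expand partial hypothesis
                H_next.append((prefix + (tok,), score + score_fn(prefix, tok)))
        # Keep the best n hypotheses (line 11 in Alg. 1).
        H_cur = sorted(H_next, key=lambda h: h[1], reverse=True)[:beam_size]
        if H_cur[0][0][-1] == EOS:             # best hypothesis is complete
            break
    return H_cur[0]

# Toy conditional distribution, invented for illustration only:
# "a" is likely for the first two positions, then </s> is likely.
def toy_score(prefix, tok):
    if len(prefix) < 2:
        return math.log(0.8) if tok == "a" else math.log(0.1)
    return math.log(0.9) if tok == EOS else math.log(0.05)

print(beam_search(toy_score, ["a", "b", EOS], beam_size=2)[0])
# → ('a', 'a', '</s>')
```

The returned score is the accumulated log-probability, which serves as the lower bound γ in the exact search described next.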
Beam search is the ubiquitous decoding algorithm for NMT, but it is prone to search errors as the number of active hypotheses is limited by n. In particular, beam search never compares partial hypotheses of different lengths with each other. As we will see in later sections, this is one of the main sources of search errors. However, in many cases, the model score found by beam search is a reasonable approximation to the global best model score. Let γ be the model score found by beam search (p in line 12, Alg. 1), which is a lower bound on the global best model score: γ ≤ log P(ŷ|x). Furthermore, since the conditionals log P(y_j | y_1^{j-1}, x) in Eq. 1 are log-probabilities and thus non-positive, expanding a partial hypothesis is guaranteed to result in a lower model score, i.e.:

    log P(y_1^j | x) ≤ log P(y_1^{j-1} | x)  ∀ j.   (3)

Consequently, when we are interested in the global best hypothesis ŷ, we only need to consider partial hypotheses with scores greater than γ. In our exact decoding scheme we traverse the NMT search space in a depth-first order, but cut off branches along which the accumulated model score falls below γ. During depth-first search (DFS), we update γ when we find a better complete hypothesis.
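A minimal Python sketch of the exact DFS scheme just described follows; the names and scoring interface are ours, and a real implementation would wrap the NMT model. Branches are pruned as soon as the accumulated score drops below the lower bound γ, which is admissible because token log-probabilities are non-positive.

```python
import math

EOS = "</s>"

def exact_search(score_fn, vocab, gamma=-math.inf, max_len=25):
    """Depth-first exact decoding with admissible pruning.

    gamma is a lower bound on the global best model score (e.g. the
    score of a prior beam search run). Because token log-probabilities
    are non-positive, a partial hypothesis whose score is already below
    gamma can never be extended into a better complete one.
    """
    best = {"score": gamma, "hyp": None}
    # Expand </s> first: this finds complete hypotheses early, tightens
    # gamma, and improves pruning in subsequent recursive calls.
    ordered = [EOS] + [t for t in vocab if t != EOS]

    def dfs(prefix, score):
        if score < best["score"]:              # admissible pruning
            return
        if prefix and prefix[-1] == EOS:       # complete hypothesis
            best["score"], best["hyp"] = score, prefix
            return
        if len(prefix) >= max_len:
            return
        for tok in ordered:
            dfs(prefix + (tok,), score + score_fn(prefix, tok))

    dfs((), 0.0)
    return best["hyp"], best["score"]

# Toy conditional distribution, invented for illustration only.
def toy_score(prefix, tok):
    if len(prefix) < 2:
        return math.log(0.8) if tok == "a" else math.log(0.1)
    return math.log(0.9) if tok == EOS else math.log(0.05)

print(exact_search(toy_score, ["a", "b", EOS])[0])
# → ('a', 'a', '</s>')
```

Even started from γ = -∞, the </s>-first ordering immediately yields the empty translation as a complete hypothesis and tightens γ, so most of the exponential space is never visited.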
Alg. 2 specifies the DFS algorithm formally. An important detail is that elements in T are ordered such that the loop in line 5 considers the </s> token first. This often updates γ early on and leads to better pruning in subsequent recursive calls.

Exact Inference under Length Constraints

Our admissible pruning criterion based on γ relies on the fact that the model score of a (partial) hypothesis is always lower than the score of any of its translation prefixes. While this monotonicity condition holds for vanilla NMT (Eq. 3), it does not hold for methods like length normalization (Jean et al., 2015; Boulanger-Lewandowski et al., 2013) or word rewards (He et al., 2016): length normalization gives an advantage to longer hypotheses by dividing the score by the sentence length, while a word reward directly violates monotonicity as it rewards each word with a positive value. In Sec. 4 we show how our exact search can be extended to handle arbitrary length models (Murray and Chiang, 2018; Huang et al., 2017; Yang et al., 2018) by introducing length-dependent lower bounds γ_k, and we report initial findings on exact search under length normalization. However, despite being of practical use, methods like length normalization and word penalties are rather heuristic as they have no justification from a probabilistic perspective. They also do not generalize well: without retuning, they often work only for a specific beam size. It would be much more desirable to fix the length bias in the NMT model itself.

Results without Length Constraints
We conduct all our experiments in this section on the entire English-German WMT news-test2015 test set (2,169 sentences) with a Transformer base (Vaswani et al., 2017) model trained with Tensor2Tensor (Vaswani et al., 2018) on parallel WMT18 data excluding ParaCrawl. Our pre-processing is as described by Stahlberg et al. (2018a) and includes joint subword segmentation using byte pair encoding (Sennrich et al., 2016). Decoding is carried out with our SGNMT decoder (Stahlberg et al., 2017, 2018b). Our main result is shown in Tab. 1. Greedy and beam search both achieve reasonable BLEU scores but rely on a high number of search errors to not be affected by a serious NMT model error: for 51.8% of the sentences, NMT assigns the global best model score to the empty translation, i.e. a single </s> token. Fig. 1 visualizes the relationship between BLEU and the number of search errors. Large beam sizes reduce the number of search errors, but the BLEU score drops because translations are too short. Even a large beam size of 100 produces 53.62% search errors. Fig. 2 shows that beam search effectively reduces search errors with respect to greedy decoding to some degree, but is ineffective in reducing search errors even further. For example, Beam-10 yields 15.9% fewer search errors (absolute) than greedy decoding (57.68% vs. 73.58%), but Beam-100 improves search only slightly (53.62% search errors) despite being 10 times slower than Beam-10. The problem of empty translations is also visible in the histogram over length ratios (Fig. 3). Beam search, although still slightly too short, roughly follows the reference distribution, but exact search has an isolated peak in [0.0, 0.1] from the empty translations.
Tab. 2 demonstrates that the problems of search errors and empty translations are not specific to the Transformer base model and also occur with other architectures. Even a highly optimized Transformer Big model from our WMT18 shared task submission (Stahlberg et al., 2018a) produces empty translations for 25.8% of the sentences. Fig. 4 shows that long source sentences are more affected by both beam search errors and the problem of empty translations. The global best translation is empty for almost all sentences longer than 40 tokens (green curve). Even when sentences for which the model prefers the empty translation are excluded, a large number of search errors remains (blue curve).

Results with Length Constraints
To find out more about the length deficiency we constrained exact search to certain translation lengths. Constraining search in this way increases the run time as the γ-bounds are lower. Therefore, all experiments in this section are conducted on only a subset of the test set to keep the runtime under control.

We first constrained search to translations longer than 0.25 times the source sentence length and thus excluded the empty translation from the search space. Although this mitigates the problem slightly (Fig. 5), it still results in a peak in the (0.3, 0.5] cluster. This suggests that the problem of empty translations is the consequence of an inherent model bias towards shorter hypotheses and cannot be fixed with a length constraint.

We then constrained exact search to either the length of the best Beam-10 hypothesis or the reference length. Tab. 3 shows that exact search constrained to the Beam-10 hypothesis length does not improve over beam search, suggesting that any search errors between the beam search score and the global best score for that length are too insignificant to affect the BLEU score. The oracle experiment in which we constrained exact search to the correct reference length (last row in Tab. 3) improved the BLEU score by 0.9 points.

A popular method to counter the length bias in NMT is length normalization (Jean et al., 2015; Boulanger-Lewandowski et al., 2013), which simply divides the sentence score by the sentence length. We can find the global best translations under length normalization by generalizing our exact inference scheme to length-dependent lower bounds γ_k.7 The generalized scheme finds the best model scores for each translation length k in a certain range (e.g. zero to 1.2 times the source sentence length). The initial lower bounds are derived from the Beam-10 hypothesis y^beam as follows:8

    γ_k = (k + 1) · log P(y^beam | x) / (|y^beam| + 1).

Exact search under length normalization no longer suffers from the length deficiency (last row in Tab. 4), but it is not able to match our best BLEU score under Beam-10 search. This suggests that while length normalization biases search towards translations of roughly the correct length, it does not fix the fundamental modelling problem.

7 Available in our SGNMT decoder (Stahlberg et al., 2017, 2018b) as the simplelendfs strategy.
8 We add 1 to the lengths to avoid division by zero errors.
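The length-dependent bounds follow from the length-normalization inequality: a hypothesis y of length k can only beat y^beam if score(y)/(k+1) > score(y^beam)/(|y^beam|+1), i.e. if its unnormalized score exceeds (k+1)·score(y^beam)/(|y^beam|+1). The helper below is a sketch of how these initial bounds might be computed; the function name is ours, and the +1 shift follows footnote 8.

```python
def initial_length_bounds(beam_score, beam_len, max_len):
    """Initial length-dependent lower bounds gamma_k for exact search
    under length normalization.

    A hypothesis of length k beats the beam hypothesis iff its
    unnormalized score exceeds (k + 1) * beam_score / (beam_len + 1);
    lengths are shifted by 1 to avoid division by zero for the empty
    translation (footnote 8).
    """
    per_token = beam_score / (beam_len + 1)
    return {k: (k + 1) * per_token for k in range(max_len + 1)}

bounds = initial_length_bounds(beam_score=-10.0, beam_len=9, max_len=12)
print(bounds[9])   # -10.0: the beam hypothesis's own length
print(bounds[4])   # -5.0: shorter candidates must score above this
```

Because beam_score is negative, the bound tightens (grows) as k shrinks, so short hypotheses, including the empty one, must clear a higher bar per the length-normalized objective.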

Related Work
Other researchers have also noted that large beam sizes yield shorter translations (Koehn and Knowles, 2017). Sountsov and Sarawagi (2016) argue that this model error is due to the locally normalized maximum likelihood training objective in NMT, which underestimates the margin between the correct translation and shorter ones if trained with regularization and finite data. A similar argument was made by Murray and Chiang (2018), who pointed out the difficulty for a locally normalized model of estimating the "budget" for all remaining (longer) translations. Kumar and Sarawagi (2019) demonstrated that NMT models are often poorly calibrated, and that this can cause the length deficiency. Ott et al. (2018) argued that uncertainty caused by noisy training data may play a role. Chen et al. (2018) showed that the consistent best string problem for RNNs is decidable. We provide an alternative DFS algorithm that relies on the monotonic nature of model scores rather than consistency, and that often converges in practice.
To the best of our knowledge, this is the first work that reports the exact number of search errors in NMT as prior work often relied on approximations, e.g. via n-best lists (Niehues et al., 2017) or constraints (Stahlberg et al., 2018b).

Conclusion
We have presented an exact inference scheme for NMT. Exact search may not be practical, but it allowed us to discover deficiencies in widely used NMT models. We linked the deteriorating BLEU scores of large beams to the reduction of search errors, and showed that the model often prefers the empty translation, evidence of NMT's failure to properly model adequacy. Our investigations into length-constrained exact search suggest that simple heuristics like length normalization are unlikely to remedy the problem satisfactorily.