The Unreasonable Volatility of Neural Machine Translation Models

Recent works have shown that Neural Machine Translation (NMT) models achieve impressive performance; however, questions about understanding the behaviour of these models remain unanswered. We investigate the unexpected volatility of NMT models where the input is semantically and syntactically correct. We discover that with trivial modifications of source sentences, we can identify cases where unexpected changes happen in the translation and, in the worst case, lead to mistranslations. This volatile behaviour of translating extremely similar sentences in surprisingly different ways highlights the underlying generalization problem of current NMT models. We find that RNN and Transformer models display volatile behaviour in 26% and 19% of sentence variations, respectively.


Introduction
The performance of Neural Machine Translation (NMT) models has dramatically improved in recent years, and with sufficient and clean data these models outperform more traditional models. Challenges when sufficient data is not available include translations of rare words (Pham et al., 2018) and idiomatic phrases (Fadaee et al., 2018) as well as domain mismatches between training and testing (Koehn and Knowles, 2017; Khayrallah and Koehn, 2018).
Recently, several approaches investigated NMT models when encountering noisy input and how worst-case examples of noisy input can 'break' state-of-the-art NMT models (Goodfellow et al., 2015; Michel and Neubig, 2018). Belinkov and Bisk (2018) show that character-level noise in the input leads to poor translation performance. Lee et al. (2018) randomly insert words in different positions in the source sentence and observe that in some cases the translations are completely unrelated to the input. While it is to some extent expected that the performance of NMT models that are trained on predominantly clean data but tested on noisy data deteriorates, other changes are more unexpected.
In this paper, we explore unexpected and erroneous changes in the output of NMT models. Consider the simple example in Table 1 where the Transformer model (Vaswani et al., 2017) is used to translate very similar sentences. Surprisingly, we observe that by simply altering one word in the source sentence, namely inserting the German word sehr (English: very), an unrelated change occurs in the translation. In principle, an NMT model that generates the translation of the word erleichtert (English: relieved) in one context should also be able to generalize and translate it correctly in a very similar context. Note that there are no infrequent words in the source sentence, and after each modification the input is still syntactically correct and semantically plausible. We call a model volatile if it displays inconsistent behaviour across similar input sentences during inference.
We investigate to what extent well-established NMT models are volatile during inference. Specifically, we locally modify sentence pairs in the test set and identify examples where a trivial modification in the source sentence causes an 'unexpected change' in the translation. These modifications are generated conservatively to avoid insertion of any noise or rare words in the data (Section 2.2). Our goal is not to fool NMT models, but instead to identify common cases where the models exhibit unexpected behaviour and that in the worst cases result in incorrect translations. We observe that our modifications expose volatilities of RNN and Transformer translation models in 26% and 19% of sentence variations, respectively. Our findings show how vulnerable current NMT models are to trivial linguistic variations, putting into question the generalizability of these models.
Sentence Variations

Is this another noisy text translation problem?
Noisy input text can cause mistranslations in most MT systems, and there has been growing research interest in studying the behaviour of MT systems when encountering noisy input (Li et al., 2019). Belinkov and Bisk (2018) propose to swap or randomize letters in a word in the input sentence. For instance, they change the word noise in the source sentence into iones. Lee et al. (2018) examine how the insertion of a random word in a random position in the source sentence leads to mistranslations. Michel and Neubig (2018) propose a benchmark dataset for translation of noisy input sentences, consisting of noisy, user-generated comments on Reddit. The types of noisy input text they observe include spelling or typographical errors, word omission/insertion/repetition, and grammatical errors.
In these previous works, the focus is on the lack of robustness of MT systems when handling noisy input text. In these approaches, the input sentences are semantically or syntactically incorrect, which leads to mistranslations.
However, in this paper, our focus is on input text that does not contain any type of noise. We modify input sentences such that the outcomes are still syntactically and semantically correct. We investigate how MT systems exhibit volatile behaviour in translating sentences that are extremely similar and differ in only one word, without any noise injection.

Variation generation
While there are various ways to automatically modify sentences, we are interested in simple semantic and syntactic modifications. These trivial linguistic variations should have almost no effect on the translation of the rest of the sentence.
We define a set of rules to slightly modify the source and target sentences in the test data and keep the sentences syntactically correct and semantically plausible.
DEL A conservative way to modify a sentence automatically without breaking its grammaticality is to remove adverbs. We identify a list of the 50 most frequent adverbs in English and their translations in German (obtained from dict.cc). For every sentence in the WMT test sets, if we find a sentence pair containing both a word and its translation from this list, we remove both words and create a new sentence pair.
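The DEL rule can be sketched as follows. The adverb table here is a tiny illustrative stand-in for the actual list of 50 frequent adverbs and their dict.cc translations, and the exact-token matching is a simplifying assumption:

```python
# Minimal sketch of the DEL variation rule: remove a frequent adverb and its
# translation from an English-German sentence pair. ADVERB_PAIRS is a
# hypothetical stand-in for the paper's 50-adverb list.
ADVERB_PAIRS = {"very": "sehr", "often": "oft", "always": "immer"}

def delete_adverb(src_tokens, tgt_tokens):
    """Return a new (src, tgt) pair with one adverb pair removed, or None
    if the pair contains no adverb from the list."""
    for en, de in ADVERB_PAIRS.items():
        if en in src_tokens and de in tgt_tokens:
            new_src = [w for w in src_tokens if w != en]
            new_tgt = [w for w in tgt_tokens if w != de]
            return new_src, new_tgt
    return None  # no adverb pair found; no variation generated

pair = delete_adverb("I am very relieved".split(),
                     "Ich bin sehr erleichtert".split())
```

Both sides are modified together, so the new pair remains a valid translation pair.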

SUBNUM
Another simple yet effective approach to safely modify sentences is to substitute numbers with other numbers. In this approach, we select every sentence pair from the test sets that contains a number and substitute the number i in both source and target sentences with i + k, where 1 ≤ k ≤ 5. We choose a small range of change so that the sentences remain, for the most part, semantically correct, resulting in few implausible sentences.
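A minimal sketch of the SUBNUM rule, under the simplifying assumption that numbers are plain integers matched with a regular expression:

```python
import re

def substitute_numbers(src, tgt, k):
    """Replace every number i in both sentences with i + k (1 <= k <= 5),
    keeping source and target consistent."""
    assert 1 <= k <= 5, "small offsets keep sentences mostly plausible"
    bump = lambda m: str(int(m.group()) + k)
    return re.sub(r"\d+", bump, src), re.sub(r"\d+", bump, tgt)

# Each sentence pair containing a number yields up to five variations:
variants = [substitute_numbers("Mary is 10 years old",
                               "Mary ist 10 Jahre alt", k)
            for k in range(1, 6)]
```

Because every k produces a distinct but near-identical pair, this rule is the one used later to study oscillations across many variations of the same sentence.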
INS Randomly inserting words in a sentence has a high chance of producing a syntactically incorrect sentence. To ensure that sentences remain grammatical and semantically plausible after modification, we define a bidirectional n-gram probability for inserting new words as follows:

P(w_3 | w_1 w_2 w_4 w_5) = C(w_1 w_2 w_3 w_4 w_5) / C(w_1 w_2 • w_4 w_5)

where C(·) denotes the count in the WMT data and • matches any single word. w_3 is inserted in the middle of the phrase w_1 w_2 w_4 w_5 if the conditional probability is greater than a predefined threshold. This simple approach, instead of using a more complex language model, serves our purposes since we are interested in inserting very common words that are already captured by the n-grams in the training data.
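The insertion criterion can be sketched as follows. The count tables, the `maybe_insert` helper, and the threshold value are illustrative assumptions, not the paper's implementation:

```python
from collections import Counter

def build_counts(corpus_sentences):
    """Count full 5-grams and their contexts with the middle word wildcarded."""
    full, ctx = Counter(), Counter()
    for sent in corpus_sentences:
        toks = sent.split()
        for i in range(len(toks) - 4):
            w1, w2, w3, w4, w5 = toks[i:i + 5]
            full[(w1, w2, w3, w4, w5)] += 1
            ctx[(w1, w2, w4, w5)] += 1  # C(w1 w2 . w4 w5), summed over middle words
    return full, ctx

def insertion_prob(full, ctx, w1, w2, w3, w4, w5):
    """P(w3 | w1 w2 _ w4 w5) = C(w1 w2 w3 w4 w5) / C(w1 w2 . w4 w5)."""
    denom = ctx[(w1, w2, w4, w5)]
    return full[(w1, w2, w3, w4, w5)] / denom if denom else 0.0

def maybe_insert(full, ctx, tokens, pos, candidate, threshold=0.5):
    """Insert `candidate` before `pos` if its bidirectional n-gram
    probability clears the threshold; return None otherwise."""
    if pos < 2 or pos > len(tokens) - 2:
        return None  # need two words of context on each side
    w1, w2 = tokens[pos - 2], tokens[pos - 1]
    w4, w5 = tokens[pos], tokens[pos + 1]
    if insertion_prob(full, ctx, w1, w2, candidate, w4, w5) > threshold:
        return tokens[:pos] + [candidate] + tokens[pos:]
    return None
```

Only candidates that frequently occur in exactly this two-word left and right context in the training data pass the threshold, which is what keeps the modified sentence fluent.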
SUBGEN Finally, a local modification is changing the gender of the person in the sentences. The goal of this modification is to investigate the existence and severity of gender bias in our models. This is inspired by recent approaches that have shown that NMT models learn social stereotypes such as gender bias from training data (Escudé Font and Costa-jussà, 2019;Stanovsky et al., 2019).
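A minimal sketch of SUBGEN, under the strong simplifying assumption that gender can be swapped word by word; the swap tables are illustrative, and real German agreement (for instance, the ambiguity of sie) needs more care than this:

```python
# Hypothetical SUBGEN sketch: swap gendered words consistently in source and
# target. The tables below are illustrative stand-ins, not the paper's lists.
SWAP_EN = {"he": "she", "she": "he", "his": "her", "her": "his"}
SWAP_DE = {"er": "sie", "sie": "er", "sein": "ihr", "ihr": "sein"}

def swap_gender(tokens, table):
    """Replace each gendered token by its counterpart; leave the rest alone."""
    return [table.get(w, w) for w in tokens]

src_variant = swap_gender("he lost his bag".split(), SWAP_EN)
tgt_variant = swap_gender("er verlor sein Gepäck".split(), SWAP_DE)
```

Applying the same swap to both sides keeps the pair a valid translation pair, so any asymmetry in the model's two translations points at the model rather than the data.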
Note that in a minority of cases these procedures can lead to semantically incorrect sentences; for instance, by substituting numbers we can potentially generate sentences such as "She was born on October 34th". While this can cause problems for a reasoning task, it barely affects the translation task, as long as the modifications are consistent on the source and target side. Table 2 shows examples of generated variations. We emphasize that only modifications with local consequences have been selected, and we intentionally ignore cases such as negation, which can result in wider structural changes in the translation of the sentence. We generate 10k sentence variations by applying these modifications to all sentence pairs in the WMT test sets 2013-2018 (Bojar et al., 2018). We use RNN and Transformer models to translate the sentences and their variations.

Experimental setup
In the translation experiments, we use the standard EN↔DE WMT-2017 training data (Bojar et al., 2018). We perform NMT experiments with two different architectures: RNN (Luong et al., 2015) and Transformer (Vaswani et al., 2017). We preprocess the training data with Byte-Pair Encoding (BPE) using 32K merge operations (Sennrich et al., 2016). During inference, we use beam search with a beam size of 5. Table 3 shows the case-sensitive BLEU scores.

Table 4 shows a random sample of sentences from the WMT test sets and our proposed variations with 'unexpected change' annotations (∆Translation). The cases where the unexpected change leads to a change in translation quality are marked in column ∆Quality. [w_i\w_j] indicates that w_i in the original sentence is replaced by w_j. S is the original and modified source sentence, R is the original and modified reference translation, T is the translation of the original sentence, and T_m is the translation of the modified sentence. Differences in translations related to annotations in the original and the modified translations are in red and orange, respectively. Note that we are interested in unexpected changes and do not highlight the changes that are a direct consequence of the modifications.
RNN As the first NMT system, we use a 2-layer bidirectional attention-based LSTM model implemented in OpenNMT (Klein et al., 2017) trained with an embedding size of 512, hidden dimension size of 1024, and batch size of 64 sentences. We use Adam (Kingma and Ba, 2015) for optimization.
Transformer We also experiment with the Transformer model (Vaswani et al., 2017) implemented in OpenNMT. We train a model with 6 layers, a hidden size of 512, and a filter size of 2048. The multi-head attention has 8 attention heads. We use Adam (Kingma and Ba, 2015) for optimization. All parameters are set based on the suggestions in Klein et al. (2017) to replicate the results of the original paper.

Evaluation of unexpected and erroneous changes
The modifications described above generate sentences that are extremely similar and hence are expected to have a very similar difficulty of translation. We evaluate the NMT models on how robust and consistent they are in translating these sentence variations rather than on their absolute quality.

Deviations from Original Translations
The variations are designed to have minimal effect on the meaning of the sentences. Hence, major changes in the translations of these variations can be an indication of volatility in the model. To assess whether the proposed sentence variations result in major changes in the translations, we measure changes in the translations of sentence variations with Levenshtein distance (Levenshtein, 1966). Specifically, Levenshtein distance measures the edit distance between the two translations. We also use the first and last positions of change in the translations, which represent the span of changes. Ideally, with our simple modifications, we expect a value of zero for the span of change and a value of at most 2 for the Levenshtein distance for a translation pair, indicating that there is only one token difference between the translation of the original sentence and that of the modified sentence. We define two types of changes based on these measures: minor and major. We choose the threshold distinguishing minor from major changes conservatively, to allow for some variation in the translations: the change in translations is considered major if both metrics are greater than 10, and minor if both are less than 10. Note that edit distances and spans are based on BPE subword units.
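The two measures and the minor/major classification can be sketched as follows; the exact span computation on translations of different lengths is an illustrative assumption:

```python
def levenshtein(a, b):
    """Edit distance between two token sequences (standard DP, two rows)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def span_of_change(a, b):
    """Distance between the first and last differing positions; 0 for a
    single-token change or identical sequences."""
    if a == b:
        return 0
    n = max(len(a), len(b))
    pad = lambda s: list(s) + [""] * (n - len(s))
    a, b = pad(a), pad(b)
    diffs = [i for i in range(n) if a[i] != b[i]]
    return diffs[-1] - diffs[0]

def change_type(t1, t2, threshold=10):
    """'major' if both metrics exceed the threshold, 'minor' if both are
    below it, 'borderline' otherwise."""
    d, s = levenshtein(t1, t2), span_of_change(t1, t2)
    if d > threshold and s > threshold:
        return "major"
    if d < threshold and s < threshold:
        return "minor"
    return "borderline"
```

In the paper these metrics are computed over BPE subword units, so the token sequences here would be the subword-segmented translations.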
With two very similar source sentences, we expect the Levenshtein distance and span of change between translations of these sentences to be small. Figure 1 shows the results for the RNN and Transformer model. While the majority of sentence variations have minor changes, a substantial number of sentences, 18% of RNN and 13% of Transformer translations, result in translations with major differences. This is surprising and an indication of volatility since these trivial modifications, in principle, should only result in minor and local changes in the translations.

Oscillations of Variation in Translations
In this section, we look into various sentence-level metrics to further analyze the observed behaviour. In particular, we focus on the SUBNUM modification because with this modification we can generate numerous variations of the same sentence. Having a high number of variations for each sentence gives us the opportunity of observing oscillations of various string matching metrics.
We use sentence-level BLEU, METEOR (Denkowski and Lavie, 2011), TER (Snover et al., 2006), and LengthRatio to quantify changes in the translations. LengthRatio represents the translation length over the reference length as a percentage. For a given source sentence, we define the oscillation range as the range of a sentence-level metric over the translations of that sentence's variations.
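The oscillation range can be sketched as follows, using LengthRatio as the metric since it needs no external library; sentence-level BLEU, METEOR, or TER scores would plug into `oscillation_range` the same way:

```python
def length_ratio(translation, reference):
    """Translation length over reference length, as a percentage."""
    return 100.0 * len(translation.split()) / len(reference.split())

def oscillation_range(scores):
    """Spread of a sentence-level metric over the translations of all
    variations of one source sentence."""
    return max(scores) - min(scores)

# Three hypothetical translations of variations of one sentence, scored
# against the same-length references (here collapsed to one for brevity):
scores = [length_ratio(t, "a b c d")
          for t in ["a b c d", "a b c d e", "a b"]]
spread = oscillation_range(scores)
```

A robust model should yield a spread near zero for SUBNUM variations, since the variations differ only in a single number.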
While sentence-level metrics are not reliable indicators of translation quality, they do capture fluctuations in translations. With the variations we introduce, there should in principle be no fluctuations beyond the substituted numbers themselves. Table 5 and Figure 3 provide the results. We observe that even though these sentence variations differ by only one number, there are many cases where an insignificant change in the sentence results in unexpectedly large oscillations. Both RNN and Transformer exhibit this behaviour to a certain extent.

The Effect of Volatility on Translation Quality
While edit distances and spans of change provide some indication of volatility, they do not capture all aspects of this unexpected behaviour. It is also not entirely clear what effect these unexpected changes have on translation quality. To further investigate this, we also perform manual evaluations.
In the first evaluation, we provide annotators with a pair of sentence variations and their corresponding translations and ask them to identify the differences between the two sentence pairs. In the second evaluation, we additionally provide the source sentences and reference translations, and ask the annotators to rank the sentence variations based on the translation quality similar to Bojar et al. (2016). In total the annotators evaluated 400 randomly selected sentence quadruplets.
The annotators identified 71% and 68% of changes in the variation translations as expected for the RNN and Transformer model, respectively. The main types of unexpected changes identified by the annotators are a change of word form (e.g., verb tense), reordering of phrases, paraphrasing parts of the sentence, dropping of words, and an 'other' category (e.g., preposition changes). A sentence pair can have multiple labels based on the types of changes. Table 4 provides examples from the test data.
Statistics for each category of unexpected change are shown in Figure 2. Our first observation is that, as to be expected, there are very few 'unexpected changes' when two variations lead to translations with minor differences. Interestingly, the vast majority of changes are due to paraphrasing and dropping of words. Comparing the performance of the RNN and Transformer model, we see that both display inconsistent translation behaviour. While Transformer has slightly fewer sentences with major changes, it has a higher number of sentence variations in the major category that result in a change in translation quality. From the annotators' assessments, we find that in 26% and 19% of sentence variations, the modification results in a change in translation quality for the RNN and Transformer model, respectively.

Generalization and Compositionality
Because of their ability to generalize beyond their training data, deep learning models achieve exceptional performance on numerous tasks. This generalization ability allows MT systems to translate sentences that were never seen during training. Recently there has been some interest in understanding whether this performance depends on recognizing shallow patterns, or whether the networks are indeed capturing and generalizing linguistic rules.
In simple terms, compositionality is the ability to construct larger linguistic expressions by combining simpler parts. For instance, if a model understands the correct compositional rules to understand 'John loves Mary', it must also understand 'Mary loves John' (Fodor and LePore, 2002). Investigating the compositional behaviour of neural networks in real-world natural language problems is a challenging task. Recently, several works have studied deep learning models' understanding of compositionality in natural language by using synthetic and simplified languages (Andreas, 2019;Chevalier-Boisvert et al., 2019). Baroni (2019) shows that to a certain extent neural networks can be productive without being compositional.
Although we do not specifically look into the compositional potential of MT systems, we are inspired by compositionality in defining our modifications. We argue that the observed volatile behaviour of the MT systems in this paper is a side effect of current models not being compositional. If an MT system has a good 'understanding' of the underlying structures of the sentences 'Mary is 10 years old' and 'Mary is 11 years old', it must also translate them very similarly regardless of the accuracy of the translation. While current evaluation metrics capture the accuracy of the NMT models, these volatilities go unnoticed.
Current neural models are successful in generalizing without learning any explicit compositional rules; however, our findings signal that they still lack robustness. We highlight this lack of robustness and suspect that it is associated with these models' lack of understanding of the compositional nature of language.

Conclusion
In this paper, we showed the unexpected volatility of NMT models by using a simple approach to modifying standard test sentences without introducing noise, i.e., by generating semantically and syntactically correct variations. We showed that even with trivial linguistic modifications of source sentences, we can identify a surprising number of cases where extremely similar sentences are translated in markedly different ways (see Figure 1). Our manual analyses show that both RNN and Transformer models exhibit volatile behaviour, with changes in translation quality for 26% and 19% of sentence variations, respectively. This highlights the problem of generalizability of current NMT models, and we hope that our insights will be useful for developing more robust NMT models.