How Grammatical is Character-level Neural Machine Translation? Assessing MT Quality with Contrastive Translation Pairs

Analysing translation quality with regard to specific linguistic phenomena has historically been difficult and time-consuming. Neural machine translation has the attractive property that it can produce scores for arbitrary translations, and we propose a novel method to assess how well NMT systems model specific linguistic phenomena such as agreement over long distances, the production of novel words, and the faithful translation of polarity. The core idea is that we measure whether a reference translation is more probable under an NMT model than a contrastive translation which introduces a specific type of error. We present LingEval97, a large-scale data set of 97 000 contrastive translation pairs based on the WMT English→German translation task, with errors automatically created with simple rules. We report results for a number of systems, and find that recently introduced character-level NMT systems perform better at transliteration than models with byte-pair encoding (BPE) segmentation, but perform more poorly at morphosyntactic agreement and at translating discontiguous units of meaning.


Introduction
It has historically been difficult to analyse how well a machine translation system can learn specific linguistic phenomena. Automatic metrics such as BLEU (Papineni et al., 2002) provide no linguistic insight, and automatic error analysis (Zeman et al., 2011; Popovic, 2011) is also relatively coarse-grained. A concrete research question that has been unanswered so far is whether character-level decoders for neural machine translation (Chung et al., 2016; Lee et al., 2016) can generate coherent and grammatical sentences. Chung et al. (2016) argue that the answer is yes, because BLEU on long sentences is similar to a baseline with longer subword units created via byte-pair encoding (BPE) (Sennrich et al., 2016a). However, BLEU is based on the precision of short n-grams, which makes it an unsuitable metric for measuring the global coherence or grammaticality of a sentence. To allow for a more nuanced analysis of different machine translation systems, we introduce a novel method to assess neural machine translation that can capture specific error categories in an automatic, reproducible fashion.
Neural machine translation (Kalchbrenner and Blunsom, 2013;Sutskever et al., 2014;Bahdanau et al., 2015) opens up new opportunities for automatic analysis because it can assign scores to arbitrary sentence pairs, in contrast to phrase-based systems, which are often unable to reach the reference translation. We exploit this property for the automatic evaluation of specific aspects of translation by pairing a human reference translation with a contrastive example that is identical except for a specific error. Models are tested as to whether they assign a higher probability to the reference translation than to the contrastive example.
A similar method of assessment has previously been used for monolingual language models (Sennrich and Haddow, 2015; Linzen et al., 2016), and we apply it to the task of machine translation. We present a large-scale test set of English→German contrastive translation pairs that allows for the automatic, quantitative analysis of a number of linguistically interesting phenomena that have previously been found to be challenging for machine translation, including agreement over long distances (Koehn and Hoang, 2007; Williams and Koehn, 2011), discontiguous verb-particle constructions (Nießen and Ney, 2000; Loáiciga and Gulordava, 2016), generalization to unseen words (specifically, transliteration of names (Durrani et al., 2014)), and ensuring that polarity is maintained (Wetzel and Bond, 2012; Chen and Zhu, 2014; Fancellu and Webber, 2015). We report results for neural machine translation systems with different choices of subword unit, identifying strengths and weaknesses of recently proposed models.

Contrastive Translation Pairs
We create a test set of contrastive translation pairs from the EN→DE test sets from the WMT shared translation task. 2 Each contrastive translation pair consists of a correct reference translation, and a contrastive example that has been minimally modified to introduce one translation error. We define the accuracy of a model as the number of times it assigns a higher score to the reference translation than to the contrastive one, relative to the total number of predictions. We have chosen a number of phenomena that are known to be challenging for the automatic translation from English to German.
1. noun phrase agreement: German determiners must agree with their head noun in case, number, and gender. We randomly change the gender of a singular definite determiner to introduce an agreement error.
2. subject-verb agreement: subjects and verbs must agree with one another in grammatical number and person. We swap the grammatical number of a verb to introduce an agreement error.
3. separable verb particle: verbs and their separable prefix often form a discontiguous semantic unit. We replace a separable verb particle with one that has never been observed with the verb in the training data.
4. polarity: arguably, polarity errors are undermeasured the most by string-based MT metrics, since a single word/morpheme can reverse the meaning of a translation. We reverse polarity by deleting/inserting the negation particle nicht ('not'), swapping the determiner ein ('a') and its negative counterpart kein ('no'), or deleting/inserting the negation prefix un-.
5. transliteration: subword-level models should be able to copy or transliterate names, even unseen ones. For names that were unseen in the training data, we swap two adjacent characters.
Table 1 shows examples for each error type. Most are motivated by frequent translation errors; for EN→DE, source and target script are the same, so technically, we do not perform transliteration. Since transliteration of names and copying them is handled the same way by the encoder-decoder networks that we tested, we consider this error type a useful proxy to test the models' transliteration capability.
2 http://www.statmt.org/wmt16/
All errors are introduced automatically, relying on statistics from the training corpus, a syntactic analysis with ParZu (Sennrich et al., 2013), and a finite-state morphology (Schmid et al., 2004;Sennrich and Kunz, 2014) to identify the relevant constructions and introduce errors. For contrastive pairs with agreement errors, we also annotate the distance between the words. For translation errors where we want to assess generalization to rare words (all except negation particles), we also provide the training set frequency of the word involved in the error (in case of multiple words, we report the lower frequency).
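To make the rule-based error generation concrete, here is a minimal sketch of two of the simpler rules: swapping adjacent characters in a name, and deleting the negation particle nicht. It is an illustration only and not the pipeline used to build the test set: it does not use ParZu or a morphological analyser, the function names are hypothetical, and the choice of a random swap position is an assumption, since the text does not say which pair of characters is swapped.

```python
import random


def swap_adjacent_characters(name, rng):
    """Return `name` with one pair of adjacent characters swapped, or None.

    Assumption: the swap position is chosen at random; only positions whose
    two characters differ are considered, so the result always differs
    from the input.
    """
    positions = [i for i in range(len(name) - 1) if name[i] != name[i + 1]]
    if not positions:
        return None  # e.g. "aa": no swap would change the string
    i = rng.choice(positions)
    return name[:i] + name[i + 1] + name[i] + name[i + 2:]


def delete_negation_particle(tokens):
    """Return a copy of `tokens` without the first 'nicht', or None if absent."""
    if "nicht" not in tokens:
        return None
    i = tokens.index("nicht")
    return tokens[:i] + tokens[i + 1:]


if __name__ == "__main__":
    rng = random.Random(0)
    print(swap_adjacent_characters("Merkel", rng))
    print(delete_negation_particle("das ist nicht gut".split()))
```

A full implementation would additionally have to check that the name is unseen in the training data and that the modified sentence is a genuine error, which is where the syntactic and morphological analysis comes in.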
The automatic processing has limitations, and we opt for a high-precision approach: for instance, we only change the gender of determiners where case and number are unambiguous, so that we can produce maximally difficult errors. 3 We expect that parsing errors will not invalidate the contrastive examples: whether the subject is correctly identified affects the distance annotation, but changing the number of the verb should always introduce an error. 4 Still, we report ceiling scores achievable by humans to account for the possibility that a generated error is not actually an error. We estimate the human ceiling by trying to select the correct variant for 20 contrastive translation pairs per category where our best system fails. The ceiling is below 100% because of errors in the reference translation, and cases that were undecidable by a human annotator (such as the gender of the 20-year-old). 5 From the 22 191 sentences in the original newstest20** sets, we create approximately 97 000 contrastive translation pairs.
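The accuracy defined at the start of this section can be sketched as follows. This is a minimal illustration under stated assumptions: `score` stands for any hypothetical function that returns a model score for a source sentence and a candidate translation (higher is better), and the tuple layout of `pairs` is invented for the example rather than taken from the released data format.

```python
from collections import defaultdict


def contrastive_accuracy(pairs, score):
    """Per-category accuracy over contrastive translation pairs.

    pairs: iterable of (category, source, reference, contrastive) tuples.
    score: hypothetical callable score(source, translation) -> float.
    A pair counts as correct only if the reference scores strictly higher.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, source, reference, contrastive in pairs:
        total[category] += 1
        if score(source, reference) > score(source, contrastive):
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}
```

Because only two scores per pair are compared, the same function applies to any model that can score a given translation, regardless of its subword segmentation.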

Evaluation
In the evaluation section, our focus is on establishing baselines on the test set, and investigating the following research questions:
• how well do different subword-level models process unseen words, specifically names?
• sequence length is increased in character-level models, compared to word-level or BPE-level models. Does this have a negative effect on grammaticality?

Data and Methods
We train NMT systems with training data from the WMT 15 shared translation task EN→DE. We train three systems with different text representations on the parallel part of the training set:
• BPE-to-BPE (Sennrich et al., 2016a)
• BPE-to-char (Chung et al., 2016)
• char-to-char (Lee et al., 2016)
We use the implementations released by the respective authors, Nematus 6 for BPE-to-BPE, and dl4mt-c2c 7 for BPE-to-char and char-to-char. dl4mt-c2c also provides preprocessed training data, which we use for comparability.
Both tools are forks of the dl4mt tutorial 8, so the implementation differences are minimal except for those pertaining to the text representation. We report hyperparameters in Table 2. They correspond to those used by Lee et al. (2016) for BPE-to-char and char-to-char; for BPE-to-BPE, we also adopt some hyperparameters from Sennrich et al. (2016b). Most importantly, we extract a joint BPE vocabulary of size 89 500 from the parallel corpus. We trained the BPE-to-BPE system for one week, following Sennrich et al. (2016a), and the *-to-char systems for two weeks, following Lee et al. (2016), on a single Titan X GPU. For both translating and scoring, we normalize probabilities by length (the number of symbols on the target side).
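The length normalization mentioned above can be illustrated with a small sketch. It assumes one common reading of the description, namely that the score is the sum of per-symbol log-probabilities divided by the number of target symbols; the exact formula and whether an end-of-sentence symbol is counted are not stated here, and the numbers below are invented for illustration.

```python
def length_normalized_score(symbol_log_probs):
    """Average log-probability per target symbol (assumed normalization)."""
    if not symbol_log_probs:
        raise ValueError("need at least one target symbol")
    return sum(symbol_log_probs) / len(symbol_log_probs)


# Invented per-symbol log-probabilities: the contrastive variant is one
# symbol shorter, as when a negation particle is deleted.
reference = [-0.2, -0.4, -0.3, -0.3]
contrastive = [-0.2, -0.5, -0.3]

# Without normalization the shorter contrastive variant wins on total
# log-probability; with normalization the reference is preferred.
print(sum(reference) > sum(contrastive))
print(length_normalized_score(reference) > length_normalized_score(contrastive))
```

The example shows why normalization matters for contrastive pairs whose two sides differ in length, such as inserted or deleted negation markers: an unnormalized score favours the shorter string simply for having fewer symbols.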
We also report results with the top-ranked system at WMT16 (Sennrich et al., 2016a), which is available online. 9 It is also a BPE-to-BPE system, but in contrast to the previous systems, it includes different preprocessing (including truecasing), other hyperparameters, additional monolingual training data, an ensemble of models, and bidirectional decoding.

Results
Firstly, we report case-sensitive BLEU scores for all systems we trained for comparison to previous work. 10 Results are shown in Table 3. The results confirm that our systems are comparable to previously reported results (Sennrich et al., 2016a; Chung et al., 2016), and that performance of the three systems is relatively close in terms of BLEU. The metric does not provide any insight into the respective strengths and weaknesses of different text representations. Our main result is the assessment via contrastive translation pairs, shown in Table 4. We find that despite obtaining similar BLEU scores, the models have learned different structures to a different degree. The models with a character decoder make fewer transliteration errors than the BPE-to-BPE model. However, they perform more poorly on separable verb particles and agreement, especially as distance increases, as seen in Figure 1. While accuracy for subject-verb agreement of adjacent words is similar across systems (95.2%, 94.0%, and 94.5% for BPE-to-BPE, BPE-to-char, and char-to-char, respectively), the gap widens for agreement between distant words: for a distance of over 15 words, the accuracy is 90.7%, 85.2%, and 82.3%, respectively.
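A breakdown of accuracy by distance, of the kind reported above for subject-verb agreement, can be computed with a sketch like the following. The bin boundaries and the input layout are hypothetical choices for illustration and are not necessarily those used for Figure 1.

```python
from bisect import bisect_left
from collections import defaultdict


def accuracy_by_distance(results, upper_bounds):
    """Accuracy per distance bin.

    results: iterable of (distance, is_correct) for agreement pairs.
    upper_bounds: ascending inclusive upper bounds of the bins; anything
    above the last bound falls into a final open-ended bin.
    Bins without any examples are omitted from the result.
    """
    labels = [f"<={b}" for b in upper_bounds] + [f">{upper_bounds[-1]}"]
    correct = defaultdict(int)
    total = defaultdict(int)
    for distance, is_correct in results:
        label = labels[bisect_left(upper_bounds, distance)]
        total[label] += 1
        correct[label] += int(is_correct)
    return {label: correct[label] / total[label] for label in labels if total[label]}
```

Reporting accuracy per bin rather than as a single average is what exposes the widening gap between systems at long distances.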
Polarity shifts between the source and target text are a well-known translation problem, and our analysis shows that the main type of error is the deletion of negation markers, in line with findings of previous studies (Fancellu and Webber, 2015). We consider the relatively high number of errors related to polarity an important problem in machine translation, and hope that future work will try to improve upon our results, shown in more detail in Table 5. We have commented that changing the grammatical number of the verb may change the meaning of the sentence instead of making it disfluent. A common example is the German pronoun sie, which is shared between the singular 'she', and the plural 'they'. We keep separate statistics for this type of error (n = 2520), and find that it is challenging for all models, with an accuracy of 87-87.2% for single models, and 90% by the WMT16 submission system.
10 Two commonly used BLEU evaluation scripts, the NIST BLEU scorer mteval-v13a.pl on detokenized text, and multi-bleu.perl on tokenized text, give different results due to tokenization differences. We here report both for comparison, but encourage the use of the NIST scorer, which is used by the WMT and IWSLT shared tasks, and allows for comparison of systems with different tokenizations.
We conclude from our results that there is currently a trade-off between generalization to unseen words, for which character-level decoders perform best, and sentence-level grammaticality, for which we observe better results with larger subword units of the BPE segmentation. We hope that our test set will help in developing and assessing architectures that aim to overcome this trade-off and perform best with respect to both morphology and syntax.
Table 6: Examples where the char-to-char model prefers the contrastive translation (subject-verb agreement errors). The 1-best translation can make an error of the same type (example 1), of a different type (the translation of taught is missing in example 2), or no error (example 3).
We encourage the use of contrastive translation pairs, and LingEval97, for future analysis, but here discuss some limitations. The first one is by design: being focused on specific translation errors, the evaluation is not suitable as a global quality metric. Also, the evaluation only compares the probability of two translations, a reference translation $T$ and a contrastive translation $T'$, and makes no statement about the most probable translation $T^*$. Even if a model correctly estimates that $p(T) > p(T')$, it is possible that $T^*$ will contain an error of the same type as $T'$. And even if a model incorrectly estimates that $p(T) < p(T')$, it may produce a correct translation $T^*$. Despite these limitations, we argue that contrastive translation pairs are useful because they can easily be created to analyse any type of error in a way that is model-agnostic, automatic and reproducible.
Table 6 shows examples where the contrastive translation is scored higher than the reference by the char-to-char model, together with the corresponding 1-best translation. In the first one, our method automatically recognizes an error that also appears in the 1-best translation. In the second example, the 1-best translation is missing the verb. Such cases could confound a human analysis of agreement errors, and we consider it an advantage of our method that it is not confounded by other errors in the 1-best translation. In the third example, our method identifies an error, but the 1-best translation is correct. We note that the German reference exhibits object fronting, but the 1-best output has the more common SVO word order. While one could consider this instance a false positive, it can be important for an NMT model to properly score translations other than the 1-best, for instance for applications such as prefix-constrained MT (Wuebker et al., 2016).

Conclusion
We present LingEval97, a test set of 97 000 contrastive translation pairs for the assessment of neural machine translation systems. By introducing specific translation errors to the contrastive translations, we gain valuable insight into the ability of state-of-the-art neural MT systems to handle several challenging linguistic phenomena. A core finding is that recently proposed character-level decoders for neural machine translation outperform subword models at processing unknown names, but perform worse at modelling morphosyntactic agreement, where information needs to be carried over long distances. We encourage the use of LingEval97 to assess alternative architectures, such as hybrid word-character models (Luong and Manning, 2016), or dilated convolutional networks (Kalchbrenner et al., 2016). For the tested systems, the most challenging error type is the deletion of negation markers, and we hope that our test set will facilitate development and evaluation of models that try to improve in that respect. Finally, the evaluation via contrastive translation pairs is a very flexible approach, and can be applied to new language pairs and error types.