Neural versus Phrase-Based Machine Translation Quality: a Case Study

Within the field of Statistical Machine Translation (SMT), the neural approach (NMT) has recently emerged as the first technology able to challenge the long-standing dominance of phrase-based approaches (PBMT). In particular, at the IWSLT 2015 evaluation campaign, NMT outperformed well established state-of-the-art PBMT systems on English-German, a language pair known to be particularly hard because of morphology and syntactic differences. To understand in what respects NMT provides better translation quality than PBMT, we perform a detailed analysis of neural versus phrase-based SMT outputs, leveraging high quality post-edits performed by professional translators on the IWSLT data. For the first time, our analysis provides useful insights on what linguistic phenomena are best modeled by neural models -- such as the reordering of verbs -- while pointing out other aspects that remain to be improved.


Introduction
The wave of neural models has eventually reached the field of Statistical Machine Translation (SMT).After a period in which Neural MT (NMT) was too computationally costly and resource demanding to compete with state-of-the-art Phrase-Based MT (PBMT)1 , the situation changed in 2015.For the first time, in the latest edition of IWSLT2 (Cettolo et al., 2015), the system described in (Luong and Manning, 2015) overtook a variety of PBMT approaches with a large margin (+5.3 BLEU points) on a difficult language pair like English-German -anticipating what, most likely, will be the new NMT era.This impressive improvement follows the distance reduction previously observed in the WMT 2015 shared translation task (Bojar et al., 2015).Just few months earlier, the NMT systems described in (Jean et al., 2015b) ranked on par with the best phrase-based models on a couple of language pairs.Such rapid progress stems from the improvement of the recurrent neural network encoderdecoder model, originally proposed in (Sutskever et al., 2014;Cho et al., 2014b), with the use of the attention mechanism (Bahdanau et al., 2015).This evolution has several implications.On one side, NMT represents a simplification with respect to previous paradigms.From a management point of view, similar to PBMT, it allows for a more efficient use of human and data resources with respect to rulebased MT.From the architectural point of view, a large recurrent network trained for end-to-end translation is considerably simpler than traditional MT systems that integrate multiple components and processing steps.On the other side, the NMT process is less transparent than previous paradigms.Indeed, it represents a further step in the evolution from rule-based approaches that explicitly manipulate knowledge, to the statistical/data-driven framework, still comprehensible in its inner workings, to a sub-symbolic framework in which the translation process is totally opaque to the analysis.
What do we know about the strengths of NMT arXiv:1608.04631v2[cs.CL] 9 Oct 2016 and the weaknesses of PBMT?What are the linguistic phenomena that deep learning translation models can handle with such greater effectiveness?To answer these questions and go beyond poorly informative BLEU scores, we perform the very first comparative analysis of the two paradigms in order to shed light on the factors that differentiate them and determine their large quality differences.
We build on evaluation data available for the IWSLT 2015 MT English-German task, and compare the results of the first four top-ranked participants.We choose to focus on one language pair and one task because of the following advantages: (i) three state-of-the art PBMT systems compared against the NMT system on the same data and in the very same period (that of the evaluation campaign); (ii) a challenging language pair in terms of morphology and word order differences; (iii) availability of MT outputs' post-editing done by professional translators, which is very costly and thus rarely available.In general, post-edits have the advantage of allowing for informative and detailed analyses since they directly point to translation errors.In this specific framework, the high quality data created by professional translators guarantees reliable evaluations.For all these reasons we present our study as a solid contribution to the better understanding of this new paradigm shift in MT.
After reviewing previous work (Section 2), we introduce the analyzed data and the systems that produced them (Section 3).We then present three increasingly fine levels of MT quality analysis.We first investigate how MT systems' quality varies with specific characteristics of the input, i.e. sentence length and type of content of each talk (Section 4).Then, we focus on differences among MT systems with respect to morphology, lexical, and word order errors (Section 5).Finally, based on the finding that word reordering is the strongest aspect of NMT compared to the other systems, we carry out a finegrained analysis of word order errors (Section 6).

Previous Work
To date, NMT systems have only been evaluated by BLEU in single-reference setups (Bahdanau et al., 2015;Sutskever et al., 2014;Luong et al., 2015;Jean et al., 2015a;Gülc ¸ehre et al., 2015).Ad-ditionally, the Montreal NMT system submitted to WMT 2015(Jean et al., 2015b) was part of a manual evaluation experiment where a large number of non-professional annotators were asked to rank the outputs of multiple MT systems (Bojar et al., 2015).Results for the Montreal system were very positive -ranked first in English-German, third in German-English, English-Czech and Czech-English -which confirmed and strengthened the BLEU results published so far.Unfortunately neither BLEU nor manual ranking judgements tell us which translation aspects are better modeled by different MT frameworks.To this end, a detailed and systematic error analysis of NMT vs. PBMT output is required.
Translation error analysis, as a way to identify systems' weaknesses and define priorities for their improvement, has received a fair amount of attention in the MT community.In this work we opt for the automatic detection and classification of translation errors based on manual post-edits of the MT output.We believe this choice provides an optimal trade-off between fully manual error analysis (Farrús Cabeceran et al., 2010;Popović et al., 2013;Daems et al., 2014;Federico et al., 2014;Neubig et al., 2015), which is very costly and complex, and fully automatic error analysis (Popović and Ney, 2011;Irvine et al., 2013), which is noisy and biased towards one or few arbitrary reference translations.
Existing tools for translation error detection are either based on Word Error Rate (WER) and Position-independent word Error Rate (PER) (Popović, 2011) or on output-reference alignment (Zeman et al., 2011).Regarding error classification, Hjerson (Popović, 2011) detects five main types of word-level errors as defined in (Vilar et al., 2006): morphological, reordering, missing words, extra words, and lexical choice errors.We follow a similar but simpler error classification (morphological, lexical, and word order errors), but detect the errors differently using TER as this is the most natural choice in our evaluation framework based on post-edits (see also Section 3.4).Irvine et al. (2013) propose another word-level error analysis technique specifically focused on lexical choice and aimed at understanding the effects of domain differences on MT.Their error classification is strictly related to model coverage and insensitive to word order differences.The technique requires access to the sys-tem's phrase table and is thus not applicable to NMT, which does not rely on a fixed inventory of translation units extracted from the parallel data.
Previous error analyses based on manually postedited translations were presented in (Bojar, 2011;Koponen, 2012;Popović et al., 2013).We are the first to conduct this kind of study on the output of a neural MT system.

Experimental Setting
We perform a number of analyses on data and results of the IWSLT 2015 MT En-De task, which consists in translating manual transcripts of English TED talks into German.Evaluation data are publicly available through the WIT3 repository (Cettolo  et al., 2012). 3  3.1 Task Data TED Talks4 are a collection of rather short speeches (max 18 minutes each, roughly equivalent to 2,500 words) covering a wide variety of topics.All talks have captions, which are translated into many languages by volunteers worldwide.Besides representing a popular benchmark for spoken language technology, TED Talks embed interesting research challenges.Translating TED Talks implies dealing with spoken rather than written language, which is hence expected to be structurally less complex, formal and fluent (Ruiz and Federico, 2014).Moreover, as human translations of the talks are required to follow the structure and rhythm of the English captions, a lower amount of rephrasing and reordering is expected than in the translation of written documents.
As regards the English-German language pair, the two languages are interesting since, while belonging to the same language family, they have marked differences in levels of inflection, morphological variation, and word order, especially long-range reordering of verbs.

Evaluation Data
Five systems participated in the MT En-De task and were manually evaluated on a representative subset of the official 2015 test set.The Human Evaluation (HE) set includes the first half of each of the 12 test talks, for a total of 600 sentences and around 10K words.Five professional translators were asked to post-edit the MT output by applying the minimal edits required to transform it into a fluent sentence with the same meaning as the source sentence.Data were prepared so that all translators equally post-edited the five MT outputs, i.e. 120 sentences for each evaluated system.
The resulting evaluation data consist of five new reference translations for each of the sentences in the HE set.Each one of these references represents the targeted translation of the system output from which it was derived, but the other four additional translations can also be used to evaluate each MT system.We will see in the next sections how we exploited the available post-edits in the more suitable way depending on the kind of analysis carried out.

MT Systems
Our analysis focuses on the first four top-ranking systems, which include NMT (Luong and Manning, 2015) and three different phrase-based approaches: standard phrase-based (Ha et al., 2015), hierarchical (Jehl et al., 2015) and a combination of phrasebased and syntax-based (Huck and Birch, 2015).Table 1 presents an overview of each system, as well as figures about the training data used. 5he phrase+syntax-based (PBSY) system combines the outputs of a string-to-tree decoder, trained with the GHKM algorithm, with those of two stan-dard phrase-based systems featuring, among others, adapted phrase tables and language models enriched with morphological information, hierarchical lexicalized reordering models and different variations of the operational sequence model.
The hierarchical phrase-based MT (HPB) system leverages thousands of lexicalised features, datadriven source pre-ordering (dependency tree-based), word-based and class-based language models, and n-best re-scoring models based on syntactic and neural language models.
The standard phrase-based MT (SPB) system features an adapted phrase-table combining in-domain and out-domain data, discriminative word lexicon models, multiple language models (word-, POS-and class-based), data-driven source pre-ordering (POSand constituency syntax-based), n-best re-scoring models based on neural lexicons and neural language models.
Finally, the neural MT (NMT) system is an ensemble of 8 long short-term memory (LSTM) networks of 4 layers featuring 1,000-dimension word embeddings, attention mechanism, source reversing, 50K source and target vocabularies, and out-ofvocabulary word handling.Training with TED data was performed on top of models trained with large out-domain parallel data.
With respect to the use of training data, it is worth noticing that NMT is the only system not employing monolingual data in addition to parallel data.Moreover, NMT and SPB were trained with smaller amounts of parallel data with respect to PBSY and HPB (see Table 1).

Translation Edit Rate Measures
The Translation Edit Rate (TER) (Snover et al., 2006) naturally fits our evaluation framework, where it traces the edits done by post-editors.Also, TER shift operations are reliable indicators of reordering errors, in which we are particularly interested.We exploit the available post-edits in two different ways: (i) for Human-targeted TER (HTER) we compute TER between the machine translation and its manually post-edited version (targeted reference), (ii) for Multi-reference TER (mTER), we compute TER against the closest translation among all available post-edits (i.e.targeted and additional references) for each sentence.Throughout sections 4 and 5, we mark a score achieved by NMT with the symbol * if this is better than the score of its best competitor at statistical significance level 0.01.Significance tests for HTER and mTER are computed by bootstrap re-sampling, while differences among proportions are assessed via one-tailed z-score tests.

Overall Translation Quality
Table 2 presents overall system results according to HTER and mTER, as well as BLEU computed against the original TED Talks reference translation.We can see that NMT clearly outperforms all other approaches both in terms of BLEU and TER scores.Focusing on mTER results, the gain obtained by NMT over the second best system (PBSY) amounts to 26%.It is also worth noticing that mTER is considerably lower than HTER for each system.This reduction shows that exploiting all the available postedits as references for TER is a viable way to control and overcome post-editors variability, thus ensuring a more reliable and informative evaluation about the real overall performance of MT systems.For this reason, the two following analyses rely on mTER.In particular, we investigate how specific characteristics of input documents affect the system's overall translation quality, focusing on (i) sentence length and (ii) the different talks composing the dataset.

Translation quality by sentence length
Long sentences are known to be difficult to translate by the NMT approach.Following previous work (Cho et al., 2014a;Pouget-Abadie et al., 2014;Bahdanau et al., 2015;Luong et al., 2015), we investigate how sentence length affects overall translation quality.Figure 1 plots mTER scores against source sentence length.NMT clearly outperforms every PBMT system in any length bin, with statistically significant differences.As a general tendency, the performance of all approaches worsens as sentence length increases.However, for sentences longer than 35 words we see that NMT quality degrades more markedly than in PBMT systems.Considering the percentage decrease with respect to the preceding length bin (26-35), we see that the %∆ for NMT (-15.4) is much larger than the average %∆ for the three PBMT systems (-7.9).Hence, this still seems an issue to be addressed for further improving NMT.

Translation quality by talk
As we saw in Section 3.1, the TED dataset is very heterogeneous since it consists of talks covering different topics and given by speakers with different styles.It is therefore interesting to evaluate translation quality also at the talk level.
Figure 2 plots the mTER scores for each of the twelve talks included in the HE set, sorted in ascending order of NMT scores.In all talks, the NMT system outperforms the PBMT systems in a statistically significant way.
We analysed different factors which could impact translation quality in order to understand if they correlate with such performance differences.We studied three features which are typically considered as indicators of complexity (see (Franc ¸ois and Fairon, 2012) for an overview), namely (i) the length of the talk, (ii) its average sentence length, and (iii) the type-token ratio6 (TTR) which -measuring lexical diversity -reflects the size of a speaker's vocabulary and the variety of subject matter in a text.
For the first two features we did not find any correlation; on the contrary, we found a moderate Pearson correlation (R=0.7332) between TTR and the mTER gains of NMT over its closest competitor in each talk.This result suggests that NMT is able to cope with lexical diversity better than any other considered approach.

Analysis of Translation Errors
We now turn to analyze which types of linguistic errors characterize NMT vs. PBMT.In the literature, various error taxonomies covering different levels of granularity have been developed (Flanagan, 1994;Vilar et al., 2006;Farrús Cabeceran et al., 2010;Stymne and Ahrenberg, 2012;Lommel et al., 2014).We focus on three error categories, namely (i) morphology errors, (ii) lexical errors, and (iii) word order errors.As for lexical errors, a number of existing taxonomies further distinguish among translation errors due to missing words, extra words, or incorrect lexical choice.However, given the proven difficulty of disambiguating between these three subclasses (Popović and Ney, 2011;Fishel et al., 2012), we prefer to rely on a more coarse-grained linguistic error classification where lexical errors include all of them (Farrús Cabeceran et al., 2010).
For error analysis we rely on HTER results under the assumption that, since the targeted translation is generated by post-editing the given MT output, this method is particularly informative to spot MT errors.We are aware that translator subjectivity is still an issue (see Section 4), however in this more finegrained analysis we prefer to focus on what a human implicitly annotated as a translation error.This particularly holds in our specific evaluation framework, where the goal is not to measure the absolute number of errors made by each system, but to compare systems with each other.Moreover, the postedits collected for each MT output within IWSLT allow for a fair and reliable comparison since systems were equally post-edited by all translators (see Section 3.2), making all analyses uniformly affected by such variability.

Morphology errors
A morphology error occurs when a generated word form is wrong but its corresponding base form (lemma) is correct.Thus, we assess the ability of systems to deal with morphology by comparing the HTER score computed on the surface forms (i.e.morphologically inflected words) with the HTER score obtained on the corresponding lemmas.The additional matches counted on lemmas with respect to word forms indicate morphology errors.Thus, the closer the two HTER scores, the more accurate the system in handling morphology.
To carry out this analysis, the lemmatized (and POS tagged) version of both MT outputs and corresponding post-edits was produced with the German parser ParZu (Sennrich et al., 2013).Then, the HTER-based evaluation was slightly adapted in order to be better suited to an accurate detection of morphology errors.First, punctuation was removed since -not being subject to morphological inflection -it could smooth the results.Second, shift errors were not considered.A word form or a lemma that matches a corresponding word or lemma in the postedit, but is in the wrong position with respect to it, is counted as a shift error in TER.Instead -when focusing on morphology -exact matches are not errors, regardless their position in the text.Table 3: HTER ignoring shift operations computed on words and corresponding lemmas, and their % difference.
Table 3 presents HTER scores on word forms and lemmas, as well as their percentage difference which gives an indication of morphology errors.We can see that NMT generates translations which are morphologically more correct than the other systems.In particular, the %∆ for NMT (-13.7) is lower than that of the second best system (PBSY, -16.9) by 3.2% absolute points, leading to a percentage gain of around 19%.We can thus say that NMT makes at least 19% less morphology errors than any other PBMT system.

Lexical errors
Another important feature of MT systems is their ability to choose lexically appropriate words.In order to compare systems under this aspect, we consider HTER results at the lemma level as a way to abstract from morphology errors and focus only on actual lexical choice problems.The evaluation on the lemmatised version of the data performed to identify morphology errors fits this purpose, since its driving assumptions (i.e.punctuation can be excluded and lemmas in the wrong order are not errors) hold for lexical errors too.
The lemma column of Table 3 shows that NMT outperforms the other systems.More precisely, the NMT score (18.7) is better than the second best (PBSY, 22.5) by 3.8% absolute points.This corresponds to a relative gain of about 17%, meaning that NMT makes at least 17% less lexical errors than any PBMT system.Similarly to what observed for morphology errors, this can be considered a remarkable improvement over the state of the art.

Word order errors
To analyse reordering errors, we start by focusing on shift operations identified by the HTER metrics.The first three columns of It should be recalled that these numbers only refer to shifts detected by HTER, that is (groups of) words of the MT output and corresponding post-edit that are identical but occurring in different positions.Words that had to be moved and modified at the same time (for instance replaced by a synonym or a morphological variant) are not counted in HTER shift figures, but are detected as substitution, insertion or deletion operations.To ensure that our reordering evaluation is not biased towards the alignment between the MT output and the post-edit performed by HTER, we run an additional assessment using KRS -Kendall Reordering Score (Birch et al., 2010) -which measures the similarity between the source-reference reorderings and the source-MT output reorderings.8Being based on bilingual word alignment via the source sentence, KRS detects reordering errors also when post-edit and MT words are not identical.Also unlike TER, KRS is sensitive to the distance between the position of a word in the MT output and that in the reference.
Looking at the last column of Table 4, we can say that our observations on HTER are confirmed by the KRS results: the reorderings performed by NMT are much more accurate than those performed by any PBMT system.9Moreover, according to the approx-imate randomization test, KRS differences are statistically significant between NMT and all other systems, but not among the three PBMT systems.
Given the concordant results of our two quantitative analyses, we conclude that one of the major strengths of the NMT approach is its ability to place German words in the right position even when this requires considerable reordering.This outcome calls for a deeper investigation, which is carried out in the following section.

Fine-grained Word Order Error Analysis
We have observed that word reordering is a very strong aspect of NMT compared to PBMT, according to both HTER and KRS.To better understand this finding, we investigate whether reordering errors concentrate on specific linguistic constructions across our systems.Using the POS tagging and dependency parsing of the post-edits produced by ParZu, we classify the shift operations detected by HTER and count how often a word with a given POS label was misplaced by each of the systems (alone or as part of a shifted block).For each word class, we also compute the percentage order error reduction of NMT with respect to the PBMT system that has highest reordering accuracy overall, that is PBSY.
Results are presented in Table 5, ranked by NMTvs-PBSY gain.Punctuation is omitted as well as word classes that were shifted less than 10 times by all systems.Examples of salient word order error types are presented in Table 6.
The upper part of Table 5 shows that verbs are by far the most often misplaced word category in all PBMT systems -an issue already known to affect standard phrase-based SMT between German and English (Bisazza and Federico, 2013).Reordering is particularly difficult when translating into German, since the position of verbs in this language varies according to the clause type (e.g.main vs. subordinate).Our results show that even syntax-informed PBMT does not solve this issue.Using syntax at decoding time, as done by one of the systems combined within PBSY, appears to be a better strategy reports a difference of 5 KRS points between the translations of a PBMT system and those produced by four human translators tested against each other, in a Chinese-English experiment.Here we see that order errors on nouns are notably reduced by NMT when they act as syntactic objects (-65% obja:N) but less when they act as preposition complements (-36% pn:N) or subjects (-33% subj:N).
The smallest NMT-vs-PBSY gains are observed on prepositions (-18% PREP), negation particles (-17% PTKNEG) and articles (-4% ART).Manual inspection of a data sample reveals that misplaced prepositions are often part of misplaced prepositional phrases acting, for instance, as temporal or instrumental adjuncts (e.g.'in my life', 'with this video').In these cases, the original MT output is overall understandable and grammatical, but does not conform to the order of German semantic arguments that is consistently preferred by post-editors (see example in Table 6(c)).Articles, due to their commonness, are often misaligned by HTER and marked as shift errors instead of being marked as two unrelated substitutions.Finally, negation particles account for less than 1% of the target tokens but play a key role in determining the sentence meaning.Looking closely at some error examples, we found that the correct placement of the German particle nicht was determined by the focus of negation in the source sentence, which is difficult to detect in English.For instance in Table 6(d) two interpretations are possible ('that did not work' or 'that worked, but not for systematic reasons'), each resulting in a different, but equally grammatical, location of nicht.In fact, negation-focus detection calls for a deep understanding of the sentence semantics, often requiring extra-sentential context (Blanco and Moldovan, 2011).When faced with this kind of translation decisions, NMT performs as poorly as its competitors.
In summary, our fine-grained analysis confirms that NMT concentrates its word order improvements on important linguistic constituents and, specifically in English-German, is very close to solving the infamous problem of long-range verb reordering which so many PBMT approaches have only poorly managed to handle.On the other hand, NMT still struggles with more subtle translation decisions depending, for instance, on the semantic ordering of adjunct prepositional phrases or on the focus of negation.

Conclusions
We analysed the output of four state-of-the-art MT systems that participated in the English-to-German task of the IWSLT 2015 evaluation campaign.Our selected runs were produced by three phrase-based MT systems and a neural MT system.The analysis leveraged high quality post-edits of the MT outputs, which allowed us to profile systems with respect to reliable measures of post-editing effort and translation error types.The outcomes of the analysis confirm that NMT has significantly pushed ahead the state of the art, especially in a language pair involving rich morphology prediction and significant word reordering.To summarize our findings: (i) NMT generates outputs that considerably lower the overall post-edit effort with respect to the best PBMT system (-26%); (ii) NMT outperforms PBMT systems on all sentence lengths, although its performance degrades faster with the input length than its competitors; (iii) NMT seems to have an edge especially on lexically rich texts; (iv) NMT output contains less morphology er-rors (-19%), less lexical errors (-17%), and substantially less word order errors (-50%) than its closest competitor for each error type; (v) concerning word order, NMT shows an impressive improvement in the placement of verbs (-70% errors).
While NMT proved superior to PBMT with respect to all error types that were investigated, our analysis also pointed out some aspects of NMT that deserve further work, such as the handling of long sentences and the reordering of particular linguistic constituents requiring a deep semantic understanding of text.Machine translation is definitely not a solved problem, but the time is finally ripe to tackle its most intricate aspects.

Figure 1 :
Figure 1: mTER scores on bins of sentences of different length.Points represent the average mTER of the MT outputs for the sentences in each given bin.

Figure 2 :
Figure 2: mTER scores per talk, sorted in ascending order of NMT scores.

Table 1 :
MT systems' overview.Data column: size of parallel/monolingual training data for each system in terms of English and German tokens.

Table 2 :
Overall results on the HE Set: BLEU, computed

Table 4 :
Word reordering evaluation in terms of shift operaii) the number of shifts required to align each system output to the corresponding post-edit; and (iii) the corresponding percentage of shift errors.Notice that the shift error percentages are incorporated in the HTER scores reported in Table2.We can see in Table4that shift errors in NMT translations are definitely less than in the other systems.The error reduction of NMT with respect to the second best system (PBSY) is about 50% (173 vs. 354). (

Table 5 :
Main POS tags and dependency labels of words oc-More insight on this is provided by the lower part of the table, where reordering errors are divided by their dependency label as well as POS tag.

Table 6 :
Auxiliary-main verb construction [aux:V]: SRC in this experiment , individuals were shown hundreds of hours of YouTube videos HPB in diesem Experiment , Individuen gezeigt wurden Hunderte von Stunden YouTube-Videos % (a) PE in diesem Experiment wurden Individuen Hunderte von Stunden Youtube-Videos gezeigt NMT in diesem Experiment wurden Individuen hunderte Stunden YouTube Videos gezeigt SRC ... when coaches and managers and owners look at this information streaming ... ... wenn Trainer und Manager und Eigentümer betrachten diese Information Streaming ... ... wenn Trainer und Manager und Besitzer sich diese Informationen anschauen ... PE ... wenn Trainer und Manager und Besitzer sich diese Informationen anschauen ... Prepositional phrase [pp:PREP det:ART pn:N] acting as temporal adjunct: SRC so like many of us , I 've lived in a few closets in my life SPB so wie viele von uns , ich habe in ein paar Schränke in meinem Leben gelebt % (c) PE so habe ich wie viele von uns während meines Lebens in einigen Verstecken gelebt NMT wie viele von uns habe ich in ein paar Schränke in meinem Leben gelebt % PE wie viele von uns habe ich in meinem Leben in ein paar Schränken gelebt Negation particle [adv:PTKNEG]: SRC but I eventually came to the conclusion that that just did not work for systematic reasons HPB aber ich kam schlielich zu dem Schluss , dass nur aus systematischen Gründen nicht funktionieren !(d) PE aber ich kam schlielich zu dem Schluss , dass es einfach aus systematischen Gründen nicht funktioniert NMT aber letztendlich kam ich zu dem Schluss , dass das einfach nicht aus systematischen Gründen funktionierte % MT output and post-edit examples showing common types of reordering errors.