A Multifaceted Evaluation of Neural versus Phrase-Based Machine Translation for 9 Language Directions

We aim to shed light on the strengths and weaknesses of the newly introduced neural machine translation paradigm. To that end, we conduct a multifaceted evaluation in which we compare outputs produced by state-of-the-art neural machine translation and phrase-based machine translation systems for 9 language directions across a number of dimensions. Specifically, we measure the similarity of the outputs, their fluency and amount of reordering, the effect of sentence length and performance across different error categories. We find that translations produced by neural machine translation systems are considerably different, more fluent and more accurate in terms of word order compared to those produced by phrase-based systems. Neural machine translation systems are also more accurate at producing inflected forms, but they perform poorly when translating very long sentences.


Introduction
A new paradigm in statistical machine translation, neural MT (NMT), has emerged very recently and has already surpassed the performance of the mainstream approach in the field, phrase-based MT (PBMT), for a number of language pairs, e.g. (Sennrich et al., 2016b; Luong et al., 2015; Costa-Jussà and Fonollosa, 2016; Chung et al., 2016).
In PBMT (Koehn, 2010) different models (translation, reordering, target language, etc.) are trained independently and combined in a log-linear scheme in which each model is assigned a different weight by a tuning algorithm. In contrast, in NMT all the components are jointly trained to maximise translation quality. NMT systems have a strong generalisation power because they encode translation units as numeric vectors that represent concepts, whereas in PBMT translation units are encoded as strings. Moreover, NMT systems are able to model long-distance phenomena thanks to the use of recurrent neural networks, e.g. long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) or gated recurrent units (Chung et al., 2014).
* Work partly done at his previous position in Dublin City University, Ireland.
The translations produced by NMT systems have been evaluated thus far mostly in terms of overall performance scores, be it by means of automatic or human evaluations. This has been the case of last year's news translation shared task at the First Conference on Machine Translation (WMT16). 1 In this translation task, outputs produced by participant MT systems, the vast majority of which fall under either the phrase-based or neural approaches, were evaluated (i) automatically with the BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) metrics, and (ii) manually by means of ranking translations (Federmann, 2012) and monolingual semantic similarity. In all these evaluations, the performance of each system is measured by means of an overall score, which, while giving an indication of the general performance of a given system, does not provide any additional information.
In order to better understand the new NMT paradigm and in what respects it provides better (or worse) translation quality than state-of-the-art PBMT, Bentivogli et al. (2016) conducted a detailed analysis for the English-to-German language direction. In a nutshell, they found that NMT (i) decreases post-editing effort, (ii) degrades faster than PBMT with sentence length and (iii) results in a notable improvement regarding reordering.
In this paper we delve further in this direction by conducting a multilingual and multifaceted evaluation in order to find answers to the following research questions. Whether, in comparison to PBMT, NMT systems result in:
• considerably different output and higher degree of variability;
• more or less fluent output;
• more or less monotone translations;
• translations with better or worse word order;
• better or worse translations depending on sentence length;
• fewer or more errors for different error categories: inflectional, reordering and lexical.
Hereunder we specify the main differences and similarities between this work and that of Bentivogli et al. (2016):
• Language directions. They considered 1 while our study comprises 9.
• Content. They dealt with transcribed speeches while we work with news stories. Previous research has shown that these two types of content pose different challenges for MT (Ruiz and Federico, 2014).
• Size of evaluation data. Their test set had 600 sentences while our test sets span from 1 999 to 3 000 depending on the language direction.
• Reference type. Their references were both independent from the MT output and also post-edited, while we have access only to single independent references.
• Analyses. While some analyses overlap, some are novel in our experiments. Namely, output similarity, fluency and degree of reordering performed.
Our analyses are conducted on the best PBMT and NMT systems submitted to the WMT16 translation task for each language direction. This (i) guarantees the reproducibility of our results as all the MT outputs are publicly available, (ii) ensures that the systems evaluated are state-of-the-art, as they are the result of the latest developments at top MT research groups worldwide, and (iii) allows the conclusions that will be drawn to be rather general, as 6 languages from 4 different families (Germanic, Slavic, Romance and Finno-Ugric) are covered in the experiments.
The rest of the paper is organised as follows. Section 2 describes the experimental setup. Subsequent sections cover the experiments carried out in which we measured different aspects of NMT, namely: output similarity (Section 3), fluency (Section 4), degree of reordering and quality of word order (Section 5), sentence length (Section 6), and amount of errors for different error categories (Section 7). Finally, Section 8 holds the conclusions and proposals for future work.

Experimental Setup
The experiments are run on the best PBMT and NMT constrained systems submitted to the news translation task of WMT16. We selected such systems according to the human evaluation (Bojar et al., 2016, Sec. 3.4). 2 We noted that many of the PBMT systems contain neural features, mainly in the form of language models. If the best PBMT submission contains any neural features we use this as the PBMT system in our analyses as long as none of these features is a fully-fledged NMT system. This was the case of the best submission in terms of BLEU for RU→EN (Junczys-Dowmunt et al., 2016).
Out of the 12 language directions at the translation task, we conduct experiments on 9. 3 These are the language pairs between English (EN) and Czech (CS), German (DE), Finnish (FI), Romanian (RO) and Russian (RU) in both directions (except for Finnish, where only the EN→FI direction is covered as no NMT system was submitted for the opposite direction, FI→EN). Finally, there was an additional language at the shared task, Turkish, that is not considered here, as either none of the systems submitted was neural (Turkish→EN), or there was one such system but its performance was extremely low (EN→Turkish) and hence most probably not representative of the state-of-the-art in NMT.
Table 1 shows the main characteristics of the best PBMT and NMT systems submitted to the WMT16 news translation task. It should be noted that all the NMT systems listed in the table fall under the encoder-decoder architecture with attention (Bahdanau et al., 2015) and operate on subword units. Word segmentation is carried out with the help of a lexicon in the EN→FI direction (Sánchez-Cartagena and Toral, 2016) and in an unsupervised way in the remaining directions (Sennrich et al., 2016a).
2 When there are not statistically significant differences between two or more NMT or PBMT systems (i.e. they belong to the same equivalence class), we pick the one with the highest BLEU score. If two NMT or PBMT systems were the best according to BLEU (draw), we pick the one with the best TER score.
3 Some experiments are run on a subset of these languages due to the lack of required tools for some of the languages involved.

Overall Evaluation
First, and in order to contextualise our analyses below, we report the BLEU scores achieved by the best NMT and PBMT systems for each language direction at WMT16's news translation task in Table 2. 4 The best NMT system clearly outperforms the best PBMT system for all language directions out of English (relative improvements range from 5.5% for EN→RO to 17.6% for EN→FI) and the human evaluation (Bojar et al., 2016, Sec. 3.4) confirms these results. In the opposite direction, the human evaluation shows that the best NMT system outperforms the best PBMT system for all language directions except when the source language is Russian. This slightly differs from the automatic evaluation, according to which NMT outperforms PBMT for translations from Czech (3.3% relative improvement) and German (9.9%) but underperforms PBMT for translations from Romanian (-3.7%) and Russian (-3.8%).

Output Similarity
The aim of this analysis is to assess to what extent translations produced by NMT systems are different from those produced by PBMT systems. We measure this by taking the outputs of the top n 5 NMT and PBMT systems submitted to each language direction and checking their pairwise overlap in terms of the chrF1 (Popović, 2015) automatic evaluation metric. 6 In order to make sure that all systems considered are truly different (rather than different runs of the same system) we consider only 1 system per paradigm (NMT and PBMT) submitted by each team for each language direction. We would consider NMT outputs considerably different (with respect to PBMT) if they resemble each other (i.e. high pairwise overlap between NMT outputs) more than they do to PBMT systems (i.e. low overlap between an output by NMT and another by PBMT). This analysis is carried out only for language directions out of English, as for all the language directions into English there was, at most, 1 NMT submission.
Table 3 shows the results. We can observe the same trends for all the language directions, namely: (i) the highest overlaps are between pairs of PBMT systems; (ii) next, we have overlaps between NMT systems; (iii) finally, overlaps between PBMT and NMT are the lowest.
4 We report the official results from http://matrix.statmt.org/matrix for the test set newstest2016 using normalised BLEU (column z BLEU-cased-norm).
5 The number of systems considered is different for each language direction as it depends on the number of systems submitted. Namely, we have considered 2 NMT and 2 PBMT into Czech, 3 NMT and 5 PBMT into German, 2 NMT and 4 PBMT into Finnish, 2 NMT and 4 PBMT into Romanian and 2 NMT and 3 PBMT into Russian.
Table 2: BLEU scores of the best NMT and PBMT systems for each language pair at WMT16's news translation task. If the difference between them is statistically significant according to paired bootstrap resampling (Koehn, 2004) with p = 0.05 and 1 000 iterations, the highest score is shown in bold.
Table 3: Average of the overlaps between pairs of outputs produced by the top n NMT and PBMT systems for each language direction from English to the target language (TL). The higher the value, the larger the overlap.
We can conclude then that NMT systems lead to considerably different outputs compared to PBMT. The fact that there is higher inter-system variability in NMT than in PBMT (i.e. overlaps between pairs of NMT systems are lower than between pairs of PBMT systems) may surprise the reader, considering the fact that all NMT systems belong to the same paradigm (encoder-decoder with attention) while for some language directions (EN→DE, EN→FI and EN→RO) there are PBMT systems belonging to two different paradigms (pure phrase-based and hierarchical). However, the higher variability among NMT translations can be attributed, we believe, to the fact that NMT systems use numeric vectors that represent concepts instead of strings as translation units.
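The pairwise overlap analysis summarised in Table 3 can be sketched as follows. This is a minimal illustration, not the evaluation code used in the paper: `chrf` is a simplified sentence-level character n-gram F-score (whitespace is dropped and the n-gram orders 1 to 6 are averaged, both of which are assumptions), and `group_overlaps` is a hypothetical helper that averages the score within and across paradigms.

```python
from collections import Counter
from itertools import combinations, product


def char_ngrams(text, n):
    """Multiset of character n-grams, ignoring whitespace (a simplification)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis, reference, max_n=6, beta=1.0):
    """Simplified chrF: average n-gram precision and recall, then F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is too short for this n-gram order
        matches = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(matches / sum(hyp.values()))
        recalls.append(matches / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


def mean_overlap(pairs):
    """Average chrF1 over a list of (output_a, output_b) pairs."""
    scores = [chrf(a, b) for a, b in pairs]
    return sum(scores) / len(scores)


def group_overlaps(nmt_outputs, pbmt_outputs):
    """Average overlap within each paradigm and across paradigms.

    Each list needs at least two outputs (one string per system).
    """
    return {
        "NMT-NMT": mean_overlap(list(combinations(nmt_outputs, 2))),
        "PBMT-PBMT": mean_overlap(list(combinations(pbmt_outputs, 2))),
        "NMT-PBMT": mean_overlap(list(product(nmt_outputs, pbmt_outputs))),
    }
```

With beta = 1 the score is symmetric in its two arguments, which is why it can serve as an overlap measure between two system outputs rather than only as a metric against a reference.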

Fluency
In this experiment we aim to find out whether the outputs produced by NMT systems are more or less fluent than those produced by PBMT systems. To that end, we take perplexity of the MT outputs on neural language models (LMs) as a proxy for fluency. The LMs are built using TheanoLM (Enarvi and Kurimo, 2016). They contain 100 units in the projection layer, 300 units in the LSTM layer, and 300 units in the tanh layer, following the setup described by Enarvi and Kurimo (2016, Sec. 3.2). The training algorithm is Adagrad (Duchi et al., 2011) and we used 1 000 word classes obtained with mkcls from the training corpus. Vocabulary is limited to the most frequent 50 000 tokens.
LMs are trained on a random sample of 4 million sentences selected from the News Crawl 2015 monolingual corpora, available for all the languages considered. 7
Table 4 shows the results. For all the language directions considered but one, perplexity is higher on the PBMT output compared to the NMT output. The only exception is translation into Finnish, in which perplexity on the PBMT output is slightly lower, probably because its fluency was improved by reranking it with a neural LM similar to the one used in this experiment. Thus, our experiment shows that the outputs produced by NMT systems are, in general, more fluent than those produced by PBMT systems.
7 http://data.statmt.org/wmt16/translation-task/training-monolingual-news-crawl.tgz
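As a reminder of how perplexity acts as a fluency proxy, the sketch below computes it from per-token log-probabilities. It assumes that the language model returns natural-log probabilities for each token; the function names are hypothetical and this is not the TheanoLM interface.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log token probabilities given by a language model."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    avg_neg_log_prob = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_prob)


def corpus_perplexity(sentences_log_probs):
    """Perplexity of a whole MT output, pooling the tokens of all sentences."""
    all_tokens = [lp for sentence in sentences_log_probs for lp in sentence]
    return perplexity(all_tokens)
```

Lower values mean that the model finds the text more predictable, which is the sense in which lower perplexity is read as higher fluency.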
One may argue that the perplexity obtained for NMT outputs is lower than that for PBMT outputs because the LMs we used to measure perplexity follow the same model as the decoder of the NMT architecture (Bahdanau et al., 2015) and hence perplexity on a neural LM is not a valid proxy for fluency. However, the following facts support our strategy:
• The manual evaluation of fluency carried out at the WMT16 shared translation task (Bojar et al., 2016, Sec. 3.5) already confirmed that NMT systems consistently produce more fluent translations than PBMT systems. That manual evaluation only covered language directions into English. In this experiment, we extend that conclusion to language directions out of English.
• Neural LMs consistently outperform n-gram based LMs when assessing the fluency of real text (Kim et al., 2016; Enarvi and Kurimo, 2016). Thus, we have used the most accurate automatic tool available to measure fluency.
Table 5: Average Kendall's tau distance between the word alignments obtained after translating the test set with each MT system being evaluated and a monotone alignment (left); and average Kendall's tau distance between the word alignments obtained for each MT system's translation and the word alignments of the reference translation (right). Larger values represent more similar alignments. If the difference between the distances depicted in the two last columns is statistically significant according to paired bootstrap resampling (Koehn, 2004) with p = 0.05 and 1 000 iterations, the largest distance is shown in bold.

Reordering
In this section we measure the amount of reordering performed by PBMT and NMT systems. Our objective is to empirically determine whether: (i) the recurrent neural networks in NMT systems produce more changes in the word order of a sentence than a PBMT decoder; and whether (ii) these neural networks make the word order of the translations closer to that of the reference. In order to measure the amount of reordering, we used the Kendall's tau distance between word alignments obtained from pairs of sentences (Birch, 2011, Sec. 5.3.2). As the distance needs to be computed from permutations, 8 we turned word alignments into permutations by means of the algorithm defined by Birch (2011, Sec. 5.2).
For each language direction, we computed word alignments between the source-language side of the test set and the target-language reference, the PBMT output and the NMT output by means of MGIZA++ (Gao and Vogel, 2008). As the test sets are rather small for word alignment (1 999 to 3 000 sentence pairs depending on the language pair), we append bigger parallel corpora to help ensure accurate word alignments and avoid data sparseness. For languages for which in-domain (news) parallel training data is available (German and Russian), we append that dataset (News Commentary). For the remaining languages (Finnish and Romanian) we use the whole Europarl corpus.
8 A permutation between a source-language sentence and a target-language sentence is defined as the set of operations that need to be carried out over the words in the source-language sentence to reflect the order of the words in the target-language sentence (Birch, 2011, Sec. 5.2).
The amount of reordering performed by each system can be estimated as the distance between the word alignments produced by that system and a monotone word alignment. The similarity between the reorderings produced by each MT system and the reorderings in the reference translation can also be estimated as the distance between the corresponding word alignments. Table 5 shows the value of these distances for the language pairs included in our evaluation. The average over all the sentences in the test set of the distance proposed by Birch (2011) is depicted.
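The two distances described above can be sketched as follows. This is an illustrative version under stated assumptions: permutations are given as lists of integer positions (the conversion from word alignments to permutations is not shown), and the distance is taken to be one minus the square root of the proportion of discordant pairs, which is our reading of the Kendall's tau distance of Birch (2011) and should be checked against that work.

```python
import math
from itertools import combinations


def kendall_tau_between(perm_a, perm_b):
    """Similarity of two permutations of the same length: 1.0 means identical order."""
    if len(perm_a) != len(perm_b):
        raise ValueError("permutations must have the same length")
    n = len(perm_a)
    if n < 2:
        return 1.0  # a single word cannot be reordered
    # A pair of positions is discordant if the two permutations order it differently.
    discordant = sum(
        1
        for i, j in combinations(range(n), 2)
        if (perm_a[i] - perm_a[j]) * (perm_b[i] - perm_b[j]) < 0
    )
    total_pairs = n * (n - 1) / 2
    return 1.0 - math.sqrt(discordant / total_pairs)


def kendall_tau_distance(permutation):
    """Similarity of a permutation to the monotone (identity) order."""
    return kendall_tau_between(permutation, list(range(len(permutation))))
```

Under this definition, comparing a system's permutation with the identity estimates how much reordering it performs, while comparing it with the reference's permutation estimates how close its word order is to the reference; in both cases larger values mean more similar orders.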
It can be observed that the amount of reordering introduced by both types of MT systems is lower than the quantity of reordering in the reference translation. NMT generally produces more changes in the structure of the sentence than PBMT. This is the case for all language pairs but two (EN→DE and EN→FI). A possible explanation for these two exceptions is the following: in the former language pair, the PBMT system is hierarchical (Williams et al., 2016) while in the latter, the output was reranked with neural LMs.
Concerning the similarity between the reorderings produced by both MT systems and those in the reference translation, out of 9 directions, in 5 directions the NMT system performs a reordering closer to the reference, in 1 direction the PBMT system performs a reordering closer to the reference and in the remaining 3 directions the differences are not statistically significant. That is, NMT generally produces reorderings which are closer to the reference translation. The exceptions to this trend, however, do not exactly correspond to the language pairs for which NMT underperformed PBMT.
In summary, NMT systems achieve, in general, a higher degree of reordering than pure phrase-based PBMT systems, and, overall, this reordering results in translations whose word order is closer to that of the reference translation.
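The significance tests reported with Table 5 rely on paired bootstrap resampling (Koehn, 2004). A minimal sketch for a sentence-averaged score such as the distance used here is given below; the function names and the fixed seed are assumptions made for illustration. A corpus-level metric such as BLEU would instead have to be recomputed on each resampled test set.

```python
import random


def paired_bootstrap(scores_a, scores_b, iterations=1000, seed=0):
    """Fraction of resampled test sets on which system A scores higher than system B.

    scores_a and scores_b hold one score per test sentence, in the same order.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(iterations):
        # Draw a test set of the original size, with replacement.
        sample = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in sample) > sum(scores_b[i] for i in sample):
            wins += 1
    return wins / iterations


def significantly_better(scores_a, scores_b, p=0.05, iterations=1000, seed=0):
    """True if A beats B on at least a (1 - p) share of the resampled test sets."""
    return paired_bootstrap(scores_a, scores_b, iterations, seed) >= 1 - p
```

Because the same sampled sentence indices are used for both systems, the comparison is paired: sentence-level difficulty affects both sums equally.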

Sentence Length
In this experiment we aim to find out whether the performances of NMT and PBMT are somehow sensitive to sentence length. In this regard, Bentivogli et al. (2016) found that, for transcribed speeches, NMT outperformed PBMT regardless of sentence length, while also noting that NMT's performance degraded faster than PBMT's as sentence length increased. It should be noted, however, that sentences in our content type, news, are considerably longer than sentences in transcribed speeches. 9 Hence, the current experiment will determine to what extent the findings on transcribed speeches also hold for texts made of longer sentences.
We split the source side of the test set in subsets of different lengths: 1 to 5 words (1-5), 6 to 10 and so forth up to 46 to 50 and finally longer than 50 words (> 50). We then evaluate the outputs of the top PBMT and NMT submissions for those subsets with the chrF1 evaluation metric. Figure 1 presents the results for the language direction EN→FI. We can observe that NMT outperforms PBMT up to sentences of length 36-40, while for longer sentences PBMT outperforms NMT, with PBMT's performance remaining fairly stable while NMT's clearly decreases with sentence length. The results for the other language directions exhibit similar trends.
Figure 2: Relative improvement of the best NMT versus the best PBMT submission on chrF1 for different sentence lengths, averaged over all the language pairs considered. Axes: sentence length (range) versus relative improvement.
Figure 2 shows the relative improvements of NMT over PBMT for each sentence length subset, averaged over all the 9 language directions considered. We observe a clear trend of this value decreasing with sentence length and in fact we found a strong negative Pearson correlation (-0.79) between sentence length and the relative improvement (chrF1) of the best NMT over the best PBMT system.
The correlations for each language direction are shown in Table 6. We observe negative correlations for all the language directions except for DE→EN.
Table 6: Pearson correlations between sentence length and relative improvement (chrF1) of the best NMT over the best PBMT system for each language pair.
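The steps of this analysis (bucketing by source length, relative improvement per bucket, and correlation with length) can be sketched as below. The helper names are hypothetical, and how sentence length is encoded for the correlation (for example, as the rank of the bucket) is an assumption, since the text does not specify it.

```python
import math


def length_bucket(num_words, width=5, max_len=50):
    """Bucket label for a source sentence: '1-5', '6-10', ..., '46-50' or '>50'."""
    if num_words < 1:
        raise ValueError("sentence must contain at least one word")
    if num_words > max_len:
        return f">{max_len}"
    low = ((num_words - 1) // width) * width + 1
    return f"{low}-{low + width - 1}"


def relative_improvement(nmt_score, pbmt_score):
    """Relative improvement (in %) of NMT over PBMT on the same subset."""
    return 100.0 * (nmt_score - pbmt_score) / pbmt_score


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

A negative coefficient between the bucket ranks and the per-bucket relative improvements indicates that the advantage of NMT shrinks as sentences get longer.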

Error Categories
In this experiment we assess the performance of NMT versus PBMT systems on a set of error categories that correspond to five word-level error classes: inflection errors, reordering errors, missing words, extra words and incorrect lexical choices. These errors are detected automatically using the edit distance, word error rate (WER), precision-based and recall-based position-independent error rates (hPER and rPER, respectively) as implemented in Hjerson (Popović, 2011). These error classes are then defined as follows:
• Inflection error (hINFer). A word whose full form is marked as an hPER error while its base form matches the base form in the reference.
• Reordering error (hRer). A word that matches the reference but is marked as a WER error.
• Missing word (MISer). A word that occurs as a deletion error in WER, is also an rPER error and does not share the base form with any hypothesis error.
• Extra word (EXTer). A word that occurs as an insertion error in WER, is also an hPER error and does not share the base form with any reference error.
• Lexical choice error (hLEXer). A word that belongs neither to inflectional errors nor to missing or extra words.
Due to the fact that it is difficult to disambiguate between three of these categories, namely missing words, extra words and lexical choice errors (Popović and Ney, 2011), we group them in a unique category, which we refer to as lexical errors.
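To make the inflection error definition above more concrete, the following sketch detects such errors with a plain bag-of-words comparison. It is a simplification written for illustration and not Hjerson's implementation: the function names are hypothetical, WER alignment is not used, and base forms are assumed to be supplied by the caller (e.g. stems).

```python
from collections import Counter


def per_errors(words, other_words):
    """Indices of words that have no remaining match in other_words (bag-of-words)."""
    remaining = Counter(other_words)
    errors = []
    for i, word in enumerate(words):
        if remaining[word] > 0:
            remaining[word] -= 1  # consume one matching occurrence
        else:
            errors.append(i)
    return errors


def inflection_errors(hyp_words, hyp_bases, ref_words, ref_bases):
    """Hypothesis words with a wrong full form whose base form matches an
    unmatched reference word."""
    # Reference words not covered by the hypothesis (recall-side errors).
    ref_error_idx = per_errors(ref_words, hyp_words)
    unmatched_ref_bases = Counter(ref_bases[i] for i in ref_error_idx)
    found = []
    # Hypothesis words not covered by the reference (precision-side errors).
    for i in per_errors(hyp_words, ref_words):
        base = hyp_bases[i]
        if unmatched_ref_bases[base] > 0:
            unmatched_ref_bases[base] -= 1
            found.append(hyp_words[i])
    return found
```

For instance, with the hypothesis "the cats sleeps" and the reference "the cat sleeps", the word "cats" has no full-form match in the reference but shares the base form "cat" with an unmatched reference word, so it is counted as an inflection error.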
As input, the tool requires the full forms and base forms of the reference translations and MT outputs. For base forms, we use stems for practical reasons. These are produced with the Snowball stemmer from NLTK 10 for all languages except for Czech, which is not supported. For this language we used the aggressive variant in czech stemmer. 11
Tables 7 and 8 show the results for language directions out of English and into English, respectively. For all language directions, we observe that NMT results in a notable decrease of both inflection (−14.6% on average for language directions out of EN and −7.91% for language directions into EN) and reordering (−12.82% from EN and −11.94% into EN) errors. The reduction of reordering errors is compatible with the results of the experiment presented in Section 5. 12
Differences in performance for the remaining error category, lexical errors, are much smaller. In addition, the results for that category show a mixed picture in terms of which paradigm is better, which makes it difficult to derive conclusions that apply regardless of the language pair. Out of English, NMT results in slightly fewer errors (0.59% decrease on average) for all target languages except for RO (2.17% increase). Similarly, in the opposite language direction, NMT also results in slightly better performance overall (1.35% error reduction on average), and looking at individual language directions NMT outperforms PBMT for all of them except RU→EN.

Conclusions
We have conducted a multifaceted evaluation to compare NMT versus PBMT outputs across a number of dimensions for 9 language directions. Our aim has been to shed more light on the strengths and weaknesses of the newly introduced NMT paradigm, and to check whether, and to what extent, these generalise to different families of source and target languages. Hereunder we summarise our findings:
• The outputs of NMT systems are considerably different compared to those of PBMT systems. In addition, there is higher inter-system variability in NMT, i.e. outputs by pairs of NMT systems are more different between them than outputs by pairs of PBMT systems.
• NMT outputs are more fluent. We have corroborated the results of the manual evaluation of fluency at WMT16, which was conducted only for language directions into English, and we have shown evidence that this finding is true also for directions out of English.
• NMT systems introduce more changes in word order than pure PBMT systems, but fewer than hierarchical PBMT systems. 13 Nevertheless, for most language pairs, including those for which the best PBMT system is hierarchical, NMT's reorderings are closer to the reorderings in the reference than those of PBMT. This corroborates the findings on reordering by Bentivogli et al. (2016).
• We have found negative correlations between sentence length and the improvement brought by NMT over PBMT for the majority of the languages examined. While for most sentence lengths NMT outperforms PBMT, for very long sentences PBMT outperforms NMT. The latter was not the case in the work by Bentivogli et al. (2016). We believe the reason behind this different finding is twofold. Firstly, the average sentence length in their evaluation dataset was considerably shorter; and secondly, the NMT systems included in our evaluation operate on subword units, which increases the effective sentence length they have to deal with.
13 The latter finding applies only to one language direction as only for that one the best PBMT system is hierarchical.
• NMT performs better in terms of inflection and reordering consistently across all language directions. We thus confirm that the findings of Bentivogli et al. (2016) regarding these two error types apply to a wide range of language directions. Differences regarding lexical errors are much smaller and inconsistent across language directions; for 7 of them NMT outperforms PBMT while for the remaining 2 the opposite is true.
The results for some of the evaluations, especially error categories (Section 7), have been analysed only superficially, looking at what conclusions can be derived that apply regardless of language direction. Nevertheless, all our data is publicly released, 14 so we encourage interested parties to use this resource to conduct deeper language-specific studies.