A Pronoun Test Suite Evaluation of the English–German MT Systems at WMT 2018

We evaluate the output of 16 English-to-German MT systems with respect to the translation of pronouns in the context of the WMT 2018 competition. We work with a test suite specifically designed to assess system quality in various fine-grained categories known to be problematic. The main evaluation scores come from a semi-automatic process, combining automatic reference matching with extensive manual annotation of uncertain cases. We find that current NMT systems are good at translating pronouns with intra-sentential reference, but the inter-sentential cases remain difficult. NMT systems are also good at the translation of event pronouns, unlike systems from the phrase-based SMT paradigm. No single system performs best at translating all types of anaphoric pronouns, suggesting unexplained random effects influencing the translation of pronouns with NMT.


Introduction
Data-driven machine translation (MT) systems are very good at making translation choices based on the words in the immediate neighbourhood of the word currently being generated, but aspects of translation that require keeping track of long-distance dependencies continue to pose problems. Linguistically, long-distance dependencies often arise from discourse-level phenomena such as pronominal reference, lexical cohesion, text structure, etc. Initially largely ignored, such problems have attracted increasing attention in the statistical MT (SMT) community in recent years (Hardmeier, 2012; Sim Smith, 2017). One important problem that has proved to be surprisingly difficult despite extensive research is the translation of pronouns (Hardmeier et al., 2015; Guillou et al., 2016; Loáiciga et al., 2017).
Since the invention of the BLEU score (Papineni et al., 2002), the MT community has measured progress to a large extent with the help of summary scores that are easy to compute, but strongly affected by the corpus-level frequency of certain phenomena, and that tend to neglect specific linguistic relations and problems that occur infrequently. The advent of neural MT (NMT) with its improved capacity for modeling more complex relationships between linguistic elements has brought an increased interest in linguistic problems perceived as difficult, which are often not captured well by metrics like BLEU. It has been suggested that test suites composed of difficult cases could provide more relevant insights into the performance of MT systems than corpus-level summary scores (Hardmeier, 2015). In this paper, we present a semi-automatic evaluation of the systems participating in the English-German news translation track of the MT shared task at the WMT 2018 conference.
The analysis was carried out with the help of an English-German adaptation of the PROTEST test suite for pronoun translation (Guillou and Hardmeier, 2016). The test suite allows us to perform a fine-grained evaluation for different types of pronouns. Whilst the translation of event pronouns, which caused serious problems in earlier evaluations of SMT systems (Hardmeier et al., 2015; Hardmeier and Guillou, 2018), seems to be handled fairly well by modern NMT systems, we find that translating anaphoric pronouns is still difficult, especially (but not only) if the pronoun has an antecedent in a different sentence. Our results also confirm earlier findings that suggested the need for a careful evaluation that is sensitive to specific linguistic problems. Whilst BLEU scores as a measure of general translation quality are strongly correlated with pronoun correctness, there are significant outliers that would be missed by an evaluation focusing on BLEU only. Moreover, evaluating pronoun translations by comparison with a reference translation is not reliable for all types of pronouns (Guillou and Hardmeier, 2018). This fact limits the usefulness of automatic pronoun evaluation metrics such as APT (Miculicich Werlen and Popescu-Belis, 2017) and affects the semi-automatic evaluation of our test suite as well.

Related Work
Research on pronoun translation was boosted by three past shared tasks (Hardmeier et al., 2015; Guillou et al., 2016; Loáiciga et al., 2017). They focused on English, French, German and Spanish in different directions. To avoid the effort and cost of manual evaluation, the tasks were designed and evaluated as classification rather than MT tasks, except for the first year, which featured both MT and classification tasks. At the time of the first of these shared tasks, phrase-based SMT systems were still competitive and the winning system was a strong n-gram language model (not involving any translation) trained as a baseline. By the time of the last pronoun-focused shared task, however, an NMT system with no explicit knowledge about pronouns ranked first (Jean et al., 2017).
Automatic metrics computed by matching the candidate and reference translations offer little explanation of the causes of errors. Additionally, the neural architectures of current end-to-end systems make it difficult to find out where exactly a translation went wrong by inspection. Test suites ease the evaluation process in general, since they allow us to simultaneously measure quantitative performance and diagnose qualitative shortcomings with regard to the targeted set of problems.
Test suites assessing NMT have focused on automatically generated contrastive pairs or sets of sentences. These include Burlot and Yvon (2017), for the evaluation of morphology in the English-to-Latvian and English-to-Czech language pairs; Sennrich (2017), who evaluates noun phrase and subject-verb agreement, particle verbs, polarity, and transliteration; and Rios Gonzales et al. (2017), whose work concentrates on word sense disambiguation for the German-to-English and German-to-French pairs. The test suite used in our work is based on the PROTEST test suite, which was originally created for English-French by Guillou and Hardmeier (2016). Closest to our work is the test suite of English-to-French anaphoric pronouns and coherence and cohesion by Bawden et al. (2018). Their test suite includes 50 examples of contrastive sentence pairs, which are manually created and targeted towards object pronouns.

Test Suite Construction
The data for our test suite was taken from the ParCorFull corpus (Lapshinova-Koltunski et al., 2018), a German-English parallel corpus manually annotated for co-reference. Although the corpus is designed for nominal co-reference, it includes annotations of two types of antecedents: entities and events. Entities can be either pronouns or noun phrases, whereas events can be verb phrases, clauses, or a set of clauses.
ParCorFull includes texts from TED talks transcripts and newswire data. Specifically, it includes the datasets used in the ParCor corpus (Guillou et al., 2014), the DiscoMT workshop (Hardmeier et al., 2016), and the test sets from the WMT 2017 shared task (Bojar et al., 2017).
We constructed a test suite of 200 pronoun translation examples for English-German with a focus on the ambiguous English pronouns it and they and the aim of providing a set of examples that represents the different problems machine translation researchers should consider.We extracted the examples from the TED talks section of ParCorFull.
The selection is based on a two-level hierarchy which considers pronoun function at the top level, followed by other pronoun attributes at the more granular lower level (for anaphoric pronouns only).
The English pronoun they functions as an anaphoric pronoun, whereas it can function as either an anaphoric (1), pleonastic (2), or event reference pronoun (3), with each function requiring the use of different pronouns in German.
(1) a. The infectious disease that's killed more humans than any other is malaria. It's carried in the bites of infected mosquitos. b. Jene Krankheit, die mehr Leute als jede andere umgebracht hat, ist Malaria gewesen. Sie wird über die Stiche von infizierten Moskitos übertragen.
(2) a. And it seemed to me that there were three levels of acceptance that needed to take place. b. Und es schien, dass es drei Stufen der Akzeptanz gibt, die alle zum Tragen kommen mussten.
(3) a. But I think if we lost everyone with Down syndrome, it would be a catastrophic loss. b. Aber, wenn wir alle Menschen mit Down-Syndrom verlören, wäre das ein katastrophaler Verlust.
At the more granular lower level, anaphoric pronouns are subdivided according to the following attributes: whether the pronoun appears in the same sentence as its antecedent (intra-sentential) or in a different sentence (inter-sentential), whether the antecedent is a group noun, whether the pronoun is in subject or non-subject position (it only), and whether an instance of they is used as a singular pronoun (for example, to refer to a person of unknown gender). An overview of the resulting categories is provided in Table 2. The distribution of test suite examples over the pronoun categories in the hierarchy can be found in the first row of Table 3. The number of examples assigned to each category reflects a) the functional ambiguity of the pronoun it, b) the number of different translation options possible in German, and c) the number of pronouns in the corpus that belong to the category (for example, there are very few instances of singular they available). Within each category, we aim to create a balance in terms of the expected pronoun translation token. We achieve this by considering the translation of the set of possible candidates in the reference translation.

Evaluation Results
The evaluation included 10 systems submitted to the English-German sub-task of the WMT 2018 competition and 6 anonymized online translation systems. Among the WMT submissions, all of the systems are neural models, with the Transformer (Vaswani et al., 2017) being a popular architecture choice. Implementation details can be found in the system description papers published at WMT 2018.

Automatic Evaluation
We provide scores from two different automatic evaluation metrics for all systems in our dataset (see Table 1 and Figure 1). To give a general impression of the translation quality achieved by the various systems, we include the BLEU scores on the TED talks from which the test suite is derived. For a more pronoun-specific evaluation, we also compute APT scores (Miculicich Werlen and Popescu-Belis, 2017). For better comparability, the set of pronouns evaluated by APT was restricted to the 200 items included in the test suite. Following the recommendations of Guillou and Hardmeier (2018), we did not define any "equivalent" pronouns in the APT metric, but counted exact matches only.
A regression fit between the BLEU scores obtained and the number of examples annotated as correct by each system indicates a strong correlation between the two (Figure 2; r = 0.912, N = 16, p < 0.001), as does a similar analysis for the APT score (r = 0.887, N = 15, p < 0.001).These results, however, should be taken with a grain of salt, as we argue further in Section 5.
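As an illustrative sketch of the correlation analysis (not the authors' actual analysis code), the Pearson coefficient can be computed directly from the paired per-system scores; the score arrays below are hypothetical placeholders, not values from the paper:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-system values: in the paper, each system's BLEU score
# is paired with its number of test suite examples annotated as correct
# (out of 200), and likewise for APT.
bleu = [28.1, 31.4, 25.0, 33.2]
correct = [120, 140, 110, 150]
print(round(pearson_r(bleu, correct), 3))
```

In practice one would use `scipy.stats.pearsonr`, which additionally returns the p-value reported alongside r in the text.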

Semi-automatic Evaluation
The semi-automatic evaluation method is a two-pass procedure. It is motivated by the observation that automatic reference-based methods can identify correct examples with relatively high precision, but low recall (Guillou and Hardmeier, 2018). The evaluation procedure relies on word alignments, which were generated automatically by running Giza++ (Och and Ney, 2003) in both directions with grow-diag-final symmetrization (Koehn et al., 2005). The word alignments for the examples in the reference translation were corrected manually.
In the first step, the candidate translations are matched against the reference translation to approve examples that we can assume to be correct with reasonable confidence. Examples in the event and pleonastic categories can be approved based on a pronoun match alone; for the anaphoric categories, we also require matching antecedent translations. Two pronoun translations are considered to match if the sets of words aligned to the pronouns have at least one element in common after lowercasing. For antecedent translations, the word sequences aligned to the source antecedent must be completely equal for an automatic match. As a special exception, no automatic matches are generated for pronoun translations containing the word sie alone, so that the ambiguity between third-person plural sie and the pronoun of polite address Sie can be manually resolved.
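The matching rules of this first pass can be summarized in a short sketch (our reconstruction from the description above, not the actual evaluation code; the function names are ours):

```python
def pronoun_match(cand_words, ref_words):
    """Candidate and reference pronoun translations match if the sets of
    words aligned to the pronouns share at least one element after
    lowercasing -- except when the candidate translation is 'sie' alone,
    which is never matched automatically so that the ambiguity between
    plural 'sie' and polite 'Sie' can be resolved manually."""
    cand = {w.lower() for w in cand_words}
    if cand == {"sie"}:
        return False  # defer to manual annotation
    return bool(cand & {w.lower() for w in ref_words})

def antecedent_match(cand_seq, ref_seq):
    """Antecedent translations must be completely equal word sequences."""
    return cand_seq == ref_seq

def auto_approve(category, cand_pron, ref_pron, cand_ante=None, ref_ante=None):
    """Event and pleonastic examples are approved on a pronoun match alone;
    anaphoric examples also require a matching antecedent translation."""
    if not pronoun_match(cand_pron, ref_pron):
        return False
    if category == "anaphoric":
        return antecedent_match(cand_ante, ref_ante)
    return True
```

Examples that this pass does not approve are not rejected; they are passed on to the manual annotation step described next.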
In the second step, all examples not automatically approved are loaded into a graphical analysis tool specifically designed for the PROTEST test suite (Hardmeier and Guillou, 2016). The tool presents the annotator with the source pronoun, its translation by a given system, and the previous sentence for context. In the case of anaphoric pronouns, the context includes the sentence with the antecedent and one additional sentence. The examples were split randomly over four annotators. The annotators, who are translator trainees at Saarland University, are all native speakers of German with a good knowledge of English. To improve the quality of the annotations, the annotators had been trained beforehand on the output of a baseline NMT system.
In total, 3,200 pronoun examples from 16 systems were evaluated. 1,150 examples were approved automatically and 2,050 examples were referred for manual annotation. To verify the validity of the semi-automatic method, we also solicited manual annotations for a random sample of 350 examples that had been approved automatically.
The first step of our two-pass procedure can only approve examples; it never rejects them automatically. As a consequence, our semi-automatic evaluation is biased towards correctness with respect to a fully manual evaluation. The scores presented in Table 3 will therefore tend to overestimate the actual system performance.
The results of the human annotation of the random sample of 320 examples automatically matched as correct are presented in Table 2. Consistent with similar results for French (Hardmeier and Guillou, 2018), 86.6% of the automatically approved examples were accepted as correct by the evaluators. However, we must highlight that the accuracy of the automatic evaluation varies substantially across categories. Whilst pronouns known to be pleonastic can be checked automatically with very good confidence, the automatic evaluation of anaphoric pronouns is much more difficult, with an evaluation accuracy as low as 55.2% in the inter-sentential subject it case. This reflects the general difficulty of automatic pronoun evaluation (Guillou and Hardmeier, 2018) and reinforces the positive bias discussed in the previous paragraph for these categories in particular.
The results of the semi-automatic evaluation are displayed in Table 3. For the counts in this table, we used manual annotations wherever possible. Automatic annotations were used only for those examples that had not been annotated manually.
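The precedence rule for combining the two annotation sources can be stated as a one-liner (a minimal illustration with hypothetical dictionaries keyed by example ID, not the actual tooling):

```python
def combine_annotations(automatic, manual):
    """Per example, a manual judgement overrides the automatic one;
    automatic approvals count only where no manual annotation exists."""
    return {**automatic, **manual}  # later dict wins on shared keys

# Hypothetical example IDs and judgements:
auto = {"ex1": "correct", "ex2": "correct"}
hand = {"ex2": "incorrect", "ex3": "correct"}
merged = combine_annotations(auto, hand)
```

Here `merged` keeps the automatic approval of `ex1` but takes the manual judgements for `ex2` and `ex3`.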
The best result was obtained by the Microsoft-Marian system, which translated 157 out of 200 pronouns correctly. It is followed by a group of 5 shared task submissions that achieved scores between 145 and 148. Three of the online systems also reached scores over 140. The remaining shared task submissions are JHU with a score of 132 and LMU-nmt with a score of 127. Unsurprisingly, the unsupervised submissions are ranked last.

Discussion
Generally speaking, a high BLEU score indicates good translation quality and vice versa. The APT score has been shown to capture good pronoun translations with reasonable precision, if unsatisfactory recall (Guillou and Hardmeier, 2018), but it is also trivially correlated with our test suite score to some extent, because the automatic part of our semi-automatic evaluation identifies good translations with a mechanism that is very similar to that of APT. In the right half of Figure 2, we observe that the APT score introduces spurious differences between systems reaching exactly the same number of correctly translated items (NTT, UCAM, uedin) and fails to reward correct pronoun translations in some of the systems (Microsoft-Marian, online-B). As a result, the score can serve as an indicator, but not as a reliable replacement for a manual or semi-automatic evaluation.
Moreover, the small size of the test suite and the differences between the system architectures must be kept in mind. Considering these two factors, a larger threshold in either of the two scores is needed to claim that one system is actually better than another (Berg-Kirkpatrick et al., 2012). This caveat appears to be confirmed by the two outliers seen in the left part of Figure 2. Interestingly, the online-F system achieves many good pronoun translations despite a low BLEU score. The gap in pronoun correctness between RWTH-uns and LMU-uns (the other unsupervised system) is also much larger than the difference in their BLEU scores would suggest.
The results of the manual evaluation vary significantly by category. In the anaphoric it categories, it is evident that intra-sentential anaphora is easier to handle than inter-sentential anaphora. In the intra-sentential case, the best systems produce correct translations for 70-80% of the examples, which is a fair result, but indicates that the problem is not completely solved yet. In the inter-sentential it categories, the average performance is below 50% despite the positive bias of our evaluation method, and even the best-performing systems are not much better. It is worth noting that no single system performs best over all anaphoric categories, which suggests that the top scores achieved for this part of the test suite could be random strokes of luck. The results for pronouns in subject and non-subject positions are not very different. This contrasts with the results of Hardmeier and Guillou (2018) for English-French, where non-subject pronouns were found to be substantially harder to translate. This might be due to the fact that the direct object forms of French personal pronouns coincide with those of the definite article, a problem that does not apply to German.
The plural cases of they do not cause any serious problems, at least for the stronger systems, since they can usually be translated straightforwardly using the German pronoun sie. The errors occurring in these categories are often due to confusion with the pronoun of polite address Sie ("you"). When they has a singular antecedent or refers to a group, however, it is mistranslated much more frequently.

Table 3: Pronoun and antecedent translations marked as correct, per system.
The only system that has noticeable problems with pleonastic it is the unsupervised LMU-uns submission. Translating event it seems to be more difficult, but many systems still achieve close to perfect results in this category. Similarly to the results of Hardmeier and Guillou (2018) for English-French, this suggests that NMT systems are quite good at identifying pronouns with event reference and producing appropriate translations for them.

Conclusions
We have presented a detailed analysis of 16 NMT systems, assessing their performance in the translation of pronouns using a semi-automatic evaluation based on a balanced test suite. The results reinforce the idea that automatic evaluation scores are correlated with manual evaluation results, but they also confirm that automatic evaluation can provide a misleading picture of the behavior of some systems. The evaluation has also reinforced that special attention should be paid to the problematic cases that are only identifiable through the careful balance of categories achieved in the test suite design. This balanced design has also made us aware of the progress made by NMT in the modeling of context for the translation of pleonastic, event and intra-sentential anaphoric pronouns. Pleonastic pronouns are handled almost perfectly by most systems, so we suggest that future evaluations emphasize the more challenging cases. Anaphoric pronouns depending on the inter-sentential context remain a significant challenge. They present an ideal test case for the development of context-aware NMT systems. Research in that direction has recently gained some traction (Tiedemann and Scherrer, 2017; Wang et al., 2017; Tu et al., 2018) and has claimed promising results specifically for pronoun translation (Voita et al., 2018). It remains to be seen whether the development of such methods will lead to a breakthrough in the translation of inter-sentential anaphoric pronouns in the near future.

Figure 1: BLEU and APT scores. The three highest-ranking systems are highlighted in orange.

Figure 2: Regression of BLEU (left) and APT (right) scores against the number of pronoun examples translated correctly by each system.

Table 1: Automatic evaluation results. These scores differ from the BLEU scores of the official WMT evaluation because they are computed on a different test set, containing texts from a different domain.

Table 2: Human evaluation of automatically approved examples.