Evaluating Pronominal Anaphora in Machine Translation: An Evaluation Measure and a Test Suite

The ongoing neural revolution in machine translation has made it easier to model larger contexts beyond the sentence level, which can potentially help resolve some discourse-level ambiguities, such as pronominal anaphora, thus enabling better translations. Unfortunately, even when the resulting improvements are seen as substantial by humans, they remain virtually unnoticed by traditional automatic evaluation measures such as BLEU, as only a few words end up being affected. Thus, specialized evaluation measures are needed. With this aim in mind, we contribute an extensive, targeted dataset that can be used as a test suite for pronoun translation, covering multiple source languages and different pronoun errors drawn from real system translations into English. We further propose an evaluation measure to differentiate good and bad pronoun translations. We also conduct a user study to report correlations with human judgments.


Introduction
Traditionally, machine translation (MT) has been performed at the level of individual sentences, i.e., in isolation from the rest of the document. This was due to the nature of the underlying frameworks: word-based (Brown et al., 1993), then phrase-based (Koehn et al., 2003), syntactic (Galley et al., 2004), and hierarchical (Chiang, 2005). While there have been attempts to model the context beyond the sentence level, e.g., looking at neighboring sentences (Carpuat and Wu, 2007;Chan et al., 2007) or even at the entire document (Hardmeier et al., 2012), these approaches were still limited by the underlying framework.
Then, along came the neural revolution. Thanks to the attention mechanism, neural translation models such as sequence-to-sequence (Bahdanau et al., 2015) and the Transformer (Vaswani et al., 2017) could model much broader context. While initially translation was still done in a sentence-by-sentence fashion, researchers soon realized that going beyond that had become easier, and recent work has successfully exploited this (Bawden et al., 2018). This is an exciting research direction, as it can help address inter-sentential phenomena such as anaphora, gender agreement, lexical consistency, and text coherence, to mention just a few. Unfortunately, going beyond the sentence level typically yields very few changes in the translation output, and even when these changes are seen as substantial by humans, they remain virtually unnoticed by typical MT evaluation measures such as BLEU (Papineni et al., 2002), which are known to be notoriously problematic for the evaluation of discourse-level aspects in MT (Hardmeier, 2014).
The limitations of BLEU are well-known and have been discussed in detail in a recent study (Reiter, 2018). It has long been argued that as the quality of machine translation improves, there will be a singularity moment when existing evaluation measures would be unable to tell whether a given output was produced by a human or by a machine. Indeed, there have been recent claims that human parity has already been achieved (Hassan et al., 2018), but it has also been shown that it is easy to tell apart a human translation from a machine output when going beyond the sentence level (Läubli et al., 2018). Overall, it is clear that there is a need for machine translation evaluation measures that look beyond the sentence level, and thus can better appreciate the improvements that a discourse-aware MT system could potentially bring.
Alternatively, one could use diagnostic test sets that are designed to evaluate how an MT system handles specific discourse phenomena (Bawden et al., 2018). There have also been proposals to use semi-automatic measures and test suites. Here we propose a targeted dataset for machine translation evaluation with a focus on anaphora. We further present a specialized evaluation measure trained on this dataset. The measure performs pairwise evaluations: it learns to distinguish good vs. bad translations of pronouns, without being given explicit signals about the errors. It has been argued that pairwise evaluation is useful and sufficient for machine translation evaluation (Guzmán et al., 2015, 2017). In particular, Duh (2008) has shown that ranking-based evaluation measures can achieve higher correlations with human judgments, as ranking judgments are easier to obtain from human judges and are also easy to use in training, while directly serving the purpose of comparing two systems.
Note that while it may be possible to rank translations using strong pre-trained conditional language models such as GPT (Radford et al., 2018), all kinds of errors would influence the score, and it would not be targeted towards a specific source of error, such as anaphora here. Our model provides a way to do this, and we demonstrate that it indeed focuses on the translation of pronouns.
Although our pronoun test suite naturally consists of the source text paired with a reference human translation, our pronoun evaluation measure is generally independent of the source language. Moreover, we use real machine translation output, which may contain various types of errors. Our contributions are as follows: • We create a dataset for pronoun translation that covers multiple source languages and various target English pronouns.
• We propose a novel evaluation measure that differentiates good pronoun translations from bad ones irrespective of the source language they were translated from.
• Unlike previous work, both the dataset and the model are based on actual system outputs.
• Our evaluation measure achieves high agreement with human judgments.

Related Work
Previous work on discourse-aware machine translation and MT evaluation has targeted a number of phenomena such as anaphora, gender agreement, lexical consistency, and coherence. In this work, we focus on pronoun translation.
Pronoun translation has been the target of a shared task at the DiscoMT and WMT workshops in 2015-2017 (Hardmeier et al., 2015;Loáiciga et al., 2017). However, the focus was on cross-lingual pronoun prediction, which required choosing the correct pronouns in the context of an existing translation, i.e., this was not a realistic translation task. The 2015 edition of the task also featured a pronoun-focused translation task, which was like a normal MT task except that the evaluation focused on the pronouns only, and was performed manually. In contrast, we have a real MT evaluation setup, and we develop and use a fully automatic evaluation measure.
More recently, there has been a move towards using specialized test suites specifically designed to assess system quality for some fine-grained problematic categories, including pronoun translation. For example, the PROTEST test suite (Guillou and Hardmeier, 2016) comprises 250 pronoun tokens, used in a semi-automatic evaluation: the pronouns in the MT output and in the reference are compared automatically, but in case of no matches, manual evaluation was required. Moreover, no final aggregate score over all pronouns was produced. In contrast, we have a much larger test suite with a fully automatic measure.
Another semi-automatic system focused on just two pronouns, it and they, and was applied to a single language pair. In contrast, we have a fully automated evaluation measure, we handle many English pronouns, and we cover multiple source languages. Bawden et al. (2018) presented hand-crafted discourse test sets to probe a model's ability to exploit previous source and target sentences, based on 200 contrastive pairs, each consisting of a correct and a wrong pronoun translation. This alleviates the need for an automatic evaluation measure, as one can simply count how many times the MT system has generated the correct pronoun. In contrast, we use texts from pre-existing MT evaluation datasets, we do not require them to come in contrastive pairs, and we have a fully automated evaluation measure; we also use larger datasets. Subsequent work also used contrastive translation pairs, mined from a parallel corpus using automatic coreference-based mining of context, thus minimizing the risk of producing contrastive examples in which both alternatives are valid translations. Yet, that work did not propose an evaluation measure.
Finally, there have been pronoun-focused automatic machine translation evaluation measures.
Two important examples include APT (Miculicich Werlen and Popescu-Belis, 2017) and AutoPRF (Hardmeier and Federico, 2010). Both measures require alignments between the source, the reference, and the system output texts in order to evaluate the pronoun translations. However, automatic alignments are noisy, and it has been shown that improvements using alignment heuristics are not statistically significant. Low agreement has also been found between these measures and human judgments, primarily due to the many possible translation choices per pronoun. APT further uses a predetermined list of 'equivalent pronouns', obtained for specific pronouns based on a French grammar book and verified through probability counts. This list is used to weight pronouns that are not exact matches, and the accuracy of the pronoun translations is calculated accordingly. Miculicich Werlen and Popescu-Belis (2017) collect such a list for English-French for the pronouns it and they. This limits the evaluation measure both in terms of the languages and the pronouns it is applicable to. In contrast, our framework requires only two candidate translations of the same text as input for comparison: this could be a reference vs. a system translation, or a comparison between two candidate translations (see Section 5.5).

Dataset Generation
We automatically generated our dataset, which we used to build a pronoun test suite and to train a pronoun evaluation model. In order to avoid generating synthetic data that may not necessarily represent a difficult context (for an MT system to correctly translate the pronouns), we used data from actual system outputs submitted for the WMT translation tasks (Callison-Burch et al., 2011; Bojar et al., 2013, 2014, 2015, 2017). Using such data means that what is essentially a conditional language model solution, such as the one used by Bawden et al. (2018), has already failed on these examples.

Figure 1: Example of noisy candidate generation.
  Reference translation: He was creative, generous, funny, loving and talented, and I will miss him dearly.
  MT system translation: It was creative, generous, funny, affectionate and talented, and we will greatly miss.
  Generated noisy example 1: It was creative, generous, funny, loving and talented, and I will miss him dearly.
  Generated noisy example 2: He was creative, generous, funny, loving and talented, and we will miss him dearly.

In particular, we aligned the system outputs with the reference translations using an automatic alignment tool (Dyer et al., 2013), and we found examples in which the pronouns did not match the reference translation. This process yielded potentially noisy data, as the alignments are automatic and thus not always perfect.
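The mismatch-mining step can be sketched as follows, assuming word alignments from an automatic aligner (e.g., fast_align) are available as (reference index, system index) pairs; the pronoun list and function names are illustrative, not the authors' actual code:

```python
# Hypothetical pronoun inventory; the real suite covers a wider range.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them",
            "his", "its", "their", "who", "which", "that"}

def mine_pronoun_mismatches(ref_tokens, sys_tokens, alignment):
    """Return (ref_idx, ref_pronoun, sys_word) for aligned pronoun mismatches."""
    mismatches = []
    for ref_idx, sys_idx in alignment:
        ref_word = ref_tokens[ref_idx].lower()
        sys_word = sys_tokens[sys_idx].lower()
        if ref_word in PRONOUNS and ref_word != sys_word:
            mismatches.append((ref_idx, ref_word, sys_word))
    return mismatches

def make_noisy_candidates(ref_tokens, mismatches):
    """One noisy candidate per mismatch: the reference with one pronoun swapped."""
    candidates = []
    for ref_idx, _, sys_word in mismatches:
        noisy = list(ref_tokens)
        noisy[ref_idx] = sys_word
        candidates.append(noisy)
    return candidates
```

For the example of Figure 1, the single mismatch he/it yields one noisy candidate differing from the reference in exactly that pronoun.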

User Study
In order to ensure that the mismatched pronouns are not equally good translations in the given context, we conducted a user study on a subset of the data. To focus the study on pronouns and to remove the influence of other MT errors, we generated a noisy candidate by replacing the correct pronoun in the reference with the aligned (potentially) incorrect pronoun from the system output. We did this for each differing pronoun, so that the difference between the reference and the noisy version is one pronoun only (see Figure 1).
Our goal was to find pronoun pairs, e.g., he-it in Figure 1, where there is high agreement that the reference is the correct translation, so that we can automatically classify it as a positive example and the MT output as a negative one. The study participants were fluent in English and were native speakers of Chinese, Russian, French, or German. They were shown the source and two candidate translations (the reference and its noisy version) in random order. The relevant sentence was shown in bold, with the pronoun highlighted. The two previous sentences were given as context; see Figure 2. We asked the study participants to choose the text with the better pronoun. They were allowed to choose (a) candidate A, (b) candidate B, (c) equivalent translations (tie), (d) "neither is correct", and (e) "invalid candidates", i.e., the highlighted words are not pronouns or are the wrong pronouns due to misalignment. We excluded from further consideration all examples marked as (d) or (e). Each participant annotated a total of 500 examples per language pair. Statistics about the annotation process are given in Table 1. 1 We also report the proportion of cases where the participants preferred the reference translation over the noisy version (see Avg%Ref). 2 We can see that there is high agreement for all language pairs, ranging from 0.82 to 0.89. The ties seem to be the major source of disagreement: excluding them yields agreements in the range of 0.91-0.97.

1 Due to the nature of the dataset, the human annotators are always more likely to choose the reference as the better candidate, which yields a skewed distribution of the annotations; traditional agreement measures such as Cohen's kappa are not robust to this, and thus we report the more appropriate Gwet's AC1/gamma coefficient (Gwet, 2008).
2 High agreement could also mean that the participants consistently picked the noisy version as the better choice; this is why we also report Avg%Ref.
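For two annotators and categorical choices, the agreement statistic used above can be sketched as follows; this is a generic implementation of Gwet's AC1 formula (Gwet, 2008), not the authors' code:

```python
def gwet_ac1(rater_a, rater_b):
    """Gwet's AC1 for two raters labeling the same items with categories."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    k = len(categories)
    # Observed agreement: fraction of items both raters labeled identically.
    p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: based on the average proportion pi_c of each category.
    p_e = 0.0
    for c in categories:
        pi_c = (rater_a.count(c) + rater_b.count(c)) / (2 * n)
        p_e += pi_c * (1 - pi_c) / (k - 1)
    return (p_a - p_e) / (1 - p_e)
```

Unlike Cohen's kappa, AC1 stays well-behaved when one category (here, preferring the reference) dominates the annotations.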
In order to measure the effect, if any, that the source text might have on the annotators' choices, we also conducted a study without providing them the source text. We did this for the texts from the Chinese→English study. The participants were only shown the English texts: the reference vs. the noisy sentence, with the context as before. The results for this study are shown on the last line of Table 1. We can see that the agreement for this English-only setup is also fairly high (0.84); the overall agreement between all six participants (three from Chinese→English and three from English-only) is 0.85. We observe very similar agreement of 0.91 (Chinese→English) and 0.92 (English-only) when the ties are excluded, with the overall agreement being 0.90 between the six participants. However, further analysis showed that although the two groups each disagreed on about 10% of the examples, only 2% of the examples were common to both groups, showing that the sources of disagreement in the two groups are different. It may be that having the source context helps disambiguate the other 8% of the cases, while also introducing ambiguity that does not seem to be an issue for the participants who saw the English texts only. See Figure 1 for an example where the source is helpful; there, noisy example 2 would be acceptable, except that the original French text uses a singular pronoun. However, the disagreements form a small part of the dataset; we also filtered out all pronoun pairs with low agreement from further use. Figure 3 shows one such low-agreement example.

Figure 3: A low-agreement example.
  Noisy candidate: For now she just wants to enjoy the moment. I didn't want to say that was my last race. That would have meant too much pressure.

Pronoun Test Suite for MT Systems
The source sentences can also be used as a test suite for MT systems to check their pronoun translations: it can be considered to be a challenging, diagnostic test set for pronoun translation, covering a range of errors such as gender (he/she→it), number (they→it), animacy (who→which), syntactic role (e.g., subject/object: he→him), and others; see the Appendix for a complete list.
The WMT test sets come from news articles; the context is available, and thus the test suite is particularly suitable for discourse-level MT systems. It is available for each source language for which English translations are generated in the WMT tasks, e.g., Czech, French, German, etc. (Table 2).
We also release the corresponding noisy versions of the references; since these were generated automatically, they contain some noise. However, the subset used in our user study is effectively curated, as human judgments are available for it. This data can serve as a more refined test suite: it is useful not only for checking agreement with human judgments, but also for identifying equivalent pronoun translations in context, as the data is also annotated for ties.

The Evaluation Measure
While diagnostic datasets allow us to evaluate MT systems with respect to specific discourse-level phenomena, an automatic discourse-aware evaluation measure is useful not only for evaluation but also for tuning MT systems. Moreover, an evaluation measure that only looks at the target language (which is computationally feasible, even if not ideal, as our study above has shown) offers additional benefits: we can train it for a specific target language without requiring a separate dataset for each source-target language pair. Below, we propose such a measure for pronoun translation.

Let R = (C_r; r) and S = (C_s; s) denote a reference and a system tuple pair containing a reference and a system translation, r and s, along with a context of previous sentences, C_r and C_s, respectively. Note that C_r and C_s can contain the same sentences or different sentences, or be empty in case no context is provided. Given a training set D containing N such tuple pairs, our aim is to learn an evaluation measure that can rank any unseen translation pair (R, S) with respect to the correct use of pronouns. In Section 3, we described how such datasets can be collected opportunistically, without recourse to expensive manual annotation.

Figure 4 shows our proposed framework to evaluate MT outputs with respect to pronouns. The inputs to the model are the sentences R and S (with or without the contexts C_r and C_s). Each input sentence is first mapped into a set of word embedding vectors of dimensionality d by performing a lookup in the shared embedding matrix E ∈ R^{v×d} with vocabulary size v. E can be initialized randomly, with pre-trained embeddings such as GloVe (Pennington et al., 2014), or with contextualized word vectors such as ELMo (Peters et al., 2018a).
In the case of initialization with GloVe vectors, we use a BiLSTM (Hochreiter and Schmidhuber, 1997) layer to get a representation of the words that is encoded with contextual information. Let X = (x_1, x_2, ..., x_n) denote an input sequence, where x_t is the t-th word embedding vector of the sequence. The LSTM recurrent layer computes a compositional representation k_t at every time step t by performing nonlinear transformations of the current input x_t and the output of the previous time step k_{t-1}. In a BiLSTM, we get the representation →k_t by processing the sequence in the forward direction, and the representation ←k_t by processing the sequence in the backward direction. The final representation of a word is the concatenation of these two representations, i.e., k_t = [→k_t; ←k_t].

With ELMo initialization, the word vectors obtained are used directly. ELMo uses a stacked biLSTM encoder and gives very powerful contextualized word representations learned from large corpora by optimizing a bidirectional language modeling loss. The ELMo representations already capture morphological, syntactic, and contextual semantic features (Peters et al., 2018b). 4

Let K_r and K_s be the matrices whose rows are the word representations of R and S, respectively (obtained either from the BiLSTM or directly from ELMo). From these representations, we extract the representations of the pronouns in the target sentences (from r and s; not from the contexts). Let P_r and P_s be the matrices whose rows are the contextualized representations of the pronouns in r and s, respectively. We use zero-padding (shown as shaded boxes in Figure 4) to make P_r and P_s fixed-length. We then use scaled multiplicative attention (Vaswani et al., 2017) to compute a contextual representation for the pronouns in r and s. Specifically, we consider the rows of P_r (resp. P_s) as query vectors and the rows of K_r (resp. K_s) as key and value vectors, letting P_r (resp. P_s) attend over K_r (resp. K_s).
4 We also tried an ELMo-initialized BiLSTM, but it did not perform well while increasing model complexity.
We use a residual connection and layer normalization to get pronoun representations B_r and B_s:

  B_r = LayerNorm(P_r + Attention(P_r, K_r, K_r))
  B_s = LayerNorm(P_s + Attention(P_s, K_s, K_s))

Note that B_r and B_s contain a d-dimensional vector for each query (pronoun) vector (and zero vectors due to padding). We pass these vectors through a shared linear layer parameterized by z ∈ R^d to obtain a score for each pronoun. This yields vectors u_r and u_s for the reference and for the system translations:

  u_r = B_r z,  u_s = B_s z

A final shared linear layer parameterized by w converts these vectors to contrastive scores, yielding a (positive) score y_r for the reference and a (negative) score y_s for the system translation:

  y_r = w^T u_r,  y_s = w^T u_s

We then use the scores in a pairwise ranking loss (Collobert et al., 2011) to find model parameters that assign a higher score to y_r than to y_s. We minimize the following ranking objective:

  L(θ) = Σ_{i=1}^{N} max(0, 1 − y_r^(i) + y_s^(i))

Note that the network shares all of its parameters (θ) to obtain y_r and y_s from a pair of inputs R_i = (C_r, r) and S_i = (C_s, s). Once trained, it can be used to score any input independently.
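The attention, scoring, and ranking-loss steps above can be sketched in NumPy. This is a minimal forward-pass illustration: dimensions and parameter values are arbitrary, and a real implementation would learn z and w by backpropagation in a deep-learning framework:

```python
import numpy as np

def attention(Q, K, V):
    # Scaled multiplicative (dot-product) attention (Vaswani et al., 2017).
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def layer_norm(X, eps=1e-6):
    mu = X.mean(axis=-1, keepdims=True)
    sigma = X.std(axis=-1, keepdims=True)
    return (X - mu) / (sigma + eps)

def pronoun_score(P, K, z, w):
    # B = LayerNorm(P + Attention(P, K, K)); u = B z; y = w^T u
    B = layer_norm(P + attention(P, K, K))
    u = B @ z            # one score per (padded) pronoun slot
    return float(w @ u)  # contrastive sentence-level score

def ranking_loss(y_r, y_s):
    # Pairwise hinge loss: prefer the reference score y_r by a margin of 1.
    return max(0.0, 1.0 - y_r + y_s)
```

Here P plays the role of P_r (or P_s) and K the role of K_r (or K_s); the loss is zero once the reference outscores the system translation by the margin.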

Experiments
Below, we describe our data, the experimental setup, and the evaluation results.

Data
We first created a set of commonly confused pronoun pairs. Using the data from the study, we calculated the inter-annotator agreement for each pair of a reference/correct pronoun and a system translation/incorrect pronoun. We excluded the pairs with low agreement (<0.8) or for which the system output was chosen as the correct translation more often. Pairs with low agreement are essentially cases where the annotators cannot agree that the reference translation is better. The source of ambiguity in this case is that the system translation is not absolutely wrong (see Figure 3); therefore, these cases may not be so critical to correct. The remaining pairs are, with a fairly high confidence, positive-negative (correct-incorrect) pairs.
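The pair-selection step can be sketched with a simplified majority-vote proxy for agreement (the paper uses Gwet's AC1 over the full annotations); field names and the choice labels are illustrative:

```python
from collections import defaultdict

def filter_pronoun_pairs(annotations, min_agreement=0.8):
    """annotations: (ref_pronoun, sys_pronoun, choice) tuples, where choice is
    'ref', 'sys', or 'tie'. Keep pairs with high agreement favoring the reference."""
    by_pair = defaultdict(list)
    for ref_p, sys_p, choice in annotations:
        by_pair[(ref_p, sys_p)].append(choice)
    kept = []
    for pair, choices in by_pair.items():
        ref_votes = choices.count("ref")
        sys_votes = choices.count("sys")
        agreement = max(ref_votes, sys_votes) / len(choices)
        # Keep only pairs where annotators clearly prefer the reference.
        if agreement >= min_agreement and ref_votes > sys_votes:
            kept.append(pair)
    return kept
```

Pairs failing the threshold, like the it-she case of Figure 3, are excluded from training as the "wrong" pronoun may still be an acceptable translation.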
Next, we filtered the WMT data, keeping only sentences with these pronoun pairs. This yielded 97,461 reference translation (positive text) / unique system output (negative text) pairs for training, taken from WMT 2011, 2012, 2013, and 2015 (Table 3).
The development data, collected from WMT14 system outputs, has 5,727 unique system translations and 6,635 unique noisy candidate pairs. 5 For testing, we used the annotated data from the user study, generated from a subset of WMT17 system translations (except for French, which is from the discussion-forum test set of WMT15 and does not overlap with the training data). There are 500 unique noisy-reference pairs per source language, for a total of 2,000.

Table 3: Statistics about our dataset.

5 The number of unique noisy candidates exceeds that of unique system translations because a separate noisy candidate was generated for each error in a system translation.

Experimental Setup
We evaluated the models in terms of accuracy, i.e., the proportion of times the model scored the reference translation higher than the system/noisy output, and we report results using either GloVe (Pennington et al., 2014) or ELMo (Peters et al., 2018a). We conducted a number of experiments, training and testing under different conditions:

No Context (NC): The reference R (or the noisy reference R′) and the system translation S go through ELMo or the BiLSTM without contextual information C_r/C_s (query = P_r/P_s; key, value = R = r / S = s).

With Context: The reference R (or the noisy reference R′) and the system translation S include the two previous sentences as context. The context can be further categorized as:
(i) Respective Context (RC): R includes its own reference context C_r = r_-2 r_-1, and S includes its own system context C_s = s_-2 s_-1 (query = P_r/P_s; key, value = R = r_-2 r_-1 r / S = s_-2 s_-1 s).
(ii) Common Reference Context (CRC): The context for both R and S is the same reference context C_r = C_s = r_-2 r_-1 (query = P_r/P_s; key, value = R = r_-2 r_-1 r / S = r_-2 r_-1 s).

We perform the evaluation in two ways: (a) R vs. S: testing over pairs of reference (R) and system translation (S) texts; (b) R vs. R′: testing over pairs of reference (R) and noisy candidate (R′) texts.
Baseline. For a baseline, we simply take the average of the extracted pronoun representations in P_r and P_s, and we convert them to pairwise scores through linear layers. The baseline is also evaluated with and without context.
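The baseline scorer and the accuracy criterion can be sketched as follows; parameter values are illustrative, and in practice the linear layers are trained with the same ranking objective:

```python
import numpy as np

def baseline_score(P, z, w_scalar=1.0):
    # Average the pronoun vectors, then apply linear layers to get a score.
    mean_vec = P.mean(axis=0)
    return w_scalar * float(mean_vec @ z)

def pairwise_accuracy(score_pairs):
    """score_pairs: list of (y_ref, y_sys); accuracy = fraction where ref wins."""
    return sum(y_r > y_s for y_r, y_s in score_pairs) / len(score_pairs)
```

Replacing the attention step with this simple average is what the attention-based model is compared against in Table 4.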

Results and Analysis
The first three experiments in Table 4 show results for the 'No Context' setting. Exp. 1-2 are on the noisy data and are indicative of how sensitive our model is to pronouns, since R and R′ differ in a single pronoun. The results on the reference/system translation pair R and S (Exp. 3) are indicative of the performance in a real use case. In the case of the GloVe-BiLSTM model, using attention instead of averaging the pronoun representations does not help. However, using ELMo greatly improves the accuracy of the baseline and also yields an improvement when using attention. ELMo models syntactic and semantic information and can thus improve coreference resolution (Peters et al., 2018a), which could be a contributing factor. The second part of our experiments (Exp. 4-5) concerns the addition of contextual information. When the respective contexts are added, there is no improvement (and even a drop in the case of GloVe-BiLSTM); quite possibly, other differences in the text make it harder for the model to focus on the pronoun errors.
To offset this issue, we use a common context for both the reference and the system sentences, taken from the reference (Exp. 6-8). Training the model with a common reference context (CRC) makes only a marginal difference for the GloVe-BiLSTM model, but the ELMo model improves with the addition of context. Our experiments show that ELMo is quite powerful at capturing contextual information, even as the context size grows.
Finally, we test our ELMo model with common reference context (CRC) on the held-out dataset from the user study. Table 5 shows the results. The accuracy of the system is lower on the study dataset. Note that since the training data was filtered to keep only the pronoun pairs with high agreement in the study, the study dataset contains pronoun pairs with low agreement that were not seen during training. We also calculate the agreement with human judgments excluding ties, since our model does not handle ties. The overall agreement with the human judgments remains high.

Pronoun-wise Analysis
We performed a pronoun-wise analysis of the results in Table 5. The model scored the noisy version higher than the reference in about 19% of the cases. Of these, 46% were pronoun pairs that were not seen during training.
Of the remaining pronoun pairs that were seen during training, the main source of errors (over 11%) were cases when that in the reference was replaced by it in the noisy version ("We always tell victims not to pay up; that/it simply exacerbates the problem", explains Kleczynski). The noisy candidate was scored higher about 28% of the time, out of 79 samples. The next highest source of errors was the reference-noisy pair of it-she at 10%.
In contrast, the best-performing pair was when a he in the reference was replaced with an it in the noisy version (He/It risked everything to save other people's lives.). The reference was scored higher than the noisy candidate 95% of the time, out of 135 samples. The next highest performer was the reference-noisy pair of his-its, which was correctly scored 86% of the time.
This performance follows the distribution of the pronoun pairs seen during training. The he-it and his-its pairs together account for over 12% of the training data, while the that-it and it-she pairs together form only 3.7%. Since the distribution of the pronoun pairs in the training data is itself based on the distribution of errors in the system translations, the model performs best over error cases that occur most often in system translations.

Discussion
As systems participating in WMT improve over the years, our test data is closer to state-of-the-art neural MT, while most of the training data is from statistical systems. This setting allows us to capture a wider range of errors while showing that our model is sensitive to small errors in a fluent output.
Note that since the model was trained on the full system output and also on all pronouns in that output, it has not received any signal about which pronoun is wrong. Yet, we can see from the attention maps in Figure 5 that the model can correctly identify the incorrect pronoun. In Figure 5 (top), the model distinguishes the wrong pronoun it (the correct one is she), while in Figure 5 (bottom) it correctly finds herself as the wrong translation (the correct one is she).
We further compare the scores of two system translations for the same sentence from WMT17 Russian-English system outputs (see Figure 6). Here, the correct pronoun to be found is her. While one system translates it alternately as its and his (see Figure 6, top), the other system translates both cases as his (see Figure 6, bottom). Our model scores the translation in Figure 6 (bottom) higher than the translation in Figure 6 (top); even though the model highlights both occurrences of his as wrong, it ends up believing that having its as a translation is worse. We could argue that the translation in Figure 6 (bottom) is better since it maintains the animacy/human aspect, even if the grammatical gender is wrong; moreover, this translation is also consistent.
One could further argue that pronoun-focused automatic machine translation evaluation measures such as APT and AutoPRF are likely to yield the same accuracy/precision-recall for both cases above.

Conclusions and Future Work
We have presented a new, extensive, targeted dataset for pronoun translation that covers multiple source languages and a wide range of target English pronouns. We have also proposed a novel evaluation measure for differentiating good vs. bad pronoun translations, irrespective of the source language, which achieved high correlation with human judgments.
In future work, we want to handle cases where multiple pronouns are equally suitable in a given context. We would also like to extend the work to other discourse phenomena.