Automatic Reference-Based Evaluation of Pronoun Translation Misses the Point

We compare the performance of the APT and AutoPRF metrics for pronoun translation against a manually annotated dataset comprising human judgements as to the correctness of translations of the PROTEST test suite. Although there is some correlation with the human judgements, a range of issues limit the performance of the automated metrics. Instead, we recommend the use of semi-automatic metrics and test suites in place of fully automatic metrics.


Introduction
As the general quality of machine translation (MT) increases, there is a growing interest in improving the translation of specific linguistic phenomena. A case in point that has been studied in the context of both statistical (Hardmeier, 2014; Guillou, 2016; Loáiciga, 2017) and neural MT (Bawden et al., 2017; Voita et al., 2018) is that of pronominal anaphora. In the simplest case, translating anaphoric pronouns requires the generation of corresponding word forms respecting the grammatical constraints on agreement in the target language, as in the following English-French example, where the correct form of the pronoun in the second sentence varies depending on which of the (equally correct) translations of the word bicycle was used in the first:
(1) a. I have a bicycle. It is red.
    b. J'ai un vélo. Il est rouge. [ref]
    c. J'ai une bicyclette. Elle est rouge. [MT]
However, the problem is more complex in practice because there is often no 1 : 1 correspondence between pronouns in two languages. This is easily demonstrated at the corpus level by observing that the number of pronouns varies significantly across languages in parallel texts (Mitkov and Barbu, 2003), but it tends to be difficult to predict in individual cases.
* Both authors contributed equally.
In general MT research, significant progress was enabled by the invention of automatic evaluation metrics based on reference translations, such as BLEU (Papineni et al., 2002). Attempting to create a similar framework for efficient research, researchers have proposed automatic reference-based evaluation metrics specifically targeting pronoun translation: AutoPRF (Hardmeier and Federico, 2010) and APT (Miculicich Werlen and Popescu-Belis, 2017). We study the performance of these metrics on a dataset of English-French translations and investigate to what extent automatic evaluation based on reference translations provides insights into how well an MT system handles pronouns. Our analysis clarifies the conceptual differences between AutoPRF and APT, uncovering weaknesses in both metrics, and investigates the effects of the alignment correction heuristics used in APT. By using the fine-grained PROTEST categories of pronoun function, we find that the accuracy of the automatic metrics varies across pronouns of different functions, suggesting that certain linguistic patterns are captured better in the automatic evaluation than others. We argue that fully automatic wide-coverage evaluation of this phenomenon is unlikely to drive research forward, as it misses essential parts of the problem despite achieving some correlation with human judgements. Instead, semiautomatic evaluation involving automatic identification of correct translations with high precision and low recall appears to be a more achievable goal. Another more realistic option is a test suite evaluation with a very limited scope.

Pronoun Evaluation Metrics for MT
Two reference-based automatic metrics of pronoun translation have been proposed in the literature.
The first (Hardmeier and Federico, 2010) is a variant of precision, recall and F-score that measures the overlap of pronouns in the MT output with a reference translation. It lacks an official name, so we refer to it as AutoPRF following the terminology of the DiscoMT 2015 shared task (Hardmeier et al., 2015). The scoring process relies on a word alignment between the source and the MT output, and between the source and the reference translation. For each input pronoun, it computes a clipped count (Papineni et al., 2002) of the overlap between the aligned tokens in the reference and the MT output. The clipped count of a given word is defined as the number of times it occurs in the MT output, limited by the number of times it occurs in the reference translation. The final metric is then calculated as the precision, recall and F-score based on these clipped counts.
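The clipped-count computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `clipped_overlap` and `auto_prf` are hypothetical, and the sketch assumes that the tokens aligned to each source pronoun in the MT output and in the reference have already been extracted from the word alignments.

```python
from collections import Counter


def clipped_overlap(mt_tokens, ref_tokens):
    """Clipped count: each word counts as often as it occurs in the MT
    output, limited by how often it occurs in the reference."""
    mt_counts = Counter(mt_tokens)
    ref_counts = Counter(ref_tokens)
    return sum(min(count, ref_counts[word]) for word, count in mt_counts.items())


def auto_prf(examples):
    """Precision, recall and F-score over clipped counts.

    `examples` holds one (mt_aligned_tokens, ref_aligned_tokens) pair
    per source pronoun (hypothetical input format).
    """
    overlap = mt_total = ref_total = 0
    for mt_tokens, ref_tokens in examples:
        overlap += clipped_overlap(mt_tokens, ref_tokens)
        mt_total += len(mt_tokens)
        ref_total += len(ref_tokens)
    precision = overlap / mt_total if mt_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denominator = precision + recall
    f_score = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_score


# Toy input: three source pronouns with their aligned target tokens.
toy_examples = [(["il"], ["il"]), (["elle"], ["il"]), (["il", "est"], ["il"])]
precision, recall, f_score = auto_prf(toy_examples)
```

Note that the third toy example aligns the pronoun to two MT tokens, so a non-pronominal token (est) lowers precision even though the pronoun itself matches.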
Miculicich Werlen and Popescu-Belis (2017) propose a metric called Accuracy of Pronoun Translation (APT) that introduces several innovations over the previous work. It is a variant of accuracy, so it counts, for each source pronoun, whether its translation can be considered correct, without considering multiple alignments. Since word alignment is problematic for pronouns, the authors propose a heuristic procedure to improve alignment quality. Finally, it introduces the notion of pronoun equivalence, assigning partial credit to pronoun translations that differ from the reference translation in specific ways deemed to be acceptable. In particular, it considers six possible cases when comparing the translation of a pronoun in MT output and the reference. The pronouns may be: (1) identical, (2) equivalent, (3) different/incompatible, or there may be no translation in: (4) the MT output, (5) the reference, (6) both the MT output and the reference. Each of these cases may be assigned a weight between 0 and 1 to determine the level of correctness.
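The case-based scoring described above can be sketched as a weighted accuracy. This is an illustrative sketch rather than the APT reference implementation: the names `classify` and `apt_score`, the case labels, and the representation of the equivalence dictionary as a set of unordered pairs are all assumptions. The sketch also assumes that the sixth case means the pronoun is untranslated in both the MT output and the reference, and that the alignment step has already produced one target pronoun (or `None`) per source pronoun.

```python
def classify(mt_pronoun, ref_pronoun, equivalents):
    """Assign one of six (hypothetically named) cases to a pronoun pair."""
    if mt_pronoun is None and ref_pronoun is None:
        return "missing_both"
    if mt_pronoun is None:
        return "missing_mt"
    if ref_pronoun is None:
        return "missing_ref"
    if mt_pronoun == ref_pronoun:
        return "identical"
    if frozenset((mt_pronoun, ref_pronoun)) in equivalents:
        return "equivalent"
    return "incompatible"


def apt_score(pairs, weights, equivalents=frozenset()):
    """Weighted accuracy: mean case weight over all source pronouns.

    Cases without an entry in `weights` receive weight 0.
    """
    if not pairs:
        return 0.0
    total = sum(weights.get(classify(mt, ref, equivalents), 0.0) for mt, ref in pairs)
    return total / len(pairs)


# Toy data: (MT pronoun, reference pronoun) per source pronoun.
toy_pairs = [("il", "il"), ("ce", "il"), ("elle", "il"), (None, "il")]
toy_equivalents = {frozenset(("il", "ce"))}
strict = apt_score(toy_pairs, {"identical": 1.0}, toy_equivalents)
lenient = apt_score(toy_pairs, {"identical": 1.0, "equivalent": 0.5}, toy_equivalents)
```

The two weight dictionaries show how the same case assignments yield different scores depending on how much credit equivalent pronouns receive.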

The PROTEST Dataset
We study the behaviour of the two automatic metrics using the PROTEST test suite. The test suite comprises 250 hand-selected personal pronoun tokens taken from the DiscoMT2015.test dataset of TED talk transcriptions and translations and annotated according to the ParCor guidelines (Guillou et al., 2014). It is structured according to a linguistic typology motivated by work on functional grammar by Dik (1978) and Halliday (2004). Pronouns are first categorised according to their function:
anaphoric: I have a bicycle. It is red.
event: He lost his job. It was a shock.
pleonastic: It is raining.
addressee reference: You're welcome.
They are then subcategorised according to morphosyntactic criteria, whether the antecedent is a group noun, whether the antecedent is in the same or a different sentence, and whether an addressee reference pronoun refers to one or more specific people (deictic) or to people in general (generic).
Manual evaluation was conducted using the PROTEST graphical user interface and accompanying guidelines. The annotators were asked to make judgements (correct/incorrect) on the translations of the pronouns and antecedent heads whilst ignoring the correctness of other words (except in cases where it impacted the annotator's ability to make a judgement). The annotations were carried out by two bilingual English-French speakers, both of whom are native speakers of French. Our human judgements differ in important ways from the human evaluation conducted for the same set of systems at DiscoMT 2015 (Hardmeier et al., 2015), which was carried out by non-native speakers over an unbalanced data sample using a gap-filling methodology. In the gap-filling task, annotators are asked to select, from a predefined list (including an uninformative catch-all group "other"), those pronouns that could fill the pronoun translation slot. Unlike in the PROTEST evaluation, the pronoun translations were obscured in the MT output. This avoided priming the annotators with the output of the candidate translation, but it occasionally caused valid translations to be rejected because they were missed by the annotator.

Accuracy versus Precision/Recall
There are three ways in which APT differs from AutoPRF: the scoring statistic, the alignment heuristic in APT, and the definition of pronoun equivalence.
APT is a measure of accuracy: It reflects the proportion of source pronouns for which an acceptable translation was produced in the target. AutoPRF, by contrast, is a precision/recall metric on the basis of clipped counts. Hardmeier and Federico (2010) motivate the use of precision and recall by pointing out that word alignments are not 1 : 1, so each pronoun can be linked to multiple elements in the target language, both in the reference translation and in the MT output. Their metric is designed to account for all linked words in such cases.
To test the validity of this argument, we examined the subset of examples of 8 systems in our English-French dataset¹ giving rise to a clipped count greater than one² and found that these examples follow very specific patterns. All 143 cases included exactly one personal pronoun. In 99 cases, the additional matched word was the complementiser que 'that'. In 31 and 4 cases, respectively, it was a form of the auxiliary verbs avoir 'to have' and être 'to be'. One example matched both que and a form of être. Two had reflexive pronouns, and one an imperative verb form. With the possible exception of the two reflexive pronouns, none of this seems to be relevant to pronoun correctness. We conclude that it is more reasonable to restrict the counts to a single pronominal item per example. With this additional restriction, however, the recall score of AutoPRF becomes equivalent to a version of APT without equivalent pronouns and alignment correction. We therefore limit the remainder of our study to APT.
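The equivalence stated above can be made explicit with a short derivation. This is a sketch under stated assumptions: the symbols N, c_i, m_i, r_i and R_i are introduced here for illustration only, and we assume that every source pronoun has exactly one aligned pronominal item in the reference.

```latex
% Assumed notation (not taken from the original metric definitions):
%   N    number of source pronouns
%   R_i  reference tokens aligned to source pronoun i
%   c_i  clipped count for source pronoun i
%   m_i  pronoun in the MT output, r_i pronoun in the reference
\[
  \mathrm{Recall}_{\mathrm{AutoPRF}}
    = \frac{\sum_{i=1}^{N} c_i}{\sum_{i=1}^{N} |R_i|}
\]
% Restricting each example to a single pronominal item gives
% |R_i| = 1 and c_i = 1 if m_i = r_i, else 0, so that
\[
  \mathrm{Recall}_{\mathrm{AutoPRF}}
    = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[m_i = r_i]
\]
```

The right-hand side is the proportion of source pronouns whose translation is identical to the reference, i.e. APT with weight 1 for identical matches and weight 0 for every other case.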

Effects of Word Alignment
APT includes a heuristic alignment correction procedure to mitigate errors in the word alignment between a source-language text and its translation (reference or MT output). We ran experiments to assess the correlation of APT with human judgements, with and without the alignment correction heuristics. Table 1 displays the APT results in both conditions and the proportion of pronouns in the PROTEST test suite marked as correctly translated. For better comparison with the PROTEST test suite results, we restricted APT to the pronouns in the test suite. We used two different weight settings: APT-A uses weight 1 for identical matches and 0 for all other cases. APT-B uses weight 1 for identical matches, 0.5 for equivalent matches and 0 otherwise.
¹ Excluding the YANDEX system, which was added later.
² A clipped count greater than one for a given pronoun translation indicates that the MT output and the reference translation aligned to this pronoun overlap in more than one token.
There is little difference in the APT scores when we consider the use of alignment heuristics. This is due to the small number of pronouns for which alignment improvements are applied for most systems (typically 0-12 per system). The exception is the ITS2 system output for which 18 alignment improvements are made. For the following systems we observe a very small increase in APT score for each of the two weight settings we consider, when alignment heuristics are applied: UU-HARDMEIER (+0.8), ITS2 (+0.8), BASELINE (+0.8), YANDEX (+0.8), and NYU (+0.4). However, these small improvements are not sufficient to affect the system rankings. It seems, therefore, that the alignment heuristic has only a small impact on the validity of the score.
To assess differences in correlation with human judgement for pairs of APT settings, we ran the Williams significance test (Williams, 1959; Graham and Baldwin, 2014). The test reveals that differences in correlation between the various configurations of APT and human judgements are not statistically significant (p > 0.2 in all cases).
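For readers unfamiliar with the Williams test, the following sketch computes its test statistic for two dependent correlations. It uses the commonly cited formulation of the statistic and should be checked against the cited sources before use. The function name `williams_t` and its argument names are hypothetical: `r12` and `r13` stand for the correlations of the human judgements with two metric configurations, `r23` for the correlation between the two configurations, and `n` for the number of systems.

```python
import math


def williams_t(r12, r13, r23, n):
    """Williams test statistic for H0: r12 == r13, where both correlations
    share variable 1 and variables 2 and 3 are correlated (r23).

    Returns (t, degrees_of_freedom). The p-value is then read from a
    Student's t distribution with n - 3 degrees of freedom.
    """
    if n <= 3:
        raise ValueError("the test needs more than three observations")
    k = 1 - r12 ** 2 - r13 ** 2 - r23 ** 2 + 2 * r12 * r13 * r23
    denominator = 2 * k * (n - 1) / (n - 3) + ((r12 + r13) ** 2 / 4) * (1 - r23) ** 3
    t = (r12 - r13) * math.sqrt((n - 1) * (1 + r23) / denominator)
    return t, n - 3


# Illustrative values only; they are not results from this study.
t_value, dof = williams_t(r12=0.9, r13=0.8, r23=0.7, n=10)
```

With only a handful of systems, the degrees of freedom are small, so even sizeable differences in correlation may fail to reach significance.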

Metric Accuracy per Category
Like Miculicich Werlen and Popescu-Belis (2017), we use Pearson's and Spearman's correlation coefficients to assess the correlation between APT and our human judgements (Table 2). Although APT does correlate with the human judgements over the PROTEST test suite, the correlation is weaker than that with the DiscoMT gap-filling evaluations reported in Miculicich Werlen and Popescu-Belis (2017). However, a Williams significance test reveals that the difference in correlation (for those systems common to both studies) is not statistically significant (p > 0.3). Table 1 also shows that the rankings induced from the PROTEST and APT scores are rather different. The differences are due to the different ways in which the two metrics define pronoun correctness, and the different sources against which correctness is measured (reference translation vs. human judgement).
We also study how the results of APT (with alignment correction) interact with the categories in PROTEST. We consider a pronoun to be measured as correct by APT if it is assigned case 1 (identical) or 2 (equivalent). Likewise, a pronoun is considered incorrect if it is assigned case 3 (incompatible). We compare the number of pronouns marked as correct/incorrect by APT and by the human judges. We ignore APT cases in which no judgement can be made (no translation of the pronoun in the MT output, the reference, or both), as well as pronouns for which the human judges were unable to make a judgement due to factors such as poor overall MT quality or incorrect word alignments. The results of this comparison are displayed in Table 3.
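The system-level correlation analysis mentioned in this section can be sketched with the standard definitions of the two coefficients. This is a generic illustration: the function names are hypothetical, the score lists are invented toy values rather than results from this study, and a real analysis would more likely rely on an established statistics library.

```python
import math


def pearson(xs, ys):
    """Pearson's product-moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson's coefficient on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented per-system scores (metric score, human-judged accuracy).
metric_scores = [0.41, 0.38, 0.52, 0.47, 0.35]
human_scores = [0.60, 0.55, 0.62, 0.70, 0.50]
r_pearson = pearson(metric_scores, human_scores)
r_spearman = spearman(metric_scores, human_scores)
```

Pearson's coefficient is sensitive to the size of the score differences, whereas Spearman's coefficient depends only on the induced system ranking, which is why the two can diverge when rankings differ.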
At first glance, we can see that APT disagrees with the human judgements for almost a quarter (24.3%) of the assessed translations. The distribution of the disagreements over APT cases is very skewed and ranges from 8% for case 1 to 32% for case 2 and 49% for case 3. In other words, APT identifies correct pronoun translations with good precision, but relatively low recall. We can also see that APT rarely marks pronouns as equivalent (case 2).
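The per-case comparison above can be sketched as a small tally. This is an illustration with invented data, not the analysis script used for Table 3: the function name `agreement_report` and the input format are assumptions. Each item pairs an APT case number with a Boolean human judgement; cases 1 and 2 count as "correct" according to APT and case 3 as "incorrect", while cases 4 to 6 are skipped because no judgement can be made.

```python
from collections import defaultdict


def agreement_report(items):
    """Compare APT case assignments with human judgements.

    `items` is a list of (apt_case, human_correct) pairs.
    """
    totals = defaultdict(int)
    disagreements = defaultdict(int)
    true_positives = apt_correct = human_correct_total = 0
    for case, human_correct in items:
        if case not in (1, 2, 3):
            continue  # cases 4-6: no translation to judge
        apt_says_correct = case in (1, 2)
        totals[case] += 1
        if apt_says_correct != bool(human_correct):
            disagreements[case] += 1
        apt_correct += int(apt_says_correct)
        human_correct_total += int(bool(human_correct))
        true_positives += int(apt_says_correct and bool(human_correct))
    assessed = sum(totals.values())
    return {
        "disagreement_by_case": {c: disagreements[c] / totals[c] for c in sorted(totals)},
        "overall_disagreement": sum(disagreements.values()) / assessed if assessed else 0.0,
        "precision": true_positives / apt_correct if apt_correct else 0.0,
        "recall": true_positives / human_correct_total if human_correct_total else 0.0,
    }


# Invented toy data: most disagreements fall on case 3.
toy_items = (
    [(1, True)] * 9 + [(1, False)]
    + [(2, True)] * 2 + [(2, False)]
    + [(3, False)] * 5 + [(3, True)] * 5
    + [(4, True)]
)
report = agreement_report(toy_items)
```

A high disagreement rate on case 3 corresponds to low recall: translations that humans accept are rejected by the metric because they differ from the reference.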
Performance for anaphoric pronouns is mixed. In general, there are three main problems affecting anaphoric pronouns (Table 4). 1) APT, which does not incorporate knowledge of anaphoric pronoun antecedents, does not consider pronoun-antecedent head agreement, so many valid alternative translations involving personal pronouns are marked as incompatible (i.e. incorrect, case 3) by APT but as correct by the human judges. Consider the following example, in which the pronoun they is deemed correctly translated by the YANDEX system (according to the human judges) as it agrees in number and grammatical gender with the translation of the antecedent extraits (clips). However, the pronoun translation ils is marked as incorrect by APT as it does not match the translation in the reference (elles).
SOURCE: so what these two clips show is not just the devastating consequence of the disease, but they also tell us something about the shocking pace of the disease...
YANDEX: donc ce que ces deux extraits[masc. pl.] montrent n'est pas seulement la conséquence dévastatrice de la maladie, mais ils[masc. pl.] nous disent aussi quelque chose sur le rythme choquant de la maladie...
2) Substitutions between pronouns are governed by much more complex rules than the simple pronoun equivalence mechanism in APT. For example, the dictionary of pronouns used in APT lists il and ce as equivalent. However, while il can often replace ce as a pleonastic pronoun in French, it has a much stronger tendency to be interpreted as anaphoric, rendering pleonastic use unacceptable if there is a salient masculine antecedent in the context. 3) APT does not consider the use of impersonal pronouns such as c' in place of the feminine personal pronoun elle or the plural forms ils and elles.
As with anaphoric pronouns, APT incorrectly marks some pleonastic and event translations as equivalent, in disagreement with the human judges. Other common errors arise from 1) the use of alternative translations marked as incompatible (i.e. incorrect) by APT but correct by the human judges, for example il (personal) in the MT output when the reference contained the impersonal pronoun cela or ça (30 cases for pleonastic, 7 for event), or 2) the presence of il in both the MT output and reference marked by APT as identical but by the human judges as incorrect (3 cases for pleonastic, 15 event).
Some of these issues could be addressed by incorporating knowledge of pronoun function in the source language, of pronoun antecedents, and of the wider context of the translation surrounding the pronoun. However, whilst we might be able to derive language-specific rules for some scenarios, it would be difficult to come up with more general or language-independent rules. For example, il and ce can be anaphoric or pleonastic pronouns, but il has a more referential character. Therefore in certain constructions that are strongly pleonastic (e.g. clefts) only ce is acceptable. This rule would be specific to French, and would not cover other scenarios for the translation of pleonastic it. Other issues include the use of pronouns in impersonal constructions such as il faut [one must/it takes] in which evaluation of the pronoun requires consideration of the whole expression, or transformations between active and passive voice, where the perspective of the pronouns changes.

Conclusions
Our analyses reveal that despite some correlation between APT and the human judgements, fully automatic wide-coverage evaluation of pronoun translation misses essential parts of the problem. Comparison with human judgements shows that APT identifies good translations with relatively high precision, but fails to reward important patterns that pronoun-specific systems must strive to generate. Instead of relying on fully automatic evaluation, our recommendation is to emphasise high precision in the automatic metrics and implement semiautomatic evaluation procedures that refer negative cases to a human evaluator, using available tools and methods. Fully automatic evaluation of a very restricted scope may still be feasible using test suites designed for specific problems (Bawden et al., 2017).