Contextual Neural Machine Translation Improves Translation of Cataphoric Pronouns

The advent of context-aware NMT has resulted in promising improvements in overall translation quality, and specifically in the translation of discourse phenomena such as pronouns. Previous works have mainly focused on using past sentences as context, with an emphasis on anaphora translation. In this work, we investigate the effect of future sentences as context by comparing the performance of a contextual NMT model trained with future context to one trained with past context. Our experiments and evaluation, using generic and pronoun-focused automatic metrics, show that the use of future context not only achieves significant improvements over a context-agnostic Transformer, but also demonstrates comparable, and in some cases improved, performance relative to its counterpart trained on past context. We also perform an evaluation on a targeted cataphora test suite and report significant gains over the context-agnostic Transformer in terms of BLEU.


Introduction
Standard machine translation (MT) systems typically translate sentences in isolation, ignoring essential contextual information: a word in a sentence may reference ideas or expressions elsewhere in the text. This locality assumption hinders the accurate translation of referential pronouns, which rely on the surrounding context to resolve cross-sentence references. The issue is further exacerbated by differences in pronoun rules between source and target languages, often resulting in morphological disagreement in the number and gender of the subject being referred to (Vanmassenhove et al., 2018).
Rapid improvements in NMT have led to it replacing SMT as the dominant paradigm. With this, context-dependent NMT has gained traction, overcoming the locality assumption in SMT through the use of additional contextual information. This has led to improvements not only in overall translation quality but also in pronoun translation (Jean et al., 2017; Bawden et al., 2018; Voita et al., 2018; Miculicich et al., 2018). However, all these works have neglected the context from future sentences, with Voita et al. (2018) reporting it to have a negative effect on overall translation quality.
In this work, we investigate the effect of future context in improving NMT performance. We particularly focus on pronouns and analyse corpora from different domains to discern whether future context could actually aid in their resolution. We find that, for the Subtitles domain, roughly 16% of the pronouns are cataphoric. This finding motivates us to investigate the performance of a context-dependent NMT model (Miculicich et al., 2018) trained on future context in comparison to its counterpart trained on past context. We evaluate our models in terms of overall translation quality (BLEU) and also employ three types of automatic pronoun-targeted evaluation metrics. We demonstrate strong improvements for all metrics, with the model using future context showing comparable, and in some cases even better, performance than the one using only past context. We also extract a targeted cataphora test set and report significant gains on it with the future context model over the baseline.

Related Work
Pronoun-focused SMT Early work on the translation of pronouns in SMT attempted to exploit coreference links as additional context to improve the translation of anaphoric pronouns (Le Nagard and Koehn, 2010; Hardmeier and Federico, 2010). These works yielded mixed results, which were attributed to the limitations of the coreference resolution systems used in the process (Guillou, 2012).

Context-Aware NMT Multiple works have successfully demonstrated the advantages of using larger context in NMT, where the context comprises a few previous source sentences (Wang et al., 2017; Zhang et al., 2018), a few previous source and target sentences (Miculicich et al., 2018), or both past and future source and target sentences (Maruf and Haffari, 2018; Maruf et al., 2018, 2019). Further, context-aware NMT has demonstrated improvements in pronoun translation using past context, through concatenating source sentences (Tiedemann and Scherrer, 2017) or through an additional context encoder (Jean et al., 2017; Bawden et al., 2018; Voita et al., 2018). Miculicich et al. (2018) observed reasonable improvements in generic and pronoun-focused translation using three previous sentences as context. Voita et al. (2018) observed improvements using the previous sentence as context, but report decreased BLEU when using the following sentence. We, on the other hand, observe significant gains in BLEU when using the following sentence as context on the same data domain.

Contextual Analysis of Corpora
To motivate our use of future context for improving the translation of cataphoric pronouns in particular and NMT in general, we first analyse the distribution of coreferences for anaphoric and cataphoric pronouns over three different corpora: OpenSubtitles2018 (Lison and Tiedemann, 2016), Europarl (Koehn, 2005) and TED Talks (Cettolo et al., 2012). For the Subtitles corpus, we remove subtitles with a run-time of less than 50 minutes (for English) and those with only a few hundred sentences. The corpus is then randomly split into training, development and test sets in the ratio 100:1:1.5. Table 1 presents the corpora statistics.
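The 100:1:1.5 document-level split can be sketched as follows; the function name, the random seed and the dummy document IDs are illustrative assumptions, not the released pipeline.

```python
import random

def split_corpus(documents, ratios=(100, 1, 1.5), seed=13):
    """Randomly split a list of documents into train/dev/test
    sets according to the given ratios (here 100:1:1.5)."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    total = sum(ratios)
    n = len(docs)
    n_train = round(n * ratios[0] / total)
    n_dev = round(n * ratios[1] / total)
    train = docs[:n_train]
    dev = docs[n_train:n_train + n_dev]
    test = docs[n_train + n_dev:]
    return train, dev, test

# Illustrative usage with dummy document IDs:
# 1025 documents -> 1000 train, 10 dev, 15 test
train, dev, test = split_corpus([f"doc{i}" for i in range(1025)])
```

Splitting at the document level (rather than the sentence level) preserves the cross-sentence coreference chains that the contextual models rely on.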

Analysis of Coreferences
We find the smallest window within which a referential English pronoun is resolved by an antecedent or postcedent using NeuralCoref. Table 2 shows that the majority of pronouns in the Europarl and TED Talks corpora are resolved intrasententially, while the Subtitles corpus demonstrates a greater proportion of intersentential coreferences. Further, anaphoric pronouns are much more frequent than cataphoric ones across all three corpora. For Subtitles, we also note that a good number of pronouns (15.6%) are cataphoric, ∼37% of which are resolved within the following sentence (Figure 1). This finding motivates us to investigate the performance of a context-aware NMT model (trained on Subtitles) for the translation of cataphoric pronouns.
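The window analysis above can be sketched as follows, assuming coreference clusters (e.g. from a resolver such as NeuralCoref) have been reduced to (sentence index, text, is-pronoun) tuples; that cluster format is a simplification for illustration.

```python
def resolution_windows(clusters):
    """For each pronoun mention, return the signed sentence distance to the
    nearest non-pronoun mention in its cluster. Negative means the nearest
    referent precedes the pronoun (anaphora), positive means it follows
    (cataphora), and zero means intra-sentential resolution.

    Each cluster is a list of (sentence_index, text, is_pronoun) tuples.
    """
    windows = []
    for cluster in clusters:
        referents = [s for s, _, is_pron in cluster if not is_pron]
        for s, _, is_pron in cluster:
            if not is_pron or not referents:
                continue
            # pick the referent sentence closest to the pronoun's sentence
            nearest = min(referents, key=lambda r: abs(r - s))
            windows.append(nearest - s)
    return windows

# Toy example: "I met Richard. He was tired." -> anaphoric, window -1
#              "Before he left, Richard waved." -> intra-sentential, window 0
clusters = [
    [(0, "Richard", False), (1, "He", True)],
    [(2, "he", True), (2, "Richard", False)],
]
print(resolution_windows(clusters))  # [-1, 0]
```

The fraction of cataphoric pronouns then falls out as `sum(w > 0 for w in windows) / len(windows)`.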

Experiments
Datasets We experiment with the Subtitles corpus on the English-German and English-Portuguese language-pairs. To obtain the English-Portuguese data, we employ the same pre-processing steps as reported in §3 (corpus statistics are in Table 1). We use 80% of the training data to train our models and the rest is held out for further evaluation, as discussed later in §4.2. The data is truecased using

Results
We consider two versions of the Transformer-HAN, trained with the following and the previous source sentence as context, respectively. From Table 3, we note that both context-dependent models significantly outperform the Transformer across all language-pairs in terms of BLEU. Further, HAN (k = +1) demonstrates statistically significant improvements over HAN (k = -1) when translating to English. These results are quite surprising, as Voita et al. (2018) report decreased translation quality in terms of BLEU when using the following sentence for English→Russian Subtitles. To identify if this discrepancy is due to the language-pair or the model, we conduct experiments with English→Russian in the same data setting as Voita et al. (2018) and find that HAN (k = +1) still significantly outperforms the Transformer and is comparable to HAN (k = -1) (more details in Appendix A.2).

7 Our models are implemented in Transformer-DyNet. Where in the original architecture k sentence-context vectors were summarised into a document-context vector, we omit this step when using only one sentence in context. 8 The code and data are available at https://github.com/sameenmaruf/acl2020-contextnmt-cataphora.

Analysis
Pronoun-Focused Automatic Evaluation For the models in Table 3, we employ three types of pronoun-focused automatic evaluation:

1. Accuracy of Pronoun Translation (APT) (Miculicich Werlen and Popescu-Belis, 2017). This measures the degree of overlapping pronouns between the output and reference translations, obtained via word-alignments.

2. Precision, Recall and F1 scores. We use a variation of AutoPRF (Hardmeier and Federico, 2010) to calculate precision, recall and F1-scores. For each source pronoun, we compute the clipped count (Papineni et al., 2002) of overlap between candidate and reference translations. To eliminate word-alignment errors, we compute this overlap over the set of dictionary-matched target pronouns, in contrast to the set of target words aligned to a given source pronoun as done by AutoPRF and APT.

3. Pairwise ranking. In contrast to the above two measures, which rely on computing pronoun overlap between the target and reference translation, we employ an ELMo-based (Peters et al., 2018) evaluation framework that distinguishes between a good and a bad translation via pairwise ranking (Jwalapuram et al., 2019). We use the CRC setting of this metric, which considers the same reference context (one previous and one next sentence) for both reference and system translations. However, this measure is limited to evaluation only on the English target-side.10

The results using the aforementioned pronoun evaluation metrics are reported in Table 4. We observe improvements for all metrics with both HAN models in comparison to the baseline. Further, we observe that HAN (k = +1) is either comparable to or outperforms HAN (k = -1) on APT and F1 for De→En and Pt→En, suggesting that for these cases, the use of the following sentence as context is at least as beneficial as using the previous sentence. For En→De, we note comparable performance for the HAN variants in terms of F1, while for En→Pt, the past context appears to be more beneficial.11 In terms of CRC, we note HAN (k = -1) to be comparable to (De→En) or better than HAN (k = +1) (Pt→En). We attribute this to the way the metric is trained to disambiguate pronoun translations based on only the previous context, which may bias it towards such scenarios.
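The clipped-count computation in metric 2 can be sketched as below, under the simplifying assumptions that translations are pre-tokenised and that the dictionary of target pronouns is given as a set; this is a reading of the metric, not the authors' released evaluation code.

```python
from collections import Counter

def pronoun_prf(candidates, references, target_pronouns):
    """Precision/recall/F1 over pronouns using clipped counts: for each
    pronoun type, the candidate is credited min(count in candidate,
    count in reference), restricted to a fixed target-pronoun set."""
    match = cand_total = ref_total = 0
    for cand, ref in zip(candidates, references):
        cand_counts = Counter(w for w in cand if w in target_pronouns)
        ref_counts = Counter(w for w in ref if w in target_pronouns)
        # clipped count: credit is capped by the reference count
        match += sum(min(c, ref_counts[p]) for p, c in cand_counts.items())
        cand_total += sum(cand_counts.values())
        ref_total += sum(ref_counts.values())
    precision = match / cand_total if cand_total else 0.0
    recall = match / ref_total if ref_total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: candidate over-generates "it" (clipped to 1 match)
pronouns = {"he", "she", "it", "they"}
cand = [["he", "said", "it", "it"]]
ref = [["he", "said", "it", "to", "them"]]
p, r, f1 = pronoun_prf(cand, ref, pronouns)  # p = 2/3, r = 1.0, f1 = 0.8
```

Restricting both sides to the dictionary-matched pronoun set is what removes the dependence on word alignments that AutoPRF and APT have.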
Ablation Study We would like to investigate whether a context-aware NMT model trained on a wider context could perform well even if we do not have access to the same amount of context at decoding. We thus perform an ablation study for English→German using the HAN model trained with the two previous and next sentences as context and decoded with varying degrees of context.

10 We use the same English pronoun list for all pronoun-focused metrics (provided by Jwalapuram et al. (2019) at https://github.com/ntunlp/eval-anaphora). All pronoun sets used in our evaluation are provided in Appendix A.4. 11 It should be noted that for Portuguese, adjectives and even verb forms can be marked by the gender of the noun, and these are hard to account for in automatic pronoun-focused evaluations.
From Table 5, we note that reducing the amount of context at decoding time does not have an adverse effect on the model's performance. However, when no context is used, there is a statistically significant drop in BLEU, while APT and F1-scores are equivalent to those of the baseline. This suggests that the model does rely on the context to achieve the improvement in pronoun translation. Further, we find that the future context is just as beneficial as the past context in improving general translation performance.

Cataphora-Focused Test Suite To gauge if the improvements in Table 3 for the HAN (k = +1) model come from the correct translation of cataphoric pronouns, we perform an evaluation on a cataphoric pronoun test suite constructed from the held-out set mentioned earlier in §3. To this end, we apply NeuralCoref over the English side to extract sentence-pairs which have a cataphoric pronoun in one sentence and the postcedent in the next sentence. This set is further segmented into subsets based on the part-of-speech of the postcedent, that is, determiner (DET), proper noun (PROPN) or all nouns (NOUN) (more details in the appendix).

From Table 6, we observe HAN (k = +1) to outperform the baseline for all language-pairs when evaluated on the cataphora test suite. In general, we observe greater improvements in BLEU when translating to English, which we attribute to the simplification of cross-lingual pronoun rules when translating from German or Portuguese to English.13 We also observe fairly similar gains in BLEU across the different pronoun subsets, which we hypothesise to be due to potential overlap in test sentences between different subsets. Nevertheless, we note optimum translation quality on the noun subsets (PROPN and NOUN), while seeing the greatest percentage improvement on the DET subset. For the latter, we surmise that the model is able to more easily link pronouns in a sentence to subjects prefixed with possessive determiners, for example, "his son" or "their child".

Figure 2: Example attention map between source (y-axis) and context (x-axis). The source pronoun he correctly attends to the postcedent Richard in the context.
We also perform an auxiliary evaluation for Transformer-HAN (k = -1) trained with the previous sentence as context on the cataphora test suite and find that the BLEU improvements still hold. Thus, we conclude that Transformer-HAN (a context-aware NMT model) is able to make better use of coreference information to improve translation of pronouns (detailed results in Appendix A.3).
Qualitative Analysis We analyse the distribution of attention to the context sentence for a few test cases. 14 Figure 2 shows an example in which a source pronoun he attends to its corresponding postcedent in context. This is consistent with our hypothesis that the HAN (k = +1) is capable of exploiting contextual information for the resolution of cataphoric pronouns.

Conclusions
In this paper, we have investigated the use of future context for NMT and particularly for pronoun translation. While previous works have focused on the use of past context, we demonstrate through rigorous experiments that using future context does not deteriorate translation performance over a baseline. Further, it shows comparable and in some cases better performance compared to using the previous sentence, in terms of both generic and pronoun-focused evaluation. In future work, we plan to investigate the translation of other discourse phenomena that may benefit from the use of future context.

13 It should be noted that the cataphora test set is extracted based on the existence of cataphoric pairs on the English side, which may have biased the evaluation when English was in the target. 14 Attention is the average of the per-head attention weights.

A.1 Model Configuration
We use a similar configuration to the Transformer-base model (Vaswani et al., 2017), except that we reduce the number of layers in the encoder and decoder stacks to 4, following Maruf et al. (2019). For training, we use the default Adam optimiser (Kingma and Ba, 2015) with an initial learning rate of 0.0001 and employ early stopping.
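The setting above amounts to the following hyperparameter sketch; the key names are illustrative rather than a specific toolkit's flags, and the model-dimension values are the standard Transformer-base defaults rather than values stated in this paper.

```python
# Transformer-base (Vaswani et al., 2017) with the encoder/decoder
# stacks reduced to 4 layers, following Maruf et al. (2019).
config = {
    "num_encoder_layers": 4,   # reduced from 6 in Transformer-base
    "num_decoder_layers": 4,
    "d_model": 512,            # Transformer-base defaults below
    "d_ff": 2048,
    "num_heads": 8,
    "dropout": 0.1,
    "optimizer": "adam",       # default Adam (Kingma and Ba, 2015)
    "learning_rate": 1e-4,     # initial learning rate from A.1
    "early_stopping": True,
}
```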

A.2 English→Russian Experiments
We wanted to compare the two variants of Transformer-HAN, with k = +1 and k = -1, in the same experimental setting as Voita et al. (2018). The data they made available only contains the previous context sentence. Thus, we extract training, development and test sets following the procedure in this work, but of roughly the same size as in Voita et al. (2018), from a random sample of documents, resulting in a test set which has document-level continuity between sentences. The pre-processing pipeline is the same as the one used for our English-German and English-Portuguese experiments, except that we perform lowercasing (instead of truecasing) and learn separate BPE codes for the source and target languages, following Voita et al. (2018). We also evaluate the models trained with our training set on the test set provided by Voita et al. (2018), after removing the sentences overlapping with our train and dev sets (corpora statistics are in Table 7).
Results Table 8 indicates that the model trained with the next sentence as context not only statistically significantly outperforms the Transformer baseline (+0.9 BLEU) but also demonstrates comparable performance to the HAN model trained with the previous sentence as context.15

A.3 Cataphora-Focused Test Suite
We segment the cataphora test set into the following subsets based on the part-of-speech of the postcedent being referred to:

• DET Postcedents prefixed with possessive determiners, e.g., his son or their child.

• PROPN Postcedents which are proper nouns, e.g., Richard.

• NOUN Postcedents which are nouns, including proper nouns and common nouns, such as boy or child.
A.3.1 Results for HAN (k = -1)

We evaluate Transformer-HAN (k = -1), enriched with anaphoric context, on the cataphora test set (Table 9) to determine if this context-aware model is making use of coreference information to improve the overall translation quality (in BLEU). We find that HAN (k = +1) performs better than HAN (k = -1) when English is on the target-side, which we hypothesise to be because the cataphora test suite is extracted from the English side. However, when English is on the source-side, both models perform comparably, showing that the Transformer-HAN (a context-aware NMT model) is able to make better use of coreference information to improve the translation of pronouns.

15 The BLEU score for the baseline on Voita et al. (2018) is less than the one reported in their original work because of the reduced size of the test set and the different training set.

Table 9: BLEU on the cataphora test set for the Transformer and Transformer-HAN (k = -1). All results for HAN (k = -1) are statistically significantly better than the baseline.