When a Good Translation is Wrong in Context: Context-Aware Machine Translation Improves on Deixis, Ellipsis, and Lexical Cohesion

Though machine translation errors caused by the lack of context beyond one sentence have long been acknowledged, the development of context-aware NMT systems is hampered by several problems. Firstly, standard metrics are not sensitive to improvements in consistency in document-level translations. Secondly, previous work on context-aware NMT assumed that the sentence-aligned parallel data consisted of complete documents while in most practical scenarios such document-level data constitutes only a fraction of the available parallel data. To address the first issue, we perform a human study on an English-Russian subtitles dataset and identify deixis, ellipsis and lexical cohesion as three main sources of inconsistency. We then create test sets targeting these phenomena. To address the second shortcoming, we consider a set-up in which a much larger amount of sentence-level data is available compared to that aligned at the document level. We introduce a model that is suitable for this scenario and demonstrate major gains over a context-agnostic baseline on our new benchmarks without sacrificing performance as measured with BLEU.


Introduction
With the recent rapid progress of neural machine translation (NMT), translation mistakes and inconsistencies due to the lack of extra-sentential context are becoming more and more noticeable among otherwise adequate translations produced by standard context-agnostic NMT systems (Läubli et al., 2018). Though this problem has recently triggered a lot of attention to context-aware translation (Jean et al., 2017a; Wang et al., 2017; Tiedemann and Scherrer, 2017; Bawden et al., 2018; Voita et al., 2018; Maruf and Haffari, 2018; Agrawal et al., 2018; Miculicich et al., 2018; Zhang et al., 2018), the progress and widespread adoption of the new paradigm are hampered by several important problems. Firstly, it is highly non-trivial to design metrics which would reliably trace the progress and guide model design. Standard machine translation metrics (e.g., BLEU) do not appear appropriate as they do not sufficiently differentiate between consistent and inconsistent translations (Wong and Kit, 2012). For example, if multiple translations of a name are possible, forcing consistency is essentially as likely to make all occurrences of the name match the reference translation as to make them all different from the reference. Secondly, most previous work on context-aware NMT has made the assumption that all the bilingual data is available at the document level. However, isolated parallel sentences are a lot easier to acquire, and hence only a fraction of the parallel data will be at the document level in any practical scenario. In other words, a context-aware model trained only on document-level parallel data is highly unlikely to outperform a context-agnostic model estimated from a much larger sentence-level parallel corpus. This work aims to address both these shortcomings. We release code and data sets at https://github.com/lena-voita/good-translation-wrong-in-context.
A context-agnostic NMT system would often produce plausible translations of isolated sentences; however, when put together in a document, these translations end up being inconsistent with each other. We investigate which linguistic phenomena cause the inconsistencies using the OpenSubtitles (Lison et al., 2018) corpus for the English-Russian language pair. We identify deixis, ellipsis and lexical cohesion as three main sources of the violations, together amounting to about 80% of the cases. We create test sets focusing specifically on the three identified phenomena (6000 examples in total). We show that by using a limited amount of document-level parallel data, we can already achieve substantial improvements on these benchmarks without negatively affecting performance as measured with BLEU. Our approach is inspired by Deliberation Networks (Xia et al., 2017). In our method, the initial translation produced by a baseline context-agnostic model is refined by a context-aware system which is trained on a small document-level subset of parallel data.
The key contributions are as follows:
• we analyze which phenomena cause context-agnostic translations to be inconsistent with each other;
• we create test sets specifically addressing the most frequent phenomena;
• we consider a novel and realistic set-up where a much larger amount of sentence-level data is available compared to that aligned at the document level;
• we introduce a model suitable for this scenario, and demonstrate that it is effective on our new benchmarks without sacrificing performance as measured with BLEU.

Analysis
We begin with a human study, in which we:
1. identify cases when good sentence-level translations are not good when placed in context of each other,
2. categorize these examples according to the phenomena leading to a discrepancy in translations of consecutive sentences.
The test sets introduced in Section 3 will then target the most frequent phenomena.

Human annotation
To find what makes good context-agnostic translations incorrect when placed in context of each other, we start with pairs of consecutive sentences. We gather data with context from the publicly available OpenSubtitles2018 corpus (Lison et al., 2018).

Table 1: Human annotation statistics for pairs of consecutive sentences.
all  | one/both bad | both good, bad pair | both good, good pair
2000 | 211          | 140                 | 1649
100% | 11%          | 7%                  | 82%

In the first stage, the annotators are instructed to mark as "good" translations which (i) are fluent sentences in the target language (in our case, Russian) and (ii) can be reasonable translations of a source sentence in some context.
For the second stage we only consider pairs of sentences with good sentence-level translations. The annotators are instructed to mark translations as bad in context of each other only if there is no other possible interpretation or additional context which could have made them appropriate. This was done to get more robust results, avoiding the influence of personal preferences of the annotators (for example, for using formal or informal speech), and excluding ambiguous cases that can only be resolved with additional context.
The statistics of answers are provided in Table 1. We find that our annotators labelled 82% of sentence pairs as good translations. In 11% of cases, at least one translation was considered bad at the sentence level, and in another 7%, the sentences were considered individually good, but bad in context of each other. This indicates that in our setting, a substantial proportion of translation errors are only recognized as such in context.
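The percentages in Table 1 follow directly from the raw counts. The short check below is only an illustrative sketch (the variable names are ours, not part of the study); it also computes a derived figure, the share of all problematic pairs whose error is visible only in context.

```python
# Counts of annotated sentence pairs, as reported in Table 1.
counts = {
    "one/both bad": 211,
    "both good, bad pair": 140,
    "both good, good pair": 1649,
}
total = sum(counts.values())  # all annotated pairs

# Share of each outcome, rounded to whole percent as in the table.
shares = {label: round(100 * n / total) for label, n in counts.items()}

# Among pairs with any problem, the fraction that is wrong only in context.
context_only_share = counts["both good, bad pair"] / (
    counts["one/both bad"] + counts["both good, bad pair"]
)
```

Under these counts, roughly two in five problematic pairs are ones that a sentence-level evaluation would not flag at all.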

Types of phenomena
From the results of the human annotation, we take all instances of consecutive sentences with good translations which become incorrect when placed in the context of each other. For each, we identify the language phenomenon which caused a discrepancy. The results are provided in Table 2. Below we discuss these types of phenomena, as well as the translation problems they cause, in more detail. In the scope of the current work, we concentrate only on the three most frequent phenomena.

Deixis
In this category, we group several types of deictic words or phrases, i.e. referential expressions whose denotation depends on context. This includes person deixis ("I", "you"), place deixis ("here", "there"), and discourse deixis, where parts of the discourse are referenced ("that's a good question."). Most errors in our annotated corpus are related to person deixis, specifically gender marking in the Russian translation, and the T-V distinction between informal and formal you (Latin "tu" and "vos").
In many cases, even when having access to neighboring sentences, one cannot make a confident decision about which of the forms should be used, as there are no obvious markers pointing to one form or another (e.g., for the T-V distinction, words such as "officer", "mister" for formal and "honey", "dude" for informal). However, when pronouns refer to the same person, the pronouns, as well as verbs that agree with them, should be translated using the same form. See Figure 1(a) for an example translation that violates T-V consistency. Figure 1(b) shows an example of inconsistent first person gender (marked on the verb), although the speaker is clearly the same.

Figure 1: Examples of violations of (a) T-V form consistency and (b) speaker gender consistency.
(a) EN We haven't really spoken much since your return. Tell me, what's on your mind these days?
    RU Мы не разговаривали с тех пор, как вы вернулись. Скажи мне, что у тебя на уме в последнее время?
    RU (transliterated) My ne razgovarivali s tekh por, kak vy vernulis'. Skazhi mne, chto u tebya na ume v posledneye vremya?
(b) EN I didn't come to Simon's for you. I did that for me.
    RU Я пришла к Саймону не ради тебя. Я сделал это для себя.
    RU (transliterated) Ya prishla k Saymonu ne radi tebya. Ya sdelal eto dlya sebya.

Anaphora are a form of deixis that has received a lot of attention in MT research, both from the perspective of modelling (Le Nagard and Koehn, 2010; Hardmeier and Federico, 2010; Jean et al., 2017b; Bawden et al., 2018; Voita et al., 2018, among others) and targeted evaluation (Hardmeier et al., 2015; Guillou and Hardmeier, 2016; Müller et al., 2018). We therefore list anaphora errors separately and do not focus on them further.

Ellipsis
Ellipsis is the omission from a clause of one or more words that are nevertheless understood in the context of the remaining elements.
In machine translation, elliptical constructions in the source language pose a problem if the target language does not allow the same types of ellipsis (requiring the elided material to be predicted from context), or if the elided material affects the syntax of the sentence; for example, the grammatical function of a noun phrase and thus its inflection in Russian may depend on the elided verb (Figure 2(a)), or the verb inflection may depend on the elided subject. Our analysis focuses on ellipses that can only be understood and translated with context beyond the sentence level. This has not been studied extensively in MT research. We classified ellipsis examples which lead to errors in sentence-level translations by the type of error they cause. Results are provided in Table 4.

Table 4: Types of discrepancy caused by ellipsis.
type of discrepancy      | frequency
wrong morphological form | 66%
wrong verb (VP-ellipsis) | 20%
other error              | 14%
The most frequent problems related to ellipsis in our annotated corpus are wrong morphological forms, followed by wrongly predicted verbs in the case of verb phrase ellipsis. This construction exists in English but not in Russian, so the verb has to be predicted in the Russian translation (Figure 2(b)).

Lexical cohesion
There are various cohesion devices (Morris and Hirst, 1991), and a good translation should exhibit lexical cohesion beyond the sentence level. We focus on repetition, with two frequent cases in our annotated corpus being reiteration of named entities (Figure 3(a)) and reiteration of more general phrase types for emphasis (Figure 3(b)) or in clarification questions.

Test Sets
For the most frequent phenomena from the above analysis we create test sets for targeted evaluation. Each test set contains contrastive examples. It is specifically designed to test the ability of a system to adapt to contextual information and handle the phenomenon under consideration. Each test instance consists of a true example (sequence of sentences and their reference translation from the data) and several contrastive translations which differ from the true one only in the considered aspect. All contrastive translations we use are correct plausible translations at a sentence level, and only context reveals the errors we introduce. All the test sets are guaranteed to have the necessary context in the provided sequence of 3 sentences. The system is asked to score each candidate example, and we compute the system accuracy as the proportion of times the true translation is preferred over the contrastive ones.
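The accuracy computation described above can be sketched as follows. This is a minimal illustration rather than the released evaluation code: the `Scorer` interface and the instance layout (source sentences, true target sentences, list of contrastive targets) are assumptions made for the example, and ties are counted as failures.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: maps (source sentences, candidate target sentences)
# to a model score such as a log-probability; higher means "preferred".
Scorer = Callable[[Sequence[str], Sequence[str]], float]
Instance = Tuple[Sequence[str], Sequence[str], List[Sequence[str]]]


def contrastive_accuracy(instances: Sequence[Instance], score: Scorer) -> float:
    """Fraction of instances where the true translation outscores every contrastive one."""
    correct = 0
    for source, true_target, contrastive_targets in instances:
        true_score = score(source, true_target)
        # The instance counts only if the true translation is strictly preferred.
        if all(true_score > score(source, alt) for alt in contrastive_targets):
            correct += 1
    return correct / len(instances)
```

Any model that can assign a score to a given source/target pair can be plugged in as `score`; no decoding is needed, which makes the evaluation cheap to run during training.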
Test set statistics are shown in Table 5.

Deixis
From Table 3, we see that the most frequent error category related to deixis in our annotated corpus is the inconsistency of T-V forms when translating second person pronouns. The test set we construct for this category tests the ability of a machine translation system to produce translations with a consistent level of politeness. We semi-automatically identify sets of consecutive sentences with consistent politeness markers on pronouns and verbs (but without nominal markers such as "Mr." or "officer") and switch T and V forms. Each automatic step was followed by human postprocessing, which ensures the quality of the final test sets (details are provided in the appendix). This gives us two sets of translations for each example, one consistently informal (T), and one consistently formal (V). For each, we create an inconsistent contrastive example by switching the formality of the last sentence. The symmetry of the test set ensures that any context-agnostic model has 50% accuracy on the test set.
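The 50% figure follows from the symmetry alone. The sketch below, which uses placeholder sentence identifiers and an arbitrary per-sentence scorer (both assumptions for illustration), models a context-agnostic system as one that scores a sequence by summing independent sentence scores. The shared context cancels within each instance, so the model's preference for the last sentence decides both instances of a pair in opposite ways.

```python
import random


def make_symmetric_instances(t_version, v_version):
    """Build the two (true, contrastive) instances derived from one example."""
    t_contrast = t_version[:-1] + [v_version[-1]]  # T context, V last sentence
    v_contrast = v_version[:-1] + [t_version[-1]]  # V context, T last sentence
    return [(t_version, t_contrast), (v_version, v_contrast)]


def context_agnostic_score(sentences, sentence_score):
    """Score a sequence as a sum of independent per-sentence scores."""
    return sum(sentence_score(s) for s in sentences)


def accuracy(instances, sentence_score):
    hits = sum(
        context_agnostic_score(true, sentence_score)
        > context_agnostic_score(contrast, sentence_score)
        for true, contrast in instances
    )
    return hits / len(instances)


rng = random.Random(0)
# Distinct integer scores, so that no two sentences tie.
distinct_scores = iter(rng.sample(range(10_000), 600))
assigned = {}


def arbitrary_sentence_score(sentence):
    # Any fixed sentence-level scoring function would do here.
    if sentence not in assigned:
        assigned[sentence] = next(distinct_scores)
    return assigned[sentence]


instances = []
for k in range(100):
    t_version = [f"t{k}_{i}" for i in range(3)]  # placeholder T-form sentences
    v_version = [f"v{k}_{i}" for i in range(3)]  # placeholder V-form sentences
    instances += make_symmetric_instances(t_version, v_version)

symmetric_accuracy = accuracy(instances, arbitrary_sentence_score)
```

Barring ties, exactly one instance of each pair is scored correctly, so `symmetric_accuracy` is 0.5 regardless of how the sentence scores are chosen.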

Ellipsis
From Table 4, we see that the two most frequent types of ambiguity caused by the presence of an elliptical structure are of a different nature, hence we construct individual test sets for each of them.
Ambiguity of the first type comes from the inability to predict the correct morphological form of some words. We manually gather examples with such structures in a source sentence and change the morphological inflection of the relevant target phrase to create a contrastive translation. Specifically, we focus on noun phrases where the verb is elided, and the ambiguity lies in how the noun phrase is inflected.
The second type we evaluate is verb phrase ellipsis. Mostly these are sentences with an auxiliary verb "do" and an omitted main verb. We manually gather such examples and replace the translation of the verb, which is only present on the target side, with other verbs with a different meaning but the same inflection. The verbs used to construct such contrastive translations are the top-10 lemmas of translations of the verb "do", which we get from the lexical table of Moses (Koehn et al., 2007) induced from the training data.

Lexical cohesion
Lexical cohesion can be established for various types of phrases and can involve reiteration or other semantic relations. In the scope of the current work, we focus on the reiteration of entities, since these tend to be non-coincidental, and can be easily detected and transformed.
We identify named entities with alternative translations into Russian, find passages where they are translated consistently, and create contrastive test examples by switching the translation of some instances of the named entity. For more details, please refer to the appendix.
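As a rough illustration of the switching step, the sketch below replaces the entity translation in the last sentence that contains it. This is a simplification under stated assumptions: the function name is hypothetical, the choice of the last occurrence is ours, and plain string replacement ignores Russian case inflection, which a real pipeline would have to handle.

```python
def make_cohesion_contrast(target_sentences, used_form, alternative_form):
    """Return a copy of the passage with the entity switched in its last occurrence.

    Returns None when the entity is not reiterated, since there is then
    no consistency to break.
    """
    occurrences = [i for i, s in enumerate(target_sentences) if used_form in s]
    if len(occurrences) < 2:
        return None
    last = occurrences[-1]
    contrast = list(target_sentences)
    contrast[last] = contrast[last].replace(used_form, alternative_form)
    return contrast
```

For example, with the transliterated passage `["Eto Saymon.", "Kto?", "Saymon zdes'."]`, switching `"Saymon"` to `"Simon"` changes only the third sentence, so each sentence remains a plausible translation in isolation.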

Setting
Previous work on context-aware neural machine translation used data where all training instances have context. This setting limits the set of available training sets one can use: in a typical scenario, we have a lot of sentence-level parallel data and only a small fraction of document-level data.
Since machine translation quality depends heavily on the amount of training data, training a context-aware model is counterproductive if this leads to ignoring the majority of available sentence-level data and sacrificing general quality. We will also show that a naive approach to combining sentence-level and document-level data leads to a drop in performance.
In this work, we argue that it is important to consider an asymmetric setting where the amount of available document-level data is much smaller than that of sentence-level data, and propose an approach specifically targeting this scenario.

Model
We introduce a two-pass framework: first, the sentence is translated with a context-agnostic model, and then this translation is refined using context of several previous sentences (context includes source sentences as well as their translations). We expect this architecture to be suitable in the proposed setting: the baseline context-agnostic model can be trained on a large amount of sentence-level data, and the second-pass model can be estimated on a smaller subset of parallel data which includes context. As the first-pass translation is produced by a strong model, we expect no loss in general performance when training the second part on a smaller dataset.
The model is close in spirit to Deliberation Networks (Xia et al., 2017). The first part of the model is a context-agnostic model (we refer to it as the base model), and the second one is a context-aware decoder (CADec) which refines context-agnostic translations using context. The base model is trained on sentence-level data and then fixed. It is used only to sample context-agnostic translations and to get vector representations of the source and translated sentences. CADec is trained only on data with context.
Let $D_{sent} = \{(x_i, y_i)\}_{i=1}^{N}$ denote the sentence-level data with $N$ paired sentences and $D_{doc} = \{(x_j, y_j, c_j)\}_{j=1}^{M}$ denote the document-level data, where $(x_j, y_j)$ are the source and target sides of a sentence to be translated and $c_j$ are several preceding sentences along with their translations.
Base model. For the baseline context-agnostic model we use the original Transformer-base (Vaswani et al., 2017), trained to maximize the sentence-level log-likelihood
$$L(\theta_B) = \frac{1}{N} \sum_{(x_i, y_i) \in D_{sent}} \log P(y_i \mid x_i, \theta_B).$$
Context-aware decoder (CADec). The context-aware decoder is trained to correct translations given by the base model using contextual information. Namely, we maximize the following document-level log-likelihood:
$$L(\theta_C) = \frac{1}{M} \sum_{(x_j, y_j, c_j) \in D_{doc}} \log P(y_j \mid x_j, y_j^B, c_j, \theta_C),$$
where $y_j^B$ is sampled from $P(y \mid x_j, \theta_B)$.
CADec is composed of a stack of N = 6 identical layers and is similar to the decoder of the original Transformer. It has a masked self-attention layer and attention to encoder outputs, and additionally each layer has a block attending over the outputs of the base decoder (Figure 4). We use the states from the last layer of the base model's encoder of the current source sentence and all context sentences as input to the first multi-head attention. For the second multi-head attention we input both the last states of the base decoder and the target-side token embedding layer; this is done for translations of the source and also all context sentences. All sentence representations are produced by the base model. To encode the relative position of each sentence, we concatenate both the encoder and decoder states with one-hot vectors representing their position (0 for the source sentence, 1 for the immediately preceding one, etc.). These distance embeddings are shown in blue in Figure 4.
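The concatenation of states with one-hot distance vectors can be sketched in a few lines. The function names, the plain-list representation of states, and the choice of four distance slots (the current sentence plus up to three context sentences) are assumptions made for this illustration; the actual model operates on tensors.

```python
def add_distance_one_hot(states, distance, max_distance=3):
    """Append a one-hot sentence-distance vector to every token state.

    `states` is a list of per-token vectors for one sentence; `distance` is
    0 for the current sentence, 1 for the immediately preceding one, etc.
    """
    one_hot = [0.0] * (max_distance + 1)
    one_hot[distance] = 1.0
    return [list(state) + one_hot for state in states]


def build_attention_memory(sentences_states):
    """Flatten per-sentence states (index = distance) into one attention memory."""
    memory = []
    for distance, states in enumerate(sentences_states):
        memory.extend(add_distance_one_hot(states, distance))
    return memory
```

Because the distance is written into the vectors themselves, a single attention block can attend over tokens from all sentences at once while still being able to tell which sentence each token came from.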

Training
At training time, we use reference translations as translations of the previous sentences. For the current sentence, we either sample a translation from the base model or use a corrupted version of the reference translation. We propose to stochastically mix objectives corresponding to these versions:
$$L(\theta_C) = \frac{1}{M} \sum_{(x_j, y_j, c_j) \in D_{doc}} \log \left[ b_j \cdot P(y_j \mid x_j, \tilde{y}_j, c_j, \theta_C) + (1 - b_j) \cdot P(y_j \mid x_j, y_j^B, c_j, \theta_C) \right],$$
where $\tilde{y}_j$ is a corrupted version of the reference translation and $b_j \in \{0, 1\}$ is drawn from a Bernoulli distribution with parameter $p$; we use $p = 0.5$ in our experiments. Reference translations are corrupted by replacing 20% of their tokens with random tokens. We discuss the importance of the proposed training strategy, as well as the effect of varying the value of p, in Section 6.5.
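The choice of first-pass input for one training instance can be sketched as below. This is a hedged illustration, not the released training code: the function names are hypothetical, and we assume that exactly a fraction `rate` of positions is replaced (the text does not say whether the 20% is exact or a per-token probability) and that replacement tokens are drawn uniformly from the vocabulary.

```python
import random


def corrupt_reference(tokens, vocab, rng, rate=0.2):
    """Replace a fraction `rate` of the tokens with random vocabulary tokens."""
    corrupted = list(tokens)
    n_replace = round(rate * len(corrupted))
    # Choose distinct positions, then overwrite each with a random token.
    for i in rng.sample(range(len(corrupted)), n_replace):
        corrupted[i] = rng.choice(vocab)
    return corrupted


def first_pass_input(reference, base_sample, vocab, rng, p=0.5):
    """Pick the first-pass translation fed to the second-pass decoder.

    With probability p (b_j = 1) a corrupted reference is used; otherwise
    (b_j = 0) the translation sampled from the base model is used.
    """
    b = rng.random() < p  # b_j ~ Bernoulli(p)
    return corrupt_reference(reference, vocab, rng) if b else list(base_sample)
```

Note that a randomly drawn token may coincide with the original one, so the effective corruption rate can be slightly below `rate` for small vocabularies.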

Inference
As input to CADec for the current sentence, we use the translation produced by the base model. Target sides of the previous sentences are produced by our two-stage approach for those sentences which have context and with the base model for those which do not. We use beam search with a beam of 4 for all models.
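The inference procedure can be summarised with the following sketch. The callables `base_translate` and `cadec_refine` are hypothetical stand-ins for the two models, and treating "has context" as "is not the first sentence of the document" is a simplifying assumption; beam search is left inside the callables.

```python
def translate_document(sentences, base_translate, cadec_refine, max_context=3):
    """Translate sentences in order, refining those that have context.

    Previously produced outputs (refined where possible) serve as the
    target-side context for later sentences.
    """
    outputs = []
    for i, source in enumerate(sentences):
        draft = base_translate(source)  # first pass: context-agnostic
        start = max(0, i - max_context)
        context = list(zip(sentences[start:i], outputs[start:i]))
        # Without context the base translation is kept as is.
        outputs.append(cadec_refine(source, draft, context) if context else draft)
    return outputs
```

The key design point is that the second pass never sees reference translations at test time: its context is built from the system's own earlier outputs.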

Data and setting
We use the publicly available OpenSubtitles2018 corpus (Lison et al., 2018) for English and Russian. As described in detail in the appendix, we apply data cleaning after which only a fraction of data has context of several previous sentences. We use up to 3 context sentences in this work. We randomly choose 6 million training instances from the resulting data, among which 1.5m have context of three sentences. We randomly choose two subsets of 10k instances for development and testing and construct our contrastive test sets from 400k held-out instances from movies not encountered in training. The hyperparameters, preprocessing and training details are provided in the supplementary material.

Results
We evaluate in two different ways: using BLEU for general quality and the proposed contrastive test sets for consistency. We show that models indistinguishable with BLEU can be very different in terms of consistency.
We randomly choose 500 out of 2000 examples from the lexical cohesion set and 500 out of 3000 from the deixis test set for validation and leave the rest for final testing. We compute BLEU on the development set as well as scores on lexical cohesion and deixis development sets. We use convergence in both metrics to decide when to stop training. The importance of using both criteria is discussed in Section 6.4. After the convergence, we average 5 checkpoints and report scores on the final test sets.

Baselines
We consider three baselines.
baseline The context-agnostic baseline is Transformer-base trained on all sentence-level data. Recall that it is also used as the base model in our 2-stage approach.
concat The first context-aware baseline is a simple concatenation model. It is trained on 6m sentence pairs, including 1.5m having 3 context sentences. For the concatenation baseline, we use a special token separating sentences (both on the source and target side).
s-hier-to-2.tied This is the version of the model s-hier-to-2 introduced by Bawden et al. (2018), where the parameters between encoders are shared (Müller et al., 2018). The model has an additional encoder for source context, whereas the target side of the corpus is concatenated, in the same way as for the concatenation baseline. Since the model is suitable only for one context sentence, it is trained on 6m sentence pairs, including 1.5m having one context sentence. We chose s-hier-to-2.tied as our second context-aware baseline because it also uses context on the target side and performed best in a contrastive evaluation of pronoun translation (Müller et al., 2018).

General results
BLEU scores for our model and the baselines are given in Table 6. For context-aware models, all sentences in a group are translated, and then only the current sentence is evaluated. We also report BLEU for the context-agnostic baseline trained only on the 1.5m dataset to show how the performance is influenced by the amount of data.
We observe that our model is no worse in BLEU than the baseline despite the second-pass model being trained only on a fraction of the data. In contrast, the concatenation baseline, trained on a mixture of data with and without context, is about 1 BLEU below the context-agnostic baseline and our model when using all 3 context sentences. CADec's performance, as measured with BLEU, remains the same regardless of the number of context sentences (1, 2 or 3). s-hier-to-2.tied performs worst in terms of BLEU, but note that this is a shallow recurrent model, while the others are Transformer-based. It also suffers from the asymmetric data setting, like the concatenation baseline.

Consistency results
Scores on the deixis, cohesion and ellipsis test sets are provided in Tables 7 and 8. For all tasks, we observe a large improvement from using context. For deixis, the concatenation model (concat) and CADec improve over the baseline by 33.5 and 31.6 percentage points, respectively. On the lexical cohesion test set, CADec shows a large improvement over the context-agnostic baseline (12.2 percentage points), while concat performs similarly to the baseline. For ellipsis, both models improve substantially over the baseline (by 19-51 percentage points), with concat stronger for inflection tasks and CADec stronger for VP-ellipsis. Despite its low BLEU score, s-hier-to-2.tied also shows clear improvements over the context-agnostic baseline in terms of consistency, but underperforms both the concatenation model and CADec, which is unsurprising given that it uses only one context sentence. When looking only at the scores where the latest relevant context is in the model's context window (column 2 in Table 7), s-hier-to-2.tied outperforms the concatenation baseline for lexical cohesion, but remains behind the performance of CADec.
The proposed test sets let us distinguish models which are otherwise identical in terms of BLEU: the performance of the baseline and CADec is the same when measured with BLEU, but very different in terms of handling contextual phenomena.

6.4 Context-aware stopping criteria
Figure 5 shows that for context-aware models, BLEU is not sufficient as a criterion for stopping: even when a model has converged in terms of BLEU, it continues to improve in terms of consistency. For CADec trained with p = 0.5, the BLEU score has stabilized after 40k batches, but the lexical cohesion score continues to grow.
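A stopping rule that requires convergence in both metrics might look like the sketch below. The convergence test itself (no improvement larger than `tolerance` over the last `window` evaluations) and its default values are assumptions chosen for illustration; the text does not specify how convergence was determined.

```python
def has_converged(history, window=3, tolerance=0.1):
    """True if the last `window` values beat the earlier best by at most `tolerance`."""
    if len(history) <= window:
        return False  # not enough evaluations to judge
    best_before = max(history[:-window])
    return max(history[-window:]) - best_before <= tolerance


def should_stop(bleu_history, consistency_history, window=3, tolerance=0.1):
    """Stop only when both BLEU and the consistency score have converged."""
    return has_converged(bleu_history, window, tolerance) and has_converged(
        consistency_history, window, tolerance
    )
```

With such a rule, a run whose BLEU has flattened but whose consistency accuracy is still climbing keeps training, which is the situation the paragraph above describes.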

Ablation: using corrupted reference
At training time, CADec uses either a translation sampled from the base model or a corrupted reference translation as the first-pass translation of the current sentence. The purpose of using a corrupted reference instead of just sampling is to teach CADec to rely on the base translation and not to change it much. In this section, we discuss the importance of the proposed training strategy.
Results for different values of p are given in Table 9. All models have about the same BLEU, not statistically significantly different from the baseline, but they are quite different in terms of incorporating context. The denoising positively influences almost all tasks except for deixis, yielding the largest improvement on lexical cohesion.

Additional Related Work
In concurrent work, Xiong et al. (2018) also propose a two-pass context-aware translation model inspired by deliberation networks. However, while they consider a symmetric data scenario where all available training data has document-level context, and train all components jointly on this data, we focus on an asymmetric scenario where we have a large amount of sentence-level data, used to train our first-pass model, and a smaller amount of document-level data, used to train our second-pass decoder, keeping the first-pass model fixed.
Automatic evaluation of the discourse phenomena we consider is challenging. For lexical cohesion, Wong and Kit (2012) count the ratio between the number of repeated and lexically similar content words over the total number of content words in a target document. However, Guillou (2013) and Carpuat and Simard (2012) find that translations generated by a machine translation system tend to be similarly or more lexically consistent, as measured by a similar metric, than human ones. This even holds for sentence-level systems, where the increased consistency is not due to improved cohesion but is accidental: Ott et al. (2018) show that beam search introduces a bias towards frequent words, which could be one factor explaining this finding. This means that a higher repetition rate does not imply that a translation system is in fact more cohesive, and we find that even our baseline is more repetitive than the human reference.
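To make the repetition-based measure concrete, here is a simplified sketch in the spirit of the ratio described above. It counts exact repetitions only (lexically similar words are not handled), and the stopword-based definition of a content word is an assumption made for the example.

```python
from collections import Counter


def repetition_ratio(tokens, stopwords=frozenset()):
    """Share of content-word tokens whose word occurs more than once in the document."""
    content = [t.lower() for t in tokens if t.lower() not in stopwords]
    if not content:
        return 0.0
    counts = Counter(content)
    # Every token of a repeated word counts as a repeated token.
    repeated = sum(n for n in counts.values() if n > 1)
    return repeated / len(content)
```

A system that overuses frequent words raises this ratio without being any more coherent, which is why such a score alone cannot separate accidental repetition from genuine cohesion.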

Conclusions
We analyze which phenomena cause otherwise good context-agnostic translations to be inconsistent when placed in the context of each other. Our human study on an English-Russian dataset identifies deixis, ellipsis and lexical cohesion as three main sources of inconsistency. We create test sets focusing specifically on the identified phenomena.
We consider a novel and realistic set-up where a much larger amount of sentence-level data is available compared to that aligned at the document level and introduce a model suitable for this scenario. We show that our model effectively handles contextual phenomena without sacrificing general quality as measured with BLEU despite using only a small amount of document-level data, while a naive approach to combining sentence-level and document-level data leads to a drop in performance. We show that the proposed test sets allow us to distinguish models (even though identical in BLEU) in terms of their consistency. To build context-aware machine translation systems, such targeted test sets should prove useful for validation, early stopping and model selection.

A Protocols for test sets
In this section we describe the process of constructing the test suites.

A.1 Deixis
The English second person pronoun "you" may have three different interpretations that matter when translating into Russian: the second person singular informal (T form), the second person singular formal (V form) and the second person plural (there is no T-V distinction for the plural form of second person pronouns). The morphological forms for the second person singular V form and the second person plural are the same. Therefore, to automatically identify examples in the second person polite form, we look for morphological forms corresponding to second person plural pronouns.
Below, all the steps performed to obtain the test suite are described in detail.

A.1.1 Automatic identification of politeness
For each sentence we try to automatically find indications of using the T or V form. The presence of the following words and morphological forms is used as an indication of usage of T/V forms:
1. second person singular or plural pronoun,
2. verb in a form corresponding to a second person singular/plural pronoun,
3. verbs in imperative form,
4. possessive forms of second person pronouns.
For 1-3 we used morphological tags predicted by pymorphy2; for the 4th we used hand-crafted lists of forms of second person pronouns, because pymorphy2 fails to identify them.
The first rule is needed as morphological forms for second person plural and second person singular V form pronouns and related verbs are the same, and there is no simple and reliable way to distinguish these two automatically. The second rule is to exclude cases where there is only one appropriate level of politeness according to the relation between the speaker and the listener. Such markers include "Mr.", "Mrs.", "officer", "your honour" and "sir". For the impolite form, these include terms denoting family relationship ("mom", "dad"), terms of endearment ("honey", "sweetie") and words like "dude" and "pal".

A.1.3 Automatic change of politeness
To construct contrastive examples aiming to test the ability of a system to produce translations with a consistent level of politeness, we have to produce an alternative translation by switching the formality of the reference translation. First, we do it automatically:
1. change the grammatical number of second person pronouns, verbs, imperative verbs,
2. change the grammatical number of possessive pronouns.
For the first transformation we use pymorphy2; for the second we use manual lists of possessive second person pronouns, because pymorphy2 cannot change them automatically.

A.1.4 Human postprocessing of automatic change of politeness
We manually correct the translations from the previous step. Mistakes of the described automatic change of politeness happen because of: 1. ambiguity arising when imperative and indicative verb forms are the same,