Dynamic Context Selection for Document-level Neural Machine Translation via Reinforcement Learning

Document-level neural machine translation has yielded attractive improvements. However, the majority of existing methods simply use all context sentences in a fixed scope. They neglect the fact that different source sentences need different sizes of context. To address this problem, we propose an effective approach to select dynamic context so that the document-level translation model can utilize the more useful selected context sentences to produce better translations. Specifically, we introduce a selection module that is independent of the translation module to score each candidate context sentence. Then, we propose two strategies to explicitly select a variable number of context sentences and feed them into the translation module. We train the two modules end-to-end via reinforcement learning. A novel reward is proposed to encourage the selection and utilization of dynamic context sentences. Experiments demonstrate that our approach can select adaptive context sentences for different source sentences, and significantly improves the performance of document-level translation methods.


Introduction
Although neural machine translation (NMT) has achieved great progress in recent years (Cho et al., 2014; Bahdanau et al., 2015; Luong et al., 2015; Vaswani et al., 2017), when fed an entire document, standard NMT systems translate sentences in isolation without considering the cross-sentence dependencies. Consequently, document-level neural machine translation (DocNMT) methods have been proposed to utilize source-side or target-side inter-sentence contextual information to improve translation quality over sentences in a document (Jean et al., 2017; Wang et al., 2017; Tiedemann and Scherrer, 2017; Tu et al., 2018; Kuang et al., 2018; Junczys-Dowmunt, 2019; Ma et al., 2020). More recently, DocNMT research has mainly focused on exploring various attention-based networks to leverage the cross-sentence context efficiently, and on evaluating specific discourse phenomena (Bawden et al., 2018; Müller et al., 2018; Voita et al., 2019b; Jwalapuram et al., 2019). However, there is still an issue that has received less attention: which context sentences should be used when translating a source sentence?
We conduct an experiment to verify an intuition: the translation of different source sentences requires different context. As shown in Table 1, we train two DocNMT models and test them using various context settings. During the test, we obtain the dynamic context sentences that achieve the best BLEU scores by traversing all context combinations for each source sentence. Compared with fixed-size context (rows 1 and 2), dynamic context (rows 3 and 4) significantly improves translation quality. Although row 2 uses more context, the redundant information may hurt the results. The experiment indicates that only a limited number of context sentences are really useful, and that they change with the source sentence.
The majority of existing DocNMT models set the context size or scope to be fixed. They utilize all of the previous k context sentences (Miculicich et al., 2018; Voita et al., 2019b; Yang et al., 2019; Xu et al., 2020), or the full context in the entire document (Maruf and Haffari, 2018; Tan et al., 2019; Zheng et al., 2020). As a result, the inadequacy or redundancy of contextual information is almost inevitable. From this viewpoint, Maruf et al. (2019) propose a selective attention approach that uses the sparsemax function (Martins and Astudillo, 2016) instead of the softmax to normalize the attention weights. The sparsemax sets the low probabilities of the softmax to zero so that the model can focus on the sentences with high probability. However, the learning of attention weights lacks guidance, and the approach cannot handle the situation where a source sentence achieves the best translation result without relying on any context, which happens for about 39.4% of sentences in our experiment.
To address the problem, we propose an effective approach to select contextual sentences dynamically for each source sentence in document-level translation. Specifically, we propose a Context Scorer to score each candidate context sentence according to the source sentence currently being translated. Then, we utilize two selection strategies to select useful context sentences for the translation module. The size of the selected context varies across sentences. A core challenge of our approach is that the selection process is non-differentiable. Therefore, we leverage reinforcement learning (RL) to train the selection and DocNMT modules together. We design a novel reward to encourage the model to be aware of different context sentences and to select more appropriate context to improve translation quality.
In this paper, we make the following contributions:
• Our approach can measure the contribution of each context sentence to the source, and select dynamic context for the translation of different source sentences. Independent of the translation network, our approach is easily adaptable to existing DocNMT models.
• We bridge the training of context selection and context-aware translation via reinforcement learning. Experiments show that our approach can significantly improve the performance of DocNMT models with the selected dynamic context sentences.

Figure 1: The architecture of the context scorer. We add a special empty context sentence "NON" to help the decision of selection strategies. The details of the Transformer layers are shown in the right dotted box.

Document-level Machine Translation
A standard DocNMT system generally translates a source sentence $X = \{x_1, \cdots, x_I\}$ to a target sentence $Y = \{y_1, \cdots, y_T\}$ with the aid of contextual information $Z$, which is usually a subset of the candidate context set $\mathcal{Z}$. The model is trained to minimize the negative log-likelihood:

$$\mathcal{L}_{MLE}(\theta) = -\sum_{t=1}^{T} \log P(y_t \mid y_{<t}, X, Z; \theta) \quad (1)$$

where $\theta$ denotes the parameters of the DocNMT model. Different granularities (word or sentence) and different sources (source-side or target-side) of contextual information $Z$ have been explored. Maruf et al. (2019) divide the candidate context set $\mathcal{Z}$ into two cases: offline, where the context comes from the entire document, and online, where only the past context is used. In this paper, we mainly focus on a general scenario, where DocNMT translates sentences with online source-side context sentences.
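As a minimal illustration of the negative log-likelihood objective described above, the following sketch sums the negative log-probabilities of the ground-truth target words. The function name `sentence_nll` and the toy probabilities are hypothetical; in a real system the per-word probabilities would come from the DocNMT decoder conditioned on the source sentence and its context.

```python
import math


def sentence_nll(token_probs):
    """Negative log-likelihood of one target sentence.

    token_probs[t] stands for the probability that a (hypothetical) DocNMT
    model assigns to the t-th ground-truth word given the previous target
    words, the source sentence and the context.
    """
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return -sum(math.log(p) for p in token_probs)


# Made-up probabilities for a three-word target sentence.
toy_probs = [0.5, 0.25, 0.8]
toy_nll = sentence_nll(toy_probs)
```

A lower value means the model finds the reference translation more likely under the given context.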

Dynamic Context Selection
Our approach translates a source sentence X in the document in two steps. First, we select the appropriate context sentences for the translation of X via the selection module. This step is independent of the DocNMT module and is conducted before the context encoding in the DocNMT module. The core component is a Context Scorer that calculates the contribution of each context sentence $z \in \mathcal{Z}$ to the translation of X (sub-section 3.1). According to the context scores, we propose two strategies to choose the useful context sentences (sub-section 3.2). Second, we feed the selected context sentences into a DocNMT module to generate the translation.
To overcome the non-differentiable behavior of the context selection and the lack of direct supervision when training the context scorer, we connect the two steps through the reinforcement learning strategy. We propose an effective reward that is related to the translation quality to guide the dynamic selection of context sentences and the optimization of parameters in DocNMT model (sub-section 3.3).

Context Scorer
As Figure 1 shows, we obtain the representation of context sentences for scoring. Inspired by popular pre-trained language models such as GPT (Radford et al., 2018) and BERT (Devlin et al., 2019), we produce one instance by concatenating the source sentence with a context sentence, adding a special symbol "DCS" at the beginning and a separator token "SEP" in between. The instance is fed into a stack of $L_1$ Transformer encoder layers. We expect the special symbol "DCS" to encode the information of the source-context sentence pair well through self-attention.
For a candidate context sentence $z \in \mathcal{Z}$, the hidden state of "DCS" after $L_1$ layers is extracted as the input to $L_2$ Transformer encoder layers, which model the dependencies among context sentences. We denote the hidden state after $L_2$ layers as $h_z \in \mathbb{R}^{d_1}$. After that, we adopt a two-layer linear scorer network to measure the score as follows:

$$s_z = \sigma\big(W_2^{\top} W_1^{\top} h_z\big) \quad (2)$$

where $W_1 \in \mathbb{R}^{d_1 \times d_2}$ and $W_2 \in \mathbb{R}^{d_2 \times 1}$, and $\sigma$ stands for the logistic sigmoid function. Considering the sampling operation during training, we normalize the scores of all context sentences in the candidate set $\mathcal{Z}$ into a probability distribution:

$$P_{select} = \mathrm{softmax}\big([s_1; s_2; \cdots; s_{|\mathcal{Z}|}]\big) \quad (3)$$

where $[\cdot; \cdot]$ concatenates elements into a vector.
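The scoring and normalization steps described above can be sketched in plain Python. This is only an illustration under stated assumptions: the scorer is taken to be two linear maps without biases or an intermediate non-linearity followed by a sigmoid, and the normalization is taken to be a softmax. The names `score_context` and `softmax` and all toy values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def score_context(h, w1, w2):
    """Score one context sentence from its hidden state.

    h  : list of d1 floats (hidden state of the special symbol)
    w1 : d1 x d2 nested list (first linear layer, assumed bias-free)
    w2 : list of d2 floats (second linear layer, assumed bias-free)
    """
    hidden = [sum(h[i] * w1[i][j] for i in range(len(h)))
              for j in range(len(w2))]
    return sigmoid(sum(hidden[j] * w2[j] for j in range(len(w2))))


def softmax(scores):
    """Normalize scores into a probability distribution (assumed softmax)."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example with d1 = 2, d2 = 1 and three candidate context sentences.
w1 = [[1.0], [-1.0]]
w2 = [2.0]
states = [[0.5, 0.1], [0.0, 0.0], [0.1, 0.9]]
scores = [score_context(h, w1, w2) for h in states]
p_select = softmax(scores)
```

Because the sigmoid bounds every score in (0, 1), the softmax here produces a fairly flat distribution; it only serves to turn the scores into sampling probabilities.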

Selection Strategies
According to the selection probability in $P_{select}$, we can obtain useful context sentences for the translation task. To select context dynamically, we add a special empty sentence "NON" into the candidate context set, which stands for the situation in which a source sentence is translated without any context. We then select those context sentences whose probability is higher than that of "NON". If the probability of "NON" is the highest, the context size is zero. We call this strategy probability-first. The selected context sentences change dynamically with the source sentence, and the context size can range from 0 to $|\mathcal{Z}|$. To make a fair comparison with existing DocNMT models that set a fixed context size, we also propose a size-first strategy that selects a given number of context sentences with the highest probability, excluding "NON". Despite the fixed size, the context is still dynamic because the selected sentences can be anywhere in the document and need not be contiguous.

Figure 2: Reinforced training of the context selection and context-aware translation. The two DocNMT models share parameters.
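The two selection strategies can be sketched as follows. The sketch assumes that the selection probabilities are available as a dictionary keyed by hypothetical sentence identifiers, with the empty sentence stored under the key "NON"; the function names are illustrative only.

```python
def probability_first(probs, non_key="NON"):
    """Keep every candidate whose probability exceeds that of the empty sentence.

    probs maps a candidate id to its selection probability and must
    contain non_key. The result may be empty (context size zero).
    """
    threshold = probs[non_key]
    return [c for c, p in probs.items() if c != non_key and p > threshold]


def size_first(probs, k, non_key="NON"):
    """Keep the k most probable candidates, ignoring the empty sentence."""
    candidates = [c for c in probs if c != non_key]
    ranked = sorted(candidates, key=lambda c: probs[c], reverse=True)
    return ranked[:k]


# Toy probabilities for four previous sentences plus the empty sentence.
toy = {"z1": 0.10, "z2": 0.30, "z3": 0.05, "NON": 0.20, "z4": 0.35}
dynamic_context = probability_first(toy)   # variable size
fixed_context = size_first(toy, 3)         # always three sentences
```

With these toy values the probability-first strategy keeps two sentences, while the size-first strategy is forced to add a third one whose probability is below that of the empty sentence.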

Model Learning
Our strategies perform a non-differentiable hard selection, and it is difficult to decide which context sentences are helpful for the translation. Together, these make the training intractable. Therefore, we apply the policy gradient method to train the selection module and the document-level translation module in an end-to-end fashion through a novel reward. The reward encourages the model to select more useful context to improve the generation probability of the ground truth translations. Figure 2 shows the reinforcement-guided training process.

Modules Initialization
It is well known that a good initialization of the network is important for optimizing the parameters in reinforcement learning.
For the DocNMT module, which is usually trained in two stages (Tu et al., 2018; Miculicich et al., 2018; Maruf et al., 2019), we load the parameters of a standard sentence-level NMT model to initialize the network.
For the selection module, we simplify the initialization of the context scorer as a binary classification task without considering the dependencies among context sentences. The initialization contains two steps. First, we create pseudo labels for candidate context sentences. Each context sentence is labeled as 1 or 0. The score in Eq. 2 is treated as the probability of predicting label 1. Specifically, pseudo labels are generated by an extra DocNMT model trained with a single random context sentence. We feed different candidate context sentences to the trained model to translate the same source sentence. Candidate context sentences with higher BLEU than "NON" are labeled as 1, while those with lower BLEU are labeled as 0. Second, we train the context scorer to predict the pseudo labels. We share the parameters of the embedding layer with the initialized DocNMT model. The initial scorer is trained to minimize the cross-entropy loss.
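The pseudo-labeling rule can be sketched as below. The sentence-level BLEU scores are assumed to have been computed beforehand by the extra DocNMT model; the function name and the toy scores are hypothetical. The text does not say how ties with "NON" are handled, so labeling them 0 is an assumption of this sketch.

```python
def pseudo_labels(bleu_by_context, non_key="NON"):
    """Label a candidate 1 if it yields higher BLEU than the empty context.

    bleu_by_context maps a candidate id (and non_key) to the BLEU score
    obtained when translating the same source sentence with that context.
    Ties are labeled 0 here, which is an assumption.
    """
    baseline = bleu_by_context[non_key]
    return {c: 1 if b > baseline else 0
            for c, b in bleu_by_context.items() if c != non_key}


# Made-up BLEU scores for one source sentence.
toy_bleu = {"NON": 20.0, "z1": 21.5, "z2": 19.0, "z3": 20.0}
labels = pseudo_labels(toy_bleu)
```

The resulting labels serve as binary classification targets for the cross-entropy initialization of the scorer.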

Reward
Given that our goal of context selection is to improve translation quality, we propose a reward that can measure translation quality and is sensitive to context changes.
For a decoding time t, we calculate the cost of generating the ground truth target word $y_t$ correctly as follows: where the first two items calculate the gap between the logarithmic probabilities of the ground truth target word $y_t$ and the best word $\tilde{y}_t^{1st}$, whose probability is the highest in the prediction probability distribution. The last item is a regularization that indicates the difference of probabilities between $\tilde{y}_t^{1st}$ and the word $\tilde{y}_t^{2nd}$, whose probability is the second highest.
We obtain the average cost (whose value > 0) of generating the ground truth sentence $Y = \{y_1, \cdots, y_T\}$, and utilize a monotone decreasing function to get the final reward bounded in 0∼1 as follows: A high value of the reward means that it is easy to generate the ground truth. Therefore, the selected context sentences should be encouraged. Conversely, if a reward is low, generating the ground truth with the selected context would cost a lot, so the selection is discouraged.
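The ingredients of the reward can be sketched as follows, with strong caveats. Only the two quantities computed in `cost_terms` come from the description above: the log-probability gap between the top-ranked word and the ground-truth word, and the probability gap between the top-ranked and second-ranked words. How they are combined into the per-step cost is not assumed here, and the use of exp(-c) as the monotone decreasing map is purely an illustrative assumption. All names are hypothetical.

```python
import math


def cost_terms(probs, gold_index):
    """Return the two quantities described for one decoding step.

    probs      : predicted distribution over the vocabulary (list of floats)
    gold_index : position of the ground-truth word in probs
    Returns (log-probability gap to the top word, top-1 minus top-2 probability).
    """
    ranked = sorted(probs, reverse=True)
    p_first, p_second = ranked[0], ranked[1]
    log_gap = math.log(p_first) - math.log(probs[gold_index])
    prob_gap = p_first - p_second
    return log_gap, prob_gap


def reward_from_cost(avg_cost):
    """Map a positive average cost into (0, 1).

    exp(-c) is an assumed choice of monotone decreasing function.
    """
    return math.exp(-avg_cost)


# Toy distribution over a three-word vocabulary.
toy_dist = [0.6, 0.3, 0.1]
gap_when_correct = cost_terms(toy_dist, 0)  # ground truth is the top word
gap_when_wrong = cost_terms(toy_dist, 1)    # ground truth is ranked second
```

When the ground-truth word is already the top prediction, the log-probability gap is zero, so any cost built from it is driven only by the regularization term.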

Self-Critical Training
We train the whole model with the self-critical training method (Rennie et al., 2017; Bai et al., 2018). The goal of RL training is to minimize the negative expected reward. In practice, the loss is usually approximated with a single sample $u$ from the policy $P$ as follows:

$$\mathcal{L}_{RL} = -\mathbb{E}_{u \sim P}[r(u)] \approx -r(u), \quad u \sim P \quad (6)$$

The self-critical training introduces a baseline reward $r(u')$ to reduce the variance of the gradient, where $u'$ is obtained by the inference algorithm used at test time. The final gradient is estimated by:

$$\nabla \mathcal{L}_{RL} \approx -\big(r(u) - r(u')\big) \nabla \log P(u) \quad (7)$$

Specifically, we denote the trainable parameters of the context scorer and DocNMT by $\omega$ and $\theta$, respectively. For each source sentence X, we select a set of context sentences $Z^*$ by our selection strategies in sub-section 3.2. Meanwhile, another set of context sentences $\hat{Z}$ with the same size as $Z^*$ is sampled according to $P_{select}$ in equation 3. The two sets of context sentences are fed into the same DocNMT module to obtain the rewards $r(Z^*)$ and $r(\hat{Z})$, respectively. Therefore, referring to equation 7, the final gradient of the context scorer is calculated by:

$$\nabla_{\omega} \mathcal{L}_{RL}(\omega) \approx -\big(r(\hat{Z}) - r(Z^*)\big) \nabla_{\omega} \log P_{\omega}(\hat{Z}) \quad (8)$$

where $P_{\omega}(\hat{Z})$ is the probability of sampling $\hat{Z}$ from $P_{select}$. With the baseline reward $r(Z^*)$ obtained by the current best policy (i.e., the learned selection strategies), the method encourages the model to explore more useful context (i.e., sampled context) that yields a higher reward than the current best (i.e., selected context).
For the DocNMT module, we can combine the MLE objective (Eq. 1) and the RL objective (Eq. 6) to stabilize the training procedure (Wu et al., 2018) through a balance factor $\alpha$ as follows:

$$\mathcal{L}(\theta) = \alpha \mathcal{L}_{MLE}(\theta) + (1 - \alpha) \mathcal{L}_{RL}(\theta) \quad (9)$$

We introduce the RL objective into the DocNMT module so that the model can make better use of the selected context. The final RL gradient of DocNMT is calculated by:

$$\nabla_{\theta} \mathcal{L}_{RL}(\theta) \approx -\big(r(\hat{Z}) - r(Z^*)\big) \nabla_{\theta} \log P_{\theta}(\hat{Y} \mid X, \hat{Z}) \quad (10)$$

where $\hat{Y}$ is a sequence generated by the current DocNMT model with the sampled context $\hat{Z}$.
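The self-critical update for the context scorer and the mixing of the two objectives can be illustrated with scalar surrogate losses. This is a sketch under stated assumptions: the rewards and the log-probability of the sampled context set are given as plain numbers, the mixed objective is taken to be a convex combination (consistent with α = 1.0 recovering the MLE loss and α = 0.0 the RL loss), and the function names are hypothetical. In an automatic differentiation framework, the rewards would be treated as constants.

```python
import math


def scorer_rl_loss(reward_sampled, reward_selected, log_prob_sampled):
    """Self-critical surrogate loss for the context scorer.

    Differentiating with respect to log_prob_sampled gives
    -(reward_sampled - reward_selected), so a sampled context set that
    beats the selected (baseline) set has its probability increased.
    """
    advantage = reward_sampled - reward_selected
    return -advantage * log_prob_sampled


def mixed_loss(mle_loss, rl_loss, alpha):
    """Combine the MLE and RL losses with a balance factor alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * mle_loss + (1.0 - alpha) * rl_loss


# Toy numbers: the sampled context yields a higher reward than the baseline.
toy_rl = scorer_rl_loss(reward_sampled=0.8, reward_selected=0.6,
                        log_prob_sampled=math.log(0.5))
toy_total = mixed_loss(mle_loss=2.0, rl_loss=4.0, alpha=0.75)
```

When the sampled set performs worse than the baseline, the advantage is negative and the same loss pushes the probability of that sample down.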

Datasets
We evaluate our approach on different domains of Chinese-English (Zh→En) and English-German (En→De) datasets. The corpora statistics are listed in Table 2. For TED Talks in IWSLT17, we use dev-2010 as the development set, and tst-2010∼2013 as the test set for both the Zh→En and En→De language pairs. For News-Commentary v14, we use newstest2017 for development and newstest2018 for testing. Europarl is a large-scale corpus extracted from Europarl v7, and we use the same training, development and test sets as Maruf et al. (2019).

Models
We compare our approach with the following methods: 1) SENTNMT (Vaswani et al., 2017) is a standard sentence-level Transformer model using the "base" version parameters. 2) TDNMT introduces the contextual information by adding attention sub-layers at each encoder and decoder layer. We use 2 previous consecutive context sentences as its authors suggest. 3) HAN (Miculicich et al., 2018) uses 3 previous sentences as context. We adopt the "HAN encoder + HAN decoder" strategy that adds a hierarchical network on top of the last encoder and decoder layers to model sentence-level and word-level contextual information. 4) SAN (Maruf et al., 2019) utilizes all context in the entire document by calculating sentence-level and word-level weights. It focuses on relevant context sentences through the sparsemax function. We choose the model that integrates the online context into the encoder with "sparse-soft H-Attention".
We implement our approach and the baseline methods with the THUMT toolkit. The parameters follow the "base" version of the original Transformer (Vaswani et al., 2017). The $d_1$ and $d_2$ in Eq. 2 are 512 and 256, respectively. We set $L_1 = 2$ and $L_2 = 2$. The effect of the layer depth of the context scorer and more implementation details are shown in the appendix.

Main Results
We use the BLEU score (Papineni et al., 2002) to evaluate translation quality. Considering the memory limitation and the complex sampling space, we select dynamic context from the previous six sentences. Table 3 shows the performance of models utilizing different context settings. We always keep the same setting for training and test.
Comparison with Fixed Context Methods. The performance of DocNMT models with fixed context is shown in rows 2∼5. Rows 2 and 3 follow the context settings in the published papers. Using more context sentences indiscriminately (rows 4 and 5) does not bring a significant BLEU improvement. Instead, it increases the computational cost.
By contrast, our approach (rows 10∼15) can significantly improve translation quality on all datasets. Let us take the TDNMT models on Zh→En TED as an example. Row 11 applies the size-first strategy to select context sentences of the same size as the original TDNMT model in row 3.

Table 3: Performance of models on BLEU (%) using different context settings. "full" means using all context in the scope. "random", "attend", and "select" stand for selecting sentences randomly, implicitly based on attention weights, and explicitly by our approaches, respectively. "dyn" stands for dynamic size. "DCS-SF" and "DCS-PF" mean dynamic context selection by the size-first and probability-first strategies, respectively. All results using "DCS" are statistically significantly (p-values < 0.05) better than the corresponding original DocNMT models.

The remaining rows show the models trained and tested with dynamic context settings. Rows 6 and 7 show a lower bound that randomly selects a fixed number of context sentences. The results are similar to those of the original models with a fixed number of previous sentences (rows 2 and 3). In contrast to the random selection, our approach (rows 10 and 11) can select the same number of context sentences that are really helpful for generating better translations. SAN (row 8) implicitly selects context from all previous sentences by sharpening the attention weights. It resets low attention weights to zero to filter out some sentences. For a fair comparison, we also implement SAN in a limited context scope (row 9). Even if the candidate set is limited to the previous six sentences, the BLEU does not decrease significantly. Different from SAN, our approach explicitly selects context sentences under reinforced guidance. As row 16 shows, when added to SAN (row 9), our approach obtains a +0.69 BLEU gain (20.18 vs. 19.49) on Zh→En TED by picking a more focused context candidate set for SAN.
Furthermore, our approach can set the context size to zero, but SAN cannot deal with this situation.
Comparison of Selection Strategies. We also compare the two selection strategies proposed in section 3.2. Results with the probability-first strategy (rows 13 and 15) are slightly better than those with the size-first strategy (rows 10 and 11). The size-first strategy has to include some useless sentences because of the fixed size. By contrast, the probability-first strategy allows more flexible context selection of dynamic size. It achieves +0.72 (20.26 vs. 19.54) and +0.95 (20.34 vs. 19.39) BLEU improvements on Zh→En TED when applied to the HAN and TDNMT models, respectively.

Effect of DocNMT Training
Our proposed context selection module is independent of the translation module. Therefore, the context scorer and DocNMT can be trained separately. As shown in Table 4, we discuss the impact of the selected context on the training of DocNMT. In row 2, we only train the context scorer, and keep the original DocNMT model unchanged as a component to calculate rewards. The result shows that our selection module can effectively distinguish between useful and useless context sentences for translation, and achieves a +0.54 BLEU gain on the En→De Europarl development set.
We also explore whether the selected context is helpful for DocNMT training. We set the balance factor α in Eq. 9 to [0, 0.25, 0.5, 0.75, 1.0] in our experiments. Row 3 shows the model with α = 1.0, which optimizes the standard MLE loss using the selected context sentences. Row 7 sets α = 0.0 to fine-tune DocNMT with the RL loss. By contrast, DocNMT models guided by the combination of the MLE and RL losses are learned better. We think the RL loss may make the model more sensitive to the selected context sentences. When α = 0.75, DocNMT obtains the best BLEU score on the development set, so we use this setting in our experiments.
Figure 3 shows the distribution of different context sizes and maximum distances in the test sets of Zh→En TED. Our approach selects context sentences whose number can range from zero to six. In Figure 3 (a), 78.9% of source sentences tend to select no more than three context sentences, and 26.2% of sentences can be translated well without contextual information. The average context size over the test sets is 2.05 sentences. In Figure 3 (b), we show the maximum distances from the selected context sentences to the currently translated sentence. Except for the cases that need no context (distance 0), the distance distribution is relatively uniform. The overall average distance is 2.57 previous sentences.

Table 6: BLEU (%) scores on the context-empty and context-nonempty test sets. "+" stands for the improvement when compared with TDNMT.

Selection of Empty Context
Our approach has the ability to select empty context for translation, which other models such as SAN (Maruf et al., 2019) cannot do. To evaluate whether the selected empty context is reasonable, we annotate a special test set that contains 500 sentences selected randomly from the Zh→En TED test sets. Each sentence is given its previous 6 sentences as context. Two annotators are instructed to mark context-empty sentences that can be translated well without any contextual information. The annotation details and statistics are shown in the appendix. The Cohen's Kappa value (Cohen, 1960) of the annotation is 0.72. We gather the sentences marked by both annotators as the final context-empty sentences (about 39.4% of the 500 sentences). The test set is thus divided into context-empty and context-nonempty subsets, whose sizes are 197 and 303, respectively. Table 5 shows the performance of our approach (using the "TDNMT+DCS-PF" model) for predicting empty context on the 500 annotated sentences. For the selection of empty context, our approach achieves a 58.96 F1-score. Table 6 shows the BLEU scores on the context-empty and context-nonempty subsets. With our context selection, the BLEU improvement on the context-empty set is higher than that on the context-nonempty set. The analysis indicates that our approach is aware of context-empty sentences, and can select empty context to improve translation quality.
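For reference, the two statistics used in this evaluation, Cohen's Kappa between two annotators and the F1-score of the empty-context predictions, can be computed as in the following sketch. The label encoding (1 for context-empty, 0 otherwise), the function names and the toy label lists are assumptions made for illustration and are not the annotated data.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labeling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                   for c in categories)
    if expected == 1.0:  # both annotators always use the same single label
        return 1.0
    return (observed - expected) / (1.0 - expected)


def f1_for_label(gold, pred, label=1):
    """F1-score of predicting `label`, given gold and predicted labels."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for four sentences (1 = context-empty).
annotator_a = [1, 1, 0, 0]
annotator_b = [1, 0, 0, 0]
kappa = cohen_kappa(annotator_a, annotator_b)
f1 = f1_for_label(gold=[1, 1, 0, 0], pred=[1, 0, 0, 1])
```

In the setting above, the gold label of a sentence would be 1 only if both annotators marked it as context-empty.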

Analysis of Discourse Phenomena
In addition to the selection of empty context, we also want to examine whether our approach can select context sentences that help improve the translation of discourse phenomena. Voita et al. (2019b) construct contrastive test sets for English-Russian to evaluate four types of discourse phenomena (i.e., deixis, lexical cohesion, inflection and VP ellipsis). Each test instance consists of a positive translation and several negative translations with incorrect phenomena. Models are evaluated by accuracy, defined as the proportion of instances in which the generation probability of the positive translation is higher than that of the negative ones. Meanwhile, each instance has three context sentences, exactly one of which is decisive in resolving the phenomenon. This sentence has been marked. Therefore, we can evaluate the accuracy of context selection, taking the marked context sentences as the standard answer.
We use the same datasets as Voita et al. (2019b) to train models. Different from TDNMT, which only uses source-side context, CADec (Voita et al., 2019b) is proposed to utilize both source-side and target-side context. Based on CADec, we try to extend our approach in a simple way to select the target-side context. When the context scorer selects a source-side context sentence, the corresponding sentence-level translation is directly selected as target-side context. Table 7 shows the accuracy of context selection on the four test sets. Our approach selects more than 85% of the standard context sentences for the special phenomena, and achieves more than 80% exact match on the lexical cohesion and VP ellipsis sets.
The accuracy on discourse phenomena is shown in Table 8. TDNMT does not perform well because it only uses source-side context, which is unchanged in the contrastive instances of the test sets. Compared with the original CADec, our approach can improve the performance on lexical cohesion. Although the simple way of selecting target-side context bears the risk of missing selections, the accuracy on some phenomena does not change significantly. Table 7 has shown that our approach can select useful target-side context in most cases, and the selection mechanism can make the model focus more on the useful context to resolve the discourse phenomena.
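The contrastive accuracy described above can be sketched as follows. Each instance is represented here as a pair of the model score (for example, a log-probability) of the positive translation and the list of scores of its negative translations; this representation, the function name and the toy scores are assumptions made for illustration.

```python
def contrastive_accuracy(instances):
    """Fraction of instances whose positive translation outscores all negatives.

    instances is a list of (positive_score, negative_scores) pairs, where
    scores are model generation scores such as log-probabilities.
    """
    correct = sum(1 for pos, negs in instances
                  if all(pos > neg for neg in negs))
    return correct / len(instances)


# Toy log-probabilities for three contrastive instances.
toy_instances = [
    (-1.0, [-2.0, -3.0]),   # positive wins
    (-2.5, [-2.0]),         # a negative translation is preferred
    (-0.5, [-0.7, -0.6]),   # positive wins
]
accuracy = contrastive_accuracy(toy_instances)
```

The accuracy of context selection can be measured in the same spirit by checking whether the marked decisive context sentence is among the selected ones.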

Related Work
Standard neural machine translation methods usually focus on sentence-level translation (Cho et al., 2014; Bahdanau et al., 2015; Zhang and Zong, 2015; Luong et al., 2015; Tu et al., 2016; Zhang and Zong, 2016; Vaswani et al., 2017; Zhao et al., 2020). By contrast, document-level neural machine translation methods mainly pay attention to how to utilize the cross-sentence context. Researchers propose various context-aware networks to utilize contextual information to improve the performance of DocNMT models on translation quality (Jean et al., 2017; Tu et al., 2018; Kuang et al., 2018) or discourse phenomena (Bawden et al., 2018; Voita et al., 2019b,a). However, most methods simply leverage all context sentences in a fixed size that is tuned on development sets (Wang et al., 2017; Miculicich et al., 2018; Yang et al., 2019; Xu et al., 2020), or the full context in the entire document (Maruf and Haffari, 2018; Tan et al., 2019; Kang and Zong, 2020; Zheng et al., 2020). They ignore the individualized needs for context when translating different source sentences. Some works have noticed that not all context is useful (Jean and Cho, 2019; Kim et al., 2019). Kimura et al. (2019) explore context selection in the single-encoder framework (Tiedemann and Scherrer, 2017), and select the context sentences that yield the highest forced back-translation probability. However, the method cannot optimize the DocNMT model in the training phase, and requires a back-translation model in the inference phase. Maruf et al. (2019) sharpen the attention weights between the source and context sentences through the sparsemax function, and implicitly select context with high attention weights. Nevertheless, the method lacks direct supervision over context selection, and it cannot cover the situation where context is not needed. Inspired by extractive-abstractive summarization (Chen and Bansal, 2018), our approach is different from the above DocNMT methods. Our approach can explicitly select a dynamic number (which can be 0) of context sentences for the translation of different source sentences.

Conclusion and Future Work
We propose a dynamic selection method to choose variable sizes of context sentences for document-level translation. The candidate context sentences are scored and selected by two proposed strategies. We train the whole model via reinforcement learning, and design a novel reward to encourage the selection of useful context sentences. When applied to existing DocNMT models, our approach can improve translation quality significantly. In the future, we will select context sentences in a larger candidate space, and explore more effective ways to extend our approach to select target-side context sentences.

A.1 Parameters and Implementation
We implement all models based on the toolkit THUMT with the parameters of the "base" version of Transformer (Vaswani et al., 2017). Specifically, we use 6 layers of encoder and decoder with 8 attention heads. The hidden size and feed-forward layer size are 512 and 2,048, respectively. For Zh→En, the Chinese and English vocabulary sizes are 30K and 25K, respectively. For En→De, the source side and target side share a vocabulary table. The vocabulary size is 30K. Chinese sentences are segmented into words by our in-house toolkit. English and German datasets are tokenized and truecased by the Moses toolkit. Words are segmented by byte-pair encoding (Sennrich et al., 2016).
We introduce a context scorer that is independent of the DocNMT models, which allows our approach to be easily deployed on many baseline DocNMT systems. Compared with the original DocNMT models, the number of additional parameters depends on the number of Transformer encoder layers $L_1$ and $L_2$ in the context scorer.

Table 9: BLEU (%) scores on the En→De TED development set using different layers of the context scorer.

In Table 9, we discuss the effect of the layer depth of the context scorer (defined in subsection 3.1). Experiments are conducted using the "TDNMT+DCS-PF" model with the balance factor α = 0.75. Our approach achieves the highest BLEU with a context scorer setting of $L_1 = 2$ and $L_2 = 2$, which introduces 12.7M extra parameters to any original DocNMT model.

A.2 Training and Inference
For training, we use the Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 10^{-9}$. We employ label smoothing with a value of 0.1 and dropout with a rate of 0.1. The batch size is 3,000 tokens. We employ 4 Titan Xp GPUs to train all models. Compared with the original DocNMT (TDNMT), the training and testing speeds are slowed down by factors of 1.61 (mainly because of the generation of $\hat{Y}$ in Eq. 10) and 1.05, respectively.
We use multi-bleu.perl to compute the BLEU score. The beam size is set to 4. The significance test is conducted by the script "bootstrap-hypothesis-difference-significance.pl" in Moses.

B Annotation and Statistics of Empty Context
In this section we describe the annotation process and statistics of the special test set constructed to evaluate the selection of empty context.

B.1 Annotation
We randomly select 500 sentences with previous 6 sentences as context from Chinese-English TED tst-2010∼2013. Each example to be annotated contains a source-reference sentence pair and six source-reference contextual sentence pairs. Two annotators proficient in both Chinese and English are instructed to annotate the sentences that can be translated well without any context. The process consists of three steps, and is carried out independently between two annotators.
Step 1. Annotators are instructed to read a single source sentence X without any context, and translate it into $Y'$ by themselves.
Step 2. The reference Y of the sentence X is shown to the annotators. Then, they are instructed to compare $Y'$ with Y word-by-word to answer whether $Y'$ is appropriate.
Step 3. Annotators are instructed to read the source-reference contextual sentences, and compare $Y'$ with Y word-by-word again. After that, they are asked to determine whether $Y'$ needs to be modified to make it better.
If an annotator insists that their translation $Y'$ is appropriate at Step 2 and needs no modification at Step 3, the sentence X is annotated as "context-empty", which means it can be translated well without relying on any context. Otherwise, the sentence is annotated as "context-nonempty". Table 10 shows the statistics of the annotation. The Cohen's Kappa value is 0.72. 197 sentences are annotated as context-empty by both annotators. These sentences are gathered as the final context-empty test set. The other 303 sentences make up the context-nonempty test set.

B.2 Statistics of Empty Context Selection
Taking the human annotation in Table 10 as the gold standard, Table 11 shows the statistics of empty context prediction by our approach in subsection 5.3.