Context-aware Neural Machine Translation with Coreference Information

We present neural machine translation models that translate a sentence in a text by using a graph-based encoder which explicitly considers coreference relations provided within the text. The graph-based encoder can dynamically encode the source text without attending to all tokens in the text. In experiments, our proposed models provide statistically significant improvements over the previous approach of up to 0.9 BLEU points on the OpenSubtitles2018 English-to-Japanese data set. Experimental results also show that the graph-based encoder handles longer texts well compared with the previous approach.


Introduction
The quality of machine translation has recently improved dramatically with Sequence-to-Sequence (Seq2Seq) models (Bahdanau et al., 2014). Most Seq2Seq models are used on the premise that each sentence is translated independently, one by one. In reality, however, sentences are often elements of a larger unit, such as a document, which means a sentence is not always semantically self-contained. To correctly interpret a sentence that is part of a document, it is important to consider its context: preceding and/or succeeding sentences.
To tackle this problem, Seq2Seq models that receive two sentences (Tiedemann and Scherrer, 2017; Bawden et al., 2018; Voita et al., 2018; Wang et al., 2017) have been utilized. To capture multiple-sentence information more effectively, Miculicich et al. (2018) and Zhang et al. (2018) incorporated document-level attention modules into Seq2Seq models. Stojanovski and Fraser (2018) proposed a Seq2Seq model that captures antecedents of pronouns in the previous source sentence by using a coreference resolution toolkit. To capture the entire source text, these models depend strongly on attention distributions.
However, the space complexity of the attention mechanism in the Seq2Seq model grows with the square of the input sequence length, because it attends to all words in the source text. This characteristic prevents the model from translating long texts. Furthermore, when translating into a pro-drop language such as Japanese, longer contexts are required to generate accurate and naturally concise sentences.
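To make the quadratic growth concrete, here is a toy numpy illustration (not from the paper): a full attention layer materializes one score per (target, source) token pair, so the score matrix grows with the square of the text length.

```python
import numpy as np

def dot_attention_scores(src_states, tgt_states):
    """Score matrix for dot attention: one score per (target, source) pair,
    so memory grows with len(tgt) * len(src)."""
    return tgt_states @ src_states.T

d = 4
for n in (50, 350):  # roughly one sentence vs. a seven-sentence text
    scores = dot_attention_scores(np.ones((n, d)), np.ones((n, d)))
    print(n, scores.size)  # 50 -> 2500, 350 -> 122500
```

Doubling the number of source tokens quadruples the attention memory footprint, which is what makes long texts problematic for attention-over-everything models.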
To avoid this problem, we propose a model that effectively captures contextual information, i.e., the preceding and succeeding sentences of the source sentence to be translated, by constructing an encoder based on explicit coreference relations. The proposed model directly takes relationships between sentences into account via a graph-structured encoder built with a coreference resolution toolkit, so it does not need to attend to all input tokens. This enables our proposed model to handle more sentences in one step than the previous models, and may improve translation quality when a source text has many sentences.
Experimental results on English-to-Japanese translation pairs in OpenSubtitles2018 (Lison et al., 2018) show that our proposed model significantly improves over the previous model in terms of BLEU scores. In addition, we observe that our model is especially effective in translating a sentence that is part of a long text, compared to the previous model.

Seq2Seq Model with Attention
In this section, we briefly describe the attention-based Seq2Seq model which our proposed model is based on. We use LSTMs (Hochreiter and Schmidhuber, 1997) as the recurrent neural network (RNN) structures in the encoder and the decoder. In the Seq2Seq model, the probability of translating an input sentence x = (x_1, ..., x_{T_x}) into an output sentence y = (y_1, ..., y_n) is represented as follows:

p(y | x) = prod_{i=1}^{n} p(y_i | y_{<i}, x),
p(y_i | y_{<i}, x) = softmax(g([s_i; c_i])),
s_i = dec(s_{i-1}, emb(y_{i-1})),
c_i = sum_{t=1}^{T_x} a(s_i, h_t) h_t,
h_t = enc(emb(x_t)),   (1)

where i is the position of an output token, t is the position of an input token, emb(.) is a function that returns the embedding of an input word, g is a 2-layer feedforward neural network (FFNN), dec is a forward-LSTM decoder with hidden state s_i, enc is a bidirectional-LSTM (Bi-LSTM) encoder with hidden state h_t, c_i is the context vector, and a is dot attention (Luong et al., 2015) for calculating the attention weight.
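As an illustration, the dot-attention step described above can be sketched in numpy (a toy sketch, not the authors' implementation): score each encoder state against the decoder state, normalize with softmax, and take the weighted sum as the context vector.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def dot_attention_context(s_i, enc_states):
    """Luong-style dot attention: weight each encoder state h_t by
    softmax(s_i . h_t) and return the context vector c_i."""
    scores = enc_states @ s_i     # one score per source position
    weights = softmax(scores)     # attention distribution over the source
    return weights @ enc_states   # convex combination of the h_t

enc_states = np.random.randn(6, 8)  # T_x = 6 source states, dimension 8
s_i = np.random.randn(8)            # current decoder state
c_i = dot_attention_context(s_i, enc_states)
assert c_i.shape == (8,)
```

Because the weights sum to one, the context vector always lies inside the convex hull of the encoder states.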

Graph-based Encoder with Coreference Relations
Our proposed model encodes not only the sentence to be translated but also its preceding and succeeding sentences together, based on the results of coreference resolution. Therefore, information about sentence relationships can be effectively utilized. Figure 1 shows the network structure of our proposed model. First, the input sentences are analyzed by a coreference resolution system. The encoder is then structured according to the coreference resolution results, and the input text is encoded into hidden states. Finally, the hidden states are converted into a translated text via attention distributions and the decoder. During translation, attention distributions are calculated only over the sentence currently being translated. In the next subsections, we explain each step in detail. Hereafter, we denote a sequence of N sentences as (X_1, ..., X_N) and the j-th word in X_i as x^i_j.

Coreference Resolution
Multiple sentences in a source text (X_1, ..., X_N) are concatenated and then input to a coreference resolution system. We use NeuralCoref 1 as the coreference resolution system. Let the length of X_i be T_i. The concatenated token sequence is represented as:

(x^1_1, ..., x^1_{T_1}, x^2_1, ..., x^2_{T_2}, ..., x^N_1, ..., x^N_{T_N}).

The coreference resolution system extracts N_c clusters of coreferring mentions (c_1, ..., c_{N_c}), each consisting of mention-span pairs:

c_k = {(main_k, sub_k)},

where main_k is the span of the representative mention in a cluster of coreferring mentions, and sub_k is the span of another mention in the cluster. 2 In general, because many mentions can belong to a single cluster, the same main_k is sometimes paired with different mentions.
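As a small illustration of the concatenation, the global position of word x^i_j in the concatenated token sequence can be computed from the sentence lengths T_i (a hypothetical helper, not from the paper):

```python
from itertools import accumulate

def concat_index(lengths, i, j):
    """0-based index of x^i_j (1-based sentence i, 1-based word j) in the
    concatenated sequence; lengths[k] is T_{k+1}."""
    offsets = [0] + list(accumulate(lengths))  # cumulative sentence offsets
    return offsets[i - 1] + (j - 1)

lengths = [5, 7, 4]                 # T_1, T_2, T_3
print(concat_index(lengths, 1, 1))  # 0  (first word of X_1)
print(concat_index(lengths, 2, 3))  # 7  (third word of X_2: 5 + 2)
print(concat_index(lengths, 3, 4))  # 15 (fourth word of X_3: 5 + 7 + 3)
```

This index bookkeeping is what lets span-based resolver output over the concatenated text be mapped back to per-sentence word positions.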
To use coreference relations in our graph-based encoder, we need to convert them into word-based coreference relations. Let head(.) be a function that returns the first word of an input span and tail(.) be a function that returns the last word of the input span. When x^{i'}_{j'} refers to x^i_j, the two words satisfy the following conditions for some cluster c_k:

x^{i'}_{j'} = head(sub_k) and x^i_j = tail(main_k).

1 https://github.com/huggingface/neuralcoref. This code is based on the work by Clark and Manning (2016). 2 We treat a nominal noun which is the antecedent of a pronoun or a proper noun as a representative mention.

Figure 2 shows an example of a word-based coreference relation: in "I have two daughters. They are ...", sub_1 = "They" refers to the span main_1 = "two daughters", so the word head(sub_1) = "They" refers to the word tail(main_1) = "daughters".
Furthermore, we denote the set of words referred to by a word x^i_j as ref(x^i_j). Because the number of words referred to by a word is at most one, the number of elements in ref(x^i_j) is at most one. We further split ref(x^i_j) by direction into ref_f(x^i_j), containing a referred word that precedes x^i_j in the concatenated sequence (used by the forward encoder), and ref_b(x^i_j), containing a referred word that follows it (used by the backward encoder).
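The conversion from span-based clusters to the word-based ref(.) mapping can be sketched as follows (a sketch under the assumption that spans are (start, end) token-index pairs over the concatenated sequence; the helper names are ours):

```python
def head(span):
    """First token index of a span given as (start, end)."""
    return span[0]

def tail(span):
    """Last token index of a span given as (start, end)."""
    return span[1]

def build_ref(clusters):
    """clusters: list of (main_span, sub_span) pairs; returns the word-based
    relation as {referring word index: referred word index}."""
    ref = {}
    for main_span, sub_span in clusters:
        ref[head(sub_span)] = tail(main_span)  # at most one referent per word
    return ref

# "I have two daughters . They are ..." with 0-based token indices:
# main_1 = "two daughters" = (2, 3), sub_1 = "They" = (5, 5)
clusters = [((2, 3), (5, 5))]
print(build_ref(clusters))  # {5: 3}: "They" refers to "daughters"
```

Because each word maps to at most one referent, the result is a plain dictionary rather than a multimap.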

Graph-based Encoder
In this section, we explain how the coreference relations are used in the encoder. As in the standard Seq2Seq model, the encoder of the proposed model is based on Bi-LSTM. For each input sentence X_i = (x^i_1, ...), the forward encoder calculates the current hidden state ->h^i_t at the position of a word x^i_t as follows:

->h^i_t = enc(emb(x^i_t), m(->h^i_{t-1}, ref_f(x^i_t))),

where ->h^i_{t-1} is the previous hidden state, ref_f(x^i_t) is the set of words referred to by x^i_t, and m(., .) is a function which merges hidden state vectors; when ref_f(x^i_t) is empty, m simply returns ->h^i_{t-1}. In this paper, we propose the following two functions as m(., .). Let ->h_r denote the hidden state of the referred word in ref_f(x^i_t). Coref-mean treats the averaged hidden state vectors as the merged vector:

m(->h^i_{t-1}, ref_f(x^i_t)) = (->h^i_{t-1} + ->h_r) / 2.

Coref-gate treats a weighted sum of the hidden state vectors as the merged vector:

m(->h^i_{t-1}, ref_f(x^i_t)) = (1 - beta) * ->h^i_{t-1} + beta * ->h_r,

where * represents the element-wise product for each dimension and beta represents the importance of ->h_r, calculated as:

beta = sigmoid(W_t ->h^i_{t-1} + W_s ->h_r),

where W_t and W_s are weight matrices. The backward encoding is processed similarly by replacing ref_f with ref_b. Finally, the forward and backward hidden states are concatenated to form h^i_t for each t. After that, h^i_t is used for translation, in place of h_t in equation (1), attending only to the target sentence to be translated.
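The two merge functions can be illustrated with a minimal numpy sketch. Note the gated form below, an element-wise interpolation with beta = sigmoid(W_t h_prev + W_s h_ref), is our reading of the description above, not a confirmed implementation:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def coref_mean(h_prev, h_ref):
    """Coref-mean: average of the previous hidden state and the hidden
    state of the referred word."""
    return 0.5 * (h_prev + h_ref)

def coref_gate(h_prev, h_ref, W_t, W_s):
    """Coref-gate: an element-wise gate beta decides, per dimension, how
    much of the referred word's state to mix into the previous state."""
    beta = sigmoid(W_t @ h_prev + W_s @ h_ref)
    return (1.0 - beta) * h_prev + beta * h_ref

d = 4
h_prev, h_ref = np.ones(d), np.zeros(d)
print(coref_mean(h_prev, h_ref))  # [0.5 0.5 0.5 0.5]
W_t, W_s = np.zeros((d, d)), np.zeros((d, d))
# With zero weights, beta = sigmoid(0) = 0.5 in every dimension.
print(coref_gate(h_prev, h_ref, W_t, W_s))  # [0.5 0.5 0.5 0.5]
```

With trained weights, the gate can suppress the coreference signal per dimension, which is the flexibility Coref-mean lacks.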

Experimental Setting
We evaluated the proposed models on the English-to-Japanese translation data set in OpenSubtitles2018 (Lison et al., 2018). We cut out consecutive n (= 1, 2, 3, 5, 7) sentences from the original data set as a unit. We then randomly selected 2,000 units as test data, and the remaining roughly 1.87 million units were used as training data. All Japanese texts were tokenized by MeCab 3 with NEologd (Sato et al., 2017).
We set the vocabulary size for both source and target sides as 32,000. Both the encoder and the decoder were composed of 2-layer LSTMs. The dimension size of word embeddings for both source and target sides was set to 500. The dimension size of the encoder LSTM layers, the decoder LSTM layers, and an attention layer were set to 500, 1000, and 500, respectively. Initial values for weights were randomly sampled from a uniform distribution within the range of -1 to 1 (Glorot and Bengio, 2010).
Adam (Kingma and Ba, 2014) was used to update the weight parameters, with the learning rate set to 0.001. Training was carried out for 200,000 steps over the entire training data. The mini-batch size was set to 32, and gradients were averaged over the number of examples in each mini-batch. The order of mini-batches was randomly shuffled at the start of training. PyTorch was used to implement the models. Each model was run independently on a single NVIDIA Tesla P100 GPU 4.
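The batching schedule described above (shuffled mini-batches of 32, gradients averaged over the examples in each batch) can be sketched framework-independently; the function names are ours, not from the paper's code:

```python
import random

def make_batches(examples, batch_size=32, seed=0):
    """Shuffle the example order once, then slice into mini-batches."""
    order = list(range(len(examples)))
    random.Random(seed).shuffle(order)  # shuffle at the start of training
    return [[examples[i] for i in order[b:b + batch_size]]
            for b in range(0, len(order), batch_size)]

def averaged_gradient(per_example_grads):
    """Average gradients over the number of examples in the mini-batch."""
    return sum(per_example_grads) / len(per_example_grads)

batches = make_batches(list(range(100)), batch_size=32)
print([len(b) for b in batches])           # [32, 32, 32, 4]
print(averaged_gradient([1.0, 2.0, 3.0]))  # 2.0
```

Averaging (rather than summing) per-example gradients keeps the effective step size independent of the last, smaller batch.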
We varied the number of input sentences, n, over {1, 2, 3, 5, 7} to observe the relationship between translation quality and the number of input sentences. We input the sentence to be translated together with the n − 1 sentences that precede it.

Table 1: BLEU scores for each model. Bold indicates the best score. Underline indicates a statistically significant improvement over the baseline Concat in the same setting (p < 0.05). × indicates that the model did not run due to a shortage of GPU memory.

n = 1: 12.8%, n = 2: 13.2%, n = 3: 13.8%, n = 5: 14.6%, n = 7: 15.4%
As a baseline model, we used a method that concatenates multiple input sentences and generates a single sentence, proposed by Bawden et al. (2018) (Concat) 5. We compared our proposed models, Coref-mean (Cor-m) and Coref-gate (Cor-g), with this baseline. To evaluate the effectiveness of succeeding sentences, we also experimented with inputting the same number of preceding and succeeding sentences around the target sentence to be translated at the center, for Cor-g. We denote this setting as Coref-gate-centered (Cor-g-c). The number of weight parameters is 111,057k for the baseline and Cor-m, and 111,558k for Cor-g.
We used BLEU scores (Papineni et al., 2002) to evaluate the translation performance of each model. All reported BLEU scores are averages over three runs and are based on MeCab tokenization. Significance tests were conducted by paired bootstrap resampling (Koehn, 2004) with multeval (Clark et al., 2011) 6. 5 In our preliminary comparison, there were no statistically significant differences in translation performance between Concat and the method of both inputting and outputting concatenated multiple sentences, also proposed by Bawden et al. (2018). From the perspective of computational efficiency, we therefore chose Concat as our baseline.

Results and Analysis
Table 1 shows the results 7. We can observe that our proposed models, Cor-m and Cor-g, outperformed the baseline Concat in terms of BLEU scores at every unit length. Interestingly, Cor-g outperformed Concat even at n = 1. As shown in Table 2, this is because our proposed models can also use intra-sentential coreference information for translation. In the setting of n = 2, all results improved over those for n = 1, which is consistent with the results reported in Bawden et al. (2018). For n > 2, the improvement in BLEU scores for Concat stopped at n = 3, in contrast to the proposed models. This indicates that the proposed models can handle more sentences well by using their graph-based encoder and the provided coreference information.
The scores for Cor-g are always better than those for Cor-m, which shows that the gating mechanism in Cor-g works well. In addition, as shown in Figure 3, the translations of Cor-g are closer in token length to the references, while Concat and Cor-m suffer from severe undergeneration. The results in Figure 4 show that for n > 2, Cor-g maintains word coherence without increasing the number of word types in generated sentences. Taking the BLEU gains into account as well, these results support our view that Cor-g captures context well compared to Cor-m and Concat.
However, the scores for Cor-g-c degraded compared to Cor-g with the same number of sentences. This reflects the tendency that most coreferences are anaphoric, while cataphora is rarely observed in the test set. If the succeeding sentences are ignored, Cor-g-c at n = 3, 5, 7 corresponds to Cor-g at n = 2, 3, 4. Interestingly, Cor-g-c at n = 3, 5 achieved better BLEU scores than Cor-g at n = 2, 3. This indicates that cataphoric information is also useful when translating many sentences in a text.
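For reference, the paired bootstrap resampling test (Koehn, 2004) used in our significance testing can be sketched as follows; the exact-match metric here is a toy stand-in for corpus BLEU, just to exercise the procedure:

```python
import random

def paired_bootstrap(metric, sys_a, sys_b, refs, n_resamples=1000, seed=0):
    """Resample the test set with replacement and count how often system A
    beats system B on the metric. metric(hyps, refs) scores a whole set."""
    rng = random.Random(seed)
    n, wins_a = len(refs), 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        score_a = metric([sys_a[i] for i in idx], [refs[i] for i in idx])
        score_b = metric([sys_b[i] for i in idx], [refs[i] for i in idx])
        if score_a > score_b:
            wins_a += 1
    return wins_a / n_resamples  # > 0.95 ~ significant at p < 0.05

# Toy sentence-level metric: mean exact match against the references.
exact = lambda hyps, refs: sum(h == r for h, r in zip(hyps, refs)) / len(refs)
p = paired_bootstrap(exact, ["a", "b", "c"], ["a", "x", "y"], ["a", "b", "c"])
assert 0.0 <= p <= 1.0
```

Because the same resampled indices are scored for both systems, the test controls for sentence difficulty, which is what makes it a paired test.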

Conclusion
In this paper, we proposed a Seq2Seq model that can effectively incorporate information from the preceding and succeeding sentences of the sentence being translated, by explicitly taking provided coreference relations into account. Experimental results showed that the proposed models improve translation quality when multiple sentences are input jointly, compared to the previous model. From these results, we conclude that considering explicit coreference relations in the Seq2Seq model contributes to improving performance on English-to-Japanese translation.