Neural Machine Translation for Cross-Lingual Pronoun Prediction

In this paper we present our systems for the DiscoMT 2017 cross-lingual pronoun prediction shared task. For all four language pairs, we trained a standard attention-based neural machine translation system as well as three variants that incorporate information from the preceding source sentence. We show that our systems, which are not specifically designed for pronoun prediction and may be used to generate complete sentence translations, generally achieve competitive results on this task.


Introduction
Given a source document and its corresponding partial translation, the goal of the DiscoMT 2017 cross-lingual pronoun prediction shared task (Loáiciga et al., 2017) is to correctly replace the missing pronouns, choosing among a small set of candidates. In this paper, we propose and evaluate models on four sub-tasks: En-Fr, En-De, De-En and Es-En.
We consider the use of attention-based neural machine translation systems for pronoun prediction and investigate the potential for incorporating discourse-level structure by integrating the preceding source sentence into the models. More specifically, instead of modeling the conditional distribution p(Y | X) over translations given a source sentence, we explore different networks that model p(Y | X, X_{-1}), where X_{-1} is the previous source sentence. The proposed larger-context neural machine translation systems are inspired by recent work on larger-context language modeling (Wang and Cho, 2016) and multi-way, multilingual neural machine translation (Firat et al., 2016).

Baseline: Attention-based Neural Machine Translation
An attention-based translation system is composed of three parts: an encoder, a decoder, and an attention model.
The decoder, composed of a GRU f topped by a one-hidden-layer MLP g, models the conditional probability of the target word y_i given the previous words and the source sentence x:

p(y_i | y_1, ..., y_{i-1}, x) = g(y_{i-1}, s_i, c_i), with s_i = f(s_{i-1}, y_{i-1}, c_i),

where s_i is the RNN hidden state at time i, and c_i is a distinct context vector used to predict y_i.
The computation of the context vector c_i depends on the previous decoder hidden state and on the sequence of annotations (h_1, ..., h_{T_x}), where each h_j is a representation of the whole source sentence with a focus on the j-th word. c_i is a weighted sum of the annotations:

c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j,   \alpha_{ij} = \exp(e_{ij}) / \sum_{k=1}^{T_x} \exp(e_{ik}),

where e_{ij} is the attention model score, which represents how well the output at time i aligns with the input around time j.
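As an illustration, the weighted-sum step above can be sketched in numpy. This is a minimal sketch: the additive (tanh) score function is the standard Bahdanau form, and all parameter names and dimensions here are illustrative, not those of our systems.

```python
import numpy as np

def attention_context(s_prev, annotations, W_s, W_h, v):
    """Compute the context vector c_i as a weighted sum of annotations.

    s_prev:      previous decoder hidden state s_{i-1}, shape (d_s,)
    annotations: source annotations h_1..h_Tx, shape (Tx, d_h)
    W_s, W_h, v: parameters of an additive attention score model
    """
    # Unnormalized scores e_ij = v^T tanh(W_s s_{i-1} + W_h h_j)
    scores = np.tanh(s_prev @ W_s + annotations @ W_h) @ v  # (Tx,)
    # Softmax over source positions gives the alignment weights alpha_ij
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    # Context vector c_i: weighted sum of the annotations
    return alpha @ annotations                              # (d_h,)

rng = np.random.default_rng(0)
Tx, d_s, d_h, d_a = 5, 4, 6, 3
c = attention_context(rng.normal(size=d_s), rng.normal(size=(Tx, d_h)),
                      rng.normal(size=(d_s, d_a)), rng.normal(size=(d_h, d_a)),
                      rng.normal(size=d_a))
```

Because the weights are a softmax, c lies in the convex hull of the annotations.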

Larger-Context Neural Machine Translation
As the antecedent needed to correctly translate a pronoun may be in a different sentence (inter-sentential anaphora) (Guillou et al., 2016), we added the previous sentence as an auxiliary input to the neural machine translation system, using an additional encoder and attention model. Similarly to the source sentence encoding, we apply a bidirectional recurrent network to generate context annotation vectors h^c_1, ..., h^c_{T_c}. The additional attention model differs slightly from the original one by integrating the current source representation c_i as a new input, so that the context vector depends on the currently attended source words. As such, this attention model takes as input the previous target symbol, the previous decoder hidden state, the context annotation vectors as well as the source vector from the main attention model. That is, the unnormalized alignment scores are computed as

e^c_{ij} = a^c(y_{i-1}, s_{i-1}, h^c_j, c_i).

Similarly to the source vector c_i, the time-dependent context vector c^c_i is also a weighted sum, this time of the context annotation vectors. With this new information, we explored three different approaches.
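The extra attention model described above can be sketched as follows. The function assumes the same additive score form as the main attention; the parameter names, the precomputed target embedding y_prev, and all dimensions are our own illustrative choices.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def context_attention(y_prev, s_prev, c_i, ctx_annotations, params):
    """Score each context annotation h^c_j given the previous target
    embedding y_{i-1}, the previous decoder state s_{i-1}, and the
    current source vector c_i from the main attention model."""
    W_y, W_s, W_c, W_h, v = params
    # Query term shared across context positions
    query = y_prev @ W_y + s_prev @ W_s + c_i @ W_c
    # Unnormalized scores e^c_ij, then softmax-normalized weights
    scores = np.tanh(query + ctx_annotations @ W_h) @ v   # (Tc,)
    beta = softmax(scores)
    # Time-dependent context vector c^c_i: weighted sum of annotations
    return beta @ ctx_annotations

rng = np.random.default_rng(1)
params = (rng.normal(size=(3, 2)), rng.normal(size=(4, 2)),
          rng.normal(size=(5, 2)), rng.normal(size=(6, 2)),
          rng.normal(size=2))
cc_i = context_attention(rng.normal(size=3), rng.normal(size=4),
                         rng.normal(size=5), rng.normal(size=(7, 6)), params)
```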

Simple Context Model (SCM)
For the first approach, we simply use the context representation c^c_i as an additional input to the decoder GRU f and the prediction function g.
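A schematic decoder step for this model might look as follows. This is a simplified stand-in: a plain tanh transition replaces the actual GRU, and all names and dimensions are illustrative.

```python
import numpy as np

def scm_decoder_step(y_prev, s_prev, c_i, cc_i, W_in, W_rec, W_out):
    """One simplified SCM decoder step: the context vector cc_i is
    concatenated to the usual inputs of both the recurrent transition f
    and the prediction function g."""
    x = np.concatenate([y_prev, c_i, cc_i])
    s_i = np.tanh(x @ W_in + s_prev @ W_rec)       # stand-in for the GRU f
    logits = np.concatenate([y_prev, s_i, c_i, cc_i]) @ W_out  # MLP g
    probs = np.exp(logits - logits.max())
    return s_i, probs / probs.sum()                # p(y_i | y_<i, X, X_{-1})

rng = np.random.default_rng(2)
s_i, p = scm_decoder_step(rng.normal(size=3), rng.normal(size=6),
                          rng.normal(size=4), rng.normal(size=5),
                          rng.normal(size=(12, 6)), rng.normal(size=(6, 6)),
                          rng.normal(size=(18, 10)))
```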
p(y_i | y_1, ..., y_{i-1}, X, X_{-1}) = g(y_{i-1}, s_i, c_i, c^c_i), with s_i = f(s_{i-1}, y_{i-1}, c_i, c^c_i).

Decoder-Gated Context Model (DGCM)
Our second approach is very similar to the first, with the exception that, for both functions f and g, distinct gates (g_1 and g_2) are applied to the context representation c^c_i. Similar context-modulating gates were previously used by Wang et al. (2017).
p(y_i | y_1, ..., y_{i-1}, X, X_{-1}) = g(y_{i-1}, s_i, c_i, g_2 ⊙ c^c_i), with s_i = f(s_{i-1}, y_{i-1}, c_i, g_1 ⊙ c^c_i),

where ⊙ denotes elementwise multiplication. Each gate has its own set of parameters and depends on the previous target symbol, the current source representation and the decoder hidden state, at time i−1 for g_1 and at time i for g_2.
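One such gate could be sketched as an elementwise sigmoid over the context vector's dimensions. The exact parameterization here is an assumption (a single affine layer); only the gate's inputs follow the description above.

```python
import numpy as np

def context_gate(y_prev, c_i, s, W_y, W_c, W_s, b):
    """A gate over the context vector's dimensions, conditioned on the
    previous target embedding y_{i-1}, the current source vector c_i,
    and a decoder hidden state (s_{i-1} for the gate feeding f,
    s_i for the gate feeding g). Output values lie in (0, 1)."""
    z = y_prev @ W_y + c_i @ W_c + s @ W_s + b
    return 1.0 / (1.0 + np.exp(-z))

# The gated context g_1 * cc_i (resp. g_2 * cc_i) then replaces cc_i
# in the inputs of f (resp. g).
rng = np.random.default_rng(3)
g1 = context_gate(rng.normal(size=3), rng.normal(size=4), rng.normal(size=6),
                  rng.normal(size=(3, 5)), rng.normal(size=(4, 5)),
                  rng.normal(size=(6, 5)), rng.normal(size=5))
```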

Combined Context Model (CCM)
The last method first combines the source and context representations into a vector d_i through a multi-layer perceptron. As in the second approach, the context is also gated.
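The combination step might look as follows; the depth and sizes of the MLP are illustrative assumptions.

```python
import numpy as np

def combine_representations(c_i, cc_i, gate, W1, b1, W2, b2):
    """CCM: merge the source vector c_i and the gated context vector
    gate * cc_i into a single vector d_i via a small MLP; d_i then
    plays the role of c_i in the decoder."""
    x = np.concatenate([c_i, gate * cc_i])
    return np.tanh(np.tanh(x @ W1 + b1) @ W2 + b2)

rng = np.random.default_rng(4)
d_i = combine_representations(rng.normal(size=4), rng.normal(size=5),
                              rng.uniform(size=5), rng.normal(size=(9, 8)),
                              rng.normal(size=8), rng.normal(size=(8, 4)),
                              rng.normal(size=4))
```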

Pronoun prediction task
The DiscoMT 2017 pronoun prediction task serves as a platform to improve pronoun prediction. We are provided with source documents and their lemmatized translations for four language pairs: En-Fr, En-De, De-En and Es-En. In each translation, some sentences have one or more pronouns substituted by the placeholder "REPLACE". For each of these tokens, we must select the correct pronoun among a small set of candidates. There are respectively 8, 5, 9 and 7 target classes for En-Fr, En-De, De-En and Es-En. For example, in the case of En-Fr, the task concentrates on the translation of "it" and "they". The possible target classes are:

Table 2: Test macro-average recall (in %) for cross-lingual pronoun prediction. The "Best" column displays the highest score across all primary and contrastive submissions to the DiscoMT 2017 shared task (Loáiciga et al., 2017).
Although only a subset of the data exhibits context dependencies, such instances are not difficult to find. The following set of sentences, taken from the En-Fr development data, is a good example:

• Context: So the idea is that accurate perceptions are fitter perceptions.
• Source: They give you a survival advantage.

And here are the source sentence translation with the missing token, and the corresponding target:

• Translation: REPLACE vous donner un avantage en terme de survie.
• Target: elles

In this example, "REPLACE" should be the translation of the word "They", which refers to "perceptions" in the previous sentence. This matters because in French, "perceptions" is feminine. Correctly choosing the pronoun here can only be done confidently with contextual information.

Experimental settings
To train our models, which are fully differentiable, we use the Adadelta optimizer (Zeiler, 2012). Word embeddings have dimensionality 620, decoder and source encoder RNNs have 1000-dimensional hidden representations, and the context encoder RNN hidden states are of size 620. As the source and context annotations are the concatenation of the forward and backward encoder hidden states, their dimensionalities are 2000 and 1240, respectively. The models are regularized with 50% Dropout (Pham et al., 2014) applied to all RNN inputs and to the decoder hidden layer preceding the softmax.
Pronouns are predicted using a modified beam search where the beam is expanded only at the "REPLACE" placeholders, and is otherwise constrained to the reference. The beam size is set to the number of pronoun classes, so that our approach is equivalent to exhaustive search for sentences with a single placeholder. The models for which beam search led to the highest validation macro-average recall were selected and submitted for the shared task. The baselines were also sent as contrastive submissions. Tables 1 and 2 present the validation and test results, respectively, across all language pairs for the models described in Sections 2 and 3. Amongst the four models we evaluated on the test sets, a different one performs best for each language pair. Nevertheless, the DGCM model is the most consistent, always ranking first or second amongst our systems. Moreover, it beats the baseline on all tasks except Es-En, which it trails by a marginal 0.2%.
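The constrained beam search above can be sketched as follows. The scoring function `log_prob_step` stands in for the trained NMT model and is an assumption of this sketch; only the control flow (expand at placeholders, force the reference elsewhere) follows the procedure described.

```python
def predict_pronouns(ref_tokens, candidates, log_prob_step):
    """Constrained beam search: the beam is expanded only at REPLACE
    placeholders; everywhere else each hypothesis is forced to follow
    the reference translation. `log_prob_step(prefix, token)` is
    assumed to return the model's log-probability of `token` given
    the decoded prefix."""
    beams = [([], 0.0)]  # (tokens so far, cumulative log-probability)
    for tok in ref_tokens:
        if tok == "REPLACE":
            # Expand with every candidate pronoun class
            expanded = [(pre + [c], lp + log_prob_step(pre, c))
                        for pre, lp in beams for c in candidates]
            # Beam size = number of classes, so a sentence with a
            # single placeholder is searched exhaustively.
            expanded.sort(key=lambda b: b[1], reverse=True)
            beams = expanded[:len(candidates)]
        else:
            # Constrained to the reference token
            beams = [(pre + [tok], lp + log_prob_step(pre, tok))
                     for pre, lp in beams]
    return max(beams, key=lambda b: b[1])[0]

def toy_log_prob(prefix, token):
    # Toy scorer standing in for the NMT model: prefers "elles".
    return {"elles": -0.1, "ils": -1.0}.get(token, 0.0)

pred = predict_pronouns(["REPLACE", "vous", "donner"], ["elles", "ils"],
                        toy_log_prob)
```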

Results
Our models, which do not leverage the provided part-of-speech tags and external alignments, are generally competitive with the best submissions (Loáiciga et al., 2017). For Es-En, our contrastive submission achieves the best performance. For En-Fr and De-En, our systems obtain a macro-average recall within 5% of the winners. Finally, the relatively poor performance of our models on En-De is due to their inability to correctly predict the rare pronoun "er": the recall of 0/8 for that class greatly affects the macro-average.

Conclusion
In this paper, we have presented our systems for the DiscoMT 2017 cross-lingual pronoun prediction shared task. We have explored various ways of incorporating discourse context into neural machine translation. Although the DGCM model often outperforms the baseline by taking the previous sentence into account, we believe there is still substantial room for improvement. To improve further, we may need to better understand the impact of context by carefully analyzing the behaviour of our models.