Cross-lingual Pronoun Prediction for English, French and German with Maximum Entropy Classification

We present our submission to the cross-lingual pronoun prediction (CLPP) shared task for English-German and English-French at the First Conference on Machine Translation (WMT16). We trained a Maximum Entropy (MaxEnt) classifier based on features from Wetzel et al. (2015) that we adapted to the new task and applied to a new language pair. Additional features such as n-grams of the pronoun context and prediction of NULL-translations proved helpful to a varying degree. Experiments with a sequence classifier over pronoun sequences did not show any improvements. Our submission is among the top three systems for English-French (61.62% macro-averaged recall) and in the middle range for English-German (48.72%) out of nine submissions.


Introduction
Translation of pronouns is a non-trivial task due to ambiguities in the source language (event pronouns, referential and non-referential uses) and due to diverging usage of pronouns between two languages (e.g. morphological differences including gender and number, pro-drop languages, preference of passive construction with expletive it). In the recent past there has been work on analysing these differences and various approaches to tackle the problem exist (Hardmeier and Federico, 2010; Le Nagard and Koehn, 2010; Guillou, 2012; Weiner, 2014; Guillou et al., 2014) including the submissions to the CLPP shared task (Hardmeier et al., 2015).
This shared task is organized again this year (Guillou et al., 2016). In addition to the English-French language pair, it introduces data sets for English-German, as well as the inverse translation directions from French and German into English. The task is to predict a target-side pronoun from a closed set of classes for each subject-position 3rd person pronoun in the source language.
One of the major differences from last year's shared task is the target-side data. It comes in the form of lemmatized tokens with their Part-of-Speech (POS) tags, instead of the full word forms. This makes the task more challenging, since agreement features of words surrounding a pronoun are no longer available. For example, all determiners are mapped to one generic form irrespective of their gender or number. One can also argue that it makes the task more realistic when considering Statistical Machine Translation (SMT) as the driving goal. SMT systems do not necessarily produce the correct target-side surface word forms, and approaches to pronoun translation should not rely on error-free translations of the relevant context. This change therefore helps with handling noisier or underspecified input.
In this paper we focus on learning to predict translations of pronouns from English into French and German. The set of source pronouns (i.e. it and they) is the same for both language pairs. For French, the closed target classes are: ce, elle, elles, il, ils, cela, on, OTHER and for German they are: er, sie, es, man, OTHER.
We use a MaxEnt classification model to learn pronoun predictions. This work is based on findings in (Wetzel et al., 2015). We incorporate source- and target-side bag-of-words context window features based on tokens and POS tags, a target-side pronoun antecedent feature and a target-side Language Model (LM) feature. Furthermore, we focus on predicting cases where the source pronoun does not have a corresponding translation and is therefore aligned to a special NONE token. We conduct additional experiments in an attempt to exploit the sequential character of coreference chains that contain pronouns by using linear-chain Conditional Random Fields (CRFs).

Related Work
The CLPP shared task from last year (Hardmeier et al., 2015) had eight contributions and a very strong baseline. The official macro-averaged F1 metric ranked the baseline highest; in terms of accuracy, however, a few of the submissions managed to perform better. Tiedemann (2015) explores models for CLPP with the focus on using only simple features. The major simplification is that no coreference resolution is performed. Experiments on using a sequence model for classification are reported, which makes predictions based on previous classification choices. However, only a degradation of performance was observed. One possible reason is that not every preceding classification choice corresponds to a mention of the same entity, and a preceding choice should only influence the current one if it does. This distinction was not captured by Tiedemann (2015). We also explore the usefulness of a sequence classifier; however, our sequences are more informed in that they follow automatically resolved source-side coreference chains.
Pham and van der Plas (2015) train a Multi-Layer Perceptron. Features consist of word embeddings of local context words, averaged word vectors of target-side antecedents of a pronoun obtained via automatic coreference chains from the source projected to the target side via word alignments, and additional vectors containing morphological information. They use a subset of the types of our features; however, integration is via word embeddings and training is based on Neural Networks. They could not find any improvements when including target-side antecedents via source-side coreference chains.

Features
In this section we motivate and describe the types of features we extract for learning the MaxEnt classifier and the CRF models. For a more detailed description of the features from last year, please refer to (Wetzel et al., 2015).

Context window
For each training instance, i.e. for each source pronoun for which we want a prediction, we extract a bag of words consisting of the ±3 tokens around the source pronoun. Additionally, we extract the tokens in the ±3 context window of the aligned target pronoun. The source-side feature consists of tokens in their full form, whereas the target-side feature uses the lemmatized tokens from the training data.
Additionally, we extract POS tags for these tokens. For the source side we automatically obtain POS tags with StanfordCoreNLP (Lee et al., 2013). For the target side the POS tags are provided as part of the training and test data.
A common strategy to improve linear classifiers is to include combinations of features so that the classifier can tune additional weights if predictive n-gram combinations provide useful information. Therefore, we experiment with combining the above context window features within each type. In addition to the unigram values, we extract n-gram values by concatenating adjacent tokens or POS tags.
All of the above features are extracted both from the source and the target side.
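The context window and n-gram extraction described above can be illustrated with a minimal Python sketch. The marker strings, the feature-name prefix, and the decision to build n-grams separately over the left and right context are assumptions made for this illustration and are not specified in the text.

```python
from typing import List, Set, Tuple

BOS, EOS = "<s>", "</s>"  # assumed sentence boundary marker strings


def context_window(tokens: List[str], index: int, size: int = 3) -> Tuple[List[str], List[str]]:
    """Return up to `size` tokens to the left and right of the pronoun at `index`.

    One boundary marker is added on each side, so a window that reaches
    the sentence edge contains the marker.
    """
    padded = [BOS] + tokens + [EOS]
    centre = index + 1  # position of the pronoun after padding
    left = padded[max(0, centre - size):centre]
    right = padded[centre + 1:centre + 1 + size]
    return left, right


def ngrams(items: List[str], max_n: int = 3) -> List[str]:
    """Concatenate adjacent items into n-grams for n = 1..max_n."""
    feats = []
    for n in range(1, max_n + 1):
        for i in range(len(items) - n + 1):
            feats.append("_".join(items[i:i + n]))
    return feats


def window_features(tokens: List[str], index: int, prefix: str,
                    size: int = 3, max_n: int = 3) -> Set[str]:
    """Bag of n-gram features; `prefix` (hypothetical) names side and type."""
    left, right = context_window(tokens, index, size)
    feats: Set[str] = set()
    for side in (left, right):
        feats.update(f"{prefix}:{gram}" for gram in ngrams(side, max_n))
    return feats
```

The same functions would be called once per feature type, e.g. with source tokens, source POS tags, target lemmas and target POS tags, each under its own prefix.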

Pleonastic pronouns
Pleonastic pronouns are non-referential pronouns, i.e. they do not have an antecedent in the discourse. They behave differently from referential pronouns, e.g. grammatical agreement requirements do not exist. We use Nada (Bergsma and Yarowsky, 2011) to estimate whether a particular pronoun is pleonastic and integrate this estimate directly as a feature value into our classifier.
Furthermore, the Stanford deterministic coreference system (Lee et al., 2013), which we use in the feature described in Section 3.4, only has a very basic rule-based detection mechanism for pleonastic pronouns. Intuitively, Nada's estimates should therefore counterbalance erroneous handling in coreference resolution.
This feature is only applied on the source side.

Language Model prediction
LMs provide a probability of a sequence of words trained on large monolingual corpora and are used in SMT as a model to encourage fluency, i.e. producing typical target-language sentences. Wetzel et al. (2015) incorporated an LM feature based on the preceding 5-gram context of a target pronoun, by utilising the conditional probability $P(\mathit{classLabel}_5 \mid w_1, w_2, w_3, w_4)$, where $\mathit{classLabel}$ is one of the class labels from the closed set of target classes, or the OTHER class, and $w_i$ are the preceding words. This ignored any information following the pronoun, which could as well be indicative of the correct prediction. Therefore, we expand the feature to provide a rating for the entire sentence, i.e. $P(\langle s \rangle, w_1, \ldots, \mathit{classLabel}, \ldots, w_n, \langle /s \rangle)$, where $n$ is the sentence length, and $\langle s \rangle$ and $\langle /s \rangle$ are sentence boundary markers. The class label that produces the highest scoring sentence according to the LM is then used as a feature value in our classifier. To obtain such a prediction for the class labels that correspond to pronouns we can directly substitute the target-side pronoun placeholder with each class label when querying the LM.
The OTHER class requires special treatment, since it does not occur as such in the LM training data. We approximate the probability for this class in the same way as described in (Wetzel et al., 2015). We first collect frequencies of words that are tagged as OTHER from the training data. Then we query the LM with the top-n words as substitute for the placeholder. The highest scoring word within that group then competes as representative for OTHER against the probabilities of the rest of the class labels.
This feature is only applied on the target side.
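The substitution procedure for the LM feature, including the proxy for OTHER, can be sketched as follows. The `score` callable is an assumption: it stands for any function that returns a log-probability for a whole sentence including boundary markers, for instance a thin wrapper around a KenLM model. The function and parameter names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def lm_prediction(tokens: List[str], position: int,
                  score: Callable[[str], float],
                  pronoun_labels: Sequence[str],
                  other_words: Sequence[str]) -> str:
    """Return the class label whose substitution yields the best-scoring sentence.

    tokens         lemmatized target sentence with a placeholder at `position`
    pronoun_labels class labels that are themselves pronouns (e.g. il, elle)
    other_words    the top-n most frequent words tagged OTHER in the training data
    """
    def sentence_score(word: str) -> float:
        filled = tokens[:position] + [word] + tokens[position + 1:]
        return score(" ".join(filled))

    candidates: Dict[str, float] = {label: sentence_score(label) for label in pronoun_labels}
    if other_words:
        # The best-scoring proxy word represents the whole OTHER class.
        candidates["OTHER"] = max(sentence_score(word) for word in other_words)
    return max(candidates, key=candidates.get)
```

The returned label is used as a single categorical feature value; the scores themselves are not passed on in this sketch.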

Antecedent information
The antecedent feature proved useful in (Wetzel et al., 2015). Intuitively, if we know the closest target-side antecedent of a referential target-side pronoun, we have access to additional information such as grammatical gender and number. Both in German and French, the pronoun has to agree in gender and number with its antecedent. Furthermore, the fact whether we find an antecedent at all should be useful information as well, since it separates referential from non-referential cases. We perform antecedent detection with the help of source-side coreference chains. We follow the source-side chain that contains the source pronoun of interest in reverse order (i.e. towards the beginning of the document) and check if the token that is aligned to the source-side mention head is a noun. If it is not, the search proceeds. The reason why we do not just search for the closest noun-antecedent on the source side and then take its projection is that nouns do not necessarily have to align to nouns, but could be aligned to NULL, pronouns, etc. We take the closest noun that we can find on the target side.
Since the target side only contains lemma information, where all gender- or number-specific information has been removed from nouns (or merged to the same token for e.g. determiners), we cannot apply a morphological tagger to give us this information. Therefore, we resort to a simpler method and look up the most frequent gender for a given lemma in a lexicon. We only experiment with this feature on the English-German task.
All of the above features are extracted from the target side (with the help of source-side annotation).
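The backward search along the source-side chain can be illustrated with a short Python sketch. The data structures (positions as sentence/token index pairs, an alignment dictionary, a dictionary of target lemma/POS pairs) and the tag name `NOUN` are assumptions chosen for the example, not the representation used in the actual system.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

Position = Tuple[int, int]  # (sentence index, token index)


def find_target_antecedent(chain_heads: List[Position],
                           pronoun: Position,
                           alignment: Dict[Position, List[Position]],
                           target_tokens: Dict[Position, Tuple[str, str]],
                           noun_tags: FrozenSet[str] = frozenset({"NOUN"})) -> Optional[str]:
    """Return the lemma of the closest target-side noun antecedent, or None.

    chain_heads   source-side mention heads of the chain containing the pronoun
    alignment     source position -> aligned target positions (empty if unaligned)
    target_tokens target position -> (lemma, POS tag)
    """
    preceding = sorted(head for head in chain_heads if head < pronoun)
    # Walk towards the beginning of the document, closest mention first.
    for head in reversed(preceding):
        for target in alignment.get(head, []):
            lemma, pos = target_tokens[target]
            if pos in noun_tags:
                return lemma
        # Not aligned to a noun (e.g. NULL or a pronoun): the search proceeds.
    return None
```

A return value of `None` can itself serve as a feature value, reflecting the observation that the absence of an antecedent separates non-referential from referential cases.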

Predicting NONE
Source pronouns do not necessarily have a counterpart in the target language. These cases are recorded in the training data with NONE labels and occur very frequently (cf. Table 1). However, they are not part of the official set of class labels and are mapped to the OTHER class for training and testing. If we know that a source pronoun does not have a translation, then this might be useful in an SMT scenario, where a feature function could assign higher scores to phrases that do not contain target-side pronouns. For CLPP our expectation is that it should help to improve prediction performance for the very heterogeneous OTHER class.
For training the classifiers we therefore first map all NONE cases from OTHER to NONE, train with the above features and map the final predictions back to OTHER before evaluation.
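The two label mappings amount to the following small sketch; the function names and the boolean flag indicating an alignment to NONE are hypothetical.

```python
def training_label(official_label: str, aligned_to_none: bool) -> str:
    """Split NONE out of OTHER before training."""
    if official_label == "OTHER" and aligned_to_none:
        return "NONE"
    return official_label


def evaluation_label(predicted_label: str) -> str:
    """Fold NONE predictions back into OTHER before scoring."""
    return "OTHER" if predicted_label == "NONE" else predicted_label
```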

Pronoun prediction in a sequence
The MaxEnt classifier makes the assumption that the translation of the pronoun is only dependent on the source and target contexts and the antecedent it refers to (for referential pronouns). This ignores the fact that pronouns are part of a longer chain of co-referring expressions, among them other pronouns. Therefore, we first prepare the training and test data such that all pronoun instances that belong to the same coreference chain form one training or testing sequence. We then train a linear-chain CRF with the same features as given above instead of a MaxEnt classifier to predict an optimal sequence of target pronouns, rather than making each prediction independently of the other pronouns. This way, typical patterns of pronoun sequences can be learnt, which might help with the prediction. Table 2 gives the distribution of sequence lengths.
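The data preparation step for the sequence model can be sketched as below. The `PronounInstance` record and its fields are hypothetical, and the treatment of pronouns that are not part of any chain (here: sequences of length one) is an assumption, since the text does not state how such cases are handled.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Optional, Tuple


class PronounInstance(NamedTuple):
    doc_id: str
    chain_id: Optional[int]  # None if the pronoun is in no coreference chain
    position: int            # token offset within the document
    label: str


def build_sequences(instances: Iterable[PronounInstance]) -> List[List[PronounInstance]]:
    """Group pronoun instances of the same chain into one sequence, in document order."""
    chains: Dict[Tuple[str, int], List[PronounInstance]] = defaultdict(list)
    sequences: List[List[PronounInstance]] = []
    for inst in instances:
        if inst.chain_id is None:
            sequences.append([inst])  # assumption: unresolved pronouns stand alone
        else:
            chains[(inst.doc_id, inst.chain_id)].append(inst)
    for members in chains.values():
        sequences.append(sorted(members, key=lambda inst: inst.position))
    return sequences
```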

Experiments
We first describe the experimental setup of our systems, then briefly describe the data we used and provide information about feature and parameter settings. Finally, we report our results on development and test data.

Systems
We use Mallet (McCallum, 2002) for training the MaxEnt classifiers and CRF models. For the MaxEnt classifier we use the default settings. For the CRF we train three-quarter order models (i.e. one weight for each feature, label pair, and one for each current label, previous label pair) and only allow label transitions that have been observed in the training data.
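The restriction to observed label transitions can be illustrated independently of the toolkit. The sketch below is plain Python and does not use or mirror Mallet's API; it only shows which (previous label, current label) pairs would be permitted given a set of training sequences.

```python
from typing import Iterable, Sequence, Set, Tuple


def observed_transitions(label_sequences: Iterable[Sequence[str]]) -> Set[Tuple[str, str]]:
    """Collect every (previous label, current label) pair seen in the training sequences."""
    allowed: Set[Tuple[str, str]] = set()
    for labels in label_sequences:
        allowed.update(zip(labels, labels[1:]))
    return allowed
```

Any transition outside the returned set would receive no state connection in the model and can therefore never be predicted.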
In all experiments, we have two setups. In the POSTCOMBINED setup, we split the training and test data for each source pronoun into separate sets, train separate classifiers and combine the predictions after classification. In the ALLINONE setup, we do not split the data.
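The difference between the two setups can be made concrete with a small sketch. The instance layout (source pronoun, features, label) and the `fit` callable, which takes training instances and returns a prediction function, are assumptions standing in for the actual classifier training.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Tuple

Instance = Tuple[str, Any, Any]            # (source pronoun, features, label)
Predictor = Callable[[Any], str]           # features -> predicted label
Fit = Callable[[List[Instance]], Predictor]


def all_in_one(train: List[Instance], test: List[Instance], fit: Fit) -> List[str]:
    """ALLINONE: one classifier trained on all source pronouns."""
    model = fit(train)
    return [model(features) for _, features, _ in test]


def post_combined(train: List[Instance], test: List[Instance], fit: Fit) -> List[str]:
    """POSTCOMBINED: one classifier per source pronoun, predictions merged in test order."""
    by_pronoun: Dict[str, List[Instance]] = defaultdict(list)
    for inst in train:
        by_pronoun[inst[0]].append(inst)
    models = {pronoun: fit(items) for pronoun, items in by_pronoun.items()}
    # Assumes every source pronoun in the test data also occurs in training.
    return [models[pronoun](features) for pronoun, features, _ in test]
```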
The systems marked with initial consist of the context window features, the pleonastic pronoun feature, the LM feature and the antecedent information (without gender information). We use fGender to refer to the gender feature, 3-gram window to refer to the n-grams from the context window and fNone to refer to the NONE-prediction feature. Systems marked with sequence are the CRF models. We submit the best performing system according to the official macro-averaged recall measure on the development set for each language pair as primary test set submission.
The official BASELINE uses LM predictions similarly to our LM feature. Additionally, it attempts to find the optimal predictions for a sentence, if there are multiple pronouns that have to be predicted. It has a NULL penalty parameter that determines the influence of not predicting a pronoun at all. For a more detailed description, please refer to the shared task paper (Guillou et al., 2016).
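For reference, macro-averaged recall is the unweighted mean of the per-class recall values, so rare classes count as much as frequent ones. The sketch below is an illustration and not the official scorer; in particular, skipping classes that have no gold instances is an assumption.

```python
from typing import Optional, Sequence


def macro_averaged_recall(gold: Sequence[str], predicted: Sequence[str],
                          classes: Optional[Sequence[str]] = None) -> float:
    """Unweighted mean of per-class recall over the given classes."""
    if classes is None:
        classes = sorted(set(gold))
    recalls = []
    for cls in classes:
        total = sum(1 for g in gold if g == cls)
        if total == 0:
            continue  # assumption: classes absent from the gold data are skipped
        correct = sum(1 for g, p in zip(gold, predicted) if g == cls and p == cls)
        recalls.append(correct / total)
    return sum(recalls) / len(recalls) if recalls else 0.0
```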

Data
For training, we only extract information from the IWSLT15 and NewsCommentary (NC9) corpora. We do not employ the provided Europarl corpus, as it does not come with predefined document boundaries other than parliamentary sessions of a complete day. For development, we use the TEDdev set. For the final submission on the official test set we include TEDdev in the training data.

Features and parameters
For the LM feature, we take the provided trained models from the shared task, which are 5-gram modified Kneser-Ney LMs that work on lemmatized text. We use KenLM (Heafield, 2011) for obtaining probabilities. As proxy for the OTHER class we use the top 35 words for German, and the top 70 for French.
Table 4: System performance in percent for English-German on the development data set.
For gender detection of German antecedents we use the lexicon from Zmorge (Sennrich and Kunz, 2014). Gender distribution of nouns is given in Table 3. When a noun has multiple genders in the lexicon, we take the most frequent one for that noun.
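The most-frequent-gender lookup can be sketched as follows. The input format, a stream of (lemma, gender) pairs with one pair per lexicon analysis, is hypothetical and does not reflect the actual Zmorge file format.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def build_gender_lookup(entries: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each noun lemma to its most frequent gender among the lexicon entries."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for lemma, gender in entries:
        counts[lemma][gender] += 1
    # On ties, Counter.most_common keeps the gender that was seen first.
    return {lemma: genders.most_common(1)[0][0] for lemma, genders in counts.items()}
```

A lemma missing from the resulting dictionary would simply yield no gender feature for that antecedent.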
The different parameters such as context window size were taken from our findings of the previous year (Wetzel et al., 2015). The n-grams of the context window are extracted for n = 1, ..., 3, including beginning- and end-of-sentence markers if necessary.

Results
The results on the development set are given in Table 4 for English-German and in Table 5 for English-French. The final results including the ranks on the official test set of the shared task are given in Table 6.
The initial systems in each language-pair perform much better than the baseline, which is especially noticeable in English-French. Adding the gender feature to the English-German classifier shows some good improvements in performance, thereby confirming the usefulness of adding gender information.
The additional feature that predicts NONE as possible translation is helpful for the English-French pair. Results on English-German showed a decrease in performance with respect to macro-averaged recall. This decrease is surprising, especially considering the much larger frequency of NONE in the German data set (cf. Table 1).
Performance between development and test sets varies greatly despite similar class label distributions (except for a much smaller number of OTHER instances in the English-French test set). To a certain degree this is expected; however, the big changes in performance suggest that there are other differences in the data sets which are worth exploring.
Training a MaxEnt classifier where we substitute our LM feature with predictions from the shared task baseline performed slightly worse. This suggests that a simpler LM feature is sufficient when included in the classifier, and that joint prediction of multiple target pronouns within one sentence is not necessary. However, we did not tune the NULL penalty of the baseline model.
The confusion matrix for English-German in Table 7 (top-left) shows that OTHER is overpredicted, which might explain the overall lower performance of the system compared to other participants. Furthermore, es and sie are confused by our classifier. For English-French in Table 7 (bottom-left) one can observe that the biggest confusion is between the genders of plural pronouns (i.e. elles and ils). This might be because we did not include any explicit gender information as a feature. As above, the OTHER class is also frequently confused with all other classes.
Similarly to our findings from last year, the POSTCOMBINED setup scored consistently worse on the test sets (and only once slightly better on the development set). This provides evidence that splitting the training data according to source pronouns is counterproductive. Furthermore, it might even be worse for the inverse prediction tasks, since there are a lot more source pronouns, hence making the available data even sparser.
The lemmatization of the French data merges singular and plural forms of il into one lemma, similarly for elle. The baseline which uses the LM trained on the lemmatized data is therefore never able to predict the plural forms of these two pronouns, resulting in zero precision and recall. This is confirmed by the corresponding confusion matrix. This might also have an indirect impact on the performance of our classifiers, since they use LM prediction as a feature.
Feature ablation experiments shown in Table 8 revealed that the antecedent feature is helpful for English-German, but not for English-French. One possible explanation for this might be that we do not have gender information of the antecedent in French and only adding the antecedent itself might not be sufficient.
Additional ablation experiments showed that the LM feature in fact hurts performance. Removing this feature gives a boost in performance, which brings our systems to the second place (first for accuracy) for English-German and to the third place (second for accuracy) for English-French. This contradicts findings from experiments we conducted for last year's shared task, where adding baseline predictions, which are very similar to our LM feature, greatly improved results. An explanation for this behaviour could be that the LM this year was trained on lemmatized text and therefore performs much worse than when trained on original data. Confusion matrices for these results are given in Table 7 (numbers to the right). For both language pairs we are now underpredicting OTHER, but gain accuracy on the classes representing pronouns.

Conclusion
We experimented with MaxEnt classifiers for CLPP applied to English-German and English-French. Some of the features are only useful for one of the two language pairs. Adding LM predictions considerably worsened performance, which is contrary to experiments performed on last year's shared task. Modelling pronoun sequences with CRFs did not prove useful at all.
The large variation in performance between development and test sets puts any findings of the shared task into perspective, and its cause should be investigated further.