A Maximum Entropy Classifier for Cross-Lingual Pronoun Prediction

We present a maximum entropy classifier for cross-lingual pronoun prediction. The features are based on local source- and target-side contexts and antecedent information obtained by a co-reference resolution system. With only a small set of feature types our best performing system achieves an accuracy of 72.31%. With a macro-averaged F1-score of 57.07%, the shared task's official metric, we rank third out of 14 systems. Feature ablation results show the important role of target-side information in general and of the resolved target-side antecedent in particular for predicting the correct classes.


Introduction
In this paper we focus on pronouns, which pose a problem for machine translation (MT). Pronoun translation is challenging because pronouns often refer to entities mentioned in a non-local context such as previous clauses or sentences. Furthermore, languages differ with respect to their usage of pronouns, e.g. how they agree with their antecedent or whether source and target language exhibit similar patterns of pronoun usage. Since pronouns contribute an important part to the meaning of an utterance, the meaning can change considerably when they are wrongly resolved and translated.
This problem has recently gained interest: work has been presented on annotating and analysing translations of pronouns in parallel corpora (Guillou et al., 2014), and MT systems focusing on the translation of pronouns have been proposed (Hardmeier and Federico, 2010; Le Nagard and Koehn, 2010; Guillou, 2012; Hardmeier et al., 2014).
The DiscoMT 2015 shared task on pronoun translation (Hardmeier et al., 2015) calls for contributions to tackle this problem. We focus on the cross-lingual pronoun prediction subtask, which is set up as follows: the two English (source language) third-person subject pronouns it and they can be translated in a variety of ways into French. A common set of nine classes is defined as possible translations: the eight pronouns ce, cela, elle, elles, il, ils, on and ça, plus an extra class OTHER, which groups together any less frequent translations, including null, noun translations and alignment errors. The source and target corpora both consist of human-created documents and therefore abstract away from additional difficulties that arise with noisy automatic translations.
Hardmeier et al. (2013) propose a neural-network-based approach for a similar cross-lingual pronoun prediction task. Their model jointly models anaphora resolution and pronoun prediction. Our approach builds on a maximum entropy (MaxEnt) classifier that incorporates various features based on the source pronoun and local source- and target-side contexts. Moreover, the target-side noun referent (i.e. the antecedent) of a pronoun is used; it is obtained with an automatic co-reference resolution system. Our system achieves high accuracy and performs third-best according to the official evaluation metric.
In Section 2 we present our MaxEnt classifier including a description of the features used. This is followed by Section 3 with experiments and evaluation. Furthermore, in Section 4 we discuss the results and in Section 5 we give concluding remarks.
2 Systems for Cross-Lingual Pronoun Prediction

Maximum Entropy Classification
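A MaxEnt classifier can model multinomial dependent variables (discrete class labels) given a set of independent variables (i.e. observations). In its standard log-linear form, the conditional probability of a class label y given an observation x is
\[ p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{j} w_j f_j(x, y) \Big) \]
where the f_j are feature functions, the w_j are their weights, and Z(x) is a normalizing factor ensuring valid probabilities.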
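As a minimal sketch of the role of Z(x), assume the standard log-linear form of a MaxEnt model, in which each class receives a weighted sum of the active feature values and the exponentiated scores are normalized over all classes. The weights and feature names below are made up for illustration and are not taken from the trained system.

```python
import math

def maxent_probs(weights, features):
    """Return p(y | x) for every class y under a log-linear model.

    weights:  {class_label: {feature_name: weight}}  (hypothetical values)
    features: {feature_name: value} extracted from one observation x
    """
    # Unnormalized log-score of each class: sum_j w_j * f_j(x, y)
    scores = {
        label: sum(w.get(name, 0.0) * value for name, value in features.items())
        for label, w in weights.items()
    }
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores.values())
    exp_scores = {label: math.exp(s - top) for label, s in scores.items()}
    z = sum(exp_scores.values())  # Z(x), up to the constant factor exp(top)
    return {label: e / z for label, e in exp_scores.items()}

# Toy weights for three of the nine classes (illustrative only).
weights = {
    "elles": {"ante_gender=f": 1.2, "ante_number=pl": 0.9},
    "ils": {"ante_gender=m": 1.1, "ante_number=pl": 0.9},
    "OTHER": {},
}
probs = maxent_probs(weights, {"ante_gender=f": 1.0, "ante_number=pl": 1.0})
```

With these toy weights, a feminine plural antecedent gives elles the highest probability, and the probabilities over the three classes sum to one.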

Features
Local Context The local context around the source pronoun and target pronoun can contain the antecedent (cf. Figure 1) or other information, such as the inflection of a verb, which can provide evidence for the gender or number of the target-side pronoun. Therefore, we include the tokens that are within a symmetric window of size 3 around the pronoun. We integrate this information as bag-of-words, but separate the feature space by source- and target-side vocabulary and by whether the word occurs before or after the pronoun. Special BOS and EOS markers are included for contexts at the beginning or end of a sentence, respectively. We neither remove stopwords nor normalize the tokens.
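The window extraction described above could be sketched as follows. The feature-name format (`src_before=...`) and the use of a single BOS/EOS marker per sentence are assumptions made for illustration; the original feature encoding is not specified.

```python
def context_features(tokens, index, side, window=3):
    """Bag-of-words features for the tokens around tokens[index].

    side is "src" or "tgt"; features are kept apart by side and by whether
    the word precedes or follows the pronoun. Tokens are used as-is
    (no stopword removal, no normalization).
    """
    padded = ["BOS"] + list(tokens) + ["EOS"]
    i = index + 1  # position of the pronoun in the padded sequence
    before = padded[max(0, i - window):i]
    after = padded[i + 1:i + 1 + window]
    feats = {f"{side}_before={tok}" for tok in before}
    feats |= {f"{side}_after={tok}" for tok in after}
    return feats

# Source pronoun "It" at position 0 of a four-token sentence.
feats = context_features(["It", "is", "raining", "."], 0, "src")
```

Because the pronoun is sentence-initial here, the only preceding-context feature is the BOS marker, while the three following tokens fill the right-hand window.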
We also include as features the Part-of-Speech (POS) tags in a 3-word window to each side of the source and target pronouns. This gives some abstraction from the lexical surface form. For the source side we use the POS tags from Stanford CoreNLP (Manning et al., 2014) mapped to universal POS tags (Petrov et al., 2012). For the target side we use coarse-grained tags provided by Morfette (Chrupała et al., 2008).
Language Model Prediction We include a target-side Language Model (LM) prediction as a feature for the classifier. A 5-gram LM is queried by providing the preceding four context words followed by one of the eight target-side pronouns that the class labels represent. The pronoun with the highest prediction probability is the feature that we include in the training data. The ninth class OTHER requires special treatment, since it represents all other tokens that were observed in the aligned data and thus does not itself appear in the LM training data. To get an accurate prediction probability for this aggregate class one would have to iterate over the entire vocabulary V (excluding the other eight pronouns) and find the most likely token. Since this would require a huge number of LM queries (|V| × number of training instances), we approximate this search by taking the 40 most frequent tokens that are observed in the training data in positions labelled as OTHER. The highest prediction probability among these tokens then competes with the probabilities of the other explicit classes. Once the most likely prediction is determined, we include the predicted class label as a feature.
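The competition between the eight explicit pronouns and the approximated OTHER class can be sketched as below. The `score` argument is a hypothetical stand-in for a 5-gram LM query that returns a log-probability for a token sequence; it is not the API of any particular LM toolkit, and the toy scores are invented.

```python
PRONOUNS = ["ce", "cela", "elle", "elles", "il", "ils", "on", "ça"]

def lm_prediction(context, score, other_candidates):
    """Return the class label whose filler the LM scores highest.

    context:          list with the four preceding target-side tokens
    score:            callable(list_of_tokens) -> log-probability (LM stand-in)
    other_candidates: frequent tokens seen in OTHER positions (40 in the paper)
    """
    best_label, best_score = None, float("-inf")
    for pronoun in PRONOUNS:
        s = score(context + [pronoun])
        if s > best_score:
            best_label, best_score = pronoun, s
    # OTHER competes with the score of its most likely candidate token.
    for token in other_candidates:
        s = score(context + [token])
        if s > best_score:
            best_label, best_score = "OTHER", s
    return best_label

# Toy scorer that only looks at the final token (illustrative values).
toy = {"il": -1.0, "elles": -2.5, "les": -0.5}

def toy_score(tokens):
    return toy.get(tokens[-1], -10.0)

context = ["<s>", "je", "pense", "que"]
label_with_les = lm_prediction(context, toy_score, ["les", "le"])
label_without_les = lm_prediction(context, toy_score, ["le"])
```

In the first call the candidate les outscores every pronoun, so the feature value is OTHER; in the second call no candidate does, so the best pronoun il is returned.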
Target-side Antecedent The target-side noun antecedent of the pronoun determines the morphological features the pronoun has to agree with, i.e. number and gender. We use the source-side co-reference resolution system provided by Stanford CoreNLP (Lee et al., 2013) to determine the co-reference chains in each document of the training data. We then project these chains to the target side via word alignments (cf. Figure 2). The motivation for obtaining target-side co-reference chains in this way is threefold. First, the target side of the training data is missing most of the target-side pronouns, since the task is to predict them. Therefore, relevant parts of co-reference chains are missing and the placeholders for these pronouns would introduce noise into the resolution system. Secondly, we have a statistical machine translation (SMT) scenario in mind as an application for cross-lingual pronoun prediction. Applying a co-reference system to the SMT output of already translated parts of the document would subject the system to much noisier data than it was originally developed for. Thirdly, resources and tools for automatic co-reference resolution are more easily available for English than for French.
Given the target-side co-reference chains in a document, we consider the chain the target-side pronoun is assigned to and greedily search for the closest noun token in the chain in the preceding context. This mention is included in the training data for the classifier as a lexical feature. In addition, we extract morphological features from the noun (i.e. number and gender) by automatically analyzing the target-side sentences with Morfette.² In cases where the pronoun is not assigned to a co-reference chain, a special indicator feature is used. In addition, the word alignment can align one source token to multiple target tokens. We search for the first noun in the aligned tokens and consider this to be the representative head antecedent of the given pronoun. If no noun can be found with this method, we resort to taking the best representative antecedent of the source chain as determined by the Stanford co-reference system and take the aligned token as the relevant target-side antecedent. In this case null alignments are also possible and a special indicator feature is used for that.
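The greedy search for the closest preceding noun in a projected chain might look like the following sketch. It is simplified: the fallback to the source chain's representative mention is omitted and replaced by an indicator, and the tag "N" for nouns, the indicator names and the document-level token indexing are all assumptions for illustration.

```python
def closest_noun_antecedent(chain_positions, pos_tags, pronoun_index):
    """Return the index of the nearest preceding noun in the pronoun's chain.

    chain_positions: target-side token indices projected from the source chain
    pos_tags:        coarse POS tag per target token ("N" marks nouns here)
    Returns None when the chain holds no noun before the pronoun.
    """
    candidates = [
        p for p in chain_positions
        if p < pronoun_index and pos_tags[p] == "N"
    ]
    return max(candidates) if candidates else None

def antecedent_features(tokens, pos_tags, chain_positions, pronoun_index):
    """Lexical antecedent feature, or an indicator when none is available."""
    if not chain_positions:
        return {"ante=NO_CHAIN"}  # pronoun not assigned to any chain
    idx = closest_noun_antecedent(chain_positions, pos_tags, pronoun_index)
    if idx is None:
        return {"ante=NO_NOUN"}  # simplified; the paper falls back further
    return {f"ante={tokens[idx]}"}

tokens = ["Les", "femmes", "arrivent", ".", "REPLACE_xx", "sont", "fatiguées"]
pos_tags = ["D", "N", "V", "PUNCT", "PRO", "V", "A"]
feats = antecedent_features(tokens, pos_tags, [1, 4], 4)
```

Here the chain links the placeholder at position 4 to femmes at position 1, so the lexical feature carries that noun; gender and number features would be read off the same token.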
Pleonastic Pronouns Pleonastic pronouns are a class of pronouns that do not have a referent in the discourse, e.g. in "It is raining". Their surface form in English is indistinguishable from referential forms. Nada (Bergsma and Yarowsky, 2011) is a tool that provides confidence estimates for whether pronouns are referential.³ We include these estimates as an additional feature. This should provide information especially for the French class labels that can be used as pleonastic pronouns, e.g. "il pleut (it is raining)" or "ça fait mal (it hurts)".
² Morfette's performance is quite robust and can handle sentences that contain REPLACE_xx tokens, which are the placeholders for target-side pronouns that have to be predicted. A comparison of the performance on the original sentences and the sentences with the REPLACE_xx tokens showed only minor differences.
³ https://code.google.com/p/nada-nonref-pronoun-detector/
In addition, the rule-based detection of pleonastic pronouns in the Stanford co-reference system (Lee et al., 2013) is only basic. However, since pleonastic pronouns do not have a referent, they cannot be part of a co-reference chain. Therefore, we expect this feature to also counteract wrong decisions by the co-reference resolution system to a certain degree. Since Nada only provides estimates for it, we do not have such a feature for pleonastic uses of they, the other source pronoun of the task.

Classifier Types
We trained classifiers in two different setups. The first setup provides all our extracted features as training data to one MaxEnt classifier, including the source pronoun as an additional feature for each training instance (from now on referred to as the ALLINONE system). The second setup splits the training data into the two source pronoun cases (it and they) and trains a separate classifier for each of them (POSTCOMBINED system).
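The difference between the two setups lies only in how the training instances are arranged before training. A minimal sketch, assuming each instance is a (source pronoun, feature set, label) triple with lowercased source pronouns and a hypothetical `src_pronoun=` feature name:

```python
def allinone_instances(instances):
    """One training set; the source pronoun becomes an extra feature."""
    return [
        (feats | {f"src_pronoun={src}"}, label)
        for src, feats, label in instances
    ]

def postcombined_instances(instances):
    """Two training sets, one per source pronoun (it / they)."""
    split = {"it": [], "they": []}
    for src, feats, label in instances:
        split[src].append((feats, label))
    return split

instances = [
    ("it", {"tgt_before=BOS"}, "il"),
    ("they", {"tgt_before=et"}, "elles"),
    ("it", {"tgt_after=est"}, "ce"),
]
single = allinone_instances(instances)
per_pronoun = postcombined_instances(instances)
```

The split makes the data-size effect visible: each POSTCOMBINED classifier sees only the instances of its own source pronoun, whereas the ALLINONE classifier sees all of them.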

Data
The shared task provides three corpora that can be used for training: the Europarl7 corpus, the NewsCommentary9 corpus, and the IWSLT14 corpus, which consists of transcripts of planned speech, i.e. TED talks. Only the latter two corpora come with natural text boundaries. Since these boundaries are necessary for co-reference resolution, we did not use the Europarl corpus.
Ranks according to each metric are given in parentheses out of 14 submitted systems (including multiple submissions per submitter and the baseline).
2093 sentences in twelve TED talk documents.

Classifier
We extract features from the training and test set and use Mallet (McCallum, 2002) to train the MaxEnt classifier. The variance for regularizing the weights is set to 1 (default setting).
For the LM component of our system we use the baseline model provided for the pronoun translation subtask. This is a 5-gram modified Kneser-Ney LM trained with KenLM (Heafield, 2011).

Evaluation Metrics
The official evaluation metric for the shared task is the macro-averaged F-score over all prediction classes (Mac-F1). Since this metric favours systems that perform equally well on all classes, the task puts emphasis on handling low-frequency classes well instead of only getting the frequent classes right. In addition to scores with the official metric we also report overall accuracy (Acc), i.e. the ratio between the correctly predicted classes and all test instances.
The evaluation script of the shared task provides results for the official fine-grained class separation with nine classes. It also provides a coarse-grained separation where some of the class labels are merged. Results reflect the fine-grained distinction except where stated.
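To make the contrast between the two metrics concrete, the sketch below computes accuracy and macro-averaged F1 from gold and predicted labels. It is not the shared task's evaluation script; in particular, treating an undefined precision, recall or F1 as 0 is an assumption.

```python
def accuracy(gold, pred):
    """Fraction of instances whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred, labels):
    """Unweighted mean of the per-class F1 scores over the given labels."""
    f1_scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1_scores.append(2 * precision * recall / total if total else 0.0)
    return sum(f1_scores) / len(f1_scores)

# A system that always predicts the frequent class.
gold = ["il", "il", "il", "elle"]
pred = ["il", "il", "il", "il"]
acc = accuracy(gold, pred)
mac = macro_f1(gold, pred, ["il", "elle"])
```

In this toy case the accuracy is 0.75, but the macro-averaged F1 is only 3/7 (about 0.43), because the rare class elle contributes an F1 of 0 with the same weight as the frequent class il.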

Results on the Test Set
Table 1 shows the official results on the test set together with the respective ranks out of 14 submitted systems. Table 2 and Table 3 provide the per-class precision, recall and F1, overall accuracy, and overall macro-averaged F-score.

Discussion
Confusion Matrices Table 5 and Table 6 present confusion matrices on the test set. Divergences from strong diagonal values in both tables derive in part from gender-choice errors. In addition, the morphological number of the personal pronouns is almost perfectly predicted in all cases.
The OTHER class causes quite a few confusions, which is not surprising since it aggregates a heterogeneous set of possible source pronoun translations. We expect a more detailed distinction in this group to lead to better systems in general.
Feature Ablation In order to investigate the usefulness of the different types of features, we performed a feature ablation. When removing all features that are related to the antecedent of the target pronoun we need to predict, i.e. the antecedent itself and its number and gender, we observe a considerable drop in performance for both evaluation metrics. This is in line with our expectations, since number and gender are strong cues for most of the classes. The antecedent token itself also provides enough information to the classifier to make a positive impact on the results. When removing all features related to the target side we observe a consistent drop in performance over all sets and classifiers. This result shows the important influence the target language has on the translation of a source pronoun. Removing the source-side features does not have a strong impact on the results, which is again consistent over all settings. Both results taken together strongly indicate that the target-side features are much more important than the source-side features.
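An ablation run of this kind amounts to dropping one feature group from every instance, retraining, and re-scoring. The filtering step is sketched below; it assumes, hypothetically, that feature groups can be identified by name prefixes such as `tgt_` or `ante`.

```python
def ablate(instances, prefixes):
    """Drop every feature whose name starts with one of the given prefixes.

    instances: list of (feature_set, label) pairs
    prefixes:  iterable of feature-name prefixes identifying the group(s)
    """
    blocked = tuple(prefixes)  # str.startswith accepts a tuple of prefixes
    return [
        ({f for f in feats if not f.startswith(blocked)}, label)
        for feats, label in instances
    ]

instances = [
    ({"tgt_before=les", "src_after=are", "ante=femmes"}, "elles"),
    ({"src_before=BOS", "tgt_after=est"}, "ce"),
]
# Remove all target-side and antecedent features, keeping source-side ones.
source_only = ablate(instances, ["tgt_", "ante"])
```

Comparing a classifier trained on `source_only` with one trained on the full instances would then show how much the removed groups contribute.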

Classifier Types
The overall results show a consistent preference for the ALLINONE classifier over the POSTCOMBINED one. The difference in performance seems to be mostly influenced by the fact that splitting the training data into two separate sets for the POSTCOMBINED setting also results in much smaller data sizes for each of the individual classifiers. Our feature ablation results show that particular features are useful for the former classifier, but useless or even harmful for the latter. This instability might be due to the fact that the POSTCOMBINED classifier has to learn from much smaller data sets. Incorporating more training data from the Europarl corpus could alleviate this problem and would make it possible to determine whether these differences persist.

Language Model
The mixed results for the usefulness of the LM features call for further investigation of how to integrate the LM. Currently we base the LM predictions on the n-gram preceding the target pronoun. However, it is also conceivable for this task to query the LM with n-grams that are within a sliding window of tokens containing the target pronoun. Furthermore, there is a small mismatch between the LM, which has been trained on truecased data, and the preceding tokens we have from the shared task data, where the case was not modified. If this difference is eliminated we expect more accurate LM predictions, which should in turn provide more accurate features for the classifiers.
Additionally, our LM feature currently predicts OTHER with a fairly high frequency of around 80% (followed by il with around 15%). This might be another reason why some classifiers work better without this feature, since this distribution does not match the observed distribution of target pronouns in the training data.
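The sliding-window alternative mentioned above was not implemented in our system; as an illustration of what such queries could look like, the following sketch enumerates every n-gram of a fixed size that contains the target pronoun position. How the resulting scores would be combined is left open.

```python
def ngrams_containing(tokens, index, n=5):
    """Return all n-grams of length n that cover tokens[index].

    Windows that would run past either end of the sentence are skipped,
    so sentences shorter than n yield an empty list.
    """
    start_min = max(0, index - n + 1)
    start_max = min(index, len(tokens) - n)
    return [tokens[s:s + n] for s in range(start_min, start_max + 1)]

tokens = ["a", "b", "c", "d", "e", "f", "g"]
windows = ngrams_containing(tokens, 3, n=3)
```

For the token d at position 3 and n = 3 this yields the three windows b c d, c d e and d e f, each of which could be scored by the LM with the candidate pronoun substituted at that position.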

Conclusion
We presented a MaxEnt classifier that can determine the French translation of the English 3rd person subject pronouns with fairly high accuracy and performs among the top systems that have been submitted for this task. The classifier only uses a small set of feature types. Target-side features contribute most to the classification quality. Potentially non-local target-side antecedent features obtained via a source-side co-reference system and projected to the target via word alignments provide useful information as well.

Figure 2 :
The antecedent (co-ref) of they in the English sentence (source language) is determined with a co-reference resolution system. The target-side antecedent (aligned) is obtained by following the word alignment links. In the shared task, the target pronoun elles has to be predicted.

Table 1 :
Official performance on the test data.

Table 3 :
Performance of POSTCOMBINED classifier on the test set.

Table 4 :
Feature ablation for both types of classifiers on the test set.

Table 5 :
Confusion matrix for the ALLINONE classifier on the test set. Row labels are gold labels and column labels are labels as they were classified.

Table 6 :
Confusion matrix for the POSTCOMBINED classifier on the test set. Row labels are gold labels and column labels are labels as they were classified.