It-disambiguation and source-aware language models for cross-lingual pronoun prediction

We present our systems for the WMT 2016 shared task on cross-lingual pronoun prediction. The main contribution is a classifier used to determine whether an instance of the ambiguous English pronoun “it” functions as an anaphoric, pleonastic or event reference pronoun. For the English-to-French task the classifier is incorporated in an extended baseline, which takes the form of a source-aware language model. An implementation of the source-aware language model is also provided for each of the remaining language pairs.


Introduction
The WMT 2016 shared task on cross-lingual pronoun prediction focuses on the translation of the subject position pronouns "it" and "they" for several language pairs. Both of these pronouns perform multiple functions in text, and disambiguation is required if they are to be translated correctly into other languages (Guillou, 2016). The pronoun "they" is typically used as an anaphoric pronoun, but may also be used generically, for example in "They say it always rains in Scotland". The pronoun "it" may be used as an anaphoric, pleonastic or event reference pronoun. Examples of these pronoun functions are provided in Figure 1.

anaphoric: I have a bicycle. It is red.
pleonastic: It is raining.
event: He lost his job. It came as a total surprise.

Figure 1: Examples of different pronoun functions
Anaphoric pronouns corefer with a noun phrase (i.e. the antecedent). Pleonastic pronouns, in contrast, do not refer to anything but are required to fill the subject position in many languages, including English, French and German. Event reference pronouns may refer to a verb, verb phrase, clause or even an entire sentence.
Different French pronouns are required when translating an instance of "it" depending on its function. For example, anaphoric "it" may be translated with the third-person singular pronouns "il" [masc.] and "elle" [fem.], or with a non-gendered demonstrative such as "cela". The French pronoun "ce" may function as both an event reference and a pleonastic pronoun, but "il", when it is not anaphoric, is used only as a pleonastic pronoun.
As revealed in an analysis of the systems submitted to the DiscoMT 2015 shared task on pronoun translation (Hardmeier et al., 2015a), the translation of pleonastic and event reference pronouns poses a particular problem for MT systems. Poor performance may be attributed to the inability of the systems to disambiguate the various possible functions of the pronoun "it". In the case of systems that incorporate coreference resolution and methods for identifying instances of pleonastic "it", inaccurate output may harm translation performance. No suitable tools exist for the detection of event reference pronouns in English.
To address the problem of disambiguating the function of "it", we propose a classifier that uses information from the current and previous sentences, as well as external tools, and indicates for each instance of "it" whether the pronoun function is anaphoric, pleonastic or event reference. The classifier was trained using data from the ParCor corpus (Guillou et al., 2014) and the DiscoMT2015.test dataset. In both corpora, pronouns are labelled according to their function, following the ParCor annotation scheme. The classifier is incorporated in an extended baseline system for the English-to-French task. The extended baseline takes the form of an n-gram language model that operates over target-language lemmas, but also has access to the identity of the source-language pronouns. Source-aware language models are also provided for the other tasks: English-to-German, German-to-English and French-to-English.

Previous Work
Work on pronoun translation, in which a complete machine translation pipeline is provided, has also considered different functions of the pronoun "it". Le Nagard and Koehn (2010) identify and exclude instances of pleonastic "it" in their English-to-French system. Guillou (2015) distinguishes between anaphoric and non-anaphoric pronouns in an English-to-French automatic post-editing system. Novák et al. (2013) consider the translation of three different uses of "it" in English-to-Czech translation: referential it, referring to a noun phrase, anaphoric it, referring to a verb phrase, and pleonastic it. These three categories correspond to those that we refer to as anaphoric, event reference and pleonastic, respectively.
Work by Navarretta (2004) and Dipper et al. (2011) has focused on resolving abstract anaphora in Danish and on the manual annotation of abstract anaphora in English and German. Abstract anaphora, in which pronouns refer to abstract entities such as facts or events, is referred to as event reference in this paper. The automatic detection of instances of pleonastic "it" has been addressed by NADA (Bergsma and Yarowsky, 2011), and also by the Stanford sieve-based coreference resolution system (Lee et al., 2011).
The cross-lingual pronoun prediction task formalised by Hardmeier (2014) was first introduced as a shared task at DiscoMT 2015 (Hardmeier et al., 2015a). The participants used a range of features in their classifiers, but this paper marks the first attempt to incorporate a component to disambiguate the various uses of "it".

Data
The ParCor corpus (Guillou et al., 2014) and DiscoMT2015.test dataset were used to train the classifier. Under the ParCor annotation scheme, which was used to annotate both corpora, pronouns are labelled according to their function. For all instances of "it" labelled as anaphoric, pleonastic or event reference, the sentence-internal position of the pronoun and the sentence itself are extracted. The pronouns "this" and "that", when used as event reference pronouns, may in many cases be used interchangeably with the pronoun "it" (Guillou, 2016). Consider Ex. 1, in which the pronouns "this" and "it" may be used to express the same meaning.
To increase the number of training examples, instances of event reference "this" and "that" are replaced with "it" and added to the training data.
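The augmentation step can be illustrated with a minimal sketch. The example representation used here, a `(tokens, pronoun_index, label)` tuple, and the function name are assumptions made for illustration; the source does not describe the data structures that were actually used.

```python
def augment_event_demonstratives(examples):
    """Return extra examples in which event-reference "this"/"that" becomes "it".

    `examples` is a list of (tokens, pronoun_index, label) tuples; this layout
    is hypothetical and not taken from the paper.
    """
    extra = []
    for tokens, idx, label in examples:
        word = tokens[idx]
        if label == "event" and word.lower() in {"this", "that"}:
            # Preserve sentence-initial capitalisation of the replaced pronoun.
            replacement = "It" if word[0].isupper() else "it"
            new_tokens = tokens[:idx] + [replacement] + tokens[idx + 1:]
            extra.append((new_tokens, idx, label))
    return extra


examples = [
    (["He", "lost", "his", "job", ".", "This", "came", "as", "a", "surprise", "."], 5, "event"),
    (["That", "is", "my", "bicycle", "."], 0, "anaphoric"),
]
augmented = augment_event_demonstratives(examples)
print(augmented)
```

Only the event-reference instance produces a new example; the anaphoric "That" is left alone.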
The data was divided into 1504 instances for training, and 501 each for the development and test sets. All sentences were shuffled before the corpus was divided, promoting a balanced distribution of the classes (Table 1).

Data Set    it-Event   it-Anaphoric   it-Pleonastic   Total
Training    504        779            221             1504
Dev         157        252            92              501
Test        169        270            62              501
Total       830        1301           375             2506

Table 1: Distribution of classes in the training data
All classifiers were trained using the Stanford Maximum Entropy package (Manning and Klein, 2003).

Features
To parse the corpus, we used the joint part-of-speech tagger and dependency parser of Bohnet et al. (2013) from the Mate toolkit. We used the pre-trained models for English that are available online. In addition, the corpus was lemmatised using the TreeTagger lemmatiser (Schmid, 1994). Although other tools were used, we relied on the output of these two parsers to extract most of our features.
For each training example, we extract the following information:
1. Previous three tokens. This includes words and punctuation. It also includes the tokens in the previous sentence when "it" occupies the first position of the current sentence.
2. Next two tokens
3. Lemmas of the next two tokens
4. Head word. As the task is limited to subject "it" and "they", most of the time the head word is a verb.
5. Whether the head word takes a 'that' complement (verbs only)
6. Tense of head word (verbs only). This is computed using the rules described in Loáiciga et al. (2014).
14. Lemma of the head word
15. Likelihood of head word taking an event subject (verbs only). An estimate of the likelihood of a verb taking an event subject was computed over the Annotated English Gigaword v.5 corpus (Napoles et al., 2012). We considered two cases where an event subject appears often and may be identified by exploiting the parse annotation of the Gigaword corpus. The first case is when the subject is a gerund and the second case is composed of "this" pronoun subjects.
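As an illustration of the context-window features (features 1 to 3), the following is a minimal sketch. It assumes pre-tokenised and pre-lemmatised input; the function name and the feature keys are hypothetical and do not come from the paper.

```python
def context_features(prev_sentence, sentence, lemmas, it_index):
    """Sketch of features 1-3 for the "it" at position `it_index`.

    `sentence` and `lemmas` are parallel token lists for the current sentence;
    `prev_sentence` is the token list of the preceding sentence.
    """
    if it_index == 0:
        # "it" is sentence-initial: take the context from the previous sentence.
        prev3 = prev_sentence[-3:]
    else:
        prev3 = sentence[max(0, it_index - 3):it_index]
    return {
        "prev_tokens": prev3,                               # feature 1
        "next_tokens": sentence[it_index + 1:it_index + 3],  # feature 2
        "next_lemmas": lemmas[it_index + 1:it_index + 3],    # feature 3
    }


prev_sentence = ["He", "lost", "his", "job", "."]
sentence = ["It", "came", "as", "a", "total", "surprise", "."]
lemmas = ["it", "come", "as", "a", "total", "surprise", "."]
features = context_features(prev_sentence, sentence, lemmas, 0)
print(features)
```

In practice each value would be turned into string-valued indicator features before being passed to a maximum entropy classifier.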
We also experimented with other features and options. For features 2 and 3, widening the window to three tokens degraded performance. For features 9 and 10, we experimented with adding their WordNet type (WordNet (Princeton University, 2010) contains 26 types of nouns), but this had no effect. Combining the noun and adjective features to the left or right also had no effect.
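The source does not give the exact estimator behind feature 15. One plausible reading, sketched below purely as an assumption, is a relative frequency: the number of times a verb occurs with a gerund or "this" subject, divided by the number of times it occurs with any subject. The input triples and the function name are hypothetical; `VBG` is the Penn Treebank tag for gerunds and present participles.

```python
from collections import Counter


def event_subject_likelihood(subject_verb_triples):
    """Relative-frequency estimate of a verb taking an event subject.

    `subject_verb_triples` yields (subject_token, subject_pos, verb_lemma),
    as might be read off a dependency-parsed corpus. This estimator is an
    assumption, not the formula used in the paper.
    """
    event_counts, total_counts = Counter(), Counter()
    for subject, pos, verb in subject_verb_triples:
        total_counts[verb] += 1
        # The two event-subject cases named in the text: gerunds and "this".
        if pos == "VBG" or subject.lower() == "this":
            event_counts[verb] += 1
    return {verb: event_counts[verb] / total_counts[verb] for verb in total_counts}


triples = [
    ("Swimming", "VBG", "help"),
    ("This", "DT", "help"),
    ("John", "NNP", "help"),
    ("John", "NNP", "eat"),
]
likelihood = event_subject_likelihood(triples)
print(likelihood)
```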

Results
For development and comparison we built two different baselines. One is a 3-gram language model built using KenLM (Heafield, 2011) and trained over a modified version of the annotated corpus in which every "it" is concatenated with its type (e.g. it-event). For testing, the "it" position is filled with each of the three "it" labels and the language model is queried. This baseline functions in a very similar way to the shared task's own baseline. Table 2 presents the results of this baseline using 14-fold cross-validation and a single held-out test set (all test-set mentions refer to the same test set). The motivation for the choice of the number of folds is threefold. First, we wanted to respect document boundaries; second, we aimed for a fair proportion of the three classes in all folds; and, lastly, we tried to lessen the variance given the relatively small size of the corpus. The second baseline is a setting in which all instances of the test set are set to the majority class it-anaphoric.
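The fill-and-query procedure of the language model baseline can be sketched as follows. The scoring function is passed in as a parameter so that any language model returning a sentence score could be used; the toy scorer below is only a stand-in invented for this example, and the label spellings are assumptions.

```python
LABELS = ("it-anaphoric", "it-pleonastic", "it-event")  # assumed label tokens


def predict_label(tokens, it_index, score):
    """Fill the "it" slot with each label and keep the best-scoring variant."""
    best_label, best_score = None, float("-inf")
    for label in LABELS:
        candidate = tokens[:it_index] + [label] + tokens[it_index + 1:]
        candidate_score = score(" ".join(candidate))
        if candidate_score > best_score:
            best_label, best_score = label, candidate_score
    return best_label


# Toy stand-in for a trained 3-gram model: counts known trigrams.
TOY_TRIGRAMS = {("it-pleonastic", "is", "raining"), ("it-anaphoric", "is", "red")}


def toy_score(sentence):
    tokens = sentence.split()
    trigrams = zip(tokens, tokens[1:], tokens[2:])
    return sum(1.0 for trigram in trigrams if trigram in TOY_TRIGRAMS)


print(predict_label(["it", "is", "raining", "."], 0, toy_score))
print(predict_label(["it", "is", "red", "."], 0, toy_score))
```

With a real model, `score` would return the log-probability of the whole sentence, so the comparison between the three filled variants stays the same.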
A quick scan of Tables 2 and 3 anticipates one of the conclusions of this paper: predicting event reference pronouns is a complex problem. The 3-gram baseline appears to be biased towards the pleonastic class, as suggested by its high precision and very low recall for the event and anaphoric classes and the opposite situation for the pleonastic class. While our own classifier is more balanced, it achieves only moderate results with the event class. Compared to both of the baselines, it shows only a very small improvement.
A manual inspection of the results shows that discriminating between anaphoric and event reference instances of "it" is indeed a very subtle process. Determining the presence or the lack of a specific (NP-like) antecedent requires the understanding of the complete coreference chain. Take for instance the following example taken from a dialogue in the corpus:
1 You're part of a generation that grew up with the Internet, and it seems as if you become offended at almost a visceral level when you see something done that you think will harm the Internet.
2 Is there some truth to it?
3 It is.
4 I think it's very true.
5 This is not a left or right issue.
6 Our basic freedoms, and when I say our, I don't just mean Americans, I mean people around the world, it's not a partisan issue.
In the example above, the "it" in sentence 3 is an event reference pronoun, while the "it" in sentence 6 is an anaphoric pronoun. With access to the whole coreference chain, one can see that the "it" in sentence 3 refers to the event expressed in the first sentence, and it is therefore annotated as an event. This same entity is then referred to with the word "issue" in sentence 5, which in turn becomes the antecedent to the "it" in sentence 6. The classifier, however, labelled these two instances as anaphoric and event respectively.
It is worth noting that of the 2031 segments composing the annotated corpus, 349 (17%) contain co-occurrences of between 2 and 7 "it" pronouns within the same segment. We experimented with including the previous it-label, when there are several within the same sentence, as an additional feature and obtained important gains in performance. It can be seen in the w/ oracle feature section of Table 3 that performance improves in almost all cases when this feature is used. The only exception is for the it-pleonastic class of the test set. We then tried to approximate this feature by using the relative position of the it-label to other it-labels within the same sentence (e.g., first, second, etc.). Contrary to the oracle feature, the approximated feature did not lead to any improvement. Modelling co-occurrences of pronouns seems like a promising step in future work.
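The two co-occurrence features discussed above can be sketched as follows. Both functions and the `"NONE"` placeholder are hypothetical names chosen for illustration, and the input is assumed to be pre-tokenised.

```python
def it_rank_feature(tokens):
    """Approximated feature: ordinal rank of each "it" in the sentence (1 = first)."""
    positions = [i for i, token in enumerate(tokens) if token.lower() == "it"]
    return {index: rank for rank, index in enumerate(positions, start=1)}


def previous_label_feature(it_indices, gold_labels):
    """Oracle feature: gold label of the preceding "it" in the same sentence.

    The first "it" of a sentence receives the placeholder "NONE".
    """
    features, previous = {}, "NONE"
    for index, label in zip(it_indices, gold_labels):
        features[index] = previous
        previous = label
    return features


tokens = "it seems as if it is true".split()
ranks = it_rank_feature(tokens)
oracle = previous_label_feature(sorted(ranks), ["pleonastic", "anaphoric"])
print(ranks)
print(oracle)
```

The oracle variant depends on gold labels, which are unavailable at test time; this is why only the rank-based approximation could be used in a real system.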
Binary classification (event vs. non-event) consistently underperformed when compared to the three-class set-up.

Source-Aware Language Model
The pronoun prediction part of our models is based on an n-gram model over target lemmas similar to the official shared task baseline. In addition to the pure target lemma context, our model also has access to the identity of the source language pronoun, which, in the absence of number inflection on the target words, provides valuable information about the number marking of the pronouns in the source and opens a way to inject the output of the pronoun type classifier into the system.
Our source-aware language model is an n-gram model trained on an artificial corpus generated from the target lemmas of the parallel training data (Figure 2). Before every REPLACE tag occurring in the data, we insert the source pronoun aligned to the tag (without lowercasing or any other processing). The alignment information attached to the REPLACE tag in the shared task data files is stripped off. In the training data, we instead add the pronoun class to be predicted. Note that all REPLACE tags are placeholders for one-word translations guaranteed to correspond to a source pronoun it or they according to the shared-task data preparation (Hardmeier et al., 2015b; Guillou et al., 2016). The n-gram model used for this component is a 6-gram model with modified Kneser-Ney smoothing (Chen and Goodman, 1998).
To predict classes for an unseen test set, we first convert it to a format matching that of the training data, but with a uniform, unannotated REPLACE tag used for all classes. We then recover the tag annotated with the correct solution using the disambig tool of the SRILM language modelling toolkit (Stolcke et al., 2011). This tool runs the Viterbi algorithm to select the most probable mapping of each token from among a set of possible alternatives. The map used for this task trivially maps all tokens to themselves with the exception of the REPLACE tags, which are mapped to the set of annotated REPLACE tags found in the training data.
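The corpus conversion and the token map can be sketched as below. Several details are assumptions made for illustration rather than facts from the paper: that alignment information appears as a numeric suffix (`REPLACE_3` pointing to source token 3), that annotated tags are written as `REPLACE_<class>`, and that each map line lists a token followed by its whitespace-separated alternatives. The function names and the toy sentence are likewise hypothetical.

```python
import re

REPLACE_RE = re.compile(r"^REPLACE_(\d+)$")  # assumed alignment-suffix format


def make_lm_line(source_tokens, target_lemmas, classes=None):
    """Insert the aligned source pronoun before each REPLACE tag.

    With `classes` (training), tags become REPLACE_<class>; without it
    (testing), a uniform unannotated REPLACE tag is emitted.
    """
    class_iter = iter(classes) if classes is not None else None
    out = []
    for token in target_lemmas:
        match = REPLACE_RE.match(token)
        if match is None:
            out.append(token)
            continue
        out.append(source_tokens[int(match.group(1))])  # pronoun kept as-is
        out.append("REPLACE" if class_iter is None else "REPLACE_" + next(class_iter))
    return " ".join(out)


def build_map_lines(vocabulary, annotated_tags):
    """Map every token to itself, and REPLACE to all annotated tags."""
    lines = [f"{word} {word}" for word in sorted(vocabulary) if word != "REPLACE"]
    lines.append("REPLACE " + " ".join(sorted(annotated_tags)))
    return lines


source = ["It", "is", "red", "."]
target = ["REPLACE_0", "être", "rouge", "."]
train_line = make_lm_line(source, target, classes=["il"])
test_line = make_lm_line(source, target)
map_lines = build_map_lines(set(test_line.split()), {"REPLACE_il", "REPLACE_elle"})
print(train_line)
print(test_line)
print(map_lines)
```

The training lines would be used to estimate the n-gram model, while the test lines and the map would be handed to the decoding tool, which picks one annotated tag per REPLACE slot.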
The source-aware language model described here is identical to the language model component included in the UU-Hardmeier submission (Hardmeier, 2016).

English-French "it" Disambiguation System
We used the classifier described in Section 3 to annotate all instances of it from the source side of the data which were mapped to a REPLACE item according to the alignment provided. Afterwards, a new source-aware language model is trained in the manner described in Section 4. In this way, instead of the sentence 'It 's got these fishing lures on the bottom .' presented in Figure 2, the system receives the labelled input 'It anaphoric 's got these fishing lures on the bottom .' All the data provided for the shared-task was used in training this system.

Results and Analysis
Unfortunately, following the submission of our system we identified an error related to the feature extraction process. We relied on contextual information of the previous sentence for some of our features. However, due to the 1:N alignments, the context information was sometimes inaccurate. The correction of this problem produced the results reported in the "Submitted corrected" section of Table 4. The macro-averaged recall obtained is 57.03%, which is considerably better than the result of the submitted system (48.92%), but still slightly lower than the score of 59.84% which was obtained by the unmodified system. However, some pronouns obtain better scores with the submitted corrected system than with the unmodified system. Precision, in particular, is higher (bolded scores in Table 4). This outcome is expected for the pronoun cela, which is the French neuter demonstrative pronoun frequently used for event reference. However, there are also gains in precision for on, elles and ils. In our opinion, this suggests that while not directly treating any of the other source-language pronouns (in the context of this shared task, other source pronouns refers only to they), the disambiguation of it positively affects the translation of the other target-language pronouns. The pronoun it, after all, is used three times more frequently than they in the training data (Loáiciga and Wehrli, 2015). Looking at the predictions, we confirmed that both source-aware language models produced identical results almost all of the time, with the system without the labels producing more correct predictions in total. However, there are a few examples where the system with the labels outperforms both the baseline and the unlabelled one. A contrastive example can be seen in Figure 3.
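For readers unfamiliar with the metric quoted above, macro-averaged recall is the unweighted mean of the per-class recall values, so that rare pronoun classes count as much as frequent ones. The sketch below uses invented toy predictions and is not the shared task's official scorer.

```python
from collections import Counter


def macro_averaged_recall(gold, predicted):
    """Mean over gold classes of (correct predictions / gold instances)."""
    correct, total = Counter(), Counter()
    for gold_label, predicted_label in zip(gold, predicted):
        total[gold_label] += 1
        if gold_label == predicted_label:
            correct[gold_label] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


gold = ["il", "il", "elle", "cela"]          # toy data, not from the paper
predicted = ["il", "elle", "elle", "ce"]
print(macro_averaged_recall(gold, predicted))
```

Here the per-class recalls are 0.5 for il, 1.0 for elle and 0.0 for cela, giving a macro average of 0.5.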

Source: it anaphoric just takes a picture of objective reality as it anaphoric is .
LM w/o labels: il OTHER
LM w/labels: elle OTHER
Baseline: cela OTHER
Gold: elle prendre juste un image objectif de la réalité .

Conclusions and Future Work
Distinguishing between anaphoric and event reference realisations of "it" is a very complex task. In particular, it can be difficult to determine the antecedent of an event reference pronoun. The identification of pleonastic realisations, on the other hand, is almost impossible in an n-gram context such as that provided by a language model. However, it is feasible in the three-class setting, and at the same time helpful for the disambiguation of the event and anaphoric realisations. While our results are modest, they point towards an improvement in the general quality of pronoun translation. Accurate disambiguation of the pronoun "it" has the potential to help NLP applications such as Machine Translation and Coreference Resolution.
In the near future, we will experiment with other classification algorithms suitable for small training sets. We also intend to experiment with features that incorporate semantic knowledge in the form of statistics computed over external resources, including the Gigaword corpus. Lastly, with the data generated from this shared task, we plan to apply bootstrapping and experiment with self-training.