What is it? Disambiguating the different readings of the pronoun ‘it’

In this paper, we address the problem of predicting one of three functions for the English pronoun 'it': anaphoric, event reference, or pleonastic. This disambiguation is valuable in the context of machine translation and coreference resolution. We present experiments using a MAXENT classifier trained on gold-standard data, and self-training experiments with an RNN trained on silver-standard data annotated using the MAXENT classifier. Lastly, we report on an analysis of the strengths of these two models.


Introduction
We address the problem of disambiguating the English pronoun 'it', which may function as a pleonastic, anaphoric, or event reference pronoun. As an anaphoric pronoun, 'it' corefers with a noun phrase (called the antecedent), as in example (1):

(1) I have a bicycle. It is red.

Pleonastic pronouns, in contrast, do not refer to anything but are required to fill the subject position in many languages, including English, French and German:

(2) It is raining.

Event reference pronouns are anaphoric, but instead of referring to a noun phrase, they refer to a verb, verb phrase, clause or even an entire sentence, as in example (3):

(3) He lost his job. It came as a total surprise.
We propose to identify the three usage types of 'it', namely anaphoric, event reference, and pleonastic, with a single system. We present several classification experiments which rely on information from the current and previous sentences, as well as on the output of external tools.

Related Work
Due to the difficulty of the task, proposals for the identification and subsequent resolution of abstract anaphora (i.e., event reference) are scarce (Eckert and Strube, 2000; Byron, 2002; Navarretta, 2004; Müller, 2007). The automatic detection of instances of pleonastic 'it', on the other hand, has been addressed by the non-referential 'it' detector NADA (Bergsma and Yarowsky, 2011), and also in the context of several coreference resolution systems, including the Stanford sieve-based coreference resolution system (Lee et al., 2011). The coreference resolution task focuses on the resolution of nominal anaphoric pronouns, de facto grouping our event and pleonastic categories together and discarding both of them. The coreference resolution task can be seen as a two-step problem: mention identification followed by antecedent identification. Identifying instances of pleonastic 'it' typically takes place in the mention identification step. The recognition of event reference 'it' is, however, to our knowledge not currently included in any such systems, although from a linguistic point of view, event instances are also referential (Boyd et al., 2005). As suggested by Lee et al. (2016), it would be advantageous to incorporate event reference resolution in the second step.
In the context of machine translation, work by Le Nagard and Koehn (2010), Novák et al. (2013), Guillou (2015) and Loáiciga et al. (2016) has also considered disambiguating the function of the pronoun 'it' in the interest of improving pronoun translation into different languages.

Labeled Data
The ParCor corpus (Guillou et al., 2014) and DiscoMT2015.test dataset were used as gold-standard data. Under the ParCor annotation scheme, which was used to annotate both corpora, pronouns are manually labeled according to their function: anaphoric, event reference, pleonastic, etc. For all instances of 'it' in the corpora, we extracted the sentence-internal position of the pronoun, the sentence itself, and the two previous sentences. All examples were shuffled before the corpus was divided, ensuring a balanced distribution of the classes (Table 1).
The pronouns 'this' and 'that', when used as event reference pronouns, may often be used interchangeably with the pronoun 'it' (Guillou, 2016). Such instances of 'this' and 'that' were therefore automatically substituted with 'it' to obtain additional event reference examples.

Baselines
We provide two different baselines (MC and LM BASELINE in Table 2). The first is a setting in which all instances are assigned to the majority class it-anaphoric. The second baseline system is a 3-gram language model built using KenLM (Heafield, 2011) and trained on a modified version of the annotated corpus in which every instance of 'it' is concatenated with its function (e.g. 'itevent'). At test time, the 'it' position is filled with each of the three it-function labels in turn, the language model is queried, and the highest scoring option is chosen.
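The label-substitution step of the LM baseline can be sketched as follows. This is a minimal illustration, not the actual KenLM setup: `lm_score` is a hypothetical toy scoring function standing in for a real 3-gram language model query.

```python
# Sketch of the LM baseline: each candidate function label is spliced
# into the 'it' position, the "language model" scores the resulting
# token sequence, and the highest-scoring label is chosen.

LABELS = ["itanaphoric", "itevent", "itpleonastic"]

def lm_score(tokens):
    # Toy stand-in for a KenLM query, purely for illustration:
    # rewards the 'itevent' label when an event-like verb is present.
    score = 0.0
    if "itevent" in tokens and "came" in tokens:
        score += 1.0
    return score

def classify_it(tokens, it_index, score_fn=lm_score):
    """Return the best-scoring function label for the 'it' at it_index."""
    best_label, best_score = None, float("-inf")
    for label in LABELS:
        # Fill the 'it' position with the concatenated label, as in the
        # modified training corpus (e.g. 'itevent').
        candidate = tokens[:it_index] + [label] + tokens[it_index + 1:]
        s = score_fn(candidate)
        if s > best_score:
            best_label, best_score = label, s
    return best_label
```

With a real model, `lm_score` would query the trained 3-gram LM over the label-augmented corpus; ties here fall back to the first (majority) label.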

Features
We designed features to capture not only the token context, but also the syntactic and semantic context preceding the pronouns and, where appropriate, their antecedents/referents, as well as the pronoun head. We used the output of the POS tagger and dependency parser of Bohnet et al. (2013), and of the TreeTagger lemmatizer (Schmid, 1994) to extract the following information for each training example:

Token context (tok)
1. Previous three tokens and next two tokens. This includes words, punctuation and the tokens in the previous sentence when the 'it' occupies the first position of the current sentence.
2. Lemmas of the next two tokens.

Pronoun head (hea)
3. Head word and its lemma. Most of the time the head word is a verb.
4. If the head verb is copular, we include the head of its complement and not the verb itself (for the verbs be, appear, seem, look, sound, smell, taste, feel, become and get).
5. Whether the head word takes a 'that' complement (verbs only).
6. Tense of the head word (verbs only), computed as described by Loáiciga et al. (2014).

Syntactic context (syn)
7. Whether a 'that' complement appears in the previous sentence.
8. Closest NP head to the left and to the right.
9. Presence or absence of extraposed sentential subjects, as in 'So it's difficult to attack malaria from inside malarious societies, [...]'.
10. Closest adjective to the right.

Semantic context (sem)
11. VerbNet selectional restrictions of the verb. VerbNet (Kipper et al., 2008) specifies 36 types of argument that verbs can take. We limited ourselves to the values of abstract, concrete and unknown.
12. Likelihood of the head word taking an event subject (verbs only). An estimate of the likelihood of a verb taking an event subject was computed over the Annotated English Gigaword v.5 corpus (Napoles et al., 2012). We considered two cases favouring event subjects that may be identified by exploiting the parse annotation of the Gigaword corpus: first, when the subject is a gerund, and second, when the subject is the pronoun 'this'.
13. Non-referential probability assigned to the instance of 'it' by NADA (Bergsma and Yarowsky, 2011).
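A minimal sketch of the token-context features (1-2), assuming pre-tokenized sentences; the lemmatizer is stubbed with lowercasing, since the paper uses TreeTagger for that step:

```python
def token_context(prev_sent, cur_sent, it_index, lemmatize=str.lower):
    """Features 1-2: three tokens to the left and two to the right of 'it',
    crossing into the previous sentence when 'it' is sentence-initial,
    plus lemmas of the two following tokens (lemmatizer stubbed)."""
    # Prepend the previous sentence so a sentence-initial 'it'
    # still receives left context, as described above.
    tokens = prev_sent + cur_sent
    i = len(prev_sent) + it_index
    left = tokens[max(0, i - 3):i]
    right = tokens[i + 1:i + 3]
    return {
        "prev_tokens": left,
        "next_tokens": right,
        "next_lemmas": [lemmatize(t) for t in right],
    }
```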

MaxEnt
The MAXENT classifier is trained using the Stanford Maximum Entropy package (Manning and Klein, 2003) with all of the features described above. We also experimented with other features and options. For features 1 and 2, a window of three tokens showed a degradation in performance. For feature 8, adding one of the 26 WordNet (Princeton University, 2010) noun types had no effect. The feature combination of nouns and adjectives to the left or right also had no effect. Feature ablation tests revealed that while combining all features is beneficial for the prediction of the anaphoric and pleonastic classes, the same is not true for the event class. In particular, the inclusion of semantic features, which we designed as indicators of eventness, appears to be harmful (Figure 1).

Figure 1: Feature ablation, MAXENT system.
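The ablation experiments behind Figure 1 amount to retraining with one feature group held out at a time. A schematic harness, where the feature names and the `evaluate` callback are placeholders rather than the actual Stanford package interface:

```python
# Feature groups as described in the Features section; the individual
# feature names here are illustrative labels, not an official schema.
FEATURE_GROUPS = {
    "tok": ["prev_tokens", "next_tokens", "next_lemmas"],
    "hea": ["head", "head_lemma", "that_complement", "tense"],
    "syn": ["that_in_prev", "np_left", "np_right", "extraposed", "adj_right"],
    "sem": ["verbnet_restriction", "event_subject_likelihood", "nada_prob"],
}

def ablation_runs(evaluate):
    """Evaluate once with all feature groups, then once per held-out group.
    `evaluate` maps a list of feature names to a score (placeholder)."""
    all_feats = [f for group in FEATURE_GROUPS.values() for f in group]
    runs = {"all": evaluate(all_feats)}
    for name, group in FEATURE_GROUPS.items():
        kept = [f for f in all_feats if f not in group]
        runs["all - " + name] = evaluate(kept)
    return runs
```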

Unlabeled Data
Given the small size of the gold-standard data, and with the aim of gaining insight from unstructured and unseen data, we used the MAXENT classifier to label additional data from the pronoun prediction shared task at WMT16. This new silver-standard training corpus comprises 1,101,922 sentences taken from the Europarl (3,752,440 sentences), News (344,805 sentences) and TED talks (380,072 sentences) sections of the shared task training data.
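The self-training setup can be sketched as: label the unlabeled pool with the MAXENT model, then build the three RNN training corpora (gold only, silver only, and combined). The classifier is a stand-in callable here:

```python
def self_training_corpora(gold, unlabeled, maxent_predict):
    """Build the three training corpora used in the experiments:
    gold only, silver only (MAXENT-labeled), and their combination.
    `gold` is a list of (instance, label) pairs; `maxent_predict`
    stands in for the trained MAXENT classifier."""
    silver = [(x, maxent_predict(x)) for x in unlabeled]
    return {
        "RNN-GOLD": list(gold),
        "RNN-SILVER": silver,
        "RNN-COMBINED": list(gold) + silver,
    }
```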

RNN
Our second system is a bidirectional recurrent neural network (RNN) which reads the context words and then makes a decision based on the representations that it builds. Concretely, it consists of word-level embeddings of size 90, two layers of Gated Recurrent Units (GRUs), also of size 90, and a final softmax layer to make the predictions. The network uses a context window of 50 tokens both to the left and right of the 'it' to be predicted. The features described above are also fed to the network in the form of one-hot vectors. The system uses the Adam optimizer and the categorical cross-entropy loss function. We chose this architecture following the example of Luotolahti et al. (2016), who built a system for the related task of cross-lingual pronoun prediction.

MAXENT          RNN-COMBINED
0.729           0.728          (preceding row of Table 3)
0.250 (3/12)    0.583 (7/12)   (6) Ambiguous between event and anaphoric
                               e.g. Today, multimedia is a desktop or living room experience, because the apparatus is so clunky. It will change dramatically with small, bright, thin, high-resolution displays.
0.400 (2/5)     0.200 (1/5)    (7) Ambiguous between event and pleonastic
                               e.g. I did some research on how much it cost, and I just became a bit obsessed with transportation systems. And it began the idea of an automated car.
-    (0/4)      -    (0/4)     (8) Annotation errors
                               e.g. Youth unemployment is particularly worrying in it context, as the lost opportunity for jobless young people to develop professional skills is likely to translate into lower productivity and lower earnings over a longer period of time.

Table 3: Accuracy scores of the systems on different portions of the test set. For each category, we test whether MAXENT is better or worse than RNN-COMBINED. A * indicates significance at p < 0.001 using McNemar's χ² test.
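The 50-token context window fed to the network can be sketched as follows; the padding symbol is an assumption, as the paper does not specify one:

```python
PAD = "<pad>"  # padding symbol (assumed; the actual token is unspecified)

def context_window(tokens, it_index, size=50):
    """Return `size` tokens to the left and `size` to the right of the
    'it' at it_index, padded at document edges so the width is fixed."""
    left = tokens[max(0, it_index - size):it_index]
    right = tokens[it_index + 1:it_index + 1 + size]
    left = [PAD] * (size - len(left)) + left
    right = right + [PAD] * (size - len(right))
    return left, right
```

This also illustrates why the short gold-standard snippets may leave much of the window filled with padding, a point taken up in the Discussion.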

Discussion
We report all of the results in Table 2. MAXENT and RNN-GOLD are trained on the gold-standard data only. RNN-SILVER is trained on the silver-standard data (annotated using the MAXENT classifier). RNN-COMBINED is trained on both the silver-standard and gold-standard data.
The MAXENT and RNN models show improvements, albeit small for the it-event class, over the baseline systems. Since they are trained on the same gold-standard data, one would expect RNN-GOLD to perform similarly to MAXENT. However, RNN-GOLD may often lack enough words to fill its context window, because the gold-standard data consists only of the sentence containing the it-pronoun and the three previous sentences, which in addition tend to be short. For the RNN-SILVER system this is not a problem: the sentences of interest have not been taken out of their original context, fully exploiting the RNN's capacity to learn from the entire context window it is presented with, even if the data is noisy. As expected, RNN-COMBINED performs better than RNN-GOLD and RNN-SILVER. Although it does not perform overwhelmingly better than MAXENT, there are gains in precision for the it-anaphoric class, and in recall for the it-pleonastic and it-event classes, suggesting that the system benefits from the inclusion of gold-standard data.
With the two-fold goal of gaining a better understanding of the difficulties of the task and of the strengths of the systems, we re-classified the test set in a stratified manner. We present the systems with seven scenarios reflecting the different types of reference relationships observed in the corpora (Table 3). Our scenarios are exhaustive, so some contain only a few examples. The analysis reveals that MAXENT is a better choice for nominal reference (case (1), mostly it-anaphoric), whereas the RNN-COMBINED system is better at identifying difficult antecedents such as cases (4) and (6). RNN-COMBINED performs slightly better at detecting verbal antecedents, case (2), while both systems perform similarly at learning pleonastic instances (5) or when the antecedent is not in the snippet (3). Finally, we found 4 instances of annotation errors (8). These correspond to some of the automatically substituted cases of 'this'/'that' with 'it', for which the 'this'/'that' should not have been marked as a pronoun by the human annotator in the first place. Case (8) is not taken into account in the evaluation.
Taking the complete test set, we found that the MAXENT system performs better than the RNN-COMBINED system in absolute terms (χ² = 50.8891, p < 0.001), but this is because case (1) is the most frequent one, and also the case at which the MAXENT system is strongest.

Conclusions and Future Work
We have shown that distinguishing between nominal anaphoric and event reference realizations of 'it' is a complex task. Our results are promising, but there is room for improvement. The self-training experiment demonstrated the benefit of combining gold-standard and silver-standard data.
We also found that the RNN-COMBINED system is better at handling difficult and ambiguous referring relationships, while MAXENT performed better for the nominal anaphoric case, where the antecedent is close. Since the two models have different strengths, in future work we plan to enrich the training data with re-training instances from the silver data where the two systems agree, in order to reduce the amount of noise, following the example of Jiang et al. (2016).
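The planned agreement-based filtering in the style of Jiang et al. (2016) can be sketched as: keep only silver instances on which both systems predict the same label. The two predictor callables below are stand-ins for the trained MAXENT and RNN models:

```python
def agreement_filter(instances, predict_a, predict_b):
    """Keep unlabeled instances on which both classifiers agree,
    labeled with the shared prediction (sketch of the planned setup)."""
    kept = []
    for x in instances:
        label = predict_a(x)
        if label == predict_b(x):
            kept.append((x, label))
    return kept
```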
Ultimately, we aim towards integrating the it-prediction system within a full machine translation pipeline and a coreference resolution system. In the first case, the different translations of the pronoun 'it' can be constrained according to their function. In the second case, the performance of a coreference resolution system vs. a modified version using the three-way distinction can be measured.