Resolving Discourse-Deictic Pronouns: A Two-Stage Approach to Do It

Discourse deixis is a linguistic phenomenon in which pronouns have verbal or clausal, rather than nominal, antecedents. Studies have estimated that between 5% and 10% of pronouns in non-conversational data are discourse deictic. However, current coreference resolution systems ignore this phenomenon. This paper presents an automatic system for the detection and resolution of discourse-deictic pronouns. We introduce a two-step approach that first recognizes instances of discourse-deictic pronouns, and then resolves them to their verbal antecedent. Both components rely on linguistically motivated features. We evaluate the components in isolation and in combination with two state-of-the-art coreference resolvers. Results show that our system outperforms several baselines, including the only comparable discourse deixis system, and leads to small but statistically significant improvements over the full coreference resolution systems. An error analysis lays bare the need for a less strict evaluation of this task.


Introduction
Coreference resolution is a central problem in Natural Language Processing with a broad range of applications such as summarization (Steinberger et al., 2007), textual entailment (Mirkin et al., 2010), information extraction (McCarthy and Lehnert, 1995), and dialogue systems (Strube and Müller, 2003). Traditionally, the resolution of noun phrases (NPs) has been the focus of coreference research (Ng, 2010). However, NPs are not the only participants in coreference, since verbal or clausal mentions can also take part in coreference relations. For example, consider:
(1) The United States says it may invite Israeli and Palestinian negotiators to Washington.
(2) Without planning it in advance, they chose to settle here.
In (1), the antecedent of the pronoun is an NP, while in (2) the antecedent is a clause (Webber, 1988). Current state-of-the-art coreference resolution systems (Lee et al., 2011; Fernandes et al., 2012; Durrett and Klein, 2014; Björkelund and Kuhn, 2014) focus on the former and ignore the latter cases. Corpus studies across several languages (Eckert and Strube, 2000; Botley, 2006; Recasens, 2008) have estimated that between 5% and 10% of pronouns in non-conversational data, and up to 20% in conversational data, have verbal antecedents. A coreference system that is able to handle discourse deixis will thus be more accurate, and benefit downstream applications.
In this paper we present an automatic system that processes discourse-deictic pronouns. We resolve the three pronouns it, this and that, which can appear in linguistic contexts that reflect the phenomenon illustrated in (2). Our system has a modular architecture consisting of two independent stages: classification and resolution. The first stage classifies a pronoun as discourse deictic (or not), and the second stage resolves discourse-deictic pronouns to verbal antecedents. Both stages use linguistically motivated features.
We first evaluate our system by measuring the performance of the detection and resolution components in isolation. They outperform several baselines, including Müller's (2007) approach, which is the only other comparable discourse deixis system, to the best of our knowledge. We also measure the impact of our system on two state-of-the-art coreference resolution systems (Durrett and Klein, 2014; Björkelund and Kuhn, 2014). The results show the benefits of stacking a discourse deixis engine on top of NP coreference resolution.

Related Work
Coreference resolution systems mostly focus on NPs. Although some isolated efforts have been made to study discourse-deictic pronouns, they consist mostly of theoretical inquiries or corpus analyses. A few practical implementations have been proposed as well, but most rely on manual intervention or only apply to restricted domains.
Webber (1988) presents a seminal account of discourse-deictic pronouns. She catalogs how the usage of certain pronouns varies based on discourse context. She also provides an insight into the distinguishing characteristics of discourse deixis.
Several empirical studies have also been conducted to evaluate the prevalence of discourse deixis in corpora across languages. These have been applied to English for dialogues (Byron and Allen, 1998; Eckert and Strube, 2000) and news and literature (Botley, 2006), Danish and Italian (Navarretta and Olsen, 2008; Poesio and Artstein, 2008; Caselli and Prodanof, 2010), and Spanish (Recasens, 2008). These studies find that discourse deixis occurs in different languages, although prevalence depends on the domain in question. While discourse deixis can account for up to 20% of pronouns in dialogue and conversational text, a more general figure is between 5% and 10% for other genres.
In addition to a corpus analysis, Eckert and Strube (2000) provide a schema for performing discourse deixis resolution that they evaluate by measuring inter-annotator agreement on five dialogues from the Switchboard corpus. Byron (2002) presents an early attempt at a practical system that handles discourse deixis. However, it relies on sophisticated discourse and semantic features, thus only working with manual intervention in a limited domain. The first fully automatic system to handle discourse-deictic pronouns was the one by Müller (2007). In contrast to our two-stage approach, it directly resolves pronouns to nominal or verbal antecedents. The author targets coreference resolution in dialogues, but includes several features that are equally applicable to text data, thus making a comparison to our system viable. Chen et al. (2011) present another unified approach to dealing with entity and event coreference. Their system combines the predictions from seven distinct mention-pair resolvers, each of which focuses on a specific pair of mention types (NP, pronoun, verb). In particular, their verb-pronoun resolver falls within the scope of discourse deixis. Due to the tight coupling of multiple resolvers, a direct comparison with systems focusing on discourse deixis is hard. However, their features are among the ones considered in this work.

Our Approach
In this section we describe the architecture of our two-stage system, and then detail the features used in both stages.

System Architecture
We propose a two-stage approach for discourse deixis processing. Our system first classifies a potential pronoun as discourse deictic (or not), and then it optionally resolves discourse-deictic pronouns to their antecedent.
Table 1: Features used for pronoun p and candidate v in the classification (Cla.) and resolution (Res.) stages. Features marked with • were selected, and those marked with - were discarded by feature selection. The last column (Mül.) contains the features used by Müller (2007). Features marked with are described in Section 3.2.
More specifically, and as described in Algorithm 1, a classification model $\Theta_c$ is applied to each pronoun p to obtain its probability of being discourse deictic, $p_c(p)$. If the probability is above a threshold $th_c$, the pronoun is considered for resolution. All verbs v in the current and n previous sentences are considered as candidates (a window of sentences is used in our experiments). A resolution model $\Theta_r$ is applied to each candidate v to obtain its probability of being the antecedent of p, $p_r(v, p)$; if the candidate with the highest score, $v_{best}$, is above a threshold $th_r$, then it is returned as the antecedent. Otherwise, the pronoun remains unlinked.
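The two-stage procedure can be illustrated with a minimal Python sketch. It is not the authors' Algorithm 1: the function name `resolve_pronoun` and its signature are hypothetical, the two models are passed in as plain scoring functions, and candidate extraction from the sentence window is assumed to have happened beforehand.

```python
from typing import Callable, Optional, Sequence, TypeVar

P = TypeVar("P")  # pronoun mention type (hypothetical)
V = TypeVar("V")  # verb candidate type (hypothetical)


def resolve_pronoun(
    pronoun: P,
    candidates: Sequence[V],
    p_c: Callable[[P], float],
    p_r: Callable[[V, P], float],
    th_c: float,
    th_r: float,
) -> Optional[V]:
    """Return the predicted verbal antecedent of `pronoun`, or None if unlinked."""
    # Stage 1: classification. Pronouns that do not clear th_c are skipped.
    if p_c(pronoun) <= th_c:
        return None
    # No verbs in the window means there is nothing to link to.
    if not candidates:
        return None
    # Stage 2: resolution. Pick the candidate verb with the highest score.
    v_best = max(candidates, key=lambda v: p_r(v, pronoun))
    # Link only if the best candidate clears th_r; otherwise stay unlinked.
    if p_r(v_best, pronoun) > th_r:
        return v_best
    return None
```

Because the second threshold is applied to the best candidate only, a pronoun that passes classification can still remain unlinked, which matches the behaviour described above.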
Both components are implemented as maximum entropy classifiers. For simplicity, our approach is independent from the NP-NP coreference resolution component: competition between verbal and nominal antecedents is not considered. Table 1 gives an overview of the features that were used by the classification and resolution models. We consider all the features listed in the table, but some of them (marked with -) are pruned by feature selection (see Section 4.2). Real-valued features are quantized, and dependency label paths are considered up to length 2. Details for the more sophisticated features (marked with in the table) follow.

Features
Negated parent/candidate We consider a verb token to be negated if it has a child connected with a negation label.
Parent/candidate transitivity We consider a verb token to be transitive if it has a child with a direct object label.
Clause-governing parent/candidate This is the probability of the parent/candidate to have a clausal or verbal argument. Probabilities for every verbal lemma are estimated from the Google News corpus. We then use the logarithm of these probabilities as the feature values.
Attribute lemma/POS If the pronoun is the subject of a copular verb, we consider the lemma and POS of the attribute of this verb as features.
Right frontier Webber (1988) proposes the right frontier condition to restrict the set of candidates available as antecedents for discourse-deictic pronouns. We define this condition in terms of what Webber calls discourse units. These are minimal discourse segments, and a sequence of several units can also be nested and form a larger unit. She states that only units on the right frontier (i.e., not followed by another unit at the same nesting level) can be antecedents for such pronouns.
In (3), where discourse units are marked by square brackets, the verbal heads of discourse segments that are on the right frontier are underlined, while the others are italicized to denote inaccessibility.
In our system, we approximate discourse units by sentences and clauses. The candidate antecedents are the respective verbal heads of these units. This feature triggers if the antecedent candidate occurs on the right frontier of the pronoun. Since we also consider cataphoric relations, we reverse the rule to check the left frontier for these cases.
I-incompatibility Eckert and Strube (2000) define an anaphor to be I-incompatible if it occurs in a context in which it "cannot refer to an individual object." Adjectives can be used as contextual cues for I-incompatible anaphors in copular constructions (4).
Similarly to Müller (2007), we define the I-incompatibility score of an adjective as its conditional probability of being the attribute of a non-nominal subject given that it occurs in a copular construction. This is estimated from the Google News corpus as the number of occurrences of the adjective in one of these patterns:
• clausal subject + BE + ADJ (To read is healthy)
• IT + BE + ADJ + TO/THAT (It is healthy to read)
• nominalized subject + BE + ADJ (The construction was suspended)
• -ing subject + BE + ADJ (Reading is healthy)
divided by its number of occurrences in the pattern BE + ADJ. At classification time, if the pronoun is in a copular construction with an adjective attribute, the I-incompatibility score of the latter is used as a feature.
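As an illustration, the ratio described above can be computed from two count tables. This is a sketch under stated assumptions: the function name `i_incompatibility` and the dictionary inputs are hypothetical, pattern matching over the corpus is assumed to be done elsewhere, and backing off to 0.0 for unseen adjectives is our own choice rather than something the text specifies.

```python
from typing import Mapping


def i_incompatibility(
    adjective: str,
    nonnominal_counts: Mapping[str, int],
    copular_counts: Mapping[str, int],
) -> float:
    """Estimate P(non-nominal subject | adjective in a BE + ADJ construction).

    nonnominal_counts: occurrences of the adjective in any of the four
        non-nominal subject patterns (hypothetical, precomputed).
    copular_counts: occurrences of the adjective in the pattern BE + ADJ.
    """
    total = copular_counts.get(adjective, 0)
    if total == 0:
        # Assumption: adjectives never seen in a copular pattern score 0.0.
        return 0.0
    return nonnominal_counts.get(adjective, 0) / total
```

For instance, an adjective seen 200 times after BE, 150 of them with a non-nominal subject, would receive a score of 0.75.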
Verb association strength To capture the strength of association between the candidate antecedent and the parent verb of the pronoun, we use the normalized pointwise mutual information of the two verbs co-occurring within a window of 3 sentences, estimated from counts in the Google News corpus.
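Normalized pointwise mutual information divides PMI by the negative log of the joint probability, which bounds the value between -1 and 1. The sketch below assumes the standard definition and uses hypothetical inputs: raw counts of windows containing each verb, of windows containing both, and the total number of windows. How the counts were gathered from the corpus is not covered here.

```python
import math


def npmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Normalized PMI of two verbs from co-occurrence counts.

    count_xy: windows containing both verbs; count_x, count_y: windows
    containing each verb; total: number of windows (all hypothetical inputs).
    """
    if count_xy == 0:
        # Verbs that never co-occur get the lower bound by convention.
        return -1.0
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    if p_xy == 1.0:
        # Degenerate case: both verbs occur in every window.
        return 1.0
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)
```

A value near 0 indicates that the two verbs co-occur about as often as expected by chance, while values near 1 indicate that they almost always appear together.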
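Selectional preference We use selectional preference, as defined by Resnik (1997), to capture the degree to which the antecedent makes a reasonable substitute of the pronoun in the context of its parent verb. Resnik first defines the overall selectional preference strength of a verb ω as the Kullback-Leibler divergence between the distribution of its arguments and the prior distribution of arguments, $S_R(\omega) = \sum_a p(a|\omega) \log \frac{p(a|\omega)}{p(a)}$. Higher values of this quantity correspond to more selective predicates. Then, the selectional preference strength of a verb ω for a particular argument a is defined as $A_R(\omega, a) = \frac{p(a|\omega) \log \frac{p(a|\omega)}{p(a)}}{S_R(\omega)}$.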
To account for nominalizations, verbs and nouns are stemmed following Porter (1980).
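The selectional preference computation can be sketched in a few lines of Python. The sketch assumes Resnik's definition of the normalizer S_R as the Kullback-Leibler divergence between p(a|ω) and p(a); the function names and the dictionary-based probability tables are hypothetical, and returning 0.0 for unseen arguments or unselective verbs is our own convention.

```python
import math
from typing import Mapping


def preference_strength(
    p_arg_given_verb: Mapping[str, float], p_arg: Mapping[str, float]
) -> float:
    """S_R: KL divergence between p(a | verb) and the prior p(a)."""
    return sum(
        p * math.log(p / p_arg[a])
        for a, p in p_arg_given_verb.items()
        if p > 0
    )


def selectional_association(
    arg: str, p_arg_given_verb: Mapping[str, float], p_arg: Mapping[str, float]
) -> float:
    """A_R: the share of S_R contributed by one particular argument."""
    s_r = preference_strength(p_arg_given_verb, p_arg)
    p = p_arg_given_verb.get(arg, 0.0)
    if s_r == 0 or p == 0:
        # Assumption: no preference signal for unseen arguments or
        # for verbs whose argument distribution equals the prior.
        return 0.0
    return p * math.log(p / p_arg[arg]) / s_r
```

By construction, the values of A_R over all arguments of a verb sum to 1 whenever S_R is positive, and individual values can be negative for arguments the verb disprefers.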

Evaluation
In this section we describe the setup for evaluating our system.

Dataset
We perform all our experiments on the English section of the CoNLL-2012 corpus (Pradhan et al., 2012), which is based on OntoNotes (Pradhan et al., 2007). It consists of 2384 documents (1.6M words) from a variety of domains: news, broadcast conversation, weblogs, etc. It is annotated with POS tags, syntax trees, word sense annotation, coreference relations, etc. The coreference layer includes verbal mentions. Given these annotations, we consider a pronoun to be discourse deictic if the preceding mention in its coreference cluster is verbal, or if it is the first mention in the cluster and the next one is verbal. The distribution of potentially discourse-deictic pronouns (it, this and that) in the test set is summarized in Table 2.
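The labeling rule for gold discourse-deictic pronouns can be written down as a small function. This is a sketch only: the `Mention` class and the assumption that each cluster is a list of mentions sorted in document order are ours, not the corpus format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Mention:
    text: str
    is_verbal: bool  # True if the mention is a verb (hypothetical flag)


def is_discourse_deictic(cluster: List[Mention], index: int) -> bool:
    """Apply the labeling rule to the pronoun at `cluster[index]`.

    `cluster` is assumed to be sorted in document order.
    """
    if index > 0:
        # Anaphoric case: look at the preceding mention in the cluster.
        return cluster[index - 1].is_verbal
    # Cataphoric case: the pronoun opens the cluster, so look ahead.
    return len(cluster) > 1 and cluster[1].is_verbal
```

Under this rule the pronoun in example (2) is labeled discourse deictic through the second branch, since it precedes its verbal antecedent.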
For all our experiments we train, tune and test according to the CoNLL-2012 split of OntoNotes. The gold analyses provided for the shared task are used for training, and the system analyses for development and testing.

Experiments
We train the two components of our system separately. For each of them, a maximum entropy model is learned on the train partition. Feature selection and threshold tuning are performed by hill climbing on the development set. We use separate thresholds for it, this, and that, since their distributions in the corpus are quite different.
We perform two evaluations of our system: first classification and resolution are evaluated in isolation, and then both components are stacked on top of an NP coreference engine.
For classification, we measure system performance on standard precision (P), recall (R) and F1 of correctly predicting whether a pronoun is discourse deictic or not. For resolution, precision is computed as the fraction of predicted antecedents that are correct, and recall as the fraction of gold antecedents that are correctly predicted. To decouple the evaluation of both stages, we also include results with oracle classifications as input to the resolution stage.
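The resolution metrics can be made concrete with a short sketch. The function name `resolution_prf` and the representation of predictions and gold links as dictionaries from pronoun identifiers to antecedent identifiers are assumptions for illustration; unlinked pronouns are simply absent from the predictions.

```python
from typing import Dict, Hashable, Tuple


def resolution_prf(
    predicted: Dict[Hashable, Hashable], gold: Dict[Hashable, Hashable]
) -> Tuple[float, float, float]:
    """Precision, recall and F1 of predicted pronoun-to-antecedent links."""
    # A prediction is correct only if the gold antecedent is the same verb.
    correct = sum(1 for p, v in predicted.items() if p in gold and gold[p] == v)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Leaving a pronoun unlinked therefore costs recall but not precision, which is why a thresholded resolver can be penalized even with oracle classification.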
Finally, we use the output of our system to extend the predictions of two state-of-the-art NP coreference systems:
• BERKELEY (Durrett and Klein, 2014), a joint model for coreference resolution, named entity recognition, and entity linking.
• HOTCOREF (Björkelund and Kuhn, 2014), a structured perceptron model with latent antecedents and non-local features.
We only add our predictions for the pronouns it, this and that when they are output as singletons by the NP coreference system. We report the standard coreference measures on the combined outputs using the updated CoNLL scorer v7 (Pradhan et al., 2014). Here, the systems are evaluated on all nominal, pronominal, and verbal mentions. The metrics include precision, recall and F1 for MUC, B³ and CEAF_e, and the CoNLL metric, which is the arithmetic mean of the first three F1 scores.
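The stacking step and the CoNLL metric can be sketched as follows. Both functions are illustrative assumptions: clusters are represented as sets of mention identifiers, discourse-deixis predictions as a dictionary from pronoun to verb, and pronouns that share a predicted verb are grouped into one cluster, a detail the text does not spell out.

```python
from typing import Dict, Hashable, List, Set


def extend_clusters(
    np_clusters: List[Set[Hashable]], deixis_links: Dict[Hashable, Hashable]
) -> List[Set[Hashable]]:
    """Add pronoun-verb links only for pronouns the NP system left as singletons."""
    non_singletons = {m for c in np_clusters if len(c) > 1 for m in c}
    accepted = {p: v for p, v in deixis_links.items() if p not in non_singletons}
    # Keep every NP cluster except singleton clusters of pronouns we now link.
    result = [
        set(c)
        for c in np_clusters
        if not (len(c) == 1 and next(iter(c)) in accepted)
    ]
    # Assumption: pronouns resolved to the same verb share one cluster.
    by_verb: Dict[Hashable, Set[Hashable]] = {}
    for pronoun, verb in accepted.items():
        by_verb.setdefault(verb, {verb}).add(pronoun)
    result.extend(by_verb.values())
    return result


def conll_score(muc_f1: float, b3_f1: float, ceafe_f1: float) -> float:
    """CoNLL metric: arithmetic mean of the MUC, B-cubed and CEAF-e F1 scores."""
    return (muc_f1 + b3_f1 + ceafe_f1) / 3.0
```

Restricting additions to singletons means that a decision by the NP system is never overridden, at the cost of ignoring discourse-deictic pronouns that the NP system has already linked to a nominal antecedent.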

Baselines
We compare our classification component against two baselines:
• ALL, which blindly classifies all mentions as discourse deictic.
• NAIVE_c, which classifies all this and that mentions as discourse deictic, and all it mentions as non-discourse-deictic. This is motivated by the distribution of discourse deixis in the corpus (see Table 2).
For resolution, we use the baselines:
• NAIVE_r, which resolves a pronoun to the closest verb in the previous sentence. This is motivated by corpus analyses studying the position of discourse-deictic pronouns relative to their antecedents (Navarretta, 2011).
• MÜLLER_r, which is an equivalent maximum entropy model using the subset of our features also considered by Müller (2007). See column Mül. in Table 1.
Finally, when measuring the impact of our system on top of an NP coreference resolution engine, we consider the following baselines:
• NAIVE, which uses NAIVE_c and NAIVE_r.
• MÜLLER, which does not include a classification stage, and uses MÜLLER_r for resolution.
• ONESTAGE, which does not include a classification stage, and uses our complete feature set for resolution.
• ORACLE, which outputs the gold annotations for discourse-deictic relations.
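The two naive baselines are simple enough to state in code. The sketch below uses hypothetical function names and one interpretive assumption: "closest verb in the previous sentence" is taken to mean the last verb of that sentence in textual order.

```python
from typing import Optional, Sequence


def naive_classify(pronoun: str) -> bool:
    """Naive classifier: this and that are discourse deictic, it is not."""
    return pronoun.lower() in {"this", "that"}


def naive_resolve(prev_sentence_verbs: Sequence[str]) -> Optional[str]:
    """Naive resolver: pick the closest verb of the previous sentence.

    Assumption: verbs are given in textual order, so the closest verb
    to a pronoun in the following sentence is the last one.
    """
    return prev_sentence_verbs[-1] if prev_sentence_verbs else None
```

Neither function uses any context beyond the pronoun form and verb positions, which makes these baselines a useful lower bound for the learned models.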

Results
The results for the classification stage are presented in Table 3, broken down by pronoun type. ALL performs the poorest overall, penalized by a precision just above 12%. Since in the case of it only 5.7% of the occurrences are discourse deictic, NAIVE_c gets better results overall by always classifying it as non-deictic. Our TWOSTAGE system improves over NAIVE_c by an additional 4% F1. However, the scores remain low, partly because of the difficulty of the problem (especially the class imbalance), and partly because, despite using a rich set of features, most of them focus on local context and ignore cues at the discourse level. The classification of it is particularly difficult, reflecting the fact that the pronoun has a wide variety of usages in English.
The scores for resolution are shown in Tables 4 and 5. The former uses oracle classification whereas the latter uses the system output of our classifier.
With oracle classification, NAIVE_r and MÜLLER_r perform very similarly, except for the case of this. Our TWOSTAGE resolver outperforms both of them for all pronouns and metrics, except for the recall of that. Overall, the difference in F1 is 9 points over NAIVE_r and 7 points over MÜLLER_r. The evaluation actually penalizes recall for our system, since we do not take advantage of the fact that all considered pronouns are discourse deictic: we trust the threshold and do not force the assignment of an antecedent.
All the results are lower with system classification. Given that our classifier performs the best for that, the drop for this pronoun is not as high as for the other two. Again, it stands out as the hardest pronoun to resolve. Neither NAIVE_r nor MÜLLER_r recovers any correct antecedent for it. TWOSTAGE obtains the highest scores across all pronouns and metrics.
Finally, Table 6 contains the coreference measures for end-to-end evaluation on top of the BERKELEY and HOTCOREF systems. The ORACLE row shows an upper bound of 2% in CoNLL score improvement. All three baselines (NAIVE, MÜLLER and ONESTAGE) actually cause a decrease of up to 0.9% CoNLL.
Our system TWOSTAGE achieves a small fraction of the headroom. The total number of discourse-deictic entities that it predicts on the test set is 248, of which 204 end up merged in the BERKELEY output, and 210 in HOTCOREF. This allows it to obtain the best B³, CEAF_e and CoNLL values, despite the fact that the low recall in the classification of discourse-deictic it reduces our margin for recall gains by one third. The drop in MUC highlights the difficulty of keeping the precision level, but our system is able to reach a better precision-recall balance than the other compared approaches.
We assess the statistical significance of the improvements of TWOSTAGE over BERKELEY and HOTCOREF using paired bootstrap resampling (Koehn, 2004) followed by two-tailed Wilcoxon signed-rank tests. All the differences are significant at the 1% level, except for the B³ F1 differences.

Error Analysis
In order to gain insight into the precision errors of our system, we manually analyzed 50 of its decisions on the CoNLL-2012 development set. Of these, 30% were correct, matching the gold annotation, as in (5).
The distribution of errors for the remaining cases is shown in Table 7. While half of the errors are due to actual errors in the model learned by our system, either in classification (6) or resolution (7), or due to a pre-processing error, another third of them are not true errors but missing (8) or partial annotations (9)-(10) in the gold standard corpus.
(6) If pictures are taken without permission, that is to say, it will at all times be pursued by legal action, a big hassle.
(7) Do we even know if these two medications are going to be effective against a strain that hasn't even presented itself? Here's the important thing about that.
(8) You will be taken to stand before governors and kings. People will do this to you because you follow me.
Table 6: End-to-end coreference resolution evaluation (TWOSTAGE corresponds to our system). All differences between the baseline system and TWOSTAGE are significant at the 1% level except for the B³ F1 differences.
(9) At this point they've wittled it down to one aircraft and a missing crew of four individuals. So we've gone from several possible aircraft to one aircraft and from several missing airmen to four. So how much easier will that make it for you to unlock this case, do you think?
Examples (9) and (10) show the difficulty of annotating discourse deixis relations under guidelines that require a unique verbal antecedent (Poesio and Artstein, 2008; Recasens, 2008). In our analysis we found several cases in which more than one antecedent is acceptable. This is usually the case when there is an elaboration (i.e., both the first clause and the follow-up clause restating or elaborating on the first one are acceptable antecedents, as in (9)) or a sequence of related and overlapping events. As pointed out by Poesio and Artstein (2008), "it is not completely clear the extent to which humans agree on the interpretation of such expressions," and the inconsistencies observed in the data are evidence of this.
Another class of hard cases is that of discourse-deictic pronouns used for packaging a previous fragment or set of clauses (10). It is very hard to pick an antecedent for them, or even to decide whether the antecedent is an NP or a clause (Francis, 1994).
Finally, in 20% of the cases the system and the annotation are in disagreement, but both decisions are debatable. In many of them, the system did not make any prediction, but the one in the gold annotation is incorrect. In (11), act is a more plausible antecedent for that.
(11) "Why didn't the Bank Board act sooner?" he said. "That is what Common Cause should ask be investigated."
As a result, even though our system obviously makes multiple mistakes in its decisions, we believe that the evaluation overpenalizes its performance due to the debatable and not always clear-cut annotations discussed above. Discourse deixis resolution is a hard problem in itself (the chances of selecting a wrong antecedent for a pronoun are many times greater than picking the right one), and this difficulty is accentuated by the problematic annotations in the training and test data.
Given the difficulty of identifying a single antecedent to discourse-deictic pronouns, as evidenced by the low inter-annotator agreement on this task, a more flexible evaluation measure for discourse deixis systems is needed.

Conclusion
We have presented an automatic system for discourse deixis resolution. The system works in two stages: first classifying pronouns as discourse deictic or not, and then assigning an antecedent.
Empirical evaluations show that our system outperforms naive baselines as well as the only existing comparable system. Additionally, when stacked on top of two different state-of-the-art NP coreference resolvers, our system yields improvements on the B³, CEAF_e and CoNLL measures. The results are still far from the upper bound achievable by an oracle. However, our research highlights the inconsistencies in the annotation of discourse deixis in existing resources, and thus the performance of our system is likely underestimated.
These inconsistencies call for future work to improve existing annotated corpora so that similar systems may be more fairly evaluated. Regarding our approach, a tighter integration between the NP and discourse deixis components could help them make more informed decisions. Other future research includes jointly learning the classification and resolution stages of our system, and exploring semisupervised learning techniques to compensate for the paucity of annotated data. Finally, we would like to transfer our system to other languages.