Enhanced Word Representations for Bridging Anaphora Resolution

Most current models of word representations (e.g., GloVe) have successfully captured fine-grained semantics. However, the semantic similarity exhibited in these word embeddings is not suitable for resolving bridging anaphora, which requires knowledge of associative similarity (i.e., relatedness) instead of semantic similarity information between synonyms or hypernyms. We create word embeddings (embeddings_PP) to capture such relatedness by exploring the syntactic structure of noun phrases. We demonstrate that using embeddings_PP alone achieves an accuracy of around 30% for bridging anaphora resolution on the ISNotes corpus. Furthermore, we achieve a substantial gain over the state-of-the-art system (Hou et al., 2013b) for bridging antecedent selection.


Introduction
Bridging (Clark, 1975; Prince, 1981; Gundel et al., 1993) establishes entity coherence in a text by linking anaphors and antecedents via various non-identity relations. In Example 1, the link between the bridging anaphor (the chief cabinet secretary) and the antecedent (Japan) establishes local (entity) coherence.
(1) Yet another political scandal is racking Japan. On Friday, the chief cabinet secretary announced that eight cabinet ministers had received five million yen from the industry.
Choosing the right antecedents for bridging anaphors is a subtask of bridging resolution. For this subtask, most previous work (Poesio et al., 2004; Lassalle and Denis, 2011; Hou et al., 2013b) calculates semantic relatedness between an anaphor and its antecedent based on word co-occurrence counts using certain syntactic patterns.
Most recently, word embeddings have gained a lot of popularity in the NLP community because they reflect human intuitions about semantic similarity and relatedness. Most word representation models explore the distributional hypothesis, which states that words occurring in similar contexts have similar meanings (Harris, 1954). State-of-the-art word representations such as word2vec skip-gram (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) have been shown to perform well across a variety of NLP tasks, including textual entailment (Rocktäschel et al., 2016), reading comprehension (Chen et al., 2016), and information status classification (Hou, 2016). However, these word embeddings capture both "genuine" similarity and relatedness, and they may in some cases be detrimental to downstream performance (Kiela et al., 2015). Bridging anaphora resolution is one such case: it requires lexical association knowledge instead of semantic similarity information between synonyms or hypernyms. In Example 1, among all antecedent candidates, "the chief cabinet secretary" is the most similar word to the bridging anaphor "eight cabinet ministers", but obviously it is not the antecedent for the latter.
In this paper, we explore the syntactic structure of noun phrases (NPs) to derive contexts for nouns in the GloVe model. We find that the prepositional structure (e.g., X of Y) and the possessive structure (e.g., Y's X) are a useful context source for the representation of nouns in terms of relatedness for bridging relations.
We demonstrate that using our word embeddings based on PP contexts (embeddings PP) alone achieves an accuracy of around 30% on bridging anaphora resolution in the ISNotes corpus, which is 12% better than the original GloVe word embeddings. Moreover, adding a feature based on embeddings PP leads to a significant improvement over a state-of-the-art system on bridging anaphora resolution (Hou et al., 2013b).

Related Work
Bridging anaphora resolution. Anaphora plays an important role in discourse comprehension. Unlike identity anaphora, in which a noun phrase refers back to the same entity introduced by previous descriptions in the discourse, bridging anaphora links anaphors and antecedents via lexico-semantic, frame or encyclopedic relations.
Bridging resolution has to recognize bridging anaphors and find links to antecedents. There have been a few works tackling full bridging resolution (Hahn et al., 1996; Hou et al., 2014). In recent years, various computational approaches have been developed for bridging anaphora recognition (Markert et al., 2012; Hou et al., 2013a) and for bridging antecedent selection (Poesio et al., 2004; Hou et al., 2013b). This work falls into the latter category, and we create a new lexical knowledge resource for the task of choosing antecedents for bridging anaphors.
Previous work on bridging anaphora resolution (Poesio et al., 2004; Lassalle and Denis, 2011; Hou et al., 2013b) explores word co-occurrence counts in certain syntactic preposition patterns to calculate word relatedness. These patterns encode associative relations between nouns which cover a variety of bridging relations. Our PP context model exploits the same principle but is more general. Unlike previous work, which only considers a small number of prepositions per anaphor, the PP context model considers all prepositions for all nouns in big corpora. It also includes the possessive structure of NPs. The resulting word embeddings are a general resource for bridging anaphora resolution. In addition, they enable efficient computation of word association strength through low-dimensional matrix operations.
Enhanced word embeddings. Recently, a few approaches have investigated different ways to improve vanilla word embeddings. Levy and Goldberg (2014) explore dependency-based contexts in the skip-gram model. The authors replace the linear bag-of-words contexts in the original skip-gram model with syntactic contexts derived from automatically parsed dependency trees. They observe that the dependency-based embeddings exhibit more functional similarity than the original skip-gram embeddings. Heinzerling et al. (2017) show that incorporating dependency-based word embeddings into their selectional preference model slightly improves coreference resolution performance. Kiela et al. (2015) try to learn word embeddings for similarity and relatedness separately by utilizing a thesaurus and a collection of psychological association norms. The authors report that their relatedness-specialized embeddings perform better on document topic classification than similarity embeddings. Schwartz et al. (2016) demonstrate that symmetric patterns (e.g., X or Y) are the most useful contexts for the representation of verbs and adjectives. Our work follows in this vein, and we are interested in learning word representations for bridging relations.

Asymmetric Prepositional and Possessive Structures
The syntactic prepositional and possessive structures of NPs encode a variety of bridging relations between anaphors and their antecedents. For instance, "the rear door of that red car" indicates the part-of relation between "door" and "car", and "the company's new appointed chairman" implies the employment relation between "chairman" and "company". We therefore extract the noun pairs door-car and chairman-company by using the syntactic structure of NPs which contain prepositions or possessive forms. It is worth noting that the bridging relations expressed in the above syntactic structures are asymmetric. So for each noun pair, we keep the head on the left and the noun modifier on the right. However, many nouns can appear in both positions, as in "travelers in the train station", "travelers from the airport", "hotels for travelers", and "the destination for travelers". To capture the differences between these two positions, we add the postfix " PP" to the nouns on the left. Thus we extract the following four pairs from the above NPs: travelers PP-station, travelers PP-airport, hotels PP-travelers, destination PP-travelers.
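The pair construction described above can be illustrated with a minimal Python sketch. It assumes that a parser has already produced (head, modifier) tuples for each prepositional or possessive NP; the function name `make_pair` and the underscore spelling of the postfix (`_PP`) are illustrative assumptions, not details confirmed by the text.

```python
# Assumption: the postfix is spelled "_PP"; the exact spelling is illustrative.
POSTFIX = "_PP"

def make_pair(head, modifier):
    """Return an asymmetric pair: the head noun is marked, the modifier is not."""
    return (head + POSTFIX, modifier)

# Hypothetical (head, modifier) tuples for the four example NPs in the text,
# e.g. "hotels for travelers" -> head "hotels", modifier "travelers".
parsed = [
    ("travelers", "station"),
    ("travelers", "airport"),
    ("hotels", "travelers"),
    ("destination", "travelers"),
]

pairs = [make_pair(head, modifier) for head, modifier in parsed]
for left, right in pairs:
    print(left, right)
```

Because the marked and unmarked forms are different tokens, "travelers" as a head and "travelers" as a modifier receive separate vectors during training.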

Word Embeddings Based on PP Contexts (embeddings PP)
Our PP context model is based on GloVe (Pennington et al., 2014), which obtains state-of-the-art results on various NLP tasks. We extract noun pairs as described in Section 3.1 from the automatically parsed Gigaword corpus (Parker et al., 2011; Napoles et al., 2012). We treat each noun pair as a sentence containing only two words and concatenate all 197 million noun pairs in one document. We employ the GloVe toolkit (https://github.com/stanfordnlp/GloVe) to train the PP context model on the extracted noun pairs. All tokens are converted to lowercase, and words that appear fewer than 10 times are filtered. This results in a vocabulary of around 276k words and 188k distinct nouns without the postfix " PP". We set the context window size to two and keep the other parameters the same as in Pennington et al. (2014). We report results for 100-dimensional embeddings, though similar trends were also observed with 200 and 300 dimensions. For comparison, we also trained 100-dimensional word embeddings (GloVe Giga) on the whole Gigaword corpus, using the same parameters reported in Pennington et al. (2014).
Table 1 lists a few target words and their top five nearest neighbors (using cosine similarity) in embeddings PP and GloVe Giga respectively. For the target words "residents" and "members", both embeddings PP and GloVe Giga yield a list of similar words, and most of them have the same semantic type as the target word. For the "travelers" example, GloVe Giga still presents similar words with the same semantic type, while embeddings PP generates both similar words and related words (words containing the postfix " PP"). More importantly, it seems that embeddings PP can find reasonable semantic roles for nominal predicates (target words containing the postfix " PP"). For instance, "president PP" is mostly related to countries or organizations, and "residents PP" is mostly related to places.
The above examples can be seen as a qualitative evaluation of our PP context model. We assume that embeddings PP can serve as a lexical knowledge resource for bridging antecedent selection. In the next section, we demonstrate the effectiveness of embeddings PP for the task of bridging anaphora resolution.
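As a rough illustration of the preprocessing described above, the following Python sketch turns noun pairs into two-word "sentences", lowercases the nouns, and computes which tokens survive a minimum-frequency threshold. The function name `build_corpus`, the `_PP` spelling of the postfix, and the choice to lowercase before attaching the postfix are assumptions made for the example; the actual training is done with the GloVe toolkit, which is not invoked here.

```python
from collections import Counter

POSTFIX = "_PP"  # assumed spelling of the postfix

def build_corpus(noun_pairs, min_count=10):
    """Serialize (head, modifier) pairs as two-word lines and filter rare tokens.

    Returns the list of lines and the set of tokens whose frequency
    is at least min_count.
    """
    lines = [f"{head.lower()}{POSTFIX} {modifier.lower()}"
             for head, modifier in noun_pairs]
    counts = Counter(token for line in lines for token in line.split())
    vocab = {token for token, count in counts.items() if count >= min_count}
    return lines, vocab

# Toy input; a threshold of 2 is used only to keep the example small.
toy_pairs = [("Travelers", "station"), ("travelers", "airport"),
             ("hotels", "travelers")]
lines, vocab = build_corpus(toy_pairs, min_count=2)
print(lines)
print(vocab)
```

With a context window of two, the only co-occurrence observed inside each two-word line is the one between the marked head and its modifier.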

Quantitative Evaluation
For the task of bridging anaphora resolution, we use the ISNotes dataset released by Hou et al. (2013b). This dataset contains around 11,000 NPs annotated for information status, including 663 bridging NPs and their antecedents, in 50 texts taken from the WSJ portion of the OntoNotes corpus (Weischedel et al., 2011). It is notable that bridging anaphors in ISNotes are not limited to definite NPs as in previous work (Poesio et al., 1997, 2004; Lassalle and Denis, 2011).
The semantic relations between anaphor and antecedent in the corpus are quite diverse: only 14% of anaphors have a part-of/attribute-of relation with the antecedent and only 7% of anaphors stand in a set relationship to the antecedent. The remaining 79% of anaphors have an "other" relation with their antecedents, without further distinction. This includes encyclopedic relations such as the waiter - restaurant as well as context-specific relations such as the thieves - palms.
We follow the experimental setup of Hou et al. (2013b) and reimplement MLN model II as our baseline. We first test the effectiveness of embeddings PP alone to resolve bridging anaphors. Then we show that incorporating embeddings PP into MLN model II significantly improves the result.

Using embeddings PP Alone
For each anaphor a, we simply construct the list of antecedent candidates E_a using NPs preceding a from the same sentence as well as from the previous two sentences. Hou et al. (2013b) found that globally salient entities are likely to be the antecedents of all anaphors in a text. We approximate this by adding NPs from the first sentence of the text to E_a. This is motivated by the fact that ISNotes is a newswire corpus and globally salient entities are often introduced in the beginning of an article. On average, each bridging anaphor has 19 antecedent candidates using this simple antecedent candidate selection strategy.
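The candidate selection strategy above can be sketched in Python as follows. The representation of a text as a list of sentences, each a list of NP strings in textual order, and the function name `antecedent_candidates` are assumptions made for the example.

```python
def antecedent_candidates(sentences, sent_idx, np_idx, window=2):
    """Collect antecedent candidates for the NP at sentences[sent_idx][np_idx].

    Candidates are the NPs of the first sentence, the NPs of the previous
    `window` sentences, and the NPs preceding the anaphor in its own sentence.
    Each NP position is added at most once.
    """
    candidates = []
    seen = set()

    def add(s, i):
        if (s, i) not in seen:
            seen.add((s, i))
            candidates.append(sentences[s][i])

    # Approximation of global salience: NPs from the first sentence.
    if sent_idx > 0:
        for i in range(len(sentences[0])):
            add(0, i)
    # NPs from the previous `window` sentences.
    for s in range(max(0, sent_idx - window), sent_idx):
        for i in range(len(sentences[s])):
            add(s, i)
    # NPs preceding the anaphor in the same sentence.
    for i in range(np_idx):
        add(sent_idx, i)
    return candidates

# Toy text with five sentences; the anaphor is "G" in the last sentence.
toy_text = [["A", "B"], ["C"], ["D"], ["E"], ["F", "G"]]
print(antecedent_candidates(toy_text, sent_idx=4, np_idx=1))
```

In the toy text, "C" is excluded because it lies outside both the two-sentence window and the first sentence.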
Given an anaphor a and its antecedent candidate list E_a, we predict the most related NP among all NPs in E_a as the antecedent for a. The relatedness is measured via cosine similarity between the head of the anaphor (plus the postfix " PP") and the head of the candidate.
This simple deterministic approach based on embeddings PP achieves an accuracy of 30.32% on the ISNotes corpus. Following Hou et al. (2013b), accuracy is calculated as the proportion of the correctly resolved bridging anaphors out of all bridging anaphors in the corpus.
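The deterministic selection and the accuracy measure can be sketched as below. The toy two-dimensional vectors are invented for illustration and do not come from the trained embeddings; the function names, the `_PP` spelling of the postfix, and the decision to return no prediction for out-of-vocabulary anaphor heads are likewise assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def select_antecedent(anaphor_head, candidate_heads, vectors, postfix="_PP"):
    """Pick the candidate head most related to the postfixed anaphor head."""
    query = vectors.get(anaphor_head + postfix)
    if query is None:
        return None  # assumption: no prediction for unknown anaphor heads
    scored = [(cosine(query, vectors[c]), c)
              for c in candidate_heads if c in vectors]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]

def accuracy(predictions, gold):
    """Proportion of correctly resolved anaphors out of all anaphors."""
    correct = sum(1 for p, g in zip(predictions, gold) if p == g)
    return correct / len(gold)

# Invented toy vectors, for illustration only.
toy_vectors = {
    "secretary_PP": [1.0, 0.1],
    "japan": [0.9, 0.2],
    "ministers": [0.1, 1.0],
    "scandal": [0.3, 0.8],
}
prediction = select_antecedent("secretary", ["scandal", "japan", "ministers"],
                               toy_vectors)
print(prediction)
```

Unresolved anaphors count as errors in `accuracy`, which matches the definition of accuracy over all bridging anaphors in the corpus.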
We found that using embeddings PP outperforms using other word embeddings by a large margin (see Table 2), including the original GloVe vectors trained on Gigaword and Wikipedia 2014 dump (GloVe GigaWiki14) and GloVe vectors that we trained on Gigaword only (GloVe Giga). This confirms our observation in Section 3.2 that embeddings PP can capture the relatedness between anaphor and antecedent for various bridging relations.
                            acc
GloVe GigaWiki14          18.10
GloVe Giga                19.00
embeddings wo PPSuffix    22.17
embeddings PP             30.32
Table 2: Results of embeddings PP alone for bridging anaphora resolution compared to the baselines. Bold indicates statistically significant differences over the baselines using randomization test (p < 0.01).
To understand the role of the suffix " PP" in embeddings PP, we trained word vectors embeddings wo PPSuffix using the same noun pairs as in embeddings PP. For each noun pair, we remove the suffix " PP" attached to the head noun. We found that using embeddings wo PPSuffix only achieves an accuracy of 22.17% (see Table 2). This indicates that the suffix " PP" is the most significant factor in embeddings PP. Note that when calculating cosine similarity based on the first three word embeddings in Table 2, we do not add the suffix " PP" to the head of a bridging anaphor because such words do not exist in these word vectors.

MLN model II + embeddings PP
MLN model II is a joint inference framework based on Markov logic networks (Domingos and Lowd, 2009). In addition to modeling the semantic, syntactic and lexical constraints between the anaphor and the antecedent (local constraints), it models that:
• semantically or syntactically related anaphors are likely to share the same antecedent (joint inference constraints);
• a globally salient entity is preferred to be the antecedent of all anaphors in a text even if the entity is distant to the anaphors (global salience constraints);
• several bridging relations are strongly signaled by the semantic classes of the anaphor and the antecedent, e.g., a job title anaphor such as chairman prefers a GPE or an organization antecedent (semantic class constraints).
Due to the space limit, we omit the details of MLN model II, but refer the reader to Hou et al. (2013b) for a full description.
We add one constraint into MLN model II based on embeddings PP: each bridging anaphor a is linked to its most related antecedent candidate using cosine similarity. We use the same strategy as in the previous section to construct the list of antecedent candidates for each anaphor. Unlike the previous section, which only uses the vector of the NP head to calculate relatedness, here we include all common nouns occurring before the NP head as well because they also represent the core semantics of an NP (e.g., "earthquake victims" and "the state senate").
Specifically, given an NP, we first construct a list N which consists of the head and all common nouns appearing before the head. We then represent the NP as a vector v using the following formula, where the suffix " PP" is added to each n if the NP is a bridging anaphor:
$$v = \frac{\sum_{n \in N} \mathrm{embeddings\_PP}_n}{|N|} \qquad (1)$$
Table 3 shows that adding the constraint based on embeddings PP improves the result of MLN model II by 4.5%. However, adding the constraint based on the vanilla word embeddings (GloVe GigaWiki14) or the word embeddings without the suffix " PP" (embeddings wo PPSuffix) slightly decreases the result compared to MLN model II. Although MLN model II already explores preposition patterns to calculate relatedness between head nouns of NPs, it seems that the feature based on embeddings PP is complementary to the original preposition pattern feature. Furthermore, the vector model allows us to represent the meaning of an NP beyond its head easily.
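The averaging in Equation 1 can be sketched in Python as follows. The function name `np_vector`, the `_PP` spelling of the suffix, and the toy vectors are assumptions for illustration. One further assumption goes beyond the formula: nouns missing from the vocabulary are skipped and the average is taken over the nouns that were found, since the text does not say how out-of-vocabulary nouns are handled.

```python
def np_vector(nouns, vectors, is_anaphor, postfix="_PP"):
    """Average the vectors of the head and its pre-head common nouns.

    nouns: the list N (head plus common nouns before the head).
    is_anaphor: if True, the postfix is attached to every noun before lookup.
    Returns None when no noun has a vector.
    """
    keys = [n + postfix if is_anaphor else n for n in nouns]
    # Assumption: out-of-vocabulary nouns are skipped rather than zero-filled.
    found = [vectors[k] for k in keys if k in vectors]
    if not found:
        return None
    dim = len(found[0])
    return [sum(vec[d] for vec in found) / len(found) for d in range(dim)]

# Invented toy vectors, for illustration only.
toy_vectors = {
    "earthquake": [1.0, 0.0],
    "victims": [0.0, 1.0],
    "earthquake_PP": [2.0, 0.0],
    "victims_PP": [0.0, 4.0],
}
candidate_vec = np_vector(["earthquake", "victims"], toy_vectors, is_anaphor=False)
anaphor_vec = np_vector(["earthquake", "victims"], toy_vectors, is_anaphor=True)
print(candidate_vec, anaphor_vec)
```

The resulting NP vectors can then be compared with cosine similarity in the same way as the head-only vectors of the previous section.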

Conclusions
We present a PP context model based on GloVe by exploring the asymmetric prepositional structure (e.g., X of Y) and possessive structure (e.g., Y's X) of NPs. We demonstrate that the resulting word vectors (embeddings PP) are able to capture the relatedness between anaphor and antecedent in various bridging relations. In addition, adding the constraint based on embeddings PP yields a significant improvement over a state-of-the-art system on bridging anaphora resolution in ISNotes (Hou et al., 2013b).
For the task of bridging anaphora resolution, Hou et al. (2013b) pointed out that future work needs to explore wider context to resolve context-specific bridging relations. Here we combine the semantics of pre-nominal modifications and the head by vector average using embeddings PP. We hope that our embedding resource will facilitate further research into improved context modeling for bridging relations.