Revisiting Selectional Preferences for Coreference Resolution

Selectional preferences have long been claimed to be essential for coreference resolution. However, they are modeled only implicitly by current coreference resolvers. We propose a dependency-based embedding model of selectional preferences which allows fine-grained compatibility judgments with high coverage. Incorporating our model improves performance, matching state-of-the-art results of a more complex system. However, it comes with a cost that makes it debatable how worthwhile such improvements are.


Introduction
Selectional preferences have long been claimed to be useful for coreference resolution. In his seminal work on "Resolving Pronominal References" Hobbs (1978) proposed a semantic approach that requires reasoning about the "demands the predicate makes on its arguments." For example, selectional preferences allow resolving the pronoun it in the text "The Titanic hit an iceberg. It sank quickly." Here, the predicate sink 'prefers' certain subject arguments over others: It is plausible that a ship sinks, but implausible that an iceberg does.
Since negative results are rarely reported, there is no clear evidence in the literature regarding the non-utility of particular knowledge sources. Consequently, the absence of explicit modeling of selectional preferences in the recent literature suggests that incorporating this knowledge source has not been very successful for coreference resolution.
More than ten years ago, Kehler et al. (2004) declared the "non-utility of predicate-argument structures for pronoun resolution" and observed that minor improvements on a small dataset were due to fortuity rather than selectional preferences having captured meaningful world knowledge relations.
The claim by Kehler et al. (2004) is based on selectional preferences extracted from 2.8 million predicate-argument pairs, a small number by current standards. Furthermore, they employ a simple (linear) maximum entropy classifier, which requires manual definition of feature combinations and is unlikely to fully capture the complex interaction between selectional preferences and other coreference features. Therefore, it is worth revisiting how a better selectional preference model affects the performance of a more complex coreference resolver.
In this work, we propose a fine-grained, high-coverage model of selectional preferences and study its impact on a state-of-the-art, non-linear coreference resolver. We show that incorporating our selectional preference model improves performance. However, it is debatable whether such small improvements, which cost notable extra time or resources, are advantageous.

Modeling Selectional Preferences
The main design choice when modeling selectional preferences is the selection of a relation inventory, i.e. the concepts and entities that can be relation arguments, and the semantic relationships that hold between them. Prior work has studied many relation inventories. Predicate-argument statistics for word-word pairs (eat, food) are easy to obtain but do not generalize to unseen pairs (Dagan and Itai, 1990). Class-based approaches generalize via word-class pairs (eat, /nutrient/food) (Resnik, 1993) or class-class pairs (/ingest, /nutrient/food) (Agirre and Martinez, 2001), but require disambiguation of words to classes and are limited by the coverage of the lexical resource providing such classes (e.g. WordNet).
Other possible relation inventories include semantic representations such as FrameNet frames and roles, event types and arguments, or abstract meaning representations. While these semantic representations are arguably well-suited to model meaningful world knowledge relationships, automatic annotation is limited in speed and accuracy, making it difficult to obtain a large number of such "more semantic" predicate-argument pairs. In comparison, syntactic parsing is both fast and accurate, making it trivial to obtain a large number of accurate, albeit "less semantic" predicate-argument pairs. The drawback of a syntactic model of selectional preferences is susceptibility to lexical and syntactic variation. For example, The Titanic sank and The ship went under differ lexically and syntactically, but would have the same or a very similar representation in a semantic framework such as FrameNet.
Our model overcomes this drawback via distributed representation of predicate-argument pairs, using (syntactic) dependencies that were specifically designed for semantic downstream tasks, and by resolving named entities to their fine-grained entity types.
Distributed representation. Inspired by Structured Vector Space (Erk and Padó, 2008), we embed predicates and arguments into a low-dimensional space in which (representations of) predicate slots are close to (representations of) their plausible arguments, as are arguments that tend to fill the same slots of similar predicates, and predicate slots that have similar arguments. For example, captain should be close to pilot, ship to airplane, the subject of steer close to both captain and pilot, and also to, e.g., the subject of drive. Such a space allows judging the plausibility of unseen predicate-argument pairs. We construct this space via dependency-based word embeddings (Levy and Goldberg, 2014). To see why this choice is better suited for modeling selectional preferences than alternatives such as word2vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014), consider the words captain, ship, pilot, and airplane. Captain and ship have high syntagmatic similarity, i.e., these words are semantically related and tend to occur close to each other. This also holds for pilot and airplane. In contrast, captain and pilot, as well as ship and airplane, have high paradigmatic similarity, i.e., they are semantically similar and occur in similar contexts. A model of selectional preferences requires paradigmatic similarity: The representations of captain and pilot in such a model should be similar, since they both can plausibly fill the subject slot of the predicate steer. Due to their use of linear context windows, word2vec and GloVe capture syntagmatic similarity, while dependency-based embeddings capture paradigmatic similarity (cf. Levy and Goldberg, 2014).
Enhanced++ dependencies. Due to distributed representation, our model generalizes over syntactic variation such as active/passive alternations: For example, steer@dobj is highly similar to steer@nsubjpass (see Appendix for more examples). To further mitigate the effect of employing syntax as a proxy for semantics, we use Enhanced++ dependencies (Schuster and Manning, 2016). Enhanced++ dependencies aim to support semantic applications by modifying syntactic parse trees to better reflect relations between content words. For example, the plain syntactic parse of the sentence Both of the girls laughed identifies Both as subject of laughed. The Enhanced++ representation introduces a subject relation between girls and laughed, which allows learning more meaningful selectional preferences: Our model should learn that girls (and other humans) laugh, while learning that an unspecified both laughs is not helpful.
Fine-grained entity types. A good model of selectional preferences needs to generalize over named entities. For example, having encountered sentences like The Titanic sank, our model should be able to judge the plausibility of an unseen sentence like The RMS Lusitania sank. For popular named entities, we can expect the learned representations of Titanic and RMS Lusitania to be similar, allowing our model to generalize, i.e., it can judge the plausibility of The RMS Lusitania sank by virtue of the similarity between Titanic and RMS Lusitania. However, this will not work for rare or emerging named entities, for which no, or only low-quality, distributed representations have been learned. To address this issue, we incorporate fine-grained entity typing (Ling and Weld). For each named entity encountered during training, we generate an additional training instance by replacing the named entity with its entity type, e.g. (Titanic, sank@nsubj) yields (/product/ship, sank@nsubj).
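The entity-type augmentation described above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function name augment_with_entity_types and the entity_types lookup table are hypothetical, and the lookup stands in for the entity linking and type mapping steps.

```python
from typing import Dict, Iterable, Iterator, Tuple

Pair = Tuple[str, str]  # (term, context), e.g. ("Titanic", "sank@nsubj")


def augment_with_entity_types(
    pairs: Iterable[Pair], entity_types: Dict[str, str]
) -> Iterator[Pair]:
    """Yield every pair, plus a typed copy for terms with a known entity type."""
    for term, context in pairs:
        yield term, context
        entity_type = entity_types.get(term)
        if entity_type is not None:
            # Additional training instance: the named entity is replaced by its type.
            yield entity_type, context


# Hypothetical lookup table; a real system would obtain types via entity linking.
entity_types = {"Titanic": "/product/ship"}
pairs = [("Titanic", "sank@nsubj"), ("iceberg", "hit@dobj")]
augmented = list(augment_with_entity_types(pairs, entity_types))
```

In this sketch the original pair is kept, so the typed instance is added to the training data rather than substituted for the named entity.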

Implementation
We train our model by combining term-context pairs from two sources. Noun phrases and their dependency context are extracted from GigaWord (Parker et al., 2011) and entity types in context from Wikilinks (Singh et al., 2012). Term-context pairs are obtained by parsing each corpus with the Stanford CoreNLP dependency parser (Manning et al., 2014). After filtering, this yields ca. 1.4 billion phrase-context pairs such as (Titanic, sank@nsubj) from GigaWord and ca. 12.9 million entity type-context pairs such as (/product/ship, sank@nsubj) from Wikilinks. Finally, we train dependency-based embeddings using the generalized word2vec version by Levy and Goldberg (2014), obtaining distributed representations of selectional preferences. To identify fine-grained types of named entities at test time, we first perform entity linking using the system by Heinzerling et al. (2016), then query Freebase (Bollacker et al., 2008) for entity types and apply the mapping to fine-grained types by Ling and Weld.
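As a rough illustration of the term-context pair format, the following Python sketch converts dependency triples into pairs such as (Titanic, sank@nsubj). The triple layout (governor, relation, dependent), the function name term_context_pairs, and the set of retained relations are assumptions made for this example; they do not reflect the CoreNLP output format or the filtering actually applied.

```python
from typing import Iterable, Iterator, Tuple

Dependency = Tuple[str, str, str]  # (governor, relation, dependent); assumed layout

# Assumption: only a few argument relations are kept in this sketch.
ARGUMENT_RELATIONS = {"nsubj", "nsubjpass", "dobj"}


def term_context_pairs(deps: Iterable[Dependency]) -> Iterator[Tuple[str, str]]:
    """Turn dependency triples into (term, governor@relation) pairs."""
    for governor, relation, dependent in deps:
        if relation in ARGUMENT_RELATIONS:
            yield dependent, f"{governor}@{relation}"


# Toy parse of "The Titanic sank quickly" (hand-written, not parser output).
parsed = [("sank", "nsubj", "Titanic"), ("sank", "advmod", "quickly")]
pairs = list(term_context_pairs(parsed))
```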
The plausibility of an argument filling a particular predicate slot can now be computed via the cosine similarity of their associated embeddings. For example, in our trained model, the similarity of (Titanic, sank@nsubj) is 0.11 while the similarity of (iceberg, sank@nsubj) is -0.005, indicating that an iceberg sinking is less plausible.
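The plausibility computation can be sketched in a few lines of Python. The two-dimensional vectors below are toy values chosen for illustration and do not reproduce the similarities of the trained model; the function names and the single embeddings dictionary are likewise assumptions.

```python
import math
from typing import Dict, Optional, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def plausibility(
    term: str, slot: str, embeddings: Dict[str, Vector]
) -> Optional[float]:
    """Similarity of an argument and a predicate slot, or None if either is unknown."""
    if term not in embeddings or slot not in embeddings:
        return None
    return cosine(embeddings[term], embeddings[slot])


# Toy embeddings (hypothetical values, not taken from the trained model).
embeddings = {
    "Titanic": [0.9, 0.1],
    "iceberg": [-0.2, 0.8],
    "sank@nsubj": [1.0, 0.2],
}
ship_score = plausibility("Titanic", "sank@nsubj", embeddings)
iceberg_score = plausibility("iceberg", "sank@nsubj", embeddings)
```

With these toy vectors, ship_score is larger than iceberg_score, mirroring the ordering reported above. Returning None for unseen items keeps missing embeddings distinct from low similarity.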

Do Selectional Preferences Benefit Coreference Resolution?
We now investigate the effect of incorporating selectional preferences, implicitly and explicitly, in coreference resolution. Figure 2 shows the selectional preference similarity of 10,000 coreferent and 10,000 non-coreferent mention pairs sampled randomly from the CoNLL 2012 training set. As we can see, while coreferent mention pairs are more similar than non-coreferent mention pairs according to the selectional preference similarity, there is no direct relation between the similarity values and the coreference relation. This indicates that coreference does not have a linear relation to the selectional preference similarities. However, it is worth investigating how these similarity values affect the overall performance when they are combined with other knowledge sources in a non-linear way.
We select the ranking model of deep-coref (Clark and Manning, 2016b) as our baseline. deep-coref is a neural model that combines the input features through several hidden layers. Baseline in Table 1 reports our baseline results on the CoNLL 2012 test set. The results are reported using MUC (Vilain et al., 1995), B³ (Bagga and Baldwin, 1998), CEAF_e (Luo, 2005), the average F1 score of these three metrics, i.e. CoNLL score, and LEA (Moosavi and Strube, 2016b). deep-coref includes the embeddings of the dependency governor of mentions. Combined with the relative position of a mention to its governor, deep-coref may be able to implicitly capture selectional preferences to some extent. −gov in Table 1 represents deep-coref performance when governors are not incorporated. As we can see, the exclusion of the governor information does not affect the performance. This result shows that the implicit modeling of selectional preferences does not provide any additional information to the coreference resolver.
For each mention, we consider (1) the whole mention string, (2) the whole mention string without articles, (3) mention head, (4) context representation, i.e. governor@dependency-relation, and (5) entity types if the mention is a named entity. We obtain an embedding for each of the above properties if they exist in the selectional preference model, otherwise we set them to unknown.
For each (antecedent, anaphor) pair, we consider all the acquired embeddings of anaphor and antecedent. We try two different ways of incorporating this knowledge into deep-coref: (1) incorporating the computed embeddings directly as a new set of inputs, i.e. +embedding in Table 2, where we add a new hidden layer on top of the new embeddings and combine its output with the outputs of the hidden layers associated with other sets of inputs; and (2) computing a similarity value between all possible combinations of the antecedent-anaphor acquired embeddings and then binarizing all similarity values, i.e. +binned sim. in Table 2.
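The second variant can be sketched as follows. This is a minimal Python illustration, not the feature extractor used in the experiments: the bin edges, the one-hot encoding with a dedicated slot for unknown embeddings, and all names are assumptions.

```python
import math
from itertools import product
from typing import Dict, List, Optional, Sequence

Vector = Sequence[float]

# Assumption: illustrative bin edges; the actual binning scheme is not specified here.
BIN_EDGES = (-0.5, 0.0, 0.25, 0.5, 0.75)


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def one_hot_bin(sim: Optional[float]) -> List[int]:
    """Binary features: slot 0 flags an unknown embedding, the rest are similarity bins."""
    feats = [0] * (len(BIN_EDGES) + 2)
    if sim is None:
        feats[0] = 1
    else:
        bin_index = sum(sim >= edge for edge in BIN_EDGES)  # 0 .. len(BIN_EDGES)
        feats[1 + bin_index] = 1
    return feats


def pair_features(
    antecedent: Dict[str, Optional[Vector]], anaphor: Dict[str, Optional[Vector]]
) -> List[int]:
    """Binned similarities for all combinations of antecedent and anaphor properties."""
    feats: List[int] = []
    for ante_key, ana_key in product(sorted(antecedent), sorted(anaphor)):
        u, v = antecedent[ante_key], anaphor[ana_key]
        sim = None if u is None or v is None else cosine(u, v)
        feats.extend(one_hot_bin(sim))
    return feats


# Toy mention properties; None marks a property missing from the model.
antecedent = {"head": [1.0, 0.0], "context": None}
anaphor = {"head": [1.0, 0.0]}
features = pair_features(antecedent, anaphor)
```

Each property combination contributes a fixed-length binary block, so the feature vector has the same layout for every mention pair as long as the same property keys are used.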
Providing selectional preference embeddings directly to deep-coref adds more complexity to the baseline coreference resolver. Yet, it performs on par with +binned sim. on the development set and generalizes worse on the test set. +SP in Table 1 is the performance of +binned sim. on the test set. As we can see from the results, adding selectional preferences improves the performance only slightly.
Reinforce in Table 1 presents the results of the reward-rescaling model of Clark and Manning (2016a) that are so far the highest reported results on the official test set. The reward rescaling model of Clark and Manning (2016a) casts the ranking model of Clark and Manning (2016b) in the reinforcement learning framework which considerably increases the training time, from two days to six days in our experiments.
We analyze how our selectional preference model affects the resolution of various types of mentions. We use Martschat and Strube (2014)'s toolkit to perform recall and error analyses. The differences in the number of recall and precision errors in +SP compared to baseline on the test set are reported in Table 4. By using our selectional preference features, the number of recall errors decreases for all types of mentions. The recall error reduction is more prominent for pronouns. On the other hand, the number of precision errors increases for all types of mentions. The increase in precision errors is highest for common nouns. Overall, +SP creates about 260 more links than baseline. Table 3 lists a few examples from the development set in which +SP creates a link that baseline does not. It also includes the similarity that has a high value for the linked mentions and probably is the reason for creating the link. For instance, in the first example, based on our model, similarity(impact@nsubj,shows@nsubj) is known and it is also higher than similarity(impact@dobj,shows@nsubj).
In order to estimate an upper bound on the expected performance boost, we run the baseline and +SP models only on anaphoric mentions. By using anaphoric mentions, the performance improves by one percent, based on both the CoNLL score and LEA. This result indicates that the incorporation of selectional preferences creates many links for non-anaphoric mentions, which in turn decreases precision. Therefore, the overall performance does not improve substantially when system mentions are used. deep-coref incorporates anaphoricity scores at resolution time. One possible way to further improve the results of +SP is to incorporate anaphoricity scores at the input level. In this way, the coreference resolver could learn to use selectional preferences mainly for mentions that are more likely to be anaphoric. However, given that the F1 score of current anaphoricity determiners or singleton detectors is only around 85 percent Strube, 2016a, 2017), the effect of using system anaphoricity scores might be small.

Conclusions
We introduce a new model of selectional preferences, which combines dependency-based word embeddings and fine-grained entity types. In order to be effective, a selectional preference model should (1) have high coverage so it can be used for large datasets like CoNLL, and (2) be combined with other knowledge sources in a non-linear way. Our selectional preference model slightly improves coreference resolution performance, but considering the extra resources that are required to train the model, it is debatable whether such small improvements are advantageous for solving coreference.