Use Generalized Representations, But Do Not Forget Surface Features

Only a year ago, all state-of-the-art coreference resolvers were using an extensive set of surface features. Recently, there has been a paradigm shift towards using word embeddings and deep neural networks, where the use of surface features is very limited. In this paper, we show that a simple SVM model with surface features outperforms more complex neural models for detecting anaphoric mentions. Our analysis suggests that generalized representations and surface features have different strengths, both of which should be taken into account for improving coreference resolution.


Introduction
Coreference resolution is the task of finding different mentions that refer to the same entity in a given text. Anaphoricity detection is an important step for coreference resolution. An anaphoricity detection module discriminates mentions that are coreferent with one of the previous mentions. If a system recognizes mention m as non-anaphoric, it does not need to make any coreferent links for the pairs in which m is the anaphor.
The current state-of-the-art coreference resolvers (Wiseman et al., 2016; Clark and Manning, 2016a; Clark and Manning, 2016b), as well as their anaphoricity detection modules, use deep neural networks, word embeddings and a small set of features describing surface properties of mentions. Although this small set of features has been shown to have a significant impact on the overall performance (Clark and Manning, 2016a), its use in the state-of-the-art systems is very limited in comparison to the embedding features.
In this paper, we first introduce a new neural model for anaphoricity detection that considerably outperforms the anaphoricity detection of the state-of-the-art coreference resolver, i.e. deep-coref introduced by Clark and Manning (2016a). However, we show that a simple SVM model, adapted from our coreferent mention detection approach (Moosavi and Strube, 2016), significantly outperforms the more complex neural models. We show that the SVM model also generalizes better than the neural model on a new domain other than the CoNLL dataset.

Discriminating Mentions for Coreference Resolution
The recognition of various categories of mentions can be beneficial for coreference resolution. The detection of the following categories is most common in the literature: (1) non-referential, (2) discourse-old, and (3) coreferent mentions. One can also discriminate other categories of mentions like mentions that are unlikely to be antecedents or discourse-new mentions (Uryupina, 2009). However, they are not common in comparison to the above categories.

Non-Referential Mentions
Non-referential mentions do not refer to an entity. These mentions only fill a syntactic position.
For instance, "it" in "it is raining" is a non-referential mention. The approaches proposed by Evans (2001), Müller (2006), Bergsma et al. (2008), and Bergsma and Yarowsky (2011) are examples of detecting non-referential cases of the pronoun it. Byron and Gegg-Harrison (2004) present a more general approach for detecting non-referential noun phrases.

Discourse-Old Mentions
Each mention can be assessed from the point of view of the discourse model (Prince, 1992). According to the discourse model, a mention may be new, old or inferable. Mentions which introduce a new entity into the discourse are discourse-new mentions. A discourse-new mention may be a singleton or it may be the first mention of a coreference chain. For instance, the first "Plato" in Example 2.1 is a discourse-new mention.
Example 2.1. Plato was a philosopher in Classical Greece. This philosopher is the founder of the Academy in Athens. Plato died at the age of 81.
A discourse-old mention refers to an entity that is already evoked in the discourse. Except for first mentions of coreference chains, other coreferent mentions are discourse-old. For instance, "this philosopher" and the second "Plato" in Example 2.1 are discourse-old mentions.
A mention is inferable if the hearer can infer the identity of the mention from another entity that has already been evoked in the discourse. "the windows" in Example 2.2 is an inferable mention. The detection of discourse-old mentions is commonly referred to as anaphoricity detection (e.g. Zhou and Kong (2009), Ng (2009), Wiseman et al. (2015), Lassalle and Denis (2015), inter alia) while the task of anaphoric mention detection, based on its original definition, is of no use for coreference resolution.
Mentions whose interpretations do not depend on previous mentions are called non-anaphoric mentions (van Deemter and Kibble, 2000). For example, both "Plato"s in Example 2.1 are non-anaphoric.
For consistency with the coreference literature, we refer to the task of discourse-old mention detection as anaphoricity detection.
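The labelling that follows from these definitions can be sketched in a few lines. This is only an illustration, not the procedure used in the paper: the `(start, end)` span encoding, the function name `label_discourse_old`, and the whitespace token offsets for Example 2.1 are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical span encoding: inclusive (start_token, end_token) offsets.
Mention = Tuple[int, int]


def label_discourse_old(mentions: List[Mention],
                        chains: List[List[Mention]]) -> Dict[Mention, bool]:
    """Map each mention to True if it is discourse-old (anaphoric in the sense used here)."""
    discourse_old = set()
    for chain in chains:
        # Every mention of a chain except the textually first one is discourse-old.
        ordered = sorted(chain)
        discourse_old.update(ordered[1:])
    # Singletons and chain-initial mentions stay discourse-new (False).
    return {m: m in discourse_old for m in mentions}


# Example 2.1 with whitespace token offsets (illustrative values).
plato_1, classical_greece, this_philosopher, plato_2 = (0, 0), (5, 6), (7, 8), (17, 17)
labels = label_discourse_old(
    [plato_1, classical_greece, this_philosopher, plato_2],
    [[plato_1, this_philosopher, plato_2]],
)
```

Under this labelling, the second "Plato" is a positive instance for anaphoricity detection even though its interpretation does not depend on a previous mention.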

Coreferent Mentions
Marneffe et al. (2015) discriminate mentions as coreferent vs. non-coreferent. Coreferent mentions are those mentions that appear in a coreference chain. A non-coreferent mention therefore can be a non-referential noun phrase or a referential noun phrase whose entity is only mentioned once (i.e. singleton). The proposed approaches of Recasens et al. (2013), Marneffe et al. (2015), and Moosavi and Strube (2016) discriminate mentions for coreference resolution this way.

Anaphoricity Detection Models
Anaphoricity detection is the most common approach for discriminating mentions for a coreference resolver. All of the state-of-the-art coreference resolvers use anaphoricity detection. In this paper, we compare three different anaphoricity detection approaches: two approaches using neural networks and word embeddings, and one using an SVM model and surface features. Clark and Manning (2016a) introduce the first neural model. Since Clark and Manning (2016a) train their anaphoricity model jointly with the coreference model, we refer to this model as the joint model. We introduce a new anaphoricity detection model as the second neural model using a Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997). The third approach is adapted from our state-of-the-art coreferent mention detection (Moosavi and Strube, 2016).

Joint Model
As one of the neural models for anaphoricity detection, we consider the anaphoricity module of deep-coref, the state-of-the-art coreference resolution system introduced by Clark and Manning (2016a). This model has three layers for encoding different types of information regarding a mention. The first layer encodes the word embeddings of the head, first, last, two previous/following words, and the syntactic parent of the mention. The second layer encodes the averaged word embeddings of the five previous/following words, all words of the mention, sentence words, and document words. The third layer encodes the following features of a mention: type, length, position and whether it is embedded in another mention. The outputs of these three layers are combined into one vector and then passed through a network with two hidden layers. This anaphoricity model is trained jointly with the deep-coref coreference model.

LSTM Model
In this section we propose a new neural model for anaphoricity detection. Apart from the properties of the mention itself, we consider a limited number of surrounding words. We first generalize the context of a mention by removing the mention from the context and replacing it with a special placeholder. In our experiments, we consider the 10 previous and following words of a mention. We concatenate the mention tokens and the head token to the generalized word sequence. We separate the head and mention tokens in the concatenated sequence using two different placeholders.
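The construction of the input sequence described above can be sketched as follows. This is a minimal illustration under stated assumptions: the placeholder strings, the function name `build_input_sequence`, and the order in which mention and head tokens are appended are not specified in the text and are chosen here only for the example.

```python
from typing import List

# Placeholder strings are hypothetical; the text does not name them.
MENTION_SLOT = "<MENTION>"  # replaces the mention inside its context
MENTION_SEP = "<M>"         # introduces the mention tokens
HEAD_SEP = "<H>"            # introduces the head token


def build_input_sequence(tokens: List[str], start: int, end: int,
                         head: int, window: int = 10) -> List[str]:
    """Mask the mention in its context, then append mention and head tokens.

    `start` and `end` are inclusive token indices of the mention; `head` is
    the index of its head word; `window` is the number of context words kept
    on each side.
    """
    left = tokens[max(0, start - window):start]
    right = tokens[end + 1:end + 1 + window]
    mention = tokens[start:end + 1]
    return (left + [MENTION_SLOT] + right
            + [MENTION_SEP] + mention
            + [HEAD_SEP, tokens[head]])


tokens = "Plato was a philosopher . This philosopher is the founder of the Academy".split()
# Mention "This philosopher" (tokens 5-6), head "philosopher"; a window of 3 keeps the demo short.
seq = build_input_sequence(tokens, start=5, end=6, head=6, window=3)
```

With the default `window=10`, the sketch matches the context size used in the experiments; the resulting token sequence would then be mapped to word embeddings.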
The word embeddings of the above sequence are encoded using a bidirectional LSTM. LSTMs have shown convincing results in generating meaningful representations for various NLP tasks.
We also incorporate a set of surface features that contains (1) mention type (proper, nominal (definite, indefinite), pronouns (he, I, it, she, they, we, you)), (2) string match in the text, (3) string match in the previous context, (4) head match in the text, (5) head match in the previous context, (6) contains tokens of another mention, (7) contains tokens of a previous mention, (8) contained in another mention, (9) contained in a previous mention, and (10) embedded in another mention. These features are concatenated with the output of the bidirectional LSTM and get passed through one more layer that generates the output.
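Features (2) to (9) can be approximated with simple string comparisons over the detected mentions. The sketch below is one possible reading, not the paper's implementation: the `(start, end, head)` mention encoding, the feature names, case-insensitive matching, and the interpretation of "contains tokens of another mention" as a proper-subset test over word sets are all assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: inclusive (start, end) token span plus head index.
Mention = Tuple[int, int, int]

CHECKS = ("string_match", "head_match", "contains", "contained_in")


def surface_features(tokens: List[str], mentions: List[Mention], i: int) -> Dict[str, bool]:
    """Match and containment features for mentions[i].

    `mentions` is assumed to be sorted by position, so index j < i means
    that mention j precedes mention i. The suffix `_any` refers to the
    whole text, `_prev` to the previous context only.
    """
    def words(m: Mention) -> List[str]:
        return [t.lower() for t in tokens[m[0]:m[1] + 1]]

    target = words(mentions[i])
    target_head = tokens[mentions[i][2]].lower()
    feats = {f"{name}_{scope}": False for name in CHECKS for scope in ("any", "prev")}

    for j, other in enumerate(mentions):
        if j == i:
            continue
        other_words = words(other)
        fired = {
            "string_match": other_words == target,
            "head_match": tokens[other[2]].lower() == target_head,
            "contains": set(other_words) < set(target),
            "contained_in": set(target) < set(other_words),
        }
        for name, value in fired.items():
            if value:
                feats[f"{name}_any"] = True
                if j < i:
                    feats[f"{name}_prev"] = True
    return feats


tokens = "Plato was a philosopher . This philosopher founded the Academy . Plato died".split()
mentions = [(0, 0, 0), (2, 3, 3), (5, 6, 6), (8, 9, 9), (11, 11, 11)]
first_plato = surface_features(tokens, mentions, 0)
this_philosopher = surface_features(tokens, mentions, 2)
second_plato = surface_features(tokens, mentions, 4)
```

The first "Plato" has a string match in the text but none in the previous context, which is the distinction that separates a chain-initial mention from a discourse-old one.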
We also experimented with a more complex model including two different LSTMs for encoding mentions and their surrounding words. We considered longer sequences of previous words and an attention mechanism for processing the long sequence. However, the performance did not improve upon the LSTM model, while the training time increased considerably.

Implementation Details
Hyperparameters are tuned on the CoNLL 2012 development set. We minimize the cross entropy loss using gradient-based optimization and the Adam update rule (Kingma and Ba, 2014). We use minibatches of size 50. A dropout (Hinton et al., 2012) with a rate of 0.3 is applied to the output of the LSTM. We initialize the embeddings with the 300-dimensional GloVe embeddings (Pennington et al., 2014). The size of the LSTM's hidden layer is set to 128. The model is trained for only one epoch.

SVM Model
Our SVM model, introduced in Moosavi and Strube (2016), achieves state-of-the-art results for coreferent mention detection. This model uses the following set of features: lemmas and POS tags of all words of a mention, lemmas and POS tags of the two previous/following words, mention string, mention length, mention type (proper, nominal, pronoun, list), string match in the text, and head match in the text. We use a similar SVM model for anaphoricity detection. In addition to the features we used for coreferent mention detection, we also add the following features for anaphoricity detection: string match in the previous context, head match in the previous context, mention words are contained in another mention, mention words are contained in a previous mention, mention contains words of another mention, mention contains words of a previous mention. Similar to Moosavi and Strube (2016), we use an anchored SVM (Goldberg and Elhadad, 2007) with a polynomial kernel of degree two and remove feature-values that occur fewer than 10 times. The use of an anchored SVM with pruning helps the model to generalize better on new domains (Goldberg and Elhadad, 2009).
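The pruning step and the kernel can be illustrated with a small sketch. It is not the anchored SVM itself: the feature encoding as name-to-value dictionaries, the function names, and the kernel offset `coef0 = 1` are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical encoding: one dict of feature name -> feature value per mention.
Instance = Dict[str, str]


def prune_rare_feature_values(instances: List[Instance],
                              min_count: int = 10) -> List[Instance]:
    """Drop every (feature, value) pair seen fewer than `min_count` times."""
    counts = Counter((name, value)
                     for inst in instances
                     for name, value in inst.items())
    return [
        {name: value for name, value in inst.items()
         if counts[(name, value)] >= min_count}
        for inst in instances
    ]


def poly2_kernel(a: Instance, b: Instance, coef0: float = 1.0) -> float:
    """Degree-two polynomial kernel over binary indicator features.

    For indicator features the dot product is the number of shared
    (feature, value) pairs; `coef0` is an assumed offset.
    """
    shared = sum(1 for name, value in a.items() if b.get(name) == value)
    return (shared + coef0) ** 2


train = ([{"head_lemma": "president", "type": "nominal"}] * 12
         + [{"head_lemma": "zyzzyva", "type": "nominal"}])
pruned = prune_rare_feature_values(train)
similarity = poly2_kernel(pruned[0], pruned[-1])
```

In this toy data the rare lemma "zyzzyva" is removed while the frequent mention type is kept, so the rare instance is represented only by features that are likely to recur in other domains. The degree-two kernel implicitly scores pairs of co-occurring features, such as a lemma together with a match feature.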

Performance Evaluation
We evaluate the anaphoricity models on the CoNLL 2012 dataset. It is worth noting that all of the examined anaphoricity detectors in this section use the same mention detection module and results are reported using system detected mentions. The performance of the mention detection module is of crucial importance for anaphoricity detection. Therefore, it is important that the compared anaphoricity detectors use the same mention detection.
The LSTM model that is described in Section 3.2 is denoted as LSTM in Table 1. In order to investigate the effect of the surface features, we also report the results of the LSTM model without these features (LSTM*).
The following observations can be drawn from the results of Table 1: (1) our LSTM model outperforms the joint model while using fewer features and being trained independently, (2) the results of the LSTM* model are considerably lower than those of LSTM, especially for recognizing anaphoric mentions, and (3) the simple SVM model outperforms the neural models in detecting both anaphoric and non-anaphoric mentions.
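Scores for the anaphoric and the non-anaphoric class can be computed separately, as in the following sketch. It assumes that the comparison is based on per-class precision, recall and F1, which is not stated explicitly in this section; the function name `prf` and the toy labels are illustrative only.

```python
from typing import List, Tuple


def prf(gold: List[bool], pred: List[bool], positive: bool) -> Tuple[float, float, float]:
    """Precision, recall and F1 for one class (`positive`) of a binary task."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# True = anaphoric, False = non-anaphoric (toy labels, not data from the paper).
gold = [True, True, False, False, False]
pred = [True, False, False, False, True]
anaphoric_scores = prf(gold, pred, positive=True)
non_anaphoric_scores = prf(gold, pred, positive=False)
```

Reporting both classes matters because non-anaphoric mentions are typically far more frequent, so a single accuracy figure would hide errors on the anaphoric class.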

Generalization Evaluation
In order to investigate the generalization on new domains, we evaluate the LSTM and SVM models on the WikiCoref dataset (Ghaddar and Langlais, 2016). The WikiCoref dataset is annotated according to the same annotation guideline as that of CoNLL. Therefore, it is an appropriate dataset for performing out-of-domain evaluations when CoNLL is used for training. For the experiments of Table 2, all models are trained on the CoNLL 2012 training data and tested on the WikiCoref dataset.
The word dictionary that is used for the LSTM model is built based on the CoNLL 2012 training data. All words that are not included in this dictionary are treated as out-of-vocabulary words with randomly initialized word embeddings. We further improve the performance of LSTM on WikiCoref by adding the words from the WikiCoref dataset to its dictionary. The LSTM model trained with this extended dictionary is denoted as LSTM† in Table 2. LSTM† results are still lower than those of the SVM model, even though SVM does not use any information from the test dataset. Pruning rare lexical features from the training data, along with the incorporation of part-of-speech tags, which are far more generalizable than lexical features, could explain the generalizability of the SVM model on the new domain.
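The effect of extending the dictionary can be made concrete with a small sketch. The names `build_dictionary` and `oov_rate`, the shared `<UNK>` entry, and the toy token lists are assumptions for illustration and do not reproduce the actual preprocessing.

```python
from typing import Dict, Iterable, List

UNK = "<UNK>"  # hypothetical entry for out-of-vocabulary words


def build_dictionary(corpora: Iterable[Iterable[str]]) -> Dict[str, int]:
    """Assign an integer id to every word seen in the given corpora."""
    vocab = {UNK: 0}
    for corpus in corpora:
        for word in corpus:
            vocab.setdefault(word, len(vocab))
    return vocab


def oov_rate(tokens: List[str], vocab: Dict[str, int]) -> float:
    """Fraction of tokens that are missing from the dictionary."""
    return sum(t not in vocab for t in tokens) / len(tokens)


conll_train = "the president said".split()      # stands in for CoNLL 2012 training data
wiki_test = "the encyclopedia said".split()     # stands in for WikiCoref

base = build_dictionary([conll_train])                 # setting denoted LSTM
extended = build_dictionary([conll_train, wiki_test])  # setting denoted LSTM†
base_rate = oov_rate(wiki_test, base)
extended_rate = oov_rate(wiki_test, extended)
```

Words covered only by the extended dictionary can be looked up in the pretrained embeddings instead of receiving a random vector, which is one plausible reason for the improvement of LSTM† over LSTM.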

Analysis Based on Mention Types
We analyze the output of the LSTM and SVM models on the CoNLL 2012 test set to see how well they perform for different types of mentions. As can be seen from Table 3, there is not much difference between the performance of LSTM and SVM for recognizing anaphoric pronouns. SVM detects anaphoric proper names better while LSTM is better at recognizing anaphoric common nouns.
We also analyze the output of LSTM*. As can be seen, the incorporation of surface features has little effect on the detection of anaphoric pronouns, while it mainly affects the detection of anaphoric proper names, by about 24 percent.
In order to see whether the same pattern holds for coreference resolution, we compare the recall and precision errors of the best coreference system that only uses surface features, i.e. cort (Martschat and Strube, 2015) with singleton features (Moosavi and Strube, 2016), and the state-of-the-art deep coreference resolver, i.e. deep-coref (Clark and Manning, 2016a). The comparison of the errors for the CoNLL 2012 test set is shown in Table 4. We use the error analysis tool of cort introduced by Martschat and Strube (2014) for the results of Table 4. As can be seen from Table 4, while deep-coref is significantly better than cort for resolving common nouns and especially pronouns, its result does not go far beyond that of cort when it comes to resolving proper names.

Discussion
In this paper we analyze the effect of surface features for anaphoricity detection, which is a small but important step for coreference resolution. Our analysis confirms that surface features are important, as was already known. Based on our results, the effects of incorporating surface properties and generalized representations differ across types of mentions. These results suggest that, apart from a unified model, we should consider different models or at least different features for processing different types of mentions, rather than putting all the burden on a single model to learn the differences. The works by Lassalle and Denis (2013) and Denis and Baldridge (2008) are examples in which distinct models have been used for various types of mentions. Besides, our analysis shows the importance of surface features for proper names. Word embeddings are very useful for capturing semantic relatedness. A coreference resolver that uses word embeddings has a great advantage in resolving common nouns and pronouns. However, the use of surface features in current state-of-the-art coreference resolvers is very limited. Before moving towards more sophisticated knowledge sources, there are still easy victories that can be achieved by incorporating more generalizable surface properties, especially for proper names.