Lexical Features in Coreference Resolution: To be Used With Caution

Lexical features are a major source of information in state-of-the-art coreference resolvers. Lexical features implicitly model some of the linguistic phenomena at a fine granularity level. They are especially useful for representing the context of mentions. In this paper we investigate a drawback of using many lexical features in state-of-the-art coreference resolvers. We show that if coreference resolvers mainly rely on lexical features, they can hardly generalize to unseen domains. Furthermore, we show that the current coreference resolution evaluation is clearly flawed by only evaluating on a specific split of a specific dataset in which there is a notable overlap between the training, development and test sets.


Introduction
As in many other tasks, lexical features are a major source of information in current coreference resolvers. Coreference resolution is a set partitioning problem in which each resulting partition refers to an entity. As shown by Durrett and Klein (2013), lexical features implicitly model some linguistic phenomena, which were previously modeled by heuristic features, but at a finer level of granularity. However, we question whether the knowledge that is mainly captured by lexical features can be generalized to other domains.
The introduction of the CoNLL dataset enabled a significant boost in the performance of coreference resolvers: there is a difference of about 10 percent between the CoNLL score of the currently best coreference resolver, deep-coref by Clark and Manning (2016b), and that of the winner of the CoNLL 2011 shared task, the Stanford rule-based system by Lee et al. (2013). However, this substantial improvement does not seem to be visible in downstream tasks. Worse, the difference between state-of-the-art coreference resolvers and the rule-based system drops significantly when they are applied to a new dataset, even with consistent definitions of mentions and coreference relations (Ghaddar and Langlais, 2016a).
In this paper, we show that if we mainly rely on lexical features, as is the case in state-of-the-art coreference resolvers, overfitting becomes more severe. Overfitting to the training dataset is a problem that cannot be completely avoided. However, there is a notable overlap between the CoNLL training, development and test sets that encourages overfitting. Therefore, the current coreference evaluation scheme is flawed because it only evaluates on this overlapping validation set. To ensure meaningful improvements in coreference resolution, we believe an out-of-domain evaluation is a must in the coreference literature.

Lexical Features
The large difference in performance between coreference resolvers that use lexical features and those that do not indicates the importance of lexical features. Durrett and Klein (2013) show that lexical features implicitly capture some phenomena, e.g. definiteness and syntactic roles, which were previously modeled by heuristic features. Durrett and Klein (2013) use exact surface forms as lexical features. However, when word embeddings are used instead of surface forms, the use of lexical features is even more beneficial. Word embeddings are an efficient way of capturing semantic relatedness. In particular, they provide an efficient way of describing the context of mentions. Durrett and Klein (2013) show that the addition of some heuristic features like gender, number, person and animacy agreements and syntactic roles on top of their lexical features does not result in a significant improvement. deep-coref, the state-of-the-art coreference resolver, follows the same approach. Clark and Manning (2016b) capture the required information for resolving coreference relations by using a large number of lexical features and a small set of non-lexical features including string match, distance, mention type, speaker and genre features. The main difference is that Clark and Manning (2016b) use word embeddings instead of the exact surface forms that are used by Durrett and Klein (2013).
Based on the error analysis by cort (Martschat and Strube, 2014), in comparison to systems that do not use word embeddings, deep-coref has fewer recall and precision errors especially for pronouns. For example, deep-coref correctly recognizes around 83 percent of non-anaphoric "it" in the CoNLL development set. This could be a direct result of a better context representation by word embeddings.

Out-of-Domain Evaluation
Aside from the evident success of lexical features, it is debatable how well the knowledge that is mainly captured by the lexical information of the training data can be generalized to other domains. As reported by Ghaddar and Langlais (2016b), state-of-the-art coreference resolvers trained on the CoNLL dataset perform poorly, i.e. worse than the rule-based system (Lee et al., 2013), on the new dataset, WikiCoref (Ghaddar and Langlais, 2016b), even though WikiCoref is annotated with the same annotation guidelines as the CoNLL dataset. The results of some recent coreference resolvers on this dataset are listed in Table 1.
The results are reported using MUC (Vilain et al., 1995), B3 (Bagga and Baldwin, 1998), CEAFe (Luo, 2005), the average F1 score of these three metrics, i.e. CoNLL score, and LEA (Moosavi and Strube, 2016). berkeley is the mention-ranking model of Durrett and Klein (2013) with the FINAL feature set including the head, first, last, preceding and following words of a mention, the ancestry, length, gender and number of a mention, distance of two mentions, whether the anaphor and antecedent are nested, same speaker and a small set of string match features.
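The CoNLL score mentioned above is simply the unweighted mean of three F1 values. A minimal sketch of that arithmetic follows; the function names are ours, and the input numbers are made up for illustration and are not taken from Table 1.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def conll_score(muc_f1: float, b3_f1: float, ceafe_f1: float) -> float:
    """Unweighted average of the MUC, B3 and CEAFe F1 scores."""
    return (muc_f1 + b3_f1 + ceafe_f1) / 3.0


# Illustrative values only (not results from the paper).
example = conll_score(60.0, 50.0, 40.0)
```

Because the three metrics are averaged with equal weight, a gain in one metric can hide a loss in another, so the per-metric scores are still worth reporting alongside the average.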
cort is the mention-ranking model of Martschat and Strube (2015). cort uses the following set of features: the head, first, last, preceding and following words of a mention, the ancestry, length, gender, number, type, semantic class, dependency relation and dependency governor of a mention, the named entity type of the head word, distance of two mentions, same speaker, whether the anaphor and antecedent are nested, and a set of string match features. berkeley and cort scores in Table 1 are taken from Ghaddar and Langlais (2016a).
deep-coref is the mention-ranking model of Clark and Manning (2016b). deep-coref incorporates a large set of embeddings, i.e. embeddings of the head, first, last, two previous/following words, and the dependency governor of a mention in addition to the averaged embeddings of the five previous/following words, all words of the mention, sentence words, and document words. deep-coref also incorporates type, length, and position of a mention, whether the mention is nested in any other mention, distance of two mentions, speaker features and a small set of string match features.
deep-coref− denotes the variant of deep-coref [lea] in which WikiCoref's words are not incorporated into the dictionary. Therefore, for deep-coref−, WikiCoref's words that do not exist in CoNLL are initialized randomly instead of using pre-trained word2vec word embeddings. The performance gain of deep-coref [lea] in comparison to deep-coref− indicates the benefit of using pre-trained word embeddings and word embeddings in general. Henceforth, we refer to deep-coref [lea] as deep-coref.
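The difference between using a pre-trained vector and falling back to random initialization can be sketched as follows. This is only an illustration under our own assumptions: the function name, the uniform range of the random vectors and the toy 2-dimensional vectors are hypothetical and do not describe deep-coref's actual implementation.

```python
import random


def init_vector(word, dictionary_words, pretrained, dim, rng):
    """Return the initial embedding for `word`.

    Words that are in the dictionary and have a pre-trained vector reuse it;
    every other word gets a small random vector (assumed initialization).
    """
    if word in dictionary_words and word in pretrained:
        return list(pretrained[word])
    return [rng.uniform(-0.1, 0.1) for _ in range(dim)]


# Toy 2-dimensional "pre-trained" vectors (made-up values).
pretrained = {"guangzhou": [0.5, -0.2]}
rng = random.Random(0)

# Word added to the dictionary: the pre-trained vector is used.
with_dictionary = init_vector("guangzhou", {"guangzhou"}, pretrained, 2, rng)
# Word left out of the dictionary: a random vector is used instead.
without_dictionary = init_vector("guangzhou", set(), pretrained, 2, rng)
```

In the second call the word carries no information from pre-training, which is the situation that new-domain words face when they are not added to the dictionary.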

Why do Improvements Fade Away?
In this section, we investigate how much lexical features contribute to the fact that current improvements in coreference resolution do not properly apply to a new domain. Table 2 shows the ratio of non-pronominal coreferent mentions in the CoNLL test set that also appear as coreferent mentions in the training data. These high ratios indicate a high degree of overlap between the mentions of the CoNLL datasets.
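An overlap ratio of this kind can be sketched as below. The matching criterion (case-insensitive exact string match) and the small pronoun list are our assumptions for illustration; the exact criterion behind Table 2 may differ.

```python
# Hypothetical, deliberately small pronoun list used only for this sketch.
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they",
            "him", "her", "them", "his", "its", "their"}


def overlap_ratio(test_mentions, train_mentions):
    """Fraction of non-pronominal test mentions that also occur in training.

    Both arguments are lists of coreferent mention strings.
    """
    train_set = {m.lower() for m in train_mentions}
    candidates = [m.lower() for m in test_mentions if m.lower() not in PRONOUNS]
    if not candidates:
        return 0.0
    seen = sum(1 for m in candidates if m in train_set)
    return seen / len(candidates)


# Toy example: "it" is skipped, 2 of the 3 remaining mentions are seen.
ratio = overlap_ratio(["the city", "Haiti", "it", "Guangzhou"],
                      ["Haiti", "the city", "he"])
```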
The highest overlap between the training and test sets exists in genre pt (Bible). The tc (telephone conversation) genre has the lowest overlap for non-pronominal mentions. However, this genre includes a large number of pronouns. We choose wb (weblog) and pt for our analysis as two genres with low and high degree of overlap. Table 3 shows the results of the examined coreference resolvers when the test set only includes one genre, i.e. pt or wb, in two different settings: (1) the training set includes all genres (in-domain evaluation), and (2) the corresponding genre of the test set is excluded from the training and development sets (out-of-domain evaluation).
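The two settings can be expressed as a simple filter over the documents. The sketch below assumes a hypothetical representation in which each document is a dictionary with a "genre" and a "split" field; it is not tied to any particular CoNLL reader.

```python
def make_setting(documents, test_genre, out_of_domain):
    """Build (train, dev, test) document lists for one test genre.

    In-domain: train and dev keep all genres.
    Out-of-domain: the test genre is removed from train and dev.
    The test set always contains only the test genre.
    """
    def keep(doc):
        return not (out_of_domain and doc["genre"] == test_genre)

    train = [d for d in documents if d["split"] == "train" and keep(d)]
    dev = [d for d in documents if d["split"] == "dev" and keep(d)]
    test = [d for d in documents
            if d["split"] == "test" and d["genre"] == test_genre]
    return train, dev, test


# Toy corpus with made-up document entries.
docs = [
    {"id": 1, "genre": "pt", "split": "train"},
    {"id": 2, "genre": "wb", "split": "train"},
    {"id": 3, "genre": "pt", "split": "dev"},
    {"id": 4, "genre": "pt", "split": "test"},
    {"id": 5, "genre": "wb", "split": "test"},
]
in_train, in_dev, in_test = make_setting(docs, "pt", out_of_domain=False)
out_train, out_dev, out_test = make_setting(docs, "pt", out_of_domain=True)
```

The test set is identical in both settings, so any difference in scores comes from what the model saw during training and model selection.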
berkeley-final is the coreference resolver of Durrett and Klein (2013) with the FINAL feature set explained in Section 3. berkeley-surface is the same coreference resolver with only surface features, i.e. ancestry, gender, number, same speaker and nested features are excluded from the FINAL feature set.
cort−lexical is a version of cort in which no lexical feature is used, i.e. the head, first, last, governor, preceding and following words of a mention are excluded.
For in-domain evaluations we train deep-coref's ranking model for 100 iterations, i.e. the setting used by Clark and Manning (2016a). However, based on the performance on the development set, we only train the model for 50 iterations in out-of-domain evaluations.
The results for the pt genre show that when there is a high overlap between the training and test datasets, the performance of all learning-based classifiers improves significantly. deep-coref has the largest gain from including pt in the training data, which is more than 13% based on the LEA score. cort uses both lexical features and a relatively large number of non-lexical features, while berkeley-surface is a purely lexicalized system. However, the difference between the performance of berkeley-surface when pt is included in or excluded from the training data is lower than that of cort. berkeley uses feature-value pruning, so lexical features that occur fewer than 20 times are pruned from the training data. This may be the reason that berkeley's performance difference is smaller than that of other lexicalized systems in highly overlapping datasets.
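Frequency-based pruning of this kind can be sketched in a few lines. The sketch assumes that each training instance is a list of feature strings and that rare features are simply dropped; the actual berkeley implementation may handle pruned values differently, and the feature strings below are invented.

```python
from collections import Counter


def prune_rare_features(instances, min_count=20):
    """Drop every feature that occurs fewer than `min_count` times overall."""
    counts = Counter(f for feats in instances for f in feats)
    return [[f for f in feats if counts[f] >= min_count] for feats in instances]


# Toy data: one feature occurs 20 times, the other only 19 times.
instances = [["head=city"]] * 20 + [["head=haiti"]] * 19
pruned = prune_rare_features(instances)
```

After pruning, a rare head word such as the second one above no longer contributes a feature, so the model cannot memorize it from a handful of training mentions.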
For a less overlapping genre, i.e. wb, the performance gain of including the genre in the training data is significantly lower for all lexicalized systems. Interestingly, the performance of berkeley-final, cort and cort−lexical increases for the wb genre when this genre is excluded from the training set. deep-coref, which uses a complex deep neural network and mainly lexical features, has the highest gain from the redundancy in the training and test datasets. As we use more complex neural networks, there is more capacity for brute-force memorization of the training dataset.
It is also worth noting that the performance gains and drops in out-of-domain evaluations are not entirely because of lexical features, as the performance of cort−lexical also drops significantly in the pt out-of-domain evaluation. The classifier may also memorize other properties of the seen mentions in the training data. However, in comparison to features like gender and number agreement or syntactic roles, lexical features have the highest potential for overfitting.
We further analyze the output of deep-coref on the development set. The "all" rows in Table 4 show the number of pairwise links that are created by deep-coref on the development set for different mention types. The "seen" rows show the ratio of each category of links for which the (antecedent head, anaphor head) pair is seen in the training set. All ratios are surprisingly high. The most worrisome cases are those in which both mentions are either a proper name or a common noun. Table 5 further divides the links of Table 4 based on whether they are correct coreferent links. The results of Table 5 show that most of the incorrect links are also made between mentions that are both seen in the training data.
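The per-category counts and seen ratios of the kind reported in Table 4 can be computed as sketched below. The link representation, the category labels and the toy head pairs are our assumptions for illustration; the sketch also assumes that "seen" means the head pair occurs in a given set of training pairs.

```python
from collections import defaultdict


def seen_ratios(links, train_head_pairs):
    """Map each category to (number of links, ratio of seen head pairs).

    `links` holds (antecedent_head, anaphor_head, category) triples and
    `train_head_pairs` is a set of (antecedent_head, anaphor_head) pairs.
    """
    total = defaultdict(int)
    seen = defaultdict(int)
    for antecedent_head, anaphor_head, category in links:
        total[category] += 1
        if (antecedent_head, anaphor_head) in train_head_pairs:
            seen[category] += 1
    return {c: (total[c], seen[c] / total[c]) for c in total}


# Toy links with hypothetical category names.
train_pairs = {("haiti", "country"), ("guangzhou", "city")}
links = [
    ("haiti", "country", "proper-common"),
    ("guangzhou", "city", "proper-common"),
    ("paris", "town", "proper-common"),
    ("he", "him", "pronoun-pronoun"),
]
table = seen_ratios(links, train_pairs)
```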
The high ratios indicate that (1) there is a high overlap between the mention pairs of the training and development sets, and (2) even though deep-coref uses generalized word embeddings instead of exact surface forms, it is strongly biased towards the seen mentions. We analyze the links that are created by Stanford's rule-based system and compute the ratio of the links that exist in the training set. All corresponding ratios are lower than those of deep-coref in Table 5. However, the ratios are surprisingly high for a system that does not use the training data. This analysis emphasizes the overlap in the CoNLL datasets. Because of this high overlap, it is not easy to assess the generalizability of a coreference resolver to unseen mentions on the CoNLL dataset given its official split.
We also compute the ratios of Table 5 for the missing links that are associated with the recall errors of deep-coref. We compute the recall errors with the cort error analysis tool (Martschat and Strube, 2014). Table 6 shows the corresponding ratios for recall errors. The lower ratios of Table 6 in comparison to those of Table 4 emphasize the bias of deep-coref towards the seen mentions. For example, the deep-coref links include 31 cases in which both mentions are either proper names or common nouns and the head of one of the mentions is "country". For all these links, "country" is linked to a mention that is seen in the training data. This raises the question of how the classifier would perform on a text about countries not mentioned in the training data.
Memorizing the pairs in which one of the mentions is a common noun could help the classifier to capture world knowledge to some extent. From the seen pairs like (Haiti, his country), and (Guangzhou, the city) the classifier could learn that "Haiti" is a country and "Guangzhou" is a city. However, it is questionable how useful world knowledge is if it is mainly based on the training data.
The coreference relation of two nominal noun phrases with no head match can be very hard to resolve. The resolution of such pairs has been referred to as capturing semantic similarity (Clark and Manning, 2016b). deep-coref links 49 such pairs on the development set. Among all these links, only 5 pairs are unseen in the training set, and all of them are incorrect links.
The effect of lexical features is also analyzed by Levy et al. (2015) for tasks like hypernymy and entailment. They show that state-of-the-art classifiers memorize words from the training data. The classifiers benefit from this lexical memorization when there are common words between the training and test sets.

Discussion
We show that the extensive use of lexical features biases coreference resolvers towards seen mentions. This bias holds us back from developing more robust and generalizable coreference resolvers. After all, while coreference resolution is an important step for text understanding, it is not an end task. Coreference resolvers are going to be used in tasks and domains for which coreference annotated corpora may not be available. Therefore, generalizability should be given attention in developing coreference resolvers.
Moreover, we show that there is a significant overlap between the training and validation sets in the CoNLL dataset. The LEA metric (Moosavi and Strube, 2016) is introduced as an attempt to make coreference evaluations more reliable. However, in order to ensure valid developments on coreference resolution, it is not enough to have reliable evaluation metrics. The validation set on which the evaluations are performed also needs to be reliable. A dataset is reliable for evaluations if a considerable improvement on this dataset indicates a better solution for the coreference problem instead of a better exploitation of the dataset itself.
This paper is not intended to argue against the use of lexical features, especially when word embeddings are used as lexical features. The incorporation of word embeddings is an efficient way of capturing semantic relatedness. Perhaps we should use them more for describing the context and less for describing the mentions themselves. Pruning rare lexical features and incorporating more generalizable features could also help to prevent overfitting.
To ensure more meaningful improvements, we ask that out-of-domain evaluations be incorporated into the current coreference evaluation scheme. Out-of-domain evaluations could be performed by using either the existing genres of the CoNLL dataset or other existing coreference annotated datasets like WikiCoref, MUC or ACE.