Using Linguistic Features to Improve the Generalization Capability of Neural Coreference Resolvers

Coreference resolution is an intermediate step for text understanding. It is used in tasks and domains for which we do not necessarily have coreference annotated corpora. Therefore, generalization is of special importance for coreference resolution. However, while recent coreference resolvers have notable improvements on the CoNLL dataset, they struggle to generalize properly to new domains or datasets. In this paper, we investigate the role of linguistic features in building more generalizable coreference resolvers. We show that generalization improves only slightly by merely using a set of additional linguistic features. However, employing features and subsets of their values that are informative for coreference resolution considerably improves generalization. Thanks to better generalization, our system achieves state-of-the-art results in out-of-domain evaluations. For example, on WikiCoref, our system, which is trained on CoNLL, achieves on-par performance with a system designed for this dataset.


Introduction
Coreference resolution is the task of recognizing different expressions that refer to the same entity. The referring expressions are called mentions. For instance, the sentence "[Susan]_1 sent [her]_1 daughter to a boarding school" contains two coreferring mentions. "her" is an anaphor which refers to the antecedent "Susan".
The availability of coreference information benefits various Natural Language Processing (NLP) tasks including automatic summarization, question answering, machine translation and information extraction. Current coreference developments are almost exclusively targeted at improving scores on the CoNLL official test set. However, the superiority of a coreference resolver on the CoNLL evaluation sets does not necessarily indicate that it also performs better on new datasets. For instance, the ranking model of Clark and Manning (2016a), the reinforcement learning model of Clark and Manning (2016b) and the end-to-end model of Lee et al. (2017) are three recent coreference resolvers, among which the model of Lee et al. (2017) performs the best and that of Clark and Manning (2016b) performs the second best on the CoNLL development and test sets. However, if we evaluate these systems on the WikiCoref dataset (Ghaddar and Langlais, 2016a), which is consistent with CoNLL with regard to coreference definition and annotation scheme, the performance ranking would be in reverse order. 1
In Moosavi and Strube (2017a), we investigate the generalization problem in coreference resolution and show that there is a large overlap between the coreferring mentions in the CoNLL training and evaluation sets. Therefore, higher scores on the CoNLL evaluation sets do not necessarily indicate a better coreference model. They may be due to better memorization of the training data. As a result, despite the remarkable improvements in coreference resolution, the use of coreference resolution in other applications is mainly limited to the use of simple rule-based systems, e.g. Lapata and Barzilay (2005), Yu and Ji (2016), and Elsner and Charniak (2008).
In this paper, we explore the role of linguistic features for improving generalization. The incorporation of linguistic features is considered as a potential solution for building more generalizable NLP systems 2 . While linguistic features 3 were shown to be important for coreference resolution, e.g. Uryupina (2007) and Bengtson and Roth (2008), state-of-the-art systems no longer use them and mainly rely on word embeddings and deep neural networks. Since all recent systems are using neural networks, we focus on the effect of linguistic features on a neural coreference resolver.
The contributions of this paper are as follows:
- We show that linguistic features are more beneficial for a neural coreference resolver if we incorporate features and subsets of their values that are informative for discriminating coreference relations. Otherwise, employing linguistic features with all their values only slightly affects the performance and generalization.
- We propose an efficient discriminative pattern mining algorithm, called EPM, for determining (feature, value) pairs that are informative for the given task. We show that while the informativeness of EPM mined patterns is on-par with that of its counterparts, it scales best to large datasets. 4
- By improving generalization, we achieve state-of-the-art performance on all examined out-of-domain evaluations. Our out-of-domain performance on WikiCoref is on-par with that of Ghaddar and Langlais (2016b)'s coreference resolver, which is a system specifically designed for WikiCoref and uses its domain knowledge.

Importance of Features in Coreference
Uryupina (2007)'s thesis is one of the most thorough analyses of linguistically motivated features for coreference resolution. She examines a large set of linguistic features, i.e. string match, syntactic knowledge, semantic compatibility, discourse structure and salience, and investigates their interaction with coreference relations. She shows that even imperfect linguistic features, which are extracted using error-prone preprocessing modules, boost the performance and argues that coreference resolvers could and should benefit from linguistic theories. Her claims are based on analyses on the MUC dataset. Ng and Cardie (2002), Yang et al. (2004), Ponzetto and Strube (2006), Bengtson and Roth (2008), and Recasens and Hovy (2009) also study the importance of features in coreference resolution.
3 [...]itions, e.g. string match, or are acquired from linguistic preprocessing modules, e.g. POS tags, as linguistic features.
4 The EPM code is available at https://github.com/ns-moosavi/epm
Apart from the mentioned studies, which are mainly about the importance of individual features, studies like Björkelund and Farkas (2012), Fernandes et al. (2012), and Uryupina and Moschitti (2015) generate new features by combining basic features. Björkelund and Farkas (2012) do not use a systematic approach for combining features. Fernandes et al. (2012) use the Entropy guided Feature Induction (EFI) approach to automatically generate discriminative feature combinations. The first step is to train a decision tree on a dataset in which each sample consists of features describing a mention pair. The EFI approach traverses the tree from the root in a depth-first order and recursively builds feature combinations. Each pattern that is generated by EFI starts from the root node. As a result, EFI tends to generate long patterns. A decision tree does not represent all patterns of data. Therefore, it is not possible to explore all feature combinations from a decision tree. Uryupina and Moschitti (2015) propose an alternative approach to EFI. They formulate the problem of generating feature combinations as a pattern mining problem. They use the Jaccard Item Mining (JIM) algorithm 5 (Segond and Borgelt, 2011). They show that the classifier that uses the JIM features significantly outperforms the one that employs the EFI features.

3 Baseline Coreference Resolver
deep-coref (Clark and Manning, 2016a) and e2e-coref (Lee et al., 2017) are among the best performing coreference resolvers, of which e2e-coref performs better on the CoNLL test set. deep-coref is a pipelined system, i.e. a mention detection module first determines the list of candidate mentions with their corresponding features. It contains various coreference models including the mention-pair, mention-ranking, and entity-based models. The mention-ranking model of deep-coref has three variations: (1) "ranking" uses the slack-rescaled max-margin training objective of Wiseman et al. (2015), (2) "reinforce" is a variation of the "ranking" model in which the hyperparameters are set in a reinforcement learning framework (Sutton and Barto, 1998), and (3) "top-pairs" is a simple variation of the "ranking" model that uses a probabilistic objective function and is used for pretraining the "ranking" model. e2e-coref is an end-to-end system that jointly models mention detection and coreference resolution. It considers all possible (start, end) word spans of each sentence as candidate mentions. Apart from a single model, e2e-coref includes an ensemble of five models.
We use deep-coref as the baseline in our experiments. The reason is that some of the examined features require the head of each mention to be known, e.g. head match, while e2e-coref mentions do not have specific heads and heads are automatically determined using an attention mechanism. We also observe that if we limit e2e-coref candidate spans to those that correspond to deep-coref's detected mentions, the performance of e2e-coref drops to a level on-par with deep-coref 6 .

Examined Features
The examined linguistic features include string match, syntactic, shallow semantic and discourse features.
- Dependency relation: enhanced dependency relation (Schuster and Manning, 2016) of the head word to its parent
- POS tags of the first, last and head words of each mention, and of the two words preceding and following it
Pairwise features include:
- Head match: both mentions have the same head, e.g. "red hat" and "the hat"
- String of one mention is contained in the other, e.g. "Mary's hat" and "Mary"
- Head of one mention is contained in the other, e.g. "Mary's hat" and "hat"
- Acronym, e.g. "Heidelberg Institute for Theoretical Studies" and "HITS"
- Compatible pre-modifiers: the set of pre-modifiers of one mention is contained in that of the other, e.g. "the red hat that she is wearing" and "the red hat"
- Compatible 7 gender, e.g. "Mary" and "women"
- Compatible number, e.g. "Mary" and "John"
- Compatible animacy, e.g. "those hats" and "it"
- Compatible attributes: compatible gender, number and animacy, e.g. "Mary" and "she"
- Closest antecedent that has the same head and compatible pre-modifiers, e.g. "this new book" and "This book" in "Take a look at this new book. This book is one of the best sellers."
- Closest antecedent that has compatible attributes, e.g. the antecedent "Mary" and the anaphor "she" in the sentence "John saw Mary, and she was in a hurry"
- Closest antecedent that has compatible attributes and is a subject, e.g. the antecedent "Mary" and the anaphor "she" in the sentence "Mary saw John, but she was in a hurry"
- Closest antecedent that has compatible attributes and is an object, e.g. "Mary" and "she" in "John saw Mary, and she was in a hurry"
The last three features are similar to the discourse-level features discussed by Uryupina (2007), which are created by combining proximity, agreement and salience properties. She shows that such features are useful for resolving pronouns. We estimate proximity by considering the distance of two mentions. Salience is also incorporated by discriminating subject or object antecedents. We do not use any gold information. All features are extracted using Stanford CoreNLP (Manning et al., 2014).
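To make three of the pairwise string features concrete, the following is a minimal sketch on raw mention strings. It is only an approximation under stated assumptions: the features in this paper are extracted with Stanford CoreNLP, whereas the function names (`tokens`, `head_match`, `string_contained`, `is_acronym`), the regular-expression tokenizer and the capital-letter acronym heuristic below are hypothetical simplifications, and the head words are assumed to be given.

```python
import re


def tokens(mention: str) -> list:
    """Lowercased word tokens; a possessive "'s" becomes its own token (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+|'s", mention.lower())


def head_match(head_a: str, head_b: str) -> bool:
    """Both mentions have the same head word; the heads are assumed to be known."""
    return head_a.lower() == head_b.lower()


def string_contained(mention_a: str, mention_b: str) -> bool:
    """The token sequence of the shorter mention occurs contiguously in the longer one."""
    a, b = tokens(mention_a), tokens(mention_b)
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    width = len(short)
    return width > 0 and any(
        long_[i:i + width] == short for i in range(len(long_) - width + 1)
    )


def is_acronym(mention_a: str, mention_b: str) -> bool:
    """One mention equals the initials of the capitalized words of the other (heuristic)."""
    def initials(text: str) -> str:
        return "".join(word[0] for word in text.split() if word[0].isupper())

    pairs = ((mention_a, mention_b), (mention_b, mention_a))
    return any(len(short) >= 2 and short == initials(long_) for short, long_ in pairs)
```

For the examples listed above, `string_contained("Mary's hat", "Mary")` and `is_acronym("Heidelberg Institute for Theoretical Studies", "HITS")` both hold, while "red hat" and "the hat" match only on their heads.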

Impact of Linguistic Features
In this section, we examine the effect of employing all linguistic features described in Section 4 in a neural coreference resolver, i.e. deep-coref. We use MUC (Vilain et al., 1995), B^3 (Bagga and Baldwin, 1998), CEAF_e (Luo, 2005) [...] The rows "ranking" and "top-pairs" show the base results of deep-coref's "ranking" and "top-pairs" models, respectively. The "+linguistic" rows represent the results for each of the mention-ranking models in which the feature set of Section 4 is employed. The gender, number, animacy and mention type features, which have fewer than five values, are converted to binary features. Named entity and POS tags, and dependency relations are represented as learned embeddings.
We observe that incorporating all the linguistic features bridges the gap between the performance of "top-pairs" and "ranking". However, it does not improve significantly over "ranking". Henceforth, we use the "top-pairs" model of deep-coref as the baseline model to incorporate linguistic features.
To assess the impact on generalization, we evaluate the "top-pairs" and "+linguistic" 8 models, which are trained on CoNLL, on WikiCoref (see Table 2). We observe that the impact on generalization is also not notable, i.e. the CoNLL score improves only by 0.5pp over "ranking". Based on an ablation study, while our feature set contains numerous features, the resulting improvements of "+linguistic" over "top-pairs" mainly come from the last four pairwise features in Section 4, which are carefully designed features.

Better Exploiting Linguistic Features
As discussed by Moosavi and Strube (2017a), there is a large lexical overlap between the coreferring mentions of the CoNLL training and evaluation sets. As a result, lexical features provide a very strong signal for resolving coreference relations.
8 i.e. "top-pairs+linguistic"
For linguistic features to be more effective in current coreference resolvers, which rely heavily on lexical features, they should also provide a strong signal for coreference resolution.
Additional linguistic features are not necessarily all informative for coreference resolution, especially if they are extracted automatically and are noisy. Besides, for features with multiple values, e.g. mention-based features, only a small subset of values may be informative.
To better exploit linguistic features, we only employ (feature, value) pairs 9 that are informative for coreference resolution. Coreference resolution is a complex task in which features have complex interactions (Recasens and Hovy, 2009). As a result, we cannot determine the informativeness of feature-values in isolation.
We use a discriminative pattern mining approach (Cheng et al., 2007, 2008; Batal and Hauskrecht, 2010) that examines all combinations of feature-values, up to a certain length, and determines which feature-values are informative when they are considered in combination.
Due to the large data size (all mention-pairs of the CoNLL training data) and the high dimensionality of feature-values, compared to common evaluation sets of pattern mining methods, the existing discriminative pattern mining approaches were not applicable to our data. In this section, we propose an efficient discriminative pattern mining approach, called Efficient Pattern Miner (EPM), that is scalable to large NLP datasets. The most important properties of EPM are (1) it examines all frequent feature-values combinations, up to the desired length, (2) it is scalable to large datasets, and (3) it is only data dependent and independent of the coreference resolver.

Notation
We use the following notations and definitions throughout this section:
- $D = \{(X_i, c(X_i))\}_{i=1}^{n}$: set of n training samples. $X_i$ is the set of feature-values that describes the i-th sample. $c(X_i) \in C$ is the label of $X_i$, e.g. coreferent and non-coreferent.
- $A = \{a_1, \ldots, a_l\}$: set of all feature-values present in D. Each $a_i \in A$ is called an item, e.g. $a_i$ = "anaphor type=proper".
- $support(p, c_i)$: the number of samples that contain pattern p, i.e. a set of items, and are labeled with $c_i$.

Data Structure
For representing the input samples, we use the Frequent Pattern Tree (FP-Tree) structure that is the data structure of the FP-Growth algorithm (Han et al., 2004), i.e. one of the most common algorithms for frequent pattern mining. FP-Tree provides a structure for representing all existing patterns of data in a compressed form. Using the FP-Tree structure allows an efficient enumeration of all frequent patterns. In the FP-Tree structure, items are arranged in descending order of frequency. The frequency of an item corresponds to $\sum_{c_j \in C} support(a_i, c_j)$. Except for the root, which is a null node, each node n contains an item $a_i \in A$. It also contains the support values of $a_i$ in the subpath of the tree that starts from the root and ends with n, i.e. $support_n(a_i, c_j)$.
The FP-Tree construction method (Han et al., 2004) is as follows:
(a) Scan D to collect the set of all items, i.e. A. Compute $support(a_i, c_j)$ for each item $a_i \in A$ and label $c_j \in C$. Sort A's members in descending order according to their frequencies, i.e. $\sum_{c_j \in C} support(a_i, c_j)$.
(b) Create a null-labeled node as the root.
(c) Scan D again. For each $(X_i, c(X_i)) \in D$:
1. Order all items $a_j \in X_i$ according to the order in A.
2. Set the current node (T) to the root.
3. Consider $X_i = [a_k | \bar{X}_i]$, where $a_k$ is the first item and $\bar{X}_i$ is the remaining list. If T has a child n that contains $a_k$ then increment $support_n(a_k, c(X_i))$ by one. Otherwise, create a new node n that contains $a_k$ with $support_n(a_k, c(X_i)) = 1$. Add n to the tree as a child of T.
4. If $\bar{X}_i$ is non-empty, set T to n. Assign $X_i = \bar{X}_i$ and go to step 3.
If we sort A based on the frequencies of its items, i.e. $support(a_i, 0) + support(a_i, 1)$ for each $a_i$, the ordering of A's items will remain the same.
The FP-Tree construction steps for the above samples are demonstrated in Figure 1. ana-type, ant-type, and head-match features are abbreviated as ana, ant, and head, respectively. From an initial FP-Tree (T ) that represents all existing patterns, one can easily obtain a new FP-Tree in which all patterns include a given pattern p. This can be done by only including sub-paths of T that contain pattern p. The new tree is called conditional FP-Tree of p, T p . An example of conditional FP-Tree is included in the supplementary materials.
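The construction procedure can be sketched as follows. This is a minimal illustration rather than the released EPM implementation: the `Node` class, the `build_fp_tree` function and the three toy samples (with abbreviated ana, ant and head items and labels 1 = coreferent, 0 = non-coreferent) are assumptions made for the example, and ties in frequency are broken alphabetically only to keep the result deterministic.

```python
from collections import Counter


class Node:
    """FP-Tree node holding one item and its per-label support on the path from the root."""

    def __init__(self, item=None, parent=None):
        self.item = item
        self.parent = parent
        self.children = {}        # item -> child Node
        self.support = Counter()  # label -> number of samples passing through this node


def build_fp_tree(samples):
    """samples: list of (set_of_items, label) pairs. Returns (root, item_rank)."""
    # (a) First scan: frequency of each item summed over all labels.
    freq = Counter()
    for items, _ in samples:
        freq.update(items)
    ranked = sorted(freq, key=lambda a: (-freq[a], a))  # descending frequency
    rank = {item: r for r, item in enumerate(ranked)}

    # (b) Null-labeled root.
    root = Node()

    # (c) Second scan: insert each sample along a path of frequency-ordered items.
    for items, label in samples:
        node = root
        for item in sorted(items, key=rank.__getitem__):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
            child.support[label] += 1
            node = child
    return root, rank


# Hypothetical toy data, not taken from the paper.
toy_samples = [
    ({"ana=NAM", "ant=NAM", "head=T"}, 1),
    ({"ana=NAM", "ant=NOM", "head=F"}, 0),
    ({"ana=NAM", "ant=NAM", "head=F"}, 0),
]
toy_root, toy_rank = build_fp_tree(toy_samples)
```

In this toy tree, all three samples share the prefix node "ana=NAM", which is where the compression of the FP-Tree comes from: the shared item is stored once together with its per-label support counts.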

Informativeness Measures
We use a discriminative power and an information novelty measure for determining informativeness. We also use a frequency measure which determines the required minimum frequency of a pattern in training samples. It helps to avoid overfitting to the properties of the training data.
Discriminative power: We use the G^2 likelihood ratio test (Agresti, 2007) in order to choose patterns whose association with the class variable is statistically significant. 10 The G^2 test has been successfully used for text analysis (Dunning, 1993).
Information novelty: A large number of redundant patterns can be generated by adding irrelevant items to a base pattern that is discriminative itself. We consider the pattern p as novel if (1) p predicts the target class label c significantly better than all of the items it contains, and (2) p predicts c significantly better than all of its sub-patterns that satisfy the frequency, discriminative power, and the first information novelty conditions. Similar to Batal and Hauskrecht (2010), we employ a binomial distribution to determine information novelty.
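As an illustration of the discriminative power measure, the G^2 statistic of a contingency table with observed counts O and expected counts E (under independence) is 2 Σ O ln(O/E). The sketch below computes it for a table whose rows are "pattern present" and "pattern absent" and whose columns are the class labels. The function name `g2_statistic` and the example counts are assumptions; the comparison against 3.841, the chi-square critical value for one degree of freedom at the 0.05 level, is shown only as an example of a significance decision.

```python
import math


def g2_statistic(table):
    """G^2 likelihood ratio statistic for a contingency table given as a list of rows."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # empty cells contribute nothing to the sum
                expected = row_sums[i] * col_sums[j] / total
                g2 += observed * math.log(observed / expected)
    return 2.0 * g2


# Hypothetical counts: rows = pattern present / absent, columns = coreferent / non-coreferent.
independent = [[10, 10], [10, 10]]
associated = [[30, 10], [10, 30]]
is_significant = g2_statistic(associated) > 3.841  # 1 degree of freedom, alpha = 0.05
```

A table in which the pattern is independent of the class yields a statistic of zero, whereas a strong association yields a large value.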

Mining Algorithm
The EPM algorithm is summarized in Algorithm 1. It takes FP-Tree T , pattern p on which T is conditioned, and set of items (A j ⊂ A) whose combinations with p will be examined. Initially, p is empty and the FP-Tree is constructed based on all frequent items of data and A j = A. Resulting patterns are collected in P .
For each a_i ∈ A_j, the algorithm builds a new pattern q by combining a_i with p. Frequent(q) checks whether q meets the frequency condition. If q is frequent, the algorithm continues the search process. Otherwise, the search along q stops, since it is not possible to build any frequent pattern out of a non-frequent one. Discriminative power and the first condition of information novelty are then checked for pattern q.

    Input: FP-Tree T, pattern p, items A_j
    for each a_i ∈ A_j do
        q = p ∪ {a_i}
        if Frequent(q) then
            if Discriminative(q) then
                if Novel(q) then
                    P = P ∪ {q}
                end
            end
            if |q| >= Θ_l then
                continue
            end
            construct T_q = q's conditional tree
            EPM(T_q, q, ancestors(a_i))
        end
    end
Algorithm 1: The EPM algorithm.
We use a threshold (Θ l ) for the maximum length of mined patterns. Θ l can be set to large values if more complex and specific patterns are desirable.
If |q| is smaller than Θ l , the conditional FP-Tree T q is built that represents patterns of T that include the pattern q. The mining algorithm then continues to recursively search for more specific patterns by combining q with the items included in ancestors(a i ), which keeps the list of all ancestors of a i in the original FP-Tree. EPM examines all frequent patterns of up to length Θ l .
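The control flow of Algorithm 1 can be sketched as below. This is a deliberately simplified version and not the EPM implementation: instead of conditional FP-Trees it filters plain lists of samples, the discriminative power and novelty checks are passed in as hypothetical callables, and the frequency condition is assumed to be an absolute count in at least one class. Restricting the recursion to items ranked before a_i plays the role of ancestors(a_i) and ensures that each combination is examined once.

```python
from collections import Counter


def epm_sketch(samples, min_support, max_len, is_discriminative, is_novel):
    """Enumerate frequent patterns up to max_len and keep those passing both checks."""
    freq = Counter()
    for items, _ in samples:
        freq.update(items)
    order = sorted(freq, key=lambda a: (-freq[a], a))  # descending frequency
    patterns = []

    def mine(cond_samples, p, candidates):
        for idx, a in enumerate(candidates):
            q = p | {a}
            # Samples that contain q; this stands in for q's conditional tree.
            covered = [(x, c) for x, c in cond_samples if a in x]
            support = Counter(c for _, c in covered)
            # Frequency condition (assumed form): frequent in at least one class.
            if not support or max(support.values()) < min_support:
                continue
            if is_discriminative(q, support) and is_novel(q, support):
                patterns.append(q)
            if len(q) >= max_len:
                continue
            # Only more frequent items can be ancestors of a in the FP-Tree.
            mine(covered, q, candidates[:idx])

    mine(samples, frozenset(), order)
    return patterns


# Hypothetical toy data and trivially permissive checks, for illustration only.
toy = [
    ({"ana=NAM", "ant=NAM", "head=T"}, 1),
    ({"ana=NAM", "ant=NOM", "head=F"}, 0),
    ({"ana=NAM", "ant=NAM", "head=F"}, 0),
]
mined = epm_sketch(toy, min_support=2, max_len=2,
                   is_discriminative=lambda q, s: True,
                   is_novel=lambda q, s: True)
```

Because support can only decrease when items are added, skipping every extension of a non-frequent pattern does not lose any frequent pattern.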
If we use a statistical test multiple times, the risk of making false discoveries increases (Webb, 2006). To tackle this, we apply the Bonferroni correction for multiple tests in a post-pruning function after the mining process. This function also applies the second information novelty condition on the resulting patterns.
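The Bonferroni correction itself is simple to state: with m tests and a family-wise significance level α, each individual test is evaluated at α/m. A minimal sketch follows, assuming that p-values have already been computed for the mined patterns and that m is the number of patterns passed in; both the function name `bonferroni_filter` and this choice of m are assumptions, since the text does not specify how the number of tests is counted.

```python
def bonferroni_filter(p_values, alpha=0.05):
    """Keep the patterns whose p-value passes the Bonferroni-corrected level alpha / m.

    p_values: dict mapping a pattern (any hashable) to its p-value.
    """
    m = len(p_values)
    if m == 0:
        return {}
    corrected_alpha = alpha / m
    return {pattern: p for pattern, p in p_values.items() if p <= corrected_alpha}


# Hypothetical p-values for three patterns.
example_p_values = {"pattern_a": 0.001, "pattern_b": 0.02, "pattern_c": 0.04}
kept_patterns = bonferroni_filter(example_p_values)
```

With three tests and α = 0.05, the corrected level is about 0.0167, so only the first pattern survives even though all three would pass an uncorrected test.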

Why Use EPM?
In this section, we explain why EPM is a better alternative compared to its counterparts for large NLP datasets. We compare EPM with two efficient discriminative pattern mining algorithms, i.e. Minimal Predictive Patterns (MPP) (Batal and Hauskrecht, 2010) and Direct Discriminative Pattern Mining (DDPMine) (Cheng et al., 2008), on standard machine learning datasets.
MPP selects patterns that are significantly more predictive than all their sub-patterns. It measures significance by the binomial distribution. For each pattern of length l, MPP checks $2^l - 1$ sub-patterns. DDPMine is an iterative approach that selects the most discriminative pattern at each iteration and reduces the search space of the next iteration by removing all samples that include the selected pattern. DDPMine uses the FP-Tree structure.
We show that EPM scales best and compares favorably based on the informativeness of resulting patterns. Due to its efficiency, EPM can handle large datasets similar to ones that are commonly used in various NLP tasks.

Experimental Setup
We use the same FP-Tree implementation for DDPMine and EPM. In all algorithms, we consider a pattern as frequent if it occurs in 10% of the samples of one of the classes. We use Θ l = 3 for both MPP and EPM.
We perform 5-times repeated 5-fold cross validation and the results are averaged. In each validation, all experiments are performed on the same split. We use a linear SVM, i.e. LIBLINEAR 2.11 (Fan et al., 2008), as the baseline classifier.
We use several datasets from the UCI machine learning repository (Lichman, 2013) whose characteristics are presented in the first three columns of Table 3, i.e. the number of (1) (real/integer/nominal) features (#Features), (2) frequent items (#FI), and (3) samples (n). For datasets with more than two classes, we use a one-vs-all technique in which the minority class is the "one" class.

How Informative are EPM Patterns?
To evaluate the informativeness of mined patterns, the common practice is to add them as new features to the feature set of the baseline classifier; the more informative the patterns, the greater impact they would have on the overall performance. All patterns are added as binary features, i.e. the feature is true for samples that contain all items of the corresponding pattern. The effect of the patterns of DDPMine, MPP and EPM on the overall accuracy is presented in Table 3. The columns #Patterns show the number of patterns mined by each of the algorithms. The Orig columns show the results of the SVM using the original feature sets. The DDP, MPP, and EPM columns show the results of the SVM on the datasets for which the feature set is extended by the features mined by DDPMine, MPP, and EPM, respectively. The results of the 5-repeated 5-fold cross validation are reported if each single validation takes less than 10 hours.
Based on the results of Table 3: (1) EPM efficiently scales to larger datasets, (2) MPP and EPM patterns considerably improve the performance, and (3) EPM has on-par results with MPP while it mines considerably fewer patterns.
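Adding mined patterns as binary features amounts to a subset test per pattern. The following minimal sketch assumes that samples and patterns are both represented as sets of items; the function name `pattern_features` and the example items are hypothetical.

```python
def pattern_features(sample_items, patterns):
    """One binary feature per pattern: 1 if the sample contains all items of the pattern."""
    return [1 if pattern <= sample_items else 0 for pattern in patterns]


# Hypothetical sample and patterns.
example_sample = {"ana=NAM", "ant=NAM", "head=T"}
example_patterns = [
    frozenset({"ana=NAM", "head=T"}),
    frozenset({"ana=NAM", "head=F"}),
]
extra_features = pattern_features(example_sample, example_patterns)
```

The resulting vector is appended to the original feature vector of the sample before training the classifier.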

How Does it Scale?
Figure 2 compares EPM mining time (in seconds) with those of DDPMine and MPP. The parameter in the parentheses is the pattern size threshold, e.g. Θ l = 4 for EPM(4). The experiments that take more than two days are terminated and are not included. EPM is notably faster in comparison to the other two approaches. It is notable that the examined datasets are considerably smaller than the coreference data, which includes more than 33 million samples and 200 frequent feature-values.
8 Impact of Informative Feature-values

Experimental Setup
For determining informative feature-values, we extract all features for all mention-pairs 11 of the CoNLL training data and then apply EPM on this data. In order to prevent learning annotation errors and specific properties of the training data, we consider a pattern as frequent if it occurs in coreference relations of at least m different coreferring anaphors (m = 20). Since the majority of mention-pairs are non-coreferent and we are not interested in patterns for non-coreferring relations, we also consider the coreference probability of each pattern p, i.e. $\frac{|\{X_i \mid p \in X_i \wedge c(X_i) = \text{coreferent}\}|}{|\{X_i \mid p \in X_i\}|}$, in the post-pruning function. The coreference probability should be higher than a threshold (60% in our experiments), so we only mine patterns that are informative for coreferring mentions.
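The coreference probability filter can be sketched as follows, assuming that samples are (set of items, label) pairs with label 1 for coreferent pairs and that a sample contains a pattern when all items of the pattern are present. The function names, the toy mention pairs and the label encoding are assumptions for illustration; only the 60% threshold is taken from the text.

```python
def coreference_probability(pattern, samples, coref_label=1):
    """Fraction of the samples containing the pattern that are labeled coreferent."""
    labels = [label for items, label in samples if pattern <= items]
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == coref_label) / len(labels)


def filter_by_coreference_probability(patterns, samples, threshold=0.6):
    """Keep patterns whose coreference probability is higher than the threshold."""
    return [p for p in patterns if coreference_probability(p, samples) > threshold]


# Hypothetical mention-pair samples.
pair_samples = [
    ({"head=T", "ana=NAM"}, 1),
    ({"head=T", "ana=PRO"}, 1),
    ({"head=T", "ana=NAM"}, 0),
    ({"head=F", "ana=NAM"}, 0),
]
candidate_patterns = [frozenset({"head=T"}), frozenset({"ana=NAM"})]
coref_patterns = filter_by_coreference_probability(candidate_patterns, pair_samples)
```

In this toy data, "head=T" occurs in three pairs of which two are coreferent (probability 2/3), so it is kept, while "ana=NAM" (probability 1/3) is pruned.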
For the coreference resolution experiments, instead of incorporating informative patterns, we incorporate feature-values that are included in the informative patterns mined by EPM. The reason is that deep-coref, or any other recent coreference resolver, uses a deep neural network, which has a fully automated feature generation process. We add these feature-values as binary features. By setting Θ_l to five, 12 EPM results in 13 pairwise feature-values, 112 POS tags, i.e. 53 POS for anaphors and 59 for antecedents, 25 dependency relations, 26 mention types (mention types or fine mention types), and finally, 14 named entity tags. 13 Based on the observation in Section 5, we use the top-pairs model of deep-coref as the baseline to employ additional features, i.e. "+EPM" is the top-pairs model in which EPM feature-values are incorporated.
Table 4: Comparisons on the CoNLL test set. The F1 gains that are statistically significant: (1) "+EPM" compared to "top-pairs", "ranking" and "JIM", (2) "+EPM" compared to "reinforce" based on MUC, B^3 and LEA, (3) "single" compared to "+EPM" based on MUC and B^3, and (4) "ensemble" compared to other systems. Significance is measured based on the approximate randomization test (p < 0.05) (Noreen, 1989).

Impact on In-domain Performance
The performance of the "+EPM" model compared to recent state-of-the-art coreference models on the CoNLL test set is presented in Table 4. The "single" and "ensemble" rows represent the results of the single and ensemble models of e2e-coref.
We also compare EPM with the pattern mining approach used by Uryupina and Moschitti (2015), i.e. Jaccard Item Mining (JIM). For a fair comparison, while Uryupina and Moschitti (2015) used mined patterns for extracting feature templates, we use them for selecting feature-values. We run the JIM algorithm on the same data and with the same setup as that of EPM. 14 This results in nine pairwise features, 260 POS tags, 38 dependency relations, 32 mention types, and 18 named entity tags.
The "+JIM" row shows the results of the deep-coref top-pairs model in which these feature-values are incorporated. As we see, EPM feature-values result in significantly better performance than those of JIM while the number of EPM feature-values is considerably smaller than that of JIM.
Table 5 shows the effect of each group of EPM feature-values, i.e. pairwise features, mention types, dependency relations, named entity tags and POS tags, on the performance of "+EPM". The performance of "+EPM" from which each of the above feature groups is removed, one feature group at a time, is represented as "-pairwise", "-types", "-dep", "-NER", and "-POS", respectively. The POS and named entity tags have the least and the pairwise features have the most significant effect. Since pairwise features have the most significant effect, we also perform an experiment in which only pairwise features are incorporated in the "top-pairs" model, i.e. "+pairwise". The results of "-pairwise" compared to "+pairwise" show that pairwise feature-values have a significant impact, but only when they are considered in combination with other EPM feature-values.
12 We observe that using larger Θ_l values will result in many over-specified patterns.
13 Following the previous studies that show different features are of different importance for various types of mentions, e.g. Denis and Baldridge (2008) and Moosavi and Strube (2017b), we mine a separate set of patterns for each type of anaphor. The resulting feature-values are the union of informative feature-values for all types of anaphora.
14 We set the minimum frequency, maximum pattern length and score + threshold parameters of JIM to 20, 5 and 0.6.

Impact on Generalization
We use the same setup as that of Moosavi and Strube (2017a) for evaluating generalization including (1) training on the CoNLL data and testing on WikiCoref 15 and (2) excluding a genre of the CoNLL data from training and development sets and testing on the excluded genre. Similar to Moosavi and Strube (2017a), we use the pt and wb genres for the latter evaluation setup. The results of the first evaluation setup are shown in Table 6. The best performance on WikiCoref is achieved by Ghaddar and Langlais (2016a) ("G&L" in Table 6), who introduced WikiCoref and designed a domain-specific coreference resolver that makes use of the Wikipedia markups of a document as well as links to Freebase, which are annotated in WikiCoref.
Incorporating EPM feature-values improves the performance by about three points. While "+EPM" does not use the WikiCoref data during training, and unlike "G&L", it does not employ any domain-specific features, it achieves on-par performance with that of "G&L". This indeed shows the effectiveness of informative feature-values in improving generalization.
15 WikiCoref only contains 30 documents, which is not enough for training neural coreference resolvers.
The second set of generalization experiments is reported in Table 7. "in-domain" columns show the results when the evaluation genres were included in training and development sets while the "out-of-domain" columns show the results when the evaluation genres were excluded. As we can see, "+EPM" generalizes best, and in out-of-domain evaluations, it considerably outperforms the ensemble model of e2e-coref, which has the best performance on the CoNLL test set.

Conclusions
In this paper, we show that employing linguistic features in a neural coreference resolver significantly improves generalization. However, the incorporated features should be informative enough to be taken into account in the presence of lexical features, which are very strong features in the CoNLL dataset. We propose an efficient algorithm to determine informative feature-values in large datasets. As a result of better generalization, we achieve state-of-the-art results in all examined out-of-domain evaluations.