Using Type Information to Improve Entity Coreference Resolution

Coreference resolution (CR) is an essential part of discourse analysis. Most recently, neural approaches have been proposed to improve over SOTA models from earlier paradigms. So far none of the published neural models leverage external semantic knowledge such as type information. This paper offers the first such model and evaluation, demonstrating modest gains in accuracy by introducing either gold standard or predicted types. In the proposed approach, type information serves both to (1) improve mention representation and (2) create a soft type consistency check between coreference candidate mentions. Our evaluation covers two different grain sizes of types over four different benchmark corpora.


Introduction
Coreference resolution (CR) is an extensively studied problem in computational linguistics and NLP (Hobbs, 1978; Lappin and Leass, 1994; Mitkov, 1999; Ng, 2017; Clark and Manning, 2016; Lee et al., 2017). Solutions to this problem allow us to make meaningful links between concepts and entities within a discourse and therefore serve as a valuable pre-processing step for downstream tasks like summarization and question answering (Steinberger et al., 2007; Dasigi et al., 2019; Sukthanker et al., 2020a).
Recently, multiple datasets including OntoNotes (Pradhan et al., 2012), LitBank (Bamman et al., 2020), EmailCoref (Dakle et al., 2020), and WikiCoref (Ghaddar and Langlais, 2016) have been proposed as benchmark datasets for CR, especially in the sub-area of entity anaphora (Sukthanker et al., 2020b). Entity anaphora is a simpler starting place for work on anaphora because, unlike abstract anaphora (Webber, 1991), entity anaphora are pronouns or noun phrases that refer to an explicitly mentioned entity in the discourse rather than an abstract idea that must be constructed from a repackaging of information revealed over an extended text. An affordance of entity anaphora is that they have easily articulated semantic types. Most of the entity CR datasets are extensively annotated for syntactic features (such as constituency parses) and semantic features (such as entity-types). However, none of the published SOTA methods (Lee et al., 2017; Joshi et al., 2019, 2020) explicitly leverage the type information.
In this paper, we present a proof of concept to demonstrate the benefits of using type information in neural approaches for CR. Named entities are generally divided generically (e.g. person, organization) or in a domain-specific manner (e.g. symptom, drug, test). In this work, we consider CR datasets that contain generic entity-types. One challenge is that the different corpora do not utilize the same set of type tags. For example, OntoNotes includes 18 types while EmailCoref includes only 4. Thus, we evaluate the performance of the proposed modeling approach on each dataset both with the set of type tags germane to the dataset and with a common set of four basic types (person, org, location, facility) inspired by research on Named Entity Recognition (NER) (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003).
Our motivation is similar to Durrett and Klein (2014), who used a structured CRF with hand-curated features to jointly model the tasks of CR, entity typing, and entity linking. Their joint architecture showed improved performance on CR over the independent baseline. However, our work differs from theirs as we show the benefits of entity-type information in neural models that use contextualized representations like BERT (Peters et al., 2018). Some prior art (Petroni et al., 2019; Roberts et al., 2020) argues that contextualized embeddings implicitly capture facts and relationships between real-world entities. However, in this work, we empirically show that access to explicit knowledge about entity-types benefits neural models that use BERT for CR. We show a consistent improvement in performance on four different coreference datasets from varied domains.

Figure 1: We improve Bamman et al. (2020) for entity coreference resolution by incorporating type information at two levels. (1) Type information is concatenated with the mention span representation created by their model; and (2) a consistency check is incorporated that compares the types of two mentions under consideration to calculate the coreference score. Please refer to Section 3 for details.
Our contribution is an evaluation of the impact of introducing type information in neural entity coreference at two different levels of granularity (which we refer to as original vs common), demonstrating its utility both in the case where gold standard type information is available and in the more typical case where it is predicted.

Related Work
Neural Coreference Resolution: Recently, neural approaches to coreference (Joshi et al., 2020, 2019; Lee et al., 2017) have begun to show their prowess. The SOTA models show impressive performance on benchmark datasets like OntoNotes (Pradhan et al., 2012) and GAP (Webster et al., 2018). The notable architecture proposed by Lee et al. (2017) scores pairs of entity mentions independently and later uses a clustering algorithm to find coreference clusters. Subsequent work improves upon this foundation by introducing an approximated higher-order inference that iteratively updates the existing span representation using its antecedent distribution. It also proposes a coarse-to-fine approach to pairwise scoring to tackle the computational challenges caused by the iterative higher-order inference. More recently, Joshi et al. (2019, 2020) showed that the use of contextual representations instead of word embeddings like GloVe (Pennington et al., 2014) can further boost the results over and above those just mentioned. Our work offers additional improvement by building on the model proposed in Bamman et al. (2020), which is based on Lee et al. (2017), and adds additional nuanced information grounded in semantic types.

Type Information: Named Entity Recognition datasets (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003; Li et al., 2016) often group entity mentions into different types (or categories) depending on the domain and the potential downstream applications of the corpus. For example, the medical corpus used in the i2b2 Challenge 2010 (Uzuner et al., 2011) annotates domain-specific types like problem, test, and symptom, whereas a more general-domain dataset like CoNLL-2002 (Tjong Kim Sang, 2002) uses generic types like person, organization, and location. Type information as a predictive signal has been shown to be beneficial for NLP tasks like relation extraction (Soares et al., 2019) and entity linking. It affords some level of disambiguation, which helps models filter out some incorrect predictions and thereby increases the probability of a correct prediction. In this work, we evaluate the benefits of using explicit type information for CR. We show that a model that leverages entity types associated with the anaphoric/antecedent mentions significantly reduces the problem of type inconsistency in the output coreference clusters and thus improves the overall performance of the neural baseline on four datasets.

Type Information for CR: Multiple prior works have shown type information to be a useful feature for shallow coreference resolution classifiers (Soon et al., 2001; Bengtson and Roth, 2008; Ponzetto and Strube, 2006; Haghighi and Klein, 2010; Durrett and Klein, 2014). Soon et al. (2001) take the most frequent sense for each noun in WordNet as the semantic class for that noun and use a decision tree for pairwise classification of whether two mentions corefer. Bengtson and Roth (2008) use a hypernym tree to extract the type information for different common nouns, and compare proper names against a predefined list to determine whether the mention is a person. They then pass this and many other features (like distance and agreement) through a regularized average perceptron for pairwise classification. This paper expands on these studies to show that entity-type information is also beneficial for neural models that use contextualized representations like BERT (Peters et al., 2018), which have been argued to implicitly capture facts and relationships between real-world entities (Petroni et al., 2019; Roberts et al., 2020).

Model
In this section, we explain how we introduce type information into a neural CR system.

Baseline
We use the model proposed in Bamman et al. (2020) as our baseline. The model gives state-of-the-art scores on the LitBank corpus (Bamman et al., 2020) and is an end-to-end mention ranking system based on Lee et al. (2017), which has shown competitive performance on the OntoNotes dataset. However, this model differs from Lee et al. (2017) as it uses BERT embeddings, omits author and genre information, and only focuses on the task of mention-linking. Since our main goal is to evaluate the benefits of type information, we too separate mention-linking from mention-identification and only show results computed over gold-standard mentions. This controls for the effects of the mention-identification module's performance on our experiments. The impact of incorporating type information in the real-world end-to-end CR setting (mention identification + linking) is left as future work.
The BERT embeddings for each token $i$ are passed through a bi-directional LSTM to obtain $x_i$. To represent a mention $m$ with start and end positions $s$ and $e$ respectively, we concatenate $x_s$, $x_e$, attention over $x_s, \ldots, x_e$, and features representing the mention width ($wi$) and inclusion within quotations ($qu$):

$$m = [x_s; x_e; \mathrm{Att}(x_s, \ldots, x_e); wi; qu] \quad (1)$$

Finally, given the representations of two mentions $m_j$ and $m_k$, their coreference score $S(m_j, m_k)$ is computed by concatenating $m_j$, $m_k$, their element-wise product $m_j \circ m_k$, the distance ($d$) between the mentions, and whether one mention is nested ($n$) within the other, and passing the result through fully-connected layers (FC):

$$S(m_j, m_k) = \mathrm{FC}([m_j; m_k; m_j \circ m_k; d; n]) \quad (2)$$
We refer the reader to (Bamman et al., 2020;Lee et al., 2017) for more details about the architecture.

Entity Type Information
We improve the above model by including entity-type information on two levels (Figure 1). First, we concatenate the entity-type $t$ of the mention to $m$ (in Eq. 1) to improve the mention representation:

$$m = [x_s; x_e; \mathrm{Att}(x_s, \ldots, x_e); wi; qu; t] \quad (3)$$

This allows the model access to the entity type of the mention as an additional feature. We call this +ET-self. Second, to check the type consistency (softly) between any two mentions under consideration as possibly coreferent, we append a feature ($tc$) in Eq. 2, which takes the value 0 if both mentions have the same type, and 1 otherwise. For example, in Figure 1, since Los Angeles and it have the same entity-type PLACE, $tc_{jk} = 0$.

$$S(m_j, m_k) = \mathrm{FC}([m_j; m_k; m_j \circ m_k; d; n; tc_{jk}]) \quad (4)$$

This part of the approach is referred to as +ET-cross throughout the remainder of the paper. We decide against the use of a hard consistency check (which would filter out mentions that do not have the same type) as it might not generalize well to bridging anaphora (Clark, 1975), where the anaphor refers to an object that is associated with, but not identical to, the antecedent (Poesio et al., 2018). In such cases, the type of the anaphor and its antecedent may not match. Finally, our architecture combines both components together as +ET (ET = ET-self + ET-cross).
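To make the two additions concrete, the following is a minimal sketch of how the +ET-self and +ET-cross features could be assembled before the fully-connected layers. It is not the authors' implementation: the one-hot encoding of the type, the `TYPE_INVENTORY` list, and all function names are assumptions made for illustration, and plain Python lists stand in for tensors.

```python
from typing import List, Sequence

# Hypothetical type inventory; the paper's corpora each define their own.
TYPE_INVENTORY = ["PER", "ORG", "LOC", "FAC", "OTHER"]


def one_hot(entity_type: str, inventory: Sequence[str] = TYPE_INVENTORY) -> List[float]:
    """Assumed encoding of the type t: a one-hot vector over the inventory."""
    return [1.0 if entity_type == name else 0.0 for name in inventory]


def add_type_to_mention(m: Sequence[float], entity_type: str) -> List[float]:
    """+ET-self: append the type encoding to the mention representation m."""
    return list(m) + one_hot(entity_type)


def type_consistency(type_j: str, type_k: str) -> int:
    """+ET-cross: tc = 0 if both mentions share a type, 1 otherwise."""
    return 0 if type_j == type_k else 1


def pair_features(m_j: Sequence[float], m_k: Sequence[float], d: float, n: float,
                  type_j: str, type_k: str) -> List[float]:
    """Build the FC input [m_j; m_k; m_j * m_k; d; n; tc_jk]."""
    product = [a * b for a, b in zip(m_j, m_k)]  # element-wise product
    tc = float(type_consistency(type_j, type_k))
    return list(m_j) + list(m_k) + product + [d, n, tc]


# Toy two-dimensional span representations for "Los Angeles" and "it".
m_la = add_type_to_mention([0.5, -1.0], "LOC")
m_it = add_type_to_mention([2.0, 0.5], "LOC")
features = pair_features(m_la, m_it, d=3.0, n=0.0, type_j="LOC", type_k="LOC")
```

Because `tc` is only one more input feature, the scorer remains free to link mentions whose types disagree, which is the soft behaviour the text argues for in the bridging case.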

Datasets
We gauge the benefits of using entity-type information on the four datasets discussed below.

LitBank. This dataset (Bamman et al., 2020) contains coreference annotations for 100 literary texts. It limits the markable mentions to six entity-types, where the majority of the mentions (83.1%) point to a person.

EmailCoref. This dataset (Dakle et al., 2020) comprises 46 email threads with a total of 245 email messages. Similar to LitBank, it considers a mention to be a span of text that refers to a real-world entity. In this work, we filter out pronouns that point towards multiple entities in the email (e.g. we, they), thus focusing only on singular mentions.

OntoNotes. From this multi-lingual dataset, we evaluate on the English subset of OntoNotes that was used in the CoNLL-2012 shared task (Pradhan et al., 2012). It contains 2802 training, 343 development, and 348 test documents. The dataset differs from LitBank in its annotation scheme, with the biggest difference being that it does not annotate singletons. It contains annotations for 18 different entity-types. However, unlike LitBank and EmailCoref, not all mentions have an associated entity-type. For example, none of the pronoun mentions are given a type even if they act as anaphors to typed entities. We partially ameliorate this issue by extracting gold coreference clusters that contain at least one typed mention and assigning the majority type in that cluster to all of its elements. For example, in Figure 1, if Los Angeles is typed PLACE, and it is in the gold coreference cluster of Los Angeles (with no other element in the cluster), then it is also assigned the type PLACE.

WikiCoref. This corpus, released by Ghaddar and Langlais (2016), comprises 30 documents from Wikipedia annotated for coreference resolution. The annotations contain additional metadata, like the associated Freebase RDF link for each mention (if available). We use this RDF entry to extract the mention's entity types from the Freebase dump. Mentions that do not get any type are marked NA. The first 24 documents are chosen for training, the next 3 for development, and the rest for testing.

The above-discussed datasets differ in the number as well as the categories of entity-types they originally annotate (Table 1). Apart from a common list of types (like PER, ORG, LOC), they also include corpus-specific categories like DIGital (EmailCoref), MONey, and LANG (OntoNotes). We carry out experiments with two sets of types, original and common, for each dataset. The common set of types includes the following 5 categories: PER, ORG, LOC, FAC, OTHER.
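The cluster-based type propagation described for OntoNotes can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the function name, the use of mention strings as identifiers, and the tie-breaking rule (the first type encountered wins a tie) are assumptions.

```python
from collections import Counter
from typing import Dict, List, Optional


def propagate_cluster_types(
    clusters: List[List[str]],
    mention_types: Dict[str, Optional[str]],
) -> Dict[str, str]:
    """Assign each gold cluster's majority type to all of its mentions.

    Clusters without any typed mention keep the placeholder type "NA".
    """
    propagated: Dict[str, str] = {}
    for cluster in clusters:
        typed = [mention_types[m] for m in cluster if mention_types.get(m) is not None]
        # Counter.most_common keeps first-seen order among equal counts (assumed tie-break).
        majority = Counter(typed).most_common(1)[0][0] if typed else "NA"
        for mention in cluster:
            propagated[mention] = majority
    return propagated


# The example from the text: the untyped pronoun inherits PLACE from its cluster.
gold_clusters = [["Los Angeles", "it"], ["they", "them"]]
annotated = {"Los Angeles": "PLACE", "it": None}
propagated_types = propagate_cluster_types(gold_clusters, annotated)
```

Note that this step relies on gold clusters, so it is only a way to complete the annotated types for the gold-type experiments.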

Experiments and Results
In this section, we provide the results of our empirical experiments.

Evaluation Metrics: We convert all three datasets into the CoNLL 2012 format and report the F1 score for the MUC, B^3, and CEAF metrics using the official CoNLL-2012 scripts. The performances are compared on the average F1 of the above-mentioned metrics. For EmailCoref, OntoNotes, and WikiCoref, we report the mean score of 5 independent runs of the model with different seeds, whereas for LitBank we present the 10-fold cross-validation results.
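As a small worked illustration of the aggregation above, the sketch below averages the three metric F1 scores per run and then averages over runs. The function names are hypothetical and the numbers are placeholders chosen for easy arithmetic, not results from the paper.

```python
from statistics import mean
from typing import Iterable, Tuple


def conll_average_f1(muc_f1: float, b_cubed_f1: float, ceaf_f1: float) -> float:
    """Unweighted mean of the MUC, B^3, and CEAF F1 scores."""
    return (muc_f1 + b_cubed_f1 + ceaf_f1) / 3.0


def mean_over_runs(per_run_scores: Iterable[Tuple[float, float, float]]) -> float:
    """Mean Avg. F1 over independent runs (e.g. different random seeds)."""
    return mean(conll_average_f1(*scores) for scores in per_run_scores)


# Placeholder scores for two runs, each given as (MUC, B^3, CEAF).
placeholder_runs = [(80.0, 70.0, 60.0), (82.0, 72.0, 62.0)]
average = mean_over_runs(placeholder_runs)
```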

Performance with Original Types
In order to establish an upper bound for improvement through introduction of type information, our first experiment leverages the original list of entity-types annotated in different corpora (+ ET (orig)), using the gold standard labels for types.
Inclusion of entity-type information improves over the baseline for all CR datasets. Table 2 presents the performance of the baseline model and the model with entity-type information. We find that entity-type information gives a boost of 0.96 Avg. F1 (p < 0.01) on LitBank, which is the new state-of-the-art score with gold mentions. This suggests that type information is helpful for CR on LitBank despite the heavily skewed distribution of entity-types in this corpus. Similarly, type information also benefits EmailCoref and WikiCoref, resulting in absolute improvements of 1.67 and 2.9 Avg. F1 points respectively (p < 0.01). We also see a 2.4 Avg. F1 improvement (p < 0.01) on OntoNotes, the largest dataset in this study. This suggests that explicit access to type information is beneficial across the board, despite the use of contextual representations which have been claimed to model real-world facts and relationships (Petroni et al., 2019).

Ablation Results: To understand the contributions of including type information to improve the mention representation (+ET-self) and of the type consistency check between candidate mentions (+ET-cross), we perform an ablation study (Table 3). We find that both components consistently provide significant performance boosts over the baseline. However, their combination (+ET) performs the best across all datasets.

Performance with Common Types
The previous experiment leverages the original entity-types assigned by dataset annotators. Due to the differences in domain and annotation guidelines among these datasets, the annotators introduce several domain-specific entity types (e.g. DIGital, Work Of Art) apart from the common four (PERson, ORGanization, LOCation, FACility) that are often used in the Named Entity Recognition literature (Tjong Kim Sang, 2002). The former can prove to be much more difficult to obtain or learn due to a dearth of relevant data. Therefore, to assess the worth of using a common entity-type list for all datasets, we map the original types (Table 1) to the above-mentioned four common types. Categories that do not map to any common type are assigned Other. The +ET (com) rows in Table 2 show the results for this experiment. Models trained with common types as features perform worse than +ET (orig), which was expected as several original types are now merged into a single category (e.g. LAW -> OTHER, LANG -> OTHER), thus somewhat reducing the effectiveness of the feature. One surprising observation is the small difference in performance on the OntoNotes dataset, despite the fact that the number of type categories reduces from 18 + Other (+ET (orig)) to 4 + Other (+ET (com)). This could be either because (1) the entities with corpus-specific types occur less frequently in OntoNotes, or (2) the baseline model does a good job of resolving them. Further research is required to understand this case, which is out of scope for this work.
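The collapse from original to common types amounts to a lookup with an OTHER fallback, sketched below. The mapping dictionary here is deliberately partial and hypothetical: it contains only the two collapses named in the text (LAW and LANG), and a real run would need the full per-corpus tables, including any corpus labels that are spelled differently from the common ones.

```python
from typing import Dict

COMMON_TYPES = {"PER", "ORG", "LOC", "FAC"}

# Partial, illustrative mapping: only the collapses stated in the text.
EXAMPLE_MAPPING: Dict[str, str] = {"LAW": "OTHER", "LANG": "OTHER"}


def to_common_type(original_type: str, mapping: Dict[str, str] = EXAMPLE_MAPPING) -> str:
    """Map a corpus-specific type to the common set, defaulting to OTHER."""
    if original_type in COMMON_TYPES:
        return original_type
    # Categories without an explicit common counterpart fall back to OTHER.
    return mapping.get(original_type, "OTHER")


collapsed = [to_common_type(t) for t in ["PER", "LAW", "LANG", "FAC"]]
```

A corpus whose label names differ from the common ones would pass its own table, for example `to_common_type("PERSON", {"PERSON": "PER"})`, where the `PERSON` label is again only an illustration.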

# Impure Clusters (#IC)
Our hypothesis around the use of entity-types was to provide additional information to the model that could be leveraged to minimize errors due to type mismatch in CR. To evaluate if the F1 score improvements achieved by +ET models are because of fewer type mismatch errors, we report the number of coreference clusters detected by the model that contain at least one element with a type that is different from the others in the cluster. Since all of the datasets used in this work only consider identity coreferences (Recasens et al., 2011) -with potentially varied definitions of identity (Bamman et al., 2020;Pradhan et al., 2012) -where the mention is a linguistic "re-packaging" of its antecedent, this measure makes sense. As shown in Table 2, the models that score lower on the impurity measure get a higher Avg F1. This suggests that the aggregate performance improvements are at least partly due to the better mention-mention comparison in +ET systems.
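The impurity measure can be computed directly from the predicted clusters and the mention types, as in the sketch below. The function name and data layout are assumptions; in particular, how untyped (NA) mentions should be treated is not specified in the text, so this sketch simply treats every label, including NA, as a distinct type.

```python
from typing import Dict, List


def count_impure_clusters(clusters: List[List[str]], mention_types: Dict[str, str]) -> int:
    """Count predicted clusters whose mentions do not all share one type."""
    impure = 0
    for cluster in clusters:
        types_in_cluster = {mention_types[mention] for mention in cluster}
        if len(types_in_cluster) > 1:
            impure += 1
    return impure


# Toy prediction: only the second cluster mixes ORG and LOC mentions.
predicted_clusters = [["a", "b"], ["c", "d", "e"], ["f"]]
types = {"a": "PER", "b": "PER", "c": "ORG", "d": "ORG", "e": "LOC", "f": "PER"}
num_impure = count_impure_clusters(predicted_clusters, types)
```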

Predicted Types
Results shown in the previous section assume the presence of gold standard types during training as well as inference, which is often impractical in the real world. Most of the new samples that a CR model would encounter would not include type information about the candidate mentions. Therefore, we set up an additional experiment to gauge the benefits of type information using predicted types. We introduce a baseline approach to infer the types of the mentions and then use these predictions in the +ET models, in place of the gold types, for coreference resolution.

Type Prediction Model
Given the mention and its immediate context, i.e. the sentence it occurs in ($S = \ldots, c_{-2}, c_{-1}, e_1, e_2, \ldots, e_n, c_1, c_2, \ldots$), we add the markers <ENT_START> and <ENT_END> before the beginning and after the end of the mention in the sentence. The new sequence ($S' = \ldots, c_{-2}, c_{-1}, \text{<ENT_START>}, e_1, e_2, \ldots, e_n, \text{<ENT_END>}, c_1, c_2, \ldots$) is tokenized using the BERT tokenizer and passed through the BERT encoder. The encoder output is then mean-pooled and passed through a fully-connected layer for classification. This architecture is motivated by Soares et al. (2019), who show that adding markers around entities before passing the sentence through BERT performs better for relation extraction.
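The input construction and pooling steps can be sketched without any model code, as below. This is a simplified illustration under stated assumptions: tokens are whitespace-level words rather than BERT word pieces, the mention span is given as inclusive indices, and the helper names are hypothetical. The encoder and the classification layer are omitted.

```python
from typing import List, Sequence


def add_entity_markers(tokens: Sequence[str], start: int, end: int) -> List[str]:
    """Wrap the mention tokens[start..end] (inclusive) in entity markers."""
    tokens = list(tokens)
    return (
        tokens[:start]
        + ["<ENT_START>"]
        + tokens[start:end + 1]
        + ["<ENT_END>"]
        + tokens[end + 1:]
    )


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average a sequence of equal-length token vectors dimension by dimension."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


sentence = ["She", "moved", "to", "Los", "Angeles", "last", "year"]
marked = add_entity_markers(sentence, start=3, end=4)
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0]])  # stand-in for encoder outputs
```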

Experiments and Results
Type Prediction: Our final evaluation of the use of types in coreference is perhaps the most important one, as it uses predicted types rather than annotated types, thus demonstrating that the benefits can be achieved in practice. Here we use the Type Prediction Model described just above. We limit the length of the input sequence to 128 tokens and use the BERT-base-cased model for our type-prediction experiments. We perform five-fold cross-validation to predict the type of each mention in the dataset. Since all four datasets suffer from class imbalance, we report both the Macro F1 score and the accuracy of the model. The model is trained for 20 epochs with early stopping (patience = 10) and is tuned on the development set for Macro F1 to give more importance to minority type categories. We do not consider NA as a separate class during type prediction for WikiCoref and OntoNotes. For evaluation of our type-prediction model, we exclude the mentions that do not have an associated gold type (NA) from the final numbers in Table 4.
As shown, our model performs well on LitBank, EmailCoref, and OntoNotes due to their favorable size in terms of training samples for the BERT-based type predictor. WikiCoref, however, proves more challenging, as the model only manages 38.0 Macro F1 points with original (orig) types and 45.0 with common (com) types, reflecting its difficulty in learning minority type categories from less data. Furthermore, our model finds it easier to predict the common (com) set of types for each dataset, as combining multiple corpus-specific types into one partially alleviates the problem of class imbalance. In line with our expectation, the largest improvement due to common types is seen for OntoNotes, where the problem reduces from an 18-way classification to a 5-way classification.
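The reason for reporting Macro F1 alongside accuracy under class imbalance can be seen in a small worked example. The sketch below uses made-up labels, not data from the paper, and computes Macro F1 over the labels present in the gold annotations (an assumption about the label set).

```python
from typing import List, Sequence


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of mentions whose predicted type equals the gold type."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over the classes present in gold."""
    scores: List[float] = []
    for label in sorted(set(gold)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# A skewed toy set: always predicting the majority class looks fine on accuracy only.
gold_labels = ["PER"] * 8 + ["ORG"] * 2
majority_predictions = ["PER"] * 10
toy_accuracy = accuracy(gold_labels, majority_predictions)
toy_macro_f1 = macro_f1(gold_labels, majority_predictions)
```

In this toy case the accuracy is 8/10, while the per-class F1 scores are 16/18 for PER and 0 for ORG, so the Macro F1 is 4/9. Selecting models by Macro F1 therefore penalizes ignoring the minority types.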
Coreference Resolution: Each mention in the corpus occurs in the test sets of the five-fold cross-validation type-prediction experiments exactly once. This allows us to infer the type of each mention using a model that is trained on a different subset of the dataset. These inferred types are used in the training and testing of the CR systems in a manner similar to the annotated types. Empirically, we found that this configuration performs better than training the +ET models with annotated types and testing with predicted types, as it exposes the CR models to the noisy types during training, thus allowing them to learn weights that take this noise into account. We report the results for both original (+ET-pred (orig)) and common (+ET-pred (com)) type categories on each dataset.

Table 5 shows the performance of the baseline and the type-informed models on the four datasets, where the types are inferred from the model described in Section 6.1. We find that the improvements from type information persist across LitBank, EmailCoref, and OntoNotes despite the use of predicted types but, as expected, remain smaller than the improvements from the gold annotated types. Scores on WikiCoref show no significant improvement over the baseline, which could be explained by the poor performance of the type prediction model on this dataset, which reduces the potency of the feature for CR.

Table 6 shows the most frequently occurring entity-types for each of the genres in OntoNotes. In line with our intuition, we find that entity-type information helps the baseline in the bc, bn, wb, and mz genres, which have less skew in their entity-type distribution. Genres like bc, bn, and wb, although dominated by PER entities, contain a substantial minority of other entity-types like ORG and GPE. Along the same lines, mz contains a majority of GPE entities but also enough entities with types PER and ORG to make type information a potentially useful feature for CR. However, there are two exceptions: +ET (orig) improves performance on tc (highest skew) and shows no significant improvement on nw (lowest skew). These findings prompt further research in the future.
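The out-of-fold scheme described at the start of this subsection, in which every mention receives a type from a predictor that never saw it during training, can be sketched as follows. The round-robin fold assignment, the function names, and the `train_fn` interface (a callable that takes training mentions and returns a predictor) are all assumptions made for illustration; the toy "model" simply predicts the majority gold type of its training fold.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence

Predictor = Callable[[str], str]


def out_of_fold_predictions(
    mentions: Sequence[str],
    train_fn: Callable[[List[str]], Predictor],
    k: int = 5,
) -> Dict[str, str]:
    """Predict a type for every mention with a model trained on the other folds."""
    folds = [list(mentions[i::k]) for i in range(k)]  # assumed round-robin split
    predicted: Dict[str, str] = {}
    for held_out_index, held_out in enumerate(folds):
        training = [m for i, fold in enumerate(folds) if i != held_out_index for m in fold]
        predict = train_fn(training)  # trained without the held-out fold
        for mention in held_out:
            predicted[mention] = predict(mention)
    return predicted


# Toy stand-in for the type predictor: always predict the training majority type.
toy_gold = {f"m{i}": ("ORG" if i % 4 == 0 else "PER") for i in range(10)}


def train_majority(training: List[str]) -> Predictor:
    majority = Counter(toy_gold[m] for m in training).most_common(1)[0][0]
    return lambda mention: majority


noisy_types = out_of_fold_predictions(list(toy_gold), train_majority)
```

The resulting `noisy_types` would then replace the gold types for both training and testing of the CR model, which is the configuration the text reports as working better than training on gold types and testing on predicted ones.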

Type Prediction: PRP vs NP
Entity coreference in discourse often takes the surface form of pronouns (PRP) (like she, they, that, it) or noun phrases (NP) (like LA, John's brother). In Table 4, we compare the performance of our type prediction model on different types of pronouns and on noun phrases of varying length. We find that the model does well in predicting types for personal pronouns (PRP (pers.)) like she and he, and for noun phrases (NP). However, it consistently underperforms on demonstrative pronouns (PRP (dem.)) like this, that, and it across all datasets. This reduced performance could be due to the fact that demonstrative pronouns do not contain any signal about the type of the entity they refer to. Therefore, the type prediction model has to rely solely on the context to make that decision. This is not the case with PRPs (pers.) and NPs, where the mention string is usually a strong indicator of the type. The problem is worsened by the imbalance due to the small presence of PRP (dem.) mentions in the different CR datasets. Since the model does not encounter enough PRPs (dem.), it might not be able to learn to give high importance to context in these cases.
This could be partially alleviated by creating a separate type-prediction path for PRP (dem.) where the mention span is masked before it is passed through the model. A model that is trained with masked mentions would focus more on the context for type prediction and thus could lead to better performance on PRPs (dem.). One could also experiment with training the type-prediction model on all of the mentions across the four datasets. The common list of types introduced in this work would allow for the creation of a larger training-set that includes mentions from multiple corpora (including external NER datasets) which could provide enough signal for the model to better learn the common types for PRPs (dem.).
Both these approaches could further boost the results for CR with predicted entity-types, ultimately reducing the gap between the scores in Tables 2 and 5. However, they are left as future work as they are out of scope for this paper.

Table 7 provides an excerpt of an email from the EmailCoref corpus. As shown, the baseline model predicts the coreference clusters for an organizer (DIG) and PricewaterhouseCoopers Calgary (ORG) incorrectly. For the former, the model mistakes it for a reference to your current home address (LOC), which is corrected by the entity-type aware models. For the latter, the baseline considers PricewaterhouseCoopers Calgary (PCC) as part of a new coreference cluster, even though it refers to the organization of the email's sender, which was previously referred to as we in the email. Models with access to gold type information (+ET (orig) and +ET (com)) are able to make that connection. +ET-pred (orig), however, is unable to cluster PCC correctly, which could be due to the fact that the type-prediction model incorrectly classifies the type of we as PER rather than ORG. This could lead to the CR model considering PCC (ORG) as a new entity in the discourse rather than a postcedent of we. This example demonstrates that sentence-level context might not be sufficient in some cases for mention type-disambiguation. We intend to experiment with models that capture long-term context and leverage external knowledge in the future.

Conclusion
In this work, we show the importance of using entity-type information in neural coreference resolution (CR) models with contextualized embeddings like BERT. Models which leverage type information annotated in the corpus substantially outperform the baseline on four CR datasets by reducing the number of type mismatches in detected coreference clusters. Since these datasets vary in the number and categories of the types they define, we also experiment with mapping the original corpus types to four common types (PER, ORG, LOC, FAC), based on previous NER research, that can be learnt more easily through large NER datasets. Models which use these common types perform slightly worse than those with original types but still show significant improvements over the baseline systems.
The presence of gold standard types during CR inference is unlikely in practice. Therefore, we propose a model that infers the type of a mention given the mention span and its immediate context to use alongside the proposed CR approach. In our evaluation, we find that using types predicted by our model for CR still performs significantly better than the baseline on three of the four datasets, thus offering stronger evidence that type information holds the potential for practical improvements for CR.

(com) experiments. These types are annotated in most named-entity recognition datasets and are therefore easier to model and learn via machine learning approaches. Tables A2, A3, A4, and A5 show the mapping from the original types of each coreference dataset used in our study to the reduced common types. The most drastic difference occurs for OntoNotes (19 -> 5) and . The OTHER type in WikiCoref is for Freebase links that did not have an associated type stored in Freebase, whereas NA is used for mentions which do not have a Freebase link. For OntoNotes, NA refers to the mentions that did not get any type assigned to them even after the use of our cluster-based type-propagation approach (explained in Section 4).