Knowledge-aware Pronoun Coreference Resolution

Resolving pronoun coreference requires knowledge support, especially in particular domains (e.g., medicine). In this paper, we explore how to leverage different types of knowledge to better resolve pronoun coreference with a neural model. To ensure the generalization ability of our model, we directly incorporate knowledge in the format of triplets, which is the most common format of modern knowledge graphs, instead of encoding it with features or rules as in conventional approaches. Moreover, since not all knowledge is helpful in a given context, we propose a knowledge attention module, which learns to select and use informative knowledge based on the context, to enhance our model. Experimental results on two datasets from different domains demonstrate the validity and effectiveness of our model, which outperforms state-of-the-art baselines by a large margin. Moreover, since our model learns to use external knowledge rather than only fitting the training data, it also outperforms the baselines in the cross-domain setting.


Introduction
As an important phenomenon of human language, coreference brings simplicity to human languages while introducing a huge challenge for machines, especially for pronouns, which are hard to interpret owing to their weak semantic meanings (Ehrlich, 1981). As a challenging yet vital subtask of general coreference resolution, pronoun coreference resolution (Hobbs, 1978) is to find the correct reference for a given pronominal anaphor in the context, and it has shown its importance in many natural language processing (NLP) tasks, such as machine translation (Mitkov et al., 1995), dialog systems (Strube and Müller, 2003), information extraction (Edens et al., 2003), and summarization (Steinberger et al., 2007).
In general, resolving pronoun coreferences requires intensive knowledge support. As shown in Table 1, answering the first question requires knowing which object can be eaten (apple vs. table), while the second question requires the knowledge that the CT scan is a test (not the hospital) and that only tests can show something. Previously, rule-based (Hobbs, 1978; Nasukawa, 1994; Mitkov, 1998; Zhang et al., 2019a) and feature-based (Ng, 2005; Charniak and Elsner, 2009; Li et al., 2011) supervised models were proposed to integrate knowledge into this task. However, while these traditional methods make it easy to incorporate external knowledge, they lack effective representation learning models that can handle such complex knowledge. Later, end-to-end solutions with neural models (Lee et al., 2017, 2018) achieved good performance on the general coreference resolution task. Although such algorithms can effectively incorporate contextual information from large-scale external unlabeled data into the model, they cannot sufficiently incorporate existing complex knowledge into the representation to cover all the knowledge one needs to build a successful pronoun coreference system. In addition, overfitting is frequently observed in deep models, which limits their performance in cross-domain scenarios and restricts their usage in real applications (Liu et al., 2018, 2019). Recently, a joint model (Zhang et al., 2019b) was proposed to connect contextual information and human-designed features for the pronoun coreference resolution task (with gold mention support) and achieved state-of-the-art performance. However, their model still requires complex features designed by experts, which are expensive and difficult to acquire, and it requires the support of gold mentions.
To address the limitations of the aforementioned models, in this paper, we propose a novel end-to-end model that learns to resolve pronoun coreferences with general knowledge graphs (KGs). Different from conventional approaches, our model does not require featurized knowledge. Instead, we directly encode knowledge triplets, the most common format of modern knowledge graphs, into our model. In doing so, the learned model can be easily applied across different knowledge types as well as domains with an adopted KG. Moreover, to address the knowledge matching issue, we propose a knowledge attention module in our model, which learns to select the most related and helpful knowledge triplets according to different contexts. Experiments conducted on general (news) and in-domain (medical) cases show that the proposed model outperforms all baseline models by a great margin. Additional experiments in the cross-domain setting further illustrate the validity and effectiveness of our model in leveraging knowledge smartly rather than fitting limited training data. All code and data are available at: https://github.com/HKUST-KnowComp/Pronoun-Coref-KG.
To summarize, this paper makes the following contributions:
1. We explore how to resolve pronoun coreferences with KGs, which outperforms all existing models by a large margin on datasets from two different domains.
2. We propose a knowledge attention module, which helps to select the most related and helpful knowledge from different KGs.
3. We evaluate the performance of different pronoun coreference models in a cross-domain setting and show that our model has better generalization ability than state-of-the-art baselines.

The Task
Given a text D, which contains a pronoun p, the goal is to identify all the mentions that p refers to. We denote the correct mentions p refers to as c ∈ C, where C is the correct mention set. Similarly, each candidate span is denoted as s ∈ S, where S is the set of all candidate spans. Note that in the case where no gold mentions are annotated, all possible spans in D are used to form S. To exploit knowledge, we denote the knowledge set as G, instantiated by multiple knowledge triplets. The task is thus to identify C out of S with the support of G. Formally, it optimizes an objective defined through F(·), the overall scoring function of p referring to s in D with G. The details of F are illustrated in the following section.
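One objective consistent with this description can be written down under an explicit assumption: that F is normalized in a softmax style over all candidate spans. This form is an assumption made for illustration and is not given in the text above. It measures the share of exponentiated score mass assigned to the correct mentions:

```latex
\mathcal{J} \;=\; \frac{\sum_{c \in C} e^{F(c,\, p,\, G,\, D)}}{\sum_{s \in S} e^{F(s,\, p,\, G,\, D)}}
```

Under this assumed form, maximizing J raises the scores of the correct mentions in C relative to the remaining candidates in S.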

Model
The overall framework of our model is shown in Figure 1. It consists of several layers. At the bottom, we encode all mention spans (s) and pronouns (p) into embeddings so as to incorporate contextual information. In the middle layer, for each pair of (s, p), we use their embeddings to select the most helpful knowledge triplets from G and generate the knowledge representations of s and p. At the top layer, we concatenate the textual and knowledge representations as the final representation of each s and p, and then use this representation to predict whether a coreference relation exists between them.

Span Representation
Contextual information is crucial to distinguish the semantics of a word or phrase, especially for text representation learning (Song et al., 2018; Song and Shi, 2018). In this work, a standard bidirectional LSTM (BiLSTM) (Hochreiter and Schmidhuber, 1997) model is used to encode each span with attention (Bahdanau et al., 2014), which is similar to the one used in Lee et al. (2017). The structure is shown in Figure 2. Let the initial word embeddings in a span s_i be denoted as x_1, ..., x_T and their encoded representations as x*_1, ..., x*_T. The weighted embedding of each span, x̂_i, is obtained by weighting the word embeddings in the span with the inner-span attention a_t, which is computed from α_t, the output of a standard feed-forward neural network, α_t = NN_α(x*_t). Finally, the starting (x*_start) and ending (x*_end) embeddings of each span are concatenated with the weighted embedding (x̂_i) and the length feature (φ(i)) to form its final representation e. The span representations of s and p are thus denoted as e_s and e_p, respectively.
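The span representation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the released implementation. It assumes that the attention a_t is a softmax over the scores α_t and that the weighted sum runs over the initial embeddings x_t, as in Lee et al. (2017). The names `span_representation`, `score_fn` (a stand-in for NN_α), and `length_feature` (a stand-in for φ(i)) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def span_representation(x, x_star, score_fn, length_feature):
    """Build e = [x*_start, x*_end, x_hat, phi] for one span.

    x              -- initial word embeddings x_1..x_T (list of vectors)
    x_star         -- encoded embeddings x*_1..x*_T (list of vectors)
    score_fn       -- hypothetical stand-in for NN_alpha: x*_t -> scalar alpha_t
    length_feature -- hypothetical stand-in for phi(i), given as a vector
    """
    alphas = [score_fn(v) for v in x_star]
    a = softmax(alphas)  # assumed form of the inner-span attention a_t
    dim = len(x[0])
    # Assumption: the weighted sum is taken over the initial embeddings x_t.
    x_hat = [sum(a_t * x_t[d] for a_t, x_t in zip(a, x)) for d in range(dim)]
    # Concatenate start, end, weighted embedding, and length feature.
    return x_star[0] + x_star[-1] + x_hat + length_feature
```

For a two-word span, the result has the dimensions of two encoded vectors, one weighted embedding, and the length feature concatenated together.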

Knowledge Representation
For each candidate span s and the target pronoun p, different knowledge from a KG can be extracted with various methods. For simplicity and generality, we use string matching in our model for knowledge extraction. Specifically, for each triplet t ∈ G, where the head and tail of t are both lists of words, if its head is the same as the string of s, we consider it to be a related triplet. We then encode the information of t by averaging the embeddings of all words in its tail. For example, if s is 'the apple' and the knowledge triplet ('the apple', IsA, 'healthy food') is found by searching the KG, we represent this relation by the averaged embeddings of 'healthy' and 'food'. Consequently, for s and p, we denote their retrieved knowledge sets as K_s and K_p, respectively, where K_s contains m_s related knowledge embeddings k_{1,s}, k_{2,s}, ..., k_{m_s,s} and K_p contains m_p of them, k_{1,p}, k_{2,p}, ..., k_{m_p,p}.
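The string-match retrieval above can be illustrated as follows. This is a sketch under stated assumptions: heads and tails are stored as whitespace-separated strings rather than word lists, and tail words without an embedding are skipped. The function names `average` and `retrieve_knowledge` and the `word_vectors` dictionary are hypothetical.

```python
def average(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def retrieve_knowledge(span_text, triplets, word_vectors):
    """Return one embedding per triplet whose head equals span_text.

    triplets     -- iterable of (head, relation, tail) strings
    word_vectors -- dict mapping a word to its embedding (list of floats)
    """
    embeddings = []
    for head, _relation, tail in triplets:
        if head != span_text:  # exact string match on the head
            continue
        tail_vectors = [word_vectors[w] for w in tail.split() if w in word_vectors]
        if tail_vectors:  # assumption: skip tails with no known words
            embeddings.append(average(tail_vectors))
    return embeddings
```

With the example from the text, the span 'the apple' and the triplet ('the apple', IsA, 'healthy food') yield a single knowledge embedding equal to the mean of the vectors for 'healthy' and 'food'.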
To incorporate the aforementioned knowledge embeddings into our model, we face a challenge that there are a huge number of such embeddings while most of them are useless in certain contexts.To solve it, a knowledge attention module is proposed to select the appropriate knowledge.
For each pair of (s, p), as shown in Figure 3, we first concatenate e_s and e_p to get the overall (span, pronoun) representation e_{s,p}, which is used to select knowledge for both s and p. Taking s as an example, we compute a weight for each k_i ∈ K_s based on e_{s,p} and k_i. The knowledge embeddings of s are then summed according to these weights to form o_s, which represents the overall knowledge for s. A similar process is conducted for p to obtain its knowledge representation o_p.
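A minimal sketch of this selection step is given below. It assumes that the weights are a softmax over scores produced by a learned scorer applied to e_{s,p} and each knowledge embedding; the exact scorer is not recoverable from the text, so `score_fn` is a hypothetical stand-in, as is the function name `knowledge_attention`.

```python
import math


def knowledge_attention(e_sp, knowledge, score_fn):
    """Return (weights, o): attention weights and the weighted knowledge sum.

    e_sp      -- concatenated (span, pronoun) representation
    knowledge -- list of knowledge embeddings k_1..k_m for one span
    score_fn  -- hypothetical stand-in for a learned scorer: (e_sp, k) -> float
    """
    if not knowledge:  # nothing was retrieved for this span
        return [], []
    scores = [score_fn(e_sp, k) for k in knowledge]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # assumed softmax normalization
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(knowledge[0])
    o = [sum(w * k[d] for w, k in zip(weights, knowledge)) for d in range(dim)]
    return weights, o
```

Because the weights depend on e_{s,p}, the same knowledge set can yield different summaries o_s for different pronouns, which is the context-dependent selection the module is meant to provide.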

Scoring
The final score of each pair (s, p) is computed from two scoring functions: one scores whether s is a valid mention, and the other identifies whether there exists a coreference relation from p to s. Element-wise multiplication is used within these functions.
After obtaining the coreference scores for all mention spans, we adopt a softmax selection over the most confident candidates for the final prediction, in which candidates whose softmax-normalized score is higher than a threshold t are selected.
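The selection step can be sketched as below. This is an illustration that assumes the softmax is taken over the raw scores F(s, p) of all candidates for one pronoun and that selection uses a strict comparison with the threshold; the function name `select_antecedents` is hypothetical.

```python
import math


def select_antecedents(scores, threshold):
    """Keep candidates whose softmax-normalized score exceeds the threshold.

    scores    -- dict mapping each candidate span to its raw score F(s, p)
    threshold -- selection threshold t, a value between 0 and 1
    """
    if not scores:
        return []
    m = max(scores.values())
    exps = {c: math.exp(v - m) for c, v in scores.items()}
    z = sum(exps.values())
    return [c for c, v in exps.items() if v / z > threshold]
```

A larger threshold keeps fewer candidates, which trades recall for precision; a very small threshold keeps almost every candidate.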

Experiments
Experiments are illustrated in this section.

Datasets
Two datasets from two different domains are used in our experiments:
• CoNLL: The CoNLL-2012 shared task (Pradhan et al., 2012) corpus, which is a widely used dataset selected from OntoNotes 5.0.
• i2b2: The i2b2 shared task dataset (Uzuner et al., 2012), consisting of electronic medical records from two different organizations, namely, Partners HealthCare (Part) and Beth Israel Deaconess medical center (Beth). All records have been fully de-identified and manually annotated with coreferences.
We split the datasets into different proportions based on their original settings. Three types of pronouns are considered in this paper following Ng (2005), i.e., third-person pronouns (e.g., she, her, he, him, them, they, it), possessive pronouns (e.g., his, hers, its, their, theirs), and demonstrative pronouns (e.g., this, that, these, those). Table 2 reports the number of the three types of pronouns and the overall statistics of the experiment datasets with their splits. Following conventional approaches (Ng, 2005; Li et al., 2011), for each pronoun, we consider its candidate mentions from the previous two sentences and the current sentence it belongs to. With this selection range of candidate mentions, each pronoun in the CoNLL data and the i2b2 data has on average 1.3 and 1.4 correct references, respectively.

Knowledge Resources
As mentioned in previous sections, our model is designed to leverage general KGs, where it takes triplets as the input of knowledge representations.For all knowledge resources, we format them as triplets and merge them together to obtain the final knowledge set.Different knowledge resources are introduced as follows.
Commonsense knowledge graph (OMCS). We use the largest commonsense knowledge base, Open Mind Common Sense (OMCS) (Singh, 2002), in this paper. OMCS contains 600K crowdsourced commonsense triplets such as (food, UsedFor, eat) and (wind, CapableOf, blow to east). All relations in OMCS are human-defined, and we select the highly confident ones (confidence score larger than 2) to form the OMCS KG, with 62,730 triplets.
Medical concepts (Medical-KG). As part of the i2b2 contest, related knowledge about medical concepts, such as (the CT scan, is, test) and (intravenous fluids, is, treatment), is provided. The annotated triplets are used as the medical concept KG, which contains 22,234 triplets.
Linguistic features (Ling). In addition to manually annotated KGs, we also consider linguistic features, i.e., plurality and animacy & gender (AG), as an important knowledge resource. The Stanford parser is employed to generate plurality, animacy, and gender markups for all the noun phrases, so as to automatically generate linguistic knowledge (in the form of triplets) for our data. Specifically, the plurality feature denotes whether each s and p is singular or plural. The animacy & gender (AG) feature denotes whether s or p is a living object and, if it is alive, whether it is male, female, or neutral. For example, a mention 'the girls' is labeled as plural and female; we use the triplets ('the girls', plurality, Plural) and ('the girls', AG, female) to represent them. As a result, we have 40,149 and 40,462 triplets for plurality and AG, respectively.
Selectional Preference (SP). Selectional preference (Hobbs, 1978) knowledge, which is the semantic constraint on word usage, is employed as the last knowledge resource. SP generally refers to the fact that, given a predicate (e.g., a verb), people have a preference for the argument (e.g., its object or subject) connected to it. To collect SP knowledge, we first parse the English Wikipedia with the Stanford parser and extract all dependency edges in the format of (predicate, argument, relation, number), where the predicate is the governor and the argument is the dependent in each dependency edge. The extracted edges are then filtered following Resnik (1997).

Baselines
Several baselines are compared in this paper, including three widely used pre-trained models:
• Deterministic model (Raghunathan et al., 2010), which is an unsupervised model and leverages manual rules to detect coreferences.
• Statistical model (Clark and Manning, 2015), which is a supervised model trained on manually crafted entity-level features between clusters and mentions.
• Deep-RL model (Clark and Manning, 2016), which uses reinforcement learning to directly optimize the coreference evaluation metric instead of the loss function of supervised learning.
The above models are included in the Stanford CoreNLP toolkit. We also include a state-of-the-art end-to-end neural model as one of our baselines:
• End2end (Lee et al., 2018), which is the current state-of-the-art model performing in an end-to-end manner and leverages both contextual information and a pre-trained language model (Peters et al., 2018).
We use their released code. In addition, to show the importance of incorporating knowledge, we also experiment with two variations of our model:
• Without KG removes the KG component and keeps all other components in the same setting as in our complete model.
• Without Attention removes the knowledge attention module and concatenates all the knowledge embeddings. All other components are identical to those of our complete model.

Implementation
Following the previous work (Lee et al., 2018), we use the concatenation of the 300d GloVe embeddings (Pennington et al., 2014)  For model training, we use cross-entropy as the loss function and Adam (Kingma and Ba, 2014) as the optimizer. All the aforementioned hyperparameters are initialized randomly, and we apply a dropout rate of 0.2 to all hidden layers in the model. For the CoNLL dataset, the model is trained for up to 100 epochs, and the best one is selected based on its performance on the development set. For the i2b2 dataset, because no development set is provided, we train the model for up to 100 epochs and use the final converged one.

Results
Table 3 reports the performance of all models, with the results for CoNLL and i2b2 in (a) and (b), respectively. Overall, our model outperforms all baselines on both datasets with respect to all pronoun types. There are several interesting observations. In general, the i2b2 dataset seems simpler than the CoNLL dataset, which might be because i2b2 only involves clinical narratives and its training data is highly similar to the test data. As a result, all neural models perform remarkably well, especially on the third-person and possessive pronouns. In addition, we also notice that it is more challenging for all models to resolve demonstrative pronouns (e.g., this, that) on both datasets, because such pronouns may refer to complex things and occur with low frequency.
Moreover, there are significant gaps in the performance of different models, with the following observations. First, models with manually defined rules or features, which cannot cover rich contextual information, perform poorly. In contrast, deep learning models (e.g., End2end and our proposed models), which leverage text representations for context, outperform other approaches by a great margin, especially on recall. Second, adding knowledge in an appropriate manner within neural models is helpful, as shown by the fact that our model outperforms both the End2end model and the Without KG variant on both datasets, especially on CoNLL, where the external knowledge plays a more important role. Third, the knowledge attention module enables our model to predict more precisely, which also results in an overall improvement in F1. To summarize, the results suggest that external knowledge is important for effectively resolving pronoun coreference, where rich contextual information determines the appropriate knowledge through a well-designed module.

Table 4: The performance of our model when removing different knowledge resources. The F1 of each case and the difference in F1 (ΔF1) between each case and the complete model are reported.

Setting              CoNLL F1   ΔF1     i2b2 F1   ΔF1
The Complete Model   75.7       -       95.2      -
- OMCS               74.8       -0.9    95.1      -0.1
- Medical-KG         74.5       -1.2    94.6      -0.6
- Ling               73.8       -1.9    94.9      -0.3
- SP                 74.0       -1.7    94.7      -0.5

Analysis
Further analysis is conducted in this section regarding the effect of different knowledge resources, model components, and settings.Details are illustrated as follows.

Ablation Study
We ablate different knowledge resources to measure their contributions to our model, with the results reported in Table 4. All knowledge resources contribute to the final performance of our model, and different knowledge types play their unique roles on different datasets. For example, the Ling knowledge contributes the most on the CoNLL dataset, while the medical knowledge is the most important one for the medical data.

Effect of the Selection Threshold
We experiment with different thresholds t for the softmax selection. The effects of t on overall performance are shown in Figure 4. In general, as t increases, fewer candidates are selected. Therefore, the overall precision increases and the recall drops. Considering that both precision and recall are important for resolving pronoun coreference, we select different thresholds for different datasets to ensure the balance between precision and recall. In detail, for the CoNLL dataset, we set t = 10^{-2} to select the most confident predictions; for the i2b2 dataset, we set t = 10^{-8} so as to keep more predictions.

Effect of Gold Mentions
The effect of adding gold mentions is shown in Table 5. Providing gold mentions to the End2end model can significantly boost its performance by 6.2 F1 and 2.1 F1 on the CoNLL and i2b2 datasets, respectively. Yet, the performance gain from gold mentions is smaller for our model. Such results suggest that incorporating KGs helps our model with mention detection. Besides that, with the help of gold mentions, our model achieves comparable (slightly better) performance to the context-and-knowledge model (Zhang et al., 2019b). As their features are originally designed for CoNLL, we only report the performance on CoNLL in Table 5. As we also included one new challenging pronoun type, the demonstrative pronoun, the overall performance of their model is lower than the one reported in the original paper. The reason our model performs better is that more knowledge resources (e.g., OMCS) can be incorporated into our model due to its generalizable design. Moreover, it is more difficult for their method (Zhang et al., 2019b) to incorporate mention detection into the model, because in that case one needs to enumerate all mention spans and generate corresponding features for all of them, which is expensive and difficult to acquire.

Cross-domain Evaluation
Considering that neural models are intensively data-driven and normally restricted by the nature of their training data, they are not easily applied in a cross-domain setting.

Case Study
To better illustrate the effectiveness of incorporating different knowledge in this task, two examples are provided for the case study in Table 7. In example A, our model correctly predicts that 'it' refers to the 'magazine' rather than the 'room', because we successfully retrieve the knowledge that, compared with the 'room', the 'magazine' is more likely to be the object of 'drop'. In example B, even though the distance between 'erythema' and 'This' is relatively far, our model is able to determine the coreference relation between them because it successfully finds out that 'erythema' is a kind of disease, while many diseases appear in the context of 'be treated' in the training data.

Related Work
Detecting mention spans in linguistic expressions and identifying coreference relations among them, namely coreference resolution, is a core task for natural language understanding. Mention detection and coreference prediction are the two major focuses of the task, as listed in Lee et al. (2017). Compared to the general coreference problem, pronoun coreference resolution has its unique challenge, since pronouns themselves have weak semantic meanings, which makes it the most challenging subtask in general coreference resolution. To address the unique difficulty brought by pronouns, we thus focus on resolving pronoun coreferences in this paper.
Resolving pronoun coreference relations often requires the support of manually crafted knowledge (Rahman and Ng, 2011; Emami et al., 2018), especially for particular domains such as medicine (Uzuner et al., 2012) and biology (Cohen et al., 2017). Previous studies on pronoun coreference resolution incorporated external knowledge including human-defined rules (Hobbs, 1978; Ng, 2005), e.g., number/gender requirements of different pronouns, domain-specific knowledge such as medical (Jindal and Roth, 2013) or biological (Trieu et al., 2018) knowledge, and world knowledge (Rahman and Ng, 2011), such as selectional preference (Wilks, 1975). Their results showed that such knowledge is helpful when appropriately used for coreference resolution. Later, end-to-end solutions (Lee et al., 2017, 2018) were proposed to learn contextual information and solve coreferences synchronously with neural networks, e.g., LSTM. However, external knowledge is often omitted in these models. Context and external knowledge have their own advantages: contextual information covers diverse text expressions that are difficult to predefine, while external knowledge is usually more precisely constructed and able to provide extra information beyond the training data. One could therefore benefit from both for this task. Different from previous studies, we provide a generic solution to resolving pronoun coreference with the support of knowledge graphs based on contextual modeling, where deep learning models are adopted to incorporate knowledge into pronoun coreference resolution and achieve remarkably good results.

Conclusion
In this paper, we explore how to build a knowledge-aware pronoun coreference resolution model, which is able to leverage different external knowledge for this task. The proposed model is an attempt at a general solution for incorporating knowledge (in the form of KGs) into a deep learning based pronoun coreference model, rather than using knowledge as features or rules in a dedicated manner. As a result, any knowledge resource presented in the format of triplets, the most widely used entry format for KGs, can be consumed by our model through the proposed attention module. Experimental results on two corpora from two different domains demonstrate the superiority of the proposed model over all baselines. Moreover, as our model learns to use knowledge rather than just fitting the training data, it achieves much better and more robust performance than state-of-the-art models in the cross-domain scenario.

Figure 1: The overall framework of our approach to pronoun coreference resolution with KGs. k_1, ..., k_m represent the retrieved knowledge for each span in the black boxes. The dotted box represents the span representation module, which generates a contextual representation for each span. The dashed box represents the knowledge selection module, which selects appropriate knowledge based on the context and generates an overall knowledge representation for each span. F(·) is the overall coreference scoring function.

Figure 2: The structure of the span representation module. BiLSTM and attention are employed to encode the contextual information.

Figure 3: The structure of the knowledge attention module. The joint representation of the candidate span and pronoun is used to select knowledge for s and p.

Figure 4: Effect of different softmax selection thresholds on our model's performance on two datasets. In general, as the threshold becomes larger, fewer candidates are selected; the precision thus increases while the recall drops.
For the selectional preference knowledge, each potential SP pair is measured by a posterior probability (Resnik, 1997), P_r(a|p) = Count_r(p, a) / Count_r(p), where Count_r(p) and Count_r(p, a) refer to how many times p and the predicate-argument pair (p, a) appear in the relation r, respectively. In our experiment, if P_r(a|p) > 0.1 and Count_r(p, a) > 10, we consider the triplet (p, r, a) (e.g., ('dog', nsubj, 'barks')) a valid SP relation. Finally, we select two SP relations, nsubj and dobj, to form the SP knowledge graph, including 17,074 and 4,536 frequent predicate-argument pairs for nsubj and dobj, respectively.
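The SP filtering rule can be sketched as follows. This is a minimal illustration, not the authors' pipeline. It assumes that the last field of each extracted edge is an occurrence count and that Count_r(p) is the sum of these counts over all arguments of p in relation r. The function name `extract_sp_triplets` and its parameter names are hypothetical.

```python
from collections import Counter


def extract_sp_triplets(edges, min_prob=0.1, min_count=10):
    """Filter dependency edges into valid SP triplets (p, r, a).

    edges -- iterable of (predicate, argument, relation, number) tuples,
             where number is assumed to be an occurrence count
    """
    pair_counts = Counter()  # Count_r(p, a)
    pred_counts = Counter()  # Count_r(p), assumed to sum over all arguments
    for predicate, argument, relation, number in edges:
        pair_counts[(predicate, relation, argument)] += number
        pred_counts[(predicate, relation)] += number

    valid = []
    for (p, r, a), count in pair_counts.items():
        prob = count / pred_counts[(p, r)]  # P_r(a | p)
        if prob > min_prob and count > min_count:
            valid.append((p, r, a))
    return valid
```

Both conditions must hold: the probability threshold removes arguments that are rare for a given predicate, and the count threshold removes pairs that are too infrequent to be reliable.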

Table 3: The performance of pronoun coreference resolution with different models on two evaluation datasets. Precision (P), recall (R), and the F1 score are reported, with the best one in each F1 column marked in bold.

Table 5: Influence of gold mentions. F1 scores on different test sets are reported. Adding human-annotated gold mentions helps both the End2end model and our model. The best-performing models are indicated in bold.

Table 6: Cross-domain performance of different models. F1 scores on the target-domain test sets are reported.

Table 7: The case study on two examples from the test data, i.e., A from CoNLL and B from i2b2. Pronouns and correct mentions are marked in red bold and blue underlined font, respectively. Knowledge triplets used for them are listed in the bottom row.