Neural Coreference Resolution with Deep Biaffine Attention by Joint Mention Detection and Mention Clustering

Coreference resolution aims to identify all mentions in a text that refer to the same real-world entity. The state-of-the-art end-to-end neural coreference model considers all text spans in a document as potential mentions and learns to link an antecedent for each possible mention. In this paper, we propose to improve the end-to-end coreference resolution system by (1) using a biaffine attention model to get antecedent scores for each possible mention, and (2) jointly optimizing the mention detection accuracy and the mention clustering log-likelihood given the mention cluster labels. Our model achieves state-of-the-art performance on the CoNLL-2012 Shared Task English test set.


Introduction
End-to-end coreference resolution is the task of identifying and grouping mentions in a text such that all mentions in a cluster refer to the same entity. An example is given below (Björkelund and Kuhn, 2014) where mentions for two entities are labeled in two clusters: [Drug Emporium Inc.]
Many traditional coreference systems, either rule-based (Haghighi and Klein, 2009; Lee et al., 2011) or learning-based (Bengtson and Roth, 2008; Fernandes et al., 2012; Durrett and Klein, 2013; Björkelund and Kuhn, 2014), usually solve the problem in two separate stages: (1) a mention detector proposes entity mentions from the text, and (2) a coreference resolver clusters the proposed mentions. At both stages, they rely heavily on complicated, fine-grained, conjoined features built via heuristics. This pipeline approach can cause cascading errors. In addition, since both stages rely on a syntactic parser and complicated hand-crafted features, it is difficult to generalize to new data sets and languages.
* Work done during the internship at IBM Watson.
Very recently, Lee et al. (2017) proposed the first state-of-the-art end-to-end neural coreference resolution system. They consider all text spans as potential mentions and therefore eliminate the need for carefully hand-engineered mention detection systems. In addition, thanks to the representation power of pre-trained word embeddings and deep neural networks, the model only uses a minimal set of hand-engineered features (speaker ID, document genre, span distance, span width).
The core of the end-to-end neural coreference resolver is the scoring function that computes the mention scores for all possible spans and the antecedent scores for pairs of spans. Furthermore, one major challenge of coreference resolution is that most mentions in the document are singletons or non-anaphoric, i.e., not coreferent with any previous mention (Wiseman et al., 2015). Since the data set only has annotations for mention clusters, the end-to-end coreference resolution system needs to detect mentions, detect anaphoricity, and perform coreference linking. Therefore, research questions still remain on good designs of the scoring architecture and the learning strategy for both mention detection and antecedent scoring given only the gold cluster labels.
Figure 1: Model architecture. We consider all text spans up to 10-word length as possible mentions. For brevity, we only show three candidate antecedent spans ("Drug Emporium Inc.", "Gary Wilber", "was named CEO") for the current span "this drugstore chain".
To this end, we propose to use a biaffine attention model instead of pure feed-forward networks to compute antecedent scores. Furthermore, instead of training only to maximize the marginal likelihood of gold antecedent spans, we jointly optimize the mention detection accuracy and the mention clustering log-likelihood given the mention cluster labels. We optimize the mention detection loss explicitly to extract mentions and also perform anaphoricity detection.
We evaluate our model on the CoNLL-2012 English data set and achieve new state-of-the-art performances of 67.8% F1 score using a single model and 69.2% F1 score using a 5-model ensemble.

Task Formulation
In end-to-end coreference resolution, the input is a document $D$ with $T$ words, and the output is a set of mention clusters, each of which refers to the same entity. A possible span is an N-gram within a single sentence. We consider all possible spans up to a predefined maximum width. To impose an ordering, spans are sorted by the start position START($i$) and then by the end position END($i$). For each span $i$, the system needs to assign an antecedent $a_i$ from all preceding spans or a dummy antecedent $\epsilon$: $a_i \in \{\epsilon, 1, \ldots, i-1\}$. If a span $j$ is a true antecedent of the span $i$, then we have $a_i = j$ and $1 \le j \le i-1$. The dummy antecedent $\epsilon$ represents two possibilities: (1) the span $i$ is not an entity mention, or (2) the span $i$ is an entity mention but not coreferent with any previous span. Finally, the system groups mentions according to coreference links to form the mention clusters.
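The span enumeration and ordering described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the representation of a span as an inclusive `(start, end)` pair of document-level word indices, and the use of `None` for the dummy antecedent are all assumptions made here.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, end): inclusive word indices into the document


def enumerate_spans(sentence_lengths: List[int], max_width: int) -> List[Span]:
    """List every candidate span that stays inside a single sentence."""
    spans: List[Span] = []
    offset = 0  # document-level index of the first word of the current sentence
    for length in sentence_lengths:
        sentence_end = offset + length  # exclusive
        for start in range(offset, sentence_end):
            # A span may cover at most max_width words and may not cross the sentence end.
            for end in range(start, min(start + max_width, sentence_end)):
                spans.append((start, end))
        offset = sentence_end
    # Order by START(i), then by END(i); tuple comparison does exactly this.
    spans.sort()
    return spans


def candidate_antecedents(i: int) -> List[Optional[int]]:
    """Candidates for the i-th span (1-based): the dummy (None) plus spans 1..i-1."""
    return [None] + list(range(1, i))
```

For a two-sentence document with 3 and 2 words and a maximum width of 2, `enumerate_spans([3, 2], 2)` yields eight spans, none of which crosses the sentence boundary between words 2 and 3.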

Model
Figure 1 illustrates our model. We adopt the same span representation approach as in Lee et al. (2017), using bidirectional LSTMs and a head-finding attention. Thereafter, a feed-forward network produces scores for spans being entity mentions. For antecedent scoring, we propose a biaffine attention model (Dozat and Manning, 2017) to produce distributions over possible antecedents. Our training data only provides gold mention cluster labels. To make the best use of this information, we propose to jointly optimize the mention scoring and antecedent scoring in our loss function.

Span Representation
Suppose the current sentence of length $L$ is $[w_1, w_2, \ldots, w_L]$. We use $\mathbf{w}_t$ to denote the concatenation of fixed pretrained word embeddings and CNN character embeddings (dos Santos and Zadrozny, 2014) for word $w_t$. Bidirectional LSTMs (Hochreiter and Schmidhuber, 1997) recurrently encode each $\mathbf{w}_t$. Then, the head-finding attention computes a score distribution over the words in a span $s_i$, using a feed-forward network (FFNN) that outputs a vector. Effective span representations encode both contextual information and the internal structure of spans. Therefore, we concatenate different vectors, including a feature vector $\phi(i)$ for the span size, to produce the span representation $\mathbf{s}_i$ for $s_i$.
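As a rough illustration of the head-finding attention and the concatenation step, the sketch below uses plain Python lists. It rests on several assumptions that are not stated in the text: the per-word attention scores are taken as given (in the model they would be produced from the LSTM states by the FFNN), the concatenated vectors are the boundary LSTM states plus the attention-weighted word vector as in Lee et al. (2017), and all function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Turn raw per-word scores into a distribution over the words of a span."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def head_attention(word_vectors: List[Vector], word_scores: List[float]) -> Vector:
    """Attention-weighted sum of the word vectors inside one span."""
    weights = softmax(word_scores)
    dim = len(word_vectors[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, word_vectors))
        for d in range(dim)
    ]


def span_representation(start_state: Vector, end_state: Vector,
                        head_vector: Vector, width_features: Vector) -> Vector:
    """Concatenate the pieces; which pieces are used is an assumption (see lead-in)."""
    return start_state + end_state + head_vector + width_features
```

With equal scores the attention reduces to a plain average, so `head_attention([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])` gives `[0.5, 0.5]`.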

Mention Scoring
The span representation is input to a feed-forward network, which produces a score $m(i)$ measuring whether the span is an entity mention. Since we consider all possible spans, the number of spans is $O(T^2)$ and the number of span pairs is $O(T^4)$. For computational efficiency, we prune candidate spans during both inference and training. We keep the $\lambda T$ spans with the highest mention scores.
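The pruning step can be sketched as a top-k selection over mention scores. This is a minimal sketch under assumptions: the function name is hypothetical, $\lambda T$ is rounded down to an integer, and the surviving spans are returned in their original order so that the antecedent ordering is preserved.

```python
import math
from typing import List


def prune_spans(mention_scores: List[float], num_words: int, lam: float = 0.4) -> List[int]:
    """Return the indices of the floor(lam * T) spans with the highest m(i)."""
    keep = int(math.floor(lam * num_words))
    # Rank span indices by mention score, highest first.
    ranked = sorted(range(len(mention_scores)),
                    key=lambda i: mention_scores[i], reverse=True)
    # Restore document order among the survivors.
    return sorted(ranked[:keep])
```

For a 5-word document and $\lambda = 0.4$, only the two best-scoring spans survive, however many candidate spans were enumerated.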
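
Biaffine Attention Antecedent Scoring
Consider the current span $s_i$ and its previous spans $s_j$ ($1 \le j \le i-1$). We propose to use a biaffine attention model to produce scores $c(i, j)$:

$$\hat{\mathbf{s}}_i = \mathrm{FFNN}_{\text{anaphora}}(\mathbf{s}_i), \qquad \hat{\mathbf{s}}_j = \mathrm{FFNN}_{\text{antecedent}}(\mathbf{s}_j), \qquad c(i, j) = \hat{\mathbf{s}}_j^{\top} U_{bi}\, \hat{\mathbf{s}}_i + v_{bi}^{\top} \hat{\mathbf{s}}_i$$

$\mathrm{FFNN}_{\text{anaphora}}$ and $\mathrm{FFNN}_{\text{antecedent}}$ reduce the span representation dimensions and only keep information relevant to coreference decisions. Compared with the traditional FFNN approach in Lee et al. (2017), biaffine attention directly models both the compatibility of $s_i$ and $s_j$ by $\hat{\mathbf{s}}_j^{\top} U_{bi}\, \hat{\mathbf{s}}_i$ and the prior likelihood of $s_i$ having an antecedent by $v_{bi}^{\top} \hat{\mathbf{s}}_i$.

Inference
The final coreference score $s(i, j)$ for span $s_i$ and span $s_j$ consists of three terms: (1) whether $s_i$ is a mention, (2) whether $s_j$ is a mention, and (3) whether $s_j$ is an antecedent for $s_i$. Furthermore, for the dummy antecedent $\epsilon$, we fix the final score to be 0:

$$s(i, j) = m(i) + m(j) + c(i, j), \qquad s(i, \epsilon) = 0$$

During inference, the model only creates a link if the highest antecedent score is positive.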
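To make the scoring and decoding steps concrete, the following sketch computes a biaffine score from the compatibility term and the prior term named above, picks the best antecedent against a dummy score of 0, and then groups linked spans into clusters. It is an illustration under assumptions rather than the authors' code: the three terms of the final score are assumed to be summed, vectors and matrices are plain Python lists, `None` stands for the dummy antecedent, and every function name is hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Vector = List[float]
Matrix = List[Vector]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def biaffine_score(s_hat_i: Vector, s_hat_j: Vector, U: Matrix, v: Vector) -> float:
    """c(i, j) = s_hat_j^T U s_hat_i + v^T s_hat_i."""
    U_s_i = [dot(row, s_hat_i) for row in U]   # U s_hat_i
    compatibility = dot(s_hat_j, U_s_i)         # s_hat_j^T U s_hat_i
    prior = dot(v, s_hat_i)                     # prior of s_i having an antecedent
    return compatibility + prior


def best_antecedent(m_i: float,
                    candidates: List[Tuple[int, float, float]]) -> Optional[int]:
    """candidates holds (j, m(j), c(i, j)); returns j, or None for the dummy."""
    best_j, best_score = None, 0.0  # the dummy antecedent has a fixed score of 0
    for j, m_j, c_ij in candidates:
        score = m_i + m_j + c_ij  # assumed additive combination of the three terms
        if score > best_score:    # link only if the score is strictly positive
            best_j, best_score = j, score
    return best_j


def build_clusters(antecedent_of: Dict[int, Optional[int]]) -> List[List[int]]:
    """Group spans by following antecedent links; unlinked spans form no cluster."""
    cluster_of: Dict[int, int] = {}
    clusters: List[List[int]] = []
    for i in sorted(antecedent_of):
        j = antecedent_of[i]
        if j is None:
            continue
        if j not in cluster_of:  # j starts a new cluster the first time it is linked to
            cluster_of[j] = len(clusters)
            clusters.append([j])
        cluster_of[i] = cluster_of[j]
        clusters[cluster_of[j]].append(i)
    return clusters
```

Because every antecedent precedes its span, processing spans in order is enough to build the clusters in a single pass.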

Joint Mention Detection and Mention Cluster
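During training, only mention cluster labels are available rather than antecedent links. Therefore, Lee et al. (2017) train the model end-to-end by maximizing the following marginal log-likelihood, where GOLD($i$) are the gold antecedents for $s_i$:

$$L_{\text{cluster}}(i) = \log \frac{\sum_{j' \in \mathrm{GOLD}(i)} \exp\big(s(i, j')\big)}{\sum_{j \in \{\epsilon, 1, \ldots, i-1\}} \exp\big(s(i, j)\big)}$$

However, the initial pruning is completely random, and the mention scoring model only receives distant supervision if we only optimize the above mention cluster performance. This makes learning slow and ineffective, especially for mention detection. Based on this observation, we propose to directly optimize mention detection:

$$L_{\text{detect}}(i) = y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i)$$

where $\hat{y}_i = \mathrm{sigmoid}(m(i))$, and $y_i = 1$ if and only if $s_i$ is in one of the gold mention clusters. Our final loss combines mention detection and clustering:

$$L_{\text{loss}} = -\lambda_{\text{detection}} \sum_{i=1}^{N} L_{\text{detect}}(i) - \sum_{i'=1}^{N'} L_{\text{cluster}}(i')$$

where $N$ is the number of all possible spans, $N'$ is the number of unpruned spans, and $\lambda_{\text{detection}}$ controls the weights of the two terms.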
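A small numerical sketch may help show how the two objectives combine. It assumes that the detection term is a binary cross-entropy log-likelihood over $\hat{y}_i = \mathrm{sigmoid}(m(i))$, that the clustering term is a softmax marginal over the gold antecedents, and that a span without a gold antecedent has the dummy (`None`) as its only gold candidate, as in Lee et al. (2017). Function names and data layout are hypothetical.

```python
import math
from typing import Dict, List, Optional, Set, Tuple


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def detection_log_likelihood(m_i: float, y_i: int) -> float:
    """y log(y_hat) + (1 - y) log(1 - y_hat) with y_hat = sigmoid(m(i))."""
    # log(1 - sigmoid(x)) equals log(sigmoid(-x)).
    return y_i * log_sigmoid(m_i) + (1 - y_i) * log_sigmoid(-m_i)


def log_sum_exp(values: List[float]) -> float:
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def cluster_log_likelihood(scores: Dict[Optional[int], float],
                           gold: Set[Optional[int]]) -> float:
    """Log of the probability mass that the antecedent softmax puts on gold antecedents.

    scores maps each candidate (None = dummy, fixed at 0.0) to s(i, j).
    """
    log_gold = log_sum_exp([scores[j] for j in gold])
    log_all = log_sum_exp(list(scores.values()))
    return log_gold - log_all


def joint_loss(mention_scores: List[float], labels: List[int],
               unpruned: List[Tuple[Dict[Optional[int], float], Set[Optional[int]]]],
               lambda_detection: float = 0.1) -> float:
    """Detection term over all N spans, clustering term over the N' unpruned spans."""
    detect = sum(detection_log_likelihood(m, y)
                 for m, y in zip(mention_scores, labels))
    cluster = sum(cluster_log_likelihood(s, g) for s, g in unpruned)
    return -lambda_detection * detect - cluster
```

Note that the detection term runs over every candidate span, so spans that are pruned before antecedent scoring still contribute a training signal to the mention scorer.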

Experiments
Data Set and Evaluation
We evaluate our model on the CoNLL-2012 Shared Task English data (Pradhan et al., 2012), which is based on the OntoNotes corpus (Hovy et al., 2006). It contains 2,802/343/348 train/development/test documents in different genres. We use three standard metrics: MUC (Vilain et al., 1995), $B^3$ (Bagga and Baldwin, 1998), and $\mathrm{CEAF}_{\phi_4}$ (Luo, 2005). We report Precision, Recall, and F1 for each metric and the average F1 as the final CoNLL score.

Implementation Details
For fair comparison, we follow the same hyperparameters as in Lee et al. (2017). We consider all spans up to 10 words and up to 250 antecedents. $\lambda = 0.4$ is used for span pruning. We use fixed concatenations of 300-dimension GloVe (Pennington et al., 2014) embeddings and 50-dimension embeddings from Turian et al. (2010). Character CNNs use 8-dimension learned embeddings and 50 kernels for each window size in {3, 4, 5}. LSTMs have hidden size 200, and each FFNN has two hidden layers with 150 units and ReLU (Nair and Hinton, 2010) activations. We include the features (speaker ID, document genre, span distance, span width) as 20-dimensional learned embeddings. Word and character embeddings use 0.5 dropout. All hidden layers and feature embeddings use 0.2 dropout. The batch size is 1 document. Based on the results on the development set, $\lambda_{\text{detection}} = 0.1$ works best from {0.05, 0.1, 0.5, 1.0}. The model is trained with the Adam optimizer (Kingma and Ba, 2015) and converges in around 200K updates, which is faster than the model of Lee et al. (2017).
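For readers who want the settings above in one place, they can be collected into a configuration object. The sketch below only restates the values listed in this section; the class name and field names are hypothetical and do not correspond to any released code.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CorefConfig:
    """Hyperparameters as stated in the text; all field names are illustrative."""
    max_span_width: int = 10            # spans up to 10 words
    max_antecedents: int = 250          # antecedents considered per span
    span_ratio: float = 0.4             # lambda for span pruning
    glove_dim: int = 300                # fixed GloVe embeddings
    turian_dim: int = 50                # fixed Turian et al. (2010) embeddings
    char_embedding_dim: int = 8         # learned character embeddings
    char_cnn_kernels: int = 50          # kernels per window size
    char_cnn_windows: Tuple[int, ...] = (3, 4, 5)
    lstm_hidden_size: int = 200
    ffnn_hidden_layers: int = 2
    ffnn_hidden_units: int = 150
    feature_embedding_dim: int = 20     # speaker, genre, distance, width features
    lexical_dropout: float = 0.5        # word and character embeddings
    hidden_dropout: float = 0.2         # hidden layers and feature embeddings
    documents_per_batch: int = 1
    lambda_detection: float = 0.1       # chosen from {0.05, 0.1, 0.5, 1.0} on dev

    @property
    def word_embedding_dim(self) -> int:
        """Size of the fixed concatenation of the two pretrained embeddings."""
        return self.glove_dim + self.turian_dim
```

Freezing the dataclass keeps a run's settings immutable, and `dataclasses.replace(config, lambda_detection=0.5)` would give a modified copy for a sweep over the detection weight.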
Overall Performance
In Table 1, we compare our model with previous state-of-the-art systems. We obtain the best results in all F1 metrics. Our single model achieves 67.8% F1 and our 5-model ensemble achieves 69.2% F1. In particular, compared with Lee et al. (2017), our improvement mainly results from the precision scores. This indicates that the mention detection loss does produce better mention scores and that the biaffine attention more effectively determines if two spans are coreferent.

Ablation Study
To understand the effect of the different proposed components, we perform an ablation study on the development set. As shown in Table 2, removing the mention detection loss term or the biaffine attention decreases the final F1 score by 0.3 or 0.4, respectively, but the result is still higher than the baseline. This shows that both components contribute, and when they work together the total gain is even higher.
Figure 2: Mention detection subtask on development set. We plot accuracy and frequency breakdown by span widths.

Mention Detection Subtask
To further understand our model, we perform a mention detection subtask where spans with mention scores higher than 0 are considered as mentions. We show the mention detection accuracy breakdown by span widths in Figure 2. Our model indeed performs better thanks to the mention detection loss. The advantage is even clearer for longer spans which consist of 5 or more words.
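The breakdown by span width can be computed along the following lines. This is a sketch under assumptions: the text does not define "accuracy" precisely, so here it is taken to be the fraction of candidate spans of a given width whose thresholded prediction (mention score above 0) agrees with the gold label, and the function name and data layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int]  # (start, end), inclusive word indices


def accuracy_by_width(spans: List[Span], mention_scores: List[float],
                      gold_mentions: Set[Span]) -> Dict[int, float]:
    """Per-width agreement between 'score > 0' predictions and gold mentions."""
    correct: Dict[int, int] = defaultdict(int)
    total: Dict[int, int] = defaultdict(int)
    for (start, end), score in zip(spans, mention_scores):
        width = end - start + 1
        predicted = score > 0            # threshold used in the subtask
        is_gold = (start, end) in gold_mentions
        total[width] += 1
        correct[width] += int(predicted == is_gold)
    return {width: correct[width] / total[width] for width in sorted(total)}
```

The `total` counts give the frequency of each width, which is the second quantity plotted in Figure 2.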
In addition, it is important to note that our model can detect mentions that do not exist in the training data. While Moosavi and Strube (2017) observe that there is a large overlap between the gold mentions of the training and dev (test) sets, we find that our model can correctly detect 1048 mentions which are not detected by Lee et al. (2017), consisting of 386 mentions existing in the training data and 662 mentions not existing in the training data. Among those 662 mentions, some examples are (1) a suicide murder, (2) Hong Kong Island, (3) a US Airforce jet carrying robotic undersea vehicles, and (4) the investigation into who was behind the apparent suicide attack. This shows that our mention loss helps detection by generalizing to new mentions in the test data rather than memorizing the existing mentions in the training data.

Related Work
As summarized by Ng (2010), learning-based coreference models can be categorized into three types: (1) Mention-pair models train binary classifiers to determine if a pair of mentions are coreferent (Soon et al., 2001; Ng and Cardie, 2002; Bengtson and Roth, 2008). (2) Mention-ranking models explicitly rank all previous candidate mentions for the current mention and select a single highest scoring antecedent for each anaphoric mention (Denis and Baldridge, 2007b; Wiseman et al., 2015; Clark and Manning, 2016a; Lee et al., 2017). (3) Entity-mention models learn classifiers to determine whether the current mention is coreferent with a preceding, partially-formed mention cluster (Clark and Manning, 2015; Wiseman et al., 2016; Clark and Manning, 2016b).
In addition, we also note latent-antecedent models (Fernandes et al., 2012; Björkelund and Kuhn, 2014; Martschat and Strube, 2015). Fernandes et al. (2012) introduce coreference trees to represent mention clusters and learn to extract the maximum scoring tree in the graph of mentions.
Recently, several neural coreference resolution systems have achieved impressive gains (Wiseman et al., 2015, 2016; Clark and Manning, 2016b,a). They utilize distributed representations of mention pairs or mention clusters to dramatically reduce the number of hand-crafted features. For example, Wiseman et al. (2015) propose the first neural coreference resolution system by training a deep feed-forward neural network for mention ranking. However, these models still employ the two-stage pipeline and require a syntactic parser or a separately designed, hand-engineered mention detector.
Finally, we also note the relevant work on joint mention detection and coreference resolution. Daumé III and Marcu (2005) propose to model both mention detection and coreference of the Entity Detection and Tracking task simultaneously. Denis and Baldridge (2007a) propose to use an integer linear programming framework to model anaphoricity and coreference as a joint task.

Conclusion
In this paper, we propose to use a biaffine attention model to jointly optimize mention detection and mention clustering in the end-to-end neural coreference resolver. Our model achieves state-of-the-art performance on the CoNLL-2012 Shared Task in English.