Coreference Resolution in Full Text Articles with BERT and Syntax-based Mention Filtering

This paper describes our system developed for the coreference resolution task of the CRAFT Shared Tasks 2019. The CRAFT corpus is more challenging than other existing corpora because it contains full text articles. We employ an existing span-based state-of-the-art neural coreference resolution system as our baseline and enhance it with two techniques to capture long-distance coreferent pairs. Firstly, we filter noisy mentions based on parse trees while increasing the number of antecedent candidates. Secondly, instead of relying on LSTMs, we integrate the highly expressive language model BERT into our model. Experimental results show that our proposed systems significantly outperform the baseline. The best performing system obtained F-scores of 44%, 48%, 39%, 49%, 40%, and 57% on the test set with the B3, BLANC, CEAFE, CEAFM, LEA, and MUC metrics, respectively. Additionally, the proposed model is able to detect coreferent pairs over long distances, even of more than 200 sentences.


Introduction
Coreference resolution is important not only in general domains but also in the biomedical domain. The Colorado Richly Annotated Full Text (CRAFT) corpus (Cohen et al., 2017) was constructed with the aim of boosting the performance of this task in the biomedical literature. Unlike other corpora, CRAFT is comprised of full text articles (full papers), and its coreference chains can be arbitrarily long: the mean chain length is 4 while the longest chain is 186, which makes resolution even more difficult than usual. The corpus was fully released in the CRAFT Shared Task 2019. In this paper, we present our approach to the coreference resolution task on this challenging corpus.
We employ the state-of-the-art end-to-end coreference system (Lee et al., 2017) as our baseline. The system generates all continuous sequences of words (spans) in each sentence as mention candidates, which means the number of candidates increases linearly with the number of sentences. Such candidates may contain a large number of noisy spans, i.e., spans that do not fit any noun phrase according to the corresponding parse tree. Such noisy spans are often wasteful when included in the list of candidates for the coreference resolution step. Especially for the CRAFT corpus, in which the average number of sentences per document is more than 300, the number of noisy spans is large and needs to be reduced. Also, our observations on the CRAFT corpus show that in many cases a mention and its antecedent are far apart, e.g., a mention can occur in the results section of a paper while its antecedent is in the abstract.
To address these problems, we enhance the baseline system in two ways: we filter noisy spans using syntactic information and increase the number of antecedent candidates to capture long-distance coreferent pairs. We further boost the system by replacing the underlying Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) layer with the Bidirectional Encoder Representations from Transformers (BERT) model (Devlin et al., 2019), a contextualized language model that efficiently captures context in a wide range of NLP tasks.
We have evaluated our system on six common metrics for coreference resolution, including B3, BLANC, CEAFE, CEAFM, LEA, and MUC, using the official evaluation script provided by the shared task organizers. By increasing the number of antecedents and filtering noisy ones, we could boost the recall of mention detection, hence improving the performance of coreference resolution. When incorporating BERT into the system, we attained better scores in both mention detection and coreference resolution on every metric.
Our contributions are as follows.
• We proposed a new method to filter noisy spans, which addresses a weakness of the baseline system (Lee et al., 2017). Our filtering method, based on syntactic trees, removed up to 90% of noisy spans while keeping 93% of correct mentions on the development set.
The method makes our model more computationally efficient than the baseline, which allows us to increase the number of antecedent candidates and capture long-distance coreferent pairs.
• We successfully integrated the BERT model, replacing the LSTM layers, for the coreference resolution task and obtained significant improvements.
• Although we only experimented on the CRAFT corpus, our proposed method is general enough to be applied to other corpora with long documents.

LSTM-based Baseline Model
Our model is based on the span-based end-to-end model (Lee et al., 2017). The model employs an exhaustive method to create all continuous sequences of words (spans) in each sentence. The representation of a span from the k-th word to the l-th word in a sentence is calculated by concatenating the information of the first word, the last word, the head word, and a span width feature:

g_{k,l} = [h_k; h_l; w_{k..l}; φ(k, l)]    (1)

where h_k and h_l are embeddings of the first and last words calculated by a bidirectional LSTM; w_{k..l} is the weighted sum of the word vectors in the span; and φ(k, l) encodes the size of this span. Mention scores are calculated by a feed-forward neural network given the span representation:
s_m(k, l) = w_m · FFNN_m(g_{k,l})    (2)

where w_m is a learnable weight vector and FFNN denotes a feed-forward neural network.
Since the span-based model generates a large number of spans, a simple pruning technique is used: spans are ranked by mention score and only the top candidates are kept, their number being a ratio λ multiplied by the document size.
To find an antecedent for each mention, we calculate the antecedent score as follows:

s_a((k, l), (u, v)) = w_a · FFNN_a([g_{k,l}; g_{u,v}; g_{k,l} • g_{u,v}; φ((k, l), (u, v))])    (3)

where w_a is a learnable weight vector; • denotes element-wise multiplication; and φ((k, l), (u, v)) represents the feature vector between the two mentions.
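For concreteness, the following PyTorch-style sketch mirrors Equations 1-3. The module structure, hidden sizes, and the use of a final linear layer in place of the weight vectors w_m and w_a are our own illustration under these assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class SpanScorer(nn.Module):
    """Illustrative sketch of span, mention, and antecedent scoring (Eqs. 1-3)."""
    def __init__(self, hidden_dim, feat_dim, ffnn_dim):
        super().__init__()
        span_dim = 3 * hidden_dim + feat_dim   # [h_k; h_l; w_{k..l}; phi(k, l)]
        pair_dim = 3 * span_dim + feat_dim     # [g_i; g_j; g_i • g_j; phi(i, j)]
        self.ffnn_m = nn.Sequential(nn.Linear(span_dim, ffnn_dim), nn.ReLU(),
                                    nn.Linear(ffnn_dim, 1))
        self.ffnn_a = nn.Sequential(nn.Linear(pair_dim, ffnn_dim), nn.ReLU(),
                                    nn.Linear(ffnn_dim, 1))

    def span_embedding(self, h_k, h_l, w_span, width_feat):
        # Eq. (1): concatenate first word, last word, head word, and width feature
        return torch.cat([h_k, h_l, w_span, width_feat], dim=-1)

    def mention_score(self, g):
        # Eq. (2): the final linear layer plays the role of w_m
        return self.ffnn_m(g).squeeze(-1)

    def antecedent_score(self, g_i, g_j, pair_feat):
        # Eq. (3): the element-wise product captures similarity between the two spans
        return self.ffnn_a(torch.cat([g_i, g_j, g_i * g_j, pair_feat],
                                     dim=-1)).squeeze(-1)
```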

Coreference Resolution with BERT
Recently, BERT (Devlin et al., 2019) has shown significant improvements on various tasks in comparison with other deep learning models, including LSTMs. This highly expressive language model is able to capture contextual information effectively. We therefore aim at investigating whether this architecture can work effectively on coreference resolution in comparison with the previous LSTM-based models. In the BERT model, contextual representations are assigned to the sub-words of each word. We use the representation of the last sub-word in a word as the representation of the word and calculate the span representation using Equation 1. Since the pre-trained BERT model only supports sequences of up to 512 sub-words, we utilize a sliding-window technique with a window size of 512 and a stride of 256 for longer sequences, and then retrieve sub-word embeddings from the windows so that each sub-word has maximum left and right context. We adapted the mention score and antecedent score functions as in Equations 2 and 3.
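A small sketch of this windowing step is given below. The helper encode_window, which stands in for a BERT forward pass over at most 512 sub-words and returns one vector per sub-word, and the "closest to window centre" tie-breaking are our own reading of "maximum left and right context", not the actual implementation.

```python
import numpy as np

def sliding_window_embeddings(subword_ids, encode_window, window=512, stride=256):
    """Encode a long sub-word sequence with overlapping windows and keep, for each
    sub-word, the embedding from the window in which it is most central."""
    n = len(subword_ids)
    starts = list(range(0, max(n - window, 0) + 1, stride))
    if starts[-1] + window < n:            # make sure the final sub-words are covered
        starts.append(n - window)

    best = {}                              # position -> (distance to window centre, vector)
    for s in starts:
        chunk = subword_ids[s:s + window]
        vecs = encode_window(chunk)        # assumed shape: (len(chunk), dim)
        centre = s + len(chunk) / 2.0
        for offset in range(len(chunk)):
            pos = s + offset
            dist = abs(pos - centre)       # smaller = more context on both sides
            if pos not in best or dist < best[pos][0]:
                best[pos] = (dist, vecs[offset])

    return np.stack([best[i][1] for i in range(n)])
```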

Learning Parse Trees to Filter Mentions
A weakness of the span-based baseline model is that the greedy method generates a large number of noisy, mostly meaningless, spans. Although Lee et al. (2017) proposed to select the k-best candidates, this strategy is problematic when working on long documents, in which a mention may be far away from its true antecedents while a large number of noisy candidates lie between them.

Figure 1: Three patterns corresponding to three gold mentions are extracted from the parse tree: "a diurnal rhythm with IOP" (pattern: (NP, IN, NP)), "the dark period of the day" (pattern: (NP)), and "the day" (pattern: (NP)).

In order to overcome this issue, we propose to filter noisy spans based on their syntactic information. We observe that in a task like coreference resolution, mentions usually follow syntactic structures such as noun phrases. We therefore learn a syntactic parsing model to parse sentences and then extract patterns of gold mentions from the resulting parse trees.
The end-to-end parsing model is trained jointly by the two following steps.
• Part-of-speech (POS) classifier: given raw sentences from the training set, words are split into sub-words with corresponding vectors from BERT embeddings. The last sub-word embedding of each word is used as the word embedding and passed through a linear layer to predict POS tags (a minimal sketch is given after this list). The gold POS tags are obtained from the CRAFT training set. The predicted POS tags and the raw text are used as input to the parsing model.
• Parser: our model is based on the constituency parsing model (Kitaev and Klein, 2018), which builds parse trees with a self-attentive encoder and achieved state-of-the-art performance on the Penn Treebank. Unlike their model, we replace the self-attentive encoder with BERT.
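As an illustration of the POS classifier described above, here is a minimal PyTorch-style sketch of last-sub-word pooling followed by a linear tag classifier; the encoder argument, tensor shapes, and all names are our own assumptions rather than the actual implementation.

```python
import torch
import torch.nn as nn

class LastSubwordPOSTagger(nn.Module):
    """Pool BERT sub-word states to the last sub-word of each word,
    then classify each word into a POS tag."""
    def __init__(self, encoder, hidden_dim, num_tags):
        super().__init__()
        self.encoder = encoder                 # assumed to return (batch, n_subwords, hidden)
        self.classifier = nn.Linear(hidden_dim, num_tags)

    def forward(self, subword_ids, last_subword_index):
        # subword_ids: (batch, n_subwords); last_subword_index: (batch, n_words), long
        subword_states = self.encoder(subword_ids)
        idx = last_subword_index.unsqueeze(-1).expand(-1, -1, subword_states.size(-1))
        word_states = torch.gather(subword_states, 1, idx)   # (batch, n_words, hidden)
        return self.classifier(word_states)                  # POS tag logits per word
```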
Figure 1 presents an example of using a parse tree to extract the patterns of gold mentions. In this example, three patterns corresponding to three gold mentions are extracted: (NP, IN, NP), (NP), and (NP). In the coreference resolution model, generated spans that match the learned patterns are fed into the span representation layer to create span embeddings, while unmatched spans are ignored.
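The sketch below shows one way to realise this pattern extraction and filtering. Tiling a mention span with the largest constituents that exactly cover it is our reading of the Figure 1 example (a mention that is itself a single constituent simply yields the pattern (NP)); the helper names are hypothetical.

```python
from collections import Counter
from nltk.tree import Tree   # parse trees are assumed to be nltk Tree objects

def mention_pattern(tree, start, end):
    """Label sequence of the largest constituents that exactly tile the token span
    [start, end), e.g. (NP, IN, NP) for "a diurnal rhythm with IOP" in Figure 1."""
    spans = []                                   # (begin, end, label) for every subtree
    def walk(node, offset):
        if isinstance(node, str):                # leaf token
            return offset + 1
        begin = offset
        for child in node:
            offset = walk(child, offset)
        spans.append((begin, offset, node.label()))
        return offset
    walk(tree, 0)

    pattern, pos = [], start
    while pos < end:
        fits = [(e, lab) for b, e, lab in spans if b == pos and e <= end]
        if not fits:
            return None                          # span cannot be tiled by constituents
        e, lab = max(fits, key=lambda x: x[0])   # greedily take the widest node
        pattern.append(lab)
        pos = e
    return tuple(pattern)

def learn_patterns(parsed_mentions, min_freq=5):
    """Count patterns of gold mentions (tree, (start, end)) and keep frequent ones."""
    counts = Counter(mention_pattern(tree, s, e) for tree, (s, e) in parsed_mentions)
    counts.pop(None, None)
    return {p for p, c in counts.items() if c >= min_freq}

def keep_span(tree, start, end, patterns):
    """A candidate span survives filtering only if its pattern was learned."""
    return mention_pattern(tree, start, end) in patterns
```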

Dataset
The organizers provided two subsets of the CRAFT corpus (Cohen et al., 2017): one for training and one for testing systems. To estimate our model's performance before submitting test results, we further divided the original training set into two subsets, namely training and development sets. Table 1 shows the statistics of these three subsets.

Compared Models
In order to show the effect of our proposed methods, we compare the following models.

• LSTM filter: this is the same as the LSTM baseline model, but we applied the filtering method and increased the number of antecedent candidates to 600 instead of 250.
• BERT: we employed the pre-trained SciBERT model (Beltagy et al., 2019) instead of the LSTM used in the baseline model. The number of antecedent candidates was 600.
• BERT filter: we used the same settings as BERT but we combined it with the filtering method.

Results and Discussion
We firstly present the results of extracting patterns to filter mentions. We then report and discuss the performance of our models on the official test set.
In order to investigate the effect of the proposed method in depth, we describe the results of ablation tests on the development set. We finally conduct an analysis to see how each model works on each group of sentence-level distances between mentions and antecedents.

Table 2 reports the patterns with the highest frequencies in the training set. In total, we extracted 1,561 unique patterns. To avoid low-quality filtering, we kept patterns with a minimum frequency threshold of 5. The threshold was chosen from our experiments so that we could filter a large number of noisy spans while still keeping a high recall on the development set. Specifically, this filtering method reduces up to 90% of noisy spans while keeping 93% of correct mentions on the development set.
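To illustrate how such a threshold can be validated, here is a hypothetical sketch that reuses learn_patterns and keep_span from the previous listing and measures, on the development set, the share of noisy spans removed and the share of gold mentions kept.

```python
def filtering_stats(dev_docs, patterns):
    """dev_docs: iterable of (tree, candidate_spans, gold_spans) per sentence."""
    kept_noise = removed_noise = kept_gold = missed_gold = 0
    for tree, candidates, gold in dev_docs:
        gold = set(gold)
        for start, end in candidates:
            survives = keep_span(tree, start, end, patterns)
            if (start, end) in gold:
                kept_gold += survives
                missed_gold += not survives
            else:
                kept_noise += survives
                removed_noise += not survives
    noise_reduction = removed_noise / max(kept_noise + removed_noise, 1)
    mention_recall = kept_gold / max(kept_gold + missed_gold, 1)
    return noise_reduction, mention_recall   # ~0.90 and ~0.93 reported in the paper
```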

Evaluation on the Test Set
The results on the official test set are presented in Table 3. In summary, our BERT filter obtained the best performance on both mention and coreference detection in all metrics.
Mention detection For mention detection, most models obtained approximately the same precision of more than 70%. However, the recall of the BERT filter is much higher than those of the LSTM and LSTM filter (57% vs. 35% and 41%, respectively). Consequently, the F-score of the BERT filter is 16% and 11% points higher than those of the LSTM and LSTM filter, respectively. The E2E MetaMap is 5% points lower than the BERT filter in F-score.
Coreference detection By obtaining the highest recall in mention detection, the BERT filter achieved the highest scores in coreference detection in all metrics. Using mention filtering improved the baseline LSTM by 4-16% points in F-score, depending on the metric. When replacing the LSTM with BERT and combining it with mention filtering, we obtained significant improvements in F-score: +19% points in B3 and LEA; +16% points in MUC, BLANC, and CEAFM; and +13% points in CEAFE.
The E2E MetaMap performance is higher than the LSTM and LSTM filter, but lower than the BERT filter.
As aforementioned, the LSTM model is based on the PyTorch implementation while the E2E MetaMap is based on the Tensorflow repository. Therefore, it is difficult to verify whether the performance difference comes from using MetaMap features or from the implementation. Due to time constraints, we have not conducted experiments to clarify the reasons yet; we leave this as future work.

Ablation Tests
We conducted experiments on the development set to show the effect of using mention filtering and BERT. In order to compare BERT directly with the LSTM, we also conducted an experiment with BERT and set the number of antecedent candidates to 250; we name this model BERT 250. Meanwhile, LSTM, LSTM filter, BERT, and BERT filter have the same settings as described in Section 3.2. All of the results are reported in Table 4.
Mention Filtering When we used mention filtering, the mention detection precision dropped by 6% points in the case of LSTM, but in the case of BERT it was almost unchanged. However, the filtering helped to improve recall in both cases, which is important for the coreference detection step. As a result, in the coreference resolution step, mention filtering improved the F-score by 2-8% points in all metrics.
Using BERT Using BERT significantly boosted the performance of the baselines in both mention detection and coreference resolution. For mention detection, BERT produced almost the same precision as the LSTM but much higher recall (+17% points), which led to an increase of 10% points in F-score. For coreference detection, BERT-based models outperformed the LSTM-based ones by 4-14% points in F-score across all metrics.
In summary, when combining both techniques (BERT filter vs. LSTM), we achieved an increase of more than 14% points in F-score for mention detection and of 11% to 20% points in F-score across all metrics for coreference resolution on the development set.

Analysis
To investigate the effect of the distance between a mention and its antecedent(s) on each model, we calculated the number of true positive coreference predictions in the development set and grouped them by sentence-level distance. Specifically, we divided the true positive predictions into five groups: ≤10, 11-50, 51-100, 101-200, and >200. The first two groups can be considered short-distance coreference, e.g., in abstracts such as those of the BioNLP dataset (Nguyen et al., 2011), which average nine sentences per document. Meanwhile, the other three groups can be considered long-distance coreference, as found in full papers such as those in the CRAFT corpus.
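The grouping itself is straightforward; a small sketch of how true-positive pairs can be binned by sentence-level distance (function and variable names are ours):

```python
from collections import Counter

def distance_group(sentence_distance):
    """Map a mention-antecedent sentence distance to one of the five analysis groups."""
    if sentence_distance <= 10:
        return "<=10"
    if sentence_distance <= 50:
        return "11-50"
    if sentence_distance <= 100:
        return "51-100"
    if sentence_distance <= 200:
        return "101-200"
    return ">200"

def group_counts(true_positive_distances):
    # e.g. a list of sentence distances of correct coreference predictions
    return Counter(distance_group(d) for d in true_positive_distances)
```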

Distribution of coreferent pairs in gold data
As illustrated in Figure 2, only about 40.85% of the gold pairs fall into the short-distance groups, while the other 59.15% fall into the long-distance groups. This means that if a model cannot deal with long-distance coreference, pairs of mentions and antecedents in this region cannot be discovered. Among those long-distance pairs, 47.8% lie between 51 and 200 sentences apart, while about 11.35% are more than 200 sentences apart.
The effect of mention filtering The results in Figure 2 show that by using the filtering method, we could effectively address long-distance coreferent pairs. It can be seen from the figure that the baseline model was good enough on short-distance pairs, where the filtering may slightly harm performance. However, for longer distances, the filtering contributed increases of 5.46%, 55.26%, 84.60%, and 100% for the groups of 11-50, 51-100, 101-200, and >200, respectively, in comparison with the baseline.
The effect of BERT Without using the filtering method, BERT itself could capture a fairly large number of long-distance pairs, even more than the LSTM filter model.
Recall problem It is necessary to note that although our model improved on the baseline LSTM and obtained promising results, the recall is still low in all distance groups. For instance, the best performing model, i.e., the BERT filter, covered only 50.90%, 45.38%, 39.08%, 31.71%, and 12.82% of the gold pairs in the groups of ≤10, 11-50, 51-100, 101-200, and >200, respectively. This is an open issue that we will address in the future.

Conclusion
In this paper, we particularly address the challenge of coreference resolution in full text articles in the CRAFT Shared Task 2019. Specifically, we employ the span-based end-to-end model (Lee et al., 2017) and enhance the model by utilizing a syntax-based mention filtering method and BERT.
To filter noisy mentions, we jointly train a parsing model with a POS classifier to obtain parse trees of sentences. We then generate syntactic patterns of gold mentions based on the resulting parse trees. Any mentions that satisfy the generated patterns will be fed into the coreference resolution model. We finally incorporate BERT into our model. Experimental results on the CRAFT corpus indicate that the proposed method is effective in capturing long-distance coreferences in long documents.

A Penn Treebank Labels
For the Penn Treebank labels in our syntactic patterns, we follow the BioMedical Treebank tagset definition (Warner et al., 2004). Please refer to Table 5 for detailed descriptions.

B Non-Coreference Results
Unlike other metrics, the BLANC metric also contains non-coreference results. We report the results of the test set in Table 6.

C Results on (Mention-Antecedent) Pair Distance
We present the detailed results of each model and the corresponding gold coreference pairs grouped by the sentence-level distance of mention-antecedent pairs in Table 7. The results are calculated over five distance groups: ≤10, 11-50, 51-100, 101-200, and >200.
Table 5: Penn Treebank labels used in our syntactic patterns.
Tag   Description
NP    noun phrase
NN    noun, singular or mass
NML   sub-NP nominal substrings
PRP$  possessive pronoun
LS    list item marker