Investigating Domain-Specific Information for Neural Coreference Resolution on Biomedical Texts

Existing biomedical coreference resolution systems depend on features and/or rules derived from syntactic parsers. In this paper, we investigate the utility of a state-of-the-art general-domain neural coreference resolution system on biomedical texts. The system is end-to-end and does not depend on any syntactic parser. We also investigate domain-specific features to enhance the system for biomedical texts. Experimental results on the BioNLP Protein Coreference dataset and the CRAFT corpus show that, with no parser information, the adapted system compared favorably with systems that depend on parser information, achieving F1 scores of 51.23% on the BioNLP dataset and 36.33% on the CRAFT corpus. In-domain embeddings and domain-specific features helped improve the performance on the BioNLP dataset, but not on the CRAFT corpus.


Introduction
Deep neural systems have recently achieved state-of-the-art performance on coreference resolution tasks in the general domain (Clark and Manning, 2016; Wiseman et al., 2016; Lee et al., 2017). These systems do not heavily rely on manual features since the networks automatically build advanced features from the input. This property has made deep neural systems preferable to traditional manual feature-based systems.
In the biomedical domain, coreference information has been shown to enhance the performance of entity and event extraction (Choi et al., 2016a). Most work in this domain uses rule-based or hybrid approaches (Nguyen et al., 2011, 2012; D'Souza and Ng, 2012; Li et al., 2014; Choi et al., 2016b; Cohen et al., 2017). These systems rely on syntactic parsers to extract hand-crafted features and rules, e.g., rules based on predicate-argument structure (Nguyen et al., 2012) or features based on syntax trees (D'Souza and Ng, 2012). These rules are designed specifically for each type of coreference, such as noun phrases, relative pronouns, and non-relative pronouns. Moreover, several rules are restricted to specific entities of the training corpus, e.g., protein entities for the BioNLP Protein Coreference dataset (Nguyen et al., 2011).
Given that deep learning methods can produce state-of-the-art performance on general texts, we are motivated to apply such methods to biomedical texts. We therefore raise three research questions in this paper:
• How does a general-domain neural system with no parser information perform on the biomedical domain?
• How can we incorporate domain-specific information into the neural system?
• Where does the system's performance stand in comparison with existing systems?
To address these questions, we directly apply the end-to-end neural coreference resolution system by Lee et al. (2017) (Lee2017) to biomedical texts. We then investigate domain-specific features such as domain-specific word embeddings, grammatical number agreement between mentions, i.e., whether mentions are singular or plural, and agreement of the MetaMap (Aronson and Lang, 2010) entity tags of mentions. These features do not rely on any syntactic parsers. Moreover, they are general to any biomedical corpus and not restricted to the corpora we use.
We evaluated the Lee2017 system on two datasets: the BioNLP Protein Coreference dataset (Nguyen et al., 2011) and CRAFT (Cohen et al., 2017). Our experimental results have revealed that the system could achieve reasonable performance on both corpora. The system outperformed several systems on the BioNLP dataset that employed rule-based (Choi et al., 2016b) and conventional machine learning methods (Nguyen et al., 2011) using parser information, although it was not competitive with the state-of-the-art systems. Integrating in-domain embeddings and domain-specific features into the deep neural system improved the performance of both mention detection and mention linking on the BioNLP dataset, but the integration could not enhance the performance on the CRAFT corpus.

Methods
In this section, we briefly introduce the baseline Lee2017 system (Lee et al., 2017) and present domain-specific features to adapt the system to biomedical texts.

Baseline System
The baseline Lee2017 system treats all spans up to a maximum length as mention candidates. Each mention candidate is represented as a concatenated vector of the embeddings of its first word, its last word, its soft head word, and its span length. The embeddings for the first and last words are taken from the outputs of LSTMs (Hochreiter and Schmidhuber, 1997), while the soft head word embedding is a weighted sum of the embeddings of the words in the span, computed with an attention mechanism (Bahdanau et al., 2014). These candidates are ranked based on their mention scores s_m, calculated as follows:

s_m(i) = w_m · FFNN_m(g_i)    (1)

where w_m is a weight vector, FFNN denotes a feed-forward neural network, and g_i is the vector representation of a mention i. After mentions are decided, the system resolves coreference by linking mentions back to their antecedents using antecedent scores s_a, calculated as:

s_a(i, j) = w_a · FFNN_a([g_i, g_j, g_i ∘ g_j, φ(i, j)])    (2)

where ∘ denotes element-wise multiplication and φ(i, j) represents the feature vector between the two mentions.
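The span representation and mention scoring described above can be sketched as follows. This is a minimal NumPy illustration, not the actual implementation: the dimensions, random weights, and function names (`span_representation`, `ffnn`) are illustrative assumptions, and the contextual word vectors stand in for real LSTM outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8          # word-vector / LSTM-output size (illustrative)
L_EMB = 4      # span-length embedding size (illustrative)

def span_representation(x, start, end, w_attn, len_emb):
    """Build g_i = [first word; last word; soft head; length embedding].

    x       : (T, D) contextual word vectors (stand-ins for LSTM outputs)
    w_attn  : (D,) attention weight vector for the soft-head mechanism
    len_emb : (max_len, L_EMB) lookup table of span-length embeddings
    """
    words = x[start:end + 1]                 # word vectors inside the span
    scores = words @ w_attn                  # attention logits
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                     # softmax over the span's words
    soft_head = alpha @ words                # weighted sum = soft head word
    return np.concatenate([x[start], x[end], soft_head,
                           len_emb[end - start]])

def ffnn(g, W1, b1, w_out):
    """One-hidden-layer feed-forward scorer: s = w_out · ReLU(W1 g + b1)."""
    h = np.maximum(0.0, W1 @ g + b1)
    return float(w_out @ h)

# Toy "sentence" of 5 tokens; score the candidate span covering tokens 1..3.
T = 5
x = rng.normal(size=(T, D))
len_emb = rng.normal(size=(T, L_EMB))
w_attn = rng.normal(size=D)

g = span_representation(x, 1, 3, w_attn, len_emb)   # 3*D + L_EMB = 28 dims
W1, b1 = rng.normal(size=(16, g.shape[0])), rng.normal(size=16)
w_m = rng.normal(size=16)

s_m = ffnn(g, W1, b1, w_m)                           # mention score s_m(i)
print(g.shape[0], s_m)
```

The antecedent score in Equation 2 is computed the same way, except that the scorer's input concatenates g_i, g_j, their element-wise product, and the pairwise feature vector φ(i, j).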

Domain-specific features
We incorporate the following domain-specific features to enhance the baseline system.
In-domain word embeddings: The input word embeddings play an important role in deep learning. Instead of using embeddings trained on general domains, e.g., the word embeddings provided with the word2vec tool (Mikolov et al., 2013), we use 200-dimensional embeddings trained on the whole of PubMed and the PubMed Central Open Access subset (PMC) with a window size of 2 (Chiu et al., 2016).
Grammatical numbers: We check mentions' grammatical numbers, i.e., whether each mention is singular or plural. A mention is singular if its part-of-speech tag is NN or if it is one of the five singular pronouns: it, its, itself, this, and that. A mention is plural if its part-of-speech tag is NNS or if it is one of the seven plural pronouns: they, their, theirs, them, themselves, these, and those.
MetaMap entity tags: We employ MetaMapLite to identify all possible entities according to the UMLS semantic types. When MetaMapLite assigns multiple semantic types to an entity, we take all of the types into account.
The grammatical numbers and MetaMap entity tags are incorporated into the network as follows. We first pre-process the input and assign token-based values for each type of feature. For example, a token may have "singular", "plural", or "unknown" as its number attribute. Meanwhile, the MetaMap entity tags are distributed to each token together with their position information, chosen from "Begin" and "Inside". These features are finally encoded as binary elements of the vector φ(i, j) in Equation 2, indicating whether two mentions i and j agree in grammatical number and whether they share the same MetaMap semantic type.
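The two pairwise agreement bits added to φ(i, j) could be computed as in the following sketch. The function name and exact encoding are our own illustration (the paper does not specify them); we assume an "unknown" number never counts as agreement and that any shared UMLS semantic type counts as a type match.

```python
def agreement_features(number_i, number_j, types_i, types_j):
    """Binary agreement features between mentions i and j.

    number_* : 'singular' / 'plural' / 'unknown'
    types_*  : sets of MetaMap (UMLS) semantic types assigned to the mention
    Returns a 2-dim 0/1 vector: [number agreement, shared semantic type],
    which would be appended to the pairwise feature vector phi(i, j).
    """
    number_agree = int(number_i == number_j and number_i != "unknown")
    type_agree = int(bool(types_i & types_j))  # any shared UMLS type counts
    return [number_agree, type_agree]

# e.g. "these proteins" (plural, type {aapp}) vs. "they" (plural, no tags)
print(agreement_features("plural", "plural", {"aapp"}, set()))  # → [1, 0]
```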

Data
We employed two biomedical corpora: the BioNLP Protein Coreference dataset (Nguyen et al., 2011) and CRAFT (Cohen et al., 2017). The BioNLP dataset consists of 1,210 PubMed abstracts selected from the GENIA-MedCo coreference corpus. CRAFT contains coreference annotations of 67 full papers extracted from PMC. While BioNLP focuses on protein/gene coreference, CRAFT covers a wider range of coreference relations such as events, pronominal anaphora, noun phrases, verbs, and nominal premodifier coreference. In the CRAFT corpus, coreference is divided into two types: identity chains (sets of base noun phrases and/or appositives that refer to the same thing in the world) and appositive relations (two noun phrases that are adjacent and not linked by a copula). We use only the identity chains. The BioNLP dataset was officially divided into training, development, and test sets. Regarding CRAFT, we randomly divided it into three subsets in a ratio of 8:1:1 for training, development, and test, respectively. Detailed characteristics of the two corpora as well as these three sets are reported in Table 1. It is noticeable that CRAFT is a corpus of full papers, which makes it more challenging for text mining tools than the BioNLP dataset, a corpus of abstracts (Cohen et al., 2010).

Settings
We first directly applied the Lee2017 system to the corpora. Lee2017 uses two sets of pretrained general-domain embeddings, provided by Pennington et al. (2014) and Turian et al. (2010), and all default features such as speaker, genre, and distance.
To train the Lee2017 system, we employed the same hyper-parameters as reported in Lee et al. (2017) except for a threshold ratio. Although Lee2017 used the ratio λ = 0.4 to reduce the number of mentions from the list of candidates, we tuned it on the BioNLP development set and used λ = 0.7.
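The ratio λ controls how aggressively mention candidates are pruned: only the top λ·T spans by mention score survive, where T is the document length in words. A minimal sketch of this top-k pruning is below; note that the actual Lee2017 system additionally forbids kept spans from partially crossing each other, which this simplified version omits.

```python
def prune_spans(spans, mention_scores, num_words, lam=0.7):
    """Keep the top lam * num_words candidate spans by mention score,
    returned in their original textual order (simplified: no crossing check)."""
    k = int(lam * num_words)
    order = sorted(range(len(spans)),
                   key=lambda i: mention_scores[i], reverse=True)
    keep = sorted(order[:k])                 # restore textual order
    return [spans[i] for i in keep]

# Toy document of 6 words with four candidate spans and their s_m scores.
spans = [(0, 1), (1, 3), (2, 2), (4, 5)]
scores = [0.9, -0.2, 1.5, 0.1]
print(prune_spans(spans, scores, num_words=6, lam=0.5))
# → [(0, 1), (2, 2), (4, 5)]
```

Raising λ from 0.4 to 0.7, as done here for the biomedical corpora, simply keeps more candidate mentions per document before the linking step.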
We then investigated the impact of each feature on the biomedical texts by preparing the following four systems:
• Lee2017: general embeddings; speaker, genre, and distance features
• PubMed: biomedical embeddings; the same features as Lee2017
• PubMed-SG: PubMed with no speaker and genre features
• PubMed+*: PubMed with the MetaMap feature (MM) and/or the grammatical number feature (Num)
Table 2: Results of mention detection on the development set of BioNLP and CRAFT. The highest numbers are shown in bold.
For evaluation, we calculated precision, recall, and F1 on MUC, B³, and CEAFφ4 using the CoNLL scorer (Pradhan et al., 2014). For the BioNLP dataset, we also employed the scorer provided by the shared task organisers to make fair comparisons with previous work. We report the performance on two sub-tasks: (1) mention detection, i.e., identifying coreferent mentions such as named entities, pronouns, or noun phrases, and (2) mention linking, i.e., linking these mentions if they refer to the same thing. The result of the first task affects that of the second.
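To make the evaluation metrics concrete, here is a small sketch of the MUC link-based score, the simplest of the three metrics. This is an illustrative re-implementation of the standard MUC definition (recall counts, per gold chain, the links not broken by the system's partition; precision is symmetric), not the official CoNLL scorer.

```python
def muc(key, response):
    """MUC precision, recall, F1 over entity clusters (lists of sets)."""
    def score(gold, pred):
        num = den = 0
        for chain in gold:
            # Partition `chain` by the pred chains; mentions absent from
            # every pred chain each form their own singleton partition.
            parts = set()
            for m in chain:
                owner = next((i for i, c in enumerate(pred) if m in c), None)
                parts.add(("chain", owner) if owner is not None
                          else ("single", m))
            num += len(chain) - len(parts)   # links preserved by pred
            den += len(chain) - 1            # links in the gold chain
        return num / den if den else 0.0

    r = score(key, response)
    p = score(response, key)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

key = [{"a", "b", "c"}, {"d", "e"}]          # gold identity chains
resp = [{"a", "b"}, {"d", "e"}]              # system output misses "c"
print(muc(key, resp))                        # → (1.0, 0.666..., 0.8)
```

B³ and CEAFφ4 reward different behaviors (per-mention overlap and optimal cluster alignment, respectively), which is why the CoNLL convention averages all three.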

Results
Results on the development sets of the two corpora are presented in Table 2 for mention detection and  Table 3 for mention linking (see Appendix A for detailed scores in different metrics).
Regarding the BioNLP dataset, the Lee2017 system performed reasonably well even though it did not use any domain-specific features. Replacing the general embeddings with the biomedical ones improved the F1 score in general (Lee2017 vs. PubMed). Removing the speaker and genre features (-SG) did not help enhance the performance. Adding MetaMap's tags (+MM) or the number feature (+Num) produced slightly better scores in comparison to PubMed. However, combining the two features at the same time was not as effective as expected. Among the proposed features, the agreement on MetaMap entity tags (+MM) was the strongest on the BioNLP dataset.
The impact of the features was quite different on the CRAFT corpus. As shown in Table 2, introducing biomedical embeddings (PubMed) showed a slightly worse F1 score on mention detection than Lee2017, but it also showed a slight improvement on mention linking. Removing the speaker and genre features (-SG) boosted the performance. However, adding the domain-specific features harmed the performance. As a result, PubMed-SG showed the best score on the CRAFT development set. Tables 2 and 3 support the observation that the CRAFT corpus is more challenging than the BioNLP dataset: the scores of the evaluated systems on the CRAFT corpus were always lower than those on the BioNLP dataset. This is reasonable because (1) CRAFT consists of full papers, which are significantly longer than abstracts, (2) it covers a wide range of anaphors, and (3) its identity chains can be arbitrarily long.

Results on Test Sets
We applied the best performing system on each development set, i.e., PubMed+MM for BioNLP and PubMed-SG for CRAFT, to its test set, and report the results in Tables 4 and 5 together with the performance of previous work for comparison. Table 4 reveals that the neural system outperformed five systems that used SVM and rule-based approaches, including the best system in the shared task, and it was competitive with the system of Nguyen et al. (2012). Meanwhile, on the CRAFT corpus (Table 5), we could only produce better performance than the general-domain state-of-the-art system, mainly due to low precision, and we did not outperform Cohen et al. (2017). This is not a fair comparison, however, as our system only addressed identity chains and our test set is different from theirs.

Conclusion
We have applied a neural coreference system to biomedical texts and incorporated domain-specific features to enhance the performance. Experimental results on two biomedical corpora, the BioNLP dataset and the CRAFT corpus, have shown that (1) the neural system performed reasonably well with no parser information, (2) the in-domain embeddings and domain-specific features did not consistently improve performance across the two corpora, and (3) the system could attain better performance than several rule-based and traditional machine learning-based systems on the BioNLP dataset.
As future work, we would like to investigate feature representations that make input features useful to a target domain. We will also incorporate rules from existing systems into the network.