Marginal Likelihood Training of BiLSTM-CRF for Biomedical Named Entity Recognition from Disjoint Label Sets

Extracting typed entity mentions from text is a fundamental component of language understanding and reasoning. While there exist substantial labeled text datasets for multiple subsets of biomedical entity types (such as genes and proteins, or chemicals and diseases), it is rare to find large labeled datasets containing labels for all desired entity types together. This paper presents a method for training a single CRF extractor from multiple datasets with disjoint or partially overlapping sets of entity types. Our approach employs marginal likelihood training to insist on labels that are present in the data, while filling in “missing labels”. This allows us to leverage all the available data within a single model. In experimental results on the Biocreative V CDR (chemicals/diseases), Biocreative VI ChemProt (chemicals/proteins) and MedMentions (19 entity types) datasets, we show that joint training on multiple datasets improves NER F1 over training in isolation, and our methods achieve state-of-the-art results.


Introduction
Identifying entities in text is a vital component of language understanding, facilitating knowledge base construction (Riedel et al., 2013), question answering (Bordes et al., 2015), and search. Identifying these entities is particularly important in biomedical data. While large scale Named Entity Recognition (NER) datasets exist in news and web data (Tjong Kim Sang and De Meulder, 2003; Hovy et al., 2006), biomedical NER datasets are typically smaller and contain only one or two types per dataset. Ultimately, we would like to identify all entity types present across the union of the label sets during inference while leveraging all the available annotations to train our models.
While one may train a single model across the union of all the datasets available, this training procedure assumes that all labels (from the union of the tag set) are correctly annotated in every training instance, which is incorrect. On the other hand, training separate models on each available dataset does not take advantage of shared statistical strength from the multiple sources of information, and requires resolution of the conflicting predictions output by the different models.
To remedy these problems, we propose methods to train a joint model across the multiple tag-sets of the different datasets, sharing statistical strength by using a single feature encoder across datasets while respecting the incompleteness of the labels during training. Thus, our single model can take full advantage of all the available annotated resources and predict the full set of relevant types given a piece of text.
In experiments on three datasets, we show our methods outperform models that do not consider the incomplete annotations. We also show that jointly training on multiple datasets improves performance further and achieves state-of-the-art performance on the Biocreative V CDR dataset.

Model
Our models build on state-of-the-art NER systems (Lample et al., 2016) based on bi-directional Long Short Term Memory (BiLSTM) feature extractors fed into a conditional random field (CRF).
The data consists of an input sequence of tokens $x = \{x_1, \ldots, x_T\}$, where each token is a sequence of characters $x_t = \{c_1, \ldots, c_{K_t}\}$. The output consists of labels for each token in the sequence, $y = \{y_1, \ldots, y_T\}$. Labeling is done using the BILOU tagging scheme, following previous observations that it outperforms the BIO tagging scheme (Ratinov and Roth, 2009). We have $D$ such datasets of input tokens and output labels.
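As an illustration of the BILOU scheme, the following minimal sketch converts entity spans into per-token tags. The function name `spans_to_bilou`, the span format (start, exclusive end, type), and the example sentence are assumptions made for this example and do not come from the paper.

```python
from typing import List, Tuple


def spans_to_bilou(num_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert non-overlapping entity spans to BILOU tags.

    Each span is (start, end, entity_type) with `end` exclusive.
    """
    tags = ["O"] * num_tokens                     # Outside by default
    for start, end, etype in spans:
        if end - start == 1:
            tags[start] = f"U-{etype}"            # Unit-length mention
        else:
            tags[start] = f"B-{etype}"            # Beginning of mention
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{etype}"            # Inside of mention
            tags[end - 1] = f"L-{etype}"          # Last token of mention
    return tags


# Hypothetical example sentence with one disease and one chemical mention.
tokens = ["acute", "renal", "failure", "after", "cisplatin"]
tags = spans_to_bilou(len(tokens), [(0, 3, "Disease"), (4, 5, "Chemical")])
```

Compared with BIO, the extra L and U tags make the boundaries of each mention explicit in the label sequence.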

Feature Encoder BiLSTM
Our model takes a sequence of tokens from a single abstract as input. Tokens are generated using byte-pair encodings (BPE) (Gage, 1994; Sennrich et al., 2016), which have recently been shown to be effective for tokenization of biological texts by addressing the issue of rare or out-of-vocabulary tokens. BPE starts from white space tokenization and breaks down the tokens further. Because all of the evaluations are on the span level rather than the token level, the use of BPE does not change how performance is computed. Each token $t$ produced by BPE is mapped to a $d$-dimensional word embedding $w$.
Character level features have been shown to improve NER accuracy (Lafferty et al., 2001; Lample et al., 2016; Passos et al., 2014). We encode characters in a word using another BiLSTM, similar to Lample et al. (2016), and obtain a character based embedding for every word by concatenating the last hidden states of the forward and backward character LSTMs. We concatenate this character based embedding with the $d$-dimensional word embedding and input it to the word-level BiLSTM. This feature representation is then projected to the label dimension $L$ using a linear layer, giving a matrix of scores $[f_{il}]$, where $f_{il}$ is the score for predicting label $l \in [L]$ for token $i \in [T]$.

Conditional Random Field (CRF)
BiLSTM-CRF models used for named entity recognition add a CRF layer (Lafferty et al., 2001) on top of the output representations of the BiLSTM model described above. The CRF layer scores all possible labelings to give a probability of the correct label sequence under the model. Given an input sequence of tokens $x = \{x_1, \ldots, x_T\}$ and the output matrix of scores $[f_{il}]$, the score for an output labeling $y = \{y_1, \ldots, y_T\}$ is given by:

$$s(x, y) = \sum_{t=1}^{T} f_{t, y_t} + \sum_{t=2}^{T} A_{y_{t-1}, y_t}$$

where $A$ is an $L \times L$ matrix of parameters for transitioning between output labels. The CRF then generates the likelihood for the correct labeling by normalizing this score over all possible output labelings:

$$\log P(y \mid x) = s(x, y) - \log Z \tag{1}$$

The log normalization term here is:

$$\log Z = \log \sum_{y'} \exp\big(s(x, y')\big)$$

where the sum goes over all possible labelings $y'$ of the sequence and is computed efficiently using dynamic programming (Lafferty et al., 2001).
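The dynamic program for the log normalization term can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the standard linear-chain form in which a labeling's score is the sum of its per-token scores from $[f_{il}]$ and its transition scores from $A$, with no special start or end transitions. The function names and the toy numbers are hypothetical and not taken from the paper's implementation.

```python
import math
from itertools import product


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(f, A, y):
    """Score of labeling y: per-token scores plus transition scores."""
    emissions = sum(f[t][y[t]] for t in range(len(y)))
    transitions = sum(A[y[t - 1]][y[t]] for t in range(1, len(y)))
    return emissions + transitions


def log_partition(f, A):
    """log Z via the forward algorithm, O(T * L^2) instead of O(L^T)."""
    num_labels = len(A)
    alpha = list(f[0])  # alpha[l]: log-sum of scores of prefixes ending in l
    for t in range(1, len(f)):
        alpha = [
            f[t][l] + logsumexp([alpha[k] + A[k][l] for k in range(num_labels)])
            for l in range(num_labels)
        ]
    return logsumexp(alpha)


def log_likelihood(f, A, y):
    """log P(y | x) = score(x, y) - log Z."""
    return sequence_score(f, A, y) - log_partition(f, A)


# Toy scores for T = 3 tokens and L = 3 labels (made-up numbers).
f = [[1.0, 0.2, -0.5], [0.3, 1.5, 0.0], [-1.0, 0.4, 0.9]]
A = [[0.5, -0.2, 0.1], [0.0, 0.3, -0.4], [-0.3, 0.2, 0.6]]

# Brute-force check: enumerate all L^T labelings explicitly.
brute = logsumexp([sequence_score(f, A, y) for y in product(range(3), repeat=3)])
```

On this toy input the forward recursion and the brute-force enumeration give the same value of log Z, which is the property that makes exact CRF training tractable for long sequences.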

Tagging Multiple Datasets
One way to tag multiple datasets is to concatenate all the datasets with all the output labels and train a single BiLSTM-CRF model. However, this assumes that each text snippet is completely annotated across the label sets, which is not true. We now discuss two models which do not make this assumption.

Multiple CRFs
We first propose one simple method to get around the assumption of complete annotation: train separate CRFs for the label set of each dataset. In particular, to share statistical strength on the input tokens, we share the BiLSTM feature encoder across the datasets but use separate CRF layers for each of the datasets. This is a multi-task learning model (Caruana, 1998) and is expected to perform better than the naive model as it no longer makes the strict assumption of complete annotation (by using separate CRFs), and shares statistical strength across datasets. However, given a new abstract to tag, this model will generate multiple possible labelings from the different CRFs. Moreover, the labelings output by the different CRFs may be inconsistent, and how to combine these multiple labelings is not obvious. We propose and evaluate a simple heuristic procedure for merging the outputs of the different CRF predictions. Whenever the different CRF predictions disagree on a span of tokens, we choose the prediction from the CRF that has the higher marginal probability of predicting that span of tokens (Alg. 1 in supplementary).
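The merging idea can be illustrated with a small sketch. This is not Alg. 1 from the supplementary material, which is not reproduced here. It is a simplified reading that assumes each CRF returns its predicted spans together with a marginal probability, and that resolves overlapping predictions greedily in favor of the higher probability. The `Span` type, the function `merge_predictions`, and the example numbers are all hypothetical.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int    # index of first token (inclusive)
    end: int      # index after last token (exclusive)
    label: str    # entity type
    prob: float   # marginal probability from the CRF that predicted the span


def merge_predictions(per_crf_spans: List[List[Span]]) -> List[Span]:
    """Greedily keep the most probable spans, dropping any that overlap them."""
    candidates = sorted(
        (span for spans in per_crf_spans for span in spans),
        key=lambda span: -span.prob,
    )
    kept: List[Span] = []
    for cand in candidates:
        # Two spans are compatible only if one ends before the other starts.
        if all(cand.end <= k.start or cand.start >= k.end for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda span: span.start)


# Made-up predictions from a CDR-trained CRF and a ChemProt-trained CRF.
cdr_spans = [Span(0, 2, "Disease", 0.6), Span(4, 5, "Chemical", 0.9)]
cp_spans = [Span(1, 2, "Protein", 0.8), Span(4, 5, "Chemical", 0.7)]
merged = merge_predictions([cdr_spans, cp_spans])
```

In this example the two CRFs disagree on tokens 0 to 1, and the Protein span wins because its probability is higher. A full implementation would also need to compare an entity prediction against the other CRF's probability of leaving the same tokens unlabeled, which this sketch omits.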

EM Marginal CRF
We also propose an alternative principled approach that does not require a heuristic merging process. In order to label $D$ datasets with some disjoint labels, we only consider the probability of the "observed labels" and allow the "unobserved" tokens to be free. Thus, when tagging dataset $i \in [D]$, we treat the non-entity tokens as potentially taking any entity type label from any of the other datasets as well as the 'O' label. For a particular input $x$ of length $T$ from a dataset $i \in [D]$ with label set $S_i$, let $y$ be the gold output labeling. Let $E \subset [T]$ be the set of indices of tokens with any entity type label in $S_i$ and $N \subset [T]$ be the set of indices of tokens with the 'O' label, and let $y_E$ be the output sequence corresponding to indices in $E$, and similarly $y_N$ be the output sequence for indices in $N$. Then, from (1), we get the likelihood $P_i(y_E \cup y_N \mid x)$, and a naive CRF trained on the concatenation of all the data will maximize this probability. However, since we cannot make the complete annotation assumption, we should instead maximize only the marginal probability of the observed entities on the dataset $i$, $P_i(y_E \mid x)$, allowing $y_N$ to take any values from the labels of the other datasets, $\bigcup_{j \neq i}^{D} S_j$. Thus,

$$\log P_i(y_E \mid x) = \log \sum_{y_N} \exp\Big( \sum_{t=1}^{T} f_{t, y_t} + \sum_{t=2}^{T} A_{y_{t-1}, y_t} \Big) - \log Z$$

where the labeling inside the sum is $y = y_E \cup y_N$, the sum ranges over assignments of $y_N$ with labels in $\bigcup_{j \neq i}^{D} S_j$, and $\log Z$ is the log normalization term, which is the same as in (1). Note that since the normalization term is the same here as for a standard CRF, we can still use the same dynamic programming algorithm as for a regular CRF to compute $\log Z$. To compute the first term, we note that it is similar to the computation required for $\log Z$: whereas $\log Z$ is obtained by summing over all possible output sequences, this term is obtained by summing over all possible output sequences which have indices in $E$ fixed to the correct label and indices in $N$ taking values from $\bigcup_{j \neq i} S_j$.
Thus, this can be computed using the same dynamic programming algorithm (Tsuboi et al., 2008), and the implementation of training this model is compatible with modern automatic differentiation libraries.
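The constrained sum can be sketched as a forward pass that masks out disallowed labels at each position. This is a minimal pure-Python illustration rather than the paper's implementation. It assumes fully disjoint entity label sets, so that a token tagged 'O' in dataset i may take 'O' or any label that is not one of dataset i's own entity labels. The function names, the label indexing, and the toy scores are hypothetical.

```python
import math

NEG_INF = float("-inf")


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def constrained_log_sum(f, A, allowed):
    """Log-sum of scores over labelings y with y[t] in allowed[t] for every t."""
    num_labels = len(A)
    alpha = [f[0][l] if l in allowed[0] else NEG_INF for l in range(num_labels)]
    for t in range(1, len(f)):
        alpha = [
            f[t][l] + logsumexp([alpha[k] + A[k][l] for k in range(num_labels)])
            if l in allowed[t] else NEG_INF
            for l in range(num_labels)
        ]
    return logsumexp(alpha)


def marginal_log_likelihood(f, A, gold, own_labels, outside=0):
    """log P_i(y_E | x): entity tokens are fixed, 'O' tokens are left free."""
    num_labels = len(A)
    # Assumption: disjoint label sets, so free tokens may be 'O' or any
    # label that dataset i does not annotate.
    free = {l for l in range(num_labels) if l not in own_labels}
    allowed = [{y} if y != outside else free for y in gold]
    everything = [set(range(num_labels))] * len(f)
    return constrained_log_sum(f, A, allowed) - constrained_log_sum(f, A, everything)


# Labels: 0 = 'O', 1 = 'U-Chemical', 2 = 'U-Disease' (made-up toy setup).
# The example comes from a dataset that annotates only chemicals.
f = [[1.0, 0.2, -0.5], [0.3, 1.5, 0.0], [-1.0, 0.4, 0.9]]
A = [[0.5, -0.2, 0.1], [0.0, 0.3, -0.4], [-0.3, 0.2, 0.6]]
gold = [1, 0, 0]

mll = marginal_log_likelihood(f, A, gold, own_labels={1})

# Naive objective for comparison: every token, including 'O', is fixed.
log_z = constrained_log_sum(f, A, [set(range(3))] * 3)
naive = constrained_log_sum(f, A, [{y} for y in gold]) - log_z
```

Because the marginal objective sums over a superset of the labelings counted by the naive objective, `mll` is never smaller than `naive`. In an automatic differentiation framework the same masking would be applied to tensors, and gradients of both terms follow from the forward recursion.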

Experimental Results
We perform experiments on two benchmark Biocreative datasets as well as the recently introduced MedMentions data (Murty et al., 2018). Our experiments consider three types of models. The single CRF model naively concatenates all training datasets together and assumes complete labeling; the multi CRF model has a single BiLSTM feature encoder with a separate CRF for each dataset (Section 2.3.1); and the EM CRF model has a single feature encoder and a single CRF trained with EM marginalization (Section 2.3.2). For full dataset statistics and specific implementation details, see the supplementary material.

Biocreative V / VI
Biocreative V Chemical Disease Relation (CDR): consists of 1,500 titles and abstracts from PubMed, human annotated with chemical and disease mentions (Li et al., 2016), and has been used in previous NER evaluations (Fries et al., 2017).
Biocreative VI ChemProt (CP): consists of 2,432 PubMed titles and abstracts, and contains human annotated mentions of both chemicals and proteins (Krallinger et al., 2017).
Our results are shown in Table 1. The top portion of the table shows models trained on single datasets, and the bottom portion shows models trained on both CDR and CP. Comparing the top and bottom portions of the table, we can see that models trained on both CP and CDR outperform training on either in isolation. Further, we see in the bottom section that our EM CRF outperforms the single CRF model and is generally better than the multi CRF model.

Adding Additional Data
Weakly labeled data. The addition of weakly labeled data has been used recently to improve the performance of relation extraction systems (Peng et al., 2016). In these approaches, titles and abstracts from PubMed are annotated using Pubtator, a state-of-the-art entity tagging and linking/normalization system (Wei et al., 2013). We use the same weakly labeled data from .
When the additional weakly labeled data is added, our results improve further, outperforming the state-of-the-art TaggerOne model.

MedMentions
MedMentions (Murty et al., 2018) is a recently introduced large dataset of PubMed abstracts containing entity linked mentions of many different semantic types. We used this data to create an artificially extreme example where two training sets contain 9 and 10 entity types each. The two type sets are fully disjoint (further details in supplementary).
In Table 3, we see that the single CRF model performs very poorly in this extreme setting due to the large number of missing annotations. The multi CRF and EM CRF both perform well and come close to the performance of a single CRF trained on the full data, which has approximately twice as much annotated data.

Related Work
Until recently, feature engineered machine learning models were the highest performing approaches to NER (Ratinov and Roth, 2009; Passos et al., 2014). More recently, neural network based approaches have become state-of-the-art (Lample et al., 2016; Strubell et al., 2017; Peters et al., 2017). In BioNLP, many of the highest performing systems still use engineered features fed into a CRF. In addition to the datasets we explored in this work, there are several other popular bio NER datasets for chemicals (Krallinger et al., 2015), species (Wang et al., 2010), diseases (Dogan et al., 2014), and genes (Tanabe et al., 2005).
In concurrent work, Wang et al. (2018) train a model very similar to our multi-CRF model on multiple biological NER datasets with non-fully overlapping labels. Additionally, they experiment with different ways of sharing the parameters of the BiLSTM encoder. We believe this work is complementary to ours, and in many ways deals with a simpler subset of the tasks we address. Wang et al. assume complete labeling in each of their datasets, and do not attempt to merge the final results of the multiple CRFs. On the other hand, we focus on the problem of cohesively labeling a dataset with the joint set of the different label sets, either directly through the EM model or by the merging process of the multi-CRF model.
Our method of training via marginal likelihood is the same as that of Tsuboi et al. (2008), who trained CRF models for Japanese word segmentation and POS tagging where only partial annotations of sentences are available. In comparison, we use marginal likelihood training in conjunction with state-of-the-art deep learning models for NER and use it to tag across multiple disjoint label sets.

Conclusions and Future Work
We have introduced a method for training NER models on multiple datasets containing disjoint label sets. We show experimentally that this joint training improves performance and that our EM CRF method outperforms models using a single CRF trained under the complete annotation assumption.
One interesting problem that our models do not account for is the existence of overlapping and non-continuous entity spans. Particularly when annotating using disjoint label sets, a token could belong to multiple entity spans from different label sets. We are interested in investigating this problem in future work.