Joint Entity Extraction and Assertion Detection for Clinical Text

Negative medical findings are prevalent in clinical reports, yet discriminating them from positive findings remains a challenging task for information extraction. Most existing systems treat this task as a pipeline of two separate tasks, i.e., named entity recognition (NER) and rule-based negation detection. We consider this as a multi-task problem and present a novel end-to-end neural model to jointly extract entities and negations. We extend a standard hierarchical encoder-decoder NER model and first adopt a shared encoder followed by separate decoders for the two tasks. This architecture performs considerably better than the previous rule-based and machine learning-based systems. To overcome the problem of increased parameter size, especially in low-resource settings, we propose the Conditional Softmax Shared Decoder architecture, which achieves state-of-the-art results for NER and negation detection on the 2010 i2b2/VA challenge dataset and a proprietary de-identified clinical dataset.


Introduction
In recent years, natural language processing (NLP) techniques have demonstrated increasing effectiveness in clinical text mining. Electronic health record (EHR) narratives, e.g., discharge summaries and progress notes, contain a wealth of medically relevant information such as diagnosis information and adverse drug events. Automatic extraction of such information and representation of clinical knowledge in standardized formats (Tomar and Bhatia, 2019) could be employed for a variety of purposes such as clinical event surveillance, decision support (Jin et al., 2018), pharmacovigilance, and drug efficacy studies.
Although many NLP applications that successfully extract findings from medical reports have been developed in recent years, identifying assertions such as positive (present), negative (absent), and hypothetical remains a challenging task, especially to generalize (Wu et al., 2014). However, identifying assertions is critical since negative and uncertain findings are frequent in clinical notes (Figure 1), and information extraction algorithms that do not distinguish between them will not paint a clear picture of the patient.
Figure 1 (example text): "Discontinue Abraxane, patient denies taking Tyleno 325 mg and is not taking calcium carbonate. Patient also stopped taking colecalciferol 1,000 units PO."
In this paper, we focus on identifying the negated findings in a multi-task setting (Bhatia et al., 2018). Most of the existing systems treat this task as a pipeline of two separate tasks, i.e., named entity recognition (NER) and negation detection. Previous efforts in this area include both rule-based and machine-learning approaches.
Rule-based systems rely on negation keywords and rules to determine the cue of negation. NegEx (Chapman et al., 2001) is a widely used algorithm that consists of an ontology lookup to index findings and a negation regular expression search in a fixed scope. ConText (Harkema et al., 2009) extends NegEx to other attributes such as hypothetical and makes the scope variable by searching for a termination term. NegBio (Peng et al., 2018) uses a universal dependency graph for scope detection. Another similar work by Gkotsis et al. (2016) utilizes a constituency-based parse tree to prune out the parts outside the scope. However, these approaches use rules and regular expressions for cue detection, which rely solely on surface text and thus are limited when attempting to capture complex syntactic constructions such as long noun phrases.
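To make the fixed-scope idea concrete, the following is a minimal Python sketch in the spirit of NegEx, not the published algorithm. The cue list, the five-token window (`SCOPE`), and the function name `negated_findings` are illustrative assumptions, and findings are restricted to single tokens in place of an ontology lookup.

```python
import re

# Hypothetical, tiny cue list; the real NegEx trigger lexicon is much larger.
NEGATION_CUES = {"denies", "no", "not", "without"}
SCOPE = 5  # assumed fixed window: number of tokens searched after a cue


def negated_findings(sentence, findings):
    """Return the findings that occur within SCOPE tokens after a negation cue."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    negated = set()
    for i, token in enumerate(tokens):
        if token in NEGATION_CUES:
            window = tokens[i + 1 : i + 1 + SCOPE]
            for finding in findings:
                if finding.lower() in window:
                    negated.add(finding)
    return sorted(negated)
```

For the sentence "Patient denies chest pain but reports fever" with the findings "pain" and "fever", this sketch marks both as negated, because "fever" still falls inside the fixed window. A variable scope that stops at a termination term such as "but", as in ConText, would avoid that error.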
Kernel-based approaches are also very common, especially in the 2010 i2b2/VA task of predicting assertions. The state of the art in that challenge applies support vector machines (SVM) to assertion prediction as a separate step after concept extraction (de Bruijn et al., 2011). They train classifiers to predict assertions of each concept word, and a separate classifier to predict the assertion of the whole concept. Shivade et al. (2015) propose an Augmented Bag of Words Kernel (ABoW), which generates features based on NegEx rules along with bag-of-words features. Cheng et al. (2017) use a CRF for classification of cues and scope detection. These machine learning-based approaches often generalize poorly, i.e., they perform less well on unseen text.
Recently, neural network models by Fancellu et al. (2016) and Rumeng et al. (2017) have been proposed. Most relevant to our work is that of Rumeng et al. (2017), where gated recurrent units (GRU) are used to represent the clinical events and their context, along with an attention mechanism. Given a text annotated with events, it classifies the presence and period of the events. However, this approach is not end-to-end, as it does not predict the events. Additionally, these models generally require a large annotated corpus for good performance. Unfortunately, such clinical text data is not easily available.
Multi-task learning (MTL) leverages overlapping representations across sub-tasks and is one of the most effective solutions for knowledge transfer across tasks. In the context of neural network architectures, we perform MTL by sharing parameters across models, such as pretraining using word embeddings (Bhatia et al., 2016; Bojanowski et al., 2016), a popular approach for most NLP tasks. In this paper, we propose an MTL approach to negation detection that overcomes some of the limitations of existing models, such as data accessibility.
To the best of our knowledge, this is the first work to jointly model named entity and negation in an end-to-end system. Our main contributions are summarized below:
• An end-to-end hierarchical neural model consisting of a shared encoder and different decoding schemes to jointly extract entities and negations.
Using our proposed model, we obtain substantial improvement over prior models for both entities and negations on the 2010 i2b2/VA challenge task as well as a proprietary de-identified clinical note dataset for medical conditions.

Methodology
We first present a standard neural framework for named entity recognition. To facilitate multi-task learning, we expand on that architecture by building a two decoder model. Then, to overcome the issues of the two decoder model, we propose a single shared decoder model. Finally, we introduce the conditional softmax shared decoder.

Named Entity Recognition Architecture
NER is a sequence tagging problem in which we maximize the conditional probability of tags y given an input sequence x, parameterized by θ:
$$P(y \mid x; \theta) = \prod_{t=1}^{T} P(y_t \mid x, y_{<t}; \theta)$$
Here T is the length of the sequence, and $y_{<t}$ represents tags for all previous time-steps. We focus on an established hierarchical architecture (Lample et al., 2016; Yang et al., 2016; Chiu and Nichols, 2016) consisting of encoders (at both word and character levels) and a tagger for output generation.
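Because each tag is conditioned on the input and on all previous tags, the log-probability of a full tag sequence is a sum of per-step log-probabilities. The sketch below illustrates only that bookkeeping. The function name `sequence_log_prob` and the dictionary representation of per-step distributions are assumptions made for illustration; in the actual model these distributions come from the tagger.

```python
import math


def sequence_log_prob(step_probs, tags):
    """Sum log P(y_t | x, y_<t) over all time steps.

    step_probs[t] maps each candidate tag to its probability at step t,
    assumed to be already conditioned on the input and on the tags before t.
    """
    return sum(math.log(step_probs[t][tag]) for t, tag in enumerate(tags))


# Toy two-token example with hypothetical probabilities.
toy_probs = [
    {"O": 0.9, "B-PROBLEM": 0.1},
    {"O": 0.2, "B-PROBLEM": 0.8},
]
toy_log_prob = sequence_log_prob(toy_probs, ["O", "B-PROBLEM"])
```

Maximizing this quantity over θ for the gold tag sequence is equivalent to minimizing the summed per-step cross-entropy.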

Encoders
Input to the model, $x \in \mathbb{N}^{T}$, represents token ids of the input vocabulary. This sequence is encoded first at the character level and additionally at the word level. The character-level representation uses a bi-directional Long Short-Term Memory (BiLSTM) (Hochreiter and Schmidhuber, 1997; Graves et al., 2013) unit to encode the characters of each word, producing forward and backward hidden states $\overrightarrow{h}_{1:l}$ and $\overleftarrow{h}_{1:l}$, where l represents the length of the word. We concatenate the last time-step of each of these sequences to obtain a vector representation of the word, $h^{(c)}_t = [\overrightarrow{h}_{l} \,\|\, \overleftarrow{h}_{l}]$. The final input to the word-level encoder is a combination of a pre-trained word embedding (Pennington et al., 2014) and the character representation, $[\mathrm{emb}(x_t) \,\|\, h^{(c)}_t]$. For the word-level encoder we make use of another BiLSTM.

Tagger
The tagger consists of a uni-directional LSTM which takes as input the latent word representation given by the word-level encoder, as well as the label embedding of the previously generated tag. During training we feed ground truth labels by way of teacher forcing (Williams and Zipser, 1989), while at test time we use the generated sequence directly. This system is trained using a standard cross-entropy objective.
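The difference between teacher forcing and test-time decoding lies only in which previous tag is fed to the next step. The following sketch shows that control flow under simplifying assumptions: `step_fn` is a hypothetical stand-in for the LSTM step, softmax, and argmax, and `toy_step` is an invented rule used purely to make the example runnable.

```python
def decode(step_fn, word_reprs, gold_tags=None, start_tag="<s>"):
    """Run a tagger left to right over the word representations.

    If gold_tags is given (training), the previous gold tag is fed to the
    next step (teacher forcing); otherwise the model's own previous
    prediction is fed (test time).
    """
    prev, preds = start_tag, []
    for t, h in enumerate(word_reprs):
        pred = step_fn(h, prev)
        preds.append(pred)
        prev = gold_tags[t] if gold_tags is not None else pred
    return preds


def toy_step(h, prev_tag):
    """Invented rule standing in for a learned step: continue an entity only if one is open."""
    if h == "cont" and prev_tag in ("B", "I"):
        return "I"
    if h == "start":
        return "B"
    return "O"
```

With the inputs `["start", "cont", "other"]`, free-running decoding depends on the model's own earlier outputs, whereas teacher forcing conditions every step on the supplied gold tags, so the two runs can produce different predictions for the same input.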

Two Decoder Model
To facilitate the MTL setting, we begin with a two decoder model consisting of decoders which use the shared encoder representation to jointly predict entities and the negation attribute (Figure 2). This is a standard architecture used for MTL, which consists of separate LSTM decoders, each followed by its own softmax layer. This model mitigates the issues associated with rule-based models that rely solely on surface text, and thus are limited when attempting to capture complex syntactic constructions. With a shared contextual encoder representation consisting of character and word embedding based models, the proposed architecture provides an effective solution for knowledge transfer across tasks, thus consolidating the ability to perform well on unseen text. However, this proposed architecture is not scalable, since the number of decoders scales linearly with the number of attributes. Another problem we observed with this architecture is the performance degradation when working in an extremely low resource setting, where more parameters prevent the model from generalizing well.
Figure 3: Shared decoder model
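The scaling concern can be illustrated with a rough parameter count. This is a back-of-the-envelope sketch, not a figure from the paper: it uses the textbook single-layer LSTM count with one bias vector per gate, ignores the softmax layers, and assumes a decoder input size of 250 (a 200-dimensional BiLSTM output plus a 50-dimensional tag embedding) with the hidden size of 50 from the model settings.

```python
def lstm_params(input_size, hidden_size):
    """Parameters of one LSTM layer: 4 gates, each with input weights,
    recurrent weights, and one bias vector."""
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)


def decoder_params(num_attributes, input_size, hidden_size, shared):
    """Total decoder LSTM parameters for num_attributes prediction tasks."""
    n_decoders = 1 if shared else num_attributes
    return n_decoders * lstm_params(input_size, hidden_size)


# Two tasks (entity and negation) under the assumed dimensions.
separate = decoder_params(2, input_size=250, hidden_size=50, shared=False)
single = decoder_params(2, input_size=250, hidden_size=50, shared=True)
```

Under these assumptions the separate-decoder count grows linearly with the number of attributes, while the shared-decoder count stays fixed.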

Shared Decoder Model
To overcome the limitations of the two decoder model, we propose a shared decoder model (Figure 3). We share both the encoder and the decoder between the two tasks, and the common output from the decoder is fed into two different softmax layers, one for entities and one for negations.
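A minimal sketch of the two output heads on top of a shared decoder state is given below. It is written in plain Python for readability, and the helper names (`softmax`, `linear`, `shared_decoder_heads`) and the explicit linear projection before each softmax are assumptions made for illustration rather than the exact layers of the model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, vec):
    """Affine map: one output score per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def shared_decoder_heads(o_t, ent_head, neg_head):
    """Feed the same decoder output o_t into two task-specific softmax heads.

    Each head is a (weights, bias) pair.
    """
    p_ent = softmax(linear(*ent_head, o_t))
    p_neg = softmax(linear(*neg_head, o_t))
    return p_ent, p_neg
```

Both heads read the same vector `o_t`, so the only task-specific parameters are the two projection layers.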

Conditional Softmax Decoder Model
While the single decoder model is more scalable, we found that this model did not perform as well for negation as the two decoder model. This can be attributed to the fact that negation occurs less frequently than the entities, so the decoder primarily focuses on making entity extraction predictions. To mitigate this issue and provide more context to negation attributes, we add an additional input, which is the softmax output from entity extraction (Figure 4). Thus, the model learns more about the input as well as the label distribution from the entity extraction prediction. As an example, we use negation only for the PROBLEM entity in the i2b2 dataset. Providing the entity prediction distribution helps the negation model to make better predictions. The negation model learns that if the prediction probability is not inclined towards PROBLEM, then it should not predict negation irrespective of the word representation.
$$\mathrm{SoftOut}^{Ent}_t = \mathrm{Softmax}^{Ent}(o_t)$$
$$\hat{y}^{Neg}_t = \mathrm{Softmax}^{Neg}([o_t, \mathrm{SoftOut}^{Ent}_t])$$
where $o_t$ is the output of the shared decoder and $\mathrm{SoftOut}^{Ent}_t$ is the softmax output of the entity at time step t.

Table 1 (negation detection rows): precision (P), recall (R), and F1.

Model                                        i2b2                   Proprietary
                                             P      R      F1       P      R      F1
(Shivade et al., 2015)                       0.899  0.900  0.900    -      -      -
Independent Negation (Lample et al., 2016)   0.810  0.850  0.820    0.840  0.820  0.83
Two Decoder (this paper)                     0.894  0.908  0.899    0.931  0.865  0.897
Shared Decoder (this paper)                  0.870  0.902  0.882    0.921  0.850  0.878
Conditional (this paper)                     0.919  0.891  0.905    0.928  0.874  0.899
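The conditioning step can be sketched as follows: the negation head receives the decoder output concatenated with the entity softmax output. This is a plain-Python illustration under assumed names and shapes (`conditional_heads`, `(weights, bias)` heads with an explicit linear projection), not the authors' implementation. In the toy usage, entity index 0 is taken to mean PROBLEM and negation index 1 to mean negated, which is also an assumption.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, vec):
    """Affine map: one output score per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def conditional_heads(o_t, ent_head, neg_head):
    """Entity softmax first; the negation head then sees [o_t, entity softmax]."""
    p_ent = softmax(linear(*ent_head, o_t))
    # List concatenation: decoder output followed by the entity distribution.
    p_neg = softmax(linear(*neg_head, o_t + p_ent))
    return p_ent, p_neg


# Toy weights: the negation score depends only on the PROBLEM probability.
o_t = [0.0, 0.0]
neg_head = ([[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 4.0, 0.0]], [0.0, -2.0])
likely_problem = ([[0.0, 0.0], [0.0, 0.0]], [2.0, 0.0])
unlikely_problem = ([[0.0, 0.0], [0.0, 0.0]], [0.0, 2.0])
_, neg_if_problem = conditional_heads(o_t, likely_problem, neg_head)
_, neg_if_other = conditional_heads(o_t, unlikely_problem, neg_head)
```

With identical decoder output, the toy negation head favors the negated class only when the entity distribution leans towards PROBLEM, which mirrors the behavior described in the text.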

Dataset
We evaluated our model on two datasets. The first is the 2010 i2b2/VA challenge dataset for "test, treatment, problem" (TTP) entity extraction and assertion detection (i2b2 dataset). Unfortunately, only part of this dataset was made public after the challenge; therefore, we cannot directly compare with NegEx and ABoW results. We followed the original data split from R. Chalapathy and Piccardi (2016) of 170 notes for training and 256 for testing. The second dataset is proprietary and consists of 4,200 de-identified, annotated clinical notes with medical conditions (proprietary dataset).

Model settings
Word, character and tag embeddings are 100, 25, and 50 dimensions, respectively. For word embeddings we use GloVe (Pennington et al., 2014) and fine-tune during training, while character and tag embeddings are randomly initialized. Character and word encoders have 50 and 100 hidden units, respectively, while the tagger LSTM has a hidden size of 50. Dropout is used after every RNN, as well as for the word embedding input. We use Adam (Kingma and Ba, 2014) as an optimizer. Hyperparameters are tuned using Bayesian Optimization (Snoek et al., 2012).
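For reference, the stated settings can be collected in a small configuration object. The class and field names below are hypothetical. The dropout rate is left unset because the text does not state it, and the derived input size assumes that the word embedding is concatenated with the forward and backward character states.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelConfig:
    word_emb_dim: int = 100          # GloVe, fine-tuned during training
    char_emb_dim: int = 25           # randomly initialized
    tag_emb_dim: int = 50            # randomly initialized
    char_hidden: int = 50            # character-level encoder hidden units
    word_hidden: int = 100           # word-level encoder hidden units
    tagger_hidden: int = 50          # tagger LSTM hidden size
    optimizer: str = "adam"
    dropout_rate: Optional[float] = None  # rate not stated in the text

    @property
    def word_encoder_input_dim(self) -> int:
        # Assumption: word embedding concatenated with forward and
        # backward character-level final states.
        return self.word_emb_dim + 2 * self.char_hidden


cfg = ModelConfig()
```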

Results
Since there is no prior work which has solved the two tasks as a joint model, we report the best results for both the individual tasks (Table 1). We observe that the baseline model for NER (Independent NER) presented in the methodology section outperforms the best model (R. Chalapathy and Piccardi, 2016) on the i2b2 challenge. The two decoder and the conditional softmax decoder (conditional decoder) models achieve even better results for NER than our baseline model, with the conditional decoder model achieving a new state of the art for the 2010 i2b2/VA challenge task. The shared decoder underperformed the other two models. This can be attributed to the single decoder primarily focusing on entity extraction predictions, which are more frequent than negations. The conditional decoder outperformed the baseline model on the negation prediction task and achieved an improvement of about 8% in F1 score compared to the baseline model, which suggests that modeling named entity and negation tasks together helps in achieving better results than each of the tasks done independently.
We compare our models for negation detection against NegEx and ABoW, which has the best results for the negation detection task on the i2b2 dataset. The conditional decoder model outperforms both NegEx and ABoW (Table 1). The low performance of NegEx and ABoW is mainly attributed to the fact that they use ontology lookup to index findings and negation regular expression search within a fixed scope. A similar trend was observed in the medical condition dataset. The important thing to note is the low F1 score for NegEx. This can primarily be attributed to abbreviations and misspellings in clinical notes, which cannot be handled well by rule-based systems.
To understand the advantage of the conditional decoder, we evaluated our model in extreme low data settings where we used a sample of our training data. We observed that the conditional decoder outperforms the two decoder model and achieved an improvement of 6% in F1 score in those settings (Figure 5). As we increase the data size, their performance gap narrows, demonstrating that the conditional decoder is robust in low resource settings.

Conclusion
In this paper we have shown that named entity and negation assertion can be modeled in a multi-task setting. Joint learning with shared parameters provides better contextual representation and helps in alleviating problems associated with using neural networks for negation detection, thereby achieving better results than the rule-based systems. Our proposed conditional softmax decoder achieves the best results across both tasks and remains robust in extreme low data settings. For future work, we plan to investigate the model on other related tasks such as relation extraction and normalization, as well as the use of advanced conditional models.

Figure 1: Negated medications (highlighted in red) and negation cues (highlighted in purple) in clinical text. Our model does not explicitly label the cues.

Figure 2: Two decoder model, upper decoder for NER and the lower decoder for negation, where common encoder

Figure 5: Conditional softmax decoder is more robust in extreme low resource setting than its two decoder counterpart.

Table 1: Test set performance during multi-task training. The table displays precision, recall and macro-averaged F1. The baseline is the current state-of-the-art optimized architecture.
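As a reminder of how macro-averaged scores are formed, the sketch below computes per-label precision, recall, and F1 and then averages each over the labels. It compares labels token by token, which is an assumption made for illustration; the evaluation behind the table may be computed at the entity level. The function name `macro_prf` and the toy labels are hypothetical.

```python
def macro_prf(gold, pred):
    """Macro-averaged precision, recall, and F1 over all labels seen."""
    labels = sorted(set(gold) | set(pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy example with two labels.
toy_gold = ["NEG", "NEG", "POS", "POS"]
toy_pred = ["NEG", "POS", "POS", "POS"]
toy_p, toy_r, toy_f1 = macro_prf(toy_gold, toy_pred)
```

Because F1 is averaged per label, the macro F1 is in general not the harmonic mean of the macro precision and macro recall.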