A Dual-Attention Network for Joint Named Entity Recognition and Sentence Classification of Adverse Drug Events

An adverse drug event (ADE) is an injury resulting from medical intervention related to a drug. Automatic ADE detection from text is either fine-grained (ADE entity recognition) or coarse-grained (ADE assertive sentence classification), with limited efforts leveraging the inter-dependencies between the two granularities. We instead propose a multi-grained joint deep network to concurrently learn the ADE entity recognition and ADE sentence classification tasks. Our joint approach takes advantage of their symbiotic relationship, with a transfer of knowledge between the two levels of granularity. Our dual-attention mechanism constructs multiple distinct representations of a sentence that capture both task-specific and semantic information in the sentence, providing stronger emphasis on the key elements essential for sentence classification. Our model improves the state-of-the-art F1-score for both tasks: (i) entity recognition of ADE words (12.5% increase) and (ii) ADE sentence classification (13.6% increase) on the MADE 1.0 benchmark of EHR notes.


Introduction
Background. Adverse drug events (ADEs), injuries resulting from medical intervention, are a leading cause of death in the United States and cost around $30 to $130 billion every year (Donaldson et al., 2000). Early detection of ADE incidents aids in the timely assessment, mitigation and prevention of future occurrences of ADEs. Natural Language Processing techniques have been recognized as instrumental in identifying ADEs and related information from unstructured text fields of spontaneous reports and electronic health records (EHRs), and thus in improving drug safety monitoring and pharmacovigilance (Harpaz et al., 2014).
Fine-grained ADE detection identifies named ADE entities at the word level, while coarse-grained ADE detection (also called ADE assertive text classification) identifies complete sentences describing drug-related adverse effects. The system of Gurulingappa et al. (2011) for identifying ADE assertive sentences in medical case reports targets the important application of detecting under-reported and under-documented adverse drug effects. Lastly, multi-grained ADE detection identifies ADE information at multiple levels of granularity, namely, both the entity and sentence levels.
As an example, Figure 1 displays ADE and non-ADE sentences. The first is an ADE sentence where the mentions of the Drugname and ADE entities have the appropriate relationship with each other. The second and third sentences show that the mention of an ADE entity by itself is not sufficient to assert a drug-related adverse side effect.
Recently, deep learning-based sequence approaches have shown some promise in extracting fine-grained ADEs and related named entities from text. However, the prevalence of entity-type ambiguity remains a major hurdle, such as distinguishing between Indication entities as the reason for taking a drug versus ADE entities as unintended outcomes of taking a drug. Coarse-grained sentence-level detection performs well in identifying ADE descriptive sentences, but is not equipped to detect fine-grained information such as the words associated with ADE related named entities. Unfortunately, when the interaction between these two extraction tasks is ignored, we miss the opportunity for a transfer of knowledge between the ADE entity and sentence prediction tasks.
Figure 1: Each sentence is classified as an ADE sentence (binary yes/no). Each word is labeled as the beginning of an entity (B-...) vs. inside an entity (I-...) for ADE related named entities (multiple classes). O denotes no entity tag.

Attention-based neural network models have been shown to be effective for text classification tasks (Luong et al., 2015; Bahdanau et al., 2014), from alignment attention in translation (Liu et al., 2016) to supervising attention in binary text classification (Rei and Søgaard, 2019). Previous approaches typically apply only a single round of attention focusing on simple semantic information. In our ADE detection task, instead, key elements of the sentence can be linked to multiple categories of task-specific semantic information of the named entities (ADE, Drug, Indication, Severity, Dose, etc.). Thus, a single attention is insufficient for exploring this multi-aspect information and consequently risks losing important cues.
Proposed Approach. In our work, we tackle the above shortcomings by designing a dual-attention based neural network model for multi-grained joint learning, called MGADE, that jointly identifies both ADE entities and ADE assertive sentences. The design of MGADE is inspired by multi-task Recurrent Neural Network architectures for jointly learning to label tokens and sentences in a binary classification setting (Rei and Søgaard, 2019). In addition, our model makes use of a supervised self-attention mechanism based on entity-level predictions to guide the attention function, aiding it in tackling the above entity-type ambiguity problem. We also introduce novel strategies for constructing multiple complementary sentence-level representations to enhance the performance of sentence classification.
Our key contributions include:
1. Joint Model. We jointly model ADE entity recognition as a multi-class sequence tagging problem and ADE assertive text classification as a binary classification problem. Our model leverages the mutually beneficial relationship between these two tasks; e.g., ADE sentence classification can influence ADE entity recognition by identifying clues that contribute to the ADE assertiveness of the sentence and matching them to ADE entities.
2. Dual-Attention. Our novel method for generating and pooling multiple attention mechanisms produces informative sentence-level representations. Our dual-attention mechanisms based on word-level entity predictions construct multiple representations of the same sentence. The dual-attention weighted sentence-level representations capture both task-specific and semantic information in a sentence, providing stronger emphasis on key elements essential for sentence classification.
3. Label-Awareness. We introduce an augmented sentence-level representation comprised of predicted entity labels which adds label-context to the proposed dual-attention sentence-level representation for better capturing the word-level label distribution and word dependencies within the sentence. This further boosts the performance of the sentence classification task.
4. Model Evaluation. We compare our joint model with state-of-the-art methods for the ADE entity recognition and ADE sentence classification tasks. Experiments on the MADE 1.0 benchmark of EHR notes demonstrate that our MGADE model drives up the F1-score for both tasks significantly: (i) entity recognition of ADE words by 12.5% and 23.5%, and (ii) ADE sentence classification by 13.6% and 23.0%, compared to state-of-the-art single-task and joint-task models, respectively.

Related Work
Zhang et al. (2018) developed a multi-task learning model that combines entity recognition with document classification to extract the adverse event from a case narrative and classify the case as serious or non-serious. However, their approach falls short in tackling our problem. Not only do their targeted labels not fall into the drug-related adverse side effects category, in which a causal relationship is suspected and required, but their attention model is only simple self-attention. As a consequence, MGADE outperforms their model by 23.5% in F1 score for entity recognition and 23.0% for assertive text classification, as seen in Section 4.

Task Definition
In the ADE and medication related information detection task, the entities are ADE, Drugname, Dose, Duration, Frequency, Indication, Route, Severity and Other Signs & Symptoms. The no-entity tag is O. Because some entities (like weight gain) can span multiple words, we use a BIO tagging scheme to distinguish the beginning (tag B-...) from the inside of an entity (tag I-...). The notation we use is given in Fig. 2. Given a sentence (a sequence of words), task one is the multi-class classification of ADE and medication related named entities in the text sequence, i.e., entity recognition. Task two is the binary classification of a sentence as ADE assertive text. The overall goal is to minimize the weighted sum of the entity recognition loss and the sentence classification loss.
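As a minimal illustration of the BIO scheme described above, the following sketch converts token-level entity span annotations into BIO tags. The helper `spans_to_bio` and the example spans are hypothetical for illustration, not part of the MADE 1.0 toolchain:

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans to BIO tags.
    spans: list of (start, end, entity_type) with token indices, end exclusive."""
    tags = ["O"] * len(tokens)              # default: no entity
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"          # beginning of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # inside the entity
    return tags

tokens = ["weight", "gain", "after", "taking", "prednisone"]
tags = spans_to_bio(tokens, [(0, 2, "ADE"), (4, 5, "Drugname")])
# tags == ["B-ADE", "I-ADE", "O", "O", "B-Drugname"]
```

Note how the multi-word entity "weight gain" receives a B- tag followed by an I- tag, which is exactly why the BIO scheme is needed.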

Input Embedding Layer
The input of this layer is a sentence represented by a sequence of words S = w_1, w_2, ..., w_N, where N is the sentence length. The words are first broken into individual characters, and character-level representations, which capture the morphology of a word, are computed with a bidirectional LSTM over the sequence of characters in the input word. We employ pre-trained GloVe word vectors (Pennington et al., 2014) to obtain a fixed word embedding for each word. A consolidated dense embedding, comprised of the pre-trained word embedding concatenated with the learned character-level representation, is used to represent a word. The output of this layer is the sequence of word representations x_t for t = 1, ..., N.
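The concatenation scheme can be sketched as follows. This is a toy stand-in: the tiny `glove` table and the two-feature `char_rep` function substitute for the real 300-dimensional GloVe vectors and the character-level Bi-LSTM; only the concatenation logic mirrors the layer described above:

```python
# Toy stand-in for a pretrained word-vector table (real model: 300-d GloVe).
glove = {"rash": [0.1, 0.2, 0.3, 0.4]}

def char_rep(word):
    # Stand-in for the char-level Bi-LSTM output: two crude morphology features
    # (normalized length, and whether the word is capitalized).
    return [len(word) / 10.0, 1.0 if word[0].isupper() else 0.0]

def embed_word(word, glove, dim=4):
    e_w = glove.get(word.lower(), [0.0] * dim)  # fixed pretrained word embedding
    c_w = char_rep(word)                        # learned character-level representation
    return e_w + c_w                            # concatenation: x_t = [e_w ; c_w]

x = embed_word("Rash", glove)
# len(x) == 6: the word-level part plus the char-level part
```

Note that the word lookup is case-insensitive while the character-level part still sees the original capitalization, matching the tokenization choices described later in the hyper-parameter settings.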

Contextual Layer
LSTM is a type of recurrent neural network that effectively captures long-distance sequence information and the interaction between adjacent words (Hochreiter and Schmidhuber, 1997). The word representations x_t are given as input to two separate LSTM networks (Bi-LSTM) that scan the sequence forward and backward, respectively. The hidden states learned by the forward and backward LSTMs are denoted h_t^f and h_t^b, respectively. The output of this layer is a sequence of hidden states h_t = [h_t^f ; h_t^b], the concatenation of the two directions. This way, the hidden state h_t of a word encodes information about the t-th word and its context.
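The bidirectional scan-and-concatenate pattern can be illustrated with a minimal sketch. The `step` function here is a hypothetical stand-in for an LSTM cell (the real layer uses gated LSTM updates); the point is only how each position's state combines a forward and a backward pass:

```python
def bi_scan(xs, step):
    """Run a recurrence forward and backward; return concatenated states per position."""
    fwd, bwd, h = [], [], 0.0
    for x in xs:                      # forward scan
        h = step(h, x)
        fwd.append(h)
    h = 0.0
    for x in reversed(xs):            # backward scan
        h = step(h, x)
        bwd.append(h)
    bwd.reverse()
    # h_t = [fwd_t ; bwd_t]: each position sees both left and right context
    return [(f, b) for f, b in zip(fwd, bwd)]

# Stand-in "cell": exponential moving average instead of a gated LSTM update.
states = bi_scan([1.0, 0.0, 2.0], lambda h, x: 0.5 * h + 0.5 * x)
```

Even in this toy version, the state at the middle position depends on inputs from both directions, which is what lets h_t encode the word's full context.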

Word-level (NER) Output Layer
The hidden states h_t are passed through a non-linear layer and then through a softmax activation function to k output nodes, where k denotes the number of entity types (classes). Entity-type labels are the named entities in the BIO format. Each output node belongs to some entity type and outputs a score for that entity type. The output of the softmax function is a categorical probability distribution, where the output probability of each class is between 0 and 1 and the output probabilities sum to 1. Each word is classified into the entity type with the highest probability value.
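The softmax-then-argmax step can be sketched as follows. The four-label set and the per-class scores are illustrative only (the full model has 19 BIO labels):

```python
import math

def softmax(scores):
    m = max(scores)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]             # probabilities sum to 1

labels = ["O", "B-ADE", "I-ADE", "B-Drugname"]   # toy label set
scores = [0.1, 2.3, 0.4, -1.0]                   # per-class scores for one word
probs = softmax(scores)
pred = labels[probs.index(max(probs))]           # highest-probability entity type
# pred == "B-ADE"
```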

Dual-Attention Layer
The purpose of the attention mechanism in the sentence classification task is to select important words in different contexts to build informative sentence representations. Different words have different importance for the ADE sentence classification task. For instance, key elements (words/phrases) in the ADE detection task are linked to multiple aspects of semantic information associated with the named entity categories (ADE, Drugname, Severity, Dose, Duration, Indication, etc.). It is necessary to assign a weight to each word according to its contribution to the ADE sentence classification task.
Moreover, certain named entities are task-specific and are considered essential for ADE sentence classification. There exists a direct correspondence between such task-specific named entities and the sentence. Hence, we anticipate that there will be at least one word with the same label as the sentence-level label. For instance, a sentence that is labeled as an ADE sentence has a corresponding ADE entity word. Although other named entity words carry important information and contribute to the ADE sentence-level classification task, a stronger focus should be on task-specific ADE words indicative of the ADE sentence's core message. A single attention distribution tends to be insufficient to explore this multi-aspect information and consequently may risk losing important cues (Wang et al., 2017).
We address this challenge by generating and using multiple attention distributions that offer additional opportunities to extract relevant semantic information. This way, we focus on different aspects of an ADE sentence to create a more informative representation. For this, we introduce a novel dual-attention mechanism which, in addition to selecting the important semantic areas in the sentence (henceforth referred to as supervised self-attention (Bahdanau et al., 2014; Yang et al., 2016; Rei and Søgaard, 2019)), also provides stronger emphasis on task-specific semantic aspect areas (henceforth referred to as task-specific attention). The task-specific attention promotes the words important to the ADE sentence classification task and reduces the noise introduced by words that are less important for the task.
Similar to (Rei and Søgaard, 2019; Yang et al., 2016), we use a self-attention mechanism where attention weights are extracted from word-level prediction scores, based on softmax probabilities and normalization. The difference between the two attention mechanisms is that the supervised self-attention recognizes word-level prediction scores of all named entities, while the task-specific attention recognizes word-level prediction scores w.r.t. only selective named entities (the one which corresponds to the ADE sentence, ignoring other named entities). Specifically, let a_t denote the word-level prediction score of word t w.r.t. the task-specific named entity (i.e., ADE), and let b_t denote the word-level prediction score of word t w.r.t. any named entity. The task-specific attention weight, normalized to sum up to 1 over all words in the sentence, is α_t = a_t / Σ_j a_j. The supervised self-attention weight, normalized to sum up to 1 over all words in the sentence, is β_t = b_t / Σ_j b_j. Fig. 3 shows examples of the supervised self-attention and task-specific attention distributions generated by our attention layer. The color depth expresses the degree of importance of the weight in the attention vector. As depicted in Fig. 3, the task-specific attention places stronger emphasis on the parts relevant to the ADE sentence classification task.
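A sketch of the two weight computations follows. This is an assumption-laden illustration: it takes the word-level softmax distributions as the prediction scores, uses the summed B-ADE/I-ADE probability as the ADE score and one minus the O probability as the any-entity score, and applies the sum-to-one normalization described above; the paper's exact scoring function may differ:

```python
def attention_weights(word_probs, labels):
    """word_probs: per-word softmax distributions over the BIO label set."""
    i_b, i_i, i_o = labels.index("B-ADE"), labels.index("I-ADE"), labels.index("O")
    ade = [p[i_b] + p[i_i] for p in word_probs]   # score w.r.t. the ADE entity only
    ent = [1.0 - p[i_o] for p in word_probs]      # score w.r.t. any named entity
    alpha = [a / sum(ade) for a in ade]           # task-specific attention weights
    beta = [b / sum(ent) for b in ent]            # supervised self-attention weights
    return alpha, beta

labels = ["O", "B-ADE", "I-ADE", "B-Drugname"]
word_probs = [[0.7, 0.1, 0.1, 0.1],   # likely "O"
              [0.1, 0.6, 0.2, 0.1],   # likely "B-ADE"
              [0.2, 0.1, 0.1, 0.6]]   # likely "B-Drugname"
alpha, beta = attention_weights(word_probs, labels)
```

In this toy example alpha concentrates on the ADE word, while beta also gives substantial weight to the Drugname word, mirroring how the two distributions emphasize different aspects of the sentence.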
Attention-based Sentence Representations. To generate informative and more accurate sentence representations, we construct two different sentence representations as weighted sums of the context-conditioned hidden states, using the task-specific attention weights α_t and the supervised self-attention weights β_t, respectively.
1. Task-specific attention weighted sentence representation: TS_S = Σ_t α_t h_t.
2. Supervised self-attention weighted sentence representation: SS_S = Σ_t β_t h_t.
Attention Pooling. A combination of multiple sentence representations obtained from focusing on different aspects captures the overall contextual semantic information about a sentence. The two attention-based representations are concatenated to form a dual-attention contextual sentence representation: C_S = [TS_S ; SS_S].
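The pooling step above can be sketched directly. The 2-dimensional hidden states and the two weight vectors are toy values; only the weighted-sum-then-concatenate structure reflects the layer:

```python
def weighted_sum(weights, hiddens):
    """Attention-weighted sum of hidden state vectors."""
    dim = len(hiddens[0])
    out = [0.0] * dim
    for w, h in zip(weights, hiddens):
        for i in range(dim):
            out[i] += w * h[i]
    return out

H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy hidden states h_t
alpha = [0.2, 0.5, 0.3]                    # task-specific attention weights
beta = [0.4, 0.4, 0.2]                     # supervised self-attention weights

ts_s = weighted_sum(alpha, H)              # TS_S
ss_s = weighted_sum(beta, H)               # SS_S
c_s = ts_s + ss_s                          # C_S = [TS_S ; SS_S], doubled dimension
```

With the paper's dimensions, TS_S and SS_S are each 200-dimensional, so C_S is 400-dimensional.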

Entity Prediction Embedding Layer
ADE detection is a challenging task. Understanding the co-occurrence of named entities (labels) is essential for ADE sentence classification. Although we implicitly capture long-range label dependencies with the Bi-LSTM in the contextual layer, and build even more informative sentence-level representations with the help of the dual-attention layer, explicitly integrating information on the label distribution in a sentence further helps to understand the label co-occurrence structure and dependencies in the sentence. The idea is to further improve the performance of the ADE sentence classification task by learning from the output word-level label knowledge. To better represent the word-level label distribution and to capture potential label dependencies within each sentence, we propose the Entity Prediction Embedding (EPE), a sentence-level vector representation of the entity labels predicted at the word-level output layer (Sec. 3.4).
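Based on the later description of EPE as a 19-dimensional binary vector (eighteen BIO entity tags plus the no-entity tag), one way to sketch its construction is as a presence indicator over the predicted labels. The exact label ordering is an assumption for illustration:

```python
ENTITIES = ["ADE", "Drugname", "Dose", "Duration", "Frequency",
            "Indication", "Route", "Severity", "OtherSSD"]
# 9 entity types x {B, I} plus the no-entity tag O -> 19 labels total.
LABEL_SET = [f"{p}-{e}" for e in ENTITIES for p in ("B", "I")] + ["O"]

def entity_prediction_embedding(pred_tags):
    """Binary vector marking which predicted labels occur in the sentence."""
    present = set(pred_tags)
    return [1.0 if lab in present else 0.0 for lab in LABEL_SET]

epe = entity_prediction_embedding(["B-ADE", "I-ADE", "O", "B-Drugname"])
# len(epe) == 19; four labels are present in this sentence
```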

Sentence Encoding Layer
A final sentence representation that captures the overall contextual semantic information and label dependencies within the sentence is constructed by combining the dual-attention weighted sentence representation with the Entity Prediction Embedding.

Sentence Classification Output Layer
Finally, we apply a fully connected function and use sigmoid activation to output the sentence prediction score.
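The final classification step can be sketched as a single fully connected unit with a sigmoid, operating on the concatenated sentence vector. The weights and inputs below are toy values, not learned parameters:

```python
import math

def sentence_score(s_vec, weights, bias):
    # Fully connected layer followed by a sigmoid: a score in (0, 1).
    z = sum(w * x for w, x in zip(weights, s_vec)) + bias
    return 1.0 / (1.0 + math.exp(-z))

score = sentence_score([0.5, 0.8, 0.6], [1.2, -0.4, 0.7], 0.1)
is_ade = score >= 0.5   # threshold the score for the binary ADE/non-ADE decision
```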

Optimization objective
The sentence-level objective is to minimize the mean squared error between the predicted sentence-level score ŷ^(sentence) and the gold-standard sentence label y^(sentence) across all m sentences:

L_sent = (1/m) Σ_i (ŷ_i^(sentence) - y_i^(sentence))^2

The word-level objective is to minimize the cross-entropy loss between the predicted word-level probability score ŷ^(entity) and the gold-standard entity label y^(entity) across all N words in the sentence:

L_word = - Σ_t y_t^(entity) · log ŷ_t^(entity)

Similar to (Rei and Søgaard, 2019), we also add another loss function L_attn that joins the sentence-level and word-level objectives and encourages the model to optimize for two conditions on an ADE sentence: (i) an ADE sentence must have at least one ADE entity word, and (ii) an ADE sentence must have at least one word that is either a non-ADE entity or a no-entity word.
We combine the different objective functions using weighting parameters that allow us to control the importance of each objective. The final objective that we minimize during training is then:

L = λ_sent · L_sent + λ_word · L_word + λ_attn · L_attn    (19)

By using word-level entity predictions as attention weights for composing sentence-level representations, we explicitly connect the predictions at both levels of granularity. When both objectives work in tandem, they help improve the performance of one another. In our joint model, we give equal importance to both tasks and set λ_word = λ_sent = 1.
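The two main loss terms and their weighted combination can be sketched as follows. For brevity this sketch omits the attention loss term L_attn; the function names and toy inputs are illustrative:

```python
import math

def mse(y_hat, y):
    """Sentence-level loss: mean squared error over m sentences."""
    return sum((a - b) ** 2 for a, b in zip(y_hat, y)) / len(y)

def cross_entropy(word_probs, gold_idx):
    """Word-level loss: mean negative log-probability of the gold labels."""
    return -sum(math.log(p[g]) for p, g in zip(word_probs, gold_idx)) / len(gold_idx)

def joint_loss(sent_pred, sent_gold, word_probs, word_gold,
               lam_sent=1.0, lam_word=1.0):
    # Weighted combination of the sentence-level and word-level objectives.
    return (lam_sent * mse(sent_pred, sent_gold)
            + lam_word * cross_entropy(word_probs, word_gold))

loss = joint_loss([0.9], [1.0], [[0.8, 0.2]], [0])
```

Setting lam_sent = lam_word = 1 reproduces the equal-importance choice described above.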

Data Set
The MADE 1.0 NLP challenge for detecting medication and ADE related information from EHRs (Jagannatha and Yu, 2016a) used 1089 de-identified EHR notes from 21 cancer patients (training: 876 notes, testing: 213 notes). The annotation statistics of the corpus are provided.
Named Entity Labels. The notes are annotated with several categories of medication information. Adverse Drug Event (ADE), Drugname, Indication and Other Sign Symptom and Diseases (OtherSSD) are specified as medical events that contribute to a change in a patient's medical status. Severity, Route, Frequency, Duration and Dosage are specified as attributes that describe important properties of the medical events. Severity denotes the severity of a disease or symptom. Route, Frequency, Duration and Dosage, as attributes of Drugname, label the medication method, the frequency of dosage, the duration of dosage, and the dosage quantity, respectively.
Sentence Labels. MADE 1.0 text has each word manually annotated with ADE or medication related entity types. For words that belong to the ADE entity type, an additional relation annotation denotes if the ADE entity is an adverse side effect of the prescription of the Drugname entity. Since MADE 1.0 dataset does not have sentence-level annotations, we use the relation annotation with the word annotation to assign each sentence a label as ADE or nonADE. In this work, the relation labels are used only to assign the sentence labels, but they are not used in the supervised learning process.
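The label-assignment rule described above can be sketched as a small function. The rule as coded (an ADE sentence must contain an ADE entity word that participates in an adverse-effect relation with a Drugname) is our reading of the description; the function name and inputs are hypothetical:

```python
def sentence_label(word_tags, has_ade_drug_relation):
    """Derive the sentence label from word annotations and relation annotations.
    has_ade_drug_relation: True if a relation annotation links an ADE entity
    in this sentence to a Drugname as its adverse side effect."""
    has_ade_word = any(t in ("B-ADE", "I-ADE") for t in word_tags)
    return "ADE" if has_ade_word and has_ade_drug_relation else "nonADE"

# An ADE word alone is not enough; the relation must also hold.
lbl1 = sentence_label(["B-ADE", "I-ADE", "O", "B-Drugname"], True)   # "ADE"
lbl2 = sentence_label(["B-ADE", "O"], False)                          # "nonADE"
```

This mirrors the point made in the introduction that the mention of an ADE entity by itself does not assert a drug-related adverse side effect.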

Hyper-parameter Settings
The model operates on tokenized sentences. Tokens were lower-cased, while the character-level component receives input with the original capitalization to learn the morphological features of each word. As input, we use the pre-trained, publicly available GloVe word embeddings of size 300 (Pennington et al., 2014). The learned character-level embeddings are 100-dimensional vectors. The LSTM hidden layers for the word-level and char-level LSTMs are of size 300 and 100, respectively. The combined hidden representation h_t was set to size 200; the attention weight layer e_t was set to size 100. The attention-weighted sentence representations TS_S and SS_S are 200-dimensional vectors, and therefore their combined context vector C_S is 400-dimensional. The Entity Prediction Embedding (EPE) L_S is of size k, the number of entity labels in BIO format; hence EPE is a 19-dimensional binary vector (eighteen BIO entity tags plus the no-entity tag). The final concatenated sentence-level vector S is thus of size 419. To avoid over-fitting, we apply a dropout strategy (Ma and Hovy, 2016; Srivastava et al., 2014) of 0.5 for our model. All models were trained with a learning rate of 0.001 using Adam (Kingma and Ba, 2014).

Sentence Classification Results. We compare against two single-task baselines: (i) similar to (Dernoncourt et al., 2017), LAST is a Bi-LSTM based sentence classification model that uses the last hidden states for sentence composition; (ii) similar to (Yang et al., 2016), ATTN is a Bi-LSTM model that uses simple attention weights for sentence composition. Our full model MGADE improves the F1 score by 13.6% over the LAST baseline in testing. We also compare with a joint-task model based on self-attention, similar to (Zhang et al., 2018). MGADE outperforms their model by 23.0% for sentence classification. The sentence classification F1 scores are:

LAST (Dernoncourt et al., 2017): 0.66
ATTN (Yang et al., 2016): 0.63
Baseline Joint Model (Zhang et al., 2018): 0.61
MGADE: 0.75
Entity Recognition Results. The entity recognition F1 scores are:

Bi-LSTM (Wunnava et al., 2019): 0.56
Bi-LSTM + CRF (Wunnava et al., 2019): 0.63
Baseline Joint Model (Zhang et al., 2018): 0.51
MGADE: 0.63

Adding a CRF component to our model might further improve the performance of the entity recognition task. We also compare with a joint-task model based on self-attention, similar to (Zhang et al., 2018). MGADE outperforms their model by 23.5% for entity recognition.

Ablation Analysis
To evaluate the effect of each part of our model, we remove core sub-components and quantify the performance drop in F1 score. Types of Attention. Table 3 studies the two types of attention we generate, supervised self-attention (β) and task-specific attention (α), for composing sentence-level representations. † denotes the models with single attention. As shown in the table, models that use only a single attention component, be it the supervised self-attention based (SS_S) or the task-specific attention based sentence representation (TS_S), achieve the same F1-score for the entity recognition task. However, their sentence classification performance varies, demonstrating that the two attentions capture different aspects of information in the sentence. The type of attention captured plays a critical role in composing an informative sentence representation. Both single-attention models performed better than the baseline individual sentence-classification models LAST and ATTN (see Table 1). TS_S achieved superior sentence classification performance over SS_S. Intuitively, a stronger focus should be placed on the words indicative of the sentence type, and TS_S, which places more emphasis on the parts relevant to the ADE sentence classification task, is more accurate in identifying ADE sentences. Single Attention vs. Dual-Attention. Table 3 studies the impact of the dual-attention component. As seen, the model with the dual-attention sentence representation C_S, which combines the two attention-weighted sentence representations, outperforms the models with single attention (denoted by †) on both the entity recognition and sentence classification tasks. Label-Awareness. Table 3 studies the effect of adding the label-awareness component in improving the sentence representation.
Our full model MGADE, with both the dual-attention and label-aware components, further improves the performance of the sentence classification and entity recognition tasks by 1.0% and 2.0%, respectively, compared to MGADE-DualA, the model with only the dual-attention component. Case Study. Dual-attention is not only effective in capturing multiple aspects of semantic information in the sentence, but also in reducing the risk of capturing incorrect or insufficient attention when only one of the single attentions (either task-specific or supervised self-attention) is used. Fig. 4 shows such an example where single attention, either task-specific or supervised self-attention, fails to place sufficient attention weight on the key semantic areas of the sentence necessary to make a correct prediction on the sentence. The incorrect distribution of attention weights assigned by the single task-specific and single supervised self-attention models (Figures 4a and 4c) is addressed by the dual-attention mechanism. The latter corrects the distribution and assigns appropriate weights to the relevant semantic words, as in Figures 4b and 4d. In Figures 4e and 4f, we demonstrate the effectiveness of the dual-attention mechanism by plotting the attention weight distributions and the sentence prediction scores when a specific type of attention is composed into the sentence representation. The bar chart depicts the ADE sentence-level classification confidence scores w.r.t. the single-attention and dual-attention models and confirms the utility of dual-attention.

Conclusion
We propose a dual-attention network for multi-grained ADE detection to jointly identify ADE entities and ADE assertive sentences from medical narratives. Our model effectively supports knowledge sharing between the two levels of granularity, i.e., words and sentences, improving the overall quality of prediction on both tasks. Our solution achieves significant performance improvements over state-of-the-art models on both tasks. Our MGADE architecture is pluggable, in that other sequential learning models, including BERT (Devlin et al., 2019) or other models for sequence labelling and text classification, could be substituted in place of the Bi-LSTM sequential representation learning model. We leave this enhancement of our model and its study to future work.