An Effective Transition-based Model for Discontinuous NER

Unlike widely used Named Entity Recognition (NER) data sets in generic domains, biomedical NER data sets often contain mentions consisting of discontinuous spans. Conventional sequence tagging techniques encode Markov assumptions that are efficient but preclude recovery of these mentions. We propose a simple, effective transition-based model with generic neural encoding for discontinuous NER. Through extensive experiments on three biomedical data sets, we show that our model can effectively recognize discontinuous mentions without sacrificing the accuracy on continuous mentions.


Introduction
Named Entity Recognition (NER) is a critical component of biomedical natural language processing applications. In pharmacovigilance, it can be used to identify adverse drug events in consumer reviews in online medication forums, alerting medication developers, regulators and clinicians (Leaman et al., 2010; Sarker et al., 2015; Karimi et al., 2015b). In clinical settings, NER can be used to extract and summarize key information from electronic medical records such as conditions hidden in unstructured doctors' notes (Feblowitz et al., 2011; Wang et al., 2018b). These applications require identification of complex mentions not seen in generic domains (Dai, 2018).
Widely used sequence tagging techniques (flat models) encode two assumptions that do not always hold: (1) mentions do not nest or overlap, so each token can belong to at most one mention; and (2) mentions comprise continuous sequences of tokens. Nested entity recognition addresses violations of the first assumption (Lu and Roth, 2015; Katiyar and Cardie, 2018; Sohrab and Miwa, 2018; Ringland et al., 2019). However, the violation of the second assumption is comparatively less studied and requires handling discontinuous mentions (see examples in Figure 1). In contrast to continuous mentions, which are often short spans of text, discontinuous mentions consist of components that are separated by intervals. Recognizing discontinuous mentions is particularly challenging because exhaustive enumeration of possible mentions, including discontinuous and overlapping spans, is exponential in sentence length. Existing approaches for discontinuous NER suffer either from high time complexity (McDonald et al., 2005) or from ambiguity in translating intermediate representations into mentions (Tang et al., 2013a; Metke-Jimenez and Karimi, 2016; Muis and Lu, 2016). In addition, the current state of the art relies on traditional approaches with manually designed features, which are tailored to recognize specific entity types. These features also usually do not generalize well across genres (Leaman et al., 2015).

Figure 1: Two example sentences, 'The left atrium is mildly dilated .' and 'have much muscle pain and fatigue .', drawn from two data sets (the second being CADEC; Karimi et al., 2015a). The first example contains a discontinuous mention 'left atrium dilated'; the second example contains two mentions that overlap: 'muscle pain' and 'muscle fatigue' (discontinuous).
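To make the exponential growth of the candidate space concrete, the following worked count may help. The symbol $n$ for sentence length is introduced here for illustration and is not part of the original text. A sentence of $n$ tokens has $n(n+1)/2$ contiguous spans, whereas any non-empty subset of its tokens is a candidate (possibly discontinuous) mention:

```latex
\[
\underbrace{\frac{n(n+1)}{2}}_{\text{contiguous spans}}
\quad\text{vs.}\quad
\underbrace{2^{n}-1}_{\text{non-empty token subsets}},
\qquad
n = 10:\quad 55 \ \text{vs.}\ 1023 .
\]
```

Because mentions may also overlap, a prediction for a sentence is a set of such candidates, which enlarges the search space further.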
Motivations The main motivation for recognizing discontinuous mentions is that they usually represent compositional concepts that differ from concepts represented by individual components. For example, the mention 'left atrium dilated' in the first example of Figure 1 describes a disorder which has its own CUI (Concept Unique Identifier) in UMLS (Unified Medical Language System), whereas both 'left atrium' and 'dilated' also have their own CUIs. We argue that, in downstream applications such as pharmacovigilance and summarization, recognizing these discontinuous mentions that refer to disorders or symptoms is more useful than recognizing separate components which may refer to body locations or general feelings.
Another important characteristic of discontinuous mentions is that they usually overlap. That is, several mentions may share components that refer to the same body location (e.g., 'muscle' in 'muscle pain and fatigue'), or the same feeling (e.g., 'Pain' in 'Pain in knee and foot'). Separating these overlapping mentions rather than identifying them as a single mention is important for downstream tasks, such as entity linking where the assumption is that the input mention refers to one entity (Shen et al., 2015).

Contributions
We propose an end-to-end transition-based model with generic neural encoding that allows us to leverage specialized actions and an attention mechanism to determine whether or not a span is a component of a discontinuous mention (see footnote 1). We evaluate our model on three biomedical data sets with a substantial number of discontinuous mentions and demonstrate that our model can effectively recognize discontinuous mentions without sacrificing the accuracy on continuous mentions.

Prior Work
Existing methods for discontinuous NER fall mainly into two categories: token level approaches, based on sequence tagging techniques, and sentence level approaches, where a combination of mentions within a sentence is jointly predicted (Dai, 2018).
Token level approach A sequence tagging model takes a sequence of tokens as input and outputs a tag for each token, composed of a position indicator (e.g., BIO schema) and an entity type. The vanilla BIO schema cannot effectively represent discontinuous, overlapping mentions; therefore, some studies overcome this limitation by expanding the BIO tag set (Tang et al., 2013a; Metke-Jimenez and Karimi, 2016; Dai et al., 2017; Tang et al., 2018). In addition to BIO indicators, four new position indicators are introduced in (Metke-Jimenez and Karimi, 2016) to represent discontinuous mentions that may overlap:
• BH: Beginning of Head, defined as the components shared by multiple mentions;
• IH: Intermediate of Head;
• BD: Beginning of Discontinuous body, defined as the exclusive components of a discontinuous mention; and
• ID: Intermediate of Discontinuous body.

Footnote 1: Code available at GitHub: https://bit.ly/2XazEAO
Sentence level approach Instead of predicting whether each token belongs to an entity mention and its role in the mention, a sentence level approach predicts a combination of mentions within a sentence. A hypergraph, proposed by Lu and Roth (2015) and extended in (Muis and Lu, 2016), can compactly represent discontinuous and overlapping mentions in one sentence. A sub-hypergraph of the complete hypergraph can, therefore, be used to represent a combination of mentions in the sentence.
For the token at each position, there can be six different node types:
• A: mentions that start from the current token or a future token;
• E: mentions that start from the current token;
• T: mentions of a certain entity type that start from the current token;
• B: mentions that contain the current token;
• O: mentions that have an interval at the current token;
• X: mentions that end at the current token.
Using this representation, a single entity mention can be represented as a path from node A to node X, incorporating at least one node of type B.

Note that both token level and sentence level approaches first predict an intermediate representation of mentions (e.g., a sequence of tags in (Metke-Jimenez and Karimi, 2016) and a sub-hypergraph in (Muis and Lu, 2016)), which is then decoded into the final mentions. During the final decoding stage, both models suffer from some level of ambiguity. Taking the sequence tagging model using the BIO variant schema as an example, even if the model can correctly predict the gold sequence of tags for the example sentence 'muscle pain and fatigue' (BH I O BD), it is still not clear whether the token 'muscle' forms a mention by itself, because the same sentence containing three mentions ('muscle', 'muscle pain' and 'muscle fatigue') can be encoded using the same gold sequence of tags. We refer the reader to the survey by Dai (2018) for more discussion of these models, and to Muis and Lu (2016) for a theoretical analysis of their ambiguity.
Similar to prior work, our proposed transition-based model uses an intermediate representation (i.e., a sequence of actions). However, it does not suffer from this ambiguity issue. That is, the output sequence of actions can always be unambiguously decoded into mention outputs.
Two other methods that address the discontinuous NER problem are described in (McDonald et al., 2005; Wang and Lu, 2019). McDonald et al. (2005) solve the NER task as a structured multi-label classification problem. Instead of starting and ending indices, they represent each entity mention using the set of token positions that belong to the mention. This representation is flexible, as it allows mentions consisting of discontinuous tokens and does not require mentions to exclude each other. However, this method suffers from high time complexity. Tang et al. (2018) compare this representation with the BIO variant schema proposed in (Metke-Jimenez and Karimi, 2016), and find that they achieve competitive F1 scores, although the latter method is more efficient. A two-stage approach that first detects all components and then combines components into discontinuous mentions based on a classifier's decision was explored in recent work by Wang and Lu (2019).
Discontinuous NER vs. Nested NER Although discontinuous mentions may overlap, we distinguish this overlapping from the one in nested NER. That is, if one mention is completely contained by the other, we call the mentions involved nested entity mentions. In contrast, overlapping in discontinuous NER usually means that two mentions overlap, but neither is completely contained by the other. Most existing nested NER models are built to tackle the complete containing structure (Finkel and Manning, 2009; Lu and Roth, 2015), and they cannot be directly used to identify the overlapping mentions studied in this paper, let alone discontinuous mentions. However, we note that it is possible to approach the discontinuous NER task by adding fine-grained entity types into the schema. Taking the second sentence in Figure 1, 'have much muscle pain and fatigue .', as an example, we can add two new entity types: 'Body Location' and 'General Feeling', and then annotate 'muscle pain and fatigue' as an 'Adverse drug event' mention, 'muscle' as a 'Body Location' mention, and 'pain' and 'fatigue' as 'General Feeling' mentions (Figure 2). Then the discontinuous NER task can be converted into a nested NER task.
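The distinction between nested and overlapping mentions can be illustrated with a small sketch. It assumes, for illustration only, that a mention is represented as a set of token indices and that containment is judged on those sets; the function names `relation` and `is_discontinuous` are hypothetical and not taken from the paper's code.

```python
def relation(a, b):
    """Classify how two mentions relate; each is a collection of token indices."""
    a, b = frozenset(a), frozenset(b)
    if not a & b:
        return "disjoint"
    if a == b:
        return "identical"
    if a < b or b < a:
        # One mention is completely contained by the other.
        return "nested"
    # Shared tokens, but neither mention contains the other.
    return "overlapping"


def is_discontinuous(mention):
    """True if the mention's token indices do not form one contiguous run."""
    idx = sorted(set(mention))
    return idx[-1] - idx[0] + 1 != len(idx)


# Token indices for: have(0) much(1) muscle(2) pain(3) and(4) fatigue(5) .(6)
muscle_pain = {2, 3}
muscle_fatigue = {2, 5}
print(relation(muscle_pain, muscle_fatigue))  # overlapping
print(is_discontinuous(muscle_fatigue))       # True
```

Under this representation, 'muscle pain' and 'muscle fatigue' overlap on 'muscle' without either containing the other, which is the case that nested NER models are not designed for.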

Model
Transition-based models, due to their high efficiency, are widely used for NLP tasks, such as parsing and entity recognition (Chen and Manning, 2014;Lample et al., 2016;Lou et al., 2017;Wang et al., 2018a). The model we propose for discontinuous NER is based on the shift-reduce parser (Watanabe and Sumita, 2015;Lample et al., 2016) that employs a stack to store partially processed spans and a buffer to store unprocessed tokens. The learning problem is then framed as: given the state of the parser, predict an action which is applied to change the state of the parser. This process is repeated until the parser reaches the end state (i.e., the stack and buffer are both empty).
The main difference between our model and the ones in (Watanabe and Sumita, 2015; Lample et al., 2016) is the set of transition actions. Watanabe and Sumita (2015) use SHIFT, REDUCE, UNARY, FINISH, and IDLE for the constituent parsing system. Lample et al. (2016) use SHIFT, REDUCE, OUT for the flat NER system. Inspired by these models, we design a set of actions specifically for recognizing discontinuous and overlapping structure. There are in total six actions in our model:
• SHIFT moves the first token from the buffer to the stack; it implies this token is part of an entity mention.
• OUT pops the first token of the buffer, indicating it does not belong to any mention.
• COMPLETE pops the top span of the stack, outputting it as an entity mention. If we are interested in multiple entity types, we can extend this action to COMPLETE-y which labels the mention with entity type y.
• REDUCE pops the top two spans $s_0$ and $s_1$ from the stack and concatenates them as a new span which is then pushed back to the stack.
• LEFT-REDUCE is similar to the REDUCE action, except that the span $s_1$ is kept in the stack. This action indicates the span $s_1$ is involved in multiple mentions. In other words, several mentions share $s_1$, which could be a single token or several tokens.
• RIGHT-REDUCE is the same as LEFT-REDUCE, except that $s_0$ is kept in the stack.

Figure 3: An example sequence of transitions. Given the states of stack and buffer (blue highlighted), as well as the previous actions, predict the next action (i.e., LEFT-REDUCE) which is then applied to change the states of stack and buffer.

Figure 3 shows an example of how the parser recognizes entity mentions from a sentence. Note that, given one parser state, not all types of actions are valid. For example, if the stack does not contain any span, only SHIFT and OUT actions are valid because all other actions involve popping spans from the stack. We employ hard constraints: we select the most likely action only from among the valid actions.
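The six actions can be exercised with a small simulator. This is a minimal sketch rather than the authors' implementation: spans are assumed to be tuples of token indices, the span kept by LEFT-REDUCE or RIGHT-REDUCE is assumed to sit directly below the newly created span, and the action sequence in the example is one sequence that yields the two mentions under these assumptions. The name `run_transitions` is hypothetical.

```python
def run_transitions(tokens, actions):
    """Apply a sequence of actions and return the mentions they produce.

    The last element of `stack` plays the role of s0, the one before it s1.
    """
    stack = []                          # partially processed spans
    buffer = list(range(len(tokens)))   # indices of unprocessed tokens
    mentions = []

    for action in actions:
        if action == "SHIFT":
            stack.append((buffer.pop(0),))
        elif action == "OUT":
            buffer.pop(0)
        elif action == "COMPLETE":
            mentions.append(stack.pop())
        elif action in ("REDUCE", "LEFT-REDUCE", "RIGHT-REDUCE"):
            s0 = stack.pop()
            s1 = stack.pop()
            if action == "LEFT-REDUCE":
                stack.append(s1)        # s1 stays available for other mentions
            elif action == "RIGHT-REDUCE":
                stack.append(s0)        # s0 stays available for other mentions
            # Concatenate the two spans, keeping tokens in sentence order.
            stack.append(tuple(sorted(set(s1) | set(s0))))
        else:
            raise ValueError(f"unknown action: {action}")

    if stack or buffer:
        raise ValueError("parser did not reach the end state")
    return [" ".join(tokens[i] for i in span) for span in mentions]


tokens = "have much muscle pain and fatigue .".split()
actions = ["OUT", "OUT", "SHIFT", "SHIFT", "LEFT-REDUCE", "COMPLETE",
           "OUT", "SHIFT", "REDUCE", "COMPLETE", "OUT"]
print(run_transitions(tokens, actions))  # ['muscle pain', 'muscle fatigue']
```

Because each COMPLETE emits exactly the span on top of the stack, an action sequence maps to a single set of mentions, which is the property that avoids the decoding ambiguity of tag-based representations.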

Representation of the Parser State
Given a sequence of N tokens, we first run a bidirectional LSTM (Graves et al., 2013) to derive the contextual representation of each token. Specifically, for the i-th token in the sequence, its representation can be denoted as:

$$\tilde{c}_i = \left[\overrightarrow{\mathrm{LSTM}}(t_0, \ldots, t_i);\ \overleftarrow{\mathrm{LSTM}}(t_i, \ldots, t_{N-1})\right]$$

where $t_i$ is the concatenation of the embedding of the i-th token and its character level representation learned using a CNN (Ma and Hovy, 2016). Pretrained contextual word representations have proven useful for improving various NLP tasks. Here, we can also concatenate pretrained contextual word representations from ELMo (Peters et al., 2018) with $\tilde{c}_i$, resulting in:

$$c_i = \left[\tilde{c}_i;\ \mathrm{ELMo}_i\right]$$

where $\mathrm{ELMo}_i$ is the output representation of pretrained ELMo models (frozen) for the i-th token. These token representations $c$ are directly used to represent tokens in the buffer. We also explore a variant that uses the output of pretrained BERT (Devlin et al., 2019) as token representations $c$, and fine-tune the BERT model. However, this fine-tuning approach with BERT does not perform as well as the feature extraction approach with ELMo (Peters et al., 2019).

Following the work in (Dyer et al., 2015), we use Stack-LSTM to represent spans in the stack. That is, if a token is moved from the buffer to the stack, its representation is learned using:

$$s_0 = \text{Stack-LSTM}(s_D \ldots s_1;\ c_{\mathrm{SHIFT}})$$

where D is the number of spans in the stack and $c_{\mathrm{SHIFT}}$ denotes the representation of the token being shifted. Once REDUCE related actions are applied, we use a multi-layer perceptron to learn the representation of the concatenated span. For example, the REDUCE action takes the representations of the top two spans in the stack, $s_0$ and $s_1$, and produces a new span representation:

$$\tilde{s} = W^{\top}[s_0; s_1] + b$$

where $W$ and $b$ denote the parameters for the composition function. The new span representation $\tilde{s}$ is pushed back to the stack to replace the original two spans, $s_0$ and $s_1$.

Capturing Discontinuous Dependencies
We hypothesize that the interactions between spans in the stack and tokens in the buffer are important factors in recognizing discontinuous mentions. Considering the example in Figure 3, a span in the stack (e.g., 'muscle') may need to combine with a future token in the buffer (e.g., 'fatigue'). To capture this interaction, we use multiplicative attention (Luong et al., 2015) to let the span in the stack $s_i$ learn which token in the buffer to attend to, and thus obtain a weighted sum of the representations of the tokens in the buffer $B$:

$$s^a_i = \mathrm{softmax}\left(s_i^{\top} W^a_i B\right) B$$

We use a distinct $W^a_i$ for each $s_i$.
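The attention step can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' code: the buffer is assumed to be a non-empty list of token vectors (one row per token), `W` stands in for the span-specific attention matrix, and the function name `attend` is hypothetical.

```python
import math


def attend(span, buffer_tokens, W):
    """Multiplicative attention of one stack span over the buffer tokens.

    span: list of d_s floats.
    buffer_tokens: non-empty list of n vectors, each of d_b floats.
    W: d_s x d_b matrix given as a list of rows.
    Returns (weights, attended), where attended is the weighted sum of
    the buffer token vectors.
    """
    d_b = len(W[0])
    # Project the span through W: q = span^T W, a vector of length d_b.
    q = [sum(span[r] * W[r][c] for r in range(len(span))) for c in range(d_b)]
    # Multiplicative score of each buffer token: dot product with q.
    scores = [sum(qc * tc for qc, tc in zip(q, tok)) for tok in buffer_tokens]
    # Numerically stable softmax over buffer positions.
    top = max(scores)
    exps = [math.exp(x - top) for x in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the buffer token representations.
    attended = [sum(w * tok[c] for w, tok in zip(weights, buffer_tokens))
                for c in range(d_b)]
    return weights, attended


weights, attended = attend(
    span=[1.0, 0.0],
    buffer_tokens=[[1.0, 0.0], [0.0, 1.0]],
    W=[[1.0, 0.0], [0.0, 1.0]],
)
print(weights)   # the first buffer token receives the larger weight
```

With a learned `W`, the weights indicate which future token a stack span is most likely to combine with.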

Selecting an Action
Finally, we build the parser representation as the concatenation of the representations of the top three spans from the stack ($s_0$, $s_1$, $s_2$) and their attended representations ($s^a_0$, $s^a_1$, $s^a_2$), as well as the representation of the previous action $a$, which is learned using a simple unidirectional LSTM. If there are fewer than three spans in the stack or no previous action, we use randomly initialized vectors $s_{\mathrm{empty}}$ or $a_{\mathrm{empty}}$ to replace the corresponding vector. This parser representation is used as input for the final softmax prediction layer to select the next action.
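Selecting the most likely action under hard validity constraints can be sketched as follows. The validity rules below are an assumption inferred from the action definitions (SHIFT and OUT need a non-empty buffer, COMPLETE needs one span on the stack, the REDUCE variants need two); the paper only states the empty-stack case explicitly. The names `valid_actions` and `select_action` are hypothetical.

```python
import math


def valid_actions(stack_size, buffer_size):
    """Actions applicable in a parser state (assumed rules, see lead-in)."""
    valid = []
    if buffer_size > 0:
        valid += ["SHIFT", "OUT"]
    if stack_size >= 1:
        valid.append("COMPLETE")
    if stack_size >= 2:
        valid += ["REDUCE", "LEFT-REDUCE", "RIGHT-REDUCE"]
    return valid


def select_action(scores, stack_size, buffer_size):
    """Return the most likely valid action and a softmax over valid actions.

    scores: dict mapping each action name to an unnormalized score.
    """
    valid = valid_actions(stack_size, buffer_size)
    if not valid:
        raise ValueError("end state: no action is applicable")
    top = max(scores[a] for a in valid)
    exps = {a: math.exp(scores[a] - top) for a in valid}
    total = sum(exps.values())
    probs = {a: e / total for a, e in exps.items()}
    best = max(valid, key=lambda a: probs[a])
    return best, probs


scores = {"SHIFT": 0.1, "OUT": 0.5, "COMPLETE": 2.0,
          "REDUCE": 3.0, "LEFT-REDUCE": 0.0, "RIGHT-REDUCE": 0.0}
# With an empty stack, REDUCE is ignored despite having the highest score.
print(select_action(scores, stack_size=0, buffer_size=3)[0])  # OUT
```

Renormalizing over the valid actions does not change which action wins; it only makes the reported probabilities sum to one over the permitted set.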

Data sets
Although some text annotation tools, such as BRAT (Stenetorp et al., 2012), allow discontinuous annotations, corpora annotated with a large number of discontinuous mentions are still rare. We use three data sets from the biomedical domain: CADEC (Karimi et al., 2015a), ShARe 13 (Pradhan et al., 2013) and ShARe 14 (Mowery et al., 2014). Around 10% of mentions in these three data sets are discontinuous. The descriptive statistics are listed in Table 1.

CADEC is sourced from AskaPatient (https://www.askapatient.com/), a forum where patients can discuss their experiences with medications. The entity types in CADEC include drug, Adverse Drug Event (ADE), disease and symptom. We only use ADE annotations because only the ADEs involve discontinuous annotations. This also allows us to compare our results directly against previously reported results (Metke-Jimenez and Karimi, 2016; Tang et al., 2018). ShARe 13 and 14 focus on the identification of disorder mentions in clinical notes, including discharge summaries, electrocardiogram, echocardiogram, and radiology reports (Johnson et al., 2016). A disorder mention is defined as any span of text which can be mapped to a concept in the disorder semantic group of SNOMED-CT (Cornet and de Keizer, 2008).

Although these three data sets share a similar field (the subject matter of the content being discussed), the tenor (the participants in the discourse, their relationships to each other, and their purposes) of CADEC is very different from that of the ShARe data sets. In general, laymen (i.e., in CADEC) tend to use idioms to describe their feelings, whereas professional practitioners (i.e., in ShARe) tend to use compact terms for efficient communication. This also results in different features of discontinuous mentions between these data sets, which we will discuss further in § 7.
Experimental Setup As CADEC does not have an official train-test split, we follow Metke-Jimenez and Karimi (2016) and randomly assign 70% of the posts as the training set, 15% as the development set, and the remaining posts as the test set (these splits can be downloaded from https://bit.ly/2XazEAO). The train-test splits of ShARe 13 and 14 are both from their corresponding shared task settings, except that we randomly select 10% of documents from each training set as the development set. The micro-averaged, strict-match F1 score is used to evaluate the effectiveness of the model. The trained model which is most effective on the development set, measured using the F1 score, is used for evaluation on the test set.
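The evaluation metric can be sketched as follows. This is a minimal sketch under stated assumptions: a mention is represented as a hashable pair of token indices and entity type, a prediction counts as correct only if it equals a gold mention exactly, and counts are pooled over all sentences before computing the scores. The function name `micro_strict_f1` is hypothetical.

```python
def micro_strict_f1(gold, pred):
    """Micro-averaged precision, recall and F1 under strict matching.

    gold, pred: lists with one entry per sentence; each entry is a set of
    mentions, where a mention is a (token_indices, entity_type) pair.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same sentences")
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g, p = set(g), set(p)
        tp += len(g & p)   # exact matches only
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


gold = [{((2, 3), "ADE"), ((2, 5), "ADE")}, {((0, 1), "ADE")}]
pred = [{((2, 3), "ADE"), ((2, 3, 4, 5), "ADE")}, {((0, 1), "ADE")}]
print(micro_strict_f1(gold, pred))
```

In the example, the covering span (2, 3, 4, 5) does not match the discontinuous gold mention (2, 5), so it counts as both a false positive and a missed mention.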

Baseline Models
We choose one flat NER model which is strong at recognizing continuous mentions, and two discontinuous NER models as our baseline models:

Flat model To train the flat model on our data sets, we use an off-the-shelf framework: Flair (Akbik et al., 2018), which achieves state-of-the-art performance on the CoNLL 03 data set. Recall that the flat model cannot be directly applied to data sets containing discontinuous mentions. Following the practice in (Stanovsky et al., 2017), we replace each discontinuous mention with the shortest span that fully covers it, and merge overlapping mentions into a single mention that covers both. Note that, different from (Stanovsky et al., 2017), we apply these changes only on the training set, but not on the development set and the test set.
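The conversion of training annotations for the flat model can be sketched as follows. This is an illustrative sketch only: mentions are assumed to be collections of token indices, overlap is judged on the covering spans, and the function name `to_flat` is hypothetical.

```python
def to_flat(mentions):
    """Convert mentions to non-overlapping continuous spans.

    mentions: iterable of collections of token indices.
    Returns a sorted list of (start, end) pairs with inclusive ends.
    """
    # Replace every mention with the shortest span that fully covers it.
    intervals = sorted((min(m), max(m)) for m in mentions)
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous span: merge into one covering span.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# 'muscle pain' (2, 3) and 'muscle fatigue' (2, 5) become one span covering
# 'muscle pain and fatigue'.
print(to_flat([(2, 3), (2, 5)]))  # [(2, 5)]
```

The resulting spans can be tagged with a standard BIO schema, at the cost of no longer distinguishing the individual overlapping mentions.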

BIO extension model
The original implementation in (Metke-Jimenez and Karimi, 2016) used a CRF model with manually designed features. We report their results on CADEC in Table 2 and reimplement a BiLSTM-CRF-ELMo model using their tag schema (denoted as 'BIO Extension' in Table 2).
Graph-based model The original paper of (Muis and Lu, 2016) only reported the evaluation results on sentences which contain at least one discontinuous mention. We use their implementation to train the model and report evaluation results on the whole test set (denoted as 'Graph' in Table 2). We argue that it is important to see how a discontinuous NER model works not only on the discontinuous mentions but also on all the mentions, especially since, in real data sets, the ratio of discontinuous mentions cannot be known a priori.
We do not choose the model proposed in (Wang and Lu, 2019) as the baseline model, because it is based on a strong assumption about the ratio of discontinuous mentions. Wang and Lu (2019) train and evaluate their model on sentences that contain at least one discontinuous mention. Our early experiments show that the effectiveness of their model strongly depends on this assumption.
In contrast, we train and evaluate our model in a more practical setting where the number of continuous mentions is much larger than that of discontinuous mentions.

Experimental Results
When evaluated on the whole test set, our model outperforms the three baseline models, as well as previously reported results in the literature, in terms of recall and F1 scores (Table 2).
The graph-based model achieves the highest precision, but with substantially lower recall, and therefore obtains the lowest F1 scores. In contrast, our model improves recall over the flat and BIO extension models as well as previously reported results, without sacrificing precision. This results in more balanced precision and recall. Improved recall is especially encouraging for our motivating pharmacovigilance and medical record summarization applications, where recall is at least as important as precision.
Effectiveness on recognizing discontinuous mentions Recall that only around 10% of mentions in these three data sets are discontinuous. To evaluate the effectiveness of our proposed model on recognizing discontinuous mentions, we follow the evaluation approach in (Muis and Lu, 2016), where we construct a subset of the test set in which only sentences with at least one discontinuous mention are included (left part of Table 3). We also report the evaluation results when only discontinuous mentions are considered (right part of Table 3). Note that sentences in the former setting usually contain continuous mentions as well, including those involved in overlapping structure (e.g., 'muscle pain' in the sentence 'muscle pain and fatigue'). Therefore, the flat model, which cannot predict any discontinuous mentions, still achieves 38% F1 on average when evaluated on these sentences with at least one discontinuous mention, but 0% F1 when evaluated on discontinuous mentions only.
Our model again achieves the highest F1 and recall in all three data sets under both settings. The comparison between these two evaluation results also shows the necessity of comprehensive evaluation settings. The BIO E. model outperforms the graph-based model in terms of F1 score on CADEC, when evaluated on sentences with discontinuous mentions. However, it achieves only 1.8 F1 when evaluated on discontinuous mentions only. The main reason is that most of the discontinuous mentions in CADEC are involved in overlapping structure (88%, cf. Table 1), and the BIO E. model is better than the graph-based model at recognizing the continuous mentions that take part in these structures. On ShARe 13 and 14, where the portion of discontinuous mentions involved in overlapping is much smaller than on CADEC, the graph-based model clearly outperforms the BIO E. model in both evaluation settings.

Table 2: Evaluation results on the whole test set in terms of precision, recall and F1 score. The original ShARe 14 task focuses on template filling of disorder attributes: that is, given a disorder mention, recognize the attribute from its context. In this work, we use its mention annotations and frame the task as a discontinuous NER task.

Analysis
We start our analysis by characterizing discontinuous mentions from the three data sets. Then we measure the behavior of our model and the two discontinuous NER models on the development sets based on the characteristics identified, and attempt to draw conclusions from these measurements.

Characteristics of Discontinuous Mentions
Recall that discontinuous mentions usually represent compositional concepts that consist of multiple components. Therefore, discontinuous mentions are usually longer than continuous mentions (Table 1). In addition, intervals between components make the total length of the span involved even longer. Previous work shows that flat NER performance degrades when applied to long mentions (Augenstein et al., 2017; Xu et al., 2017).

Another characteristic of discontinuous mentions is that they usually overlap (cf. § 1). From this perspective, we can categorize discontinuous mentions into four categories:
• No overlap: in such cases, the discontinuous mention can be interrupted by severity indicators (e.g., 'is mildly' in the sentence 'left atrium is mildly dilated'), prepositions (e.g., 'on my' in the sentence '...rough on my stomach...') and so on. This category accounts for half of discontinuous mentions in the ShARe data sets but only 12% in CADEC (Table 1).
• Left overlap: the discontinuous mention shares one component with other mentions, and the shared component is at the beginning of the discontinuous mention. This is usually accompanied by a coordination structure (e.g., the shared component 'muscle' in 'muscle pain and fatigue'). Conjunctions (e.g., 'and', 'or') are clear indicators of the coordination structure. However, clinical notes are usually written by practitioners under time pressure, who often use commas or slashes rather than conjunctions. This category accounts for more than half of discontinuous mentions in CADEC and one third in ShARe.
• Right overlap: similar to left overlap, although the shared component is at the end. For example, 'hip/leg/foot pain' contains three mentions that share 'pain'.
• Multi-overlap: the discontinuous mention shares multiple components with the others, which usually forms crossing compositions.
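The four categories above can be sketched as a small classifier. This is a hedged illustration, not the procedure used to build Table 1: a component is assumed to be a maximal contiguous run of token indices, and a component is treated as shared if any of its tokens also occurs in another mention of the same sentence. The names `components` and `overlap_category` are hypothetical.

```python
def components(mention):
    """Split a mention (token indices) into maximal contiguous runs."""
    idx = sorted(set(mention))
    runs = [[idx[0]]]
    for i in idx[1:]:
        if i == runs[-1][-1] + 1:
            runs[-1].append(i)
        else:
            runs.append([i])
    return runs


def overlap_category(mention, others):
    """Categorize a discontinuous mention given the sentence's other mentions."""
    used = set()
    for other in others:
        used |= set(other)
    runs = components(mention)
    # Positions of the components that share a token with another mention.
    shared = [k for k, run in enumerate(runs) if used & set(run)]
    if not shared:
        return "no overlap"
    if len(shared) > 1:
        return "multi-overlap"
    if shared[0] == 0:
        return "left overlap"
    if shared[0] == len(runs) - 1:
        return "right overlap"
    # A single shared middle component is not covered by the description.
    return "unclassified"


# muscle(2) pain(3) and(4) fatigue(5): 'muscle fatigue' shares 'muscle'.
print(overlap_category((2, 5), [(2, 3)]))  # left overlap
```

For 'Joint and Muscle Pain / Stiffness', every discontinuous mention shares both of its components with other mentions, so the sketch assigns it to the multi-overlap category.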

Impact of Overlapping Structure
A previous study shows that the intervals between components can be problematic for coordination boundary detection (Ficler and Goldberg, 2016). Conversely, we want to observe whether the overlapping structure helps or hinders discontinuous entity recognition. We categorize discontinuous mentions into the subsets described in § 7.1, and measure the effectiveness of different discontinuous NER models on each category. From Table 4, we find that our model achieves better results on discontinuous mentions belonging to the 'No overlap' category on ShARe 13 and 14, and the 'Left overlap' category on CADEC and ShARe 14. Note that the 'No overlap' category accounts for half of discontinuous mentions in ShARe 13 and 14, whereas 'Left overlap' accounts for half in CADEC (Table 1). The graph-based model achieves better results on the 'Right overlap' category. On the 'Multi-overlap' category, no model is effective, which emphasizes the challenges of dealing with this syntactic phenomenon. We note, however, that the portion of discontinuous mentions belonging to this category is very small in all three data sets.
Although our model achieves better results on the 'No overlap' category on ShARe 13 and 14, it does not correctly predict any discontinuous mention belonging to this category on CADEC. The ineffectiveness of our model, as well as of the other discontinuous NER models, on the CADEC 'No overlap' category can be attributed to two reasons: 1) the number of discontinuous mentions belonging to this category in CADEC is small (around 12%), rendering the learning process more difficult; and 2) the gold annotations belonging to this category are inconsistent from a linguistic perspective. For example, severity indicators are sometimes, but not often, annotated as the interval of the discontinuous mention. Note that this may be reasonable from a medical perspective, as some symptoms are roughly grouped together no matter their severity, whereas some symptoms are linked to different concepts based on their severity.

Impact of Mention and Interval Length
We conduct experiments to measure the ability of different models to recall mentions of different lengths, and to observe the impact of interval lengths. We find that the recall of all models generally decreases as mention length increases (Figure 4 (a-c)), which is similar to previous observations in the literature on flat mentions. However, the impact of interval length is not straightforward. Mentions with very short interval lengths are as difficult to recognize as those with very long interval lengths (Figure 4 (d-f)).
On CADEC, discontinuous mentions with an interval length of 2 are the easiest to recognize (Figure 4 (d)), whereas those with an interval length of 3 are the easiest on ShARe 13 and 14. We hypothesize this also relates to annotation inconsistency, because very short intervals may be overlooked by annotators. In terms of model comparison, our model achieves the highest recall in most settings. This demonstrates that our model is effective at recognizing both continuous and discontinuous mentions of various lengths. In contrast, the BIO E. model is only strong at recalling continuous mentions (outperforming the graph-based model), but fails on discontinuous mentions (interval lengths > 0).
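A recall breakdown of this kind can be sketched as follows. The definitions are assumptions made for illustration: mention length is taken to be the number of tokens in the mention, and interval length the number of tokens inside the covering span that do not belong to the mention. The function names are hypothetical.

```python
from collections import defaultdict


def mention_length(mention):
    """Number of tokens that belong to the mention."""
    return len(set(mention))


def interval_length(mention):
    """Number of gap tokens between the mention's first and last token."""
    idx = sorted(set(mention))
    return (idx[-1] - idx[0] + 1) - len(idx)


def recall_by(gold, pred, key):
    """Recall of gold mentions grouped by key(mention).

    gold: list of mentions (tuples of token indices); pred: set of mentions.
    """
    hit, total = defaultdict(int), defaultdict(int)
    for m in gold:
        k = key(m)
        total[k] += 1
        hit[k] += m in pred
    return {k: hit[k] / total[k] for k in total}


# 'left atrium dilated' in 'The left atrium is mildly dilated .'
print(interval_length((1, 2, 5)))  # 2, the gap 'is mildly'
```

Calling `recall_by(gold, pred, interval_length)` or `recall_by(gold, pred, mention_length)` gives one recall value per bucket, which is the kind of quantity plotted per model in Figure 4.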

Example Predictions
We find that previous models often fail to identify discontinuous mentions that involve long and overlapping spans. For example, the sentence 'Severe joint pain in the shoulders and knees.' contains two mentions: 'Severe joint pain in the shoulders' and 'Severe joint pain in the knees'. The graph-based model does not identify any mention from this sentence, resulting in a low recall. The BIO extension model predicts most of these tags (8 out of 9) correctly, but fails to decode them into correct mentions (it predicts 'Severe joint pain in the', resulting in a false positive, while it misses 'Severe joint pain in the shoulders').
In contrast, our model correctly identifies both of these two mentions.
No model can fully recognize mentions which form crossing compositions. For example, the sentence 'Joint and Muscle Pain / Stiffness' contains four mentions: 'Joint Pain', 'Joint Stiffness', 'Muscle Stiffness' and 'Muscle Pain', all of which share multiple components with the others. Our model correctly predicts 'Joint Pain' and 'Muscle Pain', but it mistakenly predicts 'Stiffness' itself as a mention.

Summary
We propose a simple, effective transition-based model that can recognize discontinuous mentions without sacrificing the accuracy on continuous mentions. We evaluate our model on three biomedical data sets with a substantial number of discontinuous mentions. Comparing against two existing discontinuous NER models, our model is more effective, especially in terms of recall.