Combining Spans into Entities: A Neural Two-Stage Approach for Recognizing Discontiguous Entities

In medical documents, it is possible that an entity of interest not only contains a discontiguous sequence of words but also overlaps with another entity. Entities with such structures are intrinsically hard to recognize due to the large space of possible entity combinations. In this work, we propose a neural two-stage approach to recognizing discontiguous and overlapping entities by decomposing the problem into two subtasks: 1) it first detects all the overlapping spans that either form entities on their own or are segments of discontiguous entities, based on the representation of segmental hypergraphs; 2) it then learns to combine these segments into discontiguous entities with a classifier, which filters out incorrect combinations of segments. Two neural components are designed for these subtasks respectively, and they are learned jointly using a shared encoder for text. Our model achieves state-of-the-art performance on a standard dataset, even in the absence of the external features that previous methods used.

The underlying assumptions behind most NER systems are that an entity should contain a contiguous sequence of words and that entities should not overlap with each other. However, such assumptions do not always hold in practice. First, entities or mentions with overlapping structures frequently exist in news (Doddington et al., 2004) and biomedical documents. Second, entities can be discontiguous, especially in clinical texts (Pradhan et al., 2014b). For example, Figure 3 shows three entities where two of them are discontiguous ("laceration ... esophagus" and "stomach ... lac"), and the second discontiguous entity also overlaps with another entity ("blood in stomach").

Figure 1: Entities are highlighted with colored underlines. "laceration ... esophagus" and "stomach ... lac" contain discontiguous sequences of words, and the latter also overlaps with another entity, "blood in stomach".
Such discontiguous entities are intrinsically hard to recognize considering the large search space of possible combinations of entities that have discontiguous and overlapping structures. Muis and Lu (2016a) proposed a hypergraph-based representation to compactly encode discontiguous entities. However, this representation suffers from an ambiguity issue during decoding: one particular hypergraph corresponds to multiple interpretations of entity combinations. As a result, it resorted to heuristics to deal with this issue.
Motivated by their work, we take a novel approach to resolve the ambiguity issue. Our core observation is that, although it is hard to exactly encode the exponential space of all possible discontiguous entities, recent work on extracting overlapping structures can be employed to efficiently explore the space of all the span combinations of discontiguous entities. Based on this observation, we decompose the problem of recognizing discontiguous entities into two subtasks: 1) segment extraction: learning to detect all (potentially overlapping) spans that either form entities on their own or appear as parts of a discontiguous entity; 2) segment merging: learning to form entities by merging certain spans into discontiguous entities.
Our contributions are summarized as follows:
• By decomposing the problem of extracting discontiguous entities into two subtasks, we propose a two-stage approach that does not have the ambiguity issue.
• Under this decomposition, we design two neural components for these two subtasks respectively. We further show that the joint learning setting where the two components use a shared text encoder is beneficial.
• Empirical results show that our system achieves a significant improvement compared with previous methods, even in the absence of external features that previous methods used.

Though we only focus on discontiguous entity recognition in this work, our model may find applications in other tasks that involve discontiguous structures, such as detecting gappy multiword expressions (Schneider et al., 2014).

Related Work
The task of extracting overlapping entities has long been studied (Zhou, 2006; McDonald et al., 2005; Alex et al., 2007; Finkel and Manning, 2009; Lu and Roth, 2015; Muis and Lu, 2017). As neural models (Collobert et al., 2011; Lample et al., 2016; Huang et al., 2015; Chiu and Nichols, 2016; Ma and Hovy, 2016) have proven effective for NER, several neural systems have recently been proposed to handle entities with overlapping structures (Ju et al., 2018; Katiyar and Cardie, 2018; Sohrab and Miwa, 2018; Straková et al., 2019; Lin et al., 2019; Fisher and Vlachos, 2019). Our system is based on the model of neural segmental hypergraphs, which encodes all the possible combinations of overlapping entities using a compact hypergraph representation without ambiguity. Note that other systems for extracting overlapping structures can also fit into our two-stage system.
For discontiguous and overlapping entity recognition, Tang et al. (2013) and Zhang et al. (2014) extended the BIO tagging scheme to encode such complex structures so that a traditional linear-chain CRF (Lafferty et al., 2001) can be employed. However, such a model suffers greatly from ambiguity during decoding due to the use of the extended tagset. Muis and Lu (2016a) proposed a hypergraph-based representation to reduce the level of ambiguity. Essentially, these systems trade expressiveness for efficiency: they inexactly encode the whole space of discontiguous entities, with ambiguity, for training, and then rely on heuristics to handle the ambiguity during decoding. Considering that it is intrinsically hard to exactly identify discontiguous entities in one stage using a structured model, our work decomposes the task into two subtasks to resolve the ambiguity issue.
This task is also related to joint entity and relation extraction (Kate and Mooney, 2010; Li and Ji, 2014; Miwa and Sasaki, 2014), where the discontiguous entities can be viewed as relation links between segments. The major difference is that discontiguous entities require explicitly modeling overlapping entities and linking multiple segments.

Model
Our goal is to extract a set of entities that may have overlapping and discontiguous structures given a natural language sentence. We use $x = x_1 \ldots x_{|x|}$ to denote a sentence, and use $y = \{[b_{i:j} \ldots b_{m:n}]^k\}$ to denote a set of discontiguous entities where each entity of type $k$ contains a list of spans, e.g., $b_{i:j}$ and $b_{m:n}$, with subscripts indicating the starting and ending positions of the span. Hence, this task can be viewed as extracting and labelling a sequence of spans as an entity.
Our two-stage approach first extracts spans of interest like $b_{i:j}$, which are parts of discontiguous entities. Then it merges these extracted spans into discontiguous entities. In the more general setting where discontiguous entities are typed, our approach is designed to jointly extract and label the spans at the first stage, then only merge the spans of the same type at the second stage. We call the intermediate typed span $b^k_{i:j}$ a segment in the rest of the paper. Formally, our model aims at maximizing the conditional probability $p(y|x)$, which is decomposed as:

$$p(y \mid x) = p(s \mid x)\, p(y \mid s, x) \tag{1}$$

where $s = \{b^k_{i:j}\}$ denotes the set of segments that leads to $y$ through a specific combination (footnote 4). That is, we divide the problem of extracting discontiguous entities into two subtasks, namely segment extraction and segment merging.

Segment Extraction
The entity segments $s$ of interest in a given sentence could also overlap with each other. For example, in Figure 3 the entity "blood in stomach" contains another segment "stomach". To make our model capable of extracting such overlapping segment combinations, we employ the model of neural segmental hypergraphs, which uses a hypergraph-based representation to encode all the possible combinations of segments without ambiguity. Specifically, the segmental hypergraphs adopt a log-linear approach to model the conditional probability of each segment combination for a given sentence:

$$p(s \mid x) = \frac{\exp f(x, s)}{\sum_{s'} \exp f(x, s')} \tag{2}$$

where $f(x, s)$ is the score function for any pair of input sentence $x$ and output segment combination $s$.
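In segmental hypergraphs, each segment combination $s$ corresponds to a hyperpath. Following the neural segmental hypergraph model, the score for a hyperpath is the sum of the scores for each hyperedge, which are based on the word-level and span-level representations through LSTM (Graves and Schmidhuber, 2005):

$$\mathbf{h}^w_i = \mathrm{LSTM}(\mathbf{x}_1, \ldots, \mathbf{x}_{|x|}) \tag{3}$$

$$\mathbf{h}^s_{i:j} = \mathrm{LSTM}(\mathbf{h}^w_i, \ldots, \mathbf{h}^w_j) \tag{4}$$

where $\mathbf{x}_k$ is the corresponding word embedding for word $x_k$, $\mathbf{h}^w_i$ denotes the representation for the $i$-th word, and $\mathbf{h}^s_{i:j}$ denotes the representation for the span from the $i$-th to the $j$-th word.

Footnote 4: We note that each $y$ corresponds to one unique $s$.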
On top of the segmental hypergraph representation, the partition function which is the denominator of Equation 2 can be computed using dynamic programming. The inference algorithm has a quadratic time complexity in the number of words, which can be further reduced to linear time complexity if we introduce the maximal length c of a segment. We regard c as a hyperparameter.
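The effect of the maximal segment length $c$ on the size of the search space can be illustrated with a small sketch. It only counts candidate spans and is not the hypergraph inference algorithm itself; the function name `candidate_spans` is hypothetical and introduced here for illustration.

```python
def candidate_spans(n_words, max_len=None):
    """Return all (start, end) spans, 1-indexed and inclusive.

    Illustrative only: with max_len=None every span is allowed, so the
    count grows quadratically with n_words; with a cap c, each start
    position has at most c spans, so the count grows linearly.
    """
    spans = []
    for start in range(1, n_words + 1):
        if max_len is None:
            longest_end = n_words
        else:
            longest_end = min(n_words, start + max_len - 1)
        for end in range(start, longest_end + 1):
            spans.append((start, end))
    return spans


if __name__ == "__main__":
    n = 10
    print(len(candidate_spans(n)))             # n * (n + 1) / 2 spans
    print(len(candidate_spans(n, max_len=6)))  # at most n * 6 spans
```

For a fixed cap, doubling the sentence length roughly doubles the number of spans, whereas without the cap it roughly quadruples.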

Segment Merging
Given a set of segments, our next subtask is to merge them into entities. First, we enumerate all the valid segment combinations, denoted as $E$, based on the assumption that the segments in the same entity should have the same type and not overlap with each other. Our model then independently decides whether each valid segment combination forms an entity. We call these valid segment combinations entity candidates. For brevity, let us use $t^k$ to denote an entity candidate $[b_{i:j} \ldots b_{m:n}]^k$ where each segment like $b^k_{i:j}$ belongs to $s$. Formally, given segments $s$, the probability of generating entities $y$ can be represented as:

$$p(y \mid s, x) = \prod_{t^k \in E} p(t^k \in y)^{\mathbb{1}(t^k \in y)} \, \bigl(1 - p(t^k \in y)\bigr)^{1 - \mathbb{1}(t^k \in y)} \tag{5}$$

where $\mathbb{1}$ is an indicator function. We use a binary classifier to model $p(t^k \in y)$.
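The enumeration of entity candidates described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: segments are represented as hypothetical `(start, end, type)` tuples with inclusive word positions, and the cap of three segments per candidate follows the constraint used for the dataset in the experiments.

```python
from itertools import combinations
from typing import List, Tuple

Segment = Tuple[int, int, str]  # (start, end, type), inclusive positions


def overlaps(a: Segment, b: Segment) -> bool:
    """Two segments overlap if their word ranges share a position."""
    return a[0] <= b[1] and b[0] <= a[1]


def entity_candidates(segments: List[Segment], max_segments: int = 3):
    """Enumerate valid combinations: same type, pairwise non-overlapping."""
    ordered = sorted(set(segments))
    candidates = []
    for size in range(1, max_segments + 1):
        for combo in combinations(ordered, size):
            same_type = len({seg[2] for seg in combo}) == 1
            disjoint = all(
                not overlaps(a, b) for a, b in combinations(combo, 2)
            )
            if same_type and disjoint:
                candidates.append(combo)
    return candidates


if __name__ == "__main__":
    # "He had blood in his mouth and on his tongue"
    # segments: 'blood', 'blood in his mouth', 'on his tongue'
    segs = [(3, 3, "Disorder"), (3, 6, "Disorder"), (8, 10, "Disorder")]
    for cand in entity_candidates(segs):
        print(cand)
```

Single segments are kept as candidates because a span may form an entity on its own; the pair 'blood' + 'blood in his mouth' is rejected because the two segments overlap.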
To capture the interactions between segments within the same combination, we employ yet another LSTM on top of segments as follows:

$$\mathbf{h}^e_{t^k} = \mathrm{LSTM}(\mathbf{h}^s_{i:j}, \ldots, \mathbf{h}^s_{m:n}) \tag{6}$$

where $\mathbf{h}^e_{t^k}$ denotes the representation of the segment combination $t^k$, which then serves as a feature vector for a binary classifier to determine whether it is an entity. Note that we reuse the span representation from Equation 4, meaning that the encoders for words and spans are shared between segment extraction and merging.
The binary classifier for each $t^k$ in Equation 5 is computed as:

$$p(t^k \in y) = \mathrm{sigmoid}\bigl(W \cdot \mathrm{ReLU}(\mathbf{h}^e_{t^k}) + b\bigr) \tag{7}$$

where we use a rectified linear unit (ReLU) (Glorot et al., 2011) and a linear layer, parameterized by $W$ and $b$, to map the representation from Equation 6 to a scalar score. This score is normalized into a distribution by the sigmoid function. In the joint model, we stack three separate LSTMs to encode text at different levels, from words to spans and then to discontiguous entities. Intuitively, the word-level and span-level LSTMs try to capture the lower-level information for segment extraction, while the entity-level LSTM captures the higher-level information for segment merging.
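The binary classifier described above (ReLU, then a linear layer, then a sigmoid) can be sketched in plain Python. This is an illustrative sketch only: the function names and the toy vectors are assumptions, and in the actual model the input would be the entity-level LSTM representation rather than a hand-written list.

```python
import math
from typing import Sequence


def relu(vector: Sequence[float]):
    """Element-wise rectified linear unit."""
    return [max(0.0, value) for value in vector]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def candidate_probability(h_e: Sequence[float],
                          w: Sequence[float],
                          b: float) -> float:
    """p(t in y) = sigmoid(w . ReLU(h_e) + b) for one entity candidate."""
    activated = relu(h_e)
    score = sum(wi * hi for wi, hi in zip(w, activated)) + b
    return sigmoid(score)


if __name__ == "__main__":
    h_e = [1.0, -2.0, 0.5]   # toy candidate representation (assumed)
    w = [0.5, 1.0, -1.0]     # toy weights (assumed)
    print(candidate_probability(h_e, w, b=0.0))
```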

Learning and Decoding
For a dataset $D$ consisting of sentence-entities pairs $(x, y)$, our objective is to minimize the negative log-likelihood as follows:

$$\mathcal{L}(\theta) = -\sum_{(x^i, y^i) \in D} \log p(y^i \mid x^i) + \lambda \lVert \theta \rVert_2^2 \tag{8}$$

where $\theta$ denotes model parameters and $\lambda$ is the $\ell_2$ coefficient. $p(y^i \mid x^i)$ is computed by $p(s^i \mid x^i)\, p(y^i \mid s^i, x^i)$ where $s^i$ is inferred from $y^i$. During the decoding stage, the system first predicts the most probable segments from the neural segmental hypergraph by $\hat{s} = \arg\max_{s'} p(s' \mid x)$. Then it feeds the prediction to the next stage of merging segments and outputs the discontiguous entities by $\hat{y} = \arg\max_{y'} p(y' \mid \hat{s}, x)$.
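The per-sentence training term and the second decoding step can be illustrated with a short sketch. It assumes the segment-extraction log-probability and the candidate probabilities are already available (here as a hypothetical float and a hypothetical dictionary), omits the regularization term, and assumes every probability lies strictly between 0 and 1. Because each candidate is decided independently, the arg max over entity sets reduces to keeping every candidate whose probability exceeds 0.5.

```python
import math
from typing import Dict, Hashable, Set


def merge_log_prob(cand_probs: Dict[Hashable, float],
                   gold: Set[Hashable]) -> float:
    """log p(y | s, x): one independent Bernoulli decision per candidate."""
    total = 0.0
    for cand, prob in cand_probs.items():
        total += math.log(prob) if cand in gold else math.log(1.0 - prob)
    return total


def negative_log_likelihood(log_p_segments: float,
                            cand_probs: Dict[Hashable, float],
                            gold: Set[Hashable]) -> float:
    """-log p(y | x) = -(log p(s | x) + log p(y | s, x)), one sentence."""
    return -(log_p_segments + merge_log_prob(cand_probs, gold))


def decode_entities(cand_probs: Dict[Hashable, float],
                    threshold: float = 0.5) -> Set[Hashable]:
    """Most probable entity set under independent candidate decisions."""
    return {cand for cand, prob in cand_probs.items() if prob > threshold}


if __name__ == "__main__":
    probs = {"blood in his mouth": 0.9, "blood": 0.2}  # toy values
    gold = {"blood in his mouth"}
    print(negative_log_likelihood(math.log(0.5), probs, gold))
    print(decode_entities(probs))
```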

Setup
Data We evaluated our model on the task of recognizing mentions in clinical text from the ShARe/CLEF eHealth Evaluation Lab (SHEL) 2013 (Suominen et al., 2013) and SemEval-2014 (Pradhan et al., 2014a). The task is defined as extracting mentions of disorders from clinical documents according to the Unified Medical Language System (UMLS). The original dataset only has a small percentage of discontiguous entities, making it unsuitable for comparing the effectiveness of different models when handling discontiguous entities. Following Muis and Lu (2016a), we use a subset of the original data where each sentence contains at least one discontiguous entity.
We split the dataset according to the setting of SemEval 2014. Statistics are shown in Table 1. In this subset, 53.6% of entities are discontiguous. Overlapping entities also frequently appear. Since an entity has three segments at most, we make the constraint that an entity candidate has no more than three segments during segment merging.
Note that all entities in this dataset have the same type, disorder. Our model is intrinsically able to handle discontiguous entities of multiple types. Recall that segments are typed as $b^k_{i:j}$ during segment extraction, and only segments of the same type can be merged into an entity $t^k$, where $k$ indicates the entity type. To assess its ability to deal with multiple entity types, we conducted a further analysis (see section 4.2).
Hyperparameters We use pretrained word embeddings which are trained on the PubMed corpus. A dropout layer (Srivastava et al., 2014) is used after each word is mapped to its embedding. The dropout rate and the number of hidden units in LSTMs are tuned based on the performance on the development set. We set the maximal length of a segment to be 6 during segment extraction. Our model is trained with Adam (Kingma and Ba, 2014).

Baselines The first baseline we consider is to extend the traditional BIO tagging scheme to seven tags following Tang et al. (2013). With this tagging scheme, each word in a sentence is assigned a label. Then a linear-chain CRF is built to model the sequence labelling process. The next baseline is a hypergraph-based method by Muis and Lu (2016a). It encodes each entity combination into a directed graph based on six types of nodes, each of which has its specific semantics.
Since these baselines are both ambiguous, heuristics are required during decoding. Following Muis and Lu (2016a), we explored two heuristics: given a model's ambiguous output, either a tag sequence or a hypergraph, the "enough" heuristic finds the minimal set of entities that corresponds to it, while "all" decodes the union of all the possible sets of entities. Please refer to Muis and Lu (2016a) for details. We also describe them in the Appendix for self-containedness. We compare our approach to these baselines in two settings. In the non-neural setting, we compare models using the same set of handcrafted features, including external features from a POS tagger and Brown clusters, following Muis and Lu (2016a). In the neural setting, we implement a linear-chain CRF model using the same neural encoder. We aim to test whether our model performs better in both settings. Note that none of the neural models in our experiments leverage any handcrafted features.

Results and Analysis
The main results are listed in Table 2. In both non-neural and neural settings, our model achieves the best result in terms of $F_1$ compared with the baselines, revealing the effectiveness of our methodology of decomposing the task into two stages. Our neural model achieves the best performance even without using any external handcrafted features.
We also assess the performance when our model uses separate encoders for segment extraction and merging. From the results, we observe that the setting of using a shared encoder is very beneficial for our two-stage system.
Compared with non-neural models, neural models are better in terms of $F_1$, both for the CRF and for our models. The gain mostly comes from the ability to recall more entities. Handcrafted features in non-neural models lead to high precision but do not seem to be general enough to recall most entities.
The "enough" heuristic works better than "all" in most cases. Hence we use it for evaluating models' ability in handling multiple entity types.
Handling Multiple Entity Types To assess the effectiveness of handling entities of multiple types, we further categorize each entity into three types based on its Concept Unique Identifier (CUI), following Muis and Lu (2016a). In this setting, segments are jointly extracted and labelled using these three categories during segment extraction. During segment merging, an entity candidate can only contain segments of the same type. The results are listed in Table 3. Our neural model again achieves the best performance among all models in terms of $F_1$. Compared with the neural CRF, our model is significantly better at recalling entities. Similar to the previous observation, the neural encoder consistently boosts the performance of the CRF by recalling more entities, compared with its non-neural counterpart.

Conclusion and Future Work
In this work, we propose a neural two-stage approach for recognizing discontiguous entities, which learns to extract and merge segments jointly without suffering from the ambiguity issue. Empirically, it achieves a significant improvement compared with previous methods that rely heavily on handcrafted features.
During training, the classifier for merging segments is only exposed to correct segments, making it unable to recover from segment extraction errors during decoding. This issue is similar to exposure bias (Wiseman and Rush, 2016), and it might be beneficial to expose the segment merging classifier to incorrect segments during training. We leave this for future work.

A Segment Extraction
Neural segmental hypergraphs were proposed for modeling overlapping structures in entity mentions. We directly adopt this approach to model segments with overlapping structures. Note that our segments also hold the information of entity type, so the resulting system for segment extraction can also be viewed as performing sub-mention recognition. Next, we illustrate how the segmental hypergraph encodes overlapping segments with a concrete example.
For brevity, we only show an example that is annotated with one entity type; it can be trivially extended to the case of multiple entity types.
Given a phrase "He had blood in his mouth and on his tongue", there exist two disorder mentions: 'blood in his mouth' and 'blood ... on his tongue' where the second mention has a discontiguous sequence of words. Our two-stage approach first extracts segments that lead to these two mentions. In this example, the segments consist of 'blood', 'blood in his mouth' and 'on his tongue'. We observe that the first two segments overlap with each other.
The segmental hypergraph encodes this segment combination based on five types of nodes:
• $A_i$ encodes all segments that start with the $i$-th or a later word.
• $E_i$ encodes all segments that start exactly with the $i$-th word.
• $T^k_i$ represents all segments of type $k$ starting with the $i$-th word.
• $I^k_{i,j}$ represents all segments of type $k$ that contain the $j$-th word and start with the $i$-th word.
• $X$ marks the end of a segment.
Each segment can be expressed in terms of these five nodes and corresponds with a path in the segmental hypergraph. As a result, each segment combination corresponds with a hyperpath where hyperedges are designated to connect multiple nodes so as to model overlapping segments. Figure 2 shows such a hyperpath for the segment combination in our example phrase. Since we only have one entity type in this example, we eliminate the superscript k in T and I nodes that indicates the information of entity type.
Starting from the third word 'blood', there exist two segments 'blood' and 'blood in his mouth'. The brown hyperedge with the parent node being $I_{3,3}$ is responsible for connecting these two overlapping segments. This hyperedge means that there exists a segment that ends at the third word (the link from $I_{3,3}$ to $X$) and there also exists a segment that continues to the next word (the link from $I_{3,3}$ to $I_{3,4}$). The segment 'on his tongue' is directly mapped to the path from $T_8$ to $X$.
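The mapping from a single segment to its path of $T$, $I$ and $X$ nodes can be sketched as below. This is a simplified illustration for the single-type case: the function name `segment_path` and the string node labels are hypothetical, the $A$ and $E$ nodes are omitted, and the assumption that $T_i$ links first to $I_{i,i}$ is inferred from the node definitions rather than stated explicitly in the text.

```python
def segment_path(start: int, end: int):
    """Node sequence for one segment covering words start..end (1-indexed).

    Single entity type, so the type superscript on T and I is dropped.
    Assumed layout: T_start -> I_start,start -> ... -> I_start,end -> X.
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    path = [f"T_{start}"]
    path += [f"I_{start},{j}" for j in range(start, end + 1)]
    path.append("X")
    return path


if __name__ == "__main__":
    # "He had blood in his mouth and on his tongue"
    print(segment_path(3, 3))    # 'blood'
    print(segment_path(3, 6))    # 'blood in his mouth'
    print(segment_path(8, 10))   # 'on his tongue'
```

The paths for 'blood' and 'blood in his mouth' share the prefix up to $I_{3,3}$ and then diverge, one going to $X$ and the other to $I_{3,4}$, which is the branching that the hyperedge at $I_{3,3}$ captures.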
The score for each hyperpath is the sum of the scores that are computed over each hyperedge.
Since $T$ nodes encode word-level information and $I$ nodes encode span-level information, two LSTMs are employed to capture the interactions at the word level and the span level respectively. We use the original implementation, which is publicly available.

B Heuristics for Handling Ambiguity
This section explains the two heuristics, "enough" and "all", that are applied when ambiguous tag sequences occur. We use the extended BIO tagging scheme (Tang et al., 2013; Muis and Lu, 2016a) as an example.
To encode the three discontiguous entities in Figure 3, this tagset has seven tags:
• B/I: Beginning and Inside of contiguous entities.
• BH/IH: Beginning and Inside of head, where head refers to segments shared by multiple discontiguous entities.
• BD/ID: Beginning and Inside of body, where body refers to segments that are not shared across entities.
• O: Outside of entities.
The resulting tag sequence is shown in Figure 4. Since this tagging scheme cannot model the correspondence between different tags, tag sequences are very likely to have multiple interpretations. For instance, it is not clear whether "laceration" should be combined with "esophagus" or with "stomach".
Figure 3: Entities are highlighted with colored underlines. "laceration ... esophagus" and "stomach ... lac" contain discontiguous sequences of words, and the latter also overlaps with another entity, "blood in stomach".

"esophagus ... stomach". We impose further constraints to generate only one combination following Muis and Lu (2016b).

C Hyperparameters
The hyperparameters used in our neural two-stage model are listed in Table 4. Since the size of our dataset is relatively small, dropout is crucial to prevent overfitting, considering that the pre-trained word embeddings have a dimension of 200. The length of most segments is not greater than 6, so we set the maximal length $c$ to be 6 to improve the efficiency of segment extraction. We also tried to incorporate a character-level component (Lample et al., 2016) to capture morphological and orthographic information. However, it does not have a significant effect on the performance in terms of $F_1$.

Hyperparameter               Value
word embedding dim           200
LSTM(word) hidden size       128
LSTM(span) hidden size       128
LSTM(entity) hidden size     64
maximal length c             6
dropout                      0.8
l2                           0.0001

Table 4: Hyperparameters of our joint model.