Learning to Recognize Discontiguous Entities

This paper studies the recognition of discontiguous entities. Motivated by previous work, we propose a novel hypergraph representation that jointly encodes discontiguous entities of unbounded length, which can overlap with one another. To compare with existing approaches, we first formally introduce the notion of model ambiguity, which defines the difficulty level of interpreting the outputs of a model, and then formally analyze the theoretical advantages of our model over existing approaches based on linear-chain CRFs. Our empirical results also show that our model achieves significantly better results when evaluated on standard data with many discontiguous entities.


Introduction
Building effective automatic named entity recognition (NER) systems that are capable of extracting useful shallow semantic information from texts has been one of the most important tasks in the field of natural language processing. An effective NER system can typically play an important role in certain downstream NLP tasks such as relation extraction, event extraction, and knowledge base construction (Hasegawa et al., 2004; Al-Rfou and Skiena, 2012).
Most traditional NER systems are capable of extracting entities as short spans of texts. Two basic assumptions are typically made when extracting entities: 1) entities do not overlap with one another, and 2) each entity consists of a contiguous sequence of words. These assumptions allow the task to be modeled as a sequence labeling task, for which many existing models are readily available, such as linear-chain CRFs (McCallum and Li, 2003).
While the above two assumptions are valid for most cases, they are not always true. For example, in the entity University of New Hampshire of type ORG there exists another entity New Hampshire of type LOC. This violates the first assumption above, yet it is crucial to extract both entities for subsequent tasks such as relation extraction and knowledge base construction. Researchers have therefore proposed to tackle the above issues in NER using more sophisticated models (Finkel and Manning, 2009). Such efforts still largely rely on the second assumption. Unfortunately, the second assumption is also not always true in practice. There are also cases where the entities are composed of multiple discontiguous sequences of words, such as in disorder mention recognition in clinical texts (Pradhan et al., 2014b), where the entities (disorder mentions in this case) may be discontiguous. Consider the example shown in Figure 1. In this example there are four entities. The first one, hiatal hernia, is a conventional contiguous entity. The second one, laceration ... esophagus, is a discontiguous entity, consisting of two parts. The third and fourth ones, blood in stomach and stomach ... lac (for stomach laceration), overlap with each other, with the fourth also being discontiguous.
For such discontiguous entities which can potentially overlap with other entities in complex manners, existing approaches such as those based on simple sequence tagging models have difficulties handling them accurately. This stems from the fact that there is a very large number of possible entity combinations in a sentence when the entities can be discontiguous and overlapping.
Motivated by this, in this paper we propose a novel model that can better represent both contiguous and discontiguous entities which can overlap with one another. Our major contributions can be summarized as follows:
• We propose a novel model that is able to represent both contiguous and discontiguous entities.
• Theoretically, we introduce the notion of model ambiguity for quantifying the ambiguity of different NER models that can handle discontiguous entities. We present a study and make comparisons about different models' ambiguity under this theoretical framework.
• Empirically, we demonstrate that our model can significantly outperform conventional approaches designed for handling discontiguous entities on data which contains many discontiguous entities.

Related Work
Learning to recognize named entities is a popular task in the field of natural language processing. A survey by Nadeau (2007) lists several approaches in NER, including Hidden Markov Models (HMM) (Bikel et al., 1997), Decision Trees (Sekine, 1998), Maximum Entropy Models (Borthwick and Sterling, 1998), Support Vector Machines (SVM) (Asahara and Matsumoto, 2003), and also semi-supervised and unsupervised approaches. Ratinov (2009) utilized an averaged perceptron to solve this problem and also focused on four key design decisions, achieving state-of-the-art results on the MUC-7 dataset. These approaches work on standard texts, such as news articles, and the entities to be recognized are defined to be contiguous and non-overlapping. Noticing that many named entities contain other named entities inside them, Finkel and Manning (2009) proposed a model that is capable of extracting nested named entities by representing the sentence as a constituency parse tree, with named entities as phrases. As a parsing-based model, the approach has a time complexity that is cubic in the number of words in the sentence.
Recently, a model that can represent overlapping entities was proposed. In addition to supporting nested entities, theoretically this model can also represent overlapping entities where neither is nested in another. The model represents each sentence as a hypergraph with nodes indicating entity types and boundaries. Compared to the previous model, this model has a lower time complexity, which is linear in the number of words in the sentence.
All the above models focus on NER in conventional texts, where the assumption of contiguous entities is valid. In the past few years, there has been a growing body of work on recognizing disorder mentions in clinical text. These disorder mentions may be discontiguous and also overlapping. To tackle such an issue, a research group from University of Texas Health Science Center at Houston (Tang et al., 2013; Zhang et al., 2014; Xu et al., 2015) first utilized a conventional linear-chain CRF to recognize disorder mention parts by extending the standard BIO (Begin, Inside, Outside) format, and then applied postprocessing to combine different components. Though effective, as we will see later, such a model comes with some drawbacks. Nevertheless, their work motivated us to perform further analysis on this issue and propose a novel model specifically designed for discontiguous entity extraction.

Linear-chain CRF Model
Before we present our approach, we would like to spend some time to discuss a simple approach based on linear-chain CRFs (Lafferty et al., 2001). This approach is primarily based on the system by Tang et al. (2013), and this will be the baseline system that we will compare against in later sections.
[Figure 2: the two example sentences in the linear-chain encoding. The O labels are not shown. Misspellings are from the dataset.]
EGD showed hiatal [B] hernia [I] and vertical laceration [BD] in distal esophagus [BD] with blood [B] in [I] stomach [BH] and overlying lac [BD] .
Infarctions [BH] either water [BD] shed [ID] or embolic [BD]
The problem is regarded as a sequence prediction task, where each word is assigned a label similar to the BIO format often used for NER. We used the encoding used by Tang et al. (2013), which uses 7 tags to handle entities that can be discontiguous and overlapping. Specifically, we used B, I, O, BD, ID, BH, and IH to denote Beginning of entity, Inside entity, Outside of entity, Beginning of Discontiguous entity, Inside of Discontiguous entity, Beginning of Head, and Inside of Head. To encode a sentence in this format, first we identify the contiguous word sequences which are parts of multiple entities. We call these head components and we label each word inside each component with BH (for the first word in each component) or IH. Then we find contiguous word sequences which are parts of a discontiguous entity, which we call the body components. Words inside those components which have not been labeled are labeled with BD (for the first word in each component) or ID. Finally, contiguous word sequences that form a contiguous entity are called contiguous components, and their words, if they have not been labeled, are labeled as B (for the first word in each component) or I.
This encoding is lossy, since the information on which parts constitute the same entity is lost. The top example in Figure 2 is the encoding of the example shown in Figure 1. During decoding, based on the labels only it is not entirely clear whether "laceration" should be combined with "esophagus" or with "stomach" to form a single mention. For the bottom example, we cannot deduce that "Infarctions" alone is a mention, since there is no difference in the encoding of a sentence with only two mentions {"Infarctions . . . water shed", "Infarctions . . . embolic"} or having three mentions with "Infarctions" as another mention, since in both cases, the word "Infarctions" is labeled with BH.
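The encoding procedure and its lossiness can be illustrated with a short Python sketch. This is a minimal sketch under our own assumptions, not the implementation of Tang et al. (2013): mentions are given as lists of half-open (start, end) token spans, a head component is taken to be a maximal run of tokens covered by at least two mentions, and the function name `encode_tags` is hypothetical.

```python
def encode_tags(n_tokens, mentions):
    """Encode mentions into the 7-tag scheme (B, I, O, BD, ID, BH, IH).

    mentions: list of mentions; each mention is a list of components,
    each component a half-open (start, end) token span (an assumption
    made for this sketch).
    """
    tags = ["O"] * n_tokens

    # Count how many mentions cover each token.
    cover = [0] * n_tokens
    for mention in mentions:
        for start, end in mention:
            for i in range(start, end):
                cover[i] += 1

    # Step 1: head components, i.e. runs of tokens shared by 2+ mentions.
    in_head = False
    for i in range(n_tokens):
        if cover[i] >= 2:
            tags[i] = "IH" if in_head else "BH"
            in_head = True
        else:
            in_head = False

    def label(components, begin_tag, inside_tag):
        # Label only words that have not been labeled by an earlier step.
        for start, end in components:
            started = False
            for i in range(start, end):
                if tags[i] == "O":
                    tags[i] = inside_tag if started else begin_tag
                    started = True
                else:
                    started = False

    # Step 2: body components of discontiguous mentions.
    for mention in mentions:
        if len(mention) > 1:
            label(mention, "BD", "ID")
    # Step 3: contiguous mentions.
    for mention in mentions:
        if len(mention) == 1:
            label(mention, "B", "I")
    return tags


# "Infarctions either water shed or embolic"
two_mentions = [[(0, 1), (2, 4)], [(0, 1), (5, 6)]]
three_mentions = two_mentions + [[(0, 1)]]  # adds "Infarctions" alone
print(encode_tags(6, two_mentions))
print(encode_tags(6, two_mentions) == encode_tags(6, three_mentions))
```

Both mention sets yield the same tag sequence, which is the loss of information discussed above: the tags alone cannot tell whether "Infarctions" is a mention by itself.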
Also, it should be noted that some of the label sequences are not valid. For example, a sentence in which there is only one word labeled as BD is invalid, since a discontiguous entity requires at least two words to be labeled as BD or BH. This is, however, a possible output from the linear CRF model, due to the Markov assumption inherent in linear CRF models. Later we see that our models do not have this problem.

Our Model
Linear-chain CRF models are limited in their representational power when handling complex entities, especially when they can be discontiguous and can overlap with one another. While recent models have been proposed to effectively handle overlapping entities, how to effectively handle discontiguous entities remains a research question to be answered. Motivated by previous efforts on handling overlapping entities, in this work we propose a model based on hypergraphs that can better represent entities that can be discontiguous and at the same time be overlapping with each other.
Unlike the previous work, we establish a novel theoretical framework to formally quantify the ambiguity of our hypergraph-based models and justify their effectiveness by making comparisons with the linear-chain CRF approach. Now let us introduce our novel hypergraph representation. A hypergraph can be used to represent entities of different types and their combinations in a given sentence. Specifically, a hypergraph is constructed as follows. For the word at position k, we have the following nodes:
• $A^k$: this node represents all entities that begin with the current or a future word (to the right of the current word).
• $E^k$: this node represents all entities that begin with the current word.
• $T^k_t$: this node represents entities of certain specific type t that begin with the current word. There is one $T^k_t$ for each different type.
• $B^k_{t,i}$: this node indicates that the current word is part of the i-th component of an entity of type t.
• $O^k_{t,i}$: this node indicates that the current word appears in between the (i-1)-th and i-th components of an entity of type t.
There is also a special leaf node, X-node, which indicates the end (i.e., right boundary) of an entity.
The nodes are connected by directed hyperedges, which for the purpose of explaining our models are defined as those edges that connect one node, called the parent node, to one or more child nodes. For ease of notation, in the rest of this paper we use edge to refer to directed hyperedge.
The edges. Each $A^k$ is a parent to $E^k$ and $A^{k+1}$, encoding the fact that the set of all entities at position k is the union of the set of entities starting exactly at the current position ($E^k$) with the set of entities starting at or after position k + 1 ($A^{k+1}$).
Each $E^k$ is a parent to $T^k_1, \ldots, T^k_T$, where T is the total number of possible types that we consider. Each $T^k_t$ has two edges where it serves as a parent: in one it is parent to $B^k_{t,0}$, and in the other it is parent to X. These edges encode the fact that at position k, either there is an entity of type t that begins with the current word (to $B^k_{t,0}$), or there is no entity of type t that begins with the current word (to X).
In the full hypergraph, each $B^k_{t,i}$ is a parent to $B^{k+1}_{t,i}$ (encoding the fact that the next word also belongs to the same component of the same entity), to $O^{k+1}_{t,i+1}$ (encoding the fact that this word is part of a discontiguous entity, and the next word is the first word separating the current component and the next component), and to X (representing that the entity ends at this word). Also there are edges with all possible combinations of $B^{k+1}_{t,i}$, $O^{k+1}_{t,i+1}$, and X as the child nodes, representing overlapping entities. For example, the edge $B^k_{t,i} \to (B^{k+1}_{t,i}, X)$ denotes that there is an entity which continues to the next word (the edge to $B^{k+1}_{t,i}$), while there is another entity ending at the k-th word (the edge to X). In total there are 7 edges in which $B^k_{t,i}$ is a parent, which are:
$$
\begin{aligned}
B^k_{t,i} &\to (B^{k+1}_{t,i}) \\
B^k_{t,i} &\to (O^{k+1}_{t,i+1}) \\
B^k_{t,i} &\to (X) \\
B^k_{t,i} &\to (B^{k+1}_{t,i},\, O^{k+1}_{t,i+1}) \\
B^k_{t,i} &\to (B^{k+1}_{t,i},\, X) \\
B^k_{t,i} &\to (O^{k+1}_{t,i+1},\, X) \\
B^k_{t,i} &\to (B^{k+1}_{t,i},\, O^{k+1}_{t,i+1},\, X)
\end{aligned}
$$
Analogously, $O^k_{t,i}$ has three edges that connect to $O^{k+1}_{t,i}$, $B^{k+1}_{t,i+1}$, and both. Note that $O^k_{t,i}$ is not a parent to X by definition.
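The edge counts stated above (7 for a B-node, 3 for an O-node) follow from taking every non-empty combination of the allowed child nodes. The following Python sketch enumerates them. The tuple representation of nodes and the function names are assumptions made for illustration, and the sketch ignores the boundary case at the last word of a sentence, where position k + 1 does not exist.

```python
from itertools import combinations


def nonempty_subsets(children):
    """All non-empty combinations of the given child nodes."""
    return [
        subset
        for size in range(1, len(children) + 1)
        for subset in combinations(children, size)
    ]


def b_node_edges(k, t, i):
    """Child-node tuples of the hyperedges whose parent is the B-node (k, t, i)."""
    children = [("B", k + 1, t, i), ("O", k + 1, t, i + 1), ("X",)]
    return nonempty_subsets(children)


def o_node_edges(k, t, i):
    """Child-node tuples of the hyperedges whose parent is the O-node (k, t, i).

    An O-node is never a parent to X, so only two child nodes are allowed.
    """
    children = [("O", k + 1, t, i), ("B", k + 1, t, i + 1)]
    return nonempty_subsets(children)


print(len(b_node_edges(0, "t", 0)))  # 7 edges
print(len(o_node_edges(0, "t", 1)))  # 3 edges
```

An edge with more than one child node, such as the one whose children are the next B-node and X, is what lets a single word take part in several overlapping entities.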
During testing, the model will predict a subgraph which, after decoding, yields the predicted entities. We call such a subgraph, which represents a certain entity combination, an entity-encoded hypergraph.
For example, Figure 3 shows the entity-encoded hypergraph of our model encoding the three mentions in the second example in Figure 4. The edge from the T-node for the first word to the B-node for the first word shows that there is at least one entity starting with this word. The three places where an X-node is connected to a B-node show the end of the three entities. Note that this hypergraph clearly shows the presence of the three mentions without ambiguity, unlike a linear-chain encoding of this example where it cannot be inferred that "Infarctions" alone is a mention, as discussed previously. In this paper, we set the maximum number of components to be 3 since the dataset does not contain any mention with more than 3 components.
Also note that this model supports discontiguous and overlapping mentions of different types since each type has its own set of O-nodes and B-nodes, unlike the linear-chain model, which supports only overlapping mentions of the same type.
We also experimented with a variant of this model, where we split the T-nodes, B-nodes, and O-nodes further according to the number of components. We split $B^k_{t,i}$ into $B^k_{t,i,j}$, i = 1 . . . j, j = 1 . . . 3, which represents that the word is part of the i-th component of a mention with j components in total. Similarly we split $O^k_{t,i}$ into $O^k_{t,i,j}$ and $T^k_t$ into $T^k_{t,j}$. We call the original version the SHARED model, and this variant the SPLIT model. The motivation for this variant is that the majority of overlaps in the data are between discontiguous and contiguous entities, and so splitting the two cases, one component (contiguous) versus more (discontiguous), will reduce ambiguity for those cases.
These models are still ambiguous to some degree. For example, when an O-node has two child nodes and two parents, we cannot decide which parent node is paired with which child node. However, in this paper we argue that:
• This model is less ambiguous compared to the linear-chain model, as we will show later theoretically and empirically.
• Every output of our model is a valid prediction, unlike the linear-chain model, since this model will always produce a valid path from T-nodes to the X-nodes representing some entities.
We will also show through experiments that our models can encode the entities more accurately.

Interpreting Output Structures
Both the linear-chain CRF model and our models are still ambiguous to some degree, so we need to handle the ambiguity in interpreting the output structures into entities. For all models, we define two general heuristics: ENOUGH and ALL. The ENOUGH heuristic handles ambiguity by trying to produce a minimal set of entities which encodes to the one produced by the model, while ALL heuristic handles ambiguity by producing the union of all possible entity combinations that encode to the one produced by the model. For more details on how these heuristics are implemented for each model, please refer to the supplementary material.

Training
For both models, the training follows a log-linear formulation, by maximizing the log-likelihood of the training data D:
$$
\mathcal{L}(D) = \sum_{(x,y) \in D} \Big[ \sum_{e \in E(x,y)} w^\top f(e) - \log Z_w(x) \Big] - \lambda \lVert w \rVert^2
$$
Here (x, y) is a training instance consisting of the sentence x and the entity-encoded hypergraph y ∈ Y, where Y is the set of all possible mention-encoded hypergraphs. The vector w consists of feature weights, which are the model parameters to be learned. The set E(x, y) consists of all edges present in the entity-encoded hypergraph y for input x. The function f(e) returns the features defined over the edge e, $Z_w(x)$ is the normalization term which gives the sum of scores over all possible entity-encoded hypergraphs in Y that are relevant to the input x, and finally λ is the $\ell_2$-regularization parameter.
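To make the roles of the edge scores, the normalization term, and the regularizer concrete, the following Python sketch evaluates a regularized log-linear objective of this kind on toy data. It is a sketch under strong assumptions and not our training code: the candidate structures for each sentence are enumerated explicitly (a real implementation would compute the normalization term with dynamic programming over the hypergraph), each edge is a hypothetical dictionary mapping feature names to values, and the names `score` and `objective` are ours.

```python
import math


def score(w, edges):
    """Sum of w . f(e) over the edges of one structure."""
    return sum(
        w.get(name, 0.0) * value
        for edge in edges
        for name, value in edge.items()
    )


def objective(w, data, lam):
    """Regularized log-likelihood over explicitly enumerated candidates.

    data: list of (gold_edges, candidates) pairs, where candidates is the
    list of all structures (each a list of edges) allowed for the sentence.
    """
    total = 0.0
    for gold_edges, candidates in data:
        # log Z_w(x): log of the summed exponentiated scores of all candidates.
        log_z = math.log(sum(math.exp(score(w, c)) for c in candidates))
        total += score(w, gold_edges) - log_z
    # Squared L2 penalty on the weights.
    return total - lam * sum(v * v for v in w.values())


# One toy sentence with two candidate structures; the first one is gold.
gold = [{"a": 1.0}]
other = [{"b": 1.0}]
data = [(gold, [gold, other])]
print(objective({}, data, lam=0.5))          # -log 2 with all-zero weights
print(objective({"a": 1.0}, data, lam=0.5))  # gold scored higher, but penalized
```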

Model Ambiguity
The main aim of this paper is to assess how well each model can represent the discontiguous entities, even in the presence of overlapping entities.
In this section, we will theoretically compare the models' ambiguity, which is defined as the average number of mention combinations that map to the same encoding in a model. To compare two models, instead of calculating the ambiguity of each model directly, we can calculate the relative ambiguity between the two models by comparing the number of canonical encodings in the two models.
A canonical encoding is a fixed, selected representation of a particular set of mentions in a sentence, among (possibly) several alternative representations. Several alternatives may be present due to the ambiguity of the encoding-decoding process and also since the output of the model is not restricted to a specific rule. For example, for the text "John Smith", a model trained in BIO format might output "B-PER I-PER" or "I-PER I-PER", and both will still mean that "John Smith" is a person, although the "correct" encoding would of course be "B-PER I-PER", which is selected as the canonical encoding. Intuitively, a canonical encoding is a formal way to say that we only consider the "correct" encodings.
A model with a larger number of canonical encodings will, on average, have less ambiguity compared to one with a smaller number of canonical encodings. Consequently, a model with less ambiguity will be more precise in predicting entities.
Let $M_{LI}(n)$, $M_{SH}(n)$, $M_{SP}(n)$ denote the number of canonical encodings of the linear-chain, SHARED, and SPLIT model, respectively, for a sentence with n words. Then we formally define the relative ambiguity of model $M_1$ over model $M_2$, $A_r(M_1, M_2)$, as follows:
$$
A_r(M_1, M_2) = \lim_{n \to \infty} \frac{\log \sum_{i=1}^{n} M_{M_2}(i)}{\log \sum_{i=1}^{n} M_{M_1}(i)}
$$
where $M_{M_j}(i)$ is the number of canonical encodings of model $M_j$ for a sentence with i words. $A_r(M_1, M_2) > 1$ means model $M_1$ is more ambiguous than $M_2$. Now, we claim the following:
Theorem 4.1. $A_r(LI, SH) > 1$.
We provide a proof sketch below. Due to space limitation, we cannot provide the full dynamic programming calculation. We refer the reader to the supplementary material for the details.
Proof Sketch. The number of canonical encodings in the linear-chain model is less than $7^n$ since there are 7 possible tags for each of the n words and not all of the $7^n$ tag sequences are canonical encodings. So we have $M_{LI}(n) < 7^n$ and thus, since $\sum_{i=1}^{n} 7^i < 8^n = 2^{3n}$, we can derive $\log \sum_{i=1}^{n} M_{LI}(i) < 3n \log 2$. For our models, by employing some dynamic programming adapted from the inside algorithm (Baker, 1979), we can calculate the growth order of the number of canonical encodings for the SHARED model to arrive at the conclusion that $\forall n > n_0$, $\sum_{i=1}^{n} M_{SH}(i) > C \cdot 2^{10n}$ for some constants $n_0$, $C$. Then we have:
$$
A_r(LI, SH) = \lim_{n \to \infty} \frac{\log \sum_{i=1}^{n} M_{SH}(i)}{\log \sum_{i=1}^{n} M_{LI}(i)} \ge \lim_{n \to \infty} \frac{\log C + 10 n \log 2}{3 n \log 2} = \frac{10}{3} > 1
$$
Theorem 4.1 says that the linear-chain model is more ambiguous compared to our SHARED model. Similarly, we can also establish $A_r(SH, SP) > 1$. Later we also see this empirically from experiments.
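The two bounds used in the proof sketch can be checked numerically. The Python sketch below first verifies, with exact integer arithmetic, that the sum of $7^i$ for i = 1, ..., n stays below $2^{3n}$ (which is what turns the per-length bound of $7^n$ into the bound $3n \log 2$ on the log of the sum), and then evaluates the ratio $(\log C + 10 n \log 2) / (3 n \log 2)$ for growing n. The constant C is not given in the text, so the value used here is purely hypothetical, and the function names are ours.

```python
import math


def sum_of_powers_of_7(n):
    """Exact value of 7^1 + 7^2 + ... + 7^n."""
    return sum(7 ** i for i in range(1, n + 1))


def ratio_lower_bound(n, c):
    """(log C + 10 n log 2) / (3 n log 2), the lower bound on the ratio."""
    return (math.log(c) + 10 * n * math.log(2)) / (3 * n * math.log(2))


# The upper bound for the linear-chain model: sum_i 7^i < 8^n = 2^(3n).
for n in range(1, 201):
    assert sum_of_powers_of_7(n) < 2 ** (3 * n)

# The ratio approaches 10/3 regardless of the (hypothetical) constant C.
for n in (10, 100, 10_000):
    print(n, ratio_lower_bound(n, c=1e-3))
```

The constant C only shifts the numerator by a fixed amount, so its contribution vanishes as n grows and the limit is 10/3.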

Data
To allow us to conduct experiments to empirically assess different models' capability in handling entities that can be discontiguous and can potentially overlap with one another, we need a text corpus annotated with entities which can be discontiguous and overlapping with other entities. We found the largest such corpus to be the dataset from the task to recognize disorder mentions in clinical text, initially organized by ShARe/CLEF eHealth Evaluation Lab (SHEL) in 2013 (Suominen et al., 2013) and continued in SemEval-2014 (Pradhan et al., 2014a). The definition of the task is to recognize mentions of concepts that belong to the Unified Medical Language System (UMLS) semantic group disorders from a set of clinical texts. Each text has been annotated with a list of disorder mentions by two professional coders trained for this task, followed by an open adjudication step (Suominen et al., 2013).
Unfortunately, even in this dataset, only 8.95% of the mentions are discontiguous. Working directly on such data would prevent us from understanding the true effectiveness of different models when handling entities which can be discontiguous and overlapping. In order to truly understand how different models behave on data with discontiguous entities, we consider a subset of the data consisting of those sentences which contain at least one discontiguous entity. We call the resulting subset the "Discontiguous" subset of the "Original" dataset. Later we will also still use the training data of the "Original" dataset in the experiments.
Note that this "Discontiguous" subset still contains contiguous entities since a sentence usually contains more than one entity. The subset is a balanced dataset with 53.61% of the entities being discontiguous and the rest contiguous. We then split this dataset into training, development, and test set, according to the split given in SemEval 2014 setting (henceforth LARGE dataset). To see the impact of dataset size, we also experiment on a subset of the LARGE dataset, following the SHEL 2013 setting, with the development set in the LARGE dataset used as test set (henceforth SMALL dataset). The training and development set of the SMALL dataset comes from a random 80% (Tr80) and 20% (Tr20) split of the training set in LARGE dataset.
The statistics of the datasets, including the number of overlaps between the entities in the "All" column, are shown in Table 1.
We note that this dataset only contains one type of entity. In later experiments, in order to evaluate the models on multiple types, we create another dataset where we split the entities based on the entity-level semantic category. This information is available for some entities through the Concept Unique Identifier (CUI) annotation in the data. In total we have three types: two types (type A and B) based on the semantic category, and one type (type N) for those entities for which this information is not available (footnote 2).
Figure 4 shows some examples of the mentions. The first example shows two discontiguous mentions that do not overlap. The second example shows a typical discontiguous and overlapping case. The last example shows a very hard case of overlapping and discontiguous mentions, as each of the components in {blood, dark, black material} is paired with each of the words in {vomit, bowel movement}, resulting in six mentions in total, with one having three components (dark . . . material . . . vomit).
Footnote 2: It is tempting to just ignore these entities since the N type does not convey any specific information about the entities in it. However, due to the dataset size, excluding this type will lead to a very small number of interactions between types. So we decided to keep this type.

Features
Motivated by the features used by Zhang et al. (2014), for both the linear-chain CRF model and our models we use the following features: neighbouring words with relative position information (we consider previous and next k words, where k=1, 2, 3), neighbouring words with relative position information paired with the current word, word n-grams containing the current word (n=2,3), POS tag for the current word, POS tag n-grams containing the current word (n=2,3), orthographic features (prefix, suffix, capitalization, lemma), note type (discharge summary, echo report, radiology, and ECG report), section name (e.g. Medications, Past Medical History), Brown cluster, and word-level semantic category information. We used the Stanford POS tagger (Toutanova et al., 2003) for POS tagging, and the NLP4J package for lemmatization. For Brown cluster features, following Tang et al. (2013), we used 1,000 clusters from the combination of training, development, and test set, and used all the subpaths of the cluster IDs as features.

Experimental Setup
We evaluated the three models on the SMALL dataset and the LARGE dataset.
Note that in both the SMALL and LARGE dataset, about half of all mentions are discontiguous, both in training and test set. We also want to see whether training on a set where the majority of the mentions are contiguous will affect the performance on recognizing discontiguous mentions. So we also performed another experiment where we trained each model on the original training set where the majority of the entities are contiguous. We refer to this original dataset as "Train-Orig" (it contains 10,405 sentences, including those with no entities) and the earlier one as "Train-Disc".
Table 2: Results on the two datasets and two different training data after optimizing regularization hyperparameter λ in development set. The -ENH and -ALL suffixes refer to the ENOUGH and ALL heuristics. The best result in each column is put in boldface.
First we trained each model on the training set, varying the regularization hyperparameter λ (footnote 6); then the λ with the best result on the development set, using the respective ENOUGH heuristic for each model, is chosen for the final result on the test set.
For each experiment setting, we show precision (P), recall (R) and F1 measure. Precision is the percentage of the mentions predicted by the model which are correct, recall is the percentage of mentions in the dataset correctly discovered by the model, and F1 measure is the harmonic mean of precision and recall.
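These three measures can be computed from sets of predicted and gold mentions, as in the Python sketch below. It assumes that a predicted mention counts as correct only when all of its component spans match a gold mention exactly, and that mentions are represented as tuples of half-open (start, end) token spans. Both are assumptions of this sketch, as is the function name `prf1`. A full evaluation would additionally key each mention by its sentence and accumulate the counts over the whole test set.

```python
def prf1(predicted, gold):
    """Precision, recall and F1 over two collections of mentions.

    A mention is a hashable object, here a tuple of (start, end) spans,
    and is counted as correct only on an exact match.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# "Ethanol Intoxication and withdrawal"
gold = {((0, 2),), ((0, 1), (3, 4))}  # both gold mentions
predicted = {((0, 2),)}               # only "Ethanol Intoxication" found
print(prf1(predicted, gold))
```

In this toy case the single prediction is correct but one gold mention is missed, so precision is perfect while recall is one half.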

Results and Discussions
The full results are recorded in Table 2.
We see that in general our models have higher precision compared to the linear-chain baseline. This is expected, since our models have less ambiguity, which means that from a given output structure it is easier in our model to get the correct interpretation. We will explore this more in Section 5.5.
The ALL heuristic, as expected, results in higher recall, and this is more pronounced in the linear-chain model, with up to 4% increase from the ENOUGH heuristic, achieving the highest recall in three out of four settings. The high recall of the ALL heuristic in the linear-chain model can be explained by the high level of ambiguity the model has. Since it has more ambiguity compared to our models, one label sequence predicted by the model produces a lot of entities, and so it is more likely to overlap with the gold entities. But this has the drawback of very low precision as we can see in the result.
We see that switching from one heuristic to the other does not affect the results of our models much. Looking at the output of our models, they tend to produce output structures with less ambiguity, which causes little difference in the two heuristics. One example where the baseline made a mistake is the sentence: "Ethanol Intoxication and withdrawal". The gold mentions are "Ethanol Intoxication" and "Ethanol withdrawal".
But the linear-chain model labeled it as "[Ethanol] [B] [Intoxication] [I] and [withdrawal] [BD] ", which is inconsistent since there is only one discontiguous component. Our models do not have this issue because in our models every subgraph that may be predicted translates to valid mention combinations, as discussed in Section 3.2.
Footnote 6: Taken from the set {0.125, 0.25, 0.5, 1.0, 2.0}.
In the "Train-Orig" column, we see that all models can recognize discontiguous entities better when given more data, even though the majority of the entities in "Train-Orig" are contiguous.

Experiments on Ambiguity
To see the ambiguity of each model empirically, we run the decoding process for each model given the gold output structure, which is the true label sequence for the linear-chain model and the true mention-encoded hypergraph for our models.
We used the entities from the training and development sets for this experiment, and we compare the "Original" datasets with the "Discontiguous" subset to see that the ambiguity is more pronounced when there are more discontiguous entities. Then we show the precision and recall errors (defined as 1 − P and 1 − R, respectively) in Table 3.
Since the ALL heuristic generates all possible mentions from the given encoding, theoretically it should give perfect recall. However, due to errors in the training data, there are mentions which cannot be properly encoded in the models (footnote 7). Removing these errors results in perfect recall (0% recall error). This means that all models are complete: they can encode any mention combination.
We see, however, a very large difference in the precision error between the linear-chain model and our models, even more so when most of the entities are discontiguous. For the discontiguous subset with the ALL heuristic, the linear-chain model produced 5,463 entities, while the SHARED and SPLIT models produced 2,020 and 2,006 entities, respectively. The total number of gold entities is 1,991. This means one encoding in the linear-chain model produces many more distinct mention combinations compared to our models, which again shows that the linear-chain model has more ambiguity. Similarly, we can deduce that the SHARED model has slightly more ambiguity compared to the SPLIT model. This confirms our theoretical result presented previously.
It is also worth noting that with the ENOUGH heuristic our models have smaller errors compared to the linear-chain model, showing that when both models can predict the true output structure (the correct label sequence for the baseline model and the mention-encoded hypergraph for our models), it is easier in our models to get the desired mention combinations.
Footnote 7: There are 19 errors in the original dataset, and 6 in the discontiguous subset, which include duplicate mentions and mentions with incorrect boundaries.
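The entity counts reported above for the ALL heuristic already bound the precision of each model, because the number of correct entities can never exceed the number of gold entities. The Python sketch below works out these bounds. The resulting figures are simple arithmetic on the counts in the text, not the values reported in Table 3, and the function name is hypothetical.

```python
GOLD_ENTITIES = 1991
PRODUCED = {"linear-chain": 5463, "SHARED": 2020, "SPLIT": 2006}


def precision_upper_bound(produced, gold=GOLD_ENTITIES):
    """Best possible precision when `produced` entities are output.

    At most min(gold, produced) of the produced entities can be correct.
    """
    return min(gold, produced) / produced


for model, produced in PRODUCED.items():
    bound = precision_upper_bound(produced)
    print(f"{model}: precision <= {bound:.4f}, "
          f"precision error >= {1 - bound:.4f}")
```

Under this bound the linear-chain model cannot exceed roughly 36% precision with the ALL heuristic, while both of our models can in principle stay above 98%.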

Experiments on Multiple Entity Types
We used the LARGE dataset with the multiple-type entities for this experiment. We ran our two models and the linear-chain CRF model with the ENOUGH heuristic on this multi-type dataset, in the same setting as Train-Orig in previous experiments, and the result is shown in Table 4. We used the best lambda from the main experiment for this experiment.
There is a performance drop compared to the LARGE-Train-Orig results in Table 2, which is expected since the presence of multiple types makes the task harder. But in general our models are still better than the baseline, especially the SPLIT model, which shows that our models continue to work better than the baseline model in the presence of multiple types.

Conclusions and Future Work
In this paper we proposed new models that can better represent discontiguous entities that can be overlapping at the same time. We validated our claims through theoretical analysis and empirical analysis on the models' ambiguity, as well as their performances on the task of recognizing disorder mentions on datasets with a substantial number of discontiguous entities. When the true output structure is given, which is still ambiguous in all models, it is easier with our models than with the linear-chain CRF model to produce the desired mention combinations using reasonable heuristics. We note that an extension similar to semi-Markov or weak semi-Markov models is possible for our models. We leave this for future investigations.
The supplementary material and our implementations for the models are available at: http://statnlp.org/research/ie

Model Ambiguity
This work attempts to define a more formal way to compare two models in terms of their ambiguity.
To compare the ambiguity between models, we first define a notion of model ambiguity level, which is the average number of distinct interpretations across a model's set of canonical encodings. A canonical encoding is a fixed, selected representation of a particular set of mentions in a sentence, among (possibly) several alternative representations. Several alternatives may be present due to the ambiguity of the encoding-decoding process and also because the output of the model is not restricted to a specific rule. For example, for the text "John Smith", a model trained in BIO format might output "B-PER I-PER" or "I-PER I-PER", and both will still mean that "John Smith" is a person, although the "correct" encoding would of course be "B-PER I-PER", which is selected as the canonical encoding. Intuitively, a canonical encoding is a formal way to say that we only consider the "correct" encodings.
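The "John Smith" example can be made concrete with a small decoder. The following is a minimal sketch, not the authors' implementation: the function name `decode_bio` and the lenient rule that a stray I-X tag opens a new span are assumptions made for illustration.

```python
def decode_bio(tags):
    """Return the set of (start, end, type) spans encoded by a BIO tag sequence.

    Lenient decoding (an assumption of this sketch): an I-X tag that does not
    continue a span of type X is treated as if it were B-X.
    """
    spans, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes the last span
        prefix, _, etype = tag.partition("-")
        continues = prefix == "I" and etype == current
        if not continues:
            if current is not None:
                spans.add((start, i, current))  # end index is exclusive
            if prefix in ("B", "I"):
                start, current = i, etype
            else:
                start, current = None, None
    return spans


canonical = ["B-PER", "I-PER"]    # the encoding selected as canonical
alternative = ["I-PER", "I-PER"]  # a non-canonical encoding of the same mention
print(decode_bio(canonical) == decode_bio(alternative))
```

Both tag sequences decode to the same single span, which is why only one of them is counted as a canonical encoding.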
Note that by definition, the number of canonical encodings cannot be larger than the number of possible interpretations, so the ambiguity level will be at least 1. When the number of canonical encodings is strictly smaller than the number of possible entity combinations, we have either ambiguity (when several entity combinations have the same canonical encoding) or incompleteness (when an entity combination does not have an encoding). A model which is complete and not ambiguous will have the lowest ambiguity level, which is 1. We note that in our case, all models are complete, since each model is able to encode any entity combination. All models that we consider, however, are ambiguous, since in each model there are encodings that can be interpreted in more than one way.
Let $N(n)$ be the number of all possible entity combinations in a sentence with $n$ words, and $M(n)$ be the number of canonical encodings in the model for a sentence with $n$ words. Then the ambiguity level of a model is:
\[ A = \lim_{n \to \infty} \frac{\sum_{i=1}^{n} N(i)}{\sum_{i=1}^{n} M(i)} \]
assuming the limit exists. If it does not exist, the ambiguity is undefined. This can mean that two models are incomparable, or that the number of possible entity combinations is much larger than the number of canonical encodings. When it exists, the values of $A$ lie in the range $[1, \infty)$. Sometimes when $M(n)$ is very small compared to $N(n)$, it might be useful to compute the log-ambiguity instead, defined as:
\[ \lim_{n \to \infty} \frac{\log \sum_{i=1}^{n} N(i)}{\log \sum_{i=1}^{n} M(i)} \]
This is what we will use in the remainder of this article.
Relative Ambiguity For some tasks, the number of possible encodings in a practical model may be very small compared to the number of possible entity combinations, resulting in very high ambiguity, and so it is difficult to compare two models. To overcome this, we define a notion of relative ambiguity, which is the ratio of the log-ambiguity of two models $M_1$ and $M_2$, in which the terms involving $N$ cancel:
\[ A_r(M_1, M_2) = \lim_{n \to \infty} \frac{\log \sum_{i=1}^{n} M_{M_2}(i)}{\log \sum_{i=1}^{n} M_{M_1}(i)} \]
A relative ambiguity greater than 1 means the first model is more ambiguous than the second model. A relative ambiguity of 1 means the two models have the same level of ambiguity.
Now we calculate the number of canonical encodings $M_{LI}(n)$, $M_{SH}(n)$, and $M_{SP}(n)$ for the linear-chain, SHARED, and SPLIT models, respectively.
Calculating $M_{LI}(n)$ For the linear-chain model, the number of possible encodings is the number of possible tag sequences, which is $7^n$ in this case, since there are 7 possible tags. Note that this number is larger than the number of canonical encodings, since some of the tag sequences have the same interpretation and some are not valid. So:
\[ M_{LI}(n) < 7^n \]
Calculating $M_{SH}(n)$ and $M_{SP}(n)$ For our models, the number of canonical encodings is equal to the number of valid subgraphs in the model, since each distinct subgraph yields a distinct interpretation. This is quite tricky to calculate; a straightforward application of the standard dynamic programming algorithm, similar to the inside algorithm, that uses the nodes as states fails in this case because it also includes certain combinations not representable in our graph models. To calculate the number of distinct subgraphs, we need to use combinations of nodes as states.
To explain this idea, we will calculate the number of subgraphs of the simple graph shown in Figure 1. The nodes are indexed by position from right to left, starting from 1, and also by level from bottom to top, starting from 0, as in the figure. So the top left node has level 2 and position 6, and the bottom right node has level 0 and position 1. Let $n_i^j$ denote the node at level $i$ and position $j$. From each node $n_i^j$, except the nodes in the bottom row and the rightmost node in each level, there are two edges, one connecting to $n_i^{j-1}$ and another connecting to $n_{i-1}^{j-1}$. Now we want to count the number of distinct connected directed acyclic graphs (DAGs) that include the top left node and the bottom right node.
First note that at each position, any combination of nodes in the three levels can be assigned a unique number from 0 to 7 by considering the chosen nodes as a binary number, with the node at level 2 being the most significant bit. For example, the number 5, or 101 in binary, represents the set of two nodes at level 2 and level 0. Now, we define 8 functions, $f_{000}(n), f_{001}(n), \ldots, f_{111}(n)$, one for each combination of nodes at each position. The function $f_k(n)$ represents the number of directed acyclic graphs with the node combination represented by $k$ at position $n$ as the sources, and the bottom right node as the sink. Each function is then the sum over all reachable node combinations at the previous position. To find the reachable states, we enumerate all possible edge combinations for a given node configuration. Figure 2 shows the 9 reachable states from the state 110. Note that a node combination may be reachable in multiple ways.
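From there, we have $f_{110}(n) = f_{010}(n-1) + 2f_{011}(n-1) + f_{101}(n-1) + 2f_{110}(n-1) + 3f_{111}(n-1)$. Representing this as a transition vector over the states ordered from 000 to 111, we have:
\[ f_{110}(n) = \begin{pmatrix} 0 & 0 & 1 & 2 & 0 & 1 & 2 & 3 \end{pmatrix} \begin{pmatrix} f_{000}(n-1) & f_{001}(n-1) & \cdots & f_{111}(n-1) \end{pmatrix}^{\top} \]
Stacking the transition vectors for all 8 functions, we have the following transition matrix $T$:
\[ T = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 2 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 1 & 0 & 1 \\
0 & 0 & 1 & 2 & 0 & 1 & 2 & 3 \\
0 & 0 & 0 & 3 & 0 & 1 & 0 & 5
\end{pmatrix} \]
Using this recursive formulation, we can calculate the number of possible DAGs at the top left node by calculating $f_{100}(n)$. Initially we have $f_{001}(1) = 1$ and $f_k(1) = 0$ for $k \neq 001$. Then we have the following formulation for $f(n)$, the number of DAGs from the top left node at position $n \geq 3$ to the bottom right node:
\[ f(n) = \begin{pmatrix} 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \end{pmatrix} \times T^{n-1} \times \begin{pmatrix} 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}^{\top} \]
In general, when a transition matrix is diagonalizable into $T = S^{-1}JS$ for some matrix $S$ and diagonal matrix $J$, the value of $T^n$ can be easily calculated by exponentiating the diagonal entries in $J$, since $T^n = S^{-1}J^nS$. Note that when $T$ is diagonalizable, the entries in $J$ will be the eigenvalues of $T$, and there will be coefficients $c_{ij}$ such that $f(n)$ can be written as a sum of terms of the form $c_{ij}\, n^j \lambda_i^n$, where $\lambda_1, \lambda_2, \ldots, \lambda_k$ are the eigenvalues of $T$ and $q_i$ is the multiplicity of $\lambda_i$. Suppose $\lambda_m = \max(\lambda_1, \lambda_2, \ldots, \lambda_k)$; then $n^{q_m} \lambda_m^n$ will be the dominating term, and the value of $f(n)$ will asymptotically grow as fast as $n^{q_m} \lambda_m^n$. Equivalently, $f(n) \in \Theta(n^{q_m} \cdot \lambda_m^n)$, where $\Theta$ denotes an asymptotically tight bound. Note that if $f(n) \in \Theta(g(n))$ and $h(n) < g(n)$, $\forall n$, then $f(n) \in \Omega(h(n))$. Similarly, if $f(n) \in \Theta(g(n))$ and $h(n) > g(n)$, $\forall n$, then $f(n) \in O(h(n))$.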
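The counting scheme for the simple three-level graph can be checked numerically. The sketch below rebuilds a transition matrix directly from the edge rules described above (each selected node keeps a non-empty subset of its outgoing edges) and evaluates $f(n)$ by matrix power. The function names are hypothetical, and treating the empty state 000 as mapping only to itself is an assumption of this sketch.

```python
from itertools import product

import numpy as np

LEVELS = 3  # levels 0 (bottom) to 2 (top); level 2 is the most significant bit


def edge_choices(level):
    """Non-empty sets of target levels for a node at `level`, as bitmasks."""
    same = 1 << level
    if level == 0:
        return [same]  # bottom-row nodes only connect along their own level
    down = 1 << (level - 1)
    return [same, down, same | down]


def transition_matrix():
    """T[s, t] = number of edge selections leading from state s to state t."""
    size = 1 << LEVELS
    T = np.zeros((size, size), dtype=np.int64)
    for state in range(size):
        active = [lvl for lvl in range(LEVELS) if (state >> lvl) & 1]
        for combo in product(*(edge_choices(lvl) for lvl in active)):
            target = 0
            for mask in combo:
                target |= mask
            T[state, target] += 1
    return T


def count_dags(n):
    """f(n): DAGs from the top node at position n to the bottom node at position 1."""
    T = transition_matrix()
    return int(np.linalg.matrix_power(T, n - 1)[0b100, 0b001])


T = transition_matrix()
print(T[0b110])  # coefficients of f_000 .. f_111 in the recurrence for f_110
print(max(abs(np.linalg.eigvals(T))))  # dominant eigenvalue
print([count_dags(n) for n in range(3, 8)])
```

The row for state 110 reproduces the coefficients of the recurrence for $f_{110}(n)$ given in the text, and the dominant eigenvalue agrees with the value $3 + \sqrt{5}$ reported for $T$.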
For example, the matrix $T$ above has the eigenvalues $3 + \sqrt{5}$, $3 - \sqrt{5}$, 2, and 1, with 2 and 1 having multiplicity 2 and 4, respectively. After solving for the $c_{ij}$ we obtain a closed form for $f(n)$ (Equation 8), from which we see that the function $f(n)$ grows with order $3 + \sqrt{5} \approx 5.24$.
Now, applying the same method to our original hypergraph, we can get the order in which the number of canonical encodings grows. For the SHARED model with at most two components, we have the following transition matrix:
\[ \begin{pmatrix}
1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
1 & 2 & 0 & 0 & 1 & 2 & 0 & 0 \\
0 & 1 & 1 & 1 & 0 & 1 & 1 & 1 \\
0 & 3 & 1 & 5 & 0 & 3 & 1 & 5 \\
1 & 0 & 2 & 0 & 5 & 0 & 6 & 0 \\
1 & 2 & 2 & 4 & 5 & 10 & 6 & 12 \\
0 & 1 & 3 & 5 & 0 & 5 & 11 & 17 \\
0 & 3 & 3 & 21 & 0 & 15 & 11 & 73
\end{pmatrix} \]
with the maximum eigenvalue $\lambda_m \approx 80.61$ with multiplicity 1. Therefore the number of possible DAGs in the SHARED model with two components, which is the number of canonical encodings, is in the order of $\Omega(80^n)$. Analogously, the largest eigenvalue of the transition matrix for the SHARED model with three components is $\lambda_m \approx 1261.86 > 1024 = 2^{10}$ with multiplicity 1, and so $M_{SH}(n) \in \Omega(2^{10n})$. Now, by the definition of $\Omega$, for some $n_0$ we have for all $n > n_0$: $M_{SH}(n) \geq C \cdot 2^{10n}$ for some constant $C > 0$, and so $\forall n > n_0$:
\[ \log \sum_{i=1}^{n} M_{SH}(i) > \log M_{SH}(n) \geq \log C + \log 2^{10n} = \log C + 10n \log 2. \]
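The dominant eigenvalue of the two-component SHARED transition matrix can be recomputed from the entries listed above. This is a minimal sketch using NumPy; the variable names are hypothetical, and the only checks it makes are the generic bounds for a non-negative matrix (the spectral radius lies between the largest diagonal entry and the largest row sum).

```python
import numpy as np

# Transition matrix for the SHARED model with at most two components,
# copied row by row from the text.
T_SHARED_2 = np.array([
    [1, 0, 0, 0, 1, 0, 0, 0],
    [1, 2, 0, 0, 1, 2, 0, 0],
    [0, 1, 1, 1, 0, 1, 1, 1],
    [0, 3, 1, 5, 0, 3, 1, 5],
    [1, 0, 2, 0, 5, 0, 6, 0],
    [1, 2, 2, 4, 5, 10, 6, 12],
    [0, 1, 3, 5, 0, 5, 11, 17],
    [0, 3, 3, 21, 0, 15, 11, 73],
], dtype=float)

# Spectral radius: the eigenvalue of largest magnitude governs the growth rate.
spectral_radius = max(abs(np.linalg.eigvals(T_SHARED_2)))

# Generic bounds for a non-negative matrix, used here as a sanity check.
lower = T_SHARED_2.diagonal().max()   # largest diagonal entry
upper = T_SHARED_2.sum(axis=1).max()  # largest row sum

print(lower, spectral_radius, upper)
```

The printed spectral radius can be compared against the value of approximately 80.61 reported in the text.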
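Now we can calculate the relative ambiguity between the linear-chain model and our SHARED model. Since $M_{LI}(i) < 7^i$, we have $\log \sum_{i=1}^{n} M_{LI}(i) < \log 7^{n+1} = (n+1) \log 7$, and so:
\[ A_r(LI, SH) = \lim_{n \to \infty} \frac{\log \sum_{i=1}^{n} M_{SH}(i)}{\log \sum_{i=1}^{n} M_{LI}(i)} \geq \lim_{n \to \infty} \frac{\log C + 10n \log 2}{(n+1) \log 7} = \frac{10 \log 2}{\log 7} > 1 \]
This means the linear-chain model is more ambiguous than our model. Similarly, we can compute the growth order of our SPLIT model to arrive at the conclusion that $M_{SP}(n) \in \Omega(2^{15n})$. Since $M_{SH}(n) \in O(2^{11n})$, we have $A_r(SH, SP) > \frac{15}{11} > 1$.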
Number of canonical encodings for small n One may be concerned that since the analysis above concerns asymptotic values, it is applicable only for large $n$. To address that concern, we show in Table 1 the number of canonical encodings in the linear-chain model and our models, compared with the number of possible entity combinations. Note that it is hard to calculate the number of canonical encodings for the linear-chain model, since the labels interact in a complex way. For smaller $n$, we can exhaustively enumerate the possible encodings, but as $n$ increases this becomes infeasible, so for $n \geq 4$ we show the upper bound instead, which is $7^n$. We will show how to calculate $N(n)$, the number of all possible entity combinations, later.
For $n = 1, 2$, the number of possible entity combinations is small, and we see that all models can accurately represent each of them. However, note that our models have the advantage of not being able to output non-canonical encodings for $n = 1, 2$, unlike the linear-chain model, which has 7 and 49 possible encodings for $n = 1, 2$, respectively. For $n = 3$, exhaustive enumeration of the encodings for the linear-chain model shows that there are 46 canonical encodings, which is less than the 80 canonical encodings in our models. For $n \geq 4$, which is typical for sentence length, we see that our models have more canonical encodings than the linear-chain model, which is bounded above by $7^n$, although they are still fewer than the total number of possible entity combinations.
Calculating $N(n)$ Now we show how to calculate the number of possible entity combinations in a sentence of length $n$. Prior work established that the number of possible entity combinations when there is no discontiguous entity is $2^{t \frac{n(n+1)}{2}}$, where $t$ is the number of possible entity types. For this case, we set $t = 1$, as there is only one entity type in this task, and use the binomial coefficient $\binom{n+1}{2} = \frac{n(n+1)}{2}$ as the number of contiguous entities. For the number of discontiguous entities with exactly $k$ components, there is one start position and one end position for each of the $k$ components, and all of them are distinct positions. So the number of discontiguous entities is the number of ways to choose the $2k$ positions from $n+1$ positions, resulting in $\binom{n+1}{2k}$ possible discontiguous entities with exactly $k$ components. The total number of entities with at most $k$ components is then:
\[ N = \sum_{i=1}^{k} \binom{n+1}{2i} \]
Theoretically, each of these entities can exist independently of the others, resulting in possibly overlapping entities. Considering the overlapping entities, in total we have:
\[ N(n) = 2^{\sum_{i=1}^{k} \binom{n+1}{2i}} \]
distinct entity combinations, with each discontiguous entity having at most $k$ components. Note that as $k$ goes to $\lfloor \frac{n+1}{2} \rfloor$ (that is, when we do not restrict the number of components an entity can have), the formula above reduces to $2^{2^n - 1}$ by the combinatorial identity
\[ \sum_{i=0}^{\lfloor \frac{n+1}{2} \rfloor} \binom{n+1}{2i} = 2^n, \]
which matches the interpretation that when we do not restrict the number of components, any non-empty subset of words can be an entity, and there are $2^n - 1$ such subsets. Since we count entity combinations, we then have $2^{2^n - 1}$ combinations, which is what we obtain when we let $k$ go to $\lfloor \frac{n+1}{2} \rfloor$.
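The counting argument above translates directly into a few lines of code. The sketch below uses Python's `math.comb`; the function names are hypothetical, and the loop at the end checks the unrestricted case against the closed form $2^{2^n - 1}$.

```python
from math import comb


def num_entities(n, k):
    """Number of entities with at most k components in a sentence of n words."""
    return sum(comb(n + 1, 2 * i) for i in range(1, k + 1))


def num_entity_combinations(n, k):
    """Number of entity combinations: every subset of the possible entities."""
    return 2 ** num_entities(n, k)


# With k = floor((n + 1) / 2) there is no restriction on the number of
# components, so any non-empty subset of the n words is a possible entity.
for n in range(1, 6):
    k_max = (n + 1) // 2
    assert num_entities(n, k_max) == 2 ** n - 1
    print(n, num_entity_combinations(n, k_max))
```

Restricting to contiguous entities corresponds to `k = 1`, which recovers the $\frac{n(n+1)}{2}$ count for a single entity type.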
In our case, since we are calculating the number of all possible entity combinations for n ≤ 5, the maximum number of components an entity can have is 3, so the number of all possible entity combinations is also the number of entity combinations with at most 3 components, which is also what we calculated for the models.

Data Preprocessing
We performed the following preprocessing steps on the clinical texts: sentence splitting, tokenization, and POS tagging.
Splitting and tokenization We used the document preprocessor from the Stanford CoreNLP package to split the documents into sentences, then further processed the output using some rules to better capture the structure of the document. We then tokenized each sentence using a regex-based tokenizer, similar to the NLTK wordpunct_tokenize function. We again further processed the output to handle special anonymization tokens (e.g., "[**doctor first name 77**]") and normalized them into categories (e.g., "doctor name"). After tokenization, we ran the Stanford POS tagger on the resulting text to get the POS tags for each token. Finally, we assigned each disorder mention to the corresponding sentence that contains it.
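A regex-based tokenizer of this kind might look as follows. This is a minimal sketch, not the tokenizer used in the experiments: the pattern `\w+|[^\w\s]+` mirrors the behaviour of NLTK's `wordpunct_tokenize`, while the handling of anonymization tokens (keeping each placeholder as one token and stripping its numeric identifier) is a hypothetical rule chosen for illustration, since the exact normalization categories are not spelled out here.

```python
import re

# Anonymization placeholders such as "[**doctor first name 77**]" are kept whole;
# everything else is split into runs of word characters or runs of punctuation.
ANON_PATTERN = r"\[\*\*.*?\*\*\]"
TOKEN_RE = re.compile(ANON_PATTERN + r"|\w+|[^\w\s]+")
ANON_RE = re.compile(r"\[\*\*(.*?)\*\*\]")


def tokenize(sentence):
    """Split a sentence into tokens, keeping anonymization placeholders intact."""
    return TOKEN_RE.findall(sentence)


def normalize(token):
    """Hypothetical normalization: drop digits from a placeholder's label."""
    match = ANON_RE.fullmatch(token)
    if match is None:
        return token
    label = re.sub(r"\d+", "", match.group(1))
    return " ".join(label.split())


tokens = tokenize("Seen by [**doctor first name 77**], pt stable.")
print([normalize(t) for t in tokens])
```

Putting the placeholder pattern first in the alternation ensures it wins over the punctuation rule, which would otherwise split the brackets and asterisks into separate tokens.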
Note that in this process there might be disorder mentions which do not fit into any sentence. This happens when the sentence splitter splits two words that are annotated as part of a single entity. In our case, we found that there is only one entity in the training set which is incorrectly annotated to also include the whitespace preceding a sentence, which is not part of any sentence after sentence splitting. We fixed the annotation by excluding the preceding whitespace.
Determining section names The clinical texts in the datasets are semi-structured, in the sense that the contents are organized into sections. However, the section names are quite irregular, with a number of variants of section names that have the same meaning. For the purpose of determining the section name during feature extraction, we used a regular expression to capture lines in the clinical texts that end with a colon or come after a new line, with some heuristics to handle special cases found in the datasets. More details can be read in our code at http://statnlp.org/research/ie/#discontiguous-mention

Handling Ambiguity
Both the linear-chain CRF model and our models are still ambiguous to some degree, so during the decoding process we need to handle the ambiguity to produce the mentions. For all models, we define two general heuristics: ENOUGH and ALL. The ENOUGH heuristic handles ambiguity by trying to produce a minimal set of entities which encodes to the one produced by the model, while the ALL heuristic handles ambiguity by producing the union of all possible entity combinations that encode to the one produced by the model.
More specifically, for the linear-chain CRF baseline, we first converted the labeled words into a sequence of component types: contiguous, body, and head, referring to the sequence of words encoded by {B, I}, {BD, ID}, and {BH, IH}, respectively. In the ALL heuristic we form all possible combinations between the components, considering the compatibility between components (e.g., a body component cannot be paired with a contiguous component).
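The first step, turning a label sequence into typed components, can be sketched as below. This is an illustration rather than the baseline's actual code: the function name is hypothetical, the outside tag is assumed to be "O", and the sketch covers the single-type case, so labels carry no entity-type suffix. It does not implement the pairing rules of either heuristic.

```python
# Mapping from tags to component types, following the grouping in the text.
COMPONENT_OF = {
    "B": "contiguous", "I": "contiguous",
    "BD": "body", "ID": "body",
    "BH": "head", "IH": "head",
}
BEGIN_TAGS = {"B", "BD", "BH"}


def extract_components(tags):
    """Group a tag sequence into (kind, start, end) components; end is exclusive."""
    components = []
    current = None  # the component being extended, as [kind, start, end]
    for i, tag in enumerate(tags):
        kind = COMPONENT_OF.get(tag)
        if kind is None:  # assumed outside tag "O" (or any unknown tag)
            current = None
            continue
        if tag in BEGIN_TAGS or current is None or current[0] != kind:
            current = [kind, i, i + 1]  # start a new component
            components.append(current)
        else:
            current[2] = i + 1  # extend the current component
    return [tuple(c) for c in components]


tags = ["O", "BH", "O", "BD", "ID", "O", "BD"]
print(extract_components(tags))
```

For this tag sequence the output contains one head component and two body components, which the heuristics would then pair up into entities.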
In the ENOUGH heuristic, each head component forms at most two entities by pairing it with body components, starting from the closest ones to its left, unless there is only one remaining body component of that type after this process, in which case the head component will form a third entity by being paired with this last body component. After this, the unpaired body components of the same type are paired up to form entities with two or three components. In [Tang et al., 2013] the authors mentioned that they used a few simple rules to convert labels to entities, but the exact details of the rules were not made available.
For our models, for ALL, we generate all possible sets of mentions encoded in the entity-encoded hypergraph by traversing all possible paths in the hypergraph, while in ENOUGH, we generate as many mentions as required to cover all edges present in the hypergraph.

Regularization Hyperparameter
For the experiments, the regularization hyperparameters for each setting are noted in Table 2. Note that the same regularization hyperparameter is used for both the ALL and ENOUGH heuristics during testing.

Splitting the Dataset
We note that the dataset that we have contains only one type of entity. In order to evaluate the models on multiple types, we create another dataset where we split the entities based on the entity-level semantic category. Each entity in the dataset was annotated either with its Concept Unique Identifier (CUI), or with the string "CUI-less". The CUI is a number referencing a certain entry in the Unified Medical Language System (UMLS) Metathesaurus. In UMLS, each entry has a corresponding semantic type, and the semantic types are organized in a hierarchy. There are two major roots in the hierarchy, which are type A and type B. These two types, together with the CUI-less entities, make up the three entity types that we used in creating the dataset with multiple types.

Efficiency in Handling Multiple Entity Types
Theoretically, our models can handle multiple entity types more efficiently than the linear-chain CRF model. This is because the time complexity of our models is linear in the number of entity types, while it is quadratic for the linear-chain CRF model.
To see this empirically, we randomly split the entities in the dataset into n types, where n = 1, 2, 4, 8, 16, and took note of the time taken for training the models. We show the time per iteration relative to the time taken for handling only one type in Table 3. We can see that the time per training iteration in the baseline model increases faster than that of our models, confirming the theoretical time complexity of the models.