Labeling Gaps Between Words: Recognizing Overlapping Mentions with Mention Separators

In this paper, we propose a new model that is capable of recognizing overlapping mentions. We introduce a novel notion of mention separators that can be effectively used to capture how mentions overlap with one another. On top of a novel multigraph representation that we introduce, we show that efficient and exact inference can still be performed. We present some theoretical analysis on the differences between our model and a recently proposed model for recognizing overlapping mentions, and discuss the possible implications of the differences. Through extensive empirical analysis on standard datasets, we demonstrate the effectiveness of our approach.


Introduction
Named entity recognition (NER), or in general the task of recognizing entity mentions in a text, has been a research topic for many years (McCallum and Li, 2003; Nadeau and Sekine, 2007; Ratinov and Roth, 2009; Ling and Weld, 2012). However, as noted by Finkel and Manning (2009), many previous works ignored overlapping mentions, although they are quite common. Figure 1 illustrates some examples of overlapping mentions adapted from existing datasets. For example, the location mention "Pennsylvania" appears within the organization mention "a Pennsylvania radio station". In practice, overlapping mentions have been found in many existing datasets across different domains (Doddington et al., 2004; Suominen et al., 2013). Developing algorithms that can effectively and efficiently extract overlapping mentions can be crucial for the performance of many downstream tasks such as relation extraction (Mintz et al., 2009; Gupta and Andrassy, 2016), event extraction (Lu and Roth, 2012; Li et al., 2013; Nguyen et al., 2016), coreference resolution (Chang et al., 2013), question answering (Mollá et al., 2007), and equation parsing (Roy et al., 2016).
Overlapping mention recognition is non-trivial, as existing methods that model mention recognition as a sequence prediction problem, e.g., using linear-chain conditional random fields (CRF) (Lafferty et al., 2001), have difficulties in handling overlapping mentions (Alex et al., 2007). Finkel and Manning (2009) proposed to use a tree-based constituency parsing model to handle nested entities. Due to the tree-structured representation used, the inference procedure of the resulting algorithm has a time complexity that is cubic in $n$, with $n$ being the number of words in the sentence. This makes the algorithm less scalable than models such as the linear-chain CRF, whose complexity is linear in $n$. Lu and Roth (2015) proposed an alternative approach whose time complexity is linear in $n$. Their method differs from the conventional sequence labeling approach in that it uses a hypergraph representation.
In this work, we make an observation that there exists an efficient model for recognizing overlapping mentions while still regarding the problem as a sequence labeling problem. As opposed to the conventional approach where we assign labels to natural language words, in our new approach we assign labels to the gaps between words, modeling the mention boundaries instead of modeling the role of words in forming mentions. Furthermore, while these gap-based labels can be modeled using conventional graphical models like linear-chain CRFs, we also propose a novel multigraph representation to utilize such gap-based labels efficiently. To the best of our knowledge, this is the first structured prediction model utilizing a gap-based annotation scheme to predict overlapping structures.
In this paper we make the following major contributions:
• We propose a set of mention separators which can be collectively used to define all possible mention combinations together with a novel multigraph representation, on top of which efficient and exact inference can be performed.
• Theoretically, we show that unlike a recently proposed state-of-the-art model that we compare against, our model does not exhibit the spurious structures issue in its learning procedure. On the other hand, it still maintains the same inference time complexity as the previous model.
• Empirically, we show that our model is able to achieve higher $F_1$-scores compared to previous models in multiple datasets.
We believe our proposed approach and the novel representations can be applied in other research problems involving predicting overlapping structures, and we hope this work can inspire further research along such a direction.

Related Work
NER or mention detection is normally regarded as a chunking task similar to base noun phrase chunking (Kudo and Matsumoto, 2001; Shen and Sarkar, 2005), and hence the entities or mentions are usually represented in a similar way, using BILOU (Beginning, Inside, Last, Outside, Unit-length mention) or the simpler BIO annotation scheme (Ratinov and Roth, 2009). As a chunking task, it is commonly modeled using sequence labeling models, such as the linear-chain CRF (Lafferty et al., 2001), which has time complexity $O(nT^2)$, with $n$ being the number of words in the sentence and $T$ the number of mention types.
On the task of recognizing mentions that may overlap with one another, one of the earliest works that attempted to regard this task as a structured prediction task was by McDonald et al. (2005). They represented entity mentions as top-k predictions with positive score from a structured multilabel classification model. Their model has a time complexity of $O(n^3 T)$. Alex et al. (2007) proposed a cascading approach using multiple linear-chain CRF models, each handling a subset of all the possible mention types, where the models which come later in the pipeline have access to the predictions of the models earlier in the pipeline. This results in a time complexity of roughly $O(nT)$, depending on how the pipeline was designed. Finkel and Manning (2009) later proposed a constituency parser to handle nested entities by converting each sentence into a tree, in which each mention is represented as one of the subtrees. Their model has the standard time complexity for a constituency parser with binary grammar: $O(n^3 |G|)$, where $|G|$ is the size of the grammar, which in this case is proportional to $T$ in the best case, and $T^3$ in the worst case. They showed that their model outperforms a semi-CRF baseline (Sarawagi and Cohen, 2004) in terms of $F_1$-score.
Recently, Lu and Roth (2015) proposed a hypergraph-based model called mention hypergraph that is able to handle overlapping mentions with a linear time complexity $O(nT)$. The model was shown to achieve competitive results compared to previous models on standard datasets. As we will be making extensive comparisons against this previous state-of-the-art model, we will describe this approach in the next section.

Mention Hypergraph
In the mention hypergraph model of Lu and Roth (2015), nodes and directed hyperedges are used together to encode mentions and their combinations. The following five types of nodes are used at the position $k$ of a sentence:
• $A^k$ denotes all mentions starting at $k$ or later,
• $E^k$ denotes all mentions starting at $k$,
• $T^k_t$ denotes all mentions (type $t$) starting at $k$,
• $I^k_t$ denotes all mentions (type $t$) covering $k$,
• $X$ denotes the end of a mention (leaf node).
Different hyperedges connecting these nodes are used to represent how the semantics of a node is composed from those of its child nodes.
Figure 2: (left) An example mention hypergraph encoding two overlapping mentions. (right) An example of spurious structure.
Specifically, each $A^k$ is connected to $A^{k+1}$ and $E^k$ through the hyperedge $A^k \rightarrow (A^{k+1}, E^k)$, denoting the fact that the set of mentions that start at $k$ or later is the union of the set of mentions that start at $k+1$ or later and the set of mentions that start at $k$. Each $E^k$ is connected to $T^k_1, T^k_2, \ldots, T^k_T$ through a hyperedge, denoting the fact that the mentions that start at $k$ must be one of the $T$ types. Each $T^k_t$ can be connected to $I^k_t$ through an edge (denoting there is a mention of type $t$ that starts at the $k$-th token) or to $X$ through another edge (denoting there are no mentions of type $t$ that start at the $k$-th token). Each $I^k_t$ can be connected to $I^{k+1}_t$ (denoting there is a mention continuing to the next token), to $X$ (denoting there is a mention ending here), or to both (with a single hyperedge, denoting that the two cases above occur at the same time, a case of overlapping mentions).
In this mention hypergraph, each possible mention is represented as a path from a T-node to the X-node through a sequence of I-nodes (each denoting the words which are part of the mention), and the set of all mentions present in a given sentence forms a hyperpath from the root node $A^0$ to the leaf node $X$. Figure 2 shows how the mention hypergraph represents the two mentions in the phrase "the human TCF-1 protein", which are "TCF-1" and "human TCF-1 protein". The edges $T^1 - I^1$ and $T^2 - I^2$ respectively denote that the words "human" and "TCF-1" are the beginning of a mention, and the edges from the I-nodes to the X-node define the end of the mentions. We remark that any mention hypergraph which encodes the mentions in a sentence, like this example, forms a hyperpath from the root node $A^0$ to the leaf node $X$, where a hyperpath is defined as a subgraph of a hypergraph with the property that each node has exactly one outgoing (hyper)edge except the last node, and the root node is connected to all nodes.
We refer the readers to Lu and Roth (2015) for more details on the model.

Spurious Structures
The mention hypergraph is trained by maximizing the likelihood of the training data, similar to training a linear-chain CRF. Recall that the likelihood of the training data can be calculated by taking the score of the correct structures and dividing it by the normalization term, which is the total score of all possible structures. Lu and Roth (2015) used a dynamic programming algorithm to calculate the normalization term. However, the normalization term calculated this way contains additional terms, which we call the spurious structures. This leads to the following:
Theorem 3.1. Let $Z'$ be the normalization term as calculated using the forward-backward algorithm on the mention hypergraph, and let $Z$ be the true normalization term. Then we have $Z' > Z$.
Due to space limitations, we provide only a proof sketch here. We refer the reader to the supplemental material for the details on spurious structures.
Proof sketch. First note that $Z'$ includes all possible hyperpaths, so $Z' \geq Z$. Next, due to the presence of a node with multiple parents (e.g., node $I^2$ in Figure 2 (left)), $Z'$ includes the score of that node multiple times with different children, which results in a subgraph which is not a hyperpath. For example, $Z'$ includes the score of the structure shown in Figure 2 (right), where node $I^2$ has two children, and so it is not a hyperpath. Since $Z$ is the sum of all hyperpaths, this structure is not part of $Z$, but it is included in $Z'$, so $Z' > Z$.
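The inequality can also be written out term by term. The following is only a sketch, writing $Z$ for the true normalization term and $Z'$ for the one computed by dynamic programming, and it rests on assumptions that are ours rather than stated above: scores are log-linear, so every structure contributes a strictly positive term, and $\mathcal{H}(x)$, $\mathcal{S}(x)$ and $s(\cdot)$ are hypothetical names for the set of valid hyperpaths of a sentence $x$, the set of spurious structures that the dynamic program also sums over, and the total score of a structure.

```latex
\begin{align*}
Z      &= \sum_{h \in \mathcal{H}(x)} \exp\big(s(h)\big), \\
Z'     &= Z + \sum_{g \in \mathcal{S}(x)} \exp\big(s(g)\big), \\
Z' - Z &= \sum_{g \in \mathcal{S}(x)} \exp\big(s(g)\big) > 0
          \quad \text{whenever } \mathcal{S}(x) \neq \emptyset .
\end{align*}
```

Under these assumptions, the probability assigned to a gold structure $y$ becomes $\exp(s(y)) / Z'$, which is strictly smaller than $\exp(s(y)) / Z$, so part of the probability mass goes to structures that do not correspond to any mention combination.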
Later we will see how this issue may affect the model's performance in predicting mentions.

Mention Separators
We now describe the mention separators which can be used to encode overlapping mentions in a sentence. Traditional encoding schemes that associate labels with words, such as the BIO scheme, attach the semantics of the labels to the role of the words in forming mentions. For example, the label B in the BIO scheme denotes the role of the word it is attached to, which is the first word of a mention. The BIO scheme cannot be used directly to encode overlapping mentions, since it only encodes whether a word is part of a mention and possibly its position in the mention. We notice that by encoding the mention boundaries instead, we can represent overlapping mentions. This can be accomplished by assigning what we call mention separators to the gaps between two words.
At each gap, we consider eight possible types of mention separators based on the combination of the following three cases:
1. A mention is starting at the next word (S)
2. A mention is ending at the previous word (E)
3. A mention is continuing to the next word (C)
Therefore, for each gap, the possible combinations of cases are as follows: ECS, EC, CS, C, ES, E, S, and X, where X means none of the three cases applies. For example, the separator EC means there is a mention ending at the word before the gap and another (overlapping) mention continuing to the word after it. Note that there might be more than just two mentions involved here. Figure 3 shows an illustration of these separators, and Figure 4a shows how they can be used to encode the example in Figure 2. Now we prove that the following theorem holds:
Theorem 4.1. For any combination of mentions in a sentence, there is exactly one sequence of mention separators that encodes it.
Proof. Consider the gap between any two adjacent words in the sentence. The combination of mentions present in the sentence uniquely defines what mention separator is associated with this gap. If there is a mention starting at the next word, then case S applies. Similarly, if there is a mention ending at the previous word, case E applies. And finally, if there is a mention covering both words, case C applies. By combining the cases, we get the corresponding mention separator for this gap. In this way, each gap in the sentence has a unique mention separator, which in turn defines the unique sequence of mention separators.
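The construction used in the proof can be sketched in a few lines of Python. This is a minimal illustration for a single mention type, not the authors' implementation: the function name `encode_separators`, the 0-based inclusive `(start, end)` span format, and the convention that gap `g` lies between word `g-1` and word `g` are all assumptions made here.

```python
def encode_separators(num_words, mentions):
    """Return the num_words + 1 mention separators for one mention type.

    mentions: a list or set of (start, end) word indices, inclusive, 0-based.
    Gap g lies between word g-1 and word g (gap 0 precedes the first word).
    """
    separators = []
    for gap in range(num_words + 1):
        # Case E: some mention ends at the word just before this gap.
        ends = any(end == gap - 1 for _, end in mentions)
        # Case C: some mention covers both the word before and the word after.
        continues = any(start < gap <= end for start, end in mentions)
        # Case S: some mention starts at the word just after this gap.
        starts = any(start == gap for start, _ in mentions)
        label = ("E" if ends else "") + ("C" if continues else "") + ("S" if starts else "")
        separators.append(label or "X")
    return separators


# "the human TCF-1 protein": "human TCF-1 protein" = (1, 3), "TCF-1" = (2, 2)
print(encode_separators(4, [(1, 3), (2, 2)]))  # ['X', 'S', 'CS', 'EC', 'E']
```

Because each gap is labeled by a deterministic test on the mention set, the same input always yields the same sequence. Different mention sets can still collide: in this sketch the nested pair `[(0, 2), (1, 1)]` and the crossing pair `[(0, 1), (1, 2)]` over three words both encode to `['S', 'CS', 'EC', 'E']`.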
Note that the converse of Theorem 4.1 is not true, as multiple mention combinations might encode to the same sequence of mention separators. Now we describe two ways the mention separators can be used to encode overlapping mentions.
STATE-based The first is by directly using these mention separators to replace the standard mention encoding scheme (e.g., BIO encoding) in a standard linear-chain CRF. That is, we assign each mention separator to a state in a linear-chain CRF model. Since this model encodes the gaps between words and also the gap before the first word and after the last word, a sentence with $n$ words is modeled by a sequence of $n + 1$ mention separators. Since each sequence of mention separators can only encode mentions of the same type, we support multiple types by using multiple sequences, one for each mention type.
EDGE-based Now, we propose a novel way of utilizing these mention separators. Since the mention separators encode the gaps between words, it is more intuitive to assign the mention separators to the edges of a graphical model, as opposed to the states, as described in the previous paragraph. To do this, we need to define the states of the models in such a way that all possible sequences of mention separators are accounted for. For this purpose we assign two states to each word at position $k$:
• $I_k$: word at $k$ is part of a mention,
• $O_k$: word at $k$ is not part of any mention.
Next we define the edges between the states according to the eight possible mention separators between adjacent words. More specifically, each mention separator is mapped to an edge connecting one state in the current position to another state in the next position, depending on whether the separator defines the current and next word as part of a mention, so in total we have eight edges between two positions in the model. Some mention separators may connect the same two states; for example, the ES and C separators both connect $I_k$ to $I_{k+1}$, since in both cases the current word and the next word are part of a mention. In those cases, we simply define multiple edges between the pair of states. The resulting graph, where there can be multiple edges between two states, is known in the graph theory literature as a multigraph.
Figure 4: Our mention separator model with the EDGE representation encoding two phrases.
Figure 5: The full graph in EDGE-based model.
The first I- and O-nodes in the sentence are connected to the root node, and the last I- and O-nodes are connected to the unique leaf node X. Figure 4a shows how the EDGE-based model encodes the two mentions "human TCF-1 protein" and "TCF-1" in the phrase "the human TCF-1 protein", and Figure 4b shows the encoding of the phrase found in the second example in Figure 1. Note how each edge maps to a distinct mention separator visualized in the text in red. Figure 5 shows the full graph of our EDGE-based model, in a format similar to the trellis graph for linear-chain CRFs in Figure 6. We remark that the EDGE-based model can be seen as an extension of linear-chain CRFs, with additional semantics attached to the edges. Also note that this graph encodes only one mention type. To support multiple types, similar to the STATE-based approach we can use multiple chains, one for each type.
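The mapping from separators to multigraph edges can be made concrete with a short Python sketch. It is an illustration only: the names `edge_for` and `states_along_path` are hypothetical, and treating the root and the leaf as if they were "O" states (so that the first separator must be X or S and the last must be X or E) is a simplifying assumption of this sketch.

```python
from collections import Counter

SEPARATORS = ["ECS", "EC", "CS", "C", "ES", "E", "S", "X"]


def edge_for(separator):
    """Map a separator to the (current state, next state) pair it connects."""
    # The current word is inside a mention if one ends at it or continues past it.
    source = "I" if ("E" in separator or "C" in separator) else "O"
    # The next word is inside a mention if one starts at it or continues into it.
    target = "I" if ("S" in separator or "C" in separator) else "O"
    return source, target


def states_along_path(separators):
    """Return the I/O state of every word, or raise if the edges do not chain."""
    states = []
    previous = "O"  # assumption: the root behaves like an "O" state
    for separator in separators:
        source, target = edge_for(separator)
        if source != previous:
            raise ValueError(f"{separator!r} cannot follow state {previous!r}")
        states.append(target)
        previous = target
    if states.pop() != "O":  # assumption: the leaf behaves like an "O" state
        raise ValueError("the last separator must not open or continue a mention")
    return states


# Eight edges between two positions; five of them are parallel I -> I edges.
print(Counter(edge_for(sep) for sep in SEPARATORS))
print(states_along_path(["X", "S", "CS", "EC", "E"]))  # ['O', 'I', 'I', 'I']
```

The parallel I-to-I edges are what make the structure a multigraph rather than a simple graph: the two states alone cannot distinguish C from ES, so the separator identity has to live on the edge.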
Note that the edges in our EDGE-based representations are directed, with nodes on the left serving as parents to the nodes on the right. Such directed edges will be helpful when performing inference, to be discussed in the next section.
We remark that the way we utilize the multigraph in the EDGE-based model can also be applied to the discontiguous mention model (DMM) by Muis and Lu (2016). In fact, it can be shown that the number of canonical structures as calculated in the supplementary material of the DMM paper matches the number of possible paths in our multigraph-based model, as the transition matrix in DMM corresponds to the number of possible transitions from one position to the next position, which is regarded as a lattice where edges are associated with labels.

Training, Inference and Decoding
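We follow the log-linear approach to define our model, using the regularized log-likelihood of the training data $\mathcal{D}$ as our objective function, as follows:
$$\mathcal{L}(\mathcal{D}) = \sum_{(x,y) \in \mathcal{D}} \Big[ \sum_{e \in \mathcal{E}(x,y)} w^\top f(e) - \log Z_w(x) \Big] - \lambda \lVert w \rVert^2$$
Here, $(x, y)$ is a training instance consisting of the sentence $x$ and the correct output $y$, $\mathcal{E}(x, y)$ is the set of edges in the structure that encodes $y$ for $x$, $w$ is the weight vector, $f(e)$ is the feature vector defined over the edge $e$, $Z_w(x)$ is the normalization term, and $\lambda$ is the $l_2$-regularization parameter. The objective function is then optimized until convergence using L-BFGS (Liu and Nocedal, 1989).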
We note the mention hypergraph model also defines the objective in a similar manner. For both of our models, the inference is done based on a generalized inside-outside algorithm. Both models involve directed structures, on top of which the inference algorithm first calculates the inside score for each node from the leaf node to the root, and then the outside score from the root to the leaf node, in very much the same way as inference is done in a classic graphical model. Specifically, for our EDGE-based model, the inside scores are calculated using a bottom-up (right-to-left) dynamic programming procedure, where we calculate the inside score at each node by summing up the scores associated with each path connecting the current node to one of its child nodes. Each such path score is defined as the product of the inside score stored in that child node and the score defined over the edge connecting them. The computation of the outside scores can be done in an analogous manner from left to right. It can be verified that the time complexity of this inference procedure for our model is $O(nT)$, which is the same as that of the mention hypergraph model. Note, however, that neither of our models has the spurious structures issue, as for any path in these models there are no nodes with multiple incoming edges.
During decoding, we perform MAP inference using a max-product procedure that is analogous to how the Viterbi decoding algorithm is used in conventional tree-structured graphical models to find the highest-scoring subgraph, from which we extract mentions through what we call the interpretation process. As noted in the previous section, there could be multiple mention combinations that correspond to the same sequence of mention separators, which presents an ambiguity during the interpretation process.
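The right-to-left inside pass and its max-product counterpart can be sketched for a single mention type as follows. This is a minimal sketch in log space, not the authors' Java implementation: the function names, the `scores[g][sep]` input format (one log-score per separator per gap, standing in for $w^\top f(e)$), and the treatment of root and leaf as "O" states are assumptions of this example.

```python
import math

NEG_INF = float("-inf")
STATES = ("I", "O")
# Separator -> (state of the word before the gap, state of the word after it).
EDGES = {
    "X": ("O", "O"), "S": ("O", "I"), "E": ("I", "O"),
    "C": ("I", "I"), "CS": ("I", "I"), "EC": ("I", "I"),
    "ECS": ("I", "I"), "ES": ("I", "I"),
}


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of log-scores."""
    top = max(values)
    if top == NEG_INF:
        return NEG_INF
    return top + math.log(sum(math.exp(v - top) for v in values))


def inside_log_z(scores):
    """Right-to-left inside pass; returns the log of the normalization term."""
    inside = {"O": 0.0, "I": NEG_INF}  # assumption: the leaf acts like an "O" state
    for gap_scores in reversed(scores):
        inside = {
            state: logsumexp([gap_scores[sep] + inside[dst]
                              for sep, (src, dst) in EDGES.items() if src == state])
            for state in STATES
        }
    return inside["O"]  # assumption: the root acts like an "O" state


def viterbi(scores):
    """Max-product counterpart: best log-score and its separator sequence."""
    best = {"O": (0.0, []), "I": (NEG_INF, [])}
    for gap_scores in reversed(scores):
        best = {
            state: max(((gap_scores[sep] + best[dst][0], [sep] + best[dst][1])
                        for sep, (src, dst) in EDGES.items() if src == state),
                       key=lambda candidate: candidate[0])
            for state in STATES
        }
    return best["O"]


# Two words give three gaps; with all log-scores at 0, exp(log Z) counts the paths.
uniform = [{sep: 0.0 for sep in EDGES} for _ in range(3)]
print(round(math.exp(inside_log_z(uniform))))  # 8
```

Each gap touches a constant number of edges (eight), so one pass is linear in the number of gaps; running one chain per mention type gives the $O(nT)$ behavior discussed above. Since every node's inside score is a sum over distinct outgoing edges of a chain-structured graph, each complete path is counted exactly once in this sketch.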
For these ambiguous cases, we implemented the same interpretation process as was done in the mention hypergraph model, which is to resolve ambiguous structures as nested mentions. For other cases, there is exactly one way to interpret the structure. For example, in Figure 4b, although there is only one gap marked as a starting position (S) and two gaps marked as ending positions (EC and E), the interpretation is clear that the two mentions here are "IL2" and "IL2 regulatory region".
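One way to turn a decoded separator sequence back into mentions, preferring nested readings, is sketched below. The exact interpretation procedure is not spelled out in this section, so this is only one possible heuristic and not necessarily the one used in the experiments: the function name `interpret_nested`, the stack of open start positions, and the tie-breaking rules in the comments are all assumptions of this sketch.

```python
def interpret_nested(separators):
    """Recover (start, end) mentions of one type from a separator sequence.

    Hypothetical heuristic: keep a stack of start positions that are still
    open, and whenever something ends while something else continues, close
    the innermost (most recently opened) start first.
    """
    mentions = set()
    open_starts = []
    for gap, separator in enumerate(separators):
        last_word = gap - 1
        if "E" in separator:
            if not open_starts:
                raise ValueError("separator E found without an open mention")
            if "C" in separator:
                # A mention ends here and another continues. Close the innermost
                # start, but keep it open when it is the only one, since the
                # continuing mention must then share that start.
                if len(open_starts) > 1:
                    start = open_starts.pop()
                else:
                    start = open_starts[-1]
                mentions.add((start, last_word))
            else:
                # Nothing continues past this gap, so every open start closes.
                mentions.update((start, last_word) for start in open_starts)
                open_starts = []
        if "S" in separator:
            open_starts.append(gap)
    return sorted(mentions)


print(interpret_nested(["X", "S", "CS", "EC", "E"]))  # [(1, 3), (2, 2)]
print(interpret_nested(["S", "EC", "C", "E"]))        # [(0, 0), (0, 2)]
```

With word indices starting at 0, the second call mirrors the "IL2 regulatory region" example: a single S gap yields both (0, 0) for "IL2" and (0, 2) for the full phrase. For the ambiguous sequence S, CS, EC, E this heuristic returns the nested pair (0, 2) and (1, 1) rather than the crossing pair (0, 1) and (1, 2).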

Datasets
To assess our model's capability in recognizing overlapping mentions and make comparisons with previous models, we looked at datasets where overlapping mentions are explicitly annotated. Following the previous work (Lu and Roth, 2015), our main results are based on the standard ACE-2004 and ACE-2005 datasets (Doddington et al., 2004). We additionally looked at the GENIA dataset, which was used in the previous works (Finkel and Manning, 2009; Lu and Roth, 2015).
For ACE datasets, we used the same splits as used in our previous work (Lu and Roth, 2015), published on our website. For GENIA, we used GENIAcorpus3.02p, which comes with POS tags for each word (Tateisi and Tsujii, 2004). Following previous works (Finkel and Manning, 2009; Lu and Roth, 2015), we first split off the last 10% of the data as the test set. Next we used the first 80% and the subsequent 10% for training and development, respectively. We made the same modifications as described by Finkel and Manning (2009) by collapsing all DNA, RNA, and protein subtypes into DNA, RNA, and protein, keeping cell line and cell type, and removing other mention types, resulting in 5 mention types. The statistics of each dataset are shown in Table 1. We can see that overlapping mentions are common in such datasets.
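The GENIA split described above amounts to slicing the corpus in its original order. The sketch below is illustrative only: the unit being split (sentences or documents) and the rounding at the boundaries are not specified here, so the integer arithmetic and the name `split_80_10_10` are assumptions.

```python
def split_80_10_10(items):
    """Split a list, kept in corpus order, into train / development / test."""
    n = len(items)
    train_end = n * 8 // 10  # first 80% for training
    dev_end = n * 9 // 10    # subsequent 10% for development
    return items[:train_end], items[train_end:dev_end], items[dev_end:]


train, dev, test = split_80_10_10(list(range(10)))
print(len(train), len(dev), len(test))  # 8 1 1
```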
For more details on the dataset preprocessing, please refer to the supplemental material.

Features
For models that fall under the edge-based paradigm (mention hypergraph and our model), we define features over the edges in the models. Features are defined as string concatenations of input features (information extracted over the inputs, such as the current word and the POS tags of surrounding words) and output features (structured information extracted over the output structure). We carefully defined the input and output features in a way that allows us to make use of the identical set of features for both our mention separator model and the baseline mention hypergraph model, in order to make a proper comparison. We also followed Lu and Roth (2015) to add the additional mention penalty feature for our model and all baseline approaches so that we are able to tune $F_1$-scores on the development set. Roughly speaking, the weight of this feature controls how confident the model should be in predicting more mentions. In other words, this is a way to balance the precision and recall of the model. When defining the input features for both our model and the mention hypergraph model, we implemented the features used by previous works in each dataset based on the descriptions in their papers: we followed Lu and Roth (2015) for the features used in ACE datasets, and Finkel and Manning (2009) for features used in the GENIA dataset. In general, they include surrounding words, surrounding POS tags, bag-of-words, Brown clusters (for GENIA only), and orthographic features. See the supplemental material for more details.
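The idea of building edge features by concatenating input and output features can be illustrated with a small sketch. Everything specific here is hypothetical: the function name `edge_features`, the particular templates (current, previous and next word, current POS tag), the `<PAD>` token, and the `type:separator` output string are placeholders chosen for illustration, not the feature set used in the experiments.

```python
def edge_features(words, pos_tags, position, separator, mention_type):
    """Concatenate each input feature with the output feature of one edge."""
    def at(sequence, index):
        # Out-of-range positions map to a padding token (an assumption).
        return sequence[index] if 0 <= index < len(sequence) else "<PAD>"

    input_features = [
        f"word[0]={at(words, position)}",
        f"word[-1]={at(words, position - 1)}",
        f"word[+1]={at(words, position + 1)}",
        f"pos[0]={at(pos_tags, position)}",
    ]
    # Output feature: which separator this edge carries, for which mention type.
    output_feature = f"{mention_type}:{separator}"
    return [f"{inp}|{output_feature}" for inp in input_features]


words = ["the", "human", "TCF-1", "protein"]
tags = ["DT", "JJ", "NN", "NN"]
print(edge_features(words, tags, 1, "S", "protein")[0])  # word[0]=human|protein:S
```

Keeping the input part independent of the model and only swapping the output part is one way the same input templates could be shared between two edge-based models.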

Experimental Setup
We trained each model on the training set, then tuned the $l_2$-regularization parameter based on the development set. For GENIA experiments, we also tuned the number of Brown clusters. Following Lu and Roth (2015), we also used each development set to tune the mention penalty to optimize the $F_1$-score and report the scores on the corresponding test sets separately. Similar to Finkel and Manning (2009), as another baseline model we also trained a standard linear-chain CRF using the BILOU scheme. Although this model does not support overlapping mentions, it gives us a baseline to see the extent to which our model's ability to recognize overlapping mentions can help the overall performance. There is also a simple extension of this linear-chain CRF model that can support overlapping mentions of different types by considering each type separately using multiple chains, one for each type. We call this multiple-chain variant LCRF (multiple) and the earlier standard approach LCRF (single). In all models, we also implement the mention penalty feature, adapted accordingly so that increasing the feature weight will increase the number of mentions predicted by the model. See the supplemental material for more details.
We implemented all models in Java, and also made additional comparisons on running time by running them on the same machine. In addition, we analyzed the convergence rate for different models.

Results and Discussion

Results on ACE
Table 2 shows the results on the ACE datasets, and these are our main results. Following previous works (Finkel and Manning, 2009; Lu and Roth, 2015), we report standard precision (P), recall (R), and $F_1$-score as percentages. The highest results ($F_1$-score) and those results that are not significantly different from the highest results are highlighted in bold (based on the bootstrap resampling test (Koehn, 2004), where p > 0.01). For ACE datasets, we make comparisons with the two versions of the linear-chain CRF baseline: LCRF (single), which does not support overlapping mentions at all, and LCRF (multiple), which does not support overlapping mentions of the same type, as well as our implementation of the mention hypergraph baseline (Lu and Roth, 2015).
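For reference, precision, recall, and $F_1$ over sets of predicted and gold mentions can be computed as in the sketch below. It assumes an exact-match criterion in which a predicted mention counts as correct only if its sentence, boundaries, and type all match a gold mention; that criterion, the tuple format, and the function name are assumptions of this example rather than details given above.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match P, R and F1 over collections of hashable mention tuples."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Mentions as (sentence id, start, end, type); the values are made up.
gold = [(0, 1, 3, "protein"), (0, 2, 2, "protein"), (1, 0, 0, "DNA")]
predicted = [(0, 1, 3, "protein"), (1, 0, 1, "DNA")]
p, r, f1 = precision_recall_f1(gold, predicted)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")  # P=0.50 R=0.33 F1=0.40
```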
From such empirical results we can see that our proposed model using mention separators consistently yields significantly better results (p < 0.01) than the mention hypergraph model across these two datasets, under two setups (whether to optimize $F_1$-score or not). Specifically, when the state-based approach is used (STATE), our approach is able to obtain a much higher recall, resulting in improved $F_1$-score. Empirically, we found this approach was also faster than the LCRF baseline approach in terms of the number of words processed each second (w/s) during decoding, which is expected, since STATE uses fewer tags (there are eight tags in STATE and nine in LCRF). The edge-based approach (EDGE) using our proposed multigraph representation is able to achieve a significant speedup in comparison with the state-based approach. Although this model is still about 50% slower than the mention hypergraph model (both models have the same time complexity, but they differ by a constant factor), it yielded a significantly higher $F_1$-score (up to 3.6 points higher on ACE-2004 before optimizing $F_1$-score). These results largely confirm the effectiveness of our proposed mention separator model and the usefulness of the multigraph representation for learning the model. As expected, the LCRF baselines yield relatively lower results compared to the other models, since LCRF (single) cannot predict any overlapping mentions, while LCRF (multiple) cannot predict overlapping mentions of the same type. However, such results give us some idea of how much performance we can gain by properly recognizing overlapping mentions: comparing against LCRF (single), the gain can be up to 9.7 points in $F_1$-score on ACE-2004. We can also see the gain from recognizing overlapping mentions of the same type by looking at the results of LCRF (multiple), which can be up to 5.3 points in $F_1$-score on ACE-2004.

Results on GENIA
Table 3 shows the results of running the models with $F_1$-score tuning on the GENIA dataset. All models include Brown clustering features learned from PubMed abstracts. Besides the mention hypergraph baseline, we also make comparisons with the system of Finkel and Manning (2009), which can also support overlapping mentions.
We see that the mention hypergraph model matches the performance of the constituency parser-based model of Finkel and Manning (2009), while our models based on mention separators yield significantly higher scores (p < 0.05) than all other baselines (except LCRF (multiple), which we will discuss shortly). There are two observations worth mentioning: (1) the absolute difference between the $F_1$-scores of our models and those of the baseline models in GENIA is much smaller than that in ACE datasets, and (2) the LCRF (multiple) model in the GENIA dataset can achieve higher scores compared to other more complex baseline models, although LCRF (multiple) does not support overlapping mentions of the same type. We suspect that these two observations are due to the small proportion of overlapping mentions in GENIA (18%, as compared to >40% in ACE datasets, see Table 1). To investigate this, we conduct a few more sets of experiments.

Further Experiments
On different types of sentences: As these datasets consist of both overlapping and non-overlapping mentions, to further understand the model's effectiveness in recognizing overlapping mentions (and non-overlapping mentions), we performed some additional experiments on the mention hypergraph model and our model. (We also performed this on other models; due to space constraints, we do not include the results here. See the supplemental material for more details.) Specifically, we split the test data into two portions: one that consists only of sentences that contain overlapping mentions (O) and one that consists of those that do not (Ø). The results are shown in Table 4. We can see that in ACE datasets, our model achieves higher $F_1$-scores compared to the mention hypergraph for both portions, but it achieves slightly lower results in the GENIA dataset for the portion that contains overlapping mentions. We believe that our models learn parameters so as to obtain an optimal overall performance, and since the proportion of the overlapping mentions in GENIA is much smaller compared to that in ACE datasets, it learns to focus more on the non-overlapping mentions. This is supported by the fact that the difference of $F_1$-score between the mention hypergraph model and our model in GENIA is larger compared to the difference in ACE. These results also lead to the interesting empirical finding that our model appears to be able to do well also on recognizing non-overlapping mentions. This motivates us to conduct the next set of experiments.
On data without overlapping mentions: We also performed one additional set of experiments, on the standard CoNLL-2003 dataset (Tjong Kim Sang and De Meulder, 2003), which has no overlapping mentions.
The results (without optimizing $F_1$-score) are shown in Table 5. We see that our models based on mention separators outperform baseline models such as the Illinois NER system where external resources are not used (Ratinov and Roth, 2009), and a linear-chain CRF model, although the linear-chain CRF baseline models some interactions between distinct mention types and our models do not. Such results also suggest that modeling the interactions between distinct mention types may not be crucial to get a good performance in mention recognition. This is further corroborated by the result of LCRF (multiple), which is higher than the result of LCRF (single) by about 0.5 points. When comparing our model against the mention hypergraph model, we note that our model consistently yields a higher recall. We speculate this is due to the fact that as our model does not exhibit the issue of spurious structures we discussed in Section 3.1, it is more confident in making its predictions.
On convergence: We also empirically analyzed the convergence properties of the two models. As illustrated in Figure 7, which shows how the objective improves as training progresses on ACE-2004, GENIA, and CoNLL-2003, we found that our EDGE-based model requires significantly fewer iterations to converge than the mention hypergraph on the former two datasets, which contain overlapping mentions. We believe it is possible that this slower convergence is due to the spurious structures issue in mention hypergraphs, which causes the objective function to be more complex to optimize. However, some further analyses on the convergence issue and the impact of different ways of exploiting features (over different hyperedges) for the hypergraph-based models are needed.

Conclusion and Future Work
We proposed the novel mention separators for mention recognition where mentions may overlap with one another. We also proposed two ways these mention separators can be utilized to encode overlapping mentions, where one of them utilizes a novel multigraph-based representation. We showed that by utilizing mention separators, we can get better recognition results compared to previous models, and by utilizing the multigraph representation, we can maintain a good inference speed, albeit still slower than the mention hypergraph model. We also performed theoretical analysis on the model and showed that our model does not present the spurious structures issue associated with a previous state-of-the-art model, while still keeping the same inference time complexity.
Future work includes further investigations on how to apply the multigraph approach to other structured prediction tasks, as well as applications of the proposed model in other related NLP tasks that involve the prediction of overlapping structures, such as equation parsing (Roy et al., 2016).
The code used in this paper is available at http://statnlp.org/research/ie/.