Neural Segmental Hypergraphs for Overlapping Mention Recognition

In this work, we propose a novel segmental hypergraph representation to model overlapping entity mentions that are prevalent in many practical datasets. We show that our model built on top of such a new representation is able to capture features and interactions that cannot be captured by previous models while maintaining a low time complexity for inference. We also present a theoretical analysis to formally assess how our representation is better than alternative representations reported in the literature in terms of representational power. Coupled with neural networks for feature learning, our model achieves the state-of-the-art performance in three benchmark datasets annotated with overlapping mentions.


Introduction
One of the most crucial steps towards building a natural language understanding system is the identification of basic semantic chunks in text. Such a task is typically characterized by the named entity recognition task (Grishman, 1997; Tjong Kim Sang and De Meulder, 2003), or the more general mention recognition task, where mentions are defined as references to entities that could be named, nominal or pronominal (Florian et al., 2004). The extracted mentions can be used in various downstream semantic tasks, including question answering (Abney et al., 2000), relation extraction (Mintz et al., 2009; Liu et al., 2017), event extraction (Riedel and McCallum, 2011; Li et al., 2013), and coreference resolution (Soon et al., 2001; Ng and Cardie, 2002; Chang et al., 2013).
One popular approach to the task of mention extraction is to regard it as a sequence labeling problem, with the underlying primary assumption being that the mentions are non-overlapping spans in the text. However, as highlighted in several prior research efforts (Alex et al., 2007; Finkel and Manning, 2009; Lu and Roth, 2015), mentions may overlap with one another in practice. Thus, models based on such a simplified assumption may result in sub-optimal performance for a downstream task when they are deployed in practice. For example, consider the phrase "At the Seattle zoo, . . . " shown in Figure 1: the relation LOCATEDIN between the mentions "the Seattle zoo" (of type FACILITY) and "Seattle" (of type GPE: geo-political entity) will not be extracted unless both of these overlapping mentions are extracted. Similarly, there are 4 mentions of the same type (PROTEIN) in the text span ". . . PEBP2 alpha A1, alpha B1 . . . " taken from the biomedical domain. A downstream question answering system may fail to return the correct answer if the mention extraction system it relies on is unable to extract all these valid mentions.
Various approaches to extracting overlapping mentions have been proposed in the past decade. The cascaded approach (Alex et al., 2007) builds a pipeline of sequence labeling models using conditional random fields (CRF) (Lafferty et al., 2001). However, the model is unable to handle overlapping mentions of the same type. Finkel and Manning (2009) presented a parsing based approach to nested mention extraction. Due to the chart-based parsing algorithm involved, the model has a cubic time complexity in the number of words in the sentence. A recent approach by Lu and Roth (2015) introduced a hypergraph representation for capturing overlapping mentions, which was shown to be fast and effective. The work was improved by Muis and Lu (2017), who proposed a sequence labeling approach that assigns tags to gaps between words. However, both approaches suffer from the structural ambiguity issue during inference, as we will further discuss in this paper.
We summarize our contributions as follows:
1. We propose a novel segmental hypergraph representation that is capable of modeling arbitrary combinations of (potentially overlapping) mentions in a given sentence. The model has an O(cmn) time complexity (m is the number of mention types, n is the number of words in a sentence, and c is the maximal number of words for each mention), and is able to capture features that cannot be captured by existing approaches.
2. Theoretically, we show that our approach based on such a new representation does not have the limitations associated with some recently proposed state-of-the-art approaches for overlapping mention extraction.
3. We show through extensive experiments on standard data that by exploiting both word-level and span-level features learned from neural networks, our model is able to achieve the state-of-the-art performance for recognizing overlapping mentions.
Our model is also general and robust. Further experiments show that our model yields competitive results on data that does not have overlapping mentions annotated, when compared against other recently proposed state-of-the-art neural models that are capable of extracting non-overlapping mentions only.

Related Work

Overlapping Mention Recognition
One of the earliest research efforts on handling overlapping mentions is a rule-based approach (Zhou, 2006) that is evaluated on the GENIA dataset. The authors first detected the innermost mentions and then relied on rule-based postprocessing methods to identify overlapping mentions. McDonald et al. (2005) presented a multilabel classification algorithm to model overlapping segments in a sentence systematically. Alex et al. (2007) proposed several ways to combine multiple conditional random fields (CRF) (Lafferty et al., 2001) for such tasks. Their best results were obtained by cascading several CRF models in a specific order, where each model is responsible for detecting mentions of a particular type. Outputs of one model can also serve as features to the next model. However, such an approach cannot model overlapping mentions of the same type, which frequently appear in practice. Finkel and Manning (2009) approached this task from a parsing perspective by constructing a constituency tree, mapping each mention to a node in the tree. This approach assumes one mention is contained by the other when they overlap. While such an assumption largely holds in practice, it comes with a cost: the chart-based parser suffers from its cubic time complexity, making it not scalable to large datasets involving long sentences. Based on the same idea, Wang et al. (2018) proposed a scalable transition-based approach to construct a constituency forest (a collection of constituency trees).
Instead of relying on structured models, Xu et al. (2017) proposed a local classifier for each possible span. However, this local approach is unable to capture the interactions between spans. Similar to Alex et al. (2007), Ju et al. (2018) dynamically stacked multiple flat layers which recognize mentions sequentially from innermost mentions to outermost mentions.
Our work is inspired by the model of Lu and Roth (2015), who introduced a mention hypergraph representation for capturing overlapping mentions. Their model was shown to be fast and effective, and was improved by the mention separator model (Muis and Lu, 2017). However, we note that (as also highlighted in their papers) both models suffer from the structural ambiguity issue during inference, which we will discuss later. Our new representation does not have this limitation. Recently, Katiyar and Cardie (2018) also proposed a hypergraph-based representation based on the BILOU tagging scheme. Their model is trained greedily using neural networks by viewing the hypergraph construction procedure as a multi-label assignment process.

Neural Models for Mention Recognition
Recently, neural network based approaches to entity or mention recognition have received significant attention. Huang et al. (2015) replaced the CNN used in earlier work with a bidirectional long short-term memory network (LSTM) (Hochreiter and Schmidhuber, 1997). Strubell et al. (2017) proposed an iterated dilated CNN to improve computational efficiency. Beyond word-level compositions, several methods incorporated character-level compositions with character embeddings, either through CNN (Chiu and Nichols, 2016; Ma and Hovy, 2016) or LSTM (Lample et al., 2016).

Segmental Hypergraph
A segmental hypergraph is a representation that aims at representing all possible combinations of (potentially overlapping) mentions in a given sentence. It belongs to a class of directed hypergraphs (Gallo et al., 1993), where each hyperedge e consists of a single designated parent node (the head of e) and an ordered list of child nodes (the tail of e). Specifically, our segmental hypergraph consists of the following 5 types of nodes:
• $A_i$ encodes all mentions that start with the i-th or a later word
• $E_i$ encodes all mentions that start exactly with the i-th word
• $T^k_i$ represents all mentions of type k starting with the i-th word
• $I^k_{i,j}$ represents all mentions of type k that contain the j-th word and start with the i-th word
• $X$ marks the end of a mention.
Hyperedges connecting these nodes are designed to indicate how the semantics of a parent node can be re-expressed in terms of its child nodes. Figure 2 gives a partial segmental hypergraph representing all combinations of mentions within the span [i, i + 3] consisting of 4 words. There are 4 types of hyperedges:
• Each hyperedge $\{A_i \to (E_i, A_{i+1})\}$ from $A_i$ to its children implies the fact that $A_i$ consists of those mentions that either "start exactly with the i-th word" ($E_i$), or "start with a word that appears strictly after the i-th word" ($A_{i+1}$).
• Each hyperedge $\{E_i \to (T^1_i, \ldots, T^m_i)\}$ from $E_i$ to its children implies that we should consider all possible types for the mentions (possibly of length 0) that start with the i-th word.
• Two hyperedges $\{T^k_i \to (I^k_{i,i})\}$ and $\{T^k_i \to (X)\}$ from $T^k_i$ indicate that either there exists at least one mention starting with the i-th word (the former hyperedge), or there does not exist any such mention (the latter hyperedge).
• Three hyperedges $\{I^k_{i,j} \to (I^k_{i,j+1})\}$, $\{I^k_{i,j} \to (X)\}$, and $\{I^k_{i,j} \to (I^k_{i,j+1}, X)\}$ indicate the following three cases respectively: 1) both the j-th and (j + 1)-th words belong to at least one mention that starts with the i-th word, 2) there exists one mention that starts with the i-th word and ends with the j-th word, and 3) both cases are valid.
Essentially, the complete hypergraph compactly encodes the whole search space of all possible mentions that can ever appear within a sentence, where such mentions may or may not overlap with one another. When we traverse the complete segmental hypergraph by following the directions as specified by the hyperedges, selecting only one outgoing hyperedge at a time at each node, we arrive at a hyperpath, a rooted, directed substructure contained by the original hypergraph. Figure 3 shows an example. Here, "Israeli UN Ambassador" of type PERSON is captured by the following sequence of nodes (along a hyperpath): "$A_1, E_1, T^2_1, I^2_{1,1}, I^2_{1,2}, I^2_{1,3}, X$", while "Israeli UN Ambassador Danny" of type PERSON corresponds to the following node sequence: "$A_1, E_1, T^2_1, I^2_{1,1}, I^2_{1,2}, I^2_{1,3}, I^2_{1,4}, X$". Similarly, the sequence "$A_1, A_2, E_2, T^1_2, I^1_{2,2}, X$" represents the mention "UN" of type ORGANIZATION. As we can see, such node sequences together form a single hyperpath that encodes this specific combination of mentions that overlap with one another. More details on segmental hypergraphs and hyperpaths are in the supplementary material.
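To make the encoding concrete, the following is a minimal Python sketch of how a set of mentions could be mapped to the hyperedges of one hyperpath and read back. The tuple-based node names, the function names, the sentence length n = 4, and the treatment of the last A node (given only its E child here) are assumptions made for illustration; they are not taken from our implementation.

```python
X = ("X",)  # the shared end-of-mention node

def mentions_to_hyperedges(n, m, mentions):
    """Encode mentions as the hyperedges of a single hyperpath.

    n: number of words, m: number of mention types.
    mentions: set of (start, end, type), 1-based and inclusive.
    Nodes: ("A", i), ("E", i), ("T", k, i), ("I", k, i, j), ("X",).
    A hyperedge is a pair (head, tail) where tail is a tuple of nodes.
    """
    edges = set()
    for i in range(1, n + 1):
        # A_i -> (E_i, A_{i+1}); assumption: the last A node keeps only E_n.
        a_tail = (("E", i), ("A", i + 1)) if i < n else (("E", i),)
        edges.add((("A", i), a_tail))
        # E_i -> (T^1_i, ..., T^m_i)
        edges.add((("E", i), tuple(("T", k, i) for k in range(1, m + 1))))
        for k in range(1, m + 1):
            ends = {e for (s, e, t) in mentions if s == i and t == k}
            if not ends:
                edges.add((("T", k, i), (X,)))  # no mention starts here
                continue
            edges.add((("T", k, i), (("I", k, i, i),)))
            last = max(ends)
            for j in range(i, last + 1):
                nxt = ("I", k, i, j + 1)
                if j == last:
                    tail = (X,)        # the longest mention ends at j
                elif j in ends:
                    tail = (nxt, X)    # one mention ends, another continues
                else:
                    tail = (nxt,)      # no mention ends at j
                edges.add((("I", k, i, j), tail))
    return edges

def hyperedges_to_mentions(edges):
    """Read the mentions back from the hyperedges that involve X."""
    return {(head[2], head[3], head[1])
            for head, tail in edges
            if head[0] == "I" and X in tail}

# Mentions of Figure 3, with type 1 = ORGANIZATION and type 2 = PERSON.
figure3 = {(1, 3, 2), (1, 4, 2), (2, 2, 1)}
edges = mentions_to_hyperedges(n=4, m=2, mentions=figure3)
print(hyperedges_to_mentions(edges) == figure3)
```

In this sketch the two PERSON mentions share the nodes up to $I^2_{1,3}$, and the hyperedge from $I^2_{1,3}$ to ($I^2_{1,4}$, X) records that one mention ends at the third word while another continues.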

Theoretical Analysis
Our proposed segmental hypergraph representation has the following theoretical property:
Theorem 3.1 (Structural Ambiguity Free). For any sentence and its segmental hypergraph G = (V, E), let S be the set of all possible mention combinations for the given sentence, and P be the set of all hyperpaths contained by G. Then there is a one-to-one correspondence between elements in P and S.
Due to space constraints, we provide a proof sketch and include more details in the supplementary material.
Proof Sketch. We note that each hyperpath is uniquely characterized by its collection of hyperedges that involve X nodes. These hyperedges uniquely determine the collection of mentions. Conversely, a collection of mentions can be uniquely characterized by a collection of such hyperedges, which yields a unique hyperpath.
Note that this theorem states that our novel representation has no structural ambiguity, a property that neither the mention hypergraph model of Lu and Roth (2015) nor the mention separator model of Muis and Lu (2017) possesses. As the authors have mentioned in their papers, for a given sub-structure in their model, there exist multiple ways of interpreting the combination of mentions. Specifically, in both representations, the decisions on where a mention begins and ends are made locally. Such a design leads to structural ambiguity, as there will be multiple interpretations of the mentions given a particular collection of positions marked as the beginning and end of mentions. To illustrate, consider a phrase with 4 words "A B C D" where there are only two overlapping mentions "B C" and "A B C D". In both of the previous approaches, the models would make local predictions and assign both "A" and "B" as left boundaries, and both "C" and "D" as right boundaries. However, based on such local predictions one could also interpret "A B C" as a mention. This is where the ambiguity arises.
In contrast, our model enjoys the structural ambiguity free property as it uses our newly defined I nodes (together with X nodes) to jointly capture the complete boundary information of mentions. Table 1 shows a full comparison.
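The "A B C D" example can be reproduced with a small Python sketch. It uses a deliberately simplified view in which a prediction is reduced to the sets of left and right boundary positions; this is only an illustration of why local boundary decisions are ambiguous, not an implementation of the mention hypergraph or mention separator models. The function name is hypothetical.

```python
def boundary_view(mentions):
    """Reduce (start, end) spans to independently marked boundaries."""
    lefts = {start for start, _ in mentions}
    rights = {end for _, end in mentions}
    return lefts, rights

words = ["A", "B", "C", "D"]
gold = {(0, 3), (1, 2)}              # "A B C D" and "B C"
other = {(0, 3), (1, 2), (0, 2)}     # the same, plus "A B C"

# Both combinations mark A, B as left boundaries and C, D as right ones,
# so the boundary view cannot tell them apart.
print(boundary_view(gold) == boundary_view(other))
# Keeping start and end together, as the I and X nodes do, separates them.
print(gold == other)
```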

Learning
We adopt a log-linear approach to model the conditional probability of each hyperpath as follows:
$$p(y \mid x) = \frac{\exp f(x, y)}{\sum_{y'} \exp f(x, y')} \tag{1}$$
where f(x, y) is the score function for any pair of input sentence x and output mention combination y, which corresponds to a unique hyperpath $G_y$. Our objective is to minimize the negative log-likelihood of all instances in the training set D:
$$\mathcal{L}(D) = -\sum_{(x, y) \in D} \log p(y \mid x)$$
We define features over each hyperedge within the hyperpath $G_y$. The score function can be decomposed into the following form:
$$f(x, y) = \sum_{e \in G_y} \psi(e, x)$$
where $e \in G_y$ denotes a hyperedge that appears within the hyperpath $G_y$, and ψ(e, x) is a score defined over e when the input sentence is x. Apart from word-level features, the segmental hypergraph also allows span-level features to be defined. The node $I^k_{i,j}$ corresponds to a particular span [i, j] over which we can extract our local features. The hyperedges between I nodes can capture the interactions between partial mentions, and the hyperedge between $I^k_{i,j}$ and X precisely represents the mention [i, j] with type k. We note that such features and interactions cannot be captured by the models of Lu and Roth (2015) and Muis and Lu (2017). This property makes our segmental hypergraph model more expressive than theirs.

Softmax-Margin Training
Inspired by Mohit et al. (2012), we consider the softmax-margin (Gimpel and Smith, 2010) in our model. The function ψ(e, x) is defined as $\psi(e, x) = \phi(e, x) + \Delta(e, G_{y^*})$, where φ(e, x) is a feature function, and $\Delta(e, G_{y^*})$ is the cost function that defines the margin. Here, y* is the gold mention combination, and TX[e] and TI[e] are indicator functions used by the cost function; they return true if e is between T and X and between T and I respectively, and false otherwise. We set β ≥ 1 such that the cost function will assign more penalty to false negatives than to false positives.
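As an illustration, the sketch below shows one cost function that is consistent with this description. The exact form is an assumption: we take the cost to be 0 for hyperedges on the gold hyperpath, β for a wrongly chosen T-to-X hyperedge (which would miss every mention starting at that position, a false negative), 1 for a wrongly chosen T-to-I hyperedge (a false positive), and 0 for all other hyperedges. The function names and the default value of `beta` are hypothetical.

```python
def cost(edge_kind, in_gold, beta=1.5):
    """Assumed margin cost for one hyperedge.

    edge_kind: "TX" for T -> X, "TI" for T -> I, anything else otherwise.
    in_gold:   True if the hyperedge lies on the gold hyperpath.
    """
    if in_gold:
        return 0.0
    if edge_kind == "TX":
        return beta   # predicting "no mention" where the gold has one
    if edge_kind == "TI":
        return 1.0    # predicting a mention where the gold has none
    return 0.0

def augmented_score(phi, edge_kind, in_gold, beta=1.5):
    """Cost-augmented score psi = phi + cost, used during training only."""
    return phi + cost(edge_kind, in_gold, beta)

print(augmented_score(0.2, "TX", in_gold=False, beta=2.0))
```

With β ≥ 1, a wrong T-to-X hyperedge receives at least as large a margin as a wrong T-to-I hyperedge, which matches the stated preference for penalizing false negatives more heavily.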

Feature Representation
We use two bidirectional LSTMs to learn word-level and span-level feature representations that can be used in our approach, resulting in our neural segmental hypergraph model. We first map the i-th word in a sentence to its pre-trained word embedding $e_i$, and its POS tag to its embedding $p_i$ if it exists. The final representation for the i-th word is their concatenation: $v_i = [e_i, p_i]$. Next, we use a bidirectional LSTM to capture context-specific information for each word, resulting in the word-level features $h^w_i$. Such representations are then used as inputs to a second LSTM to generate the span-level features $h^s_{i,j}$ for each span [i, j]. Inspired by Kong et al. (2016), we compute all possible span embeddings efficiently with time complexity O(cn) using dynamic programming, with n being the number of words in the input x and c being the maximal length of a mention.
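The O(cn) bound follows from reusing the recurrent state of span [i, j - 1] when extending it to [i, j]. The sketch below illustrates this for a single forward direction only, with a generic `step` function standing in for an LSTM cell; the function names and the toy additive cell are assumptions for illustration and are not our actual span encoder.

```python
def span_states(word_feats, c, step, init):
    """Return {(i, j): state} for every span of length at most c.

    Indices are 0-based and inclusive. Each call to `step` extends the
    span [i, j - 1] to [i, j], so `step` is called at most c * n times.
    """
    n = len(word_feats)
    spans = {}
    for i in range(n):
        state = init
        for j in range(i, min(i + c, n)):
            state = step(state, word_feats[j])
            spans[(i, j)] = state
    return spans

# Toy cell: the "state" is simply the running sum of the inputs.
states = span_states([1, 2, 3, 4], c=2, step=lambda s, x: s + x, init=0)
print(len(states), states[(1, 2)])
```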
Recall that there are 4 types of hyperedges in our hypergraph, over which we can define the score functions. Since every valid mention hyperpath contains the first and second kinds of hyperedges, defining scores over such hyperedges is unnecessary, as their scores would serve as a constant factor that can be eliminated in the overall loss function of the log-linear model. Thus we only need to define the score functions on the latter two types of hyperedges. For hyperedges that only involve two nodes, we use a linear layer to compute their scores, with weight matrices $W_{TX}, W_{TI} \in \mathbb{R}^{d_1 \times m}$, $W_{II} \in \mathbb{R}^{2d_2 \times m}$, and $W_{IX} \in \mathbb{R}^{d_2 \times m}$, where the superscript (k) refers to the k-th column of a matrix, $d_1$ is the dimension of $h^w$, $d_2$ is the dimension of $h^s$, and m is the number of mention types. For the hyperedges that involve more than two nodes, the score is computed with matrices $W'_{II} \in \mathbb{R}^{2d_2 \times m}$ and $W'_{IX} \in \mathbb{R}^{d_2 \times m}$. Note that in this work, we set $W'_{II} = W_{II}$ and $W'_{IX} = W_{IX}$ to reduce the number of free parameters. Learning uses stochastic gradient descent with the update rule of Adam (Kingma and Ba, 2014) and a gradient clipping of 3.0. Dropout (Srivastava et al., 2014) for input vectors v and $\ell_2$ regularization are used to reduce overfitting; both are tuned during the development process.
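The following Python sketch shows one way such linear scores could be computed. The pairing of inputs with hyperedges is inferred from the matrix dimensions and is an assumption: hyperedges leaving a T node use the word-level feature of the i-th word, the I-to-X hyperedge uses the feature of span [i, j], the I-to-I hyperedge uses the concatenation of the features of spans [i, j] and [i, j + 1], and the three-node hyperedge is scored as the sum of the I-to-I and I-to-X scores under the shared weights. All names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def t_edge_scores(W, k, h_w_i):
    """Scores of the two hyperedges leaving T^k_i (assumed inputs)."""
    return {"T->X": dot(W["TX"][k], h_w_i),
            "T->I": dot(W["TI"][k], h_w_i)}

def i_edge_scores(W, k, h_s_ij, h_s_next):
    """Scores of the three hyperedges leaving I^k_{i,j} (assumed inputs).

    W[name][k] is the k-th column of the corresponding weight matrix.
    h_s_ij and h_s_next are the features of spans [i, j] and [i, j + 1].
    """
    s_ix = dot(W["IX"][k], h_s_ij)
    s_ii = dot(W["II"][k], h_s_ij + h_s_next)   # list concatenation
    return {"I->X": s_ix, "I->I": s_ii, "I->(I,X)": s_ii + s_ix}

# Toy weights with d1 = d2 = 2 and a single mention type (k = 0).
W = {"TX": [[1.0, 0.0]], "TI": [[0.0, 1.0]],
     "II": [[1.0, 1.0, 1.0, 1.0]], "IX": [[2.0, 0.0]]}
print(t_edge_scores(W, 0, [3.0, 4.0]))
print(i_edge_scores(W, 0, [1.0, 2.0], [3.0, 4.0]))
```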

Character-level Representation
To make fair comparisons with recent models (Ju et al., 2018; Wang et al., 2018) that additionally incorporate character-level components to capture orthographic and morphological features of words, we follow Lample et al. (2016) and use a bidirectional LSTM that takes the character embeddings as input. Specifically, the character-level representation $ch_i$ for each word is obtained by concatenating the last hidden vectors of the forward and backward LSTMs. When this component is activated, the representation of each word is changed to $v_i = [e_i, p_i, ch_i]$.

Inference
Inference can be done efficiently using a generalized inside-outside style message-passing algorithm (Baker, 1979). The partition function of (1) can be computed using the inside algorithm applied to the complete hypergraph G, where we traverse from the leaf nodes X to the root node $A_1$, passing messages to a parent node p from its child nodes through each hyperedge e whose head is p. Here h(e) is the head of the hyperedge e, and T(e) is the collection of nodes that form the tail of e; they are the child nodes of h(e) given e. The message passing step for the outside algorithm can be defined analogously. It can be verified that such a message passing algorithm, which is analogous to the sum-product belief propagation algorithm (Kschischang et al., 2001) used in standard graphical models, will converge after one forward and one backward pass. For decoding, we perform the standard MAP inference on top of the complete hypergraph to find the most probable hyperpath. The resulting procedure is similar to the max-product message passing algorithm, where we consider only the feature function φ for constructing the messages. During inference, each node corresponds to a sum/max computation. Since each node is incident to at most 3 hyperedges, the time complexity of the inference algorithm is determined by the number of nodes in the graph, which is O(cmn), where c is the maximal length of any mention. This complexity is the same as that of a zeroth-order semi-Markov CRF model (Sarawagi and Cohen, 2005). Please refer to the supplementary material for a detailed explanation of the inference algorithm.

We mainly evaluate our models on the standard ACE-2004, ACE-2005 (Doddington et al., 2004), and GENIA datasets with the same splits used by previous works (Lu and Roth, 2015; Muis and Lu, 2017). Sample data statistics of these datasets are listed in Table 2. We can see that overlapping mentions frequently appear in such datasets. For ACE-2004, over 46% of the mentions overlap with one another. GENIA focuses on biomedical entity recognition, and overlapping mentions are also common in it. Most mentions (over 93%) are no longer than 6 tokens, which we select as the maximal length (c) for the restricted models.
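Returning to the inside pass described under Inference, the following Python sketch computes the log partition function in log-space by visiting the nodes from the X leaves up to the root $A_1$. It is a minimal illustration under stated assumptions: `psi` is a hypothetical scoring callback, hyperedges leaving A and E nodes contribute a score of 0, and under the length restriction the last I node of each start position can only connect to X.

```python
import math

def logsumexp(xs):
    top = max(xs)
    return top + math.log(sum(math.exp(x - top) for x in xs))

def log_partition(n, m, c, psi):
    """Inside pass over the segmental hypergraph, in log-space.

    n: number of words, m: number of types, c: maximal mention length.
    psi(kind, k, i, j): hypothetical scorer for the hyperedges "TX" and
    "TI" leaving T^k_i (j is unused) and "IX", "II", "IIX" leaving I^k_{i,j}.
    """
    log_a = 0.0                              # empty product beyond word n
    for i in range(n, 0, -1):
        log_e = 0.0
        for k in range(1, m + 1):
            last = min(n, i + c - 1)         # longest span allowed from i
            log_i = psi("IX", k, i, last)    # the last I node can only end
            for j in range(last - 1, i - 1, -1):
                log_i = logsumexp([
                    psi("IX", k, i, j),              # a mention ends at j
                    psi("II", k, i, j) + log_i,      # mentions continue
                    psi("IIX", k, i, j) + log_i,     # both cases
                ])
            log_t = logsumexp([psi("TX", k, i, i),
                               psi("TI", k, i, i) + log_i])
            log_e += log_t                   # E_i -> (T^1_i, ..., T^m_i)
        log_a = log_e + log_a                # A_i -> (E_i, A_{i+1})
    return log_a

# With all scores set to 0, exp(log Z) counts the hyperpaths.
zero = lambda kind, k, i, j: 0.0
print(round(math.exp(log_partition(3, 1, 3, zero))))
```

The two nested loops over start positions and types, with an inner loop of at most c steps, give the O(cmn) cost. Replacing `logsumexp` with `max` (and keeping back-pointers) yields the corresponding MAP decoding pass.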

Baseline Approaches
We consider the following baseline models:
• CRF (LINEAR): a linear-chain CRF model. Since the linear-chain CRF cannot handle overlapping structures, we only use the outermost mentions for learning. Specifically, every outermost mention is labeled based on the BILOU tagging scheme, which was empirically shown to be better than the BIO scheme (Ratinov and Roth, 2009).
• CRF (CASCADED): the cascaded CRF based approach following Alex et al. (2007). Note that this approach cannot model overlapping mentions of the same type.
• Semi-CRF: the semi-Markov CRF model (Sarawagi and Cohen, 2005). The semi-CRF model is also only trained on the outermost mentions. It can also capture span-level features defined over a complete segment. Similar to our model, semi-CRF typically comes with a length restriction (c) which indicates the maximal length of a mention.
• SH (-NN): a non-neural version of our segmental hypergraph model that excludes the LSTMs but employs handcrafted features. To make a proper comparison, we use the same handcrafted features used by Lu and Roth (2015), which were proven effective in previous approaches.
As discussed earlier, we also evaluate the variants of our model that take character-level representations (+char). Note that in ACE2005, Ju et al. (2018) did their experiments with a different split than Lu and Roth (2015), whose split we follow.
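The outermost-mention BILOU labeling used to train the CRF (LINEAR) baseline can be sketched as follows. This is an illustrative Python sketch and not the preprocessing code used in our experiments; the function names are hypothetical, and it assumes that the outermost mentions of a sentence do not cross or share a span (if they did, later mentions would overwrite earlier tags).

```python
def outermost(mentions):
    """Keep the mentions that are not contained in a different span.

    mentions: list of (start, end, type), 0-based and inclusive.
    """
    kept = []
    for s, e, t in mentions:
        contained = any(s2 <= s and e <= e2 and (s2, e2) != (s, e)
                        for s2, e2, _ in mentions)
        if not contained:
            kept.append((s, e, t))
    return kept

def bilou_tags(n, mentions):
    """BILOU tags for a sentence of n tokens, from outermost mentions only."""
    tags = ["O"] * n
    for s, e, t in outermost(mentions):
        if s == e:
            tags[s] = "U-" + t              # unit-length mention
        else:
            tags[s] = "B-" + t              # beginning
            for j in range(s + 1, e):
                tags[j] = "I-" + t          # inside
            tags[e] = "L-" + t              # last
    return tags

# "At the Seattle zoo": the inner GPE mention "Seattle" is dropped.
print(bilou_tags(4, [(1, 3, "FAC"), (2, 2, "GPE")]))
```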

Training
Pre-trained GloVe embeddings (Pennington et al., 2014) of dimension 100 are used to initialize the trainable word vectors for experiments on the ACE and GENIA datasets. The embeddings for POS tags are initialized randomly with dimension 32. Early stopping is used based on the performance on the development set. The value β used in softmax-margin is chosen from [1, 3] with step size 0.5.

Experimental Results
Main results can be found in Table 3. Using the same set of handcrafted features, our unrestricted non-neural model SH (-NN, c=n) achieves the best performance compared with other non-neural models, revealing the effectiveness of our newly proposed segmental hypergraph representation. It achieves around 1-2% gain in terms of $F_1$ compared with the mention hypergraph of Lu and Roth (2015) and the mention separator of Muis and Lu (2017), showing the necessity of eliminating structural ambiguity. CRF (LINEAR) and Semi-CRF do not perform well because they cannot handle overlapping mentions. In contrast, the pipeline approach CRF (CASCADED) performs better. Our unrestricted neural segmental hypergraph model SH (c=n) already achieves the best results among all previous models on the ACE datasets, showing the effectiveness of our neural segmental hypergraph. The improvement mainly comes from its ability to recall more mentions. In GENIA, even without using external features like Brown clustering features as all non-neural models do, our neural models still get significant improvements. Compared with the non-neural SH (-NN), which has around 4.2M parameters, our neural model SH has only 1.9M parameters yet still performs better. We empirically see that the representations learned by the LSTM can better capture complex contextual dependencies in sentences. The character-level representations (+char) make both restricted and unrestricted SH perform even better. In particular, SH (c=n) + char achieves the best results on all datasets compared with other recent neural models (Katiyar and Cardie, 2018; Ju et al., 2018; Wang et al., 2018).
One hypothesis we may have is that, without length restriction, a model will enjoy the benefit of recalling more long mentions, but will also be exposed to more false positives. This poses a challenge for a model: whether it is capable of balancing these two factors. Empirically, we find that the length restriction (c=6) improves the precision of semi-CRF and SH at the expense of recall, providing some evidence to support the hypothesis. However, in terms of $F_1$, the unrestricted semi-CRF performs worse while the unrestricted SH performs better compared to their restricted counterparts. The reason is that the span-level handcrafted features that the semi-CRF relies on can be very sparse when mentions are overly long. We empirically found that this issue is alleviated in the model SH (-NN), possibly due to its ability to capture interactions between neighboring spans. Even with length restriction, SH still yields competitive results, making it attractive for processing large-scale datasets considering its linear time complexity. Furthermore, we find that as c increases, SH performs consistently better in terms of $F_1$. The choice of c then becomes a tradeoff between time complexity and performance. Please refer to the supplementary material for details.
Compared with the local approach FOFE, our global approach gives a much better performance, showing its effectiveness in capturing interactions between spans. Moreover, FOFE's performance suffers significantly in the absence of the length restriction. The reason is that it generates many more negative training instances under this setting, which makes its learning more challenging.

Additional Analyses
To understand our model better, we conduct some further experiments in this section.

Ablation study
We first conduct an ablation study by removing dropout, softmax-margin and pre-trained embeddings from our model respectively. The results are shown in Table 4. The dropout and pre-trained embeddings can improve the performance of our model significantly and this behavior is consistent with previous neural models for NER (Chiu and Nichols, 2016;Lample et al., 2016). Meanwhile, our new cost function based on softmax margin training also contributes significantly to the good performance of our model across these datasets.

How well does it handle overlapping mentions?
To further understand how well our model can handle overlapping mentions, we split the test data into two portions: sentences with and without overlapping mentions. We compare our model with the two state-of-the-art models and report results on ACE-05 in Table 5. In both portions, SH achieves significant improvements, especially in the portion with overlapping mentions. This observation indicates that our model can better capture the structure of overlapping mentions than these two previous models. It also helps explain why the margin of improvement is larger in ACE than in GENIA, since the former has more overlapping mentions than the latter, as shown in Table 2. Compared with the model with length restriction c, the unrestricted model mainly benefits from its ability to recall more overlapping mentions.
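The split of the test data can be sketched as below. This is an illustrative Python sketch; it assumes that two mentions overlap when they share at least one token (so two mentions with identical spans but different types also count), and the function names are hypothetical.

```python
def has_overlap(mentions):
    """True if any two mentions (start, end, type) share a token.

    Spans are inclusive on both ends.
    """
    furthest = float("-inf")        # right-most token covered so far
    for start, end in sorted((s, e) for s, e, _ in mentions):
        if start <= furthest:
            return True
        furthest = max(furthest, end)
    return False

def split_by_overlap(sentences):
    """Partition a list of per-sentence mention lists into two portions."""
    with_overlap = [ms for ms in sentences if has_overlap(ms)]
    without_overlap = [ms for ms in sentences if not has_overlap(ms)]
    return with_overlap, without_overlap

data = [
    [(1, 3, "FAC"), (2, 2, "GPE")],     # nested mentions
    [(0, 0, "PER"), (2, 3, "ORG")],     # disjoint mentions
    [],                                 # no mentions
]
overlapping, flat = split_by_overlap(data)
print(len(overlapping), len(flat))
```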

Running time
Since the other compared models also feature linear time complexity (see Table 1), we examine the decoding speed in terms of the number of words processed per second. We re-implement the models of Lu and Roth (2015) and Muis and Lu (2017) using the same platform as ours (PyTorch) and run them on the same machine (CPU: Intel i5 2.7 GHz). The model of Wang et al. (2018) is also tested in the same environment. Results on ACE-05 are listed in Table 5. The length bound (c=6) makes our model much faster, resulting in a speed comparable to the model of Muis and Lu (2017). The transition-based model of Wang et al. (2018) has the best scalability, partially because of its greedy strategy for decoding.
What if the data has no overlapping mentions?
To assess the robustness of our model and understand whether it could serve as a general mention extraction model, we additionally evaluate our model on the CoNLL 2003 dataset, which is annotated with non-overlapping mentions only. We compared our model with recent state-of-the-art neural network based models. For a fair comparison, we used the Collobert et al. (2011) embeddings widely used by previous models, and ignored POS tag features even though they are available. Results are in Table 6. Only neural models without external features are included (see the supplementary material for complete results). By only relying on word (and character) embeddings, our model achieves competitive results compared with other state-of-the-art neural models that also do not exploit external features, yet these models are mostly designed to handle only non-overlapping mentions. The only exception is the FOFE approach of Xu et al. (2017), as we discussed earlier.

Notes on mention interactions
The dependencies between overlapping mentions can be very beneficial. SH can capture a specific kind of interaction between neighboring spans. Such interactions happen between mentions that share the same type and the same left boundary. As we can see from the sentence in Figure 3, one mention could also serve as a pre-modifier for another mention, and both could share the same type. As shown in Table 2, there are over 8% such mentions in ACE and over 4% in GENIA. Specifically, SH relies on the hyperedges between I nodes to capture such interactions explicitly. To verify the effectiveness of this connection, we zero the weights between I nodes. The ablated model only achieves around 70.0% on the ACE datasets and 71.4% on GENIA, showing the impact of this dependency connection. On the other hand, it also reveals a potential direction for improving SH: explicitly modeling more dependencies between mentions, such as the dependencies between mentions with different types. The LSTM that serves as the feature representation may capture such interactions implicitly, but building the connections explicitly could still be an important aspect for improvement.

Conclusion and Future Work
In this work, we propose a novel neural segmental hypergraph model that is able to capture overlapping mentions. We show that our model has some theoretical advantages over previous state-of-the-art approaches for recognizing overlapping mentions. Through extensive experiments, we show that our model is general and robust in handling both overlapping and non-overlapping mentions. The model achieves state-of-the-art results on three standard datasets for recognizing overlapping mentions. We anticipate that this model could be leveraged in other similar sequence modeling tasks that involve predicting overlapping structures, such as recognizing overlapping and discontinuous entities (Muis and Lu, 2016), which frequently exist in the biomedical domain.