Sequence-to-Nuggets: Nested Entity Mention Detection via Anchor-Region Networks

Sequential labeling-based NER approaches restrict each word to belong to at most one entity mention, which poses a serious problem when recognizing nested entity mentions. In this paper, we propose to resolve this problem by modeling and leveraging the head-driven phrase structures of entity mentions, i.e., although a mention can nest other mentions, they will not share the same head word. Specifically, we propose Anchor-Region Networks (ARNs), a sequence-to-nuggets architecture for nested mention detection. ARNs first identify anchor words (i.e., possible head words) of all mentions, and then recognize the mention boundaries for each anchor word by exploiting regular phrase structures. Furthermore, we also design Bag Loss, an objective function which can train ARNs in an end-to-end manner without using any anchor word annotation. Experiments show that ARNs achieve state-of-the-art performance on three standard nested entity mention detection benchmarks.


Introduction
Named entity recognition (NER), or more generally entity mention detection, aims to identify text spans pertaining to specific entity types such as Person, Organization and Location. NER is a fundamental task of information extraction which enables many downstream NLP applications, such as relation extraction (GuoDong et al., 2005; Mintz et al., 2009), event extraction (Ji and Grishman, 2008; Li et al., 2013) and machine reading comprehension (Rajpurkar et al., 2016). Previous approaches (Zhou and Su, 2002; Chieu and Ng, 2002; Bender et al., 2003; Settles, 2004; Lample et al., 2016) commonly regard NER as a sequential labeling task, which generates a label sequence for each sentence by assigning one label to each token. These approaches commonly restrict each token to belong to at most one entity mention and, unfortunately, face a serious problem when recognizing nested entity mentions, where one token may belong to multiple mentions. For example, in Figure 1, an Organization entity mention "the department of education" is nested in another Person entity mention "the minister of the department of education". Nested entity mentions are very common. For instance, in the well-known ACE2005 and RichERE datasets, more than 20% of entity mentions are nested in other mentions. Therefore, it is critical to consider nested mentions for real-world applications and downstream tasks.

Figure 1: An example of nested entity mentions. Due to the nested structure, "the", "department", "of" and "education" belong to both PER and ORG mentions.
In this paper, we propose a sequence-to-nuggets approach, named Anchor-Region Networks (ARNs), which can effectively detect all entity mentions by modeling and exploiting their head-driven phrase structures (Pollard and Sag, 1994; Collins, 2003). ARNs originate from two observations. First, although an entity mention can nest other mentions, they will not share the same head word, and the head word of a mention can provide strong semantic evidence for its entity type (Choi et al., 2018). For example, in Figure 1, although the ORG mention is nested in the PER mention, they have different head words, "department" and "minister" respectively, and these head words strongly indicate their corresponding entity types to be ORG and PER. Second, entity mentions mostly have regular phrase structures. The two mentions in Figure 1 share the same "DET NN of NP" structure, where the NN after DET is the head word. Based on the above observations, entity mentions can be naturally detected in a sequence-to-nuggets manner by 1) identifying the head words of all mentions in a sentence; and 2) recognizing entire mention nuggets centered at detected head words by exploiting regular phrase structures of entity mentions.

Figure 2: The overall architecture of ARNs. Here "minister" and "department" are detected anchor words for two mentions respectively.
To this end, we propose ARNs, a new neural network-based approach for nested mention detection. Figure 2 shows the architecture of ARNs. First, ARNs employ an anchor detector network to identify whether each word is a head word of an entity mention, and we refer to the detected words as anchor words. After that, a region recognizer network is used to determine the mention boundaries centering at each anchor word. By effectively capturing head-driven phrase structures of entity mentions, the proposed ARNs can naturally address the nested mention problem because different mentions have different anchor words, and different anchor words correspond to different mention nuggets.
Furthermore, because the majority of NER datasets are not annotated with head words, they cannot be directly used to train our anchor detector. To address this issue, we propose Bag Loss, an objective function which can be used to train ARNs in an end-to-end manner without any anchor word annotation. Specifically, Bag Loss is based on the at-least-one assumption, i.e., each mention should have at least one anchor word, and the anchor word should strongly indicate its entity type. Based on this assumption, Bag Loss can automatically select the best anchor word within each mention during training, according to the association between words and the entity type of the mention. For example, given an ORG training instance "the department of education", Bag Loss will select "department" as the anchor word of this mention based on its tight correlation with type ORG, while other words in the mention, such as "the" and "of", will not be regarded as anchor words because of their weak association with the ORG type.
We conducted experiments on three standard nested entity mention detection benchmarks: the ACE2005, GENIA and TAC-KBP2017 datasets. Experiments show that ARNs can effectively detect nested entity mentions and achieve state-of-the-art performance on all three datasets. For better reproducibility, we openly release the entire project at github.com/sanmusunrise/ARNs.
Generally, our main contributions are:
• We propose a new neural network architecture named Anchor-Region Networks. By effectively modeling and leveraging the head-driven phrase structures of entity mentions, ARNs can naturally handle the nested mention detection problem and achieve state-of-the-art performance on three benchmarks. To the best of our knowledge, this is the first work which attempts to exploit head-driven phrase structures for nested NER.
• We design an objective function named Bag Loss. By exploiting the association between words and entity types, Bag Loss can effectively learn ARNs in an end-to-end manner, without using any anchor word annotation.
• Head-driven phrase structures are widespread in natural language. This paper proposes an effective neural network-based solution for exploiting this structure, which can potentially benefit many NLP tasks, such as semantic role labeling (Zhou and Xu, 2015; He et al., 2017) and event extraction (Chen et al., 2015; Lin et al., 2018).

Related Work
Nested mention detection requires identifying all entity mentions in a text, rather than only the outermost mentions as in conventional NER. This raises a critical issue for traditional sequential labeling models because they can only assign one label to each token. To address this issue, mainly two kinds of methods have been proposed.
Region-based approaches detect mentions by classifying subsequences of a sentence individually, and nested mentions can be detected because they correspond to different subsequences. For this, Finkel and Manning (2009) regarded nodes of parse trees as candidate subsequences. Recently, Xu et al. (2017) and Sohrab and Miwa (2018) tried to directly classify all subsequences of a sentence. Besides, a transition-based method has been proposed to construct nested mentions via a sequence of specially designed actions. Generally, these approaches are straightforward for nested mention detection, but mostly come with high computational cost, as they need to classify almost all sentence subsequences.
Schema-based approaches address nested mentions by designing more expressive tagging schemas, rather than changing tagging units. One representative direction is hypergraph-based methods (Lu and Roth, 2015; Katiyar and Cardie, 2018), where hypergraph-based tags are used to ensure nested mentions can be recovered from word-level tags. Besides, Muis and Lu (2017) developed a gap-based tagging schema to capture nested structures. However, these schemas must be designed very carefully to prevent spurious structures and structural ambiguity, and more expressive, unambiguous schemas will inevitably lead to higher time complexity during both training and decoding.
Different from previous methods, this paper proposes a new architecture to address nested mention detection. Compared with region-based approaches, our ARNs detect mentions by exploiting head-driven phrase structures, rather than exhaustively classifying subsequences. Therefore ARNs can significantly reduce the number of candidate mentions and lead to much lower time complexity. Compared with schema-based approaches, ARNs can naturally address nested mentions since different mentions will have different anchor words. There is no need to design complex tagging schemas, and there are no spurious structures and no structural ambiguity.
Furthermore, we also propose Bag Loss, which can train ARNs in an end-to-end manner without any anchor word annotation. The design of Bag Loss is partially inspired by multi-instance learning (MIL) (Zhou and Zhang, 2007;Zhou et al., 2009;Surdeanu et al., 2012), but with a different target. MIL aims to predict a unified label of a bag of instances, while Bag Loss is proposed to train ARNs whose anchor detector is required to predict the label of each instance. Therefore previous MIL methods are not suitable for training ARNs.

Anchor-Region Networks for Nested Entity Mention Detection
Given a sentence, Anchor-Region Networks detect all entity mentions in a two-step paradigm. First, an anchor detector network identifies anchor words and classifies them into their corresponding entity types. After that, a region recognizer network is applied to recognize the entire mention nugget centering at each anchor word. In this way, ARNs can effectively model and exploit head-driven phrase structures of entity mentions: the anchor detector for recognizing possible head words and the region recognizer for capturing phrase structures. These two modules are jointly trained using the proposed Bag Loss, which learns ARNs in an end-to-end manner without using any anchor word annotation. This section will describe the architecture of ARNs. And Bag Loss will be introduced in the next section.

Anchor Detector
An anchor detector is a word-wise classifier, which identifies whether a word is an anchor word of an entity mention of a specific type. For the example in Figure 1, the anchor detector should identify that "minister" is an anchor word of a PER mention and "department" is an anchor word of an ORG mention. Formally, given a sentence $x_1, x_2, ..., x_n$, all words are first mapped to a sequence of word representations $\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_n$, where $\mathbf{x}_i$ is a combination of the word embedding, part-of-speech embedding and character-based representation of word $x_i$, following Lample et al. (2016). Then we obtain a context-aware representation $h^A_i$ of each word $x_i$ using a bidirectional LSTM layer:

$$\overrightarrow{h}^A_i = \mathrm{LSTM}(\mathbf{x}_i, \overrightarrow{h}^A_{i-1}), \quad \overleftarrow{h}^A_i = \mathrm{LSTM}(\mathbf{x}_i, \overleftarrow{h}^A_{i+1}), \quad h^A_i = [\overrightarrow{h}^A_i; \overleftarrow{h}^A_i] \qquad (1)$$

The learned representation $h^A_i$ is then fed into a multi-layer perceptron (MLP) classifier, which computes the scores $O^A_i$ of the word $x_i$ being an anchor word of specific entity types (or NIL if this word is not an anchor word):

$$O^A_i = \mathrm{MLP}(h^A_i) \qquad (2)$$

where $O^A_i \in \mathbb{R}^{|C|}$ and $|C|$ is the number of entity types plus one NIL class. Finally, a softmax layer is used to normalize $O^A_i$ to probabilities:

$$P(c_j \mid x_i) = \frac{e^{O^A_{ij}}}{\sum_{k=1}^{|C|} e^{O^A_{ik}}} \qquad (3)$$

where $O^A_{ij}$ is the $j$-th element of $O^A_i$ and $P(c_j \mid x_i)$ is the probability of word $x_i$ being an anchor word of class $c_j$. Note that because different mentions will not share the same anchor word, the anchor detector can naturally solve the nested mention detection problem by recognizing different anchor words for different mentions.
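The final classification step of the anchor detector can be illustrated with a minimal sketch. It assumes the per-word scores are already available (the numbers below are hypothetical, not model outputs) and that a word is treated as an anchor when its most probable class is not NIL; this decision rule is an assumption of the sketch, not something the text above specifies.

```python
import math

def softmax(scores):
    """Normalize a list of raw class scores into probabilities."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def detect_anchors(words, scores, classes):
    """Return (index, word, type) for each word whose most probable class is not NIL."""
    anchors = []
    for i, (word, word_scores) in enumerate(zip(words, scores)):
        probs = softmax(word_scores)
        best = max(range(len(classes)), key=lambda j: probs[j])
        if classes[best] != "NIL":
            anchors.append((i, word, classes[best]))
    return anchors

CLASSES = ["NIL", "PER", "ORG"]
WORDS = ["the", "minister", "of", "the", "department", "of", "education"]
# Hypothetical anchor scores: one row per word, one column per class.
SCORES = [
    [3.0, 0.1, 0.2],
    [0.2, 4.0, 0.5],
    [3.5, 0.0, 0.1],
    [3.0, 0.1, 0.3],
    [0.1, 0.4, 4.2],
    [3.2, 0.0, 0.2],
    [1.5, 0.1, 1.0],
]
ANCHORS = detect_anchors(WORDS, SCORES, CLASSES)
```

With these illustrative scores, the two nested mentions obtain distinct anchors ("minister" for PER and "department" for ORG), which is what allows them to be recovered separately.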

Region Recognizer
Given an anchor word, ARNs will determine its exact mention nugget using a region recognizer network. For the example in Figure 1, the region recognizer will recognize that "the minister of the department of education" is the mention nugget for anchor word "minister" and "the department of education" is the mention nugget for anchor word "department". Inspired by the recent success of pointer networks (Vinyals et al., 2015;Wang and Jiang, 2016), this paper designs a pointer-based architecture to recognize the mention boundaries centering at an anchor word. That is, our region recognizer will detect the mention nugget "the department of education" for anchor word "department" by recognizing "the" to be the left boundary and "education" to be the right boundary.
Similar to the anchor detector, a bidirectional LSTM layer is first applied to obtain the context-aware representation $h^R_i$ of word $x_i$. For recognizing mention boundaries, local features commonly play essential roles. For instance, a noun before a verb is an informative boundary indicator for entity mentions. To capture such local features, we further introduce a convolutional layer upon $h^R_i$:

$$r_i = \tanh(W h^R_{i-k:i+k} + b) \qquad (4)$$

where $h^R_{i-k:i+k}$ is the concatenation of vectors from $h^R_{i-k}$ to $h^R_{i+k}$, $W$ and $b$ are the convolutional kernel and the bias term respectively, and $k$ is the (one-side) window size of the convolutional layer. Finally, for each anchor word $x_i$, we compute its left mention boundary score $L_{ij}$ and right mention boundary score $R_{ij}$ at word $x_j$ by

$$L_{ij} = \tanh(r_j^\top \Lambda_1 h^R_i + U_1 r_j + b_1), \qquad R_{ij} = \tanh(r_j^\top \Lambda_2 h^R_i + U_2 r_j + b_2) \qquad (5)$$

where $\Lambda_1$, $\Lambda_2$, $U_1$, $U_2$, $b_1$ and $b_2$ are learned parameters. In the above two equations, the first term within the tanh function computes the score of word $x_j$ serving as the left/right boundary of a mention centering at word $x_i$, and the second term models the possibility of word $x_j$ itself serving as a boundary universally. After that, we select the best left boundary word $x_j$ and the best right boundary word $x_k$ for anchor word $x_i$, and the nugget $\{x_j, ..., x_i, ..., x_k\}$ will be a recognized mention.
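The boundary selection step can be sketched as follows. The score values are hypothetical and simply stand in for the left and right boundary scores of one anchor. Restricting the left boundary to positions at or before the anchor, and the right boundary to positions at or after it, is an assumption of this sketch that follows from mentions being centered at their anchor words.

```python
def select_region(anchor, left_scores, right_scores):
    """Return (left, right) boundary indices for one anchor word.

    left_scores[j] and right_scores[j] are the boundary scores of position j
    for a fixed anchor. The search ranges (left <= anchor <= right) are an
    assumption of this sketch.
    """
    left = max(range(anchor + 1), key=lambda j: left_scores[j])
    right = max(range(anchor, len(right_scores)), key=lambda k: right_scores[k])
    return left, right

WORDS = ["the", "minister", "of", "the", "department", "of", "education"]
# Hypothetical boundary scores in (-1, 1), keyed by anchor index.
LEFT = {
    1: [0.9, 0.1, -0.8, -0.6, -0.9, -0.9, -0.9],
    4: [-0.3, -0.9, -0.8, 0.8, 0.0, -0.9, -0.9],
}
RIGHT = {
    1: [-0.9, -0.2, -0.7, -0.8, -0.5, -0.6, 0.7],
    4: [-0.9, -0.9, -0.9, -0.8, -0.1, -0.6, 0.9],
}
MENTIONS = {}
for anchor in (1, 4):
    left, right = select_region(anchor, LEFT[anchor], RIGHT[anchor])
    MENTIONS[WORDS[anchor]] = " ".join(WORDS[left:right + 1])
```

Each anchor yields exactly one nugget, so two anchors inside the same span produce two overlapping mentions without any special handling of nesting.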

Model Learning with Bag Loss
This section describes how to train ARNs using existing NER datasets. The main challenge here is that current NER corpora are not annotated with anchor words of entity mentions, and therefore they cannot be directly used to train the anchor detector. To address this problem, we propose Bag Loss, an objective function which can effectively learn ARNs in an end-to-end manner, without using any anchor word annotation. Intuitively, one naive solution is to regard all words in a mention as its anchor words. However, this naive solution will inevitably result in two severe problems. First, a word may belong to different mentions when nested mentions exist. Therefore this naive solution will lead to ambiguous and noisy anchor words. For the example in Figure 1, it is unreasonable to annotate the word "department" as an anchor word of both the PER and ORG mentions, because it has little association with the PER type although the PER mention also contains it. Second, many words in a mention are just function words, which are not associated with its entity type. For example, the words "the", "of" and "education" in "the department of education" are not associated with its type ORG. Therefore annotating them as anchor words of the ORG mention will introduce considerable noise.
To resolve the first problem, we observe that a word can only be the anchor word of the innermost mention containing it. This is because a mention nested in another mention can be regarded as a replaceable component, and changing it will not affect the structure of outer mentions. For the case in Figure 1, if we replace the nested mention "the department of education" by another ORG mention (e.g., changing it to "State"), the type of the outer mention will not change. Therefore, words in a nested mention should not be regarded as anchor words of outer mentions, and a word can only be assigned as the anchor word of the innermost mention containing it.

Figure 3: An illustration of bags. $B_i$ represents the bag that word $x_i$ is in. This sentence forms five bags, two of which correspond to two entity mentions and three of which correspond to NIL.
To address the second problem, we design Bag Loss based on the at-least-one assumption, i.e., for each mention at least one word should be regarded as its anchor word. Specifically, we refer to all words belonging to the same innermost mention as a bag, and the type of the bag is the type of that innermost mention. For example, in Figure 3, {the, minister, of} will form a PER bag, and {the, department, of, education} will form an ORG bag. Besides, each word not covered by any mention will form a one-word bag with NIL type. So there are three NIL bags in Figure 3, including {convened}, {a} and {meeting}.
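The bag construction described above can be sketched in a few lines. The function name `build_bags` and the `(start, end, type)` span format with inclusive word indices are illustrative choices for this sketch. It assumes mentions are properly nested, so that the shortest mention covering a word is its innermost mention.

```python
def build_bags(words, mentions):
    """Group words into bags by the innermost mention that contains them.

    mentions: list of (start, end, type) with inclusive word indices.
    Returns a list of (bag_type, [words]) in order of first appearance.
    """
    innermost = [None] * len(words)
    for m_id, (start, end, _type) in enumerate(mentions):
        for i in range(start, end + 1):
            cur = innermost[i]
            # For properly nested mentions, the shortest covering span is the innermost.
            if cur is None or end - start < mentions[cur][1] - mentions[cur][0]:
                innermost[i] = m_id

    bags = []
    bag_of_mention = {}
    for i, m_id in enumerate(innermost):
        if m_id is None:
            bags.append(("NIL", [words[i]]))  # uncovered word: one-word NIL bag
        elif m_id in bag_of_mention:
            bag_of_mention[m_id][1].append(words[i])
        else:
            bag = (mentions[m_id][2], [words[i]])
            bag_of_mention[m_id] = bag
            bags.append(bag)
    return bags

WORDS = "the minister of the department of education convened a meeting".split()
MENTIONS = [(0, 6, "PER"), (3, 6, "ORG")]
BAGS = build_bags(WORDS, MENTIONS)
```

On the example sentence this yields the five bags listed above: one PER bag, one ORG bag and three one-word NIL bags.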
Given a bag, Bag Loss will make sure that at least one word in each bag will be selected as its anchor word and be assigned to the bag type, while other words in that bag will be classified into either the bag type or NIL. Bag Loss selects anchor words according to their associations with the bag type. That is, only words highly related to the bag type (e.g., "department" in "the department of education") will be trained towards the bag type, and other irrelevant words (e.g., "the" and "of" in the above example) will be trained towards NIL.

Bag Loss based End-to-End Learning. For ARNs, each training instance is a tuple $x = (x_i, x_j, x_k, c_i)$, where $x_j, ..., x_k$ is an entity mention with left boundary $x_j$ and right boundary $x_k$, $c_i$ is its entity type, and word $x_i$ is a word in this mention's bag. For each instance, Bag Loss considers two situations: 1) If $x_i$ is its anchor word, the loss will be the sum of the anchor detector loss (i.e., the loss of correctly classifying $x_i$ into its bag type $c_i$) and the region recognizer loss (i.e., the loss of correctly recognizing the mention boundaries $x_j$ and $x_k$); 2) If $x_i$ is not its anchor word, the loss will be only the anchor detector loss (i.e., correctly classifying $x_i$ into NIL). The final loss for this instance is a weighted sum of the losses of these two situations, where the weights are determined using the association between word $x_i$ and the bag type $c_i$ compared with other words in the same bag. Formally, Bag Loss is written as:

$$\mathcal{L}(x_i; \theta) = \omega_i \cdot \left[-\log P(c_i \mid x_i) + L^{left}(x_i; \theta) + L^{right}(x_i; \theta)\right] + (1 - \omega_i) \cdot \left[-\log P(\mathrm{NIL} \mid x_i)\right] \qquad (6)$$

where $-\log P(c_i \mid x_i)$ is the anchor detector loss.
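The term $L^{left}(x_i; \theta) + L^{right}(x_i; \theta)$ is the loss for the region recognizer, which measures how precisely the region recognizer can identify the boundaries centered at anchor word $x_i$. We define $L^{left}(x_i; \theta)$ using a max-margin loss:

$$L^{left}(x_i; \theta) = \max\left(0, \gamma - L_{ij} + \max_{t \neq j} L_{it}\right) \qquad (7)$$

where $\gamma$ is a hyper-parameter representing the margin, and $L^{right}(x_i; \theta)$ is similarly defined.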
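As a rough illustration of a max-margin boundary loss, the sketch below uses the common hinge form in which the gold boundary must outscore the best competing position by at least the margin. The exact form is an assumption of this sketch, and the scores are hypothetical.

```python
def max_margin_loss(scores, gold, margin=1.0):
    """Hinge loss over boundary positions for one anchor.

    scores[t] is the boundary score of position t, and gold is the index of the
    annotated boundary. Assumes at least two candidate positions.
    """
    best_other = max(s for t, s in enumerate(scores) if t != gold)
    return max(0.0, margin - scores[gold] + best_other)

# Hypothetical left-boundary scores for one anchor over three positions.
LEFT_SCORES = [0.9, -0.5, 0.1]
LOSS_GOLD_FIRST = max_margin_loss(LEFT_SCORES, gold=0, margin=1.0)
LOSS_GOLD_SECOND = max_margin_loss(LEFT_SCORES, gold=1, margin=1.0)
LOSS_SMALL_MARGIN = max_margin_loss(LEFT_SCORES, gold=0, margin=0.5)
```

The loss is zero once the gold boundary beats every other position by the margin, so a smaller margin is satisfied by less separated scores.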
Besides, $\omega_i$ in Equation (6) measures the correlation between word $x_i$ and the bag type $c_i$. Compared with other words in the same bag, a word $x_i$ should have a larger $\omega_i$ if it has a tighter association with the bag type. Therefore, $\omega_i$ can be naturally defined as:

$$\omega_i = \left[\frac{P(c_i \mid x_i)}{\max_{x_t \in B_i} P(c_i \mid x_t)}\right]^{\alpha} \qquad (8)$$

where $B_i$ denotes the bag that $x_i$ belongs to, i.e., all words that share the same innermost mention with $x_i$, and $\alpha$ is a hyper-parameter controlling how likely a word will be regarded as an anchor word rather than as NIL. $\alpha = 0$ means that all words are annotated with the bag type, and $\alpha \to +\infty$ means that Bag Loss will only choose the word with the highest $P(c_i \mid x_i)$ as the anchor word, while all other words in the same bag will be regarded as NIL. Consequently, Bag Loss guarantees that at least one anchor word (the one with the highest $P(c_i \mid x_i)$, whose corresponding $\omega_i$ will be 1.0) will be selected for each bag. Other words that are not associated with the type (the ones with low $P(c_i \mid x_i)$) are automatically trained towards NIL.
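The behavior of the weight under different values of α, and its role in the per-word loss, can be checked numerically. The sketch below assumes the weight is the ratio of a word's bag-type probability to the largest such probability in its bag, raised to the power α. This form matches the limiting cases described above but is an assumption here. The probabilities are hypothetical, and `anchor_weights` and `bag_loss` are illustrative names.

```python
import math

def anchor_weights(bag_probs, alpha=1.0):
    """Weight of each word in one bag: (P / max P in the bag) ** alpha."""
    top = max(bag_probs)
    return [(p / top) ** alpha for p in bag_probs]

def bag_loss(p_type, p_nil, region_loss, omega):
    """Weighted sum of the 'is an anchor' and 'is not an anchor' losses for one word."""
    as_anchor = -math.log(p_type) + region_loss  # anchor detector + region recognizer
    as_nil = -math.log(p_nil)                    # anchor detector only
    return omega * as_anchor + (1.0 - omega) * as_nil

# Hypothetical P(ORG | word) for the bag {the, department, of, education}.
BAG_PROBS = [0.05, 0.90, 0.02, 0.45]
W_ALL = anchor_weights(BAG_PROBS, alpha=0.0)     # every word keeps the bag type
W_SOFT = anchor_weights(BAG_PROBS, alpha=1.0)
W_SHARP = anchor_weights(BAG_PROBS, alpha=50.0)  # close to selecting only the top word
```

In this example "department" always receives weight 1.0, while the weights of the other words shrink towards 0 as α grows, so their loss is dominated by the NIL term.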

Experimental Settings
We conducted experiments on three standard English entity mention detection benchmarks with nested mentions: the ACE2005, GENIA and TAC-KBP2017 (KBP2017) datasets. For ACE2005 and GENIA, we used the same setup as previous work (Ju et al., 2018; Katiyar and Cardie, 2018) [...] (Pennington et al., 2014) vectors. Hyper-parameters are tuned on the development sets, apart from α in Equation (8), which will be further discussed in Section 5.4.

Baselines
We compare ARNs with the following baselines:
• Conventional CRF models, including LSTM-CRF (Lample et al., 2016) and Multi-CRF. LSTM-CRF is a classical baseline for NER, which does not consider nested mentions, so only the outermost mentions are used for training. Multi-CRF is similar to LSTM-CRF but learns one model for each entity type, and thus is able to recognize nested mentions if they have different types.
• Region-based methods, including FOFE (Xu et al., 2017), Cascaded-CRF (Ju et al., 2018) and a transition-based model (referred to as Transition). FOFE directly classifies all sub-sequences of a sentence, and thus all potential mentions can be considered. Cascaded-CRF uses several stacked CRF layers to recognize nested mentions at different levels. Transition constructs nested mentions through a sequence of actions.
• Hypergraph-based methods, including the LSTM-Hypergraph (LH) model (Katiyar and Cardie, 2018) and the Segmental Hypergraph (SH) model. LH uses an LSTM model to learn features and then decodes them into a hypergraph. SH further considers the transition between labels to alleviate labeling ambiguity, and is the state of the art on both the ACE2005 and GENIA datasets.

Besides, we also compared the performance of ARNs with the best system in the TAC-KBP 2017 Evaluation (Ji et al., 2017). As in all previous studies, models are evaluated using micro-averaged Precision (P), Recall (R) and F1-score. To balance time complexity and performance, prior work proposed to restrict the maximum length of mentions to 6, which covers more than 95% of mentions. So we also compared with baselines under both the restricted and the unrestricted maximum mention length settings. Besides, we also compared the decoding time complexity of different methods.

Overall Results
Table 1 shows the overall results on the ACE2005, GENIA and KBP2017 datasets. From this table, we can see that:
1) Nested mentions have a significant influence on NER performance and need to be specially treated. Compared with the LSTM-CRF and Multi-CRF baselines, all other methods dealing with nested mentions achieved significant F1-score improvements. So it is critical to take nested mentions into consideration for real-world applications and downstream tasks.
2) Our Anchor-Region Networks can effectively resolve the nested mention detection problem, and achieved state-of-the-art performance on all three datasets. On ACE2005 and GENIA, ARNs achieved state-of-the-art performance in both the restricted and the unrestricted mention length settings. On KBP2017, ARNs outperform the top-1 system in the 2017 Evaluation by a large margin. This verifies the effectiveness of our new architecture.
3) By modeling and exploiting the head-driven phrase structure of entity mentions, ARNs reduce the computational cost significantly. ARNs only detect nuggets centering at detected anchor words. Note that for each sentence, the number of potential anchor words k is significantly smaller than the sentence length n. Therefore the computational cost of our region recognizer is significantly lower than that of traditional region-based methods, which perform classification on all sub-sequences, as well as hypergraph-based methods, which introduce structural dependencies between labels to prevent structural ambiguity. Furthermore, ARNs are highly parallelizable if we replace the BiLSTM context encoder with another parallelizable context encoder architecture (e.g., Transformer (Vaswani et al., 2017)).

Effects of Bag Loss
In this section, we investigate the effects of Bag Loss by varying the value of the hyper-parameter α in Equation (8). Figure 4 shows the results on the evaluated datasets, including KBP2017, when α varies. We can see that: 1) Bag Loss is effective for anchor word selection during training. In Figure 4, setting α to 0 significantly undermines the performance. Note that setting α to 0 is the same as ablating Bag Loss, i.e., the model will treat all words in the same innermost mention as anchor words. This result further verifies the necessity of Bag Loss. That is, because not all words in a mention are related to its type, regarding all words in mentions as anchor words will introduce considerable noise.
2) Bag Loss is not sensitive to α when it is larger than a threshold. In Figure 4, our systems achieve nearly the same performance when α > 0.8. We find that this is because our model predicts anchor words with a very sharp probability distribution, so a slight change of α does not make a big difference. Therefore, in all our experiments we empirically set α = 1 unless otherwise stated. This also verifies that Bag Loss can discover head-driven phrase structure steadily without using anchor word annotations.

Further Discussion on Bag Loss and Marginalization-based Loss
One possible alternative to Bag Loss is to regard the anchor word as a hidden variable, and obtain the likelihood of each mention by marginalizing over all words in the mention nugget. For $P(x_i, c)$, if we assume that the prior for each word being the anchor word is equal, it can be refactorized as $P(x_i, c) = P(c \mid x_i) P(x_i) \propto P(c \mid x_i)$.
However, we find that this approach does not work well in practice. This may be because, as we mentioned above, the prior probability of each word being the anchor word should not be equal. Words with high semantic relatedness to the types are more likely to be the anchor word. Furthermore, this marginalization-based training objective can only guarantee that words regarded as anchor words are trained towards the mention type, but will not encourage the other irrelevant words in the mention to be trained towards NIL. Therefore, compared with Bag Loss, the marginalization-based solution cannot achieve promising results for ARN training.

Analysis on Anchor Words
To analyze the detected anchor words, Table 2 shows the most common anchor words for all entity types. Besides, words that frequently appear in a mention but are recognized as NIL are also presented. We can see that the top-10 anchor words of each type are very convincing: all these words are strong indicators of their entity types. Besides, we can see that frequent NIL words in entity mentions are commonly function words, which play a significant role in the structure of mention nuggets (e.g., "the" and "a" often indicate the start of an entity mention) but have little semantic association with entity types. This supports our motivation and further verifies the effectiveness of Bag Loss for anchor word selection.

Figure 5: A representative error case of ARNs, where the right boundary of the PER mention is misclassified. Braces above the sentence indicate the output of ARNs, and brackets in the sentence represent the golden annotation. We find that the majority of errors occur because of the long-term dependencies stemming from postpositive attributive and attributive clauses.

Error Analysis
This section conducts error analysis on ARNs. Table 3 shows the performance gap between the anchor detector and the entire ARNs. We can see that there is still a significant performance gap from the anchor detector to the entire ARNs. That is, there exist a number of mentions whose anchor words are correctly detected by the anchor detector but whose boundaries are mistakenly recognized by the region recognizer. To investigate the reason behind this performance gap, we analyze these cases and find that most of these errors stem from the existence of postpositive attributives and attributive clauses. Figure 5 shows an error case stemming from a postpositive attributive. These cases are quite difficult for neural networks because long-term dependencies between clauses need to be carefully considered. One strategy to handle these cases is to introduce syntactic knowledge, which we leave as future work for improving ARNs.

Conclusions and Future Work
This paper proposes Anchor-Region Networks, a sequence-to-nuggets architecture which can naturally detect nested entity mentions by modeling and exploiting head-driven phrase structures of entity mentions. Specifically, an anchor detector is first used to detect the anchor words of entity mentions and then a region recognizer is designed to recognize the mention boundaries centering at each anchor word. Furthermore, we also propose Bag Loss to train ARNs in an end-to-end manner without using any anchor word annotation. Experiments show that ARNs achieve state-of-the-art performance on all three benchmarks. As head-driven structures are widespread in natural language, the solution proposed in this paper can also be used for modeling and exploiting this structure in many other NLP tasks, such as semantic role labeling and event extraction.