Multi-grained Named Entity Recognition

This paper presents a novel framework, MGNER, for Multi-Grained Named Entity Recognition, where multiple entities or entity mentions in a sentence may be non-overlapping or totally nested. Unlike traditional approaches, which regard NER as a sequence labeling task and annotate entities consecutively, MGNER detects and recognizes entities at multiple granularities: it is able to recognize named entities without explicitly assuming non-overlapping or totally nested structures. MGNER consists of a Detector that examines all possible word segments and a Classifier that categorizes entities. In addition, contextual information and a self-attention mechanism are utilized throughout the framework to improve the NER performance. Experimental results show that MGNER outperforms current state-of-the-art baselines by up to 4.4% in terms of the F1 score on nested and non-overlapping NER tasks.


Introduction
Effectively identifying meaningful entities or entity mentions from the raw text plays a crucial part in understanding the semantic meanings of natural language. Such a process is usually known as Named Entity Recognition (NER) and it is one of the fundamental tasks in natural language processing (NLP). A typical NER system takes an utterance as the input and outputs identified entities, such as person names, locations, and organizations. The extracted named entities can benefit various subsequent NLP tasks, including syntactic parsing (Koo and Collins, 2010), question answering (Krishnamurthy and Mitchell, 2015) and relation extraction (Lao and Cohen, 2010). However, accurately recognizing representative entities in natural language remains challenging.
Previous works treat NER as a sequence labeling problem. For example, Lample et al. (2016) achieve decent performance on NER by combining deep recurrent neural networks (RNNs) with a conditional random field (CRF) (Lafferty et al., 2001). However, a critical problem that arises from treating NER as a sequence labeling task is that it only recognizes non-overlapping entities in a single, sequential scan of the raw text; it fails to detect nested named entities which are embedded in longer entity mentions, as illustrated in Figure 1. Due to the semantic structures within natural language, nested entities can be ubiquitous: e.g., 47% of the entities in the test split of the ACE-2004 (Doddington et al., 2004) dataset overlap with other entities, and 42% of the sentences contain nested entities. Various approaches (Alex et al., 2007; Lu and Roth, 2015; Katiyar and Cardie, 2018; Muis and Lu, 2017) have been proposed in the past decade to extract nested named entities. However, these models are designed explicitly for recognizing nested named entities. They usually do not perform well on non-overlapping named entity recognition compared to sequence labeling models.

[Figure 1: Example sentence "Last night, at the Chinese embassy in France, there was a holiday atmosphere." annotated with nested entity labels: Facility, GPE, GPE.]
To tackle the aforementioned drawbacks, we propose a novel neural framework, named MGNER, for Multi-Grained Named Entity Recognition. It is suitable for tackling both Nested NER and Non-overlapping NER. The idea of MGNER is natural and intuitive: first detect entity positions at various granularities via a Detector, and then classify these entities into different pre-defined categories via a Classifier. MGNER has five types of modules: Word Processor, Sentence Processor, Entity Processor, Detection Network, and Classification Network, where each module can adopt a wide range of neural network designs.
In summary, the contributions of this work are:
• We propose a novel neural framework named MGNER for Multi-Grained Named Entity Recognition, aiming to detect both nested and non-overlapping named entities effectively in a single model.
• MGNER is highly modularized. Each module in MGNER can adopt a wide range of neural network designs. Moreover, MGNER can be easily extended to many other related information extraction tasks, such as chunking (Ramshaw and Marcus, 1999) and slot filling (Mesnil et al., 2015).
• Experimental results show that MGNER is able to achieve new state-of-the-art results on both Nested Named Entity Recognition tasks and Non-overlapping Named Entity Recognition tasks.

Related Work
Existing approaches for recognizing non-overlapping named entities usually treat the NER task as a sequence labeling problem. Various sequence labeling models achieve decent performance on NER, including probabilistic graphical models such as Conditional Random Fields (CRF) (Ratinov and Roth, 2009), and deep neural networks such as recurrent neural networks or convolutional neural networks (CNN). Hammerton (2003) is the first work to use Long Short-Term Memory (LSTM) for NER. Collobert et al. (2011) employ a CNN-CRF structure, which obtains results competitive with statistical models. Most recent works leverage an LSTM-CRF architecture. Some use hand-crafted spelling features; Ma and Hovy (2016) and Chiu and Nichols (2016) utilize a character CNN to represent spelling characteristics; Lample et al. (2016) employ a character LSTM instead. Moreover, the attention mechanism has also been introduced in NER to dynamically decide how much information to use from a word-level or character-level component (Rei et al., 2016).
External resources have been used to further improve the NER performance. Peters et al. (2017) add pre-trained context embeddings from bidirectional language models to NER. Peters et al. (2018) learn a linear combination of internal hidden states stacked in a deep bidirectional language model, ELMo, to utilize both higher-level states which capture context-dependent aspects and lower-level states which model aspects of syntax. These sequence labeling models can only detect non-overlapping entities and fail to detect nested ones.
Various approaches have been proposed for Nested Named Entity Recognition. Finkel and Manning (2009) propose a CRF-based constituency parser which takes each named entity as a constituent in the parsing tree. Ju et al. (2018) dynamically stack multiple flat NER layers and extract outer entities based on the inner ones. Such a model may suffer from error propagation if shorter entities are recognized incorrectly.
Another series of approaches for Nested NER is based on hypergraphs. The idea of using hypergraphs was first introduced by Lu and Roth (2015), which allows edges to be connected to different types of nodes to represent nested entities. Muis and Lu (2017) use a multigraph representation and introduce the notion of mention separators for nested entity detection. Both Lu and Roth (2015) and Muis and Lu (2017) rely on hand-crafted features to extract nested entities and suffer from structural ambiguity. A neural segmental hypergraph model has also been presented, which uses neural networks to obtain distributed feature representations. Katiyar and Cardie (2018) also adopt a hypergraph-based formulation and learn the structure using an LSTM network in a greedy manner. One issue with these hypergraph approaches is the spurious structures of hypergraphs, as they enumerate combinations of nodes, types and boundaries to represent entities. In other words, these models are specially designed for nested named entities and are not suitable for non-overlapping named entity recognition.
Xu et al. (2017) propose a local detection method which relies on Fixed-size Ordinally Forgetting Encoding (FOFE) to encode the utterance and a simple feed-forward neural network to either reject or predict the entity label for each individual text fragment (Luan et al., 2018; Lee et al., 2017). Their model follows the same line as our proposed framework; the difference is that we separate the NER task into two stages, i.e., detecting entity positions and classifying entity categories.

The Proposed Framework
An overview of the proposed MGNER framework for multi-grained entity recognition is illustrated in Figure 2. Specifically, MGNER consists of two sub-networks: the Detector and the Classifier. The Detector detects all the possible entity positions while the Classifier classifies detected entities into pre-defined entity categories. The Detector has three modules: 1) a Word Processor which extracts word-level semantic features, 2) a Sentence Processor that learns context information for each utterance and 3) a Detection Network that decides whether a word segment is an entity or not. The Classifier consists of 1) a Word Processor which has the same structure as the one in the Detector, 2) an Entity Processor that obtains entity features and 3) a Classification Network that classifies entities into pre-defined categories. In addition, a self-attention mechanism is adopted in the Entity Processor to help the model capture and utilize entity-related contextual information. Each module in MGNER can be replaced with a wide range of different neural network designs. For example, BERT (Devlin et al., 2018) can be used as the Word Processor and a capsule model (Sabour et al., 2017; Xia et al., 2018) can be integrated into the Classification Network.
It is worth mentioning that, in order to improve the learning speed as well as the performance of MGNER, the Detector and the Classifier are trained with a series of shared input features, including the pre-trained word embeddings and the pre-trained language model features. Sentence-level semantic features trained in the Detector are also transferred into the Classifier to introduce and utilize the contextual information. We present the key building blocks and the properties of the Detector in Section 3.1 and of the Classifier in Section 3.2.

The Detector
The Detector aims to detect possible entity positions within each utterance. It takes an utterance as the input and outputs a set of entity candidates. Essentially, we use a semi-supervised neural network inspired by Peters et al. (2017) to model this process. The architecture of the Detector is illustrated in the left part of Figure 2. The Detector contains three major modules: Word Processor, Sentence Processor and Detection Network. More specifically, pre-trained word embeddings, POS tag information and character-level word information are used for generating semantically meaningful word representations. Word representations obtained from the Word Processor and the ELMo language model embeddings (Peters et al., 2018) are concatenated together to produce context-aware sentence representations. Each possible word segment is then examined in the Detection Network, which decides whether to accept it as an entity or not.

Word Processor
The Word Processor extracts a semantically meaningful word representation for each token. Given an input utterance with $K$ tokens $(t_1, ..., t_K)$, each token $t_k$ is represented by a concatenation of a pre-trained word embedding $w_k$, a POS tag embedding $p_k$ if it exists, and character-level word information $c_k$. The pre-trained word embedding $w_k$ with dimension $D_w$ is obtained from GloVe (Pennington et al., 2014). The character-level word information $c_k$ is obtained with a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) layer to capture the morphological information. The hidden size of this character LSTM is set as $D_{cl}$. As shown at the bottom of Figure 2, character embeddings are fed into the character LSTM. The final hidden states from the forward and backward character LSTM are concatenated as the character-level word information $c_k$. The POS tag embeddings and character embeddings are randomly initialized and learned during training.

Sentence Processor
To learn the contextual information from each sentence, another bidirectional LSTM, named the word LSTM, is applied to sequentially encode the utterance. For each token, the forward hidden state $\overrightarrow{h}_k$ and the backward hidden state $\overleftarrow{h}_k$ are concatenated into the hidden state $h_k$:

$$h_k = [\overrightarrow{h}_k; \overleftarrow{h}_k]$$

The dimension of the hidden states of the word LSTM is set as $D_{wl}$.
In addition, we utilize language model embeddings pre-trained in an unsupervised way, as in the ELMo model of Peters et al. (2018). The pre-trained ELMo embeddings and the hidden states of the word LSTM are concatenated. Hence, the concatenated hidden state $h_k$ for each token can be reformulated as:

$$h_k = [\overrightarrow{h}_k; \overleftarrow{h}_k; \mathrm{ELMo}_k]$$

where $\mathrm{ELMo}_k$ is the ELMo embedding for token $t_k$. Specifically, a three-layer bi-LSTM neural network is trained as the language model. Since the lower-level LSTM hidden states have the ability to model syntax properties and the higher-level LSTM hidden states can capture contextual information, ELMo computes the language model embeddings as a weighted combination of all the bidirectional LSTM hidden states:

$$\mathrm{ELMo}_k = \gamma \sum_{l=0}^{L} u_l \, h^{LM}_{k,l}$$

where $\gamma$ is a task-specific scale parameter which indicates the importance of the entire ELMo vector to the NER task, $L$ is the number of layers used in the pre-trained language model, the vector $u = [u_0, \cdots, u_L]$ represents softmax-normalized weights that combine different layers, and $h^{LM}_{k,l}$ is the language model hidden state of layer $l$ at time step $k$.
A sentence bidirectional LSTM layer with a hidden dimension of $D_{sl}$ is employed on top of the concatenated hidden states $h_k$. The forward and backward hidden states in this sentence LSTM are concatenated for each token as the final sentence representation $f_k \in \mathbb{R}^{2D_{sl}}$.
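The layer-weighted ELMo combination described in this section can be sketched for a single token as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names `softmax` and `elmo_combine`, and the use of nested lists for vectors, are assumptions made for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def elmo_combine(layer_states, raw_weights, gamma):
    """Combine language-model layers for one token.

    layer_states: list of L+1 vectors (one per LM layer), all the same length.
    raw_weights:  list of L+1 unnormalized layer weights (softmax gives u).
    gamma:        task-specific scale parameter.
    """
    u = softmax(raw_weights)
    dim = len(layer_states[0])
    return [
        gamma * sum(u[l] * layer_states[l][d] for l in range(len(layer_states)))
        for d in range(dim)
    ]


# Three layers with equal raw weights, so each layer contributes 1/3.
states = [[3.0, 0.0], [0.0, 3.0], [3.0, 3.0]]
combined = elmo_combine(states, [0.0, 0.0, 0.0], gamma=2.0)
```

In a trained model the raw layer weights and the scale would be learned or tuned per task; here they are fixed by hand only to make the arithmetic easy to follow.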

Detection Network
Using the semantically meaningful features obtained in $f_k$, we can identify possible entities within each utterance. The strategy for finding entities is to first generate all the word segments as entity proposals and then estimate the probability of each proposal being an entity or not.
To enumerate all possible entity proposals, proposals of different lengths are generated surrounding each token position. For each token position, $R$ entity proposals with lengths varying from 1 to the maximum length $R$ are generated. Specifically, assume that an input utterance consists of a sequence of $N$ tokens $(t_1, t_2, t_3, t_4, t_5, t_6, ..., t_N)$. To balance the performance and the computational cost, we set $R$ to 6. We take each token position as the center and generate 6 proposals surrounding it, so that all $6N$ possible proposals under the maximum length of 6 are generated. As shown in Figure 3, the entity proposals generated surrounding token $t_3$ are: $(t_3)$, $(t_3, t_4)$, $(t_2, t_3, t_4)$, $(t_2, t_3, t_4, t_5)$, $(t_1, t_2, t_3, t_4, t_5)$, $(t_1, t_2, t_3, t_4, t_5, t_6)$. Similar entity proposals are generated for all the token positions, and proposals that contain invalid indexes, such as $(t_0, t_1, t_2)$, are deleted. Hence we obtain all the valid entity proposals under the condition that the maximum length is $R$.
Figure 3: All possible entity proposals generated surrounding token $t_3$ when the maximum length of an entity proposal $R$ is set as 6.
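The proposal pattern listed for token $t_3$ can be reproduced with a short sketch. The offset rule used here (a proposal of length r centered at position k starts at k - floor((r - 1) / 2)) is inferred from that single example and is an assumption, as is the function name `generate_proposals`.

```python
def generate_proposals(num_tokens, max_len=6):
    """Enumerate valid (center, start, end) proposals with 1-based, inclusive indices.

    For each center k and each length r in 1..max_len, the span starts at
    k - (r - 1) // 2, which matches the example given for token t_3.
    Spans that run outside 1..num_tokens are dropped.
    """
    proposals = []
    for k in range(1, num_tokens + 1):
        for r in range(1, max_len + 1):
            start = k - (r - 1) // 2
            end = start + r - 1
            if start >= 1 and end <= num_tokens:
                proposals.append((k, start, end))
    return proposals


all_props = generate_proposals(num_tokens=6, max_len=6)
around_t3 = [(s, e) for k, s, e in all_props if k == 3]
```

Under this rule every span of length at most `max_len` is produced exactly once, because a span's start and length determine its center uniquely.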
For each token, we simultaneously estimate the probability of being an entity or not for all $R$ proposals. A fully connected layer with a two-class softmax function is used to determine the quality of entity proposals:

$$s_k = \mathrm{softmax}(f_k W_p + b_p)$$

where $W_p \in \mathbb{R}^{2D_{sl} \times 2R}$ and $b_p \in \mathbb{R}^{2R}$ are the weights and the bias of the entity proposal layer; $s_k$ contains $2R$ scores, including $R$ scores for being an entity and $R$ scores for not being an entity at position $k$. The cross-entropy loss is employed in the Detector as follows:

$$L_p = -\sum_{k} \sum_{r=1}^{R} y_k^r \log s_k^r$$

where $y_k^r$ is the label for proposal type $r$ at position $k$ and $s_k^r$ is the probability of being an entity for proposal type $r$ at position $k$. It is worth mentioning that most entity proposals are negative proposals. Thus, to balance the influence of positive and negative proposals in the loss function, we keep all positive proposals and down-sample negative proposals when calculating the loss $L_p$. For each batch, we fix the total number of proposals used in the loss function, including all positive proposals and the sampled negative proposals, as $N_b$. In the inference procedure of the Detection Network, an entity proposal is recognized as an entity candidate if its score of being an entity is higher than its score of not being an entity.
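The negative down-sampling step can be sketched as below. This is an illustrative reading of the description, assuming uniform random sampling of negatives; the function name `sample_proposals`, the `seed` argument and the handling of batches with fewer than $N_b$ proposals are assumptions, since the text does not specify them.

```python
import random


def sample_proposals(labels, n_b, seed=0):
    """Pick the proposal indices that contribute to the detection loss.

    labels: list of 0/1 flags, one per proposal in the batch (1 = entity).
    n_b:    target total number of proposals used in the loss.
    All positives are kept; negatives are sampled uniformly to fill up to n_b.
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    n_neg = max(0, min(len(negatives), n_b - len(positives)))
    return sorted(positives + rng.sample(negatives, n_neg))


labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
chosen = sample_proposals(labels, n_b=4)
```

With two positives and a budget of four, the result always contains indices 0 and 4 plus two randomly drawn negatives.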

The Classifier
The Classifier module aims at classifying entity candidates obtained from the Detector into different pre-defined entity categories. For the nested NER task, all the proposed entities are saved and fed into the Classifier. For the NER task with non-overlapping entities, we utilize the non-maximum suppression (NMS) algorithm (Neubeck and Van Gool, 2006) to deal with redundant, overlapping entity proposals and output the final entity candidates. The idea of NMS is simple but effective: pick the entity proposal with the maximum probability, delete conflicting entity proposals, and repeat the process until all the proposals are processed. Eventually, we obtain the non-conflicting entity candidates as the input of the Classifier.
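The NMS procedure described above can be sketched for one-dimensional token spans. This is a minimal greedy version, assuming that two proposals conflict whenever they share at least one token; the function name `nms_spans` and the tuple layout are illustrative choices, not taken from the paper.

```python
def nms_spans(proposals):
    """Greedy non-maximum suppression over token spans.

    proposals: list of (start, end, score) with inclusive token indices.
    Repeatedly keeps the highest-scoring remaining span and discards every
    span that overlaps a span already kept.
    """
    kept = []
    for start, end, score in sorted(proposals, key=lambda p: p[2], reverse=True):
        # Keep the span only if it is disjoint from all spans kept so far.
        if all(end < s or start > e for s, e, _ in kept):
            kept.append((start, end, score))
    return sorted(kept)


candidates = nms_spans([(1, 3, 0.9), (2, 4, 0.8), (5, 6, 0.7), (6, 6, 0.95)])
```

Here `(2, 4)` is suppressed by the higher-scoring `(1, 3)`, and `(5, 6)` is suppressed by `(6, 6)`.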
To understand the contextual information of the proposed entity, we utilize both sentence-level context information and a self-attention mechanism to help the model focus on entity-related context tokens. The framework of the Classifier is shown in the right part of Figure 2. Essentially, it consists of three modules: Word Processor, Entity Processor and Classification Network.

Word Processor
The same Word Processor as in the Detector is used here to obtain the word representations for the entity candidates obtained from the Detector. The word-level embedding, which is the concatenation of the pre-trained word embedding and the POS tag embedding if it exists, is transferred from the Word Processor in the Detector to improve the performance as well as to speed up the learning process. The character-level LSTM and character embeddings are trained separately in the Detector and the Classifier.

[Table 1: Corpus statistics for ACE-2004, ACE-2005 and CoNLL-2003, each with TRAIN / DEV / TEST columns. Recoverable entries, total sentences: ACE-2004 6,799 / 829 / 879; ACE-2005 TRAIN 7,336.]

Entity Processor
The word representation is fed into a bidirectional word LSTM with hidden size $D_{wl}$, and the hidden states are concatenated with the ELMo language model embeddings as the entity features. A bidirectional LSTM with hidden size $D_{el}$ is applied to the entity features to capture sequence information among the entity words. The last hidden states of the forward and backward Entity LSTM are concatenated as the entity representation $e \in \mathbb{R}^{2D_{el}}$. The same word in different contexts may have different semantic meanings. Therefore, our model takes the contextual information into consideration when learning the semantic representations of entity candidates. We capture the contextual information from other words in the same utterance. Let $c$ denote the context feature vector of a context word; it can be extracted from the sentence representation $f_k$ in the Detector. Hence, the sentence features trained in the Detector are directly transferred to the Classifier.
An easy way to model context words is to concatenate all the word representations or to average them. However, this naive approach may fail when there are many unrelated context words. To select highly relevant context words and learn an accurate contextual representation, we propose a self-attention mechanism to simulate and dynamically control the relatedness between the context and the entity. The self-attention module takes the entity representation $e$ and all the context features $C = [c_1, c_2, ..., c_N]$ as the inputs, and outputs a vector of attention weights $a$:

$$a = \mathrm{softmax}(C W e^{T})$$

where $W \in \mathbb{R}^{2D_{sl} \times 2D_{el}}$ is a weight matrix for the self-attention layer, and $a$ is the self-attention weight on different context words. To help the model focus on entity-related context, the attentive vector $C^{att}$ is calculated as the attention-weighted context: $C^{att} = a * C$.
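The attention step can be sketched in plain Python for small vectors. The sketch assumes the weights take the bilinear form a = softmax(C W e^T), which is consistent with the stated shape of $W$, and that $a * C$ scales each context vector by its own weight; the function name `attend` and the toy dimensions are assumptions made for illustration.

```python
import math


def attend(context, W, entity):
    """Compute attention weights and the attention-weighted context.

    context: list of N context vectors, each of length d_c.
    W:       d_c x d_e weight matrix (list of rows).
    entity:  entity representation of length d_e.
    Returns (a, c_att): N weights summing to 1, and N scaled context vectors.
    """
    # Project the entity into the context space: proj = W e^T.
    proj = [sum(row[j] * entity[j] for j in range(len(entity))) for row in W]
    # One relevance score per context word: c_n . proj.
    scores = [sum(c[i] * proj[i] for i in range(len(proj))) for c in context]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    a = [x / total for x in exps]
    c_att = [[a_n * v for v in c] for a_n, c in zip(a, context)]
    return a, c_att


context = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
a_flat, c_flat = attend(context, identity, [0.0, 0.0])
a_peaked, _ = attend(context, identity, [10.0, 0.0])
```

With a zero entity vector all context words receive equal weight, while an entity vector aligned with the first context word shifts almost all the weight onto it.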
The length of the attentive context $C^{att}$ varies across contexts. However, the goal of the Classification Network is to classify entity candidates into different categories, and thus it requires a fixed embedding size. We achieve that by adding another LSTM layer. An Attention LSTM with hidden dimension $D_{ml}$ is used, and the concatenation of the last hidden states of its forward and backward layers serves as the context representation $m \in \mathbb{R}^{2D_{ml}}$. Hence the shape of the context representation is aligned. We concatenate the context representation and the entity representation together as a context-aware entity representation to classify entity candidates: $o = [m; e]$.

Classification Network
A two-layer fully connected neural network is used to classify candidates into pre-defined categories:

$$p = \mathrm{softmax}(\sigma(o W_{c1} + b_{c1}) W_{c2} + b_{c2})$$

where $W_{c1} \in \mathbb{R}^{(2D_{ml} + 2D_{el}) \times D_h}$, $b_{c1} \in \mathbb{R}^{D_h}$, $W_{c2} \in \mathbb{R}^{D_h \times (D_t + 1)}$ and $b_{c2} \in \mathbb{R}^{D_t + 1}$ are the weights for this fully connected neural network, $\sigma$ is the activation function of the hidden layer, and $D_t$ is the number of entity types. This classification function thus classifies entity candidates into $(D_t + 1)$ types; the additional type covers the scenario in which a candidate is not a real entity. Finally, the hinge-ranking loss is adopted in the Classification Network:

$$L_c = \sum_{y_w} \max\{0, \Delta + p_w - p_r\}$$

where $p_w$ is the probability for the wrong labels $y_w$, $p_r$ is the probability for the right label $y_r$, and $\Delta$ is a margin. The hinge-ranking loss pushes the probability of the right label above the probabilities of the wrong labels and improves the classification performance.
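The hinge-ranking loss can be sketched for a single candidate. The sketch assumes the standard form in which one term max(0, Δ + p_w − p_r) is summed over every wrong label, matching the symbol definitions in the text; the function name `hinge_rank_loss` and the toy probabilities are illustrative only.

```python
def hinge_rank_loss(probs, right, margin):
    """Hinge-ranking loss for one candidate.

    probs:  list of class probabilities (length D_t + 1).
    right:  index of the correct label.
    margin: the margin Delta.
    Each wrong label contributes max(0, margin + p_w - p_r).
    """
    p_r = probs[right]
    return sum(
        max(0.0, margin + p_w - p_r)
        for i, p_w in enumerate(probs)
        if i != right
    )


loss = hinge_rank_loss([0.1, 0.7, 0.2], right=1, margin=1.0)
```

For this example the two wrong labels contribute 0.4 and 0.5, so the loss is 0.9; with a margin of zero the same prediction would incur no loss because the right label already has the highest probability.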

Experiments
To show the ability and effectiveness of our proposed framework, MGNER, for Multi-Grained Named Entity Recognition, we conduct experiments on both the Nested NER task and the traditional non-overlapping NER task.

Datasets
We mainly evaluate our framework on ACE-2004 and ACE-2005 (Doddington et al., 2004) with the same splits used by previous works (Luo et al., 2015) for the nested NER task. Specifically, the ACE datasets contain seven different types of entities, such as person, facility, weapon and vehicle. For the traditional NER task, we use the CoNLL-2003 dataset (Tjong Kim Sang and De Meulder, 2003), which contains four types of named entities: location, organization, person and miscellaneous. An overview of these three datasets is given in Table 1. It can be observed that most entities are no longer than 6 tokens, and thus we select the maximum entity length $R = 6$.

Implementation Details
We performed random search (Bergstra and Bengio, 2012) for hyper-parameter optimization and selected the best setting based on performance on the development set. We employ the Adam optimizer (Kingma and Ba, 2014) with learning rate decay for all the experiments. The learning rate is set to 0.001 at the beginning and exponentially decayed by a factor of 0.9 after each epoch. The batch size of utterances is set to 20. In order to balance the influence of positive and negative proposals, we down-sample negative ones, and the total proposal number $N_b$ for each batch is 128. To alleviate over-fitting, we add dropout regularization after the word representation layer and all the LSTM layers, with a dropout rate of 0.5. In addition, we employ early stopping when there is no performance improvement on the development set after three epochs. The pre-trained word embeddings are from GloVe (Pennington et al., 2014), and the word embedding dimension $D_w$ is 300. The ELMo 5.5B data is utilized in the experiments for the language model embeddings. Moreover, the size of the character embedding is 100, and the hidden size of the Character LSTM $D_{cl}$ is also 100. The size of the POS tag embedding $p_k$ is 300 for the ACE datasets, and no POS tag information is used for the CoNLL-2003 dataset. The hidden dimensions of the Word LSTM layer $D_{wl}$, the Sentence LSTM layer $D_{sl}$, the Entity LSTM layer $D_{el}$ and the Attention LSTM layer $D_{ml}$ are all set to 300. The hidden dimension of the classification layer $D_h$ is 50. The margin $\Delta$ in the hinge-ranking loss for the entity category classification is set to 5. The ELMo scale parameter $\gamma$ is 3.35 in the Detector and 3.05 in the Classifier.
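The learning-rate schedule and the early-stopping rule can be written out as a small sketch. It assumes that the decay is applied multiplicatively once per epoch and that "no improvement after three epochs" means the best development score is at least three epochs old; both readings, and the function names `learning_rate` and `should_stop`, are assumptions.

```python
def learning_rate(epoch, base_lr=0.001, decay=0.9):
    """Learning rate used during the given epoch (epoch 0 is the first)."""
    return base_lr * decay ** epoch


def should_stop(dev_scores, patience=3):
    """Return True once the best dev score is at least `patience` epochs old.

    dev_scores: development-set scores, one per finished epoch.
    """
    if not dev_scores:
        return False
    best_epoch = dev_scores.index(max(dev_scores))
    return len(dev_scores) - 1 - best_epoch >= patience


lr_third_epoch = learning_rate(2)
stop_now = should_stop([90.0, 91.0, 90.5, 90.9, 90.8])
```

In the example the best score occurs at the second epoch and is followed by three epochs without improvement, so training would stop.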

Results
Nested NER Task. The proposed MGNER is well suited for detecting nested named entities since every possible entity is examined and classified. In order to validate this advantage, we compare MGNER with numerous baseline models: 1) Lu and Roth (2015), which propose mention hypergraphs for recognizing overlapping entities; 2) Lample et al. (2016), which adopt the LSTM-CRF structure for sequence labeling; 3) Muis and Lu (2017), which introduce mention separators to tag gaps between words for recognizing overlapping mentions; 4) Xu et al. (2017), which propose a local detection method; 5) Katiyar and Cardie (2018), which propose a hypergraph-based model using an LSTM for learning feature representations; 6) Ju et al. (2018), which use a layered model that extracts outer entities based on inner ones; 7) a neural transition-based model that constructs nested mentions through a sequence of actions; and 8) a neural segmental hypergraph model. Experimental results of the Nested NER task on the ACE-2004 and ACE-2005 datasets are reported in Table 2. We can observe from Table 2 that our proposed framework MGNER outperforms all the baseline approaches. For both datasets, our model improves the state-of-the-art result by around 4% in terms of precision, recall, and F1 score.
To study the contribution of different modules in MGNER, we also report the performance of two ablation variants of the proposed MGNER at the bottom of Table 2. MGNER w/o attention is a variant of MGNER which removes the self-attention mechanism, and MGNER w/o context removes all the context information. To remove the self-attention mechanism, we feed the context feature $C$, rather than the attentive context vector $C^{att}$, directly into a bidirectional LSTM to obtain the context representation $m$. As for MGNER w/o context, we only use the entity representation $e$ for classification, rather than the context-aware entity representation $o$. By adding the context information, the F1 score improves by 0.9% on the ACE-2004 dataset and 0.7% on the ACE-2005 dataset. The self-attention mechanism improves the F1 score by 0.6% on the ACE-2004 dataset and 0.5% on the ACE-2005 dataset.
To analyze how well our model performs on overlapping and non-overlapping entities, we split the test data into two portions: sentences with and without overlapping entities (following previously used splits). Four state-of-the-art nested NER models are compared with our proposed framework MGNER on the ACE-2005 dataset. As illustrated in Table 3, MGNER consistently performs better than the baselines on both portions, especially on the non-overlapping part. This observation indicates that our model can better recognize non-overlapping entities than previous nested NER models.
The first step in MGNER is to detect entity positions using the Detector, where the effectiveness of proposing correct entity candidates immediately affects the performance of the whole model.
NER Task. We also evaluate the proposed MGNER framework on the NER task, which requires recognizing non-overlapping entities. Two types of baseline models are compared here: sequence labeling models, which are designed specifically for the non-overlapping NER task, and nested NER models, which also provide the ability to detect non-overlapping mentions. The first type of models includes 1) Lample et al. (2016), which adopt the LSTM-CRF structure; 2) Ma and Hovy (2016), which use an LSTM-CNNs-CRF architecture; 3) Chiu and Nichols (2016), which propose a CNN-LSTM-CRF model; 4) Peters et al. (2017), which add semi-supervised language model embeddings; and 5) Peters et al. (2018), which utilize the state-of-the-art ELMo language model embeddings. The second type consists of four nested models mentioned in the Nested NER section, including Luo et al. (2015), Muis and Lu (2017), and Xu et al. (2017). Table 4 shows the F1 scores of different approaches on the CoNLL-2003 development set and test set for the English NER task. The mean and standard deviation across five runs are reported. It can be observed from Table 4 that the proposed MGNER model outperforms all the baselines. The models designed for non-overlapping entity detection usually perform better than Nested NER models on the NER task.
Our proposed framework outperforms state-of-the-art results on both the NER and the Nested NER task. Xu et al. (2017) is the best baseline model among the nested models, since it shares a similar idea with our proposed framework by individually examining each entity proposal. From the ablation study, we can observe that by purely adding the context information, the F1 score on the CoNLL-2003 test set improves from 92.23 to 92.26, and by adding the attention mechanism, the F1 score improves to 92.28.
We also provide the performance of detecting non-overlapping entities in the Detector here. The precision, recall and F1 score are 95.33, 95.69 and 95.51 on the CoNLL-2003 dataset.
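As a quick consistency check, the reported Detector F1 score follows from the reported precision and recall via the usual harmonic-mean definition. The helper name `f1_score` is an illustrative choice.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    return 2 * precision * recall / (precision + recall)


detector_f1 = round(f1_score(95.33, 95.69), 2)
```

The value rounds to 95.51, in agreement with the figure stated above.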

Conclusions
In this work, we propose a novel neural framework named MGNER for Multi-Grained Named Entity Recognition, where multiple entities or entity mentions in a sentence may be non-overlapping or totally nested. MGNER is a framework with high modularity, and each component in MGNER can adopt a wide range of neural networks. Experimental results show that MGNER is able to achieve state-of-the-art results on both the nested NER task and the traditional non-overlapping NER task.