A Boundary-aware Neural Model for Nested Named Entity Recognition

In natural language processing, entities frequently contain other entities. Most existing work on named entity recognition (NER) deals only with flat entities and ignores nested ones. We propose a boundary-aware neural model for nested NER that leverages entity boundaries to predict entity categorical labels. Our model locates entities precisely by detecting their boundaries with a sequence labeling model. Based on the detected boundaries, it uses the boundary-relevant regions to predict entity categorical labels, which decreases computation cost and relieves the error propagation problem of layered sequence labeling models. We introduce multitask learning to capture the dependencies between entity boundaries and their categorical labels, which improves the performance of entity identification. We conduct experiments on the GENIA dataset, and the results demonstrate that our model outperforms other state-of-the-art methods.


Introduction
Named entity recognition (NER) is the task of locating named entities in unstructured text and classifying them into pre-defined categories such as person names, locations or medical codes. NER is generally treated as a single-layer sequence labeling problem (Lafferty et al., 2001; Lample et al., 2016) in which each token is tagged with one label. The label is composed of an entity boundary label and a categorical label. For example, a token can be tagged with B-PER, where B indicates the boundary of an entity and PER indicates the corresponding entity categorical label. However, when entities are nested within one another, single-layer sequence labeling models cannot extract both entities simultaneously, because a token contained in several entities has more than one categorical label. Consider the example in Figure 1 from the GENIA corpus (Kim et al., 2003): "Human TR Beta 1" is a protein and is also part of a DNA, "Human TR Beta 1 mRNA". Both entities contain the token "Human", so the token should have two different categorical labels. In that case, assigning a single categorical label to "Human" is improper.

Figure 1: An example of nested entities and their boundary labels. "B" and "E" indicate the beginning and end of an entity; they are the boundary labels. "I" and "O" denote tokens inside and outside entities, respectively. protein and RNA are categories of entities.
Traditional methods for nested entities rely on hand-crafted features (Shen et al., 2003; Alex et al., 2007) and suffer from heavy feature engineering. Recent studies tackle nested NER with neural models that do not rely on linguistic features or external knowledge resources. Ju et al. (2018) propose a layered sequence labeling model, and Sohrab and Miwa (2018) propose an exhaustive region classification model.
• The layered sequence labeling model first extracts the inner entities (those contained by other entities) and feeds them into the next layer to extract outer entities. This model therefore suffers from error propagation: when one layer extracts wrong entities, the performance of the next layer is affected. Moreover, when an outer entity is extracted first, the inner one will not be detected.
• The exhaustive region classification model enumerates all possible regions, or spans, in a sentence and predicts entities in a single layer. One issue with this method is that explicit boundary information is ignored, which leads to the extraction of some non-entities. For example, in a token sequence from the GENIA dataset, "novel TH protein" is an entity and "a novel TH protein" is not. However, because the two regions share many tokens, their merged region representations are similar. "novel" and "protein" are the boundaries of the entity; without the boundary information, both candidate regions are extracted as entities.
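To make the cost of exhaustive enumeration concrete, the following minimal sketch lists every contiguous region of a sentence. It is an illustration only, not the implementation of Sohrab and Miwa (2018); the function name `enumerate_regions` and the optional `max_len` cap are hypothetical.

```python
def enumerate_regions(tokens, max_len=None):
    """Return every contiguous span as (start, end), with end inclusive.

    max_len is a hypothetical cap on span length; None means no cap.
    """
    n = len(tokens)
    limit = n if max_len is None else max_len
    return [(i, j) for i in range(n) for j in range(i, min(n, i + limit))]


tokens = ["a", "novel", "TH", "protein"]
regions = enumerate_regions(tokens)

# Without a cap, a sentence of n tokens yields n * (n + 1) / 2 regions.
num_regions = len(regions)

# Both the entity "novel TH protein" (1, 3) and the non-entity
# "a novel TH protein" (0, 3) appear as candidates.
has_entity = (1, 3) in regions
has_non_entity = (0, 3) in regions
```

Every region returned here has to be classified individually, which is why near-duplicate spans such as (0, 3) and (1, 3) compete without any boundary signal to separate them.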
Despite their shortcomings, the layered sequence labeling model and the exhaustive region classification model are complementary, so we combine them to improve the performance of nested NER. We use a sequence labeling model to bring boundary information into locating entities. In the example above, "novel" is a boundary of the entity "novel TH protein", while "a" is a general token whose representation differs from that of "novel". Guided by boundary information, the model can detect "novel", rather than "a", as a boundary of the entity. We also use a region classification model to predict entities without considering the dependencies between inner and outer entities. As a result, our model does not suffer from the error propagation problem.
In this paper, we propose a boundary-aware neural model that fuses the sequence labeling model and the region classification model. We apply a single-layer sequence labeling model to identify entity boundaries, because the tokens in nested entities can share the same boundary labels. For example, as shown in Figure 1, "Human" can be tagged with the label B although it is the beginning of two different entities. Based on the detected entity boundaries, we predict entity categorical labels by classifying boundary-relevant regions. As shown in Figure 1, we match each token with label B to tokens with label E, and the regions between them are considered candidate entities. The representations of the candidate entities are then used to classify categorical labels.
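The shared boundary labels can be illustrated with a short sketch that derives one B/I/E/O label per token from a set of nested spans. The function name `boundary_labels` is hypothetical, and the priority used when a token plays several roles (B over E over I over O) is an assumption made for this example; the text does not state how such conflicts are resolved.

```python
def boundary_labels(num_tokens, entities):
    """Derive one boundary label per token from nested entity spans.

    entities: list of (start, end) spans, end inclusive.
    Assumed priority when roles conflict: B > E > I > O.
    """
    rank = {"O": 0, "I": 1, "E": 2, "B": 3}
    labels = ["O"] * num_tokens

    def assign(pos, label):
        # Keep the higher-priority label if the token already has one.
        if rank[label] > rank[labels[pos]]:
            labels[pos] = label

    for start, end in entities:
        for k in range(start + 1, end):
            assign(k, "I")
        assign(end, "E")
        assign(start, "B")  # a single-token entity therefore ends up as B
    return labels


# The Figure 1 example: two nested entities that both begin at "Human".
tokens = ["Human", "TR", "Beta", "1", "mRNA"]
entities = [(0, 3), (0, 4)]
labels = boundary_labels(len(tokens), entities)
```

Here "Human" receives a single B label even though it opens two entities, and both "1" and "mRNA" are marked E, so a single labeling pass exposes the boundaries of both entities.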
Our model improves on the exhaustive region classification model in two ways. (1) We leverage explicit boundary information to guide the model to locate and classify entities precisely. The exhaustive region classification model classifies entity regions individually, whereas our model considers the context of boundary tokens through a sequence labeling model, which facilitates the detection of boundaries. (2) Our model classifies only the boundary-relevant regions, which are far fewer than all possible regions, and this decreases the time cost.
Our model also improves on the layered sequence labeling model because we extract entities without distinguishing inner from outer entities.
Multitask learning optimizes an overall goal by alternately tuning two or more objectives that reinforce each other (Ruder, 2017). Because our boundary detection module and entity categorical label prediction module share the same entity boundaries, we apply a multitask loss to train the two tasks simultaneously. The features shared by the two modules are extracted by a bidirectional long short-term memory (LSTM) layer. Extensive experiments show that the multitask learning framework improves final performance by a large margin.
In summary, we make the following major contributions in this paper:
• We propose a boundary-aware neural model that leverages entity boundaries to predict categorical labels. Our model locates entities precisely by detecting boundaries with sequence labeling models. Based on the detected boundaries, it uses boundary-relevant regions to predict entity categorical labels, which decreases computation cost and relieves the error propagation problem.
• We introduce multitask learning to capture the dependencies between entity boundaries and their categorical labels, which improves the performance of entity identification.
• We conduct experiments on public nested NER datasets. The results demonstrate that our model outperforms previous state-of-the-art methods and is much faster at inference.

Related Work
NER has drawn the attention of NLP researchers because several downstream tasks, such as entity linking (Gupta et al., 2017), relation extraction (Mintz et al., 2009; Liu et al., 2017), co-reference resolution (Chang et al., 2013) and conversation systems (Ren et al., 2019), rely on it. Several methods have been proposed for flat named entity recognition (Lample et al., 2016; Ma and Hovy, 2016; Strubell et al., 2017), while few of them address nested entities. Early work on nested entities relies on hand-crafted features or rule-based post-processing (Zhou, 2006): the innermost flat entities are detected with a Hidden Markov Model, and rule-based post-processing is then used to extract the outer entities.
While most work is concerned with named entities, Lu and Roth (2015) present a novel hypergraph-based method for entity mention detection. One issue with their method is the spurious structure of the hypergraphs. Muis and Lu (2017) improve on the method of Lu and Roth (2015) by incorporating mention separators along with features.
Recent studies reveal that stacking sequence models, such as conditional random field (CRF) layers, can extract entities from inner to outer. Alex et al. (2007) propose several CRF-based methods for the GENIA dataset. However, their approach cannot recognize nested entities of the same type. Finkel and Manning (2009) present a chart-based parsing method in which each named entity is a constituent of the parse tree. However, with its cubic time complexity, their method does not scale to larger corpora. Ju et al. (2018) dynamically stack flat NER layers to extract nested entities, where each flat layer consists of a Bi-LSTM layer followed by a cascaded CRF layer. Their model suffers from error propagation from layer to layer, and an inner entity cannot be detected when an outer entity is identified first.
It is difficult for sequence models such as CRF to extract nested entities, where a token can be included in several entities. Wang et al. (2018) present a transition-based model for nested mention detection using a forest representation. One drawback of their model is its greedy training and decoding. Sohrab and Miwa (2018) consider all possible regions in a sentence and classify each of them as an entity type or as a non-entity. However, their exhaustive method brings too many irrelevant (non-entity) regions into entity type detection, and the regions are classified individually, without considering contextual information. Our model focuses on the boundary-relevant regions, which are far fewer, and its explicit use of boundary information helps to locate entities more precisely.

Method
In this paper, we propose a boundary-aware neural model that uses boundary information to locate and classify entities. The architecture is illustrated in Figure 2.
Our model is built upon a shared bidirectional LSTM layer and uses the outputs of this layer both to detect entity boundaries and to predict categorical labels. We extract entity boundaries as paired tokens with label B and label E, where "B" indicates the beginning of an entity and "E" indicates its end. We match every detected token with label B to its corresponding token with label E, and the regions between them are identified as candidate entities. We represent entities using the corresponding region outputs of the shared LSTM and classify them into categorical labels.
The boundary detection module and the entity categorical label prediction module are trained simultaneously with a multitask loss function, which can capture the underlying dependencies between entity boundaries and their categorical labels. We describe each part of our model in detail below.

Token Representation
Following the success of Ma and Hovy (2016) and Lample et al. (2016), who leverage character embeddings for the flat NER task, we represent each token in the sentence with word-level and character-level information.
For a given sentence consisting of n tokens $(t_1, t_2, \ldots, t_n)$, we represent the word embedding of the i-th token $t_i$ as equation (1):
$$x^w_i = e^w(t_i) \tag{1}$$
where $e^w$ denotes a word embedding lookup table, which we initialize with pre-trained word embeddings (Chiu et al., 2016).
We capture the orthographic and morphological features of the word by integrating character representations. We denote the representation of the characters within $t_i$ as $x^c_i$ and the embedding of each character $c_j$ within token $t_i$ as $e^c(c_j)$, where $e^c$ is the character embedding lookup table, initialized randomly. We feed the character embeddings into a bidirectional LSTM layer to learn hidden states. The forward and backward outputs are concatenated to construct the character representation:
$$x^c_i = [\overleftarrow{h}^c_i ; \overrightarrow{h}^c_i]$$
where $\overleftarrow{h}^c_i$ and $\overrightarrow{h}^c_i$ denote the backward and forward outputs of the bidirectional LSTM. The token representation is the concatenation of the word embedding and the character representation, $x^t_i = [x^w_i ; x^c_i]$.

Shared Feature Extractor
As shown in Figure 2, we apply the hard parameter sharing mechanism (Ruder, 2017) for multitask training, using a bidirectional LSTM as the shared feature extractor. Hard parameter sharing greatly reduces the risk of overfitting (Baxter, 1997) and increases the correlation between our boundary detection module and categorical label prediction module. Specifically, the hidden states of the bidirectional LSTM can be expressed as follows:
$$\overrightarrow{h}^t_i = \overrightarrow{\mathrm{LSTM}}(x^t_i, \overrightarrow{h}^t_{i-1})$$
$$\overleftarrow{h}^t_i = \overleftarrow{\mathrm{LSTM}}(x^t_i, \overleftarrow{h}^t_{i+1})$$
$$h^t_i = [\overrightarrow{h}^t_i ; \overleftarrow{h}^t_i]$$
where $x^t_i$ is the token representation described in Section 3.1 (Token Representation). We feed $x^t_i$ into a Dropout layer to prevent overfitting. $\overrightarrow{h}^t_i$ and $\overleftarrow{h}^t_i$ denote the i-th forward and backward hidden states of the Bi-LSTM layer. Formally, we extract the shared features of each token in a sentence as $h^t_i$.

Entity Boundary Detection
Previous work on flat NER (non-nested named entity recognition) (Lample et al., 2016; Ma and Hovy, 2016) predicts entity boundaries and categorical labels jointly. However, when entities are nested in other entities, an individual token can belong to several different entities, so assigning a single categorical label to each token is inappropriate. We therefore divide nested NER into two subtasks: entity boundary detection and categorical label prediction. Instead of assigning an entity categorical label to each token, we predict the boundary labels first.
Formally, given a sentence $(t_1, t_2, \ldots, t_n)$ and one entity in the sentence, we represent the entity as $R(i, j)$, which denotes that the entity is composed of the continuous token sequence $(t_i, t_{i+1}, \ldots, t_j)$. Specifically, we tag the boundary token $t_i$ as "B" and $t_j$ as "E". Tokens inside entities are assigned the label "I" and non-entity tokens are assigned the label "O".
We detect entity boundaries as shown in Figure 3. For each token $t_i$ in a sentence, we predict a boundary label by feeding its shared feature representation $h^t_i$ (described in Section 3.2, Shared Feature Extractor) into a ReLU activation function and a softmax classifier:
$$d^t_i = \mathrm{softmax}\left(U\,\mathrm{ReLU}(h^t_i) + b\right)$$
where $U$ and $b$ are trainable parameters. We compute the KL-divergence multi-label loss between the true distribution $\hat{d}^t_i$ and the predicted distribution $d^t_i$ as equation (9):
$$L_{bcls} = -\sum_{i} \hat{d}^t_i \log d^t_i \tag{9}$$
The conditional random field (CRF) (Lafferty et al., 2001) is considered good at modeling sequence label dependencies (e.g., label I must come after B). Because our sequence labels differ from those of flat NER models, we compare softmax and CRF as the output layer.
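As a rough illustration of the per-token boundary classifier, the sketch below applies a ReLU, a linear map and a softmax to one shared feature vector and scores the result with cross-entropy. It is one reading of the description above, written in plain Python rather than the PyTorch implementation used in the experiments; the weights, the feature vector and the function names are made-up values chosen for the example.

```python
import math

LABELS = ["B", "I", "E", "O"]


def relu(vector):
    return [max(0.0, x) for x in vector]


def softmax(logits):
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def boundary_distribution(h, U, b):
    """softmax(U * ReLU(h) + b) for one token's shared feature vector h."""
    activated = relu(h)
    logits = [sum(w * x for w, x in zip(row, activated)) + bias
              for row, bias in zip(U, b)]
    return softmax(logits)


def cross_entropy(true_dist, pred_dist):
    # With a one-hot true distribution this equals the KL divergence.
    return -sum(t * math.log(p) for t, p in zip(true_dist, pred_dist) if t > 0)


# Hypothetical 3-dimensional feature vector and 4 x 3 weight matrix.
h = [0.5, -1.0, 2.0]
U = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.0], [0.2, 0.0, 0.1], [-1.0, 0.0, 0.0]]
b = [0.0, 0.0, 0.0, 0.0]

d = boundary_distribution(h, U, b)
predicted_label = LABELS[d.index(max(d))]
loss = cross_entropy([1, 0, 0, 0], d)  # gold label is "B"
```

Because each token is scored independently, this output layer has no notion of valid label sequences, which is the property the CRF alternative adds at extra inference cost.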

Entity Categorical Label Prediction
Given an input sentence $X = (x_1, x_2, \ldots, x_n)$ and the corresponding boundary label sequence $L = (l_1, l_2, \ldots, l_n)$, we match each token with label B to the token with label E to construct candidate entity regions. Because some entities consist of a single token, we first match each token with label B to itself. The representation of entity $R(i, j)$ is obtained as follows:
$$R_{i,j} = \frac{1}{j - i + 1} \sum_{k=i}^{j} h^t_k$$
where $h^t_k$ denotes the output of the shared bidirectional LSTM layer for the k-th token in the sentence; that is, we simply average the representations of the tokens within the boundary region. The final entity representation is fed into a ReLU activation function and a softmax layer to predict the entity categorical label, and the loss of categorical label prediction is computed as in equations (11) and (12):
$$d^e_{i,j} = \mathrm{softmax}\left(U^e_{i,j}\,\mathrm{ReLU}(R_{i,j}) + b^e_{i,j}\right) \tag{11}$$
$$L_{ecls} = -\sum \hat{d}^e_{i,j} \log d^e_{i,j} \tag{12}$$
where $U^e_{i,j}$ and $b^e_{i,j}$ are trainable parameters, and $\hat{d}^e_{i,j}$ and $d^e_{i,j}$ denote the true distribution and the predicted distribution of entity categorical labels, respectively.
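The region construction and averaging steps can be sketched in a few lines. The pairing rule used here, in which each B token is paired with itself and with every E token to its right, is an assumption for illustration, since the text does not spell out how a B is matched when several E tokens follow it. The names `candidate_regions` and `region_representation` are hypothetical, and the hidden vectors are toy values standing in for the shared LSTM outputs.

```python
def candidate_regions(labels):
    """Build candidate (start, end) regions from boundary labels.

    Assumption: each B is paired with itself (single-token entity)
    and with every E that appears after it.
    """
    ends = [j for j, label in enumerate(labels) if label == "E"]
    regions = []
    for i, label in enumerate(labels):
        if label != "B":
            continue
        regions.append((i, i))
        regions.extend((i, j) for j in ends if j > i)
    return regions


def region_representation(hidden, i, j):
    """Average the per-token vectors hidden[i..j], end inclusive."""
    span = hidden[i:j + 1]
    dim = len(span[0])
    return [sum(vec[d] for vec in span) / len(span) for d in range(dim)]


labels = ["B", "I", "I", "E", "E"]
regions = candidate_regions(labels)

# Toy 2-dimensional stand-ins for the shared LSTM outputs h_k.
hidden = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0], [9.0, 8.0]]
rep = region_representation(hidden, 0, 3)
```

For this five-token example only three regions are classified, compared with the fifteen contiguous spans an exhaustive enumeration would produce.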

Multitask Training
Predicting entity categorical labels only after all boundary-relevant regions have been detected, as a pipeline of two separately trained modules would, is inconvenient and inefficient. Because our boundary detection module and entity categorical label prediction module share the same entity boundaries, we apply a multitask loss to train the two tasks simultaneously.
During training, we feed the ground-truth boundary labels into the entity categorical label prediction module so that the classifier is trained without being affected by erroneous boundary detection. During testing, the outputs of boundary detection are collected, and the detected boundaries indicate which entity regions should be considered when predicting categorical labels. The multitask loss function is defined as follows:
$$L_{multi} = \alpha L_{bcls} + (1 - \alpha) L_{ecls} \tag{13}$$
where $L_{bcls}$ and $L_{ecls}$ denote the categorical cross-entropy losses of the boundary detection module and the entity categorical label prediction module, respectively. $\alpha$ is a hyper-parameter that controls the relative importance of each task.
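The training and testing behaviour described above can be summarized in a small sketch. The weighting of the two losses by alpha and (1 - alpha) is an assumed form of the multitask loss, chosen so that a single hyper-parameter trades off the two tasks; the function names and loss values are hypothetical.

```python
def multitask_loss(boundary_losses, entity_losses, alpha):
    """Weighted sum of the two task losses (assumed alpha / 1 - alpha form)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * sum(boundary_losses) + (1.0 - alpha) * sum(entity_losses)


def boundaries_for_classifier(gold_labels, predicted_labels, training):
    """Use gold boundaries while training, detected boundaries at test time."""
    return gold_labels if training else predicted_labels


# Made-up per-token and per-region cross-entropy values.
boundary_losses = [0.2, 0.4]
entity_losses = [1.0]
total = multitask_loss(boundary_losses, entity_losses, alpha=0.5)

gold = ["B", "I", "E"]
detected = ["B", "E", "O"]
train_input = boundaries_for_classifier(gold, detected, training=True)
test_input = boundaries_for_classifier(gold, detected, training=False)
```

Setting alpha to 1 would train only the boundary detector and setting it to 0 would train only the region classifier, so intermediate values are the ones of interest when tuning on the development set.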

Dataset
To provide empirical evidence for the effectiveness of the proposed model, we conduct experiments on three nested NER datasets: GENIA (Kim et al., 2003), JNLPBA (Kim et al., 2004) and GermEval 2014 (Benikova et al., 2014).
The GENIA dataset is constructed from the GENIA v3.0.2 corpus. We preprocess the dataset following the same settings as Finkel and Manning (2009) and Lu and Roth (2015). The dataset is split 8.1:0.9:1 into training, development and test sets. The statistics of the GENIA dataset are shown in the corresponding table.
In the JNLPBA dataset, however, only the flat and top-most entities are preserved. We collapse the sub-categories into 5 categories, following the same settings as for the GENIA dataset.
The GermEval 2014 dataset contains German nested named entities. It covers over 31,000 sentences, corresponding to over 590,000 tokens.
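The 8.1:0.9:1 split used for GENIA can be reproduced with simple proportional slicing. The sketch below is only an illustration of the arithmetic: it assumes an order-preserving split, which the text does not specify, and `split_dataset` is a hypothetical helper.

```python
def split_dataset(sentences, ratios=(8.1, 0.9, 1.0)):
    """Split a list into train/dev/test parts in the given proportions.

    Assumption: the original order is kept and the parts are contiguous.
    """
    total = sum(ratios)
    n = len(sentences)
    n_train = round(n * ratios[0] / total)
    n_dev = round(n * ratios[1] / total)
    train = sentences[:n_train]
    dev = sentences[n_train:n_train + n_dev]
    test = sentences[n_train + n_dev:]
    return train, dev, test


# A placeholder corpus of 1,000 sentence ids.
corpus = list(range(1000))
train, dev, test = split_dataset(corpus)
sizes = (len(train), len(dev), len(test))
```

With these ratios, 81% of the sentences are used for training, 9% for development and 10% for testing.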

Baseline Methods
We compare our model with several state-of-the-art models on the GENIA dataset. These methods can be divided into three groups.
Finkel and Manning (2009) and Ju et al. (2018) propose CRF-based sequence labeling approaches for nested named entity recognition. Finkel and Manning (2009) leverage entity-level features, while Ju et al. (2018) propose a neural method. We rerun the code of Ju et al. (2018) because they have not shared their pre-processed dataset.
Sohrab and Miwa (2018) propose an exhaustive region classification model for nested NER. We reimplement their method according to their paper because they have not shared their code.
Lu and Roth (2015) and Muis and Lu (2017) build hypergraphs to represent both the nested entities and their mentions. Muis and Lu (2017) improve on the method of Lu and Roth (2015).

Parameter Settings
Our model is implemented with the PyTorch framework and trained with the Adam optimizer. We initialize word vectors with the same 200-dimensional pre-trained word embeddings as Ju et al. (2018) and Sohrab and Miwa (2018), while the character embeddings are 50-dimensional and initialized randomly. The learning rate is set to 0.005. During training, we apply a dropout rate of 0.5 in the Dropout layer that follows the token-level LSTM. The output dimension of our shared bidirectional LSTM is 200. The coefficient α of the multitask loss is tuned on the development set. All of our experiments are performed on the same machine (NVIDIA 1080ti GPU and Intel i7-8700 CPU).

Evaluation Metrics
We use a strict evaluation metric: an entity is counted as correct only when both its boundary and its categorical label are correct. We report precision, recall and F-score.
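The strict metric amounts to exact matching of (start, end, label) triples. The following minimal sketch computes precision, recall and F-score under that definition; the function name `strict_prf` and the example predictions are hypothetical.

```python
def strict_prf(gold, predicted):
    """Precision, recall and F-score with exact span and label matching.

    gold, predicted: iterables of (start, end, label) triples.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


gold = [(0, 3, "protein"), (0, 4, "DNA")]
predicted = [
    (0, 3, "protein"),  # correct boundary and label
    (0, 4, "RNA"),      # correct boundary, wrong label: not counted
    (1, 3, "protein"),  # wrong boundary: not counted
]
precision, recall, f_score = strict_prf(gold, predicted)
```

Only the first prediction counts as correct here, which shows that a near-miss on either the boundary or the label earns no partial credit.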

Overall Evaluation
We conduct experiments on the GENIA test set for nested named entity recognition. Table 2 shows that our method outperforms the compared methods in both recall and F-score. Because CRF-based models are considered well suited to sequence labeling tasks, we compare softmax and CRF as the output layer of the boundary detection module. The results show that they achieve comparable precision, recall and F-score. However, the CRF-based model is time-consuming, about 3-5 times slower than the softmax-based model at inference.

[Table 2: Model, P(%), R(%), F(%); only the first row label, Finkel and Manning (2009), is recoverable]

Our model achieves a recall of 73.6% and outperforms the compared methods in recall by a large margin. We attribute this to our model extracting entities with more accurate boundaries than the other methods, and we evaluate this claim in the experiments on the boundary detection module.

[Table: Model, P(%), R(%), F(%); rows not recoverable]

The GermEval 2014 dataset from the KONVENS 2014 shared task is a German NER dataset that contains few nested entities. Previous work on this dataset ignores nested entities or extracts inner and outer entities with two independent models, whereas our method extracts nested entities in an end-to-end way. We compare our method with two state-of-the-art approaches in Table 3. Our method outperforms them in both recall and F-score.
Table 4 reports the performance of our model on the five categories of the test dataset. Our model outperforms the models of Ju et al. (2018) and Sohrab and Miwa (2018) in F-score on all categories.

Performance of Boundary Detection
We conduct experiments on boundary detection to show that our model extracts entity boundaries more precisely than Sohrab and Miwa (2018) and Ju et al. (2018). Table 5 shows the results of boundary detection on the GENIA test set.
Our model locates entities more accurately, with a higher recall (76.9%) than the compared methods, which explains why it outperforms other state-of-the-art methods in recall. We exploit boundary information explicitly and capture the dependencies between boundaries and entity categorical labels with a multitask loss, whereas in the method of Sohrab and Miwa (2018) candidate entity regions are classified individually.
[Table 5: boundary detection P(%), R(%) and F(%) per model; only the row label Sohrab and Miwa (2018) is recoverable]

Table 6 reports the performance of our model in detecting the boundary label of each token in a sentence. The results are based on the shared bidirectional LSTM and a softmax classifier. Our model extracts entity boundaries with relatively high performance. This facilitates the prediction of entity categorical labels because the candidate entity regions are more likely to be true entities.

Our model considers only boundary-relevant regions, which are far fewer. We compare the inference speed of our model with the approaches of Sohrab and Miwa (2018) and Ju et al. (2018) in Figure 4(b). Our model is about 4 times faster than Sohrab and Miwa (2018) and about 3 times faster than Ju et al. (2018). The cascaded CRF layers of Ju et al. (2018) limit its inference speed.

Multitask learning can capture the underlying dependencies between boundaries and entity categorical labels. It helps the model focus its attention on the features that actually matter (Ruder, 2017). In a pipeline model, the entity categorical prediction module does not share information with the boundary detection module because they are trained separately, even though the two modules share the same entity boundaries. We therefore use a shared feature extractor (the bidirectional LSTM layer) to extract the features used in both entity categorical prediction and boundary detection. The results demonstrate that the multitask learning framework improves final performance.

Ablation Study and Flat NER
We conduct ablation experiments on the GENIA development set to evaluate the contributions of the neural components, including the dropout layer, the pre-trained word embeddings and the character-level LSTM. The results are listed in Table 8. All of these components contribute to the effectiveness of our model. The dropout layer contributes significantly to both precision and recall.
To show that our model works on flat NER as well as nested NER, we perform experiments on the JNLPBA dataset. We achieve an F-score of 73.6, which is comparable with the state-of-the-art result of Gridach (2017).

Case Study
Table 9 shows a case study comparing our model with the exhaustive model (Sohrab and Miwa, 2018) and the layered model (Ju et al., 2018). In the example, "human TATA binding factor" is an entity nested in the entity "transcriptionally active human TATA binding factor". Our model with multitask learning extracts both entities with exact boundaries and entity categorical labels. The exhaustive model predicts wrong boundaries and misses the token "human" in the entities. In contrast to the layered model, which detects only an outer entity, our model extracts both the inner and the outer entity. This demonstrates that our combination of sequence labeling and region classification models can locate entities precisely and extract both inner and outer entities.

[Table 9, sentence row: "Cloning of a transcriptionally active human TATA binding factor."; model outputs not recoverable]

Our pipeline model, lacking the dependency information from entity categorical labels, misses the outer entity boundaries and extracts only the inner entity. This verifies that multitask learning shares boundary information between the boundary detection module and the entity categorical label prediction module, which is very effective for nested NER.

Conclusion
This paper presents a boundary-aware model that leverages boundaries to predict entity categorical labels. Our model combines a sequence labeling model and a region classification model to locate and classify nested entities with high performance. To capture the underlying dependencies between the boundary detection module and the entity categorical prediction module, we apply a multitask loss to train the two tasks simultaneously. Our model outperforms existing nested NER models in terms of F-score.
In future work, we plan to model the dependencies among entity regions explicitly and to improve the performance of the boundary detection module, which is important for entity categorical label prediction.