Deep Exhaustive Model for Nested Named Entity Recognition

We propose a simple deep neural model for nested named entity recognition (NER). Most NER models focus on flat entities and ignore nested entities, and thus fail to fully capture the underlying semantic information in texts. The key idea of our model is to enumerate all possible regions, or spans, as potential entity mentions and to classify them with deep neural networks. To reduce the computational cost and capture the context around each region, the model represents the regions using the outputs of a shared underlying bidirectional long short-term memory (LSTM) layer. We evaluate our exhaustive model on the GENIA and JNLPBA corpora in the biomedical domain, and the results show that our model outperforms state-of-the-art models on nested and flat NER, achieving F-scores of 77.1% and 78.4% respectively, without any external knowledge resources.


Introduction
Named entity recognition (NER) is the task of finding entities with specific semantic types, such as Protein, Cell, and RNA, in text. NER is generally treated as a sequential labeling task, where each token is tagged with a label that corresponds to the entity it belongs to. However, when entities overlap or are nested within one another, treating the task as sequential labeling becomes difficult, because an individual token can be included in several entities and defining a single label for each token becomes problematic. For example, in the following phrase from the GENIA corpus (Kim et al., 2004), four levels of nested entities occur: the token "IL-2" is a Protein on its own, and it is also part of two other Proteins and one DNA.
[[[[IL-2] Protein receptor] Protein (IL-2R) alpha chain] Protein gene] DNA

NER has drawn considerable attention as the first step towards many natural language processing (NLP) applications, including relation extraction (Miwa and Bansal, 2016), event extraction (Feng et al., 2016), co-reference resolution (Fragkou, 2017; Stone and Arora, 2017), and entity linking (Gupta et al., 2017). Much work on NER, however, has ignored nested entities and instead focused on non-nested entities, also referred to as flat entities. Only a few studies target nested named entity recognition (Muis and Lu, 2017; Lu and Roth, 2015; Finkel and Manning, 2009).
Recent successes of neural networks have shown impressive performance gains on flat named entity recognition in several domains (Lample et al., 2016; Ma and Hovy, 2016; Gridach, 2017; Strubell et al., 2017). Such models achieve state-of-the-art results without requiring any hand-crafted features or external knowledge resources. In contrast, fewer approaches have addressed the nested entity recognition problem. Existing approaches to nested NER (Shen et al., 2003; Alex et al., 2007; Finkel and Manning, 2009; Lu and Roth, 2015; Xu et al., 2017; Muis and Lu, 2017) are mostly feature-based and thus suffer from heavy feature engineering. In this paper, we present a novel neural exhaustive model that reasons over all the regions within a specified maximum size. The model represents each region using the outputs of a bidirectional long short-term memory (LSTM) layer, combining a boundary representation of the region with an inside representation that treats all the tokens in the region equally by averaging the LSTM outputs corresponding to the tokens inside the region. It then classifies each region into an entity type or non-entity. Unlike existing models that rely on token-level labels, our model directly employs an entity type as the label of a region. The model does not rely on any external knowledge resources or NLP tools such as part-of-speech taggers. We evaluated our model on the GENIA and JNLPBA corpora in the biomedical domain, and the model achieved F-scores of 77.1% and 78.4% respectively, which are new state-of-the-art results on these corpora.

Neural Exhaustive Model
The proposed model exhaustively considers all possible regions in a sentence using a single neural network; we thus call it the neural exhaustive model. The model is built upon a shared bidirectional LSTM layer. It enumerates all possible regions, or spans, that can include all the nested entities, then represents the regions using the outputs of the LSTM layer and detects entities among the regions. The number of possible regions depends on a predefined maximum region size. In this section, we describe the architecture of our neural exhaustive model in detail; it is summarized in Figure 1.

Word Representation
We represent each word by concatenating word embeddings and character-based word representations. Pre-trained word embeddings are used to initialize word embeddings (Chiu et al., 2016). For the character-based word representations, we encode the character-level information of each word following the successes of Ma and Hovy (2016) and Lample et al. (2016) that utilized character embeddings for the flat NER task. The embedding of each character in a word is randomly initialized. We feed the sequence of character embeddings comprising a word to a bidirectional LSTM layer and concatenate the forward and backward output representations to obtain the word representations.
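As a concrete, heavily simplified sketch of this composition, the snippet below concatenates a word embedding with a character-derived vector built from forward and backward passes over the character sequence. The gated LSTM is replaced here by a plain recurrent cell with random weights, and all embedding tables are randomly initialized stand-ins; dimensions follow the paper's settings (200-dimensional word embeddings, 25-dimensional character embeddings).

```python
import numpy as np

rng = np.random.default_rng(0)

CHAR_DIM, CHAR_HIDDEN, WORD_DIM = 25, 25, 200
char_emb = {c: rng.normal(size=CHAR_DIM) for c in "abcdefghijklmnopqrstuvwxyz-0123456789"}

# Toy recurrent cell standing in for one LSTM direction (a hypothetical
# simplification; the paper uses a real bidirectional LSTM).
W = rng.normal(scale=0.1, size=(CHAR_HIDDEN, CHAR_HIDDEN + CHAR_DIM))

def run_direction(seq):
    h = np.zeros(CHAR_HIDDEN)
    for x in seq:
        h = np.tanh(W @ np.concatenate([h, x]))
    return h  # final hidden state of this direction

def char_word_repr(word):
    seq = [char_emb[c] for c in word.lower() if c in char_emb]
    fwd = run_direction(seq)           # left-to-right pass
    bwd = run_direction(seq[::-1])     # right-to-left pass
    return np.concatenate([fwd, bwd])  # 2 * CHAR_HIDDEN dims

word_emb = {w: rng.normal(size=WORD_DIM) for w in ["IL-2", "receptor"]}

def word_repr(word):
    # Final representation: pretrained word embedding + char-based vector.
    return np.concatenate([word_emb[word], char_word_repr(word)])

print(word_repr("IL-2").shape)  # (250,)
```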

Exhaustive Combination using LSTM
Given an input sentence X = {x_1, x_2, ..., x_n}, where x_i denotes the i-th word and n denotes the number of words in the sentence, the distributed embeddings of words and characters are fed into a bidirectional LSTM layer that computes the hidden vector sequences in the forward and backward directions. We concatenate the forward and backward outputs as h_i = [h_i^forward; h_i^backward]. With the LSTM outputs h_i, our exhaustive model shares the underlying representations of all possible regions by exhaustive combination. We generate all possible regions with sizes less than or equal to the maximum region size L. We use region(i, j) to denote the region from the i-th to the j-th word inclusive, where 1 ≤ i ≤ j ≤ n and j − i < L.
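The enumeration itself is straightforward. A minimal sketch, using the inclusive, 1-indexed region(i, j) notation and including single-word regions (which nested NER needs, since a single token such as "IL-2" can be an entity on its own):

```python
def enumerate_regions(n, L):
    """All regions region(i, j) with 1 <= i <= j <= n and j - i < L,
    i.e., spans of at most L words in a sentence of n words."""
    return [(i, j) for i in range(1, n + 1)
                   for j in range(i, min(i + L - 1, n) + 1)]

# For a 5-word sentence with maximum region size L = 3:
regions = enumerate_regions(5, 3)
print(len(regions))  # 12 candidate regions (5 + 4 + 3)
```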

Region Representation and Classification
We represent each region by separating it into a boundary representation and an inside representation. The boundary representation is important for capturing the context surrounding the region; for this purpose, we simply rely on the outputs of the bidirectional LSTM layer corresponding to the boundary words of the target region. For the inside representation, we average the outputs of the bidirectional LSTM layer within the region to treat them equally. We include the outputs for the boundary words in this average to guarantee that the inside representation always has corresponding outputs. In summary, we obtain the representation R(i, j) of region(i, j) as follows:

R(i, j) = [h_i; h_j; (1/(j − i + 1)) Σ_{k=i..j} h_k] (1)

We then feed the representation of each region to a rectified linear unit (ReLU) as an activation function. Finally, the output of the activation layer is passed to a softmax output layer that classifies the region into a specific entity type or non-entity.
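The region representation and classification head can be sketched as follows. This is a minimal numpy version with hypothetical, randomly initialized weight shapes (the paper's implementation is in Chainer); it concatenates the two boundary outputs with the average over the region and applies a ReLU layer followed by a softmax.

```python
import numpy as np

def region_representation(H, i, j):
    """[h_i; h_j; average of h_i..h_j], with H holding the BiLSTM
    outputs row-wise and i, j 1-indexed as in the paper."""
    inside = H[i - 1 : j].mean(axis=0)       # average includes boundary words
    return np.concatenate([H[i - 1], H[j - 1], inside])

def classify(R, W1, b1, W2, b2):
    hidden = np.maximum(0.0, W1 @ R + b1)    # ReLU activation layer
    logits = W2 @ hidden + b2
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                   # softmax over entity types + non-entity

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))                  # 5 words, 4-dim BiLSTM outputs
R = region_representation(H, 2, 4)           # region(2, 4), a 12-dim vector
probs = classify(R, rng.normal(size=(8, 12)), np.zeros(8),
                 rng.normal(size=(6, 8)), np.zeros(6))
```

Because every region reads from the same matrix H, the LSTM runs once per sentence and all region classifications reuse its outputs.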
The exhaustive model represents all possible regions up to the maximum entity length and classifies all of them. The overall number of classifications for each sentence is in O(nmc), where n is the number of words in the sentence, m is the maximum entity length, and c is the number of possible entity types. Finkel and Manning (2009) and Alex et al. (2007) proposed feature-based approaches for handling nested NER; their models are computationally expensive, with time complexity cubic in the number of words in the sentence. The exhaustive approach is fast since we run the LSTM only once and the classifications can be performed in parallel on the combinations created from the LSTM outputs.
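To get a feel for the cost, here is a quick count of the candidate regions per sentence, each of which receives one softmax classification (a hypothetical helper; each start position contributes at most the maximum-size number of regions, fewer near the end of the sentence):

```python
def num_regions(n_words, max_size):
    # Start position i contributes min(max_size, n_words - i + 1) regions.
    return sum(min(max_size, n_words - i + 1) for i in range(1, n_words + 1))

print(num_regions(30, 10))  # 255 regions for a 30-word sentence, max size 10
```

This grows linearly in the sentence length for a fixed maximum size, in contrast to the cubic behavior of the feature-based parsers mentioned above.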
The exhaustive model classifies each region independently, unlike word-level taggers. This makes the model flexible: it can incorporate phrase-level dictionary information directly, and we can tune biases for each entity type, which is not straightforward with a CRF. We leave this evaluation to future work.

Experimental Settings
We evaluated our exhaustive model on the GENIA (Kim et al., 2003) and JNLPBA (Kim et al., 2004) datasets to provide empirical evidence for the effectiveness of our model in both nested and flat NER. Table 1 shows the statistics of the GENIA dataset.
Our model was implemented in the Chainer deep learning framework. We employed pre-trained word embeddings that were trained on MEDLINE abstracts (Chiu et al., 2016); these are 200-dimensional embeddings with a vocabulary of 2,231,686 words. We used Adam (Kingma and Ba, 2015) for learning, with a mini-batch size of 100. We used the same hyper-parameters in all the experiments: we set the dimension of word embeddings to 200, the dimension of character embeddings to 25, the hidden layer size to 200, and the gradient clipping threshold to 5, and we kept the Adam hyper-parameters at their default values (Kingma and Ba, 2015). To better understand the effect of the model parameters, we compared models with different maximum region sizes, choosing the maximum region size from 3, 6, 8, and 10. We also compared different region representations: only the boundary representation (boundary), only the inside representation (inside), and our combined region representation (boundary+inside).
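For reference, the reported settings collected in one place (a plain summary of the hyper-parameters above, not the actual Chainer configuration):

```python
# Hyper-parameter summary for the exhaustive model experiments.
HPARAMS = {
    "word_embedding_dim": 200,      # pre-trained on MEDLINE abstracts
    "char_embedding_dim": 25,       # randomly initialized
    "hidden_size": 200,
    "gradient_clip": 5,
    "batch_size": 100,
    "optimizer": "Adam",            # default Adam hyper-parameters
    "max_region_sizes": [3, 6, 8, 10],
    "region_representations": ["boundary", "inside", "boundary+inside"],
}
```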
We employed precision, recall, and F-score as evaluation metrics. Table 2 shows the comparison of our model with several previous state-of-the-art nested NER models on the test dataset. Our model outperforms the state-of-the-art models in terms of F-score. The results in Table 2 are based on the bidirectional LSTM with character embeddings and a maximum region size of 10. Table 3 describes the performance of our model on different entity levels on the test dataset. The model performs well on multi-token and top-level entities. This is interesting because these are often considered difficult for sequential labeling models. Table 4 shows the performance on the five entity types on the test dataset. We also show the performance of Finkel and Manning (2009) (F&M) for reference. Our system performs better than their model on all types except RNA.

Ablation Tests
We report the differences in performance on the development dataset to compare possible configurations of the proposed approach and to show the importance of each component in our exhaustive model. Table 5 shows the coverage ratio and the performance with different maximum region sizes. Since the average entity mention length in the GENIA dataset is less than 4, the system can cover almost all the entities with a maximum size of 6 or more. A longer maximum region size is desirable to cover all the mentions, but it requires more computation. Fortunately, the performance did not degrade with longer maximum region sizes, despite the fact that they introduce more non-entity regions.
The ablations in Table 6 show the importance of character embeddings. They also show that both the boundary information and the inside information, i.e., the average of the LSTM outputs in a region, are necessary to improve the performance.

Flat NER
We evaluated our model on JNLPBA as a flat dataset, in which nested and discontinuous entities are removed. Table 7 shows the performance of our model on the JNLPBA dataset. We compared our result with the state-of-the-art result of Gridach (2017), which achieved 75.8% in F-score; our model obtained 78.4%.

Related Work
Interest in nested NER has increased in recent years, but it is still the case that most NER models deal with only one flat level at a time. Zhou et al. (2004) detected nested entities in a bottom-up way: they detected the innermost flat entities and then found other NEs containing the flat entities as substrings, using rules derived from the detected entities. The authors reported an improvement of around 3% in F-score under certain conditions on the GENIA corpus (Collier et al., 1999). Katiyar and Cardie (2018) proposed a neural network-based approach that learns a hypergraph representation for nested entities using features extracted from a recurrent neural network (RNN). The authors reported that the model outperformed the existing state-of-the-art feature-based approaches.
Recent studies show that conditional random fields (CRFs) can produce significantly higher tagging accuracy in flat (Athavale et al., 2016) or nested (stacking flat NER into a nested representation) (Son and Minh, 2017) NER. Ju et al. (2018) proposed a novel neural model that addresses nested entities by dynamically stacking flat NER layers until no outer entities are extracted; a cascaded CRF layer is used after the LSTM output in each flat layer. The authors reported that the model outperforms previous state-of-the-art results, achieving 74.5% in terms of F-score. Finkel and Manning (2009) proposed a tree-based representation that represents each sentence as a constituency tree of nested entities; all entities were treated as phrases and represented as subtrees, and a CRF-based approach driven by entity-level features was used to detect the nested entities. We demonstrate that the performance can be improved significantly without CRFs, by training an exhaustive neural model that learns which regions are entity mentions and how best to classify those regions.

Conclusion
This paper presented a neural exhaustive model that exhaustively considers all possible regions for nested NER. The model obtains the representation of each region from an underlying shared LSTM layer: it represents a region by concatenating the boundary representations of the region with an inside representation that averages the LSTM outputs of the words in the region, and it then classifies the region into an entity type or non-entity. The model does not depend on any external NLP tools. In the experiments, we showed that our model learns to detect nested named entities from the mention candidates generated from all possible regions. Our exhaustive model outperformed existing models by a significant margin in terms of F-score on both flat and nested NER.
For future work, we would like to investigate the use of region-level information such as phrase-level dictionaries. We would also like to model the dependencies between regions.