Pyramid: A Layered Model for Nested Named Entity Recognition

This paper presents Pyramid, a novel layered model for Nested Named Entity Recognition (nested NER). In our approach, token or text region embeddings are recursively fed into L flat NER layers, stacked bottom-up in a pyramid shape. Each time an embedding sequence passes through a layer of the pyramid, its length is reduced by one, so a hidden state at layer l represents an l-gram in the input text, which is labeled only if its corresponding text region is a complete entity mention. We also design an inverse pyramid to allow bidirectional interaction between layers. The proposed method achieves state-of-the-art F1 scores in nested NER on ACE-2004, ACE-2005, GENIA, and NNE: 80.27, 79.42, 77.78, and 93.70 with conventional embeddings, and 87.74, 86.34, 79.31, and 94.68 with pre-trained contextualized embeddings. In addition, our model can be used for the more general task of Overlapping Named Entity Recognition. A preliminary experiment confirms the effectiveness of our method on overlapping NER.


Introduction
Named Entity Recognition (NER), which aims at identifying text spans as well as their semantic classes, is an essential and fundamental Natural Language Processing (NLP) task. It is typically modeled as a sequence labeling problem, which can be effectively solved by RNN-based approaches (Huang et al., 2015; Lample et al., 2016; Ma and Hovy, 2016). However, this formulation oversimplifies the problem and rests on the strong assumption that entity mentions do not overlap with each other, which is often not the case in practice. In real-world language, entities may be deeply nested or overlapping, calling for better models to handle such complexity. Many previous studies have focused on recognizing nested entity mentions. A few works use proprietary structures, such as constituency graphs (Finkel and Manning, 2009) or hypergraphs (Lu and Roth, 2015; Muis and Lu, 2017), to explicitly capture nested entities. These structures, however, do not yield satisfactory performance.
Some other works handle nested entity mentions in a layered model, which employs multiple flat NER layers (Alex et al., 2007; Ju et al., 2018; Fisher and Vlachos, 2019). Each layer is usually responsible for predicting a group of nested entities having the same nesting level.
Unfortunately, conventional layered schemes do not address the more general overlapping setting, and they also suffer from layer disorientation. The latter problem arises when the model outputs a nested entity from the wrong layer. For example, the entity "U.N. Ambassador" is labeled as a second-layer entity (containing "U.N." and "Ambassador"), so predicting it from the first layer is considered an error. Generally, a prediction with the correct span and class but from the wrong layer is counted as a false positive and produces an over-estimated loss (even though the entity itself is correct), making the entire model reluctant to predict positives and eventually harming recall. This problem occurs quite often, as the target layer for a nested entity is determined by the nesting levels of its composing entities rather than by its own semantics or structure. A recent study on a layered model (Ju et al., 2018) also reports the error propagation issue, i.e. errors in the first few layers are propagated to the following layers.
In this paper, we propose a novel layered model called Pyramid for nested NER. The model consists of a stack of inter-connected layers. Each layer l predicts whether a text region of length l, i.e. an l-gram, is a complete entity mention. Between every two consecutive layers of our model, the hidden state sequence is fed into a convolutional network with a kernel size of two, allowing a text region embedding in the higher layer to aggregate two adjacent hidden states from the lower layer, and thus forming the pyramid shape (as the sequence in the higher layer is one token shorter than in the lower layer). This process enumerates all text spans without breaking the sequence structure. Figure 1 shows a sentence containing eight nested entities being fed into the Pyramid model. These entities are separated into 5 layers according to their number of tokens. The job of each decoding layer is simple and clear: it needs to output an entity type when it encounters a complete entity.
In the above scheme, a higher decoding layer relies on the output of the lower decoding layer in a bottom-up manner (from layer 1 to 5 in Figure 1). It is also desirable to construct an inverse pyramid, where a lower decoding layer receives input from a higher layer (from layer 5 to 1), allowing information to flow in the opposite direction.
Pyramid outperforms previous methods on nested NER while addressing all the aforementioned problems of layered models. First, it can be used for the more general overlapping NER. Second, it prevents layer disorientation, as an entity of length l in the input is only predicted at layer l. Third, it mitigates the error propagation problem, as predictions in one layer do not dictate those in other layers. Our main contributions are as follows:
• We propose a novel layered model called Pyramid for nested NER. The model recognizes entity mentions by their length, without layer disorientation or error propagation. The proposed model can also address the more general overlapping NER task.
• Besides the normal pyramid, we design an inverse pyramid to allow bidirectional interactions between neighboring layers.
• Additionally, we construct a small dataset that contains overlapping but non-nested entities. Preliminary results on this dataset show the potential of our model for handling overlapping entities.

Related Work
Existing approaches for recognizing non-overlapping named entities usually treat the NER task as a sequence labeling problem. Various sequence labeling models achieve decent performance on regular NER, including probabilistic graph models such as Conditional Random Fields (CRF) (Ratinov and Roth, 2009).
Nested NER has been intensively studied recently. Finkel and Manning 2009 proposes a CRF-based constituency parser and uses a constituency tree to represent a sentence. Lu and Roth 2015 introduces the idea of hypergraphs, which allow edges to connect to multiple nodes to represent nested entities. Muis and Lu 2017 uses a multigraph representation and introduces the notion of mention separators for nested entity detection. Wang and Lu 2018 presents a neural segmental hypergraph model using neural networks to obtain distributed feature representations. Katiyar and Cardie 2018 also adopts a hypergraph-based formulation but instead uses neural networks to learn the structure. Lin et al. 2019 introduces the Anchor Region Networks (ARNs) architecture to predict nested entity mentions. All the above works design proprietary structures to explicitly capture nested entities.
Layered models are a common solution for nested NER. Alex et al. 2007 stacks multiple flat NER layers, where the first recognizes the innermost (or outermost) mentions and the following taggers incrementally recognize next-level mentions. Ju et al. 2018 dynamically stacks multiple flat NER layers and extracts outer entities based on the inner ones. Fisher and Vlachos 2019 can also be considered a layered model with a novel neural network architecture. Our method differs from the above layered models in that (1) it is able to handle overlapping NER, and (2) it does not suffer from the layer disorientation or error propagation problems.
Exhaustive region classification models enumerate all possible regions of the input sentence. Byrne 2007; Xu et al. 2017; Sohrab and Miwa 2018; Zheng et al. 2019 aggregate all possible adjacent tokens into potential spans. These spans, together with their left and right contexts, are fed into a classifier: a maximum entropy tagger (Byrne, 2007) or a neural network (Xu et al., 2017; Sohrab and Miwa, 2018; Zheng et al., 2019). Unfortunately, all these works fail to take advantage of the dependencies among nested entities and perform prediction merely on individual text fragments, thus limiting performance. Luan et al. 2019 uses propagation layers to capture relations and coreference between spans. Our method also potentially enumerates all possible spans, while maintaining the sequence structure, which leads to better performance.
Pre-trained word embeddings, e.g. GloVe (Pennington et al., 2014), have proved effective in improving NER performance. Recently, with the rapid development of language modeling techniques, the performance of NER models has been pushed to new heights. Recent pre-trained language model embeddings include ELMo (Peters et al., 2018), Flair (Akbik et al., 2018), BERT (Devlin et al., 2019), ALBERT (Lan et al., 2019), etc. In our experiments, we leverage these embeddings and observe significant performance improvements.

Proposed Method
In this section, we describe the proposed model and its architecture, which includes an encoder, a pyramid, an inverse pyramid, and a logits layer. Figure 2 shows a toy model with a pyramid (5 bottom-up decoding layers in blue) and its inverse counterpart (5 top-down layers in pink). As shown in the blue pyramid, each decoding layer contains a convolutional network with a kernel size of two to reduce the sequence length in its output, so that all possible mention spans can be enumerated. The top-down inverse pyramid is described later.
We shall use the following notation:
Embed: the embedding layer
LSTM: the bidirectional LSTM layer
LM: the language model embedder
Linear: the fully-connected layer
LayerNorm: layer normalization
Layers with the same notation, superscript, and subscript share the same parameters. For the sake of brevity, we omit the dropout layers in this section.

The Input and Output
The input is a T-length textual sentence. After the encoder, embedding sequences are recursively fed into flat NER decoding layers, producing L tag sequences in the IOB2 format with lengths T, T − 1, ..., T − L + 1, where L is the number of decoding layers. Note that we only label n-grams that are complete mentions, so the I-{class} tag usually does not appear.
Given the running example in Figure 1, the output of the pyramid contains layered tag sequences (l = 1, . . . , 5), one per decoding layer. Unfortunately, such layered sequences cannot encode any entity of more than 5 tokens. Generally, a stack of L layers cannot predict entities containing more than L tokens. To address this issue, we propose a remedy: all entities longer than L tokens are predicted on the topmost flat NER layer. Specifically, the bottom L − 1 layers predict B-{class} tags for complete mentions as before, while the topmost layer additionally uses I-{class} tags so that mentions longer than L tokens can still be recognized there as multi-slot spans. This stipulates that when two entities are nested, if one of them is longer than L, the other one cannot be longer than L − 1.
In the running example, suppose we had only 4 decoding layers (l = 1, . . . , 4). The longest mention (Former U.N. Ambassador Jeane Kirkpatrick) would then be recognized in the fourth decoding layer as a multi-slot span tagged with B-{class} followed by I-{class}. With this remedy, our model is able to handle entities longer than L. As most entity mentions are not very long (99% are no longer than 15 tokens), and it is even rarer for two nested mentions to both be longer than 15 tokens, we set the default number of flat decoding layers to L = 16 to minimize the impact of the remedy. The parameter L can be tuned to balance accuracy and inference speed.
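To make the labeling scheme and the remedy concrete, the following is a small, self-contained Python illustration with assumed entity classes and a simplified 5-token fragment; the tags are hypothetical and are not the gold ACE annotations.

# Hypothetical illustration of the layered IOB2 tagging scheme (not gold data).
# Assume L = 4 decoding layers and the 5-token mention
# "Former U.N. Ambassador Jeane Kirkpatrick", with an assumed class PER.
tokens = ["Former", "U.N.", "Ambassador", "Jeane", "Kirkpatrick"]

# Layer l tags the l-grams starting at each position, so its sequence has
# len(tokens) - l + 1 slots.
tags = {
    1: ["O", "B-ORG", "O", "O", "O"],  # assumed 1-token mention "U.N." (ORG)
    2: ["O", "O", "O", "B-PER"],       # assumed 2-token mention "Jeane Kirkpatrick" (PER)
    3: ["O", "O", "O"],
    4: ["B-PER", "I-PER"],             # remedy: the 5-token mention exceeds L = 4, so the
                                       # topmost layer tags it as B- followed by I- slots
}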

The Encoder
We represent each word by concatenating character sequence embeddings and word embeddings. First, the character embeddings are dynamically generated by an LSTM (Lample et al., 2016) to capture the orthographic and morphological features of the word. With the introduction of character embeddings, the model can better handle out-of-vocabulary (OOV) words. Second, the word embeddings are initialized with pre-trained word vectors. For OOV words, we randomly initialize an embedding for [UNK], which is tuned during training. The concatenated character and word embeddings are fed into a bidirectional LSTM encoding layer to further leverage contextual information.
Formally, given the input sentence x:

x^char = LSTM^char(Embed^char(x))    (1)
x^word = Embed^word(x)               (2)
h^enc = LSTM^enc([x^char; x^word])   (3)

where [·; ·] denotes concatenation. For better performance, we adopt popular pre-trained contextualized language model embeddings, such as BERT (Devlin et al., 2019). These embeddings are concatenated with the output of LSTM^enc, followed by a linear layer to reduce the embedding dimension, i.e.:

x^lm = LM(x)                         (4)
h = Linear([h^enc; x^lm])            (5)
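For illustration only, the following is a minimal PyTorch-style sketch of such an encoder. All names and dimensions (char_dim, word_dim, hidden, lm_dim) are assumptions rather than values from the authors' implementation, and the character feature is simplified to the last state of the character LSTM.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    # Sketch of the encoder: char-level BiLSTM + word embeddings, a sentence-level
    # BiLSTM, and pre-trained LM embeddings concatenated and projected (eq. 1-5).
    def __init__(self, char_vocab, word_vocab, char_dim=30, word_dim=100,
                 hidden=200, lm_dim=1024):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        self.char_lstm = nn.LSTM(char_dim, char_dim, bidirectional=True, batch_first=True)
        self.word_emb = nn.Embedding(word_vocab, word_dim)
        self.enc_lstm = nn.LSTM(word_dim + 2 * char_dim, hidden // 2,
                                bidirectional=True, batch_first=True)
        self.proj = nn.Linear(hidden + lm_dim, hidden)

    def forward(self, char_ids, word_ids, lm_emb):
        # char_ids: (batch, seq, max_word_len); word_ids: (batch, seq); lm_emb: (batch, seq, lm_dim)
        b, t, c = char_ids.shape
        char_out, _ = self.char_lstm(self.char_emb(char_ids.view(b * t, c)))
        x_char = char_out[:, -1, :].view(b, t, -1)   # simplified word-level character feature
        x_word = self.word_emb(word_ids)
        h_enc, _ = self.enc_lstm(torch.cat([x_char, x_word], dim=-1))
        return self.proj(torch.cat([h_enc, lm_emb], dim=-1))   # eq. (5)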

The Pyramid
The pyramid recognizes entities in a bottom-up manner. It consists of L decoding layers, each of which corresponds to a flat named-entity recognizer.
Each decoding layer has two main components, an LSTM and a CNN with a kernel size of two. In layer l, the LSTM recognizes entity mentions of length l, and the CNN aggregates two adjacent hidden states and feeds the resulting text region embeddings, enriched with layer information, to the (l + 1)-th decoding layer above. By passing through l decoding layers with l − 1 CNNs, each hidden state (at position t) actually represents the region of l original tokens (from t to t + l − 1). Therefore, the l-th decoding layer enumerates text spans of length l, and all L layers together enumerate all possible entity spans. One may notice that the pyramid structure intrinsically provides a useful inductive bias: the higher the layer, the shorter the input sequence, forcing the model to capture high-level information for predicting long entities and low-level information for predicting short entities. Moreover, as the representation of each span is reduced to length one at its target decoding layer, the prediction task of each layer is simple and clear: to predict entities whose representation length is one at that layer.
Since the input of the first decoding layer is from the encoder while the others are from the output of their lower neighboring layers, the input bias and scale may differ among layers. This is detrimental to training. To address this issue, we apply layer normalization (Ba et al., 2016) before feeding the region embeddings into the decoding LSTM.
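A minimal PyTorch-style sketch of one bottom-up decoding layer is given below, assuming a single hidden size everywhere; it is meant to illustrate the LayerNorm, LSTM, and kernel-2 CNN flow described above, not to reproduce the authors' code.

import torch
import torch.nn as nn

class PyramidLayer(nn.Module):
    # One bottom-up decoding layer: LayerNorm -> BiLSTM (hidden states used to tag
    # l-grams) -> Conv1d with kernel size 2, which shortens the sequence by one
    # position to feed the next layer. Dimensions are assumed.
    def __init__(self, dim=200):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.lstm = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)
        self.conv = nn.Conv1d(dim, dim, kernel_size=2)   # no padding: length T -> T - 1

    def forward(self, x):
        # x: (batch, T_l, dim) text region embeddings at layer l
        h, _ = self.lstm(self.norm(x))                      # hidden states for tagging layer l
        up = self.conv(h.transpose(1, 2)).transpose(1, 2)   # (batch, T_l - 1, dim) for layer l + 1
        return h, up

# Stacking L such layers (optionally with shared LSTM weights) yields the pyramid:
# the hidden state of layer l at position t represents the l-gram x[t : t + l].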

The Inverse Pyramid
Each decoding layer in the bottom-up pyramid takes into account layer information from the lower layers. However, a layer cannot get feedback from its higher neighbors, which could potentially help. Moreover, for long entities, their embeddings need to go through many lower layers and tend to lose important information. Therefore, we add an inverse pyramid, which recognizes entity mentions in a top-down manner, to address these issues. In the pyramid, sequences pass through a CNN that reduces the sequence length before being fed into the higher decoding layer; in the inverse pyramid, we instead use another CNN with zero padding and a kernel size of two to reconstruct the lower-level text region embeddings. Specifically, to reconstruct the text region embeddings at the (l − 1)-th decoding layer, we concatenate the hidden states of the l-th normal and inverse decoding layers and feed them into the inverse CNN (see the bottom-left pink box in Figure 2).
There are two benefits to using the top-down inverse pyramid: (1) it provides feedback from higher decoding layers, allowing bidirectional interaction between neighboring decoding layers; (2) since the inverse pyramid needs to reconstruct the lower-level sequence, it requires the pyramid to retain as much original information as possible, thereby mitigating the information loss for long entities.
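Before the formal description below, here is a minimal PyTorch-style sketch of one top-down (inverse) decoding layer. The exact ordering of the inverse CNN, layer normalization, and inverse LSTM follows our reading of the description above and may differ from the authors' implementation; dimensions are assumed.

import torch
import torch.nn as nn

class InversePyramidLayer(nn.Module):
    # One top-down decoding layer: the hidden states of the l-th normal and inverse
    # layers are concatenated and passed through a zero-padded Conv1d (kernel size 2),
    # which lengthens the sequence by one to reconstruct the text region embeddings
    # of layer l - 1; an inverse BiLSTM then produces the inverse hidden states.
    def __init__(self, dim=200):
        super().__init__()
        self.conv = nn.Conv1d(2 * dim, dim, kernel_size=2, padding=1)  # length N -> N + 1
        self.norm = nn.LayerNorm(dim)
        self.lstm = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)

    def forward(self, h_l, h_inv_l):
        # h_l, h_inv_l: (batch, T - l + 1, dim) hidden states of the l-th normal
        # and inverse decoding layers
        x = torch.cat([h_l, h_inv_l], dim=-1).transpose(1, 2)
        down = self.conv(x).transpose(1, 2)            # (batch, T - l + 2, dim)
        h_inv_lower, _ = self.lstm(self.norm(down))    # inverse hidden states at layer l - 1
        return h_inv_lower

# At each layer l, the tagging logits are produced from [h_l; h_inv_l], as formalized next.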
Formally, the inverse decoding layer l − 1 takes as input the text region embeddings reconstructed by the inverse CNN from the concatenation [h^l; h̄^l], applies layer normalization, and produces its hidden states h̄^{l−1} with the inverse decoding LSTM. For the topmost inverse decoding layer, no such reconstruction is available from above, so we use zeros as its input instead. Finally, for each position, we concatenate the hidden states of the normal and inverse decoding layers and use a feed-forward layer to predict the class logits:

logits^l = Linear([h^l; h̄^l])


Experiments

Datasets
We evaluate our model on four nested entity recognition corpora: ACE-2004 (Doddington et al., 2004), ACE-2005 (Walker et al., 2006), GENIA (Kim et al., 2003), and NNE (Ringland et al., 2019). For ACE-2004 and ACE-2005, we use the same train/dev/test splits used in most previous studies. For GENIA, we use GENIAcorpus3.02p and follow the split of previous works (Finkel and Manning, 2009; Lu and Roth, 2015), i.e.: (1) split the first 81%, subsequent 9%, and last 10% of the data into train, dev, and test sets, respectively; (2) collapse all DNA, RNA, and protein subtypes into DNA, RNA, and protein, keeping cell line and cell type; and (3) remove other entity types, resulting in 5 entity types. For NNE, we keep the original dataset split and pre-processing. The statistics of each dataset are shown in Table 1.

Training Details
We denote by Pyramid-Basic the model using only the normal bottom-up pyramid, and by Pyramid-Full the one with both the normal and inverse pyramids. We use settings as similar as possible across all datasets; Table 2 lists the settings used in our experiments. For the word embeddings, we use 100-dimensional GloVe word embeddings trained on 6B tokens as initialization, and disable updating the word embeddings during training. Character-based embeddings are generated by an LSTM (Lample et al., 2016). We set the hidden dimension to 200 (100 for each direction of the bidirectional LSTM). We use inverse time learning rate decay: lr = lr_0 / (1 + decay_rate * steps / decay_steps), with decay_rate = 0.05 and decay_steps = 1000. All results are averaged over 4 runs to ensure reproducibility. The GENIA corpus differs significantly from the others in its distribution, as it belongs to the medical domain, so for GENIA we initialize word embeddings with 200-dimensional word vectors pre-trained on a biomedical corpus (Chiu et al., 2016).
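As a concrete reading of this schedule, the following short Python sketch computes the inverse time decay with the stated hyperparameters; the initial learning rate used in the example is an assumed placeholder.

def inverse_time_decay(initial_lr, step, decay_rate=0.05, decay_steps=1000):
    # The learning rate decays as 1 / (1 + decay_rate * step / decay_steps).
    return initial_lr / (1.0 + decay_rate * step / decay_steps)

# Example: with an assumed initial learning rate of 1e-3, after 10,000 steps
# the learning rate becomes 1e-3 / 1.5, i.e. about 6.7e-4.
print(inverse_time_decay(1e-3, 10_000))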
We also evaluate our method with pre-trained language model embeddings:
• [Flair] (Akbik et al., 2018): Pre-trained contextualized character-level embeddings. Here, we use the concatenation of news-forward and news-backward, forming embeddings of dimension 4096. For GENIA, we use pubmed-forward and pubmed-backward.
• [BERT] (Devlin et al., 2019): Here we use the bert-large-uncased checkpoint, with embeddings of dimension 1024. For each token, we generate the contextualized word embedding by averaging all BERT subword embeddings in the last four layers, without fine-tuning. For GENIA, we use BioBERT v1.1 (Lee et al., 2020).
• [ALBERT] (Lan et al., 2019): A lite BERT with shared transformer parameters. Here we use the albert-xxlarge-v2 checkpoint, with embeddings of dimension 4096. For each token, we average all ALBERT subword embeddings in the last four layers, without fine-tuning.
We generate Flair embeddings with the library provided by Akbik et al. 2019. We use the implementation by Wolf et al. 2019 to generate BERT and ALBERT embeddings.
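For illustration, the sketch below shows one way to obtain such frozen token embeddings with the current Hugging Face transformers API (which may differ from the library version used in the paper): the last four hidden layers are averaged for every subword piece, and the pieces of each word are then averaged.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
model = AutoModel.from_pretrained("bert-large-uncased", output_hidden_states=True)
model.eval()

words = ["Former", "U.N.", "Ambassador", "Jeane", "Kirkpatrick"]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():                               # no fine-tuning: the LM stays frozen
    hidden_states = model(**enc).hidden_states      # tuple of (1, num_pieces, 1024) tensors

# Average the last four transformer layers for every subword piece.
pieces = torch.stack(hidden_states[-4:]).mean(dim=0).squeeze(0)

# Average the subword pieces belonging to each original word.
word_ids = enc.word_ids()                           # maps each piece to its word (None for [CLS]/[SEP])
token_embeddings = []
for i in range(len(words)):
    idx = [j for j, w in enumerate(word_ids) if w == i]
    token_embeddings.append(pieces[idx].mean(dim=0))
token_embeddings = torch.stack(token_embeddings)    # (len(words), 1024)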
With pre-trained contextualized embeddings, the model is more prone to overfitting. So we increase the dropout rate by 0.05 for these settings.

Tuning Number of Layers
We evaluate our method with different values of L on all datasets. Due to space limits, we only present the results on ACE-2005 in Table 4; the findings on the other datasets are similar.
Results From All Layers: We report in Table 4 the detailed results for all entity lengths while tuning L on ACE-2005. 1-word and 2-word entities account for the majority of entities (77%), and on these we achieve competitive results. Longer entities see reductions in performance. However, thanks to our remedy strategy, entities longer than L are still recognized with acceptable performance. Note that R(N) is the recall of nested entities, i.e. for layer l, entities nested with other entities shorter than l are also counted.
Inference Speed: Table 4 also shows the inference speed with different L for the basic and full models. Although the basic model does not perform as well as the full model, it is significantly faster. Since the time complexity of our method is O(TL), with T being the number of tokens and L the number of stacked layers, we can further speed up inference by using a smaller L (e.g. L = 8 or 4) while still achieving F1 scores higher than most baselines.

Ablation Study
We conduct an ablation study to verify the effectiveness of the components of Pyramid. Likewise, we only present the results on ACE-2005 here.
Character Embeddings: Using character embeddings is a standard technique for NER to dynamically capture orthographic and morphological features. It provides some improvement.
Layer Normalization: LayerNorm eliminates the bias and scale differences between the inputs of the decoding layers and improves the F1 score. It also substantially accelerates convergence.
Sharing LSTM^dec: The jobs of the decoding layers are similar: inheriting information from previous layers and recognizing entity representations of length one. Therefore, sharing weights maximizes the use of training data and prevents overfitting.

Method of Reducing Length: We use a CNN to reduce the sequence length at each decoding layer. As shown in Table 5, compared with average pooling and maximum pooling, the CNN more effectively retains the original semantic information and captures the boundary information.
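For illustration, the following PyTorch snippet contrasts these three length-reduction choices, each shortening a sequence by one position; the dimensions are assumed.

import torch
import torch.nn as nn
import torch.nn.functional as F

dim, T = 200, 7
h = torch.randn(1, T, dim)            # hidden states of one decoding layer
x = h.transpose(1, 2)                 # (batch, dim, T) layout for 1-D operations

conv = nn.Conv1d(dim, dim, kernel_size=2)                               # learned aggregation
reduced_cnn = conv(x).transpose(1, 2)                                   # (1, T - 1, dim)
reduced_avg = F.avg_pool1d(x, kernel_size=2, stride=1).transpose(1, 2)  # (1, T - 1, dim)
reduced_max = F.max_pool1d(x, kernel_size=2, stride=1).transpose(1, 2)  # (1, T - 1, dim)

print(reduced_cnn.shape, reduced_avg.shape, reduced_max.shape)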
Pyramid Layers: Apart from the results shown in Table 5, we emphasize that the performance gain of Pyramid owes a lot to the pyramid layers (both normal and inverse ones). As shown in Table 4, reducing L to 4 leads to a drop in F1 (-1.7). It is clear that when L = 1, our method degrades to a flat entity recognizer, which can no longer handle nested mentions.

Table 6: Results of Pyramid-Basic with nested and overlapping entities. The dataset is based on part of NNE, with additional program-generated labels.

Overlapping Entity Recognition
Overlapping mentions usually occur with attributive clauses in natural language. For example, the sentence "The burial site of Sheikh Abbad, who died 500 years ago, is located." contains two overlapping mentions: "The burial site of Sheikh Abbad" and "Sheikh Abbad, who died 500 years ago". Due to the lack of datasets for overlapping NER, we create a small dataset. Among all sentences in NNE, we find 2599 which contain ", which" or ", who". We use an ELMo-based constituency parser to find attributive clauses together with the noun phrases they modify ("Sheikh Abbad, who ..."), and then check whether a bigger noun phrase ("the burial site of Sheikh Abbad") contains that noun phrase. Next, while keeping the original annotations, we add these two mentions, forming a new dataset in which around 14% of the sentences have overlapping but non-nested entity mentions. This dataset is split randomly into training, dev, and test sets containing 1599, 400, and 600 sentences, respectively. Note that the additional annotations are not verified by humans, so they may contain some errors. However, the dataset is still useful for testing the performance of our model on overlapping NER.
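To make the span relations precise, here is a small standalone Python sketch; the token offsets are made up for illustration and are not taken from the corpus. Two mentions are overlapping but non-nested when their spans intersect and neither contains the other.

def relation(a, b):
    # a, b are (start, end) token spans with inclusive start and exclusive end.
    s1, e1 = a
    s2, e2 = b
    if e1 <= s2 or e2 <= s1:
        return "disjoint"
    if (s1 <= s2 and e2 <= e1) or (s2 <= s1 and e1 <= e2):
        return "nested"
    return "overlapping (non-nested)"

# Hypothetical token spans for the example sentence
# "The burial site of Sheikh Abbad, who died 500 years ago, is located."
burial_site = (0, 6)     # "The burial site of Sheikh Abbad"
sheikh_clause = (4, 12)  # "Sheikh Abbad, who died 500 years ago"
print(relation(burial_site, sheikh_clause))   # -> overlapping (non-nested)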
The statistics of the data and the experimental results are shown in Table 6. It can be seen that our method can effectively handle overlapping NER.

Conclusion
This paper presented Pyramid, a novel layered neural model for nested named entity recognition. Our model relies on a layer-wise bidirectional decoding process (with both normal and inverse pyramids), allowing each decoding layer to take into account global information from lower and upper layers. Pyramid does not suffer from layer disorientation or error propagation, and it is applicable to the more general overlapping NER. The proposed method obtained state-of-the-art results on four nested NER datasets, confirming its effectiveness.