Bipartite Flat-Graph Network for Nested Named Entity Recognition

In this paper, we propose a novel bipartite flat-graph network (BiFlaG) for nested named entity recognition (NER), which contains two subgraph modules: a flat NER module for outermost entities and a graph module for all the entities located in inner layers. Bidirectional LSTM (BiLSTM) and graph convolutional network (GCN) are adopted to jointly learn flat entities and their inner dependencies. Different from previous models, which only consider the unidirectional delivery of information from innermost layers to outer ones (or outside-to-inside), our model effectively captures the bidirectional interaction between them. We first use the entities recognized by the flat NER module to construct an entity graph, which is fed to the next graph module. The richer representation learned from graph module carries the dependencies of inner entities and can be exploited to improve outermost entity predictions. Experimental results on three standard nested NER datasets demonstrate that our BiFlaG outperforms previous state-of-the-art models.


Introduction
Named entity recognition (NER) aims to identify words or phrases that name entities of predefined categories such as locations, organizations, or medical codes. Nested NER further deals with entities that can be nested within each other, such as the United States and third president of the United States shown in Figure 1; such nesting is quite common in natural language processing (NLP).
NER is commonly regarded as a sequence labeling task (Lample et al., 2016; Ma and Hovy, 2016; Peters et al., 2017). These approaches work well for non-nested (flat) entities but neglect nested ones. There have been efforts to deal with the nested structure. Ju et al. (2018) introduced a layered sequence labeling model that first recognizes innermost entities and then feeds them into the next layer to extract outer entities. However, this model suffers from obvious error propagation: wrong entities extracted at one layer hurt the performance of the next. Such a layered model also suffers from the sparsity of entities at high levels; for instance, in the well-known ACE2005 training dataset, there are only two entities at the sixth level. Sohrab and Miwa (2018) proposed a region-based method that enumerates all possible regions and classifies their entity types. However, this model may ignore explicit boundary information. Zheng et al. (2019) combined the layered sequence labeling model and the region-based method, first locating entity boundaries and then utilizing a region classification model to predict entities. This model, however, pays little attention to the interaction among entities located in outer and inner layers.
* Corresponding author. This paper was partially supported by National Key Research and Development Program of China (No. 2017YFB0304100), Key Projects of National Natural Science Foundation of China (U1836222 and 61733011), Huawei-SJTU long term AI project, Cutting-edge Machine reading comprehension and language model.
In this paper, we propose a bipartite flat-graph network (BiFlaG) for nested NER, which models a nested structure containing arbitrarily many layers as two parts: outermost entities and inner entities in all remaining layers. For example, as shown in Figure 1, the outermost entity Thomas Jefferson, third president of the United States is considered a flat (non-nested) entity, while third president of the United States (in the second layer) and the United States (in the third layer) are taken as inner entities. The outermost entities with the maximum coverage are identified in the flat NER module, which adopts a sequence labeling model. All the inner entities are extracted through the graph module, which iteratively propagates information between the start and end nodes of a span using a graph convolutional network (GCN) (Kipf and Welling, 2017). The benefits of our model are twofold: (1) Different from layered models such as that of Ju et al. (2018), which suffer from the constraint of one-way propagation of information from lower to higher layers, our model fully captures the interaction between outermost and inner layers in a bidirectional way. Entities extracted by the flat module are used to construct an entity graph for the graph module. Then, new representations learned by the graph module are fed back to the flat module to improve outermost entity predictions. Also, merging all the entities located in inner layers into a single graph module effectively alleviates the sparsity of entities at high levels.
(2) Compared with region-based models (Sohrab and Miwa, 2018; Zheng et al., 2019), our model makes full use of the sequence information of outermost entities, which account for a large proportion of the corpus.
The main contributions of this paper can be summarized as follows: • We introduce a novel bipartite flat-graph network named BiFlaG for nested NER, which incorporates a flat module for outermost entities and a graph module for inner entities.
• Our BiFlaG fully utilizes the sequence information of outermost entities and meanwhile considers the interaction between outermost and inner layers bidirectionally, rather than through unidirectional delivery of information.
• With extensive experiments on three benchmark datasets (ACE2005, GENIA, and KBP2017), our model outperforms previous state-of-the-art models under the same settings.

Model
Our BiFlaG includes two subgraph modules, a flat NER module and a graph module to learn outermost and inner entities, respectively. Figure 2 illustrates the overview of our model. For the flat module, we adopt BiLSTM-CRF to extract flat (outermost) entities, and use them to construct the entity graph G 1 as in Figure 2. For the graph module, we use GCN which iteratively propagates information between the start and end nodes of potential entities to learn inner entities. Finally, the learned representation from the graph module is further fed back to the flat module for better outermost predictions.

Token Representation
Given a sequence consisting of N tokens {t_1, t_2, ..., t_N}, for each token t_i we first concatenate the word-level and character-level embeddings, t_i = [w_i; c_i], where w_i is the pre-trained word embedding and the character embedding c_i is learned following the work of (Xin et al., 2018). Then we use a BiLSTM to capture sequential information for each token, x_i = BiLSTM(t_i). We take x_i as the word representation and feed it to the subsequent modules.
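The concatenation step can be sketched as follows. This is a minimal illustration with toy embedding tables; the character-level model of Xin et al. (2018) is stood in for by simple mean pooling, and all sizes are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding tables (sizes are illustrative only).
word_emb = rng.normal(size=(100, 8))   # vocab of 100 words, d_word = 8
char_emb = rng.normal(size=(30, 4))    # 30 characters,     d_char = 4

def token_representation(word_id, char_ids):
    """Concatenate a word embedding with a pooled character embedding.

    The paper learns c_i with a character-level model; here we stand in
    for it with mean pooling over character vectors.
    """
    w = word_emb[word_id]                 # w_i
    c = char_emb[char_ids].mean(axis=0)   # stand-in for learned c_i
    return np.concatenate([w, c])         # t_i = [w_i; c_i]

t = token_representation(5, [1, 2, 3])
print(t.shape)  # (12,) = d_word + d_char
```

The resulting t_i vectors are what the BiLSTM consumes to produce the x_i representations.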

Flat NER Module
We adopt BiLSTM-CRF architecture (Lample et al., 2016;Ma and Hovy, 2016;Yang and Zhang, 2018;Luo et al., 2020) in our flat module to recognize flat entities, which consists of a bidirectional LSTM (BiLSTM) encoder and a conditional random field (CRF) decoder.
BiLSTM captures bidirectional contextual information of sequences and can effectively represent the hidden state of each word in context. At each step, the hidden state h of the BiLSTM is computed as

$$\overrightarrow{h}_i = \mathrm{LSTM}(x_i, \overrightarrow{h}_{i-1}; \overrightarrow{\theta}), \quad \overleftarrow{h}_i = \mathrm{LSTM}(x_i, \overleftarrow{h}_{i+1}; \overleftarrow{\theta}), \quad h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$$

where $\overrightarrow{\theta}$ and $\overleftarrow{\theta}$ are trainable parameters, and $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ respectively denote the forward and backward context representations of token $t_i$. The output of the BiLSTM, $H = \{h_1, h_2, ..., h_N\}$, is further fed into the CRF layer.
CRF (Lafferty et al., 2001) has been widely used in state-of-the-art NER models (Lample et al., 2016; Ma and Hovy, 2016; Yang and Zhang, 2018) to help make better decisions, as it captures strong label dependencies by adding transition scores between neighboring labels. The Viterbi algorithm is applied during decoding to search for the label sequence with the highest probability. For a sequence of predictions y = {y_1, ..., y_N} of length N, its score is defined as

$$s(x, y) = \sum_{i=0}^{N} T_{y_i, y_{i+1}} + \sum_{i=1}^{N} P_{i, y_i}$$

where $T_{y_i, y_{i+1}}$ represents the transition score from $y_i$ to $y_{i+1}$, and $P_{i, y_i}$ is the score of the $y_i$-th tag of the $i$-th word from the BiLSTM encoder. The CRF model defines a family of conditional probabilities $p(y|x)$ over all possible tag sequences $y$:

$$p(y|x) = \frac{e^{s(x, y)}}{\sum_{\tilde{y}} e^{s(x, \tilde{y})}}$$

During the training phase, we maximize the log probability of the correct predictions. While decoding, we search for the tag sequence with maximum score:

$$y^* = \arg\max_{\tilde{y}}\, s(x, \tilde{y})$$
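The Viterbi search over the CRF score above can be sketched in a few lines. This is a minimal dynamic-programming implementation under the score defined above; start/end transition scores are omitted for brevity, and the toy inputs are illustrative.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Find the highest-scoring tag sequence under a linear-chain CRF.

    emissions:   (N, L) array, P[i, y] from the BiLSTM encoder
    transitions: (L, L) array, T[y, y'] score of moving from tag y to y'
    """
    N, L = emissions.shape
    score = emissions[0].copy()            # best score ending in each tag
    backptr = np.zeros((N, L), dtype=int)
    for i in range(1, N):
        # score of extending every previous tag to every current tag
        cand = score[:, None] + transitions + emissions[i][None, :]
        backptr[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    # follow back-pointers from the best final tag
    best = [int(score.argmax())]
    for i in range(N - 1, 0, -1):
        best.append(int(backptr[i, best[-1]]))
    return best[::-1], float(score.max())

emissions = np.array([[2.0, 0.0], [0.0, 1.0], [3.0, 0.0]])
transitions = np.zeros((2, 2))
tags, s = viterbi_decode(emissions, transitions)
print(tags, s)  # [0, 1, 0] 6.0
```

With zero transition scores the decoder simply picks the best emission at each step; non-zero transitions let neighboring-label dependencies override locally greedy choices.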

Graph Module
Since the original input sentences are plain text without an inherent graphical structure, we first construct graphs based on the sequential information of the text and the entity information from the flat module. Then, we apply a GCN (Kipf and Welling, 2017; Qian et al., 2019), which propagates information between neighboring nodes in the graphs, to extract the inner entities.
Graph Construction. We create two types of graphs for each sentence as in Figure 2. Each graph is defined as G = (V, E), where V is the set of nodes (words) and E is the set of edges.
• Entity graph G_1: for all the nodes in an entity extracted by the flat module, edges are added between any two nodes e_ij = (v_i, v_j), where start ≤ i < j ≤ end, as shown in Figure 2, allowing the outermost entity information to be utilized.
• Adjacent graph G 2 : for each pair of adjacent words in the sentence, we add one directed edge from the left word to the right one, allowing local contextual information to be utilized.
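The two graph constructions above can be sketched as adjacency matrices. This is a minimal illustration; the span and sentence length are toy values.

```python
import numpy as np

def build_graphs(n_tokens, entities):
    """Build adjacency matrices for the two graphs of the graph module.

    entities: list of (start, end) spans (inclusive) from the flat module.
    Returns (G1, G2): the entity graph and the adjacent-word graph.
    """
    g1 = np.zeros((n_tokens, n_tokens), dtype=int)  # entity graph G_1
    g2 = np.zeros((n_tokens, n_tokens), dtype=int)  # adjacent graph G_2
    for start, end in entities:
        for i in range(start, end + 1):
            for j in range(i + 1, end + 1):
                g1[i, j] = 1      # edge between any two nodes in the span
    for i in range(n_tokens - 1):
        g2[i, i + 1] = 1          # directed edge from left word to right
    return g1, g2

# A 5-token sentence with one flat entity covering tokens 1..3.
g1, g2 = build_graphs(5, [(1, 3)])
print(int(g1.sum()), int(g2.sum()))  # 3 4
```

Note that G_1 is empty when the flat module finds no entities, which is exactly the case the adjacent graph G_2 compensates for.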
Bi-GCN. In order to consider both incoming and outgoing features for each node, we follow the work of (Marcheggiani and Titov, 2017; Fu et al., 2019), which uses a Bi-GCN to extract graph features. Given a graph G = (V, E) and the word representations $X = \{x_1, ..., x_N\}$, the graph feature of each token is computed as

$$\overrightarrow{f}_i = \mathrm{ReLU}\Big(\sum_{j:\, e_{ij} \in E} (W x_j + b)\Big), \quad \overleftarrow{f}_i = \mathrm{ReLU}\Big(\sum_{j:\, e_{ji} \in E} (W x_j + b)\Big), \quad f_i = [\overrightarrow{f}_i; \overleftarrow{f}_i]$$

where ReLU is the non-linear activation function, $e_{ij}$ represents the edge outgoing from token $t_i$, and $e_{ji}$ represents the edge incoming to token $t_i$. The features of the two graphs are aggregated to combine the impacts of both:

$$f_i = \mathrm{ReLU}(W_c [f_i^1; f_i^2] + b_c)$$

where $W_c \in \mathbb{R}^{2d_f \times d_f}$ is a weight to be learned, $b_c \in \mathbb{R}^{d_f}$ is a bias parameter, and $f_i^1$ and $f_i^2$ are the graph features of $G_1$ and $G_2$, respectively. After getting the graph representation $F = \{f_1, f_2, ..., f_N\}$ from the Bi-GCN, we learn the entity score $M \in \mathbb{R}^{N \times N \times L}$ for inner layers as

$$M_{ij} = \mathrm{softmax}(W_m [f_i; f_j] + b_m)$$

where $L$ is the number of entity types and $M_{ij} \in \mathbb{R}^{L}$ represents the type probability of a span starting at token $t_i$ and ending at token $t_j$.
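A single Bi-GCN step over a directed graph can be sketched as below. The parameter names (Wf, bf, Wb, bb) and sizes are illustrative, not from the paper; the key point is that outgoing and incoming neighborhoods are aggregated separately and then concatenated.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def bi_gcn_layer(X, adj, Wf, bf, Wb, bb):
    """One Bi-GCN step over a directed graph.

    X:   (N, d_x) token representations
    adj: (N, N) adjacency matrix, adj[i, j] = 1 for an edge i -> j
    Forward features sum over each node's outgoing edges, backward
    features over its incoming edges; the two are concatenated.
    """
    fwd = relu(adj @ (X @ Wf + bf))    # sum over j with e_ij in E
    bwd = relu(adj.T @ (X @ Wb + bb))  # sum over j with e_ji in E
    return np.concatenate([fwd, bwd], axis=1)

rng = np.random.default_rng(0)
N, d_x, d_h = 4, 6, 3
X = rng.normal(size=(N, d_x))
adj = np.zeros((N, N))
adj[0, 1] = adj[1, 2] = adj[2, 3] = 1   # a simple chain graph
F = bi_gcn_layer(X, adj, rng.normal(size=(d_x, d_h)), np.zeros(d_h),
                 rng.normal(size=(d_x, d_h)), np.zeros(d_h))
print(F.shape)  # (4, 6)
```

Running one such layer per graph (G_1 and G_2) and then mixing the two outputs with W_c matches the aggregation described above.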
For inner entities, we define the ground-truth entity type of word pair $(t_i, t_j)$ as $\hat{M}_{ij}$, where $t_i$ and $t_j$ are the start and end nodes of a span. Cross entropy (CE) is used to calculate the loss:

$$L_{graph} = -\sum_{i=1}^{N} \sum_{j=i}^{N} I(O)\, \hat{M}_{ij} \log M_{ij}$$

where $M_{ij} \in \mathbb{R}^{L}$ denotes the entity score in the graph module, and $I(O)$ is a switching function that distinguishes the loss of the non-entity type 'O' from that of the other entity types. It is defined as

$$I(O) = \begin{cases} 1, & \text{if the gold type of } (t_i, t_j) \text{ is 'O'} \\ \lambda_1, & \text{otherwise} \end{cases}$$

$\lambda_1$ is the bias weight. The larger $\lambda_1$ is, the greater the impact of entity types and the smaller the influence of the non-entity type 'O' on the graph module.

[Algorithm 1 (the BiFlaG training procedure) appears here: it transforms M into graph G_3, collects the entities in M and y_new into the entity set T, and returns T.]

BiFlaG Training
The entity score M carries the type probability of each word pair in the sentence. To further propagate information from inner entities to outer ones, we use a Bi-GCN to generate new representations from the entity score M for the flat module. The largest type score $r_{ij}$ of the word pair $(t_i, t_j)$ indicates whether this span is an entity or a non-entity, together with the confidence of being such a type. It is obtained by a max-pooling operation:

$$r_{ij} = \max(M_{ij}), \quad type_{ij} = \arg\max(M_{ij})$$

where $type_{ij}$ represents the entity type or non-entity 'O' corresponding to the maximum type score.
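The max-pooling over the entity score, together with the zeroing of 'O' spans described next, can be sketched on a toy tensor. The scores below are illustrative; index 0 stands for the non-entity type 'O'.

```python
import numpy as np

# Toy entity score M for a 3-token sentence with L = 3 types
# (index 0 is the non-entity type 'O'; values are illustrative).
M = np.zeros((3, 3, 3))
M[0, 1] = [0.1, 0.7, 0.2]   # span (t_0, t_1): best type 1, score 0.7
M[1, 2] = [0.8, 0.1, 0.1]   # span (t_1, t_2): best type 'O'

r = M.max(axis=-1)            # r_ij: confidence of the best type per span
span_type = M.argmax(axis=-1) # type_ij
r[span_type == 0] = 0.0       # no dependency when the best type is 'O'

print(round(float(r[0, 1]), 1), int(span_type[0, 1]), float(r[1, 2]))
# 0.7 1 0.0
```

The surviving non-zero r_ij values are exactly the weighted edges used to build the new graph G_3.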
When the corresponding type is 'O', there exist no dependencies between $t_i$ and $t_j$, so we set $r_{ij}$ to 0. A new graph $G_3$ that carries the boundary information of inner entities is then defined with edge weights $e_{ij} = r_{ij}$. The new representation used to update the flat module consists of two parts. The first part carries the previous representation of each token:

$$\alpha_i^1 = W_r x_i + b_r$$

where $W_r \in \mathbb{R}^{d_x \times d_f}$ and $b_r \in \mathbb{R}^{d_f}$. The second part aggregates the inner entity dependencies of the new graph $G_3$:

$$\alpha_i^2 = \text{Bi-GCN}(x_i, G_3)$$

Finally, $\alpha_i^1$ and $\alpha_i^2$ are added to obtain the new representation

$$x_i^{new} = \alpha_i^1 + \alpha_i^2$$

which is fed into the flat module to update the parameters and extract better outermost entities.
For outermost entities, we use the BIOES sequence labeling scheme and adopt a CRF to calculate the loss. The losses corresponding to the two representations (X and X^new) are added together as the outermost loss:

$$L_{flat} = L_{CRF}(X) + L_{CRF}(X^{new})$$

Entities in the sequence are divided into two disjoint sets of outermost and inner entities, which are modeled by the flat module and the graph module, respectively. Entities in each module share the same neural network structure. Between the two modules, each entity in the flat module is either an independent node or interacts with one or more entities in the graph module. Therefore, our BiFlaG is indeed a bipartite graph. The complete training procedure for BiFlaG is shown in Algorithm 1.

Loss Function
Our BiFlaG model predicts both outermost and inner entities. The total loss is defined as

$$L = L_{flat} + \lambda_2 L_{graph}$$

where $\lambda_2$ is a weight balancing the losses of the flat module and the graph module. We minimize this total loss during the training phase.
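Putting the pieces together, the weighted graph loss and the total loss can be sketched as below. The helper names (`weighted_ce`, `total_loss`) and the toy numbers are illustrative, not from the paper; the flat-module loss is assumed to be a precomputed scalar here.

```python
import numpy as np

def weighted_ce(probs, gold, lam1):
    """Cross entropy over candidate spans with the I(O) switching weight.

    probs: (S, L) predicted type distributions for S candidate spans
    gold:  (S,) gold type indices, with 0 standing for non-entity 'O'
    Gold entity spans are weighted by lam1; 'O' spans get weight 1.
    """
    weights = np.where(gold == 0, 1.0, lam1)
    nll = -np.log(probs[np.arange(len(gold)), gold])
    return float((weights * nll).sum())

def total_loss(loss_flat, loss_graph, lam2):
    # L = L_flat + lambda_2 * L_graph
    return loss_flat + lam2 * loss_graph

probs = np.array([[0.8, 0.1, 0.1],   # span predicted as 'O'
                  [0.2, 0.7, 0.1]])  # span predicted as type 1
lg = weighted_ce(probs, np.array([0, 1]), lam1=1.5)
total = total_loss(2.0, lg, lam2=1.5)  # assume L_flat = 2.0 for the toy
print(round(total, 3))
```

With lam1 = lam2 = 1.5 (the paper's reported settings), entity spans and the graph module both receive moderately increased weight relative to the flat module.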

Dataset and Metric
We evaluate our BiFlaG on three standard nested NER datasets: GENIA, ACE2005, and TACKBP2017 (KBP2017), which contain 22%, 10%, and 19% nested mentions, respectively. Table 1 lists the relevant data statistics. The GENIA dataset (Kim et al., 2003) is based on GENIAcorpus3.02p. We use the same setup as previous works (Finkel and Manning, 2009; Lu and Roth, 2015; Lin et al., 2019a). This dataset contains 5 entity categories and is split 8.1:0.9:1 for training, development, and test.
ACE2005 (Walker et al., 2006) contains 7 fine-grained entity categories. We preprocess the dataset following the same settings as (Lu and Roth, 2015; Wang and Lu, 2018; Katiyar and Cardie, 2018; Lin et al., 2019a), keeping files from bn, nw and wl, and splitting these files into training, development, and test sets by 8:1:1, respectively.
KBP2017 Following (Lin et al., 2019a), we evaluate our model on the 2017 English evaluation dataset (LDC2017E55). The training and development sets contain previous RichERE annotated datasets (LDC2015E29, LDC2015E68, LDC2016E31 and LDC2017E02). The datasets are split into 866/20/167 documents for training, development and test, respectively.
Metric Precision (P ), recall (R) and F-score (F 1 ) are used to evaluate the predicted entities. An entity is confirmed correct if it exists in the target labels, regardless of the layer at which the model makes this prediction.
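The evaluation metric above amounts to set comparison over (start, end, type) triples. This is a minimal sketch with hypothetical spans and type labels:

```python
def evaluate(pred, gold):
    """Micro precision/recall/F1 over predicted and gold entity sets.

    Entities are (start, end, type) tuples; a prediction is correct if
    it appears in the gold set, regardless of nesting layer.
    """
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Toy nested example: gold has three entities, two of them nested.
gold = [(0, 6, "PER"), (2, 6, "PER"), (4, 6, "GPE")]
pred = [(0, 6, "PER"), (4, 6, "GPE"), (1, 3, "ORG")]
p, r, f1 = evaluate(pred, gold)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.667 0.667 0.667
```

Because matching ignores the layer at which an entity is produced, a nested entity recovered by either module counts the same.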

Parameter Settings
Our model is based on the framework of (Yang and Zhang, 2018). We optimize with stochastic gradient descent (SGD) for the flat module and Adam for the GCN module. For the GENIA dataset, we use the same 200-dimensional pre-trained word embeddings as (Ju et al., 2018; Sohrab and Miwa, 2018; Zheng et al., 2019). For the ACE2005 and KBP2017 datasets, we use the publicly available pre-trained 100-dimensional GloVe (Pennington et al., 2014) embeddings. We train the character embedding as in (Xin et al., 2018). The learning rate is set to 0.015 and 0.001 for the flat and GCN modules, respectively. We apply dropout to the embeddings and hidden states with a rate of 0.5. The hidden sizes of the BiLSTM and GCN are both set to 256. The bias weights λ_1 and λ_2 are both set to 1.5.

Results and Comparisons
Table 2 compares our model with existing state-of-the-art approaches on the three benchmark datasets. Given only standard training data and publicly available word embeddings, the results in Table 2 show that our model outperforms all of these models. Current state-of-the-art results on these datasets are tagged with † in Table 2; we improve F1 by 0.5/1.3/2.8 on ACE2005, GENIA, and KBP2017, respectively. KBP2017 contains many more entities than ACE2005 and GENIA; the number of entities in its test set is four times that of ACE2005. Our model achieves its most significant improvement on this dataset, demonstrating the effectiveness of BiFlaG. More notably, our model without POS tags surpasses the previous models (Wang and Lu, 2018; Lin et al., 2019a), which use POS tags as additional representations, on all three datasets. Besides, (Lin et al., 2019b) incorporate gazetteer information on the ACE2005 dataset, and our model achieves comparable results with theirs. Other works such as (Straková et al., 2019), which train their model on both the training and development sets, are thus not directly comparable to ours.
Table 3 makes a detailed comparison on the five categories of the GENIA test set with a layered model (Ju et al., 2018) and a region-based model (Zheng et al., 2019). Compared with the region-based model, the layered model tends to have higher precision and lower recall, since it is subject to error propagation: outer entities will not be identified if the inner ones are missed. Meanwhile, the region-based model suffers from low precision, as it may generate many candidate spans. By contrast, our BiFlaG model balances precision and recall well. The entity types Protein and DNA have the most nested entities on the GENIA dataset, and the improvement of our BiFlaG on these two entity types is remarkable, which can be attributed to the interaction of nested information between the two subgraph modules of BiFlaG.
Table 4 evaluates the performance of each module on the ACE2005 and GENIA datasets. Our flat module performs well on both datasets for outermost entity recognition. However, the recall of inner entities is low on the GENIA dataset. According to the statistics in Table 1, only 11% of the entities on GENIA are located in inner layers, while on the ACE2005 dataset the proportion is 24%. It can be inferred that the sparsity of the entity distribution in inner layers has a great impact on the results; if these inner entities were identified layer by layer, the sparsity would be even worse. We can enhance the impact of sparse entities by increasing the weight λ_1, but this may hurt precision, so we set λ_1 = 1.5 for a better trade-off between precision and recall.

Analysis of Entity Length
We conduct additional experiments on the ACE2005 dataset to examine how the length of outermost entities affects the extraction of their inner entities, as shown in Table 6. Our flat module predicts outermost entities well, and these account for a large proportion of all entities. In general, the performance on inner entities is affected by the extraction performance and the length of their outermost entities. A shorter outermost entity is more likely to have its inner entities share either the first or the last token, making the constructed graph more instructive, so its inner entities are easier to extract.

Ablation Study
In this paper, we use the interactions of the flat module and the graph module to help better predict outermost and inner entities, respectively. We conduct an ablation study to verify the effectiveness of these interactions. The first part is the information delivery from the flat module to the graph module. We conduct four experiments: (1) no graph: we skip the Bi-GCN computation and let the graph feature f = Linear(x). In this case, inner entities are independent of the outermost entities and rely only on the word representation (Section 2.1), which carries contextualized information.
(2) adjacent graph: we further utilize the sequential information of the text to help inner entity prediction.
(3) entity graph: since the boundary information of outer entities can be indicative for inner entities, we construct an entity graph based on the entities extracted by the flat module. (4) both graphs: when outer entities are not recognized by the flat module, their inner entities fail to receive boundary information; we use the sequential information of the text to make up for this deficiency of using only the entity graph. Experimental results show that the entity graph carries more useful information than the adjacent graph, improving the baseline by 1.4/1.1/1.2 F1, respectively. By combining the two graphs, we get a larger gain of 1.7/1.6/1.6 F1. The second part is the information delivery from the graph module to the flat module: the new representation X^new learned from the graph module is propagated back to the flat module. X^new is equipped with the dependencies of inner entities and proves useful, yielding an improvement of 0.8/1.5/0.5 F1 on the three benchmarks, respectively.

Inference Time
We compare the inference speed of our BiFlaG with (Zheng et al., 2019), (Sohrab and Miwa, 2018), and (Ju et al., 2018) in terms of the number of words decoded per second. For all the compared models, we use the re-implemented code released by (Zheng et al., 2019) and the same batch size of 10. Compared with (Zheng et al., 2019) and (Sohrab and Miwa, 2018), our BiFlaG does not need to compute a region representation for each potential entity, so we can take full advantage of GPU parallelism. Compared with (Ju et al., 2018), which requires CRF decoding at each layer, our model only needs to compute two modules; by contrast, their cascaded CRF layers limit inference speed.

Related Work
Recently, with the development of deep neural networks across a wide range of NLP tasks (He et al., 2018; Li et al., 2018a,b; Zhang et al., 2018, 2020a), it has become possible to build reliable NER systems without hand-crafted features. Nested named entity recognition requires identifying all the entities in a text, which may be nested within each other. Though NER is a traditional NLP task, only in recent years has attention been paid to this nested structure of named entities. (Lu and Roth, 2015) introduce a novel hypergraph representation to handle overlapping mentions. (Muis and Lu, 2017) further develop a gap-based tagging schema that assigns tags to gaps between words to address the spurious-structures issue, which can be modeled using conventional linear-chain CRFs; however, it suffers from structural ambiguity during inference. (Wang and Lu, 2018) propose a novel segmental hypergraph representation to eliminate this structural ambiguity. (Katiyar and Cardie, 2018) also propose a hypergraph-based approach based on the BILOU tag scheme that utilizes an LSTM network to learn the hypergraph representation in a greedy manner.
Stacking sequence labeling models to extract entities from inner to outer (or outside-to-inside) can also handle such nested structures. (Alex et al., 2007) propose several different modeling techniques (layering and cascading) to combine multiple CRFs for nested NER. However, their approach cannot handle nested entities of the same entity type. (Ju et al., 2018) dynamically stack flat NER layers, and recognize entities from innermost layer to outer ones. Their approach can deal with nested entities of the same type, but suffers from error propagation among layers.
Region-based approaches are also commonly used for nested NER; they extract subsequences of a sentence and classify their types. (Sohrab and Miwa, 2018) introduce a neural exhaustive model that considers all possible spans and classifies their types. This work is further improved by (Zheng et al., 2019), who first apply a single-layer sequence labeling model to identify the boundaries of potential entities using context information, and then classify these boundary-aware regions into an entity type or non-entity. (Lin et al., 2019a) propose a sequence-to-nuggets approach named Anchor-Region Networks (ARNs) to detect nested entity mentions. They first use an anchor detector to detect the anchor words of entity mentions, and then apply a region recognizer to identify the mention boundaries centered on each anchor word. (Fisher and Vlachos, 2019) decompose nested NER into two stages: tokens are merged into entities through real-valued decisions, and the entity embeddings are then used to label the identified entities.

Conclusion
This paper proposes a new bipartite flat-graph (BiFlaG) model for nested NER, which consists of two interacting subgraph modules. Applying a divide-and-conquer policy, the flat module is in charge of outermost entities, while the graph module focuses on inner entities. Our BiFlaG model also facilitates a full bidirectional interaction between the two modules, which lets the nested NE structures be jointly learned to the fullest degree. As a general model, BiFlaG can also handle non-nested structures by simply removing the graph module. Under the same strict settings, empirical results show that our model generally outperforms previous state-of-the-art models.