Hierarchical Region Learning for Nested Named Entity Recognition

Named Entity Recognition (NER) has been deeply explored and is widely used in various tasks. Often, some entity mentions are nested in other entities, which leads to the nested NER problem. Leading region-based models face both efficiency and effectiveness challenges due to the high complexity of subsequence enumeration. To tackle these challenges, we propose a hierarchical region learning framework that automatically generates a tree hierarchy of candidate regions with nearly linear complexity and incorporates structure information into the region representation for better classification. Experiments on the benchmark datasets ACE-2005, GENIA and JNLPBA demonstrate competitive or better results than state-of-the-art baselines.


Introduction
As a fundamental information extraction task, Named Entity Recognition (NER) is widely used in various downstream tasks, such as entity linking and entity search. Most studies assign a label to each token of the sequence, addressing the flat NER problem (Lample et al., 2016). However, in many domains it is common for entities to be embedded in other entities (Kim et al., 2003; Ringland et al., 2019). The example from the ACE-2005 dataset shown in Fig. 1 illustrates that the top-level PER entity includes a nested entity with the ORG label. Recognizing all entities recursively from innermost to outermost is referred to as the nested NER problem. Existing approaches mainly solve the nested NER problem by classifying all candidate subsequences (a.k.a. regions). The key to region-based methods lies in candidate region detection. One kind is the brute-force method (Sohrab and Miwa, 2018), which enumerates all possible O(n^2) subsequences of a sentence with n words. The other kind (Zheng et al., 2019) generates and classifies candidate regions in a two-stage paradigm, often leading to cascaded errors. Thus region-based methods face efficiency and effectiveness challenges.
To tackle these challenges, we propose a Hierarchical Region learning framework, referred to as HiRe. First, inspired by the constituent parsing tree shown at the top of Fig. 1 and its neural syntactic distance (Shen et al., 2018), we introduce a coherence measure between adjacent regions. Then we generate a region tree for each sentence by recursively merging two adjacent regions in a bottom-up manner based on this region coherence measure. Finally, hierarchical regions are classified based on the boundary and merging-word representations. We train the hierarchical region generation and classification tasks simultaneously.
Experimental results on three benchmark datasets, ACE-2005, GENIA and JNLPBA, demonstrate that HiRe achieves competitive or better performance than baselines. HiRe generates only O(n) candidate regions, about 77.9% fewer than the brute-force method, while achieving 98.1% true-region recall on the GENIA dataset, a good trade-off between efficiency and effectiveness.

Related Work
Given a sentence of n words (w_1, ..., w_n), the nested named entity recognition task aims at identifying all entities, especially when one entity subsequence (w_i, ..., w_j), i < j, contains another (w_p, ..., w_q), i ≤ p < q ≤ j. According to the problem each is reduced to, existing nested NER models mainly fall into three categories.
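As a concrete illustration of the containment condition above, a minimal helper (hypothetical, not part of any of the models discussed) that checks whether one span is nested in another:

```python
def is_nested(inner, outer):
    """Return True if span `inner` = (p, q) satisfies the containment
    condition i <= p < q <= j with respect to `outer` = (i, j).
    Spans are 0-indexed, inclusive word indices."""
    i, j = outer
    p, q = inner
    return i <= p < q <= j
```

Note that the formal condition requires p < q, i.e., a multi-word inner subsequence, mirroring the task definition verbatim.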
Sequence labeling models assign multiple labels to each word, assuming that one word may belong to multiple entities; examples include the linearization method (Straková et al., 2019) and layered CRF (Ju et al., 2018).
Structured label classification models capture the label relationship within a sentence for better performance. (Lu and Roth, 2015; Wang and Lu, 2018) proposed hyper-graph models to describe the label relationship, and either human-designed or latent features were adopted for classification.
Region-based models were summarized by (Lin et al., 2019) as obtaining all possible regions and assigning labels to them. The key to region classification models is how to obtain candidate regions from a sentence. One approach is the brute-force method (Sohrab and Miwa, 2018; Xia et al., 2019), which enumerates all subsequences of a sentence for classification with high time complexity. The other is to formulate the task as a two-stage paradigm: (Zheng et al., 2019; Tan et al., 2020) detected a small set of candidate regions with high efficiency, but only about 80% of entities could be found in the first stage, creating a performance bottleneck. Some studies (Finkel and Manning, 2009) leveraged external knowledge, such as a constituent parsing tree, to guide the first step, which achieved impressive performance but suffered from cubic time complexity and error propagation from external tools. Most methods above represent a region as the average or weighted sum of word representations, ignoring the region structure.

Methods
To tackle the efficiency and effectiveness challenges in region-based methods, we propose a Hierarchical Region learning framework for the nested NER problem, namely HiRe, shown in Fig. 2.

Overall Architecture
Specifically, we first obtain word representations through the encoder layer. Then, we introduce a word coherence measure based on word representations through the word coherence layer. Next, a region coherence measure is derived from the word coherence; two adjacent regions are recursively merged based on this measure, and a tree of regions is generated for each sentence. Finally, we use a ranking loss over region boundaries for the region generation task and a cross-entropy loss over candidate region labels for the entity recognition task in a multi-task framework.

Encoder Layer. Consider the i-th word w_i in a sentence with n words. We represent it by concatenating its word embedding x_i^w, part-of-speech (POS) embedding x_i^p and character-level embedding x_i^c. The character-level embeddings are generated by a convolutional neural network module with the same setting as (Yang et al., 2018) to capture the orthographic and morphological features of the word. Then, we employ a bi-directional LSTM over the concatenated embeddings to obtain the long-term context-aware representations {h_t}.
Word Coherence. The word context representations {h_t}, t = 0, ..., n-1, are fed to a convolutional kernel with window size 2 to obtain local features between adjacent words: g_0, g_1, ..., g_{n-2} = CONV(h_0, h_1, ..., h_{n-1}). These features are then input into a 2-layer feed-forward network (FFN) to obtain the word coherence measures {d_t}, t = 0, ..., n-2, where d_t indicates the affinity between words w_t and w_{t+1}. The higher this measure, the more coherent the adjacent words.
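To make the computation concrete, here is a minimal plain-Python sketch of this layer with hypothetical toy weights (the actual layer operates on learned LSTM states and trained parameters); concatenation of the two adjacent word vectors stands in for the window-2 convolution:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def relu(x):
    return [max(0.0, v) for v in x]

def word_coherence(h, W1, W2):
    """Toy sketch of the word-coherence layer.

    h  : list of n word vectors from the encoder.
    W1 : rows of the first FFN layer, applied to the window-2 local
         feature g_t = h_t ++ h_{t+1} (concatenation standing in for
         the size-2 convolution).
    W2 : weight vector of the second FFN layer, producing scalar d_t.
    Returns d_0 .. d_{n-2}, where d_t scores the affinity of w_t, w_{t+1}.
    """
    d = []
    for t in range(len(h) - 1):
        g_t = h[t] + h[t + 1]                  # window-2 local feature
        hidden = relu([dot(row, g_t) for row in W1])
        d.append(dot(W2, hidden))              # coherence of w_t and w_{t+1}
    return d
```

With these toy weights, two identical adjacent word vectors receive a higher coherence score than two dissimilar ones, which is the behaviour the learned layer is trained to produce.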
Region Coherence. A subsequence of the sentence composed of consecutive words is called a region, denoted R_{i,j} = (w_i, ..., w_j). Based on the word coherence measure, we define the region coherence between two adjacent regions using the coherence of their adjacent boundary words in Eq. (5). It indicates how likely two adjacent regions are to form a whole.
Hierarchical Region Generation. Based on the region coherence measure, we build the region hierarchy bottom-up recursively as follows. At the 1st level, for initialization, each word is treated as a region and a leaf node of the tree. At the t-th level, two regions R_{i,k} and R_{k+1,j} are merged into R_{i,j} at the merging point k. R_{i,j} is used at the following levels instead of R_{i,k} and R_{k+1,j}. Because each k has one chance to be the merging point, this merging operation is repeated at most n-1 times, so the process generates O(n) candidate regions. Fig. 3 illustrates this generation process for the example sentence from Fig. 1, where blocks with the same color belong to the same region. In practice, it is not essential to generate the whole tree under the constraint of a maximum entity length, which further reduces the number of candidate regions.

Region Classification. Here a region is composed of two sub-regions. For a region R_{i,j} with merging point k generated by the above steps, we adopt g_k as the representation of its sub-regions R_{i,k} and R_{k+1,j}. To make the classifier more sensitive to entity boundaries, both boundary and merging-word representations are concatenated as the region representation. Next, a 2-layer feed-forward network predicts the probability that region R_{i,j} belongs to entity category c as in Eq. (6).
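The merging procedure can be sketched in a few lines of plain Python (an illustrative reimplementation under our reading of the text, with region coherence simplified to the single boundary word coherence and scores assumed to be already computed):

```python
def build_region_tree(n, coherence):
    """Greedy bottom-up region generation.

    coherence[t] is the score between word t and word t+1 (higher =
    more likely to belong together). Each of the n-1 merge steps joins
    the two adjacent frontier regions whose shared boundary has the
    highest remaining coherence, so the tree holds exactly
    n + (n - 1) = 2n - 1 candidate regions in total.
    """
    regions = [(i, i) for i in range(n)]   # frontier, one region per word
    candidates = list(regions)             # leaves are candidates too
    scores = list(coherence)               # boundary scores between frontier regions
    while len(regions) > 1:
        k = max(range(len(scores)), key=scores.__getitem__)
        merged = (regions[k][0], regions[k + 1][1])
        candidates.append(merged)
        regions[k:k + 2] = [merged]        # replace the pair by its parent
        del scores[k]                      # the merged boundary is consumed
    return candidates
```

For n = 4 with scores [0.9, 0.1, 0.8], words 0-1 merge first, then 2-3, then the two halves, yielding the 7 = 2n - 1 regions of a balanced tree; the weak boundary at position 1 never produces the span (1, 2).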

Learning and Inference
We train both the hierarchical region generation and classification tasks simultaneously in a multi-task framework as Eq. (7).
For the hierarchical region generation task, we propose to optimize the pairwise ranking loss L_region in Eq. (8), which emphasizes the partial order between inner and boundary word coherences rather than their values. The predicted partial order is determined by the learned boundary and inner word coherence scores. The loss function reduces to the per-region difference between the predicted and ground-truth region hierarchies.
However, the ground-truth partial order is unavailable in the datasets. To solve this problem, we generate ground-truth coherence scores based on the rule that, for each ground-truth entity region R_{i,j}, the coherences of the boundary words w_{i-1} and w_j are always smaller than those of the inner words {w_t}, t = i, ..., j-1. Considering the hierarchy of entities, we define the ground-truth word coherence as a logarithmic function of length. Specifically, the ground-truth boundary word coherences d̂_{i-1} and d̂_j are defined as -(⌊log_2(j - i + 2)⌋ + 1), and the ground-truth inner word coherences {d̂_m}, m = i, ..., j-1, are randomly drawn from [-1, -⌊log_2(j - i + 2)⌋]. The predicted word coherences {d_t}, t = i-1, ..., j, are derived through the above layers.
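The target-generation rule can be sketched as follows (a plain-Python illustration; indices are 0-based and d̂_t scores the word pair (w_t, w_{t+1}), so a sentence of n words has targets only for positions 0 to n-2):

```python
import math
import random

def target_coherences(i, j, n):
    """Ground-truth coherence targets for one gold entity region R_{i,j}
    (0-indexed, inclusive), following the rule described above: the
    boundary coherences d_{i-1} and d_j are fixed below every inner
    coherence d_i .. d_{j-1}, so the entity is split off from its
    context before it is split internally."""
    depth = math.floor(math.log2(j - i + 2))
    targets = {}
    for b in (i - 1, j):                   # boundary positions
        if 0 <= b <= n - 2:                # d_t only exists for t in [0, n-2]
            targets[b] = -(depth + 1.0)
    for m in range(i, j):                  # inner positions
        targets[m] = random.uniform(-1.0, -float(depth))
    return targets
```

For a three-word entity R_{2,4} in a 10-word sentence, depth = ⌊log_2(5)⌋ = 2, so the boundaries at positions 1 and 4 get -3 while the inner positions 2 and 3 fall in [-2, -1], which preserves the required partial order; longer entities push their boundaries further down, encoding the hierarchy.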
For the region classification task, the cross-entropy loss function L_label is used with a softmax classifier based on the probabilities in Eq. (6).

Experiments
To investigate the effectiveness and efficiency of our proposed method, we conduct comprehensive experiments on three benchmark NER datasets.

Experimental Setting
NER datasets with some nested entities are referred to as nested NER datasets. We conduct experiments on the nested NER datasets ACE-2005 and GENIA (Kim et al., 2003), which contain 36.4% and 21.8% nested entities respectively. We follow the same dataset setup as previous work (Wang and Lu, 2018; Lin et al., 2019). We also conduct ablation experiments on the flat NER dataset JNLPBA (Collier and Kim, 2004), with pre-processed data obtained from (Zheng et al., 2019). HiRe was implemented in PyTorch. The Stanford CoreNLP toolkit was used for sentence splitting and POS tagging. We use Adam (Kingma and Ba, 2015) for optimization with batch size 32 and learning rate 0.001. Word embeddings are initialized with pre-trained 200-dimension GloVe vectors (Pennington et al., 2014). The dimensions of the POS tag embedding and character embedding, the number of LSTM layers, and the number of hidden units are 50, 100, 2 and 256 respectively. The dropout ratio is 0.2 and α is 0.4. We use BERT-base for word representations and fine-tune its parameters with learning rate 3e-5. The maximum number of hierarchical layers t is set to 8, 6 and 6 on ACE-2005, GENIA and JNLPBA respectively.

HiRe achieves competitive or better performance than baselines. The performance gain on ACE-2005 is due to the high recall of the region generation step and the incorporation of region structure into the region representation in the classification step. The larger gain on ACE-2005 suggests that HiRe performs better on datasets with more nested entities.

Effectiveness Analysis
Considering baselines with pre-trained language models, we replace the LSTM encoder with BERT-base in HiRe. Experimental results are listed in Table 2. Our model significantly outperforms the baselines. As far as we know, the only reported higher F1 score on ACE-2005 is obtained with BERT-large, which has three times as many parameters as BERT-base and thus trains and infers with lower efficiency.

Efficiency Analysis
Given a sentence with n words, the brute-force method enumerates O(n^2) candidate regions, while HiRe generates O(n). (Zheng et al., 2019) finds candidate regions through token-wise classification with O(n) time complexity. For sentences in GENIA, the number of candidate regions generated by HiRe is 77.9% less than that of the enumeration method, at the cost of discarding 1.3% of long entities, and more than that of (Zheng et al., 2019). However, the true recall of the candidate regions generated by the enumeration method and HiRe is 98.7% and 98.1% respectively, while the recall of the start/end boundaries generated by (Zheng et al., 2019) is 84.3%/87.2%. In this sense, HiRe finds a relatively small (around 20% of all regions) but high-quality (98.1% true recall) subset of all regions, which is a good trade-off between efficiency and effectiveness. (Due to different experimental settings, we reproduced (Sohrab and Miwa, 2018) under the same setting as the other baselines and obtained performance similar to the results in (Zheng et al., 2019); the other results were taken from their papers.)
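For a rough sense of the numbers, the two counts can be compared directly (a back-of-the-envelope sketch; the 77.9% figure reported above additionally reflects GENIA's sentence-length distribution and the maximum-length restraint):

```python
def candidate_counts(n, max_len=None):
    """Candidate-region counts for a sentence of n words: brute-force
    enumeration of all subsequences (optionally capped at max_len words)
    vs. HiRe's region tree, which has at most 2n - 1 nodes
    (n leaves plus n - 1 merges)."""
    if max_len is None or max_len >= n:
        brute = n * (n + 1) // 2
    else:
        brute = sum(n - l + 1 for l in range(1, max_len + 1))
    return brute, 2 * n - 1

brute, hire = candidate_counts(30)   # a 30-word sentence
reduction = 1 - hire / brute         # fraction of regions HiRe avoids
```

For a 30-word sentence this gives 465 brute-force regions versus at most 59 tree regions, i.e., roughly 87% fewer; length-capped enumeration narrows but does not close the gap.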

Ablation Study
To show that our model also works on the flat NER task, we conduct ablation experiments on the JNLPBA dataset. We compare our model with a standard flat NER baseline (Lample et al., 2016) and two nested NER methods. Our model achieves 74.0% F1, outperforming these baselines as shown in Table 3.
To analyze the role of the Hierarchical Region Representation (HRR) in HiRe, we compare the performance of HiRe with and without it on ACE-2005. HiRe without HRR instead employs the Average Word Representation (AWR), achieving precision 78.3%, recall 73.7% and F1 75.9%. Compared with HiRe_AWR, HiRe_HRR improves the absolute F1 measure by 0.6%. In all, HRR plays an essential part in HiRe.
The reason is that HRR treats each region as a hierarchical structure composed of two sub-regions, rather than as a flat structure as AWR does. The hierarchical structure puts more emphasis on some words, while the flat structure of AWR treats every word equally. For example, "the minister of the department of education", composed of the two sub-regions "the minister" and "of the department of education", should be labeled PER but may be misclassified as ORG with AWR.

Conclusion
Leading region-based approaches to nested NER face efficiency and effectiveness challenges. We propose a hierarchical region learning framework that generates hierarchical regions and jointly assigns entity category labels to those regions based on their hierarchical representations. Experimental results demonstrate a significant improvement of our proposed framework in terms of efficiency and effectiveness over SoTA baselines. In future work, we will consider how to represent hierarchical regions better.