Improving Named Entity Recognition with Attentive Ensemble of Syntactic Information

Named entity recognition (NER) is highly sensitive to sentential syntactic and semantic properties, since entities may be extracted according to how they are used and placed in the running text. To model such properties, one could rely on existing resources to provide helpful knowledge to the NER task; some existing studies have shown the effectiveness of doing so, yet they are limited in appropriately leveraging the knowledge, such as distinguishing the important pieces for a particular context. In this paper, we improve NER by leveraging different types of syntactic information through an attentive ensemble, which is realized by the proposed key-value memory networks, syntax attention, and gate mechanism for encoding, weighting, and aggregating such syntactic information, respectively. Experimental results on six English and Chinese benchmark datasets suggest the effectiveness of the proposed model and show that it outperforms previous studies on all experimental datasets.


Introduction
Named entity recognition (NER) is one of the most important and fundamental tasks in natural language processing (NLP), which identifies named entities (NEs), such as locations, organizations, person names, etc., in running texts, and thus plays an important role in downstream NLP applications including question answering (Pang et al., 2019), semantic parsing (Dong and Lapata, 2018) and entity linking (Martins et al., 2019), etc.
NER is conventionally treated as a sequence labeling task, with models such as the hidden Markov model (HMM) (Bikel et al., 1997) and the conditional random field (CRF) (McCallum and Li, 2003) applied to it in previous studies. Recently, neural models have played a dominant role in this task and shown promising results (Collobert et al., 2011; Huang et al., 2015; Lample et al., 2016; Strubell et al., 2017; Yadav and Bethard, 2018; Jie and Lu, 2019; Liu et al., 2019d; Baevski et al., 2019), because they are powerful in encoding contextual information and thus drive NER systems to better understand the text and recognize the NEs in it. Although it is straightforward and effective to use neural models for NER, incorporating more useful features into an NER system is still expected to help. Among all such features, syntactic ones, such as part-of-speech (POS) labels, syntactic constituents, and dependency relations, are of high importance to NER because they are effective in identifying the inherent structure in a piece of text and thus guide the system to find appropriate NEs accordingly, as shown in a large body of previous studies (McCallum, 2003; Li et al., 2017; Luo et al., 2018; Dang et al., 2018; Jie and Lu, 2019). Although promising results are obtained, existing models are limited in that they regard extra features as gold references and directly concatenate them with word embeddings. Therefore, such features are not distinguished or treated separately when they are used in those NER models, so the noise in the extra features (e.g., inaccurate POS tagging results) may hurt model performance. As a result, it is still a challenge to find an appropriate way to incorporate external information into neural models for NER. Moreover, in most cases, one would like to incorporate more than one type of extra feature.
Consequently, it is essential to design an effective mechanism to combine and weight those features so as to restrict the influence of noisy information.

Figure 1: The overall architecture of the proposed NER model integrated with the attentive ensemble of different syntactic information. An example input sentence and its output entity labels are given, and the syntactic information for the word "Salt" is illustrated together with its processing through KVMN, syntax attention, and the gate mechanism.
In this paper, we propose a sequence labeling based neural model to enhance NER by incorporating different types of syntactic information, which is done through an attentive ensemble with key-value memory networks (KVMN) (Miller et al., 2016), syntax attention, and a gate mechanism. Particularly, the KVMN is applied to encode the context features and their syntactic information of different types, e.g., POS labels, syntactic constituents, or dependency relations; syntax attention is proposed to weight the different types of such syntactic information; and the gate mechanism controls the contribution of the results from the context encoding and the syntax attention to the NER process. Through the attentive ensemble, important syntactic information is highlighted and emphasized when labeling NEs. In addition, to further improve NER performance, we also try different types of pre-trained word embeddings, which have been demonstrated to be effective in previous studies (Akbik et al., 2018; Jie and Lu, 2019; Liu et al., 2019b). We evaluate our approach on six widely used benchmark datasets from the general domain, half of which are in English and the other half in Chinese. Experimental results on all datasets suggest the effectiveness of our approach in enhancing NER through syntactic information, with state-of-the-art results achieved on all datasets.

The Proposed Model
NER is conventionally regarded as a typical sequence labeling task, where an input sequence $X = x_1, x_2, \cdots, x_i, \cdots, x_n$ with $n$ tokens is annotated with its corresponding NE labels $Y = y_1, y_2, \cdots, y_i, \cdots, y_n$ of the same length. Following this paradigm, we propose a neural NER model, depicted in Figure 1, with an attentive ensemble to incorporate different types of syntactic information, which can be conceptually formalized by

$$\hat{Y} = f(GM(X, SA([M^c(K^c, V^c)]_{c \in C}))) \qquad (1)$$

where $C$ denotes the set of all syntactic information types and $c$ is one of them; $M^c$ is the KVMN for encoding syntactic information of type $c$, with $K^c$ and $V^c$ referring to the keys and values in it, respectively; $SA$ denotes the syntax attention that weights the different types of syntactic information obtained through $M^c$; and $GM$ refers to the gate mechanism that controls how to use the encodings from the context encoder and those from $SA$. In the following text, we first introduce how we extract different types of syntactic information, then illustrate the attentive ensemble of syntactic information through KVMN, syntax attention, and the gate mechanism, and finally elaborate on the encoding and decoding of the input text for NER, as shown in the left part of Figure 1.

Figure 2: The extracted syntactic information in POS labels (a), syntactic constituents (b), and dependency relations (c) for "Salt" in the example sentence, where the associated contextual features and the corresponding instances of syntactic information are highlighted in blue.

Syntactic Information Extraction
A good representation of the input text is key to obtaining good model performance on many NLP tasks (Song et al., 2017; Sileo et al., 2019). Normally, a straightforward way to improve model performance is to enhance the text representation with embeddings of extra features, which has been demonstrated to be useful across tasks (Marcheggiani and Titov, 2017; Song et al., 2018a; Huang and Carley, 2019; Tian et al., 2020c), including NER (Seyler et al., 2018; Sui et al., 2019; Gui et al., 2019b,a; Liu et al., 2019b). Among different types of extra features, syntactic ones have proved helpful for NER in previous studies, where the effectiveness of POS labels, syntactic constituents, and dependency relations is demonstrated by McCallum (2003), Li et al. (2017), and Cetoli et al. (2018), respectively. In this paper, we also focus on these three types of syntactic information. To do so, we obtain the POS labels, the syntax tree, and the dependency parsing results from an off-the-shelf NLP toolkit (e.g., Stanford Parser) for each input sequence $X$. Then, for each token $x_i$ in $X$, we extract its context features and related syntactic information according to the following procedures.
For POS labels, we treat every $x_i$ as the central word and employ a window of ±1 word to extract its context words and their corresponding POS labels. For example, in Figure 2(a), the ±1 word window for "Salt" covers its left and right neighbors, so that the resulting context features are "Salt", "is", and "Lake", and we use the combination of these words and their POS labels as the POS information (i.e., "Salt NNP", "is VBZ", and "Lake NNP") for the NER task.
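The window-based extraction above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `pos_context`, the space-joined "word TAG" format, and the example sentence are assumptions made here, and the context features are returned in sentence order.

```python
def pos_context(tokens, pos_tags, i, window=1):
    """Return (context features, POS information) for the token at index i.

    Context features are the words inside a +/-window span around token i;
    each POS information instance pairs a word with its own POS label.
    """
    lo = max(0, i - window)
    hi = min(len(tokens), i + window + 1)
    keys = tokens[lo:hi]
    values = [f"{tokens[j]} {pos_tags[j]}" for j in range(lo, hi)]
    return keys, values


# Hypothetical sentence and tags, chosen to resemble the "Salt" example.
tokens = ["It", "is", "Salt", "Lake", "City"]
tags = ["PRP", "VBZ", "NNP", "NNP", "NNP"]
keys, values = pos_context(tokens, tags, 2)
```

At sentence boundaries the window is simply truncated, which is one possible choice; padding symbols would be another.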
For syntactic constituents, we start with $x_i$ at the leaf of $X$'s syntax tree, then search up through the tree to find the first acceptable syntactic node 2, and select all tokens under that node as the context features and the combination of tokens and their syntactic nodes as the constituent information. For example, in Figure 2(b), we start from "Salt" and extract its first accepted node "NP", then collect the tokens under "NP" as the context features (i.e., "Salt", "Lake", and "City") and combine them with "NP" to get the constituent information (i.e., "Salt NP", "Lake NP", and "City NP").
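One way to sketch this bottom-up search is shown below. The nested `(label, children)` tuple format, the function names, the example tree, and in particular the set of accepted node labels are all assumptions made for illustration; the paper's own list of acceptable nodes is not reproduced in this text.

```python
def leaves(tree):
    """Collect the words under a (label, children) node, left to right."""
    label, children = tree
    if isinstance(children, str):  # pre-terminal such as ("NNP", "Salt")
        return [children]
    words = []
    for child in children:
        words.extend(leaves(child))
    return words


def constituent_context(tree, index, accepted):
    """Find the lowest accepted ancestor of the leaf at `index`.

    Searching up from the leaf for the first accepted node is the same as
    remembering the last accepted node seen while walking down to the leaf.
    """
    node, offset, best = tree, 0, None
    while not isinstance(node[1], str):
        label, children = node
        if label in accepted:
            best = node
        for child in children:
            width = len(leaves(child))
            if index < offset + width:
                node = child
                break
            offset += width
        else:
            raise IndexError("token index outside the tree")
    if best is None:
        return [], []
    words = leaves(best)
    return words, [f"{w} {best[0]}" for w in words]


# Hypothetical parse and hypothetical accepted labels.
tree = ("S", [("NP", [("PRP", "It")]),
              ("VP", [("VBZ", "is"),
                      ("NP", [("NNP", "Salt"), ("NNP", "Lake"), ("NNP", "City")])])])
keys, values = constituent_context(tree, 2, accepted={"NP", "VP", "PP"})
```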
For dependency relations, we find all context features for each $x_i$ by collecting all its dependents and its governor from $X$'s dependency parse, and then regard the combination of the context features and their in-bound dependency types as the corresponding dependency information. For example, as illustrated in Figure 2(c), for "Salt", its context features are "Salt" and "City" (the governor of "Salt"), and their corresponding dependency information instances are "Salt compound" and "City root". 3 As a result, for each type of syntactic information, we obtain a list of context features and a list of syntactic information instances, which are modeled by a KVMN module to enhance the input text representation and thus improve model performance.
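A small sketch of the dependency-based extraction follows. The head-index encoding (with -1 for the root), the function name, the ordering of the output (token, governor, then dependents), and the example parse are assumptions for illustration only.

```python
def dependency_context(tokens, heads, rels, i):
    """Return (context features, dependency information) for token i.

    heads[j] is the index of token j's governor (-1 for the root) and
    rels[j] is the type of the dependency arc that enters token j.
    """
    indices = [i]
    if heads[i] >= 0:  # the governor, if token i is not the root
        indices.append(heads[i])
    indices.extend(j for j, head in enumerate(heads) if head == i)  # dependents
    keys = [tokens[j] for j in indices]
    values = [f"{tokens[j]} {rels[j]}" for j in indices]
    return keys, values


# Hypothetical parse in which "City" is the root and governs the other words.
tokens = ["It", "is", "Salt", "Lake", "City"]
heads = [4, 4, 4, 4, -1]
rels = ["nsubj", "cop", "compound", "compound", "root"]
keys, values = dependency_context(tokens, heads, rels, 2)
```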

KVMN for Syntactic Information
Since the syntactic information is obtained from off-the-shelf toolkits, the extracted information may contain noise, which may hurt model performance if it is not leveraged appropriately. Inspired by studies that use KVMN and its variants to weight and leverage extra features to enhance model performance in many NLP tasks (Miller et al., 2016; Mino et al., 2017; Xu et al., 2019b; Tian et al., 2020d), for each type of syntactic information (denoted as $c$), we propose a KVMN module ($M^c$) to model the pairwise-organized context features and syntactic information instances. Specifically, for each $x_i$ in the input, we first map its context features and syntactic information to keys and values in the KVMN, which are denoted by $K^c_i = [k^c_{i,1}, \cdots, k^c_{i,j}, \cdots, k^c_{i,m_i}]$ and $V^c_i = [v^c_{i,1}, \cdots, v^c_{i,j}, \cdots, v^c_{i,m_i}]$, respectively, with $m_i$ the number of context features for $x_i$. Next, we use two matrices to map them to their embeddings, with $e^{k_c}_{i,j}$ referring to the embedding of $k^c_{i,j}$ and $e^{v_c}_{i,j}$ to that of $v^c_{i,j}$. Then, for each token $x_i$ and its associated context features $K^c_i$ and syntactic information $V^c_i$, the weight assigned to the syntactic information $v^c_{i,j}$ is computed by

$$p^c_{i,j} = \frac{\exp(h_i \cdot e^{k_c}_{i,j})}{\sum_{j'=1}^{m_i} \exp(h_i \cdot e^{k_c}_{i,j'})} \qquad (2)$$

where $h_i$ is the hidden vector for $x_i$ obtained from the context encoder. Afterwards, we apply the weights $p^c_{i,j}$ to their corresponding syntactic information $v^c_{i,j}$ by

$$s^c_i = \sum_{j=1}^{m_i} p^c_{i,j} \, e^{v_c}_{i,j} \qquad (3)$$

where $s^c_i$ is the output of $M^c$, containing the weighted syntactic information of type $c$. Therefore, KVMN ensures that the syntactic information instances are weighted according to their corresponding context features, so that important information can be distinguished and leveraged accordingly.
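To make the memory read concrete, here is a minimal pure-Python sketch for a single token and a single syntactic type. It assumes the weights are a softmax over dot products between the hidden vector and the key embeddings, which is the usual key-value memory formulation; the function names and the toy vectors are illustrative and not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kvmn_read(h, key_embs, value_embs):
    """Weight value embeddings by how well their keys match h.

    h          : hidden vector of the token (list of floats)
    key_embs   : one embedding per context feature (same size as h)
    value_embs : one embedding per syntactic information instance
    Returns the weights p and the weighted sum s of the value embeddings.
    """
    scores = [sum(a * b for a, b in zip(h, key)) for key in key_embs]
    p = softmax(scores)
    dim = len(value_embs[0])
    s = [sum(p[j] * value_embs[j][d] for j in range(len(p))) for d in range(dim)]
    return p, s


# Toy example: the first key is aligned with h, so its value dominates.
p, s = kvmn_read([1.0, 0.0], [[3.0, 0.0], [0.0, 3.0]], [[1.0, 0.0], [0.0, 1.0]])
```

Because the weights depend on the keys only, a noisy value attached to a poorly matching context feature contributes little to the output.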

The Syntax Attention
Upon encoding each type of syntactic information by KVMN, one can assemble the different types into an overall representation. The most straightforward way of doing so is to concatenate the encodings of all types by

$$s_i = \bigoplus_{c \in C} s^c_i \qquad (4)$$

where $s_i$ is the aggregated result of the $s^c_i$, the encoding of each syntactic type from $M^c$. However, given that different types of syntactic information may conflict with each other, a more effective way to combine them is desirable.
Motivated by studies that selectively leverage different features by assigning different weights to them (Kumar et al., 2018; Higashiyama et al., 2019; Tian et al., 2020a,b), we propose a syntax attention for the ensemble of syntactic information. Particularly, for each syntactic type $c$, we first concatenate $s^c_i$ with $h_i$ and use the resulting vector to compute the weight $q^c_i$ for $s^c_i$:

$$q^c_i = \sigma(W^c_q \cdot (s^c_i \oplus h_i) + b^c_q) \qquad (5)$$

where $W^c_q$ and $b^c_q$ are a trainable vector and a trainable scalar, respectively, and $\sigma$ is the sigmoid function. Then, a softmax function is applied over all types of syntactic information to compute their corresponding attentions $a^c_i$, which is formalized by

$$a^c_i = \frac{\exp(q^c_i)}{\sum_{c' \in C} \exp(q^{c'}_i)} \qquad (6)$$

Finally, we apply the weights to their corresponding encoded syntactic information vectors by

$$s_i = \sum_{c \in C} a^c_i \, s^c_i \qquad (7)$$

where $s_i$ is the output of the syntax attention over the different syntactic information types.
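The three steps of the syntax attention (a sigmoid score per type, a softmax across types, and a weighted sum) can be sketched as below. This is an illustrative reading of the description rather than the authors' implementation; the function names, the dictionary-based interface, and the toy values are assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def syntax_attention(h, encodings, weights, biases):
    """Combine per-type syntactic encodings with attention.

    h         : hidden vector of the token
    encodings : {type: KVMN output vector for that type}
    weights   : {type: trainable vector of size len(encoding) + len(h)}
    biases    : {type: trainable scalar}
    Returns the attention per type and the attended vector.
    """
    types = list(encodings)
    # Score each type from the concatenation of its encoding with h.
    q = [sigmoid(sum(w * x for w, x in zip(weights[c], encodings[c] + h)) + biases[c])
         for c in types]
    a = softmax(q)
    dim = len(encodings[types[0]])
    s = [sum(a[t] * encodings[c][d] for t, c in enumerate(types)) for d in range(dim)]
    return dict(zip(types, a)), s


# Toy example with two types; the weight vector favors the "pos" encoding.
attention, s = syntax_attention(
    h=[0.0, 0.0],
    encodings={"pos": [2.0, 0.0], "dep": [0.0, 2.0]},
    weights={"pos": [1.0, 0.0, 0.0, 0.0], "dep": [0.0, 0.0, 0.0, 0.0]},
    biases={"pos": 0.0, "dep": 0.0},
)
```

Unlike direct concatenation, the output keeps the dimension of a single encoding, and a type that conflicts with the context can be down-weighted.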

The Gate Mechanism
To enhance NER with the syntactic information encoded by KVMN and combined by syntax attention, we propose a gate mechanism (GM) to incorporate it into the backbone NER model, where we expect this mechanism to dynamically weight and decide how to leverage the syntactic information when labeling NEs. In detail, we propose a reset function $r_i$ to evaluate the encodings from the context encoder and the syntax attention by

$$r_i = \sigma(W_{r_1} \cdot h_i + W_{r_2} \cdot s_i + b_r) \qquad (8)$$

where $W_{r_1}$ and $W_{r_2}$ are trainable matrices and $b_r$ is the bias term, and use

$$o_i = [r_i \circ h_i] \oplus [(\mathbf{1} - r_i) \circ s_i] \qquad (9)$$

to control their contribution, where $o_i$ is the output of the gate mechanism corresponding to input $x_i$, $\mathbf{1}$ is a 1-vector with its dimension matching $h_i$, and $\circ$ is the element-wise multiplication operation.
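A minimal sketch of such a gate is given below. It assumes that the gated context part and the gated syntax part are concatenated to form the output (summing them would be another plausible reading) and that the hidden vector and the syntax vector share the same dimension; the names and toy values are illustrative only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate(h, s, w_h, w_s, b):
    """Balance the context encoding h against the syntax encoding s.

    w_h, w_s : square matrices (lists of rows) applied to h and s
    b        : bias vector
    Returns the gate values r and the gated output o.
    """
    dim = len(h)
    r = [sigmoid(sum(w_h[d][k] * h[k] for k in range(dim))
                 + sum(w_s[d][k] * s[k] for k in range(dim))
                 + b[d])
         for d in range(dim)]
    kept_context = [r[d] * h[d] for d in range(dim)]
    kept_syntax = [(1.0 - r[d]) * s[d] for d in range(dim)]
    # Assumption: the two gated parts are concatenated.
    return r, kept_context + kept_syntax


# With all-zero parameters the gate is 0.5 everywhere, so both sources
# contribute equally.
zeros = [[0.0, 0.0], [0.0, 0.0]]
r, o = gate([2.0, 4.0], [6.0, 8.0], zeros, zeros, [0.0, 0.0])
```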

Encoding and Decoding for NER
To provide $h_i$ to KVMN, we adopt the Adapted-Transformer 4 as the context encoder in this work, so that the encoding of the input text can be formalized as

$$H = \text{Adapted-Transformer}(E) \qquad (10)$$

where $H = [h_1, h_2, \cdots, h_i, \cdots, h_n]$ and $E = [e_1, e_2, \cdots, e_i, \cdots, e_n]$ are the lists of hidden vectors and embeddings of $X$, respectively. Note that, since pre-trained embeddings contain context information learned from large-scale corpora, and different types of them may carry heterogeneous context information learned from different algorithms and corpora, we incorporate multiple pre-trained embeddings by directly concatenating them in the input:

$$e_i = \bigoplus_{z \in Z} e^z_i \qquad (11)$$

where $e_i$ is the final word representation fed to the context encoder, $e^z_i$ represents the word embedding of $x_i$ in embedding type $z$, and $Z$ is the set of all embedding types.
For the output, upon receiving $o_i$, a trainable matrix $W_o$ is used to align its dimension to the output space by $u_i = W_o \cdot o_i$. Finally, we apply a conditional random field (CRF) decoder to predict the labels $\hat{y}_i \in T$ (where $T$ is the set of all NE labels) in the output sequence $\hat{Y}$ by

$$\hat{y}_i = \arg\max_{y_i \in T} \frac{\exp(W_c \cdot u_i + b_c)}{\sum_{y_{i-1} y_i} \exp(W_c \cdot u_i + b_c)} \qquad (12)$$

where $W_c$ and $b_c$ are trainable parameters that model the transition from $y_{i-1}$ to $y_i$.
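For intuition about the decoding step, the sketch below shows Viterbi decoding, the standard way to find the best label sequence under a linear-chain CRF. The text does not spell out the decoding procedure, so this is an assumption; the score layout (one emission score per token and label, one transition score per label pair) and the toy numbers are also illustrative.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence.

    emissions[i][k]   : score of label k at position i
    transitions[p][k] : score of moving from label p to label k
    """
    n_labels = len(emissions[0])
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for cur in range(n_labels):
            # Best previous label for ending in `cur` at this position.
            best_prev = max(range(n_labels),
                            key=lambda prev: score[prev] + transitions[prev][cur])
            new_score.append(score[best_prev] + transitions[best_prev][cur] + emit[cur])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(range(n_labels), key=lambda k: score[k])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


emissions = [[3.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
free = [[0.0, 0.0], [0.0, 0.0]]        # no transition preference
sticky = [[0.0, -5.0], [-5.0, 0.0]]    # switching labels is penalized
```

With `free` transitions each position takes its locally best label, while `sticky` transitions keep the sequence on the first label because switching costs more than it gains. This is how transition scores let the decoder rule out unlikely label sequences.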

Datasets
In our experiments, we use three English benchmark datasets, i.e., OntoNotes 5.0 (ON5e) (Pradhan et al., 2013), WNUT-16 (WN16), and WNUT-17 (WN17) (Derczynski et al., 2017), and three Chinese datasets, i.e., OntoNotes 4.0 (ON4c) (Weischedel et al., 2011), Resume (RE), and Weibo (WE) (Peng and Dredze, 2015). 5 These datasets come from a wide range of sources, so that we are able to evaluate our approach comprehensively with them. In detail, WN16 and WN17 are constructed from social media; ON5e consists of mixed sources, such as telephone conversations, newswire, etc.; ON4c is from the news domain; RE and WE are extracted from Chinese online resources. For all datasets, we use their original splits; the statistics with respect to the number of entity types (# T.), sentences (# S.), and total entities (# E.) in the train/dev/test sets are reported in Table 1.

4 [...] to be useful for NER compared to the vanilla Transformer.

Implementation
To label NEs, we use the BIOES tagging scheme instead of the standard BIO scheme, because previous studies have shown promising improvements with this scheme (Lample et al., 2016). For the text input, we use three types of embeddings for each language by default. Specifically, for English, we use GloVe (100-dimensional) 6 (Pennington et al., 2014), ELMo (Peters et al., 2018), and BERT-cased large 7 (Devlin et al., 2019) (the derived embeddings for each word); for Chinese, we use pre-trained character and bi-gram embeddings 8 (denoted as Giga), Tencent Embedding 9 (Song et al., 2018b), and ZEN 10 (Diao et al., 2019). For both BERT and ZEN, we follow their default settings, i.e., 24 layers of self-attention with 1024-dimensional embeddings for BERT-large and 12 layers of self-attention with 768-dimensional embeddings for ZEN-base. For syntactic information, we use the Stanford CoreNLP Toolkit 11 to produce the aforementioned three types of syntactic information, i.e., POS labels, syntactic constituents, and dependency relations, for each input text. In the context encoding layer, we use a two-layer Adapted-Transformer encoder 12 with 128 hidden units and 12 heads and set the dropout rate to 0.2. For the memory module, all key and value embeddings are initialized randomly. During training, we fix all pre-trained embeddings and use Adam (Kingma and Ba, 2015) to optimize the negative log-likelihood loss function, with the learning rate set to η = 0.0001, β1 = 0.9, and β2 = 0.99. In all experiments, we run a maximum of 100 epochs with a batch size of 32 and tune the hyper-parameters on the development set. 13 The model that achieves the highest performance on the development set is evaluated on the test set with respect to the F1 scores obtained from the official conlleval toolkit 14.

5 Among these datasets, ON5e and ON4c are multi-lingual datasets. Following previous work, we extract the corresponding English and Chinese parts from them.
6 We download the Glove.6B embedding from https://nlp.stanford.edu/projects/glove/.
7 We obtain the pre-trained BERT from https://github.com/google-research/bert.
8 We obtain the embeddings from https://github.com/jiesutd/LatticeLSTM.
9 We use the official release from https://ai.tencent.com/ailab/nlp/embedding.html.
10 We use the pre-trained ZEN-base downloaded from https://github.com/sinovation/ZEN. Note that we do not use the Chinese BERT since ZEN performs better across the three Chinese datasets. For reference, we report the results of using BERT in Appendix A.
11 We use its 3.9.2 version downloaded from https://stanfordnlp.github.io/CoreNLP/.
12 We also try other encoders (i.e., Bi-LSTM and Transformer) and report their results in Appendix B for reference.
13 We report the hyper-parameter settings of different models as well as the best one in Appendix C.

Table 2: F1 scores of the baseline model and ours enhanced with different types of syntactic information ("POS.", "CON." and "DEP." refer to POS labels, syntactic constituents and dependency relations, respectively).

Table 3: F1 scores of our models with different combinations of syntactic information. "TYPE" indicates how they are combined, where "DC" and "SA" refer to direct concatenation and syntax attention, respectively.
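As a brief illustration of the BIOES scheme mentioned in this section, the sketch below converts BIO tags to BIOES tags: single-token entities receive an S- prefix and the last token of a longer entity receives an E- prefix. The function name and the example labels are chosen here for illustration and do not come from the paper.

```python
def bio_to_bioes(tags):
    """Convert a BIO tag sequence to BIOES."""
    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == f"I-{label}"  # does the entity go on?
        if prefix == "B":
            converted.append(("B-" if continues else "S-") + label)
        else:  # prefix == "I"
            converted.append(("I-" if continues else "E-") + label)
    return converted


bioes = bio_to_bioes(["B-GPE", "I-GPE", "I-GPE", "O", "B-PERSON"])
```

The explicit end and single-token markers give the decoder more precise boundary signals than BIO, which is a common motivation for choosing this scheme.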

Effect of Key-Value Memory Networks
To explore how different syntactic information helps NER, we run the baselines without syntactic information and the models with each type of syntactic information through KVMN. 15 Experimental results (F1) are reported in Table 2 for all datasets.
The results show that the models with syntactic information outperform the baseline in all cases, which demonstrates the effectiveness of using KVMN in our model. In addition, the best-performing model is not the same across different datasets, which indicates that the contributions of different types of syntactic information vary across those datasets. For example, on most datasets, models using syntactic constituents achieve the best results, which can be explained by the fact that syntactic constituents provide important cues about NE chunks. In comparison, POS labels are the most effective syntactic information for the WE dataset, which can be attributed to the nature of the dataset: most sentences in social media are not formally written, so their parsing results could be inaccurate and mislead the NER process.

14 https://www.clips.uantwerpen.be/conll2000/chunking/conlleval.txt.
15 Syntax attention and the gate mechanism are not applied.

Table 5: Comparison of F1 scores of our best performing model (i.e., the full model with attentive ensemble of all syntactic information) with those reported in previous studies on all English and Chinese benchmark datasets. "*" indicates the studies using BERT as the text encoder; "†" means the results are our runs of their models.

Effect of Syntax Attention
To examine the effectiveness of syntax attention (SA), we compare it with another strategy, namely direct concatenation (DC) of the KVMN outputs. The results of applying all combinations of different syntactic information through DC and SA are reported in Table 3.
There are several observations. First, interestingly, compared to the results in Table 2, direct concatenation of different syntactic information hurts NER performance in most cases. For example, on the RE dataset, the ensemble of all types of syntactic information through DC obtains the worst results compared to all other results that integrate less information under the same setting. The reason behind this phenomenon could be that different types of syntactic information may provide conflicting cues about NE tags and thus result in inferior performance. Second, in contrast, SA is able to improve NER when integrating multiple types of syntactic information, where consistent improvements are observed on all datasets when more types of syntactic information are incorporated. As a result, the best results are achieved by the model using all types of syntactic information. This observation suggests that the syntax attention is able to weight different syntactic information and distinguish the important pieces from the others, thus alleviating possible conflicts among them when labeling entities.

Effect of the Gate Mechanism
We run our model under its best setting (i.e., SA over all combinations of syntactic information with KVMN) with and without the gate mechanism to investigate its effectiveness in actively controlling the information flow from the context encoder and SA. The results are presented in Table 4, where the ones without the gate mechanism are taken directly from Table 3. 16 In all cases, the model with the gate mechanism achieves better performance than the one without it. These results suggest that the importance of the information from the context encoder and SA varies, so that the proposed gate mechanism is effective in balancing them.

Table 6: Experimental results (F1 scores) of our best performing model (i.e., the full model with attentive ensemble of all syntactic information) using different pre-trained embeddings and their combinations as input.

Comparison with Previous Studies
To further illustrate the effectiveness of our models, we compare the best performing one, i.e., the last line in Table 4, with the results from previous studies. The results are shown in Table 5, where our approach outperforms previous models with a BERT encoder (marked by "*") and achieves state-of-the-art results on all English and Chinese datasets. This observation indicates that incorporating different embeddings as input is more effective than directly using pre-trained models. In addition, compared to some previous studies (Luo et al., 2018; Dang et al., 2018) 17 that leverage multiple types of syntactic information by regarding the information as gold references and directly concatenating their embeddings with word embeddings, our approach is superior in that it uses an attentive ensemble through KVMN, syntax attention, and the gate mechanism to selectively learn from different syntactic information according to their contribution to NER, where this multi-phase strategy helps ensure that the information is learned appropriately.

16 The results of those models on the development sets of all datasets are reported in Appendix D.
17 Luo et al. (2018) and Dang et al. (2018) do not report their results on all general domain benchmark datasets, because the focus of their studies is biomedical NER. Therefore, we report our runs of their method in Table 5 (marked by "†").

Effect of Different Word Embeddings
Neural models are sensitive to input embeddings, which is also true for our approach. Considering that different types of embeddings carry contextual information learned from various corpora and algorithms, we explore the effect of those embeddings when they are used separately or combined as the input. The experiment is performed on our best model (i.e., KVMN+SA+GM on all syntactic information), with the results reported in Table 6. For all English and Chinese datasets, the model with all three embeddings achieves the best performance, and its performance drops consistently as more types of embeddings are excluded. This confirms that different types of embeddings do provide complementary context information to enhance the understanding of input texts for NER. Particularly, although contextualized embeddings (i.e., ELMo, BERT, and ZEN) show significantly better performance than the others (especially on Chinese), combining them with static embeddings still provides further improvement in the F1 score of NER systems.

Case Study
To better understand how the attentive ensemble of syntactic information helps NER, we conduct a case study for the word "Bill" in the example sentence "Mason was one of the drafters of the Bill of Rights" from the ON5e dataset. Figure 3 visualizes the weights for different context features in KVMN, as well as the weights from SA and GM, where darker colors refer to higher weights. 18 Interestingly, in this case, the POS labels and dependency relations that receive the highest weights both suggest a misleading "PERSON" label 19 because of their context features, so that an incorrect NER prediction would be expected if the three types of syntactic information were treated equally. However, the syntactic constituents give a strong indication of the correct label through the word "Rights" for a "LAW" entity. Later, the syntax attention ensures that the constituent information is emphasized, and the gate mechanism also tends to use syntax for this input with higher weights. Therefore, this case clearly illustrates the contribution of each component in our attentive ensemble of syntactic information.

18 For the weights, we visualize $p^c_{i,j}$ in Eq. (2)

Figure 3: An illustration of how our model encodes syntactic information through KVMN, weights them by syntax attention (SA), and learns from the gate mechanism (GM), where the weights for different features and information types are visualized. The example sentence is shown at the top with the gold NE tags for each word marked below. The weights assigned to different syntactic information for "Bill" in KVMN, SA, and GM are highlighted with colors, where darker colors refer to higher values.
exploited POS labels and syntactic constituents in their methods and found that the combination of them improves NER performance. Yet they are limited in that they regard such syntactic information as gold references and directly concatenate it with the input embeddings, so that noise is expected to affect NER accordingly. Compared with them, our model provides an alternative option to leverage syntactic information, using an attentive ensemble to encode, weight, and select it to help NER, which proves effective and has the potential to be applied to other similar tasks.

Conclusion
In this paper, we proposed a neural model following the sequence labeling paradigm to enhance NER through an attentive ensemble of syntactic information. Particularly, the attentive ensemble consists of three components applied in sequence: each type of syntactic information is encoded by key-value memory networks, the different information types are then weighted by syntax attention, and the gate mechanism is finally applied to control the contribution of the syntax attention outputs to NER for different contexts. In doing so, different types of syntactic information are comprehensively and selectively learned to enhance NER, and the experimental results on six benchmark datasets in English and Chinese confirm the validity and effectiveness of our model, with state-of-the-art performance obtained.

"GM" is the gate mechanism; "P.", "C." and "D." refer to POS labels, syntactic constituents and dependency relations, respectively.
In Table 10, we report the experimental results (F1) of our models (i.e., with all types of embeddings and KVMN) under different configurations (using syntax attention on different combinations of syntactic information, with or without the gate mechanism) on the development sets of all datasets.