Better Feature Integration for Named Entity Recognition

It has been shown that named entity recognition (NER) could benefit from incorporating the long-distance structured information captured by dependency trees. We believe this is because both types of features - the contextual information captured by the linear sequences and the structured information captured by the dependency trees may complement each other. However, existing approaches largely focused on stacking the LSTM and graph neural networks such as graph convolutional networks (GCNs) for building improved NER models, where the exact interaction mechanism between the two types of features is not very clear, and the performance gain does not appear to be significant. In this work, we propose a simple and robust solution to incorporate both types of features with our Synergized-LSTM (Syn-LSTM), which clearly captures how the two types of features interact. We conduct extensive experiments on several standard datasets across four languages. The results demonstrate that the proposed model achieves better performance than previous approaches while requiring fewer parameters. Our further analysis demonstrates that our model can capture longer dependencies compared with strong baselines.


Introduction
Named entity recognition (NER) is one of the most fundamental and important tasks in natural language processing (NLP). While the literature (Peters et al., 2018;Akbik et al., 2018;Devlin et al., 2019) largely focuses on training deep language models to improve the contextualized word representations, previous studies show that the structured information such as interactions between non-adjacent words can also be important for NER (Finkel et al., 2005;Jie et al., 2017;Aguilar and Solorio, 2019).
However, sequence models such as bidirectional LSTM (Hochreiter and Schmidhuber, 1997) are not able to fully capture the long-range dependencies (Bengio, 2009). For instance, Figure 1 (top) shows one type of structured information in NER. The words "Precision Castparts Corp." can be easily inferred as ORGANIZATION by its context (i.e., Corp.). However, the second entity "PCP" could be misclassified as a PRODUCT entity if a model relies more on the context "begin trading with" but ignores the hidden information that "PCP" is the symbol of "Precision Castparts Corp.".
Previous research works (Li et al., 2017; Jie and Lu, 2019; Wang et al., 2019) have used parse trees (Chomsky, 1956, 1969; Sandra and Taft, 2014) to incorporate such structured information. Figure 1 (Dependency Path) shows that the first entity can be connected to the second entity following the dependency tree with 5 hops. Incorporating the dependency information can be done with graph neural networks (GNNs) such as graph convolutional networks (GCNs) (Kipf and Welling, 2017). However, simply stacking the LSTM and GCN architectures for NER provides only modest improvements; sometimes, it decreases performance (Jie and Lu, 2019). Based on the dependency path in Figure 1, a 5-layer GCN is required to capture the connections between these two entities. However, deep GCN architectures often face training difficulties, which cause a performance drop (Hamilton et al., 2017b; Kipf and Welling, 2017). Directly stacking GCN and LSTM has difficulties in modeling the interaction between dependency trees and contextual information.
To address the above limitations, we propose the Synergized-LSTM (Syn-LSTM), a new recurrent neural network architecture that considers an additional graph-encoded representation to update the memory and hidden states, as shown in Figure  2. More specifically, the graph-encoded representation for each word can be obtained with GCNs. Our proposed Syn-LSTM allows the cell to receive the structured information from the graph-encoded representation. With the newly designed gating mechanism, our model is able to make independent assessments on the amounts of information to be retrieved from the word representation and the graph-encoded representation respectively. Such a mechanism allows for better integration of both contextual and structured information.
Our contributions can be summarized as:
• We propose a simple and robust Syn-LSTM model to better incorporate the structured information conveyed by dependency trees. The output of the Syn-LSTM cell is jointly determined by both contextual and structured information. We adopt the classic conditional random fields (CRF) (Lafferty et al., 2001) on top of the Syn-LSTM for NER.
• We conduct extensive experiments on several standard datasets across four languages. The proposed model significantly outperforms previous approaches on these datasets.
• We show that the proposed model can capture long-distance interactions between entities. Our further analysis statistically demonstrates the proposed gating mechanism is able to aggregate the structured information selectively.

Incorporating Structured Information
To incorporate the long-range dependencies, we consider an additional graph-encoded representation $g_t$ (Figure 2) as the model input to integrate contextual and structured information. The graph-encoded representation $g_t$ can be derived from Graph Neural Networks (GNNs) such as GCN (Kipf and Welling, 2017), which are capable of bringing in structured information through graph structure (Hamilton et al., 2017a).
However, structured information is sometimes hard to encode, as the example in Figure 1 shows. One naive approach is to use a deep GNN to capture such information along multiple dependency arcs between two words, which could muddle the information and lead to training difficulties. A straightforward solution is to integrate both structured and contextual information via LSTM. As shown in Figure 1 (Hybrid Paths), the structured information can be passed to neighbors or context, which allows a model to use fewer GNN layers and alleviates such issues for long-range dependencies. The input to the LSTM can simply be the concatenation of the word representation $x_t$ and $g_t$ at each position (Jie and Lu, 2019). However, because such an approach requires both $x_t$ and $g_t$ to decide the value of the input gate jointly, it could be a potential victim of two sources of uncertainty: 1) the uncertainty of the quality of the graph-encoded representation $g_t$, and 2) the uncertainty of the exact interaction mechanism between the two types of features. These may lead to sub-optimal performance, especially if the graph-encoded representation $g_t$ is unsatisfactory. Thus, we need to design a new approach to incorporate both types of information from $x_t$ and $g_t$ with a more explicit interaction mechanism, with which we hope to alleviate the above issues.

Syn-LSTM Cell
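We propose the Synergized-LSTM (Syn-LSTM) to better integrate the contextual and structured information to address the above limitations. The inputs of the Syn-LSTM cell include previous cell state $c_{t-1}$, previous hidden state $h_{t-1}$, current cell input $x_t$, and an additional graph-encoded representation $g_t$. The outputs of the Syn-LSTM cell include current cell state $c_t$ and current hidden state $h_t$. Within the cell, there are four gates: input gate $i_t$, forget gate $f_t$, output gate $o_t$, and an additional new gate $m_t$ to control the flow of information. Note that the forget gate $f_t$ and output gate $o_t$ are not just looking at $h_{t-1}$ and $x_t$; they are also affected by the graph-encoded representation $g_t$. The cell state $c_t$ and hidden state $h_t$ are computed as follows:

$$
\begin{aligned}
f_t &= \sigma\big(W^{(f)} x_t + U^{(f)} h_{t-1} + Q^{(f)} g_t + b^{(f)}\big) && (1)\\
o_t &= \sigma\big(W^{(o)} x_t + U^{(o)} h_{t-1} + Q^{(o)} g_t + b^{(o)}\big) && (2)\\
i_t &= \sigma\big(W^{(i)} x_t + U^{(i)} h_{t-1} + b^{(i)}\big) && (3)\\
m_t &= \sigma\big(W^{(m)} g_t + U^{(m)} h_{t-1} + b^{(m)}\big) && (4)\\
\tilde{c}_t &= \tanh\big(W^{(u)} x_t + U^{(u)} h_{t-1} + b^{(u)}\big) && (5)\\
\tilde{s}_t &= \tanh\big(W^{(n)} g_t + U^{(n)} h_{t-1} + b^{(n)}\big) && (6)\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t + m_t \odot \tilde{s}_t && (7)\\
h_t &= o_t \odot \tanh(c_t) && (8)
\end{aligned}
$$

where $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication, and $W^{(\cdot)}$, $U^{(\cdot)}$, $Q^{(\cdot)}$ and $b^{(\cdot)}$ are learnable parameters. The additional new gate $m_t$ is used to control the information from the graph-encoded representation directly. Such a design allows the original input gate $i_t$ and our new gate $m_t$ to make independent assessments on the amounts of information to be retrieved from the word representation $x_t$ and the graph-encoded representation $g_t$ respectively. In addition, we have a separate candidate state $\tilde{s}_t$ to represent the cell state that corresponds to the graph-encoded representation.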
With the proposed Syn-LSTM, the structured information captured by the dependency trees can be passed to each cell, and the additional gate m t is able to control how much structured information can be incorporated. The additional gate enables the model to feed the contextual and structured information into the LSTM cell separately. Such a mechanism allows our model to aggregate the information from linear sequence and dependency trees selectively.
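The cell described above can be sketched in a few lines of plain Python. This is only an illustrative reading of the description, not the authors' implementation: the forget and output gates see $x_t$, $h_{t-1}$ and $g_t$, the input gate and the contextual candidate see $x_t$ and $h_{t-1}$, and the gate $m_t$ and the structured candidate see $g_t$ and $h_{t-1}$. The parameter keys (`"u"`, `"n"`, etc.), the uniform initialization range, and the helper names are assumptions made for the sketch; a real model would use a tensor library rather than list arithmetic.

```python
import math
import random

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def add(*vecs):
    """Element-wise sum of any number of equal-length vectors."""
    return [sum(vals) for vals in zip(*vecs)]

def hadamard(a, b):
    """Element-wise product."""
    return [x * y for x, y in zip(a, b)]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def tanh(v):
    return [math.tanh(a) for a in v]

def init_params(x_dim, g_dim, h_dim, seed=0):
    """Hypothetical initializer; names and ranges are illustrative only."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    p = {}
    for name in ("f", "o"):   # gates that read x_t, h_{t-1} and g_t
        p[name] = {"W": mat(h_dim, x_dim), "U": mat(h_dim, h_dim),
                   "Q": mat(h_dim, g_dim), "b": [0.0] * h_dim}
    for name in ("i", "u"):   # input gate / contextual candidate: x_t, h_{t-1}
        p[name] = {"W": mat(h_dim, x_dim), "U": mat(h_dim, h_dim), "b": [0.0] * h_dim}
    for name in ("m", "n"):   # graph gate / structured candidate: g_t, h_{t-1}
        p[name] = {"W": mat(h_dim, g_dim), "U": mat(h_dim, h_dim), "b": [0.0] * h_dim}
    return p

def syn_lstm_cell(x_t, g_t, h_prev, c_prev, p):
    """One Syn-LSTM step; returns (h_t, c_t, m_t)."""
    def affine(name, inp):
        return add(matvec(p[name]["W"], inp), matvec(p[name]["U"], h_prev), p[name]["b"])
    def affine_with_graph(name):
        return add(affine(name, x_t), matvec(p[name]["Q"], g_t))
    f_t = sigmoid(affine_with_graph("f"))
    o_t = sigmoid(affine_with_graph("o"))
    i_t = sigmoid(affine("i", x_t))      # how much contextual information to admit
    m_t = sigmoid(affine("m", g_t))      # how much structured information to admit
    c_tilde = tanh(affine("u", x_t))     # contextual candidate state
    s_tilde = tanh(affine("n", g_t))     # structured candidate state
    c_t = add(hadamard(f_t, c_prev), hadamard(i_t, c_tilde), hadamard(m_t, s_tilde))
    h_t = hadamard(o_t, tanh(c_t))
    return h_t, c_t, m_t

params = init_params(x_dim=4, g_dim=3, h_dim=2)
h, c, m = syn_lstm_cell([0.1] * 4, [0.2] * 3, [0.0] * 2, [0.0] * 2, params)
```

Returning $m_t$ alongside the states is a convenience of this sketch; it makes it easy to inspect how far the graph gate opens for each word.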
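Similar to the previous work (Levy et al., 2018), it is also possible to show that the cell state $c_t$ implicitly computes the element-wise weighted sum of the previous states by expanding Equation 7 (taking the initial cell state to be zero):

$$
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t + m_t \odot \tilde{s}_t
    = \sum_{j=1}^{t} \big( a^t_j \odot \tilde{c}_j + q^t_j \odot \tilde{s}_j \big),
\qquad
a^t_j = i_j \odot \prod_{k=j+1}^{t} f_k,
\quad
q^t_j = m_j \odot \prod_{k=j+1}^{t} f_k
$$

Note that the two terms, $a^t_j$ and $q^t_j$, are products of gates. The values of the two terms are in the range from 0 to 1. Since $\tilde{c}_t$ and $\tilde{s}_t$ represent contextual and structured features, the corresponding weights control the flow of information.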
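The weighted-sum view can be checked numerically. The sketch below assumes the cell update has the form $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t + m_t \odot \tilde{s}_t$ with a zero initial state, uses scalar states for brevity (the element-wise case is the same per dimension), and draws random gate values; the function names are hypothetical.

```python
import random

def recurrent_cell_states(f, i, m, c_tilde, s_tilde):
    """Run the recurrence step by step, starting from c = 0."""
    c, states = 0.0, []
    for t in range(len(f)):
        c = f[t] * c + i[t] * c_tilde[t] + m[t] * s_tilde[t]
        states.append(c)
    return states

def expanded_cell_state(t, f, i, m, c_tilde, s_tilde):
    """Compute c_t as a weighted sum over all candidates up to position t."""
    total = 0.0
    for j in range(t + 1):
        forget_prod = 1.0
        for k in range(j + 1, t + 1):
            forget_prod *= f[k]          # product of forget gates after position j
        a = i[j] * forget_prod           # weight on the contextual candidate
        q = m[j] * forget_prod           # weight on the structured candidate
        total += a * c_tilde[j] + q * s_tilde[j]
    return total

rng = random.Random(0)
n = 6
f = [rng.random() for _ in range(n)]
i = [rng.random() for _ in range(n)]
m = [rng.random() for _ in range(n)]
c_tilde = [rng.uniform(-1, 1) for _ in range(n)]
s_tilde = [rng.uniform(-1, 1) for _ in range(n)]

stepwise = recurrent_cell_states(f, i, m, c_tilde, s_tilde)
expanded = [expanded_cell_state(t, f, i, m, c_tilde, s_tilde) for t in range(n)]
max_gap = max(abs(x - y) for x, y in zip(stepwise, expanded))
```

Because every weight is a product of sigmoid outputs, each one lies between 0 and 1, which is what lets the gates be read as soft selectors over contextual and structured candidates.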

Syn-LSTM-CRF
The goal of named entity recognition is to predict the label sequence y = {y 1 , y 2 , ..., y n } given the input sequence w = {w 1 , w 2 , ..., w n }, where w t represents the t-th word and n is the number of words. Our model is mainly constructed with three layers: input representation layer, bi-directional Syn-LSTM layer, and CRF layer. The architecture of our Syn-LSTM-CRF is shown in Figure 3.
Input Representation Layer Similar to the work by Lample et al. (2016), our input representation also includes the character embeddings, which are the hidden states of character-based BiLSTM. Jie and Lu (2019) highlight that the dependency relation helps to enhance the input representation. Furthermore, previous methods use embeddings of part-of-speech (POS) tags as additional input representation. The input representation $x_t$ of our model is the concatenation of the word embedding $v_t$, the character representation $e_t$, the dependency relation embedding $r_t$, and the POS embedding $p_t$:

$$x_t = [v_t; e_t; r_t; p_t]$$

where both $r_t$ and $p_t$ embeddings are randomly initialized and are fine-tuned during training. For experiments with the contextualized representations (e.g., BERT (Devlin et al., 2019)), we further concatenate the contextual word representation to $x_t$.

Graph Convolutional Network For our task, we employ the graph convolutional network (Kipf and Welling, 2017; Zhang et al., 2018b) to get the graph-encoded representation $g_t$. Given a graph, an adjacency matrix $A$ of size $n \times n$ is able to represent the graph structure, where $n$ is the number of nodes; $A_{i,j} = 1$ indicates that node $i$ and node $j$ are connected. We transform the dependency tree into its corresponding adjacency matrix³ $A$, and $A_{i,j} = 1$ denotes that node $i$ and node $j$ have a dependency relation. Note that the purpose of the graph-encoded representation $g_t$ is to incorporate the dependency information from neighbor nodes. The input and output representations of the $l$-th layer GCN at the $t$-th position are denoted as $g^{l-1}_t$ and $g^l_t$ respectively. Similar to the work by Zhang et al. (2018b), we use $d_t = \sum_{j=1}^{n} A_{t,j}$, which is the total number of neighbors of node $t$, to normalize the representation before going through the nonlinear function. The GCN operation is defined as:

$$g^l_t = \mathrm{ReLU}\Big(\sum_{j=1}^{n} A_{t,j} W^l g^{l-1}_j / d_t + b^l\Big)$$

where $W^l$ is a linear transformation and $b^l$ is a bias.
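As a minimal sketch of the graph encoding described above, the following builds the adjacency matrix from a list of head indices (undirected edges plus self-loops) and applies one degree-normalized GCN layer. The head-index input format, the function names, and the choice of ReLU as the nonlinearity are assumptions made for illustration.

```python
def adjacency_from_heads(heads):
    """heads[t] is the index of token t's head, or -1 for the root token."""
    n = len(heads)
    A = [[0] * n for _ in range(n)]
    for t, head in enumerate(heads):
        A[t][t] = 1                      # self-loop
        if head >= 0:
            A[t][head] = 1               # dependency edge, treated as undirected
            A[head][t] = 1
    return A

def gcn_layer(A, G, W, b):
    """One GCN layer: G holds the n input vectors, W is out_dim x in_dim."""
    n = len(A)
    out = []
    for t in range(n):
        d_t = sum(A[t])                  # number of neighbors, including the self-loop
        agg = [0.0] * len(b)
        for j in range(n):
            if A[t][j]:
                Wg = [sum(w * g for w, g in zip(row, G[j])) for row in W]
                agg = [a + v for a, v in zip(agg, Wg)]
        # normalize by the degree, add the bias, then apply ReLU
        out.append([max(0.0, a / d_t + bias) for a, bias in zip(agg, b)])
    return out

# Toy 3-token sentence whose middle token is the root.
A = adjacency_from_heads([1, -1, 1])
G1 = gcn_layer(A, [[1.0], [2.0], [3.0]], W=[[1.0]], b=[0.0])
# With an identity transform, each output is the mean over a token's neighborhood:
# G1 == [[1.5], [2.0], [2.5]]
```

Stacking two such layers lets each token see its neighbors' neighbors, which matches the two-hop reach of a 2-layer GCN over the dependency tree.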
The initial $g^0_t$ is the concatenation of the word embedding $v_t$, character embedding $e_t$, and dependency relation embedding $r_t$:

$$g^0_t = [v_t; e_t; r_t]$$

³ We treat the dependency edge as undirected and add a self-loop for each node: $A_{i,j} = A_{j,i}$ and $A_{i,i} = 1$.

Bi-directional Syn-LSTM Layer With the word representation $x_t$ and the graph-encoded representation $g_t$, a bi-directional Syn-LSTM is applied to generate the contextual representation. The forward and backward Syn-LSTM enable the model to integrate the contextual and structured information from both directions. We concatenate the hidden state $\overrightarrow{h}_t$ from the forward Syn-LSTM and the hidden state $\overleftarrow{h}_t$ from the backward Syn-LSTM to form the contextual representation of the $t$-th token:

$$h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$$

CRF Layer The CRF (Lafferty et al., 2001) is widely used in NER tasks as it is capable of capturing the structured correlations between adjacent output labels. Given the sentence $w$ and dependency tree $\tau$, the probability of the label sequence $y$ is defined as:

$$P(y \mid w, \tau) = \frac{\exp\big(\mathrm{score}(w, \tau, y)\big)}{\sum_{y'} \exp\big(\mathrm{score}(w, \tau, y')\big)}$$

The score function is defined as:

$$\mathrm{score}(w, \tau, y) = \sum_{t=0}^{n} T_{y_t, y_{t+1}} + \sum_{t=1}^{n} E_{y_t} \qquad (15)$$

where $T_{y_t, y_{t+1}}$ denotes the transition score from label $y_t$ to $y_{t+1}$ (with $y_0$ and $y_{n+1}$ as the start and end labels), $E_{y_t}$ denotes the score of label $y_t$ at the $t$-th position, and the scores are computed using the hidden state $h_t$. We learn the model parameters by minimizing the negative log-likelihood and employ the Viterbi algorithm to obtain the best label sequence during evaluation.

Detailed statistics of each dataset can be found in Table 1. Intuitively, longer sentences would require the model to capture more long-distance interactions in the sentences. We present the number of entities in terms of different sentence lengths to show that these datasets have a modest amount of entities in long sentences.
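Returning to the CRF layer, the score of a label sequence and the Viterbi search for the best sequence can be sketched as follows. This is a generic linear-chain CRF illustration under assumed inputs, not the authors' code: `emissions[t][y]` stands for the per-position label score derived from $h_t$, `transitions[a][b]` for the label-to-label score, and `start` / `end` for the transitions from and to the special boundary labels. The toy numbers are arbitrary.

```python
import itertools

def sequence_score(emissions, transitions, labels, start, end):
    """Sum of boundary, emission and transition scores for one label sequence."""
    score = start[labels[0]] + end[labels[-1]]
    score += sum(emissions[t][y] for t, y in enumerate(labels))
    score += sum(transitions[a][b] for a, b in zip(labels, labels[1:]))
    return score

def viterbi(emissions, transitions, start, end):
    """Return the highest-scoring label sequence and its score."""
    n, k = len(emissions), len(start)
    best = [start[y] + emissions[0][y] for y in range(k)]
    back = []
    for t in range(1, n):
        prev, best, ptr = best, [], []
        for y in range(k):
            # best previous label for ending in label y at position t
            a = max(range(k), key=lambda p: prev[p] + transitions[p][y])
            ptr.append(a)
            best.append(prev[a] + transitions[a][y] + emissions[t][y])
        back.append(ptr)
    last = max(range(k), key=lambda y: best[y] + end[y])
    path = [last]
    for ptr in reversed(back):           # follow back-pointers to recover the path
        path.append(ptr[path[-1]])
    path.reverse()
    return path, best[last] + end[last]

E = [[1.0, 0.2], [0.3, 1.5], [0.9, 0.1]]    # 3 positions, 2 labels
T = [[0.5, -0.2], [-0.4, 0.6]]
S, Z = [0.0, 0.1], [0.2, 0.0]

path, best_score = viterbi(E, T, S, Z)
# Exhaustive check over all 2^3 label sequences.
brute_force = max(sequence_score(E, T, list(p), S, Z)
                  for p in itertools.product(range(2), repeat=3))
```

The exhaustive comparison is only feasible for toy sizes; it is included to show that dynamic programming and enumeration agree on the maximum score.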
Experimental Setup For Catalan, Spanish, and Chinese, we use the FastText (Grave et al., 2018) 300-dimensional embeddings to initialize the word embeddings. For OntoNotes 5.0 English, we adopt the publicly available GloVe (Pennington et al., 2014) 100-dimensional embeddings to initialize the word embeddings. For experiments with the contextualized representation, we adopt the pre-trained language model BERT (Devlin et al., 2019) for the four datasets. Specifically, we use bert-as-service (Xiao, 2018) to generate the contextualized word representation without fine-tuning. Following Luo et al. (2020), we use the cased version of the BERT large model for the experiments on the OntoNotes 5.0 English data. We use the cased version of the BERT base model for the experiments on the other three datasets. For the character embedding, we randomly initialize the character embeddings and set the dimension as 30, and set the hidden size of the character-level BiLSTM as 50. The hidden size of GCN and Syn-LSTM is set as 200, and the number of GCN layers is 2. We adopt stochastic gradient descent (SGD) to optimize our model with batch size 100, L2 regularization $10^{-8}$, and initial learning rate lr 0.2; the learning rate is decayed⁴ with respect to the number of epochs. We select the best model based on the performance on the dev set⁵ and apply it to the test set. We use the bootstrapping t-test to compare the results.
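The learning-rate decay can be made concrete with a short sketch. It uses the footnoted rule lr/(1 + decay * (epoch − 1)) with the stated initial rate of 0.2 and decay of 0.1; the function name and the 1-based epoch counting are assumptions of this example.

```python
def decayed_learning_rate(initial_lr, decay, epoch):
    """Learning rate for a 1-based epoch under inverse-time decay."""
    return initial_lr / (1 + decay * (epoch - 1))

# Learning rates for the first five epochs with the reported settings.
schedule = [decayed_learning_rate(0.2, 0.1, epoch) for epoch in range(1, 6)]
```

Under this rule the first epoch trains at the full initial rate, and the rate is halved only at epoch 11, so the decay is gentle over a typical training run.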
Baselines We compare our model with several baselines with or without dependency tree information. The first one is BERT-CRF, where we apply a CRF layer on top of BERT (Devlin et al., 2019). Secondly, we compare with the BERT implementation by HuggingFace (Wolf et al., 2019). For models with dependency trees, we take the models BiLSTM-GCN-CRF and dependency-guided DGLSTM-CRF (Jie and Lu, 2019) as baselines. DGLSTM-CRF was originally paired with ELMo (Peters et al., 2018), but we also implement it with BERT. Besides, we compare our model with previous works that have results on these datasets.

⁴ We set the decay as 0.1 and the learning rate for each epoch equals lr/(1 + decay * (epoch − 1)).
⁵ The experimental results on the dev set and other experimental details can be found in the Appendix.

Main Results
SemEval 2010 Task 1 Table 2 shows comparisons of our model with baseline models on the SemEval 2010 Task 1 Catalan and Spanish datasets. Our Syn-LSTM-CRF model outperforms all existing models with $F_1$ 82.76 and 85.09 ($p < 10^{-5}$) compared to DGLSTM-CRF on Catalan and Spanish datasets when FastText word embeddings are used. Our model outperforms the BiLSTM-CRF model by 13.25 and 11.22 $F_1$ points, and outperforms the BiLSTM-GCN-CRF (Jie and Lu, 2019) model by 4.64 and 3.16 on Catalan and Spanish. The large performance gap between BiLSTM-GCN-CRF and our model indicates that Syn-LSTM-CRF shows better compatibility with GCN, and this confirms that simply stacking GCN on top of the BiLSTM does not perform well. Our method outperforms the GCN-BiLSTM-CRF model by 5.33 and 3.24 $F_1$ points on Catalan and Spanish. This shows that our proposed model demonstrates a better integration of contextual information and structured information. Furthermore, our proposed method brings 1.12 and 1.62 $F_1$ points improvement on Catalan and Spanish datasets compared to the DGLSTM-CRF (Jie and Lu, 2019). The DGLSTM-CRF employs a 2-layer dependency-guided BiLSTM to capture grandchild dependencies, which leads to longer training time and more model parameters. In contrast, our Syn-LSTM-CRF is able to get better performance with fewer model parameters and shorter training time because of the fewer LSTM layers. Such results demonstrate that our proposed Syn-LSTM-CRF manages to capture structured information effectively.
Furthermore, with the contextualized word representation, the Syn-LSTM-CRF + BERT achieves much higher performance improvement than any other method. Our model outperforms the strong baseline model DGLSTM-CRF + ELMO by 4.83 and 2.54 in terms of F 1 (p < 10 −5 ) on Catalan and Spanish, respectively.
OntoNotes 5.0 English To understand the generalizability of our model, we evaluate the proposed Syn-LSTM-CRF model on large scale OntoNotes 5.0 datasets. Table 3 shows comparisons of our model with baseline models on English. Our Syn-LSTM-CRF model outperforms all existing methods with 89.04 in terms of F 1 score (p < 0.01) compared to DGLSTM-CRF, when GloVE word embeddings are used. Our model outperforms the BiLSTM-CRF model by 1.97 in F 1 , BiLSTM-GCN-CRF (Jie and Lu, 2019) model by 0.86. Note that our implemented GCN-BiLSTM-CRF outperforms the previous DGLSTM-CRF (Jie and Lu, 2019) by 0.14 in F 1 . Our Syn-LSTM-CRF further brings the improvement to 0.52. Moreover, with the contextualized word representation BERT, our method achieves an F 1 score of 90.85 (p < 10 −5 ) compared to DGLSTM-CRF + ELMO . Our method outperforms the previous model (Luo et al., 2020), which relies on document-level information, by 0.55 in F 1 . Furthermore, the performance improvement on recall is more prominent as compared to precision. This shows that the proposed Syn-LSTM-CRF is able to extract more entities.
OntoNotes 5.0 Chinese We present the experimental results on the OntoNotes 5.0 Chinese test set in Table 4. Our model outperforms the baseline models, specifically by 2.04 in $F_1$ compared to BiLSTM-CRF, by 2.39 compared to BiLSTM-GCN-CRF, by 1.86 compared to GCN-BiLSTM-CRF and by 1.11 ($p < 10^{-5}$) compared to DGLSTM-CRF when FastText is used. Note that the baseline BiLSTM-GCN-CRF model is 0.35 points worse than BiLSTM-CRF. Such results further confirm the effectiveness of our proposed Syn-LSTM-CRF for incorporating structured information. We find a similar behavior when the contextualized word representation BERT is used. With the contextualized word representation, we achieve a higher $F_1$ score of 80.20.

[Table 4 caption, partially recovered: ... (2019). There are also other methods (Li et al., 2020a,b) that use external information, and Yu et al. (2020) use document-level information to encode the sentence; these are not direct comparisons to ours.]

Analysis
Robustness Analysis To study the robustness of our model and check whether our model can regulate the flow of information from the graph-encoded representation, we analyze the influence of the quality of dependency trees. We train and evaluate an additional dependency parser (Dozat and Manning, 2017)⁶ on the given training datasets and select the best model based on the dev sets. Then we apply the best model to the test sets to obtain dependency trees. We also train and evaluate our model with random dependency trees. Table 8 presents the comparisons between Syn-LSTM-CRF + BERT and DGLSTM-CRF + ELMO with given, predicted and random dependency trees. We observe that both models encounter a performance drop when we use the predicted parse trees and random trees. Our performance differences with the given parse trees are relatively smaller than the corresponding differences in DGLSTM-CRF + ELMO. Such an observation demonstrates the robustness of our proposed model against structured information from trees of different quality. It is worthwhile to note that, with the predicted dependencies, our proposed Syn-LSTM-CRF + BERT is still able to outperform the strong baseline DGLSTM-CRF + ELMO even with the given parse trees on Catalan, English, and Chinese datasets.

[Table caption fragment: ... Jie and Lu (2019). There are also other methods (Li et al., 2020a,b) that use external information, which are not direct comparisons to ours.]
To further study the robustness, we conduct an analysis to investigate if the gate $m_t$ (Figure 2) has the ability to regulate the flow of information from the graph-encoded representation. Intuitively, the gate $m_t$ should tend to have a small value when the quality of the parse tree is not good (e.g., with random trees). We plot the number of words with respect to different gate value ranges ($m_t$). Figure 4 shows the comparison between the models using random trees and given trees on Catalan and Spanish. We observe that the gate $m_t$ is more likely to open (the value is higher) when we use the given parse trees compared with random parse trees. Such behavior demonstrates that our proposed model can selectively aggregate the information from the graph-encoded representation.

⁶ The performance of the dependency parser can be found in the Appendix.
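The gate analysis amounts to counting words per gate-value range. A minimal sketch follows; how the gate vector of a word is reduced to one number is not stated in the text, so averaging over its dimensions is an assumption here, as are the function names, the number of bins, and the toy values.

```python
def mean_gate_per_word(gate_vectors):
    """Reduce each word's gate vector to a scalar (assumed: mean over dimensions)."""
    return [sum(vec) / len(vec) for vec in gate_vectors]

def gate_histogram(gate_values, num_bins=10):
    """Count how many gate values in [0, 1] fall into each equal-width bin."""
    counts = [0] * num_bins
    for value in gate_values:
        idx = min(int(value * num_bins), num_bins - 1)   # a value of 1.0 goes in the last bin
        counts[idx] += 1
    return counts

# Toy gate values for five words, bucketed into five ranges of width 0.2.
counts = gate_histogram([0.05, 0.15, 0.95, 1.0, 0.5], num_bins=5)
# counts == [2, 0, 1, 0, 2]
```

Comparing two such histograms, one from a model trained with given trees and one with random trees, gives the kind of contrast the analysis describes: mass shifted toward higher bins means the gate opens more often.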

Effect of Sentence Length
We compare the performance of our Syn-LSTM-CRF + BERT with BiLSTM-CRF + BERT and DGLSTM-CRF + ELMO models with respect to sentence length, and the results are shown in Figure 5. We observe that the Syn-LSTM-CRF + BERT model consistently outperforms the two baseline models on the four languages. In particular, although the performance tends to drop as the sentence length increases, our proposed model shows relatively better performance when the sentence length is ≥ 60. This confirms that the proposed Syn-LSTM-CRF + BERT is able to effectively incorporate structured information. Note that our 2-layer GCN is computed based on the dependency trees, which include both short-range dependencies and long-range dependencies. With the graph-encoded representation and the proposed Syn-LSTM-CRF + BERT, the individual word representation is enhanced by both contextual and structured information. Therefore, for the sentences with length ≤ 14, we can still observe obvious improvements. The significant performance improvements on the four datasets show the capability of our Syn-LSTM-CRF to capture the structured information regardless of the sentence length.

Effect of Entity Length
We conduct another evaluation on BiLSTM-CRF + BERT , DGLSTM-CRF + ELMO , and Syn-LSTM-CRF + BERT models with respect to entity length ∈ {1, 2, 3, 4, 5, ≥ 6} on the four languages. Table 6 shows the performance comparison of two models with respect to entity length. With the structured information, both DGLSTM-CRF + ELMO and Syn-LSTM-CRF + BERT achieve better performance compared to BiLSTM-CRF + BERT . When the length of entity is ≤ 3, Syn-LSTM-CRF + BERT achieves better results compared to DGLSTM-CRF + ELMO . This confirms that our proposed method can effectively incorporate the structured information. Our model consistently outperforms BiLSTM-CRF + BERT , and the performance tends to have more improvements when entities are getting longer except on the Chinese dataset. We note there are some special characteristics of the Chinese language. As mentioned by Jie and Lu (2019), the percentage of entities that are able to perfectly form a sub-tree is only 92.9% for OntoNotes Chinese, as compared to 98.5%, 100%, 100% for OntoNotes English, SemEval Catalan and Spanish. Furthermore, the ratio of long entities is much higher for Catalan and Spanish compared  to English and Chinese. The experimental results on Catalan and Spanish datasets show significant improvements for long entities. Such results show that the structured information conveyed by the dependency trees can be more crucial when entity length becomes longer.

Number of GCN Layers
To fully explore the impact of the number of GCN layers, we conduct another experiment on the Syn-LSTM-CRF + BERT model with the number of GCN layers ∈ {1, 2, 3}, and Figure 6 shows the performance on the dev set of the four languages. The last bar, indicated as AVG, is obtained by averaging the dev results on the four datasets. We observe that the overall performance is better when the number of GCN layers equals 2. Note that similar behavior can also be found in the work by Kipf and Welling (2017) for document classification and node classification. Therefore, we evaluate our proposed Syn-LSTM-CRF model with a 2-layer GCN.

Ablation Study To understand the contribution of each component, we conduct an ablation study on the OntoNotes 5.0 English dataset, and Table 7 presents the detailed results of our model with contextualized representation. We find that the performance drops by 0.24 $F_1$ score when we only use a 1-layer GCN. Without GCN at all, the score drops by 1.13 $F_1$. The original dependency contributes 0.27 $F_1$ score. Removing the dependency relation embedding also decreases the performance by 0.27 $F_1$. When we remove the POS tag embedding, the result drops by 0.39 $F_1$.

Related Work
LSTM LSTM has demonstrated its great effectiveness in many NLP tasks and becomes a standard module for many state-of-the-art models (Wen et al., 2015;Ma and Hovy, 2016;Dozat and Manning, 2017). However, the sequential nature of the LSTM makes it challenging to capture long-range dependencies. Zhang et al. (2018a) propose the S-LSTM model to include a sentence state to allow both local and global information exchange simultaneously. Mogrifier LSTM (Melis et al., 2020) mutually gates the current input and the previous output to enhance the interaction between the input and the context. These two works do not consider structured information for the LSTM design. Since natural language is usually structured, Shen et al. (2018) propose ON-LSTM to add a hierarchical bias to allow the neurons to be updated by following certain order. While the ON-LSTM is learning the latent constituency parse trees, we focus on incorporating the explicit structured information conveyed by the dependency parse trees.
NER Early work (Sasano and Kurohashi, 2008) uses syntactic dependency features to improve the SVM performance on the Japanese NER task. Liu et al. (2010) propose to construct skip-edges to link similar words or words having typed dependencies to capture long-range dependencies. Later works (Collobert et al., 2010; Lample et al., 2016; Chiu and Nichols, 2016b) focus on using neural networks to extract features and achieve state-of-the-art performance. Jie et al. (2017) find that some relations between the dependency edges and the entities can be used to reduce the search space of their model, which significantly reduces the time complexity. Yu et al. (2020) employ a pre-trained language model to encode document-level information and explore all spans with ideas from graph-based dependency parsing. The pre-trained language models (e.g., BERT (Devlin et al., 2019), ELMo (Peters et al., 2018)) further improve neural-based approaches with a good contextualized representation. However, previous works did not focus on investigating how to effectively integrate structured and contextual information.

Conclusion
In this paper, we propose a simple and robust Syn-LSTM model to better integrate the structured information leveraged from the long-range dependencies. Specifically, we introduce an additional graph-encoded representation to each recurrent unit. Such a graph-encoded representation can be obtained via GNNs. Through the newly designed gating mechanism, the hidden states are enhanced by contextual information captured by the linear sequence and structured information captured by the dependency trees. We present the Syn-LSTM-CRF for NER and adopt the GCN on dependency trees to obtain the graph-encoded representations. Our extensive experiments and analysis on the datasets with four languages demonstrate that the proposed Syn-LSTM is able to effectively incorporate both contextual and structured information. The robustness analysis demonstrates that our model is capable of selectively aggregating the information from the graph-encoded representation.

For experiments with the contextualized representation, we adopt the pre-trained language model BERT (Devlin et al., 2019) for the four datasets. Specifically, we use bert-as-service (Xiao, 2018) to generate the contextualized word representation without fine-tuning. Following Luo et al. (2020), we select the 18th layer of the cased version of the BERT large model for the experiments on the OntoNotes 5.0 English data. We use the 9th layer of the cased version of the BERT base model for the experiments on the other three datasets. For the character embedding, we randomly initialize the character embeddings and set the dimension as 30, and set the hidden size of the character-level BiLSTM as 50. The hidden size of GCN and Syn-LSTM is set as 200. Note that we only use one layer of bi-directional Syn-LSTM for our experiments. Dropout is set to 0.5 for input embeddings and hidden states. We adopt stochastic gradient descent (SGD) to optimize our model with batch size 100, L2 regularization $10^{-8}$, and learning rate 0.2; the learning rate is decayed with respect to the number of epochs⁹.
B Performance of dependency parser
Table 8 presents the performance of the dependency parser.

⁹ We set the decay as 0.1 and the learning rate for each epoch equals learning_rate/(1 + decay * (epoch − 1)).

[Figure caption fragment: x-axis: sentence length. y-axis: $F_1$ score (%). Note that DGLSTM-CRF + ELMO has better performance compared to DGLSTM-CRF + BERT based on the results in the main paper.]
C More data statistics
Table 9 shows the statistics of the number of entities with respect to entity length for OntoNotes 5.0 English and Chinese, SemEval 2010 Task 1 Catalan and Spanish datasets.

Figure 7 shows the comparisons of the models using random trees and given trees on OntoNotes 5.0 English and Chinese datasets.

E Effect of Sentence Length
We compare the performance of our Syn-LSTM-CRF + BERT with BiLSTM-CRF + BERT and DGLSTM-CRF + ELMO models with respect to sentence length, and the results are shown in Figure 8.

F Case Study
We further show an example to visualize the propagation of non-local information (Figure 9). The example is selected from the OntoNotes 5.0 English dataset. Even though the DGLSTM-CRF (Jie and Lu, 2019) model is able to recognize "Tanshui" as a named entity, it predicts the wrong entity type, PERSON, while the true type is GPE. If only looking at the first half of the sentence, it is possible to predict "Tanshui" as PERSON because of the local information "age". However, the second half of the sentence confirms that the entity type of "Tanshui" is GPE. With the non-local information from the graph-encoded representation, our Syn-LSTM-CRF successfully predicts the right entity type.

Figure 9: An example of dependency tree. The mentioned entity is highlighted in orange, and the entity type is GPE. Sentence: "During Tanshui 's golden age , large and small boats were constantly coming and going in the harbor , and it was not usual to see enormous steamships ."