A Lexicon-Based Graph Neural Network for Chinese NER

Recurrent neural networks (RNN) used for Chinese named entity recognition (NER) that sequentially track character and word information have achieved great success. However, the characteristic of chain structure and the lack of global semantics determine that RNN-based models are vulnerable to word ambiguities. In this work, we try to alleviate this problem by introducing a lexicon-based graph neural network with global semantics, in which lexicon knowledge is used to connect characters to capture the local composition, while a global relay node can capture global sentence semantics and long-range dependency. Based on the multiple graph-based interactions among characters, potential words, and the whole-sentence semantics, word ambiguities can be effectively tackled. Experiments on four NER datasets show that the proposed model achieves significant improvements against other baseline models.


Introduction
The task of named entity recognition (NER) involves determining entity boundaries and recognizing the categories of named entities; it is a fundamental task in natural language processing (NLP). NER plays an important role in many downstream NLP tasks, including information retrieval (Chen et al., 2015b), relation extraction (Bunescu and Mooney, 2005), and question answering systems (Diefenbach et al., 2018). Compared with English NER, Chinese named entities are more difficult to identify because of their uncertain boundaries, complex composition, and nested NE definitions (Duan and Zheng, 2011).
One intuitive way to alleviate word boundary problems is to first perform word segmentation and then apply word sequence labeling (Yang et al., 2016; He and Sun, 2017). However, the scarcity of gold-standard segmentation in NER datasets and incorrectly segmented entity boundaries both negatively impact the identification of named entities (Peng and Dredze, 2015; He and Sun, 2016). Hence, character-level Chinese NER using lexicon features to better leverage word information has attracted research attention (Passos et al., 2014; Zhang and Yang, 2018). In particular, Zhang and Yang (2018) introduced a variant of a long short-term memory network (lattice-structured LSTM) that encodes all potential words matching a sentence to exploit explicit word information, achieving state-of-the-art results.
Figure 1: Example of word character lattice with partial input. Because of the characteristic of chain structure, RNN-based methods must predict the label of "度" using only the previous partial sequence "印度 (India)", which may suffer from word ambiguities without global sentence semantics.
* Equal contribution.
However, these methods are usually based on RNN or CRF to sequentially encode a sentence, while the underlying structure of language is not strictly sequential (Shen et al., 2019). As a result, these models encounter serious word ambiguity problems (Mich et al., 2000). Especially in Chinese texts, the recognition of named entities with overlapping ambiguous strings is even more challenging. As shown in Figure 1, the middle character of an overlapping ambiguous string can form words with the characters to both its left and its right (Yen et al., 2012), such as "河流 (River)" and "流经 (Flow through)", which share the common character "流". However, RNN-based models process characters in a strictly serial order, which is similar to reading Chinese, and a character has priority in being assigned to the word on the left (Perfetti and Tan, 1999). More seriously, RNN-based models must give the label of "度" using only the previous partial sequence "印度 (India)", which is problematic without seeing the remaining characters. Hence, Ma et al. (2014) suggested that the overlapping ambiguity must be settled using sentence context and high-level information.
In this work, we introduce a lexicon-based graph neural network (LGN) that casts Chinese NER as a node classification task. The proposed model breaks the serial processing structure of RNNs and achieves better interaction between characters and words through carefully designed connections. The lexicon knowledge connects related characters to capture the local composition. Meanwhile, a global relay node is designed to capture long-range dependency and high-level features.
LGN follows a neighborhood aggregation scheme wherein the node representation is computed by recursively aggregating its incoming edges and the global relay node. Because of multiple iterations of aggregation, the model can use global context information to repeatedly compare ambiguous words for better prediction. Experimental results show that the proposed method can achieve state-of-the-art performance on four NER datasets.
The main contributions of this paper can be summarized as follows: 1) we propose the use of a lexicon to construct a graph neural network and cast Chinese NER as a graph node classification task; 2) the proposed model can capture global context information and local compositions to tackle Chinese word ambiguity problems through a recursive aggregation mechanism; 3) several experimental results demonstrate the effectiveness of the proposed method in different aspects.

Related Work

Chinese NER with Lexicon
Some previous Chinese NER studies have compared word-based and character-based methods and shown that, because of the limited performance of current Chinese word segmentation, character-based name taggers can outperform their word-based counterparts (He and Wang, 2008; Liu et al., 2010). Lexicon features have been widely used to better leverage word information for Chinese NER (Luo et al., 2015; Gui et al., 2019). In particular, Zhang and Yang (2018) proposed a lattice LSTM to model characters and potential words simultaneously. However, their lattice LSTM used a concatenation of independently trained left-to-right and right-to-left LSTMs to represent features, which was also limited (Devlin et al., 2018). In this work, we propose a novel character-based method that treats named entity recognition as a node classification task. The proposed method can utilize global information (both the left and the right context) (Dong et al., 2019) to tackle word ambiguities.

Graph Neural Networks on Texts
Graph neural networks have been successfully applied to several text classification tasks (Veličković et al., 2017; Yao et al., 2018; Zhang et al., 2018b). Peng et al. (2018) proposed a GCN-based deep learning model for text classification. Zhang et al. (2018c) proposed using dependency parse trees to construct a graph for relation extraction. Recently, multi-head attention mechanisms (Vaswani et al., 2017) have been widely used by graph neural networks during the fusion process (Zhang et al., 2018a), which can aggregate graph information by assigning different weights to neighboring nodes or associated edges. Given a set of vectors $H \in \mathbb{R}^{n \times d}$, a query vector $q \in \mathbb{R}^{1 \times d}$, and a set of trainable parameters $W$, this mechanism can be formulated as:
$$\begin{aligned} \mathrm{Attention}(q, K, V) &= \mathrm{softmax}\left(\frac{qK^\top}{\sqrt{d_k}}\right)V, \\ \mathrm{head}_i &= \mathrm{Attention}(qW_i^Q, HW_i^K, HW_i^V), \\ \mathrm{MultiAtt}(q, H) &= [\mathrm{head}_1; \ldots; \mathrm{head}_k]W^O. \end{aligned}$$
However, very little work has explored how to use the relationship among characters to construct graphs in raw Chinese texts. The few previous studies on morphological processing in Chinese proposed a decomposed lexical structure (Zhang and Peng, 1992; Zhou and Marslen-Wilson, 1994) in which Chinese words are represented in terms of their constituent characters. Inspired by this theoretical basis, we propose the use of graph neural networks to model the relationship between constituent characters and words.
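The multi-head attention aggregation described above, with a single query vector attending over a set of vectors, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the helper names (`matmul`, `attention`, `multi_att`, `init_params`), the random initialization, and the per-head dimension are assumptions made only for illustration, and it follows the standard formulation of Vaswani et al. (2017).

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, K, V):
    """Scaled dot-product attention for a single query row q (shape 1 x d_k)."""
    d_k = len(q[0])
    scores = [sum(a * b for a, b in zip(q[0], k)) / math.sqrt(d_k) for k in K]
    weights = softmax(scores)
    out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
    return [out], weights


def init_params(d, num_heads, d_head, seed=0):
    """Hypothetical initializer: small random projection matrices per head."""
    rng = random.Random(seed)

    def rand(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    heads = [(rand(d, d_head), rand(d, d_head), rand(d, d_head)) for _ in range(num_heads)]
    return {"heads": heads, "W_o": rand(num_heads * d_head, d)}


def multi_att(q, H, params):
    """Concatenate the per-head attention outputs and project them with W_o."""
    concat = []
    for W_q, W_k, W_v in params["heads"]:
        head, _ = attention(matmul(q, W_q), matmul(H, W_k), matmul(H, W_v))
        concat.extend(head[0])
    return matmul([concat], params["W_o"])
```

In a graph setting, `q` would be the state of the node (or edge) being updated and `H` the stacked states of its neighbors, so the attention weights play the role of learned neighbor weights.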

Lexicon-Based Graph Neural Network
In this work, we propose the use of lexicon information to construct graph neural networks and cast Chinese NER as a node classification task. The proposed model obtains better interaction among characters, words, and sentences through an efficient graph message passing architecture (Gilmer et al., 2017) that repeats the cycle aggregation → update → aggregation → ...

Graph Construction and Aggregation
We use the lexicon knowledge to connect characters to capture the local composition and potential word boundaries. In addition, we propose a global relay node to capture long-range dependency and high-level features. The implementation of the aggregation module for nodes and edges is similar to the multi-head attention mechanism in Transformer (Vaswani et al., 2017).

Graph Construction
The whole sentence is converted into a directed graph, as shown in Figure 2, where each node represents a character and the connection between the first and last characters in a word can be treated as an edge. The state of the i-th node represents the features of the i-th token in a text sequence. The state of each edge represents the features of a corresponding potential word. The global relay node is used as a virtual hub to gather the information from all the nodes and edges, and then utilizes the global information to help the node remove ambiguity.
Formally, let $s = c_1, c_2, \ldots, c_n$ denote a sentence, where $c_i$ denotes the $i$-th character. The potential words in the lexicon that match a character subsequence can be formulated as $w_{b,e} = c_b, c_{b+1}, \ldots, c_{e-1}, c_e$, where the indices of the first and last characters are $b$ and $e$, respectively. We use a directed graph $G = (V, E)$ to model a sentence, where each character $c_i \in V$ is a graph node and $E$ is the set of edges. Once a character subsequence matches a potential word $w_{b,e}$, we construct one edge $e_{b,e} \in E$, pointing from the beginning character $c_b$ to the ending character $c_e$.
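The graph construction described above can be sketched as a simple lexicon match over character spans. The function name `build_graph`, the `max_word_len` cutoff, the toy lexicon, and the restriction to multi-character words are assumptions for illustration only; the text does not say how single-character lexicon words are handled.

```python
def build_graph(sentence, lexicon, max_word_len=4):
    """Return (nodes, edges): one node per character, one edge per lexicon match.

    Each edge is (b, e, word) and points from the first character index b to the
    last character index e (0-based, inclusive) of the matched word.
    Only words of length 2..max_word_len are matched in this sketch.
    """
    nodes = list(sentence)
    edges = []
    for b in range(len(nodes)):
        for e in range(b + 1, min(b + max_word_len, len(nodes))):
            word = sentence[b:e + 1]
            if word in lexicon:
                edges.append((b, e, word))
    return nodes, edges


# Toy example with a hypothetical lexicon.
toy_lexicon = {"印度", "印度河", "河流", "流经", "中国"}
nodes, edges = build_graph("印度河流经中国", toy_lexicon)
```

Overlapping matches such as "印度河" and "河流" both become edges, which is what lets the later aggregation steps compare competing segmentations instead of committing to one.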
To capture global information, we add a global relay node to connect each character node and word edge. For a graph with n character nodes and m edges, there are n + m global connections linking each node and edge to the shared relay node. With the global connections, every two nonadjacent nodes are two-hop neighbors and receive non-local information with a two-step update.
In addition, we consider the transpose of the constructed graph. It is another directed graph on the same set of nodes with all of the edges reversed compared to the orientation of the corresponding edges in $G$. We denote the transpose graph as $G^\top$. Similar to the bidirectional LSTM, we compose $G$ and $G^\top$ as a bidirectional graph and concatenate the node states of $G$ and $G^\top$ as final outputs.
Local Aggregation. Given the node features $c_i^t$ and the incoming edge features $E_{c_i}^t = \{\forall k\; e_{k,i}^t\}$, we use multi-head attention to aggregate $e_{k,i}$ and the corresponding predecessor nodes $c_k$ for each node $c_i$. The intuition is that the incoming edges and predecessor nodes can effectively indicate potential word boundary information, as shown in Figure 3 (a). Formally, the node aggregation function can be formulated as follows:
$$\hat{c}_i^t = \mathrm{MultiAtt}\left(c_i^t, \{\forall k\; [c_k^t; e_{k,i}^t]\}\right),$$
where $t$ refers to the aggregation at the $t$-th step and $[\cdot\,;\cdot]$ represents the concatenation operation.
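The transpose graph and the per-node incoming edges used by local aggregation can be sketched as follows. The edge tuple layout `(source, target, word)` and the helper names `transpose` and `incoming` are assumptions chosen for this illustration.

```python
from collections import defaultdict


def transpose(edges):
    """Reverse every edge: (b, e, word) becomes (e, b, word)."""
    return [(e, b, word) for b, e, word in edges]


def incoming(edges):
    """Map each node index to the (predecessor index, word) pairs pointing to it."""
    table = defaultdict(list)
    for src, dst, word in edges:
        table[dst].append((src, word))
    return dict(table)


# Hypothetical edges for a sentence starting with "印度河流".
forward_edges = [(0, 1, "印度"), (0, 2, "印度河"), (2, 3, "河流")]
forward_in = incoming(forward_edges)
backward_in = incoming(transpose(forward_edges))
```

In the forward graph a node sees the words that end at it, while in the transpose graph it sees the words that start at it; concatenating the two node states gives each character both views.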
For edge aggregation, all the forces or potential energies acting on the edges should be considered (Battaglia et al., 2018). To exploit the word orthographic information, lexicons used to construct edges should consider all of the character composition, as shown in Figure 3 (b). Hence, different from the classic graph neural networks that use the features of terminal vertices to aggregate edges, we use the whole matching character subsequence $C_{b,e}^t = \{c_b^t, \ldots, c_e^t\}$ for the edge aggregation function, as follows:
$$\hat{e}_{b,e}^t = \mathrm{MultiAtt}\left(e_{b,e}^t, C_{b,e}^t\right).$$
Given the character sequence embeddings $C \in \mathbb{R}^{n \times d}$ and potential word embeddings $E \in \mathbb{R}^{m \times d}$, we first feed $C$ into an LSTM network to generate contextualized representations as the initial node states $C^0$ (Zhang et al., 2018c), and we use the word embeddings as the initial edge states $E^0$.
Global Aggregation. The underlying structure of language is not strictly sequential (Shen et al., 2019). To capture long-range dependency and high-level features, as shown in Figure 3 (c), we utilize a global relay node to aggregate each character node and edge into a global feature $\hat{g}^t$.
After multiple exchanges of information (§ 3.2), $g^t$ aggregates node vectors and edge vectors to summarize the global information, and $\hat{e}_{b,e}^t$ captures the compositional character information to form the local composition. As a result, the proposed model, with a thorough knowledge of both local and non-local composition, helps character nodes distinguish ambiguous words (Ma et al., 2014).

Recurrent-based Update Module
Node Update. The effective use of sentence context to tackle the ambiguity among the potential words is still a key issue (Ma et al., 2014). For a general graph, it is common practice to apply recurrent-based modules to update hidden representations of nodes (Scarselli et al., 2009; Li et al., 2015). Hence, we fuse the global feature $\hat{g}$ into the character node update module, whose trainable parameters are $W$, $V$, and $b$. In this module, $\xi_i^t$ is the concatenation of adjacent vectors of a context window. The window size in our model is 2, so it effectively acts as a character bigram, which has been shown to be useful for representing characters in sequence labeling tasks (Chen et al., 2015a; Zhang and Yang, 2018). $\chi_i^t$ is the concatenation of the global information vector $\hat{g}^t$ and the e→c aggregation result $\hat{c}_i^t$. The gates $i_i^t$, $f_i^t$, and $l_i^t$ control information flow from global features to $c_i^t$, which can further readjust the weights of the lexicon attention (e→c) to tackle the ambiguities at the subsequent aggregation step.
Edge Update. To better leverage the interaction among characters, words, and whole sentences, we designed a recurrent module not only for nodes but also for edges and the global relay node (Battaglia et al., 2018). In the edge update, $\chi_{b,e}^t$ is the concatenation of $\hat{g}^t$ and the c→e aggregation result $\hat{e}_{b,e}^t$. Similar to the node update function, $i_{b,e}^t$ and $f_{b,e}^t$ are gates that control information flow from $e_{b,e}^{t-1}$ and $\hat{g}^t$ to $e_{b,e}^t$.
Global Relay Node Update. In terms of the global state $g$, recent works (Zhang et al., 2018b; Guo et al., 2019) have shown the effectiveness of sharing useful messages across contexts. Thus, we also designed an update function for $g$, with the initialization $g^0 = \mathrm{average}(C, E)$.
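To make the role of the gates more concrete, the following sketch shows one plausible gated update together with the stated initialization $g^0 = \mathrm{average}(C, E)$. The exact update equations are not reproduced in the text above, so the mixing rule here (three gates normalized with a softmax so that they sum to one per dimension) is an assumption, as are the function names and the idea of passing raw gate scores directly instead of computing them from $\xi$ and $\chi$ with trainable weights.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gated_node_update(c_left_prev, c_prev, candidate, gate_scores):
    """Mix three vectors element-wise with gates that sum to one (assumed form).

    gate_scores[j] holds the raw scores of the (l, f, i) gates for dimension j.
    In a full model these scores would come from trainable affine maps of the
    context window and of the aggregated local and global features.
    """
    new_state = []
    for j, scores in enumerate(gate_scores):
        l, f, i = softmax(scores)
        new_state.append(l * c_left_prev[j] + f * c_prev[j] + i * candidate[j])
    return new_state


def init_global_state(C, E):
    """g^0 = average(C, E): mean over all initial node and edge vectors."""
    vectors = C + E
    return [sum(col) / len(vectors) for col in zip(*vectors)]
```

Because the gates form a convex combination in this sketch, the new node state always stays within the range spanned by its inputs; whether the original model uses exactly this normalization should be checked against the released code.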

Decoding and Training
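A standard conditional random field (CRF) is used after the graph message passing process. Given the sequence of final node states $c_1^T, c_2^T, \ldots, c_n^T$, the probability of a label sequence $\hat{y} = \hat{l}_1, \hat{l}_2, \ldots, \hat{l}_n$ can be defined as follows:
$$p(\hat{y} \mid s) = \frac{\exp\left(\sum_{i=1}^{n} \phi(\hat{l}_{i-1}, \hat{l}_i, c_i^T)\right)}{\sum_{y' \in Y(s)} \exp\left(\sum_{i=1}^{n} \phi(l'_{i-1}, l'_i, c_i^T)\right)},$$
where $Y(s)$ is the set of all arbitrary label sequences, and $\phi(l_{i-1}, l_i, c_i^T) = W_{(l_{i-1}, l_i)} c_i^T + b_{(l_{i-1}, l_i)}$, where $W_{(l_{i-1}, l_i)}$ and $b_{(l_{i-1}, l_i)}$ are the weight and bias parameters specific to the labels $l_{i-1}$ and $l_i$.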
For training, we minimize the sentence-level negative log-likelihood loss over the $N$ training sentences as follows:
$$\mathcal{L} = -\sum_{i=1}^{N} \log p(\hat{y}_i \mid s_i).$$
For testing and decoding, we maximize the likelihood to find the optimal sequence $y^*$:
$$y^* = \arg\max_{y \in Y(s)} p(y \mid s).$$
We use the Viterbi algorithm to compute the above equations, which reduces the computational complexity.
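The decoding step can be illustrated with a compact Viterbi sketch. As a simplification, it assumes that the label-pair score decomposes into a per-position emission score plus a position-independent transition score; the names `viterbi`, `emissions`, and `transitions` are hypothetical and the scores in the usage lines are made up for illustration.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence and its score.

    emissions[i][l] is the score of label l at position i;
    transitions[p][l] is the score of moving from label p to label l.
    """
    n, num_labels = len(emissions), len(emissions[0])
    score = list(emissions[0])
    backptr = []
    for i in range(1, n):
        new_score, ptr = [], []
        for l in range(num_labels):
            # Best previous label for ending at label l in position i.
            best_prev = max(range(num_labels), key=lambda p: score[p] + transitions[p][l])
            new_score.append(score[best_prev] + transitions[best_prev][l] + emissions[i][l])
            ptr.append(best_prev)
        score = new_score
        backptr.append(ptr)
    last = max(range(num_labels), key=lambda l: score[l])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, score[last]


# Illustrative scores for a three-character sentence with two labels.
toy_emissions = [[2, 0], [0, 1], [0, 3]]
free_path, free_score = viterbi(toy_emissions, [[0, 0], [0, 0]])
penalised_path, penalised_score = viterbi(toy_emissions, [[0, -5], [0, 0]])
```

Dynamic programming keeps the cost linear in the sentence length and quadratic in the number of labels, instead of enumerating every label sequence in $Y(s)$.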

Experimental Setup
In this section, we describe the datasets across different domains and the baseline methods applied for comparison. We also detail the hyper-parameter configuration of the proposed model. Our code and datasets can be found at https://github.com/RowitZou/LGN.

Lexicon
We used the lexicon built over automatically segmented Chinese Giga-Word, obtaining 704.4k words in the final lexicon. The embeddings of lexicon words were pre-trained using word2vec (Mikolov et al., 2013) and fine-tuned during training. According to the lexicon statistics, the numbers of single-character, two-character, and three-character words are 5.7k, 291.5k, and 278.1k, respectively. The lexicon covers 31.2% of the named entities in the four datasets, which means most of the lexicon words are not named entities. For a fair comparison, we used such a general lexicon instead of a professional named entity lexicon in our experiments, and we still obtained competitive results. Empirically, a high-quality lexicon could lead to further performance improvements. Character embeddings are pre-trained on Chinese Giga-Word using word2vec and fine-tuned during model training. Both the pre-trained character and lexicon word embeddings are released by Zhang and Yang (2018).
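A coverage figure like the one reported above could be computed along the following lines. This is only a sketch: the text does not state whether coverage is counted over entity mentions or over distinct entity strings, so counting over mentions is an assumption, and the function name and the toy inputs are hypothetical.

```python
def lexicon_coverage(entities, lexicon):
    """Fraction of entity mentions whose surface string is a lexicon word."""
    if not entities:
        return 0.0
    return sum(1 for entity in entities if entity in lexicon) / len(entities)


# Toy data, not taken from any of the four datasets.
toy_entities = ["印度河", "中国", "张三", "北京"]
toy_words = {"中国", "北京", "河流"}
toy_coverage = lexicon_coverage(toy_entities, toy_words)
```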

Comparison Methods
We applied the character-level and word-level methods as baselines for comparison, which incorporate the bichar, softword, and lexicon features. We also compared several state-of-the-art methods on the four datasets to verify the effectiveness of our method. We used the BMES tagging scheme for both character-level and word-level NER tagging.
Character-level methods: These methods are based on character sequences. We applied the bidirectional LSTM (Hochreiter and Schmidhuber, 1997) and CNN (Kim, 2014) as classic baseline methods.
Character-level methods + bichar + softword: Character bigrams are useful for capturing adjacent features and representing characters. We concatenated bigram embeddings with character embeddings to better leverage the bigram information. In addition, we added the segmentation information by incorporating segmentation label embeddings into the character representation. The BMES scheme is used for representing the word segmentation (Xue and Shen, 2003).
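The BMES scheme mentioned above assigns each character a label according to its position in a word: B (begin), M (middle), E (end), or S (single-character word). A minimal sketch of turning a segmented sentence into these softword labels, with a hypothetical function name and a toy segmentation, is:

```python
def bmes_labels(words):
    """Map a segmented sentence (list of non-empty words) to one BMES label per character."""
    labels = []
    for word in words:
        if len(word) == 1:
            labels.append("S")
        else:
            labels.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return labels


toy_labels = bmes_labels(["印度河", "流经", "中", "国"])
```

Each label would then be mapped to an embedding and concatenated with the character representation, as the paragraph above describes.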
Word-level methods: For the datasets with gold segmentation, we directly employed word-level NER methods to evaluate the performance, which are denoted as Gold seg. Otherwise, we first used an open source segmentation toolkit (see footnote 6) to automatically segment the datasets. Then word-level NER methods are applied, which are denoted as Auto seg. The bi-directional LSTM and CNN are also applied as baselines.
Word-level methods + char + bichar: For characters in the subsequence $w_{b,e}$, we first used a bi-directional LSTM to learn their hidden states and bigram states. We then augmented the word-level methods with the character-level features.
Lattice LSTM: Lattice LSTM (Zhang and Yang, 2018) incorporates word information into character-level recurrent units, which can avoid segmentation errors. This method achieved state-of-the-art performance on the four datasets.
5 https://github.com/jiesutd/LatticeLSTM
6 https://github.com/lancopku/PKUSeg-python

Hyper-parameter Settings
We used Adam (Kingma and Ba, 2014) as the optimizer, with a learning rate of 2e-5 for the large datasets, OntoNotes and MSRA, and 2e-4 for the small datasets, Weibo and Resume. A densely connected structure (Huang et al., 2017) was applied, which combines all hidden states from previous update steps as the final inputs for the aggregation modules at step t. To further reduce overfitting, we employed dropout (Srivastava et al., 2014) with a rate of 0.5 for embeddings and 0.2 for aggregation module outputs. The embedding size and state size were both set to 50. The number of attention heads was 10. The head dimension was set to 10 for the small datasets, Weibo and Resume, and 20 for OntoNotes and MSRA.
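The settings listed above can be collected into a small configuration object. The class name `LGNConfig`, its field names, and the helper `config_for` are hypothetical and do not correspond to identifiers in the released code; only the numeric values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LGNConfig:
    learning_rate: float
    head_dim: int
    embedding_size: int = 50
    state_size: int = 50
    num_heads: int = 10
    embedding_dropout: float = 0.5
    aggregation_dropout: float = 0.2


def config_for(dataset: str) -> LGNConfig:
    """Pick the dataset-dependent values reported in the text."""
    name = dataset.lower()
    if name in {"ontonotes", "msra"}:
        return LGNConfig(learning_rate=2e-5, head_dim=20)
    if name in {"weibo", "resume"}:
        return LGNConfig(learning_rate=2e-4, head_dim=10)
    raise ValueError(f"unknown dataset: {dataset}")
```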

Results and Discussion
In this section, we demonstrate the main results of LGN for the Chinese NER task across different domains. The model achieving the best results on the development set was chosen for the final evaluation on the test set. We also probe the effectiveness and interpretability of LGN through explanatory experiments.

Main Results
OntoNotes. Table 2 shows the results of word-level and character-level methods on OntoNotes with various settings. In the gold or automatic segmentation settings, the char and bichar features boost the performance of word-level methods. In particular, with the gold-standard segmentation, these methods are able to achieve competitive state-of-the-art results on the dataset. However, the gold-standard segmentation is not always available. On the other hand, automatic segmentation may introduce word segmentation errors and result in a loss of performance for the downstream NER task. A feasible solution is to apply character-level methods to avoid the need for word segmentation. Our proposed LGN is a character-level model based on a graph structure. It outperforms the lattice LSTM by 1.01% in F1 score and achieves a 3.00% higher F1 score than the LSTM with bichar and softword features. The LGN also significantly outperforms the word-level models with automatic segmentation.
MSRA/Weibo/Resume. Results on the MSRA, Weibo, and Resume datasets are shown in Tables 3, 4, and 5, respectively. Gold-standard segmentation is not available for the Weibo and Resume datasets and the test set of MSRA. The best classic methods leverage rich handcrafted features (Chen et al., 2006; Zhang et al., 2006; Zhou et al., 2013), embedding features (Lu et al., 2016; Peng and Dredze, 2015), radical features, cross-domain, and semi-supervised data (He and Sun, 2017) for Chinese NER. Compared with the existing methods and the word-level and character-level methods, our LGN model gives the best results by a large margin. Moreover, different from the lattice LSTM, which also leverages lexicon features, our LGN model integrates lexicon information into the graph neural network in a more effective fashion. As a result, it outperforms the lattice LSTM on all three datasets.

Steps of Message Passing
To investigate the influence of the step number T during the update process, we analyzed the performance of LGN under different step numbers. Figure 4 illustrates the variation of F1 score on the development sets as the step number increases. We used D-F1 to denote the F1 score at a given step minus the best result.
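The D-F1 quantity defined above is simply each step's F1 score shifted by the best score, so the best step sits at zero and all others are negative. A minimal sketch, with a hypothetical function name and made-up scores that are not results from the paper:

```python
def d_f1(f1_by_step):
    """Return the F1 score at each step minus the best F1 over all steps."""
    best = max(f1_by_step.values())
    return {step: f1 - best for step, f1 in f1_by_step.items()}


# Illustrative numbers only.
toy_curve = d_f1({1: 70.0, 2: 71.5, 3: 72.0})
```

Plotting this difference rather than raw F1 puts datasets with very different absolute scores on a common scale.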
The results indicate that the number of update steps is crucial to the performance of LGN, which peaks when T ≥ 3 on all four datasets. The F1 score decreases by 1.20% on average against the best results when the step number is less than 3. In particular, the F1 scores on the OntoNotes and Weibo datasets suffer serious reductions of around 1.5% and 1.8%, respectively. After several rounds of updates, the model gives steady and competitive results, which shows that LGN benefits from the update process. Empirically, at each update step, graph nodes aggregate information from their neighbors and incrementally gain more information from further reaches of the graph as the process iterates (Hamilton et al., 2017). In the LGN model, more valuable information can be captured through the recursive aggregation.

Ablation Experiments
To study the contribution of each component in LGN, we conducted ablation experiments on the four datasets and display the results in Table 6.
The results show that the model's performance is degraded if the global relay node is removed, indicating that global connections are useful in the graph structure. We also find that lexicons play an important role in character-level NER. In particular, the performance on the OntoNotes, MSRA, and Weibo datasets is seriously hurt, by over 3.0%, without lexicons. Moreover, removing both the edges and the global node causes a further performance loss.
To better illustrate the advantage of our model, we remove the CRF decoding layer and simplify the structure to a non-bidirectional version for both LGN and the lattice LSTM model. The results show that, with a single-direction structure, the LGN achieves an F1 score that is 0.77% higher on average than that of the lattice LSTM. In addition, the two models show an obvious performance gap when the CRF layer is removed. The F1 score of LGN decreases by 3.59% on average on the four datasets without CRF. In contrast, that of the lattice LSTM decreases by 6.24%. This shows that the LGN has a stronger ability to model sentences.
Figure 5 shows the performance of LGN and several baseline models on the OntoNotes dataset. We split the dataset into six parts according to the sentence length. The lattice LSTM is a strong baseline that outperforms the word+char+bichar and char+bichar+softword models over different sentence lengths. However, the accuracy of the lattice LSTM decreases significantly as the sentence length increases. In contrast, the LGN not only gives higher results on short sentences, but also shows its effectiveness and robustness when the sentence length is more than 80 characters. It gives a higher F1 score in most cases compared to the baselines, which indicates that global sentence semantics and long-range dependency can be better captured under the graph structure.

Case Study
Table 7 illustrates an example that probes the ability of LGN to tackle word ambiguity problems. The lattice LSTM ignores the sentence context and wrongly identifies "印度 (India)". With the global relay node removed, LGN makes the same mistake, which indicates that global connections are indispensable and can capture high-level information to help LGN better understand the sentence context. In contrast, with the global relay node, the LGN can correctly identify the entity boundary, even though the LGN graph composition states are updated for only one step. However, after one step it assigns an incorrect class to the entity "印度河 (The Indus River)", which is a location entity rather than a GPE (Geo-Political Entity). With the multi-step graph message passing process, the LGN is able to fuse the context information and finally detects the correct location entity.

Conclusion
In this work, we investigated a GNN-based approach to alleviate word ambiguity in Chinese NER. Lexicons are used to construct the graph and provide word-level features. The LGN enables interactions among different sentence compositions and can capture non-sequential dependencies between characters based on the global sentence semantics. As a result, it shows significantly improved performance on multiple datasets in different domains. The explanatory experiments also illustrate the effectiveness and interpretability of our proposed model.