Leverage Lexical Knowledge for Chinese Named Entity Recognition via Collaborative Graph Network

The lack of word boundary information has been seen as one of the main obstacles to developing a high-performance Chinese named entity recognition (NER) system. Fortunately, an automatically constructed lexicon contains rich word boundary and word semantic information. However, integrating lexical knowledge into Chinese NER still faces challenges when it comes to self-matched lexical words as well as the nearest contextual lexical words. We present a Collaborative Graph Network to solve these challenges. Experiments on various datasets show that our model not only outperforms the state-of-the-art (SOTA) results, but also runs six to fifteen times faster than the SOTA model.


Introduction
Named entity recognition (NER) aims to locate and classify certain occurrences of words or expressions in unstructured text into predefined semantic categories such as person names, locations, and organizations. NER is an essential pre-processing step for many natural language processing (NLP) applications, such as relation extraction (Bunescu and Mooney, 2005), event extraction (Chen et al., 2015), and question answering (Mollá et al., 2006). In English NER, LSTM-CRF models (Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016) leveraging word-level and character-level representations achieve state-of-the-art results.
In this paper, we focus on Chinese NER. Compared with English, Chinese has no obvious word boundaries. Without word boundary information, it is intuitive to use character information only for Chinese NER (He and Wang, 2008; Liu et al., 2010; Li et al., 2014), although such methods could disregard word information. However, word information is very useful in Chinese NER, because word boundaries are usually the same as named entity boundaries. For example, as shown in Figure 1, the boundaries of the word "北京机场" (Beijing airport) are the same as the boundaries of the named entity "北京机场" (Beijing airport). Therefore, making full use of word information would help to improve Chinese NER performance.
The code is available at https://github.com/DianboWork/Graph4CNER
There are three main ways to incorporate word information in NER. The first is the pipeline method, which applies Chinese Word Segmentation (CWS) first and then uses a word-based NER model. However, the pipeline method suffers from error propagation, since CWS errors may affect the performance of NER. The second is to learn the CWS and NER tasks jointly (Xu et al., 2013; Peng and Dredze, 2016; Cao et al., 2018; Wu et al., 2019).
However, the joint models must rely on CWS annotation datasets, which are costly and are annotated under many diverse segmentation criteria (Chen et al., 2017). The third is to leverage an automatically constructed lexicon, which is pre-trained on large automatically segmented texts. Lexical knowledge includes boundary and semantic information. Boundary information is provided by the lexicon word itself, and semantic information is provided by pre-trained word embeddings (Bengio et al., 2003; Mikolov et al., 2013). Compared with joint methods, a lexicon is easy to obtain and additional annotated CWS datasets are not required. Recently, Zhang and Yang (2018) proposed a lattice LSTM to integrate lexical knowledge in NER. However, integrating lexical knowledge into sentences still faces two challenges.
The first challenge is to integrate self-matched lexical words. A self-matched lexical word of a character is a lexical word that contains this character. For instance, "北京机场" (Beijing Airport) and "机场" (Airport) are the self-matched words of the character "机" (airplane). "离开" (leave) is not a self-matched word of the character "机" (airplane), since "机" (airplane) is not contained in the word "离开" (leave). The lexical knowledge of self-matched words is useful in Chinese NER. For example, as shown in Figure 1, the boundary and semantic knowledge of the self-matched word "北京机场" (Beijing Airport) can help the character "机" (airplane) to predict an "I-LOC" tag, instead of "O" or "B-LOC" tags. However, due to the limits of the word-character lattice, the lattice LSTM (Zhang and Yang, 2018) fails to integrate the self-matched word "北京机场" (Beijing Airport) into the character "机" (airplane).
The second challenge is to integrate the nearest contextual lexical words directly. The nearest contextual lexical word of a character is the word that matches the nearest past or future subsequence of this character in the given sentence. For instance, the lexical word "离开" (leave) is the nearest contextual word of the character "顿" (-ton), since the word matches the nearest future subsequence "离开" of the character, while "北京" (Beijing) is not the nearest contextual lexical word of this character. The nearest contextual lexical words are beneficial for Chinese NER. For example, as shown in Figure 1, by directly using the semantic knowledge of the nearest contextual word "离开" (leave), an "I-PER" tag can be predicted instead of an "I-ORG" tag, since "希尔顿" (Hilton Hotels) cannot be taken as the subject of the verb "离开" (leave). However, the lattice model (Zhang and Yang, 2018) only implicitly integrates the knowledge of the nearest contextual lexical words via the previous hidden state, so the information of the nearest contextual lexical word may be disturbed by other information.
To solve the above challenges, we propose a character-based Collaborative Graph Network, including an encoding layer, a graph layer, a fusion layer and a decoding layer. Specifically, there are three word-character interactive graphs in the graph layer. The first one is the Containing graph (C-graph), which is designed for integrating self-matched lexical words. It models the connection between characters and self-matched lexical words. The second one is the Transition graph (T-graph), which builds the direct connection between characters and the nearest contextual matched words. It helps to handle the challenge of integrating the nearest contextual words directly. The third one is the Lattice graph (L-graph), which is inspired by the lattice LSTM (Zhang and Yang, 2018). The L-graph implicitly captures partial information of self-matched lexical words and the nearest contextual lexical words through multiple hops. These graphs are built without external NLP tools, which avoids the error propagation problem. Besides, these graphs complement each other, and a fusion layer is designed for collaboration between them. We test our model on various Chinese NER datasets. Our model not only significantly outperforms the existing state-of-the-art (SOTA) model but also runs six to fifteen times faster than the SOTA model.
In summary, our main contributions are as follows:
• We propose a Collaborative Graph Network to integrate lexical knowledge directly and efficiently for Chinese NER.
• To solve the challenges of integrating self-matched lexical words and the nearest contextual lexical words, we propose three word-character interactive graphs. These interactive graphs can capture different lexical knowledge and are built without external NLP tools.
• We achieve the state-of-the-art results in various popular Chinese NER datasets, and our model achieves a 6-15x speedup over the existing SOTA model.

Related Work
NER. There is rich literature on NER. This includes statistical methods, such as SVM (Isozaki and Kazawa, 2002), HMMs (Bikel et al., 1997) and CRF (Lafferty et al., 2001), which depend on feature engineering. There are also a number of recent neural network approaches applied to NER, such as (Collobert et al., 2011; Huang et al., 2015; Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016; Akbik et al., 2018; Jie et al., 2019; Akbik et al., 2019). Compared with English, Chinese does not have obvious word boundaries, but it is important to leverage word boundary and semantic information in Chinese NER. Many works use word segmentation information as extra features for Chinese NER, such as ( ) heavily rely on the dependency tree to construct a single graph, which suffer from error propagation. To capture different semantic and boundary information, we propose a Collaborative Graph Network consisting of three automatically constructed graphs, which naturally avoids the error propagation problem. To the best of our knowledge, we are the first to introduce GAT and automatically constructed semantic graphs to Chinese NER tasks.

Approach
In this section, we first introduce the construction of graphs to integrate self-matched lexical words and the nearest contextual lexical words into sentences. We then introduce the architecture of Collaborative Graph Network as a core for solving Chinese NER tasks.

The Construction of Graphs
To integrate self-matched lexical words and the nearest contextual lexical words, we propose three word-character interactive graphs. The first is the word-character Containing graph (C-graph), which assists the character in capturing the boundary and semantic information of self-matched lexical words. The second is the word-character Transition graph (T-graph), which assists the character in capturing the semantic information of the nearest contextual lexical words. The third is the Lattice graph (L-graph). Zhang and Yang (2018) propose a lattice structure, nested in the LSTM (Hochreiter and Schmidhuber, 1997), to integrate lexical knowledge. We free the lattice structure from the LSTM and adopt it as the third graph. These three graphs share the same vertex set, but their edge sets are completely different. The vertex set is made up of the characters in the sentence and the matched lexical words. For example, as shown in Figure 1, the vertex set is V = {希, 尔, ..., 了, 希尔, 希尔顿, ..., 北京机场}. Each edge set is represented by an adjacency matrix, whose elements indicate whether pairs of vertices are adjacent in the graph. Since the edge sets of the three graphs are different, the adjacency matrices of the three graphs are introduced separately below:

Word-Character Containing graph
With the C-graph, the characters in the sentence can capture the boundary and semantic information of self-matched lexical words. As shown in Figure 2, if a lexical word i contains a character j, the (i, j)-entry of the adjacency matrix $A^C$ of the C-graph is assigned a value of 1.
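The containment rule can be illustrated with a short Python sketch. The function name `build_c_graph`, the vertex ordering (characters first, then matched words), the inclusive span encoding, and the choice to store each edge in both directions are assumptions made for this example and are not taken from the released implementation. The spans assume a ten-character sentence laid out like the Figure 1 example.

```python
def build_c_graph(n_chars, word_spans):
    """Adjacency matrix of the C-graph.

    Vertices 0..n_chars-1 are characters; vertex n_chars + w is the w-th
    matched word, given as an inclusive (start, end) character span.
    """
    size = n_chars + len(word_spans)
    adj = [[0] * size for _ in range(size)]
    for w, (start, end) in enumerate(word_spans):
        word_node = n_chars + w
        for char_node in range(start, end + 1):
            adj[word_node][char_node] = 1  # lexical word i contains character j
            adj[char_node][word_node] = 1  # assumption: edge stored in both directions
    return adj


# Assumed layout of a ten-character sentence like the Figure 1 example:
# Hilton = chars 0-2, leave = 3-4, Beijing = 5-6, Beijing Airport = 5-8, Airport = 7-8.
WORD_SPANS = [(0, 2), (3, 4), (5, 6), (5, 8), (7, 8)]
A_C = build_c_graph(10, WORD_SPANS)

# Indices of the words linked to character 7 (the first character of "Airport").
self_matched = [w for w in range(len(WORD_SPANS)) if A_C[10 + w][7]]
print(self_matched)  # prints [3, 4]: "Beijing Airport" and "Airport"
```

Under these assumptions, the character at position 7 is connected to both of its self-matched words, while "leave" (word index 1) is not connected to it.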

Word-Character Transition graph
The T-graph assists the character in capturing the semantic information of the nearest contextual lexical words. As shown in Figure 3, if a lexical word i or a character m matches the nearest preceding or following subsequence of a character j, the (i, j)-entry or (m, j)-entry of the adjacency matrix $A^T$ of the T-graph is assigned a value of 1. Moreover, to capture the context relation between lexical words, if a lexical word i is the preceding or following context of another lexical word k, we assign $A^T_{ik} = 1$. Note that the T-graph is the same as the word cutting graph used in Chinese Word Segmentation.

Word-Character Lattice graph
Zhang and Yang (2018) propose a lattice structure LSTM to exploit lexical knowledge for Chinese NER. A lattice structure can capture the information of the nearest contextual lexical words implicitly and capture some information of self-matched lexical words. As shown in Figure 4, if a character m is the nearest preceding or following character of a character j, the (m, j)-entry of the adjacency matrix $A^L$ of the L-graph is assigned a value of 1. Moreover, if a character j matches the first or last character of a lexical word i, we assign $A^L_{ij} = 1$.
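A minimal Python sketch of one reading of the T-graph and L-graph rules follows. The function names, the inclusive span encoding, and the symmetric storage of edges are assumptions for illustration; in particular, "nearest preceding or following subsequence" is interpreted here as a word or character that ends immediately before, or starts immediately after, the character in question.

```python
def _connect(adj, a, b):
    adj[a][b] = 1
    adj[b][a] = 1  # assumption: edges stored in both directions


def build_t_graph(n_chars, word_spans):
    """T-graph: characters and words linked to their immediate context."""
    size = n_chars + len(word_spans)
    adj = [[0] * size for _ in range(size)]
    for j in range(n_chars - 1):
        _connect(adj, j, j + 1)  # neighbouring characters
    for w, (start, end) in enumerate(word_spans):
        node = n_chars + w
        if start > 0:
            _connect(adj, node, start - 1)  # word directly follows this character
        if end + 1 < n_chars:
            _connect(adj, node, end + 1)  # word directly precedes this character
        for v, (next_start, _) in enumerate(word_spans):
            if next_start == end + 1:
                _connect(adj, node, n_chars + v)  # word directly followed by word
    return adj


def build_l_graph(n_chars, word_spans):
    """L-graph: neighbouring characters, plus each word tied to its two end characters."""
    size = n_chars + len(word_spans)
    adj = [[0] * size for _ in range(size)]
    for j in range(n_chars - 1):
        _connect(adj, j, j + 1)
    for w, (start, end) in enumerate(word_spans):
        _connect(adj, n_chars + w, start)  # first character of the word
        _connect(adj, n_chars + w, end)  # last character of the word
    return adj


# Assumed layout of a ten-character sentence like the Figure 1 example:
# Hilton = chars 0-2, leave = 3-4, Beijing = 5-6, Beijing Airport = 5-8, Airport = 7-8.
WORD_SPANS = [(0, 2), (3, 4), (5, 6), (5, 8), (7, 8)]
A_T = build_t_graph(10, WORD_SPANS)
A_L = build_l_graph(10, WORD_SPANS)
```

With these spans, `A_T[11][2]` is 1 because "leave" directly follows the character "-ton", whereas `A_L[13][7]` is 0: the lattice links "Beijing Airport" only to its first and last characters, which matches the limitation that motivates the C-graph.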

Model
A character-based Collaborative Graph Network includes an encoding layer, a graph layer, a fusion layer, and a decoding layer. The encoding layer is to capture contextual information of the sentence and to represent the semantic information of lexical words. The graph layer is based on GAT (Veličković et al., 2018) for modeling over three word-character interactive graphs. A fusion layer is used for fusing different lexical knowledge captured by these three graphs. Finally, a standard CRF (Lafferty et al., 2001) model is used for decoding labels.

Encoding
The input of the model is a sentence and all lexical words that match consecutive subsequences of the sentence. We denote the sentence as $s = \{c_1, c_2, \ldots, c_n\}$, where $c_i$ is the i-th character, and denote the matched lexical words as $l = \{l_1, l_2, \ldots, l_m\}$. Each character $c_i$ is represented as a vector $x_i$ by looking it up in a pre-trained character embedding matrix:
$$x_i = e^c(c_i)$$
where $e^c$ is a character embedding lookup table.
To represent the semantic information of lexical words, we look up word embeddings from a pre-trained word embedding matrix, and each lexical word $l_i$ is represented as a semantic vector $wv_i$:
$$wv_i = e^w(l_i)$$
where $e^w$ is a word embedding lookup table. We concatenate the contextual representation and the word embeddings as the output of this layer, denoted as $Node_f$.
[Figure: The left side shows the overall architecture, including an encoding layer, a graph layer, a fusion layer, and a decoding layer. On the right side, we show the details of graph attention networks over three word-character interactive graphs. We use blue to denote the characters in the sentence and use green to denote the matched lexicon words.]
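The inputs of the encoding layer can be sketched in Python as follows. Everything named here is hypothetical: the helper names, the toy lexicon, the two-dimensional toy embeddings, the `max_len` cap, and the `contextualize` argument, which stands in for the contextual encoder and is simply the identity in the usage lines. The example sentence is our reading of the Figure 1 sentence.

```python
def match_lexicon(sentence, lexicon, max_len=4):
    """Return (start, end, word) for every multi-character lexicon word
    that matches a consecutive subsequence of the sentence (end inclusive)."""
    matches = []
    for start in range(len(sentence)):
        for end in range(start + 1, min(len(sentence), start + max_len)):
            word = sentence[start:end + 1]
            if word in lexicon:
                matches.append((start, end, word))
    return matches


def build_node_features(sentence, lexicon, char_emb, word_emb, contextualize):
    """Assemble node features: n character vectors followed by m word vectors."""
    matches = match_lexicon(sentence, lexicon)
    x = [char_emb[c] for c in sentence]  # character embedding lookup
    h = contextualize(x)  # contextual representation (placeholder encoder)
    wv = [word_emb[w] for _, _, w in matches]  # word embedding lookup
    return h + wv, matches  # list concatenation: characters first, then words


# Toy data (assumptions for illustration only).
SENTENCE = "希尔顿离开北京机场了"
LEXICON = {"希尔顿", "离开", "北京", "机场", "北京机场"}
CHAR_EMB = {c: [float(i), 1.0] for i, c in enumerate(sorted(set(SENTENCE)))}
WORD_EMB = {w: [0.0, float(len(w))] for w in LEXICON}

node_f, matches = build_node_features(
    SENTENCE, LEXICON, CHAR_EMB, WORD_EMB, contextualize=lambda x: x
)
print([w for _, _, w in matches])
print(len(node_f))
```

Keeping the characters in the first n positions is what later allows the graph layer to discard the word columns and keep only character representations.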

Graph Attention Networks over Word-Character Interactive Graphs
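We use Graph Attention Networks (GAT) to model the three interactive graphs. In an M-layer GAT, the input of the j-th layer is a set of node features, $NF^j = \{f_1, f_2, \ldots, f_N\}$, together with an adjacency matrix $A$, where $f_i \in \mathbb{R}^F$, $A \in \mathbb{R}^{N \times N}$, $N$ denotes the number of nodes and $F$ is the dimension of features at the j-th layer. The output of the j-th layer is a new set of node features, $NF^{(j+1)} = \{f'_1, f'_2, \ldots, f'_N\}$. A GAT operation with $K$ independent attention heads can be written as:
$$f'_i = \Big\Vert_{k=1}^{K} \sigma\Big(\sum_{j \in N_i} \alpha^k_{ij} W^k f_j\Big)$$
$$\alpha^k_{ij} = \frac{\exp\big(\mathrm{LeakyReLU}(a^T [W^k f_i \,\Vert\, W^k f_j])\big)}{\sum_{u \in N_i} \exp\big(\mathrm{LeakyReLU}(a^T [W^k f_i \,\Vert\, W^k f_u])\big)}$$
where $\Vert$ denotes the concatenation operation, $\sigma$ is a nonlinear activation function, $N_i$ is the neighborhood of node i in the graph, $\alpha^k_{ij}$ are the attention coefficients, $W^k \in \mathbb{R}^{F' \times F}$, and $a \in \mathbb{R}^{2F'}$ is a single-layer feed-forward neural network. Note that the dimension of the output $f'_i$ is $KF'$. At the last layer, averaging is adopted instead of concatenation, and the dimension of the final output features is $F'$:
$$f^{final}_i = \sigma\Big(\frac{1}{K} \sum_{k=1}^{K} \sum_{j \in N_i} \alpha^k_{ij} W^k f_j\Big)$$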
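The graph attention operation can be sketched in pure Python under the standard GAT formulation of Veličković et al. (2018). The function names are hypothetical, `math.tanh` is only a stand-in for the unspecified activation σ, the LeakyReLU slope of 0.2 is an assumption, and the neighborhood of a node is taken to be exactly the nonzero entries of its adjacency row (no self-loops are added).

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gat_head(features, adj, W, a):
    """One attention head: for each node i, sum_j alpha_ij * (W f_j), before sigma."""
    proj = [matvec(W, f) for f in features]
    out = []
    for i in range(len(features)):
        neigh = [j for j in range(len(features)) if adj[i][j]]
        if not neigh:
            out.append([0.0] * len(W))  # isolated node: nothing to aggregate
            continue
        # Unnormalised attention logits a^T [W f_i || W f_j].
        scores = [
            leaky_relu(sum(p * q for p, q in zip(a, proj[i] + proj[j])))
            for j in neigh
        ]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        alphas = [e / total for e in exps]
        out.append([
            sum(al * proj[j][d] for al, j in zip(alphas, neigh))
            for d in range(len(W))
        ])
    return out


def gat_layer(features, adj, heads, last=False, act=math.tanh):
    """heads is a list of (W, a) pairs. Heads are concatenated, or averaged if last."""
    per_head = [gat_head(features, adj, W, a) for W, a in heads]
    n_nodes = len(features)
    if last:
        k = len(heads)
        dim = len(per_head[0][0])
        return [
            [act(sum(h[i][d] for h in per_head) / k) for d in range(dim)]
            for i in range(n_nodes)
        ]
    return [[act(v) for h in per_head for v in h[i]] for i in range(n_nodes)]


# Tiny usage example: three nodes, two identical heads with uniform attention.
FEATS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ADJ = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
HEADS = [(IDENTITY, [0.0] * 4), (IDENTITY, [0.0] * 4)]
hidden = gat_layer(FEATS, ADJ, HEADS)  # each node gets K * F' = 4 values
final = gat_layer(FEATS, ADJ, HEADS, last=True)  # each node gets F' = 2 values
```

With a zero attention vector the coefficients are uniform, so each node simply receives the mean of its neighbours' projected features, which makes the example easy to check by hand.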
To model the three different word-character interactive graphs, we build three independent graph attention networks, denoted as $GAT_1$, $GAT_2$, and $GAT_3$. Since the three word-character interactive graphs share the same vertex set, the input node features of all three GATs are the matrix $Node_f$, the output of the encoding layer. The output node features are denoted as $G_1$, $G_2$ and $G_3$:
$$G_1 = GAT_1(Node_f, A^C), \quad G_2 = GAT_2(Node_f, A^T), \quad G_3 = GAT_3(Node_f, A^L)$$
where $G_k \in \mathbb{R}^{F' \times (n+m)}$, $k \in \{1, 2, 3\}$. We keep the first n columns of these matrices and discard the last m columns, because only character representations are used to decode labels:
$$Q_k = G_k[:, 0:n], \quad k \in \{1, 2, 3\}$$

Fusion Layer
A fusion layer is used to fuse the different lexical knowledge captured by the word-character interactive graphs. The input of the fusion layer is the contextual representation $H$ and the outputs of the graph layer $Q_i$, $i \in \{1, 2, 3\}$. The equation of the fusion layer is:
$$R = W_1 H + W_2 Q_1 + W_3 Q_2 + W_4 Q_3$$
where $W_1$, $W_2$, $W_3$ and $W_4$ are trainable matrices. Via the fusion layer, we obtain a matrix $R \in \mathbb{R}^{F' \times n}$, which is a new sentence representation integrating the contextual information as well as the lexical knowledge of self-matched lexical words and the nearest contextual lexical words.
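As a rough illustration, the sketch below assumes the fusion is a plain linear combination, R = W1·H + W2·Q1 + W3·Q2 + W4·Q3, which is the natural reading of four trainable matrices applied to the four inputs. The function names are hypothetical, matrices are stored as lists of per-node column vectors, and the graph outputs are truncated to the first n (character) columns before fusing.

```python
def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def fuse(H, graph_outputs, weights, n_chars):
    """Fuse contextual vectors H with three graph outputs.

    H: list of n_chars vectors.
    graph_outputs: three lists of (n_chars + m) vectors, characters first.
    weights: [W1, W2, W3, W4], each a matrix given as a list of rows.
    """
    W1, *graph_weights = weights
    # Keep character columns only; word columns are not used for decoding.
    Q = [G[:n_chars] for G in graph_outputs]
    R = []
    for t in range(n_chars):
        r = matvec(W1, H[t])
        for Wk, Qk in zip(graph_weights, Q):
            r = [x + y for x, y in zip(r, matvec(Wk, Qk[t]))]
        R.append(r)
    return R


# Toy usage: two characters, one matched word, identity weights.
IDENTITY = [[1, 0], [0, 1]]
H = [[1, 2], [3, 4]]
G1 = [[1, 0], [0, 1], [9, 9]]  # the last vector belongs to the word node
G2 = [[0, 0], [0, 0], [0, 0]]
G3 = [[1, 1], [1, 1], [5, 5]]
R = fuse(H, [G1, G2, G3], [IDENTITY] * 4, n_chars=2)
print(R)  # prints [[3, 3], [4, 6]]
```

The word-node vectors (`[9, 9]` and `[5, 5]`) never reach `R`, mirroring the step in which the last m columns are discarded.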

Decoding and Training
We use a standard CRF (Lafferty et al., 2001) layer to capture the dependencies between successive labels. Given a sentence $s = \{c_1, c_2, \ldots, c_n\}$, the input of the CRF layer is $R = \{r_1, r_2, \ldots, r_n\}$, and the probability of the ground-truth tag sequence $y = \{y_1, y_2, \ldots, y_n\}$ is
$$p(y \mid s) = \frac{\exp\big(\sum_{i} (W^{y_i} r_i + T_{(y_{i-1}, y_i)})\big)}{\sum_{y'} \exp\big(\sum_{i} (W^{y'_i} r_i + T_{(y'_{i-1}, y'_i)})\big)}$$
Here $y'$ is an arbitrary label sequence, $W^{y_i}$ is used for modeling the emission potential for the i-th character in the sentence, and $T$ is the transition matrix storing the score of transferring from one tag to another. The Viterbi algorithm (Viterbi, 1967) is used to get the label sequence with the highest score. Given manually annotated training data $\{(s_i, y_i)\}_{i=1}^{N}$, we optimize the model by minimizing the negative log-likelihood loss with $L_2$ regularization. The loss function is defined as:
$$L = -\sum_{i=1}^{N} \log p(y_i \mid s_i) + \frac{\lambda}{2} \lVert \Theta \rVert^2$$
where $\lambda$ denotes the $L_2$ regularization parameter and $\Theta$ is the set of all trainable parameters.
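The decoding and training objective for a linear-chain CRF can be sketched in pure Python. This is a generic textbook version, not the authors' implementation: the function names are hypothetical, `emissions[i][t]` is assumed to already hold the emission score of tag t at position i, and no special start or end transitions are modeled. Regularization is omitted.

```python
import math


def logsumexp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def sequence_score(emissions, transitions, tags):
    """Unnormalised score of one tag sequence."""
    score = emissions[0][tags[0]]
    for i in range(1, len(tags)):
        score += transitions[tags[i - 1]][tags[i]] + emissions[i][tags[i]]
    return score


def log_partition(emissions, transitions):
    """Log of the sum of exp(score) over all tag sequences (forward algorithm)."""
    n_tags = len(emissions[0])
    alpha = list(emissions[0])
    for em in emissions[1:]:
        alpha = [
            em[t] + logsumexp([alpha[p] + transitions[p][t] for p in range(n_tags)])
            for t in range(n_tags)
        ]
    return logsumexp(alpha)


def neg_log_likelihood(emissions, transitions, tags):
    return log_partition(emissions, transitions) - sequence_score(
        emissions, transitions, tags
    )


def viterbi(emissions, transitions):
    """Highest-scoring tag sequence."""
    n_tags = len(emissions[0])
    score = list(emissions[0])
    backpointers = []
    for em in emissions[1:]:
        best_prev = [
            max(range(n_tags), key=lambda p: score[p] + transitions[p][t])
            for t in range(n_tags)
        ]
        score = [
            score[best_prev[t]] + transitions[best_prev[t]][t] + em[t]
            for t in range(n_tags)
        ]
        backpointers.append(best_prev)
    path = [max(range(n_tags), key=lambda t: score[t])]
    for best_prev in reversed(backpointers):
        path.append(best_prev[path[-1]])
    return path[::-1]


# Toy example: two positions, two tags.
EMISSIONS = [[1.0, 0.0], [0.0, 1.0]]
TRANSITIONS = [[0.0, 0.5], [0.0, 0.0]]
best_path = viterbi(EMISSIONS, TRANSITIONS)
loss = neg_log_likelihood(EMISSIONS, TRANSITIONS, best_path)
print(best_path)  # prints [0, 1]
```

The four possible sequences score 1, 2.5, 0 and 1, so the Viterbi path is `[0, 1]` and the loss for that path is the log-partition value minus 2.5.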

Experiments
In this section, we carry out extensive experiments to investigate the effectiveness of the Collaborative Graph Network.
On Weibo NER, we use the same training, development and test split as Peng and Dredze (2015). On OntoNotes, we use the same data split as . Since the MSRA dataset does not have a development set, we randomly select 10% samples from the training set as the development set.

Experimental Settings
In our experiments, we use the same character embeddings as Zhang and Yang (2018), which are pretrained on Chinese Giga-Word. We use the lexicon provided by Li et al. (2018), which includes 1.3 million Chinese words. We set the dimensionality of LSTM hidden states to 300 and set the initial learning rate to 0.001. Since the scale of each dataset varies, we set a different training batch size for each dataset. Specifically, we set the batch sizes of MSRA, OntoNotes and Weibo NER to 64, 20 and 10, respectively. We use the stochastic gradient descent (SGD) algorithm to optimize parameters on OntoNotes and Weibo NER, and use the Adam (Kingma and Ba, 2014) algorithm to optimize parameters on MSRA. We stop training when we find the best result on the development set.
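For convenience, the settings stated above can be collected in a small Python mapping. The key names and the helper `settings_for` are hypothetical and do not correspond to options of the released code; only the values come from the text.

```python
# Hyperparameters as stated in the text; key names are illustrative only.
SHARED_SETTINGS = {
    "lstm_hidden_size": 300,
    "initial_learning_rate": 0.001,
}

DATASET_SETTINGS = {
    "MSRA": {"batch_size": 64, "optimizer": "Adam"},
    "OntoNotes": {"batch_size": 20, "optimizer": "SGD"},
    "Weibo NER": {"batch_size": 10, "optimizer": "SGD"},
}


def settings_for(dataset):
    """Merge the shared settings with the dataset-specific ones."""
    merged = dict(SHARED_SETTINGS)
    merged.update(DATASET_SETTINGS[dataset])
    return merged


print(settings_for("MSRA"))
```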

Overall Performance
Weibo NER. Table 1 shows the results on Weibo NER. Zhu and Wang (2019) propose a Convolutional Attention Network using segmentation information, which is the existing state-of-the-art (SOTA) model. Our model outperforms the SOTA model by 3.78%, 1.07% and 5.34% in F1 score on Overall, Named Entity, and Nominal Mention, respectively. Zhang and Yang (2018) propose a lattice LSTM to integrate lexical knowledge. Our model outperforms the lattice LSTM by 4.3%, 3.41% and 6.07% in F1 score on Overall, Named Entity, and Nominal Mention, respectively.
OntoNotes. Table 2 shows the results on OntoNotes. Compared with the lattice LSTM (Zhang and Yang, 2018), our model gains a 0.91% improvement in F1 score. Compared with the best result (Yang et al., 2016), our model doesn't rely on gold-standard segmentation, which is not available in the real world. Note that our model even outperforms the model proposed by (Yang et al., 2016; Zhu and Wang, 2019), which uses the information of gold-standard segmentation.
MSRA. Results on the MSRA dataset are shown in Table 3. By leveraging hand-crafted features (Chen et al., 2006; Zhang et al., 2006; Zhou et al., 2013) and character embeddings (Lu et al., 2016), statistical models achieve good results on the MSRA dataset. Dong et al. (2016) integrate LSTM-CRF with radical features and Zhang and Yang (2018) propose a lattice LSTM to integrate lexical knowledge. Our model outperforms the lattice LSTM by 0.29% in F1 score on the MSRA dataset.
Speed. As an essential preprocessing NLP tool, NER tasks require high speeds of both training and testing. Since aligning word-character lattice structure for batch training is usually non-trivial, the lattice LSTM (Zhang and Yang, 2018) suffers from slow speeds in training and testing. However, both LSTM and GAT in our model can compute efficiently by batch training.
For a fair comparison, both the lattice LSTM and our model are implemented in PyTorch. Using a single NVIDIA GeForce GTX 1080 Ti GPU, we randomly select 10 training and testing epochs as samples. The average training and testing times are shown in Table 4. Our model achieves a 6-15x speedup over the lattice LSTM.

Effectiveness of Three Word-Character Interactive Graphs
We conduct ablation experiments to demonstrate the effectiveness of these three word-character interactive graphs.
Comparison Setting. We design the ablation studies as follows: 1) w/o C: without the word-character Containing graph (C-graph). 2) w/o T: without the word-character Transition graph (T-graph). 6) w/o T & L: without T-graph and L-graph, only keeping C-graph. 7) BiLSTM+CRF: baseline model.
Table 6: Case study. w/o C-graph predicted label means the label predicted without C-graph, and w/o T-graph predicted label means the label predicted without T-graph. We use green to denote the correct labels and use red to denote the wrong labels.

Comparison Results. Table 5 shows the results of the ablation experiments. Removing any graph causes obvious performance degradation, but the importance of each graph varies from dataset to dataset. Specifically, on OntoNotes and MSRA, 'w/o T-graph' obtains worse performance than the other settings, showing that T-graph is important. However, T-graph performs poorly without cooperating with the other graphs. We conjecture that T-graph can only capture the information of the nearest contextual lexical words, so relying solely on T-graph is not enough. On Weibo NER, these graphs show equal importance. Since dialects, slang and irregular phrases are very common in the social domain, we must rely on C-graph, T-graph, and L-graph jointly to handle the informal and complex contexts. In conclusion, the ablation experiments show that each graph can be used independently of the others, but together they achieve the best result, showing that all three graphs are essential to our model.

Case Study
To show visually that our model can solve the challenges of integrating self-matched lexical words and the nearest contextual lexical words, a case study comparing the model without C-graph, the model without T-graph and the complete model is shown in Table 6. In the first case, there is an entity "西安电子科技大学" (Xidian University) with nested "西安" (Xi'an) and "电子科技大学" (UESTC). These common entities are all in the lexicon. Without C-graph, the model can't integrate the information of the self-matched lexical word "西安电子科技大学" (Xidian University) into the characters "电" and "安". Influenced by another lexical word "电子科技大学" (UESTC), the predicted label of the character "电" is "B-ORG", and the label of the character "安" is predicted to be "I-ORG", affected by the lexical word "西安" (Xi'an). In the second case, there is an entity "联想" (Lenovo), which can also be a common verb ("associate") in Chinese. Without T-graph, the model can't integrate the information of the nearest contextual lexical words "腾讯" (Tencent) and "联合" (Joint) into the characters "联" and "想", so the predicted labels of the characters "联" and "想" are both "O". However, with the help of T-graph, the model can use the information of the nearest contextual lexical words "腾讯" (Tencent) and "联合" (Joint) to predict the correct labels.

Conclusion
In this paper, we propose a Collaborative Graph Network for integrating lexical knowledge in Chinese NER. The core of the network is three lexical word-character interactive graphs. These interactive graphs can capture different lexical knowledge and are built without external NLP tools. We show through various experiments that our model has complementary strengths to the SOTA model and these interactive graphs are effective.