Simplify the Usage of Lexicon in Chinese NER

Recently, many works have tried to augment the performance of Chinese named entity recognition (NER) using word lexicons. As a representative, Lattice-LSTM has achieved new benchmark results on several public Chinese NER datasets. However, Lattice-LSTM has a complex model architecture. This limits its application in many industrial areas where real-time NER responses are needed. In this work, we propose a simple but effective method for incorporating the word lexicon into the character representations. This method avoids designing a complicated sequence modeling architecture, and for any neural NER model, it requires only subtle adjustment of the character representation layer to introduce the lexicon information. Experimental studies on four benchmark Chinese NER datasets show that our method achieves an inference speed up to 6.15 times faster than those of state-of-the-art methods, along with a better performance. The experimental results also show that the proposed method can be easily incorporated with pre-trained models like BERT.


Introduction
Named Entity Recognition (NER) is concerned with the identification of named entities, such as persons, locations, and organizations, in unstructured text. NER plays an important role in many downstream tasks, including knowledge base construction (Riedel et al., 2013), information retrieval (Chen et al., 2015), and question answering (Diefenbach et al., 2018). In languages where words are naturally separated (e.g., English), NER has been conventionally formulated as a sequence labeling problem, and the state-of-the-art results have been achieved using neural-network-based models (Huang et al., 2015; Chiu and Nichols, 2016; Liu et al., 2018). The source code of this paper is publicly available at https://github.com/v-mipeng/LexiconAugmentedNER.
Compared with NER in English, Chinese NER is more difficult since sentences in Chinese are not naturally segmented. Thus, a common practice for Chinese NER is to first perform word segmentation using an existing Chinese word segmentation (CWS) system and then apply a word-level sequence labeling model to the segmented sentence (Yang et al., 2016; He and Sun, 2017b). However, it is inevitable that the CWS system will incorrectly segment some query sentences. This results in errors in the detection of entity boundaries and the prediction of entity categories in NER. Therefore, some approaches resort to performing Chinese NER directly at the character level, which has been empirically proven to be effective (He and Wang, 2008; Liu et al., 2010; Li et al., 2014; Sui et al., 2019; Gui et al., 2019b; Ding et al., 2019).
A drawback of the purely character-based NER method is that the word information is not fully exploited. With this consideration, Zhang and Yang (2018) proposed Lattice-LSTM for incorporating word lexicons into the character-based NER model. Moreover, rather than heuristically choosing a word for the character when it matches multiple words in the lexicon, the authors proposed to preserve all words that match the character, leaving the subsequent NER model to determine which word to apply. To realize this idea, they introduced an elaborate modification to the sequence modeling layer of the LSTM-CRF model (Huang et al., 2015). Experimental studies on four Chinese NER datasets have verified the effectiveness of Lattice-LSTM. However, the model architecture of Lattice-LSTM is quite complicated. In order to introduce lexicon information, Lattice-LSTM adds several additional edges between nonadjacent characters in the input sequence, which significantly slows its training and inference speeds. In addition, it is difficult to transfer the structure of Lattice-LSTM to other neural-network architectures (e.g., convolutional neural networks and transformers) that may be more suitable for some specific tasks.
In this work, we propose a simpler method to realize the idea of Lattice-LSTM, i.e., incorporating all the matched words for each character into a character-based NER model. The first principle of our model design is to achieve a fast inference speed. To this end, we propose to encode lexicon information in the character representations, and we design the encoding scheme to preserve as much of the lexicon matching results as possible. Compared with Lattice-LSTM, our method avoids the need for a complicated model architecture, is easier to implement, and can be quickly adapted to any appropriate neural NER model by adjusting the character representation layer. In addition, ablation studies show the superiority of our method in incorporating more complete and distinct lexicon information, as well as introducing a more effective word-weighting strategy. The contributions of this work can be summarized as follows:
• We propose a simple but effective method for incorporating word lexicons into the character representations for Chinese NER.
• The proposed method is transferable to different sequence-labeling architectures and can be easily incorporated with pre-trained models like BERT (Devlin et al., 2018).
We performed experiments on four public Chinese NER datasets. The experimental results show that when implementing the sequence modeling layer with a single-layer Bi-LSTM, our method achieves considerable improvements over the state-of-the-art methods in both inference speed and sequence labeling performance.

Background
In this section, we introduce several previous works that influenced our work, including the Softword technique and Lattice-LSTM.

Softword Feature
The Softword technique was originally used for incorporating word segmentation information into downstream tasks (Zhao and Kit, 2008; Peng and Dredze, 2016). It augments the character representation with the embedding of its corresponding segmentation label:
$$x^c_j \leftarrow [x^c_j; e^{seg}(seg(c_j))].$$
Here, $seg(c_j) \in Y_{seg}$ denotes the segmentation label of the character $c_j$ predicted by the word segmenter, $e^{seg}$ denotes the segmentation label embedding lookup table, and typically $Y_{seg} = \{B, M, E, S\}$. However, gold segmentation is not provided in most datasets, and segmentation results obtained by a segmenter can be incorrect. Therefore, segmentation errors will inevitably be introduced through this approach.

Lattice-LSTM
Lattice-LSTM is designed to incorporate lexicon information into the character-based neural NER model. To achieve this purpose, lexicon matching is first performed on the input sentence. If the sub-sequence $\{c_i, \cdots, c_j\}$ of the sentence matches a word in the lexicon for $i < j$, a directed edge is added from $c_i$ to $c_j$. All lexicon matching results related to a character are preserved by allowing the character to be connected with multiple other characters. Intrinsically, this practice converts the input form of a sentence from a chain into a graph.
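The edge construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the published Lattice-LSTM code: the function name `lattice_edges`, the `max_len` cap on word length, and the toy lexicon are all hypothetical, and positions are 0-based.

```python
def lattice_edges(sentence, lexicon, max_len=4):
    """Return (i, j) index pairs, i < j, whose span sentence[i..j] is a lexicon word.

    Each pair corresponds to one directed edge from character i to character j.
    `max_len` (an assumed cap) bounds the length of candidate words.
    """
    edges = []
    n = len(sentence)
    for i in range(n):
        # j is inclusive; the span sentence[i:j + 1] has length j - i + 1 <= max_len
        for j in range(i + 1, min(n, i + max_len)):
            if sentence[i:j + 1] in lexicon:
                edges.append((i, j))
    return edges


# Toy lexicon, chosen only for illustration.
toy_lexicon = {"中山", "山西", "西路", "中山西路"}
edges = lattice_edges("中山西路", toy_lexicon)
```

A character that is the end point of several edges (here the last character) receives information from several words, which is what turns the chain into a graph.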
In a normal LSTM layer, the hidden state $h_j$ and the memory cell $\mathbf{c}_j$ of each time step (the memory cell is written in bold to distinguish it from the character $c_j$) are updated by:
$$h_j, \mathbf{c}_j = f(h_{j-1}, \mathbf{c}_{j-1}, x^c_j).$$
However, in order to model the graph-based input, Lattice-LSTM introduces an elaborate modification to the normal LSTM. Specifically, let $s_{<*,j>}$ denote the list of sub-sequences of sentence $s$ that match the lexicon and end with $c_j$, $h_{<*,j>}$ denote the corresponding hidden state list $\{h_i, \forall s_{<i,j>} \in s_{<*,j>}\}$, and $\mathbf{c}_{<*,j>}$ denote the corresponding memory cell list $\{\mathbf{c}_i, \forall s_{<i,j>} \in s_{<*,j>}\}$. In Lattice-LSTM, the hidden state $h_j$ and memory cell $\mathbf{c}_j$ of $c_j$ are now updated as follows:
$$h_j, \mathbf{c}_j = f(h_{j-1}, \mathbf{c}_{j-1}, x^c_j, s_{<*,j>}, h_{<*,j>}, \mathbf{c}_{<*,j>}), \tag{3}$$
where $f$ is a simplified representation of the function used by Lattice-LSTM to perform memory update.
From our perspective, there are two main advantages to Lattice-LSTM. First, it preserves all the possible lexicon matching results that are related to a character, which helps avoid the error propagation problem introduced by heuristically choosing a single matching result for each character. Second, it introduces pre-trained word embeddings to the system, which greatly enhances its performance.
However, efficiency problems exist in Lattice-LSTM. Compared with normal LSTM, Lattice-LSTM needs to additionally model s < * ,j> , h < * ,j> , and c < * ,j> for memory update, which slows the training and inference speeds. Additionally, due to the complicated implementation of f , it is difficult for Lattice-LSTM to process multiple sentences in parallel (in the published implementation of Lattice-LSTM, the batch size was set to 1). These problems limit its application in some industrial areas where real-time NER responses are needed.

Approach
In this work, we sought to retain the merits of Lattice-LSTM while overcoming its drawbacks. To this end, we propose a novel method in which lexicon information is introduced by simply adjusting the character representation layer of an NER model. We refer to this method as SoftLexicon. As shown in Figure 1, the overall architecture of the proposed method is as follows. First, each character of the input sequence is mapped into a dense vector. Next, the SoftLexicon feature is constructed and added to the representation of each character. Then, these augmented character representations are put into the sequence modeling layer and the CRF layer to obtain the final predictions.

Character Representation Layer
For a character-based Chinese NER model, the input sentence is seen as a character sequence $s = \{c_1, c_2, \cdots, c_n\}$ with each $c_i \in V_c$, where $V_c$ is the character vocabulary. Each character $c_i$ is represented using a dense vector (embedding):
$$x^c_i = e^c(c_i),$$
where $e^c$ denotes the character embedding lookup table.
Char + bichar. In addition, Zhang and Yang (2018) showed that character bigrams are useful for representing characters, especially for those methods not using word information. Therefore, it is common to augment the character representations with bigram embeddings:
$$x^c_i = [e^c(c_i); e^b(c_i, c_{i+1})],$$
where $e^b$ denotes the bigram embedding lookup table.

Incorporating Lexicon Information
The problem with the purely character-based NER model is that it fails to exploit word information. To address this issue, we propose two methods, as described below, to introduce the word information into the character representations. In the following, for any input sequence $s = \{c_1, c_2, \cdots, c_n\}$, $w_{i,j}$ denotes its sub-sequence $\{c_i, c_{i+1}, \cdots, c_j\}$.

ExSoftword Feature
The first method is an intuitive extension of the Softword method, called ExSoftword. Instead of choosing one segmentation result for each character, it retains all possible segmentation results obtained using the lexicon:
$$x^c_j \leftarrow [x^c_j; e^{seg}(segs(c_j))],$$
where $segs(c_j)$ denotes all segmentation labels related to $c_j$, and $e^{seg}(segs(c_j))$ is a 5-dimensional multi-hot vector with each dimension corresponding to an item of $\{B, M, E, S, O\}$.
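As a rough illustration of the multi-hot ExSoftword feature, the sketch below collects every segmentation label that a lexicon match assigns to a character. The function name `exsoftword_features`, the `max_len` cap, the toy lexicon, and the choice to set the "O" dimension only when a character has no match at all are assumptions made for this example; the text above does not spell out how "O" is assigned.

```python
LABELS = ("B", "M", "E", "S", "O")


def exsoftword_features(sentence, lexicon, max_len=4):
    """Return one 5-dimensional multi-hot vector (order B, M, E, S, O) per character."""
    n = len(sentence)
    label_sets = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i, min(n, i + max_len)):
            if sentence[i:j + 1] not in lexicon:
                continue
            if i == j:
                label_sets[i].add("S")      # single-character word
            else:
                label_sets[i].add("B")      # word begins here
                label_sets[j].add("E")      # word ends here
                for k in range(i + 1, j):
                    label_sets[k].add("M")  # character lies inside the word
    # Assumption: a character with no match receives only the "O" label.
    return [[1 if lab in (labs or {"O"}) else 0 for lab in LABELS]
            for labs in label_sets]


features = exsoftword_features("中山西路", {"中山", "中山西", "山西路"})
```

In this toy case the second character is the end of one word, the middle of another, and the beginning of a third, so three dimensions of its vector are set at once. The vector records which labels occur but not which word produced each label.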

SoftLexicon
Based on the analysis of ExSoftword, we further developed the SoftLexicon method to incorporate the lexicon information. The SoftLexicon features are constructed in three steps.
Categorizing the matched words. First, to retain the segmentation information, all matched words of each character $c_i$ are categorized into four word sets, "BMES", which are marked by the four segmentation labels. For each character $c_i$ in the input sequence $s = \{c_1, c_2, \cdots, c_n\}$, the four sets are constructed by:
$$\begin{aligned}
B(c_i) &= \{w_{i,k}, \forall w_{i,k} \in L, i < k \le n\}, \\
M(c_i) &= \{w_{j,k}, \forall w_{j,k} \in L, 1 \le j < i < k \le n\}, \\
E(c_i) &= \{w_{j,i}, \forall w_{j,i} \in L, 1 \le j < i\}, \\
S(c_i) &= \{c_i, \exists c_i \in L\}.
\end{aligned}$$
Here, $L$ denotes the lexicon we use in this work. Additionally, if a word set is empty, a special word "NONE" is added to the empty word set. An example of this categorization approach is shown in Figure 3. Note that in this way, not only can we introduce the word embeddings, but no information is lost either, since the matching results can be exactly restored from the four word sets of the characters.
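The categorization step can be sketched as follows. This is an illustrative implementation under assumptions, not the released code: the function name `bmes_word_sets`, the `max_len` cap on matched word length, the use of lists for the word sets, and the toy lexicon are hypothetical, and indices are 0-based.

```python
def bmes_word_sets(sentence, lexicon, max_len=4):
    """For each character, collect matched lexicon words into B, M, E and S sets."""
    n = len(sentence)
    sets = [{"B": [], "M": [], "E": [], "S": []} for _ in range(n)]
    for i in range(n):
        for j in range(i, min(n, i + max_len)):
            word = sentence[i:j + 1]
            if word not in lexicon:
                continue
            if i == j:
                sets[i]["S"].append(word)      # the character itself is a word
            else:
                sets[i]["B"].append(word)      # word begins at character i
                sets[j]["E"].append(word)      # word ends at character j
                for k in range(i + 1, j):
                    sets[k]["M"].append(word)  # character k is strictly inside
    # An empty set receives the special word "NONE".
    for char_sets in sets:
        for key in "BMES":
            if not char_sets[key]:
                char_sets[key].append("NONE")
    return sets


word_sets = bmes_word_sets("中山西路", {"中山", "中山西", "山西路", "西"})
```

Unlike a multi-hot label vector, each set keeps the matched words themselves, so the word embeddings can be looked up later and the original matches can be recovered from the sets.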
Condensing the word sets. After obtaining the "BMES" word sets for each character, each word set is then condensed into a fixed-dimensional vector. In this work, we explored two approaches for implementing this condensation.
The first implementation is the intuitive mean-pooling method:
$$v^s(S) = \frac{1}{|S|} \sum_{w \in S} e^w(w).$$
Here, $S$ denotes a word set and $e^w$ denotes the word embedding lookup table. However, as shown in Table 8, the results of empirical studies revealed that this algorithm does not perform well. Therefore, a weighting algorithm is introduced to further leverage the word information. To maintain computational efficiency, we did not opt for a dynamic weighting algorithm like attention. Instead, we propose using the frequency of each word as an indication of its weight. Since the frequency of a word is a static value that can be obtained offline, this can greatly accelerate the calculation of the weight of each word.
Specifically, let $z(w)$ denote the frequency with which a lexicon word $w$ occurs in the statistical data. The weighted representation of the word set $S$ is then obtained as follows:
$$v^s(S) = \frac{4}{Z} \sum_{w \in S} z(w) e^w(w),$$
where $Z = \sum_{w \in B \cup M \cup E \cup S} z(w)$.
Here, weight normalization is performed on all words in the four word sets to make an overall comparison. In this work, the statistical data set is constructed from a combination of the training and development data of the task. Of course, if there is unlabeled data in the task, the unlabeled data set can serve as the statistical data set. In addition, note that the frequency of $w$ does not increase if $w$ is covered by another sub-sequence that matches the lexicon. This prevents the problem in which the frequency of a shorter word is always at least as high as the frequency of the longer word that covers it.
Combining with character representation. The final step is to combine the representations of the four word sets into one fixed-dimensional feature, and add it to the representation of each character. In order to retain as much information as possible, we choose to concatenate the representations of the four word sets, and the final representation of each character is obtained by:
$$e^s(B, M, E, S) = [v^s(B); v^s(M); v^s(E); v^s(S)],$$
$$x^c \leftarrow [x^c; e^s(B, M, E, S)].$$
Here, $v^s$ denotes the weighting function above.
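The condensing and combining steps can be illustrated with plain Python lists. This is a sketch under assumptions rather than the released implementation: the names `weighted_pool`, `softlexicon_feature` and `augment` are hypothetical, the "NONE" word is assumed to have zero frequency and a zero embedding, the normalizer is summed over the four sets as lists, and the toy embeddings and counts are made up for the example.

```python
def weighted_pool(words, z_total, freq, emb, dim):
    """Frequency-weighted sum of word embeddings, scaled by 4 / Z."""
    vec = [0.0] * dim
    if z_total == 0:
        return vec  # no matched word has a nonzero count
    for w in words:
        weight = 4.0 * freq.get(w, 0) / z_total
        e = emb.get(w, [0.0] * dim)  # assumption: unknown words and "NONE" map to zeros
        for d in range(dim):
            vec[d] += weight * e[d]
    return vec


def softlexicon_feature(char_sets, freq, emb, dim):
    """Concatenate the pooled B, M, E and S vectors of one character."""
    # Z is shared by all four sets, so weights are comparable across sets.
    z_total = sum(freq.get(w, 0) for key in "BMES" for w in char_sets[key])
    feature = []
    for key in "BMES":
        feature.extend(weighted_pool(char_sets[key], z_total, freq, emb, dim))
    return feature


def augment(char_vec, char_sets, freq, emb, dim):
    """Append the SoftLexicon feature to the character representation."""
    return list(char_vec) + softlexicon_feature(char_sets, freq, emb, dim)


# Toy word sets, embeddings (dim = 2) and counts for a single character.
char_sets = {"B": ["山西路"], "M": ["中山西"], "E": ["中山"], "S": ["NONE"]}
emb = {"中山": [1.0, 0.0], "中山西": [0.0, 1.0], "山西路": [1.0, 1.0]}
freq = {"中山": 3, "中山西": 1, "山西路": 4}
feature = softlexicon_feature(char_sets, freq, emb, dim=2)
x = augment([0.1, 0.2], char_sets, freq, emb, dim=2)
```

Because every quantity here is a lookup or a fixed-size sum, the feature can be computed independently for each character, which leaves the sequence modeling layer unchanged.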

Sequence Modeling Layer
With the lexicon information incorporated, the character representations are then put into the sequence modeling layer, which models the dependency between characters. Generic architectures for this layer include the bidirectional long short-term memory network (BiLSTM), the convolutional neural network (CNN), and the transformer (Vaswani et al., 2017). In this work, we implemented this layer with a single-layer Bi-LSTM.
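Here, we give the precise definition of the forward LSTM:
$$\begin{aligned}
\begin{bmatrix} i_t \\ f_t \\ o_t \\ \tilde{c}_t \end{bmatrix} &= \begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix} \left( W \begin{bmatrix} x^c_t \\ h_{t-1} \end{bmatrix} + b \right), \\
c_t &= \tilde{c}_t \odot i_t + c_{t-1} \odot f_t, \\
h_t &= o_t \odot \tanh(c_t),
\end{aligned}$$
where $\sigma$ is the element-wise sigmoid function and $\odot$ represents the element-wise product. $W$ and $b$ are trainable parameters. The backward LSTM shares the same definition as the forward LSTM but processes the sequence in reverse order.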

Label Inference Layer
On top of the sequence modeling layer, it is typical to apply a sequential conditional random field (CRF) (Lafferty et al., 2001) layer to perform label inference for the whole character sequence at once:
$$p(y|s; \theta) = \frac{\prod_{t=1}^{n} \phi_t(y_{t-1}, y_t|s)}{\sum_{y' \in Y_s} \prod_{t=1}^{n} \phi_t(y'_{t-1}, y'_t|s)}. \tag{13}$$
Here, $Y_s$ denotes all possible label sequences of $s$, and $\phi_t(y', y|s) = \exp(w^T_{y',y} h_t + b_{y',y})$, where $w_{y',y}$ and $b_{y',y}$ are trainable parameters corresponding to the label pair $(y', y)$, and $\theta$ denotes model parameters. For label inference, it searches for the label sequence $y^*$ with the highest conditional probability given the input sequence $s$:
$$y^* = \arg\max_{y} p(y|s; \theta),$$
which can be efficiently solved using the Viterbi algorithm (Forney, 1973).
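For reference, label inference over pairwise log-potentials can be sketched with a standard Viterbi recursion. The interface below is an assumption made for illustration: `start[y]` is taken to hold the log-potential of the first label, and `log_phi[t][y_prev][y]` the log-potential $\log \phi(y_{prev}, y|s)$ of each later step. Since the normalizer in Eq. (13) is the same for every label sequence, maximizing the summed log-potentials also maximizes the conditional probability.

```python
def viterbi(start, log_phi):
    """Return the label index sequence with the highest total log-potential.

    start:   list of length K, log-potential of each label at the first position.
    log_phi: list of (n - 1) matrices; log_phi[t][y_prev][y] scores a transition.
    """
    best = list(start)   # best[y]: best score of any path ending in label y
    back = []            # back[t][y]: best previous label for label y
    labels = range(len(best))
    for phi_t in log_phi:
        new_best, pointers = [], []
        for y in labels:
            prev = max(labels, key=lambda yp: best[yp] + phi_t[yp][y])
            pointers.append(prev)
            new_best.append(best[prev] + phi_t[prev][y])
        best = new_best
        back.append(pointers)
    # Trace the best path backwards from the best final label.
    y = max(labels, key=lambda k: best[k])
    path = [y]
    for pointers in reversed(back):
        y = pointers[y]
        path.append(y)
    path.reverse()
    return path


# Toy example with two labels and three positions.
start = [0.0, -1.0]
log_phi = [
    [[0.0, 2.0], [0.0, 0.0]],
    [[0.0, 0.0], [3.0, 0.0]],
]
best_path = viterbi(start, log_phi)
```

The recursion visits each position once and compares all label pairs, so its cost grows linearly with the sentence length and quadratically with the number of labels.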

Experiment Setup
Most experimental settings in this work followed the protocols of Lattice-LSTM (Zhang and Yang, 2018), including tested datasets, compared baselines, evaluation metrics (P, R, F1), and so on.
To make this work self-contained, we briefly describe some of its primary settings.

Datasets
The methods were evaluated on four Chinese NER datasets, including OntoNotes (Weischedel et al., 2011), MSRA (Levow, 2006), Weibo NER (Peng and Dredze, 2015; He and Sun, 2017a), and Resume NER (Zhang and Yang, 2018). OntoNotes and MSRA are from the newswire domain, where gold-standard segmentation is available for training data. For OntoNotes, gold segmentation is also available for development and testing data. Weibo NER and Resume NER are from social media and resumes, respectively. There is no gold-standard segmentation in these two datasets. Table 1 shows the statistics of these datasets. As for the lexicon, we used the same one as Lattice-LSTM, which contains 5.7k single-character words, 291.5k two-character words, 278.1k three-character words, and 129.1k other words. In addition, the pre-trained character embeddings we used are also the same as those of Lattice-LSTM, which are pre-trained on Chinese Giga-Word using word2vec.

Implementation Detail
In this work, we implement the sequence-labeling layer with Bi-LSTM. Most implementation details followed those of Lattice-LSTM, including character and word embedding sizes, dropout, embedding initialization, and LSTM layer number. Additionally, the hidden size was set to 200 for the small datasets Weibo and Resume, and 300 for the larger datasets OntoNotes and MSRA. The initial learning rate was set to 0.005 for Weibo and 0.0015 for the other three datasets with the Adamax (Kingma and Ba, 2014) step rule.
Figure 4: Inference speed against sentence length. We use the same batch size of 1 for a fair speed comparison.
From the table, we can observe that when decoding with the same batch size (=1), the proposed method is considerably more efficient than Lattice-LSTM and LR-CNN, performing up to 6.15 times faster than Lattice-LSTM. The inference speeds of SoftLexicon (LSTM) with bichar are close to those without bichar, since we only concatenate an additional feature to the character representation. The inference speeds of the BERT-Tagger and SoftLexicon (LSTM) + BERT models are limited due to the deep layers of the BERT structure. However, the SoftLexicon (LSTM) + BERT model is still faster than Lattice-LSTM and LR-CNN on all datasets.

Computational Efficiency Study
To further illustrate the efficiency of the SoftLexicon method, we also conducted an experiment to evaluate its inference speed against sentences of different lengths, as shown in Figure 4. For a fair comparison, we set the batch size to 1 in all of the compared methods. The results show that the proposed method achieves significant improvement in speed over Lattice-LSTM and LR-CNN when processing short sentences. As sentence length increases, the proposed method remains consistently faster than Lattice-LSTM and LR-CNN despite the speed degradation due to the recurrent architecture of LSTM. Overall, the proposed SoftLexicon method shows a great advantage over other methods in computational efficiency.

Effectiveness Study
Tables 3−6 show the performance of our method against the compared baselines. In this study, the sequence modeling layer of our method was implemented with a single-layer bidirectional LSTM.
Table 3: Performance on OntoNotes. A model followed by (LSTM) (e.g., Proposed (LSTM)) indicates that its sequence modeling layer is LSTM-based.
OntoNotes. Table 3 shows results on the OntoNotes dataset, where gold word segmentation is provided for both training and testing data. The methods of the "Gold seg" and the "Auto seg" groups are all word-based, with the former building on gold word segmentation results and the latter building on automatic word segmentation results produced by a segmenter trained on OntoNotes training data. The methods used in the "No seg" group are character-based. From the table, we can make several observations. First, when gold word segmentation was replaced by automatically generated word segmentation, the F1 score decreased from 75.77% to 71.70%. This reveals the problem of treating the predicted word segmentation result as the true result in word-based Chinese NER. Second, the F1 score of the Char-based (LSTM)+ExSoftword model is greatly improved from that of the Char-based (LSTM) model. This indicates the feasibility of the naive ExSoftword method. However, it still greatly underperforms relative to Lattice-LSTM, which reveals its deficiency in utilizing word information. Lastly, the proposed SoftLexicon method outperforms Lattice-LSTM by 1.76% with respect to the F1 score, and obtains a greater improvement of 2.28% when combined with the bichar feature. It even performs comparably with the word-based methods of the "Gold seg" group, verifying its effectiveness on OntoNotes.
MSRA/Weibo/Resume. Tables 4, 5 and 6 show results on the MSRA, Weibo and Resume datasets, respectively. Compared methods include the best statistical models on these datasets, which leveraged rich handcrafted features (Chen et al., 2006; Zhang et al., 2006; Zhou et al., 2013), character embedding features (Lu et al., 2016; Peng and Dredze, 2016), radical features, cross-domain data, and semi-supervised data (He and Sun, 2017b). From the tables, we can see that the performance of the proposed SoftLexicon method is significantly better than that of Lattice-LSTM and other baseline methods on all three datasets.
... which means it can complement the information obtained from the pre-trained model.

Ablation Study
To investigate the contribution of each component of our method, we conducted ablation experiments on all four datasets, as shown in Table 8.
(1) In Lattice-LSTM, each character receives word information only from the words that begin or end with it. Thus, the information of the words that contain the character inside is ignored. However, SoftLexicon prevents the loss of this information by incorporating the "Middle" group of words. In the "- 'M' group" experiment, we removed the "Middle" group in SoftLexicon, as in Lattice-LSTM. The degradation in performance on all four datasets indicates the importance of the "M" group of words, and confirms the advantage of our method.
(2) Our method draws a clear distinction between the four "BMES" categories of matched words. To study the relative contribution of this design, we conducted experiments to remove this distinction, i.e., we simply added up all the weighted words regardless of their categories. The decline in performance verifies the significance of a clear distinction between different matched words.
(3) We proposed two strategies for pooling the four word sets in Section 3.2. In the "-Weighted pooling" experiment, the weighted pooling strategy was replaced with mean-pooling, which degrades the performance. Compared with mean-pooling, the weighting strategy not only succeeds in weighing different words by their significance, but also introduces the frequency information of each word in the statistical data, which is verified to be helpful.
(4) Although existing lexicon-based methods like Lattice-LSTM also use word weighting, unlike the proposed SoftLexicon method, they fail to perform weight normalization among all the matched words. For example, Lattice-LSTM only normalizes the weights inside the "B" group or the "E" group. In the "-Overall weighting" experiment, we performed weight normalization inside each "BMES" group as Lattice-LSTM does, and found the resulting performance to be degraded. This result shows that the ability to perform overall weight normalization among all matched words is also an advantage of our method.

Conclusion
In this work, we addressed the computational efficiency of utilizing word lexicons in Chinese NER. To obtain a high-performing Chinese NER system with a fast inference speed, we proposed a novel method to incorporate the lexicon information into the character representations. Experimental studies on four benchmark Chinese NER datasets reveal that our method can achieve a much faster inference speed and better performance than the compared state-of-the-art methods.