Incorporating Uncertain Segmentation Information into Chinese NER for Social Media Text

Chinese word segmentation is necessary to provide word-level information for Chinese named entity recognition (NER) systems. However, segmentation error propagation is a challenge for Chinese NER when processing colloquial data such as social media text. In this paper, we propose a model (UIcwsNN) that specializes in identifying entities from Chinese social media text, especially by leveraging uncertain information of word segmentation. Such ambiguous information contains all the potential segmentation states of a sentence and provides a channel for the model to infer deep word-level characteristics. We propose a trilogy (i.e., Candidate Position Embedding => Position Selective Attention => Adaptive Word Convolution) to encode uncertain word segmentation information and acquire an appropriate word-level representation. Experimental results on a social media corpus show that our model alleviates the segmentation error propagation problem effectively and achieves a significant performance improvement of 2% over previous state-of-the-art methods.


Introduction
Named entity recognition (NER) is a fundamental task for natural language processing and fulfills lots of downstream applications, such as semantic understanding of social media contents.
Chinese NER is often considered a character-wise sequence labeling task since there are no natural delimiters between Chinese words (Liu et al., 2010; Li et al., 2014). However, word-level information is necessary for a Chinese NER system (Mao et al., 2008; Peng and Dredze, 2015). Various segmentation features can be obtained from Chinese word segmentation (CWS) procedures and then used in a pipeline NER module (Peng and Dredze, 2015; He and Sun, 2017a; Zhu and Wang, 2019), or the two tasks can be co-trained (Xu et al., 2013; Peng and Dredze, 2016; Cao et al., 2018).

[Figure 1: The architecture of our model (Step 1: Candidate Position Embedding; Step 2: Position Selective Attention; Step 3: Adaptive Word Convolution). An interesting instance, "南京市长江大桥调研 (Daqiao Jiang, mayor of Nanjing City, is investigating)...", is represented.]
However, segmentation error propagation is a challenge for Chinese NER when processing informal data such as social media text (Duan et al., 2012). CWS produces more unreliable results on social media text than on formal data, and incorrectly segmented entity boundaries may lead to NER errors. Nevertheless, most existing extractors assume that the input segmentation information is definitive and reliable, without conscious error discrimination: they take for granted that the single, supposedly reliable word segmentation output of a CWS module can be fed directly into the NER module. Although joint training may improve the accuracy of word segmentation, the NER module still cannot recognize inevitable segmentation errors.
To solve this problem, we design a model (UIcwsNN) dedicated to identifying entities from Chinese social media text by incorporating Uncertain Information of Chinese Word Segmentation into a Neural Network. This kind of uncertain information reflects all the potential segmentation states of a sentence, not just the single one that the CWS module deems reliable. Furthermore, we propose a trilogy to encode uncertain word segmentation information and acquire word-level representations, as shown in Figure 1.
In summary, the contributions of this paper are as follows:
• We embed candidate position information of characters into the model (in Section 3.1) to express the states of the underlying words. And we design the Position Selective Attention (in Section 3.2), which forces the model to focus on the appropriate positions while ignoring unreliable parts. These operations provide a wealth of resources that allow the model to infer deep word-level characteristics, rather than bluntly imposing segmentation information.
• We introduce the Adaptive Word Convolution (in Section 3.3), which dynamically provides word-level representations for characters in specific positions by encoding segmentations of different lengths. Hence, our model can grasp useful word-level semantic information and alleviate the interference of segmentation error cascading.
• Experimental results on different datasets show that our model achieves significant performance improvements compared to baselines that use only character information. In particular, our model outperforms the previous state-of-the-art method by 2% on the social media corpus.

Related Work
NER on English has achieved promising performance by naturally integrating character information into word representations (Ma and Hovy, 2016; Peters et al., 2018; Yadav and Bethard, 2019; Li et al., 2020). However, Chinese NER is still underachieving because of the word segmentation problem. Unlike English, words in Chinese sentences are not separated by spaces, so Chinese words cannot be obtained without CWS pre-processing. In particular, identifying entities on Chinese social media is harder than on formal text because segmentation error propagation is worse there. Existing methods paid little attention to this issue, and there were few entity recognition methods specifically for Chinese social media text (Peng and Dredze, 2015; He and Sun, 2017a,b).
As for Chinese NER, existing methods can be classified as either word-wise or character-wise. The former used words as the basic tagging unit (Ji and Grishman, 2005), so segmentation errors were directly and inevitably propagated into NER systems. The latter used characters as the basic tokens in the tagging process (Chen et al., 2006; Mao et al., 2008; Lu et al., 2016). Character-wise methods have been shown to outperform word-wise methods for Chinese NER (Liu et al., 2010; Li et al., 2014).
There were two main ways to take word-level information into a character-wise model. One was to employ various segmentation information as feature vectors in a cascaded NER model, where Chinese word segmentation was performed before character sequence labeling (Guo et al., 2004; Mao et al., 2008; Zhu and Wang, 2019). The pre-processed segmentation features included character positional embeddings (Peng and Dredze, 2015; He and Sun, 2017a,b), segmentation tags (Zhu and Wang, 2019), word embeddings (Peng and Dredze, 2015; Liu et al., 2019; E and Xiang, 2017), and so on. The other was to train the NER and CWS tasks jointly to incorporate task-shared word boundary information from CWS into NER (Xu et al., 2013; Peng and Dredze, 2016; Cao et al., 2018). Although co-training might improve the validity of the word segmentation, the NER module still had no specific measures to avoid segmentation errors. Thus, the above methods suffered from the potential issue of error propagation.
A few researchers tried to address this defect. Luo and Yang (2016) used multiple word segmentation outputs as additional features of an NER model; however, they treated the segmentations equally, without error discrimination. Liu et al. (2019) introduced four naive selection strategies to select words from a pre-prepared lexicon for their model; however, these strategies did not consider the context of a sentence. A Lattice LSTM model was proposed that used gated recurrent units to control the contribution of potential words; however, as shown by Liu et al. (2019), the gate mechanism might cause the model to degenerate into a partially word-based model. Ding et al. (2019) and Gui et al. (2019) proposed models with graph neural networks based on the information that gazetteers or lexicons offer, but obtaining large-scale, high-quality lexicons is costly. These methods were dedicated to capturing the correct segmentation information but might not alleviate the interference of inappropriate segmentations.
It is worth mentioning that the above methods were not specifically aimed at social media. We propose a method that learns word-level representations by leveraging uncertain word segmentation information while considering the informal expression characteristics of social media text.

Figure 1 illustrates the overall architecture of our model UIcwsNN. Given a sentence S = {c_1, c_2, ..., c_n} as a sequence of characters, each character will be assigned a predefined tag.

Methodology
We use a conditional random field (CRF) layer to decode tags according to the outputs of the sequence encoder (Lample et al., 2016). For sequence encoding, we use the convolution operation as our basic encoding unit. Colloquial social media text usually does not follow normative grammar or syntax and presents semantics in fragmented form, for example, "有好多好多的话想对你说李巾凡想要瘦瘦瘦成李帆我是想切开云朵的心 (Have many many words to say to you Jinfan Li wanna thin thin thin to Fan Li I am a heart that want to cut the cloud)". These properties disrupt the propagation of temporal semantic information along the textual sequence. Therefore, the convolutional neural network (CNN) is naturally suitable for encoding colloquial text because it specializes in capturing salient local features from a sequence.
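As a hedged illustration of CRF tag decoding (a generic Viterbi pass over stand-in emission and transition scores, not the paper's implementation), the decoding step can be sketched as:

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    # emissions: (n, T) per-character tag scores from the encoder;
    # transitions: (T, T) tag-transition scores of the CRF.
    n, T = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((n, T), dtype=int)
    for i in range(1, n):
        # total[j, k]: best score ending at tag k if previous tag is j.
        total = score[:, None] + transitions + emissions[i]
        back[i] = total.argmax(axis=0)
        score = total.max(axis=0)
    # Trace back the best path from the final position.
    best = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        best.append(int(back[i, best[-1]]))
    return best[::-1]

# Toy scores: two tags; emissions clearly prefer tag 0, then tag 1 twice.
em = np.array([[2.0, 0.0], [0.0, 2.0], [0.0, 2.0]])
tr = np.zeros((2, 2))
print(viterbi_decode(em, tr))  # [0, 1, 1]
```

With zero transition scores the decoder simply follows the emissions; a trained CRF would learn transitions that penalize invalid tag sequences (e.g., I following O).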
More importantly, we use a trilogy to learn the word-level representation by incorporating uncertain information of Chinese text segmentation, as detailed below.

Step-1: Candidate Position Embedding
We design the candidate position embedding to represent the candidate positions of each character in all potential words. It reflects the states of all underlying segmentations of a sentence.
Next, we use a 4-dimensional vector c_i^(p) to embed the candidate position information of a character, where each dimension indicates a positional candidate (i.e., Begin, Inside, End, Single) of the character in potential words: 1 if it exists, 0 otherwise. For example, as shown in the middle and top parts of Figure 2, since "京 (Jing)" is the beginning of "京市 (Jing Shi)", the inside of "南京市 (Nanjing City)", and the end of "南京 (Nanjing)", the 1st, 2nd and 3rd dimensions of the embedding of "京 (Jing)" are 1, but the 4th dimension is 0 (i.e., [1, 1, 1, 0]).
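A minimal sketch of how such indicator vectors could be computed is shown below; the candidate word set here is a toy stand-in for the full segmentation lattice a CWS tool would provide, and the function name is our own:

```python
def candidate_position_embedding(sentence, candidate_words):
    # One 4-dim indicator per character: [Begin, Inside, End, Single].
    n = len(sentence)
    emb = [[0, 0, 0, 0] for _ in range(n)]
    for start in range(n):
        for end in range(start + 1, n + 1):
            if sentence[start:end] not in candidate_words:
                continue
            if end - start == 1:
                emb[start][3] = 1          # Single-character word
            else:
                emb[start][0] = 1          # Begin of the word
                emb[end - 1][2] = 1        # End of the word
                for i in range(start + 1, end - 1):
                    emb[i][1] = 1          # Inside the word
    return emb

sent = "南京市长江大桥"
words = {"南京", "南京市", "京市", "市长", "长江", "长江大桥", "大桥", "江大桥"}
emb = candidate_position_embedding(sent, words)
print(emb[1])  # 京: begin of 京市, inside of 南京市, end of 南京 → [1, 1, 1, 0]
```

Note that the vector for "京" matches the [1, 1, 1, 0] example in the text above.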
The correct segmentation sequence for the example should be "南京 (Nanjing) / 市长 (mayor) / 江大桥 (Daqiao Jiang) / 调研 (is investigating) / ...". However, the single segmentation output that the CWS tool deems reliable is "南京市 (Nanjing City) / 长江大桥 (Yangtze River Bridge) / 调研 (investigates) / ...". Such errors may cause the entity "江大桥 (Daqiao Jiang)" to go unrecognized. In contrast, the candidate position embedding is a more reasonable representation of Chinese sentence segmentation: it is flexible enough for a model to infer word-level characteristics.

Step-2: Position Selective Attention
There should be only one certain position for a character in a given sentence. We design the position selective attention over candidate positions. It forces the model to focus on the most relevant positions while ignoring unreliable parts.
Each sequence S is first encoded character by character. We apply a set of convolution operations with filters W^(c) and bias terms b^(c) to the sequence to learn a representation h_i for character c_i:

h_i^(l) = tanh(W^(c) · x_{i:i+l-1} + b^(c))

where h_i^(l) represents a feature generated from a window of length l starting at c_i. The input x_i is the combination of the character embedding c_i^(e) ∈ R^{d_e} and the expanded candidate position embedding c_i^(p') ∈ R^{d_p}:

x_i = c_i^(e) ⊕ c_i^(p')

To enhance the learning of the position information assisted by the character semantic information, we ensure d_e = d_p.

The sequence representation H = [h_1, ..., h_n] is then projected to an attention matrix A that captures the interaction of position features according to the contexts:

A = softmax(tanh(H W))

where A is an n × 4 matrix whose rows are normalized over the four candidate positions, and W is a trainable parameter matrix. Given the matrix A, we define A_i, the i-th row of A, as the position distribution of character c_i, which weights its candidate positions and is used in the next step to select appropriate subwords.
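As a hedged sketch of the position selective attention (the tanh projection and row-wise softmax are assumptions about the exact formula; the random inputs are stand-ins for learned representations):

```python
import numpy as np

def position_attention(H, W):
    # H: (n, d) contextual character representations; W: (d, 4) projection.
    # Each row of the result is a softmax distribution over the four
    # candidate positions (Begin, Inside, End, Single).
    scores = np.tanh(H @ W)                              # (n, 4)
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)              # rows sum to 1

rng = np.random.default_rng(0)
A = position_attention(rng.normal(size=(6, 8)), rng.normal(size=(8, 4)))
print(A.shape)  # (6, 4)
```

The softmax forces the four candidate-position weights of each character to compete, which is how unreliable positions receive low attention.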

Step-3: Adaptive Word Convolution
Based on the position selection for each character, step 3 encodes word segmentations to obtain complete word-level semantics. For each character c_i, we want to encode the segmentation that involves c_i as its word-level representation. There is a challenge: the lengths of word segmentations are diverse, and the positions of characters within segmentations are flexible, so a single encoding structure can hardly adapt to this situation. Therefore, we propose the adaptive word convolution.
When c_i is the k-th character of a word w, we decompose the word into two parts, namely the left subword and the right subword:

w_{m:m+h-1} = subw_{m:i} ⊕ subw_{i:m+h-1}

where m = i - k is the start index of the word, h is its length, 1 ≤ m ≤ n, 1 ≤ h ≤ 4, m ≤ i ≤ m + h - 1, 0 ≤ k < h, and ⊕ denotes the join operation. For the instance mentioned above, we expect to obtain the tabulation shown in Figure 3. For example, "南 (South)" is the first (i.e., k = 0) character of the word "南京 (Nanjing)" (i.e., i = m = 1 and h = 2), so we can use the left subword subw_{1:1} and the right subword subw_{1:2} to express the word w_{1:2}, which then serves as the word-level representation for the character "南 (South)". In particular, we discard subw_{1:1} because subw_{1:2} contains it.
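The index arithmetic of this decomposition can be sketched as follows (a toy illustration using inclusive index spans; the function is our own, not the paper's code):

```python
def subword_spans(m, h, k):
    # c_i is the k-th (0-based) character of the word w_{m : m+h-1};
    # its index is i = m + k. Returns the inclusive index spans of the
    # left subword (word start through c_i) and the right subword
    # (c_i through word end).
    i = m + k
    return (m, i), (i, m + h - 1)

# "南" is the first character of "南京" (m = 1, h = 2, k = 0):
left, right = subword_spans(1, 2, 0)
print(left, right)  # (1, 1) (1, 2)
```

For "南" this reproduces the subw_{1:1} and subw_{1:2} spans from the example above.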
To model subwords automatically, we learn a feature map F (n × 7) through a set of convolution operations with windows of different directions and different sizes:

F_i^(→, l) = tanh(W^(v) · x_{i:i+l-1} + b^(v)),  F_i^(←, l) = tanh(W^(v) · x_{i-l+1:i} + b^(v)),  l ∈ {1, 2, 3, 4}

where W^(v) ∈ R^{d_v}, → indicates windows sliding forward, and ← indicates windows sliding backward; since the two length-1 windows coincide, each character obtains seven distinct subword features. Based on the candidate position distribution of characters learned in step 2, our model adaptively separates the valid subwords from F to learn the word-level representation w_i: the attention weights of the four candidate positions determine which left and right subword features of F are combined for each character. After performing the trilogy, the model can grasp useful word-level semantic information and avoid segmentation error cascading.

Experiments

Datasets. The MSRA dataset (Levow, 2006) is in the formal text domain. There are 50,729 annotated sentences with three entity types (PER, ORG, and LOC). We use the BIOES scheme (Begin, Inside, Outside, End, Single) to indicate the position of a token in an entity (Ratinov and Roth, 2009).
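As a small illustration of the BIOES scheme mentioned above (the function name and the (start, end, type) span format are our own), entity spans can be converted to character tags like so:

```python
def to_bioes(n, entities):
    # n: sentence length; entities: list of (start, end_exclusive, type).
    tags = ["O"] * n
    for s, e, t in entities:
        if e - s == 1:
            tags[s] = f"S-{t}"        # single-character entity
        else:
            tags[s] = f"B-{t}"        # entity begin
            tags[e - 1] = f"E-{t}"    # entity end
            for i in range(s + 1, e - 1):
                tags[i] = f"I-{t}"    # entity inside
    return tags

# A 2-char LOC followed by a 4-char PER in a 7-character sentence:
print(to_bioes(7, [(0, 2, "LOC"), (3, 7, "PER")]))
# ['B-LOC', 'E-LOC', 'O', 'B-PER', 'I-PER', 'I-PER', 'E-PER']
```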
Evaluation. We measure the performance of models using three complementary metrics: precision (P), recall (R), and F1-measure (F). Each experiment is performed five times under different random seeds to reduce the volatility of the models. We then report the mean and standard deviation for each model.
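For concreteness, the three metrics can be computed from true-positive, false-positive, and false-negative counts (a standard definition, not code from the paper):

```python
def prf(tp, fp, fn):
    # Precision, recall, and F1 from entity-level match counts.
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

p, r, f = prf(8, 2, 2)  # ≈ (0.8, 0.8, 0.8)
```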
Hyperparameters. The character embedding is pre-trained on raw microblog text with word2vec, and its dimension is 100. For the base model BiLSTM+CRF, we use a hidden state size of 200 for the bidirectional LSTM. For the base model CNNs+CRF, we use 100 filters with window lengths {2, 3, 4, 5}. We tune the other parameters and set the learning rate to 0.001 and the dropout rate to 0.5. We randomly select 20% of the training set as a validation set. We train each model for a maximum of 120 epochs using the Adam optimizer and stop training if the validation loss does not decrease for 20 consecutive epochs. Besides, we set d_e = d_p = 100 and d_v = 25. We also experiment with other settings and find that these are the most reasonable.
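The early-stopping rule described above (maximum 120 epochs, patience of 20 epochs on the validation loss) can be sketched as follows; the function and the synthetic loss sequence are illustrative, not the paper's training code:

```python
def train_with_early_stopping(val_losses, max_epochs=120, patience=20):
    # Returns (best validation loss, epoch at which training stopped).
    best, wait, epoch = float("inf"), 0, 0
    for epoch in range(min(max_epochs, len(val_losses))):
        loss = val_losses[epoch]
        if loss < best:
            best, wait = loss, 0          # improvement: reset patience
        else:
            wait += 1                     # no improvement this epoch
            if wait >= patience:
                break                     # patience exhausted: stop
    return best, epoch

# Loss improves for five epochs, then plateaus; training stops after
# 20 consecutive epochs without improvement.
losses = [5, 4, 3, 2, 1] + [1] * 100
print(train_with_early_stopping(losses))  # (1, 24)
```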

Ablation Study
To study the contribution of each component of our model, we conduct ablation experiments on the two datasets, in which we use the product of each step to decode tags. We display the results in Table 1 and draw the following conclusions.
The feature (CS) is generated from the single segmentation output that the CWS tool Jieba deems reliable, and it may not benefit NER on social media text. Compared with the corresponding baseline, the feature (CS) improves performance on the MSRA dataset but reduces it on the WeiboNER corpus. There are more segmentation errors in social media text than in formal text, so the impact of error cascading is heavier for NER on social media.
On the WeiboNER dataset, the three steps contribute differently to model performance. Compared with the baseline, the model with step 1 (+CPE) yields a 1.3% improvement in the F value, and its recall improves significantly by 3%, although the precision decreases by 1.2%. After we add step 2 (+PSA), the F value further increases by 0.6%; in this scenario, both precision and recall are higher than the baseline. When step 3 (+AWC) is added, the F value further increases by 0.9%; compared to the baseline, the recall improves significantly by 4%, with a 0.9% improvement in precision. Combining the results on the two different datasets, we find several consistent phenomena. Globally, the F value of the model keeps increasing after each step. From a decomposition perspective, step 2 (+PSA) is notable for improving the precision of the model, and step 3 (+AWC) is significant for improving the recall. Therefore, the steps of the trilogy are complementary.
Our method is robust. On the two datasets from different domains, the uncertain information of word segmentation is always effective, and the trilogy (i.e., +CPE, +PSA, +AWC) is valuable. However, the performance improvement on the WeiboNER dataset is more significant than on the MSRA dataset. In contrast with formal text, social media text contains more word segmentation errors, which better reflects the advantages of our method.
Finally, we verify the influence of the pre-trained language model BERT (Devlin et al., 2018) on our model. We optimize BERT to obtain the character embeddings and train the model CNNs+CRF jointly; its F value reaches 75% on the WeiboNER dataset. BERT improves entity recognition dramatically since it uses large-scale external data to pre-train contextual embeddings. When we use our model UIcwsNN to replace the base model CNNs+CRF, the result improves by nearly 1%. This proves that our trilogy and BERT are complementary: BERT provides high-quality character-level embeddings to the model, and our method contributes word-level semantic information. This conclusion can also be drawn from the results on the MSRA dataset.

[Table: P/R/F comparison with existing methods, including (Chen et al., 2006) 91.22/81.71/86.20; 91.28/90.62/90.95; (... Yang, 2018) 93.57/92.79/93.18; (Zhu and Wang, 2019) 93.53/92.42/92.97; (Ding et al., 2019) 94.60/94.20/94.40; (Zhao et al., ...).]

Our model achieves state-of-the-art performance. Its overall score is generally more than 2% higher than the scores of other models. Many methods use a lexicon instead of CWS to provide extractors with external word-level information, but choosing appropriate words based on sentence contexts remains their challenge. Besides, the approaches that jointly train the NER and CWS tasks do not achieve the desired results, because segmentation noise inevitably affects their effectiveness. Our model handles this trouble.

Comparison with Existing Methods
The CNN-based models achieve better performance than the model BiLSTM+CRF. Furthermore, most of the existing methods construct encoders based on recurrent neural networks or graph neural networks. Although they achieve excellent results on the MSRA dataset, they do not achieve a significant improvement on the WeiboNER corpus. In addition to word segmentation error propagation on social media, another important reason may be that the fragmented semantic expression of colloquial text limits their performance. In contrast, our CNN-based model has a clear advantage in capturing the fragmented semantics of colloquial text.
Results on the MSRA dataset are shown in Table 3. Our model UIcwsNN specializes in learning word-level representations but rarely considers characteristics at other levels, such as long-distance temporal semantics. Therefore, it only achieves competitive performance on formal text. However, our model UIcwsNN+BERT achieves new state-of-the-art performance.

Error Analysis
We count the output errors of the models and classify them into two categories: type errors and boundary errors, as shown in Figure 4. The model CNNs+CRF+CS produces more boundary errors than type errors. However, our model UIcwsNN dramatically decreases the boundary errors (the type errors are also reduced), so the error distribution is reversed: in the model UIcwsNN, the proportion of boundary errors is smaller than that of type errors, whereas in the model CNNs+CRF+CS the opposite is true. This shows that word segmentation errors generated by the segmentation tool seriously affect model performance, especially by misleading the model into identifying wrong entity boundaries. Our method learns word boundaries effectively, thereby alleviating the cascading of segmentation errors.

Figure 5 shows the performance of recognizing entities with different lengths {1, 2, 3, 4}. According to our statistics, entities with two or three characters account for more than 95% of the total number of entities. Both models give high F scores for entities of moderate lengths {2, 3} but low performance for entities that are too short or too long. The reason may be that entities with a single character or with more than four characters are rare, resulting in inadequate model training. Our model UIcwsNN achieves better results than the base model CNNs+CRF when identifying entities of various lengths. In particular, for entities with two or three characters, the model UIcwsNN yields more than a 2% improvement. This implies that our model captures word-level semantic information by modeling the uncertain information of word segmentation, so it is good at recognizing multi-character entities. Table 4 shows several examples with word segmentation errors.
When we use the single (supposedly reliable) segmentation sequence from the tool Jieba as the word-level feature for the model CNNs+CRF+CS, the segmentation errors "女真 (Nuzhen)" and "微博准 (wei bo zhun)" lead to misjudgments of the entities "女 (daughter)" and "准会员 (associate member)", respectively. Our model UIcwsNN can extract these entities. The uncertain character positions provide our model with rich word-level information, and the position selective attention supports the model in learning appropriate segmentation states. The visualization of the first case in Figure 6 shows that our model assigns higher attention values to the appropriate positions while mitigating error interference.

Conclusion
Named entity recognition is an urgent task for the semantic understanding of social media content. For Chinese NER, word segmentation error propagation is prominent since there is much colloquial text in social media. In this paper, we explore a trilogy that leverages the uncertain information of word segmentation to avoid the interference of segmentation errors. Step 1 utilizes the Candidate Position Embedding to present the potential segmentation states of a sentence; step 2 employs the Position Selective Attention to capture appropriate segmentation states while ignoring unreliable parts; step 3 uses the Adaptive Word Convolution to encode word-level representations dynamically. We analyze the performance of each component of the model and discuss the relationship between the model and related factors such as segmentation errors, BERT, and entity length. Experimental results on different datasets show that our model achieves new state-of-the-art performance. This demonstrates that our method has an excellent ability to capture word-level semantics and can alleviate segmentation error cascading effectively. In future work, we hope the model can get rid of the word segmentation tool and instead learn the candidate position information autonomously. We will release the source code once the paper is publicly available.