F-Score Driven Max Margin Neural Network for Named Entity Recognition in Chinese Social Media

We focus on named entity recognition (NER) for Chinese social media. With massive unlabeled text but a quite limited labelled corpus, we propose a semi-supervised learning model based on a B-LSTM neural network. To take advantage of traditional NER methods such as the CRF, we combine transition probability with deep learning in our model. To bridge the gap between label accuracy and the F-score of NER, we construct a model which can be trained directly on F-score. Considering the instability of the F-score driven method and the meaningful information provided by label accuracy, we propose an integrated method to train on both F-score and label accuracy. Our integrated model yields a 7.44% improvement over the previous state-of-the-art result.


Introduction
With the development of the Internet, social media plays an important role in information exchange. Natural language processing tasks on social media are more challenging and have drawn the attention of many researchers (Li and Liu, 2015; Habib and van Keulen, 2015; Radford et al., 2015; Cherry and Guo, 2015). As the foundation of many downstream applications (Weissenborn et al., 2015; Delgado et al., 2014; Hajishirzi et al., 2013) such as information extraction, named entity recognition (NER) deserves more research on prevailing and challenging social media text. NER is the task of identifying names in text and assigning them particular types (Sun et al., 2009; Sun, 2014; Sun et al., 2014). It is the informality of social media that hurts the accuracy of NER systems. While efforts in English have narrowed the gap between social media and formal domains (Cherry and Guo, 2015), the task remains challenging in Chinese, because Chinese logographic characters lack many of the clues that indicate whether a word is a name, such as capitalization. The scant labelled Chinese social media corpus makes the task even more challenging (Neelakantan and Collins, 2015; Skeppstedt, 2014; Liu et al., 2015).
To address the problem, one approach is to use lexical embeddings learnt from massive unlabeled text. To take better advantage of unlabeled text, Peng and Dredze (2015) evaluate three types of embeddings for Chinese text and show the effectiveness of positional character embeddings experimentally. Considering the value of word segmentation in Chinese NER, another approach is to construct an integrated model that jointly trains learned representations for both word segmentation prediction and NER (Peng and Dredze, 2016).
However, both of the above approaches are implemented within a CRF model. We construct a semi-supervised model based on a B-LSTM neural network which learns from the limited labelled corpus by using the lexical information provided by massive unlabeled text. To shrink the gap between label accuracy and F-score, we propose a method to train directly on F-score rather than label accuracy in our model. In addition, we propose an integrated method to train on both F-score and label accuracy. Specifically, we make the following contributions:
• We propose a method to train directly on F-score rather than label accuracy, as well as an integrated method to train on both F-score and label accuracy.
• We combine transition probability into our B-LSTM based max margin neural network to form structured output in neural network.
• We evaluate two methods of using lexical embeddings from unlabeled text in a neural network.

Model
We construct a semi-supervised model based on a B-LSTM neural network and combine transition probability into it to form structured output. We propose a method to train directly on F-score in our model. In addition, we propose an integrated method to train on both F-score and label accuracy.

Transition Probability
A B-LSTM neural network can learn from past input features, and the LSTM layer makes it more efficient (Hammerton, 2003; Hochreiter and Schmidhuber, 1997; Chen et al., 2015; Graves et al., 2006). However, a B-LSTM cannot learn sentence-level label information. Huang et al. (2015) combine a CRF to use sentence-level label information. We combine transition probability into our model to gain sentence-level label information: we construct a Max Margin Neural Network (MMNN) (Pei et al., 2014) based on the B-LSTM. The prediction of the label at position $t$ is given as:

$$y_t = W_{hy} h_t + b_y \tag{1}$$

where $W_{hy}$ are the transformation parameters, $h_t$ the hidden vector and $b_y$ the bias parameter. For an input sentence $c_{[1:n]}$ with a label sequence $l_{[1:n]}$, a sentence-level score is then given as:

$$s(c_{[1:n]}, l_{[1:n]}, \theta) = \sum_{t=1}^{n} \left( A_{l_{t-1} l_t} + f_\Lambda(l_t \mid c_{[1:n]}) \right) \tag{2}$$

where $f_\Lambda(l_t \mid c_{[1:n]})$ indicates the probability of label $l_t$ at position $t$ given by the network with parameters $\Lambda$, and $A$ indicates the matrix of transition probabilities. In our model, $f_\Lambda(l_t \mid c_{[1:n]})$ is computed as:

$$f_\Lambda(l_t \mid c_{[1:n]}) = \frac{\exp\left((y_t)_{l_t}\right)}{\sum_{k} \exp\left((y_t)_{k}\right)} \tag{3}$$

We define a structured margin loss $\Delta(l, \hat{l})$ as in Pei et al. (2014):

$$\Delta(l, \hat{l}) = \sum_{t=1}^{n} \kappa \, \mathbf{1}\{ l_t \neq \hat{l}_t \} \tag{4}$$

where $n$ is the length of the sentence $x$, $\kappa$ is a discount parameter, $l$ a given correct label sequence and $\hat{l}$ a predicted label sequence. For a given training instance $(x_i, l_i)$, our predicted label sequence is the label sequence with the highest score:

$$\hat{l} = \operatorname*{argmax}_{\hat{l}} \, s(x_i, \hat{l}, \theta)$$

The label sequence with the highest score can be obtained with the Viterbi algorithm. The regularized objective function is:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} q_i(\theta) + \frac{\lambda}{2} \lVert \theta \rVert^2, \qquad q_i(\theta) = \max_{\hat{l}} \left( s(x_i, \hat{l}, \theta) + \Delta(l_i, \hat{l}) \right) - s(x_i, l_i, \theta) \tag{5}$$

By minimizing the objective, we increase the score of the correct label sequence $l$ and decrease the score of incorrect label sequences $\hat{l}$.
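As a concrete illustration, the decoding step — finding the label sequence with the highest sentence-level score, i.e. the sum of per-position network scores and transition scores — can be sketched with the Viterbi algorithm. This is a minimal NumPy sketch; the array shapes and function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Return the label sequence maximizing the sentence-level score
    sum_t (A[l_{t-1}, l_t] + f(l_t | c)).

    emissions:   (n, K) array of per-position label scores f(l_t | c).
    transitions: (K, K) array A of transition scores A[i, j].
    """
    n, K = emissions.shape
    score = emissions[0].copy()            # best score ending in each label at t = 0
    back = np.zeros((n, K), dtype=int)     # backpointers
    for t in range(1, n):
        # cand[i, j] = best path ending in label i at t-1, then transitioning to j
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    seq = [int(score.argmax())]            # best final label, then follow backpointers
    for t in range(n - 1, 0, -1):
        seq.append(int(back[t][seq[-1]]))
    return seq[::-1]
```

With a strongly negative transition score into some label, the decoder avoids paths through that label even when its per-position score is highest, which is exactly the sentence-level information a plain B-LSTM lacks.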

F-Score Driven Training Method
The max margin training method uses the structured margin loss $\Delta(l, \hat{l})$ to describe the difference between the correct label sequence $l$ and the predicted label sequence $\hat{l}$. In fact, the structured margin loss $\Delta(l, \hat{l})$ reflects the loss in label accuracy. Considering the gap between label accuracy and F-score in NER, we introduce a new training method that trains directly on F-score. To introduce the F-score driven training method, we first examine the subgradient of equation (5):

$$\frac{\partial J}{\partial \theta} = \frac{1}{m} \sum_{i=1}^{m} \left( \frac{\partial s(x_i, \hat{l}_{\max}, \theta)}{\partial \theta} - \frac{\partial s(x_i, l_i, \theta)}{\partial \theta} \right) + \lambda \theta, \qquad \hat{l}_{\max} = \operatorname*{argmax}_{\hat{l}} \left( s(x_i, \hat{l}, \theta) + \Delta(l_i, \hat{l}) \right)$$

From the subgradient, we can see that the structured margin loss $\Delta(l, \hat{l})$ contributes nothing directly to the subgradient of the regularized objective function $J(\theta)$; it influences training only by determining which label sequence $\hat{l}_{\max}$ is selected. The margin loss $\Delta(l, \hat{l})$ thus serves as a trigger function that guides the training process of the B-LSTM based MMNN, and we can introduce a new trigger function to guide the training process of the neural network.

F-Score Trigger Function. The main evaluation criterion of the NER task is F-score. However, high label accuracy does not imply high F-score. For instance, if every named entity's last character is labeled as O, the label accuracy can be quite high, but the precision, recall and F-score are all 0. We use the F-score between the correct label sequence and the predicted label sequence as the trigger function, which steers the training process towards optimizing the F-score of the training examples. Our new structured margin loss can be described as:

$$\Delta(l, \hat{l}) = \kappa \, n \left( 1 - \mathrm{FScore}(l, \hat{l}) \right)$$

where $\mathrm{FScore}(l, \hat{l})$ is the F-score between the correct label sequence and the predicted label sequence.
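To make the trigger concrete, the entity-level F-score between a correct and a predicted label sequence can be computed as below. This is a self-contained Python sketch; the BIO tagging scheme and the helper names are our illustrative assumptions, not the paper's code.

```python
def extract_entities(labels):
    """Collect (start, end, type) spans from a BIO label sequence."""
    entities, start = set(), None
    for i, tag in enumerate(labels + ["O"]):  # sentinel closes a trailing span
        if start is not None and not tag.startswith("I-"):
            entities.add((start, i, labels[start][2:]))
            start = None
        if tag.startswith("B-"):
            start = i
    return entities

def f_score(gold, pred):
    """Entity-level F1 between a gold and a predicted label sequence."""
    g, p = extract_entities(gold), extract_entities(pred)
    if not g or not p:
        return 0.0
    tp = len(g & p)  # spans must match exactly in position and type
    if tp == 0:
        return 0.0
    prec, rec = tp / len(p), tp / len(g)
    return 2 * prec * rec / (prec + rec)

def f_score_margin_loss(gold, pred, kappa=0.2):
    """Margin loss scaling with (1 - F1) rather than per-label error."""
    return kappa * len(gold) * (1.0 - f_score(gold, pred))
```

The example from the text falls out directly: a prediction that labels every entity's last character as O matches no gold span exactly, so its F-score is 0 even though most individual labels are correct.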

F-Score and Label Accuracy Trigger Function
The F-score can be quite unstable in some situations. For instance, if there is no named entity in a sentence, the F-score will always be 0 regardless of the predicted label sequence. To take advantage of the meaningful information provided by label accuracy, we introduce an integrated trigger function as follows:

$$\Delta(l, \hat{l}) = \kappa \, n \left( 1 - \mathrm{FScore}(l, \hat{l}) \right) + \beta \sum_{t=1}^{n} \kappa \, \mathbf{1}\{ l_t \neq \hat{l}_t \}$$

where $\beta$ is a factor to adjust the relative weight of label accuracy and F-score.
Because the F-score depends on the whole label sequence, we use beam search to find the $k$ label sequences with the top sentence-level scores $s(x, l, \theta)$, and then use the trigger function to rerank these $k$ label sequences and select the best.
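The beam-search-and-rerank procedure can be sketched as follows. This is illustrative Python: the beam width `k`, the score arrays, and the trigger callable are our assumptions about a straightforward implementation, not the authors' code.

```python
import numpy as np

def beam_search(emissions, transitions, k=4):
    """Return the k label sequences with the highest sentence-level score."""
    n, K = emissions.shape
    beam = [([j], emissions[0, j]) for j in range(K)]
    beam.sort(key=lambda x: -x[1])
    beam = beam[:k]
    for t in range(1, n):
        # expand every partial sequence by every label, keep the k best
        cand = [(seq + [j], s + transitions[seq[-1], j] + emissions[t, j])
                for seq, s in beam for j in range(K)]
        cand.sort(key=lambda x: -x[1])
        beam = cand[:k]
    return [seq for seq, _ in beam]

def rerank(candidates, gold, trigger):
    """Pick the candidate maximizing the trigger function against gold."""
    return max(candidates, key=lambda seq: trigger(gold, seq))
```

The trigger passed to `rerank` would be the integrated loss-style function above; any callable scoring a candidate against the gold sequence works in this sketch.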

Word Segmentation Representation
Word segmentation plays an important part in Chinese text processing. Both Peng and Dredze (2015) and Peng and Dredze (2016) show the value of word segmentation for Chinese NER in social media. We present two methods to use word segmentation information in a neural network model.

Character and Position Embeddings
To incorporate word segmentation information, we attach to every character its positional tag. This method distinguishes the same character at different positions within a word. We need to segment the text into words and then learn positional character embeddings from the segmented text.
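For example, with the common B/M/E/S positional scheme (an assumption for illustration; the paper does not name its tag set), a segmented sentence can be converted into positional characters like this:

```python
def positional_characters(segmented_words):
    """Attach a position tag (B/M/E/S) to each character, so the same
    character gets a distinct embedding per within-word position."""
    out = []
    for word in segmented_words:
        if len(word) == 1:
            out.append(word + "/S")            # single-character word
        else:
            out.append(word[0] + "/B")          # word-initial character
            out.extend(ch + "/M" for ch in word[1:-1])  # word-internal
            out.append(word[-1] + "/E")         # word-final character
    return out

# e.g. positional_characters(["中国", "人"]) -> ["中/B", "国/E", "人/S"]
```

Embeddings are then learned over these tagged units rather than bare characters, so segmentation information from the unlabeled text flows into the representations themselves.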

Character Embeddings and Word Segmentation Features
We can treat word segmentation tags as discrete features in a neural network model. Discrete features can be easily incorporated into a neural network model (Collobert et al., 2011). We use word embeddings from an LSTM pretrained on the MSRA 2006 corpus to initialize the word segmentation features.
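A minimal sketch of this wiring: each character's input vector is its character embedding concatenated with the embedding of its discrete segmentation tag. The 100-dimensional sizes follow the settings reported in this paper; the random vectors here are stand-ins for pretrained values, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 100  # embedding dimension used throughout this paper

# Stand-ins for pretrained lookup tables (random for illustration only).
char_emb = {c: rng.normal(size=DIM) for c in "我爱北京"}
seg_emb = {t: rng.normal(size=DIM) for t in "BMES"}  # discrete segmentation features

def char_input(ch, seg_tag):
    """Input for one character: its character embedding concatenated with
    the embedding of its discrete word-segmentation feature."""
    return np.concatenate([char_emb[ch], seg_emb[seg_tag]])
```

The concatenated vector is what the B-LSTM consumes at each position, so segmentation enters the model as an extra learned feature rather than being baked into the character vocabulary.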

Datasets
We use the same labelled corpus as Peng and Dredze (2016) for NER in Chinese social media, and the same unlabelled text as Peng and Dredze (2016), drawn from the Sina Weibo service in China. The text is word segmented with the Chinese word segmentation system Jieba (https://github.com/fxsjy/jieba), as in Peng and Dredze (2016), so that our results are more comparable to theirs.

Parameter Estimation
We pre-train embeddings using word2vec (Mikolov et al., 2013) with the skip-gram training model, without negative sampling, and with other parameters at their defaults. Like Mao et al. (2008), we use bigram features. We use the window approach (Collobert et al., 2011) to extract higher-level features from word feature vectors, and we treat the bigram features as discrete features (Collobert et al., 2011) in our neural network. Our models are trained using stochastic gradient descent with an L2 regularizer. As for the parameters of our models, the window size for word embeddings is 5; the word embedding, feature embedding and hidden vector dimensions are all 100; the discount κ in the margin loss is 0.2; and the hyperparameter for the L2 regularizer is 0.000001. The initial learning rate is 0.1 with a decay rate of 0.95. For the integrated model, β is 0.2. We train for 20 epochs and choose the best prediction for testing.
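For reference, the hyper-parameter settings above can be collected in one place, with the learning-rate schedule written out. The per-epoch application of the 0.95 decay is our assumption; the paper states only the initial rate and the decay rate.

```python
# Hyper-parameter settings reported above, collected in one place.
config = dict(
    window_size=5,   # window for word embeddings
    emb_dim=100,     # word / feature embedding and hidden vector dimension
    kappa=0.2,       # discount in the margin loss
    l2=1e-6,         # L2 regularization strength
    lr0=0.1,         # initial learning rate
    decay=0.95,      # learning-rate decay rate
    beta=0.2,        # weight in the integrated trigger function
    epochs=20,
)

def learning_rate(epoch, lr0=0.1, decay=0.95):
    """Learning rate at a given epoch under exponential decay (assumed per-epoch)."""
    return lr0 * decay ** epoch
```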

Results and Analysis
We evaluate the two methods of incorporating word segmentation information. The experimental results of the two methods are shown in Table 2. We can observe that positional character embeddings perform better in the neural network. This is probably because the positional character embeddings method can learn word segmentation information from unlabeled text, while the word segmentation features can only be adjusted on the training corpus.
We adopt positional character embeddings in our next four models. Our first model is a B-LSTM neural network (the baseline). To take advantage of traditional models (Chieu and Ng, 2003; Mccallum et al., 2001) such as the CRF, we combine transition probability into our B-LSTM based MMNN. We apply the F-score driven training method in our third model, F-Score Driven Model I, and the integrated training method in our fourth model, F-Score Driven Model II. The results of the models are depicted in Figure 1. From the figure, we can see that our models perform better with only a small cost in running time.
Table 3 shows the results for NER on the test sets, including the micro F1-score (Overall) and out-of-vocabulary entity (OOV) recall. Peng and Dredze (2016) is the state-of-the-art NER system for Chinese social media. By comparing the results of the B-LSTM model and the B-LSTM + MMNN model, we can see that transition probability is significant for NER. Compared with the B-LSTM + MMNN model, F-Score Driven Model I improves the result on named entities with a loss on nominal mentions. The loss on nominal mentions may be caused by sentences containing no named entity or nominal mention; a detailed analysis can be found in Section 2.2. The integrated training model (F-Score Driven Model II) benefits from both label accuracy and F-score, and achieves a new state-of-the-art result for NER in Chinese social media. Our integrated model improves by 2.19% on named entities and 13.65% on nominal mentions. To better understand the impact of the factor β, we show the results of our integrated model with different values of β in Figure 2. From Figure 2, we can see that β is an important factor for balancing F-score and accuracy. Our integrated model may help alleviate the influence of noise in NER for Chinese social media.

Conclusions and Future Work
The results of our experiments also suggest directions for future work. We can observe that all models in Table 3 achieve much lower recall than precision (Pink et al., 2014), so we need to design methods to address this recall problem.
Figure 1: Running time of the models.

Figure 2: Overall F1-score with different values of β.

Details of the data are listed in Table 1.

Table 2: Two methods to incorporate word segmentation information.

Table 3: NER results for named and nominal mentions on test data.