Chinese Named Entity Recognition with Graph-based Semi-supervised Learning Model

Named entity recognition (NER) plays an important role in natural language processing (NLP). Traditional methods tend to rely on large annotated corpora to achieve high performance. Unlike many semi-supervised learning models for the NER task, this paper employs the graph-based semi-supervised learning (GBSSL) method to exploit freely available unlabeled data. The experiments show that the unlabeled corpus can enhance the state-of-the-art conditional random field (CRF) learning model and has the potential to improve tagging accuracy, although the margin in the current experiments is small.


Introduction
Named entity recognition (NER) can be regarded as a sub-task of information extraction and plays an important role in natural language processing. The NER challenge has attracted many NLP researchers, and several successful NER shared tasks have been held in past years. The annotations in the MUC-7 Named Entity task (Marsh and Perzanowski, 1998) cover entities (organization, person, and location), times, and quantities such as monetary values and percentages, for English, Chinese and Japanese.
The SIGHAN bakeoff-3 (Levow, 2006) and bakeoff-4 (Jin and Chen, 2008) tasks offer standard Chinese NER (CNER) corpora for training and testing, which contain the three commonly used entity types, i.e., personal names, location names, and organization names. The CNER task is generally more difficult than NER in western languages because Chinese text lacks explicit word boundary information.
Traditional entity recognition methods tend to employ external annotated corpora to strengthen the machine learning stage and improve test scores with the enhanced models (Zhang et al., 2006; Mao et al., 2008; Yu et al., 2008). Conditional random field (CRF) models have shown advantages and good performance in CNER tasks compared with other machine learning algorithms such as maximum entropy (ME) and hidden Markov models (HMM) (Zhou et al., 2006; Zhao and Kit, 2008). However, annotated corpora are generally very expensive and time-consuming to build.
On the other hand, a large amount of unlabeled data is freely available on the internet and can be used for research. For this reason, some researchers have begun to explore the use of unlabeled data, and semi-supervised learning methods based on labeled training data and unlabeled external data have shown their advantages (Blum and Chawla, 2001; Shin et al., 2006; Zha et al., 2008; Zhang et al., 2013).
Graph-based semi-supervised learning (GBSSL) methods have been successfully employed by many researchers. For instance, Goldberg and Zhu (2006) design a GBSSL model for sentiment categorization; Celikyilmaz et al. (2009) propose a GBSSL model for question answering; Talukdar and Pereira (2010) use GBSSL methods for class-instance acquisition; Subramanya et al. (2010) utilize a GBSSL model for structured tagging models; Zeng et al. (2013) use the GBSSL method for joint Chinese word segmentation and part-of-speech (POS) tagging and obtain higher performance than previous work. However, as far as we know, the GBSSL method has not been applied to the CNER task. To test the effectiveness of the GBSSL model in the traditional CNER task, this paper utilizes unlabeled data to enhance CRF learning through the GBSSL method.

Designed Models
To briefly introduce the GBSSL method, let $D_l = \{(x_j, r_j)\}_{j=1}^{l}$ denote the annotated data, where $r_j$ is the empirical label distribution of $x_j$. Let the unlabeled data types be denoted as $D_u = \{x_i\}_{i=l+1}^{m}$. Then the entire dataset can be represented as $D = D_l \cup D_u$. Let $G = (V, E)$ correspond to an undirected graph with $V$ as the vertices and $E$ as the edges, and let $V_l$ and $V_u$ represent the labeled and unlabeled vertices respectively. One important step is to select a proper similarity measure for calculating the similarity between a pair of vertices (Das and Smith, 2012). According to the smoothness assumption, if two instances are similar according to the graph, then their output labels should also be similar (Zhu, 2005).
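The smoothness assumption is often written as a quadratic objective over the graph. The form below is a generic sketch of that idea and is not necessarily the objective optimized in this work; the symbols $q_i$ (the label distribution estimated at vertex $i$), $w_{ij}$ (the edge weight between vertices $i$ and $j$) and $\alpha$ (a trade-off hyperparameter) are introduced here for illustration only.

```latex
\min_{q} \; \sum_{i \in V_l} \lVert q_i - r_i \rVert^2
  \; + \; \alpha \sum_{(i,j) \in E} w_{ij} \, \lVert q_i - q_j \rVert^2
```

The first term keeps labeled vertices close to their empirical distributions $r_i$; the second term penalizes neighbouring vertices whose distributions differ, more strongly when the edge weight is large.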
There are mainly three stages in the designed models, i.e., graph construction, label propagation and CRF learning. Graph construction is performed on both labeled and unlabeled data, and the unlabeled data is automatically tagged through the label propagation stage. Then, the tagged external data will be added into the manually annotated training corpus to enhance the CRF learning model.

Graph Construction & Label Propagation
We follow the research of Subramanya et al. (2010) to represent the vertices using character trigrams in labeled and unlabeled sentences for graph construction.
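As a rough illustration of trigram vertices, the sketch below extracts one character trigram per character position and pools the trigrams from labeled and unlabeled sentences into a single vertex set. The padding symbol `#`, the function name `char_trigrams`, and the toy sentences are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter


def char_trigrams(sentence, pad="#"):
    """Return one character trigram per position, centred on each character.

    The sentence is padded on both sides so that the first and last
    characters also get a trigram (padding symbol is an assumption).
    """
    padded = pad + sentence + pad
    return [padded[i:i + 3] for i in range(len(sentence))]


# Toy data (hypothetical): two labeled sentences and one unlabeled sentence.
labeled = ["北京大学", "我在北京"]
unlabeled = ["北京很大"]

# Each distinct trigram type becomes one graph vertex; the count records
# how often that type occurs across both data sources.
vertices = Counter(t for s in labeled + unlabeled for t in char_trigrams(s))
```

Because vertices are trigram types rather than tokens, a trigram seen in both the labeled and the unlabeled data maps to the same vertex, which is what lets label information flow to the unlabeled side.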
A symmetric k-NN graph is utilized with the edge weights calculated by a symmetric similarity function designed by Zeng et al. (2013).
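A minimal sketch of a symmetric k-NN graph follows. It assumes each vertex is described by a sparse feature-weight dictionary and uses cosine similarity as a stand-in; the actual similarity function of Zeng et al. (2013) is not reproduced here, and the names `cosine` and `symmetric_knn_graph` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def symmetric_knn_graph(vectors, k):
    """Connect a and b if either one is among the other's k nearest neighbours."""
    names = list(vectors)
    neighbours = {}
    for a in names:
        scored = [(cosine(vectors[a], vectors[b]), b) for b in names if b != a]
        # Highest similarity first; ties broken by name for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        neighbours[a] = {b for s, b in scored[:k] if s > 0}
    graph = {a: {} for a in names}
    for a in names:
        for b in neighbours[a]:
            weight = cosine(vectors[a], vectors[b])
            # Store the edge in both directions so the graph is undirected.
            graph[a][b] = weight
            graph[b][a] = weight
    return graph


# Toy feature vectors (hypothetical).
vectors = {
    "A": {"x": 1, "y": 1},
    "B": {"x": 1},
    "C": {"y": 1, "z": 1},
    "D": {"z": 1},
}
graph = symmetric_knn_graph(vectors, k=1)
```

With `k=1` on this toy input, `A` pairs with `B` and `C` pairs with `D`; the weaker `A`-`C` similarity is dropped, which is how the k-NN restriction keeps the graph sparse.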
The feature set we employ to measure the similarity of two vertices, based on co-occurrence statistics, is the one optimized by Han et al. (2013) for CNER tasks, as shown in Table 1.

Feature | Meaning
$U_n$, $n \in (-4, 2)$ | Unigram, from the previous 4th to the following 2nd character
$B_{n,n+1}$, $n \in (-2, 1)$ | Bigram, 4 pairs of features, from the previous 2nd to the following 2nd character

After graph construction on both labeled and unlabeled data, we use the label propagation algorithm with a sparsity-inducing penalty (Das and Smith, 2012) to induce trigram-level label distributions from the constructed graph; this step is based on the Junto toolkit (Talukdar and Pereira, 2010).
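To make the idea of label propagation concrete, the sketch below implements a basic clamped propagation scheme: labeled vertices keep their empirical distributions, and each unlabeled vertex repeatedly takes the weighted average of its neighbours' distributions. This is a simplified stand-in, not the sparsity-inducing penalty algorithm of Das and Smith (2012) and not the Junto toolkit's API; the function name `propagate` and the toy graph are assumptions.

```python
def propagate(graph, seeds, labels, iterations=20):
    """Basic clamped label propagation over a weighted undirected graph.

    graph:  {vertex: {neighbour: weight}}
    seeds:  {labeled vertex: {label: probability}} (kept fixed)
    labels: list of all label names
    """
    uniform = {y: 1.0 / len(labels) for y in labels}
    # Unlabeled vertices start from a uniform distribution.
    dist = {v: dict(seeds.get(v, uniform)) for v in graph}
    for _ in range(iterations):
        updated = {}
        for v, nbrs in graph.items():
            if v in seeds or not nbrs:
                # Seeds are clamped; isolated vertices cannot be updated.
                updated[v] = dist[v]
                continue
            total = sum(nbrs.values())
            updated[v] = {
                y: sum(w * dist[u][y] for u, w in nbrs.items()) / total
                for y in labels
            }
        dist = updated
    return dist


# Toy chain a - u - b (hypothetical): u is closer to a than to b.
toy_graph = {"a": {"u": 3.0}, "u": {"a": 3.0, "b": 1.0}, "b": {"u": 1.0}}
toy_seeds = {"a": {"PER": 1.0, "O": 0.0}, "b": {"PER": 0.0, "O": 1.0}}
result = propagate(toy_graph, toy_seeds, labels=["PER", "O"])
```

In the toy chain, the unlabeled vertex `u` ends up with three quarters of its mass on `PER`, mirroring the 3:1 ratio of its edge weights to the two seeds.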

CRF Training
In the CRF model, assume a graph $G = (V, E)$ comprising a set $V$ of vertices or nodes together with a set $E$ of edges or lines, and let $Y = \{Y_v \mid v \in V\}$, so that $Y$ is indexed by the vertices of $G$. The joint distribution over the label sequence $Y$ given $X$ takes the form:

$$p_\theta(Y \mid X) \propto \exp\Big( \sum_{e \in E,\, k} \lambda_k f_k(e, Y|_e, X) + \sum_{v \in V,\, k} \mu_k g_k(v, Y|_v, X) \Big)$$

Here $f_k$ and $g_k$ are the feature functions, and $\lambda_k$ and $\mu_k$ are the parameters trained from a specific dataset (Lafferty et al., 2001). The feature set employed in CRF learning is also the optimized one shown in Table 1. The training method utilized for the CRF model is a quasi-Newton algorithm. The corpus automatically annotated by graph-based label propagation affects the trained parameters $\lambda_k$ and $\mu_k$.
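The unigram and bigram templates of Table 1 can be sketched as a feature extractor over a character sequence. The feature-name strings, the padding token `<PAD>`, and the function name `char_features` are assumptions for illustration; the paper does not specify how features are encoded for the CRF toolkit.

```python
def char_features(chars, i, pad="<PAD>"):
    """Build Table 1 style features for the character at position i.

    Unigrams cover offsets -4..+2; bigrams cover the four adjacent
    pairs starting at offsets -2..+1.
    """
    def at(offset):
        j = i + offset
        return chars[j] if 0 <= j < len(chars) else pad

    feats = [f"U[{n}]={at(n)}" for n in range(-4, 3)]
    feats += [f"B[{n},{n + 1}]={at(n)}{at(n + 1)}" for n in range(-2, 2)]
    return feats


# Features for the third character of a toy sentence (hypothetical).
feats = char_features(list("我在北京大学"), 2)
```

Each position yields seven unigram and four bigram features, so the window is asymmetric: it looks further back (four characters) than forward (two characters).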

Data
We employ the SIGHAN bakeoff-3 (Levow, 2006) MSRA (Microsoft Research Asia) training and testing data as the standard setting. To test the effectiveness of the GBSSL method for the CRF model in CNER tasks, we utilize plain (unannotated) text from SIGHAN bakeoff-2 (Emerson, 2005) and bakeoff-4 (Jin and Chen, 2008) as external unlabeled data. The data sets are described in Table 2.

Result Analysis
We set two baseline scores for the evaluation. The first baseline is a simple left-to-right maximum matching model (MaxMatch) based on the training data; the second is the closed CRF model (Closed-CRF), which does not use unlabeled data. The semi-supervised CRF learning that incorporates the GBSSL model is denoted as GBSSL-CRF.
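A left-to-right maximum matching baseline can be sketched as a greedy longest-match scan against a lexicon of entity strings. The sketch assumes the lexicon is collected from the annotated training data and maps each entity string to one type; the paper does not give these details, so the function name `max_match` and the toy lexicon are hypothetical.

```python
def max_match(text, lexicon, max_len=None):
    """Greedy left-to-right longest match of lexicon entries in text.

    Returns a list of (start, end, entity_type) spans, end exclusive.
    """
    max_len = max_len or max(map(len, lexicon), default=1)
    spans, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in lexicon:
                spans.append((i, i + size, lexicon[piece]))
                i += size
                break
        else:
            # No entry starts here; move one character forward.
            i += 1
    return spans


# Toy lexicon (hypothetical), as if extracted from training annotations.
lexicon = {"北京": "LOC", "北京大学": "ORG", "张三": "PER"}
spans = max_match("张三在北京大学", lexicon)
```

Because the longest entry wins, `北京大学` is tagged as one ORG span rather than a LOC span followed by untagged characters; the same greediness is also why such a baseline cannot recognize entities unseen in training.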
The training costs of the CRF learning stage are reported in a separate table. The evaluation results are shown in Table 4 in terms of recall, precision, and their harmonic mean (F1-score). Both the Closed-CRF and GBSSL-CRF models substantially outperform the baseline-1 model (MaxMatch). Compared with the Closed-CRF model, the GBSSL-CRF model yields higher precision and lower recall, resulting in a slight improvement in F1-score. For both GBSSL-CRF and Closed-CRF, precision is higher than recall.
To look inside the GBSSL performance on each kind of entity, Table 5 gives the detailed F1-scores. Both GBSSL-CRF and Closed-CRF perform better on the LOC entity type and worse on the PER and ORG entity types. The GBSSL model enhances CRF learning on the two difficult entity types, PER and ORG, with F1 improvements of 0.28% and 0.58% respectively. However, it decreases the F1-score on the LOC entity type by 0.19%. The lower performance on LOC may be because the unlabeled data amounts to only 62.75% of the training corpus, which is not large enough to cover the out-of-vocabulary (OOV) LOC words in the test data; in addition, the unlabeled data also brings some noise into the model.

Related Work
Nadeau (2007) employs a semi-supervised learning method to recognize 100 entity types in English documents with little supervision. Similarly, Liao and Veeramachaneni (2009) propose a simple semi-supervised algorithm for English entity recognition. Liu et al. (2011) design an application of semi-supervised learning to English NER on tweets. Pham et al. (2012) apply semi-supervised CRF learning with generalized expectation criteria to the Vietnamese NER task. Similarly, Vo and Ock (2012) use a hybrid semi-supervised learning approach for NER in Vietnamese documents. Other recent work proposes the use of bilingual constraints to enhance NER accuracy.
Some advanced GBSSL techniques are introduced in Zhu and Lafferty (2005), Culp and Michailidis (2008), and Zhang and Wang (2011).

Conclusion and Future Work
This paper examines the effectiveness of the GBSSL model for the traditional CNER task. The experiments indicate that GBSSL can enhance the state-of-the-art CRF learning model. The improvement is small, possibly because the unlabeled data is not large enough. In future work, we plan to use a larger unlabeled dataset to enhance the CRF learning model.
The feature set optimized for CRF learning may not be the best one for similarity calculation in the graph construction stage. We will therefore try to select a better feature set for measuring vertex similarity in graph construction on CNER documents.
In this paper, we used the Microsoft Research Asia (MSRA) corpus for experiments. We will test on more kinds of Chinese corpora, such as the CITYU and LDC corpora.
The GBSSL model generally improves the tagging accuracy of out-of-vocabulary (OOV) words in the test data, i.e., words that are unseen in the training corpora. In future work, we plan to give a detailed analysis of the GBSSL model's performance on OOV words in CNER tasks.