Word-Context Character Embeddings for Chinese Word Segmentation

Neural parsers have benefited from automatically labeled data via dependency-context word embeddings. We investigate training character embeddings on a word-based context in a similar way, showing that the simple method improves state-of-the-art neural word segmentation models significantly, beating tri-training baselines for leveraging auto-segmented data.


Introduction
Neural network Chinese word segmentation (CWS) models (Liu et al., 2016; Cai and Zhao, 2016) are appealing for their strong feature representation ability, employing unigram and bigram character embeddings as input features (Zheng et al., 2013; Pei et al., 2014; Ma and Hinrichs, 2015; Chen et al., 2015a). They give state-of-the-art performance. We investigate leveraging automatically segmented text to improve their accuracy.
Such semi-supervised methods can be divided into two main categories. The first is bootstrapping, which includes self-training and tri-training. The idea is to generate more training instances by automatically labeling large-scale data. Self-training (Yarowsky, 1995; McClosky et al., 2006; Huang et al., 2010; Liu and Zhang, 2012) labels additional data using the base classifier itself, and tri-training (Zhou and Li, 2005; Li et al., 2014) uses two extra classifiers, taking the instances to which both assign the same labels as additional training data. The second is knowledge distillation, which extracts knowledge from large-scale auto-labeled data as features.
* Equal contributions
Tri-training has been used in neural parsing, giving considerable improvements for both dependency (Weiss et al., 2015) and constituent parsing (Vinyals et al., 2015; Choe and Charniak, 2016). Knowledge from auto-labeled data has also been used for parsing (Bansal et al., 2014; Melamud et al., 2016), where word embeddings are trained on automatic dependency tree contexts. Such knowledge, for example label distribution information, has also proved effective in conventional discrete CWS models (Wang et al., 2011). However, it has not been investigated for neural CWS.
We propose word-context character embeddings (WCC), using segmentation label information in the pre-training of unigram and bigram character embeddings. The method packs the label distribution information into the embeddings, which can be regarded as a form of knowledge parameterization. Our idea follows Levy and Goldberg (2014), who use dependency contexts to train word embeddings. Additionally, motivated by co-training, we propose multi-view word-context character embeddings for cross-domain segmentation, which pre-train two types of embeddings, on in-domain and out-of-domain data, respectively. In-domain embeddings are used to address data sparseness, and out-of-domain embeddings are used for domain adaptation.
Our proposed model is simple, efficient and effective, giving an average 1% accuracy improvement on in-domain data and 3.5% on out-of-domain data, significantly outperforming self-training and tri-training methods for leveraging auto-segmented data.
Each character is assigned one of the segmentation labels B, M, E and S (Xue, 2003; Low et al., 2005; Zhao et al., 2006). B, M and E indicate that the character is the beginning, middle or end of a multi-character word, respectively, and S indicates that the current character is a single-character word. Following Chen et al. (2015b), a standard bi-LSTM model (Graves, 2008) is used to assign a segmentation label to each character. As shown in Figure 1, our model consists of a representation layer and a scoring layer.
The representation layer uses a bi-LSTM to capture the context of each character in the sentence. Given a sentence $\{w_1, w_2, w_3, \cdots, w_N\}$, where $w_i$ is the $i$-th character in the sentence and $N$ is the sentence length, we have corresponding embeddings $e_{w_i}$ and $e_{w_{i-1}w_i}$ for each character unigram $w_i$ and character bigram $w_{i-1}w_i$, respectively. A forward word representation $e^f_i$ is calculated from these embeddings, and a backward representation $e^b_i$ is obtained in the same way. Then $e^f_i$ and $e^b_i$ are fed into the forward and backward LSTM units at the current position, yielding the corresponding forward and backward LSTM representations $r^{\mathrm{lstm\text{-}f}}_i$ and $r^{\mathrm{lstm\text{-}b}}_i$, respectively.
In the scoring layer, we first obtain a linear combination of $r^{\mathrm{lstm\text{-}f}}_i$ and $r^{\mathrm{lstm\text{-}b}}_i$, which is the final representation $r_i$ at the $i$-th position.
Given the representation $r_i$, we use a scoring unit to score each potential segment label. For example, the score of segment label M is computed from $r_i$, the score matrix $W_M$ for label M, and the label embedding $e_M$ for label M.
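To make the B/M/E/S label scheme concrete, the following minimal sketch converts a segmented sentence into one label per character. The function name `words_to_bmes` is hypothetical and is not taken from the paper or its released code.

```python
def words_to_bmes(words):
    """Map a list of words to one B/M/E/S label per character."""
    labels = []
    for word in words:
        if not word:
            continue                      # ignore empty strings
        if len(word) == 1:
            labels.append("S")            # single-character word
        else:
            labels.append("B")            # first character
            labels.extend("M" * (len(word) - 2))  # middle characters, if any
            labels.append("E")            # last character
    return labels


# Words borrowed from the case study; the sentence itself is made up.
print(words_to_bmes(["七星剑", "器重", "他"]))
# → ['B', 'M', 'E', 'B', 'E', 'S']
```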

Word-Context Character Embeddings
Our model structure is derived from the skip-gram model (Mikolov et al., 2013), similar to Levy and Goldberg (2014). Given a sentence of length $n$, $\{w_1, w_2, w_3, \cdots, w_n\}$, and its corresponding segment labels $\{l_1, l_2, l_3, \cdots, l_n\}$, the pre-training context of the current character $w_t$ consists of the surrounding characters in a window of size $c$, together with their corresponding segment labels (Figure 2). Characters $w_i$ and labels $l_i$ in the context are represented by vectors $e^c_{w_i} \in \mathbb{R}^d$ and $e^c_{l_i} \in \mathbb{R}^d$, respectively, where $d$ is the embedding dimensionality.
The word-context embedding of character $w_t$ is represented as $e_{w_t} \in \mathbb{R}^d$, which is trained by predicting the surrounding context representations $e^c_{w'}$ and $e^c_{l_i}$, thereby parameterizing the labeled segmentation information in the embedding parameters. To capture order information (Ling et al., 2015), we use a different embedding matrix for each context position, so that the same context word receives a different embedding at each position relative to the focus word. In particular, our context window size is five, covering the focus word and two positions on each side. As a result, each word has four different versions of $e^c$, namely $e^c_{-1}$, $e^c_{-2}$, $e^c_{+1}$ and $e^c_{+2}$, each taken from a distinct embedding matrix. Given the context window $[w_{-2}, w_{-1}, w, w_{+1}, w_{+2}]$, $w_{-1}$ is the first context word to the left of the focus word $w$, and its embedding $e^c_{-1,w_i}$ is selected from embedding matrix $E_{-1}$; $w_{+1}$ is the first word to the right of $w$, and its embedding $e^c_{+1,w_i}$ is selected from embedding matrix $E_{+1}$.
Note that each character has two types of embeddings: $e_{w_i}$ is the embedding of $w_i$ when $w_i$ is the focus word, and $e^c_{w_i}$ is the embedding of $w_i$ when it is used as a surrounding context word. There is no $e_{l_i}$ because $l_i$ acts only as surrounding context. After pre-training, $e_{w_i}$ is used as the WCC embedding.
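The context definition can be illustrated with a small sketch that enumerates the training tuples for one auto-segmented sentence. It assumes a window of two positions on each side of the focus character; the function name `wcc_training_pairs` and the tuple layout are illustrative choices, not the authors' implementation.

```python
def wcc_training_pairs(chars, labels, half_window=2):
    """Return a list of (focus_char, offset, context_char, context_label).

    `offset` is the signed position of the context character relative to
    the focus character, so it can select a position-specific context
    embedding matrix (for example E_{-1} or E_{+1}).
    """
    if len(chars) != len(labels):
        raise ValueError("one label is required per character")
    pairs = []
    for t, focus in enumerate(chars):
        for offset in range(-half_window, half_window + 1):
            j = t + offset
            if offset == 0 or j < 0 or j >= len(chars):
                continue  # skip the focus itself and positions outside the sentence
            pairs.append((focus, offset, chars[j], labels[j]))
    return pairs


pairs = wcc_training_pairs(list("七星剑好"), ["B", "M", "E", "S"])
print(pairs[0])  # → ('七', 1, '星', 'M')
```

Each tuple yields two prediction targets for the focus character: the context character and the context label, both at the given offset.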
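The objective of our model is to maximize the average log probability of the context:
$$\frac{1}{n}\sum_{t=1}^{n}\sum_{j \neq 0}\left[\log p(w_{t+j} \mid w_t) + \log p(l_{t+j} \mid w_t)\right],$$
where $j$ ranges over the nonzero offsets inside the context window. Negative sampling (Mikolov et al., 2013) is used, where $\log p(w_{t+j} \mid w_t)$ and $\log p(l_{t+j} \mid w_t)$ are computed as:
$$\log p(w_{t+j} \mid w_t) = \log \sigma\left({e^c_{w_{t+j}}}^{\top} e_{w_t}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\left(-{e^c_{w_i}}^{\top} e_{w_t}\right)\right]$$
and
$$\log p(l_{t+j} \mid w_t) = \log \sigma\left({e^c_{l_{t+j}}}^{\top} e_{w_t}\right) + \sum_{i=1}^{k} \mathbb{E}_{l_i \sim P_n(l)}\left[\log \sigma\left(-{e^c_{l_i}}^{\top} e_{w_t}\right)\right],$$
respectively, where $P_n(w)$ and $P_n(l)$ are the noise distributions and $k$ is the number of negative samples for each data sample.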
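As a rough illustration of the negative-sampling term, the sketch below scores one (focus, context) pair against `k` sampled negatives, following the standard skip-gram formulation of Mikolov et al. (2013). Embeddings are plain Python lists, and all function names are hypothetical; drawing the negatives from the noise distribution is left to the caller.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def neg_sampling_log_prob(focus_vec, context_vec, negative_vecs):
    """Negative-sampling estimate of log p(context | focus).

    focus_vec     : embedding of the focus character
    context_vec   : context embedding of the observed character or label
    negative_vecs : context embeddings of k items drawn from the noise
                    distribution
    """
    score = log_sigmoid(dot(context_vec, focus_vec))   # observed pair
    for neg in negative_vecs:
        score += log_sigmoid(-dot(neg, focus_vec))     # sampled negatives
    return score
```

The same function applies to both character contexts and label contexts; only the source of `context_vec` and `negative_vecs` differs.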
Bigram embeddings are trained in the same way as unigram character embeddings. For out-of-domain segmentation, we pre-train two embeddings for each token, extracting knowledge from the two domains, respectively.

Set-up
We perform experiments on three standard datasets for Chinese word segmentation: PKU and MSR from the second SIGHAN bakeoff shared task, and Chinese Treebank 6.0 (CTB6). For PKU and MSR, 10% of the training data are randomly selected as development data. We follow previous work to split the CTB6 corpus into training, development and testing sections. For evaluating cross-domain performance, we also experiment on Chinese novel data. Again following previous work, the training set of CTB5 is used for training, and the manually annotated sentences of the free Internet novel 'Zhuxian' (ZX) are used as the development and test data (Liu and Zhang, 2012). Chinese Gigaword (LDC2011T13, 4M) is used as in-domain unlabeled data. For out-of-domain data, 20K raw sentences of Zhuxian are used. We take self-training and tri-training as baselines, which also use large-scale auto-segmented data. For self-training, skip-gram pre-training and word-context character embeddings, the unlabeled corpus is segmented automatically by our baseline model. For tri-training, we additionally use ZPar (Zhang and Clark, 2007) and ICTCLAS as our base classifiers.
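The agreement criterion behind tri-training can be sketched as follows: a raw sentence is kept as extra training data only when two additional segmenters produce the same segmentation. The two toy segmenters below are stand-ins for illustration only; they are not ZPar or ICTCLAS, and `select_agreed_sentences` is a hypothetical name.

```python
def select_agreed_sentences(raw_sentences, segmenter_a, segmenter_b):
    """Keep auto-segmented sentences on which two extra segmenters agree.

    `segmenter_a` and `segmenter_b` are callables mapping a raw string to
    a list of words; they stand in for the two extra classifiers.
    """
    selected = []
    for sentence in raw_sentences:
        seg_a = segmenter_a(sentence)
        seg_b = segmenter_b(sentence)
        if seg_a == seg_b:          # identical segmentation, so keep it
            selected.append(seg_a)
    return selected


# Toy segmenters, for illustration only.
def split_every_two(sentence):
    return [sentence[i:i + 2] for i in range(0, len(sentence), 2)]


def split_first_two_then_rest(sentence):
    return [sentence[:2], sentence[2:]] if len(sentence) > 2 else [sentence]


extra = select_agreed_sentences(
    ["鬼王器重", "七星剑很锋利"], split_every_two, split_first_two_then_rest
)
print(extra)  # → [['鬼王', '器重']]
```

Self-training corresponds to the simpler case where the base model's own output is added without any agreement check.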
We use F1 to evaluate segmentation accuracy. The recalls of in-vocabulary (IV) and out-of-vocabulary (OOV) words are also measured.
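These metrics can be sketched for a single sentence by comparing word spans. The sketch assumes that a word counts as IV when it appears in a given training vocabulary and as OOV otherwise; the function names are hypothetical, and a corpus-level score would aggregate the counts over all sentences before dividing.

```python
def word_spans(words):
    """Convert a word list to a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_scores(gold_words, pred_words, train_vocab):
    """Return word-level precision, recall, F1, IV recall and OOV recall."""
    gold, pred = word_spans(gold_words), word_spans(pred_words)
    correct = gold & pred                      # spans found in both
    precision = len(correct) / len(pred) if pred else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0

    text = "".join(gold_words)
    iv = {s for s in gold if text[s[0]:s[1]] in train_vocab}
    oov = gold - iv
    r_iv = len(correct & iv) / len(iv) if iv else 0.0
    r_oov = len(correct & oov) / len(oov) if oov else 0.0
    return precision, recall, f1, r_iv, r_oov


p, r, f1, r_iv, r_oov = segmentation_scores(
    ["鬼王", "器重", "他"], ["鬼", "王", "器重", "他"], {"器重", "他"}
)
print(round(f1, 4), r_iv, r_oov)  # → 0.5714 1.0 0.0
```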

Hyper-Parameters
The hyper-parameters used in this work are listed in Table 1; they are tuned according to the development set of CTB6. Many previous character-based CWS models use a transition matrix to model tag dependencies and a CRF for structured inference (Pei et al., 2014; Chen et al., 2015a). However, we find that the greedy model obtains comparable segmentation accuracies across CTB6, PKU and MSR while running much faster (Table 2). Hence we adopt the greedy model as our baseline segmentation model.
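Greedy decoding, as opposed to CRF inference, can be sketched as picking the best label independently at each position and then rebuilding words from the labels. The score rows below are made-up numbers, and both function names are hypothetical.

```python
LABELS = ("B", "M", "E", "S")


def greedy_decode(score_rows):
    """Pick the highest-scoring label independently at each position."""
    decoded = []
    for row in score_rows:
        best = max(range(len(LABELS)), key=lambda k: row[k])
        decoded.append(LABELS[best])
    return decoded


def labels_to_words(chars, labels):
    """Rebuild words: a word ends after every E or S label."""
    words, current = [], ""
    for ch, label in zip(chars, labels):
        current += ch
        if label in ("E", "S"):
            words.append(current)
            current = ""
    if current:                 # flush a trailing word left open by B or M
        words.append(current)
    return words


scores = [[0.9, 0.0, 0.0, 0.1], [0.1, 0.2, 0.6, 0.1], [0.0, 0.0, 0.1, 0.9]]
labels = greedy_decode(scores)
print(labels_to_words("鬼王来", labels))  # → ['鬼王', '来']
```

Because each position is decided on its own, a greedy decoder can emit label sequences that a transition matrix would forbid (for example B followed directly by S); `labels_to_words` tolerates such sequences rather than rejecting them.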

Utilizing Varying-Scale Data
The results of self-training and tri-training with varying-scale training data are listed in Table 3, where +4X means adding 4 times the size of the supervised training data to the training set. We find that self-training does not work well, and tri-training with 16X gives a 0.5% accuracy improvement. We adopt this setting for our baseline in the remaining experiments. We also try to choose more effective examples for self-training and tri-training by selecting training instances according to the base segmentation model score. However, segmentation performance does not improve. A possible reason is that the training instances with higher confidence are always shorter than the original sampled sentences, which may not be very helpful for semi-supervised segmentation.
As shown in Table 4, pre-training with conventional skip-gram embeddings gives only small improvements, which is consistent with the findings of previous work (Chen et al., 2015a; Ma and Hinrichs, 2015; Cai and Zhao, 2016). Segmentation with self-training even shows accuracy drops on PKU and MSR. We speculate that self-training with the neural CWS baseline is sensitive to segmentation errors in the auto-labeled data. On average, our method obtains an absolute 1% accuracy improvement over the baseline, significantly outperforming the other semi-supervised methods.
We compare our model with other state-of-the-art segmentation models, which are grouped into three classes: traditional segmentation models (non-nn), neural segmentation models (nn), and combinations of neural and traditional discrete features (comb). Our simple model gives top accuracies compared with related work. Liu et al. (2016) and Cai and Zhao (2016), among others, propose to incorporate word embedding features into neural CWS, pre-training the word embeddings on large-scale labeled data. In contrast, we employ a simpler character-level model containing word information, yet obtain higher F1 scores.
Figure 3: Case studies.

Out-of-Domain Results
We test the out-of-domain performance of our model on the ZX dataset. We also use multi-view word-context character embeddings (WCC) for cross-domain segmentation, which combine two types of embeddings by simple vector concatenation. One type is pre-trained on in-domain data, and the other is pre-trained on out-of-domain data. In this case, the multi-view embeddings include cross-domain information, which may enhance cross-domain segmentation performance (Mou et al., 2016).
As shown in Table 5, both word-context character (WCC) embeddings and multi-view word-context character embeddings give significantly higher accuracy improvements than other semi-supervised methods. Additionally, we find that multi-view WCC embeddings give an extra 1% F1 improvement over WCC embeddings. Our proposed model also significantly improves the OOV recall (ROOV) and IV recall (RIV). By studying cases of segmented output (Figure 3), we find that our model can recognize OOV words such as '鬼王' and '七星剑' and the IV word '器重', which are incorrectly labeled by the baseline. This confirms that our proposed model helps with the data sparseness problem in the closed-domain setting and with domain adaptation in the cross-domain setting.
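The multi-view combination by vector concatenation can be sketched as below, with embedding tables represented as plain dictionaries. The zero-vector fallback for tokens missing from one table is an assumption of this sketch, not a detail stated in the text, and `multi_view_embedding` is a hypothetical name.

```python
def multi_view_embedding(token, in_domain, out_of_domain, dim):
    """Concatenate the in-domain and out-of-domain vectors of `token`.

    Tokens missing from a table fall back to a zero vector of size `dim`
    (an assumption made for this illustration).
    """
    zero = [0.0] * dim
    return list(in_domain.get(token, zero)) + list(out_of_domain.get(token, zero))


in_domain = {"剑": [1.0, 2.0]}
out_of_domain = {"剑": [3.0, 4.0], "鬼": [5.0, 6.0]}
print(multi_view_embedding("剑", in_domain, out_of_domain, dim=2))
# → [1.0, 2.0, 3.0, 4.0]
```

The resulting vector has twice the dimensionality of a single view, so the input layer of the segmenter would need to be sized accordingly.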
We also list the results of previous work on this dataset. One of these systems obtains better out-of-domain performance than our model. However, those results cannot be compared directly with ours because they rely on partially labeled URL link data from Chinese Wikipedia for training.

Conclusion
We proposed word-context character embeddings for semi-supervised neural CWS, which make the segmentation model more accurate on in-domain data and more robust on out-of-domain data. Our segmentation model is simple yet effective, achieving state-of-the-art segmentation accuracies on standard benchmarks. It can also be useful for other NLP tasks with small labeled training data but large unlabeled data. Our code can be downloaded at https://github.com/zhouh/WCC-Segmentation.
Table 5: Results on the out-of-domain data. Models with † do not use large-scale data, models with ‡ use in-domain large-scale data, and models with ♯ use both in-domain and out-of-domain large-scale data.