Unsupervised Neural Word Segmentation for Chinese via Segmental Language Modeling

Previous traditional approaches to unsupervised Chinese word segmentation (CWS) can be roughly classified into discriminative and generative models. The former use carefully designed goodness measures for candidate segmentations, while the latter focus on finding the segmentation with the highest generative probability. However, while there is a trivial way to extend the discriminative models into neural versions by using neural language models, extending the generative models is non-trivial. In this paper, we propose segmental language models (SLMs) for CWS. Our approach explicitly focuses on the segmental nature of Chinese and preserves several properties of language models. In SLMs, a context encoder encodes the previous context and a segment decoder generates each segment incrementally. As far as we know, we are the first to propose a neural model for unsupervised CWS, and we achieve performance competitive with state-of-the-art statistical models on four different datasets from the SIGHAN 2005 bakeoff.


Introduction
Unlike English and many other languages, Chinese sentences have no explicit word boundaries. Therefore, Chinese Word Segmentation (CWS) is a crucial step for many Chinese Natural Language Processing (NLP) tasks such as syntactic parsing, information retrieval and word representation learning (Grave et al., 2018).

Figure 1: A segmental language model (SLM) working on y = y_1 y_2 y_3 y_4 with the candidate segmentation {y_1, y_{2:3}, y_4}, where y_0 is an additional start symbol that is kept the same for all sentences.
In recent years, neural approaches for supervised CWS have been proposed and have given results competitive with the best statistical models (Sun, 2010). However, neural approaches to unsupervised CWS have not been investigated. Previous unsupervised approaches to CWS can be roughly classified into discriminative and generative models. The former use carefully designed goodness measures for candidate segmentations, while the latter focus on designing statistical models of Chinese and finding the segmentation with the highest generative probability.
Popular goodness measures for discriminative models include Mutual Information (MI) (Chang and Lin, 2003), normalized Variation of Branching Entropy (nVBE) (Magistry and Sagot, 2012) and Minimum Description Length (MDL) (Magistry and Sagot, 2013). There is a trivial way to extend these statistical discriminative approaches, as we can simply replace the n-gram language models in them with neural language models (Bengio et al., 2003). More sophisticated neural discriminative approaches may exist, but they are not the focus of this paper.
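To make the goodness-measure idea concrete, the following is a minimal sketch of an MI-style discriminative segmenter: it estimates pointwise mutual information between adjacent characters from a raw corpus and places a word boundary wherever the PMI falls below a threshold. The function name, the threshold, and the toy corpus are illustrative, not from the paper.

```python
from collections import Counter
from math import log

def pmi_boundaries(corpus, sentence, threshold=0.0):
    """Segment `sentence` by placing a boundary wherever the pointwise
    mutual information of the adjacent character pair falls below
    `threshold` (a toy version of MI-based goodness measures)."""
    unigrams, bigrams = Counter(), Counter()
    for line in corpus:
        unigrams.update(line)
        bigrams.update(line[i:i + 2] for i in range(len(line) - 1))
    n_uni = sum(unigrams.values())
    n_bi = max(sum(bigrams.values()), 1)

    def pmi(a, b):
        p_ab = bigrams[a + b] / n_bi
        if p_ab == 0.0:
            return float("-inf")  # unseen pair: always a boundary
        return log(p_ab / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))

    words, current = [], sentence[0]
    for prev, nxt in zip(sentence, sentence[1:]):
        if pmi(prev, nxt) < threshold:  # weak association: cut here
            words.append(current)
            current = ""
        current += nxt
    words.append(current)
    return words
```

A neural extension would simply replace the count-based probabilities inside `pmi` with probabilities from a neural language model, as the text notes.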
For generative approaches, typical statistical models include the Hidden Markov Model (HMM) (Chen et al., 2014), the Hierarchical Dirichlet Process (HDP) (Goldwater et al., 2009) and the Nested Pitman-Yor Process (NPY) (Mochihashi et al., 2009). However, none of them can be easily extended into a neural model. Therefore, neural generative models for word segmentation remain to be investigated.
In this paper, we propose segmental language models (SLMs), a neural generative model that explicitly focuses on the segmental nature of Chinese: SLMs can directly generate segmented sentences and give the corresponding generative probability. We evaluate our methods on four different benchmark datasets from the SIGHAN 2005 bakeoff (Emerson, 2005), namely PKU, MSR, AS and CityU. To our knowledge, we are the first to propose a neural model for unsupervised Chinese word segmentation and to achieve performance competitive with state-of-the-art statistical models on four different datasets.

Segmental Language Models
In this section, we present our segmental language models (SLMs). Notice that in Chinese NLP, characters are the atomic elements. Thus, in the context of CWS, we use "character" instead of "word" for language modeling.

Language Models
The goal of language modeling is to learn the joint probability function of sequences of characters in a language. However, this is intrinsically difficult because of the curse of dimensionality. Traditional approaches obtain generalization based on n-grams, while neural approaches introduce a distributed representation for characters to fight the curse of dimensionality.
A neural Language Model (LM) gives the conditional probability of the next character given the previous ones, and is usually implemented with a Recurrent Neural Network (RNN):

p(y_t | y_{1:t-1}) = g(h_t, y_t)    (1)

where y_t is the distributed representation of the t-th character and h_t represents the information of the previous characters.
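As a concrete illustration of Eq. (1), the following is a minimal NumPy sketch of one RNN language-model step: the hidden state summarizes the previous characters and a softmax over the vocabulary gives p(y_t | y_{1:t-1}). All sizes and parameter names are illustrative, and the paper itself uses LSTMs rather than this vanilla cell; training is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H, E = 10, 8, 6            # vocab size, hidden size, embedding size

# Randomly initialized parameters (a trained model would learn these).
emb  = rng.normal(size=(V, E))
W_xh = rng.normal(size=(E, H))
W_hh = rng.normal(size=(H, H))
W_hy = rng.normal(size=(H, V))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def lm_step(h_prev, char_id):
    """One RNN step: consume the previous character, return the new
    hidden state h_t and the distribution p(y_t | y_{1:t-1})."""
    h = np.tanh(emb[char_id] @ W_xh + h_prev @ W_hh)
    return h, softmax(h @ W_hy)

h = np.zeros(H)
for c in [1, 4, 2]:           # a toy character-id sequence
    h, p_next = lm_step(h, c)
```

After the loop, `p_next` is a proper distribution over the next character given the prefix.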

Segmental Language Models
Similar to neural language modeling, the goal of segmental language modeling is to learn the joint probability function of segmented sequences of characters. Thus, for each segment, we have:

p(y^(i)_t | y^(i)_{1:t-1}, y^(1:i-1)) = g(h^(i)_t, y^(i)_t)    (2)

where y^(i)_t is the distributed representation of the t-th character in the i-th segment and y^(1:i-1) denotes the previous segments. The concatenation of all segments y^(i)_{1:T_i} is exactly the whole sentence y_{1:T}, where T_i is the length of the i-th segment y^(i) and T is the length of the sentence y.
Moreover, we introduce a context encoder RNN that processes the character sequence y^(1:i-1) so that the initial input y^(i)_0 of the segment decoder carries the context information. Notice that although we have a context encoder and a segment decoder g, an SLM is not an encoder-decoder model, because the content that the decoder generates is not the same as what the encoder encodes. Figure 1 illustrates how SLMs work with a candidate segmentation.

Properties of SLMs
However, in the unsupervised scheme, the given sentences are not segmented. Therefore, the probability of an SLM generating a given sentence is the joint probability over all possible segmentations:

p(y_{1:T}) = Σ_{all segmentations} Π_i Π_{t=1}^{T_i+1} p(y^(i)_t | y^(i)_{1:t-1}, y^(1:i-1))

where y^(i)_{T_i+1} = <eos> is the end-of-segment symbol appended to each segment, and y^(i)_0 is the context representation of y^(1:i-1).
Moreover, for sentence generation, SLMs can generate arbitrary sentences by producing segments one by one and stopping upon generating the end-of-sentence symbol <EOS>. In addition, the time complexity is linear in the length of the generated sentence, as we can keep the hidden state of the context encoder RNN and update it when generating new words.
Last but not least, it is easy to verify that SLMs preserve the probabilistic property of language models:

Σ_i p(s_i) = 1

where s_i enumerates all possible sentences. In summary, segmental language models can perfectly substitute for vanilla language models.

Training and Decoding
Similar to language models, training is achieved by maximizing the log-likelihood of the training corpus. Luckily, we can compute the objective in linear time using dynamic programming, given the initial condition p(y_{1:0}) = 1:

p(y_{1:n}) = Σ_{k=1}^{K} p(y_{1:n-k}) p̃(y_{n-k+1:n})    (7)

where p(·) is the joint probability over all possible segmentations, p̃(·) is the probability of a single segment, and K is the maximal length of a segment.
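The recursion in Eq. (7) can be sketched as a short forward dynamic program. Here `seg_prob` stands in for the neural segment probability p̃(·); the function names are illustrative, not from the paper's code.

```python
def marginal_prob(y, seg_prob, K):
    """Compute p(y_{1:n}) = sum_{k=1..K} p(y_{1:n-k}) * p~(y_{n-k+1:n})
    left to right, i.e. the total probability of the sentence over all
    segmentations, in O(n*K) calls to seg_prob."""
    n = len(y)
    p = [0.0] * (n + 1)
    p[0] = 1.0                  # initial condition: p(y_{1:0}) = 1
    for i in range(1, n + 1):
        for k in range(1, min(K, i) + 1):
            # extend every prefix ending at i-k by one segment of length k
            p[i] += p[i - k] * seg_prob(y[i - k:i])
    return p[n]
```

A brute-force sum over all explicit segmentations gives the same value, which is a convenient correctness check for the dynamic program.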
We can also find the segmentation with the maximal probability (namely, decoding) in linear time using dynamic programming in a similar way, with p̂(y_{1:0}) = 1:

p̂(y_{1:n}) = max_{k=1..K} p̂(y_{1:n-k}) p̃(y_{n-k+1:n})    (8)

δ(y_{1:n}) = argmax_{k=1..K} p̂(y_{1:n-k}) p̃(y_{n-k+1:n})    (9)

where p̂ is the probability of the best segmentation and δ is used to trace back the decoded segmentation.
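The max-product recursion with trace-back can be sketched as follows; as above, `seg_prob` stands in for p̃(·) and the names are illustrative.

```python
def decode(y, seg_prob, K):
    """Viterbi-style decoding: best[i] is p^(y_{1:i}), and delta[i]
    stores the argmax segment length for trace-back (Eq. 8-9)."""
    n = len(y)
    best = [0.0] * (n + 1)
    best[0] = 1.0               # p^(y_{1:0}) = 1
    delta = [0] * (n + 1)
    for i in range(1, n + 1):
        for k in range(1, min(K, i) + 1):
            score = best[i - k] * seg_prob(y[i - k:i])
            if score > best[i]:
                best[i], delta[i] = score, k
    # Trace back the chosen segment lengths from the end of the sentence.
    segments, i = [], n
    while i > 0:
        segments.append(y[i - delta[i]:i])
        i -= delta[i]
    return segments[::-1], best[n]
```

With a toy segment model that strongly prefers the bigram "ab", decoding "abab" recovers the two-segment analysis, and the returned segments always concatenate back to the input.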

Experimental Settings and Details
We evaluate our models on the SIGHAN 2005 bakeoff (Emerson, 2005) datasets. Following previous work (Chen et al., 2014; Wang et al., 2011; Mochihashi et al., 2009; Magistry and Sagot, 2012), we replace all punctuation marks with <punc>, English characters with <eng> and Arabic numbers with <num>, and only consider segmenting the text between punctuation marks. Following Chen et al. (2014), we use both the training data and the test data for training, and only the test data are used for evaluation. To make a fair comparison with previous work, we do not use any other larger raw corpus. We apply word2vec (Mikolov et al., 2013) to the Chinese Gigaword corpus (LDC2011T13) to obtain pretrained character embeddings.
A 2-layer LSTM (Hochreiter and Schmidhuber, 1997) is used as the segment decoder and a 1-layer LSTM is used as the context encoder.
We use stochastic gradient descent with a mini-batch size of 256 and a learning rate of 16.0 to optimize the model parameters for the first 400 steps, then use Adam (Kingma and Ba, 2014) with a learning rate of 0.005 to further optimize the models. Model parameters are initialized from normal distributions as suggested by Glorot and Bengio (2010). We clip gradients at 0.1 and apply dropout with a rate of 0.1 to the character embeddings and RNNs to prevent overfitting.
The standard word precision, recall and F1 measures (Emerson, 2005) are used to evaluate segmentation performance.
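For reference, word-level precision/recall/F1 can be computed by matching words between the gold and predicted segmentations via their character spans. This is a minimal sketch, not the official bakeoff scoring script.

```python
def prf(gold_words, pred_words):
    """Word precision, recall and F1 for two segmentations of the same
    text: a predicted word is correct iff its character span exactly
    matches a gold word's span."""
    def spans(words):
        out, start = set(), 0
        for w in words:
            out.add((start, start + len(w)))
            start += len(w)
        return out
    g, p = spans(gold_words), spans(pred_words)
    correct = len(g & p)
    prec = correct / len(p)
    rec = correct / len(g)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```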

Results and Analysis
Our final results are shown in Table 1, where the best results are in boldface.

Table 2: Results of SLM-4 incorporating ad hoc guidelines, where † represents using an additional 1024 segmented sentences as training data and * represents using rule-based post-processing.

We test the proposed SLMs with different maximal segment lengths K = 2, 3, 4 and use "SLM-K" to denote the corresponding model. We do not try K > 4 because words consisting of more than 4 characters are rare. As can be seen, it is hard to predict which choice of K gives the best performance. This is because the exact definition of what a word is remains hard to reach, and different datasets follow different guidelines. Zhao and Kit (2008) use cross-training of a supervised segmentation system to estimate the consistency between different segmentation guidelines, and the average consistency is found to be as low as 85 (F-score). Therefore, this can be regarded as a top line for unsupervised CWS. Table 1 shows that SLMs outperform the previous best discriminative and generative models on the PKU and AS datasets. This might be because the segmentation guideline of our models is closer to those of these two datasets.
Moreover, in the experiments we observe that Chinese particles often attach to other words, for example "的" following adjectives and "了" following verbs. It is hard for our generative models to split them apart. Therefore, we propose a rule-based post-processing module to deal with this problem, in which we explicitly split the attached particles from other words. The post-processing is applied to the results of "SLM-4". In addition, we also evaluate "SLM-4" trained with the first 1024 sentences of the segmented training datasets (about 5.4% of PKU, 1.2% of MSR, 0.1% of AS and 1.9% of CityU), in order to teach "SLM-4" the corresponding ad hoc segmentation guidelines. Table 2 shows the results.
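The particle-splitting idea can be sketched as a tiny post-processing pass; the particle set below is an illustrative subset, and the paper's full rule list (in its appendix) differs.

```python
PARTICLES = {"的", "了"}  # illustrative subset; the paper's actual rules differ

def split_particles(words):
    """Split a trailing particle off any multi-character word
    (a toy version of the rule-based post-processing)."""
    out = []
    for w in words:
        if len(w) > 1 and w[-1] in PARTICLES:
            out.extend([w[:-1], w[-1]])   # e.g. "高兴的" -> "高兴", "的"
        else:
            out.append(w)
    return out
```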
The table shows that only 1024 guideline sentences improve the performance of "SLM-4" significantly.
The rules we use are listed in the appendix at https://github.com/Edward-Sun/SLM.

Error      SLM-2  SLM-3  SLM-4
Insertion   7866   4803   3519
Deletion    3855   7518   8851

Table 3: Statistics of the insertion and deletion errors that SLM-K produces on the PKU dataset.

While rule-based post-processing is very effective, "SLM-4†" outperforms "SLM-4*" on all four datasets. Moreover, performance drops when applying the rule-based post-processing to "SLM-4†" on three of the datasets. These results indicate that SLMs can learn the empirical rules for word segmentation given only a small amount of training data, and that such guideline data improve the performance of SLMs more naturally than explicit rules.

The Effect of the Maximal Segment Length
The maximal segment length K represents our prior knowledge about Chinese word segmentation; for example, K = 3 means that only unigrams, bigrams and trigrams occur in the text. While there do exist words that contain more than four characters, most Chinese words are unigrams or bigrams. Therefore, K encodes a trade-off between the accuracy of short words and long words. Specifically, we investigate two major segmentation problems that affect word segmentation performance, namely insertion errors and deletion errors. An insertion error inserts a boundary inside a word, splitting a correct word, while a deletion error deletes the boundary between two words, resulting in a composition error (Li and Yuan, 1998). Table 3 shows the statistics of the different errors on PKU for our model with different K. We observe that the number of insertion errors decreases as K increases, while the number of deletion errors increases.
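Under these definitions, the two error types can be counted by comparing boundary positions; this is a minimal sketch with illustrative names, not the paper's evaluation code.

```python
def boundary_errors(gold_words, pred_words):
    """Count insertion errors (predicted boundaries that fall inside a
    gold word) and deletion errors (gold word boundaries the prediction
    misses) between two segmentations of the same text."""
    def boundaries(words):
        out, pos = set(), 0
        for w in words[:-1]:      # no boundary after the last word
            pos += len(w)
            out.add(pos)
        return out
    g, p = boundaries(gold_words), boundaries(pred_words)
    insertions = len(p - g)       # boundary inserted inside a gold word
    deletions = len(g - p)        # boundary between gold words deleted
    return insertions, deletions
```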
We also provide some examples in Table 4, taken from the results of our models. They clearly illustrate that different K can result in different errors. For example, there is an insertion error on "反过来" by SLM-2, and a deletion error on "促进" and "了" by SLM-4.

Related Work
Generative Models for CWS Goldwater et al. (2009) are the first to propose a generative model for unsupervised word segmentation.

[Table 4: example segmentations; columns "Model" and "Example"]

Segmental Sequence Models Sequence modeling via segmentations has been well investigated, with the Sleep-WAke Network (SWAN) proposed for speech recognition. SWAN is similar to SLM; however, SLMs do not have sleep-awake states, and SLMs predict the following segment given the previous context, while SWAN tries to recover the information in the encoded state. Therefore, the key difference is that SLMs are unsupervised language models while SWANs are supervised sequence-to-sequence models. SWAN was thereafter successfully applied to phrase-based machine translation. Another related work in machine translation is the online segment-to-segment neural transduction of Yu et al. (2016), where the model is able to capture unbounded dependencies in both the input and output sequences. Kong (2017) also proposed a Segmental Recurrent Neural Network (SRNN) with CTC to solve segmental labeling problems.

Conclusion
In this paper, we proposed a neural generative model for fully unsupervised Chinese word segmentation (CWS). To the best of our knowledge, this is the first neural model for unsupervised CWS. Our segmental language model is an intuitive generalization of vanilla neural language models that directly models the segmental nature of Chinese. Experimental results show that our models achieve performance competitive with the previous state-of-the-art statistical models on four datasets from SIGHAN 2005. We also show the improvement from incorporating ad hoc guidelines into our segmental language models. Our future work may include the following two directions.
• In this work, we only considered sequential segmental language modeling. In the future, we are interested in building a hierarchical neural language model like the Pitman-Yor process.
• Like vanilla language models, the segmental language models can also provide useful information for semi-supervised learning tasks. It would also be interesting to explore our models in the semi-supervised schemes.