Neural Word Segmentation Learning for Chinese

Most previous approaches to Chinese word segmentation formalize this problem as a character-based sequence labeling task, so that only contextual information within fixed-size local windows and simple interactions between adjacent tags can be captured. In this paper, we propose a novel neural framework which thoroughly eliminates context windows and can utilize the complete segmentation history. Our model employs a gated combination neural network over characters to produce distributed representations of word candidates, which are then given to a long short-term memory (LSTM) language scoring model. Experiments on the benchmark datasets show that, without the feature engineering that most existing approaches rely on, our models achieve performance competitive with or better than previous state-of-the-art methods.


Introduction
Most East Asian languages, including Chinese, are written without explicit word delimiters; word segmentation is therefore a preliminary step for processing those languages. Since Xue (2003), most methods formalize Chinese word segmentation (CWS) as a sequence labeling problem with character position tags, which can be handled with supervised learning methods such as Maximum Entropy (Berger et al., 1996; Low et al., 2005) and Conditional Random Fields (Lafferty et al., 2001; Peng et al., 2004; Zhao et al., 2006a). However, those methods depend heavily on the choice of handcrafted features.
Recently, neural models have been widely used for NLP tasks for their ability to minimize the effort in feature engineering. For the task of CWS, Zheng et al. (2013) adapted the general neural network architecture for sequence labeling proposed in (Collobert et al., 2011), and used character embeddings as input to a two-layer network. Pei et al. (2014) improved upon (Zheng et al., 2013) by explicitly modeling the interactions between local context and the previous tag. Chen et al. (2015a) proposed a gated recursive neural network to model the feature combinations of context characters. Chen et al. (2015b) used an LSTM architecture to capture potential long-distance dependencies, which alleviates the limitation of the context window size but introduces another window for hidden states.
Despite their differences, all these models are designed to solve CWS by assigning labels to the characters in the sequence one by one. At each time step of inference, these models compute the tag scores of a character based on (i) context features within a fixed-size local window and (ii) the tag of the previous character.
Nevertheless, the tag-tag transition is insufficient to model the complicated influence of previous segmentation decisions, even though that history can sometimes be a crucial clue for later segmentation decisions. The fixed context window size, broadly adopted by these methods for feature engineering, also restricts the flexibility of modeling dependencies at diverse distances. Moreover, word-level information, a unit of greater granularity as suggested in (Huang and Zhao, 2006), remains unexploited.
To alleviate the drawbacks of previous methods and relax inconvenient constraints such as the fixed-size context window, this paper re-formalizes CWS as a direct segmentation learning task. Our method does not make tagging decisions on individual characters; instead, it directly evaluates the relative likelihood of different segmented sentences and then searches for the segmentation with the highest score. To featurize a segmented sentence, a series of distributed vector representations (Bengio et al., 2003) is generated to characterize the corresponding word candidates. This representation setting makes decoding quite different from previous methods, and indeed much more challenging, but more discriminative features can be captured.
Though the vector building is word-centered, our proposed scoring model covers all three processing levels: character, word, and sentence. First, the distributed representation starts from character embeddings, since in the context of word segmentation the n-gram data sparsity issue makes it impractical to use word vectors directly. Second, as the word candidate representation is derived from its characters, the internal character structure is also encoded, so it can be used to judge the likelihood of the word itself. Third, to evaluate how well a segmented sentence makes sense through word interaction, an LSTM (Hochreiter and Schmidhuber, 1997) is used to chain together word candidates incrementally and construct the representation of the partially segmented sentence at each decoding step, so that the coherence between the next word candidate and the previous segmentation history can be captured.
To the best of our knowledge, our proposed approach to CWS is the first attempt to explicitly model the entire contents of the segmenter's state, including the complete history of both segmentation decisions and input characters. The comparisons of feature windows used in different models are shown in Table 1. Compared to both past sequence labeling schemes and word-based models, our model thoroughly eliminates context windows and can capture the complete history of segmentation decisions, which offers more possibilities to model segmentation context effectively and accurately.

Overview
We formulate the CWS problem as finding a mapping from an input character sequence x to a word sequence y, where the output sentence y* satisfies:

y* = argmax_{y[1:n] ∈ GEN(x)} s(y[1:n], θ)

where n is the number of word candidates in y, GEN(x) denotes the set of possible segmentations for an input sequence x, and s(·) is our scoring function. Unlike all previous works, our scoring function is sensitive to the complete contents of the partially segmented sentence. As shown in Figure 1, to solve CWS in this way, a neural network scoring model is designed to evaluate the likelihood of a segmented sentence. Based on the proposed model, a decoder is developed to find the segmented sentence with the highest score. Meanwhile, a max-margin method is utilized to perform the training by comparing the structured difference between the decoder output and the gold segmentation. The following sections introduce each of these components in detail.
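As a concrete illustration of GEN(x), the following is a hypothetical sketch (not the authors' code) that enumerates every possible segmentation of a toy character sequence, with word length capped as the decoder also assumes:

```python
# Hypothetical sketch (not the authors' code): enumerating GEN(x), the set
# of all possible segmentations of a character sequence, with a cap on the
# maximum word length.
def gen_segmentations(chars, max_word_len=4):
    if not chars:
        return [[]]  # exactly one way to segment the empty sequence
    results = []
    for l in range(1, min(max_word_len, len(chars)) + 1):
        first, rest = chars[:l], chars[l:]
        for tail in gen_segmentations(rest, max_word_len):
            results.append([first] + tail)
    return results

segs = gen_segmentations("abcd")
```

Even this 4-character toy input yields 8 segmentations; the count grows exponentially with sequence length, which is why decoding resorts to beam search rather than exhaustive scoring.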

Neural Network Scoring Model
The score for a segmented sentence is computed by first mapping it into a sequence of word candidate vectors. The scoring model then takes the vector sequence as input and scores each word candidate from two perspectives: (1) how likely the word candidate itself is to be a legal word; (2) how reasonable it is for the word candidate to immediately follow the previous segmentation history. After that, the word candidate is appended to the segmentation history, updating the state of the scoring system for subsequent judgements. Figure 2 illustrates the entire scoring neural network.

Word Score
Character Embedding. While the scores are decided at the word level, using word embeddings (Bengio et al., 2003; Wang et al., 2016) directly leads to a well-known issue: rare words and out-of-vocabulary words will be poorly estimated (Kim et al., 2015). In addition, the character-level information inside an n-gram can be helpful for judging whether it is a true word. Therefore, a lookup table of character embeddings is used as the bottom layer. Formally, we have a character dictionary D of size |D|. Each character c ∈ D is represented as a real-valued vector (character embedding) c ∈ R^d, where d is the dimensionality of the vector space. The character embeddings are stacked into an embedding matrix M ∈ R^{d×|D|}. For a character c ∈ D, its character embedding c ∈ R^d is retrieved by the embedding layer according to its index.
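The lookup operation itself is simple; the following toy sketch (the dictionary, dimensionality d, and matrix values are illustrative, not the paper's settings) shows how a character's embedding is retrieved as a column of M:

```python
import numpy as np

# Toy sketch of the character-embedding lookup layer; the dictionary,
# dimensionality d, and matrix values are illustrative, not the paper's.
d = 4
dictionary = {ch: i for i, ch in enumerate("abcde")}  # D, with |D| = 5
M = np.random.randn(d, len(dictionary))               # M in R^{d x |D|}

def embed(ch):
    # Retrieve the embedding as the column of M indexed by the character id.
    return M[:, dictionary[ch]]

vec = embed("c")
```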
Gated Combination Neural Network. To obtain a word representation from its characters, the simplest strategy integrates the character vectors into a word representation using a weight matrix W^(L) that is shared across all words of the same length L, followed by a non-linear function g(·). Specifically, let c_i (1 ≤ i ≤ L) be the d-dimensional character vector representations; the corresponding word vector w will be d-dimensional as well:

w = g(W^(L) [c_1; c_2; ...; c_L])

where W^(L) ∈ R^{d×Ld} and g is a non-linear function as mentioned above. Although this mechanism seems to work well, it cannot sufficiently model the complicated combination features that occur in practice.
A gated structure in a neural network can be useful for hybrid feature extraction (Chen et al., 2015a; Chung et al., 2014; Cho et al., 2014); we therefore propose a gated combination neural network (GCNN) especially for character compositionality, which contains two types of gates, namely reset gates and update gates. Intuitively, the reset gates decide which parts of the character vectors should be mixed, while the update gates decide what to preserve when combining the character information. Concretely, for words of length L, the word vector w ∈ R^d is computed as follows:

w = z_N ⊙ ŵ + Σ_{i=1}^{L} z_i ⊙ c_i

where z_N and z_i (1 ≤ i ≤ L) are update gates for the new activation ŵ and the governed characters respectively, and ⊙ indicates element-wise multiplication.
The new activation ŵ is computed as:

ŵ = tanh(W^(L) [r_1 ⊙ c_1; r_2 ⊙ c_2; ...; r_L ⊙ c_L])

where r_i (1 ≤ i ≤ L) are the reset gates for the governed characters, which can be formalized as:

[r_1; r_2; ...; r_L] = σ(R^(L) [c_1; c_2; ...; c_L])

where R^(L) ∈ R^{Ld×Ld} is the coefficient matrix of the reset gates and σ denotes the sigmoid function.
The update gates can be formalized as:

[z_N; z_1; ...; z_L] = exp(U^(L) [ŵ; c_1; ...; c_L])

where U^(L) ∈ R^{(L+1)d×(L+1)d} is the coefficient matrix of the update gates, and each gate is divided element-wise by the normalization vector Z ∈ R^d. According to the normalization condition, the update gates are constrained by:

z_N + Σ_{i=1}^{L} z_i = 1

where 1 is a d-dimensional vector of ones. The gated mechanism is capable of capturing both character and character-interaction characteristics, giving an efficient word representation (see Section 6.3).
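The GCNN composition above can be sketched as follows; the parameter shapes follow the equations, but the random initialization and the softmax-style implementation of the gate normalization are our assumptions, not the authors' released code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sketch of the GCNN composition for a length-L word; shapes
# follow the equations, but initialization and normalization details here
# are assumptions, not the authors' exact implementation.
def gcnn_compose(chars, W, R, U):
    L, d = len(chars), chars[0].shape[0]
    c = np.concatenate(chars)                   # [c_1; ...; c_L] in R^{Ld}
    r = sigmoid(R @ c)                          # reset gates, R^(L) in R^{Ld x Ld}
    w_hat = np.tanh(W @ (r * c))                # new activation w_hat
    z = np.exp(U @ np.concatenate([w_hat, c]))  # unnormalized update gates
    z = z.reshape(L + 1, d)
    z = z / z.sum(axis=0, keepdims=True)        # z_N + sum_i z_i = 1 element-wise
    # w = z_N * w_hat + sum_i z_i * c_i
    return z[0] * w_hat + sum(z[i + 1] * chars[i] for i in range(L))

rng = np.random.default_rng(0)
d, L = 3, 2
chars = [rng.standard_normal(d) for _ in range(L)]
W = rng.standard_normal((d, L * d))
R = rng.standard_normal((L * d, L * d))
U = rng.standard_normal(((L + 1) * d, (L + 1) * d))
w = gcnn_compose(chars, W, R, U)
```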
Word Score. Denote the learned vector representations for a segmented sentence y as [w_1, w_2, ..., w_n], where n is the number of word candidates in the sentence. The word score of the t-th word candidate is computed as the dot product of its vector and a trainable parameter vector u ∈ R^d:

WS(w_t) = u · w_t

It indicates how likely a word candidate by itself is to be a true word.

Link Score
Inspired by the recurrent neural network language model (RNN-LM) (Mikolov et al., 2010;Sundermeyer et al., 2012), we utilize an LSTM system to capture the coherence in a segmented sentence.
Long Short-Term Memory Networks. The LSTM neural network (Hochreiter and Schmidhuber, 1997) is an extension of the recurrent neural network (RNN), an effective tool for sequence modeling tasks that uses its hidden states to preserve history information. At each time step t, an RNN takes the input x_t and updates its recurrent hidden state h_t by

h_t = g(W x_t + U h_{t−1} + b)

where g is a non-linear function.
Although an RNN is capable, in principle, of processing arbitrary-length sequences, it can be difficult to train it to learn long-range dependencies due to vanishing gradients. The LSTM addresses this problem by introducing a memory cell and three types of gates, namely the input gate, forget gate, and output gate. Concretely, each step of the LSTM takes input x_t, h_{t−1}, c_{t−1} and produces h_t, c_t via the following calculations:

i_t = σ(W_i x_t + U_i h_{t−1} + b_i)
f_t = σ(W_f x_t + U_f h_{t−1} + b_f)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)
c̃_t = tanh(W_c x_t + U_c h_{t−1} + b_c)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

where σ and ⊙ are respectively the element-wise sigmoid function and element-wise multiplication, and i_t, f_t, o_t, c_t are respectively the input gate, forget gate, output gate, and memory cell activation vector at time t, all of which have the same size as the hidden state vector h_t.

Link Score. LSTMs have been shown to outperform RNNs on many NLP tasks, notably language modeling (Sundermeyer et al., 2012). In our model, an LSTM is utilized to chain together word candidates in a left-to-right, incremental manner. At time step t, a prediction p_{t+1} ∈ R^d about the next word y_{t+1} is made based on the hidden state h_t. The link score for the next word y_{t+1} is then computed as:

LS(y_{t+1}) = p_{t+1} · y_{t+1}

Due to the structure of the LSTM, the prediction vector p_{t+1} carries useful information detected from the entire segmentation history, including previous segmentation decisions. In this way, our model gains the ability of sequence-level discrimination rather than local optimization.
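A minimal sketch of one LSTM step and the link score follows; the tanh prediction layer mapping h_t to p_{t+1} is our assumption (the paper does not spell out its exact form here), and all parameter values are toy random matrices:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One LSTM step; P bundles the gate parameters (this packing is an
# implementation choice, not the authors' code).
def lstm_step(x, h_prev, c_prev, P):
    i = sigmoid(P["Wi"] @ x + P["Ui"] @ h_prev + P["bi"])   # input gate
    f = sigmoid(P["Wf"] @ x + P["Uf"] @ h_prev + P["bf"])   # forget gate
    o = sigmoid(P["Wo"] @ x + P["Uo"] @ h_prev + P["bo"])   # output gate
    c_tilde = np.tanh(P["Wc"] @ x + P["Uc"] @ h_prev + P["bc"])
    c = f * c_prev + i * c_tilde       # memory cell update
    h = o * np.tanh(c)                 # hidden state
    return h, c

# Link score: predict p_{t+1} from h_t (assumed linear + tanh here),
# then dot it with the next word-candidate vector.
def link_score(h_t, W_pred, next_word_vec):
    p_next = np.tanh(W_pred @ h_t)
    return float(p_next @ next_word_vec)

rng = np.random.default_rng(1)
d = 3
P = {k: rng.standard_normal((d, d)) for k in ["Wi", "Ui", "Wf", "Uf", "Wo", "Uo", "Wc", "Uc"]}
P.update({k: rng.standard_normal(d) for k in ["bi", "bf", "bo", "bc"]})
h, c = lstm_step(rng.standard_normal(d), np.zeros(d), np.zeros(d), P)
score = link_score(h, rng.standard_normal((d, d)), rng.standard_normal(d))
```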

Sentence score
The sentence score for a segmented sentence y with n word candidates is computed by summing up the word scores (2) and link scores (3) as follows:

s(y[1:n], θ) = Σ_{t=1}^{n} [WS(w_t) + LS(w_t)]

where θ is the parameter set used in our model.

Decoding
The total number of possible segmented sentences grows exponentially with the length of the character sequence, which makes it impractical to compute the scores of every possible segmentation. To obtain exact inference, most sequence-labeling systems address this problem with a Viterbi search, which takes advantage of the hypothesis that tag interactions only exist within adjacent characters (the Markov assumption). However, since our model is intended to capture the complete history of segmentation decisions, such dynamic programming algorithms cannot be adopted in this situation.
Algorithm 1 Beam Search. To make our model efficient in practical use, we propose a beam-search algorithm with dynamic programming motivations, as shown in Algorithm 1. The main idea is that any segmentation of the first i characters can be separated into two parts: the first part consists of the characters with indexes from 0 to j, denoted as y, and the rest is the word composed of c[j+1 : i]. The influence of the previous segmentation y can be represented as a triple (y.score, y.h, y.c), where y.score, y.h, and y.c indicate the current score, current hidden state vector, and current memory cell vector respectively. Beam search ensures that the total time for segmenting a sentence of n characters is w × k × n, where w and k are the maximum word length and beam size respectively.
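A simplified version of the beam-search decoder can be sketched as follows; the scoring function here is a stand-in toy lexicon bonus rather than the neural model, and the beam states omit the LSTM hidden and cell vectors (y.h, y.c) for brevity:

```python
# Simplified beam-search decoder in the spirit of Algorithm 1; the scorer
# is a toy lexicon bonus, not the neural model, and beam states omit the
# LSTM hidden/cell vectors for brevity.
def beam_search(chars, score_word, max_word_len=4, beam_size=4):
    # beams[i]: up to beam_size (score, segmentation) pairs for the first i chars
    beams = [[(0.0, [])]] + [[] for _ in range(len(chars))]
    for i in range(1, len(chars) + 1):
        candidates = []
        for j in range(max(0, i - max_word_len), i):
            word = chars[j:i]                  # the word c[j+1 : i]
            for score, seg in beams[j]:
                candidates.append((score + score_word(word), seg + [word]))
        beams[i] = sorted(candidates, key=lambda p: -p[0])[:beam_size]
    return beams[len(chars)][0][1]             # best-scoring full segmentation

lexicon = {"ab", "cd", "abc"}                  # toy lexicon
best = beam_search("abcd", lambda w: 1.0 if w in lexicon else -1.0)
```

Each of the n positions considers at most w boundary choices over k beam entries, matching the w × k × n bound stated above.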

Training
We use the max-margin criterion (Taskar et al., 2005) to train our model. As reported in (Kummerfeld et al., 2015), margin methods generally outperform both likelihood and perceptron methods. For a given character sequence x^(i), denote the correct segmented sentence for x^(i) as y^(i). We define a structured margin loss ∆(y^(i), ŷ) for predicting a segmented sentence ŷ:

∆(y^(i), ŷ) = Σ_{t=1}^{m} µ · 1[character t is incorrectly segmented in ŷ]

where m is the length of the sequence x^(i) and µ is the discount parameter. The margin loss thus counts the number of incorrectly segmented characters and multiplies it by a fixed discount parameter for smoothing, so the loss is proportional to the number of incorrectly segmented characters.
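The margin loss can be sketched as follows; mapping words to per-character begin/inside labels is one reasonable reading of "incorrectly segmented characters," not necessarily the authors' exact counting:

```python
# Sketch of the structured margin loss: count characters whose boundary
# labels differ from the gold segmentation and scale by the discount mu.
# The begin/inside labeling is an assumption, not the authors' exact code.
def char_labels(segmentation):
    labels = []
    for word in segmentation:
        labels.extend(["B"] + ["I"] * (len(word) - 1))  # begin / inside tags
    return labels

def margin_loss(gold, pred, mu=0.2):
    g, p = char_labels(gold), char_labels(pred)
    assert len(g) == len(p)  # both segment the same character sequence
    wrong = sum(1 for a, b in zip(g, p) if a != b)
    return mu * wrong        # proportional to incorrectly segmented characters

loss = margin_loss(["ab", "cd"], ["a", "bcd"])
```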
Given a training set Ω, the regularized objective function J(θ) is the loss function including an ℓ2-norm term:

J(θ) = (1/|Ω|) Σ_{(x^(i), y^(i)) ∈ Ω} l_i(θ) + (λ/2) ||θ||²

l_i(θ) = max_{ŷ ∈ GEN(x^(i))} ( s(ŷ, θ) + ∆(y^(i), ŷ) − s(y^(i), θ) )

where the function s(·) is the sentence score defined in equation (4). Due to the hinge loss, the objective function is not differentiable, so we use a subgradient method (Ratliff et al., 2007), which computes a gradient-like direction. Following (Socher et al., 2013), we use the diagonal variant of AdaGrad (Duchi et al., 2011) with minibatches to minimize the objective. The update for the i-th parameter at time step t is as follows:

θ_{t,i} = θ_{t−1,i} − (α / √(Σ_{τ=1}^{t} g_{τ,i}²)) · g_{t,i}

where α is the initial learning rate and g_{τ,i} is the subgradient at time step τ for parameter θ_i.
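The diagonal AdaGrad update can be sketched for a single parameter vector as follows; the learning rate and the toy quadratic objective are illustrative choices, not the paper's settings:

```python
import numpy as np

# Diagonal AdaGrad update sketched for one parameter vector; the learning
# rate and the toy quadratic objective are illustrative, not the paper's.
def adagrad_update(theta, grad, accum, alpha=0.2, eps=1e-8):
    accum = accum + grad ** 2                              # accumulate squared subgradients
    theta = theta - alpha * grad / (np.sqrt(accum) + eps)  # per-coordinate scaled step
    return theta, accum

target = np.array([1.0, -1.0, 0.5])
theta, accum = np.zeros(3), np.zeros(3)
for _ in range(5):
    grad = 2 * (theta - target)   # subgradient of ||theta - target||^2
    theta, accum = adagrad_update(theta, grad, accum)
```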

Datasets
To evaluate the proposed segmenter, we use two popular datasets, PKU and MSR, from the second International Chinese Word Segmentation Bakeoff (Emerson, 2005). These datasets are commonly used by previous state-of-the-art models and neural network models. Both datasets are preprocessed by replacing continuous runs of English characters and digits with a unique token. All experiments are conducted with the standard Bakeoff scoring program, calculating precision, recall, and F1-score.
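The preprocessing step might look like the following sketch; the placeholder token strings are our assumptions, not the Bakeoff scripts' choices:

```python
import re

# Illustrative preprocessing in the spirit described above: replace runs
# of Latin letters and of digits with single placeholder tokens. The token
# strings are assumptions, not the Bakeoff scripts' choices.
def normalize(text, eng_token="<ENG>", num_token="<NUM>"):
    text = re.sub(r"[A-Za-z]+", eng_token, text)  # continuous English characters
    text = re.sub(r"[0-9]+", num_token, text)     # continuous digits
    return text

out = normalize("我有3个apple")
```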

Hyper-parameters
The hyper-parameters of a neural network model significantly impact its performance. To determine a set of suitable hyper-parameters, we divide the training data into two sets: the first 90% of sentences as the training set and the remaining 10% as the development set. We choose the hyper-parameters shown in Table 2.
We found that the character embedding size has a limited impact on performance as long as it is large enough; a size of 50 is chosen as a good trade-off between speed and performance. The number of hidden units is set to be the same as the character embedding size. The maximum word length determines the number of parameters in the GCNN part and the time consumed by beam search; since words with length l > 4 are relatively rare, the maximum word length is set to 4. Dropout is a popular technique for improving the performance of neural networks by reducing overfitting (Srivastava et al., 2014); we apply dropout to the input layer of our model with a dropout rate of 20% to avoid overfitting.

Model Analysis
Beam Size. We first investigated the impact of beam size on segmentation performance. Figure 5 shows that a beam size of 4 is enough to achieve the best performance, allowing our model to strike a good balance between accuracy and efficiency.
GCNN. We then studied the role of the GCNN in our model by comparing it with a simplified version that replaces the GCNN part with the single non-linear layer of equation (1). The results, listed in Table 3, demonstrate that performance is significantly boosted by the GCNN architecture (from 94.0% to 95.5% F1-score); the best performance the simplified version can achieve is 94.7%, and only with a much larger character embedding size.
Link Score & Word Score. We conducted several experiments to investigate the individual effects of the link score and the word score, since these two scores estimate the sentence likelihood from two different perspectives: the semantic coherence between words and the existence of individual words. The learning curves of models with different scoring strategies are shown in Figure 6. The model with only the word score can be regarded as making segmentation decisions based only on local window information, and the comparisons show that such a model gives moderate performance. By contrast, the model with only the link score performs much better, close to the joint model, which demonstrates that the complete segmentation history, which cannot be effectively modeled in previous schemes, is of great value for word segmentation.

We first compare our model with the latest neural network methods, as shown in Table 4. The results presented in (Chen et al., 2015a; Chen et al., 2015b) used an extra preprocessing step to filter out Chinese idioms according to an external dictionary. Table 4 lists the results (F1-scores) with different dictionaries, which show that our models perform better under the same settings. Table 5 gives comparisons among previous neural network models. In the first block of Table 5, the character embedding matrix M is randomly initialized. The results show that our proposed model outperforms previous neural network methods.

Previous works have found that performance can be improved by pre-training the character embeddings on large unlabeled data. Therefore, we use the word2vec (Mikolov et al., 2013) toolkit to pre-train the character embeddings on the Chinese Wikipedia corpus and use them for initialization. Table 5 also shows the results with additional pre-trained character embeddings; again, our model achieves better performance than previous neural network models. Table 6 compares our models with previous state-of-the-art systems. Recent systems such as (Zhang et al., 2013), (Chen et al., 2015b), and (Chen et al., 2015a) rely on both extensive feature engineering and external corpora to boost performance, so they are not directly comparable with our models. In the closed-set setting, our models achieve state-of-the-art performance on the PKU dataset and a competitive result on the MSR dataset, which can be attributed to the overly strict maximum word length setting kept for consistency, as the MSR corpus is well known to have a much longer average word length (Zhao et al., 2010).
Table 7 reports the results on the MSR corpus with different maximum decoding word lengths, giving both F1-scores and training time. The results show that segmentation performance can indeed be further improved by allowing longer words during decoding, though longer training time is also needed. When 6-character words are allowed, the F1-score on MSR can be further improved by 0.3%.
As for running cost, we roughly report the current computational cost on the PKU dataset. It takes about two days to finish 50 training epochs (for the results in Figure 6 and the last row of Table 6) using only two cores of an Intel i7-5960X CPU. The RAM requirement during training is less than 800MB, and the trained model can be saved within 4MB on disk.

Related Work
Neural Network Models. Most modern CWS methods followed Xue (2003) and treated CWS as a sequence labeling problem (Zhao et al., 2006b). Recently, researchers have explored neural network based approaches (Collobert et al., 2011) to reduce the effort of feature engineering (Zheng et al., 2013; Qi et al., 2014; Chen et al., 2015a; Chen et al., 2015b). They modeled CWS as a tagging problem as well, scoring tags on individual characters. In those models, tag scores are decided by context information within local windows, and the sentence-level score is obtained via context-independent tag transitions. Pei et al. (2014) introduced tag embeddings as input to capture the combinations of context and tag history. However, in previous works, only the tag of the previous character was taken into consideration, though theoretically the complete history of actions taken by the segmenter should be considered. Our code is released at https://github.com/jcyk/CWS.
Alternatives to Sequence Labeling. Besides sequence labeling schemes, Zhang and Clark (2007) proposed a word-based perceptron method, and Zhang et al. (2012) used a linear-time incremental model which can also benefit from various kinds of features, including word-based ones. But both of them rely heavily on massive handcrafted features. Contemporary to this work, some neural models (Zhang et al., 2016a; Liu et al., 2016) also leverage word-level information: Liu et al. (2016) use a semi-CRF taking segment-level embeddings as input, and Zhang et al. (2016a) use a transition-based framework. Another notable exception is (Ma and Hinrichs, 2015), which is also an embedding-based model, but models CWS as configuration-action matching. However, again, this method only uses context information within limited-size windows.
Other Techniques. The proposed model might further benefit from techniques used in recent state-of-the-art systems, such as semi-supervised learning (Zhao and Kit, 2008b; Zhao and Kit, 2008a; Sun and Xu, 2011; Zhao and Kit, 2011; Zeng et al., 2013; Zhang et al., 2013), incorporating global information (Zhao and Kit, 2007; Zhang et al., 2016b), and joint models (Qian and Liu, 2012; Li and Zhou, 2012).

Conclusion
This paper presents a novel neural framework for the task of Chinese word segmentation, which contains three main components: (1) a component that produces the representation of a word from its governed characters; (2) a sentence-level likelihood evaluation system for segmented sentences; (3) an efficient and effective algorithm to find the best segmentation.
The proposed framework is a fresh attempt to formalize word segmentation as a direct structured learning procedure in terms of the recent distributed representation framework.
Though our system already outputs results better than the latest neural network segmenters and comparable to previous state-of-the-art systems, the framework retains great potential to be further investigated and improved in the future.

Figure 5: Performances of different beam sizes on the PKU dataset.
Figure 6: Learning curves of models with different scoring strategies.
Figure 2: Architecture of our proposed neural network scoring model, where c_i denotes the i-th input character, y_j denotes the learned representation of the j-th word candidate, p_k denotes the prediction for the (k+1)-th word candidate, and u is the trainable parameter vector for scoring the likelihood of individual word candidates.

Table 2 :
Hyper-parameter settings.

Table 3 :
Performances of different models on PKU dataset.

Table 4 :
Comparison of using different Chinese idiom dictionaries.

Table 5 :
Comparison with previous neural network models. Results with * are from our runs on their released implementations.

Table 6 :
Comparison with previous state-of-the-art models. Results with * used an external dictionary or corpus.

Table 7 :
Results on MSR dataset with different maximum decoding word length settings.
Long words (length > 4) account for only 0.19% of the PKU test set but 1.07% of the MSR test set. Longer words suffer from two problems: less training data (most of them are hierarchical entity names) and more parameters to train (in the GCNN part).