Fast and Accurate Neural Word Segmentation for Chinese

Neural models with minimal feature engineering have achieved competitive performance against traditional methods for the task of Chinese word segmentation. However, both the training and working procedures of current neural models are computationally inefficient. In this paper, we propose a greedy neural word segmenter with balanced word and character embedding inputs to alleviate the existing drawbacks. Our segmenter is truly end-to-end and performs segmentation much faster and even more accurately than state-of-the-art neural models on Chinese benchmark datasets.


Introduction
Word segmentation is a fundamental task for processing most East Asian languages, typically Chinese. Almost all practical Chinese processing applications essentially rely on Chinese word segmentation (CWS), e.g., (Zhao et al., 2017).
Since (Xue, 2003), most methods formalize this task as a sequence labeling problem. In a supervised learning fashion, sequence labeling may adopt various models such as Maximum Entropy (ME) (Low et al., 2005) and Conditional Random Fields (CRF) (Lafferty et al., 2001; Peng et al., 2004). However, these models rely heavily on hand-crafted features.
To minimize the effort spent on feature engineering, neural word segmentation has been actively studied recently. Zheng et al. (2013) first adapted the sliding-window based sequence labeling (Collobert et al., 2011) with character embeddings as input. A number of other researchers have attempted to improve the segmenter of (Zheng et al., 2013) by augmenting it with additional complexity. Pei et al. (2014) introduced tag embeddings. Chen et al. (2015a) proposed to model n-gram features via a gated recursive neural network (GRNN). Chen et al. (2015b) used a long short-term memory network (LSTM) (Hochreiter and Schmidhuber, 1997) to capture long-distance context. Xu and Sun (2016) integrated both GRNN and LSTM for deeper feature extraction.
Besides sequence labeling schemes, Zhang et al. (2016) proposed a transition-based framework. Liu et al. (2016) used a zero-order semi-CRF based model. However, these two models rely on either traditional discrete features or non-neural-network components for performance enhancement; their performance drops rapidly when depending solely on neural models. Most closely related to this work, Cai and Zhao (2016) proposed to score candidate segmented outputs directly, employing a gated combination neural network over characters for word representation generation and an LSTM scoring model for segmentation result evaluation.
Despite the active progress of most existing works in terms of accuracy, their computational needs have increased to the extent that training a neural segmenter usually takes days even on cutting-edge hardware. Meanwhile, different applications often require diverse segmenters and offer large-scale incoming data. The efficiency of a word segmenter in both training and decoding is therefore crucial in practice. In this paper, we propose a simple yet accurate neural word segmenter that searches greedily during both training and working to overcome the existing efficiency obstacle. Our evaluation is performed on Chinese benchmark datasets.

Related Work
Statistical Chinese word segmentation has been studied for decades (Huang and Zhao, 2007). (Xue, 2003) was the first to cast it as a character-based tagging problem. Peng et al. (2004) showed that a CRF based model is particularly effective for solving CWS in the sequence labeling fashion. This method has been followed by most later segmenters (Tseng et al., 2005; Zhao et al., 2006; Zhao and Kit, 2008c; Zhao et al., 2010; Sun et al., 2012; Zhang et al., 2013). The same spirit has also been followed by most neural models (Zheng et al., 2013; Pei et al., 2014; Qi et al., 2014; Chen et al., 2015a,b; Ma and Hinrichs, 2015; Xu and Sun, 2016).
Unlike most previous works, which extract features within a fixed-size sliding window, Cai and Zhao (2016) proposed a direct segmentation framework that extends the feature window to cover the complete input and segmentation history and uses beam search for decoding. In this work, we make a series of significant improvements over the basic framework and, in particular, adopt greedy search instead.
Another notable exception of embedding based methods is (Ma and Hinrichs, 2015), which used character-specified tags matching for fast decoding and resulted in a character-based greedy segmenter.

Models
To segment a character sequence, we employ neural networks to score the likelihood of a candidate segmented sequence being a true sentence, and the one with the highest score will be picked as output.

Neural Scorer
Our neural architecture to score a segmented sequence (word sequence) can be described in the following three steps (illustrated in Figure 1).
Encoding. To make use of neural networks, symbolic data needs to be transformed into distributed representations. The most straightforward solution is to use a lookup table for word vectors (Bengio et al., 2003). However, in the context of neural word segmentation, it will generalize poorly due to the severe word sparsity in Chinese. An alternative is employing neural networks to compose word representations from character embedding inputs. However, it is empirically hard to learn a satisfactory composition function. In fact, quite a lot of Chinese words, like "沙(sand)发(issue)" (sofa), are not semantically compositional at the character level at all. Composing word representations from characters may thus be insufficient, while the direct use of word embeddings may lose generalization ability. We propose a hybrid mechanism to alleviate this dilemma. Concretely, we keep a short list $H$ of the most frequent words $w = c_1 \ldots c_l$ to balance character composition. If $w \in H$, the immediate word embedding $\mathbf{w} \in \mathbb{R}^{d_w}$ is attached via average pooling; otherwise, the character composition is used alone.
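The hybrid mechanism above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `build_word_table` and `encode_word` are hypothetical, "attached via average pooling" is read here as an element-wise mean of the composed vector and the word embedding (which assumes both have the same dimension), and the `keep_ratio` default of one half mirrors the default word table described in the experimental settings.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Set

Vector = List[float]


def build_word_table(training_words: Sequence[str], keep_ratio: float = 0.5) -> Set[str]:
    """Return the most frequent `keep_ratio` share of the distinct training words."""
    counts = Counter(training_words)
    ranked = [word for word, _ in counts.most_common()]
    return set(ranked[: int(len(ranked) * keep_ratio)])


def encode_word(
    word: str,
    compose: Callable[[str], Vector],
    word_table: Set[str],
    word_embeddings: Dict[str, Vector],
) -> Vector:
    """Encode a word from its characters, averaged with its own embedding if it is in H."""
    composed = compose(word)
    if word in word_table:
        direct = word_embeddings[word]
        # Assumption: average pooling = element-wise mean of the two vectors.
        return [(a + b) / 2.0 for a, b in zip(composed, direct)]
    # Words outside H rely on character composition alone.
    return composed
```

With this reading, frequent words get a dedicated embedding that can correct a poor composition, while rare and unseen words still receive a representation from their characters.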
Our character composition function COMP(·) for an $l$-length word gates each character embedding before composition, where $\odot$ denotes element-wise multiplication and $r_i \in \mathbb{R}^{d_c}$ is the gate that controls the information flow from character embedding $c_i \in \mathbb{R}^{d_c}$ to the word. Intuitively, the gating mechanism is used to determine which part of the character vectors should be retrieved when composing a certain word. This is indeed important due to the ambiguity of individual Chinese characters.
In contrast, the model in (Cai and Zhao, 2016) further combined COMP(·) and character embeddings $c_i$ via an update gate $z$ (as in Figure 2), which, according to our empirical study, does not help performance but incurs a huge computational cost.
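To make the gating idea concrete, here is a small sketch of one plausible gated composition. The exact parameterization is an assumption: the gates are computed by a sigmoid layer over the concatenated character embeddings, and the word vector by a tanh layer over the gated concatenation. The function name `compose` and the parameters `w_gate`, `b_gate`, `w_out`, and `b_out` are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def compose(
    chars: List[Vector],
    w_gate: Matrix,
    b_gate: Vector,
    w_out: Matrix,
    b_out: Vector,
) -> Vector:
    """Compose a word vector from character embeddings c_1..c_l with gates r_1..r_l."""
    # Concatenate the character embeddings into [c_1; ...; c_l].
    concat = [x for c in chars for x in c]
    # Assumed gate form: one sigmoid gate value per input coordinate.
    gates = [sigmoid(z + b) for z, b in zip(matvec(w_gate, concat), b_gate)]
    # Element-wise multiplication r_i ⊙ c_i over the whole concatenation.
    gated = [g * x for g, x in zip(gates, concat)]
    # Assumed output layer: project the gated characters to the word space.
    return [math.tanh(z + b) for z, b in zip(matvec(w_out, gated), b_out)]
```

Because the weight shapes depend on the word length $l$, a sketch like this would need one parameter set per length up to the maximum word length.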
Linking. To capture word interactions within a word sequence, the resulting word vectors are then linked sequentially via an LSTM (Sundermeyer et al., 2012). At each time step $i$, a prediction about the next word is made according to the current hidden state $h_i \in \mathbb{R}^{H}$ of the LSTM. The procedure can be described as the following equation.
The predictions $p \in \mathbb{R}^{d_w}$ are then used to evaluate how reasonable the transition is between the next word and the preceding word sequence.
Scoring. The segmented sequence is evaluated from two perspectives: (i) the legality of individual words, and (ii) the smoothness or coherence of the word sequence. The former is judged by a trainable parameter vector $u \in \mathbb{R}^{d_w}$, which is supposed to work like a hyperplane separating legal and illegal words. For the latter, the prediction $p$ made for each position can be used to score the fitness of the actual word. Both scoring operations are implemented via dot products in our setting. Summing up all scores, the segmented sequence (sentence) is scored as follows.
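Taking the description above at face value, the sentence score is the sum over words of two dot products: the legality term with $u$ and the coherence term with the prediction made for that position. The sketch below is a reconstruction from the prose only; the function names are hypothetical and `predictions[i]` is assumed to be the vector predicted for the i-th word before that word is read.

```python
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def sentence_score(
    word_vectors: Sequence[Vector],
    predictions: Sequence[Vector],
    u: Vector,
) -> float:
    """Sum word legality (u . w_i) and sequence coherence (p_i . w_i) over all words."""
    return sum(dot(u, w) + dot(p, w) for w, p in zip(word_vectors, predictions))
```

For example, with `u = [1, 1]`, word vectors `[[1, 0], [0, 2]]`, and predictions `[[1, 1], [0.5, 0.5]]`, the two words contribute 2 and 3 respectively.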

Search
The number of possible segmented sentences grows exponentially with the length of the input character sequence. Most existing methods made Markov assumptions to keep the exact search tractable. However, such assumptions cannot be made in our model, as the LSTM component takes advantage of the full segmentation history. We therefore adopt a beam search scheme, which works iteratively on every prefix of the input character sequence, approximating the $k$ highest-scored word sequences of each prefix (i.e., $k$ is the beam size). The time complexity of our beam search is $O(wkn)$, where $w$ is the maximum word length and $n$ is the input sequence length.
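The prefix-wise search can be sketched as below. This is an illustration under simplifying assumptions, not the paper's decoder: `score` is a hypothetical callable that rescores a whole word sequence from scratch, whereas a real implementation would carry the LSTM state and partial score along with each hypothesis. Setting `beam_size=1` gives word-level greedy search.

```python
from typing import Callable, List, Tuple

Hypothesis = Tuple[float, List[str]]


def beam_segment(
    chars: str,
    score: Callable[[List[str]], float],
    beam_size: int = 1,
    max_word_len: int = 4,
) -> List[str]:
    """Segment `chars`, keeping the `beam_size` best word sequences for every prefix."""
    # beams[i] holds the surviving segmentations of the prefix chars[:i].
    beams: List[List[Hypothesis]] = [[] for _ in range(len(chars) + 1)]
    beams[0] = [(0.0, [])]
    for end in range(1, len(chars) + 1):
        candidates: List[Hypothesis] = []
        # Try every word of length 1..max_word_len that ends at position `end`.
        for length in range(1, min(max_word_len, end) + 1):
            word = chars[end - length:end]
            for _, prefix in beams[end - length]:
                words = prefix + [word]
                candidates.append((score(words), words))
        candidates.sort(key=lambda hyp: hyp[0], reverse=True)
        beams[end] = candidates[:beam_size]
    return beams[len(chars)][0][1]
```

Each of the $n$ positions expands at most $w \cdot k$ candidates, which matches the $O(wkn)$ count of scored hypotheses stated above.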

Training Criteria
Our segmenter is trained using max-margin methods (Taskar et al., 2005), where the structured margin loss is defined as µ times the number of incorrectly segmented characters (Cai and Zhao, 2016). However, according to (Huang et al., 2012), the standard parameter update cannot guarantee convergence in the case of inexact search. We thus additionally examine the following two strategies.
Early update. This strategy, proposed in (Collins and Roark, 2004), can be summarized as "update once the gold answer is unreachable". In our case, when the character prefix under consideration can be correctly segmented but the correct segmentation falls off the beam, an update operation is conducted and the rest of the instance is ignored.
LaSO update. One drawback of early update is that the search may never reach the end of a training instance, which means the rest of the instance is "wasted". In contrast, the LaSO method of (Daumé III and Marcu, 2005) continues on the same instance with the correct hypothesis after each update. In our case, the beam is emptied and the corresponding prefix of the correct word sequence is inserted into the beam.
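As a rough illustration of the margin term, the sketch below counts characters whose position tag differs between the gold and predicted segmentations and scales the count by µ. Treating "incorrectly segmented" as a mismatch of B/M/E/S position tags is an assumption made for this example, and the names `position_tags` and `margin_loss` are hypothetical.

```python
from typing import List, Sequence


def position_tags(words: Sequence[str]) -> List[str]:
    """Label each character as B(egin), M(iddle), E(nd), or S(ingle) of its word."""
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def margin_loss(gold: Sequence[str], predicted: Sequence[str], mu: float) -> float:
    """Return mu times the number of characters tagged differently from the gold."""
    gold_tags = position_tags(gold)
    pred_tags = position_tags(predicted)
    if len(gold_tags) != len(pred_tags):
        raise ValueError("gold and predicted must cover the same characters")
    return mu * sum(1 for g, p in zip(gold_tags, pred_tags) if g != p)
```

A margin that grows with the number of wrong characters asks the model to separate badly wrong segmentations from the gold one by a larger score gap than nearly correct ones.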
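Both strategies hinge on detecting the first prefix at which the gold segmentation is no longer in the beam. The helper below sketches that check only; it assumes the beams are available as a mapping from character position to the list of surviving word sequences, and the names `gold_prefix_ends` and `first_violation` are hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def gold_prefix_ends(gold_words: Sequence[str]) -> List[int]:
    """Character positions at which each gold word ends."""
    ends: List[int] = []
    position = 0
    for word in gold_words:
        position += len(word)
        ends.append(position)
    return ends


def first_violation(
    beams: Dict[int, List[List[str]]],
    gold_words: Sequence[str],
) -> Optional[int]:
    """Return the first prefix end where the gold prefix fell off the beam, else None."""
    gold = list(gold_words)
    for n_words, end in enumerate(gold_prefix_ends(gold), start=1):
        if gold[:n_words] not in beams.get(end, []):
            return end
    return None
```

With early update, training would update on the prefix ending at the returned position and move to the next instance; with LaSO update, it would update, reset the beam at that position to the gold prefix, and keep searching.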

Datasets and Settings
We conduct experiments on two popular benchmark datasets, namely PKU and MSR, from the second international Chinese word segmentation bakeoff (Bakeoff-2005) (Emerson, 2005). Data statistics are in Table 1.
Throughout this paper, we use the same model setting, as shown in Table 2. These numbers are tuned on development sets. We follow (Dyer et al., 2015) to train model parameters. The learning rate at epoch $t$ is set as $\eta_t = 0.2/(1 + \gamma t)$, where $\gamma = 0.1$ for the PKU dataset and $\gamma = 0.2$ for the MSR dataset. The character embeddings are either randomly initialized or pre-trained with the word2vec toolkit (Mikolov et al., 2013) on a Chinese Wikipedia corpus (indicated by +pre-train in tables), while the word embeddings are always randomly initialized. The beam size is kept the same during training and working. By default, the early update strategy is adopted and the word table $H$ is the top half of in-vocabulary (IV) words by frequency.
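The learning rate schedule is simple enough to state directly in code. The function name `learning_rate` is hypothetical, and whether epochs are counted from 0 or 1 is not specified in the text; the sketch accepts any non-negative epoch index.

```python
def learning_rate(epoch: int, gamma: float, base: float = 0.2) -> float:
    """Decayed learning rate eta_t = base / (1 + gamma * t)."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    return base / (1.0 + gamma * epoch)
```

For instance, with $\gamma = 0.1$ (the PKU setting) the rate halves to 0.1 at $t = 10$, while with $\gamma = 0.2$ (the MSR setting) it reaches 0.1 already at $t = 5$.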

Model Analysis
Beam search collapses into greedy search. Figure 3 demonstrates the effect of beam size. To our surprise, changing the beam size has little influence on performance. That is, simple stepwise greedy search nearly achieves the best performance, which suggests that word segmentation can be solved greedily at the word level. This may be because the model is already good enough to make correct decisions at the first position. In fact, a similar phenomenon was observed at the character level (Ma and Hinrichs, 2015). The remaining experiments thus only report the results of our greedy segmenter.

Comparing different update methods. Table 3 compares the three training strategies under consideration. We find that early update leads to faster convergence and slightly better performance than both standard and LaSO update.
Character composition versus word embedding. Following Section 3.1, direct use of word embedding may bring efficiency and effectiveness for identifying IV words, but weaken the ability to recognize out-of-vocabulary (OOV) words. We

Main Results
Table 4 compares our final results (greedy search is adopted by setting k=1) to prior neural models. Pre-training character embeddings on a large-scale unlabeled corpus (not limited to the training corpus) has been shown to be helpful for extra performance improvement. The results with and without pre-trained character embeddings are listed separately, so as to follow the strict closed test setting of the SIGHAN Bakeoff, in which no linguistic resource other than the training corpus is allowed. We also show the state-of-the-art results of traditional methods in (Zhao and Kit, 2008b). The comparison shows that our neural word segmenter outperforms all state-of-the-art neural systems at much lower computational cost. Finally, we present the results on all four Bakeoff-2005 datasets (PKU, MSR, CityU, and AS) compared to (Zhao and Kit, 2008c) in Table 5. Note that (Zhao and Kit, 2008c) used AV features, which are derived from global statistics over the entire training corpus in a way similar to unsupervised segmentation (Zhao and Kit, 2008a), for performance enhancement. The comparison to their results without AV features shows that our neural models for the first time present comparable performance against state-of-the-art traditional ones under the strict closed test setting.

Notes on Table 4: To distinguish the performance improvement from model optimization, we specifically list the results of standalone neural models in (Zhang et al., 2016) and (Liu et al., 2016). All the running time results are from our runs of released implementations on a single CPU (Intel i7-5960X) with two threads only, except for those of (Zhang et al., 2016), which are from personal communication. The results of (Xu and Sun, 2016) are not listed due to their use of an external Chinese idiom dictionary.

Conclusion
In this paper, we presented a fast and accurate word segmenter using neural networks. Our experiments show a significant improvement over existing state-of-the-art neural models by adopting the following key model refinements.
(1) A novel character-word balanced mechanism for word representation generation. (2) A more efficient model for character composition by dropping unnecessary designs. (3) Early update strategy during max-margin training. (4) With the above modifications, we discover that beam size has little influence on the performance. In fact, greedy search achieves very high accuracy.
Through these improvements from both neural models and linguistic motivation, our model becomes simpler, faster, and more accurate.

Figure 3: The effect of different beam sizes.

Table 3: The effect of different update methods. #epochs denotes the number of training epochs to convergence.

Table 4: Comparison with previous models. Results with * are from (Cai and Zhao, 2016).