Accurate Linear-Time Chinese Word Segmentation via Embedding Matching

This paper proposes an embedding matching approach to Chinese word segmentation, which generalizes the traditional sequence labeling framework and takes advantage of distributed representations. The training and prediction algorithms have linear-time complexity. Based on the proposed model, a greedy segmenter is developed and evaluated on benchmark corpora. Experiments show that our greedy segmenter achieves improved results over previous neural network-based word segmenters, and its performance is competitive with state-of-the-art methods, despite its simple feature set and the absence of external resources for training.


Introduction
Chinese sentences are written as character sequences without word delimiters, which makes word segmentation a prerequisite of Chinese language processing. Since Xue (2003), most work has formulated Chinese word segmentation (CWS) as sequence labeling (Peng et al., 2004) with character position tags. This formulation lends itself to structured discriminative learning and allows rich features of segmentation configurations, including (i) the context of character/word n-grams within local windows, (ii) the segmentation history of previous characters, or combinations of both. These feature-based models still form the backbone of most state-of-the-art systems.
Nevertheless, many feature weights in such models are inevitably poorly estimated because the number of parameters is very large relative to the limited amount of training data. This has motivated the introduction of low-dimensional, real-valued vectors, known as embeddings, as a tool to deal with the sparseness of the input. Embeddings allow linguistic units appearing in similar contexts to share similar vectors. The success of embeddings has been observed in many NLP tasks. For CWS, Zheng et al. (2013) adapted Collobert et al. (2011) and used character embeddings in local windows as input for a two-layer network. The network predicts individual character position tags, the transitions of which are learned separately. A similar architecture has also been developed, which labels individual characters and uses character bigram embeddings as additional features to compensate for the absence of sentence-level modeling. Pei et al. (2014) improved upon Zheng et al. (2013) by capturing the combinations of context and history via a tensor neural network.
Despite their differences, these CWS approaches are all sequence labeling models. In such models, the target character can only influence the prediction as features. Consider the segmentation configuration in (1), where the dot appears before the target character in consideration and the box (□) represents any character that can occur in the configuration. In that example, the known history is that the first two characters 中国 'China' are joined together, which is denoted by the underline.
(1) 中国·□ 格外 (where □ ∈ {风, 规, ...})
(2) 中国风 格外 'China-style especially'
(3) 中国 规格 外 'besides Chinese spec.'
For the possible target characters 风 'wind' and 规 'rule', the correct segmentation decisions are opposite, as shown in (2) and (3), respectively. In order to correctly predict both, current models can set higher weights for target character-specific features. However, in general, 风 is more likely to start a new word than to join the existing one as in this example. Given such conflicting evidence, models can rarely find optimal feature weights, if they exist at all.
The crux of this conflicting evidence problem is that similar configurations can suggest opposite decisions, depending on the target character and vice versa. Thus it might be useful to treat segmentation decisions for distinct characters separately. And instead of predicting general segmentation decisions given configurations, it could be beneficial to model the matching between configurations and character-specific decisions.
To this end, this paper proposes an embedding matching approach (Section 2) to CWS, in which embeddings for both input and output are learned and used as representations to counteract sparsity. Thanks to embeddings of character-specific decisions (actions) serving as both input features and output, our hidden-layer-free architecture (Section 2.2) is capable of capturing prediction histories in ways similar to the hidden layers in recurrent neural networks (Mikolov et al., 2010). We evaluate the effectiveness of the model via a linear-time greedy segmenter (Section 3) implementation. The segmenter outperforms previous embedding-based models (Section 4.2) and achieves state-of-the-art results (Section 4.3) on a benchmark dataset. The main contributions of this paper are:
• A novel embedding matching model for Chinese word segmentation.
• A greedy word segmenter, which is based on the matching model and achieves competitive results.
• The idea of character-specific segmentation action embeddings as both feature and output, which are cornerstones of the model and the segmenter.

Embedding Matching Models for Chinese Word Segmentation
We propose an embedding based matching model for CWS, the architecture of which is shown in Figure 1. The model employs trainable embeddings to represent both sides of the matching, which will be specified shortly, followed by details of the architecture in Section 2.2.

Segmentation as Configuration-Action Matching
Output. The word segmentation output of a character sequence can be described as a sequence of character-specific segmentation actions. We use separation (s) and combination (c) as possible actions for each character, where a separation action starts a new word with the current character, while a combination action appends the character to the preceding ones. We model character-action combinations instead of atomic, character-independent actions. As a running example, sentence (4b) is the correct segmentation for (4a), which can be represented as the sequence (猫-s, 占-s, 领-c, 了-s, 婴-s, 儿-c, 床-c).
(4) a. 猫占领了婴儿床
    b. 猫 占领 了 婴儿床
    c. 'The cat occupied the crib'
Input. The input consists of the segmentation configurations for each character under consideration, which are described by context and history features. The context features capture the characters that are in the same sentence as the current character, and the history features encode the segmentation actions of previous characters.
• Context features. These refer to character unigrams and bigrams that appear in the local context window of h characters centered at c_i, where c_i is 领 in example (4) and h = 5 is used in this paper. The feature templates are shown in Table 1. For our example, the uni- and bi-gram features would be 猫, 占, 领, 了, 婴 and 猫占, 占领, 领了, 了婴, respectively.
• History features. To make inference tractable, we assume that only the previous l character-specific actions are relevant, where l = 2 for this study. In our example, 猫-s and 占-s are the history features. Such features capture partial information about syntactic and semantic dependencies between previous words, which are clues for segmentation that pure character contexts could not provide. A dummy character START is used to represent the absent (left) context characters in the case of the first l characters in a sentence. The predicted action for the START symbol is always s.
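The configuration described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `extract_features`, the padding tokens `<START>` and `<END>` (the text specifies a START symbol for the left side only) and the plain-string feature encoding are hypothetical choices, not details given in the paper. A practical implementation would probably also tag each feature with its template position so that the same character at different offsets receives a different embedding.

```python
START = "<START>"  # dummy left-context symbol; the token string is an assumption
END = "<END>"      # right-side padding; not specified in the text, assumed here


def extract_features(chars, actions, j, h=5, l=2):
    """Return (unigrams, bigrams, history) describing the configuration of chars[j].

    chars   -- list of characters in the sentence
    actions -- actions already decided for chars[:j], each "s" or "c"
    h       -- context window size (h = 5 in the paper)
    l       -- number of previous actions used as history (l = 2 in the paper)
    """
    half = h // 2
    padded = [START] * half + list(chars) + [END] * half
    window = padded[j:j + h]  # the h characters centered on chars[j]
    unigrams = list(window)
    bigrams = [a + b for a, b in zip(window, window[1:])]
    history = []
    for k in range(j - l, j):
        if k < 0:
            history.append(f"{START}-s")  # START is always treated as separated
        else:
            history.append(f"{chars[k]}-{actions[k]}")
    return unigrams, bigrams, history


# Configuration of 领 in the running example, given that 猫 and 占 were both separated.
unigrams, bigrams, history = extract_features(list("猫占领了婴儿床"), ["s", "s"], 2)
```

With h = 5 and l = 2 this yields 5 unigram, 4 bigram and 2 history features per character, that is, 11 features in total.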
Matching. CWS is now modeled as the matching of the input (segmentation configuration) and output (two possible character-specific actions) for each character. Formally, a matching model learns the following function:
$$ g(b_1 b_2 \dots b_n,\; a_1 a_2 \dots a_n) = \prod_{j=1}^{n} f\big(b_j(a_{j-l} \dots a_{j-1};\; c_{j-\lfloor h/2 \rfloor} \dots c_{j+\lfloor h/2 \rfloor}),\; a_j\big) \tag{1} $$
where c_1 c_2 ... c_n is the character sequence, and b_j and a_j are the segmentation configuration and action for character c_j, respectively. The notation b_j(a_{j-l} ... a_{j-1}; c_{j-⌊h/2⌋} ... c_{j+⌊h/2⌋}) indicates that the configuration for each character is a function that depends on the actions of the previous l characters and the characters in the local window of size h.

Figure 1: The architecture of the embedding matching model, illustrated with the prediction for the character 领 in sentence (4), which is the second character of the word 占领 'occupy'. Both feature and output embeddings are trainable parameters of the model.

Table 1: Uni- and bi-gram feature templates
Group     Feature template
unigram   c_{i-2}, c_{i-1}, c_i, c_{i+1}, c_{i+2}
bigram    c_{i-2}c_{i-1}, c_{i-1}c_i, c_i c_{i+1}, c_{i+1}c_{i+2}

Why embedding. The above matching model would suffer from sparsity if the outputs (character-specific actions a_j) were directly encoded as one-hot vectors, since the matching model can be seen as a sequence labeling model with C × L outputs, where L is the number of original labels and C is the number of unique characters. For Chinese, C is on the order of 10^3 to 10^4. The use of embeddings, however, can serve the matching model well thanks to their low dimensionality.

The Architecture
The proposed architecture (Figure 1) has three components, namely a look-up table, concatenation, and a softmax function for matching. We will go through each of them in this section.
Look-up table. The mapping from features/outputs to their corresponding embeddings is kept in a look-up table, as in much previous embedding-related work (Bengio et al., 2003; Pei et al., 2014). Such features are extracted from the training data. Formally, the embedding for each distinct feature d is denoted as Embed(d) ∈ R^N, which is a real-valued vector of dimension N. Each feature embedding is retrieved by the unique index of the feature. The retrieval of the embeddings for the output actions is similar.
Concatenation. To predict the segmentation for the target character c_j, its feature vectors are concatenated into a single vector, the input embedding, i(b_j) ∈ R^{N×K}, where K is the number of features used to describe the configuration b_j.
Softmax. The model then computes the dot product of the input embedding i(b_j) and each of the two output embeddings, o(a_{j,1}) and o(a_{j,2}), which represent the two possible segmentation actions for the target character c_j. The exponentials of the two raw scores are normalized to obtain probabilistic values in [0, 1].
We call the resulting scores matching probabilities, which denote the probabilities that actions match the given segmentation configuration. In our example, 领-c has a probability of 0.7 of being the correct action, while 领-s is less likely, with a probability of 0.3. Formally, the above matching procedure can be described as a softmax function, as shown in (2), which is also an individual f term in (1).
$$ f(b_j, a_j) = \frac{\exp\big(i(b_j) \cdot o(a_j)\big)}{\sum_{k=1}^{2} \exp\big(i(b_j) \cdot o(a_{j,k})\big)} \tag{2} $$
In (2), a_{j,k} (1 ≤ k ≤ 2) represents the two possible actions, such as 领-c and 领-s for 领 in our example. Note that, to ensure the input and output are of the same dimension, the model trains two distinct embeddings for each character-specific action, one in R^N as a feature and the other in R^{N×K} as an output, where K is the number of features for each input.
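The three components can be illustrated with a short NumPy sketch. It is a minimal illustration rather than the authors' implementation: the function name `matching_probabilities`, the dictionary-based look-up tables and the random initial values are assumptions, and N = 50 with K = 11 features (5 unigrams, 4 bigrams, 2 history actions) is used only to make the shapes concrete.

```python
import numpy as np


def matching_probabilities(feature_vecs, out_s, out_c):
    """Softmax over the two character-specific actions of one target character.

    feature_vecs -- list of K feature embeddings, each of shape (N,)
    out_s, out_c -- output embeddings of the s and c actions, each of shape (N*K,)
    Returns (p_s, p_c).
    """
    i_vec = np.concatenate(feature_vecs)               # input embedding i(b_j)
    scores = np.array([i_vec @ out_s, i_vec @ out_c])  # two dot products
    scores -= scores.max()                             # numerical stability only
    exp = np.exp(scores)
    p_s, p_c = exp / exp.sum()
    return float(p_s), float(p_c)


# Hypothetical look-up tables with random placeholder values.
rng = np.random.default_rng(0)
N = 50
features = ["猫", "占", "领", "了", "婴", "猫占", "占领", "领了", "了婴", "猫-s", "占-s"]
K = len(features)
feature_table = {f: rng.normal(scale=0.1, size=N) for f in features}
output_table = {a: rng.normal(scale=0.1, size=N * K) for a in ("领-s", "领-c")}

p_s, p_c = matching_probabilities(
    [feature_table[f] for f in features], output_table["领-s"], output_table["领-c"]
)
```

Only the two output embeddings of the target character enter the normalization, so the cost of one prediction does not grow with the total number of character-specific actions.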
Best word segmentation of sentence. After plugging (2) into (1) and applying (and then dropping) logarithms for computational convenience, finding the best segmentation for a sentence becomes an optimization problem as shown in (3). In the formula, Ŷ is the best action sequence found by the model among all possible ones, Y = a_1 a_2 ... a_n, where a_j is the predicted action for the character c_j (1 ≤ j ≤ n), which is either c_j-s or c_j-c, such as 领-s and 领-c.

The Greedy Segmenter
Our model depends on the actions predicted for the previous two characters as history features. Traditionally, such scenarios call for dynamic programming for exact inference. However, preliminary experiments showed that, for our model, a Viterbi search-based segmenter, even when supported by conditional random field (Lafferty et al., 2001) style training, yields results similar to those of the greedy search-based segmenter in this section. Since the greedy segmenter is much more efficient in training and testing, the rest of the paper focuses on the proposed greedy segmenter, the details of which are described in this section.

Greedy Search
Initialization. The first character in the sentence is made to have two left side characters that are dummy symbols of START, whose predicted actions are always START-s, i.e. separation.
Iteration. The algorithm predicts the action for each character c_j, one at a time, in a left-to-right, incremental manner, where 1 ≤ j ≤ n and n is the sentence length. To do so, it first extracts context features and history features, the latter of which are the predicted character-specific actions for the previous two characters. Then the model matches the concatenated feature embedding with the embeddings of the two possible character-specific actions, c_j-s and c_j-c. The one with the higher matching probability is predicted as the segmentation action for the character, and this decision is irreversible. After the action for the last character is predicted, the segmented word sequence of the sentence is built from the predicted actions deterministically.
Hybrid matching. Character-specific embeddings are capable of capturing subtle word formation tendencies of individual characters, but such representations are incapable of covering matching cases for unknown target characters. Another minor issue is that the action embeddings for certain low-frequency characters may not be sufficiently trained. To better deal with these scenarios, we also train two embeddings to represent character-independent segmentation actions, ALL-s and ALL-c, and use them to average with or substitute for the embeddings of infrequent or unknown characters, which are either insufficiently trained or nonexistent. We call this strategy hybrid matching; it can improve accuracy.
Complexity. Although the total number of actions is large, the matching for each target character only requires the two actions that correspond to that specific character, such as 领 -s and 领 -c for 领 in our example. Each prediction is thus similar to a softmax computation with two outputs, which costs constant time C. Greedy search ensures that the total time for predicting a sentence of n characters is n × C, i.e. linear time complexity, with a minor overhead for mapping actions to segmentations.
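The greedy procedure of this section can be sketched as follows. The names `greedy_segment`, `actions_to_words` and `toy_match` are hypothetical, and `toy_match` is a stand-in that returns fixed probabilities in place of the trained matching model; only the control flow is meant to reflect the description above.

```python
START = "<START>"  # dummy history symbol; the token string is an assumption


def greedy_segment(chars, match_fn, l=2):
    """Predict one irreversible action per character, left to right.

    match_fn(chars, j, history) must return (p_s, p_c) for chars[j], where
    history is a list of l (character, action) pairs for the previous characters.
    """
    actions = []
    for j in range(len(chars)):
        lo = max(0, j - l)
        # Pad with START-s when fewer than l characters precede position j.
        history = [(START, "s")] * (l - (j - lo)) + list(zip(chars[lo:j], actions[lo:j]))
        p_s, p_c = match_fn(chars, j, history)
        actions.append("c" if p_c > p_s else "s")
    return actions


def actions_to_words(chars, actions):
    """Deterministically rebuild the word sequence from the predicted actions."""
    words = []
    for ch, action in zip(chars, actions):
        if action == "s" or not words:
            words.append(ch)   # separation: start a new word
        else:
            words[-1] += ch    # combination: extend the current word
    return words


def toy_match(chars, j, history):
    """Stand-in for the trained model: favors combination for three characters."""
    return (0.3, 0.7) if chars[j] in "领儿床" else (0.7, 0.3)


chars = list("猫占领了婴儿床")
actions = greedy_segment(chars, toy_match)
words = actions_to_words(chars, actions)
```

Each character triggers exactly one call to the matching function and one append, which is where the n × C running time comes from.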

Training
The training procedure first predicts the action for the current character with the current parameters, and then optimizes the log likelihood of the correct segmentation actions in the gold segmentations to update the parameters. Ideally, the matching probability for the correct action embedding should be 1 while that of the incorrect one should be 0. To pursue this goal, we minimize the cross-entropy loss function in (4) for the segmentation prediction of each character c_j. The loss function is convex, similar to that of maximum entropy models.
$$ J(a_j) = -\sum_{k=1}^{2} \delta(a_{j,k}) \log \frac{\exp\big(i \cdot o(a_{j,k})\big)}{\sum_{k'=1}^{2} \exp\big(i \cdot o(a_{j,k'})\big)} \tag{4} $$
where a_{j,k} denotes a possible action for c_j and i is a compact notation for i(b_j). In (4), δ(a_{j,k}) is an indicator function defined by the following formula, where â_j denotes the correct action.
$$ \delta(a_{j,k}) = \begin{cases} 1 & \text{if } a_{j,k} = \hat{a}_j \\ 0 & \text{otherwise} \end{cases} $$
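To counteract over-fitting, we add an L2 regularization term to the loss function, as follows:
$$ J'(a_j) = J(a_j) + \frac{\lambda}{2}\Big(\lVert i \rVert^2 + \sum_{k=1}^{2} \lVert o_k \rVert^2\Big) \tag{5} $$
where J(a_j) is the loss in (4), λ is the regularization coefficient, and o_k is a short notation for o(a_{j,k}). The formulas in (4) and (5) are similar to those of standard softmax regression, except that both input and output embeddings are parameters to be trained. We perform stochastic gradient descent to update input and output embeddings in turn, each time considering the other as constant. We give the gradient (6) and the update rule (7) for the input embedding i(b_j) (i for short):
$$ \frac{\partial J'}{\partial i} = \sum_{k=1}^{2} \big(f(b_j, a_{j,k}) - \delta(a_{j,k})\big)\, o_k + \lambda\, i \tag{6} $$
$$ i \leftarrow i - \alpha \frac{\partial J'}{\partial i} \tag{7} $$
where f(b_j, a_{j,k}) is the matching probability in (2). The gradient and update for output embeddings are similar. The α in (7) is the learning rate, which a linear decay scheme gradually shrinks from its initial value to zero. Note that the update for the input embedding i is actually performed on the feature embeddings that form i in the concatenation step.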
Complexity. For each iteration of the training process, the time complexity is also linear in the number of input characters: compared with search, only a few additional constant-time operations for gradient computation and parameter updates are performed for each character.
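A single training update can be sketched with NumPy as below. This is a sketch under stated assumptions rather than the training code used in the experiments: it assumes the usual softmax-regression gradient with an L2 penalty of the form (λ/2)‖·‖² on each embedding, it computes all three gradients from the pre-update values instead of strictly alternating, and the names `sgd_step` and `linear_decay` as well as the values α = 0.1 and λ = 1e-4 are illustrative.

```python
import numpy as np


def linear_decay(alpha0, step, total_steps):
    """Learning rate shrinking linearly from alpha0 to zero."""
    return alpha0 * (1.0 - step / total_steps)


def sgd_step(i_vec, o_s, o_c, gold, alpha, lam):
    """Update the input and the two output embeddings in place for one character.

    gold -- "s" or "c", the correct action for the target character
    Returns the cross-entropy loss measured before the update (L2 term excluded).
    """
    outputs = np.stack([o_s, o_c])             # copy, shape (2, N*K)
    scores = outputs @ i_vec
    scores -= scores.max()                     # numerical stability only
    probs = np.exp(scores) / np.exp(scores).sum()
    delta = np.array([gold == "s", gold == "c"], dtype=float)
    err = probs - delta                        # matching probability minus indicator
    grad_i = err @ outputs + lam * i_vec       # gradient w.r.t. the input embedding
    grad_s = err[0] * i_vec + lam * o_s        # gradients w.r.t. the output embeddings
    grad_c = err[1] * i_vec + lam * o_c
    i_vec -= alpha * grad_i
    o_s -= alpha * grad_s
    o_c -= alpha * grad_c
    return float(-np.log(probs[int(delta.argmax())]))


rng = np.random.default_rng(0)
i_vec, o_s, o_c = (rng.normal(scale=0.1, size=20) for _ in range(3))
losses = [sgd_step(i_vec, o_s, o_c, "c", linear_decay(0.1, t, 50), 1e-4) for t in range(50)]
```

In a full model `i_vec` is a concatenation, so its update would be split back into N-dimensional slices and applied to the corresponding feature embeddings in the look-up table.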

Data and Evaluation Metric
In the experiments, we use two widely used and freely available manually word-segmented corpora, namely PKU and MSR, from the second SIGHAN international Chinese word segmentation bakeoff (Emerson, 2005). Table 2 shows the details of the two datasets. All evaluations in this paper are conducted with the official training/testing set split using the official scoring script.

Table 2: Details of the PKU and MSR corpora
                   PKU          MSR
Word types         5.5 × 10^4   8.8 × 10^4
Word tokens        1.1 × 10^6   2.4 × 10^6
Character types    5 × 10^3     5 × 10^3
Character tokens   1.8 × 10^6   4.1 × 10^6

The segmentation accuracy is evaluated by precision (P), recall (R), F-score and R_oov, the recall for out-of-vocabulary words. Precision is defined as the number of correctly segmented words divided by the total number of words in the segmentation result. Recall is defined as the number of correctly segmented words divided by the total number of words in the gold standard segmentation. In particular, R_oov reflects the generalization ability of the model. The metric for overall performance, the evenly-weighted F-score, is calculated as in (8):
$$ F = \frac{2 \times P \times R}{P + R} \tag{8} $$
To comply with CWS evaluation conventions and make comparisons fair, we distinguish the following two settings:
• closed-set: no extra resource other than the training corpora is used.
• open-set: additional lexicons, raw corpora, etc. are used.
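The word-level precision, recall and F-score defined above can be computed by comparing character spans, as in the following sketch. It is not the official scoring script: the function names are hypothetical, it scores a single sentence rather than a whole test set, and it does not cover R_oov, which additionally requires the training vocabulary.

```python
def word_spans(words):
    """Map a segmented sentence to the set of (start, end) character spans of its words."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def precision_recall_f(gold_words, pred_words):
    """A word counts as correct only if both of its boundaries match the gold standard."""
    gold, pred = word_spans(gold_words), word_spans(pred_words)
    correct = len(gold & pred)
    p = correct / len(pred)
    r = correct / len(gold)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Gold segmentation (4b) against a prediction that wrongly splits 占领.
p, r, f = precision_recall_f(["猫", "占领", "了", "婴儿床"], ["猫", "占", "领", "了", "婴儿床"])
```

In this example three of the five predicted words and three of the four gold words are matched, so precision is 3/5 and recall is 3/4.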
We will report the final results of our model on the PKU and MSR corpora in comparison with previous embedding-based models (Section 4.2) and state-of-the-art systems (Section 4.3), before going into detailed experiments for model analyses (Section 4.5).

Comparison with Previous Embedding-Based Models
As shown in Table 3, under closed-set evaluation, our model significantly outperforms previous embedding-based models in all metrics. Compared with the previous best embedding-based model, our greedy segmenter has achieved up to 2.2% and 25.8% absolute improvements (MSR) on F-score and R_oov, respectively. Surprisingly, our closed-set results are also comparable to the best open-set results of previous models. As we will see in Section 4.4, when using the same or fewer character uni- and bi-gram features, our model still outperforms previous embedding-based models in closed-set evaluation, which shows the effectiveness of our matching model.
Significance test. Table 4 shows the 95% confidence intervals (CI) for the closed-set results of our model and the best performing previous model (Pei et al., 2014), which are computed by formula (9), following Emerson (2005).
$$ F \pm 2\sqrt{\frac{F(1 - F)}{N}} \tag{9} $$
where F is the F-score value and N is the word token count of the testing set, which is 104,372 and 106,873 for PKU and MSR, respectively. We see that the confidence intervals of our results do not overlap with those of Pei et al. (2014), meaning that our improvements are statistically significant.
Table 5 shows that the results of our greedy segmenter are competitive with the state-of-the-art supervised systems (Best05 closed-set, Zhang and Clark, 2007), although our feature set is much simpler. More recent state-of-the-art systems rely on both extensive feature engineering and extra raw corpora to boost performance, which makes them semi-supervised. For example, 8 types of static and dynamic features have been developed to maximize a co-training system that used the extra corpora of Chinese Gigaword and Baike, each of which contains more than 1 billion character tokens. Such systems are not directly comparable with our supervised model. We leave the development of semi-supervised learning methods for our model as future work.
Features complement each other, and removing any group of features leads to a limited drop in F-score of up to 0.7%. Note that the features of the previous two actions are even more informative than all unigram features combined, suggesting that the intra- and inter-word dependencies reflected by action features are strong evidence for segmentation. Moreover, using the same or fewer character n-gram features, our model outperforms previous embedding-based models, which shows the effectiveness of our matching model.
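Returning to the significance test above, the interval can be checked numerically with a few lines of Python. The sketch assumes the binomial approximation F ± 2·sqrt(F(1 − F)/N) used for the bakeoff by Emerson (2005); the function name is hypothetical, and the F-score of 0.951 on PKU is the value reported for the greedy segmenter elsewhere in this paper.

```python
import math


def f_confidence_interval(f_score, n_words):
    """Approximate 95% confidence interval for an F-score.

    f_score -- F-score as a fraction in [0, 1]
    n_words -- number of word tokens in the test set
    """
    half_width = 2.0 * math.sqrt(f_score * (1.0 - f_score) / n_words)
    return f_score - half_width, f_score + half_width


# PKU test set: 104,372 word tokens, F-score 0.951 for the greedy segmenter.
low, high = f_confidence_interval(0.951, 104372)
half_width = (high - low) / 2.0
```

With test sets of roughly 10^5 words the half-width is a little over 0.1 percentage points, so differences of several tenths of a point in F-score lie outside the interval.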

Model Analysis
Learning curve. Figure 2 shows that the training procedure converges quickly. After the first iteration, the testing F-scores are already 93.5% and 95.7% for PKU and MSR, respectively, and they gradually reach their maximum within the next 9 iterations before the curve flattens out.

Speed. With an unoptimized single-thread Python implementation running on a laptop with an Intel Core i5 CPU (1.9 GHz), each iteration of the training procedure on the PKU dataset takes about 5 minutes, or 6,000 characters per second. The prediction speed is above 13,000 characters per second.
Hyper parameters. The hyper parameters used in the experiments are shown in Table 7. We initialized the hyper parameters with recommendations from the literature before tuning with dev-set experiments, each of which changes one parameter by an order of magnitude. We fixed the hyper parameters to the current setting without spending too much time on tuning, since that is not the main purpose of this paper.
• Embedding size determines the number of parameters to be trained, and thus should fit the training data size to achieve good performance. We tried sizes of 30 and 100, both of which perform worse than 50. A possible refinement is to use different embedding sizes for different groups of features instead of setting N_1 = 50 for all features.
• Context window size. A window size of 3-5 characters achieves comparable results. Zheng et al. (2013) suggested that context window larger than 5 may lead to inferior results.
• Initial learning rate. We found that several learning rates between 0.04 and 0.15 yielded results very similar to the one reported here. The training is not very sensitive to reasonable values of the initial learning rate. However, instead of our simple linear decay of the learning rate, it might be useful to try more sophisticated techniques, such as AdaGrad and exponential decay.
• Regularization. Our model suffers a little from over-fitting if no regularization is used. In that case, the F-score on PKU drops from 95.1% to 94.7%.
• Pre-training. We tried pre-training character embeddings using word2vec with the Chinese Gigaword Corpus and used them to initialize the corresponding embeddings in our model, as previous work did. However, we only saw insignificant F-score improvements within 0.1% and observed that the training F-score reached 99.9% much earlier. We hypothesize that pre-training leads to sub-optimal local maxima for our model.
• Hybrid matching. We tried applying hybrid matching (Section 3.1) to target characters that are less frequent than the f_top most frequent characters, including unseen characters, which leads to F-score improvements of about 0.15%.
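The hybrid matching strategy in the last item can be sketched as a small selection function. The text states that the character-independent embeddings are averaged with or substituted for character-specific ones; which of the two is used in which case is not specified, so the rule below (substitute for unseen characters, average for infrequent ones) is an assumption, as are the function name, the rank-based frequency test and the dictionary layout.

```python
import numpy as np


def hybrid_output_embedding(char, action, char_rank, out_emb, f_top):
    """Choose the output embedding used to match `char` with `action` ("s" or "c").

    char_rank -- dict mapping a character to its frequency rank (0 = most frequent)
    out_emb   -- dict mapping "char-action" and "ALL-action" keys to np.ndarray
    f_top     -- characters ranked below the top f_top count as infrequent
    """
    generic = out_emb[f"ALL-{action}"]
    key = f"{char}-{action}"
    if key not in out_emb:
        return generic                         # unseen character: substitute
    if char_rank.get(char, f_top) >= f_top:
        return (out_emb[key] + generic) / 2.0  # infrequent character: average
    return out_emb[key]                        # frequent character: use as is


# Toy values chosen only to make the three cases visible.
out_emb = {"领-c": np.ones(3), "罕-c": 2.0 * np.ones(3), "ALL-c": np.zeros(3)}
char_rank = {"领": 0, "罕": 5000}
frequent = hybrid_output_embedding("领", "c", char_rank, out_emb, f_top=1000)
infrequent = hybrid_output_embedding("罕", "c", char_rank, out_emb, f_top=1000)
unseen = hybrid_output_embedding("龘", "c", char_rank, out_emb, f_top=1000)
```

The returned vector would simply replace the character-specific output embedding in the softmax matching step, leaving the rest of the segmenter unchanged.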

Related Work
Word segmentation. Most modern segmenters followed Xue (2003) in modeling CWS as sequence labeling of character position tags, using conditional random fields (Peng et al., 2004), structured perceptron (Jiang et al., 2008), etc. Some notable exceptions are Zhang and Clark (2007) and Zhang et al. (2012), which exploited rich word-level features, and Ma et al. (2012) and Ma (2014), which explicitly model word structures. Our work generalizes sequence labeling to a more flexible framework of matching, and predicts actions as in (Zhang and Clark, 2007; Zhang et al., 2012) instead of position tags to prevent the greedy search from suffering tag inconsistencies. To better utilize resources other than training data, our model might benefit from techniques used in recent state-of-the-art systems, such as semi-supervised learning (Zhao and Kit, 2008; Sun and Xu, 2011; Zeng et al., 2013), joint models (Li and Zhou, 2012; Qian and Liu, 2012), and partial annotations (Yang and Vozila, 2014).
Distributed representation and CWS. Distributed representations are useful for various NLP tasks, such as POS tagging (Collobert et al., 2011), machine translation (Devlin et al., 2014) and parsing (Socher et al., 2013). Influenced by Collobert et al. (2011), Zheng et al. (2013) modeled CWS as tagging and treated the sentence-level tag sequence as the combination of individual tag predictions and context-independent tag transitions. Another model, inspired by Bengio et al. (2003), used character bigram embeddings to compensate for the absence of sentence-level optimization. To model interactions between tags and characters, which are absent in these two CWS models, Pei et al. (2014) introduced the tag embedding and used a tensor hidden layer in the neural net. In contrast, our work uses character-specific action embeddings to explicitly capture such interactions. In addition, our work gains efficiency by avoiding hidden layers, similar to Mikolov et al. (2013).
Learning to match. Matching heterogeneous objects has been studied in various contexts before, and is currently flourishing, thanks to embedding-based deep (Gao et al., 2014) and convolutional (Huang et al., 2013; Hu et al., 2014) neural networks. This work develops a matching model for CWS and differs from others in its "shallow" yet effective architecture.

Discussion
Simple architecture. It is possible to adopt a standard feed-forward neural network for our embedding matching model with character-action embeddings as both feature and output. Nevertheless, we designed the proposed architecture to avoid hidden layers for simplicity, efficiency and easy tuning, inspired by word2vec. Our simple architecture is effective, as demonstrated by the improved results over previous neural-network word segmenters, all of which use feed-forward architectures with different features and/or layers. It might be interesting to directly compare the performance of our model with the same features on the current and feed-forward architectures, which we leave for future work.
Greedy and exact search-based models. As mentioned in Section 3, we implemented and preliminarily experimented with a segmenter that trains a similar model with exact search via Viterbi algorithm. On the PKU corpus, its F-score is 0.944, compared with greedy segmenter's 0.951. Its training and testing speed are up to 7.8 times slower than that of the greedy search segmenter. It is counter-intuitive that the performance of the exact-search segmenter is no better or even worse than that of the greedy-search segmenter. We hypothesize that since the training updates parameters with regard to search errors, the final model is "tailored" for the specific search method used, which makes the model-search combination of greedy search segmenter not necessarily worse than that of exact search segmenter. Another way of looking at it is that search is less important when the model is accurate. In this case, most step-wise decisions are correct in the first place, which requires no correction from the search algorithm. Empirically, Zhang and Clark (2011) also reported exact-search segmenter performing worse than beam-search segmenters.
Although the greedy segmenter is incapable of considering future labels, this rarely causes problems in practice. Our greedy segmenter achieves good results compared with the exact-search segmenter above and with previous approaches, most of which utilize exact search. Moreover, the greedy segmenter has the additional advantages of faster training and prediction.
Sequence labeling and matching. A traditional sequence labeling model such as CRF has K (number of labels) target-character-independent weight vectors, where the target character influences the prediction via the weights of the features that contain it. In a way, a matching model can be seen as a family of "sub-models", which keeps a group of weight vectors (the output embeddings) for each unique target character. Different target characters activate different sub-models, allowing opposite predictions for similar input features, as the target weight vectors used are different.

Conclusion and Future Work
In this paper, we have introduced the matching formulation for Chinese word segmentation and proposed an embedding matching model to take advantage of distributed representations. Based on the model, we have developed a greedy segmenter, which outperforms previous embedding-based methods and is competitive with state-of-the-art systems. These results suggest that it is promising to model CWS as configuration-action matching using distributed representations. In addition, the linear-time training and testing complexity of our simple architecture is very desirable for industrial applications. To the best of our knowledge, this is the first greedy segmenter that is competitive with state-of-the-art discriminative learning models.
In the future, we plan to investigate methods for our model to better utilize external resources. We would like to try using convolutional neural networks to automatically encode n-gram-like features, in order to further shrink the parameter space. It would also be interesting to study whether extending our model with deep architectures can benefit CWS. Lastly, it might be useful to adapt our model to tasks such as POS tagging and named entity recognition.