State-of-the-art Chinese Word Segmentation with Bi-LSTMs

A wide variety of neural-network architectures have been proposed for the task of Chinese word segmentation. Surprisingly, we find that a bidirectional LSTM model, when combined with standard deep learning techniques and best practices, can achieve better accuracy on many of the popular datasets than models based on more complex neural-network architectures. Furthermore, our error analysis shows that out-of-vocabulary words remain challenging for neural-network models, and many of the remaining errors are unlikely to be fixed through architecture changes. Instead, more effort should go into exploring resources for further improvement.


Introduction
Neural networks have become ubiquitous in natural language processing. For the word segmentation task, there has been a growing body of work exploring novel neural network architectures for learning useful representations and thus better segmentation predictions (Pei et al., 2014; Ma and Hinrichs, 2015; Zhang et al., 2016a; Liu et al., 2016; Cai et al., 2017; Wang and Xu, 2017).
We show that properly training and tuning a relatively simple architecture with a minimal feature set and greedy search achieves state-of-the-art accuracies and beats more complex neural-network architectures. Specifically, the model itself is a straightforward stacked bidirectional LSTM (Figure 1) with just two input features at each position (character and bigram). We use three widely recognized techniques to get the most performance out of the model: pre-trained embeddings (Zhou et al., 2017), dropout (Srivastava et al., 2014), and hyperparameter tuning (Weiss et al., 2015; Melis et al., 2018). These results have important ramifications for further model development. Unless best practices are followed, it is difficult to compare the impact of modeling decisions, as differences between models are masked by choice of hyperparameters or initialization.
In addition to the simpler model we present, we also aim to provide useful guidance for future research by examining the errors that the model makes. About a third of the errors are due to annotation inconsistency, and these can only be eliminated with manual annotation. The other two thirds are those due to out-of-vocabulary words and those requiring semantic clues not present in the training data. Some of these errors will be almost impossible to solve with different model architectures. For example, while 抽象概念 (abstract concept) appears as one word at test time, any model trained only on the MSR dataset will segment it as two words: 抽象 (abstract) and 概念 (concept), which are seen in the training set 28 and 90 times, respectively, and never together. Thus, we expect that iterating on model architectures will give diminishing returns, while leveraging external resources such as unlabeled data or lexicons is a more promising direction.
In sum, this work contributes two significant pieces of evidence to guide further development in Chinese word segmentation. First, comparing different model architectures requires careful tuning and application of best practices in order to obtain rigorous comparisons. Second, iterating on neural architectures may be insufficient to solve the remaining classes of segmentation errors without further efforts in data collection.

Model
Our model is relatively simple. Our approach uses long short-term memory (LSTM) neural network architectures, since previous work has found success with these models (Chen et al., 2015; Zhou et al., 2017, inter alia). We use two features at each position: character unigrams and character bigrams. In the next sections we describe the best practices we used to achieve state-of-the-art performance from this architecture. Note that all of these practices and techniques are derived from related work, which we describe.
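To make the two input features concrete, the following minimal sketch extracts a (character, bigram) pair for every position of an unsegmented sentence. The function name, the padding symbol, and the choice to pair each character with its right neighbour are assumptions made for illustration; the text does not specify how the bigram is anchored.

```python
from typing import List, Tuple

PAD = "</s>"  # hypothetical end-of-sentence symbol, not taken from the paper


def char_and_bigram_features(sentence: str) -> List[Tuple[str, str]]:
    """Return one (character, bigram) pair per position of the sentence."""
    feats = []
    for i, ch in enumerate(sentence):
        # Assumption: the bigram is the current character plus the next one.
        nxt = sentence[i + 1] if i + 1 < len(sentence) else PAD
        feats.append((ch, ch + nxt))
    return feats


if __name__ == "__main__":
    for ch, bigram in char_and_bigram_features("抽象概念"):
        print(ch, bigram)
```

In a full model, each of the two features would be looked up in its own embedding table and the resulting vectors fed to the LSTM stack.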
Recurrent Dropout. Contrary to the recommendation of Zaremba et al. (2014), we apply dropout to the recurrent connections of our LSTMs, and we see similar improvements whether we follow the recipe of Gal and Ghahramani (2016) or simply sample a new dropout mask at every recurrent connection.
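The two masking strategies compared above differ only in how often the mask is resampled. The sketch below illustrates this with plain Python lists: with `shared=True` one mask is reused at every time step, in the style of Gal and Ghahramani (2016), and with `shared=False` a fresh mask is drawn per step. The function names and the inverted-dropout scaling are illustrative assumptions, not the authors' implementation.

```python
import random
from typing import List


def dropout_mask(size: int, rate: float, rng: random.Random) -> List[float]:
    """Inverted-dropout mask: 0 with probability `rate`, else 1 / (1 - rate)."""
    keep = 1.0 - rate
    return [0.0 if rng.random() < rate else 1.0 / keep for _ in range(size)]


def recurrent_masks(steps: int, size: int, rate: float,
                    shared: bool, seed: int = 0) -> List[List[float]]:
    """Masks applied to the hidden state at each recurrent connection.

    shared=True  -> a single mask reused at every time step.
    shared=False -> a new mask sampled at every time step.
    """
    rng = random.Random(seed)
    if shared:
        mask = dropout_mask(size, rate, rng)
        return [mask for _ in range(steps)]
    return [dropout_mask(size, rate, rng) for _ in range(steps)]


if __name__ == "__main__":
    for row in recurrent_masks(steps=3, size=4, rate=0.5, shared=True):
        print(row)
```

Each mask would be multiplied elementwise with the previous hidden state before it enters the next LSTM step.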
Hyperparameters. We use the momentum-based averaged SGD procedure of Weiss et al. (2015) to train the model, with a few additions. We normalized each gradient to be at most unit norm, and used asynchronous SGD updates to speed up training. For each configuration we evaluated, we trained different settings of a manually tuned hyperparameter grid, varying the initial learning rate, learning rate schedule, and input and recurrent dropout rates. We fixed the momentum parameter µ = 0.95. The full list of hyperparameters is given in Table 2. Table 7 shows the impact of this tuning procedure, which we found crucial for measuring the best performance of the simple architecture.
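As a rough illustration of the gradient normalization step, the sketch below rescales a gradient only when its L2 norm exceeds one and then applies a classical momentum update with µ = 0.95. The exact momentum formulation and the parameter averaging of Weiss et al. (2015) are not reproduced here; the update rule and all function names are assumptions.

```python
import math
from typing import List, Tuple


def clip_to_unit_norm(grad: List[float]) -> List[float]:
    """Rescale the gradient so that its L2 norm is at most 1."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= 1.0:
        return list(grad)
    return [g / norm for g in grad]


def momentum_step(params: List[float], velocity: List[float],
                  grad: List[float], lr: float,
                  mu: float = 0.95) -> Tuple[List[float], List[float]]:
    """One classical-momentum SGD step on a normalized gradient (assumed form)."""
    grad = clip_to_unit_norm(grad)
    new_velocity = [mu * v - lr * g for v, g in zip(velocity, grad)]
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity


if __name__ == "__main__":
    print(clip_to_unit_norm([3.0, 4.0]))
    print(momentum_step([0.0], [0.0], [2.0], lr=0.1))
```

Small gradients pass through unchanged, so the normalization only limits the size of unusually large updates.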
Pretrained Embeddings. Pre-training embedding matrices from automatically gathered data is a powerful technique that has been applied to many NLP problems for several years (e.g. Collobert et al. (2011); Mikolov et al. (2013)). We pretrain the character embeddings and character-bigram embeddings using wang2vec (Ling et al., 2015), which modifies word2vec by incorporating character/bigram order information during training. Note that this idea has been used in segmentation previously by Zhou et al. (2017), but they also augment the contexts by adding the predictions of a baseline segmenter as an additional context. We experimented with both treating the pretrained embeddings as constants and fine-tuning them on the particular datasets.
Other Related Work. Recently, a number of different neural-network-based models have been proposed for the word segmentation task. One common approach is to learn a word representation from the characters of that word. For example, Liu et al. (2016) run a bidirectional LSTM over the characters of the word candidate and then concatenate the bidirectional LSTM outputs at both end points. Cai et al. (2017) adopt a gating mechanism to control the relative importance of each character in the word candidate. Besides modeling word representations directly, sequence labeling is another popular approach. For instance, Zheng et al. (2013) and Pei et al. (2014) predict the label of a character based on the context of a fixed-size local window. Chen et al. (2015) extend the approach by using LSTMs to capture potential long-distance information. Both Chen et al. (2015) and Pei et al. (2014) use a transition matrix to model the interaction between adjacent tags. Zhou et al. (2017) conduct a rigorous comparison and show that such a transition matrix rarely improves accuracy. Our model is similar to that of Zhou et al. (2017), except that we stack the backward LSTM on top of the forward one, which improves accuracy as shown in a later section.
Our model is also trained via a simple maximum likelihood objective. In contrast, other state-of-the-art models use a non-greedy approach to training and inference, e.g.  and Zhang et al. (2016b).
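Greedy inference for a character-level labeling model can be sketched as follows: each position takes its highest-scoring tag independently, and the tags are then read off as word boundaries. The two-tag scheme used here ("B" begins a word, "I" continues one) and the function name are assumptions for illustration; the tag set is not given in the text.

```python
from typing import Dict, List


def greedy_segment(sentence: str, scores: List[Dict[str, float]]) -> List[str]:
    """Pick the best tag per character and convert tags to words.

    scores[i] maps each tag ("B" or "I", an assumed tag set) to the
    model score for the i-th character.
    """
    words: List[str] = []
    for ch, tag_scores in zip(sentence, scores):
        tag = max(tag_scores, key=tag_scores.get)  # greedy, per-position argmax
        if tag == "B" or not words:
            words.append(ch)   # start a new word
        else:
            words[-1] += ch    # extend the current word
    return words


if __name__ == "__main__":
    toy_scores = [
        {"B": 0.9, "I": 0.1},
        {"B": 0.2, "I": 0.8},
        {"B": 0.7, "I": 0.3},
        {"B": 0.1, "I": 0.9},
    ]
    print(greedy_segment("抽象概念", toy_scores))
```

Because no transition scores or beam are involved, decoding is a single pass over the sentence.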

Experiments
Data. We conduct experiments on the following datasets: Chinese Penn Treebank 6.0 (CTB6), with the data split according to the official documentation; Chinese Penn Treebank 7.0 (CTB7), with the recommended data split (Wang et al., 2011); the Chinese Universal Treebank (UD) from the CoNLL 2017 shared task (Zeman et al., 2017), with the official data split; and the datasets from the SIGHAN 2005 bake-off task (Emerson, 2005).

Main Results
Table 2 contains the state-of-the-art results from recent neural network based models, together with the performance of our model. Table 3 contains results achieved without using any pretrained embeddings.
Our model achieves the best results among NN models on 6/7 datasets. In addition, while most datasets work best when the pretrained embedding matrix is treated as constant, the MSR dataset is an outlier: fine-tuning embeddings yields a very large improvement. The likely cause is the low OOV rate in the MSR evaluation set compared to the other datasets.

Ablation Experiments
To see which decisions had the greatest impact on the result, we performed ablation experiments on the holdout sets of the different corpora. Starting with our proposed system, we remove one decision, perform hyperparameter tuning, and see the change in performance. The results are summarized in Table 6. Negative numbers in Table 6 correspond to decreases in performance for the ablated system. Note that although each of the components helps performance on average, there are cases where we observe no impact. For example, using recurrent dropout on AS and MSR rarely affects performance. In Table 7, we compare fully tuned models with those that share hyperparameter configurations across datasets for three settings of the model. We can see that hyperparameter tuning consistently improves model accuracy across all settings.

Error Analysis
In order to guide future research on Chinese word segmentation, it is important to understand the types of errors that the system is making. To get a sense of this, we randomly selected 54 and 50 errors from the CTB-6 and MSR test sets, respectively. We then manually analyzed them.
The model learns to remember words it has seen, especially high-frequency words. It also learns the notion of prefixes/suffixes, which aids in predicting OOV words, a major source of segmentation errors (Huang and Zhao, 2007). Using pretrained embeddings enables the model to expand the set of prefixes/suffixes through their nearest neighbors in the embedding space, and therefore further improves OOV recall (on average, using pretrained embeddings contributes a 10% improvement in OOV recall; see Table 5 for more details).
Nevertheless, OOV words remain challenging, especially those that can be divided into words frequently seen in the training data; most (37 out of 43) of the oversegmentation errors are due to this. For instance, the model incorrectly segmented the OOV word 抽象概念 (abstract concept) as 抽象 (abstract) and 概念 (concept). 抽象 and 概念 are seen in the training set 28 times and 90 times, respectively. Unless high-coverage dictionaries are used, it is difficult for any supervised model to learn not to follow this trend in the training data.
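To illustrate how an oversegmented OOV word lowers OOV recall, the sketch below computes the metric under a common definition: the fraction of gold words absent from the training vocabulary whose exact character span also appears in the prediction. This definition and the function names are assumptions; the scoring script used for the reported numbers is not shown in the text.

```python
from typing import Iterable, List, Set, Tuple


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Character-offset spans (start, end) of each word in a segmentation."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def oov_recall(gold: List[str], pred: List[str],
               train_vocab: Iterable[str]) -> float:
    """Share of gold OOV words recovered with exactly the same span."""
    vocab = set(train_vocab)
    pred_spans = word_spans(pred)
    start = total = hits = 0
    for w in gold:
        end = start + len(w)
        if w not in vocab:  # gold word never seen in training
            total += 1
            hits += (start, end) in pred_spans
        start = end
    return hits / total if total else 0.0


if __name__ == "__main__":
    gold = ["抽象概念", "很", "难"]
    pred = ["抽象", "概念", "很", "难"]
    print(oov_recall(gold, pred, {"抽象", "概念", "很"}))
```

In this toy case the OOV word 抽象概念 is split into two in-vocabulary words and so counts as a miss, while the OOV word 难 is recovered.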
In addition, the model sometimes struggles when a prefix/suffix can also be a word by itself. For instance, 权 (right/power) frequently serves as a suffix, such as 管理权 (right of management), 立法权 (right of legislation) and 终审权 (right of final judgment). When the model encounters 下放 (delegate/transfer) 权 (power), it incorrectly merges them together. Similarly, the model segments 居 (in/at) + 中 (middle) as 居中 (in the middle), since the training data contains words such as 居首 (in the first place) and 居次 (in the second place). This example also hints at the ambiguity of word delineation in Chinese, and explains the difficulty in keeping annotations consistent.
Fixing the above errors requires semantic-level knowledge, for example that 'Bank' (银行) is unlikely to be the name of a county (县) and, likewise, that transfer power (下放权) is not a type of right (权). Previous work (Huang and Zhao, 2007) also pointed out that OOV is a major obstacle to achieving high segmentation accuracy. They also mentioned that machine learning approaches together with character-based features are more promising for solving the OOV problem than rule-based methods. Our analysis indicates that learning from the training corpus alone can hardly solve the errors mentioned above. Exploring other sources of knowledge is essential for further improvement. One potential way to acquire such knowledge is to use a language model that is trained on a large-scale corpus (Peters et al., 2018). We leave this to future investigation.
Unfortunately, a third (34 out of 104) of the errors we looked at were due to annotation inconsistency. For example, 建筑系 (Department of Architecture) is annotated once as 建筑 (Architecture) + 系 (Department) and once as 建筑系 in exactly the same context, 建筑系教授喻肇青 (Zhaoqing Yu, professor of Architecture). 高新技术 (advanced technology) is annotated as 高 (advanced) + 新 (new) + 技术 (technology) 37 times, and as 高新 (advanced and new) + 技术 (technology) 19 times.
In order to augment the manual verification we performed above, we also wrote a script to automatically find inconsistent annotations in the data. Since this is an automatic script, it cannot distinguish between genuine ambiguity and inconsistent annotations. The heuristic we use is the following: for all word bigrams in the training data, we check whether they also occur as single words or word trigrams. We ignore the dominant analysis, count the number of occurrences of the less frequent analyses, and report this number as a fraction of the number of tokens in the corpus. Table 8 shows the results of running the script. We see that the AS corpus is the least consistent (according to this heuristic) while MSR is the most consistent. This might explain why both our system and prior work have relatively low performance on AS even though it has the largest training set. By contrast, results are much stronger on MSR, and this might be in part because it is more consistently annotated. The ordering of corpora by inconsistency roughly mirrors their ordering by accuracy.
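One possible reading of this heuristic is sketched below; it is not the authors' script. Every word unigram, bigram, and trigram is grouped by its concatenated surface string, and for each string that has a bigram analysis plus at least one competing analysis, all occurrences outside the dominant analysis are counted. Two simplifications are assumed here: competing bigram splits of the same string are also counted, and overlapping n-grams from the same inconsistency can be counted more than once.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def inconsistency_rate(corpus: List[List[str]]) -> float:
    """Occurrences of non-dominant analyses as a fraction of corpus tokens.

    `corpus` is a list of segmented sentences (lists of words).
    """
    analyses: Dict[str, Counter] = defaultdict(Counter)
    n_tokens = 0
    for sent in corpus:
        n_tokens += len(sent)
        for n in (1, 2, 3):  # single words, word bigrams, word trigrams
            for i in range(len(sent) - n + 1):
                gram = tuple(sent[i:i + n])
                analyses["".join(gram)][gram] += 1

    minority = 0
    for counts in analyses.values():
        has_bigram = any(len(analysis) == 2 for analysis in counts)
        if not has_bigram or len(counts) < 2:
            continue  # no bigram analysis, or nothing competes with it
        # Ignore the dominant analysis; count everything else.
        minority += sum(counts.values()) - max(counts.values())
    return minority / n_tokens if n_tokens else 0.0


if __name__ == "__main__":
    toy = [["建筑", "系", "教授"], ["建筑系", "教授"]]
    print(inconsistency_rate(toy))
```

As with the script described above, such a count cannot separate genuine ambiguity from annotation inconsistency, so it is only a rough indicator.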

Conclusion
In this work, we showed that further research in Chinese segmentation must overcome two key challenges: (1) rigorous tuning and testing of deep learning architectures, and (2) exploring additional resources for further performance gains.