Neural Word Segmentation with Rich Pretraining

Neural word segmentation research has benefited from large-scale raw texts by leveraging them for pretraining character and word embeddings. On the other hand, statistical segmentation research has exploited richer sources of external information, such as punctuation, automatic segmentation and POS. We investigate the effectiveness of a range of external training sources for neural word segmentation by building a modular segmentation model and pretraining its most important submodule using rich external sources. Results show that such pretraining significantly improves the model, leading to accuracies competitive with the best methods on six benchmarks.


Introduction
There has been a recent shift of research attention in the word segmentation literature from statistical methods to deep learning (Zheng et al., 2013; Pei et al., 2014; Morita et al., 2015; Chen et al., 2015b; Cai and Zhao, 2016; Zhang et al., 2016b). Neural network models have been exploited due to their strength in non-sparse representation learning and non-linear power in feature combination, which have led to advances in many NLP tasks. So far, neural word segmentors have given accuracies comparable to the best statistical models.
With respect to non-sparse representation, character embeddings have been exploited as a foundation of neural word segmentors. They serve to reduce the sparsity of character n-grams, allowing, for example, "猫(cat) 躺(lie) 在(in) 墙角(corner)" to be connected with "狗(dog) 蹲(sit) 在(in) 墙角(corner)" (Zheng et al., 2013), which is infeasible using sparse one-hot character features. In addition to character embeddings, distributed representations of character bigrams (Pei et al., 2014) and words (Morita et al., 2015; Zhang et al., 2016b) have also been shown to improve segmentation accuracies.
* Equal contribution.
With respect to non-linear modeling power, various network structures have been exploited to represent contexts for segmentation disambiguation, including multi-layer perceptrons on five-character windows (Zheng et al., 2013; Pei et al., 2014; Chen et al., 2015a), as well as LSTMs on characters (Chen et al., 2015b; Xu and Sun, 2016) and words (Morita et al., 2015; Cai and Zhao, 2016; Zhang et al., 2016b). For structured learning and inference, CRF has been used for character sequence labelling models (Pei et al., 2014; Chen et al., 2015b) and structural beam search has been used for word-based segmentors (Cai and Zhao, 2016; Zhang et al., 2016b).
Previous research has shown that segmentation accuracies can be improved by pretraining character and word embeddings over large Chinese texts, which is consistent with findings on other NLP tasks, such as parsing (Andor et al., 2016). Pretraining can be regarded as one way of leveraging external resources to improve accuracies, which is highly useful in practice and has become standard practice in neural NLP. On the other hand, statistical segmentation research has exploited raw texts for semi-supervised learning, by collecting clues from raw texts more thoroughly, such as mutual information and punctuation (Li and Sun, 2009; Sun and Xu, 2011), and by making use of self-predictions (Wang et al., 2011; Liu and Zhang, 2012). It has also utilised heterogeneous annotations such as POS (Ng and Low, 2004; Zhang and Clark, 2008) and segmentation under different standards (Jiang et al., 2009). To our knowledge, such rich external information has not been systematically investigated for neural segmentation.
We fill this gap by investigating rich external pretraining for neural segmentation. Following Cai and Zhao (2016) and Zhang et al. (2016b), we adopt a globally optimised beam-search framework for neural structured prediction (Andor et al., 2016; Zhou et al., 2015; Wiseman and Rush, 2016), which allows word information to be modelled explicitly. Different from previous work, we make our model conceptually simple and modular, so that the most important submodule, namely a five-character window context, can be pretrained using external data. We adopt a multi-task learning strategy (Collobert et al., 2011), casting each external source of information as an auxiliary classification task, with all tasks sharing a five-character window network. After pretraining, the character window network is used to initialize the corresponding module in our segmentor.
Results on 6 different benchmarks show that our method outperforms the best statistical and neural segmentation models consistently, giving the best reported results on 5 datasets in different domains and genres. Our implementation is based on LibN3L¹ (Zhang et al., 2016a). Code and models can be downloaded from http://gitHub.com/jiesutd/RichWordSegmentor

Related Work
Work on statistical word segmentation dates back to the 1990s (Sproat et al., 1996). State-of-the-art approaches include character sequence labeling models (Xue et al., 2003) using CRFs (Peng et al., 2004; Zhao et al., 2006) and max-margin structured models leveraging word features (Zhang and Clark, 2007; Sun, 2010). Semi-supervised methods have been applied to both character-based and word-based models, exploring external training data for better segmentation (Sun and Xu, 2011; Wang et al., 2011; Liu and Zhang, 2012; Zhang et al., 2013). Our work belongs to recent neural word segmentation.
¹ https://github.com/SUTDNLP/LibN3L
To our knowledge, there has been no work in the literature systematically investigating rich external resources for neural word segmentation training. Closest in spirit to our work, Sun and Xu (2011) empirically studied the use of various external resources for enhancing a statistical segmentor, including character mutual information, access variety information, punctuation and other statistical information. Their baseline is similar to ours in the sense that both character and word contexts are considered. On the other hand, their model is statistical while ours is neural. Consequently, they integrate external knowledge as features, while we integrate it by shared network parameters. Our results show a similar degree of error reduction compared to theirs by using external data.
Our model inherits from previous findings on context representations, such as character windows (Pei et al., 2014; Chen et al., 2015a) and LSTMs (Chen et al., 2015b; Xu and Sun, 2016). Similar to Zhang et al. (2016b) and Cai and Zhao (2016), we use word context on top of character context. However, words play a relatively less important role in our model, and we find that a word LSTM, which has been used by all previous neural segmentation work, is unnecessary for our model. Our model is conceptually simpler and more modularised compared with Zhang et al. (2016b) and Cai and Zhao (2016), allowing a central submodule, namely a five-character context window, to be pretrained.
Figure 1: Overall model.

Model
Our segmentor works incrementally from left to right, as illustrated by the example in Table 1. At each step, the state consists of a sequence of words that have been fully recognized, denoted as $W = [w_{-k}, w_{-k+1}, \ldots, w_{-1}]$, a current partially recognized word $P$, and a sequence of next incoming characters, denoted as $C = [c_0, c_1, \ldots, c_m]$, as shown in Figure 1. Given an input sentence, $W$ and $P$ are initialized to $[\,]$ and $\phi$, respectively, and $C$ contains all the input characters. At each step, a decision is made on $c_0$, either appending it as a part of $P$, or separating it as the beginning of a new word. The incremental process repeats until $C$ is empty and $P$ is null again ($C = [\,]$, $P = \phi$). Formally, the process can be regarded as a state-transition process, where a state is a tuple $S = \langle W, P, C \rangle$, and the transition actions include SEP (separate) and APP (append), as shown by the deduction system in Figure 2. In the figure, $V$ denotes the score of a state, given by a neural network model. The score of the initial state (i.e. the axiom) is 0, and the score of a non-axiom state is the sum of the scores of all incremental decisions resulting in the state. Similar to Zhang et al. (2016b) and Cai and Zhao (2016), our model is a global structural model, using the overall score to disambiguate states, which correspond to sequences of inter-dependent transition actions.
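The state-transition process described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names (`initial_state`, `apply_action`, `finish`, `segment`) are hypothetical, and the final step that moves the last partial word into $W$ is an assumption about how the process terminates, since the deduction system of Figure 2 is not reproduced here.

```python
from typing import List, Tuple

# A state is the tuple <W, P, C>: recognized words, partial word, incoming characters.
State = Tuple[Tuple[str, ...], str, str]

SEP, APP = "SEP", "APP"


def initial_state(sentence: str) -> State:
    """W = [], P = empty, C = all input characters."""
    return ((), "", sentence)


def apply_action(state: State, action: str) -> State:
    """Consume c0 with SEP (start a new word) or APP (extend the partial word)."""
    words, partial, chars = state
    if not chars:
        raise ValueError("no incoming characters left")
    c0, rest = chars[0], chars[1:]
    if action == SEP:
        # The old partial word (if any) is complete; c0 begins a new one.
        if partial:
            words = words + (partial,)
        return (words, c0, rest)
    if action == APP:
        if not partial:
            raise ValueError("APP needs a non-empty partial word")
        return (words, partial + c0, rest)
    raise ValueError(f"unknown action: {action}")


def finish(state: State) -> List[str]:
    """Assumed final step: move the last partial word into W once C is empty."""
    words, partial, chars = state
    if chars:
        raise ValueError("input not fully consumed")
    return list(words) + ([partial] if partial else [])


def segment(sentence: str, actions: List[str]) -> List[str]:
    """Apply a full action sequence to a sentence and return the words."""
    state = initial_state(sentence)
    for action in actions:
        state = apply_action(state, action)
    return finish(state)
```

For the example sentence from the introduction, the action sequence SEP, SEP, SEP, SEP, APP yields the words 猫, 躺, 在 and 墙角.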
The structure of our scoring network, which differs from previous work, is shown in Figure 1. It consists of three main layers. On the bottom is a representation layer, which derives dense representations $X_W$, $X_P$ and $X_C$ for $W$, $P$ and $C$, respectively. We compare various distributed representations and neural network structures for learning $X_W$, $X_P$ and $X_C$, detailed in Section 3.1. On top of the representation layer, we use a hidden layer to merge $X_W$, $X_P$ and $X_C$ into a single vector $h$ (Equation 1). The hidden feature vector $h$ is used to represent the state $S = \langle W, P, C \rangle$ for calculating the scores of the next action. In particular, a linear output layer with two nodes is employed, producing an output vector $o$. The first and second nodes of $o$ represent the scores of SEP and APP given $S$, namely Score($S$, SEP) and Score($S$, APP), respectively.
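As a rough illustration of the three-layer scoring network, the sketch below merges the three representations with one hidden layer and scores the two actions with a linear output layer. Several details are assumptions rather than facts from the text: the representations are merged by concatenation, the hidden activation is tanh, and the parameter names (`W_h`, `b_h`, `W_o`, `b_o`) and the function `score_actions` are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def linear(weights: Matrix, vector: Vector, bias: Vector) -> Vector:
    """Compute weights * vector + bias with plain Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def score_actions(x_w: Vector, x_p: Vector, x_c: Vector,
                  params: Dict[str, list]) -> Dict[str, float]:
    """Score SEP and APP for a state represented by (X_W, X_P, X_C)."""
    merged = x_w + x_p + x_c  # assumption: merge by concatenation
    # Assumption: tanh hidden activation; the text only says "a hidden layer".
    h = [math.tanh(v) for v in linear(params["W_h"], merged, params["b_h"])]
    # Linear output layer with two nodes: first is SEP, second is APP.
    o = linear(params["W_o"], h, params["b_o"])
    return {"SEP": o[0], "APP": o[1]}
```

The score of a state is then the running sum of the chosen action scores along the transition sequence that leads to it.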

Representation Learning
Characters. We investigate two different approaches to encoding incoming characters, namely a window approach and an LSTM approach. For the former, we follow prior methods (Xue et al., 2003; Pei et al., 2014), using a five-character window $[c_{-2}, c_{-1}, c_0, c_1, c_2]$ to represent incoming characters. As shown in Figure 3, a multi-layer perceptron (MLP) is employed to derive a five-character window vector $D_C$ from the single-character vector representations $V_{c_{-2}}, \ldots, V_{c_2}$. For the latter, we follow recent work (Chen et al., 2015b; Zhang et al., 2016b), using a bi-directional LSTM to encode the input character sequence. In particular, the bi-directional LSTM hidden vector of the next incoming character $c_0$, i.e. the concatenation of its forward and backward hidden states, is used to represent the incoming characters $[c_0, c_1, \ldots]$ given a state.
Intuitively, a five-character window provides a local context from which the meaning of the middle character can be better disambiguated. An LSTM, on the other hand, captures larger contexts, which can contain more useful clues for disambiguation but also irrelevant information. It is therefore interesting to investigate a combination of their strengths, by first deriving a locally disambiguated version of $c_0$, and then feeding it to an LSTM for a globally disambiguated representation.
With regard to the single-character vector representation $V_{c_i}$ ($i \in [-2, 2]$), we follow previous work and consider both the character embedding $e_c(c_i)$ and the character-bigram embedding $e_b(c_i, c_{i+1})$, investigating the effect of each on the accuracies. When both $e_c(c_i)$ and $e_b(c_i, c_{i+1})$ are utilized, the concatenated vector is taken as $V_{c_i}$.
Partial Word. We take a very simple approach to representing the partial word $P$, using the embedding vectors of its first and last characters, as well as the embedding of its length. Length embeddings are randomly initialized and then tuned in model training. $X_P$ has relatively little influence on the empirical segmentation accuracies.
Word. Similar to the character case, we investigate two different approaches to encoding recognized words, namely a window approach and an LSTM approach. For the former, we follow prior methods (Zhang and Clark, 2007; Sun, 2010), using the two-word window $[w_{-2}, w_{-1}]$ to represent recognized words. A hidden layer is employed to derive a two-word vector $X_W$ from the single word embeddings $e_w(w_{-2})$ and $e_w(w_{-1})$.
For the latter, we follow Zhang et al. (2016b) and Cai and Zhao (2016), using a uni-directional LSTM on words that have been recognized.

Pretraining
Neural network models for NLP benefit from pretraining of word/character embeddings, learning distributed semantic information from large raw texts for reducing sparsity. The three basic elements in our neural segmentor, namely characters, character bigrams and words, can all be pretrained over large unsegmented data. We pretrain the five-character window network in Figure 3 as a unit, learning the MLP parameters together with character and bigram embeddings. We consider four types of commonly explored external data to this end, all of which have been studied for statistical word segmentation, but not for neural network segmentors.
Figure 3: Shared character representation.
Raw Text. Although raw texts do not contain explicit word boundary information, statistics such as mutual information between consecutive characters can be useful features for guiding segmentation (Sun and Xu, 2011). For neural segmentation, these distributional statistics can be implicitly learned by pretraining character embeddings. We therefore consider a more explicit clue for pretraining our character window network, namely punctuation (Li and Sun, 2009). Punctuation can serve as a type of explicit markup (Spitkovsky et al., 2010), indicating that the two characters on its left and right belong to two different words. We leverage this source of information by extracting character five-grams excluding punctuation from raw sentences, using them as inputs to classify whether there is punctuation before the middle character. Denoting the resulting five-character window as $[c_{-2}, c_{-1}, c_0, c_1, c_2]$, the MLP in Figure 3 is used to derive its representation $D_C$, which is then fed to a softmax layer for binary classification:
$P(punc) = \mathrm{softmax}(W_{punc} \cdot D_C + b_{punc})$
Here $P(punc)$ indicates the probability of a punctuation mark existing before $c_0$. Standard backpropagation training of the MLP in Figure 3 can be done jointly with the training of $W_{punc}$ and $b_{punc}$. After such training, the embeddings $V_{c_i}$ and MLP values can be used to initialize the corresponding parameters for $D_C$ in the main segmentor, before its training.
Automatically Segmented Text. Large texts automatically segmented by a baseline segmentor can be used for self-training (Liu and Zhang, 2012) or deriving statistical features (Wang et al., 2011). We adopt a simple strategy, taking automatically segmented text as silver data to pretrain the five-character window network. Given $[c_{-2}, c_{-1}, c_0, c_1, c_2]$, $D_C$ is derived using the MLP in Figure 3, and then used to classify the segmentation of $c_0$ into B (beginning) / M (middle) / E (end) / S (single-character word) labels, using a softmax layer with model parameters $W_{silv}$ and $b_{silv}$. Training can be done in the same way as training with punctuation.
Heterogeneous Training Data. Multiple segmentation corpora exist for Chinese, with different segmentation granularities. There has been investigation on leveraging two corpora under different annotation standards to improve statistical segmentation (Jiang et al., 2009). We try to utilize heterogeneous treebanks by taking an external treebank as labeled data, training a B/M/E/S classifier for the character window network.
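The construction of pretraining instances for the punctuation and B/M/E/S tasks can be sketched as follows. This is a minimal illustration, not the released implementation: the punctuation set `PUNCT`, the padding symbol `PAD` used at sentence edges, and the function names are all assumptions introduced here.

```python
from typing import List, Tuple

PUNCT = set("，。！？；：、")  # assumed punctuation inventory
PAD = "<pad>"                  # assumed padding symbol for sentence edges

Window = Tuple[str, str, str, str, str]


def punctuation_instances(sentence: str) -> List[Tuple[Window, int]]:
    """Five-character windows over the sentence with punctuation removed.

    The label is 1 if a punctuation mark directly preceded the middle
    character in the raw sentence, and 0 otherwise.
    """
    chars: List[str] = []
    labels: List[int] = []
    pending = False  # True while the previous raw symbol was punctuation
    for ch in sentence:
        if ch in PUNCT:
            pending = True
        else:
            chars.append(ch)
            labels.append(1 if pending else 0)
            pending = False
    padded = [PAD, PAD] + chars + [PAD, PAD]
    # padded[i + 2] is chars[i], so padded[i:i + 5] is centred on chars[i].
    return [(tuple(padded[i:i + 5]), labels[i]) for i in range(len(chars))]


def bmes_labels(words: List[str]) -> List[str]:
    """Per-character B/M/E/S labels for a (silver or gold) segmented sentence."""
    labels: List[str] = []
    for word in words:
        if len(word) == 1:
            labels.append("S")
        else:
            labels.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return labels
```

Both functions produce one instance per character; each window would then be encoded by the shared MLP into $D_C$ and passed to the task-specific softmax layer.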
POS Data. Previous research has shown that POS information is closely related to segmentation (Ng and Low, 2004; Zhang and Clark, 2008). We verify the utility of POS information for our segmentor by pretraining a classifier that predicts the POS on each character, according to the character window representation $D_C$. In particular, given $[c_{-2}, c_{-1}, c_0, c_1, c_2]$, the POS of the word that $c_0$ belongs to is used as the output.
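A minimal sketch of how the per-character POS targets could be derived from a POS-tagged sentence is given below. The function name `pos_labels` and the example tags are hypothetical and only serve to illustrate the labeling scheme.

```python
from typing import List, Tuple


def pos_labels(tagged_words: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Pair every character with the POS tag of the word it belongs to."""
    return [(ch, pos) for word, pos in tagged_words for ch in word]
```

Each resulting (character, tag) pair corresponds to one classification instance whose input is the five-character window centred on that character.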
Multitask Learning. While each type of external training data can offer one source of segmentation information, different external data can be complementary to each other. We aim to inject all sources of information into the character window representation $D_C$ by using it as a shared representation for different classification tasks. Neural models have been shown to be capable of multi-task learning via parameter sharing (Collobert et al., 2011). As shown in Figure 3, in our case, the output layer for each task is independent, but the hidden layer $D_C$ and all layers below $D_C$ are shared.
Algorithm 1: Training.
For training with all sources above, we randomly sample sentences from the Punc./Autoseg/Heter./POS sources with a ratio of 10/1/1/1. For each sentence in the punctuation corpus, we take only 2 characters (the characters before and after the punctuation) as input instances.
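One way to realise the 10/1/1/1 sampling ratio is to draw the source of each training sentence with probability proportional to its weight. The sketch below assumes this reading; the task names, `TASK_RATIO`, and the helper functions are hypothetical.

```python
import random
from typing import Dict, List

# Assumed task identifiers, weighted as Punc./Autoseg/Heter./POS = 10/1/1/1.
TASK_RATIO: Dict[str, int] = {"punc": 10, "autoseg": 1, "heter": 1, "pos": 1}


def sample_task(rng: random.Random, ratio: Dict[str, int] = TASK_RATIO) -> str:
    """Pick the source of the next training sentence, proportional to the ratio."""
    names = list(ratio)
    weights = [ratio[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def task_schedule(num_steps: int, seed: int = 0) -> List[str]:
    """Sequence of task names deciding which output layer each step updates."""
    rng = random.Random(seed)
    return [sample_task(rng) for _ in range(num_steps)]
```

At every step only the sampled task's output layer receives gradients, while the shared window network below it is updated by all tasks.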

Decoding and Training
To train the main segmentor, we adopt the global transition-based learning and beam-search strategy of Zhang and Clark (2011). For decoding, standard beam search is used, where the B best partial output hypotheses at each step are maintained in an agenda. Initially, the agenda contains only the start state. At each step, all hypotheses in the agenda are expanded by applying all possible actions, and the B highest-scored resulting hypotheses are used as the agenda for the next step.
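The decoding procedure can be sketched in a few lines of Python. The scoring function here is a deliberately tiny stand-in: `toy_score` and its bigram lexicon are hypothetical and replace the neural scorer only for the sake of a runnable example, and the function names are assumptions.

```python
from typing import Callable, List, Tuple

Hypothesis = Tuple[float, Tuple[str, ...]]  # (total score, action sequence)
Scorer = Callable[[str, Tuple[str, ...], str], float]


def beam_search(sentence: str, score_action: Scorer, beam_size: int) -> Hypothesis:
    """Keep the beam_size best action sequences after each character."""
    agenda: List[Hypothesis] = [(0.0, ())]  # only the start state
    for i in range(len(sentence)):
        candidates: List[Hypothesis] = []
        for total, actions in agenda:
            for action in ("SEP", "APP"):
                if action == "APP" and i == 0:
                    continue  # the first character must start a word
                step = score_action(sentence, actions, action)
                candidates.append((total + step, actions + (action,)))
        candidates.sort(key=lambda hyp: hyp[0], reverse=True)
        agenda = candidates[:beam_size]
    return agenda[0]


def actions_to_words(sentence: str, actions: Tuple[str, ...]) -> List[str]:
    """Convert one action per character into the resulting word list."""
    words: List[str] = []
    for ch, action in zip(sentence, actions):
        if action == "SEP":
            words.append(ch)
        else:
            words[-1] += ch
    return words


def toy_score(sentence: str, history: Tuple[str, ...], action: str) -> float:
    """Hypothetical scorer: reward APP inside a known bigram, mildly reward SEP."""
    i = len(history)  # index of the character being decided
    if action == "APP":
        return 1.0 if sentence[i - 1:i + 1] in {"墙角"} else 0.0
    return 0.5
```

With `toy_score` and a beam size of 2, decoding 猫躺在墙角 recovers the segmentation 猫 / 躺 / 在 / 墙角.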
For training, the same decoding process is applied to each training example $(x_i, y_i)$. At step $j$, if the gold-standard sequence of transition actions $y^i_j$ falls out of the agenda, a max-margin update is performed by taking the current best hypothesis $\hat{y}_j$ in the beam as a negative example and $y^i_j$ as a positive example. In the loss function, $\delta(\hat{y}_j, y^i_j)$ is the number of incorrect local decisions in $\hat{y}_j$, and $\eta$ controls the score margin.
The strategy above is early update (Collins and Roark, 2004). On the other hand, if the gold-standard hypothesis does not fall out of the agenda until the full sentence has been segmented, a final update is made between the highest-scored non-gold-standard hypothesis $\hat{y}$ in the agenda and the gold-standard $y_i$, using exactly the same loss function. Pseudocode for the online learning algorithm is shown in Algorithm 1.
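The max-margin update with early stopping can be illustrated by the sketch below. The exact form of the loss is an assumption: it is written as a standard cost-augmented hinge loss that matches the description of $\delta$ and $\eta$ above, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Hypothesis = Tuple[float, Tuple[str, ...]]  # (total score, action sequence)


def count_wrong(pred_actions: Sequence[str], gold_actions: Sequence[str]) -> int:
    """delta: number of local decisions in the prediction that differ from gold."""
    return sum(1 for p, g in zip(pred_actions, gold_actions) if p != g)


def margin_loss(pred_score: float, gold_score: float,
                num_wrong: int, eta: float) -> float:
    """Assumed hinge form: the gold score should beat the prediction by eta * delta."""
    return max(0.0, pred_score - gold_score + eta * num_wrong)


def needs_early_update(agenda: List[Hypothesis],
                       gold_prefix: Tuple[str, ...]) -> bool:
    """True when the gold action prefix has fallen out of the agenda."""
    return all(actions != gold_prefix for _, actions in agenda)
```

During training, decoding stops at the first step where `needs_early_update` is true and the loss is computed between the best hypothesis in the agenda and the gold prefix; otherwise the same loss is applied once at the end of the sentence.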
We use Adagrad (Duchi et al., 2011) to optimize model parameters, with an initial learning rate $\alpha$. L2 regularization and dropout (Srivastava et al., 2014) on the input are used to reduce overfitting, with an L2 weight $\lambda$ and a dropout rate $p$. All the parameters in our model are randomly initialized to values in $(-r, r)$, where $r = \sqrt{6.0 / (fan_{in} + fan_{out})}$ (Bengio, 2012). We fine-tune character and character bigram embeddings, but not word embeddings, following Zhang et al. (2016b).
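The initialization range can be computed directly from the layer dimensions. The sketch below assumes a uniform distribution over the interval and uses hypothetical helper names.

```python
import math
import random
from typing import List


def init_range(fan_in: int, fan_out: int) -> float:
    """r = sqrt(6.0 / (fan_in + fan_out))."""
    return math.sqrt(6.0 / (fan_in + fan_out))


def init_matrix(fan_in: int, fan_out: int, rng: random.Random) -> List[List[float]]:
    """A fan_out x fan_in weight matrix drawn uniformly between -r and r (assumed)."""
    r = init_range(fan_in, fan_out)
    return [[rng.uniform(-r, r) for _ in range(fan_in)] for _ in range(fan_out)]
```

For instance, a layer with 3 inputs and 3 outputs gives $r = \sqrt{6/6} = 1$.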

Experimental Settings
Data. We use Chinese Treebank 6.0 (CTB6) (Xue et al., 2005) as our main dataset. Training, development and test set splits follow previous work (Zhang et al., 2014). In order to verify the robustness of our model, we additionally use the SIGHAN 2005 bake-off (Emerson, 2005) and the NLPCC 2016 shared task for Weibo segmentation (Qiu et al., 2016). For pretraining the embeddings of words, characters and character bigrams, we use Chinese Gigaword (simplified Chinese sections), automatically segmented using ZPar 0.6 off-the-shelf (Zhang and Clark, 2007), the statistics of which are shown in Table 3. For pretraining character representations, we extract punctuation classification data from the Gigaword corpus, and use the word-based ZPar and a standard character-based CRF model (Tseng et al., 2005) to obtain automatic segmentation results. We compare pretraining using ZPar results only and using results that both segmentors agree on. For the heterogeneous segmentation corpus and POS data, we use a People's Daily corpus of 5 months. Statistics are listed in Table 3.
Evaluation. The standard word precision, recall and F1 measure (Emerson, 2005) are used to evaluate segmentation performance.
Hyper-parameter Values. We adopt commonly used values for most hyper-parameters, but tune the sizes of hidden layers on the development set. The values are summarized in Table 2.
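Word-level precision, recall and F1, as used under Evaluation, compare predicted and gold words by their character spans. The following sketch shows one common way to compute them; the function names are hypothetical and the snippet is not the official scoring script.

```python
from typing import List, Set, Tuple


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Character offsets (start, end) of every word in a segmented sentence."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def prf(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Corpus-level word precision, recall and F1."""
    correct = gold_total = pred_total = 0
    for gold_words, pred_words in zip(gold, pred):
        gold_spans, pred_spans = word_spans(gold_words), word_spans(pred_words)
        correct += len(gold_spans & pred_spans)  # words with identical boundaries
        gold_total += len(gold_spans)
        pred_total += len(pred_spans)
    precision = correct / pred_total if pred_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1
```

If the gold segmentation is 猫 / 躺 / 在 / 墙角 and the prediction splits 墙角 into two words, three of the five predicted words are correct, giving a precision of 0.6 and a recall of 0.75.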

Development Experiments
We perform development experiments to verify the usefulness of various context representations, network configurations and different pretraining methods, respectively.

Context Representations
The influence of character and word context representations is empirically studied by varying the network structures for $X_C$ and $X_W$ in Figure 1, respectively. All the experiments in this section are performed using a beam size of 8.
Character Context. We fix the word representation $X_W$ to a 2-word window and compare different character context representations. The results are shown in Table 4, where "no char" represents our model without $X_C$, "5-char window" represents a five-character window context, "char LSTM" represents character LSTM context and "5-char window + LSTM" represents a combination, detailed in Section 3.1. "-char emb" and "-bichar emb" represent the combined window and LSTM context without character and character-bigram information, respectively.
Table 4: Influence of character contexts.
As can be seen from the table, without character information, the F-score is 84.62%, demonstrating the necessity of character contexts. Using window and LSTM representations, the F-scores increase to 95.41% and 95.51%, respectively. A combination of the two leads to further improvement, showing that local and global character contexts are indeed complementary, as hypothesized in Section 3.1. Finally, by removing character and character-bigram embeddings, the F-score decreases to 95.20% and 94.27%, respectively, which suggests that character bigrams are more useful than character unigrams. This is likely because they contain more distinct tokens and hence offer a larger parameter space.
Word Context. The influence of various word contexts is shown in Table 5. Without using word information, our segmentor gives an F-score of 95.66% on the development data. Using a context of only $w_{-1}$ (1-word window), the F-measure increases to 95.78%. This shows that word contexts are far less important in our model compared to character contexts, and also compared to word contexts in previous word-based segmentors (Zhang et al., 2016b; Cai and Zhao, 2016). This is likely due to the difference in our neural network structures, and to the fact that we fine-tune both character and character bigram embeddings, which significantly enlarges the adjustable parameter space as compared with Zhang et al. (2016b). The fact that word contexts contribute relatively less than characters in a word is also not surprising, in the sense that word-based neural segmentors do not outperform the best character-based models by large margins. Given that character context is what we pretrain, our model relies more heavily on it.
With both $w_{-2}$ and $w_{-1}$ being used for the context, the F-score further increases to 95.86%, showing that a 2-word window is useful by offering more contextual information. On the other hand, when $w_{-3}$ is also considered, the F-score does not improve further. This is consistent with previous findings in statistical word segmentation (Zhang and Clark, 2007), which adopts a 2-word context. Interestingly, using a word LSTM does not bring further improvements, even when it is combined with a window context. This suggests that global word contexts may not offer crucial additional information compared with local word contexts. Intuitively, words are significantly less polysemous than characters, and hence can serve as effective contexts even if used locally, to supplement a more crucial character context.

Structured Learning and Inference
We verify the effectiveness of structured learning and inference by measuring the influence of beam size on the baseline segmentor. Figure 4 shows the F-scores against different numbers of training iterations with beam sizes 1, 2, 4, 8 and 16, respectively. When the beam size is 1, the inference is local and greedy. As the size of the beam increases, more global structural ambiguities can be resolved, since learning is designed to guide search. A contrast between beam sizes 1 and 2 demonstrates the usefulness of structured learning and inference. As the beam size increases, the gain from doubling the beam size decreases. We choose a beam size of 8 for the remaining experiments as a tradeoff between speed and accuracy.
Table 6 shows the effectiveness of rich pretraining of $D_C$ on the development set. In particular, by using punctuation information, the F-score increases from 95.86% to 96.25%, with a relative error reduction of 9.4%. This is consistent with the observation of Sun and Xu (2011), who show that punctuation is more effective than mutual information and access variety as semi-supervised data for a statistical word segmentation model. With automatically segmented data, heterogeneous segmentation and POS information, the F-score increases to 96.26%, 96.27% and 96.22%, respectively, showing the relevance of all information sources to neural segmentation, which is consistent with observations made for statistical word segmentation (Jiang et al., 2009; Wang et al., 2011; Zhang et al., 2013). Finally, by integrating all the above information via multi-task learning, the F-score is further improved to 96.48%, with a 15.0% relative error reduction.

Comparison with Zhang et al. (2016b)
Both our model and that of Zhang et al. (2016b) use global learning and beam search, but our network is different. Zhang et al. (2016b) utilize the action history with an LSTM encoder, while we use partial word rather than action information. Besides, the character and character bigram embeddings are fine-tuned in our model, while Zhang et al. (2016b) keep the embeddings fixed during training. We study the F-measure distribution with respect to sentence length on our baseline model, our multitask pretraining model and Zhang et al. (2016b). In particular, we cluster the sentences in the development dataset into 6 categories based on their length and evaluate their F1 values, respectively. As shown in Figure 5, the models give different error distributions, with our models being more robust to sentence length than Zhang et al. (2016b). Their model is better on very short sentences, but worse in all other cases. This shows the relative advantages of our model.

Final Results
Our final results on CTB6 are shown in Table 7, which lists the results of several current state-of-the-art methods. Without multitask pretraining, our model gives an F-score of 95.44%, which is higher than the neural segmentor of Zhang et al. (2016b), which gives the best accuracies among pure neural segmentors on this dataset. By using multitask pretraining, the result increases to 96.21%, with a relative error reduction of 16.9%. In comparison, Sun and Xu (2011) investigated heterogeneous semi-supervised learning on a state-of-the-art statistical model, obtaining a relative error reduction of 13.8%. Our findings show that external data can be as useful for neural segmentation as for statistical segmentation.
Our final results compare favourably to the best statistical models, including those using semi-supervised learning (Sun and Xu, 2011; Wang et al., 2011), and those leveraging joint POS and syntactic information (Zhang et al., 2014). In addition, our model also outperforms the best neural models, in particular Zhang et al. (2016b)*, which is a hybrid neural and statistical model, integrating manual discrete features into their word-based neural model. We achieve the best reported F-score on this dataset. To our knowledge, this is the first time a pure neural network model outperforms all existing methods on this dataset, allowing the use of external data⁷. We also evaluate our model pretrained only on punctuation and auto-segmented data, which do not include additional manual labels. The results on CTB test data show accuracies of 95.8% and 95.7%, respectively, which are comparable with those of statistical semi-supervised methods (Sun and Xu, 2011; Wang et al., 2011). They are also among the top-performing methods in Table 7. Compared with discrete semi-supervised methods (Sun and Xu, 2011; Wang et al., 2011), our semi-supervised model is free from hand-crafted features.
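The relative error reductions quoted for CTB6 follow from treating 100 minus the F-score as the error. The short sketch below reproduces the arithmetic from the F-scores reported in the text; the function name is hypothetical.

```python
def relative_error_reduction(baseline_f: float, improved_f: float) -> float:
    """Percentage of the baseline error (100 - F) removed by the improved model."""
    return 100.0 * (improved_f - baseline_f) / (100.0 - baseline_f)


# Development set: punctuation pretraining, then full multitask pretraining.
dev_punc = round(relative_error_reduction(95.86, 96.25), 1)   # 9.4
dev_multi = round(relative_error_reduction(95.86, 96.48), 1)  # 15.0
# Test set: baseline versus multitask pretraining.
test_multi = round(relative_error_reduction(95.44, 96.21), 1)  # 16.9
```

For example, on the test set the error drops from 4.56 to 3.79 points, and 0.77 / 4.56 is approximately 16.9%.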
In addition to CTB6, which has been the most commonly adopted dataset in recent segmentation research, we additionally evaluate our results on the SIGHAN 2005 bakeoff and Weibo datasets, to examine cross-domain robustness. Different state-of-the-art methods for which results are recorded on these datasets are listed in Table 8. Most neural models reported results only on the PKU⁸ and MSR datasets of the bakeoff test sets, which are in simplified Chinese. The AS and CityU corpora are in traditional Chinese, sourced from Taiwan and Hong Kong corpora, respectively. We map them into simplified Chinese before segmentation. The Weibo corpus is in yet another genre, being social media text. Xia et al. (2016) achieved the best results on this dataset by using a statistical model with features learned using external lexicons, the CTB7 corpus and the People's Daily corpus. Similar to Table 7, our method gives the best accuracies on all corpora except for MSR, where it underperforms the hybrid model of Zhang et al. (2016b) by 0.2%. To our knowledge, we are the first to report results for a neural segmentor on more than 3 datasets, with consistently competitive results. This verifies that knowledge learned from a certain set of resources can be used to enhance cross-domain robustness in training a neural segmentor for different datasets, which is of practical importance.
⁷ We did not investigate the use of lexicons (Chen et al., 2015a,b) in our research, since lexicons might cover different OOV words in the training and test data, and hence directly affect the accuracies, which makes it relatively difficult to compare different methods fairly unless a single lexicon is used for all methods, as observed by Cai and Zhao (2016).
⁸ We notice that both the PKU dataset and our heterogeneous data are based on the news of People's Daily. However, the heterogeneous data only collect news from February 1998 to June 1998, and do not contain the sentences in the dev and test datasets of PKU.

Conclusion
We investigated rich external resources for enhancing neural word segmentation, by building a globally optimised beam-search model that leverages both character and word contexts. Taking each type of external resource as an auxiliary classification task, we use neural multi-task learning to pretrain a set of shared parameters for character contexts. Results show that rich pretraining leads to 15.4% relative error reduction, and our model gives results highly competitive with the best systems on six different benchmarks.