Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning

Ambiguous annotation criteria lead to divergence among Chinese Word Segmentation (CWS) datasets, which differ in granularity. Multi-criteria learning leverages the annotation style of each individual dataset while mining the basic knowledge they share. In this paper, we propose a domain-adaptive segmenter that captures the diverse criteria of these datasets. Our model is based on Bidirectional Encoder Representations from Transformers (BERT), which introduces external knowledge. We also optimize its computational efficiency via model pruning, quantization, and compiler optimization. Experiments show that our segmenter outperforms previous results on 10 CWS datasets and is faster than the previous state-of-the-art Bi-LSTM-CRF model.


Introduction
Chinese Word Segmentation (CWS) is regarded as a low-level task in NLP. Unlike languages that place spaces between words, such as English and French, Chinese is a type of polysynthetic language in which compounds are developed from indigenous morphemes (Jernudd and Shapiro, 2011; Gong et al., 2017). The ambiguous distinction between morphemes and compound words leads to divergent notions of what a word is. Labeled datasets therefore diverge seriously, because inconsistent annotation results in compounds of multiple granularities. In practice, a segmenter usually provides multiple granularities and is configured according to the needs of high-level tasks. Fine-grained words help reduce the vocabulary size and relieve sparseness. Coarse-grained words, on the other hand, allow exact matching and are easier to analyze. A multi-criteria model can provide the flexibility this demands.
In recent years, several multi-criteria learning methods for CWS have been proposed to exploit the common knowledge of heterogeneous datasets by using information across all corpora, which can mutually boost out-of-vocabulary (OOV) recall (Qiu et al., 2013; Chao et al., 2015; Liu et al., 2016; Chen et al., 2017). Two limitations remain. First, although heterogeneous corpora can help each other, the combined datasets are still not large enough to provide adequate linguistic knowledge. Second, standard recurrent networks, including LSTMs, are limited in decoding speed even on cutting-edge hardware, because the computation of states cannot be parallelized.
In this paper, we propose a multi-criteria method for CWS. Our model uses a domain projection layer to adapt to multiple datasets with various granularities. We adopt the pre-trained bidirectional transformer encoder BERT (Vaswani et al., 2017; Devlin et al., 2018) to introduce external knowledge. BERT can be regarded as a contextual representation model, and it has achieved great success in several NLU tasks (Reddy et al., 2018; Rajpurkar et al., 2018). However, both fine-tuning and inference with the provided models are computationally inefficient because of their large number of parameters.
The main advantages of our method are its scalability and simplicity: it provides a trade-off between accuracy and decoding speed. The number of layers can be adjusted flexibly according to sentence length. We mainly use three techniques to improve scalability: layer-level pruning, quantization, and compiler optimization. Our method not only significantly outperforms the SOTA results on 10 CWS datasets but is also faster than the previous SOTA Bi-LSTM-CRF models (Ma et al., 2018; Xinchi et al., 2017; Yang et al., 2018).

Model Description
Figure 1 summarizes the proposed model architecture, which includes a feature extraction layer, a domain projection layer, and an inference layer.

BERT for Feature Extraction
As shown in Figure 1, we employ BERT to extract features from the input sequence. Characters are first mapped to embedding vectors and then pass through several transformer blocks. Compared with a Bi-LSTM, which processes the sequence step by step, the transformer learns features for all steps in parallel, so decoding can be accelerated. However, the twelve transformer layers of the original BERT are too heavy for CWS. To balance computational cost and segmentation accuracy, we prune the layers of BERT and fine-tune on our datasets. BERT is pre-trained on a large corpus to capture semantic features and abundant knowledge, which is critically important for word segmentation. To speed up both fine-tuning and inference, we apply the further optimizations discussed in Section 2.4.

Domain Projection for Multi-Criteria Learning
Inspired by previous work (Chen et al., 2017; Peng and Dredze, 2017), we propose a domain projection layer that enables our model to adapt to datasets with diverse criteria. The domain projection layer helps capture the heterogeneous segmentation criteria of each dataset; Section 3.5 shows several examples that illustrate this. Many variations of the projection layer are possible; in this paper we use a linear transformation, which is simple but effective for this task. As shown in Figure 1, an extra shared projection layer learns the common knowledge across datasets.
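The combination of a dataset-specific linear projection with a shared one can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`linear`, `make_linear`, `project`), the layer sizes, and the randomly initialised weights are all hypothetical, and the concatenation of the two outputs follows the description in the Tag Inference section.

```python
import random

def linear(x, weight, bias):
    """Apply y = W x + b to one feature vector stored as a plain list."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def make_linear(in_dim, out_dim, rng):
    """Random (weight, bias) pair standing in for trained parameters."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias

def project(features, dataset, domain_layers, shared_layer):
    """Concatenate the dataset-specific and shared projections for each character."""
    dw, db = domain_layers[dataset]
    sw, sb = shared_layer
    return [linear(h, dw, db) + linear(h, sw, sb) for h in features]

rng = random.Random(0)
hidden, proj = 8, 4  # illustrative sizes only
domain_layers = {name: make_linear(hidden, proj, rng) for name in ("PKU", "MSR")}
shared_layer = make_linear(hidden, proj, rng)

# Three characters, each with a hidden-size feature vector from the encoder.
features = [[rng.random() for _ in range(hidden)] for _ in range(3)]
out_pku = project(features, "PKU", domain_layers, shared_layer)
out_msr = project(features, "MSR", domain_layers, shared_layer)
```

For the same input, the shared half of the output is identical regardless of the dataset, while the domain half depends on which dataset's projection is selected.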

Tag Inference
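The outputs of the domain-specific projection and the shared projection are concatenated and then fed into a conditional random field (CRF) layer (Lafferty et al., 2001). In the CRF layer, the probability of a possible label sequence is formalized as:

$$P(Y \mid X) = \frac{\exp\left(\sum_{i} \left(s(X, i)_{y_i} + b_{y_{i-1} y_i}\right)\right)}{\sum_{Y'} \exp\left(\sum_{i} \left(s(X, i)_{y'_i} + b_{y'_{i-1} y'_i}\right)\right)} \tag{1}$$

where $y \in \{B, M, E, S\}$ is the label, the score function $s(X, i)_{y_i}$ is the output of the projection layer at the $i$-th character, and $b_{y_{i-1} y_i}$ are trainable parameters. By solving Eq. 2 we obtain the optimal tag sequence:

$$Y^{*} = \arg\max_{Y} P(Y \mid X) \tag{2}$$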
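Finding the optimal tag sequence in a linear-chain CRF is commonly done with the Viterbi algorithm. The sketch below assumes per-character emission scores and a tag-to-tag transition table; the toy scores and the hand-set transition penalties are illustrative stand-ins for trained parameters, and the paper does not state which decoding routine it uses.

```python
TAGS = ["B", "M", "E", "S"]

def viterbi(scores, transitions):
    """Return the highest-scoring tag sequence.

    scores[i][t]      : emission score for character i and tag index t
    transitions[p][t] : transition score from tag index p to tag index t
    """
    k = len(TAGS)
    best = [list(scores[0])]  # best[i][t]: best score of a path ending in tag t at i
    back = []                 # back[i-1][t]: previous tag on that best path
    for i in range(1, len(scores)):
        row, ptr = [], []
        for t in range(k):
            cand = [best[-1][p] + transitions[p][t] for p in range(k)]
            p_best = max(range(k), key=cand.__getitem__)
            row.append(cand[p_best] + scores[i][t])
            ptr.append(p_best)
        best.append(row)
        back.append(ptr)
    # Trace back from the best final tag.
    t = max(range(k), key=best[-1].__getitem__)
    path = [t]
    for ptr in reversed(back):
        t = ptr[t]
        path.append(t)
    return [TAGS[t] for t in reversed(path)]

# Hand-set transitions: legal BMES moves cost nothing, illegal ones are penalised.
LEGAL = {("B", "M"), ("B", "E"), ("M", "M"), ("M", "E"),
         ("E", "B"), ("E", "S"), ("S", "B"), ("S", "S")}
transitions = [[0.0 if (p, t) in LEGAL else -1e4 for t in TAGS] for p in TAGS]

# Toy emissions for three characters; a per-character argmax would give B, B, E.
scores = [[2.0, 0.0, 0.0, 1.0],
          [1.5, 1.0, 0.0, 0.0],
          [0.0, 0.0, 2.0, 0.0]]
print(viterbi(scores, transitions))  # ['B', 'M', 'E']
```

The example shows why sequence-level decoding matters: the transition scores rule out the illegal B to B move that independent per-character decisions would produce.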

Speed Optimization
Neural CWS models improve performance by increasing model complexity, which harms decoding speed and limits their real-world application. To bridge this gap, we apply the following model acceleration techniques.
Pruning. Many parameters in deep networks are unimportant or unnecessary, so pruning methods can remove them and increase model sparsity (Han et al., 2015). Pruning methods can be categorized by granularity into fine-grained, kernel-level, filter-level, and layer-level pruning. The acceleration rate increases as the pruning granularity becomes coarser. In addition, fine-grained pruning usually assumes that the underlying hardware provides mechanisms for compressing sparse tensors and accelerating sparse tensor computation (Zhu et al., 2017), which is impractical on most current hardware platforms. To maximize the benefit of pruning in real-life applications, we perform layer-level pruning on the transformer blocks in BERT.
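Layer-level pruning can be pictured as truncating the stack of transformer blocks. The sketch below assumes that the bottom blocks are the ones kept, which the text does not state explicitly; `prune_layers`, `encode`, and the stand-in blocks are hypothetical names used only for illustration.

```python
def prune_layers(layers, keep):
    """Layer-level pruning: keep only the bottom `keep` blocks of the stack."""
    if not 1 <= keep <= len(layers):
        raise ValueError("keep must be between 1 and the number of layers")
    return layers[:keep]

def encode(x, layers):
    """Run the input through each remaining block in order."""
    for layer in layers:
        x = layer(x)
    return x

# Stand-ins for 12 pre-trained transformer blocks; each one records its index.
full_model = [lambda x, i=i: x + [i] for i in range(12)]
pruned = prune_layers(full_model, keep=3)
trace = encode([], pruned)  # which blocks actually ran
```

Because whole blocks are removed, the pruned model remains a dense network and needs no sparse-tensor support from the hardware.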
Quantization. Quantization methods have also been investigated for network acceleration. We apply reduced-precision quantization (Gupta et al., 2015) to leverage the features of NVIDIA's Volta architecture. Specifically, the kernels of the multi-head attention layers and feed-forward layers use half precision (FP16), while the remaining parameters, such as embedding and normalization parameters, use full precision (FP32). Quantization not only accelerates computation but also reduces the model size.
Compiler Optimization. XLA is a domain-specific compiler for linear algebra that optimizes TensorFlow (Abadi et al., 2015) computations. With XLA, graphs are compiled into machine instructions and low-level ops are fused to improve execution speed. For example, in the transformer computation graph a batch matmul is always followed by a transpose operation. Fusing these two operations means the intermediate result does not need to be written back to memory, which reduces redundant memory access time and kernel launch overhead.
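The mixed-precision policy described above can be illustrated with a small sketch. The parameter names and the name-matching rule in `precision_for` are hypothetical, and the FP16 rounding is emulated with Python's `struct` module rather than performed on a GPU; the sketch only shows which tensors would be stored in which precision and how that affects model size.

```python
import struct

def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

def precision_for(name):
    """Hypothetical rule: attention and feed-forward kernels in FP16, the rest in FP32."""
    is_kernel = name.endswith("/kernel") and ("attention" in name or "ffn" in name)
    return "fp16" if is_kernel else "fp32"

def quantize(params):
    """Return (parameters after rounding, total storage size in bytes)."""
    out, size = {}, 0
    for name, values in params.items():
        if precision_for(name) == "fp16":
            out[name] = [to_fp16(v) for v in values]
            size += 2 * len(values)   # 2 bytes per FP16 value
        else:
            out[name] = list(values)  # left unchanged in this sketch
            size += 4 * len(values)   # 4 bytes per FP32 value
    return out, size

params = {
    "layer_0/attention/query/kernel": [0.1, -0.2, 0.3333],
    "layer_0/ffn/dense/kernel": [1.0, 2.5],
    "embeddings/word_embeddings": [0.1, 0.2],
    "layer_0/layer_norm/gamma": [1.0],
}
quantized, size_bytes = quantize(params)
```

In this toy example the five kernel values take 10 bytes and the three remaining values take 12 bytes, compared with 32 bytes if everything were stored in FP32.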

Experimental Settings
All our experiments are run on hardware with an Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz and an NVIDIA Tesla V100.
Datasets. We evaluate our model on ten standard Chinese word segmentation datasets: MSR, PKU, AS, and CITYU from the SIGHAN 2005 bake-off task (Emerson, 2005); SXU from the SIGHAN 2008 bake-off task (MOE, 2008); Chinese Penn Treebank 6.0 (CTB6) from Xue et al. (2005); as well as the UD, CNC, WTB, and ZX datasets.
Preprocessing. AS and CITYU are mapped from traditional Chinese to simplified Chinese before segmentation. Runs of consecutive English characters and runs of consecutive digits are each replaced with a unique token. Full-width tokens are converted to half-width to handle mismatches between the training and test sets.
Hyperparameters. We use one domain projection layer and a maximum sequence length of 128. During fine-tuning, we use Adam with a learning rate of 2e-5, L2 weight decay of 0.01, and a dropout probability of 0.1.
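Two of the preprocessing steps, full-width to half-width conversion and replacement of English and digit runs, can be sketched as below. The placeholder tokens `<ENG>` and `<NUM>` are assumptions, since the paper does not name the tokens it uses, and the traditional-to-simplified mapping is omitted because it requires an external character table.

```python
import re

# Hypothetical placeholder tokens; the actual tokens are not given in the paper.
ENG_TOKEN = "<ENG>"
NUM_TOKEN = "<NUM>"

_RUN = re.compile(r"[A-Za-z]+|[0-9]+")

def to_half_width(text):
    """Map full-width ASCII variants (U+FF01..U+FF5E) and U+3000 to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:               # ideographic space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:   # full-width ASCII block
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)

def replace_runs(text):
    """Replace each run of ASCII letters or ASCII digits with one token."""
    return _RUN.sub(
        lambda m: NUM_TOKEN if m.group()[0].isdigit() else ENG_TOKEN, text
    )

def preprocess(text):
    """Normalise width first so that full-width letters and digits are also caught."""
    return replace_runs(to_half_width(text))

print(preprocess("ＩＢＭ在２０１９年发布"))  # <ENG>在<NUM>年发布
```

Width normalisation runs before token replacement so that full-width letters and digits are treated the same way as their half-width counterparts.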

Main Results
We vary the number of transformer layers from 12 down to 1 and find that, compared with 12 layers, the average F-score with 3 layers drops only slightly, from 97.1% to 96.8%, as shown in Table 3. To balance segmentation speed and accuracy, we therefore prune the model to 3 layers. The performance of our model and of recent neural CWS models is shown in Table 1. Our model outperforms prior work on all 10 datasets, with error reductions of 8.1%, 4.5%, 10.5%, 14.3%, 27.3%, 25.0%, 12.9%, 6.6%, 28.1%, and 30.2% on the PKU, MSR, AS, CITYU, CTB6, SXU, UD, CNC, WTB, and ZX datasets, respectively. Among these datasets, SXU, UD, WTB, and ZX are relatively small, but they achieve large error reductions thanks to the shared feature extraction layer. When half precision (FP16) is additionally applied, the accuracy reduction is minor and the model still outperforms previous SOTA results on all 10 datasets.
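The paper does not spell out how error reduction is computed. A common definition, assumed here, treats 100 minus the F-score as the error and reports the relative decrease; the function name and the input values below are purely illustrative and are not results from the paper.

```python
def error_reduction(f_prev, f_new):
    """Relative error reduction (%) when the F-score rises from f_prev to f_new."""
    prev_err = 100.0 - f_prev
    new_err = 100.0 - f_new
    return 100.0 * (prev_err - new_err) / prev_err

# Hypothetical scores: moving from 96.0 to 97.0 removes a quarter of the error.
print(error_reduction(96.0, 97.0))  # 25.0
```

Under this definition, a one-point gain matters more on a dataset where the previous score was already high, which is why error reduction is often reported alongside raw F-scores.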

Multi-Criteria Learning Improves OOV Recall
Previous work (Huang and Zhao, 2007; Ma et al., 2018) pointed out that OOV words are a major source of error and that exploring further sources of knowledge is essential to solving this problem. From a certain point of view, datasets are complementary, since a word that is OOV in one dataset may appear in others. To exploit this shared knowledge and improve OOV recall, our model performs multi-criteria learning with the domain projection layer. To evaluate this, we also train the proposed model on each dataset separately, i.e., single-criteria learning. Table 2 shows that, compared with single-criteria learning, multi-criteria learning improves OOV recall on all 10 datasets.
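OOV recall is usually measured as the fraction of gold-standard words absent from the training vocabulary that the segmenter recovers with exactly the right boundaries. The sketch below follows that usual definition, which is assumed rather than taken from the paper; the function names and the toy sentences are hypothetical.

```python
def to_spans(words):
    """Turn a word list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans

def oov_recall(gold_sents, pred_sents, train_vocab):
    """Share of gold words missing from train_vocab that appear in the prediction."""
    hit = total = 0
    for gold, pred in zip(gold_sents, pred_sents):
        pred_spans = to_spans(pred)
        start = 0
        for w in gold:
            span = (start, start + len(w))
            start += len(w)
            if w not in train_vocab:
                total += 1
                hit += span in pred_spans
    return hit / total if total else 0.0

train_vocab = {"我", "喜欢", "北京"}
gold = [["我", "喜欢", "天安门"]]
pred_wrong = [["我", "喜欢", "天安", "门"]]
pred_right = [["我", "喜欢", "天安门"]]
print(oov_recall(gold, pred_wrong, train_vocab))  # 0.0
print(oov_recall(gold, pred_right, train_vocab))  # 1.0
```

Comparing spans rather than word strings ensures that a word counts as recovered only when it is found at the correct position in the sentence.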

Scalability
Decoding speed is essential in practice, since word segmentation is fundamental to many downstream NLP tasks. Previous neural CWS models (Ma et al., 2018; Xinchi et al., 2017; Yang et al., 2018; Gong et al., 2018) use Bi-LSTMs with concatenated embedding sizes of 100, 100, 128, and 100, respectively. For a fair comparison, we set the Bi-LSTM embedding size and hidden size to 100 and use one hidden layer with a CRF on top. Figure 2 shows decoding speed with respect to batch size. Our model with the original 12-layer BERT is slower than the Bi-LSTM.
On the other hand, speed can be increased through layer-level pruning, weight quantization, and compiler optimization. Combining all three techniques, our model outperforms the Bi-LSTM with a 1.6× to 2.6× speedup. Our model is also more scalable than the Bi-LSTM, which is limited in its ability to process very long sequences. By observing the sequence length distribution, we can search for an appropriate number of layers to balance F-score and decoding speed.

Case Study
Figure 3 shows three examples of segmentation results on all datasets: "下午五时" (five o'clock in the afternoon), "副局长" (deputy director), and "令人满意" (to make someone pleased). The segmentation granularity of these words differs according to the criteria of each dataset. With the help of the domain projection layer, our model correctly segments these words on each dataset. Without the domain projection layer, the segmentation results are unstable: for instance, "副局长" is segmented as "副局长" or "副/局长" on the same dataset in different contexts.

Conclusion
In this paper, we proposed a simple but effective Chinese Word Segmentation (CWS) method that employs BERT and adds a domain projection layer on top for multi-criteria learning. For practicality, acceleration techniques including pruning, quantization, and compiler optimization are applied to improve segmentation speed. Experiments show that our model achieves higher accuracy and faster prediction than the SOTA methods.

A Do we really need 12 transformer layers for word segmentation?
In the paper, we demonstrated that BERT with multi-criteria learning achieves superior accuracy and prediction speed in word segmentation. We use 3 transformer layers instead of the original 12 to balance F-score and decoding speed. Why do three layers appear to be the most cost-effective choice for CWS on the ten datasets?
Here we provide some complementary analysis.

A.1 Layer Attention
BERT has achieved great success in many NLU tasks by pre-training a stack of 12 transformer layers to learn abundant knowledge. Intuitively, the top layers capture high-level semantic features, while the bottom layers learn low-level features such as grammar. For word segmentation, high-level semantic features may have only a small impact (see Section A.2), so we investigate further to find the minimal number of transformer layers. We freeze the weights of every layer in the pre-trained BERT and fine-tune a layer attention on the word segmentation datasets. As shown in Figure 4, the attention score gradually decreases across the top layers, from layer 7 to layer 12, and the third layer receives the highest attention score. These results suggest that a three-layer model contains most of the information needed for word segmentation.
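The paper does not give the exact form of its layer attention. One common formulation, assumed here, learns a scalar logit per frozen layer and mixes the layer outputs with softmax-normalised weights; the function names and the toy vectors are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def layer_attention(layer_outputs, layer_logits):
    """Mix per-layer feature vectors using softmax-normalised scalar weights.

    layer_outputs[l] : feature vector of one character at frozen layer l
    layer_logits[l]  : trainable scalar for layer l (the only learned part here)
    """
    weights = softmax(layer_logits)
    dim = len(layer_outputs[0])
    mixed = [
        sum(w * out[d] for w, out in zip(weights, layer_outputs))
        for d in range(dim)
    ]
    return mixed, weights

# Toy example: three layers, two-dimensional features, untrained (equal) logits.
layer_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed, weights = layer_attention(layer_outputs, [0.0, 0.0, 0.0])
```

After fine-tuning, the learned weights can be read as a rough indication of how much each layer contributes to the task, which is the kind of per-layer score plotted in Figure 4.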

A.2 Self Attention
In the self-attention layers of the transformer, the attention score of each character is computed against all other characters. We average the attention scores at each index over sentences longer than 50 characters. As shown in Figure 5, characters near the current character receive larger weights than those far away. This result indicates that word segmentation depends more on grammar and that long-term dependencies are relatively unimportant. It suggests that CWS does not need to keep a long-term memory of the sequence. For the transformer, self-attention could therefore be limited to a fixed window size to reduce computation and accelerate the model. We leave this for future work.
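Limiting self-attention to a fixed window is left as future work in the paper. The following sketch only illustrates the idea with a boolean mask, under the assumption of a symmetric window around each position; the function names are hypothetical.

```python
def local_attention_mask(seq_len, window):
    """mask[i][j] is True when position i may attend to j, i.e. |i - j| <= window."""
    return [[abs(i - j) <= window for j in range(seq_len)] for i in range(seq_len)]

def attended_pairs(mask):
    """Count the (query, key) pairs that remain after masking."""
    return sum(sum(row) for row in mask)

mask = local_attention_mask(seq_len=6, window=1)
print(attended_pairs(mask))  # 16, compared with 36 for full attention
```

With a fixed window, the number of attended pairs grows linearly with sequence length instead of quadratically, which is where the computational saving would come from.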

Figure 1: An overview of our model architecture.

Figure 2: Character-level decoding speed with respect to batch size; the sequence length is 64.

Figure 3: Our model learns to segment with diverse criteria.

Figure 4: Distribution of layer attention score.
Figure 5 (panels): (a) current char index 10, (b) current char index 20, (c) current char index 30, (d) current char index 40.

Table 1: State-of-the-art performance on different datasets (F-score, %).

Table 3: Average precision, recall, and F-score on 10 datasets with different numbers of layers.