Adversarial Multi-Criteria Learning for Chinese Word Segmentation

Different linguistic perspectives lead to diverse segmentation criteria for Chinese word segmentation (CWS). Most existing methods focus on improving the performance for each single criterion. However, it is interesting to exploit these different criteria and mine their common underlying knowledge. In this paper, we propose adversarial multi-criteria learning for CWS by integrating shared knowledge from multiple heterogeneous segmentation criteria. Experiments on eight corpora with heterogeneous segmentation criteria show that the performance on each corpus obtains a significant improvement compared to single-criterion learning. The source code of this paper is available on GitHub.


Introduction
Chinese word segmentation (CWS) is a preliminary and important task for Chinese natural language processing (NLP). Currently, the state-of-the-art methods are based on statistical supervised learning algorithms and rely on large-scale annotated corpora, which are expensive to build. Although there have been great achievements in building CWS corpora, they are somewhat incompatible due to different segmentation criteria. As shown in Figure 1, given the sentence "YaoMing reaches the final", the two commonly used corpora, PKU's People's Daily (PKU) (Yu et al., 2001) and Penn Chinese Treebank (CTB) (Fei, 2000), use different segmentation criteria. In a sense, it is a waste of resources if we fail to fully exploit these corpora. Previous methods exploiting heterogeneous corpora adopted stacking or multi-task architectures and showed that heterogeneous corpora can help each other. However, most of these models adopt shallow linear classifiers with discrete features, which makes it difficult to design shared feature spaces and usually results in a complex model. Fortunately, recent deep neural models provide a convenient way to share information among multiple tasks (Collobert and Weston, 2008; Luong et al., 2015; Chen et al., 2016).
In this paper, we propose adversarial multi-criteria learning for CWS by integrating shared knowledge from multiple segmentation criteria. Specifically, we regard each segmentation criterion as a single task and propose three different shared-private models under the framework of multi-task learning (Caruana, 1997; Ben-David and Schuller, 2003), where a shared layer is used to extract criteria-invariant features and a private layer is used to extract criterion-specific features. Inspired by the success of adversarial strategies on domain adaptation (Ajakan et al., 2014; Ganin et al., 2016; Bousmalis et al., 2016), we further utilize an adversarial strategy to make sure the shared layer extracts the common underlying, criteria-invariant features that are suitable for all the criteria. Finally, we exploit eight segmentation criteria on five simplified Chinese and three traditional Chinese corpora. Experiments show that our models are effective in improving the performance of CWS. We also observe that traditional Chinese segmentation can benefit from incorporating knowledge from simplified Chinese.
The contributions of this paper could be summarized as follows.
• Multi-criteria learning is first introduced for CWS, in which we propose three shared-private models to integrate multiple segmentation criteria.
• An adversarial strategy is used to force the shared layer to learn criteria-invariant features, for which a new objective function is also proposed instead of the original cross-entropy loss.
General Neural Model for CWS

The general architecture of neural CWS can be characterized by three components: (1) a character embedding layer; (2) feature layers consisting of several classical neural networks; and (3) a tag inference layer. The role of the feature layers is to extract features, using either convolutional or recurrent neural networks. In this paper, we adopt bi-directional long short-term memory networks as the feature layers, followed by a CRF as the tag inference layer. Figure 2 illustrates the general architecture of CWS.

Embedding layer
In neural models, the first step is usually to map discrete language symbols to distributed embedding vectors. Formally, for each character x_i we look up its embedding vector e_{x_i} ∈ R^{d_e} from the embedding matrix, where d_e is a hyper-parameter indicating the size of the character embedding.
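As a minimal sketch of this lookup (in NumPy, with a toy five-character vocabulary and d_e = 4; all names and sizes here are illustrative, not the paper's implementation), the embedding layer is simply a row selection from the embedding matrix:

```python
import numpy as np

# Hypothetical toy vocabulary and embedding size d_e = 4.
vocab = {"姚": 0, "明": 1, "进": 2, "入": 3, "总": 4}
d_e = 4
rng = np.random.default_rng(0)
# Embedding matrix: one d_e-dimensional row per character.
E = rng.uniform(-0.05, 0.05, size=(len(vocab), d_e))

def embed(chars):
    """Map a character sequence to an (n, d_e) matrix of embedding vectors."""
    return E[[vocab[c] for c in chars]]

X = embed("姚明")
assert X.shape == (2, d_e)
```

In a trained model the matrix E would be a learned (or pre-trained) parameter rather than random.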

Feature layers
We adopt bi-directional long short-term memory (Bi-LSTM) as feature layers.
While there are numerous LSTM variants, here we use the LSTM architecture of Jozefowicz et al. (2015), which is similar to the architecture of Graves (2013) but without peephole connections.
LSTM. LSTM introduces a gate mechanism and a memory cell to maintain long-range dependency information and avoid vanishing gradients. Formally, an LSTM unit, with input gate i, output gate o, forget gate f and memory cell c, can be expressed as:

[i_i; o_i; f_i; c̃_i] = [σ; σ; σ; φ](W_g^T [e_{x_i}; h_{i-1}] + b_g),
c_i = c_{i-1} ⊙ f_i + c̃_i ⊙ i_i,
h_i = o_i ⊙ φ(c_i),

where W_g ∈ R^{(d_e+d_h)×4d_h} and b_g ∈ R^{4d_h} are trainable parameters, and d_h is a hyper-parameter indicating the hidden state size. The functions σ(·) and φ(·) are the sigmoid and tanh functions respectively.
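The gate computation above can be sketched in NumPy as a single affine map over the concatenated input and previous hidden state, split into the four gate pre-activations (toy dimensions; a sketch of the equations, not the paper's implementation):

```python
import numpy as np

d_e, d_h = 4, 3  # toy sizes; the paper uses d_e = d_h = 100

rng = np.random.default_rng(1)
# W_g in R^{(d_e + d_h) x 4 d_h} and b_g in R^{4 d_h}, as in the text.
W_g = rng.standard_normal((d_e + d_h, 4 * d_h)) * 0.1
b_g = np.zeros(4 * d_h)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(e_x, h_prev, c_prev):
    """One LSTM step: gates i, o, f and candidate cell from one affine map."""
    g = np.concatenate([e_x, h_prev]) @ W_g + b_g
    i, o, f, c_tilde = np.split(g, 4)
    i, o, f = sigmoid(i), sigmoid(o), sigmoid(f)
    c = c_prev * f + np.tanh(c_tilde) * i   # memory cell update
    h = o * np.tanh(c)                       # hidden state
    return h, c

h, c = lstm_step(np.ones(d_e), np.zeros(d_h), np.zeros(d_h))
assert h.shape == (d_h,) and np.all(np.abs(h) < 1.0)
```

Since h is the output gate times tanh of the cell, each of its components stays strictly inside (-1, 1).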
Bi-LSTM. To incorporate information from both sides of the sequence, we use a bi-directional LSTM (Bi-LSTM) with forward and backward directions. The update of each Bi-LSTM unit can be written as:

h_i = →h_i ⊕ ←h_i = BiLSTM(e_{x_i}, →h_{i-1}, ←h_{i+1}; θ),

where →h_i and ←h_i are the hidden states at position i of the forward and backward LSTMs respectively, ⊕ is the concatenation operation, and θ denotes all parameters of the Bi-LSTM model.

Inference Layer
After extracting features, we employ a conditional random field (CRF) (Lafferty et al., 2001) layer to infer tags. In the CRF layer, p(Y|X) in Eq. (1) can be formalized as:

p(Y|X) = Ψ(Y|X) / Σ_{Y'∈L^n} Ψ(Y'|X).

Here, Ψ(Y|X) is the potential function, and we only consider interactions between two successive labels (first-order linear chain CRF):

Ψ(Y|X) = ∏_{i=2}^{n} ψ(X, i, y_{i-1}, y_i),
ψ(x, i, y', y) = exp(s(X, i)_y + b_{y'y}),

where b_{y'y} ∈ R is a trainable parameter for the label pair (y', y). The score function s(X, i) ∈ R^{|L|} assigns a score to each label for tagging the i-th character:

s(X, i) = W_s^T h_i + b_s,

where h_i is the hidden state of the Bi-LSTM at position i, and W_s ∈ R^{d_h×|L|} and b_s ∈ R^{|L|} are trainable parameters.
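A toy NumPy sketch of this first-order chain scoring (illustrative sizes; assuming a 4-label tag set such as BMES, which the source does not state explicitly): the unnormalized log-potential of a tag sequence sums per-position label scores and pairwise transition scores, and the log-partition is computed with the standard forward algorithm.

```python
import numpy as np

L = 4          # number of labels (assumed, e.g. BMES)
n, d_h = 5, 3  # toy sentence length and hidden size

rng = np.random.default_rng(2)
H = rng.standard_normal((n, d_h))            # Bi-LSTM hidden states h_i
W_s = rng.standard_normal((d_h, L)) * 0.1    # emission parameters
b_s = np.zeros(L)
b_trans = rng.standard_normal((L, L)) * 0.1  # transition scores b_{y'y}

S = H @ W_s + b_s  # s(X, i): per-position label scores, shape (n, L)

def sequence_score(S, b_trans, y):
    """Unnormalized log-potential of tag sequence y (first-order chain)."""
    score = S[0, y[0]]
    for i in range(1, len(y)):
        score += S[i, y[i]] + b_trans[y[i - 1], y[i]]
    return score

def log_partition(S, b_trans):
    """log of the sum over all tag sequences, via the forward algorithm."""
    alpha = S[0]
    for i in range(1, S.shape[0]):
        alpha = S[i] + np.logaddexp.reduce(alpha[:, None] + b_trans, axis=0)
    return np.logaddexp.reduce(alpha)

y = [0, 1, 1, 2, 3]
log_p = sequence_score(S, b_trans, y) - log_partition(S, b_trans)
assert log_p < 0.0  # a valid log-probability
```

Working in log space with logaddexp avoids the overflow that multiplying raw exponentiated potentials would cause on longer sentences.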
Multi-Criteria Learning for Chinese Word Segmentation

Although neural models are widely used for CWS, most of them cannot handle multiple corpora with heterogeneous segmentation criteria simultaneously.
Inspired by the success of multi-task learning (Caruana, 1997; Ben-David and Schuller, 2003; Liu et al., 2016a,b), we regard the heterogeneous criteria as multiple "related" tasks, which can improve each other's performance simultaneously through shared information.
Formally, assume that there are M corpora with heterogeneous segmentation criteria. We refer to D_m as corpus m with N_m samples:

D_m = {(X_i^{(m)}, Y_i^{(m)})}_{i=1}^{N_m},

where X_i^{(m)} and Y_i^{(m)} denote the i-th sentence and its corresponding label sequence in corpus m.
To exploit the shared information between these different criteria, we propose three sharing models for the CWS task, as shown in Figure 3. The feature layers of these three models consist of a private (criterion-specific) layer and a shared (criteria-invariant) layer. The difference between the three models is the information flow between the task layer and the shared layer. Besides, all three models also share the embedding layer.

Model-I: Parallel Shared-Private Model
In the feature layer of Model-I, we regard the private layer and the shared layer as two parallel layers. For corpus m, the hidden states of the shared layer and the private layer are:

h_i^{(s)} = BiLSTM(e_{x_i}, →h_{i-1}^{(s)}, ←h_{i+1}^{(s)}; θ_s),
h_i^{(m)} = BiLSTM(e_{x_i}, →h_{i-1}^{(m)}, ←h_{i+1}^{(m)}; θ_m),

and the score function in the CRF layer is computed as:

s^{(m)}(X, i) = W_s^{(m)T} [h_i^{(s)} ⊕ h_i^{(m)}] + b_s^{(m)},

where W_s^{(m)} ∈ R^{2d_h×|L|} and b_s^{(m)} ∈ R^{|L|} are criterion-specific parameters for corpus m.
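Assuming the shared and private states are concatenated before scoring (as the dimensionality of the criterion-specific parameters suggests), the criterion-specific CRF input of Model-I can be sketched as follows (toy sizes, illustrative names):

```python
import numpy as np

# Sketch of Model-I's CRF input: shared and private Bi-LSTMs run in
# parallel and their states are concatenated before scoring (toy sizes).
d_h, L = 3, 4
rng = np.random.default_rng(3)
h_shared = rng.standard_normal(d_h)    # h_i^(s), criteria-invariant state
h_private = rng.standard_normal(d_h)   # h_i^(m), criterion-specific state

W_m = rng.standard_normal((2 * d_h, L)) * 0.1  # criterion-specific W_s^(m)
b_m = np.zeros(L)                               # criterion-specific b_s^(m)

s = np.concatenate([h_shared, h_private]) @ W_m + b_m  # s^(m)(X, i)
assert s.shape == (L,)
```

Each corpus m keeps its own (W_m, b_m) pair, while h_shared comes from parameters common to all corpora.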

Model-II: Stacked Shared-Private Model
In the feature layer of Model-II, we arrange the shared layer and the private layer in a stacked manner. The private layer takes the output of the shared layer as input. For corpus m, the hidden states of the shared layer and the private layer are:

h_i^{(s)} = BiLSTM(e_{x_i}, →h_{i-1}^{(s)}, ←h_{i+1}^{(s)}; θ_s),
h_i^{(m)} = BiLSTM([e_{x_i} ⊕ h_i^{(s)}], →h_{i-1}^{(m)}, ←h_{i+1}^{(m)}; θ_m),

and the score function in the CRF layer is computed as:

s^{(m)}(X, i) = W_s^{(m)T} h_i^{(m)} + b_s^{(m)},

where W_s^{(m)} ∈ R^{d_h×|L|} and b_s^{(m)} ∈ R^{|L|} are criterion-specific parameters for corpus m.
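The stacking can be sketched as follows (toy sizes, hypothetical names; a single tanh map stands in for the private Bi-LSTM step): the private layer consumes the embedding concatenated with the shared output, and only the private state feeds the CRF score.

```python
import numpy as np

d_e, d_h, L = 4, 3, 4
rng = np.random.default_rng(5)
e_x = rng.standard_normal(d_e)
h_shared = rng.standard_normal(d_h)        # h_i^(s) from the shared layer

# Stand-in for one private-layer step over the concatenated input.
W_p = rng.standard_normal((d_e + d_h, d_h)) * 0.1
h_private = np.tanh(np.concatenate([e_x, h_shared]) @ W_p)

W_m = rng.standard_normal((d_h, L)) * 0.1  # criterion-specific parameters
s = h_private @ W_m                        # s^(m)(X, i): private state only
assert s.shape == (L,)
```

Contrast with Model-I: here the shared information reaches the CRF only indirectly, filtered through the private layer.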

Model-III: Skip-Layer Shared-Private Model
In the feature layer of Model-III, the shared layer and the private layer are stacked as in Model-II. Additionally, we feed the outputs of the shared layer directly into the CRF layer.

Model-III can thus be regarded as a combination of Model-I and Model-II. For corpus m, the hidden states of the shared layer and the private layer are computed as in Model-II, while the score function in the CRF layer is computed over both shared and private states, as in Model-I.

Objective function
The parameters of the network are trained to maximize the log conditional likelihood of the true labels on all the corpora. The objective function J_seg can be computed as:

J_seg(Θ^m, Θ^s) = Σ_{m=1}^{M} Σ_{i=1}^{N_m} log p(Y_i^{(m)} | X_i^{(m)}; Θ^m, Θ^s),

where Θ^m and Θ^s denote all the parameters of the private and shared layers respectively. Besides the task loss for CWS, we additionally introduce an adversarial loss to prevent criterion-specific features from creeping into the shared space, as shown in Figure 4. We use a criterion discriminator which aims to recognize, from the shared features alone, which criterion a sentence is annotated with.
Specifically, given a sentence X with length n, we refer to h_X^{(s)} as the shared features for X in one of the sharing models. Here, we compute h_X^{(s)} by simply averaging the hidden states of the shared layer:

h_X^{(s)} = (1/n) Σ_{i=1}^{n} h_{x_i}^{(s)}.

The criterion discriminator computes the probability p(·|X) over all criteria as:

p(·|X; Θ_d, Θ_s) = softmax(W_d^T h_X^{(s)} + b_d),

where Θ_d indicates the parameters of the criterion discriminator, W_d ∈ R^{d_h×M} and b_d ∈ R^M, and Θ_s denotes the parameters of the shared layers.
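The discriminator described above can be sketched in a few lines of NumPy (toy sizes; M = 8 criteria as in the experiments): average the shared states into a sentence vector, then apply a linear map and a softmax over criteria.

```python
import numpy as np

n, d_h, M = 6, 3, 8
rng = np.random.default_rng(4)
H_shared = rng.standard_normal((n, d_h))  # shared states h_{x_i}^(s)
W_d = rng.standard_normal((d_h, M)) * 0.1
b_d = np.zeros(M)

h_X = H_shared.mean(axis=0)               # sentence representation h_X^(s)
logits = h_X @ W_d + b_d
p = np.exp(logits - logits.max())         # numerically stable softmax
p /= p.sum()                              # p(.|X): distribution over criteria
assert p.shape == (M,) and abs(p.sum() - 1.0) < 1e-9
```

Subtracting the maximum logit before exponentiating is the standard trick to keep the softmax numerically stable.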

Adversarial loss function
The criterion discriminator is trained to minimize the cross entropy between the predicted criterion distribution p(·|X) and the true criterion, i.e., to maximize:

J_d(Θ_d; Θ_s) = Σ_{m=1}^{M} Σ_{i=1}^{N_m} log p(m | X_i^{(m)}).
The adversarial loss aims to produce shared features such that the criterion discriminator cannot reliably predict the criterion from them. Therefore, when training the shared parameters, we maximize the entropy of the predicted criterion distribution:

J_adv(Θ_s; Θ_d) = Σ_{m=1}^{M} Σ_{i=1}^{N_m} H(p(·|X_i^{(m)})),

where H(p) = −Σ_i p_i log p_i is the entropy of distribution p.
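A small numerical check of why maximizing entropy is the right objective here (toy distributions, M = 8): entropy peaks at log M when the discriminator is maximally uncertain, i.e. when the shared features carry no criterion-specific signal, and drops when one criterion becomes easy to identify.

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_i p_i log p_i, the quantity the shared layer maximizes."""
    p = np.asarray(p)
    return float(-(p * np.log(p)).sum())

M = 8
uniform = np.full(M, 1.0 / M)                # discriminator fully confused
peaked = np.array([0.93] + [0.01] * 7)       # one criterion easily recognized

assert entropy(uniform) > entropy(peaked)
assert abs(entropy(uniform) - np.log(M)) < 1e-9  # maximum is log M
```

Driving the discriminator's output toward the uniform distribution is therefore equivalent to removing criterion-specific information from the shared layer.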
Algorithm 1: Adversarial multi-criteria learning for the CWS task.

Training
Finally, we combine the task and adversarial objective functions:

J(Θ; D) = J_seg(Θ^m, Θ^s) + λ · J_adv(Θ_s; Θ_d),

where λ is a weight that controls the interaction of the two loss terms and D is the training corpora.
The training procedure optimizes the two discriminative objectives alternately, as shown in Algorithm 1. We use Adam (Kingma and Ba, 2014) with mini-batches to maximize the objectives.
Notably, when using the adversarial strategy, we first train for 2400 epochs (each epoch trains on only eight batches, one from each corpus), then we optimize only J_seg(Θ^m, Θ^s) with Θ_s fixed until convergence (early stopping).
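The alternating schedule of Algorithm 1 can be sketched schematically as follows (all function names are hypothetical; the updates are passed in as callbacks and here are no-ops, since this only illustrates the schedule of one batch per criterion per epoch):

```python
# Schematic of the alternating optimization: per epoch, for each of the M
# corpora, first ascend J_d (discriminator), then ascend J_seg + lambda*J_adv
# (task networks with the adversarial term).
def adversarial_train(corpora, epochs, update_discriminator, update_task):
    """Run the alternating schedule and record the visiting order."""
    steps = []
    for epoch in range(epochs):
        for m, batch in enumerate(corpora):   # one batch per criterion
            update_discriminator(batch, m)    # ascend J_d(Theta_d; Theta_s)
            update_task(batch, m)             # ascend J_seg + lambda * J_adv
            steps.append((epoch, m))
    return steps

# Toy run with no-op updates, just to show the schedule over 8 corpora.
corpora = [f"corpus-{m}" for m in range(8)]
log = adversarial_train(corpora, epochs=2,
                        update_discriminator=lambda b, m: None,
                        update_task=lambda b, m: None)
assert len(log) == 2 * 8
```

In a real implementation the two callbacks would hold separate optimizers, since the discriminator and the shared layer ascend opposing objectives.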

Experiments

Datasets
To evaluate our proposed architecture, we experiment on eight prevalent CWS datasets from SIGHAN 2005 (Emerson, 2005) and SIGHAN 2008 (Jin and Chen, 2008). Table 1 gives the details of the eight datasets. Among them, AS, CITYU and CKIP are in traditional Chinese, while the remaining five, MSRA, PKU, CTB, NCC and SXU, are in simplified Chinese. We use 10% of the shuffled training set as the development set for all datasets.

Experimental Configurations
For hyper-parameter configurations, we set both the character embedding size d_e and the dimensionality of LSTM hidden states d_h to 100. For initialization, we randomize all parameters from a uniform distribution over (−0.05, 0.05).
We simply map traditional Chinese characters to simplified Chinese, and optimize the same character embedding matrix across datasets, pre-trained on a Chinese Wikipedia corpus using the word2vec toolkit (Mikolov et al., 2013). Following previous work (Chen et al., 2015b; Pei et al., 2014), all experiments, including the baselines, use pre-trained character embeddings with bigram features.

Overall Results
Table 2 shows the experimental results of the proposed models on the test sets of the eight CWS datasets, organized in three blocks.
(1) In the first block, we can see that performance is boosted by using a Bi-LSTM, and that the performance of the Bi-LSTM cannot be improved by merely increasing the depth of the network. In addition, although the F value of the LSTM model in Chen et al. (2015b) is 97.4%, they additionally incorporate an external idiom dictionary.
(2) In the second block, our three proposed models based on multi-criteria learning boost performance. Model-I gains a 0.75% improvement in average F-measure over the Bi-LSTM result (94.14%); only the performance on MSRA drops slightly. Compared to the baselines (Bi-LSTM and stacked Bi-LSTM), the proposed models improve performance by exploiting information across these heterogeneous segmentation criteria. Although the criteria have different segmentation granularities, they still share some underlying information. For instance, MSRA and CTB treat the full personal name "Ning Ze-Tao" as one token, whereas some other datasets, like PKU, regard the family name "Ning" and the given name "Ze-Tao" as two tokens. The partial boundaries (before "Ning" or after "Tao") can be shared.
(3) In the third block, we introduce adversarial training, which further boosts performance; Model-I is slightly better than Model-II and Model-III. The adversarial training pushes the shared layer to keep criteria-invariant features. For instance, as shown in Table 2, when we use shared information without the adversarial strategy, the performance on MSRA drops below the baseline. The reason may be that the shared parameters are biased toward other segmentation criteria and introduce noisy features. When we additionally incorporate the adversarial strategy, the performance on MSRA improves and outperforms the baseline, and we also observe improvements on the other datasets. However, the boost from the adversarial strategy is not significant. The main reason might be that the three proposed sharing models already implicitly attempt to keep invariant features in the shared parameters and to learn criterion discrepancies in the task layers.

Table 2: Results of the proposed models on the test sets of the eight CWS datasets, in three blocks. The first block consists of two baseline models: Bi-LSTM and stacked Bi-LSTM. The second block consists of our three proposed models without adversarial training. The third block consists of our three proposed models with adversarial training. Here, P, R, F and OOV indicate precision, recall, F value and OOV recall rate respectively. The maximum F values in each block are highlighted for each dataset.

Speed
To further explore the convergence speed, we plot the results on the development sets across training epochs.
Figure 5 shows the learning curves of Model-I without the adversarial strategy. As shown in Figure 5, the proposed model makes gradual progress on all datasets; after about 1000 epochs, the performance becomes stable and converges.
We also test the decoding speed: our models process 441.38 sentences per second on average.
As the proposed models and the baseline models (Bi-LSTM and stacked Bi-LSTM) have nearly the same complexity, all models are similarly efficient at decoding. However, the training time varies from model to model. The models without adversarial training take about 10 hours to train (the same as the stacked Bi-LSTM on the eight datasets), whereas the models with adversarial training take about 16 hours. All experiments are conducted on an Intel(R) Xeon(R) CPU E5-2643 v3 @ 3.40GHz with an NVIDIA GeForce GTX TITAN X.

Error Analysis
We further investigate the benefits of the proposed models by comparing the error distributions of single-criterion learning (the baseline Bi-LSTM) and multi-criteria learning (Model-I and Model-I with adversarial training), as shown in Figure 6. A large proportion of points lie above the diagonal lines in Figures 6a and 6b, which implies that performance benefits from integrating knowledge and complementary information from the other corpora. As shown in Table 2, on the test set of CITYU, Model-I and its adversarial version (Model-I+ADV) boost the F value from 92.17% to 95.59% and 95.42% respectively.
In addition, we observe that the adversarial strategy is effective in preventing criterion-specific features from creeping into the shared space.

Knowledge Transfer
We also investigate whether the shared layers can be transferred to other related tasks or domains. In this section, we examine the ability of knowledge transfer in two experiments: (1) from simplified Chinese to traditional Chinese and (2) from formal texts to informal texts.

Results
Formal documents (like the eight datasets in Table 1) and micro-blog texts are dissimilar in many aspects. Thus, we further investigate whether formal texts can help to improve the performance on micro-blog texts. Table 5 gives the results of Model-I on the NLPCC 2016 dataset with the help of the eight datasets in Table 1. Specifically, we first train the model on the eight datasets, then train on the NLPCC 2016 dataset alone with the shared parameters fixed. The baseline model is a Bi-LSTM trained on the NLPCC 2016 dataset alone.
As we can see, the F-measure score is boosted by 0.30% (from 93.94% to 94.24%), and the OOV recall rate is boosted by 3.97%. This shows that the shared features learned from formal texts can help to improve performance on micro-blog texts.

Related Works
There are many works on exploiting heterogeneous annotation data to improve various NLP tasks. Jiang et al. (2009) proposed a stacking-based model which trains a model for one specific desired annotation criterion by utilizing knowledge from corpora with other heterogeneous annotations. Sun and Wan (2012) proposed a structure-based stacking model to reduce the approximation error, making use of structured features such as sub-words. These models provide only unidirectional aid and also suffer from the error propagation problem. Qiu et al. (2013) used a multi-task learning framework to improve the performance of POS tagging on two heterogeneous datasets. Li et al. (2015) proposed a coupled sequence labeling model which can directly learn and infer two heterogeneous annotations. Chao et al. (2015) also utilize multiple corpora with a coupled sequence labeling model. These methods adopt shallow classifiers and therefore suffer from the problem of defining shared features.
Our proposed models use deep neural networks, which can easily share information through shared hidden layers. Chen et al. (2016) also adopted neural network models for exploiting heterogeneous annotations, based on a neural multi-view model, which can be regarded as a simplified version of our proposed models obtained by removing the private hidden layers.
Unlike the above models, we design three shared-private architectures and keep the shared layer extracting criterion-invariant features by introducing adversarial training. Moreover, we fully exploit eight corpora with heterogeneous segmentation criteria to model the underlying shared information.

Conclusions & Future Work
In this paper, we propose adversarial multi-criteria learning for CWS by fully exploiting the underlying shared knowledge across multiple heterogeneous criteria. Experiments show that our three proposed shared-private models are effective in extracting the shared information and achieve significant improvements over single-criterion methods.

Figure 1: Illustration of the different segmentation criteria.

Figure 3: Three shared-private models for multi-criteria learning. The yellow blocks are the shared Bi-LSTM layers, while the gray blocks are the private Bi-LSTM layers. The yellow circles denote the shared embedding layer. The red information flow indicates the difference between the three models.

Figure 4: Architecture of Model-III with the adversarial training strategy for the shared layer. The discriminator first averages the hidden states of the shared layer, then derives a probability over all possible criteria by applying a softmax operation after a linear transformation.
Figure 5: Convergence speed of Model-I without adversarial training on the development sets of the eight datasets.

Figure 7: Segmentation cases of personal names.

Table 1: Details of the eight datasets.
The initial learning rate α is set to 0.01, and the loss weight coefficient λ is set to 0.05. Since the scale of each dataset varies, we use different training batch sizes: 512 for AS, 256 for MSRA, and 128 for the remaining datasets. We employ dropout on the embedding layer with a 20% dropout rate (keeping 80% of inputs).

Table 3: Performance on the 3 traditional Chinese datasets. Model-I* means that the shared parameters are trained on the 5 simplified Chinese datasets and fixed for the traditional Chinese datasets.

Table 5: Performance on the test set of the NLPCC 2016 dataset. Model-I* means that the shared parameters are trained on the 8 Chinese datasets (Table 1) and fixed for the NLPCC dataset.