RethinkCWS: Is Chinese Word Segmentation a Solved Task?

The performance of Chinese Word Segmentation (CWS) systems has gradually plateaued with the rapid development of deep neural networks, especially the successful use of large pre-trained models. In this paper, we take stock of what has been achieved and rethink what is left in the CWS task. Methodologically, we propose a fine-grained evaluation for existing CWS systems, which not only allows us to diagnose the strengths and weaknesses of existing models (under the in-dataset setting), but also enables us to quantify the discrepancy between different criteria and alleviate the negative transfer problem in multi-criteria learning. Strategically, although we do not aim to propose a novel model in this paper, our comprehensive experiments on eight models and seven datasets, together with a thorough analysis, point to some promising directions for future research. We make all code publicly available and release an interface that can quickly evaluate and diagnose users' models: https://github.com/neulab/InterpretEval.


Introduction
Chinese word segmentation (CWS), as a crucial first step in Chinese language processing, has drawn a large body of research (Sproat and Shih, 1990; Xue and Shen, 2003; Huang et al., 2007; Liu et al., 2014). Recent years have seen remarkable success in the use of deep neural networks on CWS (Zhou et al., 2017; Yang et al., 2017; Ma et al., 2018; Yang et al., 2019; Zheng et al., 2013; Chen et al., 2015a,b; Cai and Zhao, 2016; Pei et al., 2014), and large unsupervised pre-trained models have driven state-of-the-art results to a new level (Huang et al., 2019).
However, the performance of CWS systems has gradually reached a plateau and the development of this field has slowed down. For example, CWS systems on many existing datasets (e.g. msr, ctb) have achieved F1-scores higher than 97.0, with little further improvement. Naturally, a question arises: is CWS a solved task? When we rethink what we have achieved so far, we find that there are still some important yet rarely discussed unsolved questions for this task:

Q1: Does the current excellent performance (e.g. more than 98.0 F1-score on the msr dataset) indicate a perfect CWS system, or are there still limitations? Existing CWS systems are mainly evaluated by a corpus-level metric. This holistic measure fails to provide a fine-grained analysis; as a result, it is unclear what the strengths and weaknesses of a specific model are.

* These two authors contributed equally.
To address this problem, we shift from the traditional holistic evaluation to fine-grained evaluation (Fu et al., 2020a), in which the notion of an attribute (e.g., word length) is introduced to describe a property of each word. The test words are then partitioned into different buckets, over which we can observe the system's performance from different aspects based on word attributes (e.g. long words obtain lower F1-scores).
Q2: Is there a one-size-fits-all system (i.e., are the best-performing systems on different datasets the same)? If not, how can we choose model architectures for different datasets? Insights are still missing on how the choice of dataset influences architecture design.
To answer this question, we make use of our proposed fine-grained evaluation methodology and present two types of diagnostic methods for existing CWS systems, which not only help us to identify the strengths and weaknesses of current approaches but also provide more insight into how different choices of datasets influence model design.
Q3: Given that existing CWS systems can benefit from multi-criteria learning at the cost of negative transfer (Chen et al., 2017; Qiu et al., 2019), can we design a measure to quantify the discrepancies among different criteria and use it to guide the multi-criteria learning process (i.e., to alleviate negative transfer)?
To answer this question, we extend the in-dataset evaluation (i.e., a system is trained and tested on the same dataset) to the cross-dataset setting, in which a CWS model trained on one corpus is evaluated on a range of out-of-domain corpora. It is the above in-dataset analysis (in Q1 & Q2) that helps us design a measure to quantify the discrepancies between cross-dataset criteria. Empirical results not only show that this measure, calculated solely from the statistics of two datasets, correlates highly with cross-dataset performance, but also that it helps us avoid negative transfer (i.e., select the useful parts of source domains as training sets and achieve better results with fewer training samples).

Our contributions can be summarized as follows: 1) Instead of using a holistic metric, we propose an attribute-aided evaluation methodology for CWS systems. This allows us to diagnose the weaknesses of existing CWS systems (e.g., BERT-based models are not impeccable and are limited in dealing with words with low label consistency). 2) We show that the best-performing systems on different datasets are diverse. Based on our proposed quantitative measures, we can make better choices of model architectures for different datasets. 3) We quantify the criterion discrepancy between different datasets, which can alleviate the negative transfer problem when performing multi-criteria learning for CWS.

Task Description
Chinese word segmentation (CWS) is usually conceptualized as a character-based sequence labeling problem. Formally, let X = {x_1, x_2, ..., x_T} be a sequence of characters and Y = {y_1, y_2, ..., y_T} the output tags. The goal of the task is to estimate the conditional probability:

P(Y|X) = ∏_{t=1}^{T} P(y_t | X, y_1, ..., y_{t-1}),

where y_t usually takes one value of {B, M, E, S}.
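The mapping from a segmented sentence to BMES tags can be sketched as follows (a minimal illustration of the labeling scheme; the function name is ours, not from the paper's code):

```python
# A minimal sketch of the BMES tagging scheme assumed by the task
# formulation above.

def words_to_tags(words):
    """Map a segmented sentence (a list of words) to per-character BMES tags."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")                 # single-character word
        else:
            tags.append("B")                 # beginning of a multi-char word
            tags.extend("M" * (len(w) - 2))  # middle characters, if any
            tags.append("E")                 # end of the word
    return tags

# "图书馆 在 节假日 会 关闭" = library / at / holidays / will / close
print(words_to_tags(["图书馆", "在", "节假日", "会", "关闭"]))
# → ['B', 'M', 'E', 'S', 'B', 'M', 'E', 'S', 'B', 'E']
```

A CWS model predicts one such tag per character; the word boundaries are then recovered deterministically from the tag sequence.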

Attribute-aided Evaluation Methodology
The standard metric of CWS is becoming unable to distinguish between state-of-the-art word segmentation systems (Qian et al., 2016). Instead of evaluating CWS systems with a holistic metric (F1-score), in this paper we take a step towards fine-grained evaluation of current CWS systems by proposing an attribute-aided evaluation method. Specifically, we first introduce the notion of attributes to characterize the properties of the test words. The test set is then divided into different subsets, and the overall performance can be broken down into several interpretable buckets. Below, we introduce the seven attributes that we explore to depict a word from diverse aspects. Fig. 1 gives an example for the test word "图书馆".
Aspect-I: Intrinsic nature We can characterize a word based on its constituent features (or those of the sentence it belongs to). Here, we define three attributes: word length (wLen); sentence length (sLen); and OOV density (oDen), the number of words outside the training set in a sentence divided by the sentence length.
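These three intrinsic attributes can be computed directly from a segmented sentence and the training vocabulary, as in this stdlib-only sketch (function and field names are illustrative):

```python
def intrinsic_attributes(sentence_words, train_vocab):
    """Compute wLen per word, plus the sentence-level sLen and oDen
    (number of OOV words divided by sentence length)."""
    s_len = len(sentence_words)
    o_den = sum(w not in train_vocab for w in sentence_words) / s_len
    return [{"word": w, "wLen": len(w), "sLen": s_len, "oDen": o_den}
            for w in sentence_words]
```

Note that sLen and oDen are properties of the sentence, so every word in the same sentence shares them; only wLen varies word by word.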

Aspect-II: Familiarity
We introduce the notion of familiarity to quantify the degree to which a test word (or its constituents) has been seen in the training set. Specifically, the familiarity of a word can be calculated based on its frequency in the training set. For example, in Fig. 1, if the frequency of the test word 图书馆 (library) in the training set is 0.3, the word-frequency attribute of 图书馆 will be 0.3. In this paper, we consider two kinds of familiarity: word frequency (wFre) and character frequency (cFre).
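A simple way to derive both familiarity attributes from a training corpus is to count relative frequencies, as sketched below (the exact normalization used for bucketing in the paper may differ; this is an assumption):

```python
from collections import Counter

def familiarity(train_words):
    """Relative word (wFre) and character (cFre) frequencies estimated
    from a flat list of training-set words."""
    w_counts = Counter(train_words)
    c_counts = Counter(c for w in train_words for c in w)
    n_w, n_c = sum(w_counts.values()), sum(c_counts.values())
    w_fre = {w: n / n_w for w, n in w_counts.items()}
    c_fre = {c: n / n_c for c, n in c_counts.items()}
    return w_fre, c_fre
```

A test word absent from `w_fre` has familiarity 0, i.e., it is an OOV word; cFre lets the evaluation distinguish OOV words made of familiar characters from those made of rare ones.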
Aspect-III: Label consistency In this paper, we attempt to design a measure that can quantify the degree of the label consistency phenomenon (Fu et al., 2020b; Gong et al., 2017; Luo and Yang, 2016; Chen et al., 2017) for each test word (or character). Here, we investigate two attributes: label consistency of word (wCon) and label consistency of character (cCon). In what follows, we give the definition of label consistency of word; label consistency of character can be defined in a similar way. Specifically, we refer to w_i^k as a test word with label k, whose label consistency ψ(w_i^k) is defined as:

ψ(w_i^k) = |w_i^{tr,k}| / Σ_{k'} |w_i^{tr,k'}|,   (1)

where |w_i^{tr,k}| represents the number of occurrences of word w_i with label k in the training set D^tr, and the denominator sums over all label sequences k' with which w_i occurs in D^tr. For example, in Fig. 1, "图书馆 (library)" is labeled as BME 7 times and BMM 3 times in the training set, so ψ("图书馆^BME") = 7/10 = 0.7 and ψ("图书馆^BMM") = 3/10 = 0.3.
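The running example can be reproduced with a few lines of stdlib Python (a sketch of Eq. 1; names are ours):

```python
from collections import Counter

def label_consistency(train_word_labels):
    """Build psi(w, k): occurrences of word w with label sequence k in the
    training set, divided by the total occurrences of w (Eq. 1)."""
    counts = Counter(train_word_labels)                 # (word, label) pairs
    totals = Counter(w for w, _ in train_word_labels)   # word occurrences
    return lambda w, k: counts[(w, k)] / totals[w] if totals[w] else 0.0

# 图书馆 is labeled BME 7 times and BMM 3 times in the training set:
train = [("图书馆", "BME")] * 7 + [("图书馆", "BMM")] * 3
psi = label_consistency(train)
print(psi("图书馆", "BME"))  # 0.7
```

Words never seen in training get ψ = 0, so low wCon buckets also absorb OOV words, which is part of why wCon correlates so strongly with difficulty.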

Setup
This section focuses on the in-dataset setting, in which each CWS model is trained and tested on the same dataset.
Datasets We choose seven mainstream datasets from SIGHAN2005 and SIGHAN2008, in which cityu and ckip are in traditional Chinese, while msr, pku, ctb, ncc, and sxu are in simplified Chinese. We map traditional Chinese characters to simplified Chinese in our experiments. The details of the seven datasets used in this study are described in Chen et al. (2017).

Measures
Here, we refer to M = {m_1, ..., m_{N_m}} as a set of models and P = {p_1, ..., p_{N_p}} as a set of attributes. As described above, the test set can be split into different buckets B^j = {B^j_1, ..., B^j_{N_b}} based on an attribute p_j. We introduce a performance tensor V ∈ R^{N_m × N_p × N_b}, in which V_{ijk} represents the performance of the i-th model on the k-th sub-test set (bucket) generated by the j-th attribute.
Model-wise The model-wise measures aim to investigate whether and how the attributes influence the performance of models with different choices of neural components. Formally, we characterize how the j-th attribute influences the i-th model with two statistics, Spearman's rank correlation coefficient Spear (Mukaka, 2012) and the standard deviation Std:

S^ρ_{i,j} = Spear(V_{i,j,:}, R^j),    S^σ_{i,j} = Std(V_{i,j,:}),

where V_{i,j,:} denotes the bucket-wise performances of the i-th model under the j-th attribute and R^j is the vector of rank values of the buckets based on the j-th attribute. Intuitively, S^ρ_{i,j} reflects the degree to which the i-th model positively (or negatively) correlates with the j-th attribute, while S^σ_{i,j} indicates the degree to which this attribute influences the model.
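Both model-wise statistics can be computed from the per-bucket F1 scores alone; a stdlib-only sketch (a minimal Spearman without tie handling; scipy.stats.spearmanr would normally be used):

```python
from statistics import pstdev

def ranks(xs):
    """Rank values 1..n by ascending order (no tie handling)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, 1):
        r[i] = float(rank)
    return r

def spearman(xs, ys):
    """Spearman's rho = Pearson correlation of the two rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# V[i][j][:]: toy F1 of one model over 4 buckets of, say, increasing wLen.
bucket_f1 = [0.98, 0.95, 0.90, 0.80]
s_rho = spearman(bucket_f1, [1, 2, 3, 4])  # S^rho: -1.0, strong negative
s_sigma = pstdev(bucket_f1)                # S^sigma: spread across buckets
```

Here S^ρ = -1.0 says performance falls monotonically as the attribute grows, and S^σ says by how much the buckets differ, which matches the two roles described above.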

Dataset-wise
The dataset-wise measures aim to quantitatively characterize a dataset with different attributes. We utilize two types of measures to build the connection between datasets and attributes: a system-independent measure α^μ, and system-dependent measures α^ρ and α^σ.
1) The system-independent measure reflects intrinsic statistics of a dataset, such as the average word length over the whole dataset. It can be formally defined as:

α^μ_j = (1/N_w) Σ_{i=1}^{N_w} Attr(w_i, j),

where N_w is the number of test words and Attr(w_i, j) is the value of attribute j for word w_i.
Table 2: Neural CWS systems with different architectures and pre-trained knowledge studied in this paper. We exclude systems based on joint training to make a fair comparison in the in-dataset setting. In the model names, "C" refers to "Character" and "B" refers to "Bigram"; intuitively, models are named after their constituents. For example, Cw2vBw2vLstmCrf denotes a model whose character and bigram features are initialized with pre-trained Word2Vec embeddings, and whose sentence encoder and decoder are an LSTM and a CRF, respectively. We perform a Friedman test at p = 0.05 model- (row-) wise and data- (column-) wise. The results are p(model-wise) = 2.26 × 10^−6 < 0.05 and p(data-wise) = 8.42 × 10^−8; therefore, both the model-wise and the data-wise results pass the significance test.

2) The system-dependent measures quantify the degree to which each attribute influences the CWS systems on a given dataset, e.g., "does the attribute word length matter for a CWS system trained on the pku dataset?". To achieve this, we design the following measures:

α^ρ_j = (1/N_m) Σ_{i=1}^{N_m} S^ρ_{i,j},    α^σ_j = (1/N_m) Σ_{i=1}^{N_m} S^σ_{i,j},

where N_m is the number of evaluated models. Intuitively, a higher absolute value of α^ρ_j ∈ [−1, 1] suggests that attribute j is a crucial factor that greatly influences the performance of CWS systems. For example, if α^ρ_wLen = 0.95, word length is a major factor influencing CWS performance.
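Under the reading that the dataset-wise measures average the model-wise statistics over the N_m models (our assumption for the aggregation), they reduce to a few lines:

```python
def dataset_measures(s_rho_per_model, s_sigma_per_model, attr_values):
    """alpha_mu: mean attribute value over the N_w test words
    (system-independent); alpha_rho / alpha_sigma: S^rho / S^sigma
    averaged over the N_m evaluated models (system-dependent)."""
    n_m = len(s_rho_per_model)
    alpha_mu = sum(attr_values) / len(attr_values)
    alpha_rho = sum(s_rho_per_model) / n_m
    alpha_sigma = sum(s_sigma_per_model) / n_m
    return alpha_mu, alpha_rho, alpha_sigma
```

For instance, two models with S^ρ = -1.0 and -0.8 on wLen give α^ρ_wLen = -0.9, i.e., word length consistently hurts performance on that dataset regardless of the architecture.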

Analysis of Holistic Evaluation
Before giving a fine-grained analysis, we present the holistic results of different models on different datasets. As shown in Tab. 2, there is no one-size-fits-all model: the best-performing systems on different datasets frequently consist of diverse components. This naturally raises a question: how can we pick appropriate models for different datasets?

Analysis of Dataset Biases
Before the analysis, we conduct a statistical significance test using the Friedman test (Zimmerman and Zumbo, 1993) at p = 0.05, to examine whether the performance of the different buckets partitioned by an attribute differs significantly for a given dataset. The results are shown in the Appendix: the performance of the buckets partitioned by each attribute differs significantly (p < 0.05), and this holds for all the datasets.
1) Label consistency and word length have a more consistent impact on CWS performance. The common parts of the radar charts in Fig. 2 (b) illustrate that, regardless of the dataset, the label consistency attributes (wCon, cCon) and word length (wLen) are highly correlated with CWS performance (higher α^ρ). This suggests that the learning difficulty of CWS systems is commonly influenced by label consistency and word length.
2) Frequency and sentence length matter but are minor factors. The outliers in the radar charts (Fig. 2 (b)) show the peculiarities of different corpora. For the attributes sLen, wFre, and oDen, the extent to which different datasets are affected varies greatly. For example, the dataset ckip is distinctive, with the highest value of α^ρ_oDen, which can explain why character pre-training shows no advantage on it while the CRF layer contributes a lot.

Analysis of Model Biases
Similar to the above section, we perform the Friedman test at p = 0.05; detailed significance testing results are given in the Appendix. Tab. 3 illustrates model biases characterized by the measures S^ρ_{i,j} and S^σ_{i,j}. Values in grey denote that the given model does not pass the significance test (p ≥ 0.05) on the specific attribute. Below, we highlight some observations.

ELMo-based models can make better use of the contextual information that long sentences carry. Regarding the attribute sLen (sentence length), two models, CelmBnonLstmMlp and CbertBnonLstmMlp, pass the significance test. Additionally, as Tab. 3 shows, only the ELMo-based model (CelmBnonLstmMlp) exhibits a strong positive correlation with sentence length.
Contextualized models can reduce the negative effect of OOV density and remedy the deficiency of the MLP decoder. a) The performance of non-contextualized models (i.e. word2vec) strongly correlates with the oDen (OOV density) attribute. When equipped with BERT or ELMo, a model can still provide each OOV word with a meaningful representation on the fly based on its context. b) We observe that the model Cw2vBavgLstmMlp is strongly correlated with wCon and wLen, with the highest values of S^σ (bolded in Tab. 3), suggesting that models with an MLP decoder are unstable when generalizing to hard cases (words with a lower value of wCon and a higher value of wLen). However, once augmented with contextualized representations, systems with an MLP decoder also work well.

Table 3: Here, we average the F1, S^ρ_{i,j}, and S^σ_{i,j} over the seven datasets. Values in grey denote that the given model does not pass the significance test (p ≥ 0.05) on the specific attribute. Values in orange and in blue support observation 1 and observation 2, respectively.

Application: Model Diagnosis
Model diagnosis is the process of identifying where a model works well and where it does not (Vartak et al., 2018). We present two types of diagnostic methods: self-diagnosis and aided-diagnosis. Self-diagnosis aims to locate the bucket on which the input model obtains its worst performance with respect to a given attribute. For aided-diagnosis, suppose the holistic performance of two models satisfies A > B. Then aided-diagnosis(A, B) first looks for a bucket on which the performance satisfies A < B. If there is no qualified bucket, the bucket on which model A achieves its best performance is returned.
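The two procedures reduce to simple searches over the per-bucket scores; a minimal sketch under our own data layout (a dict of attribute → bucket → F1), not the released interface:

```python
def self_diagnosis(perf):
    """For one model, return the worst-performing bucket per attribute.
    perf: {attribute: {bucket_value: f1}}."""
    return {attr: min(buckets, key=buckets.get)
            for attr, buckets in perf.items()}

def aided_diagnosis(perf_a, perf_b):
    """For holistically-better model A, find the bucket where A trails B
    by the largest gap; fall back to A's best bucket if A >= B everywhere.
    perf_a, perf_b: {bucket_value: f1} for one attribute."""
    worse = {b: perf_a[b] - perf_b[b] for b in perf_a if perf_a[b] < perf_b[b]}
    if worse:
        return min(worse, key=worse.get)   # most negative gap: A < B
    return max(perf_a, key=perf_a.get)     # no such bucket: A's best one
```

For example, `aided_diagnosis({"S": 0.90, "L": 0.70}, {"S": 0.85, "L": 0.75})` returns `"L"`: model A wins overall yet loses on the large bucket, which is exactly the kind of finding the BERT vs. ELMo comparison below surfaces.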
Below, we will give a diagnostic analysis of some typical models shown in Tab. 4. The others are shown in the Appendix.
Self-diagnosis: BERT-based models are not impeccable. The first row in Tab. 4 shows the diagnosis of the model CbertBnonLstmMlp, in which the x-tick labels represent the bucket value of a specific attribute (e.g. wLen: word length) on which the system achieves its worst performance. The blue bins represent the worst performance, while the red bins denote the gap between the worst and best performance. For example, the first histogram in the first row denotes that CbertBnonLstmMlp achieves its worst performance on the attribute wCon with value S.
We observe a huge performance drop on all datasets when the test samples have the attribute values wCon=S (low label consistency of words), cCon=S (low label consistency of characters), or wLen=L (long words). This suggests that the contextualized information brought by BERT is insufficient to deal with low label consistency and long words. To address this challenge, more effort should be devoted to learning algorithms or data augmentation strategies.
Aided-diagnosis: BERT vs. ELMo The second row in Tab. 4 shows the comparison between BERT and ELMo. We observe: 1) BERT outperforms ELMo by a large margin in the bucket wCon=S (low label consistency of words) on all datasets, suggesting that the benefit of BERT mainly comes from handling words with low label consistency.
2) When the OOV density of a sentence is high enough, BERT loses its superiority. As shown in Tab. 4, BERT performs worse than ELMo in the bucket oDen=L on the pku dataset, whose average OOV density (α^μ_oDen) is the highest (as shown in Fig. 2 (a)). To explain this, we take a closer look at the test samples in pku with high OOV density: "仰泳100米和400米" (backstroke 100m and 400m), "10月1日，北京" (October 1, Beijing). For BERT, a multi-layer Transformer, it is difficult to collect sufficient context to understand these cases. 3) BERT is inferior to ELMo in dealing with long sentences. As shown in Tab. 4, BERT obtains lower performance in the bucket sLen=L on the pku and sxu datasets, whose average sentence lengths (α^μ_sLen) are the two highest.

Table 4: Diagnosis of different CWS systems. For ease of presentation, attribute values are classified into three categories: small (S), middle (M), and large (L). Regarding self-diagnosis, the x-tick labels represent the bucket value of a specific attribute (e.g. wLen: word length) on which the system achieves its worst performance. The blue bins represent the worst performance, while the red bins denote the gap between the worst and best performance. Regarding aided-diagnosis, the bins below the line y = 0 represent the largest gap by which model A falls short of model B; by contrast, the bins above the line y = 0 denote the largest gap by which model A surpasses model B. x-tick labels in red indicate that the corresponding bins are used for analysis in Sec. 3.6.

Investigation on Cross-dataset Setting
The above in-dataset analysis aims to interpret model bias and dataset bias on individual datasets. In many real-world scenarios, however, we need to transfer a trained model to a new dataset or domain, which requires us to understand the cross-dataset generalization behavior of current systems. In this section, our investigation of cross-dataset generalization is driven by two questions: 1) How do different architectures (e.g. Cw2vBavgLstmCrf) of CWS systems influence their cross-dataset generalization ability? 2) Given that the previous section identified a common factor (label consistency) that affects model performance across different datasets, can we design a measure based on it and use it to interpret cross-dataset generalization? We detail our exploration below.

Setup
This section focuses on the zero-shot setting: a model with a specified architecture trained on one dataset (e.g. pku) is evaluated on a range of other datasets (e.g. ctb). To better understand the generalization behavior of CWS systems and the relations between different datasets, we first define several measures to quantify our observations.

Measures
Similar to Sec. 3.2, we refer to N_d as the number of datasets and N_m as the number of architectures.
Quantifying a System's Cross-dataset Generalization The cross-dataset performance can be recorded in a tensor U ∈ R^{N_d × N_d × N_m}. Intuitively, U_{ijk} = 0.65 represents that we adopt architecture k (e.g. Cw2vBavgLstmCrf) to learn a model on the training set of dataset i (e.g. pku), and the performance on the test set of dataset j (e.g. msr) is 0.65. We apply a simple numerical transformation to U to make the meaning of the variables more intuitive: Û_{ijk} = (U_{jjk} − U_{ijk}) / U_{jjk}. For instance, Û_{pku,msr,k} = 0.2 suggests that, when both are tested on msr, the model with architecture k trained on pku performs relatively lower than the one trained on msr by 0.2. Usually, a lower value of Û is suggestive of better zero-shot generalization ability.
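The normalization is a one-liner over the performance tensor; a toy sketch with made-up scores (the real U holds 7 × 7 × 8 entries):

```python
def normalize_transfer(U, i, j, k):
    """U_hat[i][j][k] = (U[j][j][k] - U[i][j][k]) / U[j][j][k]: the relative
    drop of a model trained on dataset i and tested on dataset j, compared
    with the in-dataset model trained and tested on j."""
    return (U[j][j][k] - U[i][j][k]) / U[j][j][k]

# Toy 2x2x1 tensor: datasets {0: pku, 1: msr}, a single architecture k = 0.
U = [[[0.96], [0.65]],
     [[0.78], [0.97]]]
drop = normalize_transfer(U, 0, 1, 0)  # (0.97 - 0.65) / 0.97, roughly 0.33
```

Dividing by the in-dataset score U_{jjk} makes drops comparable across test sets with different absolute difficulty, and Û_{jjk} = 0 by construction.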

Quantifying Discrepancies of Cross-dataset Criteria To measure the discrepancy of segmentation criteria between any pair of training data D^tr_A and test data D^te_B, we extend the label consistency of a word (defined in Sec. 2.2) to the corpus level by computing its expectation on a given training-test dataset pair. Based on Eq. 1, we define the measure Ψ as:

Ψ(D^tr_A, D^te_B) = Σ_{i=1}^{N_w} freq(w_i^{te,k}) · ψ(w_i^{te,k}),

in which ψ(·) (defined in Eq. 1) calculates the label consistency of a test word w_i^{te,k} against the training set, N_w denotes the number of unique test words, and freq(w_i^{te,k}) is the frequency of the test word. A lower value of Ψ(D^tr_A, D^te_B) suggests a larger discrepancy between the two datasets. For example, Ψ(D^tr_msr, D^te_msr) = 78.0 and Ψ(D^tr_msr, D^te_pku) = 75.5, indicating that the discrepancy between msr's training set and msr's test set is smaller than that between msr's training set and pku's test set.
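The expectation can be computed purely from corpus statistics, with no model training; a sketch reusing the ψ definition from Eq. 1 (we return a value in [0, 1]; the paper reports Ψ scaled as a percentage):

```python
from collections import Counter

def criterion_discrepancy(train_word_labels, test_word_labels):
    """Psi(D_tr, D_te): the frequency-weighted expectation, over unique
    (word, label) pairs of the test set, of psi(w, k) computed against
    the training set. Inputs are lists of (word, label_sequence) pairs."""
    counts = Counter(train_word_labels)                 # (word, label) counts
    totals = Counter(w for w, _ in train_word_labels)   # word counts
    n_te = len(test_word_labels)
    psi_total = 0.0
    for (w, k), n in Counter(test_word_labels).items():
        consistency = counts[(w, k)] / totals[w] if totals[w] else 0.0
        psi_total += (n / n_te) * consistency           # freq(w) * psi(w, k)
    return psi_total
```

Because ψ is 0 for unseen words and low for words segmented differently by the two criteria, Ψ drops as the two corpora disagree, matching the msr/pku example above.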

Analysis
Tab. 5 illustrates the relationship between different train-test pairs using the data-wise Ψ and the model-wise Û_k. To test whether the expectation of label consistency is a factor that can characterize cross-dataset generalization, we perform a Friedman test at p = 0.05. Each group of samples for significance testing is obtained by varying the test set for a given training set (we have 7 groups of testing samples corresponding to the 7 columns of Ψ in Tab. 5). The testing result is p = 0.011 < 0.05; therefore, Ψ can be utilized to describe the features of a cross-dataset pair.
The distance between different datasets can be quantitatively characterized by Ψ. In Fig. 3 (a), sxu, cityu, and ctb cluster together, surrounded remotely by the other datasets ckip, ncc, and pku, suggesting that these neighboring datasets have similar distributions.
The measure Ψ can be used to interpret domain shift. As shown in Tab. 5, the value of Ψ reflects the changing trends of Û. Similarly, as shown in Fig. 3, the two graphs, although obtained in totally different ways, are impressively close: Fig. 3 (a) is computed purely from intrinsic statistics of the datasets, while Fig. 3 (b) is obtained from model outputs. These qualitative results show that our proposed measure Ψ can be used to explain the discrepancies across datasets.
To obtain a more convincing observation, we additionally conduct a quantitative analysis. Specifically, we calculate Spearman's rank correlation coefficient between Ψ and U_k. The results are shown in Fig. 4 (a-c). Encouragingly, we find that, regardless of the CWS system, its cross-dataset performance is highly correlated with our proposed measure Ψ.

Application: Multi-source Transfer
Given a target domain D_t, the above quantitative and qualitative analysis shows that the measure Ψ can quantify the importance of different source domains D_{s_1}, ..., D_{s_N}, allowing us to select suitable ones for data augmentation. Next, we show how to use Ψ to make better choices of source domains among the candidates. We take ctb as the test object and continuously add the training samples of the above seven datasets in three different orders: Rand-, Max-, and Min-select. Alg. 1 shows the decoding process for the dataset order. We choose the multi-criteria segmenter proposed by Chen et al. (2017) as our training framework for multiple datasets.
Result Fig. 5 illustrates the changes in F1-score as more source domains are introduced in the three different orders. We run a Friedman test with the null hypothesis that the order in which training sets are introduced has no influence on the performance of a given model. The significance testing result shows that the training sets introduced with Max-, Min-, and Rand-select are significantly different (p = 8.0 × 10^−3 < 0.05). We observe from Fig. 5 that more training samples are not a guarantee of better results for CWS models, due to the criterion discrepancies between different datasets.
Specifically, the Max-select operation helps us find an optimal set of source domains (ctb, sxu, ncc, cityu), on which the model achieves the best results, outperforming by a significant margin the result of Chen et al. (2017), which was trained on nine datasets (two more than ours). Regarding the two baseline decoding strategies (Min-select and Rand-select), the best performance on ctb is in both cases obtained only when all seven training sets are used. These observations indicate that, when introducing multiple training sets for data augmentation, ordering source domains by their distance to the development set helps us select which parts of them are useful. The measure Ψ proposed in this paper effectively quantifies this order (without any learning process), providing a novel solution for multi-source transfer learning.
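Under our reading of Alg. 1 (a hedged sketch; the exact decoding procedure is in the paper's algorithm), Max-select amounts to ordering candidate sources by decreasing Ψ against the target's development set:

```python
def max_select(target_dev, sources, psi):
    """Order candidate source datasets by decreasing Psi(source, target_dev),
    i.e. by criterion similarity to the target. Min-select reverses this
    order, and Rand-select shuffles it."""
    return sorted(sources, key=lambda s: psi(s, target_dev), reverse=True)
```

With `psi` precomputed from corpus statistics (as in the Ψ sketch above), source domains can be added one by one in this order, stopping when the development score stops improving.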

Discussion
We summarize the main observations from our experiments and give preliminary answers to our proposed research questions.

Does the existing excellent performance imply a perfect CWS system? No. Beyond giving this unsurprising answer, we present an interpretable evaluation method that helps us diagnose the weaknesses of existing top-performing systems and the relative merits of pairs of systems. For example, we find that even top-scoring BERT-based models still cannot deal well with words with low label consistency or with long words, and that BERT is inferior to ELMo as an encoder in dealing with long sentences.

Is there a one-size-fits-all system? No: the best-performing systems on different datasets frequently involve diverse neural architectures. Although this question can be answered relatively easily by simply looking at the overall results of different systems on diverse datasets (Sec. 2), we take a step further towards how to choose between them (BERT vs. ELMo, LSTM vs. CNN) by conducting dataset-bias-aware aided-diagnosis (Sec. 3.6).

Can we design a measure to quantify the discrepancies among different criteria? Yes. We first verify that the label consistency of words and word length have a consistent impact on CWS performance. Based on this, we design a measure to quantify the distance between different datasets, which correlates well with cross-dataset performance and can be used for multi-source transfer learning, helping us avoid negative transfer.

Figure 1 :
Figure 1: The attribute definitions for the word "图书馆 (library)" in the sentence "图书馆在节假日会关闭" (The library is closed on holidays); its ground-truth label is BME. The text in each circle is the abbreviation of an attribute name, and the text in grey and in pink gives the full name and the attribute value, respectively. "con." in grey denotes consistency.

Figure 3 :
Figure 3: 2D visualization of the distances between datasets, computed based on the data-wise measure Ψ and on the model-wise U averaged over the seven datasets, respectively. The weight between datasets i and j is transformed into an undirected edge; the distance computed from U is averaged over the eight models.

Figure 4 :
Figure 4: Spearman's rank correlation coefficient between Ψ and U_k.

Figure 5 :
Figure 5: The change in F1-score as more source domains are introduced in three different orders: Max-, Min-, and Rand-select. The red dotted line is the result reported by Chen et al. (2017) with the same model, trained on nine datasets.

Table 5 :
The relationship between different pairs of datasets, measured by the data-wise Ψ and the model-wise Û_k. Here k denotes the model Cw2vBavgLstmCrf.