A New Psychometric-inspired Evaluation Metric for Chinese Word Segmentation

Word segmentation is a fundamental task for Chinese language processing. However, with the successive improvements, the standard metric is becoming hard to distinguish state-of-the-art word segmentation systems. In this paper, we propose a new psychometric-inspired evaluation metric for Chinese word segmentation, which addresses to balance the very skewed word distribution at different levels of difficulty 1 . The performance on a real evaluation shows that the proposed metric gives more reasonable and distinguishable scores and correlates well with human judgement. In addition, the proposed metric can be easily extended to evaluate other sequence labelling based NLP tasks.


Introduction
Word segmentation is a fundamental task for Chinese language processing. In recent years, Chinese word segmentation (CWS) has undergone great development, which is, to some degree, driven by evaluation conferences of CWS, such as SIGHAN Bakeoffs (Emerson, 2005;Levow, 2006;Jin and Chen, 2008;Zhao and Liu, 2010). The current state-of-the-art methods regard word segmentation as a sequence labeling problem (Xue, 2003;Peng et al., 2004). The goal of sequence labeling is to assign labels to all elements in a sequence, which can be handled with supervised learning algorithms, such as maximum entropy (ME) (Berger et al., 1996), conditional random fields (CRF) (Lafferty et al., 2001) and Perceptron (Collins, 2002).
Benefiting from the public datasets and feature engineering, Chinese word segmentation achieves quite high precision after years of intensive research. To evaluate a word segmenter, the standard metric consists of precision p, recall r, and an evenly-weighted F-score f 1 .
However, with the successive improvement of performance, state-of-the-art segmenters are hard to be distinguished under the standard metric. Therefore, researchers also report results with some other measures, such as out-of-vocabulary (OOV) recall, to show their strengths besides p, r and f 1 .
Furthermore, although state-of-the-art methods have achieved high performances on p, r and f 1 , there exists inconsistence between the evaluation ranking and the intuitive feelings towards the segmentation results of these methods. The inconsistence is caused by two reasons: (1) The high performance is due to the fact that the distribution of difficulties of words is unbalanced. The proportion of trivial cases is very high, such as '的 ('s)'，'我们 (we)', which results in that the non-trivial cases are relatively despised. Therefore, a good measure should have a capability to balance the skewed distribution by weighting the test cases.
(2) Human judgement depends on difficulties of segmentations. A segmenter can earn extra credits when correctly segmenting a difficult word than an easy word. Conversely, a segmenter can take extra penalties when wrongly segmenting an easy word than a difficult word.
Taking a sentence and two predicted segmentations as an example: S : 白藜芦醇 是 一 种 酚类 物质 (Trans: Resveratrol is a kind of phenols material.) P1: 白 藜芦 醇 是 一种 酚类 物质 P2: 白藜 芦醇 是 一 种 酚类物 质 We can see that the two segmentations have the same scores in p, r and f 1 . But intuitively, P1 should be better than P2, since P2 is worse even on the trivial cases, such as '酚类 (phenols)' and '物质 (material)'. Therefore, we think that an appropriate evaluation metric should not only provide an all-around quantitative analysis of system performances, but also explicitly reveal the strengths and potential weaknesses of a model.
Inspired by psychometrics, we propose a new evaluation metric for Chinese word segmentation in this paper. Given a labeled dataset, not all words have the same contribution to judge the performance of a segmenter.
Based on psychometric research (Lord et al., 1968), we assign a difficulty value to each word. The difficulty of a word is automatically rated by a committee of segmenters, which are diversified by training on different datasets and features. We design a balanced precision, recall to pay different attentions to words according to their difficulties.
We also give detailed analysis on a real evaluation of Chinese word segmentation with our proposed metric. The analysis result shows that the new metric gives a more balanced evaluation result towards the human intuition of the segmentation quality. We will release the weighted datasets focused this paper to the academic community.
Although our proposed metric is applied to Chinese word segmentation for a case study, it can be easily extended to other sequence labelling based NLP tasks.

Standard Evaluation Metric
The standard evaluation usually uses three measures: precision, recall and balanced F-score.
Precision p is defined as the number of correctly segmented words divided by the total number of words in the automatically segmented corpus.
Recall r is defined as the number of correctly segmented words divided by the total number of words in the gold standard, which is the manually annotated corpus.
F-score f 1 the harmonic mean of precision and recall.
Given a sentence, the gold-standard segmentation of a sentence is w 1 , · · · , W N , N is the number of words. The predicted segmentation is w ′ 1 , · · · , w ′ N ′ , N ′ is the number of words. Among that, the number of words correctly identified by the predicted segmentation is c, and the number of incorrectly predicted words is e.
p, r and f 1 are defined as follows: As a complement to these metrics, researchers also use the recall of out-of-vocabulary (OOV) words to measure the segmenter's performance in detecting unknown words.

A New Psychometric-inspired Evaluation Metric
We involve the basic idea from psychometrics and improve the evaluation metric by assigning weights to test cases.

Background Theory
This work is inspired by the test theory in psychometrics (Lord et al., 1968). Psychologists, as well as educators, have studied the way of analyzing items in a psychological test, such as IQ test. The general idea is that test cases should be given different weights, which reflects the effectiveness of a certain item to a certain measuring object.
Similarly, we consider an evaluation task as a kind of special psychological test. The psychological traits, or the ability of the model, is not an explicit characteristics. We propose that the test cases for NLP task should also be assigned a real value to account for the credits that the tagger earned from answering the test case.
In analogy to the way of computing difficulty in psychometrics, the difficulty of a target word w i is defined as the error rate of a committee in the case of word segmentation.
Given a committee of K base segmenters, we can get K segmentations for sentence w 1 , · · · , W N . We use a mark m k i ∈ {0, 1} to indicate whether word w i is correctly segmented by the k-th segmenter.
The number of words c k correctly identified by the k-th segmenter is Thus, we can calculate the degree of difficulty of each word w i .
This methodology of measuring test item difficulty is also widely applied in assessing standardized exams such as TOEFL (Service, 2000).

Psychometric-Inspired Evaluation Metric
Since the distribution of the difficulties of words is very skew, we design a new metric to balance the weights of different words according to their difficulties. In addition, we also should keep strictly a fair rule for rewards and punishments.
Intuitively, if the difficulty of a word is high, a correct segmentation should be given an extra rewards; otherwise, if the difficulty of a word is low, it is reasonable to give an extra punishment to a wrong segmentation.
Our new metric of precision, recall and balanced F-score is designed as follows.

Balanced Recall Given a new predicted seg
According to the difficulties of each word, we can calculated the reward recall r reward which biased for the difficult cases.
where r ′ reward ∈ [0, 1] is biased recall, which places more attention on the difficult cases and less attention on the easy cases.
Conversely, we can calculated another punishment recall r punishment which biased for the easy cases.
where r punishment ∈ [0, 1] is biased recall, which places more attention on the easy cases and less attention on the difficult cases. r punishment can be interpreted as a punishment as bellows.
From Eq (9), we can see that an extra punishment is given to wrong segmentation for low difficult word. In detailed, for a word w i that is easy to segment, its weights (1 − d i ) is relative higher. When its segmentation is wrong, will be larger, which results to a smaller final score.
To balance the reward and punishment, a balanced recall r b is used, which is the harmonic mean of r reward and r punishment .
is the degree of difficulty of segment s ′ i , which is an average difficulty of the corresponding gold words.
Similar to balanced recall, we use the same way to calculate balance precision p b . Here N ′ is the number of words in the predicted segmentation. d ′ i is the weight for the predicted segmentation unit w ′ i . It equals to the word difficulty of the corresponding word w that cover the right boundary of w ′ i in the gold segmentation.
Balanced F-score The final balanced F-score is

Committee of Segmenters
It is infeasible to manually judge the difficulty of each word in a dataset. Therefore, an empirical method is needed to rate each word. Since the difficulty is also not derivable from the observation of the surface forms of the text, we use a committee of automatic segmenters instead. To keep fairness Table 1: Feature templates. C represents a Chinese character, and T represents the character-based tag in set {B, M, E, S}. The subscript indicates its position relative to the current character, whose subscript is 0. C i:j represents the subsequence of characters form relative position i to j.
and justice of the committee, we need a large number of diversified committee members. Thus, the grading result of committee is fair and accurate, avoiding the laborious human annotation and the deviation caused by the subjective factor of the artificial judgement.

Building the Committee
Base Segmenters The committee is composed of a series of base segmenters, which are based on discriminative character-based sequence labeling method. Each character is labeled as one of {B, M, E, S} to indicate the segmentation. 'B' indicates the beginning character of a word. 'M' indicates the middle character of a word. 'E' indicates the end character of a word. 'S' indicates that the word consists of only a single character.

Diversity of Committee
To objectively assess the difficulty of a word, we need to maintain a large enough committee with diversity.
To encourage diversity among committee members, we train them with different datasets and features. Specifically, each base segmenter adopts one of three types of feature templates (shown in Table 1), and are trained on randomly sampling training sets. To keep a large diversity, we set sampling ratio to be 10%, 20% and 30%. In short, each base segmenter is constructed with a random combination of the candidate feature template and the sampling ratio for training dataset.

Size of Committee
To obtain a valid and reliable assessment for a word, we need to choose the appropriate size of committee. For a given test case, the judgement of its difficulty should be relatively stable. We analyze how the judgement of its difficulty changes as the size of committee increases. Figure 2 show PKU data from SIGHAN 2005 (Emerson, 2005) the difficulty is stable when the sample size is large enough.

Interpreting Difficulty with Linguistic Features
Since we get the difficulty for each word empirically, we naturally want to know whether the difficulty is explainable, as what TOEFL researchers have done (Freedle and Kostin, 1993;Kostin, 2004). We would like to know whether the variation of word difficulty can be partially expalined by a series of traceable linguistic features.
Based on the knowledge about the characteristics of Chinese grammar and the practical experiences of corpus annotation, we consider the following surface linguistic features. In order to explicitly display the relationship between the linguistic predictors and the distribution of the word difficulty at a micro level, we divide the difficulty scale into ten discrete intervals and calculate the distributions of these linguistic features on different ranges of difficulty.
Here, we interpret the difficulties of the words from the perspective of three important linguistic features: Idiom In Chinese, the 4-character idioms have special linguistic structures.
These structure usually form a different pattern that is hard for the machine algorithm to understand. Therefore, 11.43% 58.1% it is reasonable to hypothesize that the an idiom phrase is more likely to be a difficult word for word segmentation task. We can see from Figure 2a that 58.1% of idioms have a difficulty at (0.9,1]. The proportion does increase with the degree of difficulty, which corresponds with the human intuition.

Dissyllabic Word Disyllabic word is a word formed by two consecutive Chinese characters.
We can see from Figure 2b that the frequency of disyllabic words has a negative correlations with the degree of difficulty. This is an interesting result. It means that a two-syllable word pattern is easy for a machine algorithm to recognize. This is consistent with the lexical statistics (Yip, 2000), which shows that dissyllabic words account for 64% of the common words in Chinese.

Out-of-vocabulary Word
Processing out-ofvocabulary (OOV) word is regarded as one of the key factors to the improvement of model performance. Since these words never occur in the training dataset, it is for sure that the word segmentation system will find it hard to correctly recognize these words from the contexts. We can see from Figure 2c that OOV generally has high difficulty. However, a lot of OOV is relatively easy for segmenters. All the linguistic predictors above prove that the degree of difficulty, namely the weight for each word, is not only rooted in the foundation of test theory, but also correlated with linguistic intuition.

Evaluation with New Metric
Here we demonstrate the effectiveness of the proposed method in a real evaluation by reanalyzing the submission results from NLPCC P1 P2 P3 P4 P5 P6 P7 2015 Shared Task 2 of Chinese word segmentation. The dataset of this shared task is collected from micro-blog text. For convenience, we use WB to represent this dataset in the following discussions.
We We compare the standard precision, recall and F-score with our new metric. The result is displayed in Figure 3. Considering the related privacy issues, we will refer to the participants as P1, P2, etc. The order of these participants in the sub-figures is sorted according to the original ranking given by the standard metric in each track. The same ID number refers to the same participants.
It is interesting to see that the proposed metric gives out significantly different rankings for the participants, compared to the original rankings. Based on the standard metric, Participant 1 (P1) ranks the top in closed track while P7 is ranked as the worst in both tracks. However, P2 ranks first under the evaluation of the new metric in the Closed track. P7 also get higher ranking than its original one.

Correlation with Human Judgement
To tell whether the standard metric or the proposed metric is more reasonable, we asked three experts to evaluate the quality of the submissions from the participants. We randomly selected 50 test sentences from the WB dataset. For each test sentence, we present all the submitted candidate segmentation results to the human judges in random order. Then, the judges are asked to choose the best candidate(s) with the highest segmentation quality as well as the second-best candidate(s) among all the submissions. Human judges had no access to the source of the sentences.
Once we collect the human judgement of the segmentation quality, we can compute the score for each participants. If a candidate segmentation result from a certain participant is ranked first for n times, then this participants earned n point. If second for m times, then this participants earned m 2 points. Then we can get the probability of a participants being ranked the best or sub-best by computing n+ m 2 50 . Finally, we get the humanintuition-based gold ranking of the participants through the means of scores from all the human judges.
It is worth noticing that the ranking result of our proposed metric correlates with the human judgements better than that of the standard metric, as is shown in Figure 3. The Pearson correlation between our proposed metric and human judgements are 0.9056 (p = 0.004) for closed session and 0.8799 (p = 0.04) for open session while the Pearson correlation between standard metric and human judgements are only 0.096 (p = 0.836) for closed session and 0.670 (p = 0.216). This evidence strongly supports that the proposed method is a good approximate of human judgements.

Detailed Analysis
Since we have empirically got the degree of difficulty for each word in the test dataset, we can compute the distribution of the difficulty for words that have been correctly segmented. We divided the whole range of difficulty into 10 intervals. Then, we count the ratio of the correct segmented units for each difficulty interval. In this way, we can quantitatively measure to what extent the segmentation system performs on difficult test cases and easy test cases.
As is shown in Figure 4, P7 works better on difficult cases than other systems, but the worst on easy cases. This explains why P7 gets good rank based on the new evaluation metric. Besides, if we compare P1 and P2, we will notice that P2 performs just slightly worse than P1 on easy cases, but much better than P1 on difficult cases. Therefore, conventional evaluation metric rank P1 as the top system because the P1 gets a lot of credits from a large portion of easy cases. Unlike conventional metric, our new metric achieves balance between hard cases and easy cases and ranks P2 as the top system. The experiment result indicates that the new metric can reveal the implicit difference and improvement of the model, while standard metric cannot provide us with such a fine-grained result.

Validity and Reliability
Jones (1994) concluded some important criteria for the evaluation metrics of NLP system. It is very important to check the validity and reliability of a new metric.
Previous section has displayed the validity of the proposed evaluation metric by comparing the evaluation results with human judgements. The evaluation results with our new metric correlated with human intuition well.
Regarding reliability, we perform the paralleltest experiment. We randomly split the test dataset into two halves. These two halves have similar difficulty distribution and, therefore, can be considered as a parallel test. Then different models, including those used in the first experiment, are evaluated on the first half and the second half. The results in Figure 6 shows that the performances of different models with our proposed evaluation metric are significantly correlated in two parallel tests.

Visualization of the Weight
As is known, there might be some annotation inconsistency in the dataset. We find that most of the cases with high weight are really valuable difficult test cases, such as the visualized sentences from WB dataset in Figure 7. In the first sentence, the word 'BMW 族' (NOUN.People who take bus, metro and then walk to the destination) is an OOV word and contains English characters. The weight of this word, as expected, is very high. In the second sentence, the word '素不相 识' (VERB.not familiar with each other) is a 4character Chinese idiom. the conjunction word '就 算' (CONJ.even if) has structural ambiguity. It can also be decomposed into a two-word phrase '就' (ADV.just) and '算' (VERB.count). From the visualization of the weight, we can see that these difficult words are all given high weights.

Comparisons on SIGHAN datasets
In this section, we give comparisons on SIGHAN datasets.
We use four simplified Chinese datasets: PKU and MSR (SIGHAN 2005) as well as NCC and SXU (SIGHAN 2008).
For each dataset, we train four segmenters with varying abilities, based on 20%, 50%, 80% and 100% of training data respectively. The used feature template is F2 in Table 1.  Table 2 shows the different evaluation results with standard metric and our balanced metric. We can see that the proposed evaluation metric generally gives lower and more distinguishable score, compared with the standard metric.

Related work
Evaluation metrics has been a focused topic for a long time. Researchers have been trying to evaluate various NLP tasks towards human intuition (Papineni et al., 2002;Graham, 2015a;Graham, 2015b). Previous work (Fournier and Inkpen, 2012;Fournier, 2013;Pevzner and Hearst, 2002) mainly deal with the near-miss error case on the context of text segmentation. Much attention has been given to different penalization for the error. These work criticize that traditional metrics such as precision, recall and F-score, consider all the error similar. In this sense, some studies aimed at assigning different penalization to the word. We think that these explorations can be regarded as the foreshadowing of our evaluation metric that balances reward and punishment.
Our paper differs from previous research in that we take the difficulty of the test case into consideration, while previous works only focus on the variation of error types and penalisation. We involve the basic idea from psychometrics and improve the evaluation with a balance between difficult cases and easy cases, reward and punishment.
We would like to emphasize that our weighted evaluation metric is not a replacement of the traditional precision, recall, and F-score. Instead, our new weighted metrics can reveal more details that traditional evaluation may not be able to present.

Conclusion
In this paper, we put forward a new psychometric-inspired method for Chinese word segmentation evaluation by weighting all the words in test dataset based on the methodology applied to psychological tests and standardized exams. We empirically analyze the validity and reliability of the new metric on a real evaluation dataset. Experiment results reveal that our weighted evaluation metrics gives more reasonable and distinguishable scores and  correlates well with human judgement. We will release the weighted datasets to the academic community. Additionally, the proposed evaluation metric can be easily extended to word segmentation task for other languages (e.g. Japanese) and other sequence labelling-based NLP tasks, with just tiny changes. Our metric also points out a promising direction for the researchers to take into the account of the biased distribution of test case difficulty and focus on tackling the hard bones of natural language processing.