Multi-Grained Chinese Word Segmentation

Traditionally, word segmentation (WS) adopts the single-grained formalism, where a sentence corresponds to a single word sequence. However, Sproat et al. (1996) show that the inter-native-speaker consistency ratio over Chinese word boundaries is only 76%, indicating that single-grained WS (SWS) imposes unnecessary challenges on both manual annotation and statistical modeling. Moreover, WS results of different granularities can be complementary and beneficial for high-level applications. This work proposes and addresses multi-grained WS (MWS). We build a large-scale pseudo MWS dataset for model training and tuning by leveraging the annotation heterogeneity of three SWS datasets. Then we manually annotate 1,500 test sentences with true MWS annotations. Finally, we propose three benchmark approaches by casting MWS as constituent parsing and sequence labeling. Experiments and analysis lead to many interesting findings.


Introduction
As the first processing step of Chinese language processing, word segmentation (WS) has been extensively studied and has made great progress during the past decades, thanks to the annotation of large-scale benchmark datasets, among which the most widely used are Microsoft Research Corpus (MSR) (Huang et al., 2006), Peking University People Daily Corpus (PPD) (Yu et al., 2003), and Penn Chinese Treebank (CTB) (Xue et al., 2005). Table 1 gives an example sentence segmented under different guidelines. Meanwhile, WS approaches have gradually evolved from maximum matching based on lexicon dictionaries (Liu and Liang, 1986), to path searching in segmentation graphs based on language modeling scores and other statistics (Zhang and Liu, 2002), to character-based sequence labeling (Xue, 2003), to shift-reduce incremental parsing (Zhang and Clark, 2007). Recently, neural network models have also achieved success by effectively learning representations of characters and contexts (Zheng et al., 2013; Pei et al., 2014; Ma and Hinrichs, 2015; Cai and Zhao, 2016; Liu et al., 2016).
To date, all the labeled datasets adopt the single-granularity formalization, and previous research mainly focuses on single-grained WS (SWS), where one sentence is segmented into a single word sequence. Although different WS guidelines share the same high-level criterion for word boundaries (a character string that is closely combined and stably used forms a word), people diverge greatly due to individual differences in knowledge, living environment, etc. An anonymous reviewer kindly points out that Vladímir Skalička of the Prague School claimed that, unlike "isolating" languages such as French and English, Chinese belongs to the "polysynthetic" type, in which compound words are normally produced from indigenous morphemes (Jernudd and Shapiro, 1989). The vague distinction between morphemes and compounds also contributes to the divergence in how people perceive the concept of a word. Sproat et al. (1996) show that the consensus ratio over word boundaries is only 76% among Chinese native speakers who have not been trained on a common guideline. To fill this gap, WS guidelines need to further group words into many types and provide illustrative examples for each type. Nevertheless, it is very challenging even for well-trained annotators to fully grasp the guidelines and to be consistent on uncovered cases. For example, Xiu (2013) (in Tables 1-3) shows that about 3% of characters are inconsistently segmented in the PPD training data used in SIGHAN Bakeoff 2005 (Emerson, 2005). We have also observed many inconsistent cases in all of MSR/PPD/CTB during this work. In short, SWS imposes a great challenge on data annotation and, as a side effect, forces statistical models to learn the subtleties of annotation guidelines rather than the true WS ambiguities. From another perspective, WS results of different granularities may be complementary in supporting applications such as information retrieval (IR) (Liu et al., 2008) and machine translation (MT) (Su et al., 2017).
On the one hand, coarse-grained words enable statistical models to perform more exact matching and analysis. On the other hand, fine-grained words are helpful in both reducing data sparseness and supporting deeper understanding of language.¹ To solve the above two issues for SWS, this paper proposes and addresses multi-grained WS (MWS). Given an input sentence, the goal is to produce a hierarchy structure of all words of different granularities, as illustrated in Figure 1. To tackle the lack of labeled data, we build a large-scale pseudo MWS dataset for model training and tuning by automatically converting annotations of three heterogeneous SWS datasets (i.e., MSR/PPD/CTB) based on the recently proposed coupled sequence labeling approach of Li et al. (2015). In order to fully investigate the problem, we manually annotate 1,500 test sentences with true MWS annotations. Finally, we propose three benchmark approaches by casting MWS as constituent parsing and sequence labeling problems. Experiments and data analysis lead to many interesting findings.
¹ Words in CTB are generally more fine-grained than those in PPD and MSR, probably due to the requirement of annotating syntactic structures.
We will release the newly annotated data and the code of the benchmark approaches at http://hlt.suda.edu.cn/~zhli. However, due to licensing issues, we may not be able to directly release all the pseudo MWS datasets. Instead, we will launch a web service for obtaining MWS annotations given a sentence with one of the MSR/PPD/CTB annotations.

Pseudo MWS Data Conversion
This section introduces the process of gathering pseudo MWS data by making use of the annotation heterogeneity of the three existing datasets, i.e., MSR/PPD/CTB.

Annotation Heterogeneity
MSR is a manually labeled corpus with word boundaries and named entity tags, and is annotated by Microsoft Research Asia for supporting Chinese text processing (Huang et al., 2006). The key characteristic of MSR is that it treats named entities as single words. For example, " (Great Hall of the People)" is a location and forms a word in Table 1. In general, MSR is more coarse-grained than PPD and CTB. PPD is a large-scale corpus with word boundaries, POS tagging, and phonetic notations to facilitate Chinese information processing, and is annotated by the Institute of Computational Linguistics at Peking University (Yu et al., 2003). Based on the Penn Chinese Treebank Project, CTB is built to create a Mandarin Chinese corpus with syntactic bracketing (Xue et al., 2005). We find that CTB is more fine-grained in word boundaries than MSR and PPD, since syntactic annotation tends to require deeper understanding of a sentence. For example, Table 5 reports the average number of characters per word in each corpus, and confirms our observations.
For a better understanding of annotation heterogeneity, we summarize in Appendix A the high-frequency differences among the three datasets that we observed and gathered during this study. However, it is difficult to obtain a complete list of annotation correspondences among the three datasets, since there are too many low-frequency and irregular cases. Moreover, we also observe many inconsistent annotations of the same word, or of words with similar structures, in all three datasets, as shown in Appendix B.

Coupled WS for Conversion: MSR/PPD as Example
This section introduces how to automatically produce high-quality PPD-side WS labels for a sentence with MSR-side gold-standard WS labels, by leveraging the two non-overlapping SWS datasets of MSR and PPD with the coupled sequence labeling approach of Li et al. (2015) and Li et al. (2016). Figure 2 shows the workflow. Given a sentence $x = [c_1, ..., c_i, ..., c_n]$, the coupled model aims to produce a sequence of bundled tags $t = [t^a_1 t^b_1, ..., t^a_i t^b_i, ..., t^a_n t^b_n]$, where $t^a_i$ and $t^b_i$ are two labels corresponding to the two heterogeneous guidelines respectively. Table 2 gives an example of coupled WS on MSR/PPD. We employ the standard four-tag label set to mark word boundaries of one granularity, in which B, I, and E respectively indicate that the character is at the beginning, inside, or end of a word, and S indicates a single-character word. The bottom row shows the gold-standard bundled tag sequence.
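The four-tag scheme and the bundling of two tag sequences can be illustrated with a minimal sketch. The function names, the underscore separator, and the Latin-letter placeholder sentence are assumptions made for illustration; they are not taken from the released code of Li et al. (2015).

```python
from typing import List


def words_to_bies(words: List[str]) -> List[str]:
    """Map one segmentation of a sentence to one B/I/E/S tag per character."""
    tags = []
    for word in words:
        if not word:
            raise ValueError("empty word in segmentation")
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character, zero or more inside characters, last character
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return tags


def bundle(tags_a: List[str], tags_b: List[str], sep: str = "_") -> List[str]:
    """Pair the per-character tags of two guidelines into bundled tags."""
    if len(tags_a) != len(tags_b):
        raise ValueError("both tag sequences must cover the same characters")
    return [a + sep + b for a, b in zip(tags_a, tags_b)]


# Placeholder sentence "ABCDEF" segmented under two hypothetical guidelines.
side_a = words_to_bies(["ABC", "DEF"])      # coarser segmentation
side_b = words_to_bies(["AB", "C", "DEF"])  # finer segmentation
bundled = bundle(side_a, side_b)
```

Each element of `bundled` corresponds to one character and carries its position under both segmentations at once.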
One key advantage of the coupled model is that it learns directly from two non-overlapping heterogeneous training datasets, where each dataset only contains single-side gold-standard labels. To deal with this partial (or incomplete) labeling issue, Li et al. (2015) project each single-side label to a set of bundled labels by considering all labels at the missing side, as shown in the second row in Table 2. Such ambiguous labelings are used for model supervision.
Table 2: Two WS labels are bundled to represent MSR/PPD annotations for a character. The ambiguous labeling is obtained by supposing this sentence has MSR-side gold-standard annotations.
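The projection from single-side labels to ambiguous bundled labels can be sketched as follows. This is a simplified illustration that pairs the known tag with every tag of the missing side without any pruning; whether the original implementation prunes impossible combinations is not stated here, so that behavior, like the function name, is an assumption.

```python
from typing import List, Set, Tuple

TAGS = ("B", "I", "E", "S")


def ambiguous_labels(gold_tags: List[str], gold_side: str = "a") -> List[Set[Tuple[str, str]]]:
    """For each character, pair the known tag with every tag of the missing side.

    gold_side says which position of the (a, b) bundle holds the gold tag.
    """
    if gold_side not in ("a", "b"):
        raise ValueError("gold_side must be 'a' or 'b'")
    result = []
    for gold in gold_tags:
        if gold_side == "a":
            result.append({(gold, other) for other in TAGS})
        else:
            result.append({(other, gold) for other in TAGS})
    return result


# A two-character word with gold tags on side "a" only.
candidates = ambiguous_labels(["B", "E"], gold_side="a")
```

Each position then admits four bundled tags, and the model is supervised by the whole set rather than by a single label.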
Under a traditional CRF, the coupled model defines the score of a bundled tag sequence as
$$\mathrm{Score}(x, t; \theta) = \theta \cdot \begin{bmatrix} f_{joint}(x, t^a, t^b) \\ f_{sep\_a}(x, t^a) \\ f_{sep\_b}(x, t^b) \end{bmatrix}$$
where $f_{joint}(\cdot)$ are the joint features whereas $f_{sep\_a}(\cdot)$ and $f_{sep\_b}(\cdot)$ are the separate features. Li et al. (2015) demonstrate that the joint features capture the implicit mappings between heterogeneous annotations, while the back-off separate features work as a remedy for the sparseness of the joint features. In their case study of POS tagging, Li et al. (2015) show that the coupled model improves tagging accuracy by 95.0 − 94.1 = 0.9% on CTB5-test over the baseline non-coupled model trained on a single training dataset.
More importantly, they show that the coupled model can be naturally used for the task of annotation conversion, where second-side labels are automatically annotated given one-side gold-standard labels. The given one-side tags are used to obtain ambiguous labelings, as shown in Table 2, and the coupled model finds the best bundled tag sequence in the constrained search space, instead of in the whole bundled tag space, hence greatly reducing the difficulty. Li et al. (2015) report that the coupled model can improve conversion accuracy on POS tagging by 93.9 − 90.6 = 3.3% over the non-coupled model.
Figure 3 shows the workflow of producing pseudo MWS data with three separately trained coupled models. Note that one coupled model can perform conversion between only one pair of annotation standards, and thus three coupled models are required for three annotation standards. An alternative is to directly train one coupled model on MSR/PPD/CTB by extending the approach of Li et al. (2015) from two guidelines to three, which would lead to a much larger bundled tag space. For simplicity, we directly employ their released code in this work, and leave that alternative for future exploration.

Producing Pseudo MWS Data
After conversion, we obtain 9 pseudo MWS datasets (i.e., MSR/PPD/CTB-train/dev/test) and represent each sentence in a hierarchy structure as shown in Figure 1. Note that the guideline-specific information is thrown away, since we do not care which word belongs to which guideline.
In the resulting pseudo MWS data, we find about 0.08% of words overlap with other words, meaning a string "ABC" is segmented into "A/BC" and "AB/C" in two different annotations. We have manually checked these words, and find almost all those cases are caused by conversion errors. This confirms that our treatment of MWS as a hierarchy structure is reasonable.
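The overlap check described above can be sketched with words represented as half-open character spans. The helper names and the span representation are assumptions made for illustration; two spans are counted as overlapping only when they cross, not when one contains the other.

```python
from itertools import combinations
from typing import List, Tuple

Span = Tuple[int, int]  # half-open character offsets: (start, end)


def spans_from_words(words: List[str]) -> List[Span]:
    """Convert one segmentation into character spans."""
    spans, start = [], 0
    for word in words:
        spans.append((start, start + len(word)))
        start += len(word)
    return spans


def crossing_pairs(spans: List[Span]) -> List[Tuple[Span, Span]]:
    """Return pairs of spans that partially overlap (neither contains the other)."""
    pairs = []
    # Sorting guarantees a[0] <= b[0] for every generated pair (a, b).
    for a, b in combinations(sorted(set(spans)), 2):
        if a[0] < b[0] < a[1] < b[1]:
            pairs.append((a, b))
    return pairs


# The string "ABC" segmented as "A/BC" and as "AB/C".
merged = spans_from_words(["A", "BC"]) + spans_from_words(["AB", "C"])
conflicts = crossing_pairs(merged)
```

A sentence whose merged spans yield no crossing pair can be represented as a hierarchy directly.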

Manual Annotation
In order to fully investigate the MWS problem, we have manually created a true MWS dataset of 1,500 sentences for final evaluation. From each test dataset in Table 5, we randomly sample 500 sentences with converted pseudo MWS annotations for manual correction. First, two coauthors of this work spent about two hours each day for two weeks on manual correction of the pseudo MWS annotations. During this period, we summarized a list of high-frequency corresponding patterns among the three guidelines (see Appendix A), and also wrote a simple program to automatically detect inconsistent annotations of given words in different training datasets, so that annotators can use the outputs of the program to decide ambiguous cases, which we find extremely helpful for annotation.
Then, we employ 10 postgraduate students as our annotators, who have different levels of familiarity with WS annotation. Before formal annotation, the annotators are trained for two hours on the basic concepts of MWS, high-frequency correspondences among the three guidelines, and the use of the outputs of the program. We also encourage the annotators to access the three training datasets directly to study concrete cases in real contexts. Moreover, annotators are asked to recheck their annotations before final submission to improve quality.
To measure the inter-annotator consistency, 150 sentences (10%) are sampled for double annotation, and are grouped into four batches for four pairs of annotators. After annotation, the two annotators on the same batch compare their results and produce a consensus submission through discussion. The annotation process lasts for four days, and each annotator spends about 8 hours in total on completing 160 sentences on average.
Table 3 compares data statistics on the 1,500 sentences before and after manual annotation. The second column reports the number of words, and the last three columns report the distribution of words according to their granularity levels. To illustrate how the distribution is obtained, we take Figure 1 as an example, which contains 1 single-grained word, 9 two-grained words, and 7 three-grained words.³ Table 3 shows that only 71.6% of all words are single-grained, which is roughly consistent with the inter-native-speaker consistency ratio (76%) in Sproat et al. (1996). Among multi-grained words, 26.8 / (26.8 + 1.6) = 94.4% are two-grained. Manual annotation clearly increases both the number of words, by (45,279 − 44,593) / 44,593 = 1.5%, and the proportion of multi-grained words, by 74.5 − 71.6 = 2.9%. In fact, during annotation, we also feel that multi-granularity phenomena are under-represented in the pseudo MWS data. The reason may be two-fold. First, the conversion models tend to suppress granularity differences, since most words have the same granularity in different datasets. Second, the existence of many inconsistencies within the same dataset also makes the conversion models more reluctant to produce multi-grained words.
The inter-annotator consistency ratio is 3859 / 3935 = 98.07%, where the denominator is the number of words after merging the submissions of all annotator pairs, and the numerator is the number of consensus words. We argue that the consistency ratio is not high, considering that most words do not need correction in the pseudo MWS annotations. In fact, we find that this annotation task is very difficult, since the annotators must consider three guidelines simultaneously. The main source of inconsistency for all four annotator pairs is the situation where one annotator notices a mistake while the other overlooks it. To solve this issue, our long-term plan is to compile a unified MWS guideline by integrating existing SWS guidelines, and to gradually improve it through more manual MWS annotation.
³ Formally, we call a word s three-grained if there are two other words s1 and s2 satisfying any one of the following conditions: 1) s2 ∈ s1 ∈ s (like " " in Figure 1); 2) s2 ∈ s ∈ s1 (like " "); 3) s ∈ s1 ∈ s2 (like " "), where ∈ means substring. The definition of two-grained words is analogous; all other words are single-grained.
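The granularity level of a word can be computed from the nesting of word spans. The sketch below is one possible reading of the definition: a word's level is taken as the length of the longest chain of nested words that passes through it. The span representation and function name are assumptions made for illustration, and the input is assumed to be a hierarchy without crossing spans.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # half-open character offsets: (start, end)


def granularity_levels(spans: List[Span]) -> Dict[Span, int]:
    """Map each word span to the length of the longest nesting chain through it."""
    unique = sorted(set(spans))

    def contains(outer: Span, inner: Span) -> bool:
        return outer != inner and outer[0] <= inner[0] and inner[1] <= outer[1]

    def down(span: Span) -> int:
        # longest chain of words nested strictly inside `span`
        return max((1 + down(c) for c in unique if contains(span, c)), default=0)

    def up(span: Span) -> int:
        # longest chain of words strictly containing `span`
        return max((1 + up(p) for p in unique if contains(p, span)), default=0)

    return {span: 1 + up(span) + down(span) for span in unique}


# Placeholder hierarchy over five characters:
# (0,4) contains (0,2) and (2,4); (0,2) contains (0,1) and (1,2); (4,5) stands alone.
levels = granularity_levels([(0, 4), (0, 2), (2, 4), (0, 1), (1, 2), (4, 5)])
```

Under this reading, a level of 1 corresponds to a single-grained word, 2 to a two-grained word, and 3 to a three-grained word.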

Benchmark MWS Approaches
There has recently been a surge of interest in applying neural network models to both parsing and sequence labeling tasks. In this work, we propose three simple benchmark approaches for MWS, inspired by recent neural models for constituent parsing (Cross and Huang, 2016) and SWS (Pei et al., 2014).

MWS as Constituent Parsing
Due to its hierarchy structure shown in Figure 1, we naturally cast MWS as a constituent parsing problem, where characters are leaf nodes, "C" represents a character, "W" represents a word, and "X" means that the spanning word cannot be further merged into a more coarse-grained word.
We employ the recently proposed transition-based constituent parser of Cross and Huang (2016) due to its simplicity and competitive performance on different parsing benchmark datasets. In the transition system, a stack S stores processed tokens and partial trees collected so far; a queue Q contains unprocessed tokens; structural and labeling decisions are made alternately to advance the state until a complete tree forms. The network architecture is composed of two parts: 1) two cascaded bidirectional LSTM layers to encode the input token sequence, as shown in Figure 4; 2) two separate multilayer perceptrons (MLPs) to make structural/labeling decisions based on 4/3 simple LSTM span features. A span feature represents a sentence span (i, j) by concatenating the element-wise differences of BiLSTM outputs, which following Cross and Huang (2016) can be written as $r_{i,j} = [f_j - f_i; b_i - b_j]$, where $f$ and $b$ denote the forward and backward LSTM outputs.
To adapt the original parsing model to our MWS task, we concatenate the bichar embedding $e_{c_{i-1}c_i}$ with the single-char embedding $e_{c_i}$ as input to the first-layer BiLSTM, inspired by Pei et al. (2014), who show that bichar embeddings are very helpful for SWS.
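A span feature built from differences of BiLSTM outputs can be sketched in plain Python. The fencepost indexing (one forward and one backward vector per position between tokens), the argument names, and the toy vectors are assumptions made for illustration; this is not the implementation of Cross and Huang (2016).

```python
from typing import List

Vector = List[float]


def span_feature(fwd: List[Vector], bwd: List[Vector], i: int, j: int) -> Vector:
    """Represent span (i, j) by concatenating two element-wise differences.

    fwd[k] and bwd[k] are assumed to be the forward and backward LSTM outputs
    at fencepost k, so a sentence of n tokens has n + 1 vectors in each list.
    """
    if not 0 <= i < j < len(fwd) or len(fwd) != len(bwd):
        raise ValueError("invalid span or mismatched inputs")
    forward_part = [fj - fi for fi, fj in zip(fwd[i], fwd[j])]   # f_j - f_i
    backward_part = [bi - bj for bi, bj in zip(bwd[i], bwd[j])]  # b_i - b_j
    return forward_part + backward_part


# Toy 2-dimensional outputs at four fenceposts (values are arbitrary).
fwd = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
bwd = [[6.0, 7.0], [4.0, 5.0], [2.0, 3.0], [0.0, 1.0]]
feature = span_feature(fwd, bwd, 1, 3)
```

The result has twice the dimension of a single LSTM output, regardless of how long the span is.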

MWS as Sequence Labeling
It is also straightforward to model MWS as a sequence labeling task by replacing SWS labels with MWS labels for each character. Table 4 encodes the MWS structure in Figure 1 with a sequence of MWS labels. The idea is to concatenate multiple SWS tags for one character to denote the positions of the character within words of different granularities. Note that each MWS label contains at most three SWS labels, since we only consider three SWS datasets in this work. Here, we organize the SWS labels in fine-to-coarse order of granularity.
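The label encoding can be sketched as follows, assuming a hierarchy given as half-open character spans. The exact label format of Table 4 is not reproduced here, so the plain concatenation of tags and the function names are hypothetical choices made for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open character offsets: (start, end)


def bies(span: Span, k: int) -> str:
    """B/I/E/S tag of character position k within the word `span`."""
    start, end = span
    if end - start == 1:
        return "S"
    if k == start:
        return "B"
    if k == end - 1:
        return "E"
    return "I"


def mws_labels(spans: List[Span], n: int) -> List[str]:
    """Concatenate, per character, the SWS tags of all covering words (fine to coarse)."""
    labels = []
    for k in range(n):
        covering = sorted(
            (s for s in set(spans) if s[0] <= k < s[1]),
            key=lambda s: s[1] - s[0],  # shorter (finer) words first
        )
        labels.append("".join(bies(s, k) for s in covering))
    return labels


# Placeholder hierarchy over four characters:
# (0,4) contains (0,2) and (2,4); (0,2) contains (0,1) and (1,2).
labels = mws_labels([(0, 4), (0, 2), (2, 4), (0, 1), (1, 2)], n=4)
```

Because nested words covering the same character always differ in length, sorting by length gives a well-defined fine-to-coarse order.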
For simplicity and fair comparison, we adopt a network architecture similar to that of the parsing approach, [...] into a single-hidden-layer MLP.

MWS as SWS Aggregation
Instead of directly training an MWS model on the three pseudo MWS training datasets, we can also train three separate SWS models on the three SWS training datasets. Given an input sentence, we apply the three SWS models and then merge their outputs as MWS results.
The network architecture is the same as the sequence labeling model in Section 4.2, except that the MLP outputs correspond to SWS labels instead of MWS labels.
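The merging step can be sketched as a union of the word spans produced by the separate segmenters. The function name and the span representation are assumptions made for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open character offsets: (start, end)


def aggregate(segmentations: List[List[str]]) -> List[Span]:
    """Merge several segmentations of the same sentence into one set of word spans."""
    if not segmentations:
        return []
    sentence = "".join(segmentations[0])
    merged = set()
    for words in segmentations:
        if "".join(words) != sentence:
            raise ValueError("all segmentations must cover the same sentence")
        start = 0
        for word in words:
            merged.add((start, start + len(word)))
            start += len(word)
    return sorted(merged)


# Three hypothetical SWS outputs for the placeholder sentence "ABCD".
mws_spans = aggregate([["AB", "CD"], ["ABCD"], ["A", "B", "CD"]])
```

Nothing in this union prevents two words from crossing each other, since the three outputs are produced independently.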

Experiments
Data: for MSR, we adopt the training/test datasets of the SIGHAN Bakeoff 2005 (Emerson, 2005), and cut off 10% random training sentences as the dev data following [...]; for PPD and CTB, we follow Li et al. (2015) and directly adopt their datasets and data split. Table 5 shows the data statistics.
Evaluation Metrics: the goal of MWS is to precisely produce all words of different granularities given the input sentence. Therefore, to balance precision ($P = \frac{\#\mathrm{Word}_{gold \cap sys}}{\#\mathrm{Word}_{sys}}$) and recall ($R = \frac{\#\mathrm{Word}_{gold \cap sys}}{\#\mathrm{Word}_{gold}}$), we use the F1 score ($F1 = \frac{2PR}{P+R}$) as in SWS.
Hyper-parameter: we implement all our approaches based on the code released by Cross and Huang (2016), with extensions such as adding bichar embeddings and supporting sequence labeling. For simplicity, char and bichar embeddings are randomly initialized following Cross and Huang (2016). The dimensions of char and bichar embeddings are both 50, and other hyper-parameters are the same as in Cross and Huang (2016). In our preliminary experiments, we observe that under their neural network framework, the MWS performance is quite stable when rerunning under random initialization or reasonably altering other hyper-parameters. Due to time limitations, we leave the use of pre-trained embeddings and more hyper-parameter tuning for future exploration.
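The evaluation metrics can be sketched by comparing gold and system words as sets of character spans per sentence. The function name, the span representation, and the corpus-level pooling of counts are assumptions made for illustration.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # half-open character offsets: (start, end)


def precision_recall_f1(gold: List[Set[Span]], system: List[Set[Span]]) -> Tuple[float, float, float]:
    """Compute P, R and F1 over word spans, pooling counts across sentences."""
    if len(gold) != len(system):
        raise ValueError("gold and system must have the same number of sentences")
    correct = sum(len(g & s) for g, s in zip(gold, system))
    n_gold = sum(len(g) for g in gold)
    n_sys = sum(len(s) for s in system)
    p = correct / n_sys if n_sys else 0.0
    r = correct / n_gold if n_gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# One placeholder sentence: four gold words, three system words, two in common.
gold = [{(0, 1), (0, 2), (1, 2), (2, 4)}]
system = [{(0, 2), (2, 4), (0, 4)}]
p, r, f1 = precision_recall_f1(gold, system)
```

Because words of all granularities are pooled in one set per sentence, a system that misses multi-grained words loses recall even when every word it does return is correct.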
Training/test settings: when training the parsing and sequence labeling based MWS models (not SWS aggregation) on MSR/PPD/CTB-train, we adopt the simple corpus weighting strategy used in Li et al. (2015) to balance the contributions of each training dataset. Before each iteration, we randomly sample 10,000 sentences from each training dataset, and merge and shuffle them for one iteration of training. We use the merged MSR/PPD/CTB-dev as the MWS dev data for model selection. For the SWS aggregation model, three SWS models are separately trained on the three training/dev datasets. For evaluation, the three independently produced SWS outputs for a sentence are merged into one MWS result.
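The per-iteration sampling can be sketched as below. Sampling without replacement, capping the sample at the corpus size, and the function and parameter names are all assumptions made for illustration; the text only states that 10,000 sentences are sampled from each dataset, then merged and shuffled.

```python
import random
from typing import Dict, List, Optional


def sample_iteration(corpora: Dict[str, List], per_corpus: int = 10000,
                     rng: Optional[random.Random] = None) -> List:
    """Draw an equal-sized random sample from each corpus, then merge and shuffle."""
    rng = rng or random.Random()
    batch = []
    for sentences in corpora.values():
        k = min(per_corpus, len(sentences))  # assumption: cap at corpus size
        batch.extend(rng.sample(sentences, k))
    rng.shuffle(batch)
    return batch


# Toy corpora of different sizes; integers stand in for sentences.
corpora = {
    "MSR": list(range(0, 100)),
    "PPD": list(range(100, 150)),
    "CTB": list(range(200, 205)),
}
iteration_data = sample_iteration(corpora, per_corpus=10, rng=random.Random(0))
```

Equal-sized samples keep a large corpus from dominating each iteration, which is the point of the weighting strategy.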
In all experiments, training stops when the F-score on the dev data does not improve in 20 consecutive iterations, and we choose the model that performs best on the dev data for final evaluation.
Main results: Table 6 reports the performance of different approaches on both the pseudo MWS dev data and the manually annotated MWS test data. The "#Word" column reports the total number of words returned by the corresponding model; the following three columns show the percentages of words of different granularities; the last "Overlapping" column gives the percent of words that overlap with other words, which only happens in the "SWS aggregation" approach, since no constraint can be applied to the three separate SWS models during testing. From the results, we can draw the following findings.
First, the results suggest that using pseudo training and dev datasets to build an MWS model is feasible, based on two pieces of evidence: 1) our simple benchmark model can reach a high F-score of 96.07% on the manually annotated test data, which is 1.77% higher than directly aggregating the outputs of three SWS models; 2) the P/R/F scores on the pseudo dev data and on the manually labeled test data are quite consistent in general, indicating that it is reliable to use the pseudo dev data for model selection and tuning.
Second, the parsing approach and the sequence labeling approach (with or without bichar embeddings) achieve very similar performance (differing by at most 0.15%). More importantly, the parsing approach produces more words and more multi-grained words than the sequence labeling approach, indicating that it is potentially more appropriate to model MWS as a parsing problem in order to better capture and represent multi-granularity structures. Another possible disadvantage of the sequence labeling approach is that the trained model cannot produce more granularity levels (e.g., four-grained) beyond those in the training data. Nevertheless, compared against the manual annotations in Table 3, both the parsing and sequence labeling approaches retrieve far fewer multi-grained words, which is caused by the under-representation issue of the pseudo training data, as discussed in Section 3.
Third, the SWS aggregation approach achieves the best recall at the price of very low precision on both dev/test data. We believe the reason is that training three SWS models separately, each on one of the three training datasets, has two disadvantages: 1) connections among different guidelines are totally ignored, leading to many overlapping words (1.0%); 2) the smaller training data also degrades the performance of each SWS model.
Finally, using bichar embeddings turns out to be very helpful for MWS, and leads to a 0.97 ∼ 1.18% F-score improvement on dev data and 0.62 ∼ 0.85% on test data, which is consistent with the SWS results in Pei et al. (2014).

Related Work
As far as we know, this is the first work that formally proposes and addresses the problem of Chinese MWS under the data-driven machine learning framework. It is true that the industrial community, driven by practical demand, has long been interested in retrieving words of different granularities from the engineering perspective, based on lexicon dictionaries and heuristic rules (Zhu and Li, 2008; Hou et al., 2010). We also discover two publicly released toolkits, i.e., IKAnalyzer and PoolWord, which consider all substrings in a sentence and return those above a threshold probability as candidate words. In contrast, this paper defines MWS as a strict hierarchy structure, and proposes a supervised learning framework for the problem.
To alleviate the high OOV-ratio issue of character-based sequence labeling, Zhang et al. (2006) and Zhao and Kit (2007) propose subword-based sequence labeling for word segmentation by extracting high-frequency subwords and treating them as the basic labeling units. Li (2011) and Li and Zhou (2012) propose to jointly parse the internal structures of words and the syntactic structure of a sentence. Their definition of internal structures mainly considers prefix or suffix information. They manually annotate the internal structures of words that have high-frequency prefixes or suffixes and leave other words with flat structures in CTB. Zhang et al. (2013) further annotate the internal structures of all words in CTB and then perform character-level parsing with WS labels. Cheng et al. (2015) propose to cope with the multiple WS standard problem based on internal word structures. After close study of the above works, we find that the MWS annotations automatically built in this work actually capture many of the subwords and word-internal structures in previous works. Most importantly, the main focus of previous works is to improve SWS or parsing performance, whereas this work aims to build a hierarchy structure of multi-grained words. We leave the integration of MWS and parsing for future work.
It has long been debated whether there exists an optimal WS granularity for MT, a question further complicated by the inevitable mistakes contained in 1-best WS outputs. Dyer et al. (2008) propose an MT model based on source-language word lattices, obtained by merging the outputs of different segmenters. Xiao et al. (2010) propose joint SWS and MT based on word lattices. Recently, Su et al. (2017) propose a word lattice-based neural MT model. They train many segmenters on MSR/PPD/CTB, and merge the outputs to produce word lattices for source-language sentences, which is similar to our SWS aggregation approach. All the above works show the usefulness of word lattices over a single SWS output. To help IR, Liu et al. (2008) propose a ranking-based WS approach for producing words of different granularities. We believe this work can further help both IR and MT by supplying more accurate MWS results.

Conclusion
This work proposes and addresses the problem of MWS, so that all words of different granularities can be captured in a hierarchy structure given a sentence. We can draw the following interesting findings.
(1) Our annotation conversion approach can gather high-quality pseudo MWS training/dev datasets, and hence it is feasible to use them for model training and tuning.
(2) Manual MWS data annotation tells us that about 28.4% of words are multi-grained, and among them 94.4% are two-grained words.
(3) The parsing and sequence labeling approaches achieve very similar performance, and outperform the SWS aggregation approach by a large margin.
We believe there are many directions to explore for this new task, among which we are particularly interested in three in the near future: 1) improving our benchmark approaches by considering task-specific features and neural network architectures, 2) verifying the usefulness of MWS to high-level applications such as MT, 3) integrating MWS with syntactic parsing in some way by exploiting existing treebanks.