Semi-supervised Chinese Word Segmentation based on Bilingual Information

This paper presents a bilingual semi-supervised Chinese word segmentation (CWS) method that leverages the natural segmenting information of English sentences. The proposed method involves learning three levels of features, namely, character-level, phrase-level and sentence-level, provided by multiple sub-models. We use a sub-model of conditional random fields (CRF) to learn monolingual grammars, a sub-model based on character-based alignment to obtain explicit segmenting knowledge, and another sub-model based on transliteration similarity to detect out-of-vocabulary (OOV) words. Moreover, we propose a sub-model leveraging a neural network to ensure the proper treatment of the semantic gap and a phrase-based translation sub-model to score the translation probability of the Chinese segmentation and its corresponding English sentences. A cascaded log-linear model is employed to combine these features to segment bilingual unlabeled data, the results of which are used to justify the original supervised CWS model. The evaluation shows that our method achieves superior results compared with those of the state-of-the-art monolingual and bilingual semi-supervised models that have been reported in the literature.


Introduction
Chinese word segmentation (CWS) is generally accepted to be a necessary first step in most Chinese NLP tasks because Chinese sentences are written in continuous sequences of characters with no explicit delimiters (e.g., the spaces in English). Many studies have been conducted in this area, resulting in extensive investigation of the problem of CWS using machine learning techniques in recent years. However, the reliability of CWS that can be achieved using machine learning techniques relies heavily on the availability of a large amount of high-quality, manually segmented data. Because hand-labeling individual words and word boundaries is very difficult (Jiao et al., 2006), producing segmented Chinese texts is very time-consuming and expensive. Although a number of manually segmented datasets have been constructed by various organizations, it is not feasible to combine them into a single complete dataset because of their incompatibility due to the use of various segmenting standards. Thus, it is difficult to build a large-scale manually segmented corpus, and the resulting lack of such a corpus is detrimental to further enhancement of the accuracy of CWS.
To address the scarcity of manually segmented corpora, a number of semi-supervised CWS approaches have been intensively investigated in recent years. These approaches attempt to either learn the predicted label distribution (Jiao et al., 2006) or extract mutual information (Liang et al., 2005; Sun and Xu, 2011; Zeng et al., 2013a) from large-scale monolingual unlabeled data to update the baseline model (trained on manually segmented corpora). In addition to these techniques, several co-training approaches (Zeng et al., 2013b) using character-based and word-based models have also been employed. However, because monolingual unlabeled data contain limited natural segmenting information, in most semi-supervised methods, the objective function tends to be optimized based on the personal experience and knowledge of the researchers. This practice means that these methods can typically yield high performance in certain specialized domains, but they lack generalizability. In contrast with these methods, we propose to leverage bilingual unlabeled data, i.e., a Chinese-English corpus with sentence alignment. Because English sentences are naturally segmented, extracting information from a bilingual corpus is a much more objective task. As the example presented in Fig. 1 shows, the English sentences that correspond to Chinese text can easily help guide better segmentation, and thus, the learning of segmenting information from bilingual data is a very promising approach.
Figure 1: Examples of different segmentations of the same Chinese sentences guided by the English sentences.
In this paper, to obtain high-quality segmenting information from bilingual unlabeled data, we leverage multilevel features using the following steps. First, we integrate character-level features calculated using a conditional random field (CRF) model, which is used to capture the monolingual grammars. Then, we employ a statistical aligner to perform character-based alignment. Given the results of this character-based alignment, we apply several phrase-level features to extract explicit and implicit segmenting information: (1) we use two types of English-Chinese co-occurrence features (one-to-many and many-to-many) to learn the explicit segmenting information of the English sentences, (2) we use the transliteration similarity feature to detect out-of-vocabulary (OOV) words using a phrase-based translation model, and (3) we employ a neural network to calculate the semantic gap between the Chinese and English words to ensure that the Chinese segmentation follows the semantic meanings of the corresponding English sentences as closely as possible. Finally, we employ another phrase-based translation model to perform a sentence-level calculation of the translation probability of the Chinese segmentation and its corresponding English sentences. After obtaining these multilevel features, we normalize them and combine them into two log-linear models in a cascaded structure, which is illustrated in Fig. 2. Finally, we segment the bilingual unlabeled data using the proposed model and use the segmentation of those data to justify the original supervised CWS model, which was trained on a standard manually segmented corpus.
Figure 2: The structure of the cascaded log-linear model with multilevel features.
In fact, several semi-supervised CWS methods that leverage bilingual unlabeled data have previously been proposed (Xu et al., 2008; Chang et al., 2008; Ma and Way, 2009; Chung et al., 2009; Xi et al., 2012). However, most were developed for statistical machine translation (SMT), causing them to focus on decreasing the perplexity of the bilingual data and the word alignment process rather than on achieving more accurate segmentation. These methods achieve significant improvement in SMT performance but are not very suitable for common NLP tasks because in many situations, they ignore the standard grammars to satisfy the needs of SMT. By contrast, we employ various types of features to capture both monolingual standard grammars and bilingual segmenting information, which makes our semi-supervised CWS model effective for other NLP tasks and endows it with higher generalizability.
Our evaluation also shows that our method significantly outperforms the state-of-the-art monolingual and bilingual semi-supervised approaches.

Related Work
First, we review related work on monolingual supervised and semi-supervised CWS methods. Then, we review bilingual semi-supervised CWS.

Monolingual Supervised and Semi-supervised CWS Methods
Considerable efforts have been made in the NLP community in the study of Chinese word segmentation. The most popular supervised approach treats word segmentation as a sequence labeling problem, as first proposed by Xue et al. (2003). Most previous systems have addressed this task using linear statistical models with carefully designed features (Peng et al., 2004; Asahara et al., 2005; Zhang and Clark, 2007; Zhao et al., 2010). However, the primary shortcoming of these approaches is that they rely heavily on a large amount of labeled data, which is very time-consuming and expensive to produce. Thus, the scale of available manually labeled data has placed considerable limitations on the further enhancement of supervised CWS methods.
To address this problem, a number of semi-supervised CWS approaches have been intensively investigated in recent years. For example, Sun and Xu (2011) enhanced their segmentation results by interpolating statistics-based features derived from unlabeled data into a CRF model. Zeng et al. (2013a) introduced a graph-based semi-supervised joint model of Chinese word segmentation and part-of-speech tagging and regularized the learning of a linear CRF model based on the label distributions derived from unlabeled data. However, because monolingual unlabeled data lack natural segmenting information, most previous semi-supervised CWS methods have required certain assumptions to be made regarding their objective functions based on the researchers' personal experiences. By contrast, we leverage bilingual unlabeled data that contain the natural segmentation that is present in English sentences and can therefore extract linguistic knowledge without any manual assumptions or bias.

Bilingual Semi-supervised CWS Methods
Some previous work (Xu et al., 2008; Chang et al., 2008; Ma and Way, 2009; Chung et al., 2009; Xi et al., 2012) has been performed on leveraging bilingual unlabeled data to achieve better segmentation, although most such studies have focused on statistical machine translation (SMT). These approaches leverage the mappings of individual English words to one or more consecutive Chinese characters either to construct a Chinese word dictionary for maximum-matching segmentation (Xu et al., 2004) or to form a labeled dataset for training a sequence labeling model (Peng et al., 2004). Zeng et al. (2014) also used such mappings to bias a supervised segmentation model toward a better solution for SMT. However, because most of these approaches focus on SMT performance, they emphasize decreasing the perplexity of the bilingual data and word alignment rather than improving the CWS accuracy. Thus, they sometimes ignore the standard grammars during segmentation in favor of satisfying the needs of SMT, thereby causing these methods to be rather unsuitable for other NLP tasks. By contrast, we propose to use various types of features to capture syntactic and semantic information and a cascaded log-linear model to maintain balance between the monolingual grammars and the bilingual knowledge.

Multilevel Features
In this section, we describe the three levels of features used in our approach. We propose to use character-level features to capture monolingual grammars and phrase-level and sentence-level features to obtain bilingual segmenting information. Moreover, we describe a cascaded log-linear model by proposing both inner and outer log-linear models.

Character-level Feature
The conditional random field (CRF) model (Lafferty et al., 2001) was first used for CWS tasks by Xue et al. (2003), who treated the CWS task as a sequence tagging problem and demonstrated this model's effectiveness in detecting OOV words.
In this paper, we score the character-level feature in the same manner as defined by Xue et al. (2003). For the $j$th character $c_j$ in the sentence $c_1^J = c_1 \ldots c_J$, the score is calculated from weighted feature functions, where $f_k(y_{j-1}, y_j, c_1^J, j)$ is a feature function and $\lambda_k$ is a learned weight that corresponds to the feature $f_k$. Here, $j$ is the index of the character in the sentence, and $y_{j-1}$ and $y_j$ are the tags of the previous and current characters, respectively.
We do not introduce the CRF-based CWS model in detail here; more information can be obtained from Lafferty et al. (2001) and Xue et al. (2003).
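As a sketch only, assuming the standard linear-chain CRF form of Lafferty et al. (2001), a position-wise score consistent with the feature functions and weights defined above can be written as a weighted sum. The name $\mathrm{Score}_{\mathrm{CRF}}$ and the unnormalized form are assumptions made for illustration and may differ from the exact definition used in this paper.

```latex
\mathrm{Score}_{\mathrm{CRF}}(c_j)
  \;=\; \sum_{k} \lambda_k \, f_k\!\left(y_{j-1},\, y_j,\, c_1^J,\, j\right)
```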

Phrase-level Features
In this section, we first describe English-Chinese character-based alignment. Then, we propose several phrase-level features to obtain explicit and implicit segmenting information from the character-based alignment. Finally, we describe the inner log-linear model that is used to combine the character-level and phrase-level features.

English-Chinese Character-based Alignment
To avoid introducing omissions and mistakes into the linguistic information in the initial segmentations of the bilingual data, we perform a statistical character-based alignment. First, every Chinese character in the bitexts is separated by white spaces so that individual characters are recognized as unique "words" or alignment targets. Then, they are associated with English words using a statistical word aligner. By representing the English and Chinese sentences as $e_1^I = e_1 e_2 \ldots e_I$ and $c_1^J = c_1 c_2 \ldots c_J$, respectively, where $e_i$ and $c_j$ represent single elements of the sentences, we define their alignment as $a_1^K$, of which each element is a span $a_k = \langle s, t \rangle$ that represents the alignment of the English word $e_s$ with the Chinese character $c_t$. Then, the corpus of unlabeled bilingual data can be represented as the set of sentence tuples $\langle e_1^I, c_1^J, a_1^K \rangle$. To obtain the character-based alignment, we employ the open-source toolkit Pialign (Neubig et al., 2011; Neubig et al., 2012), which uses Bayesian learning and inversion transduction grammars.
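The preprocessing step and the sentence-tuple representation can be illustrated with a minimal Python sketch. The names `SentenceTuple`, `separate_characters` and `to_aligner_input` are hypothetical and introduced only for illustration; the sketch does not reproduce the input or output format of any particular aligner.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SentenceTuple:
    """One unlabeled bilingual item <e_1^I, c_1^J, a_1^K> (illustrative container)."""
    english: List[str]                 # e_1 ... e_I, one word per element
    chinese: List[str]                 # c_1 ... c_J, one character per element
    alignment: List[Tuple[int, int]]   # spans <s, t>: e_s aligned with c_t (1-based)


def separate_characters(sentence: str) -> List[str]:
    """Split a Chinese sentence into characters, dropping existing whitespace."""
    return [ch for ch in sentence if not ch.isspace()]


def to_aligner_input(sentence: str) -> str:
    """Join the characters with single spaces so each one is treated as a 'word'."""
    return " ".join(separate_characters(sentence))
```

For instance, a four-character sentence becomes four space-separated tokens, each of which a word aligner can then link to English words.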

Features Obtained from the Character-based Alignment
Given the English-Chinese character-based alignment $a_1^K$, we extract several phrase-level features to optimize the segmentation. For the $j$th character in $c_1^J$, we assume that one of the segmentations of the substring $c_1^j$ can be represented as $w_1^{N+1} = w_1 w_2 \ldots w_{N+1}$. Then, we calculate the scores of each Chinese word $w_n = c_{j_m}^{j_n}$ ($j_m = j_{n-1} + 1$) in $w_1^{N+1}$ using the following features.

English-Chinese One-to-Many Alignment
To evaluate the probability that a sequence of Chinese characters $c_{j_m}^{j_n} = c_{j_m} c_{j_m+1} \ldots c_{j_n}$ should be combined into a word $w_n$ based on the corresponding English sentence, we integrate the feature of English-Chinese one-to-many alignment (one English word is aligned with multiple Chinese characters). First, for any English word $e_i$ in $e_1^I$, the phrase tuple $\langle e_i, c_{j_m}^{j_n} \rangle$ is defined as an aligned One-to-Many phrase tuple if it satisfies the required alignment conditions. Then, for any phrase tuple $\langle e_i, c_{j_m}^{j_n} \rangle$ that satisfies these conditions, the span $\langle i, j_m, j_n \rangle$ is defined as a One-to-Many span and as a member of the set $A_{One}$.
Thus, for each span $\langle i, j_m, j_n \rangle$, the One-to-Many score is calculated from $t(c_{j_m}^{j_n} \mid e_i)$, the translation probability of the phrase tuple $\langle e_i, c_{j_m}^{j_n} \rangle$. Finally, the score for the feature of English-Chinese One-to-Many alignment for $w_n = c_{j_m}^{j_n}$ is derived from the scores of its One-to-Many spans.
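The following Python sketch illustrates how spans of the form $\langle i, j_m, j_n \rangle$ could be collected from a character-based alignment. It assumes, purely for illustration, that a span is a maximal run of consecutive Chinese characters aligned with the same English word; this is a simplifying assumption and not necessarily the exact set of conditions used in this paper. The function name `one_to_many_spans` is hypothetical.

```python
from typing import Dict, List, Tuple


def one_to_many_spans(alignment: List[Tuple[int, int]]) -> List[Tuple[int, int, int]]:
    """Return spans (i, j_m, j_n): English word i aligned with characters j_m..j_n.

    Assumption: a span is a maximal run of consecutive character positions
    aligned with the same English word (single-character runs are kept).
    """
    by_word: Dict[int, List[int]] = {}
    for s, t in alignment:
        by_word.setdefault(s, []).append(t)

    spans: List[Tuple[int, int, int]] = []
    for i in sorted(by_word):
        positions = sorted(set(by_word[i]))
        start = prev = positions[0]
        for t in positions[1:]:
            if t == prev + 1:          # still inside the same contiguous run
                prev = t
                continue
            spans.append((i, start, prev))
            start = prev = t           # a gap starts a new run
        spans.append((i, start, prev))
    return spans
```

A translation probability such as $t(c_{j_m}^{j_n} \mid e_i)$ would then be looked up for each returned span.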

English-Chinese Many-to-Many Alignment
The second phrase-level feature, called English-Chinese Many-to-Many Alignment (multiple English words are aligned with multiple Chinese characters), is used to evaluate the probability that a space should be inserted between $c_n$ and $c_{n+1}$. Similar to One-to-Many alignment, for any sequence of English words $e_{i_1}^{i_2}$ and the Chinese word $w_n = c_{j_m}^{j_n}$, the phrase tuple $\langle e_{i_1}^{i_2}, c_{j_1}^{j_n} \rangle$ is defined as an aligned Many-to-Many phrase tuple if it satisfies a set of conditions, including that $j_1 \le j_m$ and that $j_1$ is the beginning character of a word in $w_n$. Then, for any phrase tuple $\langle e_{i_1}^{i_2}, c_{j_1}^{j_n} \rangle$ that satisfies these conditions, the span $\langle i_1, i_2, j_1, j_n \rangle$ is defined as a Many-to-Many span and as a member of the set $A_{Many}$.
Thus, for each span $\langle i_1, i_2, j_1, j_n \rangle$, the Many-to-Many score is calculated from $t(c_{j_1}^{j_n} \mid e_{i_1}^{i_2})$, the translation probability of the phrase tuple $\langle e_{i_1}^{i_2}, c_{j_1}^{j_n} \rangle$.
Finally, the score for the feature of English-Chinese Many-to-Many alignment for $w_n = c_{j_m}^{j_n}$ is derived from the scores of its Many-to-Many spans.

Transliteration Feature
To account for named entities (NEs), which suffer from sparsity and thus make it difficult to calculate the probabilities discussed above, we introduce a transliteration feature to evaluate the similarities between the pronunciations of Chinese and English words, because many NEs are translated via transliteration. To perform this task, we first introduce an initial NE dictionary and convert each dictionary item by transforming the Chinese word into its pronunciation (represented by the function $F_{py}(\cdot)$) and splitting the English word into its constituent letters (represented by the function $F_{let}(\cdot)$); for example, the Chinese-English entry for "Alice" is converted into "ai l i s i/a l i c e". Then, we train two phrase-based translation models (Chinese-English and English-Chinese) on the data obtained from the converted NE dictionary.
Specifically, we apply two standard log-linear phrase-based SMT models. The GIZA++ aligner (Och and Ney, 2000) is adopted to obtain word alignments from the converted NE dictionary. The grow-diag-final-and heuristic (Koehn et al., 2003) is used to combine the bidirectional alignments in order to extract the phrase translation and reordering tables. A 5-gram language model with Kneser-Ney smoothing is trained on the target language using SRILM (Stolcke et al., 2002). Moses (Koehn et al., 2007) is used as the decoder. Minimum error rate training (MERT) is applied to tune the feature parameters on the development dataset.
Given these two phrase-based translation models, we calculate the score of each span $\langle i, j_m, j_n \rangle$ in $A_{One}$ for the Chinese word $w_n$ using the following formula:

$$S_{tr}(\langle i, j_m, j_n \rangle) = S_{ch\text{-}en}(\langle i, j_m, j_n \rangle) + S_{en\text{-}ch}(\langle i, j_m, j_n \rangle)$$

where $S_{ch\text{-}en}(\langle i, j_m, j_n \rangle) = D_{Lev}(F_{let}(e_i), PT_{ch\text{-}en}(F_{py}(c_{j_m}^{j_n})))$ means that the pronunciation conversion in the Chinese-English direction is performed as follows. First, the English word $e_i$ is split into its constituent letters. Second, the sequence of Chinese characters $c_{j_m}^{j_n}$ is converted into its pronunciation. Third, this pronunciation is input into the Chinese-English phrase-based translation model, and the corresponding translation result is obtained. Finally, the Levenshtein distance between the English letters and the translation result is returned.
$S_{en\text{-}ch}(\langle i, j_m, j_n \rangle)$ can be calculated in exactly the same way.
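The Chinese-English direction of this computation can be sketched in Python as follows. The phrase-based translation model is not reproduced here: the argument `translated_letters` stands in for its output and is a hypothetical placeholder, as are the function names. Lowercasing the English word is an assumption based on the "a l i c e" example above.

```python
from typing import List, Sequence


def f_let(english_word: str) -> List[str]:
    """Split an English word into its constituent (lowercased) letters."""
    return list(english_word.lower())


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance between two token sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def s_ch_en(english_word: str, translated_letters: Sequence[str]) -> int:
    """D_Lev(F_let(e_i), translation of the Chinese pronunciation).

    `translated_letters` is assumed to be the letter sequence produced by the
    Chinese-English transliteration model for F_py of the Chinese characters.
    """
    return levenshtein(f_let(english_word), translated_letters)
```

The English-Chinese direction would swap the roles of the two sides and use the second translation model; the two distances are then added to give the span score.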
We set the score of any span that does not belong to $A_{One}$ to zero, and the transliteration feature score of a word $w_n = c_{j_m}^{j_n}$ is derived from the span scores $S_{tr}(\langle i, j_m, j_n \rangle)$ (Eq. 7).

English-Chinese Semantic Gap Feature
To guarantee that the semantic meanings of the Chinese segmentation match those of the corresponding English sentences as closely as possible, we propose to use a feature based on the English-Chinese semantic gap to ensure the retention of semantic meaning during the segmentation process.
First, we pre-train word embeddings using the open-source toolkit Word2Vec (Mikolov et al., 2013) on the Chinese (segmented using character-level features only) and English sentences separately, thereby obtaining the vocabularies $V_{ch}$ and $V_{en}$ and their corresponding embedding matrices $L_{ch} \in \mathbb{R}^{n \times |V_{ch}|}$ and $L_{en} \in \mathbb{R}^{n \times |V_{en}|}$. Given a Chinese word $w_n$ with an index $i$ in the vocabulary, it is then straightforward to retrieve the word's vector representation by multiplying the embedding matrix by a binary vector $d$ that is equal to zero at all positions except that with index $i$.
Because the word embeddings for the two languages ($L_{ch}$ and $L_{en}$) are learned separately and located in different vector spaces, we suppose that a transformation exists between these two semantic embedding spaces. Thus, we collect all the One-to-Many phrase tuples $\langle e_1, c_{j_1}^{j_2} \rangle$ that satisfy $e_1 \in V_{en}$ and $c_{j_1}^{j_2} \in V_{ch}$ from the entire corpus of bilingual data. Then, we insert the word embedding tuple of each One-to-Many phrase tuple into the set $A_{embed}$. Let us consider a word embedding tuple $\langle p_s, p_t \rangle$ in $A_{embed}$ as an example. We define a bidirectional semantic distance using the parameter $\theta$ as follows:

$$E_{sem}(p_s, p_t; \theta) = E_{sem}(p_s \mid p_t, \theta) + E_{sem}(p_t \mid p_s, \theta) \qquad (9)$$

Here, $E_{sem}(p_s \mid p_t, \theta) = E_{sem}(p_t, f(W_{en}^{ch} p_s + b_{en}^{ch}))$ represents the transformation of $p_s$ and is performed as follows. We first multiply a parameter matrix $W_{en}^{ch}$ by $p_s$, and after adding a bias term $b_{en}^{ch}$, we apply an element-wise activation function $f = \tanh(\cdot)$. Finally, we calculate the Euclidean distance between $p_t$ and the transformed vector. $E_{sem}(p_t \mid p_s, \theta)$ can be calculated in exactly the same way.
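A minimal pure-Python sketch of the embedding lookup and the bidirectional semantic distance is given below. It assumes that both embedding spaces share the dimension $n$ and that the distance is the plain (unsquared) Euclidean norm; the parameter names `W_s2t`, `b_s2t`, `W_t2s` and `b_t2s` are hypothetical stand-ins for the two directional parameter sets in $\theta$.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # row-major, shape n x |V| for embeddings


def lookup(embedding: Matrix, index: int) -> Vector:
    """Return column `index` of the embedding matrix as the product L * d,
    where d is a one-hot vector with a 1 at position `index`."""
    one_hot = [1.0 if k == index else 0.0 for k in range(len(embedding[0]))]
    return [sum(l * d for l, d in zip(row, one_hot)) for row in embedding]


def transform(W: Matrix, b: Vector, p: Vector) -> Vector:
    """Apply f(W p + b) with the element-wise activation f = tanh."""
    return [math.tanh(sum(w * x for w, x in zip(row, p)) + bias)
            for row, bias in zip(W, b)]


def euclidean(u: Vector, v: Vector) -> float:
    return math.sqrt(sum((a - c) ** 2 for a, c in zip(u, v)))


def semantic_distance(p_s: Vector, p_t: Vector,
                      W_s2t: Matrix, b_s2t: Vector,
                      W_t2s: Matrix, b_t2s: Vector) -> float:
    """E_sem(p_s, p_t) = E_sem(p_s | p_t) + E_sem(p_t | p_s)."""
    forward = euclidean(p_t, transform(W_s2t, b_s2t, p_s))   # map p_s toward p_t
    backward = euclidean(p_s, transform(W_t2s, b_t2s, p_t))  # map p_t toward p_s
    return forward + backward
```

In practice the one-hot product is just a column read; it is written out here only to mirror the description in the text.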
Given the definition of the semantic distance of each word-embedding tuple in $A_{embed}$, we minimize an objective function defined over these tuples. We apply the stochastic gradient descent (SGD) algorithm to optimize each parameter and ultimately obtain the optimized parameters $\theta^*$.
Using $\theta^*$, we can calculate the semantic gap for any possible span for $w_n$, such as $\langle i, j_m, j_n \rangle$: the gap is computed from $E_{sem}(p_s, p_t; \theta^*)$ if $e_i \in V_{en}$, $c_{j_m}^{j_n} \in V_{ch}$ and $\langle i, j_m, j_n \rangle \in A_{One}$, and is 0 otherwise (Eq. 12), where $p_s$ and $p_t$ are the word vector representations of $e_i$ and $c_{j_m}^{j_n}$, respectively. Thus, the semantic gap feature score of the word $w_n = c_{j_m}^{j_n}$ is derived from the semantic gaps of its spans.

Normalization and the Inner Log-Linear Model
Because the output scores of the sub-models described above are not probabilistic and vary by orders of magnitude, we must first normalize the output scores of each sub-model. After normalization, the scores have zero mean and unit standard deviation. We represent the normalization function by $Norm(\cdot)$. Thus, for the substring $c_1^j$ ($j \in [1, J)$) in $c_1^J$ of the sentence tuple $\langle e_1^I, c_1^J, a_1^K \rangle$, assuming that one of its candidate segmentations is $w_1^{N+1} = w_1 w_2 w_3 \ldots w_{N+1} = c_1^{j_1} c_{j_1+1}^{j_2} \ldots c_{j_N+1}^{j}$, the feature score of the inner log-linear model is derived by combining the normalized character-level and phrase-level feature scores, where $f_k(n)$ represents the phrase-level features.
Then, we tune the weight $\lambda_1$ from 0 to 1 in equal increments of 0.1 to optimize its value.
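The normalization and the weight search can be sketched in Python as follows. This is an illustration under two stated assumptions: normalization is taken to be standard z-scoring, and the two feature groups are combined by a convex interpolation with a single weight. The function names and the `dev_f_score` callback (which would segment the development set with a given weight and return its F-score) are hypothetical.

```python
import math
from typing import Callable, List, Tuple


def norm(scores: List[float]) -> List[float]:
    """Z-score normalization: shift to zero mean, scale to unit standard deviation."""
    n = len(scores)
    mean = sum(scores) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / n)
    if std == 0.0:                      # constant scores carry no information
        return [0.0] * n
    return [(s - mean) / std for s in scores]


def interpolate(char_score: float, phrase_score: float, lam: float) -> float:
    """Assumed combination: lam weights the character-level score,
    (1 - lam) weights the summed phrase-level score."""
    return lam * char_score + (1.0 - lam) * phrase_score


def weight_grid(steps: int = 10) -> List[float]:
    """Candidate weights 0.0, 0.1, ..., 1.0 for the default of 10 steps."""
    return [k / steps for k in range(steps + 1)]


def tune_weight(dev_f_score: Callable[[float], float]) -> Tuple[float, float]:
    """Return the grid weight with the best development score and that score."""
    best = max(weight_grid(), key=dev_f_score)
    return best, dev_f_score(best)
```

The same grid search would apply unchanged to a second interpolation weight for a further model layer.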

Sentence-level Features
In this section, we describe the sentence-level features calculated using the phrase-based translation model and the outer log-linear model that is used to combine the sentence-level features with the features in the inner log-linear model.

Features Obtained from the Phrase-based Translation Model
Let us consider the last character $c_J$ in $c_1^J$ and assume that its candidate segmentation (according to the inner log-linear model only) is $w_1^{N+1} = w_1 w_2 w_3 \ldots w_{N+1}$. We now add a sentence-level feature to be combined with the score of the inner log-linear model. This sentence-level feature is obtained using a phrase-based translation model. We segment the Chinese sentences from the bilingual unlabeled data using character-level features only and train a phrase-based translation model on the bilingual data that is similar to the phrase-based translation model used for the transliteration features.
Unlike the usage of the phrase-based translation model in the case of the transliteration features, here we input both the source and target sentences and obtain the translation probability as output. Thus, we perform forced decoding for the sentence tuple $\langle w_1^{N+1}, e_1^I \rangle$ and obtain the set of decoding paths $P(w_1^{N+1})$, where each element is a decoding path that translates $w_1^{N+1}$ into $e_1^I$. Finally, we define the sentence-level feature score of $\langle w_1^{N+1}, e_1^I \rangle$ in terms of $F_{trans}(\cdot)$, which returns the translation score of a given decoding path based on the phrase-based translation model.

The Outer Log-Linear Model
Finally, we normalize the sentence-level features in a manner similar to that described previously and construct the outer log-linear model by combining the inner log-linear model and the sentence-level features. Then, we also tune the weight $\lambda_2$ from 0 to 1 in equal increments of 0.1 to optimize its value.

Decoder
A traditional Viterbi beam search procedure is applied in the decoder to seek the segmented sequence with the highest score. Given a sentence tuple $\langle e_1^I, c_1^J, a_1^K \rangle$, the decoding procedure proceeds in a left-to-right fashion using a dynamic programming approach. At each position $j$ in the sequence $c_1^J$, we maintain a vector of size $N$ to store the top $N$ candidate segmentations of the subsequence $c_1^j$, which are scored using the inner log-linear model ($j \in [1, J)$) or the outer log-linear model ($j = J$). Finally, we return the best segmentation.
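The decoding procedure can be sketched in Python as a left-to-right beam search over prefix segmentations. The scoring functions are passed in as callbacks and stand in for the inner and outer log-linear models; the maximum word length `max_word_len` is an assumption added to bound the search and is not specified in the text.

```python
from typing import Callable, Dict, List, Tuple

Segmentation = Tuple[str, ...]
Scorer = Callable[[Segmentation], float]


def beam_segment(chars: str,
                 inner_score: Scorer,
                 outer_score: Scorer,
                 beam_size: int = 4,
                 max_word_len: int = 4) -> Segmentation:
    """Return the highest-scoring segmentation found by beam search.

    beams[j] holds the top `beam_size` segmentations of the first j characters.
    Prefixes (j < J) are ranked by `inner_score`; the full sentence (j == J)
    is ranked by `outer_score`.
    """
    J = len(chars)
    if J == 0:
        return ()
    beams: Dict[int, List[Segmentation]] = {0: [()]}
    for j in range(1, J + 1):
        score = outer_score if j == J else inner_score
        candidates: List[Segmentation] = []
        # The last word of a candidate ending at j starts at some earlier i.
        for i in range(max(0, j - max_word_len), j):
            word = chars[i:j]
            for prefix in beams[i]:
                candidates.append(prefix + (word,))
        candidates.sort(key=score, reverse=True)
        beams[j] = candidates[:beam_size]
    return beams[J][0]
```

Because only the top candidates are kept at each position, the search is approximate: a segmentation pruned early cannot be recovered later, which is the usual trade-off of beam decoding.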

Justifying the Original CWS Model
We justify the original CWS model (the CRF-based model trained on manually segmented data) using the new CRF model trained on the segmentation of unlabeled bilingual data. To avoid over-weakening the influence of the small-scale manually segmented data, we again utilize a log-linear model to balance their weights. In this combination, $\theta_3$ represents the weights of the second CRF model, which are set via minimum error rate training using the development dataset, and $\lambda_i^k$ ($i = 1, 2$) represents the learned weights of the features of the CRF models.

The Datasets
In this paper, we conduct our experiments using the People's Daily corpus of 1998 (January to June) as the standard (manually segmented) training corpus and the Bakeoff-2 CWS evaluation corpus as the development and testing datasets. Because the Bakeoff-2 corpus is made up of several sets provided by different organizations, we select only two sets whose segmenting standards are similar to that of the training corpus. For each set, we take 3000 sentences as the development dataset and the rest as the testing dataset. The statistics of each set and the standard training corpus are shown in Table 1. Moreover, the bilingual unlabeled data are drawn from a large in-house Chinese-English parallel corpus (Tian et al., 2014). In total, there are 2,215,000 Chinese-English sentence pairs crawled from online resources, concentrated in 5 different domains: laws, novels, spoken, news and miscellaneous.

Experiments
In our evaluation, the F-score was used as the accuracy measure. The precision p is defined as the percentage of words in the decoder output that are segmented correctly, and the recall r is the percentage of gold-standard output words that are correctly segmented by the decoder. The balanced F-score is calculated as 2pr/(p + r). We also report the recall of OOV words in our experiments. In the following, we refer to our methods as "SLBD" (segmenter leveraging bilingual data).
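These measures can be computed by comparing word spans, as in the following minimal Python sketch. The function names are hypothetical, and the sketch assumes that the predicted and gold segmentations cover the same character sequence.

```python
from typing import List, Set, Tuple


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a segmentation into a set of (start, end) character offsets."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def precision_recall_f(predicted: List[str], gold: List[str]) -> Tuple[float, float, float]:
    """A word is correct only if both of its boundaries match the gold standard."""
    pred, ref = word_spans(predicted), word_spans(gold)
    correct = len(pred & ref)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f
```

For example, if the decoder outputs three words of which one matches a gold standard of two words, then $p = 1/3$, $r = 1/2$ and the balanced F-score is $0.4$.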
Initially, we evaluated state-of-the-art supervised CWS methods, i.e., those of Peng et al. (2004) (Peng), Asahara et al. (2005) (Asahara), Zhang and Clark (2007) (Z&C) and Zhao et al. (2010) (Zhao), whose models are trained only on manually segmented data. Moreover, we also evaluated the performance of our sub-models by segmenting the bilingual unlabeled dataset using character-level features only, the inner log-linear model (which includes character-level and phrase-level features) and the outer log-linear model (the full SLBD approach). After applying these three segmentations using the different sub-models, we trained the new CRF models on the results of the three segmentations to justify the original CWS model. The evaluation results for the supervised CWS methods and the sub-models are presented in Table 2.
It can be seen that we achieved significant improvement in performance when we combined the character-level and phrase-level features in the inner log-linear model, demonstrating that the proposed phrase-level features can be used to efficiently obtain bilingual segmenting information. Moreover, the outer log-linear model achieves a further enhancement, thereby demonstrating that the sentence-level features can be used to effectively re-rank the candidate segmentations produced by the inner log-linear model.
Next, we compared the SLBD method with several state-of-the-art monolingual semi-supervised methods, including those of Sun et al. (2012) (Sun), Sun and Xu (2011) (S&X) and Zeng et al. (2013b) (Zeng). To ensure a fair comparison, we performed the evaluation in two steps. First, we input the entire bilingual unlabeled dataset into the SLBD method and input only the Chinese sentences from the bilingual unlabeled dataset into the other semi-supervised methods. Then, because monolingual unlabeled data are naturally available in much larger quantities than bilingual unlabeled data, we used the XIN_CMN portion of Chinese Gigaword 2.0, which contains 204 million words (more than ten times the number of words in the bilingual unlabeled dataset), as an additional unlabeled dataset for the monolingual semi-supervised methods. The testing data consisted of the AS set only. The evaluation is summarized in Table 3.
The results demonstrate that, whether the monolingual semi-supervised methods use the same unlabeled data or a much larger unlabeled dataset, the SLBD method significantly outperforms them, which indicates that the segmenting information obtained using SLBD is much more effective at optimizing segmentation.
Finally, we evaluated SLBD in comparison with other bilingual semi-supervised methods, including those of Xu et al. (2008) (Xu), Ma and Way (2009) (Ma), Xi et al. (2012) (Xi) and Zeng et al. (2014) (Zeng2014). The results presented in Table 4 indicate that SLBD demonstrates much stronger performance, primarily because these other methods were developed with a focus on SMT, which causes them to preferentially decrease the perplexity of the subsequent SMT steps rather than produce a highly accurate segmentation. In contrast to these methods, the SLBD method exhibits greater generalizability.

Conclusion
In this paper, we propose a cascaded log-linear model that learns three levels of bilingual linguistic features in order to train a new CWS model in a semi-supervised manner. Unlike other monolingual and bilingual semi-supervised approaches, we employ various types of features to capture both monolingual grammars and bilingual segmenting information, which makes our model effective for other NLP tasks and endows it with higher generalizability. The evaluation shows that our method significantly outperforms the state-of-the-art monolingual and bilingual semi-supervised approaches.