ACBiMA: Advanced Chinese Bi-Character Word Morphological Analyzer

While morphological information has been demonstrated to be useful for various Chinese NLP tasks, there is still a lack of complete theories, category schemes, and toolkits for Chinese morphology. This paper focuses on the morphological structures of Chinese bi-character words, where a corpus were collected based on a welldefined morphological type scheme covering both Chinese derived words and compound words. With the corpus, a morphological analyzer is developed to classify Chinese bi-character words into the defined categories, which outperforms strong baselines and achieves about 66% macro F-measure for compound words, and effectively covers derived words.


Introduction
Considering that Chinese is an analytic language without inflectional morphemes, Chinese morphology mainly focuses on analyzing morphological word formation. In this paper, we conceive the Chinese word forming process from a syntactic point of view (Packard, 2000). The analysis and prediction of the intra-word syntactic structures, i.e., the "morphological structures", have been shown to be effective in various Chinese NLP tasks, e.g., sentiment analysis (Ku et al., 2009;Huang, 2009), POS tagging (Qiu et al., 2008), word segmentation (Gao et al., 2005), and parsing (Li, 2011;Li and Zhou, 2012;Zhang et al., 2013). Thus, this paper focuses on analyzing the morphological structures of Chinese bi-character content words. Huang et al. (2010) observed that 52% multicharacter Chinese tokens are bi-character 1 , which reflects that the core task of Chinese morphological analysis should be aimed at bi-character words. Previous work tended to focus on longer unknown words (Tseng and Chen, 2002;Tseng et al., 2005;Lu et al., 2008;Qiu et al., 2008) or the functionality of morphemic characters (Galmar and Chen, 2010), and none of them effectively covered Chinese bi-character words. To the best of our knowledge, Huang et al. (2010) is the only work focused on Chinese bi-character words, where they analyzed Chinese morphological types and developed a suite of classifiers to predict the types. However, their work covers only a subset of Chinese content words and has limited scalability. Therefore, this paper addresses the issues, which expands their work by developing a more detailed scheme and collecting more words to produce a generalized analyzer.
Our contributions are three-fold: • Linguistic -we propose a morphological type scheme for full coverage of Chinese bicharacter content words, and developed a corpus containing about 11K words.
• Technical -we develop an effective morphological classifier for Chinese bi-character words, achieving 66% macro F-measure for compound words, and and effectively covers derived words.
• Practical -we release the collected data and the analyzer with the trained model to provide additional Chinese morphological features for other NLP tasks. 2

Morphological Type Scheme
Our morphological type category scheme is developed based on the literature (X.-H. Cheng, 1992;Lu et al., 2008;Huang et al., 2010) and the naming conventions of Stanford typed dependency (Chang The first character is a prefix character, e.g. 阿/a. 阿姨/a-yi/a-aunt/aunt sfx The second character is a suffix character, e.g. 仔/zi. 牛仔/new-zi/cow-zi/cowboy neg The first character is a negation character, e.g. 不/bu. 不能/bu-neng/no-capable/unable ec The first character is an existential construction, 有人/you-ren/exists-human/people e.g. 有/you/have;exists. The two major categories of Chinese bicharacter content words are derived words and compound words. Derived words are words formed in certain formations (e.g. duplication), while compound words are composed of constituent characters following certain syntactic relations. Table 1 and 2 present detailed category schemes. Note that for derived words, the characters "有/you/have" and "是/shi/be" are of a special type of existential constructions (Tao, 2007), so we isolate them from common prefixes to distinguish their unique characteristics. The "els" type (compound words) consists of exceptional words that cannot be categorized into our com-pound words scheme.

Morphological Type Classification
Due to the difference between derived words and compound words, we respectively adopt rulebased and machine learning approaches to predict their morphological types. Note that all of our approaches and features assume that Chinese morphological structures are independent from wordlevel contexts (Tseng and Chen, 2002;Li, 2011).

Derived Word: Rule-Based Approach
By definition, a morphological derived word can be recognized based on its formation. Therefore, we apply the pattern matching rules described in Table 1 to build a rule-based classifier.
To evaluate the coverage of these developed rules, we run the classifier on Chinese Treebank 7.0 (CTB) (Levy and Manning, 2003), where 2.9% of bi-character content words are annotated as derived words (842 unique word types). Our rules are able to capture derived words with a precision of 0.97. The false positives are caused by the ambiguity of Chinese characters "子/zi" and "兒/er". 3 The ambiguity results

Tone
All possible tones (0-4) of C i uni-char Pronunciation All possible pronunciations, consonants, and vowels of C i word TF in CTB The POS distribution of C i in CTB Majority POS in CTB The most frequent POS of C i in CTB Character POS Two POS tags when parsing the 2-token sentence C 1 C 2 uni-char Dist. of Senses in Dict POS distribution of the senses of C i in dictionary morpheme Majority POS in Dict POS of C i with the most senses in dictionary Root The radical (also referred to as "character root") of C i CTB Prefix/Suffix Dist. The occurrence distribution of the n-char words with C i alphabet as the prefix/suffix corresponding to each POS in CTB. symbol Dict Prefix/Suffix Dist. The occurrence distribution of the n-char dictionary entry words with C i as the prefix/suffix Example Word Same as above, but calculate Prefix/Suffix Dist. the distribution in dictionary example words. Word Feature Typed dependency Typed dependency relation between C 1 and C 2 (for C 1 C 2 ) Stanford Word POS Single POS tag of a single token (word) in mis-classifications such as "父子/fu-zi/fatherson/father and son" into the "sfx" type instead of the "conj" type. Table 1 defines the patterns we consider as derived words, and the words that do not belong to the defined classes will be considered as compound words.

Compound Word: Machine Learning Approach
To automatically predict morphological types for compound words, we perform machine learning techniques to capture generalizations from various features. For each bi-character word C 1 C 2 , we extract character-level features for C 1 and C 2 individually, as well as a single word-level feature for C 1 C 2 . Table 3 describes our feature set. For character-level features, a Chinese character may take on 3 different roles: word, morpheme, or alphabet symbol, where the extracted features are organized according to these roles. In addition, we propose word-level features, e.g. POS of C 1 C 2 , to capture the word information dismissed by the previous work (Huang et al., 2010) with consideration that such clue helps classification. We experiment with various ML classification models: Naïve Bayes (John and Langley, 1995), Random Forest (Breiman, 2001), and Support Vector Machine (Platt, 1999;Keerthi et al., 2001;Hastie and Tibshirani, 1998) for the classification task. The three types of baselines are compared: Majority, Stanford Dependency Map, and Tabular Models. The Tabular Models first assign the POS tags to each known character C based on different heuristics (i.e., the most frequent POS of C in CTB, the POS of C with most senses in Dict, and the POS of C annotated by Stanford Parser), and then assigns the most frequent morphological type obtained from training data to each POS combination, e.g., "(VV, NN) = vobj". The Stanford Dependency Map takes the dependency relation between C 1 and C 2 as predicted by the Stanford Parser (Chang et al., 2009) , and maps it to a corresponding morphological type, which is learned from training data. The Majority baseline always outputs the majority type, i.e., the "n-head" type. We develop a Chinese morphological type corpus containing 11,366 bi-character compound words, referred to as "ACBiMA Corpus 1.0." This corpus is incrementally developed in two stages: The "initial set" is first developed for preliminary study and analysis. We randomly extracted about 3,200 content words from Chinese Treebank 5.1 (Xue et al., 2005), and removed the derived words. After manually checking for and removing errors, the initial set contains 3,052 words, which are further annotated with "morphological types" and "difficulty level of determining" (1, 2, or 3) by trained native speakers and examined again by experts. The inter-annotator agreement on a 50-word held-out set, averaged over all annotator pairs, is 0.726 Kappa.
In the second stage, we expand on the initial set into a larger corpus for practical use. We sampled about 3,000 words from CTB 5.1 and annotated them with their morphological types. Moreover, we obtained the 6,500-word corpus developed by Huang et al. (2010) 4 and manually split its "Substantive-Modifier" words into "a-head", "nhead", or "v-head" types to match our category scheme. In total, the expanded dataset consists of 11,366 unique bi-character compound word types (see Table 4).

Experiments
We performed 10-fold cross-validation experiments on the entire dataset to evaluate our ap-4 The words in Huang et al. (2010) are sampled from the NTCIR CIRB040 news corpus, and the distribution of types is similar to that of our initial set. This suggests that the morphological types distribution between different Chinese corpora are similar. proaches for compound words. 5 As mentioned in §3.2, we compared against different baselines. Table 5 presents the results of our experiments, and the average human-judged difficulty level (in initial set) is also listed for comparison.
Random Forest and SVM outperformed all other models and baselines. The best accuracy is 0.674; 65% of words in the initial set are labeled as "easy" by human annotators, suggesting that our classifiers are comparable to human performance on the "easy" instances. Also, we achieved similar level of performance in macro F1-measure when compared to Huang et al. (2010) 6 , despite our task being more challenging due to having two extra types.

Conclusion and Future Work
In this paper, we developed a set of tools and resources for leveraging morphology of Chinese bicharacter words. We propose a category scheme, develop a corpus, and build an effective morphological analyzer. In future work, we intend to explore other NLP tasks where we can take advantage of ACBiMA and our tools to improve performance.