Improve Chinese Word Embeddings by Exploiting Internal Structure


For languages like Chinese, units smaller than the word also provide rich semantic information: Chinese characters within a word, and Chinese radicals within a character. These internal structures have been shown to be useful for Chinese word and character embeddings (Chen et al., 2015; Li et al., 2015). Chen et al. (2015) took the Chinese characters in a word into account when modeling the semantic meaning of the word. They proposed a character-enhanced word embedding model (CWE) that adds the embeddings of a word's component characters to the word embedding with equal weight. However, the internal characters in a Chinese word make different semantic contributions to its meaning. Take the Chinese word "青蛙" (frog) as an example. The character "青" (blue or green) modifies the character "蛙" (frog); it is obvious that the latter character contributes more than the former to the word's meaning. Li et al. (2015) proposed a component-enhanced Chinese character embedding model based on the observation that most Chinese characters are phono-semantic compounds. They considered characters and bi-characters as the basic embedding units. However, some bi-characters are meaningless and may not form a Chinese word; these bi-characters may undermine the embeddings of others. This paper, motivated by Chen et al. (2015), exploits the internal structure of Chinese words, namely their Chinese characters. We propose a method to calculate the semantic contribution of characters to a word in a cross-lingual manner. The basic idea is that the semantic contribution of the Chinese characters in most Chinese words, such as the word "青蛙" mentioned above, can be learned from their translations in other languages. The word embeddings of other languages are used to calculate the semantic contribution of characters to the word they compose. Moreover, Chinese characters are more ambiguous than words. To tackle this problem, we propose multiple-prototype character embeddings.
Different meanings of a character are represented by different embeddings. Our contributions can be summarized as follows:
1. We provide a method to calculate the semantic contribution of Chinese characters to the word they compose, using English translations. Compared with English, there are fewer human-made resources to supervise the learning of Chinese word and character embeddings, while translation resources are readily accessible on the Internet.
2. We propose a novel way to disambiguate Chinese characters with translation resources. Existing cluster-based algorithms (Huang et al., 2012; Neelakantan et al., 2015; Chen et al., 2015) have some limitations: they either fix the number of clusters or learn it for each word nonparametrically. However, the number of clusters varies greatly across words, and for the nonparametric methods, different hyperparameters have to be tuned to control the number of clusters for different datasets.
3. We provide a method to automatically distinguish whether a Chinese word is semantically compositional. Not all Chinese words derive their meaning from their component characters, for example entity names, transliterated words like "沙发" (sofa), and single-morpheme multi-character words like "徘徊" (wander). Chen et al. (2015) performed part-of-speech tagging to identify entity names, while transliterated words were tagged manually, which requires human work and needs to be updated when new words are created.
The evaluations on word similarity, text classification, Chinese character disambiguation, and qualitative analysis of word embeddings demonstrate the effectiveness of our method.

Word2vec (Mikolov et al., 2013a) is an algorithm for learning distributed word representations with a neural language model. Word2vec has two models, the continuous bag-of-words model (CBOW) and the skip-gram model. In this paper, we propose a new model based on CBOW, so we focus our attention on it. CBOW aims at predicting the target word given the context words in a sliding window. Given a word sequence D = {x_1, x_2, . . . , x_T}, the objective of CBOW is to maximize the average log probability

L = (1/T) Σ_{i=1}^{T} log p(x_i | x_{i-j}, . . . , x_{i+j}),

where the conditional probability is a softmax over the vocabulary W,

p(x_i | context) = exp(v'_{x_i} · h) / Σ_{x ∈ W} exp(v'_x · h),

h is the average of the input vectors of the context words, and v_x and v'_x are the input and output vector representations of word x. Since the size of the English vocabulary W may be up to the 10^6 scale, hierarchical softmax and negative sampling (Mikolov et al., 2013b) are applied during training to learn the model efficiently. However, using CBOW to learn Chinese word embeddings directly has a limitation: it fails to capture the internal structure of words. Botha and Blunsom (2014), Luong et al. (2013), Trask et al. (2015) and Chen et al. (2015) demonstrated the usefulness of exploiting the internal structure of words and proposed morphology-based methods; for example, Chen et al. (2015) exploit the internal structure of Chinese words.
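The CBOW prediction step described above can be sketched as follows. This is a toy illustration with a full softmax over a tiny vocabulary (the actual training uses hierarchical softmax or negative sampling); all sizes and values are assumed:

```python
import numpy as np

# Toy CBOW prediction step: average the context input vectors, then
# score every vocabulary word with its output vector.
rng = np.random.default_rng(0)
W, d = 10, 4                            # vocabulary size, embedding dim
V_in = rng.normal(size=(W, d)) * 0.1    # input vectors v_x
V_out = rng.normal(size=(W, d)) * 0.1   # output vectors v'_x

def cbow_prob(context_ids, target_id):
    """p(x_t | context) with a full softmax over the vocabulary."""
    h = V_in[context_ids].mean(axis=0)  # average of context input vectors
    scores = V_out @ h                  # one score per vocabulary word
    scores -= scores.max()              # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return float(probs[target_id])

p = cbow_prob([1, 2, 4, 5], target_id=3)
```

Training would then adjust `V_in` and `V_out` to increase this probability for observed (context, target) pairs.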

The CWE model
The basic idea of CWE is that both the external context words and the internal component characters of a word provide rich information for modeling its semantic meaning. In CWE, word embeddings are learned jointly with the embeddings of their component characters. Let C denote the set of Chinese characters, and let the word x_t in the context x_{i-j}, . . . , x_{i+j} be composed of characters in C, x_t = {c_1, c_2, . . . , c_{N_t}}, where c_k denotes the k-th character in x_t. The modified word embedding is

v_{x_t} = (1/2) (w_{x_t} + (1/N_t) Σ_{k=1}^{N_t} v_{c_k}),

where v_{x_t} is the modified word embedding, w_{x_t} is the original word embedding, v_{c_k} is the embedding of character c_k, and N_t denotes the number of Chinese characters in x_t. To address the ambiguity of Chinese characters, they proposed several approaches for multiple-prototype character embeddings: position-based, cluster-based, nonparametric, and position-cluster-based character embeddings, denoted CWE+P, CWE+L, CWE+N and CWE+LP respectively. However, this model has a limitation: in CWE the internal characters contribute equally to the semantic meaning of the word, which is not the case for most Chinese words.
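The equal-weight composition used by CWE can be sketched as follows, with toy two-dimensional vectors standing in for trained embeddings:

```python
import numpy as np

def cwe_embedding(word_vec, char_vecs):
    """CWE composition sketch: average the component-character
    embeddings with equal weight, then combine with the word
    embedding (each half weighted 1/2)."""
    char_avg = np.mean(char_vecs, axis=0)
    return 0.5 * (word_vec + char_avg)

w = np.array([1.0, 0.0])                          # toy word embedding
chars = [np.array([0.0, 2.0]), np.array([0.0, 0.0])]  # toy character embeddings
v = cwe_embedding(w, chars)                       # 0.5 * ([1,0] + [0,1])
```

Note that both characters receive weight 1/N_t regardless of how much each actually contributes to the word's meaning, which is exactly the limitation discussed above.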

Methodology
Our method can be described in three stages:
• Obtain translations of Chinese words and characters. A Chinese word segmentation tool is used to segment the Chinese corpus. Then we use an online English-Chinese translation tool to translate all the Chinese characters and segmented words.
• Perform Chinese character sense disambiguation. We train CBOW on an English corpus to get English word embeddings. Then we merge meanings of Chinese characters that differ only slightly, and disambiguate the meanings of characters in words by computing the similarity between their English translation words.
• Learn word and character embeddings with our model. Based on the character sense disambiguation, we modify the objective of CWE to learn Chinese word and character embeddings. We then briefly analyze the complexity of our model.

Obtain translations of Chinese words and characters
We use a segmentation tool to segment the words in the Chinese training corpus, and perform part-of-speech tagging to recognize all the entity names. Since entity names do not exhibit semantic composition, they are identified as non-compositional words. We also count how many different words each character appears in: words whose characters rarely combine with other characters are classified as single-morpheme multi-character words and likewise identified as non-compositional. Then the programming interface of an online translation tool is used to translate the Chinese words and characters into English; non-compositional Chinese words are not included in the translation list. Table 1 shows the English meanings of the Chinese words "音乐" (music) and "沙发" (sofa) and their component characters "音" and "乐", "沙" and "发".
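The character co-occurrence counting used to flag single-morpheme multi-character words might look like the following sketch; the threshold and the tiny corpus are illustrative assumptions, not values from the paper:

```python
from collections import defaultdict

def find_single_morpheme_words(words, min_partners=2):
    """Flag multi-character words whose characters rarely combine with
    other characters: a rough proxy for single-morpheme words.
    `min_partners` is an illustrative threshold."""
    words_per_char = defaultdict(set)
    for w in words:
        for ch in w:
            words_per_char[ch].add(w)  # every word each character occurs in
    flagged = set()
    for w in words:
        # flag if every character of w appears in very few distinct words
        if len(w) > 1 and all(len(words_per_char[ch]) < min_partners for ch in w):
            flagged.add(w)
    return flagged

corpus_words = ["徘徊", "音乐", "音响", "乐器"]
flagged = find_single_morpheme_words(corpus_words)
# "徘" and "徊" appear only inside "徘徊", so that word is flagged
```

Here "音" and "乐" each combine with other characters ("音响", "乐器"), so "音乐" is kept as a candidate compositional word.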

Perform Chinese character sense disambiguation
We train CBOW on an English corpus to get English word embeddings. Then, meanings of a character that differ only slightly are merged.
In Table 1, we observe that the difference between some meanings of the character "乐" is very small; some of them differ only in their part of speech. In Chinese, the same characters and words can appear in different parts of speech while expressing the same semantic meaning, so these meanings are merged into one. Let Sim(·) denote the function that calculates the similarity between meanings of Chinese words and characters; we use cosine similarity as the metric. Let c^i and c^j be the i-th and j-th meanings of a Chinese character c. Their similarity is defined as

Sim(c^i, c^j) = max_{x_m ∈ Trans(c^i), x_n ∈ Trans(c^j)} cos(v_{x_m}, v_{x_n}),

where Trans(c^i) denotes the set of English translation words of c^i, and x_m and x_n must not belong to the set of English stop words. For example, for the Chinese word "音乐" in Table 1, c_2 denotes the second character "乐" in the word. Trans(c_2^3) is the translation set of the third meaning of "乐", which is {pleasure, enjoyment}; therefore x_m can be pleasure or enjoyment here.
If Sim(c^i, c^j) is above a threshold δ, the two meanings are merged into one; for simplicity, we use the union of their English translation word sets. Since one character meaning may be translated into several English words, we can either average all the translation word embeddings and then compute the similarity, or select the maximum similarity over all English word pairs. In our experiments, the maximum method works better.
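A minimal sketch of the meaning-merging test, assuming toy English embeddings and a tiny stop-word list (the real model uses CBOW-trained vectors and a standard English stop-word list):

```python
import numpy as np

EN_STOP = {"the", "a", "of", "to"}   # tiny illustrative stop-word list

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def sim_meanings(trans_i, trans_j, en_vecs):
    """Maximum cosine similarity over all non-stop-word translation
    pairs, as used when deciding whether to merge two meanings."""
    best = -1.0
    for xm in trans_i:
        for xn in trans_j:
            if xm in EN_STOP or xn in EN_STOP:
                continue
            best = max(best, cosine(en_vecs[xm], en_vecs[xn]))
    return best

# toy English embeddings (assumed, not real word2vec output)
vecs = {"happy": np.array([1.0, 0.1]),
        "pleasure": np.array([0.9, 0.2]),
        "music": np.array([0.0, 1.0])}
s = sim_meanings({"happy"}, {"pleasure", "music"}, vecs)
delta = 0.5
merge = s > delta   # "happy" vs "pleasure" is close, so the meanings merge
```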
Finally, we perform Chinese character sense disambiguation. In Chinese, a character may have multiple meanings, but within a given word its meaning is determined. For example, for the word "音乐", the English translation is music. For the character "乐", the first translation "music" matches the meaning of the word; for the character "音", the best match is the first translation "sound". For a transliterated word like "沙发", the English translations are sofa and settee, and neither sofa nor settee has high similarity with the English translation words of the characters "沙" and "发". Formally, if max_k Sim(x_t, c_k) > λ for c_k ∈ x_t, then x_t is identified as a compositional word and added to the compositional set COMP. For compositional words, we build a set

F = {(x_t, {Sim(x_t, c_1), . . . , Sim(x_t, c_{N_t})}, {m_1, . . . , m_{N_t}}) | x_t ∈ COMP},

where m_k denotes the index of the meaning of c_k that best matches x_t. For example, the word "音乐" is stored as ("音乐", {Sim("音乐", "音"), Sim("音乐", "乐")}, {1, 1}) in F.
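The best-matching-meaning selection and the compositionality test can be sketched as follows, with hypothetical translation sets and toy embeddings:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def best_meaning(word_trans, char_meanings, en_vecs):
    """For one character in a word, pick the meaning whose English
    translations best match the word's translations; return the best
    similarity and the 1-based meaning index."""
    best_sim, best_idx = -1.0, 0
    for idx, trans_set in enumerate(char_meanings, start=1):
        for wt in word_trans:
            for ct in trans_set:
                s = cosine(en_vecs[wt], en_vecs[ct])
                if s > best_sim:
                    best_sim, best_idx = s, idx
    return best_sim, best_idx

# toy vectors standing in for trained English embeddings
vecs = {"music": np.array([0.0, 1.0]),
        "sound": np.array([0.2, 0.9]),
        "happy": np.array([1.0, 0.0])}
# a character with two meanings, translated as {"sound"} and {"happy"}
sim, idx = best_meaning({"music"}, [{"sound"}, {"happy"}], vecs)
lam = 0.4
is_compositional = sim > lam   # the first meaning matches the word well
```

The returned index plays the role of m_k in the set F, and the returned similarity the role of Sim(x_t, c_k).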

Learn word and character vectors with SCWE
The internal characters in a word make different contributions to its semantic meaning. However, in Chen et al. (2015), the component characters are treated as contributing equally to the semantic meaning of the word: character embeddings are added to the word embedding with the same weight, which may undermine the quality of the word embeddings. Motivated by this observation, we propose a similarity-based character-enhanced word embedding model that takes the contribution of characters into account; we call it SCWE for ease of reference. The architectures of CWE and SCWE are shown in Fig. 1.
Similarity-Based Character-Enhanced Word Embedding In the character sense disambiguation stage, we build a set F, which contains the compositional words, the similarities between each word and its component characters, and the meaning index of each character within the word. Suppose x_t in W is a compositional word composed of characters {c_1, . . . , c_{N_t}}. Its embedding is computed as

v_{x_t} = (1/2) (w_{x_t} + (1/N_t) Σ_{k=1}^{N_t} Sim(x_t, c_k) v_{c_k}).

Figure 1: Architecture of models. The left is CWE and the right is SCWE. "青蛙 (frog) 跳进 (jump into) 池塘 (pond)" is the word sequence. The word "青蛙" is composed of the characters "青" (blue or green) and "蛙" (frog), and the word "池塘" (pond) is composed of the characters "池" (pond, pool) and "塘" (pond).
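The similarity-weighted composition that distinguishes SCWE from CWE can be sketched as follows (toy vectors; the weights follow the Sim values built in the disambiguation stage):

```python
import numpy as np

def scwe_embedding(word_vec, char_vecs, sims):
    """SCWE composition sketch: weight each component-character
    embedding by its similarity to the word before averaging,
    in contrast to CWE's equal weights."""
    weighted = sum(s * c for s, c in zip(sims, char_vecs)) / len(char_vecs)
    return 0.5 * (word_vec + weighted)

w = np.array([1.0, 1.0])                              # toy word embedding
chars = [np.array([2.0, 0.0]), np.array([0.0, 2.0])]  # toy character embeddings
sims = [0.9, 0.1]   # the second character contributes little to the word
v = scwe_embedding(w, chars, sims)
```

With equal weights the two characters would pull the word embedding equally; here the low-contribution character is correspondingly down-weighted.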
To deal with the ambiguity of Chinese characters, we propose multiple-prototype character embeddings, denoted SCWE+M. Since the meaning of a character is determined within a given word, we utilize the meaning indices stored as the last element of each entry in F, and use a different character embedding for each meaning of a character. Then, in SCWE+M,

v_{x_t} = (1/2) (w_{x_t} + (1/N_t) Σ_{k=1}^{N_t} Sim(x_t, c_k) v_{c_k}^{(m_k)}),

where v_{c_k}^{(m_k)} is the embedding of the m_k-th meaning of character c_k.

Complexity analysis We analyze the complexities of CBOW, CWE, SCWE and SCWE+M. Let S denote the size of the corpus, |W| the size of the vocabulary, and |C| the number of Chinese characters in the corpus. Let d be the dimension of the Chinese word and character embeddings, k the context window size, f the time spent computing hierarchical softmax or negative sampling, n the average number of characters in a Chinese word, and m the average number of meanings of a Chinese character. The results are shown in Table 2.
In Chinese, most words are composed of two Chinese characters, and the number of meanings of a commonly used character is usually less than five.
Moreover, according to the CJK Unified Ideographs, the total number of Chinese characters is 20,913, and fewer than 10,000 of them are commonly used. Therefore, our model is competitive with the other methods in both model parameters and computational complexity.
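Plugging rough figures into the parameter counts makes the comparison concrete; the vocabulary size and the SCWE+M parameter formula below are illustrative assumptions based on the variables defined above, not figures from Table 2:

```python
# Rough parameter-count arithmetic with assumed figures.
d = 100          # embedding dimension (as in our experiments)
W = 200_000      # vocabulary size (assumed)
C = 10_000       # commonly used Chinese characters (upper bound from above)
m = 5            # meanings per character (upper bound from above)

cbow_params   = W * d             # CBOW stores word vectors only
scwe_m_params = (W + C * m) * d   # assumed: words plus multi-prototype characters

overhead = scwe_m_params / cbow_params   # relative growth in parameters
```

Under these assumptions the multi-prototype character table adds only about a quarter to the word-vector parameter count, which is why the model stays competitive.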

Method | Model parameters | Computational complexity
CBOW | |W|d | 2kSf

Experimental Settings

We select ANSJ as our Chinese word segmentation tool. It can process about one million words per second and reaches up to 96 percent accuracy on the segmentation task; part-of-speech tagging and named entity recognition are also performed in this step. We select ICIBA as the English-Chinese translation tool, which provides an application programming interface. CBOW and CWE are used as baseline methods. The context window size is set to 5, and both Chinese word and character embeddings have 100 dimensions. After cross validation, the thresholds δ and λ are set to 0.5 and 0.4 in the character disambiguation process. The influence of λ and δ is reported later.

Word Similarity
Word similarity is a task that computes the semantic relatedness between given word pairs, whose relatedness has been scored by humans in advance. The correlation between model results and human judgements is used to evaluate the performance of models. In this paper, wordsim-240 and wordsim-296 (Jin and Wu, 2012) are used as evaluation datasets, and Spearman's rank correlation (Myers et al., 2010) is applied to compute the correlation. The experimental results are summarized in Table 3. We observe that on wordsim-240, SCWE and SCWE+M outperform the baseline methods, which indicates the effectiveness of exploiting the internal structure. On wordsim-296, CBOW, CWE and SCWE perform similarly. This may be explained by some highly ambiguous Chinese characters in this dataset: in SCWE and CWE, representing these ambiguous characters with a single embedding may undermine the word embeddings. SCWE+M therefore achieves better performance by applying multiple-prototype character embeddings.
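Spearman's rank correlation used in this evaluation is the Pearson correlation computed over ranks; a minimal sketch with hypothetical scores (no tie handling):

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman's rank correlation: Pearson correlation of the ranks.
    This sketch assumes no tied scores."""
    ra = np.argsort(np.argsort(a)).astype(float)  # 0-based ranks of a
    rb = np.argsort(np.argsort(b)).astype(float)  # 0-based ranks of b
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

human = [7.5, 6.0, 2.1, 8.8, 4.0]       # hypothetical human judgements
model = [0.81, 0.62, 0.30, 0.90, 0.45]  # hypothetical model cosine scores
rho = spearman_rho(human, model)        # identical orderings give 1.0
```

Because only the orderings matter, a model is rewarded for ranking pairs like humans do, not for matching the raw scores.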

Text Classification
In this experiment, we use the Fudan Corpus as our dataset. It contains 20 categories of documents, including economy, politics, and sports; the number of documents per category ranges from 27 to 1061. To avoid imbalance, we select 10 categories and organize them into 2 groups. One group, named Fudan-large, contains categories with more than 1000 documents each; the other, named Fudan-small, contains categories with fewer than 100 documents each. In each category, 80 percent of the documents are used as the training set and the rest as the testing set. Detailed information on the two datasets is reported in Table 4. As with the Chinese training corpus, pure digits and non-Chinese characters are removed, ANSJ is used for word segmentation, and the publication information of each document is removed. We represent each document by averaging the embeddings of the words in it. Classifiers are trained with the LIBLINEAR package (Fan et al., 2008) on the embeddings obtained from the different methods, and each method is evaluated by its prediction accuracy on the testing set. The results are given in Table 5.
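The document representation used here, averaging word embeddings, can be sketched as follows (toy embeddings and words; the classifier itself is trained separately with LIBLINEAR on these vectors):

```python
import numpy as np

def doc_vector(tokens, emb, dim=4):
    """Represent a document as the average of its word embeddings;
    out-of-vocabulary words are skipped (a simple common convention)."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# toy embeddings standing in for trained 4-dimensional vectors
emb = {"篮球": np.array([1.0, 0.0, 0.0, 0.0]),
       "网球": np.array([0.8, 0.2, 0.0, 0.0]),
       "经济": np.array([0.0, 0.0, 1.0, 0.0])}
doc = doc_vector(["篮球", "网球", "市场"], emb)   # "市场" is OOV here
```

Because sports words cluster in the embedding space, their averages land in a similar region, which is what the linear classifier exploits.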
We observe that our methods outperform the baseline methods on both datasets. This can be explained by the fact that our methods strengthen the semantic relatedness of a word with the component characters that contribute more to its meaning. For example, in sports documents, the word "球" (ball) is used frequently. For Chinese words like "篮球" (basketball) and "网球" (tennis), the character "球" contributes more to their semantic meaning than the other characters. Therefore, these words lie closer to the character "球" in the embedding space obtained by our model than in those of CBOW and CWE, and tend to form a cluster.

Multiple Prototypes of Chinese Characters
To tackle the ambiguity of Chinese characters, we propose multiple-prototype character embeddings.
To evaluate the effectiveness of our method, we use PCA to reduce the dimensionality of the word and character embeddings; the results are illustrated in Fig. 2. We take three different meanings of two Chinese characters and two of their top-related words as examples. A character followed by a digit i denotes the i-th meaning of that character. We observe that characters and words with similar meanings are gathered together: for example, the third meaning of one character and its top-related words are all related to light, so they lie close to each other in the embedding space.
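The PCA projection behind Fig. 2 can be sketched with a plain SVD; the embeddings and sizes below are random stand-ins for the trained vectors:

```python
import numpy as np

def pca_2d(X):
    """Project row vectors onto their first two principal components,
    via SVD of the mean-centered matrix."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T   # coordinates in the top-2 component plane

rng = np.random.default_rng(1)
embeddings = rng.normal(size=(6, 100))  # 6 toy embeddings, d = 100
points = pca_2d(embeddings)             # shape (6, 2), ready for plotting
```

Each character meaning and each related word becomes one 2-D point, and clusters in this plane correspond to the groupings visible in the figure.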
We also develop a dataset to compare our method with the disambiguation methods in Chen et al. (2015). We select some ambiguous Chinese characters and use the online Xinhua Dictionary as the standard to disambiguate the words that contain them. Each word is assigned a number according to its explanation in the dictionary. We use KNN as the classifier to evaluate all the methods. The results, shown in Table 7, indicate that our method outperforms the methods proposed in Chen et al. (2015).
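A minimal sketch of the KNN evaluation, with toy labelled embeddings standing in for words that share an ambiguous character (labels are the dictionary sense numbers):

```python
import numpy as np
from collections import Counter

def knn_predict(query, X, y, k=3):
    """Plain k-nearest-neighbour majority vote by Euclidean distance."""
    dists = np.linalg.norm(X - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y[i] for i in nearest).most_common(1)[0][0]

# toy word embeddings labelled with the sense index of a shared character
X = np.array([[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1], [0.2, 1.1]])
y = [1, 1, 2, 2, 1]
pred = knn_predict(np.array([0.0, 0.8]), X, y, k=3)
```

Better embeddings place words sharing a sense closer together, so the neighbour vote recovers the dictionary label more often.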

Qualitative analysis of word embeddings
In this part, we take two Chinese words as examples and list their nearest words to examine the quality of the word embeddings obtained by CWE and SCWE. The results are shown in Table 8. We observe that the most similar words returned by both CWE and SCWE tend to share common characters with the given word. In CWE, characters with little semantic contribution to a word may undermine the quality of its embedding; for example, the character "青" in the word "青蛙". The semantic relatedness to the given word of other words containing "青" is overestimated in CWE. In our model, by calculating the semantic contribution of the internal characters to the word, we greatly alleviate this misjudgement, which demonstrates the effectiveness of our model.

Parameter Analysis
In this part, we investigate the influence of two parameters on our model: the compositional word similarity threshold λ and the character disambiguation threshold δ.
Compositional word similarity To investigate how λ influences the detection of non-compositional words, we manually build a list of 161 transliterated words, and add to it the 161 most frequent semantically compositional words in the corpus with more than one Chinese character. Table 9 reports the performance of our method in classifying transliterated words as λ ranges from 0.25 to 0.55. We observe that as λ increases, more compositional words are classified as non-compositional, while transliterated words are more likely to be classified correctly. Our method achieves its best F-Score when λ = 0.4.
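The F-Score reported in Table 9 balances this precision-recall trade-off; a sketch with hypothetical counts at one value of λ (the figures below are not from the paper):

```python
def f_score(tp, fp, fn):
    """F1: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: of the 161 transliterated words, 150 are
# correctly flagged as non-compositional (tp), 11 are missed (fn),
# and 20 compositional words are wrongly flagged as well (fp).
f = f_score(tp=150, fp=20, fn=11)
```

Raising λ trades false negatives for false positives, and the best λ is the one maximizing this harmonic mean over the 322-word list.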
Character disambiguation threshold In Table 10, we show the performance of our model in disambiguating Chinese characters, using the same dataset as in Section 4.4 with different values of δ. As Table 1 shows, some meanings of a character are very close; therefore, a high δ is adopted in our model. Our model obtains its best result on our dataset when δ = 0.5.

Conclusion
In this paper, we exploit the internal structure of Chinese words by learning the semantic contribution of the internal characters to the word. We propose a similarity-based character-enhanced word embedding model to improve Chinese word and character embeddings, and the ambiguity of Chinese characters is also tackled by our method. Moreover, we provide a way to automatically classify whether a Chinese word is compositional, which had to be labelled manually in CWE. We argue that our method may also improve the word embeddings of other languages whose internal structure is similar to that of Chinese. The code and datasets we use are available at: https://github.com/JianXu123/SCWE.