Learning Chinese Word Representations From Glyphs Of Characters

In this paper, we propose new methods to learn Chinese word representations. Chinese characters are composed of graphical components, which carry rich semantics. It is common for a Chinese learner to comprehend the meaning of a word from these graphical components. As a result, we propose models that enhance word representations by character glyphs. The character glyph features are directly learned from the bitmaps of characters by a convolutional auto-encoder (convAE), and the glyph features improve Chinese word representations which are already enhanced by character embeddings. Another contribution of this paper is that we created several evaluation datasets in traditional Chinese and made them public.


Introduction
Regardless of the target language, high-quality word representations (also known as word "embeddings") are key to many natural language processing tasks, for example, sentence classification (Kim, 2014), question answering (Zhou et al., 2015), and machine translation (Sutskever et al., 2014). Besides, word-level representations are building blocks in producing phrase-level (Cho et al., 2014) and sentence-level (Kiros et al., 2015) representations.
In this paper, we focus on learning Chinese word representations. A Chinese word is composed of characters which contain rich semantics. The meaning of a Chinese word is often related to the meanings of its compositional characters. Therefore, a Chinese word embedding can be enhanced by its compositional character embeddings (Chen et al., 2015; Xu et al., 2016). Furthermore, a Chinese character is composed of several graphical components. Characters with the same component share similar semantics or pronunciation. When a Chinese user encounters a previously unseen character, it is instinctive to guess the meaning (and pronunciation) from its graphical components, so understanding the graphical components and associating them with semantics helps people learn Chinese. Radicals are the graphical components used to index Chinese characters in a dictionary. By identifying the radical of a character, one obtains a rough meaning of that character, so radicals have been used in learning Chinese word embeddings (Yin et al., 2016) and character embeddings (Sun et al., 2014; Li et al., 2015). However, components other than radicals may also contain information that is useful in word representation learning.
Our research begins with a question: Can machines learn Chinese word representations from glyphs of characters? By exploiting the glyphs of characters as images in word representation learning, all the graphical components in a character are considered, not only radicals. In our proposed methods, we render character glyphs to fixed-size grayscale images which are referred to as "character bitmaps", as illustrated in Fig. 1. A similar idea was also used in (Liu et al., 2017) to help classify Wikipedia article titles into 12 categories. We use a convAE to extract character features from the bitmap to represent the glyphs. It is also possible to represent the glyph of a character by the graphical components in it. We do not choose this way because there is no unique way to decompose a character, and directly learning representations from bitmaps is more straightforward. Then we use models parallel to Skipgram (Mikolov et al., 2013a) or GloVe (Pennington et al., 2014) to learn word representations from the character glyph features. Although we only consider traditional Chinese characters in this paper, and the examples given below are based on the traditional characters, the same ideas and methods can be applied to the simplified characters.
Figure 1: A Chinese character is represented as a fixed-size grayscale image, which is referred to as a "character bitmap" in this paper. Panels: character glyphs (as printed in the PDF file) and their rendered 60×60-pixel bitmaps.

Background Knowledge and Related Works
To give a clear illustration of our own work, we briefly introduce the representative methods of word representation learning in Section 2.1. In Section 2.2, we will introduce some of the linguistic properties of Chinese, and then introduce the methods that utilize these properties to improve word representations.

Word Representation Learning
Mainstream research on word representation is built upon the distributional hypothesis, that is, words with similar contexts share similar meanings. Usually a large-scale corpus is used, and word representations are produced from the co-occurrence information of a word and its context. Existing methods of producing word representations can be separated into two families (Levy et al., 2015): the count-based family (Turney and Pantel, 2010; Bullinaria and Levy, 2007), and the prediction-based family, in which word representations are obtained by training neural-network-based models (Bengio et al., 2003; Collobert et al., 2011). The representative methods are briefly introduced below.

CBOW and Skipgram
Both the continuous bag-of-words (CBOW) model and the Skipgram model train with words and contexts in a sliding local context window (Mikolov et al., 2013a). Both assign each word $w_i$ an embedding $\vec{w}_i$. CBOW predicts the word given its context embeddings, while Skipgram predicts the contexts given the word embedding. Predicting the occurrence of a word or context in the CBOW and Skipgram models can be viewed as learning a multi-class classification neural network, where the number of classes is the size of the vocabulary. Mikolov et al. (2013b) introduced several techniques to improve the performance. Negative sampling speeds up learning, and subsampling of frequent words randomly discards training examples with frequent words (such as "the", "a", "of"), which has an effect similar to the removal of stop words.
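As a minimal sketch of the negative-sampling objective mentioned above, the function below scores one (word, context) pair against a handful of sampled negative contexts. The function and argument names are illustrative assumptions and are not taken from the word2vec or CWE code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgns_loss(word_vec, context_vec, negative_vecs):
    """Negative-sampling loss for a single (word, context) pair.

    The observed context should receive a high score, while each
    sampled negative context should receive a low score.
    """
    loss = -math.log(sigmoid(dot(word_vec, context_vec)))
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(word_vec, neg)))
    return loss
```

The loss is small when the word vector aligns with the true context and points away from the negatives, which replaces the full softmax over the vocabulary with a few binary decisions.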

GloVe
Instead of using local context windows, Pennington et al. (2014) proposed the GloVe model. Training GloVe word representations begins with creating a co-occurrence matrix $X$ from a corpus, where each matrix entry $X_{ij}$ is the number of times word $w_j$ appears in the context of word $w_i$. Pennington et al. (2014) used a harmonic weighting function for the co-occurrence count, that is, a word-context pair at distance $d$ contributes $1/d$ to the global co-occurrence count. Let $\vec{w}_i$ be the word representation of word $w_i$, and $\tilde{\vec{w}}_j$ be the word representation of word $w_j$ as context. The GloVe model minimizes the loss
$$J = \sum_{i,j} f(X_{ij}) \left( \vec{w}_i^{\top} \tilde{\vec{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,$$
where $b_i$ is the bias for word $w_i$, and $\tilde{b}_j$ is the bias for context $w_j$. A weighting function $f(X_{ij})$ is introduced because the authors consider that rare co-occurring word-context pairs carry less information than frequent ones, so their contributions to the total loss should be decreased. The weighting function depends on the co-occurrence count and is defined as
$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max}, \\ 1 & \text{otherwise,} \end{cases}$$
with the parameters set to $x_{\max} = 100$ and $\alpha = 0.75$.
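The weighting function and the per-pair loss term described above can be sketched as follows. This follows the published GloVe formulation with $x_{\max} = 100$ and $\alpha = 0.75$; the function names are assumptions made for illustration and do not come from the GloVe code.

```python
import math

X_MAX = 100.0
ALPHA = 0.75

def glove_weight(x, x_max=X_MAX, alpha=ALPHA):
    """Down-weight rare co-occurrences; cap the weight at 1 for frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_j_ctx, b_i, b_j_ctx, x_ij):
    """Weighted squared error between the model score and log(X_ij).

    Only defined for x_ij > 0, since log(0) does not exist.
    """
    score = sum(a * b for a, b in zip(w_i, w_j_ctx)) + b_i + b_j_ctx
    return glove_weight(x_ij) * (score - math.log(x_ij)) ** 2
```

The total loss is the sum of `glove_pair_loss` over all non-zero entries of the co-occurrence matrix.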
In the GloVe model, each word has two representations, $\vec{w}$ and $\tilde{\vec{w}}$. The authors suggest using $\vec{w} + \tilde{\vec{w}}$ as the word representation, and reported improvements over using $\vec{w}$ only.
A Chinese word is composed of a sequence of characters. The meanings of some Chinese words are related to the composition of the meanings of their characters. For example, "戰艦" (battleship) is composed of two characters, "戰" (war) and "艦" (ship). More examples are given in Fig. 2. To improve Chinese word representations with sub-word information, character-enhanced word embedding (CWE) (Chen et al., 2015), described in Section 2.2.2, was proposed.
A Chinese character is composed of several graphical components. Characters with the same component share similar semantic or phonetic properties. In a Chinese dictionary, characters with similar coarse semantics are grouped into categories for ease of searching. The common graphical component that relates to the shared semantics is chosen to index the category, and is known as a radical. Examples are given in Fig. 3. There are three radicals in row (A), and their semantic meanings are in row (B). In each column, there are five characters containing the corresponding radical. Characters that share a radical have meanings related to the radical in some aspect. A radical can be placed in different positions in a character. For example, in rows (C-1) to (C-4), the radicals are at the left-hand side of a character, but in row (C-5), the radicals are at the bottom. The shape of a radical can differ between positions. For example, the third radical, which represents "water" or "liquid", has different forms when it is at the left-hand side or at the bottom of a character. Because radicals serve as a strong semantic indicator of a character, multi-granularity embedding (MGE) (Yin et al., 2016), described in Section 2.2.3, incorporates radical embeddings in learning word representations.
Usually the components other than radicals determine the pronunciation of the characters, but in some cases they also influence the meaning of a character. Two examples are given in Fig. 4. Both characters in Fig. 4 have the same radical "亻" (human) at the left-hand side, but the graphical components at the right-hand side also have semantic meanings related to the characters. Consider the left character "伐" (attack). Its right component "戈" means "weapon", and the meaning of the character "伐" is the composition of the meanings of its two components (a human with a weapon). To the best of our knowledge, none of the previous word embedding approaches considers all the components of Chinese characters.

Character-enhanced Word Embedding (CWE)
The main idea of CWE is that a word embedding is enhanced by its compositional character embeddings. CWE predicts the word from both the word and character embeddings of its contexts, as illustrated in Fig. 5 (a). For word $w_i$, the CWE word embedding $\vec{w}_i^{cwe}$ has the following form:
$$\vec{w}_i^{cwe} = \vec{w}_i + \frac{1}{|C(i)|} \sum_{c_j \in C(i)} \vec{c}_j,$$
where $\vec{w}_i$ is the word embedding, $\vec{c}_j$ is the embedding of the $j$-th character in $w_i$, and $C(i)$ is the set of compositional characters of word $w_i$. The mean of the CWE word embeddings of the contexts is then used to predict the word $w_i$.
Sometimes one character has several different meanings; this is known as the ambiguity problem. To deal with it, each character is assigned a bag of embeddings. During training, one of the embeddings is picked to form the modified word embedding. The authors proposed three methods to decide which embedding is picked: position-based, cluster-based, and non-parametric cluster-based character embeddings.
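The composition described in this subsection can be sketched as below: the word vector is combined with the mean of its character vectors. This is a minimal illustration that assumes a plain, unscaled sum and a single embedding per character; the multiple-embedding selection is omitted and the function names are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cwe_embedding(word_vec, char_vecs):
    """Word embedding plus the mean of its character embeddings.

    A word with no character vectors falls back to the word embedding alone.
    """
    if not char_vecs:
        return list(word_vec)
    char_mean = mean_vector(char_vecs)
    return [w + c for w, c in zip(word_vec, char_mean)]
```

For a two-character word such as "戰艦", `char_vecs` would hold the vectors of "戰" and "艦".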

Multi-granularity Embedding (MGE)
Based on CBOW and CWE, Yin et al. (2016) proposed MGE, which predicts the target word from its radical embeddings and the CWE word embeddings of its context, as shown in Fig. 5. There is no ambiguity of radicals, so each radical is assigned one embedding $\vec{r}$. We denote $\vec{r}_k$ as the radical embedding of character $c_k$. MGE predicts the target word $w_i$ with the following hidden vector:
$$\vec{h}_i = \frac{1}{|C(i)|} \sum_{c_k \in C(i)} \vec{r}_k + \frac{1}{|W[i]|} \sum_{w_j \in W[i]} \vec{w}_j^{cwe},$$
where $W[i]$ is the set of context words of $w_i$ and $\vec{w}_j^{cwe}$ is the CWE word embedding of $w_j$. MGE picks character embeddings with the position-based method in CWE, and picks radical embeddings according to a character-radical index built from a dictionary during training. When a non-compositional word is encountered, only the word embedding is used to form $\vec{h}_i$.

Model
We first extract glyph features from bitmaps with the convAE in Section 3.1. The glyph features are used to enhance the existing word representation learning models in Section 3.2. In Section 3.3, we try to learn word representations directly from the glyph features.

Character Bitmap Feature Extraction
A convAE (Masci et al., 2011) is used to reduce the dimensions of rendered character bitmaps and capture high-level features. The architecture of the convAE is shown in Fig. 6. The convAE is composed of 5 convolutional layers in both the encoder and the decoder. Strides larger than one are used instead of pooling layers. Convolutional and deconvolutional layers on the same level share the same kernel. The input image is a 60×60 8-bit grayscale bitmap, and the encoder extracts a 512-dimensional feature. The feature of character $c_k$ from the encoder is referred to as the character glyph feature $\vec{g}_k$ in this paper.
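The ctxG word embedding of word $w_i$ adds the mean glyph feature of its characters to the character-enhanced word embedding:
$$\vec{w}_i^{ctxG} = \vec{w}_i + \frac{1}{|C(i)|} \sum_{c_j \in C(i)} \left( \vec{c}_j + \vec{g}_j \right),$$
where $C(i)$ is the set of compositional characters of $w_i$ and $\vec{g}_j$ is the glyph feature of $c_j$. The model predicts the target word $w_i$ from the ctxG word embeddings of its contexts, as shown in Fig. 7. The parameters of the convAE are pre-trained and not jointly learned with the embeddings $\vec{w}$ and $\vec{c}$, so the character glyph features $\vec{g}$ are fixed during training.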

Enhanced by Target Word Glyphs
Here we propose another variant. The model structure is the same as in Fig. 7. The difference lies in the hidden vector used to predict the target word. Instead of adding the mean of the character glyph features of the contexts, it adds the mean of the glyph features of the target word (tarG), as shown in Fig. 8. As in Section 3.2.1, the convAE is not jointly learned.
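The difference between the two glyph-enhanced variants can be sketched as follows. This is only an illustration under two assumptions: the glyph features already have the same dimensionality as the embeddings, and the hidden vector is a plain mean over context words. All function names are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def hidden_ctxg(context_cwe_vecs, context_glyph_vecs):
    """ctxG: each context word is enhanced by the mean glyph feature of
    its own characters, then the enhanced context vectors are averaged.

    context_glyph_vecs[i] lists the glyph features of context word i.
    """
    enhanced = [add(cwe, mean_vector(glyphs))
                for cwe, glyphs in zip(context_cwe_vecs, context_glyph_vecs)]
    return mean_vector(enhanced)

def hidden_targ(context_cwe_vecs, target_glyph_vecs):
    """tarG: the averaged context vector is enhanced by the mean glyph
    feature of the target word's characters instead."""
    return add(mean_vector(context_cwe_vecs), mean_vector(target_glyph_vecs))
```

In both cases the glyph features are fixed inputs, consistent with the convAE not being trained jointly.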

RNN-Skipgram
We learn the word representation $\vec{w}_i$ directly from the sequence of character glyph features $\{\vec{g}_k \mid c_k \in C(i)\}$ of word $w_i$, with the Skipgram objective. As shown in Fig. 9, a 2-layer Gated Recurrent Unit (GRU) (Cho et al., 2014) network followed by 2 fully connected ELU (Clevert et al., 2015) layers produces the word representation $\vec{w}_i$ from the input sequence $\{\vec{g}_k\}$ of word $w_i$. $\vec{w}_i$ is then used to predict the contexts of $w_i$. During training we use negative sampling and subsampling of frequent words from (Mikolov et al., 2013b).
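To make the recurrent encoding concrete, the sketch below reads a sequence of glyph features with a single GRU layer and applies ELU to the final state. It is a deliberate simplification: the second GRU layer and the fully connected layers are omitted, the gate convention (new state = (1 - z) * old + z * candidate) is one of several in use, and the parameter names are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def gru_step(params, x, h):
    """One GRU update. params maps 'Wz', 'Uz', 'bz', 'Wr', 'Ur', 'br',
    'Wh', 'Uh', 'bh' to weight matrices and bias vectors."""
    def gate(name, hidden, squash):
        pre = [a + b + c for a, b, c in zip(matvec(params["W" + name], x),
                                            matvec(params["U" + name], hidden),
                                            params["b" + name])]
        return [squash(p) for p in pre]

    z = gate("z", h, sigmoid)                      # update gate
    r = gate("r", h, sigmoid)                      # reset gate
    reset_h = [ri * hi for ri, hi in zip(r, h)]
    candidate = gate("h", reset_h, math.tanh)      # candidate state
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, candidate)]

def encode_word(params, glyph_features, hidden_size):
    """Run the GRU over a word's glyph features, then apply ELU."""
    h = [0.0] * hidden_size
    for g in glyph_features:
        h = gru_step(params, g, h)
    return [elu(v) for v in h]
```

Because the same parameters encode every word, an update made for one word also changes the representations of all other words.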

RNN-GloVe
We modify the GloVe model to learn directly from character glyph features, as in Fig. 10. We feed the character glyph feature sequences $\{\vec{g}_k \mid c_k \in C(i)\}$ and $\{\vec{g}_k \mid c_k \in C(j)\}$ of word $w_i$ and context $w_j$ to a shared GRU network. The outputs of the GRU are then fed to two different sets of fully connected ELU layers to produce the word representations $\vec{w}_i$ and $\tilde{\vec{w}}_j$. The inner product of $\vec{w}_i$ and $\tilde{\vec{w}}_j$ is the prediction of the log co-occurrence $\log(X_{ij})$. We apply the same weighted loss function as in GloVe. We follow (Pennington et al., 2014) and use $\vec{w}_i + \tilde{\vec{w}}_i$ for evaluations of word representations.
Figure 10: Model architecture of RNN-GloVe. A shared GRU network and 2 different sets of fully connected ELU layers produce $\vec{w}_i$ and $\tilde{\vec{w}}_j$. The inner product of $\vec{w}_i$ and $\tilde{\vec{w}}_j$ is the prediction of the log co-occurrence $\log(X_{ij})$.
word, LDC2003T09). All foreign words, numerical words, and punctuation marks were removed. Word segmentation was performed using the open-source Python package jieba. From all 316,960,386 segmented words, we extracted 8780 unique characters, and used a TrueType font (BiauKai) to render each character glyph to a 60×60 8-bit grayscale bitmap. Furthermore, we removed words whose frequency was 25 or lower, leaving 158,565 unique words as the vocabulary set.
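The vocabulary filtering step can be sketched as follows, assuming the corpus is already available as a flat list of segmented words. The function names are hypothetical; segmentation and bitmap rendering are left out.

```python
from collections import Counter

def build_vocabulary(segmented_words, max_removed_count=25):
    """Keep words that occur more than max_removed_count times."""
    counts = Counter(segmented_words)
    return {word for word, count in counts.items() if count > max_removed_count}

def unique_characters(words):
    """Collect every distinct character that appears in the given words."""
    return {char for word in words for char in word}
```

Each character returned by `unique_characters` would then be rendered once to a bitmap, so the rendering cost grows with the number of characters rather than the number of words.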

Extracting Visual Features of Character Bitmap
Inspired by (Zeiler et al., 2011), layer-wise training was applied to our convAE. From the lowest level to the highest, the kernel of each layer was trained individually for 100 epochs, with the other kernels frozen. The loss function is the Euclidean distance between the input and the reconstructed bitmap, and we added L1 regularization to the activations of the convolutional layers. We chose Adagrad as the optimization algorithm, and set the batch size to 20 and the learning rate to 0.001. The comparison between the input bitmaps and their reconstructions is shown in Fig. 11. The input bitmaps are in the upper row, while the reconstructions are in the lower row. We further visualized the extracted character glyph features with t-SNE (Maaten and Hinton, 2008). Part of the visualization result is shown in Fig. 12. In Fig. 12, characters with the same components are clustered together. The result shows that the features extracted by the convAE are capable of expressing the graphical information in the bitmaps.
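The training objective described above combines a reconstruction term with a sparsity penalty. A minimal sketch on flattened pixel lists is given below; the regularization weight `l1_weight` is a hypothetical parameter, since its value is not stated here.

```python
import math

def convae_loss(input_pixels, reconstructed_pixels, activations, l1_weight=1e-4):
    """Euclidean reconstruction distance plus an L1 penalty on activations.

    activations is a list of per-layer activation lists.
    """
    reconstruction = math.sqrt(sum((x - y) ** 2
                                   for x, y in zip(input_pixels, reconstructed_pixels)))
    sparsity = sum(abs(a) for layer in activations for a in layer)
    return reconstruction + l1_weight * sparsity
```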

Training Details of Word Representations
We used the CWE code to implement both CBOW and Skipgram, along with CWE. The number of multi-embeddings was set to 3. We modified the CWE code to produce GWE representations. For CBOW, Skipgram, CWE, GWE and RNN-Skipgram, we used the following hyperparameters. The context window was set to 5 words on both sides of a word. We used 10 negative samples, and the subsampling threshold $t$ was set to $10^{-5}$.
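To illustrate the role of the subsampling threshold $t$, the sketch below uses the discard probability given in (Mikolov et al., 2013b), $1 - \sqrt{t / f(w)}$, where $f(w)$ is the relative frequency of the word. Released implementations may use a slightly different expression, so this should be read as an illustration of the idea rather than a description of the CWE code.

```python
import math

def discard_probability(word_frequency, t=1e-5):
    """Probability of dropping one occurrence of a word during training.

    word_frequency is the word's relative frequency in the corpus (0 to 1).
    Words at or below the threshold are never discarded.
    """
    if word_frequency <= t:
        return 0.0
    return 1.0 - math.sqrt(t / word_frequency)
```

With $t = 10^{-5}$, a word that makes up 0.1% of the corpus is dropped about 90% of the time, while rare words are always kept.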
Since Yin et al. did not publish their code, we followed their paper and reproduced the MGE model. We created the mapping between characters and radicals from the Unihan database. Each character corresponds to one of the 214 radicals in this dataset, and the same hyperparameters as above were used in training. Note that, unlike the original CWE and MGE, we did not separate non-compositional words during training.
We used the GloVe code to train the baseline GloVe vectors. In constructing the co-occurrence matrix for GloVe and RNN-GloVe, we followed the parameter settings of $x_{\max} = 100$ and $\alpha = 0.75$ in (Pennington et al., 2014). The context window was 5 words on both sides of a word, and harmonic weighting was used on co-occurrence counts. For the RNN-GloVe model, we removed entries whose value was below 0.5 to speed up training.
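The construction of the co-occurrence counts can be sketched as follows: a symmetric window of 5 words, a contribution of $1/d$ for a pair at distance $d$, and removal of entries below 0.5. The dictionary-based representation and function names are assumptions for illustration; the actual GloVe code is organized differently.

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=5):
    """Harmonically weighted co-occurrence counts over a symmetric window."""
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for d in range(1, window + 1):
            j = i + d
            if j >= len(tokens):
                break
            # A pair at distance d contributes 1/d in both directions.
            counts[(word, tokens[j])] += 1.0 / d
            counts[(tokens[j], word)] += 1.0 / d
    return dict(counts)

def prune(counts, min_value=0.5):
    """Drop entries whose value is below min_value."""
    return {pair: value for pair, value in counts.items() if value >= min_value}
```

Pruning removes pairs that were seen only once at a distance of three or more words, which make up a large share of the matrix entries.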
To encourage further research, we published our convAE and embedding models on GitHub. The evaluation datasets were also uploaded; their details are explained in Section 5.

Word Similarity
A word similarity test contains multiple word pairs and their human annotated similarity scores. Word representations are considered good if the calculated similarity and human annotated scores have a high rank correlation. We computed the Spearman's correlation between human annotated scores and cosine similarity of word representations.
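The evaluation procedure can be sketched in plain Python as below: cosine similarity for each word pair, Spearman's correlation computed as the Pearson correlation of average ranks, and pairs containing out-of-vocabulary words skipped. The data layout (a list of `(word1, word2, score)` tuples and a dictionary of embeddings) is an assumption for illustration.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

def evaluate_similarity(pairs, embeddings):
    """Spearman correlation between model and human similarity scores."""
    model_scores, human_scores = [], []
    for word1, word2, score in pairs:
        if word1 in embeddings and word2 in embeddings:
            model_scores.append(cosine(embeddings[word1], embeddings[word2]))
            human_scores.append(score)
    return spearman(model_scores, human_scores)
```

Only the ordering of the scores matters, so the model's cosine values do not need to be on the same scale as the human annotations.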
Since few resources exist for traditional Chinese, we translated the WordSim-240 and WordSim-296 datasets provided by (Chen et al., 2015). Note that this translation is non-trivial. Some frequent words become out-of-vocabulary (OOV) because of differences in usage between simplified and traditional Chinese. For example, "butter" is translated to "黃油" in simplified Chinese, but to "奶油" in traditional Chinese. Besides, we manually translated SimLex-999 (Hill et al., 2016) to traditional Chinese, and used it as the third testing dataset. We also made these datasets public along with our code.
When calculating similarities, word pairs containing OOVs were removed. Table 1 shows only the results of position-based character embeddings, because the results of cluster-based character embeddings were worse in the experiments. We found that CWE consistently improved the performance only on SimLex-999 for both CBOW and Skipgram, probably because SimLex-999 contains more words that can be understood from their compositional characters. On SimLex-999, we observed that CWE was better with CBOW than with Skipgram. We think the reason is that CBOW+CWE predicts the target word with the mean of all character embeddings in the context, and thus has a less noisy feature, whereas Skipgram+CWE uses the character embeddings of an individual word. This noisy feature could have negative effects on predicting the target word. The GWEs were learned based on CWE in two ways. "ctxG" represents using glyph features of context words, while "tarG" represents using glyph features of target words. The glyph features improved CWE on WordSim-240 and SimLex-999, but not on WordSim-296. As for the MGE results, we were not able to reproduce the performance in (Yin et al., 2016). We list two possible reasons: we did not separate non-compositional words during training (character and radical embeddings are not used for these words in the original), and we created the character-radical index from a different data source. We conjecture that the first is the most crucial factor in reproducing MGE.
The results of RNN-Skipgram and RNN-GloVe are also in Table 1. Their results fall well short of those of CBOW and Skipgram. From the results, we conclude that it is not easy to produce word representations directly from glyphs. We think the reason is that the RNN representations depend on each other through shared parameters: updating the model parameters for word $w_i$ also changes the word representation of word $w_j$. As a result, it is much more difficult to train such models.
We further inspected the impact of glyph features by performing significance tests (see footnote 8) between the proposed methods and existing ones. The p-values of the tests are given in Table 2. We found that only the "tarG" method has a p-value less than 0.05 over CWE.
Table 2: p-values of significance tests between proposed methods and existing ones.
            +CWE+ctxG   +CWE+tarG
CBOW        0.085       0.215
CBOW+CWE    0.190       0.008

Word Analogy
An analogy problem has the following form: "king" : "queen" = "man" : "?", where "woman" is the answer to "?". A model that answers such questions correctly is considered capable of expressing semantic relationships. Furthermore, the analogy relation can be expressed by vector arithmetic on word representations, as shown in (Mikolov et al., 2013b). For the above problem, we find the word $w_i$ such that $\vec{w}_i = \arg\max_{\vec{w}} \cos(\vec{w}, \vec{w}_{queen} - \vec{w}_{king} + \vec{w}_{man})$.
Footnote 8: We followed the method described in https://stats.stackexchange.com/questions/17696/
Table 3: Accuracy of analogy problems for capitals of countries, (China) states/provinces of cities, family relations, and our proposed job&place (J&P) dataset. The higher the values, the better the results.
As in the previous subsection, we translated the word analogy dataset in (Chen et al., 2015) to traditional Chinese. The dataset contains 3 groups of analogy problems: capitals of countries, (China) states/provinces of cities, and family relations. Considering that most capital and city names do not relate to the meanings of their compositional characters, and that we did not separate non-compositional words in our experiments, we proposed a new analogy dataset composed of jobs and places (job&place). However, there might be multiple corresponding places for a single job. For instance, a "doctor" could be in a "hospital" or a "clinic". In this job&place dataset, we therefore provide a set of places for each job. The model is considered to answer correctly as long as the predicted word is in this set.
We take the mean of all word representations of places, $\mathrm{mean}(\vec{w}_{places_1})$, for the first job ($job_1$), and find the place for another job ($job_2$) by calculating $w_i$ such that $\vec{w}_i = \arg\max_{\vec{w}} \cos(\vec{w}, \mathrm{mean}(\vec{w}_{places_1}) - \vec{w}_{job_1} + \vec{w}_{job_2})$.
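Both analogy queries reduce to a nearest-neighbour search by cosine similarity. The sketch below assumes embeddings stored in a dictionary and, as is common practice but not stated above, excludes the query words themselves from the candidates. All function names are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_word(embeddings, query, exclude=()):
    """Word whose vector has the highest cosine similarity to query."""
    best_word, best_score = None, -2.0
    for word, vector in embeddings.items():
        if word in exclude:
            continue
        score = cosine(vector, query)
        if score > best_score:
            best_word, best_score = word, score
    return best_word

def solve_analogy(embeddings, a, b, c):
    """a : b = c : ?  solved as argmax cos(w, b - a + c)."""
    query = [vb - va + vc for va, vb, vc
             in zip(embeddings[a], embeddings[b], embeddings[c])]
    return nearest_word(embeddings, query, exclude={a, b, c})

def solve_job_place(embeddings, job1, places1, job2):
    """Use the mean place vector of job1 to find a place for job2."""
    dim = len(embeddings[job1])
    mean_places = [sum(embeddings[p][k] for p in places1) / len(places1)
                   for k in range(dim)]
    query = [m - j1 + j2 for m, j1, j2
             in zip(mean_places, embeddings[job1], embeddings[job2])]
    return nearest_word(embeddings, query, exclude={job1, job2, *places1})

def is_correct(predicted, accepted_places):
    """A job&place answer counts if it is any of the accepted places."""
    return predicted in accepted_places
```

Averaging the place vectors keeps the query from depending on which of the acceptable places for the first job happens to be chosen.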
The results are shown in Table 3. We observed that CWE improved accuracy only for the family group. The results are not surprising: the words for family relations are compositional in Chinese, whereas capital and city names usually are not. We observed that GWE further improved CWE for words in the family group. From Table 3, we found that glyph features are helpful when the characters can enhance word representations. This is reasonable because glyph features are rich representations of characters. If character information does not play a role in learning word representations, character glyphs may not be useful. The same phenomenon is observed in Table 1.
On our job&place dataset, we still observed that GWE improved CWE; however, both CWE and GWE were slightly worse than CBOW. We also observed that Skipgram-based methods became worse than CBOW-based methods, while in all previous evaluations Skipgram-based methods were consistently better.
The results of RNN-Skipgram and RNN-GloVe are still poor. We observe that the word representations learned from RNN can no longer be expressed by vector arithmetic. The reason is still under investigation.

Case Study
To further probe the effect of glyph features, we show the following word pairs in SimLex-999 whose calculated cosine similarities are higher with GWE models than with CWE. The pairs may not look alike, but their components share related semantics. For example, in "伶俐" (clever), the component "利" (sharp) contributes to the meaning of "俐" (acute), describing someone with a sharp mind. Other examples show the ability to associate semantics with radicals.
We also provide several counter-examples. Below are some word pairs which are not similar, yet for which GWE methods produce higher similarity than CBOW or CWE. Take "山峰" (mountain) and "蜂蜜" (honey) as an example. Since they share no common characters, the only thing in common is the component "夆", and we assume this to be the reason for the higher similarity. Also note that in the pair "無趣" (boring) and "好笑" (funny), the CWE similarity is also higher. We conclude that the character "無" (none) is not weighted strongly enough, so the character "趣" (fun) dominates the representation of the word "無趣" (boring), and a higher score was mistakenly assigned.
Table 5: Counter-examples to which GWE methods give higher similarity scores than CBOW or CWE.

Conclusions
This work pioneers the enhancement of Chinese word representations with character glyphs. The character glyph features are directly learned from the bitmaps of characters by a convAE. We then proposed 2 methods for learning Chinese word representations: the first uses character glyph features as an enhancement; the other learns word representations directly from sequences of glyph features. In experiments, we found the latter infeasible: training word representations with an RNN without word and character information is challenging. Nonetheless, the glyph features improved the character-enhanced Chinese word representations, especially on the word analogy task related to family. Overall, the gains from exploiting character glyph features in word representation learning were modest. Perhaps the co-occurrence information in the corpus plays a bigger role than glyph features. Nonetheless, the idea of treating each Chinese character as an image is innovative. As more character-level models (Zheng et al., 2013; Kim, 2014; Zhang et al., 2015) are proposed in the NLP field, we believe glyph features could serve as an enhancement, and we will further examine the effect of glyph features on other tasks, such as word segmentation, POS tagging, and dependency parsing, or on downstream tasks such as text classification and document retrieval.