Glyph-aware Embedding of Chinese Characters

Given the advantages and recent success of English character-level and subword-unit models in several NLP tasks, we consider the equivalent modeling problem for Chinese. Chinese script is logographic, and many Chinese logograms are composed of common substructures that provide semantic, phonetic and syntactic hints. In this work, we propose to explicitly incorporate the visual appearance of a character’s glyph in its representation, resulting in a novel glyph-aware embedding of Chinese characters. Inspired by the success of convolutional neural networks in computer vision, we use them to incorporate the spatio-structural patterns of Chinese glyphs as rendered in raw pixels. In the context of two basic Chinese NLP tasks, language modeling and word segmentation, the model learns to represent each character’s task-relevant semantic and syntactic information in the character-level embedding.


Introduction
Recently, in combination with deep learning, character-level and subword-unit-level models have achieved state-of-the-art performance in various natural language processing (NLP) tasks involving Western languages (Wu et al., 2016). We consider the equivalent modeling problem for NLP tasks in Chinese. Unlike English script, which is alphabetic with a small alphabet, Chinese script is logographic with a large set of characters that are individually meaningful. According to the Table of General Standard Characters (通用规范汉字表) compiled by the Chinese government in 2013, there are 3,500 level-1 (most common) characters and 8,105 characters in total (Wikipedia, 2017). At the same time, it is not correct to treat Chinese characters as equivalent to English words, because the distribution of Chinese characters deviates markedly from Zipf's law (Zipf, 1935; Shtrikman, 1994). Furthermore, there is evidence that segmented Chinese words, some of which are unigrams, are distributed according to Zipf's law (Xiao, 2008). Arguably, the closest English equivalent of a Chinese character is a subword unit, i.e., a word fragment.
Furthermore, there is a strong case for modeling at the character level for tasks involving Chinese corpora, since Chinese text is usually written without word boundaries to indicate the segmentation of characters into words. As a consequence, word-segmented corpora are rare. Traditionally, systems are designed to process words as input, so a separately trained or hand-crafted routine often first segments the contiguous sequence of characters into words as part of preprocessing. However, this pipeline design might unnecessarily accumulate errors due to segmentation ambiguity that could be resolved at a later stage. The trend toward end-to-end training of differentiable, neural-network-based models also enables training character-level models jointly with the rest of the system under the task objective.
It is well known that the written forms of many Chinese characters, their glyphs, share common substructures, and that some of these substructures are informative of the semantics, syntactic role and phonetics of the characters. For example, for semantics, 雨 (rain), 雪 (snow), 雹 (hail) and 雷 (thunder) all contain the substructure 雨, which commonly denotes meteorological phenomena. For syntactic roles, 打 (hit), 提 (lift) and 抓 (grab) all contain ⺘, which is indicative of a verb. For phonetics, 乙 (yǐ), 亿 (yì) and 忆 (yì) all share 乙.
However, as far as we are aware, at the time of our work there was no study that explicitly exploited the spatio-structural information of a Chinese character's glyph for NLP tasks. In this work, we explore the effect of incorporating glyphs as additional features in the context of two common Chinese NLP tasks, segmentation and language modeling, resulting in a novel glyph-aware embedding of Chinese characters. This work's major contributions are:
• a novel character embedding model that explicitly incorporates the visual appearance of Chinese characters;
• new state-of-the-art results on a segmentation benchmark task.

Hypotheses
We hypothesize that the semantic and syntactic information of sub-glyph structures can help improve the character embeddings and thus improve performance in Chinese NLP tasks. Intuitively, representing each character only by its ID implies that any pair of characters is as distinct as any other pair. This ignores any common sub-glyph structures shared by characters. Therefore, by incorporating the glyph's visual information, we should be able to generalize knowledge learned about one character to another via their shared sub-glyph structures.
However, this hypothesis is not trivial, because many Chinese characters share strikingly similar visual appearances but not their meanings. Examples are 土 (soil) ↔ 士 (roughly the suffix "-er"; for instance, "fighter" translates to 斗士, where 斗 means "fight"), and 人 (person) ↔ 入 (enter). By identifying a character with only its visual appearance, we become vulnerable to this new source of ambiguity, which can harm performance. Because of this concern, we also include in our experiments a mixed embedding that combines the ID and glyph representations.
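The idea behind the mixed embedding can be illustrated with a minimal sketch. The experiments later describe the combination as vector addition. The function name, the 4-dimensional vectors and their values below are all hypothetical, chosen only to show that two look-alike characters with near-identical glyph vectors remain distinguishable once their ID vectors are added.

```python
def mixed_embedding(id_vec, glyph_vec):
    """Combine an ID embedding and a glyph embedding by element-wise addition."""
    if len(id_vec) != len(glyph_vec):
        raise ValueError("both embeddings must have the same dimension")
    return [a + b for a, b in zip(id_vec, glyph_vec)]


# Hypothetical values: the two characters look alike, so we pretend that the
# glyph embedder maps them to the same vector, while their ID vectors differ.
glyph_shared = [0.1, 0.1, 0.1, 0.1]
id_soil = [0.5, -1.0, 0.25, 0.0]      # stands in for the ID embedding of 土
id_scholar = [-0.5, 1.0, 0.0, 0.75]   # stands in for the ID embedding of 士

mixed_soil = mixed_embedding(id_soil, glyph_shared)
mixed_scholar = mixed_embedding(id_scholar, glyph_shared)

# The glyph part alone cannot separate the pair; the mixed vectors can.
print(mixed_soil != mixed_scholar)
```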

Method
In keeping with common neural network architectures, we decided to feed the glyph as input to a feed-forward neural network (FNN) model, an embedder, that outputs an embedding vector. In both the segmentation task and the language modeling task, this vector is then consumed by a recurrent neural network (RNN) to make predictions. In order to compare the proposed glyph-aware embeddings with glyph-unaware embeddings, we keep the RNN architecture fixed and change only the embedder in our experiments.
Considering that there are many different layouts for sub-glyph structures, and that the same radical can appear at different positions, we think the most promising representation that preserves both the identities and the spatial arrangement of substructures is the raw pixels of a glyph.
Inspired by the success of convolutional neural networks (CNNs) (LeCun et al., 1995) in learning feature representations in computer vision (Krizhevsky et al., 2012), we used a CNN to implement the embedder (see Figure 1). We believe that the spatial translational invariance induced by the CNN's filter structure is particularly suited to modeling radicals that can appear at different locations in a glyph. After the CNN, a fully connected layer outputs an embedding vector of some dimension k. To apply our method, we first render the glyph for a character using a font file and then feed the glyph as a gray-scale image into the CNN embedder.
We implemented our models and experiments efficiently with TensorFlow (Abadi et al., 2016). In particular, we cached rendered glyphs, which reduced repeated render calls for the same character by a factor of 1,000. We open-source our implementation for replicability.
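The glyph caching mentioned above can be sketched with simple memoization. This is only an illustration, not the released implementation. The function `render_glyph` is a hypothetical stand-in for a real font rasterizer and merely produces a deterministic grid of gray levels; the 8 x 8 size is likewise an assumption.

```python
from functools import lru_cache


def render_glyph(char, size=8):
    """Hypothetical stand-in for a font rasterizer.

    Returns a size x size grid of gray levels (0-255) as nested tuples.
    A real renderer would draw the character from a font file instead.
    """
    seed = ord(char)
    return tuple(
        tuple((seed * (r + 1) * (c + 1)) % 256 for c in range(size))
        for r in range(size)
    )


@lru_cache(maxsize=None)
def cached_glyph(char):
    """Render each distinct character once and reuse the result afterwards."""
    return render_glyph(char)


# Six characters, but only three distinct ones, so the renderer runs 3 times.
text = "雨雪雨雷雪雨"
glyphs = [cached_glyph(ch) for ch in text]
info = cached_glyph.cache_info()
print(info.misses, info.hits)
```

Because Chinese text reuses a few thousand characters very often, a cache keyed on the character keeps the number of actual render calls close to the vocabulary size.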

Chinese language modeling
Following the common approach in language modeling (LM), we model the likelihood of a sentence as
$$p(c_1, \dots, c_n) = \prod_{i=1}^{n} p(c_i \mid c_1, \dots, c_{i-1}),$$
where $c_i$ is the $i$-th character in a sentence of $n$ characters. The conditional distribution $p(c_i \mid c_1, \dots, c_{i-1})$ is modeled by a gated recurrent unit (GRU) (Chung et al., 2014) together with an embedder. In all the experiments, we used a GRU with a 128-dimensional hidden state and 300-dimensional embedding vectors for all embedders. For the CNN embedder, we use a two-layer CNN: 32 (7, 7) filters with (2, 2) stride in the first layer, 16 (5, 5) filters with (2, 2) stride in the second layer, and a fully-connected layer at the end. We use the ReLU non-linearity in all layers (Nair and Hinton, 2010). For the linear embedder, we used only one fully-connected layer. For the last row, "ID + CNN embedder", in Table 1, we combine the embedding vectors output by the ID and CNN embedders via vector addition. In all runs, we limited the vocabulary size to 4,000 plus one unknown class.
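The factorization of the sentence likelihood into per-character conditionals can be made concrete with a short sketch. It assumes that the model has already produced the conditional probability of each observed character, and that perplexity is computed per character as the exponential of the negative mean log-probability. The probability values are hypothetical.

```python
import math


def sentence_log_likelihood(cond_probs):
    """Sum of log p(c_i | c_1..c_{i-1}) over the characters of one sentence."""
    return sum(math.log(p) for p in cond_probs)


def perplexity(cond_probs):
    """Per-character perplexity: exp of the negative mean log-probability."""
    return math.exp(-sentence_log_likelihood(cond_probs) / len(cond_probs))


# Hypothetical conditionals for a 4-character sentence.
probs = [0.25, 0.5, 0.125, 0.5]

log_lik = sentence_log_likelihood(probs)
print(math.exp(log_lik))   # product of the conditionals, i.e. p(c_1..c_4)
print(perplexity(probs))
```

Summing logarithms rather than multiplying probabilities avoids numerical underflow on long sentences.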
We experimented with language modeling on the Microsoft Research dataset (MSR) from the Second International Chinese Word Segmentation Bakeoff (Emerson, 2005). First, we should note that the CNN embedder outperformed the linear embedder by a large margin (see the second and third rows in Table 1).

Chinese word segmentation
We use the Peking University dataset (PKU) and the Microsoft Research dataset (MSR) from the Second International Chinese Word Segmentation Bakeoff (Emerson, 2005) to compare the proposed CNN embedder with the ID embedder. We formulated the segmentation task as a structured prediction problem: for each character, given the whole input sentence, predict whether to insert a word boundary after it. We experimented with both unidirectional GRU and bidirectional long short-term memory (LSTM) recurrent networks (Graves and Schmidhuber, 2005; Hochreiter and Schmidhuber, 1997; Schuster and Paliwal, 1997) as the sequence prediction models (the RNN segmentor; see Table 2 and Table 3). The RNN segmentor takes the sequence of embeddings from the embedder as input. For the CNN embedder, we used a single-layer ReLU-gated CNN: 16 (5, 5) filters with (2, 2) stride, followed by a fully-connected layer that outputs a 100-dimensional embedding vector. For the RNN segmentor, the hidden state is 100-dimensional, with a fully-connected layer mapping the output hidden state to a binary prediction at each character. Overall, on both PKU and MSR, the proposed mixed embedder combined with the bidirectional LSTM achieved the best performance, outperforming the previous state of the art by a significant margin. As in the LM experiments, we use a vocabulary of 4,000 plus one unknown class.
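The per-character boundary formulation can be sketched as a pair of conversions between a segmented sentence and binary labels. The sentence below is a hypothetical example, not one taken from PKU or MSR. Labeling the final character of the sentence as a boundary is also an assumption of this sketch.

```python
def words_to_labels(words):
    """Return (characters, labels); label 1 means a word boundary follows."""
    chars, labels = [], []
    for word in words:
        for i, ch in enumerate(word):
            chars.append(ch)
            labels.append(1 if i == len(word) - 1 else 0)
    return chars, labels


def labels_to_words(chars, labels):
    """Rebuild the word sequence from characters and boundary labels."""
    words, current = [], ""
    for ch, label in zip(chars, labels):
        current += ch
        if label == 1:
            words.append(current)
            current = ""
    if current:  # trailing characters without a closing boundary
        words.append(current)
    return words


# Hypothetical segmented sentence ("we like rainy days").
segmented = ["我们", "喜欢", "下雨", "天"]
chars, labels = words_to_labels(segmented)
print("".join(chars))
print(labels)
print(labels_to_words(chars, labels))
```

Under this encoding, the segmentor's binary output at each position is exactly one entry of `labels`, so predicted labels can be decoded back into words with `labels_to_words`.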

Discussion
It should be noted that the number of parameters of the proposed CNN embedder differs from that of the ID embedder. Suppose the dimensionality of the embedding vectors is K and the vocabulary size is N. The CNN embedder has O(N + K) parameters: O(K) trainable parameters and O(N) glyphs rendered from a font file. In contrast, the ID embedder has O(NK) parameters, all of which are trainable. This means that the CNN embedder is a more compact representation whose performance is competitive with that of the ID embedder. Shi et al. (2015) represented a character by its radicals based on the Wubi input method, but this ignores the scale and spatial arrangement of each radical, both of which are present in our rendered glyphs.
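A rough count of trainable parameters illustrates the comparison. The filter shapes, the embedding dimension of 300 and the vocabulary of 4,000 plus one unknown class come from the language modeling setup. The 24 x 24 glyph size, the "same"-style padding rule and the presence of bias terms are assumptions of this sketch, since the text does not state them.

```python
import math


def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights plus one bias per output channel of a 2D convolution."""
    return kernel_h * kernel_w * in_channels * out_channels + out_channels


def same_padding_out(size, stride):
    """Output side length under 'same'-style padding (an assumption here)."""
    return math.ceil(size / stride)


def cnn_embedder_params(glyph_size, embed_dim):
    """Trainable parameters of the two-layer CNN embedder plus its dense layer."""
    conv1 = conv_params(7, 7, 1, 32)       # 32 filters of 7x7 on a gray image
    side = same_padding_out(glyph_size, 2)
    conv2 = conv_params(5, 5, 32, 16)      # 16 filters of 5x5 on 32 channels
    side = same_padding_out(side, 2)
    dense = side * side * 16 * embed_dim + embed_dim
    return conv1 + conv2 + dense


def id_embedder_params(vocab_size, embed_dim):
    """One trainable embed_dim-vector per vocabulary entry."""
    return vocab_size * embed_dim


K = 300            # embedding dimension from the LM experiments
N = 4000 + 1       # vocabulary plus one unknown class
GLYPH_SIZE = 24    # assumed glyph side length in pixels

print(cnn_embedder_params(GLYPH_SIZE, K))   # 187516
print(id_embedder_params(N, K))             # 1200300
```

Under these assumptions the CNN embedder's trainable parameter count does not grow with the vocabulary size, whereas the ID embedder's count grows linearly with it.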

Related work
It came to our attention late that, independently, Liu et al. (2017) considered the same character-level modeling problem and experimented with vanilla CNN models almost identical to ours. They evaluated their method on a new document classification task rather than on the commonly used tasks and benchmarks considered in this work. Consistent with their findings, we observed similar effects of the CNN embedder, ID embedder and mixed embedder in our tasks. Our mixed embedder corresponds roughly to their early fusion model. Costa-jussà et al. (2017) also considered incorporating Chinese glyphs as additional features in their Chinese-Spanish machine translation system, and their modeling approach corresponds roughly to our linear embedder.

Future work
We hope to delve deeper into the cause of the CNN embedder's low performance in the LM task. In particular, we want to experiment with bag-of-stroke prediction in a multi-task loss to provide the CNN with extra supervision during training. Furthermore, in this work we have explored only two NLP tasks, both of which emphasize semantic and syntactic information. In the future, we hope to explore tasks that require more phonetic information, such as phoneme prediction.

Conclusion
Our experiments show that glyph-aware embedding can improve performance in some Chinese NLP tasks, in particular the word segmentation task. Further studies are needed to understand the usefulness of glyph features more comprehensively. However, given the visual ambiguity inherent in Chinese characters and the difficulty of interpreting neural network models, any further research that uses glyph features and deep learning methods should exercise caution when measuring and verifying the contribution of the glyph features.

Figure 1: Left: our proposed glyph-aware CNN embedder. Right: the commonly used embedding model (we refer to this as the ID embedder). The trainable parameters are labeled in orange.
Given the lack of improvement of the proposed mixed embedder over the ID embedder in the language modeling task, we suspect that the CNN embedder is under-trained. Unlike a digit class in MNIST (LeCun et al., 2010), which has 6,000 training examples, a character has only one glyph for a given font, and every sub-glyph structure appears on average in only about 40 characters. Thus we suspect that the variability in the input to the CNN is too limited. Following a common image augmentation technique (Krizhevsky et al., 2012), we applied random jitter, i.e., 2D translation with ∆x, ∆y ∈ {−2, −1, 0, +1, +2}, to the input glyphs at training time. This increases the input variation 25-fold, but the perplexity degrades slightly to 49.66.
Since we mix the ID embedding and the CNN embedding by summation in the proposed mixed embedder, the norm of each component embedding determines the relative importance of that representation in the resulting embedding. In Figure 2, we observe that the norms of the CNN embeddings are distributed differently in the trained segmentation model and the trained language model. In the case of language modeling, the norm of the CNN embeddings is squashed, suggesting that the CNN embedding is largely ignored.
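The random jitter described above, translation by ∆x, ∆y ∈ {−2, …, +2}, can be sketched on a glyph stored as a list of pixel rows. Filling the vacated pixels with the background value 0 is an assumption, since the text does not say how borders are handled. The 5 x 5 toy glyph is hypothetical.

```python
import random

OFFSETS = (-2, -1, 0, 1, 2)


def translate(glyph, dx, dy, fill=0):
    """Shift a 2D glyph by dx columns and dy rows, padding with `fill`."""
    height, width = len(glyph), len(glyph[0])
    shifted = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                shifted[ny][nx] = glyph[y][x]
    return shifted


def random_jitter(glyph, rng=random):
    """Apply one random translation drawn uniformly from the 5 x 5 offsets."""
    return translate(glyph, rng.choice(OFFSETS), rng.choice(OFFSETS))


# Toy 5x5 glyph with a single ink pixel in the center.
glyph = [[0] * 5 for _ in range(5)]
glyph[2][2] = 255

# Enumerating all offsets yields 5 * 5 = 25 distinct inputs for one glyph.
all_variants = {
    tuple(map(tuple, translate(glyph, dx, dy)))
    for dx in OFFSETS
    for dy in OFFSETS
}
print(len(all_variants))
```

Each glyph thus yields 25 shifted copies, which matches the 25-fold increase in input variation, although the copies differ only by small translations.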

Figure 2: The distribution of the Frobenius norm of ID embeddings (id norm) and CNN embeddings (glyph norm) from the mixed embedder. Top: the segmentation task. Bottom: the language modeling task.

Table 1: LM performance of different embedders on the test split of MSR.
That the CNN embedder outperforms the linear embedder is expected, as a CNN is better suited to modeling image data. Second, the ID embedder (see the first row in Table 1) remains a very strong baseline, and the mixed embedder is only as good as the ID embedder by itself (see the fourth row in Table 1). It seems that the CNN embedder did not provide extra information useful for the task.