Glyph2Vec: Learning Chinese Out-of-Vocabulary Word Embedding from Glyphs

Chinese NLP applications that rely on large text corpora often face vocabularies containing huge numbers of words that are sparse in the corpus. We show that the written forms of characters, glyphs, in ideographic languages carry rich semantics. We present a multi-modal model, Glyph2Vec, to tackle the Chinese out-of-vocabulary (OOV) word embedding problem. Glyph2Vec extracts visual features from word glyphs to expand an existing word embedding space to OOV words, without the need to access any corpus, which is useful for improving Chinese NLP systems, especially in low-resource scenarios. Experiments across different applications show the significant effectiveness of our model.


Introduction
Word embeddings encode semantic and syntactic information (Mikolov et al., 2013a,b) in a low-dimensional space and have served as useful features for various NLP applications, but they often require large-scale corpora with billions of tokens to train.
A natural constraint of word embeddings is that it is impractical to collect, for any language, the entire vocabulary with high enough frequency to train an embedding for every word, since new words may appear in downstream tasks. A typical workaround is to simply assign a single UNK embedding to all out-of-vocabulary (OOV) words that do not appear in the training data.
Current solutions such as using subwords (e.g., characters) mainly target alphabetic languages (e.g., English and French), which are composed of a small set of characters. Such techniques may not be sufficient for ideographic languages (e.g., Chinese and Japanese), in which words are composed from a very large character inventory; traditional Chinese, for example, includes about 17k distinct character tokens. Not only word embeddings but also character embeddings can therefore be expected to suffer from underfitting. Even worse, words in ideographic languages are often composed of only 2-3 characters, whereas words in alphabetic languages are longer but built from a small character set. Figure 1 provides the statistics from the Chinese Sinica Corpus.

[Figure 2: (a) Components may share similar semantics even when they differ graphically; 鳥 and 隹 both refer to birds. (b) Cangjie input method: each character can be represented as several keyboard inputs based on its components (e.g., 惆 is 心+月+土+口).]
The visual structure (or glyph) of a Chinese character contains rich semantics. A Chinese character is made up of several graphical components. Figure 2 shows examples in which components of characters represent similar semantics or pronunciations. In addition to glyphs, we propose to use the high-quality features provided by the Cangjie input method to represent each character. Cangjie is a popular Chinese input method in which, similar to radicals, characters are composed of 24 basic graphical units; each unit is mapped to a corresponding letter key on a standard QWERTY keyboard. Building on character glyphs, one can intuitively guess the semantics of a word. Recent work (Chen et al., 2015; Xu et al., 2016; Yin et al., 2016; Liu et al., 2017; Su and Lee, 2017) has shown the benefits of character-level compositionality and of visual features of Chinese glyphs for some tasks. (Sinica Corpus: http://asbc.iis.sinica.edu.tw/indexreadme.htm)
In this work, we suggest that glyphs can be particularly useful in the OOV scenario. A key observation for solving the OOV problem matches the intuition of human generalization in Chinese: when a Chinese reader encounters an unseen word or character, decomposing its structure into graphical components such as radicals often helps them understand its meaning, and sometimes its pronunciation.
We study a novel application that recovers Chinese OOV word embeddings from glyphs. Our work answers a question: given pre-trained word embeddings, can we directly learn a mapping from word glyphs to word embeddings and generalize that mapping to generate embeddings of OOV words? We formulate this as a visual-to-text transfer learning problem and show that the visual structure of Chinese characters is helpful in learning Chinese OOV embeddings.

Related Work
Exploiting Structure of Chinese Characters Recent work has explored the use of Chinese character structure in different settings (E and Xiang, 2017; Liu et al., 2017; Dai and Cai, 2017). Several works aim to use character-level features to enhance standard word embedding learning models (e.g., Word2Vec or GloVe). CWE (Chen et al., 2015) proposes a character-level formulation of words when training word embeddings; SCWE (Xu et al., 2016) and Li et al. (2015) extend this to consider the compositional relations of characters. MGE (Yin et al., 2016) and Shi et al. (2015) further include radical information associated with characters. Yu et al. (2017) jointly embed Chinese words, characters, and radicals. GWE (Su and Lee, 2017) proposes to extract features from character bitmaps as the inputs of Word2Vec and GloVe. Our work differs from all of these, since we focus on generating OOV word embeddings, which none of them handles.
Learning Embeddings for OOVs To handle OOV words, one approach operates on character-level embeddings and then averages them into word embeddings (Kim et al., 2016; Wieting et al., 2016). Morphology-based approaches take advantage of meaningful linguistic substructures (Botha and Blunsom, 2014; Luong et al., 2013; Bhatia et al., 2016), but they often struggle with vocabulary lacking such substructures, such as names and transliterations from foreign languages, which frequently appear as OOV words. In all the models above, just as in Word2Vec (Mikolov et al., 2013c), the embeddings need to be learned by training over a large corpus.
The most similar work is the Mimick model (Pinter et al., 2017). By learning a character-level generation model, guided by minimizing the distance between the output embeddings of LSTMs and pre-trained word embeddings, Mimick shows the feasibility of generating OOV word embeddings from character compositions. However, Mimick is designed mainly from the viewpoint of alphabetic languages and does not consider glyphs. Chinese words are often short sequences drawn from a very large set of distinct tokens, which is difficult for language-model approaches to handle (see Figure 1) and can lead to under-fitting.

Our Model: Glyph2Vec
We formulate the task of learning OOV embeddings as a transfer learning problem. Formally, we are given a Chinese vocabulary set V of size |V| and a pre-trained embedding matrix E ∈ R^{|V|×d}, where each word w_i is associated with a vector e_i of dimension d, forming the training set {(w_i, e_i)}_{i=1}^{|V|}. We aim to learn a mapping F : w → R^d that projects an input word into the d-dimensional embedding space such that F(w_i) ≈ e_i. At test time, a word w_t may fall outside V, yet the model must still predict its embedding e_t as F(w_t).
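As an illustration of this formulation, the sketch below fits a linear map from hypothetical k-dimensional word features to the d-dimensional embedding space by least squares, as a simplified stand-in for the learned mapping F; all features, embeddings, and dimensions are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
V, k, d = 1000, 64, 300          # vocabulary size, feature dim, embedding dim

X = rng.normal(size=(V, k))      # visual features of in-vocabulary words
W_true = rng.normal(size=(k, d))
E = X @ W_true                   # pre-trained embeddings e_i (synthetic here)

# Fit F(w) = x_w @ W by minimizing sum_i ||F(w_i) - e_i||^2.
W, *_ = np.linalg.lstsq(X, E, rcond=None)

x_oov = rng.normal(size=(1, k))  # features of an unseen word w_t
e_oov = x_oov @ W                # predicted embedding F(w_t), shape (1, 300)
```

A neural mapping replaces the linear map in practice, but the objective, fitting F on in-vocabulary pairs and applying it to unseen words, is the same.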
Given the glyphs of a word x = [c_j]_{j=1}^{|x|}, a sequence of 2D character bitmaps provided according to V, we consider a function g : x → R^k that transforms glyphs into visual features. The overall architecture is illustrated in Figure 3.

Visual Feature Extractor
We consider two implementations of visual feature extractor g.
ConvAE We adopt the convolutional autoencoder ConvAE (Masci et al., 2011) to capture the structure of character bitmaps c. The architecture of the ConvAE follows Figure 6 in Su and Lee (2017). After training, the encoder is frozen and used as an extractor that produces a 512-dimensional feature for every character c. The input bitmaps are 60×60 8-bit grayscale images.
Cangjie Composition We propose to use Cangjie input codes as high-level annotations of characters, which can be easily collected from the input-method dictionary. We construct a Bag-of-Roots (BoR) vector for each character according to the Cangjie dictionary: a 24-dimensional binary vector indicating which roots the character possesses.
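A BoR vector can be built with a few lines of code. The sketch below is illustrative only: the 24-root inventory is assumed to map onto keys a-x, and the sample code `pbr` is a hypothetical Cangjie code rather than a real dictionary entry.

```python
# Bag-of-Roots (BoR) encoding from Cangjie input codes. The root-to-key
# mapping below is an assumed placeholder; a real system would read
# character codes from the input-method dictionary.
ROOTS = "abcdefghijklmnopqrstuvwx"           # 24 Cangjie root keys (assumed)
ROOT_INDEX = {r: i for i, r in enumerate(ROOTS)}

def bag_of_roots(cangjie_code: str) -> list:
    """Binary 24-dim vector marking which roots appear in the code."""
    vec = [0] * len(ROOTS)
    for key in cangjie_code.lower():
        vec[ROOT_INDEX[key]] = 1
    return vec

vec = bag_of_roots("pbr")                    # hypothetical 3-root code
```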

Compositional Model: From Characters to Words
After the visual features of every character in a word are extracted, they still need to be composed at the word level. A compositional model f takes a sequence of characters' visual features and projects it onto the word embedding space. The right portion of Figure 3 shows the architecture of f. We construct a bi-directional RNN with GRU cells (Cho et al., 2014) to compute the expected word embedding from the character feature sequence; the final output is a 300-dimensional predicted word embedding. For the backpropagation loss, we adopt the squared Euclidean distance between the prediction F(x) = f(g(x)) and the gold word embedding w: ||F(x) − w||².
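The composition-and-loss step can be sketched as follows; for brevity, mean pooling plus a linear projection stands in for the bi-directional GRU, and all weights and features are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
k, d = 512, 300                       # visual feature dim, embedding dim

def compose(char_feats, W):
    """Project a (num_chars, k) sequence of visual features to a single
    d-dim word vector. Mean pooling replaces the paper's BiGRU here."""
    return char_feats.mean(axis=0) @ W

W = rng.normal(size=(k, d)) * 0.01    # placeholder projection weights
word_feats = rng.normal(size=(2, k))  # visual features of a 2-character word
gold = rng.normal(size=(d,))          # pre-trained gold embedding w

pred = compose(word_feats, W)         # F(x) = f(g(x))
loss = np.sum((pred - gold) ** 2)     # squared Euclidean distance
```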

Pre-trained Chinese Character Embedding
Unlike alphabetic languages, each Chinese character carries its own meaning, and state-of-the-art Chinese word embedding models (Chen et al., 2015; Xu et al., 2016; Yin et al., 2016) exploit this character-level information. As a sanity check, in Fig. 4 we visualize the embeddings of seen and OOV words. One can observe meaningful clusters of words with similar visual structure. For example, 烤雞 (roast chicken) is mapped near 烤鴨 (roast duck) because 雞 (chicken) and 鴨 (duck) have different glyphs that both relate to birds. Cooking verbs containing the radical 火 (fire), such as 烤 (roast) and 燒烤 (grill), are also mapped closely. Some unseen characters (i.e., "words" with only one character) can also be predicted reasonably.

Nearest Neighbor Examples
We qualitatively analyze Glyph2Vec with a nearest-neighbor (NN) sanity check. Table 2 shows the nearest neighbors retrieved for OOV word queries by Mimick and by our Glyph2Vec embeddings (using V), respectively.
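Such nearest-neighbor probes amount to ranking in-vocabulary embeddings by distance to a generated OOV embedding. A minimal sketch with synthetic vectors (the listed vocabulary words are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = ["烤鴨", "烤雞", "燒烤", "學府"]      # illustrative in-vocabulary words
E = rng.normal(size=(len(vocab), 300))       # their (synthetic) embeddings

def nearest_neighbors(query, E, topk=2):
    """Return indices of the topk rows of E closest to query (Euclidean)."""
    dists = np.linalg.norm(E - query, axis=1)
    return np.argsort(dists)[:topk]

# A generated OOV embedding near the first vocabulary entry
# should retrieve that entry first.
query = E[0] + 0.01 * rng.normal(size=300)
idx = nearest_neighbors(query, E)
```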
We observe that Glyph2Vec models visual semantics by associating characters that share related visual features, since it learns from character images. For example, 鰻 (eel) in 蛇鰻 (snake eel) shares the 魚 (fish) radical with 石鱸 (Haemulidae, a fish name). 銠 (Rh) and 氯 (Cl) in 三氯化銠 (RhCl3) associate with visual features related to chemistry, such as 金 in 鈰 (Ce), 气 in 氟 (F), and 酉 in 酸 (acid).
On the other hand, we observe some properties, including composition (e.g., of numbers) and character semantics, that both Glyph2Vec and Mimick provide. (1) Composition: composing characters whose individual meanings differ greatly from the meaning of the whole. For instance, 茲尼約夫 is a transliteration of Seleznev (a Russian name), in which every character is meaningless alone but forms a meaningful transliteration when combined. With the character-level compositional model in Glyph2Vec, it can be retrieved given 克羅迪歐 (Claudio, a Western name). Moreover, Glyph2Vec preserves the correct meaning of a character when it attaches to different characters. For example, 驟 (abrupt) 減 (decrease) properly retrieves 減少 (cut back) and 減低 (reduce), even though 減 (subtract) is combined with different characters. (2) Character semantics: associating different characters with similar meanings. For example, 道 (street) is related to 巷 (lane) and 弄 (alley), and they are retrieved by our model given the OOV word 學府二道 (Xuefu 2nd Street), even though the characters look completely different. (Pre-trained embeddings: https://sites.google.com/site/rmyeid/projects/polyglot)

Joint Tagging of Parts-of-Speech (POS) and Morphosyntactic Attributes
We follow the experimental protocols for part-of-speech tagging and morphosyntactic attribute tagging stated in Mimick (Pinter et al., 2017). There are two part-of-speech tagging tasks based on the Chinese treebank of the Universal Dependencies (UD) scheme (De Marneffe et al., 2014). To avoid tuning towards OOV words, we adopt evaluation protocols similar to generalized zero-shot learning (Chao et al., 2016; Xian et al., 2017), in which the embeddings of not only unseen but also seen words must be generated. Results with both word-level LSTM and character LSTM are reported (Table 3). With visual features available, Glyph2Vec consistently outperforms Mimick. On the other hand, we observe that using pre-trained character embeddings only helps accuracy on seen words, not OOV words, which suggests that a module like Mimick or Glyph2Vec is necessary to learn to compose characters for OOV words.

Wikipedia Title Classification
As introduced in Sec. 1, real-world Chinese systems can suffer from a severe OOV problem. An example is the Wikipedia encyclopedia, which contains many rare words that easily become OOV, such as terminology, scientific names, and geographic locations. We utilize the Wikipedia Title dataset (Liu et al., 2017) to study the problem. The dataset is a collection of 593K Chinese articles from Wikipedia, categorized into 12 classes based on their titles. We preprocess the data by removing punctuation, special characters, and other non-Chinese instances, and by converting Arabic numerals into Chinese text. We use the open-source Jieba toolkit to segment each title into words. 52.5% of the words are OOV with respect to the Sinica Corpus, and we generate their embeddings with Glyph2Vec.

[Table 4: Wikipedia Title Classification Accuracy]
We construct a neural network classifier that takes the generated word embeddings as input to evaluate our method. The classifier consists of 3 fully-connected (FC) layers on top of the averaged word embeddings of a title. Results are shown in Table 4. With the glyph feature and the Cangjie BoR feature provided, performance improves significantly over treating OOV words as UNK in this challenging setting.
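A minimal sketch of such a classifier's forward pass, with randomly initialized placeholder weights and the dimensions assumed above (300-d embeddings, 12 classes):

```python
import numpy as np

rng = np.random.default_rng(3)
d, h, num_classes = 300, 128, 12       # embedding dim, hidden dim (assumed), classes

def classify(title_embs, params):
    """Average the word embeddings of a title, then apply 3 FC layers
    and a softmax to produce class probabilities."""
    W1, W2, W3 = params
    x = title_embs.mean(axis=0)        # (d,) averaged title embedding
    x = np.maximum(x @ W1, 0)          # FC + ReLU
    x = np.maximum(x @ W2, 0)          # FC + ReLU
    logits = x @ W3                    # final FC
    z = np.exp(logits - logits.max())  # numerically stable softmax
    return z / z.sum()

params = (rng.normal(size=(d, h)) * 0.05,
          rng.normal(size=(h, h)) * 0.05,
          rng.normal(size=(h, num_classes)) * 0.05)
probs = classify(rng.normal(size=(4, d)), params)   # a 4-word title
```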

Conclusion
In this work, we propose a multi-modal framework that expands a pre-trained embedding space to include OOV words using character visual features, namely Chinese character glyphs and Cangjie features. We have demonstrated the effectiveness of Glyph2Vec on traditional Chinese, and we believe Glyph2Vec can be applied to other ideographic languages to handle OOV words as well.
https://github.com/fxsjy/jieba We note that our accuracy cannot be compared with the results reported in Liu et al. (2017), since they did not consider OOV words or character/word embeddings. Here we use the dataset only to examine the performance of OOV embeddings.
For simplified Chinese, we suggest first converting the text into traditional Chinese, since traditional characters have richer structures and more semantics can likely be extracted by Glyph2Vec.