A Hybrid Learning Scheme for Chinese Word Embedding

To improve word embedding, subword information has been widely employed in state-of-the-art methods. These methods can be classified as either compositional or predictive models. In this paper, we propose a hybrid learning scheme that integrates compositional and predictive models for word embedding. Such a scheme can take advantage of both models and thus learn word embeddings effectively. The proposed scheme has been applied to learn word representations for Chinese. Our results show that the proposed scheme can significantly improve the performance of word embedding in terms of analogical reasoning and is robust to the size of training data.


Introduction
Word embedding, also known as distributed word representation, represents a word as a real-valued low-dimensional vector and encodes its semantic meaning into the vector. It is fundamental to many natural language processing (NLP) tasks, such as language modeling (Bengio et al., 2003; Mnih and Hinton, 2009), machine translation (Bahdanau et al., 2014; Sutskever et al., 2014), caption generation (Devlin et al., 2015) and question answering (Hermann et al., 2015).
Most previous word embedding methods suffer from high computational complexity and are difficult to apply to large-scale corpora. Recently, the Continuous Bag-Of-Words (CBOW) and Skip-Gram (SG) models (Mikolov et al., 2013a), which alleviate this issue, have received much attention. However, these models take the word as the basic unit and ignore rich subword information, which could significantly limit their performance. To improve the performance of word embedding, subword information, such as morphemes and character n-grams, has been employed (Luong et al., 2013; Qiu et al., 2014; Cao and Rei, 2016; Sun et al., 2016a; Wieting et al., 2016; Bojanowski et al., 2017). While these methods are effective, they were originally developed for alphabetic writing systems and cannot be applied directly to other writing systems, such as Chinese.
In Chinese, each word typically consists of fewer characters than in English 1 , while each character can have a complicated internal structure that conveys its meaning. Typically, a Chinese character can be decomposed into components (部), where each component has its own meaning. The internal semantic meaning of a Chinese word emerges from such a structure. For example, the Chinese word "海水 (seawater)" is composed of "海 (sea)" and "水 (water)". The semantic component of "海 (sea)" is "氵", which is a variant form of "水 (water)" and indicates that the character is related to "水 (water)". Therefore, the word "海水 (seawater)" has the meaning of "water from the sea".
Based on these linguistic features of Chinese, recent methods have used subword information to improve Chinese word embedding. For example, Chen et al. (2015) proposed a character-enhanced word embedding (CWE) model, which extends CBOW by representing context words with both character embeddings and word embeddings. Shi et al. (2015) proposed a radical embedding method, which used the CBOW framework but replaced word embeddings with radical embeddings. Yin et al. (2016) and Xu et al. (2016) extended the CWE model in different ways: the former presented a multi-granularity embedding (MGE) model, additionally using the embeddings associated with radicals detected in the target word; the latter proposed a similarity-based character-enhanced word embedding (SCWE) model, considering the similarity between a word and its component characters. Yu et al. (2017) introduced a joint learning word embedding (JWE) model, which jointly learned embeddings for words, characters and components, and used each of them separately to predict the target word. Cao et al. (2018), on the other hand, represented Chinese words as sequences of strokes 2 and learned word embedding with stroke n-gram information.
The above methods can be divided into two types: compositional and predictive models. The compositional model composes rich information into one vector to predict the target word. In this type of model, the different kinds of information work in a cooperative manner for word embedding. By contrast, the predictive model decouples the various kinds of information and uses each to predict the target word. The information in this type of model works competitively for word embedding. Both models can effectively learn word embedding and give good estimates for rare and unseen words. By combining richer information, the compositional model can represent the target word more accurately. However, the information is usually composed in a sophisticated way. The predictive model, on the other hand, is simple and can directly capture the interaction between words and their internal information. This type of model, however, typically ignores the interrelationship among the various kinds of information.
To take advantage of both models, in this paper, we propose a hybrid learning scheme for word embedding. The proposed scheme learns word embedding in a competitive and cooperative manner. Specifically, in our scheme, the decoupled representations are each used to capture the semantic meaning of the target word, while their composition is kept semantically consistent with the target word. The performance of the proposed scheme has been evaluated on Chinese in terms of word similarity and analogy tasks. The results show that our proposed scheme can effectively learn word representations and is robust to the size of training data.

Proposed Scheme
In this section, we present the details of our proposed hybrid learning scheme for word embedding. We denote the proposed scheme as Co-Opetition Word Embedding (COWE). It consists of predictive and compositional parts, which are described in subsection 2.1 and subsection 2.2, respectively. This is followed by a description of the objective function.
2 https://en.wikipedia.org/wiki/Stroke_(CJKV_character)
The notation used in this section is as follows. We denote the training corpus as $D$, the word vocabulary as $W$, the character vocabulary as $C$, and the component vocabulary as $S$. Each word $w \in W$, character $c \in C$ and component $s \in S$ is associated with a vector $v_w \in \mathbb{R}^d$, $v_c \in \mathbb{R}^d$, $v_s \in \mathbb{R}^d$, respectively, where $d$ is the vector dimension. The characters and components in word $w$ are denoted as $C[w]$ and $S[w]$, where $|C[w]|$ and $|S[w]|$ denote the number of characters and components in $w$, respectively.
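As a concrete illustration of this notation, the characters and components of a word can be looked up as in the following minimal Python sketch. The function names and the tiny decomposition table are assumptions made for illustration only; they are not the subword files used in the experiments.

```python
# Hypothetical toy table mapping a character to its components.
# A real table would be loaded from a subword resource.
COMPONENTS_OF_CHAR = {
    "海": ["氵", "每"],
    "水": ["水"],
}

def characters(word):
    """Return the list of characters in a word."""
    return list(word)

def components(word):
    """Return the components of all characters in a word, in order.

    Characters missing from the toy table contribute no components.
    """
    return [p for ch in word for p in COMPONENTS_OF_CHAR.get(ch, [])]

word = "海水"
print(characters(word), len(characters(word)))
print(components(word), len(components(word)))
```

For "海水 (seawater)", the sketch yields two characters and three components under this toy table.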

Predictive Part
In the predictive part, the compositions of context words, characters and components, as well as the compositions of characters and components in the target word, are used to predict the target word, as illustrated in Figure 1. These separate predictions by the various compositions can be considered as competitions for the semantic meaning of the target word. In order to keep the different compositions at a similar length, COWE uses averaging as the composition operation.
Figure 1: Illustration of the predictive part of COWE.
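The goal of this part is to maximize the sum of log likelihoods of all predictive conditional probabilities:
$$L_p(w_i) = \sum_{k=1}^{5} \log P(w_i \mid h_{ik}), \tag{1}$$
where $w_i$ is the target word and $h_{i1}$, $h_{i2}$, $h_{i3}$, $h_{i4}$ and $h_{i5}$ correspond to the above mentioned five compositions, respectively.
Here, $h_{i1}$ is defined as:
$$h_{i1} = \frac{1}{2T} \sum_{-T \le j \le T,\, j \ne 0} v_{w_{i+j}}, \tag{2}$$
where $T$ is the context window size and $v_{w_{i+j}}$ is the vector of the context word $w_{i+j}$. $h_{i2}$, $h_{i3}$, $h_{i4}$ and $h_{i5}$ are defined in a similar way. The conditional probability is defined using a softmax function as:
$$P(w_i \mid h_{ik}) = \frac{\exp(h_{ik}^{\top} \hat{v}_{w_i})}{\sum_{w' \in W} \exp(h_{ik}^{\top} \hat{v}_{w'})}, \tag{3}$$
where $\hat{v}_{w}$ is the output vector of word $w$ and $W$ is the word vocabulary. This objective function is similar to the one used in JWE (Yu et al., 2017). The main difference is that we further decouple the components in the context words and the target word, and additionally leverage the characters in the target word.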
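The predictive part can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, each composition is taken to be the plain average of the relevant vectors, and a full softmax over a small output matrix stands in for the vocabulary.

```python
import numpy as np

def average(vectors):
    """Composition operation: mean of a list of d-dimensional vectors."""
    return np.mean(np.stack(vectors), axis=0)

def log_softmax_prob(h, output_vectors, target_index):
    """log P(target | h) under a full softmax over all output vectors."""
    scores = output_vectors @ h
    scores = scores - scores.max()  # subtract the max for numerical stability
    return scores[target_index] - np.log(np.exp(scores).sum())

def predictive_log_likelihood(compositions, output_vectors, target_index):
    """Sum of the log-likelihoods given by each composition separately."""
    return sum(log_softmax_prob(h, output_vectors, target_index)
               for h in compositions)

# Toy usage: 4 output words, 3 dimensions, five compositions.
rng = np.random.default_rng(0)
output_vectors = rng.normal(size=(4, 3))
compositions = [average([rng.normal(size=3) for _ in range(2)]) for _ in range(5)]
print(predictive_log_likelihood(compositions, output_vectors, target_index=1))
```

Each composition makes its own prediction of the target word, which is what the text refers to as competition among the compositions.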

Compositional Part
In the compositional part, all compositions mentioned above work in a cooperative manner, where their composition is used to predict the target word. We regard this composition as the semantic consistency point of the various representations, and its prediction loss as the consistency loss, as shown in Figure 2.
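The goal of this part is to maximize the following objective function:
$$L_c(w_i) = \log P(w_i \mid a_i), \tag{4}$$
where $a_i$ is the semantic consistency point, and is defined as:
$$a_i = \frac{1}{5} \sum_{k=1}^{5} h_{ik}, \tag{5}$$
with $h_{i1}, \dots, h_{i5}$ denoting the five compositions of the predictive part. Similar to the predictive part, the conditional probability is defined using the softmax function (see Equation (3)).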
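A minimal Python sketch of the compositional part follows. It assumes the semantic consistency point is the equally weighted average of the compositions and uses a full softmax over a small output matrix; the function names are hypothetical.

```python
import numpy as np

def consistency_point(compositions):
    """Equally weighted average of the given compositions."""
    return np.mean(np.stack(compositions), axis=0)

def consistency_loss(compositions, output_vectors, target_index):
    """Negative log-probability of the target word given the consistency point."""
    a = consistency_point(compositions)
    scores = output_vectors @ a
    scores = scores - scores.max()  # numerical stability
    log_prob = scores[target_index] - np.log(np.exp(scores).sum())
    return -log_prob

compositions = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(consistency_point(compositions))
```

Maximizing the objective of this part is the same as minimizing the loss returned by `consistency_loss`.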

Objective Function
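As COWE consists of predictive and compositional parts, its objective function is the sum of all prediction losses and the consistency loss:
$$L(w_i) = \sum_{k=1}^{5} \log P(w_i \mid h_{ik}) + \log P(w_i \mid a_i), \tag{6}$$
where $h_{ik}$ ($k = 1, \dots, 5$) are the five compositions and $a_i$ is the semantic consistency point. To solve the above optimization problem, we employ the negative sampling technique (Mikolov et al., 2013b). Note that only the consistency loss between the semantic consistency point and the target word is considered. In preliminary experiments, we also tried the consistency losses between the semantic consistency point and sampled negative words, but observed reduced performance.
As a result, the final objective function can be written as:
$$L(w_i) = \sum_{k=1}^{5} \left[ \log \sigma(h_{ik}^{\top} \hat{v}_{w_i}) + \sum_{j=1}^{\lambda} \mathbb{E}_{\tilde{w} \sim \tilde{P}} \left[ \log \sigma(-h_{ik}^{\top} \hat{v}_{\tilde{w}}) \right] \right] + \log \sigma(a_i^{\top} \hat{v}_{w_i}), \tag{7}$$
where $\sigma$ is the sigmoid function, $\sigma(x) = 1/(1 + \exp(-x))$, $\hat{v}_{w}$ is the output vector of word $w$, $\lambda$ is the number of negative words, $\tilde{w}$ is the sampled negative word and $\tilde{P}$ is the distribution of negative words.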
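The negative-sampling objective for a single target word can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the expectation over negative words is replaced by a fixed set of already sampled negatives, the consistency point is the plain average of the compositions, and all names are hypothetical.

```python
import numpy as np

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    return -np.logaddexp(0.0, -x)

def cowe_objective(compositions, target_out, negative_outs):
    """Objective for one target word (to be maximized).

    compositions:  list of d-dimensional composition vectors
    target_out:    d-dimensional output vector of the target word
    negative_outs: (k, d) array of output vectors of sampled negative words
    """
    total = 0.0
    for h in compositions:
        total += log_sigmoid(h @ target_out)              # positive term
        total += log_sigmoid(-(negative_outs @ h)).sum()  # negative terms
    # Consistency term: only the target word is used, no negative words.
    a = np.mean(np.stack(compositions), axis=0)
    total += log_sigmoid(a @ target_out)
    return total

rng = np.random.default_rng(0)
compositions = [rng.normal(size=4) for _ in range(5)]
print(cowe_objective(compositions, rng.normal(size=4), rng.normal(size=(5, 4))))
```

Leaving negative words out of the consistency term mirrors the choice described in the text, where adding them reduced performance in preliminary experiments.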

Experiments
In this section, we evaluate COWE on Chinese in terms of word similarity computation and analogical reasoning.

Experimental Settings
We  in the corpus. Finally, we perform Chinese word segmentation with THULAC 6 (Sun et al., 2016b). In addition, we perform POS tagging on the training corpus using THULAC and identify all entity names for CWE (Chen et al., 2015), as it does not use the character information for non-compositional words. We use the subword files provided by Yu et al. (2017). As a result, we obtain a 1 GB training corpus with 165,507,601 words, 368,408 unique words, 20,885 unique characters and 13,232 unique components.
We compare COWE with CBOW (Mikolov et al., 2013a) 7 , CWE (Chen et al., 2015) 8 and JWE (Yu et al., 2017) 9 . To further evaluate the effect of the consistency loss and of components, we create two variants of COWE, denoted as COWE-c2 and COWE-p. The former is in effect the JWE model with an additional consistency loss, while the latter is COWE without component information. The same parameter settings are used for all models. Specifically, the vector dimension is set to 200, the number of training iterations is set to 100, both the context window size and the number of negative samples are set to 5, the initial learning rate is set to 0.025, and the subsampling threshold is set to $10^{-4}$.

Word Similarity
This task evaluates the effectiveness of word embedding in capturing the semantic similarity of word pairs. Following Yu et al. (2017), we adopt the wordsim-240 and wordsim-296 datasets (Jin and Wu, 2012). Both datasets contain manually annotated similarity scores for word pairs. In wordsim-240, the words of 234 pairs appear in the training corpus, and in wordsim-296, the words of 286 pairs appear in the training corpus. Pairs containing unseen words are removed. The performance of word embedding is evaluated by ranking the pairs according to their cosine similarity and measuring the Spearman correlation with human ratings. The results are shown in Table 1.
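The evaluation procedure described above can be sketched in Python with NumPy. The sketch is illustrative only: the function names and the toy embeddings are assumptions, and the Spearman correlation is computed as the Pearson correlation of average ranks.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    r = np.empty(len(values))
    r[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        mask = values == v
        r[mask] = r[mask].mean()
    return r

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(x), ranks(y))[0, 1])

def evaluate_similarity(pairs, embeddings):
    """pairs: iterable of (word1, word2, human_score).

    Pairs containing a word without an embedding are skipped.
    """
    model_scores, human_scores = [], []
    for w1, w2, score in pairs:
        if w1 in embeddings and w2 in embeddings:
            model_scores.append(cosine(embeddings[w1], embeddings[w2]))
            human_scores.append(score)
    return spearman(model_scores, human_scores)

# Toy usage with made-up vectors and ratings.
embeddings = {"a": np.array([1.0, 0.0]),
              "b": np.array([1.0, 0.1]),
              "c": np.array([0.0, 1.0])}
pairs = [("a", "b", 9.0), ("a", "c", 1.0), ("b", "c", 2.0), ("a", "unseen", 5.0)]
print(evaluate_similarity(pairs, embeddings))
```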
On the wordsim-240 dataset, the results show that CWE performs better than CBOW but is outperformed by all other models. This could indicate the benefit of using rich information. COWE-c2 does not perform as well as JWE, and COWE-p and COWE perform even worse. This suggests that the introduction of the consistency loss may, to some extent, limit the performance of word representation. A possible reason is that our average semantic consistency point weights the contributions of the various representations equally. Over the course of history, however, the meanings of some Chinese characters or components have faded, making them less expressive. We plan to investigate the composition operation further in future work.

Word Analogy
This task evaluates the effectiveness of word embedding in capturing semantic relations between pairs of words. The goal is to answer analogy questions of the form "a is to a* as b is to b*", where b* is hidden and must be inferred from the vocabulary. We use the Chinese word analogy dataset provided by Chen et al. (2015). It consists of 1,124 analogy questions, categorized into 3 types: 1) capitals of countries (677 groups), 2) capitals of provinces/states (175 groups), and 3) family relationships (272 groups). The analogy questions are answered using 3CosAdd (Mikolov et al., 2013a) as well as 3CosMul (Levy and Goldberg, 2014) 10 . We abbreviate the two methods as "Add" and "Mul", respectively. The evaluation metric for this task is the percentage of questions for which the argmax result is the correct answer b*. The results are shown in Table 2 11 . CBOW performs better than CWE and JWE on the Capital and Family tasks. This is because using internal information improperly can be harmful when words are non-compositional or when unrelated words share similar internal structures. For example, the words "儿子 (son)" and "妻子 (wife)" share the same character "子", which means "son" in the former but carries no such meaning in the latter. We observe that COWE-c2 achieves the best results on the Family task and outperforms JWE by large margins. This shows the effectiveness of the consistency loss in helping the model learn from various information. COWE-p and COWE perform best on the other tasks, respectively. This suggests that different information can help in different ways.
10 https://bitbucket.org/omerlevy/hyperwords
11 The results do not agree with those reported in Yu et al. (2017). We suggest that these discrepancies stem from differences in training corpus and parameter settings.
Table 1: Results on word similarity evaluation.
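The two analogy-solving methods can be sketched as follows. This is a minimal illustration with hypothetical names and toy vectors. It assumes the three query words are excluded from the candidates, and for 3CosMul it shifts cosines to the range [0, 1] and adds a small constant to the denominator, as is common practice for this method.

```python
import numpy as np

def answer_analogy(vocab, vectors, a, a_star, b, method="add", eps=1e-3):
    """Return the candidate b* with the highest 3CosAdd or 3CosMul score."""
    index = {w: i for i, w in enumerate(vocab)}
    vectors = np.asarray(vectors, dtype=float)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    # Cosine similarity of every vocabulary word to a, a* and b.
    cos_a, cos_a_star, cos_b = (unit @ unit[index[w]] for w in (a, a_star, b))
    if method == "add":
        scores = cos_b - cos_a + cos_a_star
    else:
        # Shift cosines from [-1, 1] to [0, 1] so that products are meaningful.
        pa, pa_star, pb = ((c + 1) / 2 for c in (cos_a, cos_a_star, cos_b))
        scores = pb * pa_star / (pa + eps)
    for w in (a, a_star, b):
        scores[index[w]] = -np.inf  # never return a query word
    return vocab[int(np.argmax(scores))]

# Toy usage with made-up 3-dimensional vectors.
vocab = ["man", "woman", "king", "queen", "apple"]
vectors = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [0, 0, -1]]
print(answer_analogy(vocab, vectors, "man", "woman", "king", method="add"))
print(answer_analogy(vocab, vectors, "man", "woman", "king", method="mul"))
```

Accuracy on a dataset is then the fraction of questions for which the returned word equals the expected b*.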

Performance on Low-Resource Corpora
To evaluate the performance of different models on low-resource corpora, we conduct the same experiments on 5%, 10% and 20% of the Wikipedia articles, selected at random. Because a smaller training set contains relatively more noise, it is more difficult for models to learn good word representations. The results are shown in Table 3.
The results indicate the superiority of our models on low-resource corpora. We observe that as the size of the dataset decreases, the performance of the baselines drops rapidly, while the performance drop of COWE and its variants is much smaller. This shows the robustness of our proposed models. COWE-p is generally more robust than COWE-c2; however, COWE-c2 is more robust on the Family task. Taking both characters and components into account, COWE achieves the most robust results.
We also observe that on the Capital task, the performance of CWE and JWE drops more quickly than that of CBOW, which agrees with the previous findings. However, with the consistency loss, COWE-c2 always performs better than JWE, and usually outperforms CBOW. We believe that, in cases where some embeddings are not useful, the consistency loss encourages weak embeddings to move closer to strong embeddings, letting weak embeddings acquire some helpful features and preventing strong embeddings from overfitting. On the State and Family tasks, where the character and component embeddings could be useful, all of our models still outperform the baselines by large margins. This is likely because the consistency loss prevents the various learned embeddings from contradicting each other, thus bringing all of them close to the true target word embedding.

Case Study
To gain a better understanding of the quality of the learned word embeddings, we take the word "癌症 (cancer)" as an example and show its nearest neighbors in Table 4, where cosine similarity is used as the similarity measure.
All words yielded by the different models are disease-related. Specifically, the words yielded by CWE contain the character "癌 (cancer)", including some odd words, like "国家癌症 (national cancer)" and "抑癌 (anti-cancer)" 12 . This implies that CWE has overused the internal information. JWE and COWE, which directly capture the interaction between words and their internal information, yield disease-related words that do not contain the component "疒", such as "肺结核 (tuberculosis)". This indicates that they make full use of external and internal information, and avoid the above issue. Compared to JWE, COWE yields more words that are semantically relevant to the target word.
12 Translation by Google Translate.
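Nearest neighbors of the kind listed in Table 4 can be retrieved with a cosine-similarity search, as in the following minimal Python sketch. The function name and the toy vocabulary are assumptions for illustration.

```python
import numpy as np

def nearest_neighbors(query, vocab, vectors, k=10):
    """Return the k words most similar to `query` by cosine similarity."""
    vectors = np.asarray(vectors, dtype=float)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[vocab.index(query)]
    order = np.argsort(-sims)  # most similar first
    ranked = [(vocab[i], float(sims[i])) for i in order if vocab[i] != query]
    return ranked[:k]

# Toy usage with made-up 2-dimensional vectors.
toy_vocab = ["a", "b", "c"]
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(nearest_neighbors("a", toy_vocab, toy_vectors, k=2))
```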

Conclusion
This paper proposes a scheme that combines predictive and compositional models to jointly learn various word representations in a competitive and cooperative manner. The predictive part of the proposed scheme is based on various external and internal information, each kind of which is used to learn a corresponding representation. In the compositional part, the semantic consistency point and the consistency loss are introduced. They connect the separately learned representations and prevent them from contradicting each other. The experimental results show that the proposed scheme outperforms baseline models on word analogy tasks and achieves competitive results on word similarity tasks. The results also show that our model is robust to the size of training data. Therefore, our proposed scheme is suitable for low-resource corpora, for example task-specific corpora, where data is often very scarce.