Integrating Semantic Knowledge into Lexical Embeddings Based on Information Content Measurement

Distributional word representations are widely used in NLP tasks. These representations are based on the assumption that words with similar contexts tend to have similar meanings. To improve the quality of context-based embeddings, many studies have explored how to make full use of existing lexical resources. In this paper, we argue that when prior knowledge is incorporated into context-based embeddings, words with different occurrence statistics should be treated differently. We therefore propose to use a measure of information content to control how strongly prior knowledge is applied to context-based embeddings: different words have different learning rates when their embeddings are adjusted. Our results show that the resulting embeddings achieve significant improvements on two different tasks: Word Similarity and Analogical Reasoning.

In the past few years, several unsupervised methods for learning word embeddings (Collobert et al., 2011; Dhillon et al., 2012; Lebret and Collobert, 2014; Li and Zhang, 2015; Mikolov et al., 2013a; Pennington et al., 2014) have been proposed and have achieved strong results in various evaluations. By exploiting the local context of target words, these algorithms learn word embeddings by maximizing the likelihood of the contextual distribution over a large corpus.
Knowledge bases provide rich information about semantic relatedness between words, which is more likely to capture the semantics desired for certain NLP tasks. To improve the quality of context-based embeddings, some researchers have incorporated knowledge bases, such as WordNet (Miller, 1995) and the Paraphrase Database (Ganitkevitch et al., 2013), into the learning process. Recent work has shown that aggregating knowledge base information into context-based embeddings can significantly improve the embeddings (Chang et al., 2013; Faruqui et al., 2015; Xu et al., 2014; Yih et al., 2012; Yu and Dredze, 2014).
One implicit but critical reason for the success of knowledge bases, in our view, is that they can complement the embeddings of words that lack sufficient occurrence statistics, such as a sufficient number of occurrences or sufficiently diverse contexts. For these words it is difficult to obtain meaningful information from the given corpus. Following this idea, we argue that when prior knowledge is incorporated into context-based embeddings, words with different occurrence statistics should be treated differently. We therefore propose to use a measure of information content to control how strongly prior knowledge is applied to context-based embeddings.

Learning Embeddings
In this section, we first review word2vec, a popular context-based embedding approach, and then introduce the Relation Constrained Model (RCM), which incorporates prior knowledge. Finally, we propose our approach, which combines the two models so that words with different occurrence statistics are treated differently when prior knowledge is incorporated.

Context-based Embedding
Context-based embedding has two main model families: global matrix factorization methods, such as latent semantic analysis (LSA) (Bullinaria and Levy, 2007; Lebret and Collobert, 2014; Pennington et al., 2014; Rohde et al., 2006), and local context window methods (Bengio, 2013; Collobert and Weston, 2008; Mikolov et al., 2013a). Both families learn embeddings from the statistics of word contexts in a large corpus. In this paper, we adopt the continuous bag-of-words (CBOW) model in word2vec (Mikolov et al., 2013a) as our context-based embedding model. CBOW is an unsupervised learning algorithm based on a neural language model. Given a target word $w_t$ and its $c$ neighboring words on each side, the model aims to maximize the log-likelihood of each word given its context.
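The objective function is as follows:

$$J = \frac{1}{T} \sum_{t=1}^{T} \log p\left(w_t \mid w_{t-c}^{t+c}\right)$$

In CBOW, $p(w_t \mid w_{t-c}^{t+c})$ is defined as:

$$p\left(w_t \mid w_{t-c}^{t+c}\right) = \frac{\exp\left({e'_{w_t}}^{\top} \sum_{j=t-c,\, j \neq t}^{t+c} e_{w_j}\right)}{\sum_{w} \exp\left({e'_{w}}^{\top} \sum_{j=t-c,\, j \neq t}^{t+c} e_{w_j}\right)}$$

where $T$ is the number of tokens in the corpus, and $e_w$ and $e'_w$ represent the input and output embeddings respectively.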
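CBOW uses stochastic gradient descent with negative sampling to learn the embeddings. For the target word and each negative sample $w$, the updates of $e'_w$ and $e_{w_j}$ are:

$$e'_w \leftarrow e'_w + \alpha \left(I[w = w_t] - \sigma(f(w))\right) \sum_{j=t-c,\, j \neq t}^{t+c} e_{w_j}$$

$$e_{w_j} \leftarrow e_{w_j} + \alpha \sum_{w} \left(I[w = w_t] - \sigma(f(w))\right) e'_w$$

where $\sigma(x) = \exp\{x\}/(1 + \exp\{x\})$, $I[x]$ is 1 when $x$ is true and 0 otherwise, $f(w) = {e'_w}^{\top} \sum_{j=t-c,\, j \neq t}^{t+c} e_{w_j}$, and $\alpha$ is the learning rate.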
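As an illustration of the update just described, the following minimal Python sketch performs one CBOW step with negative sampling. It assumes that the context vectors are summed and that the caller supplies the negative samples; the function name `cbow_step`, the list-of-lists embedding layout, and the argument names are hypothetical and are not taken from word2vec or from the released code.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function, sigma(x) = exp(x) / (1 + exp(x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cbow_step(E_in, E_out, context_ids, target_id, negative_ids, alpha):
    """One CBOW update with negative sampling (illustrative sketch).

    E_in / E_out are lists of input / output embedding vectors and are
    modified in place. Assumes negative_ids does not contain target_id.
    """
    dim = len(E_in[0])
    # Sum of the input embeddings of the context words.
    h = [sum(E_in[j][k] for j in context_ids) for k in range(dim)]
    grad_h = [0.0] * dim
    for w in [target_id] + list(negative_ids):
        label = 1.0 if w == target_id else 0.0
        # alpha * (I[w = w_t] - sigma(f(w))), with f(w) = e'_w . h
        g = alpha * (label - sigmoid(dot(E_out[w], h)))
        for k in range(dim):
            grad_h[k] += g * E_out[w][k]  # accumulate with e'_w before its update
            E_out[w][k] += g * h[k]       # update the output embedding e'_w
    # Every context word receives the same accumulated update.
    for j in context_ids:
        for k in range(dim):
            E_in[j][k] += grad_h[k]
```

After one step, the score $f(w_t)$ of the target word typically rises and the scores of the negative samples fall, which is the intended effect of the update.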

Relation Constrained Model (RCM)
RCM (Yu and Dredze, 2014) is a simple but effective method for incorporating prior knowledge into context-based embeddings. Given a set of relation pairs $(w, w_i)$ from a knowledge base, the model maximizes the log probability of $w$ and $w_i$ and thereby increases the similarity between $w$ and $w_i$. To simplify the formulas, we define $R$ as the set of relation pairs $(w, w_i)$, and $R_w$ as the subset of $R$ that involves word $w$.
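The objective function is as follows:

$$J = \sum_{w_i} \sum_{(w, w_i) \in R_{w_i}} \log p(w \mid w_i)$$

where

$$p(w \mid w_i) = \frac{\exp\left({e'_w}^{\top} e'_{w_i}\right)}{\sum_{\bar{w}} \exp\left({e'_{\bar{w}}}^{\top} e'_{w_i}\right)}$$

The objective function of RCM is similar to that of CBOW but without the context. RCM only revises the output embeddings $e'_w$ and $e'_{w_i}$ when it is trained jointly with CBOW.
RCM uses stochastic gradient descent to learn the embeddings. For each related word and each negative sample $w$, the updates of $e'_w$ and $e'_{w_i}$ are:

$$e'_w \leftarrow e'_w + \alpha \left(I[(w, w_i) \in R_{w_i}] - \sigma(f(w))\right) e'_{w_i}$$

$$e'_{w_i} \leftarrow e'_{w_i} + \alpha \sum_{w} \left(I[(w, w_i) \in R_{w_i}] - \sigma(f(w))\right) e'_w$$

where $\sigma(x) = \exp\{x\}/(1 + \exp\{x\})$, $I[x]$ is 1 when $x$ is true and 0 otherwise, $f(w) = {e'_w}^{\top} e'_{w_i}$, and $\alpha$ is the learning rate.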
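The RCM update can be sketched in the same style. The following minimal Python example assumes negative sampling with caller-supplied negatives that differ from $w_i$, and touches only the output embeddings, as described above. The name `rcm_step` and the data layout are hypothetical, chosen for illustration only; this is not the RCM package's interface.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def rcm_step(E_out, w_i, related_id, negative_ids, alpha):
    """One RCM update for the relation pair (related_id, w_i) (illustrative sketch).

    Only output embeddings are modified. Assumes w_i, related_id and the
    negative samples are all distinct word ids.
    """
    dim = len(E_out[w_i])
    anchor = list(E_out[w_i])  # snapshot of e'_{w_i} used for all scores
    grad_anchor = [0.0] * dim
    for w in [related_id] + list(negative_ids):
        label = 1.0 if w == related_id else 0.0
        # alpha * (indicator - sigma(f(w))), with f(w) = e'_w . e'_{w_i}
        g = alpha * (label - sigmoid(dot(E_out[w], anchor)))
        for k in range(dim):
            grad_anchor[k] += g * E_out[w][k]  # uses e'_w before its update
            E_out[w][k] += g * anchor[k]       # pull or push e'_w
    for k in range(dim):
        E_out[w_i][k] += grad_anchor[k]
```

Multiplying `alpha` by a per-word score before calling such a step is one way a learning-rate adjustment could be attached to this update.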

Information Content Measurement
Whatever context-based embedding approach is used, word occurrence statistics play a primary role. Consequently, for words that lack sufficient occurrence statistics, such as a sufficient number of occurrences or sufficiently diverse contexts, it is difficult to obtain meaningful information from the given corpus, and embedding quality suffers. We argue that when prior knowledge is incorporated into context-based embeddings, words with different occurrence statistics should be treated differently. With this idea, we investigate several score functions $S_{IC}$ that adjust the learning rate, so that words with less statistical information are adjusted more by prior knowledge and words with richer statistical information are adjusted less.
The updates of $e'_w$ and $e'_{w_i}$ are those of RCM with the learning rate $\alpha$ scaled by $S_{IC}$. In this paper, we propose three kinds of score functions to control the adjustment: Threshold, Function(Freq.), and Function(Ent.).
a. Threshold: The first score function is a binary indicator based on a word-frequency threshold, which divides the word relations into two groups.
Under this strategy, only the low-frequency word of a relation pair is revised, and only when one word of the pair has low frequency and the other has high frequency.
b. Function(Freq.): In contrast to the previous strategy, we make the score function smoother by determining the score from the relative frequency of the two words and a hyperbolic tangent function.
This strategy still revises the relatively lower-frequency word of a relation pair when one word has relatively low frequency and the other has relatively high frequency. This score function is based on our assumption that a word with relatively more occurrences has a better embedding and therefore does not need to be adjusted much.
c. Function(Ent.): Beyond word frequency, we believe that contextual diversity plays a critical role in the quality of a word embedding. We therefore propose a score function based on conditional entropy (information content) from information theory.
We define the score function as follows:
where $C$ is the set of all context words of $w$, and $c_j$ is the $j$-th context word of $w$.
Here, the occurrence probability of $w$ (denoted $p(w)$) and the occurrence probability of $w$ with its context word $c_j$ (denoted $p(c_j, w)$) are defined as:
The output value of this entropy function depends on two main factors. First, as defined in Eq. 16, a high-frequency word $w$ yields a high output value. Second, a word $w$ with many different context words yields a higher output value. This score function is based on our assumption that a word whose contexts are more diverse has a better embedding and does not need to be adjusted much.
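To make the three strategies concrete, the following Python sketch gives one possible instantiation of each. These are illustrative assumptions, not the paper's exact formulas: the comparison operators in `threshold_score`, the log-ratio inside the hyperbolic tangent in `freq_score`, and the joint-probability weighting in `weighted_context_entropy` are all choices made for this sketch, and every function name is hypothetical. The last function only computes an entropy-style quantity; how that value is turned into a learning-rate multiplier is left out.

```python
import math


def threshold_score(freq_w, freq_other, tau=50):
    """Binary score: 1.0 only when w is rare (< tau) and its partner is
    frequent (>= tau). tau=50 is the best threshold reported in the
    experiments; the strict/non-strict comparisons are an assumption."""
    return 1.0 if freq_w < tau <= freq_other else 0.0


def freq_score(freq_w, freq_other):
    """Smooth score in [0, 1): tanh of the log frequency ratio, clipped at 0.

    Larger when w is much rarer than its partner. The log-ratio form is an
    assumption made for this sketch. Both frequencies must be positive."""
    return max(0.0, math.tanh(math.log(freq_other / freq_w)))


def weighted_context_entropy(context_counts, total_pairs):
    """Entropy-style quantity -sum_j p(c_j, w) * log p(c_j | w).

    context_counts maps each context word c_j of w to its co-occurrence
    count with w; total_pairs is the total number of (context, word) pairs
    in the corpus. Here p(c_j, w) = count / total_pairs and
    p(c_j | w) = count / count(w); this weighting is an assumption."""
    count_w = sum(context_counts.values())
    h = 0.0
    for n in context_counts.values():
        if n > 0:
            h -= (n / total_pairs) * math.log(n / count_w)
    return h


# Scaling a base learning rate by a score (values are illustrative only).
base_alpha = 0.001
effective_alpha = base_alpha * freq_score(freq_w=12, freq_other=3400)
```

Under these assumptions, `weighted_context_entropy` grows both with the frequency of the word and with the diversity of its contexts, matching the two properties described above.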

Experiments
We conduct two experiments to evaluate our approach: Word Similarity and Analogical Reasoning. These two experiments directly test the quality of the information embedded in the word vectors. We integrate semantic information from knowledge bases using four strategies: Baseline(Joint), Threshold, Function(Freq.), and Function(Ent.). We compare our proposed methods in the setting where both prior knowledge and context are used to adjust the embeddings.

Training Data
We use the New York Times (NYT) 1994-97 subset of Gigaword v5.0 (Parker et al., 2011) as the training corpus for CBOW, the same setting as Yu and Dredze (2014). After tokenization, the final training corpus contains 555.4 million tokens. We use two knowledge bases: the Paraphrase Database (PPDB) (Ganitkevitch et al., 2013) and WordNet (Miller, 1995). For PPDB, we use the XXL package, which gives the best result in Yu and Dredze (2014). It contains 587,439 synonym word pairs. For WordNet, we extract relation pairs from synonyms. It contains 132,046 word pairs.

Table 1: Spearman rank correlation on the word similarity task. All embeddings have 300 dimensions. The best result for each dataset is highlighted in bold.

Parameter Setting
We set the embedding size to 300 for all models, which Melamud et al. (2016) report as a suitable embedding size. RCM is trained for 100 iterations. The learning rate for CBOW is 0.025. We experimented with a range of learning rates for Baseline(Joint), and the best one is 0.0001. The learning rate for Threshold remains 0.0001; for Function(Freq.) and Function(Ent.) we tried various learning rates, and the best one is 0.001, which is larger than 0.0001. This can be explained by the fact that the output values of the two functions lie between 0 and 1 and are used to scale down the learning rate. In other words, the base learning rate for the two functions must be larger than the baseline's, because the functions then decrease it. The window size is 5, and the number of negative samples is 15. We experimented with threshold values of 10, 50 and 100; 50 gives the best result. We first learn embeddings using CBOW with random initialization and use these pre-trained embeddings to initialize a joint model, in which CBOW and RCM are trained jointly and their learning rates are adjusted by our proposed functions. Following Yu and Dredze (2014), we use asynchronous stochastic gradient ascent for training, where the CBOW and RCM threads are allocated in a ratio of 12:1 and the shared embeddings are updated by each thread based on the training data within that thread. We let the CBOW threads control convergence; training stops when the CBOW threads finish processing the data. The joint model without our proposed functions is taken as the baseline system, denoted Baseline(Joint).

Word Similarity Task
The word similarity task checks whether the similarity scores a model assigns to word pairs correspond closely to human judgments. The datasets contain relatedness scores for pairs of words; the cosine similarity of the embeddings of the two words should correlate highly with these scores. We use five datasets for evaluation: MEN-3k (Bruni et al., 2014), RW (Luong et al., 2013), WordSim-353 (Finkelstein et al., 2002), and the two subsets into which WordSim-353 has been partitioned by relation type, WS353-Similarity and WS353-Relatedness (Agirre et al., 2009; Zesch et al., 2008). Table 1 shows that, compared to the baseline, all three of our proposed methods achieve significant improvements. The results support our argument that incorporating prior knowledge into context-based embeddings can complement the embedding quality of words that lack sufficient occurrence statistics.

Table 2: Accuracy on the analogical reasoning task. All embeddings have 300 dimensions. The best result for each dataset is highlighted in bold.
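The word similarity evaluation can be sketched as follows: compute the cosine similarity for each word pair and report the Spearman rank correlation with the human scores. This is a minimal pure-Python sketch, not the evaluation script used in the experiments. Skipping pairs with out-of-vocabulary words and averaging the ranks of ties are assumptions of the sketch, and the name `evaluate_similarity` is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither list is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate_similarity(embeddings, scored_pairs):
    """Correlate model cosine similarities with human scores.

    scored_pairs is an iterable of (word1, word2, human_score); pairs with a
    word missing from `embeddings` are skipped (an assumption of this sketch).
    """
    model, gold = [], []
    for w1, w2, score in scored_pairs:
        if w1 in embeddings and w2 in embeddings:
            model.append(cosine(embeddings[w1], embeddings[w2]))
            gold.append(score)
    return spearman(model, gold)
```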

Analogical Reasoning Task
The analogical reasoning task was popularized by Mikolov et al. (2013b). The dataset is composed of analogous word pairs: it contains pairs of tuples of word relations that follow a common syntactic relation. The goal of this task is to find a term $c$ for a given term $d$ so that $c:d$ best resembles a sample relationship $a:b$. We use the vector offset method (Levy and Goldberg, 2014; Mikolov et al., 2013b), computing $e_c = e_a - e_b + e_d$ and returning the word whose vector has the highest cosine similarity to $e_c$. We use two datasets: Google's analogy dataset (Mikolov et al., 2013b), which contains 19,544 questions, about half of them syntactic analogies and the other half of a more semantic nature, and the MSR analogy dataset (Mikolov et al., 2013b), which contains 8,000 syntactic analogy questions. Table 2 shows results similar to those on Word Similarity and demonstrates that our proposed methods are stable and can be applied to different tasks.

Table 3: Spearman rank correlation on the word similarity task. All embeddings have 300 dimensions. The corpus is the same as in Table 1, but at 1/100 of the size. The best result for each dataset is highlighted in bold.
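The vector offset method can be sketched in a few lines of Python. The query vector is the offset $e_a - e_b$ added to the embedding of the remaining given term, and the answer is the vocabulary word with the highest cosine similarity to it. Excluding the three query words from the candidates is common practice but is an assumption here, and the name `solve_analogy` and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def solve_analogy(embeddings, a, b, given):
    """Return the word closest (by cosine) to e_a - e_b + e_given.

    The three query words are excluded from the candidates, which is an
    assumption of this sketch.
    """
    query = [x - y + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[given])]
    best_word, best_sim = None, -2.0  # cosine is always >= -1
    for word, vec in embeddings.items():
        if word in (a, b, given):
            continue
        sim = cosine(vec, query)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word


# Toy vectors chosen so that king - man + woman lands exactly on queen.
toy = {
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "apple": [0.0, -1.0],
}
answer = solve_analogy(toy, "king", "man", "woman")
```

Accuracy on an analogy dataset is then the fraction of questions for which the returned word equals the expected one.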

Corpus Size
We also apply our models to a smaller corpus. The same corpus is used, but at 1/100 of its size. All parameters are the same except that the CBOW and RCM threads are allocated in a ratio of 2:1, and only the learning rates of positive samples are adjusted by our functions. The results are shown in Table 3 and Table 4, which show that our proposed models also improve on CBOW and outperform the baseline. In our experiments, we find that for a smaller corpus, adjusting the learning rates of both positive and negative samples does not yield as much improvement as adjusting only those of positive samples. Our conjecture is that, since embeddings trained on a smaller corpus may be of lower quality than those trained on a larger corpus, and since negative samples greatly outnumber positive samples at each step (15:1 in our setting), negative samples with learning rate adjustment are more likely to mislead training on a smaller corpus.

Table 4: Accuracy on the analogical reasoning task. All embeddings have 300 dimensions. The corpus is the same as in Table 2, but at 1/100 of the size. The best result for each dataset is highlighted in bold.

Conclusion
In this paper, we argue that when prior knowledge is applied to context-based embeddings, word occurrence statistics should be considered. This is based on the assumption that an embedding with more contextual information is likely to be of higher quality and should therefore be treated differently when knowledge bases are incorporated. We propose three models and demonstrate that our embeddings improve on two different tasks: Word Similarity and Analogical Reasoning. The implementation is based on the RCM package, and we have released the code for academic use. In the future, under this framework, we plan to investigate other possible score functions for the learning rate, based on information theory or on the dynamics of the training process, for incorporating context and knowledge base information.