Learning Word Representations with Regularization from Prior Knowledge

Conventional word embeddings are trained with specific criteria (e.g., based on language modeling or co-occurrence) inside a single information source, disregarding the opportunity for further calibration using external knowledge. This paper presents a unified framework that leverages pre-learned or external priors, in the form of a regularizer, to enhance conventional language model-based embedding learning. We consider two types of regularizers. The first type is derived from topic distributions obtained by running LDA on unlabeled data. The second type is based on dictionaries created with human annotation efforts. To learn effectively with the regularizers, we propose a novel data structure, trajectory softmax. The resulting embeddings are evaluated on word similarity and sentiment classification. Experimental results show that our learning framework with regularization from prior knowledge improves embedding quality across multiple datasets, compared to a diverse collection of baseline methods.


Introduction
Distributed representation of words (or word embedding) has been demonstrated to be effective in many natural language processing (NLP) tasks (Bengio et al., 2003; Collobert and Weston, 2008; Turney and Pantel, 2010; Collobert et al., 2011; Mikolov et al., 2013b,d; Weston et al., 2015). Conventional word embeddings are trained with a single objective function (e.g., language modeling (Mikolov et al., 2013c) or word co-occurrence factorization (Pennington et al., 2014)), which prevents the learned embeddings from integrating other types of knowledge. Prior work has leveraged relevant sources to obtain embeddings best suited for target tasks; for example, Maas et al. (2011) used a sentiment lexicon to enhance embeddings for sentiment classification. However, learning word embeddings with a particular target makes the approach less generic, implying that customized adaptation has to be made whenever a new knowledge source is considered.
Along the lines of improving embedding quality, semantic resources have been incorporated as guiding knowledge to refine objective functions in a joint learning framework (Xu et al., 2014; Yu and Dredze, 2014; Nguyen et al., 2016), or used for retrofitting based on word relations defined in semantic lexicons (Faruqui et al., 2015; Kiela et al., 2015). These approaches, nonetheless, require explicit word relations defined in semantic resources, which is a demanding prerequisite for knowledge preparation.
Given the above challenges, we propose a novel framework that extends typical context learning by integrating external knowledge sources to enhance embedding learning. Compared to the well-known work by Faruqui et al. (2015), which tackled the task using a retrofitting framework on semantic lexicons, our method emphasizes joint learning, where two objectives are optimized simultaneously. Meanwhile, we design a general-purpose infrastructure that can incorporate arbitrary external sources into learning, as long as the sources can be encoded into vectors of numerical values (e.g., a multi-hot vector derived from the topic distributions of a topic model). In prior work by Yu and Dredze (2014) and Kiela et al. (2015), the external knowledge has to be clustered beforehand according to semantic relatedness (e.g., cold, icy, winter, frozen), and words of similar meanings are added as part of the context for learning. This may set a high bar for preparing external knowledge, since finding precise word-word relations is required. Our infrastructure, on the other hand, is more flexible: knowledge learned elsewhere, such as from topic modeling or even a sentiment lexicon, can be easily encoded and incorporated into the framework to enrich embeddings.
The way we integrate external knowledge is performed by the notion of a regularizer, which is an independent component that can be connected to the two typical architectures, namely, continuous bag-of-words (CBOW) and skip-gram (SG), or used independently as a retrofitter. We construct the regularizers based on the knowledge learned from both unlabeled data and manually crafted information sources. As an example of the former, a topic model from latent Dirichlet allocation (LDA) (Blei et al., 2003) is first generated from a given corpus, based on which per-word topical distributions are then added as extra signals to aid embedding learning. As an example of the latter, one can encode a dictionary into the regularizer and thus adapt the learning process with the encoded knowledge.
Another contribution of this paper is a novel data structure, trajectory softmax, for effectively learning prior knowledge in the regularizer. Compared to conventional tree-based hierarchical softmax, trajectory softmax can greatly reduce the space complexity when learning over high-dimensional vectors. Our experimental results on several different tasks demonstrate the effectiveness of our approach compared to up-to-date studies.
The rest of the paper is organized as follows. In Section 2, we describe our framework in detail, and we show how we learn the regularizer in Section 3. Section 4 presents and analyzes our experimental results, and Section 5 surveys related work. Finally, conclusions and directions for future work are discussed in Section 6.

Approach
Conventionally, word embeddings are learned from word contexts. In this section, we describe our method of extending embedding learning to incorporate other types of information sources.
Previous work has shown that many different sources can help learn better embeddings, such as semantic lexicons (Yu and Dredze, 2014; Faruqui et al., 2015; Kiela et al., 2015) or topic distributions (Maas et al., 2011; Liu et al., 2015b). To provide a more generic solution, we propose a unified framework that learns word embeddings from context (e.g., CBOW or SG) with the flexibility of incorporating arbitrary external knowledge through the notion of a regularizer. Details are unfolded in the following subsections.

The Proposed Learning Framework
Preliminaries: The fundamental principle for learning word embeddings is to leverage word context, with the general goal of maximizing the likelihood that a word is predicted by its context. For example, the CBOW model can be formulated as maximizing

\mathcal{L} = \sum_{w_i} \log p\Big(w_i \;\Big|\; \sum_{-c \le j \le c,\, j \neq 0} \upsilon_{i+j}\Big) \quad (1)

where \upsilon_{i+j} refers to the embedding of a word in w_{i-c}^{i+c}, and c defines the window size of words adjacent to the word w_i. The optimization of \mathcal{L} over the entire corpus is straightforward.
The left part of Figure 1 illustrates the concept of such context learning. It is a typical objective function for language modeling, where w_i is learned through its association with neighboring words. Since context greatly affects the choice of the current word, this modeling strategy can help find reasonable semantic relationships among words.
Regularizer: To incorporate additional sources for embedding learning, we introduce the notion of a regularizer, which is designed to encode information from arbitrary knowledge corpora.
Given a knowledge resource \Psi, one can encode the knowledge carried by a word w with \psi(w), where \psi can be any function that maps w to the knowledge it encapsulates. For example, a word w_i has a topic vector

\psi(w_i) = \Phi_{[1:K,:]}\, \vec{e}(w_i)

where \Phi_{[1:K,:]} is the topic distribution matrix for all words with K topics, and \vec{e}(w_i) is the standard basis vector with 1 at the i-th position in the vocabulary V. Therefore, regularization for all w given a knowledge source can be conceptually used to maximize \sum_{w \in V} R(\upsilon), where R is the regularizer, defined as a function of the embedding \upsilon of a given word w and formulated as:

R(\upsilon) = \log p\big(\psi(w) \mid \upsilon\big) \quad (2)

The right part of Figure 1 shows an instantiation of a regularizer that encodes prior knowledge as a matrix of |V| rows, one per word in the vocabulary, each with D dimensions.
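As a toy illustration of the encoding function ψ, assuming Φ is stored as a K × |V| list of rows (all names here are illustrative, not from the paper):

```python
def psi(word_index, Phi):
    """psi(w_i): look up the K-dim knowledge (topic) column for word i
    from the K x |V| distribution matrix Phi. Multiplying Phi by the
    standard basis vector e(w_i) simply selects the i-th column."""
    return [row[word_index] for row in Phi]

# Example: 2 topics over a 3-word vocabulary.
Phi = [[0.2, 0.7, 0.1],   # topic 1 probabilities per word
       [0.8, 0.3, 0.9]]   # topic 2 probabilities per word
print(psi(1, Phi))  # topic vector for the 2nd word
```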
Joint Learning: To extend conventional embedding learning, we combine context learning on an original corpus with external knowledge encoded by a regularizer, where the shared vocabulary forms a bridge connecting the two spaces. In particular, the objective function for CBOW integrating the regularizer can be formulated as maximizing

\mathcal{L}' = \sum_{w_i} \Big[ \log p\Big(w_i \;\Big|\; \sum_{-c \le j \le c,\, j \neq 0} \upsilon_{i+j}\Big) + \log p\Big(\psi(w_i) \;\Big|\; \sum_{-c \le j \le c,\, j \neq 0} \upsilon_{i+j}\Big) \Big] \quad (3)

where not only w_i but also R(w_i) is predicted by the context words w_{i+j} via their embeddings \upsilon_{i+j}. Figure 1 as a whole illustrates this idea. Recall that each row of the matrix corresponds to a vector of a word in V, representing prior knowledge across D dimensions (e.g., semantic types, classes or topics). When learning/predicting a word within this framework, the model needs to predict not only the correct word, as shown in the context learning part of the figure, but also the correct vector in the regularizer. In doing so, the prior knowledge is carried from regularization to context learning, and hence into the word embeddings, by back-propagation through the gradients obtained from the regularization matrix.
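A toy sketch of this joint objective for a single training position, simplifying the regularizer term to a full softmax over regularization rows (the paper uses trajectory softmax instead; all names are illustrative):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def joint_loss(ctx_embed, out_vecs, target_word, reg_rows, target_row):
    """Negative joint log-likelihood for one position: the summed context
    embedding must predict both the current word (context-learning term)
    and that word's row in the regularization matrix (regularizer term)."""
    p_word = softmax([dot(ctx_embed, v) for v in out_vecs])[target_word]
    p_reg = softmax([dot(ctx_embed, r) for r in reg_rows])[target_row]
    return -math.log(p_word) - math.log(p_reg)

loss = joint_loss([0.1, 0.2],                    # summed context embedding
                  [[1.0, 0.0], [0.0, 1.0]],      # output word vectors
                  0,                             # index of the current word
                  [[0.5, 0.5], [1.0, -1.0]],     # regularization matrix rows
                  0)                             # row for the current word
```

Minimizing this loss pushes the context embedding toward both targets at once, which is how the prior knowledge reaches the embeddings through back-propagation.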
Retrofitting: While joint learning is our goal, we emphasize that the proposed framework supports simultaneous context learning and prior knowledge retrofitting under a unified objective function. This means the regularizer can also serve as a stand-alone retrofitting component, where the external knowledge vectors are regarded as supervised-learning targets and the embeddings are updated through the course of fitting to those targets. In §4, we evaluate the performance of both the joint learner and the retrofitter in detail.

Parameter Estimation
As shown in Equation 3, prior knowledge participates in the optimization process for predicting the current word and contributes to embedding updates during the training of a CBOW model. Using stochastic gradient descent (SGD), embeddings can be updated by both objective functions, for language modeling and for regularization, through:

\upsilon \leftarrow \upsilon + \alpha\, \frac{\partial}{\partial \upsilon} \Big( \log p\big(w_i \mid w_{i-c}^{i+c}\big) + R(\upsilon) \Big) \quad (4)

where \alpha is the learning rate and R is defined as in Eq. (2) for \psi(w_i). For the SG model, prior knowledge is introduced in a similar way, the difference being that context words are predicted instead of the current word.
Therefore, when learned from the context, embeddings are updated in the same way as in normal CBOW and SG models. When learned from the regularizer, embeddings are updated via supervised learning over \Psi, on the condition that \Psi is appropriately encoded by \psi. The details of how this is performed are illustrated in the next subsection.

Trajectory Softmax
Hierarchical softmax is a good choice for reducing computational complexity when training probabilistic neural network language models. Therefore, for context learning on the left part of Figure 1, we continue using hierarchical softmax based on a Huffman coding tree (Mikolov et al., 2013a). Typically, to encode the entire vocabulary, the depth of the tree falls in a manageable range of around 15 to 18. However, different from learning context words, using hierarchical softmax to encode a regularizer, as shown on the right part of Figure 1, is intractable due to its exponential space demand. Consider words expressed as D-dimensional vectors in a regularizer: a tree-based hierarchical softmax may require 2^D - 1 nodes, as illustrated on the left-hand side of Figure 2. Since each node contains a d-dimensional "node vector" that is updated through training, the total space required for hierarchical softmax to encode the regularizer is O(2^D · d). When D is very large, e.g., D = 50, meaning a tree depth of 50, the space demand becomes unrealistic as the number of nodes in the tree grows to 2^50.
To avoid the exponential requirement in space, in this work, we propose a trajectory softmax activation to effectively learn over the D-dimensional vectors. Our approach follows a grid hierarchical structure along a path when conducting learning in the regularizer. From the right hand side of Figure 2, we see that the same regularizer entry is encoded with a path of D nodes, using a grid structure instead of a tree one. Consequently the total space required will be reduced to O(2 · D · d).
As a running example, Figure 2 shows that when D = 4, conventional hierarchical softmax needs at least 15 nodes to perform softmax over the path, while trajectory softmax reduces this to only 7 nodes. Compared to tree-based hierarchical softmax, the paths in trajectory softmax are not branches of a tree, but a fully connected grid of nodes with a space complexity of D × |C| in general. Here |C| refers to the number of choices from one node to the next, and |C| = 2 is thus the binary case. In Figure 2, we see an activation trajectory for the sequence "Root→100" encoding word w_5. w_t is then learned and updated through the nodes on the trajectory when w_5 is predicted by w_t. The learning and updating are indicated by the dashed arrow lines. Overall, trajectory softmax has a much lower space complexity than hierarchical softmax, especially when words share similar information, in which case their paths largely overlap.
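The node counts in Figure 2 can be checked with a quick calculation; the trajectory count below follows the paper's D = 4 example (7 nodes for a binary grid path):

```python
def hsoftmax_nodes(D):
    """Nodes in a full binary tree deep enough to encode D-bit codes."""
    return 2 ** D - 1

def trajectory_nodes(D):
    """Nodes on a binary trajectory grid path, matching the paper's
    D = 4 example: a root plus two candidate nodes per remaining step."""
    return 2 * D - 1

print(hsoftmax_nodes(4), trajectory_nodes(4))    # 15 vs 7, as in Figure 2
print(hsoftmax_nodes(50), trajectory_nodes(50))  # ~10^15 vs 99 nodes
```

The contrast at D = 50 makes the space argument concrete: the tree grows exponentially while the grid grows linearly.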
More formally, learning with trajectory softmax in the binary case is similar to hierarchical softmax, which is to maximize p over the path for a vector encoded in \psi(w), where p is defined below with an input vector \upsilon:

p = \prod_{i=1}^{D-1} \sigma\big( n(i+1)\, \upsilon_i^{\top} \upsilon \big) \quad (5)

where \upsilon_i is the inner vector of the i-th node on the trajectory, and n(i+1) = 1 or -1 when the (i+1)-th node is encoded with 0 or 1, respectively. The final update to the word embedding \upsilon with the regularizer is conducted by:

\upsilon \leftarrow \upsilon + \gamma \sum_{i} t_i \big( 1 - \sigma( t_i\, \upsilon_i^{\top} \upsilon ) \big)\, \upsilon_i \quad (6)

which is applied over i = 1, 2, ..., D-1, where \sigma(x) = \exp(x)/(1+\exp(x)); t_i = n(i+1); \gamma is a discount learning rate.
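A minimal sketch of the binary trajectory softmax learning rule described above, under the assumption that each node on the grid path holds one inner vector and contributes one sigmoid factor (the indexing and names are illustrative simplifications):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def trajectory_softmax_update(v, node_vecs, code, lr):
    """One gradient-ascent pass over a binary trajectory path.
    v: input embedding; node_vecs: inner vectors of the path nodes
    (updated in place); code: 0/1 bits encoding the target knowledge
    vector; lr: (discount) learning rate. Returns (new_v, log p)."""
    grad = [0.0] * len(v)
    logp = 0.0
    for i, bit in enumerate(code):
        n = 1.0 if bit == 0 else -1.0           # n(i+1) in the paper
        s = sum(a * b for a, b in zip(v, node_vecs[i]))
        p = sigmoid(n * s)
        logp += math.log(p)
        g = n * (1.0 - p)                        # d log sigma(n*s) / d s
        for j in range(len(v)):
            grad[j] += g * node_vecs[i][j]       # gradient w.r.t. v
            node_vecs[i][j] += lr * g * v[j]     # update the node vector
    new_v = [vj + lr * gj for vj, gj in zip(v, grad)]
    return new_v, logp
```

Repeated updates on the same path should increase log p, i.e., make the encoded knowledge vector more likely under the embedding.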
Since the design of trajectory softmax is compatible with conventional hierarchical softmax, one can easily implement joint learning by concatenating its Root with the terminal node in the hierarchical tree. The learning process thus traverses all the nodes from the hierarchical tree and the trajectory path.

Constructing Regularizers
We consider two categories of information sources for constructing regularizers. The first type of regularizer is built on resources without annotation; the second type uses text collections with annotation. For brevity, throughout the paper we refer to the former as the unannotated regularizer and the latter as the annotated regularizer.

Unannotated Regularizer
The unannotated regularizer constructs its regularization matrix from an LDA-learned topic distribution, which reflects the topical salience of a given word according to prior knowledge. Using LDA not only serves our purpose of learning according to word semantics reflected by co-occurrences, but also brings in knowledge inexpensively (i.e., no annotations needed).
To start, a classic LDA is first performed on an arbitrary base corpus to retrieve word topical distributions, resulting in a topic model with K topics. All the units in the corpus are then assigned a word-topic probability \phi_i corresponding to topic k, based on which a matrix is formed from all \vec{\Phi}_w, as described in §2.1. Next, we convert each \vec{\Phi} into a 0-1 vector based on the maximum values in \vec{\Phi}. In particular, positions with maximum values are set to 1 and the rest are set to 0. We experimented with different numbers of topics K; performance did not vary much when K > 40, so we do not include this comparison. An off-the-shelf LDA implementation is used for training \Phi_{[1:K,:]}, with 1,000 iterations.
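The conversion rule above can be sketched as follows; note that if several topics tie at the maximum probability, they all receive 1 (a multi-hot vector):

```python
def topic_multi_hot(phi_row, eps=1e-9):
    """Convert a word's topic distribution into a 0-1 regularization
    vector: positions holding the maximum probability are set to 1,
    all other positions to 0."""
    m = max(phi_row)
    return [1 if abs(p - m) < eps else 0 for p in phi_row]

print(topic_multi_hot([0.1, 0.6, 0.3]))  # single dominant topic
print(topic_multi_hot([0.4, 0.4, 0.2]))  # tie -> two positions set
```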

Annotated Regularizer
We use three sources for training annotated regularizers in this work. Two of the sources are semantic lexicons, namely, the Paraphrase Database (PPDB) (Ganitkevitch et al., 2013) and synonyms in WordNet (WNsyn) (Miller, 1995). They are used in the word similarity task. The third source is a semantic dictionary, SentiWordNet 3.0 (SWN) (Baccianella et al., 2010), which is used in the sentiment classification task. All three sources were created with annotation efforts, where either lexical or semantic relations were provided by human experts beforehand.
Before constructing the regularizer, we need to encode each word in the sources as a vector according to its relations to other words or predefined information. For PPDB and WNsyn, we use them in different ways for joint learning and retrofitting. To optimize efficiency in joint learning, we compress the word relations into topic representations: we use an LDA learner to obtain topic models for the lexicons, with K = 50. The word relations are thereby transferred into topic distributions learned from their co-occurrences defined in the lexicon. This way of constructing the regularization matrix may be lossy, risking the loss of information explicitly delivered in the lexicon. However, it provides effective encodings for words, and empirically also yields better learning performance in our experiments. In retrofitting, we directly use words' adjacency matrices extracted from the relations defined in the lexicons, then take the adjacency vector for each word as the regularization vector.
The SWN includes 83K words (147K words and phrases in total). Every word in SWN has two scores for its degrees of positive and negative polarity. For example, the word "pretty" receives 0.625 and 0 for positive and negative respectively, meaning it is strongly associated with positive sentiment. The scores range from 0 to 1 with a step of 0.125 for both polarities, so there are 9 different degrees at which a word can be annotated for each sentiment. To encode this dictionary, we design an 18-dimensional vector, in which the first 9 dimensions represent positive sentiment and the last 9 negative sentiment. A word is thus encoded into a binary form where the corresponding dimension is set to 1 and the others to 0. For the aforementioned word "pretty", the encoded vector is "000001000 000000000", in which the positive score of 0.625 activates the 6th dimension. In doing so, we form an 83K × 18 regularization matrix for the SWN dictionary.

Table 1: Word similarity results for joint learning on three datasets in terms of Pearson's coefficient correlation (γ) and Spearman's rank correlation (ρ) in percentages. Higher score indicates better correlation of the model with respect to the gold standard. Bold indicates the highest score for each embedding type.
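A sketch of this encoding, following the paper's "pretty" example; note that the example leaves a zero score with no activated dimension, which we mirror here (the function name and exact index mapping are our assumptions):

```python
def encode_swn(pos_score, neg_score, step=0.125):
    """Encode a SentiWordNet (positive, negative) score pair into an
    18-dim binary vector: dims 0-8 hold the positive degree, dims 9-17
    the negative degree. Scores are multiples of 0.125; a zero score
    activates nothing, matching the paper's 'pretty' example."""
    vec = [0] * 18
    if pos_score > 0:
        vec[int(round(pos_score / step))] = 1
    if neg_score > 0:
        vec[9 + int(round(neg_score / step))] = 1
    return vec

# "pretty": positive 0.625, negative 0 -> "000001000 000000000"
print(encode_swn(0.625, 0.0))
```

Stacking one such vector per word yields the 83K × 18 regularization matrix described above.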

Experiments
The resulting word embeddings, based on joint learning as well as retrofitting, are evaluated both intrinsically and extrinsically. For intrinsic evaluation, we use word similarity benchmarks to directly test the quality of the learned embeddings. For extrinsic evaluation, we use sentiment analysis as a downstream task with different input embeddings. Regularizers based on LDA, PPDB and WNsyn are used in the word similarity experiments, while SentiWordNet regularization is used in sentiment analysis. The experimental results are discussed in §4.1 and §4.2.
We experiment with three learning paradigms, namely CBOW, SG and GloVe. GloVe is only tested in retrofitting, since our regularizer is not compatible with the GloVe learning objective in joint learning. In all of our retrofitting experiments, we train the regularizer with only one iteration, consistent with Kiela et al. (2015).
The base corpus we used to train initial word embeddings consists of the latest article dumps from Wikipedia and newswire, containing approximately 8 billion words. When training on this corpus, we set the dimension of word embeddings to 200 and the word frequency cutoff to 5 occurrences. These are common setups shared across the following experiments.

Word Similarities Evaluation
We use the MEN-3k (Bruni et al., 2012), SimLex-999 (Hill et al., 2015) and WordSim-353 (Finkelstein et al., 2002) datasets to perform quantitative comparisons among different approaches to generating embeddings. Cosine scores are computed between the vectors of each pair of words in the datasets. The measures adopted are Pearson's coefficient of product-moment correlation (γ) and Spearman's rank correlation (ρ), which reflect how close the similarity scores are to human judgments.

Table 2: Word similarity results for retrofitting on three datasets in terms of Pearson's coefficient correlation (γ) and Spearman's rank correlation (ρ) in percentages. Higher score indicates better correlation of the model with respect to the gold standard. Bold indicates the highest score for each embedding type.
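The evaluation protocol, cosine similarity per word pair followed by correlation against human ratings, can be sketched as follows; the Spearman implementation below omits tie correction for brevity:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def spearman(xs, ys):
    """Spearman's rho via integer ranks (no tie correction; a sketch
    for illustration, not a full statistical implementation)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

In a real run, `xs` would be the model's cosine scores over all word pairs in a dataset and `ys` the corresponding human similarity judgments.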
For both joint learning and retrofitting, we test our approach using PPDB and WNsyn as the prior knowledge applied to our regularizer. Considering that LDA can be regarded as soft clustering of words, it is very hard to present words with deterministic relations as in PPDB and WNsyn; therefore, we do not apply retrofitting to LDA results, either for our approach or for the previous studies.
The evaluation results are shown in Table 1 and Table 2 for joint learning and retrofitting, respectively. Each block in the tables indicates an embedding type and its corresponding enhancement approaches. For comparison, we also include results from the approaches proposed in previous studies, i.e., Yu and Dredze (2014) for CBOW, Kiela et al. (2015) for SG, and Faruqui et al. (2015) for all initial embeddings. Their settings are equivalent to those used in our approach. Table 1 shows that directly using LDA topic distributions as embeddings can give reasonable results for word similarity. This is because LDA captures word co-occurrences globally, so words sharing similar contexts are encoded similarly via topic distributions. This is a good indication that LDA could be useful guidance to help our regularizer incorporate global information.
For the other joint learning results in Table 1, our approach shows significant gains over the baselines, as do the approaches from previous studies (Yu and Dredze, 2014; Faruqui et al., 2015). However, using WNsyn as in Kiela et al. (2015) does not help; this may be due to the fact that using the words defined in WNsyn as contexts interferes with the real context learning and thus distorts the joint objective function. Interestingly, using LDA in the regularizer significantly boosts performance on MEN-3k, even beyond that achieved with semantic lexicons. The reason might be that LDA enhances word embeddings with the relatedness inherent in topic distributions.
For retrofitting, Table 2 shows that our approach is effective at enhancing initial embeddings with prior knowledge. It performs consistently better than all other approaches in a wide range of settings, covering three embedding types on three datasets, with few exceptions. Since retrofitting only updates the words present in the external sources, e.g., the LDA word list or the lexicons, it is very sensitive to the quality of the corresponding sources. Consequently, we observe in our experiments that unannotated knowledge, i.e., topic distributions, is not as effective a source of guidance. In contrast, PPDB, which provides high-quality semantic knowledge, outperforms the other types of information in most cases.

Sentiment Classification Evaluation
We perform sentiment classification on the IMDB review dataset (Maas et al., 2011), which has 50K labeled samples with equal numbers of positive and negative reviews. The dataset is pre-divided into training and test sets, each containing 25K reviews. The classifier is based on a bi-directional LSTM model as described in Dai and Le (2015), with one hidden layer of 1024 units. Embeddings from the different approaches are used as inputs to the LSTM classifier. For determining the hyperparameters (e.g., training epochs and learning rate), we use 15% of the training data as the validation set and apply an early stopping strategy when the error rate on the validation set starts to increase. Note that the final model for testing is trained on the entire training set.
As reported in Table 3, the embeddings trained by our approach work effectively for sentiment classification. Both joint learning and retrofitting with our regularizer outperform the baseline approaches from previous studies, with joint learning being somewhat better than retrofitting. Overall, our joint learning with CBOW achieves the best performance on this task. A ten-partition two-tailed paired t-test at the p < 0.05 level is performed comparing each score with the baseline result for each embedding type. Considering that sentiment is not directly related to word meaning, the results indicate that our regularizer is capable of incorporating different types of knowledge for a specific task, even when that knowledge is not aligned with the context learning. This task demonstrates the potential of our framework for encoding external knowledge and using it to enrich the representations of words, without the requirement to build a task-specific, customized model.

Table 3: Sentiment classification results on the IMDB review dataset (Maas et al., 2011). Bold indicates the highest score for each embedding type. * indicates t-test significance at the p < 0.05 level when compared with the baseline.

Related Work
Early research on representing words as distributed continuous vectors dates back to Rumelhart et al. (1986). More recent studies (Collobert and Weston, 2008; Collobert et al., 2011) showed that the quality of embeddings can be improved when training multi-task deep models on task-specific corpora, with domain knowledge learned over the process. Yet one downside is that huge amounts of labeled data are often required. Another methodology is to update embeddings by learning with external knowledge, with joint learning and retrofitting as its two mainstreams. Leveraging semantic lexicons (Yu and Dredze, 2014; Faruqui et al., 2015; Liu et al., 2015a; Kiela et al., 2015; Wieting et al., 2015; Nguyen et al., 2016) or word distributional information (Maas et al., 2011; Liu et al., 2015b) has proven effective in enhancing word embeddings, especially for specific downstream tasks. Other work proposed to improve embedding learning with different kinds of knowledge, such as morphological, syntactic and semantic information. Wieting et al. (2015) improve embeddings by leveraging paraphrase pairs from the PPDB for learning phrase embeddings in the paraphrasing task. In a similar way, Hill et al. (2016) use learned word embeddings as supervised knowledge for learning phrase embeddings. Although our approach is conceptually similar to previous work, it differs in several ways. For leveraging unlabeled data, the regularizer in this work is different from applying topic distributions as word vectors (Maas et al., 2011) or treating topics as conditional contexts (Liu et al., 2015b). For leveraging semantic knowledge, our regularizer does not require explicit word relations as used in previous studies (Yu and Dredze, 2014; Faruqui et al., 2015; Kiela et al., 2015), but takes encoded information of words. Moreover, to appropriately learn the encoded information, we use trajectory softmax to perform the regularization.
As a result, our framework provides a versatile data structure that can incorporate any vectorized information into embedding learning, allowing it to integrate many different types of knowledge.

Conclusion and Future Work
In this paper we proposed a regularization framework for improving the learning of word embeddings through explicit integration of prior knowledge. Our approach can be used independently as a retrofitter or jointly with CBOW and SG to encode prior knowledge. We proposed trajectory softmax for learning over the regularizer, which greatly reduces the space complexity compared to hierarchical softmax based on a Huffman coding tree, enabling the regularizer to learn over long vectors. Moreover, the regularizer can be constructed from either unlabeled data (e.g., LDA trained on the base corpus) or manually crafted resources such as a lexicon. Experiments on word similarity evaluation and sentiment classification show the benefits of our approach.
In future work, we plan to evaluate the effectiveness of this framework with other types of prior knowledge and NLP tasks. We also want to explore different ways of encoding external knowledge for regularization.