Online Learning of Interpretable Word Embeddings

Word embeddings encode the semantic meanings of words into low-dimensional word vectors. In most word embeddings, one cannot interpret the meanings of specific dimensions of those word vectors. Non-negative matrix factorization (NMF) has been proposed to learn interpretable word embeddings via non-negative constraints. However, NMF methods suffer from scalability and memory issues because they have to maintain a global matrix for learning. To alleviate this challenge, we propose online learning of interpretable word embeddings from streaming text data. Experiments show that our model consistently outperforms the state-of-the-art word embedding methods in both representation ability and interpretability. The source code of this paper can be obtained from http://github.com/skTim/OIWE.


Introduction
Word embeddings (Turian et al., 2010) aim to encode the semantic meanings of words into low-dimensional dense vectors. Compared with the traditional one-hot representation and distributional representation, word embeddings can better address the sparsity issue and have achieved success in many NLP applications in recent years.
There are two typical approaches to word embeddings. The neural-network (NN) approach (Bengio et al., 2006) employs neural-based techniques to learn word embeddings. The matrix factorization (MF) approach (Pennington et al., 2014) builds word embeddings by factorizing word-context co-occurrence matrices. The MF approach requires a global statistical matrix, while the NN approach can flexibly perform learning from streaming text data, which is efficient in both computation and memory. For example, two recent NN methods, Skip-Gram and the Continuous Bag-of-Words Model (CBOW) (Mikolov et al., 2013a; Mikolov et al., 2013b), have achieved impressive impact due to their simplicity and efficiency.
* Corresponding author: Z. Liu (liuzy@tsinghua.edu.cn)
For most word embedding methods, a critical issue is that we do not know what each dimension of a word embedding represents. Hence, the latent dimension for which a word has its largest value is difficult to interpret. This makes word embeddings a black box, and prevents them from being human-readable and from being further manipulated.
Non-negative matrix factorization (NMF) has been proposed for word representation, denoted as non-negative sparse embedding (NNSE) (Murphy et al., 2012). NNSE realizes interpretable word embeddings by applying non-negative constraints to word embeddings. Although NNSE learns word embeddings with good interpretability, like other MF methods it requires a global matrix for learning, and thus suffers from heavy memory usage and cannot deal well with streaming text data.
Inspired by the characteristics of NMF methods (Lee and Seung, 1999), we note that non-negative constraints only allow additive combinations instead of subtractive combinations, and lead to a parts-based representation. Hence, the non-negative constraints give rise to the interpretability of word embeddings. In this paper, we aim to design an online NN method to efficiently learn interpretable word embeddings. To achieve this goal, we design projected gradient descent (Lin, 2007) for optimization so as to apply non-negative constraints to NN methods such as Skip-Gram. We also employ adaptive gradient descent (Sun et al., 2012) to speed up learning convergence. We name the proposed models online interpretable word embeddings (OIWE).
For experiments, we implement OIWE based on Skip-Gram. We evaluate the representation performance of word embedding methods on the word similarity computation task. Experimental results show that our OIWE models are significantly superior to other baselines including Skip-Gram, RNN and NNSE. We also evaluate the interpretability performance on the word intrusion detection task. The results demonstrate the effectiveness of OIWE as compared to NNSE.

Our Model
In this section, we first introduce Skip-Gram and then introduce the proposed online interpretable word embeddings based on Skip-Gram.

Skip-Gram
Skip-Gram (Mikolov et al., 2013b) is a simple and effective method for learning word embeddings. The objective of Skip-Gram is to make word vectors good at predicting their context words. More specifically, given a word sequence $\{w_1, w_2, \ldots, w_T\}$, Skip-Gram aims to maximize the average log probability

$$\frac{1}{T}\sum_{t=1}^{T}\sum_{-k \le j \le k,\ j \ne 0} \log \Pr(w_{t+j} \mid w_t), \tag{1}$$

where $k$ is the context window size, and $\Pr(w_{t+j} \mid w_t)$ indicates the probability of seeing $w_{t+j}$ in the context of $w_t$, which is measured with the softmax function

$$\Pr(w_{t+j} \mid w_t) = \frac{\exp(\mathbf{w}_{t+j} \cdot \mathbf{w}_t)}{\sum_{w=1}^{W} \exp(\mathbf{w} \cdot \mathbf{w}_t)}, \tag{2}$$

where $\mathbf{w}_{t+j}$ and $\mathbf{w}_t$ are the word embeddings of $w_{t+j}$ and $w_t$, and $W$ is the vocabulary size. Since the computation of the full softmax is time-consuming, the techniques of hierarchical softmax and negative sampling (Mikolov et al., 2013b) have been proposed for approximation. Take negative sampling for example. The log probability $\log \Pr(w_{t+j} \mid w_t)$ can be approximated by

$$\log \Pr(w_{t+j} \mid w_t) \approx \log \sigma(\mathbf{w}_{t+j} \cdot \mathbf{w}_t) + \sum_{u \in N_t} \log \sigma(-\mathbf{u} \cdot \mathbf{w}_t), \tag{3}$$

where $\sigma(x) = 1/(1 + \exp(-x))$, and $N_t$ is the set of negative samples for the corresponding context word $w_{t+j}$. The task can be regarded as distinguishing the context word $w_{t+j}$ from the negative samples.
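The negative-sampling approximation described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `neg_sampling_log_prob`, the argument layout, and the toy vectors are assumptions made here and are not taken from the released OIWE code.

```python
import numpy as np

def sigmoid(x):
    """Logistic function sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_log_prob(w_t, w_ctx, negatives):
    """Approximate log Pr(w_ctx | w_t) with negative sampling.

    w_t:       embedding of the input word, shape (K,)
    w_ctx:     embedding of the positive context word, shape (K,)
    negatives: embeddings of the negative samples N_t, shape (n, K)
    """
    # Reward a high score for the true context word ...
    positive_term = np.log(sigmoid(w_ctx @ w_t))
    # ... and a low score for every negative sample.
    negative_term = np.sum(np.log(sigmoid(-(negatives @ w_t))))
    return positive_term + negative_term

# Toy example with K = 3 and two negative samples (arbitrary values).
w_t = np.array([0.5, 0.1, 0.0])
w_ctx = np.array([0.4, 0.2, 0.1])
negatives = np.array([[0.0, 0.0, 0.9], [0.1, 0.0, 0.7]])
score = neg_sampling_log_prob(w_t, w_ctx, negatives)
```

The score is a sum of log-sigmoid terms and is therefore always negative; training pushes it towards zero.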
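For Skip-Gram with negative sampling, we can perform stochastic gradient descent for learning. The update rule for the positive/negative context words $u \in \{w_{t+j}\} \cup N_t$ is

$$\mathbf{u}^{i+1} = \mathbf{u}^{i} + \gamma \left[ I_{w_t}(u) - \sigma(\mathbf{u}^{i} \cdot \mathbf{w}_t^{i}) \right] \mathbf{w}_t^{i}, \tag{4}$$

where $I_{w_t}(u) = 1$ when $u$ is the positive context word of $w_t$ and $I_{w_t}(u) = 0$ when $u$ is a negative sample, $i$ is the iteration number, and $\gamma$ is the learning rate. Correspondingly, the update rule for the input word $w_t$ is

$$\mathbf{w}_t^{i+1} = \mathbf{w}_t^{i} + \gamma \sum_{u \in \{w_{t+j}\} \cup N_t} \left[ I_{w_t}(u) - \sigma(\mathbf{u}^{i} \cdot \mathbf{w}_t^{i}) \right] \mathbf{u}^{i}. \tag{5}$$

We note that the learning rate $\gamma$ in Skip-Gram is shared by all word embeddings.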
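As a rough illustration of these update rules, the sketch below performs one stochastic gradient step for a single (input word, context word) pair and its negative samples. It uses the standard gradient of the negative-sampling objective; the function name `sgns_step` and the parameter `lr` (standing in for the learning rate γ) are hypothetical names chosen for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(w_t, context, negatives, lr):
    """One gradient-ascent step on the negative-sampling objective.

    Returns updated copies (new_w_t, new_context, new_negatives).
    """
    # Row 0 is the positive context word, the remaining rows are negatives.
    U = np.vstack([context[None, :], negatives])
    labels = np.zeros(len(U))
    labels[0] = 1.0                      # indicator: 1 for positive, 0 for negatives
    coeff = labels - sigmoid(U @ w_t)    # indicator minus sigma(u . w_t)
    new_U = U + lr * coeff[:, None] * w_t[None, :]
    new_w_t = w_t + lr * (coeff @ U)     # uses the vectors from before the step
    return new_w_t, new_U[0], new_U[1:]

# Toy usage with arbitrary values.
w_t = np.array([0.5, 0.1, 0.0])
context = np.array([0.4, 0.2, 0.1])
negatives = np.array([[0.0, 0.0, 0.9], [0.1, 0.0, 0.7]])
new_w_t, new_context, new_negatives = sgns_step(w_t, context, negatives, lr=0.1)
```

Note that nothing in this plain update prevents embedding entries from becoming negative, which is the point the OIWE variants address.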

OIWE
In order to learn interpretable word embeddings, we have to keep the word embeddings learned by Skip-Gram non-negative. To achieve this goal, we constrain the update rules in Equations (4) and (5) as follows:

$$x_k^{i+1} = P\left[x_k^{i} + \gamma \nabla f(x_k)\right], \tag{6}$$

where $x$ may be $u$ or $w_t$, $k$ is the corresponding dimension in word embedding $x$, $\nabla f(x_k)$ indicates the gradient corresponding to $x_k$, and $P[\cdot]$ is defined as

$$P[x] = \begin{cases} x & \text{if } x > 0, \\ 0 & \text{if } x \le 0. \end{cases} \tag{7}$$

Motivated by the projected gradient descent methods for NMF (Lin, 2007), in this paper we propose two methods for Skip-Gram to realize the constraint in Equation (6).
Naive Projected Gradient (NPG). In NPG, we consider the most straightforward update strategy by simply setting

$$x_k^{i+1} = \max\left(0,\ x_k^{i} + \gamma \nabla f(x_k)\right). \tag{8}$$

This method has been used for NMF (Lin, 2007), although the details are not discussed. The NPG method only constrains the violated dimensions without taking the update consistency among dimensions of a word embedding into account. For example, if many dimensions encounter $x_k^{i} + \gamma \nabla f(x_k) < 0$ at the same time and are set to 0 by Equation (8) while the other dimensions are left untouched by the projection, the updated word embedding may deviate heavily from its semantic meaning. Hence, NPG may suffer from unstable updates. To address this issue, we propose the following improved projected gradient method.
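The NPG update amounts to an ordinary gradient step followed by clipping negative entries to zero. A minimal sketch, assuming the embedding and its gradient are NumPy vectors (the name `npg_update` and the toy values are illustrative only):

```python
import numpy as np

def npg_update(x, grad, lr):
    """Naive projected gradient: take the ascent step, then clip negatives to 0."""
    return np.maximum(0.0, x + lr * grad)

# Toy embedding with K = 4; two dimensions would go negative after the step.
x = np.array([0.30, 0.05, 0.00, 0.20])
grad = np.array([1.0, -2.0, -1.0, 0.5])
updated = npg_update(x, grad, lr=0.1)
```

In this toy case half of the dimensions are zeroed out at once, which illustrates the kind of abrupt change that the text identifies as a source of unstable updates.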
Improved Projected Gradient (IPG). In order to make the non-negative update more consistent among dimensions, we design an improved projected gradient method that iteratively searches for an appropriate learning rate γ. The basic idea is to find a learning rate γ under which fewer dimensions violate the non-negative constraint.
More specifically, in Equation (6), for a learning rate $\gamma$, we define the violation ratio as

$$R(\gamma) = \frac{\left|\{k : x_k^{i} + \gamma \nabla f(x_k) < 0\}\right|}{K}, \tag{9}$$

where $K$ is the dimension size of word embeddings. The violation ratio indicates the fraction of dimensions that violate the non-negative constraint and have to be set to 0. When the learning rate $\gamma$ decreases, the violation ratio also decreases or stays the same, and the zero-setting in Equation (8) brings less deviation to word embeddings. We set a threshold $\delta$ for the violation ratio $R(\gamma)$ and a lower bound $\gamma_L$ for the learning rate $\gamma$. Starting from an initial learning rate $\gamma_0$, we repeatedly decrease the learning rate by

$$\gamma_{m+1} = \beta \gamma_m \tag{10}$$

with $0 < \beta < 1$ until

$$R(\gamma_{m+1}) \le \delta \quad \text{or} \quad \gamma_{m+1} \le \gamma_L, \tag{11}$$

and then update with Equation (8) using $\gamma_{m+1}$. In essence, the constraint on the learning rate in Equation (11) plays a similar role to Equation (13) in (Lin, 2007), which aims to prevent the projection operation from heavily deviating the word embeddings.
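The learning-rate search of IPG can be sketched as a simple backtracking loop. The sketch below is one possible reading of the procedure, not the authors' implementation: it assumes the initial rate is accepted as-is when it already satisfies the threshold, and that the search stops once the violation ratio is at most δ or the rate has reached the lower bound. The names `violation_ratio`, `ipg_update`, `lr0` and `lr_min` are hypothetical; the default values of β, δ and the lower bound are the ones reported in the experiments.

```python
import numpy as np

def violation_ratio(x, grad, lr):
    """Fraction of dimensions that would become negative after the step."""
    return np.mean(x + lr * grad < 0)

def ipg_update(x, grad, lr0, beta=0.6, delta=1 / 60, lr_min=2.5e-6):
    """Shrink the learning rate until few dimensions violate non-negativity,
    then apply the clipped (NPG-style) update with that rate."""
    lr = lr0
    while violation_ratio(x, grad, lr) > delta and lr > lr_min:
        lr *= beta  # multiplicative decrease with 0 < beta < 1
    return np.maximum(0.0, x + lr * grad), lr

# Toy embedding with K = 4 (arbitrary values).
x = np.array([0.30, 0.05, 0.10, 0.20])
grad = np.array([1.0, -2.0, -1.0, 0.5])
new_x, used_lr = ipg_update(x, grad, lr0=0.1)
```

In this toy case the rate is reduced three times before no dimension violates the constraint, so the final update needs no clipping and the relative proportions of the dimensions change less abruptly than under NPG.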

More Optimization Details
In experiments, we explore many optimization methods and find that the following two strategies are important: (1) Adaptive Gradient Descent. Following the idea of Sun et al. (2012), we maintain a separate learning rate $\gamma_w$ for each word $w$, and the learning rates of high-frequency words may decrease faster than those of low-frequency words. This speeds up the convergence of word embedding learning.
(2) Unified Word Embedding Space. Unlike the original Skip-Gram (Mikolov et al., 2013b), which learns the embeddings of $w_t$ and its context words $w_{t+j}$ in two separate spaces, in this paper both $w_t$ and its context words $w_{t+j}$ share the same embedding space. Hence, a word embedding gets more opportunities for learning.
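To make the per-word learning rate idea concrete, the following sketch keeps one rate per word and lets it decay with the number of updates that word has received. The decay schedule `lr0 / (1 + n / tau)` and the names `PerWordRate`, `lr0` and `tau` are assumptions for illustration only; the text does not specify the exact schedule taken from Sun et al. (2012).

```python
from collections import defaultdict

class PerWordRate:
    """Keeps one learning rate per word that decays with that word's update count.

    The 1 / (1 + n / tau) schedule is a hypothetical choice for illustration.
    """

    def __init__(self, lr0=0.05, tau=1000.0):
        self.lr0 = lr0
        self.tau = tau
        self.updates = defaultdict(int)  # number of updates seen per word

    def next_rate(self, word):
        """Return the rate to use for this update and count the update."""
        rate = self.lr0 / (1.0 + self.updates[word] / self.tau)
        self.updates[word] += 1
        return rate

# Frequent words are updated more often, so their rate shrinks faster.
rates = PerWordRate(lr0=0.05, tau=10.0)
for _ in range(100):
    rates.next_rate("the")
rates.next_rate("zebra")
```

With a schedule of this kind, a frequent word settles quickly while a rare word keeps a comparatively large step size for the few updates it receives.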

Experiments
In this section, we compare the representation performance and interpretability of our OIWE models with other baselines, including typical NN and MF methods. The representation performance is evaluated with the word similarity computation task, and the interpretability is evaluated with the word intrusion detection task. For both tasks, we train our OIWE models using the text8 corpus obtained from the word2vec website, and the OIWE models achieve the best performance by setting the dimension number $K = 300$, $\beta = 0.6$, $\delta = 1/60$, and $\gamma_L = 2.5 \times 10^{-6}$.
The evaluation results of word similarity computation are shown in Table 1. We can observe that: (1) The OIWE models consistently outperform other baselines. (2) IPG generally achieves better representation performance than NPG. This indicates that consistent updates are important for learning word embeddings. One can refer to http://github.com/skTim/OIWE for the evaluation results on more evaluation datasets.

Word Intrusion Detection
We evaluate the interpretability of word embeddings with the task of word intrusion detection proposed by Murphy et al. (2012). In this task, for each dimension we create a word set containing the top-5 words in this dimension, and introduce a noisy word from the bottom half of this dimension which ranks high in other dimensions. Human editors are asked to check each word set and try to pick out the intrusion words, and the detection precision indicates the interpretability of word embedding models. Note that for this task we do not perform normalization for word vectors. The evaluation results are shown in Table 2. We can observe that: (1) Skip-Gram performs poorly in word intrusion detection, as expected, since it is uninterpretable in nature.
(2) The OIWE-NPG model achieves better interpretability than Skip-Gram, but performs much worse than the OIWE-IPG model. The OIWE-IPG model achieves interpretability competitive with NNSE. This indicates that reducing violation ratios in word embedding learning is crucial for preserving interpretability.
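The construction of one word set for the intrusion task described above can be sketched as follows. The text does not say how "ranks high in other dimensions" is decided, so the criterion used here (the word lies within the top 10% of at least one other dimension) is a hypothetical choice, as are the names `intrusion_set`, `E` and `vocab`.

```python
import numpy as np

def intrusion_set(E, vocab, dim, top_n=5, rng=None):
    """Build one word-intrusion item for dimension `dim`.

    E:     embedding matrix, shape (V, K), one row per word
    vocab: list of V words aligned with the rows of E
    Returns (top_words, intruder).
    """
    rng = np.random.default_rng() if rng is None else rng
    order = np.argsort(-E[:, dim])          # word indices, largest value first
    top = order[:top_n]
    bottom_half = order[len(order) // 2:]
    # Hypothetical criterion for "ranks high in other dimensions":
    # the word is within the top 10% of at least one other dimension.
    cutoff = max(1, E.shape[0] // 10)
    ranks = np.argsort(np.argsort(-E, axis=0), axis=0)  # rank of each word per dimension
    other = np.delete(ranks[bottom_half], dim, axis=1)
    candidates = bottom_half[(other < cutoff).any(axis=1)]
    if len(candidates) == 0:                # fall back to any bottom-half word
        candidates = bottom_half
    intruder = rng.choice(candidates)
    return [vocab[i] for i in top], vocab[intruder]

# Toy usage with random non-negative embeddings (V = 40 words, K = 4 dimensions).
E = np.random.default_rng(0).random((40, 4))
vocab = [f"w{i}" for i in range(40)]
top_words, intruder = intrusion_set(E, vocab, dim=0, rng=np.random.default_rng(1))
```

The six words (five top words plus the intruder) would then be shuffled and shown to annotators, and the fraction of correctly identified intruders gives the detection precision.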
In Table 3, we show the top-5 words for some dimensions, which clearly demonstrate the semantic meanings of these dimensions. One can also refer to http://github.com/skTim/OIWE to find the top-5 words for all dimensions.

Influence of Dimension Numbers
The dimension number is an important configuration in word embeddings. In Fig. 1 we show the performance of OIWE and Skip-Gram on word similarity computation with varying dimension numbers. From the figure, we can observe that: (1) Both models achieve their best performance under the same dimension number. This indicates that OIWE, to some extent, inherits the representation power of Skip-Gram. (2) The performance of OIWE seems to be more sensitive to the dimension number. When the dimension number changes from 300 to 200 or 400, the performance drops much more quickly than that of Skip-Gram. The reason may be as follows. OIWE has to account for both the representation ability of word embeddings and the interpretability of each dimension. An appropriate dimension number is critical to make each dimension interpretable, just as the cluster number is important for clustering. In contrast, Skip-Gram is much freer to learn word embeddings, since it is concerned only with representation ability. (3) The performance of OIWE with various dimension numbers also varies across evaluation datasets. For example, OIWE-IPG with K = 400 gets 68.74 on MEN, which is much better than that with K = 300. In future work, we will extensively investigate the characteristics of OIWE with respect to dimension numbers and other hyperparameters.

Conclusion and Future Work
Figure 1: Influence of Dimension Number on Word Similarity

In this paper, we present online interpretable word embeddings. The OIWE models perform projected gradient descent to apply non-negative constraints to NN methods such as Skip-Gram. Experimental results on word similarity computation and word intrusion detection demonstrate the effectiveness and efficiency of our models in both representation ability and interpretability. We also note that our models can be easily extended to other NN methods.
In the future, we will explore the following research issues: (1) We will extensively investigate the characteristics of OIWE with respect to various hyperparameters, including dimension numbers. (2) We will evaluate the performance of our OIWE models in various NLP applications.