Parameterized context windows in Random Indexing

This paper introduces a parameterization for word embeddings produced by the Random Indexing framework. The parameterization introduces position-specific weights in the context windows, and the approach is shown to improve performance in both word similarity and sentiment classification tasks. We also demonstrate the relation between Random Indexing and Convolutional Neural Networks.


Introduction
Quantifying the importance of contextual information for semantic representation is the goal of distributional semantics, in which contextual information is used to quantify semantic similarities between words (Turney and Pantel, 2010). However, standard practice in distributional semantics is to weight the importance of a context item based on either its frequency (Sahlgren et al., 2016), its distance to the focus word (Lund et al., 1995), or its global co-occurrence statistics (Niwa and Nitta, 1994). Thus far, there has not been much work on applying machine learning to this weighting in order to select useful context items for distributional semantics.
The idea behind the proposed parameterization is to weight the items in the context window based on their usefulness for accomplishing some specific task, such as sentiment classification or word similarity rating. In this paper, we introduce a simple parameterization for the Random Indexing processing model. We first show that Random Indexing can be formulated in terms of a convolution, in order to situate the framework in the context of neural networks. We then introduce a simple parameterization of the positions in the context windows, and we show that it improves the performance of the embeddings in some word similarity and sentiment classification tasks.

Notation
Using a vocabulary $V$ of words $w_i$ for $i = 1, \ldots, |V|$, we seek word embeddings $v_i \in \mathbb{R}^d$ by collecting statistics from a corpus $C = \{w_1, \ldots, w_t, \ldots, w_N\}$. We will interchangeably use the subscripts $i$ and $t$ on words and embeddings to index the vocabulary and the corpus, respectively.

Random Indexing as Convolution
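Random Indexing (RI) (Kanerva et al., 2000; Kanerva, 2009) is a distributional semantic model that updates the embedding vectors $v_*$ in an online fashion by summing the sparse random vectors $e_*$ that represent the context items (these vectors are called random index vectors and act as unique identifiers for the context items, words in this case):

$$v_t \leftarrow v_t + \sum_{l=-k,\; l \neq 0}^{k} h(w_{t+l})\, e_{t+l} \qquad (1)$$

where $k$ is the context window size and $h(w)$ is some weight that quantifies the importance of the context item (the standard setting is $h(w) = 1$ for all $w$). Here $e_{t+l}$ is the random index vector of corpus item $w_{t+l}$. Equation (1) describes the update rule of RI, and the final embeddings $v_i$ can be expressed as:

$$v_i = \sum_{t:\; w_t = w_i} \; \sum_{l=-k,\; l \neq 0}^{k} h(w_{t+l})\, e_{t+l} \qquad (2)$$

To establish the equivalence between RI and convolution, we can reformulate the update rule in Equation (1) as follows. Let $h \in \mathbb{R}^{(2k+1) \times d}$ be a filter function with rows indexed by $m = -k, \ldots, k$ and columns by $n = 1, \ldots, d$, where, in the standard setting $h(w) = 1$,

$$h(m, n) = \begin{cases} 1 & \text{if } n = d/2 \text{ and } m \neq 0 \\ 0 & \text{otherwise} \end{cases}$$

Furthermore, let $S \in \mathbb{R}^{N \times d}$ be a matrix of stacked sparse random vectors $e_t$, with entries outside the matrix taken to be zero. Now, if we use $h$ and $S$, we can rewrite component $j$ of the second term of the random indexing update rule in Equation (1):

$$\Big[ \sum_{l=-k,\; l \neq 0}^{k} e_{t+l} \Big]_j = \sum_{m=-k}^{k} \sum_{n=1}^{d} h(m, n)\, S(t - m,\; j - n + d/2) \qquad (3)$$

Equation (3) is a 2D discrete convolution between $S$ and $h$, hence:

$$v_t \leftarrow v_t + (S * h)(t, \cdot) \qquad (4)$$

Because $h$ has been defined with zeros everywhere except for column $d/2$, Equation (4) can be seen as a 1D convolution over each column vector $s_j$ in $S$ with the nonzero filter column $h_{d/2}$:

$$(S * h)(t, j) = (s_j * h_{d/2})(t), \quad j = 1, \ldots, d \qquad (5)$$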
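The equivalence between the RI window sum and a column-wise 1D convolution can be checked numerically. The following is a minimal sketch under stated assumptions, not the implementation used in the experiments: it assumes the standard setting h(w) = 1, ternary index vectors with a fixed number of nonzero entries, a corpus given as an array of integer word ids, and a corpus length of at least 2k + 1. The function names are hypothetical.

```python
import numpy as np

def random_index_vectors(vocab_size, d, nnz, rng):
    """Sparse index vectors (assumed ternary): nnz entries are +1 or -1, the rest 0."""
    E = np.zeros((vocab_size, d))
    for i in range(vocab_size):
        pos = rng.choice(d, size=nnz, replace=False)
        E[i, pos] = rng.choice([-1.0, 1.0], size=nnz)
    return E

def ri_window_sum(corpus, E, k):
    """Direct form: for each corpus position, add the index vectors of the context items."""
    V = np.zeros_like(E)
    N = len(corpus)
    for t, w in enumerate(corpus):
        for l in range(-k, k + 1):
            if l != 0 and 0 <= t + l < N:
                V[w] += E[corpus[t + l]]
    return V

def ri_convolution(corpus, E, k):
    """Same embeddings, computed with a 1D convolution over each column of S."""
    S = E[corpus]                       # N x d matrix, row t is e_t
    filt = np.ones(2 * k + 1)
    filt[k] = 0.0                       # the focus word itself (l = 0) is excluded
    # mode="same" keeps length N (requires N >= 2k + 1) and zero-pads the edges.
    ctx = np.stack(
        [np.convolve(S[:, j], filt, mode="same") for j in range(S.shape[1])],
        axis=1,
    )
    V = np.zeros_like(E)
    np.add.at(V, corpus, ctx)           # accumulate row t into the embedding of w_t
    return V
```

Because the filter is symmetric around the focus position, the flip performed by the convolution does not change the result, so both functions return the same embedding matrix.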

Dealing with Redundant Features
Since word embeddings (produced by RI or some other distributional model) are constructed in an unsupervised manner by collecting co-occurrence information from a large corpus, it is likely that the resulting embeddings are very general, which may lower the expressiveness of the embeddings if they are going to be used in a very specific domain. Take the example of training a text categorization classifier within a financial context; corpus occurrences of the words "bank" and "stock" in the senses of LARGE COLLECTION and INVENTORY will likely not provide useful information for the embeddings in this domain. In word embeddings, different senses are represented by co-occurrences with different context items (Cuba Gyllensten and Sahlgren, 2015). We refer to context items that are less useful for a specific task as redundant features of the embeddings. Unfortunately, it is not, in the general case, possible to know a priori which context items will be useful to construct embeddings for a particular task. Such context (i.e. feature) selection instead needs to be performed jointly with training the classifier. When backpropagation is used as the optimization strategy for the classifier, one can also treat the word embeddings as parameters to update. It is straightforward to take the derivatives of the objective function with respect to the input and apply Stochastic Gradient Descent (SGD) updates just as for the model parameters. This strategy is well known (Zhang and Wallace, 2015), and will be referred to as SGD Random Indexing (SGD-RI).
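To make the idea of treating embeddings as trainable parameters concrete, the following minimal sketch applies one SGD step to both a classifier and its input embeddings. It is an illustration under stated assumptions rather than the setup used in the experiments: the classifier is assumed to be a single logistic unit over summed word embeddings with a cross-entropy loss, and the function name `sgd_ri_step` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_ri_step(E, w, b, doc, y, lr):
    """One SGD step on an assumed logistic classifier over summed embeddings.

    E   : (|V|, d) embedding matrix, updated in place for the words in doc
    w   : (d,) classifier weights, updated in place
    b   : scalar bias (the updated value is returned)
    doc : array of word ids in the document
    y   : label in {0, 1}
    Returns (updated bias, loss before the update).
    """
    x = E[doc].sum(axis=0)                  # document vector: sum of word embeddings
    p = sigmoid(w @ x + b)
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    g = p - y                               # derivative of the loss w.r.t. the logit
    grad_x = g * w                          # derivative w.r.t. the input vector
    w -= lr * g * x                         # ordinary model-parameter update
    np.subtract.at(E, doc, lr * grad_x)     # every word occurrence gets the input gradient
    return b - lr * g, loss
```

Only the rows of `E` that belong to words in the training documents are ever modified; all other embeddings keep their pre-trained values.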

Parameterization of context window
Another strategy is to parameterize the word embeddings, and to optimize those parameters jointly with the task using backpropagation. The RI algorithm, as defined by Sahlgren et al. (2016), weights the importance of context items based on their relative frequency according to Equation (6):

$$h(w_t) = \exp\left(-c \cdot \frac{f(w_t)}{|V|}\right) \qquad (6)$$

where $c$ is a constant, $f(w_t)$ is the corpus frequency of item $w_t$, and $|V|$ is the size of the vocabulary (i.e. the number of unique words seen thus far). We would however like to parameterize context items not only depending on relative frequency but also on their usefulness for the specific task at hand. To describe the suggested parameterization, recall the Random Indexing algorithm in Equation (1), where we look at each word and its context in the corpus in a streaming fashion, and construct embedding vectors by summing the index vectors of all words occurring in the context. A fairly obvious refinement of this algorithm would be to parameterize the relative positions within the context window depending on their usefulness for the task at hand. Equation (7) formalizes the parameterization by introducing an additional factor to the weighting scheme:

$$h_\theta(w_{t+l}) = \theta^{w_t}_l \, h(w_{t+l}) \qquad (7)$$

Inserting this parameterization into the update rule in Equation (1), we get:

$$v_t \leftarrow v_t + \sum_{l=-k,\; l \neq 0}^{k} \theta^{w_t}_l \, h(w_{t+l})\, e_{t+l} \qquad (8)$$

$$v_i = \sum_{t:\; w_t = w_i} \; \sum_{l=-k,\; l \neq 0}^{k} \theta^{w_t}_l \, h(w_{t+l})\, e_{t+l} \qquad (9)$$

On inspection, $\theta^{w_t}_l$ can be moved outside the sum over corpus positions, while swapping the subscript to $i$ since $w_t = w_i$:

$$v_i = \sum_{l=-k,\; l \neq 0}^{k} \theta^{w_i}_l \sum_{t:\; w_t = w_i} h(w_{t+l})\, e_{t+l} \qquad (10)$$

The rewrite now allows the inner sum to be calculated before fitting the $\theta^{w_i}_l$, which makes the algorithm much more efficient. In practice, this means we aggregate an embedding vector $\tilde{v}^l_i = \sum_{t:\, w_t = w_i} h(w_{t+l})\, e_{t+l}$ for each relative window position $l$, for each word $w_i$. Stacking these $2k$ context vectors into a matrix $V_i$ and collecting the $\theta^{w_i}_l$ in a vector yields:

$$V_i = \left[ \tilde{v}^{-k}_i, \ldots, \tilde{v}^{-1}_i, \tilde{v}^{1}_i, \ldots, \tilde{v}^{k}_i \right] \in \mathbb{R}^{d \times 2k} \qquad (11)$$

$$\theta_i = \left[ \theta^{w_i}_{-k}, \ldots, \theta^{w_i}_{-1}, \theta^{w_i}_{1}, \ldots, \theta^{w_i}_{k} \right]^\top \in \mathbb{R}^{2k} \qquad (12)$$

Equation (10) can now be rewritten as a matrix-vector multiplication:

$$v_i = V_i\, \theta_i \qquad (13)$$

In other words, instead of aggregating embedding vectors $v_i$ according to (9), we aggregate matrices $V_i$ while parsing the corpus.
The embedding vectors are then calculated as a multiplication with a parameter vector $\theta_i$ according to (13). Note that when $\theta_i = \mathbf{1}$, the vanilla Random Indexing embeddings are recovered.
We will refer to this strategy as Parameterized Random Indexing (PAR-RI).
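The aggregation of one matrix per word and the subsequent multiplication with a parameter vector can be sketched as follows. This is a minimal illustration under stated assumptions: the corpus is an array of integer word ids, index vectors are the rows of a matrix `E`, the frequency weights h(w) are passed in as an optional array (defaulting to 1), and the function names are hypothetical.

```python
import numpy as np

def aggregate_position_matrices(corpus, E, k, h=None):
    """Build one d x 2k matrix per word; column c holds the summed (weighted)
    index vectors seen at one relative window position.

    Returns an array of shape (|V|, d, 2k). Columns are ordered by offset
    -k, ..., -1, 1, ..., k.
    """
    n_words, d = E.shape
    if h is None:
        h = np.ones(n_words)            # standard setting: h(w) = 1
    offsets = [l for l in range(-k, k + 1) if l != 0]
    V = np.zeros((n_words, d, 2 * k))
    N = len(corpus)
    for t, w in enumerate(corpus):
        for c, l in enumerate(offsets):
            if 0 <= t + l < N:
                ctx = corpus[t + l]
                V[w, :, c] += h[ctx] * E[ctx]
    return V

def embed(V, theta):
    """Compute v_i = V_i theta_i for all words; theta has shape (|V|, 2k)."""
    return np.einsum("idc,ic->id", V, theta)
```

With `theta` set to all ones, `embed` returns the ordinary RI embeddings; setting individual entries of `theta` to zero removes the contribution of the corresponding window positions without re-reading the corpus.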

Example: Word Similarity
To exemplify the effectiveness of the proposed parameterization, we use the SimLex-999 (Hill et al., 2015) test in order to see how much the Spearman rank correlation can be improved by fitting the $\theta_i$ such that the cosine similarity between the embedding vectors corresponds to the similarity ratings. Formally, we seek to minimize an objective function $f$ over the word pairs $(w_i, w_j) \in S$ in SimLex, where $s(w_i, w_j)$ is the SimLex similarity score for the word pair (scaled to $[0, 1]$) and $\cos \alpha_{ij}$ is the cosine similarity between the words' corresponding vectors:

$$\cos \alpha_{ij} = \frac{v_i^\top v_j}{\|v_i\| \, \|v_j\|}$$

where $v_i$ and $v_j$ are the word vectors corresponding to $w_i$ and $w_j$, calculated as in equation (13). Since this is a non-convex problem, SGD is applied as the optimization strategy. Calculating the gradient of $f$ with respect to $\theta_i$ and $\theta_j$ is straightforward. Applying the chain rule with $v_i = V_i \theta_i$, the gradient of $\cos \alpha_{ij}$ becomes:

$$\frac{\partial \cos \alpha_{ij}}{\partial \theta_i} = V_i^\top \left( \frac{v_j}{\|v_i\| \, \|v_j\|} - \cos \alpha_{ij} \, \frac{v_i}{\|v_i\|^2} \right)$$

The expression for $\partial \cos \alpha_{ij} / \partial \theta_j$ is the same, but with the subscripts interchanged. We now apply SGD to optimize the $\theta_i$ iteratively using the following update rules:

$$\theta_i \leftarrow \theta_i - \eta \, \frac{\partial f}{\partial \theta_i}, \qquad \theta_j \leftarrow \theta_j - \eta \, \frac{\partial f}{\partial \theta_j} \qquad (19)$$

This procedure is performed using $V_*$ matrices generated from a dump of Wikipedia with the Random Indexing hyper-parameters listed in Table 1. The $\theta_i$ are initialized to one-vectors ($\theta_* = \mathbf{1}$) and updated according to equation (19) with a learning rate $\eta = 1.0$ until convergence. The results are summarized in Table 2. We can see that the Spearman correlation is drastically improved with the optimized $\theta_i$. This experiment can be seen as, for each word $w_i$, finding a linear combination in the column space of $V_i$ that optimizes the cosine similarity of the word vectors to match the SimLex similarity scores. It is remarkable that optimizing the $\theta_i$ in the relatively small 20-dimensional ($\mathbb{R}^{2k}$) subspaces of the full word space ($\mathbb{R}^{2000}$) yields such a big improvement.
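The gradient of the cosine similarity with respect to the position weights, and a single SGD update for one word pair, can be sketched as follows. This is a minimal illustration, not the code used for the reported results. The per-pair loss is assumed here to be the squared difference between the similarity score and the cosine similarity, which is one natural choice and may differ from the objective actually used; the function names are hypothetical.

```python
import numpy as np

def cosine(Vi, ti, Vj, tj):
    """Cosine similarity between v_i = V_i theta_i and v_j = V_j theta_j."""
    vi, vj = Vi @ ti, Vj @ tj
    return (vi @ vj) / (np.linalg.norm(vi) * np.linalg.norm(vj))

def grad_cosine(Vi, ti, Vj, tj):
    """Gradient of the cosine similarity with respect to theta_i (chain rule)."""
    vi, vj = Vi @ ti, Vj @ tj
    ni, nj = np.linalg.norm(vi), np.linalg.norm(vj)
    cos = (vi @ vj) / (ni * nj)
    return Vi.T @ (vj / (ni * nj) - cos * vi / ni**2)

def sgd_pair_step(Vi, ti, Vj, tj, s, lr):
    """One SGD step on the ASSUMED per-pair loss (cos - s)^2.

    Both gradients are evaluated at the current weights before either is updated.
    Returns the updated (theta_i, theta_j).
    """
    err = cosine(Vi, ti, Vj, tj) - s
    gi = 2.0 * err * grad_cosine(Vi, ti, Vj, tj)
    gj = 2.0 * err * grad_cosine(Vj, tj, Vi, ti)   # same formula, subscripts swapped
    return ti - lr * gi, tj - lr * gj
```

A finite-difference comparison against `cosine` is a cheap way to validate `grad_cosine` before running the optimization over all word pairs.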

Example: Sentiment Classification
The improvements reported in the previous section suggest that the parameterization may also be viable for improving performance in text classification. In this section, we parameterize the embeddings for sentiment classification using two standard benchmarks: the Pang and Lee Sentence Polarity Dataset v1.0 (PL05) (Pang and Lee, 2005) and the Stanford Sentiment Treebank (SST) (Socher et al., 2013).
The PL05 data consists of 10,662 short movie reviews that are classified as either positive or negative. Experiments using this dataset are split into 25% test and 75% train/validation sets and evaluated by 5-fold cross validation on the train/validation set. We make two consecutive runs, for a total of 10 trainings, and report their maximum, minimum and mean accuracy as well as their standard deviation.
The SST data is an extension of PL05 with train/validation/test splits provided. The dataset also provides fine-grained labels (very positive, positive, neutral, negative, very negative). In this study, however, we have omitted the neutral labels and treated the task as a binary classification problem by merging the very positive and positive classes into one class and the very negative and negative classes into another. We report the maximum, minimum and mean accuracy as well as the standard deviation of 10 consecutive runs using the provided train/validation/test splits.
We use two different classifiers in these experiments. The first is a standard neural network (referred to as MLP for Multi-Layer Perceptron) (Rumelhart et al., 1986) with one hidden layer of 120 nodes with sigmoid activations and one sigmoid output unit. All word vectors are normalized to an $\ell_2$ norm of 1 and naively summed to produce document vectors. The weights in the neural network are also $\ell_2$-regularized with a constant factor of $\lambda = 0.001$. The second classifier is the model proposed by Kim (2014), which implements a Convolutional Neural Network (CNN). The hyper-parameters used are listed in Table 3.
As in the MLP model, the word embeddings are normalized to unit length. As a comparison with the different RI-based embeddings (RI, SGD-RI, and PAR-RI), we also include results using embeddings produced with SGNS (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), both with 300-dimensional vectors and a window size of 2. We also include results using embeddings randomly sampled from a uniform U(−0.25, 0.25) distribution (RAND). All word embeddings (except for the RAND vectors) are pre-trained (unsupervised) on a dump of Wikipedia. We list all hyper-parameters for RI, SGNS and GloVe in Tables 4, 5 and 6. The results of all experiments are shown in ta- [...] (Suzuki and Nagata, 2015), consistently underperform in comparison with both SGNS and PAR-RI (and, in the case of the MLP classifier, also the SGD-RI embeddings). This is in contrast to the experiments performed by Zhang and Wallace (2015), where the difference was minor.
Comparing the PAR-RI embeddings with SGD-RI and standard RI, PAR-RI performs well, with the highest mean accuracy on the SST dataset when using the MLP model. SGD-RI improves the results compared to the standard RI embeddings for the MLP model, but not for the CNN model. Updating the SGNS embeddings in the same way as SGD-RI for the CNN has also been studied by Zhang and Wallace (2015), who report a performance boost of about 0.8%. This also contrasts with our results with SGD-RI using the CNN model, where the updates instead decrease performance compared to standard RI. This could be due to the RI embeddings being higher-dimensional than SGNS, yielding a larger parameter space that is harder to optimize.
Comparing our results to other results reported in the literature, Kim (2014) and Zhang and Wallace (2015) manage to push the boundaries up to 80.10 for the PL05 data, and up to 84.88 for the SST data, using SGNS embeddings pre-trained on a much larger Google News dataset of 100 billion tokens. We believe this somewhat increased performance is partly due to the bigger dataset. Another factor could be that the language style in news articles is more similar to that of movie reviews than the style of Wikipedia is, arguably yielding better-suited embeddings.

Optimized Context Profiles
When the PAR-RI parameterization was proposed, the hypothesis was that certain relative positions in the context windows would be more important in describing the context of a word than others. The results in the two previous sections demonstrate that the proposed parameterization is able to improve the embeddings in both a word similarity task, and (to a lesser extent) a sentiment classification task. This indicates that the parameterization is actually able to find useful context profiles for terms used in the various test settings. In this section, we exemplify the kinds of context profiles learned when trained for the sentiment classification task. Figure 1 shows the learned weights per context window position for four different adjectives (top row), four different determiners (middle row), and four different nouns (bottom row). The parameterization clearly has a larger effect for some words than for others; as an example, the windows for "good" and "bad" are much more parameterized than the windows for "reliable" and "positive", and the windows for the determiners are in general much more parameterized than the windows for nouns. It is interesting to note that there is a small tendency for the windows for the adjectives to have a higher weight in the +1 position, which is consistent with a linguistic analysis of adjectives as qualifiers of succeeding nouns. By contrast, the window positions for the determiners seem to have a higher weight in the positions just preceding the focus word, while the windows for [...] It thus seems as if the parameterization is able to learn slightly different window profiles for different parts of speech.
As noted in the introduction, it is common practice in distributional semantics to weight the context windows by the distance to the focus word. If this is an optimal strategy, we should see a bell-like curve tending to zero at the edges. Such a shape is partially present for some of the words, for example in "and", "of", "good" and "bad", but for most words, the weights are almost unchanged. We believe this could be due to the vanishing gradient problem, where the gradient seems to vanish deeper down the model. In addition, the less common a word is in the training set, the less it is updated. Another interesting aspect of the learned weights is that by inspecting the $\ell_1$ norm of the weight vectors, we get a hint of the words' relative importance for the given task. We can see that the $\ell_1$ norm for the words "good" and "bad" is larger than for "the" and "of", which feels natural for the sentiment classification task.

Conclusion
This paper has introduced a simple parameterization for the RI framework, which has also been derived in terms of convolution. It parameterizes the positions in the context windows and optimizes with respect to the performance of the embeddings in some given task, such as word similarity or text classification. Our experiments show that the proposed PAR-RI model is able to improve the performance of the embeddings in many cases, and that the results are competitive in comparison with other well-known embeddings. The idea of parameterizing the window positions could also be applied to other distributional semantic models, such as SGNS.
We note that all embeddings used in the sentiment classification task produce very similar results. This indicates that in practice, the word embeddings included in this paper are more or less equivalent. It is therefore doubtful whether it is possible to draw any conclusions based on these results regarding the question whether any single embedding is superior to the others in the general case.
The context profiles provided as examples of the parameterization show some interesting effects. However, training the position-dependent weights is non-trivial, and one could probably think of better initializations of the weights than just one-vectors, for example using a bell-like shape. The vanishing gradient problem would however remain, and the weights for uncommon words would not change significantly.
The conclusion of the experiments using SGD-RI is that updating the embeddings jointly with the classification model using SGD does not necessarily improve generalization. This is in fact not so strange. Moving around only a subset of the words (i.e. the words present in the training set), while leaving the rest untouched produces an inconsistent space with undefined distributional properties between updated and non-updated embeddings. It could therefore be an idea to use randomized embeddings for all words not present in the training set because they then can be regarded as approximately orthogonal, and thus should not interfere with the semantic structure.