Additive Compositionality of Word Vectors

Additive compositionality of word embedding models has been studied from empirical and theoretical perspectives. Existing research on justifying additive compositionality of existing word embedding models requires a rather strong assumption of uniform word distribution. In this paper, we relax that assumption and propose more realistic conditions for proving additive compositionality, and we develop a novel word and sub-word embedding model that satisfies additive compositionality under those conditions. We then empirically show our model’s improved semantic representation performance on word similarity and noisy sentence similarity.


Introduction
Previous word embedding studies have empirically shown linguistic regularities represented as linear translations in the word vector space, but they do not explain these empirical results mathematically (Mikolov et al., 2013b; Pennington et al., 2014; Bojanowski et al., 2017).
Recent studies present theoretical advances to interpret these word embedding models. Levy and Goldberg (2014b) show that the global optimum of SGNS (Skip-Gram with Negative Sampling) is the shifted PMI, PMI(i, j) − log k. Arora et al. (2016) propose a generative model to explain PMI-based distributional models and present a mathematical explanation of the linguistic regularity in Skip-Gram (Mikolov et al., 2013a). Gittens et al. (2017) provide a theoretical justification of the additive compositionality of Skip-Gram and show that the linguistic regularity of Skip-Gram (Mikolov et al., 2013a) is explained by its additive compositionality. One property of the word vectors that equates to satisfying additive compositionality is the following:
\[ u_c = \sum_{i=1}^{n} u_{c_i} \]
where word c is the paraphrase word of the set of words $\{c_1, ..., c_n\}$, and u is the vector representation of a word. We explain additive compositionality in more detail in section 3.3.
In this paper, we provide a sounder mathematical explanation of linguistic regularity to overcome the limitations of previous theoretical explanations. For instance, Levy and Goldberg (2014b) do not provide a connection between the shifted PMI and linguistic regularity, and Arora et al. (2016) and Gittens et al. (2017) require strong assumptions in their mathematical explanations of linguistic regularity. Arora et al. (2016) assume isotropy, a uniformly distributed word vector space, and Gittens et al. (2017) assume a uniform word frequency distribution within a corpus, p(w) = 1/|V|.
We propose a novel word/sub-word embedding model, which we call OLIVE, that satisfies exact additive compositionality. The objective function of OLIVE consists of two parts. One is a global co-occurrence term to capture the semantic similarity of words. The other is a regularization term to constrain the size of the inner product of co-occurring word pairs. We show that, under a certain condition, the global optimum of OLIVE is exactly the shifted PMI matrix, unlike SGNS, whose optimum only approximates it because of the sampling process in training (Levy and Goldberg, 2014b). The source code and pre-trained word vectors of OLIVE are publicly available. Because it is a more theoretically sound word embedding model, OLIVE shows improved empirical performance in the semantic representation of word vectors, and because it eliminates the sampling process of SGNS, OLIVE is robust to the size of the vocabulary and to noise in sentence representation. We evaluate semantic representation performance with the word similarity task, and we show the robustness of our model by running the word similarity task with various vocabulary sizes and the sentence similarity task under various noise settings.
The contributions of our research are as follows:
• We present a novel mathematical explanation of additive compositionality of SGNS.
• We propose a word/sub-word embedding model that theoretically satisfies additive compositionality. We provide the code for this model for reproducibility.
• In addition to theoretical justification, we show the empirical performance of our model and its robustness.

Related Work
Learning word co-occurrence distribution is known as an effective method to capture the semantics of words (Baroni and Lenci, 2010; Harris, 1954; Miller and Charles, 1991; Bullinaria and Levy, 2007). Based on this word co-occurrence distribution, two major types of word embedding research have been conducted. One is to use the local context of words in a corpus to train a neural network (Mikolov et al., 2013a,b; Bojanowski et al., 2017; Xu et al., 2018; Khodak et al., 2018). The other is to use the global statistics (Huang et al., 2012; Pennington et al., 2014). Aside from the general purpose embedding, some approaches are specific to a domain (Shi et al., 2018) or use extra human labeled information (Wang et al., 2018). In this paper, we focus on the general purpose word embedding.
Among those, SGNS (Skip-Gram with Negative Sampling) (Mikolov et al., 2013b) and FastText (Bojanowski et al., 2017) are widely used neural network based word and sub-word embedding models that use negative sampling (Gutmann and Hyvärinen, 2012; Mnih and Kavukcuoglu, 2013). With the argument that global co-occurrence statistics, overlooked in SGNS and FastText, are important in capturing the word semantics, Pennington et al. (2014) propose GloVe.
Although Skip-Gram and GloVe seem to capture the linguistic regularity sufficiently, from the theoretical perspective of additive compositionality of word vectors, both models are lacking because they require extra and strong assumptions (Gittens et al., 2017; Arora et al., 2016).
Recently, Allen and Hospedales (2019) point out the strong assumptions made in previous word embedding studies (Gittens et al., 2017; Arora et al., 2016) and propose a theoretical explanation of linguistic regularity in SGNS based on the paraphrase definition of Gittens et al. (2017). Although their theory explains linguistic regularity in SGNS, some mathematical properties of SGNS remain unexplained: the uniqueness of the paraphrase vector and the meaning of the negative sampling parameter.
In this paper, we recognize the importance of additive compositionality, which connects word vectors and their linguistic regularity, and we provide a novel mathematical explanation of SGNS's additive compositionality and of the uniqueness property in additive compositionality. Further, we propose a novel word/sub-word embedding model that satisfies additive compositionality based on our theory and show its capabilities in capturing the semantics of words.

Preliminaries
In this section, we describe three important concepts about word vectors and how they are used in existing word embedding models. First, we describe the PMI matrix which is an approximated global optimum point of SGNS (Skip-Gram with Negative Sampling) (Levy and Goldberg, 2014b) and known to capture the semantic meaning of words in vector space (Arora et al., 2016). Second, we describe sub-sampling, a word frequency balancing method to increase word embedding performance. Third, we describe additive compositionality, the notion that a paraphrase word vector is a vector sum of its context word set (Gittens et al., 2017).

Objective Function
There are two major types of objective functions used in word embedding models. One is used in SGNS and is based on the PMI; the other is used in GloVe and is based on the joint probability distribution of word pairs, $p(w_1, w_2)$.

Skip-Gram with Negative Sampling
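SGNS iteratively trains word vectors for each co-occurring word pair in the same local context window (Mikolov et al., 2013b). The objective function for each word pair (i, j) is as follows,
\[ -\log \sigma(u_i^\top v_j) - k\, \mathbb{E}_{n \sim P_n}\left[ \log \sigma(-u_i^\top v_n) \right] \tag{1} \]
where σ is the sigmoid function and $P_n$ is the distribution from which negative samples are drawn. By minimizing (1), we get the global optimum point of SGNS as follows (Levy and Goldberg, 2014b),
\[ u_i^\top v_j = \mathrm{PMI}(i, j) - \log k \]
where k is the number of negative samples. One problem of SGNS, which trains word vectors for each context window independently, is that it cannot utilize the global statistics of a corpus (Pennington et al., 2014).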
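The shifted PMI target can be computed directly from co-occurrence counts. The following is a minimal sketch, assuming a dense word-by-context count matrix; the function name `shifted_pmi` and the toy counts are illustrative and do not come from the paper.

```python
import numpy as np

def shifted_pmi(counts: np.ndarray, k: float) -> np.ndarray:
    """Return PMI(i, j) - log k for a word-by-context co-occurrence count matrix."""
    p_ij = counts / counts.sum()                 # joint probability p(i, j)
    p_i = p_ij.sum(axis=1, keepdims=True)        # marginal p(i)
    p_j = p_ij.sum(axis=0, keepdims=True)        # marginal p(j)
    with np.errstate(divide="ignore"):
        pmi = np.log(p_ij) - np.log(p_i) - np.log(p_j)  # pairs with zero count give -inf
    return pmi - np.log(k)

# Toy counts (hypothetical): two words, two contexts.
counts = np.array([[4.0, 1.0], [1.0, 4.0]])
target = shifted_pmi(counts, k=5)
```

In this toy case the diagonal entries equal log(0.4 / 0.25) − log 5, which is what the inner product of a co-occurring pair would converge to at the optimum described above.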

GloVe
To capture the global co-occurrence statistics of word pairs, Pennington et al. (2014) propose GloVe with the following objective function,
\[ \sum_{i,j} f(X_{ij}) \left( u_i^\top v_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2 \]
which learns the word vectors that capture the global co-occurrences of word pairs ($X_{ij}$). Here, f is a weighting function and $b_i$, $\tilde{b}_j$ are bias terms. Note that this model does not result in the optimum being the PMI statistic, which is the condition for additive compositionality described in section 4.
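For concreteness, the GloVe objective can be sketched as follows. This is an illustrative sketch, not the reference implementation: the weighting function uses the exponent 0.75 and cap x_max = 100 reported by Pennington et al. (2014), and the function and argument names are hypothetical.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Clipped weighting f(x): grows as (x / x_max)**alpha and is capped at 1."""
    x = np.asarray(x, dtype=float)
    return np.minimum((x / x_max) ** alpha, 1.0)

def glove_loss(U, V, b_u, b_v, X):
    """Weighted squared error between u_i.v_j + biases and log X_ij over nonzero counts."""
    i, j = np.nonzero(X)                                    # only observed pairs contribute
    pred = np.einsum("nd,nd->n", U[i], V[j]) + b_u[i] + b_v[j]
    err = pred - np.log(X[i, j])
    return float(np.sum(glove_weight(X[i, j]) * err ** 2))
```

The target here is log X_ij rather than a PMI value, which is the difference the paragraph above points to.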

Sub-sampling
Word embedding models such as SGNS, FastText, and GloVe use various balancing methods that reduce the frequencies of very frequent words. These balancing methods are known to improve the semantic structure learned by a word embedding model. GloVe applies clipping in its weighting function, which places an upper bound on the number of co-occurrences of word pairs. SGNS and FastText use a sub-sampling method that probabilistically discards frequent words during the learning procedure, with the discard probability of word i as follows (Mikolov et al., 2013b),
\[ P_{discard}(i) = 1 - \sqrt{\frac{s}{p(i)}} \tag{2} \]
Here, p(i) is the frequency of word i in a corpus, and s is a sub-sampling parameter. In this paper, we propose a statistical sub-sampling method that is based on the sub-sampling method in SGNS. Our proposed method can be applied to a model that uses global statistics.
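The SGNS-style discard probability is simple to compute. A minimal sketch follows, assuming the form 1 − sqrt(s / p(i)) from Mikolov et al. (2013b); clipping at zero for rare words is an assumption added here so that the value is a valid probability.

```python
import math

def discard_probability(p_i: float, s: float = 1e-5) -> float:
    """SGNS-style discard probability 1 - sqrt(s / p_i), clipped at 0 for rare words."""
    return max(0.0, 1.0 - math.sqrt(s / p_i))
```

With s = 1e-5, a word with relative frequency 1e-3 is discarded about 90% of the time, while a word with frequency at or below s is never discarded.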

Additive Compositionality of Gittens et al. (2017)
Gittens et al. (2017) provide a mathematical definition of additive compositionality and a theoretical framework to justify the additive compositionality of Skip-Gram. They define additive compositionality by formulating a link between a paraphrase word and its context words, where the definition of paraphrase words is based on the KL divergence,
\[ D_{KL}\left( p(\cdot \mid C) \,\|\, p(\cdot \mid c) \right) \tag{3} \]
Here, c is a word and $C = \{c_1, ..., c_n\}$ is a set of words. So, if a word c minimizes (3) for a given set of words C, then we say the word c is a paraphrase of the word set C. If the paraphrase vector $u_c$ is captured by the vector addition of the words $\{c_1, ..., c_n\}$, we say that the word vectors u satisfy additive compositionality. Gittens et al. (2017) introduce two conditions for a paraphrase word vector to be the vector sum of a set C, $u_c = \sum_{i=1}^{n} u_{c_i}$.
• Given context word c, the probability that word w occurs within the same window can be calculated as follows,
\[ p(w \mid c) = \frac{1}{Z_c} \exp(u_c^\top v_w) \tag{4} \]
• Given the set of words C, the probability distribution of word w can be calculated as follows,
\[ p(w \mid C) = \frac{p(w)^{1-n}}{Z_C} \prod_{i=1}^{n} p(w \mid c_i) \tag{5} \]
where $Z_c$ and $Z_C$ are normalization variables. We are inspired by this mathematical definition of additive compositionality, and our work is based on their problem definition.
Gittens et al. (2017) assume a uniform word frequency distribution, p(w) = 1/|V|. We know from Zipf's law that this assumption is not realistic, so we propose to replace (4) with the following, so that the word embedding model satisfies additive compositionality without the uniform distribution assumption:
\[ p(w \mid c) = \frac{p(w)}{Z_c} \exp(u_c^\top v_w) \tag{6} \]
Theorem 1. If a word embedding model satisfies (5) and (6), then the embedding vector of a paraphrase word, $u_c$, can be represented by the vector sum of its context word set, $u_c = \sum_{i=1}^{n} u_{c_i}$.
Proof. The KL divergence (3) between (5) and (6) is computed as (7). Since (7) is convex and (8) becomes 0 when $u_c = u_C$, where $u_C$ denotes the vector sum over the context word set C, a paraphrase word vector can be represented by the vector sum of its context word set C. We explain the details of the proof of the convexity of (7) in the Appendix.
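The claim that the vector sum minimizes the KL divergence can be checked numerically on a toy model. The sketch below assumes that the single-word conditional has the form p(w | c) = p(w) exp(u_c · v_w) / Z_c and that the set conditional is proportional to p(w)^(1−n) ∏ p(w | c_i); the random vectors, the Zipf-like unigram distribution, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, n = 50, 8, 3                               # vocabulary size, dimension, |C|
out_vecs = rng.normal(scale=0.3, size=(V, d))    # v_w for every word w
ctx_vecs = rng.normal(scale=0.3, size=(n, d))    # u_{c_1}, ..., u_{c_n}
p_w = 1.0 / np.arange(1, V + 1)                  # Zipf-like, non-uniform unigram distribution
p_w /= p_w.sum()

def cond(u_c):
    """p(w | c) = p(w) exp(u_c . v_w) / Z_c for every w."""
    logits = np.log(p_w) + out_vecs @ u_c
    q = np.exp(logits - logits.max())
    return q / q.sum()

def kl(p, q):
    """KL divergence D(p || q) for strictly positive distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

# p(w | C) proportional to p(w)^(1-n) * prod_i p(w | c_i)
log_p_C = (1 - n) * np.log(p_w) + sum(np.log(cond(u)) for u in ctx_vecs)
p_C = np.exp(log_p_C - log_p_C.max())
p_C /= p_C.sum()

kl_at_sum = kl(p_C, cond(ctx_vecs.sum(axis=0)))  # candidate u_c = sum of context vectors
kl_at_single = kl(p_C, cond(ctx_vecs[0]))        # another candidate, for comparison
```

Under these assumptions the divergence at the summed vector vanishes up to floating-point error even though p(w) is far from uniform, while other candidate vectors give a strictly positive divergence.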
From (4) and (6), we can show that SGNS approximately satisfies additive compositionality without the uniform word frequency distribution assumption.
Since (9) is the same as (6) and we can prove (5) with Bayes' theorem, we prove that SGNS approximately satisfies additive compositionality.
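The step from the SGNS optimum to the conditional form can be spelled out. The following short derivation assumes the shifted-PMI optimum of Levy and Goldberg (2014b) and assumes that condition (6) takes the form $p(w \mid c) = p(w)\exp(u_c^\top v_w)/Z_c$; the approximation sign reflects that SGNS reaches this optimum only approximately.

```latex
\begin{align*}
u_c^\top v_w &\approx \log\frac{p(w, c)}{p(w)\,p(c)} - \log k \\
\Rightarrow\quad \frac{p(w, c)}{p(c)} &\approx k\,p(w)\exp\left(u_c^\top v_w\right) \\
\Rightarrow\quad p(w \mid c) &\approx \frac{p(w)}{Z_c}\exp\left(u_c^\top v_w\right),
\qquad Z_c = \frac{1}{k}.
\end{align*}
```

Under this reading, the number of negative samples k plays the role of the inverse normalization constant, and it is the same for every context word c.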

Model
In this section, we describe our word embedding model OLIVE which satisfies additive compositionality described in section 4. We first describe our word level embedding model and its properties, then we expand the model to the sub-word level.

Loss Function
The loss function in OLIVE consists of two parts: 1) a global co-occurrence term to capture the semantic similarity of co-occurring word pairs, and 2) a regularization term with sigmoid function and a different coefficient value for each word pair.
Here, we use the regularization term to control the global optimal point of our model. Our proposed loss function is
\[ L_{OLIVE} = -\sum_{(i,j) \in D} p(i, j) \log \sigma(u_i^\top v_j) + \sum_{(i,j) \in D} S_{ij}\, \sigma(u_i^\top v_j) \tag{10} \]
Here, D is the set of co-occurring word pairs, p(i, j) is the probability that the words (i, j) occur in the same context window, σ is the sigmoid function, u, v are the word embedding vectors, and $S_{ij}$ is the regularization coefficient for word pair (i, j). The notations used in our model are summarized in Table 1. By minimizing this loss function, we get the word embedding vectors u and v.

Properties
The global optimum of (10) depends on the value of $S_{ij}$. When $S_{ij} = p(i, j) + p(i)p(j)k$, our model has a single local optimum, the global optimum, which is the shifted PMI. Also, OLIVE satisfies additive compositionality when $S_{ij} = p(i, j) + p(i)p(j)k$. We describe the theorems and the proofs below.
Theorem 3. If $S_{ij} = p(i, j) + p(i)p(j)k$ and the dimension of the word embedding vectors u, v is sufficiently large to reach the global optimum point of $L_{OLIVE}$, then $L_{OLIVE}$ has a unique local optimum with respect to $u_i^\top v_j$, and $u_i^\top v_j$ becomes PMI(i, j) − log k at the local optimum.
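Proof. Let $r = u_i^\top v_j$. The first derivative of (10) with respect to r is as follows,
\[ \frac{\partial L_{OLIVE}}{\partial r} = p(i)p(j)k\,\sigma(-r)\sigma(r)e^{-r} \left( e^{r} - \frac{p(i,j)}{p(i)p(j)k} \right) \tag{11} \]
Since $p(i)p(j)k\,\sigma(-r)\sigma(r)e^{-r}$ is always positive, $\partial L_{OLIVE} / \partial r$ is positive when r satisfies the following condition:
\[ r > \log\left( \frac{p(i,j)}{p(i)p(j)k} \right) \]
The second derivative of $L_{OLIVE}$ is as follows,
\[ \frac{\partial^2 L_{OLIVE}}{\partial r^2} = p(i)p(j)k\,\sigma(-r)\sigma(r)^2 e^{-r} \left( \frac{2p(i,j)}{p(i)p(j)k} + 1 - e^{r} \right) \]
Since $p(i)p(j)k\,\sigma(-r)\sigma(r)^2 e^{-r}$ is always positive, $\partial^2 L_{OLIVE} / \partial r^2$ is positive when r satisfies the following condition:
\[ r < \log\left( \frac{2p(i,j)}{p(i)p(j)k} + 1 \right) \]
which leads to the following properties: $L_{OLIVE}$ is decreasing in r for $r < \log\frac{p(i,j)}{p(i)p(j)k}$, increasing for $r > \log\frac{p(i,j)}{p(i)p(j)k}$, and convex for $r < \log\left(\frac{2p(i,j)}{p(i)p(j)k} + 1\right)$. Since $\log\left(\frac{2p(i,j)}{p(i)p(j)k} + 1\right) > \log\left(\frac{p(i,j)}{p(i)p(j)k}\right)$, $L_{OLIVE}$ has a unique local optimum with respect to $u_i^\top v_j$. By simply finding the r that makes (11) equal to 0, we can show that the global optimum is PMI(i, j) − log k.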
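The location of the optimum can also be checked numerically for a single word pair. The sketch below assumes that the per-pair loss has the form −p(i, j) log σ(r) + S_ij σ(r), which is consistent with the positive factor and the sign conditions stated in the proof; the probabilities are toy values, not values from the paper.

```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

def pair_loss(r, p_ij, p_i, p_j, k):
    """Assumed per-pair loss: -p(i,j) log sigma(r) + S_ij sigma(r), S_ij = p(i,j) + p(i)p(j)k."""
    s_ij = p_ij + p_i * p_j * k
    return -p_ij * np.log(sigmoid(r)) + s_ij * sigmoid(r)

p_ij, p_i, p_j, k = 0.002, 0.01, 0.02, 50            # toy probabilities (hypothetical)
r_star = np.log(p_ij / (p_i * p_j)) - np.log(k)      # PMI(i, j) - log k
grid = np.linspace(-10.0, 10.0, 200001)              # step of 1e-4
r_best = grid[np.argmin(pair_loss(grid, p_ij, p_i, p_j, k))]
```

On this grid the minimizer coincides with PMI(i, j) − log k up to the grid resolution, and the loss has no other local minimum in the scanned range.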
Proof. From Theorem 1, we can prove Theorem 4 by showing that $L_{OLIVE}$ satisfies (5) and (6).
(b) From Theorem 3, we can rewrite the global optimum of our model as
\[ p(j \mid i) = k\,p(j) \exp(u_i^\top v_j) \tag{15} \]
Since (15) has the same form as (6), we prove that our model satisfies (6).

Sub-word Level Embedding
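We can expand (10) to a sub-word level embedding model without losing the properties in section 5.1.2. Let $u_i = \frac{1}{|G_i|} \sum_{x \in G_i} g_x$ and $v_j = \frac{1}{|G_j|} \sum_{y \in G_j} h_y$. Then, the expanded sub-word level model can be defined by the loss function
\[ L_{OLIVE\text{-}sub} = -\sum_{(i,j) \in D} p(i, j) \log \sigma\left( \Big( \tfrac{1}{|G_i|} \sum_{x \in G_i} g_x \Big)^{\top} \Big( \tfrac{1}{|G_j|} \sum_{y \in G_j} h_y \Big) \right) + \sum_{(i,j) \in D} S_{ij}\, \sigma\left( \Big( \tfrac{1}{|G_i|} \sum_{x \in G_i} g_x \Big)^{\top} \Big( \tfrac{1}{|G_j|} \sum_{y \in G_j} h_y \Big) \right) \tag{16} \]
where x, y are sub-word indicators in words i, j, the vectors g, h are sub-word embedding vectors, and $G_i$ is the set of sub-words in word i.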
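The averaging of sub-word vectors can be sketched as follows. This is an illustrative sketch: the n-gram length range [2, 7] is taken from the training settings, while the plain dictionary lookup, the absence of word-boundary symbols or hashing, and the skipping of unseen n-grams are simplifying assumptions.

```python
import numpy as np

def char_ngrams(word, n_min=2, n_max=7):
    """Character n-grams of a word with lengths in [n_min, n_max]."""
    return [word[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(word) - n + 1)]

def word_vector(word, table, dim):
    """Average of the sub-word vectors found in `table`; unseen n-grams are skipped."""
    vecs = [table[g] for g in char_ngrams(word) if g in table]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Hypothetical sub-word table for the word "cat".
table = {"ca": np.array([1.0, 0.0]), "at": np.array([0.0, 1.0]), "cat": np.array([1.0, 1.0])}
u_cat = word_vector("cat", table, dim=2)
```

Because a word vector is an average over many n-grams, changing a single character alters only some of the terms in the average.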

Statistical Sub-sampling
Similar to SGNS and FastText, we apply sub-sampling to improve word embedding performance, but because SGNS sub-samples words in each iteration of the learning process, we cannot directly apply it to OLIVE, which uses the global statistics. Instead, we propose a statistical sub-sampling based on the same discard probability form as (2), but deciding for each word whether it needs to be sub-sampled: the probability (17) is defined piecewise and equals 1 for words that do not need sub-sampling. Here, s is the sub-sampling parameter. By multiplying this probability by the frequency of the word, we get the sub-sampled global statistic of a word. The sub-sampled statistics are computed from $N_i$, the frequency of word i, and $N_{i,j}$, the co-occurrence frequency of word pair (i, j).
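One way to apply sub-sampling to global counts is sketched below. The exact piecewise probability is not reproduced here, so the sketch assumes a keep probability of sqrt(s / p(i)) for words with p(i) > s and 1 otherwise (the complement of the SGNS discard probability), and it assumes that pair counts are scaled by the product of the two keep probabilities. Both choices, and all names, are assumptions for illustration.

```python
import numpy as np

def keep_probability(p, s=1e-5):
    """Assumed keep probability: sqrt(s / p) for words with p > s, else 1."""
    p = np.asarray(p, dtype=float)
    return np.where(p > s, np.sqrt(s / p), 1.0)

def subsample_counts(N_i, N_ij, s=1e-5):
    """Scale unigram counts by P_i and pair counts by P_i * P_j (an assumption)."""
    N_i = np.asarray(N_i, dtype=float)
    keep = keep_probability(N_i / N_i.sum(), s)
    return keep * N_i, np.outer(keep, keep) * np.asarray(N_ij, dtype=float)
```

Unlike the per-token procedure in SGNS, this operates once on the count tables, so the resulting statistics are deterministic.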

Updating Rule
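In the learning process of (10) and (16), the word vectors are updated by gradient descent, where the gradient of a word vector $u_i$ in (10) is
\[ \frac{\partial L_{OLIVE}}{\partial u_i} = \sum_{j : (i,j) \in D} \sigma(-u_i^\top v_j) \left( S_{ij}\, \sigma(u_i^\top v_j) - p(i, j) \right) v_j \tag{19} \]
In (16), the gradients of the sub-word vectors $g_x$ and $h_y$ can be calculated from (19) by the chain rule; for example, $\frac{\partial L}{\partial g_x} = \sum_{i : x \in G_i} \frac{1}{|G_i|} \frac{\partial L}{\partial u_i}$. We update the word vectors with the normalized gradient of the word vector, scaled by the learning rate η.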
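A single update step can be sketched as follows. The sketch assumes the per-pair loss −p(i, j) log σ(u_i · v_j) + S_ij σ(u_i · v_j) and assumes that "normalized gradient" means rescaling the gradient to unit L2 norm; pairs outside D can be represented by zero entries in `p_row` and `S_row`. Function names are hypothetical.

```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

def grad_u(u_i, V, p_row, S_row):
    """Gradient of sum_j [-p_ij log sigma(u_i.v_j) + S_ij sigma(u_i.v_j)] w.r.t. u_i."""
    r = V @ u_i                                          # inner products with every v_j
    coeff = sigmoid(-r) * (S_row * sigmoid(r) - p_row)   # d(loss_j) / d(r_j)
    return coeff @ V

def normalized_step(u_i, grad, lr):
    """Move against the gradient rescaled to unit L2 norm (the norm choice is assumed)."""
    norm = np.linalg.norm(grad)
    return u_i if norm == 0.0 else u_i - lr * grad / norm
```

With this normalization every update moves the vector by exactly the learning rate, so the step size is controlled only by the learning-rate schedule.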

Experiment
In this section, we conduct three experiments to show the empirical effects of theoretical improvement on OLIVE. First, we conduct an experiment on the word similarity task to verify the semantic representation performance of OLIVE. Second, we report the word similarity performance of OLIVE on various vocabulary sizes and noisy sentence representation performance to show the robustness of OLIVE.

Training Settings
We train our model and baseline models with the Wikipedia English corpus with 4 billion tokens. We preprocess the corpus with Matt Mahoney's perl script. We use Skip-Gram, FastText, GloVe, and Probabilistic-FastText (Athiwaratkun et al., 2018) as baseline models. For all word and sub-word experiments, we set the dimension to 300 and the window size to 5. To train our word/sub-word model, we set $s = 10^{-5}$, k = 50, the number of iterations to 500, and the initial learning rate $\eta_0$ to 0.5. We decrease the learning rate at every iteration. In Skip-Gram and FastText, we set the number of negative samples to 5, the sub-sampling parameter to $10^{-5}$, and the initial learning rate to 0.025. In GloVe, we set the parameter $x_{max}$ to 100 following their paper (Pennington et al., 2014). In Prob-FastText (Athiwaratkun et al., 2018), we set the parameters to the default settings in their code. In the sub-word experiment, we extract sub-words whose length is in the range [2, 7].

Word Similarity
We evaluate our word embedding performance with the word similarity task using three word similarity datasets: MTurk-287, MTurk-771 (Radinsky et al., 2011), and SL-999 (Hill et al., 2015). We compare our model with four models: Skip-Gram, FastText, GloVe, and Probabilistic-FastText, a multisense sub-word embedding model (Athiwaratkun et al., 2018). Table 3 shows the results of the word similarity experiment with words that occur two or more times in the corpus, for a vocabulary size of 6.2 million words. Table 2 shows the results for the same experiment but with words that occur five or more times in the corpus, for a vocabulary size of 2.8 million words. In both tables, OLIVE outperforms all four comparison models.

Effect of Vocabulary Size
When we compare Tables 2 and 3, we see significant performance decreases in FastText, Skip-Gram, and Prob-FastText for the larger vocabulary size. On the other hand, our model and GloVe show consistent performance, and we interpret that as a result of using the global word co-occurrence statistics. To visualize that more clearly, we plot in Figure 1 the word similarity scores for the MTurk-771 dataset on various vocabulary sizes for the different models. This figure shows the robustness of our model with respect to the vocabulary size.
Table 3: Spearman's rank correlation coefficient of word similarity task on 6.2 × 10^6 vocabulary size.

Noisy Sentence Representation
In sections 4 and 5.1.2, we have shown that our model satisfies additive compositionality exactly and that SGNS satisfies it only approximately. Theorem 1 implies that finding the paraphrase vector of a set of context words is equivalent to finding a sentence vector, so the sentence vector that minimizes the KL divergence (7) is $u_{sentence} = \sum_{w \in sentence} u_w$. To test the robustness of OLIVE's additive compositionality, which comes from removing the approximation in SGNS, we measure sentence similarity under various noise settings with the sentence vectors calculated as the sum of the word vectors in the sentence.
We use the SICK dataset (Marelli et al., 2014) and all of the STS-English datasets (Cer et al., 2017): STS-train, STS-dev, and STS-test. These datasets contain sentence pairs and human-annotated similarity scores for each sentence pair. To add noise to a sentence, we use two types of noise settings: typo and omitted-word. We use the misspelling generation method proposed in Piktus et al. (2019) to make typos in a sentence with probability p. Piktus et al. (2019) use query logs of a search engine to build a probability distribution over the misspellings of a word, p(misspelled word | word). In the omitted-word setting, we randomly discard words in each sentence with probability p. In both the typo and omitted-word settings, we measure sentence similarity at various noise probabilities, p ∈ {0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4}.
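The omitted-word noise is straightforward to reproduce. A minimal sketch follows, assuming whitespace tokenization and an explicit seed for repeatability; both are choices made here, not details given in the paper. The typo setting is not sketched because it depends on the misspelling distribution built from search query logs.

```python
import random

def omit_words(sentence, p, seed=0):
    """Drop each whitespace-separated token independently with probability p."""
    rng = random.Random(seed)
    return " ".join(w for w in sentence.split() if rng.random() >= p)
```

For example, `omit_words("a man is playing a guitar", 0.15)` returns the sentence with each token removed independently at the 0.15 noise probability used in Table 4.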
To compute the similarity of a pair of sentences, we first calculate the sentence vectors and take the cosine similarity of the sentence vector pair. Then we calculate the Spearman and Pearson correlations between the human-annotated similarity and the cosine similarity of the sentence vector pair. We report results for the various typo and omitted-word settings in Figures 2 and 3, and results for the typo and omitted-word settings at 0.15 noise probability in Table 4. Table 4 shows that OLIVE outperforms the other models in each sub-word and word embedding group in both the typo and omitted-word settings.
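The evaluation pipeline described above can be sketched with NumPy alone. This is an illustrative sketch: skipping out-of-vocabulary tokens and the tie-free ranking used for the Spearman correlation are simplifying assumptions, and the function names are hypothetical.

```python
import numpy as np

def sentence_vector(tokens, vectors, dim):
    """Sum of word vectors; out-of-vocabulary tokens are skipped (an assumption)."""
    return sum((vectors[t] for t in tokens if t in vectors), np.zeros(dim))

def cosine(a, b):
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def pearson(x, y):
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Pearson correlation of ranks; this simple ranking does not average ties."""
    rank = lambda v: np.argsort(np.argsort(v))
    return pearson(rank(x), rank(y))
```

Given a list of sentence pairs, one would compute `cosine` on each pair of `sentence_vector` outputs and then pass the resulting scores and the human-annotated scores to `pearson` and `spearman`.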
When we compare the performance of GloVe, Skip-Gram, and OLIVE in Table 4, we obtain empirical evidence of the approximate and exact additive compositionality in (2) and (4). Since GloVe does not theoretically satisfy additive compositionality, the correlation of GloVe is lower than that of Skip-Gram. Overall, we establish that our model captures noisy sentence representation better than Skip-Gram, GloVe, and FastText.
Since OLIVE-sub and FastText are character n-gram embedding models, with $u_i = \frac{1}{|G_i|} \sum_{x \in G_i} g_x$, a misspelling in a word tends to change the vector representation of the word only slightly. In Figure 2, we empirically show a significant performance difference between sub-word embedding models and word embedding models in the typo setting. Also, Figures 2 and 3 show that OLIVE outperforms the other models in each sub-word and word embedding group in both the typo and omitted-word settings across the noise probabilities.

Conclusion
In this paper, we proposed 1) novel theoretical conditions for an additive compositional word embedding model and 2) a novel word/sub-word embedding model, which we call OLIVE, that satisfies additive compositionality. The loss function of OLIVE consists of a term for learning semantic similarity and a regularization term. From the loss function, we derived three properties of OLIVE: additive compositionality, uniqueness of the local optimum, and the shifted PMI as the global optimum.
Through several experiments, we showed that OLIVE outperforms other existing embedding models on various word similarity tasks and that it is robust with respect to the size of the vocabulary. With the sentence similarity task under various noise settings, we showed the robustness of the additive compositionality of OLIVE.