Batch IS NOT Heavy: Learning Word Representations From All Samples

Stochastic Gradient Descent (SGD) with negative sampling is the most prevalent approach to learning word representations. However, sampling methods are known to be biased, especially when the sampling distribution deviates from the true data distribution. Moreover, SGD suffers from dramatic fluctuations due to its one-sample learning scheme. In this work, we propose AllVec, which uses batch gradient learning to generate word representations from all training samples. Remarkably, the time complexity of AllVec remains at the same level as SGD, being determined by the number of positive samples rather than all samples. We evaluate AllVec on several benchmark tasks. Experiments show that AllVec outperforms sampling-based SGD methods with comparable efficiency, especially for small training corpora.


Introduction
Representing words using dense and real-valued vectors, aka word embeddings, has become the cornerstone for many natural language processing (NLP) tasks, such as document classification (Sebastiani, 2002), parsing (Huang et al., 2012), discourse relation recognition (Lei et al., 2017) and named entity recognition (Turian et al., 2010). Word embeddings can be learned by exploiting the distributional hypothesis (Harris, 1954): words occurring in similar contexts should have similar embeddings. A representative method is skip-gram (SG) (Mikolov et al., 2013a,b), which realizes the hypothesis using a shallow neural network model. The other family of methods is count-based, such as GloVe (Pennington et al., 2014) and LexVec (Salle et al., 2016a,b), which exploit low-rank models such as matrix factorization (MF) to learn embeddings by reconstructing word co-occurrence statistics.

* The first two authors contributed equally to this paper and share the first-authorship.

[Figure 1: Impact of different settings of negative sampling on skip-gram for the word analogy task on Text8. The accuracy depends largely on (a) the sampling size of negative words, and (b) the sampling distribution (β = 0 means the uniform distribution and β = 1 means the word frequency distribution).]
By far, most state-of-the-art embedding methods rely on SGD with negative sampling for optimization. However, the performance of SGD is highly sensitive to the sampling distribution and the number of negative samples (Chen et al., 2018; Yuan et al., 2016), as shown in Figure 1. Essentially, sampling is biased, making it difficult to converge to the same loss that would be obtained with all examples, regardless of how many update steps are taken. Moreover, SGD exhibits dramatic fluctuations and suffers from overshooting local minima (Ruder, 2016). These drawbacks of SGD can be attributed to its one-sample learning scheme, which updates parameters based on a single training sample at each step.
To address the above-mentioned limitations of SGD, a natural solution is to perform exact (full) batch learning. In contrast to SGD, batch learning involves no sampling procedure and computes the gradient over all training samples. As such, it can converge to a better optimum in a more stable way. Nevertheless, a well-known difficulty in applying full batch learning lies in the expensive computational cost for large-scale data. Taking word embedding learning as an example, if the vocabulary size is |V|, then evaluating the loss function and computing the full gradient takes O(|V|^2 k) time, where k is the embedding size. This high complexity is unaffordable in practice, since |V|^2 can easily reach the billion level or even higher.
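To make the cost concrete, here is a toy sketch (all names and sizes are illustrative, not the paper's implementation) of why one naive full-batch evaluation touches every (word, context) pair in V × V:

```python
import numpy as np

# Toy illustration of the naive full-batch cost: every (word, context)
# pair in V x V contributes one k-dimensional dot product, so a single
# loss evaluation costs O(|V|^2 * k).
V, k = 50, 8
rng = np.random.default_rng(0)
U = rng.normal(size=(V, k))        # word embeddings
U_tilde = rng.normal(size=(V, k))  # context embeddings
R = rng.random((V, V))             # target association scores (illustrative)

loss = 0.0
for w in range(V):                 # |V| words
    for c in range(V):             # |V| contexts -> |V|^2 pairs
        pred = U[w] @ U_tilde[c]   # O(k) per pair
        loss += (pred - R[w, c]) ** 2
print(f"naive full-batch loss over {V * V} pairs: {loss:.2f}")
```

With |V| in the millions, the |V|^2 factor alone is on the order of trillions of pairs, which is exactly the bottleneck the paper's reformulation removes.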
In this paper, we introduce AllVec, an exact and efficient word embedding method based on full batch learning. To address the efficiency challenge of learning from all training samples, we devise a regression-based loss function for word embedding, which allows fast optimization with memorization strategies. Specifically, the acceleration is achieved by reformulating the expensive loss over all negative samples using a partition and a decouple operation. By decoupling and caching the bottleneck terms, we succeed in using all samples for each parameter update at a manageable time complexity, mainly determined by the number of positive samples. The main contributions of this work are summarized as follows:

• We present a fine-grained weighted least squares loss for learning word embeddings. Unlike GloVe, it explicitly accounts for all negative samples and reweights them with a frequency-aware strategy.
• We propose an efficient and exact optimization algorithm based on full batch gradient optimization. It has a comparable time complexity with SGD, but being more effective and stable due to the consideration of all samples in each parameter update.
• We perform extensive experiments on several benchmark datasets and tasks to demonstrate the effectiveness, efficiency, and convergence property of our AllVec method.
Related Work

Skip-gram with Negative Sampling

Mikolov et al. (2013a,b) proposed the skip-gram model to learn word embeddings. SG formulates the problem as a predictive task, aiming at predicting the proper context c for a target word w within a local window. To speed up training, it applies negative sampling (Mikolov et al., 2013b) to approximate the full softmax. That is, each positive (w, c) pair is trained with n randomly sampled negative pairs (w, w_i). The sampled loss function of SG is defined as

L_{SG} = -\sum_{(w,c) \in S} \Big( \log \sigma(U_w^T \tilde{U}_c) + \sum_{i=1}^{n} \mathbb{E}_{w_i \sim P_n(w)} \log \sigma(-U_w^T \tilde{U}_{w_i}) \Big)

where U_w and \tilde{U}_c denote the k-dimensional embedding vectors for word w and context c, \sigma is the sigmoid function, and P_n(w) is the distribution from which the negative context w_i is sampled. Plenty of research has been done based on SG, such as the use of prior knowledge from another source (Kumar and Araki, 2016; Liu et al., 2015a; Bollegala et al., 2016), incorporating word type information (Cao and Lu, 2017; Niu et al., 2017), character-level n-gram models, and joint learning with topic models like LDA (Shi et al., 2017; Liu et al., 2015b).

Mikolov et al. (2013b) showed that using the unigram distribution raised to the 3/4th power as P_n(w) significantly outperformed both the unigram and the uniform distribution. This suggests that the sampling distribution of negative words has a great impact on embedding quality. Furthermore, Chen et al. (2018) and Guo et al. (2018) recently found that replacing the original sampler with adaptive samplers yields better performance. Adaptive samplers are used to find more informative negative examples during training: compared with the original word-frequency based sampler, they adapt to both the target word and the current state of the model. They also showed that such fine-grained samplers not only sped up convergence but also significantly improved embedding quality. Similar observations have been made in other fields such as collaborative filtering (Yuan et al., 2016).
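For reference, the sampled SG loss for a single positive pair can be sketched in NumPy as follows (a minimal illustration with our own function names; as in practice, the expectation is replaced by concretely sampled negatives):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_pair_loss(U_w, U_c, U_neg):
    """-log sigma(U_w . U_c) - sum_i log sigma(-U_w . U_neg[i])
    for one positive pair and a matrix of n sampled negative contexts."""
    pos = np.log(sigmoid(U_w @ U_c))
    neg = np.sum(np.log(sigmoid(-(U_neg @ U_w))))
    return -(pos + neg)

def smoothed_unigram(freqs, beta=0.75):
    """P_n(w): unigram frequencies raised to a power beta, renormalized."""
    p = np.asarray(freqs, dtype=float) ** beta
    return p / p.sum()
```

With beta = 0 this reduces to the uniform distribution and with beta = 1 to the raw frequency distribution, matching the β axis of Figure 1; beta = 0.75 is the setting reported by Mikolov et al. (2013b).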
While effective, negative sampling is provably a biased approximation that does not converge to the same loss as the full softmax, regardless of how many update steps are taken (Bengio and Senécal, 2008; Blanc and Rendle, 2017).

Count-based Embedding Methods
Another line of research is count-based embedding, such as GloVe (Pennington et al., 2014). GloVe performs a biased MF on word-context co-occurrence statistics, a common approach in the field of collaborative filtering (Koren, 2008). However, GloVe formulates its loss only on the positive entries of the co-occurrence matrix, meaning that negative signals about word-context co-occurrence are discarded. A remedy is LexVec (Salle et al., 2016a,b), which integrates negative sampling into MF. Some other methods (Li et al., 2015; Stratos et al., 2015; Ailem et al., 2017) also use MF to approximate word-context co-occurrence statistics. Although predictive models and count-based models seem different at first glance, Levy and Goldberg (2014) proved that SG with negative sampling implicitly factorizes a shifted pointwise mutual information (PMI) matrix, which means that the two families of embedding models resemble each other to a certain degree.
Our proposed method departs from all the above methods by using a full batch gradient optimizer to learn from all (positive and negative) samples. We propose a fast learning algorithm to show that such batch learning is not "heavy" even with tens of billions of training examples.

AllVec Loss
In this work, we adopt the regression loss commonly used in count-based models (Pennington et al., 2014; Stratos et al., 2015; Ailem et al., 2017) to perform matrix factorization on word co-occurrence statistics. As highlighted, to retain modeling fidelity, AllVec eschews any sampling and instead optimizes the loss on all positive and negative word-context pairs.

Given a word w and a symmetric window of win contexts, the set of positive contexts can be obtained by sliding through the corpus. Let c denote a specific context and M_wc the number of co-occurrences of the (w, c) pair in the corpus within the window. M_wc = 0 means that the pair (w, c) has never been observed, i.e. a negative signal. r_wc is the association coefficient between w and c, calculated from M_wc. Specifically, we use r^+_wc to denote the ground-truth value for positive (w, c) pairs, and a constant value r^- (e.g., 0 or -1) for negative ones, since there is no interaction between w and c in negative pairs. Finally, with all positive and negative pairs considered, a regular loss function can be given as

L = \sum_{(w,c) \in S} \alpha^+_{wc} (U_w^T \tilde{U}_c - r^+_{wc})^2 + \sum_{(w,c) \in (V \times V) \setminus S} \alpha^-_{wc} (U_w^T \tilde{U}_c - r^-)^2    (1)

where V is the vocabulary and S is the set of positive pairs. α^+_wc and α^-_wc represent the weights for positive and negative (w, c) pairs, respectively.
When it comes to r^+_wc, there are several choices. For example, GloVe applies the log of M_wc with bias terms for w and c. However, Levy and Goldberg (2014) showed that the SG model with negative sampling implicitly factorizes a shifted PMI matrix. The PMI value for a (w, c) pair is defined as

PMI_{wc} = \log \frac{M_{wc} \cdot M_{**}}{M_{w*} \cdot M_{*c}}    (2)

where '*' denotes summation over all corresponding indexes (e.g., M_{w*} = \sum_{c \in V} M_{wc}). Inspired by this connection, we set r^+_wc to the positive pointwise mutual information (PPMI), which is commonly used in the NLP literature (Stratos et al., 2015; Levy and Goldberg, 2014). Specifically, PPMI is the positive version of PMI, obtained by setting negative values to zero. Finally, r^+_wc is defined as

r^+_{wc} = PPMI_{wc} = \max(PMI_{wc}, 0)    (3)
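A direct way to build these targets from a dense co-occurrence count matrix is the following sketch (illustrative only; a real vocabulary would need sparse storage):

```python
import numpy as np

def ppmi(M):
    """PPMI from a co-occurrence count matrix M (words x contexts).

    PMI_wc = log( M_wc * M_** / (M_w* * M_*c) ), clipped at zero;
    unobserved entries (M_wc = 0) are left at zero.
    """
    M = np.asarray(M, dtype=float)
    total = M.sum()                        # M_**
    row = M.sum(axis=1, keepdims=True)     # M_w*
    col = M.sum(axis=0, keepdims=True)     # M_*c
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(M * total / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0           # zero counts -> -inf -> 0
    return np.maximum(pmi, 0.0)            # PPMI = max(PMI, 0)
```

For a matrix where each word co-occurs only with its own context, e.g. [[10, 0], [0, 10]], the diagonal PPMI is log 2 and the off-diagonal entries are zero.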

Weighting Strategies
Regarding α^+_wc, we follow the design of GloVe, where it is defined as

\alpha^+_{wc} = \begin{cases} (M_{wc}/x_{max})^{\rho} & \text{if } M_{wc} < x_{max} \\ 1 & \text{otherwise} \end{cases}    (4)

As for the weight of negative instances α^-_wc, considering that there is no interaction between w and a negative c, we set α^-_wc to α^-_c (or α^-_w), meaning that the weight is determined by the word itself rather than the word-context interaction. Note that choosing either α^-_wc = α^-_c or α^-_wc = α^-_w does not affect the complexity of the AllVec learning algorithm described in the next section.

The design of α^-_c is inspired by the frequency-based oversampling scheme in skip-gram and by missing data reweighting in recommendation (He et al., 2016). The intuition is that a word with high frequency is more likely to be a true negative context word if there is no observed word-context interaction. Hence, to effectively differentiate positive and negative examples, we assign a higher weight to negative examples with higher word frequency, and a smaller weight to infrequent words. Formally, α^-_wc is defined as

\alpha^-_{wc} = \alpha^-_c = \alpha_0 \frac{M_{*c}^{\delta}}{\sum_{c \in V} M_{*c}^{\delta}}    (5)

where α_0 can be seen as a global weight controlling the overall importance of negative samples; α_0 = 0 means that no negative information is used in training. The exponent δ smooths the weights. In particular, δ = 0 gives a uniform weight to all negative examples, and δ = 1 means that no smoothing is applied.
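The two weighting schemes can be sketched as follows. Note that the exact normalization of α^-_c over context frequencies is our reading of the scheme described above and should be treated as an assumption:

```python
import numpy as np

def alpha_plus(M_wc, x_max=100.0, rho=0.75):
    """GloVe-style truncated power weight for a positive (w, c) pair."""
    return min((M_wc / x_max) ** rho, 1.0)

def alpha_minus(context_freqs, alpha0=100.0, delta=0.75):
    """Frequency-aware negative weights, one per context word:
    alpha0 * f_c^delta / sum_c f_c^delta (normalization is our assumption)."""
    f = np.asarray(context_freqs, dtype=float) ** delta
    return alpha0 * f / f.sum()
```

With delta = 0, `alpha_minus` returns a uniform weight for every context; larger delta shifts weight toward frequent context words, which are more likely to be true negatives when unobserved.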

Fast Batch Gradient Optimization
Having specified the loss function, the main challenge is how to perform efficient optimization of Eq. (1). In the following, we develop a fast batch gradient optimization algorithm based on a partition reformulation of the loss and a decouple operation on the inner products.

Loss Partition
As can be seen, the major computational cost in Eq. (1) lies in the term L_N over negative samples, because the size of (V × V) \ S is huge, typically containing billions of negative examples. To this end, we present our first key design, which separates the loss over negative samples into the difference between the loss on all samples and that on positive samples:

L_N = \sum_{(w,c) \in V \times V} \alpha^-_c (U_w^T \tilde{U}_c - r^-)^2 - \sum_{(w,c) \in S} \alpha^-_c (U_w^T \tilde{U}_c - r^-)^2    (6)

This loss partition serves as the prerequisite for the efficient computation of full batch gradients.
By replacing L_N in Eq. (1) with Eq. (6), we obtain a new loss function with a clearer structure. We further simplify it by merging the terms on positive examples, arriving at the reformulated loss

L = \underbrace{\sum_{(w,c) \in V \times V} \alpha^-_c (U_w^T \tilde{U}_c - r^-)^2}_{L_A} + \underbrace{\sum_{(w,c) \in S} \big[ \alpha^+_{wc} (U_w^T \tilde{U}_c - r^+_{wc})^2 - \alpha^-_c (U_w^T \tilde{U}_c - r^-)^2 \big]}_{L_P}    (7)

The new loss function consists of two components: the loss L_A on the whole V × V set of training examples, and L_P on positive examples only. The major computation now lies in L_A, which has a time complexity of O(|V|^2 k). In the following, we show how to reduce this huge volume of computation with a simple mathematical decouple.
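The loss partition relies only on the negative weight being defined for every pair, which holds because α^-_c depends on c alone. A quick numeric check of the identity (toy data, our own names):

```python
import numpy as np

# Numeric check of the loss partition: the sum over negative pairs
# equals the sum over all V x V pairs minus the sum over positive
# pairs, so the negative set never has to be enumerated directly.
rng = np.random.default_rng(1)
V = 30
term = rng.random((V, V))            # per-pair loss terms (illustrative)
pos_mask = rng.random((V, V)) < 0.1  # sparse positive set S

L_N_direct = term[~pos_mask].sum()                # enumerate negatives
L_N_partitioned = term.sum() - term[pos_mask].sum()
assert np.isclose(L_N_direct, L_N_partitioned)
```

The identity holds for any per-pair term, but it only pays off computationally when the all-pairs sum itself can be made cheap, which is what the decouple step below achieves.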

Decouple
To clearly show the decouple operation, we expand the square in L_A and omit the constant term α^-_c (r^-)^2, yielding

\tilde{L}_A = \sum_{w \in V} \sum_{c \in V} \alpha^-_c \Big( \sum_{d=1}^{k} u_{wd} \tilde{u}_{cd} \Big)^2 - 2 r^- \sum_{w \in V} \sum_{c \in V} \alpha^-_c \sum_{d=1}^{k} u_{wd} \tilde{u}_{cd}    (8)

where u_wd and ũ_cd denote the d-th elements of U_w and \tilde{U}_c, respectively.
Now we present our second key design, based on a decouple manipulation of the inner product. The summation operators and the elements of U_w and \tilde{U}_c can be rearranged by the commutative property (Dai et al., 2007):

\tilde{L}_A = \sum_{d=1}^{k} \sum_{d'=1}^{k} \Big[ \Big( \sum_{w \in V} u_{wd} u_{wd'} \Big) \Big( \sum_{c \in V} \alpha^-_c \tilde{u}_{cd} \tilde{u}_{cd'} \Big) \Big] - 2 r^- \sum_{d=1}^{k} \Big( \sum_{w \in V} u_{wd} \Big) \Big( \sum_{c \in V} \alpha^-_c \tilde{u}_{cd} \Big)    (9)

An important feature of Eq. (9) is that the original inner product terms disappear, while \sum_{c \in V} \alpha^-_c \tilde{u}_{cd} \tilde{u}_{cd'} and \sum_{c \in V} \alpha^-_c \tilde{u}_{cd} are "constant" with respect to u_{wd} u_{wd'} and u_{wd}, respectively. This means that they can be pre-calculated before each training iteration. Specifically, we define the cached terms

p^w_{dd'} = \sum_{w \in V} u_{wd} u_{wd'},\quad p^c_{dd'} = \sum_{c \in V} \alpha^-_c \tilde{u}_{cd} \tilde{u}_{cd'},\quad q^w_d = \sum_{w \in V} u_{wd},\quad q^c_d = \sum_{c \in V} \alpha^-_c \tilde{u}_{cd}

Then the computation of \tilde{L}_A simplifies to

\tilde{L}_A = \sum_{d=1}^{k} \sum_{d'=1}^{k} p^w_{dd'} p^c_{dd'} - 2 r^- \sum_{d=1}^{k} q^w_d q^c_d

The time complexity of computing all p^w_{dd'} is O(|V|k^2); similarly, it is O(|V|k^2) for p^c_{dd'} and O(|V|k) for q^w_d and q^c_d. With all terms pre-calculated before each iteration, computing \tilde{L}_A itself takes only O(k^2). As a result, the total time complexity of computing L_A decreases to O(2|V|k^2 + 2|V|k + k^2) ≈ O(2|V|k^2), which is much smaller than the original O(|V|^2 k). Moreover, it is worth noting that our efficient computation of \tilde{L}_A is strictly equal to its original value; AllVec introduces no approximation in evaluating the loss function.
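The naive O(|V|^2 k) evaluation and the cached O(|V|k^2) form can be verified to agree exactly on toy data (variable names are ours; in matrix form the cached terms are two k × k Gram matrices):

```python
import numpy as np

# Decoupling trick: the all-pair sum
#   sum_w sum_c alpha_c * ((U_w . U~_c)^2 - 2 r_neg * U_w . U~_c)
# collapses to O(|V| k^2) via cached k x k Gram matrices.
rng = np.random.default_rng(2)
V, k, r_neg = 40, 6, -1.0
U = rng.normal(size=(V, k))
U_tilde = rng.normal(size=(V, k))
alpha = rng.random(V)                         # alpha_c, one weight per context

# Naive O(|V|^2 k) evaluation over all V x V inner products.
scores = U @ U_tilde.T                        # scores[w, c] = U_w . U~_c
naive = np.sum(alpha * (scores**2 - 2 * r_neg * scores))

# Cached O(|V| k^2) evaluation.
P_w = U.T @ U                                 # p^w_{dd'} = sum_w u_wd u_wd'
P_c = U_tilde.T @ (alpha[:, None] * U_tilde)  # p^c_{dd'} with alpha_c folded in
q_w = U.sum(axis=0)                           # q^w_d
q_c = (alpha[:, None] * U_tilde).sum(axis=0)  # q^c_d
fast = np.sum(P_w * P_c) - 2 * r_neg * (q_w @ q_c)
assert np.isclose(naive, fast)                # exact, no approximation
```

The equality is exact, mirroring the claim above that the reformulation introduces no approximation; only the order of summation changes.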
Finally, the batch gradients for u_wd and ũ_cd are

\frac{\partial L}{\partial u_{wd}} = 2 \sum_{d'=1}^{k} p^c_{dd'} u_{wd'} - 2 r^- q^c_d + 2 \sum_{c \in I^+_w} \big[ \alpha^+_{wc} (U_w^T \tilde{U}_c - r^+_{wc}) - \alpha^-_c (U_w^T \tilde{U}_c - r^-) \big] \tilde{u}_{cd}

\frac{\partial L}{\partial \tilde{u}_{cd}} = 2 \alpha^-_c \sum_{d'=1}^{k} p^w_{dd'} \tilde{u}_{cd'} - 2 \alpha^-_c r^- q^w_d + 2 \sum_{w \in I^+_c} \big[ \alpha^+_{wc} (U_w^T \tilde{U}_c - r^+_{wc}) - \alpha^-_c (U_w^T \tilde{U}_c - r^-) \big] u_{wd}

where I^+_w denotes the set of positive contexts for word w, and I^+_c denotes the set of positive words for context c.

Time Complexity Analysis
In the following, we show that AllVec achieves the same time complexity as negative sampling based SGD methods. Given the sample size n, the total time complexity of SG is O((n+1)|S|k), where n + 1 accounts for the n negative samples and 1 positive example per update. Regarding AllVec, the overall complexity of Algorithm 1 is O(4|S|k + 4|V|k^2).
For ease of discussion, we denote by c the average number of positive contexts for a word in the training corpus, i.e. |S| = c|V| (c ≥ 1000 in most cases). We then obtain the ratio

\frac{4|S|k + 4|V|k^2}{(n+1)|S|k} = \frac{4}{n+1} \Big( 1 + \frac{k}{c} \Big)

where k is typically set between 100 and 300 (Mikolov et al., 2013a; Pennington et al., 2014), so that k < c and 0 < k/c < 1. Hence, we can bound the ratio as

\frac{4}{n+1} < \frac{4|S|k + 4|V|k^2}{(n+1)|S|k} < \frac{8}{n+1}

The above analysis suggests that the complexity of AllVec is the same as that of SGD with a negative sample size between 3 and 7. In fact, considering that c is much larger than k in most datasets, the major cost of AllVec comes from the 4|S|k part (see Section 5.4 for details), which is linear in the number of positive samples.
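Plugging representative numbers into the ratio makes the bound concrete (the values of k and c below are illustrative):

```python
def cost_ratio(n, k, c):
    """(4|S|k + 4|V|k^2) / ((n+1)|S|k), using |S| = c * |V|.

    Ratio of AllVec's per-iteration cost to that of SG with n negatives.
    """
    return 4.0 / (n + 1) * (1.0 + k / c)

# With k = 200 and c = 2000 positive contexts per word, AllVec is
# slightly cheaper per iteration than SG with n = 5 negatives.
r = cost_ratio(n=5, k=200, c=2000)
print(f"cost ratio vs SG-5: {r:.3f}")
```

Since k/c is small, the ratio stays close to 4/(n+1), confirming that the 4|S|k term dominates.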
Word analogy task. The task aims to answer questions like "a is to b as c is to ?". We adopt the Google testbed, which contains 19,544 such questions in two categories: semantic and syntactic. The semantic questions are usually analogies about people or locations, like "king is to man as queen is to ?", while the syntactic questions focus on word forms or tenses, e.g., "swimming is to swim as running is to ?".
QVEC. QVEC is an intrinsic evaluation metric of word embeddings based on the alignment to features extracted from manually crafted lexical resources. QVEC has shown strong correlation with the performance of embeddings in several semantic tasks (Tsvetkov et al., 2015).
We compare AllVec with the following word embedding methods.
• SG: The original skip-gram model with SGD and negative sampling (Mikolov et al., 2013a,b).

• SGA: The skip-gram model with an adaptive sampler (Chen et al., 2018).

• GloVe: This method applies biased MF on the positive samples of the word co-occurrence matrix (Pennington et al., 2014).

• LexVec: This method applies MF on the PPMI matrix. The optimization is done with negative sampling and mini-batch gradient descent (Salle et al., 2016b).
For all baselines, we use the original implementation released by the authors.

Datasets and Experimental Setup
We evaluate the performance of AllVec on four real-world corpora, namely Text8, NewsIR, Wiki-sub and Wiki-all. Wiki-sub is a subset of the 2017 Wikipedia dump. All corpora have been pre-processed by a standard pipeline (i.e. removing non-textual elements, lowercasing and tokenization). Table 1 summarizes the statistics of these corpora.
To obtain M_wc for positive (w, c) pairs, we follow GloVe: word pairs that are x words apart contribute 1/x to M_wc. The window size is set to win = 8. Regarding α^+_wc, we set x_max = 100 and ρ = 0.75. For a fair comparison, the embedding size k is set to 200 for all models and corpora. AllVec can be easily trained with AdaGrad (Zeiler, 2012), as in GloVe, or with Newton-like second-order methods (Bayer et al., 2017; Bradley et al., 2011). For models based on negative sampling (i.e. SG, SGA and LexVec), the sample size is set to n = 25 for Text8, n = 10 for NewsIR and n = 5 for Wiki-sub and Wiki-all, as suggested by Mikolov et al. (2013b). Other hyper-parameters are reported in Table 2.
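The distance-weighted counting can be sketched as follows (a simplified in-memory version with our own function name; a real corpus would be streamed and the counts sharded):

```python
from collections import defaultdict

def cooccurrence_counts(tokens, win=8):
    """GloVe-style weighted co-occurrence counts M_wc: within a
    symmetric window of win words, a pair x words apart adds 1/x."""
    M = defaultdict(float)
    for i, w in enumerate(tokens):
        for x in range(1, win + 1):
            if i + x >= len(tokens):
                break
            c = tokens[i + x]
            M[(w, c)] += 1.0 / x   # symmetric window: credit both directions
            M[(c, w)] += 1.0 / x
    return M
```

For example, on the token sequence ["a", "b", "c"] with win = 2, adjacent pairs receive weight 1 and the pair ("a", "c"), two words apart, receives weight 0.5 in each direction.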

Accuracy Comparison
We present results on the word analogy task in Table 2. As shown, AllVec achieves the highest total accuracy (Tot.) on all corpora, particularly on the smaller ones (Text8 and NewsIR). The reason is that in smaller corpora the number of positive (w, c) pairs is very limited, so making use of negative examples brings more benefit. A similar reason explains the poor accuracy of GloVe on Text8, since GloVe does not consider negative samples. Even on the very large corpus (Wiki-all), ignoring negative samples still results in sub-optimal performance.
Our results also show that SGA achieves better performance than SG, which demonstrates the importance of a good sampling strategy. However, regardless of what sampler is used (short of the full softmax) and how many updates are taken, sampling remains a biased approach. AllVec achieves the best performance because each parameter update is computed on the whole batch of data rather than on a fraction of sampled data.
Another interesting observation is that AllVec generally performs better on semantic tasks. The reason is that our model utilizes global co-occurrence statistics, which capture more semantic than syntactic signals. While both AllVec and GloVe use global contexts, AllVec performs much better than GloVe on syntactic tasks. We argue that the main reason is that AllVec can distill useful signals from negative examples, while GloVe simply ignores all negative information. By contrast, local-window based methods, such as SG and SGA, are more effective at capturing local sentence features, resulting in good performance on syntactic analogies. However, Rekabsaz et al. (2017) argue that these local-window based methods may suffer from a topic shifting issue.

Table 3 and Table 4 provide results on the word similarity and QVEC tasks. We can see that AllVec achieves the best performance on most tasks, which confirms the advantage of batch learning from all samples. Interestingly, although GloVe performs well on semantic analogy tasks, it shows much worse results on word similarity and QVEC. The reason is likely the same as for its poor performance on syntactic tasks.

Impact of α^-_c
In this subsection, we investigate the impact of the proposed weighting scheme for negative (context) words. We show the performance change on word analogy tasks on NewsIR in Figure 2 by tuning α_0 and δ. Results on other corpora show similar trends and are omitted due to space limitations.

[Table 2 caption: The parameter columns (para.) for each model are given from left to right as follows. SG: subsampling of frequent words, window size and the number of negative samples; SGA: λ (Chen et al., 2018), which controls the distribution of the rank, with the other parameters the same as SG; GloVe: x_max, window size and symmetric window; LexVec: subsampling of frequent words and the number of negative samples; AllVec: the negative weight α_0 and δ. Boldface denotes the highest total accuracy.]
Figure 2(a) shows the impact of the overall weight α_0, with δ fixed at 0.75 (inspired by the setting of skip-gram). Clearly, all results (semantic, syntactic and total accuracy) improve greatly when α_0 increases from 0 to a larger value. As mentioned before, α_0 = 0 means that no negative information is considered, so this observation verifies that negative samples are very important for learning good embeddings. It also helps to explain why GloVe performs poorly on syntactic tasks. In addition, we find that on all corpora the optimal results are usually obtained when α_0 falls in the range of 50 to 400. For example, on NewsIR, AllVec achieves the best performance at α_0 = 100.

Figure 2(b) shows the impact of δ with α_0 = 100. As mentioned before, δ = 0 denotes a uniform weight for all negative words and δ = 1 means that no smoothing is applied to the word frequency. We can see that the total accuracy is only around 55% when δ = 0. Increasing δ gradually improves performance, with the highest score reached when δ is around 0.8; further increases degrade the total accuracy. This analysis demonstrates the effectiveness of the proposed negative weighting scheme.

As shown in Figure 3(a), AllVec exhibits a more stable convergence due to its full batch learning, whereas GloVe fluctuates more dramatically because of its one-sample learning scheme. Figure 3(b) shows the relationship between the embedding size k and the runtime on NewsIR. Although the analysis in Section 4.3 gives a time complexity of O(4|S|k + 4|V|k^2), the actual runtime is nearly linear in k. This is because 4|V|k^2 / 4|S|k = k/c, where c generally ranges from 1000 to 6000 while k is set between 200 and 300 in practice. This ratio explains why 4|S|k dominates the complexity, which is linear in both k and |S|.

Convergence Rate and Runtime
We also compare the overall runtime of AllVec and SG on NewsIR, with results shown in Table 5. As can be seen, the per-iteration runtime of AllVec falls between those of SG-3 and SG-7, which confirms the theoretical analysis in Section 4.3. Compared with SG, AllVec needs more iterations to converge, because each parameter in SG is updated many times during a single iteration, even though only one training example is used per update. Despite this, the total runtime of AllVec remains in a feasible range. Moreover, if convergence is measured by the number of parameter updates, AllVec converges much faster than the one-sample SG method.
In practice, the runtime of our model in each iteration can be further reduced by increasing the number of parallel workers. Although baseline methods like SG and GloVe can also be parallelized, their stochastic gradient steps may interfere with each other, as there is no exact way to separate the updates across workers. In other words, the parallelization of SGD does not scale well to a large number of workers. In contrast, the parameter updates in AllVec are completely independent of each other, so AllVec has no update collision issue. This means we can achieve embarrassingly parallel training by simply partitioning the updates by words, i.e. letting different workers update the model parameters for disjoint sets of words. As such, AllVec provides near-linear scaling without any approximation, since there are no potential conflicts between updates.
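The collision-free scheme can be sketched as a simple word partition (illustrative; a real implementation would hand each shard to a worker thread or process):

```python
def partition_words(vocab_size, num_workers):
    """Assign word ids to workers round-robin; the resulting shards are
    pairwise disjoint, so batch-gradient updates never collide."""
    return [list(range(w, vocab_size, num_workers))
            for w in range(num_workers)]

shards = partition_words(vocab_size=10, num_workers=3)
```

Because the batch gradient for each word row depends only on the cached terms and that word's positive pairs, workers owning disjoint shards never write to the same parameters, which is what makes the scaling near-linear.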

Conclusion
In this paper, we presented AllVec, an efficient batch learning based word embedding model that is capable of leveraging all positive and negative training examples without any sampling or approximation. In contrast with models based on SGD and negative sampling, AllVec shows more stable convergence and better embedding quality thanks to its all-sample optimization. Moreover, both theoretical analysis and experiments demonstrate that AllVec achieves the same time complexity as classic SGD models. In the future, we will extend the proposed all-sample learning scheme to deep learning methods, which are more expressive than the shallow embedding model. We will also integrate prior knowledge, such as synonym and antonym relations, into the word embedding process. Lastly, we are interested in exploring recent adversarial learning techniques to enhance the robustness of word embeddings.