Incremental Skip-gram Model with Negative Sampling

This paper explores an incremental training strategy for the skip-gram model with negative sampling (SGNS) from both empirical and theoretical perspectives. Existing methods for learning neural word embeddings, including SGNS, are multi-pass algorithms and thus cannot update the model incrementally. To address this problem, we present a simple incremental extension of SGNS and provide a thorough theoretical analysis to demonstrate its validity. Experiments confirm the correctness of the theoretical analysis as well as the practical usefulness of the incremental algorithm.


Introduction
Existing methods of neural word embeddings are typically designed to go through the entire training data multiple times. For example, negative sampling (Mikolov et al., 2013b) needs to precompute the noise distribution from the entire training data before performing Stochastic Gradient Descent (SGD). It thus needs to go through the training data at least twice. Similarly, hierarchical soft-max (Mikolov et al., 2013b) has to determine the tree structure and GloVe (Pennington et al., 2014) has to count co-occurrence frequencies before performing SGD.
The fact that these existing methods are multi-pass algorithms means that they cannot update the model incrementally when additional training data is provided. Instead, they have to re-train the model from scratch on the combination of the old and new training data.
However, re-training is inefficient because it has to process all the training data received thus far whenever new training data is provided. This is especially problematic when the new training data is much smaller than the old data. One such situation is that the embedding model is updated on a small amount of training data that includes newly emerged words, so that they can be added to the vocabulary set immediately. Another situation is that the word embeddings are learned from ever-evolving data such as news articles and microblogs (Peng et al., 2017) and the embedding model is periodically updated on newly generated data (e.g., once a week or once a month).
This paper investigates an incremental training method for word embeddings, focusing on the skip-gram model with negative sampling (SGNS) (Mikolov et al., 2013b) because of its popularity. We present a simple incremental extension of SGNS, referred to as incremental SGNS, and provide a thorough theoretical analysis to demonstrate its validity. Our analysis reveals that, under a mild assumption, the optimal solution of incremental SGNS agrees with that of the original SGNS when the training data size is infinitely large. See Section 4 for the formal statement. Additionally, we present techniques for the efficient implementation of incremental SGNS.
Three experiments were conducted to assess the correctness of the theoretical analysis as well as the practical usefulness of incremental SGNS. The first experiment empirically investigates the validity of the theoretical analysis result. The second experiment compares the word embeddings learned by incremental SGNS and the original SGNS across five benchmark datasets, and demonstrates that those word embeddings are of comparable quality. The last experiment explores the training time of incremental SGNS, demonstrating that it is able to save much training time by avoiding expensive re-training when additional training data is provided.

SGNS Overview
As a preliminary, this section provides a brief overview of SGNS.
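Given a word sequence $w_1, w_2, \ldots, w_n$ for training, the skip-gram model seeks to minimize the following objective to learn word embeddings:
$$L_{\mathrm{SG}} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{\substack{|j| \le c \\ j \ne 0}} \log p(w_{i+j} \mid w_i), \tag{1}$$
where $w_i$ is a target word and $w_{i+j}$ is a context word within a window of size $c$. $p(w_{i+j} \mid w_i)$ represents the probability that $w_{i+j}$ appears in the neighborhood of $w_i$, and is defined as
$$p(w_{i+j} \mid w_i) = \frac{\exp(\mathbf{t}_{w_i} \cdot \mathbf{c}_{w_{i+j}})}{\sum_{w \in \mathcal{W}} \exp(\mathbf{t}_{w_i} \cdot \mathbf{c}_{w})},$$
where $\mathbf{t}_w$ and $\mathbf{c}_w$ are $w$'s embeddings when it behaves as a target and context, respectively. $\mathcal{W}$ represents the vocabulary set.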
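Since it is too expensive to optimize the above objective (the normalization of the probability sums over the entire vocabulary), Mikolov et al. (2013b) proposed negative sampling to speed up skip-gram training. This approximates Eq. (1) using sigmoid functions and $k$ randomly sampled words, called negative samples. The resulting objective is given as
$$L_{\mathrm{SGNS}} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{\substack{|j| \le c \\ j \ne 0}} \Big( \log \sigma(\mathbf{t}_{w_i} \cdot \mathbf{c}_{w_{i+j}}) + k\, \mathbb{E}_{v \sim q(v)} \big[ \log \sigma(-\mathbf{t}_{w_i} \cdot \mathbf{c}_{v}) \big] \Big),$$
where $\sigma(x)$ is the sigmoid function. The negative sample $v$ is drawn from a smoothed unigram probability distribution referred to as the noise distribution:
$$q(v) = \frac{f(v)^{\alpha}}{\sum_{v' \in \mathcal{W}} f(v')^{\alpha}},$$
where $f(v)$ represents the frequency of a word $v$ in the training data and $\alpha$ is a smoothing parameter ($0 < \alpha \le 1$).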
The objective is optimized by SGD. Given a target-context word pair ($w_i$ and $w_{i+j}$) and $k$ negative samples ($v_1, v_2, \ldots, v_k$) drawn from the noise distribution, the gradient of
$$-\log \sigma(\mathbf{t}_{w_i} \cdot \mathbf{c}_{w_{i+j}}) - \sum_{l=1}^{k} \log \sigma(-\mathbf{t}_{w_i} \cdot \mathbf{c}_{v_l})$$
is computed. Then, gradient descent is performed to update $\mathbf{t}_{w_i}$, $\mathbf{c}_{w_{i+j}}$, and $\mathbf{c}_{v_1}, \ldots, \mathbf{c}_{v_k}$.
SGNS training needs to go over the entire training data to pre-compute the noise distribution $q(v)$ before performing SGD. This makes it difficult to update the model incrementally when additional training data is provided.
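The per-pair update described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names (`sgns_loss`, `sgns_step`), the dictionary-of-lists layout, the fixed learning rate, and the toy embedding values are all assumptions made for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sgns_loss(t, c, target, context, negatives):
    """-log s(t_w . c_ctx) - sum_l log s(-t_w . c_vl) for one training pair."""
    loss = -math.log(sigmoid(dot(t[target], c[context])))
    for v in negatives:
        loss -= math.log(sigmoid(-dot(t[target], c[v])))
    return loss


def sgns_step(t, c, target, context, negatives, lr=0.1):
    """One in-place gradient descent step on sgns_loss (hypothetical helper)."""
    t_w = t[target]
    grad_t = [0.0] * len(t_w)
    # Label 1 marks the observed context word, label 0 a negative sample.
    for word, label in [(context, 1.0)] + [(v, 0.0) for v in negatives]:
        c_v = c[word]
        g = sigmoid(dot(t_w, c_v)) - label      # d loss / d (t_w . c_v)
        for d in range(len(t_w)):
            grad_t[d] += g * c_v[d]             # accumulate gradient w.r.t. t_w
            c_v[d] -= lr * g * t_w[d]           # update the context embedding
    for d in range(len(t_w)):
        t_w[d] -= lr * grad_t[d]                # update the target embedding last


# Toy embeddings (hypothetical values, dimension 2).
t = {"cat": [0.1, -0.2]}
c = {"sat": [0.3, 0.1], "the": [-0.1, 0.2], "of": [0.2, 0.2]}
before = sgns_loss(t, c, "cat", "sat", ["the", "of"])
sgns_step(t, c, "cat", "sat", ["the", "of"])
after = sgns_loss(t, c, "cat", "sat", ["the", "of"])
```

The target embedding is updated only after all context embeddings, so every gradient in the step is evaluated at the same parameter values.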

Incremental SGNS
This section explores incremental training of SGNS. The incremental training algorithm (Section 3.1), its efficient implementation (Section 3.2), and the computational complexity (Section 3.3) are discussed in turn.

Algorithm
Algorithm 1 presents incremental SGNS, which goes through the training data in a single pass to update word embeddings incrementally. Unlike the original SGNS, it does not pre-compute the noise distribution. Instead, it reads the training data word by word to incrementally update the word frequency distribution and the noise distribution while performing SGD. Hereafter, the original SGNS (cf. Section 2) is referred to as batch SGNS to emphasize that the noise distribution is computed in a batch fashion.
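As a rough illustration of the single-pass idea, the sketch below keeps the word frequencies and the smoothed noise distribution up to date as each word is read, and draws negative samples from the current distribution. The class name `IncrementalNoise` and its methods are hypothetical, and sampling here takes O(|W|) time per call.

```python
import random
from collections import Counter


class IncrementalNoise:
    """Tracks word counts and the smoothed noise distribution over a stream."""

    def __init__(self, alpha=0.75, seed=0):
        self.alpha = alpha
        self.freq = Counter()            # counts over the words seen so far
        self.z = 0.0                     # normalizer: sum_w count(w)^alpha
        self.rng = random.Random(seed)

    def observe(self, word):
        """Read one more word and update the count and the normalizer."""
        old = self.freq[word]
        self.freq[word] = old + 1
        self.z += (old + 1) ** self.alpha - old ** self.alpha

    def prob(self, word):
        """Current noise probability: count(word)^alpha / normalizer."""
        return self.freq[word] ** self.alpha / self.z

    def sample(self, k):
        """Draw k negative samples from the current noise distribution."""
        words = list(self.freq)
        weights = [self.freq[w] ** self.alpha for w in words]
        return self.rng.choices(words, weights=weights, k=k)


noise = IncrementalNoise(alpha=1.0)
for w in "the cat sat on the mat".split():
    noise.observe(w)                 # update the counts and the distribution first
    negatives = noise.sample(k=2)    # then sample negatives for the SGD step on w
```

In a full training loop, the SGD update for each target-context pair would follow the sampling step.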
The learning rate for SGD is adjusted by using AdaGrad (Duchi et al., 2011). Although a linearly decaying learning rate has been widely used for training batch SGNS (Mikolov, 2013), adaptive methods such as AdaGrad are more suitable for incremental training because the amount of training data is unknown in advance and can increase without bound.
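For reference, the standard AdaGrad rule keeps a running sum of squared gradients per dimension and divides the step by its square root, so no total data size is needed. The sketch below is a generic version; the function name, the base learning rate, and the `eps` constant are assumptions rather than settings taken from the paper.

```python
import math


def adagrad_update(param, grad, accum, lr=0.1, eps=1e-8):
    """In-place AdaGrad step with one accumulator per dimension."""
    for d in range(len(param)):
        accum[d] += grad[d] ** 2                              # squared-gradient sum
        param[d] -= lr * grad[d] / (math.sqrt(accum[d]) + eps)


param = [1.0, 1.0]
accum = [0.0, 0.0]
adagrad_update(param, [2.0, 0.5], accum)
```

On the first step every dimension moves by roughly `lr`, whatever the gradient magnitude; later steps shrink as the accumulators grow.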
It is straightforward to extend incremental SGNS to the mini-batch setting by reading a subset of the training data (a mini-batch), rather than a single word, at a time to update the noise distribution and perform SGD (Algorithm 2). Although this paper primarily focuses on incremental SGNS, the mini-batch algorithm is also important in practice because it is easier to multithread.
Alternatives to Algorithm 2 might be possible. Other possible approaches include computing the noise distribution separately on each subset of the training data, fixing the noise distribution after computing it from the first (possibly large) subset, and so on. We exclude such alternatives from our investigation because it is difficult to justify them theoretically.

Efficient implementation
Although incremental SGNS is conceptually simple, its implementation raises two practical issues: keeping the vocabulary bounded and generating negative samples efficiently.

Dynamic vocabulary
Since new words emerge endlessly in the training data, the vocabulary set can grow unboundedly and exhaust memory.
We address this problem by dynamically changing the vocabulary set. The Misra-Gries algorithm (Misra and Gries, 1982) is used to approximately keep track of the m most frequent words during training, and those words are used as the dynamic vocabulary set. This method allows the maximum vocabulary size to be explicitly limited to m, while still allowing the vocabulary set to change dynamically.
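The Misra-Gries idea can be sketched as follows. This is the textbook frequent-items summary with at most `m` counters, written as an assumption-laden illustration; how evicted words and their embeddings are handled during training is not covered here.

```python
def misra_gries(stream, m):
    """Approximate frequent-items summary that keeps at most m counters."""
    counters = {}
    for w in stream:
        if w in counters:
            counters[w] += 1
        elif len(counters) < m:
            counters[w] = 1
        else:
            # No free counter: decrement all of them and drop those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


top = misra_gries(["a"] * 5 + ["b", "c", "d"] + ["a"] * 2, m=2)
# -> {'a': 6, 'd': 1}
```

The stored counts are lower bounds on the true frequencies (here "a" occurs 7 times), which is the price paid for bounded memory.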

Adaptive unigram table
Another problem is how to generate negative samples efficiently. Since k negative samples per target-context pair have to be drawn from the noise distribution, the sampling speed has a significant effect on the overall training efficiency.
Let us first examine how negative samples are generated in batch SGNS. In a popular implementation (Mikolov, 2013), a word array (referred to as a unigram table) is constructed such that the number of occurrences of a word w in it is proportional to q(w). See Table 1 for an example. Using the unigram table, negative samples can be generated efficiently by sampling the table elements uniformly at random. It takes only O(1) time to generate one negative sample.
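A minimal sketch of a static unigram table follows. The function names, the rounding of slot counts, and the toy frequencies are assumptions for illustration and do not reproduce the word2vec code.

```python
import random


def build_unigram_table(freq, alpha=0.75, size=1000):
    """Fill about size * q(w) slots with each word w."""
    z = sum(f ** alpha for f in freq.values())
    table = []
    for w, f in freq.items():
        table.extend([w] * round(size * f ** alpha / z))
    return table


def draw_negative(table, rng=random):
    """O(1) sampling: pick one table slot uniformly at random."""
    return table[rng.randrange(len(table))]


# Hypothetical counts; with alpha = 1.0 the table mirrors the raw frequencies.
table = build_unigram_table({"a": 5, "b": 3, "c": 2}, alpha=1.0, size=10)
sample = draw_negative(table, random.Random(0))
```

Because of rounding, the table length can differ slightly from `size` for other inputs.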
The above method assumes that the noise distribution is fixed and thus cannot be used directly for incremental training. One simple solution is to reconstruct the unigram table whenever new training data is provided. However, such a method is not effective for incremental SGNS, because the unigram table reconstruction requires $O(|\mathcal{W}|)$ time.

Table 1: Example of a unigram table $T$ (surviving entries: a, a, a, a, b, b, b, c, c).

We propose a reservoir-based algorithm for efficiently updating the unigram table (Vitter, 1985; Efraimidis, 2015) (Algorithm 3). The algorithm incrementally updates the unigram table $T$ while limiting its maximum size to $\tau$. In case $|T| < \tau$, it can be easily confirmed that the number of occurrences of a word $w$ in $T$ is $f(w)^{\alpha}$ ($\propto q(w)$). In case $|T| = \tau$, since $z = \sum_{w \in \mathcal{W}} f(w)^{\alpha}$ is equal to the normalization factor of the noise distribution, it can be proven by induction that, for all $j$, $T[j]$ is a word $w$ with probability $q(w)$. See (Vitter, 1985; Efraimidis, 2015) for reference.

Algorithm 3 Adaptive unigram table.
 1: f(w) ← 0 for all w ∈ W
 2: z ← 0
 3: for i = 1, . . . , n do
 4:   f(w_i) ← f(w_i) + 1
 5:   F ← f(w_i)^α − (f(w_i) − 1)^α
 6:   z ← z + F
 7:   if |T| < τ then
 8:     add F copies of w_i to T
 9:   else
10:     for j = 1, . . . , τ do
11:       T[j] ← w_i with probability F/z
12:     end for
13:   end if
14: end for
Note on implementation. In line 8, $F$ copies of $w_i$ are added to $T$. When $F$ is not an integer, the copies are generated so that their expected number becomes $F$. Specifically, $\lceil F \rceil$ copies are added to $T$ with probability $F - \lfloor F \rfloor$, and $\lfloor F \rfloor$ copies are added otherwise.
The loop in lines 10 to 12 becomes expensive if implemented straightforwardly because the maximum table size $\tau$ is typically set large (e.g., $\tau = 10^8$ in word2vec (Mikolov, 2013)). For acceleration, instead of checking all elements in the unigram table, $\tau F / z$ randomly chosen elements are substituted with $w_i$. Note that $\tau F / z$ is the expected number of table elements to be substituted in the original algorithm. This approximation achieves great speed-up because we usually have $F \ll z$.
In fact, it can be proven that it takes O(1) time when α = 1.0. See Appendix A for more discussion.
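The table update and the two implementation notes can be put together in a short sketch. The class and method names are hypothetical. Two details are assumptions of this example rather than statements from the text: the fractional replacement count τF/z is also stochastically rounded, and copies added while filling the table are capped at the remaining room.

```python
import math
import random
from collections import Counter


class AdaptiveUnigramTable:
    def __init__(self, max_size, alpha=0.75, seed=0):
        self.max_size = max_size         # tau
        self.alpha = alpha
        self.freq = Counter()            # f(w)
        self.z = 0.0                     # sum_w f(w)^alpha
        self.table = []                  # T
        self.rng = random.Random(seed)

    def _stochastic_round(self, x):
        """Return ceil(x) with probability x - floor(x), otherwise floor(x)."""
        low = math.floor(x)
        return low + (1 if self.rng.random() < x - low else 0)

    def update(self, word):
        self.freq[word] += 1
        f = self.freq[word]
        F = f ** self.alpha - (f - 1) ** self.alpha
        self.z += F
        if len(self.table) < self.max_size:
            # Filling phase: add F copies in expectation (capped at the room left).
            copies = self._stochastic_round(F)
            room = self.max_size - len(self.table)
            self.table.extend([word] * min(copies, room))
        else:
            # Accelerated variant: overwrite about tau * F / z random slots.
            n_replace = self._stochastic_round(self.max_size * F / self.z)
            for _ in range(n_replace):
                self.table[self.rng.randrange(self.max_size)] = word

    def sample(self):
        """Draw one negative sample in O(1) time."""
        return self.table[self.rng.randrange(len(self.table))]


unigram = AdaptiveUnigramTable(max_size=100, alpha=1.0)
for w in ["a", "b", "a", "c"]:
    unigram.update(w)
```

With α = 1.0 every occurrence contributes exactly one copy while the table is filling, so the table simply mirrors the raw counts until it reaches its maximum size.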

Computational complexity
Both incremental and batch SGNS have the same space complexity, which is independent of the training data size n. Both require O(|W|) space to store the word embeddings and the word frequency counts, and O(|T|) space to store the unigram table.
The two algorithms also have the same time complexity. Both require O(n) training time when the training data size is n. Although incremental SGNS requires extra time for updating the dynamic vocabulary and adaptive unigram table, these costs are practically negligible, as will be demonstrated in Section 5.3.

Theoretical Analysis
Although the extension from batch to incremental SGNS is simple and intuitive, it is not readily clear whether incremental SGNS can learn word embeddings as well as the batch counterpart. To answer this question, in this section we examine incremental SGNS from a theoretical point of view.
The analysis begins by examining the difference between the objectives optimized by batch and incremental SGNS (Section 4.1). Then, probabilistic properties of their difference are investigated to demonstrate the relationship between batch and incremental SGNS (Sections 4.2 and 4.3). We briefly touch on mini-batch SGNS at the end of this section (Section 4.4).

Objective difference
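As discussed in Section 2, batch SGNS optimizes the following objective:
$$L_B(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{\substack{|j| \le c \\ j \ne 0}} \Big( \log \sigma(\mathbf{t}_{w_i} \cdot \mathbf{c}_{w_{i+j}}) + k\, \mathbb{E}_{v \sim q_n(v)} \big[ \log \sigma(-\mathbf{t}_{w_i} \cdot \mathbf{c}_{v}) \big] \Big),$$
where $\theta = (\mathbf{t}_1, \mathbf{t}_2, \ldots, \mathbf{t}_{|\mathcal{W}|}, \mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_{|\mathcal{W}|})$ collectively represents the model parameters (i.e., word embeddings) and $q_n(v)$ represents the noise distribution. Note that the noise distribution is represented in a different notation than Section 2 to make its dependence on the whole training data explicit. It is defined as
$$q_i(v) = \frac{f_i(v)^{\alpha}}{\sum_{v' \in \mathcal{W}} f_i(v')^{\alpha}},$$
where the function $f_i(v)$ represents the word frequency in the first $i$ words of the training data.
In contrast, incremental SGNS computes the gradient of
$$-\log \sigma(\mathbf{t}_{w_i} \cdot \mathbf{c}_{w_{i+j}}) - k\, \mathbb{E}_{v \sim q_i(v)} \big[ \log \sigma(-\mathbf{t}_{w_i} \cdot \mathbf{c}_{v}) \big]$$
at each step to perform gradient descent. Note that the noise distribution does not depend on $n$ but rather on $i$. Because it can be seen as a sample approximation of the gradient of
$$L_I(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{\substack{|j| \le c \\ j \ne 0}} \Big( \log \sigma(\mathbf{t}_{w_i} \cdot \mathbf{c}_{w_{i+j}}) + k\, \mathbb{E}_{v \sim q_i(v)} \big[ \log \sigma(-\mathbf{t}_{w_i} \cdot \mathbf{c}_{v}) \big] \Big),$$
incremental SGNS can be interpreted as optimizing $L_I(\theta)$ with SGD.
Since the expectation terms in the objectives can be rewritten as
$$\mathbb{E}_{v \sim q_i(v)} \big[ \log \sigma(-\mathbf{t}_{w_i} \cdot \mathbf{c}_{v}) \big] = \sum_{v \in \mathcal{W}} q_i(v) \log \sigma(-\mathbf{t}_{w_i} \cdot \mathbf{c}_{v}),$$
the difference between the two objectives can be given as
$$\Delta L(\theta) = L_B(\theta) - L_I(\theta) = -\frac{2ck}{n} \sum_{i=1}^{n} \sum_{w, v \in \mathcal{W}} \delta_{w_i, w} \big( q_n(v) - q_i(v) \big) \log \sigma(-\mathbf{t}_{w} \cdot \mathbf{c}_{v}),$$
where $\delta_{w,v}$ is the delta function.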

Unsmoothed case
Let us begin by examining the objective difference ∆L(θ) in the unsmoothed case, α = 1.0. The technical difficulty in analyzing ∆L(θ) is that it is dependent on the word order in the training data. To address this difficulty, we assume that the words in the training data are generated from some stationary distribution. This assumption allows us to investigate the property of ∆L(θ) from a probabilistic perspective. Regarding the validity of this assumption, we want to note that this assumption is already taken by the original SGNS: the probability that the target and context words co-occur is assumed to be independent of their position in the training data.
We introduce some definitions and notation in preparation for the analysis.
Definition 1. Let $X_{i,w}$ be a random variable that represents $\delta_{w_i,w}$. It takes 1 when the $i$-th word in the training data is $w \in \mathcal{W}$ and 0 otherwise.
Recall that we assume that the words in the training data are generated from a stationary distribution. This assumption means that the expectation and (co)variance of $X_{i,w}$ do not depend on the index $i$. Hereafter, they are respectively denoted as $\mu_w = \mathbb{E}[X_{i,w}]$ and $\rho_{w,v} = \mathrm{Cov}[X_{i,w}, X_{i,v}]$.
Definition 2. Let $Y_{i,w}$ be a random variable that represents $q_i(w)$ when $\alpha = 1.0$. It is given as
$$Y_{i,w} = \frac{1}{i} \sum_{i'=1}^{i} X_{i',w}.$$

Convergence of the first and second order moments of ∆L(θ)
It can be shown that the first order moment of ∆L(θ) has an analytical form.
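Theorem 1. The first order moment of ∆L(θ) is given as
$$\mathbb{E}[\Delta L(\theta)] = \frac{2ck(H_n - 1)}{n} \sum_{w, v \in \mathcal{W}} \rho_{w,v} \log \sigma(-\mathbf{t}_{w} \cdot \mathbf{c}_{v}),$$
where $H_n$ is the $n$-th harmonic number and $\rho_{w,v}$ denotes the covariance of $X_{i,w}$ and $X_{i,v}$.
Sketch of proof. Notice that $\mathbb{E}[\Delta L(\theta)]$ can be written as
$$\mathbb{E}[\Delta L(\theta)] = -\frac{2ck}{n} \sum_{i=1}^{n} \sum_{w, v \in \mathcal{W}} \big( \mathbb{E}[X_{i,w} Y_{n,v}] - \mathbb{E}[X_{i,w} Y_{i,v}] \big) \log \sigma(-\mathbf{t}_{w} \cdot \mathbf{c}_{v}).$$
Because we have, for any $i$ and $j$ such that $i \le j$,
$$\mathbb{E}[X_{i,w} Y_{j,v}] = \frac{1}{j} \sum_{j'=1}^{j} \mathbb{E}[X_{i,w} X_{j',v}] = \mu_w \mu_v + \frac{\rho_{w,v}}{j},$$
plugging this into $\mathbb{E}[\Delta L(\theta)]$ proves the theorem. See Appendix B.1 for the complete proof.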
Theorem 1 readily gives the convergence property of the first order moment of ∆L(θ):
Theorem 2. The first-order moment of ∆L(θ) decreases in the order of $O(\log(n)/n)$:
$$\mathbb{E}[\Delta L(\theta)] = O\!\left(\frac{\log(n)}{n}\right),$$
and thus converges to zero in the limit of infinity:
$$\lim_{n \to \infty} \mathbb{E}[\Delta L(\theta)] = 0.$$
Proof. We have $H_n = O(\log(n))$ from the upper integral bound, and thus Theorem 1 gives the proof.
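The integral bound used in the proof is $\ln(n) < H_n \le 1 + \ln(n)$, which follows from comparing the sum $\sum_{i=2}^{n} 1/i$ with $\int_1^n dx/x$. The short check below evaluates both sides numerically and shows how quickly $H_n/n$ shrinks; the function name and the chosen values of n are arbitrary.

```python
import math


def harmonic(n):
    """n-th harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))


rows = []
for n in (10, 100, 1000, 10000, 100000):
    h = harmonic(n)
    # Integral bounds: ln(n) < H_n <= 1 + ln(n).
    rows.append((n, h, h / n, math.log(n) / n))

for n, h, h_over_n, log_over_n in rows:
    print(f"n={n:>6}  H_n={h:.4f}  H_n/n={h_over_n:.2e}  log(n)/n={log_over_n:.2e}")
```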
A similar result to Theorem 2 can be obtained for the second order moment of ∆L(θ) as well.
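Theorem 3. The second-order moment of ∆L(θ) decreases in the order of $O(\log(n)/n)$:
$$\mathbb{E}[\Delta L(\theta)^2] = O\!\left(\frac{\log(n)}{n}\right),$$
and thus converges to zero in the limit of infinity:
$$\lim_{n \to \infty} \mathbb{E}[\Delta L(\theta)^2] = 0.$$
Proof. Omitted. See Appendix B.2.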

Main result
The above theorems reveal the relationship between the optimal solutions of the two objectives, as stated in the next lemma.
We are now ready to provide the main result of the analysis. The next theorem shows the convergence of L B (θ).
(2) means that for any $\epsilon_2 > 0$, there exists $n'$ such that if $n' \le n$ then $|\mathbb{E}[l]| < \epsilon_2$. Therefore, we have [...]. Since $\epsilon_1$ and $\epsilon_2$ are arbitrary, $\epsilon_1 + \epsilon_2$ can be rewritten as $\epsilon$. Also, Eq. (3) [...]
Informally, this theorem can be interpreted as suggesting that the optimal solutions of batch and incremental SGNS agree when $n$ is infinitely large.

Smoothed case
We next examine the smoothed case ($0 < \alpha < 1$). In this case, the noise distribution can be represented by using the ones in the unsmoothed case:
Definition 3. Let $Z_{i,w}$ be a random variable that represents $q_i(w)$ in the smoothed case. Then, it can be written by using $Y_{i,w}$:
$$Z_{i,w} = g_w(Y_{i,1}, Y_{i,2}, \ldots, Y_{i,|\mathcal{W}|}), \qquad g_w(x_1, x_2, \ldots, x_{|\mathcal{W}|}) = \frac{x_w^{\alpha}}{\sum_{w' \in \mathcal{W}} x_{w'}^{\alpha}}.$$
Because $Z_{i,w}$ is no longer a linear combination of $X_{i,w}$, it becomes difficult to derive similar proofs to the unsmoothed case. To address this difficulty, $Z_{i,w}$ is approximated by the first-order Taylor expansion around $\mathbb{E}[(Y_{i,1}, Y_{i,2}, \ldots, Y_{i,|\mathcal{W}|})] = (\mu_1, \mu_2, \ldots, \mu_{|\mathcal{W}|})$.
The first-order Taylor approximation gives
$$Z_{i,w} \approx g_w(\mu) + \sum_{v \in \mathcal{W}} M_{w,v} (Y_{i,v} - \mu_v),$$
where $\mu = (\mu_1, \mu_2, \ldots, \mu_{|\mathcal{W}|})$ and $M_{w,v} = \left. \frac{\partial g_w(x)}{\partial x_v} \right|_{x = \mu}$. Consequently, it can be shown that the first and second order moments of ∆L(θ) have the order of $O(\log(n)/n)$ in the smoothed case as well. See Appendix C for the details.
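To make the linearization concrete, the sketch below assumes that the smoothed noise distribution is the map $g_w(x) = x_w^{\alpha} / \sum_{u} x_u^{\alpha}$ applied to the unsmoothed probabilities, in which case its Jacobian is $M_{w,v} = \alpha x_v^{\alpha-1} (\delta_{w,v} - g_w(x)) / \sum_u x_u^{\alpha}$. The function names and the toy vectors `mu` and `y` are hypothetical.

```python
def g(x, alpha):
    """Smoothed distribution: g_w(x) = x_w^alpha / sum_u x_u^alpha."""
    powered = [xi ** alpha for xi in x]
    s = sum(powered)
    return [p / s for p in powered]


def jacobian(mu, alpha):
    """M[w][v] = d g_w / d x_v evaluated at x = mu."""
    s = sum(m ** alpha for m in mu)
    gm = g(mu, alpha)
    n = len(mu)
    return [[alpha * mu[v] ** (alpha - 1) / s * ((1.0 if w == v else 0.0) - gm[w])
             for v in range(n)] for w in range(n)]


def taylor(y, mu, alpha):
    """First-order approximation g(mu) + M (y - mu)."""
    gm = g(mu, alpha)
    M = jacobian(mu, alpha)
    n = len(mu)
    return [gm[w] + sum(M[w][v] * (y[v] - mu[v]) for v in range(n))
            for w in range(n)]


mu = [0.5, 0.3, 0.2]        # hypothetical unigram probabilities
y = [0.52, 0.29, 0.19]      # a nearby empirical distribution
exact = g(y, 0.75)
approx = taylor(y, mu, 0.75)
```

The approximation error is second order in the deviation of `y` from `mu`, so it shrinks as the empirical distribution concentrates around its mean.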

Mini-batch SGNS
The same analysis result can also be obtained for the mini-batch SGNS. We can prove Theorems 2 and 3 in the mini-batch case as well (see Appendix D for the proof). The other part of the analysis remains the same.

Experiments
Three experiments were conducted to investigate the correctness of the theoretical analysis (Section 5.1) and the practical usefulness of incremental SGNS (Sections 5.2 and 5.3). Details of the experimental settings that do not fit into the paper are presented in Appendix E.

Validation of theorems
An empirical experiment was conducted to validate the result of the theoretical analysis. Since it is difficult to assess the main result in Section 4.2.2 directly, the theorems in Section 4.2.1, from which the main result is readily derived, were investigated. Specifically, the first and second order moments of ∆L(θ) were computed on datasets of increasing sizes to empirically investigate the convergence property.
Datasets of various sizes were constructed from the English Gigaword corpus (Napoles et al., 2012). Each dataset of $n$ words was constructed by randomly sampling sentences from the Gigaword corpus. The value of $n$ was varied over $\{10^3, 10^4, 10^5, 10^6, 10^7\}$. For each size $n$, 10,000 different datasets were created to compute the first and second order moments.
Figure 1 (top left) illustrates the first order moments of ∆L(θ) for α = 1.0: the circles represent the theoretical values obtained by Theorem 1, and the other markers represent the empirical values. Figure 1 (top right) similarly illustrates the second order moments of ∆L(θ). Since Theorem 3 suggests that the second order moment decreases in the order of $O(\log(n)/n)$, the graph $y \propto \log(x)/x$ is also shown. The graph was fitted to the empirical data by minimizing the squared error.
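Fitting a curve of the form $y = a \log(x)/x$ by least squares has a closed-form solution, $a = \sum_i y_i g_i / \sum_i g_i^2$ with $g_i = \log(x_i)/x_i$. The sketch below assumes the squared error is measured in the original (not logarithmic) scale, which the text does not specify, and uses synthetic values rather than the measurements from the experiment.

```python
import math


def fit_log_over_x(xs, ys):
    """Least-squares scale a for the model y = a * log(x) / x."""
    gs = [math.log(x) / x for x in xs]
    return sum(y * gv for y, gv in zip(ys, gs)) / sum(gv * gv for gv in gs)


# Synthetic data for illustration only (not the paper's measurements).
xs = [10 ** p for p in range(3, 8)]
ys = [2.5 * math.log(x) / x for x in xs]
a = fit_log_over_x(xs, ys)
```

In the original scale the smallest data sizes dominate the fit because their values are largest; fitting in log space would weight all sizes more evenly.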
The top left figure demonstrates that the empirical values of the first order moments fit the theoretical result very well, providing strong empirical evidence for the correctness of Theorem 1. In addition, the two figures show that the first and second order moments decrease almost in the order of $O(\log(n)/n)$, converging to zero as the data size increases. This result validates Theorems 2 and 3.
Figure 1 (bottom left) and (bottom right) show similar results when α = 0.75. Since we do not have theoretical estimates of the first order moment when α ≠ 1.0, the graphs $y \propto \log(n)/n$ are shown in both figures. From these, we can again observe that the first and second order moments decrease almost in the order of $O(\log(n)/n)$. This indicates the validity of the investigation in Section 4.3. The relatively larger deviations from the graphs $y \propto \log(n)/n$, compared with the top right figure, are likely attributable to the first-order Taylor approximation.

Quality of word embeddings
The next experiment investigates the quality of the word embeddings learned by incremental SGNS through comparison with the batch counterparts.
The Gigaword corpus was used for the training.
For the comparison, both our own implementation of batch SGNS and WORD2VEC (Mikolov et al., 2013c) were used (denoted as batch and w2v, respectively). The training configurations of the three methods were made as similar as possible, although they cannot be matched perfectly. For example, incremental SGNS (denoted as incremental) utilized the dynamic vocabulary (cf. Section 3.2.1) and thus we set the maximum vocabulary size m to control the vocabulary size. On the other hand, we set a frequency threshold to determine the vocabulary size of w2v. We set m = 240k for incremental, while setting the frequency threshold to 100 for w2v. This yields vocabulary sets of comparable sizes: 220,389 and 246,134.
The learned word embeddings were assessed on five benchmark datasets commonly used in the literature (Levy et al., 2015): WordSim353 (Agirre et al., 2009), MEN (Bruni et al., 2013), SimLex-999 (Hill et al., 2015), the MSR analogy dataset (Mikolov et al., 2013c), and the Google analogy dataset (Mikolov et al., 2013a). The first three are for a semantic similarity task, and the remaining two are for a word analogy task. As evaluation measures, Spearman's ρ and prediction accuracy were used in the two tasks, respectively.
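For the similarity task, Spearman's ρ is the Pearson correlation between the ranks of the human ratings and the ranks of the model scores. A self-contained sketch is given below; using cosine similarity between embeddings as the model score is an assumption of this example, and the ratings and scores are made-up numbers.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm


def ranks(values):
    """1-based ranks, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return result


def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


human = [9.1, 7.4, 3.2, 1.0]         # hypothetical gold similarity ratings
model = [0.83, 0.80, 0.35, 0.41]     # hypothetical cosine similarities
rho = spearman(human, model)
```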
Figures 2 (a) and (b) show the results on the similarity datasets and the analogy datasets, respectively. We see that the three methods (incremental, batch, and w2v) perform equally well on all of the datasets. This indicates that incremental SGNS can learn word embeddings as good as those of its batch counterparts, while being able to update the model incrementally. Although incremental performs slightly better than the batch methods on some datasets, the difference seems to be a product of chance.
The figures also show the results of incremental SGNS when the maximum vocabulary size m was reduced to 150k and 100k (incremental-150k and incremental-100k). The resulting vocabulary sizes were 135,447 and 86,993, respectively. We see that incremental-150k and incremental-100k perform comparably to incremental, although relatively large performance drops are observed on some datasets (MEN and MSR). This demonstrates that the Misra-Gries algorithm can effectively control the vocabulary size.

Update time
The last experiment investigates how much time incremental SGNS can save by avoiding retraining when updating the word embeddings. In this experiment, incremental was first trained on the initial training data of size $n_1$ (measured in the number of sentences) and then updated on the new training data of size $n_2$ to measure the update time. For comparison, batch and w2v were re-trained on the combination of the initial and new training data. We fixed $n_1 = 10^7$ and varied $n_2$ over $\{1 \times 10^6, 2 \times 10^6, \ldots, 5 \times 10^6\}$. The experiment was conducted on an Intel® Xeon® 2 GHz CPU. The update time was averaged over five trials.
Figure 2 (c) compares the update time of the three methods across various values of $n_2$. We see that incremental significantly reduces the update time. It achieves 10 and 7.3 times speed-up compared with batch and w2v, respectively (when $n_2 = 10^6$). This demonstrates the advantage of the incremental algorithm, as well as the time efficiency of the dynamic vocabulary and adaptive unigram table. We note that batch is slower than w2v because it uses AdaGrad, which maintains different learning rates for different dimensions of the parameter, while w2v uses the same learning rate for all dimensions.

Related Work
Word representations based on distributional semantics have been common (Turney and Pantel, 2010; Baroni and Lenci, 2010). The distributional methods typically begin by constructing a word-context matrix and then applying dimension reduction techniques such as SVD to obtain high-quality word meaning representations. Although some studies investigated incremental updating of the word-context matrix (Yin et al., 2015; Goyal and Daumé III, 2011), they did not explore the reduced representations. On the other hand, neural word embeddings have recently gained much popularity as an alternative. However, most previous studies have not explored incremental strategies (Mikolov et al., 2013a,b; Pennington et al., 2014).
Peng et al. (2017) proposed an incremental learning method of hierarchical soft-max. Because hierarchical soft-max and negative sampling have different advantages (Peng et al., 2017), the incremental SGNS and their method are complementary to each other. Also, their updating method needs to scan not only new but also old training data, and thus is not an incremental algorithm in a strict sense. As a consequence, it potentially incurs the same time complexity as retraining. Another consequence is that their method has to retain the old training data and thus wastes space, while incremental SGNS can discard old training examples after processing them.
Footnote 5 (on the training data size in the update-time experiment): The number of sentences here.
Very recently, May et al. (2017) also proposed an incremental algorithm for SGNS. However, their work differs from ours in that their algorithm is not designed to use a smoothed noise distribution (i.e., the smoothing parameter α is assumed fixed as α = 1.0 in their method), which is key to learning high-quality word embeddings. Another difference is that they did not provide theoretical justification for their algorithm.
There are publicly available implementations for training SGNS, one of the most popular being WORD2VEC (Mikolov, 2013). However, it does not support an incremental training method. GENSIM (Řehůřek and Sojka, 2010) also offers SGNS training. Although GENSIM allows the incremental updating of SGNS models, it is done in an ad hoc manner. In GENSIM, the vocabulary set as well as the unigram table are fixed once trained, meaning that new words cannot be added. Also, no theoretical account of the validity of its training method is provided. Finally, we want to note that most of the existing implementations can be easily extended to support incremental (or mini-batch) SGNS by simply continuing to update the noise distribution.

Conclusion and Future Work
This paper proposed incremental SGNS and provided a thorough theoretical analysis to demonstrate its validity. We also conducted experiments to empirically demonstrate its effectiveness. Although incremental model updates are often required in practical machine learning applications, little attention has been paid to learning word embeddings incrementally. We consider that incremental SGNS successfully addresses this situation and serves as a useful tool for practitioners.
The success of this work suggests several research directions to be explored in the future. One possibility is to explore extending other embedding methods such as GloVe (Pennington et al., 2014) to incremental algorithms. Such studies would further extend the potential of word embedding methods.