Why ADAGRAD Fails for Online Topic Modeling

Online topic modeling, i.e., topic modeling with stochastic variational inference, is a powerful and efficient technique for analyzing large datasets, and ADAGRAD is a widely used technique for tuning learning rates during online gradient optimization. However, these two techniques do not work well together. We show that this is because ADAGRAD uses the accumulation of previous gradients as the learning rates' denominators. For online topic modeling, the magnitude of gradients is very large, which causes learning rates to shrink very quickly, so the parameters cannot fully converge before training ends.

Probabilistic topic models (Blei, 2012) are popular algorithms for uncovering hidden thematic structure in text. They have been widely used to help people understand and navigate document collections (Blei et al., 2003), multilingual collections (Hu et al., 2014), images (Chong et al., 2009), networks (Chang and Blei, 2009; Yang et al., 2016), etc. Probabilistic topic modeling usually requires computing a posterior distribution over thousands or millions of latent variables, which is often intractable. Variational inference (Blei et al., 2016, VI) approximates posterior distributions. Stochastic variational inference (Hoffman et al., 2013, SVI) is its natural online extension and enables the analysis of large datasets.
Online topic models (Hoffman et al., 2010; Bryant and Sudderth, 2012; Paisley et al., 2015) optimize the global parameters of interest using stochastic gradient ascent. At each iteration, they sample data points to estimate the gradient. In practice, the sample contains only a small percentage of the vocabulary. The resulting sparse gradients hurt performance. ADAGRAD (Duchi et al., 2011) is designed for high-dimensional online optimization problems and adjusts learning rates for each dimension, favoring rare features. This makes ADAGRAD well-suited for tasks with sparse gradients such as distributed deep networks (Dean et al., 2012), forward-backward splitting (Duchi and Singer, 2009), and regularized dual averaging methods (Xiao, 2010).
Thus, it may seem reasonable to apply ADAGRAD to optimize online topic models. However, ADAGRAD is not suitable for online topic models (Section 1). To learn a topic model, the training algorithm must break the symmetry between the parameters of words that are highly related to the topic and those of words that are not. Before the algorithm converges, the magnitude of the gradients of these parameters is very large. Since ADAGRAD uses the accumulation of previous gradients as the learning rates' denominators, the learning rates shrink very quickly. Thus, the algorithm cannot break the symmetry quickly. We provide solutions for this problem. Two alternative learning rate methods, i.e., ADADELTA (Zeiler, 2012) and ADAM (Kingma and Ba, 2014), can address this incompatibility with online topic models. When the dataset is small enough, e.g., a corpus with only hundreds of documents, ADAGRAD can still work.

Buridan's Optimizer
Latent Dirichlet allocation (Blei et al., 2003, LDA) is perhaps the best-known topic model. In this section, we analyze problems with ADAGRAD for online LDA (Hoffman et al., 2010) and provide some solutions. Our analysis is easy to generalize to other online topic models, e.g., the online hierarchical Dirichlet process (Wang et al., 2011, HDP).
Figure 1: Illustration of ADAGRAD's problem. Initially, the topic does not favor particular words over others, so the training algorithm incorrectly increases the parameters of bottom words. Then, ADAGRAD learning rates decrease too quickly, leaving the tie between top and bottom unbroken. Thus, the algorithm fails to form appropriate topics. A constant rate easily breaks the tie. When the tie is broken, the algorithm decreases the parameters of bottom words and increases the parameters of top words until convergence.

Online LDA
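To train LDA, we want to compute the posterior
$$p(\beta, \theta, z \mid w, \alpha, \eta) \propto \prod_{k=1}^{K} p(\beta_k \mid \eta) \prod_{d=1}^{D} p(\theta_d \mid \alpha) \prod_{n=1}^{N_d} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid \beta_{z_{dn}}),$$
where $\beta_k$ is the topic-word distribution for the $k$th of $K$ topics, $\theta_d$ is the document-topic distribution for the $d$th of $D$ documents, $z_{dn}$ is the topic assignment for the $n$th of $N_d$ words in the $d$th document, $w_{dn}$ is the word type of the $n$th word in the $d$th document, and $\alpha$ and $\eta$ are the Dirichlet priors over the document-topic and topic-word distributions.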
However, this is intractable. Stochastic variational inference (SVI) is a popular approach for approximation. It first posits a mean-field variational distribution
$$q(\beta, \theta, z) = \prod_{k=1}^{K} q(\beta_k \mid \lambda_k) \prod_{d=1}^{D} q(\theta_d \mid \gamma_d) \prod_{n=1}^{N_d} q(z_{dn} \mid \phi_{dn}),$$
where $\gamma$ (Dirichlet) and $\phi$ (multinomial) are local parameters and $\lambda$ (Dirichlet) is a global parameter. SVI then optimizes the variational parameters to minimize the KL divergence between the variational distribution and the true posterior.
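At iteration $t$, SVI samples a document $d$ from the corpus and updates the local parameters:
$$\phi^{d}_{vk} \propto \exp\Big\{\Psi(\lambda_{kv}) - \Psi\Big(\sum_{i=1}^{V} \lambda_{ki}\Big) + \Psi(\gamma_{dk})\Big\}, \qquad \gamma_{dk} = \alpha + \sum_{v=1}^{V} n_v\, \phi^{d}_{vk}, \qquad (1)$$
where $n_v$ is the number of occurrences of word $v$ in $d$, and $\Psi(\cdot)$ is the digamma function. After finding $\phi_d$ and $\gamma_d$, SVI optimizes the global parameters using stochastic gradient ascent,
$$\lambda^{(t+1)}_{kv} = \lambda^{(t)}_{kv} + \rho^{(t)}_{kv}\, g^{(t)}_{kv}, \qquad g^{(t)}_{kv} = -\lambda^{(t)}_{kv} + \eta + D\, n_v\, \phi^{d}_{vk},$$
where $\rho^{(t)}_{kv}$ is the learning rate and $g^{(t)}_{kv}$ is the gradient with respect to $\lambda_{kv}$.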
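The local and global steps described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names (`digamma`, `local_step`, `global_gradient`), the fixed number of local iterations, and the toy values are assumptions, and the gradient is taken in the form $-\lambda_{kv} + \eta + D\, n_v\, \phi_{vk}$ used by online LDA (Hoffman et al., 2010).

```python
import math


def digamma(x):
    """Digamma for x > 0: upward recurrence, then an asymptotic series."""
    shift = 0.0
    while x < 6.0:
        shift -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return shift + math.log(x) - 0.5 / x - series


def local_step(counts, lam, alpha, n_iter=50):
    """Fit phi_d and gamma_d for one document; counts maps word id -> n_v."""
    K = len(lam)
    elog_beta = []
    for row in lam:
        row_total = digamma(sum(row))
        elog_beta.append([digamma(x) - row_total for x in row])
    gamma = [1.0] * K
    phi = {}
    for _ in range(n_iter):  # assumption: a fixed iteration count instead of a convergence test
        elog_theta = [digamma(g) for g in gamma]
        for v in counts:
            weights = [math.exp(elog_beta[k][v] + elog_theta[k]) for k in range(K)]
            total = sum(weights)
            phi[v] = [w / total for w in weights]
        gamma = [alpha + sum(n * phi[v][k] for v, n in counts.items()) for k in range(K)]
    return gamma, phi


def global_gradient(counts, phi, lam, eta, D):
    """Noisy gradient g_kv = -lam_kv + eta + D * n_v * phi_vk (n_v = 0 for absent words)."""
    grad = []
    for k, row in enumerate(lam):
        grad_row = []
        for v, x in enumerate(row):
            data_term = D * counts[v] * phi[v][k] if v in counts else 0.0
            grad_row.append(-x + eta + data_term)
        grad.append(grad_row)
    return grad


# Toy example: K = 2 topics, V = 3 word types, one sampled document.
lam = [[1.0, 1.1, 0.9], [0.9, 1.0, 1.2]]
counts = {0: 2, 2: 1}
gamma, phi = local_step(counts, lam, alpha=0.1)
g = global_gradient(counts, phi, lam, eta=0.01, D=1000)
rho = 0.1  # a constant global learning rate
lam = [[x + rho * gx for x, gx in zip(row, grow)] for row, grow in zip(lam, g)]
```

With a learning rate no larger than 1, this update is a convex combination of the old $\lambda$ and a positive target, so $\lambda$ stays positive.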

ADAGRAD for Online LDA
In general, $\rho^{(t)}_{kv} = \kappa^{(t)}$ for all $v \in \{1, \ldots, V\}$ and $k \in \{1, \ldots, K\}$, where $\kappa^{(t)}$ can be a decreasing rate (Hoffman et al., 2013), a small constant (Collobert et al., 2011), or an adaptive rate (Ranganath et al., 2013). These three methods are all global learning rate methods, which can neither adjust the learning rate for each dimension of the parameter nor address the problems caused by sparse gradients.
ADAGRAD is a popular learning rate method designed for online optimization problems with high dimension and sparse gradients. Thus, it seems reasonable to apply ADAGRAD to update learning rates for online topic models. When using ADAGRAD (Duchi et al., 2011) with online LDA, the update rule for each learning rate is
$$\rho^{(t)}_{kv} = \frac{\rho_0}{\sqrt{\epsilon + \sum_{i=0}^{t} \big(g^{(i)}_{kv}\big)^2}},$$
where $g^{(i)}_{kv}$ is the gradient at iteration $i$, $\rho_0$ is a constant, and a very small $\epsilon$ guarantees that the learning rates are non-zero.
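A small sketch makes the per-dimension behavior concrete. It assumes the common ADAGRAD form $\rho_0 / \sqrt{\epsilon + \sum_i g_i^2}$ with $\epsilon$ inside the square root; the function name `adagrad_rates` and the toy gradient values are illustrative, not taken from the experiments.

```python
import math


def adagrad_rates(grad_history, rho0=1.0, eps=1.0):
    """Per-dimension rates rho0 / sqrt(eps + sum of squared gradients so far)."""
    accum = [0.0] * len(grad_history[0])
    rates = []
    for grads in grad_history:
        accum = [a + g * g for a, g in zip(accum, grads)]
        rates.append([rho0 / math.sqrt(eps + a) for a in accum])
    return rates


# Dimension 0 sees large gradients at every step; dimension 1 is sparse and small.
history = [[1000.0, 0.0], [1000.0, 0.1], [1000.0, 0.0]]
rates = adagrad_rates(history)
print(rates[-1])  # dimension 0 has a far smaller rate than dimension 1
```

The accumulated sum never decreases, so each rate can only shrink; a few large gradients are enough to push a dimension's rate close to zero.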

ADAGRAD's Indecision
A philosophical thought experiment provides us with the story of Buridan's ass (Bayle, 1826): situated between two piles of equally tasty hay, the poor animal starved to death. ADAGRAD faces a similar problem in breaking the symmetries of common variational inference initializations. For convenience, we unfold an example with a single document at each iteration. Our analysis generalizes to mini-batches. Initially, the topics $\beta_{1:K}$ do not favor particular words over others, as inference cannot know a priori which words will have high probability in a particular topic. The algorithm must break ties between the parameters of the top and bottom words in a topic. Unfortunately, the momentum of ADAGRAD fails for topic models. We now explain why. ADAGRAD looks to the gradient for clues about which features will be important. Early on, these clues are unreliable: before the equilibrium is broken, the values of different $\lambda_{kv}$ are close, so Equation 1 is approximately $\phi^{d}_{vk} \propto \exp\{\Psi(\gamma_{dk})\}$, which implies that $\lambda$ has very little influence on the optimization of $\phi$. If some topics are prevalent in the sampled document $d$, large probability will be assigned to the corresponding $\phi_{\cdot k}$, meaning that all words in document $d$ are treated as top words. The initial clues are at best random and at worst counterproductive.
However, ADAGRAD uses these clues to prefer some dimensions over others. Let $\lambda^*$ be the optimum, i.e., the topic ADAGRAD should find at convergence: $\lambda^*_{kv} \approx \mathbb{E}[\lambda^{(t)}_{kv}]$. By definition, once the algorithm converges, $\lambda^*_{kv}$ for top words will have very large values while $\lambda^*_{kv}$ for bottom words will be small. Having built its momentum from noisy signals, ADAGRAD must then overcome those initial faulty signals.
We now show the lower and upper bounds of $\mathbb{E}[\lambda^{(t)}_{kv}]$ to show how big an uphill battle ADAGRAD faces. Expanding the update rule expresses this expectation in terms of $\bar{n}_v = \sum_{i=1}^{D} n_{iv} / D$, the average count of word $v$ per document, and $\phi_{vk}$, the probability that word $v$ is assigned to topic $k$. For a bottom word, $\phi_{vk} \to 0$. For a top word, $\phi_{vk} \ge 1/K$. After convergence, for a bottom word $\mathbb{E}[\phi_{vk}] \approx \eta$. For a top word, $1/K \le \mathbb{E}[\phi_{vk}] \le 1$. These limits give the lower and upper bounds of $\mathbb{E}[\lambda^{(t)}_{kv}]$. For large datasets, $D\bar{n}_v$ should be large. Thus for top words, $\lambda^*_{kv}$ will converge to a large value: quite a large hill to climb.
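One way to see where such bounds come from is sketched below. It is a reconstruction under stated assumptions rather than the original derivation: the gradient is taken to be $g_{kv} = -\lambda_{kv} + \eta + D\, n_{dv}\, \phi_{vk}$ (the online LDA form of Hoffman et al., 2010), and word counts are treated as roughly independent of the assignments at convergence, so that the expected gradient vanishes at $\lambda^*$.

```latex
\begin{align*}
\mathbb{E}\big[g_{kv}\big] = 0
  \;&\Rightarrow\;
  \lambda^{*}_{kv} \approx \eta + D\,\bar{n}_v\,\mathbb{E}[\phi_{vk}], \\
\tfrac{1}{K} \le \mathbb{E}[\phi_{vk}] \le 1 \ \text{(top word)}
  \;&\Rightarrow\;
  \eta + \frac{D\,\bar{n}_v}{K} \;\le\; \lambda^{*}_{kv} \;\le\; \eta + D\,\bar{n}_v .
\end{align*}
```

Under these assumptions, the target for a top word grows linearly with the corpus size $D$.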
How quickly the algorithm climbs the hill is inversely proportional to the gradient size. We next show that the magnitude of the gradients of top words is very large before the algorithm converges. Let $g^*$ be the gradient after convergence. We bound $|g_{kv}|$, where $|\cdot|$ is the absolute value. Only when $n_{dv} = \bar{n}_v$ does $|g^{(t)}_{kv}| = 0$. Otherwise, due to the large $D$, $|g^*_{kv}|$ will be large. In practice, $n_{dv}$ varies widely from document to document, which leads to large values of $|g^*_{kv}|$. Based on the gradient's property, when $\lambda_{kv}$ is far away from the optimum, $|g^{(t)}_{kv}| \ge |g^*_{kv}|$. Thus, the values of $|g^{(t)}_{kv}|$ for the top words are very large before convergence. ADAGRAD uses the accumulation of previous gradients as the learning rates' denominators. Because of these large gradients in the first several iterations, learning rates soon decrease to small values; even if a topic has gathered a few words, ADAGRAD lacks the momentum to move other words into the topic. These small learning rates slow the updates of $\lambda$.
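The two quantities in this argument can be written out as follows. This is a sketch under assumptions, not the original bound: the gradient is taken to be $g_{kv} = -\lambda_{kv} + \eta + D\, n_{dv}\, \phi_{vk}$, the optimum is approximated by $\lambda^*_{kv} \approx \eta + D\, \bar{n}_v\, \phi_{vk}$, and ADAGRAD is taken in the common form $\rho_0 / \sqrt{\epsilon + \sum_i g_i^2}$ with $\epsilon \ge 0$.

```latex
\begin{align*}
g^{*}_{kv} &= -\lambda^{*}_{kv} + \eta + D\,n_{dv}\,\phi_{vk}
            \;\approx\; D\,\phi_{vk}\,\big(n_{dv} - \bar{n}_v\big), \\
\big|\rho^{(t)}_{kv}\, g^{(t)}_{kv}\big|
           &= \frac{\rho_0\,\big|g^{(t)}_{kv}\big|}
                   {\sqrt{\epsilon + \sum_{i=0}^{t} \big(g^{(i)}_{kv}\big)^2}}
            \;\le\; \rho_0 .
\end{align*}
```

The first line shows the gradient at convergence scaling with $D$ whenever $n_{dv} \ne \bar{n}_v$. The second shows that, under this form of ADAGRAD, a single update moves $\lambda_{kv}$ by at most $\rho_0$, however large the gradient is, so a target that grows with $D$ takes correspondingly many iterations to reach.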
In sum, the initial gradient signals confuse the algorithm, the gradients are large enough to impede progress later, and large datasets imply a very large hill the algorithm must climb. Since the update progresses slowly, online LDA needs more iterations to break the equilibrium. Because the gradients of all words are still very large, the learning rates decrease quickly, which makes the update progress slower. When the update progresses more slowly, online LDA needs more iterations to break the tie. This cycle repeats, until some learning rates decrease to zero and learning effectively stops. Thus, the algorithm will never break the tie or infer good topics. Figure 1 illustrates the problem of online LDA with ADAGRAD.
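The cycle described above can be reproduced in a deliberately simplified toy. The sketch below tracks a single top-word parameter with $\phi$ held fixed at 1, which ignores the coupling between $\phi$ and $\lambda$; the function name `simulate`, the count distribution, and all constants are assumptions chosen for illustration.

```python
import math
import random


def simulate(rate_fn, D=100_000, eta=0.01, steps=1000, seed=0):
    """Run stochastic gradient ascent on one lambda_kv with phi fixed at 1."""
    rng = random.Random(seed)
    lam, accum = 1.0, 0.0
    for _ in range(steps):
        n_dv = rng.randint(0, 4)      # count of the word in the sampled document (mean 2)
        g = -lam + eta + D * n_dv     # noisy gradient
        accum += g * g                # running sum of squared gradients
        lam += rate_fn(accum) * g
    return lam


target = 0.01 + 100_000 * 2.0                                   # eta + D * mean count
adagrad = simulate(lambda accum: 1.0 / math.sqrt(1.0 + accum))  # rho0 = eps = 1
constant = simulate(lambda accum: 1e-3)                         # constant rate

print(f"target   {target:12.1f}")
print(f"ADAGRAD  {adagrad:12.1f}")
print(f"constant {constant:12.1f}")
```

In this toy, the constant rate covers a large fraction of the distance to the target within 1,000 steps, while ADAGRAD, whose steps are each at most $\rho_0 = 1$, remains near its starting point.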

Alternative Solutions
ADADELTA (Zeiler, 2012) and ADAM (Kingma and Ba, 2014) are extensions to ADAGRAD. ADADELTA does not have guaranteed convergence on convex optimization problems. Even though ADAM has a theoretical bound on its convergence rate, it is controlled by and sensitive to several learning rate parameters. For good performance with ADAM, manual adjustment is necessary. In addition, since ADADELTA computes the moving average of updates, and ADAM needs to compute the bias-corrected gradient estimate, they require more intricate implementations. Consequently, these two methods are not as popular as ADAGRAD for beginners. However, for SVI latent variable models, they can address the problems with ADAGRAD.
ADADELTA updates the learning rates with a rule in which $\rho_0$ is a decay constant and $\epsilon$ is for numerical stability. ADAM's update rule is determined by estimates of the first and second moments of the gradients, where $\rho_0$ is a constant and $b$ controls the decay rate.
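For reference, the two update rules can be sketched for a single scalar parameter in their standard published forms (Zeiler, 2012; Kingma and Ba, 2014). The notation here may differ from the one used in this paper: the names `decay`, `b1`, `b2` and the default values are the usual ones from those papers, not the settings of the experiments, and the sign is chosen for gradient ascent.

```python
import math


def adadelta_step(g, state, decay=0.95, eps=1e-6):
    """One ADADELTA update; returns the change to apply to the parameter."""
    state["avg_g2"] = decay * state["avg_g2"] + (1 - decay) * g * g
    step = math.sqrt(state["avg_dx2"] + eps) / math.sqrt(state["avg_g2"] + eps) * g
    state["avg_dx2"] = decay * state["avg_dx2"] + (1 - decay) * step * step
    return step


def adam_step(g, state, rho0=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One ADAM update; returns the change to apply to the parameter."""
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * g        # first-moment estimate
    state["v"] = b2 * state["v"] + (1 - b2) * g * g    # second-moment estimate
    m_hat = state["m"] / (1 - b1 ** state["t"])        # bias correction
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return rho0 * m_hat / (math.sqrt(v_hat) + eps)


adadelta_state = {"avg_g2": 0.0, "avg_dx2": 0.0}
adam_state = {"t": 0, "m": 0.0, "v": 0.0}
first_adadelta = adadelta_step(5.0, adadelta_state)
first_adam = adam_step(5.0, adam_state)
```

Both rules divide by a decayed average of squared gradients rather than a running sum, so an early burst of large gradients is gradually forgotten.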
Both ADADELTA and ADAM use the moving average of gradients as the denominator of learning rates. The learning rates will not monotonically decrease, but vary in a certain range. This property prevents online topic models from being trapped and breaks the tie between top and bottom words. ADAM in particular uses a bias-corrected estimate of the gradient, $\hat{m}_{kv}$, rather than the original stochastic gradient $g_{kv}$ to guide the direction of the optimization, and therefore achieves better results.
In addition, the magnitude of gradients is proportional to the dataset's size. Thus, when the dataset is small enough, ADAGRAD will still work.
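The contrast between an accumulated denominator and a moving-average denominator can be seen on a toy gradient sequence: a burst of large gradients followed by small ones. The helper `denominators`, the decay of 0.9, and the sequence itself are illustrative assumptions.

```python
import math


def denominators(grads, decay=0.9, eps=1e-8):
    """Return the ADAGRAD-style and moving-average-style denominators per step."""
    accum, avg = 0.0, 0.0
    cumulative, moving = [], []
    for g in grads:
        accum += g * g                              # running sum (ADAGRAD)
        avg = decay * avg + (1.0 - decay) * g * g   # decayed average (ADADELTA/ADAM style)
        cumulative.append(math.sqrt(eps + accum))
        moving.append(math.sqrt(eps + avg))
    return cumulative, moving


grads = [1000.0] * 20 + [1.0] * 200
cumulative, moving = denominators(grads)
print(cumulative[-1], moving[-1])
```

After the burst, the cumulative denominator stays in the thousands, so the effective rate never recovers, while the moving-average denominator falls back to roughly the size of the recent gradients.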

Empirical Study
We study three datasets: synthetic data, Wikipedia, and the SMS spam corpus. We use the generative process of LDA to generate synthetic data. We vary the vocabulary size $V \in \{2, 10, 100, 1000, 5000\}$ and the number of documents $D \in \{300, 500, 10^3, 10^4, 10^5, 10^6\}$. The Wikipedia dataset consists of 1M articles collected from Wikipedia. The vocabulary is the same as in Hoffman et al. (2010). The SMS corpus is a small corpus containing 1084 documents.
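Generating synthetic data from the LDA generative process can be sketched as follows. The hyperparameters, the fixed document length, and the function names (`sample_dirichlet`, `generate_corpus`) are assumptions for illustration; the settings used to build the synthetic datasets are not specified here.

```python
import random


def sample_dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]


def generate_corpus(D, V, K, doc_len, alpha=0.1, eta=0.1, seed=0):
    """Generate D documents of doc_len word ids from K topics over V word types."""
    rng = random.Random(seed)
    beta = [sample_dirichlet(rng, eta, V) for _ in range(K)]   # topic-word distributions
    docs = []
    for _ in range(D):
        theta = sample_dirichlet(rng, alpha, K)                # document-topic distribution
        doc = []
        for _ in range(doc_len):
            z = rng.choices(range(K), weights=theta)[0]        # topic assignment
            w = rng.choices(range(V), weights=beta[z])[0]      # word type
            doc.append(w)
        docs.append(doc)
    return beta, docs


beta, docs = generate_corpus(D=300, V=10, K=3, doc_len=50)
```

Because the true `beta` is returned alongside the documents, an estimate can later be compared against it directly.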

Metrics and Settings
Error rate: For experiments on synthetic datasets, we use error rate to measure the difference between the estimated $\hat{\beta}$ and the known $\beta$. The min operator in the error rate greedily matches each $\hat{\beta}_k$ to its best fit. While an uncommon metric for unsupervised algorithms, on the synthetic data we have the true $\beta$.
Predictive likelihood: For experiments on real datasets, we use per-word likelihood (Hoffman et al., 2013) to evaluate the model quality. We randomly hold out 10K documents and 100 documents on Wikipedia and SMS, respectively.
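The greedy matching in the error rate can be illustrated with a short sketch. The exact distance and normalization of the metric are not recoverable from the text, so the L1 distance and the average over topics used below are assumptions, as is the function name `error_rate`.

```python
def error_rate(beta_hat, beta_true):
    """Average, over estimated topics, of the L1 distance to the closest true topic."""
    def l1(p, q):
        return sum(abs(a - b) for a, b in zip(p, q))

    # Greedy matching: each estimated topic independently picks its best-fitting true topic.
    return sum(min(l1(est, ref) for ref in beta_true) for est in beta_hat) / len(beta_hat)


beta_true = [[0.9, 0.1], [0.1, 0.9]]
permuted = [[0.1, 0.9], [0.9, 0.1]]      # same topics in a different order
undecided = [[0.5, 0.5], [0.9, 0.1]]     # one topic never broke the tie
print(error_rate(permuted, beta_true), error_rate(undecided, beta_true))
```

Taking the minimum over true topics makes the score invariant to topic order, which matters because topic indices are not identifiable.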

Settings:
In the experiments on synthetic data, we use online LDA (Hoffman et al., 2010), since the data is generated by LDA. In the experiments on real datasets, we use online LDA and online HDP (Wang et al., 2011). In the experiments on Wikipedia, we set the number of topics $K = 100$ and the mini-batch size $M = 100$. In the experiments on the SMS corpus, we set $K = 10$ and $M = 20$. For ADAM, we use the default setting of $b$, and set $\rho_0 = 10$ and $\epsilon = 1000$. For ADADELTA, we set $\epsilon = 1000$. For ADAGRAD, we set $\rho_0 = \epsilon = 1$. These are the best settings for these three methods. The best constant rate is $10^{-3}$.
Figure 2 illustrates the experimental results on synthetic datasets. ADAGRAD only works well with small datasets. When the number of documents increases, ADAGRAD's performance degrades. Conversely, the other methods can handle more documents.
Figure 3 illustrates experimental results on real corpora. ADAGRAD achieves results competitive with the other algorithms on the small SMS corpus. However, on the very large Wikipedia corpus, ADAGRAD fails to infer good topics, and its predictive ability is worse than that of the other methods. While ADADELTA and ADAM work well on Wikipedia, ADAM is the clear winner between the two.

Conclusion
ADAGRAD is a simple and popular technique for online learning, but it is not compatible with traditional initializations and objective functions for online topic models. We show that practitioners are best off using simpler online learning techniques, or ADADELTA and ADAM, two variants of ADAGRAD that use the moving average of gradients as the denominator. These two methods avoid ADAGRAD's problem. In particular, ADAM performs much better for prediction.
We would like to build a deeper understanding of which aspects of an unsupervised objective, near-uniform initialization, and non-identifiability contribute to these issues, and to discover other learning problems that may share them.