Scaling Up Word Clustering

Word clusters improve performance in many NLP tasks including training neural network language models, but current increases in datasets are outpacing the ability of word clusterers to handle them. In this paper we present a novel bidirectional, interpolated, refining, and alternating (BIRA) predictive exchange algorithm and introduce ClusterCat, a clusterer based on this algorithm. We show that ClusterCat is 3–85 times faster than four other well-known clusterers, while also improving upon the predictive exchange algorithm's perplexity by up to 18%. Notably, ClusterCat clusters a 2.5 billion token English News Crawl corpus in 3 hours. We also evaluate in a machine translation setting, resulting in shorter training times while achieving the same translation quality as measured by BLEU scores. ClusterCat is portable and freely available.

Word clusters also speed up normalization in training neural network and MaxEnt language models, via class-based decomposition (Goodman, 2001b). This reduces the normalization time from $O(|V|)$ (the vocabulary size) to $\approx O(\sqrt{|V|})$.
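The saving from class-based decomposition can be illustrated with a small calculation. The sketch below assumes equal-sized classes and counts only the terms summed during normalization; the function name `softmax_terms` and the vocabulary size are illustrative, not taken from any particular toolkit.

```python
import math

def softmax_terms(vocab_size: int, num_classes: int) -> int:
    """Terms summed when normalizing P(w|h) = P(c|h) * P(w|c,h):
    one sum over the classes plus one over the words of a single class
    (assumes equal-sized classes)."""
    words_per_class = math.ceil(vocab_size / num_classes)
    return num_classes + words_per_class

vocab_size = 2_000_000                 # illustrative vocabulary size
balanced = math.isqrt(vocab_size)      # |C| close to sqrt(|V|) minimizes the sum

print(softmax_terms(vocab_size, 1))          # a single class: roughly |V| terms
print(softmax_terms(vocab_size, balanced))   # roughly 2 * sqrt(|V|) terms
```

With a balanced number of classes the normalization sum shrinks from millions of terms to a few thousand.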

Exchange-Based Clustering
The exchange algorithm (Kneser and Ney, 1993) uses an unlexicalized (two-sided) model: $P(w_i \mid w_{i-1}) = P(w_i \mid c_i)\, P(c_i \mid c_{i-1})$, where the class $c_i$ of the predicted word $w_i$ is conditioned on the class $c_{i-1}$ of the previous word $w_{i-1}$. Goodman (2001a) altered this model so that $c_i$ is conditioned directly upon $w_{i-1}$: $P(w_i \mid w_{i-1}) = P(w_i \mid c_i)\, P(c_i \mid w_{i-1})$. This fractionates the history more, but it greatly speeds up hypothesizing an exchange since the history doesn't change. The resulting partially lexicalized (one-sided) model gives the accompanying predictive exchange algorithm (Uszkoreit and Brants, 2008) a time complexity of $O((B + |V|) \times |C| \times I)$, where $B$ is the number of unique bigrams, $|C|$ is the number of classes, and $I$ is the number of training iterations, usually fewer than 20.
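As a minimal sketch of the one-sided model, the probabilities can be estimated from raw counts. The code below uses unsmoothed maximum-likelihood estimates on a toy corpus; the function name `one_sided_prob` and the toy class assignment are hypothetical, and a real clusterer would not recount the corpus in this way.

```python
from collections import Counter

def one_sided_prob(tokens, word2class):
    """Return a function computing P(w_i | w_{i-1}) = P(w_i | c_i) * P(c_i | w_{i-1})
    from unsmoothed maximum-likelihood counts."""
    word_count = Counter(tokens)
    class_count = Counter(word2class[w] for w in tokens)
    # N(w_{i-1}, c_i): how often a word of class c_i follows w_{i-1}
    hist_class = Counter((prev, word2class[cur]) for prev, cur in zip(tokens, tokens[1:]))
    hist_count = Counter(tokens[:-1])

    def prob(cur, prev):
        # prev must occur at least once as a history in the corpus
        c = word2class[cur]
        p_word_given_class = word_count[cur] / class_count[c]
        p_class_given_prev = hist_class[(prev, c)] / hist_count[prev]
        return p_word_given_class * p_class_given_prev

    return prob

tokens = "the cat sat the dog sat the cat ran".split()
word2class = {"the": 0, "cat": 1, "dog": 1, "sat": 2, "ran": 2}
prob = one_sided_prob(tokens, word2class)
print(prob("cat", "the"))
```

Because the history is the word $w_{i-1}$ itself rather than its class, moving a word to another class only changes the counts in which that word is the predicted item, which is what makes a tentative exchange cheap to evaluate.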

ClusterCat
ClusterCat is word clustering software designed to be fast and scalable, while also improving upon the predictive exchange algorithm. In this section we describe improvements to the model, the algorithm, and the implementation.

Model and Algorithm
We developed a bidirectional, interpolated, refining, and alternating (BIRA) predictive exchange algorithm. The goal of BIRA is to produce better clusters by using multiple, changing models to escape local optima. BIRA uses both forward and reversed bigram class models, improving cluster quality by evaluating log-likelihood on two different models. Unlike using trigrams, bidirectional bigram models only linearly increase time and memory requirements, and in fact some data structures can be shared. The two directions are interpolated to allow softer integration of these two models: $P(w_i \mid w_{i-1}, w_{i+1}) = P(w_i \mid c_i) \cdot \big(\lambda\, P(c_i \mid w_{i-1}) + (1-\lambda)\, P(c_i \mid w_{i+1})\big)$. Furthermore, the interpolation weight $\lambda$ for the forward direction alternates to $1-\lambda$ every $a$ iterations ($i$) to help escape local optima. The time complexity is $O(2 \times (B + |V|) \times |C| \times I)$. The original predictive exchange algorithm can be obtained by setting $\lambda = 1$ and $a = 0$.
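The interpolation and the alternating weight can be sketched as follows. The exact flipping schedule is an assumption here (the weight switches between $\lambda$ and $1-\lambda$ in blocks of $a$ iterations, counted from zero), and the function names are hypothetical rather than taken from the ClusterCat source.

```python
def forward_weight(lam: float, a: int, iteration: int) -> float:
    """Weight of the forward model at a 0-based iteration.
    Assumed schedule: flip between lam and 1 - lam every `a` iterations;
    a == 0 disables flipping."""
    if a == 0:
        return lam
    flipped = (iteration // a) % 2 == 1
    return 1.0 - lam if flipped else lam

def interpolated_class_prob(p_fwd: float, p_rev: float, weight: float) -> float:
    """weight * P(c_i | w_{i-1}) + (1 - weight) * P(c_i | w_{i+1})."""
    return weight * p_fwd + (1.0 - weight) * p_rev

# Forward weight over seven iterations with lam = 0.75 and a = 3
print([forward_weight(0.75, 3, i) for i in range(7)])
# lam = 1 and a = 0 ignore the reversed model entirely
print(interpolated_class_prob(0.2, 0.6, forward_weight(1.0, 0, 5)))
```

Setting `lam = 1.0` and `a = 0` makes the reversed model contribute nothing at any iteration, which corresponds to the original predictive exchange setting described above.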
Cluster refinement improves both cluster quality and speed. The vocabulary is initially clustered into $|G|$ sets, where $|G| \ll |C|$, typically 2–10. This groups words into broad classes, like nouns, verbs, etc. After a few iterations ($i$) of this, the full partitioning $C_f$ is explored. Clustering $G$ converges very quickly, typically requiring no more than 3 iterations. In contrast to divisive hierarchical clustering and coarse-to-fine methods (Petrov, 2009), after the initial iterations any word can still move to any cluster; there is no hard constraint that the more refined partitions be subsets of the initial coarser partitions. This gives more flexibility in optimizing on log-likelihood, especially given the noise that naturally arises from coarser clusterings. We also explored cluster refinement over more stages than just two, successively increasing the number of clusters, but observed no improvement over the two-stage method described above.
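The two-stage schedule can be sketched as a list giving the number of clusters available at each iteration. The function name and the concrete values (8 coarse clusters, 3 coarse iterations) are illustrative assumptions, chosen from the ranges mentioned above.

```python
def refinement_schedule(total_iters, coarse_clusters, full_clusters, coarse_iters=3):
    """Number of clusters available at each iteration: |G| first, then |C|."""
    return [coarse_clusters if i < coarse_iters else full_clusters
            for i in range(total_iters)]

schedule = refinement_schedule(total_iters=15, coarse_clusters=8, full_clusters=800)
print(schedule[:5])
```

Since the coarse stage only restricts how many clusters exist, not where a word may move later, every word remains free to land in any of the full set of clusters once the second stage starts.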
The contributions of each of these, relative to the original predictive exchange algorithm, are shown in Figure 1. The data and configurations are discussed in more detail in Section 4. The greatest improvement is due to using lambda inversion (+Rev), followed by cluster refinement (+Refine), then interpolating the bidirectional models (+BiDi), with robust improvements by using all three of these: an 18% reduction in perplexity over the predictive exchange algorithm. We have found that both lambda inversion and cluster refinement prevent early convergence at local optima, while bidirectional models give immediate and consistent training set PP improvements, although this gain is attenuated in a unidirectional evaluation.

Implementation
We represent the set of bigrams $B$ as an array of records, each of which tracks the number of predecessors and holds a pointer to an array of the predecessors' IDs. This allows for easy prefetching to reduce memory latency, while also keeping memory overhead low. We dispense with the predictive exchange RemoveWord procedure for tentative steps, since this does not change the final clustering.
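A rough Python analogue of this layout stores, for every word, a predecessor count and an offset into one flat array of predecessor IDs. Offsets stand in for pointers, bigram counts are left out for brevity, and the function name `build_predecessor_table` is hypothetical; this is a sketch of the idea, not the ClusterCat data structure itself.

```python
from array import array

def build_predecessor_table(vocab_size, bigrams):
    """bigrams: a list of unique (prev_id, word_id) pairs.
    Returns (lengths, offsets, pred_ids) such that the predecessors of word w
    are pred_ids[offsets[w] : offsets[w] + lengths[w]]."""
    lengths = array("I", [0] * vocab_size)
    for _, word in bigrams:
        lengths[word] += 1

    # Prefix sums give each word a contiguous slice of the flat ID array
    offsets = array("I", [0] * vocab_size)
    running = 0
    for w in range(vocab_size):
        offsets[w] = running
        running += lengths[w]

    pred_ids = array("I", [0] * running)
    next_slot = array("I", offsets)  # next free position per word
    for prev, word in bigrams:
        pred_ids[next_slot[word]] = prev
        next_slot[word] += 1
    return lengths, offsets, pred_ids

lengths, offsets, pred_ids = build_predecessor_table(3, [(0, 1), (2, 1), (1, 2)])
start = offsets[1]
print(list(pred_ids[start:start + lengths[1]]))  # predecessors of word 1
```

Keeping each word's predecessors contiguous means a scan over them touches sequential memory, which is the property that makes prefetching straightforward.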
Most of the computation for the predictive exchange algorithm is spent on the logarithm function in $\delta \leftarrow \delta - N(w, c) \cdot \log N(w, c)$. Since the codomain of $N(w, c)$ is $\mathbb{N}_0$, and due to the power law distribution of the algorithm's access to these entropy terms, we precompute $N \cdot \log N$ up to, say, 10e+7, with minimal memory requirements. This results in a considerable speedup of around 40%.
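The precomputation amounts to a lookup table with a fallback for large counts. The sketch below uses a deliberately small table size so that it runs quickly; `TABLE_SIZE` and `n_log_n` are illustrative names, and the convention $0 \cdot \log 0 = 0$ is assumed.

```python
import math

TABLE_SIZE = 100_000  # illustrative; the bound suggested in the text is much larger

# ENTROPY_TERMS[n] == n * log(n), with the convention 0 * log(0) = 0
ENTROPY_TERMS = [0.0] + [n * math.log(n) for n in range(1, TABLE_SIZE)]

def n_log_n(count: int) -> float:
    """Look up count * log(count), computing it directly for rare large counts."""
    if count < TABLE_SIZE:
        return ENTROPY_TERMS[count]
    return count * math.log(count)

print(n_log_n(8), n_log_n(TABLE_SIZE + 5))
```

Because counts follow a power law, most lookups hit small indices, so a modest table covers the bulk of the calls.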
The data comes from the 2011–2013 News Crawl monolingual data of the WMT task. For these experiments the data was deduplicated, shuffled, tokenized, digit-conflated, and lowercased. In order to have a large test set, one line per 100 of the resulting corpus was separated into the test set. For English this gave 1B training tokens, 2M training types, and 12M test tokens. For Russian, this gave 550M training tokens, 2.7M training types, and 6M test tokens.
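Part of this preparation can be sketched in a few lines. The digit-conflation scheme (every digit mapped to `0`) and the choice of which line in each block of 100 goes to the test set are assumptions; the text does not specify either, and the function names are hypothetical.

```python
import re

def preprocess(line: str) -> str:
    """Lowercase and map every digit to '0' (conflation scheme is an assumption)."""
    return re.sub(r"\d", "0", line.lower())

def split_every_nth(lines, n=100):
    """Send one line per n to the test set (here: the last of each block of n)."""
    train, test = [], []
    for index, line in enumerate(lines):
        (test if index % n == n - 1 else train).append(line)
    return train, test

train, test = split_every_nth([f"Line {i}" for i in range(1000)])
print(len(train), len(test), preprocess("In 2013, WMT"))
```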
All clusterers had a minimum count threshold of 3 occurrences in the training set. All used 12 threads and 15 iterations, except single-threaded mkcls, which used the default one iteration. Clusterings were performed on a 2.4 GHz Opteron 8378 machine featuring 16 threads and 64 GB of RAM. Table 1 presents wall clock times. The predictive exchange-based clusterers (ClusterCat and Phrasal) exhibit slow time growth as $|C|$ increases, while the other three (Brown, mkcls, and word2vec) are much more sensitive to $|C|$. ClusterCat is three times faster than Phrasal for all sets. For both English and Russian we observe prohibitive growth for mkcls, with the full Russian training set taking over 3 days, compared to 1.5 hours for ClusterCat. ClusterCat took 3.0 hours to cluster 2.5 billion training tokens, using 40 GB of memory for $|C| = 800$. When the number of clusters was tripled to $|C| = 2400$, the same 2.5B corpus was clustered in under 8 hours.
The clusterings are also evaluated on the perplexity (PP) of an external 5-gram two-sided class-based LM. Botros et al. (2015) found that the two-sided model (which mkcls uses) tends to give better PP in two-sided class-based LM experiments, but the one-sided model of the predictive exchange that we employed produces better PP for training LSTM LMs. Table 2 shows perplexity results using a varying number of classes. As word2vec is the only clusterer not optimized on log-likelihood, its perplexity is quite high, and remains high as more training data is added. On the other hand, mkcls gives the lowest perplexity, although this is an artefact of the two-sided evaluation. ClusterCat gives lower perplexity than the original predictive exchange algorithm (in Phrasal) and Brown clustering. The Russian experiments yielded higher PP for all clusterings, but otherwise the same comparative results. The metaheuristic techniques used in mkcls can be applied to other exchange-based clusterers, including ours, for further improvements.
It is also interesting to look at time-sensitive clustering. Figure 2 shows what perplexity can be obtained within a given training time frame. For each clusterer, each successive rightward point in the figure represents an order of magnitude more training data, from $10^6$ to $10^9$ tokens. Within a given amount of time, ClusterCat can train on 10 times more data than either mkcls or Browncluster and produces better perplexity than either.
The 2.5B-token News Crawl training set was too large for the external class-based LM to fit into memory, so no perplexity evaluation of this clustering was possible.

Extrinsic Evaluation
We also evaluated mkcls and ClusterCat extrinsically in machine translation, for word alignment. As training sets get larger every year, mkcls struggles to keep pace, and is a substantial time bottleneck in MT pipelines. We compare time and BLEU scores of using either mkcls or ClusterCat for Russian↔English translation.
The parallel data comes from the WMT-2015 Common Crawl Corpus, News Commentary, Yandex 1M Corpus, and the Wiki Headlines Corpus. The monolingual data consists of 2007–2014 News Commentary and News Crawl articles. The dev and test sets contain 3000 sentences from EN→RU manually translated news articles. We used standard configurations, like truecasing, MGIZA alignment, GDFA phrase extraction, phrase-based Moses, quantized KenLM 5-gram MKN LMs, and MERT tuning. Table 3 presents the BLEU score changes across varying cluster sizes. The BLEU score differences between using mkcls and ClusterCat are minimal, but there are a few statistically significant changes, using bootstrap resampling (Koehn, 2004). Figure 3 shows translation model training times, before MERT. Using ClusterCat reduces the translation model training time with 500 clusters from 20 hours using mkcls (of which 60% of the time is spent on clustering) to just 8 hours (of which 5% is spent on clustering).
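The training-time figures can be broken down with simple arithmetic. The sketch below treats the reported percentages as exact, which is an assumption since they are presumably rounded; `split_hours` is an illustrative helper.

```python
def split_hours(total_hours: float, clustering_share: float):
    """Return (hours spent clustering, hours spent on everything else)."""
    clustering = total_hours * clustering_share
    return clustering, total_hours - clustering

mkcls_clustering, mkcls_rest = split_hours(20, 0.60)
cc_clustering, cc_rest = split_hours(8, 0.05)

print(f"mkcls:      {mkcls_clustering:.1f} h clustering, {mkcls_rest:.1f} h other steps")
print(f"ClusterCat: {cc_clustering:.1f} h clustering, {cc_rest:.1f} h other steps")
```

Under these figures, clustering drops from about 12 hours to under half an hour, while the remaining training steps take a similar amount of time in both setups.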

Conclusion
In this article we have presented improvements to the predictive exchange algorithm that address longstanding drawbacks of the original algorithm compared to other clustering algorithms. Bidirectional models, lambda inversion, and cluster refinement produce better word clusters, as we showed in several two-sided class-based LM experiments. On these large datasets the quality of the resulting clusters is better than predictive exchange clusters and Brown clusters, and approaches the stochastic exchange clusters produced by mkcls, which takes 35-85 times longer.
We also improved upon the speed of the algorithm through cluster refinement and entropy term precalculation. MT experiments showed that word alignment models using ClusterCat fully match those using mkcls in BLEU scores, with substantial time savings from using ClusterCat. The software, as well as additional compatibility and visualization scripts, is available under a Free license at https://github.com/jonsafari/clustercat .