BIRA: Improved Predictive Exchange Word Clustering

Word clusters are useful for many NLP tasks including training neural network language models, but current increases in datasets are outpacing the ability of word clusterers to handle them. Little attention has been paid thus far to inducing high-quality word clusters at a large scale. The predictive exchange algorithm is quite scalable, but sometimes does not provide as good perplexity as other, slower clustering algorithms. We introduce the bidirectional, interpolated, refining, and alternating (BIRA) predictive exchange algorithm. It improves upon the predictive exchange algorithm's perplexity by up to 18%, giving it perplexities comparable to the slower two-sided exchange algorithm, and better perplexities than the slower Brown clustering algorithm. Our BIRA implementation is fast, clustering a 2.5 billion token English News Crawl corpus in 3 hours. It also reduces machine translation training time while preserving translation quality. Our implementation is portable and freely available.

Word clusters also speed up normalization in training neural network and MaxEnt language models, via class-based decomposition (Goodman, 2001a). This reduces the normalization time from O(|V|) (the vocabulary size) to ≈ O(√|V|). Further improvements to O(log(|V|)) are obtained using hierarchical softmax (Morin and Bengio, 2005; Mnih and Hinton, 2009).
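To make the normalization saving concrete, the following minimal sketch counts the terms summed per prediction under a class-based decomposition. It assumes equally sized classes, which real clusterings only approximate; the function name and the vocabulary size are illustrative and not taken from the cited work.

```python
import math

def normalization_cost(vocab_size: int, num_classes: int) -> float:
    """Terms summed per prediction with a class-based decomposition.

    P(w | h) = P(c(w) | h) * P(w | c(w), h): one normalization over the
    classes, then one over the words of the predicted word's class.
    Assumption: every class holds vocab_size / num_classes words.
    """
    return num_classes + vocab_size / num_classes

vocab_size = 1_000_000                      # illustrative |V|
balanced = round(math.sqrt(vocab_size))     # |C| = sqrt(|V|) minimizes the cost

print(normalization_cost(vocab_size, 1))         # a single class: no saving
print(normalization_cost(vocab_size, balanced))  # about 2 * sqrt(|V|)
```

With |C| near √|V| the two normalizations are balanced, which is where the ≈ O(√|V|) figure comes from.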

Word Clustering
Word clustering partitions a vocabulary V, grouping together words that function similarly. This helps generalize language and alleviate data sparsity. We discuss flat clustering in this paper. Flat, or strict partitioning clustering surjectively maps word types onto a smaller set of clusters.
The exchange algorithm (Kneser and Ney, 1993) is an efficient technique that exhibits a general time complexity of O(|V| × |C| × I), where |V| is the number of word types, |C| is the number of classes, and I is the number of training iterations, typically < 20. This omits the specific method of exchanging words, which adds further complexity. Words are exchanged from one class to another until convergence or until I iterations have been run.
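The exchange loop described above can be sketched as follows. This is a deliberately naive illustration, not the implementation of any cited clusterer: it uses a two-sided class bigram model and recomputes the full log-likelihood for every tentative move, whereas practical implementations update counts incrementally. The function names, the round-robin initialization, and the toy corpus are all assumptions made for the example.

```python
import math
from collections import Counter

def log_likelihood(tokens, cls):
    """Log-likelihood of the corpus under P(w_i | c_i) * P(c_i | c_{i-1})."""
    bigrams = list(zip(tokens, tokens[1:]))
    word_n = Counter(b for _, b in bigrams)            # predicted words
    class_n = Counter(cls[b] for _, b in bigrams)      # predicted classes
    hist_n = Counter(cls[a] for a, _ in bigrams)       # history classes
    pair_n = Counter((cls[a], cls[b]) for a, b in bigrams)
    ll = 0.0
    for a, b in bigrams:
        ca, cb = cls[a], cls[b]
        ll += math.log(word_n[b] / class_n[cb])        # P(w_i | c_i)
        ll += math.log(pair_n[ca, cb] / hist_n[ca])    # P(c_i | c_{i-1})
    return ll

def exchange(tokens, num_classes, max_iters=20):
    """Greedily move each word to the class that most improves the likelihood."""
    vocab = sorted(set(tokens))
    cls = {w: k % num_classes for k, w in enumerate(vocab)}  # arbitrary start
    best = log_likelihood(tokens, cls)
    for _ in range(max_iters):                 # at most I iterations
        moved = False
        for w in vocab:                        # |V| words
            home = cls[w]
            for c in range(num_classes):       # |C| candidate classes
                if c == home:
                    continue
                cls[w] = c                     # tentative exchange
                ll = log_likelihood(tokens, cls)
                if ll > best + 1e-12:
                    best, home, moved = ll, c, True
            cls[w] = home                      # keep the best class found
        if not moved:                          # converged
            break
    return cls, best

tokens = "the cat sat on the mat the dog sat on the rug".split()
clusters, ll = exchange(tokens, num_classes=2)
print(clusters, ll)
```

The three nested loops mirror the |V| × |C| × I factor; the cost of evaluating each tentative move is the part that the complexity figure leaves out.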
One of the oldest and still most popular exchange algorithm implementations is mkcls (Och, 1995), which adds various metaheuristics to escape local optima. Botros et al. (2015) introduce their implementation of three exchange-based algorithms. Martin et al. (1998) and Müller and Schütze (2015) use trigrams within the exchange algorithm. Clark (2003) adds an orthotactic bias.
The previous algorithms use an unlexicalized (two-sided) language model: P(w_i | w_{i−1}) = P(w_i | c_i) P(c_i | c_{i−1}), where the class c_i of the predicted word w_i is conditioned on the class c_{i−1} of the previous word w_{i−1}. Goodman (2001b) altered this model so that c_i is conditioned directly upon w_{i−1}, hence: P(w_i | w_{i−1}) = P(w_i | c_i) P(c_i | w_{i−1}). This new model fractionates the history more, but it allows for a large speedup in hypothesizing an exchange since the history doesn't change. The resulting partially lexicalized (one-sided) class model gives the accompanying predictive exchange algorithm (Goodman, 2001b; Uszkoreit and Brants, 2008) a time complexity of O((B + |V|) × |C| × I), where B is the number of unique bigrams in the training set. We introduce a set of improvements to this algorithm to enable high-quality large-scale word clusters.
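As an illustration of the one-sided model, the sketch below estimates P(w_i | w_{i−1}) = P(w_i | c_i) P(c_i | w_{i−1}) by maximum likelihood from raw bigram counts. The function name, the toy corpus, and the class assignment are hypothetical, and no smoothing is applied.

```python
def one_sided_prob(tokens, cls, prev, word):
    """MLE of P(word | prev) = P(word | c) * P(c | prev), with c = cls[word]."""
    bigrams = list(zip(tokens, tokens[1:]))
    c = cls[word]
    n_word = sum(1 for _, b in bigrams if b == word)       # N(word)
    n_class = sum(1 for _, b in bigrams if cls[b] == c)    # N(c)
    n_prev = sum(1 for a, _ in bigrams if a == prev)       # N(prev)
    n_prev_class = sum(1 for a, b in bigrams
                       if a == prev and cls[b] == c)       # N(prev, c)
    if n_class == 0 or n_prev == 0:
        return 0.0                                         # unseen event
    return (n_word / n_class) * (n_prev_class / n_prev)

tokens = "a b a c a b".split()
cls = {"a": 0, "b": 1, "c": 1}       # hypothetical class assignment
print(one_sided_prob(tokens, cls, "a", "b"))
```

Because the history is the word w_{i−1} itself rather than its class, moving a word to another class leaves the history counts N(prev) untouched, which is what makes hypothesizing an exchange cheap.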

BIRA Predictive Exchange
We developed a bidirectional, interpolated, refining, and alternating (BIRA) predictive exchange algorithm. The goal of BIRA is to produce better clusters by using multiple, changing models to escape local optima.
BIRA uses both forward and reversed bigram class models to improve cluster quality by evaluating log-likelihood on two different models. Unlike using trigrams, bidirectional bigram models only linearly increase time and memory requirements, and in fact some data structures can be shared. The two directions are interpolated to allow softer integration of these two models. The interpolation weight λ for the forward direction alternates to 1 − λ every a iterations (i). Figure 1 illustrates the benefit of this λ-inversion to help escape local minima, with lower training set perplexity by inverting λ every four iterations. The time complexity is O(2 × (B + |V|) × |C| × I). The original predictive exchange algorithm can be obtained by setting λ = 1 and a = 0.
Another innovation, both in terms of cluster quality and speed, is cluster refinement. The vocabulary is initially clustered into |G| sets, where |G| ≪ |C|, typically 2-10. After a few iterations (i) of this, the full partitioning C_f is explored. Clustering G converges very quickly, typically requiring no more than 3 iterations. The intuition behind this is to group words first into broad classes, like nouns, verbs, adjectives, etc. In contrast to divisive hierarchical clustering and coarse-to-fine methods (Petrov, 2009), after the initial iterations, the algorithm is still able to exchange any word to any cluster; there is no hard constraint that the more refined partitions be subsets of the initial coarser partitions. This gives more flexibility in optimizing on log-likelihood, especially given the noise that naturally arises from coarser clusterings. We explored cluster refinement over more stages than just two, successively increasing the number of clusters. We observed no improvement over the two-stage method described above.
Each BIRA component can be applied to any exchange-based clusterer. The contributions of each of these are shown in Figure 2, which reports the development set perplexities (PP) of all combinations of BIRA components over the original predictive exchange algorithm. The data and configurations are discussed in more detail in Section 4. The greatest PP reduction is due to using lambda inversion (+Rev), followed by cluster refinement (+Refine), then interpolating the bidirectional models (+BiDi), with robust improvements by using all three of these, an 18% reduction in perplexity over the predictive exchange algorithm. We have found that both lambda inversion and cluster refinement prevent early convergence at local optima, while bidirectional models give immediate and consistent training set PP improvements, but this is attenuated in a unidirectional evaluation.
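The interpolation, λ alternation, and two-stage refinement described in this section can be sketched as three small schedule functions. This is one possible reading of the prose, not the authors' code: it assumes that λ flips to 1 − λ on every a-th iteration and is restored otherwise, that a = 0 disables the flipping, and that refinement simply switches from |G| to |C| clusters after a fixed number of iterations. All names and parameter values are hypothetical.

```python
def lambda_at(iteration: int, lam0: float, a: int) -> float:
    """Forward-direction weight at a given 1-based iteration.

    Assumption: the weight becomes 1 - lam0 whenever the iteration number
    is a multiple of a, and is lam0 otherwise; a = 0 means never flip.
    """
    if a > 0 and iteration % a == 0:
        return 1.0 - lam0
    return lam0

def interpolated_class_prob(p_fwd: float, p_rev: float, lam: float) -> float:
    """lam * P(c_i | w_{i-1}) + (1 - lam) * P(c_i | w_{i+1})."""
    return lam * p_fwd + (1.0 - lam) * p_rev

def num_clusters_at(iteration: int, coarse: int, full: int,
                    coarse_iters: int) -> int:
    """Two-stage refinement: |G| clusters first, then the full |C|."""
    return coarse if iteration <= coarse_iters else full

# Hypothetical settings: lam0 = 0.75, flip every 4th iteration,
# 3 coarse iterations with |G| = 8 before moving to |C| = 800.
for i in range(1, 9):
    print(i, lambda_at(i, 0.75, 4), num_clusters_at(i, 8, 800, 3))
```

Setting lam0 = 1 and a = 0 makes `interpolated_class_prob` return the forward probability alone at every iteration, matching the statement that the original predictive exchange algorithm is a special case.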
We observed that most of the computation for the predictive exchange algorithm is spent on the logarithm function, calculating δ ← δ − N(w, c) · log N(w, c), where δ is the change in log-likelihood and N(w, c) is the count of a given word followed by a given class. Since the codomain of N(w, c) is ℕ₀, and due to the power law distribution of the algorithm's access to these entropy terms, we can precompute N · log N up to, say, 10e+7, with minimal memory requirements. This results in a considerable speedup of around 40%.
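The precomputation described above amounts to a lookup table with a fallback for the rare large counts. The sketch below is a minimal illustration; the table size is kept small for the example and the names are hypothetical.

```python
import math

TABLE_SIZE = 100_000  # illustrative only; the text suggests a much larger bound

# N * log N for N = 0 .. TABLE_SIZE - 1, taking 0 * log 0 as 0.
N_LOG_N = [0.0] + [n * math.log(n) for n in range(1, TABLE_SIZE)]

def n_log_n(n: int) -> float:
    """Table lookup for small counts, direct computation for large ones."""
    if n < TABLE_SIZE:
        return N_LOG_N[n]
    return n * math.log(n)

delta = 0.0
for count in (3, 17, 250_000):      # hypothetical N(w, c) values
    delta -= n_log_n(count)         # delta <- delta - N log N
print(delta)
```

Because counts follow a power law, most lookups hit the small-N end of the table, so the logarithm is rarely evaluated at run time.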

Experiments
Our experiments consist of both intrinsic and extrinsic evaluations. The intrinsic evaluation measures the perplexity (PP) of two-sided class-based models for English and Russian, and the extrinsic evaluation measures BLEU scores of phrase-based MT of Russian↔English and Japanese↔English texts.

Class-based Language Model Evaluation
In this task we used 400, 800, and 1200 classes for English, and 800 classes for Russian. The data comes from the 2011-2013 News Crawl monolingual data of the WMT task. For these experiments the data was deduplicated, shuffled, tokenized, digit-conflated, and lowercased. In order to have a large test set, one line per 100 of the resulting (shuffled) corpus was separated into the test set. The minimum count threshold was set to 3 occurrences in the training set.
The clusterings are evaluated on the PP of an external 5-gram unidirectional two-sided class-based language model (LM). The n-gram-order interpolation weights are tuned using a distinct development set of comparable size and quality to the test set. Table 2 and Figure 3 show perplexity results using a varying number of classes. Two-sided exchange gives the lowest perplexity across the board, although this is within a two-sided LM evaluation.
We also evaluated clusters derived from word2vec (Mikolov et al., 2013) using various configurations (negative sampling and hierarchical softmax; CBOW and skip-gram; various window sizes; various dimensionalities), and all gave poor perplexities. BIRA gives better perplexities than both the original predictive exchange algorithm and Brown clusters. For the two-sided exchange we used mkcls; for the original predictive exchange we used Phrasal's clusterer; for Brown clustering we used Percy Liang's brown-cluster (329dc). All had min-count=3, and all but mkcls (which is not multithreaded) had threads=12, iterations=15. The Russian experiments yielded higher perplexities for all clusterings, but otherwise the same comparative results.
In general Brown clusters give slightly worse results relative to exchange-based clusters, since Brown clustering requires an early, permanent placement of frequent words, with further restrictions imposed on the |C|-most frequent words (Liang, 2005); recent work by Derczynski and Chester (2016) loosens some restrictions on Brown clustering. Liang-style Brown clustering is only efficient on a small number of clusters, since there is a |C|² term in its time complexity.
The original predictive exchange algorithm has a more fractionated history than the two-sided exchange algorithm. Interestingly, increasing the number of clusters causes a convergence in the word clusterings themselves, while also causing a divergence in the time complexities of these two varieties of the exchange algorithm. The metaheuristic techniques employed by the two-sided clusterer mkcls can be applied to other exchange-based clusterers, including ours, for further improvements.
Table 3 presents wall clock times using the full training set, varying the number of word classes |C| (for English). The predictive exchange-based clusterers (BIRA and Phrasal) exhibit slow increases in time as the number of classes increases, while the others (Brown and mkcls) are much more sensitive to |C|. Our BIRA-based clusterer is three times faster than Phrasal for all these sets.
We performed an additional experiment, adding more English News Crawl training data. Our implementation took 3.0 hours to cluster 2.5 billion training tokens, with |C| = 800, using modest hardware.

Machine Translation Evaluation
We also evaluated the BIRA predictive exchange algorithm extrinsically in machine translation. As discussed in Section 1, word clusters are employed in a variety of ways within machine translation systems, the most common of which is in word alignment where mkcls is widely used. As training sets get larger every year, mkcls struggles to keep pace, and is a substantial time bottleneck in MT pipelines with large datasets.
We used data from the Workshop on Machine Translation 2015 (WMT15) Russian↔English dataset and the Workshop on Asian Translation 2014 (WAT14) Japanese↔English dataset (Nakazawa et al., 2014). Both pairs used standard configurations, such as truecasing, MeCab segmentation for Japanese, MGIZA alignment, grow-diag-final-and phrase extraction, phrase-based Moses, quantized KenLM 5-gram modified Kneser-Ney LMs, and MERT tuning.
The BLEU score differences between using mkcls and our BIRA implementation are small, but there are a few statistically significant changes, using bootstrap resampling (Koehn, 2004). Table 4 presents the BLEU score changes across varying cluster sizes (*: p-value < 0.05, **: p-value < 0.01). MERT tuning is quite erratic, and some of the BLEU differences could be affected by noise in the tuning process in obtaining quality weight values. Using our BIRA implementation reduces the translation model training time with 500 clusters from 20 hours using mkcls (of which 60% of the time is spent on clustering) to just 8 hours (of which 5% is spent on clustering).
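The reported totals and percentages can be split into clustering and non-clustering time with simple arithmetic. The sketch below only restates the figures given above; the function name is illustrative.

```python
def split_hours(total_hours: float, clustering_share: float):
    """Return (clustering hours, remaining hours) for a training run."""
    clustering = total_hours * clustering_share
    return clustering, total_hours - clustering

mkcls_clustering, mkcls_rest = split_hours(20, 0.60)   # mkcls pipeline
bira_clustering, bira_rest = split_hours(8, 0.05)      # BIRA pipeline

print(mkcls_clustering, mkcls_rest)
print(bira_clustering, bira_rest)
print(20 / 8)   # overall training-time ratio
```

Under these figures, clustering drops from about 12 hours to about 0.4 hours, while the rest of the pipeline stays near 8 hours in both cases, so almost all of the saving comes from the clustering step.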

Conclusion
We have presented improvements to the predictive exchange algorithm that address longstanding drawbacks of the original algorithm compared to other clustering algorithms, enabling new directions in using large scale, high cluster-size word classes in NLP. Botros et al. (2015) found that the one-sided model of the predictive exchange algorithm produces better results for training LSTM-based language models compared to two-sided models, while two-sided models generally give better perplexity in class-based LM experiments. Our paper shows that BIRA-based predictive exchange clusters are competitive with two-sided clusters even in a two-sided evaluation. They also give better perplexity than the original predictive exchange algorithm and Brown clustering.