Bag of Tricks for Efficient Text Classification

This paper explores a simple and efficient baseline for text classification. Our experiments show that our fast text classifier fastText is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation. We can train fastText on more than one billion words in less than ten minutes using a standard multicore CPU, and classify half a million sentences among 312K classes in less than a minute.


Introduction
Building good representations for text classification is an important task with many applications, such as web search, information retrieval, ranking and document classification (Deerwester et al., 1990; Pang and Lee, 2008). Recently, models based on neural networks have become increasingly popular for computing sentence representations (Bengio et al., 2003; Collobert and Weston, 2008).
While these models achieve very good performance in practice (Kim, 2014; Zhang and LeCun, 2015; Zhang et al., 2015), they tend to be relatively slow both at train and test time, limiting their use on very large datasets.
At the same time, simple linear models have also shown impressive performance while being very computationally efficient (Mikolov et al., 2013; Levy et al., 2015). They usually learn word level representations that are later combined to form sentence representations. In this work, we propose an extension of these models to directly learn sentence representations. We show that by incorporating additional statistics such as using bag of n-grams, we reduce the gap in accuracy between linear and deep models, while being many orders of magnitude faster.
Our work is closely related to standard linear text classifiers (Joachims, 1998; McCallum and Nigam, 1998; Fan et al., 2008). Similar to Wang and Manning (2012), our motivation is to explore simple baselines inspired by models used for learning unsupervised word representations. As opposed to Le and Mikolov (2014), our approach does not require sophisticated inference at test time, making its learned representations easily reusable on different problems. We evaluate the quality of our model on two different tasks, namely tag prediction and sentiment analysis.

Model architecture
A simple and efficient baseline for sentence classification is to represent sentences as bag of words (BoW) and train a linear classifier, for example a logistic regression or support vector machine (Joachims, 1998; Fan et al., 2008). However, linear classifiers do not share parameters among features and classes, possibly limiting generalization.
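As a concrete illustration of this baseline, the sketch below turns sentences into bag-of-words count vectors that any linear classifier could consume. It uses plain Python and a hypothetical toy corpus; the function names and data are ours, not part of any particular library.

```python
from collections import Counter

def build_vocab(sentences):
    """Map each word seen in the corpus to an integer index."""
    vocab = {}
    for sent in sentences:
        for word in sent.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def bow_vector(sentence, vocab):
    """Represent a sentence as a dense vector of word counts.

    Word order is discarded: only how often each vocabulary word
    appears matters, which is exactly the BoW assumption."""
    counts = Counter(sentence.split())
    return [counts.get(word, 0) for word in vocab]

# Toy corpus (hypothetical data, for illustration only).
corpus = ["the movie was great", "the movie was terrible"]
vocab = build_vocab(corpus)
vec = bow_vector("the movie was great great", vocab)
```

Each sentence becomes a fixed-length vector, so a logistic regression or SVM can be trained directly on these features; the lack of parameter sharing mentioned above shows up here as one independent weight per (word, class) pair.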
Common solutions to this problem are to factorize the linear classifier into low rank matrices (Schutze, 1992; Mikolov et al., 2013) or to use multilayer neural networks (Collobert and Weston, 2008; Zhang et al., 2015). In the case of neural networks, the information is shared via the hidden layers.
Figure 1 shows a simple model with 1 hidden layer. The first weight matrix can be seen as a look-up table over the words of a sentence. The word representations are averaged into a text representation, which is in turn fed to a linear classifier. This architecture is similar to the cbow model of Mikolov et al. (2013), where the middle word is replaced by a label. The model takes a sequence of words as input and produces a probability distribution over the predefined classes. We use a softmax function to compute these probabilities.
Training such a model is similar in nature to word2vec, i.e., we use stochastic gradient descent and backpropagation (Rumelhart et al., 1986) with a linearly decaying learning rate. Our model is trained asynchronously on multiple CPUs.
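The forward pass described above can be sketched as follows. This is a minimal illustration with toy dimensions and random weights, not the authors' implementation: look up each word's vector, average them into a text vector, apply the output linear classifier, and normalize with a softmax.

```python
import math
import random

random.seed(0)

V, d, K = 100, 10, 3  # vocab size, hidden dim, number of classes (toy values)

# First weight matrix: a look-up table over words.
embed = [[random.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(V)]
# Output linear classifier over the K classes.
out_w = [[random.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(K)]

def forward(word_ids):
    """Average the word vectors into a text representation, feed it to a
    linear classifier, and compute class probabilities with a softmax."""
    hidden = [sum(embed[w][j] for w in word_ids) / len(word_ids)
              for j in range(d)]
    scores = [sum(wk[j] * hidden[j] for j in range(d)) for wk in out_w]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

probs = forward([3, 17, 42])  # a "sentence" of three word ids
```

Training would then follow a cross-entropy gradient through this pass with SGD and a linearly decaying learning rate, as stated above.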

Hierarchical softmax
When the number of targets is large, computing the linear classifier is computationally expensive. More precisely, the computational complexity is O(Kd) where K is the number of targets and d the dimension of the hidden layer. In order to improve our running time, we use a hierarchical softmax (Goodman, 2001) based on a Huffman coding tree (Mikolov et al., 2013). During training, the computational complexity drops to O(d log2(K)). In this tree, the targets are the leaves.
The hierarchical softmax is also advantageous at test time when searching for the most likely class. Each node is associated with a probability that is the probability of the path from the root to that node. If the node is at depth l + 1 with parents n_1, ..., n_l, its probability is

P(n_{l+1}) = ∏_{i=1}^{l} P(n_i).

This means that the probability of a node is always lower than the one of its parent. Exploring the tree with a depth first search and tracking the maximum probability among the leaves allows us to discard any branch associated with a smaller probability. In practice, we observe a reduction of the complexity to O(d log2(K)) at test time. This approach is further extended to compute the T-top targets at the cost of O(log(T)), using a binary heap.
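The pruning idea can be sketched on a toy tree. The structure below is purely illustrative (a hand-built tree with made-up branch probabilities, not the actual Huffman tree or the scores a trained model would produce): since a node's probability is the product of probabilities along its path, any branch whose running product already falls below the best leaf found so far can be discarded.

```python
# Toy binary tree: each internal node stores the probability of taking its
# left branch; the right branch gets the complement. Leaves are classes.
tree = ("n", 0.6,
        ("n", 0.7, ("leaf", "A"), ("leaf", "B")),
        ("n", 0.2, ("leaf", "C"), ("leaf", "D")))

def best_leaf(node, prob=1.0, best=("?", 0.0)):
    """Depth-first search for the most likely leaf, discarding any branch
    whose path probability is already below the best leaf found so far."""
    if prob <= best[1]:          # prune: descendants can only be less likely
        return best
    if node[0] == "leaf":
        return (node[1], prob) if prob > best[1] else best
    _, p_left, left, right = node
    best = best_leaf(left, prob * p_left, best)
    best = best_leaf(right, prob * (1 - p_left), best)
    return best

label, p = best_leaf(tree)
```

Here the whole right subtree (path probability 0.4) is skipped once leaf "A" is found with probability 0.6 × 0.7 = 0.42, which is the kind of pruning that yields the observed O(d log2(K)) test-time behavior.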

N-gram features
Bag of words is invariant to word order, but taking this order into account explicitly is often computationally very expensive. Instead, we use a bag of n-grams as additional features to capture some partial information about the local word order. This is very efficient in practice while achieving comparable results to methods that explicitly use the order (Wang and Manning, 2012).
We maintain a fast and memory efficient mapping of the n-grams by using the hashing trick (Weinberger et al., 2009) with the same hashing function as in Mikolov et al. (2011), using 10M bins if we only use bigrams, and 100M otherwise.
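The hashing trick can be sketched as follows. This is a simplified illustration: it uses Python's built-in `hash` rather than the hashing function of Mikolov et al. (2011), and the bucket and vocabulary sizes are just the defaults mentioned in the text. Each n-gram is mapped into a fixed number of buckets placed after the word ids, so no n-gram dictionary needs to be stored.

```python
def ngram_ids(words, n_max=2, buckets=10_000_000, vocab_size=300_000):
    """Map n-grams (bigrams and up to n_max) to feature ids via the hashing
    trick: n-grams share a fixed number of buckets offset past the word ids.

    Note: Python's hash() stands in for the hashing function actually used
    in fastText; collisions between different n-grams are accepted by design.
    """
    ids = []
    for n in range(2, n_max + 1):
        for i in range(len(words) - n + 1):
            gram = " ".join(words[i:i + n])
            ids.append(vocab_size + hash(gram) % buckets)
    return ids

feats = ngram_ids("the cat sat on the mat".split())
```

A sentence of 6 words thus contributes 5 bigram features on top of its word features, with memory bounded by the number of buckets rather than the number of distinct n-grams.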

Sentiment analysis
Datasets and baselines. We employ the same 8 datasets and evaluation protocol as Zhang et al. (2015). We report the N-grams and TFIDF baselines from Zhang et al. (2015), as well as the character level convolutional model (char-CNN) of Zhang and LeCun (2015) and the very deep convolutional network (VDCNN) of Conneau et al. (2016).
We also compare to Tang et al. (2015) following their evaluation protocol. We report their main baselines as well as their two approaches based on recurrent networks (Conv-GRNN and LSTM-GRNN).

Model                        AG    Sogou  DBP   Yelp P.  Yelp F.  Yah. A.  Amz. F.  Amz. P.
BoW (Zhang et al., 2015)     88.8  92.9   96.6  92.2     58.0     68.9     54.6     90.4
ngrams (Zhang et al., 2015)  92    …

Results. We present the results in Table 1. We use 10 hidden units and run fastText for 5 epochs with a learning rate selected on a validation set from {0.05, 0.1, 0.25, 0.5}. On this task, adding bigram information improves the performance by 1-4%. Overall our accuracy is slightly better than char-CNN and a bit worse than VDCNN. Note that we can increase the accuracy slightly by using more n-grams; for example, with trigrams the performance on Sogou goes up to 97.1%. Finally, Table 3 shows that our method is competitive with the methods presented in Tang et al. (2015). We tune the hyper-parameters on the validation set and observe that using n-grams up to 5 leads to the best performance. Unlike Tang et al. (2015), fastText does not use pre-trained word embeddings, which may explain the 1% difference in accuracy.
Training time. Both char-CNN and VDCNN are trained on an NVIDIA Tesla K40 GPU, while our models are trained on a CPU using 20 threads. Table 2 shows that methods using convolutions are several orders of magnitude slower than fastText.
Note that for char-CNN, we report the time per epoch while we report overall training time for the other methods. While it is possible to get a 10× speed-up for char-CNN by using more recent CUDA implementations of convolutions, fastText takes less than a minute to train on these datasets. Our speed-up over CNN-based methods increases with the size of the dataset.

Tag prediction
Dataset and baseline. To test the scalability of our approach, further evaluation is carried out on the YFCC100M dataset (Ni et al., 2015), which consists of almost 100M images with captions, titles and tags. We focus on predicting the tags according to the title and caption (we do not use the images). We remove the words and tags occurring less than 100 times and split the data into a train, validation and test set. The train set contains 91,188,648 examples (1.5B tokens). The validation set has 930,497 examples and the test set 543,424. The vocabulary size is 297,141 and there are 312,116 unique tags. We will release a script that recreates this dataset so that our numbers can be reproduced. We report precision at 1.

We consider a frequency-based baseline which predicts the most frequent tag. We also compare with Tagspace (Weston et al., 2014), which is a tag prediction model similar to ours, but based on the Wsabie model of Weston et al. (2011). While the Tagspace model is described using convolutions, we consider the linear version, which achieves comparable performance but is much faster.

Results and training time. Table 5 presents a comparison of fastText and the baselines. We run fastText for 5 epochs and compare it to Tagspace for two sizes of the hidden layer, i.e., 50 and 200. Both models achieve a similar performance with a small hidden layer, but adding bigrams gives us a significant boost in accuracy. At test time, Tagspace needs to compute the scores for all the classes, which makes it relatively slow, while our fast inference gives a significant speed-up when the number of classes is large (more than 300K here). Overall, we are more than an order of magnitude faster to obtain a model with a better quality. The speed-up of the test phase is even more significant (a 600× speed-up). Table 4 shows some qualitative examples. FastText learns to associate words in the caption with their hashtags, e.g., "christmas" with "#christmas". It also captures simple relations between words, such as "snowfall" and "#snow". Finally,
using bigrams also allows it to capture relations such as "twin cities" and "#minneapolis".

Discussion and conclusion
In this work, we have developed fastText, which extends word2vec to tackle sentence and document classification. Unlike unsupervisedly trained word vectors from word2vec, our word features can be averaged together to form good sentence representations. In several tasks, we have obtained performance on par with recently proposed methods inspired by deep learning, while observing a massive speed-up. Although deep neural networks have in theory much higher representational power than shallow models, it is not clear if simple text classification problems such as sentiment analysis are the right ones to evaluate them. We will publish our code so that the research community can easily build on top of our work.

Figure 1: Model architecture for fast sentence classification.

Table 1: Test accuracy [%] on sentiment datasets. FastText has been run with the same parameters for all the datasets. It has 10 hidden units and we evaluate it with and without bigrams. For VDCNN and char-CNN, we show the best reported numbers without data augmentation.

Table 2: Training time on sentiment analysis datasets compared to char-CNN and VDCNN. We report the overall training time, except for char-CNN where we report the time per epoch. * Training time for a single epoch.

Table 4: Examples from the validation set of the YFCC100M dataset obtained with fastText with 200 hidden units and bigrams. We show a few correct and incorrect tag predictions.

Table 5: Prec@1 on the test set for tag prediction on YFCC100M. We also report the training time and test time.