Exploring the Relationship Between Algorithm Performance, Vocabulary, and Run-Time in Text Classification

Text classification is a significant branch of natural language processing with many applications, including document classification and sentiment analysis. Unsurprisingly, practitioners of text classification are concerned with the run-time of their algorithms, many of which depend on the size of the corpus's vocabulary due to their bag-of-words representation. Although many studies have examined the effect of preprocessing techniques on vocabulary size and accuracy, none have examined how these methods affect a model's run-time. To fill this gap, we provide a comprehensive study of how preprocessing techniques affect vocabulary size, model performance, and model run-time, evaluating ten techniques over four models and two datasets. We show that some individual methods can reduce run-time with no loss of accuracy, while some combinations of methods can trade 2-5% of accuracy for up to a 65% reduction in run-time. Furthermore, some combinations of preprocessing techniques can even provide a 15% reduction in run-time while simultaneously improving model accuracy.


Introduction
With the increasing amount of text data available, text analysis has become a significant part of machine learning (ML). Many problems in text analysis use ML methods to perform their task, ranging from classical problems like text classification and topic modeling, to more complex tasks like question answering. Although neural networks have become increasingly common in the research field, many industry NLP problems can be well served by less complex but more efficient and explainable models, such as Support Vector Machines (SVMs) or K-Nearest Neighbors (K-NN).
We focus on the text classification problem, where the dominant approach to using these non-neural models is to first calculate the number of unique terms in the dataset (the vocabulary, size V) and encode each instance of the dataset into a bag-of-words (BoW) representation (Joachims, 1998; Zhang et al., 2010). This results in a high-dimensional vector of size V that indicates whether each given word of the vocabulary was used in this instance.
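As a concrete illustration, the BoW encoding described above can be sketched in a few lines of pure Python (a toy version; practical implementations such as Scikit-Learn's CountVectorizer produce sparse vectors of size V rather than dense lists):

```python
from collections import Counter

def build_vocab(corpus):
    """Collect the unique terms of the corpus (the vocabulary, size V)."""
    vocab = sorted({tok for doc in corpus for tok in doc.split()})
    return {term: i for i, term in enumerate(vocab)}

def bow_vector(doc, vocab):
    """Encode one document as a V-dimensional vector of term counts."""
    counts = Counter(doc.split())
    return [counts.get(term, 0) for term in vocab]

corpus = ["the cat sat", "the dog sat down"]
vocab = build_vocab(corpus)   # V = 5: cat, dog, down, sat, the
vec = bow_vector("the cat sat", vocab)
```

Note that the vector's dimensionality is tied directly to V, which is why the preprocessing techniques discussed below matter for run-time.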
However, the vanilla approach to the BoW representation can lead to sub-par performance, as shown by numerous studies that have examined how preprocessing techniques affect the BoW w.r.t. performance and vocabulary size. These studies have examined this representation in fields such as information retrieval (Chaudhari et al., 2015; Patil and Atique, 2013; Beil et al., 2002; Senuma, 2011), text classification (Yang and Pedersen, 1997; Caragea et al., 2012; Uysal and Gunal, 2014; Vijayarani et al., 2015; Kumar and Harish, 2018; HaCohen-Kerner et al., 2020; Symeonidis et al., 2018) and topic modeling (Schofield and Mimno, 2016; Blei et al., 2003). They suggest a myriad of preprocessing techniques that could improve performance, ranging from choosing features that have high mutual information or low frequency, to simply removing punctuation.
Another related problem of the BoW representation is that this sparse high-dimensional vector does not scale well to datasets with large vocabularies. As preprocessing techniques help contribute to a reduced vocabulary, they should also help alleviate this scaling problem, at least according to folklore. However, to the best of our knowledge, no previous study of preprocessing techniques has examined how they contribute to reduced run-time costs, leading to uncertainty about what these techniques do to mitigate the computational complexity in practice.
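For instance, two of the techniques studied below, stopword removal and rare-word (low-frequency) filtering, can be sketched as a single vocabulary-pruning pass (the stopword list and threshold here are illustrative):

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "is"}

def filter_vocab(corpus, min_count=2, stopwords=STOPWORDS):
    """Drop stopwords and terms occurring fewer than min_count times,
    shrinking the vocabulary (and thus the BoW dimensionality)."""
    counts = Counter(tok for doc in corpus for tok in doc.split())
    return {t for t, c in counts.items()
            if c >= min_count and t not in stopwords}

corpus = ["the cat is fast", "a fast dog", "the dog barked"]
kept = filter_vocab(corpus)   # only the frequent, non-stopword terms survive
```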
To remedy this, we analyze how these preprocessing methods affect not only vocabulary size and performance, but also training and inference time. To do this, we contribute a comprehensive analysis of 10 different preprocessing methods applied to four machine learning models, evaluated on two datasets with widely varying vocabularies (Figure 1). Our results show that the individual preprocessing methods have widely different effects on run-time, with some methods (i.e. rare word filtering and stopword removal) providing significant run-time reductions without losing any performance. We also show that some combinations of preprocessing methods both improve performance and reduce run-time.

Experimental Setup
Datasets To see how preprocessing affects run-time, we examine two datasets (in English): the Amazon corpus (He and McAuley, 2016) and the AP News corpus (MacIntyre, 1998). These datasets were chosen because of the wide disparity between their vocabularies. The Amazon corpus comes from user product reviews (http://jmcauley.ucsd.edu/data/amazon/) and, due to its noisy text, contains a much larger vocabulary relative to its number of documents. The AP News corpus contains professionally-edited news articles, and its vocabulary plateaus much faster than the Amazon corpus's (Figure 1). We perform sentiment analysis on Amazon and year classification on AP News, and report scores with the accuracy metric. We note that we also computed the F1 score alongside accuracy and found the results to be similar; we therefore report accuracy since it is easier to interpret.
To test the effect of corpus size on preprocessing, we sampled various-sized datasets from the original corpora and ran our analysis on each, sampling 5 different times with differing random seeds. However, we found that our results were nearly identical across the differing corpus sizes and thus only report numbers for the 100k size.
Due to the exponential number of possible preprocessing combinations, we run all individual methods but restrict the search space of combinations of these methods. For rare word filtering and word hashing, we first conduct experiments for 9 different levels of filtering individually, using only the best level in future combinations with other methods. Results for all levels of filtering and hashing are in Appendices A and B. We then conduct experiments for all 2^4 = 16 combinations of spelling correction, word segmentation, number removal, and stopword removal, using the best outcome (the pipeline of all four) to combine with other methods. We note that while this is not an exhaustive search of all combinations, our analysis includes the standard preprocessing pipelines as well as many more.
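Since each of the four methods is an on/off choice, the searched subspace can be enumerated as all subsets of the four, from the empty pipeline up to applying all of them:

```python
from itertools import combinations

methods = ["spell", "seg", "num", "stop"]

# All subsets of the four methods: 2^4 = 16 candidate pipelines,
# including the empty (no-preprocessing) baseline.
pipelines = [combo
             for r in range(len(methods) + 1)
             for combo in combinations(methods, r)]
```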

Models
We use Scikit-Learn (Pedregosa et al., 2011) for three of the base algorithms: K-NN (Altman, 1992), Naive Bayes (Rish et al., 2001), and the Support Vector Machine (SVM; Suykens and Vandewalle, 1999). We also employ Vowpal Wabbit (Langford et al., 2007; Karampatziakis and Langford, 2010), due to its strong performance and frequent use in industry. All models use default hyperparameters, and documents are encoded in the sparse BoW representation described above. These four models provide a wide range of algorithms that might be used, allowing us to show how preprocessing methods generalize across models.
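A hypothetical sketch of this setup for the three Scikit-Learn models follows (Vowpal Wabbit, a separate tool, is omitted; the toy data, labels, and the reduced n_neighbors are illustrative choices for this snippet, not the paper's configuration, which uses default hyperparameters):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["good product", "bad product", "great value", "terrible value"]
labels = [1, 0, 1, 0]

models = {
    # n_neighbors shrunk only because this toy set has 4 documents
    "K-NN": KNeighborsClassifier(n_neighbors=1),
    "Naive Bayes": MultinomialNB(),
    "SVM": LinearSVC(),
}

preds = {}
for name, clf in models.items():
    # Sparse BoW features feed directly into each classifier
    pipe = make_pipeline(CountVectorizer(), clf)
    pipe.fit(texts, labels)
    preds[name] = pipe.predict(["good value"])[0]
```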
Compute All experiments were performed using 14-core Intel Broadwell processors running at 2.4GHz with 128GB of DDR4 2400 MT/s RAM.

Results
We format our results relative to the algorithm with no preprocessing, to easily show how preprocessing changes this baseline performance. We first run each algorithm with no preprocessing, measuring the run-time, vocabulary size, and accuracy. We then report the scores of each preprocessing pipeline relative to the algorithm's baseline (e.g. a model with preprocessing that scores 75% of the no-preprocessing baseline's accuracy has a relative accuracy of 0.75).
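This relative scoring is simply a ratio against the no-preprocessing baseline; for example:

```python
def relative_score(score, baseline):
    """Score of a preprocessed run relative to its
    no-preprocessing baseline for the same algorithm."""
    return score / baseline

# A model scoring 60% accuracy against an 80% baseline
# has a relative accuracy of 0.75.
rel = relative_score(0.60, 0.80)
```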
As the cross product of the number of methods and the number of models is still far too large to include in full in this paper, we show the average of each model's relative proportion to its respective baseline performance. (We first compute each algorithm's score relative to its own baseline, e.g. SVM with rare word filtering vs. SVM with no preprocessing, and then average these relative scores across the models for that method, e.g. averaging the relative performance of rare word filtering over K-NN, Naive Bayes, SVM, and Vowpal Wabbit to obtain the final score for rare word filtering.) This aggregation shows us the average relative performance across the four models, helping us generalize our results to be model-independent. For full tables detailing specific model results, see Appendix C. Bold scores in tables indicate statistical similarity to the best score in the column (two-sample t-test, α = 0.05).

Individual Techniques
We present results for the Amazon corpus in Table 1 and for the AP News corpus in Table 2. On Amazon, each individual preprocessing method performs statistically similarly to the baseline's accuracy, while three methods (stopword removal, rare word filtering, and word segmentation) also provide a moderate decrease (20-30%) in train and test time. Rare word filtering and stopword removal are effective across both corpora (rare word filtering is even more effective on AP News, cutting training time in half), while the other methods do not significantly impact either train-time or accuracy on AP News. We hypothesize that these techniques are more effective on the AP corpus because of its much smaller (and less varied) vocabulary.
Combination Techniques The combination techniques also show a mild impact on accuracy, with most methods on both corpora performing statistically similarly to the baseline. On the Amazon corpus, a handful of methods trade 2-5% of accuracy for up to a 65% reduction in training and testing time ("Lowest Train/Test Time" section in Table 1). Those that do not reduce accuracy (such as stop+rare) can still reduce training and testing time by up to 55%. We see in the "Highest Accuracy" section that some methods (e.g. spell+seg+rare) can even improve performance by almost 2% while also reducing run-time by 10-15%. Similarly, on AP News we find combinations with reduced run-time (up to 70% and 50% reductions in train and test time, respectively) with no accuracy loss (but also no gains).
Correlations To show the correlation between run-time and the other variables, we present a heatmap of these correlations in Figure 2. Most of these variables are highly correlated with each other, as expected (training time is highly correlated with testing time, etc.). However, although testing time is highly correlated with vocabulary size (0.8 correlation), training time is not (0.17). We hypothesize that a small vocabulary directly leads to faster inference, while which words are removed from the vocabulary plays a bigger role in how quickly the algorithm converges during training. This hypothesis is also supported by the low correlation between vocabulary size and accuracy, indicating that what is in the vocabulary is more important than its size.

Related Work Prior studies analyze and cross-compare up to 16 different techniques for four machine learning algorithms. In contrast, our work is the first to examine these preprocessing techniques beyond accuracy, examining them in tandem with how they affect vocabulary size and run-time.
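The correlations discussed above are standard Pearson coefficients; a minimal sketch follows (the values below are illustrative, not the paper's measurements):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Toy series: testing time growing roughly linearly with vocabulary size
vocab_size = [10, 20, 30, 40]
test_time = [1.1, 2.0, 2.9, 4.2]
r = pearson(vocab_size, test_time)   # close to 1 for this near-linear pair
```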

Conclusion
In this work we conduct the first study that examines the relationship between vocabulary size, run-time, and accuracy across different models and corpora for text classification. In general, we find that although vocabulary size is highly correlated with testing time, it is not highly correlated with training time or accuracy. In these cases, the specifics of the preprocessing algorithm (the content of what it removes) matter more.
Our experiments show that rare word filtering and stopword removal are superior to many other common preprocessing methods, both in terms of their ability to reduce run-time and their potential to increase accuracy. By using these methods, we show that it is possible to reduce training and testing time by up to 65% with a loss of only 2-5% of accuracy, or in some cases, to provide accuracy and run-time improvements simultaneously. We hope that this study can help both researchers and industry practitioners as they design machine learning pipelines to reach their end-goals.

A Rare Word Filtering
Tables 3 and 4 show the results of rare word filtering on the Amazon and AP News datasets. We filtered at levels corresponding to a geometric progression of values from 1 to half the size of the corpus (we refer to these as levels 1 to 9, with higher numbers indicating more filtering). We find that rare word filtering at higher levels provides greater vocabulary and run-time reductions, while in general also reducing accuracy.
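The nine filtering thresholds can be reconstructed as a hypothetical sketch (assuming evenly geometric spacing between 1 and half the corpus size; the paper does not list the exact values):

```python
def filtering_levels(corpus_size, n_levels=9):
    """Thresholds in geometric progression from 1 up to corpus_size / 2,
    corresponding to filtering levels 1 through n_levels."""
    lo, hi = 1.0, corpus_size / 2
    ratio = (hi / lo) ** (1 / (n_levels - 1))
    return [round(lo * ratio ** i) for i in range(n_levels)]

# For the 100k-document corpora used in the experiments:
levels = filtering_levels(100_000)
```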

B Word Hashing
Tables 5 and 6 show the effect of different levels of word hashing on model accuracy (where "Size" indicates the number of hash buckets used). We find that word hashing with small numbers of buckets reduces vocabulary and run-time, while also decreasing accuracy in general.
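Word hashing (the "hashing trick") maps each term into a fixed number of buckets so that the feature dimension no longer grows with the vocabulary; a minimal sketch (using a stable hash, since Python's built-in hash is salted per process):

```python
import hashlib

def hash_bucket(term, n_buckets):
    """Deterministically map a term to one of n_buckets."""
    digest = hashlib.md5(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def hashed_bow(doc, n_buckets):
    """BoW vector of fixed size n_buckets; distinct terms may collide."""
    vec = [0] * n_buckets
    for tok in doc.split():
        vec[hash_bucket(tok, n_buckets)] += 1
    return vec

vec = hashed_bow("the cat sat on the mat", 8)
```

With few buckets, collisions between distinct terms become common, which is consistent with the accuracy drop observed at small "Size" values in Tables 5 and 6.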

C Full Tables for Method Combinations
In Tables 7 and 8 we show the complete table for preprocessing method combinations on Amazon and AP News respectively.