On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis

In this paper we investigate the impact of simple text preprocessing decisions (particularly tokenizing, lemmatizing, lowercasing and multiword grouping) on the performance of a state-of-the-art text classifier based on convolutional neural networks. Despite potentially affecting the final performance of any given model, this aspect has not received substantial interest in the deep learning literature. We perform an extensive evaluation on standard benchmarks from text categorization and sentiment analysis. Our results show that a simple tokenization of the input text is often enough, but also highlight the importance of being consistent in the preprocessing of the evaluation set and the corpus used for training word embeddings.


Introduction
Words are often considered as the basic constituents of texts for many languages, including English. The first module in an NLP pipeline is a tokenizer which transforms texts to sequences of words. However, in practice, other preprocessing techniques can be (and are) further used together with tokenization. These include lemmatization, lowercasing and multiword grouping, among others. Although these preprocessing decisions have been studied in the context of conventional text classification techniques (Leopold and Kindermann, 2002; Uysal and Gunal, 2014), little attention has been paid to them in the more recent neural-based models. The most similar study to ours is Zhang and LeCun (2017), which analyzed different encoding levels for English and Asian languages such as Chinese, Japanese and Korean. As opposed to our work, their analysis was focused on UTF-8 bytes, characters, words, romanized characters and romanized words as encoding levels, rather than the preprocessing techniques analyzed in this paper.
Additionally, word embeddings have been shown to play an important role in boosting the generalization capabilities of neural systems (Goldberg, 2016; Camacho-Collados and Pilehvar, 2018). However, while some studies have focused on intrinsically analyzing the role of lemmatization in their underlying training corpus (Ebert et al., 2016; Kuznetsov and Gurevych, 2018), the impact on their extrinsic performance when integrated into a neural network architecture has remained understudied. In this paper we focus on the role of preprocessing the input text, particularly on how it is split into individual (meaning-bearing) tokens and how this affects the performance of standard neural text classification models based on Convolutional Neural Networks (CNN; LeCun et al., 2010; Kim, 2014). CNNs have proven to be effective in a wide range of NLP applications, including text classification tasks such as topic categorization (Johnson and Zhang, 2015; Tang et al., 2015; Xiao and Cho, 2016; Conneau et al., 2017) and polarity detection (Kalchbrenner et al., 2014; Kim, 2014; Dos Santos and Gatti, 2014; Yin et al., 2017), which are the tasks considered in this work. The goal of our evaluation study is to find answers to the following two questions:
1. Are neural network architectures (in particular CNNs) affected by seemingly small preprocessing decisions in the input text?
2. Does the preprocessing of the embeddings' underlying training corpus have an impact on the final performance of a state-of-the-art neural network text classifier?
According to our experiments in topic categorization and polarity detection, these decisions are important in certain cases. Moreover, we shed some light on the motivations of each preprocessing decision and provide some hints on how to normalize the input corpus to better suit each setting.
The accompanying materials of this submission can be downloaded at the following repository: https://github.com/pedrada88/preproc-textclassification.

Text Preprocessing
Given an input text, words are gathered as input units of classification models through tokenization. We refer to the corpus which is only tokenized as vanilla. For example, given the sentence "Apple is asking its manufacturers to move MacBook Air production to the United States." (running example), the vanilla tokenized text would be as follows (white spaces delimiting different word units): Apple is asking its manufacturers to move MacBook Air production to the United States .
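As an illustration of the vanilla setting, the following is a minimal sketch of tokenization. It assumes a simple regular-expression tokenizer as a stand-in for a full tokenizer (the experiments in this paper rely on Stanford CoreNLP, not on this pattern); the names TOKEN_RE and tokenize are hypothetical.

```python
import re

# Hypothetical stand-in for a real tokenizer such as Stanford CoreNLP:
# a token is either a word (optionally with internal hyphens/apostrophes)
# or a single punctuation character.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def tokenize(text):
    """Split raw text into word and punctuation tokens (vanilla setting)."""
    return TOKEN_RE.findall(text)

sentence = ("Apple is asking its manufacturers to move MacBook Air "
            "production to the United States.")
vanilla = tokenize(sentence)

# White spaces delimit the different word units; the final period
# becomes a token of its own.
print(" ".join(vanilla))
```

On the running example this pattern separates the final period from "States", which is the only difference between the raw sentence and its vanilla form.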
We additionally consider three simple preprocessing techniques to be applied to an input text: lowercasing (Section 2.1), lemmatizing (Section 2.2) and multiword grouping (Section 2.3).

Lowercasing
This is the simplest preprocessing technique, which consists of lowercasing every token of the input text: apple is asking its manufacturers to move macbook air production to the united states .
Due to its simplicity, lowercasing has been a popular practice in modules of deep learning libraries and word embedding packages (Pennington et al., 2014; Faruqui et al., 2015). Despite its desirable property of reducing sparsity and vocabulary size, lowercasing may negatively impact a system's performance by increasing ambiguity. For instance, the Apple company in our example and the apple fruit would be considered as identical entities.
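The trade-off between vocabulary size and ambiguity can be made concrete with a minimal sketch. The second sentence about the fruit is a hypothetical example added for illustration, and the function name lowercase is likewise an assumption.

```python
def lowercase(tokens):
    """Lowercase every token of an already tokenized text."""
    return [t.lower() for t in tokens]

# Part of the running example (company sense of "Apple").
company = "Apple is asking its manufacturers to move MacBook Air production".split()
# Hypothetical second sentence (fruit sense of "apple").
fruit = "She ate an apple".split()

vocab_vanilla = set(company) | set(fruit)
vocab_lower = set(lowercase(company)) | set(lowercase(fruit))

# The vocabulary shrinks by one entry because "Apple" and "apple" merge,
# which is precisely the source of the added ambiguity.
print(len(vocab_vanilla), len(vocab_lower))
```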

Lemmatizing
The process of lemmatizing consists of replacing a given token with its corresponding lemma: Apple be ask its manufacturer to move MacBook Air production to the United States .
Lemmatization has traditionally been a standard preprocessing technique for linear text classification systems (Mullen and Collier, 2004; Toman et al., 2006; Hassan et al., 2007). However, it is rarely used as a preprocessing stage in neural-based systems. The main idea behind lemmatization is to reduce sparsity, as different inflected forms of the same lemma may occur infrequently (or not at all) during training. However, this may come at the cost of neglecting important syntactic nuances.
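The sparsity argument can be illustrated with a minimal sketch. It assumes a toy lookup table (LEMMAS, hypothetical) that covers only a handful of forms; an actual system would use a full lemmatizer, such as the one in Stanford CoreNLP used in our experiments.

```python
# Toy lemma table (an assumption for illustration only); unknown tokens
# are left unchanged.
LEMMAS = {
    "is": "be",
    "asks": "ask",
    "asked": "ask",
    "asking": "ask",
    "manufacturers": "manufacturer",
}

def lemmatize(tokens, lemmas=LEMMAS):
    """Replace each token with its lemma when the table knows it."""
    return [lemmas.get(t, t) for t in tokens]

tokens = ("Apple is asking its manufacturers to move MacBook Air "
          "production to the United States .").split()
lemmatized = lemmatize(tokens)
print(" ".join(lemmatized))

# Three inflected forms collapse into one vocabulary entry, which reduces
# sparsity but discards the tense distinction.
print(set(lemmatize(["asks", "asked", "asking"])))
```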

Multiword grouping
This last preprocessing technique consists of grouping consecutive tokens together into a single token if found in a given inventory: Apple is asking its manufacturers to move MacBook_Air production to the United_States .
The motivation behind this step lies in the idiosyncratic nature of multiword expressions (Sag et al., 2002), e.g. United States in the example. The meaning of these multiword expressions is often hardly traceable from their individual tokens. As a result, treating multiwords as single units may lead to better training of a given model. Because of this, word embedding toolkits such as Word2vec propose statistical approaches for extracting these multiwords, or directly include multiwords along with single words in their pretrained embedding spaces (Mikolov et al., 2013b).
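Grouping against a fixed inventory can be sketched as a greedy longest-match pass over the tokens. This is a minimal illustration under stated assumptions, not the procedure used by any particular toolkit: the function name group_multiwords, the max_len limit, the two-entry inventory and the use of an underscore to join grouped tokens are all choices made here for the example.

```python
def group_multiwords(tokens, inventory, max_len=3):
    """Greedily merge the longest span of tokens found in `inventory`.

    `inventory` is a set of token tuples; merged spans are joined with
    an underscore (an assumption made for this sketch).
    """
    grouped, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            span = tuple(tokens[i:i + n])
            if span in inventory:
                grouped.append("_".join(span))
                i += n
                break
        else:
            # No multiword starts here: keep the single token.
            grouped.append(tokens[i])
            i += 1
    return grouped

# Hypothetical inventory covering the running example.
inventory = {("MacBook", "Air"), ("United", "States")}
tokens = ("Apple is asking its manufacturers to move MacBook Air "
          "production to the United States .").split()
grouped = group_multiwords(tokens, inventory)
print(" ".join(grouped))
```

Trying longer spans first means that an inventory containing both a two-token and a three-token expression starting at the same position resolves to the longer one.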

Evaluation
We considered two tasks for our experiments: topic categorization, i.e. assigning a topic to a given document from a pre-defined set of topics, and polarity detection, i.e. detecting if the sentiment of a given piece of text is positive or negative (Dong et al., 2015). Two different settings were studied: (1) the word embeddings' training corpus and the evaluation dataset were preprocessed in a similar manner (Section 3.2); and (2) the two were preprocessed differently (Section 3.3). In what follows we describe the common experimental setting as well as the datasets and preprocessing used for the evaluation.

Experimental setup
We experimented with two classification models. The first one is a standard CNN model similar to that of Kim (2014), using ReLU (Nair and Hinton, 2010) as non-linear activation function. In the second model, we add a recurrent layer (specifically an LSTM (Hochreiter and Schmidhuber, 1997)) before passing the pooled features directly to the fully connected softmax layer. [3] The inclusion of this LSTM layer has been shown to be able to effectively replace multiple layers of convolution and be beneficial particularly for large inputs (Xiao and Cho, 2016). These models were used for both topic categorization and polarity detection tasks, with slight hyperparameter variations given their different natures (mainly in their text size) which were fixed across all datasets. The embedding layer was initialized using 300-dimensional CBOW Word2vec embeddings (Mikolov et al., 2013a) trained on the 3B-word UMBC WebBase corpus (Han et al., 2013) with standard hyperparameters. [4]
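To make the first model more concrete, the following is a minimal pure-Python sketch of the core operations of a Kim (2014)-style CNN classifier: a filter sliding over windows of word vectors, a ReLU activation, max-over-time pooling and a softmax output. It is an illustration under stated assumptions and not the implementation used in our experiments: the toy 2-dimensional embeddings (the paper uses 300 dimensions), the filter weights, the class weights and all function names are hypothetical, training is omitted, the input is assumed to be at least as long as the filter width, and the LSTM layer of the second model is not shown.

```python
import math

def relu(x):
    return max(0.0, x)

def conv_feature(embeddings, kernel, bias=0.0):
    """Apply one filter over every window of len(kernel) word vectors,
    pass each response through ReLU, and keep the maximum
    (max-over-time pooling)."""
    h = len(kernel)
    activations = []
    for start in range(len(embeddings) - h + 1):
        window = embeddings[start:start + h]
        z = bias + sum(w * x
                       for w_row, x_row in zip(kernel, window)
                       for w, x in zip(w_row, x_row))
        activations.append(relu(z))
    return max(activations)

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(embeddings, kernels, class_weights):
    """One pooled feature per filter, then a linear layer and softmax."""
    features = [conv_feature(embeddings, k) for k in kernels]
    scores = [sum(w * f for w, f in zip(row, features))
              for row in class_weights]
    return softmax(scores)

# Hypothetical toy input: four tokens with 2-dimensional embeddings.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# Two hypothetical filters spanning windows of two tokens.
kernels = [[[1.0, 0.0], [0.0, 1.0]],
           [[-1.0, -1.0], [-1.0, -1.0]]]
class_weights = [[1.0, 0.0], [0.0, 1.0]]  # two classes

probs = classify(sentence, kernels, class_weights)
print(probs)
```

Because each filter contributes a single pooled value regardless of the text length, the same classifier can be applied to inputs of different sizes.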
Evaluation datasets. For the topic categorization task we used the BBC news dataset [5] (Greene and Cunningham, 2006), 20News (Lang, 1995), Reuters [6] (Lewis et al., 2004) and Ohsumed.
Preprocessing. Four different techniques (see Section 2) were used to preprocess the datasets as well as the corpus which was used to train word embeddings (i.e. UMBC). For tokenization and lemmatization we relied on Stanford CoreNLP. As for multiwords, we used the phrases from the pre-trained Google News Word2vec vectors, which were obtained using a simple statistical approach (Mikolov et al., 2013b). [12]
[3] The code for this CNN implementation is the same as in (Pilehvar et al., 2017), which is available at https://github.com/pilehvar/sensecnn
[4] Context window of 5 words and hierarchical softmax.
[5] http://mlg.ucd.ie/datasets/bbc.html
[6] Due to the large number of labels in the original Reuters (i.e. 91) and to be consistent with the other datasets, we reduce the dataset to its 8 most frequent labels, a reduction already performed in previous works (Sebastiani, 2002).
[8] Both PL04 and PL05 were downloaded from http://www.cs.cornell.edu/people/pabo/movie-review-data/
[9] http://www.rottentomatoes.com
[10] We mapped the numerical value of phrases to either negative (from 0 to 0.4) or positive (from 0.6 to 1), removing the neutral phrases according to the scale (from 0.4 to 0.6).
[11] For the datasets with train-test partitions, the sizes of the test sets are the following: 7,532 for 20News; 12,733 for Ohsumed; 25,000 for IMDb; and 1,000 for RTC.
[12] For future work it would be interesting to explore more complex methods to learn embeddings for multiword expressions (Yin and Schütze, 2014; Poliak et al., 2017).

Experiment 1: Preprocessing effect
Table 2 shows the accuracy [13] of the classification models using our four preprocessing techniques. We observe a certain variability of results depending on the preprocessing techniques used (average variability of ±2.4% for the CNN+LSTM model, including a statistical significance gap in seven of the nine datasets), which demonstrates the influence of preprocessing on the final results. It is perhaps not surprising that the lowest variance of results is seen in the datasets with the largest training data (i.e. RTC and Stanford). This suggests that the preprocessing decisions are not so important when the training data is large enough, but they are indeed relevant in benchmarks where the training data is limited.
As far as the individual preprocessing techniques are concerned, the vanilla setting (tokenization only) proves to be consistent across datasets and tasks, as it performs in the same ballpark as the best result in 8 of the 9 datasets for both models (with no noticeable differences between topic categorization and polarity detection). The only topic categorization dataset in which tokenization does not seem enough is Ohsumed, which, unlike the more general nature of other categorization datasets (news), belongs to a specialized domain (medical) for which fine-grained distinctions are required to classify cardiovascular diseases. In particular for this dataset, word embeddings trained on a general-domain corpus like UMBC may not accurately capture the specialized meaning of medical terms and hence, sparsity becomes an issue. In fact, lowercasing and lemmatizing, which are mainly aimed at reducing sparsity, outperform the vanilla setting by over six points in the CNN+LSTM setting and clearly outperform the other preprocessing techniques on the single CNN model as well.
Nevertheless, the use of more complex preprocessing techniques such as lemmatization and multiword grouping does not help in general. Even though lemmatization has proved useful in conventional linear models as an effective way to deal with sparsity (Mullen and Collier, 2004; Toman et al., 2006), neural network architectures seem to be more capable of overcoming sparsity thanks to the generalization power of word embeddings.
[13] Computed by averaging the accuracy of two different runs. The statistical significance was calculated according to an unpaired t-test at the 5% significance level.
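The significance testing used in this experiment (accuracy averaged over two runs, unpaired t-test at the 5% level) can be sketched as follows. This is an illustration under stated assumptions: the accuracy values are hypothetical, and the pooled-variance (Student) form of the unpaired test is assumed here, since the exact variant is not specified. With two runs per setting there are 2 degrees of freedom, for which the two-tailed 5% critical value is 4.303.

```python
import math
from statistics import mean, variance

# Two-tailed 5% critical value of Student's t for 2 degrees of freedom
# (two runs per setting: 2 + 2 - 2 = 2).
T_CRIT_DF2 = 4.303

def unpaired_t(a, b):
    """Pooled-variance t statistic and degrees of freedom for two
    independent samples (an assumed variant of the unpaired test)."""
    n1, n2 = len(a), len(b)
    pooled = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    se = math.sqrt(pooled * (1 / n1 + 1 / n2))
    return (mean(a) - mean(b)) / se, n1 + n2 - 2

# Hypothetical accuracies from two runs of two preprocessing settings.
setting_a = [0.90, 0.92]
setting_b = [0.80, 0.82]

t_stat, df = unpaired_t(setting_a, setting_b)
significant = df == 2 and abs(t_stat) > T_CRIT_DF2
print(f"mean A = {mean(setting_a):.3f}, mean B = {mean(setting_b):.3f}, "
      f"t = {t_stat:.2f}, significant = {significant}")
```

With only two runs per setting the critical value is large, so a gap between two settings has to be several times the run-to-run spread before it counts as significant.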

Experiment 2: Cross-preprocessing
This experiment aims at studying the impact of using different word embeddings (with differently preprocessed training corpora) on tokenized datasets (vanilla setting). Table 3 shows the results for this experiment. In this experiment we observe a different trend, with multiword-enhanced vectors exhibiting a better performance both on the single CNN model (best overall performance in seven of the nine datasets) and on the CNN+LSTM model (best performance in four datasets and in the same ballpark as the best results in four of the remaining five datasets). In this case the same set of words is learnt but single tokens inside multiword expressions are not trained. Instead, these single tokens are considered in isolation only, without the added noise when considered inside the multiword expression as well. For instance, the word Apple has a clearly different meaning in isolation from the one inside the multiword expression Big Apple, hence it can be seen as beneficial not to train the word Apple when part of this multiword expression.
Table 3: Cross-preprocessing evaluation: accuracy on the topic categorization and polarity detection tasks using different sets of word embeddings to initialize the embedding layer of the two classifiers. All datasets were preprocessed similarly according to the vanilla setting. † indicates results that are statistically significant with respect to the top result.
Interestingly, using multiword-wise embeddings on the vanilla setting leads to consistently better results than using them on the same multiword-grouped preprocessed dataset in eight of the nine datasets. This could provide hints on the excellent results provided by pre-trained Word2vec embeddings trained on the Google News corpus, which learns multiwords similarly to our setting. Apart from this somewhat surprising finding, the use of the embeddings trained on a simple tokenized corpus (i.e. vanilla) proved again competitive, as different preprocessing techniques such as lowercasing and lemmatizing do not seem to help. In fact, the relatively weaker performance of lemmatization and lowercasing in this cross-preprocessing experiment is somewhat expected as the coverage of word embeddings in vanilla-tokenized datasets is limited, e.g., many entities which are capitalized in the datasets are not covered in the case of lowercasing, and inflected forms are missing in the case of lemmatizing.
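The coverage argument can be illustrated with a minimal sketch that measures which fraction of the tokens of a vanilla-tokenized text have an entry in differently preprocessed embedding vocabularies. The three toy vocabularies and the function name coverage are hypothetical and built only from the running example.

```python
def coverage(tokens, embedding_vocab):
    """Fraction of dataset tokens that have a pre-trained vector."""
    if not tokens:
        return 0.0
    return sum(t in embedding_vocab for t in tokens) / len(tokens)

# Vanilla-tokenized dataset text (part of the running example).
dataset_tokens = ("Apple is asking its manufacturers to move MacBook Air "
                  "production").split()

# Toy vocabularies of embeddings trained on differently preprocessed corpora.
vanilla_vocab = set(dataset_tokens)
lower_vocab = {t.lower() for t in dataset_tokens}
lemma_vocab = {"Apple", "be", "ask", "its", "manufacturer", "to", "move",
               "MacBook", "Air", "production"}

cov_vanilla = coverage(dataset_tokens, vanilla_vocab)
cov_lower = coverage(dataset_tokens, lower_vocab)   # capitalized entities missing
cov_lemma = coverage(dataset_tokens, lemma_vocab)   # inflected forms missing
print(cov_vanilla, cov_lower, cov_lemma)
```

In this toy case the lowercased vocabulary misses the three capitalized tokens and the lemmatized vocabulary misses the three inflected forms, mirroring the two sources of limited coverage described above.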

Conclusions
In this paper we analyzed the impact of simple text preprocessing decisions on the performance of a standard word-based neural text classifier. Our evaluations highlight the importance of being careful in the choice of how to preprocess our data and of being consistent when comparing different systems. In general, a simple tokenization works as well as or better than more complex preprocessing techniques such as lemmatization or multiword grouping, except for domain-specific datasets (such as the medical dataset in our experiments) in which sole tokenization performs poorly. Additionally, word embeddings trained on multiword-grouped corpora perform surprisingly well when applied to simple tokenized datasets. This property has often been overlooked and, to the best of our knowledge, we test the hypothesis for the first time. In fact, this finding could partially explain the long-lasting success of pre-trained Word2vec embeddings, which specifically learn multiword embeddings as part of their pipeline (Mikolov et al., 2013b).
Moreover, our analysis shows that there is a high variance in the results depending on the preprocessing choice (±2.4% on average for the best performing model), especially when the training data is not large enough to generalize. Further analysis and experimentation would be required to fully understand the significance of these results, but this work can be viewed as a starting point for studying the impact of text preprocessing in deep learning models. We hope that our findings will encourage future researchers to carefully select and report these preprocessing decisions when evaluating or comparing different models. Finally, as future work, we plan to extend our analysis to other tasks (e.g. question answering), languages (particularly morphologically rich languages for which these results may vary) and preprocessing techniques (e.g. stopword removal or part-of-speech tagging).