A Comprehensive Analysis of Preprocessing for Word Representation Learning in Affective Tasks

Affective tasks such as sentiment analysis, emotion classification, and sarcasm detection have been popular in recent years due to an abundance of user-generated data, accurate computational linguistic models, and a broad range of relevant applications in various domains. At the same time, many studies have highlighted the importance of text preprocessing as an integral step in any natural language processing prediction model and downstream task. While preprocessing in affective systems is well-studied, preprocessing in word vector-based models applied to affective systems is not. To address this limitation, we conduct a comprehensive analysis of the role of preprocessing techniques in affective analysis based on word vector models. Our analysis is the first of its kind and provides useful insights into the importance of each preprocessing technique when applied at the training phase (commonly ignored in pretrained word vector models) and/or at the downstream task phase.


Introduction
Affective tasks such as sentiment analysis, emotion classification and sarcasm detection have enjoyed great popularity in recent years. This success can be largely attributed to the fundamental and straightforward nature of the methods employed, the availability of vast amounts of user-generated natural language data, and the wide range of useful applications, spanning from hate speech detection to monitoring the sentiment of financial markets and news recommendation (Djuric et al., 2015; Babanejad et al., 2019). Most early models of affect analysis employed pretrained word embeddings that have been obtained under the assumption of the distributional hypothesis (Mikolov et al., 2013; Devlin et al., 2018). The distributional hypothesis suggests that two words occurring frequently in similar linguistic contexts tend to be more semantically similar, and therefore should be represented closer to one another in the embedding space. However, while such embeddings are useful for several natural language processing (NLP) downstream tasks, they are known to be less suitable for affective tasks in particular (Tang et al., 2014; Agrawal et al., 2018). Although some authors claim that there is a need for post-processing word embeddings for affective tasks, others find that off-the-shelf vectors are very powerful for affective lexicon learning (Lison and Kutuzov, 2017). For example, word2vec (Mikolov et al., 2013) estimates the pair of words 'happy' and 'sad' to be more similar than the pair of words 'happy' and 'joy', which is counterintuitive and might affect the accuracy of models that depend on it.
To address the limitations of traditional word embeddings, several techniques have been proposed, including task-specific fine-tuning (Devlin et al., 2018), retrofitting (Faruqui et al., 2014), representing emotion with vectors using a multi-task training framework (Xu et al., 2018) and generating affective word embeddings (Felbo et al., 2017), to name a few. Other attempts to overcome the limitations of word vectors include optimization of hyperparameters (Levy et al., 2015), as well as fine-tuned preprocessing strategies tailored to different NLP tasks. While these strategies have demonstrated evidence of improving accuracy in tasks such as word similarity and word analogy, among others (Lison and Kutuzov, 2017), their effect on affective tasks has not received considerable attention and remains less explored. Our work is motivated by the observation that preprocessing factors such as stemming, stopwords removal and many others make up an integral part of nearly every improved text classification model, and affective systems in particular (Danisman and Alpkocak, 2008; Patil and Patil, 2013). However, little work has been done towards understanding the role of preprocessing techniques applied to word embeddings in different stages of affective systems. To address this limitation, the overarching goal of this research is to perform an extensive and systematic assessment of the effect of a range of linguistic preprocessing factors pertaining to three affective tasks: sentiment analysis, emotion classification and sarcasm detection. Towards that end, we systematically analyze the effectiveness of applying preprocessing to large training corpora before learning word embeddings, an approach that has largely been overlooked by the community. We investigate the following research questions: (i) what is the effect of integrating preprocessing techniques earlier, into word embedding models, instead of later, in a downstream classification model? (ii) which preprocessing techniques yield the most benefit in affective tasks? (iii) does preprocessing of word embeddings provide any improvement over state-of-the-art pretrained word embeddings, and if yes, how much? Figure 1 illustrates the difference between (a) the preprocessing word embeddings pipeline (Pre) and (b) the preprocessing classification dataset pipeline (Post), where the preprocessing techniques in (a) are applied to the training corpus of the model and in (b) only to the classification dataset. In brief, the main contributions of our work are as follows:
• We conduct a comprehensive analysis of the role of preprocessing techniques in affective tasks (including sentiment analysis, emotion classification and sarcasm detection), employing different models, over nine datasets;
• We perform a comparative analysis of the accuracy of word vector models when preprocessing is applied at the training phase (training data) and/or at the downstream task phase (classification dataset). Interestingly, we obtain the best results when preprocessing is applied only to the training corpus or when it is applied to both the training corpus and the classification dataset of interest;
• We evaluate the performance of our best preprocessed word vector model against state-of-the-art pretrained word embedding models;
• We make source code and data publicly available to encourage reproducibility of results.
The rest of the paper is organized as follows: Section 2 presents an overview of the related work. Section 3 elaborates on the preprocessing techniques employed in the evaluation of models. Section 4 describes the experimental evaluation framework. In Section 5 a comprehensive analysis of the results is provided. Section 6 concludes the paper with key insights of the research.

Related Work
In this section, we present an overview of related work on preprocessing classification datasets and preprocessing word embeddings, and how our work aims to bridge the gap between those efforts.

Preprocessing Classification Datasets
Preprocessing is a vital step in text mining; therefore, evaluation of preprocessing techniques has long been a part of many affective systems. Saif et al. (2014) indicated that, despite its popular use in Twitter sentiment analysis, the use of a precompiled stoplist has a negative impact on classification performance. Angiani et al. (2016) analyzed various preprocessing methods such as stopwords removal, stemming, negation, emoticons, and so on, and found stemming to be the most effective for the task of sentiment analysis. Similarly, Symeonidis et al. (2018) found that lemmatization increases accuracy. Jianqiang and Xiaolin (2017) observed that removing stopwords, numbers, and URLs can reduce noise but does not affect performance, whereas replacing negations and expanding acronyms can improve classification accuracy.
Preprocessing techniques such as punctuation and negation handling (Rose et al., 2018) or pos-tagging and negation (Seal et al., 2020) make up a common component of many emotion classification models (Kim et al., 2018; Patil and Patil, 2013). One of the earliest works (Danisman and Alpkocak, 2008) preserved emotion words and negative verbs during stopwords removal, replaced punctuation with descriptive new words, replaced negative short forms with long forms, and concatenated negative words with emotion words to create new words (e.g., not happy → NOThappy). Although stemming may remove the emotional meaning from some words, it has been shown to improve classification accuracy (Danisman and Alpkocak, 2008; Agrawal and An, 2012). Negations have also been found beneficial, whereas considering intensifiers and diminishers did not lead to any improvements (Strohm, 2017). Pecar et al. (2018) also highlight the importance of preprocessing when using user-generated content, with emoticon processing being the most effective. Along the same lines, while Gratian and Haid (2018) found pos-tags to be useful, Boiy et al. (2007) ignored pos-tagging because of its effect of reducing classification accuracy. The aforementioned works describe preprocessing techniques as applied directly to evaluation datasets in affective systems. In contrast, we examine the effectiveness of directly incorporating these known effective preprocessing techniques further "upstream", into the training corpus of word embeddings, which are widely used across a number of downstream tasks.

Preprocessing Word Embeddings
Through a series of extensive experiments, particularly those related to context window size and dimensionality, Levy et al. (2015) indicate that seemingly minor variations can have a large impact on the success of word representation methods in similarity and analogy tasks, stressing the need for more analysis of often ignored preprocessing settings. Lison and Kutuzov (2017) also present a systematic analysis of context windows based on a set of four hyperparameters, including window position and stopwords removal, where the right context window was found to be better than the left for the English similarity task, and stopwords removal substantially benefited the analogy task but not similarity.
A general space of hyperparameters and preprocessing factors, such as context window size (Hershcovich et al., 2019; Melamud et al., 2016), dimensionality (Melamud et al., 2016) and syntactic dependencies (Levy and Goldberg, 2014; Vulić et al., 2020), and their effect on NLP tasks, including word similarity (Hershcovich et al., 2019), tagging, parsing, relatedness and entailment (Hashimoto et al., 2017) and biomedical tasks (Chiu et al., 2016), has been studied extensively in the literature. The main conclusion of these studies, however, is that these factors are heavily task-specific. Therefore, in this work we explore preprocessing factors for generating word embeddings specifically tailored to affective tasks, which have received little attention.
A recent study investigated the role of tokenizing, lemmatizing, lowercasing and multiword grouping (Camacho-Collados and Pilehvar, 2018) as applied to sentiment analysis and found simple tokenization to be generally adequate. In the task of emotion classification, Mulki et al. (2018) examined the role of four preprocessing techniques as applied to a vector space model based on tf-idf trained on a small corpus of tweets, and found stemming, lemmatization and emoji tagging to be the most effective factors.
Distinct from prior works, we examine a much larger suite of preprocessing factors grounded in insights derived from numerous affective systems, trained over two different corpora, using three different word embedding models. We evaluate the effect of the preprocessed word embeddings in three distinct affective tasks including sentiment analysis, emotion classification and sarcasm detection.

Preprocessing in Affective Systems
This section describes the preprocessing factors applied to the training corpus that is then used to generate word representations, as well as the order in which these factors need to be applied to the corpus.

Preprocessing Factors
Basic: A group of common text preprocessing steps applied at the very beginning, such as removing HTML tags, removing numbers, and lowercasing. This step also removes all common punctuation from text, such as "@%*=()/+", using the NLTK RegexpTokenizer.
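A minimal sketch of this basic step using NLTK; the regular expressions shown are illustrative rather than the exact patterns used in our pipeline:

```python
import re
from nltk.tokenize import RegexpTokenizer

# Word-only tokenization drops punctuation such as "@%*=()/+".
tokenizer = RegexpTokenizer(r"\w+")

def basic_preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)  # remove HTML tags
    text = re.sub(r"\d+", " ", text)      # remove numbers
    text = text.lower()                   # lowercase
    return tokenizer.tokenize(text)       # punctuation is stripped by the tokenizer

print(basic_preprocess("<br />Great movie!!! 10/10 @user :)"))
# ['great', 'movie', 'user']
```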
Spellcheck (spell): A case can be made for either correcting misspellings and typos or leaving them as is, assuming they represent natural language text and its associated complexities. In this step, we identify words that may have been misspelled and correct them. As unambiguous spell corrections are not very common and in most cases we have multiple options for correction, we built our own custom dictionary to suggest replacements by parsing the ukWaC corpus to retrieve a word-frequency list. A misspelled word that has multiple replacements is replaced with the suggested word that has the maximum frequency in the corpus.
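The following sketch illustrates the frequency-based correction; word_freq stands in for the word-frequency list derived from ukWaC, and candidate_corrections for the suggestion mechanism (both are placeholders rather than our actual implementation):

```python
def correct(word, word_freq, candidate_corrections):
    """Replace a misspelled word with its most frequent known correction."""
    if word in word_freq:               # known word: leave unchanged
        return word
    candidates = [c for c in candidate_corrections(word) if c in word_freq]
    if not candidates:                  # no known replacement: keep the token
        return word
    return max(candidates, key=lambda c: word_freq[c])  # resolve by frequency

word_freq = {"happy": 120000, "happily": 30000}
print(correct("happpy", word_freq, lambda w: ["happy", "happily"]))  # 'happy'
```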

Negation (neg):
Negation is a mechanism that transforms a positive argument into its inverse rejection (Benamara et al., 2012). In the task of affective analysis specifically, negation plays a critical role, as negation words can affect the word or sentence polarity, causing the polarity to invert in many cases. Our negation procedure is as follows: (i) Compilation of an antonym dictionary: The first stage involves compiling an antonym dictionary using the WordNet corpus (Miller, 1995). For every synset, there are three possibilities: finding no antonym, one antonym or multiple antonyms. The first two cases are trivial (unambiguous replacements). In the third case (ambiguous replacement), which represents the most common case, we consider, amongst the many choices, the antonym with the maximum frequency in the ukWaC corpus, as described in the previous section; the antonym of a word is then picked at random from one of its senses in our antonym dictionary. (ii) Negation handler: Next, we identify the negation words in tokenized text. If a negation word is found, the token following it (i.e., the negated word) is extracted and its antonym is looked up in the antonym dictionary. If an antonym is found, the negation word and the negated word are replaced with it.
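A rough sketch of stage (i) using NLTK's WordNet interface is shown below; the frequency-based disambiguation assumes the word_freq list from the spellchecking step, and the random choice among senses is omitted for brevity:

```python
from collections import defaultdict
from nltk.corpus import wordnet as wn

def build_antonym_dict(word_freq):
    """Map each lemma to an antonym, resolving ambiguity by corpus frequency."""
    candidates = defaultdict(list)
    for synset in wn.all_synsets():
        for lemma in synset.lemmas():
            for ant in lemma.antonyms():         # zero, one or many antonyms
                candidates[lemma.name().lower()].append(ant.name().lower())
    return {w: max(ants, key=lambda a: word_freq.get(a, 0))
            for w, ants in candidates.items()}
```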
For example, consider the sentence "I am not happy today" in its tokenized form ['I', 'am', 'not', 'happy', 'today']. First, we identify any negation words (i.e., 'not') and their corresponding negated words (i.e., 'happy'). Then, we look up the antonym of 'happy' in the antonym dictionary (i.e., 'sad') and replace the phrase 'not happy' with the word 'sad', resulting in the new sentence "I am sad today".
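A simplified sketch of the negation handler is given below; NEGATION_WORDS and antonym_dict are illustrative stand-ins for the negation-word list and the dictionary built in stage (i):

```python
NEGATION_WORDS = {"not", "no", "never", "n't"}

def handle_negation(tokens, antonym_dict):
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i].lower()
        if tok in NEGATION_WORDS and i + 1 < len(tokens):
            antonym = antonym_dict.get(tokens[i + 1].lower())
            if antonym:              # e.g., 'not happy' -> 'sad'
                out.append(antonym)
                i += 2
                continue
        out.append(tokens[i])
        i += 1
    return out

print(handle_negation(['I', 'am', 'not', 'happy', 'today'], {"happy": "sad"}))
# ['I', 'am', 'sad', 'today']
```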
Parts-of-Speech (pos): Four parts-of-speech classes, namely nouns, verbs, adjectives and adverbs, have been shown to be more informative with regard to affect than the other classes. Thus, using the NLTK pos-tagger, for each sentence in the corpus we retain only the words belonging to one of these four classes, i.e., NN*, JJ*, VB*, and RB*.
Stopwords (stop): Stopwords are generally the most common words in a language, typically filtered out before classification tasks. Therefore, we remove all the stopwords using the NLTK library.
Stemming (stem): Stemming, which reduces a word to its root form, is an essential preprocessing technique in NLP tasks. We use the NLTK Snowball stemmer for stemming our training corpus.
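The pos, stop and stem factors map directly onto standard NLTK components; a minimal sketch, assuming the required NLTK data packages (tagger, stopwords) are downloaded:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

AFFECTIVE_TAGS = ("NN", "JJ", "VB", "RB")   # nouns, adjectives, verbs, adverbs
STOPWORDS = set(stopwords.words("english"))
stemmer = SnowballStemmer("english")

def pos_filter(tokens):
    # Keep only tokens tagged NN*, JJ*, VB* or RB*.
    return [w for w, tag in nltk.pos_tag(tokens) if tag.startswith(AFFECTIVE_TAGS)]

def remove_stopwords(tokens):
    return [w for w in tokens if w.lower() not in STOPWORDS]

def stem(tokens):
    return [stemmer.stem(w) for w in tokens]
```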

Order of Preprocessing Factors
While some preprocessing techniques can be applied independently of each other (e.g., removing stopwords and removing punctuation), others need more careful consideration of the sequence in which they are applied in order to obtain a more stable result. For instance, pos-tagging should be applied before stemming in order for the tagger to work well, and negation should be performed prior to removing stopwords. To this end, we consider the following ordering when combining all the aforementioned preprocessing factors: spellchecking, negation handling, pos classes, removing stopwords, and stemming. Table 1 summarizes the details of our two training corpora with regard to their vocabulary and corpus sizes after applying various preprocessing settings. For some preprocessing factors, such as POS filtering (pos) and stopwords removal (stop), the corpus size reduces dramatically, in some cases by more than 50%, without any significant loss in vocabulary (as indicated by the % ratio of preprocessed to basic), a nontrivial implication with regard to training time.
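Composed in the order stated above, the factors sketched in the previous subsections form a single per-sentence pipeline; this is an illustrative composition of the earlier placeholder helpers, not our exact code:

```python
def preprocess_sentence(text, word_freq, candidate_corrections, antonym_dict):
    tokens = basic_preprocess(text)                                          # basic
    tokens = [correct(w, word_freq, candidate_corrections) for w in tokens]  # spell
    tokens = handle_negation(tokens, antonym_dict)                           # neg
    tokens = pos_filter(tokens)                                              # pos
    tokens = remove_stopwords(tokens)                                        # stop
    return stem(tokens)                                                      # stem
```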

Word Embedding Models
We obtain our preprocessed word representations through three models: (i) CBOW (Continuous Bag-of-Words) and (ii) Skip-gram: While CBOW takes the context of each word as the input and tries to predict the word corresponding to the context, Skip-gram reverses the use of target and context words: the target word is fed at the input, and the output layer of the neural network is replicated multiple times to accommodate the chosen number of context words (Mikolov et al., 2013). We train both models on both training corpora using a min count of 5 for News and 100 for Wikipedia, with window sizes of 5 and 10, respectively, setting the dimensionality to 300.
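A sketch of this training setup with gensim (version 4 or later), using the stated hyperparameters; the corpus iterables and worker count are placeholders:

```python
from gensim.models import Word2Vec

def train_word2vec(sentences, skipgram, min_count, window):
    # sentences: an iterable of token lists produced by the preprocessing pipeline
    return Word2Vec(
        sentences,
        vector_size=300,            # embedding dimensionality
        window=window,
        min_count=min_count,
        sg=1 if skipgram else 0,    # sg=0 -> CBOW, sg=1 -> Skip-gram
        workers=8,
    )

# cbow_news = train_word2vec(news_sentences, skipgram=False, min_count=5, window=5)
# sg_wiki   = train_word2vec(wiki_sentences, skipgram=True, min_count=100, window=10)
```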
(iii) BERT (Bidirectional Encoder Representations from Transformers): BERT is an unsupervised method of pretraining contextualized language representations (Devlin et al., 2018). We train the model using the BERT-large uncased architecture (24 layers, 1024 hidden units, 16 heads, 340M parameters) with the same parameter settings as the original paper.
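For reference, an equivalent BERT-large uncased configuration can be instantiated as below with HuggingFace Transformers; this is illustrative only, since our training used the original TensorFlow implementation on TPUs:

```python
from transformers import BertConfig, BertForPreTraining

config = BertConfig(
    vocab_size=30522,          # BERT uncased WordPiece vocabulary
    hidden_size=1024,
    num_hidden_layers=24,
    num_attention_heads=16,
    intermediate_size=4096,
)
model = BertForPreTraining(config)
print(round(sum(p.numel() for p in model.parameters()) / 1e6), "M parameters")
```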
We train each of the three models (CBOW, Skip-gram and BERT) 8 times using 16 TPUs (64 TPU chips), TensorFlow 1.15 and 1TB of memory on Google Cloud, as well as two clusters of 32 V100/RTX 2080 Ti GPUs with 1TB of memory using the Microsoft CNTK parallelization algorithm on an Amazon server. For a large model such as BERT, each training run takes up to 4-5 days.

Evaluation Datasets
We conduct our evaluation on three tasks, namely sentiment analysis, emotion classification and sarcasm detection. Table 2 presents the details of our evaluation datasets, and some illustrative examples of text are shown in Table 3.
Sentiment Analysis: This popular task involves classifying text as positive or negative, and we use three datasets for evaluation, including (i) IMDB: this dataset includes 50,000 movie reviews for sentiment analysis, consisting of 25,000 negative and 25,000 positive reviews (Maas et al., 2011), and (ii) Airline (see Table 2).
Emotion Classification: This task involves labeling text with an emotion category; we use the ISEAR, Alm and SSEC datasets (see Table 2). The Alm dataset consists of sentences from fairy tales marked with one of five emotion categories: angry-disgusted, fearful, happy, sad and surprised (Cecilia Ovesdotter Alm, 2008).
Sarcasm Detection: Detecting sarcasm from text, a challenging task due to the sophisticated nature of sarcasm, involves labeling text as sarcastic or not. We use the following three datasets: (i) Onion: this news headlines dataset collected sarcastic versions of current events from The Onion and non-sarcastic news headlines from HuffPost (Misra and Arora, 2019), resulting in a total of 28,619 records. (ii) IAC: a subset of the Internet Argument Corpus (Oraby et al., 2016), this dataset contains response utterances annotated for sarcasm; we extract 3,260 instances from the general sarcasm type. (iii) Reddit: the Self-Annotated Reddit Corpus (SARC) is a collection of Reddit posts where sarcasm is labeled by the author, in contrast to other datasets where the data is typically labeled by independent annotators (Khodak et al., 2017).

Text | Label | Dataset
I must admit that this is one of the worst movies I've ever seen. I thought Dennis Hopper had a little more taste than to appear in this kind of yeeeecchh... [truncated] | negative | IMDB
everything was fine until you lost my bag. | negative | Airline
At work, when an elderly man complained unjustifiably about me and distrusted me. | anger | ISEAR
The ladies danced and clapped their hands for joy. | happy | Alm
if this heat is killing me i don't wanna know what the poor polar bears are going through | sadness | SSEC
ford develops new suv that runs purely on gasoline | sarcastic | Onion
Been saying that ever since the first time I heard about creationsism | not-sarcastic | IAC
Remember, it's never a girl's fault, it's always the man's fault. | sarcastic | Reddit
Table 3: Examples of text instances in the evaluation datasets.

Classification Setup
For classification, we employ the LSTM model as it works well with sequential data such as text. For binary classification, such as sentiment analysis and sarcasm detection, the loss function used is the binary cross-entropy along with sigmoid activation:

$$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\big[y_i \log p(y_i) + (1 - y_i)\log(1 - p(y_i))\big],$$

where $y_i$ is the binary representation of the true label, $p(y_i)$ is the predicted probability, and $i$ denotes the $i$-th training sample. For multiclass emotion classification, the loss function used is the categorical cross-entropy over a batch of $N$ instances and $k$ classes, along with softmax activation:

$$\mathcal{L}_{\mathrm{CCE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{k} y_{ij} \log p(y_{ij}),$$

where $p(y_{ij})$ is the predicted probability distribution over the $k$ classes. The optimizer is Adam (Kingma and Ba, 2014); all loss functions are sample-wise, and we take the mean over all samples (epochs = 5, 10; batch size = 64, 128). All sentiment and sarcasm datasets are split into training/testing using 80%/20%, with 10% of the training data held out for validation. For the smaller and imbalanced emotion datasets, we use stratified 5-fold cross-validation. We use a dropout layer to prevent overfitting by ignoring randomly selected neurons during training, and early stopping when the validation loss stops improving (patience = 3, min-delta = 0.0001). The results are reported in terms of weighted F-score (as some emotion datasets are highly imbalanced), where F-score $= \frac{2\,p\,r}{p + r}$, with $p$ denoting precision and $r$ recall.
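A hedged Keras sketch of this setup; the hidden size and the handling of precomputed word vectors as inputs are illustrative assumptions rather than our exact configuration:

```python
import tensorflow as tf

def build_classifier(num_classes, timesteps, emb_dim=300):
    binary = num_classes == 2
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(128, input_shape=(timesteps, emb_dim)),
        tf.keras.layers.Dropout(0.5),   # ignore randomly selected neurons in training
        tf.keras.layers.Dense(1 if binary else num_classes,
                              activation="sigmoid" if binary else "softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy" if binary else "categorical_crossentropy",
                  metrics=["accuracy"])
    return model

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                              patience=3, min_delta=0.0001)
# model.fit(X_train, y_train, validation_split=0.1,
#           epochs=5, batch_size=64, callbacks=[early_stop])
```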

Discussion and Analysis
We analyze the impact of preprocessing techniques in word representation learning on affect analysis.

Effect of Preprocessing Factors
A primary goal of this work is to identify the most effective preprocessing factors for training word embeddings for affective tasks. Table 4 details the results of our experiments comparing the performance of individual preprocessing factors as well as those of ablation studies (i.e., including all the factors but one).
Table 4: F-score results of evaluating the effect of preprocessing factors using CBOW and Skip-gram on the News corpus. The overall best results are in bold. The best result using any single preprocessing setting is underlined.

Observing the performance of the individual factors on the News corpus, we note that even a single simple preprocessing technique can bring improvements, thereby validating our intuition of incorporating preprocessing into the training corpora of word representations. Second, negation (neg) processing appears to be consistently the most effective factor across all nine datasets, indicating its importance in affective classification, followed by parts-of-speech (pos) processing, where we retained only words belonging to one of four classes. On the other hand, removing stopwords (stop), spellchecking (spell) and stemming (stem) yield little improvement and mixed results. Interestingly, applying all the preprocessing factors is barely better, or in some cases even worse (Onion, Reddit and SSEC), than applying just negation. Finally, the best performance comes from combining all the preprocessing factors except stemming (All-stem). Moreover, Table 5 details the performance of ablation studies on the Wikipedia corpus for all three models, where we note that the best performance for the CBOW model comes from combining all the preprocessing factors except stemming (All-stem), whereas for the Skip-gram and BERT models the best results are obtained by applying all the preprocessing factors except stopwords removal (All-stop). Considering that the Wikipedia corpus is almost 160 times bigger than the News corpus, it is unsurprising that the word embeddings obtained from the former yield considerably better results, consistently across all nine datasets.

Evaluating Preprocessing Training Corpora for Word Vectors vs. Preprocessing Classification Data
We investigate the difference between applying preprocessing to the training corpora used for generating word embeddings (Pre) and applying preprocessing to the classification datasets (Post). As an example, during Pre, we first apply the preprocessing techniques (e.g., all but stemming) to the training corpus (e.g., Wikipedia), then generate word embeddings, then convert a classification dataset (e.g., IMDB) into its word embedding representation, and finally classify using the LSTM. Conversely, for Post, we first generate word embeddings from a training corpus (e.g., Wikipedia), then apply the preprocessing techniques (e.g., all but stemming) to the classification dataset (e.g., IMDB), which is then converted to its word vector representation and finally classified using the LSTM. The results of this experiment are presented in Table 6, where we observe that incorporating preprocessing into the training corpora before generating the word embeddings (Pre) is more beneficial than applying it only to the classification datasets (Post).

Evaluating Proposed Model against State-of-the-art Baselines
While not a primary focus of this paper, in this final experiment we compare the performance of our preprocessed word embeddings against those of six state-of-the-art pretrained word embeddings. These vectors, obtained from their original repositories, have been used without any modifications.
(i) GloVe: Global vectors for word representations (Pennington et al., 2014) were trained on aggregated global word co-occurrences. We use the uncased GloVe 6B vectors trained on 6 billion words from Wikipedia and Gigaword. (ii) SSWE: Sentiment-Specific Word Embeddings (unified model) were trained on a corpus of 10 million tweets to encode sentiment information into the continuous representation of words (Tang et al., 2014). (iii) FastText: These pretrained word vectors, based on sub-word character n-grams, were trained on Wikipedia using fastText (Bojanowski et al., 2017), an extension of the word2vec model.

Table 7: F-score results of comparing against state-of-the-art word embeddings. The best score is highlighted in bold, and the second best result is underlined.
(iv) DeepMoji: These word embeddings were trained using a BiLSTM on 1.2 billion tweets with emojis (Felbo et al., 2017). (v) EWE: Emotion-enriched Word Embeddings were learned on a corpus of 200,000 Amazon product reviews using an LSTM model (Agrawal et al., 2018). From the results in Table 7, we notice that BERT is the best on eight out of nine datasets, the exception being one sarcasm dataset (Reddit), while word2vec CBOW is the second best on four datasets. Overall, our analysis suggests that preprocessing at the word embedding stage (Pre) works well for all three affective tasks. Figure 2 summarizes the results obtained for all three tasks in terms of (a) absolute F-scores and (b) relative improvement (best preprocessing over Basic preprocessing). The IMDB dataset achieves the highest F-score overall, most likely because it consists of movie reviews, which are much longer than the text from other genres. As expected, the binary classification tasks of sentiment analysis and sarcasm detection achieve comparable results, while multiclass emotion classification typically has much lower F-scores. The most interesting observation, however, is seen in Fig. 2(b), where the emotion datasets show the highest relative improvement, indicating that multiclass classification tasks may benefit the most from applying preprocessing at the word embedding stage (Pre).

Figure 2: Absolute F-scores vs. relative improvement.

Conclusions
We systematically examined the role of preprocessing the training corpora used to induce word representations for affect analysis. While all preprocessing techniques improved performance to a certain extent, our analysis suggests that the most noticeable increase is obtained through negation processing (neg). The overall best performance is achieved by applying all the preprocessing techniques except stopwords removal (All-stop). Interestingly, incorporating preprocessing into word representations appears to be far more beneficial than applying it in a downstream task to classification datasets. Moreover, while all three affective tasks (sentiment analysis, sarcasm detection and emotion classification) benefit from our proposed preprocessing framework, our analysis reveals that the multiclass emotion classification task benefits the most. Exploring the space of subsets of our preprocessing factors might yield more interesting combinations; we leave this for future work.