UIUC at SemEval-2018 Task 1: Recognizing Affect with Ensemble Models

Our submission to the SemEval-2018 Task 1: Affect in Tweets shared task competition is a supervised learning model relying on standard lexicon features coupled with word embedding features. We used an ensemble of diverse models, including random forests, gradient boosted trees, and linear models, corrected for the training-development set mismatch. We submitted the system's output for subtasks 1 (emotion intensity prediction), 2 (emotion ordinal classification), 3 (valence intensity regression) and 4 (valence ordinal classification), for English tweets. We placed 25th, 19th, 24th and 15th in the four subtasks, respectively. The baseline considered was a Support Vector Machine (SVM) model with a linear kernel on the lexicon- and embedding-based features. Our system's final performance, measured in Pearson correlation scores, outperformed the baseline by a margin of 2.2% to 14.6% across all tasks.


Introduction
Affective computing deals with the recognition, interpretation, processing, and simulation of human affects. It is a highly interdisciplinary field at the heart of a broad range of technological applications in health care, media & advertisement, automotive, and others.
Although emotions are a fundamental feature of human experience, they have long been ignored by technology development, mainly due to their complex and subjective nature, as well as the lack of learning capabilities to detect them. Current affective computing systems focus mainly on facial expressions, body language, speech (tone of voice, rhythm, etc.), keystroke dynamics, as well as physiological input (e.g., heart rate, body temperature) to capture and process changes in a user's emotional state. However, in environments such as social media and Internet forums, most often the only available signal is written language. And since language per se carries only a small portion of human communication (Mehrabian, 1981), emotions are not easy to detect.
Although emotion detection is directly related to the more popular task of sentiment analysis, they differ in many respects. Sentiment Analysis aims to detect the positive, neutral, or negative orientation of the text, while emotion detection focuses on recognizing and classifying text snippets into a set of predefined, more or less universal emotions. Various such classification models have been proposed, two famous ones being Ekman's (Ekman, 1997) six basic emotions (anger, happiness, surprise, disgust, sadness, and fear) and Plutchik's wheel of eight emotions (Plutchik, 2001), where each primary emotion has a polar opposite (joy, trust, fear, surprise, sadness, anticipation, anger, and disgust).
To date, there are many freely available tools for sentiment polarity classification of input text, yet not so many exist for emotion detection. Major challenges are: (1) the difficulty in establishing ground truth for various emotions, (2) the high variability, vagueness, ambiguity, and implicitness of language that can make the detection very problematic, (3) the scarcity of non-verbal clues in written communication, as well as (4) the challenge of getting access to and being able to process the right type of context. Challenge (3) is often illustrated by the "7% Rule" (Mehrabian, 1981): only 7% of human communication is verbal, while the remaining 93% is comprised of tone of voice (38%) and body language (55%).
This year, SemEval 2018 hosts Task 1: Affect in Tweets (Mohammad et al., 2018), a shared task competition aiming to predict emotions and sentiment in tweets. There are five sub-tasks (Table 1). The participating systems have to automatically determine the intensity of emotions (E) and the intensity of sentiment (i.e., valence V) from a collection of tweets, as experienced by the authors of these tweets. The organizers also include a multi-label emotion classification task for tweets. For each task, separate training and test data sets are provided to the participants for each language considered.

The contributions of the UIUC system are as follows: (1) In this competition, we demonstrate the use of a system that uses lexicon- and embedding-based features in an ensemble model of diverse approaches such as random forests, gradient boosted trees, and linear classifiers. We demonstrate how their combination in the final ensemble outperforms each of the individual methods.
(2) We account for the train-development mismatch in the dataset by training a separate model to learn this mismatch. (3) We analyze the UIUC system and several variants of it, some of which improve on its performance. (4) We also perform an error analysis of "difficult" tweets, and explore areas for improvement of the model.

Related Work
Word-Emotion Lexicons: Word-emotion lexicons map the words in a vocabulary to emotion ratings. Some lexicons map words to discrete emotions, such as General Inquirer (Stone et al., 1962), Wordnet Affect (Strapparava et al., 2004) and the NRC-10 Emotion Lexicon (Mohammad and Turney, 2013). Others, such as Affective Norms for English Words (ANEW) (Bradley and Lang, 1999) and the WKB Corpus (Warriner et al., 2013), map them to dimensions such as valence, arousal and dominance.
Sentence-Level Labeled Corpora: Large scale corpora annotated with sentence-level emotion labels are uncommon in the literature. Affective Text (Strapparava and Mihalcea, 2007), created for SemEval 2007, contains emotion annotations for headlines of news articles. Alm et al. annotated about 185 children's stories with the Ekman labels. Aman and Szpakowicz annotated 5,000 sentences with additional labels for intensity and emotion-bearing phrases. Preotiuc-Pietro et al. annotated 3,000 social media posts for valence and arousal, making this one of the few datasets that contains annotations based on the VAD model.
Approaches: Rule-based approaches incorporate domain knowledge. This can include term-based n-gram features, distance between certain terms, or pre-specified POS patterns. Early work in this area focused mainly on linguistic heuristics (Hatzivassiloglou and McKeown, 1997). However, a major drawback of these rule-based approaches is that they are unable to detect novel expressions of sentiment. Keyword-based approaches classify text based on the detection of unambiguous words in language. They depend on large scale lexicons with affective labels for words, such as NRC (Mohammad and Turney, 2013). Knowledge-based approaches use web ontologies or semantic networks. A major advantage of such systems is that they can use conceptual ideas derived from world knowledge (Cambria and Hussain, 2012). Recently, distributed approaches have been proposed that leverage word embeddings and train deep neural networks on the embedding space (Mohammad and Bravo-Marquez, 2017a).
Shared evaluations have encouraged the community to create benchmarks over shared tasks, and have been organized frequently. The Affective Text task at SemEval 2007 (Strapparava and Mihalcea, 2007) asked its participants to predict emotion labels for headlines of news articles. More recently, the Shared Task on Emotion Intensity (EmoInt) at WASSA 2017 (Mohammad and Bravo-Marquez, 2017a), had 22 participating teams who were given a corpus of 3,960 English tweets annotated with a continuous intensity score for each of four of Ekman's basic emotions: anger, fear, joy and sadness.

Dataset
Tasks 1 and 2 share the same training and development data sets: a total of 7,500 sentences in training and about 1,600 sentences in development across the four emotions: anger, fear, joy, sadness. It is interesting to note that the training data sets for the emotions of fear, anger and sadness overlap significantly: all pairs have a Jaccard similarity of over 0.5. This means that over 67% of the data sets across these emotions contain the same tweets.
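The overlap figures above can be reproduced with a few lines of code. The sketch below is illustrative only: the tweet-ID sets are made up. It also shows how a Jaccard score relates to the fraction of shared tweets: for equal-sized sets, a Jaccard score j corresponds to an overlap fraction of 2j/(1+j), so j > 0.5 implies over 67% shared tweets.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of tweet IDs."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical tweet-ID sets for two emotion data sets.
anger = {1, 2, 3, 4, 5, 6}
fear  = {2, 3, 4, 5, 6, 7}

j = jaccard(anger, fear)
overlap_fraction = 2 * j / (1 + j)  # fraction of each set shared, for equal sizes
print(j, overlap_fraction)          # j ≈ 0.71 here, overlap ≈ 0.83
```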
Tasks 3 and 4 share the same data sets as well, for a total of 1,200 tweets in training and 450 tweets in development across the four emotions.
Another interesting overlap is between the tweet collections for Tasks 5 and 1 (and therefore Task 2): The data set for Task 5 appears to be made up largely of the tweets for Task 1, for both the training and development sets. These overlaps of the training and development data sets across all emotions gave us the idea to tackle all tasks using a common set of features. For instance, Tasks 2 and 4 may be solved by simply transforming the output of Tasks 1 and 3, respectively. Task 5 involves a multi-label classification and thus, needs more thought.
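As a concrete illustration of solving an ordinal task by transforming a regression output, the sketch below thresholds a continuous intensity into one of four ordered classes. The thresholds are hypothetical; in practice the cut points would be tuned on the development data.

```python
import numpy as np

def to_ordinal(intensity, thresholds=(0.25, 0.5, 0.75)):
    """Map a continuous intensity in [0, 1] to an ordinal class 0..3.

    The thresholds here are illustrative, not the ones used by the system.
    """
    return int(np.searchsorted(thresholds, intensity, side="right"))

print([to_ordinal(x) for x in (0.1, 0.3, 0.6, 0.9)])  # [0, 1, 2, 3]
```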
In the test set, with the exception of the first 1,000 or so sentences, nearly 95% of the total sentences for Tasks 1A and 3A (i.e., for English) are the so-called "mystery" sentences, meaning essentially neutral sentences without any emotional content. The scores reported by the organizers are for the non-mystery sentences only (i.e., non-neutral).

The UIUC System
Our system takes as input features from affective lexicons and word embeddings trained on affective Twitter corpora. We then train an ensemble of diverse models over these features. Given that the training and development labels are not directly comparable, we also model the mismatch between the two label sets. We additionally describe models that we constructed after the competition deadline (Section 4.4). We report results for tasks 1A, 2A, 3A and 4A (where 'A' identifies the target language: English).

Feature Space
We used AffectiveTweets (Mohammad and Bravo-Marquez, 2017b), a package for Weka (Hall et al., 2009), to extract lexicon-based features from each tweet.
In addition, we also extracted embedding-based features (from the Edinburgh Twitter embeddings, 100 and 400 dimensions) using the AffectiveTweets package.
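To make the feature space concrete, here is a minimal Python sketch of the two feature families. It uses a toy lexicon and toy 4-dimensional embeddings; the real system relies on the lexicons bundled with AffectiveTweets and the Edinburgh Twitter embeddings, and the names LEXICON, EMB and featurize are ours.

```python
import numpy as np

# Toy affect lexicon (word -> intensity score), standing in for the
# NRC-style lexicons bundled with AffectiveTweets.
LEXICON = {"furious": 0.9, "happy": 0.8, "annoyed": 0.6}

# Toy 4-dimensional word embeddings, standing in for the 100/400-dim
# Edinburgh Twitter embeddings.
EMB = {"furious": np.array([0.1, -0.2, 0.3, 0.0]),
       "so":      np.array([0.0,  0.1, 0.0, 0.1])}

def featurize(tweet):
    tokens = tweet.lower().split()
    # Lexicon features: sum and count of matched affect scores.
    scores = [LEXICON[t] for t in tokens if t in LEXICON]
    lex = np.array([sum(scores), float(len(scores))])
    # Embedding feature: average vector over in-vocabulary tokens.
    vecs = [EMB[t] for t in tokens if t in EMB]
    emb = np.mean(vecs, axis=0) if vecs else np.zeros(4)
    return np.concatenate([lex, emb])

print(featurize("so furious"))  # 2 lexicon dims + 4 embedding dims
```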

Models
The UIUC system contains an ensemble constructed by stacking several base learners. A schematic of this ensemble is shown in Figure 1. We obtained out-of-fold predictions for each of the three layer-1 models using 5-fold cross-validation. These predictions were concatenated and provided as input to layer 2. The parameters of the models in this ensemble are detailed below.
Layer 1
Random Forests: n_estimators=100, max_features=√F (F = total number of features), max_depth=5, min_samples_leaf=2
XGB: max_depth=5, min_child_weight=150, gamma=0, n_estimators=150, reg_alpha=0.01, reg_lambda=0.87, learning_rate=0.1
SVM: kernel=linear, C=0.1
Layer 2
XGB: max_depth=3, min_child_weight=1, gamma=0, n_estimators=100, reg_alpha=0.1, reg_lambda=1, learning_rate=0.1, random_state=0
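The stacking procedure can be sketched in Python with scikit-learn as follows. This is a simplified reconstruction, not the submitted code: GradientBoostingRegressor stands in for XGBoost, the data is synthetic, and only parameters with direct scikit-learn equivalents are set.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_predict

# Synthetic data standing in for the lexicon + embedding feature matrix.
rng = np.random.RandomState(0)
X = rng.rand(200, 10)
y = X[:, 0] * 0.7 + rng.rand(200) * 0.3

# Layer 1: diverse base learners.
base = [RandomForestRegressor(n_estimators=100, max_depth=5,
                              min_samples_leaf=2, random_state=0),
        GradientBoostingRegressor(n_estimators=150, max_depth=5,
                                  learning_rate=0.1, random_state=0),
        SVR(kernel="linear", C=0.1)]

# Out-of-fold predictions via 5-fold CV, concatenated column-wise.
Z = np.column_stack([cross_val_predict(m, X, y, cv=5) for m in base])

# Layer 2: a boosted-tree meta-learner trained on the stacked predictions.
meta = GradientBoostingRegressor(n_estimators=100, max_depth=3,
                                 learning_rate=0.1, random_state=0)
meta.fit(Z, y)
print(meta.predict(Z[:3]))
```

Using out-of-fold predictions for the meta-learner's training input avoids leaking the layer-1 models' in-sample fit into layer 2.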

Modeling the Mismatch Between the Training and Development Sets
According to the organizers, the training set for the task was created from an annotation effort in 2016 (Mohammad and Bravo-Marquez, 2017a). The development and test sets were created from a common 2017 annotation effort. As a result, the scores for tweets across the training and development sets, or across the training and test sets, are not directly comparable. We therefore devise a model to correct for this mismatch: we train a linear model that maps the predictions made for the development set to the development ground-truth labels. This learner does not affect the training in any way; it simply transforms the predictions so that they are comparable to the ground-truth labels.
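A minimal sketch of this correction step, assuming hypothetical prediction and label values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical ensemble predictions on the development set, and the
# (differently annotated) development ground truth.
dev_pred = np.array([0.20, 0.35, 0.50, 0.70, 0.90]).reshape(-1, 1)
dev_gold = np.array([0.25, 0.42, 0.55, 0.78, 0.95])

# Learn a linear map from raw predictions to the dev/test label scale.
shift = LinearRegression().fit(dev_pred, dev_gold)

# At prediction time, apply the same transform so the outputs are
# comparable to the 2017-annotated labels.
test_pred = np.array([0.30, 0.60]).reshape(-1, 1)
print(shift.predict(test_pred))
```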

Additional Models
After the competition deadline, we built and evaluated additional models. The overall model was an ensemble with the same structure as the official submission. The additions are neural models: feed-forward neural networks (using averaged word embeddings as input), an LSTM-CNN (using individual word embeddings), and a character-level LSTM (using the character stream). The neural networks were implemented in Keras (Chollet et al., 2015) with the TensorFlow (Abadi et al., 2016) backend. Details about these additional models are shown below.
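As an illustration of the feed-forward variant, the sketch below trains a small regressor on averaged word embeddings. It is a stand-in, not the submitted Keras code: scikit-learn's MLPRegressor replaces the Keras network, and the inputs are random vectors in place of real averaged embeddings.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Toy inputs: each row stands in for the average of a tweet's word
# embeddings (e.g., 100-dim Edinburgh vectors averaged over tokens).
rng = np.random.RandomState(0)
X = rng.rand(100, 100)   # 100 tweets x 100-dim averaged embeddings
y = rng.rand(100)        # emotion intensity targets in [0, 1]

# Small feed-forward regressor; the hidden sizes are illustrative.
ffnn = MLPRegressor(hidden_layer_sizes=(64, 32), activation="relu",
                    max_iter=500, random_state=0)
ffnn.fit(X, y)
print(ffnn.predict(X[:3]))
```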

Results
In this section, we describe the results of our official submission to SemEval 2018 (Section 5.1) as well as the results of experiments on additional models constructed after the competition deadline (Section 5.2). Tables 2 and 3 show the performance of our model on Tasks 1A, 2A, 3A and 4A, respectively. We compare our model against the baseline, which was trained using an SVM with a linear kernel on the lexicon- and embedding-based features. Our submission outperforms the baseline on nearly all task-emotion pairs.
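For reference, the official metric, the Pearson correlation between predicted and gold intensities, can be computed as follows; the gold and predicted values here are made up.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical gold intensities and system predictions.
gold = np.array([0.1, 0.4, 0.5, 0.7, 0.9])
pred = np.array([0.2, 0.35, 0.55, 0.65, 0.8])

r, _ = pearsonr(gold, pred)
print(round(r, 3))
```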

Performance of the UIUC system
In particular, we observe that the results for the prediction of data points in the 0.5-1 range are poorer than those over the overall range. The reason is that this finer prediction is a harder task than the overall prediction: exactly predicting the emotion intensity, given that it is high, carries significant variance. Scores for Task 2A are worse than those for Task 1A in spite of the similarity of the tasks. This is because in Task 2A we essentially discretize the output, thereby either increasing or decreasing the absolute error between the predicted and the actual intensity, depending on whether the discretized output is correct or not. On the whole, the correlation drops as the effect of the latter case (an increase in the absolute error) dominates the former.
Tasks 3 and 4 follow similar trends as Tasks 1 and 2 respectively, but we see a higher correlation for these tasks as compared to Tasks 1 and 2. This leads to the conclusion that predicting the sentiment is an easier task than predicting the intensity of a given emotion.
Table 3: Results of the UIUC system for tasks 3a (valence intensity regression) and 4a (valence ordinal classification) and comparison with the baseline. The alternate evaluation was Pearson correlation for tweets with scores in 0.5-1 for subtask 3a and Cohen's kappa for subtask 4a.

Performance of Additional Models
Ablation Study for Task 1: Given the multiple subsections of data, it is difficult to optimize the architecture and parameters for all emotions across all subtasks. Therefore, we focus on optimizing the architecture and parameters only for the first subtask (emotion intensity prediction) for the emotion anger. Given the many models developed and presented here, it is interesting to see how they perform individually on this subtask. Table 4 shows the performance of various feature-model combinations. Note that L and E in the Features column indicate lexicon-based and embedding-based features, respectively.
Table 4: An ablation study of various features and models for subtask 1: emotion intensity prediction for the specific case of the emotion anger.
We use the SVM trained on lexical features as the baseline. We can see that the SVM+XGB+FFNN (referred to as M1) performs better than the SVM alone. LSTM-CNN with attention (referred to as M2) performs similarly to the SVM baseline. However, when combined together, the model M1+M2+Char performs better than each of the individual models on the test set. This means that the different models capture complementary information about the input, and work better in unison, thus demonstrating the efficacy of the idea of ensembling.
Henceforth, we use M1 to refer to the SVM + XGBoost + feed-forward neural network architecture trained on lexical features, M2 to refer to the LSTM-CNN architecture with attention trained on the embedding features, and Char to refer to the character-level LSTM model trained on the individual characters.
Tasks 1 and 2 with Additional Models: Table 5 shows the performance of the models described above on the first two subtasks: emotion intensity prediction and emotion ordinal classification. We show the results for all four emotions. As we can see, here too, the combination M1+M2+Char performs the best for all emotions in subtask 1. The performance of the model is best for the emotion joy, and worst for the emotion fear.
Tasks 3 and 4 with Additional Models: Coming to subtasks 3 and 4 (valence intensity prediction and valence ordinal classification, respectively), Table 6 shows the performance of the various models on these tasks. Consistent with the results of subtasks 1 and 2, the combined model M1+M2+Char performs the best for both tasks.
In general, we note that the correlation is significantly higher on valence prediction tasks as compared to the emotion intensity tasks. This is likely because the emotion intensity prediction is a fine grained task, requiring the model to observe patterns specific to an emotion. Valence is more of an "aggregated" effect of all the emotions.
Had the best model from the additional experiments for all subtasks been submitted to SemEval, with all other factors constant, its ranks based on the macro-average for the first four subtasks would have been 15th, 15th, 18th and 13th, respectively.

Discussion
In order to identify areas where the model can improve, it is necessary to study cases where it performs poorly. To do so, we select 5 sentences where the baseline SVM model performs very poorly while predicting anger intensity (based on absolute error) and 1 sentence where it performs well. We have restricted the number of sentences to 6 for brevity. In particular, for sentences 1 and 2, the model significantly overestimates the intensity, for sentence 3, the model predicts the intensity accurately. For sentences 4, 5 and 6, the model significantly underestimates the intensity. Table 7 shows the sentences considered and the true value of emotion intensity for the emotion anger.
We then compare the absolute error between the true value and the model prediction for various models. This comparison is shown in Table 8. Given that 5 of the 6 sentences are "difficult" for the models, we observe that there is no clear winner over these sentences. However, we observe that for sentences 1 and 2, the model M1 performs relatively well, while for sentences 4, 5 and 6, the models involving M2 perform relatively well. This suggests that M1 is better at predicting the lower intensities, while M2 is better at the higher ones. This may explain why, although the overall scores of the two models were similar, the ensembled model outperformed the individual models. Another interesting observation is that for sentence 4, the capital letters are the reason for the high intensity. The model M1+M2+Char identifies this well, which reduces the error significantly compared to all the other models.

Conclusion
In this paper we presented the UIUC system, which performs regression and ordinal classification of the emotion and sentiment present in English tweets. Our system comprises an ensemble trained on lexicon-based and embedding-based features. We also accounted for the training-development mismatch in the data set by training an adaptive model between the model predictions and the ground-truth labels. Finally, we performed an error analysis over the various models to identify potential sources of improvement.