IMS at EmoInt-2017: Emotion Intensity Prediction with Affective Norms, Automatically Extended Resources and Deep Learning

Our submission to the WASSA-2017 shared task on the prediction of emotion intensity in tweets is a supervised learning method with extended lexicons of affective norms. We combine three main information sources in a random forest regressor, namely (1) manually created resources, (2) automatically extended lexicons, and (3) the output of a neural network (CNN-LSTM) for sentence regression. All three feature sets perform similarly well in isolation (≈ .67 macro average Pearson correlation). The combination achieves .72 on the official test set (ranked 2nd out of 22 participants). Our analysis reveals that performance is increased by providing cross-emotional intensity predictions. The automatic extension of lexicon features benefits from domain-specific embeddings. Complementary ratings for affective norms increase the impact of lexicon features. Our resources (ratings for 1.6 million Twitter-specific words) and our implementation are publicly available at http://www.ims.uni-stuttgart.de/data/ims_emoint.


Introduction
In natural language processing, emotion recognition is the task of associating words, phrases or documents with predefined emotions from psychological models. Typical discrete categories are those proposed by Ekman (Ekman, 1999) and Plutchik (Plutchik, 2001), namely Anger, Anticipation, Disgust, Fear, Joy, Sadness, Surprise and Trust. In contrast to sentiment analysis, whose main task is to recognize the polarity of text (e. g., positive, negative, neutral, mixed), only a few resources and domains have been subject to analysis. Examples include tales (Alm et al., 2005), blogs (Aman and Szpakowicz, 2007), and, as a very popular domain, microblogs on Twitter (Dodds et al., 2011). The latter in particular provides a large resource of data in the form of user messages (Costa et al., 2014). A common source of weak supervision for training classifiers are hashtags, emoticons, or emojis, which are interpreted as a weak form of author "self-labeling" (Suttles and Ide, 2013). The classifier then learns the association of all other words in the message with the emotion (Wang et al., 2012). An alternative to discrete models are continuous models that map emotions to an n-dimensional space, with valence, arousal and dominance (VAD) being the usual dimensions. Previous work that relies on the VAD scheme focuses mainly on extending and adapting affective lexicons (Bestgen and Vincze, 2012; Turney and Littman, 2003), including to historical texts (Buechel et al., 2016), and on the prediction and extrapolation of affective ratings (Recchia and Louwerse, 2015a; Hollis et al., 2017).
The WASSA-2017 shared task on the prediction of emotion intensity in tweets (EmoInt) aims at combining discrete emotion classes with different levels of activation. Given a tweet and an emotion (anger, fear, joy, or sadness), the task is to determine the intensity expressed regarding that particular emotion. This score can be seen as an approximation of the emotion intensity felt by the reader or expressed by the author. For a detailed task description and background information on the data collection, see Mohammad and Bravo-Marquez (2017).

System Description
In the following, we introduce all feature sets we experimented with. We start with an analysis and selection of features obtained from the baseline system AffectiveTweets and explain how we extend resources to the domain of Twitter. Then, we describe our sentence regressor, which is based on deep learning and pre-trained word embeddings. Finally, we introduce two additional, manually defined features.

Baseline Features
The baseline system AffectiveTweets, which has been provided to participants together with the training and development data, includes a huge variety of different features and configurations. The feature types can be classified into (a) SparseFeatures, which refer to word and character n-grams from tweets, (b) LexiconFeatures, which are taken from several emotion and sentiment lists (we consider the SentiStrength-based feature to be part of this), and (c) the EmbeddingsFeature, which comprises a tweet-level feature representation that can incorporate any pre-trained word embeddings.

Extending and Adding Norms
The baseline system, available at https://github.com/felipebravom/AffectiveTweets, builds on top of a variety of different lexical resources (Hu and Liu, 2004; Wilson et al., 2005; Kiritchenko and Mohammad; Mohammad and Turney, 2013; Mohammad and Kiritchenko, 2015; Baccianella et al., 2010; Bravo-Marquez et al., 2016; Nielsen, 2011). Such resources are naturally limited in coverage and often focus on words that are closely associated with a certain emotion or sentiment (e. g., the word "hate" with the emotion anger).
At the same time, social media data is typically rich in lexical variation, and hence tends to contain a great deal of out-of-vocabulary words. We address this with three separate approaches, namely by (i) applying a supervised method to extend these lexicons to a larger Twitter-specific vocabulary, (ii) learning a new rating score for every word and not just highly associated terms, and (iii) including novel rating categories that provide complementary and potentially useful information, such as valence, arousal, dominance and concreteness.
Several approaches have been proposed to combine distributional word representations with supervised machine learning methods to extend affective norms (Turney et al., 2011; Tsvetkov et al., 2014; Recchia and Louwerse, 2015b; Vankrunkelsven et al., 2015; Köper and Schulte im Walde, 2016; Sedoc et al., 2017). Köper and Schulte im Walde (2017) compared various supervised methods and showed that a feed-forward neural network together with low-dimensional distributed word representations (embeddings) obtained the highest correlation with human-annotated ratings for concreteness.
Following these findings, we apply the same methodology. For a given emotion or norm, we train a feed-forward neural network with two hidden layers, each having 200 neurons. The input of the network is a single word representation (300 dimensions) and the output is one numerical value trained to correspond to the human-annotated (gold) rating for the given input word. We apply the model to predict a rating score for every word representation in our distributional space (which includes the training data).
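As a minimal sketch of the forward pass of such a norm-extension network (300-dimensional embedding in, two hidden layers of 200 units, one rating out), the following uses randomly initialised weights in place of the trained parameters; the ReLU activation is an assumption, as the paper does not specify one:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, HIDDEN = 300, 200  # input embedding size and hidden layer width

# Randomly initialised parameters stand in for the trained weights;
# in the actual system these are fit on the gold rating data.
W1 = rng.normal(0, 0.05, (DIM, HIDDEN)); b1 = np.zeros(HIDDEN)
W2 = rng.normal(0, 0.05, (HIDDEN, HIDDEN)); b2 = np.zeros(HIDDEN)
W3 = rng.normal(0, 0.05, (HIDDEN, 1)); b3 = np.zeros(1)

def predict_rating(embedding):
    """300-d word embedding -> two hidden layers of 200 units
    -> one numerical rating (ReLU activation assumed)."""
    h1 = np.maximum(0.0, embedding @ W1 + b1)
    h2 = np.maximum(0.0, h1 @ W2 + b2)
    return float(h2 @ W3 + b3)
```

Applied to every word vector in the distributional space, this yields one extended rating per word and norm.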
This method strongly depends on the underlying word representation. We therefore conduct multiple experiments using different word embeddings (shown in Section 4.2). We apply this procedure for 13 different lexicons using the following resources: the NRC Hashtag Emotion Lexicon (Mohammad and Kiritchenko, 2015), containing ratings for 17k words with associations to anger, anticipation, disgust, fear, joy, sadness, surprise and trust. Additionally, we use the 14k ratings for valence, arousal, and dominance collected by Warriner et al. (2013). For concreteness we rely on the collection of 40k ratings from Brysbaert et al. (2014). Finally, we use the 10k ratings for happiness from Dodds et al. (2011). These 13 ratings correspond to an automatic extension to 1.6 million word types with ≈ 21 million new word ratings. We map the ratings to an interval of [0, 10]. Table 1 shows the top words for eight ratings. For the emotion intensity prediction in our predictive model, we represent each rating with seven feature dimensions per tweet:
1. Average rating score across all words
2. Average rating score across all nouns
3. Average rating score across all adjectives
4. Average rating score across all verbs
5. Average rating score across all hashtags
6. Maximum rating score
7. Standard deviation of all rating scores
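The seven per-tweet dimensions for one rating category can be sketched as follows; the function name, the POS tag labels, and the fallback value for uncovered words are illustrative assumptions:

```python
from statistics import mean, pstdev

def rating_features(tokens, tags, ratings, default=5.0):
    """Seven feature dimensions for one extended rating category:
    averages over all words / nouns / adjectives / verbs / hashtags,
    the maximum, and the standard deviation. `ratings` maps words to
    scores on the [0, 10] scale; `tags` holds a coarse POS tag or
    'HASHTAG' per token (labels are illustrative)."""
    scores = [ratings.get(t, default) for t in tokens]

    def avg(pos):
        sel = [s for s, p in zip(scores, tags) if p == pos]
        return mean(sel) if sel else default

    return [
        mean(scores),    # 1. average over all words
        avg("NOUN"),     # 2. average over nouns
        avg("ADJ"),      # 3. average over adjectives
        avg("VERB"),     # 4. average over verbs
        avg("HASHTAG"),  # 5. average over hashtags
        max(scores),     # 6. maximum rating score
        pstdev(scores),  # 7. standard deviation of all rating scores
    ]
```

With 13 rating categories, this yields 91 lexicon-extension dimensions per tweet.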

Tweet Regression
The tweet regression feature relies on the annotated training samples. We train a neural network based on word embeddings to predict the emotion intensity for each tweet.
Convolutional neural networks (CNNs), trained on top of pre-trained word vectors, have been shown to work well for sentence-level classification tasks (Kim, 2014). We apply a similar method here, combining CNNs and LSTMs (Hochreiter and Schmidhuber, 1997). The final architecture used by IMS is shown in Figure 1. Each tweet is represented by a matrix of size 50 × 300 (padded where necessary; the embedding dimension is 300, and the maximal token sequence length in a tweet is set to 50). We apply dropout with a rate of 0.25. The matrix is then the input for a convolutional layer with a window size of 3, followed by a max-pooling layer (size 2) and an LSTM to predict a numerical output for each tweet.
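The architecture described above could be sketched with the Keras API (here via `tensorflow.keras`) roughly as follows; the filter count, LSTM width, activation and optimiser are not specified in the paper and are assumptions:

```python
from tensorflow.keras import layers, models

# CNN-LSTM sketch: 50 tokens x 300-d embeddings per tweet,
# dropout 0.25, Conv1D with window size 3, max pooling (size 2),
# LSTM, and a single numerical intensity output.
model = models.Sequential([
    layers.Input(shape=(50, 300)),
    layers.Dropout(0.25),
    layers.Conv1D(64, 3, activation="relu"),  # 64 filters assumed
    layers.MaxPooling1D(2),
    layers.LSTM(64),                          # LSTM width assumed
    layers.Dense(1),                          # one intensity value
])
model.compile(optimizer="adam", loss="mse")
```

Training then fits this network on the annotated tweets with mean squared error against the gold intensities.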
This architecture captures sequential information in a compact way. For comparison, we conduct experiments using a variety of different architectures (shown in Section 4.3) including linear regression, multilayer perceptron (MLP), two stacked LSTMs and the proposed CNN-LSTM architecture.

Additional Features
In addition to regression and lexical features, we add two hand-crafted features. The first is a Boolean feature which holds if and only if an exclamation mark is present in the tweet. The second represents the overall number of tokens in the tweet.

Figure 1: CNN-LSTM architecture used for tweet regression.
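The two hand-crafted features amount to a two-dimensional vector per tweet; a minimal sketch (function name is illustrative):

```python
def handcrafted_features(tokens):
    """Two additional features per tweet: whether an exclamation
    mark is present anywhere in the tweet, and the token count."""
    has_exclamation = any("!" in tok for tok in tokens)
    return [float(has_exclamation), float(len(tokens))]
```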

Implementation Details
As a source for our in-domain embeddings, we use a corpus from 2016 retrieved with the Twitter streaming and REST APIs using emotion hashtags and popular general hashtags. It consists of ≈50 million tweets and ≈800 million tokens. After removing words with less than 10 occurrences, the resource contains 1.6 million word types. The 300-dimensional word representations are obtained with word2vec (Mikolov et al., 2013). To study the impact of the training domain, we additionally conduct experiments with the publicly available GoogleNews vectors that were trained on a 100-billion-word corpus of news texts. Both word embeddings are used to extend the emotion lexicons (Section 2.2) and as input embeddings in our tweet regression model (Section 2.3). We use TweetNLP (Owoputi et al., 2013) as tokenizer. In the case of observing only out-of-vocabulary words (no rating available), we set the score to the median value of the corresponding category.
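The median fallback for tweets consisting only of out-of-vocabulary words can be sketched as follows; the function and parameter names are illustrative:

```python
from statistics import median

def tweet_rating_scores(tokens, ratings, category_scores):
    """Look up the rating for each covered token; if no token of the
    tweet appears in the lexicon, fall back to the median value of
    the whole rating category, as described above."""
    found = [ratings[t] for t in tokens if t in ratings]
    if not found:  # only out-of-vocabulary words observed
        return [median(category_scores)]
    return found
```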
The regressor based on the tweet text is implemented with Keras (Chollet et al., 2015). We train one model for each of the four emotions separately. Furthermore, we provide the output of all four emotion-specific regression models in all emotion intensity prediction tasks, i. e., the model trained for one emotion also provides predictions for the remaining emotions. Finally, for the full system IMS, we combine the features in a random forest using Weka (Witten et al., 1999). We use 800 trees (called iterations in Weka). We estimate one model for each of the four target emotions.
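The final per-emotion combination could be sketched with scikit-learn in place of Weka's random forest; the 800 trees mirror the paper's setting, while the toy data and names are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))    # stacked feature sets per tweet (toy)
y = rng.uniform(0, 1, size=40)   # gold emotion intensities in [0, 1]

# One such model is estimated per target emotion.
model = RandomForestRegressor(n_estimators=800, random_state=0)
model.fit(X, y)
pred = model.predict(X)
```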

Feature Subset Selection and Analysis
Feature selection and analysis were performed on the annotated training and development data. All experiments were carried out using 10-fold cross-validation. We report results following the official shared task evaluation measure for predicting a value between 0 and 1, namely Pearson correlation for each emotion separately as well as a macro average over all emotions. Features that were finally used in IMS are marked with ✓, and features that were disregarded with ✗.
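The evaluation measure is plain Pearson correlation per emotion, macro-averaged over the four emotions; a self-contained sketch:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between predictions and gold intensities,
    the official shared task evaluation measure."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def macro_average(per_emotion_scores):
    """Unweighted mean over the four emotion-wise correlations."""
    return sum(per_emotion_scores) / len(per_emotion_scores)
```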

Baseline Feature Engineering
We start with feature engineering based solely on the baseline features (see Section 2.1). Table 2 shows our observations when exploring the different options from AffectiveTweets using default parameters. The embeddings (Embd.) are the recommended 400-dimensional Twitter embeddings available from the baseline system's homepage.
As the table shows, an average performance of .67 is already obtained when relying only on a random forest in combination with the lexicon features. The other features, as well as the combination, result in inferior performance. In addition, the lexicon-based system is comparatively simple with only 45 feature dimensions. We therefore only use the lexicon features from the baseline system.

Lexicons and Extended Lexicons
As a next feature, we explore various settings for the automatic extension of the lexicon features. Table 3 compares the baseline lexicons against the lexicons we add without extension (ACVH-Lexicons) as well as the automatically extended resources (Ext.*). ACVH-Lexicons contains the unmodified ratings for arousal, concreteness, valence and happiness (ACVH), which were not part of the baseline system. For Ext.* we present results based on underlying news (Ext.News) and Twitter (Ext.Twitter) embeddings. In addition, we present results for each lexicon feature in isolation, as well as in combination with the baseline lexicons (Lexicons(=BL)). It can be seen that the ACVH lexicons without automatic extension (ACVH-Lexicons) perform poorly and provide no performance gain when combined with the baseline (ACVH-Lexicons+BL). We assume that their poor coverage on Twitter data is the main reason. On the other hand, the automatically extended ratings perform well, and the choice of embeddings has a high impact on the quality of the resulting ratings. In more detail, the in-domain embeddings (Ext.Twitter) create ratings that are superior in our extrinsic evaluation to those from the out-of-domain embeddings (Ext.News), with an average score of .67 against .52. The information of existing lexicons and extended norms is not redundant: the combination (Ext.Twitter+BL) increases the average correlation across all four emotions by +.02 points, from .67 → .69.
To get a further understanding of the automatically extended norms, Figure 2 shows the evaluation performance of the thirteen extended norm categories separately (numbers in brackets refer to the training size used to extend the norms; the evaluation is based on 10-fold cross-validation using the full training data and a random forest). Especially the extended ratings from the new lexicons perform well: happiness, dominance and valence. However, we also see that the number of training samples might have a big impact; e. g., the automatic ratings for joy are trained on only 3.4k samples, while the happiness training data is considerably larger.

Tweet Regression Architectures
In addition to the CNN-LSTM architecture used in the final system (see Section 2.3), we experimented with different models for tweet regression. Table 4 shows results using various machine learning algorithms to directly predict the emotion intensity.
We use the in-domain Twitter embeddings as input. We observe that our architecture, introduced in Section 2.3, performs better than the other methods. Remarkably, the CNN-LSTM feature on its own already performs on a par with the lexicon-based features.

Full System Combination
A combination of all features leads to the best performance, as they provide complementary information. An overview is given in Table 5 and Table 6.
Another interesting observation concerns the usage of cross-emotional intensity predictions: IMS trains a classifier for each emotion in isolation. Similarly, the tweet regression feature is trained emotion-wise, but for each instance we also provide the intensity predictions from all other emotion models (therefore, 4 features). Without the cross-emotion information, we obtain only a macro average across all emotions of .707 (vs. .719). Figure 3 shows how the emotion intensity predictions of these models correlate. It can be seen that fear, sadness and anger are slightly correlated, while joy is negatively correlated with all three emotions. Interestingly, a combined model (Comb), which is trained on all emotions, also leads to a high correlation for each emotion, and especially for sadness. Note that the classifier trained on all emotions (Comb) is not used by the final system IMS.
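The cross-emotion feature construction described above can be sketched as follows; the function name is illustrative:

```python
def add_cross_emotion_features(base_features, intensity_preds):
    """Append the intensity predictions of all four emotion-specific
    regression models (anger, fear, joy, sadness) to the feature
    vector of an instance, yielding the 4 cross-emotion features."""
    assert len(intensity_preds) == 4
    return list(base_features) + list(intensity_preds)
```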
Finally, we want to mention that the impact of the two manually defined features is very small; we found that they increase performance on joy by +.01, and we therefore decided to keep them. Table 7 shows the official results (Full IMS-Test) and the performance using only subsets of the entire feature set. For comparison, we also show the results of the best performing system (Best-Competitor). Our baseline, using only the lexicon features and a random forest, obtains a competitive Pearson correlation of .65, which would have been ranked as the 8th best system. Both of our core features, namely the extended resources as well as the CNN-LSTM tweet regression architecture, increase performance by +.05 points when combined with the baseline lexicons (Lexicons(=BL)). Their performance is similar for anger and joy, but the ratings seem more useful for fear, and the regression more useful for sadness. The result of Ext.Twitter+BL with .70 would have been ranked as the 4th best system.

Official Results -Analysis Test Data
The final combination of all our features results in an increase of ≈ +.02 correlation points. The performance of IMS on the test set without the two manually defined features is .719. Furthermore, we observe that our submission's performance on the test data is on average very close to the estimated performance on the training data (both .72), but looking at individual emotions, our system performs better on sadness and slightly worse on fear.

Error Analysis
Based on a manual inspection of individual tweets with a large gap between prediction and gold rating, we found that the model's prediction often depends on single words and ignores the larger context. An example with a high error for fear is: "Most people never achieve their goals because they are afraid to fail." (fear, G: .22, P: .55) Here, the gold emotion intensity for fear is comparably low, but our model predicts a high fear intensity. Similarly, for the tweet with high joy intensity "Just died from laughter after seeing that." (joy, G: .92, P: .50), our model predicts a low joy intensity. Another challenge is modification, as in "After this news Im supposed to be so damn happy and rejoicing but Im here like " (joy, G: .07, P: .53). Here, the gold annotation is very low, but our model predicts a medium intensity for joy.

Conclusion
Our system IMS, submitted to the EmoInt-2017 shared task, combines existing lexicons with automatically extended norms and a CNN-LSTM neural network based on embeddings. Our findings show that each of the three main components performs equally well, but the highest performance is achieved by their combination. In addition, we found that extending existing emotion lexicons and affective norms improves performance over the original resources. We also showed that the choice of the underlying word representations is important; in particular, in-domain embeddings (trained on Twitter data) perform superior to other embeddings. A particularly interesting observation is that providing cross-emotional intensity predictions benefits performance.