Prayas at EmoInt 2017: An Ensemble of Deep Neural Architectures for Emotion Intensity Prediction in Tweets

This paper describes the best performing system for EmoInt, a shared task on predicting the intensity of emotions in tweets. Intensity is a real-valued score between 0 and 1, and the emotions considered are anger, fear, joy and sadness. We apply three different deep neural network based models, which approach the problem from essentially different directions. Our final performance, quantified by an average Pearson correlation score of 74.7 and an average Spearman correlation score of 73.5, is obtained using an ensemble of the three models. We outperform the baseline model of the shared task by 9.9% and 9.4% in Pearson and Spearman correlation scores respectively.


Introduction
EmoInt (Mohammad and Bravo-Marquez, 2017) is a shared task hosted by WASSA 2017, aiming to predict emotion intensity in tweets. The emotion can be one of anger, joy, fear and sadness. For each tweet, the emotion is known, and the task is to predict the intensity of that emotion, where intensity is a real-valued score ranging from 0 to 1. This differs from most other tasks or systems in the domain of emotion detection/sentiment analysis, which tend to focus on classifying tweets or text into different categories. For example, given the tweet 'I hate my lawn mower. If it had a soul, I'd condemn it to the fiery pits of Hell.' and the corresponding emotion 'anger', the system has to predict a value for how intensely this emotion is felt by the author of the tweet, as close as possible to the gold label intensity (0.833 in this case).
* These authors contributed equally to the paper.
The systems built for this task are useful across various NLP applications, but perhaps most obviously in complementing sentiment analysis systems. For example, the degree of anger expressed in a grievance can be used to decide its priority of being addressed, and the intensity of joy can help decide which reviews to project when publicizing a product. Our submitted system is an ensemble of three broad sets of approaches combined using a weighted average of the separate predictions (section 3). All the approaches rely on representing the input tweet using word2vec embeddings (Mikolov et al., 2013) and on neural network based architectures to produce the intensity score for the tweet's emotion (note that the emotion of each tweet is already known in this task).
The shared task organizers provided the training and a small development dataset for building our systems, and then a period of about 2 weeks was given for submitting our predictions on a blind test set. The rest of the paper is structured as follows. Section 2 discusses the dataset for the task in brief. Section 3 explains the various approaches used by our ensemble model, the experiments we carried out along with the details of the parameters which gave optimal results on cross-validation, and the way we combined the predictions. Section 4 explains how the system is evaluated, and Section 5 states the results we achieved and discusses their implications. We conclude our work in Section 6.
Dataset
We used the dataset provided within the shared task for training our system. No other external datasets were used in training. The data files include the tweet id, the tweet, the emotion of the tweet and the emotion intensity (for the training and dev sets). The test set's gold labels were released only after the evaluation period. There are around 800-1100 tweets in the training set, 70-110 in the development set, and around 700-1000 in the test set (across all the emotions). The complete details of the dataset can be found in (Mohammad and Bravo-Marquez, 2017).

Proposed System
Our system is an ensemble of three sets of approaches. We describe the individual approaches, followed by the ensemble process. We mention the parameters for the optimal variants of each approach, as well as the architectural decisions and parameters that were varied, to provide an insight into the scope of our experiments. The parameters were chosen to maximize the Pearson correlation between the predicted and actual scores under cross-validation. The evaluation method used to select the optimal variants is explained in section 4. A bird's eye view of the various architectures is shown in Figure 1.

Approach 1: Feed-forward neural network
Feed-forward neural networks have proven highly successful in classification and real-value prediction tasks across a variety of domains, including NLP applications ((Bengio et al., 2003), (Collobert et al., 2011)). (Deep) neural networks have given state-of-the-art results in sentiment analysis (Tang et al., 2014), which is closely related to our task. Here we detail the architecture of our network.
Input features: Each tweet is represented as a 443-dimensional vector by concatenating two different feature vectors, obtained as follows:
1. Word2Vec (Mikolov et al., 2013) representation of the tweet using publicly available embeddings (Godin et al., 2015), which were trained on 400 million tweets for the ACL W-NUT 2015 shared task (Baldwin et al., 2015). We chose these over other available pre-trained tweet embeddings as they are trained on a large dataset, and we also prefer their high dimensionality of 400. The vectors of the words in the tweet are averaged to get a 400-dimensional representation of the tweet.
2. TweetToLexiconFeatureVector is a filter in the AffectiveTweets (Mohammad and Bravo-Marquez, 2017) package for converting tweets into numeric 43-dimensional vectors that can be used directly as features in our machine learning system. The filter calculates the features from the tweet using several affect lexicons.
Network Architecture: The input layer passes the 443-dimensional vector into 4 subsequent hidden layers (L1, L2, L3, L4) (the left half of Figure 1). We use the Rectified Linear Unit ('relu') (Maas et al., 2013) as the activation function for each of the hidden layers (chosen as per the cross-validation performance described in section 4). L1 is followed by dropout (Srivastava et al., 2014) to avoid over-fitting and co-adaptation of features. The number of hidden units in L1-L4 and the value of the dropout rate (p) were varied, and the optimal settings were decided as per the cross-validation performance for each emotion separately. The chosen values are mentioned in Table 1. L4 is followed by a single sigmoid neuron which predicts the intensity of the emotion between 0 and 1.
Training: The network parameters are learned by directly minimizing the negative of the Pearson correlation (as it is a differentiable function) between actual and predicted intensities. We optimize this function by back-propagating through the layers via mini-batch gradient descent. We use a batch size of 8, 30 training epochs and the Adam optimization algorithm (Kingma and Ba, 2014) with the parameters set as α = 0.001, β1 = 0.9, β2 = 0.999 and ε = 10^-9.
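The training objective, the negative of the Pearson correlation, can be sketched as follows. The paper implements it as a Keras loss; this numpy version is an illustrative assumption of what such a loss computes, not the authors' exact code:

```python
import numpy as np

def neg_pearson_loss(y_true, y_pred):
    """Negative Pearson correlation between gold and predicted intensities.

    Minimizing this loss maximizes the correlation, which is also the
    evaluation metric of the shared task.
    """
    yt = y_true - y_true.mean()
    yp = y_pred - y_pred.mean()
    denom = np.sqrt((yt ** 2).sum()) * np.sqrt((yp ** 2).sum())
    return -(yt * yp).sum() / denom

# Perfectly correlated predictions give the minimum loss of -1,
# even when they are shifted away from the gold values.
gold = np.array([0.2, 0.5, 0.8])
pred = np.array([0.1, 0.4, 0.7])  # shifted but perfectly correlated
print(round(neg_pearson_loss(gold, pred), 4))  # → -1.0
```

Because the loss is invariant to shifts and positive rescalings of the predictions, it directly optimizes ranking agreement rather than absolute error.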

Approach 2: Multitask Deep Learning
Multitask learning using deep neural networks with shared layers has become quite popular and successful, as exploited in, for example, (Collobert and Weston, 2008), and has been the focus of many cross-lingual models like (Huang et al., 2013). (Collobert and Weston, 2008) described a single unified architecture for performing a variety of NLP tasks: named entity recognition, semantic similarity, part-of-speech tagging, etc. In this approach, we use the idea of multitask learning to explore the notion of generalized or shared learning across the different emotions.
Input features: The input features are the same as in Approach 1, and the same for all 4 subtasks. We treat the 4 emotions as different subtasks for deep multitask learning.
Network Architecture: The overall architecture can still be realized using the left side of Figure 1. The network's initial layers are shared across the emotions with the objective of increasing generalization, whereas the individual top layers can be seen as learning emotion-specific features. Specifically, the system consists of two hidden layers (L1 and L2) shared between the 4 regressors, while the last two layers (L3 and L4) are allowed to differ across the subtasks (L3a, L3b, L3c, L3d, and the same for L4). The model can be thought of as an input vector for the tweet going into the exact same two hidden layers regardless of the subtask, but then going into different layers (at the 3rd and 4th levels), with the output from L4 going into the respective output neuron. The parameters (number of neurons in the shared as well as the non-shared layers, along with the dropout rate p) for each emotion are given in Table 2. Note that these parameters are optimized using cross-validation (section 4).
Training: We use the same settings as in Approach 1 with respect to the cost function, optimization algorithm, update rule, learning rate, epochs, etc. We train the network for 4 cycles at every epoch. During the 1st cycle, we train the model for anger, where the input passes through L1, L2, L3a, L4a and finally the corresponding output neuron. The network is similarly trained for fear, joy and sadness during the 2nd, 3rd and 4th cycles respectively. Learning parameters this way ensures additional training examples for the initial layers (L1, L2), so that they may generalize well and learn task-independent representations, while the higher layers (L3, L4) are pushed to learn more task-specific representations.

Approach 3: Sequence Modeling using CNNs and LSTMs
Using Recurrent Neural Networks (RNNs) has become a very common technique for various NLP tasks like language modeling (Mikolov et al., 2010). Their time-step based, sequentially connected structure is intuitive for sequential data such as sentences. The Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) architecture is an advanced version of the RNN that uses various gates to control the vanishing gradient problem (among other obstacles) that arises during the training of RNNs, and has found resounding success in a host of applications ((Graves and Jaitly, 2014), (Graves and Schmidhuber, 2005)). The Convolutional Neural Network (CNN) is also a popular neural network architecture, and has been successful in various NLP tasks ((Lee and Dernoncourt, 2016), (Kim, 2014)).
Combining these architectures has also been found to be quite successful, as in (Zhou et al., 2015). Both these architectures expect a sequence of vectors as input to operate on. We describe how we use these deep learning models, which play a dominant role in our final ensemble system.
Input features: We again use the word2vec embeddings trained on tweets (Godin et al., 2015) to represent the words in a tweet as 400-dimensional vectors, ignoring words not found in the vocabulary. These embeddings are well suited to representing tweets as they have been trained on a very large number of tweets. Instead of averaging the word vectors as in our first two approaches, we concatenate them. Since the length of different tweets can vary, we fix the length of each concatenated representation at 50 words (the maximum tweet length across the training and development data is 46 according to our analysis, and we do not want to miss out on any information in the already short tweets). For datasets where a tweet may have length greater than 50, this number has to be tuned accordingly. Padding with zero vectors is done to make the representation of every tweet a (50, 400) matrix. These representations are then fed to a host of architectures, whose general form is given in Figure 1.
Network Architecture: As shown in Figure 1, the concatenated vector representation of the tweet is first fed to an LSTM or CNN and then to some fully connected (dense) hidden layers. The representation learned in the last hidden layer is fed to a single sigmoid neuron which gives us the intensity of the emotion (as in the previous 2 approaches). We tried many variations of the different parameters involved in constructing this model (keeping all others fixed while one is varied) to come up with several architectures, but show the parameters for only the three top performing ones (as per cross-validation) for each emotion in Table 3.
The variations we tried include:
i) using only LSTM/CNN plus fully connected layers, as well as combinations of these architectures, with the initial LSTM's output for each word fed to a CNN, or vice versa;
ii) using a Simple RNN, Bidirectional LSTM ((Schuster and Paliwal, 1997), (Godin et al., 2015)) or Gated Recurrent Units (GRU) (Cho et al., 2014) instead of the LSTM;
iii) using (global) max pooling versus (global) average pooling for CNNs;
iv) using dropout (Srivastava et al., 2014); note that a dropout layer was added after the pooling layer for a CNN, while the same dropout rate was set for both matrices involved in the standard definition in the case of the LSTM (Zaremba et al., 2014);
v) using different numbers of neurons for the CNN/LSTM/fully connected hidden layers (usually starting from 300 or 256, and halving the number of neurons as we went deeper);
vi) using different numbers of fully connected hidden layers (0, 1 or 2 between the LSTM/CNN layer and the sigmoid neuron).
In every case, the 'relu' activation function was used in the hidden dense layers (except the last neuron, which uses sigmoid). Dropout, if applied, was always set to 0.2 (we also experimented with 0.1, 0.3, 0.4 and 0.5 as the dropout rate). Also, the filter height used for CNNs was always set to 3, and the stride length for convolution was always 1.
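The construction of the (50, 400) input matrices described above can be sketched as follows. Here `embed` is a hypothetical lookup that returns the 400-dimensional word2vec vector for a word, or None for out-of-vocabulary words; the toy embedding at the end is made up for illustration:

```python
import numpy as np

MAX_LEN, DIM = 50, 400  # fixed tweet length and embedding dimensionality

def tweet_matrix(tokens, embed):
    """Stack word vectors and zero-pad to a fixed (MAX_LEN, DIM) matrix."""
    vecs = [embed(t) for t in tokens]
    vecs = [v for v in vecs if v is not None]   # ignore OOV words
    mat = np.zeros((MAX_LEN, DIM))
    if vecs:
        mat[:len(vecs)] = np.array(vecs[:MAX_LEN])
    return mat

# Toy embedding: every known word maps to a constant all-ones vector.
toy_embed = lambda w: np.ones(DIM) if w != "<oov>" else None
m = tweet_matrix(["i", "love", "<oov>", "rain"], toy_embed)
print(m.shape, int(m.sum()))  # → (50, 400) 1200
```

The zero rows at the bottom of the matrix carry no signal, so the downstream LSTM/CNN layers effectively see only the real words of the tweet.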
Training: The network parameters are learned by minimizing the Mean Absolute Error between the actual and predicted values of emotion intensity. We optimize this loss function by back-propagating through the layers via mini-batch gradient descent, with a batch size of 8, 15 training epochs and the Adam optimization algorithm (Kingma and Ba, 2014) with the same parameters as mentioned in Approach 1.
The deep learning based models in all the above approaches were implemented in Python using the Keras library (Chollet et al., 2015).

Bringing it all together: The submitted ensemble system
As described above, we now have 5 models to combine: 1 each from Approaches 1 and 2, and 3 from Approach 3. We take a weighted average of the predictions from each of the systems to form our final submission. The weights are informed by the cross-validation results (the CV score as explained in section 4), and are as follows: 1 for Approach 1, 3 for Approach 2, 3 each for the two best systems from Approach 3 (which are very close in performance, as can be seen in Table 4), and 2 for the 3rd best system from Approach 3. Our ensemble model improves performance by at least 2% over any of our individual models (Table 4).
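The weighted averaging can be sketched directly. The weights below are the ones stated above (1, 3, 3, 3, 2), while the prediction values themselves are made up for illustration:

```python
import numpy as np

# Rows: predictions from the 5 models (Approach 1, Approach 2, and the
# three Approach-3 variants) for three example tweets (columns).
preds = np.array([
    [0.60, 0.40, 0.80],   # Approach 1            (weight 1)
    [0.65, 0.35, 0.75],   # Approach 2            (weight 3)
    [0.70, 0.30, 0.85],   # Approach 3, best      (weight 3)
    [0.68, 0.32, 0.82],   # Approach 3, 2nd best  (weight 3)
    [0.55, 0.45, 0.70],   # Approach 3, 3rd best  (weight 2)
])
weights = np.array([1, 3, 3, 3, 2])

# Weighted average of the per-model predictions, one value per tweet.
ensemble = weights @ preds / weights.sum()
print(np.round(ensemble, 3))
```

Since the weights are proportional to the models' CV performance, better-performing models pull the final prediction more strongly toward their own output.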

Evaluation
Cross-Validation (CV): We combined the training and development sets, trained on 80% of this combined set while predicting on the remaining 20%, and repeated this seven times (for each emotion separately). The average of these seven scores was used as the CV score to evaluate our models. The metric used for evaluating performance was the Pearson correlation.
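The repeated 80/20 evaluation can be sketched as below. The split logic is a plain illustration of the scheme just described; training and scoring of the actual models is assumed to happen elsewhere:

```python
import numpy as np

def repeated_holdout_splits(n_examples, n_repeats=7, train_frac=0.8, seed=0):
    """Yield (train_idx, test_idx) index pairs for repeated 80/20 evaluation.

    Each repeat reshuffles the combined training+development examples,
    trains on 80% and scores on the held-out 20%; the CV score is the
    average metric over the repeats.
    """
    rng = np.random.default_rng(seed)
    cut = int(train_frac * n_examples)
    for _ in range(n_repeats):
        idx = rng.permutation(n_examples)
        yield idx[:cut], idx[cut:]

splits = list(repeated_holdout_splits(1000))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # → 7 800 200
```

Averaging over seven reshuffled splits reduces the variance of the estimate compared to a single fixed development split, which matters for datasets of this size (roughly a thousand tweets per emotion).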
Test: The optimal setting for each model was decided using the CV score (Table 4). These chosen models (as described in Tables 1, 2 and 3) were then used to generate predicted intensities on the test set, by training on the full training and development sets combined. Again, an average of seven runs was taken. The predictions for the final ensemble model were generated using a weighted average of the individual predictions as described in section 3.4.

Results and Discussion
We compare the results achieved by our individual approaches, the submitted ensemble system and the WEKA baseline system, which is the official baseline model for this task (Mohammad and Bravo-Marquez, 2017), in Table 4. For brevity, we only show the Pearson correlation scores on the test set (although the Spearman correlation scores show similar trends). We discuss the major takeaways from these results:
1. Our submitted ensemble model achieves average (or overall) scores of 75.26% and 74.70%, beating the baseline model by about 14% and 10% on the cross-validation and test sets respectively. The improvement points to the potential of deep learning based models over simpler lexicon based approaches. These are also the best scores among all participating systems in the shared task (according to the public leaderboard).
2. The ensemble model achieves about a 3-5% improvement over the individual models' average scores, and offers a significant improvement in performance across all the emotions, which indicates that the approaches complement each other quite well.
3. Approach 2 (Multitask DL) achieves the lowest scores among the three sets of approaches. Between Approach 1 (Feed-Forward NN) and Approach 3 (CNN+LSTM Sequence Modeling), Approach 3 has a best test score of 72.15 compared to Approach 1's 69.58, a significant improvement that points to sequence models like LSTMs and CNNs being a better choice than feed-forward neural networks.
4. Among the individual emotions, our ensemble model gives the best performance for 'Sadness', followed very closely by 'Fear', then 'Joy' and finally 'Anger'.

Conclusion and Future Work
In this paper, we propose a deep learning framework to predict the intensity of the emotion expressed in tweets. The proposed approach is based on an ensemble of feed-forward neural networks, multitask deep learning, and sequence modeling using CNNs and LSTMs, allowing us to explore the different directions a neural network based methodology can take. Each individual approach is described in detail with a view to making our experiments replicable. The optimal parameters are mentioned, along with our method of bringing the approaches together. Our submitted system beats the baseline system by about 10% on the test set. Although our model achieves state-of-the-art results, there is definite room for improvement. In the future, we would like to experiment with hand-crafted features in addition to word vectors and lexicon features. We would also like to experiment with other filters provided in the AffectiveTweets package (Mohammad and Bravo-Marquez, 2017), such as TweetToSentiStrengthFeatureVector, TweetNLPTokenizer, etc. Another very interesting direction would be to try better ways of ensembling the different models and to analyze how each system or approach complements the others.