NUIG at EmoInt-2017: BiLSTM and SVR Ensemble to Detect Emotion Intensity

This paper describes the NUIG entry in the WASSA 2017 (8th Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis) shared task on emotion recognition. The NUIG system used an ensemble of an SVR (SVM regression) model and a BiLSTM model, relying primarily on n-grams (for SVR features) and tweet word embeddings (for BiLSTM features). Experiments were carried out on several other candidate features, some of which were added to the SVR model. Parameters for the SVR model were selected with a grid search, whilst parameters for the BiLSTM model were selected through a non-exhaustive ad-hoc search.


Introduction
The WASSA 2017 shared task on emotion intensity (EmoInt) is a competition intended to stimulate research into emotion recognition from text (Mohammad and Bravo-Marquez, 2017). The task provides a corpus of 3960 English language tweets annotated with a continuous intensity score for each of four basic emotions: anger, fear, joy and sadness. This is a subset of the set of basic emotions proposed by Ekman (1992), which has been widely used as an emotion representation scheme in emotion recognition research (Mohammad, 2016; Poria et al., 2017). An additional 3142 tweets were used for evaluation of competition entries, with annotations withheld during the competition.
The NUIG entry to the task consisted of an ensemble of two supervised models: an SVR (Support Vector Machine Regression) with n-gram and several custom features, and a BiLSTM (Bidirectional Long Short-Term Memory) model utilising tweet word embeddings. The models are accessible on DockerHub, GitHub and as a REST API service (see Section 6).
In Section 2 we briefly overview related work. In Section 3 we discuss the data cleaning and preprocessing steps taken. In Section 4 we describe the model architectures and parameter choices. In Section 5 we discuss some observed issues with the models.

Related Research
In this section we briefly describe related work that has attempted to model emotions using machine learning based regressors and classifiers. Wu et al. (2006) use a hybrid of keyword search and Artificial Neural Networks (applied when no emotional keywords are present) to tackle the problem of detecting multiple emotions (anger, fear, hope, sadness, happiness, love and thank), achieving an average test accuracy of 57.75% across all emotions. In the speech recognition domain, Wöllmer et al. (2008) applied Long Short-Term Memory networks (LSTMs) to detect emotions from speech using spectral features and measurements of voice quality, in an attempt to represent emotions continuously as opposed to using discrete classes of valence, arousal and dominance. Schuller et al. in 2008 combined acoustic models of speech, phonetics and word features on the EMO-DB database, which demonstrated the importance of incorporating word models for such emotion recognition tasks.

Preprocessing
Tokenisation for both models was based on the regular expressions and rules provided with Stanford's GloVe Twitter word vectors (Pennington et al., 2014), with some custom additions and modifications. Notable changes included the removal of hash symbols from tags and extra emoticon detection patterns.
Removal of hash symbols had a noticeable impact on the training accuracy of the BiLSTM model (for SVR the impact was not significant). One possible explanation is the presence of hashtags in the training data for which the corresponding word is present in the word embedding, but not the tag itself. A concrete example is "#firbromyalgia". Note that stop words were not removed.
The preprocessing steps were as follows:
1. URLs and @mentions are replaced by standard tokens: "<url>" and "<user>"
2. emoticons are replaced by a small set of standard tokens: "<smile>", "<lolface>", "<sadface>", "<neutralface>", "<heart>"
3. hash symbols are removed from #hashtags
4. repeated full stops, question marks and exclamation marks are replaced with a single instance, with a special token "<repeat>" added
5. characters repeated 3 times or more are replaced with one instance and a special token "<elong>" is added
6. a special token "<allcaps>" is added for each word in all capitals
7. remaining punctuation characters are treated as individual tokens
8. apostrophes are removed from negative contractions (e.g. "don't" is changed to "dont")
9. other contractions are split into two tokens (e.g. "it's" is changed to "it" and "'s")
10. tokens are converted to lower case
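The steps above can be illustrated with a minimal sketch using only the Python standard library. This is not the rule set used in the system: the emoticon patterns are hypothetical and heavily simplified ("<lolface>" is omitted), the function name `preprocess` is our own, and the ordering of the steps inside the function is an assumption chosen so that the placeholder tokens survive tokenisation.

```python
import re

# Hypothetical, simplified emoticon patterns (assumption; the real rules are richer).
EMOTICONS = [
    (re.compile(r"[:;=]-?[)\]D]+"), "<smile>"),
    (re.compile(r"[:;=]-?[(\[]+"), "<sadface>"),
    (re.compile(r"[:;=]-?\|"), "<neutralface>"),
    (re.compile(r"<3"), "<heart>"),
]
PLACEHOLDER = re.compile(r"<\w+>")
# A placeholder, an apostrophe suffix such as "'s", a word, or one punctuation mark.
TOKEN = re.compile(r"<\w+>|'\w+|\w+|[^\w\s]")
REPEATED_CHAR = re.compile(r"(\w)\1{2,}")


def preprocess(tweet):
    """Return the list of normalised tokens for one tweet."""
    # Step 1: URLs and @mentions become standard tokens.
    text = re.sub(r"https?://\S+|www\.\S+", " <url> ", tweet)
    text = re.sub(r"@\w+", " <user> ", text)
    # Step 2: emoticons become a small set of standard tokens.
    for pattern, placeholder in EMOTICONS:
        text = pattern.sub(" " + placeholder + " ", text)
    # Step 3: drop hash symbols but keep the tag word.
    text = text.replace("#", "")
    # Step 4: collapse repeated ., ? or ! and mark the repetition.
    text = re.sub(r"([.?!])\1+", r" \1 <repeat> ", text)
    # Step 8: "don't" -> "dont".
    text = re.sub(r"(n)'(t)\b", r"\1\2", text, flags=re.IGNORECASE)
    # Step 9: "it's" -> "it" + "'s".
    text = re.sub(r"(\w)'(\w+)", r"\1 '\2", text)

    tokens = []
    for token in TOKEN.findall(text):  # Step 7: punctuation as single tokens.
        if PLACEHOLDER.fullmatch(token):
            tokens.append(token)
            continue
        elongated = REPEATED_CHAR.search(token) is not None
        all_caps = len(token) > 1 and token.isalpha() and token.isupper()
        token = REPEATED_CHAR.sub(r"\1", token)  # Step 5: "soooo" -> "so".
        tokens.append(token.lower())             # Step 10: lower-casing.
        if elongated:
            tokens.append("<elong>")
        if all_caps:
            tokens.append("<allcaps>")           # Step 6.
    return tokens
```

Lower-casing is applied last so that the all-capitals check still sees the original casing. Single-letter words such as "I" are deliberately not tagged as all capitals here, which is a design choice of this sketch rather than something stated above.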

Model Architecture and Training
The overall model is a simple ensemble of a Support Vector Regression model (SVR, see Section 4.1) and a Bidirectional Long Short-Term Memory neural network (BiLSTM, see Section 4.2). The ensemble is described in Section 4.3.
The BiLSTM model was chosen due to its recent excellent performance across numerous NLP tasks. The SVR model was chosen as a baseline implementation, but was found to contribute to the overall performance. Standard Long Short-Term Memory (LSTM) models were also attempted, but were outperformed by our BiLSTM (results not reported here).

Support Vector Machine Regression
The core features for the SVR model are a bag of 1-, 2-, 3- and 4-grams. N-grams with corpus frequency less than 2 or document frequency greater than 100 were removed. Experiments including words with document frequency up to 1000 showed similar performance, so the more stringent criterion, which results in a much smaller vocabulary, was chosen. Note that this will also remove most words commonly considered stop words.
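As an illustration of this filtering, the following sketch builds a bag-of-n-grams vocabulary from tokenised tweets. It assumes that "corpus frequency" means the total number of occurrences and "document frequency" the number of tweets containing the n-gram; the function names are hypothetical and the code is not taken from the system.

```python
from collections import Counter


def ngrams(tokens, n_max=4):
    """Yield all 1- to n_max-grams of a token list as space-joined strings."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def build_vocabulary(docs, min_corpus_freq=2, max_doc_freq=100, n_max=4):
    """Keep n-grams that are neither too rare nor too widespread."""
    corpus_freq = Counter()  # total occurrences over all tweets
    doc_freq = Counter()     # number of tweets containing the n-gram
    for tokens in docs:
        grams = list(ngrams(tokens, n_max))
        corpus_freq.update(grams)
        doc_freq.update(set(grams))
    return sorted(
        g for g in corpus_freq
        if corpus_freq[g] >= min_corpus_freq and doc_freq[g] <= max_doc_freq
    )


def vectorize(tokens, vocabulary, n_max=4):
    """Bag-of-n-grams count vector for one tweet, aligned with the vocabulary."""
    counts = Counter(ngrams(tokens, n_max))
    return [counts[g] for g in vocabulary]
```

With the thresholds above as defaults, very frequent n-grams (including most stop words) drop out through `max_doc_freq`, while one-off n-grams drop out through `min_corpus_freq`.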
The following extra features were added. Because tweets vary in length, the average, minimum and maximum of the word vectors over the tokens of a tweet are taken as features. The proportion of capital symbols and the proportion of words with an initial capital are also considered. Finally, the average, standard deviation, minimum and maximum of the cosine similarities between the vector for each emotion name (e.g. "fear") and the word vectors of all words in a tweet are added.
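A minimal sketch of the similarity and capitalisation features is given below. Several details are assumptions made for illustration: the population standard deviation is used, the proportion of capitals is computed over non-whitespace characters, and all function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_features(word_vectors, emotion_vector):
    """Average, standard deviation, min and max similarity to the emotion name."""
    sims = [cosine(v, emotion_vector) for v in word_vectors]
    if not sims:  # tweet without any known word
        return [0.0, 0.0, 0.0, 0.0]
    return [mean(sims), pstdev(sims), min(sims), max(sims)]


def capitalisation_features(text):
    """Proportion of capital characters and of words starting with a capital."""
    chars = [c for c in text if not c.isspace()]
    words = text.split()
    capital_ratio = sum(c.isupper() for c in chars) / len(chars) if chars else 0.0
    initial_ratio = sum(w[0].isupper() for w in words) / len(words) if words else 0.0
    return [capital_ratio, initial_ratio]
```

These features have a fixed size regardless of tweet length, so they can be appended directly to the n-gram count vector.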
An RBF (Radial Basis Function) kernel was chosen in preference to a linear kernel: because the dataset is small, training time remains short even with the RBF kernel, and this kernel provided marginally better results.
A grid search over the model parameters C, gamma, tolerance and epsilon was applied to find the optimal set of parameters. The best combination is stored for each emotion model separately (see Table 1). Other model parameters were left at their default values in the sklearn.svm.SVR implementation, as those values performed better than the alternatives.
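The search procedure can be sketched generically as follows. The grid values and the scoring function are placeholders, not the ones used in our experiments; in practice `score_fn` would fit an SVR with the given parameters for one emotion and return a development-set score, which is an assumption about the setup rather than a description of it.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Exhaustively evaluate every parameter combination; return the best one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid over the four parameters named in the text.
PARAM_GRID = {
    "C": [0.1, 1.0, 10.0],
    "gamma": [0.01, 0.1],
    "tol": [1e-3],
    "epsilon": [0.01, 0.1],
}


def toy_score(params):
    """Stand-in for 'fit an SVR and evaluate it'; peaks at one grid point."""
    return -(abs(params["C"] - 1.0)
             + abs(params["gamma"] - 0.1)
             + abs(params["epsilon"] - 0.01))


best_params, best_score = grid_search(PARAM_GRID, toy_score)
```

Running the search once per emotion and keeping each `best_params` dictionary separately mirrors the per-emotion storage described above.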

Bidirectional LSTM
Preprocessed and tokenized sentences are converted to 100-dimensional Twitter GloVe word vectors. We also considered 200-dimensional vectors, however performance was slightly worse and memory requirements increased substantially.
Embedding vectors were fed into a BiLSTM network followed by a layer trained with dropout (Srivastava et al., 2014) to reduce over-fitting. The output of the dropout layer was fed to a network with two hidden layers before a final activation layer. Experiments were carried out on the two hidden layers, where the number of neurons was varied between 20 and 60 in the first hidden layer and between 10 and 20 in the second. For the sake of brevity, we only focus on the best performing architecture, which is 100-50-25-1 (see Figure 1). Smaller layer sizes are not sufficient to capture the shape of the data, and excessively large layer sizes lead to over-fitting and exponential growth of training time.
For the loss function in training, Mean Absolute Error (MAE) is used in preference to Mean Squared Error (MSE) as it assigns equal weight to the data points and thus emphasizes the extremes. The "Softsign" activation function was found to work best for the problem. Spearman and Pearson correlations are used as the main evaluation of network structures and parameter settings; however, we also considered R2 scores, as in some cases Spearman and Pearson scores remained the same over training epochs while the R2 score improved.
To avoid over-fitting, the number of training epochs is chosen by evaluating models after each epoch. The number of epochs after which training did not significantly improve the Spearman correlation ρ is chosen for the final model (see Table 2). It is evident that fear takes considerably longer to train, for example 4 times longer than joy.
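To make the relationship between the three evaluation measures concrete, the sketch below implements Pearson correlation, Spearman correlation and the R2 score with the standard library only. The numbers are invented for illustration and are not results from our models. Because both correlations are invariant to a positive linear rescaling of the predictions while R2 is not, predictions that become better calibrated over epochs can raise R2 without changing either correlation.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def r2_score(gold, pred):
    """Coefficient of determination of predictions against gold scores."""
    mean_gold = sum(gold) / len(gold)
    ss_res = sum((g - p) ** 2 for g, p in zip(gold, pred))
    ss_tot = sum((g - mean_gold) ** 2 for g in gold)
    return 1.0 - ss_res / ss_tot


# Invented example: 'late' is a positive linear rescaling of 'early'.
gold = [0.2, 0.4, 0.6, 0.8]
early = [0.45, 0.5, 0.52, 0.6]
late = [0.5 + 2.0 * (p - 0.5) for p in early]
```

Here `early` and `late` have identical Pearson and Spearman correlations with `gold`, but `late` has the higher R2 score because its predictions are spread over a range closer to that of the gold scores.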

Ensemble
With the limited time available, we attempted three simple approaches: taking the maximum, minimum and average of the intensities predicted by the two models. The best performance was obtained by averaging the BiLSTM and SVR outputs (see Table 3). We believe that further investigation of the characteristics that led to a better ensemble model would likely lead to improvements in model design, both in the BiLSTM itself and in alternative ensemble strategies.
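The three combination strategies amount to an element-wise operation over the two prediction lists, as in the following sketch. The function and argument names are hypothetical and the example values are invented.

```python
def ensemble(svr_preds, bilstm_preds, strategy="average"):
    """Combine two lists of intensity predictions element by element."""
    if len(svr_preds) != len(bilstm_preds):
        raise ValueError("prediction lists must have the same length")
    combine = {
        "average": lambda a, b: (a + b) / 2,
        "max": max,
        "min": min,
    }[strategy]
    return [combine(a, b) for a, b in zip(svr_preds, bilstm_preds)]


# Invented predictions for two tweets.
averaged = ensemble([0.2, 0.8], [0.4, 0.6])
```

Averaging pulls each prediction towards the other model's estimate, whereas taking the maximum or minimum always follows one of the two models.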

Discussion
Overall, we see that performance on the development data set, used to select model parameters, did not differ substantially from performance on the test set, indicating that overfitting did not occur (see Table 4). Interestingly, the difference between development and test set performance varies in line with the number of epochs. Concretely, fear and especially sadness see a strong performance gain on the test set, whereas the joy model, which was trained for the lowest number of epochs of all emotions, degraded in performance. Given that our performance relative to the best performing entry also followed this pattern and that a dropout layer was used, which has been shown to control overfitting (Srivastava et al., 2014), it seems likely that choosing a larger number of epochs would have resulted in better models.
Analysis of model prediction errors on test data revealed that extreme values were not modelled well by either the SVR or the BiLSTM model, with the SVR model performing marginally better, as seen for anger in Figure 2 (other emotions were similar). In the case of the BiLSTM model, we attribute this to the choice of L1 error as the loss function, which does not penalise large errors strongly. Overall performance with this loss function was, however, better on the provided data.
We also attempted to use the Emotion Hashtag Corpus (Mohammad, 2012) as training data for the BiLSTM model. This corpus only has category labels, so a model was built providing class probabilities, which were used as a proxy for intensity of the emotion classes. The performance was worse than random however, with an average R2 score of -3.63 (correlation 0.28), and this approach was abandoned. We believe this is due to two main factors: the intrinsic noise associated with emotion hash tags as emotion labels and that emotion probability is not a good analogue for emotion intensity. It would be interesting to experiment in the future with adding a binary feature for each emotion provided by a model trained on the hashtag corpus to our models.

Conclusion
The English language datasets provided for the WASSA competition are relatively clean but small, and the annotated labels for four emotions are precise and valuable. We performed experiments on the provided data, drawing on our experience in emotion detection. The best models were developed further and packaged as an accessible service and software. The service is now available as part of the MixedEmotions platform (http://mixedemotions.insight-centre.org/) as well as on DockerHub as a Docker image.