HGSGNLP at IEST 2018: An Ensemble of Machine Learning and Deep Neural Architectures for Implicit Emotion Classification in Tweets

This paper describes our system designed for the WASSA-2018 Implicit Emotion Shared Task (IEST). The task is to predict the emotion category expressed in a tweet from which the terms angry, afraid, happy, sad, surprised, disgusted and their synonyms have been removed. Our final submission is an ensemble of one supervised learning model and three deep neural network based models, each of which approaches the problem from an essentially different direction. Our system achieves a macro F1 score of 65.8%, a 5.9% improvement over the baseline, and is ranked 12th out of 30 participating teams.


Introduction
In Natural Language Processing, emotion recognition is concerned with associating words, phrases or documents with predefined emotion categories, such as Anger, Anticipation and Sadness (Ekman, 1999; Plutchik, 2001). Most previous work on emotion recognition (Wang et al., 2012; Bestgen and Vincze, 2012; Suttles and Ide, 2013; Recchia and Louwerse, 2015; Hollis et al., 2017) presumes that emotion words or their representations are accessible. Such models might fail to learn associations for more subtle descriptions and therefore fail to predict the emotion when overt emotion words are not available.
The WASSA-2018 Implicit Emotion Shared Task (IEST) (Klinger et al., 2018) aims to predict the emotion category of a given tweet when the explicit emotion word, or trigger word, is removed. The emotion category can be one of six classes: Anger, Disgust, Fear, Joy, Sadness and Surprise. For example: 1. "It's [#TARGETWORD#] when you feel like you are invisible to others." 2. "We are so [#TARGETWORD#] that people must think we are on good drugs or just really good actors." In these two examples, with the help of common sense or world knowledge, the implicit emotion can still be inferred from context as Sadness and Joy, respectively. The [#TARGETWORD#] tokens in the examples indicate the position of the removed word in the given tweet.
Our submitted system is an ensemble of four broad sets of approaches, combined using a weighted average of the separate predictions. One approach uses a traditional lexicon-based method to train a logistic regression classifier, while the remaining three approaches represent the input tweet as word vectors and use neural network based architectures to predict the emotion category of the tweet.
The rest of the paper is structured as follows. Section 2 describes the features used in our system. Section 3 explains the various approaches used by our ensemble model and the way we combined the predictions. Section 4 states the experiment results and discusses the implications of those results. We conclude our work in Section 5.

Word
The current word and its lowercase format are used as features. To provide additional context information, word n-grams and character n-grams are also used.
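The word-level features above can be sketched as follows. This is a minimal illustration rather than our exact feature extractor: whitespace tokenization, the feature-name prefixes (`w=`, `lw=`, `c3=`, `w2=`) and the n-gram orders are assumptions made for the example.

```python
from typing import List


def word_ngrams(tokens: List[str], n: int) -> List[str]:
    """Return word n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(word: str, n: int) -> List[str]:
    """Return the character n-grams of a single word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def extract_features(tweet: str, max_word_n: int = 2, char_n: int = 3) -> List[str]:
    """Build string-valued features: surface form, lowercase form and n-grams."""
    tokens = tweet.split()  # assumption: simple whitespace tokenization
    lowered = [tok.lower() for tok in tokens]
    features: List[str] = []
    for tok, low in zip(tokens, lowered):
        features.append("w=" + tok)    # original word
        features.append("lw=" + low)   # lowercase word
        features.extend("c%d=%s" % (char_n, g) for g in char_ngrams(low, char_n))
    for n in range(2, max_word_n + 1):
        features.extend("w%d=%s" % (n, g) for g in word_ngrams(lowered, n))
    return features
```

Words shorter than the character n-gram order simply contribute no character n-gram features.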

Word Embeddings
Word embeddings are trained on a large collection of unlabeled raw tweets and are used both as input to the neural network models and for generating word clusters.
From an initial collection of 1.6 billion tweets, the collection is filtered to only include tweets that From this tweet collection, word embeddings are generated following the steps described in Toh and Su (2016). Besides using the previous two approaches (Gensim and GloVe tool), the fastText tool (Bojanowski et al., 2017) is also used to generate word embeddings.

Word Cluster
K-means clusters are generated from the word embeddings using the K-means implementation of Apache Spark MLlib. From the K-means clusters, word cluster features are generated. For each word, the cluster id that the word belongs to is used as a feature.
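To illustrate how a cluster id becomes a feature, the sketch below assigns each word to its nearest centroid and emits one feature per in-vocabulary word. It assumes the centroids have already been computed (in our system by Spark MLlib, which is not reproduced here); the function names, the `cl=` prefix and the toy two-dimensional vectors are hypothetical.

```python
from typing import Dict, List, Sequence


def nearest_centroid(vec: Sequence[float], centroids: List[Sequence[float]]) -> int:
    """Return the index of the centroid closest to vec (squared Euclidean distance)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda k: sq_dist(vec, centroids[k]))


def cluster_features(tokens: List[str],
                     embeddings: Dict[str, Sequence[float]],
                     centroids: List[Sequence[float]],
                     prefix: str = "cl") -> List[str]:
    """Emit one cluster-id feature for every token that has an embedding."""
    feats = []
    for tok in tokens:
        vec = embeddings.get(tok.lower())
        if vec is not None:  # out-of-vocabulary words yield no cluster feature
            feats.append("%s=%d" % (prefix, nearest_centroid(vec, centroids)))
    return feats


# Toy usage with made-up vectors and centroids.
toy_embeddings = {"happy": [1.0, 0.0], "sad": [0.0, 1.0]}
toy_centroids = [[0.9, 0.1], [0.1, 0.9]]
toy_feats = cluster_features(["Happy", "unknown", "sad"], toy_embeddings, toy_centroids)
```

Using a different prefix per clustering would keep features from several sets of clusters apart.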

Approaches
This section describes the five approaches we explored to generate the emotion predictions, four of which are included in the final ensemble.

Approach 1: Lexicon Model
The Vowpal Wabbit tool is used to train a multiclass classifier using the one-against-all setting (--oaa).
The features used to train the classifier include the words in the tweet (both original and lowercase format) and word clusters where 5 different word clusters are used.
Table 1 shows the command line arguments used to train the Vowpal Wabbit model.

Approach 2: fastText Model
The fastText tool is used to train a text classifier using the supervised subcommand (Joulin et al., 2017). The lowercase words in the tweet are used to train the classifier.
Table 2 shows the command line arguments used to train the fastText model.
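As a small illustration of the input preparation, the sketch below writes one training line per tweet in the format expected by the fastText supervised subcommand, where the label is marked with the default `__label__` prefix. The helper name is hypothetical, and the training arguments themselves are those listed in Table 2, which are not repeated here.

```python
def to_fasttext_line(label: str, tweet: str) -> str:
    """Format one training example as '__label__<label> <lowercased text>'."""
    text = " ".join(tweet.lower().split())  # lowercase and collapse whitespace
    return "__label__%s %s" % (label, text)


# Example built from one of the tweets shown in the introduction.
example_line = to_fasttext_line(
    "sad", "It's [#TARGETWORD#] when you feel like you are invisible to others.")
```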

Approach 3: Convolutional Neural Network Model
Convolutional Neural Networks (CNNs) have been shown to work well for sentence-level classification tasks (Kim, 2014). Here we detail the architecture of our network.
Input and Embedding Layer: Each tweet is preprocessed by (1) normalizing emoji to text; (2) normalizing hyperlinks and @mentions to someurl and someuser; and (3) splitting hashtag chunks into separate words. Then the tweet is converted into a concatenated vector and padded to an equal length (or truncated if the tweet is longer than the pre-defined length). The input vector is fed to the embedding layer (i.e., pretrained glove.twitter.27B vectors), which converts each word into a distributional vector.
CNN Layer: The concatenated vector representation of the tweet is then fed to the CNN. The number of hidden units is set to 256. We apply tanh as the activation and dropout with a rate of 0.2.
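The preprocessing steps (2) and (3) and the padding step can be sketched as below. This is a simplified stand-in, not our actual pipeline: the regular expressions, the camel-case heuristic for splitting hashtags, the `<pad>` token and all function names are assumptions, and emoji normalization (step 1) is omitted because it depends on an external tool.

```python
import re
from typing import List

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# The lookbehind keeps the task's [#TARGETWORD#] placeholder intact.
HASHTAG_RE = re.compile(r"(?<!\[)#(\w+)")
# Assumed heuristic: split hashtags on camel case and digit boundaries.
CAMEL_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def split_hashtag(match) -> str:
    """Turn '#SoHappy' into 'So Happy'; fall back to the raw tag text."""
    parts = CAMEL_RE.findall(match.group(1))
    return " ".join(parts) if parts else match.group(1)


def normalize(tweet: str) -> List[str]:
    """Replace URLs and mentions, split hashtags, lowercase and tokenize."""
    tweet = URL_RE.sub("someurl", tweet)
    tweet = MENTION_RE.sub("someuser", tweet)
    tweet = HASHTAG_RE.sub(split_hashtag, tweet)
    return tweet.lower().split()


def pad_or_truncate(tokens: List[str], max_len: int, pad_token: str = "<pad>") -> List[str]:
    """Force every tweet to exactly max_len tokens."""
    return tokens[:max_len] + [pad_token] * max(0, max_len - len(tokens))
```

The fixed-length token sequence would then be mapped to word indices and passed to the embedding layer.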
Output Layer: The output of the CNN is flattened and then passed to a fully connected layer. Finally, a softmax layer is added on top of the fully connected layer. The network is trained by minimizing the categorical cross-entropy error with Adam for parameter optimization. Figure 1 (a) shows the model architecture of the CNN model.

Approach 4: Sequence Modeling using CNN and LSTM
The Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) architecture is an advanced version of the RNN and has been successful in the NLP domain on various tasks (Graves and Schmidhuber, 2005; Graves and Jaitly, 2014).
Combining CNN and LSTM has also been found to be quite successful (Zhou et al., 2015; Goel et al., 2017). In this approach, we use a CNN to extract regional features and then use a Bi-LSTM to capture compositional semantics from both the forward and backward directions of the word sequence.
Since the input, embedding and CNN layers are the same as in Approach 3, we only detail the architectures of the layers that differ.
Bi-LSTM with Pooling Layer: We use bidirectional LSTMs followed by pooling layers to model the output from the CNN layer. The number of hidden units is set to 300. We apply relu as the activation and dropout with a rate of 0.2. The outputs of max pooling and average pooling are concatenated.
Output Layer: The concatenated output of the Bi-LSTM with Pooling layer is then passed to a fully connected layer. Finally, a sigmoid layer is added on top of the fully connected layer. The network is trained by minimizing the categorical cross-entropy error with Adam for parameter optimization.
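The concatenation of max pooling and average pooling over the Bi-LSTM outputs can be illustrated in plain Python. The sketch assumes the Bi-LSTM has already produced one hidden vector per time step; the function name and the toy values are hypothetical, and a real implementation would use the pooling layers of a deep learning framework.

```python
from typing import List


def max_avg_pool(states: List[List[float]]) -> List[float]:
    """Pool T hidden vectors of size d over time into one vector of size 2d."""
    if not states:
        raise ValueError("need at least one time step")
    dims = range(len(states[0]))
    max_pooled = [max(s[j] for s in states) for j in dims]
    avg_pooled = [sum(s[j] for s in states) / len(states) for j in dims]
    return max_pooled + avg_pooled  # concatenation of the two pooled vectors


# Two time steps with two hidden units each.
pooled = max_avg_pool([[1.0, 4.0], [3.0, 2.0]])
```

The pooled vector has twice the hidden size, which is what the fully connected output layer receives.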
Figure 1 (b) shows the model architecture of the sequence model.

Approach 5: Residual LSTM Model
Residual LSTM (Kim et al., 2017) adds an additional spatial shortcut path from lower layers to better deal with vanishing gradients. It provides efficient training of deep networks with multiple LSTM layers and has been successfully applied to speech recognition and NER tasks (Tran et al., 2017). In this formulation, $l$ denotes the layer index, and $i_t^l$, $f_t^l$ and $o_t^l$ are the input, forget and output gates, respectively.
$x_t^l$ is the input from the $(l-1)$-th layer, $h_{t-1}^l$ is the layer output at time $t-1$, and $c_{t-1}^l$ is the internal cell state at time $t-1$. A shortcut from the prior layer's output $h_t^{l-1}$ is added to the projection output $m_t^l$. Figure 1 (c) shows the model architecture of our residual LSTM model. Two Bi-LSTM layers are included and the number of hidden units is set to 512. We apply relu as the activation and dropout with a rate of 0.2. The network is then trained by minimizing the categorical cross-entropy error with Adam for parameter optimization.
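For reference, the residual LSTM update can be sketched with the symbols defined above. The weight matrices $W$, peephole weights $w$, biases $b$ and the intermediate $r_t^l$ are notation assumed here, written from our reading of Kim et al. (2017), and should be checked against that paper; $\sigma$ is the logistic sigmoid, $\odot$ is element-wise multiplication, and $x_t^l = h_t^{l-1}$.

```latex
\begin{align}
i_t^l &= \sigma\left(W_{xi}^l x_t^l + W_{hi}^l h_{t-1}^l + w_{ci}^l \odot c_{t-1}^l + b_i^l\right) \\
f_t^l &= \sigma\left(W_{xf}^l x_t^l + W_{hf}^l h_{t-1}^l + w_{cf}^l \odot c_{t-1}^l + b_f^l\right) \\
c_t^l &= f_t^l \odot c_{t-1}^l + i_t^l \odot \tanh\left(W_{xc}^l x_t^l + W_{hc}^l h_{t-1}^l + b_c^l\right) \\
o_t^l &= \sigma\left(W_{xo}^l x_t^l + W_{ho}^l h_{t-1}^l + W_{co}^l c_t^l + b_o^l\right) \\
r_t^l &= \tanh\left(c_t^l\right) \\
m_t^l &= W_p^l \, r_t^l \\
h_t^l &= o_t^l \odot \left(m_t^l + W_h^l x_t^l\right)
\end{align}
```

The last line carries the shortcut: the lower layer's output $x_t^l = h_t^{l-1}$ is added to the projection output $m_t^l$ before the output gate is applied, so gradients can also flow through the spatial path rather than only through the cell state.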

Ensemble Model
To combine the predictions of the models described above, we compute a weighted average of the category probabilities produced by the four models included in the final ensemble. The trial data is used to select the optimal weight of each model. The predicted emotion category is the one with the highest weighted average.
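The weighted averaging step can be sketched as follows. The weight selection shown is only one plausible procedure: the exhaustive grid search, the candidate grid values and the use of accuracy as the selection criterion are assumptions made for the example, and all function names are hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple


def weighted_average(prob_sets: List[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Average per-model class probabilities for one tweet, weighted per model."""
    total = sum(weights)
    n_classes = len(prob_sets[0])
    return [sum(w * p[c] for w, p in zip(weights, prob_sets)) / total
            for c in range(n_classes)]


def predict(prob_sets: List[Sequence[float]], weights: Sequence[float], labels: List[str]) -> str:
    """Return the label with the highest weighted average probability."""
    avg = weighted_average(prob_sets, weights)
    return labels[max(range(len(avg)), key=avg.__getitem__)]


def select_weights(model_probs, gold: List[str], labels: List[str],
                   grid: Tuple[float, ...] = (0.0, 0.5, 1.0)):
    """Grid-search model weights on held-out data (assumed criterion: accuracy).

    model_probs[m][i] holds the class probabilities of model m for tweet i.
    """
    n_models = len(model_probs)
    best_weights, best_acc = None, -1.0
    for weights in product(grid, repeat=n_models):
        if sum(weights) == 0:
            continue  # at least one model must contribute
        correct = sum(
            predict([model_probs[m][i] for m in range(n_models)], weights, labels) == g
            for i, g in enumerate(gold))
        acc = correct / len(gold)
        if acc > best_acc:
            best_weights, best_acc = weights, acc
    return best_weights, best_acc
```

In practice the selection criterion could be replaced by the macro-averaged F1 score, since that is the official metric of the task.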

Dataset and Evaluation Metric
The task organizers provide a training dataset (153k instances) and a small blind trial dataset (9.6k instances) for system building. A period of one week is then given for submitting the predictions on a blind test dataset (29k instances).
Macro-averaged F1 score is chosen to be the official evaluation metric.
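For clarity, macro-averaged F1 is the unweighted mean of the per-class F1 scores, so every emotion counts equally regardless of its frequency. The sketch below is a minimal reference implementation, not the official scorer; the function name and the toy labels are hypothetical.

```python
from typing import List


def macro_f1(gold: List[str], pred: List[str], labels: List[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


# Toy example: F1(joy) = 2/3 and F1(sad) = 4/5, so the macro average is 11/15.
toy_score = macro_f1(["joy", "joy", "sad", "sad"],
                     ["joy", "sad", "sad", "sad"],
                     ["joy", "sad"])
```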

Results on Trial Data and Analysis
The optimal setting for each model is decided using cross validation on the training dataset. The weighted average is then computed from the individual predictions to generate the predictions of the final ensemble model, using the trial dataset as described in Section 3.6. Table 3 shows the trial results for all individual models and the ensemble model.
We observe that the Lexicon approach achieves the best score among all individual approaches. Among the four deep neural models, CNN+LSTM and fastText achieve a better score of 62% compared to CNN and Residual-LSTM, which suggests that both the combination of long sequence and regional features and the word n-grams capture effective information. Since the residual LSTM network does not perform as expected, we did not include it in our final ensemble model.
We also observe that the ensemble model achieves the best performance compared with each individual model and offers equal or better performance across all the emotions, which indicates that the four approaches complement each other quite well. We also compare the results achieved by our submitted ensemble system, the official baseline system and the top-ranked systems in Table 5. Our ensemble model achieves a macro F1 score of 65.8%, which beats the baseline model by 5.9%. However, the top-ranked systems all incorporate models trained in previous emotion related tasks (e.g., SemEval 2018: Affect in Tweets) as additional features. This is probably the reason for our performance gap.

Conclusion and Future Work
In this paper, we propose a hybrid framework to predict the emotion category of tweets when no explicit emotion words are present. The proposed approach combines a lexicon-based logistic regression classifier, fastText, Convolutional Neural Networks and Sequence Modeling using CNN and LSTM, allowing us to explore the different directions each methodology can take. Our system HGSGNLP, submitted to the IEST 2018 Shared Task, beats the baseline system by 5.9% on the test set.
Compared to the best systems, there is still room for improvement.
In the future, we would like to experiment with some other filters provided in the AffectiveTweets package (Mohammad and Bravo-Marquez, 2017), such as TweetToSentiStrengthFeatureVector. We would also like to experiment with incorporating lexicon features into the existing neural networks.

Figure 1 :
The architectures of our three neural models. (a) is the neural model for Approach 3. (b) is the neural model for Approach 4. (c) is the neural model for Approach 5.

Table 2 :
fastText command line arguments used to train the model.

Table 3 :
Performance comparison between individual models and the ensemble model on trial data. Our final ensemble model includes the lexicon, fastText, CNN and CNN-LSTM models.

Table 4 :
Official results for our submission.
Table 4 reports our official results on test data. Among the individual emotions, our ensemble model gives the best performance for Joy, followed by Fear and Disgust.

Table 5 :
Performance comparison between our system, the official baseline system and top-ranked systems on the IEST shared task. The numbers in parentheses are the official rankings.