Sentylic at IEST 2018: Gated Recurrent Neural Network and Capsule Network Based Approach for Implicit Emotion Detection

In this paper, we present the system we used for the WASSA 2018 Implicit Emotion Shared Task. The task is to predict the emotion of a tweet from which the explicit mentions of emotion terms have been removed. The idea is to come up with a model that can implicitly identify the emotion expressed, given the context words. We used a model based on a Gated Recurrent Unit (GRU) network and a Capsule Network for the task. Pre-trained word embeddings are used to incorporate contextual knowledge about words into the model. The GRU layer learns latent representations from the input word embeddings. The subsequent Capsule Network layer learns high-level features from that hidden representation. The proposed model achieved a macro-F1 score of 0.692.


Introduction
Emotion is a complex aspect of human behavior that distinguishes humans from other creatures. Emotions typically originate as a response to a situation. Since the emergence of social media, people often express opinions in response to daily encounters by posting on these platforms. These microblogs contain emotions related to the topics the authors have discussed. Emotion detection is therefore useful for understanding the more specific sentiments an author holds towards the discussed topics, which makes it a challenge with significant business value.
Emotion analysis can be considered an extension of sentiment analysis. Even though there has been a notable amount of research on sentiment analysis in the literature, emotion analysis has not received as much attention. Related work suggests that this task can be handled using emojis or hashtags present in the text, i.e. distant supervision techniques (Felbo et al., 2017). However, these features can be unreliable due to noise, which affects the accuracy of the results.
Although explicit words related to emotions (happy, sad, etc.) in a document directly affect the emotion detection task, other linguistic features play a major role as well. The Implicit Emotion Recognition Shared Task introduced in Klinger et al. (2018) aims at developing models that can classify a text into one of six emotions (Anger, Fear, Sadness, Joy, Surprise, Disgust) without having access to an explicit mention of an emotion word. Participants were given tweets from which any of the above emotion terms or one of their synonyms had been removed. The task is to predict the emotion that the excluded word expresses. E.g.: It's [#TARGETWORD#] when you feel like you are invisible to others.
In this paper, we propose an approach based on Gated Recurrent Units (GRU) (Cho et al., 2014) followed by Capsule Networks (Sabour et al., 2017) to tackle the challenge. This model achieved a macro-F1 score of 0.692 and ranked 5th in the WASSA 2018 implicit emotion detection task.

Methodology
We used a sentence classification model based on bidirectional GRUs and Capsule networks. First, the raw tweets are preprocessed and then mapped into a continuous vector space using an embedding layer. Next, a Bidirectional Gated Recurrent Unit (Bi-GRU) (Cho et al., 2014) layer encodes each sentence into a fixed-length representation. This representation is then fed into a Capsule Network (Sabour et al., 2017), which learns the features and emotional context of the sentence. Finally, the Capsule network is followed by a fully connected dense layer with softmax activation for classification.

Preprocessing
Microblogs typically contain informal language such as shortened terms, emojis, misspellings, and hashtags. Hence, preprocessing steps are needed to clean this informal and noisy text data. Moreover, effective preprocessing plays a vital role in achieving good performance. The Ekphrasis tool (Baziotis et al., 2017) is used for the initial preprocessing of the tweets. The preprocessing steps are tweet tokenizing, word normalization, spell correcting, and word segmentation for hashtags.

Tweet Tokenizing
Tokenizing is the first and most important step of preprocessing. The ability to correctly tokenize a tweet directly impacts the quality of a system. Since short texts such as tweets contain a large variety of vocabulary and expressions, correctly tokenizing a given tweet is challenging. Twitter markup, emoticons, emojis, dates, times, currencies, acronyms, censored words (e.g. s**t), and words with emphasis (e.g. *very*) are recognized during tokenizing, and each is treated as a separate token.

Word Normalization
After tokenizing, a set of transformations is applied: converting to lowercase and mapping URLs, usernames, emails, phone numbers, dates, times, and hashtags to a predefined set of tags (e.g. @user1 → <user>). This helps to reduce the vocabulary size and generalize the tweet.
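The normalization step described above can be illustrated with a minimal sketch. The regular expressions and the function name `normalize` are hypothetical stand-ins written for illustration; they are not the patterns Ekphrasis actually uses, and only three of the tag types are covered.

```python
import re

# Hypothetical (tag, pattern) pairs; Ekphrasis ships its own, more thorough expressions.
# Order matters: e-mail addresses are replaced before bare @user mentions.
PATTERNS = [
    ("<url>", re.compile(r"https?://\S+|www\.\S+")),
    ("<email>", re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")),
    ("<user>", re.compile(r"@\w+")),
]

def normalize(tweet):
    """Lowercase the tweet and replace URLs, e-mails and usernames with tags."""
    text = tweet.lower()
    for tag, pattern in PATTERNS:
        text = pattern.sub(tag, text)
    return text
```

For example, `normalize("@User1 check http://t.co/abc")` yields `<user> check <url>`, so every username and link collapses to a single vocabulary entry.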

Spell Correcting and Word Segmentation
As the last step in preprocessing, we apply spell correcting and word segmentation to hashtags (e.g. #makeitrain → make it rain).
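To make the hashtag segmentation step concrete, the following toy sketch splits a hashtag into the fewest known words using dynamic programming. The vocabulary `VOCAB` and the function `segment` are assumptions made for illustration; a real segmenter such as the one in Ekphrasis scores candidate splits with corpus word statistics rather than a hand-written word list.

```python
from functools import lru_cache

# Toy vocabulary (hypothetical); real tools rely on large corpus word statistics.
VOCAB = {"make", "it", "rain", "a", "in", "ma"}

def segment(hashtag):
    """Split a hashtag into the fewest vocabulary words; fall back to the raw text."""
    text = hashtag.lstrip("#").lower()

    @lru_cache(maxsize=None)
    def best(start):
        # Best segmentation of text[start:], or None if no full split exists.
        if start == len(text):
            return ()
        candidates = []
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word in VOCAB:
                rest = best(end)
                if rest is not None:
                    candidates.append((word,) + rest)
        return min(candidates, key=len) if candidates else None

    result = best(0)
    return list(result) if result is not None else [text]
```

With this vocabulary, `segment("#makeitrain")` returns the three words make, it, rain, while a hashtag that cannot be fully covered by the vocabulary is left unsplit.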

Model
An overview of the model is shown in Figure 1, and each segment of the model is described in the following subsections.

Word Embedding Layer
The word embedding layer is the first layer of the model. Each token is mapped into a continuous vector space using a set of pretrained word embeddings. We used 300-dimensional pretrained Word2Vec embeddings. Given an input tweet $S = [s_1, s_2, \dots, s_i, \dots, s_n]$, where $s_i$ is the token at position $i$, and the embedding matrix $W_e$, the output of the embedding layer is $X = [x_1, x_2, \dots, x_i, \dots, x_n]$, where $x_i$ is the row of $W_e$ that corresponds to $s_i$.
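The lookup performed by the embedding layer can be sketched as follows. The vectors are made-up 3-dimensional toy values (the model uses 300-dimensional Word2Vec vectors), and mapping unknown tokens to a zero vector is an assumption of this sketch, not something stated in the paper.

```python
# Toy 3-dimensional vectors with made-up values, for illustration only.
TOY_EMBEDDINGS = {
    "make": [0.2, -0.1, 0.5],
    "it": [0.0, 0.3, -0.2],
    "rain": [-0.4, 0.1, 0.7],
}
UNKNOWN = [0.0, 0.0, 0.0]  # assumption: out-of-vocabulary tokens map to a zero vector

def embed(tokens):
    """Return one embedding vector per token, in token order."""
    return [TOY_EMBEDDINGS.get(token, UNKNOWN) for token in tokens]
```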

Bidirectional GRU Layer
The word embedding layer is followed by a bidirectional GRU (Cho et al., 2014) layer. It consists of a forward GRU ($\overrightarrow{h_t}$) and a backward GRU ($\overleftarrow{h_t}$), and the latent representations output by the two GRUs are concatenated to give the final output of the layer, $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$. Each GRU follows the standard formulation and notation of Cho et al. (2014).
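For reference, the standard GRU update of Cho et al. (2014) can be written as below. This is a sketch under assumed symbol choices ($x_t$ is the embedding at step $t$, $r_t$ and $z_t$ are the reset and update gates, $W$ and $U$ are weight matrices, and bias terms are omitted); it is not necessarily the exact form used in the implementation. The backward GRU applies the same update while reading the tweet from right to left.

```latex
\begin{align}
r_t &= \sigma(W_r x_t + U_r h_{t-1}) \\
z_t &= \sigma(W_z x_t + U_z h_{t-1}) \\
\tilde{h}_t &= \tanh\bigl(W x_t + U (r_t \odot h_{t-1})\bigr) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{align}
```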

Capsule Layer
The features encoded by the bidirectional GRU layer are then passed to a Capsule Network (Sabour et al., 2017). A Capsule Network consists of a set of capsules, where each capsule corresponds to a high-level feature. Each capsule outputs a vector whose magnitude represents the probability that the corresponding feature exists. The following equations use the standard notation of Sabour et al. (2017). The prediction vector $\hat{u}_{j|i}$ is calculated by multiplying the output $h_i$ of the GRU layer by a weight matrix $W_{ij}$:
$$\hat{u}_{j|i} = W_{ij} h_i$$
The total input to a capsule, $s_j$, is a weighted sum over all the prediction vectors $\hat{u}_{j|i}$:
$$s_j = \sum_i c_{ij} \hat{u}_{j|i}$$
where $c_{ij}$ are the coupling coefficients found through iterative dynamic routing. A non-linear "squash" function scales the vectors so that their magnitude is mapped to a value between 0 and 1:
$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \frac{s_j}{\|s_j\|}$$
The dynamic routing process introduced by Sabour et al. (2017) is used as the routing mechanism between capsules.
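The squash function and the routing procedure described above can be sketched in plain Python. This is a minimal illustration of routing-by-agreement as described by Sabour et al. (2017), not the code used in the experiments; the function names and the nested-list layout of `u_hat` are assumptions of the sketch.

```python
import math

def squash(s):
    """Scale vector s so its length lies in [0, 1) while keeping its direction."""
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0 for _ in s]
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dynamic_routing(u_hat, iterations=5):
    """u_hat[i][j] is the prediction vector from input i for output capsule j."""
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits, start uniform
    v = []
    for _ in range(iterations):
        # Coupling coefficients: softmax over output capsules for each input.
        c = [softmax(row) for row in b]
        # Weighted sum of prediction vectors per capsule, then squash.
        v = []
        for j in range(n_out):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        # Increase a logit when a prediction agrees with the capsule output.
        for i in range(n_in):
            for j in range(n_out):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v
```

The default of 5 iterations mirrors the number of routing iterations reported in the hyper-parameters.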

Classification Layer
The flattened output of the capsule layer (say $C$) is fed to a dense layer with softmax activation. It outputs a vector of size 6 (the number of classes), whose components are the probabilities of each of the six emotions.
For all $y_i \in Y$, where $Y$ denotes the dense layer outputs before the softmax, $f_i$ is calculated as follows.
$$f_i = \frac{\exp(y_i)}{\sum_{j=1}^{6} \exp(y_j)}$$
Then the class with the highest $f_i$ is taken as the output.
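The softmax decision rule described above can be sketched as follows. The ordering of the labels in `EMOTIONS` is hypothetical, and the input scores stand in for the dense layer outputs.

```python
import math

# Hypothetical label order; the actual index-to-emotion mapping is not given in the text.
EMOTIONS = ["anger", "fear", "sadness", "joy", "surprise", "disgust"]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(y - m) for y in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(scores):
    """Return the most probable emotion and the full probability vector."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return EMOTIONS[best], probs
```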

Regularization
Gaussian noise is added to both the embedding layer and the softmax classification layer to make the model more robust to overfitting. In addition, dropout is applied to the Capsule network output and spatial dropout is applied to the embedding layer to reduce overfitting.
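The three regularizers mentioned above behave roughly as in the sketch below, which applies them to plain Python lists at training time. The function names are illustrative; the spatial variant drops whole embedding dimensions across all time steps, which is how spatial dropout over a sequence is commonly defined, and the rescaling by 1 / (1 - rate) is the usual inverted-dropout convention assumed here.

```python
import random

def gaussian_noise(vector, stddev, rng):
    """Add zero-mean Gaussian noise to every component."""
    return [x + rng.gauss(0.0, stddev) for x in vector]

def dropout(vector, rate, rng):
    """Inverted dropout: zero each component with probability rate, rescale the rest."""
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in vector]

def spatial_dropout(sequence, rate, rng):
    """Drop entire feature dimensions: one mask shared by all time steps."""
    keep = 1.0 - rate
    dim = len(sequence[0])
    mask = [1.0 / keep if rng.random() < keep else 0.0 for _ in range(dim)]
    return [[x * m for x, m in zip(step, mask)] for step in sequence]
```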

Experiments and Results
Experimental Setup

Training
We used the Adam optimizer (Kingma and Ba, 2014) to optimize our network, with a batch size of 512. Gradient clipping (Pascanu et al., 2013) was employed to address the exploding gradient problem, with all gradients clipped at 1. Keras (Chollet et al., 2015) was used to develop the model, and experiments were run using both the TensorFlow (Abadi et al., 2015) and Theano (Theano Development Team, 2016) backends. Google Colaboratory was used as the runtime environment for training the model.
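The text does not say whether gradients were clipped by norm or by value, so the sketch below shows both readings of "clipped at 1". The function names are illustrative and operate on a flat list of gradient components.

```python
import math

def clip_by_norm(grad, max_norm=1.0):
    """Rescale the gradient so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]

def clip_by_value(grad, limit=1.0):
    """Clamp each gradient component to the interval [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grad]
```

Norm clipping preserves the direction of the update, while value clipping can change it when only some components exceed the limit.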

Hyper-Parameters
We employed 300-dimensional Word2Vec embeddings for the embedding layer. The GRU layer consists of 128 cells for both directions. We used 16 capsules, each with an output size of 32, and 5 routing iterations. Spatial dropout of 0.3 is applied to the embeddings and dropout of 0.25 is applied to the Capsule network. Gaussian noise of 0.1 is added to both the embedding layer and the Capsule network.
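As a quick sanity check, the reported settings can be collected in a configuration dictionary and used to derive the layer sizes they imply. The dictionary keys and the helper `derived_sizes` are hypothetical, and the calculation assumes that 128 is the number of GRU units per direction and that the flattened capsule output feeds a dense layer with one bias per class.

```python
# Values taken from the hyper-parameters reported above; key names are illustrative.
HPARAMS = {
    "embedding_dim": 300,
    "gru_units": 128,          # assumed to be per direction
    "num_capsules": 16,
    "capsule_dim": 32,
    "routing_iterations": 5,
    "spatial_dropout": 0.3,
    "capsule_dropout": 0.25,
    "gaussian_noise": 0.1,
    "num_classes": 6,
    "batch_size": 512,
}

def derived_sizes(hp):
    """Compute sizes implied by the configuration under the stated assumptions."""
    bigru_dim = 2 * hp["gru_units"]  # forward and backward states concatenated
    flat_capsule_dim = hp["num_capsules"] * hp["capsule_dim"]
    dense_params = flat_capsule_dim * hp["num_classes"] + hp["num_classes"]
    return {
        "bigru_dim": bigru_dim,
        "flat_capsule_dim": flat_capsule_dim,
        "dense_params": dense_params,
    }
```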

Results
We ranked 5th among 30 contestants in the competition. We achieved a macro-F1 score of 0.692, a 0.155 improvement over the baseline model (a Maximum Entropy model using bag-of-words (BoW) and bigram features). The top-ranked model has a 0.031 improvement over our model.
Recurrent Neural Networks (RNNs) can capture the sequential features present in sentences. Further, when they are combined with attention mechanisms, the accuracy of the models increases notably (Yang et al., 2016; Tang et al., 2015). Hence, we first implemented a model that uses a bidirectional GRU (Cho et al., 2014) layer to learn latent representations, followed by a hierarchical attention mechanism. Attention mechanisms can capture important keywords in sentences and give a higher weight to those words. This is one of the prominent approaches that typically performs well in regular text classification tasks. Table 2 shows that this approach yielded reasonable performance, yet it was not the best performing approach.
Another approach is to use a Convolutional Neural Network (CNN) (Kim, 2014) layer on top of RNNs instead of attention mechanisms. The intuition is that the CNN layers act as a different attention mechanism and capture high-level features from the features learned by the layers below. Hence, the second approach we investigated used CNNs instead of the attention mechanism. As Table 2 shows, this approach resulted in a slight drop in performance compared to the previous approach.
Our next approach was to use Capsule networks (Sabour et al., 2017) instead of Convolutional Neural Networks (CNN). Capsule networks have shown promising results in the field of computer vision. Sabour et al. (2017) argue that it is essential to preserve the hierarchical translational and rotational features of the identified high-level features in order to perform image classification and object detection in computer vision. However, traditional CNNs with max-pooling layers tend to lose this spatial information related to the identified features. Sabour et al. (2017) introduce capsule networks to tackle these issues identified in traditional CNNs. Nonetheless, the usability of Capsule networks has not been researched much in the Natural Language Processing (NLP) community. Along the same lines, we can intuitively argue that CNN based models with pooling layers will cause loss of information in text classification tasks as well. Hence, we investigated the usability of capsule networks for improving the performance of text classification models. Using Capsule networks instead of CNNs slightly improved the performance and gave the best performing model.

Model                          Macro-F1
GRU + Hierarchical Attention   0.671
GRU + CNN                      0.657
GRU + Capsnet                  0.692
Table 2: Performance analysis of the best model in each investigated approach.

Model                          Macro-F1
GRU (1 layer) + Capsnet        0.692
LSTM (1 layer) + Capsnet       0.687
GRU (2 layers) + Capsnet       0.678
Table 3: Performance analysis of different variants of the proposed system.

Model Architecture Variants
We tried several variants of the proposed model. Table 3 shows the performance of each of those variants. We tried approaches using Long Short-Term Memory networks (LSTM) (Hochreiter and Schmidhuber, 1997), another prominent type of RNN. However, the results showed a minor drop. Another variant is to use two layers of GRUs instead of a single layer. This approach also slightly lowered the performance of the model. A potential reason for this could be overfitting. Using a single GRU layer followed by the Capsnet gave the best performance.

Analysis on Predictions
Table 4 shows the performance of the proposed model for each class. As evident from the results, anger shows a significantly lower F1-score. Other emotions show similar results, whereas joy stands out with a notably higher F1-score. Anger has been misclassified as sadness in several examples, e.g.: Girls will get [#TARGETWORD#] that her man cheated with an ugly girl more than the fact he actually cheated.
In the above example, it is unclear whether the emotion is anger or sadness. Such ambiguity has contributed to the lower F1-score for anger. There were a few other similar cases where it is challenging even for humans to clearly discriminate between emotions, due to the nuanced nature of the emotions expressed.
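The macro-F1 metric used throughout averages the per-class F1-scores with equal weight, so a weak class such as anger pulls the overall score down regardless of how many examples it has. The sketch below is a minimal illustration of that computation; the function name and the toy labels are assumptions, not the official evaluation script of the shared task.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1-scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)
```

In a toy case with gold labels anger, anger, joy, joy and predictions anger, sadness, joy, joy, a single anger tweet predicted as sadness lowers the F1-score of both the anger and the sadness class.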

Conclusion
The WASSA 2018 Implicit Emotion Shared Task (Klinger et al., 2018) introduces the task of predicting the emotion of a tweet from which the explicit mentions of emotion terms have been removed. We experimented with several deep learning based approaches to tackle this task, all using pre-trained Word2Vec embeddings. All the approaches we tried use an initial GRU layer that learns latent representations from the input word embeddings. Different alternatives were investigated for the subsequent layer: an attention layer, a CNN layer, and a Capsnet layer. The model with the Capsnet layer achieved the best results among these alternatives. Potential future work includes investigating the use of Capsule networks for other tasks in Natural Language Processing, especially where CNNs are involved. Another line of future work could be to follow the approach of Felbo et al. (2017) and apply transfer learning to the model trained on this semi-automatically annotated dataset, in order to test on human-annotated datasets such as Mohammad et al. (2018).