SSN_NLP at SemEval-2019 Task 3: Contextual Emotion Identification from Textual Conversation using Seq2Seq Deep Neural Network

Emotion identification is the process of automatically identifying emotions from text, speech or images. Emotion identification from textual conversations is a challenging problem due to the absence of gestures, vocal intonation and facial expressions. It enables conversational agents, chat bots and messengers to detect and report emotions to the user instantly, supporting healthy conversation by avoiding missed emotional cues and miscommunication. We have adopted a Seq2Seq deep neural network to identify the emotions present in text sequences. Several layers, namely an embedding layer, an encoding-decoding layer, a softmax layer and a loss layer, are used to map sequences from textual conversations to the emotions Angry, Happy, Sad and Others. We have evaluated our approach on the EmoContext@SemEval2019 dataset and obtained micro-averaged F1 scores of 0.595 and 0.6568 on the pre-evaluation dataset and the final evaluation test set, respectively. Our approach improved the baseline score by 7% on the final evaluation test set.


Introduction
Emotion identification is the process of automatically identifying emotions from different modalities. Several research works have been presented on detecting emotions from text (Rao, 2016; Abdul-Mageed and Ungar, 2017; Samy et al., 2018; Al-Balooshi et al., 2018; Gaind et al., 2019), speech (Arias et al., 2014; Amer et al., 2014; Lim et al., 2016), images (Shan et al., 2009; Ko, 2018; Ayvaz et al., 2017; Faria et al., 2017; Mohammadpour et al., 2017) and video (Matsuda et al., 2018; Hossain and Muhammad, 2019; Kahou et al., 2016). Understanding emotion from video may be easier because body language, speech variations and facial expressions can be analyzed. However, identifying emotions from textual conversations is a challenging problem due to the absence of these factors. Emotions in text are identified not only by cue words such as happy, good, bore, hurt, hate and fun; the presence of interjections (e.g. "whoops"), emoticons (e.g. ":)"), idiomatic expressions (e.g. "am in cloud nine"), metaphors (e.g. "sending clouds") and other descriptors also marks the existence of emotions in conversational text. Recently, the growth of text messaging applications for communication has created a need for emotion detection from conversation transcripts. This helps conversational agents, chat bots and messengers avoid missed emotional cues and miscommunication by detecting emotions during a conversation. The goal of the EmoContext@SemEval2019 shared task (Chatterjee et al., 2019) is to encourage more research in the field of contextual emotion detection in textual conversations. The shared task focuses on identifying the emotions Angry, Happy, Sad and Others from conversations with three turns. Since emotion detection is a classification problem, research works have used machine learning with lexical features (Sharma et al., 2017) and deep learning with deep neural networks (Phan et al., 2016) and convolutional neural networks (Zahiri and Choi, 2018) to detect emotions from text.
In contrast, we have adopted a Seq2Seq deep neural network for detecting emotions from textual conversations, which consist of sequences of phrases. This paper elaborates our Seq2Seq approach for identifying emotions from text sequences.

Related Work
This section reviews the research work reported for emotion detection from text/tweets (Perikos and Hatzilygeroudis, 2013; Rao, 2016; Abdul-Mageed and Ungar, 2017; Samy et al., 2018; Al-Balooshi et al., 2018; Gaind et al., 2019) and text conversations (Phan et al., 2016; Sharma et al., 2017; Zahiri and Choi, 2018). Sharma et al. (2017) proposed a methodology to create a lexicon, a vocabulary consisting of positive and negative expressions. This lexicon is used to assign an emotional value derived from a fuzzy set function. Gaind et al. (2019) classified Twitter text into emotion categories using textual and syntactic features with SMO and decision tree classifiers. Liew and Turtle (2016) manually annotated tweets with 28 fine-grained emotion categories and experimented with different machine learning algorithms; their results show that SVM and BayesNet classifiers produce consistently good performance for fine-grained emotion classification. Phan et al. (2016) developed an emotion lexicon from WordNet. Conversation utterances are mapped to the lexicon and 22 features are extracted using a rule-based algorithm. They used a fully connected deep neural network to train and classify the emotions. TF-IDF with handcrafted NLP features was used by Al-Balooshi et al. (2018) in logistic regression, XGBClassifier and CNN+LSTM models for emotion classification. The authors found that logistic regression performed better than the deep neural network model. All the models discussed above considered fine-grained emotion categories and used Twitter data to create a manually annotated corpus. These models used rule-based or machine learning algorithms to classify the emotion category.
A new C-GRU (Context-aware Gated Recurrent Units), a variant of LSTM, was proposed by Samy et al. (2018); it extracts contextual information (topics) from tweets and uses them as an extra layer to determine the sentiments conveyed by a tweet. The topic vectors, resembling an image, are fed to a CNN to learn the contextual information. Abdul-Mageed and Ungar (2017) built a very large dataset with 24 fine-grained types of emotions and classified the emotions using a gated RNN. Instead of using a basic CNN, a new recurrent sequential CNN was used by Zahiri and Choi (2018). They proposed several sequence-based convolutional neural network (SCNN) models with attention to facilitate sequential dependencies among utterances. All the models discussed above show that emotion prediction can be handled using variants of deep neural networks such as C-GRU, G-RNN and Sequential-CNN. The commonality between the above models is that they are variations of RNNs or LSTMs. This motivated us to use the Sequence-to-Sequence (Seq2Seq) model, which consists of stacked LSTMs, to predict the emotion labels conditioned on the given utterance sequences.

Data and Preprocessing
We have used the dataset provided by the EmoContext@SemEval2019 shared task in our approach. The dataset consists of a training set, a development set and a test set with 30160, 2755 and 5509 instances respectively. Each instance contains a sequence id, text sequences with three turns (the user utterance along with its context), followed by the emotion class label. The task is to label the user utterance with one of the emotion classes: Happy, Sad, Angry or Others. The textual sequences contain many short forms. In preprocessing, these are replaced with their original or full words; we built a look-up table that maps each short form (e.g., "'m") to its full form. The sequences are converted to lower case. Also, the three turns/sentences are delimited with "eos" in the input sequences.
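The preprocessing steps above can be sketched as follows. This is an illustrative sketch, not the authors' exact code: the look-up table entries shown here are assumptions for demonstration (the paper's actual table is larger), and the punctuation stripping is inferred from the worked example in the Methodology section.

```python
import re

# Hypothetical look-up table mapping short forms to full words.
SHORT_FORMS = {
    "'m": " am",
    "'s": " is",
    "n't": " not",
    "'re": " are",
}

def preprocess(turns):
    """Lowercase, expand short forms, strip punctuation, join turns with 'eos'."""
    cleaned = []
    for turn in turns:
        text = turn.lower()
        for short, full in SHORT_FORMS.items():
            text = text.replace(short, full)
        text = re.sub(r"[^a-z\s]", " ", text)   # drop punctuation and digits
        cleaned.append(" ".join(text.split()))  # collapse extra whitespace
    return " eos ".join(cleaned)

print(preprocess(["Bad", "Bad bad! That's the bad kind of bad.", "I have no gf"]))
# -> "bad eos bad bad that is the bad kind of bad eos i have no gf"
```

This reproduces the delimited input sequence shown for the example instance in the Methodology section.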

Methodology
The Seq2Seq model is the most popular model for learning a target sequence conditioned on a source sequence. We adopt the Seq2Seq model to map a sequence of n words to a target label (an n:1 mapping). The model has an embedding layer, an encoder, a decoder and a projection layer, as shown in Figure 1.
Once the dialogue sentences are preprocessed, the three turns of each instance are considered as the input sequence w_1, w_2, ..., w_n, and the corresponding label e is considered as the target sequence. For example, the given instance "13 Bad Bad bad! That's the bad kind of bad. I have no gf sad" is converted into the input sequence "bad eos bad bad that is the bad kind of bad eos i have no gf" and the target label "sad". The input sequences and the target label are converted into their corresponding word embeddings by the embedding layer. The vector representation for each word is derived at the embedding layer by choosing a fixed vocabulary of size V for the input sequences and target labels. The encoder, which uses a Bi-LSTM, encodes these embeddings into a fixed vector representation s, which also represents the summary of the input sequence (Figure 1 shows the system architecture). Once the source sequences are encoded, the last hidden state of the encoder is used to initialize the decoder. The projection layer is fed with the tensors of the target output label. Given the hidden state h_t, the decoder predicts the label e_t; both h_t and e_t are conditioned on the previous output e_{t-1} and on the summary s of the input sequence. The projection layer is a dense matrix that turns the top hidden states of the decoder into logit vectors of dimension V. Given the logit values, the training loss is minimized using a standard SGD optimizer with a learning rate. The model is also trained with an attention mechanism, which computes attention weights by comparing the current decoder hidden state with all encoder states. A detailed description of the working principles of the Seq2Seq model is given in (Sutskever et al., 2014).
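The embedding / Bi-LSTM encoder / decoder / projection pipeline described above can be sketched as a minimal PyTorch module. This is an assumption-laden sketch, not the authors' implementation (which used the TensorFlow NMT code): the hyperparameters, the single-step decoder and the zero-initialised cell state are all illustrative choices.

```python
import torch
import torch.nn as nn

class Seq2SeqClassifier(nn.Module):
    """Sketch: embedding -> Bi-LSTM encoder -> LSTM decoder -> projection."""

    def __init__(self, vocab_size, num_labels, emb_dim=64, hidden=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True,
                               bidirectional=True)
        # The decoder is initialised from the encoder's final hidden state.
        self.decoder = nn.LSTM(2 * hidden, 2 * hidden, batch_first=True)
        self.projection = nn.Linear(2 * hidden, num_labels)

    def forward(self, tokens):
        emb = self.embedding(tokens)                  # (B, T, E)
        enc_out, (h, c) = self.encoder(emb)           # h: (2, B, H)
        # Concatenate forward/backward final states -> summary s.
        s = torch.cat([h[0], h[1]], dim=-1)           # (B, 2H)
        # One decoding step: the model emits a single label (n:1 mapping).
        dec_in = s.unsqueeze(1)                       # (B, 1, 2H)
        init = (s.unsqueeze(0), torch.zeros_like(s).unsqueeze(0))
        dec_out, _ = self.decoder(dec_in, init)       # (B, 1, 2H)
        logits = self.projection(dec_out.squeeze(1))  # (B, num_labels)
        return logits

model = Seq2SeqClassifier(vocab_size=1000, num_labels=4)
batch = torch.randint(0, 1000, (2, 15))   # 2 token sequences of length 15
print(model(batch).shape)                  # torch.Size([2, 4])
```

A softmax over the four logits then yields the class probabilities for Angry, Happy, Sad and Others.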
We have adapted Neural Machine Translation code to implement our Seq2Seq deep neural network. Several variations have been implemented by varying the number of layers, the number of units and the attention mechanism. Following earlier experiments (Sutskever et al., 2014; Thenmozhi et al., 2018), we used two attention mechanisms, namely Normed Bahdanau (NB) (Bahdanau et al., 2014) and Scaled Luong (SL) (Luong et al., 2015, 2017). Since the model was developed using a deep learning technique, it does not require many linguistic features, such as stemming, case normalization and PoS tags, to identify the emotion cue words; these linguistic phenomena can be captured by the encoder RNNs in the sequence-to-sequence (Seq2Seq) model. Other statistical features, such as word frequency, are also not given as input to the model, because the presence of a particular cue alone does not guarantee the detection of emotion in the text.
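For intuition, the two attention variants can be contrasted with a small NumPy sketch: Luong attention scores encoder states multiplicatively against the decoder state, while Bahdanau attention uses an additive feed-forward comparison (the "normed" variant weight-normalises its score vector). The shapes, random initialisation and simplifications below are assumptions for demonstration, not the NMT code's exact implementation.

```python
import numpy as np

def scaled_luong_score(dec_h, enc_states, W, scale=1.0):
    """Luong (multiplicative) attention: scale * h_enc^T W h_dec."""
    return scale * enc_states @ (W @ dec_h)

def normed_bahdanau_score(dec_h, enc_states, W1, W2, v, g=1.0):
    """Bahdanau (additive) attention with a weight-normalised score vector v."""
    v_norm = g * v / np.linalg.norm(v)
    return np.tanh(enc_states @ W1.T + dec_h @ W2.T) @ v_norm

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
H, T = 4, 5                                  # hidden size, encoder time steps
dec_h = rng.normal(size=H)                   # current decoder hidden state
enc = rng.normal(size=(T, H))                # all encoder hidden states

sl_weights = softmax(scaled_luong_score(dec_h, enc, rng.normal(size=(H, H))))
nb_weights = softmax(normed_bahdanau_score(dec_h, enc,
                                           rng.normal(size=(H, H)),
                                           rng.normal(size=(H, H)),
                                           rng.normal(size=H)))
print(sl_weights.sum(), nb_weights.sum())    # each distribution sums to 1.0
```

In both cases the softmaxed scores give one attention weight per encoder state, which is how the decoder compares its current hidden state against the whole input sequence.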

Results
Our approach is evaluated on the EmoContext@SemEval2019 dataset.
During development, we implemented our variations with and without the end-of-sentence (EOS) delimiter. We built models using the entire training set (No split) and train-validation splits (TV split). In the TV split, 27160 and 3000 instances from the training data were used as the training and validation sets. Performance was measured in terms of the micro-averaged F1 score (F1µ) over the three emotion classes, namely Angry, Happy and Sad.
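Since the metric counts only the three emotion classes (the Others class is excluded from the average), it can be sketched as below. The toy gold/prediction lists are made up for illustration.

```python
# Micro-averaged F1 over the three emotion classes; "others" is excluded.
EMOTIONS = {"happy", "sad", "angry"}

def micro_f1(gold, pred):
    """Pool true/false positives and false negatives across emotion classes."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p in EMOTIONS and g == p:
            tp += 1
        if p in EMOTIONS and g != p:
            fp += 1
        if g in EMOTIONS and g != p:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

gold = ["happy", "sad", "others", "angry", "sad"]
pred = ["happy", "others", "others", "angry", "happy"]
print(round(micro_f1(gold, pred), 3))   # -> 0.571
```

Here precision is 2/3, recall is 2/4, so F1µ = 4/7 ≈ 0.571; correct "others" predictions neither help nor hurt the score.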
We submitted eleven runs for the EmoContext@SemEval2019 shared task on the pre-evaluation dataset. The results obtained for the pre-evaluation dataset are given in Table 1.
We observe from Table 1 that the Normed Bahdanau attention mechanism performs better than Scaled Luong. Model building with the TV split performs better than the model without a split. Incorporating the delimiter text EOS also improved the performance of our approach. Further, the performance degrades as the number of layers increases. Thus, the 2-layered LSTM with TV split, EOS delimiter and Normed Bahdanau attention mechanism performs best on the pre-evaluation dataset of EmoContext@SemEval2019, and this architecture is used for evaluating the final evaluation test set. The final evaluation submissions are based on variations in the TV split ratio and in the number of units (16, 32, 64, 128 and 256). For TV split 1, the development set (2755 instances) given by EmoContext@SemEval2019 was used as the validation set. The other two TV splits keep the validation set as 1/5 (TV split 2) and 1/3 (TV split 3) of the training set. The results of our submissions on the final evaluation test data are given in Table 2. It is observed from Table 2 that the 64U TV split 1 model outperforms all the other models with a 0.656752 F1µ score, a 7% improvement over the baseline. Table 3 shows the class-wise performance of our models on the final evaluation set. Our models perform better for the Angry class than for the other two classes, Happy and Sad.

Conclusion
We have adopted a Seq2Seq deep neural network to identify the emotions present in text sequences. Our approach is evaluated on the EmoContext@SemEval2019 dataset. The input sequences are preprocessed by replacing shorthand notations and introducing a delimiter string. Each sequence is vectorized using word embeddings and given to a bi-directional LSTM for encoding and decoding. We implemented several variations by changing the parameters, namely the number of layers, the number of units, the attention wrappers, the presence or absence of the delimiter string, and the train-validation split. Performance is measured using the micro-averaged F1 score over three emotion class labels, namely Angry, Happy and Sad. Our experiments on the development set show that a 2-layered LSTM with the Normed Bahdanau attention mechanism, the delimiter string and a train-validation split performs better than all the other variations. Three variations of the train-validation split ratio were experimented with on the final evaluation test data by varying the number of units, using the best parameter values learnt during the development phase. The 64U TV split 1 model performs better than all the other runs we submitted to the task. This model shows a 7% improvement over the baseline on the final evaluation test set. Our Seq2Seq model could be improved further by incorporating a soft attention mechanism that uses the joint distribution between the attention and output layers (Shankar et al., 2018).