GWU NLP Lab at SemEval-2019 Task 3: EmoContext: Effective Contextual Information in Models for Emotion Detection in Sentence-level in a Multigenre Corpus

In this paper we present an emotion classifier model submitted to SemEval-2019 Task 3: EmoContext. The task objective is to classify emotion (i.e., happy, sad, angry) in a 3-turn conversational data set. We formulate the task as a classification problem and introduce a Gated Recurrent Neural Network (GRU) model with an attention layer, which is bootstrapped with contextual information and trained on a multigenre corpus. We experiment with different word embeddings to empirically select the one best suited to represent our features. We train the model on a multigenre emotion corpus so that all available training sets can be leveraged to bootstrap the results. We achieved an overall F1-score of 56.05% and placed 144th.


Introduction
In recent studies, deep learning models have achieved top performance in emotion detection and classification. Access to large amounts of data has contributed to these high results. Numerous efforts have been dedicated to building emotion classification models, and successful results have been reported. In this work, we combine several popular emotion data sets from different genres with the one given for this task to train the emotion model we developed. We introduce a multigenre training mechanism; our intuition for combining different genres is a) to augment the training data and b) to generalize the detection of emotion. We utilize portable textual information such as subjectivity, sentiment, and the presence of emotion words, because emotional sentences are subjective and affectual states like sentiment are strong indicators of the presence of emotion. The rest of this paper is structured as follows: section 2 introduces our neural net model, section 3 explains the experimental setup and the data used for the training and development sets, section 4 discusses the results and analyzes the errors, section 5 describes related work, and section 6 concludes our study and discusses future directions.

Model Description
Gated Recurrent Neural Network (GRU) (Cho et al., 2014; Chung et al., 2015) and attention layers are used in sequential NLP problems, and successful results are reported in different studies. Figure 1 shows the diagram of our model. GRU has been widely used in the literature to model sequential problems. An RNN applies the same set of weights recursively over the input sequence. GRU is very similar to LSTM and has two gates, a reset gate r_t and an update gate z_t. Intuitively, the reset gate determines how to combine the new input with the previous memory, and the update gate defines how much of the previous memory to keep around. We use the Keras GRU implementation to set up our experiments. We note that GRU units are a concatenation of GRU layers in each task.
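For reference, the recurrences described above can be written out in the standard formulation of Cho et al. (2014). This is a sketch of the textbook equations, not necessarily the exact notation of our implementation; the weight matrices W, U and bias vectors b are assumed names, x_t is the input at step t, h_t is the hidden state, \sigma is the logistic sigmoid, and \odot is element-wise multiplication.

```latex
% Plain RNN: the same weights are applied at every time step
h_t = \tanh(W x_t + U h_{t-1} + b)

% GRU: update gate z_t, reset gate r_t, candidate state \tilde{h}_t
z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)
r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)
\tilde{h}_t = \tanh\bigl(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\bigr)
h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
```

In this convention a value of z_t close to 1 keeps most of the previous memory h_{t-1}, and a value of r_t close to 0 makes the candidate state ignore the previous memory. Some libraries swap the roles of z_t and 1 - z_t in the last line.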
Attention layer - GRUs update their hidden state h(t) as they process a sequence, and the final hidden state holds the summation of all the preceding history information. An attention layer (Bahdanau et al., 2014) modifies this process so that the representation of each hidden state is output by each GRU unit and analyzed to determine whether it is an important feature for prediction.
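The idea of weighting every hidden state instead of keeping only the last one can be illustrated with a minimal sketch. It uses simplified dot-product scoring rather than the additive scoring of Bahdanau et al. (2014), and the names attention_pool and context (a learned query vector) are assumptions made for illustration only.

```python
import math

def attention_pool(hidden_states, context):
    """Pool a sequence of hidden states into one vector.

    hidden_states: list of T vectors (lists of floats), one per time step.
    context: query vector of the same size (hypothetical learned parameter).
    Returns (weights, pooled), where weights sum to 1.
    """
    # Score each hidden state by its dot product with the context vector.
    scores = [sum(h_j * c_j for h_j, c_j in zip(h, context)) for h in hidden_states]
    # Numerically stable softmax over the time steps.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the hidden states.
    dim = len(hidden_states[0])
    pooled = [sum(w * h[j] for w, h in zip(weights, hidden_states)) for j in range(dim)]
    return weights, pooled
```

Time steps whose hidden state aligns with the context vector receive larger weights, so they contribute more to the pooled representation passed to the following layers.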
Model Architecture - Our model has an embedding layer of 300 dimensions when using fasttext embeddings and 1024 dimensions when using ELMo (Peters et al., 2018) embeddings. The GRU layer has 70 hidden units. We have 3 perceptron layers of size 300. The last layer is a softmax layer that predicts emotion tags. Textual information layers (explained in section 2.1) are concatenated with the GRU layer as an auxiliary layer. We utilize a dropout (Graves et al., 2013) layer after the first perceptron layer for regularization.

Textual Information
Sentiment and objective Information (SOI) - The relationship of subjectivity and sentiment with emotion is well studied in the literature. To craft these features we use SentiWordNet (Baccianella et al., 2010) to create a sentiment score and a subjectivity score for each word in each sentence. SentiWordNet is the result of the automatic annotation of all the synsets of WordNet according to the notions of positivity, negativity, and neutrality. Each synset s in WordNet is associated with three numerical scores Pos(s), Neg(s), and Obj(s), which indicate how positive, negative, and objective (i.e., neutral) the terms contained in the synset are. Different senses of the same term may thus have different opinion-related properties. These scores are presented per sentence, and their lengths are equal to the length of each sentence. When a score is not available, we use a fixed score of 0.001.
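A minimal sketch of how such per-word score sequences could be built is given here. Several details are assumptions not stated in the text: the toy lexicon stands in for SentiWordNet, the sentiment score is taken as Pos minus Neg, and sequences are padded to a fixed length with the same fallback score.

```python
DEFAULT_SCORE = 0.001  # fixed score used when no lexicon score is available
MAX_LEN = 70           # assumed fixed input length

# Hypothetical toy lexicon: word -> (Pos, Neg, Obj). Not real SentiWordNet entries.
TOY_LEXICON = {
    "good": (0.75, 0.0, 0.25),
    "bad": (0.0, 0.625, 0.375),
    "table": (0.0, 0.0, 1.0),
}

def soi_features(tokens, lexicon=TOY_LEXICON, max_len=MAX_LEN):
    """Return (sentiment, objectivity) score lists, each of length max_len."""
    sentiment, objectivity = [], []
    for tok in tokens[:max_len]:
        if tok in lexicon:
            pos, neg, obj = lexicon[tok]
            sentiment.append(pos - neg)   # assumption: sentiment = Pos - Neg
            objectivity.append(obj)
        else:
            sentiment.append(DEFAULT_SCORE)
            objectivity.append(DEFAULT_SCORE)
    # Assumption: pad short sentences with the fallback score.
    pad = [DEFAULT_SCORE] * (max_len - len(sentiment))
    return sentiment + pad, objectivity + pad
```

Both sequences stay aligned with the token positions, so they can be fed to the model alongside the word embeddings.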
Emotion Lexicon feature (emo) - The presence of emotion words is the first flag that a sentence is emotional. We use the NRC Emotion Lexicon (Mohammad and Turney, 2013) with 8 emotion tags (i.e., joy, trust, anticipation, surprise, anger, fear, sadness, disgust). We represent the presence of emotion words as an 8-dimensional feature covering all 8 emotion categories of the NRC lexicon. Each dimension represents one emotion category, where 0.001 indicates the absence of the emotion and 1 indicates its presence. The advantage of these features is their portability in transferring emotion learning across genres.
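The 8-dimensional presence feature can be sketched as follows. The toy lexicon entries are hypothetical illustrations and are not copied from the NRC Emotion Lexicon; only the tag list and the 0.001 / 1 encoding come from the description above.

```python
NRC_TAGS = ["joy", "trust", "anticipation", "surprise",
            "anger", "fear", "sadness", "disgust"]
ABSENT, PRESENT = 0.001, 1.0

# Hypothetical toy lexicon: word -> set of emotion tags.
TOY_EMOTION_LEXICON = {
    "party": {"joy", "anticipation"},
    "furious": {"anger"},
    "cry": {"sadness"},
}

def emotion_vector(tokens, lexicon=TOY_EMOTION_LEXICON):
    """Return one value per NRC tag: PRESENT if any token carries it, else ABSENT."""
    found = set()
    for tok in tokens:
        found |= lexicon.get(tok, set())
    return [PRESENT if tag in found else ABSENT for tag in NRC_TAGS]
```

Because the vector depends only on the lexicon and not on the genre of the text, the same feature can be computed for tweets, headlines, blog posts, and conversations.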

Word Embedding
Using different word embeddings, or end-to-end models in which word representations are learned from local context, produces different results in emotion detection. During our experiments we noted that pre-trained word embeddings need to be tuned with local context; otherwise the model does not converge. We experimented with different word embedding methods such as word2vec, GloVe (Pennington et al., 2014), fasttext (Mikolov et al., 2018), and ELMo. Among these methods, fasttext and ELMo produce better results.

Experimental Setup
We split the MULTI data set into 80%, 10%, and 10% for train, dev, and test, respectively. We use the AIT and EmoContext (the data for this task) splits as given by SemEval 2018 and SemEval 2019. We describe these data sets in detail in the next section. All experiments are implemented using Keras with a TensorFlow back-end.
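An 80/10/10 split of this kind can be sketched in a few lines. The random shuffle, the seed, and the function name are assumptions; the text does not state whether the split was random or stratified.

```python
import random

def split_80_10_10(examples, seed=13):
    """Shuffle and split examples into train (80%), dev (10%), and test (rest)."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # assumption: random, unstratified split
    n_train = len(items) * 8 // 10
    n_dev = len(items) // 10
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test
```

Fixing the seed keeps the three partitions identical across the repeated runs of an experiment.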

Data
We used three different emotion corpora in our experiments. Our corpora are as follows: a) a multigenre corpus created by Tafreshi and Diab (2018) with the following genres: emotional blog posts collected by Aman and Szpakowicz (2007), the headlines data set from SemEval-2007 Task 14 (Strapparava and Mihalcea, 2007), and the movie review data set (Pang and Lee, 2005), which was originally collected from Rotten Tomatoes for sentiment analysis and is among the benchmark sets for that task; we refer to this multigenre set as MULTI; b) the SemEval-2018 Affect in Tweets data set (Mohammad et al., 2018) (AIT) with the most popular emotion tags: anger, fear, joy, and sadness; c) the data set given for this task, which is 3-turn conversation data. From these data sets we only used the emotion tags happy, sad, and angry. We used the no-emotion tag from the MULTI data set as the others tag. Data statistics are shown in Figures 2, 3, and 4.
Data pre-processing - We tokenize all the data. For tweets, we replace all URLs, image URLs, hashtags, and @users with specific anchors. We replace each emoticon with an emotion tag, chosen based on how frequently the emoticon occurs with each emotion tag. We normalize all repeated characters. Finally, capitalized words are replaced with their lower-case forms but marked as caps words.
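The pre-processing steps above, except the emoticon mapping, can be sketched with regular expressions. The anchor strings, the whitespace tokenizer, and the rule of collapsing runs of three or more identical characters to two are all assumptions made for illustration.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
REPEAT_RE = re.compile(r"(.)\1{2,}")  # a character repeated 3 or more times

# Hypothetical anchor tokens.
ANCHORS = {"__url__", "__user__", "__hashtag__"}
CAPS_MARK = "__caps__"

def preprocess(text):
    """Replace URLs, users, and hashtags with anchors, normalize repeats, mark caps."""
    text = URL_RE.sub("__url__", text)
    text = USER_RE.sub("__user__", text)
    text = HASHTAG_RE.sub("__hashtag__", text)
    tokens = []
    for tok in text.split():          # assumption: whitespace tokenization
        if tok in ANCHORS:
            tokens.append(tok)
            continue
        tok = REPEAT_RE.sub(r"\1\1", tok)   # "SOOOO" -> "SOO"
        if len(tok) > 1 and tok.isupper():
            tokens.extend([CAPS_MARK, tok.lower()])  # keep a marker for caps words
        else:
            tokens.append(tok.lower())
    return tokens
```

For example, preprocess("@bob I am SOOOO happy http://t.co/x #win") yields the tokens __user__, i, am, __caps__, soo, happy, __url__, __hashtag__.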

Training the Models
We use an input size of 70 for the sentence length and for the sentiment and objective features; the emotion lexicon feature has size 8. All these features are explained in section 2.1 and are concatenated with the GRU layer as an auxiliary (input) layer. The attention layer comes after the GRU layer and has size 70. We train for up to 30 epochs in each experiment; however, training is stopped earlier if 2 consecutive larger loss values are seen when evaluating on the dev set. We use the Adam (Kingma and Ba, 2014) optimizer with a learning rate of 0.001 and dropout with a rate of 0.2. The loss function is categorical cross-entropy. We use mini-batches (Cotter et al., 2011) of size 32. All hyper-parameter values are selected empirically. We run each experiment 5 times with random initialization and report the mean score over these 5 runs. In section 4 we describe how we choose the hyper-parameter values.
Baseline - In each sentence we tag every emotional word using the NRC Emotion Lexicon (Mohammad and Turney, 2013). If one emotion occurs most often, we pick that emotion tag as the sentence emotion tag; when all emotion tags occur only once, we choose randomly among them; when there is no emotional word, we tag the sentence as others. We only use the portion of the emotion lexicon that covers the tags in the task (i.e., happy, sad, and angry).
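The early-stopping rule and the lexicon baseline described above can be sketched as two small functions. The toy lexicon, the function names, and the seeded random generator are hypothetical. Two interpretations are assumptions as well: "2 consecutive larger loss values" is read as the dev loss increasing in each of the last two epochs, and any tie among the most frequent tags is broken randomly.

```python
import random
from collections import Counter

# Hypothetical toy lexicon restricted to the task tags: word -> tag.
TOY_TASK_LEXICON = {"glad": "happy", "smile": "happy", "cry": "sad", "furious": "angry"}

def baseline_tag(tokens, lexicon=TOY_TASK_LEXICON, rng=None):
    """Tag a sentence with its most frequent lexicon emotion, or 'others'."""
    rng = rng or random.Random(0)
    counts = Counter(lexicon[t] for t in tokens if t in lexicon)
    if not counts:
        return "others"            # no emotional word in the sentence
    best = max(counts.values())
    top = sorted(tag for tag, c in counts.items() if c == best)
    # A single most frequent tag wins; ties are broken randomly (assumption).
    return top[0] if len(top) == 1 else rng.choice(top)

def should_stop(dev_losses, patience=2):
    """Return True when the dev loss rose in each of the last `patience` epochs."""
    if len(dev_losses) <= patience:
        return False
    recent = dev_losses[-(patience + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

In a training loop, should_stop would be called after each dev evaluation with the list of losses so far, up to the limit of 30 epochs.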

Related Works
In SemEval-2018 Task 1, Affect in Tweets (Mohammad et al., 2018), 6 teams reported results on sub-task E-c (emotion classification), mainly using neural net architectures, features and resources, and emotion lexicons. Among these works, Baziotis et al. (2018) proposed a Bi-LSTM architecture equipped with a multi-layer self-attention mechanism, and the model of Meisheri and Dey (2018) learned the representation of each tweet using a mixture of different embeddings. In the WASSA 2017 Shared Task on Emotion Intensity (Mohammad and Bravo-Marquez, 2017), among the proposed approaches we can recognize teams that used different word embeddings, GloVe or word2vec (He et al., 2017; Duppada and Hiray, 2017), and exploited neural net architectures such as LSTM (Goel et al., 2017; Akhtar et al., 2017), LSTM-CNN combinations (Köper et al., 2017; Zhang et al., 2017), and bi-directional versions (He et al., 2017) to predict emotion intensity. A similar approach was developed by Gupta et al. (2017) using sentiment and an LSTM architecture. A proper word embedding is key for the emotion task, and choosing the most efficient distance between vectors is crucial; the following studies explore solution sparsity-related properties, possibly including uniqueness (Shen and Mousavi, 2018; Mousavi and Shen, 2017).

Conclusion and Future Direction
We combined several data sets with different annotation schemes and different genres and trained a deep emotion model to classify emotion. Our results indicate that semantic and syntactic contextual features are beneficial to complex, state-of-the-art deep models for emotion detection and classification. We show that our model is able to classify non-emotion (others) with high accuracy.
In the future we want to improve our model so that it distinguishes between emotion classes more effectively. A hierarchical bidirectional GRU model may be beneficial, since such models compute both history and future sequences while training the model.

Figure 1: GRU-Attention neural net architecture. In this model framework, context information consists of features generated from SentiWordNet and the emotion lexicon. We use fasttext to show the embedding layer (we also use ELMo, but do not show it here). Features are presented to the GRU and attention layers, and the output of the attention layer is sent to 3 perceptron layers. The last layer is a softmax layer that predicts emotion labels. The model without contextual info excludes the contextual info input, which we do not show in the architecture.
Figure 2: MULTI data set - train, dev, test data statistics

Table 1: Data statistics illustrating the distributions of the train, dev, and test sets across different data sets.

Table 2: Results on the EmoContext test sets. We report

Table 3: Context results of each emotion tag.