SymantoResearch at SemEval-2019 Task 3: Combined Neural Models for Emotion Classification in Human-Chatbot Conversations

In this paper, we present our participation in the EmoContext shared task on detecting emotions in English textual conversations between a human and a chatbot. We propose four neural systems and combine them to further improve the results. We show that our neural ensemble systems can successfully distinguish three emotions (SAD, HAPPY, and ANGRY) and separate them from the rest (OTHERS) in a highly imbalanced scenario. Our best system achieved a 0.77 F1-score and was ranked fourth out of 165 submissions.


Introduction
Detecting emotions in text is a key task in many scenarios, such as social listening, personalised marketing, customer caring, or in building emotionally intelligent chat-bots: in this last case, the task complexity increases, since a bot's response might influence the user's emotion.
The EmoContext shared task (Chatterjee et al., 2019) was posed as a sequence classification task. Given a set of three conversational turns (human-bot-human), the goal is to predict the emotion of the third turn. The label space contains the emotions SAD, ANGRY and HAPPY, and the label OTHERS denoting anything else (emotional or non-emotional), as illustrated in Table 1.
In this paper, we present our approaches to the EmoContext shared task, and describe our best system in detail. Additionally, we show that: (a) this task is very difficult even for humans (Section 2.2); (b) for this task, neural approaches outperform a strong non-neural baseline (Section 4); (c) an ensemble of neural systems with different architectures significantly outperforms the best neural model in isolation (Section 4).

Data
The data released by the organisers consist of English user-chatbot interactions occurring in an Indian chat room. An overview of the dataset is provided in Table 2. The label distribution is highly imbalanced, and it differs between the training set and the development (dev) and test sets (a 14:18:18:50 distribution for the training set, and a 5:5:5:85 distribution for the dev and test sets). To overcome this issue we tested three strategies: (1) down-sampling the dataset to its smallest class; (2) up-sampling the emotion-related labels with an in-house dataset; and (3) up-sampling by duplicating a random portion of the dataset. None of these strategies worked, so we trained our best models using the data provided by the organisers.

Preprocessing
The language of this corpus presents many of the features of micro-blogging language: heavy use of contractions (e.g. I'm gonna bother), elongations (e.g. a vacation tooooooo!), non-standard use of punctuation (e.g. gonna explain you later..!), and incorrect spelling (e.g. U r).
To properly handle this language, we build a simple preprocessing pipeline which consists of: (1) the NLTK TweetTokenizer (Bird and Loper, 2004); and (2) a normalisation strategy that reduces sparseness by lowercasing all the words and converting elongations like looool to lol. These steps are used in all the experiments. Some of our models use additional preprocessing described further in the text.
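The two preprocessing steps can be illustrated with a minimal sketch. It is not the pipeline used in the experiments: a simple regular expression stands in for the NLTK TweetTokenizer so that the example has no dependencies, and the elongation rule (collapse any run of three or more identical characters to a single one) is an assumption chosen only because it reproduces the looool to lol example. The names `normalise` and `preprocess` are hypothetical.

```python
import re

# Assumption: a run of 3+ identical characters is an elongation and is
# collapsed to one character (so "looool" becomes "lol").
_ELONGATION = re.compile(r"(.)\1{2,}")

# Stand-in for the NLTK TweetTokenizer: words (with an optional
# apostrophe part such as "I'm") or single punctuation characters.
_TOKEN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def normalise(token):
    """Lowercase a token and collapse character elongations."""
    return _ELONGATION.sub(r"\1", token.lower())


def preprocess(turn):
    """Tokenise one conversational turn and normalise every token."""
    return [normalise(tok) for tok in _TOKEN.findall(turn)]


print(preprocess("Looool U r gonna bother"))
```

Applying the normalisation after tokenisation keeps each punctuation mark as its own token, so only repeated characters inside a token are collapsed.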

Manual Validation
To check how difficult this task is for a trained human annotator, and to estimate the expected upper limit for our classification models, we asked two fluent (but non-native) English speakers with previous annotation experience to label 300 randomly selected instances from the dev set. The annotators achieved official F1-scores of 0.73 and 0.72 against the 'gold' labels, and an F1-score of 0.71 between themselves. The only observed misclassifications between "emotional" classes were those between SAD and ANGRY. The highest number of disagreements between the annotators was between the OTHERS and the "emotional" classes. This showed that: (1) the task is naturally difficult (the trained human annotators reach an F1-score of 0.73 at most); (2) the main problem is distinguishing between the OTHERS class and the "emotional" classes.

Experimental Setup
We first randomly selected two sets of 2754 instances each from the official training set, maintaining the class ratio that was announced for the official dev and test sets (4:4:4:88), resulting in 110 instances each for the SAD, ANGRY and HAPPY classes, and 2424 instances for the OTHERS class. We refer to these two datasets as the intDev and intTest sets, and to the rest of the training dataset as intTrain. We train and tune our four neural models (Section 3.1) using the intTrain and intDev sets, and test them on the intTest set and the official dev and test sets (in different phases of the competition). We further experiment with combining their per-class softmax output probabilities (Section 3.2).
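The construction of the two held-out sets can be sketched as follows. This is an illustration under stated assumptions rather than the script used for the experiments: the pool of labelled instances is synthetic (its label counts are made up), the function name `draw_fixed_ratio` is hypothetical, and only the per-class target counts (110, 110, 110, 2424) come from the text.

```python
import random
from collections import Counter

# Per-class counts reported for each held-out set (2754 instances in total).
TARGET_COUNTS = {"sad": 110, "angry": 110, "happy": 110, "others": 2424}


def draw_fixed_ratio(instances, counts, rng):
    """Split (id, label) pairs into a sample with the requested
    per-class counts and the remaining instances."""
    by_label = {}
    for item in instances:
        by_label.setdefault(item[1], []).append(item)
    sample, rest = [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        k = counts.get(label, 0)
        if k > len(items):
            raise ValueError(f"not enough instances for class {label!r}")
        sample.extend(items[:k])
        rest.extend(items[k:])
    return sample, rest


# Synthetic stand-in for the official training set (counts are invented).
labels = ["sad"] * 5000 + ["angry"] * 5000 + ["happy"] * 4000 + ["others"] * 15000
pool = list(enumerate(labels))

rng = random.Random(0)
int_dev, remainder = draw_fixed_ratio(pool, TARGET_COUNTS, rng)
int_test, int_train = draw_fixed_ratio(remainder, TARGET_COUNTS, rng)

print(len(int_dev), len(int_test), len(int_train))
print(Counter(label for _, label in int_test))
```

Drawing intTest from what remains after intDev keeps the three sets disjoint.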
As a strong non-neural baseline we set up a linear SVM model with word and character n-grams (1-6) as features.

Neural Models
We propose four neural network models that differ slightly in their objectives.

Three-Input Model (IN3)
Given the three conversation turns (T1, T2, and T3), we explicitly represent the position of each sequence in the conversation by creating an input branch for each turn. The branches are identical and represent the text using word embeddings that feed a 2-layer bidirectional Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997). An attention mechanism (Yang et al., 2016) combines its hidden states. This architecture allows the model to independently process and attend to the most relevant parts of T1, T2, and T3. The information is later combined by a simple concatenation and a few fully connected dense layers. The model architecture is shown in Figure 1.
We use a proprietary model (© Symanto Research) to obtain 300-dimensional word embeddings on the English Wikipedia. The performance of this representation is comparable with fastText (Bojanowski et al., 2017) but the resulting embedding model is fifty times lighter. We apply 10% dropout on the output of the embedding and concatenation layers, and layer normalisation (Ba et al., 2016) after the concatenation and before the output softmax.
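The attention step that combines the hidden states of one turn can be illustrated with a small dependency-free sketch. It follows the common word-level attention form associated with Yang et al. (2016): each hidden state h_t is projected to u_t = tanh(W h_t + b), scored against a context vector u, and the softmax-normalised scores weight the sum of the hidden states. The exact parameterisation used in IN3 is not given in the text, so the shapes, the parameter values, and the function name `attention_pool` are assumptions.

```python
import math


def attention_pool(hidden_states, W, b, u):
    """Attention pooling over a sequence of hidden states.

    hidden_states: list of T vectors of size d
    W: k x d projection matrix, b: bias of size k, u: context vector of size k
    Returns the pooled vector (size d) and the T attention weights.
    """
    scores = []
    for h in hidden_states:
        # u_t = tanh(W h_t + b)
        proj = [math.tanh(sum(w * x for w, x in zip(row, h)) + b_i)
                for row, b_i in zip(W, b)]
        # score_t = u_t . u
        scores.append(sum(p * c for p, c in zip(proj, u)))
    # Numerically stable softmax over the T scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the original hidden states.
    dim = len(hidden_states[0])
    pooled = [sum(a * h[j] for a, h in zip(weights, hidden_states))
              for j in range(dim)]
    return pooled, weights


# Toy example: three 2-dimensional hidden states and hand-picked parameters.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(states, W=[[1, 0], [0, 1]], b=[0, 0], u=[1, 1])
print(weights, pooled)
```

In the three-input setting, one such pooled vector would be produced per branch (T1, T2, T3) before the concatenation.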

Two-Output Model (OUT2)
Motivated by the findings of the manual validation (Section 2.2), we build this model in an attempt to ease the emotional vs. OTHERS classification. For this reason, we use a multi-task learning approach and add an auxiliary output whose label space conflates the ANGRY, HAPPY, and SAD labels into a single emotional one. We hypothesise that this approach is well suited to our unbalanced scenario, with the dominant OTHERS class.
The model architecture is similar to our three-input one (see Figure 1). However, the three conversational turns (T1, T2, and T3) are fed to the model as a single concatenated input, with additional tokens to mark the turn boundaries. The auxiliary output is connected to the output of the attention. This forces the attention weights to favour the emotional vs. OTHERS task.
We use the pretrained word embeddings described in Section 3.1.1. Our dense layers use the leaky version (LReLU) of the Rectified Linear Unit (ReLU) activation. In addition, we use the attention mechanism of He et al. (2017). Finally, we use batch normalisation (Ioffe and Szegedy, 2015) to process the attention output.
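The input and the two targets of this multi-task setup can be sketched as below. The snippet only shows how one training example might be prepared, not the network itself. The boundary markers `<t1>`, `<t2>`, `<t3>`, the lowercase label strings, and the function name `build_out2_example` are hypothetical; the text states only that boundary tokens are added and that the auxiliary label space merges the three emotions into one.

```python
# Labels that the auxiliary task merges into a single "emotional" class.
EMOTIONAL = {"angry", "happy", "sad"}


def build_out2_example(t1, t2, t3, label):
    """Concatenate the three turns with boundary markers and derive
    the main (4-class) and auxiliary (2-class) targets."""
    # Assumption: one marker token precedes each turn.
    text = f"<t1> {t1} <t2> {t2} <t3> {t3}"
    aux_label = "emotional" if label in EMOTIONAL else "others"
    return {"text": text, "main_label": label, "aux_label": aux_label}


example = build_out2_example("u r late", "sorry about that", "i hate waiting", "angry")
print(example)
```

Both targets are derived from the same gold label, so the auxiliary task needs no extra annotation.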

Sentence-Encoder Model (USE)
As an exploration in transfer learning, we build a simple feed-forward network together with a fine-tuned Universal Sentence Encoder (Cer et al., 2018). As input, we use the first (T1) and the last (T3) turn of the conversation, as we observed that adding the second turn (T2) leads to lower performance of this model.

BERT Model (BERT)
We fine-tune a BERT-base model (Devlin et al., 2018), modelling the problem as a sentence-pair classification problem: we use the first and the third conversational turn (T1 and T3) as the first and the second sentence respectively, completely ignoring the utterance by the bot (T2). We use this model in combination with a lexical normalisation system (van der Goot and van Noord, 2017).
We also built a neural model combining BERT, IN3, and OUT2, but it resulted in lower performance than any of those models separately, and is thus not presented here.

Ensemble Models
As we noticed that our neural systems have different strengths and weaknesses on the "emotional" classes (see Table 4), we combine them by using the softmax output probabilities of each class from all four models (16 features in total) and training several classification algorithms: Naïve Bayes (John and Langley, 1995), Logistic Regression (le Cessie and van Houwelingen, 1992), Support Vector Machines (Keerthi et al., 2001) with normalization (SVM-n) or standardization (SVM-s), JRip rule learner (Cohen, 1995), J48 (Quinlan, 1993), Random Forest (Breiman, 2001), and various meta-learners on top of them or their subsets.
The neural systems are trained and tuned on the intTrain and intDev sets, and their per class probabilities are obtained for the intTest, dev, and test sets. The ensemble models are then trained on the intTest+dev set and tested on the official test set. For this second classification stage, we thus have 5509 instances for training (intTest+dev) and 5509 for testing (the official test set).
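The 16-dimensional input to the second-stage classifiers can be sketched as follows. The probabilities are invented for illustration, and the feature ordering and the function name `stack_features` are assumptions. The text specifies only that the per-class softmax outputs of the four models are used as features.

```python
MODELS = ("IN3", "OUT2", "USE", "BERT")
CLASSES = ("others", "happy", "sad", "angry")


def stack_features(probs_by_model):
    """Concatenate the per-class probabilities of all four models into
    one feature vector (4 models x 4 classes = 16 features)."""
    features = []
    for model in MODELS:
        probs = probs_by_model[model]
        features.extend(probs[c] for c in CLASSES)
    return features


# Made-up softmax outputs for a single conversation.
instance = {
    "IN3":  {"others": 0.10, "happy": 0.70, "sad": 0.15, "angry": 0.05},
    "OUT2": {"others": 0.30, "happy": 0.50, "sad": 0.10, "angry": 0.10},
    "USE":  {"others": 0.55, "happy": 0.35, "sad": 0.05, "angry": 0.05},
    "BERT": {"others": 0.20, "happy": 0.60, "sad": 0.10, "angry": 0.10},
}
features = stack_features(instance)
print(len(features), features)
```

One such vector per instance, paired with its gold label, would then be passed to whichever second-stage classifier is being trained.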

Results
We evaluate our systems using precision (P) and recall (R) for each emotional class, and the micro F1-score over the three "emotional" classes (the metric used by the task organisers for the official evaluation). The results for the baseline and the four neural systems are presented in Table 4. The results of the best ensemble models (trained on the per-class probabilities of the four neural models) are presented in Table 5. We can notice that:
(1) Our best neural system (IN3) reaches .73 on the intTest set and .72 on the official test set.
(2) All our neural systems have a noticeably higher recall on the HAPPY and SAD classes on the intTest set than on the official dev and test sets.
(3) Our two best neural systems (IN3 and OUT2) have a noticeably lower precision on the HAPPY and SAD classes on the intTest set than on the official dev and test sets.
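For reference, the evaluation metric can be written as a short function. This is a minimal sketch of a micro-averaged F1 restricted to the three "emotional" classes, in which OTHERS contributes only through errors: an emotional prediction on an OTHERS instance is a false positive, and an OTHERS prediction on an emotional instance is a false negative. It is not the organisers' scoring script; the function name `micro_f1` and the toy labels are illustrative.

```python
EMOTIONS = ("happy", "sad", "angry")


def micro_f1(gold, pred, classes=EMOTIONS):
    """Micro-averaged F1 over the given classes only."""
    pairs = list(zip(gold, pred))
    # Correct predictions of an emotional class.
    tp = sum(1 for g, p in pairs if g == p and g in classes)
    # An emotional class was predicted but the gold label differs.
    fp = sum(1 for g, p in pairs if p in classes and g != p)
    # The gold label is an emotional class but the prediction differs.
    fn = sum(1 for g, p in pairs if g in classes and g != p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["happy", "sad", "others", "angry", "others"]
pred = ["happy", "others", "sad", "angry", "others"]
print(micro_f1(gold, pred))
```

In the toy example there are two true positives, one false positive and one false negative, so precision and recall are both 2/3. The correctly predicted OTHERS instance does not affect the score.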

Error Analysis
The confusion matrix for our best system is given in Figure 2. The highest number of confusions is between the HAPPY and OTHERS classes, followed by confusions between the ANGRY and OTHERS classes. Given the findings of our manual validation (Section 2.2), we performed an additional experiment. We presented all instances for which our best system did not predict the gold label (355 instances) to one of our annotators, together with their gold and predicted labels (in random order), and asked him to choose the correct one, or to assign a NOT SURE label. The annotator chose the label predicted by our system in 46% of the cases and the gold label in 39% of the cases, and was not sure in the remaining 15%. Several examples of instances for which the predicted label did not match the "gold" label are presented in Table 3.

Conclusions
We presented our most successful approaches to the EmoContext shared task, with the goal of predicting the emotion (SAD, HAPPY, ANGRY, or OTHERS) in the third turn of a human-chatbot-human interaction, with the additional challenge of a very unbalanced distribution of classes. We showed that the task is difficult even for trained human annotators, and that our best neural systems can reach human performance (.72 F-measure). Furthermore, we showed that an SVM classifier trained on the per-class softmax output probabilities of four different neural systems can improve the results, scoring a .77 F1-measure over the three emotional classes and thus reaching fourth place in the official competition.