Figure Eight at SemEval-2019 Task 3: Ensemble of Transfer Learning Methods for Contextual Emotion Detection

This paper describes our transfer learning-based approach to contextual emotion detection as part of SemEval-2019 Task 3. We experiment with transfer learning using pre-trained language models (ULMFiT, OpenAI GPT, and BERT) and fine-tune them on this task. We also train a deep learning model from scratch using pre-trained word embeddings and a BiLSTM architecture with an attention mechanism. The ensembled model achieves a competitive result, ranking ninth out of 165 teams. Our analysis reveals that ULMFiT performs best due to its superior fine-tuning techniques. We propose improvements for future work.


Introduction
Traditionally, sentiment analysis attempts to classify the polarity of a given text at the document, sentence, or feature/aspect level, i.e., whether the opinion expressed in the text is positive, negative, or neutral. More advanced sentiment classification looks at emotional states such as "Angry", "Sad", and "Happy".
Due to the increasing popularity of social media, sentiment analysis tasks in SemEval competitions over the past years have mostly focused on Twitter (Rosenthal et al., 2014, 2015; Nakov et al., 2016; Rosenthal et al., 2017). SemEval-2018 Task 1: Affect in Tweets (Mohammad et al., 2018) includes an array of subtasks on inferring the emotions (such as joy, fear, valence, and arousal) of a person from his/her tweet.
As we increasingly communicate using text messaging applications and digital agents, contextual emotion detection in text is gaining importance to provide emotionally aware responses to users. SemEval-2019 Task 3 (Chatterjee et al., 2019) introduces a task to detect contextual emotion in conversational text.
Deep-learning based approaches have recently dominated the state-of-the-art in sentiment analysis. However, a good performing model often requires large amounts of labeled data and takes many days to train. In computer vision, transfer learning has enabled deep learning practitioners to leverage models that have been pre-trained on ImageNet, MS-COCO, and other large datasets (Razavian et al., 2014;Shelhamer et al., 2017;He et al., 2016;Huang et al., 2017). Fine-tuning such pre-trained models in computer vision has been a far more common practice than training from scratch.
In Natural Language Processing (NLP), the simplest and most common transfer learning technique is fine-tuning pre-trained word embeddings (Mikolov et al., 2013). These embeddings are used as the first layer of the model on the new dataset, but the rest of the model still requires training from scratch with large amounts of labeled data to obtain good performance.
In 2018 several pre-trained language models (ULMFiT, OpenAI GPT, and BERT) emerged. These models are trained on very large corpora and enable robust transfer learning, allowing NLP tasks to be fine-tuned with little labeled data.
In SemEval-2019 Task 3, we apply a transfer learning approach using both pre-trained word embeddings and pre-trained language models. Our model achieves a highly competitive result.
In this paper we describe our approach and experiments. The rest of the paper is laid out as follows: Section 2 provides an overview of the task, Section 3 describes the system architecture, and Section 4 reports results and performs an error analysis to better understand the strengths and weaknesses of our approach, subsequently proposing improvements. Finally, we conclude in Section 5 with a discussion of future work.
2 Task Overview

Dataset
The organizers provide a training, development, and test set. Each row in the dataset is a 3-turn conversation between two people. The task is to classify the emotion of a conversation as "Happy", "Sad", "Angry", or "Others". Table 1 shows the distribution of the datasets across the labels. No other dataset is used in our experiments.

Evaluation Metric
The evaluation metric is the micro-averaged F1 score over the three emotion classes, i.e., Happy, Sad, and Angry (excluding the class "Others"). This is referred to as the micro F1 score throughout the paper.
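For concreteness, this metric can be computed in a few lines of Python. The following is an illustrative re-implementation (label strings here are our own shorthand), not the official scorer:

```python
def micro_f1(y_true, y_pred, classes=("happy", "sad", "angry")):
    """Micro-averaged F1 over the emotion classes, excluding "others"."""
    tp = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        for c in classes:
            if p == c and t == c:
                tp += 1          # correctly predicted emotion
            elif p == c and t != c:
                fp += 1          # predicted emotion, but wrong
            elif t == c and p != c:
                fn += 1          # missed emotion
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Note that because counts are accumulated only over the three emotion classes, predicting "Others" for a true emotion example hurts recall but never precision.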
3 System Description

System Architecture
Figure 1 details the system architecture. We now describe how the different modules are tied together. The input raw text is pre-processed as described in Section 3.2. The processed text is passed through all the models described in Sections 3.3 to 3.8. Finally, the system returns the average of the predicted probabilities from all models as the output.

Pre-processing
The conversation text in the dataset is similar to tweets in that it may contain one or many emojis, and may have misspelled words. We use the ekphrasis tool 1 to preprocess the data. The tool performs the following steps: tokenization, spell correction (i.e., replacing a misspelled word with the most probable candidate word), word normalization, and word segmentation. All words are lower-cased.
After the text in each turn is processed, we concatenate the turns with the separator " eos ".
1 https://github.com/cbaziotis/ekphrasis
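A minimal sketch of this step, with simple lowercasing standing in for the full ekphrasis pipeline:

```python
def preprocess_conversation(turns):
    # Stand-in normalization: the real system runs ekphrasis
    # (tokenization, spell correction, word normalization/segmentation)
    cleaned = [t.strip().lower() for t in turns]
    # Join the three turns with the separator token used in the paper
    return " eos ".join(cleaned)
```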

Fine-tuning ULMFiT
Universal Language Model Fine-tuning (ULMFiT) (Howard and Ruder, 2018) trains language models on Wikitext-103 (Merity et al., 2017b), which consists of 28,595 preprocessed Wikipedia articles and 103 million words. It is based on the language model AWD-LSTM (Merity et al., 2017a), a regular LSTM (with no attention, shortcut connections, or other sophisticated additions) with various tuned dropout hyperparameters. It provides two pre-trained models: a forward model trained left to right, and a backward model trained right to left.
Furthermore, it introduced several novel transfer learning techniques, namely discriminative fine-tuning, slanted triangular learning rates, and gradual unfreezing, to retain previous knowledge and avoid catastrophic forgetting during fine-tuning.
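As an illustration, the slanted triangular schedule increases the learning rate linearly over a short warm-up fraction of training and then decays it linearly. A sketch following the schedule's published formulation (the defaults for `cut_frac` and `ratio` below are assumptions):

```python
import math

def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at step t under a slanted triangular schedule."""
    cut = math.floor(total_steps * cut_frac)
    if t < cut:
        p = t / cut                                      # linear warm-up
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))   # linear decay
    # lr rises from lr_max/ratio to lr_max, then falls back
    return lr_max * (1 + p * (ratio - 1)) / ratio
```

The short warm-up lets the model quickly move toward a suitable region of parameter space, while the long decay refines the parameters without overwriting the pre-trained knowledge.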
We first fine-tune the forward language model. We combine all data, including the training, dev, and test sets, and split it into a training and a validation set. We use fast.ai's lr find 2 method to find the optimum learning rate, and use early stopping on validation loss to tune the dropout multiplier over values from 0.7 to 2.5.
Then we fine-tune the classifier on the training set using 10-fold cross validation. We use early stopping on the evaluation metric of the task (micro F1 with "Others" class excluded). We experiment with dropout values from 0.7 to 0.85.
We repeat the same process for the backward language model.
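The 10-fold splitting itself can be sketched as follows (a plain random split; a stratified variant would additionally preserve the label proportions in each fold):

```python
import random

def kfold_indices(n, k=10, seed=42):
    """Yield (train_idx, val_idx) index pairs for k-fold cross validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # k roughly equal-sized folds
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val
```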

Fine-tuning BERT
Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) trains language models on BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words). It trains a deep bidirectional language model by masking some percentage of the input tokens at random and then predicting only those masked tokens. This creates deep bidirectional representations by jointly conditioning on both left and right context in all layers.
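The masking objective can be sketched as follows. This is a simplification: the actual BERT recipe replaces a selected position with [MASK] only 80% of the time, using a random or unchanged token otherwise.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly mask tokens; the model is trained to predict only these."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets.append(tok)    # prediction target at this position
        else:
            masked.append(tok)
            targets.append(None)   # position not predicted
    return masked, targets
```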
In addition, it also trains on a binarized next-sentence prediction task, which helps the model understand the relationship between two sentences, important for Question Answering and Natural Language Inference tasks.
BERT provides pre-trained base and large models in multiple languages. In our experiments we use the large uncased English model, with the PyTorch implementation by Hugging Face 3 .
We experiment with fine-tuning the language model on a training and validation set split from a combined data set including training, dev and test set. We use early stopping on validation loss.
We then add a classifier layer on top of the output from the language model, and train it using the training set from the task with 10-fold cross validation. We use early stopping on the evaluation metric of the task (micro F1 with "Others" class excluded). We experiment with learning rate from 7e-6 to 3e-5.

Fine-tuning OpenAI GPT
OpenAI's Generative Pre-Training (GPT) (Alec et al., 2018) trains a language model using the Transformer architecture on BooksCorpus. It obtains state-of-the-art results on many tasks, including Natural Language Inference, Question Answering and commonsense reasoning, Semantic Similarity, and Text Classification.
We tune the hyperparameters (clf_pdrop, embd_pdrop, resid_pdrop, and attn_pdrop) over combinations of the values 0.1 and 0.2 (the default is 0.1) on the dev set. Because the dev score is less promising than those of the previous approaches, we do not use cross validation, as it would take significantly more time and compute. In fact, this model was not included in the final submission.
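The search over these four dropout probabilities is a small grid; a sketch (parameter names follow the GPT code's naming convention):

```python
from itertools import product

DROPOUT_PARAMS = ["clf_pdrop", "embd_pdrop", "resid_pdrop", "attn_pdrop"]

def dropout_grid(values=(0.1, 0.2)):
    """Yield every combination of the four GPT dropout hyperparameters."""
    for combo in product(values, repeat=len(DROPOUT_PARAMS)):
        yield dict(zip(DROPOUT_PARAMS, combo))
```

With two candidate values per parameter this yields 2^4 = 16 configurations, each of which is evaluated once on the dev set.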

Fine-tuning DeepMoji
DeepMoji (Felbo et al., 2017) performs distant supervision on a dataset of 1,246 million tweets, each containing one of 64 common emojis. It obtained state-of-the-art performance on 8 benchmark datasets across sentiment, emotion, and sarcasm detection using a single pretrained model.
We perform fine-tuning using the training set for training and the dev set for validation. We adopt the gradual unfreezing approach introduced by ULMFiT: first unfreeze the last layer and fine-tune all unfrozen layers for one epoch, then unfreeze the next lower frozen layer and repeat, until all layers are fine-tuned to convergence in the last iteration.
We do not use 10-fold cross validation because the highest micro F1 score on the dev set did not seem promising.
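The gradual unfreezing loop can be sketched as follows, with hypothetical `train_one_epoch` and `train_to_convergence` callbacks standing in for the actual training routines:

```python
def gradual_unfreeze(layers, train_one_epoch, train_to_convergence):
    """Unfreeze layers from last to first, fine-tuning at each step.

    `layers` is ordered from input to output; the two callbacks are
    hypothetical training routines supplied by the caller.
    """
    frozen = list(layers)
    unfrozen = []
    while frozen:
        unfrozen.insert(0, frozen.pop())    # unfreeze the next-highest layer
        if frozen:
            train_one_epoch(unfrozen)       # one epoch per unfreezing step
        else:
            train_to_convergence(unfrozen)  # all layers: train until converged
    return unfrozen
```

Starting from the top layers limits how much the pre-trained lower layers are perturbed early in fine-tuning, which is the forgetting-avoidance rationale given by ULMFiT.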

Training a DeepMoji model with NTUA embedding
We also train a model from scratch using DeepMoji's architecture, but replace its embedding layer with the 310-dimensional embeddings trained by the NTUA-SLP team (Baziotis et al., 2018) on a dataset of 550M English Twitter messages. These word2vec-based embeddings consist of 300 word2vec dimensions plus 10 affective dimensions.
We use the keras lr finder 4 method to find the optimum starting learning rate (the one with the fastest decrease in training loss), and train the model on the training set using 10-fold cross validation with early stopping on the micro F1 score.

Ensembling
We combine the predictions of all the models above by taking the unweighted average of their posterior probabilities; the final prediction is the class with the largest averaged probability.
4 https://github.com/surmenok/keras_lr_finder

4 Results and Analysis

Table 2 shows the results of the various models on the dev set and test set. ULMFiT has the best performance on both, outperforming all other pre-trained models. The DeepMoji model trained from scratch with NTUA embeddings ranks second.
Figure 2 shows some examples where ULMFiT or BERT makes incorrect predictions for the same conversations. We observe that BERT often makes incorrect predictions when emojis are present in the text, while ULMFiT is more robust to emojis. This suggests that the high performance of ULMFiT is due not only to the large corpus on which the language model is pre-trained, but also to its superior fine-tuning methods, such as discriminative fine-tuning, slanted triangular learning rates, and gradual unfreezing.
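The unweighted averaging used for ensembling can be sketched as (per-example probability vectors, one per model):

```python
def ensemble_predict(model_probs, classes=("happy", "sad", "angry", "others")):
    """Average posterior probabilities across models and take the argmax.

    `model_probs` is a list with one probability vector per model,
    each aligned with `classes`.
    """
    n = len(model_probs)
    avg = [sum(p[i] for p in model_probs) / n for i in range(len(classes))]
    return classes[avg.index(max(avg))], avg
```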

Conclusion
In this paper we described our methods for contextual emotion detection. We achieved highly competitive results in SemEval-2019 Task 3 using an ensemble of transfer learning models. We demonstrated that, with the sophisticated fine-tuning techniques of ULMFiT, transfer learning using pre-trained language models yields the highest performance, outperforming models trained from scratch. For future work we plan to explore these techniques with OpenAI GPT and BERT as well.