NTUA-SLP at IEST 2018: Ensemble of Neural Transfer Methods for Implicit Emotion Classification

In this paper we present our approach to the Implicit Emotion Shared Task (IEST), organized as part of WASSA 2018 at EMNLP 2018. Given a tweet from which a certain word has been removed, we are asked to predict the emotion of the missing word. In this work, we experiment with neural Transfer Learning (TL) methods. Our models are based on LSTM networks, augmented with a self-attention mechanism. We use the weights of various pretrained models to initialize specific layers of our networks. We leverage a large collection of unlabeled Twitter messages for pretraining word2vec word embeddings and a set of diverse language models. Moreover, we utilize a sentiment analysis dataset for pretraining a model that encodes emotion-related information. The submitted model consists of an ensemble of the aforementioned TL models. Our team ranked 3rd out of 30 participants, achieving an F1 score of 0.703.


Introduction
Social media, especially micro-blogging services like Twitter, have attracted lots of attention from the NLP community. The language used is constantly evolving by incorporating new syntactic and semantic constructs, such as emojis or hashtags, abbreviations and slang, making natural language processing in this domain even more demanding. Moreover, the analysis of such content leverages the high availability of datasets offered by Twitter, satisfying the need for large amounts of data for training.
* These authors contributed equally to this work.
Emotion recognition is particularly interesting in social media, as it has useful applications in numerous tasks, such as public opinion detection about political tendencies (Pla and Hurtado, 2014; Tumasjan et al., 2010; Li and Xu, 2014), stock market monitoring (Si et al., 2013; Bollen et al., 2011b), tracking product perception (Chamlertwat et al., 2012), even detection of suicide-related communication (Burnap et al., 2015).
In the past, emotion analysis, like most NLP tasks, was tackled by traditional methods that included hand-crafted features or features from sentiment lexicons (Nielsen, 2011; Turney, 2010, 2013; Go et al., 2009) which were fed to classifiers such as Naive Bayes and SVMs (Bollen et al., 2011a; Kiritchenko et al., 2014). However, deep neural networks achieve increased performance compared to traditional methods, due to their ability to learn more abstract features from large amounts of data, producing state-of-the-art results in emotion recognition and sentiment analysis (Deriu et al., 2016; Goel et al., 2017; Baziotis et al., 2017).
In this paper, we present our work submitted to the WASSA 2018 IEST (Klinger et al., 2018). In the given task, the word that triggers emotion is removed from each tweet and is replaced by the token [#TARGETWORD#]. The objective is to predict its emotion category among 6 classes: anger, disgust, fear, joy, sadness and surprise. Our proposed model employs 3 different TL schemes of pretrained models: word embeddings, a sentiment model and language models.

Overview
Our approach is composed of the following three steps: (1) pretraining, in which we train word2vec word embeddings (P-Emb), a sentiment model (P-Sent) and Twitter-specific language models (P-LM), (2) transfer learning, in which we transfer the weights of the aforementioned models to specific layers of our IEST classifier and (3) ensembling, in which we combine the predictions of each TL model. Figure 1 depicts a high-level overview of our approach.

Data
Apart from the IEST dataset, we employ a SemEval dataset for sentiment classification and other manually collected unlabeled corpora for our language models.
Unlabeled Twitter Corpora. We collected a dataset of 550 million archived English Twitter messages, from 2014 to 2017. This dataset is used for calculating word statistics for our text preprocessing pipeline and for training our word2vec word embeddings, presented in Sec. 4.1.
For training our language models, described in Sec. 4.3, we sampled three subsets of this corpus. The first consists of 2M tweets, all of which contain emotion words. To create the dataset, we selected tweets that included one of the six emotion classes of our task (anger, disgust, fear, joy, sadness and surprise) or synonyms. We ensured that this dataset is balanced by concatenating approximately 350K tweets from each category. The second chunk has 5M tweets, randomly selected from the initial 550M corpus. We aimed to create a general sub-corpus, so as to focus on the structural relationships of words, instead of their emotional content. The third chunk is composed of the two aforementioned corpora. We concatenated the 2M emotion dataset with 2M generic tweets, creating a final 4M dataset. We denote the three corpora as EmoCorpus (2M), EmoCorpus+ (4M) and GenCorpus (5M).
Sentiment Analysis Dataset. We use the dataset of SemEval17 Task4A (Sent17) (Rosenthal et al., 2017) for training our sentiment classifier, as described in Sec. 4.2. The dataset consists of Twitter messages annotated with their sentiment polarity (positive, negative, neutral). The training set contains 56K tweets and the validation set 6K tweets.
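The construction of the balanced emotion sub-corpus described above can be illustrated with a minimal sketch. The function name, the keyword sets and the rule of assigning a tweet to the first matching class are all assumptions made for illustration; the text does not specify the synonym lists or the exact selection rule that were used.

```python
import random

def build_balanced_corpus(tweets, keywords_by_class, per_class, seed=0):
    """Keep tweets that mention an emotion keyword, then cap each class at per_class."""
    rng = random.Random(seed)
    buckets = {label: [] for label in keywords_by_class}
    for tweet in tweets:
        words = set(tweet.lower().split())
        for label, keywords in keywords_by_class.items():
            if words & keywords:
                buckets[label].append(tweet)
                break  # assumption: a tweet is assigned to the first matching class only
    corpus = []
    for bucket in buckets.values():
        rng.shuffle(bucket)
        corpus.extend(bucket[:per_class])  # equal cap per class keeps the corpus balanced
    return corpus

# Hypothetical keyword sets and toy tweets, for illustration only.
toy_keywords = {"joy": {"joy", "happy"}, "fear": {"fear", "scared"}}
toy_tweets = ["so happy today", "pure joy", "joy joy joy",
              "i am scared", "fear of heights", "nothing here"]
toy_corpus = build_balanced_corpus(toy_tweets, toy_keywords, per_class=2)
```

With roughly 350K tweets per class and six classes, the same capping step would yield a corpus of about 2M tweets.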

Preprocessing
To preprocess the tweets, we use Ekphrasis (Baziotis et al., 2017), a tool geared towards text from social networks, such as Twitter and Facebook. Ekphrasis performs Twitter-specific tokenization, spell correction, word normalization, segmentation (for splitting hashtags) and annotation.

Word Embeddings
Word embeddings are dense vector representations of words which capture semantic and syntactic information. For this reason, we employ the word2vec algorithm to train our word vectors, as described in Sec. 4.1.

Transfer Learning
Transfer Learning (TL) uses knowledge from a learned task so as to improve the performance of a related task by reducing the required training data (Torrey and Shavlik, 2010; Pan et al., 2010). In computer vision, transfer learning is employed in order to overcome the deficit of training samples for some categories by adapting classifiers trained for other categories (Oquab et al., 2014). With the power of deep supervised learning, learned knowledge can even be transferred to a totally different task (i.e. ImageNet (Krizhevsky et al., 2012)).
Following this logic, TL methods have also been applied to NLP. Pretrained word vectors (Pennington et al., 2014) have become standard components of most architectures. Recently, approaches that leverage pretrained language models have emerged, which learn the compositionality of language and capture long-term dependencies and context-dependent features. For instance, ELMo contextual word representations (Peters et al., 2018) and ULMFiT (Howard and Ruder, 2018) achieve state-of-the-art results on a wide variety of NLP tasks. Our work is mainly inspired by ULMFiT, which we extend to the Twitter domain.

Ensembling
We combine the predictions of our 3 TL schemes with the intent of increasing the generalization ability of the final classifier. To this end, we employ a pretrained word embeddings approach, as well as a pretrained sentiment model and a pretrained LM. We use two ensemble schemes, namely unweighted average and majority voting.
Unweighted Average (UA). In this approach, the final prediction is estimated from the unweighted average of the posterior probabilities of all different models. Formally, the final prediction $p$ for an instance is estimated by:
$$p = \arg\max_{c} \frac{1}{M} \sum_{i=1}^{M} p_i(c), \quad p_i \in \mathbb{R}^{C}$$
where $C$ is the number of classes, $M$ is the number of different models, $c \in \{1, ..., C\}$ denotes one class and $p_i$ is the probability vector calculated by model $i \in \{1, ..., M\}$ using the softmax function.
Majority Voting (MV). The majority voting approach counts the votes of all different models and chooses the class with the most votes. Compared to UA, MV is affected less by single-network decisions. However, this scheme does not consider any information derived from the minority models. Formally, for a task with $C$ classes and $M$ different models, the prediction for a specific instance is estimated as follows:
$$v_c = \sum_{i=1}^{M} F_i(c)$$
$$p = \arg\max_{c \in \{1, ..., C\}} v_c$$
where $v_c$ denotes the votes for class $c$ from all different models, $F_i(c)$ is the decision of the $i$-th model, which is either 1 or 0 depending on whether the model has classified the instance in class $c$ or not, and $p$ is the final prediction.
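The two ensemble schemes can be compared on a toy example. The sketch below is illustrative only: the function names and the posterior values are made up, and breaking voting ties in favor of the lowest class index is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter

def unweighted_average(prob_vectors):
    """Return the class index with the highest mean posterior across models."""
    num_models = len(prob_vectors)
    num_classes = len(prob_vectors[0])
    mean = [sum(p[c] for p in prob_vectors) / num_models for c in range(num_classes)]
    return max(range(num_classes), key=lambda c: mean[c])

def majority_vote(prob_vectors):
    """Each model votes for its argmax class; the class with the most votes wins."""
    votes = Counter(max(range(len(p)), key=lambda c: p[c]) for p in prob_vectors)
    best = max(votes.values())
    # Assumption: ties are broken in favor of the lowest class index.
    return min(c for c, v in votes.items() if v == best)

# Toy posteriors from three hypothetical models over three classes.
probs = [
    [0.40, 0.35, 0.25],
    [0.45, 0.40, 0.15],
    [0.05, 0.90, 0.05],
]
ua_pred = unweighted_average(probs)  # mean posteriors are 0.30, 0.55, 0.15
mv_pred = majority_vote(probs)       # votes: class 0 twice, class 1 once
```

In this example the two schemes disagree: one very confident model pulls the averaged posterior towards class 1, while voting sides with the two weakly confident models that prefer class 0.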

Network Architecture
All of our TL schemes share the same architecture, shown in Figure 2: a 2-layer LSTM with a self-attention mechanism.
Embedding Layer. The input to the network is a Twitter message, treated as a sequence of words. We use an embedding layer to project the words $w_1, w_2, ..., w_N$ to a low-dimensional vector space $\mathbb{R}^W$, where $W$ is the size of the embedding layer and $N$ the number of words in a tweet.
LSTM Layer. An LSTM takes as input a sequence of word embeddings and produces word annotations $h_1, h_2, ..., h_N$, where $h_i$ is the hidden state at time-step $i$, summarizing all the information of the sentence up to $w_i$. We use a bidirectional LSTM to get word annotations that summarize the information from both directions. A bi-LSTM consists of a forward LSTM $\overrightarrow{f}$ that parses the sentence from $w_1$ to $w_N$ and a backward LSTM $\overleftarrow{f}$ that parses it from $w_N$ to $w_1$. We obtain the final annotation for each word, $h_i$, by concatenating the annotations from both directions:
$$h_i = \overrightarrow{h_i} \parallel \overleftarrow{h_i}, \quad h_i \in \mathbb{R}^{2L}$$
where $\parallel$ denotes the concatenation operation and $L$ the size of each LSTM. When the network is initialized with pretrained LMs, we employ unidirectional instead of bidirectional LSTMs.
Attention Layer. To amplify the contribution of the most informative words, we augment our LSTM with an attention mechanism, which assigns a weight $a_i$ to each word annotation $h_i$. We compute the fixed representation $r$ of the whole input message as the weighted sum of all the word annotations:
$$e_i = \tanh(W_h h_i + b_h)$$
$$a_i = \frac{\exp(e_i)}{\sum_{t=1}^{N} \exp(e_t)}$$
$$r = \sum_{i=1}^{N} a_i h_i$$
where $W_h$ and $b_h$ are the attention layer's weights.
Output Layer. We use the representation $r$ as the feature vector for classification and feed it to a fully-connected softmax layer with $C$ neurons (one per class), which outputs a probability distribution over all classes $p_c$:
$$p_c = \mathrm{softmax}(W r + b)$$
where $W$ and $b$ are the layer's weights and biases.
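To make the attention layer concrete, here is a small sketch in plain Python. It assumes the common formulation in which each annotation is scored by a single tanh unit, the scores are normalized with a softmax over the words of the tweet, and the message representation is the weighted sum of the annotations. The function name and the toy values are illustrative and are not taken from the released code.

```python
import math

def attention_pool(annotations, w_h, b_h):
    """annotations: list of N vectors h_i; w_h: weight vector; b_h: scalar bias."""
    # One scalar score per word annotation: e_i = tanh(w_h . h_i + b_h)
    scores = [math.tanh(sum(w * x for w, x in zip(w_h, h)) + b_h) for h in annotations]
    # Softmax over the N words gives the attention weights a_i
    exp_scores = [math.exp(e) for e in scores]
    total = sum(exp_scores)
    weights = [e / total for e in exp_scores]
    # r is the attention-weighted sum of the annotations
    dim = len(annotations[0])
    r = [sum(a * h[d] for a, h in zip(weights, annotations)) for d in range(dim)]
    return weights, r

# Three toy annotations of size 2 and hypothetical attention parameters.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, r = attention_pool(h, w_h=[0.5, -0.5], b_h=0.0)
```

The weights always sum to one, so $r$ stays in the same range as the annotations regardless of the tweet length.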

Pretrained Word Embeddings (P-Emb)
In the first approach, we train word2vec word embeddings, with which we initialize the embedding layer of our network. The weights of the embedding layer remain frozen during training. The word2vec word embeddings are trained on the 550M Twitter corpus (Sec. 2.1), with 5 negative samples and a minimum word count of 20, using Gensim's (Řehůřek and Sojka, 2010) implementation. The resulting vocabulary contains 800,000 words.

Pretrained Sentiment Model (P-Sent)
In the second approach, we first train a sentiment analysis model on the Sent17 dataset, using the architecture described in Sec. 3. The embedding layer of the network is initialized with our pretrained word embeddings. Then, we fine-tune the network on the IEST task, by replacing its last layer with a task-specific layer.

Pretrained Language Model (P-LM)
The third approach consists of the following steps: (1) we first train a language model on a generic Twitter corpus, (2) we fine-tune the LM on the task at hand and finally, (3) we transfer the embedding and RNN layers of the LM, we add attention and output layers and fine-tune the model on the target task.

LM Pretraining. We collect three Twitter datasets as described in Sec. 2.1 and for each one we train an LM. In each dataset we use the 50,000 most frequent words as our vocabulary. Since the literature concerning LM transfer learning is limited, especially in the Twitter domain, we aim to explore the desired characteristics of the pretrained LM. To this end, our contribution in this research area lies in experimenting with a task-relevant corpus (EmoCorpus), a generic one (GenCorpus) and a mixture of both (EmoCorpus+).
LM Fine-tuning. This step is crucial since, despite the diversity of the general-domain data used for pretraining, the data of the target task will likely have a different distribution. We thus fine-tune the three pretrained LMs on the IEST dataset, employing two approaches. The first is simple fine-tuning, according to which all layers of the model are trained simultaneously. The second is a simplified variant of gradual unfreezing, proposed in (Howard and Ruder, 2018), which we denote as Simplified Gradual Unfreezing (SGU). According to this method, after we have transferred the pretrained embedding and LSTM weights, we let only the output layer fine-tune for $n - 1$ epochs. At the $n$-th epoch, we unfreeze both LSTM layers. We let the model fine-tune until epoch $k - 1$. Finally, at epoch $k$, we also unfreeze the embedding layer and let the network train until convergence. In other words, we experiment with pairs of numbers of epochs, $\{n, k\}$, where $n$ denotes the epoch when we unfreeze the LSTM layers and $k$ the epoch when we unfreeze the embedding layer. Naive fine-tuning poses the risk of catastrophic forgetting, that is, abruptly losing the knowledge of a previously learnt task as information relevant to the current task is incorporated. To prevent this from happening, we unfreeze the model starting from the last layer, which is task-specific, and after some epochs we progressively unfreeze the next, more general layers, until all layers are unfrozen.
LM Transfer. This is the final step of our TL approach. We now have several LMs from the second step of the procedure. We transfer their embedding and RNN weights to a final target classifier. We again experiment with both simple and more sophisticated fine-tuning techniques, to find out which one is more helpful for this task.
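The SGU schedule described above reduces to a simple rule for which layer groups are trainable at each epoch. The sketch below only illustrates that rule: the function name, the layer-group labels and the values n=3, k=5 are hypothetical, since the text does not report the epoch pairs that were tried.

```python
def trainable_layers(epoch, n, k):
    """Return the layer groups that are updated at a given 1-indexed epoch under SGU."""
    if not 1 <= n <= k:
        raise ValueError("expected 1 <= n <= k")
    layers = ["output"]            # the task-specific layer is trained from the start
    if epoch >= n:
        layers.append("lstm")      # both LSTM layers are unfrozen together at epoch n
    if epoch >= k:
        layers.append("embedding") # the embedding layer is unfrozen last, at epoch k
    return layers

# Hypothetical pair {n, k} = {3, 5}, listed for the first six epochs.
schedule = {epoch: trainable_layers(epoch, n=3, k=5) for epoch in range(1, 7)}
```

In a PyTorch model, the same rule would typically be applied by toggling `requires_grad` on the parameters of each layer group at the start of every epoch.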
Furthermore, we introduce the concatenation method, which was inspired by the correlation between language modeling and the task at hand. We use pretrained LMs to leverage the fact that the task is basically a cloze test. In an LM, the probability of occurrence of each word is conditioned on the preceding context, $P(w_t | w_1, ..., w_{t-1})$. In RNN-based LMs, this probability is encoded in the hidden state of the RNN, $P(w_t | h_{t-1})$. To this end, we concatenate the hidden state of the LSTM right before the missing word, $h_{implicit}$, to the output of the self-attention mechanism, $r$:
$$r' = r \parallel h_{implicit}, \quad r' \in \mathbb{R}^{2L}$$
where $L$ is the size of each LSTM, and then feed it to the output linear layer. This way, we preserve the information which implicitly encodes the probability of the missing word.
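A minimal sketch of the concatenation method follows, assuming the hidden states are available as one vector per token and that the placeholder token appears exactly as written in the task description. The function name and the zero-vector fallback for a placeholder in first position are assumptions, not details given in the text.

```python
def concat_features(tokens, hidden_states, r, target="[#TARGETWORD#]"):
    """Concatenate the attention output r with the hidden state just before the target token."""
    idx = tokens.index(target)
    # Assumption: if the placeholder is the first token there is no preceding state,
    # so a zero vector of the same size is used instead.
    h_implicit = hidden_states[idx - 1] if idx > 0 else [0.0] * len(r)
    return r + h_implicit  # list concatenation: result has size 2L

# Toy unidirectional hidden states of size L = 2, one per token.
tokens = ["i", "feel", "[#TARGETWORD#]", "today"]
hidden_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
r = [0.9, 1.0]
features = concat_features(tokens, hidden_states, r)
```

Here the state after "feel" is the one an LM would use to predict the removed word, which is why it is appended to the attention output before the final linear layer.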

Experimental Setup
Training. We use the Adam algorithm (Kingma and Ba, 2014) to optimize our networks, with mini-batches of size 64, and clip the norm of the gradients (Pascanu et al., 2013) at 0.5, as an extra safety measure against exploding gradients. We also used PyTorch (Paszke et al., 2017) and Scikit-learn (Pedregosa et al., 2011).
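Gradient norm clipping rescales the gradient whenever its L2 norm exceeds a threshold, leaving its direction unchanged. The sketch below shows the rule on a flat list of gradient values with the threshold of 0.5 mentioned above; the function name is illustrative. In PyTorch, the corresponding utility is `torch.nn.utils.clip_grad_norm_`, which applies the same idea to the joint norm of all parameter gradients.

```python
import math

def clip_by_norm(grads, max_norm=0.5):
    """Rescale grads so that their L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)        # small gradients pass through unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]

clipped = clip_by_norm([3.0, 4.0])    # norm 5.0 is rescaled to norm 0.5
unchanged = clip_by_norm([0.1, 0.2])  # norm below 0.5, returned as is
```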
Hyperparameters. For all our models, we employ the same 2-layer attention-based LSTM architecture.
Table 1: Hyper-parameters of our models.

Official Results
Our team ranked 3rd out of 30 participants, achieving 0.703 F1-score on the official test set. Table 2 shows the official ranking of the top scoring teams.

Experiments
Baselines. In Table 5 we compare the proposed TL approaches against two strong baselines: (1) a Bag-of-Words (BoW) model with TF-IDF weighting and (2) a Bag-of-Embeddings (BoE) model, where we retrieve the word2vec representations of the words in a tweet and compute the tweet representation as the centroid of the constituent word2vec representations. Both BoW and BoE features are then fed to a linear SVM classifier, with tuned C = 0.6. All of our reported F1-scores are calculated on the evaluation (dev) set, due to time constraints.
P-Emb and P-Sent models (Sec. 4.1, 4.2). We evaluate the P-Emb and P-Sent models, using both bidirectional and unidirectional LSTMs. The F1 score of our best models is shown in Table 5. As expected, bi-LSTM models achieve higher performance.
P-LM (Sec. 4.3). For the experiments with the pretrained LMs, we intend to transfer not just the first layer of our network, but rather the whole model, so as to capture more high-level features of language. As mentioned above, there are three distinct steps in the training procedure of this TL approach: (1) LM pretraining: we train three LMs on the EmoCorpus, EmoCorpus+ and GenCorpus corpora; (2) LM fine-tuning: we fine-tune the LMs on the IEST dataset in 2 different ways, namely simple fine-tuning and our simplified gradual unfreezing (SGU) technique; (3) LM transfer: we now have 6 LMs, fine-tuned on the IEST dataset. We transfer their weights to our final emotion classifier, we add attention to the LSTM layers and we experiment again with our 2 ways of fine-tuning and the concatenation method proposed in Sec. 4.3.
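Returning to the Bag-of-Embeddings baseline described above, the tweet representation is simply the centroid of its word vectors. The sketch below illustrates that step with a toy embedding table; the function name and the choice to skip out-of-vocabulary words (and to return a zero vector when no word is known) are assumptions, as the text does not say how unknown words were handled.

```python
def bag_of_embeddings(tokens, embeddings, dim):
    """Average the vectors of the in-vocabulary tokens of a tweet."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim  # assumption: all-zero features when no token is known
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

# Toy 2-dimensional embedding table standing in for the word2vec vectors.
toy_embeddings = {"happy": [1.0, 0.0], "day": [0.0, 1.0]}
feat = bag_of_embeddings(["happy", "day", "unknownword"], toy_embeddings, dim=2)
```

The resulting fixed-size vector is what would be passed to the linear SVM in place of the TF-IDF features of the BoW baseline.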
In Table 3 we present all possible combinations of transferring the P-LM to the IEST task. We observe that SGU consistently outperforms Simple Fine-Tuning (Simple FT). Due to the difficulty in running experiments for all possible combinations, we compare our best approach, namely SGU + Concat., with P-LMs trained on our three unlabeled Twitter corpora, as depicted in Table 4. Even though EmoCorpus contains fewer training examples, P-LMs trained on it learn to encode more useful information for the task at hand.

Ensembling
Our submitted model is an ensemble of the models with the best performance. More specifically, we leverage the following models: (1) TL of pretrained word embeddings, (2) TL of a pretrained sentiment classifier, (3) TL of 3 different LMs, trained on 2M, 4M and 5M tweets respectively. We use Unweighted Average (UA) ensembling of our best models from all aforementioned approaches. Our final results on the evaluation data are shown in Table 5.

Discussion
As shown in Table 5, all of our proposed models individually outperform our baselines by a large margin. Moreover, we notice that, when the three models are trained with unidirectional LSTMs and the same number of parameters, the P-LM outperforms both the P-Emb and the P-Sent models. As expected, the upgrade to bi-LSTM improves the results of P-Emb and P-Sent. We hypothesize that P-LM with bidirectional pretrained language models would have outperformed both of them. Furthermore, we conclude that both SGU for fine-tuning and the concatenation method enhance the performance of the P-LM approach. As far as the ensembling is concerned, both approaches, MV and UA, yield similar performance improvement over the individual models. In particular, we notice that adding the P-LM predictions to the ensemble contributes the most. This indicates that P-LMs encode more diverse information compared to the other approaches.

Conclusion
In this paper we describe our deep-learning methods for classifying missing emotion words in the Twitter domain. We achieved very competitive results in the IEST competition, ranking 3rd out of 30 teams. The proposed approach is based on an ensemble of Transfer Learning techniques. We demonstrate that the use of refined, high-level features of text, such as the ones encoded in language models, yields higher performance. In the future, we aim to experiment with subword-level models, as they have been shown to consistently handle the out-of-vocabulary (OOV) word problem (Sennrich et al., 2015; Bojanowski et al., 2016), which is more evident in Twitter. Moreover, we would like to explore other transfer learning approaches. Finally, we share the source code of our models, in order to make our results reproducible and facilitate further experimentation in the field.