NTUA-SLP at SemEval-2018 Task 1: Predicting Affective Content in Tweets with Deep Attentive RNNs and Transfer Learning

In this paper we present the deep-learning models that we submitted to the SemEval-2018 Task 1 competition: "Affect in Tweets". We participated in all subtasks for English tweets. We propose a Bi-LSTM architecture equipped with a multi-layer self-attention mechanism. The attention mechanism improves the model performance and allows us to identify salient words in tweets, as well as gain insight into the models, making them more interpretable. Our model utilizes a set of word2vec word embeddings trained on a large collection of 550 million Twitter messages, augmented by a set of word affective features. Due to the limited amount of task-specific training data, we opted for a transfer learning approach by pretraining the Bi-LSTMs on the dataset of SemEval-2017 Task 4A. The proposed approach ranked 1st in Subtask E "Multi-Label Emotion Classification", 2nd in Subtask A "Emotion Intensity Regression" and achieved competitive results in other subtasks.


Introduction
Social media content has dominated online communication, enriching and changing language with new syntactic and semantic constructs that allow users to express facts, opinions and emotions in a short amount of text. The analysis of such content has received great attention in NLP research due to the wide availability of data and the interesting language novelties. Specifically, the study of affective content in Twitter has resulted in a variety of novel applications, such as tracking product perception (Chamlertwat et al., 2012), public opinion detection about political tendencies (Pla and Hurtado, 2014; Tumasjan et al., 2010), stock market monitoring (Si et al., 2013; Bollen et al., 2011b), etc. The wide usage of figurative language, such as emojis, and of special language forms like abbreviations, hashtags, slang and other social media markers, which do not align with conventional language structure, makes natural language processing in Twitter even more challenging.
In the past, sentiment analysis was tackled by extracting hand-crafted features or features from sentiment lexicons (Nielsen, 2011; Turney, 2010, 2013; Go et al., 2009) that were fed to classifiers such as Naive Bayes or Support Vector Machines (SVM) (Bollen et al., 2011a; Kiritchenko et al., 2014). The downside of such approaches is that they require extensive feature engineering from experts and thus they cannot keep up with rapid language evolution (Mudinas et al., 2012), especially in a social media/micro-blogging context. However, recent advances in artificial neural networks for text classification have been shown to outperform conventional approaches (Deriu et al., 2016; Rouvier and Favre, 2016; Rosenthal et al., 2017a). This can be attributed to their ability to learn features directly from data and also utilize hand-crafted features where needed. Most of the aforementioned works focus on sentiment analysis, but similar approaches have been applied to emotion detection (Canales and Martínez-Barco, 2014), leading to similar conclusions. SemEval 2018 Task 1: "Affect in Tweets" (Mohammad et al., 2018) focuses on exploring the emotional content of tweets for both classification and regression tasks concerning the four basic emotions (joy, sadness, anger, fear) and the presence of more fine-grained emotions such as disgust or optimism.
Figure 2: High-level overview of our approach.
In this paper, we present a deep-learning system that competed in SemEval 2018 Task 1: "Affect in Tweets". To compensate for limited training data, we explore a transfer learning approach that uses the sentiment analysis dataset of SemEval-2017 Task 4A (Rosenthal et al., 2017b) to pretrain a model, which is then fine-tuned on the data of each subtask. Our model operates at the word level and uses a Bidirectional LSTM equipped with a deep self-attention mechanism (Pavlopoulos et al., 2017). Moreover, to help interpret the inner workings of our model, we provide visualizations of tweets with annotations of the salient tokens as predicted by the attention layer.

Overview
Figure 2 provides a high-level overview of our approach, which consists of three main steps: (1) the word embeddings pretraining, where we train word2vec and affective word embeddings on our unlabeled Twitter dataset, (2) the transfer learning step, where we pretrain a deep-learning model on a sentiment analysis task, and (3) the fine-tuning step, where we fine-tune the pretrained model on each subtask.
Task definitions. Given a tweet, we are asked to:
Subtask EI-reg: determine the intensity of a certain emotion (joy, fear, sadness, anger), as a real-valued number in the [0, 1] interval.
Subtask EI-oc: classify its intensity towards a certain emotion (joy, fear, sadness, anger) across a 4-point scale.
Subtask V-oc: classify its valence intensity (i.e. sentiment intensity) across a 7-point scale [−3, 3].
Subtask V-reg: determine its valence intensity as a real-valued number in the [0, 1] interval.
Subtask E-c: determine the existence of none, one or more out of eleven emotions: anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise, trust.

Data
Unlabeled Dataset. We collected a large dataset of 550 million English tweets, from April 2014 to June 2017. This dataset is used for (1) calculating the word statistics needed in our text preprocessing pipeline (Section 2.3) and (2) training word2vec and affective word embeddings (Section 2.2).
Pretraining Dataset. For transfer learning, we utilized the dataset of SemEval-2017 Task 4A. The dataset consists of 61,854 tweets with {positive, neutral, negative} sentiment (valence) annotations. To our knowledge, this is the largest Twitter dataset with affective annotations.

Word Embeddings
Word embeddings are dense vector representations of words (Collobert and Weston, 2008), capturing their semantic and syntactic information. To this end, we train word2vec word embeddings, to which we add 10 affective dimensions. We use our pretrained embeddings to initialize the first layer (embedding layer) of our neural networks.
Word2vec Embeddings. We leverage our unlabeled dataset to train Twitter-specific word embeddings. We use the word2vec algorithm, with the skip-gram model, negative sampling of 5 and minimum word count of 20, utilizing Gensim's (Řehůřek and Sojka, 2010) implementation. The resulting vocabulary contains 800,000 words.
Affective Embeddings. Starting from small manually annotated lexica, continuous norms (within the [−1, 1] interval) for new words are estimated using semantic similarity and a linear model along ten affect-related dimensions, namely: valence, dominance, arousal, pleasantness, anger, sadness, fear, disgust, concreteness, familiarity. The method of generating word-level norms is detailed in (Malandrakis et al., 2013) and relies on the assumption that, given a similarity metric between two words, one may derive the similarity between their affective ratings. This approach uses a set of $N$ words with known affective ratings (seed words) as a starting point. Concretely, we calculate the affective rating of a word $w$ as follows:
$$\hat{\upsilon}(w) = \alpha_0 + \sum_{i=1}^{N} \alpha_i \, \upsilon(t_i) \, S(t_i, w)$$
where $t_1 ... t_N$ are the seed words, $\upsilon(t_i)$ is the affective rating for seed word $t_i$, $\alpha_i$ is a trainable weight corresponding to seed $t_i$ (with $\alpha_0$ acting as a bias term) and $S()$ stands for the semantic similarity metric between $t_i$ and $w$. The seed words $t_i$ are selected separately for each dimension, from the words available in the original manual annotations (see 2.2). The $S()$ metric is estimated as shown in (Palogiannidi et al., 2015) using word-level contextual feature vectors and adopting a scheme based on mutual information for feature weighting.
Manually annotated norms. To generate affective norms, we need to start from some manual annotations, so we use ten dimensions from four sources. From the Affective Norms for English Words (Bradley and Lang, 1999) we use norms for valence, arousal and dominance. From the MRC Psycholinguistic database (Coltheart, 1981), we use norms for concreteness and familiarity. From the Paivio norms (Clark and Paivio, 2004) we use norms for pleasantness. Finally from (Stevenson et al., 2007) we use norms for anger, sadness, fear and disgust.
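As a rough illustration of the estimation scheme above, the following Python sketch computes an affective rating as a weighted sum of seed ratings scaled by similarity, and appends the resulting ratings to a base word vector. The seed words, weights, bias default and the character-overlap similarity are made-up stand-ins: in the paper the weights are learned and S() is a corpus-based metric.

```python
def affective_rating(word, seeds, similarity, bias=0.0):
    """Estimate a rating as bias + sum_i alpha_i * v(t_i) * S(t_i, word).

    seeds: list of (seed_word, rating, alpha) triples (toy values here).
    similarity: function S(seed_word, word) -> float.
    bias: constant offset (assumed; defaults to 0 in this sketch).
    """
    return bias + sum(alpha * rating * similarity(seed, word)
                      for seed, rating, alpha in seeds)


def toy_similarity(a, b):
    """Hypothetical stand-in for S(): Jaccard overlap of character sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


# Toy valence seeds: (word, rating in [-1, 1], weight alpha).
valence_seeds = [("happy", 0.9, 1.0), ("sad", -0.8, 1.0)]


def embed(word, base_vector, seeds_per_dimension):
    """Append one estimated rating per affective dimension to a base vector."""
    affect = [affective_rating(word, seeds, toy_similarity)
              for seeds in seeds_per_dimension]
    return list(base_vector) + affect
```

With a 300-dimensional base vector and ten seed sets (one per affective dimension), `embed` returns a 310-dimensional vector, matching the embedding size reported in the experimental setup.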

Preprocessing 1
We utilized the ekphrasis 2 (Baziotis et al., 2017) tool as a tweet preprocessor. The preprocessing steps included in ekphrasis are: Twitter-specific tokenization, spell correction, word normalization, word segmentation (for splitting hashtags) and word annotation.
Tokenization. Tokenization is the first fundamental preprocessing step and, since it is the basis for the other steps, it directly affects the quality of the features learned by the network. Tokenization on Twitter is challenging, since there is large variation in the vocabulary and the expressions which are used. There are certain expressions which are better kept as one token (e.g. anti-american) and others that should be split into separate tokens. Ekphrasis recognizes Twitter markup, emoticons, emojis, dates (e.g. 07/11/2011, April 23rd), times (e.g. 4:30pm, 11:00 am), currencies (e.g. $10, 25mil, 50€), acronyms, censored words (e.g. s**t), words with emphasis (e.g. *very*) and more, using an extensive list of regular expressions.
Normalization. After tokenization, we apply a series of modifications on the extracted tokens, such as spell correction, word normalization and segmentation. Specifically, for word normalization we lowercase words and normalize URLs, emails, numbers, dates, times and user handles (@user). This helps reduce the vocabulary size without losing information. For spell correction (Jurafsky and James, 2000) and word segmentation (Segaran and Hammerbacher, 2009) we use the Viterbi algorithm. The prior probabilities are obtained from word statistics from the unlabeled dataset. The benefits of the aforementioned procedure are the reduction of the vocabulary size, without removing any words, and the preservation of information that is usually lost during tokenization. Table 1 shows an example text snippet and the resulting preprocessed tokens.
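To make the hashtag segmentation step more concrete, here is a minimal sketch of probability-based word segmentation using dynamic programming, in the spirit of the Viterbi-style approach mentioned above. It is not the ekphrasis implementation: the unigram counts are toy values standing in for statistics from the unlabeled corpus, and `segment` and `log_prob` are hypothetical helpers.

```python
import math
from functools import lru_cache

# Toy unigram counts; the real system derives these from the unlabeled tweets.
COUNTS = {"twin": 50, "peaks": 40, "tv": 80, "series": 60, "david": 70, "lynch": 20}
TOTAL = sum(COUNTS.values())


def log_prob(word):
    """Log-probability of a word; unseen words are penalized by their length."""
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    return math.log(1.0 / (TOTAL * 10 ** len(word)))


def segment(text):
    """Return the most probable split of `text` into words."""
    @lru_cache(maxsize=None)
    def best(start):
        # Best (score, words) for the suffix text[start:].
        if start == len(text):
            return 0.0, ()
        candidates = []
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            score, rest = best(end)
            candidates.append((log_prob(word) + score, (word,) + rest))
        return max(candidates)

    return list(best(0)[1])


print(segment("twinpeaks"))  # ['twin', 'peaks']
```

The length-based penalty for unseen strings keeps the search from preferring one long unknown token over a split into known words.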
1 Significant portions of the systems submitted to SemEval 2018 in Tasks 1, 2 and 3, by the NTUA-SLP team are shared, specifically the preprocessing and portions of the DNN architecture. Their description is repeated here for completeness.
2 github.com/cbaziotis/ekphrasis

Table 1
original: The *new* season of #TwinPeaks is coming on May 21, 2017. CANT WAIT \o/ !!! #tvseries #davidlynch :D
processed: the new <emphasis> season of <hashtag> twin peaks </hashtag> is coming on <date> . cant <allcaps> wait <allcaps> <happy> ! <repeated> <hashtag> tv series </hashtag> <hashtag> david lynch </hashtag> <laugh>

(Taigman et al., 2014) and visual QA (Agrawal et al., 2017), where image features trained on ImageNet (Deng et al., 2009) and word embeddings estimated on large corpora via unsupervised training are combined.
Although model transfer has seen widespread success in computer vision, transfer learning beyond pretrained word vectors is less pervasive in NLP.
In our system, we explore the approach of pretraining a network on a Twitter sentiment analysis task and using it to initialize the weights of the models of each subtask. We chose the dataset of SemEval-2017 Task 4A (SA2017) (Rosenthal et al., 2017b), which is semantically similar to the emotion datasets of this task. By pretraining on a dataset in a similar domain, it is more likely that the source and target datasets will have similar distributions.
To build our pretrained model, we initialize the weights of the embedding layer with the word2vec Twitter embeddings and train a bidirectional LSTM (BiLSTM) with a deep self-attention mechanism (Pavlopoulos et al., 2017) on SA2017, similar to (Baziotis et al., 2017). Afterwards, we keep the encoding part of the network, which is the BiLSTM and the attention layer, and discard the last layer. This pretrained model is used for all subtasks, with the addition of a subtask-specific final layer for classification/regression.

Recurrent Neural Networks
We model the Twitter messages using Recurrent Neural Networks (RNN). RNNs process their inputs sequentially, performing the same operation, $h_t = f_W(x_t, h_{t-1})$, on every element in a sequence, where $h_t$ is the hidden state at time step $t$, $x_t$ the input at that step, and $W$ the network weights. We can see that the hidden state at each time step depends on the previous hidden states, thus the order of elements (words) is important. This process also enables RNNs to handle inputs of variable length.
RNNs are difficult to train (Pascanu et al., 2013), because gradients may grow or decay exponentially over long sequences (Bengio et al., 1994; Hochreiter et al., 2001). A way to overcome these problems is to use more sophisticated variants of regular RNNs, like Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997) or Gated Recurrent Units (GRU), which introduce a gating mechanism to ensure proper gradient flow through the network.
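The recurrence above can be sketched with a plain tanh cell in pure Python. This is only an illustration of applying the same operation at every step to sequences of different lengths; it is not the LSTM used in the paper, and the weights are arbitrary toy values.

```python
import math


def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One application of h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    size = len(h_prev)
    return [
        math.tanh(
            sum(W_x[j][k] * x_t[k] for k in range(len(x_t)))
            + sum(W_h[j][k] * h_prev[k] for k in range(size))
            + b[j]
        )
        for j in range(size)
    ]


def run_rnn(inputs, W_x, W_h, b):
    """Apply the same step to every element; returns all hidden states."""
    h = [0.0] * len(b)
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, W_x, W_h, b)
        states.append(h)
    return states


# Toy parameters: 2-dimensional inputs, 2-dimensional hidden state.
W_x = [[0.5, -0.3], [0.1, 0.8]]
W_h = [[0.2, 0.0], [0.0, 0.2]]
b = [0.0, 0.0]

short = run_rnn([[1.0, 0.0]], W_x, W_h, b)
longer = run_rnn([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], W_x, W_h, b)
```

Because each state feeds into the next, feeding the same inputs in a different order generally produces a different final state.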

Self-Attention Mechanism
RNNs update their hidden state $h_i$ as they process a sequence, and the final hidden state holds a summary of the information in the sequence. In order to amplify the contribution of important words in the final representation, a self-attention mechanism is used, as shown in Fig. 3. By employing an attention mechanism, the representation of the input sequence $r$ is no longer limited to just the final state $h_N$, but rather it is a combination of all the hidden states $h_i$. This is done by computing the sequence representation as the convex combination of all $h_i$. The weights $a_i$ are learned by the network and their magnitude signifies the importance of each $h_i$ in the final representation. Formally:
$$e_i = \tanh(W_h h_i + b_h)$$
$$a_i = \frac{\exp(e_i)}{\sum_{t=1}^{N} \exp(e_t)}, \quad \sum_{i=1}^{N} a_i = 1$$
$$r = \sum_{i=1}^{N} a_i h_i$$
where $W_h$ and $b_h$ are the attention layer's weights.
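The attention computation described above can be sketched as follows. Scoring each hidden state with the tanh of an affine function is an assumption of this sketch, and the weight vector `w` and bias `b` are toy parameters; the key point is that the weights form a probability distribution and the representation is their convex combination of the hidden states.

```python
import math


def softmax(scores):
    """Normalize scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(hidden_states, w, b):
    """Return (attention weights, sequence representation).

    Each state h_i gets a score e_i = tanh(w . h_i + b); the weights are
    softmax(e) and the representation is sum_i a_i * h_i.
    """
    scores = [
        math.tanh(sum(wk * hk for wk, hk in zip(w, h)) + b)
        for h in hidden_states
    ]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    representation = [
        sum(a * h[k] for a, h in zip(weights, hidden_states))
        for k in range(dim)
    ]
    return weights, representation


states = [[0.1, 0.9], [0.8, -0.2], [0.0, 0.3]]
weights, r = attend(states, w=[1.0, -1.0], b=0.0)
```

Inspecting `weights` is what enables visualizations of salient words: each entry is the share of the final representation contributed by the corresponding token.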

Model Description
Next, we present the submitted models in detail. For all subtasks, we adopted a transfer learning approach, by pretraining a BiLSTM network with a deep attention mechanism on the SA2017 dataset. Afterwards, we replaced the last layer of the pretrained model with a task-specific layer and fine-tuned the whole network for each subtask.

Transfer Learning Model (TF)
Our transfer learning model is based on the sentiment analysis model in (Baziotis et al., 2017). It consists of a 2-layer bidirectional LSTM (BiLSTM) with a deep self-attention mechanism.
Embedding Layer. The input to the network is a Twitter message, treated as a sequence of words. We use an embedding layer to project the words $w_1, w_2, ..., w_N$ to a low-dimensional vector space $R^W$, where $W$ is the size of the embedding layer and $N$ the number of words in a tweet. We initialize the weights of the embedding layer with our pre-trained word embeddings (Section 2.2).
BiLSTM Layer. An LSTM takes as input a sequence of word embeddings and produces word annotations $h_1, h_2, ..., h_N$, where $h_i$ is the hidden state of the LSTM at time-step $i$, summarizing all the information of the sentence up to $w_i$. We use bidirectional LSTMs (BiLSTM) in order to get word annotations that summarize the information from both directions. A BiLSTM consists of 2 LSTMs, a forward LSTM $\overrightarrow{f}$ that parses the sentence from $w_1$ to $w_N$ and a backward LSTM $\overleftarrow{f}$ that parses the sentence from $w_N$ to $w_1$. We obtain the final annotation for each word $h_i$ by concatenating the annotations from both directions,
$$h_i = \overrightarrow{h_i} \,\|\, \overleftarrow{h_i}, \quad h_i \in R^{2L}$$
where $\|$ denotes the concatenation operation and $L$ the size of each LSTM.
Attention Layer. To amplify the contribution of the most informative words, we augment our BiLSTM with a self-attention mechanism. We use a deep self-attention mechanism (Pavlopoulos et al., 2017) to obtain a more accurate estimation of the importance of each word. The attention weight in the simple self-attention mechanism is replaced with a multilayer perceptron (MLP), composed of $l$ layers with a non-linear activation function (tanh). The MLP learns the attention function $g$. The attention weights $a_i$ are then computed as a probability distribution over the hidden states $h_i$. The final representation $r$ is the convex combination of $h_i$ with weights $a_i$:
$$e_i = g(h_i)$$
$$a_i = \frac{\exp(e_i)}{\sum_{t=1}^{N} \exp(e_t)}$$
$$r = \sum_{i=1}^{N} a_i h_i$$
Output Layer. We use vector $r$ as the feature representation, which we feed to a final task-specific layer. For the regression tasks, we use a fully-connected layer with one neuron and a sigmoid activation function. For the ordinal classification tasks, we use a fully-connected layer, followed by a softmax operation, which outputs a probability distribution over the classes. Finally, for the multi-label classification task, we use a fully-connected layer with 11 neurons (number of labels) and a sigmoid activation function, performing binary classification for each label.
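The three kinds of task-specific output layers can be illustrated with a small pure-Python sketch. The weights here are placeholders, the helper names are hypothetical, and the 0.5 decision threshold for the multi-label case is an assumption not stated in the text.

```python
import math

EMOTIONS = ["anger", "anticipation", "disgust", "fear", "joy", "love",
            "optimism", "pessimism", "sadness", "surprise", "trust"]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear(r, W, b):
    """Fully-connected layer: one output per row of W."""
    return [sum(w * x for w, x in zip(row, r)) + bias
            for row, bias in zip(W, b)]


def regression_head(r, W, b):
    """One neuron followed by a sigmoid: an intensity in (0, 1)."""
    return sigmoid(linear(r, W, b)[0])


def ordinal_head(r, W, b):
    """Softmax over the ordinal classes; returns class probabilities."""
    logits = linear(r, W, b)
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multilabel_head(r, W, b, threshold=0.5):
    """Eleven independent sigmoids; the 0.5 threshold is an assumption."""
    probs = [sigmoid(z) for z in linear(r, W, b)]
    return [name for name, p in zip(EMOTIONS, probs) if p >= threshold]
```

Only the final layer differs between subtasks, which is why the same pretrained encoder can be reused for all of them.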

Fine-Tuning
After training a network on the pretraining dataset (SA2017), we fine-tune it on each subtask, by replacing its final layer with a task-specific layer. We experimented with two fine-tuning schemes. The first approach is to fine-tune the whole network, that is, both the pretrained encoder (BiLSTM) and the task-specific layer. The second approach is to use the pretrained model only for weight initialization, freeze its weights during training and just fine-tune the final layer. Based on the experimental results, the first approach obtains significantly better results in all tasks.

Regularization
In both models, we add Gaussian noise to the embedding layer, which can be interpreted as a random data augmentation technique that makes models more robust to overfitting. In addition, we use dropout (Srivastava et al., 2014) and we stop training once the validation loss has stopped decreasing (early stopping). Furthermore, we do not fine-tune the embedding layers. If the embeddings were fine-tuned, words occurring in the training set would be moved in the embedding space, and the classifier would correlate certain regions of the embedding space with certain emotions. However, words included only in the test set would remain at their initial positions, which might no longer reflect their "true" emotion, leading to misclassifications.
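A minimal sketch of the noise-based regularization described above: zero-mean Gaussian noise is added to each embedded word during training only. The function name and the `training` flag are hypothetical; σ = 0.2 is the value reported in the experimental setup.

```python
import random


def add_gaussian_noise(embedded, sigma=0.2, training=True, rng=random):
    """Add N(0, sigma^2) noise to every component of every word vector.

    embedded: list of word vectors (lists of floats) for one tweet.
    At evaluation time (training=False) the input is returned unchanged.
    """
    if not training:
        return embedded
    return [[x + rng.gauss(0.0, sigma) for x in vector] for vector in embedded]


tweet = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
noisy = add_gaussian_noise(tweet, rng=random.Random(0))
clean = add_gaussian_noise(tweet, training=False)
```

Since fresh noise is drawn every time a tweet is seen, the network never trains on exactly the same input twice, which is what makes the technique act like data augmentation.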

Experimental Setup
Training. We use the Adam algorithm (Kingma and Ba, 2014) for optimizing our networks, with mini-batches of size 32, and we clip the norm of the gradients (Pascanu et al., 2013) at 1, as an extra safety measure against exploding gradients. For developing our models we used PyTorch (Paszke et al., 2017) and Scikit-learn (Pedregosa et al., 2011).
Class Weights. In subtasks EI-oc and V-oc, some classes have more training examples than others, introducing bias in our models. To deal with this problem, we apply class weights to the loss function, penalizing more heavily the misclassification of under-represented classes. These weights are computed as the inverse frequencies of the classes in the training set.
Hyper-parameters. In order to tune the hyper-parameters of our model, we adopt a Bayesian optimization (Bergstra et al., 2013) approach, performing a more time-efficient search in the high-dimensional space of all the possible values, compared to grid or random search. We set the size of the embedding layer to 310 (300 word2vec + 10 affective dimensions), which we regularize by adding Gaussian noise with σ = 0.2 and dropout of 0.1. The sentence encoder is composed of 2 BiLSTM layers, each of size 250 (per direction), with a 2-layer self-attention mechanism. Finally, we apply dropout of 0.3 to the encoded representation.
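Two of the training details above lend themselves to a short sketch: inverse-frequency class weights and gradient-norm clipping. The exact normalization of the class weights is not specified in the text, so the inverse relative frequency used here is an assumption, and both helper names are hypothetical.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by the inverse of its relative frequency."""
    counts = Counter(labels)
    n = len(labels)
    return {cls: n / count for cls, count in counts.items()}


def clip_by_norm(grads, max_norm=1.0):
    """Rescale a gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


weights = inverse_frequency_weights([0, 0, 0, 1])
clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)
```

In the toy call, class 1 appears once in four examples and therefore receives three times the weight of class 0, so errors on the rarer class contribute more to the loss.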

Experiments
In Table 2, we compare the proposed transfer learning models against 3 strong baselines. Pearson correlation is the metric used for the first four subtasks, whereas the Jaccard index is used for the E-c multi-label classification subtask. The first baseline is a unigram Bag-of-Words (BOW) model with TF-IDF weighting. The second baseline is a Neural Bag-of-Words (N-BOW) model, where we retrieve the word2vec embeddings of the words in a tweet and compute the tweet representation as the average (centroid) of the constituent word2vec embeddings. Finally, the third baseline is similar to the second one, but with the addition of 10-dimensional affective embeddings that model affect-related dimensions (valence, dominance, arousal, etc.). Both BOW and N-BOW features are then fed to a linear SVM classifier, with tuned C = 0.6. In order to assess the impact of transfer learning, we evaluate the performance of each model in 3 different settings: (1) random weight initialization (LSTM-RD), (2) transfer learning with frozen weights (LSTM-TL-FR), (3) transfer learning with fine-tuning (LSTM-TL-FT). The results of our neural models in Table 2 are computed by averaging the results of 10 runs to account for model variability.
Baselines. Our first observation is that N-BOW baselines significantly outperform BOW in subtasks EI-reg, EI-oc, V-reg and V-oc, in which we have to predict the intensity of an emotion, or the tweet's valence. However, BOW achieves slightly better performance in subtask E-c, in which we have to recognize the emotions expressed in each tweet. This can be attributed to the fact that BOW models perform well in tasks where the occurrence of certain words is sufficient to accurately determine the classification result. This suggests that in subtask E-c, certain words are highly indicative of some emotions. Word embeddings, though, which encode the correlation of each word with different dimensions, enable N-BOW to better predict the intensity of various emotions.
Transfer Learning. We observe that our neural models achieved better performance than all baselines by a large margin. Moreover, we can see that our transfer learning model yielded higher performance than the non-transfer model in most of the Emotion Intensity (EI) subtasks. In the Emotion multi-label classification subtask (E-c), transfer learning did not outperform the random initialization model. This can be attributed to the fact that our source dataset (SA2017) was not diverse enough to boost the model performance when classifying the tweets into none, one or more of a set of 11 emotions. As for fine-tuning or freezing the pretrained layers, the overall results show that enabling the model to fine-tune always results in significant gains. This is consistent with our intuition that allowing the weights of the model to adapt to the target dataset, thus encoding task-specific information, results in performance gains. Regarding the emotion of joy, we observe that in the EI-reg and EI-oc subtasks, LSTM-RD matches the performance of LSTM-TL-FR. We interpret this result as an indication of the semantic similarity between the source and the target task.
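For reference, the two evaluation metrics and the N-BOW centroid representation mentioned above can be sketched in a few lines of Python. These are illustrative helpers, not the official evaluation script; in particular, scoring a tweet as 1 when both the gold and predicted label sets are empty is an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def jaccard(gold, pred):
    """Mean over tweets of |G & P| / |G | P| for label sets G and P."""
    scores = []
    for g, p in zip(gold, pred):
        union = g | p
        # Assumption: two empty label sets count as a perfect match.
        scores.append(len(g & p) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


def centroid(vectors):
    """N-BOW tweet representation: the average of its word vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


print(jaccard([{"joy"}, {"anger", "fear"}], [{"joy"}, {"anger"}]))  # 0.75
```

The centroid discards word order entirely, which is one reason the recurrent models in the comparison have more to work with than the N-BOW baselines.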
Mystery dataset. The submitted models were also evaluated against a mystery dataset, in order to investigate whether there is statistically significant social bias in them. This is a very important experiment, especially when automated machine learning algorithms are interacting with social media content and users in the wild. The mystery dataset consists of pairs of sentences that differ only in the social context (e.g. gender or race). Submitted models are expected to predict the same affective values for both sentences in the pair. The evaluation metric is the average difference in prediction scores per class, along with the p-value indicating whether the difference is statistically significant. Results are summarized in Table 3.

Attention visualizations
Fig. 10 shows a heat-map of the attention weights on top of 8 example tweets (2 tweets per emotion). The color intensity corresponds to the weight given to each word by the self-attention mechanism and signifies the importance of this word for the final prediction. We can see that the salient words correspond to the predicted emotion (e.g. "irritated" for anger, "mourn" for sadness, etc.). An interesting observation is that when emojis are present they are almost always selected as important, which indicates their function as weak annotations. Also note that the attention mechanism can hint at dependencies between words even if they are far apart in a sentence, like the "why" and "mad" in the sadness example.

Conclusion
In this paper we present a deep-learning system for short-text emotion intensity and valence estimation, for both regression and classification, and for multiclass emotion classification. We used Bidirectional LSTMs with a deep attention mechanism and took advantage of transfer learning in order to address the problem of limited training data. Our models achieved excellent results in single and multi-label classification tasks, but mixed results in emotion and valence intensity tasks. Future work can follow two directions. Firstly, we aim to revisit the task with different transfer learning approaches, such as (Felbo et al., 2017; Howard and Ruder, 2018; Hashimoto et al., 2016).
Secondly, we would like to introduce character-level information in our models, based on (Wieting et al., 2016; Labeau and Allauzen, 2017), in order to overcome the problem of out-of-vocabulary (OOV) words and learn syntactic and stylistic features (Peters et al., 2018), which are highly indicative of emotions and their intensity.
Finally, we make both our pretrained word embeddings and the source code of our models available to the community 3 , in order to make our results easily reproducible and facilitate further experimentation in the field.