psyML at SemEval-2018 Task 1: Transfer Learning for Sentiment and Emotion Analysis

In this paper, we describe the first attempt to perform transfer learning from sentiment to emotions. Our system employs Long Short-Term Memory (LSTM) networks, including bidirectional LSTMs (biLSTMs) and an LSTM with an attention mechanism. We perform transfer learning by first pre-training the LSTM networks on sentiment data, then concatenating the penultimate layers of these networks into a single vector as input to new dense layers. For the E-c subtask, we utilize a novel approach to train models for correlated emotion classes. Our system ranks 4/48, 3/39, 8/38, 4/37, and 4/35 on the English subtasks EI-reg, EI-oc, V-reg, V-oc, and E-c of SemEval-2018 Task 1: Affect in Tweets.

3. Sentiment intensity regression (V-reg): given a tweet, predict a real-valued sentiment intensity from 0 (no sentiment) to 1 (high sentiment). In this subtask, the directionality of the tweet's sentiment is ignored: a negative tweet receives the same score as a positive tweet with the same valence.
The task is particularly challenging since E-c and EI-oc are completely new subtasks, so no prior data or working models are available for comparison. The leaderboard was also not public during the competition. As shown in Table 1, the development sets are particularly small compared to the test sets, and the test sets are comparable in size to the training sets, so the model must generalize well. For EI-oc and EI-reg, the development and test sets are also annotated separately from the training sets. This impacts performance: our system would have placed 1st with an average Pearson score of 0.755 on the WASSA 2017 task, in which the EI-reg train, development, and test data are annotated in the same format. Furthermore, tweets are difficult to analyze due to their unstructured language (hashtags, emoticons, slang, misspellings, poor grammar).
To extract linguistic features, some systems employ pre-trained word embeddings (Baziotis et al., 2017; Cliche, 2017) or a combination of manually created features and/or lexicons (Köper et al., 2017; Duppada and Hiray, 2017). However, exclusively relying on hand-crafted features for EI-reg may result in a model that fails to encompass unforeseen linguistic relationships. Similarly, relying exclusively on deep learning models without lexicon inputs can lead to simple misclassifications, given the small training data.
To combine the best of both worlds, previous systems collapse high-dimensional word embeddings into a single dimension arithmetically before combining them with hand-crafted features (usually one-dimensional). Goel et al., for instance, averaged the word embeddings for each word in a tweet in order to concatenate the result with a 43-dimensional vector. Duppada and Hiray simply averaged the outputs of their two top-performing models.
In this paper, we present a deep learning system whose variants competed in all English subtasks of SemEval-2018 Task 1: Affect in Tweets, specifically EI-reg, EI-oc, V-reg, V-oc, and E-c. We make the following contributions:
• A deep learning system that can take in a combination of one-dimensional hand-crafted and multi-dimensional word embedding inputs.
• A deep learning system that uses transfer learning from sentiment tasks to overcome the lack of training data compared to test data. To the best of our knowledge, this is the first instance of transferring knowledge from sentiment to emotion.
• Specifically for Task E-c, procedures for training correlated target classes.

Overview
Fig 1 shows an overview of our system, which consists of three steps: (1) preprocessing input using a text processor and the Weka AffectiveTweets package (Mohammad and Bravo-Marquez, 2017); (2) pre-training Components A to C using sentiment data; (3) training the entire system, including Components A, B, C, and E, using the subtask-specific dataset.

Preprocessing
We use the ekphrasis text processor and word embeddings built by Baziotis et al. (2017) to convert each tweet into a sequence of word vectors. The TweetToLexiconFeatureVector returns a 43-dimension feature vector using sentiment and emotion lexicons such as Bing-Liu, AFINN, Sentiment140, and NRC-10 Expanded.

Transfer Learning
Transfer learning is the process of using knowledge from solving a source task to help performance in a target task. In particular, transfer learning is useful when the target task training set is small.
Another common way to deal with small data is distant supervision (Mintz et al., 2009), a process for generating labelled data from an unlabelled set according to a set of rules. For instance, for a sentiment analysis task, distant supervision can involve labelling tweets with smileys as positive and those with sad emojis as negative (Read, 2005).
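As a concrete illustration, the emoticon-labelling rule described above can be sketched in a few lines of Python. The emoticon sets here are illustrative stand-ins, not the rules actually used by Read (2005); the cue is stripped from the text so a model cannot simply memorize the labelling rule.

```python
# Distant supervision: derive sentiment labels from emoticon cues,
# then remove the cues from the text. Ambiguous tweets are discarded.
POSITIVE = {":)", ":-)", ":D"}   # illustrative cue sets, not Read's (2005)
NEGATIVE = {":(", ":-(", ":'("}

def distant_label(tweet):
    """Return (cleaned_tweet, label), or None if no unambiguous cue is found."""
    tokens = tweet.split()
    has_pos = any(t in POSITIVE for t in tokens)
    has_neg = any(t in NEGATIVE for t in tokens)
    if has_pos == has_neg:  # no cue, or contradictory cues: discard
        return None
    cleaned = " ".join(t for t in tokens if t not in POSITIVE | NEGATIVE)
    return cleaned, ("positive" if has_pos else "negative")
```

Applying `distant_label` to a raw tweet stream yields a large, weakly labelled sentiment corpus at the cost of some label noise.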
Transfer learning has historically performed well on computer vision problems (Yosinski et al., 2014; Razavian et al., 2014). Traditionally, the CNN layer weights are frozen and the CNN output is treated as a feature vector input to a fully-connected layer, which learns the new target task. Intuitively, the CNN learns low-level image features on the source task while the dense layers use these low-level features to predict the new target task.
Another strategy is to unfreeze the weights of the later layers of the pre-trained network and backpropagate all the way into it, so that its later layers can be fine-tuned. We choose to leave all weights of the pre-trained network unfrozen.
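The frozen variant can be illustrated with a toy numpy sketch (not the paper's actual networks): a fixed random projection with ReLU stands in for the pre-trained body, and only a new logistic head is trained on the target task.

```python
import numpy as np

rng = np.random.default_rng(0)

# A fixed random projection + ReLU stands in for the frozen, pre-trained
# network body: its weights W1 are never updated below.
W1 = rng.normal(size=(4, 8))
def features(x):
    return np.maximum(x @ W1, 0.0)

# New task head, trained from scratch on a toy target task.
w2 = np.zeros(8)
X = rng.normal(size=(64, 4))
y = (X[:, 0] > 0).astype(float)        # toy target labels

lr = 0.1
for _ in range(200):
    h = features(X)
    p = 1 / (1 + np.exp(-(h @ w2)))    # sigmoid head
    grad = h.T @ (p - y) / len(X)      # gradient w.r.t. w2 only; W1 stays frozen
    w2 -= lr * grad                    # unfreezing would also update W1 here

acc = np.mean(((1 / (1 + np.exp(-(features(X) @ w2)))) > 0.5) == (y > 0.5))
```

Unfreezing, as our system does, simply means the update step also backpropagates into `W1` instead of treating it as a constant.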
Transfer learning in natural language processing applications has been largely successful only within the same task, such as POS tagging or sentiment analysis (Blitzer et al., 2006, 2007). Across domains, good results are achieved only for semantically equivalent transfer (in which the source task and target task have the same objective but different data), not for semantically different transfer (in which the source and target task have different objectives) (Mou et al., 2016).
For all subtasks, we use transfer learning to pre-train our models on sentiment data. The source task objective is to predict categorical sentiment classes ('positive', 'negative', or 'neutral') given a tweet. Since the source task is not equivalent to any of the target tasks, we would expect lower performance than in experiments on domain adaptation.
There are two main ways to perform transfer learning: the parameter-initialization approach, in which a model is trained on a source task and its weights are transferred to a target task, and multi-task learning, in which a model is trained on multiple tasks simultaneously. We choose the parameter-initialization approach, as Mou et al. (2016) have shown the two approaches to be comparable.

Neural Network
The Recurrent Neural Network (RNN) is an extension of the traditional neural network that handles sequential data. In its simplest form, the hidden state h_t ∈ R^d at time step t (where d is the size of the RNN) is a function f of the current word embedding x_t, the past hidden state h_{t−1}, and the learned parameters θ: h_t = f(x_t, h_{t−1}; θ).
Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) is a special form of RNN with a memory cell and input, forget, and output gates that allow it to take long-term dependencies into account. It is more widely used than the plain RNN since it overcomes the vanishing and exploding gradient problems common in RNNs. The architecture of a standard LSTM is the same as that of an RNN, as shown in Fig 2, but with a different repeating module. Formally, each LSTM cell is computed as follows:

f_t = σ(W_f x_t + U_f h_{t−1} + b_f)
i_t = σ(W_i x_t + U_i h_{t−1} + b_i)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)
c_t = f_t ∘ c_{t−1} + i_t ∘ tanh(W_c x_t + U_c h_{t−1} + b_c)
h_t = o_t ∘ tanh(c_t)

where f_t is the forget gate, i_t the input gate, o_t the output gate, c_t the cell state, h_t the hidden state, σ the sigmoid function, and ∘ element-wise multiplication.
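The standard LSTM cell equations can be checked against a direct numpy implementation of a single time step; the dimensions below are illustrative, not the ones our system uses.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM cell step following the standard gate equations."""
    Wf, Uf, bf, Wi, Ui, bi, Wo, Uo, bo, Wc, Uc, bc = params
    f = sigmoid(Wf @ x_t + Uf @ h_prev + bf)                    # forget gate
    i = sigmoid(Wi @ x_t + Ui @ h_prev + bi)                    # input gate
    o = sigmoid(Wo @ x_t + Uo @ h_prev + bo)                    # output gate
    c = f * c_prev + i * np.tanh(Wc @ x_t + Uc @ h_prev + bc)   # cell state
    h = o * np.tanh(c)                                          # hidden state
    return h, c

# Tiny example: embedding dim 3, hidden dim 2.
rng = np.random.default_rng(0)
params = [rng.normal(size=s) for s in [(2, 3), (2, 2), (2,)] * 4]
h, c = lstm_step(rng.normal(size=3), np.zeros(2), np.zeros(2), params)
```

Because h_t = o_t ∘ tanh(c_t) with o_t ∈ (0, 1), each hidden-state entry is bounded in (−1, 1), while the cell state c_t is not bounded and can carry information over long spans.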
We use bidirectional LSTM (biLSTM) to incorporate both past and future context.  We use an attention mechanism (Rocktäschel et al., 2015) to learn which words in a tweet contribute more to a target task. Fig 4 shows the architecture of a standard LSTM with attention mechanism.
The attention layer is usually a 1- or 2-layer neural net that takes the output of an LSTM or RNN as input. It assigns an "attention weight" α_i to each hidden state h_i and outputs a weighted representation r of the hidden states:

e_i = tanh(W_h h_i + b_h)
α_i = exp(e_i) / Σ_j exp(e_j)
r = Σ_i α_i h_i

where α is the attention weight vector, r is the weighted representation of the hidden states, and W_h and b_h are learned during backpropagation. Components A and C are bidirectional LSTMs. Component B is an LSTM with attention mechanism.
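This attention pooling can be sketched in numpy, assuming a 1-unit scoring layer with a tanh nonlinearity (the exact activation is our assumption, not stated in the paper):

```python
import numpy as np

def attention_pool(H, w, b):
    """Weighted pooling of hidden states H (one row per time step)."""
    e = np.tanh(H @ w + b)      # scalar score per time step (1-unit dense layer)
    a = np.exp(e - e.max())     # softmax, shifted for numerical stability
    alpha = a / a.sum()         # attention weights, which sum to 1
    r = alpha @ H               # weighted representation of the hidden states
    return r, alpha

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))     # 5 time steps, hidden size 4
r, alpha = attention_pool(H, rng.normal(size=4), 0.0)
```

Inspecting `alpha` after training is what lets the model indicate which words in a tweet contributed most to the prediction.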

Data
We used the dataset provided by the SemEval challenge.
For transfer learning, we pre-train Components A to C using the train, development, and test data from the 2013 to 2017 SemEval Task 4A sentiment analysis classification tasks. In total, this provided 7,723 negative, 22,195 neutral, and 19,652 positive tweets.

Model
Component A is a 1-layer biLSTM. The input is a matrix A ∈ R^{n×d}, where n is the number of words in the tweet and d is the dimension of the word embedding. Component B is an LSTM with attention and takes the same input as Component A; the attention layer is a 1-unit dense layer. Component C is also a 1-layer biLSTM like Component A, except its input is a matrix C ∈ R^{n×43}: Weka's TweetToLexiconFeatureVector returns a 43-dimension vector, and C is the sequence of the 43-dimension vectors of the words passed into the TweetToLexiconFeatureVector. The tweet inputs for Components A through C are all zero-padded. Component D takes in an entire tweet and returns its 43-dimension TweetToLexiconFeatureVector. Component E is a 5-layer fully-connected neural net.

Regularization
In Components A through C, we apply dropout to both input and recurrent connections. Dropout (Srivastava et al., 2014) is a technique that involves randomly dropping units during training to prevent overfitting and co-adaptation of neurons. By randomly dropping units, neighboring neurons make up for the dropped units and learn representations for the target, resulting in a more robust network.
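For reference, a minimal numpy sketch of dropout in its "inverted" form, where surviving units are rescaled at training time so the expected activation is unchanged at test time (modern libraries such as Keras use this form; the sketch is illustrative, not our system's code):

```python
import numpy as np

def dropout(h, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`,
    rescale survivors by 1/(1-rate) so expectations match test time."""
    mask = (rng.random(h.shape) >= rate) / (1.0 - rate)
    return h * mask

rng = np.random.default_rng(0)
h = np.ones(1000)
out = dropout(h, 0.2, rng)   # ~20% of units zeroed, rest scaled to 1.25
```

Recurrent dropout applies the same idea to the recurrent connections, typically with a mask that is held fixed across time steps.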
In Components A and C, we apply global max pooling as the final layer.
We incorporate mini-batch gradient descent in all our models. Mini-batch gradient descent computes the gradient over small batches of data instead of the entire dataset as in traditional batch gradient descent. Compared to stochastic gradient descent, where the gradient is computed after every single example, it is less computationally intensive.
In the E-c dataset and the sentiment classification dataset, some classes are overrepresented. This class imbalance can bias the model output. To overcome this, for E-c we apply class weights to the loss function to boost recall of the minority classes.
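One common weighting scheme, sketched below, weights each class inversely to its frequency; `power=2` gives the squared inverse weights compared in the evaluation. The helper name and exact normalization are our illustration, not necessarily the system's implementation.

```python
from collections import Counter

def inverse_class_weights(labels, power=1.0):
    """Weight each class inversely proportional to its frequency.
    power=2 yields squared inverse weights."""
    counts = Counter(labels)
    total = len(labels)
    return {c: (total / n) ** power for c, n in counts.items()}

labels = ["trust"] * 5 + ["joy"] * 95   # toy imbalanced data (95% "joy")
w = inverse_class_weights(labels)       # "trust" weighted 19x more than "joy"
```

A dictionary like `w` can be passed to the loss so that errors on rare classes such as "trust" and "surprise" cost more, boosting their recall.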

Hyper-parameters
We train all our models using mini-batches of size 8, and Adam (Kingma and Ba, 2014) optimization. For the E-c subtask, we minimize binary cross entropy loss. For EI-oc, EI-reg, V-oc and V-reg, we minimize mean squared error.
Components A through C all have dropout of 0.2 on their input layers and recurrent connections.
Components A and C are both biLSTMs with 256 units. Component A takes in A ∈ R^{50×300}, while Component C takes in C ∈ R^{50×43}. We chose 50 since none of the tweets is longer than 50 words. Finally, we add a global max pooling layer.
Component B is an LSTM of 256 units, and takes in B ∈ R^{50×300}.
For source task learning, we apply a dense layer of 3 hidden units with sigmoid activation function to Components A through C.
Component E consists of 5 dense layers with 300, 125, 50, 25, and 1 hidden units. We use the rectified linear unit (ReLU) as the activation function for the first four layers.
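Assuming the forward and backward outputs of each biLSTM are concatenated after pooling (a common Keras convention, and our assumption here), the dimensions of the vector fed to Component E can be tallied as follows:

```python
import numpy as np

# Toy stand-ins for each component's pooled output vector, using the
# dimensions stated above; the 2x doubling for biLSTMs is our assumption.
out_A = np.zeros(2 * 256)  # Component A: biLSTM over word embeddings
out_B = np.zeros(256)      # Component B: LSTM with attention
out_C = np.zeros(2 * 256)  # Component C: biLSTM over lexicon features
out_D = np.zeros(43)       # Component D: TweetToLexiconFeatureVector

concat = np.concatenate([out_A, out_B, out_C, out_D])  # input to Component E
```

Under these assumptions, Component E's first 300-unit dense layer receives a 1323-dimension vector (512 + 256 + 512 + 43).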

Training
Source Task Learning During source task learning, we train each of Components A through C individually on the sentiment dataset with 10% hold out validation. The source task objective is to predict sentiment categorical class ('positive', 'negative', or 'neutral'). For each of the models, we train it for 1, 2, 3, 4, and 5 epochs and save the best performing one.
We experimented with RNN, CNN, LSTM, and biLSTM before settling on biLSTM and LSTM with attention.
Target Task Learning During target task learning, the final dense layers in Components A through C are removed and the penultimate layers are concatenated together with the output from Component D into a single vector as input to Component E. None of the weights are frozen, and the entire system (Components A through E) is trained for between 5 and 10 epochs for subtasks EI-oc, EI-reg, V-oc, and V-reg. We choose the best performing model based on performance on the development set.
For E-c, we notice the emotion classes are inter-correlated. Figure 5 is a dendrogram generated from the E-c training data: the emotions are hierarchically clustered based on their correlations, the horizontal axis labels correspond to the 11 classes, and the shorter the linkage joining two classes, the more correlated they are. From the dendrogram, we obtain the following emotion clusters: [anger, disgust], [sadness, pessimism, fear], [joy, optimism, love], [anticipation, surprise, trust]. For each cluster, we obtain a fresh copy of the entire pre-trained system and train it for 2 epochs consecutively on each target class, in the aforementioned order within the emotion cluster. We call this method "cluster training"; the pseudocode is as follows:

    for each emotion cluster:
        S ← fresh copy of the pre-trained system
        for each emotion in the cluster (in order):
            train S on the emotion's training data for 2 epochs
            predict the emotion's test data with S
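The cluster-training loop can be sketched in plain Python with a stand-in "system" object that merely records what it was trained on (purely illustrative; real training replaces the list append with 2 epochs of gradient descent):

```python
import copy

# Hypothetical stand-in for the pre-trained system: it records its
# training history so we can verify the cluster-training schedule.
pretrained = {"trained_on": []}
clusters = [["anger", "disgust"],
            ["sadness", "pessimism", "fear"],
            ["joy", "optimism", "love"],
            ["anticipation", "surprise", "trust"]]

predictions = {}
for cluster in clusters:
    S = copy.deepcopy(pretrained)        # fresh copy of the pre-trained system
    for emotion in cluster:              # train consecutively, in cluster order
        S["trained_on"].append(emotion)  # stands in for 2 epochs of training
        predictions[emotion] = list(S["trained_on"])  # predict test set with S
```

Note that each cluster starts from a fresh copy, so training on [anger, disgust] never influences the model used for the [joy, optimism, love] cluster.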

Experimental Setup
We build all the models with the Keras library and train them on Google Datalab. The dendrogram diagram is built with Plotly.

Evaluation & Results
SemEval Results Our model ranks 4/48 in EI-reg, 3/39 in EI-oc, 8/38 in V-reg, 4/37 in V-oc, and 4/35 in E-c (Mohammad et al., 2018). Our performance on V-reg is less than satisfactory because V-reg measures sentiment intensity without regard for directionality, whereas our source task takes directionality into account. This supports the finding of Mou et al. (2016) that pre-training is less useful in semantically different transfer.
System To evaluate our system, we assess the performance of each component and various combinations of them. Table 2 shows the development set performance. In particular, we note that Components A+B perform better than Component A or Component B separately, as do Components C+D. Furthermore, Components A+B+C+D perform better overall than Components A+B or C+D.
Cluster training Subtask E-c classes are imbalanced, with 95% of "Trust" and "Surprise" training examples being negative. Table 3 shows the breakdown of negative and positive training examples for each of the E-c classes.
To assess our cluster training procedure, we evaluate the performance of independently training each of the E-c emotion classes (using a fresh copy of the pre-trained system for each of the 11 emotions), as well as various class weighting schemes. Table 4 shows our experimental results.
Within the independent training experiments, squared inverse weights performed best as measured by accuracy and micro-averaged F1. Using squared inverse weights, cluster training performs better than independent training, attesting to the utility of cluster training.

Conclusion
In this paper, we present the first attempt to perform transfer learning from sentiment to emotions. Model weights are pre-trained with past SemEval sentiment categorization tasks and the penultimate layers of the models are concatenated into a single vector as input to new dense layers. The entire system is then trained for each subtask with the weights unfrozen. Our deep learning system combines multi-dimensional word embeddings with single dimensional lexicon-based features.
Specifically, we combine features of X ∈ R^{50×300}, R^{50×43}, and R^{1×43}, which results in better performance than systems using just one of these features.
For the E-c subtask, we utilize hierarchical clustering to group correlated emotions together and train the same model incrementally for emotions within the same cluster. This novel method outperforms a system that trains on each emotion independently. We participated in all of the English subtasks of SemEval 2018 Task 1: Affect in Tweets and obtained a top-4 placement in 4 out of the 5 subtasks, testifying to our model's robustness.
For future work, we would like to experiment with other training methods such as multi-task learning and distant supervision, as well as tune the hyper-parameters of our model to improve its performance across all subtasks.