SeerNet at SemEval-2018 Task 1: Domain Adaptation for Affect in Tweets

This paper describes the best-performing system for the SemEval-2018 Affect in Tweets (English) sub-tasks. The system focuses on the ordinal classification and regression sub-tasks for valence and emotion. For ordinal classification, valence is classified into 7 classes ranging from -3 to 3, whereas emotion intensity is classified into 4 classes from 0 to 3, separately for each of the emotions anger, fear, joy, and sadness. The regression sub-tasks estimate the intensity of valence and of each emotion. The system performs domain adaptation of 4 different models and creates an ensemble to give the final prediction. The proposed system achieved 1st position out of the 75 teams that participated in the aforementioned sub-tasks. We outperform the baseline model by margins ranging from 49.2% to 76.4%, thus significantly advancing the state of the art.


Introduction
Twitter is one of the most popular micro-blogging platforms, attracting over 300M daily users, with over 500M tweets sent every day. Tweet data has attracted NLP researchers because it offers easy access to a large data source of people expressing themselves online. Tweets are micro-texts comprising emoticons, hashtags, and location data, which makes them feature-rich for various kinds of analysis. Tweets also pose an interesting challenge, as users tend to write ungrammatically and use informal and slang words.
In the domain of natural language processing, emotion recognition is the task of associating words, phrases, or documents with predefined emotions drawn from psychological models. The classification of emotions has mainly been researched from two fundamental viewpoints. Ekman (1992) and Plutchik (2001) proposed that emotions are discrete, with each emotion being a distinct entity. In contrast, Mehrabian (1980) and Russell (1980) proposed that emotions can be categorized into dimensional groupings.
The Affect in Tweets (Mohammad et al., 2018) shared task in SemEval-2018 focuses on extracting affect from tweets conforming to both variants of the emotion models: valence (dimensional) and emotion (discrete). The previous version of the task (Mohammad and Bravo-Marquez, 2017) focused on estimating the emotion intensity in tweets. We participated in 4 sub-tasks of Affect in Tweets, all dealing with English tweets. The sub-tasks were: EI-oc, ordinal classification of the emotion intensity of 4 emotions (anger, joy, sadness, fear); EI-reg, determining the intensity of emotions (anger, joy, sadness, fear) on a real-valued scale of 0-1; V-oc, ordinal classification of valence into one of 7 ordinal classes [-3, 3]; and V-reg, determining the intensity of valence on a scale of 0-1.
Prior work on extracting Valence, Arousal, Dominance (VAD) from text primarily relied on using and extending lexicons (Bestgen and Vincze, 2012; Turney et al., 2011). Recent advancements in deep learning have been applied to detecting sentiment in tweets (Tang et al., 2014; Liu et al., 2012; Mohammad et al., 2013). In this work, we use various state-of-the-art machine learning models and perform domain adaptation (Pan and Yang, 2010) from their source task to the target task. We use a multi-view ensemble learning technique (Kumar and Minz, 2016) to produce the optimal feature-set partitioning for the classifier. Finally, results from multiple such classifiers are stacked together to create an ensemble (Polikar, 2012).
In this paper, we describe our approach and experiments to solve this problem. The rest of the paper is laid out as follows: Section 2 describes the system architecture, and Section 3 reports results and inferences from different experiments. Finally, we conclude in Section 4 with a discussion of future work.

System Description

Pipeline
Figure 1 details the system architecture. We now describe how the different modules are tied together. The raw input tweet is pre-processed as described in Section 2.2. The processed tweet is passed through all the feature extractors described in Section 2.3. At the end of this step, we have 5 different feature vectors for each tweet. Each feature vector is passed through the model zoo, where classifiers with different hyper-parameters are tuned. The models are described in Section 2.4. For each vector, the results of the top-2 performing models (based on cross-validation) are retained. At the end of this step, we have 10 different results for each tweet. All these results are ensembled together via stacking, as described in Section 2.4.3. Finally, the output of the ensembler is the output returned by the system.

Pre-processing
The pre-processing step modifies the raw tweets to prepare them for feature extraction. Tweets are pre-processed using the tweettokenize tool. Twitter-specific keywords are replaced with tokens, namely USERNAME, PHONENUMBER, URLs, and timestamps. All characters are converted to lowercase. A contiguous sequence of emojis is first split into individual emojis. We then replace each emoji with its description. The descriptions were scraped from EmojiPedia.
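The steps above can be illustrated with a minimal sketch. It is not the tweettokenize tool itself: the regular expressions, the `preprocess` function, and the two-entry `EMOJI_DESCRIPTIONS` table are assumptions made for illustration, the order of lowercasing and token replacement is our own choice, and the PHONENUMBER and timestamp replacements are omitted. Each emoji is assumed to be a single code point.

```python
import re

# Hypothetical stand-in for the descriptions scraped from EmojiPedia.
EMOJI_DESCRIPTIONS = {
    "\U0001F602": "face with tears of joy",
    "\U0001F621": "pouting face",
}

# Simplified patterns (assumptions, not the patterns used by tweettokenize).
URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"@\w+")


def preprocess(tweet):
    """Lowercase a tweet, insert placeholder tokens, and describe emojis."""
    text = tweet.lower()
    text = URL_RE.sub("URL", text)
    text = USER_RE.sub("USERNAME", text)
    pieces = []
    for ch in text:
        if ch in EMOJI_DESCRIPTIONS:
            # Padding with spaces splits a contiguous run into single emojis.
            pieces.append(" " + EMOJI_DESCRIPTIONS[ch] + " ")
        else:
            pieces.append(ch)
    # Collapse the extra whitespace introduced by the padding.
    return " ".join("".join(pieces).split())


print(preprocess("@Bob LOVED it \U0001F602\U0001F602 http://t.co/x"))
```

Lowercasing before inserting the placeholders keeps tokens such as USERNAME in upper case, so they cannot collide with ordinary words in the tweet.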

Feature Extraction
As mentioned in Section 1, we perform transfer learning from various state-of-the-art deep learning models. The following sub-sections describe these models in detail.

DeepMoji
DeepMoji (Felbo et al., 2017) performs distant supervision on a very large dataset (1246 million tweets) with noisy labels (emojis). DeepMoji obtained state-of-the-art results in various downstream tasks using transfer learning, which makes it an ideal candidate for domain adaptation to related target tasks. We extract 2 different feature sets by taking the embeddings from the softmax layer and the attention layer of the pre-trained DeepMoji model. The vector from the softmax layer is of dimension 64, and the vector from the attention layer is of dimension 2304.

Skip-Thought Vectors
Skip-Thought vectors (Kiros et al., 2015) is an off-the-shelf encoder that can produce highly generic sentence representations. Since tweets are restricted by a character limit, skip-thought vectors can create a good semantic representation. This representation is then passed to the classifier. The representation is of dimension 4800.

Sentiment Neuron
Radford et al. (2017) developed an unsupervised system which learned an excellent representation of sentiment. The original model was trained to generate Amazon reviews, which makes the sentiment neuron an ideal candidate for transfer learning. The representation extracted from the Sentiment Neuron is of size 4096.

EmoInt
Apart from the pre-trained embeddings, we also include various lexical features bundled in the EmoInt package (Duppada and Hiray, 2017). The lexical features include AFINN (Nielsen, 2011). This gives us five different feature vector variants. All of these feature vectors are passed individually to the underlying models. The pipeline is explained in detail in Section 2.1.

Machine Learning Models
We participated in 4 sub-tasks, namely EI-oc, EI-reg, V-oc, and V-reg. Two of the sub-tasks are ordinal classification and the remaining two are regression. We describe our approach for building ML models for both variants in the upcoming sections.

Figure 1: System Architecture.

Ordinal Classification
We participated in the emotion intensity ordinal classification sub-task, where the goal was to predict the intensity of emotions from the categories anger, fear, joy, and sadness. Separate datasets were provided for each emotion class. The goal of the valence ordinal classification sub-task was to classify the tweet into one of 7 ordinal classes [-3, 3]. We experimented with the XGBoost classifier and the Random Forest classifier of sklearn (Pedregosa et al., 2011).

Regression
For the regression tasks (EI-reg, V-reg), the goal was to predict the intensity on a scale of 0-1. We experimented with the XGBoost regressor and the Random Forest regressor of sklearn (Pedregosa et al., 2011).
The hyper-parameters of each model were tuned separately for each sub-task. The top-2 models for each feature vector type were chosen after performing 7-fold cross-validation.
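The selection step can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the code used in the system: the helper names `kfold_indices` and `top_k_models`, the contiguous fold layout, and the example scores are all hypothetical.

```python
def kfold_indices(n_samples, k=7):
    """Yield (train, validation) index lists for k contiguous folds."""
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


def top_k_models(cv_scores, k=2):
    """Return the names of the k models with the highest mean CV score.

    cv_scores maps a model name to its mean cross-validated score
    (higher is better, for example Pearson correlation).
    """
    return sorted(cv_scores, key=cv_scores.get, reverse=True)[:k]


# Hypothetical mean scores for three candidates on one feature vector.
scores = {"xgb_a": 0.71, "rf_a": 0.69, "xgb_b": 0.73}
print(top_k_models(scores))
```

Applying `top_k_models` once per feature vector type yields two retained models for each of the five feature vectors, which matches the 10 results per tweet described in the pipeline.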

Stacking
Once we get the results from all the classifiers/regressors for a given tweet, we use the stacking ensemble technique to combine them. In this case, we pass the results from the models as input to a meta classifier/regressor. The output of this meta model is treated as the final output of the system.
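As a rough illustration of stacking, the sketch below fits a meta-regressor on the outputs of the base models. The choice of an ordinary least-squares linear meta-model is an assumption made to keep the example self-contained, and it is not necessarily the meta model used in the system; the function names and the toy numbers are hypothetical.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_meta_regressor(base_predictions, targets):
    """Least-squares fit of targets on base-model outputs plus an intercept.

    base_predictions has one row per tweet and one column per base model.
    Returns one weight per base model followed by the intercept.
    """
    rows = [list(p) + [1.0] for p in base_predictions]
    d = len(rows[0])
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(d)]
    return solve_linear_system(xtx, xty)


def predict_meta(coeffs, base_prediction):
    """Combine the base-model outputs for one tweet into a final score."""
    features = list(base_prediction) + [1.0]
    return sum(c * f for c, f in zip(coeffs, features))


# Toy data: two base models, four tweets (all values are made up).
base = [[0.1, 0.3], [0.4, 0.2], [0.8, 0.6], [0.6, 0.9]]
gold = [0.2, 0.3, 0.7, 0.75]
coeffs = fit_meta_regressor(base, gold)
print(predict_meta(coeffs, [0.2, 0.4]))
```

In practice the meta model is usually fitted on out-of-fold predictions of the base models, so that it does not learn from outputs the base models produced on their own training data.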
We observed that using ordinal regressors gave us better performance than using classifiers which treat each output class as disjoint. Ordinal Regression is a family of statistical learning meth-

Task Results
The metrics used for ranking various systems are discussed in this section.

Primary Metrics
Pearson correlation with gold labels was used as the primary metric for ranking the systems. For the EI-reg and EI-oc tasks, Pearson correlation is macro-averaged (MA Pearson) over the four emotion categories. Table 1 describes the results based on the primary metrics for the various English-language sub-tasks. Our system achieved the best performance in each of the four sub-tasks. We have also included the results of the baseline and the second-best-performing system for comparison. As we can observe, our system vastly outperforms the baseline and is a significant improvement over the second-best system, especially in the emotion sub-tasks.

Table 3: Secondary metrics for regression subtasks. System rank is mentioned in brackets.
Task      Pearson (gold in 0.5-1)
V-reg     0.697 (1)
EI-reg    0.638 (1)
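For reference, the primary metric can be computed as in the following sketch. It is not the official evaluation script: the function names and the toy scores are assumptions chosen for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def macro_average_pearson(per_emotion):
    """Average the per-emotion Pearson scores with equal weight.

    per_emotion maps an emotion name to a (gold, predicted) pair of lists.
    """
    scores = [pearson(gold, pred) for gold, pred in per_emotion.values()]
    return sum(scores) / len(scores)


# Toy example with two emotions (values are made up).
example = {
    "anger": ([0.1, 0.5, 0.9], [0.2, 0.4, 0.8]),
    "joy": ([0.3, 0.6, 0.7], [0.2, 0.7, 0.6]),
}
print(macro_average_pearson(example))
```

Macro-averaging gives each emotion the same weight regardless of how many tweets its dataset contains.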

Secondary Metrics
The competition also uses some secondary metrics to provide a different perspective on the results. Pearson correlation for a subset of the test set that includes only those tweets with an intensity score greater than or equal to 0.5 is used as the secondary metric for the regression tasks. For the ordinal classification tasks, the following secondary metrics were used:
• Pearson correlation for a subset of the test set that includes only those tweets with intensity classes low X, moderate X, or high X (where X is an emotion). The organizers refer to this set of tweets as the some-emotion subset (SE).
• Weighted quadratic kappa on the full test set
• Weighted quadratic kappa on the some-emotion subset of the test set
The results for the secondary metrics are listed in Tables 2 and 3. We have also included the ranking in brackets along with the score. Our system achieves the top rank according to all the secondary metrics, which indicates its robustness.
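The weighted quadratic kappa used for the ordinal tasks can be sketched as follows. This is an illustrative implementation of the standard definition, not the organizers' scoring code; the function name and the toy labels are assumptions.

```python
def quadratic_weighted_kappa(gold, pred, min_label, max_label):
    """Cohen's kappa with quadratic weights for integer ordinal labels."""
    n_classes = max_label - min_label + 1
    n = len(gold)
    # Observed confusion matrix: rows are gold labels, columns predictions.
    observed = [[0.0] * n_classes for _ in range(n_classes)]
    for g, p in zip(gold, pred):
        observed[g - min_label][p - min_label] += 1
    gold_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(n_classes))
                 for j in range(n_classes)]
    numerator = 0.0
    denominator = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            # Disagreement cost grows with the squared class distance.
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            expected = gold_hist[i] * pred_hist[j] / n
            numerator += weight * observed[i][j]
            denominator += weight * expected
    return 1.0 - numerator / denominator


# Toy example on the 4-class emotion scale 0 to 3 (labels are made up).
print(quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 2], 0, 3))
```

Because the weights are quadratic in the class distance, predicting class 2 for a tweet labelled 3 is penalized far less than predicting class 0, which suits ordinal labels.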

Feature Importance
The performance of the system is highly dependent on the discriminative ability of the tweet representations generated by the featurizers. We measure the predictive power of each featurizer by calculating the Pearson correlation of the system when using only that featurizer. We describe the results for each sub-task separately in Tables 4-7. We observe that the DeepMoji featurizer is the most powerful of all the featurizers we used. We can also see that stacking ensembles of models trained on the outputs of multiple featurizers gives a significant improvement in performance.

System Limitations
We analyzed the data points where our model's prediction is far from the ground truth. We observed some limitations of the system; for instance, understanding a tweet sometimes requires contextual knowledge about the world. Such examples can be very confusing for the model. We use the pre-trained DeepMoji model, which uses emojis as a proxy for labels. However, partly due to the nature of Twitter conversations, the same emoji can be used for multiple emotions; for example, joy emojis are sometimes used to express joy and sometimes used for sarcasm or to insult someone. One such example is 'Your club is a laughing stock'. Such cases are sometimes incorrectly predicted by our system.

Future Work & Conclusion
This paper studies the effectiveness of various representations of tweets and proposes ways to combine them to obtain state-of-the-art results. We also show that a stacking ensemble of various classifiers learnt on different representations can vastly improve the robustness of the system.
Further improvements can be made in the pre-processing stage. Instead of discarding tokens such as punctuation and incorrectly spelled words, we can utilize this information by learning their semantic representations. We can also improve system performance by employing multi-task learning techniques, since the various emotions are not independent of each other and information about one emotion can aid in predicting another. Furthermore, more robust distant supervision techniques that are less prone to noisy labels can be employed to obtain better-quality training data.