DL Team at SemEval-2018 Task 1: Tweet Affect Detection using Sentiment Lexicons and Embeddings

This paper describes our approach for SemEval-2018 Task 1: Affect Detection in Tweets. We perform experiments with manually compiled sentiment lexicons and word embeddings, testing their performance on the Twitter affect detection task to determine which features produce the most informative representation of a sentence. We demonstrate that general-purpose word embeddings produce a more informative sentence representation than lexicon features. However, combining lexicon features with embeddings yields higher performance than embeddings alone.


Introduction
The paper describes our approach for SemEval-2018 Task 1: Affect Detection in Tweets.
The research question we address in this paper is: what are the best features for tweet affect detection? Our solution uses two types of features: lexicon features obtained from manually compiled emotion lexicons, and word embeddings built in an unsupervised manner from large corpora. We use well-established lexicons, namely DepecheMood and Vader Sentiment, and the most popular word embeddings, namely GloVe and Google News. We systematically compare all features on two subtasks and demonstrate that even though lexicon features produce unsatisfactory results in isolation, they significantly improve algorithm performance when combined with more general embeddings.
In addition, we demonstrate that special treatment of Twitter hash-tags also improves the algorithm performance.

Tasks and Data
The paper addresses three subtasks:
• EI-reg, an emotion intensity regression task: given a tweet and an emotion E, determine the intensity of E that best represents the mental state of the tweeter, a real-valued score between 0 (no E at all) and 1 (the highest magnitude of E); separate datasets are provided for fear, sadness, anger, and joy.
• V-reg, a sentiment intensity regression task: given a tweet, determine the intensity of sentiment or valence (V) that best represents the mental state of the tweeter, a real-valued score between 0 (most negative) and 1 (most positive).
• E-c, an emotion classification task: given a tweet, classify it as 'neutral or no emotion' or as one, or more, of eleven given emotions that best represent the mental state of the tweeter: trust, sadness, disgust, fear, optimism, love, joy, pessimism, anticipation, surprise, and anger.
We use English data for all three subtasks. The train, development and test set sizes are shown in Table 1. More details on the data can be found in the task organizers' paper (Mohammad and Kiritchenko, 2018).

Baseline
As a baseline we use the Text-Processing API 1 . The API uses a Naive Bayes model trained on movie reviews with NLTK. The model returns probabilities for the negative, positive and neutral labels. The negative and positive probabilities sum to 1, while the neutral probability stands alone.
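Querying the API and turning its reply into a 3-dimensional feature vector can be sketched as follows. This is a minimal illustration, not the authors' code; the JSON layout shown in `sample` is an assumption about the API's reply format, chosen so that negative and positive probabilities sum to 1 while the neutral probability stands alone, as described above.

```python
import json
import urllib.parse
import urllib.request

API_URL = "http://text-processing.com/api/sentiment/"  # assumed endpoint

def parse_sentiment(raw_json):
    """Turn the API's JSON reply into a 3-dim feature vector [neg, neutral, pos]."""
    reply = json.loads(raw_json)
    probs = reply["probability"]
    return [probs["neg"], probs["neutral"], probs["pos"]]

def query_sentiment(text):
    """POST a text to the API and return its probability vector (needs network)."""
    data = urllib.parse.urlencode({"text": text}).encode()
    with urllib.request.urlopen(API_URL, data=data) as resp:
        return parse_sentiment(resp.read().decode())

# Assumed shape of a reply: neg + pos sum to 1, neutral is reported separately.
sample = '{"probability": {"neg": 0.38, "neutral": 0.16, "pos": 0.62}, "label": "pos"}'
```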

DepecheMood
DepecheMood (Staiano and Guerini, 2014) is an emotion lexicon collected using crowdsourcing. The respondents annotated news articles with eight predefined emotions: afraid, amused, angry, annoyed, dont care, happy, inspired, sad. The document annotations were then used in a dimensionality reduction algorithm to obtain word emotional scores. The lexicon contains approximately 37 thousand entries. Each entry consists of a word and eight values between 0 and 1, one for each emotion.

Vader
Vader (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based sentiment analysis tool and a lexicon specifically attuned to sentiments expressed in social media, such as Twitter (Hutto and Gilbert, 2014). The lexicon consists of more than 7000 terms, which were compiled from other lexicons and then manually annotated. The Git repository 2 of the Vader Sentiment toolkit provides the function polarity_scores, which takes a text as input and returns a 4-dimensional feature vector containing negative, positive, neutral and compound scores.
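Using the toolkit as a feature extractor can be sketched as below. The import path and score keys follow the vaderSentiment package; the zero-vector fallback is our own addition so the sketch degrades gracefully when the package is absent:

```python
def vader_features(text):
    """4-dim Vader feature vector [neg, neu, pos, compound].

    Falls back to zeros when the vaderSentiment package is not installed.
    """
    try:
        from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    except ImportError:
        return [0.0, 0.0, 0.0, 0.0]
    scores = SentimentIntensityAnalyzer().polarity_scores(text)
    return [scores["neg"], scores["neu"], scores["pos"], scores["compound"]]
```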

GloVe
GloVe (Pennington et al., 2014) is an unsupervised algorithm that constructs embeddings from large corpora. The GloVe project 3 provides a number of models trained on various collections. We use two of these models: the Common Crawl model, and the Twitter Crawl model (200-dimensional vectors trained on 2 billion tweets with 27 billion tokens and 1.2 million distinct words).

Google News
We use word2vec (Mikolov et al., 2013) embeddings trained on the Google News collection 4 , which have become almost standard embeddings, since they are the most frequently used in various research tasks. These embeddings are 300-dimensional vectors built on a Google News dataset of 100 billion tokens and 3 million distinct words and phrases.

Method
We use various combinations of the baseline, lexicon and embedding features described above. The Text-Processing API and Vader return text-level features. For the other sources, a tweet representation is built by averaging the word vectors. Concatenation is used to combine features obtained from different sources. We run several preliminary experiments on the V-reg task to compare several algorithms, namely Gradient Boosting Regressor and Random Forest. We use sklearn implementations 5 . Gradient Boosting Regressor yields the best performance for all feature combinations (Table 2). In our official submission we apply Gradient Boosting Regressor for the EI-reg and V-reg tasks, and Gradient Boosting Classifier for the E-c task.
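The averaging-and-concatenation step can be sketched as follows. The 3-dimensional toy embeddings and the lexicon feature values are invented for the example; the real pipeline uses the 200- and 300-dimensional vectors and the lexicon features described above:

```python
# Tiny toy embeddings standing in for GloVe / Google News vectors.
TOY_EMBEDDINGS = {
    "cruel": [0.1, -0.4, 0.3],
    "man":   [0.2,  0.1, 0.0],
}

def average_embedding(tokens, emb=TOY_EMBEDDINGS, dim=3):
    """Tweet representation = mean of its in-vocabulary word vectors."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def combined_features(tokens, lexicon_feats):
    """Concatenate embedding-based and text-level lexicon features."""
    return average_embedding(tokens) + list(lexicon_feats)

x = combined_features(["cruel", "man"], [0.38, 0.16, 0.62])
# A matrix of such vectors can then be fed to sklearn's
# GradientBoostingRegressor (EI-reg, V-reg) or GradientBoostingClassifier (E-c).
```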
Hash-tags are special types of tokens in Twitter used to specify a topic or a context for a given message. They frequently contain emotional words.
Here are several examples from the dataset: • @leesyatt you are a cruel, cruel man. #therewillbeblood #revenge.
• straight people are canoodling on the quad and I'm #offended .
Thus, we try two different settings: first, processing hash-tags like all other words in the text; second, processing hash-tags separately to preserve the authors' encoding of their emotions. The second strategy consistently yields better results, as can be seen from Table 2.
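The second setting can be sketched as a simple token split: hash-tag tokens are pulled out so that each group gets its own averaged embedding before concatenation. This is our own minimal illustration of the idea:

```python
def split_hashtags(tokens):
    """Separate hash-tag tokens from plain words.

    Each group can then be averaged into its own embedding vector, and the
    two vectors concatenated, preserving the author's explicit emotion cues.
    """
    tags = [t.lstrip("#") for t in tokens if t.startswith("#")]
    words = [t for t in tokens if not t.startswith("#")]
    return words, tags

words, tags = split_hashtags(["straight", "people", "#offended"])
```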

Discussion
Comparisons of feature sets and algorithms are presented in Table 2. As can be seen from the table, the results are consistent: embeddings yield higher performance than lexicon features for all tasks. DepecheMood, even though it has five times more entries than Vader, seems to be less suitable for tweet emotion prediction and yields performance much lower than the baseline. Moreover, using both lexicons in combination does not always improve performance and in some cases works even worse than Vader alone. There is no significant difference between the embeddings: various embeddings achieve better performance depending on the task, though the best results are obtained by using all three in combination.
It can also be seen from Table 2 that separate treatment of hash-tags improves model performance. For example, for the joy detection task the difference is about 10%, which means that joy is frequently expressed explicitly in hash-tags.
The best results for all tasks are obtained by using all feature sets in combination (with the only exception of the anger intensity detection subtask). This yields an improvement of 5.5% for the anger detection subtask, 4% for fear, 7.5% for joy, 5.4% for sadness, and about 2% for the sentiment intensity detection subtask. This means that even though lexicons cannot be used by themselves to detect emotions, they provide important features that cannot be extracted from embeddings. We hypothesize that the main reason for this is low coverage, meaning that many tweets have few lexicon features or no such features at all. The coverage of the task corpora by various feature sets is presented in Table 3. It can be seen from the table that embeddings have much higher coverage than the DepecheMood lexicon. Another interesting observation is that GloVe Twitter does not have higher overall coverage than GloVe Common Crawl, though GloVe Twitter has higher coverage of hash-tags.
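A coverage figure like those in Table 3 can be computed as the fraction of all tweet tokens found in a feature source's vocabulary. The tiny corpus and vocabulary below are invented for the example:

```python
def coverage(tweets, vocabulary):
    """Fraction of tokens, across all tweets, found in the given vocabulary."""
    tokens = [t for tweet in tweets for t in tweet]
    if not tokens:
        return 0.0
    return sum(t in vocabulary for t in tokens) / len(tokens)

# Toy check: 3 of the 4 tokens appear in the vocabulary.
cov = coverage([["happy", "day"], ["so", "happy"]], {"happy", "day"})
```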

Results
The best model, used in our officially submitted solution, exploits all six feature sets plus separate embedding vectors for hash-tags. The list of feature sets and their dimensionality is presented in Table 4. The official results for the EI-reg and V-reg tasks are presented in Table 5. We report results for all instances and for the instances with the highest emotion intensity. The numerical values are similar to what we obtained on the development set. The official results for the E-c classification task are presented in Table 6.

Conclusion
In this paper we presented our approach for the SemEval Affect Detection in Tweets task. We compared manually collected lexical features with embeddings automatically extracted from huge corpora. We demonstrated that even though lexicons are less suitable for affect detection in tweets due to low coverage, they can improve model performance when lexical features are used together with more general embeddings.
In addition, we demonstrated that hash-tags are important features for tweet affect detection, since they frequently include emotional words.

Table 5: Official results for EI-reg (emotion intensity regression) and V-reg (valence intensity regression). Scores are given in the format X / Y, where X is our result and Y is the best official result on the task. Pearson correlation.
Accuracy: 47.7 / 58.8
micro-avg F1: 61.0 / 70.1
macro-avg F1: 41.6 / 52.8

Table 6: Official results for the E-c (emotion classification) task. Scores are given in the format X / Y, where X is our result and Y is the best official result on the task.
In this paper we used rather simplistic methods to combine various features, i.e., vector concatenation. In the future we plan to try another approach: to build a separate classifier for each feature set and then use a meta classifier on top of their results.

Repository
The repository with the code is located at: https://github.com/dmikrav/SemEval2018AffectsTweets
The project web-site is located at: https://dmikrav.github.io/SemEval2018AffectsTweets/