Seernet at EmoInt-2017: Tweet Emotion Intensity Estimator

The paper describes experiments on estimating emotion intensity in tweets using a generalized regressor system. The system combines various independent feature extractors, trains general-purpose regressors on them, and finally combines the best-performing models into an ensemble. The proposed system ranked 3rd out of 22 systems on the leaderboard of the WASSA-2017 Shared Task on Emotion Intensity.


Introduction
Twitter, a micro-blogging and social networking site, has emerged as a platform where people express themselves and react to events in real time. It is estimated that nearly 500 million tweets are sent per day. Twitter data is particularly interesting because of its peculiar nature, where people convey messages in short sentences using hashtags, emoticons, emojis, etc. In addition, each tweet has metadata like the location and language used by the sender. It is challenging to analyze this data because the tweets might not be grammatically correct, and users tend to use informal and slang words all the time. Hence, this poses an interesting problem for NLP researchers. Any advances in using this abundant and diverse data can help understand and analyze information about a person, an event, a product, an organization, or a country as a whole. Many notable use cases of Twitter have been documented elsewhere.
Along similar lines, Task 1 of WASSA-2017 (Mohammad and Bravo-Marquez, 2017c) poses the problem of estimating the intensity of four emotions, namely anger, fear, joy, and sadness, in tweets. In this paper, we describe our approach and experiments to solve this problem. The rest of the paper is laid out as follows: Section 2 describes the system architecture, Section 3 reports results and inferences from different experiments, and Section 4 points to ways in which the problem can be explored further.

Figure 1: System Architecture

Preprocessing
The preprocessing step modifies raw tweets before they are passed to feature extraction. Tweets are processed using the tweetokenize tool. Twitter-specific features are replaced as follows: username handles to USERNAME, phone numbers to PHONENUMBER, numbers to NUMBER, URLs to URL, and times to TIME. A continuous sequence of emojis is broken into individual tokens. Finally, all tokens are converted to lowercase.
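The replacement rules above can be sketched with simple regular expressions. This is only an illustrative re-implementation, not the actual tweetokenize logic: phone-number and emoji handling are omitted for brevity, and lowercasing is done first here so that the placeholder tokens stay uppercase.

```python
import re

def preprocess(tweet):
    """Illustrative sketch of the normalisation steps; the real system
    relies on the tweetokenize tool rather than these regexes."""
    tweet = tweet.lower()                                # lowercase first
    tweet = re.sub(r'@\w+', 'USERNAME', tweet)           # user handles
    tweet = re.sub(r'https?://\S+', 'URL', tweet)        # URLs
    tweet = re.sub(r'\b\d{1,2}:\d{2}\b', 'TIME', tweet)  # times like 10:30
    tweet = re.sub(r'\b\d+\b', 'NUMBER', tweet)          # remaining numbers
    return tweet
```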

Feature Extraction
Many tasks related to sentiment or emotion analysis depend upon affect, opinion, sentiment, sense, and emotion lexicons. These lexicons associate words with corresponding sentiment or emotion metrics. On the other hand, the semantic meaning of words, sentences, and documents can be preserved and compactly represented using low-dimensional vectors (Mikolov et al., 2013) instead of one-hot encoded vectors, which are sparse and high-dimensional. Finally, there are traditional NLP features like word N-grams, character N-grams, Part-Of-Speech N-grams, and word clusters, which are known to perform well on various tasks.
Based on these observations, the feature extraction step is implemented as a union of different independent feature extractors (featurizers) in EmoInt, a light-weight and easy-to-use Python program. It comprises all features available in the baseline model (Mohammad and Bravo-Marquez, 2017a) along with additional feature extractors and bi-gram support. Fourteen such feature extractors have been implemented, which can be grouped into three major categories:

Lexicon Features: The AFINN (Nielsen, 2011) word list is manually rated for valence with an integer between -5 (negative sentiment) and +5 (positive sentiment). The Bing Liu (Hu and Liu, 2004) opinion lexicon extracts opinions from customer reviews. +/-EffectWordNet (Choi and Wiebe, 2014) by the MPQA group is a sense-level lexicon. The NRC Affect Intensity (Mohammad, 2017) lexicons provide real-valued affect intensities. The NRC Word-Emotion Association Lexicon (Mohammad and Turney, 2010) contains 8 sense-level associations (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and 2 sentiment-level associations (negative and positive). The Expanded NRC Word-Emotion Association Lexicon (Bravo-Marquez et al., 2016) extends the NRC word-emotion association lexicon to Twitter-specific language. The NRC Hashtag Emotion Lexicon (Mohammad and Kiritchenko, 2015) contains emotion word associations computed via hashtags on an emotion-labeled Twitter corpus. The NRC Hashtag Sentiment Lexicon and Sentiment140 Lexicon (Mohammad et al., 2013) contain sentiment word associations computed on a Twitter corpus via hashtags and emoticons. SentiWordNet (Baccianella et al., 2010) assigns to each synset of WordNet three sentiment scores: positivity, negativity, and objectivity. Negation lexicon collections are used to count the total occurrences of negation words. In addition to these, the SentiStrength (Thelwall et al., 2010) application, which estimates the strength of positive and negative sentiment in tweets, is also added.
Word Vectors: We focus primarily on word vector representations (word embeddings) created specifically from Twitter data. GloVe (Pennington et al., 2014) is an unsupervised learning algorithm for obtaining vector representations of words; 200-dimensional GloVe embeddings trained on 2 billion tweets are integrated. Edinburgh embeddings (Bravo-Marquez et al., 2015) are obtained by training a skip-gram model on the Edinburgh corpus (Petrovic et al., 2010). Since tweets are abundant with emojis, emoji embeddings (Eisner et al., 2016), which are learned from emoji descriptions, have also been used. The embedding of a tweet is obtained by summing the individual word vectors and then dividing by the number of tokens in the tweet.
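The averaging scheme for tweet embeddings can be written as a short helper. Here `embeddings` is a hypothetical token-to-vector dictionary standing in for the GloVe, Edinburgh, or emoji lookup tables; note the division is by the total token count, as described above, not by the number of in-vocabulary tokens.

```python
import numpy as np

def tweet_embedding(tokens, embeddings, dim):
    """Sum the vectors of in-vocabulary tokens, then divide by the
    total number of tokens in the tweet."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return np.zeros(dim)  # no token found in the lookup table
    return np.sum(vecs, axis=0) / len(tokens)
```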
Syntactic Features: Syntax-specific features such as word N-grams, Part-Of-Speech N-grams (Owoputi et al., 2013), and Brown cluster N-grams (Brown et al., 1992), obtained using the TweetNLP project, have been integrated into the system.
The final feature vector is the concatenation of all the individual features. For example, we concatenate the average word vectors, the sum of NRC Affect Intensities, the numbers of positive and negative Bing Liu lexicon entries, the number of negation words, and so on to get the final feature vector. Scaling of the final features is not required when they are used with gradient boosted trees. However, scaling steps like standard scaling (zero mean and unit variance) may be beneficial for neural networks, as optimizers work well when the data is centered around the origin.
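The concatenation step amounts to stacking the outputs of the enabled featurizers. The function below is a minimal sketch, where each featurizer is assumed to map a token list to a 1-D NumPy array; these names are illustrative, not the EmoInt API.

```python
import numpy as np

def build_feature_vector(tokens, featurizers):
    """Concatenate the outputs of independent featurizers into the
    final feature vector for one tweet."""
    return np.concatenate([f(tokens) for f in featurizers])
```

Scaling, when needed for neural networks, can then be applied to the concatenated vectors, e.g. with scikit-learn's StandardScaler.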
A total of fourteen different feature extractors have been implemented, all of which can be enabled or disabled individually to extract features from a given tweet.

Regression
The dev dataset (Mohammad and Bravo-Marquez, 2017b) in the competition was small; hence, the train and dev sets were merged to perform 10-fold cross-validation. In each fold, a model was trained and its predictions were collected on the held-out data. The predictions were averaged across all folds to generalize the solution and prevent over-fitting. As described in Section 2.2, different combinations of feature extractors were used. After feature extraction, the data was passed to various regressors: Support Vector Regression, AdaBoost, RandomForestRegressor, and BaggingRegressor from scikit-learn (Pedregosa et al., 2011). Finally, the top-performing models were chosen as those with the least error on the evaluation metrics, namely Pearson's correlation coefficient and Spearman's rank-order correlation.
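The fold-averaging scheme can be sketched as follows. A RandomForestRegressor stands in for the various regressors tried, and the exact fold bookkeeping of the actual system may differ; this is a sketch of the general technique of averaging per-fold models' test predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def cv_averaged_predictions(X, y, X_test, n_splits=10, seed=0):
    """Train one regressor per CV fold and average their predictions
    on the test set to reduce over-fitting."""
    preds = []
    for train_idx, _ in KFold(n_splits=n_splits, shuffle=True,
                              random_state=seed).split(X):
        model = RandomForestRegressor(n_estimators=50, random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        preds.append(model.predict(X_test))
    return np.mean(preds, axis=0)
```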

Parameter Optimization
In order to find optimal parameter values for the EmoInt system, an extensive grid search was performed through the scikit-learn framework over all subsets of the (shuffled) training set, using stratified 10-fold cross-validation and optimizing the Pearson's correlation score. The best cross-validation results were obtained using an AdaBoost meta-regressor with XGBoost (Chen and Guestrin, 2016) as the base regressor, with 1000 estimators and a 0.1 learning rate. Experiments and analysis of the results are presented in the next section.
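A grid search optimizing Pearson's correlation can be set up as below. To keep the sketch dependency-free, AdaBoostRegressor's default decision-tree base estimator stands in for XGBoost, the parameter grid is illustrative rather than the one actually searched, and plain (non-stratified) 3-fold CV is used.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

# Pearson's r as a scorer (higher is better)
pearson_scorer = make_scorer(
    lambda y_true, y_pred: np.corrcoef(y_true, y_pred)[0, 1])

def tune(X, y):
    """Grid search over illustrative AdaBoost hyper-parameters,
    scored by Pearson's correlation under 3-fold CV."""
    grid = {"n_estimators": [10, 20], "learning_rate": [0.1, 1.0]}
    search = GridSearchCV(AdaBoostRegressor(random_state=0), grid,
                          scoring=pearson_scorer, cv=3)
    search.fit(X, y)
    return search.best_params_
```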

Experimental Results
As described in Section 2.2, various syntax features were used, namely Part-of-Speech tags and Brown clusters from the TweetNLP project. However, these did not perform well in cross-validation; hence, they were dropped from the final system. While performing the grid search mentioned in Section 2.4, all lexicon-based features were kept the same, and combinations of emoji vectors and word vectors were varied to minimize the cross-validation metric. Table 1 describes the results of experiments conducted with different combinations of word vectors. Emoji embeddings (Eisner et al., 2016) give better results than plain GloVe and Edinburgh embeddings. Edinburgh embeddings outperform GloVe embeddings in the Joy and Sadness categories but lag behind in the Anger and Fear categories. The official submission comprised the top-performing model for each emotion category. This system ranked 3rd on the entire test dataset and 2nd on the subset of the test data formed by taking every instance with a gold emotion intensity score greater than or equal to 0.5. Post-competition, experiments were performed on ensembling diverse models to improve accuracy. An ensemble obtained by averaging the results of the top 2 performing models outperforms all the individual models.

Feature Importance
The relative importance of a feature can be assessed by the relative depth at which it is used as a decision node in a tree. Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples. The expected fraction of samples they contribute to can thus be used as an estimate of relative feature importance. By averaging this measure over several randomized trees, the variance of the estimate can be reduced, and the result used as a measure of relative feature importance. In Figure 2, feature importance graphs are plotted for each emotion to infer which features play the major role in identifying emotional intensity in tweets. +/-EffectWordNet (Choi and Wiebe, 2014), the NRC Hashtag Sentiment Lexicon, the Sentiment140 Lexicon (Mohammad et al., 2013), and the NRC Hashtag Emotion Lexicon (Mohammad and Kiritchenko, 2015) play the most important roles.
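The impurity-based importance estimate described above is what scikit-learn exposes as `feature_importances_`. A minimal sketch, with hypothetical feature names, looks like this:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def ranked_feature_importance(X, y, feature_names, seed=0):
    """Fit a forest of randomized trees and rank features by
    impurity-based importance, averaged over the ensemble."""
    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    model.fit(X, y)
    order = np.argsort(model.feature_importances_)[::-1]
    return [(feature_names[i], float(model.feature_importances_[i]))
            for i in order]
```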

System Limitations
It is important to understand how the model performs in different scenarios. Table 2 analyzes when the system performs best and worst for each emotion. Since the features used are mostly lexicon-based, the system has difficulty capturing overall sentiment, which leads to amplified or vanishing intensity signals. For instance, in example 4 of Fear, the words "louder" and "shaking" imply fear in the lexicons, but the overall sentence does not. A similar pattern can be found in the 4th example of Anger and the 3rd example of Joy. The system also has difficulty understanding sarcastic tweets; for instance, in the 3rd tweet of Anger, the user expressed anger but used "lol", which is mostly used in a positive sense, and hence the system did a poor job of predicting the intensity. The system further fails on sentences conveying deeper emotion and sentiment that humans can understand with a little context. For example, in sample 4 of Sadness, the tweet refers to post-travel blues, which humans can understand. But with little context, it is difficult for the system to accurately estimate the intensity. The performance is

Future Work & Conclusion
The paper studies the effectiveness of various affect lexicons and word embeddings for estimating emotional intensity in tweets. A light-weight, easy-to-use affect computing framework (EmoInt), which facilitates experimenting with various lexicon features for text tasks, is open-sourced. It provides plug-and-play access to various feature extractors and handy scripts for creating ensembles. A few problems explained in the analysis section could be resolved with the help of sentence embeddings, which take context information into consideration. The features used in the system are generic enough to be used in other affective computing tasks on social media text, not just tweet data. Another interesting property of lexicon-based systems is their good run-time performance during prediction; future work to benchmark the performance of the system could prove vital for deploying it in a real-world setting.
Anger @Claymakerbigsi @toghar11 @scott mulligan @BoxingFanatic Fucker blocked me 2 years ago over a question lol proper holds a grudge old Joe

Table 1 :
Evaluation metrics for various systems. Systems are abbreviated as follows: for example, Em1-Ed0-Gl1 implies that Emoji embeddings and GloVe embeddings are included and Edinburgh embeddings are excluded, keeping all other features the same. Results marked with * correspond to the official submission. Results in bold are the best results for that metric.

Table 2 :
Sample tweets where our system's prediction is best and worst.