IITP at EmoInt-2017: Measuring Intensity of Emotions using Sentence Embeddings and Optimized Features

This paper describes the system that we submitted as part of our participation in the shared task on Emotion Intensity (EmoInt-2017). We propose a Long short term memory (LSTM) based architecture cascaded with Support Vector Regressor (SVR) for intensity prediction. We also employ Particle Swarm Optimization (PSO) based feature selection algorithm for obtaining an optimized feature set for training and evaluation. System evaluation shows interesting results on the four emotion datasets i.e. anger, fear, joy and sadness. In comparison to the other participating teams our system was ranked 5th in the competition.


Introduction
Emotion analysis (Picard, 1997) deals with the automatic extraction of emotion expressed in user-written text. The basic emotions expressed by a human being, as categorized by Ekman (1992), are joy, sadness, surprise, fear, disgust and anger. With the growing amount of text generated on social media, efficiently mining the emotions of users has become a challenging task. However, identifying only the emotion does not always reflect the exact mood of a user. The level or intensity of an emotion often differs from case to case within a single emotion. Some expressions are mild (e.g. 'not good') while others are very severe (e.g. 'terrible'). Finding the intensity level of the expressed emotion is another non-trivial task that researchers have to face.
The shared task on Emotion Intensity (EmoInt-2017) (Mohammad and Bravo-Marquez, 2017) aimed at building an efficient system for intensity prediction on a continuous scale from 0 (least intense) to 1 (most intense). There were four datasets collected from Twitter, each reflecting one class of emotion, i.e. anger, fear, joy and sadness, respectively.
We propose a Long Short Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) based neural network architecture cascaded with Support Vector Regression (SVR) (Smola and Schölkopf, 2004). We build our system on top of word embeddings, assisted by an optimized feature set obtained through Particle Swarm Optimization (PSO) (Kennedy and Eberhart, 1995). A major hurdle in obtaining a good word representation was the noisy and informal nature of the text. Therefore, as a preliminary step, we apply a series of normalization heuristics in line with Akhtar et al. (2015). The word embeddings of the resulting normalized text were more representative than those of the unnormalized text.
The high dimensionality of the feature vector often contributes to high complexity of the system. Also, some features have a higher degree of relevance to a particular task or domain than others. Careful selection of features for any task often leads to improved system performance. However, finding the relevant set of features is a cumbersome and time-consuming task. Motivated by this, we employ a Particle Swarm Optimization (PSO) based feature selection technique for selecting a subset of features from a feature pool. A system trained and evaluated on the reduced and pruned feature set often performs considerably well. At the same time, the complexity of the system also decreases as it has fewer parameters to learn. A literature survey shows successful application of PSO to various tasks and domains (Lin et al., 2008; Akhtar et al., 2017; Yadav et al., 2017).

System Description
This section discusses our proposed approach in detail. The subsequent subsections present various components of our system.

Pre-processing and Normalization
• Mentions, URLs and Punctuations: In this step we filter out all user mentions and URLs, as they carry no emotional content. We then strip all punctuation from the word boundaries so that each token becomes a valid dictionary word, e.g. 'first//' becomes 'first'. Improper use of punctuation was one of the reasons for data sparsity when working with distributed word representations. After this step we observed that the number of out-of-vocabulary (OOV) words was effectively reduced.
• Hashtag Segmentation: Here the '#' symbol is stripped off from the hashtags, and the resulting token is split into its constituent words. For example, '#SpilledBeerOnFloor' is converted to 'Spilled Beer On Floor'. This is achieved using the WordSegment module for word segmentation available in Python (https://github.com/grantjenks/wordsegment). Note that the segmented words are required only for obtaining word embeddings. For obtaining lexicon based features (cf. Section 2.3.1) the entire token including the '#' is used.
• Elongation: Users tend to express their emotional state by elongating a valid word, e.g. 'jooooy', 'goooodd' etc. In this step, all such elongated words are identified and converted into valid words by removing the repeated consecutive characters. For example, 'jooyyyy' and 'jooooy' are converted to 'joy'.
• Verb present participle: In the Twitter domain, it is observed that users tend to omit the character 'i' or 'g' in words ending with 'ing'. For example, 'going' is written as 'goin' or 'gong'. Such errors are identified and corrected. We apply this rule to all verbs that end with either 'ng' or 'in'.
• Frequent noisy term: We compile a dictionary of frequently used slang terms and abbreviations that are common in the Twitter domain, along with their normal forms. Every token in a tweet is looked up in this dictionary. If a match is found, the token is replaced with its normal form. The list was compiled from the datasets of the WNUT-2015 shared task on Twitter Lexical Normalization (Baldwin et al., 2015).
• Expand contractions: A contraction shortens a multi-word expression by dropping some characters and placing an apostrophe in their place. For example, the contraction of 'i am' is 'i'm'. We compile a dictionary of contractions and their normalized forms from the datasets of Baldwin et al. (2015). We replace every occurrence of a contraction in a tweet by its expanded form.
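The normalization steps above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `SLANG`, `CONTRACTIONS` and `VOCAB` tables are tiny hypothetical stand-ins for the WNUT-2015-derived dictionaries, lowercasing is an added assumption, the participle rule uses a vocabulary check instead of identifying verbs, and hashtag segmentation is left out because it depends on the external WordSegment module.

```python
import re
from itertools import groupby, product

# Hypothetical toy resources; the real dictionaries come from WNUT-2015 data.
SLANG = {"u": "you", "gr8": "great", "2moro": "tomorrow"}
CONTRACTIONS = {"i'm": "i am", "can't": "can not", "don't": "do not"}
VOCAB = {"joy", "good", "going", "first", "happy", "so"}

MENTION_OR_URL = re.compile(r"@\w+|https?://\S+|www\.\S+")
EDGE_PUNCT = re.compile(r"^[^\w#]+|[^\w]+$")  # punctuation at word boundaries


def squeeze_elongation(token):
    """Shrink each run of repeated characters to 1 or 2 and keep a vocabulary hit."""
    if token in VOCAB:
        return token
    runs = [(ch, len(list(grp))) for ch, grp in groupby(token)]
    options = [(ch,) if n == 1 else (ch, ch * 2) for ch, n in runs]
    for combo in product(*options):
        candidate = "".join(combo)
        if candidate in VOCAB:
            return candidate
    return token  # no valid word found, leave the token unchanged


def fix_participle(token):
    """Restore a dropped 'g' ('goin') or 'i' ('gong') when the result is a known word."""
    if token in VOCAB:
        return token
    if token.endswith("in") and token + "g" in VOCAB:
        return token + "g"
    if token.endswith("ng") and token[:-2] + "ing" in VOCAB:
        return token[:-2] + "ing"
    return token


def normalize(tweet):
    """Apply mention/URL removal, punctuation stripping and the dictionary rules."""
    text = MENTION_OR_URL.sub(" ", tweet.lower())
    out = []
    for raw in text.split():
        token = EDGE_PUNCT.sub("", raw)
        if not token:
            continue
        if token in CONTRACTIONS:
            out.extend(CONTRACTIONS[token].split())
            continue
        token = SLANG.get(token, token)
        out.append(fix_participle(squeeze_elongation(token)))
    return " ".join(out)


print(normalize("@bob I'm goin to be sooo happy first// http://t.co/x jooooy"))
# prints: i am going to be so happy first joy
```

Checking candidates against a vocabulary is one way to decide how many repeated characters to keep ('goooodd' should become 'good', not 'god'); the paper does not state which dictionary it uses for this.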

LSTM based Approach
A Long Short Term Memory (Hochreiter and Schmidhuber, 1997) network is a special kind of recurrent network that can efficiently learn dependencies over long sequences. The proposed method uses an LSTM network to obtain a sentence embedding vector, which is then fed as input to an SVR for prediction. The proposed network comprises one Bidirectional LSTM (BiLSTM) (Schuster and Paliwal, 1997) layer followed by two dense layers. The hidden layer of each LSTM consists of 100 neurons, whereas the dense layers contain 100 and 50 neurons, respectively.

Word Embeddings
A word embedding (or word vector) is a distributed representation of a word that captures syntactic and semantic information (Mikolov et al., 2013; Pennington et al., 2014). For this task, we use GloVe (Pennington et al., 2014) pre-trained word embeddings trained on the Common Crawl corpus. Each token in the tweet is represented by a 300-dimensional word vector. The choice of Common Crawl embeddings for Twitter datasets is motivated by the normalization steps (Section 2.1). We observe that the application of normalization has a positive effect on the overall performance of the system.
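As a rough illustration of the lookup described above, the sketch below parses embeddings in GloVe's plain-text format (one token per line followed by its vector components) and maps a tokenized tweet to a sequence of vectors. The function names are hypothetical, the toy vectors have 3 dimensions instead of 300, and representing out-of-vocabulary tokens by a zero vector is an assumption, since the paper does not say how missing tokens are handled.

```python
def load_glove(lines):
    """Parse GloVe text lines of the form 'token v1 v2 ... vd' into a dict."""
    table = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        parts = line.rstrip().split(" ")
        table[parts[0]] = [float(x) for x in parts[1:]]
    return table


def embed_tweet(tokens, table, dim):
    """Return one vector per token (zeros for OOV tokens) and the OOV rate."""
    zero = [0.0] * dim
    vectors = [table.get(tok, zero) for tok in tokens]
    missing = sum(tok not in table for tok in tokens)
    oov_rate = missing / len(tokens) if tokens else 0.0
    return vectors, oov_rate


# Toy 3-dimensional embeddings standing in for the 300-dimensional GloVe vectors.
toy_table = load_glove(["joy 0.1 0.2 0.3", "good 0.4 0.5 0.6"])
vectors, oov_rate = embed_tweet(["joy", "jooooy"], toy_table, dim=3)
print(vectors, oov_rate)
```

Tracking the OOV rate in this way makes the effect of normalization visible: an elongated token such as 'jooooy' misses the lookup, while its normalized form 'joy' does not.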

Particle Swarm Optimization based Feature Selection
Particle swarm optimization (Kennedy and Eberhart, 2001) is an optimization technique modeled on the social behavior of a flock of birds. Each potential solution, known as a particle, stores the best position it has attained so far. The global best solution found by any particle in the flock is also recorded and shared among the particles. In the search space, each particle moves towards the optimal solution based on its own best position and the global best position. Eventually, the particles concentrate on a limited region of the search space dictated by the global best solution found so far. The entire process is governed by three operations, namely evaluate, compare and imitate. The evaluation step quantifies the goodness of each particle, whereas the comparison step obtains the best solution by comparing the particles. The imitate step produces new particles based on the best solution. A particle is an n-dimensional binary vector, where each element represents one feature. The value of each element (1 or 0) signifies the presence or absence of its corresponding feature. Consequently, a feature that is absent from a particle does not participate in the training and testing of the system. On termination, PSO yields a particle (encoding a particular feature subset) that represents the best solution. We closely follow the PSO based feature selection algorithm of Akhtar et al. (2017) in the current work.
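To make the evaluate, compare and imitate cycle concrete, here is a minimal sketch of a generic binary PSO over feature masks. It is not the algorithm of Akhtar et al. (2017), whose details are not given here: the sigmoid-based position update, the inertia and acceleration constants, the velocity clamp and the toy fitness function are all assumptions chosen for illustration. In the actual system the fitness would presumably be the performance of a model trained on the selected features.

```python
import math
import random


def binary_pso(fitness, n_features, n_particles=20, n_iters=50,
               inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Generic binary PSO; returns the best 0/1 feature mask and its fitness."""
    rng = random.Random(seed)
    # Each particle is a binary vector: 1 = feature present, 0 = feature absent.
    pos = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]                    # evaluate
    g = max(range(n_particles), key=pbest_fit.__getitem__)  # compare
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_features):                      # imitate
                r1, r2 = rng.random(), rng.random()
                v = (inertia * vel[i][d]
                     + c1 * r1 * (pbest[i][d] - pos[i][d])
                     + c2 * r2 * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-6.0, min(6.0, v))           # keep sigmoid stable
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob else 0
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:                           # update personal best
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:                          # update global best
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit


# Hypothetical fitness: reward three "useful" features, penalize the rest.
USEFUL = {0, 2, 5}


def toy_fitness(mask):
    hits = sum(mask[i] for i in USEFUL)
    extras = sum(mask) - hits
    return hits - 0.5 * extras


best_mask, best_fit = binary_pso(toy_fitness, n_features=8)
print(best_mask, best_fit)
```

Running the search once per emotion dataset, each with its own fitness function, would give a separate feature subset for anger, fear, joy and sadness.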

Feature Set
This section describes the features that we extract to predict the emotion intensity. All these features are fed to the PSO to generate the optimized feature set.
• VADER Sentiment: VADER (Gilbert, 2014) stands for Valence Aware Dictionary and Sentiment Reasoner. It is a rule-based sentiment analysis technique designed to work with contents on social media. For every input tweet, it provides positive, negative, neutral and compound sentiment score. We use these four values as features.
• Lexicon based Features: For each tweet we extract the following lexicon based features:
  - Polar word count: Count of positive and negative words using the MPQA subjectivity lexicon (Wiebe and Mihalcea, 2006) and the Bing Liu lexicon (Ding et al., 2008).
  - Aggregate polarity scores: Positive and negative scores are obtained from each of the following lexicons: Sentiment140, AFINN (Nielsen, 2011) and SentiWordNet (Baccianella et al., 2010). Each score is calculated by aggregating the positive and negative word scores provided by the lexicon.
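The two kinds of lexicon based features can be sketched as below. The word lists and scores are toy values made up for illustration, not entries taken from MPQA, Bing Liu, Sentiment140, AFINN or SentiWordNet, and summation is assumed as the aggregation function since the paper does not specify one.

```python
# Hypothetical polar word lists (stand-ins for MPQA / Bing Liu).
POSITIVE = {"happy", "great", "love"}
NEGATIVE = {"terrible", "angry", "hate"}
# Hypothetical word-level polarity scores (stand-in for a scored lexicon).
SCORES = {"happy": 3.0, "great": 3.0, "terrible": -3.0, "hate": -3.0}


def polar_word_count(tokens, positive=POSITIVE, negative=NEGATIVE):
    """Count tokens that appear in the positive and in the negative word list."""
    pos = sum(tok in positive for tok in tokens)
    neg = sum(tok in negative for tok in tokens)
    return pos, neg


def aggregate_polarity(tokens, scores=SCORES):
    """Sum the positive and the negative word scores separately."""
    pos = sum(scores[t] for t in tokens if scores.get(t, 0.0) > 0)
    neg = sum(scores[t] for t in tokens if scores.get(t, 0.0) < 0)
    return pos, neg


tokens = "i hate this terrible but great day".split()
print(polar_word_count(tokens))    # prints: (1, 2)
print(aggregate_polarity(tokens))  # prints: (3.0, -6.0)
```

Computing these features per lexicon yields one pair of values for each resource, and these pairs form part of the feature pool handed to PSO.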

Regression Model
An overall schema of the proposed system is depicted in Figure 1. Our proposed regression model consists of an LSTM network and Support Vector Regression (SVR). First, an LSTM network is trained using word vectors as input, with sigmoid activation at the output. Upon completion of training, the output of the topmost hidden layer is used as the sentence embedding. The trained sentence embeddings represent the relevant semantic and syntactic features of the tweets. Next, the optimized feature set obtained by PSO is concatenated with the sentence embeddings for training an SVR model. The idea of cascading SVR with LSTM was motivated by the recent works of Akhtar et al. (2016) and Wang et al. (2016).
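The concatenation step of the cascade can be illustrated with the following sketch, which applies a PSO feature mask to a tweet's handcrafted features and appends the surviving values to its sentence embedding. The function names, the toy dimensions and the order of concatenation are assumptions; the paper only states that the two parts are concatenated.

```python
def select_features(features, mask):
    """Keep only the feature values whose mask entry is 1."""
    if len(features) != len(mask):
        raise ValueError("features and mask must have the same length")
    return [value for value, keep in zip(features, mask) if keep]


def build_svr_input(sentence_embedding, features, mask):
    """Concatenate the sentence embedding with the PSO-selected features."""
    return list(sentence_embedding) + select_features(features, mask)


# Toy example: a 4-dimensional embedding (50-dimensional in the paper's top
# dense layer) and five handcrafted features, of which PSO kept three.
embedding = [0.12, -0.40, 0.33, 0.05]
handcrafted = [1.0, 2.0, 3.0, -6.0, 0.7]
mask = [1, 0, 1, 1, 0]
row = build_svr_input(embedding, handcrafted, mask)
print(row)  # prints: [0.12, -0.4, 0.33, 0.05, 1.0, 3.0, -6.0]
```

One such row per tweet, paired with its gold intensity, would form the training data of the regressor. An off-the-shelf implementation such as scikit-learn's `sklearn.svm.SVR` could be fitted on these rows, although the paper does not name the SVR library it uses.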

Experimental Results
We use Keras, a Python-based neural network library, for the implementation. For tokenization of tweets, we utilize the CMU ARK tool. The official evaluation metric was the Pearson correlation coefficient. We use tanh as the activation function at the intermediate layers and sigmoid at the output layer. We employ the Adam (Kingma and Ba, 2014) optimizer and set Dropout (Srivastava et al., 2014) to 40%. We train our network for 50 epochs. Table 1 depicts the evaluation results on the development and test sets. We first train a BiLSTM network utilizing GloVe Common Crawl embeddings. The resultant network produces an average Pearson score of merely 0.1877. We observe that a good percentage of tokens (mostly noisy) were missing from the embeddings, which poses a challenge to the network during the learning phase. Subsequently, we try to minimize the effect of noisy tokens by utilizing GloVe Twitter embeddings. Although the network obtains an improved average Pearson score of 0.1921, the improvement is not significant. On analysis we find similar issues with the Twitter embeddings. To address the problem of data sparsity we employ a series of heuristics (cf. Section 2.1) to normalize the text. Consequently, we obtain an average Pearson score of 0.6289 with normalization, outperforming the baseline system (0.610) provided by the organizers of the shared task.
We then cascade the LSTM network with SVR for the final predictions (LSTM+SVR). On cascading we obtain an average Pearson score of 0.6641, a gain of 0.04 points. Finally, to further improve the prediction accuracy we introduce various handcrafted lexicon features (cf. Section 2.3.1) into the architecture (LSTM+SVR+Feat). Although we see an improvement of 0.01 points in average Pearson score, introducing the same set of lexicon features has contrasting effects on the different emotion datasets, i.e. anger, fear, joy and sadness. We observe an improvement for joy and sadness, whereas for anger the same set of features degrades the system performance. For fear, adding the features to LSTM+SVR has almost no effect. Motivated by these results, we apply the PSO based feature selection algorithm in order to find the optimal set of features for each emotion. We get the best average Pearson score of 0.7271 on the development set by utilizing sentence embeddings, the optimized feature set and SVR (LSTM+SVR+PSO). We also observe an improvement in Pearson score for each of the emotion datasets, ranging from 0.5-0.7 points over LSTM+SVR. It is evident from the obtained results that normalization of tweets is a major factor in obtaining good performance. In addition, introducing PSO based feature selection into the LSTM+SVR hybrid model further improves the performance of the system.
On final evaluation, i.e. on the test set, our proposed system (LSTM+SVR+PSO) obtains an average Pearson score of 0.682. In comparison, the baseline system produces an average Pearson score of 0.6470, a difference of about 4 percentage points. For anger and fear we observe a small performance drop on the test set as compared to the development set, while our proposed system performs better in the case of sadness. Further, we observe that our system does not perform on par for joy (a drop of nearly 17%) as compared to the development set. However, a similar phenomenon was observed for the baseline system as well, i.e. a drop of 6% in joy. We also observe that the improvement of our proposed system over the baseline system is statistically significant, with p-value = 0.03683.
The Bing Liu (Ding et al., 2008) and AFINN (Nielsen, 2011) lexicons have been employed only by fear and joy, respectively.
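For reference, the Pearson correlation coefficient used as the official metric, and its average over the four emotion datasets, can be computed as in the sketch below. The function names and the toy intensity values are illustrative only and are not taken from the shared-task data or scorer.

```python
import math


def pearson(gold, predicted):
    """Pearson correlation between gold and predicted intensity scores."""
    n = len(gold)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mean_g = sum(gold) / n
    mean_p = sum(predicted) / n
    cov = sum((g - mean_g) * (p - mean_p) for g, p in zip(gold, predicted))
    std_g = math.sqrt(sum((g - mean_g) ** 2 for g in gold))
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    if std_g == 0 or std_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (std_g * std_p)


def average_pearson(per_emotion):
    """per_emotion maps an emotion name to a (gold, predicted) pair of lists."""
    scores = {name: pearson(g, p) for name, (g, p) in per_emotion.items()}
    return sum(scores.values()) / len(scores), scores


# Toy values for two emotions; the task averages over all four datasets.
toy = {
    "anger": ([0.1, 0.5, 0.9], [0.2, 0.4, 0.8]),
    "joy": ([0.3, 0.6, 0.7], [0.7, 0.5, 0.2]),
}
average, per_emotion_scores = average_pearson(toy)
print(average, per_emotion_scores)
```

Because Pearson correlation is invariant to shifting and positive scaling of the predictions, it rewards systems that rank and space tweets correctly even when their absolute intensity values are offset.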

Error Analysis
We also perform error analysis on the obtained results. The following are a few cases where our system consistently struggles to predict the intensity values.
• The presence of high intensity emotion words (such as anger, revenge, fury, exciting etc.) makes it non-trivial for the system to correctly predict the intensity values.
Example 1: Tweet: #Forgiveness might make us look #weak, but the weakest person is the one who holds #anger, #hatred, and #revenge. Actual: 0.

Conclusion
In this paper, we have presented a hybrid LSTM-SVR architecture for predicting the intensity level with respect to an emotion. We first applied various heuristics for normalizing the tweets. This step addresses the noisiness of tweets to a great extent and consequently improves the performance of the system. The proposed approach further utilizes a relevant set of hand-crafted features obtained through a PSO based feature selection technique. Adding the optimized features to the proposed architecture (LSTM+SVR+PSO) yields a significant improvement over the system without them (LSTM+SVR), and this was observed for all four emotion datasets, i.e. anger, fear, joy and sadness.