CrystalFeel at SemEval-2018 Task 1: Understanding and Detecting Emotion Intensity using Affective Lexicons

While sentiment and emotion analysis have received a considerable amount of research attention, the notion of understanding and detecting the intensity of emotions is relatively less explored. This paper describes a system developed for predicting emotion intensity in tweets. Given a Twitter message, CrystalFeel uses features derived from parts-of-speech, n-grams, word embedding, and multiple affective lexicons including Opinion Lexicon, SentiStrength, AFINN, NRC Emotion & Hash Emotion, and our in-house developed EI Lexicons to predict the degree of intensity associated with fear, anger, sadness, and joy in the tweet. We found that including the affective lexicon-based features allowed the system to obtain strong prediction performance, while revealing interesting emotion word-level and message-level associations. On gold test data, CrystalFeel obtained Pearson correlations of 0.717 on average emotion intensity and of 0.816 on sentiment intensity.


Introduction
While humans experience emotions every day, the degree of one's emotions varies from one experience to another. To date, the vast majority of NLP and computational linguistics research deals with ground truth data constructed through the assignment of discrete labels to text messages by annotators. Conventionally, sentiment analysis seeks to determine the valence (positive, negative or neutral) of the feelings and opinions that annotators can recognize in a text message (Hu and Liu, 2004; Pang and Lee, 2008; Socher et al., 2013). Emotion classification, a closely related task, typically seeks to predict the presence or absence of an emotion, i.e., if there is joy or no joy, anger or no anger, fear or no fear, in a particular message (Alm et al., 2005; Aman and Szpakowicz, 2007; Wen and Wan, 2014). The detection of emotion intensity along a continuous scale is a relatively less explored task.
* Both authors contributed to this research equally. Correspondence should be sent to yangyp@ihpc.a-star.edu.sg.
One of the key reasons for the lack of work on detecting emotion intensity is plausibly the difficulty of measuring the very concept of emotion intensity. As highlighted in prior research, the question "how intense was your emotional experience on a scale of 1 to 10?" cannot generate reliable responses even for the same emotion type (Frijda et al., 1992). For example, asking people to respond to "how intense was your fear towards getting rejected" and "how intense was your fear towards receiving a medical test result" would lead to inconsistent answers from the same annotators at different times, as well as across different annotators. Without a clear reference point, it is nearly impossible to construct ground truth datasets with adequate reliability.
To address the measurement issue, Mohammad and Bravo-Marquez (2017) used a best-worst scaling (BWS) method to create a tweet emotion intensity dataset. Annotators were asked to rank the best and worst examples of the intensity of emotions among n text examples (called an n-tuple, where n > 2 and typically n = 4). This reduces the reference point ambiguity that annotators face regarding which baseline to use when rating a text along a single scale. Once a target tweet had been annotated with 24 ranking judgements, the emotion intensity score for the tweet was computed as a real-valued score in the range of 0 to 1 (a linear transformation of the difference between the percentage of times the tweet was ranked the highest and the percentage of times it was ranked the lowest among all ranking judgements). In total, the dataset consists of 7,097 annotated tweets (Mohammad et al., 2018). Table 1 provides a few examples from the dataset.
The ability to detect the degree or intensity of emotions is beneficial to many AI applications. For example, a virtual service assistant would be able to employ a more appropriate response strategy when high-intensity anger or frustration is sensed from its customer, rather than responding in the same way as in normal dialogues. Customer relationship management systems can be more targeted by engaging customers who express high degrees of joy or excitement with their products and services. Homecare robots, empowered with the ability to recognize high-intensity grief or distress, would be less likely to miss the opportunity to alert professional human caregivers.
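As a rough illustration of the best-worst scaling score described earlier in this section, the sketch below maps ranking counts to a score between 0 and 1. The function name `bws_intensity` and the specific rescaling `(raw + 1) / 2` are our assumptions for illustration; the source states only that a linear transformation of the best-minus-worst percentage difference is used.

```python
def bws_intensity(times_best: int, times_worst: int, judgements: int) -> float:
    """Map best-worst ranking counts for one tweet to a score in [0, 1].

    Hypothetical helper: the exact linear transformation is assumed here.
    """
    if judgements <= 0:
        raise ValueError("judgements must be positive")
    if times_best < 0 or times_worst < 0:
        raise ValueError("counts must be non-negative")
    if times_best + times_worst > judgements:
        raise ValueError("best + worst counts cannot exceed total judgements")
    # Difference between the proportion of times ranked highest and the
    # proportion of times ranked lowest; this lies in [-1, 1].
    raw = (times_best - times_worst) / judgements
    # Assumed linear rescaling from [-1, 1] to [0, 1].
    return (raw + 1.0) / 2.0


if __name__ == "__main__":
    # A tweet ranked highest 12 times and lowest 3 times in 24 judgements.
    print(bws_intensity(12, 3, 24))
```

Under this assumed rescaling, a tweet that is never picked as best or worst receives the midpoint score of 0.5.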
In this paper, we discuss our approach to this emotion intensity detection task, with a focus on the use of and experiments with affective lexicons. In the following sections, we introduce our in-house developed Emotion Intensity lexicons, and compare the performance of feature sets derived from various affective lexicons as well as parts-of-speech, n-grams and word embedding with an SVM-based classifier.

Emotion Intensity and Affective Lexicons
In its simplest form, emotion intensity refers to the degree or amount of an emotion (Mohammad and Bravo-Marquez, 2017). A basic feature of emotion intensity would be the use of quantifier words. For example, one may indicate that he or she is a bit annoyed, very pissed off, or extremely angry. On the other hand, one may also say that he or she feels angry, livid, or furious. Even without quantifier words, emotion words themselves are salient features indicating the intensity of emotions.
In 2016, we started in-house efforts to develop a multidimensional affective lexicon that computationally captures and distinguishes different psychological and linguistic meanings associated with each emotion-related word. Our initial version of the Emotion Intensity (EI) Lexicon is a collection of 3,204 emotion-related English words, common emoticons and Internet slang labelled in strength and intensity dimensions (as used in Gupta and Yang 2017). In the beginning, the rationale underlying our work centered on the fact that human emotions can be characterized using two fundamental dimensions: the dimension of evaluation strength, in that an expression would have different levels of pleasantness or unpleasantness (Osgood et al., 1957), and the dimension of intensity (Shaver et al., 1987), which concerns what Osgood et al. (1957) originally called motivational "potency" and physical "activity". By developing a lexicon that distinguishes strength and intensity, anger-based expressions (high in potency), for example, can be differentiated from equally unpleasant, sadness-based expressions (low in potency).
In Gupta and Yang (2017), we explored the use of the Emotion Intensity (EI) Lexicon and found it helpful in enhancing sarcasm detection and sentiment analysis. Encouraged by its effectiveness, we continue to develop and use the lexicon by adding more psychologically meaningful affective dimensions. We consider three more dimensions: the "basic" emotion categories (Shaver et al., 1987; Ekman, 1973) including fear, anger, sadness, joy, love and surprise; fine-grained emotion categories (as summarized in Robinson 2009) including finer emotions such as joy-contentment, joy-cheerfulness, joy-excitement; and psychological conditions including affective condition, cognitive condition, physical & bodily state and external condition. In addition, we also add a polarity-level dimension to reflect whether a word is more uni-polarized (e.g., "angry" and "careless" are definitely negative) or more bi-polarized (e.g., "surprised" and "sympathetic" may imply both positive and negative feelings). These new considerations contribute to forming the Enhanced Emotion Intensity (E2I) Lexicon. The following table (Table 2) presents the properties of our lexicon in the context of five affective lexicons which were shown to be useful in prior sentiment and emotion analysis research.

CrystalFeel System
Focusing on feature design and experiments, we employed SVM as the main classifier for the CrystalFeel system. In terms of features, we considered two broad categories: affective lexicon-based features, and non-affective lexicon-based features.

Affective Lexicons-Based Feature Sets
Following the discussion in Section 2, seven sets of affective lexicon-based features were extracted for the experiments:
• OL (6 features [...]
• [...] & −ive words (1), start positions of first occurrence of +ive & −ive words (2), counts of words holding three strengths of 1 to 3 (3), counts of words holding three intensities of 1 to 3 (3), counts of words belonging to 6 emotions and unspecific words (7), counts of words belonging to 31 emotions and unspecific words (32), counts of words belonging to 4 psychological conditions (4), counts of words belonging to 4 polarity conditions (4), pairwise intersection features across all the dimensions (56)
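To make the kind of lexicon-derived feature listed above more concrete, the following sketch computes counts of positive and negative words and the start position of their first occurrence in a tweet. It is a minimal illustration only: the word sets `POSITIVE` and `NEGATIVE`, the regex tokenizer, and the use of -1 for "not found" are assumptions of ours, not the contents or conventions of any lexicon or system named in this paper.

```python
import re

# Toy word lists for illustration only (hypothetical, not a real lexicon).
POSITIVE = {"happy", "great", "love"}
NEGATIVE = {"angry", "sad", "awful"}


def tokenize(tweet: str) -> list:
    """Lowercase the tweet and split it into simple word tokens."""
    return re.findall(r"[a-z']+", tweet.lower())


def lexicon_features(tweet: str) -> dict:
    """Return count and first-position features for +ive and -ive words."""
    tokens = tokenize(tweet)
    pos_idx = [i for i, tok in enumerate(tokens) if tok in POSITIVE]
    neg_idx = [i for i, tok in enumerate(tokens) if tok in NEGATIVE]
    return {
        "pos_count": len(pos_idx),
        "neg_count": len(neg_idx),
        # Token index of the first match; -1 (an assumed convention) if none.
        "first_pos": pos_idx[0] if pos_idx else -1,
        "first_neg": neg_idx[0] if neg_idx else -1,
    }


if __name__ == "__main__":
    print(lexicon_features("So angry and sad today, but I love my dog"))
```

Counts for other dimensions (for example, emotion category or psychological condition) could follow the same pattern with one word set per category.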

N-grams, POS, Word Counts, and Word Embedding Feature Sets
In addition to affective lexicons, we extracted 25-dimensional Tweet part-of-speech (POS) features (Owoputi et al., 2013) for each tweet. Furthermore, as emotion intensity is likely to be associated with the total tweet length and use of capital letters, we added the word counts (WC) feature set, which includes the count of total words and the count of uppercase letters. We extracted n-grams in the same way as in our earlier work (Gupta and Yang, 2017). Lastly, we used FastText (Joulin et al., 2016) to convert the tweets into 100-dimensional feature vectors. To train the FastText model, we downloaded close to 8 million tweets using the Twitter Streaming API. In summary:
• POS (25 features [...]
[...] train and test the performance of various individual and combined feature conditions. The results are presented in Table 3. Among individual lexicon-based feature sets, features derived from E2I alone led to the highest macro-averaged Pearson correlation (r = 0.456). (Note that r ranges from −1 to 1, where −1 means perfect reverse correlation and 1 means perfect correlation; a random algorithm gives close to 0.) The performances of NRC-Hash-Emo and OL came close, in second (r = 0.445) and third (r = 0.428) place. Interestingly, on specific emotions, E2I's result on the prediction of sadness (r = 0.531) is considerably higher than the second highest prediction of sadness, from AFINN-derived features (r = 0.440). NRC-Hash-Emo led to the highest results for anger (r = 0.453) and fear (r = 0.492), and OL features led to the highest value for joy (r = 0.538).
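For readers less familiar with the evaluation measure used above, the sketch below computes a Pearson correlation per emotion and then macro-averages the per-emotion values. The function names and the toy score lists are illustrative assumptions and do not reproduce any data or results from this paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def macro_average(per_emotion_r):
    """Unweighted mean of the per-emotion correlations."""
    return sum(per_emotion_r.values()) / len(per_emotion_r)


if __name__ == "__main__":
    # Toy gold and predicted intensities (made up for illustration).
    gold = {"anger": [0.1, 0.5, 0.9], "joy": [0.2, 0.4, 0.8]}
    pred = {"anger": [0.2, 0.4, 0.7], "joy": [0.3, 0.3, 0.9]}
    scores = {emo: pearson(pred[emo], gold[emo]) for emo in gold}
    print(scores, macro_average(scores))
```

Macro-averaging weights each emotion equally regardless of how many tweets it has.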
For the combined affective lexicon feature settings, we observed a tendency for each added feature set to bring an additional advantage on the macro-averaged scores (e.g., comparing the combined OL + SS features with OL or SS features alone), suggesting complementarity across these lexicons. Combining all the lexicons resulted in a large improvement (avg. r = 0.605).
Among the non-affective lexicon-based feature sets, word embedding features obtained the best result (r = 0.583). Except for predicting fear, for which n-grams performed better (r = 0.608), word embedding's advantage also held for predicting anger (r = 0.611), joy (r = 0.585) and sadness (r = 0.580).
Finally, we combined all the lexicon-based features (with small variations from the individual experiment conditions) and non-lexicon-based features. This "all-features" condition resulted in the highest performance for avg. Pearson correlation (r = 0.684) and individual correlations for all four emotions. The all-features setting was used for the CrystalFeel system on the gold test data.

Word-level and Message-level Analysis
So to what extent are emotion words from affective lexicons indicative of emotion intensity in tweets at the message level? To explore this question, we ran correlation analysis by calculating bivariate Pearson correlation coefficients between each feature derived from the affective lexicons and the emotion intensity ground truth labels. Figure 1 shows the results.
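The correlation analysis described above can be sketched as follows: compute the Pearson correlation between each feature column and the gold intensity labels, then list the strongest associations. Ranking by absolute value, treating a constant feature as having zero correlation, and all names and toy values here are assumptions for illustration rather than details taken from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson r; returns 0.0 when either input is constant (an assumption)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_features(feature_columns, gold, top_k=10):
    """Return up to top_k (name, r) pairs sorted by |r|, strongest first."""
    scored = [(name, pearson(col, gold)) for name, col in feature_columns.items()]
    scored.sort(key=lambda pair: abs(pair[1]), reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy feature columns, one value per tweet (made up for illustration).
    features = {
        "neg_count": [0, 1, 2, 3],
        "pos_count": [3, 2, 1, 1],
        "length": [5, 5, 5, 5],
    }
    gold_intensity = [0.1, 0.3, 0.6, 0.9]
    for name, r in rank_features(features, gold_intensity):
        print(f"{name}: r = {r:.3f}")
```

Keeping the sign of r in the output preserves whether a feature is positively or negatively associated with intensity.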
The analysis indicated several interesting patterns related to the usefulness of lexicon dimensions. First, the sentiment/valence dimension of affective lexicons was generally useful, as the counts of +ive and −ive words (regardless of the source of the lexicons) showed up in the top ten features [...] For example, the count of fear-fright words from E2I (E2I fear-fright) is highly correlated with fear intensity (r = 0.374) and the count of sadness words as genuine emotions (E2I sadness affective) is highly correlated with sadness intensity (r = 0.421). Furthermore, the results revealed interesting word-level and message-level feature associations across the four emotions. While the top features for intensities of anger, fear and sadness in tweets (9 or 10 out of 10 top features) are positive associations with the presence and higher amount of negative or emotion-specific words, the top features for intensity of joy (7 out of 10 top features) are negative associations with the absence and lower amount of negative words. Future research should investigate these patterns further and cross-examine them in other datasets.

Results
We evaluated the CrystalFeel system using the gold test datasets provided by SemEval-2018 Task 1 (Mohammad et al., 2018). Although the main task of emotion intensity is our primary interest, we also participated in all the other subtasks. In all subtasks, the CrystalFeel system outperformed the baseline set by the task organizers. Tables 4-8 summarize the final results.

Conclusion
This paper describes the CrystalFeel system, which is capable of predicting the intensity of emotions associated with a Twitter message. The results of the feature experiments supported the usefulness of our in-house developed EI & E2I lexicons as new manually constructed lexicons with a relatively small number of lexicon items. In addition, the lexicons also helped us understand the different patterns of associations between emotion-specific words and emotion-specific intensities at the tweet/message level. Based on the current analysis, our approach appears to have a particular advantage in understanding and predicting sadness-specific intensity in tweets. We focused on SVM as our machine learning classifier in the present study, and we plan to investigate the use of deep learning methods in future work.