ECNU at SemEval-2018 Task 1: Emotion Intensity Prediction Using Effective Features and Machine Learning Models

This paper describes our submissions to SemEval-2018 Task 1. The task is affect intensity prediction in tweets and comprises five subtasks. We participated in all subtasks for English tweets. We extracted several traditional NLP, sentiment lexicon, emotion lexicon and domain-specific features from tweets, and adopted supervised machine learning algorithms to perform emotion intensity prediction.


Introduction
SemEval-2018 Task 1 aims to automatically determine the intensity of emotions felt by tweeters from their tweets, and comprises five subtasks. Given a tweet and one of four emotions (anger, fear, joy, sadness), subtask 1 is to determine the intensity of that emotion and subtask 2 is to classify the tweet into one of four ordinal classes of intensity of that emotion. Similarly, subtask 3 is to determine the intensity of valence and subtask 4 is to classify the tweet into one of seven ordinal classes of valence intensity. Subtask 5 is a multi-label emotion classification task which classifies a tweet as neutral (no emotion) or as one or more of eleven given emotions (anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise, trust) that best represent the mental state of the tweeter. For each subtask, training and test datasets are provided for English, Arabic, and Spanish tweets. We participated in all subtasks for English tweets.
Traditional sentiment classification is a coarse-grained task in sentiment analysis which focuses on classifying the sentiment polarity of a whole sentence (i.e., positive, negative, neutral, mixed).
The five subtasks differ only in emotion granularity and in whether the output is a class (classification) or a real-valued score (quantification), so we adopt a similar method for all of them. We extracted a rich set of elaborately designed features. In addition to linguistic features, sentiment lexicon features and emotion lexicon features, we also extracted some domain-specific features. We also conducted a series of experiments with different machine learning algorithms and ensemble methods to find the best-performing configuration for each subtask. For subtask 5, we treated the problem as multiple binary classifications and constructed one model for each emotion.
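The per-emotion decomposition used for subtask 5 can be sketched as follows. This is a minimal illustration, assuming each tweet's gold annotation is available as a set of emotion names (the empty set meaning neutral); the helper names `to_binary_targets` and `merge_binary_predictions` are hypothetical and are not taken from the submitted system.

```python
EMOTIONS = ["anger", "anticipation", "disgust", "fear", "joy", "love",
            "optimism", "pessimism", "sadness", "surprise", "trust"]

def to_binary_targets(label_sets):
    """Turn a list of label sets into one 0/1 target vector per emotion."""
    return {e: [1 if e in labels else 0 for labels in label_sets]
            for e in EMOTIONS}

def merge_binary_predictions(per_emotion_preds):
    """Recombine per-emotion 0/1 predictions into one label set per tweet."""
    n_tweets = len(next(iter(per_emotion_preds.values())))
    return [{e for e in EMOTIONS if per_emotion_preds[e][i] == 1}
            for i in range(n_tweets)]

# Toy annotations (made up for illustration); set() stands for a neutral tweet.
gold = [{"joy", "love"}, set(), {"anger"}]
targets = to_binary_targets(gold)            # one binary problem per emotion
roundtrip = merge_binary_predictions(targets)
```

Each entry of `targets` would serve as the training target of one binary classifier, and the predictions of the eleven classifiers would be merged back into a label set per tweet.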

System Description
We first performed data preprocessing, then extracted several types of features from tweets and constructed supervised models for this task.

Data Preprocessing
First, all words are converted to lower case, URLs are replaced by "url", and abbreviations, slang and elongated words are transformed to their normal forms. Then, emojis are replaced by their corresponding names using the "Emoji Library" 1 . Finally, we use the Stanford CoreNLP tools (Manning et al., 2014) for tokenization, POS tagging, named entity recognition (NER) and parsing.
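The first preprocessing step can be sketched as below. This is a simplified illustration only: the regular expressions and the rule of collapsing three or more repeated letters to two are assumptions, and abbreviation or slang expansion, emoji replacement and the CoreNLP pipeline are not reproduced here.

```python
import re

# Assumed patterns; the exact rules used in the system are not specified.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
ELONGATED_RE = re.compile(r"([a-z])\1{2,}")  # a letter repeated 3+ times

def preprocess(tweet):
    """Lower-case, replace URLs by "url", and shorten elongated words."""
    text = tweet.lower()
    text = URL_RE.sub("url", text)
    # Heuristic: collapse runs of 3+ identical letters to exactly two,
    # e.g. "sooooo" -> "soo". A dictionary lookup would be needed to
    # recover the true normal form ("so").
    text = ELONGATED_RE.sub(r"\1\1", text)
    return text

example = preprocess("Sooooo HAPPY http://t.co/abc")
```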

Feature Engineering
We extracted four types of features to construct supervised models for the five subtasks: linguistic features, sentiment lexicon features, emotion lexicon features and domain-specific features.

Linguistic Features
• Lemma unigram Considering that "anger" and "angers" express similar emotion intensity, we use lemma unigram features from tweets rather than word unigram features.
• Negation Negation in a sentence often affects its sentiment orientation and conveys the intensity of the sentiment. For example, a sentence with several negation words is more inclined to negative sentiment polarity. Following previous work (Zhang et al., 2015), we manually collected 29 negations 2 and designed two binary features. One indicates whether there is any negation in the tweet and the other records whether the tweet contains more than one negation.
• NER Given the tweet "@JackHoward the Christmas episode genuinely had me in tears of laughter", useful information such as the person name and the festival may convey the tweeter's happiness. We therefore extracted 12 types of named entities (DURATION, SET, NUMBER, LOCATION, PERSON, ORGANIZATION, PERCENT, MISC, ORDINAL, TIME, DATE, MONEY) from the sentence and represented each type as a binary feature indicating whether it appears in the sentence.
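The two negation features described above can be sketched as follows. The word list here is a small hypothetical subset chosen for illustration; the 29 manually collected negations are not reproduced.

```python
# Hypothetical subset of negation cues (not the actual 29-item list).
NEGATIONS = {"not", "no", "never", "n't", "none", "nobody",
             "nothing", "neither", "nor", "cannot"}

def negation_features(tokens):
    """Return [has_any_negation, has_more_than_one_negation] as 0/1 values."""
    count = sum(1 for tok in tokens if tok in NEGATIONS)
    return [int(count >= 1), int(count > 1)]

single = negation_features(["i", "do", "n't", "like", "it"])
double = negation_features(["no", "i", "never", "said", "that"])
```

The NER features follow the same pattern: one 0/1 value per entity type, set when the tagger outputs at least one entity of that type.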

Sentiment Lexicon Features
Many tasks related to sentiment or emotion analysis depend upon affect, opinion, sentiment, sense and emotion lexicons. We therefore employ eight sentiment lexicons to capture the sentiment information of the given sentence: Bing Liu lexicon 3 , General Inquirer lexicon 4 , IMDB 5 , MPQA 6 , NRC Emotion Sentiment Lexicon 7 , AFINN 8 , NRC Hashtag Sentiment Lexicon 9 , and NRC Sentiment140 Lexicon 10 . The eight lexicons do not share a unified format. For example, the Bing Liu lexicon uses two values for each word to represent its sentiment, one for positive sentiment and the other for negative sentiment. To unify the format, we transformed the two scores into a one-dimensional value by subtracting the negative score from the positive score. Given a tweet, we calculated the following six scores for each lexicon:
-the ratio of positive words to all words.
-the ratio of negative words to all words.
-the maximum sentiment score.
-the minimum sentiment score.
-the sum of sentiment scores.
-the sentiment score of the last word in the tweet.
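The six scores listed above can be computed per lexicon as in the following sketch. It assumes that each lexicon has already been unified into a mapping from word to a single score (positive minus negative) and that out-of-lexicon words receive a score of 0; both the function name and the toy entries are hypothetical.

```python
def lexicon_features(tokens, lexicon):
    """Six sentiment scores of a tokenized tweet for one unified lexicon."""
    if not tokens:
        return [0.0] * 6
    # Assumption: words missing from the lexicon count as neutral (0.0).
    scores = [lexicon.get(tok, 0.0) for tok in tokens]
    n = len(tokens)
    return [
        sum(1 for s in scores if s > 0) / n,  # ratio of positive words
        sum(1 for s in scores if s < 0) / n,  # ratio of negative words
        max(scores),                          # maximum sentiment score
        min(scores),                          # minimum sentiment score
        sum(scores),                          # sum of sentiment scores
        scores[-1],                           # score of the last word
    ]

# Unifying a two-valued lexicon: positive score minus negative score.
raw = {"good": (1.0, 0.0), "bad": (0.0, 1.0), "great": (2.0, 0.0)}
unified = {word: pos - neg for word, (pos, neg) in raw.items()}

feats = lexicon_features(["good", "bad", "movie", "great"], unified)
```

Running this for each of the eight lexicons and concatenating the results would give 48 sentiment lexicon features per tweet.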

Emotion Lexicon Features
Since subtasks 1, 2 and 5 are related to emotion intensity prediction and subtasks 3 and 4 are related to valence intensity prediction, we adopted three emotion lexicons and one valence lexicon:
NRC Hashtag Sentiment Lexicon (Mohammad and Kiritchenko, 2015), NRC Affect Intensity Lexicon (Mohammad, 2017), NRC Word-Emotion Association Lexicon (Bravo-Marquez et al., 2017) and ANEW-1999 Lexicon (Bradley and Lang, 1999). Given a tweet, we calculate three scores for each lexicon to construct the emotion lexicon features: the maximum score, the sum of scores, and the number of words in the tweet that exist in the lexicon.
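A minimal sketch of the three per-lexicon scores is given below. It assumes a lexicon is a mapping from word to a score for one emotion (or for valence), and that a tweet with no lexicon words receives zeros; the toy entries are made up for illustration and are not taken from any of the lexicons above.

```python
def emotion_lexicon_features(tokens, lexicon):
    """Return [max score, sum of scores, number of matched words]."""
    scores = [lexicon[tok] for tok in tokens if tok in lexicon]
    if not scores:
        # Assumption: no matched word yields all-zero features.
        return [0.0, 0.0, 0]
    return [max(scores), sum(scores), len(scores)]

# Hypothetical anger-intensity entries.
anger_lexicon = {"furious": 0.75, "annoyed": 0.5}
feats = emotion_lexicon_features(
    ["so", "furious", "and", "annoyed"], anger_lexicon)
```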

Domain-specific Features
• Punctuation People often use exclamation marks (!) and question marks (?) to express surprise or emphasis. Therefore, we extract the following 6 features:
-whether the tweet contains an exclamation mark.
-whether the tweet contains more than one exclamation mark.
-whether the tweet contains a question mark.
-whether the tweet contains more than one question mark.
-whether the tweet contains both exclamation marks and question marks.
-whether the last token of the tweet is an exclamation or question mark.
• Bag-of-Hashtags Hashtags reflect emotion orientation of tweets directly, so we constructed a vocabulary of hashtags appearing in the training set and development set, then adopted the bag-of-hashtags method for each tweet.
• Emoticon We collected 67 emoticons from the Internet 11 , including 34 positive emoticons and 33 negative emoticons, then designed the following 4 binary features (1 for yes, 0 for no):
-whether a positive emoticon is present in the tweet, and whether a negative emoticon is present in the tweet (two features).
-whether the last token is a positive emoticon, and whether the last token is a negative emoticon (two features).
• Intensity Words Some words appear more frequently in tweets with higher intensity, and some words have higher scores in the emotion lexicons; such words may carry information that expresses strong emotion intensity. We therefore extracted these words in two ways:
-Pick words whose score in the emotion lexicons is greater than a threshold.
-Calculate the probability of each word appearing at each intensity level for subtasks 2 and 4, then pick words whose probability is greater than a threshold (i.e., 0.5).
Finally, for each word in the intensity words list, we use a binary feature to indicate whether it appears in the given tweet.
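The punctuation features and the probability-based selection of intensity words can be sketched as follows. The selection rule is one possible reading of the description above: the probability is estimated as the fraction of tweets containing a word that belong to a chosen intensity class, which is an assumption, as are all function names.

```python
from collections import Counter

def punctuation_features(tokens):
    """Six 0/1 features on exclamation and question marks."""
    text = "".join(tokens)
    n_excl, n_ques = text.count("!"), text.count("?")
    last = tokens[-1] if tokens else ""
    return [
        int(n_excl >= 1), int(n_excl > 1),
        int(n_ques >= 1), int(n_ques > 1),
        int(n_excl >= 1 and n_ques >= 1),
        int(last in ("!", "?")),
    ]

def select_intensity_words(tokenized_tweets, classes, target_class,
                           threshold=0.5):
    """Keep words w with estimated P(target_class | w) > threshold."""
    total, in_target = Counter(), Counter()
    for tokens, cls in zip(tokenized_tweets, classes):
        for word in set(tokens):  # count each word once per tweet
            total[word] += 1
            if cls == target_class:
                in_target[word] += 1
    return sorted(w for w in total if in_target[w] / total[w] > threshold)

def intensity_word_features(tokens, word_list):
    """One 0/1 feature per intensity word."""
    present = set(tokens)
    return [int(w in present) for w in word_list]

# Toy data (made up): class 3 stands for the highest intensity class.
tweets = [["furious", "today"], ["furious", "rage"], ["calm", "today"]]
labels = [3, 3, 0]
words = select_intensity_words(tweets, labels, target_class=3)
punct = punctuation_features(["what", "?", "!", "!"])
```

In practice very rare words would pass the threshold trivially, so a minimum frequency cut-off is likely to be useful; none is applied in this sketch.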

Dataset
The statistics of the English datasets provided by SemEval-2018 Task 1 are shown in Table 1.

Evaluation Metric
To evaluate the performance of different systems, the official evaluation measure for the first four subtasks is the Pearson correlation coefficient with the gold ratings/labels. The correlation scores across all four emotions are averaged (macro-average) to determine the final system performance.
For the last subtask, systems are evaluated by multi-label accuracy, namely the Jaccard index, defined as follows:
$$\text{Accuracy} = \frac{1}{|T|} \sum_{t \in T} \frac{|G_t \cap P_t|}{|G_t \cup P_t|}$$
where G_t is the set of gold labels for tweet t, P_t is the set of predicted labels for tweet t, and T is the set of tweets.
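Both measures can be sketched in a few lines of plain Python. This is an illustration rather than the official scorer; in particular, scoring a tweet as 1 when both the gold and the predicted label sets are empty is an assumption made here to avoid division by zero.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

def macro_average(scores_per_emotion):
    """Average the per-emotion correlations (anger, fear, joy, sadness)."""
    return sum(scores_per_emotion) / len(scores_per_emotion)

def jaccard_accuracy(gold_sets, pred_sets):
    """Multi-label accuracy: mean of |G_t & P_t| / |G_t | P_t| over tweets."""
    total = 0.0
    for gold, pred in zip(gold_sets, pred_sets):
        union = gold | pred
        # Assumption: two empty label sets count as a perfect match.
        total += 1.0 if not union else len(gold & pred) / len(union)
    return total / len(gold_sets)

acc = jaccard_accuracy([{"joy", "love"}, {"anger"}], [{"joy"}, {"anger"}])
r = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

In the toy call, the first tweet scores 1/2 and the second scores 1, so the accuracy is 0.75.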

Experiments on Training and Test Data
First, we performed a series of experiments to explore the effectiveness of each feature type. We compared different feature types on the development set using the Support Vector Regression algorithm for subtask 1. We find that:
(1) All feature types contribute to the performance of emotion intensity prediction, and their combination achieves the best performance.
(2) Linguistic features act as a baseline and perform poorly on their own for emotion intensity prediction. However, the system performance drops once we remove the linguistic features.
(3) Sentiment lexicon features make a considerable contribution to the performance, which indicates that sentiment lexicon features are beneficial not only in traditional sentiment polarity analysis tasks, but also in emotion intensity prediction tasks.
(4) In addition, the system performance drops by only 0.2% if we remove the intensity words features. This indicates that these intensity words contribute little to distinguishing emotion intensity, possibly because their function overlaps with that of the sentiment and emotion lexicon features.
We also explored the performance of different learning algorithms. Table 4 shows the results of different algorithms for subtask 1 based on all the features described above. From Table 4, we find that GBR outperforms the other single algorithms, and that the ensemble model is superior to the models using a single algorithm. The ensemble model is built from the four regression algorithms and averages their output scores.
Therefore, the system configurations for the test data are: all features for the five subtasks, the ensemble model for subtasks 1 and 3, and Logistic Regression for subtasks 2, 4 and 5. Based on these configurations, we train a separate model for each subtask and evaluate it on the test set of SemEval-2018 Task 1. Table 5 and Table 6 show the results with ranks on the test set for subtasks 1 to 5.
Compared with the top-ranked systems, there is much room for improvement in our work. First, the biggest issue is that we only used hand-crafted features and did not use deep learning methods. Second, we find that our system achieves better performance on the test set than on the development set; a possible reason is a difference in data distribution between the two sets.
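The averaging step of the ensemble can be sketched as below. The sketch assumes that each of the four regressors has already produced one score per tweet; the numbers are made up for illustration and the function name `average_ensemble` is hypothetical.

```python
def average_ensemble(predictions_per_model):
    """Average the scores of several models, tweet by tweet.

    predictions_per_model: list of equal-length lists, one list per model.
    """
    n_models = len(predictions_per_model)
    return [sum(scores) / n_models for scores in zip(*predictions_per_model)]

# Toy outputs of four regressors for two tweets (illustrative values only).
model_outputs = [
    [0.25, 0.75],
    [0.50, 0.50],
    [0.75, 0.25],
    [0.50, 0.50],
]
ensemble_scores = average_ensemble(model_outputs)
```

Unweighted averaging treats all four models equally; weighting each model by its development-set correlation would be a natural variant, but it is not what is described above.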

Conclusion
In this paper, we extracted several traditional NLP, sentiment lexicon, emotion lexicon and domain-specific features from tweets, and adopted supervised machine learning algorithms to perform emotion intensity prediction. The system performance ranks above average. In future work, we plan to use deep learning methods to model sentences with the aid of sentiment word vectors.