CPH: Sentiment analysis of Figurative Language on Twitter #easypeasy #not

This paper describes the details of our system submitted to the SemEval 2015 shared task on sentiment analysis of figurative language on Twitter. We tackle the problem as a regression task and combine several base systems using stacked generalization (Wolpert, 1992). An initial analysis revealed that the data is heavily biased, and a general sentiment analysis system (GSA) performs poorly on it. However, GSA proved helpful on the test data, which contains an estimated 25% non-figurative tweets. Our best system, a stacking system with backoff to GSA, ranked 4th on the final test data (Cosine 0.661, MSE 3.404). 1


Introduction
Sentiment analysis (SA) is the task of determining the sentiment of a given piece of text. The volume of user-generated content produced every day raises the importance of accurate automatic sentiment analysis, for applications ranging from reputation analysis (Amigó et al., 2013) to election result prediction (Tjong Kim Sang and Bos, 2012). However, figurative language is pervasive in user-generated content, and figures of speech like irony, sarcasm and metaphor pose considerable challenges for a sentiment analysis system usually trained on literal meanings. For instance, consider the following example:

@CIA We hear you're looking for sentiment analysis to detect sarcasm in Tweets. That'll be easy! #SLA2014 #irony

Irony or sarcasm does not always result in the exact opposite sentiment, and therefore the task is not as simple as inverting the scores from a general SA system. Only a few studies have attempted SA on figurative language so far (Reyes and Rosso, 2012; Reyes et al., 2013).

The prediction of a fine-grained sentiment score (between -5 and 5) for a tweet poses a series of challenges. First, accurate language technology on tweets is hard due to sample bias, i.e., collections of tweets are inherently biased towards the particular time (or way, cf. §2) they were collected (Eisenstein, 2013). Second, the notion of figurativeness (or its complementary notion of literality) does not have a strong definition, and neither do irony, sarcasm, or satire. As pointed out by Reyes and Rosso (2012), "there is not a clear distinction about the boundaries among these terms". Attaching a fine-grained score is therefore far from straightforward. In fact, the gold standard consists of the average score assigned by humans through crowdsourcing, reflecting an uncertainty in the ground truth.

1 After submission time we discovered a bug in ST2, which means that the results on the official website are those of the GSA and not of the stacking system with backoff.

Data Analysis
The goal of the initial data exploration was to investigate the amount of non-figurativeness in the train and trial data. Our analysis revealed that 99% of the training data could be classified using a simple heuristic: a regular expression decision list, hereafter called the Tweet Label System (TLS), which splits the training data into different key-phrase subgroups. The system searches a tweet for each expression and assigns a label in a cascade fashion following the order in Table 2, which lists the 14 possible label types (plus NONE) and their associated expressions, along with the support for each category in the training data. Table 1 shows that only a small fraction of the train and trial data could not be associated with a subgroup. In contrast, the final test data was estimated to have a very different distribution, with 25% of tweets presumably containing literal language use.
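A cascaded regular-expression decision list of this kind can be sketched as follows. This is a minimal illustration only: the labels and patterns below are hypothetical stand-ins, since the actual expressions and their order are those of Table 2, and the function name `tls_label` is our own.

```python
import re

# Hypothetical (label, pattern) pairs in priority order. The real TLS uses
# the 14 expressions of Table 2, which are not reproduced here.
DECISION_LIST = [
    ("sarcasm", re.compile(r"#sarcas(m|tic)\b", re.IGNORECASE)),
    ("irony", re.compile(r"#iron(y|ic)\b", re.IGNORECASE)),
    ("not", re.compile(r"#not\b", re.IGNORECASE)),
    ("literally", re.compile(r"\bliterally\b", re.IGNORECASE)),
]


def tls_label(tweet: str) -> str:
    """Return the label of the first expression that matches, else NONE."""
    for label, pattern in DECISION_LIST:
        if pattern.search(tweet):
            return label  # cascade: earlier entries take precedence
    return "NONE"
```

Because the list is evaluated in order, a tweet containing several key phrases receives the label of the highest-ranked one, and a tweet matching none of them falls into the NONE subgroup.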

Dataset            Train   Trial   Test
Instances          7988    920     4000
% Non-figurative   1%      7%      25%

Since there are obvious subgroups in the data, our hypothesis is that this fact can be used to construct a more informed baseline. In fact (§4.1), simply predicting the mean per subgroup improved considerably over the constant mean baseline (from 0.73 to 0.81 Cosine, compared to 0.59 for random). Figure 1 plots predicted scores (ridge model, §3.1) of three subgroups against the gold scores on the trial data. It can be seen that certain subgroups have similar behaviour: 'sarcasm' has a generally negative cloud, and the model performs well in predicting these values, while other groups such as 'SoTo-Speak' have more intra-group variance.
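The per-subgroup mean baseline and the two evaluation measures can be sketched in a few lines. The sketch assumes the usual vector definitions of cosine similarity and mean squared error over the gold and predicted score lists; the function names and the fallback to the overall mean for unseen labels are our own assumptions.

```python
import math
from collections import defaultdict


def cosine(gold, pred):
    """Cosine similarity between the gold and predicted score vectors."""
    dot = sum(g * p for g, p in zip(gold, pred))
    norm = math.sqrt(sum(g * g for g in gold)) * math.sqrt(sum(p * p for p in pred))
    return dot / norm if norm else 0.0


def mse(gold, pred):
    """Mean squared error between gold and predicted scores."""
    return sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold)


def fit_group_means(labels, scores):
    """Return the mean score per subgroup label and the overall mean."""
    sums, counts = defaultdict(float), defaultdict(int)
    for label, score in zip(labels, scores):
        sums[label] += score
        counts[label] += 1
    group_means = {label: sums[label] / counts[label] for label in sums}
    return group_means, sum(scores) / len(scores)


def predict_group_means(labels, group_means, overall_mean):
    """Predict each tweet's subgroup mean; unseen labels get the overall mean
    (an assumption made for this sketch)."""
    return [group_means.get(label, overall_mean) for label in labels]
```

With subgroup labels from the TLS, `fit_group_means` is run on the training data and `predict_group_means` on the trial data, after which `cosine` and `mse` give the baseline scores.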

The Effect of a General Sentiment System
The data for this task is very different from the data on which most lexicon-based or general sentiment analysis models fare best. In fact, running the general sentiment classifier (GSA) described in Elming et al. (2014) on the trial data showed that its predictions are slightly anti-correlated with the gold standard scores for the tweets in this task (cosine similarity of -0.08 and MSE of 18.62). We exploited these anti-correlated results as features for our stacking systems (cf. §3.2). Figure 2 shows the distributions of the gold scores and GSA predictions for the trial data. The gold distribution is skewed towards negative instances, while the GSA predicts more positive sentiment.

System Description
We approach the task (Ghosh et al., 2015) as a regression task (cf. §4.4), combining several systems using stacking (§3.2) and relying on features that use no POS tags, lemmas or explicit lexicons (cf. §3.3).

Single Systems
Ridge Regression (RR): A standard supervised ridge regression model with default parameters.

PCA GMM Ridge Regression (GMM): A ridge regression model trained on the output of unsupervised induced features, i.e., a Gaussian Mixture Model (GMM) trained on a PCA of word n-grams. PCA was used to reduce the dimensionality to 100. The GMM was fit under the assumption that the data was sampled from different distributions of figurative language, with k Gaussians assumed (here k = 12).

Embeddings with Bayesian Ridge (EMBD): A Bayesian ridge regression learner with default parameters, trained only on word embeddings. A corpus was built from the training data and an in-house tweet collection sampled with the expressions from the TLS. This resulted in a total of 3.7 million tweets and 67 million tokens. For details on how the word embeddings were built, see §3.3.

Ensembles
We developed two stacking systems (Wolpert, 1992): Stacking System 1 (ST1) and Stacking System 2, Stacking with Backoff (ST2). The base systems used in each are shown in Table 3, and the meta-learner for both stacking systems is linear regression. The choice of base systems is not the only difference between the two: ST2 also uses the TLS to identify the subgroup that each tweet belongs to, and for any tweet in the NONE subgroup it backs off to the predictions of the GSA. We built ST2 as a system that is not limited to sentiment analysis of figurative language, a small subsection of language, but is applicable in situations covering many types of tweets, including those in which literal language is used.
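The combination of a linear-regression meta-learner with a backoff rule can be illustrated with a small, dependency-free sketch. It is not our submitted implementation: the least-squares solver, the function names and the toy numbers in the usage note are assumptions made for illustration, and the meta-learner is assumed to be fit on base-system predictions for held-out tweets.

```python
def fit_linear(X, y):
    """Ordinary least squares with intercept, solved via the normal equations.
    Assumes the base-system predictions in X are not collinear."""
    rows = [[1.0] + list(x) for x in X]
    n = len(rows[0])
    # Augmented matrix [A^T A | A^T y]
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]  # [intercept, w_1, ..., w_k]


def predict_linear(weights, base_preds):
    """Meta-learner prediction from the base systems' predictions."""
    return weights[0] + sum(w * p for w, p in zip(weights[1:], base_preds))


def st2_predict(tls_label, base_preds, gsa_pred, weights):
    """Stacking with backoff: NONE tweets get the GSA prediction."""
    if tls_label == "NONE":
        return gsa_pred
    return predict_linear(weights, base_preds)
```

For example, with weights fit on two hypothetical base systems, `st2_predict("irony", [2.0, 2.0], 4.2, weights)` returns the meta-learner's output, while `st2_predict("NONE", [2.0, 2.0], 4.2, weights)` returns the GSA score 4.2 unchanged.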

Features
This section describes the features we used for the models in §3.1. Table 4 indicates the type of features used for the single models. Punctuation was kept as its own lexical item, and we found that removing stopwords and normalizing usernames to '@USER' increased performance; the preprocessing is therefore the same across the models. Features were set on the trial data.

1. Word N-Grams: Systems use different n-grams as features. RR uses counts of 1- and 5-word grams; GMM uses binary presence of 1-, 2- and 3-word grams.
2. Uppercase Words: The number of words in a tweet written entirely in uppercase letters.
3. Punctuation: Contiguous sequences of question marks, exclamation marks, and mixed question and exclamation marks.
4. TLS Label: The subgroup label from the TLS.
5. Word Embeddings: Parameters for the word embeddings: 100 dimensions, a minimum of 5 occurrences for a type to be included in the model, a 5-word context window and 10-example negative sampling. Each tweet was represented by 100 features, the average of the embeddings of all content words in the tweet.

Features/Systems   RR   GMM   EMBD
Word N-grams       X    X
Uppercase          X
Punctuations       X
TLS Label          X
Word Embeddings               X
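Several of these features are simple enough to sketch directly. The code below is an illustrative approximation, not our exact preprocessing: the tokenizer, the decision to ignore single-letter words for the uppercase count, and the treatment of a single '?' or '!' as a sequence are all assumptions, and stopword removal is left out.

```python
import re

USER_RE = re.compile(r"@\w+")
# Mentions, hashtags, words, and runs of '?'/'!' kept as their own tokens
TOKEN_RE = re.compile(r"@\w+|#\w+|\w+|[?!]+")
PUNCT_RE = re.compile(r"[?!]+")


def tokenize(tweet):
    """Normalize usernames to '@USER', then split into tokens."""
    return TOKEN_RE.findall(USER_RE.sub("@USER", tweet))


def count_uppercase_words(tokens):
    """Count all-uppercase words (single letters ignored: an assumption)."""
    return sum(1 for t in tokens if t.isalpha() and t.isupper() and len(t) > 1)


def punctuation_features(tokens):
    """Count runs of '?', runs of '!', and mixed '?'/'!' runs."""
    feats = {"question": 0, "exclamation": 0, "mixed": 0}
    for t in tokens:
        if not PUNCT_RE.fullmatch(t):
            continue
        if set(t) == {"?"}:
            feats["question"] += 1
        elif set(t) == {"!"}:
            feats["exclamation"] += 1
        else:
            feats["mixed"] += 1
    return feats


def average_embedding(tokens, embeddings, dim):
    """Average the vectors of all tokens found in the embedding table."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim  # no known token: zero vector (an assumption)
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
```

Here `embeddings` is a plain dictionary from token to vector, standing in for whatever embedding model is trained on the tweet corpus; with the parameters above, `dim` would be 100.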

Constant Baselines & Single Systems
We implemented the Mean, Mode, Median, Random and TLS (§2) baseline systems. TLS is the hardest baseline to beat, and RR is the only system that beats it.

Results Stacking Systems
The performance of the stacking systems on the trial data can be seen below in Table 6. ST2 did not perform well on the trial data although a reason for this

Final Results
Three models were submitted for final evaluation on the test data: RR, ST1, and ST2. For the final results, we scaled back values outside the range [-5, 5] to the nearest whole number in range. Tables 7 and 8 show the results for our systems on the final dataset, together with the performance of the overall winning system for the task (CLAC). Table 7 shows the overall cosine similarity and MSE for the systems on the test data, and Table 8 shows the breakdown of the cosine similarity for the systems on the different types of language. It is interesting to note that the performance of ST2 on the 'Other' type of language is identical to that of CLAC; this is also the best cosine similarity score for 'Other' out of all submissions.
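Scaling out-of-range predictions back to the nearest whole number in range amounts to clipping at the boundaries, since the nearest in-range integer to any value above 5 is 5 and to any value below -5 is -5. A minimal sketch, with a function name of our own choosing:

```python
def clip_score(score, low=-5.0, high=5.0):
    """Map predictions outside [low, high] to the nearest boundary;
    in-range predictions keep their decimal value."""
    return max(low, min(high, score))
```

Regression models can extrapolate beyond the annotated scale, so this step only affects the few predictions that overshoot it.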

The Case for Regression
Regression is less common in NLP than classification. For this data, however, regression is desirable because it incorporates the ordered relation between the labels instead of treating them as orthogonal. It also keeps the decimal precision in the target variable when training, which is relevant when the target variable is the average of several annotations. We ran classification experiments for this task but found that the performance of the best classification system (Cosine 0.82, MSE 2.51) is still far from that of the RR model (Cosine 0.88, MSE 1.60).

Conclusions
We tested three systems for their ability to analyse sentiment in figurative language from Twitter. Our experiments showed that a general SA system trained on literal Twitter language was slightly anti-correlated with the gold scores for figurative tweets. We found that our system's predictions fared well for certain figurative types, namely sarcasm and irony. Our system did not explicitly use a lexicon to define the sentiment of a tweet, but instead used machine learning and strictly corpus-based features (no POS or lemma) to place us 4th in the task. More effort may be needed to discriminate metaphorical from literal tweets in order to build a more robust system, although the sentiment of tweets is hard to judge even for humans. This can be seen from the data, where a number of tweets were repeated but did not always share the same gold score.