TakeLab at SemEval-2017 Task 5: Linear aggregation of word embeddings for fine-grained sentiment analysis of financial news

This paper describes our system for fine-grained sentiment scoring of news headlines submitted to SemEval-2017 Task 5, subtask 2. Our system uses a feature-light method that consists of a Support Vector Regression (SVR) with various kernels and word vectors as features. Our best-performing submission ranked 3rd on the task out of 29 teams and 4th out of 45 submissions, with a cosine score of 0.733.


Introduction
Sentiment analysis (Pang and Lee, 2008) is the task of predicting whether a text expresses a positive, negative, or neutral opinion, either in general or with respect to an entity of interest. Developing systems capable of performing highly accurate sentiment analysis has attracted considerable attention over the last two decades. The topic has been one of the main research areas in recent shared tasks, with the main focus on social media texts, which are of particular interest for social studies (O'Connor et al., 2010; Wang et al., 2012) and marketing analysis (He et al., 2013; Yu et al., 2013). At the same time, social media texts pose a big challenge for sentiment analysis due to their short, informal, and often ungrammatical format.
This work focuses on the second subtask of SemEval-2017 Task 5, which aims to perform fine-grained sentiment analysis of financial news. Given that sentiments can affect market dynamics (Goonatilake and Herath, 2007; Van de Kauter et al., 2015), sentiment analysis of financial news can be a powerful tool for predicting market reactions. Similar to social media posts, financial news headlines are short texts, but, unlike social media posts, the text is edited and hence grammatically correct. On the other hand, news headlines are notorious for the use of a specific language (Reah, 2002), which is often elliptical and compressed, and thus differs from the language used in the rest of the news story.
Many approaches to sentiment analysis resort to rich, domain-specific, hand-crafted features (Wilson et al., 2009; Abbasi et al., 2008). At the same time, there has been a growing interest in feature-light methods, including kernel methods (Culotta and Sorensen, 2004; Lodhi et al., 2002a; Srivastava et al., 2013) and neural embeddings (Maas et al., 2011; Socher et al., 2013). These methods alleviate the need for manual creation of domain-specific features, while maintaining high accuracy. Most of the recently published work focuses on sentiment analysis problems that are framed as a classification task, while fine-grained analysis is framed as a regression problem. However, most of the high-performing classification methods can be easily tuned to perform regression.
In this work we focus on feature-light methods as they do not require complex, time-consuming feature engineering. More specifically, we focus on string kernels (Lodhi et al., 2002b) and methods using neural word embeddings (Mikolov et al., 2013a). Developing domain-specific, rich feature sets would probably make the method highly dependent on the specific problem and hardly applicable to similar problems in other domains. Feature-light methods have no such constraints: they typically offer satisfactory performance across different domains and may therefore be preferred to domain-specific methods that use hand-crafted features.

Related Work
There has been considerable research focusing on sentiment analysis of short texts (Thelwall et al., 2010; Kiritchenko et al., 2014), especially within recent SemEval campaigns (Nakov et al., 2016; Rosenthal et al., 2015, 2014). A large body of recent work focuses on sentence-level sentiment prediction. Socher et al. (2012) and Socher et al. (2013) reported impressive results working with matrix-vector recursive neural network (MV-RNN) and recursive neural tensor network models over sentence parse trees. Also working with sentence parse trees, Kim et al. (2015) and Srivastava et al. (2013) obtained competitive results using tree kernels as an alternative to recursive neural networks. These methods, while producing promising results, are highly dependent on parse trees. In practice, we often work with informal texts, where syntactic parsing often produces inaccurate results, which in turn heavily affects the performance of the aforementioned methods. Furthermore, as noted by Le and Mikolov (2014), it is not straightforward to extend these methods to text spans that range over multiple sentences.
There has been a growing amount of interest in methods that are not based on syntax. The most promising results have been achieved using neural word embeddings (Mikolov et al., 2013a), while string kernels (Zhang et al., 2008; Lodhi et al., 2002a; Leslie et al., 2002) offer a viable alternative. Maas et al. (2011) and Tang et al. (2014) reported promising results by learning sentiment-specific word embeddings. By extending word embeddings to more complex paragraph embeddings, Le and Mikolov (2014) reported state-of-the-art results on sentiment classification for both short and long English texts. Building on word embeddings, Joulin et al. (2016) developed an end-to-end, domain-independent, high-performance text classification model.

Dataset
Our task was, given a news headline, to predict the sentiment score for a specific company mentioned in the headline. The dataset consisted of the name of the company, the text of the news headline and a value denoting the sentiment.
The sentiment was on a scale between −1 and 1 (inclusive), where −1 corresponds to very negative sentiment, 0 is considered neutral, while 1 stands for a very positive sentiment. The news headlines were on average 10 words in length and largely composed of abbreviations.
The training set was composed of 1142 news headlines, while the test set contained 491 headlines, i.e., a 70:30 train-test split. The training set and the test set mention 294 and 168 unique companies, respectively. The distribution of headlines for a specific company was not uniform, and only 58 companies in the train set were targets of more than 4 news headlines, while "Barclays", the most frequently mentioned one, was the target 67 times. In total, 112 companies occur in both the train and test set.

id         5
company    Ryanair
title      EasyJet attracts more passengers in June but still lags Ryanair
sentiment  0.259

Table 1: Sample training data instance
An example of a training data instance is given in Table 1. This particular example also illustrates a possible difficulty regarding the headlines as they might refer to more than one company. Such examples, however, are pretty rare in the dataset.
As for the class breakdown in the training set, we observe that the number of positively labeled instances is considerably larger than the number of negatively labeled instances (653 headlines with positive sentiment versus 451 with negative sentiment, in addition to 38 headlines with a perfectly neutral score of 0.0). However, the distribution of the target variable has an almost zero mean value of 0.031 and a standard deviation of 0.39. All things considered, we conclude that the dataset was fairly well balanced and the dependent variable was not skewed towards either class.

Methods
While working on fine-grained sentiment analysis, we focus on feature-light, domain-independent methods. In all considered methods, we use a support vector regression (SVR) model for sentiment prediction. The SVR allows us to experiment with both different features and different kernels. Model training is performed using LIBSVM (Chang and Lin, 2011) for the non-linear kernels and LIBLINEAR (Fan et al., 2008) for the linear kernel.
BoW baseline. We use the standard bag-of-words (BoW) method as a sensible baseline. BoW methods are implemented by creating a dictionary of words appearing in the train set. We implemented the BoW baseline using all uni-, bi-, and trigrams that occur at least twice in the dataset, while filtering out words from the standard stopword list. We experiment with TF-IDF and Bernoulli weighting schemes for the word features. For generating the n-grams, we used the NLTK toolkit (Bird et al., 2009), and filtered out n-grams consisting of stopwords.
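As a minimal sketch of the Bernoulli-weighted n-gram baseline described above, the following plain-Python example builds a vocabulary of uni-, bi-, and trigrams that occur at least twice and maps each headline to a binary vector. The tiny stopword set, the whitespace tokenizer, the choice to drop only n-grams made up entirely of stopwords, and the toy headlines are all assumptions made for illustration; they stand in for the NLTK-based pipeline and are not taken from the actual system.

```python
from collections import Counter

# Illustrative stopword set (assumption); a standard list would be used in practice.
STOPWORDS = {"the", "a", "at", "in", "but", "of", "to"}

def ngrams(tokens, max_n=3):
    """Yield all uni-, bi-, and trigrams of a token list as tuples."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def extract(headline):
    """Lowercase, split on whitespace, drop n-grams consisting only of stopwords."""
    tokens = headline.lower().split()
    return [g for g in ngrams(tokens) if not all(w in STOPWORDS for w in g)]

def build_vocabulary(headlines, min_count=2):
    """Map every n-gram occurring at least `min_count` times to a feature index."""
    counts = Counter(g for h in headlines for g in extract(h))
    kept = sorted(g for g, c in counts.items() if c >= min_count)
    return {g: idx for idx, g in enumerate(kept)}

def bernoulli_vector(headline, vocab):
    """Binary features: 1 if the n-gram appears in the headline, 0 otherwise."""
    vec = [0] * len(vocab)
    for g in set(extract(headline)):
        if g in vocab:
            vec[vocab[g]] = 1
    return vec

# Toy headlines (invented for the example).
train = ["Profits rise at Barclays", "Barclays profits fall", "Shares rise in June"]
vocab = build_vocabulary(train)
features = [bernoulli_vector(h, vocab) for h in train]
```

In this toy run only the unigrams "barclays", "profits", and "rise" pass the frequency threshold; a TF-IDF variant would replace the 0/1 entries with weighted counts.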
String kernels. String kernels offer a dictionary-free alternative to other commonly used methods. There are several known string kernels in use, the most popular being the spectrum kernel (SK) (Leslie et al., 2002) and the subsequence kernel (SSK) (Lodhi et al., 2002a). The SSK measures string similarity by first mapping each input string s to:

$$\phi_u(s) = \sum_{\mathbf{i} : u = s[\mathbf{i}]} \lambda^{l(\mathbf{i})} \qquad (1)$$

where u is a subsequence searched for in s, i is a vector of indices at which u appears in s, l is a function measuring the length of a matched subsequence and λ ≤ 1 is a weighting parameter giving lower weights to longer subsequences. Using (1), the SSK kernel is defined as:

$$K_n(s, t) = \sum_{u \in \Sigma^n} \phi_u(s) \, \phi_u(t) \qquad (2)$$

where n is the maximum subsequence length for which we calculate the kernel and Σ^n is the set of all finite strings of length n. The spectrum kernel can be defined as a special case of the SSK where λ = 1 and i must yield contiguous sequences. We experiment with both SK and SSK kernels, which we computed using the string similarity tool Harry.
Word embeddings. Word embeddings are task-independent features, yet they offer competitive results on many text classification tasks. We experimented with pretrained word embeddings, namely GloVe (Pennington et al., 2014) and skip-gram (Mikolov et al., 2013b) trained on the Google News corpus. We achieved the best results with the 300-dimensional Google News vectors. The feature vector that is fed to the model is computed as the linear aggregate of the words making up the headline, simply as the average of the word embeddings of the individual words. Lowercasing the words that appear in the title gave us a considerable performance gain, which is expected since most of the words appearing in news headlines are title-cased. We refer to this method as the word embeddings method (WEM).
We further experimented with additional filtering of the word tokens we use for building word embedding vectors. Our motivation was based on the observation that named entities are typically not sentiment-bearing words. We therefore used the StanfordNLP named entity recognition (NER) tools to filter out all named entities before adding up the word embedding vectors. We refer to this method as the filtered word embeddings method (FWEM).
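To make the string kernels described above concrete, the following sketch computes the SSK by brute force: it enumerates every index vector of length n, weights the matched subsequence by λ raised to the length of the span it covers, and takes the inner product of the resulting feature maps. The spectrum kernel is obtained by counting contiguous n-grams only. This is an illustrative implementation operating on characters, not the one provided by Harry; the function names are hypothetical, and the exhaustive enumeration is only practical for short strings (efficient implementations use dynamic programming).

```python
from collections import Counter, defaultdict
from itertools import combinations

def ssk_features(s, n, lam):
    """Feature map: for each subsequence u of length n, sum lam ** span over all matches."""
    phi = defaultdict(float)
    for idx in combinations(range(len(s)), n):
        u = "".join(s[k] for k in idx)
        span = idx[-1] - idx[0] + 1  # length of the stretch of s covered by the match
        phi[u] += lam ** span
    return phi

def ssk(s, t, n, lam):
    """Subsequence kernel: inner product of the two feature maps."""
    phi_s = ssk_features(s, n, lam)
    phi_t = ssk_features(t, n, lam)
    return sum(w * phi_t[u] for u, w in phi_s.items() if u in phi_t)

def spectrum_kernel(s, t, n):
    """Spectrum kernel: contiguous n-grams only, each match weighted by 1."""
    grams_s = Counter(s[i:i + n] for i in range(len(s) - n + 1))
    grams_t = Counter(t[i:i + n] for i in range(len(t) - n + 1))
    return sum(c * grams_t[g] for g, c in grams_s.items())

# "cat" and "car" share only the subsequence "ca", which spans 2 characters in each.
k_ssk = ssk("cat", "car", n=2, lam=0.5)
k_sk = spectrum_kernel("cat", "car", n=2)
```

For the pair "cat" and "car" with n = 2, the only shared subsequence is "ca", so the SSK value equals λ^4 and the spectrum kernel counts a single shared bigram.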
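The averaging step behind WEM and FWEM can be sketched as follows. The embedding table here is a hand-made two-dimensional toy dictionary standing in for the pretrained 300-dimensional vectors, and the set of named-entity tokens is supplied by hand rather than produced by an NER tool; both are assumptions for illustration. Skipping out-of-vocabulary words and returning a zero vector for an empty headline are likewise choices made for this sketch, not details confirmed by the system description.

```python
def average_embedding(headline, embeddings, dim, exclude=frozenset()):
    """Average the embeddings of the lowercased headline tokens.

    Tokens in `exclude` (e.g. named entities) and tokens without an
    embedding are skipped; an empty selection yields a zero vector.
    """
    tokens = [w.lower() for w in headline.split()]
    vectors = [embeddings[w] for w in tokens
               if w in embeddings and w not in exclude]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Toy embedding table (invented values, 2 dimensions instead of 300).
toy_embeddings = {
    "easyjet": [3.0, 0.0],
    "lags": [0.0, -3.0],
    "ryanair": [3.0, 3.0],
}
headline = "EasyJet lags Ryanair"

# WEM: average over all tokens with an embedding.
wem = average_embedding(headline, toy_embeddings, dim=2)

# FWEM: same average after removing tokens tagged as named entities
# (the entity set is hypothetical and would come from an NER tagger).
fwem = average_embedding(headline, toy_embeddings, dim=2,
                         exclude={"easyjet", "ryanair"})
```

In the toy example the filtered variant keeps only the vector of "lags", which illustrates how removing company names shifts the headline representation towards the sentiment-bearing words.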
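When using word embeddings as features, we experimented with the linear, RBF, and cosine kernel (CK). The latter is defined as:

$$CK(\mathbf{x}, \mathbf{y}) = \left[ \frac{1}{2} \left( 1 + \frac{\langle \mathbf{x}, \mathbf{y} \rangle}{\lVert \mathbf{x} \rVert \, \lVert \mathbf{y} \rVert} \right) \right]^{\alpha} \qquad (3)$$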
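A minimal sketch of a cosine kernel with an exponent α is given below, assuming the kernel takes the form of the cosine similarity rescaled to the interval [0, 1] and raised to the power α. Both this form and the function names are assumptions made for illustration. The resulting Gram matrix is the kind of input that SVR implementations accepting precomputed kernels (LIBSVM supports these) can consume.

```python
import math

def cosine_kernel(x, y, alpha=1.0):
    """Assumed form: (0.5 * (1 + cos(x, y))) ** alpha."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    cos = dot / norm if norm else 0.0  # zero vectors are treated as orthogonal (assumption)
    return (0.5 * (1.0 + cos)) ** alpha

def gram_matrix(rows, cols, alpha=1.0):
    """Kernel values between every vector in `rows` and every vector in `cols`."""
    return [[cosine_kernel(x, y, alpha) for y in cols] for x in rows]

# Toy headline vectors (invented values).
X = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
K = gram_matrix(X, X, alpha=2.0)
```

With this form, identical directions give a kernel value of 1, opposite directions give 0, and larger values of α sharpen the contrast between similar and dissimilar headlines.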

Results
Model evaluation was performed as defined on the task description page. From the instances given in the test set, we create a vector containing ground truth annotations G and a vector containing our model predictions P. Model performance score is computed using cosine similarity between the two vectors, as follows:

$$\mathrm{cosine}(G, P) = \frac{\sum_{i} G_i P_i}{\sqrt{\sum_{i} G_i^2} \, \sqrt{\sum_{i} P_i^2}}$$

To optimize the hyperparameters of the models (C for linear SVR, n and λ for SSK, n for SK, α and C for cosine kernel, and C and γ for RBF kernel), we performed a grid search in a nested K-fold cross-validation on the train set, using 10 folds in the outer and 5 folds in the inner loop. To select the best parameters for a model, we choose the ones that consistently provided the best result across the 10 outer folds. Using the chosen hyperparameters, we finally train that model on the complete train set. The best results for all of the considered models are reported in Table 2.
While working with BoW models, the best results were obtained using the simple Bernoulli feature weighting scheme, which assigns a weight of 1 if a term appears in the headline and 0 otherwise. Interestingly, experiments showed that the SK outperformed the SSK.
Using word embeddings provided us with significant performance gains compared to the other two methods, but only in combination with a non-linear kernel. Word embedding features combined with the linear kernel did not outperform string kernels, whereas using a non-linear kernel such as RBF, and especially the cosine kernel, yielded substantial performance gains.
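The evaluation measure described above, the cosine similarity between the gold vector G and the prediction vector P, can be sketched in a few lines. The function name and the toy score vectors are hypothetical, and returning 0.0 when either vector is all zeros is an assumption made to keep the example total.

```python
import math

def cosine_score(gold, pred):
    """Cosine similarity between gold annotations and model predictions."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    dot = sum(g * p for g, p in zip(gold, pred))
    norm = math.sqrt(sum(g * g for g in gold)) * math.sqrt(sum(p * p for p in pred))
    return dot / norm if norm else 0.0  # all-zero vector: score 0.0 (assumption)

# Toy sentiment scores (invented for the example).
gold = [0.5, -0.5, 0.25]
pred = [1.0, -1.0, 0.5]   # same direction as gold, twice the magnitude
score = cosine_score(gold, pred)
```

Because the measure depends only on the direction of the two vectors, predictions that are a positive multiple of the gold scores receive the same score as exact predictions, which is worth keeping in mind when comparing it with error-based measures.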

Conclusion
We described our system for fine-grained sentiment scoring of news headlines, which we submitted to SemEval-2017 Task 5, subtask 2. We implemented a number of feature-light methods for sentiment analysis with basic preprocessing. Our best-performing method used skip-gram word embeddings trained on the Google News corpus, which were fed as features to a cosine kernel Support Vector Regression. We report our results on the gold set, where our system ranked 3rd out of 29 teams, with a cosine score of 0.733.
It should be noted that we did not use the information about which company the sentiment is measured for in any way. Arguably, not using this information leads to performance decreases when dealing with (1) headlines entirely unrelated to the company of interest and (2) headlines containing mentions of multiple companies. For future work, it would be interesting to consider encoding this information into the model or using additional preprocessing methods to detect specific parts of the headline related to the company of interest.