IITPB at SemEval-2017 Task 5: Sentiment Prediction in Financial Text

This paper reports team IITPB’s participation in SemEval-2017 Task 5, ‘Fine-grained sentiment analysis on financial microblogs and news’. We developed two systems, one for each of the two tracks. One system was based on an ensemble of a Support Vector Classifier and Logistic Regression. This system relied on Distributional Thesaurus (DT), word embeddings and lexicon features to predict a real-valued sentiment score between -1 and +1. The other system was based on Support Vector Regression using word embeddings, lexicon features, and PMI scores as features. Our systems ranked 5th in track 1 and 8th in track 2.


Introduction
We live in a world where the stock market directly affects the economic system of a country. Therefore, reliable and prompt delivery of information plays an important role in the financial market. Until the last decade, print and television news were the major sources of stock market-related information. However, with the introduction of micro-blogging websites (e.g. Twitter), the trend has shifted. The rise of Twitter and StockTwits has given people and organizations an opportunity to express their feelings and views. This information can be used by an individual or an organization to make an informed prediction related to any company or stock (Si et al., 2013). This opens a new avenue for sentiment analysis in the financial domain of microblogs and news.
A news headline is a short piece of text describing the nature of an article. Due to space constraints, headlines normally follow a compact writing style, known as headlinese, which limits the use of articles, forms of the verb 'to be', conjunctions, etc.
Similarly, social media text is prone to noise. The data very often lack proper structure, grammar and appropriate punctuation. These inconsistencies make it challenging to solve any NLP problem, including sentiment analysis (Khanarian and Alvarez-Melis, 2012). Moreover, a tweet can refer to multiple company names (or stock symbols), and the expressed sentiment can differ across companies. Hence, there is a need to perform fine-grained sentiment analysis wherein, generally, a context is used to decide the relevant portion of a tweet for a particular company. Another inherent challenge with microblog and news data is the use of abbreviated language, hashtags, emoticons and embedded URLs. Special attention should be given to these, as they can provide important hidden information (Mohammad et al., 2013). For example, #bullishMarket and #increasingProfit can reflect positive sentiment. These are some of the major challenges associated with fine-grained sentiment analysis of microblogging and news data.
SemEval-2017 Task 5 (Fine-Grained Sentiment Analysis on Financial Microblogs and News) has two tracks (Cortis et al., 2017). For both tracks, the overall aim was to assign a sentiment score to a cashtag/company over a continuous range from -1 (very negative/bearish) to +1 (very positive/bullish).
The first track involves finding a sentiment score towards a given 'cashtag' (a stock symbol preceded by a $, e.g. $AAPL for Apple Inc.) in microblog messages, while the second track involves finding a sentiment score towards a given company name in news headlines. Instances in the track 1 dataset also contain a 'span', the section of a tweet from which the sentiment score should be derived. We participated in and submitted systems for both tracks. A total of 27 and 29 teams participated in track 1 and track 2, respectively. Our system ranked 5th in the first track with a cosine similarity of 0.725. In the second track, our system achieved a cosine similarity of 0.695 and ranked 8th overall.
The rest of the paper is organized as follows: Section 2 briefly describes the proposed systems. The feature set is described in Section 3. Section 4 is devoted to experimental results and error analysis. Lastly, we conclude in Section 5.

System Overview
In this section, we present a brief description of the proposed systems. We adopted a supervised approach for both tracks. We employed Logistic Regression, Support Vector Machine (SVM) and Support Vector Regression (SVR) as the base learners for prediction. We tried various combinations of the feature set for training the models and selected the feature set best suited to the problem at hand. To further improve the efficacy of the system, we ensembled the outputs of the classifiers at the end. For the ensemble, the final sentiment value was calculated by taking the harmonic mean of the two systems' predictions and then linearly scaling it to the range -1 to +1.
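The ensembling step described above can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes that both base models output scores in [0, 1] (for example, class probabilities) and that the linear scaling to [-1, +1] is x -> 2x - 1. The function names harmonic_mean and ensemble_score are hypothetical.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative scores; 0.0 if either score is 0."""
    if a < 0 or b < 0:
        raise ValueError("scores must be non-negative")
    if a == 0 or b == 0:
        return 0.0
    return 2.0 * a * b / (a + b)


def ensemble_score(p_first, p_second):
    """Combine two scores in [0, 1] and rescale the result to [-1, +1].

    Assumption: the linear scaling is x -> 2x - 1; the paper does not
    spell out the exact mapping.
    """
    combined = harmonic_mean(p_first, p_second)
    return 2.0 * combined - 1.0


if __name__ == "__main__":
    # Two models that both report 0.5 map to a neutral score of 0.0.
    print(ensemble_score(0.5, 0.5))
```

The harmonic mean is pulled towards the smaller of the two inputs, so under these assumptions a high final score requires both models to agree on a high value.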

Distributional Thesaurus
Words missing from the word2vec or GloVe vocabulary make it harder to learn from the data. We employ a Distributional Thesaurus (DT) (Biemann and Riedl, 2013) expansion strategy for words whose representation is missing from the word2vec or GloVe model. A Distributional Thesaurus is an automatically computed word list that ranks words according to their semantic similarity; it finds words that tend to occur in similar contexts as the target word. We use a pre-trained DT model to expand a source word. If the representation of a word is not present in the word2vec or GloVe model, its most similar expanded word is used to replace it. If the replacement word also lacks a representation, we select the next most similar word, and so on. For each source word, we took the top 5 similar words in the expanded list as targets. An example is listed in Table 2. For the source word 'drinks', its DT expanded word list contains 'beer', 'wine', 'coffee', 'liquids' and 'beverages'.
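The fallback strategy described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the embedding model and the DT expansions are represented as plain dictionaries, and the function name lookup_with_dt and the toy vectors are hypothetical.

```python
def lookup_with_dt(word, embeddings, dt_expansions, top_k=5):
    """Return (vector, word_used) for `word`, falling back to DT neighbours.

    embeddings:    dict mapping a word to its vector (list of floats).
    dt_expansions: dict mapping a word to its DT neighbours, ordered from
                   most to least similar.
    Returns (None, None) if neither the word nor any of its top_k
    neighbours has a vector.
    """
    if word in embeddings:
        return embeddings[word], word
    for candidate in dt_expansions.get(word, [])[:top_k]:
        if candidate in embeddings:
            return embeddings[candidate], candidate
    return None, None


if __name__ == "__main__":
    # Toy data: 'drinks' has no vector, and neither do its first two neighbours.
    toy_embeddings = {"coffee": [0.3, 0.1], "price": [0.5, 0.5]}
    toy_dt = {"drinks": ["beer", "wines", "coffee", "liquids", "beverages"]}
    vector, used = lookup_with_dt("drinks", toy_embeddings, toy_dt)
    print(used, vector)
```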
Table 2. DT expansion examples.
Word      DT expanded list
drinks    beer, wines, coffee, liquids, beverages
price     prices, pricing, cash, cost, pennies
laptop    pc, computer, notebook, tablet, imac

We use the following set of features for training the model.

Track 1 - Microblog messages
• Word Embedding: Word embeddings are known to capture syntactic and semantic similarity between words. We used 200-dimensional pre-trained GloVe vectors trained on Twitter data for word representation. Sentence embeddings were computed by averaging the word representations.
• Tf-Idf Score: We use the Tf-Idf score as a feature value. The score reflects how important a word is to a document in a corpus.
• Sentiment Lexicon: We compiled a list of positive and negative words using the NRC Hashtag Sentiment Lexicon (Kiritchenko et al., 2014), the MPQA Subjectivity Lexicon (Wilson et al., 2009) and the Bing Liu Opinion Lexicon (Hu and Liu, 2004). Using these, we created hand-engineered features. $M_{pos}$ and $M_{neg}$ are the numbers of positive and negative words in the span and in the text.
- Agreement Score: This is the agreement value A of the positive and negative words in the data instance. It was calculated for both the span and the text. If all the sentiment words are positive, or all are negative, then A = 1.
We modified the formulation proposed by Rao and Srivastava (2012) to make the feature more effective.
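Two of the Track 1 features above, the averaged sentence embedding and the lexicon counts $M_{pos}$ and $M_{neg}$, can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function names and toy data are hypothetical, and returning a zero vector when no token has an embedding is an assumption.

```python
def sentence_embedding(tokens, embeddings, dim):
    """Average the vectors of all tokens that have an embedding.

    Assumption: a zero vector of length `dim` is returned when none of
    the tokens is in the vocabulary.
    """
    vectors = [embeddings[tok] for tok in tokens if tok in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def lexicon_counts(tokens, positive_words, negative_words):
    """Return (M_pos, M_neg): counts of positive and negative lexicon hits."""
    m_pos = sum(1 for tok in tokens if tok in positive_words)
    m_neg = sum(1 for tok in tokens if tok in negative_words)
    return m_pos, m_neg


if __name__ == "__main__":
    toy_embeddings = {"profit": [1.0, 0.0], "rises": [0.0, 1.0]}
    tokens = ["profit", "rises", "sharply"]
    print(sentence_embedding(tokens, toy_embeddings, dim=2))
    print(lexicon_counts(tokens, {"profit", "rises"}, {"loss"}))
```

The same functions can be applied once to the span and once to the full text to obtain the two sets of counts.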

Track 2 - News headlines
• Word Ngrams: We extracted and used unigrams and bigrams as features for this task.
• Sentiment Lexicon: Sentiment lexicons are known to be a decisive feature in sentiment analysis tasks. We use four sentiment lexicons to obtain lexicon-based features. For each instance, we extract 3 features: positive score, negative score, and cumulative score. Each token is assigned a score of +1 or -1 if it belongs to the positive or negative list, respectively. We followed this approach for all lexicons except SentiWordNet. In the case of the SentiWordNet lexicon, we use the positive and negative scores given in the lexicon rather than +1 or -1.
• Semantic Orientation (SO): Semantic orientation (Hatzivassiloglou and McKeown, 1997) measures how strongly a token is associated with positive and with negative sentiment. We calculate a score for each term in our training corpus to get the association value:

\[ \mathrm{score}(w) = \mathrm{PMI}(w, pos) - \mathrm{PMI}(w, neg) \]
where PMI is the point-wise mutual information, calculated as follows:
\[ \mathrm{PMI}(w, pos) = \log_2 \frac{\mathrm{freq}(w, pos) \cdot N}{\mathrm{freq}(w) \cdot \mathrm{freq}(pos)} \]
In the above equation, pos is the collection of positive reviews and N is the total number of tokens in the corpus.
• Word Embeddings: We use the 300-dimensional pre-trained word2vec (Mikolov et al., 2013) vectors trained on part of the Google News dataset (about 100 billion words). The sentence embedding is obtained by averaging the embedding vectors of all words in the sentence.
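The semantic orientation score defined above can be sketched as follows. This is a sketch under stated assumptions, not the authors' code: the training instances are assumed to have already been split into a positive and a negative collection (for example, by the sign of the gold score), freq(pos) and freq(neg) are taken to be the token counts of each collection, and add-one smoothing of the joint counts is introduced here only to avoid taking the logarithm of zero. The function name semantic_orientation is hypothetical.

```python
import math
from collections import Counter


def semantic_orientation(pos_docs, neg_docs, smoothing=1.0):
    """Return {word: PMI(word, pos) - PMI(word, neg)}.

    pos_docs / neg_docs: lists of tokenized instances (lists of strings).
    smoothing: added to each joint count (an assumption, not from the paper).
    """
    pos_counts = Counter(tok for doc in pos_docs for tok in doc)
    neg_counts = Counter(tok for doc in neg_docs for tok in doc)
    freq_pos = sum(pos_counts.values())
    freq_neg = sum(neg_counts.values())
    if freq_pos == 0 or freq_neg == 0:
        raise ValueError("both collections must contain at least one token")
    n_total = freq_pos + freq_neg  # N: total number of tokens

    def pmi(joint, freq_w, freq_class):
        # PMI(w, c) = log2( freq(w, c) * N / (freq(w) * freq(c)) )
        return math.log2((joint * n_total) / (freq_w * freq_class))

    scores = {}
    for word in set(pos_counts) | set(neg_counts):
        joint_pos = pos_counts[word] + smoothing
        joint_neg = neg_counts[word] + smoothing
        freq_w = joint_pos + joint_neg
        scores[word] = (pmi(joint_pos, freq_w, freq_pos)
                        - pmi(joint_neg, freq_w, freq_neg))
    return scores


if __name__ == "__main__":
    positive = [["profit", "up"], ["profit", "gain"]]
    negative = [["loss", "down"], ["profit", "loss"]]
    for word, score in sorted(semantic_orientation(positive, negative).items()):
        print(word, round(score, 3))
```

Words that occur mostly in the positive collection receive a positive score, and words that occur mostly in the negative collection receive a negative one.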

Dataset
The training datasets contain 1700 and 1142 instances of microblog messages and news headlines, respectively. The test data comprise 800 and 491 instances for the two tracks, respectively. We use 20% of the training dataset as a validation set.

Preprocessing
We used the CMU ARK toolkit for tokenization of microblog tweets. During preprocessing, each URL, username and number was replaced by <url>, <user> and <number>, respectively; for example, 'www.twitter.com' becomes <url>, '@johnSnow' becomes <user> and '9.7' becomes <number>. Since the data was collected from the web, all HTML entities were converted to their corresponding Unicode characters, e.g. '&amp;' to 'and'. Analysis of the datasets suggests that a few hashtags convey explicit sentiment in the text. Therefore, we replace each hashtag by '#' followed by the word associated with the hashtag, for example '#happy' by '# happy'. Lastly, all characters are converted to lower case. For the news headlines, we use NLTK for tokenization.
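The normalization steps above can be sketched with regular expressions as follows. This is an illustrative sketch, not the authors' pipeline: the patterns and the function name preprocess are assumptions, tokenization with the CMU ARK toolkit or NLTK is not shown, and html.unescape maps '&amp;' to '&' rather than to the word 'and'.

```python
import html
import re

# Hypothetical patterns; the exact rules used in the paper are not given.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
USER_RE = re.compile(r"@\w+")
NUMBER_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")
HASHTAG_RE = re.compile(r"#(\w+)")


def preprocess(text):
    """Normalize a microblog message with placeholder tokens."""
    text = html.unescape(text)                # '&amp;' -> '&'
    text = URL_RE.sub("<url>", text)          # URLs -> <url>
    text = USER_RE.sub("<user>", text)        # @mentions -> <user>
    text = NUMBER_RE.sub("<number>", text)    # standalone numbers -> <number>
    text = HASHTAG_RE.sub(r"# \1", text)      # '#happy' -> '# happy'
    return text.lower()


if __name__ == "__main__":
    print(preprocess("Check www.twitter.com @johnSnow up 9.7 #Happy &amp; rich"))
```

URLs are replaced before user mentions and numbers so that digits or '@' characters inside a URL are not rewritten separately.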

Experiments
We used the Python-based machine learning package scikit-learn (http://scikit-learn.org) for the implementation. As learning algorithms, we used Logistic Regression (LR), Support Vector Machine (SVM) and Support Vector Regression (SVR). As discussed earlier, each instance of the dataset needs a score over a continuous range of -1 to +1. Since SVM predicts discrete class labels, as post-processing we use the probability of the predicted class as the score. During the validation phase, we observed that models trained with SVM work better than those trained with SVR on the microblog dataset. In contrast, SVR works better than SVM on the news headline dataset. To tune the hyperparameters, we performed an exhaustive grid search evaluated through ten-fold cross-validation on the training set. The hyperparameters of the SVM were C = 30 and γ = 0.01; for SVR we used C = 10 and γ = 0.01, and for LR we set C = 6. The cosine similarity of various combinations of the feature set is listed in Tables 3 and 4 for the microblog and news headline validation sets, respectively.
We observed that word embeddings along with lexicon-based features produce the best cosine similarity for both datasets. Further, we observed that the outputs of the different classifiers differ in nature; therefore, we merged the outputs of different classifiers using averaging and the harmonic mean. We found that the harmonic mean of LR and SVM produces a better cosine similarity score than other combinations for microblog messages. However, for news headlines the ensemble did not improve performance, so we chose the best feature combination to train an SVR. Table 5 shows the results for the harmonic mean of SVM and LR on the microblog dataset.
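The cosine similarity used to compare feature combinations can be computed between the vector of gold scores and the vector of predicted scores, as in the sketch below. It uses the standard definition of cosine similarity and is not the official task scorer; the function name and the convention of returning 0.0 for an all-zero vector are assumptions.

```python
import math


def cosine_similarity(gold, predicted):
    """Cosine similarity between two equal-length lists of scores."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    dot = sum(g * p for g, p in zip(gold, predicted))
    norm_gold = math.sqrt(sum(g * g for g in gold))
    norm_pred = math.sqrt(sum(p * p for p in predicted))
    if norm_gold == 0 or norm_pred == 0:
        return 0.0  # assumption: undefined case is reported as 0.0
    return dot / (norm_gold * norm_pred)


if __name__ == "__main__":
    gold_scores = [0.6, -0.4, 0.1]
    predicted_scores = [0.5, -0.2, 0.3]
    print(round(cosine_similarity(gold_scores, predicted_scores), 3))
```

Because the measure depends only on the angle between the two vectors, predictions that are uniformly scaled versions of the gold scores still obtain a similarity of 1.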

After finalizing the proposed approach on the validation set, we evaluated it on the test datasets. For microblog messages, we obtained a cosine similarity of 0.725. For news headlines, our system produced a cosine similarity of 0.695. Table 6 depicts the evaluation results on the test datasets.

Conclusion
In this paper, we proposed a supervised sentiment analyzer for financial texts as part of our participation in the SemEval-2017 shared task. As base learning algorithms, we used Logistic Regression (LR), Support Vector Machine (SVM) and Support Vector Regression (SVR) for predicting the sentiment score. In the second stage, we combined the predictions of the two best-performing models using the harmonic mean. Evaluation shows encouraging results on the shared task dataset. In future, we would like to explore other relevant features to improve the performance of the system.