Causality Analysis of Twitter Sentiments and Stock Market Returns

Sentiment analysis is the process of identifying the opinion expressed in text. Recently it has been used to study behavioral finance, and in particular the effect of opinions and emotions on economic or financial decisions. In this paper, we use a public dataset of tweets labeled by Amazon Mechanical Turk, and we propose a baseline classification model. Then, using Granger causality tests between both sentiment datasets and the returns of different stocks, we show that for many stocks there is causality, in both directions, between social media sentiment and stock market returns. Finally, we evaluate this causality analysis by showing that when specific news about a stock appears on certain dates, there is evidence of the same news trending on Twitter for that stock.


Introduction
Sentiment analysis of Twitter messages has been used to study behavioral finance, specifically the effect of sentiments derived from social media on financial and economic decisions. For example, Bollen and Pepe (2011) used social-media sentiment analysis to predict the size of markets, while Antenucci et al. (2014) used it to predict unemployment rates over time. Twitter sentiment analysis in particular is a challenging task, because tweets contain many misspelled words, abbreviations, grammatical errors, and made-up words, and therefore carry limited contextual information.
Previous research suggests that, if properly modeled, Twitter can be used to forecast useful information about the market. Souza et al. (2015) used the SVM-based Twitter sentiment analysis of Kolchyna et al. (2015), compared it across different industries, and showed that adding the sentiments to their predictive models reduced the error rate by 1 to 3 percent when predicting the expected returns of different industries. Alanyali et al. (2013) found a positive correlation between the number of mentions of a company in the Financial Times and the trading volume of its stock. There has been much related research in this area, but two shortcomings need to be pointed out. First, the datasets used for sentiment analysis are not specific to the context of finance (Bollen and Pepe, 2011; Souza et al., 2015). Second, the classification models mostly have low accuracy (Bollen and Pepe, 2011; Loughran and McDonald, 2010; Ranco et al., 2015; Lillo et al., 2012).
In our research on the impact of social media on the stock market, we pulled a dataset of tweets over a three-month period, had it labeled by Amazon Mechanical Turk, and also designed an SVM classification model with 79.9% accuracy. Granger causality analysis of these two tweet datasets against various stock returns shows that for many companies there is a statistically significant causality, at different lags, between the stock and the sentiments derived from tweets. When evaluating this relation, we found that on the specific dates when jumps in a stock's return occur, there is ample evidence in our Twitter dataset of mentions of the same news that caused the change in the stock market return. In Section 2, we describe the dataset pulled from Twitter, the pre-processing techniques, the labels from Amazon Mechanical Turk, and the machine learning classifier. This section has also been used in another analysis which is currently under review at ECML-PKDD 2018. In Section 3, we explain the causality models and present the results. In Section 5, we describe the evaluation process, and we conclude our findings in Section 6.

Data
Tweets were pulled from Twitter using the Twitter API between 1/1/2017 and 3/31/2017. In our filters, we only pulled tweets sent from "Verified" accounts. A verified account on Twitter indicates that the account is of public interest and that it is authentic. An account gets verified by Twitter if the user is a distinguished person in a key interest area, such as politics, journalism, government, music, or business. A tweet was considered stock-related if it contained at least one of the 100 most frequent stock symbols included in the SemEval dataset of Cortis et al. (2017). Using these filters, we were able to pull 20,013 tweets in that interval.
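The filtering step above can be sketched as follows. This is an illustrative sketch only: the field names (`user_verified`, `text`) and the three-symbol list are stand-ins, not the actual Twitter API schema or the full 100-symbol SemEval list.

```python
# Hypothetical sketch of the tweet filter described above.
STOCK_SYMBOLS = {"$AAPL", "$FB", "$AMZN"}  # stand-in for the 100 SemEval symbols

def is_stock_related(tweet):
    """Keep tweets from verified accounts that mention a tracked cashtag."""
    if not tweet.get("user_verified", False):
        return False
    tokens = tweet["text"].upper().split()
    return any(tok.strip(".,!?") in STOCK_SYMBOLS for tok in tokens)

tweets = [
    {"user_verified": True, "text": "Buy $AAPL before earnings"},
    {"user_verified": False, "text": "$FB to the moon"},
    {"user_verified": True, "text": "Nice weather today"},
]
kept = [t for t in tweets if is_stock_related(t)]
```

Both conditions (verified account and cashtag match) must hold, so only the first example tweet survives the filter.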

Labeling using Amazon Mechanical Turk
The data was submitted to Amazon Mechanical Turk (AMT) to be labeled by four different workers per tweet. Snow et al. (2008) suggested that four workers are enough to ensure that enough people have submitted their opinion on each tweet for the results to be reliable. We assigned only AMT masters as our workers, meaning workers with the highest performance across a wide range of HITs (Human Intelligence Tasks). We asked the workers to assign sentiments based on the question: "Is the tweet beneficial to the stock mentioned in the tweet or not?" It was important that tweets were not labeled from the perspective of how beneficial they would be for an investor, but rather how beneficial they would be to the company itself. Each worker assigned a number from -2 (very negative) to +2 (very positive) to each tweet. Table 1 shows the inter-rater percentage agreement between the sentiments assigned to each tweet by the four different workers. We considered the labels 'very positive' and 'positive' both as positive when calculating the inter-rater agreement.
In the end, the average of the four sentiment scores was assigned to each tweet as its final sentiment. Out of the 20,013 tweet records submitted to AMT, we assigned a neutral sentiment to any tweet whose average score fell in [-0.5, +0.5]. Table 2 summarizes the number of tweets in each sentiment category.
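The aggregation rule above can be sketched as a small helper; the function name and string labels are illustrative, not from the paper.

```python
def final_sentiment(scores):
    """Average the four worker scores (each in [-2, +2]) and map to a class.
    Averages in [-0.5, +0.5] are treated as neutral, per the rule above."""
    avg = sum(scores) / len(scores)
    if -0.5 <= avg <= 0.5:
        return "neutral"
    return "positive" if avg > 0.5 else "negative"
```

For example, worker scores of (2, 1, 1, 2) average to 1.5 and yield a positive label, while (0, 0, 1, -1) averages to 0 and is neutral.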

Classification Model
We used Amazon Mechanical Turk to manually label our stock market tweets. To create a classification model that can be used to predict sentiments for more tweets in future analyses, we applied the same preprocessing techniques and classification models explained in detail by Tabari et al. (2017). In the preprocessing phase, after tokenization, all numbers were substituted with a <num> tag, and some characters, such as '-' and '.', were removed from the text. Then, to create our feature set, we modified Loughran's lexicon of positive and negative words (Loughran and McDonald, 2010) to suit the stock market context, and used it to count the number of positive and negative words in each tweet as features. For example, 'sell' carries a negative sentiment in the stock market context and was added to Loughran's lexicon; we ultimately added around 120 new words to their list, given in Appendix A. As another feature, we replaced word couples that occur together in a tweet but carry a different sentiment in the stock market context with a single specific word. For example, 'go down' and 'pull back' both carry a negative sentiment from a stock's perspective. Around 90 word couples were defined specifically for this context and are listed in Appendix B.


Causality Models

Granger Causality
Granger Causality (GC) is a probabilistic approach for determining whether information about the past of one variable can explain another; it is based on a version of the probabilistic theory of causality (Hitchcock, 2016). According to Suppes (1970), an event A causes an event B prima facie if the conditional probability of B given A is greater than the probability of B alone, and A occurs before B. Clive Granger expanded on this idea in what is now known as Granger causality (Granger, 1969), a very common approach in econometrics: a variable A causes B if the probability of B conditional on its own past history and the past history of A does not equal the probability of B conditional on its own past history alone. The advantage of this model is that it is operational and easy to implement, although the definition is not really one of causality but of increased predictability, which is not the same thing. Many critics of this definition point out that A can Granger-cause B while controlling A might not imply that we can directly influence B, or that we even know the magnitude of what will happen to B. Granger causality is mainly important for causal notions in policy control, for the explanation and understanding of time series, and in some cases for prediction.
Correlation is not causation. It is important to understand that correlation is different from causation. Correlation means that there is a relationship between two variables, where a change in one is associated with a change in the other, whereas causation means that previous information about one time series helps explain the other variable. Two time series can exhibit causality without any correlation between them, and vice versa. Correlation is a symmetric relation (a measure of statistical linear dependence), but causality is an asymmetric relation.
Definition of Granger causality: a time series Y can be written as an autoregressive process in which the past values of Y are able to explain (in part) the current value of Y:

Y_t = a_0 + Σ_{j=1}^{p} a_j Y_{t-j} + ε_t    (1)

Granger defined causality in the following way. Consider another variable X, which has past values as well. If the past values of X help improve the prediction of the current value of Y beyond what we get with the past values of Y alone, then X is said to Granger-cause Y. The test is undertaken on the model:

Y_t = a_0 + Σ_{j=1}^{p} a_j Y_{t-j} + Σ_{j=1}^{p} b_j X_{t-j} + ε_t    (2)

The test is an F-test on the coefficients b_1, ..., b_p being jointly equal to zero. If the null hypothesis is rejected, then X is said to Granger-cause Y. Note that it is entirely possible, and appropriate, to also test whether Y can be said to Granger-cause X: it is possible for X to GC Y, for Y to GC X, or for neither to influence the other. Granger causality tests should only be undertaken on I(0) variables, that is, variables with a time-invariant mean and variance that can be adequately represented by a linear AR(p) process; in other words, the time series must be stationary.
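The F-test just described can be sketched with a plain least-squares implementation. This is a minimal illustration of the test under the definitions above, not the authors' code; the function name `granger_f` is ours.

```python
import numpy as np

def granger_f(y, x, p):
    """F statistic for 'lags 1..p of x improve prediction of y beyond
    y's own lags': compares restricted vs. unrestricted regressions."""
    rows = range(p, len(y))
    Y = np.array([y[t] for t in rows])
    # restricted model: constant + own lags of y (equation (1))
    Xr = np.array([[1.0] + [y[t - j] for j in range(1, p + 1)] for t in rows])
    # unrestricted model: also include lags of x (equation (2))
    Xu = np.array([[1.0] + [y[t - j] for j in range(1, p + 1)]
                         + [x[t - j] for j in range(1, p + 1)] for t in rows])
    rss = lambda X: np.sum((Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]) ** 2)
    rss_r, rss_u = rss(Xr), rss(Xu)
    df_num, df_den = p, len(Y) - Xu.shape[1]
    return ((rss_r - rss_u) / df_num) / (rss_u / df_den)
```

On synthetic data where y is driven by the first lag of x, the statistic is large in the x-to-y direction and near 1 in the reverse direction, matching the asymmetry discussed above.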

Stock market returns
For each of the 100 stock ticker symbols mentioned in the tweet dataset, the stock's closing price was downloaded. Then, for each company, we calculated the relative daily return. Using returns instead of closing prices creates a stationary time series, which is essential for most time-series analyses and specifically for Granger causality. Relative return is the return an asset achieves over a period of time compared to a benchmark, and it is a means to measure the performance of an active portfolio compared to other investments. The relative stock return was calculated with the following formula:

R_t = (P_t - P_{t-1}) / P_{t-1}

where P_t is the closing price on day t.
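The return computation can be sketched directly from the formula above (the function name is ours):

```python
def daily_returns(prices):
    """Relative daily return: R_t = (P_t - P_{t-1}) / P_{t-1}."""
    return [(prices[t] - prices[t - 1]) / prices[t - 1]
            for t in range(1, len(prices))]
```

For example, closing prices of 100, 110, 99 yield returns of +10% and then -10%, and the resulting series has one fewer observation than the price series.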

Comparison of social media sentiment analysis and stock market returns: Results
In order to use GC, we first need to start with the KPSS test, which is a hypothesis test of whether a time series is stationary. A stationary time series is one whose statistical properties, such as mean and variance, are constant over time. The null hypothesis of the test is that the data is stationary, with the alternative that it is not. We applied this test to all three datasets: the two daily sentiment series and the stock returns. For each non-stationary dataset, we then computed the difference, with the appropriate lag number, that would create a stationary time series. After KPSS testing, in cases where the p-value was greater than 0.05, the null hypothesis that the data is stationary was not rejected. After making sure all three datasets were stationary, the following GC models were applied to both the sentiments predicted by our classifier in Section 2.2 and those labeled by AMT.
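The differencing step can be illustrated on synthetic data: a random walk is non-stationary, but its first difference recovers the stationary innovations. This sketch shows only the differencing; in the paper the stationarity decision itself comes from the KPSS test.

```python
import numpy as np

rng = np.random.default_rng(42)
steps = rng.normal(size=500)   # stationary innovations
walk = np.cumsum(steps)        # random walk: a non-stationary level series
diffed = np.diff(walk)         # first difference restores stationarity

# diffed reproduces the original innovations (up to floating-point rounding)
# and is one observation shorter than the level series.
```

The same logic applies to the sentiment and return series: each non-stationary series is differenced before the GC models are fit.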
Model (1): SSC ∼ Lags(SSC, LAG) + Lags(RV, LAG)
Model (2): RV ∼ Lags(RV, LAG) + Lags(SSC, LAG)

Model one investigates whether stock returns cause sentiment scores, and model two the causal impact of sentiment scores on stock returns. RV (return value) is the calculated daily return for 83 different stocks. We considered SSC (sentiment score) once for the sentiments predicted in Section 2.2 and again for those labeled by AMT. We used lags between 1 and 10 in our models. The goal was twofold: first, to find out whether the causal relationship runs both ways; and second, to determine the lag number that explains causality for each model. The p-values and F-values of all Granger causality models that were statistically significant in at least one direction are given in Appendix C.

Figure 1: Daily comparison of stock returns and sentiment scores on $AAPL. Sentiments are labeled by AMT. This shows that there is a general trend between the stock return and the sentiments labeled by AMT.

Figure 2: Daily comparison of stock returns and sentiment scores on $AAPL. Sentiments are predicted by the ML model. This shows that there is a general trend between the stock return and the sentiments labeled by our machine learning model.

Figure 1 compares the daily sentiments calculated by AMT with the stock return, and Figure 2 shows the same information with the sentiments predicted by the machine learning classifier. These two figures are a good visual indication that there is a trend between how the stock market moves and how the sentiment score changes. Comparing them shows two important points: first, the overall trends of the stock returns and the sentiments follow each other for both datasets; second, AMT has done a better job with the sentiments than our machine learning model. Although the trend in Figure 2 is not as obvious as the one with AMT, it still exists; this is a visual indication that the 20% error rate damages the trend to some extent. Therefore, decreasing the 20% error rate and improving the accuracy of the machine learning model is important and necessary.

Figure 3 plots the lag numbers at which the Granger causality model was statistically significant for different stocks in model two (impact of sentiment scores on stock market returns). The lag number is the number of days before the current day at which sentiment scores had a causal effect on the stock market return. Stocks with just one bar indicate that the causality model for the other sentiment dataset was not significant at any lag, meaning there was no causality between those sentiment scores and the stock return at any lag.

Figure 4 plots the lag numbers at which the Granger causality model was statistically significant for different stocks in model one (impact of stock market returns on sentiment scores); here the lag is the number of days before the current day at which the stock market return had a causal effect on sentiment scores. Again, stocks with just one bar indicate that the causality model for the other sentiment dataset was not significant at any lag. Although this model was statistically significant for more companies, there is less consistency between the two sentiment datasets.

Evaluation
Although, as long as the F-test in the Granger causality model is statistically significant, the causality test is formally complete, in order to better understand this causal relationship we investigated certain dates for different stocks: the news that affected each company's stock on those dates, and how that news appeared on Twitter to produce our causality results. In the next parts, for different stocks that showed causality in the presented analysis, we focus on specific dates. While examining the news that actually moved the stock, we show that there was a significant trend of that same news on Twitter.

Apple Inc.
According to our Granger causality model, Apple shows a lag of two days in the impact of social media on its stock market return. On February 1st, Apple Inc. ($AAPL) released its profitable first-quarter report, which was above expectations, and the stock went up by $4. On January 31st, Apple also reported a record holiday quarter, stating that iPhone 7 sales boosted earnings after three consecutive quarters of low sales. Sample tweets from this period include 'apple report first numbers slew new products selling including new macbook pro iphone 7 aapl' (1/31/2017) and 'rt optionsaction 3 stocks could account 60 billion market cap swing week aapl fb amzn' (1/31/2017).
As shown in Figure 5, we see a similar growth trend in the sentiment score and the return value from January 30th to February 1st. On January 31st, Apple was set to post its numbers after the stock market closed, which created a trend of tweets suggesting buying Apple stock that day. A total of 354 tweets were sent by verified accounts on this topic on these two dates. Table 4 shows a sample of the tweets mentioned in that two-day period regarding $AAPL.

Facebook Inc.

Similar to Apple, the Granger causality model shows a lag of two days in the impact of social media on Facebook's stock market return (Figure 6). On February 1st, Facebook Inc. ($FB) reached record territory after earnings showed huge growth. A total of 200 tweets were sent by verified accounts on this topic in these two days. Table 5 shows a sample of the tweets mentioned in that two-day period regarding $FB.

Conclusion
In our research on the impact of social media on the stock market, we classified stock-market-related tweets in two different ways: using Amazon Mechanical Turk, and using a classification model with 79.9% accuracy. We then used these two sentiment scores together with stock market returns to understand the causality between the datasets. Granger causality analysis of the two tweet datasets against various stock returns has shown that for many companies there is a statistically significant causality between the stock and the sentiments derived from tweets. Finally, investigating the tweets sent by verified accounts on specific dates shows that when a stock's return jumps due to news regarding the stock, the number of tweets sent on Twitter jumps in the same direction, adding support to the Granger causality analysis.