Deep Attentive Learning for Stock Movement Prediction from Social Media Text and Company Correlations

In the financial domain, risk modeling and profit generation heavily rely on the sophisticated and intricate stock movement prediction task. Stock forecasting is complex, given the stochastic dynamics and non-stationary behavior of the market. Stock movements are influenced by varied factors beyond the conventionally studied historical prices, such as social media and correlations among stocks. The rising ubiquity of online content and knowledge mandates an exploration of models that factor in such multimodal signals for accurate stock forecasting. We introduce an architecture that achieves a potent blend of chaotic temporal signals from financial data, social media, and inter-stock relationships via a graph neural network in a hierarchical temporal fashion. Through experiments on real-world S&P 500 index data and English tweets, we show the practical applicability of our model as a tool for investment decision making and trading.


Introduction
Stock prices have an intrinsically volatile and non-stationary nature, making their rise and fall hard to forecast (Adam et al., 2016). Investment in stock markets involves a high risk regarding profit-making. Prices are driven by diverse factors that include, but are not limited to, company performance (Anthony and Ramesh, 1992), historical trends (Kohara et al., 1997), and investor sentiment (Neal and Wheatley, 1998). Uninformed trading decisions can expose traders and investors to financial risk and monetary losses. In contrast, careful investment choices can maximize profits (de Souza et al., 2018). Conventional research focused on time series and technical analysis (TA) of a stock, i.e., using patterns from historical price signals to forecast stock movements (B et al., 2013). However, price signals alone fail to capture market surprises and the impact of sudden unexpected events. Social media texts like tweets can have huge impacts on the stock market. For instance, US President Donald Trump shared tweets expressing negative sentiments against Lockheed Martin, which led to a loss of around $5.8 billion in the company's market capitalization. The Efficient Market Hypothesis (EMH) (Malkiel, 1989) states that financial markets are informationally efficient, such that stock prices reflect all known information. Existing works (Sec. 2) mainly focus on subsets of stock-relevant data. Although useful, they do not jointly optimize learning over modalities like social media text and inter-stock relations, limiting their potential to capture a broader scope of data affecting stock movements, as we show in Sec. 6. Multimodal stock prediction involves multiple challenges (Hu et al., 2018). Both price signals and tweets exhibit sequential context dependencies, where singular samples may not be informative enough but can be considered as a sequence for a unified context.
Tweets often have diverse influence on stock prices, based on their intrinsic content, such as breaking news as opposed to noise like vague comments. Fusing multiple modalities of vast stock related data generated with varying characteristics (frequency, noise, source) is complex and mandates the careful design of joint optimization over modality-specific components.
Building on the EMH and prior work (Sec. 2), we propose MAN-SF: Multipronged Attention Network for Stock Forecasting, which jointly learns from historical prices, social media, and inter-stock relations. Through hierarchical attention, MAN-SF captures relevant signals across diverse data to train a Graph Attention Network (GAT) for stock prediction (Sec. 3). MAN-SF (Sec. 4) jointly learns from prices and tweets over graph-based models for stock prediction. Through varied experiments (Sec. 5), we show the predictive power of MAN-SF along with a profitability analysis (Sec. 6), and we qualitatively analyze MAN-SF in high-risk scenarios (Sec. 7).
Newer models based on the EMH that are categorized under fundamental analysis (FA) (Dichev and Tang, 2006) account for stock-affecting factors beyond numerical ones, such as investor sentiment expressed through news. Work in natural language processing (NLP) on sources such as news (Hu et al., 2018), social media data (Xu and Cohen, 2018), and earnings calls (Qin and Yang, 2019; Sawhney et al., 2020b) shows the merit of FA in capturing market sentiment, surprises, mergers, and acquisitions that traditional TA-based methods fail to account for. A limitation of existing NLP methods for stock prediction is that they assume stock movements to be independent of each other, contrary to true market function (Diebold and Yilmaz, 2014). This assumption hinders the ability of NLP-centric FA to learn latent patterns for the study of interrelated stocks.
Another line of FA revolves around employing graph-based methods to improve TA (e.g., price-based models) by augmenting them with inter-stock relations (Feng et al., 2019b; Sawhney et al., 2020a). Matsunaga et al. (2019) combine historical prices with stock graphs through Graph Convolution Networks (GCNs), outperforming price-only models. Similarly, Kim et al. (2019) further improve graph neural network methods by weighing stock relations through attention mechanisms, as not all stock movements are equally correlated.
Despite the popularity of NLP and graph-based stock prediction, multimodal methods that capture inter stock relations and market sentiment through linguistic cues are seldom explored. Jue Liu (2019) combines feature extraction from news sentiment scores, financial information (price-earnings ratio, etc.) along with knowledge graph embeddings through TransR. However, such existing approaches (Deng et al., 2019) are unable to represent textual signals from social media and prices temporally, as they only utilize sentiment scores and do not account for stock correlations. To cover this gap in prior research, MAN-SF captures a broader set of features as opposed to both conventional TA and FA that singularly focus on either text or graph modalities, but not both together.

Problem Formulation
MAN-SF's main objective is to learn temporally relevant information jointly from tweets and historical price signals and make use of corporate relations among stocks to predict movements. Following Xu and Cohen (2018), we formalize movement based on the difference between the adjusted closing prices of the stock s ∈ S on trading days d and d − 1. We formulate stock movement prediction as a binary classification problem.
Problem Statement: Given a stock $s \in S$, and historical price data and tweets for $s$ over a lookback window of $T$ days spanning the day range $[t - T, t - 1]$, we define the price movement of stock $s$ from day $t - 1$ to $t$ as:

$$Y_t = \begin{cases} 0, & p^c_t < p^c_{t-1} \\ 1, & p^c_t \geq p^c_{t-1} \end{cases}$$

where $p^c_t$ represents the widely used (Yang et al., 2020; Qin and Yang, 2019) adjusted closing price of a given stock on day $t$. Here, 0 represents a price downfall, and 1 represents a rise in the price.
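As an illustration of the labeling rule, the following is a minimal sketch in plain Python. The function name `movement_labels` and the sample prices are hypothetical, and treating an unchanged price as class 1 is an assumption made here for the boundary case.

```python
def movement_labels(adj_close):
    """Label each day after the first: 1 if the adjusted close did not fall, else 0."""
    labels = []
    for prev, today in zip(adj_close, adj_close[1:]):
        # Assumption: an unchanged price is assigned to class 1.
        labels.append(1 if today >= prev else 0)
    return labels


# Hypothetical adjusted closing prices for five consecutive trading days.
prices = [101.2, 102.0, 101.5, 101.5, 103.1]
print(movement_labels(prices))
```

Each label compares a day with the one immediately before it, so a series of n prices yields n - 1 labels.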

MAN-SF: Components and Learning
In this section, we first give an overview of MAN-SF, followed by a detailed explanation of each component. As shown in Figure 1, MAN-SF first encodes market data for each stock over a fixed period. Formally, we encode stock features $x_t \in \mathbb{R}^w$ for each trading day $t$ as $x_t = B(c_t, q_t)$, where $c_t \in \mathbb{R}^u$ represents a social media feature that we obtain by encoding tweets over the lag window for each stock $s \in S = \{s_1, s_2, \ldots, s_{|S|}\}$. Similarly, $q_t \in \mathbb{R}^v$ are the features obtained from historical prices for a stock in the lag window. We detail these encoders first, and then explain the fusion $B(\cdot)$ over $c_t$ and $q_t$ to obtain $x_t \in \mathbb{R}^w$. We then describe the graph that represents the inter-stock relations. Lastly, we explain the GAT to which the fused feature vector $x_t$ is passed to propagate features based on inter-stock relations, along with the joint optimization of MAN-SF.

Price Encoder
Technical Analysis shows that historical price information is a strong indicator of future trends (Jeanblanc et al., 2009). Therefore, price data from each day is a crucial input to MAN-SF. The Price Encoder shown in Figure 2 encodes historical stock price movements to produce the price feature $q_t$. It takes in a per-day price feature from the lookback of $T$ days and encodes the temporal trend in prices. To capture such sequential dependencies across trading days, we use a Gated Recurrent Unit (GRU) (Giles et al., 2001). The output of the GRU on day $i$ is denoted by:

$$h^p_i = \mathrm{GRU}_p(p_i, h^p_{i-1}), \quad t - T \leq i \leq t - 1$$

where $p_i \in \mathbb{R}^{d_p}$ is the price vector on day $i$ for each stock $s$ in the lookback. The raw price vector comprises a stock's adjusted closing price, highest price, and lowest price for trading day $i$. Since it is the price change that determines the stock movement rather than the absolute price value, we normalize the raw vector by the previous day's adjusted closing price, $p_i = p_i / p^c_{i-1}$. It has been shown that the stock trend of each day has a different impact on stock trend prediction (Feng et al., 2019a). To this end, we employ temporal attention $\zeta(\cdot)$ (Li et al., 2018) that learns to weigh critical days and forms an aggregated feature representation across all hidden states of the GRU (Qin et al., 2017). The temporal attention mechanism yields $q_t = \zeta(h_p)$, where $h_p \in \mathbb{R}^{d_p \times T}$ is the concatenated hidden states of $\mathrm{GRU}_p$ for each stock $s$. This mechanism rewards days with more impactful information and aggregates it from all days in the lag window to produce price features $q_t \in \mathbb{R}^v$.
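The price normalization described above can be sketched as follows. This is only an illustration: the function name `normalize_prices`, the tuple layout (adjusted close, high, low), and the sample values are assumptions, not details taken from the authors' implementation.

```python
def normalize_prices(raw):
    """Divide each day's (adj_close, high, low) by the previous day's adjusted close.

    raw: list of (adj_close, high, low) tuples in chronological order.
    The first day has no predecessor, so the output starts at the second day.
    """
    normalized = []
    for i in range(1, len(raw)):
        prev_close = raw[i - 1][0]
        normalized.append(tuple(value / prev_close for value in raw[i]))
    return normalized


# Hypothetical raw prices for two trading days.
days = [(10.0, 11.0, 9.0), (12.0, 12.5, 10.0)]
print(normalize_prices(days))
```

After this step, every entry is a ratio relative to the previous close, so stocks trading at very different absolute price levels become comparable.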
Temporal Attention: We use a temporal attention mechanism that is a form of additive attention. The mechanism $\zeta(\cdot)$ aggregates all the hidden representations of the GRU across different time-steps into an overall representation with learned adaptive weights (Feng et al., 2019a). We formulate this mechanism as a weighted sum over the hidden states:

$$\zeta(h_z) = \sum_{i} \beta_i h_i$$

where $h_z \in \mathbb{R}^{T \times d_m}$ denotes the concatenated hidden states of the GRU, $h_i$ is the hidden state for trading day $i$, $\beta_i$ represents the learned attention weight for trading day $i$, and $W$ is a learnable parameter matrix used in computing these weights.
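To make the aggregation concrete, here is a minimal sketch of attention pooling over a sequence of hidden states. The scoring function is an assumption: each day is scored by a dot product with a hypothetical learned vector `w`, which stands in for the parameterization with $W$ and is not claimed to match it. The helper names and toy values are likewise illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(hidden_states, w):
    """Pool T hidden states (each a list of d floats) into one d-dim vector.

    Assumption: the score of day i is the dot product w . h_i; the exact
    scoring function of the model is not reproduced here.
    """
    scores = [sum(wk * hk for wk, hk in zip(w, h)) for h in hidden_states]
    beta = softmax(scores)  # one weight per day, summing to 1
    dim = len(hidden_states[0])
    pooled = [sum(b * h[k] for b, h in zip(beta, hidden_states)) for k in range(dim)]
    return pooled, beta


# Hypothetical hidden states for a 3-day lookback with 2-dim states.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, beta = temporal_attention(states, w=[0.0, 0.0])
print(beta)    # zero scores give uniform weights
print(pooled)
```

With a non-zero `w`, days whose hidden states align with `w` receive larger weights and dominate the pooled representation.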

Social Media Information Encoder (SMI)
Xu and Cohen (2018) suggest that tweets not only convey factual data but also portray user sentiment towards stocks, which influences financial prediction (Bollen et al., 2011; Fung et al., 2002). A variety of market factors beyond historical prices drive stock trends (Abu-Mostafa and Atiya, 1996). With the rising ubiquity of the Internet, social media platforms, such as Twitter, influence investors to follow market trends (Tetlock, 2007; Hu et al., 2018). To this end, MAN-SF uses the SMI encoder to extract a feature vector $c_t$ from tweets. The encoder shown in Figure 3 extracts social media features, $c_t$, by first encoding tweets for a day and then over multiple days using a hierarchical attention mechanism (Yang et al., 2016).
Tweet Embedding: For any given tweet $tw$, we generate an embedding vector $m \in \mathbb{R}^d$. We explored word and sentence level embedding methods to learn tweet representations: Global Vectors for Word Representation (GloVe) (Pennington et al., 2014), Fasttext (Joulin et al., 2017), and Universal Sentence Encoders (USE) (Cer et al., 2018). Empirically, sentence-level embeddings generated using a deep averaging network encoder variant of the USE (implementation used: https://tfhub.dev/google/universal-sentence-encoder/2) gave us the most promising results. Thus, we encode each tweet $tw$ using USE.
Learning Representations for one day: On any day $i$, a variable number of tweets $[tw_1, tw_2, \ldots, tw_K]$ for each stock $s$ are posted, and these capture and influence the stock trends (Fung et al., 2002). For each tweet, we obtain a representation using the Tweet Embedding layer (USE) as $[m_1, m_2, \ldots, m_K]$, where $m_j \in \mathbb{R}^d$ and $K$ is the number of tweets per stock on day $i$. To model the sequence of tweets within a day, we use a GRU. For stock $s$ on each day $i$:

$$h^m_j = \mathrm{GRU}_m(m_j, h^m_{j-1}), \quad 1 \leq j \leq K$$

The influence of online tweets on the market can vary greatly (Hu et al., 2018). To identify tweets that are likely to have a more substantial influence on the market, we use an intraday tweet-level attention. For each stock $s$ on each day $i$, the mechanism can be summarized as:

$$r_i = \sum_{j} \gamma_j h^m_j$$

where $h_m \in \mathbb{R}^{K \times d_m}$ denotes a concatenation of all hidden states from $\mathrm{GRU}_m$ and $d_m$ is the dimension of each hidden state. $\gamma_j$ represents the attention weights, and $r_i$ represents the features obtained from the tweets published on day $i$ for each stock $s$. $W$ is a learned linear transformation used in computing the attention weights.
Learning Representations across days: Analyzing a temporal sequence of tweets and combining them can provide a more reliable assessment of market trends (Zhao et al., 2017). We learn a social media representation from the sequence of day-level tweet representations $r_i$. This feature vector encodes all the information in a lookback window. We feed the temporal day-level tweet vectors to a GRU for sequential modeling, given by:

$$h^s_i = \mathrm{GRU}_s(r_i, h^s_{i-1}), \quad t - T \leq i \leq t - 1$$

where $h^s_i$ summarizes the tweets on day $i$ for stock $s$ as well as tweets from preceding days, while focusing on day $i$. Like historical prices, tweets from each day have a different impact on stock movements. Hence, the previously described temporal attention mechanism used for historical prices is also used for social media. This mechanism learns a procedure to aggregate impactful information to form SMI features $c_t$ over a lookback of $T$ days for each stock $s$. The temporal attention mechanism yields $c_t = \zeta(h_s)$, where $h_s \in \mathbb{R}^{T \times d_s}$ represents the concatenated hidden states of $\mathrm{GRU}_s$ and $d_s$ is the size of the output space of the GRU. This temporal attention $\zeta(\cdot)$, along with the intraday tweet-level attention, forms a hierarchical attention mechanism. This mechanism captures the fact that tweets are differently informative and have varied impacts during different market phases. The obtained SMI and price features for each stock are then blended to obtain a joint representation.

Blending Multimodal Information
Signals from different modalities often carry complementary information about different events in the market (Robert P. Schumaker, 2019). Direct concatenation treats information from the Price and SMI encoders equally (Li et al., 2016). Furthermore, the interdependencies between prices and tweets are not appropriately captured, damping the framework's capacity to learn their correlations to market trends (Li et al., 2014). We use a bilinear transformation that learns the pairwise feature interactions from historical price features and tweets. Formally, $q_t \in \mathbb{R}^v$ and $c_t \in \mathbb{R}^u$ are obtained from the Price Encoder and SMI Encoder, respectively.
The output $x_t \in \mathbb{R}^w$ is given by:

$$x_t = \mathrm{ReLU}(q_t^T W c_t + b)$$

where $W \in \mathbb{R}^{w \times v \times u}$ is the weight matrix, and $b \in \mathbb{R}^w$ is the bias. Methods like direct mean and attention-based aggregation do not account for pairwise interactions, as shown in the results (Sec. 6). Other methods, like factorized bilinear pooling (Yu et al., 2017), reduce computational complexity; however, we empirically find that the generalized bilinear layer outperforms these techniques. This layer learns an optimum blend of features from prices and tweets in a translationally invariant manner.
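The bilinear blend can be illustrated with a small pure-Python sketch. It computes one output component per slice of the weight tensor, summing over all price-feature and tweet-feature pairs, adds the bias, and applies a ReLU. The function name `bilinear_fuse` and the toy dimensions and weights are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def bilinear_fuse(q, c, W, b):
    """Return x with x[k] = relu(sum_i sum_j q[i] * W[k][i][j] * c[j] + b[k]).

    q: price features (length v); c: tweet features (length u);
    W: nested list of shape w x v x u; b: bias of length w.
    """
    fused = []
    for W_k, b_k in zip(W, b):
        value = b_k
        for i, q_i in enumerate(q):
            for j, c_j in enumerate(c):
                value += q_i * W_k[i][j] * c_j  # pairwise price-tweet interaction
        fused.append(max(0.0, value))  # ReLU
    return fused


# Toy example with v = 2, u = 1, w = 2 (all values hypothetical).
q = [1.0, 2.0]
c = [3.0]
W = [[[1.0], [0.0]],
     [[0.0], [-1.0]]]
b = [0.5, 0.0]
print(bilinear_fuse(q, c, W, b))
```

Unlike concatenation, every output component here depends on products of a price feature and a tweet feature, which is what lets the layer model interactions between the two modalities.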

Graph Attention Network (GAT)
Stocks are often interlinked with one another, and thus, we model stocks and their relations as a graph. Following Feng et al. (2019b), we make use of Wiki company-based relations. Using Wikidata, we extract first and second-order relations between the company stocks in the S&P 500 index. A first-order relation is defined as $X \xrightarrow{R_1} Y$, where $X$ and $Y$ denote entities in Wikidata that correspond to the two stocks. A second-order relation is defined by $X \xrightarrow{R_2} Z \xleftarrow{R_3} Y$, where $Z$ denotes another entity connecting the two entities $X$ and $Y$. $R_1$, $R_2$, and $R_3$, defined in Wikidata, are different types of entity-relations. For instance, Wells Fargo and Bank of America are related to Berkshire Hathaway via a first-order company relation "owned by." Another example is Microsoft and Berkshire Hathaway, which are related through Bill Gates (second-order relation: "owned by" - "is a board member of"), since Bill Gates possesses ownership over Microsoft and is a board member of Berkshire Hathaway. We define the stock relation network as a graph $G(S, E)$, where $S$ denotes the set of nodes and $E$ is the set of edges. Each node $s \in S$ represents a stock, and two stocks $s_1, s_2 \in S$ are joined by an edge $e \in E$ if $s_1, s_2$ are linked by a first or second-order relation.
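The graph construction can be sketched as building an adjacency map from relation tuples. The function name `build_stock_graph`, the tuple layout, and the choice to treat edges as undirected and unlabeled are assumptions for illustration; the example relations are the ones mentioned in the text.

```python
def build_stock_graph(stocks, first_order, second_order):
    """Return an adjacency map {stock: set of neighbouring stocks}.

    first_order: (x, y) pairs of directly related stocks.
    second_order: (x, z, y) triples where z is the connecting entity.
    Assumption: edges are undirected and relation types are discarded.
    """
    graph = {s: set() for s in stocks}
    pairs = list(first_order) + [(x, y) for x, _z, y in second_order]
    for x, y in pairs:
        if x in graph and y in graph and x != y:
            graph[x].add(y)
            graph[y].add(x)
    return graph


stocks = ["Wells Fargo", "Bank of America", "Berkshire Hathaway", "Microsoft"]
first_order = [("Wells Fargo", "Berkshire Hathaway"),
               ("Bank of America", "Berkshire Hathaway")]
second_order = [("Microsoft", "Bill Gates", "Berkshire Hathaway")]
graph = build_stock_graph(stocks, first_order, second_order)
print(sorted(graph["Berkshire Hathaway"]))
```

The connecting entity of a second-order relation (Bill Gates here) does not become a node; it only justifies the edge between the two stocks.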
Graph Attention: Graph-based representation learning through graph neural networks can be considered as information exchange between related nodes (Gilmer et al., 2017). As each stock has a different degree of influence on another stock, it is essential that the graph encoding suitably weighs more relevant relations between stocks. To this end, we use graph attention networks (GATs), which are graph neural networks with node-level attention (Veličković et al., 2017).
We first describe a single GAT layer that is used throughout the GAT component. The input to the GAT is a set of stock (node) features, $h = [x_1, x_2, \ldots, x_{|S|}]$, where $x_i$ is the encoded multimodal market information (Sec. 4.3). The GAT layer produces an updated set of node features $h' = [z_1, z_2, \ldots, z_{|S|}]$, $z_i \in \mathbb{R}^{w'}$, based on the GAT mechanism (shown in Figure 1). We first apply a shared linear transform parameterized by $W \in \mathbb{R}^{w' \times w}$ to all the nodes. Then, we apply a shared self-attention mechanism to each node $i$ in its immediate neighborhood $N_i$. For each node $j \in N_i$, we compute normalized attention coefficients $\alpha_{ij}$ representing the importance of relations among stocks $i$ and $j$. Formally, $\alpha_{ij}$ is given as:

$$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(a_w^T [W x_i \oplus W x_j]\right)\right)}{\sum_{k \in N_i} \exp\left(\mathrm{LeakyReLU}\left(a_w^T [W x_i \oplus W x_k]\right)\right)}$$

where $\cdot^T$ and $\oplus$ represent transpose and concatenation, respectively, and $a_w \in \mathbb{R}^{2w'}$ is the learnable weight vector of a single-layer feed-forward neural network. The learned attention coefficients $\alpha_{ij}$ are used to weigh and aggregate feature vectors from neighboring stocks with a non-linearity $\sigma$. The updated node feature vector $z_i$ is given as:

$$z_i = \sigma\left(\sum_{j \in N_i} \alpha_{ij} W x_j\right)$$

We use multi-head attention to stabilize training (Vaswani et al., 2017). Formally, $U$ independent executors apply the above attention mechanism. Their output features are concatenated to yield:

$$z_i = \bigoplus_{k=1}^{U} \sigma\left(\sum_{j \in N_i} \alpha^k_{ij} W^k x_j\right)$$

where $\alpha^k_{ij}$ and $W^k$ denote the normalized attention coefficients and the linear transformation parameter matrix computed by the $k$-th attention mechanism.
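A single attention head of the kind described above can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the graph is a small dictionary, the LeakyReLU slope of 0.2 and the use of ELU as the non-linearity are choices made here, isolated nodes simply keep their projected features, and all names and values are hypothetical.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def elu(x):
    return x if x > 0 else math.exp(x) - 1.0


def gat_layer(features, neighbors, W, a):
    """Single-head graph attention over a dict-based graph.

    features: {node: list of floats}; neighbors: {node: list of nodes};
    W: shared projection matrix; a: attention vector of length 2 * len(W).
    Returns (updated features, attention coefficients per node).
    """
    projected = {n: matvec(W, x) for n, x in features.items()}
    updated, alphas = {}, {}
    for i, nbrs in neighbors.items():
        if not nbrs:  # assumption: an isolated node keeps its projected features
            updated[i], alphas[i] = projected[i], []
            continue
        # Unnormalized score per neighbor j: LeakyReLU(a . [W x_i concat W x_j]).
        scores = [
            leaky_relu(sum(ak * vk for ak, vk in zip(a, projected[i] + projected[j])))
            for j in nbrs
        ]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        alphas[i] = [e / total for e in exps]  # softmax over the neighborhood
        dim = len(projected[i])
        agg = [sum(al * projected[j][k] for al, j in zip(alphas[i], nbrs))
               for k in range(dim)]
        updated[i] = [elu(v) for v in agg]  # non-linearity chosen as ELU (assumption)
    return updated, alphas


# Toy graph: B and C carry identical features, so A attends to them equally.
feats = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [0.0, 1.0]}
nbrs = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
identity = [[1.0, 0.0], [0.0, 1.0]]
out, att = gat_layer(feats, nbrs, identity, a=[0.1, 0.2, 0.3, 0.4])
print(att["A"])
print(out["A"])
```

Multi-head attention would run several such heads with separate `W` and `a` and concatenate their outputs per node.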
We use a two-layer GAT; the first layer is followed by an Exponential Linear Unit (Clevert et al., 2015), and the second layer outputs a vector $y_i$ for each stock $i$, which is then used to classify the stock's future price movements. MAN-SF is trained using the Adam optimizer by optimizing the cross-entropy loss, given as:

$$\mathcal{L} = -\sum_{i=1}^{|S|} \left[ Y_i \ln(y_i) + (1 - Y_i) \ln(1 - y_i) \right]$$

where $Y_i$ is the true price movement of stock $i$.

Dataset and Training Setup
We adopt the StockNet dataset (Xu and Cohen, 2018) for the training and evaluation of MAN-SF. The dataset contains data of high-trade-volume stocks in the S&P 500 index in the NYSE and NASDAQ markets. Stock-specific tweets are extracted using regex queries made out of NASDAQ ticker symbols, for instance, $AMZN for Amazon. The price data has been obtained from Yahoo Finance. We shift a 5-day lag window along the trading days to generate samples. We label the samples according to the movement percentage of the closing price such that those ≥ 0.55% and ≤ −0.5% are labeled positive and negative samples, respectively. This leaves us with 26,614 samples divided as 49.78% and 50.22% in the two classes. We temporally split the dataset in a Train:Validation:Test ratio of 70:10:20, leaving us with date ranges from 01/01/2014 to 31/07/2015 for training, 01/08/2015 to 30/09/2015 for validation, and 01/10/2015 to 01/01/2016 for testing. Following Xu and Cohen (2018), we align trading days by dropping samples that lack either prices or tweets, and further align the data across trading windows for related stocks to ensure data is available for all trading days in the window for all stocks. The hidden size of all GRUs is 64, and the USE embedding dimension is 512. We use U = 8 attention heads for both GAT layers. We use the Adam optimizer with a learning rate of 5e−4 and train MAN-SF for 10,000 epochs. It takes 3 hours to train and test MAN-SF on a Tesla K80 GPU. We use early stopping based on the Matthews Correlation Coefficient (MCC) computed over the validation set.
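The sample generation described above can be sketched with two small helpers. The function names are hypothetical, and dropping moves that fall between the two thresholds is an assumption about how unlabeled samples are handled; the thresholds and the 5-day lag come from the text.

```python
def label_sample(pct_move, up=0.55, down=-0.5):
    """Map a percentage move to 1 (positive), 0 (negative), or None (dropped)."""
    if pct_move >= up:
        return 1
    if pct_move <= down:
        return 0
    return None  # assumption: moves between the thresholds are not used


def lag_windows(days, lag=5):
    """Pair every target day with the `lag` trading days that precede it."""
    return [(days[i - lag:i], days[i]) for i in range(lag, len(days))]


# Hypothetical trading-day indices and percentage moves.
print(lag_windows(list(range(8))))
print([label_sample(m) for m in (0.6, -0.5, 0.1)])
```

Sliding the window by one day at a time means consecutive samples share four of their five lookback days.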

Evaluation
Following prior research on stock prediction (Ding et al., 2014; Xu and Cohen, 2018), we use accuracy, F1 score, and MCC (implementations from sklearn) to measure classification performance. We use MCC because, unlike the F1 score, MCC avoids bias due to data skew, as it does not depend on the choice of the positive class and accounts for the true negatives.

For a given confusion matrix $\begin{pmatrix} tp & fn \\ fp & tn \end{pmatrix}$, MCC is computed as:

$$\mathrm{MCC} = \frac{tp \times tn - fp \times fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}$$

Like prior work (Kim et al., 2019; Feng et al., 2019b), to evaluate MAN-SF's applicability to real-world trading, we assess its profitability on the test data of the S&P 500 index using two metrics: Cumulative Profit and Sharpe Ratio (Sharpe, 1994). We follow a trading strategy where, if MAN-SF predicts a rise in a stock's value the next day, then one share of that stock is bought (long position) at the closing price of the current trading session and sold at the next day's closing price. Otherwise, if the strategy speculates a fall in price, a short sell is performed. We compute the cumulative profit (Krauss, 2018) earned as:

$$CP = \sum_{i \in S} \left(p^t_i - p^{t-1}_i\right) \times (-1)^{\mathrm{Action}^{t-1}_i}$$

where $S$ denotes the set of stocks and $p^t_i$ denotes the price of stock $i$ on day $t$. $\mathrm{Action}^{t-1}_i$ is a binary value in $\{0, 1\}$: it is 0 if the long position is taken at time $t$ for stock $i$; otherwise, it is 1.
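Both evaluation quantities can be sketched in a few lines of plain Python. The function names and the data layout are hypothetical. Returning 0 for MCC when the denominator vanishes is a common convention adopted here, and accumulating the profit over all consecutive day pairs in the test period is an assumption about how the per-day quantity is summed.

```python
import math


def mcc(tp, fn, fp, tn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention (assumption): return 0.0 when the denominator is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def cumulative_profit(prices, actions):
    """Sum the daily profit of one-share long (0) or short (1) positions.

    prices: {stock: list of closing prices per day}
    actions: {stock: list of 0/1}, where actions[s][t - 1] governs the
             trade opened on day t - 1 and closed on day t.
    """
    total = 0.0
    for stock, series in prices.items():
        for t in range(1, len(series)):
            sign = (-1) ** actions[stock][t - 1]  # +1 for long, -1 for short
            total += (series[t] - series[t - 1]) * sign
    return total


# Hypothetical example: go long before a rise, short before a fall.
print(mcc(tp=40, fn=10, fp=15, tn=35))
print(cumulative_profit({"X": [10.0, 12.0, 11.0]}, {"X": [0, 1]}))
```

A short position earns money when the price falls, which is why the price difference is multiplied by -1 whenever the action is 1.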

Baselines
We compare MAN-SF with the below baselines spanning both technical and fundamental analysis.
Technical Analysis: These methods use only historical price information.
• RAND: Random guess as price rise or fall.
• ARIMA: Autoregressive Integrated Moving Average models historical prices as a nonstationary time series (Brown, 2004).
Fundamental Analysis: These methods use other modalities such as text information and company relationships along with historical prices.
• TSLDA: Topic Sentiment Latent Dirichlet Allocation model is a generative model that uses sentiments and topic modeling on social media (Nguyen and Shirai, 2015).
• HAN: A hierarchical attention mechanism to encode textual information during a day and across multiple days (Hu et al., 2018).
• StockNet: A variational Autoencoder (VAE) that uses price and text information. Text is encoded using hierarchical attention during and across days. Price features are modeled sequentially (Xu and Cohen, 2018). We compare with all five variants of StockNet.
• HATS: A hierarchical graph attention method that uses a multi-graph to weigh different relationships between stocks. It uses only historical price data (Kim et al., 2019).
• Chen et al. (2018): GCNs to model inter stock relations with only historical price data.

Results and Analysis
We now discuss the experimental results and some findings with their financial implications. Table 1 shows the performance of the compared methods on StockNet's test data split from 01/10/2015 to 31/12/2015 on the S&P 500 index, averaged over ten different runs. Using a learned blend of historical prices and tweets together with corporate relationships, MAN-SF achieves the best performance, outperforming the strongest baselines, StockNet and Adversarial LSTM. We also note that Fundamental Analysis (FA) techniques outperform numerical-only Technical Analysis (TA) methods, reiterating the effectiveness of factoring in social media signals and inter-stock relations. These results empirically validate the effectiveness of multimodal signals due to a broader capture of stock price influencing information, including tweets and other related stocks.

Performance Comparison
Ablation Study: In Table 2, we observe the ability of price and text models to predict the market trend to an extent using unimodal features. Improvements over individual modalities are noted with the inclusion of a graph-based learning model, i.e., GCN and GAT, validating the premise of using inter-stock relations for enhanced forecasting.
When the text and price signals are fused, and more relevant information is extracted using the attention mechanisms, a performance gain is seen. The ablation study ties in with the EMH: as we add additional modalities, we note an improvement in MAN-SF's ability to predict stock movements. Two critical observations from Table 2 are the substantial MCC gains when using GAT over GCN and the contrast between fusing text and prices via concatenation and via bilinear transformations. We discuss these next.
Analyzing Graph Attention: We notice that equally weighing all correlations using GCN-based models leads to smaller performance gains, as shown in Table 2, compared to GAT-based models (GAT and MAN-SF variants). To analyze this difference, we first calculate each neighbor's attention scores in the stock relations graph, as shown in Figure 4b. By analyzing the different stock associations with the highest and lowest attention scores, we observe that some relations between stocks, such as being a part of the same industry or having the same founder, are more critical than other relations, like stocks having the same country of origin. For instance, C (CitiCorp) and JPM (JP Morgan) have a relatively high attention score and are a part of the same investment and banking industry, whereas the attention score for JPM and CSCO (Cisco) is relatively low. We also observe that some stocks share hidden correlations captured by the GAT due to the market's temporal nature. We explain one such example in Section 7.
Profitability: We examine MAN-SF's practical applicability through a profitability analysis on real-world stock data. From Table 3 and Figure 6, we note that MAN-SF achieves higher risk-adjusted returns and an overall profit. MAN-SF outperforms different baselines over the common testing period of three months using the stock data in the S&P 500 index. These observations show the profitability of MAN-SF over models that do not capture stock correlations (StockNet) and models that do not use the impact of textual data (HATS). We potentially attribute these improvements to MAN-SF's ability to learn a more concentrated blend of text and price features as opposed to competitive

Qualitative Analysis
We conduct an extended analysis across two high-risk scenarios, as shown in Figure 5, to study the applicability of MAN-SF to investors in the stock market. The study is based on Apple's (AAPL) trend from 12th Nov to 18th Nov. Figure 5 shows some of the tweets posted and AAPL's relations with relevant stocks such as Alibaba (BABA) and Google (GOOG), among others, during that period.
12th Nov to 16th Nov: Failure of StockNet and models that do not capture inter-stock relations: From the price movement in Figure 5, we see that 12th to 16th November 2015 shows a decline in Apple's stock price. Here, we observe that StockNet predicts a further drop in Apple's price, and similar models that use only price and text are unable to correctly predict the price rise for Apple on 17th November. However, we discover that Apple shares a strong relationship with Alibaba and Google during that time, as indicated by the attention weights. MAN-SF incorporates inter-stock relations through graph attention to learn latent correlations between AAPL, BABA, and GOOG, as shown by the graph snippet in Figure 5. MAN-SF correctly predicts a rise in Apple's price and makes a profit, unlike StockNet. We attribute this prediction to MAN-SF likely having a broader context by blending multimodal signals.
14th Nov to 18th Nov: Failure of HATS and models that do not leverage social media data: Despite Apple's sharp fall on 18th November, we see tweets with positive sentiment having higher attention weights during the lookback window, indicating a possible increase in Apple's price. MAN-SF uses hierarchical attention mechanisms over tweets and inter-stock correlations, thereby likely predicting a rise in Apple's stock price correctly, similar to models such as StockNet. In contrast, models such as HATS forecast a continual decrease in Apple's price, potentially due to not factoring in social media data.

Conclusion and Future Work
We study stock movement prediction using natural language, graph-based, and numeric features. We propose MAN-SF, a neural model that jointly learns temporally relevant signals from chaotic multimodal data spanning historical prices, tweets, and inter-stock correlations in a hierarchical fashion. Extensive quantitative and qualitative experiments on real market data demonstrate MAN-SF's applicability for neural stock forecasting. We plan to further use news articles, earnings calls, and other data sources to better capture market dynamics. Another interesting direction of future research is to explore the cold start problem, where MAN-SF could be leveraged to predict stock movements for new stocks. Lastly, we would also like to extend MAN-SF's architecture so that it is not limited to modeling all stocks together (a consequence of its GAT component), to increase scalability to cross-market scenarios.