Stock Movement Prediction from Tweets and Historical Prices

Stock movement prediction is a challenging problem: the market is highly stochastic, and we make temporally-dependent predictions from chaotic data. We treat these three complexities and present a novel deep generative model jointly exploiting text and price signals for this task. Unlike the case with discriminative or topic modeling, our model introduces recurrent, continuous latent variables for a better treatment of stochasticity, and uses neural variational inference to address the intractable posterior inference. We also provide a hybrid objective with temporal auxiliary to flexibly capture predictive dependencies. We demonstrate the state-of-the-art performance of our proposed model on a new stock movement prediction dataset which we collected.


Introduction
Stock movement prediction has long attracted both investors and researchers (Frankel, 1995; Edwards et al., 2007; Bollen et al., 2011; Hu et al., 2018). We present a model to predict stock price movement from tweets and historical stock prices.
In natural language processing (NLP), public news and social media are two primary content resources for stock market prediction, and the models that use these sources are often discriminative. Among them, classic research relies heavily on feature engineering (Schumaker and Chen, 2009; Oliveira et al., 2013). With the prevalence of deep neural networks (Le and Mikolov, 2014), event-driven approaches were studied with structured event representations (Ding et al., 2014, 2015). More recently, Hu et al. (2018) propose to mine news sequences directly from text with hierarchical attention mechanisms for stock trend prediction.
However, stock movement prediction is widely considered difficult due to the high stochasticity of the market: stock prices are largely driven by new information, resulting in a random-walk pattern (Malkiel, 1999). Instead of using only deterministic features, generative topic models were extended to jointly learn topics and sentiments for the task (Si et al., 2013; Nguyen and Shirai, 2015). Compared to discriminative models, generative models have a natural advantage in depicting the generative process from market information to stock signals and in introducing randomness. However, these models underrepresent chaotic social texts with bag-of-words and employ simple discrete latent variables.
In essence, stock movement prediction is a time series problem. The significance of the temporal dependency between movement predictions is not addressed in existing NLP research. For instance, when a company suffers from a major scandal on a trading day $d_1$, generally, its stock price will have a downtrend in the coming trading days until day $d_2$, i.e. $[d_1, d_2]$. If a stock predictor can recognize this decline pattern, it is likely to benefit all the predictions of the movements during $[d_1, d_2]$. Otherwise, the accuracy in this interval might be harmed. This predictive dependency is a result of the fact that public information, e.g. a company scandal, needs time to be absorbed into movements (Luss and d'Aspremont, 2015), and thus is largely shared across temporally-close predictions.
Aiming to tackle the above-mentioned outstanding research gaps in terms of modeling high market stochasticity, chaotic market information and temporally-dependent prediction, we propose StockNet, a deep generative model for stock movement prediction.
To better incorporate stochastic factors, we generate stock movements from latent driven factors modeled with recurrent, continuous latent variables. Motivated by Variational Auto-Encoders (VAEs; Kingma and Welling, 2013; Rezende et al., 2014), we propose a novel decoder with a variational architecture and derive a recurrent variational lower bound for end-to-end training (Section 5.2). To the best of our knowledge, StockNet is the first deep generative model for stock movement prediction.
To fully exploit market information, StockNet directly learns from data without pre-extracting structured events. We build market sources by referring to both fundamental information, e.g. tweets, and technical features, e.g. historical stock prices (Section 5.1). [3]
To accurately depict predictive dependencies, we assume that the movement prediction for a stock can benefit from learning to predict its historical movements in a lag window. We propose trading-day alignment as the framework basis (Section 4), and further provide a novel multi-task learning objective (Section 5.3).
We evaluate StockNet on a stock movement prediction task with a new dataset that we collected. Compared with strong baselines, our experiments show that StockNet achieves state-of-the-art performance by incorporating both data from Twitter and historical stock price listings.

Problem Formulation
We aim at predicting the movement of a target stock $s$ in a pre-selected stock collection $S$ on a target trading day $d$. Formally, we use the market information comprising relevant social media corpora $\mathcal{M}$, i.e. tweets, and historical prices, in the lag $[d - \Delta d, d - 1]$, where $\Delta d$ is a fixed lag size. We estimate the binary movement

$$y = \mathbb{1}\left(p^c_d > p^c_{d-1}\right) \quad (1)$$

where 1 denotes rise and 0 denotes fall, and $p^c_d$ denotes the adjusted closing price, i.e. the closing price adjusted for corporate actions affecting stock prices, e.g. dividends and splits. [4] The adjusted closing price is widely used for predicting stock price movement (Xie et al., 2013) or financial volatility (Rekabsaz et al., 2017).

[3] To a fundamentalist, stocks have their intrinsic values that can be derived from the behavior and performance of their company. On the contrary, technical analysis considers only the trends and patterns of the stock price.
[4] Technically, $d - 1$ may not be an eligible trading day and thus has no available price information. In the rest of this paper, the problem is solved by keeping the notational consistency with our recurrent model and using its time step $t$ to index trading days. Details will be provided in Section 4. We use $d$ here to make the formulation easier to follow.

Data Collection
In finance, stocks are categorized into 9 industries: Basic Materials, Consumer Goods, Healthcare, Services, Utilities, Conglomerates, Financial, Industrial Goods and Technology. [5] Since high-trade-volume stocks tend to be discussed more on Twitter, we select the two-year price movements from 01/01/2014 to 01/01/2016 of 88 stocks to target, comprising all the 8 stocks in Conglomerates and the top 10 stocks in capital size in each of the other 8 industries (see supplementary material).
We observe that there are a number of targets with exceptionally minor movement ratios. In a three-way stock trend prediction task, a common practice is to categorize these movements to another "preserve" class by setting upper and lower thresholds on the stock price change (Hu et al., 2018). Since we aim at the binary classification of stock changes identifiable from social media, we set two particular thresholds, -0.5% and 0.55%, and simply remove the 38.72% of the selected targets whose movement percentages fall between the two thresholds. Samples with movement percentages ≤ -0.5% and > 0.55% are labeled with 0 and 1, respectively. The two thresholds are selected to balance the two classes, resulting in 26,614 prediction targets in the whole dataset with 49.78% and 50.22% of them in the two classes. We split them temporally: 20,339 movements between 01/01/2014 and 01/08/2015 are for training, 2,555 movements from 01/08/2015 to 01/10/2015 are for development, and 3,720 movements from 01/10/2015 to 01/01/2016 are for test.
There are two main components in our dataset: [6] a Twitter dataset and a historical price dataset. We access Twitter data under the official license of Twitter, then retrieve stock-specific tweets by querying regexes made up of NASDAQ ticker symbols, e.g. "\$GOOG\b" for Google Inc. We preprocess tweet texts using the NLTK package (Bird et al., 2009) with the particular Twitter mode, including for tokenization and treatment of hyperlinks, hashtags and the "@" identifier. To alleviate sparsity, we further filter samples by ensuring there is at least one tweet for each corpus in the lag. We extract historical prices for the 88 selected stocks from Yahoo Finance to build the historical price dataset.

[5] https://finance.yahoo.com/industries
[6] Our dataset is available at https://github.com/yumoxu/stocknet-dataset.

Model Overview
Figure 1: Illustration of the generative process from observed market information to stock movements. We use solid lines to denote the generation process and dashed lines to denote the variational approximation to the intractable posterior.
We provide an overview of data alignment, model factorization and model components.
As explained in Section 1, we assume that predicting the movement on trading day d can benefit from predicting the movements on its former trading days. However, due to the general principle of sample independence, building connections directly across samples with temporally-close target dates is problematic for model training.
As an alternative, we notice that within a sample with a target trading day d there are likely to be other trading days than d in its lag that can simulate the prediction targets close to d. Motivated by this observation and multi-task learning (Caruana, 1998), we make movement predictions not only for d, but also other trading days existing in the lag. For instance, as shown in Figure 2, for a sample targeting 07/08/2012 and a 5-day lag, 03/08/2012 and 06/08/2012 are eligible trading days in the lag and we also make predictions for them using the market information in this sample. The relations between these predictions can thus be captured within the scope of a sample.
As shown in the instance above, not every single date in a lag is an eligible trading day, e.g. weekends and holidays. To better organize and use the input, we regard the trading day, instead of the calendar day used in existing research, as the basic unit for building samples. To this end, we first find all the $T$ eligible trading days referred to in a sample, in other words, those existing in the time interval $[d - \Delta d + 1, d]$. For clarity, in the scope of one sample, we index these trading days with $t \in [1, T]$, and each of them maps to an actual (absolute) trading day $d_t$. We then propose trading-day alignment: we reorganize our inputs, including the tweet corpora and historical prices, by aligning them to these $T$ trading days. Specifically, on the $t$th trading day, we recognize market signals from the corpus $\mathcal{M}_t$ in $[d_{t-1}, d_t)$ and the historical prices $p_t$ on $d_{t-1}$, for predicting the movement $y_t$ on $d_t$. We provide an aligned sample for illustration in Figure 2. As a result, every single unit in a sample is a trading day, and we can predict a sequence of movements $y = [y_1, \ldots, y_T]$. The main target is $y_T$ while the remainder $y^* = [y_1, \ldots, y_{T-1}]$ serves as the temporal auxiliary target. We use these in addition to the main target to improve prediction accuracy (Section 5.3).
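Trading-day alignment can be illustrated with a minimal Python sketch. It assumes a sorted list of trading days is available; the function name `align_trading_days` and the toy calendar (weekdays only, holidays ignored) are hypothetical and not part of our released code.

```python
from bisect import bisect_left
from datetime import date, timedelta


def align_trading_days(target, lag_size, trading_days):
    """Return (d_prev, d_t) pairs for every trading day d_t in [target - lag_size + 1, target].

    `trading_days` is a sorted list of dates on which the market was open
    (a hypothetical calendar supplied by the caller). d_prev is the trading
    day immediately before d_t, so tweets are gathered from [d_prev, d_t)
    and prices are taken on d_prev.
    """
    start = target - timedelta(days=lag_size - 1)
    pairs = []
    for d_t in trading_days:
        if start <= d_t <= target:
            idx = bisect_left(trading_days, d_t)
            if idx == 0:
                continue  # no earlier trading day to align against
            pairs.append((trading_days[idx - 1], d_t))
    return pairs


# Weekdays of the first two weeks of August 2012 (holidays ignored for brevity).
calendar = [date(2012, 8, day) for day in range(1, 15)
            if date(2012, 8, day).weekday() < 5]
aligned = align_trading_days(date(2012, 8, 7), 5, calendar)
```

For the target 07/08/2012 with a 5-day lag, the sketch keeps the three trading days 03/08, 06/08 and 07/08, and pairs 06/08 with 03/08 across the weekend.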
We model the generative process shown in Figure 1. We encode observed market information as a random variable $X = [x_1; \ldots; x_T]$, from which we generate the latent driven factor $Z = [z_1; \ldots; z_T]$ for our prediction task. For the aforementioned multi-task learning purpose, we aim at modeling the conditional probability distribution $p_\theta(y|X) = \int_Z p_\theta(y, Z|X)$ instead of $p_\theta(y_T|X)$. We write the following factorization for generation,

$$p_\theta(y, Z|X) = p_\theta(y_T|X, Z)\, p_\theta(z_T|z_{<T}, X) \prod_{t=1}^{T-1} p_\theta(y_t|x_{\le t}, z_t)\, p_\theta(z_t|z_{<t}, x_{\le t}, y_t) \quad (2)$$

where for a given indexed matrix of $T$ vectors $[v_1; \ldots; v_T]$, we denote by $v_{<t}$ and $v_{\le t}$ the submatrices $[v_1; \ldots; v_{t-1}]$ and $[v_1; \ldots; v_t]$, respectively. Since $y^*$ is known in generation, we use the posterior $p_\theta(z_t|z_{<t}, x_{\le t}, y_t)$, $t < T$, to incorporate market signals more accurately and only use the prior $p_\theta(z_T|z_{<T}, X)$ when generating $z_T$. Besides, when $t < T$, $y_t$ is independent of $z_{<t}$ while our main prediction target, $y_T$, is made dependent on $z_{<T}$ through a temporal attention mechanism (Section 5.3).
We show StockNet modeling the above generative process in Figure 2. In a nutshell, StockNet comprises three primary components following a bottom-up fashion,
1. Market Information Encoder (MIE) that encodes tweets and prices to $X$;
2. Variational Movement Decoder (VMD) that infers $Z$ with $X$, $y$ and decodes stock movements $y$ from $X$, $Z$;
3. Attentive Temporal Auxiliary (ATA) that integrates temporal loss through an attention mechanism for model training.
Figure 2: The architecture of StockNet. We use the main target of 07/08/2012 and the lag size of 5 for illustration. Since 04/08/2012 and 05/08/2012 are not trading days (a weekend), trading-day alignment helps StockNet to organize message corpora and historical prices for the other three trading days in the lag. We use dashed lines to denote auxiliary components. Red points denoting temporal objectives are integrated with a temporal attention mechanism to acquire the final training objective.

Model Components
We detail next the components of our model (MIE, VMD, ATA) and the way we estimate our model parameters.

Market Information Encoder
MIE encodes information from social media and stock prices to enhance market information quality, and outputs the market information input $X$ for VMD. Each temporal input is defined as

$$x_t = [c_t, p_t] \quad (3)$$

where $c_t$ and $p_t$ are the corpus embedding and the historical price vector, respectively.
The basic strategy of acquiring $c_t$ is to first feed messages into the Message Embedding Layer for their low-dimensional representations, then selectively gather them according to their quality. To handle the circumstance that multiple stocks are discussed in one single message, in addition to text information, we incorporate the position information of stock symbols mentioned in messages as well. Specifically, the layer consists of a forward GRU and a backward GRU for the preceding and following contexts of a stock symbol, $s$, respectively. Formally, in the message corpus of the $t$th trading day, we denote the word sequence of the $k$th message, $k \in [1, K]$, as $W$ where $W_{\ell^*} = s$, $\ell^* \in [1, L]$, and its word embedding matrix as $E = [e_1; e_2; \ldots; e_L]$. We run the two GRUs as follows,

$$\overrightarrow{h}_f = \overrightarrow{\mathrm{GRU}}(e_f, \overrightarrow{h}_{f-1}) \quad (4)$$
$$\overleftarrow{h}_b = \overleftarrow{\mathrm{GRU}}(e_b, \overleftarrow{h}_{b+1}) \quad (5)$$
$$m = (\overrightarrow{h}_{\ell^*} + \overleftarrow{h}_{\ell^*}) / 2 \quad (6)$$

where $f \in [1, \ldots, \ell^*]$ and $b \in [\ell^*, \ldots, L]$. The stock symbol is regarded as the last unit in both the preceding and the following contexts where the hidden values, $\overrightarrow{h}_{\ell^*}$ and $\overleftarrow{h}_{\ell^*}$, are averaged to acquire the message embedding $m$. Gathering all message embeddings for the $t$th trading day, we have a message embedding matrix $M_t \in \mathbb{R}^{d_m \times K}$. In practice, the layer takes as inputs a five-rank tensor for a mini-batch, and yields all $M_t$ in the batch with shared parameters.
Tweet quality varies drastically. Inspired by the news-level attention (Hu et al., 2018), we weight messages with their respective salience in collective intelligence measurement. Specifically, we first project $M_t$ non-linearly to $u_t$, the normalized attention weight over the corpus,

$$u_t = \zeta(w_u^\top \tanh(W_{m,u} M_t)) \quad (7)$$

where $\zeta(\cdot)$ is the softmax function and $W_{m,u} \in \mathbb{R}^{d_m \times d_m}$, $w_u \in \mathbb{R}^{d_m \times 1}$ are model parameters.
Then we compose messages accordingly to acquire the corpus embedding,

$$c_t = M_t u_t^\top \quad (8)$$

Since it is the price change that determines the stock movement rather than the absolute price value, instead of directly feeding the raw price vector $\tilde{p}_t = [\tilde{p}^c_t, \tilde{p}^h_t, \tilde{p}^l_t]$, comprising the adjusted closing, highest and lowest price on a trading day $t$, into the networks, we normalize it with its last adjusted closing price, $p_t = \tilde{p}_t / \tilde{p}^c_{t-1} - 1$. We then concatenate $c_t$ with $p_t$ to form the final market information input $x_t$ for the decoder.
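The message-level attention, corpus composition and price normalization can be sketched in plain Python as follows. This is an illustrative sketch rather than our implementation: the function names are hypothetical, the tanh projection is an assumption about the non-linearity, the toy parameters are hand-picked instead of learned, and messages are stored as a list of vectors rather than as columns of $M_t$.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def corpus_embedding(messages, W_mu, w_u):
    """Attention-pool K message embeddings (each of size d_m) into one vector.

    score_k = w_u . tanh(W_mu m_k); u = softmax(score); c = sum_k u_k m_k.
    """
    scores = []
    for m in messages:
        projected = [math.tanh(sum(w * x for w, x in zip(row, m))) for row in W_mu]
        scores.append(sum(w * p for w, p in zip(w_u, projected)))
    u = softmax(scores)
    dim = len(messages[0])
    c = [sum(u[k] * messages[k][j] for k in range(len(messages))) for j in range(dim)]
    return c, u


def normalize_prices(raw_prices, prev_adj_close):
    """Turn raw (adjusted close, high, low) prices into relative changes."""
    return [p / prev_adj_close - 1.0 for p in raw_prices]


# Toy inputs: two 2-dimensional message embeddings and an identity projection.
msgs = [[1.0, 0.0], [0.0, 1.0]]
c_t, u_t = corpus_embedding(msgs, W_mu=[[1.0, 0.0], [0.0, 1.0]], w_u=[1.0, 1.0])
p_t = normalize_prices([102.0, 105.0, 99.0], prev_adj_close=100.0)
x_t = c_t + p_t  # list concatenation plays the role of [c_t, p_t]
```

In this toy case both messages receive the same score, so the corpus embedding is their plain average, and the normalized prices are relative changes of roughly 2%, 5% and -1% with respect to the previous adjusted close.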

Variational Movement Decoder
The purpose of VMD is to recurrently infer and decode the latent driven factor Z and the movement y from the encoded market information X.

Inference
While latent driven factors help to depict the market status leading to stock movements, the posterior inference in the generative model shown in Eq. (2) is intractable. Following the spirit of the VAE, we use deep neural networks to fit latent distributions, i.e. the prior $p_\theta(z_t|z_{<t}, x_{\le t})$ and the posterior $p_\theta(z_t|z_{<t}, x_{\le t}, y_t)$, and sidestep the intractability through neural approximation and reparameterization (Kingma and Welling, 2013; Rezende et al., 2014). We first employ a variational approximator $q_\phi(z_t|z_{<t}, x_{\le t}, y_t)$ for the intractable posterior. We observe the following factorization,

$$q_\phi(Z|X, y) = \prod_{t=1}^{T} q_\phi(z_t|z_{<t}, x_{\le t}, y_t) \quad (9)$$

Neural approximation aims at minimizing the Kullback-Leibler divergence between $q_\phi(Z|X, y)$ and $p_\theta(Z|X, y)$. Instead of optimizing it directly, we observe that the following equation naturally holds,

$$\log p_\theta(y|X) = D_{KL}[q_\phi(Z|X, y) \,\|\, p_\theta(Z|X, y)] + \mathbb{E}_{q_\phi(Z|X, y)}[\log p_\theta(y|X, Z)] - D_{KL}[q_\phi(Z|X, y) \,\|\, p_\theta(Z|X)] \quad (10)$$

where $D_{KL}[q \,\|\, p]$ is the Kullback-Leibler divergence between the distributions $q$ and $p$. Therefore, we equivalently maximize the following variational recurrent lower bound by plugging Eq. (2, 9) into Eq. (10),

$$\mathcal{L}(\theta, \phi; X, y) = \sum_{t=1}^{T} \mathbb{E}_{q_\phi(z_t|z_{<t}, x_{\le t}, y_t)} \left\{ \log p_\theta(y_t|x_{\le t}, z_{\le t}) - D_{KL}[q_\phi(z_t|z_{<t}, x_{\le t}, y_t) \,\|\, p_\theta(z_t|z_{<t}, x_{\le t})] \right\} \le \log p_\theta(y|X) \quad (11)$$

where the likelihood term

$$p_\theta(y_t|x_{\le t}, z_{\le t}) = \begin{cases} p_\theta(y_t|x_{\le t}, z_t), & \text{if } t < T \\ p_\theta(y_T|X, Z), & \text{if } t = T. \end{cases} \quad (12)$$

Li et al. (2017) also provide a lower bound for inferring directly-connected recurrent latent variables in text summarization. In their work, priors are modeled with $p_\theta(z_t) \sim \mathcal{N}(0, I)$, which, in fact, turns the KL term into a static regularization term encouraging sparsity. In Eq. (11), we provide a more theoretically rigorous lower bound where the KL term with $p_\theta(z_t|z_{<t}, x_{\le t})$ plays a dynamic role in inferring dependent latent variables for every different model input and latent history.

Decoding
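As per time series, VMD adopts an RNN with a GRU cell to extract features and decode stock signals recurrently,

$$h^s_t = \mathrm{GRU}(x_t, h^s_{t-1}) \quad (13)$$

where $h^s_t$ is the hidden state of the GRU. We let the approximator $q_\phi(z_t|z_{<t}, x_{\le t}, y_t)$ follow a multivariate Gaussian distribution $\mathcal{N}(\mu, \delta^2 I)$. We calculate $\mu$ and $\delta$ as

$$\mu_t = W^\phi_{z,\mu} h^z_t + b^\phi_\mu \quad (14)$$
$$\log \delta^2_t = W^\phi_{z,\delta} h^z_t + b^\phi_\delta \quad (15)$$

and the shared hidden representation $h^z_t$ as

$$h^z_t = \tanh(W^\phi_z [z_{t-1}, x_t, h^s_t, y_t] + b^\phi_z) \quad (16)$$

where $W^\phi_{z,\mu}$, $W^\phi_{z,\delta}$, $W^\phi_z$ are weight matrices and $b^\phi_\mu$, $b^\phi_\delta$, $b^\phi_z$ are biases. Since the Gaussian distribution belongs to the "location-scale" distribution family, we can further reparameterize $z_t$ as

$$z_t = \mu_t + \delta_t \odot \epsilon \quad (17)$$

where $\odot$ denotes an element-wise product. The noise term $\epsilon \sim \mathcal{N}(0, I)$ naturally involves stochastic signals in our model.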
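The reparameterization step can be sketched as below, assuming the network outputs the mean and the log-variance of the approximate posterior. The function name `reparameterize` and the option of passing the noise explicitly are illustrative choices, not part of our implementation.

```python
import math
import random


def reparameterize(mu, log_var, eps=None, rng=random):
    """Sample z = mu + delta * eps element-wise, with delta = exp(0.5 * log_var).

    `eps` can be passed explicitly for reproducibility; otherwise it is drawn
    from a standard normal distribution.
    """
    if eps is None:
        eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, log_var, eps)]


mu_t = [0.5, -1.0]
log_var_t = [0.0, math.log(4.0)]            # delta = [1.0, 2.0]
z_fixed = reparameterize(mu_t, log_var_t, eps=[1.0, -0.5])
z_random = reparameterize(mu_t, log_var_t)  # stochastic draw
```

Because the randomness enters only through the noise term, the sample stays a deterministic, differentiable function of the mean and the log-variance, which is what allows gradients to flow through the sampling step.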
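Similarly, we let the prior $p_\theta(z_t|z_{<t}, x_{\le t}) \sim \mathcal{N}(\mu', \delta'^2 I)$. Its calculation is the same as that of the posterior except for the absence of $y_t$ and independent model parameters,

$$\mu'_t = W^\theta_{z,\mu} h^{z\prime}_t + b^\theta_\mu \quad (18)$$
$$\log \delta'^2_t = W^\theta_{z,\delta} h^{z\prime}_t + b^\theta_\delta \quad (19)$$
$$h^{z\prime}_t = \tanh(W^\theta_z [z_{t-1}, x_t, h^s_t] + b^\theta_z) \quad (20)$$

where $h^s_t$ is the hidden state of the VMD GRU. Following Zhang et al. (2016), differently from the posterior, we set the prior $z_t = \mu'_t$ during decoding. Finally, we integrate deterministic features and the final prediction hypothesis is given as

$$g_t = \tanh(W_g [x_t, h^s_t, z_t] + b_g) \quad (21)$$
$$\tilde{y}_t = \zeta(W_y g_t + b_y), \quad t < T \quad (22)$$

where $W_g$, $W_y$ are weight matrices and $b_g$, $b_y$ are biases. The softmax function $\zeta(\cdot)$ outputs the confidence distribution over up and down. As introduced in Section 4, the decoding of the main target $y_T$ depends on $z_{<T}$ and thus lies at the interface between VMD and ATA. We will elaborate on it in the next section.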
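Since both the approximate posterior and the prior are Gaussians with diagonal covariance, the KL term between them admits a closed form. The sketch below computes it in plain Python; we assume here that the closed form is used, which the text above does not state explicitly, and the function name `gaussian_kl` is hypothetical.

```python
import math


def gaussian_kl(mu_q, log_var_q, mu_p, log_var_p):
    """KL[N(mu_q, diag(var_q)) || N(mu_p, diag(var_p))] in closed form."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, log_var_q, mu_p, log_var_p):
        var_q, var_p = math.exp(lq), math.exp(lp)
        total += 0.5 * (lp - lq + (var_q + (mq - mp) ** 2) / var_p - 1.0)
    return total


# Identical distributions give zero divergence.
kl_same = gaussian_kl([0.3, -0.2], [0.1, 0.4], [0.3, -0.2], [0.1, 0.4])
# Unit-variance Gaussians whose means differ by 1 give 0.5.
kl_shift = gaussian_kl([0.0], [0.0], [1.0], [0.0])
```

Unlike a fixed standard-normal prior, the prior parameters here change with the input and the latent history, so the same function yields a different regularization strength at every time step.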

Attentive Temporal Auxiliary
With the acquisition of a sequence of auxiliary predictions $\tilde{Y}^* = [\tilde{y}_1; \ldots; \tilde{y}_{T-1}]$, we incorporate two-fold auxiliary effects into the main prediction and the training objective flexibly by first introducing a shared temporal attention mechanism.
Since each hypothesis of a temporal auxiliary contributes unequally to the main prediction and model training, as shown in Figure 3, temporal attention calculates their weights in these two contributions by employing two scoring components: an information score and a dependency score. Specifically,

$$v_i = w_i^\top \tanh(W_{g,i} G^*) \quad (23)$$
$$v_d = g_T^\top \tanh(W_{g,d} G^*) \quad (24)$$
$$v^* = \zeta(v_i \odot v_d) \quad (25)$$

where $W_{g,i}, W_{g,d} \in \mathbb{R}^{d_g \times d_g}$, $w_i \in \mathbb{R}^{d_g \times 1}$ are model parameters. The integrated representations $G^* = [g_1; \ldots; g_{T-1}]$ and $g_T$ are reused as the final representations of temporal market information. The information score $v_i$ evaluates historical trading days as per their own information quality, while the dependency score $v_d$ captures their dependencies with our main target. We integrate the two and acquire the final normalized attention weight $v^* \in \mathbb{R}^{1 \times (T-1)}$ by feeding their element-wise product into the softmax function. As a result, the main prediction can benefit from the temporally-close hypotheses that have been made, and we decode our main hypothesis $\tilde{y}_T$ as

$$\tilde{y}_T = \zeta(W_T [\tilde{Y}^* v^{*\top}, g_T] + b_T) \quad (26)$$

where $W_T$ is a weight matrix and $b_T$ is a bias.
As to the model objective, we use the Monte Carlo method to approximate the expectation term in Eq. (11) and typically only one sample is used for gradient computation. To incorporate varied temporal importance at the objective level, we first break down the approximated $\mathcal{L}$ into a series of temporal objectives $f \in \mathbb{R}^{T \times 1}$ where $f_t$ comprises a likelihood term and a KL term for a trading day $t$,

$$f_t = \log p_\theta(y_t|x_{\le t}, z_{\le t}) - \lambda D_{KL}[q_\phi(z_t|z_{<t}, x_{\le t}, y_t) \,\|\, p_\theta(z_t|z_{<t}, x_{\le t})] \quad (27)$$

where we adopt the KL term annealing trick (Bowman et al., 2016; Semeniuta et al., 2017) and add a linearly-increasing KL term weight $\lambda \in (0, 1]$ to gradually release the KL regularization effect in the training procedure. Then we reuse $v^*$ to build the final temporal weight vector,

$$v = [\alpha v^*, 1]^\top \quad (28)$$

where 1 is for the main prediction and we adopt the auxiliary weight $\alpha \in [0, 1]$ to control the overall auxiliary effects on the model training. $\alpha$ is tuned on the development set and its effects will be discussed at length in Section 6.5.
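Finally, we write the training objective $F$ by recomposition,

$$F(\theta, \phi; X, y) = \frac{1}{N} \sum_{n=1}^{N} v^{(n)\top} f^{(n)} \quad (29)$$

where $v^{(n)}$ and $f^{(n)}$ are the temporal weight vector and the temporal objectives of the $n$th of $N$ samples, and where our model can learn to generalize with the selective attendance of temporal auxiliary. We take the derivative of $F$ with respect to all the model parameters $\{\theta, \phi\}$ through backpropagation for the update.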
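The recomposition of the objective for a single sample can be sketched as follows. The per-day log-likelihoods, KL values and attention scores are passed in as plain numbers, whereas in the model they come from VMD and the temporal attention; the function names and the `warmup_steps` parameter of the annealing schedule are hypothetical, and the average over samples is omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_objective(log_liks, kls, info_scores, dep_scores, alpha, kl_weight):
    """Compose one sample's objective from T per-day terms.

    log_liks, kls: length-T lists; info_scores, dep_scores: length T-1.
    """
    # f_t = log-likelihood - lambda * KL for each trading day t.
    f = [ll - kl_weight * kl for ll, kl in zip(log_liks, kls)]
    # v* = softmax(v_i * v_d) over the T-1 auxiliary days.
    v_star = softmax([a * b for a, b in zip(info_scores, dep_scores)])
    # v = [alpha * v*, 1]: the main target always gets weight 1.
    v = [alpha * w for w in v_star] + [1.0]
    return sum(w * ft for w, ft in zip(v, f)), v


def kl_weight_schedule(step, warmup_steps):
    """Linearly increasing KL weight, capped at 1 (warmup_steps is hypothetical)."""
    return min(1.0, (step + 1) / warmup_steps)


obj, weights = temporal_objective(
    log_liks=[-0.7, -0.6, -0.5], kls=[0.2, 0.1, 0.3],
    info_scores=[1.0, 1.0], dep_scores=[2.0, 2.0],
    alpha=0.5, kl_weight=1.0)
```

Setting `alpha` to 0 removes the auxiliary days from the objective entirely, while `alpha` set to 1 lets them contribute, in total, as much weight as the main target.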

Experiments
In this section, we detail our experimental setup and results.

Training Setup
We use a 5-day lag window for sample construction and 32 shuffled samples in a batch. The maximal token number contained in a message and the maximal message number on a trading day are empirically set to 30 and 40, respectively, with the excess clipped. Since all tweets in the batched samples are simultaneously fed into the model, we set the word embedding size to 50 instead of larger sizes to control memory costs and make model training feasible on one single GPU (11GB memory). We set the hidden size of Message Embedding Layer to 100 and that of VMD to 150. All weight matrices in the model are initialized with the fan-in trick and biases are initialized with zero. We train the model with an Adam optimizer (Kingma and Ba, 2014) with the initial learning rate of 0.001. Following Bowman et al. (2016), we use the input dropout rate of 0.3 to regularize latent variables. TensorFlow (Abadi et al., 2016) is used to construct the computational graph of StockNet and hyper-parameters are tweaked on the development set.
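For reference, the reported settings can be gathered into a single configuration, sketched here as a Python dictionary. The key names and the `clip_corpus` helper are hypothetical and do not mirror our code; only the values come from the setup described above.

```python
# Key names are hypothetical; values are the ones reported in the text.
TRAINING_CONFIG = {
    "lag_size_days": 5,
    "batch_size": 32,
    "max_tokens_per_message": 30,
    "max_messages_per_day": 40,
    "word_embedding_size": 50,
    "message_embedding_hidden_size": 100,
    "vmd_hidden_size": 150,
    "optimizer": "adam",
    "initial_learning_rate": 0.001,
    "input_dropout_rate": 0.3,
}


def clip_corpus(messages, config=TRAINING_CONFIG):
    """Clip a day's tokenized messages to the configured maxima."""
    kept = messages[: config["max_messages_per_day"]]
    return [tokens[: config["max_tokens_per_message"]] for tokens in kept]


# 45 messages of 35 tokens each exceed both limits and are clipped.
clipped = clip_corpus([["tok"] * 35 for _ in range(45)])
```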

Evaluation Metrics
Following previous work for stock prediction (Xie et al., 2013; Ding et al., 2015), we adopt the standard measure of accuracy and Matthews Correlation Coefficient (MCC) as evaluation metrics. MCC avoids bias due to data skew. Given the confusion matrix $\begin{pmatrix} tp & fn \\ fp & tn \end{pmatrix}$ containing the number of samples classified as true positive, false negative, false positive and true negative, MCC is calculated as

$$\mathrm{MCC} = \frac{tp \times tn - fp \times fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}} \quad (30)$$
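The MCC computation is small enough to sketch directly. The function name `mcc` is our own choice here, and returning 0 when the denominator vanishes is a common convention that we assume rather than one prescribed above.

```python
import math


def mcc(tp, fn, fp, tn):
    """MCC from confusion-matrix counts; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


perfect = mcc(tp=5, fn=0, fp=0, tn=5)   # every sample classified correctly
chance = mcc(tp=5, fn=5, fp=5, tn=5)    # predictions uncorrelated with labels
skewed = mcc(tp=6, fn=2, fp=1, tn=3)    # imbalanced but informative predictions
```

A perfect classifier scores 1 and an uninformative one scores 0, regardless of how the two classes are balanced, which is why MCC complements accuracy.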

Baselines and Proposed Models
We construct the following five baselines in different genres,
• RAND: a naive predictor guessing up or down at random.
• ARIMA: Autoregressive Integrated Moving Average, an advanced technical analysis method using only price signals (Brown, 2004).
• RANDFOREST: a discriminative Random Forest classifier using Word2vec text representations (Pagolu et al., 2016).
• TSLDA: a generative topic model jointly learning topics and sentiments (Nguyen and Shirai, 2015).
• HAN: a state-of-the-art discriminative deep neural network with hierarchical attention (Hu et al., 2018).
To make a detailed analysis of all the primary components in StockNet, in addition to HEDGEFUNDANALYST, the fully-equipped StockNet, we also construct the following four variations,
• TECHNICALANALYST: the generative StockNet using only historical prices.
• FUNDAMENTALANALYST: the generative StockNet using only tweets.
• INDEPENDENTANALYST: the generative StockNet without the temporal auxiliary.
• DISCRIMINATIVEANALYST: the discriminative StockNet directly optimizing the likelihood objective. Following Zhang et al. (2016), we set $z_t = \mu'_t$ to take out the effects of the KL term.

Table 1: Performance of the baselines and our proposed models.
Model                    Acc. (%)   MCC
…
HAN (Hu et al., 2018)    57.64      0.051800
…
HEDGEFUNDANALYST         58.23      0.080796

Results
Since stock prediction is a challenging task and a minor improvement usually leads to large potential profits, the accuracy of 56% is generally reported as a satisfying result for binary stock movement prediction (Nguyen and Shirai, 2015). We show the performance of the baselines and our proposed models in Table 1. TSLDA is the best baseline in MCC while HAN is the best baseline in accuracy. Our model, HEDGEFUNDANALYST, achieves the best performance of 58.23 in accuracy and 0.080796 in MCC, outperforming TSLDA and HAN by 4.16 and 0.59 in accuracy, and by 0.015414 and 0.028996 in MCC, respectively.
Though slightly better than random guess, classic technical analysis, e.g. ARIMA, does not yield satisfying results. Similar in using only historical prices, TECHNICALANALYST shows an obvious advantage in this task compared with ARIMA. We believe there are two major reasons: (1) TECHNICALANALYST learns from training data and incorporates more flexible non-linearity; (2) our test set contains a large number of stocks while ARIMA is more sensitive to peculiar sequence stationarity.
It is worth noting that FUNDAMENTALANALYST gains exceptionally competitive results with only 0.009092 less in MCC than HEDGEFUNDANALYST. The performance of FUNDAMENTALANALYST and TECHNICALANALYST confirms the positive effects from tweets and historical prices in stock movement prediction, respectively. As an effective ensemble of the two types of market information, HEDGEFUNDANALYST gains even better performance.
Compared with DISCRIMINATIVEANALYST, the performance improvements of HEDGEFUNDANALYST are not from enlarging the networks, demonstrating that modeling underlying market status explicitly with latent driven factors indeed benefits stock movement prediction. The comparison with INDEPENDENTANALYST also shows the effectiveness of capturing temporal dependencies between predictions with the temporal auxiliary. However, the effects of the temporal auxiliary are more complex and will be analyzed further in the next section.

Effects of Temporal Auxiliary
We provide a detailed discussion of how the temporal auxiliary affects model performance. As introduced in Eq. (28), the temporal auxiliary weight $\alpha$ controls the overall effects of the objective-level temporal auxiliary on our model. Figure 4 presents how the performance of HEDGEFUNDANALYST and DISCRIMINATIVEANALYST fluctuates with $\alpha$.
As shown in Figure 4, enhanced by the temporal auxiliary, HEDGEFUNDANALYST reaches its best performance at $\alpha = 0.5$, and DISCRIMINATIVEANALYST achieves its maximum at $\alpha = 0.7$. In fact, the objective-level auxiliary can be regarded as a denoising regularizer: for a sample with a specific movement as the main target, the market source in the lag can be heterogeneous, e.g. when a stock is affected by bad news, tweets on earlier days are negative but turn positive due to timely crisis management. Without temporal auxiliary tasks, the model tries to identify positive signals on earlier days only for the main target of rise movement, which is likely to result in pure noise. In such cases, temporal auxiliary tasks help to filter market sources in the lag as per their respective aligned auxiliary movements. Besides, from the perspective of training variational models, the temporal auxiliary helps HEDGEFUNDANALYST to encode more useful information into the latent driven factor $Z$, which is consistent with recent research in VAEs (Semeniuta et al., 2017). Compared with HEDGEFUNDANALYST, which contains a KL term performing dynamic regularization, DISCRIMINATIVEANALYST requires the stronger regularization effects that come with a bigger $\alpha$ to achieve its best performance.
Since $y^*$ is also involved in generating $y_T$ through the temporal attention, tweaking $\alpha$ acts as a tradeoff between focusing on the main target and generalizing by denoising. Therefore, as shown in Figure 4, our models do not benefit linearly from incorporating the temporal auxiliary. In fact, the two models follow a similar pattern in terms of performance change: the curves first drop with the increase of $\alpha$, except for the MCC curve of DISCRIMINATIVEANALYST, which rises temporarily at 0.3. After that, the curves ascend abruptly to their maximums, then keep descending until $\alpha = 1$. Though the start phase of increasing $\alpha$ even leads to worse performance, when auxiliary effects are properly introduced, the two models finally gain better results than those with no involvement of auxiliary effects, e.g. INDEPENDENTANALYST.

Conclusion
We demonstrated the effectiveness of deep generative approaches for stock movement prediction from social media data by introducing StockNet, a neural network architecture for this task. We tested our model on a new comprehensive dataset and showed it performs better than strong baselines, including implementations of previous work. Our comprehensive dataset is publicly available at https://github.com/yumoxu/stocknet-dataset.