VolTAGE: Volatility Forecasting via Text-Audio Fusion with Graph Convolution Networks for Earnings Calls

Natural language processing has recently made advances in stock movement and volatility forecasting, leading to improved financial forecasting. Transcripts of companies' earnings calls are well studied for risk modeling, offering unique investment insight into stock performance. However, vocal cues in the speech of company executives present an underexplored, rich source of natural language data for estimating financial risk. Additionally, most existing approaches ignore the correlations between stocks. Building on existing work, we introduce a neural model for stock volatility prediction that accounts for stock interdependence via graph convolutions while fusing verbal, vocal, and financial features in a semi-supervised multi-task risk forecasting formulation. Our proposed model, VolTAGE, outperforms existing methods, demonstrating the effectiveness of multimodal learning for volatility prediction.


Introduction
Motivation Financial risk modeling is of great interest to capital market participants for making sound investment decisions, and stock volatility is a vital indicator of a company's risk profile (Poon and Granger, 2003; Yang et al., 2020). One underexplored yet crucial event that leads to significant fluctuations in stock volatility is the earnings conference call. These calls are held periodically by publicly traded companies' executives to summarize and prognosticate the company's performance (Qin and Yang, 2019). Harnessing the interplay between the multimodal verbal and vocal cues in earnings calls can help better analyze the impact these calls may have on financial markets and forecast stock volatility (Dichev and Tang, 2009; Yang et al., 2020).
Challenges While stock trading presents unparalleled investment opportunities, accurately predicting the rise and fall of stock prices has numerous challenges (Campbell et al., 1997). Conventional research in finance revolves around using historical stock data to develop statistical models and recurrent neural networks (RNNs) capable of forecasting price trends (Kristjanpoller et al., 2014; Zheng et al., 2019). However, stock prices are influenced by many factors, ranging from public opinion to the movements of other related stocks (Malkiel, 2003). Recent advances in deep learning present a promising prospect for multimodal stock forecasting by analyzing online news (Hu et al., 2018) and social media (Guo et al., 2018) to learn latent patterns affecting stock prices (Jiang, 2020). However, a challenging aspect of stock forecasting is that most existing work treats stock movements as independent of each other, contrary to true market function (Diebold and Yılmaz, 2014). Additionally, existing research has not leveraged the rich audio signals in company executives' speech, which could indicate the emotional and affective state of the speakers and provide insights into company performance. More recently, the use of audio processing for earnings calls has gained interest in both financial and linguistic research (Burgoon et al., 2015; Jiang and Pell, 2017).
Multimodal approaches can extract complementary information from multiple modalities to improve financial modeling; MDRM (Qin and Yang, 2019) and HTML (Yang et al., 2020) validate the premise of such approaches for volatility forecasting. Additionally, advances in graph-based deep learning (Kipf and Welling, 2017) have led to the rise of graph neural networks (GNNs) that can model the relationships between related stocks (Feng et al., 2019). Publicly available online company information can be used to identify connections between stocks that might influence each other, such as those having the same CEO or belonging to the same industry. Moreover, financial tasks are often correlated, making multi-task learning a promising choice for financial forecasting.
Contributions Building on advances at the intersection of financial research, graph neural networks, and natural language processing, we present VolTAGE: Volatility forecasting via Text-Audio fusion with Graph convolution networks for Earnings calls. VolTAGE comprises a set of neural components to capture cross-modal signals from earnings call transcripts, CEO speech, inter-stock dependence graphs, and numerical financial features. First, VolTAGE captures the verbal-vocal coherence between earnings call transcripts and speech via an inter-modal multi-utterance attention mechanism. The fused features are then fed to a graph convolution network (GCN) to simultaneously solve two homogeneous stock volatility tasks, average volatility (the main task) and single-day volatility prediction (an auxiliary task), in a semi-supervised fashion. Through a set of comparative, qualitative, and ablation experiments on real-world S&P 500 index data, we show VolTAGE's utility in augmenting vocal and verbal cues with graph-based features in a multi-task setup.
Ethical Considerations and Limitations Examining a CEO's speech and tone in earnings calls is a well-studied phenomenon in the financial literature (Crawford Camiciottoli, 2011; Qin and Yang, 2019). Our work focuses only on calls for which companies publicly release transcripts and audio recordings. The data used in our study corresponds to earnings calls of S&P 500 companies. We acknowledge the presence of gender bias in our study, given the imbalance in the gender ratio of CEOs of S&P 500 companies. We also acknowledge the demographic bias in our study: the S&P 500 companies are organizations listed in the US, and our findings may not generalize directly to non-native speakers.

Background
Extensive studies have shown the utility of employing historical financial data (Jones, 2017; Dichev and Tang, 2009) for volatility prediction, yet financial forecasting using multiple modalities remains an underexplored avenue. While newer work focuses on data across multiple modalities, current methods retain drawbacks and leave promising approaches understudied, which we describe next.
Volatility Forecasting Forecasting stock volatility is a crucial pillar across multiple financial domains and has been the focus of numerous academic studies. Volatility is a key indicator of uncertainty and a decisive variable in many investment decisions and portfolio constructions. Previous work in this domain has mainly relied on numerical features (Liu and Chen, 2019; Nikou et al., 2019), such as macroeconomic indicators (Hoseinzade et al., 2019). This includes discrete (GARCH (Duan, 1995), rolling regression (Peng et al., 2018)), continuous (Andersen, 2007), and neural approaches (Kogan et al., 2009). This comprehensive body of work illustrates the significance of volatility in investment, security valuation, and risk management.

Natural Language Processing and Finance
Extensive studies incorporating related text information have proven successful in financial forecasting tasks. Mohan et al. (2019) and Tan et al. (2019) utilized financial news articles to improve the accuracy of stock price predictions. Hu et al. (2018) propose a hybrid attention network to predict the stock trend based on the related sequential news articles. Researchers have also observed the influence of textual data in online media on stock markets (Bollen et al., 2011; Mittermayer and Knolmayer, 2006). Si et al. (2014) showed that social media sentiment analysis is predictive of individual stocks' market movements. However, utilizing multimodal sources of information remains an underexplored avenue in financial forecasting.

Speech Processing and Finance
Newer studies (Qin and Yang, 2019; Yang et al., 2020) illustrate the gains obtained by using vocal cues from CEOs' speech in earnings conference calls for volatility prediction; yet, the majority of current work does not utilize speech-based data. Audio features add greater context and provide psycholinguistic signals about the speaker's emotional state (Jiang and Pell, 2017). Qin and Yang (2019) illustrated that late fusion of audio and text features from earnings calls can be used to forecast stock volatility following the call. The verbose quarterly earnings calls (Wang and Hua, 2014) act as a medium of voluntary disclosure (Tasker, 1998), thereby resulting in significant stock movements (Ding et al., 2015); yet, the majority of existing approaches do not focus on such highly volatile macro activities, where the market microstructure is highly uncertain (Rogers et al., 2009). During these macro events, the predictability of stock returns can be improved, since the disclosure of informed investors influences volatility spreads (Atilgan, 2014). Although multiple sources of information are crucial, not all modalities contribute equally (Akhtar et al., 2019), and noise in one modality can be detrimental in such multimodal frameworks (Morris-Drake et al., 2016).

Multimodality and Finance
The Efficient Market Hypothesis (Malkiel, 2003) posits that stock prices reflect all publicly available information, motivating the use of multimodal data sources for predictive financial tasks. The more recent multimodal HTML (Yang et al., 2020) is a transformer-based model that uses BERT (Devlin et al., 2019) for textual modeling and the same hand-crafted audio features as MDRM (Qin and Yang, 2019), in an early fusion formulation. Both MDRM and HTML assume that stocks move independently and do not exploit relations between stock movements. Relations such as a shared industrial base or co-ownership result in related stock movements (Feng et al., 2019). Recent works exploit stock relations through graph neural networks (Kipf and Welling, 2017; Veličković et al., 2018) for stock movement prediction (Kim et al., 2019; Sawhney et al., 2020).
Building on these gaps in existing literature, we propose VolTAGE for volatility prediction.

Forecasting Stock Volatility
Following Kogan et al. (2009) and Qin and Yang (2019), we define stock volatility prediction as a regression task. For a given stock with close price $p_i$ on trading day $i$, we calculate the average log volatility over the $n$ days following the earnings call as:

$$v_{[0,n]} = \ln\left(\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(r_i - \bar{r}\right)^2}\right)$$

where the return price $r_i$ is defined as $r_i = \frac{p_i}{p_{i-1}} - 1$ and $\bar{r}$ is the mean of the return price over the period from day 0 to day $n$. Additionally, for our auxiliary task, we define the single-day log volatility using the daily log absolute returns as:

$$v_i = \ln\left(|r_i|\right)$$

Problem Statement Given an earnings call $e$, comprising an audio $A$ and aligned text $T$, and stock prices $p_{[0,n]}$, we aim to learn a predictive model that estimates the average volatility $v_{[0,n]}$ (main task) and the single-day volatilities $v_i$ (auxiliary task).
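As a concrete sketch, the two volatility targets above can be computed from a price series as follows (function and variable names are our own; this is not the authors' released code):

```python
import numpy as np

def avg_log_volatility(prices, n):
    """Average log volatility over the n trading days after the call.

    prices: close prices p_0..p_n (length n+1), where p_0 is the day of
    the call. Implements v = ln(sqrt(mean((r_i - r_bar)^2))).
    """
    p = np.asarray(prices, dtype=float)
    r = p[1:n + 1] / p[:n] - 1.0          # simple returns r_1..r_n
    r_bar = r.mean()
    return float(np.log(np.sqrt(((r - r_bar) ** 2).mean())))

def single_day_log_volatility(p_prev, p_curr):
    """Single-day log volatility: log absolute return on day i."""
    r_i = p_curr / p_prev - 1.0
    return float(np.log(abs(r_i)))

prices = [100.0, 102.0, 99.0, 101.0]      # toy 3-day post-call price series
v = avg_log_volatility(prices, n=3)
```

Note that both targets live on a log scale, which compresses the heavy-tailed return distribution into a range better suited to MSE regression.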

VolTAGE: Architecture and Learning
Below, we describe both the individual components and joint optimization of VolTAGE, and present an overview of the architecture in Figure 1.

Verbal Cues: Transcript Encoding
We use FinBERT 1 (Araci, 2019), a pre-trained language model based on BERT for the financial domain, as our sentence encoder. Recent works (Araci, 2019; Keith and Stent, 2019) in this domain indicate the benefits of using a language model pre-trained on financial corpora and retrofitting pre-computed embeddings, achieving considerable performance gains and giving us strong grounds to do the same.
FinBERT has been pre-trained on 46,000 documents of financial news articles and has shown state-of-the-art performance on the FiQA 2 and Financial PhraseBank benchmarks (Malo et al., 2013). Formally, we represent the transcript utterances of each call as $(t_1, t_2, \ldots, t_K)$, where $t_i$ is the $i$-th text utterance and $K$ is the number of sentences. Each utterance is encoded as $s_i = \mathrm{FinBERT}(t_i)$. We then pass the sequence of these sentence representations to a BiLSTM to obtain the contextual text encoding $T_t = \mathrm{BiLSTM}(s_1, \ldots, s_K)$.

Figure 1: VolTAGE architecture overview: feature extraction, semi-supervised learning, and multi-task regression.

Vocal Cues: Audio Call Encoding
Audio-based features provide prosodic cues related to the affective state of speakers (Montacié and Caraty, 2018). Capturing the emotional valence of the CEO can alter the understanding of the underlying linguistic utterances in an earnings call (Schröder et al., 2001). We extract a set of 26 acoustic features from each aligned audio clip at a sampling rate of 10 ms for each sentence. These feature time series are then summarized by statistical functions such as mean, median, min, and max to yield a fixed-dimensional representation for each sentence, extending the feature sets of previous works (Qin and Yang, 2019; Yang et al., 2020). These features have been shown to correlate with the speaker's affective states such as stress and anxiety (APQ11 shimmer, DDA shimmer) (Li et al., 2007; Mongia and Sharma, 2014), with inconsistencies in vocal pace (ratio of voiced to unvoiced frames in the audio) (Přibil and Přibilová, 2009; Viswanathan et al., 2012), and with deception (pitch) (Burgoon et al., 2015). We extracted these 26 features from each audio utterance using Praat (Boersma and Van Heuven, 2001).
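The statistical summarization step can be sketched as follows (the exact statistic set and ordering are our assumption; the paper names mean, median, min, and max):

```python
import numpy as np

def summarize_utterance(features):
    """Collapse a (frames x 26) acoustic feature time series into a fixed
    vector by stacking per-feature statistics: mean, median, min, max."""
    f = np.asarray(features, dtype=float)
    stats = [f.mean(axis=0), np.median(f, axis=0),
             f.min(axis=0), f.max(axis=0)]
    return np.concatenate(stats)          # shape: (4 * 26,) = (104,)

# A ~2.5 s clip sampled every 10 ms yields ~250 frames of 26 features.
frames = np.random.default_rng(0).normal(size=(250, 26))
vec = summarize_utterance(frames)
```

This turns variable-length clips into fixed-size vectors, which is what lets utterances of very different durations share one BiLSTM input layer.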
Text-Audio Alignment Following Qin and Yang (2019), we use the pre-aligned dataset for earnings calls, where the audio is segmented and aligned with each corresponding utterance of the transcript using the Iterative Forced Alignment (IFA) algorithm. IFA is the process of determining the time interval (in the audio file) containing the spoken text for each fragment of the transcript. Qin and Yang (2019) implemented IFA using Aeneas 3 as the fundamental forced alignment method. Formally, we represent the segmented audio clips as $(a_1, a_2, \ldots, a_K)$, where $a_i$ is the $i$-th audio clip and $K$ is the number of clips in an earnings call, with each clip represented by its 26 acoustic features. Similar to the processing of verbal utterances, we employ a BiLSTM layer to sequentially encode these features and obtain an audio encoding $A_t = \mathrm{BiLSTM}(a_1, \ldots, a_K)$.

Verbal-Vocal Attention
The acoustic features provide context and structure to the verbal cues. To capture the associations between verbal and vocal cues, we employ a Cross-Modal Gated Attention Fusion (CM Attn) mechanism that simultaneously learns alignment weights between audio features and text sentence sequences, highlighting contributing features by giving more attention to the respective utterance and its neighboring utterances. Motivated by Akhtar et al. (2019) and Dhingra et al. (2016), we use a multiplicative gated attention mechanism to generate modality-specific attentive representations.
Formally, a multiplicative gating mechanism is used to attend to the important components of the text and audio sequences, yielding the final attentive feature embeddings $F_t$, $F_a$, which are then combined as:

$$H = F_t \oplus F_a \oplus (F_t \odot F_a)$$

where $\odot$ represents element-wise multiplication and $\oplus$ represents concatenation, and the attention weights themselves are computed via dot products between the two modalities' sequences. The fused verbal-vocal feature vector per earnings call is then fed to a GCN, as described next.
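A minimal numerical sketch of multiplicative gated fusion in this spirit (the gating equations, weight shapes, and final combination below are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

K, d = 5, 8                       # utterances, feature dim (toy sizes)
T = rng.normal(size=(K, d))       # text utterance encodings
A = rng.normal(size=(K, d))       # audio utterance encodings

# Multiplicative cross-modal gate (Wt, Wa are hypothetical parameters).
Wt, Wa = rng.normal(size=(d, d)), rng.normal(size=(d, d))
G = sigmoid(T @ Wt + A @ Wa)      # per-utterance, per-feature gate in (0, 1)

F_t = G * T                       # audio-gated text features
F_a = (1.0 - G) * A               # text-gated audio features
H = np.concatenate([F_t, F_a, F_t * F_a], axis=-1)  # fused representation
```

The gate lets each feature dimension decide, per utterance, how much to trust each modality, rather than weighting whole modalities uniformly.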

Graph-based Semi Supervised Learning
Mining Stock Relations First, we construct a company graph, inspired by the relations defined by Feng et al. (2019). We mine connections between companies from Wikidata (Vrandečić and Krötzsch, 2014). Wikidata represents relations in the form of statements like (subject; predicate; object), such as (Facebook; founded by; Mark Zuckerberg). 4 We say that company A has a first-order relation with company B if there is a statement with A as the subject and B as the object. Similarly, there exists a second-order relation between them if they are related through an intermediate entity. This Wiki-Company graph $G_{WC} = (V, E_{WC})$ is a homogeneous graph, where each node represents a company and two nodes are connected by an edge representing either a first- or second-order relation. We present examples of first- and second-order relations in Figure 2. Since it is the companies that are related, and not the earnings calls, we extend the graph $G_{WC}$ by incorporating nodes corresponding to earnings calls. Each call is connected by an edge to the company it corresponds to. This extended graph $G = (V, E)$ is heterogeneous, with two types of nodes (companies and earnings calls).
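First- and second-order relation mining over such triples can be sketched as follows (company names and predicates are hypothetical):

```python
from collections import defaultdict

# Hypothetical (subject; predicate; object) statements mined from Wikidata.
triples = [
    ("CompanyA", "owned_by", "HoldingX"),
    ("CompanyB", "owned_by", "HoldingX"),
    ("CompanyA", "subsidiary", "CompanyC"),
]
companies = {"CompanyA", "CompanyB", "CompanyC"}

# First-order: A relates directly to B via a single statement.
first_order = {(s, o) for s, _, o in triples
               if s in companies and o in companies}

# Second-order: A and B share a non-company intermediate entity
# (e.g. the same holding company or CEO).
linked = defaultdict(set)
for s, _, o in triples:
    linked[s].add(o)
    linked[o].add(s)
second_order = {(a, b) for a in companies for b in companies
                if a < b and (linked[a] & linked[b]) - companies}
```

Both relation types then become undirected edges between company nodes in the Wiki-Company graph.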
Graph Convolution Network We frame the task as a graph-based semi-supervised learning problem, since labels (volatility values) are available only for a subset of nodes, i.e., the earnings call nodes (Kipf and Welling, 2017). Our intuition behind applying GCNs is to allow the model to distribute gradient information from the supervised loss on the labeled earnings call nodes. As shown in Figure 1, we feed the fused verbal-vocal features $H$ as node features for each earnings call node to the GCN. As for the stock nodes, since a stock may have multiple earnings calls, we take the mean of the feature vectors of all calls pertaining to a stock as its feature vector, incorporating features across all of its earnings calls. Formally, let $F \in \mathbb{R}^{n \times m}$ represent the input feature matrix comprising these feature vectors of length $m$ for the nodes in $G$ with adjacency matrix $A$, and let $D$ represent the diagonal degree matrix defined as $D_{ii} = \sum_j A_{ij}$. Following Kipf and Welling (2017), the update rule at layer $l$ of the GCN is:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2} H^{(l)} W^{(l)}\right)$$

where $\tilde{A} = A + I$ adds self-loops, $\tilde{D}$ is its degree matrix, and the first layer takes $H^{(0)} = F$. We experiment with single-layer and 2-layer GCNs, and find better results with the latter; the 2-layer GCN yields the estimated volatility values as:

$$\hat{Y} = \hat{A}\,\mathrm{ReLU}\left(\hat{A} F W^{(0)}\right) W^{(1)}, \qquad \hat{A} = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$$

Using the earnings call node labels, we train the GCN on the MSE loss in this semi-supervised fashion. This mechanism generates feature representations for both the company nodes and the earnings call nodes, of which we use the latter. Subsequently, these earnings call node features, denoted by $O_e$, are fed along with the financial features to a multimodal LSTM network in a multi-task learning setup, as described next.
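A numerical sketch of a two-layer GCN forward pass with the Kipf and Welling (2017) renormalization trick (toy shapes; the weight initialization and output dimension here are illustrative):

```python
import numpy as np

def gcn_forward(A, F, W1, W2):
    """Two-layer GCN: Y_hat = A_hat ReLU(A_hat F W1) W2, where
    A_hat = D~^{-1/2} A~ D~^{-1/2} and A~ = A + I (self-loops)."""
    A_tilde = A + np.eye(A.shape[0])            # add self-loops
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # symmetric normalization
    H1 = np.maximum(A_hat @ F @ W1, 0.0)        # layer 1 + ReLU
    return A_hat @ H1 @ W2                      # layer 2 (linear output)

rng = np.random.default_rng(0)
n, m, h = 6, 4, 3                               # nodes, input dim, hidden dim
A = np.zeros((n, n))
A[0, 1] = A[1, 0] = A[1, 2] = A[2, 1] = 1.0     # a small toy graph
F = rng.normal(size=(n, m))
out = gcn_forward(A, F, rng.normal(size=(m, h)), rng.normal(size=(h, 1)))
```

Each layer mixes a node's features with its neighbors', so two layers let an earnings call node absorb signal from calls of related companies two hops away.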

Multimodal LSTM for Risk Forecasting
Prior work (Figlewski, 1994) in the financial domain has shown the benefits of using past data for future volatility forecasting. However, fusing the sequential historical volatility data with non-temporal GCN embeddings poses a challenge. To overcome this disparity, we employ multimodal "conditioned" LSTM networks (Karpathy and Fei-Fei, 2015): we add the GCN node embeddings from the first layer, passed through a ReLU non-linearity, to the hidden state of the LSTM at the first time-step to integrate temporally diverse modalities. Further, the past data introduces historical context in cases where calls may not have major announcements that would lead to large fluctuations in stock volatility.
Incorporating Historical Data To incorporate financial data, we extract the past $n$-day average volatilities prior to the earnings call, where $n \in [2, 30]$. Formally, the LSTM model takes the sequence of input vectors $(x_1, \ldots, x_T)$ representing the past financial data, along with the earnings call node embeddings $O_e$ obtained using the GCN. The model computes a series of hidden states $(h_1, \ldots, h_T)$ and a sequence of outputs $(y_1, \ldots, y_T)$ by repeating the standard LSTM recurrence from time $t = 1$ to $T$:

$$h_t = \mathrm{LSTM}(x_t, h_{t-1}), \qquad y_t = W_o h_t + b_o$$

where the LSTM weights and $W_o$, $b_o$ are learnable parameters and $x_t$ is the average $t$-day past volatility. Following Karpathy and Fei-Fei (2015), we feed the GCN embeddings to the LSTM only at the first time-step. We use the output $y_T$ from the last LSTM unit for the final multi-output prediction.
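A toy sketch of conditioning the LSTM's initial hidden state on the GCN call embedding (the shapes, parameter layout, and ReLU projection are our assumptions, in the spirit of Karpathy and Fei-Fei (2015), not the exact VolTAGE implementation):

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def conditioned_lstm(x_seq, o_e, params):
    """Run an LSTM whose initial hidden state h_0 is a ReLU projection of
    the GCN earnings call embedding o_e; x_seq holds past volatilities."""
    W, U, b, W_c = params
    h = np.maximum(W_c @ o_e, 0.0)        # h_0 conditioned on GCN embedding
    c = np.zeros_like(h)
    d = h.shape[0]
    for x_t in x_seq:
        z = W @ np.atleast_1d(x_t) + U @ h + b
        i, f = sigmoid(z[:d]), sigmoid(z[d:2 * d])        # input/forget gates
        o, g = sigmoid(z[2 * d:3 * d]), np.tanh(z[3 * d:])  # output gate, cell
        c = f * c + i * g
        h = o * np.tanh(c)
    return h                               # read y_T off the last hidden state

rng = np.random.default_rng(2)
d, e = 4, 6                                # hidden dim, GCN embedding dim
params = (rng.normal(size=(4 * d, 1)), rng.normal(size=(4 * d, d)),
          np.zeros(4 * d), rng.normal(size=(d, e)))
h_T = conditioned_lstm([0.1, -0.2, 0.05], rng.normal(size=e), params)
```

Injecting the embedding only at $t = 1$ keeps the recurrence itself purely temporal while still grounding it in the call-level multimodal features.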

Network Optimization
We finally train VolTAGE by optimizing a multi-task loss:

$$\mathcal{L} = \mu \cdot \mathrm{MSE}(\hat{y}_i, y_i) + (1 - \mu) \cdot \mathrm{MSE}(\hat{y}_j, y_j)$$

Here, $\hat{y}_i$, $\hat{y}_j$ are the predicted volatilities and $y_i$, $y_j$ the true volatilities for the main and auxiliary tasks, respectively, and $\mu$ is a parameter that controls the relative weight of the loss between the two tasks.
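The multi-task objective can be sketched directly (the numbers below are illustrative):

```python
import numpy as np

def multitask_loss(y_main_pred, y_main, y_aux_pred, y_aux, mu=0.8):
    """Weighted sum of MSE losses for the main (n-day average) and
    auxiliary (single-day) volatility tasks; mu trades them off."""
    mse = lambda a, b: float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))
    return mu * mse(y_main_pred, y_main) + (1.0 - mu) * mse(y_aux_pred, y_aux)

loss = multitask_loss([-3.5, -4.0], [-3.6, -3.9], [-3.0], [-3.2], mu=0.8)
```

Setting mu = 1 recovers single-task training on the main task, and mu = 0 on the auxiliary task.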

Data
We used the S&P 500 2017 Earnings Conference Calls dataset (Qin and Yang, 2019). 5 The dataset consists of 559 earnings call audio recordings and their transcripts for 277 public companies in the S&P 500 index. Each call is segmented into a sequence of audio clips aligned with their corresponding text sentences, as spoken by the Chief Executive Officer (CEO) during the call. We temporally divide the data into train, validation, and test sets in a ratio of 70:10:20, respectively, to ensure future data is not used for forecasting. We extract stock prices for each company using Yahoo Finance 6 from January 1, 2017 to December 31, 2017. The stock data for 11 earnings calls was not available on Yahoo Finance; hence, we excluded these calls from our dataset. Following Qin and Yang (2019) and Yang et al. (2020), we experiment with n ∈ {3, 7, 15, 30} days to analyze the performance over both short- and long-term periods.
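The temporal split can be sketched as follows (the 548-call count reflects the 559 calls minus the 11 without price data; integer indices stand in for date-sorted calls):

```python
def temporal_split(items, ratios=(0.7, 0.1, 0.2)):
    """Split date-sorted items into train/val/test so that every training
    example precedes every validation example, which precedes every test
    example -- i.e. no future information leaks into training."""
    n = len(items)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

calls = sorted(range(548))          # stand-in for calls sorted by call date
train, val, test = temporal_split(calls)
```

A random split would be easier but would let the model "see the future", inflating test performance on temporal financial data.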

Baselines
We compare VolTAGE with the following methods: • V_past: Following Qin and Yang (2019), we use V_past, the average log volatility of the past d days, to predict the future d days' average log volatility.
• MDRM: Qin and Yang (2019) extract pre-trained GloVe embeddings and hand-crafted acoustic features that are fed to separate BiLSTMs to get uni-modal contextual embeddings, which are then fused and fed to a two-layer dense network.
• HTML: Yang et al. (2020) present the state-of-the-art model, which uses WWM-BERT to encode text tokens. HTML uses the same audio features as MDRM. The unimodal features are fused and fed to a sentence-level transformer to obtain multimodal representations for each call.
We use FinBERT with default pre-training parameters, which outputs a 768-dimensional embedding for each sentence. The maximum number of audio clips in any call is 520; hence, we zero-pad the calls that have fewer than 520 clips for efficient batching. The number of neurons in the time-distributed dense layer following the audio and text BiLSTMs is 100.


Results and Analysis

Comparative Analysis
We present the volatility prediction performance of VolTAGE and the baselines in Table 1. We report the MSE averaged across 10 different runs for all models on the main task ($n$-day average prediction). Our choice of MSE as the comparative metric is motivated by prior work (Qin and Yang, 2019; Yang et al., 2020). Additionally, we report the coefficient of determination $R^2 = 1 - \mathrm{MSE}/\mathrm{MSE}_{V_{past}}$ to illustrate the improvements over $V_{past}$. We observe gains over the multimodal HTML, which leverages both text and audio modalities. We ascribe this improvement to the cross-modal attention fusion mechanism, which uses associations between the audio and text modalities over each contextual utterance instead of the concatenation used in HTML. Moreover, a key limitation of the baselines is the assumption of independence between inter-stock movements. VolTAGE captures the correlations between the price movements, and hence volatilities, of related stocks through the GCN, amplifying performance. Similar to prior work (Qin and Yang, 2019), Table 1 illustrates that forecasting volatility in the short term is a more intricate task than in the long term. Based on Post Earnings Announcement Drift (PEAD) (Bernard and Thomas, 1989), a documented financial phenomenon, we note that price fluctuations around earnings calls tend to stabilize over long periods. We observe that VolTAGE outperforms the baselines by a large margin in short-term prediction (n = 3, 7); however, the margin diminishes over longer durations (n = 30).

Footnote 7: We extract features for nodes using the last layer of verbal-vocal fusion tuned only for average n-day volatility prediction; the verbal-vocal attention fusion was not trained on the multi-task loss, and VolTAGE is not trained end-to-end.

Table 2: Ablation results over model components.
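The reported improvement metric over the V_past baseline can be computed as follows (the MSE numbers below are illustrative, not the paper's results):

```python
def r2_vs_vpast(mse_model, mse_vpast):
    """Coefficient-of-determination-style improvement over the V_past
    baseline: R^2 = 1 - MSE_model / MSE_Vpast. Positive values mean the
    model beats the naive historical-volatility baseline."""
    return 1.0 - mse_model / mse_vpast

score = r2_vs_vpast(mse_model=0.30, mse_vpast=0.50)
```

Unlike raw MSE, this metric is comparable across horizons, since each horizon is normalized by its own baseline error.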

Ablation Study
We observe an improvement for the text modality (T) when compared to the HTML (Text) model (Yang et al., 2020) in Table 1 and Table 2. This performance gain can be attributed to FinBERT, which is trained to handle language tasks in the financial domain, while the sentence-level transformer employed in the HTML (Text) model is a generalized implementation of BERT (Devlin et al., 2019). We also note that representations learned by FinBERT outperform both GloVe (Pennington et al., 2014) and BERT embeddings, reiterating the effectiveness of domain-specific pre-training. Further, we observe from Table 2 that the Audio+FinBERT (CM Attn) model outperforms the unimodal components, demonstrating the utility of multimodal verbal-vocal cues for volatility prediction. On adding the GCN, we observe a gain of 17.7%, likely due to the GCN's ability to learn correlations between price movements of related stocks that are captured by the company relations. The conditioned LSTM network helps counteract the impact of PEAD by introducing earnings-call-independent information into the model, as can be observed in Table 2. We note that VolTAGE outperforms all its ablative components, demonstrating how its multimodal components complement each other.

On Multi-task Learning
Training a network on multiple tasks jointly has been shown to improve performance on tasks that share conceptual similarity (Caruana, 1997). In our case, we optimize VolTAGE on both the n-day average and single-day volatility prediction tasks in a multi-task formulation. In Figure 3, we analyze the variation of the weight parameter µ against the 3-day validation MSE of the n-day average and single-day predicted volatility. As both tasks share a weighted loss function, tuning µ trades off between the two tasks. We observe from Figure 3 that at the extreme values of the weight parameter, µ = 0 and µ = 1, which represent single-task learning on the single-day and n-day average prediction tasks respectively, VolTAGE does not obtain optimal performance. Empirically, we find the optimal µ = 0.8 for 3-day volatility forecasting on the main task, validating our hypothesis that multi-task learning across both average and single-day spans of volatility prediction improves predictive power.

Qualitative Analysis
We analyze the Q3-2017 earnings call of DG (Dollar General), an American variety store company, whose stock price became highly volatile for a few days following the earnings call. Figure 4a shows the audio-aware text attention heatmap for the duration of the earnings call. The heatmap represents the cross-modal attention weights assigned to textual utterances using the corresponding vocal cues; each cell (i, j) represents the weight of the j-th vocal utterance on the i-th textual utterance. We observe that the highest attention falls towards the middle of the call, suggesting that the vocal cues of this portion have the highest impact on the contextual text embeddings for most of the sentences in the call.
Earnings calls are often structured such that the beginning of the call involves introductory disclaimers and greetings, while the CEO starts presenting the financial results for the reporting quarter, along with the company's future goals, towards the middle of the call, which explains why we see such influential utterances in this portion. Figure 4b shows the disparity between the CEO's vocal and verbal cues around one such utterance. While the textual content seems positive, a sudden spike in shimmer features in the CEO's voice while speaking this sentence suggests disharmony between the verbal and vocal cues. Past research in acoustics (Li et al., 2007) suggests that elevated shimmer can indicate underlying stress in speech. After the earnings call, it was noted that the company's gross margin slipped by 0.4% due to increased transportation costs caused by Hurricane Irma in 2017. On analyzing the graph, we observe that DG has edges to WMT (Walmart) and TGT (Target Corp.), both of which are retail variety stores like DG. Analysts had estimated a negative impact of about $2.8 billion on the retail sector due to Hurricane Irma, which is also reflected in the high volatilities recorded for WMT and TGT during the same quarter. A unimodal model may miss these subtle disparities between text and audio. By leveraging cross-modal attention fusion and correlation graphs, VolTAGE accurately forecasts the volatility of DG three days after the earnings call.

Conclusion and Future Work
Volatility, measured as the deviation in returns, is a reliable indicator of the market risk linked with a stock. Earnings calls are a rich source of company information that provide high risk-reward opportunities, given their uniqueness and critical information disclosure. Although evidence shows that enriching models with speech and inter-stock correlations can improve volatility forecasting, this area remains underexplored. We propose VolTAGE, a neural architecture that jointly exploits the coherence between speech and text along with inter-stock correlations for volatility forecasting following earnings calls. Through experiments on S&P 500 index data, we show the merit of cross-modal gated attention fusion, graph-based learning, and multi-task prediction for volatility forecasting.
There are several promising directions for future work that we wish to explore. First, we want to improve the audio feature extraction: to better model the speech of CEOs in earnings calls, we plan to use semitones rather than raw frequency for pitch-related features. Experimenting with other commonly used acoustic feature sets, such as MFCCs, openSMILE features, and auDeep features, for representing audio utterances is another direction for audio feature extraction. Second, we want to expand the analysis presented in this paper beyond the S&P 500 index and US-based companies. Existing research (Qin and Yang, 2019; Yang et al., 2020) and this work at the intersection of natural language processing and earnings calls are limited to a small set of companies and earnings calls. Analyzing the demographic, cultural, and gender bias in research pertaining to financial disclosures, particularly earnings calls, forms another direction of research. We also plan to study a wider set of earnings calls and companies spanning multiple languages, demographics, speakers, and genders.