An Empirical Investigation of Bias in the Multimodal Analysis of Financial Earnings Calls

Volatility prediction is complex due to the stock market's stochastic nature. Existing research focuses on the textual elements of financial disclosures, such as earnings call transcripts, to forecast stock volatility and risk, but ignores the rich acoustic features in company executives' speech. Recently, multimodal approaches that leverage the verbal and vocal cues of speakers in financial disclosures have significantly outperformed previous state-of-the-art approaches, demonstrating the benefits of multimodality and speech. However, the financial realm still suffers from a severe underrepresentation of communities spanning diverse demographics, genders, and native languages. While multimodal models are better risk forecasters, it is imperative to also investigate the potential bias that these models may learn from the speech signals of company executives. In this work, we present the first study of gender bias in multimodal volatility prediction, arising from gender-sensitive audio features and the scarcity of female executives in earnings calls of one of the world's largest stock indexes, the S&P 500. We quantitatively analyze bias as error disparity and investigate its sources. Our results suggest that multimodal neural financial models accentuate gender-based stereotypes.


Introduction
Earnings calls are publicly available, quarterly conference calls where CEOs discuss their company's performance and future prospects with outside analysts and investors (Qin and Yang, 2019; Sawhney et al., 2020b). They consist of two sections: a prepared delivery of performance statistics, analysis, and future expectations, and a spontaneous question-answer session to seek additional information not disclosed before (Keith and Stent, 2019).
Researchers have studied the Post Earnings Announcement Drift (PEAD) and observed that statements made by upper management affect the way information is digested and acted upon, impacting short-term price movements (Ball and Brown, 1968; Bernard and Thomas, 1989; Yang et al., 2020).
Audio features contextualize text and convey the speaker's emotional and psychological state (Fish et al., 2017; Jiang and Pell, 2017; Burgoon et al., 2015; Bachorowski, 1999). Hence, when used alongside textual features, audio features significantly shape the effect of earnings calls on the stock market (Qin and Yang, 2019; Yang et al., 2020). Past research has shown that audio features such as a speaker's pitch and intensity vary greatly across genders (Mendoza et al., 1996; Burris et al., 2014; Latinus and Taylor, 2012). Moreover, female executives are highly underrepresented in these earnings calls (Agarwal, 2019; Investments, 2017). This variation in audio features is amplified by deep learning models due to the dearth of female training examples and manifests as gender bias: the system learns spurious correlations between stock volatility and sensitive attributes such as gender and accent. It further perpetuates gender-based stereotypes and generalizations, e.g., that female executives are less confident than male executives (Lonkani, 2019), that men are assessed as more charismatic than women under identical conditions (Novák-Tót et al., 2017), and that nurses are female and doctors are male (Saunders and Byrne, 2020). Biased models further entrench stereotypes that can harm underrepresented communities, specifically in the financial and corporate world. Novák-Tót et al. (2017) even show that female speakers must deliver a better acoustic-melodic performance to be perceived as equally charismatic as men.
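To make the gender sensitivity of such features concrete, the following is a minimal, self-contained sketch of how pitch and intensity might be estimated from a speech frame. It uses simple autocorrelation and RMS energy as toy stand-ins for the Praat-style features used in the earnings-call literature; the function name and thresholds are our own illustrative choices, not those of any cited work.

```python
import numpy as np

def pitch_and_intensity(signal, sr, fmin=75.0, fmax=400.0):
    """Crude pitch (via autocorrelation) and RMS intensity of a mono
    frame -- a toy stand-in for Praat-style acoustic features."""
    signal = signal - signal.mean()
    rms = np.sqrt(np.mean(signal ** 2))            # intensity proxy
    # Autocorrelation; keep only non-negative lags.
    ac = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)        # plausible pitch-period lags
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag, rms

# Synthetic 120 Hz tone (roughly a typical male fundamental frequency).
sr = 16000
t = np.arange(0, 0.1, 1 / sr)
f0, rms = pitch_and_intensity(np.sin(2 * np.pi * 120 * t), sr)
```

A pipeline built on such features inherits their gender sensitivity: a typical female fundamental frequency (around 200 Hz) sits in a different region of the same lag range, so downstream models can separate speakers by gender even when that is not intended.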
Taking a step towards fair risk forecasting models, we analyze gender bias by studying the error disparity in the state-of-the-art for multimodal volatility prediction, MDRM (Qin and Yang, 2019).

Background: Why Study Bias?
Bias in Finance Public financial data is impacting virtually every aspect of investment decision making (Perić et al., 2016; Brynjolfsson et al., 2011). Prior research shows that NLP methods leveraging social media (Sawhney et al., 2020a), news (Du and Tanaka-Ishii, 2020), and earnings calls (Wang and Hua, 2014) can accurately forecast financial risk. Companies and investors use statistical and neural models on multimodal financial data to forecast volatility (Cornett and Saunders, 2003; Trippi and Turban, 1992) and minimize risk. These models, although effective, may be tainted by bias due to individual and societal differences, often unintended (Mehrabi et al., 2019). For example, models trained on audio features extracted from CEOs' speech in earnings calls (Qin and Yang, 2019) may be prone to bias, given the underrepresentation of several demographics across race, gender, native language, etc. in the financial realm.
Bias in AI Bias is prevalent in AI-based neural models owing to the lack of diversity in training data (Torralba and Efros, 2011; Tommasi et al., 2017). AI models designed and trained on gender-imbalanced data risk depriving underrepresented groups, such as women, of opportunities (Niethammer, 2020; Dastin, 2018). With over 75% of AI professionals being men, male experiences also dominate algorithmic creation (Forum, 2018). In terms of natural language representation, embeddings such as word2vec and GloVe, trained on news articles, may inherit gender stereotypes (Packer et al., 2018; Bolukbasi et al., 2016; Park et al., 2018). Recent studies also show the presence of bias in speech emotion recognition (Li et al., 2019).

Bias in AI and Finance
With the advent of AI and Big Data, companies are intelligently using data to measure performance (Newman, 2020), but enterprises seldom check for imbalance in the gathered data. Women still hold fewer than 20% of positions in the financial-services C-suite (Chin et al., 2018), and only 5% of Fortune 500 CEOs are women (Suresh and Guttag, 2019). Studies show that models trained on gender-imbalanced data reduce women's chances of obtaining capital investments or loans (Gürdeniz et al., 2020). Another study found that, despite having identical credibility, female CEOs are perceived as less capable of attracting growth capital (Bigelow et al., 2014). Apart from that, using feature representations intrinsic to different genders can induce semantic gender bias (Li et al., 2019; Suresh and Guttag, 2019). Professional studies have found that men tend to self-reference using 'I', 'me', and 'mine', whereas women tend to reference the team, using 'we', 'our', and 'us' (Investments, 2017). Although there has been great progress in mitigating bias in text, understanding its presence in multimodal speech-based analysis, particularly in real-world scenarios like corporate earnings call analysis, remains an understudied yet promising research direction.

Figure 1: Model architecture used for training the multimodal audio-text model for evaluating gender-specific performance, inspired by Qin and Yang (2019).

Formulation and Experiments
Stock volatility Following Kogan et al. (2009) and Sawhney et al. (2020c), for a given stock with a close price of $p_i$ on trading day $i$, we calculate the average log volatility over the $\tau$ days following the day of the earnings call as:

$$v_{[0,\tau]} = \ln\left(\sqrt{\frac{1}{\tau}\sum_{i=1}^{\tau}\left(r_i - \bar{r}\right)^2}\right)$$

where the return price $r_i$ is defined as $\frac{p_i}{p_{i-1}} - 1$ and $\bar{r}$ is the average of $r_i$ from $0$ to $\tau$.

Volatility prediction We use the state-of-the-art Multimodal Deep Regression Model, MDRM (Qin and Yang, 2019), as shown in Figure 1. MDRM takes utterance-level audio A and text T embeddings and models them through two contextual BiLSTM layers, followed by late multimodal fusion. The fused text-audio features are fed to another BiLSTM followed by two fully-connected layers. MDRM is trained end-to-end by optimizing the mean squared error (MSE) between the predicted and true stock volatility.

Training setup The hyperparameters (Table 4) are tuned on the validation set defined by Qin and Yang (2019), following the same preprocessing. The maximum number of audio clips in any call is 520; hence, we zero-pad calls that have fewer than 520 clips. The model is trained on a TPU v3-8 for 20 epochs with a learning rate of 0.001. We perform five end-to-end runs with early stopping on the validation loss to arrive at the decision of training for 20 epochs.
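The volatility measure above can be sketched in a few lines (a minimal illustration; the function and variable names are ours, not the paper's):

```python
import numpy as np

def log_volatility(prices, tau):
    """Average log volatility over the tau trading days following an
    earnings call (after Kogan et al., 2009).

    prices: close prices p_0 .. p_tau (length tau + 1).
    """
    prices = np.asarray(prices, dtype=float)
    assert len(prices) == tau + 1
    returns = prices[1:] / prices[:-1] - 1.0   # r_i = p_i / p_{i-1} - 1
    r_bar = returns.mean()                     # average return over tau days
    return np.log(np.sqrt(np.mean((returns - r_bar) ** 2)))

# Example: a mildly fluctuating price series over tau = 3 days.
v = log_volatility([100.0, 101.0, 99.5, 100.5], tau=3)
```

For small daily fluctuations the inner standard deviation is well below 1, so the log volatility is negative; larger swings push it toward zero and above.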

Results and Analysis
Bias in Multimodal Volatility Prediction For evaluating gender bias in MDRM, we analyze the error disparity quantified by ∆G for the individual text and audio modalities and their combination for τ = 3, 7, 15, 30 days. We tabulate the error disparity in terms of ∆G across modalities in Table 2 and performance in Table 3. We observe that for all modalities, the error for the male distribution is consistently lower than that for the female distribution, for both short- and long-term durations. Although the audio modality improves model performance significantly, it carries the highest amount of bias, as audio features for males and females vary significantly. Further, the skewed distribution of speakers' gender in the earnings calls amplifies this error disparity.
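A minimal sketch of the error-disparity computation, assuming ∆G is the gap between the group-wise mean squared errors (the paper's exact definition of ∆G may differ; the function name and toy numbers are ours):

```python
import numpy as np

def error_disparity(y_true, y_pred, is_female):
    """Gender error disparity: MSE on female-speaker calls minus
    MSE on male-speaker calls. Positive values mean the model errs
    more on the underrepresented female group."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    is_female = np.asarray(is_female, dtype=bool)
    sq_err = (y_true - y_pred) ** 2
    return sq_err[is_female].mean() - sq_err[~is_female].mean()

# Toy example: the model errs more on the female-speaker subset.
delta_g = error_disparity(
    y_true=[1.0, 1.2, 0.8, 1.1],
    y_pred=[1.1, 1.0, 0.7, 1.6],
    is_female=[False, False, True, True],
)
```

In the toy example the female-group MSE is 0.13 against a male-group MSE of 0.025, giving a positive disparity, which mirrors the direction of the gap reported across modalities.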
Over-amplification refers to bias that arises during model fitting. The model learns imperfect generalizations between the attributes and the final labels and amplifies them when predicting on the test set. In our case, since female examples are far fewer than their male counterparts, the model discriminates between male and female examples, inferring insufficient information beyond its source base rate, as shown in Table 2.
To study this effect, we train the model on training samples with different gender ratios and observe the performance variation in Figure 2. We note that as the male:female training ratio increases, the test loss is amplified most in the audio modality, followed by audio+text and text.
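The ratio sweep can be sketched as a simple subsampling step before each training run (an illustrative sketch only; the paper does not specify its exact sampling procedure, and the helper name is ours):

```python
import random

def subsample_by_ratio(male_calls, female_calls, ratio, seed=0):
    """Build a training set with a target male:female ratio by keeping
    all female calls and sampling `ratio` male calls per female call."""
    rng = random.Random(seed)  # fixed seed for reproducible runs
    n_male = min(len(male_calls), int(ratio * len(female_calls)))
    return rng.sample(male_calls, n_male) + list(female_calls)

# e.g. sweep ratios 1:1 up to 5:1, retraining the model at each setting:
train_2to1 = subsample_by_ratio(list(range(100)), ["f1", "f2", "f3"], ratio=2)
```

Keeping every female call while varying the number of male calls isolates the effect of the ratio itself from the effect of simply having fewer female examples.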

Ethical Considerations
Degradation in the performance of speech models can stem from discernible noise as well as indiscernible sources like demographic bias: age, gender, dialect, culture, etc. (Meyer et al., 2020; Hashimoto et al., 2018; Tatman and Kasten, 2017). Studies also show that AI systems can exhibit bias against Black people in criminal sentencing (Angwin et al., 2016; Tatman and Kasten, 2017). Although we only account for gender bias in our study, we acknowledge that other kinds of bias could exist due to age, accent, culture, and ethnic and regional disparities in audio cues, as the publicly available earnings calls predominantly feature US companies. Moreover, only publicly available earnings calls have been used, limiting the scope of the data. This also limits the genders represented in the data to only male and female. In the future, we hope to increase the amount of data to expand our study to more categories and types of sensitive attributes.

Conclusion
Earnings calls provide company insights from executives, proving to be high risk-reward opportunities for investors. Recent multimodal approaches that utilize these acoustic and textual features to predict financial risk achieve state-of-the-art performance, but overlook the gender bias associated with speech. We analyze the gender bias in volatility prediction on earnings calls arising from gender-sensitive audio features and the underrepresentation of women in executive positions. We observe that while adding speech features improves performance, it also perpetuates gender bias, as the audio modality has the highest error disparity. We further probe the sources of this bias, analyze audio feature variations across genders, and perform experiments with varying training data distributions.
Our study presents the first analysis of gender bias in multimodal financial forecasting, bridging the gap between fairness in AI, neural financial forecasting, and multimodality.