Modeling Financial Analysts’ Decision Making via the Pragmatics and Semantics of Earnings Calls

Every fiscal quarter, companies hold earnings calls in which company executives respond to questions from analysts. After these calls, analysts often change their price target recommendations, which are used in equity research reports to help investors make decisions. In this paper, we examine analysts' decision making behavior as it pertains to the language content of earnings calls. We identify a set of 20 pragmatic features of analysts' questions which we correlate with analysts' pre-call investor recommendations. We also analyze the degree to which semantic and pragmatic features from an earnings call complement market data in predicting analysts' post-call changes in price targets. Our results show that earnings calls are moderately predictive of analysts' decisions even though these decisions are influenced by a number of other factors including private communication with company executives and market conditions. A breakdown of model errors indicates disparate performance on calls from different market sectors.


Introduction
Financial analysts are key sell-side players in finance who are employed to analyze, interpret, and disseminate financial information (Brown et al., 2015). For the firms they cover, financial analysts regularly release stock price targets and recommendations to buy, hold, or sell the company's stock. Financial analysts' forecasts are of value to investors (Givoly and Lakonishok, 1980) and may be better surrogates for market expectations than forecasts generated by time-series models (Fried and Givoly, 1982).
Analysts' decisions are influenced by market conditions and private communications, so it is impossible to exactly reconstruct their decision making process. (Brown et al. (2015) find that over half of the 365 analysts they survey have private communication with company executives.) However, signals of analysts' decision making may be obtained by analyzing earnings calls: quarterly live conference calls in which company executives present prepared remarks (the presentation section) and then selected financial analysts ask questions (the question-answer section). Previous work has shown that earnings calls disclose more information than company filings alone (Frankel et al., 1999) and influence investor sentiment in the short term (Bowen et al., 2002). However, recently company executives and investors have questioned their value (Koller and Darr, 2017; Melloy, 2018).
Earnings calls are extremely complex, naturally-occurring examples of discourse that are interesting to study from the perspective of computational linguistics (see Figure 1). In this work, we examine analysts' decision making in the context of earnings calls in two ways:
• Correlating analysts' question pragmatics with their pre-call judgments: With domain experts, we select a set of 20 pragmatic and discourse features which we extract from the questions of earnings calls. We then correlate these with analysts' pre-call judgments and find that bullish analysts tend to be called on earlier in calls, and ask questions that are more positive, more concrete, and less about the past (§4).
• Predicting changes in analysts' post-call forecasts: We use the pragmatic features, along with representations of the semantic content of earnings calls, to predict changes in analysts' post-call price targets. Since analysts have a deep understanding of market factors influencing a company's performance and have access to private information, our null hypothesis is that earnings calls are not predictive of forecast changes. However, our best model gives a reduction of 25% in relative accuracy error over a majority-class baseline (twice the reduction of a model using market data alone), suggesting there is signal in the noise. We also conduct pairwise comparisons of modeling features including: semantic vs. pragmatic features, Q&A-only vs. whole-call data, and whole-document vs. turn-level models (§5).

[Figure 1: An example question-answer pair from an earnings call; the answer turn shown is from Sundar Pichai, CEO, discussing YouTube and mobile search.]

Related work
NLP is used extensively for financial applications (Tetlock et al., 2008; Kogan et al., 2009; Leidner and Schilder, 2010; Loughran and McDonald, 2011; Wang et al., 2013; Ding et al., 2014; Peng and Jiang, 2016; Li and Shah, 2017; Rekabsaz et al., 2017). Earnings calls, in particular, have been shown to be predictive of investor sentiment in the short term, including of increased stock volatility and trading volume levels (Frankel et al., 1999), decreased forecast error and forecast dispersion (Bowen et al., 2002), and increased absolute returns for intra-day trading (Cohen et al., 2012). Although most prior work on earnings calls treats each call as a single document, Matsumoto et al. (2011) find that the question-answer portion of the earnings call is more informative (in terms of intra-day absolute returns) than the presentation portion, and Cohen et al. (2012) show firms "cast" earnings calls by disproportionately calling on bullish analysts. Most prior applications of NLP to earnings calls use only shallow linguistic features and correlation analyses, specifically correlations between political bigrams and stock return volatility (Hassan et al., 2016); contrastive words and share prices (Palmon et al., 2016); and euphemisms and earnings surprise (Suslava, 2017). Other work analyzes earnings calls from a sociolinguistic perspective, including in terms of discourse connectives (Camiciottoli, 2010), indirect requests (Camiciottoli, 2009), unanswered questions (Hollander et al., 2010), persuasion (Crawford Camiciottoli, 2018), and deception (Larcker and Zakolyukina, 2011). Focusing on only the audio of earnings calls, Mayew and Venkatachalam (2012) extract managers' affective states using commercial speech software. In the work most similar to ours, Wang and Hua (2014) use named entities, part-of-speech tags, and probabilistic frame-semantic features in addition to unigrams and bigrams to correlate earnings calls with financial risk, which they define as the volatility of stock prices in the week following the earnings call.
NLP-based corpus analyses of decision making are rare. Beňuš et al. (2014) analyze the impact of entrainment on Supreme Court justices' subsequent decisions. Multiple groups have examined the impact of various semantic and pragmatic features on modeling opinion change using Reddit ChangeMyView discussions (e.g., Hidey et al., 2017; Jo et al., 2018; Musi, 2018), and there has been other work on opinion change using other web discussion data (e.g., Tan et al., 2016; Habernal and Gurevych, 2016; Lukin et al., 2017). Because many factors influence decision making behavior, the fact that any signal can be obtained from linguistic analyses of isolated language artifacts is scientifically interesting.

[Table 1: Earnings calls by data split: 12,285 total (2010-2017); 9,770 train; the remainder split between validation and test.]

Data and pre-processing
Our data consists of transcripts of 12,285 earnings calls held between January 1, 2010 and December 31, 2017. In order to control for analyst coverage effects (larger companies with a greater market share will typically be covered by more analysts), we include only calls from S&P 500 companies. We split the data by year into training, validation, and testing sets (see Table 1). The transcripts are XML files with metadata specifying speaker turn boundaries and the name of the speaker (or "Operator" for the call operator). In order to identify speaker type (analyst or company representative) we use the following heuristic: if the transcript explicitly includes the speaker type with the speaker name (e.g. "John Doe, Analyst"), we do exact string matching for ", Analyst"; otherwise, we assume the names of speakers between the first and second operator turns (i.e. in the presentation section) are those of company representatives, and all other speakers are analysts. We manually checked this heuristic on a few dozen documents and found it to have high precision.
We remove turns spoken by the operator as well as turns that have fewer than 10 tokens, since manual analysis revealed the latter were largely acknowledgment and greeting turns (e.g. "Thank you for your time" and "You're welcome"). We also lexicalize named entities, representing each as a single token. We obtain tokenization, part-of-speech tagging, and dependency parsing via a proprietary NLP library.
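As a concrete illustration, the following is a minimal Python sketch of the speaker-type heuristic and turn filtering described above, assuming turns have already been parsed from the XML transcripts; the `Turn` class and field names are hypothetical, not the paper's code.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str   # e.g. "John Doe, Analyst" or "Operator"
    text: str

def label_speaker_types(turns):
    """Return a parallel list of speaker types: 'analyst' or 'company'."""
    # Names appearing between the first and second operator turns
    # (the presentation section) are assumed to be company representatives.
    operator_indices = [i for i, t in enumerate(turns) if t.speaker == "Operator"]
    company_names = set()
    if len(operator_indices) >= 2:
        start, end = operator_indices[0], operator_indices[1]
        company_names = {t.speaker for t in turns[start + 1:end]}

    labels = []
    for t in turns:
        if ", Analyst" in t.speaker:       # explicit metadata wins
            labels.append("analyst")
        elif t.speaker in company_names:
            labels.append("company")
        else:
            labels.append("analyst")
    return labels

def keep_substantive_turns(turns, min_tokens=10):
    """Drop operator turns and short greeting/acknowledgment turns."""
    return [t for t in turns
            if t.speaker != "Operator" and len(t.text.split()) >= min_tokens]
```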

Pragmatic correlations with analysts' pre-call judgments
We are interested in whether and how the forms of analysts' questions reflect their pre-call judgments about the companies they cover. Analysts' questions are complex: a single turn may contain several questions (or answers). An example question-answer pair is shown in Figure 1. We compute Pearson correlations between linguistic features indicating certainty, deception, emotion, and outlook (§4.1) and the type of analyst (bullish, bearish, or neutral) asking the question. We use a mapping of analysts' recommendations to a 1-5 scale, where 1 denotes "strong sell" and 5 denotes "strong buy." We label each analyst according to their recommendation of the company before the earnings call:
• bearish if the analyst gives the company a 1 or 2,
• neutral if they give a 3, and
• bullish if they give a 4 or 5.
We have analyst recommendations for 160,816 total question turns; the distribution over analyst labels is 4.5% bearish, 35.7% neutral, and 59.7% bullish. Following other correlation work in NLP (Preoţiuc-Pietro et al., 2015; Holgate et al., 2018), we use Bonferroni correction to address the multiple comparisons problem.

Pragmatic lexical features
We extract 20 pragmatic features from each turn by gathering existing hand-crafted linguistic lexicons for these concepts. See Table 2 for statistics about the lexicons and Table 3 for examples.
Named entity counts and concreteness ratio. For each turn, we calculate the number of named entities in five coarse-grained groups constructed from the fine-grained entity types of OntoNotes (Hovy et al., 2006): (1) events, (2) numbers, (3) organizations/locations, (4) persons, and (5) products. We also calculate (6) a concreteness ratio: the number of named entities in the turn divided by the total number of tokens in the turn.

Predicate-based temporal orientation. Temporal orientation is the emphasis individuals place on the past, present, or future. Previous work has shown correlations between "future intent" extracted from query logs and financial market volume volatility (Hasanuzzaman et al., 2016). We determine the temporal orientation of every predicate in a turn. We extract OpenIE predicates via a re-implementation of PredPatt (White et al., 2016). For each predicate, we look at its Penn Treebank part-of-speech tag and use a heuristic to determine whether it is "past," "present," or "future": if the part-of-speech tag for the predicate is VBD or VBN, the temporal orientation is "past"; otherwise, if it is VB, VBG, VBP, or VBZ, it is "present," unless the predicate has a dependent of the form will, 'll, shall, or wo (indicating "future"), is, am, or are (indicating "present"), or was or were (indicating "past"). We then calculate the number of (7) "past" oriented predicates, (8) "present" oriented predicates, and (9) "future" oriented predicates in each turn.
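A minimal sketch of this temporal-orientation heuristic, assuming each predicate comes with its Penn Treebank POS tag and the surface forms of its dependents (the function and argument names are ours):

```python
def temporal_orientation(pred_pos, dependent_tokens):
    """Map a predicate's PTB POS tag (plus its dependents) to past/present/future."""
    past_tags = {"VBD", "VBN"}
    present_tags = {"VB", "VBG", "VBP", "VBZ"}
    if pred_pos in past_tags:
        return "past"
    if pred_pos in present_tags:
        deps = {d.lower() for d in dependent_tokens}
        if deps & {"will", "'ll", "shall", "wo"}:   # "wo" from tokenized "won't"
            return "future"
        if deps & {"was", "were"}:
            return "past"
        return "present"                             # includes is/am/are dependents
    return None  # not a verbal predicate

# e.g. temporal_orientation("VBG", ["will", "be"]) -> "future"
```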
Sentiment. We calculate the ratio of (10) positive sentiment terms and (11) negative sentiment terms to the number of tokens in each turn. We use the financial sentiment lexicons developed by Loughran and McDonald (2011) from fourteen years of 10-Ks. We supplement these with a general-purpose sentiment dictionary (Taboada et al., 2011), to account for the relative informality of earnings calls.
Hedging. We calculate (12) the ratio of hedges to tokens in each turn. Hedges are lexical choices by which a speaker indicates a lack of commitment to the content of their speech (Prince et al., 1982). We use the single- and multi-word hedge lexicons from Prokofieva and Hirschberg (2014).
Other lexicon-based features. We compute the ratios of (13) modal, (14) uncertain, (15) constraining, and (16) litigious terms in each turn using the respective lexicons from Loughran and McDonald (2011). In each case, we compute the ratio of terms in the category to the number of tokens in the turn.
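Features (10)-(16) all share the same form: the fraction of a turn's tokens that appear in a lexicon. A minimal sketch, assuming the word lists (e.g. the Loughran-McDonald categories) are loaded elsewhere as Python sets; multi-word hedges would additionally need phrase matching, omitted here for brevity:

```python
def lexicon_ratio(tokens, lexicon):
    """Fraction of tokens in `tokens` that appear in `lexicon` (a set of words)."""
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok.lower() in lexicon)
    return hits / len(tokens)

# Usage with hypothetical pre-loaded word lists:
# features = {name: lexicon_ratio(turn_tokens, lex)
#             for name, lex in [("positive", lm_positive), ("negative", lm_negative),
#                               ("hedges", hedge_lexicon), ("modal", lm_modal),
#                               ("uncertain", lm_uncertainty),
#                               ("constraining", lm_constraining),
#                               ("litigious", lm_litigious)]}
```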
Other pragmatic features. We also calculate (17) the turn order, (18) the number of tokens, (19) the number of predicates, and (20) the number of sentences in each turn.

Interpretation of correlation results.
Full results for the pragmatic correlation analysis are given in Table 4. For a number of features the correlations are not statistically significant. Below we expand upon the statistically significant negative (−) and positive (+) correlations with the bullishness of an analyst:
• (+) Bullishness and turn order. Bullish analysts tend to be called on earlier in the call, and bearish and neutral analysts tend to be called on later, which confirms the conclusion of Cohen et al. (2012).
• (+) Bullishness and positive sentiment. Bullish analysts tend to ask more positive (less negative) questions and the reverse is true for neutral/bearish analysts. Intuitively, this makes sense since bullish analysts are more favorable towards the firm and thus probably cast the firm in a positive light.
• (+) Bullishness and entities. Here we find that bullish analysts are slightly more concrete in their questions towards the company and tend to ask more about organizations and people.
• (−) Bullishness and past predicates. This suggests bearish and neutral analysts tend to talk about the past more.
These correlations could be used by journalists and investors to flag questions that follow atypical patterns for a particular analyst.

Predicting changes in analysts' post-call forecasts
We are interested in what earnings-call related information is indicative of analysts' subsequent decisions to change (or not change) their price targets after an earnings call. A price target is a projected future price level of an asset; for example, an analyst may give a stock that is currently trading at $50 a six-month price target of $90 if they believe the stock will perform better in the future.
We design experiments to answer the following research questions:
(1) Is the text of earnings calls predictive of analysts' changes in price targets from before to after the call? This is an open research question since analysts may change their price targets at any time and consider external information (e.g. current events or private conversations with company executives).
(2) If the text is predictive, is it more predictive than market-based features such as the company's stock price, volatility, and earnings?
(3) If the text is predictive, which linguistic aspects (e.g. pragmatic vs. semantic) are more predictive, and with which feature representations?
(4) Is the question-answer portion of the call more predictive than the presentation portion?
(5) Does a turn-based model of the call provide more signal than "single document" representations?

Representing analysts' forecast changes
We model analysts' changes in forecasts as both a regression task and a 3-class classification task, because different formulations may be of interest to different stakeholders.

Regression. For each earnings call in our corpus, i ∈ D, and each analyst in the set of analysts covering that call, j ∈ J_i, let b_j be the price target of analyst j before the call and a_j the price target after the call. (Because the company holding the earnings call chooses which analysts to call on for questions, our data includes ratings and recommendations from analysts who do not ask a question in a call; and because individual analysts' recommendations may be sold to different vendors, we do not have ratings and recommendations for all analysts who do ask questions.) Then the average percent change in analysts' price targets is

y_i = \frac{1}{|J_i|} \sum_{j \in J_i} \frac{a_j - b_j}{b_j}    (1)

See Figure 1 for the distribution of y_i. Since analysts can report price targets at any time, we set cut-off points for b_j and a_j to be 3 months before and 14 days after the earnings call date, respectively (a majority of analysts who change their price targets do so within two weeks after a call).

Classification. We create three (roughly equal) classes (negative, neutral, and positive change) by binning the y_i values from Equation (1) into thirds. Table 5 shows the class breakdown for each split of the data.
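A minimal sketch of the target construction, implementing Equation (1) and the tercile binning described above (variable names are ours):

```python
import numpy as np

def average_percent_change(pairs):
    """Equation (1): mean of (a_j - b_j) / b_j over analysts j covering a call."""
    return float(np.mean([(a - b) / b for b, a in pairs]))

def tercile_labels(y_values):
    """Bin y_i values into three roughly equal classes: 0=negative, 1=neutral, 2=positive."""
    lo, hi = np.percentile(y_values, [100 / 3, 200 / 3])
    return np.digitize(y_values, [lo, hi])

# Toy example: two analysts cover a call, with (before, after) price targets.
y_i = average_percent_change([(50.0, 55.0), (60.0, 57.0)])  # -> 0.025
```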

Features
We compare models with market-based, pragmatic, and semantic features.

Market features
For each company and call in our dataset, we obtain 10 market features for the trading day prior to the call date: open price, high price, low price, volume of shares, 30-day volatility, 10-day volatility, price/earnings ratio, relative price/earnings ratio, EBIT yield, and earnings yield (see Appendix B in the supplemental material for detailed definitions of these finance terms). We impute missing values for these features using the mean value of features in the training data; there are missing values for less than 1% of the data, mainly due to company acquisitions and changes of company names. We scale features to have zero mean and unit variance.
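A minimal sketch of this preprocessing with scikit-learn: mean imputation fit on the training split, followed by zero-mean/unit-variance scaling. The toy arrays stand in for the (n_calls, 10) market-feature matrices.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the market-feature matrices, with a missing value.
X_train = np.array([[50.0, 52.0], [60.0, np.nan], [55.0, 51.0]])
X_test = np.array([[np.nan, 53.0]])

market_prep = make_pipeline(
    SimpleImputer(strategy="mean"),  # fill missing values with training-set means
    StandardScaler(),                # zero mean, unit variance
)
X_train_prepped = market_prep.fit_transform(X_train)  # fit on training data only
X_test_prepped = market_prep.transform(X_test)
```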

Semantic features
Doc2Vec. We use the paragraph vector algorithm proposed by Le and Mikolov (2014) to obtain 300-dimensional document embeddings. Depending on the model, we train doc2vec embeddings over whole calls, question-answer sections only, or individual turns. Using the Gensim implementation (version 3.6.0; Řehůřek and Sojka, 2010), we train doc2vec models for 50 epochs and ignore words that occur fewer than 10 times in the respective training corpus.
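A minimal Gensim sketch of this setup, matching the stated settings (300 dimensions, 50 epochs); the toy corpus stands in for the token lists of calls, Q&A sections, or turns, and `min_count=1` is used only so the toy example runs (the paper uses 10 on the full corpus):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [["revenue", "grew", "this", "quarter"],
        ["guidance", "was", "revised", "downward"]]

tagged = [TaggedDocument(words=toks, tags=[i]) for i, toks in enumerate(docs)]
model = Doc2Vec(tagged, vector_size=300, min_count=1, epochs=50)
vec = model.dv[0]  # 300-dim embedding (model.docvecs in the Gensim 3.x series)
```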
Bag-of-words. We lowercase tokens, augment them with their parts of speech, and then limit the vocabulary to the top 100K content words (those with Universal Dependencies part-of-speech tags ADJ, ADV, AUX, INTJ, NOUN, PRON, PROPN, or VERB) in the training data. Depending on the model, we calculate bag-of-words feature vectors over the whole document, over the Q&A section, or over each turn separately.
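A minimal sketch of such bag-of-words features with scikit-learn; the `token|POS` format and the upstream content-word filtering are our assumptions about the representation:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents with POS-augmented tokens produced upstream.
train_docs = ["revenue|NOUN grew|VERB strongly|ADV",
              "margin|NOUN declined|VERB slightly|ADV"]

vectorizer = CountVectorizer(
    max_features=100_000,   # keep only the most frequent terms (top 100K)
    token_pattern=r"\S+",   # tokens are whitespace-separated token|POS units
    lowercase=False,        # tokens were already lowercased before POS tagging
)
X_train_bow = vectorizer.fit_transform(train_docs)
```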

Pragmatic features
We combine the 20 pragmatic features described in Section 4.1 into a single feature vector. These features are only used in our turn-level models.

Models
We use several models to predict changes in analysts' forecasts.
[Table 6: Test-set regression and classification results. Models are ridge regression (RR), long short-term memory networks (LSTM), logistic regression (LR), and ensemble (Ens.). WD denotes whole-document models; Q&A denotes Q&A-only models. Evaluation metrics are mean squared error (MSE), the coefficient of determination (R²), accuracy (Acc.), and macro-level F1. For regression, percent error reduction (% err.) is from the MSE of the baseline of predicting the training mean; for classification, it is from the accuracy of predicting the majority class.]

Whole-document models
Ridge regression. For regression, we use ridge regression (implemented with scikit-learn), which minimizes a linear least squares loss regularized with an L2-norm. (We also tried kernel ridge regression with a Gaussian (RBF) kernel, which gave similar results; see Appendix C for details.) To tune hyperparameters, we perform a five-fold cross-validation grid search over the regularization strength α, for values from 10^-3 to 10^8 on a logarithmic scale. We evaluate on mean squared error (MSE) and the coefficient of determination (R²).

Logistic regression. For classification, we train logistic regression with an L2 penalty (implemented with scikit-learn; we also tried support vector machines, see Appendix C), and we tune C, the inverse regularization constant, via a grid search with 5-fold cross-validation on the training set. We evaluate validation and test sets using accuracy and macro F1 scores.
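A minimal sketch of both whole-document models with scikit-learn; the α grid follows the values stated above, while the C grid and the toy data are our assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge, LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))        # toy stand-in for document features
y_reg = rng.normal(size=100)          # toy regression targets (y_i values)
y_cls = rng.integers(0, 3, size=100)  # toy 3-class labels

# Regression: tune alpha over 1e-3 .. 1e8 on a log scale with 5-fold CV.
ridge = GridSearchCV(Ridge(), {"alpha": np.logspace(-3, 8, 12)},
                     cv=5, scoring="neg_mean_squared_error").fit(X, y_reg)

# Classification: L2-penalized logistic regression, tuning C with 5-fold CV.
logreg = GridSearchCV(LogisticRegression(penalty="l2", max_iter=1000),
                      {"C": np.logspace(-3, 3, 7)},
                      cv=5, scoring="accuracy").fit(X, y_cls)
```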

Q&A-only models
In order to compare the relative influence of the presentation versus question-answer sections of the earnings calls, we remove the presentation portion of each call and predict on the Q&A portion only. Except for this difference, Q&A-only models are identical to whole-document models.

Turn-by-turn models
LSTM for regression. We model transcripts as a sequence of turns using long short-term memory networks (LSTMs) (Hochreiter and Schmidhuber, 1997). Let x_t ∈ R^k be the input vector at time t for embedding dimension k, and let L be the total length of the sequence. Each x_t is fed into the LSTM in order and produces a corresponding output vector h_t. The final output vector is passed through a linear layer y = w_y · h_L + b_y for output y ∈ R, with w_y ∈ R^k. For a given mini-batch b, L_b is fixed as the maximum number of turns among all documents in the mini-batch, and the sequences for the other documents are padded. The network is trained with mean squared error (MSE) loss.

LSTM for classification. The LSTM architecture for classification is similar to that used for regression, except that there is an additional softmax layer after the final linear layer. This network is trained with cross-entropy loss.
Both LSTMs are implemented in PyTorch (https://pytorch.org/) and trained via a grid search over hyperparameters including the learning rate and hidden state size.
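A minimal PyTorch sketch of the regression variant, mapping a sequence of per-turn feature vectors through an LSTM and a linear layer on the final output vector; padding handling is simplified here, and this is not the paper's code:

```python
import torch
import torch.nn as nn

class TurnLSTMRegressor(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.linear = nn.Linear(hidden_dim, 1)  # y = w_y . h_L + b_y

    def forward(self, x):             # x: (batch, num_turns, input_dim)
        outputs, _ = self.lstm(x)     # outputs: (batch, num_turns, hidden_dim)
        h_last = outputs[:, -1, :]    # output vector at the final turn, h_L
        return self.linear(h_last).squeeze(-1)

# For classification, the final layer would map to 3 logits and the model
# would be trained with nn.CrossEntropyLoss (which applies softmax internally).
model = TurnLSTMRegressor(input_dim=300, hidden_dim=128)
x = torch.randn(4, 25, 300)           # a padded mini-batch: 4 calls, 25 turns each
loss = nn.MSELoss()(model(x), torch.randn(4))
```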

Fusion and ensembling
Early fusion. We use early fusion (Atrey et al., 2010) to combine semantic and pragmatic feature vectors at every turn and feed these into a LSTM.
Ensembling via stacking. We use "stacked generalization" (Wolpert, 1992) (a.k.a. "stacking") to combine fusion and market-based models. For regression, we take the output values from the fusion and market-based models as features for a ridge regression model. For classification, we take the three-dimensional probability vectors output by the fusion and market-based models and concatenate them as features for a logistic regression model. In both cases, hyperparameters are tuned on validation data.
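A minimal sketch of the classification stacking step; the Dirichlet draws stand in for the (n_calls, 3) class-probability outputs of the two base models on the validation split, and variable names are ours:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
p_fusion_val = rng.dirichlet(np.ones(3), 50)   # base-model class probabilities
p_market_val = rng.dirichlet(np.ones(3), 50)
y_val = rng.integers(0, 3, size=50)

Z_val = np.hstack([p_fusion_val, p_market_val])  # 6 meta-features per call
meta = LogisticRegression().fit(Z_val, y_val)    # stacked meta-classifier
```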

Baselines.
We compare against several baselines: (1) random, drawing a random variable from a Gaussian centered at the mean of the training data, (2) predicting the mean change in forecast across all documents in the training set (regression), and (3) predicting 0, the majority class (classification).

Results.
See Table 6 for full results. We address our original research questions from the beginning of §5.
(1) Predictiveness. We find earnings calls are moderately predictive of changes in analysts' forecasts, with an almost 25% relative error reduction in classification accuracy over the baseline of predicting the majority class. While the accuracy of our best model may seem modest, analysts' decisions can be influenced by many external factors outside of the text itself, and our ability to find any signal among the noise may be interesting to financial experts.
(2) Text vs. market. Semantic features are more predictive of changes in analysts' price targets than market features (a 24.8% error reduction over baseline for bag-of-words and a 23.8% reduction for doc2vec, vs. a 12.4% error reduction for market features).
(3) Semantic vs. pragmatic. Semantic features (doc2vec and bag-of-words) are more predictive than pragmatic features. This suggests the semantic content of the earnings call is important in how analysts make decisions to change their price targets.
(4) Q&A-only vs. whole-doc. Contrary to Matsumoto et al. (2011), who find the question-answer portions of earnings calls to be most informative, we find the Q&A-only models are much less predictive than whole-document models for both doc2vec (accuracy 0.385 vs. 0.479) and bag-of-words (accuracy 0.388 vs. 0.482).
(5) Whole-doc vs. turn-level. Whole-document models are more predictive than turn-level models (the best LSTM model achieves a 19.1% error reduction over baseline, vs. 24.8% for the best whole-document model). We hypothesize that turn-level models might capture more signal if they incorporated speaker metadata, e.g. the role of the speaker or the analyst's pre-call judgment of the company. Although whole-document models are more predictive, turn-level analyses of analysts' behavior may be more useful for alerting stakeholders to predictive signals in real time (e.g. an important analyst question midway through a live earnings call), since financial markets can vary significantly over short time periods.
Breakdown of results by industry. We analyze errors on the validation data by segmenting earnings calls by each company's Global Industry Classification Standard (GICS) sector. See Figure 2 for the breakdown. Notably, the bag-of-words model performs almost 2.5 times worse on earnings calls from the Materials sector than on those from the Utilities and Telecommunication Services sectors. This suggests industry-specific models may be important in future work.

Conclusions and future work
In this work we (a) correlate pragmatic features of analysts' questions with the pre-call judgment of the questioner, and (b) explore the influence of market, semantic, and pragmatic features of earnings calls on analysts' subsequent decisions. We show that bullish analysts are more likely to ask slightly more positive and concrete questions, talk less about the past, and be called on earlier in a call. We also demonstrate that earnings calls are moderately predictive of changes in analysts' forecasts.
Promising directions for future research include examining additional features and feature representations: pragmatic features such as formality (Pavlick and Tetreault, 2016) or politeness (Danescu-Niculescu-Mizil et al., 2013); acoustic-prosodic features from earnings call audio; more sophisticated semantic representations such as claims (Lim et al., 2016), automatically induced entity-relation graphs (Bansal et al., 2017), or question-answer motifs (Zhang et al., 2017) (these representations are non-trivial to construct because a single turn may contain many questions or answers); or even discourse structures. The models used in this work aim to be just complex enough to determine whether useful signals exist for this task; future modeling work could include training a complete end-to-end system such as a hierarchical attention network (Yang et al., 2016), or building industry-specific models.