Uncertainty over Uncertainty: Investigating the Assumptions, Annotations, and Text Measurements of Economic Policy Uncertainty

Methods and applications are inextricably linked in science, and in particular in the domain of text-as-data. In this paper, we examine one such text-as-data application, an established economic index that measures economic policy uncertainty from keyword occurrences in news. This index, which is shown to correlate with firm investment, employment, and excess market returns, has had substantive impact in both the private sector and academia. Yet, as we revisit and extend the original authors' annotations and text measurements, we find interesting text-as-data methodological research questions: (1) Are annotator disagreements a reflection of ambiguity in language? (2) Do alternative text measurements correlate with one another and with measures of external predictive validity? We find, for this application, that (1) some annotator disagreements about economic policy uncertainty can be attributed to ambiguity in language, and (2) switching measurements from keyword-matching to supervised machine learning classifiers results in low correlation, a concerning implication for the validity of the index.


Introduction
The relatively novel research domain of text-as-data, which uses computational methods to automatically analyze large collections of text, is a rapidly growing subfield of computational social science with applications in political science (Grimmer and Stewart, 2013), sociology (Evans and Aceves, 2016), and economics (Gentzkow et al., 2019). In economics, textual data such as news editorials (Tetlock, 2007), central bank communications (Lucca and Trebbi, 2009), financial earnings calls (Keith and Stent, 2019), company disclosures (Hoberg and Phillips, 2016), and newspapers (Thorsrud, 2020) have recently been used as new, alternative data sources.
In one such economic text-as-data application, Baker et al. (2016) aim to construct an economic policy uncertainty (EPU) index whereby they quantify the aggregate degree to which policy contributes to economic uncertainty (see Table 1 for examples). They operationalize this as the proportion of newspaper articles that match keywords related to the economy, policy, and uncertainty.
The index has had impact both on the private sector and academia. In the private sector, financial companies such as Bloomberg, Haver, FRED, and Reuters carry the index and sell financial professionals access to it. Academics show economic policy uncertainty has strong relationships with other economic indicators: Gulen and Ion (2016) find a negative relationship between the index and firm-level capital investment, and Brogaard and Detzel (2015) find that the index can positively forecast excess market returns. The EPU index of Baker et al. has substantive impact and is a real-world demonstration of finding economic signal in textual data. Yet, as the subfield of text-as-data grows, so too does the need for rigorous methodological analysis of how well the chosen natural language processing methods operationalize the social science construct at hand. Thus, in this paper we seek to re-examine Baker et al.'s linguistic, annotation, and measurement assumptions. Regarding measurement, although keyword look-ups yield high-precision results and are interpretable, they can also be brittle and may suffer from low recall. Baker et al. did not explore alternative text measurements based on, for example, word embeddings or supervised machine learning classifiers.

Table 1: Example sentences that express economic policy uncertainty.
1. Demand for new clothing is uncertain because several states may implement large hikes in their sales tax rates.
2. The outlook for the H1B visa program remains highly uncertain. As a result, some high-tech firms fear that shortages of qualified workers will cramp their expansion plans.
3. The looming political fight over whether to extend the Bush-era tax cuts makes it extremely difficult to forecast federal income tax collections in 2011.
4. Uncertainty about prospects for war in Iraq has encouraged a build-up of petroleum inventories and pushed oil prices higher.
5. Some economists claim that uncertainties due to government industrial policy in the 1930s prolonged and deepened the Great Depression.
6. It remains unclear whether the government will implement new incentives for small business hiring.

In exploring Baker et al.'s construction of EPU, we identify and disentangle multiple sources of uncertainty. First, there is the real underlying uncertainty about economic outcomes due to government policy that the index attempts to measure. Second, there is semantic uncertainty that can be expressed in the language of newspaper articles. Third, there is annotator uncertainty about whether a document should be labeled as EPU or not. Finally, there is modeling uncertainty in which text classifiers are uncertain about the decision boundary between positive and negative classes.
In this paper, we revisit and extend Baker et al.'s human annotation process (§3) and computational pipeline that obtains EPU measurement from text (§4). In doing so, we draw on concepts from quantitative social science's measurement modeling, mapping observable data to theoretical constructs, which emphasizes the importance of validity (is it right?) and reliability (can it be repeated?) (Loevinger, 1957; Messick, 1987; Quinn et al., 2010; Jacobs and Wallach, 2019).
Overall, this paper contributes the following:
• We examine the assumptions Baker et al. use to operationalize economic policy uncertainty via keyword-matching of newspaper articles. We demonstrate that using keywords collapses some rich linguistic phenomena such as semantic uncertainty (§2.1).
• We also examine the causal assumptions of Baker et al. through the lens of structural causal models (Pearl, 2009) and argue that readers' perceptions of economic policy uncertainty may be important to capture (§2.2).
• We conduct an annotation experiment by re-annotating documents from Baker et al. We find preliminary evidence that disagreements in annotation could be attributed to inherent ambiguity in the language that expresses EPU (§3).
• Finally, we replicate and extend Baker et al.'s data pipeline with numerous measurement sensitivity extensions: filtering to US-only news, keyword-matching versus supervised document classifiers, and prevalence estimation approaches. We demonstrate that a measure of external predictive validity, i.e., correlations with a stock-market volatility index (VIX), is particularly sensitive to these decisions (§4).

Assumptions of Measuring Economic Policy Uncertainty from News
The goal of Baker et al. (2016) is to measure the theoretical construct of policy-related economic uncertainty (EPU) for particular times and geographic regions. Baker et al. assume they can use information from newspaper articles as a proxy for EPU, an assumption we explore in great detail in Section 2.2, and they define EPU very broadly in their coding guidelines: "Is the article about policy-related aspects of economic uncertainty, even if only to a limited extent?" For an article to be annotated as positive, there must be a stated causal link between policy and economic consequences, and either the former or the latter must be uncertain. For the automated index, a document that matches at least one keyword from each of the economy, uncertainty, and policy keyword lists (see Table 2 in the Appendix) is considered a positive document. Counts of positive documents are summed and then normalized by the total number of documents published by each news outlet.

Semantic Uncertainty
While the keywords Baker et al. (2016) select ("uncertain" or "uncertainty") are the most overt ways to express uncertainty via language, they do not capture the full extent of how humans express uncertainty. For instance, Example No. 6 in Table 1 would be counted as a negative by Baker et al. despite indicating semantic uncertainty via the phrase "it remains unclear." These keyword assumptions are a threat to content validity, "the extent to which a measurement model captures everything we might want it to" (Jacobs and Wallach, 2019). We look to definitions from linguistics to potentially expand the operationalization of uncertainty; we refer the reader to Szarvas et al. (2012) for all subsequent definitions and quotes. In particular, uncertainty is defined as a phenomenon that represents a lack of information. With respect to truth-conditional semantics, semantic uncertainty refers to propositions "for which no truth value can be attributed given the speaker's mental state." Discourse-level uncertainty indicates "the speaker intentionally omits some information from the statement, making it vague, ambiguous, or misleading" and in the context of Baker et al. could result from journalists' linguistic choices to express ambiguity in economic policy uncertainty. For instance, in the first example in Table 3, the lexical cues "suggest" and "might" indicate to the reader that the journalist writing the article is unclear about the intention of Alan Greenspan. In contrast, epistemic modality "encodes how much certainty or evidence a speaker has for the proposition expressed by his utterance," (e.g., "Congresswoman X: 'We may delay passing the tariff bill.'") and doxastic modality refers to the beliefs of the speaker ("I believe that Congress will . . . "). In the second example in Table 3, the entity "he" seems to be uncertain about the fate of the economy because he "shakes his head in bewilderment," which demonstrates that uncertainty can also be conveyed through world knowledge and inference.
Collapsing all these types of semantic uncertainty to the keywords "uncertainty" and "uncertain" has major implications: (a) the relationship between the uncertainty journalists express and what readers infer impacts the causal assumptions (§2.2) and annotation decisions (§3) of this task, and (b) Baker et al.'s keywords are most likely low-recall, which could affect empirical measurement results (§4). We see fruitful future work in improving content validity and recall via automatic uncertainty and modality analysis from natural language processing.

Causal Assumptions
Using the paradigm of structural causal models (Pearl, 2009), we re-examine the causal assumptions of Baker et al. In Figure 1, for a single timestep, U* represents the real, aggregate level of economic policy uncertainty in the world, which is unobserved. If one could obtain a measurement of U*, then one could analyze the causal relationship between U* and other macroeconomic variables, M. Presumably, newspaper reporting, X, is affected by U*, and x = f_X(u*), where f_X is a nonparametric function that represents a causal process.

Table 3: Example documents (with document IDs) discussed in §2.1.
1. "The stock market had soared on Mr. Greenspan's suggestion that global financial problems posed as great a threat to the United States as inflation did, suggesting that a rate cut to stimulate the economy might be on the horizon." (docid 1047100)
2. "But ask him whether the Mexican stock market will rise or plunge tomorrow and he shakes his head in bewilderment." (docid 1043578)

In our setting, f_X represents the process of media production: for example, the ability of journalists to collect information from sources, or editorial decisions on what topics will be published. The major assumption of Baker et al. is that they can obtain a measure of U* via a proxy measure from newspaper text, X. Yet, aside from examining the political bias of media, Baker et al. largely ignore f_X and how the media production process could influence EPU measurements. However, an alternative causal path from U* to M goes through H*, the macro-level human perception of real EPU. In this case, U* is irrelevant: as long as people perceive policy-related economic uncertainty to be changing, they could potentially make real economic decisions (e.g., hiring or purchases) that affect the greater macroeconomy, M. It is unclear how to design a causal intervention in which one manipulates the real EPU, do(U*), in order to estimate its effect on X and M. However, one could design an ideal causal experiment to intervene on newspaper text, do(X); one could artificially change the level of EPU coverage in synthetic articles, show these to participants, and measure the resulting difference in participants' economic decisions. If H* to M is the causal path of interest, then it is extremely important to measure and model human perception of EPU, an assumption we explore in terms of annotation decisions in Section 3. (There is some evidence from the original authors that human perception is important: in the EPU index released to the public, one of three underlying components is the disagreement of economic forecasters as a proxy for uncertainty; see http://policyuncertainty.com/methodology.html.)

Annotator Uncertainty
Reliable human annotation is essential for both building supervised classifiers and assessing the internal validity of text-as-data methods. In order to validate their EPU index, Baker et al. sample documents from each month, obtain binary labels on the documents from annotators, and then construct a "human-generated" index which they report has a 0.86 correlation with their keyword-based index (aggregated quarterly). Yet, in our analysis of Baker et al.'s annotations (denoted below as BBD), we find only 16% of documents have more than one annotator and, of these, the agreement rates are moderate: 0.80 pairwise agreement and 0.60 Krippendorff's α chance-adjusted agreement (Artstein and Poesio, 2008). See Line 2 of Table 4 for additional descriptive statistics of these annotations. The original authors did not address whether this disagreement is a result of annotator bias, error in annotations, or true ambiguity in the text. In contrast to the popular paradigm that one should aim for high inter-annotator agreement rates (Krippendorff, 2018), recent research has shown "disagreement between annotators provides a useful signal for phenomena such as ambiguity in the text" (Dumitrache et al., 2018). Additionally, recent research in natural language processing (Sharmanska et al., 2016) has leveraged annotator uncertainty to improve modeling. Thus, for our setting, we ask the following research question:

RQ1: Is there inherent ambiguity in the language that expresses economic policy uncertainty? If so, are annotator disagreements a reflection of this ambiguity?
The following evidence lends support to our hypothesis that there is inherent ambiguity in whether documents encode EPU: (1) the original coding guide of Baker et al. had 17 pages of "hard calls" that describe difficult or ambiguous documents, (2) there was a moderate amount of annotator disagreement in BBD (Table 4), (3) we qualitatively analyze examples with disagreement and reason about what makes the inferences of these documents difficult (§3.2, and Tables 10 and 11 in the Appendix), and (4) we run an experiment in which we gather additional annotations and show that our annotations have more disagreement with documents that have non-unanimous labels in BBD (§3.1).

Our annotation experiment
The ideal assessment of inherent annotator uncertainty would be to gather a large number of annotations for many documents and then analyze the posterior distribution over labels. We perform a similar, small-scale experiment in which we recruit 10 annotators, a mix of professional analysts and PhD students, who annotate 37 documents for a total of 193 annotations. We sampled documents from the pool of BBD documents that had more than one annotator and whose BBD labels were unanimous (Sample A) and non-unanimous (Sample B). We re-annotated these samples in order to provide insight into the nature of these unanimous and non-unanimous labels. See Figure 4 in the Appendix for our full annotation instructions.

Pairwise cross-agreement. In order to quantitatively compare two annotation rounds (ours vs. Baker et al.'s), we provide a new metric, pairwise cross-agreement (PXA). Formally, for each document of interest, d ∈ D, let A_d and B_d be the sets of annotations on that document from each of the two rounds respectively. Let P_d be the set of all pairs (a ∈ A_d, b ∈ B_d) formed by combining one annotation from each of the two rounds. Then,

PXA = ( Σ_{d ∈ D} Σ_{(a,b) ∈ P_d} 1[a = b] ) / ( Σ_{d ∈ D} |P_d| ).    (1)

A code sketch of this computation appears at the end of this subsection.

Results. The results of our experiment (Tables 4 and 5) provide evidence supporting our hypothesis that there is inherent ambiguity in documents about EPU that contributes to annotator disagreement. In Table 5, PXA is higher in Sample A (0.70), in which BBD annotators had unanimous agreement, compared to Sample B (0.50), in which BBD annotators had non-unanimous labels. Since our annotations agreed with Sample A more, this could indicate these documents inherently have more agreement. The pairwise agreement between our annotations on Samples A and B is roughly the same (Table 4), but the proportion of documents that had unanimous agreement among our five annotators per document was slightly higher in Sample A than in Sample B (0.37 vs. 0.28). Limitations of our experiment include that our sample size is relatively small and our annotation instructions are different from and significantly shorter than Baker et al.'s.
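For concreteness, the following is a minimal sketch of computing PXA from two rounds of binary annotations; the dictionary-based data structures and toy labels are our own illustrative choices, not the annotation tooling used in the experiment.

```python
from itertools import product

def pairwise_cross_agreement(round_a, round_b):
    """Compute PXA between two annotation rounds.

    round_a, round_b: dicts mapping document id -> list of binary labels
    from each round. Only documents annotated in both rounds contribute.
    """
    agree, total = 0, 0
    for doc_id in set(round_a) & set(round_b):
        # All cross-round pairs (one label from each round) for this document.
        for a, b in product(round_a[doc_id], round_b[doc_id]):
            agree += int(a == b)
            total += 1
    return agree / total if total else float("nan")

# Toy usage: two documents, labels from a BBD-style round and our round.
bbd_labels = {"d1": [1, 1], "d2": [1, 0]}
our_labels = {"d1": [1, 0, 1], "d2": [0, 0, 1]}
print(pairwise_cross_agreement(bbd_labels, our_labels))  # 7 agreements / 12 pairs
```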

Qualitative Document Analysis
Our qualitative analysis suggests that readers' perceptions of EPU differ meaningfully and that it is difficult to measure EPU with a simple document-level binary label. In Tables 10 and 11 in the Appendix, we present documents with the highest levels of agreement from Sample A and disagreement from Sample B. Annotators are likely to disagree on the label of a document when they need real-world knowledge to infer whether a policy is contributing to economic uncertainty. For instance, in Table 11 Example 1, the reader has to infer that the author of an op-ed would only write an op-ed about a policy if it was uncertain, but the uncertainty is never explicitly stated in the text. In other instances, the causal link between policy and economic uncertainty is unclear. In Table 11 Example 4, economic downturn is mentioned as well as turnover in the administration, but these are never explicitly linked; yet, some annotators may have read "questions about what lies ahead" as uncertainty that also encompasses economic uncertainty. Although there has been a rise of common-sense reasoning research in natural language processing (e.g., Bhagavatula et al., 2019), we suspect current state-of-the-art NLP systems would be unable to accurately resolve the inferences stated above. Furthermore, if there is inherent ambiguity in the language that expresses EPU, and, as we argue in Section 2.2, human perception is important, then we may desire to build models that can identify ambiguous documents and incorporate the uncertainty arising from ambiguity of language into measurement predictions, e.g., Paun et al. (2018). We leave this for future work.

Measurement
For text-as-data applications, substantive results are contingent on how researchers operationalize measurement of the (latent) theoretical construct of interest via observed text data. Using Baker et al.'s original causal assumptions (Section 2.2), we formally define the measurement of interest as

U = g(X),

where g is the measurement function that maps text, X, to economic policy uncertainty, U. For text-as-data practitioners, we emphasize that there is a "garden of forking paths" (Gelman and Loken, 2014) in how g can be operationalized, for instance, in the representation of text (bag-of-words vs. embeddings), the document classification function (deterministic keyword matching vs. supervised machine learning classifiers), and the way of aggregating individual document predictions (mean of predictions vs. prevalence-aware aggregation).

RQ2: What happens when we change g to equally or more valid measurement functions? In particular, we are interested in sensitivity: for two measurements, g_1 and g_2, does U_1 correlate well with U_2; and external predictive validity: for each measurement, g_i, does U_i correlate well with the VIX, a stock-market volatility index based on S&P 500 options prices?
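To make this forking-paths view concrete, the following sketch (our own illustration, not the authors' code) treats a measurement g as the composition of a per-document scoring function and an aggregation function, each of which can be swapped independently; the helper names key_org and classifier_prob are hypothetical stand-ins for the methods described below.

```python
from typing import Callable, Iterable, List

def epu_index(documents: Iterable[str],
              score_doc: Callable[[str], float],
              aggregate: Callable[[List[float]], float]) -> float:
    """One measurement g: score each document, then aggregate to an index value.

    score_doc: maps a document to a label or probability, e.g. keyword matching
               (0/1) or a classifier's P(EPU = 1 | document).
    aggregate: maps per-document scores to a single number for the time period,
               e.g. the mean of labels (CC) or the mean of probabilities (PCC).
    """
    scores = [score_doc(d) for d in documents]
    return aggregate(scores)

# Two "forks" of the same construct for one month of documents (hypothetical helpers):
#   u1 = epu_index(month_docs, key_org, lambda s: sum(s) / len(s))          # keywords + CC
#   u2 = epu_index(month_docs, classifier_prob, lambda s: sum(s) / len(s))  # LogReg + PCC
```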
Baker et al. also use the VIX as a measure of external validity, and like Baker et al. we note that the VIX is a good proxy for economic uncertainty but does not necessarily capture policy uncertainty. As Baker et al. mention, "differences in the topical scope between the VIX and the EPU index are an important source of distinct variation in the two measures." In the future, we could compare our indexes against additional measures of external validity.

Data and pre-processing. Although Baker et al. use 10 newspapers to construct their US-based index, we instead use the New York Times Annotated Corpus (NYT-AC) (Sandhaus, 2008) because the text data is cleaned, easily accessible, and results on the corpus are reproducible. This collection includes over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007. Baker et al. assume that using newspapers based in the United States is sufficient to find a signal of US-based EPU. To test this assumption, we apply a simple heuristic to the dateline of NYT-AC articles and remove articles that mention non-US cities. However, we find relatively little variation in results via this heuristic (see Appendix, Figure 7).

Keyword matching
Matching keyword lists, also known as lexicons or dictionaries, is a straightforward method to retrieve and/or classify documents of interest and has the advantage of being interpretable. However, relying on a small set of keywords can create issues with recall and generalization. On NYT-AC, we apply the original keyword-matching method of Baker et al. (2016), who label a document as positive if it matches any of 2 economy keywords, AND any of 2 uncertainty keywords, AND any of 13 policy keywords (KeyOrg). We also compare a method with the same economy and uncertainty matching criteria but without policy keyword matching (KeyEU), and a method for which we expand the economy and uncertainty keywords via word embeddings (KeyExp). See Table 2 in the Appendix for the full list of keywords.
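For concreteness, a minimal sketch of this matching rule follows (our own illustration rather than the authors' code); the keyword sets shown are abbreviated placeholders for the full lists in Table 2 of the Appendix.

```python
# Abbreviated, illustrative keyword lists; see Table 2 in the Appendix for the full sets.
ECONOMY = {"economy", "economic"}
UNCERTAINTY = {"uncertain", "uncertainty"}
POLICY = {"regulation", "deficit", "legislation", "congress", "white house",
          "federal reserve"}  # Baker et al. use 13 policy terms in total.

def matches_any(text_lower, keywords):
    # Substring matching keeps multi-word terms like "white house" simple;
    # it also means "uncertain" matches "uncertainty".
    return any(kw in text_lower for kw in keywords)

def key_org(text):
    """KeyOrg: economy AND uncertainty AND policy keyword match."""
    t = text.lower()
    return (matches_any(t, ECONOMY)
            and matches_any(t, UNCERTAINTY)
            and matches_any(t, POLICY))

def key_eu(text):
    """KeyEU: the same rule without the policy-keyword requirement."""
    t = text.lower()
    return matches_any(t, ECONOMY) and matches_any(t, UNCERTAINTY)

doc = "Uncertainty over new legislation has clouded the economic outlook."
print(key_org(doc), key_eu(doc))  # True True
```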
KeyExp. Although Baker et al. use human auditors to find policy keywords that minimize the false positive and false negative rates, they do not expand or optimize the economy or uncertainty keywords. Thus, we expand these keyword lists via GloVe word embeddings (Pennington et al., 2014), and find the five nearest neighbors via cosine distance. This is a simple keyword expansion technique. In future work, one could look to the literature on lexicon induction to improve the creation of lexicons that represent the semantic concepts of interest (Taboada et al., 2011; Pryzant et al., 2018; Hamilton et al., 2016; Rao and Ravichandran, 2009). Alternatively, one could also create a probabilistic classifier over pre-selected lexicons to soften the predictions, or use other uncertainty lexicons or even automatic uncertainty cue detectors.
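A minimal sketch of this expansion step follows, assuming gensim and a standard public GloVe file (our assumptions, not necessarily the exact vectors or tooling used in the paper).

```python
from gensim.models import KeyedVectors

# Load GloVe vectors; the particular file (6B tokens, 300d) is our assumption.
# `no_header=True` handles GloVe's plain-text format in gensim >= 4.0.
vectors = KeyedVectors.load_word2vec_format(
    "glove.6B.300d.txt", binary=False, no_header=True)

seed_terms = ["economy", "economic", "uncertain", "uncertainty"]
expanded = set(seed_terms)
for term in seed_terms:
    # Five nearest neighbors by cosine similarity
    # (equivalently, smallest cosine distance).
    for neighbor, _score in vectors.most_similar(term, topn=5):
        expanded.add(neighbor)

print(sorted(expanded))
```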

Document classifiers
Probabilistic supervised machine learning classifiers are optimized to minimize the training loss between the predicted and true classes, and typically have better precision-recall trade-offs compared to keyword-matching methods. We use 1844 documents and labels from BBD from 1985-2007 as training data and 687 documents from 2007-2012 as a held-out test set. We train a simple logistic regression classifier using sklearn (Pedregosa et al., 2011) with a bag-of-words representation of text (LogReg-BOW). We tokenize and prune the vocabulary to retain words that appear in at least 5 documents, resulting in a vocabulary size of 15,968. We tune the L2 penalty via five-fold cross-validation. We also try alternative (non-BOW) text representations, but these did not result in improved performance (Appendix, §D). Note that the labeled documents in BBD are a biased sample, as the authors select documents to annotate that match the economy and uncertainty keyword banks and do not select documents at random.
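The following sketch shows how such a LogReg-BOW pipeline could be set up with sklearn; the cross-validation grid and the assumption that the BBD training documents have already been loaded are our own illustrative choices.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

def train_logreg_bow(train_texts, train_labels):
    """Train a LogReg-BOW classifier as described above.

    train_texts / train_labels stand in for the labeled BBD documents
    (1985-2007); loading them is outside the scope of this sketch.
    """
    pipeline = Pipeline([
        # Bag-of-words counts, keeping words that appear in at least 5 documents.
        ("bow", CountVectorizer(min_df=5)),
        # L2-regularized logistic regression; C is the inverse penalty strength.
        ("clf", LogisticRegression(penalty="l2", max_iter=1000)),
    ])
    # Tune the L2 penalty via five-fold cross-validation (this grid is illustrative).
    search = GridSearchCV(pipeline, {"clf__C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
    search.fit(train_texts, train_labels)
    return search.best_estimator_

# Usage (assuming the labeled BBD documents have been loaded):
#   model = train_logreg_bow(bbd_texts, bbd_labels)
#   probs = model.predict_proba(nyt_texts)[:, 1]   # P(EPU = 1 | document)
```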

Prevalence estimation
Measuring economic policy uncertainty is an instance of prevalence estimation, the task of estimating the proportion of items in each given class. Previous work has shown that simple aggregation methods over individual class labels can be biased if there is a shift in the distribution from training to testing or if the task is difficult (Keith and O'Connor, 2018). We compare aggregating via classify and count (CC), taking the mean over binary labels, and probabilistic classify and count (PCC), taking the mean over classifiers' inferred probabilities. See the Appendix §D.3 for additional prevalence estimation experiments.
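To make the two aggregation rules concrete, a minimal sketch follows; the monthly probabilities are toy values, not estimates from our models.

```python
import numpy as np

def classify_and_count(probs, threshold=0.5):
    """CC: threshold each document's probability, then take the mean binary label."""
    return float(np.mean(np.asarray(probs) >= threshold))

def probabilistic_classify_and_count(probs):
    """PCC: take the mean of the classifier's inferred probabilities."""
    return float(np.mean(probs))

# Toy monthly example: inferred P(EPU = 1) for five documents in one month.
month_probs = [0.9, 0.6, 0.4, 0.2, 0.1]
print(classify_and_count(month_probs))                 # 0.4
print(probabilistic_classify_and_count(month_probs))   # 0.44
```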

Results
Addressing RQ2, our experimental results show that changes in measurement can result in substantial differences in the corresponding index. Table 6 presents individual classification results on the training and test sets of BBD, and Figures 2 and 3 show inference of the models on NYT-AC. In Figure 2, we note that the overall prevalences are substantially different: KeyExp has higher prevalence than KeyOrg, as expected with more keywords, but the supervised methods infer prevalences near 0.2 (CC) and 0.4 (PCC), which indicates they may be biased towards the training prevalence. LogReg-BOW achieves better individual classification predictive performance and, combined with probabilistic classify and count (PCC), yields the PCC-LogReg-BOW index we examine below.

Limitations
We use the NYT-AC as a "sandbox" for our experiments because of proprietary restrictions that limit us from acquiring the full text of all 10 news outlets used by Baker et al. To understand the limitations of using only a single news outlet, we compare the "official" aggregated index of Baker et al. with KeyOrg applied to only the NYT-AC. Table 7 shows a 0.68 correlation between the official EPU index (KeyOrg-10) and the same keyword-matching method on only the NYT-AC (KeyOrg-NYT). Yet, KeyOrg-10 has a much higher correlation with the VIX, 0.57, compared to KeyOrg-NYT's correlation of 0.15. See Figure 8 in the Appendix for a graph of these different indexes. We hypothesize that applying PCC-LogReg-BOW to the texts of all 10 newspapers used by Baker et al. would result in improved external predictive validity, but we leave an empirical confirmation of this to future work. In practice, while keyword look-ups have lower recall than supervised methods, they have the advantage of being interpretable and can use counts from document retrieval systems instead of full texts.
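For illustration, correlations of the kind reported in Table 7 can be computed from monthly index series with pandas; the column names and values below are hypothetical stand-ins, not our actual series.

```python
import pandas as pd

# Hypothetical monthly series; each column is one operationalization of g
# (or the VIX) aggregated to months.
indexes = pd.DataFrame({
    "KeyOrg-NYT":     [0.010, 0.012, 0.020, 0.015],
    "PCC-LogReg-BOW": [0.380, 0.400, 0.450, 0.410],
    "VIX":            [14.2, 15.0, 19.3, 16.1],
}, index=pd.period_range("2006-01", periods=4, freq="M"))

# Pairwise Pearson correlations: sensitivity (index vs. index) and
# external predictive validity (index vs. VIX).
print(indexes.corr(method="pearson"))
```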

Related work
There have been only a few other attempts to construct alternative, non-keyword measurements of economic policy uncertainty. Azqueta-Gavaldón (2017) uses unsupervised topic models to construct an alternative EPU index, and topic models have also been used to extract economic signals from news more broadly (Thorsrud, 2020; Bybee et al., 2020), while other work identifies negated uncertainty markers (e.g., "there is no uncertainty") in the Federal Reserve's Beige Books (Saltzman and Yung, 2018) and extracts sentiment from central bank communications (Apel and Grimaldi, 2012). Boudoukh et al. (2019) use off-the-shelf supervised document classifiers to demonstrate that the information in news can predict stock prices.
Text-as-data methods. Traditional ways of analyzing textual data include content analysis, where human annotators read and hand-code documents for particular phenomena (Krippendorff, 2018). In the last decade, many researchers have adapted machine learning and NLP methods to the needs of social scientists (Card, 2019; O'Connor et al., 2011). NLP technologies such as lexicons, topic models (Roberts et al., 2014; Blei et al., 2003), supervised classifiers, word embeddings (Mikolov et al., 2013; Pennington et al., 2014), and large-scale pre-trained language model representations (Devlin et al., 2019) have been applied to textual data to extract relevant signals. More recent work attempts to extend text-as-data methods to incorporate principles from causal inference (Pryzant et al., 2018; Wood-Doughty et al., 2018; Veitch et al., 2020; Roberts et al., 2020; Keith et al., 2020).

Future directions
In the future, estimating the sensitivity of causal estimates to the different measurement approaches presented in this paper could potentially have substantive impact. Using a Bayesian modeling approach to annotator uncertainty (Paun et al., 2018), investigating better calibration, which has been shown to improve prevalence estimation (Card and Smith, 2018), or estimating model uncertainty could improve measurement. One could also shift from document-level predictions of EPU to paragraph, sentence, or span-level predictions. Annotating discourse structure and selecting discourse fragments, e.g. Prasad et al. (2004), could potentially increase annotator agreement. These subdocument extraction models could also potentially provide human-interpretable contextualization of movements in an EPU index.

Conclusion
There is great promise for text-as-data methods and applications; however, we echo the cautionary advice of Grimmer and Stewart (2013) that automatic methods require extensive "problem-specific validation." Our paper's investigation of Baker et al. provides a number of general insights for text-as-data practitioners along these lines. First, content validity: when dealing with text data, one needs to think carefully about the kinds of linguistic information one is trying to measure. For instance, mapping economic policy uncertainty to a document-level binary label collapses all types of semantic uncertainty, many of which cannot be identified via keywords alone. Second, one needs to examine social perception assumptions. Is one trying to prescribe an annotation schema, or, as we argue in this paper, are people's perceptions about the concept as important as the concept itself, especially in the face of ambiguity in language? Third, sensitivity of measurements: text-as-data practitioners can strengthen their substantive conclusions if multiple measurement approaches give similar results. For economic policy uncertainty, this paper demonstrates that an index built from keyword matches and an index built by aggregating the outputs of a document classifier are not tightly correlated, a concerning implication for the validity of this index.

D Measurement: Additional Experiments
In this section, we provide additional measurement experiments. Also note there is a very small overlap between our training-time documents and inference-time NYT-AC documents: there are 375 training documents from the NYT between the years of 1990 and 2006, while the total number of inference documents is 1,501,131, so the overlap is less than 0.025% of documents.

D.1 Filtering to US-Only News
Initial qualitative analysis reveals that many documents, and in particular articles with high annotator disagreement, are focused on events outside the United States. An unstated assumption of Baker et al. (2016) is that US-based news sources will primarily report US-based news and thus US-based economic policy uncertainty. We test this assumption empirically.
To remove non-US news, we use a simple heuristic that gives almost perfect precision. NYT-AC has metadata about the dateline of an article, for example "KUWAIT, Sunday, March 30," "SAN ANTONIO, March 29," or "BAGHDAD, Iraq, March 29." We (1) use the GeoNames Gazetteer and filter to cities that have more than 15,000 inhabitants; (2) separate these city names into US and non-US cities such that ties go to the US (for example, Athens would not be removed because the town of Athens, Georgia is in the United States); (3) write a rule-based text parser that extracts the span of text that is in all capitals; and (4) if the extracted city name is in the set of non-US cities, we discard the document. Per month, on average, we remove 449 documents that were about non-US news. See Figure 6 for a comparison of all NYT articles, articles with a dateline, and US-only articles based on our heuristic. Figure 7 displays correlation results for all models with the US-only document filter. Applying the US-only filter only slightly improves the correlation of all models with the VIX (by 0.01-0.04). From these results, it seems that Baker et al.'s assumption is valid. However, we also acknowledge that our heuristic is high-precision and low-recall, and in the future one could possibly use a country-level document classifier instead.
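A minimal sketch of this dateline heuristic follows; the city sets and the regular expression are illustrative simplifications of the GeoNames-based lists and parser described above.

```python
import re

# Illustrative city sets; in the paper these come from the GeoNames Gazetteer,
# filtered to cities with more than 15,000 inhabitants, with ties going to the US.
US_CITIES = {"san antonio", "athens", "new york"}
NON_US_CITIES = {"kuwait", "baghdad", "london"}

def dateline_city(dateline):
    """Extract the all-capitals span at the start of a dateline, e.g. 'BAGHDAD, Iraq, March 29'."""
    match = re.match(r"^([A-Z][A-Z .'-]+?),", dateline)
    return match.group(1).strip().lower() if match else None

def keep_as_us_news(dateline):
    """Discard a document only when its dateline city is a known non-US city."""
    city = dateline_city(dateline)
    if city is None:
        return True            # no dateline information: keep the document
    if city in US_CITIES:
        return True            # ties go to the US (e.g., Athens, Georgia)
    return city not in NON_US_CITIES

for d in ["KUWAIT, Sunday, March 30", "SAN ANTONIO, March 29", "BAGHDAD, Iraq, March 29"]:
    print(d, "->", keep_as_us_news(d))
```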

D.2 Predicting after EU filter
As we acknowledge in the main text, the training set is biased because documents were sampled only if they matched the economy and uncertainty keyword banks. To make a fair comparison at inference time, we looked at the predictions of our document classifiers on the subset of documents in NYT-AC that also matched these economy and uncertainty keyword banks (KeyEU). In Figure 5, we see that, on this subset, these models had lower correlation with the VIX.

D.3 Additional prevalence estimation experiments
As an alternative to the classify and count (CC) and probabilistic classify and count (PCC) prevalence estimation methods, we also experiment with the Implicit Likelihood (ImpLik) prevalence estimation method of Keith and O'Connor (2018). This method gives the predictions of a discriminative classifier a generative re-interpretation and backs out an implicit individual-level likelihood function, which can take into account bias in the training prevalence. We use the authors' freq-e software package (https://github.com/slanglab/freq-e); for the label prior we use the training prevalence of 0.48. Figure 7 shows a high correlation (0.83) between ImpLik and PCC; however, ImpLik had much lower correlation with the VIX (0.1). Note that the mean prevalences from ImpLik are much lower than those of PCC or CC, with a mean monthly prevalence across 1990-2006 of 0.02. Thus, the method seems to be correcting towards a more realistic prevalence, but the true prevalence values may be too low to pick up relevant signal via this method.

D.4 BERT representations
Finally, we acknowledge that a bag-of-words representation in the document classifier is unsatisfying for capturing long-range semantic dependencies and the contextual nature of language that has motivated recent research in contextual, distributed representations of text. Thus, we use the frozen representations of a large, pre-trained language model that has been optimized for long documents, the Longformer (Beltagy et al., 2020), a model that adapts RoBERTa (Liu et al., 2019) to long documents. We use the huggingface implementation of the Longformer (https://huggingface.co/transformers/model_doc/longformer.html) and use the 768-dimensional "pooled output" as our document representation; this is the hidden state of the last layer for the first token of the sequence, passed through a linear layer and a Tanh activation function, where the linear layer weights are trained from the next sentence prediction objective during pre-training. We then use the same sklearn logistic regression training as the BOW models.
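A minimal sketch of this feature extraction step follows, assuming the standard public allenai/longformer-base-4096 checkpoint (our assumption about the exact weights used) and the huggingface transformers API.

```python
import torch
from transformers import LongformerModel, LongformerTokenizer

# Frozen, pre-trained Longformer; the checkpoint name is the standard public one.
tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")
model.eval()

def document_representation(text):
    """Return the frozen 768-d pooled output for one (possibly truncated) document."""
    inputs = tokenizer(text, truncation=True, max_length=4096, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # pooler_output: first-token hidden state passed through a linear layer + Tanh.
    return outputs.pooler_output.squeeze(0).numpy()

features = document_representation(
    "The outlook for the H1B visa program remains highly uncertain.")
print(features.shape)  # (768,)
```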
Comparing Table 12 to Table 6, we see that this representation has decreased performance compared to LogReg-BOW. We speculate that this decrease in performance may originate in having to truncate documents to 4096 tokens due to the constraints of the model architecture. With more computational resources, we would fine-tune the pre-trained weights instead of leaving them frozen. Future work could also consider obtaining alternative representations of text via weighted averaging of embeddings (Arora et al., 2017), deep averaging networks (Iyyer et al., 2015), or pooling BERT embeddings of all paragraphs in a document.

Figure 6: NYT total documents (red), documents with datelines (green), and documents for which the dateline does not contain a non-US city (blue). We checked and confirmed that the spike in 1995-10 is an artifact of the corpus.