Measure Country-Level Socio-Economic Indicators with Streaming News: An Empirical Study

Socio-economic conditions are difficult to measure. For example, the U.S. Bureau of Labor Statistics needs to conduct large-scale household surveys regularly to track the unemployment rate, an indicator widely used by economists and policymakers. We argue that events reported in streaming news can be used as “micro-sensors” for measuring socio-economic conditions. Similar to collecting surveys and then counting answers, it is possible to measure a socio-economic indicator by counting related events. In this paper, we propose Event-Centric Indicator Measure (ECIM), a novel approach to measure socio-economic indicators with events. We empirically demonstrate strong correlation between ECIM values and several representative indicators in socio-economic research.


Introduction
Socio-economic indicators are powerful instruments for measuring economic conditions and the sociocultural environment that people live in. They are widely used to inform policy makers and to help them measure the outcomes of policy interventions. However, they are often difficult, if not impossible, to measure. Take the unemployment rate as an example: the U.S. Bureau of Labor Statistics conducts large-scale household surveys in order to track it. More abstract indicators (e.g., economic uncertainty), which involve vague or complex social interactions, are even harder to measure accurately.
We hypothesize that streaming news, which reports a vast number of real-world events, can be used for measuring socio-economic indicators. We propose Event-Centric Indicator Measure (ECIM), a novel approach to measure socio-economic indicators using events extracted from streaming news. We demonstrate that ECIM is effective: ECIM values are strongly correlated with representative socio-economic indicators.
Figure 1: The ECIM workflow. Words in bold are event triggers. Text in parentheses shows normalized locations (countries) and time. "Related" shows whether an event trigger is related to the target indicator (e.g., "unemployment rate") based on keyword matching.
We first present an overview of the ECIM approach, and then describe how we extract events from text and aggregate them to calculate ECIMs for each socio-economic indicator. We then present large-scale experiments on several widely used socio-economic indicators to demonstrate that ECIM is effective.

Event-Centric Indicator Measure
The ECIM workflow is summarized in Figure 1. Given a large-scale streaming news collection, the system first extracts event mentions (event trigger words or phrases) along with their locations and times. For each socio-economic indicator, the system then aggregates the relevant events at each time step to produce the ECIM, a time series measuring the corresponding socio-economic condition.

Extract Events from Text
We extract event frames from predicate-argument structures that are automatically generated from text. Then we extract an event trigger, a location and a time, if available, from each event frame. To attach a location or time that is further away from the trigger (or based on the publishing date), we apply a few inference rules.
Extract Events. Syntactic-semantic representations such as Abstract Meaning Representation (Huang et al., 2016) and Semantic Role Labeling (SRL) (Peng et al., 2016; Surdeanu et al., 2003) have been shown to be effective for event extraction. Following Peng et al. (2016) and Surdeanu et al. (2003), our event extractor is based on tagging predicates (verbs and eventive nouns) and their location and temporal arguments.
We first apply SRL (Punyakanok et al., 2004; He et al., 2017) to each sentence. An example SRL representation of a sentence is shown in Figure 2. From the predicate-argument structure, the system then extracts the predicate as an event trigger 1 , and the entity mention and time mention attached to the predicate through AM-LOC and AM-TMP as the location and time of the event, respectively. The location is resolved to a country-level GPE by looking up part-of relations in GeoNames 2 (e.g., converting "Boston" into "U.S."). Our system also normalizes time mentions into Timex2 (Ferro et al., 2001) and resolves relative time expressions (e.g., "last Tuesday") based on the document publishing date. This process extracts an event mention in the form of a triple <trigger: clashed, location: Syria, time: 2018-11-23> from the sentence in Figure 2.
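The mapping from one predicate-argument frame to an event triple can be sketched as follows. This is a minimal illustration, not the authors' implementation: the frame is assumed to be already available as a dictionary of role labels, `PART_OF_COUNTRY` is a hypothetical toy stand-in for the GeoNames part-of lookup, `resolve_time` handles only "last <weekday>" and ISO dates rather than full Timex2 normalization, and the example frame is invented because the sentence of Figure 2 is not reproduced in the text.

```python
from datetime import date, timedelta

# Hypothetical stand-in for the GeoNames part-of lookup (toy entries only).
PART_OF_COUNTRY = {"Boston": "U.S.", "Aleppo": "Syria", "Syria": "Syria"}

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def resolve_time(mention, published):
    """Resolve 'last <weekday>' or an ISO date against the publishing date."""
    words = mention.lower().split()
    if len(words) == 2 and words[0] == "last" and words[1] in WEEKDAYS:
        # Most recent such weekday strictly before the publishing date.
        delta = (published.weekday() - WEEKDAYS.index(words[1]) - 1) % 7 + 1
        return (published - timedelta(days=delta)).isoformat()
    try:
        return date.fromisoformat(mention).isoformat()
    except ValueError:
        return None  # unresolved time mentions are left empty


def frame_to_event(frame, published):
    """Turn one SRL frame (a dict of role label -> text) into an event triple."""
    location = PART_OF_COUNTRY.get(frame.get("AM-LOC", ""))
    time_mention = frame.get("AM-TMP")
    time = resolve_time(time_mention, published) if time_mention else None
    return {"trigger": frame["predicate"], "location": location, "time": time}


# Invented frame; the publishing date is an assumption chosen for illustration.
example = frame_to_event(
    {"predicate": "clashed", "AM-LOC": "Aleppo", "AM-TMP": "last Friday"},
    published=date(2018, 11, 26),
)
```

Events whose location or time cannot be resolved keep `None` in that field, which leaves room for the inference rules described later in this section.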

Find relevant event triggers
Given an indicator, we use the following approach to find a set of relevant event trigger words/phrases:
• We process a development corpus (500 documents from the English Gigaword 3 ) with the above-mentioned approach to extract all event triggers, and then ask an annotator to find trigger words for the target indicator, starting with the most frequent trigger.
• We then use WordNet (Miller, 1995) synsets to automatically expand the trigger word list. We also use word embedding similarities to search for new trigger words that are most similar (e.g., using cosine similarity) to the centroid of existing triggers.
An annotator then reviews the expanded triggers and removes incorrect ones. The whole process 4 takes less than 10 minutes per indicator. Example triggers for three representative indicators are listed in Table 1.
To tag event mentions, we match the predicate against the relevant keyword list 5 , constructed using the process described above.
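The embedding-based expansion step can be illustrated with a short sketch. It assumes that word vectors are available as a plain dictionary; the function names, the `top_k` parameter, and the three-dimensional toy vectors are all invented for illustration, and the paper does not specify which embeddings or cut-off were used.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_triggers(seeds, embeddings, top_k=2):
    """Rank non-seed words by cosine similarity to the centroid of the seeds."""
    seed_vecs = [embeddings[w] for w in seeds if w in embeddings]
    dim = len(seed_vecs[0])
    centroid = [sum(v[d] for v in seed_vecs) / len(seed_vecs) for d in range(dim)]
    candidates = [w for w in embeddings if w not in seeds]
    ranked = sorted(candidates,
                    key=lambda w: cosine(embeddings[w], centroid),
                    reverse=True)
    return ranked[:top_k]


# Toy 3-dimensional vectors, invented for illustration only.
toy_embeddings = {
    "recession": [0.9, 0.1, 0.0],
    "bankruptcy": [0.8, 0.2, 0.1],
    "layoff": [0.85, 0.15, 0.05],
    "downturn": [0.9, 0.2, 0.0],
    "concert": [0.0, 0.1, 0.9],
}
suggested = expand_triggers({"recession", "bankruptcy"}, toy_embeddings)
```

The returned candidates are only suggestions; as described above, an annotator still reviews the expanded list and removes incorrect entries.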
Infer location and time. Locations are crucial for identifying whether events are related to the country of interest. Time is necessary for binning events by time step for counting. Extracting the time and location of an event trigger can be challenging when they are far from the trigger. Often, no location or time is mentioned in the sentence in which the event is stated.
To increase the coverage of location and time for events, we apply the following inference rules:
• Same-sentence: if the event extractor doesn't find a location or a time, but there is one and only one location (an entity mention with type GPE or Location) and/or one and only one time mention in the same sentence, it will attach that location and/or time to the event.
• Document metadata: A news article often comes with a publishing date and a location. When neither the extractor nor the same-sentence rule finds a location or a time for an event, we attach the publishing location and/or date to the event.
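The two rules above can be sketched as a single fallback chain. This is a minimal illustration under stated assumptions: events are dictionaries with `location` and `time` fields, the lists of sentence-level mentions are assumed to be already normalized, and the function name and argument names are hypothetical.

```python
def infer_location_and_time(event, sentence_locations, sentence_times,
                            doc_location=None, doc_date=None):
    """Fill missing location/time: same-sentence rule first, then metadata."""
    filled = dict(event)  # do not modify the caller's event
    for field, mentions, fallback in (
        ("location", sentence_locations, doc_location),
        ("time", sentence_times, doc_date),
    ):
        if filled.get(field) is not None:
            continue  # the extractor already found a value
        if len(mentions) == 1:
            filled[field] = mentions[0]  # same-sentence rule: unique mention
        elif fallback is not None:
            filled[field] = fallback  # document-metadata rule
    return filled


bare = {"trigger": "clashed", "location": None, "time": None}
by_sentence = infer_location_and_time(
    bare, ["Syria"], [], doc_location="U.S.", doc_date="2018-11-26")
by_metadata = infer_location_and_time(
    bare, ["Syria", "Turkey"], ["2018-11-23"], doc_location="U.S.")
```

Location and time are handled independently, so one field can come from the sentence while the other falls back to the document metadata.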

Measure Socio-economic Indicators
Table 1: Indicators, examples of event triggers (words or phrases) and sentences. Example sentence: "These policies and other institutional weaknesses continue to undermine prospects for sustained economic development"
We filter events by location (target country) and then use the following steps to calculate ECIM:
• Aggregation and counting: We bin events by time step 7 . For each time step $t$ and event $e \in E_i$ ($E_i$ is the set of events for indicator $i$), we then generate the count $N_{e,t}$. For each indicator $i$, we aggregate the counts: $\sum_{e \in E_i} N_{e,t}$.
• Normalizing counts: These counts are not normalized and can be inflated by the increasing level of media activity. To normalize, we divide them by the total number of articles published in each month, $M_t$.
• Smoothing: To remove noise, we smooth the normalized counts by calculating moving averages 8 centered around each t with a window of T = 7 time steps 9 .
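In summary, ECIM for indicator $i$ at time $t$ is defined as the moving average, over a window of $T$ time steps centered at $t$, of the normalized aggregated counts: $\mathrm{ECIM}_{i,t} = \frac{1}{T} \sum_{t'=t-(T-1)/2}^{t+(T-1)/2} \frac{\sum_{e \in E_i} N_{e,t'}}{M_{t'}}$.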
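The filtering, counting, normalizing and smoothing steps can be put together in a short sketch. It is an illustration under assumptions rather than the authors' code: events are dictionaries with `trigger`, `location` and `time` fields, `time` is already a time-step key (for example a month string), every time step has a positive article count, the moving average is unweighted, and the window is simply truncated at both ends of the series. The function and parameter names are hypothetical.

```python
from collections import Counter


def ecim(events, triggers, country, articles_per_step, window=7):
    """Return {time_step: ECIM value} for one indicator and one country.

    events: iterable of dicts with 'trigger', 'location' and 'time'.
    articles_per_step: {time_step: total number of articles M_t}, all > 0.
    """
    # Filter by country and relevant triggers, then count events per time step.
    counts = Counter(
        e["time"] for e in events
        if e["location"] == country and e["trigger"] in triggers
    )
    steps = sorted(articles_per_step)
    # Normalize by media volume in each time step.
    normalized = [counts[t] / articles_per_step[t] for t in steps]
    half = window // 2
    smoothed = {}
    for i, t in enumerate(steps):
        # Centered window, truncated at the ends of the series (an assumption).
        chunk = normalized[max(0, i - half): i + half + 1]
        smoothed[t] = sum(chunk) / len(chunk)
    return smoothed


# Toy data, invented for illustration only.
toy_events = (
    [{"trigger": "recession", "location": "U.S.", "time": "2008-01"}]
    + [{"trigger": "bankrupt", "location": "U.S.", "time": "2008-02"}] * 2
    + [{"trigger": "recession", "location": "U.S.", "time": "2008-03"}] * 4
    + [{"trigger": "recession", "location": "Syria", "time": "2008-02"}]
    + [{"trigger": "concert", "location": "U.S.", "time": "2008-02"}]
)
toy_articles = {"2008-01": 10, "2008-02": 10, "2008-03": 20}
series = ecim(toy_events, {"recession", "bankrupt"}, "U.S.", toy_articles, window=3)
```

Truncating the window at the boundaries keeps a value for every time step; dropping the first and last few steps instead would be an equally reasonable choice.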

Experiments
We use the English Gigaword corpus 10 , which consists of 5.7 million articles published from 1994 to 2010 by a wide range of sources including the New York Times, the Associated Press, Los Angeles Times, Washington Post, Agence France-Presse, Central News Agency of Taiwan, and Xinhua News Agency. We run event extractors on this corpus to extract over 10 million event mentions that took place in the U.S. from 1994 to 2010. We then generate ECIMs for 3 representative socio-economic indicators: (1) unemployment rate, a crucial index of the economy and for policy making; (2) the Chicago Board Options Exchange (CBOE) Volatility Index (VIX), a widely used market volatility measurement; and (3) Economic Policy Uncertainty (EPU) (Baker et al., 2016), a policy uncertainty index. We focus on these 3 indicators because (1) they are widely used in economics and social science research, and (2) their data are publicly available 11 .
Figure 3 shows the time series of ECIMs (blue solid lines) and those of the 3 corresponding indicators (red dashed lines):
Unemployment rate: The downward/upward trends and the peaks of the two lines match each other quite well. This shows that the ECIM is correlated with the unemployment rate. There is some delay in the peaks and downward trends of the unemployment rate line: not surprisingly, the unemployment rate reacts to events such as "economic downturn" with a delay, and its recovery takes longer than the media coverage of "economic downturn" events lasts. Among the events detected, our system found many that can be labeled as "economic downturn" (e.g., "recession", "depression", "financial crisis") or "bankruptcy" (e.g., "bankrupt"). This matches our intuition that economic downturns and more bankruptcies are often correlated with a higher unemployment rate (Reinhart and Rogoff, 2009).
VIX: Figure 3 shows that the ECIM matches VIX very well over time. We found that high market volatility is strongly correlated with unfavorable macroeconomic events such as "economic crisis", firm-level events such as "bankruptcy", as well as their aftereffects such as "loan".
EPU: The ECIM strongly correlates with EPU. Similarly, economic crisis, which often leads to high economic policy uncertainty, is among the most frequently detected events. In addition, our system found extreme events such as "attack", "conflict", and "terrorism", which may trigger major changes of economic policy. The slight deviation in 2003-2004 is caused by low coverage of the Iraq War in Gigaword.
Table 2 shows a quantitative correlation analysis between ECIMs and their corresponding indicators. Pearson correlation coefficients (en.wikipedia.org/wiki/Pearson_correlation_coefficient) show strong correlation between the two for each indicator. We also test the hypothesis that ECIMs and their corresponding indicators are independent (the p-value is shown in the third column). For all three indicators, we reject the hypothesis of independence at the 99.99% confidence level.
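A correlation analysis of this kind can be sketched in a few lines. The Pearson coefficient below follows the standard definition; the significance test is a permutation test, which is an assumption made here for illustration, since the text does not state which independence test produced the reported p-values. Function names and the toy series are invented.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def permutation_p_value(x, y, trials=2000, seed=0):
    """Two-sided p-value: share of shuffles with |r| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)


# Toy series, invented for illustration only.
ecim_values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
indicator = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
r = pearson(ecim_values, indicator)
p = permutation_p_value(ecim_values, indicator)
```

One caveat on this design choice: shuffling treats time steps as exchangeable, which smoothed time series generally are not, so p-values obtained this way are likely optimistic for autocorrelated data.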
On negative correlation. We further study negative correlation between some events and indicators. We hypothesize that "Competition" events are positively correlated with the number of participants in a market, and therefore with a higher demand for labor, which means a lower unemployment rate. We construct an ECIM $M_{comp}$ for "Competition" and then plot $1/M_{comp}$ over time against the unemployment rate. It is important to note that we use $1/M_{comp}$ to flip the line for "Competition" upside down, so that the better the two lines align with each other, the more strongly the two are negatively correlated. Figure 4 shows a strikingly high negative correlation between "Competition" and the unemployment rate. The correlation coefficients between the unemployment rate and $1/M_{comp}$ are also very high, as shown in parentheses in the first row of Table 2. This points to a promising future direction: it is possible to measure an indicator using an event that occurs more frequently when the value of the indicator is low.
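The effect of the reciprocal transform can be seen on a small example. The numbers below are invented for illustration and are not taken from the experiments; the point is only that a series that falls as the indicator rises becomes positively correlated with it once inverted, and that the reciprocal is a nonlinear transform, so the resulting coefficient is not simply the negated original one.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Toy values, invented for illustration only.
m_comp = [0.8, 0.5, 0.4, 0.25, 0.2]        # ECIM for "Competition"
unemployment = [4.0, 5.0, 6.0, 8.0, 9.5]   # indicator values

r_raw = pearson(m_comp, unemployment)                        # negative
r_inverted = pearson([1.0 / m for m in m_comp], unemployment)  # positive
```

The reciprocal requires strictly positive ECIM values, so time steps with no matching events would need separate handling.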

Related Work
Most prior work on constructing or measuring indicators (Bansal et al., 2005; Jurado et al., 2015; Bachmann et al., 2013) is in social or economic science research. Recent work (Dzielinski, 2012; Alexopoulos and Cohen, 2015; Baker et al., 2016) tries to incorporate text into economic research with keyword-based approaches. Similar to ECIM, EPU (Baker et al., 2016) applies a keyword-counting approach (with keywords "economic", "policy" and "uncertainty") to measure economic policy uncertainty. In contrast, this paper uses the richer information from syntactic-semantic analysis of text. Furthermore, EPU uses aggregated sentiment (counting the number of times people expressed their views on economic policy uncertainty) as the measure, whereas we use a more objective approach that measures events correlated with the uncertainty. Rohlfs et al. (2016) applied supervised topic modeling to measure the effects of Federal Open Market Committee text content on the direction of short- and medium-term interest rate movements.

Conclusion and Future Work
This paper presents ECIM, a novel approach to measure socio-economic indicators with news events. Experiments show strong correlations between ECIM values and representative indicators in socio-economic research.
Our next steps are to further study the correlation with time lags, and to incorporate more sophisticated event extraction techniques.