A Corpus of Corporate Annual and Social Responsibility Reports: 280 Million Tokens of Balanced Organizational Writing

We introduce JOCo, a novel text corpus for NLP analytics in the field of economics, business and management. This corpus is composed of corporate annual and social responsibility reports of the top 30 US, UK and German companies in the major (DJIA, FTSE 100, DAX), middle-sized (S&P 500, FTSE 250, MDAX) and technology (NASDAQ, FTSE AIM 100, TECDAX) stock indices, respectively. Altogether, this adds up to 5,000 reports from 270 companies headquartered in three of the world’s most important economies. The corpus spans a time frame from 2000 up to 2015 and contains, in total, 282M tokens. We also feature JOCo in a small-scale experiment to demonstrate its potential for NLP-fueled studies in economics, business and management research.


Introduction
A crucial prerequisite in today's NLP research is the availability of large amounts of language data. National reference corpora such as the ANC for American English (Ide and Suderman, 2004), the BNC for British English (Burnard, 2000), and the DEREKO for German (Kupietz and Lüngen, 2014) assemble a collection of language data with a focus on ordinary language use covering a wide range of genres (e.g., newspaper articles, technical writing and popular fiction, letters, transcripts of court or parliament speeches, etc.). Corpora exclusively focusing on newspaper articles have been particularly influential for the development of syntactic and semantic methodologies in NLP research (e.g., PENN TREEBANK (Marcus et al., 1993) or PENN PROPBANK (Palmer et al., 2005) for the English language).
* These authors contributed equally to this work.
Turning to more specialized, mostly scientific, domains, these general language resources can only be reused at the cost of substantial performance penalties due to characteristic sublanguage phenomena in those domains. For the biomedical domain, e.g., these negative effects can be shown for the whole range of low-level (sentence splitting, tokenization (Tomanek et al., 2007; Griffis et al., 2016)) up to high-level tasks (such as syntactic analysis (Laippala et al., 2014; Jiang et al., 2015)). As a consequence, these specialized fields of NLP research have created their own resource infrastructure in terms of domain-specific lexicons and corpora for syntactic and semantic processing.
The rapidly increasing number of publications using text analytics for economics, business, and management (for surveys, cf. Lu et al. (2010); Goldenstein et al. (2015); Kumar and Ravi (2016)) indicates the emergence of an entirely new application domain for NLP systems (see Section 2). At first sight, one might argue that domain-specific corpora such as the PENN TREEBANK are sufficient since they already contain economy-related language data. Yet, as these resources assemble only excerpts from newspaper articles, at second sight, such resources turn out to be biased. Newspaper articles reflect journalists' interpretations and do not necessarily directly transport the attitudes and views of economic actors, such as an individual (consumer) or business corporations (Simon, 1991).
This shortcoming can be alleviated if one targets the economic actors' verbal communication behavior directly on various media channels. Our choice is to focus on annual reports (AR) and corporate social responsibility reports (CSRR) of major business corporations in Western economies.
Altogether, these documents comprise 282M tokens and reflect the unfiltered views of these commercial enterprises and their embedding in the social and regulatory system in market-driven societies. Viewing enterprises as social actors with their own goals, their legal, social and other responsibilities becomes increasingly relevant for both the explanation and prediction of economic and organizational phenomena, as well as for economics, management and organization science, in general (King et al., 2010; Bromley and Sharkey, 2017). While the raw data set we assembled can be used for scientific purposes only, we also offer an embedding model trained on it which is available without any legal restrictions.
From a methodological perspective, the social interactions between these actors (customers, enterprises, and political/juridical authorities) have been studied in terms of the sentiments they bring to bear (Van De Kauter et al., 2015). Evidence is collected from consumers' and enterprises' verbal behavior and their communication about products and services, e.g., via social media (Si et al., 2014; Liu, 2015; Alshahrani et al., 2018). This research is complemented by studies related to reputation, expertise, credibility and trust models for agents in the economic process (as traders, sellers, advertisers) based on mining communication traces and recommendation legacy data, including fake ad/review recognition (Bar-Haim et al., 2011; Brown, 2012; Mukherjee et al., 2012; Rechenthin et al., 2013; Tang and Chen, 2014; Žnidaršič et al., 2018).
System-wise, specialized types of search engines have been developed, for instance, enterprise search engines (e-commerce, e-marketing) or consumer search engines, market monitors, and product/service recommender systems (Vandic et al., 2017; Trotman et al., 2017). This also includes customer-supplier interaction platforms (e.g., portals, help desks, newsgroups) and transaction support systems based on natural language communication (including business chat bots) (Cui et al., 2017; Altinok, 2018). Specialized modes of information extraction and text mining in economic domains, e.g., temporal event or transaction mining, have also been explored (Tao et al., 2015; Lefever and Hoste, 2016; Ding et al., 2016), as well as information aggregation from single sources (e.g., review summaries, automatic threading) (Gerani et al., 2014).
Pioneering efforts in considering texts originally produced by enterprises as a basis for economic NLP were made by Kloptchenko et al. (2004), who used sentiments in enterprises' quarterly reports as a predictor for stock market prices. Later, Kogan et al. (2009) came up with the influential 10-K Corpus, a collection of 54,379 ARs from 10,492 different publicly traded companies covering a time interval from 1996 up to 2006. This seminal resource is a cornerstone of economic corpus development, and our work is meant to complement it with more recent and more diverse language data.

Corpus Description
The corpus we introduce here consists of ARs and CSRRs from companies in the United States, the United Kingdom and Germany. An AR is a comprehensive report published yearly by publicly listed corporations on their activities and financial performance of the past year. ARs provide information for current and prospective shareholders, the governmental and regulatory bodies, the stock exchanges, as well as all other stakeholders (Neu et al., 1998; Yuthas et al., 2002). A CSRR is a regular report published by a company or an organization about the economic, environmental and social impacts caused by its activities (Dahlsrud, 2008; Chen and Bouvain, 2009; Fifka, 2013). CSRRs also present the organization's values and governance model, and reveal the link between its strategy and its commitment to the organization's environment and a sustainable global economy (Du et al., 2010; Aguinis and Glavas, 2012).
Compared with the popular 10-K corpus (Kogan et al., 2009), the data set we present is significantly smaller in size (both in terms of tokens and companies). However, the 10-K corpus only covers ARs, while we also include CSRRs, allowing a wider view on organizational communication traces. Also, the 10-K corpus only includes reports up to the year 2006, whereas our work incorporates documents as recent as 2015. Additionally, the 10-K corpus is based only on the 10-K forms mandated by the Securities and Exchange Commission (SEC) in the US. US corporations' ARs, in contrast, contain the same information as required by the 10-K forms and much more. Furthermore, ARs are a genre of reports diffused globally (Rutherford, 2005; Meyer and Höllerer, 2010). Hence, the choice of ARs as a backbone for our corpus allows for a careful international sampling strategy balancing different kinds of corporations from different countries. This property makes our corpus particularly well suited for deeper economic investigations with respect to cross-index, cross-industry and cross-country comparisons.

Selection of Raw Data
ARs as well as CSRRs are considered relevant for our corpus based on two main criteria, namely the company that issued them and the year they report about. We selected companies in a step-wise process, first selecting the countries of origin and then the stock indices they were listed in.
Regarding the selection of countries, we chose the US, the UK and Germany because their combined GDP makes up 30% of world GDP (as of 2014), thus representing a relevant portion of the global economy. For each of these three countries, 90 companies were selected for inclusion in our corpus. We first took the 30 most intensively traded and most highly valued corporations of the American Dow Jones Industrial Average (DJIA), the British Financial Times Stock Exchange (FTSE 100) and the German Stock Index (DAX; "Deutscher Aktienindex"). Next, we added reports of middle-sized companies (30 per country) and technology companies (again 30 per country) for a total of 270 companies in our sample. Middle-sized companies were selected from the S&P 500, the FTSE 250 and the MDAX, whereas tech firms were chosen from the NASDAQ, the FTSE AIM 100 and the TECDAX indices for the US, the UK and Germany, respectively. We selected each corporation from the three countries so that they matched the corresponding two counterparts with respect to industry segment, sales and trading volumes.
Lastly, we let the time span of our corpus range between the years 2000 and 2015. Each report (AR and CSRR) from one of the 270 companies in the previously defined sample that addresses one of these years was included in the corpus, if possible (see also the following Subsection 3.2). The year 2000 was chosen as a starting point because of, first, the burst of the dotcom bubble and, second, the emergence of CSRRs. Further details regarding our sampling strategy are provided in the README file of our corpus distribution.
Table 2: Sample word embeddings illustrated by their five nearest neighbors based on cosine similarity.

Data Acquisition and Cleansing
The reports determined in this way were collected by three student assistants from the Business and Management Department by downloading the reports in PDF format from the companies' websites. In some cases, especially for documents from the early 2000s, reports were not available for downloading. The students (and, if necessary, one of the authors) then requested the documents directly from the respective investor relations department via email. The following metadata were recorded: report type (either AR or CSRR), reference year of the report 2 (as given on the title page), company of origin, and stock index.
2 In some cases, particularly for CSRRs, multiple consecutive years were indicated. In these cases, only the first year is considered the reference year.
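The metadata record and the first-year rule described above can be sketched as follows. This is only an illustration: the class name `ReportMeta`, the function `reference_year`, the label format on the title page and the company name are assumptions, not part of the corpus distribution.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportMeta:
    """Hypothetical container for the four metadata fields named in the text."""
    report_type: str      # "AR" or "CSRR"
    reference_year: int   # as given on the title page
    company: str
    stock_index: str


# Four-digit years in the 1900s or 2000s (assumed label format)
YEAR = re.compile(r"(?:19|20)\d{2}")


def reference_year(title_page_label: str) -> int:
    """Return the first year in the label, so '2009/2010' yields 2009."""
    match = YEAR.search(title_page_label)
    if match is None:
        raise ValueError(f"no year found in {title_page_label!r}")
    return int(match.group())


# Made-up example record
meta = ReportMeta(
    report_type="CSRR",
    reference_year=reference_year("Sustainability Report 2009/2010"),
    company="Example plc",
    stock_index="FTSE 100",
)
```

For a report covering several consecutive years, `reference_year` keeps only the first one, which mirrors the rule stated in the footnote.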
We used the pdf2text software by glyphandcog.com to extract plain text from the collected PDF files. In general, this software extracts text with sufficient quality. However, the final result depends heavily on the layout and style of the input files. For this reason, the resulting plain text files were iteratively refined in a rule-based fashion. This post-processing included restoring the original text structure of headings and paragraphs, deleting superfluous line breaks and hyphenation, page numbers and (rarely occurring) odd character sequences, as well as remnants of structured data, such as tables. This post-processing strategy yielded a mostly clean corpus of raw textual data only, i.e., preserving the running text of the original PDF files as well as possible while at the same time stripping off all irrelevant non-linguistic data.
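A minimal sketch of what such rule-based refinement might look like is given below. The concrete rules used for the corpus are not listed in the text, so the four regular expressions here are illustrative assumptions covering only page numbers, end-of-line hyphenation and superfluous line breaks.

```python
import re


def clean_extracted_text(text: str) -> str:
    """Apply a few illustrative cleanup rules to text extracted from a PDF."""
    # Drop lines that contain nothing but a short number (assumed page numbers;
    # limited to three digits so that a year on its own line survives)
    lines = [ln for ln in text.splitlines()
             if not re.fullmatch(r"\s*\d{1,3}\s*", ln)]
    text = "\n".join(lines)
    # Collapse runs of blank lines left behind by the removed page numbers
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Rejoin words hyphenated across a line break: "perfor-\nmance" -> "performance"
    text = re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)
    # Turn single line breaks inside a paragraph into spaces,
    # keeping blank lines as paragraph boundaries
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse repeated spaces and tabs
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()


raw = "Our perfor-\nmance improved\nin 2014.\n\n17\n\nOutlook\nremains stable."
cleaned = clean_extracted_text(raw)
```

Note that the naive dehyphenation rule also merges genuine compounds that happen to break at a line end (e.g., "cross-" followed by "industry"), which is one reason why such rules typically need iterative refinement.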

Corpus Analysis
After corpus construction, we used NLTK.org tools (Bird, 2006) for counting tokens and sentences for all of the reports. The results, summarized for each stock index, are depicted in Table 1. In total, our corpus comprises almost 5,000 reports, summing up to 282M tokens (9M sentences). This constitutes a substantial collection of textual data (for comparison, the BNC, ANC, and DEREKO contain 100M, 15M, and 42B tokens, respectively). The vast majority of the data set consists of ARs (247M tokens vs. 35M tokens from CSRRs). American, British and German corporations are evenly represented in the data set, i.e., for each of these countries, the three indices add up to about 90M tokens. Figure 1 depicts the growth curves for ARs as well as CSRRs. As can be seen, for both ARs and CSRRs, the number of reports increases over time. This graph also reflects the fact that documents become harder to acquire the older they are, as we experienced during data collection. Note that we could only collect a marginal number of CSRRs for the year 2000 (11). This is because their issuance became widespread only in that and the following years, as discussed above.
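The per-index aggregation behind such a summary table can be sketched as follows. To keep the example self-contained, two simple regular expressions stand in for the NLTK tokenizers mentioned above; they are assumptions for illustration and will not reproduce the reported counts.

```python
import re
from collections import defaultdict

# Crude stand-ins for a word tokenizer and a sentence splitter (assumptions)
TOKEN = re.compile(r"\w+|[^\w\s]")
SENTENCE_END = re.compile(r"[.!?]+(?=\s|$)")


def count_units(text):
    """Return (number of tokens, number of sentences) for one report."""
    return len(TOKEN.findall(text)), len(SENTENCE_END.findall(text))


def summarize(reports):
    """Aggregate report, token and sentence counts per stock index.

    `reports` is an iterable of (stock_index, text) pairs.
    """
    totals = defaultdict(lambda: {"reports": 0, "tokens": 0, "sentences": 0})
    for stock_index, text in reports:
        tokens, sentences = count_units(text)
        row = totals[stock_index]
        row["reports"] += 1
        row["tokens"] += tokens
        row["sentences"] += sentences
    return dict(totals)


# Made-up miniature input
summary = summarize([
    ("DAX", "Revenue grew. Costs fell."),
    ("DAX", "We invest."),
    ("FTSE 100", "Is growth organic? Yes!"),
])
```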

Word Embeddings
The distribution of the plain text data of JOCO is restricted by Intellectual Property Rights (IPR) regulations. As a substitute, we train word embeddings using the FastText.cc toolkit (Bojanowski et al., 2017) to capture the distributional semantics of economic jargon. As a prerequisite, the corpus was tokenized using NLTK and casefolded. Only words with frequency ≥ 50 were modeled. Subword information was not taken into account. The latter two decisions were taken to decrease the number of artifacts stemming from the PDF conversion in our final embedding model.
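The effect of case folding combined with the frequency threshold can be illustrated with a small, self-contained sketch. The function name `build_vocabulary` and the toy input are assumptions; in the fastText toolkit itself, a frequency cutoff and the deactivation of character n-grams are presumably set through options such as `minCount` and `maxn`, but the exact configuration used for the released model is not stated in the text.

```python
from collections import Counter

MIN_COUNT = 50  # frequency threshold stated in the text


def build_vocabulary(tokenized_docs, min_count=MIN_COUNT):
    """Case-fold all tokens and keep only those seen at least `min_count` times."""
    counts = Counter(
        token.casefold() for doc in tokenized_docs for token in doc
    )
    return {word: n for word, n in counts.items() if n >= min_count}


# Made-up input: two spellings of one word plus a rare conversion artifact
docs = [["Growth"] * 30 + ["GROWTH"] * 25 + ["gr0wth"] * 3]
vocabulary = build_vocabulary(docs)
```

In this toy input, the two case variants are merged and pass the threshold together, whereas the rare artifact "gr0wth" is discarded, which is the kind of noise reduction the two design decisions aim at.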
To illustrate the semantics captured in this way, Table 2 lists sample entries of our embedding model together with their five nearest neighbors. As can be seen, the results reveal high face validity: "growth", e.g., exhibits strong reference to its economic meaning (such as in "double-digit growth" or "organic growth") but does not refer to biological growth, which would have been indicated by neighbors like "plant" or "hormones".
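Nearest neighbors based on cosine similarity can be computed as in the following sketch. The three-dimensional vectors are made up purely for illustration and are not taken from the released embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equally long vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(word, vectors, k=5):
    """Return the k words most similar to `word`, best match first."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors (assumption): the first dimension loosely encodes "economic" usage
toy_vectors = {
    "growth": [0.9, 0.1, 0.0],
    "organic": [0.8, 0.2, 0.1],
    "double-digit": [0.7, 0.1, 0.2],
    "plant": [0.0, 0.9, 0.4],
    "hormones": [0.1, 0.2, 0.9],
}
neighbors = nearest_neighbors("growth", toy_vectors, k=2)
```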

Effects of Organizational Emotions
To demonstrate the potential of the JOCO corpus, we investigate the interaction of linguistic signals from corporations and their market performance. We focus on emotions expressed in ARs since the interplay of organizational cognition, character, and emotions is becoming a hot topic in organization science (Albrow, 1992; King, 2015; Buechel et al., 2016; Händschke et al., 2017). We conducted this work on a subsample of the corpus covering British and German firms only and their ARs from 2008 to 2015 to allow for European comparability. Financial and accounting metadata were retrieved from AMADEUS, a database that holds data of European firms (except for banks and insurance companies).
In the regression analysis, we employ the generalized estimating equations (GEE) method (Liang and Zeger, 1986), an estimation approach for longitudinal data that handles repeated observations of the same unit over time. In our case, we use its multivariate linear regression variant (see the Appendix for details). The dependent variable 'performance' is operationalized as Return on Equity (ROE), lagged by one year to allow for causality. Following the established psychological VAD model of emotions (Bradley and Lang, 1994), the independent explanatory variables are three dimensions of espoused organizational emotions: Valence, Arousal, and Dominance. These three dimensions are measured individually for each AR using the open-source tool JEmAS (Buechel and Hahn, 2016), which yields a value for each of the dimensions per firm per year. Due to the high correlation between Dominance and Valence, the latter variable was dropped from the model to prevent biasing of the estimators (cf. the correlation matrix given in the Appendix, Table 3). Control variables are the corporation's size (in terms of employees and assets, both logarithmized), operational profitability (sales per employee and sales per assets) and country of origin, measured with a dummy variable where Germany is coded as '1'. For our full model (Model III in Table 4), we find that Arousal has a significant (p < .001) negative effect on ROE, meaning that a company performs better, the calmer it communicates. However, this effect is more pronounced for British companies since the interaction term between Arousal and country (GER) shows a significant (p < .001) positive effect, which counteracts the negative main effect for German firms. Thus, our results suggest that espoused organizational emotionality correlates with performance, yet the nature of this interaction is country-dependent.
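The reading of the interaction term can be made explicit with a schematic version of the full model. The notation below is an assumption made for illustration (the authoritative specification is the one in the Appendix): i indexes firms, t years, c collects the control variables, and the one-year lag is written as emotions in year t explaining ROE in year t+1.

```latex
\begin{align*}
\mathrm{ROE}_{i,t+1} ={}& \beta_0
  + \beta_A\,\mathrm{Arousal}_{i,t}
  + \beta_D\,\mathrm{Dominance}_{i,t}
  + \beta_G\,\mathrm{GER}_i \\
 & + \beta_{AG}\,(\mathrm{Arousal}_{i,t} \times \mathrm{GER}_i)
  + \beta_{DG}\,(\mathrm{Dominance}_{i,t} \times \mathrm{GER}_i)
  + \boldsymbol{\gamma}^{\top}\mathbf{c}_{i,t}
  + \varepsilon_{i,t+1} \\[6pt]
\frac{\partial\,\mathrm{ROE}_{i,t+1}}{\partial\,\mathrm{Arousal}_{i,t}}
 ={}& \beta_A + \beta_{AG}\,\mathrm{GER}_i
 = \begin{cases}
     \beta_A & \text{British firms } (\mathrm{GER}_i = 0) \\
     \beta_A + \beta_{AG} & \text{German firms } (\mathrm{GER}_i = 1)
   \end{cases}
\end{align*}
```

With a negative main effect (\beta_A < 0) and a positive interaction (\beta_{AG} > 0), the slope for German firms is less negative than the slope for British firms, which is why the Arousal effect is described as more pronounced for British companies.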
Accordingly, our findings point towards the existence of a distinct organizational character (King, 2015) and emotionality (Albrow, 1992), and thus lend support to viewing organizations as social actors (King et al., 2010; Bromley and Sharkey, 2017). This piece of evidence might have far-reaching implications for the organizations' role and responsibility in society (Beyer et al., 2014).

Conclusion
We introduced JOCO, a novel text corpus for NLP analytics in the field of economics, business and management. This corpus comprises ARs and CSRRs of 270 publicly traded corporations in the US, UK and Germany from 2000 to 2015. Altogether, we assembled almost 5,000 reports and, in total, 282M tokens (9M sentences). By design, JOCO carefully balances various characteristics allowing cross-index, cross-industry, and cross-country comparisons and, thus, enables informed prospective applications in business research and economics, for which we provided a first, yet preliminary example.
Table 4: Results of GEE panel regression with dependent variable ROE lagged by one year and interaction effects of arousal and dominance with the country dummy (GER=1). Columns give the respective slope coefficient (Beta), standard error (S.E.) and p-value (Sig.). The three models differ in the set of variables taken into account. The number of cases is 1,127 for each model (one AR per corporation per year in the application's subsample of the corpus).