Financial Event Extraction Using Wikipedia-Based Weak Supervision

Extraction of financial and economic events from text has previously been done mostly using rule-based methods, with more recent works employing machine learning techniques. This work is in line with this latter approach, leveraging relevant Wikipedia sections to extract weak labels for sentences describing economic events. Whereas previous weakly supervised approaches required a knowledge-base of such events, or corresponding financial figures, our approach requires no such additional data, and can be employed to extract economic events related to companies which are not even mentioned in the training data.


Introduction
Event extraction from text (Hogenboom et al., 2011; Ritter et al., 2012; Hogenboom et al., 2016) has been the subject of active research for over two decades (Allan et al., 2003). Detection and extraction of finance-related events have mostly focused on events that are described in news articles and are likely to impact stock prices. In particular, previous work has sought to extract descriptions of events pertaining to a specific company, and analyzed how such events correlate with measures of that company's stock (price, volatility, etc.). While much of the literature has focused on the prediction of stock prices (e.g., Ding et al., 2015; Xie et al., 2013), it is recognized that predicting future stock movements is a formidable challenge (see, e.g., Merello et al., 2018); still, there are use cases that might benefit from business-related event extraction from news.
One promising direction is enhancing the finance-related research performed by finance analysts. Such research typically requires reviewing a large body of news data under severe time constraints. We propose an automatic system for highlighting meaningful company-related news events that are likely to deserve the analyst's attention.
Work on economic event extraction often defines an ad-hoc taxonomy of events, and what constitutes an 'important event' in one study might not be considered as such in another. For instance, the CoProE event ontology (Kakkonen and Mufti, 2011) includes events such as patent issuance and delayed filing of company reports, which are not considered by Du et al. (2016); similarly, while CoProE considers earnings estimates by analysts as events, Jacobs et al. (2018) examine instead analyst buy ratings and recommendations.
Outlining a comprehensive list of event types seems futile. For example, if a company's databases are hacked, this is certainly an influential event; but compiling an explicit and exhaustive event taxonomy that is sufficiently fine-grained to include all events such as this one is doomed to fail. At the same time, a formal event hierarchy is not necessarily required from an analyst's perspective. The strength of an automated system comes from the ability to process a large volume of news data and detect events of interest; automatically classifying these events into types is probably of secondary importance to an expert in the field.
Thus, our focus here is on a binary classification problem that is not type-based. This presents an interesting challenge, since the aim is not capturing the characteristics of predefined event types, but rather capturing general properties of relevant events.
The common NLP approach for economic event extraction has mostly made use of hand-crafted rules and patterns (Feldman et al., 2011; Arendarenko and Kakkonen, 2012; Xie et al., 2013; Hogenboom et al., 2013; Ding et al., 2014, 2015; Du et al., 2016). However, creating and maintaining such rules is time-consuming, and seems less suitable for our scenario, where no set of underlying event types (which give rise to such rules) is assumed. Hence, we follow a different, more flexible approach that relies on a robust statistical learning framework for identifying relevant events. In particular, we adopt a supervised learning approach for identifying events related to a given company, and propose training a sentence-level classifier for this purpose. Given sentences from news articles discussing the company, the classifier aims to identify sentences containing events that would be of interest to the analyst. Since the sentences come from articles discussing the company, our main focus is on determining whether a sentence conveys an event worth considering, and not on ascertaining that it is related to the company.
Learning a supervised model requires annotated data. The standard approach for obtaining annotated data involves human annotation, which requires substantial effort and limits the size of the data, which in turn may hinder the results. One way to overcome this problem is using weak supervision (Zhou, 2017), where labelled data is generated automatically using heuristics rather than manual annotation. Although such data may be noisier and less precise than standard labelled data, it makes it possible to create much larger amounts of data at a significantly lower cost. Here we rely on content from Wikipedia to automatically generate a weakly-labelled sentence dataset for company events. We report experimental results that demonstrate the potential merit of our approach.

Related Work
Arendarenko and Kakkonen (2012) relied on a collection of hand-crafted detection rules in order to recognize 41 distinct company-related event types, and Du et al. (2016) used about 600 distinct patterns to cover 15 business event types.
More recently, machine-learning techniques were considered for this task. Jacobs et al. (2018) frame the problem as a multi-class classification task. They define a taxonomy of 10 event types, in addition to a "no-event" class, and 7 companies of interest, and rely on manual annotation to train a sentence-level multi-class classifier. Testing several classifiers, they show that a linear SVM classifier attains the best results for most event types. While the current paper also adopts a supervised learning sentence-level approach, here the data is constructed based on weak labels, and the task is framed as a type-independent binary classification problem.
Rönnqvist and Sarlin (2017) used weak supervision in the context of financial events, focusing on bank distress events. They consider 101 banks for which 243 such events, and their dates, are known. They then extract 386K sentences referring to these banks, and consider a sentence as describing a distress event if there is a matching event in the knowledge base mentioning the same bank and occurring near the publication date of the article from which the sentence was extracted. This approach requires a large knowledge base of specific events, which is not readily available when moving from a confined event type (i.e., bank distress) to a diverse space of events. In this work we suggest a weak-label approach that aims to encompass a variety of relevant entities, event types and event occurrences.

Data
We used two types of datasets, one which is created automatically based on weak labels, and another which is based on manual annotation.

Weakly labelled datasets - Wikipedia
We leverage the content of Wikipedia articles describing companies as a source of influential events in the company's chronology.
In order to automatically identify 'positive' sentences which likely describe noteworthy events, we rely on two observations: (1) such events tend to appear within specific Wikipedia sections; (2) sentences beginning with a date, specifically the date-pattern [On/In/By/As of + month + year], often describe an event. Thus, we manually created a lexicon of words which tend to appear in the titles of event-prone sections. A section whose title contains one of the following words is defined as an event-section: history, creation, leadership, corporate, acquisitions, growth, finance, financial, lawsuits, litigation, legal.
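The two observations above can be sketched in code as follows. This is a minimal illustration, not the authors' implementation: the regular expression is only one assumed rendering of the date-pattern (full month names, four-digit year, optional trailing comma), and the function names `is_event_section` and `strip_opening_date` are hypothetical.

```python
import re
from typing import Optional

# Lexicon of title words marking event-prone sections (taken from the text).
EVENT_SECTION_WORDS = {
    "history", "creation", "leadership", "corporate", "acquisitions",
    "growth", "finance", "financial", "lawsuits", "litigation", "legal",
}

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Assumed rendering of the date-pattern: On/In/By/As of + month + year,
# optionally followed by a comma. Day numbers are not handled here.
DATE_PATTERN = re.compile(
    rf"^(?:On|In|By|As of)\s+(?:{MONTHS})\s+\d{{4}},?\s*"
)


def is_event_section(title: str) -> bool:
    """Return True if any word of the section title is in the lexicon."""
    words = re.findall(r"[a-z]+", title.lower())
    return any(word in EVENT_SECTION_WORDS for word in words)


def strip_opening_date(sentence: str) -> Optional[str]:
    """Return the sentence without its opening date, or None if it has none."""
    match = DATE_PATTERN.match(sentence)
    if match is None:
        return None
    return sentence[match.end():]
```

For instance, `is_event_section("Corporate affairs")` is true under this sketch, while a sentence that does not open with the date-pattern yields `None` from `strip_opening_date`.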
Given a company C, we select from its Wikipedia article all sentences appearing in an event-section and starting with a date-pattern. We remove the opening date and mark the sentences as positive examples with respect to C. All sentences which do not start with a date-pattern and are not in an event-section are considered as negative. To balance the dataset, we enforce an equal number of positive and negative examples by discarding sentences from the larger set. In addition, since many positive examples begin with either the company's name or the words "the company", we aim to balance the two classes in terms of sentences containing these patterns. The rest of the negative examples are chosen at random.
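The labelling and balancing steps above might look roughly like the sketch below. It assumes that each sentence already carries two boolean flags (whether it lies in an event-section and whether it starts with a date-pattern), and it reads the text as leaving the two mixed cases unlabelled. The additional balancing on sentences that contain the company name or "the company" is not reproduced here; the function names and the fixed random seed are illustrative choices.

```python
import random
from typing import List, Optional, Tuple


def weak_label(in_event_section: bool, starts_with_date: bool) -> Optional[int]:
    """Return 1 for positive, 0 for negative, None if the sentence is unused."""
    if in_event_section and starts_with_date:
        return 1
    if not in_event_section and not starts_with_date:
        return 0
    return None  # assumption: mixed cases receive no weak label


def balance(
    positives: List[str], negatives: List[str], seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Downsample the larger class at random so both classes have equal size."""
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    return rng.sample(positives, n), rng.sample(negatives, n)
```

Sampling without replacement keeps every retained sentence unique, so the smaller class is kept in full and only the larger one loses examples.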
The procedure described above was used to create two datasets. The first, S&P-wiki, is generated from Wikipedia articles of the companies on the S&P 500 index. A larger dataset, Extended-wiki, was later generated from Wikipedia articles of companies traded in one of five major stock exchanges, yielding 3.8K companies in total.
Each dataset was split into train and test sets. Table 1 shows the statistics of the resulting datasets, which will be released as part of this work.

Manually labelled dataset - SentiFM
To the best of our knowledge, the only manually annotated dataset for event detection in news articles is SentiFM (Jacobs et al., 2018). This dataset contains manual annotations of sentences into 10 predefined financial event types. However, this dataset is designed to solve a slightly different problem from the one explored in this paper. SentiFM was constructed in the context of a multi-class classification problem, whereas here we deal with a binary problem. Namely, we are not interested in event types, and do not assume there is a closed set of underlying types describing the events of interest. Indeed, it is possible that an event of interest might not be included in the SentiFM taxonomy, and hence a corresponding sentence would be labeled as negative. Despite these differences, we sought to examine how a classifier trained on the SentiFM data would perform on our task. To this end, we created a binary version of SentiFM, by considering all 'no-event' sentences as negative examples, and sentences of all event types as positives. We kept the original train/test split (see Table 1) and denote this dataset as SentiFM-binary.

Table 1: Dataset statistics (columns: Model, Train, Test).

2019 News Sentences - News-2019
In order to evaluate methods for detecting company-related events within news data, we compile a set of sentences from news articles. Specifically, we selected the 10 S&P companies with the largest number of events from 2019 mentioned in their Wikipedia page (see Table 2). For each company, we retrieved all articles from 2019 on Seeking Alpha that contained the company name in their title. We assume that this set of articles provides good coverage of the company's events of interest during 2019. We applied sentence-splitting to the retrieved articles, keeping only sentences 10-50 tokens long.
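The length filter at the end of this step can be illustrated with a short sketch. The text does not say which tokenizer was used or whether the bounds are inclusive, so whitespace tokenization and inclusive bounds are assumptions here, and `keep_sentence` is a hypothetical name.

```python
def keep_sentence(sentence: str, min_tokens: int = 10, max_tokens: int = 50) -> bool:
    """Keep sentences whose token count lies within [min_tokens, max_tokens]."""
    # Assumption: tokens are approximated by whitespace-separated words.
    num_tokens = len(sentence.split())
    return min_tokens <= num_tokens <= max_tokens
```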

Experiments
The datasets described in Section 3 were used to train three event detection models. All classification models are based on BERT (Devlin et al., 2018), which has shown state-of-the-art results in many NLP tasks. We use a single-sentence input, and fine-tune the classifier with the SentiFM-binary, S&P-wiki and Extended-wiki datasets. Henceforth, we will use these names to refer to their corresponding BERT models. We use the BERT-Base model configuration, with a maximum sequence length of 256, batch size of 16, dropout rate of 0.1 and learning rate of 5e-5. Each model was fine-tuned for 3 epochs, using a cross-entropy loss function.
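For reference, the reported fine-tuning settings can be collected in a small configuration object. This is only a sketch: the class and helper names are hypothetical, and the step count assumes that the last, possibly smaller, batch of each epoch is kept.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters as reported in the text."""
    max_seq_length: int = 256
    batch_size: int = 16
    dropout: float = 0.1
    learning_rate: float = 5e-5
    epochs: int = 3
    num_labels: int = 2  # binary task: event vs. no event


def total_update_steps(num_train_examples: int, config: FineTuneConfig) -> int:
    """Number of optimizer updates over all epochs (partial last batch kept)."""
    steps_per_epoch = math.ceil(num_train_examples / config.batch_size)
    return steps_per_epoch * config.epochs
```

With these settings, a hypothetical training set of 1,000 sentences would give 63 updates per epoch and 189 updates in total.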

Initial model evaluation
We first evaluate the performance of the three models on their corresponding test sets. As shown in Table 3, all models reach high performance when tested on the same type of data used in training. Next, we evaluate these models on the Extended-wiki test set (see Table 3). Notably, although less than 15% of the companies in Extended-wiki are in S&P-wiki, the latter model exceeds 90% precision and recall over the Extended-wiki test data. This suggests that the model is also able to detect events for companies that were not seen in training.
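Precision and recall for the positive (event) class follow the standard definitions, which can be written as a short sketch. The function name and the choice to return 0.0 when a denominator is zero are illustrative.

```python
from typing import Sequence, Tuple


def precision_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall of the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```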

Identifying Wikipedia events in the news
Ultimately, we are interested in the ability to detect events in the target domain of news articles. To validate performance over this domain, we used sentences from News-2019 and cross-referenced them with company events from Wikipedia. Specifically, we manually extracted events from 2019 from the Wikipedia pages of the companies in Table 2. For each event, we asked 3 annotators to mark all sentences from News-2019 which mention this event. In total, 26 of the Wikipedia events were mentioned in at least one sentence.
We then applied each of the three models to all the news sentences, and kept only the sentences that were classified as positive by the model. For each model, we measure the event recall rate as the fraction of Wikipedia events which are mentioned in at least one positively-classified sentence.
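The event recall rate defined above can be computed as in the following sketch. It assumes that each event is mapped to the set of sentence identifiers the annotators marked as mentioning it, and that only events with at least one marked mention are included; the names are hypothetical.

```python
from typing import Dict, Set


def event_recall(event_mentions: Dict[str, Set[int]], positive_ids: Set[int]) -> float:
    """Fraction of events with at least one positively-classified mention."""
    if not event_mentions:
        return 0.0
    covered = sum(1 for ids in event_mentions.values() if ids & positive_ids)
    return covered / len(event_mentions)
```

Note that an event counts as recalled once any single mention is classified as positive, so this measure is more lenient than sentence-level recall.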
As expected, the recall rates of the Wikipedia-based models over the news data (Table 4) are lower than those achieved over Wikipedia data. This may be due to the difference in writing style between the two sources. Notably, even though SentiFM-binary was trained on news data, its recall is the lowest among the three models. This may be attributed to the mismatch between the event types in SentiFM and those in Wikipedia.
Sorting the positively-classified sentences by their model score, we also measure the average rank of the highest-scored mention of each [...]
[...] Table 2 were annotated by three co-authors of this work. The guidelines were to determine whether a given sentence contains information which may influence the company's stock price, as such events presumably deserve the attention of a finance analyst. The annotation process was composed of two stages. First, each sentence was annotated by two labelers. Then, the sentences on which the labelers disagreed (21% of the sentences) were annotated by a third annotator. Average agreement between the initial two annotators was 0.45 (Cohen's Kappa). Table 5 shows the precision of the two models, compared to a baseline of randomly-selected sentences.

Model              Precision
Random sentences   0.28
SentiFM-binary     0.70
Extended-wiki      0.74

Finally, we wanted to analyze the diversity of events captured by the two models. For this purpose, we looked at the distribution of unique tokens in the top 200 predictions of each model, after filtering out stop words and the companies appearing in the list of Table 2. We sorted the remaining tokens by their frequency from highest to lowest, and computed the cumulative frequency as a function of the number of unique tokens. Figure 1 indicates that the top candidates of Extended-wiki capture a richer vocabulary than those of SentiFM-binary, which is dominated by a smaller group of tokens. For example, 20% of the tokens are covered by the 36 and 19 most frequent tokens in Extended-wiki and SentiFM-binary, respectively. Moreover, despite their similar precision values, the populations of events captured by the two models are quite different: the overlap between their top candidates is less than 10% (18 out of 200 examples). This observation suggests that the models are complementary, and that there is potential benefit in combining them.
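The diversity analysis above can be illustrated with a small sketch that computes how many of the most frequent unique tokens are needed to cover a given share of all token occurrences, together with the overlap between two top-candidate lists. Stop-word and company-name filtering is assumed to have been applied beforehand, and both function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def tokens_to_cover(tokens: Sequence[str], fraction: float) -> int:
    """Smallest number of most-frequent unique tokens covering `fraction` of occurrences."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    total = sum(counts)
    running = 0
    for k, count in enumerate(counts, start=1):
        running += count
        if running >= fraction * total:
            return k
    return len(counts)


def overlap_fraction(top_a: Sequence[str], top_b: Sequence[str]) -> float:
    """Share of the candidates in top_a that also appear in top_b."""
    return len(set(top_a) & set(top_b)) / len(top_a)
```

Under this measure, a larger value of `tokens_to_cover` at the same fraction indicates a richer vocabulary, which matches the comparison of 36 versus 19 tokens at the 20% level reported in the text.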

Discussion
This paper focused on detecting 'important' events in news articles, related to a specific company. We proposed leveraging information contained in Wikipedia to create weakly-labelled data, and demonstrated the usefulness of the resultant classifier for the desired task. We believe that the results can be further improved by finding additional sources for weak labels, e.g., by exploiting information from relevant knowledge bases.
The potential coverage of relevant events can be increased by retrieving articles which do not necessarily include the name of the considered company in their title. Extending our framework to pinpoint noteworthy events for a particular company, mentioned in articles that are not focused on that company, is a natural direction for future research. Such an extension will require adapting the weakly-labelled data and the corresponding classifiers to cope with an environment in which sentences are not necessarily relevant to the company.