Forecasting Firm Material Events from 8-K Reports

In this paper, we show deep learning models can be used to forecast firm material event sequences based on the contents in the company’s 8-K Current Reports. Specifically, we exploit state-of-the-art neural architectures, including sequence-to-sequence (Seq2Seq) architecture and attention mechanisms, in the model. Our 8K-powered deep learning model demonstrates promising performance in forecasting firm future event sequences. The model is poised to benefit various stakeholders, including management and investors, by facilitating risk management and decision making.


Introduction
A Corporate Event Sequence (CES) is a sequence of events that take place at one company over a period of time. A series of company events can reflect corporate strategy and future plans. Therefore, a CES can be used as a tool to probe corporate strategy and decision-making behavior.
Investors can use an existing CES to project a company's future CES. For instance, an acquisition is a sign of upcoming financing, and insufficient funding is a signal for refinancing. Similarly, a failing operational decision can bring about a change in executive personnel, and a new senior-level appointment can be expected after that. Since a CES embodies consistency (with corporate strategy) and continuity (over time), it is well suited to guiding the evaluation of organizational strategy.
Research increasingly reveals the merit of textual data in financial studies. Many studies center on the financial market, such as a firm's stock price, return, and volatility (Fang and Peress, 2009; Tetlock, 2010; Edmans, 2011). Meanwhile, the high dimensionality of textual data presents challenges to traditional econometric models, which makes textual data a natural fit for machine learning and deep learning models. Deep learning models have recently become popular in finance applications, and many of them have also focused on market-related tasks (Ding et al., 2014, 2015). These models, however, did not take corporate strategy into account and did not focus on the sequential nature of corporate events.
The U.S. Securities and Exchange Commission (SEC) requires publicly traded companies to file Form 8-K (also called 'Material Event Report' or 'Current Report') when certain types of corporate events take place. In general, an 8-K report should be filed when the company has an event that its shareholders should be aware of. A matter is considered material if there is a substantial likelihood that a reasonable person would consider it important, and a "rule of thumb" impact scale for a material event is five to ten percent of net income.
Although public companies are also required to file Form 10-K (annual report) and Form 10-Q (quarterly report), 10-K/Qs have significant drawbacks compared to 8-Ks. 10-K/Qs are designed to cover mixed categories of information, which tends to lower their readability and raise the barrier for amateur readers. While 10-K/Qs keep getting longer (Cazier and Pfeiffer, 2015), not all investors have the skill to extract the insightful messages from them. When they encounter difficulties with 10-K/Qs, most retail investors do not have the resources that advanced institutional investors do. Most importantly, 10-K/Qs are released long after the events they cover and at long intervals, which means investors have to wait a quarter or longer to see the updated official release from the company.
Various stakeholders, not only investors but also management teams and regulators, can find CES useful. CES provides hints about both corporate strategy and corporate operation patterns. Given its continuous nature, CES gives stakeholders timely information for achieving higher profits and a better position.
Given the versatile benefits of 8-Ks and the ability of deep learning models to forecast from textual data, in this paper we propose an end-to-end sequence-to-sequence neural network to predict corporate event sequences from 8-K reports.

Gated Recurrent Units (GRUs)
The Gated Recurrent Unit (GRU) is a representative deep learning architecture, first proposed by Cho et al. (2014). GRUs have been applied to numerous natural language processing tasks, such as part-of-speech (POS) tagging, information extraction, syntactic parsing, speech recognition, machine translation (Cho et al., 2014), and question answering.

Sequence to Sequence (Seq2Seq) Neural Network
The sequence-to-sequence (seq2seq) model was introduced by Sutskever et al. (2014). It is widely used in machine translation tasks (Bahdanau et al., 2015; Luong et al., 2015), i.e., translating sentences from one language to another, such as French to English. The attention mechanism has been explored broadly in recent publications.
The intuition behind attention in natural language processing is to assign higher weight to the text that contains more information for the task at hand. Yang et al. (2016) employed both word-level and sentence-level attention for document classification. Wang et al. (2016) proposed aspect-level attention to capture different sentiments for different aspects in a sentence. Ma et al. (2017) proposed an interactive attention architecture that models the interaction between the context and the target for sentiment classification. More recently, Vaswani et al. (2017) used multi-head attention alone to solve sequence prediction problems that are traditionally handled by other neural network techniques such as Long Short-Term Memory and Convolutional Neural Networks. We are inspired by Bahdanau et al. (2015), Luong et al. (2015), Kadlec et al. (2016) and Cui et al. (2017) to compute event attention for our prediction task.

Material Events and Form 8-K Current Reports

Related Works
Various items are required to be filed in company 8-K reports, and many studies have tried to group 8-Ks into categories. Zhao (2016) classified 8-Ks into seven categories: 1) information about business and operations (OPR), 2) financial information (FIN), 3) matters related to the exchange or trading of securities, 4) information related to financial accountants and financial statements, 5) corporate governance and management (GOV), 6) events related to Regulation Fair Disclosure (REG), and 7) other events considered important to the firm (OTH). OPR, FIN, GOV, REG, and OTH are the five major 8-K categories, covering more than 95% of all 8-K reports in that study. Feuerriegel and Pröllochs (2018) used Latent Dirichlet Allocation (LDA) to categorize 8-K reports into topics: energy sector, insurance sector, change of trustee, real estate, corporate structure, loan payment, amendment of shareholder rights, earnings results, securities sales, stock option award, credit rating, income statements, business strategy, securities lending, management change, health care sector, tax report, stock dilution, mergers and acquisitions, and public relations. Earnings results and public relations are the top two topics in their study. He and Plumlee (2019) categorized voluntary items (Items 2.02, 7.01, and 8.01) into business combinations, conference presentations, dividend announcements, litigation, patents, restructuring, security offerings, share repurchases, and shareholder agreements.

Our Approach
However, none of the above studies categorized 8-Ks by the nature of the event. Meanwhile, certain items are mandatory to report and other items are voluntary. Therefore, we first read thousands of 8-Ks ourselves and designed a taxonomy that holistically characterizes 8-Ks by event type, based on human understanding of the report content and the nature of the event. Then, we map every report to one of our event types for analysis. We list our event types in Table 1; they are the target variables of our prediction. Since some reports can be filed under different item numbers (from 1.01 to 8.01; we eliminated 9.01 Exhibits), the mapping between report items and event types is many-to-many. In other words, one report item number can appear under different event types. The bold item numbers in Table 1

Background
The goal of our work is to predict a firm's future event sequences based on its historical event sequences. Our prediction task is therefore a sequence-to-sequence problem. In particular, we use the corporate event sequences in memory M to predict the event sequences in forecasting horizon H. Historical event sequences are collected from corporate 8-K reports, and all events are identified by the event types listed in Table 1. Memory M and forecasting horizon H are composed of smaller time intervals j and q, respectively. The time structure of our model is illustrated in Figure 1.
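As a minimal sketch of this time structure, the snippet below slices one company's per-window event lists into (memory, horizon) pairs. The function name `make_windows`, the `stride` parameter, and the toy event labels are illustrative assumptions and are not taken from the paper.

```python
from typing import List, Tuple

Window = List[str]  # event types filed in one time window (e.g. one month)


def make_windows(
    timeline: List[Window], memory: int, horizon: int, stride: int = 1
) -> List[Tuple[List[Window], List[Window]]]:
    """Slice a per-company timeline into (memory, horizon) pairs."""
    pairs = []
    last_start = len(timeline) - (memory + horizon)
    for start in range(0, last_start + 1, stride):
        t = start + memory  # boundary between memory and forecasting horizon
        pairs.append((timeline[start:t], timeline[t:t + horizon]))
    return pairs


# Toy timeline: 6 windows, some empty, each holding hypothetical event labels.
timeline = [["acquisition"], [], ["financing"], ["personnel"], [], ["financing"]]
pairs = make_windows(timeline, memory=3, horizon=2)
```

With the settings reported later in the paper, the call would use `memory=36` and `horizon=12`; how the windows are strided in the actual experiments is not stated, so `stride` is left as a free parameter here.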

A Real-world Example
We can also view sequence-to-sequence prediction as a story completion task: once we know what events happened in the past, we can predict what events are likely to occur in the future to complete the story. We present an example of a corporate event sequence in Figure 2 to demonstrate the real-world practice and the importance of the problem. Figure 2 shows AT&T's event sequence from time t-M to t+H, and it illustrates how the historical corporate events from t-M to t can forecast and affect the future corporate events from t to t+H. For example, at time t-M, AT&T wins several wireless spectrum auctions from the Federal Communications Commission (FCC). Because of the auction win, AT&T needs more capital to support the new business. Therefore, we see that AT&T arranged loans and reported them in its 8-K afterward. In another stream, similar to its competitors, AT&T also wants to tap into the content industry. It announces the acquisition of Time Warner to support this corporate blueprint. As noted above, an acquisition calls for financing, so AT&T needs more money for this deal. As a result, AT&T filed a loan financing activity in its following 8-K report. Meanwhile, possibly because of a disagreement over corporate strategy, the CEO of AT&T's business solutions unit, who supported expanding the hardware business instead of the content industry, announces retirement. Given what we have learned so far, we can foresee that several corporate events have a higher chance of occurring in the future. For instance, if the previous financing amounts were not enough for the Time Warner acquisition, AT&T would have to take out more loans. We can observe that such a loan financing activity indeed occurred in the forecasting horizon. Moreover, because of the acquisition, AT&T would need to make arrangements for Time Warner's executive members, which also happened in the forecasting window: AT&T announced the duties of Time Warner's CEO after the acquisition was completed. Additionally, after the acquisition was completed, AT&T recognized additional financial needs for the combined business, and we notice that AT&T reported another loan activity after the completion of the acquisition. This example tells us that historical corporate events can affect not only what type of event will happen in the future, but also when the event will occur.

Formal Definition
We formally define the problem using the following notation:
• y denotes the event types to be predicted.
• M denotes the size of the memory, and H denotes the size of the forecasting horizon; both are measured in number of time windows.
• i indexes companies.
• ev denotes event index, and |Ev| denotes the total number of event types.
• |K| denotes the total number of events per time window.
• $S^{(i)}_{jk}$ is the embedding of the kth event of company $C_i$ in time window j, and $E^{(i)}_j$ is the aggregate event embedding of company $C_i$ in time window j.
• g is a function that aggregates multiple event embeddings into one embedding.
• f is a learned model (function) that maps all event embeddings in memory to forecasting horizon.
In principle, g and f can be parameterized as any function approximator.
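The symbols listed above suggest a formulation of the following shape. This is a sketch assembled from those definitions and should not be read as the paper's own equation; in particular, the index ranges $t-M+1, \ldots, t$ and $t+1, \ldots, t+H$ are assumptions.

```latex
\begin{aligned}
E^{(i)}_{j} &= g\left(S^{(i)}_{j1}, \ldots, S^{(i)}_{j|K|}\right), \\
\left(y^{(i)}_{t+1}, \ldots, y^{(i)}_{t+H}\right) &= f\left(E^{(i)}_{t-M+1}, \ldots, E^{(i)}_{t}\right).
\end{aligned}
```

Read this way, g collapses the events of one time window into a single vector, and f maps the M aggregated vectors in memory to the H event-type predictions in the forecasting horizon.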

GRU model
Since our task operates on time sequences and is formulated as a sequence-to-sequence (seq2seq) problem, we use the GRU as the backbone of our model and an encoder-decoder framework as its architecture. In the encoder, we train our event type embeddings in the Event Embedding Layer. Multiple reports can be filed in the same time window. Therefore, at each time window, we select the top |K| event embeddings for each company, ranked by the L2 norm of each event embedding. We experiment with several choices for the function g in Equation 1, including an attention mechanism. In the Event Attention Layer, we apply attention to the top |K| event embeddings and obtain the weighted embedding at each time window.
In the Event Attention Layer, the weighted-sum context vector $E^{(i)}_j$ is used as the aggregated semantic representation of the company's events at each time window. Next, $E^{(i)}_j$ is fed into the following Event Embedding Layer.
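The selection-then-attention step can be illustrated with a small pure-Python sketch. The scoring rule used here (dot product with a learned query vector, followed by softmax) is an assumption made for illustration, since the exact attention form is not given in this text; the names `aggregate_events` and `query` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def l2_norm(v: Vector) -> float:
    return math.sqrt(sum(x * x for x in v))


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_events(embeddings: List[Vector], query: Vector, k: int) -> Vector:
    """Hypothetical g: keep the k largest-norm embeddings, then attend over them."""
    if not embeddings:  # assumption: an empty window maps to a zero vector
        return [0.0] * len(query)
    top = sorted(embeddings, key=l2_norm, reverse=True)[:k]
    scores = [sum(q * x for q, x in zip(query, e)) for e in top]
    weights = softmax(scores)
    return [sum(w * e[d] for w, e in zip(weights, top)) for d in range(len(query))]


# Three events filed in the same window; the smallest-norm one is dropped.
events = [[3.0, 0.0], [0.0, 2.0], [0.1, 0.1]]
window_embedding = aggregate_events(events, query=[0.0, 0.0], k=2)
```

With a zero query the two retained events receive equal weight, so the result is simply their average; a trained query would shift the weights toward the more informative event.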
In the vanilla GRU model, the last hidden state of the encoder is directly connected to the decoder. At every time t in the decoder, the hidden state $hs_t$ is used to predict the event type at the current time step in the Prediction Layer, where we use the softmax function to compute $y_t$.
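A prediction layer of this kind might look as follows. The linear projection before the softmax (the weight matrix `W` and bias `b`) is an assumption, as the text only states that softmax is applied to the decoder hidden state; the function name `predict_event_probs` is hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_event_probs(
    hidden: List[float], weights: List[List[float]], bias: List[float]
) -> List[float]:
    """Assumed prediction layer: softmax(W @ hidden + b), one row of W per event type."""
    logits = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


hs_t = [1.0, -1.0]                        # toy decoder hidden state
W = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # three hypothetical event types
b = [0.0, 0.0, 0.0]
y_t = predict_event_probs(hs_t, W, b)
```

The output is a probability distribution over event types at decoder step t.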

GRU attention model
We implement the Alignment Attention Layer in the GRU attention model. In the encoder-decoder framework, information from the encoder is carried over to the decoder. To capture which historical events play a larger role in the prediction horizon, we employ an attention mechanism (Bahdanau et al., 2015; Luong et al., 2015; Kadlec et al., 2016; Cui et al., 2017). The decoder of the GRU attention model looks at every hidden state in the encoder. The Alignment Attention Layer assigns attention values to the hidden states in the encoder and aggregates them. We follow the "general" approach in Luong et al. (2015) and obtain attention scores between the target sequence and the input sequence as
$$\mathrm{score}(h_{tr}, h_{sr}) = h_{tr}^{\top} W_a h_{sr},$$
where $h_{tr}$ is the hidden state of the target sequence, $h_{sr}$ is the hidden state of the source sequence, and $W_a$ is a learned weight matrix. The context vector $c_t$ is the sum of the encoder hidden states weighted by the attention scores.
Here, [.;.] denotes concatenation along the sequence dimension. We illustrate the GRU attention model in Figure 3.
We compare two model variants:
• GRU: event sequence as input and without attention.
• GRU attention: event sequence as input and with attention.
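The alignment attention used by the second variant can be sketched in pure Python. The score follows the "general" form of Luong et al. (2015), score = h_tr^T W_a h_sr; the softmax normalization of the scores and the helper names are assumptions for illustration.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def matvec(m: Matrix, v: Vector) -> Vector:
    return [dot(row, v) for row in m]


def luong_general_context(
    h_tr: Vector, encoder_states: List[Vector], w_a: Matrix
) -> Tuple[List[float], Vector]:
    """Return (attention weights, context vector) for one decoder step."""
    # "general" score: h_tr^T W_a h_sr for every encoder hidden state h_sr
    scores = [dot(h_tr, matvec(w_a, h_sr)) for h_sr in encoder_states]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(encoder_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, encoder_states)) for d in range(dim)
    ]
    return weights, context


h_tr = [1.0, 0.0]                        # toy decoder hidden state
encoder_states = [[1.0, 0.0], [0.0, 1.0]]
w_a = [[1.0, 0.0], [0.0, 1.0]]           # identity stands in for a learned W_a
weights, context = luong_general_context(h_tr, encoder_states, w_a)
```

In this toy case the first encoder state aligns with the decoder state and therefore receives the larger weight.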

MCMC baseline
For every company, we gathered its event sequences over the entire experimental period and constructed an event transition matrix. Given this transition matrix, we can run Markov Chain Monte Carlo (MCMC) simulations. In particular, we treat each row of the transition matrix as the probability distribution over next events for the corresponding event type. We take every event type at the last training timestamp as the current event type $E_t$ and draw the next event type $E_{t+1}$ from the distribution of $E_t$. Repeating this step, we sample $E_{t+2}$ from the distribution of $E_{t+1}$, and so on, until we reach $E_{t+H}$ and complete the sampling process. In the experiment, for each event type $E_t$, we sample its sequences 100 times and use the averaged sequence performance as our baseline.
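The baseline can be sketched with the standard library alone. The function names, the toy event history, and the early-stop rule for states without an observed successor are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List


def transition_matrix(sequence: List[str]) -> Dict[str, Dict[str, float]]:
    """Estimate row-normalized transition probabilities from one event sequence."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(sequence, sequence[1:]):
        counts[cur][nxt] += 1
    return {
        cur: {nxt: c / sum(ctr.values()) for nxt, c in ctr.items()}
        for cur, ctr in counts.items()
    }


def sample_path(
    matrix: Dict[str, Dict[str, float]], start: str, horizon: int, rng: random.Random
) -> List[str]:
    """Draw E_{t+1}, ..., E_{t+H} by repeatedly sampling from the current row."""
    path, cur = [], start
    for _ in range(horizon):
        row = matrix.get(cur)
        if not row:  # assumption: stop if the state has no observed successor
            break
        cur = rng.choices(list(row), weights=list(row.values()), k=1)[0]
        path.append(cur)
    return path


history = ["acquisition", "financing", "acquisition",
           "financing", "personnel", "financing"]
matrix = transition_matrix(history)
rng = random.Random(0)  # fixed seed so the sketch is reproducible
paths = [sample_path(matrix, history[-1], horizon=12, rng=rng) for _ in range(100)]
```

Each of the 100 sampled paths would then be scored against the true horizon and the scores averaged, as described above.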

Per Event Type Evaluation
We apply a threshold to convert the softmax output for each event type at time t into a binary value: an output above the threshold is converted to 1, and an output below it to 0. In our experiments, we set the threshold to 0.1. We evaluate classification performance per event type. In particular, we compute precision, recall, and F1 score for each event type to evaluate our model results.
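A minimal sketch of the thresholding step is shown below. The strict comparison (`>`) and the `probs[t][ev]` layout are assumptions; the threshold value 0.1 is the one reported in the experiments.

```python
from typing import List

THRESHOLD = 0.1  # value reported in the experiments


def binarize(probs: List[List[float]], threshold: float = THRESHOLD) -> List[List[int]]:
    """probs[t][ev] -> 1 if the softmax value is above the threshold, else 0."""
    return [[1 if p > threshold else 0 for p in step] for step in probs]


# Two time steps, three hypothetical event types.
probs = [[0.70, 0.25, 0.05], [0.08, 0.12, 0.80]]
binary = binarize(probs)
```

Because the threshold is applied per event type, several event types can be flagged at the same time step, which is what makes per-event-type precision and recall meaningful.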
In addition, since predicting the correct event type within a reasonably close time period also matters in a real business setting, we evaluate our models in two ways: precise evaluation and fuzzy evaluation.
Precise evaluation: we compute the confusion matrix at the precise time t as
$$\mathrm{Precision}_{ev,t}\ (Pr_{ev,t}) = \frac{TP_{ev,t}}{TP_{ev,t} + FP_{ev,t}}$$
$$\mathrm{Recall}_{ev,t}\ (Re_{ev,t}) = \frac{TP_{ev,t}}{TP_{ev,t} + FN_{ev,t}}$$
$$F1_{ev,t} = \frac{2 \cdot Pr_{ev,t} \cdot Re_{ev,t}}{Pr_{ev,t} + Re_{ev,t}}$$
Fuzzy evaluation: because correctly predicting an event type close to the exact time t is also useful in practice, we compute the confusion matrix within the time window $[t-z, t+z]$, with $z \in \{1, 2, \ldots\}$. We update the true positives for precision and recall to $TP_{ev,[t-z,t+z]}$ and re-compute precision, recall, and F1 over $[t-z, t+z]$; for example, the numerator of $Re_{ev,[t-z,t+z]}$ is $TP_{ev,[t-z,t+z]}$.
Precise evaluation is the special case of fuzzy evaluation with z = 0. We report both the precise evaluation and the fuzzy evaluation (z = 1) results in the results section.
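One way to realize both evaluations for a single event type is sketched below. The matching rule is an assumption: a predicted event at time t counts as a true positive for precision if the event truly occurs anywhere in [t-z, t+z], and a true event at time t counts as found for recall if it is predicted anywhere in [t-z, t+z]. The paper's exact bookkeeping for false positives and false negatives is not recoverable from this text.

```python
from typing import List, Tuple


def _hit(series: List[int], t: int, z: int) -> bool:
    """True if `series` has a positive anywhere in [t - z, t + z]."""
    lo, hi = max(0, t - z), min(len(series), t + z + 1)
    return any(series[lo:hi])


def fuzzy_prf(truth: List[int], pred: List[int], z: int = 0) -> Tuple[float, float, float]:
    """Precision, recall, F1 for one event type; z = 0 gives precise evaluation."""
    pred_pos = [t for t, p in enumerate(pred) if p]
    true_pos = [t for t, y in enumerate(truth) if y]
    matched_pred = sum(_hit(truth, t, z) for t in pred_pos)
    matched_true = sum(_hit(pred, t, z) for t in true_pos)
    precision = matched_pred / len(pred_pos) if pred_pos else 0.0
    recall = matched_true / len(true_pos) if true_pos else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


truth = [0, 1, 0, 0, 1, 0]  # the event truly occurs at t = 1 and t = 4
pred = [1, 0, 0, 0, 1, 0]   # the model predicts it at t = 0 and t = 4
precise = fuzzy_prf(truth, pred, z=0)
fuzzy = fuzzy_prf(truth, pred, z=1)
```

In this toy case the prediction at t = 0 misses the true event at t = 1 by one window, so it is penalized under precise evaluation but credited under fuzzy evaluation with z = 1.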

Data
We use 8-K Current Reports filed in the SEC's EDGAR system (the Electronic Data Gathering, Analysis, and Retrieval system) between August 2004 and December 2018 as our data. Our study focuses on Fortune 1,000 companies as our focal companies.
Given time constraints, we use data from only 200 companies in the training set, 45 companies in the validation set, and 45 companies in the test set. We split the dataset by company. In the end, we had 8,400 sequences in training, 2,304 in validation, and 2,090 in testing.

Preprocessing
We extracted the company name, report content, and publication date from each 8-K report in EDGAR. We use Python to download the reports and extract their content. We use spaCy (https://spacy.io/) fuzzy matching to map report content to the event types in Table 1.

Model Training
We train event embeddings in our model. In the experiments, we define time window j to be a month and set the memory size M = 36. We define time window q to be a month and set the forecasting horizon H = 12. Given the nature of event reporting, we use |K| = 2 in our experiments. We use softmax (Goodfellow et al., 2016) as the activation function, Adam (Kingma and Ba, 2014) as the optimizer, and cross entropy (CE) as the loss function.

Discussion
We show our precise model performance in Table 2, fuzzy model performance in Table 3, and model perplexity in Table 4.
First of all, the model performance demonstrates the potential of predicting firm event sequences with a sequence-to-sequence neural network. The performance tables show that framing corporate event sequence prediction as a story completion task is a promising direction, although there is room to keep improving the model. At the same time, we want to point out that the dataset is unbalanced, which follows from the nature of business: some event types happen less frequently than others. For instance, intellectual property activities, litigation and lawsuit, delisting, and bankruptcy happen much less frequently than financial activities, senior personnel change, and information disclosure. Therefore, the prediction results for some event types are not as high as for others.
From the model results, we can see (1) the proposed sequence-to-sequence neural network models perform better than the baseline simulation model, and (2) the attention mechanism is useful on certain event type predictions as well. They both demonstrate the promising direction of the proposed problem formulation.
In both Table 2 and Table 3, we can see that the sequence-to-sequence models perform better than the baseline simulation model. Moreover, in precise evaluation, the model with attention improves prediction performance for event types including senior personnel change, information disclosure, document updates, intellectual property activities, and delisting. In fuzzy evaluation, the model with attention performs better for the business combination and restructuring, document updates, intellectual property activities, litigation and lawsuit, delisting, and bankruptcy event types. These results show the value of the model formulation. Meanwhile, comparing the models with and without attention shows the usefulness of the attention mechanism for certain event types. Table 4 also supports the promise of the model design.
In a real business setting, other data streams, such as company fundamentals, can also contribute to the decision-making process. We are working on integrating multiple data streams as well.

Conclusion and Future Work
In this paper, we proposed sequence-to-sequence (Seq2Seq) models to forecast firm material event sequences, based on firm historical event sequences. The proposed deep learning model demonstrates promising performance and design rationale for the task of predicting firm future event sequences.
However, there is still room to improve our models in the future. We plan to incorporate other data streams and other techniques, such as the variational autoencoder (VAE) (Kingma and Welling, 2013), the Transformer (Vaswani et al., 2017), and/or BERT (Devlin et al., 2018), into the model architecture. We also plan to further investigate the economic implications of our formulation and solution.