Detecting and Characterizing Events

Significant events are characterized by interactions between entities (such as countries, organizations, or individuals) that deviate from typical interaction patterns. Analysts, including historians, political scientists, and journalists, commonly read large quantities of text to construct an accurate picture of when and where an event happened, who was involved, and in what ways. In this paper, we present the Capsule model for analyzing documents to detect and characterize events of potential significance. Specifically, we develop a model based on topic modeling that distinguishes between topics that describe "business as usual" and topics that deviate from these patterns. To demonstrate this model, we analyze a corpus of over two million U.S. State Department cables from the 1970s. We provide an open-source implementation of an inference algorithm for the model and a pipeline for exploring its results.


Introduction
Foreign embassies of the United States government communicate with one another and with the U.S. State Department through diplomatic cables. The National Archives collects these cables in a corpus, which traces the (declassified) diplomatic history of the United States. The corpus contains, for example, over two million cables sent between 1973 and 1978. Most of these cables describe diplomatic "business as usual," such as arrangements for visiting officials,
Figure 1: Capsule's analysis (described in detail in section 5) of two million cables from the National Archives' corpus. The y-axis represents a loose measure of "eventness" (equation (5)). The gray background depicts the number of cables sent over time.
through corpora of documents to find potentially significant events and the primary sources around them. We present Capsule, a probabilistic model for detecting and characterizing important events, such as the fall of Saigon, in large corpora of historical communication, such as diplomatic cables from the 1970s. Figure 1 illustrates Capsule's analysis of two million cables from the National Archives' corpus. The y-axis represents "eventness," a loose measure of how strongly a week's cables deviate from typical diplomatic "business as usual" to discuss some matter that is common to many embassies. (We describe this measure of "eventness" in detail in section 3.) This figure shows that Capsule detects many well-known events between 1973 and 1978, including the fall of Saigon (April 30, 1975) and the death of Mao Tse-tung (September 9, 1976). Capsule also uncovers obscure, but significant, events that have largely escaped the attention of scholars, such as when the U.S. defended its control of the Panama Canal before the United Nations Security Council (March 19, 1973). Capsule therefore provides a new way to detect and characterize historical moments that may be of interest to historians, political scientists, and journalists.
The intuition behind Capsule is this: Embassies write cables throughout the year, usually describing typical diplomatic business, such as visits from government officials. Sometimes, however, important events occur, such as the fall of Saigon, that pull embassies away from their typical activities and lead them to write cables that discuss these events and their consequences. Capsule therefore operationalizes an "event" as a moment in history when multiple embassies deviate from their usual topics of discussion and each embassy deviates in a similar way.
Capsule embeds this intuition into a Bayesian model that uses latent variables to encode what "business as usual" means for each embassy, to characterize the events of each week, and to identify the cables that discuss those events. Given a corpus of cables, the corresponding posterior distribution of the latent variables provides a filter for the cables that isolates important moments in diplomatic history. Figure 1 depicts the mean of this posterior distribution.
We present the Capsule model in section 3, providing both a formal model specification and guidance on how to use the model to detect and characterize real-world events. In section 4, we validate Capsule using simulated data, and in section 5, we use it to analyze over two million U.S. State Department cables. Although we describe Capsule in the context of diplomatic cables, it is suitable for exploring any corpus with the same underlying structure: text (or other discrete multivariate data) generated over time by known entities. This includes email, consumer behavior, social media posts, and opinion articles.

Related Work
We first review previous work on automatic event detection and other related concepts, to contextualize our approach in general and Capsule in particular.
In both univariate and multivariate settings, analysts often want to predict whether or not rare events will occur (Weiss and Hirsh, 1998; Das et al., 2008). In contrast, Capsule is intended to help analysts explore and understand their data; our goal is human interpretability rather than prediction or forecasting. Events can be construed as either anomalies (temporary deviations from usual behavior) or "changepoints" that mark persistent shifts in usual behavior (Guralnik and Srivastava, 1999; Adams and MacKay, 2007). We focus on events as anomalies.
Event detection in the context of news articles (Zhao et al., 2012; Zhao et al., 2007; Zhang et al., 2002; Li et al., 2005; Allan et al., 1998) and social media posts (Atefeh and Khreich, 2015; VanDam, 2012; Lau et al., 2012; Jackoway et al., 2011; Sakaki et al., 2010; Reuter and Cimiano, 2012; Becker et al., 2010; Sayyadi et al., 2009) usually means identifying clusters of documents. For news, the goal is to create new clusters as novel stories appear; each article is assumed to be associated with one event, which does not allow for distinctions between typical content and rare events. For social media, the goal is to identify rare events, but the resultant methods are intended for short documents, and are not appropriate for longer documents that may contain information about a variety of subjects.
Many existing methods for detecting events from text focus on individual vocabulary terms, often weighted by tf-idf values (Fung et al., 2005; Kumaran and Allan, 2004; Brants et al., 2003; Das Sarma et al., 2011; Zhao et al., 2007; Zhao et al., 2012). We characterize events by bursts in groups of terms.
Although groups of terms can be summarized directly (Peng et al., 2007; Chakrabarti and Punera, 2011; Gao et al., 2012), topic models (Blei, 2012) provide a way to automatically identify groups of related terms and reduce the dimensionality of text data. Researchers have previously used topic models to detect events mentioned in social media posts (Lau et al., 2012; Dou et al., 2012) and to find posts relevant to particular, monitored events (VanDam, 2012). Capsule uses topics to characterize both typical diplomatic content and potentially significant events.
In addition to modeling text over time, researchers have also used spatial information (Neill et al., 2005; Mathioudakis et al., 2010; Liu et al., 2011) and information about authors (Zhao et al., 2007) and news outlets to enhance event detection. We rely on author information to characterize diplomatic "business as usual" for each embassy.
Event detection is closely related to detecting and characterizing relationships between entities (Schein et al., 2015; Linderman and Adams, 2014; Das Sarma et al., 2011). Capsule can trivially use sender-receiver pairs instead of authors, and the model specification can be tailored to reflect network structure.
Finally, there are connections between Capsule and recent work on Poisson processes. In particular, we can interpret Capsule as a collection of related discrete-time Poisson processes with random intensity measures. Further, marginalizing out the event strengths (described in section 3.1) reveals that the use of a vocabulary term by one embassy can "excite" the use of that term by another. This suggests a close relationship to Hawkes processes (Hawkes, 1971).

The Capsule Model
In this section, we present the Capsule model for detecting and characterizing significant diplomatic events. We first provide the intuition behind Capsule, and then formally specify the model. We also explain how to use Capsule to explore a corpus and how to learn the posterior distribution of the latent variables.
Consider an entity like the Bangkok embassy, as illustrated in figure 2. We can imagine that this entity sends a stream of diplomatic cables over time: some to the U.S. State Department, others to other American embassies, such as the one in Hong Kong. Embassies usually write cables that describe typical diplomatic business. For example, the Bangkok embassy might write about topics regarding southeast Asia more generally. We can think of a topic as being a probability distribution over vocabulary terms. Now imagine that an event, such as the capture of Saigon during the Vietnam War, occurs during a particular time interval t. We cannot directly observe the occurrence of this event, but we can observe the stream of cables and the event's impact on it. When the event occurs, multiple entities deviate from their usual topics of discussion simultaneously, before returning to their usual behavior, as depicted in figure 2. For example, the day after the capture of Saigon, the majority of the diplomatic cables written by the Bangkok embassy and several other entities were about Vietnam War refugees. If we think of the event as another probability distribution over vocabulary terms, then each entity's stream of cables reflects its typical concerns, as well as any significant events.

Model Specification
We now define the Capsule model. Our data come from entities (e.g., embassies) who send messages (e.g., diplomatic cables) over time; specifically, we observe the number of times $n_{dv}$ that each vocabulary term $v$ occurs in each message $d$. Each message is associated with an author entity $a_d$ and a time interval $t_d$ within which that message was sent.
We model each message with a bank of Poisson distributions, one for each vocabulary term:
$$n_{dv} \sim \text{Poisson}(\lambda_{dv}). \tag{1}$$
The rate $\lambda_{dv}$ blends the different influences on message content. Specifically, it blends three types of topics, intended to capture "business-as-usual" discussion and content related to significant events. We operationalize each topic as a specialized probability distribution over vocabulary terms (the set of unique words in the corpus of messages), as is common in topic models (Blei et al., 2003; Canny, 2004; Gopalan et al., 2014); that is, each term is associated with each topic, but with a different probability. Each message blends 1) general topics $\beta_1, \ldots, \beta_K$ about diplomacy (e.g., terms about diplomats, terms about communication), 2) an entity topic $\eta_{a_d}$ specific to the author of that message (e.g., terms about Hong Kong), and 3) event topics $\gamma_1, \ldots, \gamma_T$ that are specific to the events in recent time intervals (e.g., terms about a coup, terms about the death of a dignitary).
Examples of these three types of topics are in table 1. The general topic relates to planning travel, the entity topic captures words related to the U.S.S.R., and the event topic captures words related to the evacuation of Saigon toward the end of the Vietnam War.
The messages share the three types of topics in different ways: all messages share the general topics, messages written by a single entity share an entity topic, and messages in the same time interval use the event topics in similar ways. Each message blends its corresponding topics with a set of message-specific strengths. As a result, each message captures a different mix of general diplomacy discussion, entity-specific terms, and recent events. Specifically, the Poisson rate for vocabulary term $v$ in message $d$ is
$$\lambda_{dv} = \sum_{k=1}^{K} \theta_{dk} \beta_{kv} + \zeta_d \eta_{a_d v} + \sum_{t=1}^{T} f(t_d, t)\, \epsilon_{dt} \gamma_{tv}, \tag{2}$$
where $\theta_{dk}$ is message $d$'s strength for general topic $k$, $\zeta_d$ is message $d$'s strength for $a_d$'s entity topic, and $\epsilon_{dt}$ is message $d$'s strength for event topic $t$. The function $f(\cdot)$ ensures that the events' influences decay over time. As we describe in appendix B, we compared several different decay functions (exponential, linear, and step) and found that the following exponential decay function works well in practice:
$$f(t_d, t) = \begin{cases} 0 & t_d < t \text{ or } t_d \geq t + \tau, \\ \exp\left\{ -\dfrac{t_d - t}{\tau / 5} \right\} & \text{otherwise.} \end{cases} \tag{3}$$
Dividing $\tau$ by five means that we can interpret $\tau$ as the number of time intervals after which an event will have little impact on the content of the messages. We place hierarchical gamma priors over the message-specific strengths, introducing entity-specific strengths $\phi_1, \ldots, \phi_A$ and $\xi_1, \ldots, \xi_A$ that allow different entities to focus on different topics and event strengths $\psi_1, \ldots, \psi_T$ that allow different time intervals to be more or less "eventful." We place Dirichlet priors over the topics. The graphical model is in figure 3 and the generative process is in figure 4.
Given a corpus of messages, learning the posterior distribution of the latent variables uncovers the three types of topics, the message- and entity-specific strengths, and the event strengths. In section 3.3, we explain how an analyst can use the event strengths as a filter that isolates potentially significant messages.
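The way the three kinds of topics combine into a single Poisson rate can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released implementation: the function and argument names are hypothetical, each topic is a plain list of per-term probabilities, and the decay is assumed to be an exponential that is zero outside a window of `tau` intervals starting at the event.

```python
import math


def decay(t_d, t, tau=4):
    """Assumed exponential decay: zero outside [t, t + tau), else exp(-(t_d - t) / (tau / 5))."""
    if t_d < t or t_d >= t + tau:
        return 0.0
    return math.exp(-(t_d - t) / (tau / 5.0))


def poisson_rate(v, t_d, theta_d, beta, zeta_d, eta_a, eps_d, gamma, tau=4):
    """Rate for term v in one message sent during interval t_d.

    theta_d[k]: message strength for general topic k; beta[k][v]: general topics
    zeta_d:     message strength for the author's entity topic eta_a[v]
    eps_d[t]:   message strength for event topic t; gamma[t][v]: event topics
    """
    general = sum(theta_d[k] * beta[k][v] for k in range(len(beta)))
    entity = zeta_d * eta_a[v]
    # Only events inside the decay window contribute to the rate.
    event = sum(decay(t_d, t, tau) * eps_d[t] * gamma[t][v] for t in range(len(gamma)))
    return general + entity + event
```

With one general topic, one entity topic, and two event topics, a message sent in interval 0 picks up the interval-0 event at full strength and ignores the interval-1 event, because that event has not happened yet.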

Learning the Posterior Distribution
In order to use Capsule to explore a corpus of messages, we must first learn the posterior distribution of the latent variables (the general topics, the entity topics, the event topics, the message- and entity-specific strengths, and the event strengths) conditioned on the observed term counts. As for many Bayesian models, this posterior distribution is not tractable to compute; approximating it is therefore our central statistical and computational problem. We introduce an approximate inference algorithm for Capsule, based on variational methods (Jordan et al., 1999), which we outline in appendix A. 5 This algorithm produces a fitted variational distribution, which can then be used as a proxy for the true posterior distribution.

Detecting and Characterizing Events
We can use the mean of the fitted variational distribution to explore the data. Specifically, we can explore "business-as-usual" content using the posterior expected values of the general topics $\beta_1, \ldots, \beta_K$ and the entity topics $\eta_1, \ldots, \eta_A$, and we can detect and characterize events using the posterior expected values of the event strengths and the event topics.
To detect events, we define a measure that quantifies the "eventness" of time interval $t$. Specifically, we first compute how relevant each message $d$ is to that time interval: $m_{dt} = f(t_d, t)\, \mathbb{E}[\epsilon_{dt}]$. Using these relevancy values, we then compute the proportion of each message's term counts that are associated with the event topic specific to time interval $t$:
$$p_{dt} = \frac{m_{dt}}{\sum_{k=1}^{K} \mathbb{E}[\theta_{dk}] + \mathbb{E}[\zeta_d] + \sum_{t'=1}^{T} m_{dt'}}. \tag{4}$$
Finally, we aggregate these values over messages:
$$\frac{1}{\sum_{d=1}^{D} f(t_d, t)} \sum_{d=1}^{D} p_{dt}, \tag{5}$$
where the multiplicative fraction ensures that messages that were sent during time intervals that are further from $t$ contribute less than messages that were sent during time intervals that are closer to $t$. We can characterize an event $t$ by selecting the highest-probability vocabulary terms from $\mathbb{E}[\gamma_t]$. By ordering the messages according to $m_{dt} = f(t_d, t)\, \mathbb{E}[\epsilon_{dt}]$, we can also identify the messages that are most strongly associated with event $t$.
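The "eventness" computation can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes that a message's event proportion is its decayed event strength divided by the sum of all of its expected strengths, and that the aggregate is normalized by the summed decay weights. The dictionary keys (`t_d`, `theta`, `zeta`, `eps`) and the function names are hypothetical, and the values stand in for posterior expectations.

```python
import math


def decay(t_d, t, tau=4):
    """Assumed exponential decay, zero outside the window [t, t + tau)."""
    if t_d < t or t_d >= t + tau:
        return 0.0
    return math.exp(-(t_d - t) / (tau / 5.0))


def relevance(msg, t, tau=4):
    """m_dt: decayed expected strength of message msg for event topic t."""
    return decay(msg["t_d"], t, tau) * msg["eps"][t]


def eventness(t, messages, tau=4):
    """Aggregate, over messages, the share of each message explained by event t."""
    total, norm = 0.0, 0.0
    for msg in messages:
        m_all = [relevance(msg, s, tau) for s in range(len(msg["eps"]))]
        denom = sum(msg["theta"]) + msg["zeta"] + sum(m_all)
        total += m_all[t] / denom if denom > 0 else 0.0
        norm += decay(msg["t_d"], t, tau)
    return total / norm if norm > 0 else 0.0


def most_relevant(t, messages, tau=4):
    """Indices of messages ordered from most to least relevant to event t."""
    return sorted(range(len(messages)), key=lambda i: -relevance(messages[i], t, tau))
```

Ranking intervals by `eventness` gives candidate events, and `most_relevant` then surfaces the messages an analyst would read first for a chosen interval.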
In section 5, we explore the cables associated with significant events in the National Archives' corpus of diplomatic cables. To make Capsule more accessible for historians, political scientists, and journalists, we have released an open-source tool for visualizing its results. 6 This tool allows analysts to browse a corpus of messages and the mean of the corresponding posterior distribution, including general topics, entity topics, and event topics. Figure 5 contains several screenshots of the tool's browsing interface.
5 Appendices are in the supplemental material.
6 Source code: https://github.com/ajbc/capsule-viz; demo: http://www.princeton.edu/~achaney/capsule/.

Model Validation with Simulated Data
Before using Capsule to explore a corpus of real messages (described in section 5), we provide a quantitative validation of the model using simulated data.
We used the generative process in figure 4 to create ten data sets, each with 100 time intervals, ten general topics, ten entities, and roughly 20,000 messages. We then used these data sets to compare Capsule's event detection performance to that of four baseline methods. We also compared the methods' abilities to identify the most relevant messages for each event.

Detecting Events
For each data set, we ordered the time intervals from most to least eventful, using the "eventness" measure described in section 3.3 and the simulated values of the latent variables. We then treated these ranked lists of time intervals as "ground truth" and assessed how well each method was able to recover them.
For Capsule itself, we used our approximate inference algorithm to obtain a fitted variational distribution for each simulated data set. We then ordered the time intervals using our "eventness" measure and the posterior expected values of the latent variables.
For our first baseline, we constructed an "event-only" version of Capsule by dropping the first and second terms in equation (2). We used this baseline to test whether modeling "business as usual" discussion makes it easier to detect significant events. We obtained a fitted variational distribution for this model using a variant of our approximate inference algorithm, and then ordered the time intervals using our "eventness" measure, modified appropriately, and the posterior expected values of the latent variables.
For our second baseline, we drew inspiration from previous work on event detection in the context of news articles, and focused on each time interval's deviation in term counts from the average. Specifically, we ordered the time intervals $1, \ldots, T$ for each simulated data set according to this measure:
$$\sum_{v=1}^{V} \left| \sum_{d:\, t_d = t} n_{dv} - \frac{1}{T} \sum_{d=1}^{D} n_{dv} \right|. \tag{6}$$
We added tf-idf term weights for our third baseline:
$$\sum_{v=1}^{V} \text{tf-idf}(v) \left| \sum_{d:\, t_d = t} n_{dv} - \frac{1}{T} \sum_{d=1}^{D} n_{dv} \right|. \tag{7}$$
Finally, we randomly ordered the time intervals for each data set to serve as a straw-man baseline.
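A term-count deviation baseline of this kind can be sketched as below. The sketch assumes the deviation is the summed absolute difference between an interval's term counts and the per-interval average, optionally scaled by a per-term weight such as tf-idf; the function names and the input layout (`counts_by_interval[t][v]`) are hypothetical.

```python
def term_count_deviation(counts_by_interval, weights=None):
    """Score each interval by how far its term counts sit from the average interval.

    counts_by_interval[t][v]: total count of term v over messages sent in interval t
    weights[v]:               optional per-term weight (for example, a tf-idf value)
    """
    num_intervals = len(counts_by_interval)
    num_terms = len(counts_by_interval[0])
    if weights is None:
        weights = [1.0] * num_terms
    # Average count of each term per interval.
    mean = [sum(row[v] for row in counts_by_interval) / num_intervals
            for v in range(num_terms)]
    return [sum(weights[v] * abs(row[v] - mean[v]) for v in range(num_terms))
            for row in counts_by_interval]


def rank_intervals(scores):
    """Interval indices ordered from highest to lowest score (ties keep index order)."""
    return sorted(range(len(scores)), key=lambda t: -scores[t])
```

Passing `weights=None` corresponds to the unweighted baseline, while supplying per-term weights gives the tf-idf variant.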
We also experimented with baselines that involved term-count deviations on the entity level and topic-usage deviations on the message level (Dou et al., 2012), but found that they were not competitive.
For each data set, we compared each method's ranked list of time intervals to the corresponding "ground-truth" list of time intervals, by dividing the sum of the lists' actual set overlap at each rank by the sum of their maximum set overlap at each rank:
$$\frac{\sum_{r=1}^{T} \left| S_r^{\text{truth}} \cap S_r^{\text{method}} \right|}{\sum_{r=1}^{T} r}, \tag{8}$$
where $S_r^{\text{truth}}$ is the set of the top $r$ time intervals according to the "ground-truth" list and $S_r^{\text{method}}$ is the set of the top $r$ time intervals according to the method. Figure 6 shows that Capsule outperforms all four baseline methods. These results serve as a sanity check for both the model and its implementation.
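The overlap-based comparison of two ranked lists can be sketched directly from its description: at each rank r, count how many of the top r items the two lists share, then divide the running total by the largest total possible (the maximum overlap at rank r is r). The function name is hypothetical, and the sketch assumes both lists rank the same set of intervals.

```python
def ranked_overlap(truth, method):
    """Sum of top-r set overlaps divided by the sum of the maximum overlaps (1 + 2 + ... + T)."""
    if len(truth) != len(method):
        raise ValueError("both ranked lists must have the same length")
    actual = 0
    maximum = 0
    for r in range(1, len(truth) + 1):
        # Overlap between the top-r prefixes, ignoring order within the prefix.
        actual += len(set(truth[:r]) & set(method[:r]))
        maximum += r
    return actual / maximum
```

Identical lists score 1.0, and a fully reversed three-item list scores 4/6 because the prefixes only begin to agree at rank two.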

Identifying Relevant Messages
For each data set, we created a list of the most relevant messages for each time interval $t$ by computing $f(t_d, t)\, \epsilon_{dt}$ for each message $d$ (using the simulated values of $\epsilon_{dt}$) and ordering the messages accordingly. We then treated these ranked lists of messages as "ground truth" and assessed how well Capsule and the baseline methods were able to recover them.
Figure 6: Event detection performance using ten simulated data sets. Each dot represents the performance (equation (8); higher is better) of a single method on a single data set; each shaded green area summarizes the distribution of performance for a single method. Capsule outperforms all four baseline methods.
For Capsule, we used our approximate inference algorithm to obtain a fitted variational distribution for each data set, and then, for each time interval, ordered the messages according to $m_{dt} = f(t_d, t)\, \mathbb{E}[\epsilon_{dt}]$. For our second and third baselines, we ordered the messages sent during each time interval according to message-specific versions of equations (6) and (7).
For each data set, we compared each method's ranked list of messages for each time interval to the corresponding "ground-truth" list, by computing precision at ten messages. The average precision for Capsule was 0.44, while the average precision for the "event-only" version of the model was 0.09. The other baselines recovered zero relevant messages.
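Precision at ten can be sketched as follows. The text does not spell out which messages count as relevant, so this sketch assumes that a message is relevant when it appears among the top k of the "ground-truth" list; the function name is hypothetical.

```python
def precision_at_k(truth_ranked, method_ranked, k=10):
    """Fraction of a method's top-k messages that also appear in the ground-truth top k."""
    relevant = set(truth_ranked[:k])
    hits = sum(1 for message_id in method_ranked[:k] if message_id in relevant)
    return hits / k
```

For example, if four of a method's ten highest-ranked messages are among the ten most relevant simulated messages, the score is 0.4.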

Exploratory Analysis
Capsule is intended to help analysts explore and understand their data. In this section, we demonstrate its capabilities by analyzing a corpus of over two million U.S. State Department cables from the 1970s.

Data
The National Archives collects diplomatic cables sent between the U.S. State Department and its foreign embassies. We obtained a subset of this corpus from the Central Foreign Policy Files at the National Archives, via the History Lab at Columbia University; 7 the subset contains cables sent between 1973 and 1978. In addition to the text of the cables, each message is labeled with its author (e.g., the U.S. State Department, a particular embassy, or an individual), the date the cable was sent, and other metadata. We used a vocabulary of 6,293 terms and omitted cables with fewer than three terms, resulting in 2,021,852 cables sent by 22,961 entities. We used weekly time intervals, as few cables were sent on weekends.
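The two preprocessing choices described here (weekly intervals and a minimum cable length) can be sketched as below. Both details are assumptions for illustration: the week origin of January 1, 1973 is hypothetical, and "fewer than three terms" is read as fewer than three in-vocabulary term occurrences, which may differ from the authors' exact rule.

```python
from datetime import date


def week_index(sent, start=date(1973, 1, 1)):
    """Zero-based weekly time interval containing `sent`, counted from an assumed origin."""
    return (sent - start).days // 7


def keep_cable(term_counts, min_terms=3):
    """Keep a cable only if it has at least `min_terms` in-vocabulary term occurrences."""
    return sum(term_counts.values()) >= min_terms
```

Under these assumptions, a cable sent on April 30, 1975 falls in weekly interval 121.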

Model Settings
We ran our approximate inference algorithm for Capsule to obtain a fitted variational distribution. We used $K = 100$ general topics, the exponential decay function in equation (3) with $\tau = 4$, and top-level hyperparameters $s = r = 0.3$. With these settings, a single iteration of the algorithm took about an hour. 8

Detecting Well-Known Events
To evaluate Capsule's ability to detect well-known events, we used a list, provided to us by the History Lab, of thirty-nine well-known events that took place between 1973 and 1978. Each event is present in at least one of six reputable collections of historic events, such as the Office of the Historian's Milestones in the History of U.S. Foreign Relations. 9 We treated this list of events as "ground truth" and assessed how well Capsule and each of the baselines described in section 4.1 were able to recover them; in other words, we assessed how well the methods identify these eventful weeks, compared to more typical weeks.
Specifically, we used each method to construct a ranked list of time intervals. Then, for each method, we computed the discounted cumulative gain (DCG), which, in this context, is equivalent to computing
$$\sum_{e=1}^{39} \frac{1}{\log\left( \text{rank}\left(e; L_T^{\text{method}}\right) \right)}, \tag{9}$$
where $L_T^{\text{method}}$ is the method's ranked list of time intervals and $\text{rank}\left(e; L_T^{\text{method}}\right)$ is the rank of the $e$th well-known event in $L_T^{\text{method}}$. Finally, we divided the DCG by the ideal DCG, i.e., $\sum_{e=1}^{39} \frac{1}{\log(e)}$, to obtain the normalized DCG (nDCG). Table 2 shows that Capsule outperforms all four baseline methods.
Table 2
Method                                          nDCG
Capsule (this paper)                            0.693
term-count deviation + tf-idf (equation (7))    0.652
term-count deviation (equation (6))             0.642
random                                          0.557
"event-only" Capsule (this paper)               0.426
7 http://history-lab.org
8 Each iteration of our algorithm considers all messages. Modifying it to stochastically sample the data would reduce the time required to obtain an equivalent fitted variational distribution.
9 https://history.state.gov/milestones/1969-1976
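A normalized DCG with binary relevance can be sketched as below. Note the assumption: this sketch uses the common log2(1 + rank) discount so that an event ranked first has a finite gain, which may not match the exact discount used for the reported numbers. The function name and input format (one-based ranks of the known events in a method's list) are hypothetical.

```python
import math


def ndcg_binary(event_ranks):
    """nDCG for binary relevance, given the one-based rank of each known event.

    Assumes the standard log2(1 + rank) discount; the ideal ordering places
    the known events at ranks 1, 2, ..., len(event_ranks).
    """
    dcg = sum(1.0 / math.log2(1 + rank) for rank in event_ranks)
    ideal = sum(1.0 / math.log2(1 + e) for e in range(1, len(event_ranks) + 1))
    return dcg / ideal
```

A method that places every known event at the top of its list scores 1.0, and the score falls as known events move down the ranking.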

Exploration
We now turn to our primary goal: using Capsule to explore and understand a corpus of messages. Figure 1 shows our "eventness" measure (equation (5)) over time. One of the tallest peaks occurs during the week of December 1, 1975, when the United Nations General Assembly discussed omnibus decolonization. As described in section 3.3, we can characterize this event by computing $m_{dt} = f(t_d, t)\, \mathbb{E}[\epsilon_{dt}]$ for each message $d$ and then ordering the messages accordingly. Table 3 lists the highest-ranked messages.
Another notable event was the seizure of the S.S. Mayaguez, an American merchant vessel, during May, 1975, at the end of the Vietnam War. Table 4 lists the highest-ranked messages for this event. We can examine these messages to confirm their relevancy and learn more about the event. For example, here is the content of the most relevant message:
In absence of MFA Chief of Eighth Department Avramov, I informed American desk officer Yankov of circumstances surrounding seizure and recovery of merchant ship Mayaguez and its crew. Yankov promised to inform the Foreign Minister of US statement today (May 15). Batjer
A third week of interest occurs in early July, 1976. On July 4, the U.S. celebrated its Bicentennial, but on the same day, Israeli forces completed a hostage rescue mission because an Air France flight from Tel Aviv had been hijacked and taken to Entebbe, Uganda. 10 This event was mostly discussed the week after the event took place; the most relevant messages are listed in appendix B (table 5). The cable from Stockholm describing the "Ugandan role in Air France hijacking" begins with the following content, which reveals further information about this event:
1. We provided MFA Director of Political Affairs Leifland with Evidence of Ugandan assistance to hijackers contained in Ref A.
After reading material, Leifland described it a "quite good", and said it would be helpful for meeting MFA has scheduled for early this morning to determine position GOS will take at July 8 UNSC consideration of Israeli Rescue Operation. ...
In addition to detecting and characterizing well-known events, such as the S.S. Mayaguez incident and Operation Entebbe, Capsule can detect and characterize obscure, but significant, events, such as when Eritrean rebels kidnapped Tenneco oil employees (April 8, 1974) and when the U.S. Navy evacuated citizens from Lebanon ("Operation Fluid Drive," June 20, 1976). Both events appear in figure 1. Capsule uncovers events where analysts might not otherwise look.
Capsule also provides a way to explore "business-as-usual" discussion using the posterior expected values of the general topics $\beta_1, \ldots, \beta_K$ and the entity topics $\eta_1, \ldots, \eta_A$. Examples of each of these types of topics are in appendix B (tables 6 and 7, respectively); these examples illustrate that, as desired, the entity topics absorb location-specific terms, preventing them from overwhelming the general topics.
10 Capsule assumes that only one event occurs during each time interval. This example is a clear violation of this assumption, but also serves to demonstrate that Capsule can successfully detect and characterize multiple events, even when they overlap.

Conclusion
We presented Capsule, a Bayesian model for detecting and characterizing potentially significant events. We evaluated Capsule's ability to detect events and identify relevant messages; it outperformed four baseline methods. We used Capsule to analyze a large corpus of U.S. State Department cables from the 1970s, demonstrating that it can discover both well-known and obscure (but significant) events, as well as relevant documents. We anticipate that Capsule, and our visualization tool, will be useful for historians, political scientists, and journalists who wish to explore and understand large corpora of documents. This is increasingly important: the U.S. State Department alone produces around two billion e-mails annually.