Using Topic Modeling and Similarity Thresholds to Detect Events

This paper presents a Retrospective Event Detection algorithm, called Eventy-Topic Detection (ETD), which automatically generates topics that describe events in a large, temporal text corpus. Our approach leverages the structure of the topic modeling framework, specifically Latent Dirichlet Allocation (LDA), to generate topics that are later labeled as Eventy-Topics or non-Eventy-Topics. The system first runs daily LDA topic models, then calculates the cosine similarity between the topics of the daily topic models, and then runs our novel Bump Detection algorithm. Similar topics labeled as Eventy-Topics are then grouped together. The algorithm is demonstrated on two terabyte-sized corpora: a Reuters News corpus and a Twitter corpus. Our method is evaluated on a human-annotated test set, on which it accurately describes and labels events in a temporal text corpus.


Introduction
A vast amount of research has been devoted to organizing, searching, indexing, browsing, and understanding the immense number of electronic documents. Topic models have emerged as a powerful technique to discover patterns of words that reflect the underlying topics that are combined to form documents. Latent Dirichlet Allocation (Blei et al., 2003) defines topics as multinomial distributions over words, and documents as multinomial distributions over these topics. LDA uses Dirichlet priors for both the document-topic and topic-word distributions.
Topic Detection and Tracking (TDT) is an area of research that was prominent in the 1990s (Allan et al., 1998). The goal of TDT is to detect the appearance of new topics and track their evolution over time. Specifically relevant to our paper is the task of Retrospective Event Detection, defined as the task of identifying all events in a corpus of stories.
In our Eventy-Topic Detection (ETD) algorithm we leverage the powerful structure of topic models for the Retrospective Event Detection task. In particular, we develop an algorithm that is capable of identifying Eventy-Topics in a sequentially ordered, massive 'Big Data' sized corpus. We define an Eventy-Topic to be a topic that solely describes a specific, time-sensitive news event. A topic that is consistently and persistently in the news is not an Eventy-Topic.
We run daily LDA topic models, then calculate the cosine similarities between the topics in all the models. In these cosine similarity graphs, Eventy-Topics show a noticeable spike around the date of the event. To detect these spikes, we smooth the cosine similarity values so that the bump has a monotonically increasing section, followed by a plateau, followed by a monotonically decreasing section. We then run a novel algorithm called Bump Detection that searches for these properties.
Given a time-stamped corpus, our goal is to automatically detect and describe all of these Eventy-Topics. Our algorithm is capable of detecting one-time (uni-modal) Eventy-Topics, such as "Robin Williams Death", as well as multi-time (multi-modal) related Eventy-Topics, such as "The Masters Golf Tournament".

Related Work
Multiple works have studied the topics of temporal corpora. Topics over Time (Wang and McCallum, 2006) incorporates time directly into the generative topic model. A timestamp is drawn from a beta distribution for every word in the corpus. One limitation of this method is the restrictiveness of the beta distribution. The presence of a topic in a corpus can be multi-modal, which conflicts with the beta distribution. In contrast, our work does not assume that the presence of an event in a corpus is unimodal.
Dynamic topic models (Blei and Lafferty, 2006) capture the evolution of topics in a time-stamped corpus. The approach fits a static topic model in each time slice and models how the prior parameters change over time, given a logistic normal prior. The motivation for dynamic topic models is to track the evolution of topics, not to detect emerging topics that correspond to events.
Retrospective New Event Detection research utilizes metrics such as cosine similarity, Hellinger similarity, and KL divergence to determine how similar documents are (Dou et al., 2012). On-line LDA (AlSumait et al., 2008) incorporates topic detection into its algorithm by calculating the KL divergence of evolving topics at adjacent time periods. If the calculated KL divergence exceeds a historical percentile threshold, then the topic is flagged as an emerging, new topic. Our work is similar in spirit, but we use difference measures against all previous topics as opposed to just adjacent ones.
There has been success modeling the burstiness of phrases in the news cycle (Leskovec et al., 2008). Static LDA topic models have had their topics labeled as hot and cold based on the mean document-topic mixtures in different time segments (Griffiths and Steyvers, 2004).
TimeMines (Swan and Jensen, 200) is a three-step TDT system that first creates noun phrases for features, then finds significant features using a 2x2 contingency table and a χ² test, then groups significant features together by testing for dependence. These groups of noun phrases form the description of the emerging topic.
The Group-Topic model (Wang et al., 2005) slices a 15-year U.N. text corpus into yearly slices, then runs a topic-relation model and later compares the trends of topics.
Multiscale Topic Tomography (Nallapati et al., 2007) uses conjugate priors on the topic parameters to model the evolution of topics (similar to DTM, but with conjugate priors). The authors present a tree-like hierarchy of topics, where topics can be zoomed in on at different time periods and topic trends can be analyzed.
Multi-Modal Retrospective News Event Detection (Li et al., 2005) is an extensive generative model that incorporates content, time, persons, and location. One challenge of this model is that the number of events to generate must be supplied as input, just as in a clustering application.

Training Corpus
Our Eventy-Topic Detection algorithm is demonstrated on a 525-day, 350,000-story Reuters News corpus and a 200-day, 2-billion-tweet Twitter corpus. This comes out to an average of about 6,200 stories per 10-day stretch and 10 million tweets per day, respectively. The computation is run on a 30-node Hadoop cluster.

Daily Topic Modeling
LDA topic modeling is run daily on the sequential text corpus. Topic modeling is done with our own implementation of LDA, which uses efficient Gibbs sampling (Yao et al., 2009) and is similar to the algorithm used in Mallet (McCallum, 2002). The text input for each LDA model is the text from the date of interest together with a fixed number, N, of days before it. For the Reuters news corpus N = 9, so a total of 10 days is used in the training of each topic model. For the Twitter corpus N = 0, so only that exact day is used as input. N is chosen based on several factors, including a maximum input of 6GB for each training run and the need for enough text to derive meaningful, consistent topics. Character unigrams are used as features for the Reuters news corpus, and alphabetic unigrams as well as hashtags are used as features for the Twitter corpus. The models for each of the daily training runs are then serialized.
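The windowing rule above can be illustrated with a minimal sketch. The function names and the (date, text) document representation are assumptions made for illustration; they are not taken from our implementation.

```python
from datetime import date, timedelta

def training_window(day, n_days_back):
    """Dates feeding the model for `day`: the day itself plus the N days before it."""
    return [day - timedelta(days=k) for k in range(n_days_back, -1, -1)]

def documents_for_model(docs, day, n_days_back):
    """Select the texts whose date falls inside the training window.

    `docs` is assumed to be an iterable of (date, text) pairs.
    """
    window = set(training_window(day, n_days_back))
    return [text for (d, text) in docs if d in window]

# Reuters setting: N = 9 gives a 10-day window ending on the date of interest.
reuters_window = training_window(date(2013, 4, 16), 9)
# Twitter setting: N = 0 uses only the date of interest.
twitter_window = training_window(date(2013, 4, 16), 0)
```

With N = 9 each story falls inside the windows of 10 consecutive daily models, which is why the same event later appears in several neighboring models.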

Similarity Measures
There are D serialized topic models (one for each day), with each topic model having K topics. Thus there are D × K total serialized topics, where each topic is represented as a multinomial distribution over words. For each of these topics, the cosine similarity is calculated between that topic and each of the (D − 1) × K topics from the other days. Thus, there are a total of D × K × (D − 1) × K cosine similarity calculations. The symmetric KL divergence can also be calculated for these pairs. The rest of the methodology describes only the use of cosine similarity; however, it can easily be modified to use the symmetric KL divergence.
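As an illustration of the symmetric KL divergence mentioned above, the following sketch computes KL(p‖q) + KL(q‖p) for two topics stored as word-to-probability dictionaries. The dictionary representation, the function name, and the small additive constant `eps` (used to avoid taking the logarithm of zero for words missing from one topic) are assumptions of this sketch.

```python
import math

def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence KL(p||q) + KL(q||p) between two word distributions.

    `p` and `q` map words to probabilities. `eps` is an assumed smoothing
    constant so that words absent from one topic do not produce log(0).
    """
    vocab = set(p) | set(q)

    def smooth(dist):
        raw = {w: dist.get(w, 0.0) + eps for w in vocab}
        total = sum(raw.values())
        return {w: v / total for w, v in raw.items()}

    ps, qs = smooth(p), smooth(q)
    kl_pq = sum(ps[w] * math.log(ps[w] / qs[w]) for w in vocab)
    kl_qp = sum(qs[w] * math.log(qs[w] / ps[w]) for w in vocab)
    return kl_pq + kl_qp
```

Note that a divergence is low for similar topics, whereas cosine similarity is high, so any step that selects the highest cosine similarity would select the lowest divergence instead.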
For each topic (date1:topic1), the topic with the highest cosine similarity score from each of the other D − 1 daily topic models is saved (date2:topic2). This creates a mapping table, date1:topic1_date2:topic2 → cosineSimilarity, where date1:topic1 and date2:topic2 are concatenated as the key, and the value is the cosine similarity. An example of this mapping can be seen in Table 1. The algorithm is outlined in Algorithm 1.
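A minimal in-memory sketch of this best-match mapping is given below. It is not a reproduction of Algorithm 1: the dictionary-based topic representation, the function names, and the three-digit topic identifiers are assumptions, and each daily model is assumed to contain at least one topic.

```python
import math

def cosine(p, q):
    """Cosine similarity between two topics stored as word -> probability dicts."""
    dot = sum(w * q.get(term, 0.0) for term, w in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def best_match_table(models):
    """Build the date1:topic1_date2:topic2 -> cosine similarity mapping.

    `models` maps a date string to that day's list of topics. For every topic,
    only the most similar topic from each other day is kept.
    """
    table = {}
    for d1, topics1 in models.items():
        for k1, t1 in enumerate(topics1):
            for d2, topics2 in models.items():
                if d2 == d1:
                    continue  # topics are only compared across different days
                k2, sim = max(
                    ((k, cosine(t1, t2)) for k, t2 in enumerate(topics2)),
                    key=lambda pair: pair[1],
                )
                table[f"{d1}:{k1:03d}_{d2}:{k2:03d}"] = sim
    return table

toy_models = {
    "20130415": [{"bomb": 0.7, "boston": 0.3}, {"bond": 0.6, "yield": 0.4}],
    "20130416": [{"bond": 0.5, "yield": 0.5}, {"bomb": 0.6, "boston": 0.4}],
}
toy_table = best_match_table(toy_models)
```

The table holds D × K × (D − 1) entries, one per topic and other day, even though D × K × (D − 1) × K similarities are computed to find them.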

Smoothing
The cosine similarity values are then smoothed using Loess smoothing (Cleveland and Loader, 1996). Figures 1-4 show the cosine similarity graphs before and after smoothing. The bumps that are present in Figures 2(a) and 4(a) do not contain monotonically increasing sections followed by monotonically decreasing sections. The main parameter in Loess smoothing, α, determines the percentage of nearest points used in the weighted regressions. Smoothing is done for α = .02, .03, .04, .05, .10 on (x, y) pairs grouped by date1:topic1 in the mapping table. The date2 day index is the x-value, and the cosine similarity is the y-value. The α values that we use in Eventy-Topic Detection are significantly lower than the usual .25 to .5 range. This is done to accommodate the sharp, unusual bumps that are found for Eventy-Topics in the cosine similarity pair graphs. The larger the α, the smoother the graph becomes and the less pronounced the bump. These small α values ensure a pronounced bump in Eventy-Topics as well as monotonically increasing and decreasing sections.
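To make the role of α concrete, the following is a simplified, non-robust Loess sketch: for each x-value it fits a weighted straight line to the nearest α-fraction of points using tricube weights. It is an illustrative stand-in written for this section, not the implementation used in our experiments, and it assumes distinct x-values (day indices).

```python
import math

def loess(xs, ys, alpha):
    """Local linear regression with tricube weights.

    `alpha` is the fraction of points used in each local fit; at least two
    points are always used. Returns one fitted value per input x.
    """
    n = len(xs)
    k = min(n, max(2, math.ceil(alpha * n)))
    fitted = []
    for x0 in xs:
        # Bandwidth: distance to the k-th nearest point (fallback avoids division by zero).
        h = sorted(abs(x - x0) for x in xs)[k - 1] or 1.0
        w = [(1 - min(1.0, abs(x - x0) / h) ** 3) ** 3 for x in xs]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0  # degenerate window: weighted mean
        fitted.append(my + slope * (x0 - mx))
    return fitted

# One smoothed series per alpha, as in the text.
alphas = [0.02, 0.03, 0.04, 0.05, 0.10]
```

For a 525-day series, α = .02 corresponds to local fits over roughly 11 neighboring days, which is why sharp bumps survive the smoothing.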

Bump Detection
We created a detection method to identify Eventy-Topics out of the D × K collection of topics. We posit that a topic is an Eventy-Topic if its cosine similarity graph contains a definite bump, and a Non-Eventy-Topic otherwise. After smoothing, the bumps display a monotonically increasing period followed by a monotonically decreasing period. To automatically detect these localized, relatively high cosine similarity bumps we use a novel algorithm called Bump Detection. This algorithm is outlined in Algorithm 2. Bump Detection is run on each of the five sets of smoothed cosine similarity values (α = .02, .03, .04, .05, .10). It uses the following variables and parameters:
• coldLevel - the level that all non-bump cosine similarity values must be below
• hotLevel - the level that all cosine similarity values in the bump plateau must be above
• maxRiseTime - the maximum time it takes to get from coldLevel to hotLevel
• maxFallTime - the maximum time it takes to get from hotLevel back to coldLevel
• minHot - the minimum number of cosine similarity values above the hotLevel
• maxHot - the maximum number of cosine similarity values above the hotLevel
• minHotColdDiffThresh - the value that (hotLevel − coldLevel) must exceed in order for the topic to be labeled an 'Eventy-Topic'
The hot cosine similarity values must be continuously above the hot threshold. The cold cosine similarity values must be contiguous to the left of the rise values and to the right of the fall values, respectively. The minHotColdDiffThresh is the key parameter, used to select only graphs that contain large bumps.
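The constraints listed above can be checked with a sketch such as the following. It is not Algorithm 2: it assumes that candidate coldLevel and hotLevel values are already given (how they are searched for is not shown here), it measures rise and fall time as a number of data points, and it does not verify monotonicity, which is assumed to come from the smoothing step.

```python
def is_bump(y, cold_level, hot_level, max_rise_time, max_fall_time,
            min_hot, max_hot, min_hot_cold_diff_thresh):
    """Check whether a smoothed similarity series `y` contains a single bump."""
    if hot_level - cold_level <= min_hot_cold_diff_thresh:
        return False
    hot = [i for i, v in enumerate(y) if v > hot_level]
    if not hot or not (min_hot <= len(hot) <= max_hot):
        return False
    first, last = hot[0], hot[-1]
    if last - first + 1 != len(hot):
        return False  # hot values must form one contiguous run
    # Walk left through the rise until the series drops below coldLevel.
    left = first - 1
    while left >= 0 and y[left] >= cold_level:
        left -= 1
    rise_time = first - 1 - left
    # Walk right through the fall until the series drops below coldLevel.
    right = last + 1
    while right < len(y) and y[right] >= cold_level:
        right += 1
    fall_time = right - last - 1
    if rise_time > max_rise_time or fall_time > max_fall_time:
        return False
    # Everything outside rise, plateau, and fall must stay cold.
    return (all(v < cold_level for v in y[:left + 1])
            and all(v < cold_level for v in y[right:]))

bump_series = [0.1, 0.1, 0.12, 0.3, 0.6, 0.65, 0.62, 0.35, 0.1, 0.1]
noisy_series = [0.3, 0.6, 0.2, 0.7, 0.1, 0.55, 0.3]
params = dict(cold_level=0.2, hot_level=0.5, max_rise_time=2, max_fall_time=2,
              min_hot=2, max_hot=5, min_hot_cold_diff_thresh=0.2)
```

Here `is_bump(bump_series, **params)` accepts the series, while the noisy series is rejected because its hot values are not contiguous.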
Topic 042 from the model with date 2013-06-04 generated from the Reuters corpus represents a "Bond Topic" (Figure 1). Topic 017 from the model with date 2013-04-26 generated from the Twitter corpus represents a "Happy Birthday Topic" (Figure 3). Both of these figures show noisy cosine similarity graphs. This is because these topics are present at all or random times in their respective corpora and do not correspond to a time-specific event. In fact, in almost every serialized topic model in the Twitter corpus, there is a "Happy Birthday" topic with a nearly identical topic-word distribution.
Both the "Boston Marathon Bombing" topic from the Reuters corpus (Figure 2) and the "Robin Williams' Death" topic from the Twitter corpus (Figure 4) have noticeable bumps in their cosine similarity graphs around the date of their respective events.
Figure 5 depicts the cosine similarity graph for topic 003 from the model with date 2014-08-12 generated from the Reuters corpus. This topic describes an event where Mt. Gox, a bitcoin exchange, collapsed in minutes. Figure 6 is a closeup of the bump that includes the variables generated by the Bump Detection algorithm. The difference between the hotLevel and coldLevel for this topic's cosine similarity graph is .536, which is significantly higher than our usual minHotColdDiffThresh of .20.

Event Grouping
The final step of generating Eventy-Topics is grouping similar Eventy-Topics together. In the Reuters corpus, for example, topic modeling is run daily over the previous 10 days, and thus each document is input into 10 different daily topic models. This makes the "Boston Marathon Bombing" Eventy-Topic exist in models run between April 16, 2013 and May 2, 2013. For each Eventy-Topic generated by the Bump Detection algorithm, there are almost surely other near-identical Eventy-Topics. Topics with cosine similarity values in the hot zone of one Eventy-Topic are likely to be labeled Eventy-Topics as well. Thus we want to group these Eventy-Topics into one. We group these Eventy-Topics together by creating a graph whose vertices are the Eventy-Topics. If one Eventy-Topic, K1, is in the hot zone of another Eventy-Topic, K2, then we place an edge between these two vertices in our Eventy-Topic graph. We then run a connected components algorithm over the graph to generate a list of sets of Eventy-Topics. For each set in the list, the vertex with the highest degree is chosen to represent all the Eventy-Topics in that set.
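The grouping step can be sketched as follows. The input format (each Eventy-Topic mapped to the set of topic identifiers found in its hot zone), the function name, and the tie-breaking rule for equal degrees are assumptions of this sketch.

```python
from collections import deque

def group_eventy_topics(hot_zones):
    """Group Eventy-Topics via connected components.

    `hot_zones` maps each Eventy-Topic id to the set of topic ids in its hot
    zone. Returns a list of (representative, members) pairs, where the
    representative is the highest-degree vertex of its component.
    """
    nodes = set(hot_zones)
    adj = {v: set() for v in nodes}
    for k2, zone in hot_zones.items():
        for k1 in zone:
            # Only Eventy-Topics are vertices; other hot-zone topics are ignored.
            if k1 in nodes and k1 != k2:
                adj[k1].add(k2)
                adj[k2].add(k1)
    seen, groups = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        component, queue = [], deque([start])
        seen.add(start)
        while queue:  # breadth-first search over one component
            v = queue.popleft()
            component.append(v)
            for u in sorted(adj[v]):
                if u not in seen:
                    seen.add(u)
                    queue.append(u)
        members = sorted(component)
        # Ties in degree are broken by the smallest id (an assumption).
        representative = max(members, key=lambda v: len(adj[v]))
        groups.append((representative, members))
    return groups

toy_zones = {
    "20130416:003": {"20130417:010", "20130418:007"},
    "20130417:010": {"20130416:003"},
    "20130418:007": set(),
    "20130604:042": {"20130605:001"},  # hot-zone topic that is not an Eventy-Topic
}
toy_groups = group_eventy_topics(toy_zones)
```

In this toy input the three April topics collapse into one group represented by 20130416:003, and the remaining topic forms its own group.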

Multi-Bump Detection
Some events might happen in two or more separate time periods. The topics that describe these events will not be captured by the Bump Detection algorithm because the cosine similarity graph will dip below the cold threshold between the two bumps. To extend the single Bump Detection algorithm, we added an extra parameter, minTimeBetweenBumps, which controls the minimum time the cosine similarity graph must stay in the cold zone between bumps. The algorithm then allows multiple bumps as long as they are a certain distance apart from each other.
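The separation rule can be illustrated with the sketch below. It covers only the spacing check between bumps, under the assumptions that time is counted in data points and that "staying in the cold zone" means the longest contiguous stretch of values below coldLevel between two hot runs; the function names are hypothetical.

```python
def hot_runs(y, hot_level):
    """Return (start, end) index pairs of contiguous runs above hot_level."""
    runs, start = [], None
    for i, v in enumerate(y):
        if v > hot_level and start is None:
            start = i
        elif v <= hot_level and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(y) - 1))
    return runs

def longest_cold_stretch(values, cold_level):
    """Length of the longest contiguous stretch below cold_level."""
    best = current = 0
    for v in values:
        current = current + 1 if v < cold_level else 0
        best = max(best, current)
    return best

def bumps_well_separated(y, cold_level, hot_level, min_time_between_bumps):
    """True if there are at least two hot runs and each consecutive pair is
    separated by a long enough cold stretch."""
    runs = hot_runs(y, hot_level)
    if len(runs) < 2:
        return False
    for (_, end_a), (start_b, _) in zip(runs, runs[1:]):
        gap = y[end_a + 1:start_b]
        if longest_cold_stretch(gap, cold_level) < min_time_between_bumps:
            return False
    return True

two_bumps = [0.1, 0.6, 0.7, 0.3, 0.1, 0.1, 0.1, 0.3, 0.65, 0.6, 0.1]
```

In a fuller version, each hot run would also be checked against the single-bump constraints (rise time, fall time, and plateau length).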
Figure 7 corresponds to an announcement in January 2013 that India would raise 57 billion through its first sale of inflation-linked bonds in over a decade. India issued these bonds in several periods (Mar 2013, Jun 2013, Oct 2013), which correspond to the multiple bumps in the graph. News about this major India debt offering was present only at these particular times, and all of it is tied to that January 2013 announcement.

Experimental Results
Evaluation of our ETD algorithm was done by annotating a selected set of topics. To expedite and strengthen the annotation process, we first ran Bump Detection with a relatively low minHotColdDiffThresh and then again with this parameter set to a relatively high value. The sampling for our annotation set was then divided into 3 strata.
• Strata I: topics that were not labeled as Eventy-Topics with a low minHotColdDiffThresh.
• Strata II: topics that were labeled as Eventy-Topics with a low, but not a high, minHotColdDiffThresh.
• Strata III: topics that were labeled as Eventy-Topics with a high minHotColdDiffThresh.
The details of our sampling for annotation can be seen in Table 3. Note that the annotation was done on topics and not on the results of the Event Grouping step.
Our annotation set consisted of 84 randomly sampled topics from Strata I, 11 topics from Strata II, and 22 topics from Strata III. The vast majority of topics fell into Strata I (40,270), with the second most in Strata II (1,151), and the rest in Strata III (579). We divided the sampled topics into strata because the annotation outcomes differed across the 3 strata. 80/84 topics in Strata I were labeled as 'Non-Eventy-Topics', while 21/22 topics in Strata III were labeled as 'Eventy-Topics'. 6/11 topics sampled for Strata II were labeled as 'Eventy-Topics'. Strata II topics were the most difficult to annotate. With this annotated set of Eventy-Topics in hand, we tuned the parameters of our Eventy-Topic Detection algorithm to maximize performance over the annotated set. The results of our Reuters News corpus Eventy-Topic Detection with optimal parameters can be seen in Table 4.

Discussion
The data sets need to be sufficiently large in size and time horizon for our ETD algorithm to be useful. The Reuters News corpus spanned 525 days, and a corpus spanning an even longer period could yield better results. The algorithm also requires significant computation. We ran all our computation on Hadoop in the MapReduce framework and wrote all the data to HBase. On our 30-node Hadoop cluster, the daily topic modeling for the Reuters corpus took approximately 1 day, and the cosine similarity calculation took about 2 days. The Bump Detection algorithms for different smoothing parameters and thresholds took only a few minutes.
One limitation of ETD is that it is run on a large, static corpus of sequential text and not on an online stream of text. Our algorithm can be modified to run the topic modeling on an incoming stream of text, say every 3 hours, and then to compute the cosine similarity pairs and run Bump Detection.
Further extensions, such as analyzing the shape of the bump, the rise time, and the fall time to determine whether the Eventy-Topic was expected or unexpected, could be very useful.
Our Eventy-Topic Detection algorithm was evaluated with a manually annotated corpus. This is similar to the way Retrospective Event Detection is evaluated in previous studies.

Figure 7: Cosine similarity graphs for Reuters Topic 20130115:052. "Large India Bond Sale" - { percent india gmt eye year inr ns indian oil rupees bond billion ... }

Table 1: Cosine similarity pair mapping table.

Table 3: Sampling of Topics from Reuters Corpus for Annotation

Table 4: Accuracy of Eventy-Topic Detection with Optimized minHotColdDiffThresh