An Unsupervised Bayesian Modelling Approach for Storyline Detection on News Articles

Storyline detection from news articles aims at summarizing events described under a certain news topic and revealing how those events evolve over time. It is a difficult task because it requires first the detection of events from news articles published in different time periods and then the construction of storylines by linking events into coherent news stories. Moreover, each storyline has different hierarchical structures which are dependent across epochs. Existing approaches often ignore the dependency of hierarchical structures in storyline generation. In this paper, we propose an unsupervised Bayesian model, called dynamic storyline detection model, to extract structured representations and evolution patterns of storylines. The proposed model is evaluated on a large scale news corpus. Experimental results show that our proposed model outperforms several baseline approaches.


Introduction
The rapid development of online news media sites is accompanied by the generation of a tremendous number of news reports. Facing such a massive amount of news articles, it is crucial to develop an automated tool which can provide a temporal summary of events and their evolution related to a topic from news reports. Therefore, storyline detection, which aims at summarising the development of certain related events, has been studied in order to help readers quickly understand the major events reported in news articles. It has attracted great attention recently. Kawamae (2011) proposed a trend analysis model which used the difference between temporal words and other words in each document to detect topic evolution over time. Ahmed et al. (2011) proposed a unified framework to group temporally and topically related news articles into the same storylines in order to reveal the temporal evolution of events. Tang and Yang (2012) developed a topic-user-trend model, which incorporates user interests into the generative process of web contents. Radinsky and Horvitz (2013) built storylines based on text clustering and entity entropy to predict future events. Huang and Huang (2013) developed a mixture-event-aspect model to model sub-events into local and global aspects and utilized an optimization method to generate storylines. Wang et al. (2013) proposed an evolutionary multi-branch tree clustering method for streaming text data in which the tree construction is cast as an online posterior estimation problem by considering both the current tree and the previous tree simultaneously.
With the fast development of social media platforms, newsworthy events are widely scattered not only on traditional news media but also on social media (Zhou et al., 2015). For example, Twitter, one of the most widely adopted social media platforms, appears to cover nearly all newswire events (Petrovic et al., 2013). Therefore, approaches have also been proposed for storyline summarization on social media. Given a user input query of an ongoing event, Lin et al. (2012) extracted the storyline of an event by first obtaining relevant tweets and then generating storylines via graph optimization. In (Li and Li, 2013), an evolutionary hierarchical Dirichlet process was proposed to capture the topic evolution pattern in storyline summarization.
However, most of the aforementioned approaches do not represent events in the form of structured representation. More importantly, they ignore the dependency of the hierarchical structures of events at different epochs in a storyline. In this paper, we propose a dynamic storyline detection model to overcome the above limitations.
We assume that each document could belong to one storyline s, which is modelled as a joint distribution over some named entities e and a set of topics z. Furthermore, to link events at different epochs and detect different types of storylines, the weighted sum of storyline distribution of previous epochs is employed as the prior of the current storyline distribution. The proposed model is evaluated on a large scale news corpus. Experimental results show that our proposed model outperforms several baseline approaches.

Methodology
To model the generation of a storyline in consecutive time periods for a stream of documents, we propose an unsupervised latent variable model, called dynamic storyline detection model (DSDM). The graphical model of DSDM is shown in Figure 1. In this model, we assume that the storyline-topic-word, storyline-topic and storyline-entity probabilities at time t are dependent on the previous storyline-topic-word, storyline-topic and storyline-entity distributions in the last M epochs. For a certain period of time, we assume that each document could belong to one storyline s, which is modelled as a joint distribution over some named entities e and a set of topics z. This assumption essentially encourages documents published around a similar time that involve the same named entities and discuss similar topics to be grouped into the same storyline. As the storyline distribution is shared across documents with the same named entities and similar topics, it essentially preserves the ambiguity that, for example, documents comprising the same person and location may or may not belong to the same storyline.
The generative process of DSDM is defined for each time period t from 1 to T. We define an evolutionary matrix of storyline indicator s and topic z, $\sigma^{t}_{s,z}$, where each column $\sigma^{t}_{s,z,m}$ denotes the storyline-topic-word distribution of storyline indicator s and topic z at epoch m; an evolutionary topic matrix of storyline indicator s, $\tau^{t}_{s}$, where each column $\tau^{t}_{s,m}$ denotes the storyline-topic distribution of storyline indicator s at epoch m; and an evolutionary entity matrix of storyline indicator s, $\upsilon^{t}_{s}$, where each column $\upsilon^{t}_{s,m}$ denotes the storyline-entity distribution of storyline indicator s at epoch m.
We attach a vector of M + 1 weights $\mu^{t}_{s,z} = \{\mu^{t}_{s,z,m}\}_{m=0}^{M}$ (with $\mu^{t}_{s,z,m} > 0$ and $\sum_{m=0}^{M} \mu^{t}_{s,z,m} = 1$), whose components represent the weights that each $\sigma^{t}_{s,z,m}$ contributes to calculating the priors of $\phi^{t}_{s,z}$. We do the same for $\theta^{t}_{s}$ and $\omega^{t}_{s}$. The Dirichlet priors for the storyline-topic-word distribution, the storyline-topic distribution and the storyline-entity distribution at epoch t are then obtained as weighted sums of the columns of the corresponding evolutionary matrices. In our experiments, the weight parameters are set to be the same regardless of storylines or topics. They depend only on the time window through an exponential decay function, $\mu_{m} = \exp(-0.5 \times m)$, where m stands for the mth epoch counting backwards in the past M epochs. That is, more recent documents have a relatively stronger influence on the model parameters in the current epoch than earlier documents. It is also possible to estimate the weights directly from data. We leave this as future work.
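The weighting scheme described above can be illustrated with a small sketch. The function names `decay_weights` and `weighted_prior`, the toy columns, and the normalisation step are assumptions made for illustration: the text gives $\mu_{m} = \exp(-0.5 \times m)$ and separately requires the weights to sum to one, so the sketch simply rescales the raw decay values.

```python
import math


def decay_weights(M, rate=0.5, normalize=True):
    """Weights mu_m = exp(-rate * m) for m = 0..M (m = 0 is the current epoch).

    Normalising so the weights sum to 1 is an assumption of this sketch.
    """
    raw = [math.exp(-rate * m) for m in range(M + 1)]
    if not normalize:
        return raw
    total = sum(raw)
    return [w / total for w in raw]


def weighted_prior(columns, weights):
    """Weighted sum of M + 1 equal-length columns, one column per epoch."""
    if len(columns) != len(weights):
        raise ValueError("need exactly one weight per column")
    size = len(columns[0])
    return [sum(w * col[i] for w, col in zip(weights, columns))
            for i in range(size)]


# Toy example: M = 2 past epochs, vocabulary of 3 words (all values hypothetical).
weights = decay_weights(M=2)
columns = [
    [0.01, 0.01, 0.01],  # m = 0: flat hyperparameter column for the current epoch
    [0.7, 0.2, 0.1],     # m = 1: word distribution from the previous epoch
    [0.1, 0.2, 0.7],     # m = 2: word distribution from two epochs back
]
prior = weighted_prior(columns, weights)
```

Because the weight for m = 1 is larger than the weight for m = 2, the resulting prior leans towards the word distribution of the more recent epoch.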
The storyline-topic-word distribution $\phi^{t}_{s,z}$, the storyline-topic distribution $\theta^{t}_{s}$ and the storyline-entity distribution $\omega^{t}_{s}$ at the current epoch t are generated from Dirichlet distributions parameterized by the corresponding priors. With this formulation, we can ensure that the mean of the Dirichlet parameter for the current epoch becomes proportional to the weighted sum of the word, topic and entity distributions at previous epochs.
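One plausible way to write this construction is sketched below. The prior symbols $\beta^{t}_{s,z}$, $\alpha^{t}_{s}$ and $\epsilon^{t}_{s}$, and the use of a shared weight $\mu^{t}_{s,m}$ for the topic and entity priors, are assumed notation chosen to agree with the weighted-sum description; they should not be read as the authors' exact equations.

```latex
\begin{align}
\beta^{t}_{s,z} &= \sum_{m=0}^{M} \mu^{t}_{s,z,m}\,\sigma^{t}_{s,z,m},
  & \phi^{t}_{s,z} &\sim \mathrm{Dirichlet}\!\left(\beta^{t}_{s,z}\right), \\
\alpha^{t}_{s} &= \sum_{m=0}^{M} \mu^{t}_{s,m}\,\tau^{t}_{s,m},
  & \theta^{t}_{s} &\sim \mathrm{Dirichlet}\!\left(\alpha^{t}_{s}\right), \\
\epsilon^{t}_{s} &= \sum_{m=0}^{M} \mu^{t}_{s,m}\,\upsilon^{t}_{s,m},
  & \omega^{t}_{s} &\sim \mathrm{Dirichlet}\!\left(\epsilon^{t}_{s}\right).
\end{align}
```

Under this reading, each Dirichlet parameter is a weighted sum of the columns of the corresponding evolutionary matrix, which is what makes its mean proportional to the weighted sum of the historical distributions.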

Inference and Parameter Estimation
We use collapsed Gibbs sampling (Griffiths and Steyvers, 2004) to infer the parameters of the model, given observed data D. Gibbs sampling is a Markov chain Monte Carlo method that constructs a Markov chain whose stationary distribution is the posterior of interest, here over $s^{t}_{d}$ and $z^{t}_{d,n}$, by repeatedly sampling each variable from its distribution given the current values of all other variables and the data. Such samples can be used to empirically estimate the target distribution. Letting the subscript −d denote a quantity that excludes counts in document d, the conditional posterior for $s_{d}$ is expressed in terms of the following counts: $N_{j}$ denotes the number of documents assigned to storyline indicator j in the whole corpus; D is the total number of documents; $n_{j,e}$ is the number of times named entity e is assigned storyline indicator j; $n^{E}_{j}$ denotes the total number of named entities with storyline indicator j in the document collection; $n_{j,k}$ is the number of words assigned topic label k and storyline indicator j; $n_{j}$ is the total number of words (excluding named entities) in the corpus with storyline indicator j; and $n_{j,k,v}$ is the number of occurrences of word v with storyline indicator j and topic label k in the document collection. Counts with the (d) notation denote the counts relating to document d only.
Let the index x = (d, n) denote the nth word in document d, and let the subscript −x denote a quantity that excludes data from the nth word position in document d. We only sample a topic $z_{x}$ if the nth word is not a named entity, using the corresponding conditional posterior. Once the latent variables s and z are known, we can easily estimate the model parameters π, Θ, φ, ψ, ω. We set the hyperparameters α = γ = 0.1 and β = ϵ = 0.01 for the current epoch (i.e., m = 0), and gather statistics in the previous 7 epochs (i.e., M = 7) to set the Dirichlet priors for the storyline-topic-word distribution $\phi^{t}_{s,z}$, the storyline-topic distribution $\theta^{t}_{s}$ and the storyline-entity distribution $\omega^{t}_{s}$ in the current epoch t. We run the Gibbs sampler for up to 1000 iterations, stopping once the log-likelihood of the training data converges under the learned model.
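As an illustration of the topic-sampling step, the sketch below computes a conditional distribution over topics for one word position using the standard collapsed form for Dirichlet-multinomial topic models. This is an assumption: it is the usual LDA-style expression adapted to the counts $n_{j,k}$ and $n_{j,k,v}$ defined above, not a reproduction of the authors' equation, and the function name `topic_conditional` and all toy values are hypothetical.

```python
def topic_conditional(n_jk, n_jkv, alpha_j, beta_j, v):
    """Return P(z_x = k | s_d = j, everything else) for k = 0..K-1.

    n_jk[k]      : words with topic k in storyline j, excluding position x
    n_jkv[k][w]  : occurrences of word w with topic k in storyline j, excluding x
    alpha_j[k]   : Dirichlet prior of the storyline-topic distribution
    beta_j[k][w] : Dirichlet prior of the storyline-topic-word distribution
    v            : vocabulary index of the word at position x
    """
    scores = []
    for k in range(len(n_jk)):
        # How popular topic k already is within storyline j.
        topic_term = n_jk[k] + alpha_j[k]
        # How strongly topic k of storyline j is associated with word v.
        word_term = (n_jkv[k][v] + beta_j[k][v]) / (n_jk[k] + sum(beta_j[k]))
        scores.append(topic_term * word_term)
    total = sum(scores)
    return [s / total for s in scores]


# Toy example: K = 2 topics, vocabulary of 3 words.
n_jk = [3, 1]
n_jkv = [[2, 1, 0], [0, 0, 1]]
alpha_j = [0.1, 0.1]
beta_j = [[0.01, 0.01, 0.01], [0.01, 0.01, 0.01]]
probs = topic_conditional(n_jk, n_jkv, alpha_j, beta_j, v=0)
```

The normalising constant of the topic term is the same for every k, so it is dropped before the scores are rescaled. In a full sampler, a new topic would be drawn from `probs` and the counts updated accordingly.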

Dataset
We crawled and parsed the GDELT Event Database containing news articles published in May 2014. We manually annotated one week of data containing 101,654 documents and identified 77 storylines for evaluation. We also report the results of our model on the one-month data containing 526,587 documents, but for this dataset we only report the precision and not the recall of the extracted storylines, since it is time consuming to identify all the true storylines in such a large dataset. In our experiments, we used the Stanford Named Entity Recognizer to identify the named entities. In addition, we removed common stopwords and only kept tokens which are verbs, nouns, or adjectives in these news articles.

Baselines
We chose the following three methods as the baseline approaches.
1. K-Means + Cosine Similarity (KMCS): the method first applies K-Means to cluster news documents for each day, then links storylines detected in different days based on the cosine similarity measurement.
2. LDA + Cosine Similarity (LDCS): the method first splits news documents on a daily basis, then applies the Latent Dirichlet Allocation (LDA) model to detect the latent storylines for the documents in each day, in which each storyline is modelled as a joint distribution over named entities and words, and finally links storylines detected in different days using the cosine similarity measurement.
3. Dynamic LDA (DLDA): this is the dynamic LDA (Blei and Lafferty, 2006) where the topic-word distributions are linked across epochs based on the Markovian assumption. That is, the topic-word distribution at the current epoch is only influenced by the topic-word distribution in the previous epoch.
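The cross-day linking step shared by the first two baselines can be sketched as follows. The representation of each daily cluster as a term-weight vector, the function names `cosine` and `link_clusters`, and the similarity threshold are assumptions for illustration; the text does not specify how the baselines choose a threshold.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def link_clusters(prev_day, curr_day, threshold=0.5):
    """Map each current-day cluster index to its best previous-day match.

    A cluster is linked only if its best similarity reaches the threshold;
    otherwise it maps to None and would start a new storyline.
    """
    links = {}
    for i, curr in enumerate(curr_day):
        best, best_sim = None, threshold
        for j, prev in enumerate(prev_day):
            sim = cosine(curr, prev)
            if sim >= best_sim:
                best, best_sim = j, sim
        links[i] = best
    return links


# Toy example over a 3-term vocabulary (all values hypothetical).
prev_day = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
curr_day = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
links = link_clusters(prev_day, curr_day, threshold=0.8)
```

Here the first current-day cluster continues the first previous-day cluster, while the second falls below the threshold and is treated as a new storyline.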

Evaluation Metric
To evaluate the performance of the proposed approach, we use precision, recall and F-score which are commonly used in evaluating information extraction systems. The precision is calculated based on the following criteria: 1) The entities and keywords extracted refer to the same storyline.
2) The duration of the storyline is correct. We assume that the start date (or end date) of a storyline is the publication date of the first (or last) news article about it.
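For reference, the three metrics can be computed as in the sketch below. It assumes the usual definitions (precision as correct over extracted, recall as correct over annotated, F-score as their harmonic mean), with a storyline counted as correct when it satisfies both criteria above. The function name and the toy counts are hypothetical and are not results from the experiments.

```python
def precision_recall_f(num_correct, num_extracted, num_gold):
    """Precision, recall and F-score from storyline counts.

    num_correct   : extracted storylines judged correct
    num_extracted : all storylines output by the system
    num_gold      : manually annotated (true) storylines
    """
    precision = num_correct / num_extracted if num_extracted else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


# Toy counts: 3 correct storylines out of 4 extracted, with 6 annotated.
p, r, f = precision_recall_f(num_correct=3, num_extracted=4, num_gold=6)
```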

Experimental Results
The proposed model is compared against the baseline approaches on the annotated one-week data, which consist of 77 storylines. The number of storylines, S, and the number of topics, K, are both set to 100. The number of historical epochs, M, which is taken into account for setting the Dirichlet priors for the storyline-topic-word, the storyline-topic and the storyline-entity distributions, is set to 7. The evaluation results of our proposed approach in comparison to the three baselines are presented in Table 1.
Table 1: Performance comparison of the storyline extraction results in terms of Precision (%), Recall (%) and F-score (%).
It can be observed from Table 1 that simply using K-means to cluster news articles in each day and linking similar stories across different days in the hope of identifying storylines gives the worst results. Using LDA to detect stories in each day improves the precision dramatically. The dynamic LDA model assumes that topics (or stories) in the current epoch evolve from the previous epoch, and it further improves the storyline detection results significantly. Our proposed model aims to capture long-distance dependencies: the statistics gathered in the past 7 days are taken into account to set the Dirichlet priors of the storyline-topic-word, storyline-topic and storyline-entity distributions in the current epoch. It gives the best performance and outperforms dynamic LDA by nearly 7% in F-measure.
To study the impact of the number of topics on the performance of the proposed model, we conducted experiments on the one-month data with different numbers of topics varying between 100 and 200. In all these experiments, the number of storylines, S, is set to 200, based on the speculation that about 40 storylines in the annotated one-week data last for one month and about 40 new storylines occur each week (that is, roughly 40 + 4 × 40 = 200).
Figure 2: Storyline about the patent infringement case between Apple and Samsung extracted by the proposed model.

Structured Browsing
We illustrate the evolution of storylines by using structured browsing, from which the structured information (entity, topic, keywords) about storylines and the duration of storylines can be easily observed. Figure 2 shows the storyline about "The patent infringement case between Apple and Samsung". It can be observed that in the first two days, the hierarchical structure consists of entities (Apple, Samsung) and keywords (trial, patent, infringe). The case gained significant attention in the next three days, when a US jury ordered Samsung to pay Apple $119.6 million. It can be observed that the stories in these three days also consist of entities (Apple, Samsung), but with different keywords (award, patent, win). The last day's story gives an overall summary and consists of entities (Apple, Samsung) and keywords (jury, patent, company).
To further investigate the storylines detected by the proposed model, we randomly selected three detected storylines. The first one is about "the patent infringement case between Apple and Samsung". It is a short-term storyline lasting for 6 days, as shown in Figure 3. The second one is about "India election", which is a long-term storyline lasting for one month. The third one is about "Pistorius shoot Steenkamp", which is an intermittent storyline, lasting for a total of 22 days but with no relevant news reports on certain days, as shown in Figure 3. It can be observed that the proposed model can detect not only continuous but also intermittent storylines, which further demonstrates the advantage of the proposed model.

Conclusions and Future Work
In this paper, we have proposed an unsupervised Bayesian model to extract storylines from a news corpus. Experimental results show that our proposed model is able to extract both continuous and intermittent storylines and outperforms a number of baselines. In future work, we will consider modelling background topics explicitly and investigate more principled ways of setting the weight parameters of the statistics gathered in the historical epochs. Moreover, we will also explore the impact of different scales of the dependencies from historical epochs on the distributions of the current epoch.