Hierarchical Dirichlet Gaussian Marked Hawkes Process for Narrative Reconstruction in Continuous Time Domain

In news and discussions, many articles and posts are provided without their related previous articles or posts. Hence, it is difficult to understand the context from which the articles and posts have occurred. In this paper, we propose the Hierarchical Dirichlet Gaussian Marked Hawkes process (HD-GMHP) for reconstructing the narratives and thread structures of news articles and discussion posts. HD-GMHP unifies three modeling strategies from previous research: temporal characteristics, triggering event relations, and meta information of text in news articles and discussion threads. To show the effectiveness of the model, we perform experiments in narrative reconstruction and thread reconstruction with real-world datasets: articles from the New York Times and a corpus of Wikipedia conversations. The experimental results show that HD-GMHP outperforms the baselines of LDA, HDP, and HDHP on both tasks.


Introduction
Online news sites and discussion forums generate large volumes of articles and discussions, which we can call "events". To fully understand the discussions and the news stories, one often needs a larger context for that text, such as what related posts and relevant articles have been posted before. For instance, to understand a news article about a presidential election, we would need to know the history of the candidates' political actions through relevant previous articles. While there are some news articles with a curated set of related articles and discussion threads with a well-organized structure, there are many more articles and discussion threads for which the structure is absent or incomplete. In this context, automatically reconstructing the narrative structure of articles and the thread structure of discussions is an important problem.
Generally, textual information and various kinds of meta information such as location and keywords are used as features to solve this narrative reconstruction problem. With these features, previous research mainly focuses on three modeling strategies. First, it models the triggering relationship of events to identify which preceding events led to the occurrence of the current event. Second, it uses meta information such as location and keywords. Third, it considers the temporal characteristics of the event stream, such that events in close temporal proximity are more likely to be related. However, there is no method that effectively considers all three of these. In narrative reconstruction, there are several approaches that focus on using meta information and temporal characteristics with clustering methods (Zhou et al., 2016; Tang et al., 2015; Ahmed et al., 2011), and there are several approaches using the Hawkes process to model the temporal characteristics (Du et al., 2015; Mavroforakis et al., 2017; Jankowiak and Gomez-Rodriguez, 2017). In thread reconstruction, there are approaches that focus on modeling triggering relationships of events and using meta information (Kim et al., 2010; Louis and Cohen, 2015; Wang et al., 2011b).
In this paper, we propose a novel Gaussian Marked Hawkes Process (GMHP) that effectively reconstructs the narrative structure of articles and the thread structure of discussions by considering all three modeling strategies. GMHP uses the Hawkes process to model events in continuous time, a Gaussian distribution to model the meta information of text, and a Gaussian mixture to model the triggering relationships of events. The detailed modeling strategies are as follows. We use the Hawkes process to model time in the continuous domain, as the Hawkes process is a stochastic process used to understand a sequence of events in continuous time (Iwata et al., 2013; Rong et al., 2015). To use meta information, we represent text and meta information in a general vector form and extend the Hawkes process to handle the vector of event information with a Gaussian distribution. To model the triggering relationships, we assume a model structure parameterized by each preceding event, so that an event can be directly generated from a probability distribution parameterized by preceding events.
The GMHP models a single narrative or thread in event streams. To find the narratives or threads from a mixture of event streams, we combine our GMHP model with the Hierarchical Dirichlet Process to build HD-GMHP.
We evaluate the effectiveness of our model with two real-world datasets: articles from the New York Times, and discussion threads from Wikipedia. On the New York Times dataset, we perform a narrative reconstruction experiment and compare the results with human-annotated narrative labels. On the Wikipedia discussion corpus, we perform two kinds of thread reconstruction experiments. One is grouping posts in the same thread. The other is reconstructing the post-reply structure of the posts. From these experiments, we see that our model outperforms the state-of-the-art model, the hierarchical Dirichlet Hawkes process (HDHP) (Mavroforakis et al., 2017).
The contributions of our research are threefold. First, we propose the Gaussian Marked Hawkes Process, which effectively models a single narrative (event stream) with all three modeling strategies used in previous research. Second, we propose HD-GMHP, a combination of the GMHP model with the HDP, to reconstruct the narratives of articles and the thread structure of discussions from a mixture of event streams. Finally, we propose a novel inference algorithm for HD-GMHP based on the Sequential Monte Carlo method (Doucet et al., 2001).

Related Work
Narrative Reconstruction: One major approach to reconstructing narratives from news articles is clustering articles by using a variant of the Chinese Restaurant Process (CRP). Related work such as (Zhou et al., 2016; Tang et al., 2015; Ahmed et al., 2011) models chronologically ordered news articles with text and various meta information including author, organization, keywords, and location, using the CRP and the distance-dependent CRP (Blei and Frazier, 2011). Other research uses the recurrent CRP (Ahmed and Xing, 2008) and an exponentially decaying kernel to model the probability of the time difference between two relevant events, but it relies on discrete rather than continuous time information and on hand-crafted kernel parameters (Ahmed et al., 2011).
There is another approach that reconstructs narratives by directly extracting important sentences from articles. (Xu et al., 2013) proposes a model that treats sentence- and image-level narrative reconstruction as an optimization problem and solves it by maximizing the divergence of narratives under some constraints. (Wang et al., 2016) treats narrative reconstruction as a sentence recommendation problem and uses matrix factorization. These existing models focus on how to handle the text and meta information of articles, while our model uses the Hawkes process to effectively model the continuous time information of events. Discussion Thread Reconstruction: There are several approaches to reconstructing threads from a corpus of unstructured discussions. (Wang et al., 2011a) uses a Conditional Random Field to reconstruct the reply structure in a discussion corpus. (Balali et al., 2014) uses content, time, and author information as features of a single post with a rank SVM to reconstruct thread structure. (Dehghani et al., 2013; Aumayr et al., 2011) use an SVM and a decision tree with the meta information of posts.
However, a major limitation of this previous research is the assumption that, for each post, the main thread to which it belongs is given. That is, the problem it solves is finding the post to which a given post is directly replying, rather than treating the corpus as a single set of posts with no known information about the threads, the initial post of each thread, or the posts that belong to each thread. This limitation means those approaches are not applicable to more general online conversation data, such as IRC or a Facebook group chat: massive unstructured online discussions for which the initial post of a thread is not labeled. Unlike this strong assumption in previous research, we use the more general assumption that the initial posts are unknown, so our approach is applicable to a wider range of discussion data. Also, as in the narrative reconstruction area, previous research focuses on how to handle the text and meta information in posts; unlike that research, ours uses the Hawkes process to model continuous time information. Continuous Time Modeling: The Hawkes process, a stochastic process that models the continuous time information of events given the event occurrence history, is an effective tool for modeling events in continuous time. One of the main research themes in the Hawkes process literature is finding which events trigger which other events. (He et al., 2015) models topic diffusion patterns in a social network by inferring the triggering node with the Hawkes process. The Hawkes process is also used to model social event streams (Rong et al., 2015) and to classify rumors (Lukasik et al., 2016), and a combination of the Hawkes process and the Dirichlet mixture model is used to cluster event streams (Xu and Zha, 2017).
Recent research clusters text streams with the Hawkes process and the Chinese Restaurant Process or the Chinese Restaurant Franchise (Mavroforakis et al., 2017; Du et al., 2015). These models use a bag-of-words representation of text, while (Jankowiak and Gomez-Rodriguez, 2017) proposes a Hawkes process model that can handle a more general vector representation of events. The main difference between our model and this line of research is that we add the triggering relationship between two events. With this addition, our model can reconstruct narratives with an explicit relation between two documents.

Hawkes Processes
Before we describe our proposed model, we briefly explain the Hawkes process, one of the two main stochastic processes used in our model. We omit the explanation of the HDP due to space constraints.
The Hawkes process (Hawkes, 1971) is a subclass of temporal point processes. With an exponentially decaying kernel, its intensity function is

λ*(t) = λ_0(t) + Σ_{s < t} αβ e^{−β(t − s)},

where the sum runs over past event times s, and the intensity λ*(t) represents the conditional probability of an event occurrence within the time window [t, t + dt). The Hawkes process is used to model the number of occurrences of events where one event can trigger other events. In the equation above, the base intensity λ_0(t) models the intensity of events that occur on their own initiative, whereas αβ e^{−β(t−s)} models the intensity of events triggered by a previous event that occurred at time s. Here, the product of α and β represents the influence of the previous event, and β represents the decay rate of that influence. Thus, the effect of a previous event decays exponentially with the time difference. From the definition of the intensity λ*(t), the likelihood of the Hawkes process over an observation window [0, T] with events at times t_1, ..., t_n is

L = [Π_{i=1}^{n} λ*(t_i)] exp(−Λ(T)), where Λ(T) = ∫_0^T λ*(s) ds.
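As a concrete illustration, the exponential-kernel intensity can be evaluated directly; this is a minimal sketch (function and parameter names are ours, not the paper's):

```python
import math

def hawkes_intensity(t, history, lam0, alpha, beta):
    """Conditional intensity lambda*(t) of a Hawkes process with an
    exponentially decaying kernel: the base rate lam0 plus one
    alpha * beta * exp(-beta * (t - s)) term per past event time s < t."""
    return lam0 + sum(alpha * beta * math.exp(-beta * (t - s))
                      for s in history if s < t)
```

Each past event momentarily raises the intensity by αβ, and the bump decays at rate β, so recent events dominate the sum.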

Problem Setting
In this section, we define the event stream and formulate the narrative and thread reconstruction problems.

Definition of Event Stream:
If a text appears at time t_i, we define the event s_i as (t_i, e_i, z_i, x_i).
Here, e_i is the feature vector of the text, x_i is the latent global cluster indicator of event s_i, which represents the cluster of events with similar text information, and z_i is the latent local cluster indicator for events that are temporally related within the same global cluster. We define the event stream S as [s_1, ..., s_n].
Assumptions: 1) We assume that two events in the same local cluster occur close in time and have similar feature vectors e. These properties are called temporal and spatial locality. 2) We assume a hierarchical structure of global and local clusters. That is, one global cluster can consist of multiple local clusters.
Problem Formulation: We formulate the spatial locality of two events in the same local cluster with a Gaussian distribution. If two events s_i and s_j are in the same local cluster and t_i > t_j, then we assume the later event vector e_i is generated from one of the two relations

e_i ~ N(e_0, Σ_0)   or   e_i ~ N(e_j, Σ_v).

Here, e_0 is the base event vector and Σ_0 is the covariance matrix of the cluster; Σ_v is the covariance matrix of the Gaussian distribution centered on a past event in the cluster.
We use the Hawkes process to formulate the temporal locality of two events in the same local cluster. If events s_i and s_j are in the same local cluster and t_i > t_j, then t_i is generated from one of the two relations: from the base intensity μ, or from the triggering intensity αβ e^{−β(t_i − t_j)} of the past event s_j. Here, if t_i is generated from the Hawkes kernel with parameters α and β at time t_j and e_i is generated from e_j, then we say that event s_j is the parent event of event s_i. We formulate the hierarchical structure of global and local clusters with the Hierarchical Dirichlet Process (Teh et al., 2006). If the parameters θ_{z_i} of local clusters z_1, z_2, ..., z_n are equal to the parameter Θ_x of the global cluster, then we say there is a hierarchy between those local clusters and the global cluster, written as

θ_{z_1} = θ_{z_2} = ... = θ_{z_n} = Θ_x.

Now, we define the narrative reconstruction and the thread reconstruction problems as the problem of inferring the latent variables in S.
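The two pairs of generative relations can be made concrete with a small sampling sketch (the names are ours; we use the standard facts that background inter-event times at rate μ are exponential, and that, conditional on a triggered child existing under the kernel αβe^{−βΔt}, its delay is exponential with rate β):

```python
import numpy as np

def sample_event(rng, parent, t_prev, e0, Sigma0, Sigma_v, mu, beta):
    """Draw (t_i, e_i) from one of the two generative relations.

    parent: None for a self-initiated event, otherwise the (t_j, e_j)
    of the parent event in the same local cluster."""
    if parent is None:
        # background: exponential waiting time at base rate mu,
        # event vector Gaussian around the cluster base vector e0
        t = t_prev + rng.exponential(1.0 / mu)
        e = rng.multivariate_normal(e0, Sigma0)
    else:
        t_j, e_j = parent
        # triggered: conditional on the child existing, its delay is
        # Exp(beta); its vector is Gaussian around the parent's vector
        t = t_j + rng.exponential(1.0 / beta)
        e = rng.multivariate_normal(e_j, Sigma_v)
    return t, e
```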

Model
We now describe clustering a mixture of event streams S with the Gaussian Marked Hawkes Process and the hierarchical Dirichlet process. We first propose the Gaussian Marked Hawkes Process (GMHP), which models the temporal and spatial locality assumptions described in section 4. After defining the GMHP, we propose the Hierarchical Dirichlet Gaussian Marked Hawkes Process (HD-GMHP), a combination of the GMHP with the Hierarchical Dirichlet Process. The GMHP models event streams within the same local cluster z, and the HDP groups the local clusters into one global cluster x.

Model Description
In GMHP, we assume events are generated either by a past event or on their own initiative. If event s_i is generated by event s_j, then we say that s_j is the parent event of s_i. If event s_i occurs on its own initiative, the index of its parent event is 0. We define the intensity function given the parent event c_i as

λ(t | c_i) = μ if c_i = 0,   and   λ(t | c_i) = αβ e^{−β(t − t_{c_i})} otherwise.

To model the spatial locality of two D-dimensional event vectors e_i, e_{c_i}, we define the probability distribution of e_i as

p(e_i | c_i) = N(e_i | e_0, Σ_0) if c_i = 0,   and   p(e_i | c_i) = N(e_i | e_{c_i}, Σ_v) otherwise.

Here, e_0 is the base event vector for when c_i = 0; Σ_0 and Σ_v are the covariance matrices for events occurring on their own initiative and for events generated by a past event, respectively. From the above definitions, the intensity of event vector e at time t is

λ(t, e) = μ N(e | e_0, Σ_0) + Σ_{t_j < t} αβ e^{−β(t − t_j)} N(e | e_j, Σ_v).   (4)

The total intensity of GMHP is obtained by integrating this intensity over the event vector e.
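Equation 4 can be implemented directly; the following sketch (with our own helper names) sums the background term and one exponentially decayed, Gaussian-weighted term per past event:

```python
import numpy as np

def gauss_pdf(e, mean, cov):
    """Density of a multivariate Gaussian N(mean, cov) at point e."""
    e, mean = np.asarray(e, float), np.asarray(mean, float)
    d = mean.size
    diff = e - mean
    norm = ((2 * np.pi) ** d * np.linalg.det(cov)) ** -0.5
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

def gmhp_intensity(t, e, events, mu, alpha, beta, e0, Sigma0, Sigma_v):
    """Intensity of observing event vector e at time t (eq. 4):
    a background term plus one triggering term per past event (t_j, e_j)."""
    lam = mu * gauss_pdf(e, e0, Sigma0)
    for t_j, e_j in events:
        if t_j < t:
            lam += (alpha * beta * np.exp(-beta * (t - t_j))
                    * gauss_pdf(e, e_j, Sigma_v))
    return lam
```

Since each Gaussian integrates to one over e, integrating this intensity over the event vector recovers the purely temporal Hawkes intensity.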

Parameter estimation
From equation 1, the likelihood of the observed event stream can be computed as

L(θ) = [Π_{i=1}^{n} λ(t_i, e_i)] exp(−∫_0^T ∫ λ(s, e) de ds).

Since this likelihood of GMHP is hard to maximize directly, we instead define a likelihood conditioned on the given parent events:

L(θ | C) = [Π_{i=1}^{n} (μ N(e_i | e_0, Σ_0))^{C_{i0}} Π_{j<i} (αβ e^{−β(t_i − t_j)} N(e_i | e_j, Σ_v))^{C_{ij}}] exp(−∫_0^T ∫ λ(s, e) de ds),   (7)

where C_{ij} is 1 when c_i = j and 0 otherwise. By maximizing equation 7, we can estimate the parameters θ = {μ, α, e_0, Σ_0, Σ_v}. The inference of the parent events is described in section 6.
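A sketch of the parent-conditioned log-likelihood of equation 7 (with our own naming; the survival term uses the fact that the Gaussian mark density integrates to one over e, leaving the purely temporal compensator μT + Σ_j α(1 − e^{−β(T − t_j)})):

```python
import numpy as np

def log_gauss(e, mean, cov):
    """Log-density of a multivariate Gaussian N(mean, cov) at point e."""
    diff = np.asarray(e, float) - np.asarray(mean, float)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (diff.size * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.inv(cov) @ diff)

def gmhp_loglik(events, parents, mu, alpha, beta, e0, Sigma0, Sigma_v, T):
    """Log-likelihood of an event stream given parent assignments.

    events: list of (t_i, e_i); parents[i] is 0 for a self-initiated
    event, otherwise the 1-based index j of its parent (t_j < t_i)."""
    ll = 0.0
    for i, (t_i, e_i) in enumerate(events):
        if parents[i] == 0:
            ll += np.log(mu) + log_gauss(e_i, e0, Sigma0)
        else:
            t_j, e_j = events[parents[i] - 1]
            ll += (np.log(alpha * beta) - beta * (t_i - t_j)
                   + log_gauss(e_i, e_j, Sigma_v))
    # compensator: integral of the total intensity over [0, T]
    ll -= mu * T
    ll -= sum(alpha * (1.0 - np.exp(-beta * (T - t_j))) for t_j, _ in events)
    return ll
```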

Modeling a Mixture of GMHP with the HDP
When clustering a mixture of streams using the Hawkes process, the exponential triggering function prevents two events with a large time difference from being assigned to the same global cluster. To solve this problem, (Mavroforakis et al., 2017) uses the HDP instead of the Dirichlet process used in (Du et al., 2015). The hierarchical structure of the HDP assigns a cluster label with probability proportional to the size of the cluster, which allows two events with a large time difference to be assigned to the same cluster. For the same reason, we use the HDP to model a mixture of GMHPs. We consider each GMHP in the mixture as a table in the Chinese Restaurant Franchise (CRF) metaphor. Since the intensity of the k-th GMHP, λ_k(t), represents how likely an event is to occur at table k at time t, we use the intensity in place of the number of customers in the CRF metaphor. The whole generative process of HD-GMHP is as follows.
For each event s_n:

1) Sample the local cluster (table) z_n: an existing table k is chosen with probability proportional to its intensity λ_k(t_n), and a new table with probability proportional to the concentration parameter.

2) If a new table is opened, sample its global cluster (dish) x_n: an existing global cluster m is chosen with probability proportional to N_m, and a new global cluster with probability proportional to the concentration parameter, in which case we increment K. This assignment is interpreted as setting the parameter θ_{x_n} for local cluster z_n. Here, N_m is the number of local clusters in global cluster m.

3) Sample the parent event:

c_n ∼ μ_{x_n} δ(N_{z_n} + 1) + Σ_{j=1}^{N_{z_n}} g_{x_n}(t_j) δ(j)   (10)

If c_n = N_{z_n} + 1, then replace c_n with 0 (a self-initiated event) and sample the event vector from N(e_0, Σ_0); otherwise sample it from N(e_{c_n}, Σ_v).

Inference

We infer the latent variables of HD-GMHP with the Sequential Monte Carlo (SMC) method (Doucet et al., 2001). To calculate the posterior of the latent variables z and x at each timestamp t_i, we need the estimated parameters to compute the intensity λ(t_i). As described in section 5.1.2, the parameter estimation step needs the parent event information; in our proposed inference, the parent events are inferred within SMC. The inference algorithm is summarized in algorithm 1.
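The table-assignment idea, with intensities standing in for customer counts, can be sketched as follows (`gamma` plays the role of the CRF concentration parameter; names are illustrative, not the paper's notation):

```python
import numpy as np

def local_cluster_probs(intensities, gamma):
    """Probability of seating an incoming event at each existing GMHP
    'table' or at a new one. intensities[k] = lambda_k(t) replaces the
    customer count of table k in the Chinese Restaurant Franchise;
    gamma is the concentration parameter for opening a new table."""
    weights = np.append(np.asarray(intensities, float), gamma)
    return weights / weights.sum()
```

Tables whose recent events keep their intensity high attract the next event; as all intensities decay, the probability of opening a new table grows.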

Sequential Monte Carlo with parent event inference
To approximate the posterior of the latent variables, SMC samples the latent variables from a proposal distribution and computes a weight for each sampled particle, called the particle weight. To infer the parent event within SMC, we define the particle weight of our modified SMC as follows: Here, ψ_n^i is (x_n^i, z_n^i). We denote the two factors on the right-hand side of equation 12 by w^ψ and w^δ, respectively.
Here, δ_n^i is (x_n^i, z_n^i, c_n^i).
We use p(ψ_n | ψ_{1:n−1}, s_{1:n}^o) as the proposal distribution of ψ_n^i in equation 13 to minimize the variance of w_n^i (Doucet et al., 2000), and p(c_n | δ_{1:n−1}, ψ_n, t_n, s_{1:n}^o) as the proposal distribution of c_n^i. When calculating the probability of t_n in equation 17, we assume the parameters μ_{1:K}, α_{1:K} are given (Carvalho et al., 2010). From the likelihood of GMHP, the probability term p(t_n | ψ_{1:n}, s_{1:n−1}^o) in equation 17 can be calculated as λ_{z_n}(t_n) e^{−Λ(t_n, t_{n−1})}, where

Λ(t_n, t_{n−1}) = ∫_{t_{n−1}}^{t_n} λ_{z_n}(s) ds.

For the probability terms p(e_n | c_n, rest) and p(e_n | z_n, rest) in equation 17, as explained in the sampling process of ψ_n, we can calculate them with Student's t-distribution. With the particle weight update rule 17 and the parameter update rule described in section 6.2, we infer the latent variables with algorithm 1.
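The SMC machinery relies on normalizing and resampling the particle weights. A generic multinomial resampling step looks as follows; this is a standard sketch, not the paper's algorithm 1, which additionally updates the GMHP parameters between steps:

```python
import numpy as np

def resample_particles(particles, log_weights, rng):
    """Multinomial resampling: normalize particle weights (in log space
    for numerical stability) and draw a new, equally weighted particle
    set with probability proportional to those weights."""
    lw = np.asarray(log_weights, float)
    w = np.exp(lw - lw.max())   # subtract max to avoid underflow
    w /= w.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)
    # after resampling, all particles carry equal (zero) log-weight
    return [particles[i] for i in idx], np.zeros(len(particles))
```

In practice this step is typically triggered only when the effective sample size of the weights drops below a threshold.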

Updating Parameter
From equation 7 and the priors of the parameters used in GMHP, we can estimate the parameters in the following form.

Approximation
To reduce the computation time in the inference algorithm, we use several approximation strategies.

Marginal distribution Approximation
To calculate p(e_n | z_n, ψ_{1:n−1}, t_n, s_{1:n−1}^o) in equation 17, we need to marginalize p(e_n, c_n | z_n, ψ_{1:n−1}, t_n, s_{1:n−1}^o), which has time complexity linear in the number of events in z_n and causes the time complexity of equation 17 to be O(n).
To reduce the time complexity, we note that the event vector e_n is sampled from a Gaussian mixture in which the influence of each component decreases exponentially. We assume the marginal distribution p(e_n | z_n, ψ_{1:n−1}, t_n, s_{1:n−1}^o) can be approximated by p(e_n | c_{1:n} = 0, z_n, ψ_{1:n−1}, t_n, s_{1:n−1}^o). With this approximation, we can calculate the posterior predictive with Student's t-distribution:

p(e_n | z_n, ψ_{1:n−1}, t_n, s_{1:n−1}^o) = t_{ν_n}(e_n | m_n, ((κ_n + 1)/(κ_n ν_n)) S_n),

where ν_n = 2α_0 + N_{z_n} and κ_n = λ_{e_0} + N_{z_n}. To calculate p(e_n | c_n, t_n, s_{1:n−1}^o, ψ_{1:n}) in equation 17, we need to calculate the posterior predictive for each past event.
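The Student's t posterior predictive above can be evaluated with a direct implementation of the multivariate t log-density (our own naming; `scale` plays the role of ((κ_n + 1)/(κ_n ν_n)) S_n and `nu` the role of ν_n):

```python
import math
import numpy as np

def mvt_logpdf(e, mean, scale, nu):
    """Log-density of a multivariate Student's t distribution
    t_nu(mean, scale) in d dimensions."""
    e, mean = np.asarray(e, float), np.asarray(mean, float)
    d = mean.size
    diff = e - mean
    _, logdet = np.linalg.slogdet(scale)
    quad = diff @ np.linalg.inv(scale) @ diff
    return (math.lgamma((nu + d) / 2) - math.lgamma(nu / 2)
            - 0.5 * (d * math.log(nu * math.pi) + logdet)
            - 0.5 * (nu + d) * math.log1p(quad / nu))
```

For d = 1 and ν = 1 this reduces to the Cauchy density, which provides a convenient sanity check.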
To reduce the computation time of this calculation, we approximate the probability distribution p(e_n | c_n, t_n, s_{1:n−1}^o, ψ_{1:n}) as follows.

Experiment
In this section, we demonstrate the narrative reconstruction and thread reconstruction performance of our model on a corpus of New York Times articles and on the Wikipedia conversation dataset.

Dataset
New York Times Dataset: We collected 112,538 New York Times news articles from January 2016 to July 2017. The dataset contains the text, timestamp, news section, and keywords of each article. The keywords are semantic tags specified by the newsroom to indicate the main topics of the articles. We select news articles in the sections "U.S.", "World", "Opinion", and "Sports" that contain at least one of the ten most frequently used keywords. The statistics of these keywords are described in table 1. Further, we select articles with more than ten words in their bodies. The final number of articles used in our experiment is 16,858. The dataset is publicly available. Wikipedia Conversation Dataset: This dataset was released by (Danescu-Niculescu-Mizil et al., 2012). It contains the timestamp, the initial post of the conversation, "reply to" link information, and the text of each post in conversation threads on Wikipedia talk pages. We select threads that have ten or more posts from September 2010 to December 2010. The final number of posts used in our experiment is 2,004 and the final number of threads is 154.

Preprocessing
To apply our model to the real-world datasets, we represent each event with time information and an event vector. For the time information, we take the first article or post and set its time to zero, set the last article or post to one, and scale the timestamps of all other articles and posts accordingly. To extract the event vectors, we use different vectorization methods for the two datasets. For the NYT dataset, we use the document topic vector from LDA (Blei et al., 2003). For the Wikipedia dataset, because there are only a few words in each post, we cannot use the LDA topic vector, so we use the averaged word embedding vector (Mikolov et al., 2013) of the words in each post. Narrative reconstruction: To demonstrate the narrative reconstruction performance of our model, we apply the inference method to our corpus of NYT articles. We use the set of keywords of each article as the ground truth label. We then run our model and consider the set of articles with the same global cluster label as one narrative. We compare the results with the ground truth labels using the common clustering metrics AMI and ARI (Hubert and Arabie, 1985; Vinh et al., 2010) to evaluate the narrative reconstruction performance of our model. We compare HD-GMHP with the following baselines: LDA and HDP with DBSCAN, and the Hierarchical Dirichlet Hawkes Process (HDHP) (Mavroforakis et al., 2017), a state-of-the-art model for the text and continuous timestamps of an event. Also, to measure the similarity of each recovered narrative to the ground truth narrative, we use the F1 score of the top ten narratives. Thread reconstruction: In this experiment, we use two evaluation criteria: post grouping and reply structure recovery, which is simply the recovery of the child nodes. Here, we use a different child node recovery task than the one used in previous research.
In our task, we do not provide the initial post of each thread, while previous research does. This makes the thread reconstruction problem more general and more difficult.
In post grouping, we use the initial post of each post as the ground truth label and measure the clustering metrics used on the NYT dataset. In the child node recovery experiment, we use the parent event information inferred by our method as the recovered tree structure of the threads. We then measure the node metrics used in previous research (Wang et al., 2011a; Dehghani et al., 2013). We compare our model with the following baselines: HDHP, and a naive baseline that reconstructs threads as a single linked list of posts in chronological order.

Metrics
AMI and ARI are commonly used to measure clustering performance (Hubert and Arabie, 1985; Vinh et al., 2010). P_node and R_node measure the local similarity between two thread structures (Wang et al., 2011a):
P_node = (1/N) Σ_i |child_GT(i) ∩ child_E(i)| / |child_E(i)|,   R_node = (1/N) Σ_i |child_GT(i) ∩ child_E(i)| / |child_GT(i)|,

where child_GT(i) and child_E(i) are the sets of children of node i in the ground truth thread structure and the recovered thread structure, respectively. (Wang et al., 2011a) also proposed P_path and R_path to measure the similarity of the global structure of two threads. The path metrics are sensitive to the recovered initial post of each thread, but since we do not provide the initial post of each thread in our experiment, the path metrics are not appropriate here, so we measure only the node metrics. Table 2 shows the clustering accuracy of our method and the baseline methods on the real-world datasets. We average the results over five runs for each model. The highest value for each metric is indicated in boldface. From the results, we establish that our model outperforms the baseline methods in both the NYT narrative reconstruction task and the Wikipedia thread reconstruction task. For the NYT dataset, to examine the accuracy of our model in more detail, we compute the F-scores for the top ten most frequent labels and the micro and macro averages in table 3. To compute the F-score between the true labels and the recovered cluster labels, we select the cluster with the highest F-score as the corresponding cluster. From the results, we establish that our model performs better than the baseline model, HDHP. Table 4 shows the thread reconstruction results of our model and the baseline models on the Wikipedia conversation dataset. Since the HDHP model does not infer parent events, we reconstruct its threads as a chronologically ordered linked list of the posts in each local cluster inferred by HDHP. From the F1_node scores, we establish that our model performs better than the other baseline models.
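The node metrics from the Metrics subsection can be computed as follows; this sketch averages the per-node overlaps, one common convention, though (Wang et al., 2011a) should be consulted for the exact averaging used there:

```python
def node_precision_recall(children_gt, children_rec):
    """P_node and R_node: average per-node overlap between ground-truth
    and recovered child sets. Both arguments map a node id to the set
    of its children in the respective thread structure."""
    nodes = set(children_gt) | set(children_rec)
    p_terms, r_terms = [], []
    for i in nodes:
        gt = children_gt.get(i, set())
        rec = children_rec.get(i, set())
        hit = len(gt & rec)
        if rec:
            p_terms.append(hit / len(rec))   # precision of node i's children
        if gt:
            r_terms.append(hit / len(gt))    # recall of node i's children
    p = sum(p_terms) / len(p_terms) if p_terms else 0.0
    r = sum(r_terms) / len(r_terms) if r_terms else 0.0
    return p, r
```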

Results
To demonstrate the robustness of HD-GMHP to changes in the dimensionality of the input vector, we measure the performance of each task using 50-, 100-, and 150-dimensional vectors. The results are described in tables 5 and 6. From the results, we verify that there are no drastic changes in performance on either the NYT dataset or the Wikipedia dataset.

                 AMI     ARI
HD-GMHP (50D)    0.2310  0.1518
HD-GMHP (100D)   0.2479  0.1416
HD-GMHP (150D)   0.2421  0.1191

Conclusion

In this paper, we defined the narrative and thread reconstruction problems as clustering problems. To cluster event streams with continuous time information and triggering event information, we proposed the Gaussian Marked Hawkes process, which models event streams with additional event information represented in vector form. Furthermore, we combined our GMHP model with the HDP to cluster mixtures of event streams (HD-GMHP). We showed that our model performs better than several baseline methods in both narrative reconstruction on a dataset of NYT articles and thread reconstruction on a dataset of Wikipedia conversations.