Jointly Event Extraction and Visualization on Twitter via Probabilistic Modelling

Event extraction from texts aims to detect structured information such as what has happened, to whom, where, and when. Event extraction and visualization are typically considered as two different tasks. In this paper, we propose a novel approach based on probabilistic modelling to jointly extract and visualize events from tweets, where both tasks benefit from each other. We model each event as a joint distribution over named entities, a date, a location and event-related keywords. Moreover, both tweets and event instances are associated with coordinates in the visualization space. The manifold assumption that the intrinsic geometry of tweets is a low-rank, non-linear manifold within the high-dimensional space is incorporated into the learning framework through a regularization term. Experimental results show that the proposed approach can effectively handle both event extraction and visualization and performs remarkably better than both the state-of-the-art event extraction method and a pipeline approach that performs event extraction and visualization sequentially.


Introduction
Event extraction, one of the important and challenging tasks in information extraction, aims to detect structured information such as what has happened, to whom, where and when. The outputs of event extraction could be beneficial for downstream applications such as summarization and personalized news systems. Data visualization, an important exploratory data analysis task, provides a simple way to reveal the relationships among data (Nakaji and Yanai, 2012).
Although event extraction and visualization are two different tasks and are typically studied separately in the literature, they are highly related. Documents which are close to each other in the low-dimensional visualization space are likely to describe the same event, and events at nearby locations in the visualization space are likely to share similar event elements. Therefore, jointly learning the two tasks could potentially benefit both. However, it is not straightforward to learn event extraction and visualization jointly, since event extraction usually relies on semantic parsing results (McClosky et al., 2011) while visualization is accomplished by dimensionality reduction (Iwata et al., 2007; López-Rubio et al., 2002).
In this paper, we propose a novel probabilistic model, called the Latent Event Extraction & Visualization (LEEV) model, for joint event extraction and visualization on Twitter. It is partly inspired by the Latent Event Model (LEM) (Zhou et al., 2015), in which each tweet is assigned to one event instance and each event is modeled as a joint distribution over named entities, a date/time, a location and the event-related keywords. Going beyond LEM, we assume that each event is not only modeled as a joint distribution over event elements as in (Zhou et al., 2015), but is also associated with coordinates in the visualization space. The Euclidean distance between a tweet and each event determines which event the tweet should be assigned to. Furthermore, the manifold assumption that the intrinsic geometry of tweets is a low-rank, non-linear manifold within the high-dimensional space is incorporated into the learning framework through a regularization term. Experimental results show that the proposed approach can effectively handle both event extraction and visualization and performs remarkably better than both the state-of-the-art event extraction method and a pipeline approach for event extraction and visualization.

Related Work
Our proposed work is related to two lines of research: event extraction, and joint topic modeling and visualization.

Event Extraction
Research on event extraction from tweets can be categorized into domain-specific and open-domain approaches. Domain-specific approaches usually have target events in mind and aim to extract events from a particular location or for emergency response during natural disasters. Anantharam et al. (2015) focused on extracting city events by solving a sequence labeling problem. Evaluation was carried out on a real-world dataset consisting of event reports and tweets collected over four months from the San Francisco Bay Area. TSum4act (Nguyen et al., 2015) was designed for emergency response during disasters and was evaluated on a dataset containing 230,535 tweets.
Most open-domain approaches focus on extracting a summary of events discussed in social media. For example, Benson et al. (2011) proposed a structured graphical model which simultaneously analyzed individual messages, clustered them, and induced a canonical value for each event. Capdevila et al. (2015) proposed a model named Tweet-SCAN based on the hierarchical Dirichlet process to detect events from geo-located tweets. To extract more information, a system called SEEFT (Wang et al., 2015) used links in tweets and combined tweets and the linked articles to identify events. Zhou et al. (2015) proposed an unsupervised Bayesian model, called the latent event model (LEM), for event extraction from Twitter, assuming that each tweet message is assigned to one event instance and each event is modeled as a joint distribution over named entities, a date/time, a location and the event-related keywords. Our proposed method is partly inspired by (Zhou et al., 2015). However, different from previous methods, our approach not only extracts the structured representation of events, but also learns the coordinates of events and tweets simultaneously.

Joint Topic Modeling and Visualization
Since our proposed approach can be considered as a variant of topic models, we also review related work on joint topic modeling and visualization here.
Traditionally, topic modeling and visualization are considered as two disjoint tasks that can be combined for pipeline processing. For example, probabilistic latent semantic analysis (Hofmann, 1999) can first be performed, followed by parametric embedding (Iwata et al., 2007). Another pipeline approach (Millar et al., 2009) is based on latent Dirichlet allocation followed by self-organizing maps (López-Rubio et al., 2002).
Jointly modeling topics and visualization is a new problem explored in very few works. The state-of-the-art is a joint approach proposed in (Iwata et al., 2008). In this model, both documents and topics are assumed to have latent coordinates in a visualization space. The topic proportions of a document are determined by the distances between the document and the topics in the visualization space, and each word is drawn from one of the topics according to the document's topic proportions. A visualization is obtained by fitting the model to a given set of documents using the EM algorithm. Following the same line, by considering the local consistency in terms of the intrinsic geometric structure of the document manifold, an unsupervised probabilistic model, called SEMAFORE, was proposed in (Le and Lauw, 2014a) to preserve the manifold in the lower-dimensional space. In (Le and Lauw, 2014b), a semantic visualization model is learned by associating with each document a coordinate in the visualization space, a multinomial distribution in the topic space, and a directional vector in a high-dimensional unit hypersphere in the word space.
Our work is partly inspired by (Le and Lauw, 2014a). However, our proposed approach differs from (Le and Lauw, 2014a) in that events, instead of topics, are modelled as joint distributions over event elements. Both tweets and events are associated with coordinates in the visualization space.

Methodology
We follow the same pre-processing steps described in (Zhou et al., 2015) to filter out non-event-related tweets and extract dates, locations, and named entities by temporal resolution, part-of-speech (POS) tagging and named entity recognition. The pre-processed tweets are then fed into our proposed model for event extraction and visualization. We describe our model in more detail below.

Table 1: Notations used in this paper.
            number of keywords in w_m
θ_ey        probability of named entity y in event e
φ_ed        probability of date d in event e
ψ_el        probability of location l in event e
ω_ek        probability of keyword k in event e
β, γ, η, λ  Dirichlet hyperparameters
χ, δ        Normal hyperparameters
G           dimension of the visualization space

Latent Event Extraction & Visualization (LEEV) Model
We propose an unsupervised latent variable model called the Latent Event Extraction & Visualization (LEEV) model which simultaneously extracts events from tweets and generates a visualization of the events. Table 1 lists notations used in this paper.
In LEEV, each tweet message w_m, m ∈ {1...M}, is associated with a latent coordinate x_m in the visualization space, and each event e ∈ {1...E} is associated with a coordinate ϕ_e. Each tweet w_m is assigned to one event instance z_m = e, and event e is modeled as a joint distribution over the named entities y, the date d when e happened, the location l and the event-related keywords k. The generative process of the model is described as follows:

1. For each event e ∈ {1...E}:
   (a) Draw the element distributions θ_e ~ Dirichlet(β), φ_e ~ Dirichlet(γ), ψ_e ~ Dirichlet(η), ω_e ~ Dirichlet(λ);
   (b) Draw the event coordinate ϕ_e ~ N(0, χ⁻¹I).
2. For each tweet w_m, m ∈ {1...M}:
   (a) Draw the tweet coordinate x_m ~ N(0, δ⁻¹I);
   (b) Draw an event assignment z_m = e with probability P(e|x_m, Φ);
   (c) Draw the named entity y ~ Multinomial(θ_e), the date d ~ Multinomial(φ_e), the location l ~ Multinomial(ψ_e), and each keyword k ~ Multinomial(ω_e).

Here, β, γ, η, λ, χ, δ are priors, I is an identity matrix, and P(e|x_m, Φ) is the probability of the tweet w_m with coordinate x_m belonging to the event e. It is defined as

P(e|x_m, Φ) = exp(-½‖x_m - ϕ_e‖²) / Σ_{e'=1}^{E} exp(-½‖x_m - ϕ_{e'}‖²).    (1)

It is calculated from the normalized Euclidean distances between the tweet w_m and the events: when the Euclidean distance between a tweet w_m and an event e is small, the probability that tweet w_m belongs to event e becomes large. The graphical model of LEEV is shown in Figure 1.

The parameters to be learned are Θ = {θ_e, φ_e, ψ_e, ω_e}_{e=1}^{E}, the tweets' coordinates X = {x_m}_{m=1}^{M} and the events' coordinates Φ = {ϕ_e}_{e=1}^{E}, which are collectively denoted as B = ⟨Θ, X, Φ⟩. The log likelihood of B given the tweets W is

L(B|W) = Σ_{m=1}^{M} log Σ_{e=1}^{E} P(e|x_m, Φ) P(w_m|e, Θ),    (2)

where P(w_m|e, Θ) is the probability of the observed entity, date, location and keywords of w_m under event e's distributions. For the events' coordinates ϕ_e and the tweets' coordinates x_m, we use Gaussian priors with zero mean and spherical covariance:

p(ϕ_e) = N(ϕ_e | 0, χ⁻¹I),    p(x_m) = N(x_m | 0, δ⁻¹I).
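To make the event-assignment probability in Equation (1) concrete, the following minimal Python sketch (our own illustration; the function name and the two-event toy coordinates are not part of the model) computes it as a softmax over negative halved squared Euclidean distances:

```python
import math

def event_posterior(x_m, event_coords):
    """P(e | x_m, Phi): softmax over -0.5 * squared Euclidean distance
    between a tweet's coordinate and each event's coordinate (Eq. 1)."""
    scores = [-0.5 * sum((a - b) ** 2 for a, b in zip(x_m, phi))
              for phi in event_coords]
    mx = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    return [v / z for v in exps]

# A tweet whose coordinate is close to event 0 gets most of the probability mass.
probs = event_posterior((0.1, 0.0), [(0.0, 0.0), (3.0, 3.0)])
```

As the distance between a tweet and an event shrinks, the exponentiated score and hence the assignment probability grows, matching the behaviour described above.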

LEEV with Manifold Regularization
Recent studies suggest that the intrinsic geometry of textual data is a low-rank, non-linear manifold lying in the high-dimensional space (Cai et al., 2008; Zhang et al., 2005). We therefore assume that when two tweets w_i and w_j are close in the intrinsic geometry of the manifold Υ, their low-rank representations should be close as well.
To capture this assumption, we consider Laplacian Eigenmaps (LE) (Belkin and Niyogi, 2003), which have been commonly used in manifold learning algorithms (Le and Lauw, 2014a). LE constructs a k-nearest-neighbors graph to represent data residing on a low-dimensional manifold embedded in a higher-dimensional space. In this paper, we use LE to incorporate the neighborhood information of tweets. We construct a manifold graph with edges connecting two data points w_i and w_j, setting the edge weight υ_ij = 1 if w_j is one of the k nearest neighbors of w_i, and υ_ij = 0 otherwise. We represent each tweet as a word-count vector, i.e., each element of a vector is weighted by its corresponding term frequency, and use the cosine similarity metric to measure the distance between tweets when constructing the manifold graph. We also tried vectors with the TF-IDF weighting strategy to represent tweets and found that word-count vectors give better results.

We apply a regularization framework to incorporate the manifold structure into the learning model. The new regularized log-likelihood function is

L(B|W, Υ) = L(B|W) - ξR(B|Υ),    (3)

where ξ is the regularization parameter; LEEV is thus a special case of the regularized model with ξ = 0. The second component R is a regularization function which consists of two parts:

R(B|Υ) = R⁺ + R⁻,    R⁺ = Σ_{i,j: υ_ij=1} F(w_i, w_j),    R⁻ = -Σ_{i,j: υ_ij=0} F(w_i, w_j),

where F is a distance function that operates on the low-rank space. We define F as the squared Euclidean distance of coordinates in the visualization space:

F(w_i, w_j) = ‖x_i - x_j‖².

Minimizing R⁺ leads to minimizing the distance between neighbors, and minimizing R⁻ leads to maximizing the distance between non-neighbors. By enforcing manifold learning, we capture the spirit of keeping neighbors close and keeping non-neighbors apart.
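The k-nearest-neighbors graph described above can be sketched as follows (a minimal Python illustration under our own naming; real tweets would first pass through the pre-processing of Section 3, and ties are broken arbitrarily):

```python
from collections import Counter
import math

def cosine(u, v):
    """Cosine similarity between two sparse word-count vectors (Counters)."""
    dot = sum(c * v.get(w, 0) for w, c in u.items())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_graph(tweets, k):
    """Binary adjacency: graph[i][j] = 1 iff tweet j is among the
    k most cosine-similar neighbours of tweet i (the weight upsilon_ij)."""
    vecs = [Counter(t.split()) for t in tweets]
    n = len(vecs)
    graph = [[0] * n for _ in range(n)]
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: cosine(vecs[i], vecs[j]), reverse=True)
        for j in ranked[:k]:
            graph[i][j] = 1
    return graph

g = knn_graph(["riots in london", "london riots tonight",
               "senate vote on bill"], k=1)
```

Note the resulting graph is directed in general, since the k-nearest-neighbor relation is not symmetric.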

Parameter Estimation
As in Equation 2, the presence of the sum over e prevents the logarithm from directly acting on the joint distribution. If the corresponding latent event z_m of each tweet w_m were known, {W, Z} would constitute the complete data, and maximizing the log likelihood of the complete data, log P(W, Z|B), could be easily done. However, in practice we do not observe the latent variables Z and only have the incomplete data W. Therefore, the expectation maximization (EM) algorithm is employed to handle the incomplete data. EM provides an efficient iterative procedure to compute the maximum likelihood estimates of probabilistic models with unobserved latent variables.
The class posterior probability of the m-th tweet under the current parameter values B̂,

P(z_m = e|m, B̂) = P(e|x̂_m, Φ̂) P(w_m|e, Θ̂) / Σ_{e'=1}^{E} P(e'|x̂_m, Φ̂) P(w_m|e', Θ̂),

corresponds to the E-step of the EM algorithm.
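The E-step can be sketched in a few lines of Python (our own illustration; inputs are assumed to be the coordinate-based prior of Equation 1 and per-event log-likelihoods of the tweet's elements, and the computation is done in log space for stability):

```python
import math

def e_step(prior_probs, log_likelihoods):
    """Responsibility P(z_m = e | m, B): proportional to
    P(e | x_m, Phi) * P(w_m | e, Theta), normalised over events.
    prior_probs[e] comes from the coordinates; log_likelihoods[e] is
    log P(w_m | e, Theta) for the tweet's entity/date/location/keywords."""
    logs = [math.log(p) + ll for p, ll in zip(prior_probs, log_likelihoods)]
    mx = max(logs)                        # log-sum-exp normalisation
    exps = [math.exp(v - mx) for v in logs]
    z = sum(exps)
    return [v / z for v in exps]

# Coordinates slightly favour event 0, but the tweet's content strongly
# favours event 1, so the responsibility shifts to event 1.
resp = e_step([0.6, 0.4], [-5.0, -1.0])
```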
In the M-step, the model parameters B are updated by maximizing the regularized conditional expectation of the complete-data log likelihood with the priors:

Q(B|B̂) = Σ_{m=1}^{M} Σ_{e=1}^{E} P(z_m = e|m, B̂) log [P(e|x_m, Φ) P(w_m|e, Θ)] + Σ_{e=1}^{E} log p(ϕ_e) + Σ_{m=1}^{M} log p(x_m) - ξR(B|Υ),

where P(z_m = e|m, B̂) is calculated in the E-step. Maximizing Q(B|B̂) w.r.t. θ_ey, φ_ed, ψ_el, ω_ek gives the next estimates in closed form, e.g.

θ_ey = [β + Σ_m P(z_m = e|m, B̂) n_m(y)] / [Yβ + Σ_m P(z_m = e|m, B̂) Σ_{y'} n_m(y')],

with analogous updates for φ_ed, ψ_el and ω_ek, where n_m(·) denotes element counts in tweet w_m and Y, D, L, K are the total numbers of distinct named entities, dates, locations, and words appearing in the whole Twitter corpus, respectively. ϕ_e and x_m cannot be solved in closed form and are estimated by maximizing Q(B|B̂) using a quasi-Newton method. The gradients of Q(B|B̂) w.r.t. ϕ_e and x_m are

∂Q/∂ϕ_e = Σ_{m=1}^{M} (P(z_m = e|m, B̂) - P(e|x_m, Φ)) (x_m - ϕ_e) - χϕ_e,
∂Q/∂x_m = Σ_{e=1}^{E} (P(z_m = e|m, B̂) - P(e|x_m, Φ)) (ϕ_e - x_m) - δx_m - ξ ∂R(B|Υ)/∂x_m,

where the gradient of R(B|Υ) w.r.t. x_m is

∂R(B|Υ)/∂x_m = 2 Σ_{j: υ_mj=1} (x_m - x_j) - 2 Σ_{j: υ_mj=0} (x_m - x_j).

We set the parameters χ = 0.00005, δ = 0.05, β = γ = η = λ = 0.1 and run the EM algorithm for 50 iterations. Finally, we select the entity y, date d, location l and two keywords k with the highest probabilities to form a tuple ⟨y, d, l, k⟩ representing each potential event.
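The closed-form M-step update for one event's multinomial can be sketched as follows (our own illustration; the function and variable names are not from the paper, and the same code serves θ_e, φ_e, ψ_e and ω_e with their respective Dirichlet priors):

```python
def m_step_multinomial(resp, counts, beta, vocab_size):
    """Dirichlet-smoothed M-step update for one event's distribution:
    theta_ey proportional to beta + sum_m P(z_m = e | m) * n_m(y),
    normalised so the vocab_size entries sum to one.
    resp[m] is the E-step responsibility of this event for tweet m;
    counts[m] maps element ids to their counts in tweet m."""
    theta = [beta] * vocab_size           # start from the prior pseudo-counts
    for r, cnt in zip(resp, counts):
        for y, n in cnt.items():
            theta[y] += r * n             # responsibility-weighted counts
    total = sum(theta)
    return [v / total for v in theta]

# Two tweets; the first (mentioning entity 0) belongs to this event
# with high responsibility, so entity 0 dominates the estimate.
theta = m_step_multinomial([0.9, 0.1], [{0: 1}, {1: 1}],
                           beta=0.1, vocab_size=3)
```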

Post-processing
In order to filter out spurious events, we calculate a correlation coefficient for each event element. An event element is removed if its correlation coefficient is less than a threshold C_e, and an event is removed if the sum of the correlation coefficients of all its four event elements is less than a threshold C_t.
For an event element A, its correlation coefficient is calculated as

C(A) = Σ_{B ∈ Ω, B ≠ A} #(A, B) / √(#(A) · #(B)),

where Ω is the set of the four event elements ⟨y, d, l, k⟩, #(x) indicates the number of times x appeared in the whole corpus, and #(A, B) the number of times A and B appeared together. We empirically set C_e to 0.4 and C_t to 4.
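The thresholding logic of this post-processing step can be sketched as follows (a minimal Python illustration under our own naming; the coefficients are assumed to have been computed already, and the example values are hypothetical):

```python
def filter_events(events, C_e=0.4, C_t=4.0):
    """Post-processing: drop an element whose correlation coefficient is
    below C_e, and drop a whole event when the coefficients of its four
    elements sum to less than C_t."""
    kept = []
    for ev in events:                     # ev: slot -> (value, coefficient)
        if sum(c for _, c in ev.values()) < C_t:
            continue                      # spurious event: discard entirely
        kept.append({s: v for s, (v, c) in ev.items() if c >= C_e})
    return kept

# Hypothetical coefficients: the first event survives but loses its weakly
# correlated location element; the second event is discarded as spurious.
events = [
    {"y": ("Obama", 1.5), "d": ("2010-12-18", 1.2),
     "l": ("Washington", 0.3), "k": ("tax cut", 1.4)},
    {"y": ("someone", 0.5), "d": ("2010-12-01", 0.5),
     "l": ("somewhere", 0.5), "k": ("stuff", 0.5)},
]
kept = filter_events(events)
```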

Experiments
In this section, we first describe the datasets used in our experiments and then present the experimental results.

Setup
We choose two datasets for model evaluation. The first one is the First Story Detection (FSD) dataset (Petrovic et al., 2013) (Dataset I), which contains 2,499 tweets published between 7th July and 12th September 2011. These tweets have been manually annotated with 27 events, covering a wide range of topics from accidents to science discoveries and from disasters to celebrity news. We filter out events mentioned in fewer than 15 tweets, since events mentioned in very few tweets are less likely to be significant. The final dataset contains 2,453 tweets annotated with 20 events. This dataset has been previously used for evaluating event extraction models, and the state-of-the-art results on it have been achieved using LEM (Zhou et al., 2015). We also create another dataset, called Dataset II, by manually annotating 1,000 tweets published in December 2010. A total of 20 events are annotated.

We compare our model with LEM (Zhou et al., 2015), which also extracts events as 4-tuples ⟨y, d, l, k⟩. The main difference between LEM and our model is that LEM directly estimates the event distributions from the sampled latent event labels, while we derive the distributions from the coordinates of tweets and events, x_m and ϕ_e. We re-implemented the system described in (Zhou et al., 2015) and used the same evaluation metrics: precision, recall and F-measure. Precision is defined as the proportion of correctly identified events out of the events returned by the system. Recall is defined as the proportion of true events that are correctly identified. For judging the correctness of an extracted 4-tuple ⟨y, d, l, k⟩, we use the following criteria:

• Do the entity y, location l, date d and keyword k that we have extracted refer to the same event?
• If the extracted representation contains keywords, are they informative enough to tell us what happened?
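The evaluation metrics defined above can be sketched as follows (a straightforward Python illustration; the example counts are hypothetical, not results from the paper):

```python
def precision_recall_f1(n_correct, n_returned, n_true):
    """Precision: share of correctly identified events among those the
    system returned. Recall: share of annotated true events that were
    correctly identified. F-measure: their harmonic mean."""
    p = n_correct / n_returned if n_returned else 0.0
    r = n_correct / n_true if n_true else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# e.g. 18 correct 4-tuples out of 20 returned, against 20 annotated events
p, r, f = precision_recall_f1(18, 20, 20)
```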
As mentioned in Section 2, parametric embedding (PE) (Iwata et al., 2007) is a nonlinear visualization method which takes a set of class posterior vectors as input and embeds samples in a low-dimensional Euclidean space. By minimizing a sum of Kullback-Leibler divergences, PE tries to preserve the posterior structure in the embedding space. In order to evaluate the visualization results, we compare our proposed method with a pipeline approach, event extraction using LEM (Zhou et al., 2015) followed by event visualization using PE (Iwata et al., 2007), referred to as LEM+PE.

Table 2 shows the event extraction results on the two datasets. LEEV+R is LEEV with manifold regularization incorporated, in which the model parameters are estimated by the EM algorithm described in Section 3.3. For LEEV and LEEV+R, the number of events E is set to 50 for both datasets. For LEEV+R, the neighborhood size k is set to 10 and the regularization parameter ξ is set to 1. For LEM, E is set to 25 for both datasets following the suggestion in (Zhou et al., 2015).

Event Extraction Results
We ran our experiments on a server equipped with a 3.40 GHz Intel Core i7 CPU and 8 GB of memory. The average running time of LEEV for one iteration is 2,328.1 seconds on Dataset I and 940.7 seconds on Dataset II. It can be observed that both LEEV and LEEV+R outperform the state-of-the-art results achieved by LEM on Dataset I. In particular, LEEV improves upon LEM by over 5% in F-measure, and with regularization, LEEV+R further improves upon LEEV by over 4%. A similar trend is observed on Dataset II, where both LEEV and LEEV+R outperform LEM and the best performance is given by LEEV+R. This shows the effectiveness of using regularization in LEEV; we further demonstrate its importance in the visualization results below. Overall, we see superior performance of LEEV+R over the other two models, with an F-measure of over 89% achieved on both datasets.
As described in Section 3.1, the coordinates of tweets and events are randomly initialized. Therefore, we would like to see whether the performance of event extraction is heavily influenced by random initialization. We repeat the experiments on the two datasets 10 times using LEEV+R. The experimental results are shown in Figure 2. It can be observed that the performance of LEEV+R is quite stable on both datasets. The standard deviation of F-measure on both Dataset I and Dataset II is 0.036, which shows that random initialization does not have a significant impact on the final performance of the model.

Impact of Number of Events E
We need to pre-set the number of events E in the proposed approach. Figure 3 shows the performance of event extraction based on LEEV+R for different values of E on the two datasets. It can be observed that the performance of the proposed approach improves as E increases, and when E goes beyond 50, we notice more balanced precision/recall values and a relatively stable F-measure. This shows that the proposed approach is not very sensitive to the number of events E as long as E is set to a relatively large value.

Impact of Neighborhood Size
As described in Section 3.2, the neighborhood information of tweets is incorporated into the learning framework. A manifold graph with edges connecting two tweets (or data points) w_i and w_j is constructed by setting the edge weight υ_ij = 1 if w_j is among the k nearest neighbors of w_i and υ_ij = 0 otherwise. Therefore, it is important to see whether the performance of LEEV+R depends heavily on the setting of k. Figure 4 shows the performance of our proposed approach with different neighborhood sizes k. It can be observed that the performance of LEEV+R is quite stable and largely independent of the value of k.

Visualization Results
We show the visualization results produced by the different approaches on the two datasets in Figures 5 and 6, respectively. We compare LEEV and LEEV+R with the pipeline approach LEM+PE. In the figures, each point represents a tweet, and different shapes and colors represent the different events the tweets are associated with. Each red cross represents an extracted event with coordinate ϕ_e.
For Dataset I, it can be observed from Figure 5(a) that the visualization result generated by LEM+PE is not informative. Tweets from different events are mixed together, and events are evenly distributed across the whole visualization space. Thus, this visualization does not provide any sensible information about the relationships between tweets and events. The result generated by LEEV without the manifold regularization term R seems better than that of LEM+PE, as shown in Figure 5(b). However, a large number of tweets are crowded together at the center, which makes it difficult to reveal the relations between tweets and events. The best visualization result is given by LEEV+R, shown in Figure 5(c), where different events are well separated and related events are located nearby. For example, the three events enclosed by the red circle represent "people died in terrorist attacks in Delhi, Oslo and Norway", the three events in the blue circle represent "riots in Ealing, Tottenham and Croydon", and the two events in the black circle represent "American credit rating" and "House debt bill". This shows that LEEV+R, with manifold learning incorporated, significantly improves upon LEEV without regularization and gives better visualization results. The relationships between events are directly reflected in the distances between their coordinates in the visualization space.
Similar visualization results are obtained on Dataset II. Figures 6(a) and 6(b) fail to convey the semantic relations between different events, while LEEV+R in Figure 6(c) separates tweets from different events well. The events in the red circle are government activities of the United States. The events in the blue circle can be categorized as traffic incidents: "Transport chaos caused by heavy snow", "Train to Paris crashed" and "Demonstrators attacked car carrying Prince Charles". Compared to LEM+PE and LEEV, LEEV+R gives much more informative visualization results.
To analyze the visualization results in more detail, the four representative events in the red circle of Figure 6(c) and their corresponding tweets are visualized in Figure 7. These four events are "Senate vote on repealing gay ban", "US state governor plan to visit North Korea", "Send letter to President Obama to stop tax cut deal" and "Congress passed the Child Nutrition Bill". Their corresponding tweets are denoted as green '△', blue '□', green '□' and blue '+', respectively, in Figure 7. It can be observed that these four events are all about government activities of the United States, and they are located close to each other in the low-dimensional visualization space. Moreover, the tweets describing the same event are located close to each other and center around their corresponding event, while tweets describing different events are far away from each other.

Conclusions
In this paper, we have proposed an unsupervised Bayesian model, called the Latent Event Extraction & Visualization (LEEV) model, to extract structured representations of events from social media and simultaneously visualize them in a two-dimensional Euclidean space. The proposed approach has been evaluated on two datasets. Experimental results show that the proposed approach outperforms the previously reported best result on Dataset I by nearly 10% in F-measure. Visualization results show that the proposed approach with manifold regularization can significantly improve the quality of event visualization. These results show that by jointly learning event extraction and visualization, our proposed approach is able to give better results on both tasks. In future work, we will investigate scalable and parallel model learning to explore the performance of our model on large-scale real-time event extraction and visualization.