Event extraction from Twitter using Non-Parametric Bayesian Mixture Model with Word Embeddings

To extract structured representations of newsworthy events from Twitter, unsupervised models typically assume that tweets involving the same named entities and expressed using similar words are likely to belong to the same event. Hence, they group tweets into clusters based on the co-occurrence patterns of named entities and topical keywords. However, there are two main limitations. First, they require the number of events to be known beforehand, which is not realistic in practical applications. Second, they don't recognise that the same named entity might be referred to by multiple mentions, so tweets using different mentions of the same entity would be wrongly assigned to different events. To overcome these limitations, we propose a non-parametric Bayesian mixture model with word embeddings for event extraction, in which the number of events can be inferred automatically and the issue of lexical variations for the same named entity can be dealt with properly. Our model has been evaluated on three datasets with sizes ranging from 2,499 to over 60 million tweets. Experimental results show that our model outperforms the baseline approach on all datasets by 5-8% in F-measure.


Introduction
Event extraction from text aims to automatically extract key information about events, such as what happened to whom, when and where. Previous research mainly focused on news articles, the best and most abundant source of newsworthy events. With the increasing popularity of social media platforms, events are also reported and discussed in social media apart from news articles. It was reported in (Petrovic et al., 2013) that even the 1% public Twitter stream covers 95% of the events reported on newswire. Extracting events from social media makes it possible to quickly understand what is being discussed, and can be further integrated into downstream applications such as tracking the public's viewpoints towards a certain event. However, due to the difficulty of acquiring annotated data for training and the short and informal texts common in social media, traditional approaches to event extraction from news articles (Grishman et al., 2005) are not directly applicable to social media data. Nevertheless, one important characteristic of social media data is that for most newsworthy events, there might be a high volume of redundant messages referring to the same event. An example of several tweets describing one event is given in Table 1. Approaches to event extraction from social media have largely exploited this redundancy characteristic (Xia et al., 2015; Popescu et al., 2011; Abdelhaq et al., 2013). Most of the previous methods aim to discover new or previously unidentified events without extracting structured representations of events. Ritter et al. (2012) presented a system called TwiCal to extract and categorize events from Twitter. The strength of association between each named entity y and date d is measured based on the number of co-occurring tweets in order to form a binary tuple ⟨y, d⟩ to represent an event. However, TwiCal relies on a supervised sequence labeler trained on tweets annotated with event mentions for the identification of event-related phrases.
Assuming that each tweet message m ∈ {1..M} is assigned to one event instance e, where e is modeled as a joint distribution over the named entities y, the date/time d when the event occurred, the location l where the event occurred and the event-related keywords k, Zhou et al. (2014) proposed an unsupervised Bayesian model called the latent event model (LEM) for event extraction from Twitter. However, LEM requires the number of events to be known beforehand, which is not realistic in practical applications. To address this limitation, in this paper a non-parametric mixture model for event extraction is proposed, in which the number of events is inferred automatically from data. Moreover, lexical variations of the same named entity, for example "Charles" and "The Prince of Wales", if identified properly, could be exploited to help detect the same event described in tweets with different mentions. To this end, we further extend the non-parametric mixture model to incorporate word embeddings generated using neural language modelling.
The main contributions of the paper are summarized below:
• We propose a non-parametric approach called the Dirichlet Process Event Mixture Model (DPEMM) to extract structured event information. It avoids the problem of presetting the number of events, a common issue in latent Dirichlet allocation (LDA) based approaches.
• We extend DPEMM by incorporating word embeddings to deal with the issue of using multiple mentions to refer to the same named entity.
• The proposed approaches have been evaluated on three datasets, and a significant improvement in F-measure over the baseline approach is observed.

Related Work
Research on event extraction from tweets can be divided into domain-specific and open-domain approaches. Domain-specific approaches typically focus on one particular type of event. For example, Panem et al. (2014) proposed an algorithm to extract attribute-value pairs and map such pairs to manually generated schemas for natural disaster events. Evaluation was carried out on 58,000 tweets for 20 events, and the system can fill such event schemas with an F-measure of 60%. TSum4act (Nguyen et al., 2015) was designed for disaster responses based on tweets and has been evaluated on a dataset containing 230,535 tweets. Anantharam et al. (2014) focused on extracting city events by solving a sequence labeling problem. Evaluation was carried out on a real-world dataset consisting of event reports and tweets collected over four months from the San Francisco Bay Area. Open-domain event extraction approaches are not limited to a specific event type or topic. Benson et al. (2011) proposed a structured graphical model which simultaneously analyzed individual messages, clustered them, and induced a canonical value for each event. Popescu et al. (2011) focused on detecting events involving known entities from Twitter. Experimental results showed that events centered on specific entities can be extracted with 70% precision and 64% recall. Liu et al. (2012) worked on social event extraction for social network construction using a factor graph, by harvesting the redundancy in tweets. Experiments were conducted on a manually annotated dataset and results showed a gain of 21% in F-measure. In (Abdelhaq et al., 2013), a system called EvenTweet was constructed to extract localized events from a stream of tweets in real time. The extracted events are described by start time, location and a number of related keywords. Armengo et al. (2015) proposed a model named Tweet-SCAN based on the hierarchical Dirichlet process to detect events from geo-located tweets. To extract more information, a system called SEEFT (Wang et al., 2015) used links in tweets and combined tweets and linked articles to identify events. Xia et al. (2015) proposed a framework combining text, image and geo-location information to detect events with low spatial and temporal deviation.
Our proposed method belongs to the open-domain category. Different from the previous methods, our model can automatically identify the number of events in the corpus and deal with lexical variations of named entities using word embeddings generated from neural language modelling.

Methodology
Our proposed model for event extraction is based on a typical non-parametric mixture model, the Dirichlet Process Mixture Model (DPMM) (Green and Richardson, 2001; Ishwaran and Zarepour, 2002), in which the number of active clusters is automatically learned from the data. We first give a brief introduction to DPMM. In DPMM, observation $x_i$ is assumed to be generated from the following model:

$$\phi_k \sim G_0, \qquad c_i \mid \pi \sim \mathrm{Multinomial}(\pi), \qquad x_i \mid c_i, \{\phi_k\}_{k=1}^{K} \sim F(\phi_{c_i}),$$

where $K$ denotes the number of components in the mixture model and can go to infinity, $\pi$ is the vector of mixture weights of the components, $\phi_k$ is the parameter of the $k$th component, $c_i$ denotes the component index of observation $x_i$, and $F(\phi_{c_i})$ denotes the distribution of $x_i$ with parameter $\phi_{c_i}$. In this model, $\pi$ can be generated by the stick-breaking construction (Pitman, 2002) or the Chinese restaurant process (Aldous, 1985). Suppose that all the observations are generated by DPMM and the parameter associated with observation $x_i$ is $\theta_i$; then $\theta_i$ has the following conditional distribution:

$$\theta_i \mid \theta_1, \ldots, \theta_{i-1} \sim \sum_{k} \frac{n_k}{i - 1 + \alpha}\, \delta_{\phi_k} + \frac{\alpha}{i - 1 + \alpha}\, G_0,$$

where $\phi_1, \ldots, \phi_k$ are the distinct values of $\theta$, $n_k$ is the number of observations that belong to component $k$, $\delta_{\phi_k}$ is a probability measure concentrated on $\phi_k$ which returns 1 when $\theta_i = \phi_k$, and $G_0$ is the base probability measure, from which a new $\phi$ is generated with probability $\frac{\alpha}{i - 1 + \alpha}$.
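To make the Chinese restaurant process prior concrete, the following minimal Python sketch (our illustration, not part of the model implementation) draws cluster assignments sequentially: the $i$th observation joins an existing cluster $k$ with probability $n_k/(i-1+\alpha)$ and opens a new cluster with probability $\alpha/(i-1+\alpha)$.

```python
import numpy as np

def crp_assignments(n, alpha, rng=None):
    """Draw cluster assignments for n observations from a Chinese
    restaurant process with concentration parameter alpha."""
    rng = np.random.default_rng() if rng is None else rng
    counts = []        # counts[k] = number of observations in cluster k
    assignments = []
    for i in range(n):
        # existing cluster k chosen with prob counts[k]/(i + alpha),
        # a new cluster with prob alpha/(i + alpha); i prior customers
        probs = np.array(counts + [alpha], dtype=float) / (i + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)   # open a new cluster (new event)
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments
```

Note how the number of clusters is not fixed in advance: it grows with the data at a rate controlled by $\alpha$, which is exactly the property DPEMM exploits to avoid presetting the number of events.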

Dirichlet Process Event Mixture Model (DPEMM)
We propose a Dirichlet Process Event Mixture Model (DPEMM) in which each event is represented as a 4-tuple ⟨y, l, k, d⟩, where y stands for non-location named entities, l for locations, k for event-related keywords and d for dates. It is worth noting that y, l and k are not necessarily atomic: each could be a set by itself, since one event can have multiple named entities, locations or keywords. Also, some elements of the 4-tuple might be absent if no associated information can be found in tweets. Assuming that the data contains an infinite number of events and each event is modeled as a joint distribution over y, l, k and d, the model can be viewed as a Bayesian mixture model. The generative process of the proposed model is given below.
• Draw the event distribution π from the stick-breaking prior with concentration parameter α.
• For each event e, draw θ_e ∼ Dirichlet(β), ψ_e ∼ Dirichlet(η), ω_e ∼ Dirichlet(λ) and φ_e ∼ Dirichlet(γ).
• For each tweet t:
  - Draw an event from the event distribution e ∼ Multinomial(π).
  - For each non-location named entity occurring in t, choose a named entity y ∼ Multinomial(θ_e).
  - For each location occurring in t, choose a location l ∼ Multinomial(ψ_e).
  - For each keyword occurring in t, choose a keyword k ∼ Multinomial(ω_e).
  - For each date occurring in t, choose a date d ∼ Multinomial(φ_e).

Here, K is the number of events and can go to infinity. To estimate the parameters of the model, we employ Markov chain sampling methods (Neal, 2000). As K goes to infinity, we cannot represent the infinite number of θ_e, ψ_e, ω_e and φ_e explicitly. Therefore, we perform Gibbs sampling for only those parameters that are currently associated with some observations. Gibbs sampling for the event label $e_i$ of tweet $i$ is based on the following conditional probabilities. If $e_i$ is assigned to a previously seen event $e$,

$$P(e_i = e \mid \mathbf{e}_{-i}, s_i) = b\,\frac{n_e^{-i}}{n - 1 + \alpha} \int F_y(y_i; \theta_e)\, dH_y(\theta_e) \int F_l(l_i; \psi_e)\, dH_l(\psi_e) \int F_k(k_i; \omega_e)\, dH_k(\omega_e) \int F_d(d_i; \phi_e)\, dH_d(\phi_e).$$

If $e_i$ is assigned to a new event,

$$P(e_i = e_{\mathrm{new}} \mid \mathbf{e}_{-i}, s_i) = b\,\frac{\alpha}{n - 1 + \alpha} \int F_y(y_i; \theta)\, dG_0(\theta) \int F_l(l_i; \psi)\, dG_0(\psi) \int F_k(k_i; \omega)\, dG_0(\omega) \int F_d(d_i; \phi)\, dG_0(\phi),$$

where $b$ is the normalizing constant that makes the probabilities sum to 1, $\mathbf{e}_{-i}$ is the event assignment of all the other tweets excluding the $i$th tweet, $s_i$ is the 4-tuple ⟨$y_i, l_i, k_i, d_i$⟩, $n$ is the total number of tweets, $n_e^{-i}$ is the number of tweets assigned to event $e$ excluding the current assignment, $F_y(\theta_e)$ is the multinomial distribution over non-location named entities with parameter $\theta_e$, $F_l(\psi_e)$ over locations with $\psi_e$, $F_k(\omega_e)$ over keywords with $\omega_e$, and $F_d(\phi_e)$ over dates with $\phi_e$. $H_y(\theta_e)$ is the posterior distribution of the parameters based on the prior $G_0(\theta_e) \sim \mathrm{Dirichlet}(\beta)$ and all observations $y_j$ for which $j \neq i$ and $e_j = e$, and similarly for $H_l(\psi_e)$, $H_k(\omega_e)$ and $H_d(\phi_e)$.
We then derive the following formulae. If $e_i$ is assigned to a previously seen event $e$,

$$P(e_i = e \mid \mathbf{e}_{-i}, \mathbf{t}_{-i}) \propto \frac{n_e^{-i}}{n - 1 + \alpha} \prod_{y \in y_i} \frac{n_{e,y}^{-i} + \beta}{\sum_{y'} n_{e,y'}^{-i} + V_y \beta} \prod_{l \in l_i} \frac{n_{e,l}^{-i} + \eta}{\sum_{l'} n_{e,l'}^{-i} + V_l \eta} \prod_{k \in k_i} \frac{n_{e,k}^{-i} + \lambda}{\sum_{k'} n_{e,k'}^{-i} + V_k \lambda} \prod_{d \in d_i} \frac{n_{e,d}^{-i} + \gamma}{\sum_{d'} n_{e,d'}^{-i} + V_d \gamma}.$$

If $e_i$ is assigned to a new event $e'$, all counts are zero and each factor reduces to the prior predictive:

$$P(e_i = e' \mid \mathbf{e}_{-i}, \mathbf{t}_{-i}) \propto \frac{\alpha}{n - 1 + \alpha} \cdot \frac{1}{V_y^{|y_i|}} \cdot \frac{1}{V_l^{|l_i|}} \cdot \frac{1}{V_k^{|k_i|}} \cdot \frac{1}{V_d^{|d_i|}},$$

where the superscript $-i$ denotes a count excluding data from the $i$th tweet; $n_{e,y}^{-i}$, $n_{e,l}^{-i}$, $n_{e,k}^{-i}$ and $n_{e,d}^{-i}$ denote the occurrence counts of non-location entity $y$, location $l$, keyword $k$ and date $d$ in event $e$, respectively; $V_y$, $V_l$, $V_k$ and $V_d$ denote the vocabulary sizes of entities, locations, keywords and dates; and $\mathbf{t}_{-i}$ denotes all other tweets. $\beta, \eta, \lambda, \gamma$ are the hyperparameters and are all set to 1 in the experiments reported in this paper.
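A simplified Python sketch of this collapsed Gibbs update is given below. It is our illustration, not the authors' code: it assumes the per-event count tables already exclude tweet i, uses a single symmetric hyperparameter `beta` standing in for β, η, λ, γ (all 1 in the paper), and ignores the small count corrections needed when a tweet repeats the same word.

```python
import numpy as np

def event_posterior(tweet, events, alpha, beta, vocab_sizes, n_total):
    """Normalized Gibbs probabilities for assigning one tweet (a dict
    with 'y', 'l', 'k', 'd' lists) to each existing event or a new one.
    `events` is a list of per-event count tables with tweet i's
    contribution already removed."""
    probs = []
    for ev in events:
        p = ev["n_tweets"] / (n_total - 1 + alpha)
        for field in ("y", "l", "k", "d"):
            counts = ev[field]               # dict: word -> count in event
            total = sum(counts.values())
            V = vocab_sizes[field]
            for w in tweet[field]:
                # Dirichlet-multinomial predictive probability
                p *= (counts.get(w, 0) + beta) / (total + V * beta)
        probs.append(p)
    # new event: all counts are zero, so each factor reduces to 1/V
    p_new = alpha / (n_total - 1 + alpha)
    for field in ("y", "l", "k", "d"):
        p_new *= (1.0 / vocab_sizes[field]) ** len(tweet[field])
    probs.append(p_new)
    probs = np.array(probs)
    return probs / probs.sum()               # normalization constant b
```

One sweep of the sampler applies this update to every tweet in turn, so clusters can be created when the "new event" entry is drawn and removed when their last tweet leaves.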

DPEMM With Word Embeddings
In the model proposed above, each distinct word is treated separately without considering its semantic relations with other words. However, knowledge of the semantic relations between words might be useful for event extraction. For example, "Putin" and "The President of Russia" are two different mentions referring to the same person. Such knowledge would help to cluster the following two tweets together, "President of Russia attended the opening ceremony of the 119th session of the International Olympic Committee." and "Putin took part in the presentation of Sochi, at the 119th of the IOC.", and hence identify a single event. Moreover, there might exist partitive relations between two location names; for example, Croydon is a part of London. This information helps to identify that the same event may be described as happening in Croydon and in London, and subsequently improves the accuracy of event extraction.
To incorporate such information about the semantic relations between words, we propose another model, called DPEMM-WE, which employs word embeddings to describe the semantic relations among entities y or locations l. Each word is represented as a real-valued vector, and in the embedding space words that are more semantically or syntactically similar to each other are located closer together. We use neural language modelling (Collobert et al., 2011) to learn word representations by discriminating legitimate phrases from incorrect ones. Given a sequence of words $p = (w_1, w_2, \ldots, w_d)$ with window size $d$, the goal of the model is to discriminate the sequence of words $p$ (the correct phrase) from a random sequence of words $p^r$. Thus, the objective of the model is to minimize the following ranking loss with respect to the parameters $\theta$:

$$\sum_{p \in P} \sum_{r \in R} \max\big(0,\; 1 - f_\theta(p) + f_\theta(p^r)\big),$$

where $P$ is the set of all possible text sequences with $d$ words coming from the corpus $U$, $R$ is the dictionary of words, $p^r$ denotes the window of words obtained by replacing the central word of $p$ by the word $r$, and $f_\theta(p)$ is the score of $p$.
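As an illustration only (the paper trains a full neural language model), the sketch below computes this hinge ranking loss for a black-box scoring function; `score_fn` is a hypothetical stand-in for the network, and sampling a handful of corrupting words per window stands in for the sum over the full dictionary R.

```python
import numpy as np

def ranking_loss(score_fn, windows, vocab, n_neg=10, rng=None):
    """Collobert & Weston style hinge ranking loss: each genuine window
    p should outscore, by a margin of 1, the same window with its
    central word replaced by a random word r."""
    rng = np.random.default_rng() if rng is None else rng
    loss = 0.0
    for p in windows:                            # p: list of d words
        mid = len(p) // 2
        sp = score_fn(p)                         # score of the true window
        for r in rng.choice(vocab, size=n_neg):  # sampled negatives
            p_r = p[:mid] + [r] + p[mid + 1:]    # corrupt the central word
            loss += max(0.0, 1.0 - sp + score_fn(p_r))
    return loss
```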
The dataset for learning the language model can be constructed by considering all the word sequences in the corpus. Positive examples are the word sequences from the corpus, while negative examples are the same word sequences with the central word replaced by a random one. Different from DPEMM, in DPEMM-WE non-location named entities y and locations l are assumed to follow Gaussian distributions so as to incorporate word embeddings, and their prior distributions are assumed to follow the Normal-Inverse-Wishart (NIW) distribution, which is conjugate to the Gaussian. The probability density function is

$$\mathrm{NIW}(\mu, \Sigma \mid \mu_0, \kappa_0, \Psi, \nu) = \mathcal{N}\Big(\mu \,\Big|\, \mu_0, \tfrac{1}{\kappa_0}\Sigma\Big)\, \mathcal{W}^{-1}(\Sigma \mid \Psi, \nu) = \frac{\kappa_0^{p/2}\, |\Psi|^{\nu/2}\, |\Sigma|^{-\frac{\nu + p + 2}{2}}}{(2\pi)^{p/2}\, 2^{\nu p/2}\, \Gamma_p(\nu/2)} \exp\Big(-\frac{1}{2}\mathrm{tr}(\Psi\Sigma^{-1}) - \frac{\kappa_0}{2}(\mu - \mu_0)^\top \Sigma^{-1} (\mu - \mu_0)\Big),$$

where Σ and Ψ are p × p positive definite matrices and $\Gamma_p(\cdot)$ is the multivariate gamma function.
The graphical model of DPEMM-WE is shown in Figure 1. The generative process of DPEMM-WE is given below.
• For each tweet t:
  - Draw an event from the event distribution e ∼ Multinomial(π).
  - For each named entity occurring in t, choose a named entity y ∼ Gaussian(θ_e).
  - For each location occurring in t, choose a location l ∼ Gaussian(ψ_e).
  - For each keyword occurring in t, choose a keyword k ∼ Multinomial(ω_e).
  - For each date occurring in t, choose a date d ∼ Multinomial(φ_e).
As in DPEMM, Gibbs sampling is used to infer the event label of each tweet, with the Gaussian parameters θ_e and ψ_e integrated out. Under the NIW prior, the posterior predictive distribution of a named entity embedding $y$ in event $e$ is a multivariate Student-t distribution. Given the $n$ entity embeddings already assigned to event $e$, with sample mean $\bar{y}$ and scatter matrix $S$, the posterior parameters are

$$\kappa_n = \kappa_0 + n, \quad \nu_n = \nu_0 + n, \quad \mu_n = \frac{\kappa_0 \mu_0 + n\bar{y}}{\kappa_0 + n}, \quad \Psi_n = \Psi_0 + S + \frac{\kappa_0 n}{\kappa_0 + n}(\bar{y} - \mu_0)(\bar{y} - \mu_0)^\top,$$

and the predictive distribution is $t_{\nu_n - p + 1}\Big(\mu_n,\; \frac{(\kappa_n + 1)\Psi_n}{\kappa_n(\nu_n - p + 1)}\Big)$. The parameters of the locations' Student-t distribution can be calculated similarly.
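The sketch below (our own, following the standard conjugate NIW analysis rather than any released code) computes these posterior-predictive Student-t parameters from the embeddings currently assigned to an event; the same function serves for locations.

```python
import numpy as np

def niw_posterior_predictive(Y, mu0, kappa0, nu0, Psi0):
    """Student-t posterior predictive for a Gaussian component with a
    Normal-Inverse-Wishart prior.  Y is the n x p matrix of embeddings
    currently assigned to the event; nu0 should be >= p."""
    n, p = Y.shape
    ybar = Y.mean(axis=0)
    S = (Y - ybar).T @ (Y - ybar)            # scatter matrix
    kappa_n, nu_n = kappa0 + n, nu0 + n
    mu_n = (kappa0 * mu0 + n * ybar) / kappa_n
    diff = (ybar - mu0).reshape(-1, 1)
    Psi_n = Psi0 + S + (kappa0 * n / kappa_n) * (diff @ diff.T)
    dof = nu_n - p + 1                       # predictive degrees of freedom
    scale = Psi_n * (kappa_n + 1) / (kappa_n * dof)
    return mu_n, scale, dof                  # predictive: t_dof(mu_n, scale)
```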

Post-Processing
DPEMM or DPEMM-WE essentially outputs tweet clusters where each cluster represents one event. To further extract a structured representation of an event, namely its named entities, locations, dates and keywords, from each cluster, we simultaneously consider the probabilities of each event element returned by our models and their co-occurrence frequencies. We assume that non-location named entities are generally the most important elements, since an event is usually driven by a person or an organization; for events anchored to a place, as in "A bomb attack happened in London", the location is the most important. Therefore, we first select the top 3 non-location named entities ranked by the probability θ_e. For each non-location named entity y, its occurrence frequency needs to exceed T_y. If no such entities exist, the top 3 locations ranked by the probability ψ_e are chosen; otherwise, the location l is chosen based on its co-occurrences with the selected non-location named entities. After that, keywords k are chosen among the top 10 keywords ranked by ω_e: only those keywords whose correlation coefficients with the chosen named entities and locations exceed T_c are selected. The date d is then chosen in a similar way. Here, we define the correlation coefficient between a and b as

$$\mathrm{Corr}(a, b) = \log \frac{\#(a, b)}{\#(b)},$$

where #(a, b) denotes the co-occurrence count of a and b in the same tweet within a tweet cluster and #(b) denotes the occurrence count of b in all tweets within the tweet cluster. In our experiments, we set the thresholds T_y = 0.2 and T_c = 0.4.
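The snippet below sketches the correlation filter. It is our reading of the procedure, not the authors' code: the formula is implemented exactly as stated above, the count tables (`pair_counts`, `counts`) are assumed to be precomputed per tweet cluster, and we assume a candidate must pass the threshold against every chosen element.

```python
import math

def corr(a, b, pair_counts, counts):
    """Corr(a, b) = log(#(a, b) / #(b)) within one tweet cluster;
    #(a, b) counts tweets containing both a and b, #(b) counts tweets
    containing b.  Pairs that never co-occur score -inf."""
    joint = pair_counts.get(frozenset((a, b)), 0)
    return math.log(joint / counts[b]) if joint else float("-inf")

def select_keywords(candidates, chosen, pair_counts, counts, t_c):
    """Keep candidate keywords (the top 10 by omega_e) whose correlation
    with every already chosen entity/location exceeds the threshold T_c."""
    return [k for k in candidates
            if all(corr(e, k, pair_counts, counts) > t_c for e in chosen)]
```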
If an entity or location is represented by word embeddings, its occurrence frequency is calculated as the sum of the occurrence frequencies of all the neighboring words whose cosine similarity with it is greater than 0.85. The rationale behind our post-processing step is that although tweets have been filtered in the pre-processing step, tweet clusters generated by the proposed models still contain noisy event elements. As such, we select event elements from tweet clusters not only based on the probability distributions given by the proposed models but also taking into account their co-occurrences in each tweet cluster.
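A minimal sketch of this soft counting, assuming a dict `emb` from words to embedding vectors and a mapping `counts` of within-cluster word frequencies (every counted word is assumed to have an embedding):

```python
import numpy as np

def soft_frequency(word, emb, counts, threshold=0.85):
    """Embedding-based occurrence frequency: sum the counts of all
    words whose cosine similarity with `word` meets the threshold
    (the word itself has similarity 1 and is always included)."""
    v = emb[word] / np.linalg.norm(emb[word])
    total = 0
    for w, c in counts.items():
        u = emb[w] / np.linalg.norm(emb[w])
        if float(u @ v) >= threshold:
            total += c
    return total
```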

Experiments
We evaluate the proposed models on three datasets. Dataset I is the First Story Detection (FSD) dataset (Petrovic et al., 2013) containing 2,499 tweets manually annotated with 27 events. These tweets were published between 7th July and 12th September 2011, covering a range of categories such as accidents and science discoveries. Considering that events mentioned in very few tweets are less likely to be significant, we remove events mentioned in fewer than 15 tweets and are left with 2,453 tweets annotated with 20 events. Datasets II and III were collected from tweets published in December 2010 using the Twitter streaming API. Dataset II consists of 6,297 tweets manually annotated with 73 events; all the annotated events in Dataset II are mentioned in at least 15 tweets. Dataset III contains 60 million unlabelled tweets. We chose LEM (Zhou et al., 2014), the state-of-the-art approach based on Bayesian modelling for event extraction, as the baseline to compare with the proposed models. For all datasets, pre-processing is done as described in (Zhou et al., 2014). A named entity tagger specifically built for Twitter is used for extracting named entities, including locations, from tweets. A Twitter part-of-speech tagger (Gimpel et al., 2011) is used for POS tagging, and only words tagged as nouns, verbs or adjectives are kept as candidate keywords. Word embeddings are trained on Dataset III (60 million tweets) using Word2Vec. In this model, a word is used as the input to a log-linear classifier with a continuous projection layer, and the objective is to predict its neighboring words.
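For reference, training such embeddings with gensim's Word2Vec might look like the sketch below; the hyperparameters shown are our guesses rather than values reported in the paper, and `tokenized_tweets` is a hypothetical iterable of token lists from Dataset III.

```python
from gensim.models import Word2Vec

# tokenized_tweets: iterable of token lists from the 60M-tweet corpus
# (hypothetical variable; dimensionality and window are illustrative)
model = Word2Vec(sentences=tokenized_tweets,
                 vector_size=100,  # embedding dimensionality
                 window=5,         # context window size
                 min_count=5,      # drop very rare words
                 sg=1,             # skip-gram: predict neighboring words
                 workers=4)
vector = model.wv["london"]        # look up a word's embedding
```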
We train DPEMM, DPEMM-WE and LEM on an IBM 3850 X5 Linux server equipped with a 1.86 GHz processor and 8 GB DDR3 RAM. The number of Gibbs sampling iterations is set to 1,000 for LEM on all the datasets. DPEMM converges in 16 iterations on Dataset I and 20 iterations on Datasets II and III, while DPEMM-WE converges in 20 iterations on both Datasets II and III.

Experimental Results
To evaluate the performance of the proposed approaches, we calculate precision, recall, and F-measure on Datasets I and II, and only precision on Dataset III since it is hard to know exactly how many events are mentioned in such a large dataset. Precision is defined based on the following criteria: 1) Do the entity y, location l and date d refer to the same event? 2) Are the keywords k in accord with the event that the other extracted elements y, l, d refer to, and are they informative enough to tell us what happened? Extracted events without any keywords are considered incorrect.
The performance comparison of event extraction results is presented in Table 2. It can be observed that the proposed DPEMM achieves better performance on all three datasets compared to the baseline approach, with the improvement in F-measure being 6.1% and 7.7% on Datasets I and II, respectively. After incorporating word embeddings into DPEMM, the proposed DPEMM-WE further improves upon DPEMM slightly, by 1.45% in F-measure on Dataset II, and more substantially, by 4.16% in precision on Dataset III. This verifies our hypothesis that knowledge about the semantic relations of entities and locations can improve the performance of event extraction. We also compared the proposed models with K-means on Dataset I to assess whether the proposed generative models are better than traditional clustering methods based on co-occurrence. The feature set was constructed by organizing the words into the four categories y, l, k and d and concatenating the four one-hot feature sets together.
It is worth noting that we did not apply DPEMM-WE on Dataset I because this dataset is very small, consisting of fewer than 2,500 tweets; it is unreliable to learn word embeddings from such a small dataset. It is also hard to pre-train word embeddings on an external dataset such as Wikipedia, since the vocabulary and language style of tweets differ substantially from formal text. Example extraction results of DPEMM and DPEMM-WE are shown in Table 3. It can be observed that the results extracted by DPEMM-WE contain more detailed and accurate information describing the events. For example, for the first event, DPEMM-WE is able to extract the location information while DPEMM failed to do so. For the third event, DPEMM-WE gives more accurate location information compared to DPEMM. This might be attributed to the advantage of incorporating word embeddings, which map semantically similar words to nearby locations in the embedding space. As such, although two tweets might contain different mentions of named entities and locations, they can still be clustered together if these named entities or locations have similar word embeddings.
We observed that the precision achieved by DPEMM is significantly better than that of LEM on Datasets I and II, while similar on Dataset III. We found that DPEMM tended to generate more but smaller clusters compared to LEM. As Dataset III is huge, DPEMM might generate some small clusters which do not contain enough information to describe a correct event.

Quality of Clusters
As the proposed approaches essentially group tweets into clusters with each cluster corresponding to an event, we conduct experiments to explore the quality of the clusters using a purity measure, defined as $P_e = \frac{n_e}{n}$, where $n_e$ denotes the number of tweets describing the event $e$ extracted from a cluster and $n$ denotes the total number of tweets in the cluster. Since it is difficult to calculate purity on Dataset III, we only report the results on Datasets I and II, as shown in Figures 2 and 3 respectively.
Each point (x, y) in the figures denotes the percentage y of clusters whose purity is less than x. Obviously, a steeper curve means that the percentage of clusters with low purity is smaller and the quality of the clusters is better. It can be observed that DPEMM achieves the best cluster quality on both Dataset I and Dataset II, even though its precision is lower than that of DPEMM-WE. Specifically, on Dataset I, more than 80% of the clusters generated by DPEMM have purity values greater than 0.9, compared to only 70% for LEM. This might be attributed to the property of DPEMM that clusters are generated dynamically without a preset number of clusters. On Dataset II, both DPEMM and DPEMM-WE achieve better clustering results compared to LEM. However, the purity of the clusters generated by DPEMM is slightly higher than that of the clusters generated by DPEMM-WE. This is somewhat contrary to our prior belief. By further analyzing the results, we found that as more tweets are clustered together by DPEMM-WE, more noisy information is introduced, such as named entities that have similar word embeddings but are unrelated to the event. We present an example of the tweet clusters describing the same event generated by DPEMM and DPEMM-WE in Figure 4. For each method, we use a histogram to indicate the number of tweets which share the same event elements. Regions highlighted in dark or light red indicate that the corresponding tweets are event-related, while regions highlighted in blue denote tweets that are not event-related. It can be observed that the purity of the cluster generated by DPEMM is 91%, which is better than DPEMM-WE's 63%. However, the cluster returned by DPEMM is smaller and it failed to extract the location information. In contrast, DPEMM-WE generated a larger cluster and, for some tweets, successfully extracted the location "Senate". However, more spurious tweets are included because "Harry Reid" is close to both "DreamAct" and "Obama", and "White House" is close to "Senate" in the word embedding space. Therefore, although DPEMM-WE gives better extraction results overall compared to DPEMM as shown in Table 2, it returns lower purity because of the noisy information introduced through word embeddings.
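For completeness, per-cluster purity can be computed in a few lines. This sketch takes the gold event label of each tweet in a cluster and treats the dominant label as the extracted event e, which is an assumption on our part.

```python
from collections import Counter

def cluster_purity(gold_labels):
    """P_e = n_e / n for one cluster: the fraction of its tweets that
    describe the cluster's dominant (extracted) event."""
    counts = Counter(gold_labels)
    return max(counts.values()) / len(gold_labels)
```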

Conclusions and Future Work
In this paper, we have proposed a model based on the Dirichlet process mixture model to extract structured event information from social media data. Different from previous approaches to event extraction, which require setting the number of events beforehand, it can infer the number of events automatically from data. This is particularly appealing for processing large-scale social media data. Moreover, considering that different mentions could refer to the same person (and similarly for other named entities such as locations), we have proposed to incorporate word embeddings into DPEMM so as to more effectively capture semantically similar words. Experiments have been conducted on three datasets, and the proposed approaches achieve better performance on all of them in comparison with the baseline approach. In the future, we plan to investigate more effective ways of reducing the noise introduced by word embeddings, and to incorporate emotion information into the proposed models so as to simultaneously extract public opinions on the extracted events.