Open Event Extraction from Online Text using a Generative Adversarial Network

To extract structured representations of open-domain events, Bayesian graphical models have made some progress. However, these approaches typically assume that all words in a document are generated from a single event. While this may be true for short text such as tweets, such an assumption does not generally hold for long text such as news articles. Moreover, Bayesian graphical models often rely on Gibbs sampling for parameter inference, which may take a long time to converge. To address these limitations, we propose an event extraction model based on Generative Adversarial Nets, called the Adversarial-neural Event Model (AEM). AEM models an event with a Dirichlet prior and uses a generator network to capture the patterns underlying latent events. A discriminator is used to distinguish documents reconstructed from the latent events from the original documents. As a byproduct, the features learned by the discriminator network allow the visualization of the extracted events. Our model has been evaluated on two Twitter datasets and a news article dataset. Experimental results show that our model outperforms the baseline approaches on all the datasets, with more significant improvements observed on the news article dataset, where an increase of 15% in F-measure is achieved.


Introduction
With the increasing popularity of the Internet, online texts provided by social media platforms (e.g. Twitter) and news media sites (e.g. Google News) have become important sources of real-world events. Therefore, it is crucial to automatically extract events from online texts.
Due to the high variety of events discussed online and the difficulty in obtaining annotated data for training, traditional template-based or supervised learning approaches for event extraction are no longer applicable to online texts. Nevertheless, newsworthy events are often discussed in many tweets or online news articles, so the same event may be mentioned by a high volume of redundant tweets or news articles. This property has inspired the research community to devise clustering-based models (Popescu et al., 2011; Abdelhaq et al., 2013; Xia et al., 2015) to discover new or previously unidentified events without extracting structured representations.
To extract structured representations of events such as who did what, when, where and why, Bayesian approaches have made some progress. Assuming that each document is assigned to a single event, which is modeled as a joint distribution over the named entities, the date and the location of the event, and the event-related keywords, Zhou et al. (2014) proposed an unsupervised Latent Event Model (LEM) for open-domain event extraction. To address the limitation that LEM requires the number of events to be pre-set, Zhou et al. (2017) further proposed the Dirichlet Process Event Mixture Model (DPEMM), in which the number of events can be learned automatically from data. However, both LEM and DPEMM have two limitations: (1) they assume that all words in a document are generated from a single event, which can be represented by a quadruple <entity, location, keyword, date>; however, long texts such as news articles often describe multiple events, which clearly violates this assumption; (2) during the inference process of both approaches, the Gibbs sampler needs to compute the conditional posterior distribution and assign an event to each document, which is time-consuming and takes a long time to converge.
To deal with these limitations, in this paper we propose the Adversarial-neural Event Model (AEM), based on adversarial training, for open-domain event extraction. The principal idea is to use a generator network to learn the projection function between the document-event distribution and four event-related word distributions (entity distribution, location distribution, keyword distribution and date distribution). Instead of providing an analytic approximation, AEM uses a discriminator network to discriminate between documents reconstructed from latent events and the original input documents. This essentially helps the generator to construct a more realistic document from random noise drawn from a Dirichlet distribution. Due to the flexibility of neural networks, the generator is capable of learning complicated nonlinear distributions, and the supervision signal provided by the discriminator helps the generator to capture the event-related patterns. Furthermore, the discriminator also provides low-dimensional discriminative features which can be used to visualize documents and events.
The main contributions of the paper are summarized below:
• We propose a novel Adversarial-neural Event Model (AEM), which is, to the best of our knowledge, the first attempt at using adversarial training for open-domain event extraction.
• Unlike existing Bayesian graphical modeling approaches, AEM is able to extract events from different text sources (short and long), with a significant improvement in computational efficiency.
• Experimental results on three datasets show that AEM outperforms the baselines in terms of precision, recall and F-measure. In addition, the results show the strength of AEM in visualizing events.

Related Work
Our work is related to two lines of research, event extraction and Generative Adversarial Nets.

Event Extraction
Recently there has been much interest in event extraction from online texts, and approaches can be categorized as domain-specific and open-domain event extraction. Domain-specific event extraction often focuses on specific types of events (e.g. sports events or city events). Panem et al. (2014) devised a novel algorithm to extract attribute-value pairs and mapped them to manually generated schemes for extracting natural disaster events. Similarly, to extract city-traffic related events, Anantharam et al. (2015) viewed the task as a sequential tagging problem and proposed an approach based on conditional random fields. Zhang (2018) proposed an event extraction approach based on imitation learning, in particular inverse reinforcement learning.
Open-domain event extraction aims to extract events without restricting them to specific types. To analyze individual messages and induce a canonical value for each event, Benson et al. (2011) proposed an approach based on a structured graphical model. Representing an event as a binary tuple constituted by a named entity and a date, Ritter et al. (2012) employed a statistical measure of the strength of association between a named entity and a date; their system relies on a supervised labeler trained on annotated data. In (Abdelhaq et al., 2013), Abdelhaq et al. developed a real-time event extraction system called EvenTweet, in which each event is represented as a triple of time, location and keywords. To extract more information, Wang et al. (2015) developed a system employing the links in tweets and combining tweets with linked articles to identify events. Xia et al. (2015) combined texts with location information to detect events with low spatial and temporal deviations. Zhou et al. (2014; 2017) represented an event as a quadruple and proposed two Bayesian models to extract events from tweets.

Generative Adversarial Nets
As a neural-based generative model, Generative Adversarial Nets (Goodfellow et al., 2014) have been extensively studied in the natural language processing (NLP) community.
For text generation, the sequence generative adversarial network (SeqGAN) proposed in (Yu et al., 2017) incorporated a policy gradient strategy to optimize the generation process. Based on the policy gradient, Lin et al. (2017) proposed RankGAN to capture the rich structures of language by ranking and analyzing a collection of human-written and machine-written sentences. To overcome mode collapse when dealing with discrete data, Fedus et al. (2018) proposed MaskGAN, which used an actor-critic conditional GAN to fill in missing text conditioned on the surrounding context. Along this line, SentiGAN was proposed to generate texts with different sentiment labels. Besides, adversarial training has also been used to improve semi-supervised text classification, and GAN-based models have been designed for distant-supervision relation extraction (Zeng et al., 2018; Qin et al., 2018).
Although various GAN-based approaches have been explored for many applications, none of them tackles open-domain event extraction from online texts. We propose a novel GAN-based event extraction model called AEM. Compared with previous models, AEM has the following differences: (1) unlike most GAN-based text generation approaches, a generator network is employed in AEM to learn the projection function between an event distribution and the event-related word distributions (entity, location, keyword, date); the learned generator captures event-related patterns rather than generating text sequences; (2) different from LEM and DPEMM, AEM uses a generator network to capture the event-related patterns and is able to mine events from different text sources (short and long); moreover, unlike traditional inference procedures such as the Gibbs sampling used in LEM and DPEMM, AEM can extract events more efficiently due to CUDA acceleration; (3) the discriminative features learned by the discriminator of AEM provide a straightforward way to visualize the extracted events.

Methodology
We describe the Adversarial-neural Event Model (AEM) in this section. An event is represented as a quadruple <e, l, k, d>, where e stands for non-location named entities, l for a location, k for event-related keywords and d for a date; each component of the quadruple is represented by component-specific representative words.
AEM is constituted by three components: (1) the document representation module, as shown at the top of Figure 1, defines a document representation approach which converts an input document from the online text corpus into a vector d_r ∈ R^V that captures the key event elements; (2) the generator G, as shown in the lower-left part of Figure 1, takes an event distribution θ drawn from a Dirichlet distribution as input and generates a fake document d_f constituted by four multinomial distributions; (3) the discriminator D, as shown in the lower-right part of Figure 1, distinguishes the real documents from the fake ones, and its output is subsequently employed as a learning signal to update G and D. The details of each component are presented below.

Document Representation
Each document doc in a given corpus C is represented as a concatenation of four multinomial distributions: the entity distribution (d^e_r), location distribution (d^l_r), keyword distribution (d^k_r) and date distribution (d^d_r) of the document. As the four distributions are calculated in a similar way, we only describe the computation of the entity distribution below as an example.
The entity distribution d^e_r is represented by a normalized V_e-dimensional vector weighted by TF-IDF, and its i-th component is calculated as:

d^e_{r,i} = ( n^e_{i,doc} · log(|C_e| / |C^e_i|) ) / Σ_{j=1}^{V_e} n^e_{j,doc} · log(|C_e| / |C^e_j|)

where C_e is the pseudo corpus constructed by removing all non-entity words from C, V_e is the total number of distinct entities in the corpus, n^e_{i,doc} denotes the number of times the i-th entity appears in document doc, |C_e| represents the number of documents in the corpus, and |C^e_i| is the number of documents that contain the i-th entity. The obtained d^e_{r,i} denotes the relevance between the i-th entity and document doc.
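As a concrete illustration, the TF-IDF weighting above can be sketched in a few lines of Python. The helper name and the exact normalization are our own choices for illustration; the paper does not give an implementation.

```python
import math
from collections import Counter

def entity_distribution(doc_entities, corpus_entities):
    """TF-IDF entity distribution of one document (a sketch of d^e_r).

    doc_entities: entity tokens of the document.
    corpus_entities: one entity-token list per document (pseudo corpus C_e).
    """
    vocab = sorted({e for d in corpus_entities for e in d})
    n_docs = len(corpus_entities)                     # |C_e|
    # document frequency of each entity, |C^e_i|
    df = {e: sum(1 for d in corpus_entities if e in d) for e in vocab}
    tf = Counter(doc_entities)                        # n^e_{i,doc}
    weights = [tf[e] * math.log(n_docs / df[e]) for e in vocab]
    total = sum(weights)
    # normalize so the components form a distribution
    return [w / total if total else 0.0 for w in weights]

corpus = [["obama", "paris"], ["obama", "nyc"], ["election"]]
dist = entity_distribution(["obama", "obama", "paris"], corpus)
```

The location, keyword and date distributions would be computed the same way over their own pseudo corpora.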
Similarly, the location distribution d^l_r, keyword distribution d^k_r and date distribution d^d_r of doc can be calculated in the same way, and the dimensions of these distributions are denoted V_l, V_k and V_d, respectively. Finally, each document doc in the corpus is represented by a V-dimensional (V = V_e + V_l + V_k + V_d) vector d_r obtained by concatenating the four computed distributions.

Generator
The generator network G is designed to learn the projection function between the document-event distribution θ and the four document-level word distributions (entity distribution, location distribution, keyword distribution and date distribution).
More concretely, G consists of an E-dimensional document-event distribution layer, an H-dimensional hidden layer and a V-dimensional event-related word distribution layer. Here, E denotes the number of events, H is the number of units in the hidden layer, and V is the vocabulary size, equal to V_e + V_l + V_k + V_d. As shown in Figure 1, G first takes a random document-event distribution θ as input. To model the multinomial property of the document-event distribution, θ is drawn from a Dirichlet distribution parameterized by α:

θ ∼ Dir(α)

where α is the hyper-parameter of the Dirichlet distribution, E is the number of events which should be set in AEM, and θ_t ∈ [0, 1] represents the proportion of event t in the document, with Σ_{t=1}^{E} θ_t = 1. Subsequently, G transforms θ into an H-dimensional hidden space using a linear layer followed by layer normalization:

h_s = LeakyReLU_{l_p}(LN(W_h θ + b_h))

where W_h ∈ R^{H×E} represents the weight matrix of the hidden layer, b_h denotes the bias term, and l_p is the parameter of the LeakyReLU activation. To generate the four event-related word distributions (as shown in Figure 1), four subnets (each containing a linear layer, a batch normalization layer and a softmax layer) are employed in G:

d^e_f = softmax(BN(W_e h_s + b_e))
d^l_f = softmax(BN(W_l h_s + b_l))
d^k_f = softmax(BN(W_k h_s + b_k))
d^d_f = softmax(BN(W_d h_s + b_d))

Finally, the four generated distributions are concatenated to represent the generated document d_f corresponding to the input θ:

d_f = [d^e_f ; d^l_f ; d^k_f ; d^d_f]
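A minimal NumPy sketch of this forward pass follows. The sizes are illustrative, the parameters are random rather than trained, and batch normalization in the subnets is omitted for brevity, so this shows only the shape of the computation, not the learned model.

```python
import numpy as np

rng = np.random.default_rng(0)

E, H = 10, 150                               # number of events, hidden units
sizes = {"e": 500, "l": 100, "k": 800, "d": 50}   # V_e, V_l, V_k, V_d

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def leaky_relu(x, slope=0.1):
    return np.where(x > 0, x, slope * x)

# illustrative random parameters (learned in the real model)
W_h = rng.normal(0, 0.1, (H, E))
b_h = np.zeros(H)
subnets = {k: (rng.normal(0, 0.1, (v, H)), np.zeros(v))
           for k, v in sizes.items()}

def generate(theta):
    """theta: document-event distribution drawn from Dir(alpha)."""
    h = leaky_relu(layer_norm(W_h @ theta + b_h))
    # one subnet per event element; batch normalization omitted here
    parts = [softmax(W @ h + b) for W, b in subnets.values()]
    return np.concatenate(parts)             # fake document d_f

theta = rng.dirichlet(np.full(E, 0.1))       # draw from the Dirichlet prior
d_f = generate(theta)
```

Each of the four concatenated parts is a proper probability distribution over its own sub-vocabulary.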

Discriminator
The discriminator network D is designed as a fully-connected network which contains an input layer, a discriminative feature layer (whose features are employed for event visualization) and an output layer. D takes the fake document d_f or the real document d_r as input and outputs a signal D_out indicating the source of the input data (a lower value means that D is more inclined to predict the input as a fake document, and vice versa).
As previously discussed in (Gulrajani et al., 2017), the Lipschitz continuity of the D network is crucial to the training of GAN-based approaches. To ensure the Lipschitz continuity of D, we employ the spectral normalization technique (Miyato et al., 2018). More concretely, for each linear layer l_d(h) = W h (the bias term is omitted for simplicity) in D, the weight matrix W is normalized by σ(W). Here, σ(W) is the spectral norm of the weight matrix W, defined as:

σ(W) = max_{h: h ≠ 0} ||W h||_2 / ||h||_2

which is equivalent to the largest singular value of W. The weight matrix W is then normalized using:

Ŵ_SN = W / σ(W)

Obviously, the normalized weight matrix Ŵ_SN satisfies σ(Ŵ_SN) = 1, which ensures the Lipschitz continuity of the D network (Miyato et al., 2018). To avoid the high cost of computing the spectral norm σ(W) by singular value decomposition at each iteration, we follow (Yoshida and Miyato, 2017) and employ the power iteration method to estimate σ(W) instead. With this substitution, the spectral norm can be estimated with very little additional computational time.
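The power-iteration estimate is easy to sketch. The following NumPy snippet (our own illustration, not the paper's code) estimates σ(W) for a matrix whose largest singular value is known, then normalizes W:

```python
import numpy as np

def spectral_norm(W, n_iter=50, eps=1e-12):
    """Estimate the largest singular value of W by power iteration,
    in the spirit of Miyato et al. (2018)."""
    rng = np.random.default_rng(0)
    u = rng.normal(size=W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v) + eps
        u = W @ v
        u /= np.linalg.norm(u) + eps
    return float(u @ W @ v)          # converges to sigma(W)

W = np.diag([3.0, 1.0, 0.5])         # largest singular value is 3.0
sigma = spectral_norm(W)
W_sn = W / sigma                     # normalized weight: spectral norm 1
```

In practice one power-iteration step per training update is usually enough, since W changes only slightly between updates.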

Objective and Training Procedure
The real document d_r and the fake document d_f shown in Figure 1 can be viewed as random samples from two distributions, P_r and P_g, each of which is a joint distribution constituted by four Dirichlet distributions (corresponding to the entity, location, keyword and date distributions). The training objective of AEM is to make the distribution P_g (produced by the G network) approximate the real data distribution P_r as closely as possible.
Comparing different GAN losses, Kurach et al. (2018) take a sober view of the current state of GANs and suggest that the Jensen-Shannon divergence used in (Goodfellow et al., 2014) performs more stably than alternative objectives. They also advocate that the gradient penalty (GP) regularization devised in (Gulrajani et al., 2017) further improves model stability. Thus, the objective function of the proposed AEM is defined as:

L = L_d + λ · L_gp
L_gp = E_{d* ∼ P_{d*}} [ (||∇_{d*} D(d*)||_2 − 1)^2 ]

where L_d denotes the discriminator loss, L_gp represents the gradient penalty regularization loss, λ is the gradient penalty coefficient which trades off the two components of the objective, d* is obtained by sampling uniformly along a straight line between d_r and d_f, and P_{d*} denotes the corresponding distribution.
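The gradient penalty term can be illustrated with a toy critic. In the sketch below (our own construction) the critic is linear, so its input gradient is available in closed form; the real model would obtain ∇D by automatic differentiation, and the critic term shown is a simple Wasserstein-style stand-in for L_d.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 8                                    # toy "vocabulary" size

w = rng.normal(size=V)
def critic(x):
    """Toy linear critic D(x) = w.x, so grad_x D(x) = w in closed form."""
    return float(w @ x)

d_r = rng.dirichlet(np.ones(V))          # a "real" document vector
d_f = rng.dirichlet(np.ones(V))          # a "fake" document vector

eps = rng.uniform()                      # sample d* on the line d_r -- d_f
d_star = eps * d_r + (1.0 - eps) * d_f

grad = w                                 # gradient of the linear critic at d*
L_gp = (np.linalg.norm(grad) - 1.0) ** 2
lam = 10.0                               # gradient penalty coefficient
L_d = critic(d_f) - critic(d_r)          # illustrative critic loss term
L = L_d + lam * L_gp
```

The penalty pushes the critic's gradient norm toward 1 along the interpolation line, which is what enforces the Lipschitz-style constraint during training.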
The training procedure of AEM is presented in Algorithm 1, where E is the event number, n_d denotes the number of discriminator iterations per generator iteration, m is the batch size, α′ represents the learning rate, β_1 and β_2 are hyper-parameters of Adam (Kingma and Ba, 2014), and p_a denotes {α′, β_1, β_2}. In this paper, we set λ = 10, n_d = 5 and m = 32. Moreover, α′, β_1 and β_2 are set to 0.0002, 0.5 and 0.999, respectively.

Event Generation
After model training, the generator G has learned the mapping function between the document-event distribution and the document-level event-related word distributions (entity, location, keyword and date). In other words, with an event distribution θ′ as input, G can generate the corresponding entity distribution, location distribution, keyword distribution and date distribution.

Algorithm 1 Training procedure for AEM
Input: E, λ, n_d, m, α′, β_1, β_2
Output: the trained G and D.
1: Initialize the D parameters ω_d and the G parameters ω_g
2: while ω_g has not converged do
3:    for n_d inner iterations: sample a minibatch of m real documents d_r and m event distributions θ ∼ Dir(α), compute L = L_d + λL_gp, and update ω_d by Adam(p_a)
4:    sample a minibatch of m event distributions θ ∼ Dir(α) and update ω_g by Adam(p_a) using the discriminator's signal
5: end while
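The alternating update schedule of Algorithm 1 can be sketched as follows. The update steps are left as comments (the real model would backpropagate L through D and G); the skeleton only shows the n_d-to-1 ratio of discriminator to generator updates and the Dirichlet noise batches.

```python
import numpy as np

rng = np.random.default_rng(0)
E, n_d, m = 10, 5, 32          # events, D steps per G step, batch size
alpha = np.full(E, 0.1)        # Dirichlet hyper-parameter (illustrative)

d_updates = g_updates = 0
for _ in range(3):                             # stand-in for "until converged"
    for _ in range(n_d):                       # n_d discriminator updates
        theta = rng.dirichlet(alpha, size=m)   # one theta per document
        # d_f = G(theta); L = L_d + lam * L_gp; Adam step on omega_d
        d_updates += 1
    theta = rng.dirichlet(alpha, size=m)
    # Adam step on omega_g using the discriminator's output as signal
    g_updates += 1
```

Each row of `theta` is a valid document-event distribution, so a batch of m rows sums to m.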
In AEM, we employ an event seed s_t, t ∈ {1, ..., E}, an E-dimensional one-hot vector, to generate the event-related word distributions. For example, in a ten-event setting, s_1 = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]^T represents the event seed of the first event. With the event seed s_1 as input, the corresponding distributions are generated by G as:

[φ^1_e ; φ^1_l ; φ^1_k ; φ^1_d] = G(s_1)

where φ^1_e, φ^1_l, φ^1_k and φ^1_d denote the entity distribution, location distribution, keyword distribution and date distribution of the first event, respectively.
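Constructing the seeds and reading off an event's representative words is straightforward. The snippet below is our own illustration: the toy vocabulary and word distribution stand in for the output of a trained generator.

```python
import numpy as np

E = 10

def event_seed(t, E=E):
    """One-hot seed s_t for event t (1-indexed, as in the paper)."""
    s = np.zeros(E)
    s[t - 1] = 1.0
    return s

s1 = event_seed(1)
# Feeding s1 through the trained generator would yield the four
# distributions phi^1_e, phi^1_l, phi^1_k, phi^1_d of the first event.
# Reporting an event then amounts to reading off the top-weighted words:

def top_words(phi, vocab, n=5):
    idx = np.argsort(phi)[::-1][:n]
    return [vocab[i] for i in idx]

vocab = ["quake", "fire", "vote", "match"]   # toy vocabulary
phi = np.array([0.5, 0.1, 0.3, 0.1])         # toy word distribution
top = top_words(phi, vocab, 2)               # ['quake', 'vote']
```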

Experiments
In this section, we firstly describe the datasets and baseline approaches used in our experiments and then present the experimental results.

Experimental Setup
To validate the effectiveness of AEM in extracting events from social media (e.g. Twitter) and news media sites (e.g. Google News), three datasets (the FSD (Petrovic et al., 2013), Twitter and Google datasets) are employed. Details are summarized below:
• FSD dataset (social media) is the first story detection dataset, containing 2,499 tweets. We filter out events mentioned in fewer than 15 tweets, since events mentioned in very few tweets are less likely to be significant. The final dataset contains 2,453 tweets annotated with 20 events.
• Twitter dataset (social media) is collected from tweets published in December 2010 using the Twitter streaming API. It contains 1,000 tweets annotated with 20 events.
• Google dataset (news articles) is a subset of the GDELT Event Database; documents are retrieved using event-related words. For example, documents containing 'malaysia', 'airline', 'search' and 'plane' are retrieved for the event MH370. Combining the documents related to 30 events, the dataset contains 11,909 news articles.
We choose the following three models as baselines:
• K-means is a well-known data clustering algorithm. We implement it using the sklearn toolbox and represent documents using bag-of-words weighted by TF-IDF.
• LEM (Zhou et al., 2014) is a Bayesian modeling approach for open-domain event extraction. It treats an event as a latent variable and models the generation of an event as a joint distribution of its individual event elements. We implement the algorithm with its default configuration.
• DPEMM (Zhou et al., 2017) is a nonparametric mixture model for event extraction. It addresses the limitation of LEM that the number of events should be known beforehand. We implement the model with the default configuration.
For the social media corpora (FSD and Twitter), a named entity tagger specifically built for Twitter is used to extract named entities, including locations, from tweets. A Twitter Part-of-Speech (POS) tagger (Gimpel et al., 2010) is used for POS tagging, and only words tagged as nouns, verbs or adjectives are retained as keywords. For the Google dataset, we use the Stanford Named Entity Recognizer to identify named entities (organization, location and person). Due to the 'date' information not being provided in the

Experimental Results
To evaluate the performance of the proposed approach, we use precision, recall and F-measure as evaluation metrics. Precision is defined as the proportion of correctly identified events among the events generated by the model. Recall is defined as the proportion of correctly identified true events. For calculating the precision of the 4-tuple, we use the following criterion: (1) do the extracted entity/organization, location, date/person and keyword refer to the same event?
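For concreteness, the three metrics reduce to the usual counting formulas. The sketch below is ours (variable names are illustrative); the event-level correctness judgment itself follows the tuple criterion above.

```python
def precision_recall_f(n_correct, n_extracted, n_true):
    """n_correct: extracted events judged correct;
    n_extracted: events the model generated;
    n_true: ground-truth events in the dataset."""
    p = n_correct / n_extracted
    r = n_correct / n_true
    f = 2 * p * r / (p + r)        # harmonic mean of precision and recall
    return p, r, f

# e.g. 18 correct events out of 20 extracted, against 24 true events
p, r, f = precision_recall_f(18, 20, 24)
```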
It can be observed that K-means performs the worst on all three datasets. On the social media datasets, AEM outperforms LEM and DPEMM by 6.5% and 1.7% respectively in F-measure on the FSD dataset, and by 4.4% and 3.7% in F-measure on the Twitter dataset. We can also observe that, apart from K-means, all the approaches perform worse on the Twitter dataset than on FSD, possibly due to the limited size of the Twitter dataset. Moreover, on the Google dataset, the proposed AEM performs significantly better than LEM and DPEMM: it improves upon LEM by 15.5% and upon DPEMM by more than 30% in F-measure. This is because: (1) the assumption made by LEM and DPEMM that all words in a document are generated from a single event is not suitable for long text such as news articles; (2) DPEMM generates too many irrelevant events, which leads to a very low precision score. Overall, we see the superior performance of AEM across all datasets, with more significant improvement on the Google dataset (long text).
We next visualize the detected events based on the discriminative features learned by the trained D network in AEM. The t-SNE (Maaten and Hinton, 2008) visualization results on the datasets are shown in Figure 2. For clarity, each subplot is plotted on a subset of the dataset containing ten randomly selected events. It can be observed that documents describing the same event have been grouped into the same cluster.
To further evaluate whether variations of the parameters n_d (the number of discriminator iterations per generator iteration), H (the number of units in the hidden layer) and the structure of the generator G impact extraction performance, additional experiments were conducted on the Google dataset, with n_d set to 5, 7 and 10, H set to 100, 150 and 200, and three G structures (3, 4 and 5 layers). The comparison results on precision, recall and F-measure are shown in Figure 3. From the results, it can be observed that AEM with the 5-layer generator performs best, achieving 96.7% in F-measure, while the worst F-measure obtained by AEM is 85.7%. Overall, AEM outperforms all compared approaches across various parameter settings, showing relatively stable performance.
Finally, we compare in Figure 4 the training time required for each model, excluding the constant time required by each model to load the data. We observe that K-means runs fastest among the four approaches. Both LEM and DPEMM need to sample the event allocation for each document and update the relevant counts during Gibbs sampling, which is time-consuming. AEM only requires a fraction of the training time of LEM and DPEMM. Moreover, on a larger dataset such as the Google dataset, AEM is far more efficient than LEM and DPEMM.

Conclusions and Future Work
In this paper, we have proposed a novel approach based on adversarial training to extract the structured representation of events from online text. The experimental comparison with the state-of-the-art methods shows that AEM achieves improved extraction performance, especially on