Self-regulation: Employing a Generative Adversarial Network to Improve Event Detection

Due to their ability to encode and map semantic information into a high-dimensional latent feature space, neural networks have been successfully applied to event detection to a certain extent. However, such a feature space can be easily contaminated by the spurious features inherent in event detection. In this paper, we propose a self-regulated learning approach that utilizes a generative adversarial network to generate spurious features. On this basis, we employ a recurrent network to eliminate the fakes. Detailed experiments on the ACE 2005 and TAC-KBP 2015 corpora show that our proposed method is highly effective and adaptable.


Introduction
Event detection aims to locate the event triggers of specified types in text. Normally, triggers are words or nuggets that evoke the events of interest.
Detecting events automatically is challenging, not only because an event can be expressed in different words, but also because a word may express a variety of events in different contexts. In particular, the frequent use of common words, ambiguous words and pronouns in event mentions makes them harder to detect:

1) Generality: taken home <Transport>
   Ambiguity 1: campaign in Iraq <Attack>
   Ambiguity 2: political campaign <Elect>
   Coreference: Either its bad or good <Marry>

A promising solution to this challenge is semantic understanding. Recently, neural networks have been widely used in this direction (Nguyen et al., 2016; Ghaeini et al., 2016; Feng et al., 2016; Liu et al., 2017b; Chen et al., 2017), which allows the semantics of event mentions (trigger plus context) to be encoded in a high-dimensional latent feature space. This facilitates the learning of deep-level semantics. Besides, the use of neural networks not only strengthens current supervised classification of events but also alleviates the complexity of feature engineering.
However, compared with earlier studies (Liao and Grishman, 2010; Hong et al., 2011; Li et al., 2013), in which the features are carefully designed by experts, the neural network based methods suffer more from spurious features. Here, a spurious feature is defined as latent information that looks semantically related to an event but actually is not (Liu et al., 2017a). For example, in the following sample, the semantic information of the word "prison" most probably gives rise to spurious features, because the word often co-occurs with the trigger "taken" to evoke an Arrest-Jail event instead of the ground-truth event Transport:

2) Prison authorities have given the nod for Anwar to be taken home later in the afternoon.
   Trigger: taken. Event type: Transport

Spurious features often result from semantically pseudo-related context, and during training, a neural network may mistakenly and unconsciously preserve the memory that produces these fakes. However, it is difficult to determine which words are pseudo-related in a specific case, and when they will "jump out" to mislead the generation of latent features during testing.
To address this challenge, we propose to regulate the learning process with a two-channel self-regulated learning strategy. In the self-regulation process, on one hand, a generative adversarial network is trained to produce the most spurious features, while on the other hand, a neural network is equipped with a memory suppressor to eliminate the fakes. Detailed experiments on event detection show that our proposed method achieves a substantial performance gain, and is capable of robust domain adaptation.

[Figure 1: Self-regulated learning scheme]

Task Definition
The task of event detection is to determine whether there are one or more event triggers in a sentence. A trigger is defined as a token or nugget that best signals the occurrence of an event. Once identified, a trigger is assigned a tag indicating the event type:

Input: Either its bad or good
Output: its <trigger>; Marry <type>

We formalize event detection as a multi-class classification problem. Given a sentence, we classify every token of the sentence into one of the predefined event classes (Doddington et al., 2004) or the non-trigger class.
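The formalization above can be sketched as token-level classification. In the following minimal sketch, the label set and the toy classifier are illustrative (not the actual 34 ACE classes or a trained model): every token is mapped to an event class or "None", and the non-"None" tokens are reported as triggers.

```python
# Illustrative label set (assumption: a small subset standing in for the
# predefined event classes plus the non-trigger class).
EVENT_TYPES = ["None", "Transport", "Attack", "Elect", "Marry"]

def detect_events(tokens, classify):
    """Return (token, event type) pairs for the tokens detected as triggers."""
    triggers = []
    for tok in tokens:
        label = classify(tok)                  # index into EVENT_TYPES
        if EVENT_TYPES[label] != "None":
            triggers.append((tok, EVENT_TYPES[label]))
    return triggers

# Toy per-token classifier standing in for the trained discriminator.
lookup = {"taken": 1}                          # "taken" -> Transport
result = detect_events("Anwar was taken home".split(),
                       lambda t: lookup.get(t, 0))
```

Every token receives a decision, so a sentence with no event simply yields an empty trigger list.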

Self-Regulated Learning (SELF)
SELF is a double-channel model (Figure 1), consisting of a cooperative network (Islam et al., 2003) and a generative adversarial network (GAN) (Goodfellow et al., 2014). A memory suppressor S is used to regulate communication between the channels.

Cooperative Network
In channel 1, the generator G is specified as a multilayer perceptron. It plays the role of a "diligent student". By a differentiable function G(x, θ_g) with parameters θ_g, the generator learns to produce a vector of latent features o_g that may best characterize the token x, i.e., o_g = G(x, θ_g).
The discriminator D (called "a lucky professor") is a single-layer perceptron, implemented as a differentiable function D(o_g, θ_d) with parameters θ_d. Relying on the feature vector o_g, it attempts to accurately predict the probability of the token x triggering an event for all event classes, i.e., ŷ = D(o_g, θ_d), and assigns x to the most probable class c (iff ŷ_c > ŷ_c̃ for all c̃ ≠ c). Therefore, G and D cooperate with each other during training, developing the parameters θ_g and θ_d toward the same goal of minimizing the performance loss L(ŷ, y) in the detection task:

(θ_g, θ_d) = argmin_{θ_g, θ_d} L(ŷ, y)

where y denotes the ground-truth probability distribution over event classes, and L indicates the deviation of the prediction from the ground truth.
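The cooperative channel can be sketched as a forward pass; the sizes below are assumptions for illustration, not the paper's settings. G is a one-hidden-layer MLP producing latent features, and D is a single-layer perceptron with a softmax output over the event classes.

```python
import numpy as np

rng = np.random.default_rng(0)
e, d, C = 8, 6, 34        # input dim, feature dim, class count (assumed sizes)

# Generator G: a multilayer perceptron producing latent features o_g = G(x; θ_g).
W1, b1 = 0.1 * rng.normal(size=(d, e)), np.zeros(d)

def G(x):
    return np.tanh(W1 @ x + b1)                # o_g ∈ R^d

# Discriminator D: a single-layer perceptron over o_g with softmax output.
Wd, bd = 0.1 * rng.normal(size=(C, d)), np.zeros(C)

def D(o_g):
    z = Wd @ o_g + bd
    z -= z.max()                               # numerical stability
    p = np.exp(z)
    return p / p.sum()                         # ŷ: distribution over classes

x = rng.normal(size=e)                         # a token representation
y_hat = D(G(x))
c = int(np.argmax(y_hat))                      # ŷ_c > ŷ_c̃ for all c̃ ≠ c
```

During training both parameter sets would be updated by gradient descent on the same loss, which is what makes the channel cooperative.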

Generative Adversarial Network
In channel 2, the generator Ǧ and discriminator Ď have the same perceptual structures as G and D. They also perform learning by differentiable functions, respectively Ǧ(x, θ_ǧ) and Ď(o_ǧ, θ_ď). A major difference, however, is that they are caught in a cycle of highly adversarial competition. The generator Ǧ is a "trouble maker". It learns to produce spurious features, and utilizes them to contaminate the feature vector o_ǧ of the token x.
Thus Ǧ changes a real sample x into a fake z, sometimes successfully, sometimes less so. Using the fakes, Ǧ repeatedly instigates the discriminator Ď to make mistakes. On the other side, Ď ("a hapless professor") has to avoid being deceived, and struggles to correctly detect events no matter whether it encounters x or z.
In order to outsmart its adversary, Ǧ develops the parameters θ_ǧ during training to maximize the performance loss, while, on the contrary, Ď develops the parameters θ_ď to minimize the loss:

θ_ǧ = argmax_{θ_ǧ} L(ŷ_ǧ, y);  θ_ď = argmin_{θ_ď} L(ŷ_ǧ, y)

Numerous studies have confirmed that this two-player min-max game enables both Ǧ and Ď to improve their methods (Goodfellow et al., 2014; Liu and Tuzel, 2016; Huang et al., 2017).
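The opposite update directions can be illustrated with a toy one-parameter "pipeline" (the quadratic loss below is an assumption for illustration, not the paper's network): Ǧ takes a gradient ASCENT step on the loss, while Ď takes a gradient DESCENT step.

```python
def loss(theta_g, theta_d, x=1.5, y=1.0):
    """Stand-in differentiable detection loss L(Ď(Ǧ(x)), y)."""
    return (theta_d * theta_g * x - y) ** 2

tg, td, lr, eps = 0.5, 0.5, 0.01, 1e-6

# Discriminator Ď: minimize the loss (numerical gradient for brevity).
d_grad = (loss(tg, td + eps) - loss(tg, td - eps)) / (2 * eps)
td_new = td - lr * d_grad                      # descent step

# Generator Ǧ: maximize the loss.
g_grad = (loss(tg + eps, td) - loss(tg - eps, td)) / (2 * eps)
tg_new = tg + lr * g_grad                      # ascent step
```

After one step, Ď's update has lowered the loss and Ǧ's update has raised it, which is exactly the min-max dynamic described above.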

Regulation with Memory Suppressor
Using a memory suppressor, we try to optimize the diligent student G. The goal is to make G as dissimilar as possible to the trouble maker Ǧ.
The suppressor uses the output o_ǧ of Ǧ as a reference resource which should be full of spurious features. On this basis, it looks over the output o_g of G to verify whether the features in o_g are different from those in o_ǧ. If they are very different, the suppressor allows G to preserve its memory (viz., θ_g in G(x, θ_g)); otherwise it forces an update. In other words, for G, the suppressor forcibly erases the memory which may result in the generation of spurious features. We call this self-regulation.
Self-regulation is performed for the whole sentence which is fed into G and Ǧ. Assume that O_g is a matrix constituted by a series of feature vectors, i.e., the vectors generated by G for all the tokens in an input sentence (o_g ∈ O_g), while O_ǧ is another feature matrix, generated by Ǧ for the same tokens (o_ǧ ∈ O_ǧ). Thus, we utilize the matrix approximation between O_g and O_ǧ to measure the loss of self-regulated learning L_diff. The higher the similarity, the greater the loss. During training, the generator G is required to develop the parameters θ_g to minimize this loss:

θ_g = argmin_{θ_g} L_diff(O_g, O_ǧ)

We present the matrix approximation calculation in detail in section 4.4, where the squared Frobenius norm (Bousmalis et al., 2016) is used.
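The dissimilarity measure can be sketched as follows (the matrix shapes are assumptions): the more the features of G resemble the spurious features of Ǧ, the larger the loss, so minimizing it pushes G away from Ǧ.

```python
import numpy as np

def self_regulation_loss(O_g, O_gc):
    """Squared Frobenius norm of O_g · O_ǧᵀ as the similarity-based loss."""
    A = O_g @ O_gc.T                   # pairwise scalar products of rows
    return float((A ** 2).sum())

l, d = 4, 6                            # tokens per sentence, feature dim (toy)
rng = np.random.default_rng(1)
O_g = rng.normal(size=(l, d))
high = self_regulation_loss(O_g, O_g)              # identical features: large loss
low = self_regulation_loss(O_g, np.zeros((l, d)))  # no overlap: zero loss
```

Two identical feature matrices maximize the row-wise scalar products and hence the loss, while orthogonal (here, zero) reference features yield zero loss.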

Learning to Predict
We incorporate the cooperative network with the GAN, and enhance their learning by joint training.
In the 4-member incorporation, i.e., {G, Ǧ, D, Ď}, the primary beneficiary is the lucky professor D. It benefits from both the cooperation in channel 1 and the competition in channel 2. The latent features it uses are well produced by G, and decontaminated by eliminating possible fakes like those made by Ǧ. Therefore, in the experiments, we choose to output the prediction results of D.
In this paper, we use two recurrent neural networks (RNNs) (Sutskever et al., 2014; Chung et al., 2014) of the same structure as the generators, and both discriminators are implemented as a fully-connected layer followed by a softmax layer.

Recurrent Models for SELF
An RNN with long short-term memory (LSTM) is adopted due to its superior performance in a variety of NLP tasks (Liu et al., 2016a; Lin et al., 2017; Liu et al., 2017a). Furthermore, the bidirectional LSTM (Bi-LSTM) architecture (Schuster and Paliwal, 1997; Ghaeini et al., 2016; Feng et al., 2016) is strictly followed. This architecture enables modeling the semantics of a token with both the preceding and following contexts.

LSTM based Generator
Given a sentence, we follow previous work to take all the tokens of the whole sentence as the input. Before feeding the tokens into the network, we transform each of them into a real-valued vector x ∈ R^e. The vector is formed by concatenating a word embedding with an entity type embedding.
• Word Embedding: It is a fixed-dimensional real-valued vector which represents the hidden semantic properties of a token (Collobert and Weston, 2008; Turian et al., 2010).
• Entity Type Embedding: It is specially used to characterize the entity type associated with a token. The BIO2 tagging scheme (Wang and Manning, 2013) is employed for assigning a type label to each token in the sentence.
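The input construction above can be sketched as a concatenation; the dimensions and embedding tables below are toy assumptions, not the paper's pre-trained values. Each token vector joins a word embedding with an entity type embedding keyed by the token's BIO2 tag.

```python
import numpy as np

# Toy embedding tables (assumptions for illustration).
word_emb = {"Anwar": np.ones(6), "home": np.zeros(6)}
entity_emb = {"B-PER": np.full(3, 0.5), "O": np.zeros(3)}

def token_vector(token, bio_tag):
    """x ∈ R^e with e = word embedding dim + entity type embedding dim."""
    return np.concatenate([word_emb[token], entity_emb[bio_tag]])

x = token_vector("Anwar", "B-PER")     # a person-entity token
```

In the actual model, the word table would be pre-trained and the entity table randomly initialized, but the concatenation step is the same.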
For the input token x_t at the current time step t, the LSTM generates the latent feature vector o_t ∈ R^d from the previous memory. Meanwhile, the token is used to update the current memory.
The LSTM possesses a long-term memory unit c_t ∈ R^d and a short-term candidate memory c̃_t ∈ R^d. In addition, it is equipped with an input gate i_t, a forget gate f_t and a hidden state h_t, which are assembled together to promote the use of memory as well as dynamic memory updating. Each of them is likewise defined as a d-dimensional vector in R^d. The LSTM works in the following way:

[i_t; f_t; o_t; c̃_t] = W · [h_{t-1}; x_t] + b        (5)
c_t = σ(f_t) ⊙ c_{t-1} + σ(i_t) ⊙ tanh(c̃_t)         (6)
h_t = σ(o_t) ⊙ tanh(c_t)                             (7)

where W ∈ R^{4d×(d+e)} and b ∈ R^{4d} are the parameters of the affine transformation; σ refers to the logistic sigmoid function and ⊙ denotes element-wise multiplication.
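A sketch of one LSTM step, mirroring the single-affine-map formulation above (the sizes and random initialization are toy assumptions): one matrix multiply over [h; x] is split four ways into the gates and the candidate memory.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step: split W·[h; x] + b into input gate i, forget gate f,
    output gate o and candidate memory c̃, then update c_t and h_t."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    i, f, o, g = np.split(z, 4)                           # four d-sized chunks
    c_t = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)   # ⊙ memory update
    h_t = sigmoid(o) * np.tanh(c_t)
    return h_t, c_t

d, e = 4, 3                                # feature dim, input dim (toy)
rng = np.random.default_rng(2)
W = 0.1 * rng.normal(size=(4 * d, d + e))  # W ∈ R^{4d×(d+e)}
b = np.zeros(4 * d)
h_t, c_t = lstm_step(rng.normal(size=e), np.zeros(d), np.zeros(d), W, b)
```

Because h_t is an output gate times a tanh, its entries always stay strictly inside (-1, 1).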
The output functions of both generators in SELF, i.e., G and Ǧ, boil down to the output gate o_t ∈ R^d of the LSTM cell:

o_t = LSTM(x_t; θ)

where the function LSTM(·;·) is shorthand for Eqs. (5-7) and θ represents all the parameters of the LSTM. For both G and Ǧ, θ is initialized with the same values in the experiments. But due to the distinct training goals of G and Ǧ (diligence or trouble-making), the parameter values in the two cases become very different after training. Therefore, we have o_{g,t} = LSTM(x_t; θ_g) and o_{ǧ,t} = LSTM(x_t; θ_ǧ).

Fully-connected Layer for Discrimination
Depending on the feature vectors o_{g,t} and o_{ǧ,t}, the two discriminators D and Ď predict the probability of the token x_t triggering an event for each event class. As usual, they compute the probability distribution over classes using a fully-connected layer followed by a softmax layer:

ŷ = softmax(Ŵ · o_t + b̂)

where ŷ is a C-dimensional vector, in which each dimension indicates the prediction for a class; C is the number of classes; Ŵ ∈ R^{C×d} is the weight matrix which needs to be learned; b̂ is a bias term. It is noteworthy that the discriminators D and Ď do not share the weight and the bias. This means that, for the same token x_t, they may make markedly different predictions:

ŷ_{g,t} = softmax(Ŵ_d · o_{g,t} + b̂_d);  ŷ_{ǧ,t} = softmax(Ŵ_ď · o_{ǧ,t} + b̂_ď)

Classification Loss
We specify the loss as the cross-entropy between the predicted and ground-truth probability distributions over classes. Given a batch of training data that includes N samples (x_i, y_i), we calculate the losses the discriminators incur as below:

L(ŷ_g, y) = -(1/N) Σ_{i=1}^{N} Σ_{j=1}^{C} y_{i,j} · log ŷ_{g,i,j}
L(ŷ_ǧ, y) = -(1/N) Σ_{i=1}^{N} Σ_{j=1}^{C} y_{i,j} · log ŷ_{ǧ,i,j}

where y_i is a C-dimensional one-hot vector. The value of its j-th dimension is set to 1 only if the token x_i triggers an event of the j-th class, and 0 otherwise. Both ŷ_{g,i} and ŷ_{ǧ,i} are the predicted probability distributions over the C classes for x_i.
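The cross-entropy over a batch can be sketched with toy numbers (N=2 samples, C=3 classes; all values illustrative, and the averaging over N is our reading of the batch loss):

```python
import numpy as np

def classification_loss(Y, Y_hat):
    """-(1/N) Σ_i Σ_j y_ij · log ŷ_ij for one-hot rows of Y."""
    return float(-(Y * np.log(Y_hat)).sum() / Y.shape[0])

Y = np.array([[1.0, 0.0, 0.0],       # x_1 triggers class 0
              [0.0, 1.0, 0.0]])      # x_2 triggers class 1
Y_hat = np.array([[0.8, 0.1, 0.1],   # predicted distributions
                  [0.2, 0.7, 0.1]])
loss = classification_loss(Y, Y_hat)
```

Because Y is one-hot, only the log-probability assigned to each ground-truth class contributes to the sum.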

Loss of Self-regulated Learning
Assume that O_g is a matrix consisting of the feature vectors output by G for all the tokens in a sentence, i.e., o_{g,t} ∈ O_g, and O_ǧ is that provided by Ǧ, i.e., o_{ǧ,t} ∈ O_ǧ. We compute the similarity between O_g and O_ǧ and use it as the measure of the self-regulation loss L_diff(O_g, O_ǧ):

L_diff(O_g, O_ǧ) = ‖O_g · O_ǧ^⊤‖²_F

where ‖·‖²_F denotes the squared Frobenius norm (Bousmalis et al., 2016), which is used to calculate the similarity between the matrices.
It is noteworthy that the feature vectors a generator outputs serve as the rows of its matrix, deployed top-down in the order in which they are generated. For example, the feature vector o_{g,t} that the generator G outputs at time t is placed in the t-th row of the matrix O_g.
At the very beginning of the measurement, the similarity between every feature vector in O_g and every one in O_ǧ is calculated by the matrix-matrix multiplication:

O_g · O_ǧ^⊤ ∈ R^{l×l}

where the symbol ⊤ denotes the transpose operation; l is the sentence length, which is defined to be uniform for all sentences (l=80), with padding used when it is larger than the real length; the (i, j)-th entry o_{g,i} · o_{ǧ,j} is the scalar product between the feature vectors o_{g,i} and o_{ǧ,j}.
Let A_{m×n} be a matrix. The squared Frobenius norm of A_{m×n} (i.e., ‖A_{m×n}‖²_F) is defined as:

‖A_{m×n}‖²_F = Σ_{i=1}^{m} Σ_{j=1}^{n} a_{ij}²

where a_{ij} denotes the j-th element in the i-th row of A_{m×n}. Thus, if we let A_{m×n} be the matrix produced by the matrix-matrix multiplication O_g · O_ǧ^⊤, its squared Frobenius norm yields the self-regulation loss L_diff(O_g, O_ǧ). For a batch of training data that includes N sentences, the global self-regulation loss is specified as the sum of the losses over all the sentences:

L_SELF = Σ_{k=1}^{N} L_diff(O_g^{(k)}, O_ǧ^{(k)})
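Putting the pieces together, the batch-level loss can be sketched as follows (l=80 from the text; the feature size and the random feature matrices are assumptions): each sentence's matrices are zero-padded to l rows, and the per-sentence losses are summed over the batch.

```python
import numpy as np

l, d = 80, 5                               # uniform sentence length, feature dim

def pad(O):
    """Zero-pad a feature matrix to l rows, keeping generation order."""
    out = np.zeros((l, d))
    out[: O.shape[0]] = O
    return out

def l_diff(O_g, O_gc):
    """Per-sentence self-regulation loss ‖O_g · O_ǧᵀ‖²_F."""
    return float(((O_g @ O_gc.T) ** 2).sum())

rng = np.random.default_rng(3)
# Two toy sentences of 7 and 12 tokens, each with a (G, Ǧ) feature pair.
batch = [(rng.normal(size=(n, d)), rng.normal(size=(n, d))) for n in (7, 12)]
L_SELF = sum(l_diff(pad(O_g), pad(O_gc)) for O_g, O_gc in batch)
```

The zero-padded rows contribute nothing to the scalar products, so padding changes the shapes but not the loss value.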

Training
We train the cooperative network in SELF to minimize both the classification loss L(ŷ_g, y) and the loss of self-regulated learning L_SELF:

(θ_g, θ_d) = argmin_{θ_g, θ_d} [L(ŷ_g, y) + λ · L_SELF]

where λ is a hyper-parameter used to harmonize the two losses. The min-max game is utilized for training the adversarial net in SELF: θ_ǧ = argmax L(ŷ_ǧ, y); θ_ď = argmin L(ŷ_ǧ, y).
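The cooperative channel's objective is simply a λ-weighted sum; the toy loss values below are illustrative, with λ = 0.1 as reported in the hyperparameter settings.

```python
def joint_loss(l_class, l_self, lam=0.1):
    """L(ŷ_g, y) + λ · L_SELF: the quantity minimized over (θ_g, θ_d)."""
    return l_class + lam * l_self

total = joint_loss(0.29, 1.7)    # toy classification and self-regulation losses
```

A small λ keeps the classification loss dominant while still penalizing features that resemble the spurious ones.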
All the networks in SELF are trained jointly using the same batches of samples. They are trained via stochastic gradient descent (Nguyen and Grishman, 2015) with shuffled mini-batches and the AdaDelta update rule (Zeiler, 2012). The gradients are computed using back-propagation, and regularization is implemented by dropout (Hinton et al., 2012).

Resource and Experimental Datasets
We test the presented model on the ACE 2005 corpus. The corpus is annotated with single-token event triggers and has 33 predefined event types (Doddington et al., 2004; Ahn, 2006), which, along with one "None" class for the non-trigger tokens, constitute a 34-class classification problem.
For comparison purposes, we use the corpus in the traditional way, randomly selecting 30 English articles from different genres as the development set, and utilizing a separate set of 40 English newswire articles as the test set. The remaining 529 English articles are used as the training set.

Hyperparameter Settings
The word embeddings are initialized with 300-dimensional real-valued vectors. We follow previous work (Feng et al., 2016) to pre-train the embeddings over the NYT corpus using Mikolov et al. (2013)'s skip-gram tool. The entity type embeddings, as usual (Feng et al., 2016; Liu et al., 2017b), are specified as 50-dimensional real-valued vectors. They are initialized with 32-bit floating-point values randomly sampled from the uniform distribution over [-1, 1]. We initialize the other adjustable parameters of the back-propagation algorithm by randomly sampling in [-0.1, 0.1].
We follow Feng et al. (2016) to set the dropout rate to 0.2 and the mini-batch size to 10. We tune the initialized parameters mentioned above, the harmonic coefficient λ, the learning rate and the L2 norm on the development set. Grid search (Liu et al., 2017a) is used to seek the optimal parameters. Eventually, we take a coefficient λ of 0.1, a learning rate of 0.3 and an L2 norm of 0.
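For reference, the settings reported in this section can be gathered into one place (the key names are our own; settings not stated in the text are omitted):

```python
# Hyperparameters as reported in this section (key names are assumptions).
CONFIG = {
    "word_embedding_dim": 300,
    "entity_type_embedding_dim": 50,
    "dropout_rate": 0.2,
    "mini_batch_size": 10,
    "lambda": 0.1,                       # harmonic coefficient for L_SELF
    "learning_rate": 0.3,
    "l2_norm": 0,
    "entity_embedding_init_range": (-1, 1),
    "other_param_init_range": (-0.1, 0.1),
}
```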
The source code of SELF to reproduce the experiments has been made publicly available.

Compared Systems
The state-of-the-art models proposed in the past decade are compared with ours. Taking the learning framework as the criterion, we divide the models into three classes. Minimally supervised approaches: including MSEP-EMD (Peng et al., 2016). Feature based approaches: including Joint (Li et al., 2013), which is based on a structured perceptron and combines local and global features. Neural network based approaches: including the convolutional neural network (CNN) (Nguyen and Grishman, 2015), the non-consecutive N-grams based CNN (NC-CNN) (Nguyen and Grishman, 2016), the CNN assembled with a dynamic multi-pooling layer (DM-CNN), as well as Ghaeini et al. (2016)'s FB-RNN, Nguyen et al. (2016)'s Bi-RNN and Feng et al. (2016)'s Hybrid.

Experimental Results
We evaluate our model using Precision (P), Recall (R) and F-score (F). To facilitate the comparison, we review the best performance of the competitors, which has been evaluated using the same metrics and publicly reported earlier.

Table 1 shows the trigger identification performance. It can be observed that SELF outperforms the other models, with a performance gain of no less than 1.1% F-score. Admittedly, the gain mainly comes from the higher recall (78.8%), but the precision (75.3%), which is relatively comparable to the recall, reinforces the advantage. By contrast, although most of the compared models achieve much higher precision than SELF, they suffer greatly from the substantial gaps between precision and recall: their precision advantage is offset by the greater loss of recall.

Trigger identification
The GAN plays an important role in optimizing the Bi-RNN. This is proven by the fact that SELF (Bi-LSTM+GAN) outperforms Nguyen et al. (2016)'s Bi-RNN. Note that the two models use different kinds of recurrent units: Bi-RNN uses GRUs, while SELF uses LSTM units. Nevertheless, GRUs have been experimentally shown to be comparable in performance to LSTMs (Chung et al., 2014; Jozefowicz et al., 2015). This allows a fair comparison between Bi-RNN and SELF.

Table 2 shows the performance of multi-class classification. SELF achieves nearly the same F-score as Feng et al. (2016)'s Hybrid, and outperforms the others. More importantly, SELF is the only model that obtains a performance higher than 70% for both precision and recall.

Event classification
Besides, by analyzing the experimental results, we have identified the following regularities:

• Similar to the pattern classifiers that are based on hand-designed features, the CNN models obtain higher precision. However, the recall is lower.
• The RNN models contribute to achieving a higher recall. However, the precision is lower.
• Expansion of the training data set helps to increase the precision.
Let us turn to the structurally more complicated models, SELF and Hybrid.
SELF inherits the merits of the RNN models, classifying the events with higher recall. Besides, through the utilization of the GAN, SELF evolves beyond the traditional learning strategies, being capable of learning from the GAN and getting rid of the mistakenly generated spurious features. As a result, it outperforms the other RNNs, with improvements of no less than 4.5% precision and 1.7% recall.
Hybrid is elaborately established by assembling an RNN with a CNN. It models an event from two perspectives: language generation and pragmatics. The former is deeply learned using the continuous states hidden in the recurrent units, while the latter is learned through the convolutional features. Multi-angled cognition enables Hybrid to be more precise. However, it is built using a single-channel architecture, concatenating the RNN and the CNN. This results in a twofold accumulation of feature information, causing a serious overfitting problem. Therefore, Hybrid is confined to much higher precision but substantially lower recall.
Overfitting enlarges the gap between precision and recall when the task becomes more difficult. For Hybrid, as illustrated in Figure 2, the gap becomes much wider (from 9% to 19.7%) when the binary classification task (trigger identification) is shifted to multi-class classification (event detection). By contrast, the other work shows a nearly constant gap. In particular, SELF yields the minimum gap in each task, which changes negligibly from 3.5% to 3.4%. It may be added that, similar to DM-CNN and FB-RNN, SELF is cost-effective. Compared to other models (Table 3), it either uses less training data, or is only required to learn two kinds of embeddings, namely those of words and entity types.

Discussion: Adaptation, Robustness and Effectiveness
Domain adaptation is a key criterion for evaluating the utility of a model in practical applications. A model can be considered adaptable only if, when trained on the source domain, it works well for the unlabeled data in the target domain (Blitzer et al., 2006; Plank and Moschitti, 2013). We perform two groups of domain adaptation experiments, using the ACE 2005 corpus and the corpus of the TAC-KBP 2015 event nugget track (Ellis et al., 2015), respectively.
The ACE corpus consists of 6 domains: broadcast conversation (bc), broadcast news (bn), telephone conversation (cts), newswire (nw), usenet (un) and web blogs (wl). Following the common practice of adaptation research on this data (Nguyen and Grishman, 2014, 2015; Plank and Moschitti, 2013), we take the union of bn and nw as the source domain, and bc, cts and wl as three different target domains. We randomly select half of the instances from bc to constitute the development set. The TAC-KBP corpus consists of 2 domains: newswire (NW) and discussion forum (DF). We follow Peng et al. (2016) to use NW and DF in alternation as the source domain, with the other serving as the target domain. We randomly select a proportion (20%) of the instances from the target domain to constitute the development set.

We compare with Joint, CNN, MSEP-EMD, SSED (Sammons et al., 2015) and Hybrid. All the models except Hybrid have been previously assessed for domain adaptation performance; in this section, we only cite the best performance they obtained. We reproduce Hybrid using the source code provided by the authors. To ensure a fair comparison, we perform 3 runs, in each of which both Hybrid and SELF are redeveloped on a new development set. What we report herein is the average performance over the 3 runs.

Adaptation Performance
We show the adaptation performance on the ACE corpus in Table 4 and that on TAC-KBP in Table 5. It can be observed that SELF outperforms the other models in the out-of-domain scenarios.
Besides, when testing is performed on the out-of-domain ACE corpus, the performance degradation of SELF is not much larger than that of CNN and Hybrid. When the out-of-domain TAC-KBP corpus is used, the performance of SELF is impaired much less severely than that of SSED and Hybrid. More importantly, the adaptability of SELF is relatively close to that of MSEP-EMD. Considering that MSEP-EMD is stable due to using minimal supervision (Peng et al., 2016), we suggest that the fully trained networks in SELF are not extremely inflexible; on the contrary, they should be transferable for use (Ge et al., 2016).

Robustness in Resource-Poor Settings
We discuss two resource-poor conditions in this section: a lack of in-domain training data and a lack of out-of-domain training data. Hybrid and SELF are brought into the discussion.
For the former (in-domain) case, we went over the numbers of samples used for training in the adaptation experiments, which are shown in Table 6. It can be observed that the NW domain contains the minimum number of training samples (triggers plus tokens). By contrast, the bn+nw domain contains the smallest number of positive samples (triggers), though an overwhelming number of negative samples (general tokens).
Under such conditions, Hybrid performs better in the NW domain than in bn+nw and DF in the three in-domain adaptation experiments (see the column labelled "In-domain bn+nw" in Table 4 as well as "In-domain NW" and "In-domain DF" in Table 5). This illustrates that Hybrid does not necessarily rely on a tremendous number of training samples to ensure robustness. But SELF does. It needs far more negative samples than Hybrid for the following reasons:

• It relies on the use of spurious features to implement self-regulation during training.
• It is impossible to know in advance which negative samples harbor spurious features. Therefore, taking as many negative samples as possible into consideration may help to increase the probability that the spurious features will be discovered.

This is demonstrated by the fact that SELF obtains better performance in the bn+nw domain but not in NW (see the column labeled "Training" in Table 6 and "In-domain" in Tables 4 and 5). It may be added that SELF performs worse in DF although more negative samples are used for training (see Table 6). Taking a glance at the number of positive samples, one may find that it is approximately 2.4 times larger than that in bn+nw, whereas the number of negative samples in DF is only 1.5 times larger than that in bn+nw. This implies that, if more positive samples are used for training, SELF needs to consume proportionally more negative samples for self-regulation; otherwise, the performance will degrade.
For the out-of-domain case, both Hybrid and SELF encounter the problem that there is a lack of target-domain data available for training. In this case, SELF displays less performance degradation than Hybrid.

Recall and Missing
SELF is able to accurately recall the events whose occurrence is triggered by ambiguous words, such as "fine", "charge", "campaign", etc. These ambiguous words easily cause confusion. For example, "campaign" may trigger an Elect event or an Attack event in the ACE corpus. More importantly, SELF fishes out the common words which serve as a trigger even though they are not closely related to any kind of event, such as "take", "try", "acquire", "become", "create", etc. In general, it is very difficult to accurately recall such triggers because their meanings are not concrete enough, and their contexts may be full of all kinds of noise (see example 2 on page 1). We observe that Bi-RNN and Hybrid seldom pick them up.
However, SELF fails to recall the pronouns that act as a trigger. This is because they occur much more frequently in spoken language than in written language. The lack of narrative content makes it difficult to learn the relationship between the pronouns and the events. Some real examples collected from ACE are shown in Table 7.

Related Work
Event detection is an important subtask of event extraction (Doddington et al., 2004;Ahn, 2006).
The research can be traced back to the pattern based approach (Grishman et al., 2005). Encouraged by the high accuracy and ease of use, researchers have made great efforts to extract discriminative patterns. Cao et al. (2015a, 2015b) use dependency regularization and active learning to generalize and expand the patterns.
In earlier studies, another trend was to explore the features that best characterize each event class, so as to facilitate supervised classification. A variety of strategies have emerged for converting classification clues into feature vectors (Ahn, 2006; Patwardhan and Riloff, 2009; Liao and Grishman, 2010; Hong et al., 2011; Li et al., 2013; Wei et al., 2017). Benefiting from this general modeling framework, the methods enable the fusion of multiple features, and more importantly, they are flexible to use through feature selection. But considerable expertise is required for feature engineering.
Recently, the use of neural networks for event detection has become a promising line of research. The closely related work has been presented in section 5.3. The primary advantages of neural networks have been demonstrated in that work, such as performance enhancement, self-learning capability and robustness.
The generative adversarial network (Goodfellow et al., 2014) has emerged as an increasingly popular approach for text processing (Lamb et al., 2016). Liu et al. (2017a) use adversarial multi-task learning for text classification. We follow this work to create spurious features, but use them to regulate the self-learning process in a single-task situation.

Conclusion
We use a self-regulated learning approach to improve event detection. In the learning process, the adversarial and cooperative models are utilized to decontaminate the latent feature space.
In this study, the performance of the discriminator in the adversarial network remains to be evaluated. Most probably, this discriminator also performs well because it is gradually enhanced by the fierce competition. Considering this possibility, we suggest driving the two discriminators in our self-regulation framework to cooperate with each other. Besides, the global features extracted in Li et al. (2013)'s work are potentially useful for detecting the event instances referred to by pronouns, although they involve noise. Therefore, in the future, we will encode the global information with neural networks and use the self-regulation strategy to reduce the negative influence of the noise.