Leveraging FrameNet to Improve Automatic Event Detection

Frames deﬁned in FrameNet (FN) share highly similar structures with events in ACE event extraction program. An even-t in ACE is composed of an event trigger and a set of arguments. Analogously, a frame in FN is composed of a lexical unit and a set of frame elements, which play similar roles as triggers and arguments of ACE events respectively. Besides having similar structures, many frames in FN actually express certain types of events. The above observations motivate us to explore whether there exists a good mapping from frames to event-types and if it is possible to improve event detection by using FN. In this paper, we propose a global inference approach to detect events in FN. Further, based on the detected results, we analyze possible mappings from frames to event-types. Finally, we improve the performance of event detection and achieve a new state-of-the-art result by using the events automatically detected from FN.


Introduction
In the ACE (Automatic Context Extraction) event extraction program, an event is represented as a structure consisting of an event trigger and a set of arguments. This paper tackles with the event detection (ED) task, which is a crucial component in the overall task of event extraction. The goal of ED is to identify event triggers and their corresponding event types from the given documents.
FrameNet (FN) (Baker et al., 1998;Fillmore et al., 2003) is a linguistic resource storing considerable information about lexical and predicateargument semantics. In FN, a frame is defined as a composition of a Lexical Unit (LU) and a set of Frame Elements (FE). Most frames contain a set of exemplars with annotated LUs and FEs (see Figure 2 and Section 2.2 for details).
From the above definitions of events and frames, it is not hard to find that the frames defined in FN share highly similar structures as the events defined in ACE. Firstly, the LU of a Frame plays a similar role as the trigger of an event. ACE defines the trigger of an event as the word or phrase which most clearly expresses an event occurrence. For example, the following sentence "He died in the hospital." expresses a Die event, whose trigger is the word died. Analogously, the LU of a frame is also the word or phrase which is capable of indicating the occurrence of the expressed semantic frame. For example, the sentence "Aeroplanes bombed London." expresses an Attack 1 frame, whose LU is the word bombed. Secondly, the FEs of a frame also play similar roles as arguments of an event. Both of them indicate the participants involved in the corresponding frame or event. For example, in the first sentence, He and hospital are the arguments, and in the second sentence, Aeroplanes and London are the FEs.
Besides having similar structure as events, many frames in FN actually express certain types of events defined in ACE. Table 1 shows some examples of frames which also express events.

Frame
Event Sample in FN Attack Attack Aeroplanes bombed London.

Invading Attack
Hitler invaded Austria .

Fining Fine
The court fined her $40.
Execution Execute He was executed yesterday. The aforementioned observations motivate us to 1 The notation of frames distinguishes from that of events by the italic decoration. explore: (1) whether there exists a good mapping from frames to event-types, and (2) whether it is possible to improve ED by using FN. For the first issue, we investigate whether a frame could be mapped to an event-type based on events expressed by exemplars annotated for that frame. Therefore the key is to detect events from the given exemplar sentences in FN. To achieve this goal, we propose a global inference approach (see figure 1). We firstly learn a basic ED model based on the ACE labeled corpus and employ it to yield initial judgements for each sentence in FN. Then, we apply a set of soft constraints for global inference based on the following hypotheses: 1). Sentences belonging to the same LU tend to express events of the same type; 2). Sentences belonging to related frames tend to express events of the same type; 3). Sentences belonging to the same frame tend to express events of the same type. All of the above constraints and initial judgments are formalized as first-order logic formulas and modeled by Probabilistic Soft Logic (PSL) (Kimmig et al., 2012;Bach et al., 2013). Finally, we obtain the final results via PSL-based global inference. We conduct both manual and automatic evaluations for the detected results.
For the second issue, ED generally suffers from data sparseness due to lack of labeled samples. Some types, such as Nominate and Extradite, contain even less than 10 labeled samples. Apparently, from such a small scale of training data is difficult to yield a satisfying performance. We notice that ACE corpus only contains about 6,000 labeled instances, while FN contains more than 150,000 exemplars. Thus, a straightforward solution to alleviate the data sparseness problem is to expand the ACE training data by using events detected from FN. The experimental results show that events from FN significantly improve the performance of the event detection task. To sum up, our main contributions are: (1) To our knowledge, this is the first work performing event detection over ACE and FN to explore the relationships between frames and events. (2) We propose a global inference approach to detect events in FN, which is demonstrated very effective by our experiments. Moreover, based on the detected results, we analyze possible mappings from frames to event-types (all the detecting and mapping results are released for further use by the NLP community 2 ). (3) We improve the performance of event detection significantly and achieve a new state-of-the-art result by using events automatically detected from FN as extra training data.

ACE Event Extraction
In ACE evaluations, an event is defined as a specific occurrence involving several participants. ACE event evaluation includes 8 types of events, with 33 subtypes. Following previous work, we treat them simply as 33 separate event types and ignore the hierarchical structure among them. In this paper, we use the ACE 2005 corpus 3 in our experiments. It contains 599 documents, which include about 6,000 labeled events.

FrameNet
The FrameNet is a taxonomy of manually identified semantic frames for English 4 . Figure 2 shows the hierarchy of FN corpus. Listed in the FN with each frame are a set of lemmas with part of speech (i.e "invade.v") that can evoke the frame, which are called lexical units (LUs). Accompanying most LUs in the FN is a set of exemplars annotated for them. Moreover, there are a set of labeled relations between frames, such as Inheritance.
FN contains more than 1,000 various frames and 10,000 LUs with 150,000 annotated exemplars. Eight relations are defined between frames in FN, but in this paper we only use the following three of them because the others do not satisfy our hypotheses (see section 4.2): Inheritance: A inherited from B indicates that A must correspond to an equally or more specific fact about B. It is a directional relation. See also: A and B connected by this relation indicates that they are similar frames. Perspective on: A and B connected by this relation means that they are different points-of-view about the same fact (i.e. Receiving vs. Transfer).

Related Work
Event extraction is an increasingly hot and challenging research topic in NLP. Many approaches have been proposed for this task. Nearly all the existing methods on ACE event task use supervised paradigm. We further divide them into featurebased methods and representation-based methods.
In feature-based methods, a diverse set of strategies has been exploited to convert classification clues into feature vectors. Ahn (2006) uses the lexical features(e.g., full word), syntactic features (e.g., dependency features) and externalknowledge features (WordNet (Miller, 1995)) to extract the event. Inspired by the hypothesis of One Sense Per Discourse (Yarowsky, 1995), Ji and Grishman (2008) combined global evidence from related documents with local decisions for the event extraction. To capture more clues from the texts, Gupta and Ji (2009), Liao and Grishman (2010) and Hong et al. (2011) proposed the crossevent and cross-entity inference for the ACE event task. Li et al. (2013) proposed a joint model to capture the combinational features of triggers and arguments. Liu et al. (2016) proposed a global inference approach to employ both latent local and global information for event detection.
In representation-based methods, candidate event mentions are represented by embedding, which typically are fed into neural networks. Two similarly related work has been proposed on event detection (Chen et al., 2015;Nguyen and Grishman, 2015). Nguyen and Grishman (2015) employed Convolutional Neural Networks (CNNs) to automatically extract sentence-level features for event detection. Chen et al. (2015) proposed dynamic multi-pooling operation on CNNs to capture better sentence-level features.
FrameNet is a typical resource for framesemantic parsing, which consists of the resolution of predicate sense into a frame, and the analysis of the frame's participants (Thompson et al., 2003;Giuglea and Moschitti, 2006;Hermann et al., 2014;. Other tasks which have been studied based on FN include question answering (Narayanan and Harabagiu, 2004;Shen and Lapata, 2007), textual entailment (Burchardt et al., 2009) and paraphrase recognition (Padó and Lapata, 2005). This is the first work to explore the application of FN to event detection.

Basic Event Detection Model
Alike to existing work, we model event detection (ED) as a word classification task. In the ED task, each word in the given sentence is treated as a candidate trigger and the goal is to classify each of these candidates into one of 34 classes (33 event types plus a NA class). However, in this work, as we assumed that the LU of a frame is analogical to the trigger of an event, we only treat the LU annotated in the given sentence as a trigger candidate. Each sentence in FN only contains one candidate trigger, thus "the candidate" denotes both the candidate trigger of a sentence and the sentence itself for FN in the remainder of this paper. Another notable difference is that we train the detection model on one corpus (ACE) but apply it on another (FN). That means our task is also a cross-domain problem. To tackle with it, our basic ED approach follows representation-based paradigm, which has been demonstrated effective in the cross-domain situation (Nguyen and Grishman, 2015).

Model
We employ a simple three-layer (a input layer, a hidden layer and a soft-max output layer) Artificial Neural Networks (ANNs) (Hagan et al., 1996) to model the ED task. In our model, adjacent layers are fully connected.
Word embeddings learned from large amount of unlabeled data have been shown to be able to capture the meaningful semantic regularities of words (Bengio et al., 2003;Erhan et al., 2010). This paper uses unsupervised learned word embeddings as the source of base features. We use the Skipgram model (Mikolov et al., 2013) to learn word embeddings on the NYT corpus 5 .
Given a sentence, we concatenate the embedding vector of the candidate trigger and the average embedding vector of the words in the sentence as the input to our model. We train the model using a simple optimization technique called stochastic gradient descent (SGD) over shuffled mini-batches with the Adadelta update rule (Zeiler, 2012). Regularization is implemented by a dropout (Kim, 2014;Hinton et al., 2012). The experiments show that this simple model is surprisingly effective for event detection.

Event Detection in FrameNet
To detect events in FN, we first learned the basic ED model based on ACE labeled corpus and then employ it to generate initial judgements (possible event types with confidence values) for each sentence in FN. Then, we apply a set of constraints for global inference based on the PSL model.

Probabilistic Soft Logic
PSL is a framework for collective, probabilistic reasoning in relational domains (Kimmig et al., 2012;Bach et al., 2013). Similar to Markov Logic Networks (MLNs) (Richardson and Domingos, 2006), it uses weighted first-order logic formulas to compactly encode complex undirected probabilistic graphical models. However, PSL brings two remarkable advantages compared with MLNs. First, PSL relaxes the boolean truth values of MLNs to continuous, soft truth values. This allows for easy integration of continuous values, such as similarity scores. Second, PSL restricts the syntax of first order formulas to that of rules with conjunctive bodies. Together with the soft truth values constraint, the inference in PSL is a convex optimization problem in continuous space and thus can be solved using efficient inference approaches. For further details, see the references (Kimmig et al., 2012;Bach et al., 2013).

Global Constraints
Our global inference approach is based on the following three hypotheses.

H1: Same Frame Same Event
This hypothesis indicates that sentences under the same frame tend to express events of the same type. For example, all exemplars annotated for the frame Rape express events of type Attack, and all sentences under the frame Clothing express NA (none) events. With this hypothesis, sentences annotated for the same frame help each other to infer their event types during global inference. H2: Related Frame Same Event This hypothesis is an extension of H1, which relaxes "the same frame" constraint to "related frames". In this paper, frames are considered to be related if and only if they are connected by one of the following three relations: Inheritance, See also and Perspective on (see section 2.2). For example, the frame Invading is inherited from Attack, and they actually express the same type of event, Attack. With this hypothesis, sentences under related frames help each other to infer their event types during global inference.
The previous two hypotheses are basically true for most frames but not perfect. For example, for the frame Dead or alive, only a few of the sentences under it express Die events while the remainder do not. To amend the this flaw, we introduce the third hypothesis. H3: Same LU Same Event This hypothesis indicates that sentences under the same LU tend to express events of the same type (as a remind, LUs are under frames). It is looser than the previous two hypotheses thus holds true in more situations. For example, H3 holds true for the frame Dead or alive which violates H1 and H2. In FN, LUs annotated for that frame are alive.a, dead.a, deceased.a, lifeless.a, living.n, undead.a and undead.n. All exemplars under dead.a, deceased.a and lifeless.a express Die events. Therefore, this hypothesis amends the flaws of the former two hypotheses.
On the other hand, the first two hypotheses also help H3 in some cases. For example, most of the sentences belonging to the LU suit.n under the frame Clothing are misidentified as Sue events due to the ambiguity of the word "suit". However, in this situation, H1 can help to rectify it because the majority of LUs under Clothing are not ambiguous words. Thus, under the first hypothesis, the misidentified results are expected to be corrected by the the results of other exemplars belonging to Clothing.

Inference
To model the above hypotheses as logic formulas in PSL, we introduce a set of predicates (see Table 2), which are grouped into two categories: observed predicates and target predicates. Observed predicates are used to encode evidences, which are always assumed to be known during the inference, while target predicates are unknown and thus need to be predicted.
CandEvt(c, t) is introduced to represent conf (c, t), which is the confidence value generated by the basic ED model for classifying the candidate c as an event of the type t. SameF r(c 1 , c 2 ) indicates whether the candidates c 1 and c 2 belong to the same frame. It is initialized by the indicator function I sf (c 1 , c 2 ), which is defined as follows: SameLU (c 1 , c 2 ) is similar, but applies for candidates under the same LU. The last three observed predicates in Table 2 are used to encode the aforementioned semantic relations between frames. For example, Inherit(c 1 , c 2 ) indicates whether the frame of c 1 is inherited from that of c 2 , and it is initialized by the indicator function I ih (c 1 , c 2 ), which is set to 1 if and only if the frame of c 1 is inherited from that of c 2 , otherwise 0. Evt(c, t) is the only target predicate, which indicates that the candidate c triggers an event of type t.

Type Predicate Assignment
Observed CandEvt(c, t) conf (c, t) SameF r(c 1 , c 2 ) Putting all the predicates together, we design a set of formulas to apply the aforementioned hypotheses in PSL (see Table 3). Formula f 1 connects the target predicate with the initial judgements from the basic ED model. Formulas f 2 and f 3 respectively encode H1 and H3. Finally, the remaining formulas are designed for various relations between frames in H2. We tune the formulas's weights via grid search (see Section 5.4). The inference results provide us with the most likely

Evaluations
In this section, we present the experiments and the results achieved. We first manually evaluate our novel PSL-based ED model on the FN corpus. Then, we also conduct automatic evaluations for the events detected from FN based on ACE corpus. Finally, we analyze possible mappings from frames/LUs to event types.

Data
We learned the basic ED model on ACE2005 dataset. In order to evaluate the learned model, we followed the evaluation of (Li et al., 2013): randomly selected 30 articles from different genres as the development set, and we subsequently conducted a test on a separate set of 40 ACE 2005 newswire documents. We used the remaining 529 articles as the training data set. We apply our proposed PSL-based approach to detect events in FrameNet. Via collecting all exemplars annotated in FN, we totally obtain 154,484 sentences for detection.

Setup and Performance of Basic Model
We have presented the basic ED model in Section 3. Hyperparameters were tuned by grid search on the development data set. In our experiments, we set the size of the hidden layer to 300, the size of word embedding to 200, the batch size to 100 and the dropout rate to 0.5. Table 4 shows the experimental results, from which we can see that the three-layer ANN model is surprisingly effective for event detection, which even yields competitive results compared with Nguyen's CNN and Chen's DMCNN. We believe the reason is that, compared with CNN and DMCNN,

Pre
Rec F1  ANN focuses on capturing lexical features which have been proved much more important than sentence features for the ED task by (Chen et al., 2015). Moreover, our basic model achieves much higher precision than state-of-the-art approaches (79.5% vs. 75.6%). We also investigate the performance of the basic ED model without pre-trained word embeddings 6 (denoted by ANN-Random). The result shows that randomly initialized word embeddings decrease the F 1 score by 7.3 (61.5 vs. 68.8). The main reasons are: 1). ACE corpus only contains 599 articles, which are far insufficient to train good embeddings. 2). Words only existing in the test dataset always retain random embeddings.

Baselines
For comparison, we designed four baseline systems that utilize different hypotheses to detect events in FN.
(1) ANN is the first baseline, which directly uses a basic ED model learned on ACE training corpus to detect events in FN. This system does not apply any hypotheses between frames and events.
(2) SameFrame (SF) is the second baseline system, which applies H1 over the results from AN-N. For each frame, we introduce a score function φ(f, t) to estimate the probability that the frame f could be mapped to the event type t as follows: where S f is the set of sentences under the frame f ; I(c, t) is an indicator function which is true if and only if ANN predicts the candidate c as an event of type t. Then for each frame f satisfying φ(f, t) > α, we mapped it to event type t, where α is a hyperparameter. Finally, all sentences under mapped frames are labeled as events. Note that, 6 We thank the anonymous reviewer for this suggestion.
unlike the PSL-based approach which applies constraints as soft rules, this system utilizes H1 as a hard constraint.
(3) RelatedFrame (RF) is the third baseline system, which applies H2 over the results from AN-N. For each frame f , we merge it and its related frames into a super frame, f . Similar with SF, a score function ζ(f , t), which shares the same expression to equation 3, is introduced. For the merged frame satisfying ζ(f , t) > β, we mapped it to the event type t. Finally, all sentences under f are labeled as events.
(4) SameLU (SL) is the last baseline, which applies the hypothesis H3 over the results from ANN. Also, a score function ψ(l, t) is introduced: where S l is the set of sentences under the LU l. For each LU satisfying ψ(l, t) > γ, we mapped it to the event type t. Finally, all sentences under l are labeled as events.

Manual Evaluations
In this section, we manually evaluate the precision of the baseline systems and our proposed PSLbased approach. For fair comparison, we set α, β and γ to 0.32, 0.29 and 0.42 respectively to ensure they yield approximately the same amount of events as the first baseline system ANN. We tune the weights of formulas in PSL via grid search by using ACE development dataset. In details, we firstly detect events in FN under different configurations of formulas' weights and add them to ACE training dataset, respectively. Consequently, we obtain several different expanded training datasets. Then, we separately train a set of basic ED models based on each of these training datasets and evaluate them over the development corpus. Finally, the best weights are selected according to their performances on the development dataset.

Manual Annotations
Firstly, we randomly select 200 samples from the results of each system. Each selected sample is a sentence with a highlighted trigger and a predicted event type. Figure 3 illustrates three samples. The first line of each sample is a sentence labeled with the trigger. The next line is the predicted event type of that sentence. Annotators are asked to assign one of two labels to each sample (annotating in the third line): Y: the word highlighted in the given sentence indeed triggers an event of the predicted type.
N: the word highlighted in the given sentence does not trigger any event of the predicted type. We can see that, it is very easy to annotate a sample for annotators, thus the annotated results are expected to be of high quality. To make the annotation more credible, each sample is independently annotated by three annotators 7 (including one of the authors and two of our colleagues who are familiar with ACE event task) and the final decision is made by voting. Table 5 shows the results of manual evaluations. Through the comparison of ANN and SF, we can see that the application of H1 caused a loss of 5.5 point. It happens mainly because the performance of SF is very sensitive to the wrongly mapped frames. That is, if a frame is mismapped, then all sentences under it would be mislabeled as events. Thus, even a single mismapped frame could significantly hurt the performance. This result also proves that H1 is inappropriate to be used as a hard constraint. As H2 is only an extension of H1, RF performs similarly with SF. Moreover, SL obtains a gain of 2.0% improvement compared with ANN, which demonstrates that the "same LU" hypothesis is very useful. Finally, with all the hypotheses, the PSL-based approach achieves the best performance, which demonstrates that our hypotheses are useful and it is an effective way to jointly utilize them as soft constraints through PSL for event detection in FN.

Automatic Evaluations
To prepare for automatic evaluations, we respectively add the events detected from FN by each of the aforementioned five systems to ACE training corpus. Consequently, we obtain five ex-7 The inter-agreement rate is 86.1%   Table 6: Automatic evaluations of events from FN. Table 6 presents the results where we measure precision, recall and F 1 . Compared with ACE-ANN-FN, events from SF and RF hurt the performance. As analyzed in previous section, SF and R-F yield quite a few false events, which dramatically hurt the accuracy. Moreover, ACE-SL-FN obtains a score of 70.3% in F 1 measure, which outperforms ACE-ANN-FN. This result illustrates the effectiveness of our "same LU" hypothesis. Finally and most importantly, consistent with the results of manual evaluations, ACE-PSL-FN performs the best, which further proves the effectiveness of our proposed approach for event detection in FN.

Improving Event Detection Using FN
Event detection generally suffers from data sparseness due to lack of labeled samples. In this section, we investigate the effects of alleviating the aforementioned problem by using the events detected from FN as extra training data. Our investigation is conducted by the comparison of two basic ED models, ANN and ANN-FN: the former is trained on ACE training corpus and the latter is trained on the new training corpus ACE-PSL-FN (introduced in the previous section), which contains 3,816 extra events detected from FN.  Table 7: Effects of expanding training data using events automatically detected from FN. Table 7 presents the experimental results. Compared with ANN, ANN-FN achieves a significant improvement of 1.9% in F 1 measure. It happens mainly because that the high accurate extra training data makes the model obtain a higher recall (from 60.7% to 65.2%) with less decrease of precision (from 79.5% to 77.6%). The result demonstrates the effectiveness of alleviating the data sparseness problem of ED by using events detected from FN. Moreover, compared with state-ofthe-art methods, ANN-FN outperforms all of them with remarkable improvements (more than 1.3%).

Analysis of Frame-Event Mapping
In this section, we illustrate the details of mappings from frames to event types. The mapping pairs are obtained by computing the function φ (see Section 5.3) for each (frame, event-type) pair (f , t) based on the events detected by the PSLbased approach. Table 8 presents the top 10 mappings. We manually evaluate their quality by investigating: (1) whether the definition of each frame is compatible with its mapped event type; (2) whether exemplars annotated for each frame actually express events of its mapped event type.
For the first issue, we manually compare the definitions of each mapped pair. Except Relational nat features 8 , definitions of all the mapped pairs are compatible. For the second issue, we randomly sample 20 exemplars (if possible) from each frame and manually annotate them. Except the above frame and Invading, exemplars of the remaining frames all express the right events. The only exemplar of Invading failing to express its mapped event is as follows: "The invasion of China by western culture has had a number of far-reaching effects on Confucianism." ACE requires an Attack event to be a physical act, while the invasion of culture is unphysical. Thus, the above sentence does not express an 8 The full name is Relational natural relations in FN.  Attack event. To sum up, the quality of our mappings is good, which demonstrates that the hypothesis H1 is basically true.

Analysis of LU-Event Mapping
This section illustrates the details of mappings from LUs to event types. The mapping pairs are obtained by computing the function ψ (see Section 5.3).  Table 9: Top 10 mappings from LUs to event types. N e is the number of exemplars detected as events; ||S l || and ψ hold the same meanings as mentioned in Section 5.3.
To investigate the mapping quality, we manu-9 Their frames separately are Hostile encounter, Cause harm, Forming relationships, Killing, Trial, Attack, Quarreling, Arrest, Forming relationships and Hit target. ally annotate the exemplars under these LUs. The result shows that all exemplars are rightly mapped. These mappings are quite good. We believe the reason is that an LU is hardly ambiguous due to its high specificity, which is not only specified by a lemma but also by a frame and a part of speech tag. Table 9 only presents the top 10 mappings. In fact, we obtain 54 mappings in total with ψ = 1.0. We released all the detected events and mapping results for further use by the NLP community.

Conclusions and Future Work
Motivated by the high similarity between frames and events, we conduct this work to study their relations. The key of this research is to detect events in FN. To solve this problem, we proposed a PSL-based global inference approach based on three hypotheses between frames and events. For evaluation, we first conduct manual evaluations on events detected from FN. The results reveal that our hypotheses are very useful and it is an effective way to jointly utilize them as soft rules through P-SL. In addition, we also perform automatic evaluations. The results further demonstrate the effectiveness of our proposed approach for detecting events in FN. Furthermore, based on the detected results, we analyze the mappings from frames/LUs to event types. Finally, we alleviate the data sparseness problem of ED by using events detected from FN as extra training data. Consequently, we obtain a remarkable improvement and achieve a new state-of-the-art result for the ED task.
Event detection is only a component of the overall task of event extraction, which also includes event role detection. In the future, we will extend this work to the complete event extraction task. Furthermore, event schemas in ACE are quite coarse. For example, all kinds of violent acts, such as street fights and wars, are treated as a single event type Attack. We plan to refine the event schemas by the finer-grained frames defined in FN (i.e. Attack may be divided into Terrorism, Invading, etc.).