Event Detection with Trigger-Aware Lattice Neural Network

Event detection (ED) aims to locate trigger words in raw text and then classify them into the correct event types. Neural network based models have become the mainstream approach to this task in recent years. However, two problems arise for languages without natural delimiters, such as Chinese. First, word-based models suffer severely from word-trigger mismatch, which limits their performance. Second, even if trigger words are located accurately, the polysemy of triggers can still affect the trigger classification stage. To address the two issues simultaneously, we propose the Trigger-aware Lattice Neural Network (TLNN). (1) The framework dynamically incorporates word and character information so that the trigger-word mismatch issue can be avoided. (2) Moreover, for polysemous characters and words, we model all of their senses with the help of an external linguistic knowledge base, so as to alleviate the problem of ambiguous triggers. Experiments on two benchmark datasets show that our model effectively tackles both issues and significantly outperforms previous state-of-the-art methods. The source code of this paper can be obtained from https://github.com/thunlp/TLNN.


Introduction
Event Detection (ED) is a pivotal part of Event Extraction, which aims to detect the position of event triggers in raw text and classify them into the corresponding event types. Conventionally, the stage of locating trigger words is known as Trigger Identification (TI), and the stage of classifying trigger words into particular event types is called Trigger Classification (TC). Although neural network methods have achieved significant progress in event detection (Nguyen and Grishman, 2015; Chen et al., 2015; Zeng et al., 2016), both steps still face the following two issues.
In the TI stage, trigger-word mismatch can severely impact the performance of event detection systems. In languages without natural delimiters, such as Chinese, mainstream approaches are mostly word-based models, for which word segmentation must first be performed as a preprocessing step. Unfortunately, these word-wise methods neglect an important problem: a trigger can be a specific part of one word or span multiple words. As shown in Figure 1(a), "射" (shoot) and "杀" (kill) are two triggers that are both parts of the word "射杀" (shoot and kill). In the other case, "示威游行" (demonstration) is a trigger that crosses two words. Under these circumstances, word-based methods cannot locate triggers correctly, which becomes a serious limitation for the task. Some feature-based methods have been proposed (Chen and Ji, 2009; Qin et al., 2010; Li and Zhou, 2012) to alleviate the issue, but they rely heavily on hand-crafted features. Lin et al. (2018) propose nugget proposal networks (NPN) for this issue, which use a neural network to model the character compositional structure of trigger words in a fixed-size window. However, the NPN mechanism limits the scope of trigger candidates to a fixed-size window, which is inflexible and suffers from the problem of trigger overlaps.
Even if the locations of triggers are correctly detected in the TI step, the TC step can still be severely affected by the inherent ambiguity of polysemy, because a trigger word with multiple word senses can be classified into different event types. Take Figure 1(b) as an example: the polysemous trigger word "释放" (release) can represent two distinctly different event types. In the first case, the word 'release' triggers an Attack event (release tear gas). In the second case, the event triggered by 'release' becomes Release-Parole (release a man in court).
To further illustrate that the two problems mentioned above do exist, we manually count the proportion of mismatched triggers and polysemous triggers on two widely used datasets. The statistics are shown in Table 1: data with trigger-word mismatch and trigger polysemy account for a considerable proportion and therefore affect the task.
In this paper, we propose the Trigger-aware Lattice Neural Network (TLNN), a comprehensive model that can tackle both issues simultaneously. To avoid error propagation from NLP tools such as word segmenters, we take characters as the basic units of the input sequence. Moreover, we utilize HowNet (Dong and Dong, 2003), an external knowledge base that manually annotates polysemous Chinese and English words, to obtain sense-level information. Further, we develop the trigger-aware lattice LSTM as the feature extractor of our model, which leverages character-level, word-level and sense-level information at the same time. More specifically, to address the trigger-word mismatch issue, we construct shortcut paths that link the cell states of the start and end characters of each word. It is worth mentioning that the paths are sense-level, which means that all the sense information of words ending at a specific character flows into the memory cell of that character. Hence, by using multiple granularities of information (character, word and word sense), the problem of polysemous triggers can be effectively alleviated.
We conduct experiments on two real-world datasets for the task of event detection. Empirical results of the main experiments show that our model can effectively address both issues. In comprehensive comparisons with other methods, our model achieves state-of-the-art results on both datasets. Subsidiary experiments are also conducted to analyze how TLNN addresses the two issues.

Methodology
In this paper, event detection is regarded as a sequence labelling task. For each character, the model should identify whether it is part of a trigger and classify the trigger into one specific event type.
The architecture of our model is shown in Figure 2, which primarily includes the following three parts:
(1) Hierarchical Representation Learning, which learns the character-level, word-level and sense-level embedding vectors in an unsupervised way.
(2) Trigger-aware Feature Extractor, which automatically extracts different levels of semantic features by a tree structure LSTM model.
(3) Sequence Tagger, which calculates the probability of being a trigger for each character candidate.
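To make the sequence labelling formulation concrete, the following minimal sketch converts trigger spans into one label per character. The BIO scheme, the helper name `spans_to_bio`, the example sentence and the event type names are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Tuple

# Hypothetical representation: a trigger is (start, end_exclusive, event_type),
# with offsets counted in characters.
Trigger = Tuple[int, int, str]


def spans_to_bio(sentence: str, triggers: List[Trigger]) -> List[str]:
    """Assign one label per character; "O" marks characters outside any trigger."""
    labels = ["O"] * len(sentence)
    for start, end, event_type in triggers:
        labels[start] = f"B-{event_type}"
        for i in range(start + 1, end):
            labels[i] = f"I-{event_type}"
    return labels


# Illustrative sentence: "射" and "杀" are separate triggers inside the word "射杀".
example = spans_to_bio("警察射杀歹徒", [(2, 3, "Attack"), (3, 4, "Die")])
```

Because labels are attached to characters rather than words, a trigger that is only part of a word, or that spans several words, remains representable.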

Hierarchical Representation Learning
The input is a sequence $S = \{c_1, c_2, ..., c_N\}$, where $c_i$ denotes the $i$-th character. At the character level, each character is represented by an embedding vector $x^c$ obtained with the Skip-Gram method (Mikolov et al., 2013).
At the word level, the input sequence can also be written as $S = \{w_1, w_2, ..., w_M\}$, where the basic unit is a single word $w_i$. In this paper, we use two indexes $b$ and $e$ to denote the start and the end of a word, written $w_{b,e}$, and each word is likewise mapped to a word embedding. However, the Skip-Gram method maps each word to only a single embedding, ignoring the fact that many words have multiple senses. Hence, a representation of finer granularity is still necessary to capture deep semantics. With the help of HowNet (Dong and Dong, 2003), we can obtain the representation of each sense of a character or a word. For each character $c$, there are possibly multiple senses $\mathrm{sen}^{(c_i)} \in S^{(c)}$ annotated in HowNet. Similarly, for each word $w$, the senses are $\mathrm{sen}^{(w_i)} \in S^{(w)}$. Consequently, we can obtain the sense embeddings by jointly learning word and sense embeddings in a Skip-Gram manner. This mechanism is also used by Niu et al. (2017).
Here $\mathrm{sen}^{(c_i)}_j$ and $\mathrm{sen}^{(w_{b,e})}_j$ denote the $j$-th sense of the character $c_i$ and of the word $w_{b,e}$ in the sequence, and $s^{c_i}_j$ and $s^{w_{b,e}}_j$ are the corresponding sense embeddings.
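The word indexes $b$ and $e$ can be illustrated with a small lexicon-matching sketch that enumerates every candidate word $w_{b,e}$ in a character sequence. The function names, the `max_len` cutoff and the toy lexicon are assumptions made for illustration; the sense inventories that HowNet would supply are not modelled here.

```python
from typing import List, Set, Tuple

Match = Tuple[int, int, str]  # (b, e, word) with inclusive character indexes


def match_lexicon(chars: str, lexicon: Set[str], max_len: int = 4) -> List[Match]:
    """Return every (b, e, word) such that chars[b..e] is a lexicon entry."""
    matches = []
    for b in range(len(chars)):
        # Only multi-character words: single characters are handled at the character level.
        for e in range(b + 1, min(len(chars), b + max_len)):
            word = chars[b : e + 1]
            if word in lexicon:
                matches.append((b, e, word))
    return matches


def words_ending_at(matches: List[Match], i: int) -> List[Tuple[int, str]]:
    """Start index and surface form of every matched word whose last character is i."""
    return [(b, word) for b, e, word in matches if e == i]


toy_lexicon = {"罪名", "成立", "罪名成立"}  # hypothetical lexicon
matches = match_lexicon("若罪名成立", toy_lexicon)
```

Overlapping candidates such as "成立" and "罪名成立" are all kept, which is what allows a lattice model to consider several segmentations at once.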

Trigger-Aware Feature Extractor
The trigger-aware feature extractor is the core component of our model. After training, the outputs of the extractor are the hidden state vectors h of an input sentence.
Conventional LSTM. LSTM (Hochreiter and Schmidhuber, 1997) is an extension of the recurrent neural network (RNN) with additional gates that control the information flow. Traditionally, an LSTM has three basic gates: the input gate $i$, the output gate $o$ and the forget gate $f$. They collectively control which information is retained, forgotten and output. All three gates are parameterized by a corresponding weight matrix $W$. The current cell state $c$ records all historical information flow up to the current time step. The character-based LSTM is defined over these quantities, and $h^c_i$ denotes its hidden state vector for the $i$-th character.
Trigger-Aware Lattice LSTM. The Trigger-Aware Lattice LSTM is the core feature extractor of our framework and is an extension of the LSTM and the lattice LSTM. In this subsection, we derive and theoretically analyze the model in detail.
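For reference, the conventional character-based LSTM described at the start of this subsection is usually written as follows. This is the textbook formulation with the notation of this paper substituted in ($x^c_i$ for the embedding of the $i$-th character, $h^c_i$ for its hidden state). The stacked weight matrix $W^c$, the bias $b^c$ and the candidate state $\tilde{c}^c_i$ are assumed names and may differ from the original equations.

```latex
\begin{aligned}
\begin{bmatrix} i^c_i \\ o^c_i \\ f^c_i \\ \tilde{c}^c_i \end{bmatrix}
  &= \begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix}
     \left( W^{c\top} \begin{bmatrix} x^c_i \\ h^c_{i-1} \end{bmatrix} + b^c \right) \\
c^c_i &= f^c_i \odot c^c_{i-1} + i^c_i \odot \tilde{c}^c_i \\
h^c_i &= o^c_i \odot \tanh\left(c^c_i\right)
\end{aligned}
```

Here $\sigma$ is the logistic sigmoid and $\odot$ is the element-wise product.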
Figure 3: The structure of the Trigger-Aware Feature Extractor. The input of the example is a part of the sentence "若罪名成立，他将被逮捕" (If convicted, he will be arrested). In this case, "罪名成立" (convicted) is a trigger with event type Justice: Sentence. "成立" (convicted/found) and "立" (stand/conclude) are polysemous words. To keep the figure concise, we (1) only show two senses for each polysemous word; (2) only show the forward direction.
In this section, characters and words are assumed to have K senses. As mentioned in 2.1, the embedding of the $j$-th sense of the $i$-th character $c_i$ is $s^{c_i}_j$. An additional LSTMCell is then used to integrate all senses of the character. In the computation of the cell gate of the multi-sense character $c_i$, $c^{c_i}_j$ is the cell state of the $j$-th sense of the $i$-th character, and $c^c_{i-1}$ is the final cell state of the $(i-1)$-th character. To obtain the cell state of the character, an additional gate is used, and all the senses are then dynamically integrated into the temporary cell state, weighted by $\alpha^{c_i}_j$, the character sense gate after normalization. Eq. 11 obtains the temporary cell state of the character, $c^{*c_i}$, by incorporating all the sense information of the character. However, word-level information needs to be considered as well. As mentioned in 2.1, $s^{w_{b,e}}_j$ is the embedding of the $j$-th sense of the word $w_{b,e}$. Similar to characters, an extra LSTMCell is used to calculate the cell state of each word that matches the lexicon D.
Similar to Eq. 11, the cell state of the word is computed by incorporating the cells of all its senses (Eq. 16).
Here $\alpha^{w_{b,e}}_j$ is the word sense gate after normalization. For a character $c_i$, the temporary cell state $c^{*c_i}$ that contains sense information is calculated by Eq. 11. Moreover, Eq. 16 gives the cell states of all words that end at index $i$, denoted $c^{w_{b,i}}$ for each matched start index $b$. To ensure that the corresponding information flows into the final cell state of $c_i$, an extra gate $g^m_{b,i}$ is used to merge the character and word cells. The final cell state of the character, $c^c_i$, is then computed from these cells, weighted by $\alpha^w_{b,i}$ and $\alpha^c_i$, the word gate and the character gate after normalization. The normalization is similar to Eq. 12 and Eq. 17. Therefore, the final cell state $c^c_i$ can represent the ambiguous characters and words in a dynamic manner. Similar to Eq. 7, hidden state vectors are then calculated and passed to the sequence tagger layer.
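The gated merging described above (normalize a set of gates, then take a weighted sum of candidate cell states) can be sketched in plain Python. The element-wise softmax used for normalization is an assumption borrowed from common lattice LSTM practice, since the exact form of the normalization equations is not reproduced in this text; `normalize_gates` and `merge_cells` are hypothetical names.

```python
import math
from typing import List

Vector = List[float]


def normalize_gates(gates: List[Vector]) -> List[Vector]:
    """Element-wise softmax across candidates: weights sum to 1 in every dimension."""
    dims = len(gates[0])
    weights = [[0.0] * dims for _ in gates]
    for d in range(dims):
        exps = [math.exp(g[d]) for g in gates]
        total = sum(exps)
        for k, value in enumerate(exps):
            weights[k][d] = value / total
    return weights


def merge_cells(cells: List[Vector], gates: List[Vector]) -> Vector:
    """Weighted element-wise sum of candidate cell states (senses, words or characters)."""
    weights = normalize_gates(gates)
    dims = len(cells[0])
    return [sum(w[d] * c[d] for w, c in zip(weights, cells)) for d in range(dims)]


# Two candidate cell states with equal gate values reduce to their average.
merged = merge_cells([[1.0, 0.0], [3.0, 2.0]], [[0.0, 0.0], [0.0, 0.0]])
```

The same routine could serve both levels of merging: first across the senses of one character or word, then across the character cell and every word cell ending at that character.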

Sequence Tagger
In this paper, the event detection task is regarded as a sequence tagging problem. For an input sequence $S = \{c_1, c_2, ..., c_N\}$, there is a corresponding label sequence $L = \{y_1, y_2, ..., y_N\}$. The hidden vectors $h$ of each character obtained in 2.2 are used as the input. We use a classic CRF layer to perform the sequence tagging. In its probability distribution (Eq. 20), $S$ is the score function that computes the emission score from the hidden vector $h_i$ to the label $y_i$, where $W^{y_i}_{CRF}$ and $b^{y_i}_{CRF}$ are learned parameters specific to $y_i$, and $T$ is the transition function that computes the transition score from $y_{i-1}$ to $y_i$. $C$ contains all the possible label sequences on sequence $S$, and L is an arbitrary label sequence in $C$.
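A linear-chain CRF with the emission and transition scores described above is conventionally written as below. This is the standard form, given as a sketch rather than a verbatim copy of the original equations; $L'$ with labels $y'_i$ is written for the arbitrary sequence in $C$ to keep it distinct from the target sequence $L$.

```latex
\begin{aligned}
p(L \mid S) &= \frac{\exp\left(\sum_{i=1}^{N} \big( S(h_i, y_i) + T(y_{i-1}, y_i) \big)\right)}
                    {\sum_{L' \in C} \exp\left(\sum_{i=1}^{N} \big( S(h_i, y'_i) + T(y'_{i-1}, y'_i) \big)\right)} \\
S(h_i, y_i) &= W^{y_i}_{CRF}\, h_i + b^{y_i}_{CRF}
\end{aligned}
```

The denominator sums over every label sequence, so the distribution is normalized at the sentence level rather than per character.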
We use the standard Viterbi algorithm (Viterbi, 1967) as the decoder to find the highest-scoring label sequence. The loss function of our model is the sentence-level log-likelihood.
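The Viterbi decoding step can be sketched as follows. This is a generic implementation over emission and transition score tables, not the authors' code; it omits start and end transitions for brevity, and integer label ids stand in for the actual tag set.

```python
from typing import List


def viterbi(emissions: List[List[float]], transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring label path.

    emissions[i][y]   : score of label y at position i
    transitions[p][y] : score of moving from label p to label y
    """
    num_labels = len(emissions[0])
    scores = list(emissions[0])
    backpointers: List[List[int]] = []
    for i in range(1, len(emissions)):
        new_scores, pointers = [], []
        for y in range(num_labels):
            # Best previous label for ending at label y in position i.
            best_prev = max(range(num_labels), key=lambda p: scores[p] + transitions[p][y])
            new_scores.append(scores[best_prev] + transitions[best_prev][y] + emissions[i][y])
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final label.
    path = [max(range(num_labels), key=lambda y: scores[y])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

With all transition scores set to zero the decoder reduces to a per-position argmax; non-zero transitions let the tagger rule out implausible label sequences.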
In the loss function, $M$ is the number of sentences and $L_i$ is the correct label sequence for the sentence $S_i$.
Chen and Ji (2009). To remain rigorous, we use the official evaluation toolkit to compute the metrics for KBP2017.
Hyper-Parameter Settings. We tune the parameters of our models by grid search on the validation dataset. Adam (Kingma and Ba, 2014) with learning rate decay is used as the optimizer. The embedding sizes of characters and senses are all 50. To avoid overfitting, dropout (Srivastava et al., 2014) is used with a rate of 0.5. We select the best models by early stopping based on the F1 results on the validation dataset. Because their influence is limited, we follow empirical settings for the other hyper-parameters.

Overall Results
Table 2: Overall results of the compared methods and TLNN on ACE2005 and KBP2017. * indicates results adapted from the original paper. For KBP2017, "Trigger Identification" and "Trigger Classification" correspond to the "Span" and "Type" metrics in the official evaluation.
In this section, we compare our model with previous state-of-the-art methods. The compared methods are as follows:
DMCNN (Chen et al., 2015) uses a dynamic multi-pooling CNN as a sentence-level feature extractor. We additionally add a classifier with IOB encoding to DMCNN.
NPN (Lin et al., 2018) is a comprehensive model that automatically learns the inner compositional structures of triggers to solve the trigger mismatch problem.
The results of all the models are shown in Table 2. From the results, we can observe that:
(1) On both ACE2005 and KBP2017, TLNN significantly outperforms the other compared models, achieving the best results on the two datasets. This demonstrates that the trigger-aware lattice structure can enhance the accuracy of locating triggers. Further, thanks to the use of sense-level information, triggers can be classified more precisely into the correct event types.
(2) On the TI stage, TLNN gives the best performance. By linking the shortcut paths of all word candidates to the current character, the model can effectively exploit both character and word information, which alleviates the issue of trigger-word mismatch.
(3) On the TC stage, TLNN still maintains its advantage. The results indicate that the linguistic knowledge of HowNet and the structure that dynamically utilizes sense-level information enhance the performance on the TC stage. By considering the ambiguity of triggers, more of the located triggers are classified into the correct event types.
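The distinction between the TI and TC scores discussed above can be made concrete with a small scoring sketch: TI compares trigger spans only, while TC also requires the event type to match. This is a simplified stand-in, not the official KBP2017 toolkit; the function names and the example predictions are hypothetical.

```python
from typing import Dict, Set, Tuple

Typed = Tuple[int, int, str]  # (start, end_exclusive, event_type)


def prf(predicted: set, gold: set) -> Tuple[float, float, float]:
    """Precision, recall and F1 over two sets of items."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def evaluate(predicted: Set[Typed], gold: Set[Typed]) -> Dict[str, Tuple[float, float, float]]:
    """TI ignores the event type; TC requires span and type to agree."""
    ti = prf({(s, e) for s, e, _ in predicted}, {(s, e) for s, e, _ in gold})
    tc = prf(predicted, gold)
    return {"TI": ti, "TC": tc}


gold = {(2, 3, "Attack"), (3, 4, "Die")}
pred = {(2, 3, "Attack"), (3, 4, "Attack")}  # second trigger located but mistyped
scores = evaluate(pred, gold)
```

In this example both spans are found, so TI is perfect, while the mistyped trigger halves the TC scores. For a whole corpus, a sentence identifier would need to be added to each tuple.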

Effect of Trigger-aware Feature Extractor
In this section, we design a set of experiments to explore the effect of the trigger-aware feature extractor. We implement strong character-based and word-based baselines by replacing the trigger-aware lattice LSTM with a standard Bi-LSTM. For the word-based baselines, the input is first segmented into a word sequence. Furthermore, we implement an extra CNN and LSTM to learn character-level features as additional modules. For the character-based baselines, the basic units of the input sequence are characters. We then enhance the character representation by adding external word-level features, including bigram and softword (the word in which the current character is located). Hence, both baselines can jointly utilize character and word information. As shown in Table 3, experiments with the two types of baselines and our model are conducted on ACE2005 and KBP2017. For the word baseline, adding character-level features improves the performance, but the effect is relatively limited. The char baseline gains considerable improvements when word-level features are taken into account. The baseline results indicate that integrating different levels of information is an effective strategy for improving model performance. TLNN achieves the best F1-score among all the compared models on both datasets, showing remarkable superiority and robustness. The results show that by dynamically combining multi-grained information, the trigger-aware feature extractor can explore deeper semantic features than the feature-based strategies used in the baselines.
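The bigram and softword features used by the character-based baseline can be sketched as below. The exact feature definitions are assumptions: the bigram is taken here as the current character plus the next one, with a hypothetical `</s>` padding token at the end, and the segmentation is supplied by the caller.

```python
from typing import List, Tuple


def char_features(words: List[str]) -> List[Tuple[str, str, str]]:
    """For each character of a segmented sentence, return (char, bigram, softword)."""
    sentence = "".join(words)
    features = []
    position = 0
    for word in words:
        for ch in word:
            # Bigram: current character followed by the next one (assumed definition).
            nxt = sentence[position + 1] if position + 1 < len(sentence) else "</s>"
            # Softword: the segmented word that contains the current character.
            features.append((ch, ch + nxt, word))
            position += 1
    return features


feats = char_features(["警察", "射杀", "歹徒"])  # illustrative segmentation
```

Each feature would be mapped to its own embedding and concatenated with the character embedding, so the quality of the softword feature still depends on the segmenter.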

Influence of Trigger Mismatch
To explore the influence of the trigger mismatch problem, we split the test data of ACE2005 and KBP2017 into two types: match and mismatch. Table 1 shows the proportion of word-trigger match and mismatch on the two datasets.
Table 6: F1-score of two splits of two datasets on Trigger Classification task. The splits are based on the polysemy of triggers. "Poly" and "Mono" correspond to polysemous and monosemous trigger splits.
The recall of the different methods on each split for the Trigger Identification task is shown in Table 4. We can observe that:
(1) The word-trigger mismatch problem can severely impact the performance of the task. All approaches except ours give lower recall on the trigger-mismatch part than on the trigger-match part. In contrast, our model robustly addresses the word-trigger mismatch problem, reaching the best results on both parts of the two datasets.
(2) To a certain extent, the NPN model alleviates the problem by using hybrid representation learning and a nugget generator in a fixed-size window. However, the mechanism is still not flexible or robust enough to integrate character and word information.
(3) The word-based baseline is most severely affected by the trigger-word mismatch problem. This is to be expected: if a trigger is not segmented as a word of its own in the preprocessing stage, it cannot be located correctly.
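The match and mismatch split can be illustrated with a short check of whether a trigger span coincides exactly with one segmented word. The helper names and the example segmentations are assumptions; the paper does not specify how the split was implemented.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end_exclusive) in characters


def word_spans(words: List[str]) -> List[Span]:
    """Character span of every word in a segmented sentence."""
    spans, start = [], 0
    for word in words:
        spans.append((start, start + len(word)))
        start += len(word)
    return spans


def is_match(trigger: Span, words: List[str]) -> bool:
    """True when the trigger is exactly one segmented word, False otherwise."""
    return trigger in word_spans(words)


segmented = ["警察", "射杀", "歹徒"]  # illustrative segmentation
part_of_word = is_match((2, 3), segmented)           # "射" inside "射杀"
whole_word = is_match((2, 4), segmented)             # "射杀"
cross_word = is_match((0, 4), ["示威", "游行"])       # "示威游行" over two words
```

Both mismatch cases named in the introduction fall out of the same test: a trigger inside a word and a trigger spanning several words each fail to align with a single word span.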

Influence of Trigger Polysemy
In this section, we focus on the influence of polysemous triggers. We select the NPN model for comparison. We also implement a version of TLNN without sense information, denoted as TLNN -w/o Sense info in Table 5 and Table 6.
Empirical results in Table 5 show the overall performance on ACE2005 and KBP2017. TLNN is weakened by removing sense information, which indicates the effectiveness of sense-level information. Even without sense information, our model still outperforms the NPN model on both datasets.
To further explore and analyze the effect of word sense information, we split the KBP2017 dataset into two parts based on the polysemy of triggers and their contexts. The F1-score of each split is shown in Table 6, in which TLNN yields the best results on both "Poly" parts. Without sense information, TLNN -w/o Sense info gives F1-scores comparable with TLNN on the "Mono" parts. The results indicate that the trigger-aware feature extractor can dynamically learn all the senses of characters and words, gaining significant improvements under the condition of polysemy.

Case Study
Table 7 shows two examples comparing the TLNN model with other ED methods. The former example concerns trigger-word mismatch: the correct trigger "抗" (resist) is part of the idiom "抗敌援友" (resist the enemies and aid the allies). In this case, the word baseline predicts the whole word "抗敌援友", because word-based methods cannot detect part-of-word triggers. The NPN model recognizes a non-existent word "刻抗". The reason is that the NPN enumerates the combinations of all characters within a window as trigger candidates, which is likely to generate invalid words. In contrast, our model detects the event trigger "抗" accurately.
In the latter example, the trigger "送" (send) is a polysemous word with two different meanings: "送行" (see him off) and "送钱" (give him money). Without considering the multiple word senses of polysemes, the NPN and TLNN (w/o Sense info) classify the trigger "送" into the wrong event type TransferPerson. In contrast, TLNN can dynamically select the word sense of a polysemous trigger by utilizing context information, so the correct event type TransferMoney is predicted.

Related Work
Event Detection (ED) is a crucial subtask of Event Extraction. Feature-based methods (Ahn, 2006; Ji and Grishman, 2008; Liao and Grishman, 2010; Huang and Riloff, 2012; Patwardhan and Riloff, 2009; McClosky et al., 2011) were widely used in the ED task, but these traditional methods rely heavily on manual features, which limits their scalability and robustness.
Recent developments in deep learning have led to a renewed interest in neural event detection. Neural networks can automatically learn features of the input sequence and conduct token-level classification. CNN-based models are the seminal neural network models in ED (Nguyen and Grishman, 2015; Chen et al., 2015). However, these models can only capture local context features in a fixed-size window. Some approaches design comprehensive models to explore the interdependency among trigger words (Chen et al., 2018; Feng et al., 2018). To further improve the ED task, some joint models have been designed (Lu and Nguyen, 2018; Yang and Mitchell, 2016). These methods have achieved great success on English datasets.
However, in languages without delimiters, such as Chinese, the word-trigger mismatch becomes significantly more severe. Some feature-based methods have been proposed to solve the problem (Chen and Ji, 2009; Qin et al., 2010; Li and Zhou, 2012), but they rely heavily on hand-crafted features. Lin et al. (2018) propose NPN, a neural network based method to address the issue. However, the NPN mechanism limits the scope of trigger candidates to a fixed-size window, which causes two problems. First, NPNs still cannot take all the possible trigger candidates into account, leading to meaningless computation. Second, the overlap of triggers is serious in NPNs. Lattice-based models have been used in other fields to combine character and word information (Li et al., 2019).
Mainstream methods also suffer from the problem of trigger polysemy. Lu and Nguyen (2018) propose a multi-task learning model that uses word sense disambiguation to alleviate the effect of trigger polysemy, but that work requires word sense disambiguation datasets. In contrast, our model can solve both the word-trigger mismatch and the trigger polysemy problems at the same time.

Conclusion and Future Work
We propose a novel framework, TLNN, for event detection, which can simultaneously address the problems of trigger-word mismatch and polysemous triggers. With hierarchical representation learning and the trigger-aware feature extractor, TLNN effectively exploits multi-grained information and learns deep semantic features. Experiments on two real-world datasets show that TLNN effectively addresses the two issues and yields better empirical results than a variety of neural network models.
In future work, we will conduct experiments on more languages, with and without explicit word delimiters. In addition, we will try to develop a dynamic mechanism that selectively considers sense-level information rather than taking all the senses of characters and words into account.