Nugget Proposal Networks for Chinese Event Detection

Neural network based models commonly regard event detection as a word-wise classification task, and therefore suffer from the mismatch between words and event triggers, especially in languages without natural word delimiters such as Chinese. In this paper, we propose Nugget Proposal Networks (NPNs), which solve the word-trigger mismatch problem by directly proposing entire trigger nuggets centered at each character regardless of word boundaries. Specifically, NPNs perform event detection in a character-wise paradigm, where a hybrid representation for each character is first learned to capture both structural and semantic information from both characters and words. Then, based on the learned representations, trigger nuggets are proposed and categorized by exploiting the character compositional structures of Chinese event triggers. Experiments on both the ACE2005 and TAC KBP 2017 datasets show that NPNs significantly outperform the state-of-the-art methods.


Introduction
Automatic event extraction is a fundamental task of information extraction. Event detection, which aims to identify event triggers of specific types, is a key step of event extraction. For example, from the sentence "Henry was injured, and then passed away soon", an event detection system should detect an "Injure" event triggered by "injured", and a "Die" event triggered by "passed away".
Recently, neural network methods, which transform event detection into a word-wise classification paradigm, have achieved significant progress in event detection (Nguyen and Grishman, 2015; Chen et al., 2015b; Ghaeini et al., 2016). For instance, a model will detect events in the sentence "Henry was injured" by successively classifying its three words as NIL, NIL and Injure. By automatically extracting features from raw text, these methods rely little on prior knowledge and have achieved promising results.
Unfortunately, word-wise event detection models suffer from the word-trigger mismatch problem, because a number of triggers do not exactly match a word. Specifically, a trigger can be part of a word or span multiple words, and such triggers are impossible to detect with word-wise models. This problem is more severe in languages without natural word delimiters such as Chinese. Figure 1 (a) shows an example of part-of-word triggers, where the two characters in one word "并购" (acquire and merge) trigger two different events: a "Merge Org" event triggered by "并" (merge) and a "Transfer Ownership" event triggered by "购" (acquire). Figure 1 (b) shows a multi-word trigger, where the three words "受" (is), "了" and "伤" (injured) together trigger an Injure event. Table 1 shows the statistics of different types of word-trigger match on two standard datasets. We can see that word-trigger mismatch is crucial for Chinese event detection, since nearly 25% of triggers in RichERE and 15% of triggers in ACE2005 do not exactly match a word.
To resolve the word-trigger mismatch problem, this paper proposes Nugget Proposal Networks (NPNs), which identify triggers by modeling the character compositional structures of trigger nuggets regardless of word boundaries. Given a sentence, NPNs regard characters as the basic detecting units and are able to 1) directly propose the entire potential trigger nugget at each character by exploiting the inner compositional structure of triggers; 2) effectively categorize proposed triggers by learning semantic representations from both characters and words. For example, at the character "伤" (injured) in Figure 1 (b), NPNs can not only detect that it is part of an Injure event trigger, but also propose the entire trigger nugget "受了伤" (is injured).
The main idea behind NPNs is that most Chinese triggers have a regular character compositional structure. Concretely, most Chinese event triggers have one central character that indicates the event type, e.g., "杀" (kill) in "枪杀" (kill by shooting). Furthermore, characters are composed into a trigger based on regular compositional structures, e.g., "manner + verb" for "枪杀" (kill by shooting) and "砍杀" (hack to death), as well as "verb + auxiliary + noun" for "受了伤" (is injured) and "挨了打" (beaten).
Figure 2 shows the architecture of NPNs. Given a character in a sentence, a hybrid representation learning module is first used to learn its semantic representation from both the characters and the words in the sentence. This hybrid representation is then fed into two modules: a trigger nugget generator, which proposes the entire potential trigger nugget by exploiting the inner character compositional structure, and an event type classifier, which determines the event type once a trigger is proposed.
Compared with previous methods, NPNs have the following main advantages:
1) By directly proposing the entire trigger nugget centered at a character, the trigger nugget generator can effectively resolve the word-trigger mismatch problem. First, using characters as basic units, NPNs do not suffer from the word-trigger mismatch problem of word-wise methods. Furthermore, by modeling and exploiting the character compositional structure of triggers, our model is more tolerant of character-wise classification errors than traditional character-based models, as shown in Section 4.4.
2) By summarizing information from both characters and words, our hybrid representation can effectively capture information for both inner character composition and accurate event categorization. For example, the inner compositional structure of the trigger "枪杀" (kill by shooting) can be learned from the character-level sequence. In addition, characters are often ambiguous, so accurate representations must take their word context into consideration. For example, the representation of "杀" (kill) in "枪杀" (kill by shooting) should differ from its representation in "杀青" (completed).
We conducted experiments on both the ACE2005 and the TAC KBP 2017 Event Nugget Detection datasets. Experiment results show that NPNs can effectively solve the word-trigger mismatch problem, and therefore significantly outperform previous state-of-the-art methods (see footnote 1).

Hybrid Representation Learning
Given a sentence, NPNs first learn a representation for each character, which is then fed into downstream modules. We observe that both characters and words contain rich information for Chinese event detection: characters reveal the inner compositional structure of event triggers, while words can provide more accurate and less ambiguous semantics than characters (Chen et al., 2015a). For example, character-level information can tell us that "枪杀" (kill by shooting) is a trigger constructed with the regular pattern "manner + verb", while word-level sequences provide more explicit information when we distinguish the semantics of "杀" (kill) in this context from that of the same character in other words such as "杀青" (completed). Therefore, we propose to learn a hybrid representation that summarizes information from both characters and words. Specifically, we first learn two separate character-level and word-level representations using token-level neural networks. Then we design three kinds of hybrid paradigms to obtain the hybrid representation.
Figure 3: Token-level feature extractor, where PE is relative positional embeddings and WE is word embeddings. The concerning token is "并购".
1 Our source code, including all hyper-parameter settings and pre-trained word embeddings, is openly available at github.com/sanmusunrise/NPNs.

Token-level Representation Learning
Two token-level neural networks are used to extract features from characters and words respectively. The network architecture is similar to DMCNN (Chen et al., 2015b). Figure 3 shows a word-level example. Given $n$ tokens $t_1, t_2, ..., t_n$ in the sentence and the concerning token $t_c$, let $x_i$ be the concatenation of the word embedding of $t_i$ and the embedding of the relative position of $t_i$ to $t_c$. A convolutional layer with window size $h$ is applied over these embeddings to capture compositional semantics, where $x_{i:i+j}$ refers to the concatenation of embeddings from $x_i$ to $x_{i+j}$, $w_i$ is the $i$-th filter of the convolutional layer, and $b_i \in \mathbb{R}$ is a bias term. Then a dynamic multi-pooling layer is applied to preserve important signals from different parts of the sentence, yielding $r_i^{left}$ and $r_i^{right}$ for the $i$-th feature map. After that, we concatenate $r_i^{left}$ and $r_i^{right}$ from all feature maps, together with the embeddings of the tokens near $t_c$, to obtain the word-level representation $f_{word}$ of $t_c$. Applying the same procedure to character sequences, we obtain the character-level representation $f_{char}$.
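The convolution and dynamic multi-pooling steps described above can be sketched for a single filter as follows. This is a minimal illustration rather than the authors' implementation: the tanh activation, the zero padding at the end of the sentence, the split of the feature map at the concerning position `c`, and the names `conv_feature_map` and `dynamic_multi_pool` are all assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def conv_feature_map(xs: Sequence[Vector], w: Vector, b: float, h: int) -> List[float]:
    """Apply one filter over windows of h consecutive token vectors.

    Assumption: the activation is tanh and the sentence is zero-padded at the
    end so that every token position yields one feature value.
    """
    dim = len(xs[0])
    padded = list(xs) + [[0.0] * dim for _ in range(h - 1)]
    feature_map = []
    for j in range(len(xs)):
        # Concatenate the h embeddings x_j ... x_{j+h-1} into one window vector.
        window = [v for x in padded[j:j + h] for v in x]
        score = sum(wi * vi for wi, vi in zip(w, window)) + b
        feature_map.append(math.tanh(score))
    return feature_map


def dynamic_multi_pool(feature_map: Sequence[float], c: int) -> Tuple[float, float]:
    """Max-pool the part before position c and the part from c onwards.

    Assumption: an empty part (c == 0) pools to 0.0.
    """
    r_left = max(feature_map[:c], default=0.0)
    r_right = max(feature_map[c:], default=0.0)
    return r_left, r_right


# Toy usage: four 1-dimensional token vectors, window size 2, concerning token at index 2.
fm = conv_feature_map([[1.0], [2.0], [3.0], [4.0]], w=[0.5, 0.0], b=0.0, h=2)
r_left, r_right = dynamic_multi_pool(fm, c=2)
```

In a full model the two pooled values from every filter would be concatenated with the embeddings of the tokens near the concerning token, as the paragraph above describes.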

Hybrid Representation Learning
So far we have both the character-level feature representation $f_{char}$ and the word-level feature representation $f_{word}$. This section describes how we mix them to obtain a hybrid representation. Before this, we first project $f_{char}$ and $f_{word}$ into the same vector space using two dense layers, and we denote the projected $d'$-dimensional vectors as $f'_{char}$ and $f'_{word}$. Then we design three different paradigms to mix them: Concat Hybrid, General Hybrid and Task-specific Hybrid, as illustrated in Figure 4.
Concat Hybrid is the simplest method, which simply concatenates the character-level and word-level representations. This approach does not introduce any additional parameters, but we find it very effective in our experiments.
General Hybrid aims to learn a shared hybrid representation for both trigger nugget proposal and event type classification. Specifically, we design a gated structure to model the information flow from $f'_{char}$ and $f'_{word}$ to the general hybrid feature representation $f_G$:
$$z_G = s(W_{GH} f'_{char} + U_{GH} f'_{word} + b_{GH}) \quad (4)$$
$$f_G = z_G \odot f'_{char} + (1 - z_G) \odot f'_{word} \quad (5)$$
Here $s$ is the sigmoid function, $W_{GH} \in \mathbb{R}^{d' \times d'}$ and $U_{GH} \in \mathbb{R}^{d' \times d'}$ are weight matrices, and $b_{GH} \in \mathbb{R}^{d'}$ is the bias term. $z_G$ is a $d'$-dimensional vector whose values represent the contributions of $f'_{char}$ and $f'_{word}$ to the final hybrid representation, which models the importance of individual features in the given context.
As the two downstream modules of NPNs have individual functions, they might have different requirements for the input features. Intuitively, the trigger nugget generator depends more on fine-grained character-level features. In contrast, word-level features might play a more important role in the event type classifier, since they are enriched with more explicit semantics. As a result, a unified representation may be insufficient, and it is better to learn task-specific hybrid representations.
Task-specific Hybrid is proposed to tackle this problem, where two gates are introduced, one for each module. Formally, we learn one representation for the trigger nugget generator and one for the event type classifier, each with its own gate. Here $f_N$ and $f_T$ are the hybrid features for the trigger nugget generator and the event type classifier respectively, and the meanings of the other parameters are similar to those in Equations (4) and (5).
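The three hybrid paradigms can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions, not the authors' code: the gate is assumed to weight the character features by $z$ and the word features by $1 - z$, and the helper names (`matvec`, `gated_hybrid`, `task_specific_hybrid`) and the parameter tuples are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]
GateParams = Tuple[Matrix, Matrix, Vector]  # (W, U, b), hypothetical container


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def concat_hybrid(f_char: Vector, f_word: Vector) -> Vector:
    """Concat Hybrid: no parameters, just concatenation."""
    return list(f_char) + list(f_word)


def gated_hybrid(f_char: Vector, f_word: Vector, params: GateParams) -> Vector:
    """General Hybrid: a single sigmoid gate mixes the two projected vectors.

    Assumption: z weights the character features and (1 - z) the word features.
    """
    W, U, b = params
    z = [sigmoid(p + q + r) for p, q, r in zip(matvec(W, f_char), matvec(U, f_word), b)]
    return [zi * c + (1.0 - zi) * w for zi, c, w in zip(z, f_char, f_word)]


def task_specific_hybrid(f_char: Vector, f_word: Vector,
                         nugget_params: GateParams,
                         type_params: GateParams) -> Tuple[Vector, Vector]:
    """Task-specific Hybrid: one gate per downstream module (f_N, f_T)."""
    return (gated_hybrid(f_char, f_word, nugget_params),
            gated_hybrid(f_char, f_word, type_params))


# Toy usage with 2-dimensional projected features and all-zero gate parameters,
# which makes every gate value 0.5 (an even mix of characters and words).
zeros: GateParams = ([[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
f_n, f_t = task_specific_hybrid([1.0, 3.0], [3.0, 5.0], zeros, zeros)
```

The only difference between the General and Task-specific variants in this sketch is whether the two downstream modules share one set of gate parameters or each own a separate set.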

Nugget Proposal Networks
Given the hybrid representation of a character in a sentence, the goal of NPNs is to propose the potential trigger nugget at each character, as well as to identify its corresponding event type. For example, in Figure 5, centered at the character "伤" (injured), NPNs need to propose "受了伤" (is injured) as the entire trigger nugget and identify its event type as "Injure". For this, NPNs are equipped with two modules. The first, called the trigger nugget generator, proposes the potential trigger nugget containing the concerning character by exploiting the character compositional structures of triggers. The second, called the event type classifier, determines the specific type of the event once a trigger nugget is detected.
Figure 5: Our trigger nugget generator. For each character, there are 7 candidate nuggets including "NIL" if the maximum length of nuggets is 3.

Trigger Nugget Generator
Chinese event triggers have regular inner compositional structures, e.g., "受了伤" (is injured) and "挨了打" (is beaten) have the same "verb + auxiliary + noun" structure, and "枪杀" (kill by shooting) and "砍杀" (hack to death) share the same "manner + verb" pattern. If a model is able to learn this compositional regularity, it can effectively detect trigger nuggets at characters. Recent advances have shown that convolutional neural networks are effective at capturing and predicting region information in object detection (Ren et al., 2015) and semantic segmentation (He et al., 2017), which reveals the strong ability of CNNs to learn spatial and positional information. Inspired by this, we propose a neural network based trigger nugget generator, which is expected not only to predict whether a character belongs to a trigger nugget, but also to point out the entire trigger nugget.
Figure 5 illustrates our trigger nugget generator. The hybrid representation $f_N$ of the concerning character is first learned as described in Section 2 and is then fed into a fully-connected layer to compute a score vector $O^G \in \mathbb{R}^{d_N}$ over the possible trigger nuggets containing that character, where $d_N$ is the number of candidate nuggets plus one "NIL" label indicating that this character does not belong to a trigger. Given the maximum length $L$ of trigger nuggets, there are $\frac{L^2 + L}{2}$ possible nuggets containing a specific character, as shown in Figure 5. In both the ACE and Rich ERE corpora, more than 98.5% of triggers contain no more than 3 characters, so for a specific character we consider 6 candidate nuggets and thus $d_N = 7$.
We expect NPNs to give a high score to a nugget if it follows a regular compositional structure of triggers. For example, in Figure 5, "受了伤" (is injured) follows the compositional pattern "verb + auxiliary + noun", so a high score is given to the category where "伤" is at the 3rd place of a nugget of length 3. By contrast, "了伤" does not match a regular pattern, so the score for "伤" at the 2nd place of a nugget of length 2 will be low in this context.
After obtaining the scores for each nugget, a softmax layer is applied to normalize the scores:
$$P(y^G = i \mid x; \theta) = \frac{\exp(O^G_i)}{\sum_{j=1}^{d_N} \exp(O^G_j)}$$
where $O^G_i$ is the $i$-th element of $O^G$ and $\theta$ denotes the model parameters.
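To make the label space of the trigger nugget generator concrete, the following sketch enumerates the candidate nuggets for a given maximum length and maps a predicted label back to a character span. The label encoding as `(length, position)` pairs, their ordering, and the function names are assumptions made for illustration; only the counts (6 candidates plus "NIL" for a maximum length of 3) come from the text.

```python
import math
from typing import List, Optional, Tuple

Label = Optional[Tuple[int, int]]  # None stands for "NIL"; otherwise (length, position)


def candidate_labels(max_len: int) -> List[Label]:
    """All labels for one character: NIL plus every (length, 1-based position) pair."""
    labels: List[Label] = [None]
    for length in range(1, max_len + 1):
        for position in range(1, length + 1):
            labels.append((length, position))
    return labels


def nugget_span(char_index: int, label: Tuple[int, int]) -> Tuple[int, int]:
    """Half-open character span [start, end) of the nugget proposed at char_index."""
    length, position = label
    start = char_index - (position - 1)
    return start, start + length


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over the score vector."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


labels = candidate_labels(3)
# The character at index 2 predicted as the 3rd character of a 3-character nugget
# corresponds to the span covering characters 0, 1 and 2.
span = nugget_span(2, (3, 3))
```

With a maximum length of 3 the list holds 7 labels, matching $d_N = 7$ above.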

Event Type Classifier
The event type classifier aims to identify whether the given character in the given context will exhibit an event type. Once we detect an event trigger nugget at a character, the hybrid feature $f_T$ extracted previously is fed into a neural network classifier, which further determines the specific type of this trigger. Following previous work (Chen and Ng, 2012), our event type classifier directly classifies nuggets into event subtypes, ignoring the hierarchy between event types.
Formally, given the hybrid feature vector $f_T$ of input $x$, a fully-connected layer is applied to compute the scores $O^C \in \mathbb{R}^{d_T}$ assigned to each event subtype, where $d_T$ is the number of event subtypes. Then, similar to the trigger nugget generator, a softmax layer is introduced:
$$P(y^C = i \mid x; \theta) = \frac{\exp(O^C_i)}{\sum_{j=1}^{d_T} \exp(O^C_j)}$$
where $O^C_i$ is the $i$-th element of $O^C$, representing the score for the $i$-th subtype.

Dealing with Conflicts between Proposed Nuggets
Because NPNs directly propose a nugget at each character, there might exist conflicts between the nuggets proposed at different characters. Generally speaking, there are two types of conflicts: (i) NIL/trigger conflict, which means NPNs propose a trigger nugget at one character but classify another character in that nugget as "NIL" (e.g., proposing the nugget "受了伤" (is injured) at "受" and outputting "NIL" at "了"); (ii) overlapped conflict, i.e., proposing two overlapping nuggets (e.g., proposing the nugget "受了伤" (is injured) at "受" and the nugget "伤" at "伤"). We find that overlapped conflicts are very rare, because NPNs are very effective in capturing positional knowledge, and that the main challenge of event detection is to distinguish triggers from non-triggers. Therefore, in this paper we employ a redundant prediction strategy that simply adds all proposed nuggets to the results and ignores "NIL" predictions. For example, if NPNs successively propose "受了伤" (is injured), "NIL" and "了伤" from "受了伤", then we ignore the "NIL" and add both of the other nuggets to the result. We found that this redundant prediction paradigm is an advantage of our model. Compared with conventional character-based models, even if NPNs mistakenly classify the character "了" as "NIL", we can still accurately detect the trigger "受了伤" (is injured) if we predict the entire nugget at the character "受" or "伤". This redundant prediction makes our model more tolerant of character-wise classification errors, as verified in Section 4.4.
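The redundant prediction strategy can be sketched as a small decoding routine: every non-"NIL" proposal is converted to a span and added to a result set, so a nugget is recovered as long as any one of its characters proposes it. The prediction format `(length, position, event_type)`, the function name, and the choice to drop proposals that would fall outside the sentence are assumptions made for this example.

```python
from typing import List, Optional, Set, Tuple

# Per-character prediction: None for "NIL", else (length, 1-based position, event type).
Prediction = Optional[Tuple[int, int, str]]
Nugget = Tuple[int, int, str]  # half-open span [start, end) plus event type


def decode_redundant(predictions: List[Prediction]) -> List[Nugget]:
    """Collect every proposed nugget and ignore NIL predictions."""
    results: Set[Nugget] = set()
    for i, pred in enumerate(predictions):
        if pred is None:
            continue  # NIL predictions never veto nuggets proposed elsewhere
        length, position, event_type = pred
        start = i - (position - 1)
        end = start + length
        if start < 0 or end > len(predictions):
            continue  # assumption: drop proposals that leave the sentence
        results.add((start, end, event_type))
    return sorted(results)


# Three characters: the first and last both propose the full 3-character nugget,
# while the middle character is (wrongly) classified as NIL.
nuggets = decode_redundant([(3, 1, "Injure"), None, (3, 3, "Injure")])
```

Because duplicates collapse in the set, agreeing proposals from several characters yield a single nugget, while a disagreeing proposal such as a 2-character nugget at the last character would simply be added as a second result.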

Model Learning
To train the trigger nugget generator, we regard all characters included in trigger nuggets as positive training instances, and randomly sample characters not in any trigger as negative instances, labeling them as "NIL". Suppose we have $T_G$ training examples in $S^G = \{(x_k, y^G_k) \mid k = 1, 2, ..., T_G\}$ to train the trigger nugget generator, as well as $T_C$ examples in $S^C = \{(x_k, y^C_k) \mid k = 1, 2, ..., T_C\}$ to train the event type classifier. We can then define the loss function $L(\theta)$ as follows:
$$L(\theta) = -\sum_{(x_k, y^G_k) \in S^G} \log P(y^G_k \mid x_k; \theta) - \sum_{(x_k, y^C_k) \in S^C} \log P(y^C_k \mid x_k; \theta)$$
where $\theta$ denotes the parameters of NPNs. Since all modules in NPNs are differentiable, any gradient-based algorithm can be applied to minimize $L(\theta)$.
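As a rough illustration of the training objective, the sketch below computes a joint loss from the predicted probability distributions of the two modules. It assumes the loss is the unweighted sum of the negative log-likelihoods over the generator examples and the classifier examples; the function names and the `(probabilities, gold_index)` example format are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# One training example: the module's predicted distribution and the gold label index.
Example = Tuple[Sequence[float], int]


def negative_log_likelihood(probabilities: Sequence[float], gold_index: int) -> float:
    return -math.log(probabilities[gold_index])


def joint_loss(generator_examples: List[Example], classifier_examples: List[Example]) -> float:
    """Sum of the per-example negative log-likelihoods of both modules.

    Assumption: both terms are weighted equally.
    """
    generator_term = sum(negative_log_likelihood(p, y) for p, y in generator_examples)
    classifier_term = sum(negative_log_likelihood(p, y) for p, y in classifier_examples)
    return generator_term + classifier_term


# Toy usage: one generator example over 2 labels and one classifier example over 2 subtypes.
loss = joint_loss([([0.5, 0.5], 0)], [([0.25, 0.75], 1)])
```

In practice the probabilities would come from the softmax layers of the two modules, and an optimizer would minimize this quantity with respect to all model parameters.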

Data Preparation and Evaluation
We conducted experiments on two standard datasets: ACE2005 and TAC KBP 2017 Event Nugget Detection (KBPEval2017).
Table 2: Experiment results on ACE2005 and KBPEval2017. * indicates the result adapted from the original paper. For KBPEval2017, "Trigger Identification" corresponds to the "Span" metric and "Trigger Classification" corresponds to the "Type" metric reported in the official evaluation.
For KBPEval2017, we evaluated our model on the 2017 Chinese evaluation dataset (LDC2017E55), using previous RichERE annotated Chinese datasets (LDC2015E78, LDC2015E105, LDC2015E112, and LDC2017E02) as the training set, except for 20 randomly sampled documents reserved as the development set. In total, there were 506/20/167 documents in the training/development/test sets. We used the Stanford CoreNLP toolkit (Manning et al., 2014) to preprocess all documents for sentence splitting and word segmentation. The Adadelta update rule (Zeiler, 2012) was applied for optimization.
Models are evaluated by micro-averaged Precision (P), Recall (R) and F1-score. For ACE2005, we followed Chen and Ji (2009) to compute the above measures. For KBPEval2017, we used the official evaluation toolkit to obtain these metrics.
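For readers unfamiliar with micro-averaging, the following sketch pools true positives, predictions and gold triggers over all documents before computing the three measures. It uses exact matching of `(start, end, type)` tuples, which is an assumption made for illustration; the official KBP toolkit and the ACE2005 protocol of Chen and Ji (2009) may differ in matching details.

```python
from typing import List, Set, Tuple

Nugget = Tuple[int, int, str]  # (start, end, event type), hypothetical encoding


def micro_prf(gold: List[Set[Nugget]], predicted: List[Set[Nugget]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over a list of documents."""
    true_positives = sum(len(g & p) for g, p in zip(gold, predicted))
    n_predicted = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy usage: one document, one of two predictions is correct, one of two gold triggers is found.
p, r, f1 = micro_prf(
    gold=[{(0, 3, "Injure"), (5, 6, "Die")}],
    predicted=[{(0, 3, "Injure"), (1, 3, "Injure")}],
)
```

Dropping the event type from each tuple before calling the function would give the identification ("Span") variant of the same measures.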

Baselines
Three groups of baselines were compared:
Character-based NN models. This group of methods solves Chinese event detection in a character-level sequential labeling paradigm. It includes the Convolutional Bi-LSTM model (C-BiLSTM) proposed by Zeng et al. (2016), Forward-backward Recurrent Neural Networks (FBRNN) by Ghaeini et al. (2016), and a character-level DMCNN model with a classifier using IOB encoding (Sang and Veenstra, 1999).
Word-based NN models. This group of methods directly applies current NN models to word-level sequences. It includes word-based FBRNN, word-based DMCNN and the Hybrid Neural Network, which combines a CNN with a Bi-LSTM and achieves the SOTA NN-based result on ACE2005. To alleviate the OOV problem stemming from word-trigger mismatch, we also adopt errata table replacing (Han et al., 2017), which builds an errata table from the training data and directly replaces each word that contains a trigger nugget as one of its parts with that trigger.
Feature-enriched Methods. This group of methods includes Rich-C (Chen and Ng, 2012) and CLUZH (KBP2017 Best) (Makarov and Clematide, 2017). Rich-C uses several handcrafted Chinese-specific features and is one of the state-of-the-art methods on ACE2005. CLUZH incorporates many heuristic features into an LSTM encoder and achieved the best performance in the TAC KBP2017 evaluation.
Table 2 shows the results on ACE2005 and KBPEval2017. From this table, we can see that:
2) By exploiting the compositional structures of triggers, our trigger nugget generator can effectively resolve the word-trigger mismatch problem. As shown in Table 2, NPN(Task-specific) achieved significant F1-score improvements on the trigger identification task on both datasets. Notably, our method achieved a remarkably high recall on both datasets, which indicates that NPNs do detect a number of triggers that previous methods cannot identify.
3) By summarizing information from both characters and words, hybrid representation learning is effective for event detection. Compared with the corresponding character-based methods, word-based methods achieved improvements of 2 to 3 F1 points, which indicates that words can provide additional information for event detection. By combining character-level and word-level features, NPNs are able to perform character-based event detection while also taking word-level knowledge into consideration.

Comparing with Conventional Character-based Methods
To further investigate the effects of the trigger nugget generator, we compared NPNs with other character-based methods and analyzed their behaviors. We conducted a supplementary experiment by replacing our trigger nugget generator and event type classifier with an IOB encoding labeling layer. We call this system NPN(IOB). In addition, we compared the result with FBRNN(Char), which proposes candidate trigger nuggets according to an external trigger table.
Table 3 shows the results on KBP2017Eval. We can see that NPN(Task-specific) outperforms the other methods significantly. We believe this is because:
1) FBRNN(Char) only regards tokens in the candidate table as potential trigger nuggets, which limits the choice of possible trigger nuggets and results in a very low recall.
2) To accurately identify a trigger, NPN(IOB) and conventional character-based methods require all characters in a trigger to be classified correctly, which is very challenging (Zeng et al., 2016): many characters that appear in a trigger nugget do not serve as part of a trigger in the majority of contexts, so they are easily classified as "NIL". For the first example in Table 5, NPN(IOB) was unable to fully recognize the trigger nugget "贺电" (congratulatory message) because the character "贺" (congratulatory) does not often serve as part of a "PhoneWrite" trigger. In fact, "贺" serves as "NIL" in the majority of similar contexts, e.g., in words that simply mean "congratulation".
3) NPNs are able to handle the above problems. First, NPNs do not rely on candidate tables to generate potential triggers, which guarantees good generalization ability. Second, NPNs propose the entire trigger nugget at each character, and this redundant prediction paradigm makes NPNs more tolerant of character-level errors. For example, even if the model mistakenly classifies "贺" as "NIL", NPNs can still identify the correct nugget "贺电" at the character "电", because "电" is a common part of "PhoneWrite" event triggers.

Influence of Word-Trigger Mismatch
This subsection investigates the effects of resolving the word-trigger mismatch problem using different methods. According to the different types of word-trigger match, we split the KBP2017Eval test set into three parts: Exact, Part-of-Word and Cross-Words, as defined in Table 1.
... exactly detect boundaries of trigger nuggets, thus has a low recall on all splits. Conventional DMCNN regards words as potential triggers, which means it can only identify triggers that exactly match words. In the second example in Table 5, the word "死伤" (dead or injured) as a whole has never been annotated as a trigger, so DMCNN is unable to recognize it at all. Errata replacing can only solve part of the part-of-word mismatch problem: it cannot handle cases where one word contains multiple triggers (e.g., "死伤" in Table 5) or cases where a trigger crosses multiple words.
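The three-way split can be illustrated with a small function that compares a trigger span against the word spans produced by a segmenter. Table 1 itself is not reproduced in this text, so the exact definitions used here (a trigger equal to one word is "Exact", a trigger strictly inside one word is "Part-of-Word", anything else is "Cross-Words") are an assumption, as are the function name and the half-open character-offset encoding.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open character offsets [start, end)


def match_type(trigger: Span, words: List[Span]) -> str:
    """Classify how a trigger span aligns with the word segmentation."""
    if trigger in words:
        return "Exact"
    t_start, t_end = trigger
    for w_start, w_end in words:
        if w_start <= t_start and t_end <= w_end:
            return "Part-of-Word"
    # Not a word and not inside a word, so the trigger must cross a word boundary.
    return "Cross-Words"


# Toy segmentation of a 5-character sentence into a 2-character word and three single characters.
words = [(0, 2), (2, 3), (3, 4), (4, 5)]
kinds = [match_type(t, words) for t in [(0, 2), (0, 1), (2, 5)]]
```

A word-wise model can only ever output the spans in `words`, which is why triggers of the second and third kind are out of its reach without additional mechanisms.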

Effects of Hybrid Representation
This section analyzes the effect of hybrid features in NPNs. First, from Table 2, we can see that the Task-specific Hybrid method achieved the best performance on both datasets. Surprisingly, the simple Concat Hybrid outperforms the General Hybrid approach. We believe this is because the trigger nugget generator and the event type classifier rely on different information, so a single unified gate is not enough. Task-specific Hybrid uses two different task-specific gates that can satisfy both sides, thus resulting in the best overall performance. Furthermore, to investigate the necessity of using hybrid features, an auxiliary experiment, called NPN(Char), was conducted by removing word-level features from NPNs. We also compared with the model obtained by removing character-level features, which is the original DMCNN(Word).
Table 6: Results of using different representations on the Trigger Classification task on KBP2017Eval.
Table 6 shows the experiment results. We can see that neither the character-level nor the word-level representation alone can achieve results competitive with NPNs. This verifies the necessity of the hybrid representation. In addition, we can see that NPN(Char) outperforms the other character-level methods in Table 2, which further confirms that our trigger nugget generator is still effective even when using only character-level information.

Related Work
Event detection is an important task in information extraction and has attracted much attention. Traditional methods (Ji and Grishman, 2008; Patwardhan and Riloff, 2009; Liao et al., 2010; McClosky et al., 2011; Hong et al., 2011; Huang and Riloff, 2012; Li et al., 2013a,b, 2014) rely heavily on hand-crafted features, which are hard to transfer among languages and annotation standards. Recently, deep learning methods, which automatically extract high-level features and perform token-level classification with neural networks (Chen et al., 2015b; Nguyen and Grishman, 2015), have achieved significant progress. Some improvements have been made by jointly predicting triggers and arguments and by introducing more complicated architectures to capture a larger scale of context (Ghaeini et al., 2016). These methods have achieved promising results in English event detection.
Unfortunately, the word-trigger mismatch problem significantly undermines the performance of word-level models in Chinese event detection (Chen and Ji, 2009). To resolve this problem, Chen and Ji (2009) proposed a feature-driven BIO tagging method over character-level sequences. Qin et al. (2010) introduced a method that can automatically expand the candidate Chinese trigger set. Other work manually defined character compositional patterns for Chinese event triggers. However, these methods rely on hand-crafted features and patterns, which makes them difficult to integrate into recent deep learning models.
Recent advances have shown that neural networks can effectively capture spatial and positional information from raw inputs (Ren et al., 2015; He et al., 2017; Wang and Jiang, 2017). This paper designs Nugget Proposal Networks to capture the character compositional structure of event triggers, which is more robust and more effective than previous hand-crafted patterns or character-level sequential labeling methods.

Conclusions and Future Work
This paper proposes Nugget Proposal Networks for Chinese event detection, which can effectively resolve the word-trigger mismatch problem by modeling and exploiting character compositional structure of Chinese event triggers, using hybrid representation which can summarize information from both characters and words. Experiment results have shown that our method significantly outperforms conventional methods.
Because the mismatch between words and extraction units is a common problem in information extraction, we believe our method can also be applied to many other languages and tasks for exploiting inner composition structure during extraction, such as Named Entity Recognition.