Cost-sensitive Regularization for Label Confusion-aware Event Detection

In supervised event detection, most of the mislabeling occurs between a small number of confusing type pairs, including trigger-NIL pairs and sibling sub-types of the same coarse type. To address this label confusion problem, this paper proposes cost-sensitive regularization, which forces the training procedure to concentrate on optimizing confusing type pairs. Specifically, we introduce a cost-weighted term into the training loss, which penalizes mislabeling between confusing label pairs more heavily. Furthermore, we propose two estimators which can effectively measure such label confusion based on instance-level or population-level statistics. Experiments on the TAC-KBP 2017 datasets demonstrate that the proposed method can significantly improve the performance of different models in both English and Chinese event detection.


Introduction
Automatic event extraction is a fundamental task in information extraction. Event detection, which aims to identify trigger words of specific types of events, is a vital step of event extraction. For example, from the sentence "Mary was injured, and then she died", an event detection system is required to detect a Life:Injure event triggered by "injured" and a Life:Die event triggered by "died".
Recently, neural network-based supervised models have achieved promising progress in event detection (Nguyen and Grishman, 2015; Chen et al., 2015; Ghaeini et al., 2016). Commonly, these methods regard event detection as a word-wise classification task with one NIL class for tokens that do not trigger any event. Specifically, a neural network automatically extracts high-level features and then feeds them into a classifier to categorize words into their corresponding event sub-types (or NIL). The optimization criterion of such models commonly involves minimizing a cross-entropy loss, which is equivalent to maximizing the likelihood of making correct predictions on the training data.

However, we find that in supervised event detection, most of the mislabeling occurs between a small number of confusing type pairs. We refer to this phenomenon as label confusion. Specifically, there are mainly two types of label confusion in event detection: 1) trigger/NIL confusion; 2) sibling sub-type confusion. For example, both Transaction:Transfer-money and Transaction:Transfer-ownership events are frequently triggered by the word "give". Besides, in many cases "give" does not serve as a trigger word at all. Table 1 shows the classification results of a state-of-the-art event detection model (Chen et al., 2015) on all event triggers with the coarse type Contact on the TAC-KBP 2017 English Event Detection dataset. We can see that the model severely suffers from the two types of label confusion mentioned above: more than 50% of the mislabeling happens on the trigger/NIL decision due to the ambiguity of natural language. Furthermore, the majority of the remaining errors are between sibling sub-types of the same coarse type because of their semantic relatedness (Liu et al., 2017b). Similar results are also observed on other event detection datasets such as ACE2005 (Liu et al., 2018a). Therefore, it is critical to enhance supervised event detection models by taking the label confusion problem into consideration.
In this paper, inspired by cost-sensitive learning (Ling and Sheng, 2011), we introduce cost-sensitive regularization to model and exploit the label confusion during model optimization, which makes the training procedure more sensitive to confusing type pairs. Specifically, the proposed regularizer reshapes the loss function of model training by penalizing the likelihood of making wrong predictions with a cost-weighted term. If instances of class i are more frequently misclassified into class j, we assign a higher cost to this type pair to make the model intensively learn to distinguish between them. Consequently, the training procedure not only considers the probability of making correct predictions, but also tries to separate confusing type pairs with a larger margin. Furthermore, in order to estimate such costs automatically, this paper proposes two estimators based on population-level or instance-level statistics.
We conducted experiments on the TAC-KBP 2017 Event Nugget Detection datasets. Experiments show that our method can significantly reduce the errors between confusing type pairs, and therefore leads to better performance of different models in both English and Chinese event detection. To the best of our knowledge, this is the first work that tackles the label confusion problem of event detection and addresses it in a cost-sensitive regularization paradigm.

Cost-sensitive Regularization for Neural Event Detection

Neural Network Based Event Detection
The state-of-the-art neural network models commonly transform event detection into a word-wise classification task. Formally, let D = {(x_i, y_i) | i = 1, 2, ..., n} denote n training instances, and let P(y|x; θ) be the neural network model parameterized by θ, which takes a representation (feature) x as input and outputs the probability that x is a trigger of event sub-type y (or NIL). The training procedure of such models commonly involves minimizing the following cross-entropy loss:

L_CE(θ) = - Σ_{i=1}^{n} log P(y_i | x_i; θ)

which corresponds to maximizing the log-likelihood of the model making the correct prediction on all training instances, and does not take the confusion between different type pairs into consideration.
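As a concrete reference point, this word-wise classification objective can be sketched as follows. This is a minimal NumPy illustration with illustrative names of our own; it assumes logits produced by an arbitrary feature extractor rather than any specific architecture from the paper.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy_loss(logits, gold):
    # L_CE = -(1/n) * sum_i log P(y_i | x_i; theta), where gold holds the
    # index of each instance's class (event sub-types plus one NIL class).
    probs = softmax(logits)
    n = logits.shape[0]
    return -np.mean(np.log(probs[np.arange(n), gold]))
```

Note that this objective treats every misclassification alike: a trigger/NIL error and an error between unrelated types contribute identically to the loss.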

Cost-sensitive Regularization
As discussed above, the key to improving event detection performance is to solve the label confusion problem, i.e., to guide the training procedure to concentrate on distinguishing between more confusing type pairs such as trigger/NIL pairs and sibling sub-type pairs. To this end, we propose cost-sensitive regularization, which reshapes the training loss with a cost-weighted term of the log-likelihood of making wrong predictions. Formally, the proposed regularizer is defined as:

L_CS(θ) = Σ_{i=1}^{n} Σ_{y_j ≠ y_i} C(y_i, y_j; x_i) log P(y_j | x_i; θ)

where C(y_i, y_j; x) is a positive cost of mislabeling an instance x with golden label y_i into label y_j. A higher C(y_i, y_j; x) is assigned if y_i and y_j form a more confusing type pair (i.e., one more easily mislabeled by the current model). Therefore, the cost-sensitive regularizer makes the training procedure pay more attention to distinguishing between confusing type pairs, because they have a larger impact on the training loss. Finally, the entire optimization objective can be written as:

L(θ) = L_CE(θ) + λ L_CS(θ)

where λ is a hyper-parameter that controls the relative impact of our cost-sensitive regularizer.
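A minimal sketch of the combined objective, assuming the costs are supplied as a class-by-class matrix (function and variable names are our own illustrations, not the original implementation):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cost_sensitive_loss(logits, gold, cost, lam=1.0):
    """L(theta) = L_CE + lam * L_CS, where L_CS weights the log-likelihood
    of each wrong class y_j by the cost C(y_i, y_j).

    cost: (num_classes, num_classes) matrix with cost[i, j] = C(y_i, y_j);
    the gold class is masked out so correct predictions carry no cost.
    """
    probs = softmax(logits)
    n, k = probs.shape
    idx = np.arange(n)
    log_p = np.log(probs)
    l_ce = -np.mean(log_p[idx, gold])
    # Mask out the gold class so only wrong labels contribute to L_CS.
    mask = np.ones_like(probs)
    mask[idx, gold] = 0.0
    l_cs = np.mean((cost[gold] * mask * log_p).sum(axis=-1))
    return l_ce + lam * l_cs
```

Because log P(y_j | x_i) grows toward zero as the wrong-class probability rises, putting probability mass on a high-cost wrong class raises the total loss more than putting the same mass on a low-cost one, which is exactly the intended behavior.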

Cost Estimation
Obviously it is critical for the proposed cost-sensitive regularization to have an accurate estimation of the cost C(y_i, y_j; x). In this section, we propose two approaches to this issue, based on population-level or instance-level statistics.

Population-level Estimator
A straightforward approach to measuring such costs is to use the relative mislabeling risk on the dataset. Therefore our population-level cost estimator is defined as:

C_POP(y_i, y_j) = #(y_i, y_j) / Σ_{y_k} #(y_i, y_k)

where #(y_i, y_j) is the number of instances with golden label y_i but classified into class y_j in the corpus. These statistics can be computed either on the training set or on the development set. This paper uses statistics on the development set due to its compact size, and the estimators are updated every epoch during the training procedure.
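A sketch of how such an estimator might be computed from development-set predictions. This shows one plausible normalization (each row divided by the total count of its gold class) with a hypothetical function name; the exact scheme in the original system may differ.

```python
import numpy as np

def population_costs(confusion):
    """Estimate C_POP(y_i, y_j) from a confusion-count matrix, where
    confusion[i, j] = #(y_i, y_j), the number of development-set instances
    with gold label y_i that the current model classifies as y_j.
    Each row is normalized by the total number of instances of that gold
    class, and the diagonal (correct predictions) is zeroed so correct
    decisions carry no cost.
    """
    counts = confusion.astype(float).copy()
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0.0] = 1.0          # guard against empty classes
    costs = counts / totals
    np.fill_diagonal(costs, 0.0)
    return costs
```

In training, one would rebuild the confusion matrix (and hence the cost matrix) once per epoch, as described above.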

Instance-level Estimator
The population-level estimator requires predicting on the entire dataset whenever the estimator is updated, which incurs a large computational cost. To handle this issue, we propose another estimation method based directly on instance-level statistics. Inspired by Lin et al. (2017), the probability P(y_j | x_i; θ) of classifying instance x_i into the wrong class y_j can be directly regarded as the mislabeling risk of that instance. Therefore our instance-level estimator is:

C_INS(y_i, y_j; x_i) = P(y_j | x_i; θ)

The cost-sensitive regularizer for each training instance can then be written as:

L_INS(x_i; θ) = Σ_{y_j ≠ y_i} P(y_j | x_i; θ) log P(y_j | x_i; θ)

Note that if the probability of making the correct prediction, P(y_i | x_i; θ), is fixed, L_INS(x_i; θ) achieves its minimum when the probabilities of mislabeling x_i into all incorrect classes are equal. This is equivalent to maximizing the margin between the probability of the golden label and that of any other class. In this circumstance, the loss L(θ) can be regarded as jointly maximizing the likelihood of correct prediction and the margin between correct and incorrect classes.

Experiments

We conducted experiments on two state-of-the-art neural network event detection models to verify the portability of our method. One is the DMCNN model proposed by Chen et al. (2015). The other is an LSTM model by Yang and Mitchell (2017). Due to page limitations, please refer to the original papers for details.
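Returning briefly to the instance-level estimator defined in the previous subsection: the per-instance regularizer L_INS can be sketched as follows (a NumPy illustration with our own names; it takes predicted distributions directly rather than logits).

```python
import numpy as np

def instance_regularizer(probs, gold):
    """Per-instance L_INS(x_i) = sum over wrong classes y_j of
    P(y_j | x_i) * log P(y_j | x_i): the cost C_INS is the current
    mislabeling probability itself, so no extra pass over the dataset
    is needed to update the estimator.
    probs: (n, k) predicted distributions; gold: (n,) gold class indices.
    """
    n, k = probs.shape
    idx = np.arange(n)
    mask = np.ones_like(probs)
    mask[idx, gold] = 0.0               # exclude the gold class
    safe = np.clip(probs, 1e-12, 1.0)   # avoid log(0)
    return (mask * probs * np.log(safe)).sum(axis=-1)
```

The claim above can be checked numerically: with the gold-class probability held fixed, an instance whose remaining mass is spread evenly over the wrong classes yields a smaller L_INS than one whose mass is concentrated on a single wrong class.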
2) Focal Loss (Focal) (Lin et al., 2017), which is an instance-level method that rescales the loss with a factor proportional to the mislabeling probability to enhance the learning on hard instances.
3) Hinge Loss (Hinge), which tries to separate the correct and incorrect predictions with a margin larger than a constant and is widely used in many machine learning tasks.
4) Under-sampling (Sampling), a representative cost-sensitive learning approach which samples instances to balance model learning and is widely used in event detection to deal with data imbalance (Chen et al., 2015).
We also compared our methods with the top systems in the TAC-KBP 2017 Evaluation. We evaluated all systems with micro-averaged precision (P), recall (R) and F1 using the official toolkit.

Overall Results
Table 2 shows the overall performance on the TAC-KBP 2017 datasets. We can see that: 1) Cost-sensitive regularization can significantly improve event detection performance by taking mislabeling costs into consideration. The proposed CR-INS and CR-POP steadily outperform the corresponding baselines. Besides, compared with population-level estimators, instance-level cost estimators are more effective. This may be because instance-level estimators can be updated every batch while population-level estimators are updated every epoch, which leads to a more accurate estimation.
2) Cost-sensitive regularization is robust to different languages and models. We can see that cost-sensitive regularization achieves significant improvements on both English and Chinese datasets with both CNN and RNN models. This indicates that our method is robust and can be applied to different models and datasets.
3) Data imbalance is not the only reason behind label confusion. Even though the Focal and Sampling baselines deal with the data imbalance problem, they still cannot achieve performance comparable with CR-POP and CR-INS. This means that there are other causes of label confusion which are not fully resolved by conventional methods for data imbalance.

Comparing with State-of-the-art Systems
Figure 1 compares our models with the top systems in the TAC-KBP 2017 Evaluation. To achieve a strong baseline, we also incorporate ELMo (Peters et al., 2018) into the English system for better representations. (Note that the top systems in the evaluation are commonly ensemble models with additional resources, while our reported in-house results are of single models.) We can see that CR-INS gains further significant improvements over all strong baselines, which have already achieved performance comparable with the top systems. In both English and Chinese, CR-INS achieves new state-of-the-art performance, which demonstrates its effectiveness.

Error Analysis
To clearly show where the improvement of our method comes from, we compared the mislabeling made by the Sampling baseline and our CR-INS method. Table 3 shows the results. We can first see that trigger/NIL mislabeling and sibling sub-type mislabeling make up most of the errors of the CE baseline. This further verifies our motivation. Besides, cost-sensitive regularization significantly reduces these two kinds of errors without introducing more of other types of mislabeling, which clearly demonstrates the effectiveness of our method.
Conclusion

In this paper, we propose cost-sensitive regularization for neural event detection, which introduces a cost-weighted term over the mislabeling likelihood to make the training procedure concentrate more on confusing type pairs. Experiments show that our method significantly improves the performance of neural network event detection models.

Table 1 :
Prediction percentage heatmap of triggers with the Contact coarse type. Row labels are the golden labels and column labels indicate the predictions.

Table 2 :
Overall results. CR-POP and CR-INS are our methods with population-level and instance-level estimators, respectively. All F1 improvements made by CR-POP and CR-INS are statistically significant with p < 0.05.
Figure 1:
Comparison with the top systems in TAC-KBP 2017. CR is our CR-INS method. The srcb system in English used additional CRF-based models to deal with multi-word triggers, which is not considered in our model and leads to a significantly higher recall than other competitors.

Table 3 :
Error rates (CNN) on trigger words on the Chinese test set with Sampling (SP) and CR-INS (CR).