A Context-Dependent Gated Module for Incorporating Symbolic Semantics into Event Coreference Resolution

Event coreference resolution is an important research problem with many applications. Despite the recent remarkable success of pre-trained language models, we argue that it is still highly beneficial to utilize symbolic features for the task. However, as the input for coreference resolution typically comes from upstream components in the information extraction pipeline, the automatically extracted symbolic features can be noisy and contain errors. Also, depending on the specific context, some features can be more informative than others. Motivated by these observations, we propose a novel context-dependent gated module to adaptively control the information flows from the input symbolic features. Combined with a simple noisy training method, our best models achieve state-of-the-art results on two datasets: ACE 2005 and KBP 2016.


Introduction
Within-document event coreference resolution is the task of clustering event mentions in a text that refer to the same real-world events (Lu and Ng, 2018). It is an important research problem with many applications (Vanderwende et al., 2004; Ji and Grishman, 2011; Choubey et al., 2018). Since the trigger of an event mention is typically the word or phrase that most clearly describes the event, virtually all previous approaches employ features related to event triggers in one form or another. To achieve better performance, many methods also use a variety of additional symbolic features such as event types, attributes, and arguments (Chen et al., 2009; Chen and Ji, 2009; Zhang et al., 2015; Sammons et al., 2015; Lu and Ng, 2016; Chen and Ng, 2016; Duncan et al., 2017). Previous neural methods (Nguyen et al., 2016; Choubey and Huang, 2017; Huang et al., 2019) also use non-contextual word embeddings such as word2vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014). The code is publicly available at https://github.com/laituan245/eventcoref.

Table 1: An example of using the modality attribute to improve event coreference resolution.

With the recent remarkable success of language models such as BERT (Devlin et al., 2019) and SpanBERT (Joshi et al., 2020), one natural question is whether we can simply use these models for coreference resolution without relying on any additional features. We argue that it is still highly beneficial to utilize symbolic features, especially when they are clean and carry complementary information. Table 1 shows an example from the ACE 2005 dataset, in which our baseline SpanBERT model incorrectly predicts the highlighted event mentions to be coreferential. The event triggers are semantically similar, making them challenging for our model to distinguish. However, notice that the event {head out} ev1 is mentioned as if it were a real occurrence, and so its modality attribute is ASSERTED (LDC, 2005).
In contrast, because of the phrase "were set to", we can infer that the event {leave} ev2 did not actually happen (i.e., its modality attribute is OTHER). Therefore, our model should be able to avoid the mistake if it utilizes additional symbolic features such as the modality attribute in this case.
Several previous methods use contextual embeddings together with type-based or argument-based information (Lu et al., 2020; Yu et al., 2020). For example, Lu et al. (2020) propose a new mechanism to better exploit event type information for coreference resolution. Despite their impressive performance, these methods are each specific to one particular type of additional information.
In this paper, we propose general and effective methods for incorporating a wide range of symbolic features into event coreference resolution. Simply concatenating symbolic features with contextual embeddings is not optimal, since the features can be noisy and contain errors. Also, depending on the context, some features can be more informative than others. Therefore, we design a novel context-dependent gated module to selectively extract information from the symbolic features. Combined with a simple regularization method that randomly adds noise to the features during training, our best models achieve state-of-the-art results on the ACE 2005 (Walker et al., 2006) and KBP 2016 (Mitamura et al., 2016) datasets. To the best of our knowledge, our work is the first to explicitly focus on dealing with various noisy symbolic features for event coreference resolution.

Preliminaries
We focus on within-document event coreference resolution. The input to our model is a document D consisting of n tokens and k (predicted) event mentions {m_1, m_2, ..., m_k}. For each m_i, we denote the start and end indices of its trigger by s_i and e_i, respectively. We assume the mentions are ordered by s_i (i.e., if i ≤ j then s_i ≤ s_j).
We also assume each m_i has K (predicted) categorical features {c_i^(1), c_i^(2), ..., c_i^(K)}. Table 2 lists the symbolic features we consider in this work. The definitions of the features and their possible values are given in the ACE and Rich ERE guidelines (LDC, 2005; Mitamura et al., 2016). The accuracy scores of the symbolic feature predictors are also shown in Table 2. We use OneIE to identify event mentions along with their subtypes. For other symbolic features, we train a joint classification model based on SpanBERT. The appendix contains more details.

Single-Mention Encoder
Given a document D, our model first forms a contextualized representation for each input token using a Transformer encoder (Joshi et al., 2020). Let X = (x_1, ..., x_n) be the output of the encoder, where x_i ∈ R^d. Then, for each mention m_i, its trigger's representation t_i is defined as the average of its token embeddings:

t_i = (1 / (e_i − s_i + 1)) Σ_{j = s_i}^{e_i} x_j    (1)

Next, using K trainable embedding matrices, we convert the symbolic features of m_i into K feature vectors {c_i^(1), ..., c_i^(K)}, where each c_i^(u) ∈ R^l.
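As a concrete illustration, the trigger averaging in Eq. 1 and the feature-embedding lookup can be sketched in PyTorch as follows. This is a minimal sketch under stated assumptions: d = 768 corresponds to spanbert-base-cased, while the feature-vector size l and the value-set sizes are hypothetical placeholders.

```python
import torch
import torch.nn as nn

d, l = 768, 20  # d: encoder hidden size; l: feature embedding size (hypothetical)
num_feature_values = [34, 4, 3]  # hypothetical value-set sizes for K = 3 symbolic features

# K trainable embedding matrices, one per symbolic feature type.
feature_embeddings = nn.ModuleList(
    [nn.Embedding(n, l) for n in num_feature_values]
)

def trigger_repr(x, s, e):
    """t_i: average of the token embeddings x[s..e] of the trigger span (inclusive)."""
    return x[s : e + 1].mean(dim=0)

# Toy usage: a 10-token document and one mention whose trigger spans tokens [2, 3].
x = torch.randn(10, d)           # stand-in for the encoder output X
t_i = trigger_repr(x, 2, 3)      # trigger representation in R^d
c_i = [emb(torch.tensor(0)) for emb in feature_embeddings]  # K feature vectors in R^l
```

The same lookup is applied to every mention; only the embedding matrices are trained, while the categorical feature values come from the upstream predictors.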

Mention-Pair Encoder and Scorer
Given two event mentions m_i and m_j, we define their trigger-based pair representation as:

t_ij = FFNN_t([t_i, t_j, t_i • t_j])    (2)

where FFNN_t is a feedforward network mapping from R^(3×d) to R^p, and • is element-wise multiplication. Similarly, we can compute their feature-based pair representations {h_ij^(1), ..., h_ij^(K)}:

h_ij^(u) = FFNN_u([c_i^(u), c_j^(u), c_i^(u) • c_j^(u)])    (3)

where u ∈ {1, 2, ..., K}, and FFNN_u is a feedforward network mapping from R^(3×l) to R^p. Now, the most straightforward way to build the final pair representation f_ij of m_i and m_j is to simply concatenate the trigger-based representation and all the feature-based representations:

f_ij = [t_ij, h_ij^(1), ..., h_ij^(K)]    (4)

However, this approach is not always optimal. First, as the symbolic features are predicted, they can be noisy and contain errors; the performance of most symbolic feature predictors is far from perfect (Table 2). Also, depending on the specific context, some features can be more useful than others. Inspired by studies on gated modules (Lai et al., 2019), we propose the Context-Dependent Gated Module (CDGM), which uses a gating mechanism to selectively extract information from the input symbolic features (Figure 1). Given two mentions m_i and m_j, we use their trigger-based pair representation t_ij as the main controlling context to compute a filtered representation h̃_ij^(u) for each feature-based representation:

h̃_ij^(u) = CDGM(t_ij, h_ij^(u))    (5)

where u ∈ {1, 2, ..., K}. More specifically, h_ij^(u) is decomposed into an orthogonal component o_ij^(u) and a parallel component p_ij^(u), and h̃_ij^(u) is simply the fusion of these two components:

h̃_ij^(u) = g_ij • o_ij^(u) + (1 − g_ij) • p_ij^(u)    (6)

In order to find the optimal mixture, the gating vector g_ij, computed from the controlling context, is used to control the composition. The decomposition unit is defined as:

p_ij^(u) = ((h_ij^(u) · t_ij) / (t_ij · t_ij)) t_ij
o_ij^(u) = h_ij^(u) − p_ij^(u)    (7)

where · denotes the dot product. The parallel component p_ij^(u) is the projection of h_ij^(u) onto t_ij; it can be viewed as containing information that is already present in t_ij. In contrast, o_ij^(u) is orthogonal to t_ij, and so it can be viewed as containing new information. Intuitively, when the original symbolic feature vector h_ij^(u) is clean and has complementary information, we want to utilize the new information in o_ij^(u) (i.e., g_ij ≈ 1), and vice versa.
Finally, after using CDGMs to distill the symbolic features, the final pair representation f_ij of m_i and m_j is computed as:

f_ij = [t_ij, h̃_ij^(1), ..., h̃_ij^(K)]    (8)

and the coreference score s(i, j) of m_i and m_j is:

s(i, j) = FFNN_a(f_ij)    (9)

where FFNN_a is a feedforward network mapping from R^((K+1)×p) to R.
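The decomposition and gated fusion of Eqs. 6-7 can be sketched as a small PyTorch module. This is an illustrative sketch, not our exact implementation: in particular, the parameterization of the gate (a single linear layer plus sigmoid over t_ij) is an assumption made for brevity.

```python
import torch
import torch.nn as nn

class CDGM(nn.Module):
    """Context-Dependent Gated Module (sketch).

    Decomposes a feature-based pair representation h into a component parallel
    to the trigger-based context t (information already in t) and an orthogonal
    component (new information), then fuses the two with a gate computed from t.
    The gate parameterization below is an illustrative assumption.
    """

    def __init__(self, p):
        super().__init__()
        self.gate = nn.Linear(p, p)

    def forward(self, t, h):
        # Parallel component: projection of h onto t.
        par = (h @ t) / (t @ t) * t
        # Orthogonal component: the remainder, orthogonal to t.
        orth = h - par
        g = torch.sigmoid(self.gate(t))
        # g ≈ 1 keeps the new information; g ≈ 0 falls back to the parallel part.
        return g * orth + (1.0 - g) * par

# Toy usage with pair-representation size p (hypothetical value).
p = 128
cdgm = CDGM(p)
t_ij = torch.randn(p)            # trigger-based pair representation (Eq. 2)
h_u = torch.randn(p)             # one feature-based pair representation (Eq. 3)
h_tilde = cdgm(t_ij, h_u)        # filtered representation (Eqs. 5-6)
```

One CDGM is applied per symbolic feature type, and the filtered outputs are concatenated with t_ij before scoring (Eqs. 8-9).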

Training and Inference
Algorithm 1: Noise Addition for Symbolic Features. Input: document D. Hyperparameters: {ε_1, ε_2, ..., ε_K}.

Training We use the same loss function as in (Lee et al., 2017). Also, notice that the training accuracy of a feature predictor is typically much higher than its accuracy on the dev/test sets (Table 2). If we simply train our model without any regularization, our CDGMs will rarely encounter noisy symbolic features during training. Therefore, to encourage our CDGMs to actually learn to distill reliable signals, we propose a simple but effective noisy training method. Before passing a training batch to the model, we randomly add noise to the predicted features. More specifically, for each document D in the batch, we go through every symbolic feature of every event mention in D and consider sampling a new value for the feature. The operation is described in Algorithm 1 (we use the same notation as in Section 2.1). {ε_1, ε_2, ..., ε_K} are hyperparameters determined by validation. In general, the larger the discrepancy between the train and test accuracies of a feature's predictor, the larger ε_u should be.
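The noise-addition step can be sketched as follows, under stated assumptions: with probability ε_u, the predicted value of feature u is replaced by a value drawn uniformly from that feature's value set (the uniform resampling distribution is an assumption for illustration).

```python
import random

def add_feature_noise(mentions, value_sets, eps, rng=random):
    """Randomly corrupt predicted symbolic features before a training batch.

    mentions: list of dicts mapping feature index u -> categorical value.
    value_sets: value_sets[u] is the list of possible values for feature u.
    eps: eps[u] is the probability of resampling feature u (tuned on dev).
    """
    for m in mentions:
        for u, values in enumerate(value_sets):
            if rng.random() < eps[u]:
                # Replace the predicted value with a uniformly sampled one.
                m[u] = rng.choice(values)
    return mentions
```

With eps[u] = 0 the features pass through unchanged; larger eps[u] exposes the CDGMs to more corrupted values for feature u during training.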
Inference For each (predicted) mention m_i, our model assigns an antecedent a_i from the set of all preceding mentions and a dummy antecedent ε:

a_i = argmax_{j ∈ {ε, 1, ..., i−1}} s(i, j)    (10)

The dummy antecedent ε represents two possible cases: (1) m_i is not actually an event mention; (2) m_i is indeed an event mention, but it is not coreferent with any previously extracted mention. In addition, we fix s(i, ε) to be 0.
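Given the pairwise scores, coreference clusters follow from greedy antecedent selection. The sketch below assumes the scores s(i, j) are precomputed; since s(i, ε) is fixed to 0, a mention starts a new cluster unless some preceding mention scores above 0.

```python
def cluster_mentions(scores):
    """Greedy antecedent-based clustering (sketch).

    scores[i] is the list of coreference scores s(i, j) for all j < i,
    in mention order; scores[0] is empty.
    """
    cluster_id = []   # cluster index assigned to each mention
    next_id = 0
    for i, row in enumerate(scores):
        best_j, best_s = None, 0.0   # dummy antecedent epsilon has score 0
        for j, s in enumerate(row):
            if s > best_s:
                best_j, best_s = j, s
        if best_j is None:
            # No antecedent beats the dummy: start a new cluster.
            cluster_id.append(next_id)
            next_id += 1
        else:
            # Join the antecedent's cluster.
            cluster_id.append(cluster_id[best_j])
    return cluster_id
```

For example, with three mentions where m_3 strongly links to m_1 and m_2 links to nothing, the function yields clusters {m_1, m_3} and {m_2}.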

Experiments and Results
Data and Experiments Setup We evaluate our methods on two English datasets: ACE 2005 (Walker et al., 2006) and KBP 2016 (Ji et al., 2016; Mitamura et al., 2016). We report results in terms of F1 scores obtained using the CoNLL and AVG metrics. By definition, these metrics summarize other standard coreference metrics, including B^3, MUC, CEAF_e, and BLANC (Lu and Ng, 2018). We use SpanBERT (spanbert-base-cased) as the Transformer encoder (Wolf et al., 2020a; Joshi et al., 2020). More details about the datasets and hyperparameters are in the appendix. We refer to models that use only trigger features as [Baseline]. In a baseline model, f_ij is simply t_ij (Eq. 2). We refer to models that use only the simple concatenation strategy as [Simple] (Eq. 4), and models that use the simple concatenation strategy together with the noisy training method as [Noise].
Overall Results (on Predicted Mentions) Table 3 and Table 4 show the overall end-to-end results on ACE 2005 and KBP 2016, respectively. Note that Peng et al. (2016) conducted 10-fold cross-validation and thus effectively used more training data. Nevertheless, the magnitude of the differences in scores between our best model and the state-of-the-art methods indicates the effectiveness of our methods.
Overall Results (on Ground-truth Triggers) The overall results on ACE 2005 using ground-truth triggers and predicted symbolic features are shown in Table 5. The performance of our full model is comparable with the previous state-of-the-art result of (Yu et al., 2020). To better analyze the usefulness of symbolic features as well as the effectiveness of our methods, we also conduct experiments using ground-truth triggers and ground-truth symbolic features (Table 6). First, when the symbolic features are clean, incorporating them using the simple concatenation strategy already boosts performance significantly; the symbolic features contain information complementary to that in the SpanBERT contextual embeddings. Second, we also see that the noisy training method is not helpful when the symbolic features are clean. Unlike other regularization methods such as dropout (Srivastava et al., 2014) and weight decay (Krogh and Hertz, 1992), the main role of our noisy training method is not to reduce overfitting in the traditional sense; its main function is to help CDGMs learn to distill reliable signals from noisy features.

Table 7 shows the results of incorporating different types of symbolic features on the ACE 2005 dataset. Overall, our methods consistently perform better than the simple concatenation strategy across all feature types. The gains are also larger for noisier features than for cleaner ones (feature prediction accuracies are shown in Table 2). This suggests that our methods are particularly useful in situations where the symbolic features are noisy.

Comparison with Multi-Task Learning
We also investigate whether we can incorporate symbolic semantics into coreference resolution by simply doing multi-task training. We train our baseline model to jointly perform coreference resolution and symbolic feature prediction. The test AVG score on ACE 2005 is only 56.5. In contrast, our best model achieves an AVG score of 59.76 (Table 3).

Qualitative Examples
Table 8 shows a few examples from the ACE 2005 dataset that illustrate how incorporating symbolic features using our proposed methods can improve the performance of event coreference resolution. In each example, our baseline model incorrectly predicts the highlighted event mentions to be coreferential.
Remaining Challenges Previous studies suggest that there exist different types and degrees of event coreference (Recasens et al., 2011;Hovy et al., 2013). Many methods (including ours) focus on the full strict coreference task, but other types of coreference such as partial coreference have remained underexplored. Hovy et al. (2013) defines two core types of partial event coreference relations: subevent relations and membership relations. Subevent relations form a stereotypical sequence of events, whereas membership relations represent instances of an event collection. We leave tackling the partial coreference task to future work.

Related Work
Several previous approaches to within-document event coreference resolution operate by first applying a mention-pair model to compute pairwise distances between event mentions, and then applying a clustering algorithm such as agglomerative clustering or spectral graph clustering (Chen et al., 2009; Chen and Ji, 2009; Chen and Ng, 2014; Nguyen et al., 2016; Huang et al., 2019). In addition to trigger features, these methods use a variety of additional symbolic features such as event types, attributes, arguments, and distance. These approaches do not use contextual embeddings such as BERT and SpanBERT (Devlin et al., 2019; Joshi et al., 2020). Recently, several studies have used contextual embeddings together with type-based or argument-based information (Lu et al., 2020; Yu et al., 2020). These methods design networks or mechanisms that are specific to only one type of symbolic feature. In contrast, our work is more general and can be effectively applied to a wide range of symbolic features.

Conclusions and Future Work
In this work, we propose a novel gated module to incorporate symbolic semantics into event coreference resolution. Combined with a simple noisy training technique, our best models achieve competitive results on ACE 2005 and KBP 2016. In the future, we aim to extend our work to more general problems such as cross-lingual and cross-document coreference resolution.
interpreted as necessarily representing the official policies, either expressed or implied, of DARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

A.1 Symbolic Feature Predictors
In an end-to-end setting, we train and use OneIE to identify event mentions along with their subtypes. For other symbolic features, we train a simple joint model. More specifically, given a document, our joint model first forms contextualized representations for the input tokens using SpanBERT (Joshi et al., 2020). Each event mention's representation is then defined as the average of the embeddings of the tokens in its trigger. After that, we feed the mentions' representations into classification heads for feature value prediction. Each classification head is a standard multi-layer feedforward network with a softmax output layer.
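The joint feature predictor described above can be sketched as follows. The feature names and value-set sizes are hypothetical placeholders; d = 768 assumes spanbert-base-cased token embeddings, and the two-layer head is an illustrative choice.

```python
import torch
import torch.nn as nn

d = 768  # SpanBERT-base hidden size
# Hypothetical symbolic feature types and value-set sizes, for illustration only.
num_values = {"modality": 2, "polarity": 2, "genericity": 2, "tense": 4}

# One classification head per symbolic feature type.
heads = nn.ModuleDict({
    name: nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, n))
    for name, n in num_values.items()
})

def predict_features(x, s, e):
    """Average the trigger token embeddings x[s..e], then score each feature's values."""
    t = x[s : e + 1].mean(dim=0)
    return {name: head(t).softmax(dim=-1) for name, head in heads.items()}

# Toy usage: a 5-token document, one mention with trigger span [1, 2].
x = torch.randn(5, d)
probs = predict_features(x, 1, 2)
```

At inference time, the argmax of each head's distribution is taken as the mention's predicted feature value.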

A.2 Datasets Description
In this work, we use two English within-document coreference datasets: ACE 2005 and KBP 2016. The ACE 2005 English corpus contains fine-grained event annotations for 599 articles from a variety of sources. We use the same split as that stated in (Chen et al., 2015), with 529/30/40 documents in the train/dev/test sets. ACE adopts a strict notion of event coreference, which requires two event mentions to be coreferential if and only if they have the same agent(s), patient(s), time, and location. For KBP 2016, we follow the setup of (Lu and Ng, 2017a), where 648 documents can be used for training and 169 documents for testing. We train our model on 509 documents randomly chosen from the training documents and tune parameters on the remaining 139 training documents. Different from ACE, KBP adopts a more relaxed definition of event coreference, where two event mentions can be coreferent as long as they intuitively refer to the same real-world event. Table 9 summarizes the basic statistics of the datasets.

A.3 Hyperparameters
We use SpanBERT (spanbert-base-cased) as the Transformer encoder (Wolf et al., 2020a; Joshi et al., 2020). We tuned hyperparameters using the datasets' dev sets. For all experiments, we pick the model that achieves the best AVG score on the dev set and then evaluate it on the test set. For each of our models, two different learning rates are used: one for the lower pre-trained Transformer encoder and one for the upper layers. The optimal hyperparameter values are variant-specific, and we experimented with the following ranges of possible values: {8, 16} for batch size, {3e-5, 4e-5, 5e-5} for the lower learning rate, {1e-4, 2.5e-4, 5e-4} for the upper learning rate, and {50, 100} for the number of training epochs. Table 10 shows the value of ε we used for each symbolic feature type. In general, the larger the discrepancy between the train and test accuracies, the larger the value of ε.

A.4 Reproducibility Checklist
We present the reproducibility information of the paper. Due to licensing reasons, we cannot provide downloadable links for ACE 2005 and KBP 2016.

Computing Infrastructure The experiments were conducted on a server with Intel(R) Xeon(R) Gold 5120 CPUs @ 2.20GHz and NVIDIA Tesla V100 GPUs. The allocated RAM is 187 GB; GPU memory is 16 GB.

Implementation Dependencies
Average Runtime Table 11 shows the estimated average runtime of our full model.

Hyperparameters of Best-Performing Models Table 12 summarizes the hyperparameter configurations of the best-performing models.

Expected Validation Performance We repeat training five times for each best-performing model and show the average validation performance in Table 13. Our validation scores on KBP 2016 are not comparable to those of (Lu et al., 2020), because we split the original 648 training documents into the final train and dev sets randomly. We still use the same test set. For each best-performing model, we report the test performance of the checkpoint with the best AVG score in the main paper.