A Two-Step Approach for Implicit Event Argument Detection

In this work, we explore the implicit event argument detection task, which studies event arguments beyond sentence boundaries. The addition of cross-sentence argument candidates imposes great challenges on modeling. To reduce the number of candidates, we adopt a two-step approach, decomposing the problem into two sub-problems: argument head-word detection and head-to-span expansion. Evaluated on the recent RAMS dataset (Ebner et al., 2020), our model achieves overall better performance than a strong sequence labeling baseline. We further provide a detailed error analysis, presenting where the model mainly makes errors and indicating directions for future improvements. Detecting implicit arguments remains challenging, calling for more future work on document-level modeling for this task.


Introduction
Event argument detection is a key component in the task of event extraction. It resembles semantic role labeling (SRL) in that the main target is to find argument spans to fill the roles of event frames. However, event arguments can go beyond sentence boundaries: there can be non-local or implicit arguments at the document level. Figure 1 shows such an example: for the purchase event, which is triggered by the word "bought", its money argument appears in the previous sentence.
Implicit arguments have been under-explored in event extraction. Most previous systems (Li et al., 2013; Chen et al., 2015; Nguyen et al., 2016; Wang et al., 2019) only consider local arguments in the same sentence as the event trigger. Incorporating implicit arguments requires corresponding annotations, but few exist in most of the widely used event datasets, like ACE2005 (LDC, 2005; Walker et al., 2006) and RichERE (LDC, 2015). There have been several annotation efforts for implicit arguments.

Figure 1: Examples of implicit arguments and model illustration. (a) The new computer cost 3000 dollars, while the old one cost 1000 dollars. Nevertheless, he still bought the more expensive one. (b) The new computer cost 3000 dollars, while the old one cost 1000 dollars. Therefore, he bought the cheaper one. The bold text indicates the trigger word for the purchase event, while the underlined text indicates its non-local "money" argument in the previous sentence. Our model first detects the head-word "dollars", and then expands it to the whole span.
Recently, Ebner et al. (2020) created the Roles Across Multiple Sentences (RAMS) dataset, which covers multi-sentence implicit arguments for a wide range of event and role types. They further develop a span-based argument linking model and achieve relatively high scores. However, they mainly explore a simplified setting that assumes the availability of gold argument spans. We extend their work and explore the more challenging full detection problem, which predicts argument spans among all possible candidates. The difficulty of the full problem is highlighted in Figure 1: both "3000 dollars" and "1000 dollars" are good candidates for the money role of the purchase event, but the correct selection differs depending on the context.
When considering all possible candidate spans that may occur in any sentence, their quadratic number poses great challenges for detection. Inspired by dependency-based SRL (Surdeanu et al., 2008; Hajič et al., 2009), we take the syntactic head-words as proxies for full argument spans, hypothesizing that the head-words contain enough information to fill the argument roles. Based on this, we adopt a two-step approach: first detecting the head-words of the arguments, and then expanding each head to its full span. This type of two-step setup is not uncommon in prior work on information extraction, including entity detection, coreference resolution (Peng et al., 2015) and document-level pseudo-coreference (Jauhar et al., 2015; Liu et al., 2016). By considering only individual tokens in the detection step, the system only needs to handle a candidate space whose size scales linearly, rather than quadratically, with the number of tokens.
Under the same setting of fine-tuning a BERT (Devlin et al., 2019) encoder, we show the effectiveness of our model by obtaining overall better results than a strong sequence-labeling model. We further provide a detailed error analysis, showing that the main difficulties of the task lie in non-local and non-core arguments. Our analysis shows that the implicit argument task is quite challenging, calling for more future work on document-level semantic understanding for this task.

Model
The goal of event argument detection is to create labeled links between argument spans and the predicate (event trigger). Recent state-of-the-art solutions for sentence-level SRL perform the detection in an end-to-end setting, such as span-based (He et al., 2018; Ouchi et al., 2018) and sequence labeling models (He et al., 2017; Shi and Lin, 2019). However, span-based models face great challenges when considering arguments across sentence boundaries, since the computational complexity of such models grows quadratically to deal with O(N^2) span candidates given N tokens. While traditional sequence labeling models run in linear time, they are less flexible and extensible in complex scenarios like overlapping mentions and multiple roles for one mention. In this work, we take a two-step approach that explicitly decomposes the problem into two sub-problems, based on the hypothesis that head-words can usually capture the information of the mention spans. Figure 1 illustrates the three main modules of our model: 1) BERT-based Encoder, 2) Argument Head-Word Detector, and 3) Head-to-Span Expander.

BERT-based Encoder
Our encoding module is a BERT-based contextualized encoder. The input contains a predicate word (or occasionally a span), which triggers an event, together with its multi-sentence context. We refer to the sentence containing the event trigger as the center sentence. We concatenate the tokens within the 5-sentence window (the window size used in the RAMS annotation) around the center sentence, and feed them to BERT to obtain the contextual representation e of each token. In addition, we add special token-type-id indicators: tokens of the event trigger are assigned 0, other tokens in the center sentence get 1, and tokens in surrounding sentences get 0. We only adopt the indicators when fine-tuning BERT, since the pre-trained BERT originally uses them as segment ids.
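As a minimal sketch, the indicator scheme above can be written as follows; `make_type_ids` and its span arguments are hypothetical names for illustration, not the original implementation:

```python
def make_type_ids(n_tokens, trigger_span, center_span):
    """Assign token-type ids: 0 for trigger tokens, 1 for the other
    tokens in the center sentence, 0 for surrounding sentences."""
    t_lo, t_hi = trigger_span   # trigger token range [lo, hi)
    c_lo, c_hi = center_span    # center-sentence token range [lo, hi)
    ids = []
    for i in range(n_tokens):
        if t_lo <= i < t_hi:
            ids.append(0)       # event trigger
        elif c_lo <= i < c_hi:
            ids.append(1)       # rest of the center sentence
        else:
            ids.append(0)       # surrounding context sentences
    return ids

# e.g., 10 tokens, trigger at position 6, center sentence spans tokens 4-8
print(make_type_ids(10, (6, 7), (4, 9)))
# -> [0, 0, 0, 0, 1, 1, 0, 1, 1, 0]
```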

Argument Head-word Detector
Instead of directly deciding argument spans, we first identify the head-words of the arguments. The hypothesis is that the head-word is able to represent the meaning of the whole span. In this way, this sub-problem mimics a token-pairwise dependency-parsing problem. Following Dozat and Manning (2017, 2018), we adopt a biaffine module to calculate Pr_r(p, c): the probability of a candidate word c filling an argument role r in the frame governed by a predicate p. We first take the contextualized representations of the candidate (e_c) and the predicate (e_p), which are calculated by BERT as described in §2.1. Biaffine_r further gives the pairwise score based on these representations, and Pr_r(p, c) is then given by a softmax over the scores:

$$\Pr\nolimits_r(p, c) = \frac{\exp\big(\mathrm{Biaffine}_r(e_p, e_c)\big)}{\sum_{c' \in C \cup \{\mathrm{null}\}} \exp\big(\mathrm{Biaffine}_r(e_p, e_{c'})\big)}$$

where the normalization is done over the argument candidate set C (plus null, whose score is fixed to 0) for each role, following (Ebner et al., 2020; Ouchi et al., 2018). During training, we use the cross-entropy loss to guide the network to pick head-words of gold arguments (or null if there are no arguments for this role). If there are multiple arguments for one role, we view them as individual instances and sum the losses. At inference time, we simply pick the maximum-scoring argument (or null) for each role.
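A minimal NumPy sketch of this scoring is below. The simple bilinear-plus-linear form and the weight shapes are assumptions for illustration, not the exact parameterization of the paper's biaffine module:

```python
import numpy as np

def biaffine_role_probs(e_p, cand_reprs, W_r, b_r):
    """Score each candidate head-word c for role r as
    s(c) = e_p^T W_r e_c + b_r^T e_c, then softmax over the
    candidates plus a null option whose score is fixed to 0."""
    scores = np.array([e_p @ W_r @ e_c + b_r @ e_c for e_c in cand_reprs])
    scores = np.append(scores, 0.0)        # null: "no argument for this role"
    exp = np.exp(scores - scores.max())    # numerically stable softmax
    return exp / exp.sum()                 # last entry is Pr(null)

# toy example with random 8-dimensional representations
rng = np.random.default_rng(0)
d = 8
probs = biaffine_role_probs(rng.normal(size=d),
                            [rng.normal(size=d) for _ in range(3)],
                            0.1 * rng.normal(size=(d, d)),
                            0.1 * rng.normal(size=d))
# probs has 4 entries (3 candidates + null) and sums to 1
```

At inference, taking the argmax over `probs` either selects a head-word or the null option, mirroring the decoding rule described above.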

Head-to-span Expander
The second module expands each head-word of the argument to its full span. We view this as a combination of left and right boundary classification problems. Taking the left-expanding scenario (L) as an example, for each head-word h, we generate a set of candidate spans by adding words one by one on the left, up to K words (we empirically set K = 7), and calculate the probability of word b being the boundary as follows:

$$\Pr\nolimits_L(b \mid h) = \mathrm{softmax}_b\big(\mathrm{MLP}_L([e_b; e_h])\big)$$

Here, the input to the Multi-Layer Perceptron (MLP) is again the contextualized representations described in §2.1. During training, we minimize cross-entropy losses on the left and right boundaries respectively. At test time, we expand to the maximum-scoring boundary words on both sides.
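The candidate generation and boundary selection can be sketched as follows; the scoring callables stand in for the trained MLP outputs, and all names are hypothetical:

```python
K = 7  # maximum expansion width on each side (as set empirically)

def expand_head(h, n_tokens, score_left, score_right):
    """Expand head-word index h to a span (lo, hi) by picking the
    maximum-scoring boundary word within K tokens on each side.
    score_left / score_right map a candidate boundary index to a score."""
    left_cands = range(max(0, h - K), h + 1)            # h itself is allowed
    right_cands = range(h, min(n_tokens, h + K + 1))
    lo = max(left_cands, key=score_left)
    hi = max(right_cands, key=score_right)
    return lo, hi

# toy scores favoring a 2-word extension to the left and none to the right
lo, hi = expand_head(5, 12,
                     score_left=lambda b: -abs(b - 3),
                     score_right=lambda b: -abs(b - 5))
print((lo, hi))  # -> (3, 5)
```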

Experiment
We conduct all experiments on the RAMS (v1.0) dataset and focus on the event argument detection task: given (gold) event triggers and their multi-sentence contexts, predict the argument spans from raw input tokens. Following (Ebner et al., 2020), we only use gold event types in the type-constrained decoding (TCD) setting.
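Type-constrained decoding can be sketched as a simple masking step over role scores; the ontology fragment and function names below are hypothetical illustrations, not the RAMS ontology itself:

```python
def type_constrained_scores(role_scores, event_type, type_to_roles):
    """Keep scores only for roles licensed by the (gold) event type;
    all other roles are masked out so they can never be predicted."""
    allowed = type_to_roles[event_type]
    return {r: s for r, s in role_scores.items() if r in allowed}

# hypothetical ontology fragment and role scores
ontology = {"transaction.purchase": {"buyer", "money", "artifact"}}
scores = {"buyer": 2.1, "money": 1.4, "victim": 3.0}
print(type_constrained_scores(scores, "transaction.purchase", ontology))
# -> {'buyer': 2.1, 'money': 1.4}
```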
Throughout our experiments, we adopt the pre-trained bert-base-cased model. We train all the models for at most 20 epochs. When fine-tuning BERT, we set the initial learning rate to 5e-5; otherwise, it is set to 2e-4. We jointly train our argument detector and span expander, with loss multipliers of 1.0 and 0.5, respectively.

Table 1: Results of the span-based (Ebner et al., 2020) and head-based (ours) models on RAMS, given gold argument spans. "+TCD" indicates whether type-constrained decoding based on gold event types is applied.

Since head-words are not annotated, we apply a simple rule: utilizing predicted dependency trees, we heuristically pick the word that has the smallest arc distance to the dependency root as the head. Ties are broken by choosing the rightmost one. This procedure does not always give the perfect head, and sometimes there is no single head-word for a span (e.g., in multi-word expressions or conjunctions). Nevertheless, we find this strategy works well in practice.
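The head-selection rule can be sketched as follows, assuming the dependency parse is given as a parent-index array where `-1` marks the root; the helper names are hypothetical:

```python
def depth(i, heads):
    """Number of dependency arcs from token i up to the root."""
    d = 0
    while heads[i] != -1:
        i = heads[i]
        d += 1
    return d

def pick_head(span, heads):
    """Pick the span token closest to the dependency root (smallest
    arc distance); ties are broken by taking the rightmost token."""
    return min(span, key=lambda i: (depth(i, heads), -i))

# "3000 dollars": token 0 ("3000") modifies token 1 ("dollars"),
# which attaches to the verb (root) at index 2
heads = [1, 2, -1]
print(pick_head([0, 1], heads))  # -> 1, i.e., "dollars"
```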

Argument Linking with Gold Spans
Setting To compare our model with span-based models, we first evaluate in the same setting as Ebner et al. (2020), which assumes gold argument spans. We directly apply the head rule to the gold spans and consider the head-words as candidates. We also adopt the same BERT setting: learning a linear combination of layers 9, 10, 11 and 12, and applying neither the special input indicators nor fine-tuning.
Results Table 1 compares our results with the reported results of the span-based model from Ebner et al. (2020). The results show that the head-word approach obtains comparable results to the span-based counterpart. This matches our hypothesis that head-words capture sufficient information about their surrounding words through contextualized embeddings, making them reasonable alternatives to full argument spans.

Full Argument Detection
Setting This setting considers all arguments from any spans in the multi-sentence context. Unless otherwise noted, here we use the last layer of BERT and apply fine-tuning to the whole model. We compare with a strong BERT-based BIO-style sequence labeling model (Shi and Lin, 2019). We adopt a modified version from AllenNLP and retrain it on RAMS with similar settings: adopting special input indicators and fine-tuning BERT. For arguments that have multiple role labels, we simply concatenate the labels into a new class.

Table 3: Ablation on the encoder for the head-based argument detection model (on the development set, no type-constrained decoding). "BERT-Full" is our full fine-tuned BERT encoder, "No-Indicator" ablates the indicator inputs, "No-FineTuning" freezes all pre-trained parameters of BERT, and "LSTM" replaces BERT with a bi-directional LSTM encoder.
Results Table 2 shows the main results for full argument detection. Since the criterion of full-span matching might be too strict, we also report head-word based F1 scores by evaluating solely on head-word matches (obtained using the same head rules). The results show that our head-word based approach obtains better results on average without type-constrained decoding, and significantly better results after adopting type-constrained decoding with gold event types. Our head-driven approach is also flexible and easily extensible to more complex scenarios like nested mentions or multiple roles, while keeping the linear complexity.
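The head-word-match evaluation reduces to micro precision/recall/F1 over (role, head-index) tuples; a minimal sketch, with hypothetical names, is:

```python
def prf1(gold, pred):
    """Micro precision/recall/F1 over sets of (role, head_index)
    tuples, i.e., the head-word-match evaluation."""
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# one correct head-word, one wrong one
gold = {("money", 4), ("buyer", 10)}
pred = {("money", 4), ("buyer", 12)}
print(prf1(gold, pred))  # -> (0.5, 0.5, 0.5)
```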
As shown in Table 3, fine-tuning BERT and the special indicator inputs provide further improvements.

On Sentence Distances Table 4 lists the performance breakdown over different sentence distances between arguments and triggers. In contrast to the relatively consistent performance in the gold-span setting, as shown in (Ebner et al., 2020), we notice a dramatic performance drop on non-local arguments. There may be two main reasons: 1) data imbalance, since non-local implicit arguments appear much less frequently (only around 18% in RAMS) than local ones; 2) the lack of direct syntactic signals, which makes the connections between implicit arguments and event triggers much weaker than for local ones.

On Argument Roles We also investigate performance breakdowns over different argument roles. The results are shown in Figure 2, where we take the top-20 most frequent roles to obtain more robust results. We observe that our model performs better on core roles such as "communicator", "employee" and "victim" (with F1 > 50), but struggles on non-core roles, like "instrument", "origin" and "destination", with F1 scores of around 20 to 30. The F1 scores correlate well (with Pearson and Spearman correlation coefficients of 0.64 and 0.70, respectively) with the local percentages: the more often a role appears locally around the event trigger, the better results it obtains. These patterns are not surprising if we consider the possible underlying reasoning: non-core arguments are not closely related to the event trigger, and thus can appear more freely at other places (or sometimes even be omitted), leading to a lower local percentage and making them harder to detect.

Figure 2: Performance breakdown of Span-F1 on the top-20 most frequent roles (on the development set, no type-constrained decoding). The x-axis represents the percentage of local arguments for each role, while the y-axis denotes the role-specific Span-F1 scores. The two blue dashed lines denote the overall F1 score (0.389) and local percentage (82.8%).

Manual Analysis
We further manually compare the annotated arguments with the predicted ones. For both annotated and predicted arguments, we assign them to one of seven categories; the results are listed in Table 5. Here, the "Span" errors denote unimportant span mismatches, and they account for nearly 9% of all items. If we ignore these errors, the performance reaches around 47%, which roughly matches the automatically evaluated Head-F1 scores. In some way, this supports our intuition to adopt a two-step approach, since the decisions on span ranges may be separated from the core problem of argument detection, where head-words can be reasonable representatives. Another major source of errors comes from "Coref.", which is not surprising since the same entities can have multiple mentions at the document level. Our analysis indicates that this is a problem that should be further investigated for both modeling and evaluation. Another notable type of error is frame mismatch ("Frame"). In the main setting (without type-constrained decoding), our model neither utilizes nor predicts event frame types, meaning that the frame information comes purely from the trigger words. Therefore, roles belonging to other event frames may be predicted. Finally, the "Others" category includes the cases where we cannot find obviously intuitive patterns. We would identify most of them as the more difficult cases, whose error breakdown follows similar patterns to the overall ones, as shown in Figure 2.

Conclusion
In this work, we propose a flexible two-step approach for implicit event argument detection. Our head-word based approach effectively reduces the candidate space and achieves good results on the RAMS dataset. We further provide a detailed error analysis, showing that non-local and non-core arguments are the main difficulties. We hope that this work can shed some light on and inspire future work in this line of research.