Joint Modeling of Arguments for Event Understanding

We frame event argument linking in documents as analogous to slot resolution for intents in dialogue, and present a Transformer-based model that extends a recently proposed approach to resolving references to slots. Our approach jointly considers all argument candidates for a detected event, which we show leads to state-of-the-art performance in multi-sentence argument linking.


Introduction
Given an event recognized in text, we are concerned with finding its associated arguments. Significant work has focused on single-sentence contexts, such as in semantic role labeling (SRL; Gildea and Jurafsky, 2000; He et al., 2017; Ouchi et al., 2018, inter alia). Unfortunately, even perfect performance in SRL is limited by the existence of arguments outside the sentence boundary, leading to prior work (Das et al., 2010; Silberer and Frank, 2012; Ebner et al., 2020) on an alternative paradigm variously called implicit role resolution or argument linking, where an event trigger (e.g. "attack") evokes a set of roles (e.g. ATTACKER, TARGET) to be filled, which are linked to explicit argument mentions found in text. In argument linking, possible candidate arguments are first detected, then linked to specific roles of detected events. This bears similarity to coreference resolution, where document-level context can be aptly utilized. For an example, see Figure 1.
This formulation is similar to the resolution of referring expressions in conversational dialogues (Çelikyilmaz et al., 2014), where a current utterance is considered to invoke an intent (e.g. BUY-BOOK), accompanied by a number of slots (e.g. …).^1 We propose a novel model for joint modeling of potential arguments, inspired by Chen et al. (2019), who proposed jointly predicting which spans in a dialogue are relevant to the intent of the current round. A Transformer (Vaswani et al., 2017) encoder is placed over the event trigger and potential arguments to jointly learn the relations between the event trigger and its arguments. The input to this Transformer is no longer tokens but spans: given the Transformer output for each span, a classification loss is used to perform argument role classification. We demonstrate that this leads to state-of-the-art performance on the RAMS argument linking dataset introduced by Ebner et al. (2020), showing the benefits of joint modeling when linking arguments to roles of events.

^1 Our code can be found at https://github.com/wanmok/joint-arglinking.

Background

Implicit role resolution  Palmer et al. (1986) treated unfilled semantic roles as special cases of anaphora and coreference resolution. Starting from the SemEval 2010 Task 10: Linking Roles (Ruppenhofer et al., 2010), there have been more recent modeling efforts on this task. Chen et al. (2010) approached it with their SRL system SEMAFOR (Das et al., 2010), casting the task as extended SRL by admitting constituents (potential arguments) from context beyond sentence boundaries. Silberer and Frank (2012) considered the problem as an anaphora resolution task within the discourse context. Ebner et al. (2020) similarly considered the task as related to anaphora resolution, and introduced a new dataset, RAMS, for exploring non-local argument linking.

Event extraction  Event extraction historically comprises three subtasks: detecting event triggers, detecting entity mentions, and argument role prediction, where relations between mentions and triggers are predicted in accordance with the event type's predefined set of roles under a closed ontology. Prior work has approached these subtasks as a pipeline (Ji and Grishman, 2008; Li et al., 2013; Yang and Mitchell, 2016, inter alia) or as a joint model over all three (Nguyen and Nguyen, 2019; Lin et al., 2020, inter alia). Our work can be seen as a version of argument role prediction that operates beyond sentence boundaries.
Frame-based SLU  In dialogue systems, semantic-frame-based spoken language understanding (SLU) is one of the most commonly applied SLU technologies for human-computer interaction. Such systems often output an interpretation of dialogues represented as intents and slots (Wang et al., 2011). Çelikyilmaz et al. (2014) and Bapna et al. (2017) proposed models to resolve references to slots in the dialogue, tracking conversation states across multiple dialogue turns. Dhingra et al. (2017) augmented such methods with external knowledge bases (KBs) to create a multi-turn dialogue agent that helps users search KBs. Chen et al. (2019) proposed joint models over potential slots in dialogue to decide which contextual slots should be carried over to the most recent utterance. Our approach is inspired by this work, drawing analogies between concepts in SLU (intents / slots) and those in IE (events / arguments) (see Table 1).

Problem Formulation
Following Ebner et al. (2020), we consider argument linking as the task of choosing amongst detected mention span candidates given detected event trigger spans. Given a document D = (w_1, …, w_n) where each w_i is a word, an entity mention set M (candidate arguments) containing mentions m = [i : j] ∈ M where i and j demarcate the left and right boundaries (both inclusive), and an event trigger span t = [k : l], an argument linking model predicts the role (or absence thereof) of each mention with respect to the event.
An event ontology can be formulated as a set of event types T, where each type τ ∈ T is associated with a set of permissible roles R(τ); all other roles are non-permissible. We denote the union of the role sets of all event types, plus an empty role ε (a dummy role denoting that an argument is not part of the event structure), as R = ⋃_{τ∈T} R(τ) ∪ {ε}.
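The role set R and the per-type permissibility masking above can be sketched in a few lines. This is an illustrative sketch only: the ontology contents and names (EPSILON, ONTOLOGY, role_mask) are hypothetical, not drawn from the paper's actual ontology.

```python
EPSILON = "<no-role>"  # dummy role: mention is not part of the event structure

# Toy ontology: event type tau -> permissible role set R(tau)
ONTOLOGY = {
    "Conflict.Attack": ["Attacker", "Target", "Instrument"],
    "Transaction.Buy": ["Buyer", "Seller", "Artifact"],
}

# R = union of all roles over all event types, plus the empty role
ALL_ROLES = sorted({r for roles in ONTOLOGY.values() for r in roles}) + [EPSILON]

def role_mask(event_type):
    """Boolean mask over ALL_ROLES: True iff the role is permissible for
    this event type. The empty role is always permissible."""
    permitted = set(ONTOLOGY[event_type]) | {EPSILON}
    return [r in permitted for r in ALL_ROLES]
```

A classifier over R would apply this mask before the softmax, so probability mass only falls on R(τ) ∪ {ε}.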

Approach
Argument and trigger representation  We compute a fixed-length vector of dimension d for each argument and trigger span as its representation. To compute this, we first pass the document through a pre-trained contextualizing model (here, BERT; Devlin et al., 2019).^5 We split documents into sentences and feed each sentence to BERT for encoding. Each token may be split into more than one subword unit; in this case we take the average of the subword representations so that each token has a single vector representation w ∈ R^{d_tok}, following Zhang et al. (2019).
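The subword-averaging step can be sketched as follows, assuming a tokenizer that reports which word each subword piece belongs to (the function name and interface here are illustrative, not the paper's actual code):

```python
def average_subwords(subword_vecs, word_ids):
    """Average subword-piece vectors back into one vector per token.

    subword_vecs: one vector (list of floats) per subword piece.
    word_ids: for each piece, the index of the word it belongs to.
    Returns one averaged vector per word, in word order.
    """
    n_words = max(word_ids) + 1
    dim = len(subword_vecs[0])
    sums = [[0.0] * dim for _ in range(n_words)]
    counts = [0] * n_words
    for vec, w in zip(subword_vecs, word_ids):
        counts[w] += 1
        for d in range(dim):
            sums[w][d] += vec[d]
    return [[s / counts[w] for s in sums[w]] for w in range(n_words)]
```

For example, if "unloading" were split into two pieces mapped to word 0, its token vector would be the mean of the two piece vectors.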
For an argument span m = (w_i, …, w_j), we follow Lee et al. (2017) to generate a span embedding.^6 The span embedding for a mention span comprises three parts: the representation of its left boundary, the representation of its right boundary, and a learned pooling over the tokens in the span. The learned pooling uses a global attention query vector q ∈ R^{d_tok} and computes the weighted sum of all tokens in the span with respect to attention scores derived from q:

    a_u = exp(q · w_u) / Σ_{u'=i..j} exp(q · w_{u'}),    ŵ = Σ_{u=i..j} a_u w_u    (1)

We then pass the concatenated boundary and pooled representations through a 2-layer feed-forward neural network to yield a fixed-length vector m ∈ R^{d_span} for each argument span:

    m = FFNN_arg([w_i; w_j; ŵ])    (2)

Similarly, for any trigger span t = [k : l], we employ a different set of parameters:

    t = FFNN_trig([w_k; w_l; ŵ])    (3)

Joint modeling of arguments  We propose a joint model over all arguments with respect to the given event trigger t of event type τ (see Figure 1). We form a sequence (t, m_1, m_2, …, m_n) with the trigger span encoding as the prefix, followed by the representations of all candidate mentions, and feed it to a Transformer encoder (Vaswani et al., 2017):

    (t̂, m̂_1, …, m̂_n) = TransformerEncoder(t, m_1, …, m_n)    (4)

A Transformer, through its self-attention mechanism, naturally models the relation between every trigger-argument and argument-argument pair. Note two major differences compared to a Transformer that runs on tokens: (1) each input to the Transformer represents a span instead of a token, following Chen et al. (2019); (2) since the arguments do not have an explicit sequential order, we forgo positional embeddings, effectively modeling the input as a set of spans rather than a sequence (self-attention is permutation invariant in the absence of positional embeddings; Lee et al., 2019). For each argument span input m_i, we pass its output from the Transformer encoder, m̂_i, to a linear layer whose output size is |R|. A softmax is applied, with the non-permissible roles masked out, yielding a distribution over the roles designated by the given event type, plus the non-argument role:

    P(r | m_i, t, τ) = softmax_{r ∈ R(τ) ∪ {ε}}(W m̂_i + b)    (5)

The model is then trained with a cross-entropy loss to maximize the likelihood of the gold role assignments.

[Figure 1: an example document with an event trigger and cross-sentence candidate arguments, including the sentences "It showed footage of ambulances arriving at the Kilis State hospital …", ""They hit the school," wailed a Syrian woman …", and "The Observatory and al-Halaby also reported an air raid on the village of Kaljibrin near Azaz."]

^5 Documents are chunked into segments of at most 512 tokens while respecting sentence boundaries, and each segment is fed to BERT separately.
^6 The width embeddings in Lee et al. (2017) are not used.
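The learned attention pooling over a span can be sketched without any deep-learning framework; this is a minimal illustration of the idea (boundary vectors concatenated with an attention-weighted sum), with hypothetical names, not the paper's implementation:

```python
import math

def span_embedding(token_vecs, q):
    """Pool a span into [w_left; w_right; attention-weighted sum].

    token_vecs: token vectors for one span (equal-length lists of floats).
    q: global attention query vector of the same dimension.
    """
    # Attention score for each token is its dot product with q
    scores = [sum(qi * wi for qi, wi in zip(q, w)) for w in token_vecs]
    # Numerically stable softmax over the span's tokens
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    attn = [e / z for e in exps]
    # Weighted sum of token vectors under the attention weights
    dim = len(q)
    pooled = [sum(a * w[d] for a, w in zip(attn, token_vecs)) for d in range(dim)]
    # Concatenate left boundary, right boundary, and pooled representation
    return token_vecs[0] + token_vecs[-1] + pooled
```

With a zero query vector the attention weights are uniform, so the pooled part reduces to the mean of the span's token vectors.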

Experiments
As we draw connections between SLU in dialogue systems and argument linking in information extraction, we focus primarily on evaluating the model on a discourse-level dataset, RAMS (Ebner et al., 2020).

Baseline  Aside from joint modeling of arguments, we also include an independent model as an ablation (our proposed method is labeled joint). The independent model removes the Transformer encoder (cf. Equation 4) and instead applies a feed-forward neural network directly atop the trigger representation and each argument representation to classify the role (or absence thereof) of the argument with respect to the event trigger. The results from this model show the difference between the proposed joint argument modeling approach and a simpler, independent model.

Metrics  We use precision, recall, and F1-score as metrics. A link between a trigger and an argument is considered correct if and only if the predicted argument span offsets and role match the gold reference. We report micro-averaged F1 across roles.
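The link-level metric described above can be sketched as follows, treating each link as a tuple of trigger, span offsets, and role (the tuple layout here is an assumption for illustration):

```python
def micro_prf(gold_links, pred_links):
    """Micro-averaged precision/recall/F1 over argument links.

    Each link is a hashable tuple, e.g. (trigger_id, start, end, role).
    A predicted link counts as a true positive only if the span offsets
    AND the role exactly match a gold link; pooling all links across
    roles gives the micro-average.
    """
    gold, pred = set(gold_links), set(pred_links)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1
```

Note that a link with the correct span but the wrong role contributes both a false positive and a false negative under this definition.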

ACE 2005
We use ACE 2005 as a sanity check for our discourse-context model, to verify its ability to perform sentence-context extraction. We follow Lin et al. (2020)'s pre-processing and dataset splits for the event extraction task (see Table 2 for statistics). Table 3 reports the experimental results on ACE 2005. Although the results are not directly comparable, since our model has access to gold trigger/argument spans (Lin et al. (2020) do not), we observe similar levels of performance, suggesting our method may be competitive when applied to event understanding beyond sentence boundaries.

RAMS
Roles Across Multiple Sentences (RAMS; Ebner et al., 2020) is an event extraction dataset that considers discourse-level, non-local arguments in document-level context. We follow the train/dev/test split provided with the dataset, with statistics shown in Table 2. The experimental setup follows the configuration employed for ACE 2005. Table 4 shows the performance of our models on RAMS. Following the same conditions as Ebner et al. (2020), our joint model outperforms that work, and our independent baseline, by a substantial margin of 6.6%, illustrating the benefit of modeling potential arguments jointly. We also analyze the performance of our model on non-local arguments, i.e., arguments that are not in the same sentence as the event trigger (Table 5). Our model's performance on non-local arguments is on par with its performance on local arguments, demonstrating its ability to handle non-local argument linking.
Case study  Here we show one example where the joint model performs better than the independent model. The joint model correctly labeled all the roles, while the independent model failed on two. We hypothesize that joint modeling of the arguments helps avoid cases where multiple spans are labeled with the same role.
… Stratfor analyst Sim Tack: "This was indeed an Islamic State attack, rather than an accidental explosion." New satellite imagery appears to reveal extensive damage to a strategically significant airbase in central Syria used by Russian forces …

Conclusion
We proposed a joint modeling approach for argument linking that considers the interdependent relationships among argument mentions conditioned on a specific event. Our approach extends recent work in dialogue systems, viewing a document as essentially a one-sided discourse in which event arguments are analogous to slots that may carry over across utterances. Experimental results show our approach achieves superior performance on a recently introduced dataset for modeling discourse-level contexts.