BERTering RAMS: What and How Much does BERT Already Know About Event Arguments? - A Study on the RAMS Dataset

Using the attention map based probing framework from (Clark et al., 2019), we observe that, on the RAMS dataset (Ebner et al., 2020), BERT’s attention heads have modest but well above-chance ability to spot event arguments sans any training or domain finetuning, varying from a low of 17.77% for Place to a high of 51.61% for Artifact. Next, we find that linear combinations of these heads, estimated with approx. 11% of available total event argument detection supervision, can push performance well higher for some roles — highest two being Victim (68.29% Accuracy) and Artifact (58.82% Accuracy). Furthermore, we investigate how well our methods do for cross-sentence event arguments. We propose a procedure to isolate “best heads” for cross-sentence argument detection separately of those for intra-sentence arguments. The heads thus estimated have superior cross-sentence performance compared to their jointly estimated equivalents, albeit only under the unrealistic assumption that we already know the argument is present in another sentence. Lastly, we seek to isolate to what extent our numbers stem from lexical frequency based associations between gold arguments and roles. We propose NONCE, a scheme to create adversarial test examples by replacing gold arguments with randomly generated “nonce” words. We find that learnt linear combinations are robust to NONCE, though individual best heads can be more sensitive.


Introduction
The NLP representation paradigm has undergone a drastic change in this decade -moving from lin-1 Refer to Figure 1 of that paper for an example illustrating four role names. Since these role names are human readable and intuitively named, we refer to them without elaboration. 2 We use map to refer to the per-example word-word activations at a particular layer-head, while head refers either to the identity of the particular layer-head. We ground these terms more clearly in §2.1 guistic/task motivated 0-1 feature families to perword-type pretrained vectors (Pennington et al., 2014) to contextual embeddings (Peters et al., 2018).
Contextual embeddings (CEs) produce incontext representations for each token -the representation framework being a large, pretrained encoder with per-token outputs. The typical procedure to use CEs for a downstream task is to add one or more task layers atop each token, or for a designated token per-sentence, depending on the nature of the task.
The task layers (and optionally, the representation) are then "finetuned" using a task specific loss, albeit with a slower training rate than would be used for from-scratch training. ELMo (Peters et al., 2018) was an early CE. The three-fold recipe of a transformer based architecture, masked language modelling objective and large pre-training corpora, starting with BERT (Devlin et al., 2018) led to CEs which were vastly effective for most tasks.
The strong performance of contextual representations with just shallow task layers and minimal finetuning drove the urge to understand what and how much these models already knew about aspects of syntax and semantics. The study of methods and analysis to do this has come to be called probing. Besides "explaining" CE featurization, probing can aid in finding lacunae to be addressed by future representations. Linzen et al. (2016), one of the early works on probing, evaluated whether language models could predict the correct verb form agreeing with the noun. Marvin and Linzen (2018) generalized this approach beyond single-word gaps with a larger suite of "minimal pairs". They also control for lexical confounding and expand the probing to new aspects such as reflexive anaphora and NPIs. Gulordava et al. (2018) evaluate subject-verb agreement but only through "nonce" sentences to con-trol for both lexical confounding and memorization 3 . Lakretz et al. (2019) isolate units of LSTM language models whose activations closely track verb-noun number agreement, particularly for hard, long-distance cases. Clark et al. (2019), whose probing methods we adopt, examine if BERT attention heads capture dependency structure.
In this work, we probe what and how much a pretrained BERT representation already knows about event roles and their arguments. Understanding how well event arguments are represented can be a first foray into understanding other aspects about events. Extraction of event arguments is often a prerequisite for more complex event tasks. Some examples are event coreference (Lu and Ng, 2018), detecting event-event temporal (Vashishtha et al., 2019) and causal relations (Dunietz et al., 2017), sub-event structure (Araki et al., 2014) and generating approximate causal paths (Kang et al., 2017). Tuples of event-type and arguments are one way of inducing script like-structures (Chambers and Jurafsky, 2008). In summary, our work makes the following contributions: 1. We show that there always exists a BERT attention head (BESTHEAD) with above-chance ability to detect arguments, for a given event role. We also show that this ability is even stronger through learnt linear combinations The Transformer architecture (Vaswani et al., 2017) consists of |L| layers, each comprised of |H| > 1 "self-attention" heads. Here, we describe the architecture just enough to ground terminology -we defer to the original work for detailed exposition. In a given layer l 4 , a single self-attention head h consists of three steps -First, query, key and value projections T e i are computed from the previous layer's token embedding e i . Then, softmax normal- Henceforth, we refer to the parameter tuple {Q h,l , K h,l , V h,l } , uniquely identified by h ∈ {0, 1, . . . |H| − 1}, l ∈ {0, 1, . . . |L| − 1} as the "attention head" or simply "head", while values α h,l ij are collectively referred to as the "attention map".

BERT
BERT uses a Transformer architecture with 12 heads and 12 layers 5 . It comes with an associated BPE tokenizer (Sennrich et al., 2015) which tokenizes raw inputs to subwords. Its vocabulary contains three special tokens -[CLS], [SEP] and [MASK]. While [CLS] and [SEP] serve as start and end (or sequence-separator) tokens, [MASK] is used in pretraining as described next.
BERT follows a two-stage pretraining process. In the first stage, also known as masked language modelling (MLM), randomly selected token positions are replaced with [MASK]. The task is to predict the true identities of words at these positions, given the sequence. This stage uses single sentences as training examples. In the second stage, also known as next sentence prediction (NSP), the Figure 1: In this example, the head chosen by BESTHEAD for the PLACE role, correctly picks out the argument "Syria" for the trigger "capitulation". Attention probabilities are shown as blue lines from trigger token to other tokens, with boldness indicating magnitude. It manages to evade distractor pronouns (there) and other geographical entity names (Russia and United States). The text above flows in right to left direction. The full text reads: "Chances of intentional conflict are real as is the possibility of an unintended clash escalating . At the same time, Syria is not essential to the national security of Russia or the United States. It is not without importance but a defeat or capitulation there will not change the balance of power between them at all . . . " model is given a pair of sentences (separated by [SEP]), with the task being to predict whether these were truly consecutive or not.
Unless otherwise mentioned, we use the bertbase-uncased model. We use the implementation of BERT from HuggingFace. 6 (Wolf et al., 2019)

Dataset
We use the recently released RAMS dataset (Ebner et al., 2020) for all our experiments. The reasons for using this particular dataset for our analysis are • It has a wide mix of reasonably frequent roles (represented well across splits) from different kinds of frames . Discussion on non-frequent roles can be found in §3.6.
• For many roles, it has examples with the gold arguments being in a different sentence from the event trigger. This makes it easy to probe for intra-sentence and cross-sentence argument extraction in the same set of experiments. Analysis of cross-sentence performance can be found in §3.3 and §3.4 We note that the dataset is in English (Bender and Friedman, 2018) and observations made may not generalize to other languages.

Setup
For example x, we refer to the event, role, gold argument and document as e, r, a and D. D is an 6 github.com/huggingface/transformers/ ordered sequence of tokens {w 0 , w 1 . . . w |D|−1 }. i e denotes the event trigger index 7 . We use the layer index l and head indices 0 to |H| − 1 to index the respective head's attention distribution from token i to all other tokens j ∈ D at index i e .
Note, however that there exist a complementary set of attention values from each token j to the token i e . To use a unified indexing scheme to refer to these values, we use negative indices from −1 to −|H| as their head indices. Since these values come from attention-head activations of different positions, they need to be renormalized to use them as probabilities.

Words and Subwords
Our above framework assumed that the attention maps are between whole word tokens. However, BERT represents a sentence as a sequence of BPEsubwords at every level, including for the attention maps.
We use the quite intuitive approach described in Section 4.1 of (Clark et al., 2019) -incoming attentions to constitutent subwords of a word are added to get the attention to that word. Outgoing attention values from constituent subwords are averaged to get the outgoing attention value from the word.
Note that the above operations precede the probability computations in Equations 1 and 2.

Dataset Splits
We follow the practice of earlier probing works such as (Sorodoc et al., 2020) and (Linzen et al., 2016)

Evaluation Measure
For a given event e and role r, we define a predicted argument token index a to be accurate if it corresponds to any of the tokens in the gold argument span [a beg r,e , a end r,e ]. This is described formally in Equation 3. I stands for the 0-1 indicator function.
Acc e,r,a ( a) = I(a beg r,e ≤ a < a end r,e ) Typical measures of argument extraction differ from the one we use, being span-based. Given the limitations of our probing approaches, we lack a clear mechanism of predicting multi-word spans, and can only predict likely single tokens for the argument, which led us to choose this measure 8 .

BESTHEAD
Let X = {e m , r m , a m } m=M m=1 be the training set. X r is the subset of training examples with r m = r. For each role r, BESTHEAD selects the head {l, h} best (r) with best aggregate accuracy on X r . Other than one pass over the training set for comparing aggregate accuracies of heads for each role, there is no learning required for this method. At test-time, based on the test role, the respective best head is used to predict the argument token.

LINEAR
The LINEAR model learns a weighted linear combination of all |L| × |H| × 2 head distributions (twice for the "from" and "to" heads).
Note that gradients are not backpropagated into BERT -only the linear layer parameters w l,h , B are updated during backpropagation. This formulation is the same as the one in (Clark et al., 2019). For our loss function, we use the KL Divergence KL( P ||P ) between the predicted distribution over document tokens P and the gold distribution over document tokens P . For the gold distribution over arg tokens, the probability mass is equally distributed tokens in the argument span, with zero mass on the other tokens.

RAND
The expected accuracy of following the strategy of randomly picking any token i from the document D as the argument (other than the trigger word i e itself). For a given role r with a gold argument a r,e of length |a r,e |, this equals |ar,e| |D|−1 .

SENTONLY
The expected accuracy of following the strategy of randomly picking any token from the same sentence S e as the argument, save the event trigger itself. This is motivated by the intuition that event arguments mostly lie in-sentence. This equals |ar,e| |Se|−1

NONCE procedure
We wish to isolate how much of the heads performance is due to memorized "world knowledge" and typical lexical associations e.g Russia would typically always be a PLACE or TARGET. Recent works have shown that BERT does retain such associations, including for first names (Shwartz et al., 2020), and enough so that it can act as a reasonable knowledge base (Petroni et al., 2019).
One way of implementing this is to create perturbed test examples where gold arguments are replaced with synthetically created "nonce" words not necessarily related to the context. This is similar to the approach of (Gulordava et al., 2018).
• Each gold argument token is replaced by a randomly generated token with the same number of characters as the original string.
• Stop words such as determiners, pronouns, and conjunctions are left unaltered, though they might be a part of the argument span.
• We also ensure that the shape of the original argument, i.e the profile of case, digit vs letter is maintained 9 . e.g Russia-15 can be randomly replaced by Vanjia-24, which has the same shape Xxxx-dd.
• Note that we do not take pronounceability of the nonce word into account. Though this could arguably be a relevant invariant to maintain, we were not sure of an apt way to enforce it automatically.
• We also note that BERT may end up using a likely larger number of subword tokens to replace the nonce words than it would use for the gold argument token. Since these are essentially randomly composed tokens, they can contain subwords which are rarely seen in vocabulary tokens.
We refer to this procedure as NONCE, and overloading the term, the test set so created as the NONCE test set.

Spotting the Best Head
In Table 2, we record the accuracies and layer positions of best heads for the 15 most frequent roles.
1. BESTHEAD always has higher accuracy than the RAND and SENT baselines. 2. 5 of the 15 roles can be identified with 40%+ accuracy -the highest being COMMUNICA-TOR , at 51.61%.  3. The best head for arguments which are not together present in frames is often the same. For instance, Layer 8, Head 10 is the best head for TRANSPORTER, ATTACKER, COMMUNICA-TOR and BENEFICIARY. 4. Most best heads are located in the higher layers, specifically the 7th, 8th or 9th layers. An exception are the best head for DESTINATION and ARTIFACT roles, located in the 0th layer and 4th layers respectively. 5. Place roles are the hardest to identify, with an accuracy of 17.77%. 6. Layer 8, Head 10 seems to be doing a lot of the heavylifting. For 7 out of 15 roles, this is the best head. This shows that it is quite "overworked" in terms of the number of roles it tracks. Furthermore, though some of these role pairs are from different frames (e.g see Point 3 above), some aren't, e.g GIVER and BENEFICIARY. In such cases, atleast one of the two arguments predicted for these two roles is sure to be inaccurate -e.g the head would point to either the GIVER or BENEFI-CIARY, but not both. 10 7. Most of the best heads for roles are "from" heads rather than "to" heads, apart from those for ORIGIN, ATTACKER and ARTIFACT.  While DESTINATION and PLACE do see increases in LINEAR compared to BESTHEAD, it could be the case that none of the individual heads are particularly good at capturing cross-sentence arguments for the other three roles, while the best head is already good enough to capture the intrasentence case. This would make LINEAR not any more rich as a hypothesis space compared to BEST-HEAD -causing the similar or slightly worse accuracy. In §3.3 we dig deeper into the aspect of cross-sentence performance.

Cross-Sentence Performance
From Table 4, we observe that both BESTHEAD and LINEAR performance degrades in the crosssentence case i.e when "trigger sentence" and "gold argument sentence" differ. Three potential reasons: 1. There are too few instances of cross-sentence event arguments in the small supervised set we use. Furthermore, even if there are a sufficient quantum of cross-sentence event arguments, these form a much smaller proportion of the total instances in comparison to the intra-sentence instances. 2. Because the limited number of attention heads are already dominated by intra-sentence aspects such as dependency relations, punctuation and subject-verb agreement (Clark et   2019), it is difficult for a single attention head to have a higher value for outside sentence tokens compared to in-sentence ones. 3. Different heads might be best for intra and cross-sentence performance, and finding one best head for both could be sub-optimal.

Cross Sentence Occlusion (CSO)
Motivated by the above reasons, we devise a procedure which we refer to as cross-sentence occlusion (CSO). Since Reason 1 is a property of the data distribution, we attempt to alleviate Reasons 2 and 3. To address Reason 3, we try to learn a different head (combination) for the cross-sentence case. To address Reason 2, while finding the best cross-sentence head, we zero-mask out the attention values corresponding to in-sentence 11 tokens and re-normalize the probability distribution. In practice, one would not be able to use two separate argument detectors for the intra and crosssentence cases for the same role, since groundtruth information of whether the argument is crosssentence would be unavailable. We assume this contrived setting only to allow easy analysis 12 , and to gloss over the lack of an intuitive zero-shot mechanism of switching between the two cases, when predicting arguments using just attention heads.

+CSO Results
From Table 4, we can observe the improvement in cross-sentence test accuracy when using the +CSO approach over its simple counterpart, both for BESTHEAD and LINEAR. The only exception to this is the ORIGIN role, where LINEAR betters LINEAR-CSO.
For the INSTRUMENT role, both BEST-HEAD+CSO and LINEAR+CSO get close to 50% 11 RAMS comes with a given sentence segmentation. 12 And also so that we can validate our diagnosis for poor cross-sentence performance in §3.3 accuracy. In part, their relatively stronger performance can be explained by BESTHEAD and LINEAR already being relatively better at detecting cross-sentence INSTRUMENT (just above 20%, but higher than the sub-15 accuracies on the other roles). Nevertheless, CSO still leads to a doubling of accuracies for both approaches.
We highlight here again that these numbers are only on that subset of the test set where we know that the gold arguments are located in other sentences -though this setting is useful for analysis, a model actually solving this task won't have access to this information.
Even in our case, there is no obvious way to have a consolidated probe which uses a LIN-EAR+CSO and LINEAR component together, since this would require learning an additional component which predicts whether the gold arguments lie intra-sentence or across-sentence.

Effect of NONCE
In Figures 2a and 2b, we compare the performances of our methods on perturbations of the test set created using the NONCE procedure outlined in §2.5.3 with their normal test performance. Since NONCE is stochaistic, corresponding results are averaged over NONCE sets created with 5 different seeds. BESTHEAD test performance is more sensitive to NONCE than LINEAR. Especially for INSTRU-MENT, ARTIFACT and ORIGIN, the decrease in accuracy is quite drastic. Surprisingly, we also see increases for 4 of the 15 roles -DEFENDANT, GIVER, VICTIM and PARTICIPANT. All other roles see small decreases. For LINEAR, however, most roles are largely unmoved by NONCE, showing that LINEAR relies less on lexical associations.

Non-Frequent Roles
So far, we've focussed on analyzing the 15 most frequent roles. In this subsection, we also evaluate our approaches for some non-frequent roles outside this set, such as PREVENTER and PROSECUTOR. The results are presented in Table 5. Note that, owing to high sparsity for these roles, these results should be taken with "a pinch of salt" (which is why we chose to separate them out from the frequent roles).
For the frequent roles, we had seen that LINEAR was mostly better than, or equally good as BEST-HEAD. For the non-frequent roles, we see that the comparative performance of LINEAR vs BEST-HEAD varies a lot more -LINEAR is better for 6 of the 11 roles, and worse for the other 5. The fall in LINEAR performance is largest for PROSECUTOR (58.33 → 16.67).
We conjecture that this drop is due to poor generalization as a result of learning from lesser supervision as a result of the roles being non-frequent. Since BESTHEAD has only two parameters (identity of the best head) compared to the 289 parameters of LINEAR, the latter is more sensitive to this problem.
Secondly, we notice that the gap between BEST-HEAD and the RAND and SENTONLY baselines is much narrower. For VEHICLE and MONEY, SEN-TONLY even outdoes BESTHEAD. For VEHICLE, the BESTHEAD accuracy even drops to 0. However, in all these cases, we find that LINEAR still manages to outdo both baselines. We conjecture that these cases could be due to the best head predicted not being very generalizable due to small training set size (for that role). Though LINEAR would also suffer from poor generalization in this case, it might stand its ground better since it relies on multiple heads rather than just one.

Cased vs Uncased
In our analysis so far, we have been using the same contextual embedding mechanism through-  Table 5: Test accuracies using all the baselines and probe approaches described in § §2.4 for some non frequent roles. Both BESTHEAD and LINEAR probes still outdo the baselines in most cases, but not as convincingly as for frequent roles. Unlike the frequent roles case, LINEAR actually does worse than BESTHEAD for many roles.
out, namely bert-base-uncased. In Figure 3a, we plot the difference of BESTHEAD test accuracies when using bert-base-cased vs bert-base-uncased.
We can see that bert-base-uncased is better for most roles -except for Attacker, Victim and Artifact. We also notice that the best layer-head configuration {l best , h best } is mostly not preserved between the bert-base-cased and bert-base-uncased scenarios. The difference between bert-base-uncased and bert-base-cased is even more drastic in the cross sentence only experiment , for instance, while there exists a single head which can find cross-sentence Instrument args with 37% accuracy, the best such head of bert-base-cased has only 17% accuracy.

Qualitative Examples
In Figure 3.8, we illustrate some examples of BEST-HEAD identifying arguments. We defer further discussion to Appendix §A owing to lack of space.

Related Work
A complete description of the large body of work on probing is beyond the scope of this paper. Besides those discussed earlier, other aspects studied include filler-gap dependencies (Wilcox et al., 2018), function word comprehension (Kim et al., 2019), sentence-level properties (Adi et al., 2016) and negative polarity items (Warstadt et al., 2019).
Probing is not limited to examining pairwise word relations or sentence properties. Hewitt and Manning (2019) find that BERT token representations are linearly projectable into a space where they embed constituency structure. Recently, Sorodoc et al. (2020) probed transformer based language models for coreference. However, they restrict themselves to entity coreference. Further- Hewitt and Liang (2019) raised a note of caution about classifier based probes, pointing out that probes themselves could be rich enough to learn certain phenomena even with random representations. We avoid direct classifier-based probing, thus avoiding the mentioned pitfalls.

Conclusion
We showed how BERT's attention heads have modest but well above chance ability to detect arguments for event roles. This ability is achievable either with only i) 2 parameters per role (BESTHEAD) ii) 289 parameters per role (LINEAR). Furthermore, the supervision required to reach this is just ≈ 11% of full training set size. Secondly, we propose a method to learning separate heads (combinations) for cross-sentence argument detection. Our experiments show that the heads so learnt have higher cross-sentence accuracy. Thirdly, we show that LINEAR performance is robust to a perturbed NONCE test setting with weakened lexical associations. In future, we plan to extend our probing to other event aspects like coreference and subevents.  Figure 4: In (a), BESTHEAD correctly picks out the TARGET of "airstrike" as "Yemen". In (b), BESTHEAD correctly picks out the RECIPIENT of "advised" as "companies". In (c), the token picked is coreferent but not identical to the gold argument. Attentions are shown as blue lines from trigger token, with lineweight ∝ value. Gold arguments are shaded green .