Relation Classification with Cognitive Attention Supervision

Many current language models such as BERT utilize attention mechanisms to transform sequence representations. We ask whether we can influence BERT's attention with human reading patterns by using eye-tracking and brain imaging data. We fine-tune BERT for relation extraction with an auxiliary objective that supervises BERT's attention weights with cognitive data. Across a variety of metrics, we find that this attention supervision increases the similarity between model attention distributions over sequences and the cognitive data without significantly affecting classification performance, while leading the models to make errors distinct from the baseline's. In particular, models with cognitive attention supervision more often correctly classified samples misclassified by the baseline.


Introduction
For humans, the task of determining semantic relationships may entail complicated inference based on concepts' contexts (Yee and Thompson-Schill, 2016; Zhang et al., 2020) and commonsense knowledge (e.g., causal relations; Chiang et al., 2021), and for labeling relations between entities in texts the task may depend on the genre of the text (e.g., biomedical, biographical) and constraints indicated by annotator instructions (Mohammad, 2016). The advent of crowdsourcing for machine learning approaches to Natural Language Processing (NLP) creates challenges in collecting high-quality annotations (Ramírez et al., 2020). A platform such as Amazon Mechanical Turk (MTurk) allows accessible, sophisticated task design (Stewart et al., 2017), but defaults to simple templates for NLP tasks and is susceptible to self-selection bias (raters may not represent the population) and to social desirability bias or demand effects, where judges seek to confirm the inferred hypotheses of experimenters (Antin and Shaw, 2012; Mummolo and Peterson, 2019; Aguinis et al., 2020). Cognitive research has shown that self-reports are frequently inaccurate (Vraga et al., 2016) and that subjects are unable to effectively introspect about or recall their eye movements during reading (Võ et al., 2016; Clarke et al., 2017; Kok et al., 2017). This encourages the use of precise, objective recordings of non-conscious language processing behavior as model training data, rather than relying solely on reader annotations. As prior work has emphasized, when reading, humans produce reliable patterns that can be recorded, such as gaze trajectories and brain activity. These signals can associate linguistic features with cognitive processing and subsequently be applied to NLP tasks. The recording of eye movements during reading can be traced to psychology and physiology in the late 1800s (Wade, 2010), but the use of eye-tracking data in NLP is a relatively recent development (Mishra and Bhattacharyya, 2018). Brain data has a longstanding relationship with language processing and in recent years has been investigated with NLP models (Schwartz et al., 2019), leveraged notably by Mitchell et al. (2008) to predict fMRI activity for novel nouns.
The working intuition in recent NLP studies using cognitive data is that signals produced by humans during naturalistic reading can be leveraged by artificial neural networks to induce human-like biases and potentially improve performance on natural language tasks. For example, recognizing and relating entities while reading sentences might elicit patterns of activation or particular gaze behaviors in human readers, which can be transferred to and recovered by models given the same text sequences as inputs. Models might then generalize the learned biases to similar text inputs. One route for augmenting neural networks with cognitive data is to regularize attention, for example with eye-tracking (ET) data (Barrett et al., 2018) and/or electroencephalography (EEG) data (Muttenthaler et al., 2020).

Example phrases and their relation labels:

Phrase | Relation
<e> ford </e> became an engineer with the <e> edison illuminating company </e> | Employer
<e> ford </e> became an <e> engineer </e> | Job Title
<e> ford </e> was born on a prosperous farm in <e> springwells township </e> | Birthplace
<e> mary litogot </e> ( c1839-1876 ) , immigrants from <e> county cork </e> | Visited


Related Work

Mathias et al. (2020) describe the key terms used in gaze behavior studies; eye-tracking appears to be the more robust and proven measurement modality for augmenting machine learning models. In particular, fixations are the eyes' focused pauses on Areas of Interest (AOIs); saccades are rapid movements from one point to another. These movements can be progressive or regressive, moving to later or earlier AOIs (e.g., the words in a sentence), and occur on the order of milliseconds. Some studies combine the indirect signals of ET with EEG data, moving beyond inferences based on eye-screen positioning (e.g., that content words are more likely to be fixated upon, and that unfamiliar words have longer fixation durations). In general, EEG provides high temporal resolution but, due to interference from the scalp, exhibits poorer spatial resolution than other brain imaging methods such as magnetoencephalography (MEG). To understand the cognitive processes involved in, e.g., longer fixation durations, EEG can complement ET: larger amplitudes for event-related potentials (ERPs) such as the N400 correspond to less frequent or less predictable words and to semantic processing (Frank et al., 2015).

1 https://huggingface.co/bert-base-uncased
2 https://osf.io/2urht/
A number of studies have applied cognitive data to NLP tasks, among them sentiment analysis (Mishra et al., 2016), part-of-speech (POS) tagging, and named entity recognition (NER). Gaze and brain data have also been applied together to a suite of NLP tasks, including relation classification. For sentiment analysis, multi-task learning (MTL) has been used with a bidirectional Long Short-Term Memory (biLSTM) network, learning gaze behavior as the auxiliary task. Malmaud et al. (2020) predict ET data with a variant of BERT as an auxiliary to question answering. Bautista and Naval (2020) predict gaze features with an LSTM to evaluate on sentiment classification and NER tasks. Barrett et al. (2018) supervise model attention with ET data by adding an attention loss to the main classification loss, so the model jointly learns a sentence classification task and the auxiliary task of attending more to the tokens on which humans typically focus. Muttenthaler et al. (2020) follow this paradigm using EEG data.
A number of studies impose schemata or mechanisms to encourage BERT to learn more structured relation classification (RC) representations: Soares et al. (2019) fine-tune BERT for RC, experimenting with additional special entity tokens from BERT's final hidden states to represent relations, rather than the last layer's classification token, [CLS]; the [CLS] token is conventionally used as the sentence representation for tasks such as classification (Devlin et al., 2019), as well as for attention analysis (Clark et al., 2019). For joint entity and relation extraction, Xue et al. (2019)

Data
Hollenstein et al. (2018) created ZuCo, a corpus of ET and EEG recordings in which 12 adult subjects (fluent English speakers) read full sentences at their own speed, with brain recordings synchronized to eye fixations. The corpus sentences are written English: 400 review excerpts from the Stanford Sentiment Treebank (Socher et al., 2013) and 707 biographical sentences from a Wikipedia relation extraction dataset (Culotta et al., 2006). In this work we use a subset of 300 relation sentences (7,737 tokens), divided into 566 phrases to encompass the multiple binary relation statements and annotated with markers around entity mentions. The dataset uses 11 relation types, as seen in Table 2. For ET we had access to five features for each word, including first fixation duration (FFD), gaze duration (the sum of fixations), and total reading time (TRT: the sum of the word's fixations including regressions to it). The EEG features we use are the 105 electrode values mapped to first-pass fixation onsets to create fixation-related potentials (FRPs), so that each word has 105 values. We average ET and EEG values over all subjects, which has been shown to reduce variability of results (Hollenstein et al., 2020) and overfitting. To obtain a single ET value for each token, Barrett et al. (2018) used the mean fixation duration (MFD), obtained by dividing TRT by the number of fixations. There is no best practice to our knowledge, and in this study we use TRT as a proxy for overall attention.


Method

We fine-tune BERT for RC, using the final layer's [CLS] token as the sequence representation for classification and training with categorical cross-entropy loss over batches of M samples:

L_RC = -Σ_{j=1}^{M} log a_{t_j}

where a_{t_j} is the t_j-th value of the softmax of sample j's C prediction scores φ(ŷ_j), and t_j is the index of sample j's true relation class. We additionally compute auxiliary attention losses. BERT takes an input of sequence hidden states ∈ R^{N×d} (N tokens, d = 768 features) and uses 12 attention heads at each layer to create 12 token-token attention weight matrices ∈ R^{N×N}. Specifically, each of these matrices has a row for every token in the sequence: a distribution of N attention weights, where each scalar weight corresponds to that token's similarity to a token in the sequence. The resulting matrices are multiplied with the input to transform the tokens' features and produce a context matrix ∈ R^{N×d}. Each token's context vector c contains a blend of features from the sequence's tokens: each feature of c is a weighted sum dominated by that feature's values from the tokens most attended to by c. For instance, the features in the context vector for [CLS] will reflect the features of the tokens given highest attention by [CLS], with the features of lower-weighted tokens scaled down and contributing minimally.
These operations are founded on the conception of attention emerging from relationships between tokens in sequence contexts, or the notion of each token attending to the others, and the computations occur in the subspaces of the heads' attention weights: this is incompatible with the concept of a single abstracted human reading a displayed word sequence. Therefore, to intervene on the production of contextualized model representations using the ZuCo data as proxies for attention, we seek a single distribution of weights from the multiple token-token attention matrices for a given sequence, analogous to the competitive attention given by a human reader. Because [CLS] serves as the sequence representation used for classification, we take from each matrix the row of weights accorded by [CLS], resulting in 12 vectors, treating [CLS] as our model reader. We average these vectors along the head axis to obtain a [CLS]-token vector α ∈ R^{1×N} of attention weights. This aggregate is supervised during training: in this way, each independent representation subspace (head) is informed by the human values, influencing the features of the sequence representation used for the RC task.
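For illustration, this aggregation can be sketched with the Hugging Face transformers interface (a minimal sketch assuming bert-base-uncased and the final encoder layer; variable names are ours, not released code):

```python
# Minimal sketch: extract the [CLS]-assigned attention row from each head of the
# final layer and average over heads to obtain a single distribution per sequence.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("ford became an engineer", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

last_layer = outputs.attentions[-1]    # (batch, heads=12, N, N)
cls_rows = last_layer[:, :, 0, :]      # weights given by [CLS]: (batch, 12, N)
alpha = cls_rows.mean(dim=1)           # average over heads: (batch, N); each row sums to 1
```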
We then obtain human scores for the sequence tokens. Previous studies used "type-aggregated" cognitive data, where values are averaged over corpus word occurrences to obtain an aggregated value for that word type. This method exchanges specific sample contexts for the ability to synthesize distributions for samples not in the original data through type lexicon queries, using 0 for unknown word types. For relation extraction, Hollenstein et al. (2019) previously discretized and binned ZuCo features, which were then used in an auxiliary task. To preserve context, we extract from ZuCo the raw ET and EEG values for each sample without type-aggregating, so that ZuCo coverage of tokens in the samples is complete: every token has a ZuCo value, excluding special model tokens, which are assigned zeros.
Because BERT uses subword tokenization, to allow matching entries to be found in the ZuCo word-level data we split the ZuCo words into BERT tokens, evenly dividing values between the subword pieces (e.g., "delicacy" → "del", "##ica", "##cy", each piece allotted a third of the ZuCo value), a technique used by Malmaud et al. (2020). We preserve the entity markers "<e>" and "</e>" in each sample by adding them as special tokens to the BERT tokenizer so that their embeddings are learned with the other tokens during fine-tuning. The human ET and EEG token values z_ET and z_EEG are passed through a softmax to obtain two distributions over sequences, the vectors α_ET and α_EEG. ET features such as TRT, measured in milliseconds, are much larger than the small EEG microvoltages (µV), so the raw ET values' softmax output α_ET would be much peakier than α_EEG, providing an extremely low-entropy signal in which weights are forced onto one or two tokens. To combat this, we reduce each ET token value by dividing it by the maximum value for its sequence (Eq. 3) before the softmax, returning the softmax output α_ET. Each sequence thereby has a context-specific distribution, reflecting the averaged responses of the human subjects.

Figure 1: Plots of baseline and attention-supervised model attentions against ZuCo ET and EEG values where the baseline is correct and the attention-supervised model is incorrect. Note that piece attentions are combined (e.g., "may": 0.1004 + "##sville": 0.0939 → "maysville": 0.1943). The ET+EEG model in the top plot was influenced to emphasize the location "maysville" alongside "born in" and predicts "Visited" rather than the correct "Birthplace", whereas the baseline places relatively more emphasis on "she was" and "born". At bottom, the baseline attends strongly to "died" whereas the ET+EEG model has learned a more uniform attention distribution.

Following other studies that implemented attention supervision (Qiuxia et al., 2020; Sharan et al., 2019; Sood et al., 2020), we compute attention losses based on the Kullback-Leibler divergence (D_KL; footnote 4) from the aggregate model attention weights α to the human weights α_ET and α_EEG. We do so for each sequence j in batches of size M for each modality, obtaining the eye-tracking loss L_ET and the EEG loss L_EEG. By toggling binary coefficients λ, one or both losses are added to the RC categorical cross-entropy loss to give the overall multi-task fine-tuning loss:

L_MTL = L_RC + λ_ET L_ET + λ_EEG L_EEG
4 For this computation, zeros are set to 1e-12.
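As a concrete sketch of these losses (an illustrative reconstruction under our assumptions, not released code; the argument order of the KL divergence is one possible reading of the text):

```python
# Illustrative sketch of the auxiliary attention losses and overall MTL loss.
# Shapes: alpha_model (M, N) aggregate [CLS] attention; z_et, z_eeg (M, N) ZuCo
# token values; logits (M, C); labels (M,).
import torch
import torch.nn.functional as F

EPS = 1e-12  # zeros are set to 1e-12 for the KL computation (footnote 4)

def human_distribution(z, is_et):
    """Turn raw ZuCo token values into an attention-like distribution per sequence."""
    if is_et:
        z = z / z.max(dim=-1, keepdim=True).values  # max-normalize raw ET values (Eq. 3)
    return F.softmax(z, dim=-1)

def attention_loss(alpha_model, z_human, is_et):
    """KL divergence between the aggregate model attention and the human distribution."""
    alpha_human = human_distribution(z_human, is_et).clamp(min=EPS)
    alpha_model = alpha_model.clamp(min=EPS)
    return F.kl_div(alpha_model.log(), alpha_human, reduction="batchmean")

def mtl_loss(logits, labels, alpha_model, z_et, z_eeg, lambda_et=1.0, lambda_eeg=1.0):
    """Overall multi-task loss: RC cross-entropy plus toggled attention losses."""
    loss_rc = F.cross_entropy(logits, labels)
    loss_et = lambda_et * attention_loss(alpha_model, z_et, is_et=True)
    loss_eeg = lambda_eeg * attention_loss(alpha_model, z_eeg, is_et=False)
    return loss_rc + loss_et + loss_eeg
```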
Here α_ET,j is the softmax of the max-normalized vector of ET token values z_ET,j for sequence j:

α_ET,j = softmax(z_ET,j / max(z_ET,j))    (3)


Experimental Results

Ablations
We perform ablations comparing base BERT fine-tuned for four runs with arbitrary random seeds and varying combinations of the cognitive data. The baseline used in the ablations is the result of fine-tuning on the ZuCo data without attention supervision. For the ET model, we add only the loss computed from the ET data. For the EEG model we do likewise with the EEG loss, and for the combined ET+EEG model we compute and add both auxiliary losses to the main classification loss. We similarly create random ET, EEG, and ET+EEG models. For the random models, we replace the modality's ZuCo values with values uniformly sampled from the fixed minimum-to-maximum range of the modality's ZuCo values. This should allow us to distinguish the effects of learning regularities in the ZuCo token attention values from the effects of merely constraining the range of magnitudes given by the ZuCo values. After training, we evaluated the final models on the held-out test data of 57 samples. Table 3 shows the evaluation results. Two-sided Pitman's permutation tests (Dror et al., 2018) were performed on final accuracies to assess statistical significance, comparing each of the six models against the baseline. Averaging over four runs, there are no statistically significant differences (p > 0.05) between the baseline and the ET, EEG, and ET+EEG models, or their random counterparts. Figure 2 displays confusion matrices for the models, showing similar per-class results, with some cases where classes with few samples, such as "Deathplace" (19 samples), were classified as more dominant categories, such as "Visited" (144 samples).

Figure 2: Confusion matrices of accuracies averaged over four runs for the baseline and ET+EEG models. Models often misclassified "Deathplace" (roughly 3.5% of the splits' samples) as "Visited" (25%) or "Nationality" (7%) as "Job title" (26%), and "Visited" was occasionally misclassified as "Education" (7%).
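A minimal sketch of the random-value replacement used for the random-supervision ablations (illustrative; array names and shapes are assumptions):

```python
# Sketch: replace a modality's ZuCo token values with values drawn uniformly
# from that modality's fixed observed range.
import numpy as np

rng = np.random.default_rng(0)

def random_modality_values(zuco_values):
    """zuco_values: list of per-sequence arrays of token values for one modality."""
    lo = min(v.min() for v in zuco_values)  # fixed minimum over the modality
    hi = max(v.max() for v in zuco_values)  # fixed maximum over the modality
    return [rng.uniform(lo, hi, size=v.shape) for v in zuco_values]
```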

Attention Similarity
Sen et al. (2020) define a behavioral similarity metric to quantify the extent to which model attentions focus on the same words as human attention; in their work, human attention maps are binary vectors used as the ground truth against which the continuous model attention maps are compared using Area Under the Curve (AUC), a binary classification metric. In a similar vein, to assess whether models learn a generalizable bias in attention, we define a measure of the token overlap between the continuous human and model attention vectors for phrases in the test set. Results of this measure, as well as relative entropies, are shown in Table 4.
We compare a fixed top-k set of tokens per sequence using a variety of k values, with tokens scored by the model attentions after fine-tuning and by the scores given by the human data. We run the models on all splits, using the methods described above to obtain model attentions α, and compute the attention similarity for the test set by pairwise comparison of each model's attentions with the human data. Specifically, as Equation 4 describes, for each model we obtain, for every sample, the sets of token indices and values for the top k attention weights from both the ZuCo distributions and the model attentions α, and divide the cardinality of the sets' intersection by k to obtain an overlap ratio. To factor the salience of the k weights into the similarity, we divide their total weight given by the model by their total ZuCo weight and multiply this percentage, capped at 1.0, with the overlap ratio. For example, if both the baseline and an attention-supervised model have the same tokens in their top k attention, the model that weighs these tokens similarly to the ZuCo data should receive a greater score. We take the average over each sample j in dataset D, where the intersection set for sequence j contains the indices shared by the top k attention values of the model and of the human data, and the human distribution corresponds separately to α_ET (Eq. 3) or α_EEG. As Table 4 shows, the baseline and random models have less overlap than the ET model for all sets. Curiously, after the baseline, EEG overlap was weakest for the model supervised with EEG, including relative to the random models. This might indicate a diffusion of attention that makes top-k overlap difficult to differentiate, as EEG overlap values reach parity with the non-EEG models for k > 10. Figure 1 visualizes the final [CLS] attention weights, averaged over attention heads, for the baseline vs. attention-supervised models, against the ET and EEG ZuCo values used to supervise the latter models.
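A minimal sketch of this measure, under our reading of the description above (names and the exact formulation are illustrative, not a reproduction of Eq. 4):

```python
# Sketch of the top-k overlap measure: overlap ratio of top-k index sets,
# weighted by the model's relative attention mass on the shared tokens.
import numpy as np

def topk_overlap_score(alpha_model, alpha_human, k):
    top_model = set(np.argsort(alpha_model)[-k:])   # indices of the model's top-k weights
    top_human = set(np.argsort(alpha_human)[-k:])   # indices of the ZuCo top-k weights
    shared = sorted(top_model & top_human)
    if not shared:
        return 0.0
    overlap = len(shared) / k
    # Salience factor: model mass on shared tokens relative to human mass, capped at 1.0.
    salience = min(1.0, alpha_model[shared].sum() / alpha_human[shared].sum())
    return overlap * salience

def attention_similarity(model_dists, human_dists, k):
    """Average the per-sequence scores over a set of paired distributions."""
    return float(np.mean([topk_overlap_score(m, h, k)
                          for m, h in zip(model_dists, human_dists)]))
```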

Unique Errors
While task performance does not differ significantly, we can see that the model attentions are affected. To detect the possible effects of these attentional differences, where alternative features may be emphasized or diminished in the sequence representations used for RC, we analyze the errors made by each model. As seen in Table 5, models with non-random ZuCo attention supervision have more unique errors relative to the baseline than those with random supervision. In this case, the EEG-based attention loss seems to be the source of the small differences, as the ET and Random ET models have similar mismatches. Lin et al. (2020) examine fixes: instances where the baseline is in error, but the modified model is correct. We analyze the percentages of fixes and also of breaks, which we define to occur when the baseline is correct, but the model with supervised attention is incorrect. These are also shown in Table 5. Compared to the random models, the ZuCo models seem to more frequently predict correctly samples that the baseline labeled incorrectly.
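For concreteness, the fixes and breaks counts can be sketched as follows (illustrative; normalizing by the full test set is our assumption):

```python
# Sketch: count fixes (baseline wrong, supervised model right) and breaks
# (baseline right, supervised model wrong) over parallel prediction arrays.
import numpy as np

def fixes_and_breaks(baseline_preds, model_preds, labels):
    baseline_correct = baseline_preds == labels
    model_correct = model_preds == labels
    fixes = np.sum(~baseline_correct & model_correct)
    breaks = np.sum(baseline_correct & ~model_correct)
    return fixes / len(labels), breaks / len(labels)
```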

Conclusions and Future Work
Overall, BERT models with multiple modes of human attention supervision converged to accuracies on the relation classification task that do not differ significantly from the fine-tuned base BERT model, despite possessing attention distributions that were shifted toward the cognitive data. Measured by overlap, attention supervision with eye-tracking data was most influential on the final layer's [CLS]-assigned attention weights. In addition, we have shown that the behavior of these models differs from the baseline consistently, in that they misclassify different samples, exposing pathologies which may be of interest for research in neural network-based human language processing. Prior eye-tracking studies have pointed to distinct reading patterns for unfamiliar proper nouns, which may be more readily apparent in the ET values. On the other hand, it may be that the EEG data were too noisy and that dimensionality reduction to find the most predictive electrode values, such as that performed by Muttenthaler et al. (2020), is needed to provide a consistent signal. Additionally, Hollenstein et al. (2019) and Muttenthaler et al. (2020) incorporated EEG frequency bands into their ZuCo-based studies; the α frequency band has been associated with attention (Feldmann-Wüstefeld and Awh, 2020), and supervision with this band might yield different results. The cognitive data used in this study were not specifically produced from an entity-related reading task, but Brédart (2017) has noted the increased difficulty of processing proper names, reflected in behavioral studies and in a double dissociation between common nouns and proper names, where production of one type of noun can be impaired while the other remains intact. A more careful use of neuroimaging data may be needed to leverage signals reflecting the differing brain mechanisms involved in human lexical access.
Typically, researchers implicitly seek to induce a human-like bias in classifiers so that they correlate more highly with human judgments, by using self-reported annotations to supervise learning. This supervision is limited insofar as self-reports cannot capture responses inaccessible to annotator introspection, such as the brain's electrical activity or detailed gaze behavior. Models additionally biased by non-conscious physiological responses may learn to more robustly reflect human language processing, incorporating both subjective and objective signals. Human annotations are conventionally taken as ground truth, yet cognitive data may offer valid judgments as well. For example, in sentiment analysis, a false negative according to a self-report could be a true negative according to physiological affective responses. Cognitive data may reveal inconsistencies and gradations obscured by labels.
In the case of relation extraction, cognitive data might uncover patterns more reflective of different, potentially novel categories of semantic relation, or of different dynamics, due to linguistic ambiguity and/or changing contexts and readerships. In terms of limitations, we did not investigate the breadth or depth of influence of our [CLS]-based aggregate attention supervision on the model attentions across layers and heads, nor the supervision of specific layers or heads as done by Strubell et al. (2018). We did not explore trade-off coefficients on the multiple losses, such as the convex combination used by Malmaud et al. (2020). We used a relatively small English dataset, which limits generalizability and robustness. Prior work has described ethical concerns in the recording and use of cognitive data, including the use of voluntarily provided data that was procured but not recorded by the NLP researchers themselves. These concerns include loss of privacy through the identification of subjects, an overrepresentation and normalization of particular demographics, and the perpetuation of fossilized human prejudices. Sen et al. (2020) have described the potential for human attention supervision to address the validity of attention as a faithful, human-like explanation for model decisions, while Pruthi et al. (2019) have discussed the potential for deception by manipulating attention to make models appear less biased. Future work could scrutinize whether human attention supervision can provide a basis for exploring cognitive biases learned by models, or align attention-based explanations with model outcomes, enabling performant models to adhere faithfully to auditor expectations.