CogNLP-Sheffield at CMCL 2021 Shared Task: Blending Cognitively Inspired Features with Transformer-based Language Models for Predicting Eye Tracking Patterns

The CogNLP-Sheffield submissions to the CMCL 2021 Shared Task examine the value of a variety of cognitively and linguistically inspired features for predicting eye tracking patterns, as both standalone model inputs and as supplements to contextual word embeddings (XLNet). Surprisingly, the smaller pre-trained model (XLNet-base) outperforms the larger (XLNet-large), and despite evidence that multi-word expressions (MWEs) provide cognitive processing advantages, MWE features provide little benefit to either model.


Introduction and Motivation
Many researchers now agree that eye movements during reading are not random (Rayner, 1998); as a result, eye-tracking has been used to study a variety of linguistic phenomena, such as language acquisition (Blom and Unsworth, 2010) and language comprehension (Tanenhaus, 2007). Readers do not study every word in a sentence exactly once, so following patterns of fixations (pauses with the eyes focused on a word for processing) and regressions (returning to a previous word) provides a relatively non-intrusive method for capturing subconscious elements of subjects' cognitive processes.
Recently, cognitive signals like eye-tracking data have been put to use in a variety of NLP tasks, such as POS-tagging (Barrett et al., 2016), detecting multi-word expressions, and regularising attention mechanisms. The majority of research utilising eye-tracking data has focused on what it reveals about the linguistic qualities of the reading material and/or the cognitive processes involved in reading. The CMCL 2021 Shared Task of Predicting Human Reading Behaviour (Hollenstein et al., 2021) asks a slightly different question: given the reading material, is it possible to predict eye-tracking behaviour?
Our ability to quantitatively describe linguistic phenomena has greatly increased since the first feature-based models of reading behaviour (e.g. Carpenter and Just (1983)). Informed by these traditional models, our first model tests 'simple' features informed by up-to-date expert linguistic knowledge. In particular, we investigate information about multi-word expressions (MWEs), as eye-tracking information has been used to detect MWEs in context (Rohanian et al., 2017), and empirically MWEs appear to have processing advantages over non-formulaic language (Siyanova-Chanturia et al., 2017).
Our second model is motivated by evidence that Pre-trained Language Models (PLMs) outperform feature-based models in ways that do not correlate with identifiable cognitive processes (Sood et al., 2020). Since many PLMs evolved from the study of human cognitive processes (Vaswani et al., 2017) but now perform in ways that do not correlate with human cognition, we wished to investigate how merging cognitively inspired features with PLMs may impact predictive behaviour. We felt this was a particularly pertinent question given that PLMs have been shown to contain information about crucial features for predicting eye tracking patterns, such as parts of speech (Chrupała and Alishahi, 2019; Tenney et al., 2019) and sentence length (Jawahar et al., 2019).
We therefore had the goals of providing a competitive Shared Task entry, and investigating the following hypotheses: A) Does linguistic/cognitive information that can be predicted by eye-tracking features prove useful for predicting eye-tracking features? B) Can adding cognitively inspired features to a model based on PLMs improve performance in predicting eye tracking features?

Task Description
The CMCL 2021 Shared Task of Predicting Reading Behaviour formulates predicting gaze features from the linguistic information in their associated sentences as a regression task. The data for the task consists of 991 sentences (800 training, 191 test) and their associated token-level gaze features from the Zurich Cognitive Language Processing Corpora (Hollenstein et al., 2018, 2020). For each word, the following measures were averaged over the reading behaviour of the participants: FFD (first fixation duration, the length of the first fixation on the given word); TRT (total reading time, the sum of the lengths of all fixations on the given word); GPT (go past time, the time from the first fixation on the given word until the eyes move to the right of it in the sentence); nFix (number of fixations, the total quantity of fixations on a word, regardless of fixation lengths); and fixProp (fixation proportion, the proportion of participants that fixated the word at least once). Solutions were evaluated using Mean Absolute Error (MAE). For more details about the Shared Task, see Hollenstein et al. (2021).
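As a concrete illustration, the evaluation metric can be sketched as follows. This is a minimal sketch; the averaging of per-feature MAE scores into one overall score is our assumption about how the overall ranking is formed.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error, the Shared Task's evaluation metric."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def overall_mae(true_by_feature, pred_by_feature):
    """Average MAE over the five target gaze features (assumed aggregation)."""
    feats = ["nFix", "FFD", "GPT", "TRT", "fixProp"]
    return float(np.mean([mae(true_by_feature[f], pred_by_feature[f]) for f in feats]))
```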

Related Work
Transformer architectures
Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) is a language representation model constructed from stacked neural network attention layers and 'massively' pre-trained on large natural language corpora. In contrast with traditional language models, BERT is pre-trained in two settings: a 'cloze' task where a randomly masked word is to be predicted, and next sentence prediction. BERT and derivative models have been used to achieve state-of-the-art results on many NLP tasks (Devlin et al., 2019; Yang et al., 2019). Analysis studies have shown that BERT learns complex, task-appropriate, multi-stage pipelines for reasoning over natural language, although there is evidence of model bias. XLNet (Yang et al., 2019) is an autoregressive reformulation of BERT which trains on all possible permutations of contextual words, and removes the assumption that predicted tokens are independent of each other.

Similar studies
To our knowledge, studies that attempt to predict cognitive signals using language models are fairly few and far between. Djokic et al. (2020) successfully used non-Transformer word embeddings to decode brain activity recorded during literal and metaphorical sentence disambiguation. Since RNNs may be considered more 'cognitively plausible' than Transformer-based models, Merkx and Frank (2020) compared how well these two types of language models predict different measures of human reading behaviour, finding that the Transformer models more accurately predicted self-paced reading times and EEG signals, but the RNNs were superior for predicting eye-tracking measures.
In a slightly different task, Sood et al. (2020) compared LSTM, CNN, and XLNet attention weightings with human eye-tracking data on the MovieQA task (Tapaswi et al., 2016), finding significant evidence that LSTMs display similar patterns to humans when performing well. XLNet used a more accurate strategy for the task but was less similar to human reading.
Though these studies may indicate that Transformer models are not the best suited to eye-tracking prediction, they are still considered state of the art in creating broad semantic representations and general linguistic competence (Devlin et al., 2019). As such, we hoped they would allow us to investigate Carpenter and Just's speculation that the dominance of word length and frequency for predicting eye-tracking behaviour may reduce "as the metrics improve for describing higher-level factors" like semantic meaning (1983, p. 290).

Experimental Design
We pursued both feature engineering and deep learning approaches to the task; though both methods performed well independently, there was little improvement in predictive capability when combining their features (see Table 1). As such, we developed and submitted two models: Model 1 (Feature Rich) and Model 2 (XLNet). Additional details about the feature combinations used in our final models can be found in Appendices A and C.

Linguistic Features
Each word in the training vocabulary was encoded as a one-hot vector. Since function words are more likely to be fixated than open class words (Carpenter and Just, 1983), we included POS information generated by Spacy (Honnibal et al., 2020) (honouring the tokenisation in the training data). We included a binary indicator for whether a word was the first or last in its sentence, to incorporate the knowledge that first and last fixations on a line are 5-7 letter spaces from the two respective ends (Rayner, 1998). We generated raw frequencies (proportion per million words) and Zipf frequencies (Van Heuven et al., 2014).
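Two of these features can be sketched as follows. This is a minimal illustration under our assumptions (tokens already split per the training data; POS tagging via Spacy is omitted here); the function names are ours, not the paper's.

```python
import numpy as np

def boundary_features(tokens):
    """Binary indicators: is each token first / last in its sentence?"""
    n = len(tokens)
    return [(int(i == 0), int(i == n - 1)) for i in range(n)]

def one_hot(token, vocab):
    """One-hot encoding over the training vocabulary.
    Words outside the vocabulary map to an all-zero vector (our assumption)."""
    vec = np.zeros(len(vocab))
    if token in vocab:
        vec[vocab[token]] = 1.0
    return vec
```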
Finally, concreteness norms (a measure of how 'abstract' a given word is) were included as features (mean, standard deviation, and the % of participants familiar enough with the word to accurately judge its concreteness; Brysbaert et al. (2014)). We specifically tested concreteness due to the unusually large coverage of the norms.

Reading Specific Features
Word length has been empirically demonstrated as a very good predictor of gaze features in many studies (e.g. Rayner and McConkie (1976); Carpenter and Just (1983)). Duration of fixation is observed to increase for words that exceed the mean saccade length (7-9 letters), and probability of fixation is reduced for words shorter than half the mean saccade length (Rayner and McConkie, 1976). Therefore, as features we included both the raw word lengths, and categorical variables representing word length as a proportion of a mean saccade length.
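The categorical variable can be sketched as below. The three-way binning and the choice of 8 letters as the mean saccade length (the midpoint of the 7-9 range above) are our assumptions for illustration.

```python
def length_category(word, mean_saccade=8):
    """Categorise word length relative to an assumed mean saccade length.
    0: shorter than half the mean saccade (fixation less likely)
    1: between half the mean saccade and the mean
    2: longer than the mean saccade (fixation duration tends to increase)
    """
    n = len(word)
    if n < mean_saccade / 2:
        return 0
    if n <= mean_saccade:
        return 1
    return 2
```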
Since readers may store information about adjacent words (Rayner, 1975, 1998; Barrett, 2018), we also experimented with supplying features from previous and future words to each target word.
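Supplying neighbouring words' features can be sketched as a simple windowing step; a one-word window on each side and zero-padding at sentence boundaries are our assumptions.

```python
def with_context(feature_rows):
    """Append each token's neighbours' feature vectors (previous and next word).
    Sentence boundaries are zero-padded (our assumption)."""
    n = len(feature_rows)
    zero = [0.0] * len(feature_rows[0])
    out = []
    for i in range(n):
        prev = feature_rows[i - 1] if i > 0 else zero
        nxt = feature_rows[i + 1] if i < n - 1 else zero
        out.append(list(feature_rows[i]) + list(prev) + list(nxt))
    return out
```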

Type Summary Statistics from GECO
Following Barrett et al. (2016), we used the monolingual data from the GECO corpus (Cop et al., 2017) to generate type-level summary statistics for each word. Specifically, we averaged the gaze features across the 12 participants who completed the reading task, and normalised these features to reflect the normalisation of the Shared Task training data. We then averaged these values again at the type (word) level. For words present in the task training data but not the GECO data, we estimated the values using means for words in the GECO data of a similar frequency (according to the wordfreq library).
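The type-level averaging and the frequency-based imputation can be sketched as follows. The similarity band of 0.5 Zipf units and the fallback to the global mean are our assumptions; the paper does not specify the binning scheme.

```python
def type_level_means(words, gaze_values):
    """Average token-level gaze values at the type (word) level."""
    sums, counts = {}, {}
    for w, v in zip(words, gaze_values):
        key = w.lower()
        sums[key] = sums.get(key, 0.0) + v
        counts[key] = counts.get(key, 0) + 1
    return {w: sums[w] / counts[w] for w in sums}

def impute_by_frequency(word, type_means, zipf, band=0.5):
    """Estimate a gaze value for a word absent from the corpus by averaging
    the type-level values of words of similar (Zipf) frequency."""
    target = zipf(word)
    close = [v for w, v in type_means.items() if abs(zipf(w) - target) <= band]
    if not close:  # fallback: global mean (our assumption)
        close = list(type_means.values())
    return sum(close) / len(close)
```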

Multi-word Expression Features
We generated an MWE lexicon and summary metrics using the Wikitext-103 corpus (Merity et al., 2016) and mwetoolkit (Ramisch, 2012). MWEs identified in the training data were assigned MWE embeddings and compositionality information as features, and non-MWEs were assigned single-word embeddings and zero values for compositionality.

XLNet
In order to obtain massively pre-trained language model features we used XLNet. We fine-tuned a model that was pre-trained on BooksCorpus (Zhu et al., 2015), English Wikipedia, Giga5 (Courtney Napoles, Matthew R. Gormley, 2012), ClueWeb 2012-B (Callan et al., 2009), and Common Crawl text (Crawl, 2019). For predictions, we took the final hidden representation of the first sub-word token encoding of each word. We concatenated this feature with an integer representing the total word length in characters to encourage the model to explicitly attend to word length. We tested the effectiveness of sub-word aggregation but found this reduced the model's accuracy by an average of 0.04 MAE, which we speculate is due to loss of information in the pooling operation, whilst head sub-word units already contain contextual information. We then passed the concatenated sub-word and word-length features to a 3-layer dense neural network which was used to predict the Shared Task's five target features. This 3-layer multi-feature network was found to be optimal through experimentation. For stability, we used the Huber loss objective, which approximates L2 loss for small values and L1 loss for large values. We trained using the AdamW optimiser, with learning rate and training duration chosen through grid search across 3-fold cross-validation, obtaining an optimal learning rate of 0.00001 and 800 epochs.
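The Huber objective used for stability can be written as a minimal sketch; the threshold delta at which the loss switches from quadratic to linear is not specified in the text, so the default of 1.0 below is an assumption.

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Huber loss: quadratic (L2-like) for residuals below delta,
    linear (L1-like) above it, which stabilises regression training."""
    r = np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float))
    quad = 0.5 * r ** 2
    lin = delta * (r - 0.5 * delta)
    return float(np.mean(np.where(r <= delta, quad, lin)))
```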

Results
In Table 1 we present the MAE on validation splits of the training data. This information informed our choice of model submissions, alongside a preference for models using more cognitive features. We submitted two sets of predictions from Model 2 (ElasticNet(XLNet-base-cased)) and one set of predictions from Model 1 (Feature Rich). Our overall standing on the task was 5th, with an MAE delta of 0.143 behind the best model. Whilst a prediction which combined Models 1 and 2 was slightly more accurate (see Table 1), we regard this improvement as within the margin of error. We therefore focussed on Models 1 and 2 separately, since this allowed for clearer comparisons between the two approaches.

Analysis and Discussion
Our results (Table 1) support both our hypotheses introduced in Section 1. We did not anticipate that XLNet-base would outperform XLNet-large, which had more pre-training data and layers. This is possibly due to the limited amount of task-specific training data for fine-tuning, resulting in the larger model underfitting. We are able to confirm that the knowledge XLNet learns through massive pre-training is crucial to its performance in this arena: removal of this knowledge through weight randomisation increases MAE from 3.959 to 4.675. Hence we believe that both the structure and the pre-training of XLNet-base contribute to its success in this task.
We use normalised permutation feature importance (see Appendix B) to better understand the value of different features and present it on a per-target basis for each model in Figure 2.
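Permutation feature importance can be sketched as follows: shuffle one feature column at a time and measure how much the error metric degrades. The exact normalisation used in the paper is described in its Appendix B; dividing each score by the sum, as below, is our assumption.

```python
import numpy as np

def permutation_importance(model, X, y, metric, n_repeats=5, seed=0):
    """Permutation feature importance: permute one column at a time and
    record the mean increase in the error metric over n_repeats shuffles."""
    rng = np.random.default_rng(seed)
    base = metric(y, model(X))
    scores = []
    for j in range(X.shape[1]):
        deltas = []
        for _ in range(n_repeats):
            Xp = X.copy()
            perm = rng.permutation(X.shape[0])
            Xp[:, j] = Xp[perm, j]  # permute column j only
            deltas.append(metric(y, model(Xp)) - base)
        scores.append(float(np.mean(deltas)))
    total = sum(scores)
    # Normalise so the per-feature scores sum to 1 (assumed scheme)
    return [s / total if total else s for s in scores]
```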
The most interesting outcome of our experiments was the fact that XLNet embeddings subsume the information contained in most features except word length (especially in predicting nFix). It may be that the use of word-pieces obfuscates word-length information, thus requiring the explicit addition of that information. While the usefulness of features such as word length is consistent with the literature, we were surprised by the relative unimportance of MWE information, given that many neurocognitive studies have demonstrated differences in how MWEs are processed (Siyanova-Chanturia et al., 2011; Cacciari and Tabossi, 1988). An additional surprise is that even though the Skip-gram embeddings provide semantic information about single words as well as MWEs, the Feature Rich models make little use of them. Many of the Feature Rich models utilise the GECO features, which may be because they provide approximate guidance about the distributions of the various gaze features that would be difficult to learn directly given the sparsity of the training data.

Conclusion and Future Work
This work describes our submissions to the 2021 CMCL Shared Task: we contributed a Feature Rich model inspired by cognitive and linguistic information, and a model predominantly based on contextual XLNet-base embeddings. We find that only a limited subset of the cognitive features (such as word length) are helpful in the XLNet model. To our surprise, neither XLNet-large embeddings nor MWE features provide performance improvements. However, we believe this indicates a need for further research into MWE representations, as opposed to suggesting that MWEs are unimportant for creating effective cognitive models.