KonTra at CMCL 2021 Shared Task: Predicting Eye Movements by Combining BERT with Surface, Linguistic and Behavioral Information

This paper describes the submission of the team KonTra to the CMCL 2021 Shared Task on eye-tracking prediction. Our system combines the embeddings extracted from a fine-tuned BERT model with surface, linguistic and behavioral features, resulting in an average mean absolute error of 4.22 across all 5 eye-tracking measures. We show that word length and features representing the expectedness of a word are consistently the strongest predictors across all 5 eye-tracking measures.


Introduction
The corpora ZuCo 1.0 and ZuCo 2.0 (Hollenstein et al., 2018, 2019) contain eye-tracking data collected in a series of reading tasks on English materials. For each word of the sentences, five eye-tracking measures are recorded: 1) the number of fixations (nFix), 2) the first fixation duration (FFD), 3) the go-past time (GPT), 4) the total reading time (TRT), and 5) the fixation proportion (fixProp). Providing a subset of the two corpora, the CMCL 2021 Shared Task (Hollenstein et al., 2021) requires the prediction of these eye-tracking measures based on any relevant feature.
To tackle the task, we conduct a series of experiments using various combinations of BERT embeddings (Devlin et al., 2018) and a rich set of surface, linguistic and behavioral features (SLB features). Our experimental setting enables a comparison of the potential of BERT and the SLB features, and allows for the explainability of the system. The best performance is achieved by the models combining word embeddings extracted from a fine-tuned BERT model with a subset of the SLB features that are the most predictive for each eye-tracking measure. Overall, our model was ranked 8th out of 13 models submitted to the shared task.
Our main contributions are the following: 1) We show that training solely on SLB features provides better results than training solely on word embeddings (both pre-trained and fine-tuned ones). 2) Among the SLB features, we show that word length and linguistic features representing word expectedness consistently show the highest weight in predicting all of the 5 measures.

Describing Eye-Tracking Measures
To explore the impact of linguistic and cognitive information on eye-movements in reading tasks, we extract a set of surface, linguistic, behavioral and BERT features, as listed in Table 1.
Surface Features Given the common finding that surface characteristics, particularly the length of a word, influence fixation duration (Juhasz and Rayner, 2003; New et al., 2006), we compute various surface features at word and sentence level (e.g., word and sentence length).

Linguistic Features
The linguistic characteristics of the words co-occurring in a sentence have an effect on eye movements (Clifton et al., 2007). Thus, we experiment with features of a syntactic and semantic nature. The syntactic features are extracted using the Stanza NLP toolkit (Qi et al., 2020). For each word, we extract its part-of-speech (POS), its word type (content vs. function word), its dependency relation and its named entity type. According to Godfroid et al. (2018) and Williams and Morris (2004), word familiarity (both local and global) has an effect on the reader's attention, i.e., readers may pay less attention to words that already occurred in the previous context. In this study, we treat familiarity as word expectedness and model it using three types of semantic similarity: a) the similarity of the current word w_m to the whole sentence (similarity(w_m, s)), b) the similarity of the current word to its previous word (similarity(w_m, w_m-1)), and c) the similarity of the current word to all of its previous words within the current sentence (similarity(w_m, w_1...m-1)). To compute these similarity measures, we use the pre-trained BERT (base) model (Devlin et al., 2018; https://github.com/google-research/bert) and map each word to its pre-trained embedding of layer 11. We chose this layer because it mostly captures semantic properties, while the last layer has been found to be very close to the actual classification task and thus less suitable for our purpose (Jawahar et al., 2019; Lin et al., 2019). Based on these extracted embeddings, we calculate the cosine similarities.

Table 1: Overview of the extracted features.

Surface Features: word length, sentence length in tokens, sentence length in characters, word length-sentence length ratio
Linguistic Features: POS, word type, named entity type, dependency relation, surprisal score, frequency score, similarity(w_m, s), similarity(w_m, w_m-1), similarity(w_m, w_1...m-1)
Behavioral Features: age of acquisition, prevalence score, valence score, arousal score, dominance score, concreteness_human, concreteness_auto
BERT Features: pre-trained BERT embedding, fine-tuned BERT embedding
To measure the similarity of the current word to the whole sentence (similarity(w_m, s)), we take the CLS token to represent the whole sentence; we also experimented with the average of the token embeddings as the sentence embedding, but found that the CLS token performs better. For the similarity of the current word to all of its previous words (similarity(w_m, w_1...m-1)), we average the embeddings of the previous words and compute the cosine similarity between this average embedding and the embedding of the current word. Furthermore, semantic surprisal, i.e., the negative log-transformed conditional probability of a word given its preceding context, provides a good measure of the predictability of words in context and efficiently predicts reading times (Smith and Levy, 2013), N400 amplitude and pupil dilation (Frank and Thompson, 2012). We compute surprisal using a bigram language model trained on the lemmatized version of the first slice (roughly 31 million tokens) of the ENCOW14-AX corpus (Schäfer and Bildhauer, 2012). As an additional measure of word expectedness, we also include frequency scores based on the US subtitle corpus (SUBTLEX-US; Brysbaert and New, 2009).
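The expectedness features above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the layer-11 word embeddings and the CLS sentence embedding have already been extracted (stand-in vectors are used below), fills the context similarities of a sentence-initial word with 1.0 (the paper does not state its fill value), and uses add-one smoothing for the bigram surprisal (a smoothing choice the paper does not specify).

```python
import math
from collections import Counter

import numpy as np


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def similarity_features(word_embs, sent_emb):
    """Three expectedness features per word.

    word_embs: (n, d) array of layer-11 embeddings for words w_1..w_n.
    sent_emb:  (d,) CLS embedding standing in for the whole sentence.
    For the first word, the context-based similarities default to 1.0
    (an assumption; the paper does not state how this case is handled).
    """
    feats = []
    for m in range(len(word_embs)):
        sim_sent = cosine(word_embs[m], sent_emb)              # similarity(w_m, s)
        if m > 0:
            sim_prev = cosine(word_embs[m], word_embs[m - 1])  # similarity(w_m, w_m-1)
            sim_ctx = cosine(word_embs[m], word_embs[:m].mean(axis=0))  # similarity(w_m, w_1...m-1)
        else:
            sim_prev = sim_ctx = 1.0
        feats.append((sim_sent, sim_prev, sim_ctx))
    return feats


def bigram_surprisal(tokens, bigram_counts, unigram_counts, vocab_size):
    """-log2 P(w_m | w_{m-1}) for each token after the first,
    estimated from raw counts with add-one smoothing."""
    out = []
    for prev, word in zip(tokens, tokens[1:]):
        p = (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)
        out.append(-math.log2(p))
    return out
```

In the actual system the counts come from the ENCOW14-AX slice and the embeddings from the pre-trained BERT base model; here both are left as inputs so the feature computation itself stays self-contained.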

Behavioral Features As discussed in Juhasz and Rayner (2003) and Clifton et al. (2007), behavioral measures highly affect eye movements in reading tasks. For each word in the sentence, we extract behavioral features from large collections of human-generated norms available online: age of acquisition (Kuperman et al., 2012), prevalence (Brysbaert et al., 2019), valence, arousal, dominance (Warriner et al., 2013) and concreteness. For concreteness, we experiment both with human-generated scores (concreteness_human; Brysbaert et al., 2014) and automatically generated ones (concreteness_auto; Köper and Schulte im Walde, 2017). All behavioral measures have been centered (mean equal to zero) and missing values have been set to the corresponding mean value.
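The centering and mean-imputation step amounts to a few lines of numpy; the following is a minimal sketch (the input values are illustrative, not actual norm scores). Since missing entries are filled with the column mean and the column is then centered, missing entries end up at exactly zero.

```python
import numpy as np


def center_and_impute(values):
    """Center a feature column to zero mean and fill missing (NaN) entries.

    Imputing with the mean of the observed entries and then centering
    leaves every formerly-missing entry at exactly 0.
    """
    values = np.asarray(values, dtype=float)
    mean = np.nanmean(values)                          # mean over observed entries only
    values = np.where(np.isnan(values), mean, values)  # impute missing values
    return values - mean                               # center to zero mean


# e.g., an age-of-acquisition column with one missing word
aoa = center_and_impute([3.5, float("nan"), 7.1, 5.0])
```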

BERT Features
Given the success of current language models on various NLP tasks, we investigate their expressivity for human-centered tasks such as eye-tracking prediction: each word is mapped to two types of contextualized embeddings. First, each word is mapped to its BERT embedding (Devlin et al., 2018) extracted from the pre-trained base model. To extract the second type of contextualized embedding, we fine-tune BERT on each of the five eye-tracking measures. Specifically, the BERT base model is fine-tuned separately five times, once for each of the eye-tracking measures to be predicted. Based on these fine-tuned models, we extract the embedding of each word as a fixed feature vector to be used in further experiments. This means that in this step each word is in fact mapped to five distinct embeddings, one per fine-tuned model. In the later experiments, we use the embedding corresponding to the measure currently being predicted (e.g., the embedding extracted from the model fine-tuned for nFix is used to predict nFix).

Experiment 1: Using Only SLB Features
In Experiment 1, we train the aforementioned model architectures on the full set of SLB features. Among the three models, the Random Forest Regressor achieves the best overall performance, with an average MAE across all 5 eye-tracking measures of MAE(RF) = 4.059, compared to MAE(DT) = 4.187 and MAE(LR) = 4.322. To shed light on the most predictive features for each of the eye-tracking measures, we perform feature selection based on the features' weights, i.e., the impurity-based feature importance (Gini importance), computed as the normalized total reduction of the splitting criterion brought by each feature: the higher the value, the more important the feature. We select features with an importance higher than 0.01, resulting in the reduced SLB feature set shown in Table 2. This selected set is used again in Experiment 3 (see Section 3.3).
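The importance-based selection can be illustrated with scikit-learn's RandomForestRegressor, whose feature_importances_ attribute exposes exactly this normalized impurity-based (Gini) importance. The paper does not name its implementation, so treat the library choice as an assumption; the data below are toy stand-ins, while the 0.01 threshold matches the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy stand-in data: 200 words, 5 features; only the first two
# actually drive the synthetic "reading time" target.
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances are normalized to sum to 1; keep the
# features above the 0.01 threshold used in the paper.
selected = [i for i, imp in enumerate(rf.feature_importances_) if imp > 0.01]
```

With real SLB feature columns in place of X, `selected` would correspond to the reduced feature set reported in Table 2.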

Experiment 2: Using Only BERT
Our second experiment investigates the expressivity of the contextualized BERT embeddings. We experiment with the two variants of BERT embeddings (see Section 2). In the first variant, the three models use the pre-trained BERT embeddings; in the second variant, they use the fine-tuned BERT embeddings. The latter means that for each of the 5 eye-tracking measures, the embeddings extracted from the corresponding fine-tuned model are used and 3 models are trained per measure, for a total of 15 models. We also experiment with the predictions directly produced by the fine-tuning tasks, but observe that these predictions show similar performance. This finding is in line with what is reported in Devlin et al. (2018).

Experiment 3: Enhancing BERT with SLB Features
Extracting BERT embeddings as fixed-length features, instead of using the predictions directly out of the fine-tuned model, allows us to extend the BERT vectors with further features. Thus, in the last experiment, we train the 3 regression models on an extended vector comprising the extracted 768-dimensional BERT embedding and additional dimensions for the reduced SLB feature set of Experiment 1 (see Section 3.1). Again, two variants are tested: one using the pre-trained embeddings and the other using the fine-tuned embeddings of the corresponding model. The models combining the fine-tuned embeddings with the selected SLB features achieve the best performance across all 5 eye-tracking measures. When we compare the predictive power of the models including only SLB features against the models trained only on BERT, we see that the embeddings are less informative than the carefully selected set of SLB features. A closer investigation of the selected SLB features in Table 2 provides interesting insights into the nature of the features and the task.
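The extended input vector is a plain concatenation of the embedding and the selected SLB features; a minimal sketch with stand-in values (a zero vector in place of an extracted fine-tuned embedding, and three arbitrary numbers in place of selected SLB feature values):

```python
import numpy as np

bert_emb = np.zeros(768)             # stand-in for an extracted fine-tuned BERT embedding
slb = np.array([4.0, 0.62, 5.1])     # stand-in values for the selected SLB features
x = np.concatenate([bert_emb, slb])  # extended vector fed to a regression model
```

With k selected SLB features, the regressors are trained on vectors of dimension 768 + k rather than on the raw embedding alone.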

Results and Discussion
Surface Features Among all SLB features, word length is consistently the predictor with the highest weight across all 5 measures. Furthermore, the word length-sentence length ratio is among the most important contributors for 4 of the 5 measures. This confirms the observation in Hollenstein et al. (2018, p. 10) that the probability of a word being skipped decreases as word length increases.
Linguistic Features Two features for word expectedness, i.e., the frequency score and similarity(w_m, w_m-1), also show high predictive power for all 5 measures. This confirms previous findings by Godfroid et al. (2018) and Williams and Morris (2004). Likewise, similarity(w_m, w_1...m-1) ranks among the most important features for 4 of the 5 measures, and the surprisal score for 3 of the 5 measures. Most importantly, the surprisal score shows a much higher importance in predicting GPT, which indicates that encountering an unexpected word may trigger regressive reading to re-inspect previous words and thus increases the go-past time. On the other hand, the syntactic properties of a word (e.g., POS, dependency relation and named entity type) do not show any strong effect in our results. The only exception is that numeral tokens are among the most important features in predicting GPT and TRT. After a closer look at the data, we found that a majority of the numeral tokens convey information about dates (e.g., November 28). This effect could probably be explained by the nature of the data, where a majority of the sentences are biographical sentences from Wikipedia (Hollenstein et al., 2018, 2019); in such data, numeral information is highly relevant to the context.

Behavioral Features Dominance and age of acquisition also play a significant role in predicting GPT: as indicated in the literature (Juhasz and Rayner, 2003), such behavioral measures have a strong impact on the processing time of words in context.

Conclusion
We presented a system of eye-tracking feature prediction which combines BERT with a rich set of surface, linguistic and behavioral (SLB) features. Overall, our three studies indicate that including not only semantic properties that can be directly extracted from text, such as embeddings and surprisal score, but also measures reflecting behavioral (e.g., dominance and age of acquisition) and surface properties (word and sentence length) has a positive impact on the performance of our models in predicting eye-tracking data.