PIHKers at CMCL 2021 Shared Task: Cosine Similarity and Surprisal to Predict Human Reading Patterns.

Eye-tracking psycholinguistic studies have revealed that context-word semantic coherence and predictability influence language processing. In this paper we present our approach to predicting eye-tracking features from the ZuCo dataset for the shared task of the Cognitive Modeling and Computational Linguistics (CMCL 2021) workshop. Using both cosine similarity and surprisal within a regression model, we significantly improved on the baseline Mean Absolute Error, computed over five eye-tracking features.


Introduction
The shared task proposed by the organizers of the Cognitive Modeling and Computational Linguistics workshop (Hollenstein et al., 2021) requires participants to create systems capable of predicting eye-tracking data from the ZuCo dataset (Hollenstein et al., 2018). Systems that efficiently predict such biometric data can be useful for making predictions about linguistic materials for which little or no experimental data is available, and for formulating new hypotheses about the internal dynamics of cognitive processes.
The approach we propose relies mainly on two factors that have been shown to influence language comprehension: i.) the semantic coherence of a word with the previous ones (Ehrlich and Rayner, 1981) and ii.) its predictability from the previous context (Kliegl et al., 2004). We model the first factor with the cosine similarity (Mitchell et al., 2010; Pynte et al., 2008) between the distributional vectors representing the context and the target word, produced by different Distributional Semantic Models (DSMs) (Lenci, 2018). We compared 10 state-of-the-art word embedding models and two different approaches to computing the context vector. We model the predictability of a word within its context with the word-by-word surprisal computed with three of the models mentioned above (Hale, 2001; Levy, 2008). Finally, cosine similarity and surprisal are combined in different regression models to predict eye-tracking data.

Related Work
Different word embedding models (GloVe, Word2Vec, WordNet2Vec, FastText, ELMo, BERT) have been evaluated in the framework proposed by Hollenstein et al. (2019). The evaluation is based on the models' ability to reflect semantic representations in the human mind, using cognitive data from different eye-tracking, EEG, and fMRI datasets. Word embedding models are used to train neural networks on a regression task. The results of their analyses show that BERT, ELMo, and FastText have the best prediction performance.
Regression models with different combinations of cosine similarity and surprisal, used to predict (and further study the cognitive dynamics beneath) eye movements, have been created by Frank (2017). He claims that, since word embeddings are based on co-occurrences, semantic distance may actually reflect word predictability rather than semantic relatedness, and that previous findings of correlations between reading times and semantic distance were due to a confound between these two concepts. In his work, he fits linear regression models that alternately include and exclude different surprisal measures. The results show that when surprisal is factored out, the effects of semantic similarity on reading times disappear, thus demonstrating an interplay between the two elements.

Datasets
The shared task materials come from ZuCo (Hollenstein et al., 2018), which includes EEG and eye-tracking data collected from 12 English speakers reading natural texts. The data collection was done in three different settings: two normal reading tasks and one task-specific reading session. The original dataset comprises 1,107 sentences; for the shared task, 800 sentences (15,736 words) were used as training data, while the test set included about 200 sentences (3,554 words). Since the shared task focuses on eye-tracking features, only these data were available. Each entry in the training dataset includes the sentence number, the word-within-sentence number, the word itself, the number of fixations (nFix), the first fixation duration (FFD), the total reading time (TRT), the go-past time (GPT), and the fixation proportion (fixProp). The first three elements were part of the test set too.
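For concreteness, the training-set layout described above can be read into a table as follows. This is only an illustrative sketch: the column names and the inline CSV snippet are our assumptions, not the official files released by the task organizers.

```python
import io
import pandas as pd

# Hypothetical CSV snippet mirroring the training-set structure described
# above: sentence number, word-within-sentence number, the word itself,
# and the five eye-tracking features. Values are invented.
csv = io.StringIO(
    "sentence_id,word_id,word,nFix,FFD,TRT,GPT,fixProp\n"
    "1,1,The,1.2,5.1,6.0,6.3,55.0\n"
    "1,2,dog,1.5,5.8,7.1,7.4,63.0\n"
)
train = pd.read_csv(csv)

# The five prediction targets of the shared task.
features = ["nFix", "FFD", "TRT", "GPT", "fixProp"]
print(train[features].mean())
```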
Our approach includes a preliminary feature-selection step. For this purpose we also used GECO (Cop et al., 2017) and Provo (Luke and Christianson, 2018), two eye-tracking corpora containing long, complete, and coherent texts. GECO is a monolingual and bilingual (English and Dutch) corpus based on Agatha Christie's entire novel The Mysterious Affair at Styles. It contains eye-tracking data from 33 subjects (19 bilingual, 14 English monolingual) reading the full novel, presented paragraph by paragraph on a screen, for a total of 54,364 tokens. Provo contains 55 short English texts on various topics, for a total of 2,689 tokens and a vocabulary of 1,197 words. These texts were read by 85 subjects, and their eye-tracking measures are available in an online dataset. Like ZuCo, GECO and Provo record data during naturalistic reading of everyday materials. For every word in GECO and Provo, we extracted its mean total reading time, mean first fixation duration, and mean number of fixations by averaging over subjects. Table 1 shows the embedding types used in our experiments: 6 non-contextualized DSMs and 4 contextualized DSMs. The former include predict models (SGNS and FastText) (Mikolov et al., 2013; Levy and Goldberg, 2014; Bojanowski et al., 2017) and count models (SVD and GloVe) (Bullinaria and Levy, 2012; Pennington et al., 2014). Four DSMs are window-based and two are syntax-based (synt). Embeddings have 300 dimensions and were trained on the same corpus of about 3.9 billion tokens, a concatenation of ukWaC and a 2018 dump of Wikipedia.
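The per-word averaging we performed on GECO and Provo amounts to a groupby-mean over subjects. A minimal sketch with invented per-subject records (the column names are our own, not the corpora's actual schemas):

```python
import pandas as pd

# Toy per-subject eye-tracking records in the spirit of GECO/Provo:
# one row per (subject, word) pair. All values are invented.
records = pd.DataFrame({
    "word":    ["the", "the", "mystery", "mystery"],
    "subject": [1, 2, 1, 2],
    "TRT":     [180.0, 200.0, 320.0, 300.0],  # total reading time (ms)
    "FFD":     [150.0, 170.0, 210.0, 230.0],  # first fixation duration (ms)
    "nFix":    [1.0, 1.0, 2.0, 2.0],          # number of fixations
})

# Average each measure over subjects, word by word.
per_word = records.groupby("word")[["TRT", "FFD", "nFix"]].mean()
print(per_word)
```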

Method
To predict eye-tracking data we tested different regression models and several feature combinations.
Feature Selection. To select the features, for each word embedding model and language model we carried out a preliminary investigation, computing Spearman's correlation between the eye-tracking features and, respectively, surprisal and cosine similarity. The features with the highest correlation with the biometric data were selected for use in the regression model.
For each target word w in GECO, Provo, and ZuCo, we measure the cosine similarity between the embedding of w and the embedding of the context c, composed of the previous words in the same sentence. We then compute the Spearman correlation between the cosine and the eye-tracking data for w. We test two different ways of computing the context embedding: i.) Additive model (for every embedding type): the context vector is the sum of all its word embeddings. Because of the bidirectional nature of BERT, the input to this model needed special pre-processing. To prevent the vectors representing context words from being computed using the target word itself, we passed BERT a list of sub-sentences, each composed of context words only. So, given the sentence The dog chases the cat, the context vector for cat is The + dog + chases + the, computed from the sub-sentence S[3]. ii.) CLS: the context vector is the embedding produced by BERT for the special token [CLS]. As for the additive model, BERT was fed with sub-sentences, and for each target word the CLS context vector was the one computed at the previous list element. So, in the previous example, for cat as target word we use the CLS vector representing all the S[3] elements. Given the positive effect of semantic coherence on language processing, we expect the eye-tracking data for w to correlate negatively with its cosine similarity to c: the higher the cosine, the lower the reading time of w as measured by eye-tracking.
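The additive model and the cosine measure can be sketched in a few lines. Toy 3-dimensional vectors stand in for the real 300-dimensional (or BERT) embeddings; the numbers are invented for illustration only.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def additive_context(embeddings):
    """Additive model: context vector as the component-wise sum
    of the previous words' embeddings."""
    return [sum(dims) for dims in zip(*embeddings)]

# Toy embeddings for the context "The dog chases the" and the target "cat".
context_words = [[0.1, 0.2, 0.0], [0.4, 0.1, 0.3],
                 [0.2, 0.5, 0.1], [0.1, 0.2, 0.0]]
cat = [0.5, 0.4, 0.2]

c = additive_context(context_words)
print(cosine(c, cat))
```

With real embeddings, this cosine value is the quantity we correlate (via Spearman) with the eye-tracking measures of the target word.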
We then used BERT, GPT2-xl, and Neural Complexity to compute word-by-word surprisal. As for the cosine similarity, for BERT the input sentences were organized into sub-sentences, and the last token, the target word, was replaced with the special token [MASK]. Finally, we compute the Spearman correlation between the surprisal of w and the eye-tracking data for the target word. Unlike the cosine, we expect surprisal to be positively correlated with the word's reading time: the less predictable a word, the slower its processing.
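Surprisal itself is just the negative log-probability a language model assigns to a word given its context; with an autoregressive model such as GPT2-xl, P(w | context) is read off the softmax over the vocabulary at the target position. A minimal definition, with toy probabilities standing in for real model scores:

```python
from math import log2

def surprisal(prob):
    """Surprisal in bits: -log2 P(word | context)."""
    return -log2(prob)

# Toy conditional probabilities from a hypothetical language model:
# a highly predictable word versus an unexpected one.
print(surprisal(0.5))   # predictable word  -> 1.0 bit
print(surprisal(0.01))  # surprising word   -> ~6.64 bits
```

The expected positive correlation with reading times follows directly: rarer continuations get larger surprisal values.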
The comparison was made between 60 possible features: 6 values of cosine similarity between non-contextualized vectors, 51 values of cosine similarity between contextualized vectors (48 from the 24 layers of BERT with the two ways of computing the context vector, and 3 from ELMo, GPT2-xl, and Neural Complexity), and 3 values of surprisal from BERT, GPT2-xl, and Neural Complexity. Based on the correlation values, we selected one cosine similarity feature and one surprisal feature, which were combined with two variables well known in the cognitive neuroscience literature for influencing eye movements: word length and word frequency, the latter computed on Wikipedia.
Regression Model Selection. Taking into account the Spearman's correlations, we selected one word embedding model for the cosine similarity and one language model for the surprisal. Then, different kinds of regression models from Scikit-learn were compared: PLS regression, Multi-layer Perceptron regressor, Random Forest regressor, Linear regression, Ridge regression, Bayesian Ridge regression, Epsilon-Support Vector regression, linear regression with combined L1 and L2 priors as regularizer (Elastic Net), and Gradient Boosting regressor. The metric used to evaluate the different models is the Mean Absolute Error (MAE) on the prediction of ZuCo's eye-tracking features. Once the model and the features were selected, we compared three different regression settings: i) surprisal only; ii) cosine similarity only; iii) surprisal + cosine similarity. For the regression model selection, we used 2/3 of the ZuCo training set to train the model and 1/3 for validation. Once we found the combination of features and regression model with the lowest MAE over the eye-tracking features, we made the prediction on the test data.
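The model-selection loop above can be sketched with Scikit-learn on synthetic data. The feature matrix below is random noise standing in for our four real features (cosine, surprisal, word length, word frequency); only the 2/3–1/3 split, the MAE metric, and the model comparison mirror our actual procedure.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for the four regression features and one
# eye-tracking target (e.g. TRT). Coefficients are arbitrary.
X = rng.normal(size=(600, 4))
y = 2.0 * X[:, 1] - 1.5 * X[:, 0] + 0.5 * X[:, 2] \
    + rng.normal(scale=0.3, size=600)

# 2/3 train, 1/3 validation, as in our model-selection step.
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=1 / 3,
                                          random_state=0)

models = {
    "Ridge": Ridge(),
    "RandomForest": RandomForestRegressor(random_state=0),
    "GradientBoosting": GradientBoostingRegressor(random_state=0),
}
maes = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    maes[name] = mean_absolute_error(y_va, model.predict(X_va))
    print(name, round(maes[name], 3))
```

In the real experiment this comparison was run over all nine regressors and all feature settings, picking the configuration with the lowest MAE on the validation third.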

Results and Discussion
Spearman's correlations between the eye-tracking features and cosine similarity showed that the best performance is reached by the vectors produced by BERT layer 22 with the CLS context (mean correlation over eye-tracking features on the three datasets: −0.62), while the best correlations between eye-tracking data and surprisal are reached by GPT2-xl (mean correlation over eye-tracking features on the three datasets: 0.40). These results led us to select as features for the regression model the cosine similarity between vectors computed by BERT 22 CLS and the surprisal computed by GPT2-xl. We also tested the cosine similarity between vectors computed by GPT2-xl, to compare against a regression model whose features are all produced by the same model. While performing regression model selection over the 9 Scikit-learn models, we also tried different combinations of features. Table 2 shows the best 3 combinations of features and models, compared with the baseline built on word frequency and word length only. The lowest MAEs for each eye-tracking feature were reached by a Gradient Boosting Regressor (GBR) using both the cosine similarity between vectors produced by BERT and the surprisal computed by GPT2-xl. The average MAE of the GBR model with BERT cosine and GPT2-xl surprisal was 4.22 (mean improvement over the baseline: 0.54), with one feature, fixProp, producing a MAE significantly higher than the other eye-tracking features. Since fixProp is "the proportion of participants that fixated the current word" (i.e., the probability of the word being fixated), we hypothesized that the combination of phenomena influencing the likelihood of fixating a word could be captured by the other 4 eye-tracking features, making them in turn good predictors of fixProp.
Therefore, we tested the 9 Scikit-learn regression models again, this time using nFix, FFD, TRT, GPT, word length, and word frequency as features, in every possible combination (one at a time, pairs of features, etc.). The lowest MAE on fixProp on the training data was obtained with a Random Forest using nFix, TRT, and GPT, reaching a MAE of 3.15.
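The second stage of this pipeline is an ordinary Random Forest regression from the other eye-tracking features to fixProp. A sketch on synthetic data (the target is constructed to depend on the inputs, mimicking our hypothesis; the coefficients and sizes are invented):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Synthetic first-stage predictions for nFix, TRT, and GPT, plus a
# fixProp target that (by construction) depends on them.
n = 400
stage1 = rng.normal(size=(n, 3))          # columns: nFix, TRT, GPT
fixprop = (10 * stage1[:, 0] + 4 * stage1[:, 1]
           + rng.normal(scale=0.5, size=n))

# Fit on the first 300 rows, evaluate MAE on the held-out 100.
rf = RandomForestRegressor(random_state=0)
rf.fit(stage1[:300], fixprop[:300])
pred = rf.predict(stage1[300:])
mae = np.abs(pred - fixprop[300:]).mean()
print(round(mae, 3))
```

In our actual system, the inputs to this second stage were the first-stage predictions of nFix, TRT, and GPT rather than synthetic values.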
The improvement of the final model over the baseline suggests that the information conveyed by the cosine similarity and the surprisal contributes to modeling the cognitive processing underlying reading. Our results are consistent with the findings of Pynte et al. (2008) and Mitchell et al. (2010) on the relation between cosine similarity and eye movement data, as well as with Hale (2001) and Levy (2008), who found surprisal useful in predicting reading times. Moreover, our model's performance shows that taking both computational measures into account benefits the modeling. Even if Frank (2017) raises an interesting issue about the interplay between the information encoded in word embeddings and that provided by the surprisal computed by language models, our results keep us from fully agreeing with his observations: since the joint model performed better than the ones based only on cosine similarity or only on surprisal, the two measures must each convey distinct and useful information, even if it is quite plausible that they share some information to some extent.
In summary, we used a two-step approach: i.) the final model to predict nFix, FFD, GPT, and TRT in test data was a Gradient Boosting Regressor having as features the cosine similarity between the CLS vector (BERT) and the target word embedding, GPT2-xl surprisal, word length and word frequency; ii.) the predicted values of nFix, GPT, and TRT were used in a Random Forest to predict fixProp.
The shared task's final results on the test data revealed that our model had an average MAE of 4.3877 over all eye-tracking features (the baseline was 7.3699, while the best model reached a MAE of 3.8134).

Conclusions
In this paper we described the system we proposed for the CMCL 2021 shared task "Predicting human reading patterns". We were required to create a model capable of predicting the number of fixations, first fixation duration, total reading time, go-past time, and fixation proportion of each word in the ZuCo dataset. We proposed a regression model using word length and word frequency, combined with two factors known to influence reading: the semantic coherence and the predictability of a word within its context. To compute these two regression features we used, respectively, the cosine similarity between the vector representing the context and the word embedding of the target word, and the surprisal computed by language models. We selected the models used to produce the vectors and to compute the surprisal by calculating the Spearman correlation between the cosine similarity and the eye-tracking data, and between the surprisal and the same data. We then used the best cosine similarity and surprisal within a regression model, selected among 9 candidates. Our results outperformed the baseline, with an average MAE over the eye-tracking features just 0.5743 higher than the best model in the competition. Our model could be improved by exploring new types of regressors and word embeddings, and by including new textual features such as sentence length and information about the words immediately preceding the target ones.