LangResearchLab_NC at CMCL2021 Shared Task: Predicting Gaze Behaviour Using Linguistic Features and Tree Regressors

Analysis of gaze behaviour has gained momentum in recent years for different NLP applications. The present paper aims at modelling the gaze behaviour of tokens in the context of a sentence. We have experimented with various machine learning regression algorithms on a feature space comprising linguistic features of the target tokens for the prediction of five Eye-Tracking features. The CatBoost Regressor performed best and achieved fourth position in terms of MAE-based evaluation on the ZuCo Dataset.


Introduction
Eye-Tracking or gaze data compiles millisecond-accurate records of where humans look while reading. This yields valuable insights into the psycho-linguistic and cognitive aspects of various tasks requiring human intelligence. Eye-Tracking data has been successfully employed for various downstream NLP tasks, such as part-of-speech tagging (Barrett et al., 2016), named entity recognition (Hollenstein et al., 2018), sentiment analysis (Mishra et al., 2018), text simplification (Klerke et al., 2016), and sequence classification (Barrett and Hollenstein, 2020), among others. The development of systems for the automatic prediction of gaze behaviour has become an important topic of research in recent years. For example, Klerke et al. (2016) and Mishra et al. (2017) used a bi-LSTM and a CNN, respectively, for learning different gaze features. In the present work, Eye-Tracking features for the tokens of given sentences are learned using tree regressors trained on a feature space comprising the linguistic properties of the target tokens. The proposed feature engineering scheme aims at encoding shallow lexical features, possible familiarity to readers, interactions of a target token with other words in its context, and statistical language model features.

Task Setup
The shared task is designed to predict five Eye-Tracking features, namely number of fixations (nF), first fixation duration (FFD), total reading time (TR), go-past time (GP), and fixation proportion (fxP). The ZuCo Eye-Tracking dataset is used for the present task (Hollenstein et al., 2018, 2020, 2021). The dataset contains three subsets, Train, Trial, and Test, which contain 700, 100, and 191 sentences, respectively. Their respective token counts are 13765, 1971, and 3554. Each input token is uniquely represented by a tuple <sid, wid>, where sid is the sentence_id and wid is the word_id. Mean Absolute Error (MAE) is used for evaluation.

Feature Engineering
For the above-mentioned task, linguistic features for a given input token are extracted in order to encode the lexical, syntactic, and contextual properties of the token. Additionally, the familiarity of the input token and its collocation with surrounding words are also modelled, as explained below.

Shallow Lexical Features
It is intuitive that the lexical properties of a given input token affect the amount of time spent on reading the word. Features such as the number of letters (Nlets), vowels (Nvow), syllables (Nsyl), phonemes (Nphon), and morphemes (Nmorph), and the percentage of upper-case characters (PerUp) in the input token are used to model the shallow lexical characteristics of the target token. A feature (IsNamed) is used to indicate whether the input token is a Named Entity. The language of etymological origin of the target token, e.g., Latin or French, is also considered as a feature, named EtyOrig.
In addition, several Boolean features have been used for characterization of the input token. Input tokens which are the last words of their respective sentences are suffixed by the string <EOS>. These are identified by a Boolean feature (IsLast), and the <EOS> string is removed before further feature extraction. Two Boolean features (IsNumber, Hyphen) are used to indicate whether the input token is numeric, and whether the target token contains multiple words connected by hyphens, respectively. A Boolean feature (IsPossessive) indicates that the input token is a possessive word; the identification is done with the help of the POS tag from the SpaCy library and the presence of an apostrophe. A Boolean feature (StartPunct) is used to identify inputs starting with a punctuation character; these punctuation characters are removed before further feature extraction. Furthermore, we have considered two sentence-level features, namely the total number of tokens in the sentence (LenSent) and the relative position (Relpos) of the input token in the sentence.
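A minimal sketch of how these Boolean and positional features might be extracted. The feature names follow the paper; the exact extraction rules (e.g., the numeric check) are our assumptions, not the authors' implementation:

```python
import string

def shallow_features(token, position, sent_len):
    """Sketch of the Boolean shallow-lexical features described above.
    Feature names (IsLast, IsNumber, Hyphen, StartPunct, LenSent, Relpos)
    follow the paper; the extraction heuristics are illustrative."""
    is_last = token.endswith("<EOS>")
    if is_last:                      # strip the end-of-sentence marker first
        token = token[: -len("<EOS>")]
    return {
        "IsLast": is_last,
        "IsNumber": token.replace(".", "", 1).replace(",", "").isdigit(),
        "Hyphen": "-" in token.strip("-"),      # hyphen joining multiple words
        "StartPunct": bool(token) and token[0] in string.punctuation,
        "LenSent": sent_len,
        "Relpos": position / sent_len,          # relative position in the sentence
    }
```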

Modelling Familiarity
In the present work, the familiarity of a token is modelled using various frequency based features as described below.
A Boolean feature (IsStopword) is used to indicate whether the token is a stopword. It has been observed that the gaze time for stopwords, such as a, an, and of, is much less than that for uncommon words, such as grandiloquent <457, 20> and contrivance <715, 4>. This feature has been extracted using NLTK's list of English stopwords.
Corpus-based features are used to indicate the common usage of input tokens. A Boolean feature (InGoogle) indicates whether the input token belongs to the list of the 10,000 most common English words, as determined by n-gram frequency analysis of Google's Trillion Word Corpus. Similarly, a Boolean feature (InOgden) indicates the presence of the input token in the list of 1,000 words included in Ogden's Basic English.
Frequency-based features are also used to model the familiarity of input tokens. The following features are used: frequency of the input token in Ogden's Basic English (OgdenFreq), the Exquisite Corpus (ECFreq), and SUBTLEX (SUBTFreq). The Exquisite Corpus compiles texts from seven different domains. SUBTLEX provides word frequencies calculated on a 51-million-word corpus of movie subtitles. Contextual Diversity (ConDiversity), reported in SUBTLEX and computed as the percentage of movies in which the word appears, is also used as a feature. Furthermore, frequencies of the input tokens given in the L count of Thorndike and Lorge (1944) and in the London-Lund Corpus of English Conversation (Brown, 1984) are also used as features (TLFreq, BrownFreq).
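A toy sketch of these list and frequency lookups. The miniature word lists and frequency entries below are hypothetical placeholders; the actual features are read from the full resources named above:

```python
# Hypothetical miniature stand-ins for the real word lists and frequency norms
GOOGLE_10K = {"the", "time", "people", "early"}   # 10,000 most common words
OGDEN_1000 = {"the", "time", "people"}            # Ogden's Basic English
STOPWORDS = {"a", "an", "of", "the"}              # e.g. NLTK's English stopwords
SUBTLEX = {"early": (12_000, 54.2)}               # word -> (frequency, contextual diversity %)

def familiarity_features(token):
    """Membership and frequency features modelling familiarity of a token."""
    token = token.lower()
    freq, diversity = SUBTLEX.get(token, (0, 0.0))
    return {
        "IsStopword": token in STOPWORDS,
        "InGoogle": token in GOOGLE_10K,
        "InOgden": token in OGDEN_1000,
        "SUBTFreq": freq,
        "ConDiversity": diversity,
    }
```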
The probability of the input token calculated using bigram and trigram character language models is also considered as a feature (CharProb2, CharProb3). The probability is lower for words whose letters have an unusual ordering. For example, for the tokens crazy <350, 3> and czar <525, 28>, CharProb2(crazy) > CharProb2(czar), because the letter bigram cr (cry, crazy, create, cream, secret) is more common than the bigram cz (czar, eczema) amongst English words. The letter bigram and trigram probabilities are calculated using letter counts from Google's Trillion Word Corpus. Suppose a word W consists of N letters, W = l_1 . . . l_N; then the corresponding bigram feature value is calculated as:

CharProb2(W) = ∏_{i=2}^{N} P(l_i | l_{i−1})

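The character-bigram probability can be sketched as below. The tiny training word list and the add-one smoothing are illustrative assumptions; the paper derives its counts from the full Trillion Word Corpus:

```python
from collections import Counter

def char_bigram_model(words):
    """Collect letter unigram and bigram counts from a word list
    (a stand-in for the Trillion Word Corpus letter counts)."""
    bigrams, unigrams = Counter(), Counter()
    for w in words:
        w = w.lower()
        unigrams.update(w)
        bigrams.update(w[i:i + 2] for i in range(len(w) - 1))
    return bigrams, unigrams

def char_prob2(word, bigrams, unigrams):
    """CharProb2(W) = prod_i P(l_i | l_{i-1}); add-one smoothing is
    an assumption to keep unseen bigrams from zeroing the product."""
    word = word.lower()
    vocab = len(unigrams)
    p = 1.0
    for i in range(1, len(word)):
        p *= (bigrams[word[i - 1:i + 1]] + 1) / (unigrams[word[i - 1]] + vocab)
    return p
```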
Modelling Context
There is a significant variation in the amount of time spent on comprehending the semantics of a word in different sentences. Variation in fixation time for the token early in different sentences is presented in Table 1. To model this variation, it is important to include features with respect to the context of the input word. Both the simple Universal POS tag (UniTag) and the detailed Penn POS tag (PennTag) of the input token, extracted using SpaCy, are considered as features. The POS of a target word depends on the context in which it appears, as shown in Table 2.
The numbers of synsets (Nsyn), hyponyms (Nhypo), and hypernyms (Nhyper) extracted from NLTK WordNet are also used as features. These features are calculated considering only the synsets having the same POS tag as the input token. The dependency tree of a sentence helps to understand the relationships between its words. In this respect, the dependency tag of the input token with respect to its syntactic head (DepTag) and the POS tag of the head (HeadPOS) are considered as features. Additionally, two features are extracted from the dependency tree, namely the depth of the input token in the tree (TokDepth) and the number of children of the input token (NChild).
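The two tree-shape features can be computed directly from a head-index representation of the parse. This sketch uses a plain list of head indices for illustration, rather than the spaCy parse objects used in the paper:

```python
def dependency_features(heads, idx):
    """TokDepth and NChild for token `idx`, given a dependency tree as a
    head-index list (heads[i] = index of token i's head; the root points
    to itself). A stand-in for reading these off spaCy's parse."""
    depth, node = 0, idx
    while heads[node] != node:            # walk up until we reach the root
        node = heads[node]
        depth += 1
    n_child = sum(1 for i, h in enumerate(heads) if h == idx and i != idx)
    return {"TokDepth": depth, "NChild": n_child}
```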

Language Model Features
Statistical n-gram language models help to model the collocation of words in sentences and to determine the probability of a word sequence. In the present work, we use a trigram language model trained on the Gigaword corpus (lm_giga_64k_nvp_3gram.zip) to extract two features (FragScore3, FragScore5), which measure the language model score of a word sequence containing the input token and its context words in a window of 3 and 5, respectively.
We use an n-gram language model to calculate the conditional probability of a word given the preceding n−1 words. In particular, two features corresponding to the average conditional probabilities (AvgCondP3, AvgCondP2) have been extracted using the aforementioned trigram language model and a bigram model trained on Google's Trillion Word Corpus. For words near the sentence boundary, the average is adjusted accordingly. If P2 denotes the bigram language model probability, then

AvgCondP2(w_n) = (1/2) Σ_{k=n}^{n+1} P2(w_k | w_{k−1})

Sentences with higher perplexity contain uncommon word sequences which may require more time to comprehend. The perplexity of the sentence, calculated using the trigram language model, is also considered as a feature (Perplexity):

Perplexity(S) = (1 / P3(w_1 w_2 . . . w_N))^(1/N)
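These two quantities can be sketched as follows. Here `probs` is a hypothetical list where `probs[k]` holds the model's conditional probability of word k given its history; the real values would come from the n-gram models named above:

```python
import math

def avg_cond_p2(probs, n):
    """AvgCondP2(w_n) = (1/2) * sum_{k=n}^{n+1} P2(w_k | w_{k-1});
    near the sentence boundary only the available terms are averaged."""
    terms = [probs[k] for k in (n, n + 1) if 0 <= k < len(probs)]
    return sum(terms) / len(terms)

def perplexity(probs):
    """Perplexity(S) = (1 / P(w_1 ... w_N)) ** (1/N), computed in log
    space for numerical stability."""
    n = len(probs)
    log_p = sum(math.log(p) for p in probs)
    return math.exp(-log_p / n)
```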

Description of Algorithms
Experiments were conducted using the following machine learning regression algorithms:
• Partial Least Square Regression (PLS): This method aims at fitting a linear regression model by projecting the dependent and independent variables into a new space.
• Neural Network (NN): NN based regression method aims at predicting the value of the dependent variable as a function of input variables via a collection of interconnected nodes.
• Decision Tree (DT): The regression model is built in the form of a tree structure by breaking the dataset into smaller subsets.
• Random Forest (RF): RF regressor fits a multitude of decision trees on various sub-samples of the dataset, and uses averaging to improve accuracy and control over-fitting.
• XGBoost (XG): Here, weak decision-tree learners are combined into a strong learner by sequentially training each new tree on the residuals of the current ensemble rather than by simple aggregation (Chen and Guestrin, 2016).
• Light Gradient Boosting Machine (LG): This method uses a histogram-based boosting algorithm with Gradient-based One-Side Sampling, which retains data points with large gradients and randomly samples the rest (Ke et al., 2017).
• CatBoost (CB): This method takes advantage of categorical features, which are otherwise converted to numerical features in traditional gradient boosting algorithms. CB uses oblivious trees as base predictors, which use the same splitting criterion across an entire level of the tree and hence are less prone to overfitting (Prokhorenkova et al., 2018).
Since five target Eye-Tracking metrics had to be predicted, the Multioutput (MO) and Regressor Chain (RC) wrappers from sklearn were deployed.
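The two wrappers can be sketched as below; a stock DecisionTreeRegressor stands in for the CatBoost model, and the data are random placeholders:

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor, RegressorChain
from sklearn.tree import DecisionTreeRegressor

# Toy data: 10 tokens x 4 features -> 5 eye-tracking targets (nF, FFD, GP, TR, fxP)
rng = np.random.default_rng(0)
X, y = rng.random((10, 4)), rng.random((10, 5))

# MO fits one independent regressor per target ...
mo = MultiOutputRegressor(DecisionTreeRegressor(max_depth=3)).fit(X, y)

# ... while RC feeds earlier predictions as inputs to later targets,
# in the given target order (here the order the paper reports for CB+RC)
rc = RegressorChain(DecisionTreeRegressor(max_depth=3), order=[0, 4, 1, 2, 3]).fit(X, y)
```

Either wrapper exposes a `predict` returning one value per target, so the five Eye-Tracking features come out of a single call.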

Experimental Details
Input tokens containing only punctuation were removed. The Eye-Tracking features for the token '&' are assigned a fixed value; for all other punctuation tokens, the assigned Eye-Tracking feature value is 0. SpaCy is used for POS tagging, lemmatization, dependency parsing, and NER. The stopword, corpus, and frequency features described in Section 3.2 were extracted after lower-casing and lemmatizing the input token. For RC, the order is tuned over the 120 possible target permutations, and the max_depth, denoted d, is tuned from 1 to 15. For NN, the number of intermediate dense layers is tuned from 1 to 4, the layer dimension is tuned over {10, 25, 50, 100, 150, 200, 250, 300, 500}, and the dropout rates are tuned between 0 and 1. The ReLU activation function is used in the intermediate dense layers, the batch size is set to 32, the learning rate is set to 0.005, and MAE is minimized using the Adam optimizer (Kingma and Ba, 2015).

Results
The individual MAE for the five predicted features, along with the overall MAE, for the various regression techniques are reported in Table 3. For NN, two dense layers with dimensions 100 and 200 and corresponding dropouts of 0.13 and 0.02 were used. In the present work, CB outperforms the other regression algorithms. This can be attributed to the permutation-driven ordered boosting technique of CB and its effective use of categorical features. It can be observed that CB+MO performed the best on the Test Dataset. CB+RC with order (0, 4, 1, 2, 3) improved the performance on the Trial Dataset; however, it did not have the same effect on the Test Data. The MAE of the proposed system is within 0.14 of the top performer.

Analysis
System predictions are presented in Table 4. The model had the highest MAE for the token <824, 16>, which contained alphanumeric characters, because the features failed to capture its properties. For the token <900, 9>, the gold labels are 0, but the system predicts positive values. The true gaze features nF, GP, and TR for the multi-hyphenated and repeated token <874, 20> are found to be higher than the predicted values. However, the predictions of the system for the tokens <951, 5> and <976, 26> are close to the true values. The MAE for the token 'with' in sentence 828 is very low, while in sentence 933 it is very high. This is because there is a large variation in the true Eye-Tracking values, while the variation in the predicted values is low.
To analyze the importance of each feature, the corresponding feature is eliminated and the CB+MO model is retrained on the reduced feature space. It was observed that elimination of each individual feature increased the error; thus, each feature contributes to the overall performance of the system. The MAE on the Trial Set corresponding to the elimination of individual features is reported in Table 5. The feature Relpos, which indicates the relative position of the token in the sentence, emerged as the most important feature.
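This leave-one-feature-out ablation can be sketched as follows. The feature names, toy data, and stand-in tree regressor (in place of the CB+MO model) are all illustrative:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

feature_names = ["Relpos", "Nlets", "SUBTFreq", "UniTag"]   # illustrative subset
rng = np.random.default_rng(1)
X, y = rng.random((40, 4)), rng.random(40)

scores = {}
for i, name in enumerate(feature_names):
    X_red = np.delete(X, i, axis=1)          # drop one feature column
    model = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_red, y)
    scores[name] = mean_absolute_error(y, model.predict(X_red))
# A feature whose removal yields the largest MAE increase is the most important.
```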

Conclusion and Future Work
Automatic prediction of gaze features without human intervention is important for scaling these features to tasks involving large datasets. The Shared Task aims at the prediction of five Eye-Tracking features for each token of a given sentence. In the present work, a set of linguistic features is extracted, focused on representing the shallow lexical characteristics of the token, the rarity of the token, and the interaction and collocation of the target token with its context. A CB+MO regressor trained on this feature space secured fourth rank in the Shared Task. Error analysis indicates that there is high variation in Eye-Tracking features for the same words in different contexts; however, the proposed system does not capture this variation well. In future work, we would like to incorporate more features to represent the context of the target token more effectively.