That Looks Hard: Characterizing Linguistic Complexity in Humans and Language Models

This paper investigates the relationship between two complementary perspectives in the human assessment of sentence complexity and how they are modeled in a neural language model (NLM). The first perspective takes into account multiple online behavioral metrics obtained from eye-tracking recordings. The second one concerns the offline perception of complexity measured by explicit human judgments. Using a broad spectrum of linguistic features modeling lexical, morpho-syntactic, and syntactic properties of sentences, we perform a comprehensive analysis of linguistic phenomena associated with the two complexity viewpoints and report similarities and differences. We then show the effectiveness of linguistic features when explicitly leveraged by a regression model for predicting sentence complexity and compare its results with the ones obtained by a fine-tuned neural language model. We finally probe the NLM’s linguistic competence before and after fine-tuning, highlighting how linguistic information encoded in representations changes when the model learns to predict complexity.


Introduction
From a human perspective, linguistic complexity concerns the difficulties encountered by a language user during sentence comprehension. The source of such difficulties is commonly investigated using either offline measures or online behavioral metrics. In the offline framework, complexity ratings can be elicited either by assessing errors in comprehension tests or by collecting explicit complexity judgments from readers. In the online paradigm, instead, cognitive signals are collected mainly through specialized machinery (e.g., MRI scanners, eye-tracking systems) during natural or task-oriented reading. Among the wide range of online complexity metrics, gaze data are widely regarded as reliable proxies of processing difficulties, reflecting both low- and high-level complexity features of the input (Rayner, 1998; Hahn and Keller, 2016). Eye-tracking measures have recently contributed to significant improvements across many popular NLP applications (Hollenstein et al., 2019a, 2020), and in particular on tasks related to linguistic complexity such as automatic readability assessment (ARA) (Ambati et al., 2016; Singh et al., 2016; González-Garduño and Søgaard, 2018), obtaining meaningful results for sentence-level classification into easy- and hard-to-read categories (Vajjala and Lučić, 2018; Evaldo Leal et al., 2020; Martinc et al., 2021). However, readability levels are conceptually very different from cognitive processing metrics, since ARA corpora are usually built in an automated fashion from parallel documents at different readability levels, without explicit evaluations of complexity by target readers (Vajjala and Lučić, 2019). A different approach to complexity assessment that directly accounts for the perspective of readers is presented in the corpus by Brunato et al. 
(2018), where sentences are individually labeled with annotators' perceived complexity, which may better reflect the underlying cognitive processing required by readers to parse the sentence. This consideration is supported by recent results highlighting the unpredictability of outliers in perceived complexity annotations, especially for sentences with complex syntactic structures (Sarti, 2020).
Given the relation between complexity judgments elicited from annotators and online cognitive processing metrics, we investigate whether the connection between the two perspectives can be highlighted empirically in human annotations and language model representations. We begin by leveraging linguistic features associated with a variety of sentence-level structural phenomena and analyzing their correlation with offline and online complexity metrics. We then evaluate the performance of models using either complexity-related explicit features or contextualized word embeddings, focusing mainly on the neural language model ALBERT (Lan et al., 2020). In this context, we show how both explicit features and learned representations obtain comparable results when predicting complexity scores. Finally, we focus on studying how complexity-related properties are encoded in the representations of ALBERT. This perspective goes in the direction of exploiting human processing data to address the interpretability issues of unsupervised language representations (Hollenstein et al., 2019b; Gauthier and Levy, 2019; Abnar et al., 2019). To this end, we rely on the probing task approach, a recently introduced technique within the area of NLM interpretability consisting of training diagnostic classifiers to probe the presence of encoded linguistic properties inside contextual representations (Conneau et al., 2018; Zhang and Bowman, 2018). We observe that fine-tuning on online and offline complexity produces an increase in probing performance for complexity-related features. This investigation has the specific purpose of studying whether and how learning a new task affects the linguistic properties encoded in pretrained representations. In fact, while pre-trained models have been widely studied using probing methods, the effect of fine-tuning on encoded information has seldom been investigated. For example, Merchant et al. 
(2020) found that fine-tuning does not heavily impact the linguistic information implicitly learned by the model, especially when considering a supervised probe closely related to a downstream task. Miaschi et al. (2020) further demonstrated a positive correlation between the model's ability to solve a downstream task on a specific input sentence and the related linguistic knowledge encoded in the language model. Nonetheless, to our knowledge, no previous work has considered sentence complexity assessment as a fine-tuning task for NLMs. Our results suggest that the competencies developed by the model during training are interpretable from a linguistic perspective and are possibly related to its predictive capabilities for complexity assessment.
Contributions To the best of our knowledge, this is the first work demonstrating the connection between online and offline complexity metrics and studying how they are represented by a neural language model. We a) provide a comprehensive analysis of linguistic phenomena correlated with eye-tracking data and the human perception of complexity, addressing similarities and differences from a linguistically-motivated perspective across metrics and at different levels of granularity; b) compare the performance of models using both explicit features and unsupervised contextual representations when predicting online and offline sentence complexity; and c) show the natural emergence of complexity-related linguistic phenomena in the representations of language models trained on complexity metrics.

Data and Preprocessing
Our study leverages two corpora, each capturing different aspects of linguistic complexity.

Eye-tracking For online complexity metrics, we used the monolingual English portion of GECO (Cop et al., 2017), an eye-tracking corpus based on the novel "The Mysterious Affair at Styles" by Agatha Christie. The corpus consists of 5,386 sentences annotated at the word level with eye-movement records of 14 English native speakers. We select four online metrics spanning multiple phases of cognitive processing, which are widely considered relevant proxies for linguistic processing in the brain (Demberg and Keller, 2008; Vasishth et al., 2013). We sum-aggregate these at the sentence level and average their values across participants to obtain the four online metrics presented in Table 1. As a final step to make the corpus more suitable for linguistic complexity analysis, we remove all utterances with fewer than 5 words. This design choice is adopted to ensure consistency with the perceived complexity corpus by Brunato et al. (2018).
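As a rough sketch, the aggregation step described above can be expressed as follows; the function name and the (sentence id, participant id, word index, value) record layout are illustrative assumptions, not GECO's actual schema:

```python
from collections import defaultdict

def aggregate_gaze(records, min_tokens=5):
    """Sum word-level gaze values within each sentence per participant,
    then average the per-participant sums across readers.
    records: iterable of (sentence_id, participant_id, word_idx, value)."""
    per_reader = defaultdict(float)  # (sentence, participant) -> summed value
    words = defaultdict(set)         # sentence -> distinct word indices seen
    for sent_id, part_id, word_idx, value in records:
        per_reader[(sent_id, part_id)] += value
        words[sent_id].add(word_idx)
    # collect the per-participant sums for each sentence
    sums = defaultdict(list)
    for (sent_id, _), total in per_reader.items():
        sums[sent_id].append(total)
    return {
        sent_id: sum(vals) / len(vals)
        for sent_id, vals in sums.items()
        if len(words[sent_id]) >= min_tokens  # drop utterances under 5 words
    }
```

The same sum-then-average scheme would be repeated once per gaze measure (FXC, FPD, TFD, TRD).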
Perceived Complexity For the offline evaluation of sentence complexity, we used the English portion of the corpus by Brunato et al. (2018). The corpus contains 1,200 sentences taken from the Wall Street Journal section of the Penn Treebank (McDonald et al., 2013), with uniformly-distributed lengths ranging between 10 and 35 tokens. Each sentence is associated with 20 ratings of perceived complexity on a 1-to-7 point scale. Ratings were assigned by English native speakers on the CrowdFlower platform. To reduce the noise produced by the annotation procedure, we removed duplicates and sentences for which less than half of the annotators agreed on a score in the range µ_n ± σ_n, where µ_n and σ_n are, respectively, the average and standard deviation of all annotators' judgments for sentence n. Again, we average scores across annotators to obtain a single metric for each sentence. Table 2 presents an overview of the two corpora after preprocessing. The resulting eye-tracking (ET) corpus contains roughly four times more sentences than the perceived complexity (PC) one, with shorter words and sentences on average.
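The agreement filter can be sketched as below, assuming a mapping from sentences to their rating lists; the duplicate-removal step is omitted and the helper name is our own:

```python
import statistics

def filter_judgments(sentences):
    """Keep a sentence only if at least half of its annotators rated it
    within mu_n +/- sigma_n of the sentence's own rating distribution;
    return the mean rating for the survivors.
    sentences: dict mapping sentence -> list of complexity ratings."""
    kept = {}
    for sent, ratings in sentences.items():
        mu = statistics.mean(ratings)
        sigma = statistics.pstdev(ratings)  # population std over annotators
        in_range = [r for r in ratings if mu - sigma <= r <= mu + sigma]
        if len(in_range) >= len(ratings) / 2:
            kept[sent] = mu  # single offline score per sentence
    return kept
```

A sentence with a bimodal rating distribution (strong annotator disagreement) ends up with few ratings inside one standard deviation of the mean and is discarded.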

Analysis of Linguistic Phenomena
As a first step to investigate the connection between the two complexity paradigms, we evaluate the correlation of online and offline complexity labels with linguistic features modeling a number of properties of sentence structure. To this end, we rely on the Profiling-UD tool (Brunato et al., 2020) to annotate each sentence in our corpora and extract ∼100 features representing their linguistic structure according to the Universal Dependencies formalism (Nivre et al., 2016). These features capture a comprehensive set of phenomena, from basic information (e.g. sentence and word length) to more complex aspects of sentence structure (e.g. parse tree depth, verb arity), including properties related to sentence complexity at different levels of description. A summary of the most relevant features in our analysis is presented in Table 3. Figure 1 reports correlation scores for features showing a strong connection (|ρ| > 0.3) with at least one of the evaluated metrics. Features are ranked by their Spearman's correlation with the complexity metrics, and the scores highlight the relation between linguistic phenomena and complexity paradigms. We observe that features showing a significant correlation with eye-tracking metrics are twice as many as those correlating with PC scores and generally tend to have higher coefficients, except for total regression duration (TRD). Nevertheless, the most correlated features are the same across all metrics. As expected, sentence length (n_tokens) and other related features capturing aspects of structural complexity occupy the top positions in the ranking. Among those, we also find the length of dependency links (max_links_len, avg_links_len) and the depth of the whole parse tree or of selected sub-trees, i.e. nominal chains headed by a preposition (parse_depth, n_prep_chains). 
Similarly, the distribution of subordinate clauses (sub_prop_dist, sub_post) is positively correlated with all metrics, with a stronger effect for the eye-tracking ones, especially in the presence of longer embedded chains (sub_chain_len). Interestingly, the presence of numbers (upos_NUM, dep_nummod) affects only the explicit perception of complexity and is never strongly correlated with any eye-tracking metric. This finding is expected, since numbers are very short tokens and, like other functional parts of speech, were never found to be strongly correlated with online reading in our results. Conversely, numerical information has been identified as a factor hampering sentence readability and understanding (Rello et al., 2013).
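The feature ranking behind this analysis can be sketched as follows, using a simple rank-based Spearman estimate on hypothetical aligned arrays (a tie-corrected implementation, e.g. scipy.stats.spearmanr, would be used in practice):

```python
import numpy as np

def spearman(a, b):
    """Spearman's rho as Pearson correlation of ranks.
    Note: argsort-based ranks break ties arbitrarily (no tie correction)."""
    rank = lambda v: np.argsort(np.argsort(v))
    return float(np.corrcoef(rank(a), rank(b))[0, 1])

def rank_features(feature_matrix, feature_names, metric, threshold=0.3):
    """Keep features with |rho| > threshold against one complexity metric,
    sorted by |rho| in descending order, as in the paper's Figure 1."""
    scores = [(name, spearman(col, metric))
              for name, col in zip(feature_names, feature_matrix.T)]
    kept = [(name, rho) for name, rho in scores if abs(rho) > threshold]
    return sorted(kept, key=lambda p: abs(p[1]), reverse=True)
```

Repeating this per metric (PC, FXC, FPD, TFD, TRD) yields one ranked feature list per complexity viewpoint, which can then be compared across paradigms.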
Unsurprisingly, sentence length is the most correlated predictor for all complexity metrics. Since many linguistic features highlighted in our analysis are strongly related to sentence length, we tested whether they maintain a relevant influence when this parameter is controlled. To this end, Spearman's correlation was computed between features and complexity metrics, this time considering bins of sentences having approximately the same length. Specifically, we split each corpus into 6 bins of sentences with 10, 15, 20, 25, 30 and 35 tokens respectively, with a margin of ±1 token per bin to select a reasonable number of sentences for our analysis. Figure 2 reports the new rankings of the most correlated linguistic features within each bin across complexity metrics (|ρ| > 0.2). Again, we observe that fewer features show a significant correlation with complexity scores in the PC bins than in the eye-tracking ones. This depends on controlling for sentence length, but also on the small size of the bins relative to the whole dataset. As in the coarse-grained analysis, TRD is the eye-tracking metric least correlated with linguistic features, while the other three (FXC, FPD, TFD) show a homogeneous behavior across bins. For the latter, vocabulary-related features (type-token ratio, average word length, lexical density) are always ranked on top (with a positive correlation) in all bins, especially for shorter sentences (i.e. from 10 to 20 tokens). For PC, this is true only for some of them (i.e. word length and lexical density). At the same time, features encoding numerical information are still highly correlated with the explicit perception of complexity in almost all bins. Interestingly, features modeling subordination phenomena extracted from fixed-length sentences exhibit the reverse trend compared to the whole corpus, i.e. they are negatively correlated with judgments. 
If, on the one hand, we expect an increase in the presence of subordination for longer sentences (possibly making sentences more convoluted), on the other hand, when length is controlled, our findings suggest that subordinate structures are not necessarily perceived as a symptom of sentence complexity. Our analysis also highlights that PC's relevant features are significantly different from those correlated with online eye-tracking metrics when controlling for sentence length, an aspect that was not evident from the previous coarse-grained analysis. We note that, despite controlling for sentence length, gaze measures are still significantly connected to length-related phenomena. This may be due to the ±1 margin applied for sentence selection and the high sensitivity of behavioral metrics to small changes in the input.
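The length-binned selection described above can be sketched as follows (token counts via a naive whitespace split, for illustration only):

```python
def length_bins(sentences, centers=(10, 15, 20, 25, 30, 35), margin=1):
    """Group sentences into bins of length c +/- margin tokens.
    Sentences whose length falls outside every bin are discarded."""
    bins = {c: [] for c in centers}
    for sent in sentences:
        n = len(sent.split())  # naive tokenization; real counts come from UD
        for c in centers:
            if abs(n - c) <= margin:
                bins[c].append(sent)
    return bins
```

Within each resulting bin, the same per-feature Spearman correlation is recomputed, so that length itself is (approximately) held constant.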

Predicting Online and Offline Linguistic Complexity
Given the high correlations reported above, we proceed to quantify the importance of explicit linguistic features from a modeling standpoint. Table 4 presents the RMSE and R² scores of predictions made by baselines and models for the selected complexity metrics. Performance is evaluated with 5-fold cross-validation regression with a fixed random seed on each metric. Our baselines use the average metric scores of all training sentences (Average) and the average scores of sentences binned by their length in number of tokens (Length-binned average) as predictions. The two linear SVM models leverage explicit linguistic features, using respectively only n_tokens (SVM length) and the whole set of ∼100 features (SVM feats). Besides those, we also test the performance of a state-of-the-art Transformer neural language model relying entirely on contextual word embeddings; we selected ALBERT for this purpose. From Table 4 we note that: i) the length-binned average baseline is very effective in predicting complexity scores and gaze metrics, which is unsurprising given the strong correlation between length and complexity metrics presented in Figure 1; ii) the SVM feats model shows considerable improvements over the length-only SVM model for all complexity metrics, highlighting how length alone accounts for much, but not all, of the variance in complexity scores; and iii) ALBERT performs on par with the SVM feats model on all complexity metrics despite the small size of the fine-tuning corpora and the absence of explicit linguistic information. A possible interpretation of ALBERT's strong performance is that the model implicitly develops competencies related to the phenomena encoded by linguistic features while training on online and offline complexity prediction. We explore this perspective in Section 5.
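As an illustration, the length-binned average baseline and the RMSE metric can be sketched as below; the class name and data layout are hypothetical, and the actual experiments use 5-fold cross-validation over the full feature set:

```python
from collections import defaultdict

class LengthBinnedAverage:
    """Predicts, for each test sentence, the mean complexity score of
    training sentences with the same token count; falls back to the
    global mean for lengths unseen at fit time."""

    def fit(self, lengths, scores):
        by_len = defaultdict(list)
        for n, y in zip(lengths, scores):
            by_len[n].append(y)
        self.mean_ = sum(scores) / len(scores)
        self.by_len_ = {n: sum(v) / len(v) for n, v in by_len.items()}
        return self

    def predict(self, lengths):
        return [self.by_len_.get(n, self.mean_) for n in lengths]

def rmse(y_true, y_pred):
    """Root mean squared error, the evaluation metric of Table 4."""
    n = len(y_true)
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n) ** 0.5
```

Because sentence length alone already explains a large share of the variance in both online and offline scores, this trivial baseline is a demanding reference point for the feature-based and neural models.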
As a final step in the study of feature-based models, we inspect the importance accorded by the SVM feats model to the features highlighted in the previous sections. Table 5 presents the coefficient ranks produced by SVM feats for all sentences and for the 10±1 length bin, which was selected as the broadest subset. Despite evident similarities with the previous correlation analysis, we encounter some differences that are possibly attributable to the model's inability to capture nonlinear relations. In particular, the SVM model still finds sentence length and related structural features highly relevant for all complexity metrics. However, especially for PC, lexical features also appear in the top positions (e.g. lexical density, ttr_lemma, char_per_tok), as well as specific features related to verbal predicate information (e.g. xpos_dist_VBZ, xpos_dist_VBN). This holds both for all sentences and when considering single length-binned subsets. While in the correlation analysis the eye-tracking metrics were almost indistinguishable, they behave quite differently when considering how linguistic features are used for inference by the linear SVM model. In particular, the fixation count metric (FXC) consistently behaves differently from the other gaze measures, even when controlling for length.
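The coefficient ranking of Table 5 reduces to a signed sort over a fitted linear model's weights; a minimal sketch with a hypothetical helper name:

```python
def coefficient_ranks(feature_names, coefs):
    """Rank features by the signed coefficient a linear model assigned
    to them: the most positive coefficient gets rank 1, the most
    negative the last rank."""
    order = sorted(range(len(coefs)), key=lambda i: coefs[i], reverse=True)
    return {feature_names[i]: rank + 1 for rank, i in enumerate(order)}
```

With a scikit-learn linear SVM, `coefs` would come from the fitted model's coefficient vector; the top and bottom ten ranks correspond to the orange and cyan cells of Table 5.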

Probing Linguistic Phenomena in ALBERT Representations
As shown in Table 4, ALBERT's performance on the PC and eye-tracking corpora is comparable to that obtained using a linear SVM with explicit linguistic features. To investigate whether ALBERT encodes the linguistic knowledge that we identified as strongly correlated with online and perceived sentence complexity during training and prediction, we adopt the probing task testing paradigm. The aim of this analysis is two-fold: i) probing the presence of complexity-related information encoded by ALBERT representations during the pre-training process, especially in relation to the analyzed features; and ii) verifying whether, and in which respects, this competence is affected by fine-tuning on complexity assessment tasks.

Table 5: Rankings based on the coefficients assigned by SVM feats for all metrics, for all sentences and for the 10±1 length bin. The top ten positive and negative features are marked in orange and cyan, respectively; "/" marks features present in less than 5% of sentences.
To conduct the probing experiments, we aggregate three UD English treebanks representative of different genres, namely EWT, GUM and ParTUT (Silveira et al., 2014; Zeldes, 2017; Sanguinetti and Bosco, 2015). We thus obtain a corpus of 18,079 sentences and use the Profiling-UD tool to extract n sentence-level linguistic features Z = {z_1, ..., z_n} from gold linguistic annotations. We then generate representations A(x) of all sentences in the corpus using the last-layer [CLS] embedding of a pretrained ALBERT base model without additional fine-tuning, and train n single-layer perceptron regressors g_i : A(x) → z_i that learn to map the representations A(x) to each linguistic feature z_i. We finally evaluate the error and R² scores of each g_i as a proxy for how well the representations A(x) encode their respective linguistic feature z_i. We repeat the same evaluation for ALBERT models fine-tuned respectively on perceived complexity (PC) and on all eye-tracking labels with multitask learning (ET), averaging scores with 5-fold cross-validation. Results are shown on the left side of Table 6.
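A minimal sketch of one probing regressor g_i, with a toy embedding matrix X standing in for the frozen [CLS] representations; plain gradient descent is used here for self-containedness, whereas the paper's single-layer perceptron setup may differ in optimizer and training details:

```python
import numpy as np

def train_probe(X, z, epochs=2000, lr=0.3):
    """Linear probe g_i: A(x) -> z_i, fit with MSE via gradient descent.
    X: (n_sentences, hidden) frozen embeddings; z: (n_sentences,) feature."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        pred = X @ w + b
        grad = pred - z              # gradient of 0.5 * MSE w.r.t. pred
        w -= lr * (X.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def r2_score(z, pred):
    """Coefficient of determination used to score each probe."""
    ss_res = ((z - pred) ** 2).sum()
    ss_tot = ((z - z.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot
```

One such probe is trained per linguistic feature; a higher R² indicates that the feature is linearly recoverable from the frozen sentence embedding.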
As we can see, ALBERT's last-layer sentence representations encode relatively little knowledge of the complexity-related probes, but performance on them increases substantially after fine-tuning. Specifically, a noticeable improvement is obtained on features that were already better encoded in the base pretrained representations, i.e. sentence length and related features, suggesting that fine-tuning may simply accentuate properties already well known to the model, regardless of the target task. To verify that this is not the case, we repeat the same experiments on ALBERT models fine-tuned on the smallest length-binned subset (i.e. 10±1 tokens) presented in previous sections. The right side of Table 6 presents these results. We know from our length-binned analysis of Figure 2 that PC scores become largely decoupled from length-related properties, while ET scores remain significantly affected despite our controlling of sequence size. This also holds for the length-binned probing task results, where the PC model seems to neglect length-related properties in favor of others, the same highlighted in our fine-grained correlation analysis (e.g. word length, numbers, explicit subjects).

Table 6: RMSE and R² scores for diagnostic regressors trained on ALBERT representations, respectively without fine-tuning (Base), and with PC and eye-tracking (ET) fine-tuning on all data (left) and on the 10±1 length-binned subset (right). Bold values highlight relevant increases in R² from Base.
The ET-trained model confirms the same behavior, retaining strong but lower performance on length-related features. We note that, for all metrics, features that were highly relevant only for the SVM predictions, such as those encoding verbal inflectional morphology or vocabulary-related ones (Table 5), are not affected by the fine-tuning process. Despite obtaining the same accuracy as an SVM, the neural language model thus seems to address the task more similarly to humans when accounting for correlation scores (Figure 2). A more extensive analysis of the relation between human behavior and the predictions of different models is left as an interesting direction for future work.
To conclude, although higher probing task performance after fine-tuning on complexity metrics should not be interpreted as direct proof that the neural language model is exploiting newly acquired morpho-syntactic and syntactic information, it suggests an importance shift in the NLM's representations, triggered by fine-tuning, that produces an encoding of linguistic properties better able to model the human assessment of complexity.

Conclusion
This paper investigated the connection between eye-tracking metrics and the explicit perception of sentence complexity from an experimental standpoint. We performed an in-depth correlation analysis between complexity scores and sentence-level properties at different granularity levels, highlighting how all metrics are strongly connected to sentence length and related properties, but also revealing different behaviors when controlling for length. We then evaluated models using explicit linguistic features and unsupervised word embeddings to predict complexity, showing comparable performance across metrics. We finally tested the encoding of linguistic properties in the contextual representations of a neural language model, noting the natural emergence of task-related linguistic properties within the model's representations after training. We thus conjecture that a relation exists between the linguistic knowledge acquired by the model during training and its downstream performance on tasks in which morpho-syntactic and syntactic structures play a relevant role. In the future, we would like to comprehensively test the effectiveness of tasks inspired by human language learning as intermediate steps toward training more robust and parsimonious neural language models.

Broader Impact and Ethical Perspectives
The findings described in this work are mostly intended to evaluate recent efforts in the computational modeling of linguistic complexity. That said, some of the models and procedures described here can clearly benefit society. For example, models trained to predict reading patterns may be used in educational settings to identify difficult passages that can be simplified, improving reading comprehension for students in a fully personalizable way. However, it is essential to recognize the potentially malicious usage of such systems. The integration of eye-tracking systems in mobile devices, paired with the predictive models presented in this work, could be used to build harmful surveillance systems and advertisement platforms that exploit gaze predictions for behavioral manipulation. In terms of research impact, the experiments presented in this work may provide useful insights into the behavior of neural language models for researchers working in the fields of NLP interpretability and computational psycholinguistics.

A Parametrization and Fine-tuning Details for ALBERT
We leverage the pretrained albert-base-v2 checkpoint available in HuggingFace's Transformers framework (Wolf et al., 2020) and use adapted scripts and classes from the FARM framework (Deepset, 2019) to perform multitask learning on eye-tracking metrics. Table 7 presents the parameters used to define models and training procedures for the experiments in Sections 4 and 5. During training we compute an MSE loss score L_m for each of the task-specific heads of the four eye-tracking metrics, m ∈ {FXC, FPD, TFD, TRD}, and perform a weighted sum to obtain the overall loss score optimized by the model:

L_ET = Σ_m w_m · L_m

The use of L_TRD was shown to have a positive impact on the overall predictive capabilities of the model only when down-weighted to prevent it from dominating the L_ET sum.
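A minimal sketch of this weighted multitask loss; the per-metric weight values below are hypothetical, since the paper only states that L_TRD is down-weighted:

```python
def multitask_loss(preds, targets, weights):
    """Weighted sum of per-task MSE losses, L_ET = sum_m w_m * L_m.
    preds / targets: dicts mapping metric name -> list of values;
    weights: dict mapping metric name -> loss weight w_m."""
    def mse(p, t):
        return sum((a - b) ** 2 for a, b in zip(p, t)) / len(p)
    return sum(weights[m] * mse(preds[m], targets[m]) for m in preds)
```

In the actual fine-tuning, each term would come from one task-specific regression head on top of the shared ALBERT encoder, and the summed scalar is what backpropagation optimizes.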
Probing tasks on linguistic features are performed by freezing the language model weights and training 1-layer heads as probing regressors over the last-layer [CLS] token for each feature. In this setting no loss weighting is applied, and the regressors are trained for 5 epochs without early stopping on the aggregated UD dataset.

B Examples of Sentences from Complexity Corpora

Table 8 presents examples of sentences randomly selected from the two corpora leveraged in this study. We highlight how eye-tracking scores show a very consistent relation with sentence length, while PC scores are much more variable. This fact suggests that the offline nature of PC judgments makes them less related to surface properties and more connected to syntax and semantics.

C Model Performance on Length-binned Sentences
Similarly to the approach adopted in Section 3, we test the performance of models on length-binned data to verify whether performance on length-controlled sequences is consistent with that achieved on the whole corpora. RMSE scores averaged with 5-fold cross-validation over the length-binned sentence subsets are presented in Figure 3. We note that ALBERT outperforms the SVM with linguistic features on nearly all lengths and metrics, showing the largest gains on intermediate bins for PC and gaze durations (FPD, TFD, TRD). Interestingly, the overall performance of the models follows a length-dependent increasing trend for eye-tracking metrics, but not for PC. We believe this behavior can be explained in terms of the high sensitivity to length previously highlighted for online metrics, as well as the variability in bin sizes (especially for the last bin, containing only 63 sentences). We finally observe that the SVM model based on explicit linguistic features (SVM feats) performs poorly on larger bins for all tasks, sometimes being even worse than the bin-average baseline. While we found this behavior surprising given the positive influence of features highlighted in Table 4, we believe it is mostly due to the small size of longer bins, which negatively impacts the generalization capabilities of the regressor. The relatively better scores achieved by ALBERT on those bins, instead, support the effectiveness of the information stored in pretrained language representations when a limited number of examples is available.

Table 8: Examples of sentences selected from each length-binned subset of the Perceived Complexity corpus (top) and the GECO corpus (bottom). Scores are aggregated following the procedure described in Section 2. Reading times (FPD, TFD, TRD) are expressed in milliseconds.

Figure 3: RMSE scores for the models of Table 4, computed with 5-fold cross-validation on the same length-binned subsets used for the analysis of Figure 2. Lower scores are better.