Assessing Language Proficiency from Eye Movements in Reading

We present a novel approach for determining learners’ second language proficiency which utilizes behavioral traces of eye movements during reading. Our approach provides stand-alone eyetracking based English proficiency scores which reflect the extent to which the learner’s gaze patterns in reading are similar to those of native English speakers. We show that our scores correlate strongly with standardized English proficiency tests. We also demonstrate that gaze information can be used to accurately predict the outcomes of such tests. Our approach yields the strongest performance when the test taker is presented with a suite of sentences for which we have eyetracking data from other readers. However, it remains effective even using eyetracking with sentences for which eye movement data have not been previously collected. By deriving proficiency as an automatic byproduct of eye movements during ordinary reading, our approach offers a potentially valuable new tool for second language proficiency assessment. More broadly, our results open the door to future methods for inferring reader characteristics from the behavioral traces of reading.


Introduction
It is currently estimated that over 1.5 billion people are learning English as a Second Language (ESL) worldwide. Their learning progress is commonly evaluated with classroom tests prepared by language instructors, quizzes in language learning software such as Duolingo and Rosetta Stone, and by official standardized language proficiency tests such as TOEFL, IELTS, MET and others. In "high stakes" scenarios, official language proficiency tests are the de facto standards for language assessment; they are accepted by educational and professional institutions, and are taken by millions of language learners every year (for example, over three million people took the IELTS test in 2016; IELTS, 2017). These tests probe language proficiency based on performance on various linguistic tasks, including grammar and vocabulary exams, reading and listening comprehension questions, as well as essay writing and speaking assignments.
Despite their ubiquity, traditional approaches to language proficiency testing have several drawbacks. First, such tests are typically prepared manually and require extensive resources for test development. Moreover, their validity can be undermined by test specific training, prior knowledge of the evaluation mechanisms (Powers et al., 2002), as well as plain cheating via unauthorized access to test materials. Further, the utilized testing and evaluation methodologies vary across different tests, and test materials are in most cases inaccessible to the research community. Perhaps most crucially, the reliance of these tests on the end products of linguistic tasks makes it challenging to study learners' language processing patterns and the difficulties they encounter in real time.
In this work we propose a novel methodology for language proficiency assessment which marks a significant departure from traditional language proficiency tests and addresses many of their drawbacks. In our approach, we determine language proficiency from broad coverage analysis of eye movements during reading of free-form text in a foreign language, a special case of the general problem of inferring comprehender characteristics and cognitive state from the measurable traces of real-time language processing. Our framework does not require the test taker to prepare for the test or to perform any hand-crafted linguistic tasks, but simply to attentively read an arbitrary set of sentences. To the best of our knowledge, this work is the first to propose and implement such an approach, yielding a novel language proficiency evaluation scheme which relies solely on ordinary reading.
Our framework builds on previous research in psycholinguistics demonstrating that the eyetracking record reflects how readers interact with the text and how language processing unfolds over time (Frazier and Rayner, 1982; Rayner, 1998; Rayner et al., 2012). In particular, it has been shown that key aspects of the reader's characteristics and cognitive state, such as mind wandering during reading (Reichle et al., 2010), dyslexia (Rello and Ballesteros, 2015) and native language (Berzak et al., 2017), can be inferred from their gaze record. Despite these advances, the potential of the rich and highly informative behavioral signal obtainable from human reading for automated inference about readers, and specifically about their linguistic proficiency, has thus far been largely unutilized.
Here, we first introduce EyeScore, an independent measure of ESL proficiency which reflects the extent to which a learner's English reading patterns resemble those of native speakers. Second, we present a regression model which uses gaze features to predict the learner's scores on specific external proficiency tests. We address each of our tasks in two data regimes: Fixed Text, which requires eyetracking training data for the specific sentences presented to the test taker, as well as the more general and challenging Any Text regime, where the test taker is presented with arbitrary sentences for which no previous eyetracking data is available. To enable prediction mechanisms in both regimes, we utilize previously proposed gaze features, and develop new linguistically and psychologically motivated feature sets which capture the interaction between eye movements and linguistic properties of the text.
We demonstrate the effectiveness of our approach via score comparison to standardized English proficiency tests. Our primary benchmark, taken in the lab by 145 ESL participants, comprises the grammar and listening sections of the Michigan English Test (MET), whose combined score ranges from 0 to 50. EyeScore yields a Pearson's correlation of 0.5 with MET in the Fixed Text regime, and 0.48 in the Any Text regime. Our regression model for predicting MET scores from eye movement features obtains a correlation of 0.7 and a Mean Absolute Error (MAE) of 3.31 points in the Fixed Text regime, and a correlation of 0.49 and an MAE of 4.11 in the Any Text regime. Our results are substantially stronger than a baseline using only raw reading speed, and are reasonably close to correlations among traditional proficiency tests. These outcomes confirm the promise of the proposed methodology to reliably measure language proficiency.
This paper is structured as follows. Section 2 describes the data and the experimental setup. In Section 3 we delineate our feature sets for characterizing eye movements in human reading. Section 4 introduces EyeScore, a second language proficiency metric which is based on similarity of reading patterns to native speakers. In Section 5 we use eyetracking patterns to predict scores on MET and TOEFL. In Section 6 we survey related work. Finally, we conclude and discuss future work in Section 7.

Experimental Setup
Our study uses the dataset of eye movement records and English proficiency scores introduced in Berzak et al. (2017), which we describe here in brief. The dataset contains gaze recordings of 37 native English speakers and 145 ESL speakers belonging to four native language backgrounds: 36 Chinese, 36 Japanese, 36 Portuguese and 37 Spanish. Participants were presented with free-form English sentences appearing as one-liners. To encourage attentive reading, each sentence was followed by a yes/no comprehension question. During the experiment participants held a controller with buttons for indicating sentence reading completion and answering the sentence comprehension questions. Participants' eye movements were recorded using a desktop mount EyeLink 1000 eyetracker (SR Research) at a sampling rate of 1000 Hz.

Procedure and Reading Materials
An experimental trial for a sentence starts with a presentation of a target circle at the center left of a blank screen. A 300ms fixation on this circle triggers a one-liner sentence on a new screen starting at the same location. After completing reading the sentence, participants are presented with the letter Q on a blank screen. A 300ms fixation on this letter triggers a question about the sentence on a new screen. Participants provide a yes/no answer to the question and are subsequently informed if they answered correctly. The first trial of the experiment was presented to familiarize participants with the experimental setup, and is discarded from the analysis.
Each participant read a total of 156 English sentences, randomly drawn from the Wall Street Journal Penn Treebank (WSJ-PTB) (Marcus et al., 1993). The maximal sentence length was set to 100 characters, yielding an average sentence length of 11.4 words. All the sentences include the manual PTB annotations of POS tags (Santorini, 1990) and phrase structure trees, as well as Google universal POS tags (Petrov et al., 2012) and dependency trees obtained from the Universal Dependency Treebank (UDT) (McDonald et al., 2013).

Experimental Regimes
Half of the 156 sentences presented to each participant belong to the Fixed Text regime, and the other half belong to the Any Text regime. Sentences from the two regimes were interleaved randomly and presented to all participants in the same order.
Fixed Text In this regime, all the participants read the same suite of 78 pre-selected sentences (900 words). The Fixed Text regime supports token-level comparisons of reading patterns for specific words in the same contexts across readers. It enables the construction of a proficiency test which relies on a fixed battery of reading materials for which previous eyetracking data was collected.
Any Text In the second, Any Text regime, different participants read different sets of 78 sentences each (880 words on average). This regime generalizes the Fixed Text scenario; predicting reader characteristics in this regime requires formulating type-level abstractions that would allow meaningful comparisons of reading patterns across different sentences. It corresponds to a proficiency test in which the sentences presented to the test taker are completely arbitrary, and no prior eyetracking data is available for them.

Standardized English Tests
We use participants' performance on the Michigan English Test (MET) and TOEFL as external benchmarks of their English proficiency.
Michigan English Test (MET) Our primary indicator of English proficiency is the listening and grammar sections of the MET (Form-B), which were administered by Berzak et al. (2017) in-lab, and taken by all the 145 non-native participants upon completion of the reading experiment. The test has a total of 50 multiple choice questions, comprising 20 listening comprehension questions and 30 written grammar questions. The test score is computed as the number of correct answers for these questions, with possible scores ranging from 0 to 50. The mean MET score in the dataset is 41.46 (std 6.27).
TOEFL Berzak et al. (2017) also collected self-reported scores on the most recently taken official English proficiency test, which we use here as a secondary evaluation benchmark. We focus on the most commonly reported test, the TOEFL-iBT, whose scores range from 0 to 120. We take into account only test results obtained less than four years prior to the experiment, yielding 33 participants. We sum the scores of the reading and listening sections of the test, with a total possible score range of 0 to 60. In cases where participants reported only the overall score, we divided that score by two. We further augment this data with 20 participants who took the TOEIC Listening and Reading test within the same four-year range, resulting in a total of 53 external proficiency scores. The TOEIC scores were converted to the TOEFL scale by fitting a third degree polynomial on an unofficial score conversion table between the tests. The converted scores were then divided by two. Henceforth we refer to both TOEFL-iBT scores and TOEIC scores converted to the TOEFL-iBT scale as TOEFL scores. The mean TOEFL score is 47.6 (std 9.55). The Pearson's r correlation between the TOEFL and MET scores in the dataset is 0.74.

Data Split
We divide the ESL speakers into training/development and test sets in the following manner. For MET, we split our 145 ESL participants into a training/development set of 88 participants and a test set of 57 participants. The test set consists of an entire held out native language (36 speakers of Portuguese) as well as 7 participants randomly sampled from each of the remaining three native languages. Our test set is thus particularly challenging due to the large fraction of participants belonging to the held out language, a design which emphasizes generalization to language learner populations which are not part of the training set. Figure 1 presents a schematic overview of our MET split. For TOEFL, due to the limited available data, in Section 4 we report EyeScore correlations for all the 53 test takers, and in Section 5 we perform regression experiments using leave-one-out cross validation.
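The MET split described above can be sketched in a few lines. This is only an illustration under stated assumptions: the participant IDs, the function name `split_participants` and the random seed are hypothetical, and the actual split used in the study is not reproduced here.

```python
import random
from typing import Dict, List, Tuple


def split_participants(
    by_language: Dict[str, List[str]],
    held_out: str,
    n_per_language: int = 7,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Return (train_dev, test) participant IDs.

    The whole held-out language goes to the test set; from every other
    language, n_per_language participants are sampled into the test set.
    """
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    train_dev: List[str] = []
    test: List[str] = list(by_language[held_out])
    for language, participants in by_language.items():
        if language == held_out:
            continue
        sampled = set(rng.sample(participants, n_per_language))
        test.extend(p for p in participants if p in sampled)
        train_dev.extend(p for p in participants if p not in sampled)
    return train_dev, test


# Hypothetical participant IDs with the group sizes reported in the text.
by_language = {
    "Chinese": [f"zh{i}" for i in range(36)],
    "Japanese": [f"ja{i}" for i in range(36)],
    "Portuguese": [f"pt{i}" for i in range(36)],
    "Spanish": [f"es{i}" for i in range(37)],
}
train_dev, test = split_participants(by_language, held_out="Portuguese")
print(len(train_dev), len(test))  # 88 57
```

With the reported group sizes, the test set holds 36 + 3 × 7 = 57 participants and the training/development set holds the remaining 88, matching the numbers above.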

Eye Movement Features
In order to capture behavioral psycholinguistic traces of language proficiency we utilize several linguistically and psychologically motivated feature representations of eye movements in reading. We include features introduced in prior work (see Words in Fixed Context and Syntactic Clusters (Berzak et al., 2017)) as well as newly developed feature sets (see Word Property Coefficients and Transitions). All our features rely on the well established division of gaze trajectories into fixations (stops) and saccades (movements between fixations) that characterizes human reading (Rayner, 1998).
Our fixation based features make use of several standard metrics of fixation times, defined below.
• First Fixation duration (FF) Duration of the first fixation on a word.
• First Pass duration (FP) Time spent from first entering a word to first leaving it (including re-fixations within the word).
• Total Fixation duration (TF) The sum of all fixation times on a word.
• Regression Path duration (RP) Time from first entering a word until proceeding to its right.
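As a minimal sketch of the four metrics listed above, the function below computes them for one word from a chronological list of fixations. The input representation (pairs of word index and duration in milliseconds), the function name and the toy trial are assumptions made for illustration; returning zeros for a skipped word is also a choice of this sketch rather than something stated in the text.

```python
from typing import Dict, List, Tuple

Fixation = Tuple[int, int]  # (word index, fixation duration in ms)


def fixation_metrics(fixations: List[Fixation], word: int) -> Dict[str, int]:
    """Compute FF, FP, TF and RP for one word from a chronological fixation list."""
    positions = [k for k, (w, _) in enumerate(fixations) if w == word]
    if not positions:  # the word was never fixated (skipped)
        return {"FF": 0, "FP": 0, "TF": 0, "RP": 0}
    start = positions[0]

    # FF: duration of the first fixation on the word.
    first_fixation = fixations[start][1]

    # TF: sum of all fixations on the word.
    total = sum(d for w, d in fixations if w == word)

    # FP: from first entering the word until first leaving it.
    first_pass = 0
    for w, d in fixations[start:]:
        if w != word:
            break
        first_pass += d

    # RP: from first entering the word until a fixation lands to its right.
    regression_path = 0
    for w, d in fixations[start:]:
        if w > word:
            break
        regression_path += d

    return {"FF": first_fixation, "FP": first_pass, "TF": total, "RP": regression_path}


# Toy trial: the reader regresses from word 1 back to word 0 before moving on.
trial = [(0, 200), (1, 180), (1, 120), (0, 150), (1, 100), (2, 210)]
metrics = fixation_metrics(trial, word=1)
print(metrics)  # {'FF': 180, 'FP': 300, 'TF': 400, 'RP': 550}
```

In the toy trial, RP exceeds TF for word 1 because the regression path also includes the 150 ms spent on word 0 before the reader finally proceeds to word 2.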
Our feature sets are divided into two groups. The first group consists of type-level features, applicable both in the Any Text and Fixed Text regimes. The second group of feature sets is token-based and can be extracted only in the Fixed Text regime, because it presupposes the same textual input for all participants.

Type-Level Features

Word Property Coefficients (WP-Coefficients)
This new feature set quantifies the influence of three key word characteristics on reading times of individual readers: word length, word frequency and surprisal. The last measures the difficulty of processing a word in a sentence (Hale, 2001; Levy, 2008), and is defined as its negative log probability given a sentential context:
$$\mathrm{surprisal}(w_i \mid w_{1 \ldots i-1}) = -\log p(w_i \mid w_{1 \ldots i-1}) \tag{1}$$
In the reading literature, these three characteristics were suggested as the most prominent linguistic factors influencing word reading times (e.g. Inhoff and Rayner, 1986; Rayner and Well, 1996; Pollatsek et al., 2008; Kliegl et al., 2004; Rayner et al., 2004, 2011; Smith and Levy, 2013; Luke and Christianson, 2016), whereby longer, less frequent and contextually less predictable words are fixated longer.
To derive this feature set, we measure length as the number of characters in the word. Word (log) frequencies are obtained from the BLLIP-WSJ corpus (Charniak et al., 2000). Estimates of surprisal are obtained from a trigram language model with Chen and Goodman's modified Kneser-Ney smoothing trained on the BLLIP-WSJ using SRILM (Stolcke et al., 2002). We then fit for each participant four regression models that use these three word characteristics to predict the word's raw FF, FP, TF and RP durations. The regression models are fitted using Ordinary Least Squares (OLS). We also train a logistic regression model for predicting word skips. Finally, we extract the weights and intercepts of these models and encode them as features. As each of the five models has three coefficients and one intercept term, the resulting WP-Coefficients feature set has 20 features.
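The OLS part of this procedure can be sketched as follows for a single participant and a single fixation metric. This is a pure-Python illustration, not the authors' implementation: the helper names, the toy word properties and the coefficients used to generate the toy reading times are all hypothetical, and the logistic regression model for word skips is left out.

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(word_properties: Sequence[Sequence[float]], times: Sequence[float]) -> List[float]:
    """Return [intercept, w_length, w_log_frequency, w_surprisal] via the normal equations."""
    rows = [[1.0, *props] for props in word_properties]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, times)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical (length, log frequency, surprisal) values for six words.
WORDS = [(3, 9.5, 2.0), (7, 4.0, 6.5), (5, 7.0, 3.0),
         (10, 2.5, 9.0), (4, 8.0, 5.0), (6, 5.0, 1.5)]
# Hypothetical "true" coefficients used only to generate noise-free toy durations.
TRUE_COEFFICIENTS = [100.0, 20.0, -5.0, 10.0]
durations = [TRUE_COEFFICIENTS[0] + sum(c * p for c, p in zip(TRUE_COEFFICIENTS[1:], w))
             for w in WORDS]

coefficients = fit_ols(WORDS, durations)
print([round(c, 3) for c in coefficients])
```

Repeating the fit for FF, FP, TF and RP, and adding the skip model, would give the five groups of four values that the text describes as the 20 WP-Coefficients features.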

Syntactic Clusters (S-Clusters)
Following Berzak et al. (2017), we extract average word reading times clustered by POS tags and syntactic functions. We utilize three metrics of reading times, FF, FP and TF durations. We then cluster words according to three types of syntactic criteria, Google Universal POS tags, PTB POS tags, and the syntactic function label of the word to its head word. To derive the feature set, we average the word fixation times of each cluster. An example of an S-Cluster feature is the average TF duration for words with the PTB POS tag DT. We take into account only cluster labels that appear at least once in the reading input of all the participants, yielding a total of 312 S-Clusters features in the Fixed Text regime. In the Any Text regime we obtain 156 S-Clusters features for MET and 165 S-Clusters features for TOEFL.
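A minimal sketch of the clustering step, assuming each word is represented as a (cluster label, reading time) pair for one metric and one labelling scheme. The function names and the toy readers are hypothetical; the label filter mirrors the requirement that a cluster label appear in the reading input of all participants.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def cluster_averages(words: List[Tuple[str, float]]) -> Dict[str, float]:
    """Average reading time per cluster label for one participant."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for label, duration in words:
        totals[label] += duration
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}


def shared_labels(per_participant: List[Dict[str, float]]) -> Set[str]:
    """Keep only labels observed for every participant."""
    return set.intersection(*(set(features) for features in per_participant))


# Toy TF durations (ms) keyed by PTB POS tag for two hypothetical readers.
reader_a = cluster_averages(
    [("DT", 120.0), ("NN", 250.0), ("DT", 100.0), ("VBD", 230.0), ("NN", 270.0)]
)
reader_b = cluster_averages([("DT", 90.0), ("NN", 210.0), ("JJ", 240.0)])

kept = sorted(shared_labels([reader_a, reader_b]))
features_a = [reader_a[label] for label in kept]
print(kept, features_a)  # ['DT', 'NN'] [110.0, 260.0]
```

Sorting the retained labels gives every participant the same feature order, which is needed before the vectors can be compared or fed to a regression model.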

Transitions
Transitions is a new feature set which summarizes the sequence of saccades between words in a sentence. Given a sentence with $n$ words, we construct an $n \times n$ matrix $T$. A matrix entry $t_{i,j}$ records the number of saccades whose launch site falls within word $i$ and landing site falls within word $j$. With a total of 11,616 possible transitions in the Fixed Text sentences, the resulting feature set contains 9,077 features with a non-zero value for at least one participant for MET, and 8,132 such features for TOEFL.
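The construction of the matrix can be sketched as below, assuming the gaze record for a sentence has already been reduced to the chronological sequence of fixated word indices. The function name and the toy sequence are hypothetical. Counting saccades that launch and land within the same word on the diagonal is an assumption of this sketch.

```python
from typing import List


def transition_matrix(fixated_words: List[int], n_words: int) -> List[List[int]]:
    """Count saccades from word i (launch site) to word j (landing site)."""
    t = [[0] * n_words for _ in range(n_words)]
    # Each pair of consecutive fixations corresponds to one saccade.
    for launch, landing in zip(fixated_words, fixated_words[1:]):
        t[launch][landing] += 1
    return t


# Toy sequence over a 4-word sentence: a refixation on word 1, a skip of
# word 2, a regression from word 3 to word 2, then a return to word 3.
sequence = [0, 1, 1, 3, 2, 3]
T = transition_matrix(sequence, n_words=4)
for row in T:
    print(row)
# [0, 1, 0, 0]
# [0, 1, 0, 1]
# [0, 0, 0, 1]
# [0, 0, 1, 0]
```

Flattening the matrices of all sentences into one vector yields one feature per (sentence, launch word, landing word) triple, which is why this representation requires that all participants read the same sentences.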

Words in Fixed Context (WFC)
This feature set was previously used in Berzak et al. (2017) and consists of reading times for words within fixed contexts. We extract FP and TF durations for the 900 words in the Fixed Text sentences, resulting in a total of 1,800 WFC features.

English Proficiency Scoring Based on Eye Movements in Reading
We hypothesize that language proficiency influences the way that learners process a second language, which in turn will be reflected in eye movement patterns in reading. Specifically, we examine whether the more proficient an ESL learner is, the more similar their reading patterns are to those of native English speakers. We operationalize the notion of native-like reading in the following manner. First, given a feature representation of choice and a dataset $D$ comprising ESL learners $D_{L2}$ and native speakers $D_{L1}$, we Z-score each feature in $D$ using a Z scaler derived from $D_{L2}$. We then obtain a prototype feature vector of native reading $v_{L1}$ by averaging the feature vectors of the native speakers.
Finally, we obtain an eyetracking based proficiency score of an ESL learner by computing the cosine similarity of their feature vector to the native reading prototype. Hereafter we refer to this measure as EyeScore.
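The three steps above (Z-scoring with statistics from the ESL learners, averaging the native speakers into a prototype, and taking the cosine similarity) can be sketched as follows. The function names and the toy two-feature vectors are hypothetical, and the use of the population standard deviation, with a fallback of 1.0 for constant features, is an assumption of this sketch.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def fit_z_scaler(vectors: Sequence[Vector]) -> Tuple[List[float], List[float]]:
    """Per-feature mean and standard deviation (constant features get std 1.0)."""
    columns = list(zip(*vectors))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]
    return means, stds


def z_transform(v: Vector, means: Vector, stds: Vector) -> List[float]:
    return [(x - m) / s for x, m, s in zip(v, means, stds)]


def cosine_similarity(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def eye_scores(l2_vectors: Sequence[Vector], l1_vectors: Sequence[Vector]) -> List[float]:
    """Cosine similarity of each learner to the native reading prototype."""
    means, stds = fit_z_scaler(l2_vectors)  # scaler derived from ESL learners only
    natives = [z_transform(v, means, stds) for v in l1_vectors]
    prototype = [mean(c) for c in zip(*natives)]  # native prototype vector
    return [cosine_similarity(z_transform(v, means, stds), prototype) for v in l2_vectors]


# Toy feature vectors (e.g. two average reading times in ms per reader).
natives = [[200.0, 250.0], [210.0, 260.0]]
learners = [[220.0, 270.0], [300.0, 400.0], [260.0, 330.0]]
scores = eye_scores(learners, natives)
print([round(s, 2) for s in scores])
```

In this toy example the first learner, whose reading times are closest to the native ones, receives the highest score and the slowest learner the lowest. Note that these toy features are not speed normalized.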
Reading Speed Normalization To reduce bias towards fast readers, the feature representations used for EyeScore are normalized to be invariant to the reading speed of the participant. Specifically, for the S-Clusters and WFC feature sets we follow the normalization procedure of Berzak et al. (2017), where for a given participant, the reading time of a word $w_i$ according to a fixation metric $M$ is normalized by $S_{M,C}$, the metric's fixation time per word in the linguistic context $C$:
$$S_{M,C} = \frac{1}{|C|} \sum_{w \in C} M_w$$
The linguistic context is defined as the surrounding sentence in the Fixed Text regime, and the entire textual input in the Any Text regime. The normalized fixation time is then obtained as:
$$M^{norm}_{w_i} = \frac{M_{w_i}}{S_{M,C}}$$
For the WP-Coefficients features we take into account only the 15 model coefficients, and omit the 5 intercept features which capture the reading speed of the participant. Finally, we also normalize the Transitions feature matrix $T$ by the total number of saccades in the sentence to obtain $T^{norm}$, in which $\sum_{i,j} t^{norm}_{i,j} = 1$.
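A small sketch of the two normalizations, under the assumption that the fixation time per word in a context is the mean of the metric over the words of that context. The function names and toy values are hypothetical.

```python
from typing import List, Sequence


def normalize_reading_times(times: Sequence[float]) -> List[float]:
    """Divide each word's reading time by the mean reading time in the context."""
    per_word = sum(times) / len(times)  # assumed form of the per-word fixation time
    return [t / per_word for t in times]


def normalize_transitions(t: Sequence[Sequence[int]]) -> List[List[float]]:
    """Divide saccade counts by the total number of saccades in the sentence."""
    total = sum(sum(row) for row in t)
    if total == 0:  # no saccades recorded: leave the matrix at zero
        return [[0.0 for _ in row] for row in t]
    return [[count / total for count in row] for row in t]


# A reader and a uniformly twice-as-slow reader get identical normalized values.
fast = normalize_reading_times([200.0, 300.0, 100.0])
slow = normalize_reading_times([400.0, 600.0, 200.0])
print(fast, slow)  # [1.0, 1.5, 0.5] [1.0, 1.5, 0.5]

t_norm = normalize_transitions([[0, 1], [2, 1]])
print(t_norm)  # [[0.0, 0.25], [0.5, 0.25]]
```

The first example shows the intended invariance: scaling all reading times by a constant factor leaves the normalized features unchanged, so only the relative distribution of time across words remains.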

Correlation with MET and TOEFL
We evaluate the ability of EyeScore to capture language proficiency by comparing it against our two external proficiency tests, MET and TOEFL. The strongest correlations, 0.5 for MET and 0.54 for TOEFL, are obtained in the Fixed Text regime using the WFC features. This outcome confirms the effectiveness of reading time comparisons when the presented sentences are shared across participants. To illustrate the quality of this result, Figure 2 presents a comparison of EyeScore and MET scores in the Fixed Text and WFC features setup. We further note good performance of the Transitions and S-Clusters features in this regime across both proficiency tests. The strongest performance in the Any Text regime is obtained using the S-Clusters features, yielding 0.48 correlation with MET and 0.45 correlation with TOEFL. These results are competitive with the WFC feature set in the Fixed Text regime, suggesting that reliable EyeScores can be obtained even when no prior eyetracking data is available for the sentences presented to the test taker.
In order to contextualize the correlations obtained with the EyeScore approach, we first compare our results to raw reading speed, an informative baseline which does not rely on eyetracking. EyeScore substantially outperforms this baseline for nearly all the feature sets on both MET and TOEFL, clearly showing the benefit of eye movement information for our task. Next, we consider possible upper bounds for our correlations. While obtaining such upper bounds is challenging, we can use correlations between different traditional standardized proficiency tests as informative reference points. First, as mentioned previously, in our dataset the MET and reported TOEFL scores have a Pearson's r correlation of 0.74. We further note an external study conducted by the testing company Education First (EF) which measured the correlation of their flagship standardized English proficiency test EFSET-PLUS with TOEFL-iBT (Luecht, 2015). Using 384 participants who took both tests, the study found a Pearson's r of 0.63 for the reading comprehension and 0.69 for the listening comprehension sections of these tests. Despite the radical difference of our testing methodology, our strongest feature sets obtain rather competitive results relative to these correlations, further strengthening the evidence for the ability of our approach to capture language proficiency.

Predicting Performance on MET and TOEFL
In section 4 we introduced EyeScore as an independent metric of language proficiency which is based on eye movements during reading. Here, we examine whether eye movements can also be used to explicitly predict the performance of participants on specific external standardized language proficiency tests. This task is of practical value for development of predictive tools for standardized proficiency tests, and constitutes an alternative framework for studying the relevance of eye movement patterns in reading to language proficiency.
To address this task, we use Ridge regression to predict overall scores on an external proficiency test from eye movement features in reading. The model parameters θ are obtained by minimizing the following loss objective:
$$L(\theta) = \sum_{i} \left( y_i - \theta^\top f(x_i) \right)^2 + \lambda \lVert \theta \rVert_2^2$$
where $y_i$ is a participant's test score, $x_i$ is their eye movement record, and $f(x_i)$ are the extracted eye movement features. To calibrate the model with respect to native English speakers, we augment each training set $D_{L2}^{tr}$ with the group of 37 native speakers $D_{L1}$ whose proficiency scores are assigned to the maximum grade of the respective test (50 for MET and 60 for TOEFL) (footnote 3). Based on MET performance on the train/dev set, the features used for predicting scores on both tests are not normalized for speed (footnote 4). As a preprocessing step, we fit a Z scaler for each feature using the ESL participants in the training set, and apply it to all the participants in the training and test sets.
Table 2: Pearson's r and Mean Absolute Error (MAE) for prediction of MET scores (test set, 57 participants) and TOEFL scores (leave-one-out cross validation, all 53 participants) from eye movement patterns in reading. We consider two baselines which do not use eyetracking information: (1) the average proficiency score in the training set, which yields 4.82 MAE on MET and 8.29 MAE on TOEFL, and (2) the reading speed of the participant.
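The training pipeline described in this paragraph (a Z scaler fitted on the ESL training participants, native speakers added at the maximum score, and a ridge fit) can be sketched as below. Everything concrete here is an assumption for illustration: the sketch uses the standard ridge objective (squared error plus λ times the squared L2 norm of the weights), adds an unpenalized intercept, and works on a single hypothetical feature with made-up reading times and scores.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ridge(features: Sequence[Sequence[float]], scores: Sequence[float], lam: float) -> List[float]:
    """Closed-form ridge fit; theta[0] is an unpenalized intercept (an assumption)."""
    rows = [[1.0, *f] for f in features]
    k = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    for i in range(1, k):
        gram[i][i] += lam  # L2 penalty on the feature weights only
    rhs = [sum(r[i] * y for r, y in zip(rows, scores)) for i in range(k)]
    return solve(gram, rhs)


def predict(theta: Sequence[float], f: Sequence[float]) -> float:
    return theta[0] + sum(w * x for w, x in zip(theta[1:], f))


# Hypothetical data: one feature (a mean reading time in ms) per participant.
esl_train = [[310.0], [280.0], [250.0], [230.0]]
esl_scores = [32.0, 38.0, 43.0, 46.0]
natives = [[200.0], [210.0]]
MAX_SCORE = 50.0  # maximum MET grade, assigned to native speakers

# Z scaler fitted on the ESL training participants only, applied to everyone.
columns = list(zip(*esl_train))
means = [mean(c) for c in columns]
stds = [pstdev(c) or 1.0 for c in columns]


def scale(v: Sequence[float]) -> List[float]:
    return [(x - m) / s for x, m, s in zip(v, means, stds)]


train_x = [scale(v) for v in esl_train + natives]
train_y = esl_scores + [MAX_SCORE] * len(natives)

theta = fit_ridge(train_x, train_y, lam=1.0)
prediction = predict(theta, scale([260.0]))
print(round(prediction, 2))
```

In this toy setting the fitted weight is negative, so longer reading times map to lower predicted scores, and increasing `lam` shrinks the weight towards zero.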

Results
We evaluate prediction accuracy using Pearson's r and Mean Absolute Error (MAE) with respect to the true proficiency test scores. The λ parameter for MET is optimized for MAE using 10-fold cross validation within the training/development set. For TOEFL, which has a relatively small number of participants, we report results on leave-one-out cross validation with λ set to 1. Table 2 presents the results for both proficiency tests. We consider two baselines; the first assigns all test set participants the average score of the training participants. This baseline yields an MAE of 4.82 on MET and 8.29 on TOEFL. The second baseline uses reading speed as the sole feature for prediction. In all cases, our eyetracking based features outperform the average score and reading speed baselines. The performance of the different feature sets is in most cases consistent across the two proficiency tests and is largely in line with the correlations of EyeScore reported in Table 1. Similarly to the EyeScore outcomes, the best performance in the Fixed Text regime is obtained using the WFC feature set, with a Pearson's r of 0.7 and MAE of 3.31 for MET. This result is highly competitive with correlations between different standardized English proficiency tests. Figure 3 depicts a comparison between MET scores and our MET predictions in this setup. On TOEFL, WFC features obtain the strongest MAE of 6.68, while S-Clusters have a higher r coefficient of 0.55.
In the Any Text regime, unlike for EyeScore, we obtain comparable results for the S-Clusters and WP-Coefficients feature sets. Overall, the improvements of both feature sets over the baselines in the Any Text regime further support the ability of type-level features to generalize the task of language proficiency prediction to arbitrary sentences.
Footnote 3: Our experiments on the training/development set indicate that this training data augmentation step leads in most cases to improved regression performance.
Footnote 4: We note that in line with the low correlation of reading speed with TOEFL, speed normalized features tend to be better predictors of TOEFL scores, obtaining r 0.59 and MAE 6.47 with WFC features in the Fixed Text regime, and r 0.58 and MAE 7.19 with S-Clusters in the Any Text regime.
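For reference, the two evaluation measures and the average-score baseline can be computed as in the sketch below. The function names and the toy scores are hypothetical and are not taken from the reported experiments.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy test-set scores and model predictions.
true_scores = [30.0, 40.0, 45.0, 50.0]
predicted = [33.0, 38.0, 46.0, 47.0]

# Average-score baseline: every test participant gets the training-set mean.
train_mean = 41.0  # hypothetical training-set average
baseline = [train_mean] * len(true_scores)

model_mae = mean_absolute_error(true_scores, predicted)
baseline_mae = mean_absolute_error(true_scores, baseline)
print(model_mae, baseline_mae)  # 2.25 6.25
print(round(pearson_r(true_scores, predicted), 3))
```

Pearson's r is undefined for the average-score baseline because its predictions have zero variance, so that baseline can only be compared in terms of MAE.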

Related Work
Our work lies at the intersection of language proficiency assessment, second language acquisition (SLA), the psychology of reading and NLP. Automated language proficiency assessment from free-form linguistic performance has been studied mainly in language production (Dikli, 2006; Williamson, 2009; Shermis and Burstein, 2013). Over the past several decades, multiple essay and speech scoring systems have been developed for learner language using a wide range of linguistically motivated feature sets (e.g. Lonsdale and Strong-Krause, 2003; Landauer, 2003; Xi et al., 2008; Yannakoudakis et al., 2011). Some of these systems have been deployed in official language proficiency tests, for example the e-rater essay scoring system (Attali and Burstein, 2004) used in TOEFL (Ramineni et al., 2012). While this line of work focuses on assessment of language production, here we introduce and address for the first time automated language assessment during online language comprehension.
In SLA, there has been considerable interest in eyetracking, where studies have mostly focused on controlled experiments examining processing of specific linguistic phenomena such as syntactic ambiguities, cognates and idioms (Dussias, 2010; Roberts and Siyanova-Chanturia, 2013). A notable exception is Cop et al. (2015), who used free-form reading to study differences in fixation times and saccade lengths between native and non-native readers. Our work also adopts broad coverage analysis of reading patterns, which we use to formulate predictive models of language proficiency.
Our study draws on a large body of work in the psychology of reading (see Rayner, 1998; Rayner et al., 2012, for overview) which has suggested that eye movement patterns during reading are systematically influenced by a broad range of linguistic characteristics of the text, and reflect how readers mentally engage with the text (Frazier and Rayner, 1982; Rayner and Frazier, 1989; Reichle et al., 1998; Engbert et al., 2005; Demberg and Keller, 2008; Reichle et al., 2009; Levy et al., 2009, among many others). Prior work on reading has also demonstrated that gaze provides valuable information about various characteristics of the reader and their cognitive state. For example, Reichle et al. (2010) have shown that eye movement patterns are categorically different in attentive versus mindless reading. In Rello and Ballesteros (2015) eye movements were used to distinguish between readers with and without dyslexia. Berzak et al. (2017) collected the dataset used in our work and used it to predict the first language of non-native English readers from gaze. We build on these studies to motivate our task and design feature representations which encode linguistic factors known to affect the human reading process.
Related work in NLP developed predictive models of reading times in reading of free-form text (e.g. Nilsson and Nivre, 2009; Hara et al., 2012; Hahn and Keller, 2016). In a complementary vein, eyetracking signal has been used for linguistic annotation tasks such as POS tagging (Barrett and Søgaard, 2015a; Barrett et al., 2016) and prediction of syntactic functions (Barrett and Søgaard, 2015b). Both lines of investigation provide further evidence for the tight interaction between eye movements and linguistic properties of the text, which we leverage in our work for inference about the linguistic knowledge of the reader.

Conclusion and Discussion
We present a novel approach for automated assessment of language proficiency which relies on eye movements during reading of free-form text. Our EyeScore test captures the similarity of language learners' gaze patterns to those of native speakers, and correlates well with the standardized tests MET and TOEFL. A second variant of our approach accurately predicts participants' scores on these two tests. To the best of our knowledge, the proposed framework is the first proof-of-concept for a system which utilizes eyetracking to measure linguistic ability.
In future work, we plan to extend the analysis of the validity and consistency of our approach, and further explore its applications for language proficiency evaluation. In particular, we will examine the impact of factors that can undermine the validity of language proficiency tests, such as test-specific training, familiarity with the evaluation system's features (Powers et al., 2002), and cheating via unauthorized prior access to test materials. Since participants are less likely to be able to manipulate their eye movements in an informed and systematic manner (readers are generally not even aware that their eye movements are saccadic), and since our test can be performed on arbitrary sentences, we expect it to be robust to prior exposure to the test materials and testing methodology. We will further study the consistency of our scores for repeated tests by the same participants. A preliminary split-half analysis indicates that eyetracking based scores are expected to be highly consistent across tests. Finally, our approach can be combined with traditional proficiency testing methodologies, whereby gaze will be recorded while the participant is taking a standardized language proficiency test. This will enable developing novel approaches to language proficiency assessment which will integrate task based performance with real time monitoring of cognitive and linguistic processing.