Predicting Native Language from Gaze

A fundamental question in language learning concerns the role of a speaker’s first language in second language acquisition. We present a novel methodology for studying this question: analysis of eye-movement patterns in second language reading of free-form text. Using this methodology, we demonstrate for the first time that the native language of English learners can be predicted from their gaze fixations when reading English. We provide analysis of classifier uncertainty and learned features, which indicates that differences in English reading are likely to be rooted in linguistic divergences across native languages. The presented framework complements production studies and offers new ground for advancing research on multilingualism.


Introduction
The influence of a speaker's native language on learning and performance in a foreign language, also known as cross-linguistic transfer, has been studied for several decades in linguistics and psychology (Odlin, 1989; Martohardjono and Flynn, 1995; Jarvis and Pavlenko, 2008; Berkes and Flynn, 2012; Alonso, 2015). The growing availability of learner corpora has also sparked interest in cross-linguistic influence phenomena in NLP, where studies have explored the task of Native Language Identification (NLI) (Tetreault et al., 2013), as well as analysis of textual features in relation to the author's native language (Jarvis and Crossley, 2012; Swanson and Charniak, 2013; Malmasi and Dras, 2014). Despite these advances, the extent and nature of first language influence in second language processing remain far from being established. Crucially, most prior work on this topic focused on production, while little is currently known about cross-linguistic influence in language comprehension.
In this work, we present a novel framework for studying cross-linguistic influence in language comprehension using eyetracking for reading and free-form native English text. We collect and analyze English newswire reading data from 182 participants, including 145 English as a Second Language (ESL) learners from four different native language backgrounds (Chinese, Japanese, Portuguese and Spanish), as well as 37 native English speakers. Each participant reads 156 English sentences, half of which are shared across all participants, while the remaining half are individual to each participant. All the sentences are manually annotated with part-of-speech (POS) tags and syntactic dependency trees. We then introduce the task of Native Language Identification from Reading (NLIR), which requires predicting a subject's native language from gaze while reading text in a second language. Focusing on ESL participants and using a log-linear classifier with word fixation times normalized for reading speed as features, we obtain an NLIR accuracy of 71.03 in the shared sentences regime. We further demonstrate that NLIR can be generalized effectively to the individual sentences regime, in which each subject reads a different set of sentences, by grouping fixations according to linguistically motivated clustering criteria. In this regime, we obtain an NLIR accuracy of 51.03.
We then provide classification and feature analyses, suggesting that the signal underlying NLIR is likely to be related to linguistic characteristics of the respective native languages. First, drawing on previous work on ESL production, we observe that classifier uncertainty in NLIR correlates with global linguistic similarities across native languages. In other words, the more similar the languages are, the more similar the reading patterns of their native speakers are in English. Second, we perform feature analysis across native and non-native English speakers, and discuss structural and lexical factors that could potentially drive some of the non-native reading patterns in each of our native languages. Taken together, our results provide evidence for a systematic influence of native language properties on reading, and by extension, on online processing and comprehension in a second language.
To summarize, we introduce a novel framework for studying cross-linguistic influence in language learning by using eyetracking for reading free-form English text. We demonstrate the utility of this framework in the following ways. First, we obtain the first NLIR results, addressing both the shared and the individual textual input scenarios. We further show that reading preserves linguistic similarities across native languages of ESL readers, and perform feature analysis, highlighting key distinctive reading patterns in each native language. The proposed framework complements and extends production studies, and can inform linguistic inquiry on cross-linguistic influence.
This paper is structured as follows. In section 2 we present the data and our experimental setup. Section 3 describes our approach to NLIR and summarizes the classification results. We analyze cross-linguistic influence in reading in section 4. In section 4.1 we examine NLIR classification uncertainty in relation to linguistic similarities between native languages. In section 4.2 we discuss several key fixation features associated with different native languages. Section 5 surveys related work, and section 6 concludes.

Experimental Setup

Participants
We recruited 182 adult participants. Of those, 37 are native English speakers and 145 are ESL learners from four native language backgrounds: Chinese, Japanese, Portuguese and Spanish. All the participants in the experiment are native speakers of only one language. The ESL speakers were tested for English proficiency using the grammar and listening sections of the Michigan English Test (MET), which consist of 50 multiple choice questions. The English proficiency score was calculated as the number of correctly answered questions on these modules. The majority of the participants scored in the intermediate-advanced proficiency range. Table 1 presents the number of participants and the mean English proficiency score for each native language group. Additionally, we collected metadata on gender, age, level of education, duration of English studies and usage, time spent in English speaking countries and proficiency in any additional language spoken.

Reading Materials
We utilize 14,274 randomly selected sentences from the Wall Street Journal part of the Penn Treebank (WSJ-PTB) (Marcus et al., 1993). To support reading convenience and measurement precision, the maximal sentence length was set to 100 characters, leading to an average sentence length of 11.4 words. Word boundaries are defined as whitespaces. From this sentence pool, 78 sentences (900 words) were presented to all participants (henceforth shared sentences) and the remaining 14,196 sentences were split into 182 individual batches of 78 sentences (henceforth individual sentences, averaging 880 words per batch). All the sentences include syntactic annotations from the Universal Dependency Treebank project (UDT) (McDonald et al., 2013). The annotations include PTB POS tags (Santorini, 1990), Google universal POS tags (Petrov et al., 2012) and dependency trees. The dependency annotations of the UDT are converted automatically from the manual phrase structure tree annotations of the WSJ-PTB.

Gaze Data Collection
Each participant read 157 sentences. The first sentence was presented to familiarize participants with the experimental setup and was discarded during analysis. The following 156 sentences consisted of 78 shared and 78 individual sentences. The shared and the individual sentences were mixed randomly and presented to all participants in the same order. The experiment was divided into three parts, consisting of 52 sentences each. Participants were allowed to take a short break between experimental parts.
Each sentence was presented on a blank screen as a one-liner. The text appeared in Times font, with font size 23. To encourage attentive reading, upon completion of sentence reading participants answered a simple yes/no question about its content, and were subsequently informed if they answered the question correctly. Both the sentences and the questions were triggered by a 300ms gaze on a fixation target (fixation circle for sentences and the letter "Q" for questions) which appeared on a blank screen and was co-located with the beginning of the text in the following screen.
Throughout the experiment, participants held a joystick with buttons for indicating completion of sentence reading and answering the comprehension questions. Eye movements of each participant's dominant eye were recorded using a desktop mount Eyelink 1000 eyetracker, at a sampling rate of 1000Hz. Further details on the experimental setup are provided in appendix A.

Native Language Identification from Reading
Our first goal is to determine whether the native language of ESL learners can be decoded from their gaze patterns while reading English text. We address this question in two regimes, corresponding to our division of reading input into shared and individual sentences. In the shared regime, all the participants read the same set of sentences. Normalizing over the reading input, this regime facilitates focusing on differences in reading behavior across readers. In the individual regime, we use the individual batches from our data to address the more challenging variant of the NLIR task in which the reading material given to each participant is different.

Features
We seek to utilize features that can provide robust, simple and interpretable characterizations of reading patterns. To this end, we use speed normalized fixation duration measures over word sequences.

Fixation Measures
We utilize three measures of word fixation duration:
• First Fixation duration (FF) Duration of the first fixation on a word.
• First Pass duration (FP) Time spent from first entering a word to first leaving it (including re-fixations within the word).
• Total Fixation duration (TF) The sum of all fixation times on a word.
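The three measures above can be illustrated with a minimal sketch. The input format (a temporally ordered list of `(word_index, duration_ms)` pairs) and the function name `fixation_measures` are assumptions made for illustration, not the format of the recorded data.

```python
from collections import defaultdict


def fixation_measures(fixations):
    """Compute FF, FP and TF per word.

    `fixations` is a hypothetical input format: a list of
    (word_index, duration_ms) pairs in temporal order.
    """
    ff = {}                  # First Fixation duration per word
    fp = defaultdict(float)  # First Pass duration per word
    tf = defaultdict(float)  # Total Fixation duration per word
    left = set()             # words whose first pass has already ended
    prev = None
    for word, duration in fixations:
        if prev is not None and prev != word:
            left.add(prev)   # gaze moved away, so the first pass on prev is over
        tf[word] += duration
        if word not in ff:
            ff[word] = duration
        if word not in left:
            fp[word] += duration  # includes re-fixations within the first pass
        prev = word
    return ff, dict(fp), dict(tf)


# Toy example: word 0 is fixated twice, left, and later revisited.
ff, fp, tf = fixation_measures([(0, 200), (0, 50), (1, 180), (0, 100), (2, 220)])
```

In this toy example the regression back to word 0 adds to its TF but not to its FP, which is the distinction the definitions above draw.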
We experiment with fixations over unigram, bigram and trigram sequences. Importantly, we control for variation in reading speeds across subjects by normalizing each subject's sequence fixation times. For each metric $M$ and sequence $seq_{i,k}$ we normalize the sequence fixation time $M_{seq_{i,k}}$ relative to the subject's sequence fixation times in the textual context of the sequence. The context $C$ is defined as the sentence in which the sequence appears for the Words in Fixed Context feature-set and the entire textual input for the Syntactic and Information Clusters feature-sets (see definitions of feature-sets below). The normalization term $S_{M,C,k}$ is accordingly defined as the metric's fixation time per sequence of length $k$ in the context:
$$S_{M,C,k} = \frac{1}{|C|} \sum_{seq_k \in C} M_{seq_k}$$
where $|C|$ counts the sequences of length $k$ in $C$. We then obtain a normalized fixation time $M^{norm}_{seq_{i,k}}$ as:
$$M^{norm}_{seq_{i,k}} = \frac{M_{seq_{i,k}}}{S_{M,C,k}}$$
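The speed normalization described above can be sketched as follows. This is an illustration under stated assumptions: the fixation time of a length-k sequence is taken to be the sum of the fixation times of its words, the context is passed in as a plain list of per-word times for one metric, and the name `normalize_fixations` is hypothetical.

```python
def normalize_fixations(word_times, k=1):
    """Normalize sequence fixation times within one context.

    `word_times` holds one metric (e.g. TF) per word of the context,
    in reading order. Assumption: the fixation time of a length-k
    sequence is the sum of the times of its words.
    """
    seq_times = [sum(word_times[i:i + k]) for i in range(len(word_times) - k + 1)]
    if not seq_times:
        return []
    # Normalization term: average fixation time per length-k sequence.
    s = sum(seq_times) / len(seq_times)
    if s == 0:
        return [0.0 for _ in seq_times]  # no fixations in this context
    return [t / s for t in seq_times]


# A slow and a fast reader with the same relative pattern get the same values.
slow = normalize_fixations([200, 400, 600])
fast = normalize_fixations([100, 200, 300])
```

Because each time is divided by the reader's own average in the context, a reader who is uniformly twice as slow yields the same normalized values, which is the purpose of the normalization.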

Feature Types
We use the speed normalized fixation metrics presented above to extract three feature-sets: Words in Fixed Context (WFC), Syntactic Clusters (SC) and Information Clusters (IC). WFC is a token-level feature-set that presupposes a fixed textual input for all participants. It is thus applicable only in the shared sentences regime. SC and IC are type-level features which provide abstractions over sequences of words. Crucially, they can also be applied when participants read different sentences.
• Words in Fixed Context (WFC) The WFC features capture fixation times on word sequences in a specific sentence. This feature-set consists of FF, FP and TF times for each of the 900 unigram, 822 bigram, and 744 trigram word sequences comprising the shared sentences. The fixation times of each metric are normalized for each participant relative to their fixations on sequences of the same length in the surrounding sentence. As noted above, the WFC feature-set is not applicable in the individual regime, as it requires identical sentences for all participants.
• Syntactic Clusters (SC) SC features are average globally normalized FF, FP and TF times for word sequences clustered by our three types of syntactic labels: universal POS, PTB POS, and syntactic relation labels.
An example of such a feature is the average of speed-normalized TF times spent on the PTB POS bigram sequence DT NN. We take into account labels that appear at least once in the reading input of all participants. On the four non-native languages, considering all three label types, we obtain 104 unigram, 636 bigram and 1,310 trigram SC features per fixation metric in the shared regime, and 56 unigram, 95 bigram and 43 trigram SC features per fixation metric in the individual regime.
• Information Clusters (IC) We also obtain average FF, FP and TF for words clustered according to their length, measured in number of characters. Word length was previously shown to be a strong predictor of information content (Piantadosi et al., 2011). As such, it provides an alternative abstraction to the syntactic clusters, combining both syntactic and lexical information. As with SC features, we take into account features that appear at least once in the textual input of all participants. For our set of non-native languages, we obtain for each fixation metric 15 unigram, 21 bigram and 23 trigram IC features in the shared regime, and 12 unigram, 18 bigram and 18 trigram IC features in the individual regime. Notably, this feature-set is very compact, and differently from the syntactic clusters, does not rely on the availability of external annotations.
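The cluster feature-sets above amount to averaging normalized fixation times over all sequences that share a label sequence. A minimal sketch, assuming one label per word and one already normalized time per length-k sequence (aligned by start position); the function name `cluster_features` is hypothetical.

```python
from collections import defaultdict


def cluster_features(labels, seq_times, k=1):
    """Average normalized fixation time per label k-gram.

    `labels` has one label per word (e.g. a PTB POS tag for SC features,
    or the word length in characters for IC features). `seq_times` has one
    normalized time per length-k sequence, so that
    len(seq_times) == len(labels) - k + 1.
    """
    if len(seq_times) != len(labels) - k + 1:
        raise ValueError("seq_times must align with length-k label sequences")
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i, t in enumerate(seq_times):
        key = tuple(labels[i:i + k])
        sums[key] += t
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# SC-style bigram features over PTB POS tags (toy values).
tags = ["DT", "NN", "VBZ", "DT", "NN"]
sc = cluster_features(tags, [0.8, 1.2, 1.0, 0.6], k=2)

# IC-style unigram features: the label is simply the word length.
words = ["the", "public", "has", "confidence"]
ic = cluster_features([len(w) for w in words], [0.5, 1.1, 0.7, 1.7], k=1)
```

The same function serves both feature-sets because only the labeling differs, which is why these features remain defined when each participant reads different sentences.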
In each feature-set, we perform a final preprocessing step for each individual feature, in which we derive a zero mean unit variance scaler from the training set feature values, and apply it to transform both the training and the test values of the feature to Z scores.
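The scaling step above fits the scaler on training values only and reuses it for test values. A minimal sketch for a single feature; the function names are hypothetical, and the use of the population standard deviation is an assumption.

```python
import statistics


def fit_scaler(train_values):
    """Derive a zero mean, unit variance scaler from training values only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)  # assumption: population std
    if std == 0:
        std = 1.0  # constant feature: avoid division by zero
    return mean, std


def transform(values, scaler):
    """Apply a fitted scaler to training or test values of the same feature."""
    mean, std = scaler
    return [(v - mean) / std for v in values]


scaler = fit_scaler([1.0, 2.0, 3.0])
train_z = transform([1.0, 2.0, 3.0], scaler)
test_z = transform([2.0, 4.0], scaler)  # test values never influence the scaler
```

Fitting on the training fold alone keeps test information out of the features, which matters under cross-validation.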

Model
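The experiments are carried out using a log-linear model:
$$p(y \mid x; \theta) = \frac{\exp(\theta \cdot f(x, y))}{\sum_{y' \in Y} \exp(\theta \cdot f(x, y'))}$$
where $y$ is the reader's native language, $x$ is the reading input, $\theta$ are the model parameters, $f(x, y)$ is the feature vector and $Y$ is the set of native languages. The classifier is trained with gradient-based optimization using L-BFGS (Byrd et al., 1995).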
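As an illustration of how such a model assigns probabilities, the following sketch computes the class distribution of a log-linear classifier for one reader. It assumes one weight vector per native language, which is one common way to parameterize the model; the name `predict_proba` and the toy weights are hypothetical, and training with L-BFGS is not shown.

```python
import math


def predict_proba(weights, features):
    """Class probabilities of a log-linear (softmax) model.

    `weights` maps each native language to a weight vector (an assumed
    parameterization); `features` is the reader's feature vector.
    """
    scores = {
        lang: sum(w * x for w, x in zip(ws, features))
        for lang, ws in weights.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {lang: math.exp(s - top) for lang, s in scores.items()}
    z = sum(exps.values())
    return {lang: e / z for lang, e in exps.items()}


# Toy weights over two features, invented for illustration.
toy_weights = {"zh": [0.5, -0.2], "ja": [0.4, -0.1], "pt": [-0.3, 0.2], "es": [-0.4, 0.3]}
probs = predict_proba(toy_weights, [1.0, 0.5])
predicted = max(probs, key=probs.get)
```

The probabilities returned here, rather than only the predicted label, are what an uncertainty analysis over language pairs would consume.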

Experimental Results
In table 2 we report 10-fold cross-validation results on NLIR in the shared and the individual experimental regimes for native speakers of Chinese, Japanese, Portuguese and Spanish. We introduce two baselines against which we compare the performance of our feature-sets. The majority baseline selects the native language with the largest number of participants. The random clusters baseline clusters words into groups randomly, with the number of groups set to the number of syntactic categories in our data.
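For reference, the majority baseline under k-fold cross-validation can be sketched as follows. The fold assignment (by participant index modulo the number of folds) and the function name are assumptions, and the label counts in the example are invented rather than taken from our data.

```python
from collections import Counter


def majority_baseline_accuracy(labels, n_folds=10):
    """Cross-validated accuracy of always predicting the largest training class."""
    correct = 0
    for fold in range(n_folds):
        test_idx = list(range(fold, len(labels), n_folds))  # assumed fold scheme
        held_out = set(test_idx)
        train = [y for i, y in enumerate(labels) if i not in held_out]
        majority = Counter(train).most_common(1)[0][0]
        correct += sum(labels[i] == majority for i in test_idx)
    return correct / len(labels)


# Invented label distribution: 7 readers of one language, 3 of another.
toy_labels = ["zh"] * 7 + ["ja"] * 3
accuracy = majority_baseline_accuracy(toy_labels, n_folds=5)
```

The majority class is re-estimated from the training portion of each fold, so the baseline is evaluated under the same protocol as the classifiers.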
In the shared regime, WFC fixations yield the highest classification rates, substantially outperforming the cluster feature-sets and the two baselines. The strongest result using this feature-set, 71.03, is obtained by combining unigram, bigram and trigram fixation times. In addition to this outcome, we note that training binary classifiers in this setup yields accuracies ranging from 68.49 for the language pair Portuguese and Spanish, to 93.15 for Spanish and Japanese. These results confirm the effectiveness of the shared input regime.
Generally, we observe that adding bigram and trigram fixations in the shared regime leads to performance improvements compared to using unigram features only. This trend does not hold for the individual sentences, presumably due to a combination of feature sparsity and context variation in this regime. We also note that IC and SC features tend to perform better together than separately, suggesting that the information encoded using these feature-sets is to some extent complementary.
The generalization power of our cluster based feature-sets has both practical and theoretical consequences. Practically, they provide useful abstractions for performing NLIR over arbitrary textual input. That is, they enable performing this task using any textual input during both training and testing phases. Theoretically, the effectiveness of linguistically motivated features in discerning native languages suggests that linguistic factors play an important role in the ESL reading process. The analysis presented in the following sections will further explore this hypothesis.

Analysis of Cross-Linguistic Influence in ESL Reading
As mentioned in the previous section, the ability to perform NLIR in general, and the effectiveness of linguistically motivated features in particular, suggest that linguistic factors in the native and second languages are pertinent to ESL reading. In this section we explore this hypothesis further, by analyzing classifier uncertainty and the features learned in the NLIR task.

Preservation of Linguistic Similarity
Previous work in NLP suggested a link between textual patterns in ESL production and linguistic similarities of the respective native languages (Nagata and Whittaker, 2013; Nagata, 2014; Berzak et al., 2014, 2015). In particular, Berzak et al. (2014) demonstrated that NLI classification uncertainty correlates with similarities between languages with respect to their typological features. Here, we extend this framework and examine whether the preservation of native language similarities in ESL production is paralleled in reading.
Similarly to Berzak et al. (2014), we define the classification uncertainty for a pair of native languages $y$ and $y'$ in our data collection $D$ as the average probability assigned by the NLIR classifier to one language given the other being the true native language. This approach provides a robust measure of classification confusion that does not rely on the actual performance of the classifier. We interpret the classifier uncertainty as a similarity measure between the respective languages and denote it as English Reading Similarity (ERS):
$$ERS_{y,y'} = \frac{\sum_{(x,y) \in D_y} p(y' \mid x; \theta) + \sum_{(x,y') \in D_{y'}} p(y \mid x; \theta)}{|D_y| + |D_{y'}|} \quad (5)$$
We compare these reading similarities to the linguistic similarities between our native languages. To approximate these similarities, we utilize feature vectors from the URIEL Typological Compendium (Littell et al., 2016) extracted using the lang2vec tool (Littell et al., 2017). URIEL aggregates, fuses and normalizes typological, phylogenetic and geographical information about the world's languages.
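The English Reading Similarity defined above can be computed directly from held-out classifier outputs. A minimal sketch, assuming each test sample is stored as a pair of the true native language and a dictionary of predicted probabilities; the function name and the toy probabilities are hypothetical.

```python
def english_reading_similarity(samples, lang_a, lang_b):
    """Average probability assigned to one language when the other is true.

    `samples` is a list of (true_language, probabilities) pairs, where
    `probabilities` maps each language to the classifier's probability.
    """
    total = 0.0
    count = 0
    for true_lang, probs in samples:
        if true_lang == lang_a:
            total += probs[lang_b]
            count += 1
        elif true_lang == lang_b:
            total += probs[lang_a]
            count += 1
    if count == 0:
        raise ValueError("no samples for the requested language pair")
    return total / count


# Toy classifier outputs, invented for illustration.
toy_samples = [
    ("es", {"es": 0.6, "pt": 0.3, "ja": 0.1}),
    ("pt", {"es": 0.4, "pt": 0.5, "ja": 0.1}),
    ("ja", {"es": 0.1, "pt": 0.1, "ja": 0.8}),
]
ers_es_pt = english_reading_similarity(toy_samples, "es", "pt")
ers_es_ja = english_reading_similarity(toy_samples, "es", "ja")
```

The measure is symmetric in the two languages and uses probabilities rather than hard decisions, so it remains informative even when the classifier predicts both languages correctly.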
We obtain all the 103 available morphosyntactic features in URIEL, which are derived from the World Atlas of Language Structures (WALS) (Dryer and Haspelmath, 2013), Syntactic Structures of the World's Languages (SSWL) (Collins and Kayne, 2009) and Ethnologue (Lewis et al., 2015). Missing feature values are completed with a KNN classifier. We also extract URIEL's 3,718 language family features derived from Glottolog (Hammarström et al., 2015). Each of these features represents membership in a branch of Glottolog's world language tree. After discarding features that have the same value for all our languages, we are left with 76 features, consisting of 49 syntactic features and 27 family tree features. The linguistic similarity LS between a pair of languages $y$ and $y'$ is then determined by the cosine similarity of their URIEL feature vectors.
Figure 1 presents the URIEL based linguistic similarities for our set of non-native languages against the average NLIR classification uncertainties on the cross-validation test samples. The results presented in this figure are based on the unigram IC+SC feature-set in the individual sentences regime. We also provide a graphical illustration of the language similarities for each measure, using the Ward clustering algorithm (Ward Jr, 1963). We observe a correlation between the two measures which is also reflected in similar hierarchies in the two language trees. Thus, linguistically motivated features in English reveal linguistic similarities across native languages. This outcome supports the hypothesis that English reading differences across native languages are related to linguistic factors.
We note that while comparable results are obtained for the IC and SC feature-sets, together and separately, in the shared regime, WFC features in the shared regime do not exhibit a clear uncertainty distinction when comparing across the pairs Japanese and Spanish, Japanese and Portuguese, Chinese and Spanish, and Chinese and Portuguese. Instead, this feature-set yields very low uncertainty, and correspondingly very high performance ranging from 90.41 to 93.15, for all four language pairs.
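The linguistic similarity computation described in this section (discarding features that are constant across the languages, then taking the cosine similarity) can be sketched as follows. The vectors below are invented toy values, not URIEL features, and the function names are hypothetical.

```python
import math


def drop_constant_features(vectors):
    """Remove feature positions that have the same value for all languages."""
    langs = list(vectors)
    dim = len(vectors[langs[0]])
    keep = [j for j in range(dim) if len({vectors[l][j] for l in langs}) > 1]
    return {l: [vectors[l][j] for j in keep] for l in langs}


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented binary feature vectors for three hypothetical languages.
toy_vectors = {"A": [1, 0, 1, 1], "B": [1, 0, 0, 1], "C": [1, 1, 0, 0]}
reduced = drop_constant_features(toy_vectors)  # the first feature is constant
ls_ab = cosine_similarity(reduced["A"], reduced["B"])
ls_ac = cosine_similarity(reduced["A"], reduced["C"])
```

Dropping constant features before computing the cosine matters because a feature shared by every language would otherwise inflate all pairwise similarities equally.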

Feature Analysis
Our framework enables not only native language classification, but also exploratory analysis of native language specific reading patterns in English. The basic question that we examine in this respect is on which features do readers of different native language groups spend more versus less time. We also discuss several potential relations of the observed reading time differences to usage patterns and grammatical errors committed by speakers of our four native languages in production. We obtain this information by extracting grammatical error counts from the CLC FCE corpus (Yannakoudakis et al., 2011), and from the ngram frequency analysis in Nagata and Whittaker (2013).
In order to obtain a common benchmark for reading time comparisons across non-native speakers, in this analysis we also consider our group of native English speakers. In this context, we train four binary classifiers that discern each of the non-native groups from native English speakers based on TF times over unigram PTB POS tags in the shared regime. The features with the strongest positive and negative weights learned by these classifiers are presented in table 3. These features serve as a reference point for selecting the case studies discussed below.
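Selecting the features with the strongest positive and negative weights from a trained binary classifier is a simple ranking step. A minimal sketch, assuming the learned weights are available as a dictionary from feature name to weight; the function name and the toy weights are hypothetical and do not correspond to table 3.

```python
def strongest_features(weights, n=3):
    """Return the n most positive and n most negative weighted features."""
    ranked = sorted(weights.items(), key=lambda item: item[1])
    most_negative = ranked[:n]
    most_positive = ranked[-n:][::-1]  # largest weight first
    return most_positive, most_negative


# Invented weights over placeholder feature names.
toy_weights = {"A": -0.9, "B": 0.8, "C": -0.5, "D": 0.1}
positive, negative = strongest_features(toy_weights, n=2)
```

With speed normalized features, the sign of a weight indicates whether relatively longer time on a feature pushes the prediction towards or away from the non-native group.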
Interestingly, some of the reading features that are most predictive of each native language lend themselves to linguistic interpretation with respect to structural factors. For example, in Japanese and Chinese we observe shorter reading times for determiners (DT), which do not exist in these languages. Figure 2a presents the mean TF times for determiners in all five native languages, suggesting that native speakers of Portuguese and Spanish, which do have determiners, do not exhibit reduced reading times on this structure compared to native English speakers. In ESL production, missing determiner errors are the most frequent error for native speakers of Japanese and the third most common error for native speakers of Chinese.
In figure 2b we present the mean TF reading times for pronouns (PRP), where we also see shorter reading times by natives of Japanese and Chinese as compared to English natives. In both languages pronouns can be omitted both in object and subject positions. Portuguese and Spanish, in which pronoun omission is restricted to the subject position, present a similar, albeit weaker, tendency.
In figure 2c we further observe that, differently from natives of Chinese and Japanese, native speakers of Portuguese and Spanish spend more time on NN+POS in head final possessives such as "the public's confidence". While similar constructions exist in Chinese and Japanese, the NN+POS combination is expressed in Portuguese and Spanish as a head initial NN of NN. This form exists in English (e.g. "the confidence of the public") and is preferred by speakers of these languages in ESL writing (Nagata and Whittaker, 2013). As an additional baseline for this construction, we provide the TF times for NN in figure 2d. There, relative to English natives, we observe longer reading times for Japanese and Chinese and comparable times for Portuguese and Spanish.
The reading times of NN in figure 2d also give rise to a second, potentially competing interpretation of differences in ESL reading times, which highlights lexical rather than structural factors. According to this interpretation, increased reading times of nouns are the result of substantially smaller lexical sharing with English by Chinese and Japanese as compared to Spanish and Portuguese. Given the utilized speed normalization, lexical effects on nouns could in principle account for reduced reading times on determiners and pronouns. Conversely, structural influence leading to reduced reading times on determiners and pronouns could explain longer dwelling on nouns. A third possibility consistent with the observed reading patterns would allow for both structural and lexical effects to impact second language reading. Importantly, in each of these scenarios, ESL reading patterns are related to linguistic factors of the reader's native language.
We note that the presented analysis is preliminary in nature, and warrants further study in future research. In particular, reading times and classifier learned features may in some cases differ between the shared and the individual regimes. In the examples presented above, similar results are obtained in the individual sentences regime for DT, PRP and NN. The trend for the NN+POS construction, however, diminishes in that setup, with similar reading times for all languages. On the other hand, one of the strongest features for predicting Portuguese and Spanish in the individual regime is longer reading times for prepositions (IN), an outcome that holds in the shared regime only relative to Chinese and Japanese, but not relative to native speakers of English.
Despite these caveats, our results suggest that reading patterns can potentially be related to linguistic factors of the reader's native language. This analysis can be extended in various ways, such as inclusion of additional feature types and fixation metrics, as well as utilization of other comparative methodologies. Combined with evidence from language production, this line of investigation can be instrumental for informing linguistic theory of cross-linguistic influence.

Related Work
Eyetracking and second language reading Second language reading has been studied using eyetracking, with much of the work focusing on processing of syntactic ambiguities and analysis of specific target word classes such as cognates (Dussias, 2010; Roberts and Siyanova-Chanturia, 2013). In contrast to our work, such studies typically use controlled, rather than free-form sentences. Investigation of global metrics in free-form second language reading was introduced only recently by Cop et al. (2015). This study compared ESL and native reading of a novel by native speakers of Dutch, observing longer sentence reading times, more fixations and shorter saccades in ESL reading. Differently from this study, our work focuses on comparison of reading patterns between different native languages. We also analyze a related, but different metric, namely speed normalized fixation durations on word sequences.
Eyetracking for NLP tasks Recent work in NLP has demonstrated that reading gaze can serve as a valuable supervision signal for standard NLP tasks. Prominent examples of such work include POS tagging (Barrett and Søgaard, 2015a; Barrett et al., 2016), syntactic parsing (Barrett and Søgaard, 2015b) and sentence compression (Klerke et al., 2016). Our work also tackles a traditional NLP task with free-form text, but differs from this line of research in that it addresses this task only in comprehension. Furthermore, while these studies use gaze recordings of native readers, our work focuses on non-native readers.
NLI in production NLI was first introduced in Koppel et al. (2005) and has been drawing considerable attention in NLP, including a recent shared-task challenge with 29 participating teams (Tetreault et al., 2013). NLI has also been driving much of the work on identification of native language related features in writing (Tsur and Rappoport, 2007; Jarvis and Crossley, 2012; Brooke and Hirst, 2012; Tetreault et al., 2012; Swanson and Charniak, 2013, 2014; Malmasi and Dras, 2014; Bykh and Meurers, 2016). Several studies have also linked usage patterns and grammatical errors in production to linguistic properties of the writer's native language (Nagata and Whittaker, 2013; Nagata, 2014; Berzak et al., 2014, 2015). Our work departs from NLI in writing and introduces NLI and related feature analysis in reading.

Conclusion and Outlook
We present a novel framework for studying cross-linguistic influence in multilingualism by measuring gaze fixations during reading of free-form English text. We demonstrate for the first time that this signal can be used to determine a reader's native language. The effectiveness of linguistically motivated criteria for fixation clustering and our subsequent analysis suggest that the ESL reading process is affected by linguistic factors. Specifically, we show that linguistic similarities between native languages are reflected in similarities in ESL reading. We also identify several key features that characterize reading in different native languages, and discuss their potential connection to structural and lexical properties of the native language. The presented results demonstrate that eyetracking data can be instrumental for developing predictive and explanatory models of second language reading.
While this work is focused on NLIR from fixations, our general framework can be used to address additional aspects of reading, such as analysis of saccades and gaze trajectories. In future work, we also plan to explore the role of native and second language writing system characteristics in second language reading. More broadly, our methodology introduces parallels with production studies in NLP, creating new opportunities for integration of data, methodologies and tasks between production and comprehension. Furthermore, it holds promise for formulating language learning theory that is supported by empirical findings in naturalistic setups across language processing domains.