L2 Processing Advantages of Multiword Sequences: Evidence from Eye-Tracking

A substantial body of research has demonstrated that native speakers are sensitive to the frequencies of multiword sequences (MWS). Here, we ask whether and to what extent intermediate-advanced L2 speakers of English can also develop sensitivity to the statistics of MWS. To this end, we aimed to replicate, in an ecologically more valid eye-tracking study, the MWS frequency effects previously found for adult native speakers in self-paced reading and sentence recall tasks. L2 speakers’ sensitivity to MWS frequency was evaluated using generalized linear mixed-effects regression, with separate models fitted for each of the four dependent measures. Mixed-effects modeling revealed significantly faster processing of sentences containing MWS compared to sentences containing equivalent control items across all eye-tracking measures. Taken together, these findings suggest that, in line with emergentist approaches, MWS are important building blocks of language and that similar mechanisms underlie both native and non-native language processing.


Emergentist approaches and statistical learning
A widely held assumption in the language sciences, including psycholinguistics, has long been the 'words and rules' view (Levelt, 1993; Jackendoff, 2002; Pinker, 1999). In this view, speakers/writers generate sentences by combining words according to the grammatical rules of their language, and listeners/readers comprehend sentences by looking up words in their mental lexicon and combining them using the same rules. This view has been challenged recently by an accumulating body of evidence demonstrating that language users are highly sensitive not only to the frequencies of individual words but also to the frequencies of word sequences (see, e.g., Christiansen and Arnon, 2017, for a recent overview). This evidence calls into question the strict compartmentalization between the lexicon as a store of individual words and the grammar as a set of rules or constraints used to combine them. Moving away from the traditional 'words and rules' approach, emergentist approaches have put forward alternative theoretical models of language. Following the literature (see, e.g., Arnon and Snider, 2010; Kidd et al., 2017; MacWhinney and O'Grady, 2015; Mitchell et al., 2013), we use the term 'emergentist' as a cover term for a broad class of approaches to language including usage-based (a.k.a. experience-based) models, constraint-based approaches, exemplar-based models and connectionist models (for more general overviews, see, e.g., Beckner et al., 2009; Christiansen and Chater, 2016a,b; Ellis and Larsen-Freeman, 2006; Ellis, 2019; MacWhinney, 2012; McClelland et al., 2010). Distinct from nativist/generative approaches, emergentist approaches share the following two central assumptions. First, emergentist approaches eschew the existence of Universal Grammar and instead emphasize that language is learnable via general cognitive mechanisms.
Second, these approaches put the emphasis on usage and/or experience with language and assume a direct and immediate relationship between processing and learning, conceiving of them as inseparable rather than governed by different mechanisms ('two sides of the same coin'). In these approaches, language acquisition is viewed as learning how to process efficiently (see the 'learning-as-processing' assumption, Chang, Dell, and Bock, 2006; see also 'language acquisition as skill learning', Chater and Christiansen, 2018). One of the major advances in the language sciences across theoretical orientations has been the recognition that language consists of complex, highly variable patterns occurring in sequence, and as such can be described in terms of statistical or distributional relations among language units (see, e.g., Redington and Chater, 1997). Thus, learning a language heavily involves figuring out the statistics inherent in language input. This is supported by a large body of evidence from the literature on statistical learning. Statistical learning, defined as the mechanism by which language users discover the patterns inherent in the language input based on its distributional properties, has been shown to facilitate the acquisition of various aspects of language knowledge, including phonological learning (e.g., Maye et al., 2008; Thiessen and Saffran, 2003), word segmentation (e.g., Onnis et al., 2008; Saffran et al., 1996), learning the graphotactic and morphological regularities of written words (e.g., Pacton et al., 2005), and learning to form syntactic and semantic categories and structures (e.g., Lany and Saffran, 2010; Saffran and Wilson, 2003; Thompson and Newport, 2007).
Furthermore, an impressive body of evidence has accumulated in recent years indicating a close relationship between individual differences in statistical learning ability and variation in native language learning in both child and adult L1 populations (e.g., Conway et al., 2010; Kidd and Arciuli, 2016; Misyak and Christiansen, 2012; Siegelman and Frost, 2015), and in adult L2 populations (e.g., Ettlinger et al., 2016; Frost et al., 2013; Onnis et al., 2016). Thus, from an emergentist perspective, language acquisition is essentially an 'intuitive statistical learning problem' (Ellis, 2008, p. 376).
Emergentist approaches have developed a growing interest in the role of multiword sequences (henceforth MWS), also commonly referred to as 'formulaic sequences' (Wray, 2013). MWS are succinctly defined as variably-sized, compositional, recurring sequence patterns comprised of multiple words (for a recent overview, see ...). Three mechanisms have been proposed to underpin frequency effects specifically in learning word sequences (Diessel, 2007): [1] increased frequency causes the strengthening of linguistic representations, [2] increased frequency causes the strengthening of expectations, and [3] increased frequency leads to the automatization of chunks. The frequency with which building blocks of language occur is thus a driving force behind chunking and, all else being equal, each exposure to a given sequence of words (sounds or graphemes) will affect its subsequent processing. But why is there a need for chunking? Language processing operates under 'real-time' constraints imposed by the limitations of the human sensory system and human memory in combination with the continual deluge of language input (cf. Christiansen and Chater, 2016a,b, on the 'Now-or-Never bottleneck'). To ameliorate the effects of these constraints, humans learn, through constant exposure to both auditory and visual language input, to rapidly and efficiently recode incoming information into larger sequences. Such chunking is possible because language abounds in statistical regularities at multiple levels of representation and because humans are able to detect these regularities via statistical learning. The by-products of statistical learning and chunking enable the anticipatory language processing that humans rely on to integrate the greatest possible amount of available information as fast as possible.
Processing a MWS as a chunk will minimize memory load and speed up integration of the MWS with prior context (see the chunk-based computational model presented by McCauley and Christiansen, 2019).

Multiword Frequency Effects in Online Processing
There is now an extensive body of evidence demonstrating that language users are sensitive to input frequency across all levels of linguistic analysis (Ellis, 2002; Diessel, 2007; Jurafsky, 2003). An accumulating body of evidence suggests that frequency effects also extend to the processing of MWS. Children and adults have been shown to be sensitive to the statistics of MWS and to rely on knowledge of such statistics to facilitate language processing and boost acquisition (for overviews, see, Christiansen and Shaoul and Westbury, 2011).
In the area of native language processing, a number of comprehension and production studies have provided evidence of processing advantages for MWS over non-MWS (see, e.g., Arnon and Snider, 2010; Bannard and Matthews, 2008; Conklin and Schmitt, 2012; Durrant and Doherty, 2010; Tremblay et al., 2011). Many of these studies follow an approach in which the target stimuli are defined by a frequency threshold. These threshold-approach studies aimed to determine whether and to what extent MWS, more precisely 'lexical bundles' (LBs), are processed faster than less frequent counterparts (non-lexical bundles, NLBs). The stimulus material is typically derived from language corpora based on predefined frequency criteria, and sequences that differ in frequency but are matched on other properties are created as control stimuli. Biber and Conrad (1999) proposed that for a sequence of words to be considered a MWS, it must occur at least ten times per million words in a corpus for sequences between two and four words long, and at least five times per million words for longer sequences. Among these studies, Tremblay et al. (2011) is the most relevant for the purposes of the present study. They created a dichotomous category for their stimuli based on Biber and Conrad's threshold criteria. Their sequences were matched on words in non-final position. Rather than presenting isolated phrases, they embedded their sequences in a full sentential context, as in I sat in the middle of the bullet train. To examine sequence reading performance, Tremblay et al. conducted three self-paced reading experiments: word-by-word reading, portion-by-portion reading and whole-sentence reading. All three experiments showed an online processing advantage of LBs over equivalent NLBs: sentences with LBs were read faster than those with NLBs. The magnitude of the whole-string frequency effect increased with the length of the presentation window (word-by-word: ∼50 to 65 ms; portion-by-portion: ∼120 ms; sentence-by-sentence: ∼380 ms). The authors interpreted this incremental facilitatory effect as being linked to an increased opportunity to "skip" words.
While there has been an increased interest in the role of MWS in L2 online processing, most of the available research has focused either on non-compositional phrases, i.e. idioms (e.g. kick the bucket), or on shorter compositional MWS, including binomials (e.g. bride and groom) and collocations, i.e. frequently recurring two-word sequences (e.g. perfectly natural) (see Conklin and Schmitt, 2012, for a review). However, much less is known about whether and to what extent adult L2 speakers can develop sensitivity to the frequency of compositional (i.e. syntactically regular and semantically transparent) MWS larger than two words. The few existing studies have produced inconsistent results: some studies found frequency effects in the processing of MWS in non-native speakers (e.g. Jiang and Nekrasova, 2007), whereas others found no such effects (e.g. Babaei et al., 2015). In addition, these previous studies have demonstrated frequency effects of MWS in lexical (phrasal) decision and/or acceptability judgment tasks using a self-paced reading paradigm.

The Present Study
As reviewed above, a processing advantage for MWS in native speakers is well attested. However, much less is known about whether this advantage extends to non-native (L2) speakers. The few existing L2 studies that have addressed this question have produced mixed results. The main goal of the present study is to replicate, in an ecologically more valid eye-tracking study with a group of L2 speakers, the processing advantage of MWS found for adult native speakers in self-paced reading and sentence recall tasks (Tremblay et al., 2011). Eye movements of thirty participants were recorded using both early and late measures (first fixation duration, first-pass reading time, total reading time and fixation count). In line with emergentist accounts, we predict that L2 speakers are sensitive to the statistics of MWS, specifically to the frequencies of lexical bundles (LBs), as evident in faster reading times across these four eye-tracking measures.

Participants
Thirty L1 German L2 speakers of English (27 female, 3 male; mean age = 24.5, SD = 5.1) at RWTH Aachen University participated in the study. All participants had normal or corrected-to-normal vision. The L2 speakers were classified as having a Common European Framework (CEF) English proficiency level of upper intermediate (CEF = B2) or lower advanced (CEF = C1) based on their institutional status (educational background) and their scores on the Lexical Test for Advanced Learners of English (LexTALE; Lemhöfer and Broersma, 2012), an English vocabulary size test that is often used to estimate the CEF proficiency level. In addition, our participants completed the Language Experience and Proficiency Questionnaire (LEAP-Q; Marian et al., 2007). Table 1 reports details on age of English acquisition, exposure, and proficiency of the L2 speaker group. The tested L2 group reached an average LexTALE score of 79.68, supporting their classification as intermediate to advanced. Regarding their English acquisition, the L2 speakers started learning English around the age of 9 and reported having acquired fluency at around 15 years of age. On average, their current experience with English comes mainly from reading (mean score of 8.32 out of 10), watching TV (mean score of 7.61 out of 10) and listening to music (mean score of 6.55 out of 10). Self-ratings of their English language proficiency on a 10-point scale were relatively high (all mean scores greater than 7.5).

Material
We used the same stimulus material as in Tremblay et al. (2011). This material comprised pairs of short sentences (mean sentence length = 8.5 words, SD = 0.7) that differed in exactly one word. An example of such a pair is presented in (1a) and (1b):
(1a) I sat in the middle of the bullet train.
(1b) I sat in the front of the bullet train.
The underlined portions in the sentences mark an MWS of either four or five words. The words in bold print are the words that distinguish MWS that are lexical bundles (LBs), here in the middle of the, from those that are not (NLBs), here in the front of the. Following Biber and Conrad (1999), the distinction of 'lexical bundlehood' was based on the frequencies of the MWS obtained from the spoken subcorpus of the BNC, with frequency thresholds set to at least 10 occurrences per million words (for four-grams) and 5 occurrences per million words (for five-grams). As shown in (1a) and (1b), the MWS (LBs or NLBs) were embedded after the second word of the sentence and were followed by two more words. The frequencies of the words occurring before and after the MWS were controlled. The sentence material comprised a total of 20 such pairs (40 sentences containing LBs and NLBs) as well as 40 filler sentences (20 of which made sense and 20 of which were nonsensical). The sentence material was split into two counterbalanced lists, list A and list B, each of which contained 10 sentences with LBs, 10 sentences with NLBs, 10 meaningful filler sentences, and 10 nonsensical filler sentences. A complete list of the stimulus material can be found in Tremblay et al. (2011).
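The threshold criterion described above can be illustrated with a short sketch. The following Python fragment is not taken from Tremblay et al. (2011) or from the present study: the function names, the corpus size of ten million tokens and the raw counts are hypothetical values, chosen only to show how a per-million frequency is compared against the two thresholds.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Convert a raw corpus count into occurrences per million words."""
    return count / corpus_size * 1_000_000


def is_lexical_bundle(sequence: str, count: int, corpus_size: int) -> bool:
    """Apply the thresholds stated in the text: at least 10 per million
    for four-word sequences and at least 5 per million for five-word ones."""
    n_words = len(sequence.split())
    if n_words == 4:
        threshold = 10.0
    elif n_words == 5:
        threshold = 5.0
    else:
        raise ValueError("the stimuli are four- or five-word sequences")
    return per_million(count, corpus_size) >= threshold


# Hypothetical corpus size and counts, for illustration only.
CORPUS_SIZE = 10_000_000
lb_example = is_lexical_bundle("in the middle of the", 120, CORPUS_SIZE)
nlb_example = is_lexical_bundle("in the front of the", 8, CORPUS_SIZE)
```

With these made-up counts, the first sequence (12 per million) would be classed as an LB and the second (0.8 per million) as an NLB.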

Procedure
Participants were randomly assigned to one of two groups. Group 1 was first presented with list A, followed by a thirty-minute break, followed by list B. Group 2 was presented with the two lists in reversed order. The sentences were presented on a 23-inch TFT monitor (resolution: 1920 x 1080 pixels) in pseudorandomised order, i.e. the order of presentation was randomly determined but then kept constant across groups. Participants were instructed to read the sentences silently for comprehension and at their own pace. Each trial consisted of the following steps. The participants saw an asterisk in the centre of the screen (font: Arial bold; size: 100). When ready, the participants pressed a key to see the sentence, which was then displayed in a single line in black 30-point font on a white background at the centre of the screen. Once they had finished reading the sentence, participants pressed a key to continue. Each trial ended with a simple yes-no question specific to the sentence to ensure that the participants actually read and processed the material. Eye movements were recorded using a Tobii Tx300 remote eye tracker that records binocular gaze data at 300 Hz and filtered with the Tobii fixation filter with standard settings (velocity threshold = 30 pixels/sample; distance threshold). The experiment took about 15 minutes (including calibration and explanation).

Statistical analysis
Eye movements were analyzed based on data collected from four measures: (1) first fixation duration (FFD), i.e. the time spent initially fixating the MWS region, (2) first-pass reading time (FPRT), i.e. the sum of all fixations made in the MWS region until the point of fixation leaves the region, (3) total reading time (TRT), i.e. the sum of all fixation times within the MWS region, including fixations made when re-reading the region, and (4) the number of regressive saccades into the MWS region (COUNT).¹ L2 speakers' sensitivity to MWS frequency was evaluated using mixed-effects regression models implemented with the lme4 package (Bates et al., 2014) in the R environment (R Core Team, 2018). Separate models were fitted for each of the four dependent measures gathered in the eye-tracking experiment (FFD, FPRT, TRT, COUNT). Fixation times were log-transformed (natural log) to reduce the non-normality of their distributions. In each model, the dependent measure was regressed onto the predictor lexical bundlehood (dummy coded: LB vs. NLB). In addition, two control variables (length of MWS in characters and participants' LexTALE scores, a measure of L2 vocabulary size) were entered into each model as fixed effects. All models had the maximal random-effects structure justified by the design (Barr et al., 2013), which included by-subject random intercepts and slopes for lexical bundlehood as well as random intercepts for items.
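To make the definitions of the four measures concrete, the following Python sketch derives them from a sequence of fixations on one sentence. It is an illustration under stated assumptions and not the pipeline used in the study: the function name `region_measures`, the representation of a fixation as a (word index, duration) pair and the example fixation sequence are all hypothetical, and first-pass reading time is simplified to the first uninterrupted run of fixations inside the region.

```python
from typing import Dict, List, Optional, Tuple

# (index of the fixated word, fixation duration in ms); hypothetical format
Fixation = Tuple[int, float]


def region_measures(
    fixations: List[Fixation], start: int, end: int
) -> Dict[str, Optional[float]]:
    """Compute FFD, FPRT, TRT and COUNT for the region spanning words start..end."""
    inside = [start <= word <= end for word, _ in fixations]
    if not any(inside):
        # The region was skipped entirely.
        return {"FFD": None, "FPRT": None, "TRT": 0.0, "COUNT": 0}

    first = inside.index(True)
    ffd = fixations[first][1]  # duration of the first fixation in the region

    # First pass: sum fixations from first entry until the eyes leave the region.
    fprt = 0.0
    for (_, duration), is_inside in zip(fixations[first:], inside[first:]):
        if not is_inside:
            break
        fprt += duration

    # Total reading time: every fixation in the region, re-reading included.
    trt = sum(d for (_, d), is_inside in zip(fixations, inside) if is_inside)

    # Regressive saccades into the region: the previous fixation lay to its right.
    count = sum(
        1
        for previous, is_inside in zip(fixations, inside[1:])
        if is_inside and previous[0] > end
    )
    return {"FFD": ffd, "FPRT": fprt, "TRT": trt, "COUNT": count}


# Made-up fixations on "I sat in the middle of the bullet train." (words 0..8);
# the MWS "in the middle of the" occupies words 2..6.
example_fixations = [(0, 180.0), (2, 210.0), (4, 190.0), (7, 230.0), (5, 160.0), (8, 200.0)]
measures = region_measures(example_fixations, start=2, end=6)
# measures: FFD = 210.0, FPRT = 400.0, TRT = 560.0, COUNT = 1
```

In this invented trial the reader fixates the region twice during the first pass, moves on to bullet, and then regresses once into the region, which is why TRT exceeds FPRT and COUNT equals 1.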

Results
Prior to the analyses, for each eye-tracking measure, all trials that were more than 2 standard deviations above or below the participant's mean score were removed. This led to a loss of about 5% of the data (4.9% for FFD, 4.4% for FPRT, 4.7% for TRT, and 4.8% for COUNT). Figure 1 shows the distributions of all four eye-tracking dependent measures for multiword sequences (MWS) that are lexical bundles (LB; left) and those that are not lexical bundles (NLB; right). The plots in Figure 1 suggest a processing advantage of MWS that are LBs over those that are NLBs for three out of the four eye-tracking measures. On average, participants exhibited shorter total reading times (...).
The results of the mixed-effects models are presented in Table 2. The top part of Table 2 presents the information regarding the effects of our key predictor variable, lexical bundlehood, and the two control variables, MWS length (in characters) and LexTALE scores. The 'Intercept' row lists the mean fixation times (for the TRT, FPRT and FFD measures) and regressive saccade count (for the COUNT measure) for LBs on the log scale. The 'NLB' row indicates the difference in log fixation times (or, in the case of the COUNT model, regressive saccade counts) between LBs and NLBs.
¹ FFD and FPRT are 'early measures' that are indicative of early processes during reading (e.g. familiarity checks, access to orthographic/phonological information and lexical meaning; cf. Reichle et al., 1998). TRT and COUNT are 'late measures' taken to reflect later processes (e.g. reanalysis of information, integration of information in discourse and recovery from processing difficulties; cf. Rayner, 1998).
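The trimming rule stated at the beginning of this section (removal of trials more than 2 standard deviations from the participant's mean) can be sketched as follows. This is a minimal illustration and not the code used in the study: the function name `trim_trials` and the reading times are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not specify which estimator was applied.

```python
from statistics import mean, stdev
from typing import Dict, List


def trim_trials(
    values_by_participant: Dict[str, List[float]], n_sd: float = 2.0
) -> Dict[str, List[float]]:
    """Keep, per participant, only trials within n_sd standard deviations of
    that participant's own mean (sample SD; an assumption, see lead-in)."""
    trimmed: Dict[str, List[float]] = {}
    for participant, values in values_by_participant.items():
        if len(values) < 2:
            # A standard deviation is undefined for fewer than two trials.
            trimmed[participant] = list(values)
            continue
        centre = mean(values)
        spread = stdev(values)
        trimmed[participant] = [
            v for v in values if abs(v - centre) <= n_sd * spread
        ]
    return trimmed


# Made-up total reading times (ms) for one hypothetical participant.
raw = {"P01": [300.0, 320.0, 310.0, 305.0, 315.0, 295.0, 325.0, 900.0]}
kept = trim_trials(raw)
# The 900 ms trial lies more than 2 SD above this participant's mean
# and is dropped; the other seven trials are kept.
```

Applying the rule separately to each participant and each measure, as described in the text, means that the same trial can be retained for one measure and excluded for another.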
The results show that, for all dependent variables except FFD, which only approached significance, lexical bundlehood was a significant predictor of eye movements, even after controlling for the effects of MWS length and LexTALE scores. Participants were significantly faster in processing sentences containing LBs compared to sentences containing equivalent control items with NLBs (NLB in the TRT model: estimate = 0.33, SE = 0.09, t = 3.9, p < 0.001). After accounting for the effects of length and L2 proficiency and adjusting for the individual variation between subjects and items, sentences with LBs were read exp(6.76 + 0.33) − exp(6.76) = 339 ms faster than sentences with NLBs. Table 2 also presents the variability in the data that is attributable to random effects (e.g. some participants exhibited overall faster reading times than others). We found that, across the four eye-tracking measures, there was a relatively large amount of variability in reading speed between participants (...). The standard deviations of the by-subject random slopes for lexical bundlehood were small (all SD < 0.2), indicating that the LB effect was consistent across subjects. This pattern of results is in line with the results reported by Tremblay et al. (2011).
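Because the TRT model was fitted to log-transformed fixation times, the coefficient of 0.33 is a difference on the log scale, and the difference in the original unit has to be obtained by exponentiating the two predicted values, as in the expression given above. The following Python sketch reproduces that arithmetic; the function name `effect_on_original_scale` is hypothetical, and the inputs are the rounded estimates printed in the text. With these rounded inputs the result comes out at about 337, slightly below the reported figure of 339, presumably because the reported value was computed from unrounded estimates.

```python
import math


def effect_on_original_scale(intercept_log: float, coefficient_log: float) -> float:
    """Back-transform a dummy-coded effect from the log scale:
    exp(intercept + coefficient) - exp(intercept)."""
    return math.exp(intercept_log + coefficient_log) - math.exp(intercept_log)


# Rounded TRT estimates as printed in the text (log scale).
difference = effect_on_original_scale(6.76, 0.33)  # roughly 337
ratio = math.exp(0.33)                              # roughly 1.39
```

The ratio offers a second reading of the same coefficient: on the rounded estimate, total reading times for NLB sentences are about 39% longer than for LB sentences, independently of the intercept.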

Discussion
The main goal of the present study was to determine whether non-native (L2) speakers can develop sensitivity to the statistics of compositional multiword sequences (MWS) larger than two words. To this end, the study aimed to replicate the processing advantage of such sequences found for native speakers (Tremblay et al., 2011) in a group of L2 speakers of English. As reviewed in Section 1, Tremblay et al. (2011) performed three self-paced reading studies to investigate the facilitatory effects of lexical bundles (LBs) and found that the magnitude of the whole-string frequency effect increased with the length of the presentation window. We were able to replicate the MWS frequency effects using eye-tracking methodology. Mixed-effects modeling revealed that lexical bundlehood was a significant predictor of eye movements, even after controlling for the effects of MWS length and LexTALE scores and after adjusting for the individual variation between subjects and items: participants were significantly faster in processing sentences containing LBs compared to sentences containing equivalent control items with NLBs for all dependent variables except first fixation duration (FFD), which approached significance (p < 0.1). Similar results were reported in a recent eye-tracking study on the online processing of multiword sequences in Chinese (see Yi et al., 2017), where significant or marginally significant effects of MWS frequency were found in the eye-movement measures also investigated in the present study. Like the present study, Yi et al. (2017) found that the effect of MWS frequency on FFD was marginally significant. The findings reported here are thus consistent with the results reported in previous L2 studies (Ellis, 2008; Durrant and Schmitt, 2009; Hernández et al., 2016; Jiang and Nekrasova, 2007; Kerz and Wiechmann, 2017; Siyanova-Chanturia et al., 2011).
Table 2: Regression coefficients (with standard errors) from the four mixed-effects models fitted to the eye-movement data. Estimates and standard errors of fixation times are in logged milliseconds. One observation is equal to one fixation time (or, in the case of the COUNT model, regressive saccade count) measurement for one sentence read by one participant.
Our study thus provides additional evidence in support of the hypothesis that, similarly to native speakers, non-native speakers can also develop sensitivity to the statistics of MWS. At a more general theoretical level, the results of the present study are consistent with emergentist accounts that challenge dual-system views of language and instead argue for single-system views of language. More importantly, the results indicate that similarities between L1 and late L2 learning are more striking than the differences and, therefore, that unified theoretical models rather than separate ones are needed to account for the mechanisms used for L1 and L2 learning (see, e.g., MacWhinney, 2017). Emergentist accounts have proposed such mechanisms, namely statistical learning and chunking (see Section 1.1 for more details). Sensitivity to the statistics of multiword sequences facilitates chunking, which is required to integrate the greatest possible amount of available information as fast as possible so as to overcome the fleeting nature of linguistic input and the limited nature of our memory for sequences of linguistic input (Now-or-Never bottleneck; see Christiansen and Chater, 2016a).
Some of the questions left open by the current study may provide interesting avenues for future work. First, we investigated sensitivity to 'simple statistics' (i.e. corpus-derived frequencies) of MWS in non-native speakers. The question arises whether similar results could be obtained for 'more complex' distributional statistics using association measures, such as transitional probability or mutual information, or using information-theoretic measures, such as entropy, as well as measures that capture the variability of MWS. Second, the stimulus material used in this study was derived from a corpus representing spoken language. In the light of growing evidence that the statistics of written input play a crucial role in the development of linguistic knowledge, as written input provides a source of substantial change in the statistics of an individual's language experience (Seidenberg and MacDonald, 2018), it would be important to determine whether language users can 'tune to' multiple statistics inherent in different registers/genres. Third, it would be important to determine whether the ability to tune to the statistics of MWS is subject to individual differences and, if so, to what extent these differences are linked to a host of experience-related, cognitive and affective factors.