Modelling semantic acquisition in second language learning

Using methods of statistical analysis, we investigate how semantic knowledge is acquired in English as a second language and evaluate the pace of development across a number of predicate types and content word combinations, as well as across the levels of language proficiency and native languages. Our exploratory study helps identify the most problematic areas for language learners with different backgrounds and at different stages of learning.


Introduction
Acquisition of semantic knowledge and vocabulary of a second language (L2), including appropriate word choice and awareness of selectional preference restrictions, is widely recognised as an important aspect of L2 learning by native speakers, language teachers and learners themselves. Previous research demonstrated a strong correlation between semantic knowledge and proficiency level (Shei and Pain, 2000; Alderson, 2005), and argued that the use of collocations makes one's speech more native-like (Kjellmer, 1991; Aston, 1995; Granger and Bestgen, 2014). James (1998) noted that learners often equate L2 mastery with mastery of L2 vocabulary, and Leacock et al. (2014) mention an experiment in which teachers of English ranked word choice errors among the most serious errors in L2 writing. At the same time, it has also been argued that acquisition of semantic knowledge proceeds on a word-by-word basis, with each word being acquired as a separate construct (Gyllstad et al., 2015), and that acquisition of knowledge of content word combinations is slow and uneven, presenting challenges even at high proficiency levels (Bahns and Eldaw, 1993; Laufer and Waldman, 2011; Thewissen, 2013).
Native speakers are believed to be experts in their own language (James, 1998), and the language norm is usually set based on their preferences (Wulff and Gries, 2011). Apart from errors, learner English is often characterised by differences in the probabilistic distribution of lexical items, which are expressed in under- or overuse of certain constructions (De Cock, 2004; Durrant and Schmitt, 2009; Laufer and Waldman, 2011; Wulff and Gries, 2011). In this paper, we adopt a statistical approach and assume that native and learner language are characterised by different distributions. We investigate how non-native use of language develops and how closely it approximates native use at different levels of proficiency.
The native language distribution is modelled using a combination of the British National Corpus (BNC) and ukWaC, while learner language distributions are modelled using the Cambridge Learner Corpus (CLC). CLC covers various L1 backgrounds as well as 6 language proficiency levels defined by the Common European Framework of Reference for Languages (CEFR) (Council of Europe, 2011a), ranging from "basic" (A1-A2) to "independent" (B1-B2) to "proficient" (C1-C2). In contrast to much of previous research, we run the experiments both on a wider scale, using a large corpus of learner English, and at a finer level of granularity, exploring learner development across proficiency levels. Table 1 defines the amount and range of linguistic constructions that the learners are expected to be familiar with at different levels. Specifically, we explore: (1) the pace of semantic knowledge and vocabulary acquisition across levels; (2) the influence of one's L1 on the development of semantic knowledge; (3) acquisition and development of selectional preference patterns across levels.
Level  Descriptor
A1     Has a very basic repertoire of words and simple phrases related to personal details and particular concrete situations.
A2     Uses basic sentence patterns with memorised phrases, groups of a few words and formulae in order to communicate limited information in simple everyday situations.
B1     Has enough language to get by, with sufficient vocabulary to express him/herself with some hesitation and circumlocutions on topics such as family, hobbies and interests, work, travel, and current events.
B2     Has a sufficient range of language to be able to give clear descriptions, express viewpoints on most general topics, without much conspicuous searching for words, using some complex sentence forms to do so.
C1     Has a good command of a broad range of language allowing him/her to select a formulation to express him/herself clearly in an appropriate style on a wide range of general, academic, professional or leisure topics without having to restrict what he/she wants to say.
C2     Shows great flexibility reformulating ideas in differing linguistic forms to convey finer shades of meaning precisely, to give emphasis, to differentiate and to eliminate ambiguity. Also has a good command of idiomatic expressions and colloquialisms.

Previous research
Within NLP, it is more typical to explore learner language from the perspective of automated assessment or error detection and correction (Leacock et al., 2014), which focus on the contrast between learner and native language in terms of errors in L2, rather than from a language development perspective. The latter was studied more extensively by Second Language Acquisition (SLA) researchers. Previous research looked into vocabulary acquisition and language development by assessing passive, or receptive, vocabulary knowledge (Gyllstad et al., 2015) and trying to estimate the vocabulary that the learners might understand at different proficiency levels (Nation, 2006; Bergsma and Yarowsky, 2013). The vocabulary size tests of the type proposed by Nation (2012) were shown to be inappropriate for testing productive vocabulary knowledge, as they overestimate vocabulary size (Gyllstad et al., 2015). Using learner writing to estimate the productive vocabulary size provides more reliable results, but previous studies in this area were performed on a smaller scale, focusing on a limited number of proficiency levels (Gilquin and Granger, 2011; Granger and Bestgen, 2014), on a limited number of L1s (Gilquin and Granger, 2011; Granger and Bestgen, 2014; Siyanova-Chanturia, 2015), or on overall smaller datasets (Grant and Ginther, 2000; Granger and Bestgen, 2014).
It is widely accepted that vocabulary develops over time, and richer vocabulary is characteristic of better language knowledge (Laufer and Waldman, 1995; Grant and Ginther, 2000). Moreover, as students become more proficient writers, they not only start operating with an overall larger vocabulary, but also become more precise in their word choice, which is reflected in the increase of the type-token ratio (TTR) (Ferris, 1994; Engber, 1995; Frase et al., 1999; Grant and Ginther, 2000). However, the methodology of tagging the word choice and measuring TTR similar to that adopted in Grant and Ginther (2000) fails to take omissions into account, while the method proposed in this paper helps alleviate this problem.
With respect to the development of selectional preference patterns and phraseological knowledge, Siyanova-Chanturia (2015) shows that L2 learners even at lower levels do not just focus on the acquisition of single words but also attend to combinatorial linguistic mechanisms. The studies of Durrant and Schmitt (2009) and Granger and Bestgen (2014) suggest that intermediate learners tend to overuse high-frequency collocations (such as hard work) and underuse lower-frequency collocations (such as immortal souls), while as proficiency in the language increases, this balance changes. Durrant and Schmitt (2009) argue that learners at the lower proficiency levels seem to over-rely on forms which are common in the language, and Paquot and Granger (2012) note that this might be related to the fact that learners feel confident using such common forms.
An interesting observation concerns the pace of semantic knowledge development: for instance, Laufer and Waldman (1995) observed that advanced learners' vocabulary is too varied to remain stable across different samples of writing. Laufer and Waldman (2011) and Nesselhauf (2005) investigated the development of collocational knowledge and came to a somewhat counterintuitive conclusion that more proficient learners produce more deviant collocations than their less proficient counterparts. Thewissen (2008) argues that higher-level learners attempt a much wider range of lexical phrases which are not always error-free, and produce a large number of near-hits as compared to their lower intermediate counterparts. Paquot and Granger (2012) conclude that at an advanced level, learners take more risks, try out more complex lexical phrases and, as a result, produce errors, but those are of a different, more 'advanced' nature than the basic errors typical of earlier stages.

Experimental setup
We focus on three types of content word combinations that are some of the most frequent in learner writing and have previously been found challenging for language learners (Lorenz, 1999; Paquot and Granger, 2012): adjective-noun (AN), verb-direct object (VO) and subject-verb (SV). We (1) investigate how the use of the predicating words (adjectives and verbs) within these combinations develops over time, 1 and (2) look into how their selectional preference patterns change across levels of language proficiency. We do not focus on collocations specifically for two reasons: firstly, there is a lot of disagreement in defining collocations (cf. Foster (2010), Nesselhauf (2005), Hoey (1991)), and secondly, learners have been shown to have difficulties with all types of content word combinations, including those that are referred to as 'free' (Paquot and Granger, 2012).

Data
Learner data: We have extracted the data for our experiments from the Cambridge Learner Corpus (CLC), which is a 52.5 million-word corpus of learner English (Nicholls, 2003). It comprises essays written during examinations in English by language learners with over 80 L1s and representing all 6 CEFR levels (Council of Europe, 2011a). Since the learners are not restricted in the word choice, we believe that the range of vocabulary used in the essays is representative of what is in learners' active lexicon and, therefore, reflects semantic knowledge internalised at this point. We have extracted the word combinations from the full CLC parsed with RASP (Briscoe et al., 2006). Table 2 summarises learner data: we include the number of types (unique combinations), tokens (overall number of combinations), type-token ratio (TTR) as well as the number of predicates for each level. Table 2 demonstrates that the overall number of the combinations and predicates as well as TTR consistently increase from A1 through to C2, with the largest increase between levels A2 and B1, when the learners move from beginner to intermediate level and start using vocabulary beyond the basic and simple, and between levels C1 and C2, when learners are expected to master idiomatic expressions and colloquialisms.

1 We combine adjectives in AN and verbs in VO and SV combinations under the term of predicating words because we assume that they impose the selectional restrictions on the arguments (nouns) within the corresponding combinations.
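The per-level statistics described for Table 2 can be illustrated with a minimal sketch. The function name `summarise` and the toy verb-object pairs are hypothetical and serve only to show how types, tokens, TTR and the number of predicates relate to each other; they are not taken from the CLC.

```python
from collections import Counter

def summarise(combinations):
    """Return (types, tokens, TTR) for a list of (predicate, argument) pairs."""
    counts = Counter(combinations)
    types = len(counts)              # unique combinations
    tokens = sum(counts.values())    # all occurrences of combinations
    ttr = types / tokens if tokens else 0.0
    return types, tokens, ttr

# Toy verb-object pairs, invented for illustration only.
toy_vo = [("make", "decision"), ("make", "decision"),
          ("take", "bus"), ("read", "book")]

types, tokens, ttr = summarise(toy_vo)
predicates = {pred for pred, _ in toy_vo}  # distinct predicating words
```

Running the same summary separately on the combinations extracted for each CEFR level would give one row of statistics per level.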
Native data: To estimate the general linguistic and vocabulary range of a native speaker, we have extracted the statistics on the use of ANs, VOs and SVs and the predicates from a combination of the BNC (Burnard, 2007) and ukWaC (Ferraresi et al., 2008), which together amount to more than 2 billion words. For consistency, the native data has also been parsed with RASP (Briscoe et al., 2006).

Statistical methods
Distribution similarity: We measure the similarity between two distributions using Kullback-Leibler (KL) divergence (MacKay, 2003), which for distributions P and Q is defined as:
$$D_{KL}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)} \qquad (1)$$
In our experiments, P is the distribution in the learner data and Q is the distribution in the native data. The closer the two distributions are, the lower the value of $D_{KL}$. To support the results, we additionally measure the Pearson correlation coefficient (PCC) between the predicates and content word combinations in the learner and native data. PCC is higher for more similar distributions.
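As an illustration of the two measures, the following sketch computes KL divergence and PCC for a pair of toy frequency lists, with P taken from the learner counts and Q from the native counts. Several details are assumptions of the sketch rather than of the study: the additive smoothing constant `alpha` (used so that Q(i) is never zero), the choice to correlate relative frequencies, and the toy counts themselves.

```python
import math
from collections import Counter

def to_distribution(counts, vocab, alpha=1e-6):
    """Relative frequencies over a shared vocabulary.
    Additive smoothing (alpha) is an assumption that keeps KL finite."""
    total = sum(counts.get(w, 0) for w in vocab) + alpha * len(vocab)
    return [(counts.get(w, 0) + alpha) / total for w in vocab]

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy adjective counts, invented for illustration only.
learner = Counter({"good": 60, "big": 30, "nice": 10})
native = Counter({"good": 40, "big": 25, "nice": 15, "decipherable": 20})

vocab = sorted(set(learner) | set(native))
p = to_distribution(learner, vocab)   # P: learner distribution
q = to_distribution(native, vocab)    # Q: native distribution
d_kl = kl_divergence(p, q)
pcc = pearson(p, q)
```

Identical distributions give a KL divergence of 0 and a PCC of 1, which is the direction in which learner data is expected to move at higher proficiency levels.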
Argument clustering: To address the issue of data sparsity, we estimate selectional preferences (SP) over argument classes as well as individual arguments. We obtain SP classes using spectral clustering of nouns with lexico-syntactic features, which has been shown to be effective in previous lexical classification tasks (Brew and Schulte im Walde, 2002; Sun and Korhonen, 2009). Spectral clustering partitions the data relying on a matrix that records similarities between all pairs of data points. We use Jensen-Shannon divergence to measure the similarity between feature vectors for nouns $w_i$ and $w_j$ as follows:
$$d_{JS}(w_i, w_j) = \frac{1}{2} d_{KL}(w_i \,\|\, m) + \frac{1}{2} d_{KL}(w_j \,\|\, m) \qquad (2)$$
where $d_{KL}$ is the KL divergence, and $m$ is the average of $w_i$ and $w_j$. We construct the similarity matrix $S$ computing similarities $S_{ij}$ as $S_{ij} = \exp(-d_{JS}(w_i, w_j))$. The matrix $S$ encodes a similarity graph $G$ over the nouns, where $S_{ij}$ are the adjacency weights. The clustering problem can then be defined as identifying the optimal partition, or cut, of the graph into clusters, such that the intra-cluster weights are high and the inter-cluster weights are low. We cluster the 2,000 most frequent nouns in the BNC, using their grammatical relations as features. The features consist of verb lemmas occurring in the subject, direct object and indirect object relations with the given nouns in the RASP-parsed BNC. The feature vectors are constructed from the corpus counts and normalized by the sum of the feature values.
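The construction of the similarity matrix can be sketched as follows, assuming the definition d_JS(w_i, w_j) = 1/2 d_KL(w_i || m) + 1/2 d_KL(w_j || m) with m the average of the two vectors, and S_ij = exp(-d_JS(w_i, w_j)). The three nouns and their feature counts are invented for illustration, and the clustering step itself is not shown: the resulting matrix would be passed as a precomputed affinity to a spectral clustering implementation (for example scikit-learn's `SpectralClustering` with `affinity="precomputed"`, which is an assumption here and not necessarily the tool used in the study).

```python
import math

def normalise(vec):
    """Normalise corpus counts by the sum of the feature values."""
    total = sum(vec)
    return [v / total for v in vec]

def kl(p, q):
    """KL divergence; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence with m the average of p and q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def similarity_matrix(vectors):
    """S_ij = exp(-d_JS(w_i, w_j)) for all pairs of feature vectors."""
    return [[math.exp(-js(a, b)) for b in vectors] for a in vectors]

# Toy feature counts; each column stands for a hypothetical verb feature.
features = {"water": [8, 1, 1], "juice": [7, 2, 1], "idea": [0, 1, 9]}
nouns = list(features)
S = similarity_matrix([normalise(features[n]) for n in nouns])
```

Because m is nonzero wherever either vector is nonzero, the divergence stays finite without smoothing, and every entry of S lies between 0.5 and 1 when natural logarithms are used.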
Selectional preference model: Once the SP classes are obtained, we quantify the strength of association between a given predicate and each of the classes. We adopt an information theoretic measure proposed by Resnik (1993) for this purpose. Resnik first measures selectional preference strength (SPS) of a predicate in terms of KL divergence between the distribution of noun classes occurring as arguments of the predicate, $p(c|v)$, and the prior distribution of the noun classes, $p(c)$:
$$S_R(v) = D_{KL}\big(p(c|v) \,\|\, p(c)\big) = \sum_{c} p(c|v) \log \frac{p(c|v)}{p(c)} \qquad (3)$$
where $R$ is the grammatical relation for which SPs are computed. SPS measures how strongly the predicate constrains its arguments. Selectional association with a particular argument class is then defined as a relative contribution of that argument class to the overall SPS of the predicate:
$$A_R(v, c) = \frac{1}{S_R(v)} \, p(c|v) \log \frac{p(c|v)}{p(c)} \qquad (4)$$
We extract VO and SV relations, map the argument heads to SP classes and quantify selectional association of a given predicate with each SP class.
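A minimal sketch of the two quantities, assuming the standard formulation of Resnik's measure: SPS is the KL divergence between p(c|v) and p(c), and the selectional association of a class is its term in that sum divided by SPS. The function name, the noun-to-class mapping and the toy pairs are hypothetical, and probabilities are estimated here by simple relative frequency, which is an assumption of the sketch.

```python
import math
from collections import Counter, defaultdict

def selectional_preferences(pairs, noun_class):
    """pairs: (predicate, argument head) tuples for one relation R.
    Returns {predicate: (SPS, {class: selectional association})}."""
    class_counts = Counter()
    pred_class_counts = defaultdict(Counter)
    for pred, noun in pairs:
        c = noun_class.get(noun)
        if c is None:            # skip nouns outside the clustered set
            continue
        class_counts[c] += 1
        pred_class_counts[pred][c] += 1

    total = sum(class_counts.values())
    prior = {c: n / total for c, n in class_counts.items()}   # p(c)

    result = {}
    for pred, counts in pred_class_counts.items():
        n_pred = sum(counts.values())
        # One term of the KL sum per class: p(c|v) * log(p(c|v) / p(c))
        terms = {c: (n / n_pred) * math.log((n / n_pred) / prior[c])
                 for c, n in counts.items()}
        sps = sum(terms.values())
        assoc = {c: t / sps for c, t in terms.items()} if sps > 0 else {}
        result[pred] = (sps, assoc)
    return result

# Toy data, invented for illustration only.
noun_class = {"water": "liquid", "juice": "liquid",
              "idea": "abstract", "plan": "abstract"}
pairs = [("drink", "water"), ("drink", "juice"), ("drink", "water"),
         ("discuss", "idea"), ("discuss", "plan"),
         ("have", "water"), ("have", "idea")]
result = selectional_preferences(pairs, noun_class)
```

In this toy data "drink" constrains its objects strongly and "have" barely at all, so the SPS of "drink" is much higher. Individual association values can be negative when a class occurs with the predicate less often than the prior predicts.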

Experimental results
We run a series of experiments to test the aspects of semantic knowledge acquisition outlined in §1. Table 2 shows that at the lower levels learners operate with quite a small vocabulary. Many previous studies argued that learners at lower levels tend to overuse high-frequency lexical items, whereas over time they expand their vocabulary with less frequent lexical items. It has also been argued that semantic knowledge acquisition is an unsteady process (see §2). First, we explore how exactly the semantic knowledge develops across proficiency levels, and investigate whether content word choice error rates decrease over time. We define the error rate as the proportion of word combinations where the predicate is chosen inappropriately as, for example, in *choose decision instead of make decision, or *actual room instead of current room.

Pace of semantic knowledge acquisition
For that, we identify 10 frequency bands for predicating words within each combination type using native English data. Each band covers from 363 (within band 1 of the most frequent predicates) up to 7,672 (within band 10 of the least frequent ones) unique adjectives in ANs, and similarly from 281 to 3,676 verbs in VOs, and 297 to 3,367 verbs in SVs. For instance, band 1 contains such adjectives as big and good, and verbs give, go and see, while band 10 contains adjectives behaviouristic and decipherable, and verbs factor, garnish and mesmerise. It is reasonable to expect that learners are familiar with the "simpler" words from band 1 even at the lower proficiency levels, while they might find words from band 10 much more challenging. In order to quantitatively assess this, we measure the proportion of new predicating words used at each level and map it to the identified frequency bands. Next, we estimate the error rates for each level and for each frequency band. Figure 1 shows the distribution of the new vocabulary acquired at each level mapped against the frequency bands, as well as the distribution of the error rates across the frequency bands at each level. 4 While we observe that, as expected, learners expand their vocabulary by acquiring words from the bands covering less frequent words, the following trends are worth noting. Most of the verb predicates in VOs and SVs that the learners know at level A1 are covered by frequency band 1. At A2 and B1 they still expand their vocabulary with some verbs from band 1, but starting with level B2 none of the new vocabulary comes from this band. Most new verbs in VOs at level C2 are covered by band 10, and in SVs by band 4. For adjectives, most new vocabulary at A1 and A2 comes from band 1, at B1 from band 3, at B2 from band 5, at C1 from band 8 and at C2 from band 10. Predictably, the error rates decrease towards the higher proficiency levels and within the bands covering more frequent words. The highest error rates are observed on the bands covering less frequent words: for example, even though the error rates are overall lower for C2 level, the highest error rate for C2 is associated with band 10 for all three types of combinations, which confirms that semantic acquisition is challenging even at advanced levels.
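The mapping of new vocabulary and error rates to frequency bands can be sketched as below. The sketch assumes that a band lookup for each predicate is already available and that a word counts as "new" at a level if it was not used at any lower level; both the function names and the toy data are hypothetical, and the way the bands themselves are derived from native data is not reproduced here.

```python
from collections import Counter

LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

def new_vocabulary_by_band(level_vocab, band_of):
    """For each level, the share of newly used predicates per band.
    Assumption: 'new' means not seen at any lower level."""
    seen = set()
    result = {}
    for level in LEVELS:
        new_words = level_vocab.get(level, set()) - seen
        band_counts = Counter(band_of[w] for w in new_words if w in band_of)
        total = sum(band_counts.values())
        result[level] = ({b: n / total for b, n in band_counts.items()}
                         if total else {})
        seen |= new_words
    return result

def error_rate_by_band(uses, band_of):
    """uses: (predicate, is_error) pairs observed at one level."""
    totals, errors = Counter(), Counter()
    for pred, is_error in uses:
        if pred in band_of:
            band = band_of[pred]
            totals[band] += 1
            errors[band] += int(is_error)
    return {b: errors[b] / totals[b] for b in totals}

# Toy band assignments and usage data, invented for illustration only.
band_of = {"big": 1, "good": 1, "give": 1, "actual": 2, "decipherable": 10}
level_vocab = {"A1": {"big", "good"},
               "A2": {"big", "good", "actual"},
               "C2": {"good", "decipherable"}}
result = new_vocabulary_by_band(level_vocab, band_of)
rates = error_rate_by_band(
    [("big", False), ("big", True), ("decipherable", True)], band_of)
```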
While these results corroborate previous findings and show quantitatively how semantic knowledge develops across levels, we look further into how it approximates native English. In particular, it is reasonable to assume that the variety of English used by language learners at the lower proficiency levels is more dissimilar to native English, both for predicates and for content word combinations. We quantify this with the measures described in §3.2 and expect that towards C2 level PCC increases and approaches 1.0, while KL decreases and approaches 0.0. Table 3 presents the PCC and KL values for the distribution of the adjectives and verbs in columns marked with pred for predicating words, and for combinations in columns marked with comb. These values show that PCC steadily increases while KL steadily decreases from level A1 through to level C1, with the biggest "jump" between levels A2 and B1 for the adjectives and verbs in SVs, and A1 and A2 for the verbs in VOs. However, we note that at level C2 the distribution of predicating words is less similar to the native English distribution than at level C1 for all types of combinations; we mark these values in the table in bold. We hypothesise that at level C2 the learners are already familiar with the basic vocabulary and start experimenting with the use of novel constructions, which might result in a quite distinct variety of English (see Thewissen (2008) and Paquot and Granger (2012) for similar hypotheses). To investigate this further, we identify 10 predicates per combination type such that after removing them from the list of predicates, KL between the learner and native distribution improves (see Table 4).

4 More detailed description is available at www.cl.cam.ac.uk/~ek358/vocab-acquisition.html.
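One way to identify predicates whose removal improves KL is a leave-one-out comparison, sketched below. The exact selection procedure is not spelled out in the text, so the choices made here are assumptions: each predicate is removed from both distributions one at a time, the remaining counts are renormalised with a small smoothing constant, and predicates are ranked by the resulting reduction in KL. The toy counts are invented for illustration.

```python
import math

def kl_from_counts(learner, native, vocab, alpha=1e-6):
    """D_KL(learner || native) over vocab, with additive smoothing."""
    l_total = sum(learner.get(w, 0) for w in vocab) + alpha * len(vocab)
    n_total = sum(native.get(w, 0) for w in vocab) + alpha * len(vocab)
    total = 0.0
    for w in vocab:
        p = (learner.get(w, 0) + alpha) / l_total
        q = (native.get(w, 0) + alpha) / n_total
        total += p * math.log(p / q)
    return total

def top_contributors(learner, native, k=10):
    """Rank predicates by how much removing them lowers KL."""
    vocab = set(learner) | set(native)
    base = kl_from_counts(learner, native, vocab)
    gains = []
    for w in vocab:
        reduced = vocab - {w}   # drop the predicate from both distributions
        gains.append((base - kl_from_counts(learner, native, reduced), w))
    gains.sort(reverse=True)
    return [(w, g) for g, w in gains[:k] if g > 0]

# Toy counts in which "good" is heavily overused by learners.
learner = {"good": 70, "big": 20, "nice": 10}
native = {"good": 30, "big": 35, "nice": 35}
top = top_contributors(learner, native)
```

In the toy data the overused word "good" comes out first, which mirrors the pattern of frequent, overused predicates described for Table 4.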
What makes the use of these predicates by learners different from native use? Column "#B" in Table 4 presents the mean of the frequency bands and shows that most of these predicates come from the first two frequency bands, so they represent frequent words that are overused by the learners. We calculate the average error rates for the combinations with these predicates (column "ErR") and compare them to the average error rate over all predicates for each level (in parentheses). For adjectives and verbs in VOs the error rates are comparable to or below the average error rate at the lower levels, and higher than the average at the upper levels. Verbs in SVs demonstrate the opposite trend: at the lower levels error rates associated with the use of these predicates are higher than average, while at the upper levels they are comparable or lower. We conclude that the differences in the distributions at the lower levels are caused by the overuse of the basic vocabulary, while towards the upper levels they are due to occasionally incorrect use of more diverse vocabulary.
The rightmost columns of Table 3 also compare the distribution of the ANs, VOs and SVs in the learner data to those in the native English data. We note that, similarly to the distribution of the predicates, the use of the content word combinations becomes more similar to native use towards higher levels of language proficiency. We also observe a disruption of this trend at C2 level for VOs and SVs, which further supports our hypothesis about the peculiar use of language at C2. Finally, we note that the development proceeds at a quicker pace from A1 through to B2, and slows down at the upper levels.

L1 effects
L1 influence on the word choice has been extensively studied by SLA researchers (Siyanova-Chanturia, 2015; Paquot and Granger, 2012). It seems reasonable to expect that the similarity between one's L1 and L2 should facilitate semantic acquisition in L2: for example, if L1 and L2 belong to the same language group, they can be expected to bear considerable semantic similarities that might help learners acquire semantic knowledge in L2, while one may expect to observe slower learning pace for speakers of more distant L1s (Gilquin and Granger, 2011).

Table 4: Top 10 predicates contributing to the difference between learner and native language distribution
To test to what extent L1 exerts influence on L2 semantic knowledge acquisition, we consider three language groups: Germanic L1s (GE) that belong to the same group as English (EN), Romance L1s (RM) that represent a different group within the same family of the Indo-European languages, and Asian L1s (AS) representing a group of languages most distant from English among the three. We measure KL divergence for the three pairs, GE-EN, RM-EN and AS-EN, on the distribution of the predicates.
The results reported in Table 5 contradict our original assumption as we observe that the variety of English used by speakers of Romance L1s is closer to native English than the variety used by speakers of Germanic L1s. Furthermore, the variety of English used by speakers of Asian L1s, especially at the lower levels, is more similar to native English than the variety used by Germanic L1 speakers. We hypothesise that since Asian L1s are very different from English, the speakers of these languages may prefer to use prefabricated phrases more often than speakers of Germanic L1s, which makes their language more native-like. Similar hypotheses have been formulated earlier: for example, Gilquin and Granger (2011) noted that learners, especially at the lower levels, are likely to repeat expressions that are familiar to them and appear to be safe, and Hulstijn and Marchena (1989) noted that learners tend to rely on a "play-it-safe" strategy rather than experiment unless they are confident in their vocabulary knowledge. We assume that speakers of Germanic L1s might feel more confident in their semantic knowledge and as a result be more "adventurous" in their use of English than speakers of Asian L1s. Our experiments on the individual L1s within each group show the same trends as observed for L1 groups.

Selectional preference patterns
Finally, we investigate how selectional preference patterns develop across proficiency levels and whether they approximate native English patterns. For each predicate in learner and native data, we form argument clusters using the methodology described in §3.2, estimate SP strength for the predicates at each level using eq. 3, and then apply KL divergence and PCC to measure the difference. Table 6 gives an overview of the similarity between the SP models in learner and native data for the arguments and argument clusters (see columns with cl). As before, we observe that the SP models in learner data become more similar to those in native language towards upper levels. Both ANs and VOs show the biggest improvements between A2 and B1, and we observe a disruption in this trend at the levels A2 and C2 (we mark those in bold). Next, we look into the set of predicates that have the most different SP patterns in the learner and native language, and using eq. 4, identify the argument cluster that is most strongly associated with each of these predicates in the learner and native data. For the sake of space, in Table 7 we present only some illustrative examples from different levels and combination types. The experiments suggest that the difference between the learner and native SP models might be due to the learners' use of concrete nouns with the adjectives and verbs where native speakers prefer abstract nouns.

Table 7: Examples of the most strongly associated arguments
To further investigate this hypothesis, we identify 10 predicates per combination type and proficiency level with the most distinct selectional preference patterns. Using the MRC Psycholinguistic Database (Wilson, 1988), we calculate the average concreteness score for the argument clusters in learner and native data. Our results show that at the lower levels learners use more concrete arguments than native speakers, with the difference statistically significant at the 0.05 level according to a t-test, while the difference becomes less pronounced towards C1-C2 levels. Our results for productive vocabulary knowledge corroborate previous findings on the relation between receptive vocabulary knowledge and acquisition of abstract concepts (Tanaka et al., 2013; Vajjala and Meurers, 2014).
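The concreteness comparison can be sketched as follows. The text does not say which variant of the t-test was used, so Welch's unequal-variance version is an assumption here, and only the test statistic and degrees of freedom are computed (a p-value would require a statistics library or a t-table). The concreteness scores are invented placeholders, not actual MRC values, and the function names are hypothetical.

```python
import math
from statistics import mean, variance

def cluster_concreteness(cluster, scores):
    """Average concreteness over the nouns in a cluster that have a rating."""
    rated = [scores[n] for n in cluster if n in scores]
    return mean(rated) if rated else None

def welch_t(xs, ys):
    """Welch's t statistic and degrees of freedom (assumed test variant)."""
    vx, vy = variance(xs) / len(xs), variance(ys) / len(ys)
    t = (mean(xs) - mean(ys)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(xs) - 1) + vy ** 2 / (len(ys) - 1))
    return t, df

# Placeholder ratings, invented for illustration; not real MRC scores.
scores = {"table": 600, "water": 620, "book": 610,
          "idea": 300, "freedom": 280, "plan": 340}

learner_clusters = [["table", "water"], ["book", "water"],
                    ["table", "book", "plan"]]
native_clusters = [["idea", "freedom"], ["plan", "idea"],
                   ["freedom", "book"]]

learner_scores = [cluster_concreteness(c, scores) for c in learner_clusters]
native_scores = [cluster_concreteness(c, scores) for c in native_clusters]
t, df = welch_t(learner_scores, native_scores)
```

A positive t indicates that the learner argument clusters are, on average, more concrete than the native ones in this toy sample.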
The results show that the difference in selectional preference patterns between the learner and native language is due to the concreteness of the selected arguments. This may reflect (1) the difficulty in acquiring semantics of abstract concepts in L2, or, alternatively, (2) L1-based instructional practices that may focus first on teaching concrete concepts before abstract concepts. The awareness of this discrepancy can serve as further guidance for language instructors and learners, and help make one's language use more native-like.

Discussion and conclusions
This paper reports the results of a large-scale corpus-based exploratory study of semantic knowledge acquisition by L2 learners. In contrast to previous work, we ran experiments on a wider scale, using a large learner corpus, and at finer granularity, investigating L2 development across 6 CEFR proficiency levels. We show that (1) the learners tend to overuse highly frequent English words across all proficiency levels, although towards the higher levels the lexical distributions in learner and in native language become more similar; (2) the two peaks of vocabulary acquisition are associated with the transition between beginner and intermediate levels (A2-B1), and between the two proficient levels (C1-C2); (3) lexical distribution at upper proficient level (C2) is less similar to native distribution than at lower proficient level (C1), which may be due to the more creative language use at C2; (4) the variety of English used by speakers of more distant L1s at lower levels of proficiency is closer to native English than the variety used by speakers of closer L1s, which might be an effect of the "play-it-safe" strategy adopted by learners; (5) concrete nouns tend to be more strongly associated with the predicates in learner language than abstract nouns. The methodology presented in this paper can help identify the gaps in learner vocabulary knowledge and tailor vocabulary acquisition exercises to the needs of learners at different proficiency levels.
We acknowledge that the potential topic and genre bias of learner exam data is a limitation of our corpus-based approach. We believe that corpus-based studies of the type presented in this paper will facilitate further research into semantic knowledge development, although it is possible that learner corpora provide only limited access to productive learner vocabulary. As Siyanova-Chanturia (2015) notes, "in an ideal world, one would use the same topic across and within all tested levels, but in a language classroom, this is hardly possible". Future work will investigate possible solutions for this problem such as (1) augmentation of the data with other learner corpora, (2) use of fill-in-the-gaps exercises that test vocabulary knowledge directly, and (3) sampling of the native data to more closely reflect the selection of topics in the learner data.