Cross-Lingual Lexico-Semantic Transfer in Language Learning

Lexico-semantic knowledge of our native language provides an initial foundation for second language learning. In this paper, we investigate whether and to what extent the lexico-semantic models of the native language (L1) are transferred to the second language (L2). Specifically, we focus on the problem of lexical choice and investigate it in the context of three typologically diverse languages: Russian, Spanish and English. We show that a statistical semantic model learned from L1 data improves automatic error detection in L2 for the speakers of the respective L1. Finally, we investigate whether the semantic model learned from a particular L1 is portable to other, typologically related languages.


Introduction
Lexico-semantic knowledge of our native language is one of the factors that underlie our ability to communicate and reason about the world. It is also the knowledge that guides us in the process of second language learning. Lexico-semantic variation across languages (Bach and Chao, 2008) makes lexical choice a challenging task for second language learners (Odlin, 1989). For instance, the meaning of the English expression pull the trigger is realised as *push the trigger in Russian and Spanish, possibly leading to errors of lexical choice by Russian and Spanish speakers learning English. Our native language (L1) plays an essential role in the process of lexical choice. When choosing between several linguistic realisations in L2, non-native speakers may rely on the lexico-semantic information from L1 and select a translational equivalent that they deem to match their communicative intent best. For example, Russian speakers *do exceptions and offers instead of making them, and *find decisions instead of finding solutions, since in Russian do and make have a single translational equivalent (delat'), and so do decision and solution (resheniye). As a result, non-native speakers who tend to fall back on their L1 translate phrases word for word, violating English lexico-semantic conventions.
The effect of L1 interference on lexical choice in L2 has been pointed out in a number of studies (Chang et al., 2008; Rozovskaya and Roth, 2010; Rozovskaya and Roth, 2011; Dahlmeier and Ng, 2011). Some of these studies also demonstrated that using L1-specific properties, such as the error patterns of speakers of a given L1 or L1-induced paraphrases, improves the performance of automatic error correction in non-native writing. However, none of these approaches has constructed a semantic model from L1 data and systematically studied the effects of its transfer onto L2. In addition, most previous work has focused on error correction, bypassing the task of error detection for lexical choice. Lexical choice is one of the most challenging tasks for both non-native speakers and automated error detection and correction (EDC) systems. The results of the most recent shared task on EDC, which spanned all error types including lexical choice, show that most teams either did not propose any algorithms for this type of error or did not perform well on it (Ng, 2014).
In this paper, we experimentally investigate the influence of L1 on lexical choice in L2 and whether lexico-semantic models from L1 are transferred to L2 during language learning. For this purpose, we induce L1 and L2 semantic models from corpus statistics in each language independently, and then use the discrepancies between the two models to identify errors of lexical choice. We focus on two types of verb-noun combinations, VERB-DIRECT OBJECT (dobj) and SUBJECT-VERB (subj), and consider two widely spoken L1s from different language families: Russian and Spanish. We conduct our experiments using the Cambridge Learner Corpus (Nicholls, 2003), containing writing samples of non-native speakers of English. Spanish speakers account for around 24.6% of the non-native speakers represented in this corpus and Russian speakers for 4%.
Our experiments test two hypotheses: (1) that L1 effects in the lexical choice in L2 reveal themselves in the difference of the word association strength in the L1 and L2; and (2) that L1 lexico-semantic models are portable to other, typologically related languages. To the best of our knowledge, our paper is the first to experimentally investigate these questions. Our results demonstrate that L1-induced information improves automatic error detection for lexical choice, confirming the hypothesis that L1 speakers rely on semantic knowledge from their native language during L2 learning. We test the second hypothesis by verifying that Russian speakers exhibit similar trends in errors with the speakers of other Slavic languages, and Spanish speakers with the speakers of other Romance languages. We find that the L1-induced information from Russian and Spanish is effective in assessing lexical choice of the speakers of other languages for both language groups.
Related work

Error detection in content words
Early approaches to collocation error detection relied on manually created databases of correct and incorrect word combinations (Shei and Pain, 2000; Wible et al., 2003; Chang et al., 2008). Constructing such databases is expensive and time-consuming, and therefore more recent research has turned to machine learning techniques. Leacock et al. (2014) note that most approaches to detection and correction of collocation errors compare the writer's word choice to a set of alternatives using association strength measures and choose the combination with the highest score, reporting an error if this combination does not coincide with the original choice (Futagi et al., 2008; Östling and Knutsson, 2009; Liu et al., 2009). This strategy is expensive, as it relies on comparison with a set of alternatives; limited in capacity, as it depends on the quality of the alternatives generated; and circular, as detection cannot be performed independently of correction. Our approach alleviates these problems, since error detection depends on the original combination only.
Some previous approaches focused on correction only (Dahlmeier and Ng, 2011; Kochmar and Briscoe, 2015), and although they show promising results, they have not attempted to perform error detection in lexical choice. Kochmar and Briscoe (2014) focus on error detection, but their system addresses adjective-noun combinations and does not use L1-induced information.

L1 factors in L2 writing
The influence of an L1 on lexical choice in L2 and the resulting errors have been previously studied (Chang et al., 2008; Östling and Knutsson, 2009; Dahlmeier and Ng, 2011). These works focus on errors in particular L1s and use the translational equivalents directly to improve candidate selection and quality of corrections. Dahlmeier and Ng (2011) show that L1-induced paraphrases outperform approaches based on edit distance, homophones, and WordNet synonyms in selecting the appropriate corrections. Rozovskaya and Roth (2010) show that an error correction system for prepositions benefits from restricting the set of possible corrections to those observed in the non-native data. Rozovskaya and Roth (2011) further demonstrate that the models perform better when they use knowledge about error patterns of the non-native writers. According to their results, an error correction algorithm that relies on a set of priors dependent on the writer's preposition and the writer's L1 outperforms other methods. Madnani et al. (2008) show promising results in whole-sentence grammatical error correction using round-trip translations from Google Translate via 8 different pivot languages.
The results of these studies suggest that L1 is a valuable source of information in EDC. However, all these works use isolated translational equivalents and focus on error correction only. In contrast, we construct holistic semantic models of L1 from L1 corpora and use these models to perform the more challenging task of error detection.

Data
We first use large monolingual corpora in Spanish, Russian and English to build word association models for each of the languages. We then apply the resulting models for error detection in the English learner data.

L1 Data
Spanish data. The Spanish data was extracted from the Spanish Gigaword corpus (Mendonca et al., 2011), a one billion-word collection of news articles in Spanish. The corpus was parsed using the Spanish Malt parser (Nivre et al., 2007; Ballesteros et al., 2010). We extracted VERB-SUBJECT and VERB-DIRECT OBJECT relations from the output of the parser, which we then used to build an L1 word association model for Spanish.
Russian data. The Russian data was extracted from the RU-WaC corpus (Sharoff, 2006), a two billion-word representative collection of texts from the Russian Web. The corpus was parsed using the Malt dependency parser for Russian (Sharoff and Nivre, 2011), and the VERB-SUBJECT and VERB-DIRECT OBJECT relations were extracted from the parser output to create an L1 word association model for Russian.
Dictionaries and translation. Once the L1 word associations have been computed for the verb-noun pairs, we identify possible translations for verbs and nouns (in each pair) in isolation, as a language learner might do. To create the translation dictionaries, we extracted translations from the English-Spanish and English-Russian editions of Wiktionary, both from the translation sections and from the gloss sections if the latter contained single words as glosses. Since we focus on verb-noun pairs, all multi-word expressions were removed. We added inverse translations for every original translation. We then created separate translation dictionaries for each language and part-of-speech tag combination from the resulting collection of translations.
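The dictionary construction and word-by-word translation described above can be sketched as follows. This is only an illustration under stated assumptions: the input format (triples of English word, L1 word and part of speech), the function names and the transliterated toy entries are hypothetical and are not taken from the actual Wiktionary extraction.

```python
from collections import defaultdict


def build_dictionaries(raw_pairs):
    """Build per-POS translation dictionaries in both directions.

    raw_pairs: iterable of (english, l1_word, pos) triples (hypothetical format).
    Multi-word expressions on either side are dropped.
    """
    dicts = defaultdict(
        lambda: {"en2l1": defaultdict(set), "l12en": defaultdict(set)}
    )
    for english, l1_word, pos in raw_pairs:
        en, l1 = english.strip().lower(), l1_word.strip().lower()
        if " " in en or " " in l1:
            continue  # multi-word expression: removed
        dicts[pos]["en2l1"][en].add(l1)
        # Inverse translation added for every original translation.
        dicts[pos]["l12en"][l1].add(en)
    return dicts


def translate_pair(verb, noun, dicts):
    """Translate an L1 verb-noun pair word by word, as a learner might."""
    verbs = dicts["verb"]["l12en"].get(verb, set())
    nouns = dicts["noun"]["l12en"].get(noun, set())
    return {(v, n) for v in verbs for n in nouns}


# Toy entries; the transliterations are illustrative only.
raw = [
    ("do", "delat'", "verb"),
    ("make", "delat'", "verb"),
    ("find", "nakhodit'", "verb"),
    ("decision", "resheniye", "noun"),
    ("solution", "resheniye", "noun"),
    ("pull the trigger", "nazhat' na kurok", "verb"),  # dropped as multi-word
]
dicts = build_dictionaries(raw)
candidates = translate_pair("nakhodit'", "resheniye", dicts)
```

With these toy entries, a single Russian pair yields both find decision and find solution as English candidates, which is exactly the ambiguity a learner faces.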

L2 data
To build the English word association model, we have used a combination of the British National Corpus (Burnard, 2007) and the UKWaC (Baroni et al., 2009). The corpora were parsed by the RASP parser (Briscoe et al., 2006) and VERB-SUBJECT and VERB-DIRECT OBJECT relations were extracted from the parser output. Since the UKWaC is a Web corpus, we assume that the data contains a certain amount of noise, e.g. typographical errors, slang and non-words. We filter these out by checking that the verbs and nouns in the extracted relations are included in WordNet (Miller, 1995) with the appropriate part of speech.

Learner data
To extract the verb-noun combinations that have been used by non-native speakers in practice, we use the Cambridge Learner Corpus (CLC), which is a 52.5 million-word corpus of learner English collected by Cambridge University Press and Cambridge English Language Assessment since 1993 (Nicholls, 2003). It comprises English examination scripts written by learners of English with 148 different L1s, ranging across multiple examinations and covering all levels of language proficiency. A 25.5 million-word component of the CLC has been manually error-annotated.
We have preprocessed the CLC with the RASP parser (Briscoe et al., 2006), as it is robust when applied to ungrammatical sentences. We have then extracted all dobj and subj combinations: in total, we have extracted 187,109 dobj and 225,716 subj combinations. We have used the CLC error annotation to split the data into correct combinations and errors. We note that some verb-noun combinations are annotated both as being correct and as errors, depending on their wider context of use. To ensure that the annotation we use in our experiments is reliable and not context-dependent, we have empirically set a threshold to filter out ambiguously annotated instances. The set of correct word combinations includes only those word pairs that are used correctly in at least 70% of the cases they occur in the CLC; the set of errors includes only those that are used incorrectly at least 70% of the time.
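A minimal sketch of the 70% filtering step is given below. The input format (one verb-noun pair and one error flag per annotated occurrence) and the counts are assumptions made for illustration and do not reflect actual CLC figures.

```python
from collections import Counter


def split_by_annotation(instances, threshold=0.7):
    """Split pairs into reliably correct and reliably erroneous sets.

    instances: iterable of ((verb, noun), is_error) items, one per occurrence.
    Pairs that reach the threshold in neither direction are discarded.
    """
    total, errors = Counter(), Counter()
    for pair, is_error in instances:
        total[pair] += 1
        if is_error:
            errors[pair] += 1

    correct_set, error_set = set(), set()
    for pair, n in total.items():
        n_err = errors[pair]
        if n_err / n >= threshold:
            error_set.add(pair)
        elif (n - n_err) / n >= threshold:
            correct_set.add(pair)
        # otherwise: ambiguously annotated, filtered out
    return correct_set, error_set


# Toy occurrences (hypothetical counts).
instances = (
    [(("find", "decision"), True)] * 4 + [(("find", "decision"), False)] * 1
    + [(("make", "decision"), False)] * 9 + [(("make", "decision"), True)] * 1
    + [(("do", "business"), True)] * 1 + [(("do", "business"), False)] * 1
)
correct_set, error_set = split_by_annotation(instances)
```

Here find decision is kept as an error (80% of its uses are annotated as errors), make decision is kept as correct (90%), and do business is dropped as ambiguous.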

Experimental datasets
We split the annotated CLC data by language and relation type. Table 1 presents the statistics on the datasets collected. We extract the verb-noun combinations from the CLC texts written by native speakers of Russian (RU) and Spanish (ES) to test our first hypothesis, as well as by speakers of ALL L1s in the CLC to test our second hypothesis. We then filter the extracted relations using the translated verb-noun pairs from Russian and Spanish corpora.
We note that Russian and Spanish have comparable numbers of word combinations in L1-specific subsets (10K-12K for dobj and subj combinations) and comparable error rates (ERR). We also note that the error rates in the dobj subsets are higher than in the subj subsets, presumably because VERB-SUBJECT combinations allow for more flexibility in lexical choice. We find a large number of translated word combinations in other L1s, and it is interesting to note that the error rates are higher across multiple languages than in the same L1s, which corroborates our second hypothesis that the lexico-semantic models from L1s transfer to L2. The last two columns of Table 1 show how diverse our datasets are in terms of verbs and nouns used in the constructions: for example, the RU dobj subset contains combinations with 786 different verbs and 1,918 different nouns.

Methods
Our approach to detecting lexico-semantic transfer errors relies on the intuition that a mismatch between the lexico-semantic models in two languages reveals itself in the difference in word association scores. We argue that a high association score of a verb-noun combination in L1 shows that it is a collocation in L1, whereas a low association score of its translational equivalent in L2 signals an error in L2 stemming from lexico-semantic transfer. Following previous research (Baldwin and Kim, 2010), we measure the strength of verb-noun association using pointwise mutual information (PMI). Figure 1 illustrates this intuition. In Russian, both *find decision and find solution have a high PMI score. In English, however, the latter has a high PMI while the former has a negative PMI. We expect such a discrepancy in word association to be an indicator of an error of lexical choice driven by the L1 semantics. We treat the task of lexico-semantic transfer error detection as a binary classification problem and train a classifier for this task. The classifier uses a combination of L1 and L2 semantic features. If our hypothesis holds, we expect to see an improvement in the classifier's performance when adding L1 semantic features.
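As a rough illustration of the association measure, the sketch below estimates PMI from verb-noun co-occurrence counts. The log base, the estimation of marginals from the relation counts themselves, the treatment of unseen pairs and the toy counts are all assumptions; the text above does not specify these details.

```python
import math
from collections import Counter


def pmi_table(pair_counts):
    """Return a function computing PMI(verb, noun) from co-occurrence counts.

    PMI = log2( p(v, n) / (p(v) * p(n)) ), with all probabilities estimated
    from the counts of extracted verb-noun relations.
    """
    total = sum(pair_counts.values())
    verb_counts, noun_counts = Counter(), Counter()
    for (verb, noun), count in pair_counts.items():
        verb_counts[verb] += count
        noun_counts[noun] += count

    def pmi(verb, noun):
        joint = pair_counts.get((verb, noun), 0)
        if joint == 0:
            return float("-inf")  # unseen pair; no smoothing in this sketch
        return math.log2(joint * total / (verb_counts[verb] * noun_counts[noun]))

    return pmi


# Toy English dobj counts (hypothetical numbers).
english_counts = Counter({
    ("find", "solution"): 80,
    ("find", "decision"): 1,
    ("make", "decision"): 90,
    ("make", "solution"): 2,
})
pmi_en = pmi_table(english_counts)
```

On these toy counts, `pmi_en("find", "solution")` is positive and `pmi_en("find", "decision")` is negative, mirroring the contrast described for English.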

L2 lexico-semantic features
We experiment with two types of L2 features: lexico-semantic features and semantic vector space features.
Lexico-semantic features include:
• pmi in L2: we estimate the association strength between the noun and verb using the combined BNC and UKWaC corpus;
• verb and noun: the identity of the verb and the noun in the pair, encoded in a numerical form in the range of (0, 1). The motivation behind that step is that certain words are more error-prone than others and converting them into numerical features helps the classifier to use this information.
Semantic vector space features. Kochmar and Briscoe (2014) obtained state-of-the-art results in error detection by using the semantic component of the content word combinations. We reimplement these features and test their impact on our task. We extracted the noun and verb vectors from the publicly available word2vec dataset of word embeddings for 3 million words and phrases. The 300-dimensional vectors have been trained on a part of the Google News dataset (about 100 billion words) using word2vec. The dobj and subj vectors are then built using element-wise addition on the vectors (Mitchell and Lapata, 2008; Kochmar and Briscoe, 2014).
Once the compositional vectors are created, the method relies on the idea that correct combinations can be distinguished from the erroneous ones by certain vector properties (Vecchi et al., 2011; Kochmar and Briscoe, 2014). We implement a set of numerical features based on the following properties of the vectors:
• length of the additive (vn) vector
• cos vn∧n: cosine between the vn vector and the noun vector
• cos vn∧v: cosine between the vn vector and the verb vector
• dist 10: distance to the 10 nearest neighbours of the vn vector
• lex-overlap: proportion of the 10 nearest neighbours of the vn vector containing the verb/noun
• comp-overlap: overlap between the 10 neighbours of the vn vector and 10 neighbours of the verb/noun vector
• cos v∧n: cosine between the verb and the noun vectors.
The 10 nearest neighbours are retrieved in the combined semantic space containing word embeddings and additive phrase vectors. All features, except for the last one, have been introduced in previous work and showed promising results (Vecchi et al., 2011;Kochmar and Briscoe, 2014). For example, it has been shown that the distance from the constructed word combination vector to its nearest neighbours is one of the discriminative features of the error detection classifier. Manual inspection of the vectors and nearest neighbours shows that the closest neighbour to *find decision is see decision with the similarity of 0.8735 while the closest one to find solution is discover solution with the similarity of 0.9048.
We implement an additional cos v∧n feature based on the intuition that the distance between the verb and noun vectors themselves may indicate a semantic mismatch and thus help in detecting lexical choice errors.
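The vector features that do not require a neighbour search can be sketched in a few lines. This is an illustrative sketch only: it uses plain Python lists and two-dimensional toy vectors rather than the 300-dimensional word2vec embeddings, the feature names are hypothetical, and the neighbour-based features (dist 10, lex-overlap, comp-overlap) are omitted because they need the full semantic space.

```python
import math


def add(u, v):
    """Element-wise addition, used to compose the verb-noun (vn) vector."""
    return [a + b for a, b in zip(u, v)]


def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(sum(a * a for a in u))


def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector has zero length."""
    denom = norm(u) * norm(v)
    if denom == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / denom


def vector_features(verb_vec, noun_vec):
    """Compute the neighbour-free features for one verb-noun combination."""
    vn = add(verb_vec, noun_vec)
    return {
        "length": norm(vn),
        "cos_vn_n": cosine(vn, noun_vec),
        "cos_vn_v": cosine(vn, verb_vec),
        "cos_v_n": cosine(verb_vec, noun_vec),
    }


# Toy vectors standing in for real embeddings.
features = vector_features([1.0, 0.0], [0.0, 1.0])
```

For orthogonal toy vectors like these, the verb-noun cosine is zero while the composed vector is equally close to both of its parts.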

L1 lexico-semantic features
We first quantified the strength of association between the L1 verbs and nouns in the original L1 data, using PMI. We then generated a set of possible translations for each verb-noun pair in L1 using the translation dictionaries. Each verb-noun pair in the CLC was then mapped to one of the translated L1 pairs and its L1 features. We used the following L1 features in classification:
• pmi in L1: we estimate the strength of association on the original L1 corpora;
• difference between the PMI of the verb-noun pair in L1 and in L2.
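The mapping from a learner's English pair to L1 features might look like the sketch below. The text does not state which L1 pair is chosen when several translate to the same English pair; picking the candidate with the highest L1 PMI is purely an assumption of this sketch, as are the function name, the dictionary-based PMI lookups and the toy scores.

```python
def l1_features(en_pair, l1_candidates, pmi_l1, pmi_l2):
    """Return L1 features for an English verb-noun pair, or None if unmapped.

    l1_candidates: L1 pairs whose word-by-word translation yields en_pair.
    pmi_l1, pmi_l2: dicts mapping a pair to its PMI in the L1 / L2 corpora.
    Assumption: when several L1 pairs match, the one with the highest
    L1 PMI is used.
    """
    scored = [(pmi_l1[pair], pair) for pair in l1_candidates if pair in pmi_l1]
    if not scored or en_pair not in pmi_l2:
        return None
    best_pmi, best_pair = max(scored)
    return {
        "l1_pair": best_pair,
        "pmi_l1": best_pmi,
        "pmi_diff": best_pmi - pmi_l2[en_pair],  # L1 PMI minus L2 PMI
    }


# Toy scores (hypothetical values; transliteration is illustrative).
pmi_ru = {("nakhodit'", "resheniye"): 4.2}
pmi_en_scores = {("find", "decision"): -1.5, ("find", "solution"): 3.9}

feats_error = l1_features(
    ("find", "decision"), [("nakhodit'", "resheniye")], pmi_ru, pmi_en_scores
)
feats_ok = l1_features(
    ("find", "solution"), [("nakhodit'", "resheniye")], pmi_ru, pmi_en_scores
)
```

With these toy scores the difference feature is large for *find decision and small for find solution, which is the kind of discrepancy the classifier is meant to exploit.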

Classification
Classifier settings. We treat the task as a binary classification problem and apply a linear SVM classifier using the scikit-learn LinearSVC implementation. The error rates in Table 1 show that we are dealing with a two-class problem where one class (correct word combinations) significantly outnumbers the other class (errors) by up to 11:1 (on RU subj). To address the problem of class imbalance, we use subsampling: we randomly split the set of correct word combinations into n samples, keeping the majority class baseline under 0.60, and run n experiments over the samples. We apply 10-fold cross-validation within each sample. The results reported in the following sections are averaged across the samples for each dataset.
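One possible reading of the subsampling scheme is sketched below: the correct combinations are shuffled and split into the smallest number of chunks n for which each chunk, paired with the full set of errors, has a majority-class share under 0.60. Reusing all errors in every sample and choosing the smallest such n are assumptions of this sketch, not details stated in the text.

```python
import random


def subsample(correct, errors, max_baseline=0.60, seed=0):
    """Split `correct` into n chunks, each paired with all `errors`.

    n is the smallest number of chunks such that the share of correct
    items in every sample stays below `max_baseline`.
    """
    if not errors:
        raise ValueError("at least one error instance is required")
    rng = random.Random(seed)
    shuffled = list(correct)
    rng.shuffle(shuffled)

    n = 1
    while True:
        largest_chunk = -(-len(shuffled) // n)  # ceiling division
        if largest_chunk / (largest_chunk + len(errors)) < max_baseline:
            break
        n += 1

    # Striding yields n chunks whose sizes differ by at most one.
    chunks = [shuffled[i::n] for i in range(n)]
    return [(chunk, list(errors)) for chunk in chunks]


# Toy data with an 11:1 imbalance, as in the most skewed dataset.
correct_items = [("correct", i) for i in range(110)]
error_items = [("error", i) for i in range(10)]
samples = subsample(correct_items, error_items)
```

Each (chunk, errors) sample could then be passed to a linear SVM with 10-fold cross-validation, and the scores averaged over the samples.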
Evaluation. The goal of the classifier is to detect errors, therefore we primarily focus on its performance on the error class and, in addition to accuracy, report precision (P), recall (R) and F1 on this class. Previous studies (Nagata and Nakatani, 2010) suggest that systems with high precision in detecting errors are more helpful for L2 learning than systems with high recall, as non-native speakers find misidentified errors very misleading. In line with this research, we focus on maximising precision on the error class.
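For concreteness, the measures on the error class can be computed as in the following sketch. The label encoding (1 for error, 0 for correct) and the toy predictions are assumptions made for illustration.

```python
def error_class_scores(gold, predicted, error_label=1):
    """Accuracy plus precision, recall and F1 computed on the error class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == error_label and p == error_label)
    fp = sum(1 for g, p in pairs if g != error_label and p == error_label)
    fn = sum(1 for g, p in pairs if g == error_label and p != error_label)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    accuracy = sum(1 for g, p in pairs if g == p) / len(pairs)
    return {"accuracy": accuracy, "P": precision, "R": recall, "F1": f1}


# Toy labels: 1 = error, 0 = correct.
gold = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
scores = error_class_scores(gold, predicted)
```

In this toy case two of the three flagged combinations are true errors and two of the three true errors are found, so precision and recall on the error class are both 2/3 while accuracy is 0.75.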
Baseline. We compare the performance of our different feature sets to a baseline classifier which uses the L2 co-occurrence frequency of the verb and noun in the pair as a single feature. Frequency sets a competitive baseline as it is often judged to be a measure of acceptability of an expression, and many previous works relied on frequency of occurrence as evidence of acceptability (Shei and Pain, 2000; Futagi et al., 2008).

Experimental Results
To test our hypothesis that lexico-semantic models are transferred from L1 to L2, we first run the set of experiments on the L1 subsets of the CLC data, that is RU → RU CLC and ES → ES CLC, where the left-hand side of the notation denotes the lexico-semantic model and the right-hand side the L1 of the speakers that produced the word pairs extracted from the CLC. We incrementally add the features, starting with the set of lexico-semantic features in L2 that are readily available without reference to the L1, and later adding L1 semantic features, and measure their contribution.

L2 lexico-semantic features
The first system configuration we experiment with uses the set of lexico-semantic features from L2. Adding the noun as a feature decreases performance of the classifier and we do not further use this feature. The verb used as an additional feature consistently improves classifier performance.

L2 semantic vector space features
Next, we test the combination of the semantic vector space features (sem) and combine them with two L2 lexico-semantic features including pmi En and verb (denoted as ft En hereafter for brevity). Table 3 reports the results.
We note that the semantic vector space features on their own yield precision of 50% to 52% on the error class in dobj combinations and lower than 50% on subj combinations. This suggests that the classifier misidentifies correct combinations as errors more frequently than it correctly detects errors. Moreover, recall of this system configuration is also low on all datasets. Adding the semantic vector space features to the other L2 semantic features, however, improves the performance, as shown in Table 3. As both groups of features refer to the phenomena in L2, the results suggest that they complement each other.
Table 4: System performance (in %) using L1 and L2 lexico-semantic features, L1 → L1 CLC.

L1 lexico-semantic features
Finally, we add the L1 lexico-semantic features to the well-performing L2 features (pmi and verb).
The combination of L1 lexico-semantic features with the L2 lexico-semantic and semantic vector space features achieves lower results, so we do not report it here. The use of L1 pmi improves both the accuracy and the F-score of the error class (see Table 4). For ease of comparison, we also include the results obtained using a combination of L2 lexico-semantic features (denoted ft En). The addition of the explicit difference feature between the two PMIs has not yielded further improvement. This is likely to be due to the fact that the classifier already implicitly captures the knowledge of this difference in the form of individual L1 and L2 PMIs. We note that the system using a combination of L1 and L2 lexico-semantic features gains an absolute improvement in accuracy ranging from 1.04% on RU subj to 2.55% on ES dobj. The performance on the error class improves in all but one case (P e on RU dobj), with an absolute increase in F1 of up to 7.66%. The system has both a higher coverage in error detection (a rise in recall) and a higher precision. The improvement in performance across all four datasets is statistically significant at the 0.05 level. These results demonstrate the effect of lexico-semantic model transfer from L1 to L2.

Effect on different L1s
Next, we test our second hypothesis that a lexico-semantic model from one L1 is portable across several L1s, in particular, typologically related ones. We first experiment with the data representing all L1s in the CLC and then with the data representing a specific language group. We compare the performance of the baseline system using verb-noun co-occurrence frequency as a single feature, the system that uses L2 semantic features only and the system that combines both L2 and L1 semantic features. Table 1 shows that using the translated verb-noun combinations from our L1s (RU and ES) we are able to find a large number of both correct and erroneous combinations in different L1s in the CLC including RU and ES (see ALL). This gives us an initial confirmation that the lexico-semantic models may be shared across multiple languages.

Experiments on all L1s
We then experiment with error detection across all L1s represented in the CLC. The results are shown in Table 5. The baseline system achieves similar performance on RU → ALL CLC as on RU → RU CLC, and better performance on ES → ALL CLC than on ES → ES CLC. The results obtained with the L2 lexico-semantic features are also comparable: the system achieves an absolute increase in accuracy of up to 9.86% for the model transferred from RU subj, reaching an accuracy of around 65% to 66% with balanced performance in terms of precision and recall on errors.
When the L1 lexico-semantic features are added to the model, we observe an absolute increase in accuracy ranging from 0.57% (for RU subj) to 1.43% (for ES dobj). The Spanish lexico-semantic model has a higher positive effect on all measures, including precision on the error class. Although the addition of the L1 lexico-semantic features does not have a significant effect on the accuracy and precision, the system achieves an absolute improvement in recall of up to 12.71% (on RU dobj). That is, the system that uses L1 lexico-semantic features is able to find more errors in the data originating with a set of different L1s. Generally, the results of the Spanish model are more stable and comparable to the results in the previous section, which may be explained by the fact that Spanish is better represented in the CLC.
Table 5: System performance (in %) using L1 and L2 lexico-semantic features, L1 → all L1s.

Experiments on related L1s
The results on ALL L1s confirm our expectations: since we have extracted verb-noun combinations that originate with two particular L1s from the set of all different L1s in the CLC, and then used the L1 lexico-semantic features, the system is able to identify more errors, and thus we observe an improvement in recall. The precision, however, does not improve, possibly because the set of errors in ALL L1s is different from that in the two L1s we rely on to build the lexico-semantic models. The final question that we investigate is whether the lexico-semantic models of our L1s are directly portable to typologically related languages. If this is the case, we expect to see an effect on the precision of the classifier as well as on the recall. We experiment with the following groups of related languages, ordered by the number of verb-noun pairs we found in the CLC data:
• RU group: Russian, Polish, Czech, Slovak, Serbian, Croatian, Bulgarian, Slovene;
• ES group: Spanish, Italian, Portuguese, French, Catalan, Romanian, Romansch.
In addition to investigating the effect of the L1 lexico-semantic model on the whole language group, we also consider its effects on individual languages. We chose Polish for the RU model and Italian for the ES model, as these two languages have the most data representing their native speakers in the CLC. Table 6 shows the number of verb-noun combinations and error rates for the language groups and these individual languages.
The results are presented in Tables 7 and 8. They exhibit similar trends in the change of the system performance on L1 → L1 GROUP as we see for L1 → ALL L1s. Adding the L1 lexico-semantic features has only a minor effect on accuracy and precision, and a more pronounced effect on recall. In contrast, when we test the system on one particular related L1 (Table 8) we observe the opposite effect: with the exception of ES subj data, precision and accuracy improve, suggesting that the error detection system using L1-induced information identifies errors more precisely.
Overall, the observed gains in performance indicate that L1 semantic models contribute information to lexical choice error detection in L2 for the speakers of typologically related languages. This in turn suggests that there may be less semantic variation within a language group than across different language groups.

Discussion and data analysis
The best accuracy achieved in our experiments is 71.19% on ES subj combinations. However, previous research suggests that error detection in lexical choice is a difficult task. For instance, Kochmar and Briscoe (2014) report that the agreement between human annotators on error detection in adjective-noun combinations is 86.50%.
We then qualitatively assessed the performance of our systems by analysing what types of errors the classifiers detect. The detected errors involve:
• verbs offer, propose and suggest, which are often confused with each other. Correctly identified errors include *offer plan vs. suggest plan, *propose work vs. offer work and *suggest cost vs. offer cost;
• verbs demonstrate and show, where demonstrate is often used instead of show as in *chart demonstrates;
• verbs say and tell, particularly well identified with the ES model. Examples include *say idea instead of tell idea and *tell goodbye instead of say goodbye.
These examples represent lexical choice errors when selecting among near-synonyms, and violations of verb subcategorization frames. The error in *find decision discussed throughout the paper is also reliably identified by the classifier across all runs. It is interesting to note that in the pair of verbs do and make, which are often confused with each other by both Russian and Spanish L1 speakers, errors involving make are identified more reliably than errors involving do: for example, *make business is correctly identified as an error, while *do joke is missed by the classifier. Many of the errors missed by the classifier are context-dependent. Some of the most problematic cases involve combinations with verbs like be and become. Such errors do not result from an L1 lexico-semantic transfer and it is not surprising that the classifiers miss them.

Conclusion
We have investigated whether lexico-semantic models from the native language are transferred to the second language, and what effect this transfer has on lexical choice in L2. We focused on two typologically different L1s, Russian and Spanish, and experimentally confirmed the hypothesis that statistical semantic models learned from these L1s significantly improve automatic error detection in L2 data produced by the speakers of the respective L1s. We also investigated whether the semantic models learned from particular L1s are portable to other languages, and in particular to languages that are typologically close to the investigated L1s. Our results demonstrate that L1 models improve the coverage of the error detection system on a range of other L1s.