Large-Scale Native Language Identification with Cross-Corpus Evaluation

We present a large-scale Native Language Identification (NLI) experiment on new data, with a focus on cross-corpus evaluation to identify corpus- and genre-independent language transfer features. We test a new corpus and show it is comparable to other NLI corpora and suitable for this task. Cross-corpus evaluation on two large corpora achieves good accuracy and evidences the existence of reliable language transfer features, but lower performance also suggests that NLI models are not completely portable across corpora. Finally, we present a brief case study of features distinguishing Japanese learners’ English writing, demonstrating the presence of cross-corpus and cross-genre language transfer features that are highly applicable to SLA and ESL research.


Introduction
Native Language Identification, the task of determining the native language (L1) of an author based on a second language (L2) text, has received much attention recently. Much of this is motivated by Second Language Acquisition (SLA) as NLI, often accomplished via machine learning methods, can be used to study language transfer effects.
Most NLI research hitherto has focused on identifying linguistic phenomena that can capture transfer effects, with little effort towards interpreting discriminant features. Some researchers have now shifted their focus to developing data-driven methods for the automatic extraction and ranking of linguistic features that distinguish specific L1s (Swanson and Charniak, 2014).
Such methods could be used not only to confirm existing SLA hypotheses, but also to create new ones. This hypothesis formulation is an inherently difficult problem requiring copious amounts of data. Contrary to this requirement, researchers have long noted the paucity of suitable corpora for this task (Brooke and Hirst, 2011). This is one of the research issues addressed by this work.
Furthermore, deriving SLA hypotheses from a single corpus may not be entirely useful for SLA research. Many variables like genre and topic are constant within a corpus, restricting the validity of such cross-validation studies to those dimensions.
An alternative, potentially more helpful approach is to identify transfer features that reliably distinguish an L1 across multiple corpora of differing genres and domains. A cross-corpus methodology may be a more promising avenue to finding features that generalize to diverse text sources, but requires additional large corpora. It is also a more realistic approach, and one we pursue in this work.
Accordingly, the aims of the present work are to: (1) test a large new corpus suitable for NLI, (2) perform within-corpus evaluation with a comparative analysis against equivalent corpora, (3) perform cross-corpus evaluation to determine the efficacy of corpus-independent features and (4) analyze the features' utility for SLA and ESL research.

Background and Motivation
NLI work has been growing in recent years, using a wide range of syntactic and, more recently, lexical features to distinguish the L1. A detailed review of NLI methods is omitted here for reasons of space, but a thorough exposition is presented in the report from the very first NLI Shared Task, which was held in 2013.
Most English NLI work has been done using two corpora. The International Corpus of Learner English (Granger et al., 2009) was widely used until recently, despite its shortcomings being widely noted (Brooke and Hirst, 2012a); these issues exist because the corpus was not designed for NLI. More recently, TOEFL11, the first corpus designed for NLI, was released. While it is the largest NLI dataset available, it only contains argumentative essays, limiting analyses to this genre.
Research has also expanded to use non-English learner corpora (Malmasi and Dras, 2014a; Malmasi and Dras, 2014c). Recently, Malmasi and Dras (2014b) introduced the Chinese Learner Corpus for NLI and their results indicate that feature performance may be similar across corpora and even L1-L2 pairs. This is a claim that we will test here.
NLI is now also moving towards using features to generate SLA hypotheses. Swanson and Charniak (2014) approach this by using both L1 and L2 data to identify features exhibiting non-uniform usage in both datasets, creating lists of candidate transfer features. Malmasi and Dras (2014d) propose a different method, using linear SVM weights to extract lists of overused and underused linguistic features for each L1 group.
Cross-corpus studies have been conducted for various data-driven NLP tasks, including parsing (Gildea, 2001), WSD (Escudero et al., 2000) and NER (Nothman et al., 2009). While most such experiments show a drop in performance, the effect varies widely across tasks, making it hard to predict the expected drop for NLI. We aim to address this question using large training and testing data.

EFCamDat: A new corpus for NLI
The EF Cambridge Open Language Database (EFCAMDAT) is an English L2 corpus that was released recently (Geertzen et al., 2013). It is composed of texts submitted to Englishtown, an online school used by thousands of learners daily.
This corpus is notable for its size, containing some 550k texts from numerous nationalities, making it an ideal candidate for NLI research. While the TOEFL11 is made of argumentative essays, EFCAMDAT has a much wider range of genres including emails, descriptions, letters, reviews, instructions and more.
In this work we present an application of NLI to this new data. As some of the texts can be short, we use the methodology of Brooke and Hirst (2011) to concatenate and create texts with at least 300 tokens, much like the TOEFL11. From the data we choose 850 texts from each of the top 11 nationalities. This subset of EFCAMDAT thus consists of 9,350 documents totalling approximately 3.2m tokens. This is an average of 337 tokens per text, close to the 348 tokens per text in TOEFL11.
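The concatenation step can be illustrated with a minimal sketch. It assumes that the input texts all come from the same L1 group, that tokens are whitespace-separated, and that a trailing remainder below the threshold is dropped; none of these details are specified here, and the function name `concatenate_texts` is hypothetical.

```python
from typing import Iterable, List


def concatenate_texts(texts: Iterable[str], min_tokens: int = 300) -> List[str]:
    """Greedily merge consecutive short texts until each merged document
    has at least `min_tokens` whitespace-separated tokens."""
    documents: List[str] = []
    buffer: List[str] = []
    for text in texts:
        buffer.extend(text.split())
        if len(buffer) >= min_tokens:
            documents.append(" ".join(buffer))
            buffer = []
    # Assumption: leftover tokens below the threshold are discarded.
    return documents


if __name__ == "__main__":
    short_texts = ["a b c", "d e", "f g h i", "j"]
    for doc in concatenate_texts(short_texts, min_tokens=5):
        print(len(doc.split()), doc)
```

A greedy merge keeps each output document close to the threshold, which is one way to obtain average lengths comparable to those of another corpus.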
This also provides us with the same number of classes as the TOEFL11, as shown in Table 1, facilitating direct performance comparisons. The table also indicates the 9 classes common to both corpora. This subset of common classes enables us to perform large-scale cross-corpus validation experiments that have not been possible until now.

Methodology
We use the standard NLI classification approach. A linear Support Vector Machine is used for classification and feature vectors are created using relative frequency values. We also combine features with a mean probability ensemble classifier (Polikar, 2006, §4.2) using the probabilities assigned to each class. We compare results with a random baseline and an oracle baseline. The oracle correctly classifies a text if any ensemble member correctly predicts its label and defines an upper bound for classification accuracy.
We avoid using lexical features as EFCAMDAT is not topic balanced. We extract the following topic-independent feature types:
Function words are topic-independent grammatical words such as prepositions, which indicate the relations between other words. They are known to be useful for NLI. Frequencies of 400 English function words are extracted as features. We also apply function word bigrams as described in Malmasi et al. (2013).
Context-free Grammar Production Rules are extracted after parsing each sentence. Each rule is a classification feature (Wong and Dras, 2011) and captures global syntactic patterns.
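As a rough illustration of the production rule features, the sketch below reads a bracketed (Penn Treebank style) parse and collects phrase-level rules with their relative frequencies. The parse string is hand-written rather than parser output, the helper names are hypothetical, and skipping lexical rules (preterminal to word) is an assumption of this sketch.

```python
from collections import Counter
from typing import Dict, List, Tuple

Tree = Tuple[str, list]  # (label, children); a child is a Tree or a word (str)


def parse_bracketed(s: str) -> Tree:
    """Parse a bracketed tree such as '(S (NP (PRP I)) (VP (VBP am)))'."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read() -> Tree:
        nonlocal pos
        assert tokens[pos] == "("
        label = tokens[pos + 1]
        pos += 2
        children: list = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return read()


def production_rules(tree: Tree) -> List[str]:
    """Return phrase-level rules, skipping lexical (preterminal -> word) rules."""
    label, children = tree
    rules: List[str] = []
    if children and all(isinstance(c, tuple) for c in children):
        rules.append(f"{label} -> {' '.join(c[0] for c in children)}")
        for child in children:
            rules.extend(production_rules(child))
    return rules


def relative_frequencies(rules: List[str]) -> Dict[str, float]:
    """Turn rule counts into relative frequency feature values."""
    counts = Counter(rules)
    total = sum(counts.values())
    return {rule: n / total for rule, n in counts.items()}


if __name__ == "__main__":
    parse = ("(S (SBAR (IN If) (S (NP (PRP you)) (VP (VBP have) "
             "(NP (JJ spare) (NN time))))) (, ,) (NP (PRP you)) "
             "(VP (MD will) (VP (VB think))) (. .))")
    for rule, value in relative_frequencies(production_rules(parse_bracketed(parse))).items():
        print(f"{value:.3f}  {rule}")
```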

Within-Corpus Evaluation
Our first experiment applies 10-fold cross-validation within the corpus to assess feature efficacy. The results are shown in the first column of Table 2. All features perform substantially better than the 9% random baseline. POS trigrams are the best single feature (53%), suggesting there exist significant interclass syntactic differences. Next, we also combined all features using a classifier ensemble, which has been shown to be helpful for NLI (Tetreault et al., 2012). This yields the best accuracy of 65% against an upper bound of 87% set by the oracle.
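The relation between the ensemble accuracy and the oracle upper bound can be made concrete with a small sketch. It assumes each ensemble member already provides per-class probability estimates (a linear SVM does not produce these natively, so some calibration step is assumed), and the function names and toy numbers are hypothetical.

```python
from typing import List, Sequence


def mean_probability_ensemble(
    member_probs: Sequence[Sequence[Sequence[float]]],
) -> List[int]:
    """member_probs[m][i][c] is the probability member m assigns class c to text i.
    Each text is labelled with the class whose mean probability is highest."""
    n_members = len(member_probs)
    n_texts = len(member_probs[0])
    n_classes = len(member_probs[0][0])
    predictions: List[int] = []
    for i in range(n_texts):
        means = [
            sum(member_probs[m][i][c] for m in range(n_members)) / n_members
            for c in range(n_classes)
        ]
        predictions.append(max(range(n_classes), key=means.__getitem__))
    return predictions


def accuracy(preds: Sequence[int], gold: Sequence[int]) -> float:
    return sum(p == y for p, y in zip(preds, gold)) / len(gold)


def oracle_accuracy(member_preds: Sequence[Sequence[int]], gold: Sequence[int]) -> float:
    """A text counts as correct if ANY ensemble member predicts its label."""
    hits = sum(any(preds[i] == y for preds in member_preds) for i, y in enumerate(gold))
    return hits / len(gold)


if __name__ == "__main__":
    # Toy values: two members, three texts, two classes.
    probs = [
        [[0.9, 0.1], [0.4, 0.6], [0.6, 0.4]],
        [[0.2, 0.8], [0.3, 0.7], [0.3, 0.7]],
    ]
    gold = [1, 1, 0]
    member_preds = [[max(range(2), key=p.__getitem__) for p in member] for member in probs]
    print("ensemble:", accuracy(mean_probability_ensemble(probs), gold))
    print("oracle:  ", oracle_accuracy(member_preds, gold))
```

Because the oracle only needs one member to be right, its accuracy can never fall below that of the best single member, which is why it serves as an upper bound for combination methods.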
We also compare these results to those from the TOEFL11 and Chinese Learner Corpus (CLC). As shown in Figure 1, we find that feature performance is nearly identical across corpora. Consistent with the results in Malmasi and Dras (2014b), this seems to suggest an invariant degree of transfer across different learners and L1-L2 pairs. Figure 2 shows the confusion matrix. German is the most correctly classified L1, while the highest confusion is between Japanese-Korean, followed by Spanish-Portuguese and French-Italian. This is not surprising given the syntactic similarity within each pair, and the latter two pairs are also typologically related.

Large-scale Cross-Corpus Evaluation
Our second experiment tests the cross-corpus efficacy of the features by training on EFCAMDAT and testing on TOEFL11, and vice versa, using the 9 common classes discussed in §3. As the corpus texts are from different genres, this approach enables us to test the cross-corpus and cross-genre generalizability of our features. Results are shown in the second and third columns of Table 2. While lower than the cross-validation results, which were on 11 classes versus 9 here, the results are far above the baseline. The accuracy for training on EFCAMDAT and testing on TOEFL11 is higher (33.45%) than the other way around (28.42%), even though TOEFL11 is the larger corpus. This is possibly because EFCAMDAT has numerous genres while TOEFL11 does not. The cross-corpus oracle is also over 20% lower, despite an increase in the random baseline, showing that some features are not portable across corpora. Training on TOEFL11 yields a lower oracle.
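The class restriction and the change in the random baseline can be sketched as follows. The label codes in the demo are placeholders rather than the actual class lists of either corpus, the helper names are hypothetical, and the baseline formula assumes balanced classes.

```python
from typing import List, Tuple

Sample = Tuple[str, str]  # (text, L1 label)


def restrict_to_common_classes(
    train: List[Sample], test: List[Sample]
) -> Tuple[List[Sample], List[Sample], List[str]]:
    """Keep only samples whose L1 label occurs in both corpora."""
    common = sorted({y for _, y in train} & {y for _, y in test})
    train_common = [(x, y) for x, y in train if y in common]
    test_common = [(x, y) for x, y in test if y in common]
    return train_common, test_common, common


def random_baseline(n_classes: int) -> float:
    """Expected accuracy of uniform random guessing with balanced classes."""
    return 1.0 / n_classes


if __name__ == "__main__":
    # Placeholder labels for illustration only.
    corpus_a = [("text a1", "JPN"), ("text a2", "POR"), ("text a3", "GER")]
    corpus_b = [("text b1", "JPN"), ("text b2", "HIN"), ("text b3", "GER")]
    train, test, common = restrict_to_common_classes(corpus_a, corpus_b)
    print(common, len(train), len(test))
    print(f"11 classes: {random_baseline(11):.1%}, 9 classes: {random_baseline(9):.1%}")
```

With fewer classes the random baseline rises, so a cross-corpus accuracy should be read against the 9-class baseline rather than the 11-class one used within the corpus.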

Although a performance drop was expected due to the big genre differences, results suggest the presence of some corpus-independent features that capture cross-linguistic influence. However, they also suggest that a large portion of the features helpful for NLI are genre-dependent.
Previously, word n-grams have been applied in small-scale cross-corpus studies and found to be the best feature (Brooke and Hirst, 2012b). Word n-grams have been previously used in NLI and are believed to capture lexical transfer effects, which have been noted by researchers and linguists (Odlin, 1989). The effects are mediated not only by cognates and word form similarities, but also by semantics and meanings. Other NLI studies have also provided empirical evidence for this hypothesis (Malmasi and Cahill, 2015). However, issues stemming from topic bias (due to correlations between text topics and L1 classes) have also limited their use in NLI (Brooke and Hirst, 2012a), although their use could be justified in cross-corpus scenarios due to the lower risk of topic bias across corpora. We applied word unigrams to our cross-corpus experiment, achieving an accuracy of 41.8% for training on EFCAMDAT and testing on TOEFL11 and 42.5% for the reverse setting. These are the best results in this setup.
To check for any topic-bias effects, we inspected the most discriminative features for each L1 class using the method proposed by Malmasi and Dras (2014d). This analysis revealed that the top features were mostly cultural and geographic references related to the author's country. Table 3 contains words selected from the top 15 most discriminative features found in the cross-corpus experiment for three L1s. We observe that most of these are toponyms or culture-specific terms such as names and currencies. These results reveal another potential issue with using lexical features. Although this is not topic bias, the features do not represent genuine linguistic differences or lexical transfer effects between L1 groups. In practical scenarios, this could also make NLI systems vulnerable to content-based manipulation. The exclusion of proper nouns is one way to combat this.
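The inspection of discriminative features relies on ranking features by their weights in a linear model. A minimal sketch of such a ranking is given below; it assumes one weight vector per L1 class from a one-vs-rest linear classifier and reads positive weights as overuse and negative weights as underuse. The function name, feature names and weight values are hypothetical, and the sketch is not the exact procedure of Malmasi and Dras (2014d).

```python
from typing import List, Sequence, Tuple


def rank_features(
    weights: Sequence[float], feature_names: Sequence[str], k: int = 10
) -> Tuple[List[str], List[str]]:
    """Return the top-k overused (largest positive weight) and
    underused (most negative weight) features for one L1 class."""
    order = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
    overused = [feature_names[j] for j in order if weights[j] > 0][:k]
    underused = [feature_names[j] for j in reversed(order) if weights[j] < 0][:k]
    return overused, underused


if __name__ == "__main__":
    # Toy weights for illustration only; not results from any corpus.
    names = ["feat_a", "feat_b", "feat_c", "feat_d"]
    weights = [0.8, -0.5, 0.1, -0.9]
    over, under = rank_features(weights, names, k=2)
    print("overused: ", over)
    print("underused:", under)
```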

A Case Study of Japanese Learners
To demonstrate the utility of this cross-corpus approach we present a brief case study of features that characterize English writings of Japanese learners. We extracted the most discriminative cross-corpus features of Japanese learner texts using the method of Malmasi and Dras (2014d). Table 4 contains the top production rule features. The first rule shows a preference for having a subordinate clause before the main clause. The next two rules show that Japanese learners tend to begin their sentences with adverbs and conjunctions. This preference for placing information at the start of sentences is most likely rooted in the fact that Japanese is an SOV head-final language where dependent clauses generally precede the main clause and relative clauses precede the noun they modify. The influence of this head-direction parameter on English acquisition has been previously investigated (Flynn, 1989).
In contrast, it is quite common for the main clause to precede the subordinate clause in English. Other research has also noted that Japanese speakers have a "long before short" preference (Yamashita and Chang, 2001). This is also evidenced by another highly discriminative rule for this L1: S → S , CC S .
Japanese writers also seem more likely to split longer arguments into multiple shorter sentences, as suggested by our third production rule. It has also been noted that Japanese and Korean sentences in the TOEFL11 have the shortest mean length (Cimino et al., 2013, p. 211).
Table 4: Top production rules characterizing Japanese learner texts, with example sentences.
Production Rule        Example Sentence
S → SBAR , NP VP .     If you have spare time, you'll think of shopping.
S → ADVP , NP VP .     Therefore, the online studying system is really convenient for me.
S → CC NP VP .         But I'm not good at English. / But it wasn't comfortable and cosy.

Turning to POS trigrams, the POS tag sequence VBZ JJ NN is strongly linked to Japanese learners. It represents a third-person singular present verb, such as is or has, followed by an adjective and a noun. A brief analysis reveals that this is commonly observed in Japanese learner texts because the sequence is missing a determiner before the noun phrase. This likely stems from the fact that Japanese learners have difficulty with English articles, often failing to use them (Butler, 2002; Thomas, 1989). Its prominence in the ranked list shows that it is a common issue across distinct learners and genres.
The top overused and underused function words are listed in Table 5. The words however and therefore are highly relevant; Japanese writers often use these to start sentences, possibly due to the abovementioned production rules. The word into is also predictive and seems to be used in places where in is more appropriate. This may be due to the Japanese words for in, to and into being similar. In the underuse list, perhaps is never used by Japanese learners. Other words here are low-frequency in Japanese L1 texts in both corpora.
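The VBZ JJ NN pattern discussed in this section can be counted with a simple POS trigram extractor. The sketch below assumes the text has already been tagged with Penn Treebank tags; the example sentence ("It is good idea.") and its tags are hand-constructed for illustration and are not taken from either corpus, and the function name is hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

Trigram = Tuple[str, str, str]


def pos_trigram_frequencies(tags: Sequence[str]) -> Dict[Trigram, float]:
    """Relative frequencies of POS tag trigrams in one tagged text."""
    counts = Counter(zip(tags, tags[1:], tags[2:]))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {trigram: n / total for trigram, n in counts.items()}


if __name__ == "__main__":
    # Hand-tagged hypothetical sentence: It/PRP is/VBZ good/JJ idea/NN ./.
    tags = ["PRP", "VBZ", "JJ", "NN", "."]
    freqs = pos_trigram_frequencies(tags)
    print(freqs.get(("VBZ", "JJ", "NN"), 0.0))
```

A sentence with the article in place ("It is a good idea.") would instead yield VBZ DT JJ, so the trigram acts as a proxy for the missing determiner.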

Discussion
In this work we presented the first application of one of the largest and newest publicly available learner corpora to NLI. Cross-validation experiments mirrored the performance of other corpora and demonstrated its utility for the task. We believe this will motivate future work by equipping researchers with a large-scale corpus that is highly suitable for NLI.
Next, results from the largest cross-corpus NLI evaluation to date were presented, providing strong evidence for the presence of transfer features that generalize across learners, corpora, topics and genres. However, the fact that the cross-corpus accuracy is lower than within-corpus cross-validation highlights that a large portion of the features are highly corpus-specific. This suggests that NLI models are not entirely portable across corpora. Practical applications of NLI to forensic linguistics or SLA must be robust to input from numerous sources and their associated variations, and this finding highlights the need for a cross-corpus approach.
To demonstrate how this methodology could be used for SLA, an examination of the cross-corpus features effective in classifying texts of Japanese learners was conducted. Through feature analysis, we were able to link these patterns of syntactic productions, article use and lexical choices to L1-based SLA hypotheses.
Our output lists hundreds of features, not included or examined here due to space limitations, whose analysis would allow SLA researchers to explore and generate new hypotheses, especially by combining multiple syntactic feature types.
A shortcoming here is that we did not balance texts by proficiency to match the TOEFL11. We expect that a more even sampling of proficiency or using proficiency-segregated models will yield higher accuracy and features more representative of students at each proficiency level.
Directions for future work are manifold. The next phase of this research will focus on developing tools to derive and browse ranked lists of the most discriminative cross-corpus features, which will then be used to formulate SLA hypotheses. Subject to availability of data, this could be expanded to a multiple cross-corpus methodology, using three or more corpora. Its application to other languages besides English is also of interest.
NLI is a young but rapidly growing field of research and this study is but a first step in shifting efforts towards a more interpretive approach to the task. We hope that the new dataset and directions presented here will galvanize future work.