Contrastive Analysis with Predictive Power: Typology Driven Estimation of Grammatical Error Distributions in ESL

This work examines the impact of cross-linguistic transfer on grammatical errors in English as a Second Language (ESL) texts. Using a computational framework that formalizes the theory of Contrastive Analysis (CA), we demonstrate that language specific error distributions in ESL writing can be predicted from the typological properties of the native language and their relation to the typology of English. Our typology driven model makes it possible to obtain accurate estimates of such distributions without access to any ESL data for the target languages. Furthermore, we present a strategy for adjusting our method to low-resource languages that lack typological documentation, using a bootstrapping approach which approximates native language typology from ESL texts. Finally, we show that our framework is instrumental for linguistic inquiry seeking to identify first language factors that contribute to a wide range of difficulties in second language acquisition.


Introduction
The study of cross-linguistic transfer, whereby properties of a native language influence performance in a foreign language, has a long tradition in Linguistics and Second Language Acquisition (SLA). Much of the linguistic work on this topic was carried out within the framework of Contrastive Analysis (CA), a theoretical approach that aims to explain difficulties in second language learning in terms of the relations between structures in the native and foreign languages.
The basic hypothesis of CA was formulated by Lado (1957), who suggested that "we can predict and describe the patterns that will cause difficulty in learning, and those that will not cause difficulty, by comparing systematically the language and culture to be learned with the native language and culture of the student". In particular, Lado postulated that divergences between the native and foreign languages will negatively affect learning and lead to increased error rates in the foreign language. This and subsequent hypotheses were soon met with criticism, targeting their lack of ability to provide reliable predictions, leading to an ongoing debate on the extent to which foreign language errors can be explained and predicted by examining native language structure.
Unlike the SLA tradition, which emphasizes manual analysis of error case studies (Odlin, 1989), we address the heart of this controversy from a computational data-driven perspective, focusing on the issue of predictive power. We provide a formalization of the CA framework, and demonstrate that the relative frequency of grammatical errors in ESL can be reliably predicted from the typological properties of the native language and their relation to the typology of English using a regression model.
Tested on 14 languages in a leave-one-out fashion, our model achieves a Mean Absolute Error (MAE) reduction of 21.8% in predicting the language specific relative frequency of the 20 most common ESL structural error types, as compared to the relative frequency of each of the error types in the training data, yielding improvements across all the languages and the large majority of the error types. Our regression model also outperforms a stronger, nearest neighbor based baseline that projects the error distribution of a target language from its typologically closest language.
While our method presupposes the existence of typological annotations for the test languages, we also demonstrate its viability in low-resource scenarios for which such annotations are not available. To address this setup, we present a bootstrapping framework in which the typological features required for prediction of grammatical errors are approximated from automatically extracted ESL morpho-syntactic features using the method of Berzak et al. (2014). Despite the noise introduced in this process, our bootstrapping strategy achieves an error reduction of 13.9% compared to the average frequency baseline.
Finally, the utilization of typological features as predictors makes it possible to shed light on linguistic factors that could give rise to different error types in ESL. For example, in accordance with common linguistic knowledge, feature analysis of the model suggests that the main contributor to increased rates of determiner omission in ESL is the lack of determiners in the native language. A more complex case of missing pronouns is intriguingly tied by the model to native language subject pronoun marking on verbs.
To summarize, the main contribution of this work is a CA inspired computational framework for learning language specific grammatical error distributions in ESL. Our approach is both predictive and explanatory. It enables us to obtain improved estimates for language specific error distributions without access to ESL error annotations for the target language. Coupling grammatical errors with typological information also provides meaningful explanations for some of the linguistic factors that drive the observed error rates.
The paper is structured as follows. Section 2 surveys related linguistic and computational work on cross-linguistic transfer. Section 3 describes the ESL corpus and the typological data used in this study. In section 4 we motivate our native language oriented approach by providing a variance analysis for ESL errors across native languages. Section 5 presents the regression model for prediction of ESL error distributions. The bootstrapping framework which utilizes automatically inferred typological features is described in section 6. Finally, we present the conclusion and directions for future work in section 7.

Related Work
Cross-linguistic transfer has been extensively studied in SLA, Linguistics and Psychology (Odlin, 1989; Gass and Selinker, 1992; Jarvis and Pavlenko, 2007). Within this area of research, our work is most closely related to the Contrastive Analysis (CA) framework. Rooted in the comparative linguistics tradition, CA was first suggested by Fries (1945) and formalized by Lado (1957). In essence, CA examines foreign language performance, with a particular focus on learner difficulties, in light of a structural comparison between the native and the foreign languages. From its inception, CA was criticized for the lack of a solid predictive theory (Wardhaugh, 1970; Whitman and Jackson, 1972), leading to an ongoing scientific debate on the relevance of comparison based approaches. Important to our study is that the type of evidence used in this debate typically relies on small scale manual case study analysis. Our work seeks to reexamine the issue of predictive power of CA based methods using a computational, data-driven approach.
Computational work touching on cross-linguistic transfer was mainly conducted in relation to the Native Language Identification (NLI) task, in which the goal is to determine the native language of the author of an ESL text. Much of this work focuses on experimentation with different feature sets (Tetreault et al., 2013), including features derived from the CA framework (Wong and Dras, 2009). A related line of inquiry which is closer to our work deals with the identification of ESL syntactic patterns that are specific to speakers of different native languages (Swanson and Charniak, 2013; Swanson and Charniak, 2014). Our approach differs from this research direction by focusing on grammatical errors, and emphasizing prediction of language specific patterns rather than their identification.
Previous work on grammatical error correction that examined determiner and preposition errors (Rozovskaya and Roth, 2011; Rozovskaya and Roth, 2014) incorporated native language specific priors in models that are otherwise trained on standard English text. Our work extends the native language tailored treatment of grammatical errors to a much larger set of error types. More importantly, the approach taken in that work is limited by the availability of manual error annotations for the target language, which are needed to obtain the required error counts. Our framework makes it possible to bypass this annotation bottleneck by predicting language specific priors from typological information.
The current investigation is most closely related to studies that demonstrate that ESL signal can be used to infer pairwise similarities between native languages (Nagata and Whittaker, 2013; Berzak et al., 2014) and in particular, tie the similarities to the typological characteristics of these languages (Berzak et al., 2014). Our work inverts the direction of this analysis by starting with typological features, and utilizing them to predict error patterns in ESL. We also show that the two approaches can be combined in a bootstrapping strategy by first inferring typological properties from automatically extracted morpho-syntactic ESL features, and in turn, using these properties for prediction of language specific error distributions in ESL.

ESL Corpus
We obtain ESL essays from the Cambridge [...] The FCE corpus has an elaborate error annotation scheme (Nicholls, 2003) and high quality of error annotations, making it particularly suitable for our investigation. The annotation scheme encompasses 75 different error types, covering a wide range of grammatical errors on different levels of granularity. As the typological features used in this work refer mainly to structural properties, we filter out spelling errors, punctuation errors and open class semantic errors, leaving a list of grammatical errors that are typically related to language structure. We focus on the 20 most frequent error types 3 in this list, which are presented and exemplified in table 1. In addition to concentrating on the most important structural ESL errors, this cutoff prevents us from being affected by data sparsity issues associated with less frequent errors.
1 http://www.cambridge.org/gb/elt/catalogue/subject/custom/item3646603
2 We plan to extend our analysis to additional proficiency levels and languages when error annotated data for these learner profiles becomes publicly available.
3 Filtered errors that would have otherwise appeared in the top 20 list, with their respective rank in brackets: Spelling (1), Replace Punctuation (2), Replace Verb (3), Missing Punctuation (7), Replace (8), Replace Noun (9), Unnecessary Punctuation (13), Replace Adjective (18), Replace Adverb (20).

Typological Database
We use the World Atlas of Language Structures (WALS; Dryer and Haspelmath, 2013), a repository of typological features of the world's languages, as our source of linguistic knowledge about the native languages of the ESL corpus authors. The features in WALS are divided into 11 categories: Phonology, Morphology, Nominal Categories, Nominal Syntax, Verbal Categories, Word Order, Simple Clauses, Complex Sentences, Lexicon, Sign Languages and Other. Table 2 presents examples of WALS features belonging to different categories. The features can be associated with different variable types, including binary, categorical and ordinal, making their encoding a challenging task. Our strategy for addressing this issue is feature binarization (see section 5.3).
An important challenge introduced by the WALS database is incomplete documentation. Previous studies (Daumé III, 2009; Georgi et al., 2010) have estimated that only 14% of all the language-feature combinations in the database have documented values. While this issue is most acute for low-resource languages, even the well studied languages in our ESL dataset are lacking a significant portion of the feature values, inevitably hindering the effectiveness of our approach.
We perform several preprocessing steps in order to select the features that will be used in this study. First, as our focus is on structural features that can be expressed in written form, we discard all the features associated with the categories Phonology, Lexicon 4 , Sign Languages and Other. We further discard 24 features which either have a documented value for only one language, or have the same value in all the languages. The resulting feature-set contains 119 features, with an average of 2.9 values per feature, and 92.6 documented features per language.
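The filtering steps above can be illustrated with a minimal Python sketch. The data layout (a dictionary mapping each feature to its category and its documented values), the function name `select_features` and the toy feature names are assumptions made for illustration only; they are not the format of the WALS distribution or the code used in this study. "Same value in all the languages" is interpreted here as the same value in all languages for which the feature is documented.

```python
# Categories discarded because they are not structural features of written text.
DISCARDED_CATEGORIES = {"Phonology", "Lexicon", "Sign Languages", "Other"}

def select_features(features):
    """features: dict name -> (category, {language: documented value}).

    Undocumented values are simply absent from the inner dictionary
    (an assumption of this sketch).
    """
    kept = {}
    for name, (category, values) in features.items():
        if category in DISCARDED_CATEGORIES:
            continue  # wrong category
        if len(values) <= 1:
            continue  # documented for at most one language
        if len(set(values.values())) == 1:
            continue  # identical value in every documented language
        kept[name] = values
    return kept

# Hypothetical toy input with placeholder feature names and values.
toy = {
    "feat_nominal": ("Nominal Categories", {"lang1": "a", "lang2": "b", "lang3": "a"}),
    "feat_phon": ("Phonology", {"lang1": "a", "lang2": "b"}),
    "feat_single": ("Word Order", {"lang1": "a"}),
    "feat_constant": ("Simple Clauses", {"lang1": "a", "lang2": "a"}),
}
print(list(select_features(toy)))
```

Only the feature that is structural, documented for several languages and not constant survives the filter.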

Variance Analysis of Grammatical Errors in ESL
To motivate a native language based treatment of grammatical error distributions in ESL, we begin by examining whether there is a statistically significant difference in ESL error rates based on the native language of the learners. This analysis provides empirical justification for our approach, and to the best of our knowledge was not conducted in previous studies.
To this end, we perform a Kruskal-Wallis (KW) test (Kruskal and Wallis, 1952) for each error type 5 . We treat the relative error frequency per word in each document as a sample 6 (i.e. the relative frequencies of all the error types in a document sum to 1). The samples are associated with 14 groups, according to the native language of the document's author. For each error type, the null hypothesis of the test is that error fraction samples of all the native languages are drawn from the same underlying distribution. In other words, rejection of the null hypothesis implies a significant difference between the relative error frequencies of at least one language pair.
As shown in table 1, we can reject the null hypothesis for 16 of the 20 grammatical error types with p < 0.01, where Unnecessary Determiner, Unnecessary Preposition, Wrongly Derived Noun, and Replace Conjunction are the error types that do not exhibit dependence on the native language. Furthermore, the null hypothesis can be rejected for 13 error types with p < 0.001. These results suggest that the relative error rates of the majority of the common structural grammatical errors in our corpus indeed differ between native speakers of different languages.
5 We chose the non-parametric KW rank-based test over ANOVA, as according to the Shapiro-Wilk (1965) and Levene (1960) tests, the assumptions of normality and homogeneity of variance do not hold for our data. In practice, the ANOVA test yields similar results to those of the KW test.
6 We also performed the KW test on the absolute error frequencies (i.e. raw counts) per word, obtaining similar results to the ones reported here on the relative frequencies per word.
We further extend our analysis by performing pairwise post-hoc Mann-Whitney (MW) tests (Mann and Whitney, 1947) in order to determine the number of language pairs that significantly differ with respect to their native speakers' error fractions in ESL. Table 1 presents the number of language pairs that pass this test with p < 0.01 for each error type. This inspection suggests Missing Determiner as the error type with the strongest dependence on the author's native language, followed by Replace Determiner, Verb Tense, Word Order, Missing Pronoun and Replace Preposition.
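As an illustration of the rank-based test used in this section, the following Python sketch computes the Kruskal-Wallis H statistic for a single error type, given one list of per-document relative error frequencies per native language. It is a simplified sketch rather than the analysis code of this study: the function names are hypothetical, the tie correction is omitted, and no p-value is computed. Under the null hypothesis H is approximately chi-square distributed with (number of groups − 1) degrees of freedom, and a statistics library routine such as `scipy.stats.kruskal` would normally be used to obtain the p-value.

```python
from itertools import chain

def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """H statistic (no tie correction) for a list of samples, one per native language."""
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)

# Toy samples for three hypothetical languages (not corpus data).
toy_groups = [[0.01, 0.02, 0.03], [0.04, 0.05, 0.06], [0.07, 0.08, 0.09]]
print(kruskal_wallis_h(toy_groups))
```

Larger values of H indicate stronger evidence that the error fractions of at least one language differ from the rest.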

Predicting Language Specific Error Distributions in ESL

Task Definition
Given a language $l \in L$, our task is to predict for this language the relative error frequency $y_{l,e}$ of each error type $e \in E$, where $L$ is the set of all native languages, $E$ is the set of grammatical errors, and $\sum_{e \in E} y_{l,e} = 1$.

Model
In order to predict the error distribution of a native language, we train regression models on individual error types:
$$\hat{y}'_{l,e} = \theta_{l,e} \cdot f(t_l, t_{eng}) \qquad (1)$$
In this equation $\hat{y}'_{l,e}$ is the predicted relative frequency of an error of type $e$ for ESL documents authored by native speakers of language $l$, and $f(t_l, t_{eng})$ is a feature vector derived from the typological features of the native language $t_l$ and the typological features of English $t_{eng}$. The model parameters $\theta_{l,e}$ are obtained using Ordinary Least Squares (OLS) on the training data $D$, which consists of typological feature vectors paired with relative error frequencies of the remaining 13 languages:
$$\theta_{l,e} = \arg\min_{\theta} \sum_{l' \in L \setminus \{l\}} \left(\theta \cdot f(t_{l'}, t_{eng}) - y_{l',e}\right)^2 \qquad (2)$$
To guarantee that the individual relative error frequency estimates sum to 1 for each language, we renormalize them to obtain the final predictions:
$$\hat{y}_{l,e} = \frac{\hat{y}'_{l,e}}{\sum_{e' \in E} \hat{y}'_{l,e'}} \qquad (3)$$
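The leave-one-out procedure described in this section (one least squares fit per error type on the remaining languages, followed by renormalization) can be sketched in Python with NumPy as follows. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the function names are hypothetical, no intercept term is included, and negative raw estimates are clipped to zero before renormalization, a detail the text does not specify. When there are more features than training languages, `np.linalg.lstsq` returns the minimum-norm least squares solution.

```python
import numpy as np

def predict_error_distribution(X_train, Y_train, x_test):
    """Predict a normalized error distribution for one held-out language.

    X_train: (n_languages, n_features) typological feature vectors.
    Y_train: (n_languages, n_error_types) relative error frequencies.
    x_test:  (n_features,) feature vector of the held-out language.
    """
    # A single call fits one least squares model per error type (column of Y_train).
    theta, *_ = np.linalg.lstsq(X_train, Y_train, rcond=None)
    raw = x_test @ theta
    # Assumption of this sketch: clip negative estimates before renormalizing.
    raw = np.clip(raw, 0.0, None)
    total = raw.sum()
    if total == 0.0:
        # Degenerate case: fall back to a uniform distribution.
        return np.full_like(raw, 1.0 / raw.size)
    return raw / total

def leave_one_out(X, Y):
    """Predict each language's distribution from models trained on all the others."""
    predictions = np.empty_like(Y, dtype=float)
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        predictions[i] = predict_error_distribution(X[mask], Y[mask], X[i])
    return predictions

# Toy data with random binary features and random distributions (not corpus data).
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(6, 10)).astype(float)
Y_toy = rng.dirichlet(np.ones(4), size=6)
print(leave_one_out(X_toy, Y_toy).sum(axis=1))  # every row sums to 1
```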

Features
Our feature set can be divided into two subsets. The first subset, used in a version of our model called Reg, contains the typological features of the native language. In a second version of our model, called RegCA, we also utilize additional features that explicitly encode differences between the typological features of the native language and the typological features of English.

Typological Features
In the Reg model, we use the typological features of the native language that are documented in WALS. As mentioned in section 3.2, WALS features belong to different variable types, and are hence challenging to encode. We address this issue by binarizing all the features. Given $k$ possible values $v_k$ for a given WALS feature $t_i$, we generate $k$ binary typological features of the form:
$$f_{i,k}(t_l, t_{eng}) = \begin{cases} 1 & \text{if } t_{l,i} = v_k \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$
where $t_{l,i}$ denotes the value of feature $t_i$ in the native language $l$. When a WALS feature of a given language does not have a documented value, all $k$ entries of the feature for that language are assigned the value of 0. This process transforms the original 119 WALS features into 340 binary features.
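The binarization scheme described above can be sketched in Python as follows. The function name `binarize`, the dictionary-based data layout and the placeholder feature names and values are assumptions made for illustration.

```python
def binarize(wals_values, feature_values):
    """One-hot encode the WALS features of a single language.

    wals_values:    dict feature -> documented value (absent if undocumented).
    feature_values: dict feature -> ordered list of all possible values.
    Returns a flat list with one 0/1 indicator per (feature, value) pair.
    """
    vector = []
    for feature, values in feature_values.items():
        observed = wals_values.get(feature)  # None when undocumented
        # An undocumented feature matches no value, so all its entries are 0.
        vector.extend(1 if observed == value else 0 for value in values)
    return vector

# Placeholder features: one with three possible values, one with two.
toy_values = {"feat_a": ["v1", "v2", "v3"], "feat_b": ["w1", "w2"]}
print(binarize({"feat_a": "v2"}, toy_values))  # feat_b is undocumented
```

Encoding undocumented features as all zeros keeps the vector length fixed across languages, so that the same regression model can be applied to every language.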

Divergences from English
In the spirit of CA, in the model RegCA, we also utilize features that explicitly encode differences between the typological features of the native language and those of English. These features are also binary, and take the value 1 when the value of a WALS feature in the native language is different from the corresponding value in English:
$$f_{i}(t_l, t_{eng}) = \begin{cases} 1 & \text{if } t_{l,i} \neq t_{eng,i} \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$
We encode 104 such features, in accordance with the typological features of English available in WALS. The features are activated only when a typological feature of English has a corresponding documented feature in the native language. The addition of these divergence features brings the total number of features in our feature set to 444.
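A minimal Python sketch of the divergence features described above is given below. The function name and the dictionary-based representation are assumptions; the feature names and values are placeholders rather than actual WALS entries.

```python
def divergence_features(native, english):
    """One 0/1 feature per WALS feature documented for English.

    A feature is 1 only when the native language has a documented value
    that differs from the English value; undocumented features stay 0.
    """
    return [
        1 if feature in native and native[feature] != english[feature] else 0
        for feature in english
    ]

# Placeholder typologies: feat_c is undocumented for the native language.
toy_english = {"feat_a": "v1", "feat_b": "w1", "feat_c": "u1"}
toy_native = {"feat_a": "v2", "feat_b": "w1"}
print(divergence_features(toy_native, toy_english))
```

In the RegCA setting these indicators would be concatenated with the binarized typological features of the native language.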

Results
We evaluate the model predictions using two metrics. The first metric, Absolute Error, measures the distance between the predicted and the true relative frequency of each grammatical error type 7 :
$$\text{Absolute Error} = |\hat{y}_{l,e} - y_{l,e}| \qquad (6)$$
When averaged across different predictions we refer to this metric as Mean Absolute Error (MAE). The second evaluation score is the Kullback-Leibler divergence $D_{KL}$, a standard measure for evaluating the difference between two distributions. This metric is used to evaluate the predicted grammatical error distribution of a native language:
$$D_{KL}(y_l \,\|\, \hat{y}_l) = \sum_{e \in E} y_{l,e} \ln \frac{y_{l,e}}{\hat{y}_{l,e}} \qquad (7)$$
Table 3: Results for prediction of relative error frequencies using the MAE metric across languages and error types, and the $D_{KL}$ metric averaged across languages. #Languages and #Mistakes denote the number of languages and grammatical error types on which a model outperforms Base.
Table 3 summarizes the grammatical error prediction results 8 . The baseline model Base sets the relative frequencies of the grammatical errors of a test language to the respective relative error frequencies in the training data. We also consider a stronger, language specific model called Nearest Neighbor (NN), which projects the error distribution of a target language from the typologically closest language in the training set, according to the cosine similarity measure. This baseline provides a performance improvement for the majority of the languages and error types, with an average error reduction of 13.3% on the MAE metric compared to Base, and improving from 0.052 to 0.046 on the KL divergence metric, thus emphasizing the general advantage of a native language adapted approach to ESL error prediction.
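The two evaluation metrics and a simple frequency baseline can be sketched in Python as follows. The function names are hypothetical. The baseline shown here averages the per-language distributions of the training languages, which is one possible reading of "the respective relative error frequencies in the training data" and should be treated as an assumption. The KL computation assumes that the predicted frequency is positive wherever the true frequency is positive.

```python
import math

def mean_absolute_error(true, pred):
    """Average of |pred - true| over error types."""
    return sum(abs(p - t) for t, p in zip(true, pred)) / len(true)

def kl_divergence(true, pred):
    """D_KL(true || pred); terms with a true frequency of 0 contribute 0."""
    return sum(t * math.log(t / p) for t, p in zip(true, pred) if t > 0)

def base_prediction(train_distributions):
    """Assumed baseline: mean frequency of each error type over training languages."""
    n = len(train_distributions)
    return [sum(column) / n for column in zip(*train_distributions)]

# Toy distributions over three error types (not corpus data).
toy_true = [0.5, 0.3, 0.2]
toy_pred = [0.4, 0.4, 0.2]
print(mean_absolute_error(toy_true, toy_pred))
print(kl_divergence(toy_true, toy_pred))
```

A relative error reduction over the baseline, as reported in the results, would then be computed as (MAE of Base − MAE of the model) / MAE of Base.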
Our regression model introduces further substantial performance improvements. The Reg model, which uses the typological features of the native language for predicting ESL relative error frequencies, achieves 20.4% MAE reduction over the Base model. The RegCA version of the regression model, which also incorporates differences between the typological features of the native language and English, surpasses the Reg model, reaching an average error reduction of 21.8% from the Base model, with improvements across all the languages and the majority of the error types. Strong performance improvements are also obtained on the KL divergence measure, where the RegCA model scores 0.032, compared to the baseline score of 0.052.
To illustrate the outcome of our approach, consider the example in table 4, which compares the top 10 predicted errors for Japanese using the Base and RegCA models. In this example, RegCA correctly places Missing Determiner as the most common error in Japanese, with a significantly higher relative frequency than in the training data. Similarly, it provides an accurate prediction for the Missing Preposition error, whose frequency and rank are underestimated by the Base model. Furthermore, RegCA correctly predicts the frequency of Replace Preposition and Word Order to be lower than the average in the training data.

Feature Analysis
An important advantage of our typology-based approach is the clear semantics of the features, which facilitates the interpretation of the model. Inspection of the model parameters allows us to gain insight into the typological features that are potentially involved in causing different types of ESL errors. Although such inspection is unlikely to provide a comprehensive coverage of all the relevant causes for the observed learner difficulties, it can serve as a valuable starting point for exploratory linguistic analysis and formulation of a cross-linguistic transfer theory.
Table 5 lists the most salient typological features, as determined by the feature weights averaged across the models of different languages, for the error types Missing Determiner and Missing Pronoun. In the case of determiners, the model identifies the lack of definite and indefinite articles in the native language as the strongest factors related to increased rates of determiner omission. Conversely, features that imply the presence of an article system in the native language, such as 'Indefinite word same as one' and 'Definite word distinct from demonstrative' are indicative of reduced error rates of this type. A particularly intriguing example concerns the Missing Pronoun error. The most predictive typological factor for increased pronoun omissions is pronominal subject marking on the verb in the native language. Unlike the case of determiners, it is not the lack of the relevant structure in the native language, but rather its different encoding that seems to drive erroneous pronoun omission. Decreased error rates of this type correlate most strongly with obligatory pronouns in subject position, as well as a verbal person marking system similar to the one in English.
Table 4: Comparison between the fractions and ranks of the top 10 predicted error types by the Base and RegCA models for Japanese. As opposed to the Base method, the RegCA model correctly predicts Missing Determiner to be the most frequent error committed by native speakers of Japanese. It also correctly predicts Missing Preposition to be more frequent and Replace Preposition and Word Order to be less frequent than in the training data.

Bootstrapping with ESL-based Typology
Thus far, we presupposed the availability of substantial typological information for our target languages in order to predict their ESL error distributions. However, the existing typological documentation for the majority of the world's languages is scarce, limiting the applicability of this approach for low-resource languages. We address this challenge for scenarios in which an unannotated collection of ESL texts authored by native speakers of the target language is available. Given such data, we propose a bootstrapping strategy which uses the method proposed in Berzak et al. (2014) in order to approximate the typology of the native language from morpho-syntactic features in ESL. The inferred typological features serve, in turn, as a proxy for the true typology of that language in order to predict its speakers' ESL grammatical error rates with our regression model.
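To put this framework into effect, we use the FCE corpus to train a log-linear model for native language classification using morpho-syntactic features obtained from the output of the Stanford Parser (de Marneffe et al., 2006):
$$p(l \mid x; \theta) = \frac{\exp(\theta \cdot f(x, l))}{\sum_{l' \in L} \exp(\theta \cdot f(x, l'))} \qquad (8)$$
where $l$ is the native language, $x$ is the observed English document, $\theta$ are the model parameters, and $f(x, l)$ here denotes the morpho-syntactic feature vector of the document paired with the candidate language.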
We then derive pairwise similarities between languages by averaging the uncertainty of the model with respect to each language pair:
$$S'^{ESL}_{l,l'} = \frac{1}{|D_l|} \sum_{x \in D_l} p(l' \mid x; \theta), \quad l' \neq l \qquad (9)$$
In this equation, $x$ is an ESL document, $\theta$ are the parameters of the native language classification model and $D_l$ is a set of documents whose native language is $l$. For each pair of languages $l$ and $l'$ the matrix $S'^{ESL}$ contains an entry $S'^{ESL}_{l,l'}$ which represents the average probability of confusing $l$ for $l'$, and an entry $S'^{ESL}_{l',l}$, which captures the opposite confusion. A similarity estimate for a language pair is then obtained by averaging these two scores:
$$S^{ESL}_{l,l'} = S^{ESL}_{l',l} = \frac{1}{2}\left(S'^{ESL}_{l,l'} + S'^{ESL}_{l',l}\right) \qquad (10)$$
As shown in Berzak et al. (2014), given the similarity matrix $S^{ESL}$, one can obtain an approximation for the typology of a native language by projecting the typological features from its most similar languages. Here, we use the typology of the closest language, an approach that yields 70.7% accuracy in predicting the typological features of our set of languages.
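The symmetrization of the confusion scores and the projection of typological features from the closest language can be sketched in Python as follows. This is an illustrative sketch only: the function names, the nested dictionary representation and the placeholder languages are assumptions, and the confusion scores are toy numbers rather than the output of a trained classifier.

```python
def symmetrize(confusion):
    """Average the two directed confusion scores of every language pair.

    confusion[a][b]: average probability assigned to language b over
    documents written by native speakers of language a.
    """
    languages = list(confusion)
    return {
        a: {b: 0.5 * (confusion[a][b] + confusion[b][a]) for b in languages if b != a}
        for a in languages
    }

def closest_language(similarity, target):
    """Language with the highest similarity to the target."""
    return max(similarity[target], key=similarity[target].get)

def approximate_typology(similarity, typology, target):
    """Use the typological features of the closest language as a proxy for the target."""
    return dict(typology[closest_language(similarity, target)])

# Toy confusion scores for three placeholder languages.
toy_confusion = {
    "A": {"B": 0.2, "C": 0.1},
    "B": {"A": 0.3, "C": 0.05},
    "C": {"A": 0.1, "B": 0.15},
}
toy_typology = {"B": {"feat_a": "v1"}, "C": {"feat_a": "v2"}}
toy_similarity = symmetrize(toy_confusion)
print(closest_language(toy_similarity, "A"))
print(approximate_typology(toy_similarity, toy_typology, "A"))
```

In the bootstrapping setup, the approximated typology would replace the WALS features of the test language only, while the training languages keep their documented typology.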
In the bootstrapping setup, we train the regression models on the true typology of the languages in the training set, and use the approximate typology of the test language to predict the relative error rates of its speakers in ESL. Table 6 summarizes the error prediction results using approximate typological features for the test languages. As can be seen, our approach continues to provide substantial performance gains despite the inaccuracy of the typological information used for the test languages. The best performing method, RegCA, reduces the MAE of Base by 13.9%, with performance improvements for most of the languages and error types. Performance gains are also obtained on the $D_{KL}$ metric, where RegCA scores 0.041, compared to the Base score of 0.052, improving on 11 out of our 14 languages.

Conclusion and Future Work
We present a computational framework for predicting native language specific grammatical error distributions in ESL, based on the typological properties of the native language and their compatibility with the typology of English. Our regression model achieves substantial performance improvements as compared to a language oblivious baseline, as well as a language dependent nearest neighbor baseline. Furthermore, we address scenarios in which the typology of the native language is not available, by bootstrapping typological features from ESL texts. Finally, inspection of the model parameters allows us to identify native language properties which play a pivotal role in generating different types of grammatical errors. In addition to the theoretical contribution, the outcome of our work has a strong potential to be beneficial in practical setups. In particular, it can be utilized for developing educational curricula that focus on the areas of difficulty that are characteristic of different native languages. Furthermore, the derived error frequencies can be integrated as native language specific priors in systems for automatic error correction. In both application areas, previous work relied on the existence of error tagged ESL data for the languages of interest. Our approach paves the way for addressing these challenges even in the absence of such data.