Ensemble Methods for Native Language Identification

Our team—Uvic-NLP—explored and evaluated a variety of lexical features for Native Language Identification (NLI) within the framework of ensemble methods. Using a subset of the highest performing features, we train Support Vector Machines (SVM) and Fully Connected Neural Networks (FCNN) as base classifiers, and test different methods for combining their outputs. Restricting our scope to the closed essay track in the NLI Shared Task 2017, we find that our best SVM ensemble achieves an F1 score of 0.8730 on the test set.


Introduction
Native Language Identification (NLI) is the task of identifying a person's native language (L1) based on a sample of their writing or speech in a second language (L2). The underlying intuition is that those with the same L1 tend to use similar language patterns during L2 production. This is known as cross-linguistic influence (Ortega, 2014).
NLI can accelerate second language acquisition by giving students L1-specific feedback on their written or spoken samples (Malmasi et al., 2014). In forensic linguistics, NLI can be applied to identify the L1 of anonymous texts (Perkins, 2015).
The NLI Shared Task 2013, the first of its kind, was based on written essays, while the 2016 Computational Paralinguistics Challenge was based on spoken responses (Schuller et al., 2016). The NLI Shared Task 2017 organizers provided a dataset of both essays and transcriptions of verbal responses. As our team, Uvic-NLP, participated in the closed essay track, we performed classification on essays only.
* These authors contributed equally to this work.
We begin our analysis by comparing various lexical features and focus on two high-performing classifiers: Support Vector Machines (SVM) and Fully Connected Neural Networks (FCNN). Then, we explore different ensemble methods for combining outputs of individual classifiers. We present and discuss three of our best systems for this task: a single SVM classifier, an SVM ensemble, and an FCNN ensemble.

Related Work
NLI is generally conceptualized as a multiclass supervised classification problem, where the classes represent the set of possible L1s. One of the first NLI systems trained SVMs on a variety of stylistic features (Koppel et al., 2005).
The NLI Shared Task 2013 introduced a corpus designed specifically for NLI. Use of a standardized dataset and evaluation metric allowed for the effective comparison of different models, and the results confirmed the usefulness of SVMs for NLI. Popular features included word, part of speech (POS), and character n-grams; higher-order n-grams were shown to be especially useful. Four of the top five teams used at least 4-grams, with the top team using up to 9-grams. String kernels using 5- to 8-grams at the character level also worked well, and were one of the best performing models for this task (Ionescu et al., 2014).
A trend in recent work is the use of ensemble methods, which combine the predictions of a set of classifiers, giving more accurate results than a single classifier trained on a combination of different features (Tetreault et al., 2012; Malmasi et al., 2013). Malmasi and Dras (2017) used meta-classifier ensembles, where results from base classifiers are fed to an ensemble of meta-classifiers. Such models are the current state of the art for NLI.

Data
The dataset for the essay track of the NLI Shared Task 2017 was collected by Educational Testing Service, and consists of written responses to a standardized assessment of English proficiency for academic purposes.
13,200 response essays from test takers were separated into three sets: 11,000 for training (TRAIN), 1,100 for development (DEV), and 1,100 for testing (TEST).

Features
Previous work demonstrates that a variety of lexical and syntactic features are useful for NLI (Tetreault et al., 2012). In addition to incorporating lexical features known to be effective for this task, we also extract phonemes. Here, we describe each of the features in turn.
Word n-grams Where topic bias is pervasive, word n-grams are not useful features for classification (Brooke and Hirst, 2011), but have been used successfully in topic-balanced corpora (Tetreault et al., 2012). Our dataset is balanced across topics, making word n-grams useful.
Lemma n-grams Lemmas are the dictionary representation of words, i.e. words that are stripped of morphological marking. The lemmatized versions of all words in our corpus were obtained using the Natural Language Toolkit's WordNet interface (Bird et al., 2009; Feinerer and Hornik, 2016; Wallace, 2007; Fellbaum, 1998).
Character n-grams Tsur and Rappoport (2007) achieved good results on the NLI task using only character bigrams as features. Methods working at the character level were also the previous state of the art (Ionescu et al., 2014). Character n-grams can be generated from text within or across word boundaries.
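The difference between generating character n-grams within and across word boundaries can be illustrated with a short Python sketch. The function name `char_ngrams` and the whitespace tokenization are assumptions made for illustration only; this is not the extraction code used in the experiments.

```python
from collections import Counter


def char_ngrams(text, n, within_words=False):
    """Count character n-grams across or within word boundaries."""
    if within_words:
        # Treat each whitespace-delimited token separately, so that
        # no n-gram spans a space between two words.
        units = text.split()
    else:
        # Treat the whole text as one string; n-grams may contain spaces.
        units = [text]
    counts = Counter()
    for unit in units:
        for i in range(len(unit) - n + 1):
            counts[unit[i:i + n]] += 1
    return counts


across = char_ngrams("the cat sat", 3)
within = char_ngrams("the cat sat", 3, within_words=True)
```

With this toy input, the across-boundary variant produces trigrams such as "e c" that straddle two words, whereas the within-boundary variant yields only "the", "cat", and "sat".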
Part of speech n-grams Koppel et al. (2005) found rare part of speech (POS) bigrams to be a useful feature; many teams in the 2013 Shared Task also made use of this feature. We use the Stanford Tagger to extract POS features (Toutanova et al., 2003).
Function words Function words are a closed class of words that serve a grammatical function in sentences, whose use for NLI was explored early on (Koppel et al., 2005). These include articles, determiners, conjunctions, and auxiliaries. These were extracted based on a list provided in the ModErn Text Analysis Toolkit (Massung et al., 2016).
Spelling errors Spelling errors were extracted by finding the difference between misspelled words before and after they were corrected using the autocorrect package (Jonas, 2013). We coded a subset of the spelling errors defined by Koppel et al. (2005): repeated letter, double letter appears only once, letter replacement, letter inversion, inserted letter, and missing letter.
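One way to assign a misspelled word and its correction to the six error categories listed above is sketched below in Python. The function `classify_error` and the exact decision rules (for example, treating an extra letter as "repeated" when it duplicates a neighbouring letter) are assumptions for illustration; the precise definitions used in the experiments are not given in the text.

```python
def classify_error(wrong, right):
    """Assign a (misspelled, corrected) pair to one error category."""
    if len(wrong) == len(right):
        diff = [i for i in range(len(right)) if wrong[i] != right[i]]
        if len(diff) == 1:
            return "letter replacement"
        if (len(diff) == 2 and diff[1] == diff[0] + 1
                and wrong[diff[0]] == right[diff[1]]
                and wrong[diff[1]] == right[diff[0]]):
            return "letter inversion"
    elif len(wrong) == len(right) + 1:
        # The misspelling has one extra letter: find where it sits.
        for i in range(len(wrong)):
            if wrong[:i] + wrong[i + 1:] == right:
                repeated = ((i > 0 and wrong[i] == wrong[i - 1])
                            or (i + 1 < len(wrong) and wrong[i] == wrong[i + 1]))
                return "repeated letter" if repeated else "inserted letter"
    elif len(wrong) + 1 == len(right):
        # The misspelling lacks one letter: find which one was dropped.
        for i in range(len(right)):
            if right[:i] + right[i + 1:] == wrong:
                doubled = ((i > 0 and right[i] == right[i - 1])
                           or (i + 1 < len(right) and right[i] == right[i + 1]))
                return "double letter appears only once" if doubled else "missing letter"
    return "other"


examples = [("untill", "until"), ("begining", "beginning"),
            ("definate", "definite"), ("recieve", "receive"),
            ("whith", "with"), ("goverment", "government")]
labels = [classify_error(w, r) for w, r in examples]
```

The resulting category counts per essay could then serve as features; errors involving more than one edit fall into the catch-all "other" class in this sketch.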
Phoneme n-grams Phonemes are representations of sounds in a language. In English, one sound can be represented using many different letters (e.g. the /k/ sound in cat and kick). For mapping orthography onto phonemes, we used the Carnegie Mellon Pronouncing Dictionary (Weide, 2005). To our knowledge, phonemes have not yet been explored as a feature for NLI.
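The mapping from orthography to phoneme n-grams can be sketched as follows. The two-entry `TOY_LEXICON` is a hypothetical stand-in for the pronouncing dictionary (ARPAbet-style symbols, stress markers omitted), and skipping out-of-vocabulary words and building n-grams over the concatenated phoneme sequence are both assumptions, since the text does not specify these details.

```python
from collections import Counter

# Hypothetical miniature lexicon standing in for a pronouncing dictionary.
TOY_LEXICON = {
    "cat": ["K", "AE", "T"],
    "kick": ["K", "IH", "K"],
}


def phoneme_ngrams(tokens, n, lexicon=TOY_LEXICON):
    """Map tokens to phonemes and count n-grams over the phoneme sequence."""
    phonemes = []
    for token in tokens:
        # Words missing from the lexicon are skipped (an assumption).
        phonemes.extend(lexicon.get(token.lower(), []))
    return Counter(tuple(phonemes[i:i + n])
                   for i in range(len(phonemes) - n + 1))


bigrams = phoneme_ngrams(["Cat", "kick", "zzz"], 2)
```

Here the letters "c", "k", and "ck" all map to the same phoneme K, so spelling variation is collapsed before the n-grams are counted.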

Classifiers
We evaluated classifier performance across feature types and found that the SVM and FCNN classifiers consistently outperformed other classifiers, such as Perceptron and Multinomial Naive Bayes. As such, we focus on these two classifiers in subsequent experiments.
Ensemble methods involve combining the outputs of multiple classifiers to yield a final prediction (Polikar, 2006). Three types of ensemble methods which have been shown to be useful for NLI are explored here (Malmasi and Dras, 2017). At a high level, SVM and FCNN outputs are combined using (1) a voting scheme, (2) a Linear Discriminant Analysis (LDA) classifier trained on the outputs, and (3) multiple LDA classifiers, trained on random subsets of the outputs, whose predictions are in turn combined using a voting scheme.
SVMs (Joachims, 1998) are frequently used for text classification and have been applied successfully to NLI. We use a scikit-learn SVM implementation: LinearSVC (Pedregosa et al., 2011).

Neural Networks
Since we found little previous work applying neural networks to NLI, we aim to fill this gap by constructing an FCNN using TensorFlow (Allaire et al., 2016) and the Keras (Chollet et al., 2015) framework. The network consists of one hidden layer of 128 nodes with a tanh activation function, preceded by an input dropout of 0.2. The optimal dropout value was established empirically. Following the hidden layer, there is an 11-node output layer that uses the softmax activation function. The entire network uses a cross entropy loss function and the Adam optimization algorithm.
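The computation performed by this architecture can be made concrete with a dependency-free Python sketch of the forward pass and loss. This is not the Keras implementation used in the experiments; the weight initialization, the toy input size of 20, and all function names are assumptions, and the Adam update step is omitted.

```python
import math
import random

N_HIDDEN, N_CLASSES, DROPOUT = 128, 11, 0.2


def init_params(n_inputs, rng):
    """Small random weights; the initialization scheme is an assumption."""
    def layer(n_out, n_in):
        return [[rng.uniform(-0.05, 0.05) for _ in range(n_in)]
                for _ in range(n_out)]
    return {"W1": layer(N_HIDDEN, n_inputs), "b1": [0.0] * N_HIDDEN,
            "W2": layer(N_CLASSES, N_HIDDEN), "b2": [0.0] * N_CLASSES}


def dense(x, weights, biases):
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def softmax(z):
    peak = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, params, rng=None):
    """Return class probabilities; pass rng to enable input dropout."""
    if rng is not None:
        # Inverted dropout on the inputs, applied during training only.
        x = [0.0 if rng.random() < DROPOUT else v / (1.0 - DROPOUT) for v in x]
    hidden = [math.tanh(v) for v in dense(x, params["W1"], params["b1"])]
    return softmax(dense(hidden, params["W2"], params["b2"]))


def cross_entropy(probs, label):
    return -math.log(probs[label])


rng = random.Random(0)
params = init_params(n_inputs=20, rng=rng)
x = [rng.random() for _ in range(20)]
probs = forward(x, params)             # inference: dropout disabled
train_probs = forward(x, params, rng)  # training: dropout enabled
loss = cross_entropy(probs, label=3)
```

The 11 softmax outputs sum to one and play the role of the per-class confidence values that are later combined in the ensembles.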
Due to memory constraints, we limit analysis to only the 100,000 most important features, selected by performing an ANOVA F-test on the entire feature set (Harwell et al., 1992).
In addition to the FCNN, we test another type of neural network for this task. Following the architecture described by Wang et al. (2016), we train a pipeline consisting of a convolutional neural network (CNN) which transforms the input data at the character level and a Long Short-Term Memory (LSTM) neural network which performs classification on the output of the CNN. We also trained an LSTM on word vectors (Mikolov et al., 2013). In both cases, however, we found results to be lacking in accuracy.
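As an illustration of ANOVA-based feature selection, the sketch below computes the one-way F statistic for each feature column and keeps the k highest-scoring columns. The helper names `anova_f` and `select_top_k` and the toy data are assumptions; scikit-learn's `f_classif` combined with `SelectKBest` is one existing option for the same computation at scale, though the text does not state which implementation was used.

```python
def anova_f(values, labels):
    """One-way ANOVA F statistic for one feature across class labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    means = {label: sum(g) / len(g) for label, g in groups.items()}
    ss_between = sum(len(g) * (means[label] - grand_mean) ** 2
                     for label, g in groups.items())
    ss_within = sum((v - means[label]) ** 2
                    for label, g in groups.items() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    # Ratio of between-class to within-class mean squares.
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def select_top_k(rows, labels, k):
    """Return the indices of the k features with the largest F statistic."""
    n_features = len(rows[0])
    scores = [anova_f([row[j] for row in rows], labels)
              for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


# Toy data: feature 0 separates the two classes, feature 1 does not.
rows = [[5, 1], [6, 0], [1, 1], [0, 0]]
labels = ["A", "A", "B", "B"]
kept = select_top_k(rows, labels, k=1)
```

A feature whose mean differs strongly between L1 classes relative to its within-class variance receives a high score and is retained.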

Ensemble construction
For any given SVM or FCNN, the output for 11-way classification can be represented as a vector of 11 numbers. For the SVM, output is in the form of confidence scores for each class, each of which corresponds to the signed distance from the sample to that class's hyperplane (Weston and Watkins, 1998). Similarly, each FCNN prediction is in the form of confidence values for each class, derived from the softmax output layer.
Using the best feature combination and representation from the previous experiments, we trained two sets of base classifiers, FCNNs and SVMs, on different features and combined each set of outputs using three different voting schemes (Polikar, 2006):
• Mean: Final label is the class corresponding to the greatest average confidence score.
• Median: Final label is the class corresponding to the greatest median confidence score.
• Plurality vote: Final label is the class with the greatest number of votes. In a tie, we choose the class that comes first alphabetically.
In line with previous work, we achieve the highest accuracy using the mean rule (Malmasi et al., 2013), as shown in Table 3.
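The three voting schemes can be sketched in a few lines of Python. The function name `combine`, the three class labels, and the score vectors below are illustrative assumptions rather than values from the experiments.

```python
from collections import Counter
from statistics import mean, median


def combine(outputs, classes, rule="mean"):
    """Combine per-classifier score vectors (one score per class)."""
    if rule in ("mean", "median"):
        aggregate = mean if rule == "mean" else median
        # zip(*outputs) groups the scores that each class received.
        scores = [aggregate(column) for column in zip(*outputs)]
        return classes[scores.index(max(scores))]
    if rule == "plurality":
        # Each classifier votes for its own highest-scoring class.
        votes = Counter(classes[o.index(max(o))] for o in outputs)
        top = max(votes.values())
        # Ties are broken by choosing the alphabetically first class.
        return min(c for c, v in votes.items() if v == top)
    raise ValueError(f"unknown rule: {rule}")


classes = ["ARA", "CHI", "FRE"]
outputs = [[0.90, 0.05, 0.05],
           [0.30, 0.40, 0.30],
           [0.30, 0.40, 0.30]]
by_mean = combine(outputs, classes, "mean")
by_median = combine(outputs, classes, "median")
by_vote = combine(outputs, classes, "plurality")
tie = combine([[1, 0, 0], [0, 1, 0]], classes, "plurality")
```

In this toy case one highly confident classifier decides the outcome under the mean rule, while the median and plurality rules follow the two weaker classifiers, which shows how the schemes can disagree on the same outputs.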

Meta-classifier
Another way to combine the outputs of several base classifiers is to feed their outputs into another classifier, also known as a meta-classifier. To obtain outputs from SVMs and FCNNs, we split the training set into ten folds and perform cross-validation. This gives us a set of meta-features that are then used as input to an LDA meta-classifier, which was found to outperform other algorithms for meta-classification in Malmasi and Dras (2017).
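A minimal scikit-learn sketch of this procedure is given below, under several assumptions: synthetic three-class data stands in for the essay features, two column slices stand in for two feature types, and only SVM base classifiers are shown. It should be read as one plausible realization rather than the exact pipeline used in the experiments.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.svm import LinearSVC

# Synthetic data standing in for the essay features (assumption).
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Two column slices play the role of two feature types (assumption).
views = [slice(0, 10), slice(10, 20)]

train_meta, dev_meta = [], []
for cols in views:
    svm = LinearSVC(max_iter=5000)
    # Out-of-fold confidence scores: every training sample is scored by a
    # model that did not see it, using 10-fold cross-validation.
    train_meta.append(cross_val_predict(
        svm, X_train[:, cols], y_train, cv=10, method="decision_function"))
    # For held-out data, the base classifier is refit on all training data.
    svm.fit(X_train[:, cols], y_train)
    dev_meta.append(svm.decision_function(X_dev[:, cols]))

train_meta = np.hstack(train_meta)  # shape: (n_train, n_views * n_classes)
dev_meta = np.hstack(dev_meta)

# The LDA meta-classifier is trained on the stacked confidence scores.
meta_clf = LinearDiscriminantAnalysis().fit(train_meta, y_train)
dev_pred = meta_clf.predict(dev_meta)
```

Using out-of-fold scores keeps the meta-classifier from being trained on outputs that the base classifiers produced for samples they had already seen.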

Meta-classifier ensembles
Building on the idea of ensembles and meta-classification, we experiment with ensembles of meta-classifiers (Malmasi and Dras, 2017). SVM and FCNN outputs, or meta-features, are generated in the same way as in section 5.4. However, instead of training a single meta-classifier on these features, we use bagging (bootstrap aggregating) to train multiple LDAs on random subsets of the base classifier outputs. A grid search was performed to find the optimal number of meta-classifiers and optimal percentage of samples to train each LDA on. The predictions from multiple LDAs were then combined using voting schemes described in section 5.3.
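Bagged LDA meta-classifiers with a small grid search can be sketched with scikit-learn as follows. The synthetic meta-features, the grid values, and the three-fold search are assumptions for illustration; the values actually selected in the experiments are not reported in the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for stacked base classifier outputs (assumption);
# with six base classifiers and 11 classes there would be 66 columns.
meta_X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                                n_redundant=2, n_classes=3, random_state=0)
meta_train, y_train = meta_X[:240], y[:240]
meta_dev = meta_X[240:]

# Each LDA is trained on a bootstrap sample of the meta-features.
bagged_lda = BaggingClassifier(LinearDiscriminantAnalysis(),
                               bootstrap=True, random_state=0)

# Search over the number of meta-classifiers and the fraction of samples
# each one sees; the grid values here are hypothetical.
search = GridSearchCV(
    bagged_lda,
    param_grid={"n_estimators": [10, 25], "max_samples": [0.5, 0.8]},
    cv=3,
)
search.fit(meta_train, y_train)

ensemble = search.best_estimator_
# BaggingClassifier averages the class probabilities of its members,
# which corresponds to combining the LDA predictions by the mean rule.
proba = ensemble.predict_proba(meta_dev)
pred = ensemble.predict(meta_dev)
```

Other combination rules, such as the median or plurality vote, would require collecting the predictions of the individual members (available through `ensemble.estimators_`) and combining them manually.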

Results and Discussion
In this section we present our results on single features, feature combinations, single classifiers, and classifier ensembles.

Individual features
The results of SVM and FCNN classifiers trained on different features are shown in Table 1. For these experiments, features were represented by their frequency count. We observe a general trend within different feature types: F1 scores increase as n-gram order increases. This is not unexpected, given the success of NLI models that make use of higher-order n-grams (Jarvis et al., 2013). One exception to this trend is that there seems to be an upper bound for word n-grams at the bigram level, where accuracy drops for word trigrams. This may be attributed in part to the increased sparsity of features when we move from bigrams to trigrams at the word level. Interestingly, spelling errors were less informative than we had expected. Although we did not evaluate the accuracy of the autocorrect package we used for spelling correction, we suspect that it did not perform well since it operates naively, without looking at context (Jonas, 2013). Additionally, the types of errors we defined might not have been fine-grained enough to capture differences unique to groups of L1 writers.

Single classifier results
As in Malmasi et al. (2013), we measure the effectiveness of different feature representations. Of the feature types described above, we include in our final system only a subset of the highest performing features. Thus, analysis is limited to this subset of features.
With frequency counts as a baseline, we compare the performance of classifiers trained on three different combinations of high-performing features. These groups are:
• Word: Lemmas, words (1-, 2-, and 3-grams).
Each group of features is tested with and without term frequency-inverse document frequency (TF-IDF) weighting. Further, we examine the effects of binarization, L1 normalization, and L2 normalization on the same feature set. Note that L1 and L2 normalization refer to the vector norms across each input row. These results are summarized in Table 2.
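The feature representations compared here can be written out explicitly. The sketch below assumes the smoothed IDF form ln((1 + N) / (1 + df)) + 1 and applies binarization, TF-IDF weighting, and row normalization in that order; both the IDF variant and the order of operations are assumptions, as the text does not specify them.

```python
import math


def binarize(row):
    """Replace each count by 1 if the feature occurs, else 0."""
    return [1.0 if v > 0 else 0.0 for v in row]


def tfidf(rows):
    """Weight each column by a smoothed inverse document frequency."""
    n_docs = len(rows)
    n_features = len(rows[0])
    df = [sum(1 for r in rows if r[j] > 0) for j in range(n_features)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    return [[v * w for v, w in zip(r, idf)] for r in rows]


def normalize(row, norm="l2"):
    """Scale one input row to unit L1 or L2 norm."""
    if norm == "l1":
        size = sum(abs(v) for v in row)
    else:
        size = math.sqrt(sum(v * v for v in row))
    return [v / size for v in row] if size else list(row)


counts = [[2, 0, 1],   # rows are essays, columns are n-gram counts
          [0, 3, 1]]
weighted = tfidf([binarize(r) for r in counts])
l2_rows = [normalize(r, "l2") for r in weighted]
l1_rows = [normalize(r, "l1") for r in weighted]
```

After TF-IDF weighting, the feature that occurs in only one essay receives a larger weight than the feature shared by both, and normalization then removes the effect of essay length on the vector magnitude.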
Comparing classifiers trained on individual features (Table 1) to those trained on combinations of features (Table 2), it is evident that better results are achieved by training a single classifier on multiple features than on any single feature type. Further, Table 2 shows that the best performing classifiers use L2-normalized features with TF-IDF.
Our official submission to the NLI Shared Task 2017 used a single SVM classifier, which requires less time and fewer computational resources to train compared to an FCNN. An SVM on words (1-, 2-, and 3-grams) and characters (4- and 5-grams) achieves an F1 score of 0.8633 on TEST (see Table 4). The features were binarized, L2-normalized and TF-IDF weighted. The confusion matrix is shown in Figure 1.

Ensembles
The results detailed in this section were not submitted as part of the NLI Shared Task 2017, and were obtained after the test phase ended.
At the most basic level, individual classifiers are combined in a straightforward manner using a voting scheme. As we increase the complexity of the model, first by training an LDA meta-classifier on the outputs, and then by constructing an ensemble of meta-classifiers, we observe a slight performance gain for both SVMs and FCNNs at each step, consistent with the results in Malmasi and Dras (2017). Table 3 summarizes our results from using different ensemble methods to combine individual classifiers trained on words (2- and 3-grams), characters (4- and 5-grams) and phonemes (4- and 5-grams). Further experiments with SVMs and FCNNs were conducted by selecting different features to combine on a trial and error basis. The decision to use character n-grams within as opposed to across word boundaries was made arbitrarily. All features are binarized, L2-normalized, and TF-IDF weighted. The results of our best ensemble classifiers on DEV and TEST are displayed in Table 4.
For SVMs, an ensemble of meta-classifiers outperforms both a simple voting scheme and a single meta-classifier; we do not observe the same performance gain with respect to FCNNs (see Table 3).
Our best SVM ensemble consists of an ensemble of meta-classifiers. SVMs are trained on words (2- and 3-grams), characters within word boundaries (4- and 5-grams), and phonemes (4- and 5-grams), giving a total of six classifiers. The outputs of these individual classifiers are fed to an ensemble of LDAs, as described in 5.5. Finally, the LDA predictions are combined using the mean rule. The F1 score on TEST for this model is 0.8730.
Our best FCNN ensemble applies a voting scheme to classifier outputs. Four FCNNs are trained on the following combinations of features: (1) word bigrams and lemma trigrams, (2) word bigrams, (3) character 5-grams, (4) character 5-grams within word boundaries. The outputs from these individual networks are combined using the mean rule, yielding an F1 score of 0.8560 on TEST. Additionally, we created an ensemble of different SVM and FCNN classifiers but found no improvement over pure ensembles of either type.

Future work
We excluded from our system individual features that did not perform well in our experiments. It would be helpful to evaluate the influence of these less accurate features and determine whether they would be useful to include in ensemble classifiers. Further, we tested a limited number of combinations of features. One facet of the problem involves developing a systematic approach to search for a good feature set.
Although we trained several FCNNs on different feature types, the utility of an FCNN as a meta-classifier has not been examined.
A CNN-LSTM model shown to perform well for sentiment analysis (Wang et al., 2016) did not achieve good results for NLI. While sentiment classification typically involves five or fewer classes, there were 11 classes for the NLI Shared Task 2017. It may be that additional classes increase the possibility of error. Further investigation is required to explain why a CNN-LSTM architecture performs worse relative to an FCNN model.
Our results show the utility of various features for this task and confirm that ensemble methods perform better than single classifiers trained on multiple features. They also offer several new directions to further improve NLI systems.