CIC-FBK Approach to Native Language Identification

We present the CIC-FBK system, which took part in the Native Language Identification (NLI) Shared Task 2017. Our approach combines features commonly used in previous NLI research, i.e., word n-grams, lemma n-grams, part-of-speech n-grams, and function words, with recently introduced character n-grams from misspelled words, and with features that are novel in this task, such as typed character n-grams and syntactic n-grams of words and of syntactic relation tags. We use the log-entropy weighting scheme and perform classification using the Support Vector Machines (SVM) algorithm. Our system achieved a macro-averaged F1-score of 0.8808 and shared the 1st rank in the NLI Shared Task 2017 scoring.


Introduction
Native language identification (NLI) is a natural language processing (NLP) task that aims at automatically identifying the native language (L1) of a language learner based on his/her writing in the second language (L2). Identifying the native language is based on the hypothesis that the L1 of a learner impacts his/her L2 writing due to the language transfer effect. NLI can be used for a variety of purposes, including marketing, security, and educational applications. From the machine-learning perspective, the NLI task is viewed as a multi-class, single-label classification problem, in which automatic methods have to assign class labels (L1s) to objects (texts).
Recent trends in NLI include cross-genre and cross-corpus NLI scenarios (Malmasi and Dras, 2015a), as well as identifying the L1 based on writings in other non-English L2s and cross-lingual NLI research (Malmasi and Dras, 2015b). However, following the practice of the first NLI shared task (Tetreault et al., 2013), this year's task focuses on L2 English data. This can be related to the use of English as a lingua franca on the Internet and in academia, since NLI methods are particularly useful for languages with a large number of foreign speakers. Moreover, following the 2016 Computational Paralinguistics Challenge (Schuller et al., 2016) and the VarDial workshop (Malmasi et al., 2016), this year's competition covers an NLI task based on the spoken response. Overall, this year's task consists of three tracks: NLI on the essay only, NLI on the spoken response only, and NLI on both essay and spoken response. In this paper, we describe the CIC-FBK approach to the essay-only track.
Previous works on identifying the native language from texts explored a large variety of features, including lexical and part-of-speech (POS) features (Koppel et al., 2005a), character n-grams (Ionescu et al., 2014), spelling errors (Koppel et al., 2005b), and syntactic features (Wong and Dras, 2011). Following previous research on the NLI task, we incorporate commonly used word n-grams, lemma n-grams, POS n-grams, and function words. In order to capture the L1 influences at the character level, we use recently introduced character n-grams from misspelled words (Chen et al., 2017), as well as the 10 categories of character n-gram features proposed by Sapkota et al. (2015). We also include syntactic features by extracting syntactic dependency-based n-grams of words and of syntactic relation tags using the algorithm designed by Posadas-Durán et al. (2014). We describe the features used by the CIC-FBK system in more detail in subsection 3.1.
Our system achieved a macro-averaged F1-score of 0.8808 and an accuracy of 0.8809 in the essay-only track and shared the 1st rank in the NLI Shared Task 2017 scoring. It obtained the 2nd absolute score, 0.0010 F1-score and 0.0009 accuracy below the 1st place.

Data
The dataset used in the NLI Shared Task 2017 is composed of English essays written by non-native learners in a standardized assessment of English proficiency for academic purposes. The corpus consists of 13,200 essays (1,000 essays per L1 for training, 100 for development, and 100 for testing). The essays are sampled from 8 prompts, and score levels (low/medium/high) are provided for each essay. The training, development, and test sets are balanced in terms of the number of essays per L1 group. The 11 L1s covered by the corpus are: Arabic (ARA), Chinese (CHI), French (FRE), German (GER), Hindi (HIN), Italian (ITA), Japanese (JAP), Korean (KOR), Spanish (SPA), Telugu (TEL), and Turkish (TUR). The detailed description of the corpus and its statistics can be found in .

Methodology
Our system incorporates a wide range of features, i.e., word, lemma, and POS n-grams, spelling error character n-grams, typed character n-grams, and syntactic n-grams. We used the tokenized version of essays provided by the organizers. For the evaluation of our approach, we merged the training and development sets, and conducted experiments under 10-fold cross-validation. System performance was measured in terms of both classification accuracy and F1 (macro) score. The former was used as the evaluation metric in the majority of previous works on NLI, whilst the latter is the official evaluation metric in the NLI Shared Task 2017.
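The two evaluation measures can be sketched in a few lines of plain Python. This is a minimal illustration rather than the official scoring script: the function names and the toy labels are assumptions made for the example, and macro F1 is taken here as the unweighted mean of the per-class F1 scores.

```python
from collections import Counter

def accuracy(gold, pred):
    """Fraction of texts whose predicted L1 matches the gold L1."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """Unweighted mean of the per-class F1 scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was missed
    scores = []
    for label in sorted(set(gold) | set(pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels (hypothetical, not taken from the shared task data).
gold = ["GER", "GER", "HIN", "HIN", "TEL", "TEL"]
pred = ["GER", "GER", "HIN", "TEL", "TEL", "TEL"]
acc = accuracy(gold, pred)
f1 = macro_f1(gold, pred)
```

On balanced data such as the shared task corpus the two measures tend to be close, which is consistent with the small differences between accuracy and F1 reported in this paper.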

Features
Word, lemma, and POS n-grams
Word and lemma features represent the lexical choice of a writer, while part-of-speech (POS) features capture the morpho-syntactic patterns in a text. Following previous works on the NLI task (Jarvis et al., 2013; Malmasi and Dras, 2017), we use word, lemma, and POS n-grams with n ranging from 1 to 3. We include punctuation marks and split n-grams by a full stop. We lowercase word and lemma n-grams and replace each digit by the same symbol (e.g., 12,345 → 00,000), as proposed in previous work, to capture the format (e.g., 00.000 vs. 00,000), which reflects a stylistic choice of the learner, rather than the value of the number, which carries no stylistic information. Lemmas and POS tags were obtained using the TreeTagger software package (Schmid, 1995).
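The preprocessing described above can be sketched as follows. This is an illustrative reading of the text, not the system's actual code: the function name is hypothetical, and it is an assumption that the full stop stays in the sentence chunk it closes, so that no n-gram spans two sentences.

```python
import re

def word_ngrams(tokens, n_min=1, n_max=3):
    """Word n-grams (n_min..n_max) that never cross a full stop.

    Tokens are lowercased and every digit is replaced by '0'.
    """
    normalized = [re.sub(r"\d", "0", tok.lower()) for tok in tokens]
    # Cut the token stream into chunks, closing a chunk at each full stop
    # (assumption: the full stop belongs to the chunk it ends).
    chunks, current = [], []
    for tok in normalized:
        current.append(tok)
        if tok == ".":
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    ngrams = []
    for chunk in chunks:
        for n in range(n_min, n_max + 1):
            for i in range(len(chunk) - n + 1):
                ngrams.append(" ".join(chunk[i:i + n]))
    return ngrams

# Already tokenized toy input (hypothetical example sentence).
tokens = "It costs 12,345 dollars . I agree .".split()
features = word_ngrams(tokens)
```

The same function could be applied to lemma sequences; for POS tags the lowercasing and digit replacement would presumably be skipped.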

Function words
Function words are the most common words in a language (e.g., articles, determiners, conjunctions). They are considered one of the most important stylometric features (Kestemont, 2014). Function words can be seen as indicators of the grammatical relations between other words. We use a set of 318 English function words from the scikit-learn package (Pedregosa et al., 2011). Other function word lists that we examined, obtained from the Natural Language Toolkit (127 function words) and the Onix Text Retrieval Toolkit (429 function words), as well as function word skip-grams (Guthrie et al., 2006), did not lead to an improvement in accuracy.

Spelling error character n-grams
Spelling errors have been used as features for NLI since Koppel et al. (2005b). They are considered a strong indicator of an author's L1, since they reflect L1 influences, such as sound-to-character mappings in L1. Recently, Chen et al. (2017) introduced the use of character n-grams from misspelled words. The authors showed that adding spelling error character n-grams to other commonly used features (word and lemma n-grams) improves NLI classification accuracy. We extract 39,512 unique misspelled words from the training and development sets using the spell shell command. Then we build character n-grams (n = 4) from the extracted misspelled words. Other sizes of spelling error character n-grams that we examined (n = 1, 2, 3, and 5), as well as their combinations, did not lead to an improvement in system performance.
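A minimal sketch of this feature type is given below. It assumes that the set of misspelled words has already been produced by a spell checker (the spell command itself is not invoked here), and that words shorter than n contribute no n-grams; the function names and the toy data are hypothetical.

```python
from collections import Counter

def char_ngrams(word, n=4):
    """All contiguous character n-grams of a single word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def spelling_error_features(tokens, misspelled, n=4):
    """Count character n-grams taken only from misspelled tokens."""
    counts = Counter()
    for tok in tokens:
        if tok in misspelled:
            counts.update(char_ngrams(tok, n))
    return counts

# Hypothetical spell-checker output and a toy tokenized essay.
misspelled = {"becouse", "recieve"}
tokens = "I recieve letters becouse I write".split()
features = spelling_error_features(tokens, misspelled)
```

Correctly spelled tokens such as "letters" are ignored, so the resulting feature space stays much smaller than that of character n-grams over the full text.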

Typed character n-grams
Character-level features are sensitive to both the content and the form of a text and are able to capture lexical and syntactic information, as well as punctuation and capitalization information related to the authors' style (Stamatatos, 2013). The effectiveness of character n-gram features for representing the stylistic properties of a text has been demonstrated in previous NLI studies (Ionescu et al., 2014; Chen et al., 2017). Their effectiveness in NLI is hypothesized to result from phoneme transfer from the learner's L1 and from their ability to capture orthographic conventions of a language (Tsur and Rappoport, 2007). Sapkota et al. (2015) defined 10 different character n-gram categories based on affixes, words, and punctuation. In this approach, instances of the same n-gram may refer to different typed n-gram features. For example, in the phrase less carelessness, the two instances of the 4-gram less are assigned to different character n-gram categories.
As an example, consider the following sample sentence:
(1) Lisa said, "John should repair it tomorrow."
The character n-grams (n = 4) for the sample sentence (1) for each of the categories proposed by Sapkota et al. (2015) are shown in Table 1. For clarity, spaces are represented by the underscore.

Table 1: Character n-grams (n = 4) for the sample sentence (1) in each category of Sapkota et al. (2015).

SC      Category       N-grams
affix   prefix         shou repa tomo
        suffix         ould pair rrow
        space-prefix   _sai _sho _rep _it_ _tom
        space-suffix   isa_ ohn_ uld_ air_
word    whole-word     Lisa said John
        mid-word       houl epai omor morr orro
        multi-word *   sa_s hn_s ld_r ir_i it_t
punct   beg-punct      "Joh
        mid-punct **   ,_" ._"
        end-punct      aid, row.

* If the previous word is more than one character long, two characters are considered; otherwise, only one character is considered.
** We use the tokenized version of essays and set the size of n-grams to 3 for this category. For other categories of typed character n-grams, the size is set to 4.
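To make the category assignment more concrete, the sketch below covers only the four categories that depend on a single word (prefix, suffix, whole-word, and mid-word). It is a partial illustration under stated assumptions, not the implementation of Sapkota et al. (2015): words are taken to be maximal runs of ASCII letters, and the space-based, multi-word, and punctuation categories are deliberately left out. The function name is hypothetical.

```python
import re
from collections import defaultdict

def word_level_typed_ngrams(text, n=4):
    """Assign word-internal character n-grams to four typed categories."""
    typed = defaultdict(list)
    # Assumption: a word is a maximal run of ASCII letters.
    for word in re.findall(r"[A-Za-z]+", text):
        if len(word) == n:
            # The n-gram covers the whole word.
            typed["whole-word"].append(word)
        elif len(word) > n:
            typed["prefix"].append(word[:n])
            typed["suffix"].append(word[-n:])
            # n-grams that touch neither the first nor the last character.
            for i in range(1, len(word) - n):
                typed["mid-word"].append(word[i:i + n])
    return dict(typed)

sentence = 'Lisa said, "John should repair it tomorrow."'
typed = word_level_typed_ngrams(sentence)
```

For the sample sentence (1) this yields the prefix, suffix, whole-word, and mid-word n-grams listed for those categories; in a feature vector each n-gram would be stored together with its category label, so that, for instance, a prefix "less" and a suffix "less" remain distinct features.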
Typed character n-grams have been shown to be predictive features for other classification tasks, such as authorship attribution (Sapkota et al., 2015), author profiling (Markov et al., 2016), and discriminating between similar languages (Gómez-Adorno et al., 2017). In our experiments, typed character n-grams (n = 4) outperformed traditional character n-grams of the same size in most system configurations. In addition, we compared the performance of typed and traditional character n-grams on the 7-way ICLEv2 corpus (Granger et al., 2009), following the corpus splitting described in Ionescu et al. (2014). In this experiment, typed character n-grams proved to be more indicative than traditional character n-grams when used in combination with the features described in this paper.

Syntactic n-grams
Syntactic features, including production rules (Wong and Dras, 2011) and Tree Substitution Grammars (TSGs) (Swanson and Charniak, 2012), have been previously explored for NLI. Tetreault et al. (2012) experimented with the Stanford parser (de Marneffe et al., 2006) dependency features and concluded that they are strong indicators of structural differences in L2 writing. We exploit the Stanford dependencies to build syntactic n-gram features by using the algorithm designed and made available by Posadas-Durán et al. (2014). Consider the following sample sentence:
(2) I remember this great experience.
Stanford dependencies, including a backoff transformation based on POS, were used as features for NLI in Tetreault et al. (2012). In the metalanguage proposed in Sidorov (2013a), the head element stands to the left of a square bracket and its dependent elements are inside the brackets; elements separated by a comma belong to non-continuous syntactic n-grams, that is, they are at the same level in the syntactic tree.
Syntactic n-grams can be used in any task where traditional n-grams are applied. They make it possible to introduce syntactic information into machine-learning methods (at the cost of prior syntactic parsing). Syntactic n-grams outperformed traditional n-grams in the task of authorship attribution and have been applied in tasks related to L2, for example, automatic grammar correction for English as L2 (Sidorov, 2013b). In our system, we use only continuous syntactic n-grams of words and of syntactic relation tags with n ranging from 2 to 3. The inclusion of non-continuous syntactic n-grams improved 10-fold cross-validation accuracy; however, it did not perform well on the test set.
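As a rough illustration of continuous syntactic n-grams of words, the sketch below enumerates head-to-dependent paths of length 2 and 3 in a dependency tree and prints them in a bracket notation with the head to the left of its dependent. It is not the algorithm of Posadas-Durán et al. (2014): the function names are hypothetical, the tree for sentence (2) is written by hand as an assumption about the parser output, and nodes are keyed by word form, which only works when no word is repeated in the sentence.

```python
def continuous_sn_grams(children, head, n):
    """Continuous syntactic n-grams of size n that start at `head`.

    A continuous n-gram is a path that follows head -> dependent arcs.
    """
    if n == 1:
        return [head]
    result = []
    for dep in children.get(head, []):
        for tail in continuous_sn_grams(children, dep, n - 1):
            result.append(f"{head}[{tail}]")
    return result

def all_sn_grams(children, n_min=2, n_max=3):
    """Collect continuous syntactic n-grams for every node in the tree."""
    nodes = set(children) | {d for deps in children.values() for d in deps}
    grams = []
    for n in range(n_min, n_max + 1):
        for node in sorted(nodes):
            grams.extend(continuous_sn_grams(children, node, n))
    return grams

# Hand-written dependency tree for "I remember this great experience."
# (assumed structure: head -> list of dependents).
children = {
    "remember": ["I", "experience"],
    "experience": ["this", "great"],
}
sn_grams = all_sn_grams(children)
```

Replacing the word forms with the relation labels on the arcs would give the corresponding n-grams of syntactic relation tags.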

Frequency threshold
The fine-tuning of feature set size has proved to be a useful strategy for NLI (Jarvis et al., 2013) and other NLP tasks (Stamatatos, 2013). In our approach, we selected the frequency threshold value that provided the highest 10-fold cross-validation result. We consider only those features that occur in at least two documents in the training corpus and that occur at least 4 times in the entire training corpus. This frequency threshold improves 10-fold cross-validation accuracy by about 1% compared to the configuration in which all features are considered, and reduces the size of the feature set by approximately 90%. The final size of our feature set is 726,494.
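The threshold can be expressed as a simple filter over document frequency and total corpus frequency. The sketch below assumes each document is already represented as a list of extracted features; the function name and the toy documents are hypothetical.

```python
from collections import Counter

def select_features(docs, min_df=2, min_total=4):
    """Keep features found in >= min_df documents and >= min_total times overall.

    docs: list of feature lists, one list per document.
    """
    total = Counter()     # occurrences in the whole corpus
    doc_freq = Counter()  # number of documents containing the feature
    for feats in docs:
        total.update(feats)
        doc_freq.update(set(feats))
    return {f for f in total
            if doc_freq[f] >= min_df and total[f] >= min_total}

# Toy corpus of three "documents" (hypothetical).
docs = [
    ["the", "the", "of", "rare", "rare", "rare", "rare"],
    ["the", "of", "the"],
    ["of", "the"],
]
kept = select_features(docs)
```

In this toy corpus "rare" is frequent overall but confined to one document, and "of" appears in every document but only three times in total, so only "the" passes both conditions.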

Weighting scheme
We use the log-entropy weighting scheme, which showed good results in previous studies on NLI (Jarvis et al., 2013; Chen et al., 2017).
The log-entropy weighting scheme consists of a local weighting (denoted as $L_{\log}(i, j)$) and a global weighting (denoted as $G_{ent}(i)$). The local weighting is the logarithm of the add-one smoothed term frequency:

$$L_{\log}(i, j) = \log(\mathrm{frequency}(i, j) + 1)$$

where $\mathrm{frequency}(i, j)$ is the frequency of term $i$ in document $j$. The global entropy weighting is calculated by the following formula:

$$G_{ent}(i) = 1 + \frac{\sum_{j=1}^{J} p_{ij} \log p_{ij}}{\log J}$$

where $J$ is the total number of documents in the corpus, $\sum_{j=1}^{J} p_{ij} \log p_{ij}$ is the additive inverse of the entropy of the conditional distribution given $i$, and

$$p_{ij} = \frac{\mathrm{frequency}(i, j)}{\sum_{j=1}^{J} \mathrm{frequency}(i, j)}$$

The final weighting $W$ is calculated as follows:

$$W(i, j) = L_{\log}(i, j) \times G_{ent}(i)$$

Other examined feature representations, i.e., binary feature representation, tf, tf-idf, and normalized feature representation, did not enhance system performance. The log-entropy weighting scheme outperforms tf-idf, the second best scheme in our experiments, by 2.6% in 10-fold cross-validation accuracy.
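The weighting can be sketched in plain Python as follows. This is a minimal illustration under assumptions, not the system's code: the local weight is taken as log(frequency + 1), the global weight as 1 plus the sum of p log p over documents divided by log J, with p being the share of a term's corpus frequency that falls in a given document, and the corpus is assumed to contain more than one document so that log J is non-zero. The function name and the toy counts are hypothetical.

```python
import math

def log_entropy_weights(counts):
    """Log-entropy weights for a corpus.

    counts: list of dicts, one per document, mapping term -> raw frequency.
    Returns a list of dicts mapping term -> weight (assumes len(counts) > 1).
    """
    num_docs = len(counts)
    # Corpus-wide frequency of each term.
    global_freq = {}
    for doc in counts:
        for term, freq in doc.items():
            global_freq[term] = global_freq.get(term, 0) + freq
    # Global weight: 1 + sum_j p_ij * log(p_ij) / log(J).
    g_ent = {}
    for term, gf in global_freq.items():
        neg_entropy = 0.0
        for doc in counts:
            freq = doc.get(term, 0)
            if freq > 0:
                p = freq / gf
                neg_entropy += p * math.log(p)
        g_ent[term] = 1.0 + neg_entropy / math.log(num_docs)
    # Final weight: local log weight times global entropy weight.
    return [{term: math.log(freq + 1) * g_ent[term]
             for term, freq in doc.items()}
            for doc in counts]

# Toy corpus of two documents (hypothetical counts).
counts = [{"the": 2, "kimchi": 3}, {"the": 2}]
weights = log_entropy_weights(counts)
```

In the toy corpus, "the" is spread evenly over both documents and receives a global weight of zero, whereas "kimchi" occurs in a single document and keeps its full local weight; this is the behaviour that makes the scheme favour terms concentrated in few documents.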

Classifier
Support Vector Machines (SVM) is considered among the best performing classification algorithms for text categorization tasks; moreover, it was the classifier of choice for the majority of the teams in the previous edition of the NLI shared task. We use the liblinear scikit-learn (Pedregosa et al., 2011) implementation of SVM with the one-vs-rest ('ovr') multi-class strategy. We set the penalty hyperparameter C to 100 based on our model selection results.

Results
We present the results of our experiments in two phases. First, we show the performance of each type of features in isolation under 10-fold cross-validation on the merged training and development sets. Then, we compare the performance obtained on the test set with other participating teams. We present the 10-fold cross-validation results in terms of classification accuracy. For each experiment, the difference between accuracy and F1 (macro) score was less than 0.0003.
The individual performance of the features used in our system with the configurations described in the previous section, as well as the number of features (N) of each type, are shown in Table 2. In line with the previous works on the NLI task (Tetreault et al., 2013; Jarvis et al., 2013; Chen et al., 2017), in our configurations word and lemma n-grams are the most predictive features. They showed 0.8463 and 0.8454 10-fold cross-validation accuracy, respectively, when evaluated in isolation. Typed character n-grams also performed well with a much smaller feature set, achieving 0.7779 accuracy. Syntactic n-grams of syntactic relation tags showed the lowest accuracy when evaluated in isolation; however, when used in combination with other features, they improve 10-fold cross-validation accuracy by 0.2%. The combination of all the features showed 0.8640 10-fold cross-validation accuracy on the merged training and development sets.
The NLI Shared Task 2017 organizers reported several 1st ranked teams based on McNemar's statistical significance test with an alpha value of 0.05. The official results for the essay-only track in terms of F1 (macro) score and classification accuracy for the 1st ranked teams, as well as the baseline results, are shown in Table 3.
The CIC-FBK best run differs by 0.0009 in terms of classification accuracy from the highest result, achieved by the ItaliaNLP Lab system, which corresponds to one correctly predicted label. All 17 participating teams in the NLI Shared Task 2017 achieved a higher F1 (macro) score than the official baseline of 0.7104. The CIC-FBK system showed 0.8639 F1 (macro) score and 0.8640 accuracy under 10-fold cross-validation on the merged training and development sets. Our other runs in the NLI Shared Task 2017 included small modifications in system configurations, such as variations in frequency threshold values and a different strategy for dealing with digits (e.g., 12,345 → 0,0). However, since these modifications showed only marginal accuracy variations and did not improve system performance on the test set, the results for these runs are omitted in this paper.
The confusion matrix for our best run is shown in Figure 1. The highest level of confusion is between the Hindi and Telugu classes. Korean and Japanese form another problematic language pair, in which Korean native speakers are often classified as Japanese. The highest accuracy of 0.9800 was achieved for German native speakers. These results are in line with the ones reported in the previous edition of the NLI shared task (Tetreault et al., 2013), where the teams achieved low levels of accuracy for the Hindi/Telugu (none of the systems was able to reach 0.8000 accuracy for Hindi) and the Korean/Japanese pairs. In future work, we intend to tackle these two language pairs in isolation in order to improve the overall system performance.
Typed character n-grams and syntactic n-grams are new types of features that are introduced in the NLI task for the first time. Preliminary experiments on the training and development sets showed that these features improve the classification accuracy when used in combination with other types of features, such as word n-grams, lemma n-grams, part-of-speech n-grams, spelling error character n-grams, and function words. The CIC-FBK system achieved 0.8808 F1 (macro) score and 0.8809 accuracy and shared the 1st rank in the competition.