Native Language Identification Using a Mixture of Character and Word N-grams

Native language identification (NLI) is the task of determining an author's native language based on a piece of their writing in a second language. In recent years, NLI has received much attention due to its challenging nature and its applications in language pedagogy and forensic linguistics. We participated in the NLI Shared Task 2017 under the name UT-DSP. Our method for native language identification uses a fusion of character and word N-grams, and achieved its best F1-score of 77.64% when using both the essay and speech transcription datasets.


Introduction
Native Language Identification (NLI) is the task of using a piece of writing in a second language in order to determine the writer's native language. The main applications of NLI are in language teaching and in forensic linguistics (Kochmar, 2011).
In language teaching, NLI can help in determining the role of native language transfer in second language acquisition, so that course designers can change the material based on the native language of the learners (Laufer and Girsai, 2008).
In forensic linguistics, NLI can be the starting point for making assumptions about the identity of the author of a text that is of interest to intelligence agencies, since it reveals the linguistic background of the author (Tsvetkov et al., 2013).
The 2017 shared task contains three sub-challenges (Malmasi et al., 2017). The first challenge is predicting the native language of an English language learner using written responses to a standardized assessment of English proficiency for academic purposes. The second challenge is native language identification using the transcriptions of spoken responses produced by test takers. The last sub-challenge of the NLI Shared Task 2017 is a fusion of the two, i.e. both written and spoken responses from test takers are available for predicting their native language.
Our team, UT-DSP, participated in the NLI Shared Task 2017. An account of our participation is given in this paper.

Related Work
The first NLI Shared Task was organized in 2013 (Tetreault et al., 2013). The task was designed to predict the native language of an English learner based only on his/her English writing. The corpus used for the training phase of the task was the TOEFL11 corpus (Blanchard et al., 2013), which contained 11,000 English texts written by native speakers of 11 different languages. In total, 29 teams participated, with overall accuracies ranging from 0.319 to 0.836. According to the NLI Shared Task 2013 report, the prevailing trend among the teams was the use of character, word, and POS N-grams (Jarvis et al., 2013; Henderson et al., 2013; Bykh et al., 2013). The leading team (Jarvis) used a support vector machine (SVM) with more than 400,000 unique features, including lexical and POS N-grams.
A number of teams employed simple N-gram-based methods, as these approaches can be simpler to implement and, as a result, less time-consuming. Gyawali et al. (2013) developed four different models using character N-grams, word N-grams, POS N-grams, and the perplexity rates of character N-grams. They used an ensemble of these four models to achieve an accuracy of 0.75. Kyle et al. (2013) used an approach employing key N-grams. They outperformed the random baseline, with an accuracy of 0.59.
Three years after the first NLI Shared Task, in 2016, the Computational Paralinguistics Challenge included a sub-task aiming at the prediction of native language based on recordings of spoken responses. The accuracy rates reported by participating teams ranged from 30.9 to 47.5 per cent (Schuller et al., 2016).

Data Description
The datasets for the NLI Shared Task 2017 were released by the Educational Testing Service (ETS). These datasets were released in four phases, two of which belonged to the training phase and the remaining two to the testing phase. Each dataset released contained an equal number of files for each of the following 11 languages: Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu, and Turkish.

Train - Phase 1
In this phase, a dataset containing 12,100 essay files was released, 1,100 of which were included in a collection named dev chosen for evaluation purposes, and the rest were used for training the method.

Train - Phase 2
The dataset released in this phase contained a collection of 12,100 speech files, which were added to the essay files released in the previous phase. Similar to the previous phase, 1,100 of the speech files were chosen as the dev collection, in order to be used for evaluation. The remaining files were used to train the method.
Since both essay and speech files were at our disposal at this stage, we could train the method to predict the test taker's native language using the essay and speech datasets together, as well as using them separately.

Test - Phase 1
The purpose of the first test phase was to test the implemented methods for native language prediction using the essay and speech collections separately. The essay and speech collections contained 1,100 files each, with no overlap between the two.

Test - Phase 2
The aim of this phase was to test the fusion method on a collection of files belonging to 1,100 test takers. For each test taker, an essay and a speech file were included in the collection.

Methodology
An N-gram-based language model is used to estimate the probability of the occurrence of the next language particle (e.g. a character or a word) given its N−1 previous particles of the same type, by using a maximum likelihood estimation (MLE) approach (Amini et al., 2016; Brown et al., 1992). For example, considering $N(w_{i-n+1}^{i})$ as the number of occurrences of the word sequence $w_{i-n+1} w_{i-n+2} \ldots w_{i-1} w_i$ in a corpus, the n-gram probability of word $w_i$ based on the sequence of words $w_{i-n+1} w_{i-n+2} \ldots w_{i-1}$ which come before it is computed using formula 1:

$$P(w_i \mid w_{i-n+1}^{i-1}) = \frac{N(w_{i-n+1}^{i})}{N(w_{i-n+1}^{i-1})} \tag{1}$$

Our work employed a simple approach using a mixture of character and word N-grams. In order to do so, we had to train N-grams for each of the essay and speech transcription datasets in each language. The method was implemented without the use of i-vectors.
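The maximum likelihood estimate in formula 1 can be illustrated with a minimal Python sketch. The function name `train_mle_ngram` and the toy sentence are hypothetical and not taken from our system; the sketch also counts a history only where it is followed by another token, which is one common convention and an assumption here.

```python
from collections import Counter


def train_mle_ngram(tokens, n):
    """Return prob(history, token), the unsmoothed MLE n-gram estimate."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    ngram_counts = Counter(ngrams)
    # Assumption: a history is counted only when a token follows it,
    # so that the estimates for each seen history sum to one.
    history_counts = Counter(gram[:-1] for gram in ngrams)

    def prob(history, token):
        history = tuple(history)
        denominator = history_counts[history]
        if denominator == 0:
            return 0.0  # unseen history: the MLE is undefined, return 0 here
        return ngram_counts[history + (token,)] / denominator

    return prob


tokens = "the cat sat on the mat".split()
bigram = train_mle_ngram(tokens, 2)
print(bigram(["the"], "cat"))  # "the cat" occurs once, "the" starts two bigrams
```

Because unseen sequences receive a probability of zero under this estimate, some form of smoothing is needed before the model can score new text.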
To compute the character N-grams, we first extracted two separate lists of characters from the essay and speech files. Then, for each language within each of the essay and speech groups, we computed the character trigrams and 4-grams, smoothed using the additive smoothing method with α = 0.1.
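A minimal sketch of an additively smoothed character N-gram model is given below. The class name `CharNgramModel`, the `<unk>` placeholder for characters not seen in training, and the absence of padding at text boundaries are all assumptions made for illustration; they are not details of our implementation.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder for characters unseen in training


class CharNgramModel:
    """Character n-gram model with additive smoothing (illustrative only)."""

    def __init__(self, n, alpha=0.1):
        self.n = n
        self.alpha = alpha
        self.ngram_counts = Counter()
        self.history_counts = Counter()
        self.vocab = {UNK}

    def fit(self, texts):
        for text in texts:
            self.vocab.update(text)  # adds every character of the text
            for i in range(len(text) - self.n + 1):
                gram = text[i:i + self.n]
                self.ngram_counts[gram] += 1
                self.history_counts[gram[:-1]] += 1
        return self

    def prob(self, history, char):
        """Smoothed P(char | history), where history is a string of n-1 chars."""
        if char not in self.vocab:
            char = UNK
        numerator = self.ngram_counts[history + char] + self.alpha
        denominator = self.history_counts[history] + self.alpha * len(self.vocab)
        return numerator / denominator


trigram_model = CharNgramModel(n=3, alpha=0.1).fit(["abab"])
print(trigram_model.prob("ab", "a"))  # (1 + 0.1) / (1 + 0.1 * 3)
```

In this sketch one such model would be trained per language and per dataset (essay or speech), once with n = 3 and once with n = 4.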
In order to compute the word N-grams, two separate lists of words from the essay and speech files were extracted. These two lists were then limited to the words which were encountered more than once. Afterwards, we computed the word monograms and bigrams (considering out-of-vocabulary words), which were smoothed using the additive smoothing method with α = 0.01.
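The word-level step can be sketched along the same lines. In the sketch below, the function name `train_word_models`, the `<oov>` token, and the choice to map every word seen only once to that token are assumptions used for illustration; the exact treatment of out-of-vocabulary words in our system may differ.

```python
from collections import Counter

OOV = "<oov>"  # hypothetical out-of-vocabulary token


def train_word_models(tokens, alpha=0.01):
    """Return (vocab, p_uni, p_bi) with additive smoothing (illustrative only)."""
    counts = Counter(tokens)
    # Keep only words encountered more than once; everything else becomes OOV.
    vocab = {word for word, count in counts.items() if count > 1} | {OOV}
    mapped = [word if word in vocab else OOV for word in tokens]

    unigram_counts = Counter(mapped)
    bigram_counts = Counter(zip(mapped, mapped[1:]))
    history_counts = Counter(mapped[:-1])  # words that have a successor
    vocab_size = len(vocab)

    def normalize(word):
        return word if word in vocab else OOV

    def p_uni(word):
        return (unigram_counts[normalize(word)] + alpha) / (
            len(mapped) + alpha * vocab_size)

    def p_bi(previous, word):
        previous, word = normalize(previous), normalize(word)
        return (bigram_counts[(previous, word)] + alpha) / (
            history_counts[previous] + alpha * vocab_size)

    return vocab, p_uni, p_bi


vocab, p_uni, p_bi = train_word_models("a b a b c".split())
print(sorted(vocab))   # "c" occurs once, so it is treated as OOV
print(p_bi("a", "b"))  # (2 + 0.01) / (2 + 0.01 * 3)
```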
In order to predict the native language for a text file, considering it as an essay/speech transcription, we have to compute its probabilities using the character and word N-grams of essay/speech for each language. The character-level probabilities are computed using formulas 2 and 3, in which $\mathrm{Prob}_{l,c-N}(C)$ stands for the character-level probability of the text by the character N-gram for language $l$, $m$ is the number of characters in the text, $P_{l,c-3}(c_i \mid c_{i-2} c_{i-1})$ represents the character trigram probability of language $l$ for character $c_i$ given its two previous characters, and $P_{l,c-4}(c_i \mid c_{i-3} c_{i-2} c_{i-1})$ represents the character 4-gram probability of language $l$ for character $c_i$ given its three previous characters.
The word-level probabilities are computed using formulas 4 and 5, in which $\mathrm{Prob}_{l,w-N}(W)$ stands for the word-level probability of the text by the word N-gram for language $l$, $n$ is the number of words in the text, $P_{l,w-1}(w_i)$ represents the word monogram probability of language $l$ for word $w_i$, and $P_{l,w-2}(w_i \mid w_{i-1})$ represents the word bigram probability of language $l$ for word $w_i$ given its previous word.
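As a rough illustration of text-level scoring, the sketch below assumes that the score of a text is the sum of the log conditional N-gram probabilities over its positions. This form is an assumption and may differ from formulas 2 to 5; the function name `sequence_log_prob` and the constant toy model are hypothetical.

```python
import math


def sequence_log_prob(tokens, n, cond_prob):
    """Sum log P(token_i | previous n-1 tokens) over positions with a full history.

    `tokens` can be a string (characters) or a list of words, and
    `cond_prob(history, token)` must return a smoothed, non-zero probability.
    """
    total = 0.0
    for i in range(n - 1, len(tokens)):
        history = tuple(tokens[i - n + 1:i])
        total += math.log(cond_prob(history, tokens[i]))
    return total


def toy_model(history, token):
    # Toy stand-in for a trained model: every event has probability 0.5.
    return 0.5


short_score = sequence_log_prob("abcd", 3, toy_model)      # 2 scored positions
long_score = sequence_log_prob("abcdefgh", 3, toy_model)   # 6 scored positions
print(short_score, long_score)
```

Under this assumed form, a longer text accumulates more terms and therefore a lower total score, which illustrates how the length of a file can affect its probabilities.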
For the character-level probability, we used the 4-gram probability to predict the language of an essay file, while for speech files we used the sum of the trigram and 4-gram character probabilities. For both essay and speech files, we used the sum of the word-level monogram and bigram probabilities. These N-gram combinations were chosen because they achieved the best results on the dev dataset when the system was trained on the train dataset.
In order to compute the final probability of a text file for each language, we added the character-level and word-level probabilities together. The language with the highest probability was chosen as the predicted language for the text. To test our system on the test dataset, we trained it using both the train and dev datasets.
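The combination and decision steps described above can be sketched as follows. The function names, the dictionary layout, and all numeric scores are hypothetical illustrations rather than values from our experiments; the scores are assumed to be text-level probabilities (for example in log form) already computed for each language.

```python
def text_score(kind, char3, char4, word1, word2):
    """Combine per-model scores for one language and one text."""
    if kind == "essay":
        char_part = char4              # essays: character 4-gram only
    elif kind == "speech":
        char_part = char3 + char4      # speech: trigram plus 4-gram
    else:
        raise ValueError(f"unknown text kind: {kind}")
    return char_part + word1 + word2   # word monogram plus bigram in both cases


def predict_language(kind, scores_by_language):
    """Return the language with the highest combined score, and all totals."""
    totals = {
        language: text_score(kind, **parts)
        for language, parts in scores_by_language.items()
    }
    return max(totals, key=totals.get), totals


# Made-up scores for two languages, purely for illustration.
scores = {
    "HIN": dict(char3=-10.0, char4=-12.0, word1=-8.0, word2=-9.0),
    "TEL": dict(char3=-20.0, char4=-11.0, word1=-8.0, word2=-9.0),
}
print(predict_language("essay", scores)[0])   # trigram score is ignored
print(predict_language("speech", scores)[0])  # trigram score is included
```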

Results
In the first test phase, we achieved a macro F1-score of 0.7609 and an overall accuracy of 0.7636 on the Essay track, and a macro F1-score of 0.4530 and an overall accuracy of 0.4536 on the Speech track. Tables 1 and 2 show our method's performance on each class, and Figures 1 and 2 show the confusion matrices from the first test phase.
In the second test phase, we tested our system on the essay dataset, the speech dataset, and the fusion of both. Table 3 shows the results achieved in each test; the best result was achieved in the fusion test. Table 4 shows our method's performance on each class, and Figure 3 shows the confusion matrix from the fusion result in the second test phase.
All results reported in this section were officially submitted as part of the NLI Shared Task 2017.
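For readers less familiar with the two metrics reported above, the sketch below shows how a macro F1-score and an overall accuracy can be derived from a confusion matrix. It assumes that the macro F1-score is the unweighted mean of the per-class F1-scores; the function name and the 2-class matrix are hypothetical and unrelated to our actual results.

```python
def macro_f1_and_accuracy(confusion):
    """confusion[i][j] = number of items of true class i predicted as class j."""
    num_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(num_classes))

    f1_scores = []
    for i in range(num_classes):
        true_positive = confusion[i][i]
        false_positive = sum(confusion[r][i] for r in range(num_classes)) - true_positive
        false_negative = sum(confusion[i]) - true_positive
        denominator = 2 * true_positive + false_positive + false_negative
        f1_scores.append(2 * true_positive / denominator if denominator else 0.0)

    return sum(f1_scores) / num_classes, correct / total


# Toy 2-class confusion matrix, for illustration only.
macro_f1, accuracy = macro_f1_and_accuracy([[8, 2], [1, 9]])
print(macro_f1, accuracy)
```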

Discussion
First of all, it is worth mentioning that all the results reported in this paper were achieved without the use of i-vectors. Therefore, our results are compared with the baseline results only for essays, speech (transcriptions only), and the fusion of essays and speech transcriptions.
Our method is useful for native language identification of essays, where it outperforms the baseline F1-score of 0.710. However, it does not perform well on speech transcriptions (whose baseline F1-score is 0.544) or, as a result, on the fusion of essays and transcriptions (with a baseline F1-score of 0.779). A possible reason is that file lengths vary much more in the speech transcriptions than in the essay files. Since the length of a file affects its probabilities in our method, this variation may degrade the predictions.
As is evident in Figures 1 to 3, most of the performance reduction was due to difficulties in telling Telugu and Hindi apart. Figure 2 shows that, in the speech track, these two languages were very often mistaken for each other; however, Figures 1 and 3 show that in the essay and fusion tracks, Hindi was detected more accurately, while Telugu was often labeled as Hindi.
An interesting point worth mentioning is that, although our method did not yield a decent performance on the speech dataset, it achieved its best performance when applied to the combination of both essay and speech files in the fusion phase. As explained in Section 3, our method is a rather simple one compared to SVMs and artificial neural networks. The combination of character N-grams and word N-grams used in our method is purely experimental, and does not rest on a strong mathematical basis.
Table 2: Per Class Performance for the Speech Track
All that being said, our method could still be used in combination with a form of supervised learning, in order to be more effective and achieve a decent accuracy rate.