Discriminating between Similar Languages using Weighted Subword Features

The present contribution revolves around a contrastive subword n-gram model which has been tested in the Discriminating between Similar Languages shared task. I present and discuss the method used in this 14-way language identification task comprising varieties of 6 main language groups. It features the following characteristics: (1) the preprocessing and conversion of a collection of documents to sparse features; (2) weighted character n-gram profiles; (3) a multinomial Bayesian classifier. Meaningful bag-of-n-grams features can be used as a system in a straightforward way: my approach outperforms most of the systems used in the DSL shared task (3rd rank).


Introduction
Language identification is the task of predicting the language(s) that a given document is written in. It can be seen as a text categorization task in which documents are assigned to pre-existing categories. This research field found renewed interest in the 1990s due to advances in statistical approaches, and it has been active ever since, particularly because the methods developed have also been deemed relevant for text categorization, native language identification, authorship attribution, text-based geolocation, and dialectal studies (Lui and Cook, 2013).
As of 2014 and the first Discriminating between Similar Languages (DSL) shared task, a unified dataset comprising news texts of closely-related language varieties has been used to test and benchmark systems. The documents to be classified are quite short and may even be difficult to distinguish for human annotators, thus adding to the difficulty and the interest of the task. A second shared task took place in 2015 (Zampieri et al., 2015). An analysis of recent developments can be found in Goutte et al. (2016) as well as in the report on the third shared task.
Not all varieties are to be considered equally since differences may stem from extra-linguistic factors. It is for instance assumed that Malay and Indonesian derive from a millennium-old lingua franca, so that shorter texts have been considered to be a problem for language identification (Bali, 2006). Besides, the Bosnian/Serbian language pair seems to be difficult to tell apart whereas Croatian distinguishes itself from the two other varieties mostly because of political motives (Ljubešić et al., 2007; Tiedemann and Ljubešić, 2012).
The remainder of this paper is organized as follows: in section 2 the method is presented; it is then evaluated and discussed in section 3.

Preprocessing
Preliminary tests have shown that adding a custom linguistic preprocessing step could slightly improve the results. As such, instances are tokenized using the SoMaJo tokenizer (Proisl and Uhrig, 2016), which achieves state-of-the-art accuracies on both web and CMC data for German. As it is rule-based, it is deemed efficient enough for the languages of the shared task. No stop words are used since relevant cues are expected to be found automatically as explained below. Additionally, the text is converted to lowercase as it led to better results during the development phase on 2016 data.

Bag of n-grams approach
Statistical indicators such as character- and token-based language models have proven to be efficient on short text samples, especially character n-gram frequency profiles from length 1 to 5, whose interest is (inter alia) to perform indirect word stemming (Cavnar and Trenkle, 1994). In the context of the shared task, a simple approach using n-gram features and discriminative classification achieved competitive results (Purver, 2014). Although features relying on the output of instruments may yield useful information such as POS features (Zampieri et al., 2013), the diversity of the languages to classify as well as the prevalence of statistical methods call for low-resource methods that can be trained and applied easily.
In view of this, I document work on a refined version of the Bayesline which has been referenced in the last shared task (Barbaresi, 2016a) and which has now been used in the official competition. After looking for linguistically relevant subword methods to overcome data sparsity (Barbaresi, 2016b), it became clear that taking frequency effects into consideration is paramount. As a consequence, the present method is grounded in a bag-of-n-grams approach. It first proceeds by constructing a dictionary representation which is used to map words to indices. After turning the language samples into numerical feature vectors (a process also known as vectorization), the documents can be treated as a sparse matrix (one row per document, one column per n-gram).
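The dictionary and vectorization steps can be illustrated with a minimal sketch in plain Python. It is not the implementation used in the shared task: the function names, the n-gram lengths (2 to 3, chosen for brevity) and the toy documents are assumptions made for illustration only.

```python
from collections import Counter


def char_ngrams(text, n_min=2, n_max=3):
    """Return all character n-grams of length n_min..n_max of a lowercased text."""
    text = text.lower()
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def build_vocabulary(documents, n_min=2, n_max=3):
    """Map every n-gram seen in the collection to a column index."""
    vocabulary = {}
    for doc in documents:
        for gram in char_ngrams(doc, n_min, n_max):
            # the next free index is assigned the first time an n-gram is seen
            vocabulary.setdefault(gram, len(vocabulary))
    return vocabulary


def vectorize(documents, vocabulary, n_min=2, n_max=3):
    """Return one sparse row per document as {column index: count}."""
    rows = []
    for doc in documents:
        counts = Counter(char_ngrams(doc, n_min, n_max))
        # n-grams absent from the vocabulary are ignored
        rows.append({vocabulary[g]: c for g, c in counts.items() if g in vocabulary})
    return rows


# Hypothetical toy documents, not shared task data
docs = ["Saya suka", "suka"]
vocab = build_vocabulary(docs)
matrix = vectorize(docs, vocab)
```

Each row only stores the columns that are actually non-zero, which is what makes the matrix sparse.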
Higher-order n-grams mentioned in the development tests below use feature hashing, also known as the "hashing trick" (Weinberger et al., 2009), where words are directly mapped to indices with a hashing function, thus sparing memory. The upper bound on the number of features has been fixed to 2^24 in the experiments below.
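The idea behind feature hashing can be sketched as follows. The sketch uses CRC32 from the Python standard library as a deterministic stand-in hash; this choice, like the function names, is an assumption made for illustration and does not correspond to the hash function of any particular toolkit.

```python
import zlib

N_FEATURES = 2 ** 24  # upper bound on the number of columns, as in the text


def hashed_index(ngram, n_features=N_FEATURES):
    """Map an n-gram straight to a column index, without storing a vocabulary."""
    # CRC32 is a stand-in chosen for determinism (assumption of this sketch)
    return zlib.crc32(ngram.encode("utf-8")) % n_features


def hash_vectorize(text, n=3, n_features=N_FEATURES):
    """Return a sparse row {column index: count} for the character n-grams of text."""
    row = {}
    for i in range(len(text) - n + 1):
        col = hashed_index(text[i:i + n], n_features)
        # distinct n-grams that collide simply add up in the same column
        row[col] = row.get(col, 0) + 1
    return row
```

No dictionary has to be kept in memory, at the price of possible collisions: with a small `n_features`, several n-grams end up sharing a column.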

Term-weighting
The next step consists of counting and normalizing, which implies giving diminishing weight to tokens that occur in the majority of samples. The concept of term-weighting originates from the field of information retrieval (Luhn, 1957; Sparck Jones, 1972). The whole operation is performed using existing implementations by the scikit-learn toolkit (Pedregosa et al., 2011), which features an adapted version of the tf-idf (term-frequency/inverse document-frequency) term-weighting formula (footnote 1). Smooth idf weights are obtained by systematically adding one to document frequencies, as if an extra document had been seen containing every term in the collection exactly once, which prevents zero divisions.
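The smoothing can be made concrete with a short sketch. It assumes the variant documented by scikit-learn for its smooth idf option, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by scaling each row to unit Euclidean length; the function names are hypothetical.

```python
import math


def smooth_idf(n_documents, document_frequency):
    """Smoothed idf: one is added to both counts, then one to the logarithm."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1


def tfidf_row(counts, idf):
    """Weight raw counts by idf and scale the row to unit Euclidean length."""
    weighted = {term: tf * idf[term] for term, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weighted.values()))
    return {term: w / norm for term, w in weighted.items()} if norm else weighted
```

A term that occurs in every document receives the minimal weight of 1, while a document frequency of zero no longer leads to a division by zero.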

Naive Bayes classifier
The classifier used entails a conditional probability model where events represent the occurrence of an n-gram in a single document. In this context, a multinomial Bayesian classifier assigns a probability to each target language during the test phase. It has been shown that Naive Bayes classifiers are more than mere baselines for text classification tasks. They can compete with state-of-the-art classification algorithms such as support vector machines, especially when using appropriate preprocessing concerning the distribution of event frequencies (Rennie et al., 2003); additionally, they are robust enough for the task at hand, as their decisions may be correct even if their probability estimates are inaccurate (Rish, 2001).
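A multinomial Naive Bayes classifier over sparse rows can be sketched in a few lines. This is a didactic version with additive smoothing, not the scikit-learn implementation used in the experiments; the function names, the smoothing value and the toy data are assumptions.

```python
import math
from collections import defaultdict


def train_multinomial_nb(rows, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from sparse rows."""
    class_docs = defaultdict(int)
    feature_sums = defaultdict(lambda: defaultdict(float))
    vocabulary = set()
    for row, label in zip(rows, labels):
        class_docs[label] += 1
        for feature, value in row.items():
            feature_sums[label][feature] += value
            vocabulary.add(feature)
    model = {}
    for label, n_docs in class_docs.items():
        # additive smoothing: alpha is added to every feature of the vocabulary
        total = sum(feature_sums[label].values()) + alpha * len(vocabulary)
        model[label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "log_likelihood": {
                f: math.log((feature_sums[label].get(f, 0.0) + alpha) / total)
                for f in vocabulary
            },
        }
    return model


def predict(model, row):
    """Return the class with the highest posterior log score for one sparse row."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for feature, value in row.items():
            # features never seen during training are skipped
            if feature in params["log_likelihood"]:
                score += value * params["log_likelihood"][feature]
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy data: two classes, one document each
model = train_multinomial_nb(
    [{"ab": 2, "bc": 1}, {"xy": 1, "yz": 2}], ["A", "B"]
)
```

Only the ranking of the scores matters for the decision, which is why the classifier can be right even when the underlying probability estimates are poor.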

"Bayesline" formula
The Bayesline formula used in the shared task builds on existing code (footnote 2) and takes advantage of a comparable feature extraction technique and of a similar Bayesian classifier. The improvements described here concern the preprocessing phase, the vector representation, and the parameters of classification. Character n-grams from length 2 to 7 are taken into account (footnote 3).
1 http://scikit-learn.org/stable/modules/feature_extraction.html
2 https://github.com/alvations/bayesline
3 TfidfVectorizer(analyzer='char', ngram_range=(2,7), strip_accents=None, lowercase=True) followed by MultinomialNB(…)

Evaluation

Data from the third edition
In order to justify the choice of the formula, experiments have been conducted on data from the third edition of the DSL shared task; training and development sets have been combined as training data, and gold data used for evaluation. The method described above has been tested with several n-gram ranges; the results are summarized in Table 1. The best combinations were found with a minimum n-gram length of 1 to 3 and a maximum n-gram length of 6 to 8. Accordingly, an aurea mediocritas (golden mean) from 2 to 7 has been chosen. Table 2 shows the extraction, training, and testing times for n-gram lengths with a minimum of 2. One can conclude that the method is computationally efficient on the shared task data. Execution with feature hashing is necessary for higher-order n-grams due to memory constraints; it effectively improves scalability but it also seems to trade accuracy for computational efficiency, probably due to the upper bound on the number of features and/or hash collisions.
Table 3 documents the efficiency and accuracy of several algorithms on the classification task, without extensive parameter selection. The Ridge (Rifkin and Lippert, 2007) and Naive Bayes classifiers would have outperformed the best submission of the 2016 competition (0.894) with scores of 0.895 and 0.902 respectively, while the Passive-Aggressive (Crammer et al., 2006) and Linear Support Vector (Fan et al., 2008) classifiers would have been ranked second with a score of 0.892. It is noteworthy that the Naive Bayes classifier would still have performed best without taking the development data into consideration (accuracy of 0.898).
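The comparison of n-gram ranges described in this section can be sketched with scikit-learn as follows. The vectorizer settings follow the ones given for the Bayesline formula; everything else is an assumption made for illustration: the toy sentences and labels stand in for the shared task data, the classifier is left at its default parameters, and the helper name is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder data standing in for the shared task sets (assumption)
train_texts = [
    "saya tidak bisa datang karena sakit",
    "saya tidak boleh datang kerana sakit",
]
train_labels = ["id", "my"]
test_texts = ["dia tidak bisa datang", "dia tidak boleh datang"]
test_labels = ["id", "my"]


def evaluate_range(n_min, n_max):
    """Train on the toy data with one n-gram range and return the test accuracy."""
    model = make_pipeline(
        TfidfVectorizer(
            analyzer="char",
            ngram_range=(n_min, n_max),
            strip_accents=None,
            lowercase=True,
        ),
        MultinomialNB(),  # default parameters, an assumption of this sketch
    )
    model.fit(train_texts, train_labels)
    return model.score(test_texts, test_labels)


# Minimum lengths 1 to 3 and maximum lengths 6 to 8, as in the experiments
results = {
    (n_min, n_max): evaluate_range(n_min, n_max)
    for n_min in (1, 2, 3)
    for n_max in (6, 7, 8)
}
```

On data of realistic size, the same loop would also be the place to record extraction, training, and testing times.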

Data from the fourth edition
As expected, the method performed well on the fourth shared task, as it reached the 3rd place out of 11 teams (with an accuracy of 0.925 and a weighted F1 of 0.925). In terms of statistical significance, it was among the systems ranked first by the organizers. The official baseline/Bayesline used a comparable algorithm with lower results (accuracy and weighted F1 of 0.889).
The confusion matrix in Figure 1 details the results. Three-way classifications between the variants of Spanish and within the Bosnian-Croatian-Serbian complex still leave room for improvement, although Peruvian Spanish does not seem to be as noisy as the Mexican Spanish data from the last edition. The F-score on variants of Persian is fairly high (0.960), which suggests that the method can be applied to a wide range of alphabets.
The same method has been tested without preprocessing on a new task, the identification of Swiss German dialects (GDI shared task). The low result (second to last with an accuracy of 0.627 and a weighted F1 of 0.606) can be explained by the lack of adaptation, most notably to the presence of much shorter instances. The classification of the Lucerne variant is particularly problematic and calls for tailored solutions.

Conclusion
The present contribution revolves around a contrastive subword n-gram model which has been tested in the Discriminating between Similar Languages shared task. It features the following characteristics: (1) the preprocessing and conversion of a collection of documents to sparse features; (2) weighted character n-gram profiles; (3) a multinomial Bayesian classifier, hence the name "Bayesline". Meaningful bag-of-n-grams features can be used as a system in a straightforward way. In fact, my method outperforms most of the systems used in the DSL shared task.
Thus, I propose a new baseline and make the necessary components available under an open source licence (footnote 4). The efficiency of the Bayesline as well as the difficulty of reaching higher scores in open training could be explained by artificial regularities in the test data. For instance, the results for the Dari/Iranian Persian and Malay/Indonesian pairs are striking: these clear distinctions do not reflect the known commonalities between these language varieties. This could be an artifact of the data, which feature standard language of a different nature than the continuum "in the field", that is between two countries as well as within a single country. The conflict between in-vitro and real-world language identification has already been emphasized in the past (Baldwin and Lui, 2010); it calls for the inclusion of web texts (Barbaresi, 2016c) into the existing task reference.
4 https://github.com/adbar/vardial-experiments