When Sparse Traditional Models Outperform Dense Neural Networks: the Curious Case of Discriminating between Similar Languages

We present the results of our participation in the VarDial 4 shared task on discriminating between closely related languages. Our submission includes simple traditional models using linear support vector machines (SVMs) and a neural network (NN). The main idea was to leverage language group information. We did so with a two-layer approach in the traditional model and with a multi-task objective in the neural network. Our results confirm earlier findings: simple traditional models consistently outperform neural networks on this task, at least given the number of systems we could examine in the available time. Our two-layer linear SVM ranked 2nd in the shared task.


Introduction
The problem of automatic language identification has been a popular task for at least the last 25 years. From early on, different solutions achieved very high results (Cavnar et al., 1994; Dunning, 1994), while more recent models achieve near-perfect accuracies.
Distinguishing closely related languages, however, still remains a challenge. The Discriminating between Similar Languages (DSL) shared task (Zampieri et al., 2017) is aimed at solving this problem. For this year's task our team (mm lct) built a model that discriminates between 14 languages or language varieties across 6 language groups (each containing two or three languages or language varieties). 1 The most popular of the more recent systems, such as langid.py (Lui and Baldwin, 2012) and CLD/CLD2, 2 produce very good results on datasets containing fewer than 100 languages, but even a model trained on as many as 131 languages (Kocmi and Bojar, 2017), or whatlang (Brown, 2013), trained on 184 and 1,100 languages, are not able to distinguish closely related (and therefore very similar) languages and dialects to a satisfying degree, at least not with the data available.
As part of the DSL 2017 shared task we chose to further explore traditional linear approaches as well as deep learning methods. In the next section we briefly discuss previous approaches to the task of discriminating between similar languages. In Section 3 we then describe our systems and the data, followed by the results in Section 4, which are discussed in Section 5. We conclude in Section 6.

Related Work
Even though a number of studies on dialect identification have been conducted (Tiedemann and Ljubešić, 2012; Lui and Cook, 2013; Maier and Gómez-Rodriguez, 2014; Ljubešić and Kranjcic, 2015, among many others), they mostly deal with particular language groups or language varieties. Our goal was to create a language identifier that produces comparable results for the languages within all provided groups using the same set of features for every language group, so that it can be expanded beyond the languages provided by the DSL shared task with no changes other than to the training corpus, making the system as language-independent and universal as possible.
The overviews of the previous DSL shared tasks (Zampieri et al., 2015; Goutte et al., 2016) showed that SVMs always produce some of the top results in this task, especially when tested on same-domain datasets (Çöltekin and Rama, 2016). Thus, we chose to put our efforts into improving upon SVM approaches, but still decided to experiment with a neural network to see whether we could get comparable results while using fewer features and reducing the chance of overfitting.
The popularity of using NNs for NLP tasks is growing, and a few neural language identifiers already exist (Tian and Suontausta, 2003; Takçi and Ekinci, 2012; Simões et al., 2014, among others); however, on average, traditional systems still seem to outperform them. The results of the DSL 2016 shared task also show the same overall tendency (Bjerva, 2016; Cianflone and Kosseim, 2016; Çöltekin and Rama, 2016; Malmasi et al., 2016).

Methodology and Data
In this section, we first describe the datasets that were provided for the DSL 2017 shared task. We then describe the three systems we used to tackle the problem: first a two-layer SVM that uses language-group classification, then a single-layer SVM that does not use grouping, and finally a neural network-based approach.

Data
This year's data is a new version of the DSL Corpus Collection (DSLCC), again with 18,000 instances for training and 2,000 instances for development. The test data consists of 1,000 instances per language and contains the same languages as the training and development data. The test data is furthermore very similar to the development data, as supported by the results: performance during development was almost the same as performance on the test set. All instances come from short newspaper texts.
However, whereas last year's version of the DSLCC contained Mexican Spanish, this year's version has Peruvian Spanish (es-PE). Another new addition is the Farsi language group, with the two variations Persian (fa-IR) and Dari (fa-AF). Thus, this year's version contains 14 languages belonging to 6 groups:
• BCS: containing Bosnian, Croatian and Serbian;
• Spanish: containing Argentine, Peninsular and Peruvian varieties;
• Farsi: containing Afghan Farsi (or Dari) and Iranian Farsi (or Persian);
• French: containing Canadian and Hexagonal varieties;
• Indonesian and Malay; and
• Portuguese: containing Brazilian and European varieties.
An overview of the data is given in Table 1, which includes the number of instances as well as the number of tokens for each language in the training and development data.
In the final submissions we performed no preprocessing on the data. During development we explored the usefulness of lowercasing all characters, using placeholders for numbers, and removing punctuation, but we found that these decreased the performance of the system.
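The preprocessing variants we explored (and ultimately rejected) can be sketched as follows; the exact placeholder token and the order of the steps are illustrative assumptions, not our exact setup.

```python
import re
import string

def preprocess(text):
    """Sketch of the explored preprocessing: lowercasing,
    punctuation removal, and a placeholder for numbers.
    The 'NUM' token is a hypothetical choice."""
    text = text.lower()
    # Strip punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Replace each run of digits with a single placeholder.
    text = re.sub(r"\d+", "NUM", text)
    return text
```

Since the character n-gram features include whitespace and punctuation, removing these signals plausibly explains the performance drop we observed.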
Finally, for the final submission we trained all our runs on the combination of the training and development datasets, which was shown to be effective by last year's winning team (Çöltekin and Rama, 2016).

Run 1 -SVM with grouping
For our first and most promising run we developed and submitted a two-layer classifier, which first predicts for each instance which language group it belongs to, and then classifies the specific language within the predicted language group. This method has been used by DSL participants before (Franco-Salvador et al., 2015; Nisioi et al., 2016) and has been shown to have a positive impact on performance. Adopting this method, we built a combination of SVMs with linear kernels.
The first SVM decides on the language group to which the instance belongs. As features it uses character uni- to 6-grams (including whitespace and punctuation characters) weighted by tf-idf. 3 While testing it on the development set it appeared to be very reliable: all instances misclassified at the group level contained only names and digits and would therefore be impossible for a human to classify either. The second SVM predicts the specific languages within each group (with the same feature parameters for every group), using word uni- and bigrams in combination with character n-grams of up to 6 characters, likewise weighted by tf-idf. Figure 1a shows that when trained on a subset of 100,000 randomly selected instances of the training data (while keeping the language distribution the same), the best accuracy is achieved with character n-grams from 1 to 6 characters and no word n-grams. However, when we trained and tested this configuration on the DSL 2016 data, it scored lower than the winning team (on the in-domain test set). We therefore chose a different feature set by adding word unigrams and bigrams, which gave us a slight advantage over last year's results. This did reduce performance on this year's development set, but the reduction was so minimal that we deemed it unlikely to be significant (accuracies of 0.90296 without word n-grams vs. 0.90206 with word uni- and bigrams), especially considering that the difference between the accuracies becomes smaller as more training data is available.

3 The formula used to compute tf-idf, as defined by the scikit-learn Python package, is tf-idf(d, t) = tf(t) * idf(d, t), where idf(d, t) = log(n/df(d, t)) + 1, n is the total number of documents, and df(d, t) is the document frequency, i.e. the number of documents d that contain term t (Pedregosa et al., 2011).
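The footnoted idf definition can be checked against scikit-learn directly; note that scikit-learn smooths idf by default, so smoothing must be disabled to match the formula. The toy documents below are illustrative.

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

# Manual idf following the footnote: idf(d, t) = log(n / df(d, t)) + 1.
def idf(term, docs):
    df = sum(term in d.split() for d in docs)  # document frequency of the term
    return math.log(len(docs) / df) + 1

docs = ["apple banana", "apple cherry", "apple date", "banana cherry"]

# smooth_idf=False makes scikit-learn use exactly the footnote's formula.
vec = TfidfVectorizer(smooth_idf=False).fit(docs)
sk_idf = vec.idf_[vec.vocabulary_["date"]]  # "date" appears in 1 of 4 documents
```

For "date", df = 1 and n = 4, so both computations yield log(4) + 1.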
Fine-tuning the second SVM for particular language groups seemed to defeat the goal of developing a language-independent classifier: retraining on other languages would not have been possible without largely adjusting the system.
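A minimal sketch of the two-layer setup, using scikit-learn; the class and function names are our own and the SVM hyperparameters are left at their defaults, which are assumptions rather than our exact configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.svm import LinearSVC

def group_clf():
    # First layer: character uni- to 6-grams, tf-idf weighted.
    return make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(1, 6)),
        LinearSVC())

def lang_clf():
    # Second layer: word uni-/bigrams plus character uni- to 6-grams.
    return make_pipeline(
        make_union(
            TfidfVectorizer(analyzer="word", ngram_range=(1, 2)),
            TfidfVectorizer(analyzer="char", ngram_range=(1, 6))),
        LinearSVC())

class TwoLayerIdentifier:
    def fit(self, texts, langs, groups):
        self.group_model = group_clf().fit(texts, groups)
        # One within-group classifier per language group,
        # each fitted on only that group's data.
        self.lang_models = {}
        for g in set(groups):
            idx = [i for i, gr in enumerate(groups) if gr == g]
            self.lang_models[g] = lang_clf().fit(
                [texts[i] for i in idx], [langs[i] for i in idx])
        return self

    def predict(self, texts):
        # Route each instance to the classifier of its predicted group.
        return [self.lang_models[g].predict([t])[0]
                for t, g in zip(texts, self.group_model.predict(texts))]
```

Each within-group classifier sees only about a sixth of the data, which is also why the two-layer system needs less memory at once.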

Run 2 -SVM without grouping
For the second run we submitted a single system: a linear-kernel SVM that does not perform language-group classification first but classifies languages straight away. When exploring different combinations of word and character n-grams, we trained the system on the same 100,000 instances and found that the highest results were achieved with a combination of word uni- and bigrams and character uni- to 6-grams (see Figure 1b). Thus, for this run we use the same parameters as the within-group classifier of run 1.
When trained on this year's full training set and tested on the development set, this system performs slightly better than the two-layer system (though likely insignificantly so, with a less than 0.1% difference in accuracy).

Figure 1: Visualisation of the differences in accuracy with changing maximum lengths of word and character n-grams, trained on 100,000 instances of training data and tested on the development dataset. Where n-grams are 0, that n-gram type was turned off; the lower left corner is therefore the random baseline. (a) shows the accuracies for the SVM with grouping, (b) for the SVM without grouping.

Run 3 -CBOW multi-task NN
We also experimented with NNs, in particular an NN with a multi-task objective. The idea was to take advantage of language group information to guide learning. This represents a complementary approach to run 1.
Our preliminary experiments confirmed earlier findings that NN-based approaches are outperformed by simpler linear models for language identification (Çöltekin and Rama, 2016; Gamallo et al., 2016). We compared recurrent NNs to simpler models based on continuous bag-of-words (CBOW) representations (Mikolov et al., 2013), which are similar to feedforward NNs and simply take the mean vector of the input embeddings as the input representation. The CBOW models were not only quicker to train, they also outperformed their RNN/LSTM counterparts, and thus formed our final submission.
In particular, run 3 is a simple CBOW NN with two output layers: the first predicting the actual language identity, the second predicting the language group. The training objective is to minimise the cross-entropy loss on language identification (L1) and language group identification (L2), the latter weighted by λ, which was set on the development set using a subset of 10,000 instances. The joint training objective was: L = L1 + λ · L2. As input features it uses embeddings of character uni- to 5-grams, which outperform simple word input alone. We observed that the multi-task objective sped up learning, although ultimately the difference between the MTL model and its non-MTL counterpart was minor. We submitted the MTL model as our final run. It was trained on the combined training and development data without any preprocessing, so as to make it comparable to our SVM submissions.
Note that due to time constraints we could not fully explore many directions here, such as the feature space, hyperparameters, or alternative models, but overall NNs seemed less promising for this task.
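The CBOW multi-task setup can be sketched in NumPy as follows: a mean n-gram embedding feeds two softmax heads, and the gradients of the joint loss L = L1 + λ·L2 flow back into both heads and the embeddings. All dimensions, the learning rate, and λ are illustrative assumptions, not our actual settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class CbowMtl:
    def __init__(self, vocab, dim, n_langs, n_groups, lam=0.5, lr=0.1):
        self.E = rng.normal(0, 0.1, (vocab, dim))     # n-gram embeddings
        self.Wl = rng.normal(0, 0.1, (dim, n_langs))  # language head
        self.Wg = rng.normal(0, 0.1, (dim, n_groups)) # language-group head
        self.lam, self.lr = lam, lr

    def loss(self, ngram_ids, lang, group):
        h = self.E[ngram_ids].mean(axis=0)            # CBOW: mean embedding
        pl, pg = softmax(h @ self.Wl), softmax(h @ self.Wg)
        # Joint objective: L = L1 + lambda * L2 (both cross-entropy).
        return -np.log(pl[lang]) - self.lam * np.log(pg[group]), h, pl, pg

    def step(self, ngram_ids, lang, group):
        # One SGD step; assumes ngram_ids contains no duplicates.
        L, h, pl, pg = self.loss(ngram_ids, lang, group)
        dl, dg = pl.copy(), pg.copy()
        dl[lang] -= 1.0                               # softmax + CE gradient
        dg[group] -= 1.0
        dh = self.Wl @ dl + self.lam * (self.Wg @ dg)
        self.Wl -= self.lr * np.outer(h, dl)
        self.Wg -= self.lr * self.lam * np.outer(h, dg)
        self.E[ngram_ids] -= self.lr * dh / len(ngram_ids)
        return L
```

Dropping the group head (λ = 0) yields the non-MTL counterpart referred to above.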

Results
Based on absolute scores, our first system (the SVM with grouping) performed second best in the DSL shared task (Zampieri et al., 2017) with an accuracy of 0.9254. Both our other systems also performed substantially above the random baseline of 0.0714: accuracies of 0.9236 and 0.8997 for the SVM without grouping and the NN, respectively. See Table 2 for an overview of the accuracies and F1-scores of our three systems.

Table 2: Accuracies and F1-scores (micro, macro and weighted) for the three systems, along with the random baseline.

Table 3 presents the confusion matrix for the SVM with grouping. Out-of-group confusions, which are very rare in general in all three runs, occur notably less often with the SVM with grouping (only 2.2% of the confusions it makes are out-of-group confusions) than with the other runs. This is to be expected, as the SVM with grouping is designed to first group instances of the same language group together and then to discriminate between the particular language varieties within the groups. Within-group confusions also occur relatively less often with the SVM with grouping: in all groups except French, its accuracy is higher than that of the SVM without grouping, while the NN has notably lower accuracies for all groups (see Table 4).

Overall, the fewest within-group confusions occurred in the Indonesian-Malay group; the most mistakes were made in the BCS group. This is also supported by the accuracies. These values, though, do not necessarily support claims that Bosnian, Serbian and Croatian must then be more alike than, e.g., Indonesian and Malay are: differences in the amount of training data or in the quality of the data may make the results incomparable. Also, the language groups that contain three languages perform, as expected, worse overall than the groups with two languages.

Table 4: Accuracies for all language groups for the first SVM (with grouping), the second SVM (without grouping) and the NN.
Another striking aspect of the confusion matrix is that, in the BCS group, Bosnian is confused more often than Croatian or Serbian, while Serbian and Croatian are rarely confused with each other. This suggests that, in a gradual transition between Croatian and Serbian, Bosnian is somewhere in the middle. A similar gradual transition does not seem to exist for the Spanish varieties (as supported by the confusion matrix). This is also supported by the fact that Bosnian, of all 14 languages, performs the worst in terms of both precision and recall (F1 = 0.79). Indonesian and Malay perform the best, both with an almost perfect F1 = 0.99. A full report of language-specific performance for the SVM with grouping can be found in Table 5.

Discussion
We presented our approaches to tackling the problem of discriminating between similar languages and dialects. The SVM that first groups instances by language group, using word uni- and bigrams and character uni- to 6-grams as features, works best by a very small margin; in the DSL shared task it placed second in absolute F1-scores, also by a small margin. The margin between our two SVMs, though, is so small that it might not even be statistically significant. 4 However, although grouping does not really improve the performance of the system, it does make the model noticeably faster. This is because, with grouping, the system requires less memory at once: it fits the data for only one language group at a time, which is only about a sixth of the total data (in this dataset), depending on the group. It processes the full dataset only once, when grouping the instances into language groups, and from then on uses fewer features.
As expected, the SVMs do perform notably better than the deep-learning approach we tried. However, even though our NN uses simple CBOW representations, it still places rather well among the other systems. Figure 1a suggests that the two-layer SVM approach might perform slightly better when using no word n-grams at all. Although we decided against such a system, it would be interesting to see what impact removing word n-grams from the two-layer SVM feature set would have on performance. It would also be interesting to see whether having only longer n-grams (e.g. only 3- to 5-character n-grams) or only combinations of particular lengths would improve the results.

Conclusions
Discriminating between similar languages is still not a fully solved problem: no known system reaches perfect performance. The models presented in this paper once again confirm that traditional models, such as SVMs, perform better on this task than deep learning techniques. We also showed that a two-layer approach, in which languages are first classified by language group, barely improves performance; yet, in our experience, it speeds up the system significantly.