A Perplexity-Based Method for Similar Languages Discrimination

This article describes the system submitted by the Citius_Ixa_Imaxin team to VarDial 2017 (DSL and GDI tasks). The strategy underlying our system is based on a language distance computed by means of model perplexity. The best model configuration we tested is a voting system that combines several n-gram models of both words and characters, although word unigrams alone turned out to be a very competitive model, with reasonable results in the tasks in which we participated. An error analysis identified many test examples with no linguistic evidence to distinguish among the variants.


Introduction
Language detection is not a solved problem when the task is to identify similar languages and varieties. Closely related languages or language varieties are much more difficult to identify and separate than languages belonging to different linguistic families. In this article, we describe the system submitted by the Citius_Ixa_Imaxin team to VarDial 2017. We participated in two tasks: Discriminating between Similar Languages (DSL) and German Dialect Identification (GDI). The strategy underlying our system is based on comparing language models using perplexity. Perplexity is defined as the inverse probability of the test text given the model. Most of the best systems for language identification use probability-based metrics with n-gram models. The shared task report (Zampieri et al., 2017) describes the tasks and compares all the submitted systems.
DSL is focused on discriminating between similar languages and national language varieties, including six different groups of related languages or language varieties:
• Bosnian, Croatian, and Serbian
The objective of GDI is the identification of German varieties (four Swiss German dialect areas: Basel, Bern, Lucerne, Zurich) based on speech transcripts.
Analyses of previous results for the two scenarios can be found in Goutte et al. (2016) and Malmasi et al. (2015). The latter focuses on Arabic varieties, but the scenario is similar to that of the GDI task.

Language Identification and Similar Languages
Two specific tasks for language identification have attracted a lot of research attention in recent years, namely discriminating among closely related languages (Malmasi et al., 2016) and language detection on noisy short texts such as tweets (Zubiaga et al., 2015). The Discriminating between Similar Languages (DSL) workshop (Zampieri et al., 2014; Zampieri et al., 2015; Goutte et al., 2016) is a shared task where participants are asked to train systems to discriminate between similar languages, language varieties, and dialects. In the three editions organized so far, most of the best systems were based on models built with high-order character n-grams (n >= 5) using traditional supervised learning methods such as SVMs, logistic regression, or Bayesian classifiers. By contrast, deep learning approaches based on neural algorithms did not perform very well (Bjerva, 2016).
In our previous participation (Gamallo et al., 2016) in the DSL 2016 shared task, we presented two very basic systems: classification with ranked dictionaries and Naive Bayes classifiers. The results showed that ranked dictionaries are more sound and stable across different domains, while basic Bayesian models perform reasonably well on in-domain datasets but their performance drops when they are applied to out-of-domain texts. We also observed that basic n-gram models of characters and words work quite well even when they are used with simple learning systems. In the current participation we decided to use basic n-grams with a very intuitive strategy: measuring the distance between languages on the basis of the perplexity of their models.

Perplexity
The most widely used evaluation metric for language models is the perplexity of test data. In language modeling, perplexity is frequently used as a quality measure for language models built with n-grams extracted from text corpora (Chen and Goodman, 1996; Sennrich, 2012). It has also been used in very specific tasks, such as classifying tweets as formal or colloquial (González, 2015).

Methodology
Our method is based on perplexity, a measure of how well a model fits the test data. More formally, the perplexity (PP for short) of a language model on a test set is the inverse probability of the test set, normalized by its length. For a test set consisting of a sequence of characters $CH = ch_1, ch_2, ..., ch_n$ and a language model LM with n-gram probabilities $P(\cdot)$ estimated on a training set, the perplexity PP of CH given a character-based n-gram model LM is computed as follows:

$$PP(CH, LM) = \sqrt[n]{\prod_{i=1}^{n} \frac{1}{P(ch_i \mid ch_1^{i-1})}} \qquad (1)$$

where $ch_1^{i-1}$ denotes the sequence $ch_1, ..., ch_{i-1}$ and the n-gram probabilities $P(\cdot)$ are defined in this way:

$$P(ch_n \mid ch_1^{n-1}) = \frac{C(ch_1^{n-1} ch_n)}{C(ch_1^{n-1})} \qquad (2)$$

Equation 2 estimates the n-gram probability by dividing the observed frequency (C) of a particular sequence of characters by the observed frequency of the prefix, where the prefix stands for the same sequence without the last character. To take into account unseen n-grams, we use a smoothing technique based on linear interpolation.
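As an illustration, the perplexity computation described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the submitted system: the function names, the interpolation weights, the uniform fallback distribution, and the `vocab_size` value are all hypothetical choices, since the exact interpolation scheme is not detailed here. The product of inverse probabilities is computed in log space to avoid numerical underflow on long texts.

```python
import math
from collections import Counter


def train_char_ngrams(text, max_order):
    """Count every character n-gram of order 1..max_order in `text`."""
    counts = Counter()
    for order in range(1, max_order + 1):
        for i in range(len(text) - order + 1):
            counts[text[i:i + order]] += 1
    return counts


def interpolated_prob(counts, total_chars, context, char, weights, vocab_size):
    """P(char | context) as a weighted sum of estimates of every order.

    weights[0] goes to a uniform distribution over `vocab_size` characters
    (an assumption, so unseen characters never get zero probability).
    weights[k] (k >= 1) goes to the order-k relative frequency
    C(prefix + char) / C(prefix), where prefix is the last k - 1 characters
    of the context. The weights are expected to sum to 1.
    """
    prob = weights[0] / vocab_size
    for order in range(1, len(weights)):
        if order - 1 > len(context):
            break  # not enough history for this order yet
        prefix = context[len(context) - (order - 1):]
        prefix_count = counts[prefix] if prefix else total_chars
        if prefix_count > 0:
            prob += weights[order] * counts[prefix + char] / prefix_count
    return prob


def perplexity(counts, total_chars, text, weights, vocab_size=256):
    """Perplexity of `text`: exp of the average negative log-probability."""
    if not text:
        raise ValueError("cannot compute the perplexity of an empty text")
    max_order = len(weights) - 1
    log_prob = 0.0
    for i, char in enumerate(text):
        # An n-gram model only looks at the last max_order - 1 characters.
        context = text[max(0, i - (max_order - 1)):i]
        log_prob += math.log(
            interpolated_prob(counts, total_chars, context, char, weights, vocab_size)
        )
    return math.exp(-log_prob / len(text))


# Toy data and assumed weights: uniform, 1-gram, 2-gram, 3-gram.
training = "the cat sat on the mat and the cat ate the rat"
weights = [0.1, 0.2, 0.3, 0.4]
counts = train_char_ngrams(training, max_order=3)
print(perplexity(counts, len(training), "the cat sat", weights))
print(perplexity(counts, len(training), "zyxw qvk jjj", weights))
```

A text that resembles the training data should obtain a much lower perplexity than a text made of unseen character sequences, which is the property the method relies on.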
A perplexity-based distance between two languages is defined by comparing the n-grams of a text in one language with the n-gram model trained for the other language. Then, the perplexity of the test text CH in language L2, given the language model LM of language L1, can be used to define the distance, $Dist_{perp}$, between L1 and L2:

$$Dist_{perp}(L1, L2) = PP(CH_{L2}, LM_{L1}) \qquad (3)$$

The lower the perplexity of $CH_{L2}$ given $LM_{L1}$, the lower the distance between languages L1 and L2. The distance $Dist_{perp}$ is an asymmetric measure.
In order to apply this measure to language identification, we compute the perplexity-based distance between the test text and each language model, and the closest model (the one with the lowest perplexity) is selected.
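The selection step can be sketched as follows. To keep the example short it uses word unigrams only, and it smooths them by interpolating with a uniform distribution; the function names, the toy training sentences, the weight `lam`, and `vocab_size` are assumptions made for illustration, not values taken from the submitted system.

```python
import math
from collections import Counter


def train_unigram(text):
    """Word counts for a lower-cased, whitespace-tokenised text."""
    return Counter(text.lower().split())


def unigram_perplexity(counts, words, lam=0.9, vocab_size=100_000):
    """Perplexity of `words` under a word-unigram model.

    Relative frequencies are interpolated with a uniform distribution over
    `vocab_size` word types (assumed), so unseen words keep a small probability.
    """
    if not words:
        raise ValueError("cannot compute the perplexity of an empty text")
    total = sum(counts.values())
    log_prob = 0.0
    for word in words:
        p = lam * counts[word] / total + (1.0 - lam) / vocab_size
        log_prob += math.log(p)
    return math.exp(-log_prob / len(words))


def identify(models, text):
    """Return the label of the closest model and the perplexity per label."""
    words = text.lower().split()
    scores = {label: unigram_perplexity(counts, words)
              for label, counts in models.items()}
    return min(scores, key=scores.get), scores


# Toy training data, invented for this example.
models = {
    "es": train_unigram("el gato come pescado y el perro come carne en la casa"),
    "pt": train_unigram("o gato come peixe e o cachorro come carne na casa"),
}
label, scores = identify(models, "el perro come pescado")
print(label, scores)
```

Because each score is the perplexity of the test text under one model, the same function also yields the asymmetric distance described above when the test text is drawn from another language's corpus.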

Runs and Data
In the DSL task we have taken part in both tracks: closed and open. The open model was trained with the datasets released in previous DSL tasks (Malmasi et al., 2016;Zampieri et al., 2015;Zampieri et al., 2014).
We prepared three runs for each task. All of them are based on perplexity but use different model configurations:
• Run1 uses perplexity with a voting system over 6 n-gram models: 1-grams, 2-grams and 3-grams of words, and 5-grams, 6-grams and 7-grams of characters. We observed that short n-grams of words clearly outperform longer word n-grams, while long n-grams of characters tend to perform better than short ones.
• Run2 uses perplexity with just 1-grams of words. In the development tests, we observed that this simple model is very stable over different situations and tasks.
• Run3 also uses perplexity but with 7-grams of characters, since long n-grams of characters tend to perform better than short ones.
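The voting scheme used in Run1 is not specified in detail here, so the following is only one plausible reading: a plain majority vote over the labels chosen by the six models, with ties resolved by a fixed preference order. The function name, the model names, and the tie-breaking rule are all assumptions.

```python
from collections import Counter


def vote(predictions, tie_break_order):
    """Majority vote over per-model predictions.

    predictions: dict mapping a model name to the label it selected.
    tie_break_order: model names from most to least trusted; when several
    labels share the top vote count, the label chosen by the first listed
    model that voted for one of them wins (an assumed rule).
    """
    tally = Counter(predictions.values())
    top = max(tally.values())
    tied = {label for label, votes in tally.items() if votes == top}
    for name in tie_break_order:
        if predictions.get(name) in tied:
            return predictions[name]
    raise ValueError("tie_break_order lists no model that voted for a top label")


# Hypothetical outputs of the six Run1 models for one test sentence.
predictions = {
    "word1": "es-ar", "word2": "es-es", "word3": "es-es",
    "char5": "es-ar", "char6": "es-pe", "char7": "es-ar",
}
order = ["word1", "char7", "char6", "char5", "word2", "word3"]
print(vote(predictions, order))
```

Putting the word-unigram model first in the preference order reflects the observation that it is the most stable single model, but any other order could be tested in the same way.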

Results
In the first task (Discriminating between Similar Languages) we submitted systems generated with both closed and open training.

DSL Closed
The results obtained by our runs in the DSL task are shown in Table 1. The random baseline (14 classes) is 0.071, and the reference accuracy of the best system in 2016 is 0.8938. However, it is worth noting that the 2016 and 2017 DSL tasks are not comparable because the varieties proposed for the two shared tasks are not exactly the same. The table shows that the best results are obtained using the first two configurations: Run1 and Run2. Note that the second one reaches good results even though it is based on a very simple model (just word unigrams). This is also true for the GDI task (see the Discussion section below). Our best run in the DSL task achieved 0.903 accuracy (9th position out of 11 systems), while the best system in this task reached 0.927. Comparing the confusion matrices for the Spanish variants between Run1 and Run2, we can observe that, although the results are similar in both cases, the two runs succeed and fail in different ways (Table 3). So, they seem to be quite complementary strategies.
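The complementarity of Run1 and Run2 can be checked directly from the Table 3 counts for the Spanish variants. The sketch below computes, for each run, the share of each row that falls on the diagonal. It assumes that each row of the matrix corresponds to one variety and that the diagonal holds the cases where row and column labels agree; the function name is hypothetical, and only the three Spanish labels are covered.

```python
LABELS = ["es-ar", "es-es", "es-pe"]

# Counts copied from Table 3 (rows and columns in the order of LABELS).
RUN1 = [[892, 67, 36], [88, 871, 35], [111, 126, 763]]
RUN2 = [[861, 81, 56], [78, 870, 48], [87, 104, 809]]


def diagonal_share(matrix):
    """Return (per-row diagonal fraction, overall diagonal fraction)."""
    per_row = [row[i] / sum(row) for i, row in enumerate(matrix)]
    diagonal = sum(matrix[i][i] for i in range(len(matrix)))
    overall = diagonal / sum(sum(row) for row in matrix)
    return per_row, overall


for name, matrix in (("Run1", RUN1), ("Run2", RUN2)):
    per_row, overall = diagonal_share(matrix)
    shares = ", ".join(f"{label}: {share:.3f}" for label, share in zip(LABELS, per_row))
    print(f"{name}: {shares} (overall {overall:.3f})")
```

On these counts the overall diagonal shares of the two runs are very close, while Run1 has the higher share for es-ar and Run2 has the higher share for es-pe, which is consistent with the two configurations making different errors.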

DSL Open Training
We tried to improve the results by adding more training data from previous shared tasks. Table 2 shows that the simplest configuration (Run2) gets better results than in the closed training task, but only a slight improvement (0.5%) was obtained. No comparison can be made with other systems because the other participants did not take part in this track.

Table 3: Confusion matrices for the Spanish variants (Run1 and Run2).
         Run1                     Run2
         es-ar  es-es  es-pe      es-ar  es-es  es-pe
es-ar    892    67     36         861    81     56
es-es    88     871    35         78     870    48
es-pe    111    126    763        87     104    809

Table 4: Results for the GDI task.

GDI
The results for the GDI task are shown in Table 4. The majority class baseline is 0.258 and there were no previous results to compare with. However, the best result for Arabic dialects in VarDial 2016 (in conditions similar to GDI) was 0.513 (F-score). The results are much lower than in the DSL task. Several factors may influence these results:
• the GDI task has unbalanced test sets,
• the data come from speech transcription,
• the task itself is more difficult given the strong similarity of the varieties.
In this task, our best configuration is Run2, which, in spite of its simple model, outperforms the voting-based system. The confusion matrix for Run2 (see Figure 2) shows that the scores obtained for the Lucerne dialect are very poor.
Run2 achieved 0.630 accuracy (8th position out of 10 systems), while the best system in this task reached 0.680. It is worth noting that only two of the systems that were also involved in the DSL 2016 task outperform our results in GDI.

Discussion
The results show that our system, despite its simplicity, performs reasonably well. In the DSL 2016 task we obtained the second best performance, although the results are more modest in 2017; and in the GDI task the results are better than the best score in 2016 for the Arabic Dialectal Identification task.
It should be stressed that the configuration of our Run2 is very simple (just unigrams of words). In order to find key elements for further improvement, we carried out an error analysis on variants that we know quite well (variants of Spanish).

Analysis of errors in Spanish
From the list of errors on Spanish texts found in the evaluation carried out on the development corpus, we randomly selected 50 cases.
We classified these texts into the following categories:
• Not distinguishable: the dialect is impossible or very difficult to classify. There are no specific language features that allow a distinction to be made. For instance, the text "La propuesta de reunir en un mismo lugar a las etiquetas premium de las principales bodegas del país ha logrado cautivar al público amante del buen vino, siendo hoy el evento del sector más esperado del año." is classified by our system as Spanish from Argentina (es-AR) but was annotated as Spanish from Spain (es-ES). However, the text has no relevant dialectal characteristic.
• Distinguishable by dialectal uses. These are cases in which it is possible to find words such as mamá or tercerizar that are more frequent in some of the variants.
• Others: more complex cases in which it is difficult to make a decision since there are no clear language features from one particular variety. In some of the examples, several hypotheses were possible.
The figures for each case are shown in Table 5. We can observe that the first two cases (i.e., not distinguishable and distinguishable by named entities) are the most frequent in the test set.

Future Work
Based on the error analysis, we are planning to test a variant of our system with two new features:
• The system will be provided with a none category for those cases where there is not enough evidence to make a decision. This can increase the precision of the system.
• The system will be enriched with lists (gazetteers) of named entities linked to the dialects or geographical locations. These gazetteers could be used to assign weights to n-grams or as new features in the voting system. However, it will be necessary to consider the interference that this new information might introduce into the system. For instance, in the following example (Es indudable que los que utilice en los partidos amistosos que jugaremos contra España, en Huelva el 28 de mayo, y ante México...), the use of localized named entities could generate a false positive for Spanish from Spain (es-ES).
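One way the planned none category could be tried is to abstain whenever the two closest models are almost equally close to the test text. The sketch below is purely illustrative: the ratio criterion, the `min_ratio` threshold, and the function name are assumptions, since the decision rule has not been defined yet.

```python
def decide_with_none(perplexities, min_ratio=1.05):
    """Return the closest label, or "none" when the evidence is too weak.

    perplexities: dict mapping each label to the perplexity of the test text
    under that label's model. The decision is rejected when the runner-up
    perplexity is less than `min_ratio` times the best one (assumed rule).
    """
    if not perplexities:
        raise ValueError("at least one model score is required")
    ranked = sorted(perplexities.items(), key=lambda item: item[1])
    if len(ranked) == 1:
        return ranked[0][0]
    (best_label, best_pp), (_, second_pp) = ranked[0], ranked[1]
    if second_pp / best_pp < min_ratio:
        return "none"
    return best_label


# Hypothetical perplexity values for two test sentences.
print(decide_with_none({"es-ar": 100.0, "es-es": 102.0, "es-pe": 150.0}))
print(decide_with_none({"es-ar": 100.0, "es-es": 130.0, "es-pe": 150.0}))
```

A higher threshold makes the system abstain more often, so it would have to be tuned on development data to trade precision against coverage.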
Additionally, we intend to test the perplexity strategy as a way to measure the distance among languages or dialects from a diachronic perspective. This would allow us to observe the quantitative transformations of the languages/dialects and the relations among them.
Finally, we will perform further experiments with different voting systems in order to find the most appropriate for our models.
Our perplexity-based system for measuring the distance between languages is freely available at https://github.com/gamallo/Perplexity.