Evaluating HeLI with Non-Linear Mappings

In this paper we describe the non-linear mappings we used with the Helsinki language identification method, HeLI, in the 4th edition of the Discriminating between Similar Languages (DSL) shared task, which was organized as part of the VarDial 2017 workshop. Our SUKI team participated in the closed track together with 10 other teams, and our system reached 7th position in the track. We describe the HeLI method and the non-linear mappings in mathematical notation. The HeLI method uses a probabilistic model with character n-grams and word-based backoff. We also describe our trials using the non-linear mappings instead of relative frequencies, and we present statistics about the backoff function of the HeLI method.


Introduction
The 4th edition of the Discriminating between Similar Languages (DSL) shared task (Zampieri et al., 2017) was divided into an open and a closed track. In the closed track the participants were allowed to use only the training data provided by the organizers, whereas in the open track the participants could use any data sources they had at their disposal. This year we did not participate in the open track, so we did not use any additional sources for training and development. The creation of the earlier DSL corpora has been described by Tan et al. (2014). This year's training data consisted of 18,000 lines of text, excerpts of journalistic texts, for each of the 14 languages. The corresponding development set had 2,000 lines of text for each language. The task had a language selection comparable to the 1st, 2nd (Zampieri et al., 2015), and 3rd (Malmasi et al., 2016) editions of the shared task. The languages and varieties are listed in Table 1. The differences from the previous year's shared task were the inclusion of the Persian and Dari languages, as well as the replacement of the Mexican Spanish variety with Peruvian Spanish.

For the 4th edition, we were interested in modifying the HeLI method to use TF-IDF scores and some non-linear mappings instead of relative frequencies. We were inspired by the successful use of TF-IDF scores by Barbaresi (2016), who was able to significantly boost the accuracy of his identifier after the 3rd edition of the shared task. Earlier, Brown (2014) had managed to boost several language identification methods using non-linear mappings.

Related Work
Automatic language identification of digital text has been researched for more than 50 years. The first article on the subject was written by Mustonen (1965), who used multiple discriminant analysis to distinguish between Finnish, English, and Swedish. For more on the history of automatic language identification, the reader is referred to the literature review chapter of Marco Lui's doctoral thesis (Lui, 2014).
There has also been research directly involving the language groups present in this year's shared task. Automatic identification of South-Slavic languages has been researched by Ljubešić et al. (2007), Tiedemann and Ljubešić (2012), Ljubešić and Kranjčić (2014), and Ljubešić and Kranjčić (2015). Brown (2012) presented confusion matrices for the languages of the former Yugoslavia (including Bosnian and Croatian) as well as for Indo-Iranian languages (including Western and Eastern Farsi). Chew et al. (2009) experimented with distinguishing between Dari and Farsi, as well as between Malay and Indonesian, among others. Distinguishing between Malay and Indonesian was also studied by Ranaivo-Malançon (2006). Automatic identification of French dialects was studied by Zampieri (2013), among others. Discriminating between Portuguese varieties has also been studied, and Zampieri (2013) and Maier and Gómez-Rodríguez (2014), among others, researched language variety identification between Spanish dialects.
The system description articles provided for the previous shared tasks are all relevant, and references to them can be found in the shared task reports by Zampieri et al. (2015) and Malmasi et al. (2016). A detailed analysis of the first two shared tasks was done by Goutte et al. (2016).
The language identification method used by the system presented in this article, HeLI, was first introduced by Jauhiainen (2010), and it was also described in the proceedings of the 2nd edition of the DSL shared task (Jauhiainen et al., 2015). The complete description of the method was first presented in the proceedings of the 3rd VarDial workshop (Jauhiainen et al., 2016). The language identifier tool using the HeLI method is available as open source from GitHub (https://github.com/tosaja/HeLI). The non-linear mappings evaluated in this article were previously tested with several language identifiers by Brown (2014).

Methodology
In this paper, we re-present most of the description of the HeLI method from last year's system description paper (Jauhiainen et al., 2016). We leave out the mathematical description of words as features, as they were not used in the submitted runs. We tried several combinations of words, lowercased words, n-grams, and lowercased n-grams with the development set. The best results of these trials can be seen in Table 2. In the table, "l. $n_{max}$" refers to the maximum length of the lowercased n-grams, "c. $n_{max}$" to the n-grams retaining capital letters, "l. w." to lowercased words, and "c. w." to words with original capitalization. We did similar tests with different combinations of the language models when choosing the models to be used with the loglike function described later.

On Notation
A corpus $C$ is a finite sequence $u_1, \ldots, u_{l_C}$ of individual tokens $u_i$, which may be words or characters. The total count of all individual tokens $u$ in the corpus $C$ is denoted by $l_C$. A feature $f$ is some countable characteristic of the corpus $C$. When referring to all features $F$ in a corpus $C$, we use $C^F$, and the count of all features is denoted by $l_{C^F}$. The count of a feature $f$ in the corpus $C$ is referred to as $c(C, f)$. An n-gram is a feature which consists of a sequence of $n$ individual tokens. An n-gram of length $n$ starting at position $i$ in a corpus is denoted $u_i^n$. If $n = 1$, $u$ is an individual token. When referring to all n-grams of length $n$ in a corpus $C$, we use $C^n$, and the count of all such n-grams is denoted by $l_{C^n}$. The count of an n-gram $u$ in a corpus $C$ is referred to as $c(C, u)$ and is defined by Equation 1:

$$c(C, u) = \sum_{i=1}^{l_C + 1 - n} \begin{cases} 1, & \text{if } u = u_i^n \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$
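As a concrete illustration, the counting in Equation 1 can be sketched in a few lines of Python. The function name count_ngram and the string-based corpus representation are illustrative choices for this sketch, not part of the method itself.

```python
# A minimal sketch of Equation 1: counting overlapping occurrences of
# an n-gram u in a corpus C, here modeled simply as a string of tokens.
def count_ngram(corpus: str, u: str) -> int:
    n = len(u)
    return sum(1 for i in range(len(corpus) - n + 1) if corpus[i:i + n] == u)

print(count_ngram("banana", "an"))  # 2
```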
The set of languages is $G$, and $l_G$ denotes the number of languages. A corpus $C$ in language $g$ is denoted by $C_g$. A language model $O$ based on $C_g$ is denoted by $O(C_g)$. The features given values by the model $O(C_g)$ form the domain $dom(O(C_g))$ of the model. In a language model, the value $v$ for the feature $f$ is denoted by $v_{C_g}(f)$. For each potential language $g$ of a corpus $C$ in an unknown language, a resulting score $R_g(C)$ is calculated. A corpus in an unknown language is also referred to as a mystery text.

HeLI Method
The goal is to correctly guess the language $g \in G$ in which the monolingual mystery text $M$ has been written, when all languages in the set $G$ are known to the language identifier. In the method, each language $g \in G$ is represented by several different language models based on character n-grams of lengths from one to $n_{max}$. Only one of the language models is used for each word $t$ found in the mystery text $M$. The model used is selected by its applicability to the word $t$ under scrutiny. If we are unable to apply the n-grams of the size $n_{max}$, we back off to lower-order n-grams. We continue backing off until character unigrams, if needed.
A development set is used for finding the best values for the parameters of the method. The three parameters are the maximum length of the used character n-grams ($n_{max}$), the maximum number of features to be included in the language models (cut-off $c$), and the penalty value for those languages where the features being used are absent (penalty $p$). The penalty value has a smoothing effect in that it transfers some of the probability mass to unseen features in the language models.

Creating the Language Models
The training data is tokenized into words using the non-alphabetic and non-ideographic characters as delimiters. The relative frequencies of character n-grams of lengths from 1 to $n_{max}$ are calculated inside the words, so that the preceding and following space characters are included. The n-grams are overlapping, so that, for example, a word with three characters includes three character trigrams.
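To make the construction concrete, the following Python sketch extracts padded, overlapping character n-grams as described above. The tokenizer is a rough approximation of the delimiter rule, and both function names are illustrative only.

```python
import re

def tokenize(text: str):
    # Approximate the paper's rule: split on non-alphabetic characters.
    # \W, digits, and underscore act as delimiters; Unicode letters remain.
    return [w for w in re.split(r"[\W\d_]+", text) if w]

def char_ngrams(word: str, n: int):
    # Pad the word with a leading and trailing space, then take all
    # overlapping n-grams inside the padded word.
    padded = " " + word + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("cat", 3))  # [' ca', 'cat', 'at '] -- three trigrams
```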
The $c$ most common n-grams of each length in the corpus of a language are included in the language models for that language. We estimate the probabilities using the relative frequencies of the character n-grams in the language models, computed over the retained n-grams only. We then transform those frequencies into scores using base-10 logarithms.
The derived corpus containing only the n-grams retained in the language models is called $C^n$. The domain $dom(O(C^n))$ is the set of all character n-grams of length $n$ found in the models of all languages $g \in G$. The values $v_{C_g^n}(u)$ are calculated similarly for all n-grams $u \in dom(O(C^n))$ for each language $g$, as shown in Equation 2:

$$v_{C_g^n}(u) = \begin{cases} -\log_{10}\left(v_{C_g}(u)\right), & \text{if } c(C_g^n, u) > 0 \\ p, & \text{if } c(C_g^n, u) = 0 \end{cases} \qquad (2)$$

In the first run of the shared task we used relative frequencies of n-grams as the values $v_{C_g}(u)$. They are calculated for each language $g$ as in Equation 3:

$$v_{C_g}(u) = \frac{c(C_g^n, u)}{l_{C_g^n}} \qquad (3)$$

where $c(C_g^n, u)$ is the number of n-grams $u$ found in the derived corpus of the language $g$ and $l_{C_g^n}$ is the total number of n-grams of length $n$ in the derived corpus of the language $g$.
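A minimal sketch of this model-building step under the description above: keep the $c$ most frequent n-grams of each length, compute relative frequencies over the retained n-grams only, and store negative base-10 logarithms (Equations 2 and 3). It reuses tokenize() and char_ngrams() from the earlier sketch; build_model is an illustrative name, not the released tool's API.

```python
import math
from collections import Counter

def build_model(corpus: str, n_max: int, cutoff: int) -> dict:
    """Map each retained n-gram to -log10(relative frequency)."""
    model = {}
    for n in range(1, n_max + 1):
        counts = Counter()
        for word in tokenize(corpus):
            counts.update(char_ngrams(word, n))
        kept = dict(counts.most_common(cutoff))   # cut-off c
        total = sum(kept.values())                # counts over retained n-grams only
        for u, c in kept.items():
            model[u] = -math.log10(c / total)     # Equations 2 and 3
    return model
```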
Brown (2014) experimented with five language identifiers using two non-linear mappings, the gamma and the loglike functions. We tested applying the two non-linear mappings to the relative frequencies. Both functions have a parameter ($\gamma$ or $\tau$) whose value has to be found empirically using the development set.
The value $v_{C_g}(u)$ using the gamma function is calculated as in Equation 4:

$$v_{C_g}(u) = \left(\frac{c(C_g^n, u)}{l_{C_g^n}}\right)^{\gamma} \qquad (4)$$

The value $v_{C_g}(u)$ using the loglike function is calculated as in Equation 5:

$$v_{C_g}(u) = \frac{\log\left(1 + 10^{\tau}\,\frac{c(C_g^n, u)}{l_{C_g^n}}\right)}{\log\left(1 + 10^{\tau}\right)} \qquad (5)$$
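For illustration, the two mappings of Equations 4 and 5 can be written directly as functions of a relative frequency rf; in a sketch, these values would replace the plain relative frequency inside Equation 2.

```python
import math

def gamma_value(rf: float, gamma: float) -> float:
    # Equation 4: raise the relative frequency to the power gamma.
    return rf ** gamma

def loglike_value(rf: float, tau: float) -> float:
    # Equation 5: Brown's (2014) loglike mapping of the relative frequency.
    return math.log(1 + 10 ** tau * rf) / math.log(1 + 10 ** tau)
```

Note that with $\gamma = 1.0$ the gamma mapping is the identity, which is why the method then reduces to the original HeLI method, as noted in the experiments below.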

Scoring N-grams in the Mystery Text
When using n-grams, the word $t$ is split into overlapping character n-grams $u_i^n$ of length $n$, where $i = 1, \ldots, l_t + 1 - n$. Each of the n-grams $u_i^n$ is then scored separately for each language $g$.
If the n-gram $u_i^n$ is found in $dom(O(C^n))$, the values in the models are used. If the n-gram $u_i^n$ is not found in any of the models, it is simply discarded. We define the function $d_g(t, n)$ for counting the n-grams in $t$ found in a model in Equation 6:

$$d_g(t, n) = \sum_{i=1}^{l_t + 1 - n} \begin{cases} 1, & \text{if } u_i^n \in dom(O(C^n)) \\ 0, & \text{otherwise} \end{cases} \qquad (6)$$

When all the n-grams of the size $n$ in the word $t$ have been processed, the word gets the value of the average of the scored n-grams $u_i^n$ for each language, as in Equation 7:

$$v_g(t, n) = \frac{1}{d_g(t, n)} \sum_{i=1}^{l_t + 1 - n} v_{C_g^n}(u_i^n) \qquad (7)$$

where $d_g(t, n)$ is the number of n-grams $u_i^n$ found in the domain $dom(O(C^n))$. If all of the n-grams of the size $n$ were discarded, so that $d_g(t, n) = 0$, the language identifier backs off to using n-grams of the size $n - 1$. If no values are found even for unigrams, the word gets the penalty value $p$ for every language, as in Equation 8:

$$v_g(t, 0) = p \qquad (8)$$
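The scoring and backoff logic of Equations 6 to 8 can be sketched as follows, assuming models maps each language to a dictionary produced by the earlier build_model() sketch. The treatment of n-grams known to some but not all models follows our reading of the description above, with the penalty value $p$ filling the gaps as in Equation 2.

```python
def score_word(word: str, models: dict, n: int, penalty: float) -> dict:
    """Score one word for every language, backing off from length n."""
    grams = char_ngrams(word, n)
    # Keep only n-grams found in at least one language model; the rest
    # are discarded, per the description above.
    known = [u for u in grams if any(u in m for m in models.values())]
    if not known:                                  # d_g(t, n) = 0
        if n > 1:
            return score_word(word, models, n - 1, penalty)  # back off
        return {g: penalty for g in models}        # Equation 8
    # Equation 7: average over the retained n-grams; an n-gram absent
    # from a particular language's model contributes the penalty p.
    return {g: sum(m.get(u, penalty) for u in known) / len(known)
            for g, m in models.items()}
```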

Language Identification
The mystery text is tokenized into words using the non-alphabetic and non-ideographic characters as delimiters. After this, a score $v_g(t)$ is calculated for each word $t$ in the mystery text for each language $g$. If the length of the word $l_t$ is at least $n_{max} - 2$, the language identifier uses character n-grams of the length $n_{max}$. In case the word $t$ is shorter than $n_{max} - 2$ characters, $n = l_t + 2$. The whole mystery text $M$ gets the score $R_g(M)$ equal to the average of the scores of the words $v_g(t)$ for each language $g$, as in Equation 9:

$$R_g(M) = \frac{1}{l_{T(M)}} \sum_{i=1}^{l_{T(M)}} v_g(t_i) \qquad (9)$$

where $T(M)$ is the sequence of words and $l_{T(M)}$ is the number of words in the mystery text $M$. Since we are using negative logarithms of probabilities, the language with the lowest score is returned as the language with the maximum probability for the mystery text.
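Putting the pieces together, a sketch of the identification step (Equation 9) under the same illustrative names; it assumes the mystery text contains at least one word.

```python
def identify(text: str, models: dict, n_max: int, penalty: float) -> str:
    """Return the language with the lowest average word score."""
    words = tokenize(text)
    totals = {g: 0.0 for g in models}
    for t in words:
        # Use n-grams of length n_max, or l_t + 2 for shorter words
        # (the space-padded word has length l_t + 2).
        n = min(n_max, len(t) + 2)
        for g, score in score_word(t, models, n, penalty).items():
            totals[g] += score
    # Scores are negative log-probabilities, so the lowest average wins.
    return min(totals, key=lambda g: totals[g] / len(words))
```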

Experiments
In order to find the best possible parameters ($n_{max}$, $c$, and $p$), we applied a simple form of the greedy algorithm using the development set. The best recall for the original HeLI method, 0.9105, was reached using $n_{max} = 8$, $c = 170,000$, and $p = 6.6$.
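As an illustration of this kind of parameter search, a simple greedy loop might look like the following sketch. The recall callback, the candidate grids, and all names here are assumptions made for the sketch, not our actual tuning code.

```python
def greedy_search(recall, grids: dict, params: dict):
    """Optimize one parameter at a time, holding the others fixed,
    and repeat until no single change improves development recall."""
    best = recall(params)
    improved = True
    while improved:
        improved = False
        for name, candidates in grids.items():
            for value in candidates:
                trial = {**params, name: value}
                score = recall(trial)
                if score > best:
                    best, params, improved = score, trial, True
    return params, best
```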

TF-IDF
We made a small experiment trying to adapt the HeLI method to use TF-IDF scores (the product of term frequency and inverse document frequency). TF-IDF scores were successfully used to boost the performance of a Naive Bayes identifier by Barbaresi (2016). Malmasi et al. (2015) also used character n-grams of lengths one to four, weighted with TF-IDF. There are several variations of the TF-IDF weighting scheme, and Malmasi et al. (2015) do not specify whether they used the basic formula or not. We calculated the TF-IDF as in Equation 10:

$$v_{C_g}(u) = \frac{c(C_g^n, u)}{l_{C_g^n}} \log\left(\frac{l_G}{df(u)}\right) \qquad (10)$$

where $df(u)$ is defined as in Equation 11. Let $l_G$ be the number of languages in a language-segmented corpus $C^G$. We define the document frequency $df(u)$ of an n-gram $u$ as the number of languages in which it appears:

$$df(u) = \left|\{ g \in G \mid c(C_g^n, u) > 0 \}\right| \qquad (11)$$

We used the $v_{C_g}(u)$ values from Equation 10 instead of relative frequencies in Equation 2, but we were unable to come even close to the accuracy of our original method. We did not submit a run using the TF-IDF weighting.
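A sketch of the weighting in Equations 10 and 11, assuming counts maps each language to a Counter of n-grams and totals to the corresponding $l_{C_g^n}$; the IDF variant with a plain logarithm is a reconstruction of the basic formula, and the function name is illustrative.

```python
import math

def tf_idf(u: str, g: str, counts: dict, totals: dict) -> float:
    """TF-IDF value for n-gram u in language g; assumes u occurs
    in at least one language, so df is never zero."""
    tf = counts[g][u] / totals[g]                          # term frequency
    df = sum(1 for lang in counts if counts[lang][u] > 0)  # Equation 11
    return tf * math.log(len(counts) / df)                 # Equation 10
```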

Gamma Function
Using the gamma function in his experiments, Brown (2014) was able to reduce the error rate of his own language identifier by 83.9% with 1366 languages and by 76.7% with 781 languages. We tested the gamma function with the development set, but it did not improve our results. It seems that the penalty value $p$ of the HeLI method and the $\gamma$ parameter have at least partly the same effect: if we fix one of the values, we are able to reach almost or exactly the same results by varying the other. Table 3 shows some of the results on the development set. With $\gamma = 1.0$ the method is identical to the original HeLI method. As there were no improvements in the results at all, we decided not to submit a run using the gamma function.

Loglike Function

Table 4 shows some of the results on the development set when using the loglike function, $n_{max} = 8$, and $c = 170,000$. There seemed to be a local optimum at around $\tau = 2.9$, so we also experimented with slightly different values of $n_{max}$ and $c$ around it. The best recall of 0.9109 was provided by $n_{max} = 7$, $c = 180,000$, and $\tau = 3.0$. The loglike function seemed to make a tiny (about half a percent) improvement in the error rate on the development set. Brown (2014) also reported improvements using the loglike function in his experiments.

Results
Our SUKI team submitted two runs for the closed track. For both runs we used all of the training and development data to create the language models. The first run used the relative frequencies as in Equation 3. In the second run, we used the loglike function as in Equation 5. The results and the parameters for each run can be seen in Tables 5 and 6. We have also included the results of the winning team, CECL (Bestgen, 2017). For the 3rd edition of the task, we used the HeLI method without any modifications, and the first run of the 4th edition was produced with an identical system. This year, Peruvian Spanish replaced Mexican Spanish. It seems to be more easily distinguished from the Argentinian and Peninsular varieties, at least with the HeLI method, as the average F1-score for the Spanish varieties rose from last year's 0.80 to 0.86. The inclusion of the languages written in the Arabic script also helped to raise the overall average F1-score from 0.888 to 0.905.

Discussion
After this year's shared task, we also looked into the backoff function of the HeLI method and calculated how often each of the n-gram lengths was used with the test set. These calculations can be seen in Table 7. Table 8 shows the number of words of each length after removing non-alphabetic characters and adding the extra spaces before and after the word. Comparing the two tables, it seems that the backoff function was used for only a small fraction of the words.

Conclusions
Using the loglike function with the actual test set improved the results much more than with the development set: the reduction in the error rate was 4.8%, around ten times larger than on the development set. In the future, we will carry out further experiments trying to introduce discriminating features into the HeLI method. As it stands, it is still a generative method, not relying on finding discriminating features between languages.