Discriminating between Similar Languages with Word-level Convolutional Neural Networks

Discriminating between Similar Languages (DSL) is a challenging task addressed at the VarDial Workshop series. We report on our participation in the DSL shared task with a two-stage system. In the first stage, character n-grams are used to separate language groups; in the second, specialized classifiers distinguish similar language varieties. We conducted experiments with three system configurations and submitted one run for each. Our main approach is a word-level convolutional neural network (CNN) that learns task-specific vectors with minimal text preprocessing. We also experiment with multi-layer perceptron (MLP) networks and a hybrid configuration. Our best run achieved an accuracy of 90.76%, ranking 8th among 11 participants and less than 2 percentage points behind the top-ranked system. Even though the CNN model did not achieve the best results, it is still a viable approach to discriminating between similar languages.


Introduction
Language identification is the task of detecting the language of a given text segment. Although methods that achieve an accuracy of over 99% for clearly distinct languages like English and Spanish do exist (Dunning, 1994), it is still a major problem to distinguish between closely related languages, like Bosnian and Croatian, and language varieties, like Brazilian and European Portuguese (Goutte et al., 2016). The problem of discriminating between similar languages was addressed in the DSL shared task at VarDial 2017. In DSL 2017, participants were asked to develop systems that could distinguish between 14 language varieties, distributed over 6 language groups. Two participation tracks were available: closed and open training. In the closed track, systems had to be trained exclusively on the DSL Corpus Collection (Tan et al., 2014), provided by the organizers (see Section 3), while in the open track the use of external resources was allowed. For a detailed description of the VarDial workshop and of DSL 2017, refer to the shared task report (Zampieri et al., 2017). This paper describes our system and the results of our submissions for the closed track at DSL 2017. Our goal was to experiment with deep neural networks in language variety distinction, in particular word-level Convolutional Neural Networks (CNN). This kind of network has been successfully applied to several natural language processing tasks, such as text classification (Kim, 2014) and question answering (Severyn and Moschitti, 2015; Wang et al., 2016).
As other participants did in previous editions of the DSL shared task (Zampieri et al., 2015), we chose to use two-stage classification. First, each sentence gets a group label that guides the selection of a model trained specifically for that group. Then, it goes through a classifier that predicts the final language variety. We experimented with different machine learning techniques for variety prediction while the language group classifier was kept the same. This allowed us to compare not only the overall accuracy of each classifier, but also its accuracy within each language group.
To distinguish between language groups, we leveraged the efficiency of character n-grams (Vatanen et al., 2010), while three configurations had their performances compared for language variety prediction. One run was submitted for each of the following configurations: (a) run1: a word-level CNN that learns word vectors from scratch; (b) run2: a multi-layer perceptron (MLP) fed by tf-idf vectors of word n-grams; and (c) run3: a hybrid configuration composed of word-level MLP models and character-level Naive Bayes models. Our best run (run3) was positioned 8th among 11 participants, with an accuracy of 90.76% on the test set, 1.98 percentage points behind the top-ranked system.
Although our word-level CNN did not outperform the other two configurations, it scored very close to our best run. We also found that combinations of unigrams and bigrams produce higher scores than unigrams alone. This was observed in both convolutional networks and multi-layer perceptron networks.

Related Work
Many approaches to discriminating between similar languages have been attempted in previous DSL shared tasks, and the best results were achieved by simpler machine learning methods like SVMs and Logistic Regression. However, since deep neural networks have been successfully applied to many NLP tasks such as question answering (Severyn and Moschitti, 2015; Santos et al., 2015; Rao et al., 2016), we wanted to experiment with similar network architectures, particularly CNNs, in the task of discriminating between similar languages. In the last shared task (DSL 2016), four teams used some form of convolutional neural network. The team mitsls (Belinkov and Glass, 2016) developed a character-level CNN, meaning that each sentence character was embedded in vector space. Their system ranked 6th out of seven rank positions, with an overall accuracy of 0.830, while the 1st system scored 0.894 using SVMs and character n-grams.
Cianflone and Kosseim (2016) used a character-level convolutional network with a bidirectional long short-term memory (BiLSTM) layer. This approach achieved an accuracy of 0.785.
A similar approach was used by the team Res-Ident (Bjerva, 2016). They developed a residual network (a CNN combined with recurrent units) and represented sentences at byte level, arguing that UTF-8 encodes non-ASCII symbols with more than one byte, which potentially allows for more disambiguating power. This system achieved an accuracy of 0.849. The fourth team used a word-level CNN, but details are not available since a paper was not submitted.
In DSL 2015, Franco-Salvador et al. (2015) used logistic regression and SVM models fed by pre-trained distributed vectors. Two strategies were explored for sentence representation: sentences represented as the average of their word vectors trained by word2vec (Mikolov et al., 2013), and sentences represented directly as vectors trained by Paragraph Vector (Le and Mikolov, 2014). This system ranked 7th out of 9 participants.
Collobert et al. (2011) propose avoiding task-specific engineering by learning features during model training. In that work, several NLP tasks were used as benchmarks to measure the relevance of the internal representations discovered by the learning procedure. One of these benchmarks used a convolutional layer to produce local features around each word in a sentence.
We intended to experiment with learning word vectors in the target task, in an approach similar to that of Collobert et al. (2011). We are particularly interested in local features captured by convolutional networks. We believe these networks can learn words and language constructions commonly used in particular language varieties.

Data
Since we participated in the closed track, all models were trained and tested on the DSL Corpus Collection (Tan et al., 2014), provided by the organizers. This corpus was composed by merging different corpora subsets, for the purpose of the DSL shared task, and comprises news data of various language varieties.
New versions of the DSL Corpus Collection (DSLCC) are built upon lessons learned by the organizers. Thus, an overview of the version used in DSL 2017 is provided in Table 1. It encompasses 14 language varieties distributed over 6 language groups. Since its first release, the DSLCC contains 18,000 training sentences, 2,000 development sentences and 1,000 test sentences for each language variety; each sentence contains at least 20 tokens (Tan et al., 2014).

Methodology
We experimented with three system configurations and submitted one run for each. We use two-stage classification, and apply different machine learning techniques to train one classifier per language group in each configuration. Our pipeline starts with language group prediction. After getting a group label, each sentence is forwarded to the corresponding variety classifier. In all configurations, the group classifier was kept fixed.
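The routing logic of the two-stage pipeline can be sketched as follows. This is a minimal illustration, not the submitted system: the names `two_stage_predict`, `toy_group_classifier` and `toy_variety_classifiers` are hypothetical, and the toy classifiers are hard-coded stand-ins for trained models.

```python
from typing import Callable, Dict

# Hypothetical type: a classifier is any function from a sentence to a label.
Classifier = Callable[[str], str]


def two_stage_predict(
    sentence: str,
    group_classifier: Classifier,
    variety_classifiers: Dict[str, Classifier],
) -> str:
    """Predict a language variety by first predicting its language group."""
    group = group_classifier(sentence)               # stage 1, e.g. "A"
    variety_classifier = variety_classifiers[group]  # one model per group
    return variety_classifier(sentence)              # stage 2, e.g. "hr"


# Toy stand-ins for trained models (illustrative only).
def toy_group_classifier(sentence: str) -> str:
    return "F" if "el" in sentence.split() else "A"


toy_variety_classifiers: Dict[str, Classifier] = {
    "A": lambda sentence: "hr",
    "F": lambda sentence: "es-es",
}

prediction = two_stage_predict(
    "el gato duerme", toy_group_classifier, toy_variety_classifiers
)
```

Because the group classifier is shared, swapping the entries of the dictionary is enough to compare variety classifiers under the same first stage.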
Character n-grams are used to train a Naive Bayes classifier (we use the scikit-learn multinomial Naive Bayes) that distinguishes between language groups. Before training, language codes are replaced with the respective group code (bs, hr, or sr becomes A, for example), sentences are tokenized, and each token gets an end mark ($). Tokens are defined as character segments delimited by whitespace. Better results were achieved in the development set when letter case was kept original, so it was not changed. Named entities were not changed either. We found 5 to be the best size for n-grams, with an accuracy of 0.9981 in the development set. Values greater than 5 also give good results, but training is much slower.
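The feature extraction for the group classifier might look like the sketch below. It only covers the preprocessing and counting steps, not the Naive Bayes training. Two points are assumptions: n-grams are slid over the whole marked sentence (so they may span a token boundary), and only group A is listed in the mapping because it is the only group spelled out above. The names `GROUP_OF` and `char_ngrams` are illustrative.

```python
from collections import Counter

# Language code -> group code. Only group A is given explicitly in the text;
# the remaining groups would be mapped in the same way.
GROUP_OF = {"bs": "A", "hr": "A", "sr": "A"}


def char_ngrams(sentence: str, n: int = 5) -> Counter:
    """Count character n-grams after appending '$' to every token."""
    # Tokens are character segments delimited by whitespace; case is kept.
    marked = " ".join(token + "$" for token in sentence.split())
    # Assumption: n-grams slide over the whole marked sentence, so some of
    # them include the space between two tokens.
    return Counter(marked[i:i + n] for i in range(len(marked) - n + 1))


features = char_ngrams("Dobar dan", n=5)
```

For the sentence "Dobar dan", the marked string is "Dobar$ dan$", which yields seven distinct 5-grams such as "Dobar" and "obar$".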
In the first system configuration, language varieties are classified using convolutional neural networks. This is our main approach.

Convolutional Neural Network
The model, shown in Figure 1, is similar to one of the architectures explored by Kim (2014). It takes raw sentences as input and generates class probabilities as output. The class with the highest probability is selected as the prediction.
Let $s = \{w_1, w_2, w_3, \ldots, w_L\}$ be a sentence of fixed length $L$. Each word $w_j$ must be mapped to a row vector $x_j \in \mathbb{R}^{d}$ embedded in matrix $W \in \mathbb{R}^{(|V|+1) \times d}$, where $|V|$ is the number of distinct words in the language group. Rows in $W$ follow the same order as words in the vocabulary, so that the $i$-th row in $W$ represents the vector of the $i$-th word in the vocabulary $V$. Words are mapped to vectors by looking up their corresponding indexes in $W$ (embedding lookup). Words that are not found in the vocabulary $V$ are skipped.
Matrix $S \in \mathbb{R}^{L \times d}$ represents the sentence $s$ and is obtained by concatenation of word vectors $x_j$. Notice that $W$ has $|V| + 1$ rows. The first row corresponds to a special token PAD, used to fill up sentences shorter than $L$.
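The mapping from a sentence to the matrix S can be sketched in plain Python as below. The function names and the toy values in `W` are illustrative. Truncating sentences longer than L is an assumption, since the text only describes padding shorter ones.

```python
from typing import Dict, List

PAD_INDEX = 0  # the first row of W is reserved for the PAD token


def build_vocabulary(sentences: List[str]) -> Dict[str, int]:
    """Map each distinct token to a row index of W, starting at 1."""
    vocab: Dict[str, int] = {}
    for sentence in sentences:
        for token in sentence.split():
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab


def to_indices(sentence: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Convert a sentence to exactly max_len row indices of W."""
    # Out-of-vocabulary words are skipped, as described in the text.
    indices = [vocab[token] for token in sentence.split() if token in vocab]
    indices = indices[:max_len]  # assumption: longer sentences are truncated
    return indices + [PAD_INDEX] * (max_len - len(indices))


def embed(indices: List[int], W: List[List[float]]) -> List[List[float]]:
    """Embedding lookup: stack the rows of W selected by the indices."""
    return [W[i] for i in indices]


vocab = build_vocabulary(["a b", "b c"])
indices = to_indices("c x a", vocab, max_len=5)
# Toy embedding matrix with |V| + 1 = 4 rows and d = 2 columns.
W = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
S = embed(indices, W)
```

Here the unknown word "x" is dropped, and the remaining two indices are followed by three PAD indices, so S always has L rows.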
Convolution filters are slid over $S$ to generate intermediate feature vectors known as feature maps. Filters are always of width $d$, but there may be different filter lengths and multiple filters of each length.
Formally, each feature $c_i$ in a feature map $c$ is computed as
$$c_i = f(w \cdot S_{i:i+h-1} + b)$$
where $w \in \mathbb{R}^{h \times d}$ is a convolution filter, $b \in \mathbb{R}$ is a bias term, $f(\cdot)$ is a non-linear function such as the hyperbolic tangent, $h$ is the filter length, and $S_{i:i+h-1}$ denotes rows $i$ through $i+h-1$ of $S$.
The convolution of 3 filters of length 2 is represented in Figure 1. Each filter generates one feature map.
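The feature-map computation can be sketched in plain Python, assuming the usual form c_i = f(w · S_{i:i+h-1} + b) with f set to the hyperbolic tangent, where the product is the sum of element-wise products between the filter and a window of h rows of S. The function name `feature_map` and the toy values are illustrative and are not taken from the TensorFlow implementation.

```python
import math
from typing import List

Matrix = List[List[float]]


def feature_map(S: Matrix, w: Matrix, b: float) -> List[float]:
    """Compute c_i = tanh(w . S[i:i+h] + b) for every window of h rows."""
    h = len(w)  # filter length; the filter width equals the embedding size d
    c = []
    for i in range(len(S) - h + 1):
        window = S[i:i + h]
        # Sum of element-wise products between the filter and the window.
        total = sum(
            w_val * s_val
            for w_row, s_row in zip(w, window)
            for w_val, s_val in zip(w_row, s_row)
        )
        c.append(math.tanh(total + b))
    return c


# Toy sentence matrix (L = 3, d = 2) and one filter of length h = 2.
S = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [[1.0, 0.0], [0.0, 1.0]]
c = feature_map(S, w, b=0.0)
```

A filter of length h over a sentence of L rows produces a feature map with L - h + 1 entries, two in this toy case.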
Max-over-time pooling is applied to each feature map $c$ to take the maximum value $\hat{c} = \max(c)$. Those pooled values are concatenated to form a final feature vector that is fed to a fully-connected layer followed by softmax. For regularization, dropout is applied to the fully-connected layer. The final output is a probability distribution over the class labels.
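The pooling and output steps can be sketched as follows, in plain Python and at inference time, so dropout is omitted. The function names and the toy weights are illustrative assumptions, not values from the trained models.

```python
import math
from typing import List


def max_over_time(feature_maps: List[List[float]]) -> List[float]:
    """Keep the maximum value of each feature map."""
    return [max(c) for c in feature_maps]


def softmax(logits: List[float]) -> List[float]:
    """Turn scores into a probability distribution (numerically stable)."""
    highest = max(logits)
    exps = [math.exp(z - highest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(pooled: List[float], weights: List[List[float]],
             biases: List[float]) -> List[float]:
    """Fully-connected layer followed by softmax."""
    logits = [
        sum(w_k * x_k for w_k, x_k in zip(row, pooled)) + bias
        for row, bias in zip(weights, biases)
    ]
    return softmax(logits)


# Two toy feature maps, pooled into a vector with two entries.
pooled = max_over_time([[0.1, 0.9, -0.3], [0.5, 0.2, 0.4]])
probs = classify(pooled, weights=[[1.0, 0.0], [0.0, 1.0]], biases=[0.0, 0.0])
predicted = probs.index(max(probs))  # index of the most probable class
```

Since each feature map contributes a single value, the size of the pooled vector depends only on the number of filters, not on the sentence length.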

Model Training
To train the model, sentences are tokenized and all digits (0-9) are replaced with zeros. Letter case is not changed. Tokens are delimited by whitespaces, but no end marker is appended to them. Maximum sentence length L is set to 80, since the longest sentence found in the training set had 77 tokens.
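The preprocessing for the CNN is small enough to fit in one function. The sketch below assumes that each digit is replaced by one zero, so that "2017" becomes "0000"; the function name `preprocess` is illustrative.

```python
import re
from typing import List


def preprocess(sentence: str) -> List[str]:
    """Replace every digit with 0 and split on whitespace; case is kept."""
    # Assumption: one zero per digit, so number lengths are preserved.
    return re.sub(r"[0-9]", "0", sentence).split()


tokens = preprocess("U 2017. godini 14 jezika")
```

Punctuation stays attached to the neighbouring token, because whitespace is the only delimiter.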
One model is trained for each language group. The vocabulary V is the set of unique tokens found in the training set for the current group. Vocabulary sizes are shown in Table 2. Word vectors (matrix W) are initialized randomly and updated by backpropagation along with other network weights. Since we intend to minimize the dependence of our model on external resources that may not be readily available for specific languages, the use of pre-trained word embeddings is entirely avoided.

Table 2: Vocabulary size per language group (columns: Group, Languages, # of tokens).
The model hyperparameters are: vector dimension d = 200, filters of lengths (h) 1 and 2 with 100 feature maps each, hyperbolic tangent for non-linearity, a dropout rate of 0.20 (a keep probability of 0.80), and shuffled mini-batches of size 50. Parameter values were found by grid search on the development set. All models are trained for 3 epochs, using the Adam optimizer (Kingma and Ba, 2014) to minimize the cross-entropy, without early stopping. We use TensorFlow (Abadi et al., 2016) for implementation.
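Some sizes follow directly from these hyperparameters. The sketch below collects them in a dictionary and derives three quantities. The dictionary keys and function names are hypothetical, and the step count assumes 18,000 training sentences per variety, as stated in the Data section.

```python
import math

# Values from the text; the key names are illustrative.
CNN_HYPERPARAMS = {
    "embedding_dim": 200,
    "filter_lengths": (1, 2),
    "feature_maps_per_length": 100,
    "activation": "tanh",
    "dropout_rate": 0.20,
    "batch_size": 50,
    "epochs": 3,
    "max_sentence_length": 80,
}


def pooled_vector_size(hp: dict) -> int:
    """One pooled value per filter, over all filter lengths."""
    return len(hp["filter_lengths"]) * hp["feature_maps_per_length"]


def conv_parameter_count(hp: dict) -> int:
    """Each filter of length h has h * d weights plus one bias."""
    d = hp["embedding_dim"]
    m = hp["feature_maps_per_length"]
    return sum(m * (h * d + 1) for h in hp["filter_lengths"])


def steps_per_epoch(num_sentences: int, hp: dict) -> int:
    """Number of mini-batches needed to see every sentence once."""
    return math.ceil(num_sentences / hp["batch_size"])


feature_vector_size = pooled_vector_size(CNN_HYPERPARAMS)
conv_params = conv_parameter_count(CNN_HYPERPARAMS)
# A group with 3 varieties, assuming 18,000 training sentences per variety.
group_steps = steps_per_epoch(3 * 18000, CNN_HYPERPARAMS)
```

Under these assumptions the fully-connected layer receives a vector of 200 pooled values, and the convolution layer is small compared with the embedding matrix, whose size grows with the vocabulary.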

Multi-Layer Perceptron
A vanilla Multi-Layer Perceptron (MLP) was used to compare the CNN performance with that of another neural model. In this approach, one classifier is trained for each language group, just as before. Sentences are represented as bags of word n-grams structured as high-dimensional tf-idf vectors. To make n-grams comparable to filters in the CNN models, they are extracted from sentences in sizes of 1 and 2 words (unigrams and bigrams). Letter case is not changed and no transformation is done on digits.
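A bag of word unigrams and bigrams with tf-idf weights can be sketched in plain Python as below. The exact weighting used in the system is not stated, so the formula here (raw term frequency, smoothed idf, L2 normalisation) is an assumption, and sparse dictionaries stand in for the high-dimensional vectors. The function names are illustrative.

```python
import math
from collections import Counter
from typing import Dict, List


def word_ngrams(sentence: str) -> List[str]:
    """Return the word unigrams followed by the word bigrams."""
    tokens = sentence.split()  # case and digits are left unchanged
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def tfidf_vectors(sentences: List[str]) -> List[Dict[str, float]]:
    """Build one sparse tf-idf vector (term -> weight) per sentence."""
    counts = [Counter(word_ngrams(s)) for s in sentences]
    n_docs = len(sentences)
    # Document frequency: number of sentences that contain each term.
    doc_freq = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        # Assumed weighting: tf * (ln((1 + n) / (1 + df)) + 1).
        weights = {
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, tf in c.items()
        }
        norm = math.sqrt(sum(v * v for v in weights.values()))
        vectors.append({t: v / norm for t, v in weights.items()} if norm else {})
    return vectors


vectors = tfidf_vectors(["o gato preto", "o cachorro"])
```

In this toy corpus, "o" occurs in both sentences and therefore receives a lower weight than "gato", which occurs in only one.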
The model has a hidden layer of size 30 and each language variety corresponds to one unit in the output layer. The activation function is the hyperbolic tangent. Models are trained for 10 epochs without early stopping, with mini-batches of 200 examples, using the Adam optimizer to minimize the cross-entropy.

Hybrid System Configuration
Considering the lower performance of both previous configurations in group A, relative to other groups, we came up with a hybrid system configuration in which all language varieties are predicted by MLP classifiers, except for group A. For that group, a standard character n-gram model is applied. It is exactly the model described in Section 4 as the first component of our pipeline.
This change had little impact on performance, as discussed later in Section 6. Table 3 shows the performance of our convolutional neural network (run1) in each language variety, while Table 4 shows the corresponding confusion matrix. In Table 4, the horizontal axis indicates predicted labels, while true labels are indicated on the vertical axis. For example, 28 hr sentences were wrongly predicted as sr. For fine-grained results, we opted to report on our main approach (CNN) instead of reporting on our best performing system.
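Reading the confusion matrix can be made concrete with a short sketch. It uses only the rows of Table 4 that appear in this text (hr, bs, sr, es-ar and es-es) and computes the recall of each variety and the share of its sentences that stay inside the correct language group. The names `LABELS`, `ROWS`, `recall` and `in_group_share` are illustrative.

```python
# Column order of Table 4 (predicted labels).
LABELS = ["hr", "bs", "sr", "es-ar", "es-es", "es-pe", "fa-af",
          "fa-ir", "fr-ca", "fr-fr", "id", "my", "pt-br", "pt-pt"]

# Rows of Table 4 available in this text (true label -> predicted counts).
ROWS = {
    "hr":    [837, 131, 28, 0, 1, 0, 0, 0, 0, 1, 2, 0, 0, 0],
    "bs":    [156, 718, 125, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
    "sr":    [10, 114, 876, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "es-ar": [0, 0, 0, 798, 77, 123, 0, 0, 0, 0, 0, 0, 2, 0],
    "es-es": [0, 0, 0, 90, 842, 63, 0, 1, 1, 0, 0, 0, 0, 3],
}


def recall(true_label: str) -> float:
    """Share of sentences of a variety that were predicted correctly."""
    row = ROWS[true_label]
    return row[LABELS.index(true_label)] / sum(row)


def in_group_share(true_label: str, group: list) -> float:
    """Share of sentences predicted as any variety of the given group."""
    row = ROWS[true_label]
    return sum(row[LABELS.index(label)] for label in group) / sum(row)


hr_as_sr = ROWS["hr"][LABELS.index("sr")]
hr_recall = recall("hr")
hr_in_group = in_group_share("hr", ["hr", "bs", "sr"])
```

For hr, most errors are confusions with bs and sr, while only a handful of sentences leave group A altogether.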

Results
The overall results of our three submitted runs, along with a random baseline, are summarized in Table 5. The result of the best performing system is also reported, and an extra column was appended to the table to report on development set accuracy. Our best run (run3) ranked 8th out of 11 participants according to the official evaluation. It achieved an accuracy of 0.9076, a small difference of 0.0198 (1.98 percentage points) from the best system. Our deep neural network (run1) achieved an accuracy of 0.8878, indicating that the CNN scored close to our best run, but could not outperform it. Accuracy values computed on the development set behave similarly to those of the official evaluation.
The result of a traditional single-stage character n-gram model is also reported in Table 5 as a baseline for the development set. This is the Naive Bayes model described in Section 4, used to distinguish between language groups, but trained over all 14 language varieties.

Discussion
Although we focus on results of our main approach, all three runs behaved similarly. We can see in Table 4 that the confusion between language groups is minimal. This is due to the two-stage architecture that separates sentences into groups before discriminating between varieties.
The group classifier performs its task almost perfectly. In the development set, the group classifier achieved an accuracy of 99.81%. We conducted an error analysis by sampling misclassified sentences, and found that most of them really seem to belong to the predicted language group. In the following example, the classifier predicted group D (French) instead of the true label F (Spanish):
Jean-Paul Bondoux, chef propietario de La Bourgogne & Jérôme Mathe, chef de Le Café des Arts (Figueroa
In most examples, the classifier is misled by proper nouns in foreign languages, like names of soccer players commonly found in news texts.
Prior classification of language groups narrows down the set of output classes for variety classifiers, allowing for their optimization in a single language. We believe this raises the accuracy within language groups.
However, some language groups are more challenging than others, as is shown in Table 3. Groups A and F are responsible for the lowest scores. Group A, particularly, contains the most difficult language to discriminate (bs) for our three system configurations. Even the change from a neural to a statistical approach in our hybrid configuration had little impact on that group's performance (Table 5). This was observed both in the development set and the official runs.
The vocabulary of group A may lead to sparser language models that hinder the performance of classifiers. Group A contains almost twice as many tokens as group F, the second largest group, which also comprises 3 language varieties (Table 2).
Overall, our hybrid configuration showed the best performance, which is very close to that of the MLP. In fact, we would still rank in the same position if the MLP configuration (run2) were considered instead.
Although the MLP scored higher than the CNN, the difference was small. Also, the convolutional model trains relatively fast on appropriate hardware, considering that pre-trained word vectors are not used and all model values are initialized randomly. Together with its minimal preprocessing requirements, these characteristics make our word-level CNN a viable model for discriminating between similar languages.

Conclusion
In this work we explored word-level convolutional neural networks to discriminate between similar languages and language varieties. Our intuition is that language varieties can be distinguished by particular words and common language constructions. Even though we argue for avoiding task-specific feature engineering, we believe this kind of linguistic bias is fundamental to the success of methods that address the task of discriminating between similar languages. We believe both the CNN and the MLP models were able to capture particular words and common language constructions as features.

Table 4: Confusion matrix of the CNN (run1); rows are true labels, columns are predicted labels. Only the rows for hr, bs, sr, es-ar and es-es are shown.
         hr   bs   sr  es-ar  es-es  es-pe  fa-af  fa-ir  fr-ca  fr-fr  id  my  pt-br  pt-pt
hr      837  131   28      0      1      0      0      0      0      1   2   0      0      0
bs      156  718  125      0      0      0      0      0      0      1   0   0      0      0
sr       10  114  876      0      0      0      0      0      0      0   0   0      0      0
es-ar     0    0    0    798     77    123      0      0      0      0   0   0      2      0
es-es     0    0    0     90    842     63      0      1      1      0   0   0      0      3