The SMarT Classifier for Arabic Fine-Grained Dialect Identification

This paper describes the approach adopted by the SMarT research group to build a dialect identification system in the framework of the MADAR shared task on Arabic fine-grained dialect identification. We experimented with several approaches and finally adopted a Multinomial Naive Bayes classifier based on word and character n-grams, in addition to language model probabilities. We achieved a macro-averaged accuracy of 67.73% and a macro-averaged F1-score of 67.31%.


Introduction
Arabic is a complex language which presents significant challenges for natural language processing and its applications. Arabic is characterized by its plurality: it encompasses Modern Standard Arabic (MSA) and a wide variety of dialects that differ across regions and countries. Language identification is the task of identifying the language of a given text. It is an important preprocessing step for many Natural Language Processing (NLP) tasks such as machine translation (Meftouh et al., 2018; Harrat et al., 2017), sentiment analysis (Rana et al., 2016; Abdul-Mageed et al., 2014; Saad et al., 2013), etc. In general, language identification is not a highly challenging problem: it has been studied for a long time, and several machine learning techniques tested in this area have yielded good results. Nonetheless, in cases such as identifying languages from very little data, from mixed input, or when the languages are extremely close to each other, the task becomes very challenging (Goutte et al., 2014). This paper describes the submission of Loria (SMarT research group) to the MADAR shared task on Arabic fine-grained dialect identification, covering 25 specific cities from across the Arab World, in addition to Modern Standard Arabic (Bouamor et al., 2019). This shared task is the first to target such a large set of dialect labels at the city and country levels. It has two subtasks.
Our submission to this campaign deals with the first subtask. The remainder of this paper is organized as follows: in the next section, we discuss related work pertaining to Arabic dialect identification; Section 3 reviews the modeling choices we made for the shared task, and Section 4 describes the results in detail.

Related Work
Several research works have addressed the problem of Arabic dialect identification (Habash et al., 2008; Elfardy and Diab, 2013). Elfardy and Diab (2013) use token-level labels to derive sentence-level features; these features are then used with other core and meta features to train a generative classifier that predicts the correct label for each sentence of the input text. In addition to a multi-dialect, multi-genre, human-annotated corpus, the authors in Cotterell and Callison-Burch (2014) present the results of a language identification task extended to include five dialects, considering Naïve Bayes and Support Vector Machine classifiers. The approach used by Darwish et al. (2014) for identifying the Egyptian dialect was based on lexical, morphological and phonological information; they show that accounting for such information can improve dialect detection. Using a set of surface features based on characters and words, Malmasi et al. (2015) conduct three experiments with a linear SVM classifier and a meta-classifier using stacked generalization on the Multidialectal Parallel Corpus of Arabic (MPCA) compiled by Bouamor et al. (2014): they first conduct a 6-way multi-dialect classification task, then investigate pairwise binary dialect classification, and finally carry out a cross-corpus evaluation on the Arabic Online Commentary (AOC) dataset. Another study performed an experiment on six dialects from different Algerian regions. In Salameh et al. (2018), the authors present the first system dealing with a fine-grained dialect classification task covering 25 specific cities from across the Arab World, in addition to Standard Arabic; for this purpose, they build several classification systems using a Multinomial Naïve Bayes classifier and exploring a large space of features.

For the experiments reported in this paper, we only use the training and development data available for subtask 1 of the shared task. The dataset of this subtask is the same as the one reported in Bouamor et al. (2018) and Salameh et al. (2018). It is composed of two corpora. The first (Corpus-26) is a collection of parallel sentences built to cover the dialects of 25 cities from the Arab World, in addition to MSA; its training part consists of 1,600 labeled instances per class, while its development part has 200 labeled instances per class. The second (Corpus-6) contains 10,000 additional sentences translated into the dialects of only five cities (Beirut, Cairo, Doha, Tunis and Rabat), in addition to MSA; they are split into 9,000 instances per language for training and 1,000 instances per language for development.

Method
In order to develop a language identification system that can distinguish between several Arabic dialects, we tested three methods: recurrent neural networks (LSTMs) (Sak et al., 2015), a method based on word embeddings (Word2vec) (Mikolov et al., 2013), and Naïve Bayes classifiers.
Given the limited size of the provided corpora, the first two methods proved ineffective. We give in Table 1 the results obtained on Corpus-26 in terms of macro-averaged F1-score, precision and recall. We chose a Naïve Bayes method because, in a past comparative study of topic identification methods for French, it led to the best results (Bigi et al., 2001). In this work we consider a Multinomial Naïve Bayes classifier: the study of McCallum and Nigam (1998) showed that the multinomial model almost always performs better than the multivariate Bernoulli model. The term Multinomial Naïve Bayes indicates that each p(f_i|c) (where f_i is a feature and c the class) follows a multinomial distribution, rather than some other distribution such as a Bernoulli distribution. To develop our system, we used Python, relying on the Scikit-learn module (Pedregosa et al., 2011).
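As an illustration, here is a minimal Scikit-learn sketch of such a classifier; the sentences, the dialect labels and the word-unigram features are placeholders rather than the actual MADAR data or the exact submitted configuration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy sentences and city labels standing in for the MADAR training data.
train_sentences = ["ana rayeh al souq", "wesh rak dir", "shu hal akhbar"]
train_labels = ["CAI", "ALG", "BEI"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer()),   # TF-IDF weighted word unigrams
    ("nb", MultinomialNB()),        # multinomial model of p(feature | dialect)
])
clf.fit(train_sentences, train_labels)
print(clf.predict(["wesh rak"]))    # predicts one of the toy labels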

Features
A Naïve Bayes classifier identifies a category by estimating the distribution of the features within that category. It assumes that the features are conditionally independent of one another given the category. Choosing features is therefore a critical step when applying Naïve Bayes classifiers, and we carried out several experiments to select adequate ones. In the end, we selected the following 38 features for each sentence:
• Word unigrams.

• Word bigrams.
• Character n-grams of order 1 to 5.
• Character n-grams of order 1 to 5 taking into account the spaces between words; in other words, n-grams at the edges of words are padded with a space. All punctuation symbols have been removed from the training, development and test data.
• The 26 likelihoods estimated by the 26 unigram language models (one per class).
For all the features, we use a special character to mark the start of each sentence. We weight the n-gram features with Term Frequency-Inverse Document Frequency (Tf-Idf) scores (Spärck Jones, 1972), which have been shown to outperform raw count weights in several NLP applications. A sketch of how such a feature set can be assembled is given below.
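The following is a minimal sketch, assuming Scikit-learn and SciPy, of one way to build this feature set. The start-of-sentence marker (here the token "xxbos"), the punctuation list, the add-one smoothing of the unigram language models and the use of a per-word geometric-mean probability to keep the language-model features non-negative for the Multinomial Naïve Bayes classifier are illustrative assumptions, not details taken from the paper.

import string
from collections import Counter

import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

START = "xxbos"  # illustrative start-of-sentence marker

def preprocess(sentence):
    """Remove punctuation and prepend the start-of-sentence marker."""
    cleaned = sentence.translate(str.maketrans("", "", string.punctuation))
    return START + " " + cleaned

class UnigramLM:
    """Add-one smoothed unigram language model for one dialect (assumed smoothing)."""
    def __init__(self, sentences):
        self.counts = Counter(w for s in sentences for w in s.split())
        self.total = sum(self.counts.values())
        self.vsize = len(self.counts) + 1

    def mean_prob(self, sentence):
        # Per-word geometric mean of the unigram probabilities: stays in (0, 1],
        # so the feature remains non-negative for MultinomialNB.
        words = sentence.split()
        logp = sum(np.log((self.counts[w] + 1) / (self.total + self.vsize))
                   for w in words)
        return float(np.exp(logp / max(len(words), 1)))

# Toy data standing in for Corpus-26 (26 classes in the real setting).
train = [("ana rayeh al souq", "CAI"), ("wesh rak dir", "ALG")]
X_raw = [preprocess(s) for s, _ in train]
y = [label for _, label in train]

word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))       # word 1- and 2-grams
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 5))       # char 1- to 5-grams
charwb_vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 5))  # char n-grams padded with spaces at word edges

blocks = [v.fit_transform(X_raw) for v in (word_vec, char_vec, charwb_vec)]

# One unigram LM per class; each sentence gets one likelihood feature per LM.
lms = {label: UnigramLM([s for s, l in zip(X_raw, y) if l == label]) for label in sorted(set(y))}
lm_feats = csr_matrix([[lms[label].mean_prob(s) for label in sorted(lms)] for s in X_raw])

X = hstack(blocks + [lm_feats])
clf = MultinomialNB().fit(X, y)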

Results and Discussion
For the purpose of this campaign, we built several systems using the model described in Section 3. We carried out several experiments to determine the additive smoothing value required by the Naive Bayes method, and we set it to 0.093 for all the systems. In Table 2, we report the results of all the experiments concerning the Multinomial Naive Bayes method. For evaluation, we use the macro-averaged F1-score, which is the official metric retained by the organizers of the MADAR shared task. First, we train the multinomial NB on word n-grams. The best results are achieved with unigrams and bigrams; for higher-order n-grams, the performance of the model degrades due to data sparsity. Then, we tested the effect of character n-gram features with (wi) and without (wo) taking into account the space at the end of words. We experimented with the features of each option alone and combined. In Table 2, the notation x-y means that all n-gram features of order x to y in the corresponding column are taken into account in the classification. In all the experiments, the best model is obtained for n ranging from 1 to 5. We note that a classifier based on character n-gram features (1-5) outperforms the classifier based on word n-gram features by at least 3 points. Finally, the best classifier is the one using word unigrams and bigrams together with character n-grams ranging from 1 to 5, both with and without spaces. The introduction of the language model features improved the result on the development corpus but reduced it on the test corpus.
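A minimal sketch of how the additive smoothing value could be tuned on the development set using the official macro-averaged F1-score; the toy data, the character-feature choice and the search grid are illustrative assumptions, not the exact procedure used for the submitted systems.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins; in practice these are the Corpus-26 training and development splits.
train_texts, y_train = ["ana rayeh al souq", "wesh rak dir"], ["CAI", "ALG"]
dev_texts, y_dev = ["ana hena", "wesh bik"], ["CAI", "ALG"]

vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 5))
X_train = vec.fit_transform(train_texts)
X_dev = vec.transform(dev_texts)

best_alpha, best_f1 = None, -1.0
for alpha in np.arange(0.001, 1.0, 0.001):   # grid over the additive smoothing value
    clf = MultinomialNB(alpha=alpha).fit(X_train, y_train)
    f1 = f1_score(y_dev, clf.predict(X_dev), average="macro")
    if f1 > best_f1:
        best_alpha, best_f1 = float(alpha), f1

print(best_alpha, best_f1)  # the paper retains 0.093 for all of its systems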
We finally decided to participate in the campaign with the classifier that includes the language model features.

Conclusion
In this paper, we described the experiments we conducted as part of the MADAR shared task on Arabic fine-grained dialect identification. This task is the first to cover the dialects of 25 specific cities from across the Arab World, in addition to MSA. We tested several systems exploring a large set of features. A blind run on the test set was then performed and submitted as part of the shared task. The macro accuracy is 67.73% (macro-averaged F1-score 67.31%), placing our classifier first among 19 participants. This result shows that, despite its simplicity, our approach performs very well. Even though it is ranked first, further effort is needed to turn it into an effective tool for the community.

Table 2: Macro-averaged F1-score on the Development and Test sets for Corpus-26.