ST MADAR 2019 Shared Task: Arabic Fine-Grained Dialect Identification

This paper describes the solution that we propose for the MADAR 2019 Arabic Fine-Grained Dialect Identification task. The proposed solution utilizes a set of classifiers that we trained on character and word features. These classifiers are: Support Vector Machines (SVM), Bernoulli Naive Bayes (BNB), Multinomial Naive Bayes (MNB), Logistic Regression (LR), Stochastic Gradient Descent (SGD), Passive Aggressive (PA) and Perceptron (PC). The system achieved competitive results, with a performance of 62.87% and 62.12% on the development and test sets, respectively.


Introduction
Dialect identification (Zaidan and Callison-Burch, 2014) is a subfield of language identification which can be coarse-grained or fine-grained. Coarse-grained dialect identification, or simply dialect identification (Meftouh et al., 2015), refers to the process of dividing a language into the main dialects that belong to that language. On the other hand, fine-grained dialect identification considers the differences between the sub-dialects inside a dialect of some language.
In this paper, we describe a fine-grained dialect identification system that participated in the MADAR 2019 Arabic Fine-Grained Dialect Identification task (Bouamor et al., 2019). In this task, our system was trained on a dataset of short sentences in the travel domain. A sentence in this dataset belongs to one or more Arabic fine-grained dialects. These cover 25 city dialects in addition to Modern Standard Arabic (MSA). The task of our system is to identify the dialect of a given sentence that belongs to these 26 dialects.
The multi-way classification system that we propose uses word n-grams and char n-grams as features, and MNB, BNB and SVM as classifiers.
The rest of the paper is organized as follows. In Section 2, we describe the dataset. In Section 3, we describe the proposed system, addressing the task as a multi-way text classification problem. We report our experiments and results in Section 4, and present a discussion and conclusion with suggestions for future research in Sections 5 and 6.

Dataset
In this work, we used the MADAR Travel Domain dataset built by translating the Basic Traveling Expression Corpus (BTEC) (Takezawa et al., 2007). All sentences have been translated manually from English and French to the different Arabic dialects by speakers of 25 dialects (Bouamor et al., 2019). The training data is composed of 1600 sentences for each of the 25 dialects, in addition to MSA. The size of the development and test sets is 200 sentences per dialect. The sentences are short, ranging from 4 to 15 words each. Each sentence is annotated with the speaker's dialect. In Table 1, we provide some statistics on the used corpora.
Arabic dialects can be considered as variants of Modern Standard Arabic. However, the absence of a standard orthography (Habash et al., 2012) for dialects generates many different shapes of the same word. In addition, the similarities that remain between these dialects make their identification difficult in textual form. In Figure 3, we present respectively the number of words and sentences shared between the different dialects.

System
An overview of our proposed approach is shown in Figure 2.

Feature extraction
We applied a light preprocessing step consisting of simple blank tokenization and punctuation filtering. It is worth noting that in our preliminary experiments we deployed low-level NLP processing such as POS-tagging (Freihat et al., 2018b) features and lemmatization (Freihat et al., 2018a), but without a significant enhancement of the achieved results. Besides the word and character n-gram features used in previous work such as (Lichouri et al., 2018), we added the character word-boundary feature (char_wb). In the following, we present a description of the three adopted features.
• Word n-grams: We extract word n-grams, with n ranging from 1 to 3.
• Char n-grams: Character n-grams, with n ranging from 1 to 3, are used as features.
• Char wb n-grams: This feature creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
The count matrices obtained using these features are transformed into a TF-IDF representation.
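The three feature extractors described above can be sketched with scikit-learn's `TfidfVectorizer`, which supports exactly these three analyzer modes (`word`, `char`, `char_wb`) and produces TF-IDF-weighted counts directly. The example sentences below are toy Latin-transliterated phrases for illustration only; the n-gram ranges follow the paper, but any other vectorizer settings are scikit-learn defaults, not the authors' exact configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# Three TF-IDF vectorizers mirroring the described features:
# word 1-3 grams, character 1-3 grams, and character 1-3 grams
# restricted to word boundaries (padded with space at word edges).
features = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 3))),
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(1, 3))),
    ("char_wb", TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))),
])

# Toy transliterated sentences (illustrative, not from the MADAR corpus).
sentences = ["marhaba kifak", "sabah el kheir", "wesh rak"]
X = features.fit_transform(sentences)
print(X.shape[0])  # one row per input sentence
```

`FeatureUnion` simply concatenates the three sparse TF-IDF matrices column-wise, so downstream classifiers see all three feature families at once.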

Classification Models
Our model is based on a set of classifiers using the scikit-learn library (Pedregosa et al., 2011), namely: Support Vector Machines (SVM), Bernoulli Naive Bayes (BNB), Multinomial Naive Bayes (MNB), Logistic Regression (LR), Stochastic Gradient Descent (SGD), Passive Aggressive (PA) and Perceptron (PC). In the following, we present the selected parameters for each classifier.
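The set of classifiers listed above can be sketched as a simple comparison loop over scikit-learn estimators. The hyperparameters here are library defaults (with `LinearSVC` standing in for SVM), not the tuned values selected by the authors, and the training data is a tiny made-up sample with hypothetical dialect labels.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import (LogisticRegression, PassiveAggressiveClassifier,
                                  Perceptron, SGDClassifier)
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# The seven classifiers named in the paper, with default parameters.
classifiers = {
    "SVM": LinearSVC(),
    "BNB": BernoulliNB(),
    "MNB": MultinomialNB(),
    "LR": LogisticRegression(max_iter=1000),
    "SGD": SGDClassifier(),
    "PA": PassiveAggressiveClassifier(),
    "PC": Perceptron(),
}

# Toy training data with hypothetical dialect labels (not MADAR data).
train_x = ["kifak enta", "shlonak habibi", "wesh rak khouya", "labas alik"]
train_y = ["BEI", "BAG", "ALG", "RAB"]

for name, clf in classifiers.items():
    model = make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(1, 3)), clf)
    model.fit(train_x, train_y)
    print(name, model.predict(["wesh rak"])[0])
```

Each classifier is wrapped in a pipeline with the TF-IDF vectorizer so that feature extraction and classification are fit jointly, which keeps the comparison across classifiers uniform.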

Results
Using the aforementioned classifiers, the best achieved performance (F1-Macro) for coarse-grained dialect identification was 90.55% on the development dataset (Corpus-6) using word n-grams (Table 4).

Discussion
We experimented with different classifiers and a set of features to solve fine-grained dialect identification, i.e., a 26-way classification problem. The results show that fine-grained dialect identification is more difficult, given the similarity between dialects on one side and, on the other side, the non-standardization of dialectal writing, which generates unpredictable texts. In addition, we noted the presence of MSA texts in several dialectal examples, which distorts the results. Using the test dataset, we calculated the accuracy achieved by our best model. In Figure 3, we show the average accuracy over the 5 regions and MSA, for both the development and test sets. We notice that the best results were achieved for the Yemen region, with an accuracy of 75%, and an average accuracy of over 67% for the Maghreb region.
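The region-level scores above are averages of per-dialect accuracies grouped by region. A minimal sketch of that aggregation step, using hypothetical dialect codes, an assumed region grouping, and illustrative accuracy values (not the paper's reported numbers):

```python
# Hypothetical per-dialect accuracies (illustrative values only).
dialect_acc = {
    "RAB": 0.70, "ALG": 0.68, "TUN": 0.64,  # assumed Maghreb cities
    "SAN": 0.75,                            # assumed Yemen city
    "MSA": 0.66,
}

# Assumed grouping of dialect codes into regions.
regions = {
    "Maghreb": ["RAB", "ALG", "TUN"],
    "Yemen": ["SAN"],
    "MSA": ["MSA"],
}

# Average the per-dialect accuracies within each region.
region_acc = {
    region: sum(dialect_acc[c] for c in codes) / len(codes)
    for region, codes in regions.items()
}
for region, acc in region_acc.items():
    print(f"{region}: {acc:.2%}")
```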

Conclusion
In this paper, we proposed an Arabic fine-grained dialect identification system. Our best run on the test data yielded an F1-Macro score of 62% using a Naive Bayes classifier and word n-gram features. Despite the simplicity of these features, the results were promising. In order to improve performance, we intend to investigate alternative methods, such as deep learning architectures and rule-based techniques, in future work.