MICHAEL: Mining Character-level Patterns for Arabic Dialect Identification (MADAR Challenge)

We present MICHAEL, a simple, lightweight method for automatic Arabic Dialect Identification (DID) on the MADAR travel-domain corpus. MICHAEL uses simple character-level features to perform classification without any pre-processing. More precisely, character N-grams extracted from the original sentences are used to train a Multinomial Naive Bayes classifier. This system achieved an official score (accuracy) of 53.25% with 1 ≤ N ≤ 3 but showed a much better result with character 4-grams (62.17% accuracy).


Introduction
The Arabic language is one of the most widely spoken languages in the world, currently ranked fifth (Chung, 2008), with more than 330 million speakers. It is the official language of more than 22 countries. In its written form, commonly referred to as Literary Arabic, it is divided into two categories: Classical Arabic and Modern Standard Arabic (MSA). However, Arabic speakers mostly use dialects, which are linguistic variants of Classical Arabic with their own features, varying with the country or the region. While MSA is used only for written and official communication, dialects are used for oral communication as well as for many forms of device-mediated communication: email, SMS, chat or blogs. Therefore, Arabic dialect identification (DID) has become a very important pre-processing step that attracts much attention in NLP research. Indeed, knowing the dialect of an input text is useful in several NLP tasks such as sentiment analysis (Al-Twairesh et al., 2016).
We propose a simple, lightweight, character-based method to classify Arabic sentences into 26 classes (25 dialects + MSA) based on the MADAR corpus provided for this competition (Bouamor et al., 2019). This paper is organized as follows: in Section 2, we present some related work on DID. In Section 3, we describe some aspects of the Arabic dialects, and in Section 4 we present the MADAR dataset and introduce MICHAEL, the system we designed to tackle the DID task. Finally, we show our results in Section 5 and give some future directions in Section 6.

Previous Work
Arabic Dialect Identification is a very difficult task because of several factors, such as the lack of NLP tools that deal with Arabic variants. So far, researchers have tried to address this task using different methods.
Salameh et al. presented an MNB (Multinomial Naive Bayes) classifier trained to identify a tweet among 26 classes (MSA + 25 dialects) using a large-scale collection of parallel sentences. Their models reach 67.9% accuracy for sentences with an average length of 7 words and more than 90% with 16 words. Elfardy and Diab (2013) proposed a supervised method for identifying whether a given sentence is predominantly MSA or Egyptian, using the Arabic Online Commentary (AOC) dataset (Zaidan and Callison-Burch, 2011). Their system achieves an accuracy of 85.5% on this dataset.
Najafian et al. (2018) presented different approaches for Dialect Identification (DID) in Arabic broadcast speech, using Support Vector Machines (SVM) and Convolutional Neural Networks (CNN) as backend classifiers. The final system merges these results and obtains 24.7% and 19.0% relative error rate reductions compared to conventional phonotactic DID and to i-vectors with bottleneck features, respectively. Naser and Hanani (2018) describe an Arabic Dialect Identification (ADI) system for the VarDial 2018 challenge, with the goal of distinguishing four major Arabic dialects, as well as Modern Standard Arabic (MSA), using four systems. The first system uses word transcriptions and tries to recognize the speaker's dialect by modeling the word sequence of each dialect. The second one aims to recognize the dialect by modeling the phone sequences produced by non-Arabic phone recognizers. The other two systems use GMMs trained on acoustic features to recognize the dialect. This system reached 68.77% in micro F1. Elaraby and Abdul-Mageed (2018) presented deep learning models for DID and compared them with the performance of several conventional machine learning models under different conditions. Their model reached an accuracy of 87.65% on the binary task (MSA vs. dialects) and 87.4% on the 3-class task (Egyptian, Gulf and Levantine).

The Dialectal Varieties of Arabic
"Arabic language" is a rather generic term that in fact covers many variants and dialects. Nowadays, Arabic can be considered to be divided into three major categories: Classical Arabic, Modern Standard Arabic (MSA) and dialectal Arabic. The 2019 MADAR competition focused on the last of these.
Dialectal Arabic is a proper form of the Arabic language used in everyday communication, usually called "darija". It varies from one country to another and even from one region to another within the same country. All Arab countries have their own dialects that are more or less close to each other. The differences the dialects exhibit depend mainly on the history of each country and its geographical location. For example, the Tunisian dialect (TUN) integrates several borrowings from French, as Tunisia was colonized by France. Words like "stylo" (pen/pencil) or "cartable" (schoolbag) are examples of borrowings completely integrated into TUN. According to Zaidan and Callison-Burch (2014), Arabic dialects can be classified into five major classes (these classes can have several subclasses):
• Egyptian: The most widely understood dialect, due to a thriving Egyptian television and movie industry (Haeri, 2003).
• Levantine: A set of dialects that differ somewhat in pronunciation and intonation, but are largely equivalent in written form. They are closely related to Aramaic (Amara, 2010).
• Gulf: Probably the closest to MSA, perhaps because the current form of MSA evolved from an Arabic variety originating in the Gulf region. There are differences between Gulf and MSA but Gulf kept more of MSA's verb conjugation than other dialects (Versteegh, 2001).
• Iraqi: Despite its similarity to Gulf dialects it exhibits some very distinctive features in terms of prepositioning, verb conjugation and pronunciation (Mitchell, 1993).
• Maghrebi: These dialects were influenced by both French and Berber languages. The westernmost varieties could be unintelligible for speakers from other regions in the Middle East, especially in spoken form. The Maghreb is a large region with more variation than regions like the Levant or the Gulf, which probably makes it easier to distinguish its local variants (Tunisian, Algerian, Moroccan, Libyan, etc.) (Tilmatine, 1999).
Arabic dialects differ from one another and from MSA on several levels of linguistic representation such as phonology, morphology, lexicon and syntax. Table 1 exhibits examples of differences between some dialects. For instance, the phoneme "qaf" (first column) does not have the same pronunciation in all the dialects. In the second column one can see that the future tense is not marked by the same morpheme in each variant. The syntax of negation (third column) is not the same in Maghrebi dialects and in other dialects. Regarding lexicon (fourth column), the concept "car" in the ALG and MAR dialects reflects a borrowing from the French term "automobile".
Despite the differences between the dialects, their automatic identification remains a very difficult task, even impossible in some cases. This difficulty is due to several factors:
• Shared lexicon: dialects have a common vocabulary and a dialectal sentence can contain several dialects as well as MSA.
• Grammatical Ambiguity: some identical words are used with different functions. For example, the word "Tyb" can be an adjective in some dialects and an interjection in others.
• Homonyms: mostly due to the omission of short vowels, a dialectal word can have the same spelling as an MSA word but an entirely different meaning. This includes strongly dialectal words such as dwl: it is either the Egyptian (EGY) word for "these" (pronounced dowl) or the MSA word for "countries" (pronounced duwal) (Zaidan and Callison-Burch, 2014).

Data: The MADAR corpus
The purpose of the shared task is to assign each short sentence one of 26 available labels. We took advantage of the MADAR corpus supplied for the competition to train various classifiers. We did not use any external resources.
Table 3: Results for the Multinomial Naive Bayes classifier with character N-grams, for ranges of N from N min = 1 to N max = 5 and different training and testing configurations (the score in blue is our official score).
We tried different classifiers but quickly found that, under the technical constraints we were facing, Naive Bayes algorithms were the most appropriate for such a multi-class problem. The one-vs-rest implementation of SVM we tested was unable to produce a result, and we did not want to train 26 different classifiers separately. We used the scikit-learn implementations of Naive Bayes, and it quickly became clear that, among the NB implementations in this library, Multinomial Naive Bayes (MNB) was the most efficient. We will show in the next section different learning configurations and various sizes of n-grams for feature engineering.
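To make the approach concrete, the following is a minimal from-scratch sketch of a Multinomial Naive Bayes classifier over character N-gram counts. It is an illustration only, not the system submitted to the shared task: the class name `CharNgramMNB`, the Laplace smoothing value `alpha=1.0`, the choice to ignore N-grams unseen in training, and the toy strings with placeholder labels "X" and "Y" are all assumptions made here. With scikit-learn, a comparable setup would presumably combine `CountVectorizer(analyzer="char", ngram_range=(1, 3))` with `MultinomialNB`, but the exact configuration used for MICHAEL is not specified in this paper.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of `text` for n_min <= n <= n_max."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


class CharNgramMNB:
    """Hypothetical, illustrative Multinomial Naive Bayes on raw n-gram counts."""

    def __init__(self, n_min=1, n_max=3, alpha=1.0):
        self.n_min, self.n_max, self.alpha = n_min, n_max, alpha

    def fit(self, sentences, labels):
        # Number of training sentences per class (used for the class prior).
        self.class_counts = Counter(labels)
        # Per-class n-gram counts, accumulated over all sentences of the class.
        self.feature_counts = defaultdict(Counter)
        for sentence, label in zip(sentences, labels):
            self.feature_counts[label].update(
                char_ngrams(sentence, self.n_min, self.n_max))
        # Vocabulary = every n-gram seen in any training sentence.
        self.vocabulary = set()
        for counts in self.feature_counts.values():
            self.vocabulary.update(counts)
        return self

    def predict(self, sentence):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        grams = [g for g in char_ngrams(sentence, self.n_min, self.n_max)
                 if g in self.vocabulary]  # n-grams unseen in training are ignored
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            # Laplace-smoothed denominator for P(n-gram | class).
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n_docs / total_docs)  # log prior
            for gram in grams:
                score += math.log((counts[gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data with placeholder labels (not MADAR sentences).
toy_sentences = ["abab abab", "baba abab", "cdcd cdcd", "dcdc cdcd"]
toy_labels = ["X", "X", "Y", "Y"]
model = CharNgramMNB(n_min=1, n_max=3).fit(toy_sentences, toy_labels)
print(model.predict("abba"), model.predict("dccd"))
```

Because the features are taken directly from the raw character sequence, no tokenization or normalization step is needed, which is the sense in which the method is free of pre-processing. Restricting the model to 4-grams only would correspond to `n_min=4, n_max=4` in this sketch.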
It appears that the results obtained on the Test Set were worse than those obtained on the Dev Set (third column of Table 3), with an average loss of 1.6 percentage points. Merging the Train and the Dev Set resulted in a gain that in most cases was marginal (+0.26 pp). With N max > 4 we did not find much improvement in results, except on the Dev Set, but this may reflect a bias. This threshold may be related to the fact that character N-grams with N > 4 tend to represent the lexicon more than general properties of the dialect itself.
Table 4 shows the confusion matrix of our best configuration. The 25 dialects are grouped by region and MSA appears as the last class. We can see that MUS and SAN are the closest dialects to MSA, with respectively 35 and 17 errors involving the MUS-MSA and the SAN-MSA pairs. The CAI, MUS and DAM dialects were the most difficult to detect, with respectively 112, 106 and 100 False Negatives (FN). Regarding False Positives (FP), the most problematic cases were ASW (106), RIY (105) and JED (103). Interestingly, the most difficult dialect pairs to discriminate were from the Maghreb: FES-RAB (36 and 34 FP) and SFX-TUN (47 and 22). Most FPs occurred between dialects of the same region, with two exceptions: (i) a minor one, since North Levant dialects are hard to distinguish from South Levant dialects, and (ii) a stranger case, with BEN-RIY and KHA-MUS being rather difficult pairs to distinguish despite their apparent distance.
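The per-class False Negative and False Positive counts discussed above can be read off a confusion matrix as row and column sums minus the diagonal. The sketch below shows this computation under the assumption that rows hold gold labels and columns hold predicted labels (the orientation of Table 4 is not stated here); the function names, the 3-class matrix and the labels "A", "B", "C" are hypothetical and are not values from Table 4.

```python
def per_class_errors(confusion, labels):
    """Return {label: (false_negatives, false_positives)}.

    Assumes confusion[i][j] is the number of sentences whose gold label is
    labels[i] and whose predicted label is labels[j].
    """
    errors = {}
    for k, label in enumerate(labels):
        true_pos = confusion[k][k]
        row_total = sum(confusion[k])                    # all gold-k sentences
        col_total = sum(row[k] for row in confusion)     # all predicted-k sentences
        errors[label] = (row_total - true_pos, col_total - true_pos)
    return errors


def accuracy(confusion):
    """Fraction of sentences on the diagonal of the confusion matrix."""
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Toy 3-class matrix with placeholder labels (not taken from the paper).
toy_labels = ["A", "B", "C"]
toy_confusion = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 3, 7],
]
for label, (fn, fp) in per_class_errors(toy_confusion, toy_labels).items():
    print(label, "FN =", fn, "FP =", fp)
print("accuracy =", accuracy(toy_confusion))
```

A class with many False Negatives is one the classifier often misses, while a class with many False Positives is one it over-predicts; a pair of classes is hard to discriminate when both off-diagonal cells linking them are large.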

Conclusion and Future Work
In this paper, we explored the problem of Arabic dialect classification into 26 classes (covering 25 cities from the Arab World in addition to Modern Standard Arabic (MSA)). We presented MICHAEL, a simple system designed for this DID task that requires no pre-processing. MICHAEL uses character N-gram features to train a Multinomial Naive Bayes classifier. Besides its simplicity, MICHAEL does not need a huge amount of training data to achieve good results. This system achieved an official score (accuracy) of 53.25% with 1 ≤ N ≤ 3 but showed a much better result with only character 4-grams (62.17% accuracy). Using N-grams with N > 4 did not seem to improve the results. However, an accurate feature selection technique, such as mutual information, may help to take advantage of these longer N-grams, which capture more lexical information than shorter N-grams.
Using other types of character features like closed motifs (Buscaldi et al., 2018) would be a first way to assess the influence of the classifier and the features. We plan to explore whether adding pre-processing steps like tokenization into words or normalization may improve the results. Another interesting perspective would be to test a BiLSTM RNN architecture, since it has proven to be well suited to sequential data and a BiLSTM can exploit both character-level and word-level features. Finally, it would be very interesting to perform a deeper analysis of classification errors.