Natural Language Processing for Dialectical Arabic: A Survey

This paper presents a wide literature review of natural language processing for dialectical Arabic. Four main research areas were identified and the dialect coverage in research work was outlined. The paper can be used as a quick reference to identify relevant contributions that address a specific NLP aspect for a specific dialect.


Introduction
The last ten years have seen a growing interest in natural language processing for dialectical Arabic. This growth can be attributed to several factors including the wide usage of Arabic dialects in social media. The topics treated by computational linguists for Arabic dialects range from fundamental language aspects such as morphology up to sophisticated solutions such as machine translation.
To obtain an overview of the research that has been done in this area we went through as many papers as possible and tried to specify the main contributions of each paper. We identified four main categories, each with several subcategories. The main categories are basic language analyses, building language resources, semantic-level analysis and synthesis, and identifying Arabic dialects. Then, we mapped each paper to categories and subcategories as well as to the addressed dialect or dialects in a matrix form as given in Table 1. In this way, it can be easily identified what has been done in NLP for dialectal Arabic, by whom, and for what dialects.
The following four sections describe the related work in the four main categories. For space reasons, however, we limited the description to main aspects. The final section provides a brief discussion of the findings of this survey.
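The paper-to-category mapping described in this introduction can be pictured as a small lookup structure. The following is a minimal sketch, not the layout of Table 1 itself: the names `ENTRIES`, `build_matrix`, and `lookup` are hypothetical, and the few entries shown are taken from the descriptions given in the sections of this survey.

```python
from collections import defaultdict

# (citation, subcategory, dialects) triples, as described in this survey's text.
# Only a handful of entries are listed; the selection is illustrative.
ENTRIES = [
    ("Duh & Kirchhoff, 2005", "Morphological Analysis and POS Tagging", ["Egyptian"]),
    ("Chiang et al., 2006", "Syntax and Parsing", ["Levantine"]),
    ("Dasigi & Diab, 2011", "Orthographic Analysis", ["Levantine", "Egyptian"]),
    ("Graff et al., 2006", "Building Lexicons and Lexical Analysis", ["Iraqi"]),
    ("Salloum & Habash, 2011", "Machine Translation", ["Levantine", "Egyptian"]),
]


def build_matrix(entries):
    """Map each (dialect, subcategory) cell to the list of papers that cover it."""
    matrix = defaultdict(list)
    for citation, subcategory, dialects in entries:
        for dialect in dialects:
            matrix[(dialect, subcategory)].append(citation)
    return matrix


def lookup(matrix, dialect, subcategory):
    """Return the papers in one cell, or an empty list if the cell is empty."""
    return matrix.get((dialect, subcategory), [])


if __name__ == "__main__":
    matrix = build_matrix(ENTRIES)
    print(lookup(matrix, "Levantine", "Syntax and Parsing"))
```

Empty cells in such a structure correspond to dialect and topic combinations for which no work was found.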

Basic Language Analyses
Several solutions have been proposed for the morphological analysis, syntactical analysis, and orthographic analysis and generation. The following three sections describe these solutions, respectively.

Morphological Analysis and POS Tagging
The morphology of dialectal Arabic gained early attention from computational linguists. An early contribution is a morphological analyzer and generator denoted MAGEAD. This tool is able to analyze the Levantine dialect and to convert MSA to Levantine. In a later publication the authors detailed the morphophonemic and the orthographic rules encoded in MAGEAD (Habash & Rambow, 2007).
In (Almeman & Lee, 2012), two morphological analyzers for Gulf, Levantine, Egyptian, North African, Sudanese, and Iraqi dialects were presented. The first one relies on an MSA morphological analyzer. The second one applies word segmentation and uses web data as a corpus to produce statistical information about the frequency of different segment combinations. In (Zribi, Khemakhem, & Belguith, 2013), a morphological analyzer for the Tunisian dialect based on an MSA analyzer was proposed. Furthermore, a lexicon for the Tunisian dialect was built as an expansion of an MSA lexicon. An unsupervised approach for morphological segmentation was applied to improve machine translation from the Qatari dialect to English (Al-Mannai et al., 2014).
In (Duh & Kirchhoff, 2005), a part-of-speech tagger for Egyptian Arabic was proposed based on a morphological analyzer for MSA and a minimally supervised approach that requires raw text data from several Arabic varieties.
In (Al-Sabbagh & Girju, 2012a), a function-based POS tagger was proposed that was trained on a manually annotated Egyptian Arabic corpus.
A rule-based stemmer for the Arabic Gulf dialect was proposed in (Abuata & Al-Omari, 2015), and a fine-grained POS tagger for the Tunisian dialect has also been presented.

Syntax and Parsing
The syntax of Arabic dialects has rarely been addressed in the context of computational linguistics. In (Brustad, 2000), the author presented a comparative study of Moroccan, Egyptian, Syrian, and Kuwaiti dialects with respect to syntax, though without computational aspects.
In (Chiang et al., 2006), a parser for Levantine Arabic was proposed. The parser does not rely on an annotated Levantine corpus or a parallel Levantine-MSA corpus. Rather, each Levantine word is translated into a bag of MSA words that are scored and decoded relying on an MSA corpus. The resulting text is then parsed using an MSA parser. Finally, the terminal nodes in the resulting parse structure are replaced with the original Levantine words.
Levantine was also the dialect treated in the work on a pilot Levantine Arabic Treebank, which was developed through morphological and syntactic annotation of 26,000 words of Levantine Arabic conversational telephone speech. The Treebank was used to develop and evaluate parsers for Levantine texts. Grammatical mapping rules were defined to provide language resources for machine translation from the Tunisian dialect to MSA and other target languages in (Sadat, Mallek, et al., 2014).

Orthographic Analysis
In contrast to MSA, dialectical Arabic has no orthographic standard. The same word can be written in different forms, which poses difficulties for NLP tools. In (Dasigi & Diab, 2011), first steps were made towards normalizing the orthography of the Levantine and Egyptian dialects. For that, different similarity measures were employed that exploit string similarity and contextual semantic similarity.
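To make the idea of similarity-based normalization concrete, the following minimal sketch groups spelling variants by string similarity alone. It is not the method of (Dasigi & Diab, 2011): contextual semantic similarity is not modeled, the greedy grouping strategy and the threshold are assumptions, and the romanized example strings are illustrative placeholders rather than data from any cited corpus.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a, b).ratio()


def group_variants(words, threshold=0.7):
    """Greedily group words whose similarity to a group's first member
    reaches the (assumed) threshold; otherwise start a new group."""
    groups = []
    for word in words:
        for group in groups:
            if similarity(word, group[0]) >= threshold:
                group.append(word)
                break
        else:
            groups.append([word])
    return groups


if __name__ == "__main__":
    # Hypothetical romanized spellings used only for illustration.
    print(group_variants(["mabsut", "mabsoot", "kitab", "mabsout"]))
```

A practical system would typically combine such a string measure with context-based evidence, since unrelated words can also be close in spelling.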
In (Habash, Diab, & Rambow, 2012), a conventional orthography was proposed to help build computational models for Arabic dialects in general and Egyptian in particular. The rules and guidelines produced were named CODA.
In (Zribi, Graja, et al., 2013), orthography guidelines for the Tunisian dialect were presented for the purpose of transcribing a Tunisian speech corpus. The rules presented are based on the standard Arabic transcription conventions. This work was later used in (Zribi, Khemakhem, & Belguith, 2013) for the morphological analysis presented in the Morphological Analysis and POS Tagging section.

Building Resources for Dialectal Arabic
The lack of language resources for dialectical Arabic is a well-known problem. Many researchers addressed this problem by creating lexicons, wordnets, corpora, and treebanks. In (Zaghouani, 2014), a useful survey of freely available Arabic corpora including lexicons was presented. The author highlighted the severe lack of freely available dialectal corpora, as only two resources could be identified (Graja et al., 2010), (Almeman & Lee, 2013).
In (Sansò, 2004), the MED-TYP project was presented, which aimed at building a typological database for Mediterranean languages including MSA and Arabic dialects. While the researchers found that the Mediterranean could not be identified as a linguistic area in the traditional sense, a number of significant contact phenomena were discovered.

Building Lexicons and Lexical Analysis
In (Graff et al., 2006), a lexicon for the Iraqi dialect was presented. The lexicon comprises words from recorded speech tagged with pronunciation data, morphology information, and part-of-speech. The annotation was performed manually with the aid of a user interface and supporting tools.
In (Al-Sabbagh & Girju, 2010) a lexicon for Egyptian Cairene Arabic is described. Each Cairene entry was mapped to its MSA synonym and tagged with its part-of-speech. Additionally, the entry is tagged with its top-ranked meaning according to web queries.
A spelling corrector for the Iraqi dialect was presented in (Rytting et al., 2011). An orthographic density metric is used to motivate the need for a fine-grained ranking method for candidate words.
In (Graff & Maamouri, 2012), the update of three bilingual dictionaries for English-speaking learners of Moroccan, Syrian and Iraqi Arabic was presented. The original editions of the dictionaries were published by Georgetown University Press in the 1960s, and the update is a collaboration between the Linguistic Data Consortium and Georgetown University Press. In the updated dictionaries, both Arabic script and International Phonetic Alphabet orthographies are used. A web interface enables searching, editing, reviewing and managing the lexicon efficiently.
In (Boujelbane et al., 2013), a Tunisian dialect text corpus as well as a method for building a bilingual dictionary are described. The target is to create a language model for a speech recognition system for the Tunisian Broadcast News.
In (Duh & Kirchhoff, 2006), a Levantine lexicon was built using transductive learning through partially annotated text. For the purpose of sentiment analysis of social networks data, a dedicated lexicon for slang sentimental words and idioms was developed in (Hedar & Doss, 2013).
In (Cavalli-Sforza et al., 2013), an Iraqi WordNet was presented based on the MSA WordNet, the English WordNet, and an English-Iraqi dictionary. A Tunisian dialect WordNet was built in (Bouchlaghem & Elkhlifi, 2014) starting from a Tunisian corpus.

Building Corpora and Treebanks
In (Al-Sabbagh & Girju, 2012b), preliminary work on building a multi-genre corpus for Egyptian Arabic was described. The corpus data is compiled from Twitter, blogs, forums, and online knowledge market services. The paper addresses different aspects related to building dialectal Arabic corpora such as function-based web harvesting, dialect identification, vowel-based spelling variation, linguistic hypercorrection, unsupervised part-of-speech tagging and base phrase chunking for dialectal Arabic.
Using the web as a source was also described in (Almeman & Lee, 2013), where multi-dialect Arabic corpora were built for Gulf, Levantine, Egyptian and North African dialects. The work by Boujelbane et al. on building a lexicon for the Tunisian dialect can be cited here again because it also built a corpus from Tunisian broadcast news (Boujelbane et al., 2013).
In (Cotterell & Callison-Burch, 2014), a multi-dialect, multi-genre corpus for Egyptian, Gulf, Levantine, Maghrebi, and Iraqi dialects was presented. Another multi-dialect corpus based on Twitter data was built for seven different dialects. Preliminary work on a corpus for the Palestinian dialect with 43K words was presented in (Jarrar et al., 2014). A parallel corpus for the Algerian dialect and MSA was proposed in (Harrat et al., 2014) for the purpose of machine translation.
The pilot Levantine Arabic Treebank discussed in the Syntax and Parsing section also belongs here. For this treebank, conversational telephone speech with about 26,000 words was annotated with morphological and syntactic data. Recently, Maamouri et al. presented a treebank for the Egyptian dialect.
As the quality of the annotation process is essential for building accurate language resources, some researchers paid special attention to this process. In the COLABA work, multiple systems to develop NLP resources for Arabic dialects including Levantine, Egyptian, Moroccan, and Iraqi were presented. The systems utilized MAGEAD as well as the Buckwalter morphological analyzer and generator (BAMA) (Buckwalter, 2004). The ability of COLABA to process Arabic dialects was evaluated through the COLABA information retrieval system.
A web application for annotating Egyptian, Iraqi, Levantine, and Moroccan dialects has also been proposed. The authors pursue non-functional objectives including optimizing speed, accuracy, and efficiency while maintaining the security and integrity of the data. In (Zaidan & Callison-Burch, 2011), the building of a 52M-word Arabic online commentary dataset rich in dialectal content was presented. The long-term annotation effort to identify the dialect level in each sentence was also discussed. The authors of (Elfardy & Diab, 2012b) presented a set of guidelines for detecting code switching in Arabic on the word and token levels. These guidelines were used to annotate a corpus that is rich in Egyptian, Levantine, and Iraqi dialects with frequent code switching to MSA. In (Habash et al., 2008a), guidelines for identifying the level of dialectalness of a certain text were presented. Three levels of dialectalness were proposed: MSA with non-standard orthography, MSA words with dialect morphology, and dialectal lexemes.
In (Hawwari et al., 2014), a framework for classifying and annotating Egyptian multi-word expressions in a specialized computational lexicon was proposed. A graphical tool for annotating Moroccan tweets was presented in (Tratz et al., 2013).
Comprehensive guidelines for annotating an Arabic corpus including the Qatari dialect have also been proposed. The corpus is denoted Qatar Arabic Language Bank (QALB). Special attention in this work is paid to manual correction, which should provide training data for learning-based Arabic error correction tools.

Semantic-Level Analysis and Synthesis
Most work in this area relates to machine translation from or to Arabic dialects. Some papers treat other tasks such as information retrieval and sentiment analysis.

Machine Translation
In (Bakr et al., 2008), the authors proposed a hybrid approach to convert an Egyptian sentence into its corresponding diacritized MSA. The approach is generic, i.e., it can be extended to other Arabic dialects. Some techniques for lexical acquisition of colloquial words are developed.
In (Sawaf, 2010), a hybrid machine translation system was extended to handle Arabic dialects from 15 regions including Northern Iraq, Baghdad, Southern Iraq, Saudi Arabia, Southern Arabian Peninsula, Egypt, Sudan, Libya, Morocco, Tunisia, Lebanon, North Syria, Damascus, Palestine and Jordan. A decoding algorithm was developed to normalize non-standard, spontaneous and dialectal Arabic into Modern Standard Arabic.
In (Salloum & Habash, 2011), the quality of Arabic-English statistical machine translation was improved to deal with Levantine and Egyptian dialects using morphological knowledge. A simple rule-based approach was used to generate MSA paraphrases for dialectal Arabic out-of-vocabulary words and low frequency words.
In (Zbib et al., 2012), crowdsourcing was applied to build Levantine-English and Egyptian-English parallel corpora, consisting of 1.1M words and 380k words, respectively. The dialectal sentences were selected from a large corpus of Arabic web text, and translated using Amazon's Mechanical Turk. The data was used to build dialectal machine translation systems.
In (Jehl et al., 2012), the authors collected bilingual sentence pairs for training statistical machine translation systems to translate microblog messages. The paper addressed the Gulf, Levantine, and Egyptian dialects as well as MSA. The technique presented was found to perform better than other methods such as techniques based on extracting phrases from similar text.
In (Al-Gaphari & Al-Yadoumi, 2012), an algorithm was proposed that normalizes the Sana'ani dialect to MSA based on morphological rules. Input text is tokenized and each token is analyzed into stem and affixes. The stem and the affixes can be either dialect-specific, MSA-specific, or both. For each morphological rule the algorithm checks whether the rule can be applied.
In (Salloum & Habash, 2012), a rule-based approach for machine translation from Arabic dialects to MSA was presented. The approach relies on morphological analysis, morphological transfer rules and dictionaries in addition to language models to produce MSA paraphrases of dialectal sentences. The treated dialects are Levantine, Egyptian, Iraqi, and Gulf Arabic.
In (Mohamed et al., 2012), a translator from MSA to the Egyptian dialect was presented. Among other uses, this translation helps in the annotation of the Egyptian dialect and in the translation from this dialect to English.
In (Soltau et al., 2011), a corpus-based translator from MSA to Levantine was described. The translator is trained on corpora with a mixture of Levantine dialect and MSA.
The Iraqi dialect was studied with respect to MT in two papers by Condon et al. In (Condon et al., 2010), a two-way evaluation of English-Iraqi dialog translation was performed. Four MT systems were evaluated and error types were specified. The English-Iraqi speech translation systems were evaluated using automated metrics. The study described the features of Iraqi speech data and the difficulties they pose for the evaluation of machine translation quality.
In (Jeblee et al., 2014), domain and dialect adaptation was suggested to produce a statistical machine translation system from English to the Egyptian dialect with MSA as a pivot. A machine translation system of the Moroccan dialect into MSA based on statistical models and a rule-based approach was proposed in (Tachicart & Bouzoubaa, 2014).

Other Semantic Tasks
Sentiment and subjectivity analysis (SSA) was treated in several papers. In (Abdul-Mageed et al., 2014), the authors investigated how to treat Arabic dialects and whether genre-specific features have a measurable impact on performance of a sentiment analyzer.
In (Hedar & Doss, 2013), a classifier for Arabic slang that applies sentiment analysis to classify news and comments on Facebook was presented.
In (Mourad & Darwish, 2013), the issue of limited Arabic SSA lexicons was addressed by providing baselines that employ Arabic specific processing including stemming, POS tagging, and tweets normalization. Also, a random graph walking algorithm was employed to expand SSA lexicons. Open issues in sentiment analysis were discussed in (El-Beltagy & Ali, 2013) and a sentiment lexicon for Egyptian dialect was presented.
Recently, other sentiment analysis systems for social media data were proposed in (Duwairi et al., 2014) and (Ibrahim et al., 2015) for the Jordanian and Egyptian dialects, respectively.
In (El-Fishawy et al., 2014), a microblog summarization technique based on machine learning for Egyptian dialect was presented. The results achieved were compared to several well-known algorithms such as SumBasic, TF-IDF, PageRank, MEAD, and human summaries. (Pasha et al., 2013) addressed the challenges of retrieving information in Arabic dialects, which have significant linguistic differences from Standard Arabic. The presented tool automatically generates dialect search terms with relevant morphological variations from English or Standard Arabic query terms.
In (Zirikly & Diab, 2014) and (Zirikly & Diab, 2015), different approaches for Named Entity Recognition in the Egyptian dialect were proposed. Named entity recognition in microblogs was also treated by Darwish and Gao, although mainly for MSA (Darwish & Gao, 2014).
In (Darwish & Magdy, 2014), a general study of Arabic information retrieval was presented. The survey includes different domains and applications of Arabic IR systems as well as the specific challenges in this NLP area.

Dialect Identification and Recognition
The recognition of dialectal content in an Arabic text or speech gained a special interest in the literature.

Dialect Identification in Text
Some of the previously cited work on text annotation, e.g., (Zaidan & Callison-Burch, 2011), or machine translation, e.g., (Soltau et al., 2011), implicitly includes components for dialect identification.
In (Habash et al., 2008b), standard annotation guidelines to identify a switching between MSA and an Egyptian or a Levantine dialect in written text were presented. The guidelines can be used to annotate large collections of data used for training and testing NLP tools.
In (Elfardy & Diab, 2013), a supervised approach on the sentence level was proposed to differentiate between MSA and the Egyptian dialect. Token-level labels are used to derive sentence-level features that are employed with other core and meta features to train a generative classifier that predicts the correct label for each sentence in the given input text. This work was extended to the Iraqi, Levantine and Moroccan dialects by the same authors in (Elfardy & Diab, 2012a).
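The step from token-level labels to sentence-level features can be illustrated with a small sketch. The label names `"da"` (dialectal) and `"msa"` as well as the three features below are assumptions chosen for illustration; they are not the feature set of (Elfardy & Diab, 2013), and the classifier that would consume these features is not shown.

```python
from collections import Counter


def sentence_features(token_labels):
    """Derive simple sentence-level features from per-token labels.

    token_labels: list of strings such as "da", "msa", or any other tag
    (hypothetical label inventory).
    """
    total = len(token_labels)
    if total == 0:
        return {"frac_dialect": 0.0, "frac_msa": 0.0, "num_tokens": 0}
    counts = Counter(token_labels)
    return {
        "frac_dialect": counts["da"] / total,  # share of dialectal tokens
        "frac_msa": counts["msa"] / total,     # share of MSA tokens
        "num_tokens": total,                   # sentence length in tokens
    }


if __name__ == "__main__":
    print(sentence_features(["da", "msa", "da", "other"]))
```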
In (Zaidan & Callison-Burch, 2012), the authors used a large annotated dataset to train and evaluate automatic classifiers for the sake of Arabic dialect identification. Given an Arabic sentence, the task consists in determining the variety of Arabic in which it is written. The variety can be MSA, Maghrebi, Egyptian, Levantine, Iraqi, or Gulf.
Recently, a naive Bayes classifier based on a character bi-gram model was proposed to identify 18 different Arabic dialects (Sadat, Kazemi, & Farzindar, 2014). Another identification approach for the Egyptian dialect is based on lexical, morphological, as well as phonological information.
annotations to identify Levantine, Gulf, Egyptian, Iraqi, and Maghrebi dialects. The identification of several Maghrebi dialects in addition to Syrian and Palestinian Arabic was an aspect in the cross-dialectical study proposed in (Harrat et al., 2015).
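As an illustration of how a naive Bayes classifier over character bi-grams can work in principle, consider the following minimal sketch. It is not the implementation of (Sadat, Kazemi, & Farzindar, 2014): the class name `BigramNaiveBayes`, the add-one smoothing, and the padding with spaces are assumptions, and any training data passed to it would be supplied by the user.

```python
import math
from collections import Counter, defaultdict


def char_bigrams(text):
    """Return overlapping character bi-grams, padding the text with spaces."""
    padded = f" {text} "
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


class BigramNaiveBayes:
    """Multinomial naive Bayes over character bi-grams (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # additive smoothing constant (assumed)
        self.bigram_counts = defaultdict(Counter)  # label -> bi-gram counts
        self.doc_counts = Counter()               # label -> number of training texts
        self.vocab = set()                        # all bi-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_bigrams(text)
            self.bigram_counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            counts = self.bigram_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            # Log prior plus sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            for gram in char_bigrams(text):
                score += math.log((counts[gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Character-level features of this kind need no tokenizer or morphological analyzer, which is convenient for dialects that lack a standard orthography.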

Dialect Recognition in Speech
In (Lei & Hansen, 2009), a factor analysis-based modeling technique was proposed to describe the composition of the supervector defined by the Gaussian Mixture Model for dialect identification. The method utilizes knowledge types of information contained in the transcript file of the data. The addressed dialects in this work are the Emirati, the Egyptian, the Iraqi, the Palestinian, and the Syrian dialects.
In (Biadsy et al., 2009), the authors described a system that automatically identifies the Arabic dialect (Gulf, Iraqi, Levantine, Egyptian and MSA) of a speaker given a sample of his/her speech.
In (Akbacak et al., 2011), the authors studied the effectiveness of recently developed language recognition techniques based on speech recognition models for the discrimination of Arabic dialects.
In (Belgacem et al., 2010), an automatic recognition system for Arabic dialects was proposed. The analyzed dialects are Tunisian, Moroccan, Algerian, Egyptian, Syrian, Lebanese, Yemeni, Iraqi, and Gulf. The proportion of vocalic intervals and the standard deviation of consonantal intervals are analyzed using the platform Alize and Gaussian Mixture Models.
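The two rhythm measures mentioned above, the proportion of vocalic intervals and the standard deviation of consonantal intervals, are often written %V and ΔC. The following minimal sketch computes them from a pre-segmented utterance. The input format (a list of `("V" or "C", duration)` pairs) and the use of the population standard deviation are assumptions; the segmentation itself and the Gaussian Mixture Model classification used in (Belgacem et al., 2010) are not shown.

```python
from statistics import pstdev


def rhythm_metrics(intervals):
    """Compute (%V, delta C) for one utterance.

    intervals: list of (kind, duration_in_seconds) pairs, where kind is
    "V" for a vocalic interval and "C" for a consonantal interval
    (hypothetical input format).
    """
    vocalic = [d for kind, d in intervals if kind == "V"]
    consonantal = [d for kind, d in intervals if kind == "C"]
    total = sum(vocalic) + sum(consonantal)
    if total == 0 or not consonantal:
        raise ValueError("need a non-empty utterance with consonantal intervals")
    percent_v = 100.0 * sum(vocalic) / total  # share of vocalic duration
    delta_c = pstdev(consonantal)             # spread of consonantal durations
    return percent_v, delta_c


if __name__ == "__main__":
    print(rhythm_metrics([("C", 0.1), ("V", 0.2), ("C", 0.3), ("V", 0.4)]))
```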
In (Zhang et al., 2013), the authors investigated variations to supervector pre-processing for dialect identification based on phone recognition-support vector machines. They studied the normalization of supervector dimensions in the pre-squashing stage, the impact of alternative squashing functions, and the N-gram selection for supervector dimensionality reduction. Addressed dialects include Iraqi, Gulf, Egyptian, and Levantine.
Speech recognition for Arabic dialects was addressed in (Kirchhoff & Vergyri, 2005), (Boujelbane et al., 2013), and (Alghamdi et al., 2008) for the Egyptian, Tunisian and Saudi dialects, respectively. In (Kirchhoff & Vergyri, 2005), the authors described the use of MSA acoustic data to improve the recognition of Egyptian conversational dialect. To simplify this task, the MSA data is vowelized automatically before combining it with the Egyptian conversational dialect data. The corpus building in (Boujelbane et al., 2013) was motivated by the need to create language models towards a speech recognition system for the Tunisian Broadcast News.
Recently, Ali et al. presented a system for Egyptian speech recognition that reduces word error rate using microblog data (Ali, 2014).
In (Alghamdi et al., 2008), the authors aimed to present a speech database by native speakers across Saudi Arabia. The paper shows an approach that enables researchers to select samples from a population to produce a speech database where a dialect map is unobtainable. The resulting corpus was used to train a speech recognition system.
In (Iskra et al., 2004), the results of the OrienTel project were presented. This European project dealt with building telephony databases across Northern Africa and the Middle East.

Discussion
Table 1 summarizes the discussed research work on Arabic NLP. The columns represent the different research areas and the rows show the different covered dialects. Based on this table and on the discussions in the previous sections the following comments can be made.
1. By counting all published works, it can be seen that the research on computational linguistics for dialectal Arabic, as an alternative to Modern Standard Arabic, is emerging. Given that the different Arabic dialects are spoken by more than 390 million people in total, the total amount of research conducted in this area is still very limited.
2. The most treated dialect in Arabic NLP is Egyptian Arabic. This may be attributed to the fact that Egypt is the country with the largest population in the Arab world. However, such a population argument fails to explain why Levantine Arabic has received relatively high attention, while the dialects of some population-rich countries such as Sudan, Morocco, and Algeria have been treated very poorly. The relatively high concentration on Levantine Arabic may be associated with geopolitical issues and the Middle-East conflict.
3. Most research effort has gone into building and annotating dialectical corpora due to the fact that dialectical Arabic is still a resource-poor language. Dialect identification and speech recognition were also researched intensively. Recall that these two tasks are frequently performed towards building language resources. While the morphology of dialectical Arabic was addressed in some papers, syntactical analysis is almost ignored in research.
4. The selection of the geographic granularity level on which Arabic dialects are treated is not clear. The majority of related work that addresses Levantine, for instance, treats this variety as one dialect. Levantine, however, is spoken in Syria, Jordan, Lebanon, and Palestine. In each of these countries, furthermore, many varieties can be identified.
From this discussion it is clear that the research on Arabic dialects should be enhanced both on the dialect level and on the topic level. A hierarchical scheme should be introduced to define the granularity of Arabic dialects so that researchers can be more specific in assigning their work to some dialect or dialects. The language resources that have been built, especially annotated corpora, should be made available to accelerate the research in this area. More research on the syntactical analysis of Arabic dialects is required to improve the quality of related tools.