Morphologically Annotated Corpora for Seven Arabic Dialects: Taizi, Sanaani, Najdi, Jordanian, Syrian, Iraqi and Moroccan

We present a collection of morphologically annotated corpora for seven Arabic dialects: Taizi Yemeni, Sanaani Yemeni, Najdi, Jordanian, Syrian, Iraqi and Moroccan Arabic. The corpora collectively cover over 200,000 words, and are all manually annotated in a common set of standards for orthography, diacritized lemmas, tokenization, morphological units and English glosses. These corpora will be publicly available to serve as benchmarks for training and evaluating systems for Arabic dialect morphological analysis and disambiguation.


Introduction
As Arabic dialects (DA) become more widely written in social media, there is increased interest in the Arabic NLP community in annotated corpora that allow us both to study the dialects linguistically and to create systems that can automatically process dialectal text. There have been important efforts to create relatively large corpora for Egyptian (Maamouri et al., 2014), Palestinian (Jarrar et al., 2014), and Emirati Arabic. While these resources are very helpful for single dialects, there are many dialects, and in fact it is often unclear what to count as separate dialects (for example, the subdialects of Levantine). Therefore, we present a different approach in this paper: we annotate seven dialects, but with relatively smaller corpora (most around 30,000 words). Some of the dialects are closely related (Jordanian and Syrian), others are more distant (Moroccan). We use the same annotation methodology for all dialects: same guidelines, same processing steps, and same annotation file format. This makes our effort an ideal starting point for experimenting with using multidialectal resources to create and train NLP tools. The dialects we consider are Taizi Yemeni (YE.TZ), Sanaani Yemeni (YE.SN), Najdi (SA.NJ), Jordanian (JOR), Syrian (SY.DM), Iraqi (IR.BG), and Moroccan (MA.RB).
The paper is structured as follows. We start with a review of relevant literature (Section 2). We then summarize some linguistic facts about DA in general (Section 3) and subsequently present each of our seven dialects in Section 4, summarizing the corpora used and some interesting facts specific to each dialect. Section 5 then presents our annotation methodology. We then briefly discuss morphological analyzers, and conclude.

Related Work
Data Collections There have been several data collections centered on Arabic dialects, specifically spoken Arabic. A very useful resource is the Semitisches Tonarchiv at the University of Heidelberg in Germany. We have included two Yemeni transcriptions from this resource in our YE.TZ and YE.SN corpora. Khalifa et al. (2016) is a large collection of over 100M words from a number of Arabic dialects, although the majority is from the Gulf. Another large corpus contains parallel text from 25 Arab cities. Further data collections include Al-Amri (2000), which has not yet been digitized for use in NLP research.
Annotated Corpora There are few annotated corpora for dialectal Arabic: the Levantine Arabic Treebank (specifically Jordanian) (Maamouri et al., 2006), the Egyptian Arabic Treebank (Maamouri et al., 2014), Curras, the Palestinian Arabic annotated corpus (Jarrar et al., 2014), the Gulf Arabic Annotated corpus, Syrian, Jordanian dialectal corpora (Bouamor et al., 2014; Harrat et al., 2014), a small effort on Sanaani and Moroccan (AlShargi et al., 2016) (which this paper builds on), and SUAR (Al-Twairesh et al., 2018), a morphologically annotated corpus for Najdi and Hijazi which is semi-automatically annotated using the MADAMIRA tool (Pasha et al., 2014) and subsequently manually checked. Additionally, Voss et al. (2014) present a corpus of Moroccan dialect which has been annotated for language variety (code switching). Several of these efforts have followed the approach of Curras (Jarrar et al., 2014), which consists of around 70,000 words of a balanced genre corpus. The corpus was manually annotated using the DIWAN tool (Alshargi and Rambow, 2015), which we also use. The annotation in Curras is done by first using a morphological tagger for another Arabic dialect, namely MADAMIRA Egyptian (Pasha et al., 2014), to produce a base that was then corrected or accepted by a trained annotator.

Other NLP Resources for Dialectal Arabic
The effort to annotate corpora in context is a central step in developing morphological analyzers and taggers (Eskander et al., 2013). However, other notable approaches and efforts that do not use annotated corpora have focused on developing specific resources manually or semi-automatically, e.g., the Egyptian Arabic morphological analyzer (Habash et al., 2012b), which is built upon the Egyptian Colloquial Arabic Lexicon (Kilany et al., 2002), the multidialectal dictionary Tharwa, or extending MSA analyzers and resources (Harrat et al., 2014; Boujelbane et al., 2013).

Dialects: Linguistic Facts
In this section we present some general facts and phenomena shared across different dialects. In subsequent subsections, we present our dialects in more detail and comment on the corpus sources.
Dialects and MSA Arabic dialects share many commonalities with Classical Arabic and Modern Standard Arabic (MSA). All variants of Arabic are morphologically complex, as they include rich inflectional and derivational morphology that is expressed in two ways, namely via templates and affixes. Furthermore, they contain several classes of attachable clitics. However, the dialects as a class differ in consistent ways from MSA, and they differ amongst each other. In fact, the differences between MSA and Dialectal Arabic (DA) have often been compared to those between Latin and the Romance languages (Chiang et al., 2006). The principal morpho-syntactic difference between DA and MSA is the loss of productive case marking and nunation (tanween) on nouns, and of mood marking on imperfective verbs.

Dialectal Variations
Differences among the dialects are found on all levels of linguistic description, i.e., phonology, morphology, syntax, and the lexicon. We summarize three salient phonological examples and three salient morphological examples for our dialects in Table 1: the pronunciation of MSA /q/ written q, MSA /dʒ/ written j and MSA /k/ written k; and the various forms of the future, progressive and possessive particles.
From a lexical point of view, there are many words that have different meanings across dialects. For example, the word mA$y /maːʃi/ is 'no' in YE.SN and MA.RB, 'yes/ok' in SY.DM and JOR, and 'walking' in SA.NJ. Another example is the word SAfy /sˤaːfi/, which means 'enough' in MA.RB, but 'pure' in the other dialects and MSA. Some cases show subtle differences in meaning, e.g., xdAm /xaddaːm/ means 'employee' generically in MA.RB, but it has a more specific and negative connotation in YE.TZ and YE.SN, namely 'enslaved servant'. While the above cases are all homonyms (homophones and homographs), there are also cases of the same meaning being expressed in different ways, e.g., 'spoon' is mlEqp in MSA, metathesized mElqp in JOR and SY.DM, and xA$wqp in IR.BG.
Dialectal Orthography Since Arabic dialects do not have spelling standards, several previous efforts on Arabic dialect annotations (Maamouri et al., 2014; Jarrar et al., 2014) contributed to a movement that led to the creation of a common Conventional Orthography for Dialectal Arabic (CODA) (Habash et al., 2012a; Zribi et al., 2014). We also follow this approach to map from any spontaneous orthography in our data to CODA. The spirit of CODA is to define a common and consistent approach to spelling DA words that acknowledges their etymological and historical relationship with MSA and CA, but also maintains their uniqueness and independence. For example, if a DA word has an MSA cognate containing q, then its CODA spelling will use q even if the dialectal pronunciation is different. In contrast, DA morphemes are spelled in a way that reflects their DA uniqueness. For example, the SY.DM word Hnfyq /ħanfiːʔ/ 'we will wake up' is a cognate of MSA snfyq /sanafiːqu/: the future marker reflects the dialectal morphology and is not spelled as in MSA, but the stem is spelled as in MSA and thus the q does not reflect the dialectal pronunciation.
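The mapping from spontaneous orthography to CODA can be pictured as a token-level lookup. The following is only a minimal sketch: the table and function names are hypothetical, the entries are spontaneous/CODA pairs mentioned in the dialect-specific sections of this paper (in Buckwalter transliteration), and the actual annotation was done manually in DIWAN rather than by a script of this kind.

```python
from typing import List

# Hypothetical lookup table: spontaneous spelling -> CODA spelling.
# The pairs are examples taken from this paper; the table itself is
# illustrative and is not part of any released tool.
SPONTANEOUS_TO_CODA = {
    "hl>": "hlq",      # SY.DM 'now'
    "gTwp": "gdwp",    # YE.SN 'tomorrow'
    "mnyb": "mAnyb",   # SA.NJ 'I am not'
    "n$tmE": "njtmE",  # SY.DM 'we meet'
}


def to_coda(token: str) -> str:
    """Return the CODA spelling if the token is in the table, else the token unchanged."""
    return SPONTANEOUS_TO_CODA.get(token, token)


def normalize(sentence: str) -> List[str]:
    """Split on whitespace and map each token to its CODA spelling."""
    return [to_coda(tok) for tok in sentence.split()]
```

A pure lookup cannot resolve spellings that depend on context, which is one reason a human annotator remains in the loop.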

Dialect-Specific Corpora
Until recently, Arabic was mostly written in Modern Standard Arabic (MSA) and Classical Arabic, while written DA was rare. One early source of written dialectal Arabic is textbooks for learning an Arabic dialect, intended for non-Arabic speakers. Furthermore, spoken language has sometimes been recorded and transcribed. However, owing to the advent of the internet and its rapid growth among Arabic-speaking populations, written materials in DA are now more accessible and easier to obtain than they were in the past. These written materials are typically informal written conversations among participants or traditional folk literature such as short stories, poems, prose, thoughts and songs. These texts can be found in online forums, blogs, and postings on social media networks. All of our dialectal corpora consist of sources of various genres, collected from both online and print materials in order to cover many aspects of these dialects. Each of the YE.TZ, SA.NJ, IR.BG, and JOR corpora has 30K words, while YE.SN has 32K words, SY.DM has 35K words and MA.RB has 20K words. It should be noted that the data collected from the internet was written in Arabic characters, using "spontaneous" orthography since there are no orthographic standards for DA. Sentences written in the Roman alphabet were transcribed from the textbooks into the Arabic alphabet using CODA. All examples presented in the rest of this section are in CODA except where specified otherwise.
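The per-corpus sizes quoted above can be tallied to check the overall figure of over 200,000 words. This is a small sketch; the dictionary name is chosen here for illustration, and the sizes are the rounded figures given in the text.

```python
# Rounded corpus sizes in thousands of words, as stated in the text.
CORPUS_SIZE_K = {
    "YE.TZ": 30,
    "YE.SN": 32,
    "SA.NJ": 30,
    "JOR": 30,
    "SY.DM": 35,
    "IR.BG": 30,
    "MA.RB": 20,
}

total_k = sum(CORPUS_SIZE_K.values())
print(f"{len(CORPUS_SIZE_K)} dialects, about {total_k}K words in total")
# prints "7 dialects, about 207K words in total"
```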

Taizi Corpus (YE.TZ)
Sources The YE.TZ written data was collected manually from different resources such as forums, blogs, and social media networks. As for spoken data, half of the oral interviews were recorded and transcribed manually by the annotators; the remaining oral interview transcripts are taken from the Semitisches Tonarchiv (Section 2). The data includes wise anecdotes, proverbs, stories, poems, songs and dialogues.
Phonology and Orthography A distinguishing feature of YE.TZ is that MSA j /dʒ/ is pronounced as /g/, e.g., jml 'camel' /gamal/, and that MSA q /q/ retains its pronunciation. In that regard, CODA spellings were straightforward.
Morphology Similar to a number of other dialects but unlike MSA, negation is expressed as an enclitic $ 'not', e.g., ydxl+$ 'he does not enter'. The vocative particle is expressed as the proclitics yA 'Oh' and wA 'Oh', or as the enclitic Ah as in AmAh 'my mother'. The verbal proclitic qA 'already', which corresponds to MSA qd, frequently appears with past verbs, e.g., qA EmlnA 'we have already done that'.

Sanaani Corpus (YE.SN)
Sources The social texts were taken from a Sanaani Radio Station program called msEd wmsEdp, which addressed social issues and problems of the community. The oral interview transcripts were taken from the Semitisches Tonarchiv (Section 2). The interviews describe daily life, history and lifestyle in Sanaa. Folktales describing traditional stories handed down in Sanaa are taken from internet forums. Collections of wisdom sayings and tales of the famous wise man of Yemen "Ali walad Zaid" are taken from internet websites. Other texts were taken from social media, and include political events in Yemen, Sanaani jokes, religious sermons and transcripts that discuss the Sanaani dialect in MSA.
Phonology and Orthography MSA q /q/ is pronounced /g/ in YE.SN, including in religious contexts. For example, the word qmr 'moon' is pronounced /gamar/. This variation is not unique to YE.SN; other dialects such as IR.BG and JOR have it as well. This /g/ is often spontaneously spelled as q, which is consistent with CODA guidelines. A particularly distinctive phenomenon in YE.SN is the devoicing and emphasis of some instances of word-medial /d/, e.g., gdwp 'tomorrow' is pronounced /ɣutˤwa/ and as a result may be written spontaneously as gTwp.
Morphology As shown in Table 1, there are four future particles in YE.SN: + E+, Ed, + $ +, + y+. While + E+ may be used with 1st, 2nd, or 3rd person conjugated verbs, the rest are only used with 1st person singular conjugated verbs.

Najdi Corpus (SA.NJ)
Sources The SA.NJ corpus was collected from different sources that represent different genres: forums, poetry, jokes and tweets. We collected different posts from the Saudi web forum eqla3.com, including personal narratives (mainly sarcastic) and discussions. We also collected Najdi poems from the late twentieth century, mainly written by the contemporary Najdi poets Khalid AlFaisal, Mohammed bin Ahmed AlSudairy and Saad Bin Jadlan. We manually collected Najdi jokes from various online resources. Finally, on Twitter, we searched for distinctive Najdi keywords such as HnA 'we', qrw$p 'inconvenience', and mnyb 'I'm not'.
Phonology and Orthography As Table 1 shows, there are a number of phonological alternations in SA.NJ. The /dz/ variant of q /q/ and the /ts/ variant of k /k/ are rather restricted in their usage. Unlike MSA, SA.NJ shows no distinction between the pronunciation of MSA etymological /dˤ/ and /ðˤ/. These phenomena affect spontaneous orthography and had to be addressed in the CODA annotations.
Morphology One distinctive morphological feature of SA.NJ (and other Gulf Arabic dialects) is the use of the negation circumfix mA+ .. +b, as in mAnyb 'I am not' (spontaneously, often written as mnyb). Similar constructions exist in other dialects but are more productive, e.g., Egyptian mA+ .. +$ negates verbs in addition to pronouns. Unlike most DA and like MSA, SA.NJ retains some tanween (nunation). For example: >nA qAylK lk /ʔana gaːylin lak/ 'I said (active participle) to you'. However, as in MSA, the nunation is rarely written. Some morphological phenomena are becoming very rare, e.g., the use of ts for the 2nd person singular feminine pronominal enclitic is dying out among younger people and merging with the masculine form k.
Lexicon SA.NJ has some distinguishing words such as >bxS 'more expert', kfw 'good', and dAfwr 'nerd'. There are many borrowed words from English compared to borrowings from Turkish or Persian. For instance, the verb yflm is borrowed from English 'film' and means 'to act dramatically'.

Jordanian Corpus (JOR)
Sources The corpus includes written as well as spoken data. The written materials were drawn from internet sources such as forums, blogs, and social media. They include informal conversations among participants or traditional folk literature such as short stories, poems, prose, memoirs, and songs. As for spoken data, oral interviews and observations were recorded and transcribed by the annotators. Nearly 20 informants were interviewed by the researchers. Older as well as uneducated people are included in order to ensure the authenticity of the data. The JOR data included a mix of subdialects that reflect the multiplicity of DA forms, including markedly Palestinian as well as Jordanian variants. For this reason, we refer to this corpus simply as JOR.

Phonology and Orthography In some JOR sub-dialects, as with IR.BG, MSA k is affricated to /tʃ/, e.g., klb /tʃalb/ 'dog'. MSA q is realized in two forms, as /g/ and /ʔ/. Some of these phenomena result in different spontaneous spellings that are then normalized during annotation.
Morphology JOR's 2nd person feminine singular pronominal clitic has two alternations depending on the sub-dialect: ky /ki/ and k /ik/. Examples include $ftky or $ftk 'I saw you'; however, when following a vowel, both become ky /ki/, e.g., $Afwky 'they saw you'. Negation is marked with the enclitic $, as in bAswy$ 'I do not do'.

Syrian Corpus (SY.DM)
Sources The written data was collected manually from different online written resources such as forums, blogs, and social media networks. Among the data, there were anecdotes, proverbs, stories, some poems, songs and dialogues.
Phonology and Orthography SY.DM has a glottal stop phoneme /ʔ/ that is a cognate with either MSA Hamza ( & } > < ') or MSA Qaf q. In most spontaneous SY.DM orthography, the two forms are distinguished in a manner similar to CODA guidelines. A few exceptions include the word hl> 'now', which in CODA is written as hlq, highlighting its etymological link to hAlwqt 'this time'. Less common spelling variations include the devoicing of j /ʒ/ to /ʃ/, which may be reflected in spontaneous orthography, e.g., njtmE /niʒtmiʕ/ 'we meet' may appear as n$tmE /niʃtmiʕ/.
Morphology A distinguishing feature of SY.DM (and North Levantine) compared to South Levantine and a number of other dialects is the absence of the negation enclitic $. SY.DM makes use of a number of future particles in free distribution (see Table 1). The progressive particle Em can only be used to indicate active progression at the moment, while the progressive proclitic + b+ has a wider range from habitual to progressive.

Iraqi Corpus (IR.BG)
Sources The materials of the IR.BG corpus were obtained from social media websites, blogs and other online sources. The sources contain posts on political, social, and religious issues that touch upon the daily life of the Iraqi people. The sources include blogs, e.g., different sarcastic posts with a witty sense of humor gathered from the Iraqi blog $l$ AlErAqy, and short essays with commentary and views that sharply criticize the loss of traditional values and morals in Iraqi society after 2003. Proverbs, common sayings, and famous expressions were also collected from online blogs and forums.
Phonology and Orthography Some instances of MSA k appear as /tʃ/ in IR.BG, e.g., kAnt 'she was' /tʃaːnat/. Some of these cases appear in spontaneous orthography as t$ or even J/j (mostly due to Persian spelling influences). Some instances of MSA /q/ are pronounced as /g/, e.g., fwq 'above' /foːg/. Some of these cases appear in spontaneous orthography as G or k, also due to Persian influences.
Morphology A strong marker of IR.BG is the progressive proclitic + d+, e.g., $dtswq? 'what are you driving?'. IR.BG also has three future particles: rAH, rH, and + H+, which seem to be in free variation.
Lexicon The IR.BG lexicon has some distinguishing words such as >Twx 'little darker', and |ny 'I'. IR.BG has many loanwords from Kurdish, Persian, and Russian, e.g., Kurdish kAkh 'mister', Persian qndAg 'very weak tea or hot water and sugar', and Russian <stkAn 'a spindle-shaped tea cup'.

Moroccan Corpus (MA.RB)
Sources The corpus includes comments from the Moroccan news website hespress.com that have to do with sports, cinema, and education policy. The materials from forums include advice on social, religious, and economic issues. The oral interviews are transcriptions of people telling stories, most of which are events from their lives.
The folktales come from a Moroccan website that reprinted stories originally published in an encyclopedia of traditional Moroccan folktales.
The textbook examples include many basic greetings and expressions, as well as sample dialogues. The blog posts range in topic, but include relationship advice, recipes, and philosophical musings. The humor includes both short and long jokes from a few Facebook pages and one other website.
Phonology and Orthography Most MA.RB consonants are pronounced like their MSA equivalents; however, there are exceptions: dental consonants in MSA have become alveolar, so MSA v /θ/, * /ð/, and Z /ðˤ/ are pronounced /t/, /d/, and /dˤ/, respectively, in MA.RB. Such issues naturally interact with spontaneous orthography and are annotated as per CODA guidelines.
Morphology Among the set of dialects discussed here, MA.RB has the most distinct set of morphological features, such as its future, progressive and possessive particles (see Table 1). Like other North African dialects, and unlike MSA, MA.RB uses the prefix + n+ for imperfect first person singular, and distinguishes first person plural by adding the plural suffix +wA. Interestingly the imperfect first person singular in MA.RB looks like the imperfect first person plural in MSA and numerous other dialects. Finally, the perfect second person singular masculine and feminine both use the suffix ty, which corresponds to the feminine suffix in other DA.
Lexicon MA.RB has a number of loanwords from Berber, French and Spanish; and many speakers code-switch between Moroccan and French or Spanish.

Annotation Process
Process Overview To create new morphologically annotated corpora, we follow the basic approach of AlShargi et al. (2016): we utilize the DIWAN tool (Alshargi and Rambow, 2015) to build and annotate the seven DA corpora discussed above. The project team consists of annotators, dialect leads, and a project manager. The dialect leads verify the annotators' work, and the project manager organizes and monitors the progress of everyone using the tool in the project.

Annotation Steps First, the dialect leads collect the corpus text from different resources such as social media, forums, websites, etc. The next step is to develop dialect-specific annotation guidelines, including the CODA specification for normalized orthography. The dialect leads then train the annotators before annotation starts. The leads follow the annotators' work. The annotations are not approved until the dialect leads check them. Wrong annotations are sent back to the annotator for correction. After the first round of annotation is done, we perform a second round of error checking, using both manual inspection and scripts that check for coherent annotations. The result is a DIWAN file which includes the correct annotation for the entire corpus. In the last step, we automatically reformat the annotations into a format which is best suited for computational purposes; we perform a third round of error checking for format errors, which we fix automatically. Figure 1 shows these steps.

Morphological Features Annotated
The DIWAN interface assists human annotators in annotating each token with morphological and semantic information, including the following fields:
• The CODA spelling of the raw token.
• The lemma, or the citation form, of the token.
• The morphemes of the word (prefixes, stem, suffixes) and their part-of-speech (POS). The stem is marked by the symbol # on either side.
• The English gloss of the word.
The annotation for one sentence in different dialects is shown in Table 2. This is not actually a sentence from our corpora, of course; we have chosen it to illustrate the annotation.
Error Correction Linguistic annotation is carried out manually. In order to guarantee high levels of accuracy and precision, we performed extensive error checking and correction. After annotating the seven different corpora, the annotated words were compiled in the form of linguistic codes in either one file or separate files to be checked and corrected by a second reviewer. This form of error checking cannot, of course, identify annotation errors in context (for example, a noun misidentified as a verb); instead, this approach is efficient at finding impossible annotations. Examining the data showed that the most challenging part for the annotators was the annotation of suffixes, especially in long and complicated words. Examples of such errors are listed in Table 3.
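A context-free check for impossible annotations of the kind described above might look like the following sketch. Everything in it is an assumption made for illustration: the record layout, the POS labels, the undiacritized placeholder lemma, and the rule that each token has exactly one stem wrapped in # are hypothetical, and the real DIWAN file format and checking scripts may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TokenAnnotation:
    """Hypothetical record holding the annotated fields for one token."""
    raw: str                          # token as found in the source text
    coda: str                         # CODA spelling of the raw token
    lemma: str                        # citation form
    morphemes: List[Tuple[str, str]]  # (form, POS); the stem is written as #stem#
    gloss: str                        # English gloss


def is_stem(form: str) -> bool:
    """A stem is marked by the symbol # on either side."""
    return len(form) > 2 and form.startswith("#") and form.endswith("#")


def find_problems(ann: TokenAnnotation) -> List[str]:
    """Return descriptions of context-free problems in one annotation."""
    problems = []
    for name in ("raw", "coda", "lemma", "gloss"):
        if not getattr(ann, name).strip():
            problems.append(f"empty field: {name}")
    stems = [form for form, _ in ann.morphemes if is_stem(form)]
    if len(stems) != 1:  # assumed rule: exactly one stem per token
        problems.append(f"expected exactly one stem, found {len(stems)}")
    for form, pos in ann.morphemes:
        if not pos.strip():
            problems.append(f"missing POS for morpheme {form!r}")
    return problems


# Illustrative record based on the YE.TZ example ydxl+$ 'he does not enter';
# the POS labels and the lemma value are placeholders, not corpus data.
example = TokenAnnotation(
    raw="ydxl$",
    coda="ydxl$",
    lemma="dxl",
    morphemes=[("y", "PREFIX"), ("#dxl#", "VERB"), ("$", "NEG")],
    gloss="he does not enter",
)
print(find_problems(example))  # prints []
```

As with the manual review, a check like this can only flag structurally impossible records; it cannot tell whether a well-formed analysis is the right one in context.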

Distribution of Resources
All created resources will be freely available for research purposes from Columbia (http://innovation.columbia.edu).

Conclusion and Future Work
We presented a collection of morphologically annotated corpora for seven Arabic dialects, collectively covering over 200,000 words. All corpora were manually annotated in a common set of standards for orthography, diacritized lemmas, tokenization, morphological units and English glosses. These corpora will be publicly available to serve as benchmarks for training and evaluating systems for Arabic dialect morphological analysis and disambiguation.
In future work, we will use these resources to train morphological taggers. We also plan to extend the collection of dialects to include additional less-studied varieties, and to expand towards different historical and literature-based varieties of Arabic.

Acknowledgments
This work is supported by the Air Force Research Laboratory (AFRL) under a grant administered by Ball Aerospace. Alkhereyf is supported by the KACST Graduate Studies program. The views expressed here are those of the authors and do not reflect the official policy or position of the U.S. Department of Defense or the U.S. Government. We would also like to thank all the anonymous reviewers for their insightful and valuable comments and suggestions.