AIDA2: A Hybrid Approach for Token and Sentence Level Dialect Identification in Arabic

In this paper, we present a hybrid approach for performing token- and sentence-level Dialect Identification in Arabic. Specifically, we try to identify whether each token in a given sentence belongs to Modern Standard Arabic (MSA), Egyptian Dialectal Arabic (EDA) or some other class, and whether the whole sentence is mostly EDA or MSA. The token level component relies on a Conditional Random Field (CRF) classifier that uses decisions from several underlying components such as language models, a named entity recognizer and a morphological analyzer to label each word in the sentence. The sentence level component uses a classifier ensemble system that relies on two independent underlying classifiers that model different aspects of the language. Using a feature-selection heuristic, we select the best set of features for each of these two classifiers. We then train another classifier that uses the class labels and the confidence scores generated by each of the two underlying classifiers to decide upon the final class for each sentence. The token level component yields a new state-of-the-art F-score of 90.6% (compared to the previous state of the art of 86.8%) and the sentence level component yields an accuracy of 90.8% (compared to 86.6% obtained by the best state-of-the-art system).


Introduction
In this age of social media ubiquity, we note the pervasive presence of informal language mixed in with formal language. The degree of mixing formal and informal language registers varies across languages, making them ever harder to process. The problem is quite pronounced in Arabic, where the difference between the formal Modern Standard Arabic (MSA) and the informal dialects of Arabic (DA) amounts to differences that are morphological, lexical, syntactic, semantic and pragmatic, exacerbating the challenges for almost all NLP tasks. MSA is used in formal settings, edited media, and education. The informal vernaculars, on the other hand, are spoken, are currently written in social media, and are penetrating formal media. There are multiple dialects corresponding to different parts of the Arab world: (1) Egyptian, (2) Levantine, (3) Gulf, (4) Moroccan, and (5) Iraqi. For each of these, sub-dialectal variants exist. Speakers/writers code switch between the two forms of the language, especially in social media text, both inter- and intra-sententially. Automatically identifying code-switching between variants of the same language (Dialect Identification) is quite challenging due to the lexical overlap and significant semantic and pragmatic variation, yet it is crucial as a preprocessing step before building any Arabic NLP tool. MSA-trained tools perform very badly when applied directly to DA or to intra-sentential code-switched DA and MSA text (ex. Alfryq fAz bAlEAfyp bs tSdr qA}mp Aldwry, where the words correspond to MSA MSA DA DA MSA MSA MSA, respectively) 1 . Dialect Identification has been shown to be an important preprocessing step for statistical machine translation (SMT). (Salloum et al., 2014) explored the impact of using Dialect Identification on the performance of MT and found that it improves the results.
They trained four different SMT systems: (a) DA-to-English SMT, (b) MSA-to-English SMT, (c) DA + MSA-to-English SMT, and (d) a DA-to-English hybrid MT system, and treated the task of choosing which SMT system to invoke as a classification task. They built a classifier that uses various features derived from the input sentence, indicating, among other things, how dialectal the input sentence is, and found that this approach improved the performance by 0.9% BLEU points.
In this paper, we address the problem of token- and sentence-level dialect identification in Arabic, specifically between Egyptian Arabic and MSA. For the token level task, we treat the problem as a sequence labeling task by training a CRF classifier that relies on the decisions made by a language model, a morphological analyzer, a shallow named entity recognition system, a modality lexicon and other features pertaining to the sentence statistics to decide upon the class of each token in the given sentence. For the sentence level task, we resort to a classifier ensemble approach that combines independent decisions made by two classifiers and uses their decisions to train a new one. The proposed approaches for both tasks significantly outperform the current state of the art, while forming a pipelined system.

Related Work
Dialect Identification in Arabic has recently gained interest among Arabic NLP researchers. Early work on the topic focused on speech data. Biadsy et al. (2009) presented a system that identifies dialectal words in speech through acoustic signals. More recent work targets textual data. The main task for textual data is to decide the class of each word in a given sentence; whether it is MSA, EDA or some other class such as Named-Entity or punctuation and whether the whole sentence is mostly MSA or EDA. The first task is referred to as "Token Level Dialect Identification" while the second is "Sentence Level Dialect Identification".
For sentence level dialect identification in Arabic, the most recent works are (Zaidan and Callison-Burch, 2011), (Elfardy and Diab, 2013), and (Cotterell and Callison-Burch, 2014a). Zaidan and Callison-Burch (2011) annotate MSA-DA news commentaries on Amazon Mechanical Turk and explore the use of a language-modeling based approach to perform sentence-level dialect identification. They target three Arabic dialects (Egyptian, Levantine and Gulf) and develop different models to distinguish each of them against the others and against MSA. They achieve accuracies of 80.9%, 79.6%, and 75.1% for the Egyptian-MSA, Levantine-MSA, and Gulf-MSA classification, respectively. These results support the common assumption that Egyptian, relative to the other Arabic dialectal variants, is the dialect variant of Arabic most distinct from MSA. Elfardy and Diab (2013) propose a supervised system to perform Egyptian Arabic sentence identification. They evaluate their approach on the Egyptian part of the dataset presented by Zaidan and Callison-Burch (2011) and achieve an accuracy of 85.3%. Cotterell and Callison-Burch (2014b) extend the work of Zaidan and Callison-Burch (2011) by handling two more dialects (Iraqi and Moroccan) and targeting a new genre, specifically tweets. Their system outperforms Zaidan and Callison-Burch (2011) and Elfardy and Diab (2013), achieving classification accuracies of 89%, 79%, and 88% on the same Egyptian, Levantine and Gulf datasets. For token level dialect identification, King et al. (2014) use a language-independent approach that utilizes character n-gram probabilities, lexical probabilities, word label transition probabilities and existing named entity recognition tools within a Markov model framework.
Jain and Bhat (2014) use a CRF based token level language identification system that uses a set of easily computable features (ex. isNum, isPunc, etc.). Their analysis showed that the most important features are the word n-gram posterior probabilities and word morphology. Lin et al. (2014) use a CRF model that relies on character n-gram probabilities (tri- and quad-grams), prefixes, suffixes, the unicode page of the first character, capitalization case, alphanumeric case, and tweet-level language ID predictions from two off-the-shelf language identifiers: cld2 2 and ldig. 3 They increase the size of the training data using a semi-supervised CRF autoencoder approach coupled with unsupervised word embeddings.
MSR-India (Chittaranjan et al., 2014) use character n-grams to train a maximum entropy classifier that identifies whether a word is MSA or EDA. The resultant labels are then used, together with word length, the existence of special characters in the word, and the current, previous and next words, to train a CRF model that predicts the token level classes of words in a given sentence/tweet. In our previously published system AIDA (Elfardy et al., 2014) we use a weakly supervised rule based approach that relies on a language model to tag each word in the given sentence as MSA, EDA, or unk. We then use the LM decision for each word in the given sentence/tweet and combine it with other morphological information, in addition to a named entity gazetteer, to decide upon the final class of each word.

Approach
We introduce AIDA2, an improved version of our previously published tool AIDA. It tackles the problem of dialect identification in Arabic on both the token and sentence levels in mixed Modern Standard Arabic (MSA) and Egyptian Dialectal Arabic (EDA) text. We first classify each word in the input sentence into one of the following six tags, as defined in the shared task on "Language Identification in Code-Switched Data" at the first workshop on computational approaches to code-switching [ShTk] (Solorio et al., 2014):
• lang1: the token is MSA (ex. AlwAqE, "The reality");
• lang2: the token is EDA (ex. m$, "Not");
• ne: the token is a named entity (ex. >mrykA, "America");
• ambig: the given context is not sufficient to identify the token as MSA or EDA (ex. slAm Elykm, "Peace be upon you");
• mixed: the token is of mixed morphology (ex. b>myT, meaning "I'm always removing");
• other: the token is or is attached to any non-Arabic token (ex. numbers, punctuation, Latin characters, emoticons, etc.).
The fully tagged tokens in the given sentence are then used, in addition to some other features, to classify the sentence as being mostly MSA or EDA.
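As a concrete illustration of the tag set, the code-switched example from the introduction can be represented as a simple token/label sequence. This is a minimal sketch; the transliterated tokens and their MSA/EDA labels come from the paper's own example, and the data structure is purely illustrative:

```python
# The six-way tag set used for token-level classification.
TAGS = {"lang1", "lang2", "ne", "ambig", "mixed", "other"}

# The introduction's code-switched example, labeled word by word
# ('lang1' = MSA, 'lang2' = EDA, per the gold labels given there).
tagged_sentence = [
    ("Alfryq", "lang1"),    # MSA
    ("fAz", "lang1"),       # MSA
    ("bAlEAfyp", "lang2"),  # EDA
    ("bs", "lang2"),        # EDA
    ("tSdr", "lang1"),      # MSA
    ("qA}mp", "lang1"),     # MSA
    ("Aldwry", "lang1"),    # MSA
]

assert all(tag in TAGS for _, tag in tagged_sentence)
```

The sentence-level label then summarizes this sequence: five of the seven tokens are lang1, so the sentence would be classified as mostly MSA.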

Token Level Identification
Identifying the class of a token in a given sentence requires knowledge of its surrounding tokens, since these surrounding tokens can be the trigger for identifying a word as MSA or EDA. This suggests that the best way to approach the problem is to treat it as a sequence labeling task. Hence we use a Conditional Random Field (CRF) classifier to classify each token in the input sentence. The CRF is trained using decisions from the following underlying components:
• MADAMIRA: a publicly available tool for morphological analysis and disambiguation of EDA and MSA text (Pasha et al., 2014). 4 MADAMIRA uses SAMA (Maamouri et al., 2010) to analyze the MSA words and CALIMA (Habash et al., 2012) for the EDA words.
We use MADAMIRA to tokenize both the language model data and the input sentences using the D3 tokenization scheme, the most detailed level of tokenization provided by the tool (ex. bAlfryq, "By the team", becomes "b+ Al+ fryq") (Habash and Sadat, 2006). This is important in order to maximize the Language Model (LM) coverage. Furthermore, we also use MADAMIRA to tag each token in the input sentence as MSA or EDA based on the source of the morphological analysis: if MADAMIRA analyzes the word using SAMA, the token is tagged MSA, while if the analysis comes from CALIMA, the token is tagged EDA. Out-of-vocabulary words are tagged unk.
• Language Model: a D3-tokenized 5-gram language model. It is built using the 119K manually annotated words of the training data of the shared task ShTk, in addition to 8M words of weblog data (4M from MSA sources and 4M from EDA ones). The weblogs are automatically annotated based on their source: if the source of the data is dialectal, all the words from this source are tagged EDA; otherwise they are tagged MSA. Since we use D3-tokenized data, all D3 tokens of a word are assigned the same tag as their corresponding word (ex. if the word "bAlfryq" is tagged MSA, then each of "b+", "Al+", and "fryq" is tagged MSA). At runtime, the Language Model classifier module creates a lattice of all possible tags for each word in the input sentence after it has been tokenized by MADAMIRA. The Viterbi search algorithm (Forney, 1973) is then used to find the best sequence of tags for the given sentence. If the input sentence contains out-of-vocabulary words, they are tagged unk. This module also provides a binary flag called "isMixed", which is "true" only if the LM decisions for the prefix, stem, and suffix are not all the same.
• Modality List: ModLex (Al-Sabbagh et al., 2013) is a manually compiled lexicon of Arabic modality triggers (i.e. words and phrases that convey modality).
It provides the lemma with a context and the class of this lemma (MSA, EDA, or both) in that context. In our approach, we match the lemma of the input word (as provided by MADAMIRA) and its surrounding context with an entry in ModLex, and we then assign this word the corresponding class from the lexicon. If we find more than one match, we use the class of the longest matched context. If there is no match, the word takes the unk tag. Ex. the word "Sdq", which means "told the truth", gets the class "both" in the context ">flH An Sdq", meaning "He will succeed if he told the truth".
• NER: a shallow named entity recognition module. It provides a binary flag "isNE" for each word in the input sentence. This flag is set to "true" if the input word has been tagged as ne.
It uses a list of all sequences of words that are tagged as ne in the training data of ShTk, in addition to the named entities from ANERGazet (Benajiba et al., 2007), to identify the named entities in the input sentence. This module also checks the POS provided by MADAMIRA for each input word: if a token is tagged with the noun prop POS, then the token is classified as ne.
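The decoding step of the Language Model component described above — building a lattice of candidate tags per token and finding the best tag sequence with Viterbi search — can be sketched as follows. This is an illustrative toy, not the paper's implementation: the emission scores stand in for the 5-gram LM probabilities, the two-tag set is simplified, and all numbers are invented:

```python
# Toy sketch of lattice construction + Viterbi decoding over per-token
# tag candidates. emit[(token, tag)] and trans[(prev_tag, tag)] hold
# (invented) log-probabilities; unseen tokens get a low default score,
# analogous to the unk handling described in the text.

TAGS = ["MSA", "EDA"]

def viterbi(tokens, emit, trans):
    """Return the highest-scoring tag sequence for `tokens`."""
    # best[tag] = (score of best path ending in tag, that path)
    best = {t: (emit.get((tokens[0], t), -10.0), [t]) for t in TAGS}
    for tok in tokens[1:]:
        new_best = {}
        for t in TAGS:
            score, path = max(
                (best[p][0] + trans[(p, t)] + emit.get((tok, t), -10.0),
                 best[p][1])
                for p in TAGS
            )
            new_best[t] = (score, path + [t])
        best = new_best
    return max(best.values())[1]

emit = {("fAz", "MSA"): -0.2, ("fAz", "EDA"): -2.0,
        ("bs", "MSA"): -3.0, ("bs", "EDA"): -0.3}
# Toy transitions: staying in the same tag is cheaper than switching.
trans = {(p, t): (-0.4 if p == t else -1.2) for p in TAGS for t in TAGS}

print(viterbi(["fAz", "bs"], emit, trans))  # → ['MSA', 'EDA']
```

The same machinery generalizes to per-segment (prefix/stem/suffix) decisions, which is what allows the "isMixed" flag to detect disagreement among a word's D3 segments.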
Using these four components, we generate the following features for each word:
• MADAMIRA-features: the input word, prefix, stem, suffix, POS, MADAMIRA decision, and associated confidence score;
• LM-features: the "isMixed" flag, in addition to the prefix-class, stem-class, suffix-class and the confidence score for each of them as provided by the language model;
• Modality-features: the Modality List decision;
• NER-features: the "isNE" flag from the NER module;
• Meta-features: "isOther", a binary flag that is set to "true" only if the input word is a non-Arabic token, and "hasSpeechEff", another binary flag set to "true" only if the input word has speech effects (i.e. word lengthening).
We then use these features to train a CRF classifier using the CRF++ toolkit (Sha and Pereira, 2003), and we set the window size to 16. 5 Figure 1 illustrates the different components of the token-level system.
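To make the training setup concrete, the per-token features listed above could be serialized into the column format that CRF++ expects: one token per line with whitespace-separated feature columns, the gold label last, and a blank line terminating each sentence. The feature subset and field names below are a hypothetical simplification of the full feature set, chosen only for illustration:

```python
# Sketch: serialize per-token feature dicts into CRF++-style rows.
# The chosen columns (POS, MADAMIRA tag, LM stem class, NE flag) are a
# hypothetical subset of the features described in the text.

def to_crfpp_rows(sentence):
    """sentence: list of per-token feature dicts; returns one training block."""
    rows = []
    for tok in sentence:
        rows.append(" ".join([
            tok["word"], tok["pos"], tok["madamira_tag"],
            tok["lm_stem_class"], str(int(tok["is_ne"])), tok["label"],
        ]))
    return "\n".join(rows) + "\n\n"   # blank line ends the sentence

sent = [
    {"word": "fAz", "pos": "verb", "madamira_tag": "MSA",
     "lm_stem_class": "MSA", "is_ne": False, "label": "lang1"},
    {"word": "bs", "pos": "conj", "madamira_tag": "EDA",
     "lm_stem_class": "EDA", "is_ne": False, "label": "lang2"},
]
print(to_crfpp_rows(sent))
```

CRF++ then expands these columns into binary features over a context window via its template file; the window size of 16 mentioned above controls how much of that context each token's features can see.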

Sentence Level Identification
For this level of identification, we rely on a classifier ensemble to generate the class label for each sentence. The underlying classifiers are trained on gold labeled data with sentence level binary decisions of either being MSA or EDA. Figure 2 shows the pipeline of the sentence level identification component. The pipeline consists of two main pathways with some pre-processing components. The first classifier (Comprehensive Classifier/Comp-Cl) is intended to cover dialectal statistics, token statistics, and writing style while the second one (Abstract Classifier/Abs-Cl) covers semantic and syntactic relations between words. The decisions from the two classifiers are fused together using a decision tree classifier to predict the final class of the input sentence. 6

Comprehensive Classifier
The first classifier is intended to explicitly model detailed aspects of the language. We identify multiple features that are relevant to the task and group them into different sets. Using the D3-tokenized version of the input data, in addition to the classes provided by the "Token Level Identification" module for each word in the given sentence, we conduct a suite of experiments using the decision tree implementation in the WEKA toolkit (Hall et al., 2009): we exhaustively search over all features in each group in the first phase, and then exhaustively search over all of the remaining features from all groups to find the combination of features that maximizes the 10-fold cross-validation accuracy on the training data. We explore the same features used by Elfardy and Diab (2013) in addition to three other features that we refer to as "Modality Features". The full list of features includes:
• Perplexity-Features [PF]: We run the tokenized input sentence through a tokenized MSA and a tokenized EDA 5-gram LM to get the sentence perplexity from each LM (msaPPL and edaPPL). These two LMs are built using the same data and the same procedure as the LMs used in the "Token Level Identification" module;
• Dia-Statistics-Features [DSF]:
- The percentage of words tagged as EDA in the input sentence by the "Token Level Identification" module (diaPercent);
- The percentage of words tagged as EDA and MSA by MADAMIRA in the input sentence (calimaWords and samaWords, respectively), and the percentage of words found in a precompiled EDA lexicon (egyWords) used and provided by Elfardy and Diab (2013);
- hasUnk: a binary feature set to "true" only if the language model of the "Token Level Identification" module yielded at least one unk tag in the input sentence;
- Modality features: the percentage of words tagged as EDA, MSA, and both (modEDA, modMSA, and modBoth, respectively) using the Modality List component in the "Token Level Identification" module.
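The two-phase feature-selection heuristic described above can be sketched as follows. This is a minimal stand-in, not the paper's WEKA pipeline: `cv_accuracy` is a toy scoring stub that stands in for the 10-fold decision-tree cross-validation accuracy, and the group contents are abbreviated with invented utilities:

```python
from itertools import combinations

# Phase 1: exhaustively search all feature subsets within each group
# and keep the per-group winners. Phase 2: exhaustively search over
# the union of the winners. `cv_accuracy` is a stand-in for the real
# cross-validation score; its additive "utilities" are invented.

def cv_accuracy(features):
    utility = {"msaPPL": 3, "edaPPL": 3, "diaPercent": 4,
               "sentLength": -1, "hasUnk": 1, "numPercent": -1}
    return sum(utility.get(f, 0) for f in features)

def best_subset(features):
    candidates = (
        subset
        for r in range(1, len(features) + 1)
        for subset in combinations(features, r)
    )
    return max(candidates, key=cv_accuracy)

groups = {
    "PF":  ["msaPPL", "edaPPL"],
    "DSF": ["diaPercent", "hasUnk", "numPercent", "sentLength"],
}

# Phase 1: best subset within each group.
retained = [f for g in groups.values() for f in best_subset(g)]
# Phase 2: best subset over all retained features.
final = best_subset(retained)
print(sorted(final))  # → ['diaPercent', 'edaPPL', 'hasUnk', 'msaPPL']
```

The exhaustive search is feasible here because each group is small; searching within groups first keeps the phase-2 candidate set manageable.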

Abstract Classifier
The second classifier, Abs-Cl, is intended to cover the implicit semantic and syntactic relations between words. It runs the input sentence in its surface form, without tokenization, through a surface-form MSA and a surface-form EDA 5-gram LM to get the sentence probability from each of the respective LMs (msaProb and edaProb). These two LMs are built using the same data as the "Token Level Identification" module LM, but without tokenization. This classifier complements the information provided by Comp-Cl. While Comp-Cl yields detailed and specific information about the tokens as it uses tokenized-level LMs, Abs-Cl is able to capture better semantic and syntactic relations between words since it can see longer context in terms of the number of words compared to that seen by Comp-Cl (on average, a span of two words in the surface-level LM corresponds to almost five words in the tokenized-level LM) (Rashwan et al., 2011).
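The difference in effective context span can be seen with a tiny sketch: D3 tokenization splits clitics into separate tokens, so a fixed-size n-gram window covers fewer surface words after tokenization. The segmentation of "bAlfryq" below is the paper's own example; the one-entry lookup table is a hypothetical stand-in for MADAMIRA's analysis:

```python
# Toy D3 segmentation table (hypothetical stand-in for MADAMIRA's
# output); "bAlfryq" -> "b+ Al+ fryq" is the paper's own example.
d3 = {"bAlfryq": ["b+", "Al+", "fryq"]}

def d3_tokenize(words):
    """Expand each surface word into its D3 segments (identity if unknown)."""
    return [seg for w in words for seg in d3.get(w, [w])]

surface = ["fAz", "bAlfryq"]
print(d3_tokenize(surface))  # → ['fAz', 'b+', 'Al+', 'fryq']
```

Here a 2-token window over the surface form spans both words, while the same two words occupy four positions after D3 tokenization, so a tokenized-level 5-gram sees roughly the context a surface-level 2- to 3-gram would.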

DT Ensemble
In the final step, we use the classes and confidence scores of the preceding two classifiers on the training data to train a decision tree classifier. Accordingly, an input test sentence goes through Comp-Cl and Abs-Cl, where each classifier assigns the sentence a label and a confidence score for this label. The decision tree then uses the two labels and the two confidence scores to produce the final classification for the input sentence.
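The ensemble step amounts to stacking: the two underlying predictions become a four-dimensional meta-feature vector for the combiner. The sketch below illustrates that feature construction; `meta_rule` is a hand-written stand-in for the learned decision tree, not the actual model:

```python
# Each underlying classifier contributes (label, confidence); those
# four values form the meta-feature vector for the decision-tree
# combiner. `meta_rule` is an illustrative stand-in for the learned
# tree, not the trained model itself.

def meta_features(comp_label, comp_conf, abs_label, abs_conf):
    return {"comp_label": comp_label, "comp_conf": comp_conf,
            "abs_label": abs_label, "abs_conf": abs_conf}

def meta_rule(x):
    # Toy rule: if the classifiers agree, keep the label;
    # otherwise trust the more confident classifier.
    if x["comp_label"] == x["abs_label"]:
        return x["comp_label"]
    return x["comp_label"] if x["comp_conf"] >= x["abs_conf"] else x["abs_label"]

x = meta_features("EDA", 0.61, "MSA", 0.87)
print(meta_rule(x))  # → MSA
```

A learned tree can of course express more nuanced splits than this rule, e.g. trusting Comp-Cl only above some confidence threshold, which is why the combiner is trained rather than hand-written.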

Data
To our knowledge, there is no publicly available standard dataset annotated at both the token and sentence levels that could be used to evaluate both levels of classification. Accordingly, we use two separate standard datasets for the two tasks.
For the token level identification, we use the training and test data that is provided by the shared task ShTk. Additionally, we manually annotate more token-level data using the same guidelines used to annotate this dataset and use this additional data for training and tuning our system.
• tokTrnDB: the ShTk training set. It consists of 119,326 words collected from Twitter;
• tokTstDB: the ShTk test set. It consists of 87,373 words of tweets collected from users unseen in the training set and 12,017 words of sentences collected from Arabic commentaries;
• tokDevDB: 42,245 words collected from weblogs and manually annotated in house using the same guidelines as the shared task. 7 We only use this set for system tuning, to decide upon the best configuration;
• tokTrnDB2: 171,419 words collected from weblogs and manually annotated in house using the same guidelines as the shared task. We use it as an extra training set, in addition to tokTrnDB, to study the effect of increasing the training data size on system performance. 8
Table 1 shows the distribution of each of these subsets of the token-level dataset.
(7: The task organizers kindly provided the guidelines for the task. 8: We expect to release both tokDevDB and tokTrnDB2, in addition to some other data still under development, to the community by 2016.)

Token Level Baselines
For the token level component, we evaluate our approach against the best published results of the systems participating in the ShTk evaluation test bed. These baselines include:
• IUCL: The best results obtained by King et al. (2014);
• IIIT: The best results obtained by Jain and Bhat (2014)

Sentence Level Baselines
For the sentence level component, we evaluate our approach against all published results on the publicly available Arabic "Online Commentaries (AOC)" dataset (Zaidan and Callison-Burch, 2011).

Token Level Evaluation
Table 3 compares our token level identification approach to all baselines. It shows that our proposed approach significantly outperforms all baselines using the same training and test sets. AIDA2 achieves a 90.6% weighted average F-score while the nearest baseline gets 86.8% (a 28.8% error reduction from the best published approach). By using both tokTrnDB and tokTrnDB2 for training, the weighted average F-score is further improved by 2.3%, as shown in the last row of the table.
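The relative error-reduction figures quoted throughout the results can be reproduced from the F-scores directly, taking the error to be 100 minus the F-score:

```python
# Relative error reduction = (old_error - new_error) / old_error,
# with error = 100 - F-score.

def error_reduction(old_f, new_f):
    old_err, new_err = 100 - old_f, 100 - new_f
    return 100 * (old_err - new_err) / old_err

# 86.8% (best previous system) -> 90.6% (AIDA2)
print(round(error_reduction(86.8, 90.6), 1))  # → 28.8
# 90.6% -> 92.9% (after adding tokTrnDB2 to the training data)
print(round(error_reduction(90.6, 92.9), 1))  # → 24.5
```

Both values match the reductions reported in the text (28.8% over the best baseline, and 24.5% from the additional gold training data).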

Sentence Level Evaluation
For all experiments, we use a decision-tree classifier as implemented in the WEKA toolkit (Hall et al., 2009). Table 4 shows the 10-fold cross-validation results on sentTrnDB (cross-validation accuracy on sentTrnDB using the best selected features in each group):
• "Comp-Cl" shows the results and the best set of features for each group, i.e. the ones that yield the best cross-validation results on sentTrnDB. "Best-of-all-groups" shows the result of the best selected features from the retained feature groups, which in turn is the final set of features for the comprehensive classifier. In our case the best selected features are msaPPL, edaPPL, diaPercent, hasUnk, calimaWords, modEDA, egyWords, latinPercent, puncPercent, avgWordLen, and hasDiac.
• "Abs-Cl" shows the results and best set of features (msaProb and edaProb) for the abstract classifier.
• "DT Ensemble" reflects the results of combining the labels and confidence scores from Comp-Cl and Abs-Cl using a decision tree classifier.
Among the different configurations, the ensemble system yields the best 10-fold cross-validation accuracy of 89.9%. We compare the performance of this best setup to our baselines on both the cross-validation and held-out test sets. As Table 5 shows, the proposed approach significantly outperforms all baselines on all sets.
Results Discussion

Token Level Results Discussion
Table 5: Results of using our best setup (DT Ensemble) against baselines
The last row of Table 3 shows that the system achieves a 24.5% error reduction by adding 171K words of gold data to the training set. This shows that the system has not reached saturation yet, which means that adding more gold data can further increase performance. Table 6 shows the confusion matrix of our best setup for all six labels over tokTstDB. The table shows that the highest confusability is between the lang1 and lang2 classes: 2.9% of tokens are classified as lang1 instead of lang2 and 1.6% are classified as lang2 instead of lang1. This accounts for 63.8% of the total errors. The table also shows that our system does not produce the mixed class at all, probably because of the tiny number of mixed cases in the training data (only 33 words out of 270.7K words). The same applies to the ambig class, as it represents only 0.4% of the whole training data. lang1 and ne are also quite highly confusable. Most ne words have another non-named-entity meaning, and in most cases these other meanings tend to be MSA. Therefore, we expect that a more sophisticated NER system would help in identifying these cases.
Table 6: Token-level confusion matrix for the best performing setup on tokTstDB (rows are gold labels, columns are predicted labels)

Gold\Pred  lang1   lang2   ambig   ne      other   mixed
lang1      55.7%   1.6%    0.0%    0.9%    0.0%    0.0%
lang2      2.9%    18.9%   0.0%    0.2%    0.0%    0.0%
ambig      0.1%    0.1%    0.0%    0.0%    0.0%    0.0%
ne         0.8%    0.2%    0.0%    10.3%   0.1%    0.0%
other      0.0%    0.0%    0.0%    0.0%    8.2%    0.0%
mixed      0.0%    0.0%    0.0%    0.0%    0.0%    0.0%

Table 7 shows examples of words that are misclassified by our system. The misclassified word in the first example (bED, meaning "each other") has the gold class other. However, the gold label is incorrect, and our system predicted it correctly as lang2 given the context. In the second example, the misclassified named entity refers to the name of a charitable organization, but the word also means "message", which is a lang1 word. The third example shows a lang1 word that is incorrectly classified by our system as lang2. Similarly, in the last example our system incorrectly classified a lang2 word as lang1.

Table 7: Examples of words that were misclassified by our system

Sentence Level Results Discussion
The best selected features show that Comp-Cl benefits most from using only 11 features. By studying the excluded features, we found that:
• Five features (hasSpeechEff, hasEmot, hasDecEff, hasExMark, and hasQuesMark) are zero for most records, hence extremely sparse, which explains why they are not selected as relevant distinctive features. However, it should be noted that the hasSpeechEff and hasEmot features are markers of informal language, especially in social media (not to ignore the fact that users write in MSA using these features as well, but much less frequently). Accordingly, we anticipate that if the data had more of these features, they would have a significant impact on modeling the phenomena;
• Five features are not strong indicators of dialectalness. For the sentLength feature, the average length of the MSA and EDA sentences in the training data is almost the same, while the numPercent, modMSA, modBoth, and hasRepPunc features are almost uniformly distributed across the two classes;
• The initial assumption was that SAMA is exclusively MSA while CALIMA is exclusively EDA, so the samaWords feature would be a strong indicator for MSA sentences and the calimaWords feature a strong indicator for EDA sentences. Yet on closer inspection, we found that in 96.5% of the EDA sentences, calimaWords is higher than samaWords, but in only 23.6% of the MSA sentences is samaWords higher than calimaWords. This means that the samaWords feature is not able to distinguish the MSA sentences efficiently. Accordingly, the samaWords feature was not selected as a distinctive feature in the final feature selection process.
Although modEDA is selected as one of the representative features, it only occurs in a small percentage of the training data (10% of the EDA sentences and 1% of the MSA sentences). Accordingly, as an ablation study, we repeated the best setup (DT Ensemble) without the modality features to measure their impact on performance. In the 10-fold cross-validation on sentTrnDB using Comp-Cl alone, we note that performance slightly decreased (from 87.3% to 87.0%). However, given the sparsity of the feature (it occurs in less than 1% of the tokens in the EDA sentences), a 0.3% drop in performance is significant. This shows that if the modality lexicon had more coverage, we would observe a more significant impact. Table 8 shows some examples of our system's predictions. The first example is correctly classified with high confidence (92%). Example 2 is quite challenging: the second word is a typo in which two words are concatenated due to a missing white space, while the first and third words can be used in both MSA and EDA contexts. Therefore, the system gives a wrong prediction with a low confidence score (59%); in principle this sentence could be either EDA or MSA. The last example should be tagged as EDA; however, our system tagged it as MSA with a very high confidence score (94%).

Conclusion
We presented AIDA2, a hybrid system for token- and sentence-level dialect identification in code-switched Modern Standard Arabic and Egyptian Dialectal Arabic text. The proposed system uses a classifier ensemble approach to perform dialect identification on both levels. In the token level module, we run the input sentence through four different components, each of which classifies each word in the sentence. A CRF model is then used to predict the final class of each word using the information provided by the four underlying components. The output of the token level module is then used to train one of the two underlying classifiers of the sentence level module. A decision tree classifier is then used to predict the final label of any new input sentence using the predictions and confidence scores of the two underlying classifiers. The sentence level module also uses a heuristic feature-selection approach to select the best features for each of its two underlying classifiers by maximizing the accuracy on a cross-validation set. Our approach significantly outperforms all published systems on the same training and test sets. We achieve a 90.6% weighted average F-score on token level identification, compared to 86.8% for the state of the art on the same data sets. Adding more training data further improves performance to 92.9%. On the sentence level, AIDA2 yields an accuracy of 90.8% using cross-validation, compared to the latest state-of-the-art performance of 86.6% on the same data.