Arabic Dialect Identification for Travel and Twitter Text

This paper presents the results of experiments carried out as part of the MADAR Shared Task on Arabic Fine-Grained Dialect Identification at WANLP 2019. Dialect identification is a prominent task in natural language processing, as subsequent language modules can be improved based on its output. We explored features such as character and word n-grams and language model probabilities across different classifiers. Results show that these features improve dialect classification accuracy. They also show that traditional machine learning classifiers tend to perform better than neural network models on this task in a low-resource setting.


Introduction
In general, Arabic refers to a wide spectrum of native languages used in the Middle East and North Africa. As noted by Zaidan and Callison-Burch (2014), the native languages of Arabic speakers differ from each other and from Modern Standard Arabic (MSA). These native languages, or dialects, can be categorized by their common linguistic features and geographical locations; this categorization is described in detail in Bouamor et al. (2019). In an era of expanding communication technology, automatic identification of these dialects has become an essential task for major natural language applications, including Machine Translation (Ling et al., 2013), Speech Recognition, Tourist Guides (Alshutayri and Atwell, 2017), Real-time Disaster Management (Alkhatib et al., 2019) and health care. The task at hand was to identify the dialect of a given sequence of Arabic text. As per the shared task, these texts were either tourist help-guide sentences (subtask1) or social media text (subtask2).

Related Work
Dialect identification is a well-known task in the natural language processing community. Work on dialect processing exists for languages such as English, German and Chinese (Jauhiainen et al., 2018). It can broadly be categorized into spoken-level and text-level tasks. This categorization also includes work on resource creation for dialects (Zaidan and Callison-Burch, 2014) as well as on building robust systems for dialect identification. In Arabic, dialect identification is a prerequisite for many subsequent NLP tasks. Spoken dialect identification work can be found in Biadsy et al.

• MADAR Twitter User Dialect Identification: this subtask requires predicting the country of origin for a given Twitter user. We treat this as a pipeline of two tasks. First, we classify each tweet according to its country. Since each user can tweet several times, the user-to-country mapping is then decided by the frequency of these per-tweet predictions: each user is mapped to the country predicted most often for the tweets s/he posted.

We utilized the features and model described in  as baselines for Arabic dialect identification on Corpus-26. We aimed to replicate their model, which used a multinomial naive Bayes classifier (Pedregosa et al., 2011) on character and word n-grams with language model scores as features to obtain state-of-the-art accuracy.
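The tweet-to-user aggregation described above can be sketched as a simple majority vote; the function name and the tie-breaking behaviour below are illustrative assumptions, not our exact implementation:

```python
from collections import Counter

def map_user_to_country(tweet_countries):
    """Return the country predicted most often across a user's tweets."""
    # Counter.most_common(1) gives the (label, count) pair with the
    # highest count; ties are broken by first insertion order.
    return Counter(tweet_countries).most_common(1)[0][0]
```

For a user whose tweets were classified as, say, two Egyptian and one Saudi, the user would thus be mapped to Egypt.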

Data
The details of the datasets used for training, development and testing in the different subtasks are given in Tables 1 and 2. In Table 1, the training data was distributed into 26 classes, named MADAR-Corpus-26, where each class had 1600 samples (26 × 1600 = 41600 training sentences). Each class had 200 samples in the dev data (26 × 200 = 5200 sentences).
Type    #Sentences
train   41600
dev     5200
test    5200

Preprocessing is a necessary step when handling textual data. The preprocessing steps involved in the subtasks are detailed below:
• Tokenization and Normalization: We did not use any off-the-shelf tokenizer for the tweets. We used the standard technique of tokenizing the text on white space for both tasks.
• Text cleaning (Tweets): Unlike standard texts, tweets can contain different spelling variations of words, special characters, Twitter handles and URLs due to the limited space. We ran experiments to observe the impact of removing Twitter handles and URLs on the overall classification accuracy. Since removing these terms adversely affected the classification score, we chose to keep the tweets as they were.
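Under these choices, tokenization reduces to a plain white-space split that leaves handles and URLs intact (a minimal sketch):

```python
def whitespace_tokenize(text):
    # str.split() with no argument splits on any run of whitespace,
    # so handles (@user) and URLs survive as single tokens.
    return text.split()
```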

Feature Engineering
The features used for subtask1 were similar to those used in . Three different machine learning models were explored. All the models mentioned below were implemented using the scikit-learn (Pedregosa et al., 2011) machine learning library.

• Logistic Regression
The individual features used in different subtasks are explained in detail here.
• Subtask1
-TF-IDF: We used different combinations of word- and character-level n-grams for the tasks. We observed that combining word- and character-level n-gram TF-IDF vectors performed significantly better than individual word or character TF-IDF vectors. For our final submissions, a combination of word unigrams and character-level n-grams was used, where n lies in {2, 3, 4, 5}.
-Language Modeling: We trained different language models (LMs) for the two types of corpora available to us.
We trained a language model on the sentences of each particular class for both MADAR-Corpus-6 (6 LMs) and MADAR-Corpus-26 (26 LMs). Two features derived from these language models were included in the machine learning models for subtask1. The coarse probabilities mentioned in Table 3 came from the scores of the language models trained on the MADAR-6 corpus. The final language model score was obtained by adding the scores of the word and character 5-gram LMs for both corpora.
• Subtask2: For the first classification task in subtask2, we used the same word and character TF-IDF features and the same classifiers as mentioned in subtask1. As an additional feature, we used the dialect probabilities present in column 4 of the provided data; these dialect probabilities were obtained from the best model in . We followed an ensemble approach for the classification task. Some of the tweets were unavailable in the training set, and some tweets consisted of only English tokens, so the Arabic dialect probabilities were missing for them. We therefore used two separate classifiers with the following features to handle the different types of data:
-Word unigram and character 2-5 gram TF-IDF vectors plus dialect probabilities, for tweets containing Arabic text
-Word unigram and character 2-5 gram TF-IDF vectors, for tweets containing no Arabic text or only URLs or Twitter handles
During testing, the appropriate classifier was used for inference depending on the available features. For tweets that were unavailable, we marked 'Saudi Arabia' as the country of origin, because most of the tweets in the training set came from users in Saudi Arabia.
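The word-plus-character TF-IDF setup described above can be sketched with scikit-learn's FeatureUnion and a multinomial naive Bayes classifier; the toy romanized training sentences and labels below are our own illustrative stand-ins for the MADAR data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

# Concatenate word-unigram and character 2-5 gram TF-IDF vectors.
features = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 1))),
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(2, 5))),
])
model = Pipeline([("tfidf", features), ("clf", MultinomialNB())])

# Toy examples standing in for the MADAR corpora.
X_train = ["marhaba kifak", "ezayak 3amel eh", "shlonak shino akhbarak"]
y_train = ["LEV", "EGY", "GLF"]
model.fit(X_train, y_train)
```

In the full system, the language model score and coarse probability features were used alongside these TF-IDF vectors.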

Deep Models
For subtask1, we also tried a deep learning based classifier, using the character- and word-level TF-IDF features described above as input to a multi-layer perceptron (MLP).
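Since the exact MLP configuration is not detailed here, the sketch below shows one plausible scikit-learn setup; the hidden layer size, iteration count and toy training data are our assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Character 2-5 gram TF-IDF features feeding a small MLP.
mlp = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2, 5))),
    ("clf", MLPClassifier(hidden_layer_sizes=(128,), max_iter=500,
                          random_state=0)),
])

# Toy examples standing in for MADAR-Corpus-26.
X_train = ["marhaba kifak", "ezayak 3amel eh"]
y_train = ["LEV", "EGY"]
mlp.fit(X_train, y_train)
```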

Observations
We observed that all the classifiers performed similarly when all the features were used. The combination of character- and word-level TF-IDF vectors performed better than character- or word-level TF-IDF vectors in isolation. The language models trained at the word and character level were the biggest contributors to the system's performance for subtask1; TF-IDF features and coarse probabilities did not add much to the overall accuracy. Logistic regression and multinomial naive Bayes performed significantly worse for subtask2, so we did not report those results in this paper. Machine learning approaches performed marginally better than the multi-layer perceptrons, possibly because deep learning approaches try to learn far more parameters than traditional approaches. One of the main reasons for the lower classification accuracy in subtask2 was our assumption of assigning 'Saudi Arabia' as the country of origin for unavailable tweets.

Conclusion and Future Work
We presented our experiments on the supervised dialect identification task (MADAR) in Arabic. Our experiments demonstrate that for a relatively low-resource task such as MADAR, traditional machine learning algorithms with feature engineering hold their own against deep learning approaches. As future work, unlabelled Arabic corpora could be used to learn character and word embeddings, and it would be interesting to explore how recurrent neural networks perform on this task.