Mawdoo3 AI at MADAR Shared Task: Arabic Tweet Dialect Identification

Arabic dialect identification is an inherently complex problem, as the Arabic dialect taxonomy is convoluted and aims to dissect a continuous space rather than a discrete one. In this work, we present machine and deep learning approaches to predict one of 21 fine-grained dialects from a set of given tweets per user. We adopted numerous feature extraction methods, most of which improved the final model, such as word embeddings, Tf-idf, and other tweet features. Our results show that a simple LinearSVC can outperform any complex deep learning model given a set of curated features. With a relatively complex user voting mechanism, we achieved a Macro-Averaged F1-score of 71.84% on MADAR shared subtask-2. Our best submitted model ranked second among all participating teams.


Introduction
In recent years, an extensive increase in the usage of social media platforms, such as Facebook and Twitter, has led to exponential growth in user-generated content. The nature of this data is diverse. It comprises different expressions, languages, and dialects, which has attracted researchers to understand and harness language semantics such as sentiment, emotion, dialect identification, and many other Natural Language Processing (NLP) tasks. Arabic is one of the most spoken languages in the world; its use by many nations spread across multiple geographical locations has led to the generation of language variations (i.e., dialects) (Zaidan and Callison-Burch, 2014).
In this paper, we tackle the problem of predicting a user's dialect from a set of their tweets. We describe our work on exploring different machine and deep learning methods in our attempt to build a classifier for user dialect identification as part of MADAR (Multi-Arabic Dialect Applications and Resources) shared subtask-2 (Bouamor et al., 2019). The task of user dialect identification can be seen as a text classification problem, where we predict the probability of a dialect given a sequence of words and other features provided by the task organizers. Besides reporting results from different models, we show how the provided dataset for the task is not straightforward and requires additional analysis, feature engineering, and post-processing techniques.
In the following sections, we describe the methods used to achieve our best model. Section 2 lists previous work, Section 3 analyzes the dataset, and Section 4 describes our models and different approaches. Section 5 compares and discusses empirical results, and Section 6 concludes.

Related Work
Recent work on the Arabic language tackles the task of dialect identification. Fine-grained dialect identification models have been proposed to classify 26 specific Arabic dialects with an emphasis on feature extraction, training multiple Multinomial Naive Bayes (MNB) models and achieving a Macro-Averaged F1-score of 67.9% for the best model.
In addition to traditional models, deep learning methods have tackled the same problem. The research proposed by Elaraby and Abdul-Mageed (2018) shows an enhancement in accuracy when compared to machine learning methods. Huang (2015) used weakly supervised data, i.e., distant supervision techniques, crawling data from Facebook posts and combining it with a labeled dataset to increase the accuracy of dialect identification by 0.5%.
In this paper, we build on top of these methods, including those of Elaraby and Abdul-Mageed (2018), by exploring machine and deep learning models to tackle the problem of fine-grained Arabic dialect identification.

Dataset
The dataset used in this work represents information about tweets posted from the Arabic region, where each tweet is associated with its dialect label (Bouamor et al., 2019). This dataset is collected from 21 countries: Algeria, Bahrain, Djibouti, Egypt, Iraq, Jordan, Kuwait, Lebanon, Libya, Mauritania, Morocco, Oman, Palestine, Qatar, Saudi Arabia, Somalia, Sudan, Syria, Tunisia, United Arab Emirates, and Yemen.
As shown in Figure 1, the distribution of tweets among countries is unbalanced. Around 35% of the tweets belong to Saudi Arabia, while only 0.08% belong to Djibouti.
The dataset contains 6 features for each user: the username of the Twitter user, the tweet ID, the language of the tweet as automatically detected by Twitter, probability scores over 25 city dialects and MSA (Modern Standard Arabic) for each tweet, obtained by running the best model described in , and, most importantly, the tweet text.
Each user has at most 100 tweets, all labeled with the same dialect, and appears in only one set. For example, if a user is listed in the training set, then that user will not appear in the development or test set. Moreover, the maximum length of each tweet is 280 characters, including spaces, URLs, hashtags, and mentions, which makes it challenging to identify dialects automatically (Twitter).
Another challenge of the dataset is that around 61% of the tweets are retweets, as shown in Table 1. This means that the majority of the tweets are re-posts of other Twitter users. For example, the tweet "RT @Bedoon120: https://t.co/sIKqXCUSAn" for the user @abushooooq8 is an Egyptian tweet, but it has the label Kuwait because the user who retweeted it is Kuwaiti (i.e., @abushooooq8), while the original author is Egyptian (i.e., @Bedoon120). Table 1 shows the distribution of available and unavailable data across the different sets. It is also worth mentioning that around 10% of the data is missing; some tweets are not accessible because they were deleted by the author or the owner's account was deactivated.

Models
In this section, we explain our feature extraction methodology and then go over the various approaches we experimented with.

Feature Extraction
As a preprocessing step, normalization of Arabic letters is common when dealing with Arabic text. We adopted the same preprocessing methodology used in (Soliman et al., 2017).
AraVec: Pre-trained word embedding models proposed by (Soliman et al., 2017) for the Arabic language, built from three different datasets: Twitter tweets, World Wide Web pages, and Wikipedia Arabic articles. These models were trained using the Word2Vec Skip-gram and CBOW architectures (Mikolov et al., 2013). In our experiments, we used the 300-dimensional Twitter Skip-gram AraVec model.
Tf-idf: Tf-idf has proven efficient at encoding textual information into real-valued vectors that represent the importance of a word to a document. One drawback of the Tf-idf vectorized representation of text is that it loses word-order information (i.e., syntactic information). Considering n-grams, at both the word and character levels, reduces the effect of this drawback. Accordingly, unigram and bigram word-level Tf-idf vectors were extracted, in addition to character-level Tf-idf vectors with n-gram values ranging from 1 to 5.
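The feature extraction above can be sketched with Scikit-learn's vectorizers; the sample tweets are toy stand-ins and the remaining vectorizer settings are left at their defaults, which the paper does not specify:

```python
# Sketch of the Tf-idf feature extraction described above, using
# scikit-learn. The sample tweets are illustrative placeholders.
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = ["مرحبا كيف الحال", "شلونك اليوم", "وش اخبارك اليوم"]

# Word-level unigrams and bigrams.
word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
# Character-level n-grams ranging from 1 to 5.
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 5))

word_feats = word_vec.fit_transform(tweets)
char_feats = char_vec.fit_transform(tweets)

# Concatenate both views into a single sparse feature matrix,
# one row per tweet.
features = hstack([word_feats, char_feats])
print(features.shape[0])  # 3
```

In practice, the two vectorizers are fit on the training split only and reused to transform the development and test splits.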
Features specific to tweets: Some features are unique to Twitter, such as user mentions (e.g., @abushooooq8) and emojis. We found that using usernames as a feature can help the model infer the user dialect; for instance, it can easily learn that the users @7abeeb ksa, @a ksa2030, and @alanzisaudi have a Saudi Arabia dialect. Character-level unigram Tf-idf vectors were extracted from each of the mentioned features.

Classification Methods
We used a range of classification methods, from traditional machine learning methods to more complex deep learning techniques.

Machine Learning Approaches
Traditional models include linear and probabilistic classifiers with various feature engineering techniques. We used an SVM classifier, the LinearSVC implementation from the Scikit-learn library (Pedregosa et al., 2011). We used the LinearSVC model to predict the dialect given the tweet text represented as Tf-idf vectors, the username features, and the language model probabilities, as formulated in Equation 1:

ŷ = argmax_{d ∈ D} P(d | tfidf, tweet_feat, lm)    (1)

where ŷ is the predicted dialect, D is the probability space of all dialects, tfidf denotes the Tf-idf features extracted from a given tweet, tweet_feat denotes the tweet features, and lm denotes the language model probabilities.
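A minimal sketch of this classifier follows; the feature matrices are random toy stand-ins and the dimensions are assumptions, not the task's actual data, but the structure (concatenating sparse Tf-idf features with the dense language-model scores before fitting LinearSVC) matches the description above:

```python
# Toy sketch of the LinearSVC model over concatenated Tf-idf and
# language-model features. All data here is randomly generated.
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_tweets, n_tfidf, n_lm = 40, 50, 26  # 25 city dialects + MSA

tfidf = csr_matrix(rng.random((n_tweets, n_tfidf)))   # Tf-idf stand-in
lm_scores = rng.random((n_tweets, n_lm))              # language-model probabilities
X = hstack([tfidf, csr_matrix(lm_scores)])            # combined feature matrix
y = np.arange(n_tweets) % 3                           # toy dialect labels

clf = LinearSVC()          # linear SVM; its decision scores play the
clf.fit(X, y)              # role of the per-dialect scores in Eq. 1
pred = clf.predict(X)
print(pred.shape)  # (40,)
```

LinearSVC keeps the features sparse throughout, which matters at the scale of character n-gram Tf-idf matrices.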

Deep Learning Approaches
fastText Classification: The word embeddings of the words in an input sentence are fed into an averaging layer, and the result is then fed into a linear classifier with a softmax output layer.
SepCNN: Stands for Separable Convolutional Neural Network (Denk, 2017), and is composed of two consecutive convolutional layers. The first operates on the spatial dimension and is applied to each channel separately, while the second convolves over the channel dimension only. Word embeddings of the sentences are looked up from AraVec (Soliman et al., 2017). Then, the embeddings of each word in the sentence (i.e., the tweet) are passed into a number of SepCNN blocks followed by a max-pooling layer.
Word-level LSTM: A traditional deep learning classification method. The word sequence is passed into an AraVec layer to look up word embeddings and then fed into a number of LSTM layers. The final LSTM output is used as input to a softmax layer to predict the dialect (Liu et al., 2016).
Char-Level CNN: In this architecture, the input is represented as characters that are converted into 128-dimensional character embeddings. These embedding vectors are then passed into a number of one-dimensional convolutional layers. Each convolutional layer is followed by a batch normalization layer to optimize training and add a regularization factor. The final output is then passed into one hidden layer followed by a softmax output layer (Zhang et al., 2015).
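The character-level input representation feeding this architecture can be sketched as follows; the vocabulary, the reserved padding index, and the helper name `encode_chars` are assumptions for illustration, not the paper's implementation:

```python
# Sketch of the character-level input: each tweet becomes a
# fixed-length sequence of character indices that a 128-dimensional
# embedding layer would then look up.
import numpy as np

def encode_chars(text, vocab, max_len=280):
    # Map each character to its index; 0 is reserved for
    # padding and unknown characters (an assumed convention).
    ids = [vocab.get(ch, 0) for ch in text[:max_len]]
    return np.array(ids + [0] * (max_len - len(ids)), dtype=np.int64)

# Toy vocabulary over Arabic letters and the space character.
vocab = {ch: i + 1 for i, ch in enumerate("ابتثجحخدذرزسشصضطظعغفقكلمنهوي ")}

seq = encode_chars("مرحبا بالعالم", vocab)
print(seq.shape)  # (280,) -- the maximum tweet length
```

The fixed length of 280 mirrors the maximum tweet length noted in Section 3.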
Char-Level CNN and Word-level LSTM: A combination of the previous two methods. The output of the word-level LSTM is concatenated with that of the char-level CNN before both are passed into a hidden layer followed by a softmax output layer.
Char-Level CNN and Word-level CNN: In this network, words are transformed into word embeddings using AraVec and then concatenated with the output of a character-level CNN. The concatenated result is fed into an LSTM layer, whose output is then passed into a hidden layer and a softmax output layer to make the final prediction (Zhang et al., 2017).

Results and Discussion
Two types of experiments were conducted to evaluate our models. In the first, each tweet was treated independently with its corresponding label in the training and testing stages, without grouping tweets per user. All our experiments on MADAR shared subtask-2 were evaluated using the Macro-Averaged F1-score. Table 2 shows the accuracy and Macro-Averaged F1-score of the LinearSVC model. LinearSVC outperformed the other traditional machine learning models, so we omit their results. Deep learning models are known to generalize better on large datasets, but unexpectedly they underperformed the machine learning models.
The second type of experiment grouped predictions per user. Grouping was done either by combining all of a user's tweets into one document or by voting over per-tweet predictions. In the former, we applied LinearSVC on the combined data, averaging the language model scores over all tweets per user. This model achieved 77.33% accuracy and a 70.43% Macro-Averaged F1-score. In the latter, we took the output of the first model (the uncombined LinearSVC) and applied two voting techniques.
The first technique was user voting based on dialect weighting. This approach gives more emphasis to less frequent dialects by multiplying each predicted label by a weight associated with its dialect (dialect_weight), where max_count is the number of tweets for the largest dialect (i.e., Saudi Arabia), min_count is the number of tweets for the smallest dialect (i.e., Djibouti), and step is a range defined as the inverse cubic difference between the maximum and minimum dialect counts, divided by 5. dialect_weight is an integer between 1 and 6 that defines the dialect weight. Moreover, we found that increasing the weight of a retweet to 6 enhanced the accuracy of the model, and decreasing the weight of <UNAVAILABLE> tweets to 1 had a similar effect. The final user voting model achieved 81.67% accuracy and a 71.60% F1-score, making it our best model, as shown in Table 2.

The second voting technique is based on majority voting with a penalty on the largest dialect. In this approach, we take the most frequent label among a user's tweets as the final label for that user, selecting Saudi Arabia only if at least 75% of the predictions for that user were Saudi Arabia. This approach achieved 80.02% accuracy and a 71.84% Macro-Averaged F1-score.
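The second voting scheme can be sketched as follows. The fallback to the next most frequent label when Saudi Arabia misses the threshold is an assumption on our part (the paper does not spell it out), and the label strings are placeholders:

```python
# Sketch of majority voting with a penalty on the largest dialect:
# pick the most frequent per-tweet prediction, but accept
# "Saudi_Arabia" only when it covers at least 75% of a user's tweets.
from collections import Counter

def vote(predictions, dominant="Saudi_Arabia", threshold=0.75):
    counts = Counter(predictions)
    label, _ = counts.most_common(1)[0]
    if label == dominant and counts[label] / len(predictions) < threshold:
        # Assumed fallback: demote the over-represented dialect to the
        # next most frequent label, when one exists.
        rest = [l for l, _ in counts.most_common() if l != dominant]
        if rest:
            label = rest[0]
    return label

print(vote(["Saudi_Arabia", "Saudi_Arabia", "Egypt", "Kuwait"]))  # Egypt
print(vote(["Saudi_Arabia", "Saudi_Arabia", "Saudi_Arabia", "Egypt"]))  # Saudi_Arabia
```

With two of four votes, Saudi Arabia falls below the 75% threshold and is demoted; with three of four it meets the threshold and is kept.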

Conclusion
This paper describes various methods applied to MADAR shared subtask-2 to predict an Arabic dialect from a set of given tweets, usernames, and other features. Experimental results show that LinearSVC was the most powerful prediction model, achieving a higher Macro-Averaged F1-score than the other machine learning and deep learning models. Despite the substantial amount of unavailable tweets in our dataset, we were able to achieve a relatively high F1-score of 71.60% on the development set and 69.86% on the test set, ranking second in the competition.