Hierarchical Deep Learning for Arabic Dialect Identification

In this paper, we present two approaches for Arabic Fine-Grained Dialect Identification. The first approach is based on Recurrent Neural Networks (BLSTM, BGRU) using hierarchical classification. The main idea is to separate the classification process for a sentence from a given text in two stages. We start with a higher level of classification (8 classes) and then the finer-grained classification (26 classes). The second approach is given by a voting system based on Naive Bayes and Random Forest. Our system achieves an F1 score of 63.02 % on the subtask evaluation dataset.


Introduction
Online platforms such as Social Media have become the default channel for people to actively participate in the generation of online content in different languages and dialects. Arabic is one of the fastest growing languages used on these platforms. There are many differences between Dialectal Arabic and Modern Standard Arabic which cause many challenges for Arabic language processing. Therefore, identifying the dialect in which posts are written is very important for understanding what has been written over these online platforms. Shoufan and Alameri (2015) presents a wide literature review of natural language processing for dialectical Arabic. The authors highlighted the huge lack of freely available dialectal corpora which was mentioned in (Zaghouani, 2014).
Although Arabic dialects are related but there are some lexical, phonological and morphological differences between them (Habash et al., 2013;Azab et al., 2013;Attia et al., 2012). Most recently, AL-Walaie and Khan, 2017) started to investigate the problem of the Arabic Dialect Identification with different classification methods.
In this paper, we are describing our work in the same research direction using the MADAR shared task corpus described in (Bouamor et al., 2019). The goal of this task is to classify a given text into one of 26 classes, corresponding to various dialects of Arabic language.
The remainder of this paper is organized as follows. In section 2, we describe the different techniques used in this work. In Section 3, we present our experimental setup and discuss the models and features used as well as our results. Finally, in Section 4 we conclude and give our future directions.

System Description
In the next few paragraphs, we will describe the two main methods we used in the MADAR shared task. The first one is based on deep learning with a hierarchical classification of dialects. The second one is based on the combination of Naive Bayes and Random Forest.

Hierarchical Deep Learning
We address the fine-grained identification of 25 dialects and the Modern Standard Arabic (MSA). Given the number of different dialects and the small size of the data set provided, deep learning algorithms didn't perform well. Our proposed method will aim to handle this problem by decreasing the number of classes the models need to predict. This is achieved using a hierarchical classification similar to the work described in Kowsari et al. (2017).
The classes are separated geographically and represent the dialects of 25 Arabic cities. Some of these dialects are remarkably similar, in particular for cities of the same country/region .Some dialects can be clustered to form a larger group. These groups are determined by the geographical distribution of the cities and the similarities between each dialect. This distribution is shown in table 1. A deep neural network (DNN) is trained to predict a group given a sentence. This model serves as the base for our system. Then for each dialect, a different model is trained. These models make predictions on their respective subset of dialects. Following this technique, two levels of DNNs are defined. First a base whose predictions are used to choose from a set of DNNs. The chosen one is then used to identify the dialect. The system architecture is presented in figure 1.

Vote Based Probabilistic Classifier
The low size of our data set made statistical models perform much better than the deep learning methods. Our proposed method will take into account the large number of classes by creating two different pipelines. The first one uses a Multinomial Naive Bayes. The second model uses a Random Forest Classifier. These models were implemented using the package scikit-learn (Pedregosa et al., 2011). The pipelines are pre-trained before they are given to the voting classifier. Then, the whole system is trained again to maximize the model performance for the dialects classification task. The data is first given into a count vectorizer then into a TF-IDF tranformer to extract meaningful information on word level. The voting classifier uses a hard voting method to select the model with the correct prediction.

Data
We used the data set provided by the MADAR Shared Task. The corpus covers the dialects of 25 Arab cities and the MSA. It is the same data set described in Bouamor et al. (2019) and . This corpus is composed of 2000 sentences translated to each dialect, with a total of 52000 sentences. We refer to this set as the MADAR corpus. We split this data set evenly between dialects in three parts: 80% constitutes the Train set, 10% the Dev and the last 10% the Test set. In our experiment, we limit the length of the sequences to 40 words and pad the sequences with zeros. For preprocessing we remove all non Arabic characters with the exception of Arabic numbers. To maximize the precision of the hierarchical deep learning system the input of the models is produced by a word2vec. The word2vec we used was trained separately using a database of over 32 million tweets. This data was downloaded using keywords extracted from the MADAR corpus. We used the score of a TF-IDF to find the most relevant words from each dialect. Tweets containing one of these words were downloaded and added to this data set. This way we could ensure a dialectal weight on the word embeddings.

Hierarchical Deep Learning
In our models, we used Bidirectional Long Short-Term Memory networks (B-LSTM) (Schuster and Paliwal, 1997). It consists of two LSTM networks running in parallel in different directions. Each LSTM generates a hidden representation: the first is generated by reading the input sequence from left to right and the second form right to left. This representations are then combined to compute the output sequence.
The architecture of the hierarchical system is composed of two levels (see the figure 1). The level one is a DNN with three layers: A B-LSTM    of 128 neurons followed by a fully-connected layer of size 64 and a fully-connected layer of size 8 with softmax activation for the output. The level two is a set of 7 DNNs. For each of this models the size of the layers and the type of Recurrent Neural Network (RNN) units used is different. This is done in order to adapt each model to the number of classes it has to handle as well as to have a proportional number of parameters with the size of the groups data set. The models utilize the following pattern: They are composed of three layers. The first is a RNN layer, either B-LSTM or a B-GRU with a size ranging between 32 and 64 units. Then a fully-connected layer of size ranging between 32 and 64. Finally a fully-connected layer with softmax activation for the output. All models were trained using the following parameters: batch size = 100, learning rate = 0.001, β1 = 0.9, β2 = 0.999, decay = 0. The cost function used was the cross entropy. Two gradient descent optimizers where used for training: the RMSProp and the Adamax. To metric the possible improvement of this system we compare the results with a baseline. This baseline is a deep neural network with a similar architecture as the ones found in the hierarchical system.

Vote Based Probabilistic Classifier
The statistical method performed much better than the Deep learning method. In this section we describe the pipeline using different parameters. To define the accuracy we used the F-1 macro average score. By changing parameters of each pipeline, our results change drastically. We found that for the Naive Bayes the alpha at 0.3 was giving the best performance. For the Random Forest Classifier (RFC), random states set to 2 was also giving the best results. Using 250 estimators and a 200 depth, the RFC was performing the best, leading up to a 4% increase in F1-score.

Results
The table 5 shows the result of each DNN in the hierarchical system. We notice good performance for some groups such as G3 and G2. However, the improvement in accuracy is not as substantial in most of the groups. Notably the performance of the seventh group only reaching a score of 0.55. This translates to a poor performance on the overall system. We see in   ble on similar dialects. For example, the first two sentences of table 6, are very similar. The first one is from Tunis whereas the second one is from Sfax which both belong to the same group. Because of this similarity, the models cannot make a correct distinction and often miss predict the correct label. Nonetheless, the statistical method provides good result when dialects are very close. Tunis and Sfax have both a good F1-score, even with some confusion due to similar sentences. However it struggles to identify dialects such as Mosul (MOS), Cairo (CAI) and Salt (SAL) which have a very low precision (table 2). The results can be explained by the fact that the amount of data available was very low which can lead to an overfitting of the deep learning model. The voting classifier perform 9% better (table 3).

Conclusion and Future Work
In this paper, we propose to use two different methods for Arabic dialect identification: the Hierarchical Deep Neural Network and the Hard Vot-ing Classifier. The hierarchical model uses two levels of DNNs where the first one predicts the group of a dialect, and the second one predicts the dialect according to the previous prediction. The method based on a statistical model is composed of a Multinomial Naive Bayes and a Random Forest Classifier connected by a Hard Voting Classifier. This model outperformed the F1-score results of the Hierarchical Deep Neural Network.
In the future, we plan to work on the combination of two neural networks. The output of the first model will be a vector composed of probabilities for each group. The second one, will take as input the sentence as well as the output of the previous model as a new feature.