No Army, No Navy: BERT Semi-Supervised Learning of Arabic Dialects

We present our deep learning system submitted to MADAR shared task 2, focused on Twitter user dialect identification. We develop tweet-level identification models based on GRUs and BERT in supervised and semi-supervised settings. We then introduce a simple yet effective method of porting tweet-level labels to the level of users. Our system ranks first in the competition, with a 71.70% macro F1 score and 77.40% accuracy.


Introduction
Language identification (LID) is an important NLP task that usually acts as an enabling technology in a pipeline involving another downstream task such as machine translation (Salloum et al., 2014) or sentiment analysis (Abdul-Mageed, 2017b,a). Although several works have focused on detecting languages in global settings (see Jauhiainen et al. (2018) for a survey), there has not been extensive research on teasing apart similar languages or language varieties. This is the case for Arabic, the term used to collectively refer to a large number of varieties with a vast population of native speakers (∼300 million). For this reason, we focus on detecting fine-grained Arabic dialects as part of our contribution to the MADAR shared task 2, Twitter user dialect identification (Bouamor et al., 2019).
Previous works on Arabic (e.g., Callison-Burch (2011, 2014); Elfardy and Diab (2013); Cotterell and Callison-Burch (2014)) have primarily targeted cross-country regional varieties such as Egyptian, Gulf, and Levantine, in addition to Modern Standard Arabic (MSA). These works exploited social data from blogs (Diab et al., 2010; Elfardy and Diab, 2012; Al-Sabbagh and Girju, 2012; Sadat et al., 2014), the general Web (Al-Sabbagh and Girju, 2012), online news site comment sections (Callison-Burch, 2011), and Twitter (Abdul-Mageed et al., 2014; Mubarak and Darwish, 2014; Qwaider et al., 2018). Other works have used translated data (e.g., Bouamor et al. (2018)) or speech transcripts (e.g., Malmasi and Zampieri (2016)). More recently, other works have reported larger-scale datasets at the country level. These include data spanning 10 to 17 different countries (Zaghouani and Charfi, 2018). To solve Arabic dialect identification, many researchers have developed models based on computational linguistics and machine learning (Elfardy and Diab, 2013; Salloum et al., 2014; Cotterell and Callison-Burch, 2014), and on deep learning. In this paper, we focus on using state-of-the-art deep learning architectures to identify the Arabic dialects of Twitter users at the country level. We use the MADAR Twitter corpus (Bouamor et al., 2019), comprising 21 country-level dialect labels. Namely, we employ a unidirectional Gated Recurrent Unit (GRU) as our baseline and pre-trained Multilingual Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) to identify dialect classes for individual tweets (which we then port to the user level). We also apply semi-supervised learning to augment our training data, with the goal of improving model performance.
* The title is a wordplay on the much-quoted metaphor (in Yiddish) of the Yiddish linguist Max Weinreich, "A language is a dialect with an army and navy". See: https://en.wikipedia.org/wiki/A_language_is_a_dialect_with_an_army_and_navy.
Our system ranks first in the shared task. The rest of the paper is organized as follows: data are described in Section 2. Section 3 introduces our methods, followed by experiments in Section 4. We conclude in Section 5.

Data
Twitter user dialect identification is the second sub-task of the 2019 MADAR shared task (Bouamor et al., 2019). This task is set up as fine-grained multi-class classification, where the corpus released by the organizers is labeled with the tagset {Algeria, Bahrain, Djibouti, Egypt, Iraq, Jordan, Kuwait, Lebanon, Libya, Mauritania, Morocco, Oman, Palestine, Qatar, Saudi Arabia, Somalia, Sudan, Syria, Tunisia, United Arab Emirates, Yemen}. The corpus is divided into train, dev, and test sets (with the test set shared without labels). For each tweet, the organizers released a user id and a tweet id, and participants needed to crawl the actual tweets. We were not able to crawl part of the data because it was no longer available on the Twitter platform. The distribution of the data in our splits after crawling is as follows: 2,036 (TRAIN-A), 281 (DEV), and 466 (TEST). For our experiments, we also make use of the task 1 corpus (95,000 sentences (Bouamor et al., 2018)). More specifically, we concatenate the task 1 data to the training data of task 2 to create TRAIN-B. Note that both DEV and TEST across our experiments are exclusively the data released in task 2, as described above. TEST labels were only released to participants after the official task evaluation. Table 1

Pre-processing & Architectures
With tweet ids at hand, we crawl users' tweets via the Twitter API. We remove all usernames, URLs, and diacritics from the data. For evaluation, we use accuracy and macro F1 score. For modeling, we use two main deep learning architectures, Gated Recurrent Unit (GRU) and Bidirectional Encoder Representations from Transformers (BERT). For GRU, we tokenize tweets into word sequences by white-space. For BERT input, we apply WordPiece tokenization. We set the maximal sequence length to 50 words/WordPieces. A GRU (Chung et al., 2014) is a simplification of long short-term memory networks (LSTM), which in turn are a version of recurrent neural networks. BERT (Devlin et al., 2018) dispenses with recurrence and convolution. Its model architecture is a multi-layer bidirectional Transformer encoder (Vaswani et al., 2017). It uses masked language modeling to enable pre-trained deep bidirectional representations, in addition to a binary next sentence prediction task. The pre-trained BERT can be easily fine-tuned on a large suite of sentence-level and token-level tasks. We also use semi-supervised learning in our modeling, as we explain next.
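The pre-processing steps described above can be sketched roughly as follows. This is a minimal illustration, not the exact rules used in our system: the regular expressions, the diacritic range, and the helper names `clean_tweet` and `tokenize_for_gru` are assumptions introduced here for the example.

```python
import re
from typing import List

# Assumed patterns; the exact cleaning rules are not specified in the text.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USERNAME_RE = re.compile(r"@\w+")
# Arabic diacritics (tashkeel) U+064B..U+0652 plus superscript alef U+0670.
DIACRITICS_RE = re.compile("[\u064B-\u0652\u0670]")

MAX_LEN = 50  # maximal sequence length stated in the text


def clean_tweet(text: str) -> str:
    """Remove URLs, usernames, and diacritics, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = USERNAME_RE.sub(" ", text)
    text = DIACRITICS_RE.sub("", text)
    return " ".join(text.split())


def tokenize_for_gru(text: str, max_len: int = MAX_LEN) -> List[str]:
    """White-space tokenization, truncated to at most max_len tokens."""
    return clean_tweet(text).split()[:max_len]
```

WordPiece tokenization for BERT would replace the white-space split with the tokenizer shipped with the pre-trained model; only the cleaning step is shared between the two architectures in this sketch.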

Semi-supervised Learning
Supervised deep learning requires a large number of labeled data points. For this reason, we investigate augmenting our training data with automatically-predicted tweets using semi-supervised learning (SSL). More specifically, we use self-training.
Self-training is a wrapper method for SSL (Triguero et al., 2015; Pavlinek and Podgorelec, 2017) where a classifier is initially trained on a small set of labeled samples D_l. Then, the learned classifier is used to classify the unlabeled sample set D_u. Based on the prediction output, the most confident samples are added, with their predicted labels, to the labeled set. The classifier can then be re-trained on the new 'labeled' set. This process can be repeated until all the samples from D_u are added to D_l or a given stopping criterion is reached. We now introduce our experiments.
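The generic self-training procedure just described can be sketched as the loop below. It is a model-agnostic illustration under stated assumptions: `train_fn`, the fixed confidence `threshold`, and the `max_rounds` cap are hypothetical choices made for this example, and any classifier that returns a label with a confidence score could be plugged in.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]                      # (text, label)
Predictor = Callable[[str], Tuple[str, float]]  # text -> (label, confidence)


def self_train(
    train_fn: Callable[[List[Example]], Predictor],
    labeled: Sequence[Example],
    unlabeled: Sequence[str],
    threshold: float = 0.9,
    max_rounds: int = 5,
):
    """Generic self-training: grow D_l with confident predictions on D_u."""
    labeled = list(labeled)   # D_l
    pool = list(unlabeled)    # D_u
    predict = train_fn(labeled)
    for _ in range(max_rounds):
        if not pool:
            break  # every unlabeled sample has been moved into D_l
        scored = [(x, *predict(x)) for x in pool]
        confident = [(x, y) for x, y, p in scored if p >= threshold]
        if not confident:
            break  # stopping criterion: nothing confident enough to add
        labeled.extend(confident)
        absorbed = {x for x, _ in confident}
        pool = [x for x in pool if x not in absorbed]
        predict = train_fn(labeled)  # re-train on the enlarged labeled set
    return predict, labeled, pool
```

The loop returns the final predictor together with the enlarged labeled set and whatever remains unlabeled, so the caller can inspect how much of D_u was absorbed.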

Experiments
We describe our four main sets of experiments. We present (i) our baseline model, GRU (Section 4.1), (ii) fine-tuning the BERT-Base, Multilingual Cased model for dialect identification (Section 4.2), (iii) semi-supervised learning with unlabeled data (Section 4.3), and (iv) user-level dialect identification (DID) (Section 4.4).

GRU
We train a baseline GRU network on TRAIN-A. This network has a one-layer unidirectional GRU with 500 units and a linear output layer. The input word tokens are embedded by trainable word vectors, which are initialized from a standard normal distribution (µ = 0, σ = 1), i.e., W ∼ N(0, 1). We apply Adam (Kingma and Ba, 2014) with a fixed learning rate of 1e−3 for optimization. For regularization, we use a dropout (Srivastava et al., 2014) rate of 0.5 on the hidden layer. We set the maximal sequence length in our GRU model to 50, and choose an arbitrary vocabulary size of 10,000 words. We employ batch training with a batch size of 8 on this model. We run the network for 10 epochs and save the model at the end of each epoch, choosing the model that performs highest on DEV as our best model. We report our best result on DEV in Table 2. Our best result is acquired with 3 epochs. As Table 2 shows, the baseline obtains accuracy = 46.81% and F1 = 28.84.
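For reference, the two metrics reported throughout (accuracy and macro F1) can be computed along the following lines. This is a plain-Python sketch rather than the official shared-task scorer; in particular, averaging over every class that appears in either the gold or the predicted labels is an assumption of this example.

```python
from typing import Sequence


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of items whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over classes seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because macro F1 weights every class equally, a model can gain accuracy on frequent classes while its macro F1 stays low, which is why both numbers are reported.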

BERT
As mentioned earlier, we use the BERT-Base Multilingual Cased model released by the authors. The model is trained on 104 languages (including Arabic), with 12 layers of 768 hidden units each, 12 attention heads, and 110M parameters in the entire model. The model has a shared WordPiece vocabulary of 119,547 tokens, and was pre-trained on the entire Wikipedia for each language. For fine-tuning, we use a maximum sequence size of 50 tokens and a batch size of 32. We set the learning rate to 2e−5 and train for 10 epochs. We use the same hyper-parameters in all of our BERT models. We fine-tune BERT on the TRAIN-A and TRAIN-B sets, and call the resulting models BERT-A and BERT-B, respectively. As Table 2 shows, both BERT models acquire better performance than the GRU model. On accuracy, BERT-A is 1.69% better than the baseline, and BERT-B is 1.95% better than the baseline. BERT-B obtains 34.87 F1, which is 5.03 better than the baseline and 0.94 better than BERT-A. The best model across these two sets of experiments is BERT-B, which obtains the best accuracy and F1. Hence, we use BERT-B in our following semi-supervised learning experiments.

Semi-supervised Learning
As we mentioned earlier, we apply self-training in order to augment our training set. For this purpose, we use an in-house unlabeled Arabic dataset of 9,981,965 tweets. We refer to this unlabeled dataset as unlabeled-10M. We pre-process unlabeled-10M using the same method as the rest of our data. We use the best model from Section 4.2 (i.e., BERT-B, which is trained on TRAIN-B) to predict dialect labels for unlabeled-10M. To obtain the best performance, we investigate various settings for selecting the most reliable samples before adding them to our training data. These settings are based on the per-class value in the softmax/output layer, as follows: (i) Top-N%: We select the samples that obtain the top n% softmax values and add them, with their predicted labels, to TRAIN-B. We refer to the new training set as N SEMI. (ii) Top-N% Class: We also extract the samples that obtain the top n% softmax values within each country class and add them to our training data, referring to the new training set as N Class SEMI. In our experiments, we choose n from the set {5%, 10%, 25%}. Then, we fine-tune the BERT-Base, Multilingual Cased model on the resulting six new training sets (e.g., 5% SEMI, 5% Class SEMI, 10% SEMI) with the same hyper-parameters as in Section 4.2. We evaluate on DEV. For reference, BERT-N denotes the model trained on N SEMI, and BERT-N Class SEMI denotes the model trained on N Class SEMI. We describe these six training sets in Table 3. As Table 4 shows, most semi-supervised models outperform BERT-B. For accuracy, the best model is BERT-10% (acc = 49.34%) with 4 epochs. It is 0.639% higher than BERT-B. For F1, the best model is BERT-5% (F1 = 35.931) with 3 epochs. We use these two models in the following user-level DID. Since the official metric of the shared task is macro F1 score, we also consider BERT-25% Class SEMI as a candidate model for user-level DID, since it acquires better F1 than BERT-10%, as Table 4 shows.
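The two selection settings, Top-N% and Top-N% Class, can be illustrated with the sketch below. It assumes each prediction is a (tweet, predicted label, softmax value) triple, that the softmax value is the probability of the predicted class, and that the cut-off count is rounded down; the function names are hypothetical and introduced only for this example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Prediction = Tuple[str, str, float]  # (tweet_text, predicted_label, softmax_value)


def _top_fraction(preds: Sequence[Prediction], fraction: float) -> List[Prediction]:
    """Keep the `fraction` of predictions with the highest softmax value."""
    ranked = sorted(preds, key=lambda t: t[2], reverse=True)
    k = int(len(ranked) * fraction)  # assumption: round down
    return ranked[:k]


def select_top_n(preds: Sequence[Prediction], fraction: float) -> List[Prediction]:
    """Top-N%: most confident samples over the whole unlabeled pool."""
    return _top_fraction(preds, fraction)


def select_top_n_per_class(preds: Sequence[Prediction], fraction: float) -> List[Prediction]:
    """Top-N% Class: most confident samples within each predicted class."""
    by_class: Dict[str, List[Prediction]] = defaultdict(list)
    for p in preds:
        by_class[p[1]].append(p)
    selected: List[Prediction] = []
    for label in sorted(by_class):
        selected.extend(_top_fraction(by_class[label], fraction))
    return selected
```

The global variant can end up dominated by classes the model predicts confidently, whereas the per-class variant guarantees that every predicted class contributes samples in proportion to its size in the pool.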

User-level DID
Our aforementioned models identify dialect at the tweet level, rather than directly detecting the dialect of a user. Hence, we use tweet-level predicted labels (and associated softmax values) as a proxy for user-level labels. For each predicted label, we use the softmax value as a threshold for including only the most confidently predicted tweets. Since in some cases softmax values can be low, we try all threshold values between 0.00 and 0.99 and take the softmax-based majority class as the user-level predicted label, tuning the threshold on our DEV set. Figure 1 provides the performance of the BERT-25% Class SEMI model on DEV using different softmax threshold values. Note that the shared task allows a maximum of three submitted models. For these, we chose the top 3 models in Table 4 (i.e., BERT-5%, BERT-10%, and BERT-25% Class SEMI). As a precaution, we also include BERT-B when we tune at the user level on DEV. We then use only the 3 models that perform best on DEV as our official task submission. As Table 5 shows, the best three systems on DEV are BERT-B, BERT-5%, and BERT-25% Class SEMI. For the 34 unavailable users, we assigned the majority class in TRAIN-A (i.e., 'Saudi Arabia'). According to Table 5, our best system on the TEST set is BERT-5%, with 77.40% accuracy and 71.70 F1. It ranks first in the shared task.
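The porting of tweet-level predictions to the user level can be sketched as a thresholded majority vote with a sweep over threshold values. This is an illustrative reading of the procedure, not the exact implementation: the function names are hypothetical, and falling back to a vote over all of a user's tweets when none passes the threshold is an assumption made for this example.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

TweetPred = Tuple[str, float]  # (predicted_label, softmax_value)


def user_label(tweets: Sequence[TweetPred], threshold: float, fallback: str) -> str:
    """Majority label among tweets whose softmax value is >= threshold."""
    kept = [label for label, p in tweets if p >= threshold]
    if not kept:
        # Assumption: if no tweet is confident enough, vote over all tweets.
        kept = [label for label, _ in tweets]
    if not kept:
        return fallback  # user with no available tweets
    return Counter(kept).most_common(1)[0][0]


def tune_threshold(
    users: Dict[str, List[TweetPred]],
    gold: Dict[str, str],
    fallback: str,
    score_fn: Callable[[List[str], List[str]], float],
) -> Tuple[float, float]:
    """Try thresholds 0.00, 0.01, ..., 0.99 and keep the best one on DEV."""
    best_t, best_score = 0.0, float("-inf")
    for step in range(100):
        t = step / 100
        pred = [user_label(users.get(u, []), t, fallback) for u in gold]
        score = score_fn(list(gold.values()), pred)
        if score > best_score:  # ties keep the lowest threshold
            best_t, best_score = t, score
    return best_t, best_score
```

Here `fallback` plays the role of the majority class assigned to users whose tweets could not be crawled, and `score_fn` would be macro F1 when tuning for the official metric.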

Conclusion
In this paper, we described our submission to the MADAR shared task 2, focused on user-level Arabic dialect identification. We showed how we acquire effective models using various supervised and semi-supervised methods, porting tweet-level labels to the user level. Our semi-supervised model with BERT achieves the best results in the official task evaluation. In the future, we will investigate more extensive semi-supervised methods to improve performance.