Domain Adaptation for Arabic Cross-Domain and Cross-Dialect Sentiment Analysis from Contextualized Word Embedding

Finetuning deep pre-trained language models has shown state-of-the-art performance on a wide range of Natural Language Processing (NLP) applications. Nevertheless, their generalization performance drops under domain shift. In the case of the Arabic language, diglossia makes building and annotating corpora for each dialect and/or domain a more challenging task. Unsupervised Domain Adaptation tackles this issue by transferring the knowledge learned from labeled source domain data to unlabeled target domain data. In this paper, we propose a new unsupervised domain adaptation method for Arabic cross-domain and cross-dialect sentiment analysis from contextualized word embeddings. Several experiments are performed adopting the coarse-grained and the fine-grained taxonomies of Arabic dialects. The obtained results show that our method yields very promising results and outperforms several domain adaptation methods on most of the evaluated datasets. On average, our method increases performance by an improvement rate of 20.8% over zero-shot transfer learning from BERT.


Introduction
The Arabic language is characterized by two main varieties: Modern Standard Arabic (MSA) and dialectal Arabic. MSA has a standard written form and official status across the Arab countries, while dialectal Arabic refers to the informal spoken dialects of the Arab world (Habash, 2010). These dialects are used in daily life but have no standard written form (Saadane and Habash, 2015; Eryani et al., 2020). Geographically, and according to Zaidan and Callison-Burch (2014), Arabic dialects can be classified into five coarse-grained regional dialects: Egyptian, Levantine, Gulf, Iraqi, and Maghrebi. Recent studies have categorized dialectal Arabic at more fine-grained levels, including countries and cities (Bouamor et al., 2019; Muhammad et al., 2020). These dialects differ from one another and from MSA, to varying degrees, at different linguistic levels.
With the unprecedented reach of social media platforms, Sentiment Analysis (SA) has become a fundamental task for many applications. Most research in this area has been devoted to English and other European languages, while some studies have addressed the question of transfer learning from MSA to dialectal Arabic. However, Khaddaj et al. (2019) and Qwaider et al. (2019) have shown that zero-shot transfer learning from models trained on MSA data does not perform well for SA on dialectal Arabic data. Consequently, existing works have focused on building resources and annotating corpora for a few dialects, most of which were collected from social media (Medhaffar et al., 2017; Al-Twairesh et al., 2017; Baly et al., 2018; Moudjari et al., 2020; Oueslati et al., 2020). Nevertheless, dealing with Arabic dialects as standalone languages is challenging, since manually building such resources is costly and time-consuming.
It is well known that the generalization performance of Machine Learning (ML) models drops in the case of domain shift (out-of-distribution data). Hence, there is an imperative need to leverage existing labeled data from other related domains in order to address this challenge. The aim is to accurately transfer the knowledge learned from labeled source domain data to new target domain data. On the one hand, adaptive pre-training of contextualized word embedding models has shown effective transfer learning performance under domain shift (Han and Eisenstein, 2019; Rietzler et al., 2020). It consists of finetuning a pre-trained language model on a large unlabeled corpus from the target domain using the Masked Language Model (MLM) objective. On the other hand, self-training and domain-adversarial learning have been applied successfully to many NLP applications (Ramponi and Plank, 2020; Ganin et al., 2016). An effective method that combines domain-adversarial training and self-training is the Adversarial-Learned Loss for Domain Adaptation (ALDA) (Chen et al., 2020). Domain-adversarial training aligns both domains' output distributions, while self-training captures the discriminative features of the target domain data.
In this paper, we introduce a new unsupervised domain adaptation method for Arabic cross-domain and cross-dialect sentiment analysis based on the AraBERT language model (Antoun et al., 2020) and the Adversarial-Learned Loss for Domain Adaptation (ALDA) (Chen et al., 2020). Due to the limited amount of unlabeled data for most target domains and dialects, we do not rely on adaptive pre-training of the AraBERT model. Our method leverages the potential of: i) contextualized word embeddings to learn high-level text representations, ii) adversarial domain training to match the output distributions of domains and dialects, and iii) self-training to capture the discriminative features of the target domain data. To summarize, our main contributions are as follows:
• The proposition of a new unsupervised domain adaptation method for Arabic SA.
• The study of three possible challenging scenarios of domain adaptation for Arabic SA.
• The achievement of very promising results on several Arabic cross-domain and cross-dialect sentiment classification datasets.
To the best of our knowledge, this is the first study that investigates domain adaptation for cross-domain, cross-dialect, and cross-domain & cross-dialect sentiment analysis, adopting the coarse-grained and the fine-grained taxonomies of Arabic dialects. The proposed method outperforms several state-of-the-art methods on most test datasets.
The rest of this paper is organized as follows. Section 2 presents related work. In Section 3, we introduce our method. Section 4 describes the conducted experiments and discusses the obtained results. Finally, we conclude the paper and outline a few directions for future work.

Related Work
Unsupervised domain adaptation. In the past few years, there has been considerable interest in unsupervised domain adaptation for cross-domain NLP tasks, including cross-domain sentiment analysis (Ramponi and Plank, 2020). Previous work has focused on minimizing the discrepancy between domains by aligning the output distributions of the source and the target domains. Maximum Mean Discrepancy (MMD) (Gretton et al., 2012), KL-divergence (Zhuang et al., 2015), Correlation Alignment (CORAL) (Sun and Saenko, 2016), and domain-adversarial learning (Ganin et al., 2016) are among the most widely used methods to learn domain-invariant features. In the same vein, other researchers have adopted the self-training approach in order to learn discriminative features of the target domain (Ramponi and Plank, 2020). The latter approach enables the model to also be trained on some samples of the target domain. The main idea is to select a subset of pseudo-labels, predicted on the target domain inputs, for which the model's confidence is higher than a fixed threshold, and to incorporate them into the model loss. However, pseudo-labels are generally noisy and may hurt the performance of the model. Chen et al. (2020) have tackled this issue by introducing the adversarial-learned loss for domain adaptation, where the discriminator corrects the noise in the pseudo-labels by generating noise vectors that are specific to each domain.
Domain adaptation for cross-domain sentiment analysis. In order to learn cross-domain text representations, several domain adaptation methods have relied on pivot feature extraction. Inspired by structural correspondence learning, Yu and Jiang (2016) have proposed a method to learn continuous sentence embeddings employing a CNN model across various domains. Li et al. (2018) have introduced a domain adaptation method that can be extended to documents. The latter method uses a hierarchical attention transfer network for extracting pivot and non-pivot features between source and target domains. Ziser and Reichart (2018) have proposed a pivot-based language modeling objective to train a model from scratch rather than adapting a pre-trained embedding model.
Recently, several methods have been introduced for domain adaptation based on adaptive pre-training of contextualized word embeddings (Han and Eisenstein, 2019; Vu et al., 2020). This approach relies on the availability of a large amount of unlabeled data in the target domain to finetune/adapt an existing pre-trained language model using the MLM objective. Rietzler et al. (2020) have proposed an unsupervised domain adaptation method for aspect-target sentiment classification based on BERT adaptive pre-training. Vu et al. (2020) have presented an adaptive pre-training method that adversarially masks out tokens that are hard to reconstruct with the MLM. In another work, Du et al. (2020) have proposed to combine BERT domain-aware training and adversarial domain learning (Ganin et al., 2016) for cross-domain sentiment analysis. The domain-aware training combines adaptive pre-training using the MLM objective with a Domain Distinguish Task (DDT). For cross-domain and cross-lingual domain adaptation, an unsupervised feature decomposition method based on mutual information has been introduced to extract domain-invariant and domain-specific features using the XLM language model (Lample and Conneau, 2019).
For the Arabic language, Khaddaj et al. (2019) have introduced a domain adaptation method for cross-domain and cross-dialect sentiment analysis, combining domain-adversarial training (Ganin et al., 2016) with a denoising autoencoder for representation learning. The input sentences of both domains are represented using a bag-of-words representation built from the top 5,000 most frequent unigrams and bigrams. The results obtained on the Levantine multi-topic ArSentD-LEV dataset (Baly et al., 2018) show that combining the reconstruction loss with adversarial training slightly improves performance in some cases. Nevertheless, the overall results show that zero-shot transfer from an SVM model achieves competitive results on some datasets. In another work, Qwaider et al. (2019) have shown that models trained on MSA for sentiment classification generalize poorly to dialectal Arabic data. To improve the results, they have performed domain adaptation using feature engineering and sentiment lexicons.

Method
In this section, we present our model architecture. The noise-correcting discriminator and the classifier and generator losses employed in our model are those of the ALDA model (Chen et al., 2020).

Model architecture
In the unsupervised domain adaptation setting for sentiment analysis, we are given a labeled source domain D_S = {(x_s, y_s)} and an unlabeled target domain D_T = {x_t}. The aim is then to transfer the learned knowledge from D_S to D_T. In other words, the objective is to train a robust classifier on the labeled source domain data that generalizes well to the target domain test data. Figure 1 presents the general framework of our method. We aim to leverage the strengths of both domain-adversarial training and self-training in a unified framework. The adversarial training aligns both domains' output distributions, whereas the self-training captures the discriminative features of the target domain. Besides, AraBERT is used as a generator to extract high-level representations from both source and target domain sentences.
The generator G, the AraBERT encoder, is trained to extract features from the input sentences for both domains: h [CLS] = G(x) corresponds to the hidden state of the [CLS] token. The weights of the generator are shared between both domain inputs.
The classifier C operates on the hidden state h_[CLS] to classify the input instances x, and outputs a probability vector p(y = k|x) = Softmax(W_c h_[CLS] + b_c) for both domains (p_s and p_t), where b_c and W_c are the bias vector and the weight matrix of the classification layer, respectively.
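As a concrete illustration, the classification head can be sketched in a few lines of NumPy (the actual model operates on AraBERT hidden states trained end-to-end; the array shapes and function names here are ours, for illustration only):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the class logits."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def classify(h_cls, W_c, b_c):
    """p(y = k | x) = Softmax(W_c h_[CLS] + b_c)."""
    return softmax(W_c @ h_cls + b_c)
```

For a hidden state of size H and K classes, W_c has shape (K, H) and b_c shape (K,); the output is a probability vector over the sentiment classes.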
The generator G tries to confuse the discriminator D by maximizing its loss. Thus, the generator aligns both domains' output distributions, whereas the discriminator must distinguish both domains' features by generating a different noise vector for each domain. These noise vectors are employed to correct the pseudo-labels predicted by the classifier C. The Gradient Reversal Layer (GRL) reverses the gradient of the discriminator's loss during the back-propagation step.
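The gradient reversal trick itself is small; a minimal framework-agnostic sketch, with the backward pass written out by hand rather than relying on a specific autograd library, could look like:

```python
class GradientReversal:
    """Identity in the forward pass; scales gradients by -lam in the
    backward pass, so the generator ascends the discriminator's loss."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        # Pass features through unchanged.
        return x

    def backward(self, grad_output):
        # Reverse (and scale) the gradient flowing back from D.
        return [-self.lam * g for g in grad_output]
```

In a deep learning framework this is typically implemented as a custom autograd operation placed between the generator and the discriminator.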

Noise correcting discriminator
The input of the discriminator D is the hidden state h_[CLS] produced by the generator G. D is trained to produce a noise vector ξ(x) = σ(D(h_[CLS])) by applying σ, the sigmoid activation, to its output layer. Note that the output layer size is equal to K, the number of classes. Each component of the noise vector estimates the probability that the predicted label is the correct label: ξ(x)_k = p(y = k|ŷ = k, x). Hence, instead of being trained to discriminate the source domain sentences from those of the target domain directly, D is trained to generate different noise vectors for each domain. The noise vector is used to estimate the confusion matrix η = (η_kl), which is applied to correct the target domain's pseudo-labels predicted by the classifier C. The intuition behind the ALDA model is that, if we estimate the confusion matrix appropriately, the noise in the pseudo-labels predicted by the classifier can be efficiently corrected (Chen et al., 2020).
Assuming that the noise in the pseudo-labels is class-wise uniform with vector ξ(x_t), the confusion matrix is then given by:

η(x_t)_kl = ξ(x_t)_l if k = l, and (1 − ξ(x_t)_l) / (K − 1) otherwise.

The corrected label vector in the target domain is given by c(x_t)_k = Σ_l η(x_t)_kl p(ŷ_t = l|x_t), where l ranges over the classes of the predicted pseudo-label distribution. For the source domain, the corrected label vector c(x_s) is computed using the same procedure.
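Under the class-wise uniform noise assumption, the confusion matrix and the corrected label vector can be computed as in the following NumPy sketch (an illustration of the ALDA correction step, not the authors' released code):

```python
import numpy as np

def confusion_matrix(xi, K):
    """eta[k, l] = xi[l] if k == l, else (1 - xi[l]) / (K - 1).
    Each column is a distribution over the corrected class."""
    eta = np.tile((1.0 - xi) / (K - 1), (K, 1))
    np.fill_diagonal(eta, xi)
    return eta

def corrected_label(eta, p):
    """c[k] = sum_l eta[k, l] * p[l]: the pseudo-label distribution
    p corrected by the learned noise model."""
    return eta @ p
```

Each column of η sums to 1, so the corrected label vector remains a probability distribution whenever the pseudo-label vector is one.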
For the source domain, the discriminator minimizes the binary cross-entropy loss L_bce between the corrected label vectors and the ground truth labels y_s:

L_D^s = L_bce(c(x_s), y_s).

For the target domain, the discriminator minimizes the binary cross-entropy loss L_bce between the corrected label vector and the opposite distribution of the predicted pseudo-label u(ŷ_t):

L_D^t = L_bce(c(x_t), u(ŷ_t)),

where u(ŷ_t) is computed as follows:

u(ŷ_t)_k = 0 if k = ŷ_t, and 1 / (K − 1) otherwise.

To discriminate between both domains, the discriminator minimizes the following total adversarial loss:

L_adv = L_D^s + L_D^t.

In order to make the training more stable, ALDA incorporates the classification loss of the source domain as a regularization term into the discriminator. Thus, the discriminator must also correctly classify the source domain data. The regularization term is given by:

L_reg = L_ce(p_D(x_s), y_s),

where p_D(x_s) = Softmax(D(h_[CLS])) and L_ce is the cross-entropy loss. Finally, the discriminator minimizes the following loss function:

L_D = L_adv + L_reg.
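The two building blocks of the adversarial loss, the opposite distribution and the binary cross-entropy, can be sketched as follows (a minimal NumPy illustration; the function names are ours):

```python
import numpy as np

def opposite_distribution(y_hat, K):
    """Uniform mass over every class except the predicted pseudo-label."""
    u = np.full(K, 1.0 / (K - 1))
    u[y_hat] = 0.0
    return u

def bce(c, t, eps=1e-12):
    """Binary cross-entropy between a corrected label vector c and a target t."""
    c = np.clip(c, eps, 1.0 - eps)
    return float(-np.mean(t * np.log(c) + (1.0 - t) * np.log(1.0 - c)))
```

The total adversarial loss is then the sum of bce over the source pair (c(x_s), y_s) and the target pair (c(x_t), u(ŷ_t)).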

Classifier and generator losses
Following the principles of pseudo-labeling methods for domain adaptation, the ground truth label y_t for the target domain can be substituted by:

ŷ_t = argmax_k p(y = k|x_t) if max_k p(y = k|x_t) > δ,

where δ is a confidence threshold (target samples below the threshold are ignored). By using the learned confusion matrix η(x_t) to correct the pseudo-label generated by the classifier C, ALDA approximates the loss in the target domain by:

L_T = Σ_k c(x_t)_k L_unh(p_t, k),

where L_unh(p, k) = 1 − p_k is the unhinged loss. Then, the classifier C minimizes the following loss:

L_C = L_ce(p_s, y_s) + L_T,

where L_ce(p_s, y_s) is the cross-entropy loss of the source domain. Finally, the generator G minimizes the following loss function:

L_G = L_C − λ L_adv,

where λ ∈ [0, 1] is a hyperparameter balancing classification and adversarial alignment.
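The pseudo-label selection and the corrected target loss can be sketched in plain Python (an illustrative reading of the self-training step; the default threshold value is an assumption):

```python
def pseudo_label(p, delta=0.9):
    """Return the argmax class if the classifier is confident enough,
    otherwise None (the target sample is ignored)."""
    k = max(range(len(p)), key=lambda i: p[i])
    return k if p[k] > delta else None

def unhinged_target_loss(c, p):
    """Expected unhinged loss sum_k c_k * (1 - p_k) under the
    corrected label distribution c."""
    return sum(ck * (1.0 - pk) for ck, pk in zip(c, p))
```

The unhinged loss is linear in the prediction, which makes it robust to residual noise left in the corrected labels.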

Experiments
In this section, we present the experiments carried out to investigate the performance of our proposed method for Arabic cross-domain and cross-dialect sentiment analysis. We describe the datasets used and present the compared methods as well as the obtained results. We provide the experimental settings and implementation details of our method in Section A. The source code for reproducing the experiments can be found in our GitHub repository 1 .

Datasets
We conduct three main sets of experiments to cover three possible scenarios.
Scenario 1: Domain adaptation for dialects of the same region. The set of experiments of this scenario aims to study our method's performance for cross-dialect and cross-domain sentiment analysis for Arabic dialects of the same region.
Scenario 2: Domain adaptation across regional dialects. In the set of experiments of this scenario, we investigate the performance of our method using the coarse-grained regional taxonomy of Arabic dialects. For this purpose:
1. First, we select three datasets mixing Arabic dialects and MSA: BRAD (Elnagar and Einea), HARD (Elnagar et al., 2018), and TEAD (Abdellaoui and Zrigui, 2018), compiled from book reviews, hotel reviews, and Twitter, respectively. These datasets have sufficient samples to build a multi-dialect multi-domain dataset.
2. Second, we train an AraBERT-based dialect identification model, selecting data from some of the publicly available datasets, including MADAR (Bouamor et al., 2019), DART (Alsarsour et al., 2018), AOC (Zaidan and Callison-Burch, 2011), PADIC (Karima et al., 2018), and the multi-dialect Arabic text corpora proposed in (Khalid and Mark, 2013). The resulting Arabic dialect identification corpus consists of 353,171 training sentences and a balanced test set of 50,000 sentences, and covers MSA as well as dialectal sentences from the Maghrebi, Levantine, Egyptian, and Gulf regions. It is worth mentioning that our trained dialect identification model achieves 89% accuracy.
3. Finally, we apply our dialect identification model to the three selected datasets to build our multi-dialect multi-domain corpus. We keep the Levantine and Gulf dialects and MSA, which yielded sufficient data across domains. For the review datasets, rating levels 1 and 2 are assigned negative polarity, while ratings 4 and 5 are considered positive. Furthermore, we sample 1,000 positive and 1,000 negative instances for each dialect to build our final multi-dialect multi-domain dataset.
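The rating-to-polarity mapping and the balanced sampling described above can be sketched as follows (a hedged illustration; the function names are ours, not from the released code, and we assume rating 3 is discarded as neutral):

```python
import random

def rating_to_polarity(rating):
    """Map 1-5 review ratings to polarity; rating 3 is treated as
    neutral and discarded (an assumption of this sketch)."""
    if rating in (1, 2):
        return "negative"
    if rating in (4, 5):
        return "positive"
    return None

def balanced_sample(positives, negatives, n=1000, seed=0):
    """Draw n positive and n negative instances per dialect."""
    rng = random.Random(seed)
    return rng.sample(positives, n), rng.sample(negatives, n)
```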
Scenario 3: Domain adaptation from MSA to Arabic dialects using social media data. The set of experiments of this scenario tackles transfer learning from MSA to Arabic dialects belonging to different regions, using corpora built from social media (see Table 7). Since some of these datasets are labeled with positive and negative classes only (TSAC and MSAC), we evaluate our method using positive and negative sentences for all the datasets. We use the train-test splits of the evaluated datasets whenever this information is available. Otherwise, we split the datasets into 80% train and 20% test. For ArSentD-LEV, following the work of Khaddaj et al. (2019), we evaluate our method on the full target domain/dialect dataset. For all our experiments, we report the accuracy measure and highlight the best performance in bold.
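When no official split is provided, the 80/20 split can be reproduced with a seeded shuffle along the following lines (an illustrative sketch; the seed value is an assumption):

```python
import random

def train_test_split(data, test_ratio=0.2, seed=42):
    """Shuffle and split a dataset when no official split is provided."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    cut = int(len(data) * (1.0 - test_ratio))
    return [data[i] for i in indices[:cut]], [data[i] for i in indices[cut:]]
```

Fixing the seed keeps the split reproducible across the compared methods.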

Compared Methods
In order to assess the performance of our method, we compare it with the state-of-the-art domain adaptation method introduced by Khaddaj et al. (2019) for Arabic sentiment analysis on the ArSentD-LEV dataset. Moreover, we evaluate BERT for zero-shot transfer from the source domain, denoted ZS-BERT. For a fair comparison, we investigate the performance of three state-of-the-art domain adaptation methods, namely MMD (Gretton et al., 2012), CORAL (Sun and Saenko, 2016), and DANN (Ganin et al., 2016), implemented on top of AraBERT. We have also evaluated two state-of-the-art cross-domain sentiment analysis methods, namely PBLM (Ziser and Reichart, 2018) and HATN (Li et al., 2018). It is worth mentioning that for PBLM and HATN, we have used an extra 4,000 unlabeled sentences from each domain/dialect. For HATN, we have used the Mazajak word embedding model (Abu Farha and Magdy, 2019).

Results
Scenario 1: Domain adaptation for dialects of the same region. Tables 1 and 2 present the results obtained for Arabic cross-domain and cross-dialect sentiment analysis using ArSentD-LEV.
Scenario 2: Domain adaptation across regional dialects. Table 3 summarizes the results obtained for cross-domain and cross-dialect, as well as cross-domain & cross-dialect, Arabic sentiment analysis using two regional dialects (Gulf and Levantine) and MSA data, covering three domains (book reviews, hotel reviews, and Twitter). The overall results show that zero-shot transfer from AraBERT (ZS-BERT) outperforms previous state-of-the-art methods (PBLM and HATN). Moreover, the domain adaptation methods evaluated on top of BERT improve AraBERT's performance in all evaluated scenarios. Besides, the results demonstrate that the performance of the ZS-BERT method drops significantly in the cross-domain as well as the cross-domain & cross-dialect scenarios. Nevertheless, the domain adaptation methods show larger improvements (an increment of 7.4% on average) in these scenarios. The obtained results clearly show that our method surpasses the other methods for most target datasets and scenarios, except in a few cases where the gap remains small.
Scenario 3: Domain adaptation from MSA to Arabic dialects using social media data.

Table 3: Accuracy results of cross-dialect and cross-domain as well as cross-domain & cross-dialect Arabic sentiment analysis using two regional dialects and MSA data, covering three domains (books, hotels, and Twitter). Each target dataset's performance is the average accuracy obtained using its corresponding domain and/or dialect source data for each scenario. For example, in the cross-dialect scenario, the result of Gulf_BRAD is the average accuracy obtained with Levantine_BRAD and MSA_BRAD as source dialects.
In agreement with the previously obtained results, all domain adaptation methods outperform the ZS-BERT method when transferring from MSA to Arabic dialects, on all evaluated datasets, by an average increment of 4.9%. CORAL, MMD, and DANN achieve comparable performance for most dialectal datasets. Moreover, the overall comparison shows that our method outperforms all the other domain adaptation methods.

Result discussion
The overall results of the evaluated scenarios show that our method improves the transfer performance from contextualized word embeddings. Moreover, it achieves far better transfer performance than the state-of-the-art methods based on bag-of-words representations or pre-trained word embeddings. Indeed, all BERT-based domain adaptation methods yield far better transfer learning performance than both the DANN_BOW and ADRL methods. Besides, our method achieves better performance than CORAL, MMD, and DANN, which are implemented on top of the BERT module. These results can be explained by the fact that BERT captures a high-level representation of the input text (Devlin et al., 2019; Antoun et al., 2020), as well as by the effectiveness of ALDA. In fact, the latter aligns both domains' output distributions using adversarial training and captures the discriminative features of the target domain inputs through self-training (Chen et al., 2020). Moreover, using BERT as a feature generator allows the model to extract high-level shared features of the input data that are transferable across domains and dialects. For instance, the results show that training DANN on top of the BERT model outperforms DANN_BOW, trained using the bag-of-words text representation, and even the state-of-the-art methods based on pivot feature extraction (HATN and PBLM), by a large margin for both cross-domain and cross-dialect sentiment analysis (Tables 1 and 2).

Error Analysis
To understand why our proposed method outperforms the previous methods, we perform an error analysis. We focus on two aspects: the instances misclassified by our method, and the instances correctly predicted by our method on which the other approaches fail.
For the first aspect, the majority of misclassified samples correspond to very short sentences in the target dialect. Most of them are idiomatic, offensive, or sarcastic expressions that are specific to the target dialect and contain words that are distant from MSA: /wAErp/, /gAr Allh yEfw wSAfy/, /mlA THAn/, and /crf xArf/ (transliterated using the Safe Buckwalter scheme). It is worth mentioning that the other evaluated methods also misclassify these samples.
For the second aspect, we have checked the cases where our method correctly predicts the instance labels while the other methods fail. Overall, we notice that the zero-shot predictions are biased toward the distribution of the source data; for example, the ArSAS dataset contains 63% negative instances. MMD, CORAL, and DANN overcome this issue by aligning the distributions of source and target features, which improves the results on the target domain. Meanwhile, they tend to misclassify reviews that convey multiple sentiment polarities, as is the case for hotel or book reviews, where users tend to express both negative and positive sentiments in the same review. Table 8 (Section B) shows a sample of these instances. Our method outperforms these DA methods since it relies on a noise-correcting discriminator that generates different noise vectors for the source and the target domains and learns a confusion matrix in an adversarial manner. By correcting the noise in the pseudo-labels of the target domain using the confusion matrix, we can achieve a class-wise feature alignment of the source and the target domains. In contrast, the other evaluated DA methods align the output features of the source and the target domains in a class-agnostic fashion.

Conclusion
In this work, we have introduced an unsupervised domain adaptation method for Arabic cross-domain and cross-dialect sentiment analysis based on the pre-trained AraBERT language model and the Adversarial-Learned Loss for Domain Adaptation (ALDA). We have performed several experiments to investigate the performance of our method as well as several state-of-the-art methods, adopting both the coarse-grained and the fine-grained taxonomies of Arabic dialects. Moreover, we have studied the performance of domain adaptation from MSA to Arabic dialects using social media data. The overall results showed that domain adaptation methods outperform zero-shot transfer from the BERT model by a large margin. Furthermore, our method achieved very promising performance and surpassed the evaluated methods on most test datasets.
In future work, we plan to investigate domain-adaptive pre-training by collecting unlabeled data for the target domains and finetuning AraBERT using the MLM objective. The aim is to study the performance of our method using a domain-aware language model. Since the zero-shot transfer performance of the BERT model drops significantly in the cross-domain sentiment analysis experiments, we believe that training domain adaptation methods on top of a domain-aware BERT model will lead to improved performance. We also plan to study domain adaptation from high-resource languages such as English to Arabic and its dialects.

Computing Infrastructure
We conduct our experiments on an Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz workstation with a single Nvidia Tesla P100 GPU (16 GB of memory).