Improved Generalization of Arabic Text Classifiers

While transfer learning for text has been very active for English, progress in Arabic has been slow, including the use of Domain Adaptation (DA). Domain Adaptation aims to generalize a classifier's performance on a given task across different text domains. In this paper, we propose and evaluate two variants of a domain adaptation technique: the first is a base model called Domain Adversarial Neural Network (DANN), while the second is a variation that incorporates representational learning. Similar to previous approaches, we propose the use of the proxy A-distance as a metric to assess the success of generalization. We make use of ArSentD-LEV, a multi-topic dataset collected from the Levantine countries, to test the performance of the models. We show the superiority of the proposed method in accuracy and robustness when dealing with the Arabic language.


Introduction
Natural Language Processing (NLP) for Arabic is challenging due to the complexity of the language. Additionally, resources in Arabic are scarce, making it difficult to achieve NLP progress at the pace of other resource-rich languages such as English (Badaro et al., 2019). As a result, there is a need for transfer learning methods that can overcome the resource limitations. In this paper, we propose the use of domain adaptation to address this challenge while considering the task of sentiment analysis (SA), also referred to as Opinion Mining (OM).
When training over a dataset with multiple domains, different domains have different data distributions. This has a negative impact when training on one domain and testing on another, since the model would not be able to generalize well.
Although domains within the same dataset have differences, they share some characteristics. For example, consider reviews of Amazon products: reviews of electronic products are different from book reviews, but these two domains share the general structure of reviews. We say there exists a shift in the data's distribution between the two domains. To solve this problem, many approaches have been proposed within the field of Domain Adaptation (DA) (Ben-David et al., 2010). This field has received far more attention in English than in Arabic.
Solving the data shift problem is of interest for several reasons. First, it is harder for machine learning models to learn good internal representations of Arabic text than of English text, due to the sparsity of Arabic and its morphological complexity. Another reason is the limited amount of available data, especially for dialects, which causes deep learning models to perform poorly on any task. Lastly, we are not aware of domain adaptation techniques for the Arabic language, and thus much work is needed in this area to catch up with the research in English.
Traditionally, researchers focused their efforts on extracting features shared between the source and target domains (Blitzer et al., 2006; Pan et al., 2010). After the advancement of representational learning (Bengio et al., 2013), several algorithms were introduced. The most notable is the Stacked Denoising Autoencoder (SDA) (Vincent et al., 2010; Glorot et al., 2011). Later, a modified version was introduced by Chen et al. (2012). This version, called the marginalized Stacked Denoising Autoencoder (mSDA), introduced a speedup compared to the original SDA since the input/output relation was provided in closed form. After Generative Adversarial Nets (Goodfellow et al., 2014) were introduced, interest in adversarial training increased. Researchers developed new approaches that solve the DA problem through adversarial training, with emphasis on applications in computer vision and limited exploration for NLP. The most notable approaches are Domain Adversarial Neural Networks (DANN) (Ganin et al., 2016), Domain Separation Networks (DSN) (Bousmalis et al., 2016), Adversarial Discriminative Domain Adaptation (ADDA) (Tzeng et al., 2017) and Conditional Adversarial Domain Adaptation (Long et al., 2018). Although limited in Arabic, some efforts have been spent on the domain shift problem (Jeblee et al., 2014; Monroe et al., 2014).
In this paper, we propose and evaluate adversarial approaches for domain adaptation. The first is a regular DANN model, while the second is a variant of DANN that incorporates representational learning. To assess the success of domain adaptation, we use the proxy A-distance as a metric (Ben-David et al., 2007). The rest of the paper is organized as follows. Section 2 presents different approaches for DA. Section 3 introduces the algorithms to be evaluated and describes the dataset. Section 4 presents the experiments and the results. We finally summarize our work and conclude the paper in Section 5.

Related Work
Domain Adaptation passed through several development stages. The first stage was based on feature engineering methods, while in the later stages, DA experienced a shift towards deep learning.
Initial approaches included finding words that behave similarly in both the source and target domains. Blitzer et al. (2006) called such words pivot features, and proposed different approaches for extracting them. They first proposed using the most frequent common words as pivot features (Blitzer et al., 2006), and later proposed using the words with highest mutual information with the source labels. The extracted pivot features are then used by the algorithm to augment the initial dataset. This is done by learning a mapping to a vector space with dimensionality smaller than that of the input data. Then, an optimization problem is solved in the new space, with the objective function being a similarity measure. Using the results of the optimization problem, new features are added to the original dataset. The resulting algorithm is called Structural Correspondence Learning (SCL) (Blitzer et al., 2006). A similar approach was introduced by Gong et al. (2013), who suggested finding words, which they called landmarks, that have similar distributions over the source and target domains. These landmarks were used to increase the confusion between source and target domains by optimizing a series of auxiliary tasks. Another point of view was introduced by Pan et al. (2010) based on spectral graph theory. Their approach, called Spectral Feature Alignment (SFA), aligned features from the source and target domains using bipartite graphs. Although these approaches improved accuracies on domain adaptation tasks, the improvements remained limited.
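As an illustration of the mutual-information criterion for picking pivot features, the sketch below ranks binary bag-of-words features by their mutual information with the source labels. The function names and the count-based MI estimate are ours, not taken from SCL itself.

```python
import numpy as np

def mutual_info(f, y):
    """Mutual information between a binary feature f and binary labels y,
    estimated from empirical counts."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((f == a) & (y == b))
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (np.mean(f == a) * np.mean(y == b)))
    return mi

def pivots_by_mi(X, y, k):
    """Indices of the k features with highest MI with the source labels."""
    scores = [mutual_info(X[:, j], y) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1][:k]
```

A feature that perfectly tracks the labels gets the maximum score, while a constant feature gets exactly zero, which matches the intuition behind pivot selection.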
The rise of deep learning motivated the search for deep learning algorithms that could solve this problem. An interesting approach by Glorot et al. (2011) was to prepare the input of any classifier by passing it through Stacked Denoising Autoencoders (SDA) (Vincent et al., 2010). The use of SDAs helps find a new representation of the data that is domain invariant. This is achieved by reconstructing the input from stochastically corrupted data (via noise injection). Once the data is transformed, a linear SVM is trained on the new representation. This approach was more accurate than previous approaches in predicting target domain labels. However, training SDAs is very time consuming. For this reason, Chen et al. (2012) forced the reconstruction mapping to be linear. This restriction yielded a closed-form solution. The new model, called the marginalized Stacked Denoising Autoencoder (mSDA), performed as well as the original SDA while taking much less time to train.
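The closed form can be sketched for a single marginalized denoising layer. The following is our reading of the construction in Chen et al. (2012): form the expected scatter statistics under masking noise, then solve a linear system, with no iterative training. The function name and the small ridge term added for numerical stability are ours.

```python
import numpy as np

def mda_layer(X, p):
    """One marginalized denoising layer: the linear map W that best undoes
    masking noise of probability p, computed in closed form.
    X has shape (features, examples)."""
    d, n = X.shape
    Xb = np.vstack([X, np.ones((1, n))])   # append a bias row (never corrupted)
    S = Xb @ Xb.T                          # scatter matrix, (d+1) x (d+1)
    q = np.full(d + 1, 1.0 - p)            # survival probability per feature
    q[-1] = 1.0                            # the bias feature always survives
    # Expected correlations of the corrupted input under the noise model.
    Q = S * np.outer(q, q)
    np.fill_diagonal(Q, q * np.diag(S))
    P = S[:d, :] * q                       # expected input/corrupted cross-terms
    # Solve W Q = P; the tiny ridge keeps Q invertible in practice.
    W = P @ np.linalg.inv(Q + 1e-8 * np.eye(d + 1))
    return W                               # the layer output would be tanh(W @ [x; 1])
```

With p = 0 the layer reduces to a plain linear reconstruction, so on full-rank data it reproduces the input, which is a useful sanity check on the statistics above.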
After the publication of GANs (Goodfellow et al., 2014), many researchers took interest in adversarial training. Ganin et al. (2016) proposed an adversarial network for domain adaptation. By introducing a Gradient Reversal Layer (GRL) that inverts the gradient's sign during backpropagation, the Domain Adversarial Neural Network (DANN) is forced to find a saddle point between two errors: a label prediction error (to be minimized) and a domain classification error (to be maximized). This approach led to the emergence of domain invariant features. DANN achieved state-of-the-art performance in domain adaptation tasks, most notably sentiment analysis and image classification. For Arabic, domain adaptation was attempted for machine translation (Jeblee et al., 2014) and through feature augmentation (Daume III, 2007) for word segmentation (Monroe et al., 2014). Both approaches were successful.

Proposed Method
A Domain Adaptation task is, in general, a prediction problem where, given labeled data from a source domain S, we are to predict the labels of a target domain T with unlabeled data (Ben-David et al., 2010). In this paper, we focus on domain adaptation for sentiment analysis: given data with sentiment labels from one domain, the model should be able to predict the sentiment of data coming from another domain. Let X_s denote the N_s labeled observations from the source domain, and X_t the N_t unlabeled observations from the target domain. The source and target observations are concatenated to form the input data X of N_s + N_t observations to the model. The architecture of the adopted DANN is similar to the one in (Ganin et al., 2016). The variant, shown in Figure 1, is composed of 5 main parts: an encoder, a decoder, a label predictor, a gradient reversal layer, and a domain classifier. The above model uses denoising reconstruction (Vincent et al., 2010; Chen et al., 2012) and adversarial training (Ganin et al., 2016), in order to learn features that are discriminative for the task at hand, while at the same time being able to generalize from one domain to another.
Three loss functions are associated with the network: 1) a loss function related to the classification task at hand, denoted L_task; 2) a loss function associated with the domain classifier, e.g. the binary cross-entropy (log loss), denoted L_domain; and 3) a loss function associated with the reconstruction of the input data, e.g. the mean squared error or hinge loss, denoted L_recon. The model tries to minimize the sum of the three loss functions, i.e. it seeks the parameters θ* such that

θ* = argmin_θ ( L_task + λ L_domain + µ L_recon )     (1)

where λ and µ are real numbers in the range [0, 1]. Since the reconstruction error tends to be larger than the other two losses by orders of magnitude, its corresponding scalar µ tends to be small.
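As a concrete instantiation of Eq. (1), the snippet below combines the three losses, using binary cross-entropy for L_task and L_domain and mean squared error for L_recon; the λ and µ values are illustrative defaults of ours, not values from the paper.

```python
import numpy as np

def bce(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy, used here for both L_task and L_domain."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

def total_loss(y, y_hat, d, d_hat, X, X_rec, lam=0.1, mu=1e-3):
    """Objective of Eq. (1): L_task + lam * L_domain + mu * L_recon.
    mu is small because the reconstruction MSE over a large bag-of-words
    vector dwarfs the two classification losses."""
    L_task = bce(y, y_hat)                 # sentiment labels (source data only)
    L_domain = bce(d, d_hat)               # source-vs-target domain labels
    L_recon = np.mean((X - X_rec) ** 2)    # denoising reconstruction error
    return L_task + lam * L_domain + mu * L_recon
```

With perfect predictions and a perfect reconstruction the total loss is (numerically) zero, and any degradation in one component increases it.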

Label Predictor
Using the label predictor, the model predicts the labels of the input data. During training, since only the source domain data has labels, the input is sliced so that only the N_s observations from the source domain are passed into the label predictor. The loss function L_task depends on the task at hand (Janocha and Czarnecki, 2017). For example, one could use the mean squared error for regression, or the binary cross-entropy for classification. For our purpose, we use the binary cross-entropy.

Domain Classifier
The model above should be robust to shifts in the data distribution. Said differently, the model should be able to accurately predict the label of a given observation even when it comes from the target domain instead of the source domain. Mathematically, this is equivalent to minimizing the error on label prediction while maximizing the error on domain classification. Ganin et al. (2016) showed that this can be done using a special layer they called the Gradient Reversal Layer (GRL). The GRL does not affect the network during forward propagation, but it flips the sign of the gradients during backpropagation. The domain loss L_domain adopted by (Ganin et al., 2016) is the log loss between the true domain and the predicted domain. Other binary loss functions are possible (Janocha and Czarnecki, 2017); in our approach, we use the binary cross-entropy. The error of the domain classifier is scaled by λ.
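The GRL's behavior can be sketched without any deep learning framework: it is the identity in the forward pass, and it negates (and scales by λ) the gradient in the backward pass. The toy linear model below is ours and only illustrates the sign flip that makes the feature extractor ascend the domain loss while the domain classifier descends it.

```python
import numpy as np

def grl_forward(x):
    return x                       # the GRL is the identity going forward

def grl_backward(grad, lam=1.0):
    return -lam * grad             # flip (and scale) the gradient going back

# Toy pipeline: features F = W x -> GRL -> domain logit s = v . F.
rng = np.random.default_rng(0)
x, W, v = rng.normal(size=3), rng.normal(size=(2, 3)), rng.normal(size=2)
F = W @ x
s = v @ grl_forward(F)

# Gradient of the domain loss w.r.t. W, taking dL/ds = 1 for simplicity.
grad_F_plain = v                           # gradient reaching F without a GRL
grad_F_grl = grl_backward(v, lam=1.0)      # with the GRL: exactly negated
grad_W_plain = np.outer(grad_F_plain, x)
grad_W_grl = np.outer(grad_F_grl, x)
```

The feature-extractor update thus points in the exact opposite direction of the one that would help the domain classifier, which is what drives the features toward domain invariance.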

Denoising Autoencoder
The noised version of X, denoted X̃, is obtained from X by applying masking noise, i.e. elements of X are set to 0 with probability p (Glorot et al., 2011). Then, X̃ is propagated through an encoder network h(·) (Baldi, 2012) to get h(X̃). The decoder network r(·) reconstructs the input data X from the encoder's output h(X̃). A possible loss function is the mean squared error L_recon = ||X − r(h(X̃))||². The error of the autoencoder is scaled by µ.
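A minimal sketch of the masking-noise corruption and the reconstruction error; the encoder h and decoder r themselves are omitted, and the function names are ours.

```python
import numpy as np

def masking_noise(X, p, rng):
    """Corrupt X by independently setting each entry to 0 with probability p."""
    keep = rng.random(X.shape) >= p    # True = the entry survives corruption
    return X * keep

def recon_loss(X, X_rec):
    """Mean squared reconstruction error between X and r(h(X_tilde))."""
    return np.mean((X - X_rec) ** 2)
```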

Proxy A-distance as a Generalization Metric
Ben-David et al. (2007) developed a distance metric called the proxy A-distance. The lower the distance, the more similar the domains are. Intuitively, this means the source and target domains share more common features; hence, machine learning models will not lose much accuracy when trained on the source domain and tested on the target domain. Let D and D′ be two probability distributions defined over a domain χ, and let 𝒜 be a hypothesis class of subsets of χ. The A-distance between D and D′ is defined as

d_A(D, D′) = 2 sup_{A ∈ 𝒜} |Pr_D(A) − Pr_{D′}(A)|.

Intuitively, this amounts to finding the maximum L1 distance between the two probability distributions D and D′. Since computing this metric is intractable, Ben-David et al. (2007) proposed a way to approximate it from finite samples as follows: a linear SVM is trained to discriminate between the two domains, then its error ε, called the generalization error, is used to compute a proxy of the A-distance, d_A = 2(1 − 2ε). This proxy A-distance (PAD) can then be used to represent the distance between the two domains.
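A small end-to-end sketch of the PAD computation follows. As simplifications of ours, a logistic-regression discriminator trained by gradient descent stands in for the linear SVM, and the training error stands in for a held-out generalization error.

```python
import numpy as np

def proxy_a_distance(Xs, Xt, epochs=200, lr=0.1):
    """Approximate the proxy A-distance d_A = 2 * (1 - 2 * err), where err is
    the error of a linear classifier trained to tell the two domains apart."""
    X = np.vstack([Xs, Xt])
    y = np.concatenate([np.zeros(len(Xs)), np.ones(len(Xt))])
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):                     # plain gradient descent
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    err = np.mean((p >= 0.5) != y)              # domain-discrimination error
    return 2.0 * (1.0 - 2.0 * err)
```

When the two domains are indistinguishable the discriminator's error approaches 0.5 and the PAD approaches 0; when they are easily separated the error approaches 0 and the PAD approaches 2.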

Experiments and Results
To test the effectiveness of the proposed approach, we conduct 5-point sentiment classification on ArSentD-LEV (Baly et al., 2018), once using the tweet's country of origin as the domain, and once using the category to which the tweet belongs. We then show the effect of the data size on the performance of the adaptation algorithms. We start by describing the dataset, then we describe each experiment alongside its results, with some insights.

Dataset Description and Experiment Setup
ArSentD-LEV is a multi-domain dataset containing almost 4,000 tweets collected equally from the 4 Levantine countries: Jordan, Lebanon, Palestine and Syria. For each tweet, the following labels are available: the country of origin, the sentiment conveyed by the tweet on a 5-point scale (from very negative to very positive), the way the sentiment is expressed (explicit vs. implicit) and the category to which the tweet belongs. The tweets are divided into 5 categories: politics, personal, sports, religious and other. The distribution of the tweets amongst these categories is shown in Figure 2.
Following the approach used by (Chen et al., 2012; Ganin et al., 2016), we extract from the dataset the 5,000 most frequent unigrams and bigrams. We then use these unigrams and bigrams to form a bag-of-words matrix that serves as input data for the learned models. Although many models represent text better (e.g. sequence models, tree models, etc.), we limit ourselves to a simpler model to show that the improvement comes from the domain adaptation technique rather than from the text model.
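The preprocessing can be sketched as follows. The whitespace tokenizer is a simplification of ours (the exact Arabic tokenization is not specified here), as is the function name.

```python
import numpy as np
from collections import Counter

def bow_matrix(docs, k=5000):
    """Bag-of-words matrix over the k most frequent unigrams and bigrams."""
    counts = Counter()
    tokenized = []
    for doc in docs:
        toks = doc.split()                                  # naive tokenizer
        grams = toks + [" ".join(b) for b in zip(toks, toks[1:])]
        tokenized.append(grams)
        counts.update(grams)
    vocab = [g for g, _ in counts.most_common(k)]           # top-k n-grams
    index = {g: i for i, g in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for i, grams in enumerate(tokenized):
        for g in grams:
            if g in index:
                X[i, index[g]] += 1
    return X, vocab
```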
The different experiments evaluate the performance of four models. A linear SVM is used as the baseline and representative of feature-based models. For the deep learning models, we consider a fully-connected neural network (Rumelhart et al., 1988) consisting of a hidden layer of 100 neurons and a label predictor of size 2. The setup of DANN is similar to that in (Ganin et al., 2016): the same hidden layer of 100 neurons and label predictor of size 2, with a domain classifier (also of size 2) preceded by a GRL. The proposed model is identical to the description in Section 3. All neural networks are trained using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 10^-3.
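The wiring described above can be sketched at the shape level with random weights; the ReLU activation is our assumption, since the hidden-layer activation is not restated here.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def forward(X, rng):
    """Shape-level sketch: a shared 100-unit hidden layer feeding a label
    predictor of size 2 and a domain classifier of size 2. The GRL is the
    identity in the forward pass, so it does not appear here."""
    n, d = X.shape
    W_h = 0.01 * rng.normal(size=(d, 100))
    W_y = 0.01 * rng.normal(size=(100, 2))
    W_d = 0.01 * rng.normal(size=(100, 2))
    H = np.maximum(X @ W_h, 0.0)           # hidden layer (100 neurons, ReLU)
    return softmax(H @ W_y), softmax(H @ W_d)
```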

Evaluation for Cross Country Adaptation
In this experiment, we evaluate the adaptation task between tweets from different countries: the source domain consists of tweets from one of the 4 Levantine countries, and the target domain consists of tweets from one of the other countries. We thus have a total of 12 adaptation tasks. Baly et al. (2017) showed that Twitter is used for different purposes in different countries, which presents an additional challenge.
The results of the domain adaptation tasks are shown in Table 1. The proposed method outperformed all other models in most of the adaptation tasks. Although traditional machine learning models are usually better when little data is available (Cortes and Vapnik, 1995; Goodfellow et al., 2016), the proposed model was able to outperform the linear SVM in most of the tasks in our experiment, which means it was able to extract useful representations from the data. The model also outperformed DANN, which shows that representational learning provides an intrinsic representation of the data.

Evaluation for Cross Topic Adaptation
In this second experiment, we consider the task of adapting between tweets from different topics. ArSentD-LEV (Baly et al., 2018) contains 5 topic classes: politics, personal, religious, sports and other. This means we have a total of 20 tasks. The models evaluated are the linear SVM, DANN and the proposed model, with structures identical to those defined in Section 4.2.
The results of the experiment are shown in Table 2. The behavior of the algorithms is significantly different across these categories. This is caused by the unbalanced data distribution amongst the different topics, as can be seen in Figure 2. Whenever the data is very limited, the linear SVM outperforms the deep learning models. This is expected, since neural networks cannot learn the underlying representation well when data is scarce.
Looking at the radar plot in Figure 3, we observe an interesting property: the higher the PAD between the source and target domains, the better the performance of the proposed model. This can be related to the fact that the proposed model tries to find a hidden representation that combines features from both source and target domains, i.e. it decreases the distance between the two domains. Whenever the distance is already low, the proposed model cannot decrease it much further.

Performance with Limited Data Size
To test how the proposed approach behaves with limited data, we consider the task where the source domain is "Politics" and the target domain is "Personal", since more data is available for this pair than for the other tasks. We then gradually increase the dataset size and test the performance of the models at each size. Looking at Figure 4, we can see that the performance of the proposed method is better than that of DANN at all sizes. This confirms our assumption that DANN with an SDA learns a better representation through the incorporation of the autoencoder, whereas DANN alone focuses on the discriminative task at hand and thus fails to generalize. We also observe a generally increasing trend: as more data becomes available, the models are able to learn better features.

Conclusion
In this paper, we presented the first application of domain adaptation to the Arabic language. While domain adaptation has been studied extensively for English, we are not aware of prior work for Arabic. We considered the Domain Adversarial Neural Network (DANN) (Ganin et al., 2016) and proposed a variant that incorporates a stacked denoising autoencoder (SDA) into DANN. The experiments and results provided several insights. We observed that integrating a reconstruction loss into DANN helped the model learn a better latent representation. This proved useful in all experiments, especially when little data is available. These observations are consistent with what has been observed for English. The success of domain adaptation suggests the possibility of using DA to bridge the gap between different dialects of the Arabic language. Future work includes applying DA techniques to more Arabic dialects, trying other domain adaptation algorithms on Arabic, developing new domain adaptation techniques, evaluating the DA tasks using better text representations (e.g. sequence models) and integrating transfer learning techniques into the models (Ng et al., 2015).