Improving the results of string kernels in sentiment analysis and Arabic dialect identification by adapting them to your test set

Recently, string kernels have obtained state-of-the-art results in various text classification tasks such as Arabic dialect identification or native language identification. In this paper, we apply two simple yet effective transductive learning approaches to further improve the results of string kernels. The first approach is based on interpreting the pairwise string kernel similarities between samples in the training set and samples in the test set as features. Our second approach is a simple self-training method based on two learning iterations. In the first iteration, a classifier is trained on the training set and tested on the test set, as usual. In the second iteration, a number of test samples (to which the classifier associated higher confidence scores) are added to the training set for another round of training. However, the ground-truth labels of the added test samples are not necessary. Instead, we use the labels predicted by the classifier in the first training iteration. By adapting string kernels to the test set, we report significantly better accuracy rates in English polarity classification and Arabic dialect identification.


Introduction
In recent years, methods based on string kernels have demonstrated remarkable performance in various text classification tasks ranging from authorship identification (Popescu and Grozea, 2012) and sentiment analysis (Giménez-Pérez et al., 2017) to native language identification (Popescu and Ionescu, 2013; Ionescu et al., 2014, 2016), dialect identification (Ionescu and Popescu, 2016b; Ionescu and Butnaru, 2017) and automatic essay scoring (Cozma et al., 2018). As long as a labeled training set is available, string kernels can reach state-of-the-art results in various languages including English (Ionescu et al., 2014; Giménez-Pérez et al., 2017; Cozma et al., 2018), Arabic (Ionescu, 2015; Ionescu et al., 2016; Ionescu and Butnaru, 2017), Chinese and Norwegian (Ionescu et al., 2016). Different from all these recent approaches, we use unlabeled data from the test set to significantly increase the performance of string kernels. More precisely, we propose two transductive learning approaches combined into a unified framework. We show that the proposed framework improves the results of string kernels in two different tasks (cross-domain sentiment classification and Arabic dialect identification) and two different languages (English and Arabic). To the best of our knowledge, transductive learning frameworks based on string kernels have not been studied in previous works.

Transductive String Kernels
String kernels. Kernel functions (Shawe-Taylor and Cristianini, 2004) capture the intuitive notion of similarity between objects in a specific domain. For example, in text mining, string kernels can be used to measure the pairwise similarity between text samples, simply based on character n-grams. Various string kernel functions have been proposed to date (Lodhi et al., 2002; Shawe-Taylor and Cristianini, 2004; Ionescu et al., 2014). One of the more recently introduced string kernels is the histogram intersection string kernel (Ionescu et al., 2014). For two strings over an alphabet $\Sigma$, $x, y \in \Sigma^*$, the intersection string kernel is formally defined as follows:

$$k^{\cap}(x, y) = \sum_{v \in \Sigma^p} \min\{\mathrm{num}_v(x), \mathrm{num}_v(y)\}, \tag{1}$$

where $\mathrm{num}_v(x)$ is the number of occurrences of n-gram $v$ as a substring in $x$, and $p$ is the length of $v$. The spectrum string kernel or the presence bits string kernel can be defined in a similar fashion (Ionescu et al., 2014). The standard kernel learning pipeline is presented in Figure 1. String kernels help to efficiently compute the dual representation directly, thus skipping the first step in the pipeline illustrated in Figure 1.

Figure 1: The standard kernel learning pipeline based on the linear kernel. Kernel normalization is not illustrated for simplicity. Best viewed in color.

Transductive string kernels. We propose a simple and straightforward approach to produce a transductive similarity measure suitable for strings, as illustrated in Figure 2. We take the following steps to derive transductive string kernels. For a given kernel (similarity) function $k$, we first build the full kernel matrix $K$, by including the pairwise similarities of samples from both the training and the test sets (step S1 in Figure 2).
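As an illustration of the intersection string kernel and of step S1, the following is a minimal Python sketch, not the authors' implementation. It assumes a single n-gram length `p` (the experiments reported later use a range of lengths), and the function names `ngram_counts`, `intersection_kernel` and `full_kernel_matrix` are hypothetical.

```python
from collections import Counter

import numpy as np


def ngram_counts(text, p):
    """Count every character n-gram of length p that occurs in text."""
    return Counter(text[i:i + p] for i in range(len(text) - p + 1))


def intersection_kernel(x, y, p):
    """Sum, over n-grams v shared by x and y, of min(num_v(x), num_v(y))."""
    cx, cy = ngram_counts(x, p), ngram_counts(y, p)
    return sum(min(count, cy[v]) for v, count in cx.items() if v in cy)


def full_kernel_matrix(train, test, p):
    """Pairwise similarities over Z = train + test (step S1).

    Assumption: a plain double loop, which is quadratic in |Z| and only
    meant for small illustrative inputs.
    """
    z = list(train) + list(test)
    size = len(z)
    K = np.zeros((size, size))
    for i in range(size):
        for j in range(i, size):
            K[i, j] = K[j, i] = intersection_kernel(z[i], z[j], p)
    return K
```

The first m rows and columns of the returned matrix correspond to training samples and the remaining n to test samples, matching the ordering of Z used in the text.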
For a training set $X = \{x_1, x_2, ..., x_m\}$ of $m$ samples and a test set $Y = \{y_1, y_2, ..., y_n\}$ of $n$ samples, such that $X \cap Y = \emptyset$, each component in the full kernel matrix is defined as follows (step S2 in Figure 2):

$$K_{ij} = k(z_i, z_j), \tag{2}$$

where $z_i$ and $z_j$ are samples from the set $Z = X \cup Y = \{x_1, x_2, ..., x_m, y_1, y_2, ..., y_n\}$, for all $1 \leq i, j \leq m + n$. We then normalize the kernel matrix by dividing each component by the square root of the product of the two corresponding diagonal components:

$$\hat{K}_{ij} = \frac{K_{ij}}{\sqrt{K_{ii} \cdot K_{jj}}}. \tag{3}$$

We transform the normalized kernel matrix into a radial basis function (RBF) kernel matrix as follows:

$$\tilde{K}_{ij} = \exp\left(-\frac{1 - \hat{K}_{ij}}{2\sigma^2}\right). \tag{4}$$

As the kernel matrix is already normalized, we can choose $\sigma^2 = 0.5$ for simplicity. Therefore, Equation (4) becomes:

$$\tilde{K}_{ij} = \exp\left(-1 + \hat{K}_{ij}\right). \tag{5}$$

Each row in the RBF kernel matrix $\tilde{K}$ is now interpreted as a feature vector, going from step S2 to step S3 in Figure 2. In other words, each sample $z_i$ is represented by a feature vector that contains the similarity between the respective sample $z_i$ and all the samples in $Z$ (step S3 in Figure 2). Since $Z$ includes the test samples as well, the feature vector is inherently adapted to the test set. Indeed, it is easy to see that the features will be different if we choose to apply the string kernel approach on a different set of test samples $Y'$, such that $Y' \neq Y$. It is important to note that, through the features, the subsequent classifier will have some information about the test samples at training time. More specifically, the feature vector conveys information about how similar every test sample is to every training sample. We next consider the linear kernel, which is given by the scalar product between the new feature vectors. To obtain the final linear kernel matrix, we simply need to compute the product between the RBF kernel matrix and its transpose (step S4 in Figure 2):

$$\ddot{K} = \tilde{K} \cdot \tilde{K}^{\top}. \tag{6}$$

In this way, the samples from the test set, which are included in $Z$, are used to obtain new (transductive) string kernels that are adapted to the test set at hand.
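The normalization, RBF transform and final matrix product described above can be sketched in a few lines of NumPy. This is a hedged illustration under the stated choice of σ² = 0.5; the function name `transductive_kernel` is hypothetical, and the sketch assumes every diagonal entry of the input matrix is strictly positive (each sample has non-zero similarity with itself).

```python
import numpy as np


def transductive_kernel(K):
    """Map a full (train + test) kernel matrix to a transductive kernel."""
    K = np.asarray(K, dtype=float)
    diag = np.diag(K)
    # Normalization: divide each entry K_ij by sqrt(K_ii * K_jj).
    K_hat = K / np.sqrt(np.outer(diag, diag))
    # RBF transform with sigma^2 = 0.5, i.e. exp(-(1 - K_hat)).
    K_tilde = np.exp(K_hat - 1.0)
    # Each row of K_tilde is a feature vector; the scalar products
    # between rows give the final linear kernel.
    return K_tilde @ K_tilde.T
```

Because the rows of the RBF matrix contain similarities to test samples as well as training samples, changing the test set changes every entry of the result, including the train-versus-train block.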
Transductive kernel classifier. After obtaining the transductive string kernels, we use a simple transductive learning approach that falls in the category of self-training methods (McClosky et al., 2006; Chen et al., 2011). The transductive approach is divided into two learning iterations. In the first iteration, a kernel classifier is trained on the training data and applied to the test data, just as usual. Next, the test samples are sorted by the classifier's confidence score, so that the samples most likely to be correctly labeled appear at the top of the sorted list. In the second iteration, a fixed number of samples (1000 in the experiments) from the top of the list are added to the training set for another round of training. Even though a small percentage (less than 8% in all experiments) of the predicted labels corresponding to the newly included samples are wrong, the classifier has the chance to learn some useful patterns (from the correctly predicted labels) that are only visible in the test data. The transductive kernel classifier (TKC) is based on the intuition that the added test samples bring more useful information than noise, since the majority of added test samples have correct labels. Finally, we would like to stress that the ground-truth test labels are never used in our transductive algorithm.
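The two learning iterations can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes a binary task with labels in {-1, +1}, uses Kernel Ridge Regression in dual form as the kernel classifier, and takes the absolute value of the regression output as the confidence score. The function names and the default `reg` value are illustrative assumptions.

```python
import numpy as np


def krr_scores(K, train_idx, y_train, test_idx, reg=1e-5):
    """Kernel Ridge Regression in dual form; returns real-valued test scores."""
    K_train = K[np.ix_(train_idx, train_idx)]
    alpha = np.linalg.solve(K_train + reg * np.eye(len(train_idx)), y_train)
    return K[np.ix_(test_idx, train_idx)] @ alpha


def transductive_kernel_classifier(K, train_idx, y_train, test_idx, r=1000):
    """Two-iteration self-training on a precomputed kernel matrix K."""
    train_idx = np.asarray(train_idx)
    test_idx = np.asarray(test_idx)
    y_train = np.asarray(y_train, dtype=float)

    # Iteration 1: train on the labeled data and predict the test set.
    scores = krr_scores(K, train_idx, y_train, test_idx)
    predicted = np.where(scores >= 0, 1.0, -1.0)

    # Sort test samples by confidence (assumed here to be |score|),
    # most confident first, and keep the top r.
    top = np.argsort(-np.abs(scores))[:r]

    # Iteration 2: retrain with the top-r test samples added, using the
    # predicted labels only (ground-truth test labels are never used).
    aug_idx = np.concatenate([train_idx, test_idx[top]])
    aug_y = np.concatenate([y_train, predicted[top]])
    final_scores = krr_scores(K, aug_idx, aug_y, test_idx)
    return np.where(final_scores >= 0, 1, -1)
```

Here `K` is indexed over all training and test samples, so the same precomputed matrix serves both iterations and only the index sets change.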
The proposed transductive learning approaches are used together in a unified framework. As with any other transductive learning method, the main disadvantage of the proposed framework is that the unlabeled test samples from the target domain need to be available in the training stage. Nevertheless, we present empirical results indicating that our approach can obtain significantly better accuracy rates in cross-domain polarity classification and Arabic dialect identification compared to state-of-the-art methods based on string kernels (Giménez-Pérez et al., 2017; Ionescu and Butnaru, 2017). We also report better results than other domain adaptation methods (Pan et al., 2010; Bollegala et al., 2013; Franco-Salvador et al., 2015; Sun et al., 2016; Huang et al., 2017).

Polarity Classification
Data set. For the cross-domain polarity classification experiments, we use the second version of the Multi-Domain Sentiment Dataset (Blitzer et al., 2007). The data set contains Amazon product reviews from four different domains: Books (B), DVDs (D), Electronics (E) and Kitchen appliances (K). Reviews contain star ratings (from 1 to 5), which are converted into binary labels as follows: reviews rated with more than 3 stars are labeled as positive, and those with fewer than 3 stars as negative. In each domain, there are 1000 positive and 1000 negative reviews.

Baselines. We compare our approach with several methods (Pan et al., 2010; Bollegala et al., 2013; Franco-Salvador et al., 2015; Sun et al., 2016; Giménez-Pérez et al., 2017), including SFA (Pan et al., 2010), KMM (Huang et al., 2007), CORAL (Sun et al., 2016) and TR-TrAdaBoost (Huang et al., 2017) in the single-source setting.

Evaluation procedure and parameters. We follow the same evaluation methodology as Giménez-Pérez et al. (2017) to ensure a fair comparison. Furthermore, we use the same kernels, namely the presence bits string kernel (K_{0/1}) and the intersection string kernel (K_∩), and the same range of character n-grams (5-8).

Results in multi-source setting. The results for the multi-source cross-domain polarity classification setting are presented in Table 1. Both the transductive presence bits string kernel (K_{0/1}) and the transductive intersection kernel (K_∩) obtain better results than their original counterparts. Moreover, according to McNemar's test (Dietterich, 1998), the results on the DVDs, the Electronics and the Kitchen target domains are significantly better than the best baseline string kernel, at a significance level of 0.01. When we employ the transductive kernel classifier (TKC), we obtain even better results. On all domains, the accuracy rates yielded by the transductive classifier are more than 1.5% better than the best baseline. For example, on the Books domain the accuracy of the transductive classifier based on the presence bits kernel (84.1%) is 2.1% above the best baseline (82.0%), represented by the intersection string kernel. Remarkably, the improvements brought by our transductive string kernel approach are statistically significant in all domains.
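For readers unfamiliar with the significance test mentioned above, the following is a minimal sketch of McNemar's test for comparing two classifiers on the same test set. It assumes the continuity-corrected chi-squared form described by Dietterich (1998); whether the experiments used exactly this variant is not stated in the text, and the function name `mcnemar_test` is hypothetical.

```python
import math


def mcnemar_test(y_true, pred_a, pred_b):
    """Return (statistic, p_value) for McNemar's test on paired predictions."""
    # n01: classifier A is wrong and B is right; n10: A is right and B is wrong.
    n01 = sum(a != t and b == t for t, a, b in zip(y_true, pred_a, pred_b))
    n10 = sum(a == t and b != t for t, a, b in zip(y_true, pred_a, pred_b))
    if n01 + n10 == 0:
        # The two classifiers never disagree in correctness.
        return 0.0, 1.0
    # Chi-squared statistic with continuity correction.
    stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    # Survival function of the chi-squared distribution with 1 degree
    # of freedom, expressed through the complementary error function.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

A difference would be reported as significant at the 0.01 level when the returned p-value is below 0.01. Only the samples on which the two classifiers disagree in correctness influence the statistic.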
Results in single-source setting. The results for the single-source cross-domain polarity classification setting are presented in Table 2. We considered all possible combinations of source and target domains in this experiment, and we improve the results in each and every case. Without exception, the accuracy rates reached by the transductive string kernels are significantly better than the best baseline string kernel (Giménez-Pérez et al., 2017), according to McNemar's test performed at a significance level of 0.01. The highest improvements (above 2.7%) are obtained when the source domain contains Books reviews and the target domain contains Kitchen reviews. As in the multi-source setting, we obtain much better results when the transductive classifier is employed for the learning task. In all cases, the accuracy rates of the transductive classifier are more than 2% better than the best baseline string kernel. Remarkably, in four cases (E→B, E→D, B→K and D→K) our improvements are greater than 4%. The improvements brought by our transductive classifier based on string kernels are statistically significant in each and every case. In comparison with SFA (Pan et al., 2010), we obtain better results in all but one case (K→D). With respect to KMM (Huang et al., 2007), we also obtain better results in all but one case (B→E). Remarkably, we surpass the other state-of-the-art approaches (Sun et al., 2016; Huang et al., 2017) in all cases.

Table 2: Single-source cross-domain polarity classification accuracy rates (in %) of our transductive approaches versus a state-of-the-art (sota) baseline based on string kernels (Giménez-Pérez et al., 2017), as well as SFA (Pan et al., 2010), KMM (Huang et al., 2007), CORAL (Sun et al., 2016) and TR-TrAdaBoost (Huang et al., 2017). The best accuracy rates are highlighted in bold. The marker * indicates that the performance is significantly better than the best baseline string kernel according to a paired McNemar's test performed at a significance level of 0.01.

Arabic Dialect Identification
Data set. The Arabic Dialect Identification (ADI) data set (Ali et al., 2016) contains audio recordings and Automatic Speech Recognition (ASR) transcripts of Arabic speech collected from the Broadcast News domain. The classification task is to discriminate between Modern Standard Arabic and four Arabic dialects, namely Egyptian, Gulf, Levantine, and Maghrebi. The training set contains 14000 samples, the development set contains 1524 samples, and the test set contains another 1492 samples. The data set was used in the ADI Shared Task of the 2017 VarDial Evaluation Campaign.

Baseline. We choose as baseline the approach of Ionescu and Butnaru (2017), which is based on string kernels and multiple kernel learning. The approach that we consider as baseline is the winner of the 2017 ADI Shared Task. In addition, we also compare with the second-best approach (Meta-classifier).

Evaluation procedure and parameters. Ionescu and Butnaru (2017) combined four kernels into a sum, and used Kernel Ridge Regression for training. Three of the kernels are based on character n-grams extracted from ASR transcripts. These are the presence bits string kernel (K_{0/1}), the intersection string kernel (K_∩), and a kernel based on Local Rank Distance (K_{LRD}) (Ionescu, 2013). The fourth kernel is an RBF kernel (K_{ivec}) based on the i-vectors provided with the ADI data set (Ali et al., 2016). In our experiments, we employ the exact same kernels as Ionescu and Butnaru (2017) to ensure an unbiased comparison with their approach. As in the polarity classification experiments, we select r = 1000 unlabeled test samples to be included in the training set for the second round of training the transductive classifier, and we use Kernel Ridge Regression with a regularization of 10^{-5} in all our ADI experiments.

Table 3: [...] (Ionescu and Butnaru, 2017) and the first runner-up. The best accuracy rates are highlighted in bold. The marker * indicates that the performance is significantly better than Ionescu and Butnaru (2017) according to a paired McNemar's test performed at a significance level of 0.01.
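The combination of several kernels into a sum, followed by Kernel Ridge Regression, can be sketched as below. This is an illustrative sketch only: the one-versus-all target encoding in {-1, +1} and the argmax decision rule are assumptions made to handle the five-class task, the function names are hypothetical, and the kernel matrices are assumed to be precomputed over the same ordered set of samples.

```python
import numpy as np


def sum_kernels(kernels):
    """Combine precomputed kernel matrices of equal shape by summation."""
    return np.sum(np.stack(kernels), axis=0)


def krr_multiclass_predict(K, train_idx, labels, test_idx, n_classes, reg=1e-5):
    """One-vs-all Kernel Ridge Regression on a precomputed kernel matrix."""
    labels = np.asarray(labels)
    # Assumed encoding: +1 for the sample's class, -1 for all other classes.
    Y = -np.ones((len(labels), n_classes))
    Y[np.arange(len(labels)), labels] = 1.0
    K_train = K[np.ix_(train_idx, train_idx)]
    # Dual coefficients, one column per class.
    alpha = np.linalg.solve(K_train + reg * np.eye(len(labels)), Y)
    scores = K[np.ix_(test_idx, train_idx)] @ alpha
    # Predict the class with the highest regression output.
    return scores.argmax(axis=1)
```

With the regularization set to 1e-5 as in the ADI experiments, the regressor fits the training targets almost exactly, so the quality of the summed kernel largely determines the predictions.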
Results. The results for the cross-domain Arabic dialect identification experiments on both the development and the test sets are presented in Table 3. The domain-adapted sum of kernels obtains improvements above 0.8% over the state-of-the-art sum of kernels (Ionescu and Butnaru, 2017). The improvement on the development set (from 64.17% to 65.42%) is statistically significant. Nevertheless, we obtain higher and significant improvements when we employ the transductive classifier. Our best accuracy is 66.73% (2.56% above the baseline) on the development set and 78.35% (2.08% above the baseline) on the test set. The results show that our domain adaptation framework based on string kernels attains the best performance on the ADI Shared Task data set, and the improvements over the state of the art are statistically significant, according to McNemar's test.