Cross-Domain Review Helpfulness Prediction Based on Convolutional Neural Networks with Auxiliary Domain Discriminators

With the growing number of reviews on e-commerce websites, it is critical to assess review helpfulness and recommend reviews to consumers accordingly. Recent studies on review helpfulness require plenty of labeled samples for each domain/category of interest. However, such an approach based on the closed-world assumption is not always practical, especially for domains with limited reviews or the “out-of-vocabulary” problem. Therefore, we propose a convolutional neural network (CNN) based model which leverages both word-level and character-based representations. To transfer knowledge between domains, we further extend our model to jointly model different domains with auxiliary domain discriminators. On the Amazon product review dataset, our approach significantly outperforms the state of the art in terms of both accuracy and cross-domain robustness.


Introduction
Product reviews significantly help consumers finalize their purchasing decisions. With online reviews being ubiquitous, it is critical to examine the quality of reviews and present more useful information to consumers. Both academia and industry have paid close attention to the task of review helpfulness prediction (Liu et al., 2017a; Yang et al., 2015, 2016; Martin and Pu, 2014).
Recent studies on review helpfulness prediction have shown the effectiveness of hand-crafted features, for example, semantic features like LIWC, INQUIRER, and GALC (Yang et al., 2015; Martin and Pu, 2014), and aspect-based (Yang et al., 2016) and argument-based (Liu et al., 2017a) features. However, those methods require a large amount of labeled samples, which is not always practical, and they yield models limited to the product domains/categories of interest. For example, the "Electronics" category used in our experiment from the Amazon.com Review Dataset (McAuley and Leskovec, 2013) has more than 354k labeled reviews, while the "Watches" category has under 10k. For domains with limited data, labeled samples may be too few to build good estimators, and the "out-of-vocabulary" (OOV) problem is often observed.

* Yinfei Yang is now with Google.
To alleviate the aforementioned issues, in this work we propose an end-to-end approach for review helpfulness prediction that requires no prior knowledge or manual feature crafting. In recent years, convolutional neural networks (CNNs), which extract deep features from raw text, have demonstrated remarkable results in many natural language processing tasks, owing to their high efficiency and performance comparable to recurrent neural networks (RNNs) (Kim, 2014; Zhang et al., 2015). We thus employ CNNs as the basis of this work. As character-level representations are notably beneficial for alleviating the OOV problem in tasks such as text classification and machine translation (Ballesteros et al., 2015; Ling et al., 2015; Kim et al., 2016; Lee et al., 2017), we specifically enrich the word-level representation of CNNs by adding a character-based representation. Experiments show that our CNN-based method significantly outperforms those using hand-crafted features and yields better results than the ensemble models.
To tackle the problem of insufficient data in some domains, we develop a cross-domain transfer learning (TL) approach that leverages knowledge from a domain with sufficient data. It is worth noting that existing studies on this task focus on a single product category or largely ignore inter-domain correlations. Previous work also shows that some features are domain-specific while others are sharable across domains. For example, image quality features are only useful for categories covering products like cameras (Yang et al., 2016), while semantic and argument-based features usually work for all domains (Yang et al., 2015; Liu et al., 2017a). Thus it is important for a TL approach to learn shared features across domains. A typical TL model uses both a shared neural network (NN) and domain-specific NNs to derive shared and domain-specific features (Ganin et al., 2016; Taigman et al., 2017). Recently, Liu et al. (2017b) and Chen et al. (2017) applied adversarial loss and domain discriminators to specific-shared models using RNNs for text classification and word segmentation, respectively. Inspired by them, we study the cross-domain review helpfulness task with both adversarial loss and domain discriminators in a specific-shared framework.
In a nutshell, our main novelty is the first end-to-end cross-domain model for review helpfulness prediction. Our model consists of two components: a feature transformation network (CNN) to represent the input reviews and a transfer learning module to adapt domain knowledge. In addition, shared and specific-shared features are constrained with adversarial and domain discrimination losses. Extensive experiments show that our model is able to transfer knowledge between domains and outperforms the state of the art.
The remainder of the paper is organized as follows. Section 2 formally defines the problem and presents our model. Section 3 illustrates the effectiveness of the proposed model in the experiments. Section 4 presents related work, and finally Section 5 concludes our paper.

Model
We define review helpfulness prediction as a regression task that predicts the helpfulness score of a given review. The ground-truth helpfulness is determined using the "a of b" approach: a out of b users think the review is helpful.
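As a concrete illustration, the "a of b" ground truth can be read as the fraction of voters who found a review helpful. The following minimal Python sketch computes it; the `min_votes` cutoff is an illustrative assumption, not a detail from this paper:

```python
# Sketch of the "a of b" ground-truth helpfulness score.
# `min_votes` is an illustrative threshold, not from the paper.
def helpfulness_score(a, b, min_votes=1):
    """a = number of helpful votes, b = total votes; returns a/b,
    or None when a review has too few votes to yield a reliable label."""
    if b < min_votes:
        return None
    return a / b

print(helpfulness_score(7, 10))  # 0.7
```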
Formally, we consider a cross-domain review helpfulness prediction task in which we have a set of labeled reviews from a source domain and a target domain. We seek to transfer knowledge from a source domain with adequate data to train a better model for a target domain, which has a relatively insufficient amount of data. For a review X, our goal is to predict its helpfulness score y.
As shown in Figure 1, our base model is a multi-granularity CNN, which combines both word-level and character-level representations.

CNN with Character Representations
In many applications, such as text classification (Bojanowski et al., 2017) and machine reading comprehension (Seo et al., 2016), it is beneficial to enrich word embeddings with subword information. Inspired by that, we use a character embedding layer to enrich word representations. Let X be a review consisting of a sequence of words (x_1, x_2, ..., x_m). Following the CNN model in (Kim, 2014), we first look up the embeddings (e_1, e_2, ..., e_m) of all words in X from an embedding matrix E ∈ R^{|V|×l}, where |V| is the vocabulary size and l is the embedding dimension.
The characters of the i-th word x_i are embedded into vectors and then fed into a convolutional layer and a max-pooling layer to obtain a fixed-sized vector CharEmb(x_i). This vector is concatenated with the original word embedding e_i to form a new word embedding. This representation is advantageous in two ways: it helps group words with shared subwords, and it alleviates the OOV problem. Hence, we obtain a review's final representation by concatenating the embeddings of the words in the review: e_X = [e'_1, e'_2, ..., e'_m], where e'_i = CharEmb(x_i) ⊕ e_i, ∀i ∈ [1..m], e'_i is a column vector, and ⊕ is a stacking operator.
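The CharEmb step above can be sketched in plain numpy: embed the characters of one word, slide a convolution over the character sequence, and max-pool over time to a fixed-size vector. All dimensions, the ASCII-sized character vocabulary, and the random initializations below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

# Toy sketch of CharEmb(x_i): char embeddings -> 1-D conv -> max-pool over time.
rng = np.random.default_rng(0)
char_dim, n_filters, win = 8, 16, 3          # illustrative sizes

def char_emb(char_ids, char_table, filters):
    E = char_table[char_ids]                 # (T, char_dim) char embeddings
    T = E.shape[0]
    # valid 1-D convolution with window `win` over the character sequence
    conv = np.stack([
        np.tensordot(E[t:t + win], filters, axes=([0, 1], [0, 1]))
        for t in range(T - win + 1)
    ])                                       # (T - win + 1, n_filters)
    return conv.max(axis=0)                  # max-pool over time -> (n_filters,)

char_table = rng.normal(size=(128, char_dim))   # assumed ASCII-sized vocabulary
filters = rng.normal(size=(win, char_dim, n_filters))

v = char_emb(np.array([ord(c) for c in "helpful"]), char_table, filters)
# concatenate with a word embedding e_i (l = 100) to form the enriched e'_i
e_i = rng.normal(size=(100,))
enriched = np.concatenate([v, e_i])          # length n_filters + 100
```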
Next, we stack two 2-D convolutional layers and two 2-D max-pooling layers on the matrix e_X to obtain the hidden representation h_X. Multiple filters are used here. For each filter window size f, we obtain a hidden representation

    h_f = MaxPool(Conv(e_X)),

where f ∈ {2, 3, 4, 5} is the window size (each filter spans f words of the l-dimensional embeddings, with channel size c), Conv(·) represents a convolutional layer, MaxPool(·) is a max-pooling layer, and the two stacked conv/pool stages are applied in sequence. All the representations are then concatenated to form the final representation h_X, i.e.,

    h_X = h_2 ⊕ h_3 ⊕ h_4 ⊕ h_5.

In all, for each input X, our CNN model outputs a hidden feature representation h_X = CNN(X).
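The multi-filter encoder can be illustrated with a toy numpy sketch: for each window size f, convolve over the word dimension and max-pool to a fixed vector, then concatenate across window sizes. For brevity a single conv/pool stage is shown per filter size (the model stacks two), and the sizes are illustrative, not the paper's (c = 128):

```python
import numpy as np

# Toy sketch of the multi-window CNN encoder producing h_X.
rng = np.random.default_rng(1)
m, l, c = 20, 10, 4                  # words, embedding dim, channels (toy sizes)

e_X = rng.normal(size=(m, l))        # review matrix after char enrichment

def conv_pool(e, f, W):
    # valid convolution over word windows of size f, then max-pool over time
    out = np.stack([np.tensordot(e[i:i + f], W, axes=([0, 1], [0, 1]))
                    for i in range(e.shape[0] - f + 1)])
    return out.max(axis=0)           # (c,)

h_parts = []
for f in (2, 3, 4, 5):
    W = rng.normal(size=(f, l, c))   # one filter bank per window size
    h_parts.append(conv_pool(e_X, f, W))
h_X = np.concatenate(h_parts)        # final representation, length 4 * c
```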

Knowledge Transfer with Domain Discriminators
A typical transfer learning framework uses both a shared neural network and domain-specific neural networks to learn shared and domain-specific features (Liu et al., 2017b). In our model, we use a shared CNN and domain-specific CNNs to derive shared features h_c and domain-specific features h_s and h_t. The domain-specific output layers are defined as

    ŷ = σ(W_sc h_c + W_s h_s + b_s) for the source domain (k = 0),
    ŷ = σ(W_tc h_c + W_t h_t + b_t) for the target domain (k = 1),

where k ∈ {0, 1} is the domain label indicating whether a data instance is from the source domain (i.e., k = 0) or the target domain (i.e., k = 1). W_sc, W_tc, W_s, and W_t are the weights for shared-source, shared-target, source, and target domains respectively, b_s and b_t are the biases for the source and target domains respectively, and σ(·) is the sigmoid function.

Recent studies (Ganin et al., 2016; Taigman et al., 2017; Liu et al., 2017b) apply domain discriminators on shared features to prevent domain-specific features from creeping into the shared feature space. The main idea of a domain discriminator p(d | h_c) is to predict the domain label d from the shared features h_c. Here the domain discriminator is defined as a fully connected layer with weights W_c and bias vector b_c:

    p(d | h_c) = softmax(W_c h_c + b_c).

Since the goal is to make the shared feature space indiscriminate across the two domains, we define the adversarial loss L_adv as

    L_adv = - Σ_i Σ_{k∈{0,1}} (1/2) log p(d = k | h^c_i),

where h^c_i denotes the shared features derived from an input X_i; this loss is minimized when the discriminator predicts both domains with equal probability. Furthermore, to encourage the specific feature spaces to discriminate between different domains, we apply domain discrimination losses on the two specific feature spaces. We further add two negative cross-entropy losses, L_s for the source domain and L_t for the target domain:

    L_s = - Σ_i Σ_{k∈{0,1}} I(d_i = k) log p(d = k | h^s_i),
    L_t = - Σ_i Σ_{k∈{0,1}} I(d_i = k) log p(d = k | h^t_i),

where I(d_i = k) is an indicator function set to 1 when d_i = k holds and 0 otherwise, and h^s_i and h^t_i are the domain-specific features derived from an input X_i from the source and target domains respectively.
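To make the loss shapes concrete, here is a toy numpy sketch of a softmax domain discriminator, the adversarial loss on shared features, and a domain-discrimination loss on a source-specific feature. The dimensions, random features, and the softmax parameterization are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# Toy sketch of the domain discriminator and its two loss roles.
rng = np.random.default_rng(2)

def softmax(z):
    z = z - z.max()                  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

def disc(h, W, b):
    return softmax(W @ h + b)        # p(d = k | h), k in {0, 1}

dim = 6
W, b = rng.normal(size=(2, dim)), np.zeros(2)

# adversarial loss on shared features: pushes p(d | h_c) toward uniform
h_c = rng.normal(size=dim)
p = disc(h_c, W, b)
L_adv = -0.5 * (np.log(p[0]) + np.log(p[1]))   # minimized at p = (0.5, 0.5)

# domain-discrimination loss on a source-specific feature (true label d = 0)
h_s = rng.normal(size=dim)
L_s = -np.log(disc(h_s, W, b)[0])
```

Note that L_adv attains its minimum, log 2, exactly when the discriminator is maximally confused, which is the intended effect on the shared space.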
Moreover, studies (Bousmalis et al., 2016; Liu et al., 2017b) show that adding orthogonality constraints between the learned shared features H_c and the specific features H_k of each domain k ∈ {s, t} can help learn domain-invariant features. We thus adopt the constraint

    L_orth = Σ_{k∈{s,t}} ‖H_c^T H_k‖_F^2

in our model, where ‖·‖_F is the Frobenius norm. H_c and H_k are obtained by stacking the hidden features from all the input instances.
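The orthogonality penalty is straightforward to compute; the sketch below uses toy random matrices (their sizes are illustrative) and the squared-Frobenius-norm form from Bousmalis et al. (2016):

```python
import numpy as np

# Toy sketch of the orthogonality constraint L_orth = sum_k ||H_c^T H_k||_F^2.
rng = np.random.default_rng(3)
n, dim = 5, 6                        # instances, feature dim (toy sizes)
H_c = rng.normal(size=(n, dim))      # stacked shared features
H_s = rng.normal(size=(n, dim))      # stacked source-specific features
H_t = rng.normal(size=(n, dim))      # stacked target-specific features

def orth_penalty(H_c, H_k):
    M = H_c.T @ H_k                  # (dim, dim) cross-correlation
    return np.sum(M ** 2)            # squared Frobenius norm

L_orth = orth_penalty(H_c, H_s) + orth_penalty(H_c, H_t)
```

The penalty is zero exactly when every shared feature dimension is orthogonal, across instances, to every specific feature dimension, which discourages the two subspaces from encoding the same information.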
Finally, we obtain a combined loss

    L(Θ) = L_mse + λ_1 L_adv + λ_2 L_s + λ_3 L_t + λ_4 L_orth + λ_5 ‖Θ‖_2^2,

where L_mse is the mean squared error of the helpfulness predictions on both domains, all λ's are weights for the different losses, and Θ denotes the model parameters.

Experiments
Following previous work (Yang et al., 2015, 2016), experiments are conducted on reviews from five categories of products in the Amazon review dataset (McAuley and Leskovec, 2013). Data statistics are summarized in Table 1. The empirical study proceeds in two steps. Without TL, Part 1 (Sections 3.1 and 3.2) shows that the embedding-based features of CNNs outperform hand-crafted features. After validating the advantage of the CNN-based model, Part 2 (Section 3.3) demonstrates that our TL approach (introduced in Section 2.2) boosts this advantage further more effectively than other TL approaches. In Part 2, the same CNN-based model is used for all TL approaches.

The lookup table E is initialized with pre-trained vectors from GloVe (Pennington et al., 2014), setting l = 100. For CNNs, the activation function is ReLU and the channel size is set to 128. We also set λ_1 = λ_2 = λ_3 = λ_4 = 0.05 and λ_5 = 0.0008. AdaGrad (Duchi et al., 2011) is used for training with an initial learning rate of 0.08. Following previous work (Yang et al., 2015, 2016), ten-fold cross-validation is performed for all experiments, and all results are evaluated by the correlation coefficient between the predicted helpfulness score and the ground-truth score computed by the "a of b" approach from the dataset.
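The evaluation metric can be sketched as follows, assuming Pearson's r (the text says only "correlation coefficients", so the exact choice of coefficient is an assumption here):

```python
import numpy as np

# Sketch of the evaluation metric: correlation between predicted helpfulness
# scores and the "a of b" ground truth (Pearson's r assumed for illustration).
def pearson(pred, gold):
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    return np.corrcoef(pred, gold)[0, 1]

r = pearson([0.1, 0.4, 0.8, 0.9], [0.2, 0.3, 0.7, 1.0])
```

In a ten-fold setup, this coefficient would be computed on each held-out fold and averaged across folds.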

Comparison with hand-crafted features
We first compare our base CNN model with regression baselines using hand-crafted features, namely STR, UGR, LIWC, and INQUIRER (Yang et al., 2015) and the aspect-based feature ASP (Yang et al., 2016), as well as the vanilla CNN (CNN) in (Kim, 2014). As shown in Table 2, both CNN-based models outperform the baselines, indicating that CNN-based models have better expressiveness than these hand-crafted features for this task.
Our CNN-based model outperforms the vanilla CNN-based one on relatively small domains (e.g., "Watches", "Cellphones") and achieves comparable results on large ones (e.g., "Electronics"). This is because the OOV problem is more severe on small domains, where our character-level representations help the most. Overall, our CNN-based method shows better performance than the baselines.

Comparison with ensemble features
We further compare our CNN-based model with two groups of ensemble features: Fusion 1, comprising the STR, UGR, LIWC, and INQUIRER features (Yang et al., 2015), and Fusion 2, further adding the ASP feature (Yang et al., 2016). As shown in Table 3, our CNN-based model consistently outperforms the models based on ensemble features.

Comparison with TL models
To evaluate the effectiveness of our transfer learning approach, we compare our full model with three baselines: Src-only, which uses only source data; Tgt-only, which uses only target data; and TL-S, which uses both source and target data with adversarial training as in (Liu et al., 2017b). For the TL-based approaches, we use the "Electronics" category as the source domain and all other categories as target domains. According to Table 4, due to domain shift, Src-only performs worse than Tgt-only. This is intuitive, as the domains are related but different. Our model achieves results better than or comparable to Tgt-only and TL-S. This supports the benefits of transfer learning and demonstrates the usefulness of adding domain discriminators on both source and target domains.
Last but not least, our model shows less improvement over Tgt-only as the target-domain data size increases. For example, our model yields an improvement of 4% over Tgt-only on the smallest domain, "Watches," but the improvement drops to 1.7% on the largest domain, "Home." To investigate this, we pick the category "Outdoor" as the target domain and track how our TL approach loses its edge as the amount (in terms of percentage in Table 5) of target-domain data used in training increases. The full set of data from the source domain "Electronics" is used throughout. According to Table 5, the more data from the target domain, the less advantage our approach has over the Tgt-only model. It is more beneficial to leverage knowledge from another relevant domain when there is less data in the target domain. This also demonstrates that our model is able to learn transferable features from a relevant domain to help the task on a target domain, which often has limited data.

Related Work
Review Helpfulness Prediction: Recent studies on review helpfulness prediction focus on hand-crafted features from the review texts. For example, Yang et al. (2015) and Martin and Pu (2014) examined semantic features like LIWC, INQUIRER, and GALC. Subsequently, aspect-based (Yang et al., 2016) and argument-based (Liu et al., 2017a) features were demonstrated to improve prediction performance. However, these methods rely on sufficient labeled data and may not perform ideally for domains with limited data. To alleviate this issue, we employ convolutional neural networks (CNNs) (Kim, 2014; Zhang et al., 2015) as the base model and further consider character-level representations (Ballesteros et al., 2015; Ling et al., 2015; Kim et al., 2016; Lee et al., 2017).

Transfer Learning: Transfer learning (TL) has been extensively studied in the last decade; interested readers can refer to (Pan and Yang, 2010) for a detailed survey. With the popularity of deep learning, a great number of neural network (NN) based methods have been proposed for TL (Yosinski et al., 2014; Wang and Zheng, 2015; Mou et al., 2016; Yang et al., 2017; Liu et al., 2017b). A simple but widely used framework is fine-tuning, which initializes the model parameters for the target domain with the parameters of a well-trained model on the source domain, and then fine-tunes them on labeled data from the target domain (Yosinski et al., 2014; Mou et al., 2016). Another typical framework uses a shared NN to learn shared features for both source and target domains (Mou et al., 2016; Yang et al., 2017). On top of that, specific-shared frameworks use both a shared NN and domain-specific NNs to derive shared and domain-specific features (Ganin et al., 2016; Taigman et al., 2017; Yu et al., 2018).
However, since it may not be ideal to separate shared and specific features in this way, recent studies (Ganin et al., 2016; Taigman et al., 2017; Liu et al., 2017b) employ adversarial networks to learn more robust shared features across domains. Inspired by this, our method adopts an adversarial network on the shared features. Meanwhile, we also use domain discriminators on both source and target features to help learn domain-specific features.
To the best of our knowledge, our work is the first to study cross-domain review helpfulness prediction. Without any hand-crafted features, our CNN-based method achieves better results than the existing approaches.

Conclusion
In this work, we proposed a convolutional neural network (CNN) based approach that combines both word- and character-level representations for review helpfulness prediction. We studied transfer learning for this task and used auxiliary domain discriminators on both shared and specific representations. Experiments showed that our CNN-based models outperform existing approaches. In the near future, we will look at multi-task helpfulness prediction to further transfer knowledge across domains. It is also worth studying domain correlation in transfer learning (Yu et al., 2018) or multi-task settings.