Reinforced Product Metadata Selection for Helpfulness Assessment of Customer Reviews

To automatically assess the helpfulness of a customer review online, conventional approaches generally acquire various linguistic and neural embedding features solely from the textual content of the review itself as the evidence. We find, however, that a helpful review is largely concerned with the metadata (such as the name, the brand, the category, etc.) of its target product. This leaves us with the challenge of choosing the correct key-value product metadata to help appraise the helpfulness of free-text reviews more precisely. To address this problem, we propose a novel framework composed of two mutually beneficial modules. Given a product, a selector (agent) learns from both the keys in the product metadata and one of its reviews to take an action that selects the correct value, and a successive predictor (network) makes the free-text review attend to this value to obtain better neural representations for helpfulness assessment. The predictor is directly optimized by SGD with the loss of helpfulness prediction, and the selector can be updated via policy gradient rewarded with the performance of the predictor. We use two real-world datasets, from Amazon.com and Yelp.com respectively, to compare the performance of our framework with other mainstream methods under two application scenarios: helpfulness identification and regression of customer reviews. Extensive results demonstrate that our framework achieves state-of-the-art performance with substantial improvements.


Introduction
The massive number of reviews left by experienced consumers on online products is a priceless treasure in e-commerce. We believe that online customer reviews can provide more subjective and informative opinions on products from various perspectives, beyond the objective descriptions given by their merchants. Hence, we prefer browsing reviews online to find our desired products. However, it is quite time-consuming for potential buyers to sift through millions of online reviews of uneven quality to make purchase decisions.
To discover and surface helpful reviews for customers, some well-known websites (e.g., Amazon.com) have launched a module, illustrated by Figure 1, which asks for feedback on the helpfulness of online reviews. A recent study reported that this featured module can increase the revenue of Amazon by an estimated 27 billion U.S. dollars annually. Although this crowdsourcing module is quite useful for finding a fraction of helpful reviews, statistics (Fan et al., 2018) indicate that roughly 60% of online reviews still do not receive any vote of helpfulness or unhelpfulness. This phenomenon of unknown helpfulness is even more prominent for low-traffic items, including less popular and newly arrived products.
We believe that establishing an automatic helpfulness assessment system for online reviews is a promising study, as such a system can be as useful as a product recommendation engine (Park et al., 2012) in e-commerce. Moreover, review helpfulness assessment (Diaz and Ng, 2018) and sentiment analysis (Liu and Zhang, 2012) are two different but related lines of work on user-generated content (UGC). Sentiment analysis mainly focuses on identifying the opinion orientation of the reviewer himself/herself on the target product. A reviewer can express several emotional words in his/her comment; however, this comment may contain little information on the target product and may not be helpful to other potential consumers. Therefore, review helpfulness assessment is more concerned with whether a comment is useful/helpful to other potential consumers.
So far as we know, a series of work on review helpfulness prediction has been proposed from two perspectives: 1) some work leveraged domain-specific knowledge to extract a wide range of hand-crafted features (including structural (Mudambi and Schuff, 2010; Xiong and Litman, 2014), lexical (Kim et al., 2006; Xiong and Litman, 2011), syntactic (Kim et al., 2006), emotional (Martin and Pu, 2014), semantic (Yang et al., 2015) and even argument features (Liu et al., 2017)) from the raw text of reviews as the evidence fed to off-the-shelf learning tools for helpfulness prediction; and 2) recent studies (Chen et al., 2018a,b; Fan et al., 2018) took advantage of deep neural nets by modifying the convolutional neural network (Kim, 2014) to acquire low-dimensional representations of helpful reviews without the aid of feature engineering. Overall, these mainstream methods extract various linguistic and neural features solely from the raw text of a review as the evidence for helpfulness assessment.
We, on the other hand, consider that identifying the helpfulness of a review should be fully aware of the metadata (such as the name, the brand, the place of origin, the category, the description, etc.) of the target product besides the textual content of the review itself. To illustrate our idea, Figure 2 shows an example of two customer reviews on the same product in Amazon.com with diverse helpfulness scores. We can easily figure out that the helpful review (104 of 114 people found it helpful) may concern the key of "description" in the metadata of the product, while the unhelpful review (0 of 17 people found it helpful) says nothing about the product but expresses deep remorse. Without direct supervision from human beings, it is difficult for machines to infer the correct key/aspect in the product metadata that helpful reviews really concern.
This leaves us with the problem of how to teach machines to choose the correct key-value product metadata to help assess the helpfulness of free-text reviews more precisely. To address the issue, this paper introduces a novel framework composed of two mutually beneficial modules: a product metadata selector (agent) and a review helpfulness predictor (network). This work is intended to be deployed in a real-world system, so we suggest decoupling the two modules (i.e., the selector and the predictor). Leveraging the attention mechanism would be an alternative approach for an end-to-end framework; however, decoupled modules are preferred in an industrial system. Given a product, the selector explores the connection between the keys in the product metadata and one of its reviews to take an action that selects the correct value, and the successive predictor exploits this value, attended by the free-text review, to acquire better neural representations for helpfulness prediction. The predictor is directly optimized by the stochastic gradient descent algorithm (Ruder, 2016) with the loss of prediction, and the selector can be updated via the policy gradient method (Sutton et al., 2000) rewarded with the performance of the predictor. We use two real-world datasets, from Amazon.com and Yelp.com respectively, to compare the performance with other mainstream approaches on two application tasks: helpfulness identification and regression of customer reviews. The experimental results reveal that our framework can reach state-of-the-art performance on both tasks with substantial improvements. In addition, our framework can help acquire the embeddings of the keys in product metadata, and visualization results illustrate that they can capture various aspects of customer reviews.

Framework
The intuition of our framework (named reinforced review helpfulness prediction, abbr. R²HP) is to predict the helpfulness of a customer review while being fully aware of the correct product metadata selected by a reinforced selector (agent). This work is intended to be deployed in a real-world system responsible for recommending helpful reviews to millions of customers. Therefore, we suggest decoupling the two modules (selector and predictor). Leveraging the attention mechanism (Vaswani et al., 2017) is an alternative approach to establishing an end-to-end framework; however, decoupled modules are preferred by the industry.
As shown by Figure 3, the selector (agent) learns from both the keys in the product metadata and one of its reviews to take an action that selects the correct value, and a successive predictor (network) makes the free-text review attend to this value to obtain better neural representations for helpfulness prediction. The predictor is directly optimized by SGD with the loss of prediction, and the selector could be updated via policy gradient rewarded with the performance of the predictor.
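The interplay between the two update signals can be sketched in a few lines. The snippet below is a minimal, self-contained illustration (with made-up numbers; per the text above, the predictor's loss doubles as the reward) of how a REINFORCE-style gradient for a softmax selector policy would be computed:

```python
import numpy as np

rng = np.random.default_rng(3)

# Minimal sketch: a softmax policy over k = 4 metadata keys, a sampled
# action, and the REINFORCE term of the gradient. The loss value is made up,
# and as described above it doubles as the reward in (0.0, 1.0].
p = np.array([0.1, 0.2, 0.4, 0.3])   # selector policy (already softmaxed)
v = int(rng.choice(4, p=p))          # sampled action: the selected key
loss = 0.35                          # hypothetical predictor loss / reward

# For a softmax policy, d log p[v] / d logits = onehot(v) - p
grad_log_pi = -p.copy()
grad_log_pi[v] += 1.0

# REINFORCE term: reward-weighted gradient of the log-probability
policy_grad = loss * grad_log_pi
```

A gradient-descent step on this term lowers the probability of actions that led to a high predictor loss, which is exactly the coupling between the two modules.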

Reinforced Product Metadata Selection
Given a customer review and the key-value formed product metadata, our reinforced neural selector π takes them as input and outputs a policy p, i.e., a probability distribution over the keys. Suppose the product metadata contains k keys, each represented by an l-dimensional vector; we thus obtain the key embeddings K ∈ R^{l×k}. Assume there are n words/tokens in the customer review c. We align each word/token with the embedding dictionary acquired by word embedding approaches such as Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) or ELMo (Peters et al., 2018) to initialize the distributed representations of the customer review C ∈ R^{l×n}.

To achieve the local contextual embeddings of the customer review c, we use a Bi-LSTM network (Schuster and Paliwal, 1997) which takes the word embeddings C as input:

H_c = Bi-LSTM(C),    (1)

where H_c ∈ R^{2l×n} stands for the contextual embeddings: each word obtains a hidden unit of length 2l encoding both the backward and the forward contextual information of the customer review locally. We compute B ∈ R^{k×n} to obtain the bilinear relationship between the embeddings of the keys K and the local contextual embeddings of the customer review H_c:

B = ReLU(K^T W H_c),    (2)

where W ∈ R^{l×2l} is the weight matrix for the Rectified Linear Unit (ReLU). The i-th row of B contains the aspect/topic feature of the i-th key aligned with the local contextual embeddings of the customer review.
We apply the reduce-max strategy to each row of B to keep the most significant feature for each metadata key, and use the softmax function to obtain the policy p ∈ R^k:

p = Softmax(Reduce_max(B, axis = 1)).    (3)

Then our reinforced selector (agent) can select a metadata value v ∼ π(v|K, C) in terms of p.
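As a concrete illustration, the selector's forward pass (bilinear scoring with ReLU, reduce-max over tokens, softmax, and sampling) can be sketched as follows. All sizes are hypothetical, and random matrices stand in for the learned key embeddings, the Bi-LSTM outputs, and the weight matrix W:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: k metadata keys, n review tokens, embedding size l
k, n, l = 4, 10, 8
K = rng.normal(size=(l, k))          # key embeddings K in R^{l x k}
H_c = rng.normal(size=(2 * l, n))    # contextual review embeddings H_c
W = rng.normal(size=(l, 2 * l))      # bilinear weight matrix

# Bilinear relationship with ReLU: B[i, j] scores key i against token j
B = np.maximum(K.T @ W @ H_c, 0.0)   # shape (k, n)

# Reduce-max over tokens keeps the strongest evidence per key;
# softmax turns the k scores into the policy p
scores = B.max(axis=1)
p = np.exp(scores - scores.max())
p /= p.sum()

# The agent samples a key (hence its value) according to p
action = int(rng.choice(k, p=p))
```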

Product-aware Review Helpfulness Prediction
In this part, we elaborate our neural predictor for review helpfulness prediction. It is devised based on the motivation that assessing the helpfulness of an online review should be fully aware of the selected metadata of its target product besides the textual content of the customer review itself. The predictor is composed of two components: 1) the local contextual embeddings of a review and 2) the product-aware distributed representations of the review. We have explained how to obtain the local contextual embeddings of a review H_c ∈ R^{2l×n} in the previous subsection. Here we describe how to achieve the product-aware distributed representations of the review, denoted by H̃_c.

We use m to stand for the number of tokens/words in the selected metadata value v provided by our reinforced selector. Similarly, we can refine the word embeddings V of the metadata value via another Bi-LSTM network:

H_v = Bi-LSTM(V),    (4)

and achieve the contextual embeddings of the selected product metadata H_v ∈ R^{2l×m}. To make the contextual embeddings of the customer review fully aware of the product metadata, we design a word-level matching mechanism:

Q = (W H_v + b ⊗ e)^T H_c,    (5)

where W ∈ R^{2l×2l} is the weight matrix and b ∈ R^{2l} is the bias vector. The outer product ⊗ copies the bias vector b m times (with e ∈ R^m being an all-ones vector) to generate a 2l × m matrix. Then Q ∈ R^{m×n} is the sparse matrix that holds the word-level matching information between the value v of the product metadata and the customer review c.
If we further apply the softmax function to each column of Q, we obtain G ∈ R^{m×n}, the i-th column of which represents the normalized attention weights over all the words in the metadata value v for the i-th word in the customer review c:

G = Softmax(Q, axis = 0).    (6)

Then we can use the attention matrix G ∈ R^{m×n} and the contextual embeddings of the product metadata H_v ∈ R^{2l×m} to re-form the product-aware review representation H̃_c ∈ R^{2l×n}:

H̃_c = H_v G.    (7)

Driven by the original motivation, we join the local contextual embeddings of the review (H_c) and the product-aware distributed representations of the review (H̃_c) together for predicting its helpfulness with the feature matrix H ∈ R^{2l×n}:

H = H_c + H̃_c.    (8)

H can also benefit from the idea of ResNet, which efficiently acquires the residual between H_c and H̃_c and provides a highway to update H_c if the residual is tiny.

Generally speaking, we define a loss function L(s_g | v, H̃_c), which uses the product-aware features to predict a helpfulness score s_p judged against the ground-truth score s_g. Given that the value v is selected by π, our objective is to minimize the expectation

J(Θ) = E_{v∼π(v|K,C)} [L(s_g | v, H̃_c)],    (9)

where Θ are the parameters to be learned. The gradient of J(Θ) with respect to Θ is

∇_Θ J(Θ) = E_{v∼π(v|K,C)} [∇_Θ L(s_g | v, H̃_c) + L(s_g | v, H̃_c) ∇_Θ log π(v|K, C)],    (10)

where ∇_Θ L(s_g | v, H̃_c) refers to training the predictor by SGD, and we use the REINFORCE algorithm (Williams, 1992) to update the selector with the gradient of log π(v|K, C) and the reward L(s_g | v, H̃_c) ∈ (0.0, 1.0].
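The word-level matching, column-wise softmax attention, and residual combination described above can be sketched as follows; all sizes are illustrative, and random matrices stand in for the learned parameters and the Bi-LSTM outputs:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: m value tokens, n review tokens, hidden size 2l
l, m, n = 8, 6, 10
H_v = rng.normal(size=(2 * l, m))   # contextual embeddings of the selected value
H_c = rng.normal(size=(2 * l, n))   # contextual embeddings of the review
W = rng.normal(size=(2 * l, 2 * l))
b = rng.normal(size=(2 * l, 1))

# Word-level matching: the bias is broadcast over the m value tokens
Q = (W @ H_v + b).T @ H_c                     # shape (m, n)

# Column-wise softmax: attention over value tokens for each review token
G = np.exp(Q - Q.max(axis=0, keepdims=True))
G /= G.sum(axis=0, keepdims=True)             # each column sums to 1

H_c_tilde = H_v @ G                           # product-aware review representation
H = H_c + H_c_tilde                           # residual (ResNet-style) combination
```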

Real-world Datasets
We use two well-formatted JSON resources online which contain plenty of product metadata (including titles, brands, categories, and descriptions) and numerous customer reviews. One is the data collection of Amazon.com crawled by He and McAuley (2016) up to July 2014. The other is the dump file directly provided by Yelp.com for academic purposes. We use the product ids (i.e., "asin" in Amazon and "business id" in Yelp) as the foreign keys to align the metadata of products with customer reviews. 80% of the products with online reviews are randomly picked as the training set, leaving the rest as the test set. In this way, two real-world datasets, i.e., Amazon and Yelp, are built; their statistics are shown in Table 1 and Table 2, respectively. In this study, we regard the reviews which receive at least 1 vote for helpfulness/unhelpfulness, i.e., the column # (R.) ≥ 1v. in Table 1 and Table 2, as the experimental samples. In Amazon, the crowdsourcing module for voting on helpful reviews provides an "X of Y" helpfulness score, where "Y" stands for the total number of users who participated in voting, and "X" denotes the number who think the review is helpful. Yelp offers more options (useful: X, cool: Y, and funny: Z) to users who are willing to give feedback.
Regardless of this difference, we generally consider reviews with a helpfulness/usefulness ratio of at least 0.75 (helpfulness/usefulness score ≥ 0.75) as positive samples, leaving the others as negative samples for classification.
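A minimal sketch of this labeling rule, assuming raw helpful/total vote counts are available per review (the function and argument names are ours, not from the datasets):

```python
from typing import Optional

# A sketch of the labeling rule above. In the Amazon data, the two counts
# would come from the "X of Y" helpfulness score.
def label_review(helpful_votes: int, total_votes: int) -> Optional[int]:
    """1 = positive (ratio >= 0.75), 0 = negative, None = unvoted (excluded)."""
    if total_votes < 1:
        return None          # unvoted reviews are excluded from the samples
    return 1 if helpful_votes / total_votes >= 0.75 else 0
```

For the two reviews in Figure 2, `label_review(104, 114)` yields 1 and `label_review(0, 17)` yields 0.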

Comparison Methods
We compare our framework (R²HP) with a wide range of prior art. Specifically, we re-implement both methods that learn from deep neural networks and methods with hand-crafted features. The up-to-date neural approaches involve the embedding-gated CNN (EG-CNN) (Chen et al., 2018a,b) and the multi-task neural learning (MTNL) architecture (Fan et al., 2018) for review helpfulness prediction. The hand-crafted features include the structural features (STR) (Mudambi and Schuff, 2010; Xiong and Litman, 2014), the lexical features (LEX) (Xiong and Litman, 2011), the emotional features (GALC) (Martin and Pu, 2014) and the semantic features (INQUIRER) (Yang et al., 2015). We also add two more experiments integrating all the hand-crafted features via Support Vector Machines (SVM) and the Random Forest (R.F.) model for review helpfulness assessment.

Application Scenarios
Previous studies mostly reported their performance on either the application scenario of review helpfulness identification or regression. Therefore, we conduct extensive experiments comparing the performance of our framework with all the other approaches under both scenarios.
Identification of helpful reviews: As both the training and test sets are imbalanced, we adopt the Area under the Receiver Operating Characteristic curve (AUROC) as the metric to evaluate the performance of all approaches on helpful review identification. As shown in Table 3 and Table 4, MTNL (Fan et al., 2018) is the strongest baseline on this classification task, achieving the best performance among the baseline approaches on 12 of 14 categories across the Amazon and Yelp datasets. R²HP surpasses MTNL on both datasets and obtains state-of-the-art (micro-averaged) results of 67.5% AUROC (Amazon) and 75.1% AUROC (Yelp), with absolute improvements of 4.9% and 4.7% AUROC, respectively.
Regression of helpfulness score: In this task, all approaches are required to predict the fraction of helpful votes that each review receives. We use the data in the column # (R.) ≥ 1v. in Table 1 and Table 2 as the training and test sets. The Squared Correlation Coefficient (R²-score) is adopted as the metric to evaluate the performance of all approaches on helpfulness score regression. Table 5 and Table 6 show that MTNL (Fan et al., 2018) achieves the best performance on this regression task among the baseline approaches. Our framework outperforms MTNL on both datasets and obtains state-of-the-art (micro-averaged) results of a 62.3% R²-score (Amazon) and a 74.0% R²-score (Yelp), with absolute improvements of 5.4% and 5.8%, respectively.
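Both evaluation metrics can be computed directly from predictions. A minimal NumPy sketch follows: AUROC via the rank-sum (Mann-Whitney) formulation, which assumes no tied scores, and the R²-score as the squared Pearson correlation (note that this differs from the coefficient-of-determination definition used in some libraries):

```python
import numpy as np

def auroc(y_true, y_score):
    """AUROC via the rank-sum (Mann-Whitney U) formulation; assumes no ties."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    order = np.argsort(y_score)
    ranks = np.empty(len(y_score))
    ranks[order] = np.arange(1, len(y_score) + 1)  # rank 1 = lowest score
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def r2_corr(y_true, y_pred):
    """Squared Pearson correlation coefficient between targets and predictions."""
    return float(np.corrcoef(y_true, y_pred)[0, 1] ** 2)
```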

Ablation Study on Metadata Selector
In this part, we conduct an ablation study on different metadata selectors: the random selector, the heuristic selector, and our reinforced selector. The random selector requires no prior knowledge and randomly picks one (key, value) pair from the product metadata with uniform probability. The heuristic selector always chooses the (key, value) pair in which the value contains the longest text. Our reinforced selector learns from the reward given by the helpfulness predictor and makes a wise decision on metadata selection. The values of product metadata selected by the three selectors are fed into the same predictor. Figure 4 shows the performance of both identification and regression of review helpfulness supported by the three different selectors, and the results demonstrate that our reinforced selector surpasses the other policies (the random selector and the heuristic selector) for metadata selection.
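The two baseline policies are simple enough to state in a few lines; a sketch with a hypothetical metadata record:

```python
import random

random.seed(0)

# Hypothetical product metadata record (key -> value text)
metadata = {
    "title": "Acme Wireless Mouse",
    "brand": "Acme",
    "description": "Ergonomic 2.4 GHz wireless mouse with adjustable DPI "
                   "and a 12-month battery life.",
}

def random_selector(metadata):
    """Baseline 1: uniformly pick one (key, value) pair, no prior knowledge."""
    return random.choice(sorted(metadata.items()))

def heuristic_selector(metadata):
    """Baseline 2: pick the pair whose value contains the longest text."""
    return max(metadata.items(), key=lambda kv: len(kv[1]))
```

On this record the heuristic selector always returns the "description" pair, since descriptions tend to be the longest values.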

Case Study on Key Embeddings
Our framework also helps acquire the distributed representations of the keys in product metadata. We believe these key embeddings can capture various aspects/topics of customer reviews and lead to the correct value for review helpfulness prediction. With the help of t-SNE (Maaten and Hinton, 2008), we map the embeddings of the metadata keys from the Amazon and Yelp datasets into 2-D vectors and illustrate them in Figure 5.
Figure 5 shows that the key embeddings from Amazon are located at the bottom-right while those from Yelp are generally at the upper-left. Keys that express similar meanings within the same dataset are close to each other, such as the keys "city", "state" and "address" in Yelp, and the keys "title", "description" and "brand" in Amazon. Even across different datasets, key embeddings draw near to each other if they are close in meaning, e.g., "title" (Amazon) and "name" (Yelp), or "asin" (Amazon) and "business id" (Yelp).
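Producing such a 2-D map from learned key embeddings is a routine application of t-SNE; a sketch using scikit-learn, with random vectors standing in for the learned embeddings and all dimensions hypothetical:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(2)

# Random vectors stand in for the learned key embeddings (8 keys, 16 dims)
keys = ["title", "brand", "description", "categories",   # Amazon-style keys
        "name", "city", "state", "address"]               # Yelp-style keys
E = rng.normal(size=(len(keys), 16))

# Map the embeddings to 2-D points for plotting (perplexity must be < #keys)
coords = TSNE(n_components=2, perplexity=3.0, random_state=0).fit_transform(E)
```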

Conclusion and Future Work
Driven by the intuition that assessing the helpfulness of an online review should be fully aware of the metadata of its target product besides the textual content of the review itself, we take on the challenge of selecting the correct key-value product metadata to help predict the helpfulness of free-text reviews more precisely. To address the problem, we propose a novel framework composed of two interdependent modules. Given a product, an agent (selector) learns from both the keys in the product metadata and one of its reviews to take an action that selects the correct value, and a successive network (predictor) makes the free-text review attend to this value to produce better neural representations for helpfulness prediction. We use two real-world datasets, from Amazon and Yelp respectively, to compare the performance of our framework with other mainstream methods on two tasks: helpfulness identification and regression of online reviews. Extensive results show that our framework achieves state-of-the-art performance with substantial improvements. Further discussions demonstrate that it not only provides better policies for selecting the correct value of product metadata but also acquires the embeddings of the keys in the product metadata.
We also believe that the study of review helpfulness assessment could be as important as the topic of product recommendation, and several open problems deserve to be explored in the future. User-specific and explainable recommendation of helpful reviews: As different users may care about various aspects of online products, helpful review recommendation needs to be more user-specific and self-explainable.
Enhancing the prediction of helpful reviews with unlabeled data: As only a small proportion of reviews can be heuristically regarded as helpful or unhelpful, it becomes a promising study to automatically predict the helpfulness of online reviews based on a small amount of labeled data and a vast amount of unlabeled data.
Cross-domain helpfulness prediction of online reviews (Chen et al., 2018b): Given that manually annotating a sufficient number of helpful reviews in a new domain is costly, we should explore effective approaches for transferring useful knowledge from limited labeled samples in another domain.