Review-based Question Generation with Adaptive Instance Transfer and Augmentation

While online reviews of products and services have become an important information source, it remains inefficient for potential consumers to exploit verbose reviews to fulfill their information needs. We propose to explore question generation as a new way of exploiting review information, namely generating questions that can be answered by the corresponding review sentences. One major challenge of this generation task is the lack of training data, i.e. explicit mappings between user-posed questions and review sentences. To obtain proper training instances for the generation model, we propose an iterative learning framework with adaptive instance transfer and augmentation. To generate to-the-point questions about the major aspects in reviews, related features extracted in an unsupervised manner are incorporated without the burden of aspect annotation. Experiments on data from various categories of a popular E-commerce site demonstrate the effectiveness of the framework, as well as the potential of the proposed review-based question generation task.


Introduction
User-written reviews of products and services have become an important information source, and several research areas analyze such data, including aspect extraction (Bing et al., 2016; Chen et al., 2013), product recommendation (Chelliah and Sarkar, 2017), and sentiment analysis (Zhao et al., 2018a). Reviews reflect certain concerns or experiences of users on products or services, and such information is valuable for other potential consumers. However, there are few mechanisms that help users digest reviews efficiently. It is time-consuming for users to locate the critical review parts that they care about, particularly in long reviews.
* The work described in this paper is partially supported by a grant from the Research Grant Council of the Hong Kong Special Administrative Region, China (Project Code: 14204418).
† The work was done when Qian Yu was an intern at Alibaba.
We propose to utilize question generation (QG) as a new means to overcome this problem. Specifically, given a review sentence, the generated question is expected to ask about the concerned aspect of the product, from the perspective of the review writer. Such a question can be regarded as a reading anchor of the review sentence, and it is easier to view and grasp due to its concise form. As an example, the review for a battery case product in Table 1 is too long for a reader to quickly find the sentences that can answer a user question such as "How long will the battery last?". Given the generated questions in the right column, it would be much easier to find the helpful part of the review. Question generation has recently attracted significant research attention, and most works regard it as a dual task of reading comprehension, namely generating a question from a sentence with a fixed text segment in the sentence designated as the answer.
Two unique characteristics of our review-based question generation task differentiate it from previous question generation work. First, there are no review-question pairs available for training, so a simple Seq2Seq-based question generation model that learns the mapping from the input (i.e. review) to the output (i.e. question) cannot be applied. Even though we can easily obtain large volumes of user-posed review sets and question sets, they are separate datasets and cannot provide any supervision of the input-output mapping (i.e. review-question pairs). Second, unlike in traditional question generation, the question generated from a review sentence will not simply take a fixed text segment in the review as its answer. The reason is that some reviews describing user experiences are highly context-sensitive. For the example in Table 1, for the review "I have battery life for more than two days for normal use, i.e. not power-consuming gaming." and its corresponding example question "How many days it can last?", the text segment "more than two days" is obviously a less precise answer, while the whole review sentence is much more informative. In some other cases, even such a less precise answer span cannot be extracted from the review sentence, e.g. for the example question "Does this give protection to the phone?" and the review sentence "Also it is ... even though I dropped my phone ..., its shock resistant technology won't let a single thing happen to the case or the phone.". Here, a simple "Yes" or "No" answer does not make much sense either, while the whole review sentence is a vivid and informative answer.

Table 1: A review of a battery case product and example questions.
Review: It doesn't heat up like most of the other ones, and I was completely fascinated by the ultra light and sleek design for the case. Before I was using the Mophie case but I couldn't wear it often because it was like having a hot brick in your pocket, hence I had to always leave it at home. On the contrary, with PowerBear, I never take it off because I can't even tell the difference. Also it is build in a super STRONG manner and even though I dropped my phone a few times, its shock resistant technology won't let a single thing happen to the case or the phone. The PowerBear case became an extension to my phone that I never have to take off because when I charge it at night, it charges both my phone and the case. I have battery life for more than two days for normal use, i.e. not power-consuming gaming.
Question:
- Does this make the phone warm during charging?
- Have any of you that own this had a Mophie?
- Does this give protection to the phone?
- Can this charge the phone and the extra battery at the same time?
- How many days it can last?
The above two unique characteristics raise two challenges for our task. The first challenge, namely the lack of review-question pairs as training instances, appears to be intractable, particularly given that current end-to-end models are very data-hungry. One immediate idea is to use user-posed (question, answer) pairs as a substitute for training. However, several instance-related defects hinder the learned generation model from being competent for review-based question generation. First, some answers are very short, e.g. "more than two days", so without the necessary context they are not helpful for generating good questions. Second, some verbose answers contain irrelevant content, especially for subjective questions. To handle this challenge, we propose a learning framework with adaptive instance transfer and augmentation.
First, a generation model pre-trained on user-posed answer-question pairs is used as the initial question generator. A ranker is designed to work together with the generator to improve the training instance set, by distilling it via removing unsuitable answer-question pairs to avoid "negative transfer" (Pan and Yang, 2009), and by augmenting it (Kobayashi, 2018) via adding suitable review-question pairs. For selecting suitable reviews for question generation, the ranker considers two factors: the major aspects in a review and the review's suitability for question generation. The two factors are captured via a reconstruction objective and a reinforcement objective with the reward given by the generator. Thus, the ranker and the generator are iteratively enhanced, and the adaptively transferred answer-question pairs and the augmented review-question pairs gradually relieve the data scarcity problem.
In accordance with the second characteristic of our task, it is plausible to regard a review sentence or clause as the answer to the corresponding question originating from it. Such treatment brings in the second challenge: how can we guarantee that the generated question concentrates on the critical aspect mentioned by the review sentence? For example, a question like "How was the experience for gaming?" is not a favourable generation for "I have battery life for more than two days for normal use, i.e. not power-consuming gaming.". To solve this problem, we incorporate aspect-based feature discovery in the ranker, and then we integrate the aspect features and an aspect pointer network in the generator. The incorporation of such aspect-related features and structures helps the generator to focus on critical product aspects rather than on the less important parts, which is consistent with real user-posed questions.
To sum up, our main contributions are threefold.
(1) A new practical task, namely question generation from reviews without annotated instances, is proposed, and it has good potential for multiple applications.
(2) A novel adaptive instance transfer and augmentation framework is proposed for handling the data scarcity challenge in the task.
(3) Review-based question generation is conducted on E-commerce data of various product categories.

Related Work
Question generation (QG) is an emerging research topic due to its wide application scenarios such as education, goal-oriented dialogue (Lee et al., 2018), and question answering. The preliminary neural QG models outperform the rule-based methods relying on hand-crafted features, and thereafter various models have been proposed to further improve the performance by incorporating question type (Dong et al., 2018), answer position, long passage modeling (Zhao et al., 2018b), question difficulty, and to-the-point context (Li et al., 2019). Some works try to find the possible answer text spans for facilitating the learning. Question generation models can be combined with their dual task, i.e., reading comprehension or question answering, with various motivations, such as improving auxiliary task performance (Golub et al., 2017), making QA and QG models collaborate (Tang et al., 2018), and unified learning (Xiao et al., 2018).
Although question generation has been applied to other datasets, e.g., Wikipedia (Du and Cardie, 2018), most existing QG works treat it as a dual task of reading comprehension (Cui et al., 2017), namely generating a question from a piece of text where a certain text span is marked as the answer, with a few exceptions where only sentences without answer spans are used for generating questions (Chali and Baghaee, 2018). Such a generation setting is not suitable for reviews due to the lack of (question, review) pairs and the improper assumption of a text-span answer, as mentioned above. Some works train the question generation model with the user-written QA pairs on E-commerce sites (Hu et al., 2018; Chali and Baghaee, 2018), but their practicality is limited since the questions are only generated from answers instead of reviews.
Transfer learning (Pan and Yang, 2009; Li et al., 2020) refers to a broad scope of methods that exploit knowledge across domains for handling tasks in the target domain. A few terms are used for describing specific methods in this learning paradigm, e.g., self-taught learning (Raina et al., 2007), domain adaptation (Long et al., 2017), etc. Based on "what to transfer", transfer learning is categorized into four groups (Pan and Yang, 2009), namely instance transfer, feature representation transfer, parameter transfer, and relational knowledge transfer. Our learning framework can be regarded as a case of instance transfer with iterative instance adaptation and augmentation.

The Proposed AITA Framework
For handling the aforementioned issues, we propose an Adaptive Instance Transfer and Augmentation (AITA) framework as shown in Figure 1. Since the review-related processing is always sentence-based, we use "review" for short to refer to a review sentence in this paper. The two components of AITA, namely the ranker and the generator, are learned iteratively. Initially, AITA simply transfers all available (question, answer) pairs and trains a generator. Then it iteratively enhances the generator with the help of the ranker. The ranker takes a (question, answer) pair and a review as its input and calculates a ranking score s. Thus, it can rank all reviews for a given QA pair. The ranking objective incorporates the reward provided by the generator, which helps find suitable reviews to form (review, question) pairs for training (i.e. augmenting the training data). Meanwhile, the reward from the generator also helps remove unsuitable QA pairs from training, which makes the transfer more adaptive. Note that the ranker also learns to model two hidden aspect-related variables for the review, which help the generator ask about the major aspects in the review. Such an iterative instance manipulation procedure gradually transfers and augments the training set for handling review-based question generation.

Review Ranker for Data Augmentation
There are two pieces of input text for the ranker. The first one is the concatenation of a (question, answer) pair, denoted qa, and the second one is a review sentence r. qa and r are associated with the same product. Since the ranker is responsible for instance augmentation that provides (question, review) pairs, it is trained to learn a score s(qa, r) which can be used to return suitable r's for a given qa.
Ranking with Partially Shared Encoders. The inputs qa and r are encoded with two Transformer encoders with the same structure and partially shared parameters, to leverage the advantage of multi-head self-attention in modeling word associations without considering term position. An input (qa or r) is written as a matrix $E = [e_1^\top, \ldots, e_n^\top]^\top$, where $e_i$ is a word embedding and $n$ is the text length. The number of heads in the multi-head self-attention is denoted as $m$, and the output of the $j$-th head is written as:
$$\mathrm{head}^j = \mathrm{softmax}\left(\frac{Q^j (K^j)^\top}{\sqrt{d}}\right) V^j,$$
where $Q^j$, $K^j$ and $V^j$ are the query, key and value projections of $E$ for the $j$-th head, and $d$ is the dimension of word embedding. The outputs of different heads are concatenated, and the encoding for the $i$-th word is written as $h_i = [\mathrm{head}^1_i; \ldots; \mathrm{head}^m_i]$. To obtain a sentence representation that considers the complete semantics, we apply a global attention layer on the output of the Transformer encoder:
$$h^\alpha = \sum_{i=1}^{n} \alpha_i h_i, \tag{3}$$
where the attention weight $\alpha_i = \exp(h_i \cdot M \cdot \bar{h}) / Z_\alpha$, $Z_\alpha$ is the normalization term, and $\bar{h} = \sum_{i} h_i / n$. The parameter matrix $M$ is shared by the encoders for both qa and r, to capture the common attention features across them.
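As a minimal sketch of the global attention pooling described above (not the authors' implementation), the following pure-Python function computes the weights $\alpha_i$ from the word encodings and the shared matrix $M$, and then pools the encodings into one sentence vector. The function name `global_attention` and the nested-list representation of `H` and `M` are illustrative assumptions.

```python
import math


def global_attention(H, M):
    """Pool word encodings H (n rows of length d) into one sentence vector.

    alpha_i is proportional to exp(h_i . M . h_bar), where h_bar is the mean
    of the word encodings. Returns (alpha, pooled_vector).
    """
    n, d = len(H), len(H[0])
    # Mean word encoding h_bar.
    h_bar = [sum(row[k] for row in H) / n for k in range(d)]
    # M . h_bar, computed once and reused for every word.
    m_h = [sum(M[a][b] * h_bar[b] for b in range(d)) for a in range(d)]
    logits = [sum(h[a] * m_h[a] for a in range(d)) for h in H]
    # Subtract the max logit for numerical stability before exponentiating.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    z = sum(exps)
    alpha = [e / z for e in exps]
    pooled = [sum(alpha[i] * H[i][k] for i in range(n)) for k in range(d)]
    return alpha, pooled
```

With an identity matrix for `M`, a word whose encoding is closest in direction to the mean encoding receives the largest weight, which is the intended effect of sharing `M` across the two encoders.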
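After encoding qa and r as $h^\alpha(qa)$ and $h^\alpha(r)$, a vector $g(qa, r)$ is formed as the concatenation of $h^\alpha(qa)$, $h^\alpha(r)$ and their difference:
$$g(qa, r) = [h^\alpha(qa);\ h^\alpha(r);\ h^\alpha(qa) - h^\alpha(r)].$$
The review ranking score $s(qa, r)$ is then computed by applying the sigmoid function $\sigma$ to a learned projection of $g(qa, r)$.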
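The scoring step can be illustrated with a short sketch. The linear projection with weights `w` and bias `b` is an assumption made for illustration, since only the sigmoid and the feature vector are specified in the text; the function names are hypothetical as well.

```python
import math


def feature_vector(h_qa, h_r):
    """Concatenate the two sentence vectors and their element-wise difference."""
    return list(h_qa) + list(h_r) + [x - y for x, y in zip(h_qa, h_r)]


def ranking_score(h_qa, h_r, w, b):
    """Sigmoid of a linear projection of g(qa, r).

    w and b are hypothetical learnable parameters, assumed for this sketch.
    """
    g = feature_vector(h_qa, h_r)
    z = sum(wi * gi for wi, gi in zip(w, g)) + b
    return 1.0 / (1.0 + math.exp(-z))
```

Because the score lies in (0, 1), it can be normalized over the reviews of one product to obtain sampling probabilities.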

Reinforcement Objective for Ranker Learning.
To learn an appropriate s(qa, r), we encounter a major challenge, namely lacking ground truth labels for (question, review). Our solution takes the generator in our framework as an agent that can provide reward for guiding the learning of ranker.
The generator is initially trained with (question, answer) data and is gradually updated with adapted and augmented training instances, so that the rewards from the generator can reflect the ability of a review to generate the corresponding question. Specifically, we propose a reinforcement objective that makes use of the reward from the generator, denoted as $\mathrm{reward}_G(r, q)$. For each pair of question and review, we take $\log \mathrm{ppl}(q|r)$ in the generator, normalized over $R_{qa}$, as the reward, where $R_{qa}$ is the set of reviews under the same product as qa, and $\log \mathrm{ppl}(q|r)$ is the log perplexity of generating a question $q$ from a review $r$. The reinforcement objective for the ranker is to maximize the average reward for all the reviews given a question. The sampling probabilities for reviews are obtained via the normalized ranking score, namely $p(r|qa) = s(qa, r)/Z_{qa}$, where $Z_{qa} = \sum_{r^* \in R_{qa}} s(qa, r^*)$. The loss function $L_g(qa, r)$ is defined over the rewards of reviews sampled from this distribution. Computing the exact gradient of this objective is intractable. As an approximation that performs well in the iterative algorithm, the normalization term $Z_{qa}$ is fixed during the calculation of the policy gradient:
$$\nabla L_g(qa, r) = \sum_{r} \nabla s(qa, r)\, \mathrm{reward}_G(r, q) / Z_{qa}.$$

Regularization with Unsupervised Aspect Extraction. Product aspects usually play a major role in product questions, answers and reviews alike, since they are the focus of discussion in such text. Thus, such aspects can act as connections in modeling input pairs of qa and r via the partially shared structure. To help the semantic vector $h^\alpha$ in Eqn 3 capture salient aspects of reviews, an autoencoder module is connected to the encoding layer for reconstructing $h^\alpha$. Together with the matrix $M$, the autoencoder can be used to extract salient aspects from reviews. Note that this combined structure is similar to the ABAE model, which has been shown to be effective for unsupervised aspect extraction. Compared with supervised aspect detection methods, such an unsupervised module avoids the burden of aspect annotation for different product categories, and our experiments demonstrate that regularization based on this module is effective.
Specifically, $h^\alpha$ is mapped to an aspect distribution $p^\alpha$, and the review representation is then reconstructed from $p^\alpha$, where each dimension in $p^\alpha$ stands for the probability that the review contains the corresponding aspect, and $A$ is a learnable parameter matrix used in the reconstruction. Note that we define "aspects" as implicit aspect categories, namely clusters of associated attributes of a product, which is common in unsupervised aspect extraction (Wang et al., 2015). The reconstruction objective $L_\alpha(qa, r)$ considers only the reconstruction of review representations, since we focus on discovering aspects in reviews (see footnote 1). In this way, the aspect-based reconstruction will force $h^\alpha$ to focus on salient aspects that facilitate the reconstruction. The final loss function of the ranker is regularized to:
$$L(qa, r) = L_g(qa, r) + \lambda L_\alpha(qa, r), \tag{10}$$
where $\lambda$ is a hyper-parameter.
Footnote 1: We simplified the objective of the ABAE model by eliminating the additional regularization term, which is not necessary when combining $L_\alpha(qa, r)$ and $L_g(qa, r)$.
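To make the ranker objective more concrete, here is a small sketch of the quantities involved, under stated assumptions: the rewards are toy numbers rather than generator outputs, the additive combination of the two loss terms with weight λ is assumed, and all function names are hypothetical.

```python
def sampling_probs(scores):
    """p(r|qa) = s(qa, r) / Z_qa over the reviews of one product."""
    z_qa = sum(scores)
    return [s / z_qa for s in scores], z_qa


def expected_reward(scores, rewards):
    """Average generator reward under the ranker's sampling distribution."""
    probs, _ = sampling_probs(scores)
    return sum(p * rw for p, rw in zip(probs, rewards))


def gradient_coefficients(scores, rewards):
    """With Z_qa held fixed, each grad s(qa, r) is weighted by reward / Z_qa."""
    _, z_qa = sampling_probs(scores)
    return [rw / z_qa for rw in rewards]


def regularized_loss(l_g, l_alpha, lam=0.8):
    """Assumed combination: reinforcement term plus lambda times reconstruction term."""
    return l_g + lam * l_alpha
```

Holding `z_qa` constant means each review contributes an independent term to the gradient, which is what makes the approximation cheap to compute inside the iterative loop.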

Question Generator in Transfer Learning
We adapt the Seq2Seq model for the aspect-focused generation model, which is updated gradually via the transferred and augmented instances. With the help of aspect-based variables learned in ranker, the generator can generate questions reflecting the major aspect in the review.
Aspect-enhanced Encoding. To emphasize the words related to salient aspects, the attention weight $\alpha_i$ obtained in the ranker is incorporated into the word embedding. Given an input review sentence, we obtain the extended word embedding $\tilde{e}_i$ at position $i$:
$$\tilde{e}_i = [e_i;\ e_i^{POS};\ e_i^{NER};\ \alpha_i],$$
where $e_i$ is the pre-trained word embedding, $e_i^{POS}$ is the one-hot POS tag of the $i$-th word, $e_i^{NER}$ is a BIO feature indicating whether the $i$-th word is a named entity, and $\alpha_i$ indicates the aspect-based weight for the $i$-th word. A Bi-LSTM is adopted as the basic encoder of the generator, encoding the $i$-th word as the concatenation of the hidden states from both directions.

Decoding with Aspect-aware Pointer Network. A pointer network, i.e., a copy mechanism, can significantly improve the performance of text generation. In our task, in addition to the word-level hidden state in the decoder, the overall aspect distribution of the review can also provide clues for how likely the generator should copy the corresponding review aspect words into the generated question. The question is generated with an LSTM decoder. The word probability for the current time step is computed from $s_t$ and $c_t$, where $s_t$ is the hidden state for the $t$-th word in the question and $c_t$ is the context encoding based on the attention weights $z_{tj}$.
Algorithm 1: Learning algorithm of AITA.
Data: QA set $S_{qa} = \{(q, a)\}$; review set $S_r = \{r\}$; $\mu$
Result: $S$; generator trained with $S$
Prepare pairs of (qa, r) under each product
Initialize the training set $S = S_{qa}$
For each epoch Do
    1. Train generator with $S$.
    2. Prepare the reward $\mathrm{reward}_G(qa, r)$ as generator reward for each pair of (qa, r) (each answer a in qa pairs is regarded as a review for q).
    3. Adapt $S$ via removing $\mu$ instances with low reward.
    4. Train ranker according to the objective in Eqn 10.
    5. Augment $S$ via adding $\mu$ pairs of instances, which are (q, r) pairs with top $s(qa, r)$ in ranker.
    6. Collect $\alpha$ and $p^\alpha$ for instances in $S$ from ranker.
End

In the pointer network, for a particular position $t$ in the generated text, the word may be copied from a distribution based on the attention weights $z_t = \{z_{tj}\}$, where the copy probability is assigned according to the current hidden state $s_t$. We also consider the influence of the aspect distribution $p^\alpha$ in the copy probability $\beta$ used for interpolation. The incorporation of $p^\alpha$ helps the pointer network to consider the overall aspect distribution of the context, in addition to the semantics at the current position, when copying words. Finally, the $t$-th word is generated from the mixture of the two distributions, weighted by $\beta$. The generator is trained by maximizing the likelihood of the question $q$ given the review $r$.
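The interpolation between generating from the vocabulary and copying from the review can be sketched as follows. This is a simplified illustration rather than the paper's exact formulation: the copy probability `beta` is passed in as a given number, so the way the decoder state and the aspect distribution produce it is not modeled, and the function name is hypothetical.

```python
def mix_distributions(p_vocab, attention, source_tokens, beta):
    """Blend a vocabulary distribution with a copy distribution.

    p_vocab: dict mapping vocabulary words to generation probabilities.
    attention: attention weights over source positions (sums to 1).
    source_tokens: review tokens aligned with the attention weights.
    beta: copy probability in [0, 1], assumed to be computed elsewhere.
    """
    final = {word: (1.0 - beta) * p for word, p in p_vocab.items()}
    for token, weight in zip(source_tokens, attention):
        # Repeated or in-vocabulary tokens accumulate their copy mass.
        final[token] = final.get(token, 0.0) + beta * weight
    return final
```

A larger `beta` shifts probability mass toward words that appear in the review, which is how aspect words from the review can reach the generated question even when they are rare in the vocabulary distribution.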

Iterative Learning Algorithm
The purpose of our iterative learning, as shown in Algorithm 1, is to update the generator gradually via instance augmentation. The input data for the iterative learning consists of the transferred instance set of question-answer pairs $S_{qa}$, an unlabeled review set $S_r$, and an adaption parameter $\mu$. When the learning is finished, two outputs are produced: the final training instances $S$, and the learned generator. The training set $S$ for the generator is initialized with $S_{qa}$. In each iteration of the algorithm, the generator is trained with the current $S$, and then $S$ is adapted accordingly. The ranker is trained based on the rewards from the generator, and is then used for instance augmentation in $S$. Thus, the training set $S$ is updated during the iterative learning, starting from a pure (question, answer) set. An analysis of the influence of the composition of $S$, i.e., the instance numbers of the two types, is presented in Section 4.5.
There are two kinds of updates for the instance set $S$: (1) adaption via removing (q, a) pairs with low generator reward, in order to avoid "negative transfer"; (2) augmentation via adding (q, r) pairs that are top ranked by the ranker, in order to increase the proportion of suitable review-question instances in the training set. The instance number hyperparameter $\mu$ for removing and adding can be set according to the scale of $S_{qa}$, and more details are given in our experimental setting.
To guarantee the effective instance manipulation, two interactions exist between generator and ranker. First, aspect-related variables for reviews obtained by ranker are part of the generator input. The second interaction is that a reward from generator is part of the learning objective for ranker, in order to teach ranker to capture the suitable reviews for generating the corresponding question.
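The control flow of the iterative procedure can be sketched in a few lines. This is a schematic under stated assumptions, not the authors' code: `train_generator`, `reward` and `train_ranker` are hypothetical callables supplied by the caller, candidates are `(q, a, r)` triples under the same product, and the collection of the aspect variables for the generator input is omitted.

```python
def aita_loop(qa_pairs, candidates, train_generator, reward, train_ranker, mu, epochs):
    """Iteratively adapt and augment the generator's training set.

    reward(generator, text, question) returns the generator reward.
    train_ranker(generator, candidates) returns a scoring function over candidates.
    Returns the final instance set and the most recently trained generator.
    """
    kept_qa = list(qa_pairs)   # transferred (question, answer) instances
    added_qr = []              # augmented (question, review) instances
    generator = None
    for _ in range(epochs):
        # Train the generator on the current instance set.
        generator = train_generator(kept_qa + added_qr)
        # Adapt: drop the mu (q, a) pairs with the lowest generator reward.
        kept_qa.sort(key=lambda qa: reward(generator, qa[1], qa[0]), reverse=True)
        kept_qa = kept_qa[:max(len(kept_qa) - mu, 0)]
        # Train the ranker with the generator's rewards.
        score = train_ranker(generator, candidates)
        # Augment: add the mu top-ranked (q, r) pairs not added before.
        fresh = [c for c in candidates if (c[0], c[2]) not in added_qr]
        fresh.sort(key=score, reverse=True)
        added_qr.extend((q, r) for q, _a, r in fresh[:mu])
    return kept_qa + added_qr, generator
```

Keeping the transferred and augmented instances in separate lists makes the proportion of review-question instances easy to track across epochs.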

Datasets
We use the user-written QA dataset collected in (Wan and McAuley, 2016) and the review set collected in (McAuley et al., 2015) as our experimental data. The two datasets are collected from Amazon.com separately. We filter and merge the two datasets to obtain products for which both associated QA pairs and reviews can be found. The statistics of our datasets can be found in Table 2, where the number of products for several very large product categories is restricted to 5000. According to the average lengths, whole reviews tend to be very long. This supports our assumption that it is not easy for users to exploit reviews, and that short questions can serve as a good catalogue for viewing reviews.
To test our question generation framework, we manually labeled 100 ground truth review-question pairs for each product category. Six volunteers were asked to select user-posed questions and the corresponding review sentences that can serve as answers. Specifically, the volunteers were given pairs of question and review, and only considered the relevance between question and review. The answer to the question was also accessible, but it was only used to help annotators understand the question. All labeled pairs were validated by two experienced annotators with a good understanding of consumer information needs in E-commerce. The labeled instances are removed from the training set.

Experimental Settings
For each product category, we train the AITA framework and use the learned generator for testing. Fixed 300-dimensional GloVe word embeddings (Pennington et al., 2014) are used as the basic word vectors. For all text, including questions, answers and reviews, we use StanfordNLP for tokenization, lowercasing, and linguistic feature extraction, e.g., NER and POS for the encoder in the generator. In the ranker, the dimension of the aspect distribution is set to 20 and the $\lambda$ in the final loss function in Eqn 10 is set to 0.8. In the multi-head self-attention, the head number is set to 3 and the dimension for Q, K, V is 300. The dimensions of the matrices can be set accordingly. The hidden dimension in the generator is set to 200. In the iterative learning algorithm, we set the epoch number to 10 and the updating instance number $\mu$ to $0.05 \times |S_{qa}|$. In testing, given a review $r$ as input for the generator, the additional input variables $\alpha(r)$ and $p^\alpha(r)$ are obtained via the review encoder (Eqn 3) and aspect extraction (Eqn 8), which are question-independent.
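For convenience, the hyper-parameters listed above can be gathered in one configuration object. The class and field names below are illustrative assumptions, and rounding $\mu$ to the nearest integer is a choice made for this sketch; only the values come from the settings described in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AITAConfig:
    embedding_dim: int = 300         # fixed GloVe word embeddings
    num_aspects: int = 20            # dimension of the aspect distribution
    lambda_reg: float = 0.8          # weight of the reconstruction term
    num_heads: int = 3               # heads in multi-head self-attention
    qkv_dim: int = 300               # dimension for Q, K, V
    generator_hidden_dim: int = 200  # hidden dimension in the generator
    epochs: int = 10                 # iterations of the learning algorithm
    update_ratio: float = 0.05       # mu as a fraction of |S_qa|

    def mu(self, num_qa_pairs: int) -> int:
        """Number of instances removed and added per epoch (rounded)."""
        return round(self.update_ratio * num_qa_pairs)
```

For a category with 1000 QA pairs, `AITAConfig().mu(1000)` gives 50 instances removed and 50 added in each epoch.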
For testing the effectiveness of our learning framework and the incorporation of aspect, we compare our method with the following models:
- $G_a$: A sentence-based Seq2Seq generation model trained with user-written answer-question pairs.
- $G_a^{PN}$: A pointer network is incorporated in the Seq2Seq decoding to decide whether to copy a word from the context or select one from the vocabulary.
- $G_{ar}^{PN}$: Review data is incorporated via a retrieval-based method. Specifically, the most relevant review sentence for each question is retrieved via the BM25 method, and such review-question pairs are added into the training set.
- $G_a^{PN}$+aspect (Hu et al., 2018): Aspect is exploited in this model. We trained the aspect module in our framework, i.e. only using the reconstruction objective to obtain an aspect feature extractor from reviews. Then the aspect features and distributions can be used in the same way as in our method.
- AITA: Our proposed framework.
- AITA-aspect: All the extracted aspect-related features are removed from AITA, as an ablation for evaluating the effectiveness of the unsupervised aspect module.
For every product category, we run each model 3 times and report the average performance with four evaluation metrics: BLEU1 (B1), BLEU4 (B4), METEOR (MET) and ROUGE-L ($R_L$).
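The retrieval step used by the BM25-based baseline can be illustrated with a compact Okapi BM25 scorer. This is a generic sketch, not the baseline's actual implementation: whitespace tokenization, the parameter values `k1=1.5` and `b=0.75`, and the function names are assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for a tokenized query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def most_relevant_review(question, reviews):
    """Return the review sentence with the highest BM25 score for the question."""
    docs = [r.lower().split() for r in reviews]
    scores = bm25_scores(question.lower().split(), docs)
    best = max(range(len(reviews)), key=scores.__getitem__)
    return reviews[best]
```

Because BM25 relies on lexical overlap only, a retrieved sentence may share words with the question without actually answering it.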

Evaluation of Question Generation
The results are shown in Table 3. AITA achieves the best performance on all product categories across the evaluation metrics. The significant improvements over other models demonstrate that our instance transfer and augmentation method can indeed reduce inappropriate answer-question pairs and provide helpful review-question pairs for the generator. The performance of $G_a$ is very poor due to the lack of an attention mechanism. Both $G_a^{PN}$ and $G_a^{PN}$+aspect perform worse than ours, even though some product categories have a large volume of QA pairs (>100k), e.g., Electronics, Tools, etc. This indicates that answer-question instances alone are not sufficient for learning a review-based question generator, because of the different characteristics of the answer set and the review set. $G_{ar}^{PN}$ performs much worse than $G_a^{PN}$, which shows that a simple retrieval method is not effective for merging the instances related to reviews and answers. AITA adapts and augments the QA set to select suitable review-question pairs considering both aspect and generation suitability, resulting in a better generator. In addition, the effectiveness of the aspect features and the aspect pointer network is illustrated by the slight but stable improvement of $G_a^{PN}$+aspect over $G_a^{PN}$ and the performance drop of AITA-aspect on all the categories. This shows that even without precise aspect annotation, our unsupervised aspect-based regularization is helpful for improving generation.

Human Evaluation and Case Study
We conduct human evaluation on two product categories to study the quality of the generated questions. Two binary metrics, Relevance and Aspect, are used to indicate whether a question can be answered by the review and whether they share the same or related product aspect. The third metric, Fluency, with the value set {1, 2, 3}, is adopted for judging the question fluency, where 1 means not fluent and 3 means very fluent. We selected 50 generated questions from each model and asked 4 volunteers for evaluation. The average scores are reported in Table 4, which shows that our framework achieves the best performance regarding all the metrics, especially for Relevance, showing that our AITA can help generate more accurate questions based on reviews and thus facilitates exploiting reviews. Due to the incorporation of implicit aspect information, both AITA and $G_a^{PN}$+aspect significantly outperform $G_a^{PN}$ regarding both Aspect and Relevance. Again, $G_{ar}^{PN}$ with a simple retrieval method for augmenting training instances cannot perform well.

Table 5: Questions generated by different models for sentences from a long watch review.
Review sentence: The entire length of the watch is 9 inches, but the effective length from the last hole to clasp is about 8 inches.
- $G_a^{PN}$: What is the difference between gear 2 neo and this watch?
- $G_a^{PN}$+aspect: How is the length?
- AITA: What is the dimension in mm?
Review sentence: If you have a huge wrist this watch mayn't look good nor fit you well.
- $G_a^{PN}$: What is the wrist size?
- $G_a^{PN}$+aspect: How does it fit?
- AITA: Will it fit my huge hand?
Review sentence: The stainless steel case back can be pried off from the 12 o'clock position (from the back), and the battery CAN be replaced.
- $G_a^{PN}$: Is the material good quality and not easy to tore?
- $G_a^{PN}$+aspect: Can the lid be removed?
- AITA: Can you tell me how to replace the battery?
Review sentence: The watch has a Japanese Miyota movement inside, and has a Japanese Sony 626sw battery which requires you to loosen a very small flat head screw and slide a little metal arm out of the way to remove the battery.
- $G_a^{PN}$: What is the battery life on this watch?
- $G_a^{PN}$+aspect: Can I remove the battery?
- AITA: Can I remove the battery?
The review sentences in Table 5 are from a long review describing some important information about a watch, and the questions generated by different models are also given. These questions are more user-friendly, and potential consumers can browse them to quickly locate the information they care about. For example, if a user wants to know more about battery replacement, the portion before the third sentence can be skipped. From the questions generated by the three methods in Table 5, we can see that the questions from AITA ask about the major aspects of the review sentences. $G_a^{PN}$ failed to capture the major aspects in the first three sentences, and the questions generated by $G_a^{PN}$+aspect are not as concrete as ours, owing to the insufficient training instances.

Analysis on Instances Composition
The training instance set for the generator, i.e., $S$ in Algorithm 1, is initialized with the QA set and gradually adapted and augmented. Here, we investigate the effect of the composition of $S$ on the generator performance at different epochs. Fig 2 illustrates two product categories and two metrics as the training instance set $S$ gradually changes. The proportion of review-question (qr) instances in $S$ starts at 0, and significant performance improvement can be observed while the qr proportion gradually increases. The results stay stable until the qr proportion reaches 80%.

Conclusions
We propose a practical task of question generation from reviews, whose major challenge is the lack of training instances. An adaptive instance transfer and augmentation framework is designed for handling the task via an iterative learning algorithm. Unsupervised aspect extraction is integrated for aspect-aware question generation. Experiments on real-world E-commerce data demonstrate the effectiveness of the training instance manipulation in our framework and the potential of the review-based question generation task.