Automatic Opinion Question Generation

We study the problem of generating opinion questions from sentences with the help of community-based question answering systems. For this purpose, we use a sequence-to-sequence attentional model, and we adopt a coverage mechanism to prevent the generated questions from repeating words. Experimental results on the Amazon question/answer dataset show an improvement over state-of-the-art question generation systems in both automatic evaluation metrics and human evaluations.


Introduction
Question generation (QG) is a task that affects many aspects of people's lives. One of its main benefits is its capability to improve one's learning ability. Studies have shown that asking questions helps students realize their knowledge deficits and encourages them to look for information to compensate for those deficits (Graesser and Person, 1994). Additionally, QG can aid search engines by providing suggestions related to users' queries (Chali and Hasan, 2015). This way, users can either choose one of those suggestions or get a better idea of how to modify their query to obtain better results. Moreover, QG can assist the reading comprehension task and the question answering community by providing a robust input for their systems (Serban et al., 2016; Rajpurkar et al., 2016; Yang et al., 2017).
In this work, we propose a sequence to sequence model that uses attention and coverage mechanisms for addressing the question generation problem at the sentence level. The attention and coverage mechanisms prevent language generation systems from generating the same word over and over again, and have been shown to improve a system's output (See et al., 2017).
We benefit from community-based question answering systems. Specifically, we use the Amazon question/answer dataset (McAuley and Yang, 2016). Its sentences are mostly informal and sometimes do not follow correct grammatical structure. We use the answers that people post on the community question answering system as inputs to our model. The result is an opinion question generation system that could serve as an interface to online forums, helping users browse and query them by offering questions as suggestions.
In the next section, we describe work related to QG. We then define the task and present the model structure. After that, we discuss the experimental settings, and we conclude with a thorough discussion of our results.

Related Work
After the first question generation shared task evaluation challenge (Rus et al., 2010), the question generation task has received considerable attention from the natural language generation community. Many of the traditional approaches rely on human effort to create robust templates, which are then employed to generate questions. For instance, Heilman and Smith (2010) overgenerate questions using hand-written rules and then rank them with a logistic regression model. Labutov et al. (2015) benefit from a low-dimensional ontology for document segments. They crowdsource a set of promising question templates that are matched with that representation and rank the results based on their relevance to the source. Lindberg et al. (2013) employ a template-based approach while taking advantage of semantic information to generate natural language questions for online learning support. Chali and Hasan (2015) consider the automatic generation of all possible questions from a topic of interest by exploiting the named entity information and the predicate argument structures of the sentences.
Lately, more approaches have been presented that use the neural encoder-decoder architecture. Serban et al. (2016) address the problem by transducing knowledge graph facts into questions. They created a factoid question and answer corpus using a recurrent neural network architecture.
QG can also be combined with its complementary task, question answering (QA), for further improvement. Tang et al. (2017) consider QG and QA as dual tasks and train their respective models simultaneously. Their training framework takes advantage of the probabilistic correlation between the two tasks. QG has also entered other communities such as computer vision. Mostafazadeh et al. (2016) introduced the visual question generation task, where the goal of the system is to create a question given an image.
One of the latest studies on the QG task was conducted by Du et al. (2017). They generate questions from both sentences and paragraphs for the reading comprehension task, adopting an attention-based sequence learning model. In another recent work, Yuan et al. (2017) generate questions from documents using supervised and reinforcement learning.
In our work, we generate questions using community questions and answers and apply the encoder-decoder structure. To boost the performance of our system, we use attention and coverage mechanisms as suggested in See et al. (2017).

Task Formulation
Given an answer $A = (a_1, a_2, \ldots, a_N)$, we generate a natural language question $Q = (q_1, q_2, \ldots, q_M)$ whose answer is embedded in $A$. Our goal is to find the $Q$ that maximizes the conditional probability $p(Q|A)$. We model $p(Q|A)$ as a product of word predictions:
$$p(Q|A) = \prod_{t=1}^{M} p(q_t \mid q_1, \ldots, q_{t-1}, A)$$
This indicates that the probability of each $q_t$ depends on the previously generated words and the input sentence $A$.
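As a small illustration of this factorization, the sketch below scores a question by summing the log-probabilities of its words, which is equivalent to taking the product of the per-word predictions. The probabilities are made-up values for illustration only, and `sequence_log_prob` is a hypothetical helper name, not part of the system described here.

```python
import math

def sequence_log_prob(step_probs):
    """Return log p(Q|A) from the per-step probabilities p(q_t | q_<t, A)."""
    # Summing logs avoids the numerical underflow of multiplying many small numbers.
    return sum(math.log(p) for p in step_probs)

# Hypothetical per-word probabilities for a 4-token question (illustrative values).
step_probs = [0.5, 0.25, 0.8, 0.9]
log_p = sequence_log_prob(step_probs)
prob = math.exp(log_p)  # equals the product of the step probabilities
```

Working in log space is the usual design choice here, since the same quantity is what the training objective and beam search operate on.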

Model Structure
For modeling $p(Q|A)$, we use the simple RNN encoder-decoder architecture (Cho et al., 2014) with the global attentional model (Luong et al., 2015), which lets the decoder learn to focus on a particular range of the input sequence during the generation task. To improve upon this model, we apply a coverage mechanism (See et al., 2017), which mitigates the word repetition problem.

Encoder
An encoder network maps an input sequence into word vectors and then converts them into hidden states $b_1, \ldots, b_N$. In our case, the encoder is a two-layer bidirectional LSTM network (Hochreiter and Schmidhuber, 1997). We concatenate the forward hidden state $\overrightarrow{b_j}$ and the backward hidden state $\overleftarrow{b_j}$ for input token $j$, giving $b_j = [\overrightarrow{b_j}; \overleftarrow{b_j}]$. This $b_j$ is used later by the decoder to calculate the context vector $c_t$, which stores the relevant source-side information and simplifies the prediction of the next target word. $c_t$ is computed as a weighted sum of the $b_i$:
$$c_t = \sum_{i=1}^{N} a_{t,i}\, b_i \qquad (1)$$
where $a_t$ is an alignment vector, calculated according to the general attention model of Luong et al. (2015):
$$a_{t,i} = \frac{\exp(h_t^\top W_a b_i)}{\sum_{k=1}^{N} \exp(h_t^\top W_a b_k)} \qquad (2)$$
Here $h_t$ is the decoder's target hidden state and $W_a$ is a learnable matrix. To initialize the decoder's hidden state, we concatenate the hidden states of the forward and the backward pass of the encoder.
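The computation of the alignment vector and the context vector can be sketched in plain Python as follows. This is a minimal illustration that assumes the "general" score of Luong et al. (2015), that is, a bilinear score between the decoder state and each encoder state. The function names, the toy vectors, and the identity matrix used for `W_a` are assumptions made for the example, not values from the experiments.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def general_attention(h_t, B, W_a):
    """Return (a_t, c_t) for decoder state h_t and encoder states B."""
    # Bilinear score for each source position: h_t^T (W_a b_i).
    scores = [sum(h * z for h, z in zip(h_t, matvec(W_a, b_i))) for b_i in B]
    a_t = softmax(scores)  # alignment vector over source positions
    dim = len(B[0])
    # Context vector: attention-weighted sum of the encoder states.
    c_t = [sum(a * b_i[k] for a, b_i in zip(a_t, B)) for k in range(dim)]
    return a_t, c_t

# Toy example: three encoder states of dimension 2 (illustrative values).
h_t = [1.0, 0.0]
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_a = [[1.0, 0.0], [0.0, 1.0]]
a_t, c_t = general_attention(h_t, B, W_a)
```

In this toy case the first and third source positions receive equal, and the largest, attention weights because they have the same score against `h_t`.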

Decoder
The decoder is a two-layer unidirectional LSTM. It keeps a coverage vector $s_t$, which is the sum of the previous alignment vectors:
$$s_t = \sum_{t'=0}^{t-1} a_{t'}$$
It shows how much coverage each input word has received from the attention mechanism so far, and it helps the mechanism avoid attending to the same words again once they have been attended to (See et al., 2017). Note that $s_0$ is a zero vector, since nothing has been covered at the first time step. This coverage vector, weighted by a parameter $w_s$ to be learned, is added to the source hidden states $b_i$, and the resulting $b_i$ is substituted in equations (1) and (2). This way, with the help of $s_t$, the attention mechanism always has a memory of its past decisions.
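The accumulation of the coverage vector can be sketched as a running sum over the alignment vectors. The helper name `coverage_vectors` and the alignment values below are assumptions for illustration. The sketch only shows the bookkeeping, not how the coverage vector is fed back into the attention computation.

```python
def coverage_vectors(alignments):
    """Given alignment vectors a_0..a_{T-1}, return s_0..s_{T-1}.

    s_t is the element-wise sum of a_0..a_{t-1}, so s_0 is all zeros.
    """
    n = len(alignments[0])
    running = [0.0] * n
    out = []
    for a in alignments:
        out.append(list(running))  # coverage available before step t
        running = [r + x for r, x in zip(running, a)]
    return out

# Hypothetical attention distributions over 3 source words for 3 decoding steps.
alignments = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
]
cov = coverage_vectors(alignments)
```

Here the first source word has accumulated a coverage of about 1.3 before the third step, which signals that it has already been attended to heavily.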
The decoder predicts the next word $q_t$ given the context vector $c_t$ and all the previously predicted words $\{q_1, \ldots, q_{t-1}\}$. We use a softmax layer to produce the predictive distribution:
$$p(q_t \mid q_1, \ldots, q_{t-1}, A) = \mathrm{softmax}(W_s \tilde{h}_t)$$
$\tilde{h}_t$ is the attentional hidden state, which is calculated given the target hidden state $h_t$ and the source context vector $c_t$:
$$\tilde{h}_t = \tanh(W_c [c_t; h_t])$$
where $W_s$ and $W_c$ are learnable parameters. The hidden state at time step $t$ of the decoder is generated by:
$$h_t = \mathrm{LSTM}(q_{t-1}, h_{t-1})$$
where $q_{t-1}$ is the previously generated word and $h_{t-1}$ is the former hidden state.
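The prediction step can be illustrated with a short sketch. It assumes the attentional hidden state takes the form used by Luong et al. (2015), a tanh layer over the concatenation of the context vector and the target hidden state, followed by a softmax over a toy vocabulary of three words. The function name `attentional_step` and all weights are illustrative assumptions.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def attentional_step(c_t, h_t, W_c, W_s):
    """Return (h_tilde, probs) for one decoding step."""
    concat = list(c_t) + list(h_t)                         # [c_t; h_t]
    h_tilde = [math.tanh(x) for x in matvec(W_c, concat)]  # attentional hidden state
    probs = softmax(matvec(W_s, h_tilde))                  # distribution over the vocabulary
    return h_tilde, probs

# Toy dimensions: context and hidden size 2, vocabulary size 3 (illustrative values).
c_t = [0.5, -0.5]
h_t = [1.0, 0.0]
W_c = [[0.1, 0.2, 0.3, 0.4], [-0.4, 0.3, -0.2, 0.1]]
W_s = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
h_tilde, probs = attentional_step(c_t, h_t, W_c, W_s)
```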
Moreover, we use the input feeding approach (Luong et al., 2015), which informs the decoder about which words were considered in past alignments. We do this by concatenating the attentional hidden state $\tilde{h}_t$ with the input at the next time step.

Training and Generation
The training objective is to minimize the negative log-likelihood of the training corpus. Considering $S = \{(a_i, q_i)\}_{i=1}^{|S|}$ as our whole training data, we define the objective as:
$$\mathcal{L} = -\sum_{i=1}^{|S|} \log p(q_i \mid a_i) \qquad (3)$$
In addition to this primary loss function, we introduce a coverage loss that penalizes overlap between the coverage vector and the attention distribution, that is, attending to the same location multiple times.
After being reweighted by a hyperparameter $\lambda$, this amount is added to equation (3). Following See et al. (2017), the overlap at decoding step $t$ is measured as $\sum_{i} \min(a_{t,i}, s_{t,i})$:
$$\mathcal{L}_{\text{total}} = \mathcal{L} + \lambda \sum_{t} \sum_{i=1}^{N} \min(a_{t,i}, s_{t,i})$$
In the generation step, we use beam search for inference to maximize the conditional probability.
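The coverage penalty can be sketched for a single decoding step as follows. The sketch assumes the form proposed by See et al. (2017), in which the overlap is the sum of element-wise minima between the attention distribution and the coverage vector. The function names and numeric values are illustrative assumptions.

```python
import math

def coverage_loss(a_t, s_t):
    """Overlap between the attention distribution a_t and the coverage vector s_t."""
    return sum(min(a, s) for a, s in zip(a_t, s_t))

def step_loss(p_target, a_t, s_t, lam=1.0):
    """Negative log-likelihood of the target word plus the weighted coverage penalty."""
    return -math.log(p_target) + lam * coverage_loss(a_t, s_t)

# Hypothetical values: the model attends again to a word it has already covered.
a_t = [0.6, 0.3, 0.1]
s_t = [0.7, 0.2, 0.1]
overlap = coverage_loss(a_t, s_t)          # 0.6 + 0.2 + 0.1
loss = step_loss(0.5, a_t, s_t, lam=1.0)   # -log(0.5) + overlap
first_step = coverage_loss(a_t, [0.0, 0.0, 0.0])  # no penalty when nothing is covered
```

Because the penalty is zero whenever the coverage vector is zero, the first decoding step is never penalized. The penalty grows only when attention returns to positions that were already attended to.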
Since the size of our vocabulary is limited, many unknown-word (UNK) tokens are generated during inference. We replace each UNK token with the source word that has the highest attention weight at that decoding step.
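The UNK replacement step can be sketched as a lookup of the most attended source position. The token string `<unk>`, the helper name `replace_unk`, and the example sentence with its attention weights are assumptions for illustration.

```python
def replace_unk(generated, alignments, source_tokens, unk="<unk>"):
    """Replace each UNK in the output with the most attended source token.

    alignments[t] is the attention distribution over source_tokens at step t.
    """
    out = []
    for tok, a_t in zip(generated, alignments):
        if tok == unk:
            best = max(range(len(a_t)), key=a_t.__getitem__)  # argmax of attention
            out.append(source_tokens[best])
        else:
            out.append(tok)
    return out

# Hypothetical source sentence, model output, and attention weights.
source_tokens = ["the", "wraps", "give", "support"]
generated = ["are", "the", "<unk>", "stiff", "?"]
alignments = [
    [0.4, 0.3, 0.2, 0.1],
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.2, 0.3, 0.4],
    [0.25, 0.25, 0.25, 0.25],
]
fixed = replace_unk(generated, alignments, source_tokens)
```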

Dataset
We use the Amazon question/answer dataset (McAuley and Yang, 2016). To filter out poorly structured sentences, we set the minimum length of the questions to 4 tokens, including the question mark. The answers must be at least 10 tokens long. Moreover, we set the maximum length of the questions and the answers to 20 and 35 tokens, respectively. As there are many URLs in the dataset, we replace them with a URL token to reduce the vocabulary size. We lower-case the entire dataset and use the NLTK toolkit for sentence tokenization. Many examples contain questions that are not grammatically correct; people may just ask: "Waterproof ?". The same problem occurs with the answers: the answer might be a single "Yes". We use 80% of the dataset as the training set, and the rest is divided between the validation set and the test set. Table 1 shows the total number of examples in each dataset after removing very long or very short sentences from the training and the validation datasets.
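The filtering rules above can be sketched as follows. This is a simplified illustration. It uses whitespace splitting as a stand-in for the NLTK tokenizer, so it expects punctuation to be separated by spaces already. The `<url>` token name and the URL pattern are assumptions rather than the exact preprocessing used for the experiments.

```python
import re

# Simple URL pattern (an assumption; real data may need a broader pattern).
URL_RE = re.compile(r"https?://\S+|www\.\S+")

def preprocess(text):
    """Lower-case, replace URLs with a placeholder token, and split on whitespace."""
    return URL_RE.sub("<url>", text.lower()).split()

def keep_pair(question, answer):
    """Apply the length limits: questions 4 to 20 tokens, answers 10 to 35 tokens."""
    q, a = preprocess(question), preprocess(answer)
    return 4 <= len(q) <= 20 and 10 <= len(a) <= 35

# Examples: a very short pair is dropped, a well-formed pair is kept.
dropped = keep_pair("Waterproof ?", "Yes")
kept = keep_pair(
    "how many cupcakes will it hold ?",
    "i really do not know , i only made full size cupcakes in it",
)
tokens = preprocess("See http://example.com/x for details")
```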

| | Train | Validation | Test |
|---|---|---|---|
| # pairs | 233729 | 28969 | 70648 |

Table 1: Statistics of the dataset

Experimental Setting
Our base model is from the OpenNMT system (Klein et al., 2017), and we use the PyTorch library, a deep learning framework that offers flexibility and speed. It accelerates the computation on both CPU and GPU, and its memory usage is efficient compared to other options. We fix the size of the answer and the question vocabularies to 50k. Only the most frequent words are kept, and the rest are replaced with the UNK token. We set the word embedding dimension to 300 and we use glove.840B.300d (Pennington et al., 2014) as the pre-trained word embedding on both the encoder and the decoder sides. These embeddings are updated during training. The LSTM hidden unit size is set to 600 and we set the number of layers to 2. We employ stochastic gradient descent (SGD) as the optimization method with an initial learning rate of 1.0 and halve the learning rate after 10 epochs. The training continues for 20 epochs with a batch size of 64 and a dropout probability of 0.3. The hyperparameter $\lambda$ that is used for weighting the coverage loss is set to 1. The decoding is done using beam search with a beam size of 5, and the generation is stopped when we reach the EOS token. In the end, we choose the model with the lowest perplexity on the validation set.
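The vocabulary truncation described above can be sketched with a frequency counter. The special token names and the choice not to count them toward the size limit are assumptions for illustration. The toy corpus uses a limit of 5 in place of 50k.

```python
from collections import Counter

def build_vocab(token_lists, max_size, specials=("<unk>", "<s>", "</s>")):
    """Map the max_size most frequent tokens (plus special tokens) to integer ids."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    most_common = [tok for tok, _ in counts.most_common(max_size)]
    return {tok: i for i, tok in enumerate(list(specials) + most_common)}

def encode(tokens, vocab, unk="<unk>"):
    """Convert tokens to ids, mapping out-of-vocabulary words to the UNK id."""
    return [vocab.get(tok, vocab[unk]) for tok in tokens]

# Toy corpus: "waterproof" occurs once and falls outside the 5 most frequent words.
corpus = [
    ["what", "size", "is", "it", "?"],
    ["what", "size", "?"],
    ["is", "it", "waterproof", "?"],
]
vocab = build_vocab(corpus, max_size=5)
ids = encode(["waterproof", "?"], vocab)
```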

Baseline
We compare our model to that of Du et al. (2017). We only experiment with their sentence-level model and run the system provided by the first author on the same Amazon question and answer dataset. We keep the source and target vocabulary sizes the same as ours (i.e., 50k) and set the maximum and the minimum lengths of the questions and answers the same as in our model. Everything else is left at the default values.

Automatic Evaluation Metrics
To evaluate our system automatically, we use three evaluation metrics. The first is BLEU (Papineni et al., 2002), which uses the n-gram similarity between a prediction and a set of references. We calculate BLEU scores for unigrams and bigrams. The next is METEOR (Denkowski and Lavie, 2014), which scores predictions by aligning them to ground-truth sentences with the help of stemming, synonyms, and paraphrases. The last evaluation metric is ROUGE (Lin, 2004), which compares the generated sentences with the references based on n-grams. For this task, we use ROUGE-L, which reports the results based on the longest common subsequence. We use the evaluation package by Chen et al. (2015).
Table 2 shows the results of our system and the baseline. Our model improves the BLEU-1 score by at least 1.5 points. It also achieves better BLEU-2 and METEOR scores, whereas its ROUGE score is lower than the baseline's. The BLEU scores reported in Du et al. (2017) are much higher than those in our work. The reason is that they use the SQuAD dataset (Rajpurkar et al., 2016), which is a human-generated corpus. Its sentences are well structured and grammatically correct, with less unnecessary punctuation and fewer colloquialisms. In community-based question answering systems, however, sentences do not always follow a correct structure, and they often contain useless information and symbols.
Another problem is that multiple questions can be generated from a single sentence. The system may generate a question that is semantically and grammatically correct and asks about accurate information in the sentence. However, if it differs from the ground truth, the scores will suffer. Figure 1 shows some examples generated by our system and by Du et al. (2017), including one where the coverage mechanism becomes useful and prevents the model from generating the same word 'material' twice.
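The n-gram overlap behind BLEU and the longest common subsequence behind ROUGE-L can be illustrated with a short sketch. This is not the evaluation package used for the reported scores. It computes only the clipped n-gram precision against a single reference, without BLEU's brevity penalty, and the raw LCS length, without ROUGE-L's precision and recall weighting. The tokenization of the example pair is an assumption.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(hyp, ref, n):
    """Fraction of hypothesis n-grams that also occur in the reference (clipped counts)."""
    hyp_counts = Counter(ngrams(hyp, n))
    ref_counts = Counter(ngrams(ref, n))
    overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    total = sum(hyp_counts.values())
    return overlap / total if total else 0.0

def lcs_length(x, y):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        cur = [0]
        for j, yj in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if xi == yj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

# One pair from the examples in this paper, lower-cased and whitespace-tokenized.
ref = "how many mini-cupcakes will this hold ?".split()
hyp = "how many cupcakes will it hold ?".split()
p1 = clipped_precision(hyp, ref, 1)  # 5 of 7 unigrams match
p2 = clipped_precision(hyp, ref, 2)  # 2 of 6 bigrams match
lcs = lcs_length(hyp, ref)           # "how many will hold ?"
```

The example shows why a reasonable question can still score low: replacing "mini-cupcakes" and "this" with near-equivalents breaks four of the six bigrams.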

Results and Discussion
Answer 1: I really don't know, I did full size cupcakes, mini ones it would hold a ton!
GT Question: How many mini-cupcakes will this hold?
DSC: what size is it?
Ours: how many cupcakes will it hold?

Answer 2: Nothing out of the ordinary. just a simple screw driver. if I recall correctly, I think it may have came with the tools needed to assemble. good luck and congratulations
GT Question: What tools are required to assemble unit?
DSC: What is the assembly required?
Ours: what tools do I need to assemble this?

Answer 3: You can definitely still do pushups with the wraps on. The wraps just give extra support, they really don't impact your range of motion at all.
GT Question: Can I do pushups while wearing these wraps, or is the material too stiff?
DSC: Can you still use the material while wearing the material?
Ours: Can I do pushups while wearing these wraps?

Answer 4: I would go with a medium it fits well and when you adjust it with the helmet it's tight to the chin.
GT Question: What size to buy for 14 yr old 125lb and 5'5?
DSC: I'm a woman with a small head, what size should I get?
Ours: What size should I get for a child who is 5'6"?

Answer 5: There's the ability to forward the bp measurement information via email to friends, family and doctors so I assume that once it's been sent an email you can print it however I haven't tested this functionality yet. At the very least when you bring up the bp readings on your screen you can do a screen capture and then print that screen capture.
GT Question: Is it possible to print the BP readings?
DSC: What is the difference between the BP and the BP?
Ours: How do you print from the BP?

Human Evaluations
To further assess the performance of our system, we performed human evaluations on the results. Three English-speaking students were asked to score the questions generated by both systems from 1 (very poor) to 5 (very good) according to two criteria: syntactic correctness and relevance. Syntactic correctness indicates grammaticality and fluency, and relevance indicates whether the question is meaningful and related to the sentence it is generated from. The three assessors performed the evaluations on 100 randomly selected question and answer pairs from the results. The comparison of human evaluations between our system and the Du et al. (2017) model is shown in Table 3. Bold numbers indicate the best-performing system for each evaluation criterion, and we see that our system outperforms the Du et al. (2017)

Conclusion
In this work, we presented a sequence-to-sequence learning model to address the opinion question generation task. We trained the model with global attention and applied the coverage mechanism to improve it. We took advantage of community-based question answering systems, which contain informal language whose sentences do not always follow grammatical rules. Experimental results show an improvement in the automatic evaluation metrics as well as the human evaluations compared to the baseline system.