Asking the Crowd: Question Analysis, Evaluation and Generation for Open Discussion on Online Forums

Teaching machines to ask questions is an important yet challenging task. Most prior work has focused on generating questions with fixed answers. Because their content is tightly constrained by the given answers, such questions are often not worth discussing. In this paper, we take the first step toward teaching machines to ask open-answered questions from real-world news for open discussion (openQG). Generating high-quality questions requires effective ways to evaluate them. We take the perspective that the more answers a question receives, the better it is for open discussion, and analyze how language use affects the number of answers. Compared with other factors, e.g. topic and post time, linguistic factors keep our evaluation from being domain-specific. We carefully perform variable control on 11.5M questions from online forums to obtain a dataset, OQRanD, and then perform question analysis. Based on the resulting conclusions, we build several models for question evaluation. For the openQG task, we construct OQGenD, the first such dataset as far as we know, and propose a model based on conditional generative adversarial networks and our question evaluation model. Experiments show that our model generates questions of higher quality than commonly used text generation methods.


Introduction
Teaching machines to ask questions from a given corpus, i.e. question generation (QG), is an important yet challenging task in natural language processing. In recent years, QG has received increasing attention from both the industrial and academic communities due to its wide applications. Dialog systems can be proactive by asking users questions (Wang et al., 2018), question answering (QA) systems can benefit from the corpus produced by a QG model (Duan et al., 2017), and education (Heilman and Smith, 2010) and clinical (Weizenbaum et al., 1966; Colby et al., 1971) systems require QG as well.
We can divide all questions into two categories. Fixed-answered questions have standard answers, e.g. "who invented the car? (Karl Benz)". In contrast, different people may give distinct answers to open-answered questions like "what do you think of the self-driving car?". Most prior work on QG (QA) aimed to generate (answer) fixed-answered questions. As such questions target answers that are specific spans of the given corpus, they are usually not worth discussing. Nowadays, with the help of online QA forums (e.g. Quora and Zhihu), open-answered questions can arouse open discussion that helps people from different backgrounds share knowledge and ideas (high-quality questions also help QA forums attract more visitors). This kind of question is also useful for many tasks, e.g. making dialog systems more proactive.
In this paper, we focus on generating open-answered questions for open discussion, i.e. the openQG task. To make our model useful in practice, we generate questions from real-world news, which is suitable for arousing open discussion. As far as we know, no research has focused on this task before due to the two difficulties:
• To generate high-quality questions (for open discussion), we need to perform question evaluation, which is rather challenging.
It is worth mentioning that a good question evaluation metric is not only necessary for comparing different models, but can also guide the text generation process, e.g. by acting as the reward function in reinforcement learning. Based on the perspective that the more answers a question receives, the higher its quality for open discussion, we analyze how language use affects the number of answers. Compared with other factors, e.g. the topic and post time, focusing on language use keeps our evaluation from being domain-specific. To this end, we carefully perform variable control on 11.5M online questions from Zhihu and build the "open-answered question ranking dataset (OQRanD)", containing 22K question pairs (the questions in each pair differ only in language use). Based on OQRanD, we reach some interesting conclusions on how linguistic factors affect the number of answers that a question receives, and further build question evaluation models. After building our linguistic-based question evaluation model, we propose a QG model based on a conditional generative adversarial network (CGAN). During the adversarial training process, we perform reinforcement learning to introduce information from the evaluation model. As far as we know, this architecture has not been used in QG before, and experiments show that our model outperforms commonly used text generation methods in the quality of generated questions. All the experiments are performed on the "open-answered question generation dataset (OQGenD)" that we build, which contains 20K news-question pairs. To the best of our knowledge, it is the first dataset for openQG.
In summary, the main contributions of this paper are threefold:
• We propose the openQG task, and build OQGenD and OQRanD from 11.5M questions for generating and evaluating questions.
• We study how language use affects the number of answers a question receives, and draw some interesting conclusions for linguistic-based question evaluation.
• We propose a model based on CGAN and our question evaluation model, which outperforms commonly used text generation models in the quality of generated questions.
Related Work

Question Evaluation
Question evaluation is a rather challenging task. Automatic evaluation metrics such as BLEU (Papineni et al., 2002), ROUGE (Lin, 2004) and METEOR (Lavie and Agarwal, 2007) have been widely used to measure n-gram overlap between generated questions and ground-truth questions. However, they are far from sufficient, since we cannot list all possible ground-truth questions in openQG. To this end, we need to develop evaluation metrics specific to questions. Some studies (Heilman and Smith, 2010; Figueroa and Neumann, 2013) directly trained question ranking (QR) models via supervised learning and used them to perform evaluation. However, these models are domain-specific and not interpretable, since we cannot tell what makes a question get a high (low) score. Rao and Daumé III (2018) took a step further and pointed out that a good question is one whose expected answer will be useful. By using the "expected value of perfect information", they proposed a useful evaluation model. However, our task differs significantly from theirs in two aspects. First, there is no correct answer for open-answered questions, so it is hard to tell which answer is "useful". Second, the goal of openQG is to arouse open discussion instead of "solving a problem".
Intuitively, a good question evaluation metric should be interpretable and avoid being domain-specific. To this end, we first analyze how language use affects the number of answers, and then build evaluation models based on these conclusions. There are some studies (Guerini et al., 2011; Danescu-Niculescu-Mizil et al., 2012; Guerini et al., 2012; Tan et al., 2014) on how language use affects the reaction that a piece of text generates, but as far as we know we are the first to focus on questions.

Question Generation
QG was traditionally tackled by rule-based approaches (Heilman and Smith, 2010; Lindberg et al., 2013; Mazidi and Nielsen, 2014; Hussein et al., 2014; Labutov et al., 2015). In recent years, neural network (NN) approaches have become the mainstream. Du et al. (2017) pioneered NN-based QG by using Seq2seq models (Sutskever et al., 2014). Many studies have since tried to make it more suitable for QG tasks, including using answer position features (Zhou et al., 2017), the pointer mechanism (Kumar et al., 2018a; Zhao et al., 2018), etc. Adding more constraints, e.g. controlling the topic (Hu et al., 2018) and difficulty (Gao et al., 2018) of QG, or combining it with QA (Duan et al., 2017; Wang et al., 2017; Tang et al., 2017), has also been studied. Recently, using adversarial training and reinforcement learning (Yuan et al., 2017; Kumar et al., 2018b; Yao et al., 2018) has become a new trend. As far as we know, the CGAN model we propose has not been used before. Besides, most prior studies aimed to generate fixed-answered questions, and to the best of our knowledge we are the first to propose the openQG task.
It is worth mentioning that although we focus only on text-based QG, questions can also be generated from images, i.e. visual question generation (Ren et al., 2015; Fan et al., 2018), and from knowledge graphs (Serban et al., 2016; Elsahar et al., 2018).

Question Analysis and Evaluation
In this section, we deal with question analysis and evaluation. We first perform variable control and build OQRanD. After that, we analyze how language use affects the number of answers a question receives. Based on these conclusions, we further build question evaluation models.

Construction of OQRanD
The number of answers a question receives is affected by many factors. As pointed out by a number of prior studies, there are four dominant variables: topic, author, time and language use. In other words, we should control the first three variables to study the effect of language use. We perform our analysis on an in-house dataset from Zhihu. It contains 11.5M open-domain questions, and the following information is provided for each question: the post time, the author (user ID), the author's followers and followees, the manually-tagged topics, and the numbers of answers, viewers and followers.
Although we mainly focus on the number of answers, the numbers of viewers and followers of a question are also interesting. In particular, if a question receives more answers, can we expect it to be viewed and followed by more people as well? To find out, we perform correlation analysis using the Pearson correlation coefficient (PCC) (Lee Rodgers and Nicewander, 1988). PCC is a measure of the linear correlation between two random variables. It takes a real value in [-1, 1], where 1 means a total positive linear correlation, 0 means no linear correlation exists, and -1 means a total negative linear correlation. The PCC between the numbers of answers and viewers is 0.93, and the PCC between the numbers of answers and followers is 0.86. Thus, questions with more answers also tend to attract more viewers and followers.
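As an illustration of the correlation analysis above, the following is a minimal Python sketch of the Pearson correlation coefficient. The function name `pearson` and the toy counts are hypothetical and are not taken from the Zhihu data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances (unnormalized; the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy counts, for illustration only.
answers = [1, 2, 3, 4]
viewers = [10, 20, 30, 40]
pcc = pearson(answers, viewers)
```

On perfectly linear toy data such as the above, the coefficient is 1 up to floating-point error; real counts would give values strictly inside [-1, 1].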
As for variable control, we first focus on topic. Since each of the 11.5M questions has one of the 37 manually-tagged topics (all topics are listed in the appendix), we divide them into 37 subsets, and extract question pairs in each subset independently. In each pair, we want the topics of the two questions to be as close as possible. Since questions are short texts (often about 10 words), topics are largely reflected by nouns. We measure topic similarity for questions $q_1$, $q_2$ by:

$$TS(q_1, q_2) = \frac{\#\ \text{nouns in both}\ q_1\ \text{and}\ q_2}{\#\ \text{nouns in}\ q_1 + \#\ \text{nouns in}\ q_2} \qquad (1)$$

where "#" means "the number of". The larger $TS(q_1, q_2)$ is, the closer $q_1$ and $q_2$ are in topic.
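The topic-similarity measure in Eq. 1 can be sketched in Python as follows. This is only an illustration: it assumes that nouns have already been extracted by a POS tagger and that nouns are counted as distinct types, neither of which is specified in the text.

```python
def topic_similarity(nouns1, nouns2):
    """TS(q1, q2) from Eq. 1: shared nouns over the sum of the noun counts.

    Assumption: nouns are counted as distinct types (sets), and POS tagging
    is done upstream.
    """
    set1, set2 = set(nouns1), set(nouns2)
    total = len(set1) + len(set2)
    if total == 0:
        return 0.0  # no nouns in either question
    return len(set1 & set2) / total


# Hypothetical noun lists for two questions.
ts = topic_similarity(["car", "price"], ["car", "safety"])
```

Under this reading, two questions with identical noun sets reach the maximum value of 0.5, since the denominator sums the counts of both questions.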
We set a threshold $\mu$, and filter out question pairs whose $TS(q_1, q_2) < \mu$. We tried a number of values for $\mu$ and finally chose $\mu = 0.3$, since the topics of $(q_1, q_2)$ are then close enough without discarding too much data. This yields 24.2M topic-controlled (TC) question pairs. Based on the TC pairs, we further control the effect of authors. Since users with more followers are expected to get more responses, we need to eliminate the effect of their social network. To do so, we collect all active users provided by Zhihu and build a "follower network". In this network, each user is a node, and there is an edge from A to B if user A follows user B. We run the PageRank algorithm (Page et al., 1999) on the network, and get a PageRank value for each user (real values are rounded to integers). By excluding TC pairs whose authors do not have the same PageRank value, we get 10.8M topic- and author-controlled (TAC) question pairs.
Controlling the effect of time is rather complex, since few questions are posted at exactly the same time. An earlier question may benefit from the "first-mover advantage" (Borghol et al., 2012), but a later question might be preferred because the earlier one can become "stale" (Tan et al., 2014). For a TAC pair $(q_1, q_2)$, we use $(n_1, n_2)$ to denote their numbers of answers, and $(t_1, t_2)$ to denote their posting times. The idea is as follows: we first study how time affects the number of answers, i.e. how $\Delta t = |t_1 - t_2|$ affects $\Delta n = |n_1 - n_2|$. After that, we can find out whether certain values of $\Delta t$ have only a small effect. By picking TAC pairs with such $\Delta t$, the effect of time can be greatly reduced.
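To make the author-control step concrete, the following is a minimal power-iteration sketch of PageRank on a follower network. The damping factor of 0.85, the iteration count, and the toy edges are assumptions for illustration; the text does not state the settings actually used, nor the scale on which real values are rounded to integers, so the rounding step is not reproduced here.

```python
def pagerank(edges, damping=0.85, iters=100):
    """Power-iteration PageRank.

    edges: iterable of (follower, followee) pairs, i.e. an edge from A to B
    if user A follows user B. damping and iters are assumed defaults.
    """
    edges = list(edges)
    nodes = sorted({user for edge in edges for user in edge})
    out = {u: [] for u in nodes}
    for follower, followee in edges:
        out[follower].append(followee)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        # Rank held by users who follow nobody is spread uniformly.
        dangling = sum(rank[u] for u in nodes if not out[u])
        new = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            for v in out[u]:
                new[v] += damping * rank[u] / len(out[u])
        rank = new
    return rank


# Hypothetical toy network: users a and b both follow user c.
ranks = pagerank([("a", "c"), ("b", "c")])
```

In this toy network the followed user `c` receives the highest score, while `a` and `b` are tied; pairs of authors would then be compared on such scores after bucketing.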
To study how $\Delta t$ affects $\Delta n$, we should leave $\Delta t$ as the only variable, i.e. control the effect of language use in TAC pairs. To do so, we measure the distance between $q_1$ and $q_2$ by the normalized edit distance $d(q_1, q_2)$, where $\mathrm{edit}(q_1, q_2)$ is the edit distance and $\mathrm{len}(\cdot)$ is the length of a question. The smaller $d(q_1, q_2)$ is, the more similar $q_1$ and $q_2$ are in language use. We rank all TAC pairs by $d$ from small to large, and pick the first 2% of pairs to get 217K topic-, author- and language-controlled (TALC) question pairs. Now that $\Delta t$ is the only difference, the smaller its effect, the smaller $\Delta n$ is expected to be. The number of TALC pairs decreases exponentially as $\Delta t$ grows. As pointed out by Tan et al. (2014), directly computing $E(\Delta n|\Delta t)$ is not reliable, since the estimate will be dominated by TALC pairs with small $\Delta n$. Instead, we use the deviation estimate $D$, where $\hat{E}(n_2|n_1)$ is the average $n_2$ over question pairs whose $q_1$ has $n_1$ answers. TALC pairs whose $n_1 > 9$ are not considered, since there are too few of them, which would make the results less reliable.
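A normalized edit distance of the kind described above can be sketched as follows. The normalization by the length of the longer question is an assumption made for this illustration, since the exact normalizer is not shown in the text; the inputs may be strings or lists of segmented words.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute x by y (free if equal)
            ))
        prev = cur
    return prev[-1]


def normalized_edit_distance(a, b):
    """Edit distance divided by the longer length (assumed normalizer)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0
```

With this normalizer the distance lies in [0, 1], which is compatible with ranking pairs by $d$ and with thresholds on $d$ such as the one used later to build OQRanD.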
In Figure 1(a), we show how $D$ varies with $\Delta t$ (the smaller the effect of $\Delta t$, the closer $D$ is to 0). As we can see, $D$ is rather small when $\Delta t$ is close to 0, which is in accordance with common sense. As $\Delta t$ grows, $D$ increases sharply, which is largely caused by the "first-mover advantage" described by Borghol et al. (2012). Although $D$ decreases when $\Delta t$ is about 100 hours (we think the main reason is that earlier questions start to become "stale"), it is not as small as before. When $\Delta t$ is about 200 hours (the later questions also start to become "stale"), $D$ increases again and stays at a high level. Figure 1(b) shows the case when $\Delta t$ is close to 0.
As mentioned above, if we control $\Delta t$ to make $D$ rather small, the effect of time will be greatly reduced. However, we may filter out too much data if we make $\Delta t$ too close to 0. Intuitively, 90 seems like a good upper bound for $D$, and we use $\Delta t_{D<90}$ to denote the time interval composed of all $\Delta t$ that make $D < 90$. To further test this upper bound, we pick out TALC pairs whose $\Delta t \in \Delta t_{D<90}$, and compute the deviation $|E(n_2|n_1) - n_1|$ under different $n_1$ to get Figure 2 (for contrast, we also show the case when $\Delta t$ is not controlled). As we can see, by choosing pairs whose $\Delta t \in \Delta t_{D<90}$, we can greatly reduce the deviations. Since $|E(n_2|n_1) - n_1| < 5$ under each $n_1$, we can further eliminate the remaining time effect by enlarging $\Delta n$. Based on these conclusions, we perform time control on all TAC pairs by choosing pairs whose $\Delta t \in \Delta t_{D<90}$ and $\Delta n > 20$ (20 is much larger than 5). To study the effect of language use, we do not want $q_1$ and $q_2$ to be too close. So we further discard the remaining pairs whose $d(q_1, q_2) < 0.6$, and get 22K question pairs to build OQRanD.
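The per-$n_1$ deviation $|E(n_2|n_1) - n_1|$ used above can be estimated from a list of answer-count pairs as in the following sketch. The function name and the toy pairs are hypothetical; the cut-off of 9 follows the statement that pairs with $n_1 > 9$ are not considered.

```python
from collections import defaultdict


def deviation_by_n1(pairs, max_n1=9):
    """Return {n1: |mean(n2 | n1) - n1|} for pairs whose n1 <= max_n1.

    pairs: iterable of (n1, n2) answer counts for question pairs.
    """
    groups = defaultdict(list)
    for n1, n2 in pairs:
        if n1 <= max_n1:
            groups[n1].append(n2)
    return {
        n1: abs(sum(values) / len(values) - n1)
        for n1, values in sorted(groups.items())
    }


# Hypothetical toy pairs; the last one is ignored because n1 > 9.
deviations = deviation_by_n1([(1, 1), (1, 3), (2, 2), (12, 100)])
```

A time interval would then be considered well controlled when these deviations stay small for every $n_1$, as in Figure 2.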

The Effects of Language Use
To show how language use affects the number of answers that a question receives, we perform significance tests on different linguistic features. The one-sided paired t-test with Bonferroni correction (for multiple comparisons) is adopted. For significance levels, we set α = .05, .01, .001, .0001, which correspond to the number of arrows (Table 1). The direction of the arrows shows how a feature affects the number of answers: up arrows (↑) indicate that a large feature value (e.g. a longer length, a higher perplexity) leads to more answers, and down arrows (↓) mean that small feature values are preferred. Here are some interesting conclusions:
Ask concise questions. The basic sanity check we perform is on the length of questions. Table 2 indicates that questions with fewer words tend to get more answers. This is in accordance with Simmons et al. (2011), who show that shorter versions of memes are more likely to become popular. In contrast, Tan et al. (2014) found that longer versions of tweets are more likely to be popular. This indicates that attracting more answers is different from getting a post retweeted by more people.
Ask one thing at a time and make it vivid. What kinds of words can help to get more answers? We test the proportions of different parts of speech (POS) that occur (proportions are better than word counts since they eliminate the effect of length). As Table 2 suggests, using fewer nouns, adjectives and prepositions is helpful. As nouns are often topic words (occurring with adjectives and prepositions), it is better to cover fewer topics and ask one thing at a time. On the other hand, it is better to use more verbs and adverbs to make the question vivid. Besides, using less punctuation helps (this often leads to more concise questions).
Interact with readers naturally. We check the proportions of personal pronouns (ppron), and find that it helps to be interactive by using more second-person pronouns, e.g. 你认为 (what do you think of). We also check the proportion of please-words, e.g. 请教 (could you please answer...). As Table 2 indicates, we should not use too many honorifics. It is better to interact with others naturally, as if talking to close friends.
Positive words help. Can we get more answers by picking words with sentiment? We check the occurrence of positive and negative words based on a word emotional polarity dictionary, NTUSD. As shown in Table 2, more sentiment words can help, especially positive words.
Use familiar expressions. Distinctive expressions may attract attention, but using "common language" can make a question better understood. Intuitively, if more commonly used words occur, a question is easier to read. To this end, we collect the 4K words with the highest frequency from OQRanD and measure their occurrence. Table 2 shows that it is better to use common words and make the question familiar. In addition, we randomly sample 134K questions that do not appear in OQRanD to build six language models (LMs) based on 1-, 2- and 3-gram word and POS features, respectively. Table 3 indicates that questions with smaller perplexity (i.e. more familiar ones) are better.
Imitate good questions. Since a number of questions have already aroused wide open discussion, can we get more answers by imitating them? We pick the 80K questions with the most answers that do not appear in OQRanD as "good questions" and train six LMs (similar to the above). Table 3 shows that the lower the perplexity of a question, the more answers it arouses. In conclusion, imitating good questions helps. We also explore whether news headlines are worth imitating. On the one hand, they are carefully written, concise texts. On the other hand, as pointed out by Wei and Wan (2017), many Chinese news headlines are intentionally written to be attention-getting. Table 3 shows that imitating their word use is useful.

Question Evaluation Model
Based on OQRanD and our conclusions about how language use affects the number of answers that a question receives, we can train models to predict which question in each pair receives more answers. Since questions in the same pair differ only in language use, models based on OQRanD can concentrate on linguistic factors and avoid being domain-specific.
Given a pair $(q_1, q_2)$, we label it as "1" if $n_1 > n_2$; otherwise we use the label "0". In this way, our task turns into a binary classification task. We train a model $F_s$ which takes a question as input and outputs a score. The larger $F_s(\cdot)$ is, the more answers are expected. By comparing $F_s(q_1)$ and $F_s(q_2)$, we can make the final prediction. Although we could also use both $q_1$ and $q_2$ as inputs and train a model that directly outputs label 0 or 1, applying $F_s$ to $q_1$ and $q_2$ separately is more flexible when we need to rank more than two questions. Besides, $F_s$ can be directly used to obtain rewards during the reinforcement QG process.
We use several models as $F_s$, and perform training based on the hinge loss. Table 4 shows the accuracy of different models (hyper-parameters and training details are provided in the appendix). When the features in Section 3.2 are not used, the CNN model gets the best performance, which is not surprising. However, adding these features greatly improves the performance of all statistical models, making SVM and RF significantly surpass CNN. This illustrates the importance of linguistic factors.
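The pairwise use of a scoring function $F_s$ with a hinge loss can be sketched as below. This is an illustration under stated assumptions: the linear scorer, the margin of 1.0, and the feature vectors are hypothetical, and the exact loss formulation used for each model is not given in the text.

```python
def linear_score(weights, features):
    """A stand-in F_s: a linear score over hand-crafted linguistic features."""
    return sum(w * x for w, x in zip(weights, features))


def pairwise_hinge_loss(score_better, score_worse, margin=1.0):
    """Hinge loss that is zero once the better question wins by the margin."""
    return max(0.0, margin - (score_better - score_worse))


def predict_label(score1, score2):
    """Label 1 if q1 is predicted to receive more answers than q2, else 0."""
    return 1 if score1 > score2 else 0


# Hypothetical weights and feature vectors for the two questions of a pair.
weights = [0.5, -1.0]
s1 = linear_score(weights, [4.0, 0.5])  # 1.5
s2 = linear_score(weights, [1.0, 1.0])  # -0.5
loss = pairwise_hinge_loss(s1, s2)
label = predict_label(s1, s2)
```

Because each question is scored independently, the same scorer can rank any number of questions by sorting on the score, which is the flexibility argued for above.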

Question Generation
In this section, we perform openQG. We construct OQGenD, the first dataset for openQG as far as we know, and propose a model based on CGAN. In particular, we use the question evaluation model based on OQRanD to introduce prior knowledge. Finally, we perform experiments with multiple evaluation metrics (including our linguistic-based model) and draw our conclusions.

Construction of OQGenD
Since real-world news is suitable for arousing open discussion, we built OQGenD from news and open-answered questions. We crawled news (published in the last three years) from Tencent News, and performed data cleaning (removing non-textual components and filtering out redundant data), which left 59K news articles. To make the questions in OQGenD suitable for open discussion, we ranked the 11.5M questions mentioned in Section 3.1 by their number of answers from large to small and picked the first half (576K).
To match news and questions, we first used automatic methods to find a "candidate dataset" and then performed human labeling to build our final OQGenD dataset. To get the candidate dataset, three heuristic unsupervised methods were used to compute the distance between a piece of news and a question: (1) term frequency-inverse document frequency (tf-idf), which first extracted 5 (10) keywords from each question (news) by tf-idf values, and then measured distances by the number of intersecting keywords; (2) cosine distance, which is based on the bag-of-words model; (3) weighted averaged word embeddings, proposed by Arora et al. (2016), which first computes a weighted average of the word vectors in the sentence and then performs "common component removal". For each piece of news, we picked out the questions with the two smallest distances under each method.
We further hired five native speakers to label the candidate dataset. A news-question pair (NQ-pair) was preserved only if it was appropriate for a human to raise the question given the piece of news. In other words, the question should be related to the given news while not mentioning extra information. To avoid discarding too many NQ-pairs, we allowed human labelers to perform two kinds of modifications on each question. First, we allowed them to modify the question in an NQ-pair by at most two entities, e.g. change it from "马克龙是怎样一个人？(What is Macron like?)" to "特朗普是怎样的一个人 (What is Trump like?)". Second, we allowed them to use a meaningful substring to replace the original question. We ensured that each NQ-pair was labeled by three people, and it was preserved in OQGenD only if all of them agreed. In this way, we got 20K NQ-pairs. Among these pairs, there were 9K news articles, each corresponding to more than one question. The average numbers of words in each piece of news and each question were 508 and 12, respectively.
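As an illustration of the bag-of-words cosine distance used for candidate matching, the following sketch computes the distance between two token lists and picks the closest questions for a piece of news. The helper names and the toy tokens are hypothetical, and word segmentation is assumed to be done upstream.

```python
import math
from collections import Counter


def cosine_distance(tokens1, tokens2):
    """1 - cosine similarity between bag-of-words count vectors."""
    c1, c2 = Counter(tokens1), Counter(tokens2)
    dot = sum(count * c2[token] for token, count in c1.items())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    if norm1 == 0 or norm2 == 0:
        return 1.0  # treat an empty text as maximally distant
    return 1.0 - dot / (norm1 * norm2)


def closest_questions(news_tokens, questions, k=2):
    """Return the k tokenized questions with the smallest distance to the news."""
    return sorted(questions, key=lambda q: cosine_distance(news_tokens, q))[:k]


# Hypothetical toy example.
news = ["self-driving", "car", "test", "city"]
candidates = [["what", "is", "love"], ["self-driving", "car", "safe"], ["car", "price"]]
top2 = closest_questions(news, candidates, k=2)
```

Keeping only the two closest questions per method keeps the candidate set small enough for manual labeling, at the cost of missing matches that share few surface words with the news.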

Model
As shown in Figure 3, our model is composed of a generator $G_\theta$ and a discriminator $D_\phi$. $G_\theta$ outputs a question $\hat{Y} = \{\hat{y}_1, \hat{y}_2, ..., \hat{y}_n\}$ from given news $X = \{x_1, x_2, ..., x_m\}$. It is a Seq2seq network with the attention mechanism (Luong et al., 2015). Both the encoder and the decoder are GRU (Chung et al., 2014) networks. $D_\phi$ takes an NQ-pair $(X, Y_D)$ as input, and predicts how likely it is to come from the real-world dataset. First, it embeds $X$ and $Y_D$ into $v_{news}$ and $v_{ques}$, respectively, using two CNNs similar to Zhang and Wallace (2015). Based on the two representations, it computes $v_{match}$ and $v_{fluent}$, where $[v_{news}; v_{ques}]$ is the concatenation of the two vectors $v_{news}$, $v_{ques}$, and $W_m$, $W_f$, $b_m$, $b_f$ are parameters of our model. We expect $v_{match}$ to measure whether the question matches the news, and $v_{fluent}$ to measure whether the question is fluent enough (like human-written questions). The final prediction (Eq. 5) is obtained with the sigmoid function $\sigma$ and the parameters $W_{proj}$, $b_{proj}$. As we can see, both $G_\theta(X)$ and $D_\phi(X, Y_D)$ are conditioned on $X$, so our model can be viewed as a special type of CGAN (Mirza and Osindero, 2014), which provides more control to keep generated questions closely related to the input news.
Algorithm 1 Training process.
    ...
    for d-steps do
        Use X, Y, Ŷ to generate fake NQ-pairs (X_f, Y_f);
        Train D_φ on real NQ-pairs (X, Y) and fake NQ-pairs (X_f, Y_f) by Eq. 6;
    end for
    for g-steps do
        ...
        Compute rewards for Ŷ by Eq. 10;
        Update G_θ on (X, Ŷ) by Eq. 9;
    end for
until G, D converge

Adversarial Training
The training process of a GAN is formalized as a game in which the generative model is trained to generate outputs that fool the discriminator (Goodfellow et al., 2014). For our model, the training process is described in Algorithm 1.
Before adversarial training, we pre-train $G_\theta$ by maximizing the log probability of a question $Y$ given $X$ ($X$, $Y$ come from OQGenD), i.e. maximum likelihood estimation (MLE), as described in Sutskever et al. (2014). This helps to make the adversarial training process more stable. Besides, the parameters of our question evaluation model Q are frozen during the whole process.
We iteratively perform d-steps and g-steps to train $D_\phi$ and $G_\theta$, respectively, during the adversarial training process. In d-steps, we fix the parameters of $G_\theta$, and the inputs for $D_\phi$ are of three kinds: (1) NQ-pairs $(X, Y)$ from OQGenD; (2) news and questions generated by $G_\theta$, i.e. $(X, \hat{Y})$; (3) unmatched NQ-pairs created from OQGenD. We label the "real data" (1) as "1", and regard both (2) and (3) as "fake data" with label "0". It is worth mentioning that the unmatched NQ-pairs are used to keep $D_\phi$ from focusing only on the questions. To train $D_\phi$, we minimize the objective function in Eq. 6. Since text generation is a discrete process, we cannot directly use $D_\phi(X, \hat{Y})$ to update $\theta$ in $G_\theta$. A commonly used idea (Yu et al., 2017; Li et al., 2017) is to train $G_\theta$ based on policy gradient (Sutton et al., 2000). In this case, $G_\theta$ is regarded as a policy network. At time-step $t$, state $s_t$ is the generated text $\hat{Y}_{[1:t]}$, and action $a_t$ is generating the next word $\hat{y}_{t+1}$ with probability $\pi_G(a_t|s_t) = p_G(\hat{y}_{t+1} | \hat{Y}_{[1:t]}, X)$. To get reward $r_t$, we perform Monte-Carlo search, i.e. we complete $\hat{Y}_{[1:t]}$ into a full sentence $\hat{Y}_{MC}$ by sampling $k$ times, and compute $r_t$ by Eq. 7. After getting $r_t$, $\theta$ is updated by minimizing the objective in Eq. 8. We can also change Eq. 8 into a penalty-based version (Eq. 9), in which $E[\sum_t \pi(a_t|s_t)]$ can be viewed as a regularization term. It forces the generator to prefer a smaller $\pi(a_t|s_t)$. In this way, it can generate more diversified results.
Since we have already trained a question evaluation model $F_s(\cdot)$ in Section 3.3, we can use the reward in Eq. 10 to replace Eq. 7. In Eq. 10, we add prior knowledge about "how language use affects the number of answers" into the adversarial training process through reinforcement learning, and expect the linguistic effects that we have discovered to guide the text generation process.
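The Monte-Carlo search used to obtain intermediate rewards can be sketched as below. This is a toy illustration that assumes the reward is the average discriminator score over $k$ completed rollouts, in the style of Yu et al. (2017); the rollout policy, the discriminator stand-in and the vocabulary are all hypothetical and do not reproduce the model in this paper.

```python
import random

VOCAB = ["what", "do", "you", "think", "?"]  # hypothetical toy vocabulary


def toy_rollout(prefix, rng, max_len=6):
    """Complete a partial question by sampling tokens uniformly (toy policy)."""
    tokens = list(prefix)
    while len(tokens) < max_len:
        tokens.append(rng.choice(VOCAB))
    return tokens


def toy_discriminator(tokens):
    """Stand-in for D(X, Y): returns a score in [0, 1] for a full question."""
    return 1.0 if tokens and tokens[-1] == "?" else 0.0


def monte_carlo_reward(prefix, rollout, discriminator, k=4, rng=None):
    """Average discriminator score over k sampled completions of the prefix."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(k):
        total += discriminator(rollout(prefix, rng))
    return total / k


reward = monte_carlo_reward(["what"], toy_rollout, toy_discriminator, k=8,
                            rng=random.Random(1))
```

A larger $k$ lowers the variance of the estimated reward but costs one generator rollout and one discriminator call per sample, so it is usually kept small in practice.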

Experiments
We choose several typical text-generation models as baselines. We apply a Seq2seq model similar to Du et al. (2017), and use a CopyNet similar to Kumar et al. (2018b). As adversarial training has become a new trend in QG, we also adopt the SeqGAN proposed by Yu et al. (2017) and the SentiGAN proposed by Wang et al. (2018). For our model, the "vanilla" version uses Eq. 7 to compute rewards, and the "full" version uses Eq. 10 (the SVM model, which gets the best performance in Table 4, is adopted as $F_s$). More details about hyper-parameters and the training process are provided in the appendix.
We adopt the commonly used BLEU, ROUGE-L and METEOR for question evaluation. Besides, our score function $F_s$ based on OQRanD is also used. Similarly, we choose the SVM model, which gets the best performance in Table 4. We compute $F_s(\hat{Y})$ for each generated question $\hat{Y}$, and report the average value in the "$F_s$-SVM" column of Table 5. As mentioned above, $F_s$ shows whether the generated questions are expected to receive more answers and are thus more suitable for open discussion. The higher the $F_s$ a model gets, the better its performance.
The results of our experiments are listed in Table 5. In terms of BLEU, ROUGE-L and METEOR, our models get the best performance. This shows the advantage of conditioning both the generator and the discriminator on the input news. Besides, the full version of our model gets the best BLEU-3, BLEU-4 and METEOR values by introducing the linguistic-based question evaluation model during adversarial training. Of all the baselines, SentiGAN gets the best performance on BLEU-3 and BLEU-4, which is largely attributable to its penalty-based objective function. Since the same piece of news corresponds to multiple questions (and these questions may differ a lot) in OQGenD, models based on adversarial training (SeqGAN, SentiGAN and ours) get better results than the others (Seq2seq and CopyNet).
In terms of $F_s$, the full version of our model gets the best performance, which illustrates that information from the SVM model is useful for generating questions of better quality. Besides, we can also use the conclusions in Section 3.2 to compare different models, e.g. questions generated by the full version of our model are the most concise (9.68 words per question). On the other hand, SentiGAN generates the longest questions (11.54 words per question).

Conclusion and Future Work
In this paper, we take the first step toward teaching machines to ask open-answered questions from news for open discussion. To generate high-quality questions, we analyze how language use affects the number of answers that a question receives, based on OQRanD, a dataset created by variable control. These conclusions help us build question evaluation models, and can also be used to compare the results of different question generation models. For question generation, we propose a model based on CGAN that uses reinforcement learning to introduce information from our evaluation model. Experiments show that our model outperforms commonly used text generation methods.
There is much future work to be done. First, we will explore more powerful QG structures to deal with the huge difference between the lengths of input and output texts. Besides, how to better leverage prior knowledge during openQG (as humans often do) is also interesting. Finally, combining openQG with its reverse task, openQA, is also worth exploring.

A Details of Language Model
In this section, we introduce the details of our language models described in section 3.2.
We used the HanLP toolkit to perform word segmentation. The toolkit was also used to get the POS of each word. To train language models, we adopted the SRILM toolkit. During this process, we used modified Kneser-Ney smoothing for all the language models based on word n-grams and Witten-Bell smoothing for the language models based on POS n-grams.

B Details of Question Evaluation Models
In this section, we introduce the details of our question evaluation models described in Section 3.3. We adopted the RankLib toolkit to train the random forest model. For the SVM model, we used the SVM-rank toolkit. More specifically, we set the trade-off between training error and margin of the SVM to 3 and chose the linear kernel function.
For CNN and RNN models, the word embedding size is 128, and the size of POS embedding is 32.The RNN model is a single-layer bidirectional LSTM network with 128 hidden units.As for the CNN model, the convolution layer contains filters whose sizes are 160 × 1, 160 × 2, 160 × 3, 160 × 4. The counts for each kind of filters are 64, 64, 64, 64, and the stride for each of them is 1.After the convolution layer, there is a max-pooling layer and a fully connected layer with the sigmoid activation to get the final result.

C Details of Question Generation models
In this section, we introduce the details of our question generation model described in section 4.2.
Our model is composed of a generator and a discriminator. The generator is a typical Seq2seq model. It has three components: an encoder network, a decoder network and an attention network. The encoder is a single-layer bidirectional GRU with 64 hidden units, while the decoder is a single-layer unidirectional GRU with 128 hidden units. The CNN of the discriminator for news contains filters whose sizes are 128 × 1, 128 × 2, 128 × 3, 128 × 4, 128 × 5. The counts for each kind of filter are 32, 64, 64, 32, 16, and the stride for each of them is set to 1. The CNN of the discriminator for questions contains filters whose sizes are 128 × 1, 128 × 2, 128 × 3, 128 × 4. The counts for each kind of filter are 32, 64, 64, 32, and the stride for each of them is set to 1.

D Examples of Our Datasets
As mentioned above, we controlled the effects of topic, time and author to get OQRanD. During this process, we divided all the questions into 37 subsets according to manually-tagged topics. These topics are listed in Table 6. Examples from OQRanD are shown in Table 7. Examples from OQGenD are shown in Table 8 (since the original news articles are long, we omit the sentences that are not related to the questions).

Figure 1: The effect of time lag (∆t) on D.

Figure 2: $D$ under different $n_1$ (the smaller, the better).

Figure 3: Architecture of our model.

Table 1: The number of arrows and t-test efficacy.

Table 3: Significance tests on LM-based features. ppl stands for perplexity.

Table 5: Results for openQG. *( ) denotes that our vanilla (full) model differs from the baseline significantly based on a one-sided paired t-test with p < 0.05.