Generating a Common Question from Multiple Documents using Multi-source Encoder-Decoder Models

Ambiguous user queries in search engines result in the retrieval of documents that often span multiple topics. One potential solution is for the search engine to generate multiple refined queries, each of which relates to a subset of the documents spanning the same topic. A preliminary step towards this goal is to generate a question that captures common concepts of multiple documents. We propose a new task of generating a common question from multiple documents and present a simple variant of an existing multi-source encoder-decoder framework, called the Multi-Source Question Generator (MSQG). We first train an RNN-based single encoder-decoder generator from (single document, question) pairs. At test time, given multiple documents, the Distribute step of our MSQG model predicts target word distributions for each document using the trained model. The Aggregate step aggregates these distributions to generate a common question. This simple yet effective strategy significantly outperforms several existing baseline models applied to the new task when evaluated using automated metrics and human judgments on the MS-MARCO-QA dataset.


Introduction
Search engines return a list of results in response to a user query. In the case of ambiguous queries, retrieved results often span multiple topics and might benefit from further clarification from the user. One approach to disambiguate such queries is to first partition the retrieved results by topic and then ask the user to choose from queries refined for each partition.
For example, a query 'how good is apple?' could retrieve documents, some of which relate to apple the fruit, and some of which relate to Apple the company. In such a scenario, if the search engine generates two refinement queries, 'how good is apple the fruit?' and 'how good is the company apple?', the user could then choose one of them as a way to clarify her initial query.

* Work was done while affiliated with Microsoft Research AI.
In this work, we take a step towards this aim by proposing a model that generates a common question that is relevant to a set of documents. At training time, we train a standard sequence-to-sequence model (Sutskever et al., 2014) with a large number of (single document, question) pairs to generate a relevant question given a single document. At test time, given multiple (N) input documents, we use our model, called the Multi-Source Question Generator (MSQG), to allow document-specific decoders to collaboratively generate a common question. We first encode the N input documents separately using the trained encoder. Then, we perform an iterative procedure to i) (Distribute step) compute predictive word distributions from each document-specific decoder based on the previous context and generation, and ii) (Aggregate step) aggregate the predictive word distributions by voting and generate a single shared word for all decoders. These two steps are repeated until an end-of-sentence token is generated. We train and test our model on the MS-MARCO-QA dataset and evaluate it by assessing whether the original passages can be retrieved from the generated question, as well as by human judgments of fluency, relevancy, and answerability. Our model significantly outperforms multiple baselines. Our main contributions are: i) a new task of generating a common question from multiple documents, where a common question target does not exist, unlike in multilingual-sources-to-common-language translation tasks; ii) an extensive evaluation of existing multi-source encoder-decoder models, including our simple variant model, for generating a common question; iii) an empirical evaluation framework based on automated metrics and human judgments on answerability, relevancy, and fluency to extensively evaluate our proposed MSQG model against the baselines.

Multi-Source Question Generator
Our Multi-Source Question Generator (MSQG) model introduces a mechanism to generate a common question given multiple documents. At training time, it employs a standard sequence-to-sequence (S2S) model using a large number of (single document, question) pairs. At test time, it generates a common question given multiple documents, similar to Garmash and Monz (2016) and Firat et al. (2016). Specifically, our MSQG model iterates over two interleaved steps until an end-of-sentence (EOS) token is generated:

Distribute Step

During the Distribute step, we take an instance of the trained S2S model and perform inference with N different input documents. Each document is encoded using one copy of the model to generate a unique target vocabulary distribution P^dec_{i,t} (for document i, at time t) for the next word. Note that source information comes not only from the encoded latent representation of a source document, but also from the cross-attention between the source and the generation.
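The Distribute step can be sketched as follows; this is a minimal toy illustration, assuming each document-specific decoder exposes its next-word logits (the array shapes and values here are made up, not taken from the paper).

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the vocabulary axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distribute_step(doc_logits):
    """Distribute step sketch: turn each document-specific decoder's
    next-word logits (shape [N, vocab]) into per-document
    distributions P^dec_{i,t}, one row per document."""
    return softmax(doc_logits)

# Toy example: N = 3 documents, a vocabulary of 4 words.
logits = np.array([[2.0, 0.5, 0.1, 0.0],
                   [1.8, 0.4, 0.2, 0.1],
                   [0.1, 2.2, 0.3, 0.0]])
P = distribute_step(logits)  # each row is a valid probability distribution
```

In the actual model, each row of `logits` would come from the shared trained decoder attending over one document's encoded representation.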

Aggregate Step

During the Aggregate step, we aggregate the N different target distributions into one distribution by averaging them:

    P̄^dec_t = (1/N) Σ^N_{i=1} β_i P^dec_{i,t},

where P̄^dec_t is the final decoding distribution at time t, and Σ^N_{i=1} β_i = N. In our experiments, we weight all the decoding distributions equally (β_i = 1) to smooth out features that are distinct in each document i, where i ∈ {1, . . . , N}.
Note that the average Aggregate step can be perceived as a majority voting scheme, in that each document-specific decoder votes over the vocabulary and the final decision is made in a collaborative manner. We also experimented with different Aggregate functions: (i) MSQG_mult multiplies the distributions, which is analogous to a unanimous voting scheme. However, it led to sub-optimal results, since one unfavorable distribution can discourage the decoding of certain common words. (ii) MSQG_max takes the maximum probability of each word across the N distributions and normalizes them into a single distribution, but it could not generate sensible questions, so we excluded it from our pool of baselines.

Model Variants
Avoiding repetitive generation. We observed that naively averaging the target distributions at every decoding step continually emphasized the common topic, thereby decoding repetitive topic words. To increase the diversity of generated tokens, we mask tokens that have already been decoded in subsequent decoding steps. This strategy is reasonable for our task, since questions generally tend to be short and rarely contain repeated words. This mechanism can be viewed as a hard counterpart of the coverage models developed in Tu et al. (2016) and See et al. (2017). We denote this feature by the subscript rmrep.

Shared encoder feature. To initialize the multiple decoders with the common meaning of the documents in a partition, we broadcast the mean of the encoded latent representations to each decoder and denote this variant by the subscript sharedh. Note that each source document can still affect the generated target vocabulary distribution P^dec_{i,t} at the Distribute step through source-generation cross-attention.
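The rmrep mask can be sketched as a greedy decoding loop that zeroes out already-emitted tokens; the fixed per-step distribution below is a toy stand-in for the aggregated distributions P̄^dec_t (in the real model these change at every step).

```python
import numpy as np

def decode_rmrep(step_dists, eos_id):
    """Greedy decoding with the rmrep mask: any token generated earlier
    is zeroed out of subsequent aggregated distributions, a hard
    counterpart of soft coverage models."""
    generated, mask = [], None
    for p in step_dists:
        p = p.copy()
        if mask is not None:
            p[mask] = 0.0  # forbid repeats of previously decoded tokens
        w = int(p.argmax())
        generated.append(w)
        if w == eos_id:
            break
        mask = np.array(generated)
    return generated

# Without the mask, this peaked distribution would emit token 2 forever;
# with rmrep, decoding falls through to the next-best tokens each step.
p = np.array([0.10, 0.20, 0.60, 0.05, 0.05])
out = decode_rmrep([p, p, p], eos_id=4)
print(out)  # -> [2, 1, 0]
```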

Experimental setup
Our training method uses a standard LSTM-based (Hochreiter and Schmidhuber, 1997) S2S model with bi-linear attention (Luong et al., 2015). An input to our encoder is a concatenation of a 100-dim GloVe (Pennington et al., 2014) vector, a 100-dim predicate location vector, and a 1024-dim ELMo (Peters et al., 2018) vector. Targets are embedded into 100-dim vectors. The S2S model is bi-directional with a 256-dim bi-linear attention in each direction with ReLU (Nair and Hinton, 2010). Our encoder has two layers, and we use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 2 × 10^−5.

Baselines
S2S We compare our model with a standard S2S baseline where we concatenate the N documents into a single document to generate a question. We provide detailed discussions about the effect of document order in supplementary material (SM). Two variants are considered (S2S and S2S rmrep ). Beam size is set to 5.

MESD
We also compare our model with the multi-encoder single-decoder (MESD) baseline, where the N documents are encoded individually into vectors {v_i}^N_{i=1}. The single decoder's initial hidden state is initialized by the mean of {v_i}^N_{i=1}, following Dong and Smith (2018).

Dataset
We use the Microsoft MAchine Reading COmprehension Question-Answering dataset (MS-MARCO-QA) (Nguyen et al., 2016), where a single data instance consists of an anonymized Bing search query q and the top-10 retrieved passages. Among the 10 passages, a passage is labelled is_selected:True if annotators used it to construct an answer; most instances contain one or two selected passages. For training S2S, we use a single selected passage p* ∈ {p_1, p_2, . . . , p_10} as input, and the query q as the target output.
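The training-pair construction can be sketched as below; the field names (`query`, `passages`, `is_selected`, `passage_text`) are assumed to follow the public MS-MARCO schema, and the record shown is a truncated toy example.

```python
def make_training_pairs(records):
    """Build (single selected passage, query) S2S training pairs from
    MS-MARCO-QA-style records, keeping one is_selected passage per
    instance as the input document and the query as the target."""
    pairs = []
    for r in records:
        selected = [p["passage_text"] for p in r["passages"] if p["is_selected"]]
        if selected:  # skip instances with no labelled passage
            pairs.append((selected[0], r["query"]))
    return pairs

record = {
    "query": "difference between cucumber and zucchini",
    "passages": [
        {"passage_text": "cucumbers and zucchini ...", "is_selected": 1},
        {"passage_text": "unrelated passage", "is_selected": 0},
    ],
}
pairs = make_training_pairs([record])
```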

Constructing Evaluation Sets
For automatic evaluation, we follow the standard evaluation method from the MS-MARCO Re-Ranking task. For each generated question q̂, we construct an evaluation set that contains 100 passages in total. First, using the 10-passage sets from the MS-MARCO-QA development dataset as inputs, we generate common questions with the baselines and our MSQG models, decoded to a maximum length of 25 words. A sample generation is provided in the SM. Second, we evaluate the generations using the pre-trained BERT-based MS-MARCO passage re-ranker R, which is publicly available and state-of-the-art as of April 1, 2019 (Nogueira and Cho, 2019). We assess whether the 10-passage set used to generate the question ranks higher than 90 other passages drawn from a pool of ~8.8 million MS-MARCO passages, using the generated question as the query. These 90 passages are retrieved via a different criterion: BM25 (Robertson and Zaragoza, 2009) using Lucene. Note that there can be multiple 10-passage sets that generate the same question q̂. For each of these 10-passage sets, we construct a 100-passage evaluation set using the same 90 passages retrieved via the BM25 criterion.

Evaluation Metrics
MRR, MRR@10, nDCG. An input to the re-ranker R is a concatenation of the generated question and one passage, i.e. [q̂, p]. For each pair, it returns a score ∈ (0, 1), where 1 denotes that the input passage is the most suitable for q̂. We score all 100 pairs in an evaluation set. For the source 10-passage set, we average its 10 scores into one score, treating it as one combined document, and obtain the retrieval statistics MRR, MRR@10 (Voorhees, 2001; Radev et al., 2002), and nDCG (Järvelin and Kekäläinen, 2002) (see the SM for details).
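The reciprocal-rank computation for one evaluation set can be sketched as follows; the scores below are made-up stand-ins for the re-ranker's outputs (10 source-passage scores plus 90 distractor scores).

```python
import numpy as np

def mrr_for_eval_set(source_scores, distractor_scores, k=None):
    """Reciprocal rank of the source 10-passage set: its 10 re-ranker
    scores are averaged into one combined score, which is then ranked
    against the distractor scores. k=10 gives MRR@10 (0 if rank > k)."""
    combined = float(np.mean(source_scores))
    rank = 1 + sum(s > combined for s in distractor_scores)
    if k is not None and rank > k:
        return 0.0
    return 1.0 / rank

source = [0.9, 0.8, 0.85, 0.7, 0.95, 0.9, 0.6, 0.75, 0.8, 0.85]
distractors = [0.99, 0.5, 0.4] + [0.1] * 87
print(mrr_for_eval_set(source, distractors))  # combined 0.81, rank 2 -> 0.5
```

Averaging these per-set reciprocal ranks over all evaluation sets gives the reported MRR.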
Human Judgments. We also conduct a human evaluation in which we compare questions generated by MSQG_sharedh,rmrep, questions generated by the S2S baseline, and the reference questions using three criteria: fluency, relevancy, and answerability with respect to the original 10 passages. We randomly select 200 (10-passage, reference question) sets from which we generate questions, yielding 2,000 (passage, question) evaluation pairs for our model, the baseline, and the reference, respectively (see the SM for details).

Results

Table 3 shows the mean retrieval statistics and the proportion of unique generated questions from 55,065 10-passage instances. Our proposed MSQG models are more effective at retrieving the source 10-passage sets. In particular, MSQG_sharedh,rmrep outperforms the baselines on all metrics, indicating that broadcasting the mean of the document vectors to initialize the decoders (sharedh) and increasing the coverage of vocabulary (rmrep) are effective mechanisms for generating common questions.

Overall, the retrieval statistics are relatively low. Most of the 100 passages in each evaluation set have high pair-wise cosine similarities: based on BERT (Devlin et al., 2018) embeddings, computed over a large sample of passage pairs until the estimates converged, a random set of 10 passages has an average pair-wise similarity of 0.80, whereas the top-10 re-ranked passages have an average of 0.85. Given the small similarity margin, the retrieval task is challenging. Despite the low absolute statistics, the differences in MRR are statistically significant at p < 0.00001 between all model pairs (see the SM for details).
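The pair-wise similarity analysis can be sketched as below; the embeddings here are random stand-ins for BERT passage embeddings, generated to mimic a set of correlated passages.

```python
import numpy as np

def avg_pairwise_cosine(embeddings):
    """Mean cosine similarity over all distinct pairs of passage embeddings."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    S = X @ X.T                    # full cosine-similarity matrix
    iu = np.triu_indices(len(X), k=1)  # distinct pairs only
    return float(S[iu].mean())

# Toy stand-ins for one 10-passage set: a shared base vector plus noise,
# so the passages are semantically correlated, as in the dataset.
rng = np.random.default_rng(0)
base = rng.normal(size=8)
emb = base + 0.3 * rng.normal(size=(10, 8))
sim = avg_pairwise_cosine(emb)  # high average pair-wise similarity
```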
Human evaluation results are shown in Table 1. In the comparison tasks, our proposed model significantly outperforms the strong baseline by a large margin. Nevertheless, judges preferred the reference over our model on all three aspects. The individual tasks corroborate these observations.

Conclusion
We present a new task of generating common questions based on shared concepts among documents, and extensively evaluate multi-source encoder-decoder framework models, including our variant model MSQG, applied to this new task. We also provide an empirical evaluation framework based on automated metrics and human judgments to evaluate multi-source generation frameworks for generating common questions.

A Additional Retrieval Results

Table 3 shows the retrieval results for a larger set of baselines and MSQG models. M_256^attn is an attention-based encoder-decoder with hidden size 256 for both the encoder and the decoder. M_256 and M_512 are non-attention encoder-decoders with hidden sizes 256 and 512, respectively. S2S denotes M_256^attn, as in the main paper. The table shows that models constructed using M_256^attn are more effective than models using M_512, which has more parameters. Furthermore, we see that the averaging scheme in the Aggregate step, broadcasting the same encoder mean, and increasing the coverage of vocabulary tokens are important features for generating common questions with MSQG models.

B Effect of Document Order on S2S
To examine whether the order of multiple input documents is critical for S2S, we obtain the attention weights at each decoding step, gathered across the development dataset. Next, we perform a simple ordinary least squares regression, where the predictors are the indexed word positions in a concatenated input, and the responses are the (noisy) attention weights at each word position over the development dataset.
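The regression can be sketched with synthetic data; the attention weights below are simulated as roughly uniform over positions (the paper's finding), not taken from the actual development set.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: one attention weight per word position of a
# concatenated ~476-word input, roughly uniform plus observation noise.
positions = np.arange(476)
weights = 1 / 476 + rng.normal(0, 1e-4, size=476)

# Ordinary least squares fit of weight against position.
slope, intercept = np.polyfit(positions, weights, deg=1)
# A slope indistinguishable from zero, with an intercept near
# 1/476 ~= 0.0021, indicates attention is spread evenly over positions.
```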
The slope coefficient fell within a 95% confidence interval that includes the null, [−2.75 × 10^−5, 3.03 × 10^−5], with a statistically significant intercept value of 0.0021. The result also validates that an average 10-passage string is approximately 476 (≈ 1/0.0021) words long. Thus, we conclude that the attention weights are evenly distributed across the multiple documents at test time, and that document ordering is not critical to the performance of S2S.

Figure 2: Agglomerative clustering of 55,065 source 10-passage sets. Each set is represented by the mean of its 10 BERT embeddings. Both max and average linkages yield the same inflection point at 0.0326, corresponding to 35,928 and 32,871 clusters. This method implies that the target proportion of unique generations should be at least 65% or 60%, which all models but MSQG_mult achieve.

C Clustering Duplicate 10-passage Sets
In the MS-MARCO-QA dataset, there are many highly similar 10-passage sets retrieved for semantically close MS-MARCO queries. Examples of semantically close MS-MARCO queries include ["symptoms blood sugar is low", "low blood sugar symptoms", "symptoms of low blood sugar levels", "signs and symptoms of low blood sugar", "what symptoms from low blood sugar", ...], from which we expect duplicate generated questions, and thus, in total, fewer than 55,065 distinct questions.
Therefore, to estimate the target proportion of unique generations, we examine the number of semantically similar 10-passage sets through agglomerative clustering. Figure 2 shows clustering results with varying affinity thresholds; we observe that effective models should generate at least 65% unique questions from the development dataset. This, together with the low retrieval statistics of MSQG_mult, implies that multiplying the distributions is not an appropriate Aggregate step.
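The clustering procedure can be sketched as below; the 2-dim vectors are toy stand-ins for the mean-of-10-BERT-embeddings set representations, and the 0.05 threshold is illustrative (the paper's inflection point is 0.0326).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def count_clusters(set_embeddings, threshold):
    """Average-linkage agglomerative clustering of 10-passage-set
    embeddings, cut at a cosine-distance threshold; returns the
    number of resulting clusters."""
    Z = linkage(set_embeddings, method="average", metric="cosine")
    labels = fcluster(Z, t=threshold, criterion="distance")
    return len(set(labels))

# Two tight groups of near-duplicate sets plus one set in between.
X = np.array([[1.00, 0.00], [0.99, 0.02], [1.00, 0.01],   # group A
              [0.00, 1.00], [0.02, 0.99],                 # group B
              [0.70, 0.70]])                              # neither group
n = count_clusters(X, threshold=0.05)  # groups A and B collapse
```

The ratio of cluster count to total set count then gives the estimated target proportion of unique questions.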
On the other hand, generating the largest number of unique questions does not imply that a model better generates common questions. In particular, S2S_rmrep generates the most diverse questions; however, its retrieval statistics are significantly lower than those of its MSQG counterparts.

D Statistical Significance Tests
Retrieval evaluation on the ~55K evaluation sets using the re-ranker R is compute-intensive. Thus, for each model, we randomly sample and obtain retrieval statistics for 15K evaluation sets, which is enough to approximate the true evaluation-set distribution.
Then, to assess statistical significance, we use a non-parametric two-sample test, such as the Mann-Whitney (MW) or Kolmogorov-Smirnov test, and test whether any pair of 15K retrieval-statistic samples from two models comes from the same distribution. In our task, both tests reached the same conclusion. MW two-sample tests on the MRR results showed statistical significance at p < 0.00001 for all model pairs considered in the main paper, in spite of the relatively low retrieval statistics.
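The test can be sketched as below; the per-set MRR samples are synthetic stand-ins drawn from beta distributions chosen to be low in absolute value but slightly different between the two models, which is the situation described above.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Hypothetical per-evaluation-set MRR values for two models: both low
# on average, but drawn from slightly different distributions.
mrr_a = rng.beta(2, 30, size=15000)  # stand-in for model A
mrr_b = rng.beta(2, 28, size=15000)  # stand-in for model B

# Two-sided Mann-Whitney U test on the two 15K-sample MRR sets.
stat, p = mannwhitneyu(mrr_a, mrr_b, alternative="two-sided")
# A tiny p-value rejects the hypothesis that the two MRR samples come
# from the same distribution, despite both means being small.
```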

E Human Evaluation Templates
The UHRS comparison and individual task instructions are shown on the following pages.

F Generated Questions Sample
Passage 1: cucumbers and zucchini look similar but have nutritional differences . photo credit martin poole / digital vision / getty images . do n't let the similarities between cucumbers and zucchini confuse you . even though both cylindrical vegetables are dark green with white flesh , they are distinctively different species . both cucumbers and zucchini belong to the curcurbit family , which also counts gourds , melons , pumpkins and squash among its members . cucumbers and zucchini differ both in how people commonly eat them and in their nutritional values . people almost always eat cukes raw , while zucchini is more often cooked .
Passage 2: cucumber and squash seedlings both have elongated foliage for the first set of leaves after they emerge from the soil . the second set of leaves on a seedling varies . cucumber leaves are in the shape of a triangle and are flat in the center and rough to the touch . squash plants vary in shape as to the particular variety , but have three to five lobes and are larger than cucumber leaves . zucchini squash has elongated serrated leaves .
Passage 3: zucchini vs cucumber . zucchini and cucumber are two vegetables that look mightily similar and hard to distinguish from each other . but in close inspection , they are actually very different . so read on . zucchini . zucchini is defined to be the kind of vegetable that is long , green colored and has many seeds .
Passage 4: as a general rule , we prefer cucumbers raw and zucchini cooked . while you ca n't replace one with the other , zucchinis and cucumbers do complement one another . slice two cucumbers , two zucchinis and one sweet onion , and soak them all in rice vinegar for at least an hour in the refrigerator .
Passage 5: cucumber and zucchini are popular vegetables that are similar in appearance and botanical classification . but they differ significantly in taste , texture and culinary application . zucchini and cucumber are both members of the botanical family cucurbitaceae , which includes melons , squashes and gourds .
Passage 6: melon vs. squash . the cucumber is not particularly sweet , but it shares a genus with the cantaloupe and is botanically classified as a melon . the zucchini is a variety of summer squash and is of the same species as crookneck squash .
Passage 7: cucumber vs. zucchini . side by side , they might fool you : cucumbers and zucchinis share the same dark green skin , pale seedy flesh , and long cylindrical shape . to the touch , however , these near -twins are not the same : cucumbers are cold and waxy , while zucchinis are rough and dry . the two vegetables also perform very differently when cooked .
Passage 8: the second set of squash leaves grow much quicker and larger than cucumber leaves in the same time . squash leaves may be up to four times as large as a cucumber leaf when they are the same age .
Passage 9: in reality , zucchini is really defined as a vegetable so when it comes to the preparation of it , it has different temperament . cucumber . cucumber is both classified as a fruit and a vegetable . it is long and is green in color , too . it is part of what they call the gourd family .
Passage 10: zucchini 's flowers are edible ; cucumber 's flowers are not . zucchini is generally considered as a vegetable ; cucumber is classified as both a fruit and a vegetable . yes , they can fool the eye because of their similar look but as you go deeper , they are very different in so many ways .
Question generated by MSQG_sharedh,rmrep: what are the difference between cucumber and zucchini
Question generated by S2S: different types of zucchini
Reference question: difference between cucumber and zucchini