Multi-style Generative Reading Comprehension

This study tackles generative reading comprehension (RC), which consists of answering questions based on textual evidence and natural language generation (NLG). We propose a multi-style abstractive summarization model for question answering, called Masque. The proposed model has two key characteristics. First, unlike most studies on RC that have focused on extracting an answer span from the provided passages, our model instead focuses on generating a summary from the question and multiple passages. This serves to cover various answer styles required for real-world applications. Second, whereas previous studies built a specific model for each answer style because of the difficulty of acquiring one general model, our approach learns multi-style answers within a model to improve the NLG capability for all styles involved. This also enables our model to give an answer in the target style. Experiments show that our model achieves state-of-the-art performance on the Q&A task and the Q&A + NLG task of MS MARCO 2.1 and the summary task of NarrativeQA. We observe that the transfer of the style-independent NLG capability to the target style is the key to its success.


Introduction
Question answering has been a long-standing research problem. Recently, reading comprehension (RC), a challenge to answer a question given textual evidence provided in a document set, has received much attention. Current mainstream studies have treated RC as a process of extracting an answer span from one passage (Rajpurkar et al., 2016, 2018) or multiple passages (Joshi et al., 2017), which is usually done by predicting the start and end positions of the answer (Devlin et al., 2018). The demand for answering questions in natural language is increasing rapidly, and this has led to the development of smart devices such as Alexa.
In comparison with answer span extraction, however, the natural language generation (NLG) capability for RC has been less studied. While datasets such as MS MARCO (Bajaj et al., 2018) and NarrativeQA (Kociský et al., 2018) have been proposed for providing abstractive answers, the state-of-the-art methods for these datasets are based on answer span extraction (Hu et al., 2018). Generative models suffer from a dearth of training data to cover open-domain questions.
Moreover, to satisfy various information needs, intelligent agents should be capable of answering one question in multiple styles, such as well-formed sentences, which make sense even without the context of the question and passages, and concise phrases. These capabilities complement each other, but previous studies could not use and control different styles within a single model.
In this study, we propose Masque, a generative model for multi-passage RC. It achieves state-of-the-art performance on the Q&A task and the Q&A + NLG task of MS MARCO 2.1 and the summary task of NarrativeQA. The main contributions of this study are as follows.
Multi-source abstractive summarization. We introduce the pointer-generator mechanism (See et al., 2017) for generating an abstractive answer from the question and multiple passages, which covers various answer styles. We extend the mechanism to a Transformer-based one (Vaswani et al., 2017) that allows words to be generated from a vocabulary and to be copied from the question and passages.
Multi-style learning for style control and transfer. We introduce multi-style learning that enables our model to control answer styles and improves RC for all styles involved. We also extend the pointer-generator to a conditional decoder by introducing an artificial token corresponding to each style, as in (Johnson et al., 2017). For each decoding step, it controls the mixture weights over three distributions with the given style (Figure 1).

Problem Formulation
This paper considers the following task: given a question x^q = {x^q_1, …, x^q_J} with J words, a set of K passages, where the k-th passage is composed of L words x^{p_k} = {x^{p_k}_1, …, x^{p_k}_L}, and an answer style label s, an RC model outputs an answer y = {y_1, …, y_T} conditioned on the style.
In short, given a 3-tuple (x^q, {x^{p_k}}, s), the system predicts P(y). The training data is a set of 6-tuples: (x^q, {x^{p_k}}, s, y, a, {r^{p_k}}), where a and {r^{p_k}} are optional. Here, a is 1 if the question is answerable with the provided passages and 0 otherwise, and r^{p_k} is 1 if the k-th passage is required to formulate the answer and 0 otherwise.
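
To make the formulation concrete, here is a minimal sketch of how one training example could be represented in code; all field and variable names are illustrative, not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RCExample:
    """One training 6-tuple (x^q, {x^{p_k}}, s, y, a, {r^{p_k}})."""
    question: List[str]                   # x^q: question words
    passages: List[List[str]]             # {x^{p_k}}: K passages of words
    style: str                            # s: answer style label
    answer: List[str]                     # y: target answer words
    answerable: Optional[int] = None      # a: 1 if answerable, else 0
    passage_relevance: Optional[List[int]] = None  # r^{p_k} per passage

example = RCExample(
    question="tablespoon in cup".split(),
    passages=[["there", "are", "16", "tablespoons", "in", "a", "cup", "."]],
    style="NLG",
    answer="there are 16 tablespoons in a cup .".split(),
    answerable=1,
    passage_relevance=[1],
)
```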

Proposed Model
We propose a Multi-style Abstractive Summarization model for QUEstion answering, called Masque. Masque directly models the conditional probability P(y | x^q, {x^{p_k}}, s). As shown in Figure 2, it consists of the following modules.
1. The question-passages reader (§3.1) models interactions between the question and passages.
2. The passage ranker (§3.2) finds passages relevant to the question.
3. The answer possibility classifier (§3.3) identifies answerable questions.
4. The answer sentence decoder (§3.4) generates an answer sentence conditioned on the style.
Our model is based on multi-source abstractive summarization: the answer that it generates can be viewed as a summary from the question and passages. The model also learns multi-style answers together. With these two characteristics, we aim to acquire the style-independent NLG ability and transfer it to the target style. In addition, to improve natural language understanding in the reader module, our model considers RC, passage ranking, and answer possibility classification together as multi-task learning.

Question-Passages Reader
The reader module is shared among multiple answer styles and the three task-specific modules.

Word Embedding Layer
Let x^q and x^{p_k} represent one-hot vectors (of size V) for words in the question and the k-th passage. First, this layer projects each of the vectors to a d_word-dimensional vector with a pretrained weight matrix W^e ∈ R^{d_word × V} such as GloVe (Pennington et al., 2014). Next, it uses contextualized word representations via ELMo (Peters et al., 2018), which allows our model to use morphological clues to form robust representations for out-of-vocabulary words unseen in training. Then, the concatenation of the word and contextualized vectors is passed to a two-layer highway network (Srivastava et al., 2015) that fuses the two types of embeddings. The highway network is shared by the question and passages.
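
For illustration, a minimal PyTorch sketch of the two-layer highway fusion follows; the embedding dimensions and activation choice are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class HighwayFusion(nn.Module):
    """Two-layer highway network (Srivastava et al., 2015) used to fuse
    the concatenated word and contextualized embeddings (a sketch)."""

    def __init__(self, dim: int, num_layers: int = 2):
        super().__init__()
        self.transforms = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])
        self.gates = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for transform, gate in zip(self.transforms, self.gates):
            g = torch.sigmoid(gate(x))      # transform/carry gate
            h = torch.relu(transform(x))    # candidate representation
            x = g * h + (1.0 - g) * x       # gated mixture with the identity path
        return x

# Shared by the question and passages: fuse [GloVe; ELMo] vectors.
fused = HighwayFusion(dim=300 + 1024)(torch.randn(2, 10, 300 + 1024))
```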

Shared Encoder Layer
This layer uses a stack of Transformer blocks, which are shared by the question and passages, on top of the embeddings provided by the word embedding layer. The input of the first block is immediately mapped to a d-dimensional vector by a linear transformation. The outputs of this layer are E^{p_k} ∈ R^{d×L} for each k-th passage and E^q ∈ R^{d×J} for the question.
Transformer encoder block. The block consists of two sub-layers: a self-attention layer and a position-wise feed-forward network. For the self-attention layer, we adopt the multi-head attention mechanism (Vaswani et al., 2017). Following GPT (Radford et al., 2018), the feed-forward network consists of two linear transformations with a GELU (Hendrycks and Gimpel, 2016) activation function in between. Each sub-layer is placed inside a residual block (He et al., 2016). For an input x and a given sub-layer function f, the output is LN(f(x) + x), where LN indicates layer normalization (Ba et al., 2016). To facilitate these residual connections, all sub-layers produce a sequence of d-dimensional vectors. Note that our model does not use any position embeddings in this block because ELMo gives the positional information of the words in each sequence.
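
For concreteness, a minimal PyTorch sketch of this encoder block follows (post-layer-norm residual blocks with a GELU feed-forward network and no position embeddings); the dimensions and head count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Transformer encoder block: each sub-layer f is wrapped as LN(f(x) + x)."""

    def __init__(self, d: int = 296, heads: int = 8, d_ff: int = 256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d, d_ff), nn.GELU(), nn.Linear(d_ff, d))
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x, pad_mask=None):
        a, _ = self.attn(x, x, x, key_padding_mask=pad_mask)  # self-attention
        x = self.ln1(a + x)                                   # residual + LN
        return self.ln2(self.ff(x) + x)                       # FFN residual + LN

out = EncoderBlock()(torch.randn(2, 12, 296))  # (batch, words, d)
```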

Dual Attention Layer
This layer uses a dual attention mechanism to fuse information from the question to the passages as well as from the passages to the question.
It first computes a similarity matrix U^{p_k} ∈ R^{L×J} between the question and the k-th passage, where each element

U^{p_k}_{lj} = w^{a⊤} [E^{p_k}_l ; E^q_j ; E^{p_k}_l ⊙ E^q_j]

indicates the similarity between the l-th word of the k-th passage and the j-th question word. The w^a ∈ R^{3d} are learnable parameters. The ⊙ operator denotes the Hadamard product, and the [;] operator denotes vector concatenation across the rows. Next, the layer obtains the row and column normalized similarity matrices A^{p_k} = softmax_j(U^{p_k⊤}) and B^{p_k} = softmax_l(U^{p_k}). It then uses DCN (Xiong et al., 2017) to obtain the dual attention representations G^{q→p_k} ∈ R^{5d×L} and G^{p→q} ∈ R^{5d×J}.
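
The similarity computation and the two normalizations can be sketched as follows. This shows only the first-level attended vectors, not the full 5d DCN concatenation, and the tensor layout (one passage, no batch dimension) is an assumption for readability.

```python
import torch
import torch.nn.functional as F

def dual_attention(E_p, E_q, w_a):
    """E_p: (L, d) passage encodings; E_q: (J, d) question encodings;
    w_a: (3d,) learnable weights. Returns question-aware passage vectors
    and passage-aware question vectors."""
    L, J = E_p.size(0), E_q.size(0)
    p = E_p.unsqueeze(1).expand(L, J, -1)
    q = E_q.unsqueeze(0).expand(L, J, -1)
    U = torch.cat([p, q, p * q], dim=-1) @ w_a   # (L, J) similarity matrix
    A = F.softmax(U.t(), dim=0)                  # normalize over question words j
    B = F.softmax(U, dim=0)                      # normalize over passage words l
    A_bar = A.t() @ E_q   # (L, d): attended question context per passage word
    B_bar = B.t() @ E_p   # (J, d): attended passage context per question word
    return A_bar, B_bar

A_bar, B_bar = dual_attention(torch.randn(30, 296), torch.randn(8, 296),
                              torch.randn(3 * 296))
```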

Modeling Encoder Layer
This layer uses a stack of the Transformer encoder blocks for question representations and obtains M^q ∈ R^{d×J} from G^{p→q}. It also uses another stack for passage representations and obtains M^{p_k} ∈ R^{d×L} from G^{q→p_k} for each k-th passage. The outputs of this layer, M^q and {M^{p_k}}, are passed on to the answer sentence decoder; the {M^{p_k}} are also passed on to the passage ranker and the answer possibility classifier.

Passage Ranker
The ranker maps the output of the modeling layer, {M^{p_k}}, to the relevance score of each passage. It takes the output for the first word, M^{p_k}_1, which corresponds to the beginning-of-sentence token, to obtain the aggregate representation of each passage sequence. Given w^r ∈ R^d as learnable parameters, it calculates the relevance of each k-th passage to the question as

β^{p_k} = sigmoid(w^{r⊤} M^{p_k}_1).

Answer Possibility Classifier
The classifier maps the output of the modeling layer to a probability for the answer possibility. It also takes the output for the first word, M^{p_k}_1, for all passages and concatenates them. Given w^c ∈ R^{Kd} as learnable parameters, it calculates the answer possibility for the question as

P(a) = sigmoid(w^{c⊤} [M^{p_1}_1; …; M^{p_K}_1]).
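
A compact sketch covering both heads follows, assuming the first-token vectors have already been gathered; shapes and names are illustrative.

```python
import torch
import torch.nn as nn

class RankerAndClassifier(nn.Module):
    """Passage relevance scores beta^{p_k} and answer possibility P(a),
    both computed from first-token outputs of the modeling layer."""

    def __init__(self, d: int, K: int):
        super().__init__()
        self.w_r = nn.Linear(d, 1)      # per-passage relevance head
        self.w_c = nn.Linear(K * d, 1)  # per-question possibility head

    def forward(self, M_first):
        # M_first: (batch, K, d) first-word vectors M^{p_k}_1 for K passages.
        beta = torch.sigmoid(self.w_r(M_first)).squeeze(-1)            # (batch, K)
        p_a = torch.sigmoid(self.w_c(M_first.flatten(1))).squeeze(-1)  # (batch,)
        return beta, p_a

beta, p_a = RankerAndClassifier(d=296, K=10)(torch.randn(2, 10, 296))
```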

Answer Sentence Decoder
Given the outputs provided by the reader module, the decoder generates a sequence of answer words one element at a time. It is autoregressive (Graves, 2013), consuming the previously generated words as additional input at each decoding step.

Word Embedding Layer
Let y represent one-hot vectors of the words in the answer. This layer has the same components as the word embedding layer of the reader module, except that it uses a unidirectional ELMo to ensure that the predictions for position t depend only on the known outputs at positions previous to t.
Artificial tokens. To be able to use multiple answer styles within a single system, our model introduces an artificial token corresponding to the style at the beginning of the answer (y_1), as done in (Johnson et al., 2017; Takeno et al., 2017). At test time, the user can specify the first token to control the style. This modification does not require any changes to the model architecture. Note that introducing the token on the decoder side prevents the reader module from depending on the answer style.
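
A sketch of this token convention follows; the token strings are hypothetical.

```python
def prepend_style_token(answer_words, style):
    """Conditional decoding via an artificial first token (illustrative strings)."""
    style_tokens = {"qa": "<style:qa>", "nlg": "<style:nlg>"}
    return [style_tokens[style]] + answer_words

# Training: the target sequence starts with the gold style token as y_1.
target = prepend_style_token("there are 16 tablespoons in a cup .".split(), "nlg")
# Inference: the user fixes y_1 to the desired style token; the decoder
# generates the remaining words conditioned on it.
```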

Attentional Decoder Layer
This layer uses a stack of Transformer decoder blocks on top of the embeddings provided by the word embedding layer. The input is immediately mapped to a d-dimensional vector by a linear transformation, and the output is a sequence of d-dimensional vectors: {s_1, …, s_T}.
Transformer decoder block. Compared with the encoder block, this block has two additional sub-layers, inserted between the self-attention sub-layer and the feed-forward network, as shown in Figure 2. As in (Vaswani et al., 2017), the self-attention sub-layer uses a subsequent mask to prevent positions from attending to subsequent positions. The second and third sub-layers perform multi-head attention over M^q and M^p_all, respectively, where

M^p_all = [M^{p_1}, …, M^{p_K}] ∈ R^{d×KL}

is the concatenation of the encoder stack outputs for all passages, and the [,] operator denotes vector concatenation across the columns. This attention over the concatenated passages produces attention weights that are comparable between passages.
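
A minimal sketch of this decoder block, under the same illustrative dimensions as the encoder sketch above; the mask handling follows standard PyTorch conventions and is an assumption.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Masked self-attention, attention over M^q, attention over M^p_all,
    then a GELU feed-forward network; each sub-layer is LN(f(x) + x)."""

    def __init__(self, d: int = 296, heads: int = 8, d_ff: int = 256):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.q_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.p_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d, d_ff), nn.GELU(), nn.Linear(d_ff, d))
        self.lns = nn.ModuleList([nn.LayerNorm(d) for _ in range(4)])

    def forward(self, y, M_q, M_p_all):
        T = y.size(1)
        # Subsequent mask: True entries are blocked, so position t sees only <= t.
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        a, _ = self.self_attn(y, y, y, attn_mask=causal)
        y = self.lns[0](a + y)
        a, _ = self.q_attn(y, M_q, M_q)          # second sub-layer: question
        y = self.lns[1](a + y)
        a, _ = self.p_attn(y, M_p_all, M_p_all)  # third: concatenated passages
        y = self.lns[2](a + y)
        return self.lns[3](self.ff(y) + y)

s = DecoderBlock()(torch.randn(2, 5, 296), torch.randn(2, 8, 296),
                   torch.randn(2, 300, 296))
```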

Multi-source Pointer-Generator
Our extended mechanism allows words both to be generated from a vocabulary and to be copied from the question and multiple passages (Figure 3). We expect that the capability of copying words will be shared among answer styles.

Figure 3: Multi-source pointer-generator mechanism, with additive attention layers over the question and the concatenated passages. For each decoding step t, mixture weights λ^v, λ^q, λ^p for the probability of generating words from the vocabulary and copying words from the question and the passages are calculated. The three distributions are weighted and summed to obtain the final distribution.
Extended vocabulary distribution. Let the extended vocabulary, V_ext, be the union of the common words (a small subset of the full vocabulary, V, defined by the input-side word embedding matrix) and all words appearing in the input question and passages. P^v then denotes the probability distribution of the t-th answer word, y_t, over the extended vocabulary:

P^v(y_t) = softmax(W^{2⊤}(W^1 s_t + b^1)),

where the output embedding W^2 ∈ R^{d_word × V_ext} is tied with the corresponding part of the input embedding (Inan et al., 2017), and W^1 ∈ R^{d_word × d} and b^1 ∈ R^{d_word} are learnable parameters. P^v(y_t) is zero if y_t is an out-of-vocabulary word for V.
Copy distributions. A recent Transformer-based pointer-generator randomly chooses one of the attention heads to form a copy distribution, but that approach gave no significant improvements in text summarization (Gehrmann et al., 2018).
In contrast, our model uses an additional attention layer for each copy distribution on top of the decoder stack. For the passages, the layer takes s_t as the query and outputs α^p_t ∈ R^{KL} as the attention weights and c^p_t ∈ R^d as the context vector:

e^p_{tl} = w^{p⊤} tanh(W^{pm} M^p_{all,l} + W^{ps} s_t + b^p),
α^p_t = softmax(e^p_t),
c^p_t = Σ_l α^p_{tl} M^p_{all,l},

where w^p, b^p ∈ R^d and W^{pm}, W^{ps} ∈ R^{d×d} are learnable parameters. For the question, our model uses another identical layer and obtains α^q_t ∈ R^J and c^q_t ∈ R^d. As a result, P^q and P^p are the copy distributions over the extended vocabulary:

P^q(y_t) = Σ_{j: x^q_j = y_t} α^q_{tj},
P^p(y_t) = Σ_{l: x^{p_k(l)}_l = y_t} α^p_{tl},

where k(l) means the passage index corresponding to the l-th word in the concatenated passages.
Final distribution. The final distribution of y_t is defined as a mixture of the three distributions:

P(y_t) = λ^v P^v(y_t) + λ^q P^q(y_t) + λ^p P^p(y_t),
λ^v, λ^q, λ^p = softmax(W^m [s_t; c^q_t; c^p_t] + b^m),

where W^m ∈ R^{3×3d} and b^m ∈ R^3 are learnable parameters.
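
The mixture can be sketched as follows; the scatter-add realizes the copy distributions over the extended vocabulary, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def final_distribution(P_v, alpha_q, alpha_p, q_ids, p_ids, lambdas):
    """P_v: (V_ext,) generative distribution; alpha_q/alpha_p: copy attention
    weights; q_ids/p_ids: extended-vocabulary id of each input word;
    lambdas: (3,) mixture weights (lambda^v, lambda^q, lambda^p)."""
    V_ext = P_v.size(0)
    P_q = torch.zeros(V_ext).scatter_add_(0, q_ids, alpha_q)  # copy from question
    P_p = torch.zeros(V_ext).scatter_add_(0, p_ids, alpha_p)  # copy from passages
    return lambdas[0] * P_v + lambdas[1] * P_q + lambdas[2] * P_p

P = final_distribution(
    F.softmax(torch.randn(10), dim=0),
    alpha_q=F.softmax(torch.randn(3), dim=0), q_ids=torch.tensor([4, 2, 7]),
    alpha_p=F.softmax(torch.randn(5), dim=0), p_ids=torch.tensor([1, 2, 2, 9, 4]),
    lambdas=torch.tensor([0.5, 0.2, 0.3]),
)
assert torch.isclose(P.sum(), torch.tensor(1.0))  # still a distribution
```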

Combined Attention
In order not to attend to words in irrelevant passages, our model introduces a combined attention. While the original technique combined word- and sentence-level attentions (Hsu et al., 2018), our model combines the word- and passage-level attentions. The word attention weights are re-defined using the passage relevance scores as

α^p_{tl} ← α^p_{tl} β^{p_k(l)} / Σ_{l'} α^p_{tl'} β^{p_k(l')}.
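
A two-line sketch of this renormalization, with illustrative names:

```python
import torch

def combine_attention(alpha_p, beta, passage_of):
    """alpha_p: (K*L,) word attention over the concatenated passages;
    beta: (K,) passage relevance scores; passage_of: (K*L,) index k(l).
    Scales each word weight by its passage's relevance and renormalizes."""
    scaled = alpha_p * beta[passage_of]
    return scaled / scaled.sum()

# Words in the second (low-relevance) passage get suppressed.
alpha = combine_attention(torch.full((6,), 1 / 6), torch.tensor([0.9, 0.1]),
                          torch.tensor([0, 0, 0, 1, 1, 1]))
```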

Loss Function
We define the training loss as the sum of the losses,

L(θ) = L_dec + γ_rank L_rank + γ_cls L_cls,

where θ is the set of all learnable parameters, and γ_rank and γ_cls are balancing parameters.
The loss of the decoder, L_dec, is the negative log-likelihood of the whole target answer sentence, averaged over the N_able answerable examples:

L_dec = − (1/N_able) Σ_{(x,y)∈D: a=1} (1/T) Σ_t log P(y_t),

where D is the training dataset. The losses of the passage ranker, L_rank, and the answer possibility classifier, L_cls, are the binary cross entropy between the true and predicted values, averaged over all N examples:

L_rank = − (1/(KN)) Σ_x Σ_k ( r^{p_k} log β^{p_k} + (1 − r^{p_k}) log(1 − β^{p_k}) ),
L_cls = − (1/N) Σ_x ( a log P(a) + (1 − a) log(1 − P(a)) ).
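
A sketch of how the three terms combine; the gamma values and tensor layout are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def total_loss(log_p_answer, answerable, beta, relevance, possibility,
               gamma_rank=0.5, gamma_cls=0.1):
    """log_p_answer: (N,) per-example mean log-likelihood of the target answer
    (only meaningful where answerable == 1); beta/relevance: (N, K) predicted
    and gold passage relevance; possibility: (N,) predicted P(a)."""
    answerable = answerable.float()
    n_able = answerable.sum().clamp(min=1.0)
    L_dec = -(log_p_answer * answerable).sum() / n_able   # answerable only
    L_rank = F.binary_cross_entropy(beta, relevance.float())
    L_cls = F.binary_cross_entropy(possibility, answerable)
    return L_dec + gamma_rank * L_rank + gamma_cls * L_cls
```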

Setup
Datasets. MS MARCO 2.1 provides two tasks for generative open-domain QA: the Q&A task and the Q&A + Natural Language Generation (NLG) task. Both tasks consist of questions submitted to Bing by real users, and each question refers to ten passages. The dataset also includes annotations on the relevant passages, which were selected by humans to form the final answers, and on whether there was no answer in the passages.
Answer styles. We associated the two tasks with two answer styles. The NLG task requires a well-formed answer that is an abstractive summary of the question and passages, averaging 16.6 words. The Q&A task also requires an abstractive answer but prefers it to be more concise than in the NLG task, averaging 13.1 words, and many of the answers do not contain the context of the question. For the question "tablespoon in cup", a reference answer in the Q&A task is "16," while that in the NLG task is "There are 16 tablespoons in a cup."

Subsets. In addition to the ALL dataset, we prepared two subsets for ablation tests, as listed in Table 1. The ANS set consisted of answerable questions, and the NLG set consisted of answerable questions with well-formed answers, so that NLG ⊂ ANS ⊂ ALL. We note that multi-style learning enables our model to learn from data with different answer styles (i.e., the ANS set), and multi-task learning with the answer possibility classifier enables our model to learn from both answerable and unanswerable data (i.e., the ALL set).
Training and Inference. We trained our model with mini-batches consisting of multi-style answers that were randomly sampled. We used a greedy decoding algorithm and did not use beam search or random sampling, because they did not provide any improvements.
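
The decoding loop can be sketched as follows; `step_fn` is a hypothetical callable standing in for a forward pass of the decoder that returns the next-word distribution for a given prefix.

```python
import torch

def greedy_decode(step_fn, style_token_id, eos_id, max_len=100):
    """Greedy decoding: feed back the argmax word at each step, starting
    from the artificial style token y_1 (a sketch, not the authors' code)."""
    prefix = [style_token_id]
    for _ in range(max_len):
        next_id = int(torch.argmax(step_fn(prefix)))  # most probable next word
        prefix.append(next_id)
        if next_id == eos_id:
            break
    return prefix[1:]  # drop the style token from the emitted answer
```
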
Evaluation metrics and baselines. ROUGE-L and BLEU-1 were used to evaluate the models' RC performance, where ROUGE-L is the main metric on the official leaderboard. We compared against the reported scores of extractive (Yan et al., 2019), generative, and unpublished RC models at the time of submission.
In addition, to evaluate the individual contributions of our modules, we used MAP and MRR for the ranker and F1 for the classifier, where the positive class was the answerable questions.

Results
Does our model achieve state-of-the-art on the two tasks with different styles? Table 2 shows the performance of our model and competing models on the leaderboard. Our ensemble model of six training runs, where each model was trained with the two answer styles, achieved state-of-the-art performance on both tasks in terms of ROUGE-L. In particular, for the NLG task, our single model outperformed competing models in terms of both ROUGE-L and BLEU-1.
Does multi-style learning improve the NLG performance? Table 3 shows the ablation results; training with both answer styles improved the NLG performance over training with the NLG style alone.

Does the Transformer-based pointer-generator improve the NLG performance? Table 3 also shows that our model outperformed a variant that used RNNs and self-attention instead of Transformer blocks, as in MCAN (McCann et al., 2018). Our deep decoder captured the multi-hop interaction among the question, the passages, and the answer better than a single-layer LSTM decoder could.
Does joint learning with the ranker and classifier improve NLG performance? Furthermore, Table 3 shows that our model (jointly trained with the passage ranker and answer possibility classifier) outperformed the model that did not use the ranker and classifier. Joint learning thus had a regularization effect on the question-passages reader. We also confirmed that the gold passage ranker, which can perfectly predict the relevance of passages, significantly improved the RC performance. Passage ranking will be a key to developing a system that can outperform humans.
Does joint learning improve the passage ranking performance? Table 4 lists the passage ranking performance on the ANS dev. set. The ranker shares the question-passages reader with the answer decoder, and this sharing contributed to improvements over a ranker trained without the answer decoder. Also, our ranker outperformed the initial ranking provided by Bing by a significant margin.
Does our model accurately identify answerable questions? Figure 4 shows the precision-recall curve for answer possibility classification on the ALL dev. set. Our model identified the answerable questions well. The maximum F 1 score was 0.7893, where the threshold of answer possibility was 0.4411. This is the first report on answer possibility classification with MS MARCO 2.1.
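
A maximum-F1 operating point like the one reported above can be recovered from a precision-recall sweep in the standard way; the following generic recipe (not the authors' script) shows how.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, y_prob):
    """Return the maximum F1 score and the probability threshold achieving it."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    best = int(np.argmax(f1[:-1]))  # the last PR point has no threshold
    return f1[best], thresholds[best]

f1, thr = best_f1_threshold([1, 0, 1, 1, 0], [0.9, 0.3, 0.6, 0.45, 0.5])
```
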
Does our model control answer lengths with different styles? Figure 5 shows the lengths of the answers generated by our model broken down by the answer style and query type. The generated answers were relatively shorter than the reference answers, especially for the Q&A task, but well controlled with the target style for every query type. The short answers degraded our model's BLEU scores in the Q&A task (Table 2) because of BLEU's brevity penalty (Papineni et al., 2002).

Experiments on NarrativeQA
Next, we evaluated our model on NarrativeQA (Kociský et al., 2018). It requires understanding the underlying narrative rather than relying on shallow pattern matching. Our detailed setup and output examples are given in the supplementary material.

Setup
We only describe the settings specific to this experiment.
Datasets. Following previous studies, we used the summary setting for comparisons with the reported baselines, where each question refers to one summary (averaging 659 words) and there are no unanswerable questions. Our model therefore did not use the passage ranker or the answer possibility classifier.
Answer styles. The NarrativeQA dataset does not explicitly provide multiple answer styles. In order to evaluate the effectiveness of multi-style learning, we used the NLG subset of MS MARCO as additional training data. We associated the NarrativeQA and NLG datasets with two answer styles. The answer style of NarrativeQA (NQA) differs from that of MS MARCO (NLG) in that the answers are short (averaging 4.73 words) and frequently contain pronouns. For instance, for the question "Who is Mark Hunter?", a reference is "He is a high school student in Phoenix."

Evaluation metrics and baselines. BLEU-1, BLEU-4, METEOR, and ROUGE-L were used in accordance with the evaluation in the dataset paper (Kociský et al., 2018). We used the reported scores of top-performing extractive (Tay et al., 2018; Hu et al., 2018) and generative (Bauer et al., 2018; Indurthi et al., 2018) models.

Results
Does our model achieve state-of-the-art performance? Table 5 shows that our single model, trained with two styles and controlled with the NQA style, pushed forward the state-of-the-art by a significant margin. The evaluation scores of the model controlled with the NLG style were low because the two styles are different. Also, our model without multi-style learning (trained with only the NQA style) outperformed the baselines in terms of ROUGE-L. This indicates that our model architecture itself is powerful for natural language understanding in RC.

Related Work and Discussion
Transfer and multi-task learning in RC. Recent breakthroughs in transfer learning demonstrate that pre-trained language models perform well on RC with minimal modifications (Peters et al., 2018; Devlin et al., 2018; Radford et al., 2018, 2019). Our model also uses ELMo (Peters et al., 2018) for contextualized embeddings.
Multi-task learning is a transfer mechanism to improve generalization performance (Caruana, 1997), and it is generally applied by sharing the hidden layers between all tasks, while keeping task-specific layers. Nishida et al. (2018), among others, reported that sharing the hidden layers between the multi-passage RC and passage ranking tasks was effective. Our results also showed the effectiveness of sharing the question-passages reader module among the RC, passage ranking, and answer possibility classification tasks.
In multi-task learning without task-specific layers, Devlin et al. (2018), among others, improved RC performance by learning multiple datasets from the same extractive RC setting. McCann et al. (2018) and Yogatama et al. (2019) investigated multi-task and curriculum learning on many different NLP tasks; their results were below those of task-specific RC models. Our multi-style learning does not use style-specific layers; instead, it uses a style-conditional decoder.
Generative RC. S-Net used an extraction-then-synthesis mechanism for multi-passage RC. The models proposed by McCann et al. (2018), Bauer et al. (2018), and Indurthi et al. (2018) used an RNN-based pointer-generator mechanism for single-passage RC. Although these mechanisms can alleviate the lack of training data, large amounts of data are still required. Our multi-style learning will be a key technique enabling learning from many RC datasets with different styles.
In addition to MS MARCO and NarrativeQA, there are other datasets that provide abstractive answers. DuReader, a Chinese multi-document RC dataset, provides longer documents and answers than those of MS MARCO. DuoRC (Saha et al., 2018) and CoQA contain abstractive answers; most of the answers are short phrases.
Controllable text generation. Many studies have been carried out in the framework of style transfer, which is the task of rephrasing a text so that it contains specific styles such as sentiment. Recent studies have used artificial tokens (Sennrich et al., 2016; Johnson et al., 2017), variational auto-encoders (Hu et al., 2017), or adversarial training (Tsvetkov et al., 2018) to separate the content and style on the encoder side. On the decoder side, conditional language modeling has been used to generate output sentences with the target style. In addition, output length control with conditional language modeling has been well studied (Kikuchi et al., 2016; Takeno et al., 2017; Fan et al., 2018). Our style-controllable RC relies on conditional language modeling in the decoder.
Multi-passage RC. The simplest approach is to concatenate the passages and find the answer from the concatenation, as in (Wang et al., 2017). Earlier pipelined models found a small number of relevant passages with a TF-IDF based ranker and passed them to a neural reader, while more recent models have used a neural re-ranker to more accurately select the relevant passages (Nishida et al., 2018). Also, non-pipelined models (including ours) consider all the provided passages and find the answer by comparing scores between passages. The most recent models make a proper trade-off between efficiency and accuracy (Yan et al., 2019).

RC with unanswerable question identification.
Previous work (Levy et al., 2017) output a no-answer score depending on the probability of all answer spans. Hu et al. (2019) proposed an answer verifier to compare an answer with the question. Sun et al. (2018) jointly learned an RC model and an answer verifier. Our model introduces a classifier on top of the question-passages reader, which does not depend on the generated answer.
Abstractive summarization. Current state-of-the-art models use the pointer-generator mechanism (See et al., 2017). In particular, content selection approaches, which decide what to summarize, have recently been used with abstractive models. Most methods select content at the sentence level (Hsu et al., 2018; Chen and Bansal, 2018) or the word level (Pasunuru and Bansal, 2018; Gehrmann et al., 2018). Our model incorporates content selection at the passage level in the combined attention.
Query-based summarization has rarely been studied because of a lack of datasets. Nema et al. (2017) proposed an attentional encoder-decoder model; however, Saha et al. (2018) reported that it performed worse than BiDAF on DuoRC. Hasselqvist et al. (2017) proposed a pointer-generator based model; however, it does not consider copying words from the question.

Conclusion
This study sheds light on multi-style generative RC. Our proposed model, Masque, is based on multi-source abstractive summarization and learns multi-style answers together. It achieved state-of-the-art performance on the Q&A task and the Q&A + NLG task of MS MARCO 2.1 and the summary task of NarrativeQA. The key to its success is transferring the style-independent NLG capability to the target style by use of the question-passages reader and the conditional pointer-generator decoder. In particular, the capability of copying words from the question and passages can be shared among the styles, while the capability of controlling the mixture weights for the generative and copy distributions can be acquired for each style. Our future work will involve exploring the potential of our multi-style learning towards natural language understanding.

A.2 Experimental Setup for NarrativeQA
Model configurations. Our best model was jointly trained with the NarrativeQA and MS MARCO NLG datasets for a total of seven epochs with a batch size of 64, where each batch consisted of multi-style answers that were randomly sampled. For efficient multi-style learning, each summary in the NarrativeQA dataset was divided into ten passages (of 130 words each) with sentence-level overlaps such that each sentence in the summary was entirely contained in a passage. Each passage from MS MARCO was also truncated to 130 words. The rest of the configuration was the same as in the MS MARCO experiments.
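
The splitting can be sketched as follows; the exact overlap policy is not fully specified, so the half-passage overlap used here is an assumption, and the number of resulting passages is not forced to be exactly ten.

```python
def split_summary(sentences, max_words=130):
    """Divide a summary (a list of sentences, each a word list) into
    overlapping passages so that every sentence is fully contained in
    some passage (one reading of the setup described above)."""
    passages, current = [], []
    for sent in sentences:
        if current and sum(map(len, current)) + len(sent) > max_words:
            passages.append([w for s in current for w in s])  # flush a passage
            # Start the next passage with an overlap of trailing sentences.
            overlap, size = [], 0
            for s in reversed(current):
                if size + len(s) > max_words // 2:
                    break
                overlap.insert(0, s)
                size += len(s)
            current = overlap
        current.append(sent)
    if current:
        passages.append([w for s in current for w in s])
    return passages
```
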
Evaluation settings. An official evaluation script is not provided, so we used the evaluation script created by Bauer et al. (2018). The answers were normalized by making words lowercase and removing punctuation marks.

A.3 Output Examples Generated by Masque
Tables 6 and 7 list examples generated by Masque for questions from MS MARCO 2.1 and NarrativeQA, respectively. We can see from the examples that our model could control answer styles appropriately for various question and reasoning types. We did find some important errors: style errors, yes/no classification errors, copy errors with respect to numerical values, grammatical errors, and multi-hop reasoning errors.

Table 6: Output examples generated by Masque from MS MARCO. The model was trained with the Q&A and NLG styles. The relevant passage is one that an annotator selected to compose the reference answer. The model could control answer styles appropriately for (a) natural language, (b) cloze-style, and (c) keywords questions. (d) The answer style was incorrect. (e) The answers were not consistent between the styles. (f) Copying from numerical words worked poorly. There were some grammatical errors in the generated answers, which are underlined.

Table 7: Output examples generated by Masque from NarrativeQA. The model was trained with the NarrativeQA (NQA) and MS MARCO (NLG) styles. It could control answer styles appropriately for questions that required (a, b) single-sentence reasoning and (c) multi-sentence reasoning. (d) Example of an error in multi-sentence reasoning. There were some grammatical errors in the generated answers, which are underlined.