A Multi-answer Multi-task Framework for Real-world Machine Reading Comprehension

The task of machine reading comprehension (MRC) has evolved from answering simple questions over well-edited text to answering real user questions over web data. In this real-world setting, the full-body text of multiple relevant documents from the top search results is provided as context for questions derived from user queries, which include not only questions with a single, short, factual answer, but also questions about reasons, procedures, and opinions. Consequently, multiple answers can be equally valid for a single question, and each answer may occur multiple times in the context; both facts should be taken into account when building an MRC system. We propose a multi-answer multi-task framework in which different loss functions are used for multiple reference answers. Minimum Risk Training is applied to address the multi-occurrence problem of a single answer. Combined with a simple heuristic passage-extraction strategy for overlong documents, our model increases the ROUGE-L score on the DuReader dataset from 44.18, the previous state of the art, to 51.09.


Introduction
Machine reading comprehension (MRC) or question answering (QA) has been a long-standing goal in Natural Language Processing. There has been a surge of interest in this area due to new end-to-end modeling techniques and the release of several large-scale, open-domain datasets.
In earlier datasets (Hermann et al., 2015; Hill et al., 2016; Yang et al., 2015; Rajpurkar et al., 2016), the questions did not arise from actual end users. Instead, they were constructed in cloze style or created by crowdworkers given a short passage from well-edited sources such as Wikipedia and CNN/Daily Mail. As a consequence, the questions are usually well-formed and about simple facts, and the answers are guaranteed to exist as short spans in the given candidate passages.

* Corresponding author: D. Lin (lindek@naturali.io).
In MS-MARCO (Nguyen et al., 2016), the questions were sampled from actual search queries, which may have typos and may not be phrased as questions. Multiple short passages that might contain the answer to the query were extracted from webpages by a separate information retrieval system. He et al. (2017) made the DuReader dataset a more realistic reflection of the real-world problem by including not only questions with relatively short, factual answers, but also questions about complex descriptions, procedures, opinions, etc., which may have multiple, much longer answers, or no answer at all. Furthermore, the full-body text of webpages listed in the top search results is directly provided as context. These documents tend to be much noisier than Wikipedia and CNN articles. They are also much longer (5 times longer than those in MS-MARCO on average) and contain many paragraphs that are irrelevant to the query.
New problems arise as we now consider the task of machine reading comprehension in this much more challenging real-world setting. First, multiple valid answers to a single question are not only possible but quite common. Figure 1 shows some examples of questions with multiple answers from the DuReader dataset. There could be multiple ways to perform the same task (Question 1), multiple opinions about the same subject (Question 2), or multiple explanations for the same observation (Question 3). However, little work has been done on multiple answers in machine reading comprehension. To address this problem, we propose a multi-answer multi-task scheme which incorporates multiple reference answers in the objective function (but still predicts a single answer at decoding time). We propose three different multi-answer loss functions and compare their performance experimentally.

[Figure 1: Examples of questions with multiple answers from the DuReader dataset]
Another problem is the multiple occurrences of the same answer. As rich context is provided for a single question, the same answer can occur more than once, in different passages or even at different places in the same passage. In this case, using only one gold span for the answer could be problematic, as the model is forced to choose one span over others that contain the same content. To solve this problem, we propose to apply Minimum Risk Training (MRT), which uses the expected metric as the loss and rewards all spans that are similar to the gold answer.
In this paper, we present a multi-answer, multi-task objective function to train an end-to-end MRC/QA system. We experiment with various alternatives on the DuReader dataset and show that our model outperforms other competing systems, increasing the state-of-the-art ROUGE-L score by about 7 points.

Related Work
Various datasets have been released in recent years, fueling research on reading comprehension and question answering. The CNN/Daily-Mail dataset (Hermann et al., 2015) and the Children's Book Test (Hill et al., 2016) evaluate comprehension by filling in missing words from well-edited texts. SQuAD (Rajpurkar et al., 2016) is one of the most popular datasets for reading comprehension, where a span in a Wikipedia passage is to be extracted to answer questions generated by annotators. WikiQA (Yang et al., 2015) is another dataset from Wikipedia, where a single sentence is to be selected to answer questions from search engine logs. Different from the above datasets, the MS-MARCO dataset (Nguyen et al., 2016) was built in a real-world setting. The questions were real anonymized Bing queries, and multiple passages were extracted from related web pages by a separate system. DuReader (He et al., 2017) is a Chinese dataset, similarly constructed from user queries as MS-MARCO, but in a more realistic setting using Baidu Web Search and Baidu Answers (Zhidao) data. While only a small proportion of questions were labeled with multiple answers in MS-MARCO (9.93%), more than half of the DuReader queries were annotated with multiple answers, which provides the perfect setup for our work.
Great effort has been put into developing sophisticated neural models for machine reading comprehension. The attention mechanism was first introduced into reading comprehension by Hermann et al. (2015) and soon became the dominant modeling choice. Wang and Jiang (2017) proposed to solve machine comprehension with Match-LSTM and an answer pointer. Seo et al. (2017) and Xiong et al. (2017) applied different ways of matching the question and the context with bidirectional attention. Other work used an iterative aligner to match the question and the passage with a feature-rich encoder. Cui et al. (2017) employed an additional layer of attention over the bi-directional attention mechanism. A self-matching mechanism has also been applied to aggregate evidence from the context. Tan et al. (2018) proposed to generate the answer from an extracted answer span. Yu et al. (2018) proposed to use convolution with self-attention instead of recurrent models in reading comprehension.
Recently, some emerging work has started to approach the reading comprehension task from the answer side. Wang et al. (2018a) proposed evidence aggregation to re-rank answer candidates extracted from different passages, and Wang et al. (2018b) proposed a cross-passage answer verification model for the same purpose. Neither involves multiple answers as in this work.
Minimum Risk Training (MRT) has been widely used in various NLP tasks: it was introduced into Neural Machine Translation, and Ayana et al. (2016) applied it to text summarization.

Our Approach
In this section we describe in detail the architecture of our model, which is depicted in Figure 2.

Passage Extraction
Unlike most other datasets, where the source of the answer is a short passage of a few hundred words, the DuReader dataset provides up to 5 full documents, which together may contain up to 100K words. This places an exorbitant demand on memory and training time. To deal with this issue, previous approaches select a single representative paragraph per document, on which answer extraction is performed. The original DuReader paper (He et al., 2017) employed a simple heuristic strategy, Wang et al. (2018b) trained a paragraph ranking model, and Clark and Gardner (2017) applied a TF-IDF based method to the TriviaQA dataset (Joshi et al., 2017), which poses a similar problem. However, answers can come from more than one paragraph. We apply a simple yet effective method to extract content from multiple paragraphs of a document, aiming to include as much information as possible for answer extraction.
We concatenate the title and the whole document to form the passage if it is shorter than a predefined maximum length. Otherwise, we perform passage extraction as follows:
• The title of the document is extracted, since whether a document is relevant to the question can often be judged from the title alone.
• We compute the BLEU-4 score of each paragraph relative to the question, and select the paragraph that appears first in the document among those with top-k scores.
• We extract the full body of this selected paragraph and the next paragraph.
• For each of the following paragraphs, only the first sentence is extracted, as it probably contains the main point.
• We concatenate all the extracted contents to form the extracted passage, truncating it to the predefined maximum length if necessary.
We apply our model on the basis of the extracted passages.
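The steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the sentence-splitting rule, and the `bleu4` scorer passed in as a callable are all assumptions.

```python
# Hypothetical sketch of the heuristic passage-extraction strategy.
# `bleu4(candidate, reference)` is assumed to return a sentence-level
# BLEU-4 score for the candidate text against the question.

def extract_passage(title, paragraphs, question, bleu4, max_len=500, top_k=3):
    doc = title + "".join(paragraphs)
    if len(doc) <= max_len:
        return doc  # short documents are kept whole

    # Score each paragraph against the question and take the paragraph
    # that appears first in the document among the top-k scoring ones.
    scores = [bleu4(p, question) for p in paragraphs]
    top = sorted(range(len(paragraphs)),
                 key=lambda i: scores[i], reverse=True)[:top_k]
    anchor = min(top)  # earliest-occurring paragraph among the top-k

    parts = [title]
    # Full text of the selected paragraph and the one after it.
    parts.extend(paragraphs[anchor:anchor + 2])
    # Only the first sentence of each later paragraph.
    for p in paragraphs[anchor + 2:]:
        first_sentence = p.split("。")[0] + "。" if "。" in p else p
        parts.append(first_sentence)

    return "".join(parts)[:max_len]  # truncate to the maximum length
```

A simpler overlap scorer can stand in for BLEU-4 when trying this out; only the relative ranking of paragraphs matters to the heuristic.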

Word Representation
Given the word sequence of a question Q = {w^q_t}_{t=1}^{m} and the word sequence of an extracted passage P = {w^p_t}_{t=1}^{n}, we combine several sources of information to form the representation of each question word w^q_t and passage word w^p_t:
• Word-level embedding: each word w in the question and passage is mapped to its corresponding n_we-dimensional embedding we.
• POS tag embedding: we use a POS tagger to tag each word in the question and passage. Each POS tag is mapped to an n_pe-dimensional embedding pe.
• Word-in-question feature: following prior work (Weissenborn et al., 2017), we use one additional binary feature wiq for each passage word, indicating whether the word occurs in the question.

Each question word is represented as the concatenation of the word embedding we and the POS tag embedding pe, denoted as x^q = [we; pe]. Each passage word is additionally concatenated with the word-in-question feature wiq: x^p = [we; pe; wiq].
It should be noted that character-level embedding is an important part of word representation in English MRC models (Seo et al., 2017; Weissenborn et al., 2017; Tan et al., 2018). Character sequences provide information that helps relieve the OOV problem, as many English words share the same stem and differ only in prefix or suffix. However, this is not the case in Chinese, and we observe no significant improvement from incorporating character-level embeddings into our system.
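The feature concatenation described above can be sketched as below. This is an illustrative sketch only: the embedding lookups `word_emb` and `pos_emb` are stand-ins for trained embedding tables, and the dimensions in the usage example are placeholders, not the paper's settings.

```python
import numpy as np

def wiq_feature(passage_words, question_words):
    """Binary word-in-question feature: 1 if the passage word occurs
    in the question, else 0."""
    qset = set(question_words)
    return [1.0 if w in qset else 0.0 for w in passage_words]

def passage_representation(passage_words, pos_tags, question_words,
                           word_emb, pos_emb):
    """Build x^p = [we; pe; wiq] for each passage word."""
    wiq = wiq_feature(passage_words, question_words)
    rows = []
    for w, t, f in zip(passage_words, pos_tags, wiq):
        rows.append(np.concatenate([word_emb[w], pos_emb[t], [f]]))
    return np.stack(rows)  # shape: (n, n_we + n_pe + 1)
```

The question-side representation x^q is the same concatenation without the final wiq component.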

Encoding Layer
Following previous work, we use a bi-directional LSTM to obtain a contextual encoding for each word in the question and the passage respectively:

{u^q_i}_{i=1}^{m} = BiLSTM({x^q_i}_{i=1}^{m}),  {u^p_j}_{j=1}^{n} = BiLSTM({x^p_j}_{j=1}^{n})

Match Layer
To fuse the question encoding and the passage encoding, we adopt the Attention Flow Layer (Seo et al., 2017) with a simpler similarity function. Let s_{ij} denote the similarity score between the contextual encoding of a query word u^q_i and that of a passage word u^p_j. The context-to-query attention vectors c^p_j are computed from the similarity scores:

c^p_j = Σ_{i=1}^{m} softmax_i(s_{ij}) u^q_i

The query-to-context attention vector d^p is computed as:

d^p = Σ_{j=1}^{n} softmax_j(max_i s_{ij}) u^p_j

Another BiLSTM is applied on top of them to obtain the question-aware passage representation {h^p_j}_{j=1}^{n}.

Multi-answer Multi-task Loss Function

Answer prediction with multi-answer
A reading comprehension model is typically trained as an extractor of an answer span from a candidate passage. In the DuReader dataset, multiple reference answers are provided for a single question. For each reference answer, we add the span with the highest F1 score to the set of gold answer spans. For models considering only a single answer span (the baseline model), the gold answer span is the one with the highest F1 score relative to any of the reference answers (He et al., 2017; Wang et al., 2018b).

In the boundary model with a pointer network (Wang and Jiang, 2017; Tan et al., 2018), two probability distributions y^1_j and y^2_j (j = 1 ... n), denoting the probability that position j is the beginning or the end of the answer span respectively, are computed from the passage representation; the initial hidden state h^a_0 of the pointer network is generated by attention-pooling over the question representation. Note that all passages for the same question are concatenated in order to predict one answer span. For a gold span with start position s and end position e, the loss is the sum of the negative log probabilities of the ground-truth start and end positions under the predicted distributions:

L = -(log y^1_s + log y^2_e)

We propose three different ways to incorporate multiple answers. A simple solution is to compute the average loss over the K gold answer spans:

L_avg = (1/K) Σ_{k=1}^{K} L_k

L_avg treats all answer spans as equally good. However, some of them may be closer to the human-generated answers than others. We therefore define the weighted average loss:

L_w-avg = Σ_{k=1}^{K} w_k L_k

where w_k is the F-score between the k-th answer span and the corresponding human-generated answer, normalized by the sum of the scores over all spans. Another solution is to use the minimal value of the loss over the spans:

L_min = min_k L_k

Instead of predicting all answer spans, this loss encourages the model to predict only the answer span that is easiest for it.
The answer span prediction loss L_ap is defined as the average of any of the loss functions described above over the training set. Tan et al. (2018) showed that their single-answer, multi-passage MRC model benefits from multi-task learning with an auxiliary loss for predicting the correct passage to extract the answer from. We adapt this idea to compute a passage selection loss L_ps in the multi-answer setting.
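The three multi-answer losses can be sketched as follows. This is a minimal sketch, not the training code: `span_nll` is assumed to hold the per-span loss L_k = -(log y^1_{s_k} + log y^2_{e_k}) for each gold span, and `f_scores` the F-score of each span against its reference answer, both precomputed.

```python
import numpy as np

def avg_loss(span_nll):
    """L_avg: every gold span is treated as equally good."""
    return float(np.mean(span_nll))

def weighted_avg_loss(span_nll, f_scores):
    """L_w-avg: spans closer to human answers get larger weights."""
    w = np.asarray(f_scores, dtype=float)
    w = w / w.sum()                 # normalize F-scores into weights
    return float(np.dot(w, span_nll))

def min_loss(span_nll):
    """L_min: only the easiest span contributes to the loss."""
    return float(np.min(span_nll))
```

For uniform F-scores the weighted average reduces to the plain average, which matches the intuition that the weighting only matters when span quality differs.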

Passage selection with multi-answer
We first apply attention-pooling over the passage representation {h^p_j}_{j=1}^{n}, and then calculate a matching score g_k for each passage k. Since multiple answers are provided in the DuReader dataset, multiple passages may contain correct answers, so the matching scores of different passages are not in competition with one another. We therefore use a pointwise sigmoid function instead of the softmax function (as in Tan et al. (2018)) in the passage selection loss L_ps:

L_ps = -Σ_k [y_k log σ(g_k) + (1 - y_k) log(1 - σ(g_k))]

where y_k = 1 if one of the gold spans comes from passage k, and y_k = 0 otherwise.
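The pointwise formulation above amounts to an independent sigmoid cross-entropy per passage, so several passages can all be labeled positive. A minimal sketch (function name and per-passage averaging are illustrative choices, not the paper's):

```python
import math

def passage_selection_loss(scores, labels):
    """Pointwise sigmoid cross-entropy over passages.

    scores: matching score g_k for each passage.
    labels: 1 if a gold span comes from passage k, else 0.
    """
    loss = 0.0
    for g, y in zip(scores, labels):
        p = 1.0 / (1.0 + math.exp(-g))   # sigmoid(g_k)
        loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return loss / len(scores)
```

Unlike a softmax over passages, this loss does not force probability mass to concentrate on a single passage, which is the point when multiple passages contain valid answers.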

Joint training
We train our model by jointly optimizing the answer span prediction loss and the passage selection loss:

L = L_ap + λ_ps L_ps

where λ_ps is a hyper-parameter tuned on the development set.

Minimum Risk Training
Minimum Risk Training (MRT) has been widely used in various NLP tasks. The basic idea is to directly optimize the evaluation metric instead of maximizing the log-likelihood of the training data with Maximum Likelihood Estimation (MLE) as described above. In MRT, the objective is to minimize the expected loss with respect to the posterior distribution:

J_MRT(θ) = Σ_i Σ_{y_i} P(y_i | x_i; θ) Δ(y_i, y*_i)

where Δ(y_i, y*_i) is a function indicating the difference between the predicted result y_i and the gold result y*_i.
In this work, we apply MRT to address the multi-occurrence problem of answers in machine reading comprehension, directly using the evaluation metric (ROUGE-L) to define Δ. When an answer occurs multiple times in the context, every span in which it occurs has minimal difference from the answer, and is thus rewarded by a model trained with MRT.
In machine translation and many other tasks, computing the expected metric with respect to the posterior distribution is often intractable, so sampling methods are commonly used in MRT. In our span extraction model, however, we use all spans without sampling.
Formally, the MRT loss in our model is defined as:

L_MRT = Σ_k Σ_l y^1_k y^2_l Δ(P_{k,l}, A)

where P_{k,l} is the span from position k to position l and A is the gold answer. Following previous work, we minimize a linear combination of the MLE and MRT losses:

J(θ) = J_MLE(θ) + λ J_MRT(θ)

where J_MLE(θ) refers to the joint loss L defined above and λ is a hyper-parameter tuned on the development set.
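Because all spans are used without sampling, the expectation can be computed exactly from the boundary distributions. The sketch below assumes `delta[k, l]` holds a precomputed per-span difference (e.g. 1 - ROUGE-L of span (k, l) against the gold answer); the upper-triangular mask restricting to valid spans with k ≤ l is our own simplification.

```python
import numpy as np

def mrt_loss(p_start, p_end, delta):
    """Exact expected span loss: sum over all spans (k, l) of
    p_start[k] * p_end[l] * delta[k, l], restricted to k <= l."""
    n = len(p_start)
    joint = np.outer(p_start, p_end)       # joint boundary probability
    mask = np.triu(np.ones((n, n)))        # keep only spans with k <= l
    return float(np.sum(joint * mask * delta))
```

The O(n^2) sum over all spans is what makes sampling unnecessary here, in contrast to machine translation, where the output space is exponentially large.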

Experiment
We conduct our experiment on the DuReader dataset (He et al., 2017), where multiple passages containing full-body text are provided for each question, and over half of the questions have multiple answers.

Dataset and Evaluation Metrics
The DuReader dataset consists of 201,574 questions in total: 181,574 in the training set, 10,000 in the development set, and 10,000 in the test set. The questions are sampled from frequently occurring queries on the Baidu search engine, and the full-body text of the top-5 search results from the web is provided as context. BLEU-4 and ROUGE-L are used for evaluation on DuReader. However, the implementations of the two metrics are quite different in the official evaluation tool. As in MS-MARCO, the BLEU-4 score is normalized across all questions, essentially giving different weights to different questions, while ROUGE-L is averaged over questions. We mainly focus on ROUGE-L, which treats each question equally (Tan et al., 2018). For a single question with multiple reference answers, the maximum score against any reference answer is used, as implemented in the official tool for ROUGE-L. This is reasonable, as providing one valid answer is good enough in many cases.
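The max-over-references scoring can be sketched as below. This is a simplified illustration of ROUGE-L, not the official tool: it uses plain F1 (β = 1) over the longest common subsequence and assumes pre-tokenized inputs.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference):
    """Simplified LCS-based F-measure (beta = 1)."""
    lcs = lcs_len(candidate, reference)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(candidate), lcs / len(reference)
    return 2 * p * r / (p + r)

def multi_ref_rouge_l(candidate, references):
    """Maximum score against any reference, as in the official tool."""
    return max(rouge_l(candidate, ref) for ref in references)
```

Taking the maximum means a prediction matching any one reference answer perfectly receives a perfect score, regardless of the other references.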

Word and POS Tag Embedding
We train a segmentation model with a one-layer BiLSTM on the DuReader dataset, and apply it to a subset of the SogouT corpus 2, which contains a large number of Chinese web pages (Liu et al., 2012). 256-dimensional word embeddings are trained on this data with a language-model task using a one-layer BiLSTM. For POS tags, we use a POS tagger trained on the Chinese Treebank (CTB) data to tag each word in the questions and passages of the DuReader dataset. 64-dimensional POS tag embeddings are trained on this data using a one-layer BiLSTM.
We keep all word and POS tag embeddings fixed during training.

Training and Parameters
The maximum length of each passage is set to 500. The batch size is set to 32. The dimension of the hidden vectors is set to 150 for all layers. Dropout (Srivastava et al., 2014) is applied between layers, with a dropout rate of 0.15. We use λ_ps = 3 for the passage selection loss and λ = 10 for Minimum Risk Training. Our model is optimized with the Adam algorithm (Kingma and Ba, 2014), and the learning rate is fixed to 0.001 during training.

2 http://www.sogou.com/labs/resource/t.php

Table 1 shows the results for passage extraction and rich-feature representation (pre-trained word, POS, and word-in-question embeddings) on the development set. Both dramatically increase the ROUGE-L and BLEU-4 scores over the BiDAF baseline from the original DuReader paper. Together they form our single-answer baseline, on which we test the effectiveness of the multi-answer multi-task loss and Minimum Risk Training.

Different Loss Functions with Multi-answer

Table 2 shows the experimental results with the three different multi-answer loss functions introduced in Section 3.5.1. All of them improve over the single-answer baseline, which shows the effectiveness of utilizing multiple answers. The average loss performs better than the min loss, which suggests that forcing the model to predict all possible answers is better than asking it to find only the easiest one. Not surprisingly, by taking the quality of different answer spans into account, the weighted average loss outperforms the average loss and achieves the best result among the three.
All later experiments are conducted based on the weighted average loss.

Multi-task Loss and Minimum Risk Training
As shown in Table 3, the ROUGE-L score on the DuReader development set increases to 49.77 by incorporating multiple answers into the loss function. Joint learning with the passage selection loss yields a further increase of 0.19. With Minimum Risk Training, our model reaches a ROUGE-L score of 50.62, a further increment of 0.66.

Table 4 shows the performance of our model and other state-of-the-art models on the DuReader test set. First, we compare our passage extraction method with the paragraph ranking model from Wang et al. (2018b). Based on the same BiDAF model described in Section 3.4, our method (PE+BiDAF) significantly outperforms the trained model from Wang et al. (2018b) (PR+BiDAF) on the DuReader test set. Our complete model achieves state-of-the-art performance in both ROUGE-L and BLEU-4, and greatly narrows the performance gap between MRC systems and humans in this challenging real-world setting.

Further Analysis
For further analysis, we construct two sets from the development set: Q_s contains 2787 questions with a single reference answer, and Q_m contains 6650 questions with more than one reference answer. The 563 questions in the development set labeled with no answer are included in neither set. Table 5 shows the performance of our model on Q_s and Q_m. Even questions with a single answer (Q_s) benefit from using multiple answers in training, and the improvement for Q_m is higher than that for Q_s.

Conclusion
In this paper, we focus on real-world machine reading comprehension. We propose a multi-answer multi-task framework to tackle the multi-answer problem, which is common in the real world. Minimum Risk Training is applied to solve the multi-occurrence problem of answers. We also propose a simple passage extraction method that addresses the excessive length of the provided documents. Experimental results indicate that our model achieves state-of-the-art performance on the challenging DuReader dataset. Despite using multiple answers in training, our system predicts only a single answer at decoding time. However, in some cases (e.g., for questions about opinions), finding all possible answers may be desirable. In the future, we plan to design models that can generate all possible answers for a single question.