Question Answering through Transfer Learning from Large Fine-grained Supervision Data

We show that the task of question answering (QA) can significantly benefit from the transfer learning of models trained on a different large, fine-grained QA dataset. We achieve the state of the art in two well-studied QA datasets, WikiQA and SemEval-2016 (Task 3A), through a basic transfer learning technique from SQuAD. For WikiQA, our model outperforms the previous best model by more than 8%. We demonstrate that finer supervision provides better guidance for learning lexical and syntactic information than coarser supervision, through quantitative results and visual analysis. We also show that a similar transfer learning procedure achieves the state of the art on an entailment task.


Introduction
Question answering (QA) is a long-standing challenge in NLP, and the community has introduced several paradigms and datasets for the task over the past few years. These paradigms differ from each other in the type of questions and answers and in the size of the training data, ranging from a few hundred to millions of examples.
We are particularly interested in the context-aware QA paradigm, where the answer to each question can be obtained by referring to its accompanying context (a paragraph or a list of sentences). Under this setting, the two most notable types of supervision are coarse sentence-level and fine-grained span-level. In sentence-level QA, the task is to pick the sentences most relevant to the question from a list of candidates (Yang et al., 2015). In span-level QA, the task is to locate the smallest span in the given paragraph that answers the question (Rajpurkar et al., 2016).
In this paper, we address the coarser, sentence-level QA task through a standard transfer learning technique: we pretrain a model on a large, span-supervised QA dataset and fine-tune it on the target task. We demonstrate that the target task benefits not only from the scale of the source dataset but also from the capability of the fine-grained span supervision to better teach syntactic and lexical information.
For the source dataset, we pretrain on SQuAD (Rajpurkar et al., 2016), a recently released, span-supervised QA dataset. For the source and target models, we adopt BiDAF (Seo et al., 2017), one of the top-performing models on the dataset's leaderboard. For the target datasets, we evaluate on two recent QA datasets, WikiQA (Yang et al., 2015) and SemEval 2016 (Task 3A) (Nakov et al., 2016), which possess sufficiently different characteristics from those of SQuAD. Our results show an 8% improvement on WikiQA and a 1% improvement on SemEval. In addition, we report state-of-the-art results on recognizing textual entailment (RTE) on SICK (Marelli et al., 2014) with a similar transfer learning procedure.

Background and Data
Modern machine learning models, especially deep neural networks, often significantly benefit from transfer learning. In computer vision, deep convolutional neural networks trained on a large image classification dataset such as ImageNet (Deng et al., 2009) have proved to be useful for initializing models on other vision tasks, such as object detection (Zeiler and Fergus, 2014). In natural language processing, pretrained word representations such as word2vec (Mikolov et al., 2013a,b) and GloVe (Pennington et al., 2014) are also widely used for natural language tasks (Karpathy and Fei-Fei, 2015; Kumar et al., 2016). Instead of these, we initialize our models from a QA dataset and show how standard transfer learning can achieve the state of the art on target QA datasets.

There have been several QA paradigms in NLP, which can be categorized by the context and supervision used to answer questions. The context can range from structured and confined knowledge bases (Berant et al., 2013) to unstructured and unbounded natural language (e.g., documents on the web (Voorhees and Tice, 2000)) to unstructured but size-restricted text (e.g., a paragraph or multiple sentences (Hermann et al., 2015)). Recent advances in neural question answering have led to numerous datasets and successful models in these paradigms (Rajpurkar et al., 2016; Yang et al., 2015; Nguyen et al., 2016; Trischler et al., 2016). The answer types in these datasets largely fall into three categories: sentence-level, in-context span, and generation. In this paper, we specifically focus on the former two and show that span-supervised models can better learn syntactic and lexical features.

Among these datasets, we briefly describe the three QA datasets used in our experiments. We also describe an RTE dataset as an example of a non-QA task. See Table 1 for examples from each dataset.
SQuAD (Rajpurkar et al., 2016) is a recent span-based QA dataset containing 100k/10k train/dev examples. Each example is a pair of a context paragraph from Wikipedia and a human-created question, and the answer is a span in the context.

SQuAD-T is our modification of the SQuAD dataset to allow for sentence-selection QA ('T' for senTence). We split the context paragraph into sentences and formulate the task as classifying whether each sentence contains the answer. This enables a fair comparison between pretraining on span-supervised and sentence-supervised QA datasets. A sketch of this conversion is given below.
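The sketch below illustrates one way to perform this conversion for a single SQuAD example. The regex-based sentence splitter and the span-overlap labeling rule are our assumptions; the paper does not specify the exact splitter used.

```python
import re

def squad_to_squad_t(context: str, answer_start: int, answer_text: str):
    """Label each context sentence by whether it contains the answer span.

    A minimal sketch: sentence boundaries are approximated with a regex,
    which is an assumption rather than the authors' exact procedure.
    """
    answer_end = answer_start + len(answer_text)
    examples = []
    # Split after ., ! or ? (keeping the delimiter) while tracking
    # character offsets, so sentence spans align with answer offsets.
    for m in re.finditer(r'[^.!?]+[.!?]?\s*', context):
        sentence = m.group().strip()
        if not sentence:
            continue
        # Positive iff the answer span overlaps [m.start(), m.end()).
        label = int(answer_start < m.end() and answer_end > m.start())
        examples.append((sentence, label))
    return examples

# e.g. squad_to_squad_t("Denver won. Carolina lost.", 0, "Denver")
# -> [("Denver won.", 1), ("Carolina lost.", 0)]
```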
WikiQA (Yang et al., 2015) is a sentence-level QA dataset, containing 1.9k/0.3k train/dev answerable examples. Each example consists of a real user's Bing query and a snippet of a Wikipedia article retrieved by Bing, containing 18.6 sentences on average. The task is to classify whether each sentence provides the answer to the query.
SemEval 2016 (Task 3A) (Nakov et al., 2016) is a sentence-level QA dataset, containing 1.8k/0.2k/0.3k train/dev/test examples. Each example consists of a community question by a user and 10 comments. The task is to classify whether each comment is relevant to the question.
SICK (Marelli et al., 2014) is a dataset for recognizing textual entailment (RTE), containing 4.5k/0.5k/5.0k train/dev/test examples. Each example consists of a hypothesis and a premise, and the goal is to determine whether the premise is entailed by, contradicts, or is neutral to the hypothesis (hence a classification problem). We report results on SICK to show that a span-supervised QA dataset can also be useful for non-QA tasks.
Model and Training

BiDAF. The inputs to the model are a question $q$ and a context paragraph $x$. The model then selects the best answer span, $\arg\max_{(i,j)} y^{\mathrm{start}}_i y^{\mathrm{end}}_j$ subject to $i \le j$, where $y^{\mathrm{start}}_i$ and $y^{\mathrm{end}}_i$ are the start and end position probabilities of the $i$-th context word, respectively.
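As an illustration, this arg max over valid spans can be computed in linear time from the two probability vectors. The sketch below is ours, not code from the BiDAF release:

```python
import numpy as np

def best_span(p_start: np.ndarray, p_end: np.ndarray):
    """Return (i, j) maximizing p_start[i] * p_end[j] subject to i <= j.

    Runs in O(n) by tracking the best start probability seen so far
    while scanning end positions left to right.
    """
    best_score = -1.0
    best_pair = (0, 0)
    best_start_prob, best_start_idx = p_start[0], 0
    for j in range(len(p_end)):
        if p_start[j] > best_start_prob:   # best start with index <= j
            best_start_prob, best_start_idx = p_start[j], j
        score = best_start_prob * p_end[j]
        if score > best_score:
            best_score, best_pair = score, (best_start_idx, j)
    return best_pair
```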
Here, we briefly describe the answer module, which is important for transfer learning to sentence-level QA. The input to the answer module is a sequence of vectors $\{h_i\}$, each of which encodes information about the $i$-th context word and its relationship with its surrounding words and the question words. The role of the answer module is to map each vector $h_i$ to its start and end position probabilities, $y^{\mathrm{start}}_i$ and $y^{\mathrm{end}}_i$.
BiDAF-T is our modified version of BiDAF, compatible with sentence-level QA ('T' for senTence). In this task, the inputs are a question $q$ and a list of sentences $x_1, \ldots, x_T$, where $T$ is the number of sentences. Note that, unlike BiDAF, which outputs a single answer span per example, here we need to output a $C$-way classification for each $k$-th sentence.
Since BiDAF is a span-selection model, it cannot be directly used for sentence-level classification. Hence we replace the original answer module of BiDAF with a new answer module, keeping the other modules identical to those of BiDAF. Given the input to the new answer module, $\{h^k_1, \ldots, h^k_N\}$, where the superscript is the sentence index ($1 \le k \le T$), we obtain the $C$-way classification scores for the $k$-th sentence, $\tilde{y}^k \in [0,1]^C$, via max-pooling:

$$\tilde{y}^k = \mathrm{softmax}\big(W \max(h^k_1, \ldots, h^k_N) + b\big)$$

where $W \in \mathbb{R}^{C \times d}$ and $b \in \mathbb{R}^C$ are a trainable weight matrix and bias, respectively, and $\max(\cdot)$ is applied elementwise. For WikiQA and SemEval 2016, the number of classes is $C = 2$: each sentence (or comment) is either relevant or not relevant. Since some of the metrics used for these datasets require a full ranking, we use the predicted probability of the "relevant" label to rank the sentences.
Note that BiDAF-T can also be used for the RTE dataset: we consider the hypothesis as a question and the premise as a context sentence ($T = 1$), and classify each example as 'entailment', 'neutral', or 'contradiction' ($C = 3$).
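A minimal PyTorch sketch of this answer module follows. Only the elementwise max-pooling and the $C$-way linear layer come from the description above; the module name and the equal-length-sentence batching are our simplifying assumptions.

```python
import torch
import torch.nn as nn

class SentenceAnswerModule(nn.Module):
    """Max-pooling answer module of BiDAF-T (a sketch).

    Pools each sentence's word encodings h^k_1..h^k_N elementwise,
    then maps the pooled vector to C class probabilities.
    """

    def __init__(self, d: int, num_classes: int):
        super().__init__()
        # W in R^{C x d} and b in R^C from the equation above
        self.linear = nn.Linear(d, num_classes)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (T, N, d) -- T sentences, N words each (assumed equal
        # length here for simplicity), d-dimensional encodings
        pooled = h.max(dim=1).values        # elementwise max over words
        return torch.softmax(self.linear(pooled), dim=-1)  # (T, C)

# WikiQA / SemEval: num_classes=2, rank sentences by the "relevant"
# probability. SICK: num_classes=3 with T=1.
```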
Transfer Learning. Transfer learning between identical model architectures is straightforward: we first initialize the weights of the target model with the weights of the source model pretrained on the source dataset, and then further train (fine-tune) the target model on the target dataset. To transfer from BiDAF (on SQuAD) to BiDAF-T, we transfer all the weights of the identical modules and initialize the new answer module of BiDAF-T with random values. For more training details, refer to Appendix A.
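A minimal sketch of this transfer step, assuming both models are PyTorch modules whose shared (non-answer-module) parameters carry identical names and shapes; the name-matching heuristic is our assumption:

```python
import torch

def transfer_weights(source_model: torch.nn.Module,
                     target_model: torch.nn.Module) -> None:
    """Initialize the target model with the pretrained source weights.

    Parameters present in both models with matching shapes are copied;
    everything else (e.g., the new answer module of BiDAF-T) keeps its
    random initialization, after which fine-tuning proceeds as usual.
    """
    source_state = source_model.state_dict()
    target_state = target_model.state_dict()
    for name, weight in source_state.items():
        if name in target_state and target_state[name].shape == weight.shape:
            target_state[name] = weight.clone()
    target_model.load_state_dict(target_state)
```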

Experiments
Main results. Table 2 shows the performance of BiDAF-T with different pretraining routines on WikiQA and SemEval-2016. We make the following observations.

(a) BiDAF-T without any pretraining performs considerably worse than every pretrained variant on both datasets.

(b) The gain from pretraining is smaller on SemEval-2016 than on WikiQA; we attribute this to the domain difference between SemEval-2016 and SQuAD, which are drawn from community forums and Wikipedia, respectively.
(c) Pretraining on SQuAD or SQuAD-T with fine-tuning (fourth and fifth rows) significantly outperforms (by more than 5%) the highest-ranking systems on WikiQA. It also outperforms the second-ranking system on SemEval-2016 and is only 1% behind the first-ranking system.
(d) Transfer learning models achieve better results with pretraining on span-level supervision (SQuAD) than on coarser sentence-level supervision (SQuAD-T). We additionally perform the Mann-Whitney U test and McNemar's test to establish the statistical significance of this advantage: for WikiQA, the advantage is significant at confidence levels of 97.1% and 99.6%, respectively; for SemEval, at 97.8% and 99.9%.

Finally, we also use an ensemble of 12 different training runs of the same BiDAF architecture, which obtains the state of the art on both datasets. This system outperforms the highest-ranking system on WikiQA by more than 8% and the best system on SemEval-2016 by 1% on every metric. It is important to note that, while we certainly benefit from the scale of SQuAD when transferring to the smaller WikiQA, the gap between SQuAD-T and SQuAD pretraining (> 3%) is a clear sign that span supervision itself plays a significant role.
Varying the size of the pretraining dataset. We vary the size of the SQuAD dataset used during pretraining and test on WikiQA with fine-tuning. Results are shown in Table 3. As expected, MAP on WikiQA drops as the size of SQuAD decreases. It is worth noting that pretraining on SQuAD-T (Table 2) yields 0.5 points lower MAP than pretraining on 50% of SQuAD. In other words, roughly speaking, span-level supervision data is worth more than twice as much as sentence-level supervision data for the purpose of pretraining. Moreover, even a small amount of fine-grained supervision data helps: pretraining with 12.5% of SQuAD gives an advantage of more than 7 points over no pretraining.
Analysis. Figure 1 shows the latently learned attention maps between the question and one of the context sentences from a WikiQA example in Table 1. The top map is from the model pretrained on SQuAD-T (corresponding to SQuAD-T&Yes in Table 2) and the bottom map from the model pretrained on SQuAD (SQuAD&Yes). The redder the color, the higher the relevance between the words. There are two interesting observations here. First, in the SQuAD-pretrained model (bottom), we see a high correspondence between the question's airbus and the context's aircraft and aerospace, whereas the SQuAD-T-pretrained model fails to learn such correspondence.
Second, the attention map of the SQuAD-pretrained model is sparser, indicating that it is able to more precisely localize correspondences between question and context words. To quantify this, we compare the sparsity on WikiQA test examples under SQuAD&Yes and SQuAD-T&Yes. Following Hurley and Rickard (2009), we define the sparsity of an attention map as

$$\mathrm{sparsity}(V) = \frac{|\{v \in V : v \le \epsilon\}|}{|V|}$$

where $V$ is the set of values (between 0 and 1) in the attention map and $\epsilon$ is a small threshold, which we set to 0.01. A histogram of the sparsity is shown in Figure 2. There is a large gap in the average sparsity over WikiQA test examples between SQuAD&Yes and SQuAD-T&Yes: 0.84 and 0.56, respectively. More analyses, including an error analysis and additional visualizations, are given in Appendix B.
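A small sketch implementing this sparsity measure as reconstructed above (the fraction of attention values at or below $\epsilon$); this is illustrative rather than the authors' code:

```python
import numpy as np

def attention_sparsity(attention: np.ndarray, eps: float = 0.01) -> float:
    """Fraction of attention values at or below eps.

    Implements the definition above; note that the formula itself is a
    reconstruction from the surrounding text.
    """
    values = attention.ravel()
    return float(np.mean(values <= eps))
```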
Entailment Results. In addition to the QA experiments, we show that models trained on span-supervised QA can be useful for the recognizing textual entailment (RTE) task. Table 4 shows the transfer learning results on SICK.

(a) BiDAF-T pretrained on SQuAD outperforms the model without any pretraining by 6% and the model pretrained on SQuAD-T by 2%, which demonstrates that transfer learning from large span-based QA gives a clear improvement.
(b) Pretraining on SQuAD+SNLI outperforms pretraining on SNLI alone. Given that SNLI is larger than SQuAD, the difference in performance is a strong indicator that we benefit not only from the scale of SQuAD but also from the fine-grained supervision it provides.
(c) We outperform the previous state of the art by 2% with an ensemble using the SQuAD+SNLI pretraining routine. It is worth noting that Mou et al. (2016) also show improvement on SICK by pretraining on SNLI.

Conclusion
In this paper, we show state-of-the-art results on WikiQA and SemEval-2016 (Task 3A) as well as on an entailment task, SICK, outperforming previous results by 8%, 1%, and 2%, respectively. We show that question answering with sentence-level supervision can greatly benefit from standard transfer learning of a question answering model trained with large-scale, span-level supervision. We additionally show that such transfer learning is applicable to other NLP tasks such as textual entailment.

References

Todor Mihaylov and Preslav Nakov. 2016. SemanticZ at SemEval-2016 Task 3: Ranking relevant answers in community question answering using semantic similarity based on fine-tuned word embeddings. In SemEval, pages 879-886.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In ICLR.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In NIPS.

A Training details
Parameters. For pretraining BiDAF on SQuAD, we follow the exact same procedure as Seo et al. (2017). For pretraining BiDAF-T on SQuAD-T, we use the same hyperparameters for all modules except the answer module, for which we use a hidden state size of 200. The learning rate is controlled by AdaDelta (Zeiler, 2012) with an initial learning rate of 0.5 and a minibatch size of 50. We maintain exponential moving averages of all model weights with a decay rate of 0.999 during training and use the averaged weights at test time. The loss function is the cross entropy between $\tilde{y}^k$ and the one-hot vector of the correct classification.
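The moving-average bookkeeping can be sketched as follows for a generic PyTorch model; the class and method names are illustrative, not from the authors' code:

```python
import torch

class WeightEMA:
    """Exponential moving average of model weights (decay 0.999),
    maintained during training and swapped in at test time."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        self.shadow = {name: p.detach().clone()
                       for name, p in model.named_parameters()}

    @torch.no_grad()
    def update(self, model: torch.nn.Module) -> None:
        # Call once per training step, after the optimizer update.
        for name, p in model.named_parameters():
            self.shadow[name].mul_(self.decay).add_(p, alpha=1 - self.decay)

    @torch.no_grad()
    def copy_to(self, model: torch.nn.Module) -> None:
        # Overwrite the live weights with the averaged ones for evaluation.
        for name, p in model.named_parameters():
            p.copy_(self.shadow[name])
```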
Convergence. For all settings, we train models until performance on the development set stops improving for 5k steps. Table 5 shows the median selected step for each setting.

B Analysis

Figure 3. (Top) We see high correspondence between identical words in the question and context, such as senator and john, in the SQuAD-pretrained model, but the SQuAD-T-pretrained model fails to learn such correspondence. (Bottom) We see high correspondence between stems in the question and stem in the context (left), as well as plant in the question and plants in the context (right), in the SQuAD-pretrained model, but the SQuAD-T-pretrained model fails to learn these correspondences.
Error Analysis. We sampled incorrectly answered questions from WikiQA and SemEval-2016 and classified them into 6 categories (Table 6), with examples shown in Table 7. In Table 8, we compare the performance of the SQuAD-T-pretrained model and the SQuAD-pretrained model on these WikiQA examples. It shows that span supervision clearly helps in answering questions in Categories 1 and 2, which are easier to answer, with most of the questions in Category 1 answered correctly. Similarly, we compare the model without pretraining and the SQuAD-pretrained model on the classified examples from SemEval-2016. It likewise shows that span supervision helps in answering questions asking for information or for opinions/recommendations.

C More Results
SQuAD-T. To better understand the SQuAD-T dataset, we report the performance of BiDAF-T under different training routines. We obtain MAP 89.46 and accuracy 85.34% with the SQuAD-trained BiDAF model, and MAP 90.18 and accuracy 84.69% with the SQuAD-T-trained BiDAF-T model. There is no large gap between the two models, as each paragraph of SQuAD-T has only 5 sentences on average, which makes the classification problem easier than in WikiQA.
SNLI. Other, larger RTE datasets such as SNLI also benefit from transfer learning, although the improvement is smaller. We confirm this by showing that pretraining BiDAF on SQuAD yields 82.6% on SNLI, slightly higher than pretraining on SQuAD-T (81.6%). This, however, does not outperform the state of the art (88.8%) by Wang et al. (2017), mostly because BiDAF (and BiDAF-T) is a QA model not designed for RTE tasks.