Two-Stage Synthesis Networks for Transfer Learning in Machine Comprehension

We develop a technique for transfer learning in machine comprehension (MC) using a novel two-stage synthesis network. Given a high performing MC model in one domain, our technique aims to answer questions about documents in another domain, where we use no labeled data of question-answer pairs. Using the proposed synthesis network with a pretrained model on the SQuAD dataset, we achieve an F1 measure of 46.6% on the challenging NewsQA dataset, approaching performance of in-domain models (F1 measure of 50.0%) and outperforming the out-of-domain baseline by 7.6%, without use of provided annotations.


Introduction
Machine comprehension (MC), the ability to answer questions over a provided context paragraph, is a key task in natural language processing. The rise of high-quality, large-scale human-annotated datasets for this task (Rajpurkar et al., 2016; Trischler et al., 2016) has allowed for the training of data-intensive but expressive models such as deep neural networks (Wang et al., 2016; Xiong et al., 2016; Seo et al., 2016). Moreover, these datasets have the attractive quality that the answer is a short snippet of text within the paragraph, which narrows the search space of possible answer spans.
However, many of these models rely on large amounts of human-labeled data for training. Yet data collection is a time-consuming and expensive task. Moreover, direct application of a MC model trained on one domain to answer questions over paragraphs from another domain may suffer performance degradation.
While understudied, the ability to transfer a MC model to multiple domains is of great practical importance. For instance, the ability to quickly use a MC model trained on Wikipedia to bootstrap a question-answering system over customer support manuals or news articles, where there is no labeled data, can unlock a great number of practical applications.
In this paper, we address this problem in MC through a two-stage synthesis network (SynNet). The SynNet generates synthetic question-answer pairs over paragraphs in a new domain that are then used in place of human-generated annotations to finetune a MC model trained on the original domain.
The idea of generating synthetic data to augment insufficient training data has been explored before. For example, for the target task of translation, Sennrich et al. (2016) present a method to generate synthetic translations given real sentences to refine an existing machine translation system.
However, unlike machine translation, for tasks like MC, we need to synthesize both the question and answers given the context paragraph. Moreover, while the question is a syntactically fluent natural language sentence, the answer is mostly a salient semantic concept in the paragraph, e.g., a named entity, an action, or a number, which is often a single word or short phrase. Since the answer has a very different linguistic structure compared to the question, it may be more appropriate to view answers and questions as two different types of data. Hence, the synthesis of a (question, answer) tuple is needed.
In our approach, we decompose the process of generating question-answer pairs into two steps: answer generation conditioned on the paragraph, and question generation conditioned on the paragraph and answer. We generate the answer first because answers are usually key semantic concepts, while questions can be viewed as full sentences composed to inquire about the concept.
Using the proposed SynNet, we are able to outperform a strong baseline of directly applying a high-performing MC model trained on another domain. For example, when we apply our algorithm using a pretrained model on the Stanford Question-Answering Dataset (SQuAD) (Rajpurkar et al., 2016), which consists of Wikipedia articles, to answer questions on the NewsQA dataset (Trischler et al., 2016), which consists of CNN/Daily Mail articles, we improve the performance of the SQuAD baseline from 39.0% to 46.6% F1 and approach results of previously published work of Trischler et al. (2016) (50.0% F1), without use of labeled data in the new domain. Moreover, an error analysis reveals that we achieve higher accuracy over the baseline on all common question types.

Question Answering
Question answering is an active area in natural language processing with ongoing research in many directions (Berant et al., 2013; Hill et al., 2015; Golub and He, 2016; Chen et al., 2016; Hermann et al., 2015). Machine comprehension, a form of extractive question answering where the answer is a snippet or multiple snippets of text within a context paragraph, has recently attracted a lot of attention in the community. The rise of large-scale human annotated datasets with over 100,000 realistic question-answer pairs such as SQuAD (Rajpurkar et al., 2016), NewsQA (Trischler et al., 2016), and MSMARCO (Nguyen et al., 2016), has led to a large number of successful deep learning models (Lee et al., 2016; Seo et al., 2016; Xiong et al., 2016; Dhingra et al., 2016; Wang and Jiang, 2016).

Semi-Supervised Learning
Semi-supervised learning has a long history (cf. Chapelle et al. (2009) for an overview), and has been applied to many tasks in natural language processing such as dependency parsing (Koo et al., 2008), sentiment analysis (Yang et al., 2015), machine translation (Sennrich et al., 2016), and semantic parsing (Berant and Liang; Wang et al.; Jia and Liang, 2016). Recent work generated synthetic annotations on unsupervised data to boost the performance of both reading comprehension and visual question answering models (Yang et al., 2017; Ren et al., 2015), but on domains with some form of annotated data. There has also been work on generating high-quality questions (Yuan et al., 2017; Serban et al., 2016; Labutov et al.), but not how to best use them to train a model. In contrast, we use the two-stage SynNet to generate data tuples to directly boost performance of a model on a domain with no annotations.

Transfer Learning
Transfer learning (Pan and Yang, 2010) has been successfully applied to numerous domains in machine learning, such as machine translation (Zoph et al., 2016), computer vision (Sharif Razavian et al., 2014), and speech recognition (Doulaty et al., 2015). Specifically, object recognition models trained on the large-scale ImageNet challenge (Russakovsky et al., 2015) have proven to be excellent feature extractors for diverse tasks such as image captioning (e.g., Lu et al. (2016); Fang et al. (2015); Karpathy and Fei-Fei (2015)) and visual question answering (e.g., Zhou et al. (2015); Xu and Saenko (2016); Fukui et al. (2016); Yang et al. (2016)), among others. In a similar fashion, we use a model pretrained on the SQuAD dataset as a generic feature extractor to bootstrap a QA system on NewsQA.

The Transfer Learning Task for MC
We formalize the task of machine comprehension below. Our MC model takes as input a tokenized question $q = \{q_0, q_1, \ldots, q_n\}$ and a context paragraph $p = \{p_0, p_1, \ldots, p_n\}$, where $q_i$, $p_i$ are words, and learns a function $f(p, q) \rightarrow \{a_{start}, a_{end}\}$, where $a_{start}$ and $a_{end}$ are pointer indices into paragraph $p$, i.e., the answer $a = p_{a_{start}} \ldots p_{a_{end}}$.
Given a collection of labeled (paragraph, question, answer) triples $\{(p, q, a)\}_{i=1}^{n}$ from a particular domain $s$, e.g., Wikipedia articles, we can learn a MC model $f_s(p, q)$ that is able to answer questions in that domain.
However, when applying the model trained in one domain to answer questions in another, the performance may degrade. On the other hand, labeling data to train a model in the new domain is expensive and time-consuming.
In this paper, we propose the task of transferring a MC system $f_s(p, q)$ that is trained in a source domain to answer questions over another target domain, $t$. In the target domain $t$, we are given an unlabeled set $p_t = \{p\}_{i=1}^{k}$ of $k$ paragraphs. During test time, we are given an unseen set of paragraphs, $p^*$, in the target domain, over which we would like to answer questions.

Two-Stage SynNet
To bootstrap our model $f_s$ we use a SynNet (Figure 1), which consists of answer synthesis and question synthesis modules, to generate data on $p_t$. Our SynNet learns the conditional probability of generating answer $a = \{a_{start}, a_{end}\}$ and question $q = \{q_1, \ldots, q_n\}$ given paragraph $p$, $P(q, a \mid p)$. We decompose the joint probability distribution $P(q, a \mid p)$ into a product of conditional probability distributions $P(q \mid p, a) P(a \mid p)$, where we first generate the answer $a$, followed by generating the question $q$ conditioned on the answer and paragraph.

Answer Synthesis Module
In our answer synthesis module we train a simple IOB tagger to predict whether each word in the paragraph is part of an answer or not.
More formally, given a set of words in a paragraph $p = \{p_1, \ldots, p_n\}$, our IOB tagging model learns the conditional probability of labels $y_1, \ldots, y_n$, where $y_i \in \{\text{IOB}_{START}, \text{IOB}_{MID}, \text{IOB}_{END}\}$ if word $p_i$ is marked as part of an answer by the annotator in our train set, and NONE otherwise.
We use a bi-directional Long-Short Term Memory Network (Bi-LSTM) (Hochreiter and Schmidhuber, 1997) for tagging. Specifically, we project each word $p_i \rightarrow p_i^*$ into a continuous vector space via pretrained GloVe embeddings (Pennington et al., 2014). We then run a Bi-LSTM over the word embeddings $p_1^*, \ldots, p_n^*$ to produce a context-dependent word representation $h_1, \ldots, h_n$, which we feed into two fully connected layers followed by a softmax to produce our tag likelihoods for each word.
We select all consecutive spans where $y \neq \text{NONE}$ produced by the tagger as our candidate answer chunks, which we feed into our question synthesis module for question generation.
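The span selection step can be illustrated with a minimal sketch. It assumes the tagger output is already available as one tag string per token; the function name `extract_answer_spans` and the tag spellings are hypothetical, chosen only for this example. Note that, under this reading, two answers that touch each other are merged into a single chunk.

```python
from typing import List, Tuple


def extract_answer_spans(tags: List[str], none_tag: str = "NONE") -> List[Tuple[int, int]]:
    """Return inclusive (start, end) indices of each maximal run of non-NONE tags."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag != none_tag:
            if start is None:
                start = i  # a new candidate chunk begins here
        elif start is not None:
            spans.append((start, i - 1))  # the chunk ended on the previous token
            start = None
    if start is not None:
        spans.append((start, len(tags) - 1))  # chunk runs to the end of the paragraph
    return spans


# Toy input (hypothetical tags, not real tagger output).
tokens = ["The", "Cessna", "206", "crashed", "near", "Puerto", "Rico"]
tags = ["NONE", "IOB_START", "IOB_END", "NONE", "NONE", "IOB_START", "IOB_END"]
spans = extract_answer_spans(tags)
chunks = [" ".join(tokens[s:e + 1]) for s, e in spans]
```

Each extracted chunk would then be paired with its paragraph and passed to the question synthesis module.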

Question Synthesis Module
Our question synthesis module learns the conditional probability of generating question $q = \{q_1, \ldots, q_n\}$ given answer $a = \{a_{start}, a_{end}\}$ and paragraph $p = \{p_1, \ldots, p_n\}$, $P(q_1, \ldots, q_n \mid p_1, \ldots, p_n, a_{start}, a_{end})$. We decompose the joint probability distribution of generating all the question words $q_1, \ldots, q_n$ into generating the question one word at a time, i.e., $\prod_{i=1}^{n} P(q_i \mid p, a, q_{1 \ldots i-1})$. The model is similar to an encoder-decoder network with attention (Bahdanau et al., 2014), which computes the conditional probability $P(q_i \mid p_1, \ldots, p_n, a_{start}, a_{end}, q_{1 \ldots i-1})$.
We run a Bi-LSTM over the paragraph to produce context-dependent word representations $h = \{h_1, \ldots, h_n\}$.
To model where the answer is in the paragraph, similar to Yang et al. (2017), we insert answer information by appending a zero/one feature to the paragraph word embeddings. Then, at each time step $i$, a decoder network attends to both $h$ and the previously generated question token $q_{i-1}$ to produce a hidden representation $r_i$. Since paragraphs may often have named entities and rare words not present during training, we incorporate a copy mechanism into our models (Gu et al., 2016).
We use an architecture motivated by latent predictor networks (Ling et al., 2016) to force the model to learn when to copy vs. directly predict the word, without direct supervision of what action to choose. Specifically, at every time step $i$, two latent predictors generate the probability of generating word $w_i$: a pointer network $C_p$ (Vinyals et al., 2015), which can copy a word from the context paragraph, and a vocabulary predictor $V_p$, which directly generates a probability distribution of choosing a word $w_i$ from a predefined vocabulary. The likelihood of choosing predictor $k$ at time step $i$ is proportional to $w_k r_i$, and the likelihood of predicting question token $q_i$ is given by $q_i^* = p_v \, l_v(w_i) + (1 - p_v) \, l_c(w_i)$, where $v$ represents the vocabulary predictor and $c$ represents the copy predictor, $p_v$ is the likelihood of choosing the vocabulary predictor, and $l(w_i)$ is the likelihood of the word given by the predictor. For training, since no direct supervision is given as to which predictor to choose, we minimize the cross entropy loss of producing the correct question tokens $\sum_{j=1}^{n} -\log(q_j^*)$ by marginalizing out latent variables using a variant of the forward-backward algorithm (see Ling et al. (2016) for full details).
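To make the role of the two latent predictors concrete, the following sketch computes a per-token likelihood by summing over the choice of predictor, and the resulting negative log-likelihood of a question. It is a simplification under stated assumptions: the predictor scores are normalized with a two-way softmax, the marginalization is done independently at each step rather than with the forward-backward variant of Ling et al. (2016), and all function names are hypothetical.

```python
import math
from typing import Iterable, Tuple


def predictor_gate(score_vocab: float, score_copy: float) -> Tuple[float, float]:
    """Two-way softmax over predictor scores (assumed normalization)."""
    m = max(score_vocab, score_copy)  # subtract the max for numerical stability
    e_v = math.exp(score_vocab - m)
    e_c = math.exp(score_copy - m)
    z = e_v + e_c
    return e_v / z, e_c / z


def token_likelihood(score_vocab: float, score_copy: float,
                     l_vocab: float, l_copy: float) -> float:
    """Likelihood of the correct token, marginalized over the predictor choice."""
    p_v, p_c = predictor_gate(score_vocab, score_copy)
    return p_v * l_vocab + p_c * l_copy


def question_nll(steps: Iterable[Tuple[float, float, float, float]]) -> float:
    """Sum of -log likelihoods over (score_vocab, score_copy, l_vocab, l_copy) steps."""
    return -sum(math.log(token_likelihood(*step)) for step in steps)


# Toy numbers: equal predictor scores, so each predictor has weight 0.5.
example_likelihood = token_likelihood(0.0, 0.0, 0.2, 0.6)
example_loss = question_nll([(0.0, 0.0, 0.2, 0.6)])
```

Because the loss only sees the summed likelihood, a token that is rare in the vocabulary but present in the paragraph can still be learned through the copy predictor, with no label saying which predictor was "correct".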
During inference, to generate a question $q_1, \ldots, q_n$, we use greedy decoding in the following manner. At time step $i$, we select the most likely predictor ($C_p$ or $V_p$), followed by the most likely word $q_i$ given the predictor. We feed the predicted word as input at the next timestep back into the decoder until we predict the end symbol, END, after which we stop decoding.
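The greedy procedure described above can be sketched as follows. The decoder is abstracted as a hypothetical `step_fn` that maps the tokens generated so far to predictor probabilities and per-predictor word distributions; the `max_len` cap is an added safety assumption and is not part of the description above.

```python
from typing import Callable, Dict, List, Tuple

StepOutput = Tuple[Dict[str, float], Dict[str, Dict[str, float]]]


def greedy_decode(step_fn: Callable[[List[str]], StepOutput],
                  end_token: str = "END", max_len: int = 30) -> List[str]:
    """Pick the most likely predictor, then its most likely word, until END."""
    question: List[str] = []
    for _ in range(max_len):
        predictor_probs, word_probs = step_fn(question)
        best_predictor = max(predictor_probs, key=predictor_probs.get)
        candidates = word_probs[best_predictor]
        best_word = max(candidates, key=candidates.get)
        if best_word == end_token:
            break
        question.append(best_word)  # fed back to the decoder at the next step
    return question


def toy_step(prefix: List[str]) -> StepOutput:
    """Scripted stand-in for a trained decoder (illustrative numbers only)."""
    script = [
        ({"vocab": 0.9, "copy": 0.1}, {"vocab": {"Where": 0.8, "Who": 0.2}, "copy": {"Forrest": 1.0}}),
        ({"vocab": 0.7, "copy": 0.3}, {"vocab": {"was": 0.9, "did": 0.1}, "copy": {"Forrest": 1.0}}),
        ({"vocab": 0.2, "copy": 0.8}, {"vocab": {"he": 1.0}, "copy": {"Forrest": 0.6, "Atlanta": 0.4}}),
        ({"vocab": 0.6, "copy": 0.4}, {"vocab": {"killed": 0.7, "END": 0.3}, "copy": {"shot": 1.0}}),
    ]
    if len(prefix) < len(script):
        return script[len(prefix)]
    return {"vocab": 1.0, "copy": 0.0}, {"vocab": {"END": 1.0}, "copy": {"END": 1.0}}


decoded = greedy_decode(toy_step)
```

Choosing the predictor before the word means a high-probability copy candidate is ignored whenever the vocabulary predictor is preferred at that step, which is the behavior the text describes.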

Machine Comprehension Model
Our machine comprehension model $f(p, q) \rightarrow a$ learns the conditional likelihood of predicting answer pointers $a = \{a_{start}, a_{end}\}$ given paragraph $p$ and question $q$, $P(a \mid p, q)$. In our experiments we use the open-source Bi-directional Attention Flow (BiDAF) network (Seo et al., 2016) since it is one of the best-performing models on the SQuAD dataset, although we note that our algorithm for data synthesis can be used with any MC model.

Algorithm 1: Training Algorithm
Input: $x_s = \{p_s, q_s, a_s\}_{i=1}^{n}$ triplets from source domain $s$; pretrained MC model on $s$, $f_s(p, q) \rightarrow \{a_{start}, a_{end}\}$; paragraphs from target domain $t$, $\{p\}_{j=1}^{m}$
Output: MC model on target domain, $f_t(p, q) \rightarrow \{a_{start}, a_{end}\}$
1. Train SynNet $g$ to maximize $P(q, a \mid p)$ on source $s$;
2. Generate samples $x_t = \{(q, a \mid p)\}_{i=1}^{k}$ on text in target domain $t$;
3. Use $x_s \cup x_t$ to finetune MC model $f_s$ on domain $t$. For every batch sampled from $x_t$, sample $k$ batches from $x_s$.

Algorithm Overview
Having given an overview of our SynNet and a brief overview of the MC model, we describe our training procedure, which is illustrated in Algorithm 1.

Training
Our approach for transfer learning consists of several training steps. First, given a series of labeled examples $x_s = \{p_s, q_s, a_s\}_{i=1}^{n}$ from domain $s$, paragraphs $\{p\}_{j=1}^{m}$ from domain $t$, and pretrained MC model $f_s(p, q)$, we train the SynNet $g_s$ to maximize the likelihood of the question-answer pairs in $s$.
Second, we fix our SynNet $g_s$ and we sample $x_t = \{p_t, q_t, a_t\}_{i=1}^{k}$ question-answer pairs on the paragraphs in domain $t$. Several examples of generated questions can be found in Table 1.
Table 1: Randomly sampled paragraphs and corresponding synthetic vs. human questions from the NewsQA train set.

Snippet of context paragraph: ...A high court in northern India on Friday acquitted a wealthy businessman facing the death sentence for the killing of a teen in a case dubbed " the house of horrors ." Moninder Singh Pandher was sentenced to death by a lower court in February. The teen was one of 19 victims - children and ...
Generated question: How many victims were in India ?
Human question: What was the amount of children murdered ?

Snippet of context paragraph: ...Rescuers have found the body of a man who was one of six people aboard a small airplane that crashed Sunday evening near the northern shore of Puerto Rico , the U.S. Coast Guard said . The Cessna 206 single-engine aircraft ...
Generated question: What was the body of Puerto Rico's airplane?
Human question: Where did the diver find the body?

Snippet of context paragraph: ...Shopping malls around the country were expected to review their emergency plans and consider additional security measures in light of Wednesday's shooting , which killed eight . Watch what experts say about keeping malls safe...
Generated question: How many experts died in the International Institute of Shopping War ?
Human question: How many died in mall shooting ?

Snippet of context paragraph: ...Former boxing champion Vernon Forrest , 38 , was shot and killed in southwest Atlanta , Georgia , on July 25 . A grand jury indicted the three suspects - Charman Sinkfield , 30 ; Demario Ware , 20 ; and Jquante...
Generated question: Where was the first person to be shot ?
Human question: Where was Forrest killed?

We then transfer the MC model originally learned on the source domain to the target domain $t$ using SGD on the synthetic data. However, since the synthetic data is usually noisy, we alternately train the MC model with mini-batches from $x_s$ and $x_t$, which we call data-regularization. For every $k$ batches from $x_s$, we sample 1 batch of synthetic data from $x_t$, where $k$ is a hyper-parameter, which we set to 4. Letting the model encounter many examples from source domain $s$ serves to regularize the distribution of the synthetic data in the target domain with real data from $s$. We checkpoint the finetuned model $f_s^*$ every $i$ mini-batches, $i = 1000$ in our experiments, and save a copy of the model at each checkpoint.
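The data-regularization schedule can be sketched as a small batch interleaver: k source-domain batches are emitted for every synthetic target-domain batch. This is an illustration only; the function name `interleave_batches`, the decision to cycle through the source batches, and stopping when the synthetic batches run out are assumptions made for the example.

```python
from itertools import cycle
from typing import Iterator, List, Sequence, Tuple


def interleave_batches(source_batches: Sequence, synthetic_batches: Sequence,
                       k: int = 4) -> Iterator[Tuple[str, object]]:
    """Yield k ('source', batch) items, then one ('synthetic', batch), repeatedly."""
    source_list: List = list(source_batches)
    if k > 0 and not source_list:
        raise ValueError("source_batches must be non-empty when k > 0")
    source_iter = cycle(source_list)  # reuse human-labeled batches as often as needed
    for synthetic in synthetic_batches:
        for _ in range(k):
            yield "source", next(source_iter)
        yield "synthetic", synthetic


# Toy schedule with k = 2: two source batches per synthetic batch.
schedule = list(interleave_batches(["s0", "s1", "s2"], ["t0", "t1"], k=2))
```

Setting `k=0` in this sketch yields synthetic batches only, which corresponds to finetuning without data-regularization.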
At test time, to generate an answer, we feed paragraph $p = \{p_0, p_1, \ldots, p_n\}$ and question $q$ through our finetuned MC model $f^*(p, q)$ to get $P(p_i = a_{start})$, $P(p_i = a_{end})$ for all $i \in 1 \ldots n$. We then use dynamic programming (Seo et al., 2016) to find the optimal answer span $\{a_{start}, a_{end}\}$. To improve the stability of using our model for inference, we average the predicted answer likelihoods from model copies at different checkpoints prior to running the dynamic programming algorithm.
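A minimal sketch of this inference step is given below: start and end likelihoods from several checkpoints are averaged position by position, and a single left-to-right pass then finds the span maximizing the product of start and end likelihoods with start at or before end. The product objective and the function names are assumptions made for illustration; the probability values are made up.

```python
from typing import List, Sequence, Tuple


def average_checkpoints(per_checkpoint: Sequence[Sequence[float]]) -> List[float]:
    """Average per-position likelihoods across model checkpoints."""
    n = len(per_checkpoint)
    return [sum(column) / n for column in zip(*per_checkpoint)]


def best_span(p_start: Sequence[float], p_end: Sequence[float]) -> Tuple[int, int, float]:
    """Return (start, end, score) maximizing p_start[i] * p_end[j] with i <= j."""
    best = (0, 0)
    best_score = -1.0
    best_start = 0  # index of the largest start likelihood seen so far
    for j in range(len(p_end)):
        if p_start[j] > p_start[best_start]:
            best_start = j
        score = p_start[best_start] * p_end[j]
        if score > best_score:
            best_score = score
            best = (best_start, j)
    return best[0], best[1], best_score


# Two hypothetical checkpoints over a three-token paragraph.
start_avg = average_checkpoints([[0.1, 0.7, 0.2], [0.3, 0.5, 0.2]])
end_avg = average_checkpoints([[0.6, 0.1, 0.3], [0.4, 0.1, 0.5]])
span_start, span_end, span_score = best_span(start_avg, end_avg)
```

Averaging before the span search, rather than voting over each checkpoint's predicted span, keeps the search itself unchanged and lets checkpoints with diffuse predictions still contribute.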

Experimental Setup
We summarize the datasets we use in our experiments, parameters for our model architectures, and training details.
The SQuAD dataset consists of approximately 100,000 question-answer pairs on Wikipedia, 87,600 of which are used for training, 10,570 for development, and an unknown number in a hidden test set. The NewsQA dataset consists of 92,549 train, 5,166 development and 5,165 test questions on CNN/Daily Mail news articles. Both the domain type (i.e., news) and question types differ between the two datasets. For example, an analysis of a randomly generated sample of 1,000 questions from both NewsQA and SQuAD (Trischler et al., 2016) reveals that approximately 74.1% of questions in SQuAD require word matching or paraphrasing to retrieve the answer, as opposed to 59.7% in NewsQA. As our test metrics, we report two numbers, exact match (EM) and F1 score.
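For readers unfamiliar with the two metrics, the sketch below computes a simplified exact match and token-level F1 for a single prediction. It assumes lowercasing and whitespace tokenization only; official evaluation scripts typically apply further normalization (for example, removing punctuation and articles) and take the maximum over multiple reference answers, which this example does not do.

```python
from collections import Counter


def exact_match(prediction: str, truth: str) -> float:
    """1.0 if the lowercased token sequences are identical, else 0.0."""
    return float(prediction.lower().split() == truth.lower().split())


def span_f1(prediction: str, truth: str) -> float:
    """Harmonic mean of token-level precision and recall between two spans."""
    pred_tokens = prediction.lower().split()
    gold_tokens = truth.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


em_score = exact_match("Atlanta", "atlanta")
f1_score = span_f1("southwest Atlanta", "Atlanta")
```

The example shows why F1 is reported alongside EM: a prediction that contains the reference answer plus one extra token scores zero on exact match but still receives partial F1 credit.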
We train a BIDAF model on the SQuAD train dataset and use a two-stage SynNet to finetune it on the NewsQA train dataset.
We initialize word-embeddings for the BIDAF model, answer synthesis module, and question synthesis module with 300-dimensional GloVe vectors (Pennington et al., 2014) trained on the 840B Common Crawl corpus. We set all embeddings of unknown word tokens to zero.
For both the answer synthesis and question synthesis module, we use a vocabulary of size 110,179. We use LSTMs with hidden states of size 150 for the answer module vs. those of size 100 for the question module since the answer module is less memory intensive than the question module.
We train both the answer and question module with Adam (Kingma and Ba, 2014) and a learning rate of 1e-2. We train a BIDAF model with the default hyperparameters provided in the open-source repository. To stop training of the question synthesis module, after each epoch, we monitor both the loss as well as the quality of questions generated on the SQuAD development set. To stop training of the answer synthesis module, we similarly monitor predictions on the SQuAD development set.
To train the question synthesis module, we only use the questions provided in the SQuAD train set. However, to train the answer synthesis module, we further augment the human-annotated labels of each paragraph with tags from a simple NER system because labels of answers provided in the train set are underspecified, i.e., many words in the paragraph that could be potential answers are not labeled. Therefore, we assume any named entities could also be potential answers of certain questions, in addition to the answers explicitly labeled by annotators.

Table 2 (partial rows; columns are system, EM, F1): (Trischler et al., 2016) 34.9 50.0; Match-LSTM on NewsQA (Trischler et al., 2016)
Table 2 (caption, continued): $Q_{gen}$ refers to using answers generated from our SynNet respectively to finetune the model on NewsQA, $A_{ner}$ refers to using answers extracted from a standard NER system to generate questions. $M^*_{sq}$ refers to using the baseline SQuAD model in the ensemble.

System                      EM    F1
$M_{newsqa}$                46.3  60.8
$M_{newsqa} + S_{net}$      47.9  61.5
Table 3: NewsQA to SQuAD. Exact match (EM) and span F1 results on SQuAD development set of a NewsQA BIDAF model baseline vs. one finetuned on SQuAD using the data generated by a 2-stage SynNet ($S_{net}$).
To generate question-answer pairs on the NewsQA train set using the SynNet, we first run every paragraph through our answer synthesis module. We then randomly sample up to 30 candidate answers extracted by our module, which we feed into the question synthesis module. This results in 250,000 synthetic question-answer pairs that we can use to finetune our MC model.

Table 4 (caption, continued): In study A, we vary $k$, the number of mini-batches from SQuAD for every batch in NewsQA. In study B, we set $k = 0$, and vary the answer type and how much of the paragraph we use for question synthesis. 2-sent refers to using two sentences before answer span, while context refers to using the entire paragraph. $A_{ner}$ refers to using an NER system and $A_{oracle}$ refers to using the human-annotated answers to generate questions.

Experimental Results
We report the main results on the NewsQA test set (Table 2), report brief results on SQuAD (Table 3), conduct ablation studies (Table 4), and conduct an error analysis.

Results
We compare to the best previously published work, which trains BARB (Trischler et al., 2016) and Match-LSTM (Wang and Jiang, 2016) architectures, and a BIDAF model we train on NewsQA. Directly applying a BIDAF model trained on SQuAD to predict on NewsQA leads to poor performance with an F1 measure of 39.0%, 13.2% lower than one trained on labeled NewsQA data. Using the 2-stage SynNet already leads to a slight boost in performance (F1 measure of 44.3%), which implies that having exposure to the new domain via question-answer pairs provides important signal for the model during training. When we augment the answers from our answer synthesis module with those from a generic NER system to produce questions, we have an additional 2.3% performance boost. Finally, when we ensemble with the original model, we boost the EM further by 0.2%. Our final system achieves an F1 measure of 46.6%, approaching previously published results of 50.0%. The results demonstrate that using the proposed architecture and training procedure, we can transfer a MC model from one domain to another, without use of annotated data.
We also evaluate the SynNet on the NewsQA-to-SQuAD direction. We directly apply the best setting from the other direction and report the result in Table 3. The SynNet improves over the baseline by 1.6% in EM and 0.7% in F1. Limited by space, we leave out ablation studies in this direction.

Ablation Studies
To better understand how various components in our training procedure and model impact overall performance we conduct several ablation studies, as summarized in Table 4.

Answer Synthesis
We experiment with using the answer chunks given in the train set, $A_{oracle}$, to generate synthetic questions, versus those from an NER system, $A_{ner}$. Results in Table 4(B) show that using human-annotated answers to generate questions leads to a significant performance boost over using answers from an answer generation module. This supports the hypothesis that the answers humans choose to generate questions for provide important linguistic cues for finetuning the machine comprehension model.

Question Synthesis
To see how copying impacts performance, we explore using the entire paragraph to generate the question vs. only the two sentences before and one sentence after the answer span and report results in Table 4(B). On the NewsQA train set, synthetic questions that use 2 sentences contain an average of 3.0 context words within 10 words to the left and right of the answer chunk, those that use the entire context have 2.1 context words, and human generated questions only have 1.7 words. Training with generated questions that have a large amount of overlap with words close to the answer span (i.e., those that use 2-sentences vs. entire context for generation) leads to models that perform worse, especially with synthetic answer spans and no data regularization (35.6% F1 vs. 34.3% F1). One possible reason is that, according to analysis in Trischler et al. (2016), significantly more questions in the NewsQA dataset require paraphrase, inference, and synthesis as opposed to word-matching.
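One plausible way to compute the overlap statistic discussed above is sketched here: count the question tokens that also occur within a fixed window of paragraph tokens on either side of the answer chunk. The exact counting rules are not specified in the text, so the choices in this example (case-insensitive matching, counting every matching question token including stop words and punctuation, excluding the answer tokens themselves) are assumptions, as is the function name.

```python
from typing import List


def window_overlap(paragraph_tokens: List[str], question_tokens: List[str],
                   answer_start: int, answer_end: int, window: int = 10) -> int:
    """Count question tokens found within `window` tokens left or right of the answer."""
    left = paragraph_tokens[max(0, answer_start - window):answer_start]
    right = paragraph_tokens[answer_end + 1:answer_end + 1 + window]
    nearby = {token.lower() for token in left + right}
    return sum(1 for token in question_tokens if token.lower() in nearby)


# Toy example built from a Table 1 snippet; the answer "19" is at index 5.
paragraph = "The teen was one of 19 victims".split()
generated = "How many victims were in India ?".split()
overlap_count = window_overlap(paragraph, generated, answer_start=5, answer_end=5)
```

Averaging this count over all synthetic questions, and separately over human questions, would give per-source figures comparable in spirit to the 3.0, 2.1, and 1.7 reported above.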

Model Finetuning
To see how the quantity of synthetic questions encountered during training impacts performance, we use $k = \{0, 2, 4\}$ mini-batches from SQuAD for every synthetic mini-batch from NewsQA to finetune our model, and average the predictions of 4 checkpointed models during testing. As we see from the results, letting the model encounter data from human annotations, although from another domain, serves as a key form of data-regularization, yielding consistent improvement as $k$ increases. We hypothesize this is because the data distribution of machine-generated questions is different from that of human-annotated ones; our batching scheme provides a simple way to prevent overfitting to this distribution.

Error Analysis
In this section we provide a qualitative analysis of some of our components to help guide further research in this task.

Answer Synthesis
We randomly sample and present a paragraph with answers extracted by our answer synthesis module (Tables 5 and 6). Although the module appears to have high precision, i.e., it picks up entities such as the "Atlantic Paranormal Society", it misses clear entities such as "David Schrader", which suggests training a system with full NER/POS tags as labels would yield better results, and also explains why augmenting synthetic data generated by SynNet with such tags leads to improved performance.
Sample paragraph (Table 5): They are ghost hunters , or , as they prefer to be called , paranormal investigators . " Ghost-Hunters ", which airs a special live show at 7 p.m. Halloween night , is helping lift the stigma once attached to paranormal investigators . The show has become so popular that the group featured in each episode - Atlantic Paranormal Society - has spawned imitators across United States and affiliates in countries . TAPS , as the " Hunters" group is informally known , even has its own " Reality Radio" show , magazine , lecture tours , T-shirts - and groupies . " Hunters" has made creepy cool , says David Schrader , a paranormal investigator and co-host of " Radio ", a radio show that investigates paranormal activity.

What is Oklahoma 's unemployment rate until Oklahoma City ?
What was the manager of the Oklahoma City agency ?
How many companies are in Oklahoma City ?
How many workers may Oklahoma have as fair hold ?
Who said the bureau has already hired civilians to choose
What was the average hour manager of Oklahoma City ?
How much would Oklahoma have a year to be held
What year did Oklahoma 's census build job industry ?
Table 6: Predictions from the question synthesis module on a subset of a paragraph.

Question Synthesis
We randomly sample synthetic questions generated by our module and present our results in Table 6. Due to the copy mechanism, our module has the tendency to directly use many words from the paragraph, especially common entities, such as "Oklahoma" in the example. Thus, one way to generate higher-quality questions may be to introduce a cost function that promotes diversity during decoding, especially within a single paragraph. In turn, this would expose the MC model to a larger variety of training examples in the new domain, which can lead to better performance.

Machine Comprehension Model
We examine the performance over various question types of a finetuned BIDAF on NewsQA vs. one trained on NewsQA vs. one trained on SQuAD (Figure 2). Finetuning with SynNet improves performance over all question types given, with the largest performance boost on location and person-identification questions. Similarly, models trained on synthetic questions tend to approach in-domain performance on numeric and person-identification questions, but still struggle with questions that require higher-order reasoning, i.e., those starting with "what was" or "what did". Designing a question generator that explicitly requires such reasoning may be one way to further bridge the gap in performance.

Conclusion
We introduce a two-stage SynNet for the task of transfer learning for machine comprehension, a task which is both challenging and of practical importance. With our network and a simple training algorithm where we generate synthetic question-answer pairs on the target domain, we are able to generalize a MC model from one domain to another with no annotated data. We present strong results on the NewsQA test set, improving performance of a baseline BIDAF model by 7.6% F1. Through ablation studies and error analysis, we provide insights into our methodology on the SynNet and MC models that can help guide further research in this task.

Figure 1: Illustration of the two-stage SynNet. The SynNet is trained to synthesize the answer and the question, given the paragraph. The first stage of the model, an answer synthesis module, uses a bi-directional LSTM to predict IOB tags on the input paragraph, which mark out key semantic concepts that are likely answers. The second stage, a question synthesis module, uses a uni-directional LSTM to generate the question, while attending on embeddings of the words in the paragraph and IOB ids. Although multiple spans in the paragraph could be identified as potential answers, we pick one span when generating the question.

Figure 2: NewsQA accuracy of baseline BIDAF model trained on SQuAD (light yellow), vs. model finetuned with our method (green) vs. one trained from scratch on NewsQA (dark blue).
System                                                    EM    F1
$M_{sq} + A_{gen} + Q_{gen}$                              30.6  44.3
$M_{sq} + A_{gen} + A_{ner} + Q_{gen}$                    32.8  46.6
$M_{sq} + A_{gen} + A_{ner} + Q_{gen} + M^{*}$

Table 2: Main Results. Exact match (EM) and span F1 scores on the NewsQA test set of a BIDAF model finetuned with our SynNet. $M_{sq}$ refers to a baseline BIDAF model trained on SQuAD, $A_{gen}$,

Table 4: Ablation Studies. Exact match (EM) and span F1 results on NewsQA test set of a BIDAF model finetuned with a 2-stage SynNet.

Table 5: Sample predictions from our answer synthesis module.