Multi-source Meta Transfer for Low Resource Multiple-Choice Question Answering

Multiple-choice question answering (MCQA) is one of the most challenging tasks in machine reading comprehension, since it requires more advanced reading comprehension skills such as logical reasoning, summarization, and arithmetic operations. Unfortunately, most existing MCQA datasets are small, which increases the difficulty of model learning and generalization. To address this challenge, we propose multi-source meta transfer (MMT) for low-resource MCQA. In this framework, we first extend meta learning by incorporating multiple training sources to learn a generalized feature representation across domains. To bridge the distribution gap between the training sources and the target, we further introduce meta transfer, which can be integrated into the multi-source meta training. More importantly, the proposed MMT is independent of the backbone language model. Extensive experiments demonstrate the superiority of MMT over state-of-the-art methods, and consistent improvements are achieved with different backbone networks in both supervised and unsupervised domain adaptation settings.


Introduction
Recently, there has been growing interest in making machines understand human language, and great progress has been made in machine reading comprehension (MRC). There are two main types of MRC task: 1) extractive/abstractive question answering (QA), such as SQuAD (Rajpurkar et al., 2018) and DROP (Dua et al., 2019); and 2) multiple-choice QA (MCQA), such as MultiRC (Khashabi et al., 2018) and DREAM (Sun et al., 2019a). Unlike extractive/abstractive QA, whose answers are usually limited to text spans that exist in the passage, the answers in MCQA may not appear in the passage and may involve complex language inference. Thus, MCQA usually requires more advanced reading comprehension abilities, including arithmetic operations, summarization, logical reasoning, and commonsense reasoning (Richardson et al., 2013; Sun et al., 2019a). In addition, most existing MCQA datasets are much smaller than the extractive/abstractive QA datasets. For instance, all the span-based QA datasets except CQ (Bao et al., 2016) contain more than 100k samples. In contrast, most existing MCQA datasets contain far fewer than 100k samples (see Table 1), and the smallest one contains only 660 samples.
The above two challenges make MCQA much more difficult to optimize and generalize, especially in the low-resource setting. To achieve good performance on downstream NLP tasks, pre-trained deep language models (Devlin et al., 2019; Raffel et al., 2019) usually have to be fine-tuned with a large amount of supervised target data to reduce the discrepancy between the training source and the target data. Because such data are scarce for MCQA, the performance of most existing MCQA methods is far from satisfactory. One straightforward solution is to merge all available data resources for training (Palmero Aprosio et al., 2019). However, the heterogeneity of the datasets (e.g., different resource domains, answer types, and numbers of choices across MCQA datasets) hinders the practical use of this strategy.
To better discover the hidden knowledge across multiple data sources, we propose a novel framework termed Multi-source Meta Transfer (MMT). In this framework, we first propose a module named multi-source meta learning (MML), which extends traditional meta learning to multiple sources: a series of meta-tasks is constructed on different data resources to simulate the low-resource target task. In this way, a more generalized representation can be obtained by considering multiple source datasets. On top of it, meta transfer learning (MTL) is integrated into the multi-source meta training to further reduce the distribution gap between the training sources and the target. Unlike traditional meta learning, which assumes that tasks are generated from a similar distribution or the same dataset, MMT is able to discover knowledge across different datasets and transfer it to the target task. More importantly, MMT is agnostic to the upstream framework, i.e., it can be seamlessly incorporated into any existing backbone language model to improve performance. Figure 1 briefly illustrates both meta learning and the proposed MMT.

Related Work

Meta Learning
Meta learning, a.k.a. "learning to learn", intends to design models that can learn general data representations and adapt to new tasks with a few training samples (Finn et al., 2017; Nichol et al., 2018). Early works have demonstrated that meta learning is capable of boosting the performance of natural language processing (NLP) tasks, such as named entity recognition (Munro et al., 2003) and grammatical error correction (Seo et al., 2012).
Recently, meta learning has gained more and more attention. Many works adopt meta learning to address low-resource issues in various NLP tasks, such as machine translation (Gu et al., 2018; Sennrich and Zhang, 2019), semantic parsing (Guo et al., 2019), query generation (Huang et al., 2018), emotion distribution learning (Zhao and Ma, 2019), and relation classification (Wu et al., 2019; Obamuyide and Vlachos, 2019). These methods have all achieved good performance due to their powerful data representation ability. Meanwhile, the strong learning capability of meta learning also provides deep models with a better initialization and speeds up their adaptation to new tasks under both supervised (Qian and Yu, 2019; Obamuyide and Vlachos, 2019) and unsupervised (Srivastava et al., 2018) scenarios. Unfortunately, meta learning has seldom been studied for multiple-choice question answering. To the best of our knowledge, this is also the first work to extend meta learning to multi-source scenarios.

Multiple-Choice Question Answering
Multiple-choice question answering (MCQA) is a challenging task, which requires understanding the relationships and handling the interactions between passages, questions, and choices to select the correct answer (Chen and Durrett, 2019). As one of the most active tracks of question answering, MCQA has recently seen a surge of challenging datasets and novel architectures. These datasets are built by considering different contexts and scenes. For instance, Guo et al. (2017) present an open-domain comprehension dataset; Lai et al. (2017) build a QA dataset from examinations, which requires more complex reasoning on questions; and Zellers et al. (2018) introduce a QA dataset that requires both natural language inference and commonsense reasoning. Meanwhile, various approaches have been proposed to address the MCQA task using different neural network architectures. Some works compute the similarity between the question and each of the choices through an attention mechanism. Kumar et al. (2016) construct the context embedding for semantic representation. Other works apply recurrent memory networks for question reasoning. Chung et al. (2018) and Jin et al. (2019) further incorporate an attention mechanism into recurrent memory networks for multi-step reasoning. Most existing works only strive to increase the reasoning capability by constructing complex models, but ignore the low-resource nature of the available MCQA datasets.

Methodology
Many existing MCQA tasks suffer from the low-resource issue, which requires a special training strategy. Recent advances in meta learning show its advantages in solving few-shot learning problems: it can train a model with good generalization ability from only a very small number of training samples (Finn et al., 2017; Nichol et al., 2018). Unfortunately, existing meta learning algorithms cannot be applied directly in our problem setting, since they assume that the meta tasks are generated from the same data distribution (Fallah et al., 2019). For example, one of the most popular benchmarks is the Mini-ImageNet dataset proposed by Lake et al. (2011), which consists of 100 sub-classes from the ImageNet dataset. All the meta tasks generated from the same training dataset have similar properties. In contrast, in MCQA, data properties such as answers, question types, and commonsense vary greatly across datasets. Specifically, the passages and questions come from different scenarios (such as exams, dialogues, and stories), and the answer choices contain more complex semantic information than the fixed categories in Mini-ImageNet. Therefore, simply combining all the data resources into one and feeding it into existing meta learning algorithms is not an optimal solution (the experimental results in Figure 5 also support this point).
To address the data heterogeneity challenge and cater to the MCQA task, we extend the traditional meta learning method to the multi-source scenario, where we fully exploit multiple inter-domain sources to learn more generalized representations. Specifically, one iteration of multi-source meta learning performs meta learning on each of the sources in sequence. However, multi-source meta learning alone cannot guarantee the desired performance because of the data distribution gap between the multiple sources and the target data. Therefore, transfer learning from the multiple sources to the target is required. Here we introduce meta transfer learning into each meta learning iteration, which aims at reducing the discrepancy between the meta representation learned from the multiple sources and the target.

Multi-source Meta Transfer
The proposed multi-source meta transfer (MMT) method consists of two modules: multi-source meta learning (MML) and meta transfer learning (MTL). As shown in Figure 2, MML contains the fast adaptation, meta-model update, and target fine-tuning steps, and MTL transfers the knowledge initialized by MML to the target task. Note that MMT is agnostic to backbone models, i.e., it can be seamlessly incorporated into any stronger backbone to boost performance. In this work, we select pre-trained BERT (Devlin et al., 2019) and RoBERTa as the backbones for MMT. Generally, MMT first learns meta features from multiple sources so that those features can be mapped into a latent representation space. Then, the fine-tuning step reduces the representation gap between the different sources and the meta representation. Finally, MTL is applied to transfer the well-initialized meta representations to the target task. The details of MMT are summarized in Algorithm 1, where the procedures of MML and MTL are presented in lines 2-16 and lines 17-21, respectively. In MML, we sequentially sample data from the multiple source distributions $\{p_s(\tau); s \in S\}$ to construct the meta-learning tasks $\tau$, where $S$ denotes the source index set. Note that the support-tasks and query-tasks in one iteration of MML should be sampled simultaneously so that they follow the same distribution. Each learning module has its own learning rate: $\alpha$ is the learning rate for the fast adaptation module, $\beta$ is used for both the meta-model update and target fine-tuning, and $\lambda$ is the learning rate for MTL. Moreover, the parameters of MMT are initialized from the backbone language model, i.e., BERT or RoBERTa.
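The sequential task sampling described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `sample_meta_tasks`, its parameters, and the toy data are all assumptions, and each "task" is simply a random subset of one source's examples.

```python
import random


def sample_meta_tasks(sources, n_tasks, task_size, seed=0):
    """Yield (source_name, support_tasks, query_tasks) for each source in turn.

    sources: dict mapping a source name to a list of examples (hypothetical format).
    Support and query tasks are drawn in the same call, from the same source,
    so that both follow that source's distribution p_s(tau).
    """
    rng = random.Random(seed)
    for name, examples in sources.items():  # sources are visited in sequence
        support = [rng.sample(examples, task_size) for _ in range(n_tasks)]
        query = [rng.sample(examples, task_size) for _ in range(n_tasks)]
        yield name, support, query


# Toy data: each "example" is just a labelled string.
toy_sources = {
    "RACE": [f"race-{i}" for i in range(20)],
    "MCTEST": [f"mctest-{i}" for i in range(20)],
}
for name, support, query in sample_meta_tasks(toy_sources, n_tasks=2, task_size=3):
    print(name, len(support), len(query))
```

Drawing the support and query tasks inside the same loop iteration is what keeps them tied to a single source distribution.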
Next, we introduce each step of the multi-source meta learning (MML) module. The first step is fast adaptation (lines 4-8), which aims to learn the meta information from the support-tasks $\tau_i^s$. The task-specific parameter $\theta'$ is updated by
$$\theta' = \theta - \alpha \nabla_{\theta} \mathcal{L}_{\tau_i^s}(f(\theta)). \qquad (1)$$
The second step is the meta-model update (line 9). Its cost function, $\sum_{\tau_i^s \sim p_s(\tau)} \mathcal{L}_{\tau_i^s}(f(\theta'))$, is calculated with respect to $\theta'$ and evaluates the performance of fast adaptation on the corresponding newly sampled query-tasks $\tau_i^s$. It is worth noting that $f(\theta')$ is an implicit function of $\theta$ (see Equation 1), so the second-order Hessian matrix is required for the gradient computation (Nichol et al., 2018). However, second derivatives are computationally expensive, so we employ a first-order approximation (Obamuyide and Vlachos, 2019) to update the meta-model by
$$\theta = \theta - \beta \nabla_{\theta'} \sum_{\tau_i^s \sim p_s(\tau)} \mathcal{L}_{\tau_i^s}(f(\theta')). \qquad (2)$$
The last step of MML is target fine-tuning (lines 10-14). Although the learned meta representations carry sufficient semantic knowledge and generalize well, a data distribution discrepancy between the meta representation and the target still exists. This fine-tuning step reduces the distance between the meta representation and the target task in the latent representation space.
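To see where the second derivatives come from, the chain rule can be applied to the fast-adaptation step. The derivation below is a standard MAML-style sketch in the spirit of Finn et al. (2017), not a reproduction of the paper's own equations; $\mathcal{L}^{\text{sup}}$ and $\mathcal{L}^{\text{qry}}$ are shorthand introduced here for the support-task and query-task losses.

```latex
\begin{align*}
\theta' &= \theta - \alpha \nabla_{\theta} \mathcal{L}^{\text{sup}}(f(\theta)) \\
\nabla_{\theta} \mathcal{L}^{\text{qry}}(f(\theta'))
  &= \left(I - \alpha \nabla_{\theta}^{2} \mathcal{L}^{\text{sup}}(f(\theta))\right)
     \nabla_{\theta'} \mathcal{L}^{\text{qry}}(f(\theta')) \\
  &\approx \nabla_{\theta'} \mathcal{L}^{\text{qry}}(f(\theta'))
  \qquad \text{(first-order: the Hessian term is dropped)}
\end{align*}
```

The first-order variant therefore only needs the query-task gradient evaluated at the adapted parameters, which avoids forming or multiplying by the Hessian.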
Generally, all the steps in MML are sequentially conducted until the meta-model converges. After performing MML, the meta transfer learning (MTL) module will be applied upon the learnt meta representations for the final transfer learning on target data.
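The overall control flow (fast adaptation, first-order meta-model update, target fine-tuning, then MTL) can be illustrated on a toy problem. The sketch below is an assumption-laden stand-in, not the authors' code: the "model" is a single scalar `theta` fitted to lines `y = slope * x`, the helper names `mse_grad`, `mmt_train`, and `make_batch` are hypothetical, and the learning rates are chosen for the toy problem rather than taken from the paper.

```python
def mse_grad(theta, batch):
    """Gradient of the mean squared error of y = theta * x over a batch."""
    return sum(2 * (theta * x - y) * x for x, y in batch) / len(batch)


def mmt_train(theta, sources, target, alpha=0.05, beta=0.05, lam=0.05,
              meta_iters=50, transfer_steps=50):
    """Toy MMT loop on a scalar parameter theta.

    sources: list of (support_batch, query_batch) pairs, one per source.
    target:  a batch from the target task.
    """
    # Multi-source meta learning (MML).
    for _ in range(meta_iters):
        for support, query in sources:  # sources are visited in sequence
            # Fast adaptation on the support batch.
            theta_fast = theta - alpha * mse_grad(theta, support)
            # First-order meta update: query gradient evaluated at theta_fast.
            theta = theta - beta * mse_grad(theta_fast, query)
        # Target fine-tuning pulls the meta parameter toward the target task.
        theta = theta - beta * mse_grad(theta, target)
    # Meta transfer learning (MTL): final transfer on the target data.
    for _ in range(transfer_steps):
        theta = theta - lam * mse_grad(theta, target)
    return theta


def make_batch(slope, xs):
    """Build a noiseless batch of (x, y) pairs with y = slope * x."""
    return [(x, slope * x) for x in xs]


# Two toy sources (slopes 2 and 4) and a target (slope 3) in between.
sources = [
    (make_batch(2.0, [0.5, 1.0, 1.5]), make_batch(2.0, [1.0, 2.0])),
    (make_batch(4.0, [0.5, 1.0, 1.5]), make_batch(4.0, [1.0, 2.0])),
]
target = make_batch(3.0, [0.5, 1.0, 1.5])
theta = mmt_train(0.0, sources, target)
print(round(theta, 3))
```

In a real setting `theta` would be the backbone's parameters and `mse_grad` would be replaced by backpropagation through the MCQA loss; only the ordering of the updates is meant to carry over.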

Unsupervised Domain Adaptation
Algorithm 1 (surviving steps): compute the gradient for fast adaptation; compute the gradient for target fine-tuning, evaluating $\nabla_{\theta} \mathcal{L}_{\tau_i^t}(f(\theta))$ with respect to the batch size; compute the gradient for meta transfer learning (line 20).
In this section, we extend MMT to the unsupervised domain adaptation setting, where no labeled data from the target domain are given. In this setting, the difficulty arises from the different numbers of choices in the source and target datasets. This mismatch prevents a pre-trained model from being applied directly to a target task whose number of choices differs from that of the source task, i.e., only the knowledge of the feature encoder is transferable. To address this issue, unsupervised MMT constructs the support/query-tasks by sampling, so that the number of choices in the source tasks equals that of the target task. In this way, unsupervised MMT is able to transfer the knowledge of both the feature encoder and the classifier to the target task. Some prior works (Chung et al., 2018) also investigated unsupervised transfer learning in QA, but they did not resolve the category difference issue that exists in multi-source learning. To the best of our knowledge, we are the first to apply meta learning to knowledge transfer between tasks with different numbers of choices in the unsupervised domain adaptation setting. In what follows, we refer to the proposed method as unsupervised MMT. The framework of unsupervised MMT is shown in Figure 3. A model pre-trained on a specific source serves as the initial state of the meta model, which reduces the optimization cost of MMT learning without prior information. From this initial state, unsupervised MMT conducts meta learning by iterating the fast adaptation and meta-model update steps. Correspondingly, unsupervised MMT is trained by removing the fine-tuning procedures (lines 10-14 and lines 17-21) from Algorithm 1.
In this manner, unsupervised MMT shortens the target representation discrepancy by moving from a specific transferred representation to a generalized meta representation. Moreover, unsupervised MMT adapts quickly to tasks with a variable number of categories without supervised fine-tuning, which relaxes the fixed-category constraint in transfer learning.
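One way to make the number of choices in a source example match the target task is to keep the correct answer and subsample the distractors. The text does not specify the exact sampling procedure, so the sketch below is only one plausible reading; the function `match_choice_count` and its behaviour when the source has too few choices are assumptions.

```python
import random


def match_choice_count(choices, answer_idx, n_target, rng):
    """Subsample a source example so it has exactly n_target choices.

    The correct choice is always kept; distractors are sampled at random.
    Returns the new choice list and the new index of the correct answer.
    """
    if n_target < 1 or n_target > len(choices):
        # Assumption: examples that cannot be matched are rejected, not padded.
        raise ValueError("cannot build a task with the requested number of choices")
    answer = choices[answer_idx]
    distractors = [c for i, c in enumerate(choices) if i != answer_idx]
    kept = rng.sample(distractors, n_target - 1) + [answer]
    rng.shuffle(kept)  # avoid always placing the answer last
    return kept, kept.index(answer)


rng = random.Random(0)
new_choices, new_idx = match_choice_count(["a", "b", "c", "d"], 2, 3, rng)
print(new_choices, new_idx)
```

Because every sampled task has the target's number of choices, the classification head trained on the sources has the same output shape as the one needed for the target.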

Source Selection in MMT
Source selection is a prerequisite step for MMT. Because of the data heterogeneity of different sources, the performance of meta learning may drop if undesirable data sources are included in training. In other words, these undesirable, or "dissimilar", data sources cause negative transfer when their distribution is far from that of the target. To avoid this drawback, we select only the "similar" datasets from all the available data sources. In the experiments, we also evaluate the transfer performance of all the source datasets on the target task. The more "similar" a source is to the target data, the larger the improvement that MMT achieves on the target task. Therefore, we use the transfer performance as guidance for the sequential multi-source meta transfer training, i.e., the model learns from dissimilar sources first and from similar ones last.
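The selection and ordering rule can be sketched as a small helper. The function `order_sources`, the `min_gain` threshold, and the gain values are hypothetical and only illustrate the idea of dropping negative-transfer sources and training from the least to the most similar one.

```python
def order_sources(transfer_gain, min_gain=0.0):
    """Keep sources whose transfer gain exceeds min_gain, least similar first."""
    kept = {src: gain for src, gain in transfer_gain.items() if gain > min_gain}
    return sorted(kept, key=kept.get)


# Hypothetical transfer gains (accuracy points) of three sources on one target.
gains = {"source_a": -1.5, "source_b": 3.0, "source_c": 9.0}
print(order_sources(gains))
```

Here `source_a` is dropped because it would cause negative transfer, and the remaining sources are scheduled so that the most similar one is seen last.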

Dataset
We conduct experiments to evaluate the performance of MMT on the following MCQA benchmark datasets.
DREAM (Sun et al., 2019a) is a dialogue-based dataset designed by education experts to evaluate the English level of non-native English speakers. It focuses on multi-turn, multi-party dialogue understanding and contains various types of questions, such as summary, logic, arithmetic, and commonsense questions.
MCTEST (
RACE (Lai et al., 2017) is a passage reading comprehension dataset collected from middle and high school English examinations. Human experts design the questions, and the passages cover various categories of articles: news, stories, advertisements, biography, philosophy, etc.
SemEval-2018-Task11 (Ostermann et al., 2018) consists of scenario-related narrative text and various types of questions. The goal is to evaluate the machine comprehension for commonsense knowledge.
SWAG (Zellers et al., 2018) is a dataset about rich grounded situations. It is de-biased with adversarial filtering and explores the gap between machine and human comprehension.

Experimental Setting
To demonstrate the versatility of MMT, we adopt both BERT (Devlin et al., 2019) and RoBERTa as the backbone. Due to resource limitations, the maximum input sequence lengths of BERT and RoBERTa are set to 512 and 256, respectively. For all datasets, the model is optimized with Adam (Kingma and Ba, 2014); the initial learning rate of fast adaptation, $\alpha$, is set to 1e-3, and the remaining learning rates are set to 1e-5.
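For reference, the settings above can be collected in a small configuration object. The class `MMTConfig` and its field names are hypothetical; only the numeric values and the optimizer come from the text, and "the remaining learning rates" is read here as $\beta$ and $\lambda$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MMTConfig:
    """Hypothetical container for the reported training settings."""
    backbone: str
    max_seq_length: int
    alpha: float = 1e-3   # fast adaptation learning rate
    beta: float = 1e-5    # meta-model update and target fine-tuning
    lam: float = 1e-5     # meta transfer learning (lambda in the text)
    optimizer: str = "adam"


bert_cfg = MMTConfig(backbone="BERT", max_seq_length=512)
roberta_cfg = MMTConfig(backbone="RoBERTa", max_seq_length=256)
print(bert_cfg)
print(roberta_cfg)
```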

Supervised MCQA
The results of MCQA under the supervised setting are summarized in Table 2. Note that we reproduce the results of BERT-Base and RoBERTa-Large on the benchmark datasets in our experimental setting for a fair comparison. From the results, we make two observations. First, MMT(RoBERTa) achieves the best performance over all benchmark datasets and outperforms the current SOTA methods by significant margins (from 5% to 13%). Second, MMT boosts performance across different pre-trained language models, and the weaker the backbone network is, the larger the improvement MMT achieves. For example, MMT(BERT-Base) improves over BERT-Base by more than 14% on MCTEST, whereas MMT(RoBERTa) improves by only 1.54% on MCTEST. The performance difference between MMT(RoBERTa) and MMT(BERT-Base) is mainly related to the performance of the backbone itself and the number of backbone parameters involved in MMT optimization. We also want to point out that one advantage of MMT is that it is backbone-agnostic, which means that its performance can improve progressively as language models advance.

Unsupervised Domain Adaptation for MCQA
In this experiment, we further evaluate the performance of MMT under unsupervised domain adaptation, where no labeled data from the target domain are available. We use BERT-Base as the backbone, and the model is trained on the SWAG and RACE training sources, which is termed unsupervised MMT(S+R). We also compare it with other SOTA methods as well as some transfer learning baselines "TL( * )". For example, "TL(R-S)" denotes that BERT-Base is first fine-tuned in sequence on RACE and SWAG, and then tested on MCTEST.
The results on MCTEST are summarized in Table 3.
From the results, we observe that unsupervised MMT outperforms other unsupervised domain adaptation methods, e.g., MemN2N (Chung et al., 2018) and QACNN (Chung et al., 2018), by a large margin. Moreover, unsupervised MMT can beat some supervised methods, such as BERT-Base and IMC, even without any labeled data from the target domain. For a fairer comparison, we also create several transfer learning baselines that can utilize multiple training sources, such as TL(R-S) and TL(S-R). From the results, we conclude that unsupervised MMT makes better use of multiple training sources than sequential transfer learning. Similar observations hold on the SWAG dataset. As reported in Table 4, unsupervised MMT outperforms the other methods significantly. Note that we follow the same setting as KagNet (Lin et al., 2019), in which only the development set of SWAG is evaluated.
Method                         Sup.   Accuracy
(Chung et al., 2018)           Yes    72.66
IMC                            Yes    76.59
MemN2N (Chung et al., 2018)    No     53.39
QACNN (Chung et al., 2018)     No     63
Table 3: Results on MCTEST. "Sup." denotes supervised, "S" denotes SWAG, "R" denotes RACE, and "TL( * )" denotes transfer learning from different datasets to MCTEST. For example, "TL(R-S)" denotes that BERT-Base is first fine-tuned on RACE, then on SWAG. Unsupervised MMT(S+R) denotes that the meta model is trained on the sources of SWAG and RACE.

Ablation Study
We conduct ablative experiments to analyze the two modules of MMT, i.e., multi-source meta learning (MML) and meta transfer learning (MTL). The MTL is the transfer learning module specifically designed for MML, and TL denotes the traditional transfer learning without MML. The experiments are based on the BERT-Base model, and all the results are reported in Table 5.
Table 5: Ablation study of MMT on DREAM. "TL" denotes supervised transfer learning, "M" denotes MCTEST, "R" denotes RACE, and "∪" denotes the task combination of RACE and MCTEST.
In the first experiment, we present the results of the MML module. When the input for MML is a single source, MML reduces to traditional meta learning. From the results, we observe that MML trained on MCTEST (MML(M)) is better than MML trained on RACE (MML(R)), which is caused by the large difference between the RACE and DREAM datasets. We also compare with a baseline that simply combines the RACE and MCTEST datasets into one large training source, denoted by MML(M∪R). This baseline drops dramatically and achieves only 29.20% on the DREAM dataset, which is 23.67% lower than MML(M). This suggests that simply combining the two different training datasets for meta training is not a good choice.
For the transfer learning (TL) module, we observe that transferring knowledge from RACE to DREAM yields a larger improvement than transferring from MCTEST. In addition, TL(R-M) benefits from fine-tuning on RACE and MCTEST sequentially and achieves better results.
With the help of MTL, MMT further boosts the performance on DREAM and outperforms both the MML and TL baselines. For instance, MMT(M) outperforms MML(M) and TL(M) by 15.67% and 8.40%, respectively. Moreover, MMT also helps to alleviate the overfitting issue that exists in the TL baselines. The development-set results of TL( * ) are higher than the test-set results, which indicates the poor generalization ability of TL( * ). MMT( * ), in contrast, is able to address this issue. MMT(R+M), which is trained on both RACE and MCTEST in the meta learning manner, achieves the best results among all evaluated methods.
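If the reported differences are read as absolute accuracy points (an assumption; the text does not say whether they are absolute or relative, nor which split they refer to), the quoted gaps imply the following DREAM accuracies. Only 29.20, 23.67, 15.67, and 8.40 come from the text; the other values are derived here and should be checked against Table 5.

```latex
\begin{align*}
\text{MML}(M) &= 29.20 + 23.67 = 52.87 \\
\text{MMT}(M) &= 52.87 + 15.67 = 68.54 \\
\text{TL}(M)  &= 68.54 - 8.40 = 60.14
\end{align*}
```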

Source Selection for MMT
Source selection is a prerequisite step for MMT. In the previous experiments, we assumed that the training resources were given without selection. Because of the data heterogeneity of different sources, the performance of meta learning may drop if undesirable data sources are incorporated in training. In this experiment, we evaluate the transferability between different datasets and give suggestions on source selection for MMT. The results are summarized in Figure 4, where the X-axis denotes the source and the Y-axis denotes the target. The values in the boxes indicate the transferability from the source to the target data in terms of accuracy. For example, 14 denotes that transferring from RACE to the target MCTEST yields a 14% accuracy improvement over training on MCTEST only. A negative value in the transferability matrix indicates negative transfer. No source can be used to improve the performance on SWAG effectively.
Figure 4: Transferability matrix. X-axis denotes the source, and Y-axis denotes the target. The values indicate the transferability from source to the target data in terms of accuracy. The higher the value is, the stronger the transferability is. Taking the MCTEST dataset as an example, transfer learning with pre-training on RACE leads to a 14% performance improvement over fine-tuning on MCTEST only.
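Each entry of such a matrix can be computed as the accuracy obtained with transfer minus the accuracy obtained by training on the target alone. The sketch below assumes that definition; the function `transferability` and all absolute accuracies are hypothetical, chosen so that the RACE to MCTEST entry reproduces the 14-point gain quoted in the text, while the SWAG to MCTEST entry is purely illustrative.

```python
def transferability(acc_transfer, acc_target_only):
    """Return gain[target][source] = accuracy with transfer - accuracy without."""
    return {
        target: {src: round(acc - acc_target_only[target], 2)
                 for src, acc in row.items()}
        for target, row in acc_transfer.items()
    }


# Hypothetical accuracies (in %); only the resulting 14-point gain is from the text.
acc_target_only = {"MCTEST": 60.0}
acc_transfer = {"MCTEST": {"RACE": 74.0, "SWAG": 58.0}}

matrix = transferability(acc_transfer, acc_target_only)
negative = [src for src, gain in matrix["MCTEST"].items() if gain < 0]
print(matrix)
print("negative transfer from:", negative)
```

Sources that end up in `negative` are the ones that would be excluded from MML training for that target.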
In MMT, we employ this transferability matrix to guide the source selection for MML training. Specifically, in supervised MMT, we only choose training sources with significant positive transfer. In unsupervised MMT, the source with the highest score is selected as the initial state. To verify the impact of different datasets on MMT, we further study the improvement on the target SemEval when training with different sources. The results are shown in Figure 5. The performance on SemEval drops when we incorporate DREAM and SWAG into training. Recalling the transferability matrix in Figure 4, the DREAM and SWAG datasets are of little help in improving the performance on SemEval compared to RACE and MCTEST. In summary, more source data do not guarantee better performance. Only "similar" source data are beneficial for multi-source meta learning.

Conclusion
In this work, we propose a novel method named multi-source meta transfer for multiple-choice question answering in the low-resource setting. Our method integrates multi-source meta learning and target fine-tuning into a unified framework, which is able to learn a general representation from multiple sources and alleviate the discrepancy between source and target. We demonstrate the superiority of our method over state-of-the-art methods in both the supervised and the unsupervised domain adaptation settings. In future work, we will explore extending this approach to other low-resource tasks in NLP.