CLER: Cross-task Learning with Expert Representation to Generalize Reading and Understanding

This paper describes our model for the reading comprehension task of the MRQA shared task. We propose CLER, which stands for Cross-task Learning with Expert Representation for the generalization of reading and understanding. To generalize its capabilities, the proposed model combines three key ideas: multi-task learning, mixture of experts, and ensemble. In-domain datasets are used to train and validate our model, and out-of-domain datasets are used to validate the generalization of its performance. In the submission run, the proposed model achieved an average F1 score of 66.1% in the out-of-domain setting, a 4.3 percentage point improvement over the official BERT baseline model.


Introduction
Reading comprehension (RC) tasks are important to measure machines' capabilities of reading and understanding. Given a question and context, a typical extractive RC task aims to automatically extract an appropriate answer from the given context.
A large number of datasets for RC tasks, which contain various types of context, such as Wikipedia articles (Rajpurkar et al., 2016; Yang et al., 2018; Kwiatkowski et al., 2019), newswire (Trischler et al., 2017), and web snippets (Dunn et al., 2017; Joshi et al., 2017), have recently been published. Similarly, many types of RC task have been proposed, such as multiple passage (Dunn et al., 2017; Joshi et al., 2017), multi-hop reasoning (Yang et al., 2018; Welbl et al., 2018), dialog (Choi et al., 2018; Reddy et al., 2019), and commonsense reasoning (Ostermann et al., 2018). To assess the performance of an RC model on such datasets, we basically have to train the model on the target domain. This solution requires a dataset from the same domain as the target to appropriately train the model. However, it is difficult to collect such a dataset whenever we train a model for an RC task. To overcome this problem, transfer learning can be applied to create a general model, but there have been few works on this (Chung et al., 2018; Sun et al., 2019). During training on the source dataset, the model should be generalized to prevent overfitting to a particular domain. In other words, the model should be able to deal well with examples from the target domain (i.e., out-of-domain).
The MRQA shared task aims to measure generalization capability for RC tasks.
The shared task released six in-domain datasets (Rajpurkar et al., 2016; Trischler et al., 2017; Joshi et al., 2017; Dunn et al., 2017; Yang et al., 2018; Kwiatkowski et al., 2019) to train and validate the model in the in-domain setting, and unveiled six of the twelve test datasets 1 (Dua et al., 2019; Lai et al., 2017; Kembhavi et al., 2017; Levy et al., 2017; Saha et al., 2018) to validate the trained model in the out-of-domain setting. The characteristics of the released datasets are shown in Table 1. The goal of this competition is to demonstrate high performance on the out-of-domain datasets (the bottom part of Table 1, plus additional unseen test datasets) with a model trained only on the in-domain datasets (the top part of Table 1).
In this paper, we propose CLER, which stands for Cross-task Learning with Expert Representation.
CLER is based on BERT (Devlin et al., 2019), which has recently shown great success as a large-scale language model. The proposed model is composed of three concepts: multi-task learning, mixture of experts (MoE), and ensemble.
Our first motivation, to employ multi-task learning, is inspired by MT-DNN (Liu et al., 2019a). MT-DNN uses BERT as a shared layer and is trained on four tasks: single-sentence classification, pairwise text similarity, pairwise text classification, and pairwise ranking. Among these four tasks, natural language inference (NLI), a pairwise text classification task, is particularly related to RC tasks. Therefore, we train the proposed model on RC and NLI tasks in a multi-task setting.
Our second motivation, to employ MoE, is inspired by Guo et al. (2018), who demonstrated the effectiveness of the MoE architecture for transfer learning in sentiment analysis and part-of-speech tagging tasks. MoE consists of different neural networks called "experts" and divides a single task into several subtasks so that each subtask is assigned to one expert. Here, we assume that each subtask corresponds to a domain in the in-domain setting. Moreover, in MoE, unseen domains (i.e., out-of-domain) are represented as a combination of several source domains, such as SQuAD, TriviaQA, and HotpotQA. Therefore, we expect that MoE can deal well with examples in any domain.
Finally, we employ an ensemble to enhance the performance of the proposed model. Because ensemble models have shown superior performance over single ones (Seo et al., 2016), we introduce an ensemble mechanism to improve performance.

1 BioASQ: http://bioasq.org/
The contributions of this paper are as follows:
• We propose a BERT-based model with multi-task learning and mixture of experts called CLER.
• We demonstrate that our model achieves better performance than the official BERT baseline model in both in-domain and out-of-domain settings.
Related Works

RC models: The state-of-the-art in RC tasks has been rapidly advanced by neural models (Seo et al., 2016; Yu et al., 2018). In particular, BERT (Devlin et al., 2019) significantly improves the performance of a wide range of natural language understanding tasks, including RC tasks. BERT is designed to pre-train contextual representations from unlabeled text and to be fine-tuned for downstream tasks. By leveraging large amounts of unlabeled data, BERT can obtain rich contextual representations.

Multi-task learning:
Multi-task learning (Caruana, 1997) is a widely used technique in which a model is trained on data from multiple tasks. Multi-task learning provides the model a regularization effect to alleviate overfitting to a specific task, thus enabling universal representations to be learned across tasks. Liu et al. (2019a) proposed the multi-task deep neural network (MT-DNN) based on the BERT model. Similar to the original BERT model, MT-DNN is pre-trained as a language model for learning contextual representations. In the fine-tuning phase, MT-DNN uses multi-task learning instead of training on only a specific task.
Mixture-of-Experts: Guo et al. (2018) introduced the mixture-of-experts (MoE) (Jacobs et al., 1991) approach for unsupervised domain adaptation from multiple sources. MoE is composed of different neural networks, i.e., experts. In the original MoE, a single task is divided into subtasks, and each expert learns to handle a certain subtask. Guo et al. (2018) assume that different source domains are aligned to different sub-spaces of the target domain.

Model
For generalization to RC tasks, we propose CLER, which is based on BERT (Devlin et al., 2019) and several other techniques. An overview of the proposed model is illustrated in Figure 1. The core concepts behind our model are multi-task learning, mixture of experts (MoE), and the ensemble mechanism. During training, MoE learns the relationship between domains regardless of the type of task, while the model is trained on RC and NLI tasks simultaneously. We refer to this training procedure, which trains the model with different experts on two types of task, as cross-task learning.

BERT-based model
We utilize BERT LARGE to encode a pair of sentences composed as [CLS] <sentence1> [SEP] <sentence2>. BERT LARGE , which consists of 24 transformer blocks, has already been pre-trained using BooksCorpus (Zhu et al., 2015) and English Wikipedia. For an RC task, the given question and context are set to <sentence1> and <sentence2>, respectively. Similarly, for an NLI task, the given premise and hypothesis are set to <sentence1> and <sentence2>, respectively.
[CLS] and [SEP] are special tokens prepared by the default function of BERT. The given pair of sentences is tokenized into wordpiece tokens with a sequence length of up to L = 512. Finally, all tokens are fed into the MoE layer.
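As a sketch, the input composition above can be illustrated as follows. Whitespace splitting stands in for BERT's actual wordpiece tokenizer, which requires its vocabulary, so the function name and tokenization are illustrative only:

```python
def build_bert_input(sentence1, sentence2, max_len=512):
    """Compose a BERT input sequence: [CLS] <sentence1> [SEP] <sentence2>.

    Whitespace splitting is a stand-in for wordpiece tokenization;
    sequences longer than max_len tokens are truncated.
    """
    tokens = ["[CLS]"] + sentence1.split() + ["[SEP]"] + sentence2.split()
    return tokens[:max_len]
```

For an RC example, sentence1 would be the question and sentence2 the context; for an NLI example, the premise and hypothesis, respectively.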

Mixture of Experts
To explicitly capture the representation between domains, we introduce a mixture-of-experts (MoE) (Jacobs et al., 1991) layer after encoding the representation with BERT. As illustrated in Figure 2, MoE is composed of K expert layers that encode the input representation and a gating network that assigns the input representation to the local experts. Intuitively, we expect each expert to interpret domain-wise representations. Formally, given the representation X ∈ R^{d×L}, where d is the number of dimensions of the BERT output and L indicates the number of input tokens, the output Y ∈ R^{d×L} is computed as follows:

Y = \sum_{i=1}^{K} G(x)_i E_i(x),

where G(x)_i indicates the output probability of the i-th expert via the gating network, E_i(x) indicates the output representation of the i-th expert layer, and K is the total number of experts.
Here, we give the equations of the gating network G(·) as follows:

\overrightarrow{h} = \overrightarrow{\mathrm{GRU}}(x_1, \ldots, x_L),
\overleftarrow{h} = \overleftarrow{\mathrm{GRU}}(x_1, \ldots, x_L),
G(x) = \mathrm{softmax}(W_g [\overrightarrow{h}; \overleftarrow{h}] + b_g),

where \overrightarrow{\mathrm{GRU}} and \overleftarrow{\mathrm{GRU}} correspond to a forward GRU and a backward GRU, respectively, W_g is a weight matrix, b_g is a bias vector, ; indicates the concatenation operator, and L is the number of given tokens. Note that each GRU outputs only its final hidden state vector.
Then, we give the equation of the i-th expert layer E_i(·) as follows:

E_i(x) = W_i x + b_i,

where W_i is the i-th weight matrix and b_i is the i-th bias vector. As mentioned above, each expert has its own weight matrix and bias vector, and the gating network assigns an input example to the local experts. Therefore, the experts are able to interpret the input representation with respect to any domain, even one unseen in the source domains.
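The MoE computation above can be sketched in numpy as follows. For brevity, this sketch replaces the BiGRU pooling of the gating network with simple mean pooling over tokens; all names and shapes are illustrative, not the authors' implementation:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(X, expert_weights, expert_biases, W_g, b_g):
    """Mixture-of-experts forward pass.

    X: (d, L) token representations from the encoder.
    expert_weights: list of K (d, d) matrices W_i.
    expert_biases: list of K (d,) vectors b_i.
    W_g: (K, d) gating weights; b_g: (K,) gating bias.

    The paper pools X with a BiGRU before gating; mean pooling
    is used here as a simplified stand-in.
    """
    pooled = X.mean(axis=1)                 # (d,) summary of the sequence
    gate = softmax(W_g @ pooled + b_g)      # (K,) expert probabilities G(x)
    Y = np.zeros_like(X)
    for i, (W_i, b_i) in enumerate(zip(expert_weights, expert_biases)):
        E_i = W_i @ X + b_i[:, None]        # i-th expert: linear transform
        Y += gate[i] * E_i                  # gate-weighted combination
    return Y, gate
```

The output Y keeps the (d, L) shape of the input, so the layer can be dropped between the encoder and the task-specific heads.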

Multi-task Learning
According to Liu et al. (2019a), multi-task learning is effective for improving models on several NLP tasks. In particular, NLI tasks are related to RC tasks and even several NLP tasks. Therefore, we employ the multi-task learning approach on RC and NLI tasks to enhance the generalization of our model.
The BERT encoder and MoE layer correspond to a shared layer, and both FC RC and FC NLI, which denote fully connected layers, are task-specific layers in our multi-task setting. For FC RC at prediction time, given the representations of all tokens from the MoE layer, FC RC outputs the span with the maximum logits across all tokens. Specifically, FC RC consists of two span predictors that estimate the start and end positions of the span individually. For FC NLI at prediction time, given the representation of the first token from the MoE layer, corresponding to the [CLS] token, FC NLI outputs a predicted class out of entailment, neutral, and contradiction.
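The two task-specific heads can be sketched as follows. The exhaustive span search and the single linear NLI classifier are illustrative stand-ins for the actual FC RC and FC NLI layers:

```python
import numpy as np

def predict_span(start_logits, end_logits):
    """Pick the answer span (s, e) maximizing start + end logits, s <= e."""
    L = len(start_logits)
    best, best_score = (0, 0), -np.inf
    for s in range(L):
        for e in range(s, L):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

def predict_nli(cls_vector, W_nli, b_nli):
    """Classify the [CLS] representation into the three NLI labels."""
    labels = ["entailment", "neutral", "contradiction"]
    logits = W_nli @ cls_vector + b_nli
    return labels[int(np.argmax(logits))]
```

In practice span search is usually restricted to a maximum answer length, which this sketch omits.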

Loss Function
Finally, we minimize the following loss function in the multi-task setting:

\mathcal{L} = \mathcal{L}_{RC} + \lambda \mathcal{L}_{NLI} + \mathcal{L}_{importance},

where \mathcal{L}_{RC} is the negative log-likelihood loss for RC tasks, \mathcal{L}_{NLI} is the cross-entropy loss for NLI tasks, \mathcal{L}_{importance} is an importance loss, and λ is a weight hyperparameter.
Following Shazeer et al. (2017), we employ an importance loss \mathcal{L}_{importance} to avoid local minima. This loss penalizes experts that frequently receive a large probability from the gating network in any domain. The importance loss is defined as follows:

\mathcal{L}_{importance} = w_{importance} \cdot CV\left(\sum_{x \in Z} G(x)\right)^2,

where Z represents all samples in the given mini-batch, CV(·) is the coefficient of variation, and w_{importance} is a weight hyperparameter.
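A minimal sketch of the combined objective, assuming gating probabilities are collected over a mini-batch; the function names and decoupled structure are illustrative:

```python
import numpy as np

def importance_loss(gate_probs, w_importance=0.1):
    """Importance loss in the style of Shazeer et al. (2017).

    gate_probs: (N, K) gating probabilities for N mini-batch examples.
    Penalizes the squared coefficient of variation of the per-expert
    importance, so no single expert dominates the batch.
    """
    importance = gate_probs.sum(axis=0)        # (K,) total weight per expert
    cv = importance.std() / importance.mean()  # coefficient of variation
    return w_importance * cv ** 2

def total_loss(l_rc, l_nli, gate_probs, lam=0.5, w_importance=0.1):
    """Multi-task objective: L = L_RC + lambda * L_NLI + L_importance."""
    return l_rc + lam * l_nli + importance_loss(gate_probs, w_importance)
```

When the gating network spreads probability uniformly across experts, the importance loss is zero; it grows as the gate collapses onto a few experts.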

Ensemble
To further enhance the generalization of our model, we employ an ensemble mechanism. The ensemble is only applied at test time.
At test time, we independently feed examples of RC tasks into our models, which are trained with different seeds. We integrate the logits from FC RC into merged logits as follows:

m_s = \frac{1}{J} \sum_{j=1}^{J} o_s^j, \quad m_e = \frac{1}{J} \sum_{j=1}^{J} o_e^j,

where o_s^j ∈ R^L and o_e^j ∈ R^L correspond to the logits of our j-th model for the start span and end span, respectively, and J is the total number of models in the ensemble. Finally, we take the span with the maximum logits over m_s and m_e.
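The logit-averaging step can be sketched as follows; the greedy span decoding here (best start, then best end at or after it) is a simplification of taking the maximum-logit span:

```python
import numpy as np

def ensemble_span(start_logit_list, end_logit_list):
    """Average start/end logits across J models, then decode a span.

    start_logit_list, end_logit_list: J sequences of per-token logits,
    one per ensemble member.
    """
    m_s = np.mean(start_logit_list, axis=0)   # merged start logits
    m_e = np.mean(end_logit_list, axis=0)     # merged end logits
    s = int(np.argmax(m_s))                   # best start position
    e = s + int(np.argmax(m_e[s:]))           # best end at or after start
    return s, e
```

Averaging logits rather than hard predictions lets a confident model outweigh uncertain ones when the members disagree.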

Datasets
Datasets for RC Tasks The MRQA shared task organizers released six in-domain train and development datasets to train and validate models for generalization. Additionally, six of the twelve out-of-domain datasets were unveiled, used only to validate the trained model.
We randomly sampled examples from the official train dataset to create our Test set. Note that our Train set, which was also created from the official train dataset, does not contain any of the examples in Test. Our Dev. set is identical to the official development set. The statistics of the datasets are listed in Table 2.

Datasets for NLI Tasks
We introduce two types of NLI datasets to train our model with multi-task learning: SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018). The statistics of these datasets are listed in Table 3.
At training time, the number of examples drawn from each NLI dataset was matched to the number of examples in the corresponding RC dataset. Specifically, the numbers of examples in SNLI, FICTION, GOVERNMENT, SLATE, TELEPHONE, and TRAVEL were the same as those of SQuAD, NewsQA, TriviaQA, SearchQA, HotpotQA, and NaturalQuestions, respectively.

Experimental Setup
All of our implementations followed the settings described in this section.
We used the BERT LARGE model for all of our implementations. For the MoE layer, the number of experts was set to 12. We set the hidden unit sizes of the GRU layer and the hidden unit sizes of each expert to 512 and 1024, respectively. For the ensemble model, we trained three models independently with different seeds. The best model of the three evaluated on the out-of-domain development set was chosen as a single model.
We used Adam with a learning rate of 3e-5 to optimize the model. We fine-tuned the model for 2 epochs with a batch size of 24. During training, λ and w importance were set to 0.5 and 0.1, respectively.
Two metrics, exact match (EM) and partial match (F1), were employed in the MRQA shared task. EM is 1 if the predicted answer is exactly the same as the gold answer, and 0 otherwise. For F1, we calculate the token overlap rate between the predicted answer and the gold answer, so the maximum F1 score is 1.
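The two metrics can be sketched as follows. Note that the official evaluation additionally normalizes answers (stripping punctuation and articles), which this sketch only partially reproduces with lowercasing:

```python
from collections import Counter

def exact_match(pred, gold):
    """1 if the prediction equals the gold answer (case-insensitive), else 0."""
    return int(pred.strip().lower() == gold.strip().lower())

def f1_score(pred, gold):
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred_toks = pred.lower().split()
    gold_toks = gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting "in Paris" against the gold answer "Paris" scores EM = 0 but F1 = 2/3 (precision 0.5, recall 1.0).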

Comparison Models
As baseline models, we referred to the official evaluation results based on BERT BASE and BERT LARGE. To fairly compare the baselines and our models, we prepared BERT STL, which is composed of only the BERT encoder and FC RC with the same settings as our models. BERT STL differs from the BERT LARGE baseline with respect to the scheduling hyperparameter (t_total in the PyTorch implementation). Note that BERT STL employs neither multi-task learning nor ensembling.
We also prepared BERT MTL excluding the MoE layer from CLER, as illustrated in Figure 1, to assess the effectiveness of multi-task learning.

In-domain Evaluation
We evaluated all models on the in-domain development set. Table 4 summarizes the results on the in-domain development set.
CLER with the ensemble setting consistently demonstrated superior performances on all datasets.
Multi-task learning (BERT MTL) also effectively improved overall performance. However, MoE could not improve performance compared with BERT MTL on the in-domain datasets.

Out-of-domain Evaluation
We also evaluated all models on the out-of-domain development set. Overall, the performance of our model improved compared to the official baseline models. CLER drastically improved the EM and F1 scores compared with the baseline models on TextbookQA and DuoRC. Moreover, multi-task learning improved the average F1 score (+0.7 pt) compared with BERT STL, and the MoE layer further improved the average F1 score (+0.8 pt) compared with BERT MTL. This suggests that both multi-task learning and MoE are effective for improving generalization for RC tasks.

Submission Run
For the submission run, six datasets for the development set and six additional datasets for the test set were used to evaluate the submitted models; all of these were out-of-domain. Table 6 summarizes the submission run results. CLER drastically improved performance compared with the official baseline models. We finally ranked 6th among all participants.

Conclusion
In this paper, we proposed a BERT-based model with multi-task learning and mixture of experts (MoE) called CLER. To enhance generalization for RC tasks, we introduced an MoE layer and the multi-task learning approach. We also applied an ensemble mechanism to CLER to further improve its performances. Experimental results showed that CLER drastically improved EM and F1 scores compared with the official BERT baseline models.
In future work, we will replace the BERT encoder with a more powerful model, such as XLNet (Yang et al., 2019) or RoBERTa (Liu et al., 2019b), which have recently achieved state-of-the-art performance on natural language understanding benchmarks. We will also attempt other training strategies, such as question generation, to automatically augment the training dataset.