Multi-Task Deep Neural Networks for Natural Language Understanding

In this paper, we present a Multi-Task Deep Neural Network (MT-DNN) for learning representations across multiple natural language understanding (NLU) tasks. MT-DNN not only leverages large amounts of cross-task data, but also benefits from a regularization effect that leads to more general representations to help adapt to new tasks and domains. MT-DNN extends the model proposed in Liu et al. (2015) by incorporating a pre-trained bidirectional transformer language model, known as BERT (Devlin et al., 2018). MT-DNN obtains new state-of-the-art results on ten NLU tasks, including SNLI, SciTail, and eight out of nine GLUE tasks, pushing the GLUE benchmark to 82.7% (2.2% absolute improvement) as of February 25, 2019 on the latest GLUE test set. We also demonstrate using the SNLI and SciTail datasets that the representations learned by MT-DNN allow domain adaptation with substantially fewer in-domain labels than the pre-trained BERT representations. Our code and pre-trained models will be made publicly available.


Introduction
Learning vector-space representations of text, e.g., words and sentences, is fundamental to many natural language understanding (NLU) tasks. Two popular approaches are multi-task learning and language model pre-training. In this paper, we strive to combine the strengths of both approaches by proposing a new Multi-Task Deep Neural Network (MT-DNN).
Multi-Task Learning (MTL) is inspired by human learning activities, where people often apply the knowledge learned from previous tasks to help learn a new task (Caruana, 1997; Zhang and Yang, 2017). For example, it is easier for a person who knows how to ski to learn skating than for one who does not. Similarly, it is useful for multiple (related) tasks to be learned jointly so that the knowledge learned in one task can benefit other tasks. Recently, there has been growing interest in applying MTL to representation learning using deep neural networks (DNNs) (Collobert et al., 2011; Liu et al., 2015; Luong et al., 2015; Xu et al., 2018), for two reasons. First, supervised learning of DNNs requires large amounts of task-specific labeled data, which is not always available. MTL provides an effective way of leveraging supervised data from many related tasks. Second, multi-task learning benefits from a regularization effect that alleviates overfitting to a specific task, thus making the learned representations universal across tasks.
In contrast to MTL, language model pre-training has been shown to be effective for learning universal language representations by leveraging large amounts of unlabeled data. A recent survey is included in Gao et al. (2018). Some of the most prominent examples are ELMo (Peters et al., 2018), GPT (Radford et al., 2018), and BERT (Devlin et al., 2018). These are neural network language models trained on text data using unsupervised objectives. For example, BERT is based on a multi-layer bidirectional Transformer, and is trained on plain text for masked word prediction and next sentence prediction tasks. To apply a pre-trained model to specific NLU tasks, we often need to fine-tune the model, for each task, with additional task-specific layers using task-specific training data. For example, Devlin et al. (2018) show that BERT can be fine-tuned this way to create state-of-the-art models for a range of NLU tasks, such as question answering and natural language inference.
We argue that MTL and language model pre-training are complementary technologies, and can be combined to improve the learning of text representations and boost the performance of various NLU tasks. To this end, we extend the MT-DNN model originally proposed in Liu et al. (2015) by incorporating BERT as its shared text encoding layers. As shown in Figure 1, the lower layers (i.e., text encoding layers) are shared across all tasks, while the top layers are task-specific, combining different types of NLU tasks such as single-sentence classification, pairwise text classification, text similarity, and relevance ranking. Similar to the BERT model, MT-DNN is trained in two stages: pre-training and fine-tuning. Unlike BERT, MT-DNN uses MTL in the fine-tuning stage with multiple task-specific layers in its model architecture.
MT-DNN obtains new state-of-the-art results on eight out of nine NLU tasks used in the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018), pushing the GLUE benchmark score to 82.2%, amounting to a 1.8% absolute improvement over BERT. We further extend the superiority of MT-DNN to the SNLI (Bowman et al., 2015a) and SciTail (Khot et al., 2018) tasks. The representations learned by MT-DNN allow domain adaptation with substantially fewer in-domain labels than the pre-trained BERT representations. For example, our adapted models achieve an accuracy of 91.1% on SNLI and 94.1% on SciTail, outperforming the previous state-of-the-art performance by 1.0% and 5.8%, respectively. Even with only 0.1% or 1.0% of the original training data, the performance of MT-DNN on both the SNLI and SciTail datasets is fairly good and much better than that of many existing models. All of these results clearly demonstrate MT-DNN's exceptional generalization capability via multi-task learning.

Tasks
The MT-DNN model combines four types of NLU tasks: single-sentence classification, pairwise text classification, text similarity scoring, and relevance ranking. For concreteness, we describe them using the NLU tasks defined in the GLUE benchmark as examples.
Single-Sentence Classification: Given a sentence, the model labels it using one of the pre-defined class labels. For example, the CoLA task is to predict whether an English sentence is grammatically plausible. The SST-2 task is to determine whether the sentiment of a sentence extracted from movie reviews is positive or negative.
Text Similarity: This is a regression task. Given a pair of sentences, the model predicts a real-valued score indicating the semantic similarity of the two sentences. STS-B is the only example of this task in GLUE.
Pairwise Text Classification: Given a pair of sentences, the model determines the relationship between the two sentences based on a set of pre-defined labels. For example, both RTE and MNLI are language inference tasks, where the goal is to predict whether one sentence is an entailment, contradiction, or neutral with respect to the other. QQP and MRPC are paraphrase datasets that consist of sentence pairs. The task is to predict whether the sentences in a pair are semantically equivalent.
Relevance Ranking: Given a query and a list of candidate answers, the model ranks all the candidates in order of relevance to the query. QNLI is a version of the Stanford Question Answering Dataset (Rajpurkar et al., 2016). The task involves assessing whether a sentence contains the correct answer to a given query. Although QNLI is defined as a binary classification task in GLUE, in this study we formulate it as a pairwise ranking task, where the model is expected to rank the candidate that contains the correct answer higher than the candidate that does not. We will show that this formulation leads to a significant improvement in accuracy over binary classification.

The Proposed MT-DNN Model
The architecture of the MT-DNN model is shown in Figure 1. The lower layers are shared across all tasks, while the top layers represent task-specific outputs. The input X, which is a word sequence (either a sentence or a pair of sentences packed together), is first represented as a sequence of embedding vectors, one for each word, in l_1. Then the Transformer encoder captures the contextual information for each word via self-attention, and generates a sequence of contextual embeddings in l_2. This is the shared semantic representation that is trained by our multi-task objectives. In what follows, we elaborate on the model in detail.

Figure 1: Architecture of the MT-DNN model for representation learning. The lower layers are shared across all tasks while the top layers are task-specific. The input X (either a sentence or a pair of sentences) is first represented as a sequence of embedding vectors, one for each word, in l_1. Then the Transformer encoder captures the contextual information for each word and generates the shared contextual embedding vectors in l_2. Finally, for each task, additional task-specific layers generate task-specific representations, followed by operations necessary for classification, similarity scoring, or relevance ranking.
Lexicon Encoder (l_1): The input X = {x_1, ..., x_m} is a sequence of tokens of length m. Following Devlin et al. (2018), the first token x_1 is always the [CLS] token. If X is packed with a sentence pair (X_1, X_2), we separate the two sentences with a special token [SEP]. The lexicon encoder maps X into a sequence of input embedding vectors, one for each token, constructed by summing the corresponding word, segment, and positional embeddings.
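To make the encoding concrete, the token-wise sum can be sketched in plain Python. This is an illustrative toy, not the actual implementation: the vocabulary size, embedding dimension D, and randomly initialized tables are placeholders.

```python
import random

random.seed(0)
D = 4  # toy embedding dimension (BERT uses 768 or 1024)

def make_table(n, d):
    # randomly initialized embedding table (placeholder values)
    return [[random.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(n)]

word_emb = make_table(100, D)   # token embeddings (toy vocabulary of 100)
seg_emb = make_table(2, D)      # segment A / segment B embeddings
pos_emb = make_table(512, D)    # positional embeddings up to length 512

def lexicon_encode(token_ids, segment_ids):
    """l_1: sum word + segment + position embedding for each token."""
    vectors = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        v = [word_emb[tok][i] + seg_emb[seg][i] + pos_emb[pos][i]
             for i in range(D)]
        vectors.append(v)
    return vectors

# [CLS] w1 w2 [SEP] w3 [SEP]  ->  segment ids 0,0,0,0,1,1
l1 = lexicon_encode([1, 7, 8, 2, 9, 2], [0, 0, 0, 0, 1, 1])
```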
Transformer Encoder (l_2): We use a multi-layer bidirectional Transformer encoder (Vaswani et al., 2017) to map the input representation vectors (l_1) into a sequence of contextual embedding vectors C ∈ R^{d×m}. This is the shared representation across different tasks. Unlike the BERT model (Devlin et al., 2018), which learns the representation via pre-training and adapts it to each individual task via fine-tuning, MT-DNN learns the representation using multi-task objectives.
Single-Sentence Classification Output: Suppose that x is the contextual embedding (l_2) of the token [CLS], which can be viewed as the semantic representation of input sentence X. Take the SST-2 task as an example. The probability that X is labeled as class c (i.e., the sentiment) is predicted by a logistic regression with softmax:

P_r(c|X) = softmax(W_SST^⊤ · x),    (1)

where W_SST is the task-specific parameter matrix.
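As an illustration of this output layer, the softmax over class scores can be sketched in plain Python; the [CLS] vector x and the matrix W_SST below are toy values, not trained parameters.

```python
import math

def softmax(z):
    # numerically stable softmax over a list of logits
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def classify(x, W):
    # P_r(c|X) = softmax(W_SST . x): one logit per class from the
    # [CLS] embedding x (toy values below, not trained parameters)
    logits = [sum(wi * xi for wi, xi in zip(row, x)) for row in W]
    return softmax(logits)

x = [0.2, -0.1, 0.4]                          # stand-in [CLS] embedding
W_SST = [[0.1, 0.3, -0.2], [-0.1, 0.2, 0.5]]  # 2 classes: negative/positive
probs = classify(x, W_SST)
```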
Text Similarity Output: Take the STS-B task as an example. Suppose that x is the contextual embedding (l_2) of [CLS], which can be viewed as the semantic representation of the input sentence pair (X_1, X_2). We introduce a task-specific parameter vector w_STS to compute the similarity score as:

Sim(X_1, X_2) = g(w_STS^⊤ · x),    (2)

where g(z) = 1/(1 + exp(−z)) is the sigmoid function, which maps the score to a real value in the range [0, 1].
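A minimal sketch of this similarity head, with toy values standing in for the learned w_STS and the [CLS] embedding:

```python
import math

def sim_score(x, w):
    # Sim(X1, X2) = g(w_STS . x); the sigmoid g squashes the
    # raw score into (0, 1)
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

score = sim_score([0.5, -0.2, 0.1], [0.3, 0.8, -0.4])
```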
Pairwise Text Classification Output: Take natural language inference (NLI) as an example. The NLI task defined here involves a premise P = (p_1, ..., p_m) of m words and a hypothesis H = (h_1, ..., h_n) of n words, and aims to find a logical relationship R between P and H. The design of the output module follows the answer module of the stochastic answer network (SAN) (Liu et al., 2018a), a state-of-the-art neural NLI model. SAN's answer module uses multi-step reasoning. Rather than directly predicting the entailment given the input, it maintains a state and iteratively refines its predictions.
The SAN answer module works as follows. We first construct the working memory of premise P by concatenating the contextual embeddings of the words in P, which are the output of the Transformer encoder, denoted as M^p ∈ R^{d×m}, and similarly the working memory of hypothesis H, denoted as M^h ∈ R^{d×n}. Then, we perform K-step reasoning on the memory to output the relation label, where K is a hyperparameter. At the beginning, the initial state s^0 is the summary of M^h: s^0 = Σ_j α_j M_j^h, where α_j = exp(w_1^⊤ M_j^h) / Σ_i exp(w_1^⊤ M_i^h). At each subsequent step, the state is updated as s^k = GRU(s^{k−1}, x^k). Here, x^k is computed from the previous state s^{k−1} and memory M^p: x^k = Σ_j β_j M_j^p, where β_j = softmax(s^{k−1} W_2 M^p). A one-layer classifier is used to determine the relation at each step k:

P_r^k = softmax(W_3 [s^k; x^k; |s^k − x^k|; s^k · x^k]).    (3)

At last, we utilize all of the K outputs by averaging the scores:

P_r = avg([P_r^0, P_r^1, ..., P_r^{K−1}]).    (4)

Each P_r is a probability distribution over all the relations R ∈ R. During training, we apply stochastic prediction dropout (Liu et al., 2018b) before the above averaging operation. During decoding, we average all outputs to improve robustness.
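The multi-step reasoning loop can be sketched as follows. This is a simplified illustration, not SAN itself: the GRU state update is replaced by a simple blend, stochastic prediction dropout is omitted, and all parameters are random toy values.

```python
import math, random

random.seed(0)
D, K, N_REL = 4, 5, 3  # toy dims, reasoning steps, relation labels

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# working memories for premise (6 words) and hypothesis (4 words)
M_p = [[random.gauss(0, 1) for _ in range(D)] for _ in range(6)]
M_h = [[random.gauss(0, 1) for _ in range(D)] for _ in range(4)]
w1 = [random.gauss(0, 1) for _ in range(D)]
W3 = [[random.gauss(0, 0.1) for _ in range(2 * D)] for _ in range(N_REL)]

# s0: attention-weighted summary of the hypothesis memory
alpha = softmax([dot(w1, m) for m in M_h])
s = [sum(a * m[i] for a, m in zip(alpha, M_h)) for i in range(D)]

step_probs = []
for k in range(K):
    beta = softmax([dot(s, m) for m in M_p])   # attend over the premise
    x_k = [sum(b * m[i] for b, m in zip(beta, M_p)) for i in range(D)]
    s = [0.5 * s[i] + 0.5 * x_k[i] for i in range(D)]  # stand-in for GRU
    feats = s + x_k                            # simplified classifier input
    step_probs.append(softmax([dot(row, feats) for row in W3]))

# average the K per-step distributions (Equation 4, without dropout)
P_r = [sum(p[c] for p in step_probs) / K for c in range(N_REL)]
```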
Relevance Ranking Output: Take QNLI as an example. Suppose that x is the contextual embedding vector of [CLS], which is the semantic representation of a pair of question and its candidate answer (Q, A). We compute the relevance score as:

Rel(Q, A) = g(w_QNLI^⊤ · x),    (5)

For a given Q, we rank all of its candidate answers based on their relevance scores computed using Equation 5.
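The ranking behavior can be illustrated with a small sketch; the scoring vector w_qnli and the candidate [CLS] embeddings are hypothetical toy values.

```python
import math

def rel(x, w):
    # Rel(Q, A) = g(w_QNLI . x) with g the logistic sigmoid
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

w_qnli = [0.4, -0.3, 0.2]                 # toy scoring vector
candidates = {"A1": [0.1, 0.9, 0.2],      # toy [CLS] embeddings of (Q, A)
              "A2": [0.8, -0.5, 0.6]}
# rank candidates by descending relevance score
ranking = sorted(candidates, key=lambda a: rel(candidates[a], w_qnli),
                 reverse=True)
```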

The Training Procedure
The training procedure of MT-DNN consists of two stages: pretraining and multi-task fine-tuning.
The pretraining stage follows that of the BERT model (Devlin et al., 2018). The parameters of the lexicon encoder and Transformer encoder are learned using two unsupervised prediction tasks: masked language modeling and next sentence prediction. In the multi-task fine-tuning stage, we use mini-batch based stochastic gradient descent (SGD) to learn the parameters of our model (i.e., the parameters of all shared layers and task-specific layers), as shown in Algorithm 1. In each epoch, a mini-batch b_t is selected (e.g., among all 9 GLUE tasks), and the model is updated according to the task-specific objective for the task t. This approximately optimizes the sum of all multi-task objectives.
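A skeleton of this mini-batch scheduling might look as follows; the datasets are dummy lists and the parameter update is left as a stub.

```python
import random

random.seed(0)

def make_batches(examples, batch_size):
    # pack one task's data into mini-batches (last batch may be smaller)
    return [examples[i:i + batch_size]
            for i in range(0, len(examples), batch_size)]

# dummy per-task datasets; the real ones are the GLUE training sets
tasks = {"SST-2": list(range(10)), "MNLI": list(range(25)),
         "STS-B": list(range(7))}
batch_size = 4

# tag every batch with its task so the right objective is applied
all_batches = [(name, b) for name, data in tasks.items()
               for b in make_batches(data, batch_size)]

updates = []
for epoch in range(2):
    random.shuffle(all_batches)  # mix tasks within each epoch
    for task_name, batch in all_batches:
        # stub: compute the task-specific loss for `batch` and update
        # the shared layers plus this task's output layer
        updates.append(task_name)
```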
For the classification tasks (i.e., single-sentence or pairwise text classification), we use the cross-entropy loss as the objective: −Σ_c 1(X, c) log(P_r(c|X)), where 1(X, c) is the binary indicator (0 or 1) of whether class label c is the correct classification for X, and P_r(·) is defined by, e.g., Equation 1 or 4.
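As a quick illustration, the per-example cross-entropy reduces to the negative log-probability of the gold class (probabilities here are toy values):

```python
import math

def cross_entropy(probs, gold_class):
    # -sum_c 1(X, c) log P_r(c|X): only the gold class contributes
    return -math.log(probs[gold_class])

loss = cross_entropy([0.7, 0.3], gold_class=0)
```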
For the text similarity tasks, such as STS-B, where each sentence pair is annotated with a real-valued score y, we use the mean squared error as the objective: (y − Sim(X_1, X_2))^2, where Sim(·) is defined by Equation 2.
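The corresponding per-pair objective is a one-liner (scores below are toy values):

```python
def mse_objective(y, sim):
    # squared error between gold score y and predicted Sim(X1, X2)
    return (y - sim) ** 2

loss = mse_objective(3.8, 3.5)
```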
The objective for the relevance ranking tasks follows the pairwise learning-to-rank paradigm (Burges et al., 2005; Huang et al., 2013). Take QNLI as an example. Given a query Q, we obtain a list of candidate answers A which contains a positive example A^+ that includes the correct answer, and |A| − 1 negative examples. We then minimize the negative log-likelihood of the positive example given queries across the training data: −Σ_{(Q, A^+)} log P_r(A^+|Q), where P_r(A^+|Q) = exp(γ Rel(Q, A^+)) / Σ_{A′∈A} exp(γ Rel(Q, A′)), Rel(·) is defined by Equation 5, and γ is a tuning factor determined on held-out data. In our experiment, we simply set γ to 1.
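A small sketch of this objective, computed for one query with toy relevance scores (positive candidate at index 0, γ = 1):

```python
import math

def ranking_nll(scores, positive_idx, gamma=1.0):
    # negative log-likelihood of the positive candidate under a
    # softmax over gamma-scaled relevance scores
    exps = [math.exp(gamma * s) for s in scores]
    return -math.log(exps[positive_idx] / sum(exps))

# one positive (index 0) and two negatives, with toy scores
loss = ranking_nll([2.0, 0.5, -1.0], positive_idx=0)
```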

Experiments
We evaluate the proposed MT-DNN on three popular NLU benchmarks: GLUE (Wang et al., 2018), the Stanford Natural Language Inference corpus (SNLI) (Bowman et al., 2015b), and SciTail (Khot et al., 2018). We compare MT-DNN with existing state-of-the-art models, including BERT, and demonstrate the effectiveness of MTL for model fine-tuning using GLUE and for domain adaptation using SNLI and SciTail.

Datasets
This section briefly describes the GLUE, SNLI, and SciTail datasets, as summarized in Table 1.
The GLUE benchmark is a collection of nine NLU tasks, including question answering, sentiment analysis, and textual entailment; it is considered well-designed for evaluating the generalization and robustness of NLU models. Both SNLI and SciTail are NLI tasks.
CoLA The Corpus of Linguistic Acceptability task is to predict whether an English sentence is linguistically acceptable or not (Warstadt et al., 2018). It uses the Matthews correlation coefficient (Matthews, 1975) as the evaluation metric.

SST-2 The Stanford Sentiment Treebank task is to determine the sentiment of sentences. The sentences are extracted from movie reviews with human annotations of their sentiment (Socher et al., 2013). Accuracy is used as the evaluation metric.

STS-B The Semantic Textual Similarity Benchmark is a collection of sentence pairs drawn from multiple data sources, including news headlines, video and image captions, and NLI data (Cer et al., 2017). Each pair is human-annotated with a similarity score from one to five, indicating how similar the two sentences are. The task is evaluated using two metrics: the Pearson and Spearman correlation coefficients.
QNLI This is derived from the Stanford Question Answering Dataset (Rajpurkar et al., 2016), which has been converted to a binary classification task in GLUE. A query-candidate-answer tuple is labeled as positive if the candidate contains the correct answer to the query, and negative otherwise. In this study, however, we formulate QNLI as a relevance ranking task, where for a given query, its positive candidate answers are considered more relevant, and thus should be ranked higher than its negative candidates.
QQP The Quora Question Pairs dataset is a collection of question pairs extracted from the community question-answering website Quora. The task is to predict whether two questions are semantically equivalent (Chen et al., 2018). As the distribution of positive and negative labels is unbalanced, both accuracy and F1 score are used as evaluation metrics.

MRPC The Microsoft Research Paraphrase Corpus consists of sentence pairs automatically extracted from online news sources, with human annotations denoting whether the sentences in a pair are semantically equivalent (Dolan and Brockett, 2005). Similar to QQP, both accuracy and F1 score are used as evaluation metrics.
MNLI Multi-Genre Natural Language Inference is a large-scale, crowd-sourced entailment classification task (Nangia et al., 2017). Given a pair of sentences (i.e., a premise-hypothesis pair), the goal is to predict whether the hypothesis is an entailment, contradiction, or neutral with respect to the premise. The test and development sets are split into in-domain (matched) and cross-domain (mismatched) sets. The evaluation metric is accuracy.
RTE The Recognizing Textual Entailment dataset is collected from a series of annual challenges on textual entailment.The task is similar to MNLI, but uses only two labels: entailment and not entailment (Wang et al., 2018).
WNLI The Winograd NLI (WNLI) dataset is a natural language inference dataset derived from the Winograd Schema dataset (Levesque et al., 2012). This is a reading comprehension task: the goal is to select the referent of a pronoun from a list of choices, given a sentence that contains the pronoun.
SNLI The Stanford Natural Language Inference (SNLI) dataset contains 570k human-annotated sentence pairs, in which the premises are drawn from the captions of the Flickr30 corpus and the hypotheses are manually annotated (Bowman et al., 2015b). This is the most widely used entailment dataset for NLI. The dataset is used only for domain adaptation in this study.
SciTail This is a textual entailment dataset derived from a science question answering (SciQ) dataset (Khot et al., 2018). The task involves assessing whether a given premise entails a given hypothesis. In contrast to the other entailment datasets mentioned previously, the hypotheses in SciTail are created from science questions, while the corresponding answer candidates and premises come from relevant web sentences retrieved from a large corpus. As a result, these sentences are linguistically challenging, and the lexical similarity between premise and hypothesis is often high, making SciTail particularly difficult. The dataset is used only for domain adaptation in this study.

Implementation details
Our implementation of MT-DNN is based on the PyTorch implementation of BERT. We used Adamax (Kingma and Ba, 2014) as our optimizer with a learning rate of 5e-5 and a batch size of 32. The maximum number of epochs was set to 5. A linear learning rate decay schedule with warm-up over 0.1 was used, unless stated otherwise. Following Liu et al. (2018a), we set the number of reasoning steps to 5 with a dropout rate of 0.1. To avoid the exploding gradient problem, we clipped the gradient norm to within 1. All texts were tokenized using WordPieces and chopped to spans no longer than 512 tokens.
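Two of these details, the warm-up schedule and gradient clipping, can be sketched in plain Python. The schedule below is an assumed linear warm-up/decay reading of "warm-up over 0.1"; the actual implementation may differ.

```python
import math

def lr_at(step, total_steps, peak_lr=5e-5, warmup=0.1):
    # linear warm-up over the first `warmup` fraction of steps,
    # then linear decay to zero (assumed interpretation)
    warmup_steps = int(total_steps * warmup)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * (total_steps - step) / max(1, total_steps - warmup_steps)

def clip_norm(grads, max_norm=1.0):
    # rescale the gradient vector when its L2 norm exceeds max_norm
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        grads = [g * max_norm / norm for g in grads]
    return grads

lrs = [lr_at(s, 1000) for s in range(1000)]
clipped = clip_norm([3.0, 4.0])  # norm 5.0 -> rescaled to norm 1.0
```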

GLUE Results
The test results on GLUE are presented in Table 2. MT-DNN outperforms all existing systems on all tasks except WNLI, creating new state-of-the-art results on eight GLUE tasks and pushing the benchmark to 82.2%, which amounts to a 1.8% absolute improvement over BERT_LARGE. Since MT-DNN uses BERT_LARGE for its shared layers, the gain is solely attributed to the use of MTL in fine-tuning. MTL is particularly useful for tasks with little in-domain training data. As we observe in the table, on the same type of tasks, the improvements over BERT are much more substantial for the tasks with less in-domain training data, e.g., the two NLI tasks (RTE vs. MNLI) and the two paraphrase tasks (MRPC vs. QQP). The gain of MT-DNN is also attributed to its flexible modeling framework, which allows us to incorporate task-specific model structures and training methods developed in the single-task setting, effectively leveraging the existing body of research.
Two such examples are the use of the SAN answer module as the pairwise text classification output module, and the use of the pairwise ranking loss for the QNLI task, which by design is a binary classification problem in GLUE. (There is an ongoing discussion on revising the QNLI dataset; we will update the results when the new dataset is available.) To investigate the relative contributions of these two modeling design choices, we implement different versions of MT-DNN and compare their performance on the development sets. The results are shown in Table 3.
• BERT BASE is the base BERT model released by the authors, which we used as a baseline.
We fine-tuned the model for each single task.
• MT-DNN is the proposed model described in Section 3, using the pre-trained BERT_BASE as its shared layers. We then fine-tuned the model using MTL on all GLUE tasks. Comparing MT-DNN with BERT_BASE, we see that the results on the dev sets are consistent with the GLUE test results in Table 2.
• ST-DNN, which stands for Single-Task DNN, uses the same model architecture as MT-DNN. But, instead of fine-tuning one model for all tasks using MTL, we create multiple ST-DNNs, one for each task, using only its in-domain data for fine-tuning. Thus, for pairwise text classification tasks, the only difference between the ST-DNNs and the BERT models is the design of the task-specific output module. The results show that on three out of four tasks (MNLI, QQP, and MRPC), ST-DNNs outperform their BERT counterparts, justifying the effectiveness of the SAN answer module. We also compare the results of ST-DNN and BERT on QNLI. While ST-DNN is fine-tuned using the pairwise ranking loss, BERT views QNLI as binary classification and is fine-tuned using the cross-entropy loss. That ST-DNN significantly outperforms BERT clearly demonstrates the importance of problem formulation.

SNLI and SciTail Results

Domain Adaptation Results
One of the most important criteria for building practical systems is fast adaptation to new tasks and domains. This is because it is prohibitively expensive to collect labeled training data for new domains or tasks. Very often, we have only a small amount of training data, or even none at all.
To evaluate the models using the above criterion, we perform domain adaptation experiments on two NLI tasks, SNLI and SciTail, using the following procedure:
1. fine-tune the MT-DNN model on the eight GLUE tasks, excluding WNLI;
2. create for each new task (SNLI or SciTail) a task-specific model, by adapting the trained MT-DNN using task-specific training data;
3. evaluate the models using task-specific test data.
We denote the two task-specific models as MT-DNN. For comparison, we also apply the same adaptation procedure to the pre-trained BERT model, creating two task-specific BERT models for SNLI and SciTail, respectively, denoted as BERT.
We split the training data of SNLI and SciTail, randomly sampling 0.1%, 1%, 10%, and 100% of the training data of each. As a result, we obtain four training sets for SciTail, containing 23, 235, 2.3k, and 23.5k training samples, respectively. Similarly, we obtain four training sets for SNLI, containing 549, 5.5k, 54.9k, and 549.3k training samples.
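The subset construction can be sketched as follows; the seed and the exact sampling procedure are illustrative assumptions, not the paper's.

```python
import random

def sample_fractions(data, fractions, seed=0):
    # shuffle once, then take nested prefixes at each fraction,
    # yielding progressively larger subsets of the training data
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    return {f: shuffled[:max(1, int(len(data) * f))] for f in fractions}

scitail_train = list(range(23500))  # stand-in for ~23.5k SciTail examples
subsets = sample_fractions(scitail_train, [0.001, 0.01, 0.1, 1.0])
```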
Results on different amounts of training data from SNLI and SciTail are reported in Figure 2 and Table 5. We observe that our model, pre-trained on GLUE via multi-task learning, consistently outperforms the BERT baseline. The less training data used, the larger the improvement MT-DNN demonstrates over BERT. For example, with only 0.1% (549 samples) of the SNLI training data, MT-DNN achieves 82.1% accuracy while BERT achieves 52.5%; with 1% of the training data, the accuracy of our model is 85.2% while BERT's is 78.1%. We observe similar results on SciTail. These results indicate that the representations learned by MT-DNN are more effective for domain adaptation than those of BERT.

Conclusion
In this work we proposed a model called MT-DNN to combine multi-task learning and language model pre-training for language representation learning. MT-DNN obtains new state-of-the-art results on ten NLU tasks across three popular benchmarks: SNLI, SciTail, and GLUE. MT-DNN also demonstrates an exceptional generalization capability in domain adaptation experiments.
There are many future areas to explore to improve MT-DNN, including a deeper understanding of model structure sharing in MTL, a more effective training method that leverages relatedness among multiple tasks, and ways of incorporating the linguistic structure of text in a more explicit and controllable manner.

Figure 2 :
Figure 2: Domain adaptation results on the SNLI and SciTail development datasets, using the shared embeddings generated by MT-DNN and BERT, respectively. Both MT-DNN and BERT are fine-tuned based on the pre-trained BERT_BASE. The X-axis indicates the amount of domain-specific labeled samples used for adaptation.

Table 1 :
Summary of the three benchmarks: GLUE, SNLI and SciTail.

Table 2 :
GLUE test set results, scored by the GLUE evaluation server. The number below each task denotes the number of training examples. The state-of-the-art results are in bold. MT-DNN uses BERT_LARGE for its shared layers. All results are obtained from https://gluebenchmark.com/leaderboard.

Table 3 :
GLUE dev set results. The best result on each task is in bold. BERT_BASE is the base BERT model released by the authors, and is fine-tuned for each single task. The Single-Task DNN (ST-DNN) uses the same model architecture as MT-DNN, but instead of fine-tuning one model for all tasks using MTL, we create multiple ST-DNNs, one for each task, using only in-domain data for fine-tuning. ST-DNNs and MT-DNN use BERT_BASE for their shared layers.

In Table 4, we compare our adapted models, using all in-domain training samples, against several strong baselines, including the best results reported in the leaderboards. We see that MT-DNN achieves new state-of-the-art results on both datasets, pushing the benchmarks to 91.1% on SNLI (a 1.0% absolute improvement) and 94.1% on SciTail (a 5.8% absolute improvement), respectively.

Table 4 :
Results on the SNLI and SciTail datasets.

Table 5 :
Domain adaptation results on SNLI and SciTail, as shown in Figure 2.