The Microsoft Toolkit of Multi-Task Deep Neural Networks for Natural Language Understanding

We present MT-DNN, an open-source natural language understanding (NLU) toolkit that makes it easy for researchers and developers to train customized deep learning models. Built upon PyTorch and Transformers, MT-DNN is designed to facilitate rapid customization for a broad spectrum of NLU tasks, using a variety of objectives (classification, regression, structured prediction) and text encoders (e.g., RNNs, BERT, RoBERTa, UniLM). A unique feature of MT-DNN is its built-in support for robust and transferable learning using the adversarial multi-task learning paradigm. To enable efficient production deployment, MT-DNN supports multi-task knowledge distillation, which can substantially compress a deep neural model without significant performance drop. We demonstrate the effectiveness of MT-DNN on a wide range of NLU applications across general and biomedical domains. The software and pre-trained models will be publicly available at https://github.com/namisan/mt-dnn.


Introduction
NLP model development has undergone a paradigm shift in recent years, due to the success of pretrained language models in improving a wide range of NLP tasks (Peters et al., 2018; Devlin et al., 2018). Unlike the traditional pipeline approach that conducts annotation in stages using primarily supervised learning, the new paradigm features a universal pretraining stage that trains a large neural language model via self-supervision on a large unlabeled text corpus, followed by a fine-tuning step that starts from the pretrained contextual representations and conducts supervised learning for individual tasks. The pretrained language models can effectively model textual variations and distributional similarity. Therefore, they can make subsequent task-specific training more sample efficient and often significantly boost performance in downstream tasks. However, these models are quite large and pose significant challenges to production deployment with stringent memory or speed requirements. As a result, knowledge distillation has become another key feature in this new learning paradigm. An effective distillation step can often substantially compress a large model for efficient deployment (Clark et al., 2019; Tang et al., 2019; Liu et al., 2019a).
In the NLP community, there are several well-designed frameworks for research and commercial purposes, including toolkits for providing conventional layered linguistic annotations (Manning et al., 2014), platforms for developing novel neural models (Gardner et al., 2018) and systems for neural machine translation (Ott et al., 2019). However, it is hard to find an existing tool that supports all features in the new paradigm and can be easily customized for new tasks. For example, Wolf et al. (2019) provide a number of popular Transformer-based (Vaswani et al., 2017) text encoders in a unified interface, but do not offer multi-task learning or adversarial training, state-of-the-art techniques that have been shown to significantly improve performance. Additionally, most public frameworks do not offer knowledge distillation. A notable exception is DistilBERT (Sanh et al., 2019), but it provides a standalone compressed model and does not support task-specific model compression that can further improve performance.
We introduce MT-DNN, a comprehensive and easily configurable open-source toolkit for building robust and transferable natural language understanding models. MT-DNN is built upon PyTorch (Paszke et al., 2019) and the popular Transformer-based text-encoder interface (Wolf et al., 2019). It supports a large inventory of pretrained models, neural architectures, and NLU tasks, and can be easily customized for new tasks.
A key distinguishing feature of MT-DNN is that it provides out-of-the-box adversarial training, multi-task learning, and knowledge distillation. Users can train a set of related tasks jointly so that they reinforce each other. They can also invoke adversarial training (Miyato et al., 2018; Jiang et al., 2019), which helps improve model robustness and generalizability. For production deployment where large model size becomes a practical obstacle, users can use MT-DNN to compress the original models into substantially smaller ones, even using a completely different architecture (e.g., compressing BERT or other Transformer-based text encoders into LSTMs (Hochreiter and Schmidhuber, 1997)). The distillation step can similarly leverage multi-task learning and adversarial training. Users can also conduct pretraining from scratch using the masked language model objective in MT-DNN. Moreover, in the fine-tuning step, users can incorporate this objective as an auxiliary task on the training text, which has been shown to improve performance. MT-DNN provides a comprehensive list of state-of-the-art pre-trained NLU models, together with step-by-step tutorials for using such models in general and biomedical applications.

Design
MT-DNN is designed for modularity, flexibility, and ease of use. Its modules are built upon PyTorch (Paszke et al., 2019) and Transformers (Wolf et al., 2019), allowing the use of state-of-the-art (SOTA) pre-trained models, e.g., BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019c) and UniLM (Dong et al., 2019). The unique attribute of this package is a flexible interface for adversarial multi-task fine-tuning and knowledge distillation, so that researchers and developers can build large SOTA NLU models and then compress them into small ones for online deployment. The overall workflow and system architecture are shown in Figure 1 and Figure 3, respectively.

Workflow
As shown in Figure 1, starting from neural language model pre-training, there are three different training configurations, obtained by following the directed arrows:
• Single-task configuration: single-task fine-tuning and single-task knowledge distillation;
• Multi-task configuration: multi-task fine-tuning and multi-task knowledge distillation;
• Multi-stage configuration: multi-task fine-tuning, single-task fine-tuning and single-task knowledge distillation.
Moreover, all configurations can additionally be equipped with adversarial training. Each stage of the workflow is described in detail as follows.

Figure 1: The workflow of MT-DNN: train a neural language model on a large amount of unlabeled raw text to obtain general contextual representations; then fine-tune the learned contextual representation on downstream tasks, e.g., GLUE (Wang et al., 2018); lastly, distill this large model into a lighter one for online deployment. In the latter two phases, we can leverage powerful multi-task learning and adversarial training to further improve performance.

Neural Language Model Pre-Training: Due to the great success of deep contextual representations, such as ELMo (Peters et al., 2018), GPT (Radford et al., 2018) and BERT (Devlin et al., 2018), it is common practice to develop NLU models by first pre-training the underlying neural text representations (text encoders) through massive language modeling, which results in superior text representations transferable across multiple NLP tasks. Because of this, there has been an increasing effort to develop better pre-trained text encoders by scaling up either the data (Liu et al., 2019c) or the model size (Raffel et al., 2019). Moreover, users can leverage LM pre-training objectives, such as the masked LM used by BERT, as an auxiliary task for fine-tuning under the multi-task learning (MTL) framework (Sun et al., 2019; Liu et al., 2019b).
Fine-tuning: Once the text encoder is trained in the pre-training stage, an additional task-specific layer is usually added for fine-tuning on the downstream task. Besides the typical single-task fine-tuning, MT-DNN facilitates joint fine-tuning with a configurable list of related tasks in an MTL fashion. By encoding task relatedness and sharing underlying text representations, MTL is a powerful training paradigm that promotes model generalization and results in improved performance (Caruana, 1997; Liu et al., 2019b; Luong et al., 2015; Liu et al., 2015; Ruder, 2017; Collobert et al., 2011). Additionally, a two-step fine-tuning stage is supported to utilize datasets from related tasks, i.e., single-task fine-tuning following multi-task fine-tuning. MT-DNN also supports two popular sampling strategies in MTL training: 1) sampling tasks uniformly (Caruana, 1997; Liu et al., 2015); 2) sampling tasks based on the size of the dataset (Liu et al., 2019b). This makes it easy to explore various ways to feed training data to MTL training. Finally, to further improve model robustness, MT-DNN also offers a recipe to apply adversarial training (Madry et al., 2017; Zhu et al., 2019; Jiang et al., 2019) in the fine-tuning stage.
Knowledge Distillation: Although contextual text representation models pre-trained with massive text data have led to remarkable progress in NLP, it is computationally prohibitive and inefficient to deploy such models with millions of parameters for real-world applications (e.g., the BERT large model has 344 million parameters). Therefore, in order to expedite the NLU model learned in either a single-task or multi-task fashion for deployment, MT-DNN additionally supports multi-task knowledge distillation (Clark et al., 2019; Liu et al., 2019a; Tang et al., 2019; Balan et al., 2015; Ba and Caruana, 2014), an extension of Hinton et al. (2015), to compress cumbersome models into lighter ones. The multi-task knowledge distillation process is illustrated in Figure 2. Similar to the fine-tuning stage, adversarial training is available in the knowledge distillation stage.
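The two task-sampling strategies mentioned for the fine-tuning stage can be illustrated with a minimal sketch. It is not the toolkit's implementation: the function names, the task names and the dataset sizes are all hypothetical, and only the Python standard library is used.

```python
import random


def uniform_task_probs(dataset_sizes):
    """Strategy 1: every task is equally likely, regardless of its size."""
    n_tasks = len(dataset_sizes)
    return {task: 1.0 / n_tasks for task in dataset_sizes}


def proportional_task_probs(dataset_sizes):
    """Strategy 2: a task is sampled in proportion to its dataset size."""
    total = sum(dataset_sizes.values())
    return {task: size / total for task, size in dataset_sizes.items()}


def sample_task_schedule(probs, num_steps, seed=0):
    """Draw the task that supplies the mini-batch at each training step."""
    rng = random.Random(seed)
    tasks = list(probs)
    weights = [probs[task] for task in tasks]
    return rng.choices(tasks, weights=weights, k=num_steps)


# Toy dataset sizes (illustrative numbers, not real benchmark statistics).
sizes = {"task_a": 8000, "task_b": 1500, "task_c": 500}
uniform = uniform_task_probs(sizes)
proportional = proportional_task_probs(sizes)
schedule = sample_task_schedule(proportional, num_steps=10)
```

Under uniform sampling a small task is visited as often as a large one, whereas size-based sampling lets the largest dataset dominate the schedule; which behavior is preferable depends on the task mix.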

Architecture
Lexicon Encoder ($l_1$): The input $X = \{x_1, \ldots, x_m\}$ is a sequence of tokens of length $m$. The first token $x_1$ is always a special token, e.g., [CLS] for BERT (Devlin et al., 2018) and <s> for RoBERTa (Liu et al., 2019c). If $X$ is a pair of sentences $(X_1, X_2)$, we separate these sentences with special tokens, e.g., [SEP] for BERT and </s> for RoBERTa. The lexicon encoder maps $X$ into a sequence of input embedding vectors, one for each token, constructed by summing the corresponding word, positional, and optional segment embeddings.
Encoder ($l_2$): We support a multi-layer bidirectional Transformer (Vaswani et al., 2017) or an LSTM (Hochreiter and Schmidhuber, 1997) encoder to map the input representation vectors ($l_1$) into a sequence of contextual embedding vectors $C \in \mathbb{R}^{d \times m}$. This is the shared representation across different tasks. Note that MT-DNN also allows developers to customize their own encoders. For example, one can design an encoder with few Transformer layers (e.g., 3 layers) to distill knowledge from the BERT large model (24 layers), so that this small model can be deployed online to meet latency restrictions, as shown in Figure 2.
Task-Specific Output Layers: We can incorporate arbitrary natural language tasks, each with its task-specific output layer. For example, we implement the output layer as a neural ranker for relevance ranking, as a logistic regression for text classification, and so on. A multi-step reasoning decoder, SAN (Liu et al., 2018a,b), is also provided. Users can choose from the existing task-specific output layers or implement new ones themselves.
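The input handling described for the lexicon encoder can be sketched in plain Python. This is an illustration under stated assumptions, not the toolkit's code: the helper names `build_input` and `embed` are hypothetical, the trailing separator follows the usual BERT convention, and real tokenizers differ in detail (for example, in how many separators they place between the two sentences).

```python
# Special tokens named in the text; the dictionary layout is an assumption.
SPECIAL_TOKENS = {
    "bert": {"start": "[CLS]", "sep": "[SEP]"},
    "roberta": {"start": "<s>", "sep": "</s>"},
}


def build_input(first, second=None, family="bert"):
    """Prepend the start token and separate an optional sentence pair."""
    special = SPECIAL_TOKENS[family]
    tokens = [special["start"]] + list(first) + [special["sep"]]
    segments = [0] * len(tokens)
    if second is not None:
        extra = list(second) + [special["sep"]]
        tokens += extra
        segments += [1] * len(extra)
    return tokens, segments


def embed(tokens, segments, word_emb, pos_emb, seg_emb=None):
    """Sum word, positional and (optional) segment embeddings per token."""
    vectors = []
    for position, (token, segment) in enumerate(zip(tokens, segments)):
        parts = [word_emb[token], pos_emb[position]]
        if seg_emb is not None:
            parts.append(seg_emb[segment])
        vectors.append([sum(values) for values in zip(*parts)])
    return vectors


tokens, segments = build_input(["a", "b"], ["c"], family="bert")

# Toy 2-dimensional embedding tables, chosen only to make the sum visible.
word_emb = {token: [1.0, 0.0] for token in tokens}
pos_emb = [[float(i), 0.0] for i in range(len(tokens))]
seg_emb = [[0.0, 0.0], [0.0, 1.0]]
vectors = embed(tokens, segments, word_emb, pos_emb, seg_emb)
```

The resulting list holds one vector per token, which is the sequence that the shared encoder would then turn into contextual embeddings.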

Application
In this section, we present a comprehensive set of examples to illustrate how to customize MT-DNN for new tasks. We use popular benchmarks from general and biomedical domains, including GLUE (Wang et al., 2018), SNLI (Bowman et al., 2015), SciTail (Khot et al., 2018), SQuAD (Rajpurkar et al., 2016), ANLI (Nie et al., 2019), and biomedical named entity recognition (NER), relation extraction (RE) and question answering (QA) (Lee et al., 2019). To make the experiments reproducible, we make all the configuration files publicly available. We also provide a quick guide for customizing a new task in Jupyter notebooks.

General Domain Natural Language Understanding Benchmarks
• SNLI. The Stanford Natural Language Inference (SNLI) dataset contains 570k human-annotated sentence pairs, in which the premises are drawn from the captions of the Flickr30k corpus and the hypotheses are manually annotated (Bowman et al., 2015). This is the most widely used entailment dataset for NLI.
• SciTail. This is a textual entailment dataset derived from a science question answering (SciQ) dataset (Khot et al., 2018). In contrast to the other entailment datasets mentioned previously, the hypotheses in SciTail are created from science questions, while the corresponding answer candidates and premises come from relevant web sentences retrieved from a large corpus.
• ANLI. The Adversarial Natural Language Inference (ANLI) dataset (Nie et al., 2019) is a new large-scale NLI benchmark, collected via an iterative, adversarial human-and-model-in-the-loop procedure. In particular, the data is selected to be difficult for state-of-the-art models, including BERT and RoBERTa.
• SQuAD. The Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) contains about 23K passages and 100K questions. The passages come from approximately 500 Wikipedia articles, and the questions and answers are obtained by crowdsourcing.
Following Devlin et al. (2018), Table 2 compares different training algorithms: 1) BERT denotes single-task fine-tuning; 2) BERT + MTL denotes joint training via MTL; and 3) BERT + AdvTrain denotes single-task fine-tuning with adversarial training. Both MTL and adversarial training improve on single-task fine-tuning. We further test our model on the ANLI dataset (Nie et al., 2019); Table 3 summarizes the results. Following Nie et al. (2019), we combine the ANLI (Nie et al., 2019), MNLI (Williams et al., 2018), SNLI (Bowman et al., 2015) and FEVER (Thorne et al., 2018) datasets for training. RoBERTa-LARGE+AdvTrain obtains the best performance among all the strong baselines, demonstrating the advantage of adversarial training.

Biomedical Natural Language Understanding Benchmarks
There has been rising interest in exploring natural language understanding tasks in high-value domains other than newswire and the Web. In our release, we provide MT-DNN customization for three representative biomedical natural language understanding tasks:
• Named entity recognition (NER): In biomedical natural language understanding, NER has received greater attention than other tasks, and datasets are available for recognizing various biomedical entities such as diseases, genes and drugs (chemicals).
• Relation extraction (RE): Relation extraction is more closely related to end applications, but the annotation effort is significantly higher than for NER. Most existing RE tasks focus on binary relations within a short text span, such as a sentence of an abstract. Examples include gene-disease or protein-chemical relations.
• Question answering (QA): Inspired by interest in QA for the general domain, there has been some effort to create question-answering datasets in biomedicine. Annotation requires domain expertise, so it is significantly harder than in the general domain, where it is possible to produce large-scale datasets by crowdsourcing.
The MT-DNN customization can work with standard or biomedicine-specific pre-trained models such as BioBERT, and can be directly applied to biomedical benchmarks (Lee et al., 2019).

Extension
We will go through a typical natural language inference task, e.g., SNLI, which is one of the most popular benchmarks, to show how to apply our toolkit to a new task. MT-DNN is driven by configuration and command-line arguments. First, the SNLI configuration is shown in Figure 4. The configuration defines tasks, model architecture and loss functions. We briefly introduce these attributes as follows:
1. data_format is a required attribute; for SNLI it denotes that each sample includes two sentences (premise and hypothesis). Please refer to the tutorial and API for the supported formats.
2. task_layer_type specifies the architecture of the task-specific layer. The default is a "linear layer".
3. labels lists the unique label values. It is used to convert back and forth between text labels and numbers during training and evaluation. Without it, MT-DNN assumes that the labels are numbers.
4. metric_meta is the evaluation metric used for validation.
5. loss is the loss function for SNLI. Other functions are also supported, e.g., MSE for regression.
6. kd_loss is the loss function in the knowledge distillation setting.
7. adv_loss is the loss function in the adversarial setting.
8. n_class denotes the number of categories for SNLI.
9. task_type specifies whether it is a classification task or a regression task.
Once the configuration is provided, one can train the customized model for the task, using any supported pre-trained model as a starting point.
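To make the attribute list concrete, the following sketch writes a task definition as a Python dictionary and shows the label conversion that the labels attribute enables. It is a hedged illustration rather than a copy of Figure 4: the keys are spelled with underscores (an assumption about the exact key names), every value other than the three SNLI label names and the class count is a placeholder, and the helper functions are hypothetical.

```python
# Hypothetical task definition mirroring the nine attributes listed above.
# Values such as the format, metric and loss names are placeholders.
snli_task = {
    "data_format": "PremiseAndOneHypothesis",
    "task_layer_type": "linear",
    "labels": ["contradiction", "neutral", "entailment"],
    "metric_meta": ["ACC"],
    "loss": "CrossEntropy",
    "kd_loss": "MSE",
    "adv_loss": "SymmetricKL",
    "n_class": 3,
    "task_type": "Classification",
}

# The text names data_format as the one required attribute.
REQUIRED_FIELDS = {"data_format"}


def validate_task(task):
    """Reject a definition that lacks required fields or is inconsistent."""
    missing = REQUIRED_FIELDS - task.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    if "labels" in task and len(task["labels"]) != task["n_class"]:
        raise ValueError("n_class does not match the number of labels")


def label_to_id(task, label):
    """Text label -> number, as needed during training."""
    return task["labels"].index(label)


def id_to_label(task, index):
    """Number -> text label, as needed when reporting predictions."""
    return task["labels"][index]


validate_task(snli_task)
neutral_id = label_to_id(snli_task, "neutral")
```

Keeping the label list in the task definition is what allows predictions to be reported as text labels instead of raw class indices.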
MT-DNN is also highly extensible. As shown in Figure 4, loss and task_layer_type point to existing classes in the code. Users can write customized classes and plug them into MT-DNN; the customized classes can then be used via configuration.
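One common way to let a configuration string point to a class is a small registry, sketched below. This is only an illustration of the plug-in idea: `LOSS_REGISTRY`, `register_loss`, `build_loss` and both loss classes are hypothetical names, and the toolkit's own lookup mechanism may work differently.

```python
LOSS_REGISTRY = {}


def register_loss(name):
    """Class decorator that makes a loss selectable by its config name."""
    def decorator(cls):
        LOSS_REGISTRY[name] = cls
        return cls
    return decorator


@register_loss("MseLoss")
class MseLoss:
    """Stands in for a loss that ships with the package."""

    def __call__(self, predictions, targets):
        squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
        return sum(squared) / len(squared)


@register_loss("MaeLoss")
class MaeLoss:
    """Stands in for a user-written loss plugged in afterwards."""

    def __call__(self, predictions, targets):
        absolute = [abs(p - t) for p, t in zip(predictions, targets)]
        return sum(absolute) / len(absolute)


def build_loss(task_config):
    """Resolve the 'loss' entry of a task configuration to an instance."""
    return LOSS_REGISTRY[task_config["loss"]]()


custom_loss = build_loss({"loss": "MaeLoss"})
value = custom_loss([1.0, 3.0], [2.0, 1.0])
```

With this pattern, adding a new loss requires only a new registered class; the training loop keeps reading the name from the configuration.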

Conclusion
Microsoft MT-DNN is an open-source natural language understanding toolkit that helps researchers and developers build customized deep learning models. Its key features are: 1) support for robust and transferable learning using the adversarial multi-task learning paradigm; 2) support for knowledge distillation under the multi-task learning setting, which can be leveraged to derive lighter models for efficient online deployment. In the future, we will extend MT-DNN to support natural language generation tasks, e.g., question generation, and incorporate more pre-trained encoders, e.g., T5 (Raffel et al., 2019).

Figure 2: Process of knowledge distillation for MTL. A set of tasks for which task-specific labeled training data is available is picked. Then, for each task, an ensemble of different neural nets (teacher) is trained. The teacher is used to generate a set of soft targets for each task-specific training sample. Given the soft targets of the training datasets across multiple tasks, a single MT-DNN (student), shown in Figure 3, is trained using multi-task learning and back-propagation, except that if task t has a teacher, the task-specific loss is the average of two objective functions, one for the correct targets and the other for the soft targets assigned by the teacher.
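The student objective described in the Figure 2 caption, the average of a loss on the correct targets and a loss on the teacher's soft targets, can be written out for a single classification example. The sketch below assumes cross-entropy for both terms and assumes that the ensemble's soft targets are the mean of the teachers' class probabilities; both choices and all function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, student_logits):
    """Cross-entropy between a target distribution and the student."""
    probs = softmax(student_logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs) if t > 0)


def ensemble_soft_targets(teacher_logits_list):
    """Assumed aggregation: average the teachers' class probabilities."""
    all_probs = [softmax(logits) for logits in teacher_logits_list]
    return [sum(column) / len(all_probs) for column in zip(*all_probs)]


def distillation_loss(student_logits, gold_label, soft_targets=None):
    """Hard-target loss, averaged with a soft-target loss if a teacher exists."""
    one_hot = [1.0 if i == gold_label else 0.0
               for i in range(len(student_logits))]
    hard = cross_entropy(one_hot, student_logits)
    if soft_targets is None:  # task without a teacher
        return hard
    soft = cross_entropy(soft_targets, student_logits)
    return 0.5 * (hard + soft)


soft_targets = ensemble_soft_targets([[2.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
loss = distillation_loss([0.5, 0.1, -0.3], gold_label=0,
                         soft_targets=soft_targets)
```

Tasks without a teacher fall back to the ordinary hard-target loss, which matches the "if task t has a teacher" condition in the caption.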

Figure 3: Overall system architecture. The lower layers are shared across all tasks while the top layers are task-specific. The input $X$ (either a sentence or a set of sentences) is first represented as a sequence of embedding vectors, one for each word, in $l_1$. Then the encoder, e.g., a Transformer or a recurrent neural network (LSTM) model, captures the contextual information for each word and generates the shared contextual embedding vectors in $l_2$. Finally, for each task, additional task-specific layers generate task-specific representations, followed by operations necessary for classification, similarity scoring, or relevance ranking. In the case of adversarial training, we perturb the embeddings from the lexicon encoder and then add an extra loss term during training. Note that the inference phase does not require perturbations.
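The adversarial step mentioned in the caption (perturb the embedding, then add an extra loss term) can be illustrated on a toy linear classifier. The sketch substitutes a single normalized gradient step, in the style of the fast gradient method, for whatever perturbation procedure the toolkit uses, and it assumes a symmetric KL divergence as the extra term; `epsilon`, `alpha` and all function names are hypothetical.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, embedding):
    """Toy task layer: one logit per class from a single embedding."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in weights]


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def adversarial_loss(weights, embedding, gold, epsilon=0.1, alpha=1.0):
    """Return (total loss, task loss, extra adversarial term)."""
    probs = softmax(linear(weights, embedding))
    task_loss = -math.log(probs[gold])

    # Gradient of the task loss w.r.t. the embedding: W^T (p - y).
    residual = [p - (1.0 if i == gold else 0.0) for i, p in enumerate(probs)]
    grad = [sum(residual[c] * weights[c][d] for c in range(len(weights)))
            for d in range(len(embedding))]

    # One normalized ascent step of size epsilon (assumed procedure).
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        perturbed = list(embedding)
    else:
        perturbed = [x + epsilon * g / norm for x, g in zip(embedding, grad)]

    # Extra term: symmetric KL between clean and perturbed predictions.
    adv_probs = softmax(linear(weights, perturbed))
    extra = kl(probs, adv_probs) + kl(adv_probs, probs)
    return task_loss + alpha * extra, task_loss, extra


weights = [[1.0, 0.0], [0.0, 1.0]]
total, task, extra = adversarial_loss(weights, [0.5, -0.5], gold=0)
```

Only the training loss uses the perturbed embedding; prediction at inference time would call the clean forward pass alone.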
• GLUE.The General Language Understanding Evaluation (GLUE) benchmark is a collection of

Table 2: Comparison among single-task, multi-task and adversarial training on MNLI, RTE, QNLI, SST and MRPC in GLUE.

Table 3: Results in terms of accuracy on ANLI.