LIMIT-BERT: Linguistics Informed Multi-Task BERT

In this paper, we present Linguistics Informed Multi-Task BERT (LIMIT-BERT), which learns language representations across multiple linguistics tasks through Multi-Task Learning. LIMIT-BERT covers five key linguistics tasks: Part-Of-Speech (POS) tagging, constituent and dependency syntactic parsing, and span and dependency semantic role labeling (SRL). Different from recent Multi-Task Deep Neural Networks (MT-DNN), LIMIT-BERT is fully linguistically motivated and is thus able to adopt an improved masked training objective based on syntactic and semantic constituents. In addition, LIMIT-BERT takes a semi-supervised learning strategy that supplies linguistics task data at the same large scale as the language model training data. As a result, LIMIT-BERT not only improves performance on linguistics tasks but also benefits from a regularization effect and from linguistic information, which lead to more general representations that adapt better to new tasks and domains. LIMIT-BERT outperforms the strong Whole Word Masking BERT baseline on dependency and constituent syntactic/semantic parsing, the GLUE benchmark, and the SNLI task. Our practice on the proposed LIMIT-BERT also enables us to release a single well pre-trained model that serves multiple natural language processing tasks at once.


Introduction
Recently, pre-trained language models have proven highly effective across a range of linguistics inspired natural language processing (NLP) tasks such as syntactic parsing, semantic parsing and so on (Zhou et al., 2020; Ouchi et al., 2018), when taking the latter as downstream tasks for the former. In the meantime, introducing linguistic clues such as syntax and semantics into pre-trained language models may further enhance other downstream tasks such as various Natural Language Understanding (NLU) tasks (Zhang et al., 2020a,b). However, nearly all existing language models are trained on large amounts of unlabeled text data (Peters et al., 2018; Devlin et al., 2019), without explicitly exploiting linguistic knowledge. These observations motivate us to jointly consider both types of tasks: pre-training language models and solving linguistics inspired NLP problems. We argue such a treatment may bring two-fold benefits. (1) Joint learning lets the former help the latter in a bidirectional mode, rather than in the unidirectional mode that takes the latter as downstream tasks of the former.
(2) Naturally empowered by linguistic clues from joint learning, pre-trained language models will be more powerful for enhancing downstream tasks. Thus we propose Linguistics Informed Multi-Task BERT (LIMIT-BERT), an attempt to incorporate linguistic knowledge into pre-trained language representation models. LIMIT-BERT is implemented in terms of Multi-Task Learning (MTL) (Caruana, 1993), which has been shown useful for alleviating overfitting to a specific task, thus making the learned representations universal across tasks.
Universal language representations are learned by leveraging large amounts of unlabeled data, whose volume differs greatly from that of linguistics task datasets such as the Penn Treebank (PTB) (Marcus et al., 1993).
To alleviate such data unbalance in multi-task learning, we apply a semi-supervised learning approach that uses a pre-trained linguistics model to annotate large amounts of unlabeled text and combines the result with gold linguistics task datasets as our final training data. With such pre-processing, it is easy to train LIMIT-BERT on large amounts of data over many tasks concurrently by simply summing all the concerned losses. Moreover, since every sentence has been labeled with predicted syntax and semantics, we can further improve the masked training objective by fully exploiting the known syntactic or semantic constituents during language model training. Unlike the previous work MT-DNN (Liu et al., 2019b), which only fine-tunes BERT on GLUE tasks, LIMIT-BERT is trained on large amounts of data in a semi-supervised way and is firmly supported by explicit linguistic clues. We verify the effectiveness and applicability of LIMIT-BERT on Propbank semantic parsing in both span style (CoNLL-2005) (Carreras and Màrquez, 2005) and dependency style (CoNLL-2009) (Hajič et al., 2009), and on the Penn Treebank (PTB) (Marcus et al., 1993) for both constituent and dependency syntactic parsing. Our empirical results show that semantics and syntax can indeed benefit the language representation model via multi-task learning, outperforming the strong baseline Whole Word Masking BERT (BERT WWM).

Tasks and Datasets
LIMIT-BERT includes five types of downstream tasks: Part-Of-Speech tagging, constituent and dependency syntactic parsing, and span and dependency semantic role labeling (SRL).
Both span (constituent) and dependency are broadly-adopted annotation styles for either semantics or syntax, and have been well studied and discussed from both linguistic and computational perspectives (Chomsky, 1981). Constituency parsing builds a constituency-based parse tree from a sentence, representing its syntactic structure according to a phrase structure grammar, while dependency parsing identifies syntactic relations (such as an adjective modifying a noun) between word pairs in a sentence. The constituent structure is better at disclosing phrasal continuity, while the dependency structure is better at indicating dependency relations among words.
Semantic role labeling (SRL) is dedicated to recognizing the predicate-argument structure of a sentence, such as who did what to whom, where, and when. For argument annotation, there are two formalizations. One is based on text spans, namely span-based SRL. The other is dependency-based SRL, which annotates the syntactic head of an argument rather than the entire argument span. SRL is an important method for obtaining semantic information beneficial to a wide range of NLP tasks (Mihaylov and Frank, 2019).
BERT is typically trained on quite large unlabeled text datasets, BooksCorpus and English Wikipedia, which comprise 13GB of plain text, while the datasets for specific linguistics tasks total less than 100MB. We therefore employ semi-supervised learning to alleviate this data unbalance in multi-task learning, using a pre-trained linguistics model to label the BooksCorpus and English Wikipedia data. This pre-trained model jointly learns POS tags and the four types of semantic and syntactic structures; it is the XLNet version of (Zhou et al., 2020), giving state-of-the-art or comparable performance on the four concerned parsing tasks. During training, with 10% probability we instead use gold syntactic parsing and SRL data: Penn Treebank (PTB) (Marcus et al., 1993), span style SRL (CoNLL-2005) (Carreras and Màrquez, 2005), and dependency style SRL (CoNLL-2009) (Hajič et al., 2009).
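To make the data-mixing step concrete, here is a minimal sketch; the function and dataset names (sample_training_sentence, gold_dataset, silver_dataset) are illustrative assumptions, not the paper's actual implementation:

```python
import random

def sample_training_sentence(gold_dataset, silver_dataset, gold_prob=0.1):
    """Pick one annotated sentence for a training step.

    With probability `gold_prob` (10% in the paper) draw from the gold
    treebanks (PTB, CoNLL-2005, CoNLL-2009); otherwise draw from
    BooksCorpus/Wikipedia text auto-labeled by the pre-trained
    linguistics model (the "silver" data).
    """
    source = gold_dataset if random.random() < gold_prob else silver_dataset
    return random.choice(source)
```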

Linguistics-Guided Mask Strategy
BERT applies two training objectives: Masked Language Model (MLM) and Next Sentence Prediction (NSP), based on WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary. For the masked LM objective, the BERT training data generator chooses 15% of the token positions at random for mask replacement and predicts the masked tokens. Since a different masking strategy can improve model performance, such as Whole Word Masking, which masks all of the tokens corresponding to a word at once, we further improve the masking strategy by exploiting available linguistic clues, namely syntactic or semantic constituents (phrases) predicted by our pre-trained linguistics model as discussed in Section 2 (syntactic phrases are constituent subtrees, while semantic phrases are predicates or arguments in span SRL). Thus, we apply one of three mask strategies at random for each sentence: Syntactic Phrase Masking, Semantic Phrase Masking, and Whole Word Masking. Syntactic/Semantic Phrase Masking (SPM) means that all the tokens corresponding to a syntactic/semantic phrase are masked, as shown in Figure 1. The overall masking rate and replacement strategy remain the same as in BERT, and we still predict each masked WordPiece token independently. Intuitively, SPM is strictly more powerful than the original Token Masking or Whole Word Masking, since SPM can choose and predict meaningful words or phrases such as verb predicates or noun phrases.
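The sketch below illustrates phrase masking under our reading of SPM; it is an assumption-laden simplification that omits BERT's 80%/10%/10% replacement rule, and the `phrases` input format (predicted constituent or argument spans) is hypothetical:

```python
import random

MASK = "[MASK]"

def phrase_mask(tokens, phrases, mask_rate=0.15):
    """Mask whole syntactic/semantic phrases until ~15% of tokens are covered.

    tokens: WordPiece token list for one sentence.
    phrases: list of (start, end) half-open token spans predicted by the
    linguistics model (constituent subtrees, or predicates/arguments in SRL).
    """
    budget = max(1, int(len(tokens) * mask_rate))
    masked, covered = list(tokens), 0
    # Visit phrases in random order, masking each whole span at once.
    for start, end in random.sample(phrases, len(phrases)):
        if covered >= budget:
            break
        for i in range(start, end):
            masked[i] = MASK
        covered += end - start
    return masked
```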

Overview
The architecture of LIMIT-BERT is shown in Figure 2. Our model includes four modules: token representation, Transformer encoder, language modeling layers, and task-specific layers consisting of syntactic and semantic scorers and decoders. We take multi-task learning (MTL) (Caruana, 1993), sharing the parameters of the token representation and the Transformer encoder, while the language modeling layers and the top task-specific layers have independent parameters. The training procedure is simple: we sum the language model loss and the task-specific losses.

Token Representation
Following the BERT token representation (Devlin et al., 2019), the first token is always the [CLS] token. If the input X is packed from a sentence pair X1; X2, we separate the two sentences with a special token [SEP] ("packed" means the two sentences are concatenated, as in BERT training). The input X is mapped to a sequence of input embedding vectors, one for each token, each being the sum of the corresponding word, segment, and positional embeddings.
When using the BERT training data (BooksCorpus and English Wikipedia), we pack sentence pairs to perform next sentence prediction and take only the first sentence, including its [CLS] and [SEP] tokens, for the later linguistics tasks. When using gold linguistics task data (PTB, CoNLL-2005, and CoNLL-2009), chosen with 10% probability, we take only one sentence as input, with [CLS] and [SEP] as the first and last tokens respectively.
Since the input sequence X consists of WordPiece tokens, we take the last WordPiece vector of each word in the last layer of the Transformer encoder as our sole word representation for the later linguistics task input, so that the token sequence has the same length as the word-level label annotations.
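A minimal sketch of this word-level pooling, assuming we already know the index of the last sub-token of each word (the names are ours, not the paper's):

```python
import torch

def last_wordpiece_pooling(hidden_states, last_subtoken_index):
    """Keep one vector per word from the final encoder layer.

    hidden_states: (seq_len, hidden_size) tensor from the last Transformer layer.
    last_subtoken_index: for each word, the position of its final WordPiece,
    e.g. ["Fed", "##eral", "Paper"] -> [1, 2].
    """
    return hidden_states[torch.tensor(last_subtoken_index)]
```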

Transformer Encoder
The Transformer encoder in our model is adapted from (Vaswani et al., 2017); it transforms the input representation vectors into a sequence of contextualized embedding vectors whose representation is shared across the different tasks. We use the pre-trained parameters of BERT (Devlin et al., 2019) to initialize our encoder for faster convergence.

Language Modeling Layers
BERT training applies masked language modeling (MLM) as a training objective, which corrupts the input by replacing some tokens with a special token [MASK] and then lets the model reconstruct the original tokens. In our LIMIT-BERT training, the linguistics specific tasks and MLM training take the same input; thus the [MASK] tokens raise a mismatch problem: the model sees artificial [MASK] tokens during MLM training but not during fine-tuning and inference on linguistics tasks. Besides, due to learning bidirectional representations, MLM approaches incur a substantial computational cost increase because the network only learns from 15% of the tokens per example and needs more training time to converge.
Recently, Yang et al. (2019) and Clark et al. (2020) have made attempts to alleviate this difficulty. The latter applies a replaced token detection task in their ELECTRA model: instead of masking the input, ELECTRA corrupts it by replacing some input tokens with plausible alternatives sampled from a small generator network, which keeps the input close to the original without [MASK] tokens.
We adopt the ELECTRA training approach in LIMIT-BERT, letting the generator G and discriminator D share the same parameters and embeddings, as shown in Figure 2. The generator G is identical to BERT training (Devlin et al., 2019): it predicts the masked tokens and the next sentence, and sums the token mask loss and the next sentence prediction loss as $J_G(\theta)$ (when using gold linguistics task data, we compute only the token mask loss). The discriminator D then takes the tokens predicted by generator G and is trained to distinguish tokens that have been replaced by the generator, a simple binary classification of each token with loss $J_D(\theta)$. At last, we take the output vector X of the discriminator D to feed the following task-specific layers, and sum $J_G(\theta)$ and $J_D(\theta)$ as the final language modeling loss $J_{lm}(\theta)$: $$J_{lm}(\theta) = J_G(\theta) + \lambda J_D(\theta),$$ where $\lambda$ is set to 50, the same as in ELECTRA.
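A sketch of how the loss terms could be combined, under the assumption that the generator and discriminator losses are already computed per batch (function and argument names are ours):

```python
def language_modeling_loss(mask_loss, nsp_loss, rtd_loss, lam=50.0, gold_data=False):
    """Combine generator and discriminator objectives: J_lm = J_G + lam * J_D.

    mask_loss: generator masked-token prediction loss.
    nsp_loss:  generator next sentence prediction loss (skipped for gold
               linguistics data, which is fed as single sentences).
    rtd_loss:  discriminator replaced-token-detection loss (J_D).
    """
    j_g = mask_loss if gold_data else mask_loss + nsp_loss
    return j_g + lam * rtd_loss
```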

Task-specific Layers
Firstly, we rebuild word representations from the WordPiece tokens for the linguistics tasks. Then we follow (Zhou et al., 2020) to construct the task-specific layers, including a scoring layer and a decoder layer. The former scores three types of linguistic objectives: dependency head, syntactic constituent, and semantic role. The latter generates the legal linguistics structures.

Word Level Construction
Suppose that X is the output of the discriminator Transformer encoder. We pre-process the WordPiece sequence vector X for the linguistics specific tasks, which operate at the word level. We take only the first sentence X1, including the tokens [CLS] and [SEP], of the packed sentence pair (X1; X2). Then we convert the WordPiece sequence vector to word level by simply taking the last WordPiece token vector of each word as the representation of the whole word.

Scoring Layer
After the word-level construction, we calculate POS tag, syntactic constituent, dependency head, and semantic role scores, following the training procedure of (Zhou et al., 2020) to construct the syntactic constituent, dependency head, and semantic role objective losses, represented as $J_1(\theta)$, $J_2(\theta)$, and $J_3(\theta)$ respectively. For POS tagging, we apply a one-layer feedforward network and minimize the negative log-likelihood of the gold POS tag $gp_i$ of each word, implemented as a cross-entropy loss: $$J_{pos}(\theta) = -\sum_i \log P(gp_i \mid x_i),$$ where $x_i$ is the word vector inside X.
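A sketch of such a POS head follows; the hidden size and tag inventory are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class POSHead(nn.Module):
    """One-layer feedforward POS scorer trained with cross-entropy."""

    def __init__(self, hidden_size=768, num_tags=45):  # 45 PTB tags, assumed
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_tags)

    def forward(self, word_vectors, gold_tags):
        # word_vectors: (num_words, hidden_size); gold_tags: (num_words,) long
        logits = self.proj(word_vectors)
        # Negative log-likelihood of the gold POS tag of each word.
        return F.cross_entropy(logits, gold_tags)
```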
Utilizing these task-specific scores, we sum them to obtain the linguistics task loss $J_{lt}(\theta)$ for training: $$J_{lt}(\theta) = J_{pos}(\theta) + J_1(\theta) + J_2(\theta) + J_3(\theta).$$ At last, LIMIT-BERT is trained by simply minimizing the overall loss: $$J(\theta) = J_{lm}(\theta) + J_{lt}(\theta).$$

Decoder Layer
For syntactic parsing, we apply the joint span CKY-style algorithm to generate the constituent and dependency syntactic trees simultaneously, following (Zhou et al., 2020).
For span and dependency SRL, we use a single dynamic programming decoder over the uniform semantic role scores, subject to the non-overlapping constraint: span semantic arguments for the same predicate do not overlap (Punyakanok et al., 2008). For further details of the scoring and decoder layers, please refer to (Zhou et al., 2020).
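As an illustration of the non-overlapping constraint, the following sketch selects a maximum-score set of disjoint argument spans for a single predicate via weighted-interval scheduling; this is our own simplification, not the exact uniform decoder of (Zhou et al., 2020):

```python
from bisect import bisect_right

def select_nonoverlapping_spans(spans):
    """Pick a maximum-score subset of disjoint argument spans.

    spans: list of (start, end, score) with `end` exclusive, so two spans
    are compatible when one ends at or before the other starts.
    """
    spans = sorted(spans, key=lambda s: s[1])
    ends = [s[1] for s in spans]
    best = [0.0] * (len(spans) + 1)   # best[i]: max score using first i spans
    choice = [None] * (len(spans) + 1)
    for i, (start, end, score) in enumerate(spans, 1):
        # Last span (among the first i-1) that ends before this one starts.
        j = bisect_right(ends, start, 0, i - 1)
        take = best[j] + score
        if take > best[i - 1]:
            best[i], choice[i] = take, (i - 1, j)
        else:
            best[i] = best[i - 1]
    # Backtrack the selected spans.
    selected, i = [], len(spans)
    while i > 0:
        if choice[i] is None:
            i -= 1
        else:
            idx, j = choice[i]
            selected.append(spans[idx])
            i = j
    return selected[::-1], best[-1]
```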

Evaluation
We use the model of (Zhou et al., 2020) with fine-tuned uncased BERT WWM (Whole Word Masking) as the baseline. For a fair comparison with the baseline BERT WWM, we also extract the language modeling layer of LIMIT-BERT and fine-tune it with the same model of (Zhou et al., 2020). We evaluate LIMIT-BERT and the baseline BERT WWM on the CoNLL-2009 shared task (Hajič et al., 2009) for dependency-style SRL and the CoNLL-2005 shared task (Carreras and Màrquez, 2005) for span-style SRL, both using the Propbank convention (Palmer et al., 2005); on the English Penn Treebank (PTB) (Marcus et al., 1993) for constituent syntactic parsing; and on the Stanford basic dependencies (SD) representation (de Marneffe et al., 2006), converted by the Stanford parser, for dependency syntactic parsing. We follow the standard data splits and evaluation settings of (Zhou et al., 2020) and use the end-to-end setup for both span and dependency SRL. Since LIMIT-BERT involves all the syntactic and semantic parsing tasks, it is also possible to apply LIMIT-BERT directly to each task without fine-tuning, and we compare these results as well.
To evaluate the language model pre-training performance of LIMIT-BERT, we also evaluate it on two widely-used benchmarks: the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018), a collection of nine NLU tasks, and the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015).

Implementation Details
Our implementation of LIMIT-BERT is based on the PyTorch implementation of BERT. We use a learning rate of 3e-5 and a batch size of 32 with 1 million training steps. The optimizer and other training settings are the same as for BERT (Devlin et al., 2019). For the task-specific layers, including the syntactic and semantic scorers and decoders, we use the same hyperparameter settings as (Zhou et al., 2020). The LIMIT-BERT model is trained on 32 NVIDIA GeForce GTX 1080Ti GPUs.
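For reference, a hypothetical optimizer setup matching the reported learning rate; the weight-decay value follows BERT's defaults and is our assumption (batch size 32 and the 1M-step schedule belong to the training loop, not the optimizer):

```python
from torch.optim import AdamW

def build_optimizer(model, lr=3e-5, weight_decay=0.01):
    # Adam with decoupled weight decay, as used for BERT pre-training.
    return AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
```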

Main Results
Syntactic Parsing Results
As shown in Table 1, the results of syntactic and semantic parsing empirically illustrate that incorporating linguistic knowledge into a pre-trained language model by multi-task and semi-supervised learning can significantly enhance downstream tasks.

SNLI Results
Table 4 includes the best results reported in the SNLI leaderboard. LIMIT-BERT outperforms the strong baseline model BERT WWM by 0.3 F1 score on the SNLI benchmark.

GLUE Results
We fine-tuned LIMIT-BERT for each GLUE task on task-specific data. The dev results in Table 5 show that LIMIT-BERT outperforms the strong baseline model and achieves remarkable results compared to other state-of-the-art models in the literature.

Discussions
Ablation Study
LIMIT-BERT contains three key components: multi-task learning, the ELECTRA training approach, and Syntactic/Semantic Phrase Masking (SPM). To evaluate the contribution of each component, we remove it from the model for training and then fine-tune on downstream NLU tasks and linguistics tasks for evaluation. In consideration of computational cost, we start training from BERT base and use only one-tenth of the BERT training corpus. We employ the same training setting for each ablation model: 200k training steps, 1e-5 learning rate, and 32 batch size. After language model training, we extract the layers of BERT base and fine-tune on downstream tasks for evaluation. The ablation study is conducted on NLU tasks and linguistics tasks, as shown in Table 6; SPM can also improve performance when fine-tuning on linguistics tasks. Comparing the results in Tables 6 and 7, the ELECTRA training approach and SPM are more effective for NLU tasks, while multi-task learning improves performance on linguistics tasks significantly. A possible explanation is that multi-task learning enables LIMIT-BERT to 'remember' the linguistic information and thus leads to better performance on downstream linguistics tasks.

Fine-tuning Effect
We examine the fine-tuning effect of LIMIT-BERT on linguistics tasks. The results in Table 8 show that LIMIT-BERT, with or without fine-tuning, still outperforms the BERT WWM baseline consistently across all tasks. Fine-tuning is necessary to boost semantic parsing performance, while not fine-tuning performs better on syntactic parsing. As shown in Table 8, fine-tuning improves span SRL and dependency SRL by 0.1 F1 and 0.4 F1 respectively, but not fine-tuning performs better by nearly 0.2 F1 on syntactic parsing. A possible explanation is that LIMIT-BERT without fine-tuning relies on the semi-supervised training data, which contains many more long-sentence samples and thus benefits syntactic parsing more.

Table 9: POS tagging results on the WSJ test set.

Model                    Test
Yasunaga et al. (2018)   97.59
Akbik et al. (2018)      97.85
Bohnet et al. (2018)     97.96
LIMIT-BERT               97.71

Part-Of-Speech Performance
Table 9 lists the results of POS tagging on the WSJ test set, showing that LIMIT-BERT achieves competitive results compared with other state-of-the-art models. Note that we apply only a simple one-layer decoder without complicated components such as a conditional random field (CRF) (Lafferty et al., 2001), since POS tagging is not the main concern of our model.

Sentences Length Performance
Figure 3 shows the performance of the baseline model and LIMIT-BERT for varying sentence lengths on the English dev sets of the four linguistics tasks. The statistics show that LIMIT-BERT outperforms the baseline model across all sentence lengths. Moreover, LIMIT-BERT's advantage over the baseline is largest on long sentences (longer than 50 words) for both syntactic and semantic parsing. A possible explanation is that LIMIT-BERT uses semi-supervised training data, which contains many more long-sentence samples and thus benefits parsing performance on long sentences.

Related Work
Linguistics Inspired NLP Tasks With the impressive success of deep neural networks in various NLP tasks (Chen and Manning, 2014; Dozat and Manning, 2017; Ma et al., 2018; Strubell et al., 2018; Zhang et al., 2018a; Li et al., 2018a), syntactic parsing and semantic role labeling have been well developed with neural networks and achieve very high performance (Chen and Manning, 2014; Dozat and Manning, 2017; Ma et al., 2018; Kitaev and Klein, 2018). Semantic role labeling is deeply related to syntactic structure, and a number of works try to incorporate syntactic information into semantic role labeling models by different methods, such as concatenation of lexicalized embeddings, use of syntactic GCNs, and multi-task learning (Strubell et al., 2018; Zhou et al., 2020). Since semantic role labeling and syntactic parsing are two key tasks of semantics and syntax, they are included in our linguistics tasks for multi-task learning.
In addition, both span and dependency are popularly adopted annotation styles for both semantics and syntax, and some works jointly learn semantics and syntax (Henderson et al., 2013; Lluís et al., 2013; Swayamdipta et al., 2016). Researchers are interested in how the two styles of SRL models may benefit from each other, rather than in their separate development, a question roughly discussed in (Johansson and Nugues, 2008). On the other hand, researchers have discussed how to encode lexical dependencies in phrase structures, as in lexicalized tree adjoining grammar (LTAG) (Schabes et al., 1988), Combinatory Categorial Grammar (CCG) (Steedman, 2000), and head-driven phrase structure grammar (HPSG) (Pollard and Sag, 1994), a constraint-based, highly lexicalized, non-derivational generative grammar framework. To absorb the strengths of both span and dependency structures, we apply both span (constituent) and dependency representations for semantic role labeling and syntactic parsing. Thus, it is natural to study the relationship between constituent and dependency structures and the joint learning of constituent and dependency syntactic parsing (Klein and Manning, 2004; Charniak and Johnson, 2005; Farkas et al., 2011; Green and Žabokrtský, 2012; Ren et al., 2013; Xu et al., 2014; Yoshikawa et al., 2017).

Pre-trained Language Modeling
Recently, deep contextual language models have been shown effective for learning universal language representations by leveraging large amounts of unlabeled data, achieving various state-of-the-art results on a series of NLU benchmarks. Some prominent examples are Embeddings from Language Models (ELMo) (Peters et al., 2018), Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019), and Generalized Autoregressive Pretraining (XLNet) (Yang et al., 2019).
Many recent works modify the language model based on BERT, such as ELECTRA (Clark et al., 2020) and MT-DNN (Liu et al., 2019b). ELECTRA focuses on the [MASK] token mismatch problem and draws on the idea of Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). MT-DNN applies multi-task learning to language model pre-training and achieves new state-of-the-art results on the GLUE benchmark. Besides, Gururangan et al. (2020) find that multi-phase adaptive pretraining offers large gains in task performance, which is similar to our semi-supervised learning strategy.

Conclusions
In this work, we present LIMIT-BERT, which applies multi-task learning over multiple linguistics tasks with semi-supervised learning. We use five key syntax and semantics tasks: Part-Of-Speech (POS) tagging, constituent and dependency syntactic parsing, and span and dependency semantic role labeling (SRL), and we further improve the masking strategy of BERT training by effectively exploiting the available syntactic and semantic clues. The experiments show that LIMIT-BERT outperforms the strong baseline BERT WWM on four benchmark parsing treebanks and two NLU tasks. The results on GLUE and SNLI empirically illustrate that incorporating linguistic knowledge into BERT pre-training by multi-task and semi-supervised learning can also enhance downstream tasks. There are many future directions for improving LIMIT-BERT, including a deeper understanding of model structure sharing in MTL, more effective training methods that leverage relatedness among multiple tasks for both fine-tuning and pre-training, and ways of incorporating the linguistic structure of text in a more explicit and controllable manner.