LIMIT-BERT: Linguistic Informed Multi-Task BERT

In this paper, we present Linguistic Informed Multi-Task BERT (LIMIT-BERT), which learns language representations across multiple linguistic tasks through Multi-Task Learning (MTL). LIMIT-BERT covers five key syntactic and semantic tasks: Part-Of-Speech (POS) tagging, constituent and dependency syntactic parsing, and span and dependency semantic role labeling (SRL). In addition, LIMIT-BERT adopts a linguistic masking strategy, Syntactic and Semantic Phrase Masking, which masks all of the tokens corresponding to a syntactic or semantic phrase. Unlike the recent Multi-Task Deep Neural Network (MT-DNN) (Liu et al., 2019), LIMIT-BERT is linguistically motivated and is trained in a semi-supervised manner, which provides as much linguistic-task data as the BERT training corpus. As a result, LIMIT-BERT not only improves performance on linguistic tasks but also benefits from a regularization effect and from linguistic information, leading to more general representations that help it adapt to new tasks and domains. LIMIT-BERT obtains new state-of-the-art or competitive results on both span and dependency semantic parsing on the Propbank benchmarks and on both dependency and constituent syntactic parsing on the Penn Treebank.


Introduction
Recently, language model pre-training has been shown to be effective for improving accuracy across a range of natural language tasks. Because language models are trained on large amounts of unlabeled data (Peters et al., 2018; Devlin et al., 2018), they do not explicitly acquire linguistic knowledge such as syntactic and semantic information, which can benefit downstream tasks such as Natural Language Understanding (NLU) (Zhang et al., 2019a,b). To investigate whether linguistic information can help language representation models improve downstream task performance, this work proposes a model called Linguistic Informed Multi-Task BERT (LIMIT-BERT), making the first attempt to incorporate linguistic knowledge into the pre-trained language representation model BERT (Devlin et al., 2018).
In addition, Multi-Task Learning (MTL) (Caruana, 1993) has been shown to be useful for jointly learning multiple related tasks. The advantage is twofold. First, the knowledge learned in one task can inherently benefit other, similar tasks. Second, MTL has a regularization effect: it alleviates overfitting to a specific task, making the learned representations more universal across tasks.
It is therefore natural to incorporate linguistic knowledge by jointly learning the language model with linguistic tasks. However, universal language representations are learned from large amounts of unlabeled data, whose volume is quite different from that of linguistic task datasets such as the Penn Treebank (PTB) (Marcus et al., 1993).
To alleviate this data imbalance in multi-task learning, we apply a semi-supervised learning approach: a pre-trained linguistic model labels large amounts of language model training corpus, and this automatically labeled data is combined with the golden linguistic task datasets to form our final training data. With this preprocessing, it is easy to train LIMIT-BERT on large amounts of data with many tasks concurrently by summing all the losses together. Moreover, since every sentence has predicted syntactic and semantic structure, we can also modify the masking strategy in language model training so that it is based on syntactic or semantic phrases. Unlike the previous work MT-DNN (Liu et al., 2019), which only fine-tunes BERT on GLUE tasks by multi-task learning, LIMIT-BERT is trained on large amounts of data by a semi-supervised learning method and is linguistically motivated.
We verify the effectiveness and applicability of LIMIT-BERT on Propbank semantic parsing in both span style (CoNLL-2005) (Carreras and Màrquez, 2005) and dependency style (CoNLL-2009) (Hajič et al., 2009), and on the Penn Treebank (PTB) (Marcus et al., 1993) for both constituent and dependency syntactic parsing. Our empirical results show that semantics and syntax can indeed benefit a language representation model via multi-task learning, and LIMIT-BERT reaches new state-of-the-art or competitive performance on all four tasks: span and dependency SRL, and constituent and dependency syntactic parsing.

Tasks and Dataset
LIMIT-BERT includes five types of downstream tasks: Part-Of-Speech (POS) tagging, constituent and dependency parsing, and span and dependency semantic role labeling (SRL).
Both span (constituent) and dependency are effective formal representations for both semantics and syntax. They have been well studied and discussed from both linguistic and computational perspectives, though few works have comprehensively considered the impact of either or both representation styles on the respective parsing tasks (Chomsky, 1981; Li et al., 2019).
Constituency parsing aims to extract from a sentence a constituency-based parse tree that represents its syntactic structure according to a phrase structure grammar, while dependency parsing identifies syntactic relations (such as an adjective modifying a noun) between word pairs in a sentence. Constituent structure is better at disclosing phrasal continuity, while dependency structure is better at indicating dependency relations among words.
Semantic role labeling (SRL) is dedicated to recognizing the predicate-argument structure of a sentence, such as who did what to whom, where and when. For argument annotation, there are two formalizations. One is based on text spans, namely span-based SRL. The other is dependency-based SRL, which annotates the syntactic head of an argument rather than the entire argument span. SRL is an important method for obtaining semantic information that benefits a wide range of natural language processing (NLP) tasks (Zhang et al., 2018; Mihaylov and Frank, 2019).
BERT is trained on large unlabeled corpora, BooksCorpus and English Wikipedia, which together contain 13GB of plain text, while the task-specific datasets are smaller than 100MB. We therefore use a semi-supervised learning approach to alleviate this data imbalance in multi-task learning: a pre-trained linguistic model labels the BooksCorpus and English Wikipedia data. As our pre-trained linguistic model, we jointly learn Part-Of-Speech (POS) tagging with the model of Zhou et al. (2019), which reaches state-of-the-art or competitive performance on both span (constituent) and dependency styles of both SRL and syntactic parsing. During training, we use the golden syntactic parsing and SRL data with 10% probability: Penn Treebank (PTB) (Marcus et al., 1993), span-style SRL (CoNLL-2005) (Carreras and Màrquez, 2005) and dependency-style SRL (CoNLL-2009) (Hajič et al., 2009).
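The 10% mixing rule can be illustrated with a short sketch. This is not the authors' code: the function names, the "golden"/"silver" labels and the choice to draw once per training step are assumptions made for illustration, and only the 10% probability comes from the text.

```python
import random

GOLDEN_PROBABILITY = 0.10  # from the text: golden data is used with 10% probability


def sample_data_source(rng: random.Random,
                       golden_probability: float = GOLDEN_PROBABILITY) -> str:
    """Pick the data source for one draw.

    Returns "golden" (PTB / CoNLL-2005 / CoNLL-2009 annotations) with
    probability `golden_probability`, otherwise "silver" (BooksCorpus and
    Wikipedia sentences labeled automatically by the pre-trained model).
    """
    return "golden" if rng.random() < golden_probability else "silver"


def count_sources(num_draws: int, seed: int = 0) -> dict:
    """Simulate `num_draws` draws and count how often each source is chosen."""
    rng = random.Random(seed)
    counts = {"golden": 0, "silver": 0}
    for _ in range(num_draws):
        counts[sample_data_source(rng)] += 1
    return counts
```

Calling `count_sources(10000)` should give a golden share close to one tenth; whether the draw happens per sentence or per batch is not stated in the text.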

Linguistics Mask Strategy
BERT applies two language model learning tasks, Masked LM and Next Sentence Prediction (NSP), based on WordPiece embeddings (Wu et al., 2016) with a 30,000-token vocabulary. For the Masked LM task, BERT uses a training data generator that chooses 15% of the token positions at random for mask replacement and predicts the masked tokens. Since a different masking strategy can improve model performance, as with Whole Word Masking, which masks all of the tokens corresponding to a word at once, we change the masking strategy based on linguistic information. As discussed in Section 2, each sentence in our training data is labeled with syntactic and semantic phrases. We therefore apply one of three masking strategies, chosen at random, to each sentence: Syntactic Phrase Masking, Semantic Phrase Masking and Whole Word Masking. Syntactic/Semantic Phrase Masking masks all of the tokens corresponding to a syntactic/semantic phrase at once, as shown in Figure 1. The overall masking rate and replacement strategy remain the same as in BERT, and we still predict each masked WordPiece token independently.
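Phrase-level masking can be sketched as follows. This is an illustrative sketch under stated assumptions, not the released implementation: the function names, the half-open span format and the policy of skipping phrases that would overflow the masking budget are all hypothetical. For brevity every selected position is replaced by [MASK], whereas BERT's replacement strategy also leaves some selected tokens unchanged or swaps in random tokens.

```python
import random
from typing import List, Sequence, Tuple

MASK_TOKEN = "[MASK]"


def select_phrase_mask_positions(
    num_tokens: int,
    phrase_spans: Sequence[Tuple[int, int]],
    rng: random.Random,
    mask_rate: float = 0.15,
) -> List[int]:
    """Choose token positions to mask, one whole phrase at a time.

    `phrase_spans` holds half-open (start, end) token ranges, e.g. the
    positions covered by a constituent or by an SRL predicate/argument.
    Phrases are visited in random order and added whole as long as the
    running total stays within round(num_tokens * mask_rate) tokens
    (with a minimum budget of one token).
    """
    budget = max(1, round(num_tokens * mask_rate))
    spans = list(phrase_spans)
    rng.shuffle(spans)
    chosen: set = set()
    for start, end in spans:
        new_positions = set(range(start, end)) - chosen
        if not new_positions:
            continue
        if len(chosen) + len(new_positions) > budget:
            continue  # this phrase would exceed the budget; try another one
        chosen |= new_positions
    return sorted(chosen)


def apply_mask(tokens: Sequence[str], positions: Sequence[int]) -> List[str]:
    """Return a copy of `tokens` with the given positions set to [MASK]."""
    masked = list(tokens)
    for pos in positions:
        masked[pos] = MASK_TOKEN
    return masked
```

Under this sketch the three strategies differ only in where `phrase_spans` comes from: constituents for Syntactic Phrase Masking, predicates and arguments for Semantic Phrase Masking, and single words for Whole Word Masking.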

Overview
The architecture of LIMIT-BERT is shown in Figure 2. Our model includes three modules: token representation, the BERT transformer encoder, and task-specific layers comprising syntactic and semantic scorers and decoders. We take a multi-task learning (MTL) approach (Caruana, 1993) in which the parameters of the token representation and the BERT transformer encoder are shared, while the top task-specific layers have independent parameters. The training procedure is simple: we sum the language model loss, which includes the masked LM and next sentence prediction losses, with the task-specific losses. The input X is a WordPiece sequence consisting of either a single sentence or a pair of sentences packed together, while our linguistic tasks operate at the word level on a single sentence. We therefore take only the first sentence of a packed sentence pair as task-specific input and use the last WordPiece token of each word as the word representation. In what follows, we describe the model in detail.

Token Representation
Following Devlin et al. (2018), the first token x1 is always the [CLS] token. If the input X is packed from a sentence pair (X1; X2), we separate the two sentences with a special token [SEP]. The transformer encoder maps X into a sequence of input embedding vectors, one for each token, constructed by summing the corresponding word, segment, and positional embeddings.
When we use the BERT training data (BooksCorpus and English Wikipedia), the input is a packed sentence pair so that next sentence prediction can be performed, and only the first sentence, including the [CLS] and [SEP] tokens, is passed to the later linguistic tasks. When we use the golden linguistic task data (Penn Treebank, CoNLL-2005 and CoNLL-2009), which happens with 10% probability, the input is a single sentence whose first and last tokens are [CLS] and [SEP], respectively.
Since the input sequence X consists of WordPiece tokens, we take only the last WordPiece vector of each word in the last layer of the transformer encoder as the sole word representation for the later linguistic tasks, so that the token sequence has the same length as the label annotations.
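The last-WordPiece rule can be made concrete with a small sketch. It assumes the usual BERT WordPiece convention in which continuation pieces start with "##"; the function names and the example tokenization in the usage note are hypothetical.

```python
from typing import List, Sequence


def last_wordpiece_indices(wordpieces: Sequence[str]) -> List[int]:
    """Return, for every word, the index of its last WordPiece.

    A piece is the last one of its word when it is the final piece of the
    sequence or when the next piece does not start with "##". Special
    tokens such as [CLS] and [SEP] are treated as one-piece words here.
    """
    indices = []
    for i in range(len(wordpieces)):
        is_last_piece = (
            i + 1 == len(wordpieces)
            or not wordpieces[i + 1].startswith("##")
        )
        if is_last_piece:
            indices.append(i)
    return indices


def pool_to_word_level(
    vectors: Sequence[Sequence[float]],
    wordpieces: Sequence[str],
) -> List[List[float]]:
    """Keep only the encoder vector of each word's last WordPiece."""
    return [list(vectors[i]) for i in last_wordpiece_indices(wordpieces)]
```

For the pieces `["[CLS]", "play", "##ing", "un", "##believ", "##able", "[SEP]"]`, the selected indices are 0, 2, 5 and 6, so the pooled sequence has one vector per word plus the two special tokens.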

Transformer Encoder
The Transformer encoder in our model is adapted from (Vaswani et al.); it transforms the input representation vectors into a sequence of contextual embedding vectors, with the representation shared across different tasks. We initialize the encoder with the pre-trained parameters of BERT (Devlin et al., 2018), which yields faster convergence. Below, we describe how to combine the linguistic task-specific layers with the language model training objective.

Task-specific Layers
We follow Zhou et al. (2019) to construct the task-specific layers, which include a scoring layer and a decoder layer whose purpose is to generate a legal linguistic structure.
The scoring layer contains four types of score: POS score, dependency head score, constituent score and semantic role score. In the decoder layer, we use these four types of score to produce a joint span structure for the constituent and dependency syntactic trees⁸ and a uniform representation for span and dependency SRL.
Suppose that X is the output of the transformer encoder. From X we compute the language model loss J_lm(θ), which is the sum of the masked-token loss and the next sentence prediction loss⁹, the same as in BERT training (Devlin et al., 2018).
Next, we prepare the WordPiece sequence vectors X for the linguistic tasks, which operate at the word level. First, we take only the first sentence X1, including the [CLS] and [SEP] tokens, of the packed sentence pair (X1; X2). Then we convert the WordPiece sequence vectors to the word level by simply taking the last WordPiece token vector of each word as the representation of the whole word.
After word-level construction, we calculate the POS, constituent span, dependency head, and semantic role scores. Using these task-specific scores, we compute the linguistic task loss J_lt(θ) for training and run a dynamic programming decoder to generate the constituent and dependency syntactic trees and the span and dependency SRL; the settings follow Zhou et al. (2019).
⁸ Besides, to construct a full predicted syntactic tree, we also include the POS task in our model and use the POS score to predict the POS tags.
⁹ When using golden linguistic task data, we only compute the masked-token loss.
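The way the two loss terms are combined can be sketched as below. This is a minimal sketch, not the authors' training loop: the function name, the per-task breakdown of J_lt and the task names used in the usage note are assumptions. The unweighted sum and the rule that golden data contributes only the masked-token part of J_lm come from the text.

```python
from typing import Mapping


def total_loss(
    masked_lm_loss: float,
    next_sentence_loss: float,
    task_losses: Mapping[str, float],
    uses_golden_data: bool,
) -> float:
    """Sum the language model loss J_lm and the linguistic task loss J_lt.

    Golden inputs are single sentences, so there is no sentence pair and
    only the masked-token part of J_lm is counted for them.
    """
    if uses_golden_data:
        j_lm = masked_lm_loss
    else:
        j_lm = masked_lm_loss + next_sentence_loss
    j_lt = sum(task_losses.values())
    return j_lm + j_lt
```

For example, with a masked-token loss of 1.0, a next-sentence loss of 0.5 and task losses `{"pos": 0.25, "constituent": 0.5, "dependency": 0.25, "srl": 1.0}`, the total is 3.5 for automatically labeled data and 3.0 for golden data.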
For POS training, we apply a simple one-layer feedforward network and minimize the negative log-likelihood of the golden POS tag g_i of each word, which is implemented as a cross-entropy loss. In the POS scorer, x_i is the word vector of X, LN denotes Layer Normalization, and g is the Rectified Linear Unit nonlinearity.
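The displayed equation for the POS scorer is not present in this copy of the text. One form consistent with the symbols defined here would be the following. It is a sketch under stated assumptions and not the authors' formula: the weight and bias names W_1, b_1, W_2, b_2, the placement of LN inside the nonlinearity, and the softmax normalization over tags t are all assumed. As in the text, g denotes the ReLU nonlinearity while g_i denotes the golden tag of word i.

```latex
\begin{aligned}
S_{pos}(i) &= W_2 \, g\big(\mathrm{LN}(W_1 x_i + b_1)\big) + b_2, \\
J_{pos}(\theta) &= -\sum_{i} \log
  \frac{\exp\big(S_{pos}(i)[g_i]\big)}{\sum_{t} \exp\big(S_{pos}(i)[t]\big)}
\end{aligned}
```

Here S_pos(i)[t] is the score the network assigns to tag t for word i, so the second line is the standard cross-entropy between the softmax over tag scores and the golden tag.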

Evaluation
We evaluate our proposed model LIMIT-BERT on the CoNLL-2009 shared task (Hajič et al., 2009) for dependency-style SRL, the CoNLL-2005 shared task (Carreras and Màrquez, 2005) for span-style SRL, and the English Penn Treebank (PTB) (Marcus et al., 1993) for constituent syntactic parsing, with the Stanford basic dependencies (SD) representation (Marneffe et al., 2006) converted by the Stanford parser for dependency syntactic parsing. We follow the standard data splits and evaluation settings of Zhou et al. (2019). In addition, we use end-to-end SRL setups.

Implementation details
Our implementation of LIMIT-BERT is based on the PyTorch implementation of BERT. We use a learning rate of 1e-5 and a batch size of 16, with 1 million training steps. The optimizer and other training settings are the same as in BERT (Devlin et al., 2018). For the task-specific layers, including the syntactic and semantic scorers and decoders, we use the same hyperparameter settings as Zhou et al. (2019). LIMIT-BERT is trained on four NVIDIA Titan RTX GPUs with an Intel i7-7800X CPU.

Syntactic Parsing Results
Compared with existing state-of-the-art models with pre-training, LIMIT-BERT achieves a new state of the art on both constituent and dependency parsing. Compared with our baseline (Zhou et al., 2019), LIMIT-BERT improves by more than 0.2 UAS on dependency parsing and by 0.3 F1 on constituent parsing, which are considerable improvements over such a strong baseline.

Semantic Parsing Results
We present all results using the official evaluation scripts from the CoNLL-2005 and CoNLL-2009 shared tasks. The results table reports performance on the in-domain (WSJ) and out-of-domain (Brown) test sets and compares our model with previous state-of-the-art models in end-to-end mode. The upper part of the table presents results for span SRL, while the lower part shows the results for dependency SRL.
LIMIT-BERT achieves a new state of the art on three of the four datasets, which empirically illustrates that incorporating linguistic knowledge into the pre-trained language model BERT by multi-task and semi-supervised learning can indeed improve downstream tasks.

Conclusions
In this work, we present LIMIT-BERT, a model that applies multi-task learning over multiple linguistic tasks with a semi-supervised method. We use five key syntactic and semantic tasks: Part-Of-Speech (POS) tagging, constituent and dependency syntactic parsing, and span and dependency semantic role labeling (SRL). We also modify the masking strategy of the BERT training input in order to incorporate syntactic and semantic information into the language model. The experiments show that LIMIT-BERT obtains new state-of-the-art or competitive results on four parsing tasks on the Propbank benchmarks and the Penn Treebank.
Figure 1: Syntactic and Semantic Phrase Masking strategy. In panel (a), Semantic Phrase Masking, the predicates sells and products have been replaced by [MASK], while in panel (b) each token of the constituent federal paper board has also been masked.

Figure 2: The framework of LIMIT-BERT.

Table 1: Dependency syntactic parsing on WSJ test set.