Towards Semi-Supervised Learning for Deep Semantic Role Labeling

Neural models achieve state-of-the-art performance on Semantic Role Labeling (SRL). However, they require an immense amount of semantic-role annotated corpora and are thus not well suited to low-resource languages or domains. This paper proposes a semi-supervised semantic role labeling method that outperforms the state of the art when SRL training corpora are limited. The method explicitly enforces syntactic constraints by augmenting the training objective with a syntactic-inconsistency loss component, and uses SRL-unlabeled instances to train a joint-objective LSTM. On the CoNLL-2012 English section, the proposed semi-supervised training with 1% and 10% SRL-labeled data and varying amounts of SRL-unlabeled data achieves +1.58 and +0.78 F1, respectively, over pre-trained models that use the SOTA architecture with ELMo and are trained on the same SRL-labeled data. Additionally, by also using the syntactic-inconsistency loss at inference time, the proposed model achieves +3.67 and +2.1 F1 over the pre-trained model on 1% and 10% SRL-labeled data, respectively.


Introduction
Semantic role labeling (SRL), also known as shallow semantic parsing, identifies the arguments corresponding to each clause or proposition, i.e., its semantic roles, based on lexical and positional information. SRL labels non-overlapping text spans corresponding to typical semantic roles such as Agent, Patient, Instrument, and Beneficiary. The task is used in many downstream applications such as question answering (Shen and Lapata, 2007), information extraction (Bastianelli et al., 2013), and machine translation.
Several SRL systems relying on large annotated corpora have been proposed (Peters et al., 2018; He et al., 2017) and perform relatively well. A more challenging task is to design an SRL method for low-resource scenarios (e.g., rare languages or domains) where we have limited annotated data but may leverage annotated data from related tasks. Therefore, in this paper, we focus on building effective systems for low-resource scenarios and illustrate our system's performance by simulating low-resource scenarios for English.
* Equal contribution, name order decided by coin flip.
SRL systems for English are built using large annotated corpora of verb predicates and their arguments provided as part of the PropBank and OntoNotes v5.0 projects (Kingsbury and Palmer, 2002; Pradhan et al., 2013). These corpora are built by adding semantic role annotations to the constituents of previously annotated syntactic parse trees in the Penn Treebank (Marcus et al., 1993). Traditionally, SRL relies heavily on syntactic parse trees, obtained either from shallow syntactic parsers (chunkers) or from full syntactic parsers, and Punyakanok et al. (2008) show significant improvements from using syntactic parse trees.
Recent breakthroughs motivated by end-to-end deep learning techniques (Zhou and Xu, 2015; He et al., 2017) achieve state-of-the-art performance without leveraging any syntactic signals, relying instead on ample role-label annotations. We hypothesize that by leveraging syntactic structure while training neural SRL models, we may achieve robust performance, especially in low-resource scenarios. Specifically, we propose to leverage syntactic parse trees as hard constraints for the SRL task, i.e., we explicitly enforce, via a scoring function in the training objective, that the predicted argument spans of the SRL network agree with the spans implied by the syntactic parse of the sentence. Moreover, we present a semi-supervised learning (SSL) formulation, wherein we leverage syntactic parse trees for SRL-unlabeled data to build effective SRL systems for low-resource scenarios.
We build upon the state-of-the-art SRL system of (Peters et al., 2018; He et al., 2017), where SRL is formulated as a BIO tagging problem and modeled with multi-layer highway bi-directional LSTMs. However, we differ in our training objective. In addition to the log-likelihood objective, we include a syntactic inconsistency loss (defined in Section 2.3), which quantifies violations of the hard constraints (the spans implied by the syntactic parse). In other words, while training our model, we push the outputs of our system to agree with the spans implied by the syntactic parse of the sentence as much as possible. In summary, our contributions to low-resource SRL are:
1. A novel formulation which leverages syntactic parse trees for SRL by introducing them as hard constraints while training the model.
2. Experiments with varying amounts of SRL-unlabeled data that point towards semi-supervised learning for low-resource SRL by leveraging the fact that syntactic inconsistency loss does not require labels.

Proposed Approach
We build upon an existing deep-learning approach to SRL (He et al., 2017). We first revisit the definitions introduced by (He et al., 2017) and then discuss our formulation.

Task definition
Given a sentence-predicate pair $(x, v)$, the SRL task is defined as predicting a sequence of tags $y$, where each $y_i$ belongs to a set of BIO tags ($\Omega$). For an argument span with semantic role ARG$_i$, the B-ARG$_i$ tag indicates that the corresponding token begins the argument span, the I-ARG$_i$ tag indicates that the token is inside the argument span, and the O tag indicates that the token is outside all argument spans. Let $n = |x| = |y|$ be the length of the sentence. Further, let srl-spans($y$) denote the set of all argument spans in the SRL tagging $y$. Similarly, parse-spans($x$) denotes the set of all unlabeled parse constituents for the given sentence $x$. Lastly, SRL-labeled/unlabeled data refers to sentence-predicate pairs with/without gold SRL tags.
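To make the tagging scheme concrete, the following is a minimal sketch of how srl-spans(y) could be read off a BIO tag sequence. The function name `srl_spans`, the inclusive (start, end, role) span representation, and the choice to skip an I- tag that has no matching open span are illustrative assumptions, not details taken from the original system.

```python
from typing import List, Optional, Set, Tuple

Span = Tuple[int, int, str]  # (start index, inclusive end index, role); hypothetical layout


def srl_spans(tags: List[str]) -> Set[Span]:
    """Collect labeled argument spans from a BIO tag sequence."""
    spans: Set[Span] = set()
    start: Optional[int] = None
    role: Optional[str] = None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        # An I- tag with the same role extends the currently open span.
        if tag.startswith("I-") and role == tag[2:]:
            continue
        # Any other tag closes the open span, if there is one.
        if role is not None:
            spans.add((start, i - 1, role))
            start, role = None, None
        if tag.startswith("B-"):
            start, role = i, tag[2:]
        # Assumption: an I- tag without a matching open span is skipped.
    return spans


tags = ["B-ARG0", "I-ARG0", "O", "B-ARG1", "I-ARG1"]
print(sorted(srl_spans(tags)))  # [(0, 1, 'ARG0'), (3, 4, 'ARG1')]
```

Dropping the role from each tuple gives unlabeled spans that can be compared directly against parse-spans(x).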

State-of-the-Art (SOTA) Model
He et al. (2017) proposed a deep bi-directional LSTM to learn a locally decomposed scoring function conditioned on the entire input sentence, $\sum_{i=1}^{n} \log p(y_i \mid x)$. To learn the parameters $w$ of the network, the conditional negative log-likelihood $L(w)$ of a sample of training data $T = \{(x^{(t)}, y^{(t)})\}$ is minimized:

$$L(w) = -\sum_{(x, y) \in T} \sum_{i=1}^{|y|} \log p(y_i \mid x; w) \qquad (1)$$

Since Eq. (1) does not model any dependencies between the output tags, the predicted output tags tend to be structurally inconsistent. To alleviate this problem, He et al. (2017) search for the best $\hat{y}$ over the space of all possibilities ($\Omega^n$) using the scoring function $f(x, y)$, which incorporates log-probability and structural penalty terms:

$$\hat{y} = \arg\max_{y \in \Omega^n} f(x, y) \qquad (2)$$

The details of the scoring function are given in Appendix Eq. (7).
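As an illustration of the locally decomposed objective, the sketch below computes the negative log-likelihood of a gold tag sequence from per-token tag distributions and sums it over a small training sample. The dictionary-based representation of the distributions and the function names are assumptions made for the example; a real model would produce these probabilities from the LSTM outputs.

```python
import math
from typing import Dict, List, Tuple

TokenDist = Dict[str, float]  # tag -> probability for one token (hypothetical layout)


def sequence_nll(token_probs: List[TokenDist], gold_tags: List[str]) -> float:
    """Negative log-likelihood of one gold tag sequence, decomposed per token."""
    if len(token_probs) != len(gold_tags):
        raise ValueError("one distribution is required per gold tag")
    return -sum(math.log(dist[tag]) for dist, tag in zip(token_probs, gold_tags))


def corpus_nll(samples: List[Tuple[List[TokenDist], List[str]]]) -> float:
    """Sum of per-sequence negative log-likelihoods over a training sample."""
    return sum(sequence_nll(probs, gold) for probs, gold in samples)


probs = [{"B-ARG0": 0.5, "O": 0.5}, {"I-ARG0": 0.25, "O": 0.75}]
gold = ["B-ARG0", "I-ARG0"]
print(sequence_nll(probs, gold))  # equals log(8), since 0.5 * 0.25 = 1/8
```

Because each token is scored independently, nothing in this loss prevents an invalid sequence such as O followed by I-ARG0, which is the structural inconsistency the text describes.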

Structural Constraints
There are different types of structural constraints: BIO, SRL, and syntactic constraints. BIO constraints define valid BIO transitions for sequence tagging; for example, B-ARG0 cannot be followed by I-ARG1. SRL constraints define rules at the role level and comprise three particular constraints: unique core roles (U), continuation roles (C), and reference roles (R) (Punyakanok et al., 2008). Lastly, syntactic constraints state that srl-spans($y$) must be a subset of parse-spans($x$). He et al. (2017) use BIO and syntactic constraints at decoding time by solving Eq. (2), where $f(x, y)$ incorporates those constraints, and report that SRL constraints do not show significant improvements over the ensemble model. In particular, by using syntactic constraints, He et al. (2017) achieve up to +2 F1 on the CoNLL-2005 dataset via A* decoding. This improvement of an SRL system through syntactic constraints is consistent with other observations (Punyakanok et al., 2008). However, all previous works enforce syntactic consistency only during the decoding step. We propose that enforcing syntactic consistency during training is also beneficial, and show its efficacy experimentally in Section 3.3.
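The syntactic constraint can be checked with plain set operations. The sketch below lists the SRL spans that are not parse constituents and reports them as a fraction of all SRL spans. Treating this fraction as a disagreement rate is an assumption made for illustration, and the function names and the (start, end) span layout are hypothetical.

```python
from typing import Set, Tuple

Span = Tuple[int, int]  # unlabeled (start, inclusive end) token indices


def violating_spans(srl: Set[Span], parse: Set[Span]) -> Set[Span]:
    """SRL spans that do not coincide with any parse constituent."""
    return srl - parse


def disagreement_rate(srl: Set[Span], parse: Set[Span]) -> float:
    """Fraction of SRL spans that violate the syntactic constraint (assumed definition)."""
    if not srl:
        return 0.0  # no predicted spans means nothing can disagree
    return len(violating_spans(srl, parse)) / len(srl)


srl = {(0, 1), (3, 5)}
parse = {(0, 1), (3, 4), (0, 5)}
print(violating_spans(srl, parse))    # {(3, 5)}
print(disagreement_rate(srl, parse))  # 0.5
```

The constraint srl-spans(y) ⊆ parse-spans(x) holds exactly when `violating_spans` returns the empty set.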

Training with Joint Objective
Based on Eq. (1), a supervised loss, and Eq. (5), the SI-Loss, we propose a joint training objective. For a given sentence-predicate pair $(x, v)$ and SRL tags $y$, our joint training objective (at epoch $t$) is defined as:

$$\text{Joint-Loss}(w) = \alpha_1 \, L(w) + \alpha_2 \, \text{SI-Loss}(x, \hat{y}^{(t)}) \qquad (6)$$

Here, $\alpha_1$ and $\alpha_2$ are weights (hyper-parameters) for the different loss components and are tuned using a development set. During training, we minimize the joint loss, i.e., the negative log-likelihood (or cross-entropy loss) together with the syntactic inconsistency loss.

Semi-supervised learning formulation
In low-resource scenarios, we have limited labeled data and larger amounts of unlabeled data. The obvious question is how to leverage large amounts of unlabeled data for training accurate models. In the context of SRL, we propose to leverage SRL-unlabeled data in terms of parse trees.
Observing Eq. (5), one can notice that our formulation of SI-Loss depends only on the model's predicted tag sequence $\hat{y}^{(t)}$ at a particular time point $t$ during training and on the given sentence; it does not depend on gold SRL tags. We leverage this fact in our SSL formulation to compute SI-Loss from SRL-unlabeled sentences. Let sup-s be a batch of SRL-labeled sentences and unsup-s be a batch of SRL-unlabeled sentences that carry only parse information. In the SSL setup, we propose to train our model with the joint objective, where sup-s contributes only to the supervised loss of Eq. (1) and unsup-s contributes only to the syntactic inconsistency objective of Eq. (5); the two are combined according to Eq. (6) into the joint loss.
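The batch-level combination described above can be sketched as follows. The per-sentence loss values are taken as given numbers, since the SI-Loss computation itself is not reproduced here; averaging within each batch, the function name `ssl_joint_loss`, and its default weights are assumptions for the example rather than the original implementation.

```python
from typing import List


def ssl_joint_loss(
    sup_nll: List[float],   # supervised losses, one per SRL-labeled sentence (sup-s)
    unsup_si: List[float],  # SI-Loss values, one per SRL-unlabeled sentence (unsup-s)
    alpha1: float = 1.0,
    alpha2: float = 1.0,
) -> float:
    """Weighted sum of the supervised term and the syntactic-inconsistency term."""
    # Assumption: each term is the batch mean; an empty batch contributes zero.
    supervised = sum(sup_nll) / len(sup_nll) if sup_nll else 0.0
    inconsistency = sum(unsup_si) / len(unsup_si) if unsup_si else 0.0
    return alpha1 * supervised + alpha2 * inconsistency


print(ssl_joint_loss([2.0, 4.0], [0.5, 1.5]))            # 4.0
print(ssl_joint_loss([2.0, 4.0], [0.5, 1.5], 0.5, 0.5))  # 2.0
```

Only `sup_nll` needs gold SRL tags; `unsup_si` can be computed from model predictions and parse information alone, which is what allows SRL-unlabeled sentences to contribute to training.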

Dataset
We evaluate our model's performance on the span-based SRL dataset from the CoNLL-2012 shared task (Pradhan et al., 2013). This dataset contains gold predicates as part of the input sentence and also gold parse information for each sentence, which we use to define hard constraints for the SRL task. We use the standard train/development/test split containing 278K/38.3K/25.6K sentences. Further, there is approximately 10% disagreement between gold SRL-spans and gold parse-spans (we term these noisy syntactic constraints). During training, we do not preprocess the data to handle these noisy constraints, but for the analysis of enforcing syntactic constraints during inference we study both cases: with and without noisy constraints.

Model configurations
For the SOTA system proposed in (Peters et al., 2018), we use code from Allen AI to implement our approach. We follow their initialization and training configurations.
Let BX and JX denote models trained with X% of the SRL-labeled data with the cross-entropy and the joint training objective, respectively.

Results
We are interested in answering the following questions: (Q1) how well does the baseline model produce syntactically consistent outputs, (Q2) does our approach actually enforce syntactic constraints, (Q3) does our approach enforce syntactic constraints without compromising the quality of the system, (Q4) how well does our SSL formulation perform, especially in low-resource scenarios, and lastly (Q5) how does using the syntactic constraints at training time differ from using them at decoding time. To answer (Q1-2), we report the average disagreement rate computed over the test split. To answer (Q3-5), we report overall F1 scores on the CoNLL-2012 test set (using the standard evaluation script). For experiments using SRL-unlabeled data, we report results averaged over multiple runs with different random samples of that data.
Does training with the joint objective help? We trained three models on a random 1%, a random 10%, and the whole 100% of the training set with the joint objective ($\alpha_1 = \alpha_2 = 0.5$). For comparison, we trained three SOTA models on the same training sets. All models were trained for at most 150 epochs with a patience of 20 epochs. Table 1 reports the results of this experiment. We see that models trained with the joint objective (JX) improve over the baseline models (BX), both in F1 and in average disagreement rate. These improvements provide evidence for answering (Q1-3) favorably. Further, the gains are larger in low-resource scenarios, because training models jointly to satisfy syntactic constraints helps generalization when the SRL corpora are limited.
Does SSL-based training work for low-resource scenarios? To enforce syntactic constraints via SI-Loss on SRL-unlabeled data, we further train the pre-trained model with two objectives in the SSL setup: (a) SI-Loss (Table 2) and (b) the joint objective (Table 3). For experiment (a), we use a square-loss regularizer, $\lVert W - W_{\text{pre-train}} \rVert^2$, to keep the model $W$ close to the pre-trained model $W_{\text{pre-train}}$ and avoid catastrophic forgetting (regularization weight set to 0.005). We optimize with SGD with a learning rate of 0.01, $\alpha_2 = 1.0$, and a patience of 10 epochs. We see that with SI-Loss the improvements are more significant in average disagreement rate than in F1. For experiment (b), we train B1 and B10 with the joint objective in the SSL setup (as discussed in Section 2.5). We use SGD with a learning rate of 0.05, $\alpha_1 = \alpha_2 = 1.0$, and a patience of 10 epochs. We report +1.58 F1 and +0.78 F1 improvements over B1 and B10, trained with 5% and 100% SRL-unlabeled data, respectively. Note that we cannot achieve these improvements by simply fine-tuning BX with the supervised loss, as seen with BX-further in Table 3. This provides evidence to answer (Q4) favorably. In general, the performance gains increase as the size of the SRL-unlabeled data increases.
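The regularizer used in experiment (a) can be sketched as a squared distance between the current and the pre-trained parameters. Flat Python lists stand in for the model weights, and the function names are hypothetical. The defaults of 0.005 and 1.0 follow the values reported above, while the way the two terms are added together is an assumption about how the objective is assembled.

```python
from typing import Sequence


def proximity_penalty(
    weights: Sequence[float],
    pretrained: Sequence[float],
    strength: float = 0.005,
) -> float:
    """Scaled squared L2 distance between current and pre-trained weights."""
    if len(weights) != len(pretrained):
        raise ValueError("weight vectors must have the same length")
    return strength * sum((w - p) ** 2 for w, p in zip(weights, pretrained))


def regularized_si_objective(
    si_loss: float,
    weights: Sequence[float],
    pretrained: Sequence[float],
    alpha2: float = 1.0,
    strength: float = 0.005,
) -> float:
    """SI-Loss term plus the penalty that discourages drifting from the pre-trained model."""
    return alpha2 * si_loss + proximity_penalty(weights, pretrained, strength)


print(proximity_penalty([1.0, 2.0], [0.0, 0.0]))  # about 0.025 (0.005 * 5)
```

The penalty is zero when the model has not moved from the pre-trained weights and grows quadratically with the drift, which is how it counteracts forgetting when no supervised signal is present.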
Is it better to enforce syntactic consistency at decoding or at training time? To answer (Q5), we conducted three experiments, using syntactic constraints (a) at inference only, i.e., structured prediction, (b) at training only, and (c) at both training and inference. For structured prediction, we consider A* decoding, as used in (He et al., 2017), and gradient-based inference, which optimizes a loss function similar to the SI-Loss of Eq. (5) on a per-example basis. If neither A* decoding nor gradient-based inference is used, we use the Viterbi algorithm to enforce BIO constraints. Performance is best (bold in Table 4) when syntactic consistency is enforced at both training and inference, with +3.67 and +2.1 F1 improvements over B1 and B10, respectively, and we conclude that enforcing syntactic consistency at inference time is complementary to enforcing it at training time. Note, however, that overall performance increases because the benefit from enforcing syntactic consistency with SSL is far greater than the marginal decrement in structured prediction. While syntactic constraints help at both training and inference, injecting constraints at training time is far more robust than enforcing them at decoding time. The performance of structured prediction drops rapidly when noise is introduced into the parse information (x column of Table 4). On the other hand, SSL was trained on CoNLL-2012 data, where about 10% of the gold SRL-spans do not match gold parse-spans, and even when we increase the noise level to 20% the performance drop was only around 0.1 F1.
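For the fallback case where only BIO constraints are enforced, a Viterbi decoder over per-token log-probabilities might look like the sketch below. It is a generic illustration, not the decoder used in the experiments. The transition rule (an I-X tag may follow only B-X or I-X, and may not start a sentence) and all function names are assumptions.

```python
import math
from typing import Dict, List


def bio_allowed(prev: str, cur: str) -> bool:
    """I-X may only follow B-X or I-X; every other transition is allowed."""
    if cur.startswith("I-"):
        return prev in ("B-" + cur[2:], "I-" + cur[2:])
    return True


def viterbi_bio(log_probs: List[Dict[str, float]]) -> List[str]:
    """Highest-scoring tag sequence that contains no invalid BIO transition."""
    tags = list(log_probs[0])
    # A sequence may not start inside a span, so I- tags get -inf at position 0.
    score = {t: (-math.inf if t.startswith("I-") else log_probs[0][t]) for t in tags}
    back: List[Dict[str, str]] = []
    for step in log_probs[1:]:
        new_score: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in tags:
            # Best predecessor among tags allowed to precede `cur`.
            best_prev = max((p for p in tags if bio_allowed(p, cur)), key=lambda p: score[p])
            new_score[cur] = score[best_prev] + step[cur]
            pointers[cur] = best_prev
        score = new_score
        back.append(pointers)
    # Follow back-pointers from the best final tag.
    path = [max(tags, key=lambda t: score[t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


steps = [
    {"O": math.log(0.6), "B-ARG0": math.log(0.3), "I-ARG0": math.log(0.1)},
    {"O": math.log(0.2), "B-ARG0": math.log(0.1), "I-ARG0": math.log(0.7)},
]
print(viterbi_bio(steps))  # ['B-ARG0', 'I-ARG0']
```

In this example, picking the most probable tag at each position independently would give O followed by I-ARG0, which is not a valid BIO sequence; the constrained search instead returns the best valid sequence.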

Related Work
The traditional approaches to SRL (Pradhan et al., 2005; Koomen et al., 2005) consisted of a cascaded system with four subtasks: pruning, argument identification, argument labeling, and inference.
Recent approaches (Zhou and Xu, 2015; He et al., 2017) proposed end-to-end systems for SRL using deep recurrent or bi-LSTM-based architectures with no syntactic inputs and have achieved SOTA results on English SRL. Lastly, (Peters et al., 2018) proposed ELMo, a deep contextualized word representation, and improved the SOTA for English SRL by 3.2 F1 points.
Even in end-to-end learning, inference remains a separate subtask and is formalized as a constrained optimization problem. To solve this problem, ILP (Punyakanok et al., 2008), the A* algorithm (He et al., 2017), and gradient-based inference (Lee et al., 2017) have been employed. Further, all of these works leveraged the syntactic parse during inference; it was never used during training except as part of a cascaded system.
To the best of our knowledge, this work is the first attempt at an SSL span-based SRL model. Nonetheless, there have been a few efforts on SSL in dependency-based SRL systems (Fürstenau and Lapata, 2009; Deschacht and Moens, 2009; Croce et al., 2010). (Fürstenau and Lapata, 2009) proposed to augment the dataset by finding unlabeled sentences similar to already labeled ones and annotating them accordingly. While interesting, a similar augmentation technique is harder to apply to span-based SRL, as it requires annotating the whole span. (Deschacht and Moens, 2009; Croce et al., 2010) proposed to leverage the relations between words by learning a latent word distribution over the context, i.e., a language model. Our paper also incorporates this idea by using ELMo, as it is trained with a language-model objective.

Conclusion and Future Work
We presented an SI-Loss that pushes SRL systems to produce syntactically consistent outputs. Further, leveraging the fact that SI-Loss does not require labeled data, we proposed an SSL formulation with a joint objective consisting of SI-Loss and the supervised loss. We showed the efficacy of the proposed approach in low-resource settings, with +1.58 and +0.78 F1 on 1% and 10% SRL-labeled data, respectively, via further training on top of a pre-trained SOTA model. We further showed that structured prediction can be used as a complementary tool, yielding performance gains of +3.67 and +2.1 F1 over the pre-trained model on 1% and 10% SRL-labeled data, respectively. Semi-supervised training from scratch and an examination of the semi-supervised setting on large datasets remain future work.