Structured Tuning for Semantic Role Labeling

Recent neural network-driven semantic role labeling (SRL) systems have shown impressive improvements in F1 scores. These improvements are due to expressive input representations, which, at least at the surface, are orthogonal to knowledge-rich constrained decoding mechanisms that helped linear SRL models. Introducing the benefits of structure to inform neural models presents a methodological challenge. In this paper, we present a structured tuning framework to improve models using softened constraints only at training time. Our framework leverages the expressiveness of neural networks and provides supervision with structured loss components. We start with a strong baseline (RoBERTa) to validate the impact of our approach, and show that our framework outperforms the baseline by learning to comply with declarative constraints. Additionally, our experiments with smaller training sizes show that we can achieve consistent improvements under low-resource scenarios.


Introduction
Semantic Role Labeling (SRL, Palmer et al., 2010) is the task of labeling semantic arguments of predicates in sentences to identify who does what to whom. Such representations can come in handy in tasks involving text understanding, such as coreference resolution (Ponzetto and Strube, 2006) and reading comprehension (e.g., Berant et al., 2014; Zhang et al., 2020). This paper focuses on the question of how knowledge can influence modern semantic role labeling models.
Linguistic knowledge can help SRL models in several ways.
In addition to such influences on input representations, knowledge about the nature of semantic roles can inform structured decoding algorithms used to construct the outputs. The SRL literature is witness to a rich array of techniques for structured inference, including integer linear programs (e.g., Punyakanok et al., 2005, 2008), bespoke inference algorithms (e.g., Täckström et al., 2015), A* decoding (e.g., He et al., 2017a), greedy heuristics (e.g., Ouchi et al., 2018), and simple Viterbi decoding to ensure that token tags are BIO-consistent.
By virtue of being constrained by the definition of the task, global inference promises semantically meaningful outputs, and could provide a valuable signal when models are being trained. However, beyond Viterbi decoding, it may impose prohibitive computational costs, thus ruling out using inference during training. Indeed, optimal inference may be intractable, and inference-driven training may require ignoring certain constraints that render inference difficult.
While global inference was a mainstay of SRL models until recently, today's end-to-end trained neural architectures have shown remarkable successes without needing decoding. These successes can be attributed to the expressive input and internal representations learned by neural networks. The only structured component used with such models, if at all, involves sequential dependencies between labels that admit efficient decoding.
In this paper, we ask: can we train neural network models for semantic roles in the presence of general output constraints, without paying the high computational cost of inference? We propose a structured tuning approach that exposes a neural SRL model to differentiable constraints during the finetuning step. To do so, we first write the output space constraints as logic rules. Next, we relax such statements into differentiable forms that serve as regularizers to inform the model at training time. Finally, during inference, our structure-tuned models are free to make their own judgments about labels, without any inference algorithms beyond a simple linear sequence decoder.
We evaluate our structured tuning on the CoNLL-05 (Carreras and Màrquez, 2005) and CoNLL-12 English SRL (Pradhan et al., 2013) shared task datasets, and show that by learning to comply with declarative constraints, trained models can make more consistent and more accurate predictions. We instantiate our framework on top of a strong baseline system based on the RoBERTa (Liu et al., 2019) encoder, which by itself performs on par with previous best SRL models that are not ensembled. We evaluate the impact of three different types of constraints. Our experiments on the CoNLL-05 data show that our constrained models outperform the baseline system by 0.2 F1 on the WSJ section and 1.2 F1 on the Brown test set. Even with the larger and cleaner CoNLL-12 data, our constrained models show improvements without introducing any additional trainable parameters. Finally, we also evaluate the effectiveness of our approach in low training data scenarios, and show that constraints can be more impactful when we do not have large training sets.
In summary, our contributions are: 1. We present a structured tuning framework for SRL which uses soft constraints to improve models without introducing additional trainable parameters.¹ 2. Our framework outperforms strong baseline systems, and shows especially large improvements in low data regimes.

Model & Constraints
In this section, we will introduce our structured tuning framework for semantic role labeling. In §2.1, we will briefly cover the baseline system.
To that, we will add three constraints, all treated as combinatorial constraints requiring inference algorithms in past work: Unique Core Roles in §2.3, Exclusively Overlapping Roles in §2.4, and Frame Core Roles in §2.5. For each constraint, we will discuss how to use its softened version during training.

¹ Our code to replay our experiments is archived at https://github.com/utahnlp/structured_tuning_srl.
We should point out that the specific constraints chosen serve as a proof of concept for the general methodology of tuning with declarative knowledge. For simplicity, in all our experiments, we use the ground truth predicates and their senses.

Baseline
We use the base version of RoBERTa (Liu et al., 2019) to develop our baseline SRL system. The large number of parameters not only allows it to make fast and accurate predictions, but also offers the capacity to learn from the rich output structure, including the constraints from the subsequent sections.
Our base system is a standard BIO tagger, briefly outlined below. Given a sentence s, the goal is to assign a label of the form B-X, I-X, or O to each word i being an argument with label X for a predicate at word u. These unary decisions are scored as follows:

φ(i, u) = g(f_va([f_v(map(e)_u); f_a(map(e)_i)]))

Here, map converts the wordpiece embeddings e to whole-word embeddings by summation, f_v and f_a are linear transformations of the predicate and argument embeddings respectively, f_va is a two-layer ReLU network with concatenated inputs, and finally g is a linear layer followed by a softmax activation that predicts a probability distribution over labels for each word i when u is a predicate. In addition, we also have a standard first-order sequence model over the label sequence for each predicate, in the form of a CRF layer that is Viterbi decoded. We use the standard cross-entropy loss to train the model.
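As a concrete illustration, the unary scorer described above can be sketched as follows. The layer sizes, random parameters, and function names are illustrative assumptions for this sketch, not the paper's configuration; in the actual model these parameters are learned jointly with RoBERTa finetuning, and the CRF layer is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
H, L = 16, 7   # hidden size and number of BIO labels (illustrative sizes)

# Illustrative random parameters; names mirror f_v, f_a, f_va, and g above.
W_v = rng.normal(size=(H, H)) / np.sqrt(H)
W_a = rng.normal(size=(H, H)) / np.sqrt(H)
W_1 = rng.normal(size=(2 * H, H)) / np.sqrt(2 * H)
W_2 = rng.normal(size=(H, H)) / np.sqrt(H)
W_g = rng.normal(size=(H, L)) / np.sqrt(H)

def score_arguments(word_emb, u):
    """Unary label distribution for every word i given a predicate at word u.
    `word_emb` holds whole-word embeddings (wordpieces already summed by map)."""
    n = word_emb.shape[0]
    v = word_emb[u] @ W_v                         # f_v: predicate transform
    a = word_emb @ W_a                            # f_a: argument transform
    pair = np.concatenate([np.tile(v, (n, 1)), a], axis=1)
    h = np.maximum(0.0, np.maximum(0.0, pair @ W_1) @ W_2)   # f_va: two-layer ReLU
    z = h @ W_g                                   # g: linear scoring layer
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)       # softmax over labels per word
```

Each row of the returned matrix is a distribution over the label set for one word, which is what the constraint relaxations in the following sections operate on.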

Designing Constraints
Before looking at the specifics of individual constraints, let us first take a broad overview of our methodology. We will see concrete examples in the subsequent sections.
Output space constraints serve as prior domain knowledge for the SRL task. We will design our constraints as invariants at the training stage. To do so, we will first define constraints as statements in logic. Then we will systematically relax these Boolean statements into differentiable forms using concepts borrowed from the study of triangular norms (t-norms, Klement et al., 2013). Finally, we will treat these relaxations as regularizers in addition to the standard cross-entropy loss.
All the constraints we consider are conditional statements of the form

L(x) → R(x)

where the left- and right-hand sides, L(x) and R(x) respectively, can be either disjunctive or conjunctive expressions. The literals that constitute these expressions are associated with classification neurons, i.e., the predicted output probabilities are soft versions of these literals.
What we want is for model predictions to satisfy our constraints. To teach a model to do so, we transform conditional statements into regularizers, such that during training, the model receives a penalty if the rule is not satisfied for an example.² To soften logic, we use the conversions shown in Table 1, which combine the product and Gödel t-norms. We use this combination because it offers cleaner derivatives that make learning easier. A similar combination of t-norms was also used in prior work (Minervini and Riedel, 2018). Finally, we transform the derived losses into log space to be consistent with the cross-entropy loss. Li et al. (2019) outline this relationship between the cross-entropy loss and constraint-derived regularizers in more detail.
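To make the relaxation concrete, here is a minimal sketch of these conversions in code; the function names are our own, and the literal values stand in for predicted probabilities:

```python
import math

# Literals a, b are predicted probabilities in [0, 1].

def g_and(a, b):
    """Gödel t-norm: a AND b relaxes to min(a, b)."""
    return min(a, b)

def g_or(a, b):
    """Gödel t-conorm: a OR b relaxes to max(a, b)."""
    return max(a, b)

def neg(a):
    """Negation: NOT a relaxes to 1 - a."""
    return 1.0 - a

def implies_loss(l, r, eps=1e-12):
    """Product t-norm for the top-level L -> R, taken to log space:
    the penalty max(0, log l - log r) is zero whenever r >= l,
    i.e., whenever the softened implication holds."""
    return max(0.0, math.log(l + eps) - math.log(r + eps))
```

The implication penalty grows as the model becomes confident in the antecedent while doubting the consequent, which is exactly the violation pattern the regularizers below punish.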
Table 1: Converting logical operations to differentiable forms. For literals inside L(s) and R(s), we use the Gödel t-norm. For the top-level conditional statement, we use the product t-norm. Operations not used in this paper are marked with '-'.

Unique Core Roles (U )
Our first constraint captures the idea that, in a frame, there can be at most one core participant of a given type. Operationally, this means that for every predicate in an input sentence s, there can be no more than one occurrence of each core argument (i.e., A_core = {A0, A1, A2, A3, A4, A5}). In first-order logic, we have:

∀ u, i ∈ s, ∀ X ∈ A_core:  B_X(u, i) → ⋀_{j∈s, j≠i} ¬B_X(u, j)    (6)

which says, for a predicate u, if a model tags the i-th word as the beginning of a core argument span, then it should not predict that any other token is the beginning of the same label.
In the above rule, the literal B_X is associated with the predicted probability for the label B-X.³ This association is the cornerstone for deriving constraint-driven regularizers. Using the conversion in Table 1 and taking the natural log of the resulting expression, we can convert the implication in (6) into

l(u, i, X) = max(0, log B_X(u, i) − log min_{j≠i} (1 − B_X(u, j)))

Adding up the terms for all tokens and labels, we get the final regularizer:

L_U(s) = Σ_{u∈s} Σ_{i∈s} Σ_{X∈A_core} l(u, i, X)

Our constraint is universally applied to all words and predicates (i.e., i and u respectively) in the given sentence s. Whenever there is a pair of predicted labels for tokens i, j that violates Rule (6), our loss will yield a positive penalty.
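The regularizer above can be sketched numerically as follows. This is an illustrative reimplementation under our assumptions (Gödel t-norm inside the rule, log-space product t-norm for the implication, as in §2.2), not the authors' exact code, and it handles a single predicate and a single core label:

```python
import math

def unique_role_loss(b_probs, eps=1e-12):
    """Sketch of the unique-core-role penalty for one predicate and one
    core label X. `b_probs[i]` is the predicted probability that token i
    is tagged B-X. The right-hand side of Rule (6) relaxes to a min over
    negated literals; the implication becomes a log-space hinge."""
    loss = 0.0
    n = len(b_probs)
    for i in range(n):
        # Relaxation of "no other token j also starts the same argument".
        r = min(1.0 - b_probs[j] for j in range(n) if j != i)
        loss += max(0.0, math.log(b_probs[i] + eps) - math.log(r + eps))
    return loss
```

A single confident span start incurs no penalty, while two confident starts for the same core label yield a large one, pushing the model toward unique core arguments.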
Error Measurement ρ_u To measure the violation rate of this constraint, we will report the percentage of propositions that have duplicate core arguments. We will refer to this error rate as ρ_u.

Exclusively Overlapping Roles (O)
We adopt this constraint from Punyakanok et al. (2008) and related work. In any sentence, an argument for one predicate can either be contained in, or be entirely outside, an argument for any other predicate. We illustrate the intuition of this constraint in Table 2, assuming core argument spans are unique and tags are BIO-consistent.
Based on Table 2, we design a constraint that says: if an argument has boundary [i, j], then no other argument span can cross the boundary at j.
Table 2: Formalizing the exclusively overlapping role constraint in terms of the B and I literals. For every possible span [i, j] in a sentence, whenever it has a label X for some predicate (first row), token labels as in the subsequent rows are not allowed for any other predicate and any other argument Y. Note that this constraint does not affect the cells marked with a '-'.
This constraint applies to all argument labels in the task, denoted by the set A.
∀ u, i, j ∈ s such that j > i, and ∀ X ∈ A:

P(u, i, j, X) → ⋀_{v≠u} ⋀_{Y∈A} (¬Q₁(v, i, j, Y) ∧ ¬Q₂(v, i, j, Y))    (8)

Here, the term P(u, i, j, X) denotes the indicator for the argument span [i, j] having the label X for a predicate u, and corresponds to the first row of Table 2. The terms Q₁(v, i, j, Y) and Q₂(v, i, j, Y) correspond to prohibitions of the type described in the second and third rows respectively. As before, the literals B_X, etc. are relaxed as model probabilities to define the loss. By combining the Gödel and product t-norms, we translate Rule (8) into

L_O(s) = Σ_{u∈s} Σ_{j>i} Σ_{X∈A} l(u, i, j, X)

where

l(u, i, j, X) = max(0, log P(u, i, j, X) − log min_{v≠u, Y∈A} min(1 − Q₁(v, i, j, Y), 1 − Q₂(v, i, j, Y)))

Again, our constraint applies to all predicted probabilities. However, doing so requires scanning over 6 axes defined by (u, v, i, j, X, Y), which is computationally expensive. To get around this, we observe that, since we have a conditional statement, the higher the probability of P(u, i, j, X), the more likely it is to yield a non-zero penalty. These cases are precisely the ones we hope the constraint helps. Thus, for faster training and ease of implementation, we modify Equation 8 by squeezing the (i, j) dimensions using top-k, redefining L_O above as

L_O(s) = Σ_{u∈s} Σ_{X∈A} Σ_{(i,j)∈T} l(u, i, j, X)

where T denotes the set of the top-k span boundaries for predicate u and argument label X. This change results in a constraint defined by u, v, X, Y and the k elements of T.
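The top-k squeeze described above can be sketched as follows; the function name and the dictionary-based interface are illustrative assumptions, not the authors' implementation:

```python
def topk_spans(span_probs, k=4):
    """Illustrative sketch of the top-k squeeze: for one predicate u and
    one label X, keep only the k span boundaries (i, j) with the highest
    predicted probability P(u, i, j, X), and apply the pairwise overlap
    penalty only to those spans instead of to every (i, j) pair."""
    ranked = sorted(span_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [span for span, _ in ranked[:k]]
```

Since low-probability antecedents contribute little or no penalty under the hinge above, restricting the scan to the k most confident spans keeps most of the useful training signal at a fraction of the cost.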
Error Measurement ρ_o We will refer to the error of the overlap constraint as ρ_o, which counts the total number of non-exclusively overlapping pairs of arguments. In practice, we found that models rarely make such mistakes. In §3, we will see that using this constraint during training helps models generalize better with the other constraints. In §4, we will analyze the impact of the parameter k in the optimization described above.

Frame Core Roles (F )
The task of semantic role labeling is defined using the PropBank frame definitions. That is, for any predicate lemma of a given sense, PropBank defines which core arguments it can take and what they mean. The definitions allow for natural constraints that can teach models to avoid predicting core arguments outside of the predefined set.
∀ u ∈ s, ∀ k ∈ S(u):  Sense(u, k) → ⋀_{i∈s} ⋀_{X∈A_core∖R(u,k)} ¬B_X(u, i)

where S(u) denotes the set of senses for a predicate u, and R(u, k) denotes the set of acceptable core arguments when the predicate u has sense k.
As noted in §2.2, literals in the above statement can be associated with classification neurons. Thus Sense(u, k) corresponds to either a model prediction or the ground truth. Since our focus is to validate the approach of using relaxed constraints for SRL, we will use the latter.
This constraint can also be converted into a regularizer following the previous examples, giving us a loss term L_F(s).
Error Measurement ρ_f We will use ρ_f to denote the violation rate. It represents the percentage of propositions that have predicted core arguments outside the role sets of the PropBank frames.
Loss Our final loss is defined as:

L(s) = L_E(s) + λ_U L_U(s) + λ_O L_O(s) + λ_F L_F(s)    (12)

Here, L_E(s) is the standard cross-entropy loss over the BIO labels, and the λ's are hyperparameters.

Experiments & Results
In this section, we study the question: in what scenarios can we inform an end-to-end trained neural model with declarative knowledge? To this end, we experiment with the CoNLL-05 and CoNLL-12 datasets, using standard splits and the official evaluation script for measuring performance.
To empirically verify our framework in various data regimes, we consider scenarios ranging from ones where only limited training data is available to ones where large amounts of clean data are available.

Experiment Setup
Our baseline (described in §2.1) is based on RoBERTa. We used the pre-trained base version released by Wolf et al. (2019). Before the final linear layer, we added a dropout layer (Srivastava et al., 2014) with probability 0.5. To capture the sequential dependencies between labels, we added a standard CRF layer. At test time, Viterbi decoding with hard transition constraints was employed across all settings. In all experiments, we used the gold predicates and gold frame senses.
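For concreteness, Viterbi decoding with hard BIO transition constraints can be sketched as below. The label set and emission scores are illustrative, and the learned CRF transition weights are omitted for brevity; only the hard legality mask (an I-X tag may follow only B-X or I-X) is shown:

```python
import numpy as np

def bio_viterbi(emissions, labels):
    """Sketch of Viterbi decoding with hard BIO transition constraints.
    `emissions` holds per-token label log-scores; `labels` names each
    column, e.g. ["O", "B-A0", "I-A0"]."""
    n, num_labels = emissions.shape
    NEG = -1e9
    # allowed[a, b] iff label b may legally follow label a.
    allowed = np.ones((num_labels, num_labels), dtype=bool)
    for a in range(num_labels):
        for b in range(num_labels):
            if labels[b].startswith("I-"):
                x = labels[b][2:]
                allowed[a, b] = labels[a] in ("B-" + x, "I-" + x)
    score = emissions[0].copy()
    # A sequence may not start inside a span.
    score[[i for i, lab in enumerate(labels) if lab.startswith("I-")]] = NEG
    back = np.zeros((n, num_labels), dtype=int)
    for t in range(1, n):
        # trans[a, b]: score of being in a at t-1 and moving to b, if legal.
        trans = np.where(allowed, score[:, None], NEG)
        back[t] = trans.argmax(axis=0)
        score = trans.max(axis=0) + emissions[t]
    # Backtrace the best path.
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [labels[i] for i in reversed(path)]
```

This guarantees BIO-consistent output sequences even when the per-token argmax would pick an illegal tag such as a span-initial I-X.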
Model training proceeded in two stages: 1. We finetune the pre-trained RoBERTa model on SRL with only the cross-entropy loss for 30 epochs with learning rate 3 × 10⁻⁵. 2. Then we continue finetuning with the combined loss in Equation 12 for another 5 epochs with a lowered learning rate of 1 × 10⁻⁵.
During both stages, learning rates were warmed up linearly for the first 10% updates.
For a fair comparison, we finetuned our baseline twice (as with the constrained models); we found that it consistently outperformed the singly finetuned baseline in terms of both error rates and role F1. We grid-searched the λ's by incrementally adding regularizers. The combination of λ's with a good balance between F1 and the error rates ρ on the dev set was selected for testing. We refer readers to the appendix for the values of the λ's.
For models trained on the CoNLL-05 data, we report performance on the dev set, and the WSJ and Brown test sets. For CoNLL-12 models, we report performance on the dev and test splits.

Scenario 1: Low Training Data
Creating SRL datasets requires expert annotation, which is expensive.
In this paper, we study the scenario where we have small amounts of fully labeled training data. We sample 3% of the training data and an equivalent amount of development examples. The same training/dev subsets are used across all models.
Table 3 reports the performance of using 3% of the training data from CoNLL-05 and CoNLL-12 (top and bottom respectively). We compare our strong baseline model with structure-tuned models using all three constraints. Note that for all these evaluations, while we use subsamples of the dev set for model selection, the evaluations are reported on the full dev and test sets.
We see that training with constraints greatly improves precision with low training data, while recall reduces. This trade-off is accompanied by a reduction in the violation rates ρ_u and ρ_f. As noted in §2.4, models rarely predict label sequences that violate the exclusively overlapping roles constraint. As a result, the error rate ρ_o (the number of violations) only slightly fluctuates.

Table 4 reports the performance of models trained with our framework using the full training set of the CoNLL-05 dataset, which consists of 35k sentences with 91k propositions. Again, we compare RoBERTa (twice finetuned) with our structure-tuned models. We see that the constrained models consistently outperform baselines on the dev, WSJ, and Brown sets. With all three constraints, the constrained model reaches 88 F1 on the WSJ. It also generalizes well to a new domain, outperforming the baseline by 1.2 points on the Brown test set.

Scenario 2: Large Training Data
As in the low training data experiments, we observe improved precision due to the constraints. This suggests that even with large training data, direct label supervision might not be enough for neural models to pick up the rich output space structure. Our framework helps neural networks, even ones as strong as RoBERTa, to make more correct predictions from differentiable constraints.
Surprisingly, the development ground truth has a 2.34% error rate on the frame role constraint, and 0.40% on the unique role constraint.Similar percentages of unique role errors also appear in WSJ and Brown test sets.For ρ o , the oracle has no violations on the CoNLL-05 dataset.
The exclusively overlapping constraint (i.e., ρ_o) is omitted, as we found models rarely make such prediction errors. After adding constraints, the error rate of our model approached the lower bound. Note that our framework focuses on the learning stage, without any specialized decoding algorithms in the prediction phase except the Viterbi algorithm, which guarantees that there are no BIO violations.

What about even larger and cleaner data?
The ideal scenario, of course, is when we have the luxury of massive and clean data to power neural network training. In Table 5, we present results on CoNLL-12, which is about 3 times as large as CoNLL-05. It consists of 90k sentences and 253k propositions. The dataset is also less noisy with respect to the constraints. For instance, the oracle development set has no violations of either the unique core or the exclusively overlapping constraints.
We see that, while adding constraints reduced the error rates ρ_u and ρ_f, the improvements in label consistency do not affect F1 much. As a result, our best constrained model performs on par with the baseline on the dev set, and is slightly better than the baseline (by 0.1) on the test set. Thus we believe that when we have the luxury of data, learning with constraints becomes optional. This observation is in line with recent results in Li and Srikumar (2019) and Li et al. (2019).
But is it due to the large data or the strong baseline? To investigate whether the seemingly saturated performance is from the data or from the model, we also evaluate our framework on the original BERT (Devlin et al., 2019), which is relatively less powerful. We follow the same model setup for these experiments and report the performances in Table 5 and Table 9. We see that, compared to RoBERTa, BERT obtains similar F1 gains on the test set, suggesting that the performance ceiling is due to the training set size.

Ablations & Analysis
In §3, we saw that constraints not only improve model performance, but also make outputs more structurally consistent. In this section, we will show the results of an ablation study that adds one constraint at a time. Then, we will examine the sources of the improved F-score by looking at individual labels, and also the effect of the top-k relaxation for the constraint O. Furthermore, we will examine the robustness of our method to the randomness involved during training. We will end this section with a discussion of the ability of constrained neural models to handle structured outputs.

Constraint Ablations
We present the ablation analysis of our constraints in Table 6. We see that as models become more constrained, precision improves. Furthermore, one class of constraints does not necessarily reduce the violation rate for the others. Combining all three constraints offers a balance between precision, recall, and constraint violation.
One interesting observation is that adding the O constraint improves F-scores even though the ρ_o values were already close to zero. As noted in §2.4, our constraints apply to the predicted scores of all labels for a given argument, while the actual decoded label sequence is just the highest scoring sequence under the Viterbi algorithm. Seen this way, our regularizers increase the decision margins on affected labels. As a result, the model predicts scores that help Viterbi decoding, and also generalizes better to new domains, i.e., the Brown set.

Sources of Improvement Table 7 shows label-wise F1 scores for each argument. Under low training data conditions, our constrained models gained improvements primarily on the frequent labels, e.g., A0-A2. On the CoNLL-05 dataset, we found that the location modifier (AM-LOC) posed challenges to our constrained models, which performed significantly worse than the baseline. Another challenge is the negation modifier (AM-NEG), where our models underperformed on both datasets, particularly with small training data. When using the CoNLL-12 training set, our models performed on par with the baseline even on frequent labels, confirming that the performance of soft-structured learning is nearly saturated on the larger, cleaner dataset.

Robustness to Random Initialization We observed that model performance with structured tuning is generally robust to random initialization. As an illustration, we show the performance of models trained on the full CoNLL-12 dataset with different random initializations in Table 9.

Impact of Top-k Beam Size
Can Constrained Networks Handle Structured Prediction? Larger, cleaner data may presumably be better for training constrained neural models. But it is not that simple. We will approach the above question by looking at how good transformer models are at dealing with two classes of constraints, namely: 1) structural constraints that rely only on available decisions (constraint U), and 2) constraints involving external knowledge (constraint F).
For the former, we expected neural models to perform very well, since the constraint U represents a simple local pattern. From Tables 4 and 5, we see that the constrained models indeed reduced the violations ρ_u substantially. However, when the training data is limited, i.e., comparing CoNLL-05 3% and 100%, the constrained models, while reducing the number of errors, still make many invalid predictions. We conjecture this is because networks learn with constraints mostly by memorization; thus the ability to generalize learned patterns to unseen examples relies on training size. The constraint F requires external knowledge from the PropBank frames. We see that even with large training data, constrained models were only able to reduce the error rate ρ_f by a small margin. In our development experiments, having a larger λ_F tends to strongly sacrifice argument F1, yet still does not improve the development error rate substantially. Without additional training signal in the form of such background knowledge, constrained inference becomes a necessity, even with strong neural network models.

Semantic Role Labeling & Constraints
The SRL task is inherently knowledge rich; the outputs are defined in terms of an external ontology of frames. The work presented here can be generalized to several different flavors of the task, and indeed, constraints could be used to model the interplay between them. For example, we could revisit the analysis of Yi et al. (2007), who showed that the PropBank A2 label takes on multiple meanings, but that by mapping them to VerbNet, they can be disambiguated. Such mappings naturally define constraints that link semantic ontologies.
Constraints have long been a cornerstone of SRL models. Several early linear models for SRL (e.g., Punyakanok et al., 2004, 2008; Surdeanu et al., 2007) modeled inference for PropBank SRL using integer linear programming. Riedel and Meza-Ruiz (2008) used Markov Logic Networks to learn and predict semantic roles with declarative constraints. Täckström et al. (2015) showed that certain SRL constraints admit efficient decoding, leading to a neural model that used this framework (FitzGerald et al., 2015). Learning with constraints has also been widely adopted in semi-supervised SRL (e.g., Fürstenau and Lapata, 2012).
With the increasing influence of neural networks in NLP, however, the role of declarative constraints seems to have decreased in favor of fully end-to-end training (e.g., He et al., 2017b; Strubell et al., 2018, and others). In this paper, we show that even in the world of neural networks with contextual embeddings, there is still room for systematically introducing knowledge in the form of constraints, without sacrificing the benefits of end-to-end learning. Chang et al. (2012) and Ganchev et al. (2010) developed models for structured learning with declarative constraints. Our work is in the same spirit of training models that attempt to maintain output consistency.

Structured Losses
There is a line of recent work on designing models and loss functions by relaxing Boolean formulas. Kimmig et al. (2012) used the Łukasiewicz t-norm for probabilistic soft logic. Li and Srikumar (2019) augment the neural network architecture itself using such soft logic. Xu et al. (2018) present a general framework for loss design that does not rely on soft logic. Introducing extra regularization terms to a downstream task has been shown to be beneficial in terms of both output structure consistency and prediction accuracy (e.g., Minervini and Riedel, 2018; Hsu et al., 2018; Mehta et al., 2018; Du et al., 2019; Li et al., 2019).
Final words In this work, we have presented a framework that seeks to predict structurally consistent outputs without extensive model redesign, or any expensive decoding at prediction time. Our experiments on the semantic role labeling task show that such an approach can be especially helpful in scenarios where we do not have the luxury of massive annotated datasets.

Table 3: Results on low training data (3% of CoNLL-05 and CoNLL-12). RoBERTa²: baseline finetuned twice. U: Unique core roles. F: Frame core roles. O: Exclusively overlapping roles. δF1: improvement over the baseline. ρ_f is marked NA for the CoNLL-05 test results because ground truth senses are unavailable on the CoNLL-05 shared task page.

Table 4: Results on the full CoNLL-05 data. Oracle: errors of the oracle. ρ_o is in [0, 6] across all settings.

Table 5: Results on CoNLL-12. BERT²: the original BERT finetuned twice. ρ_o is around 50 across all settings. With the luxury of large and clean data, constrained learning becomes less effective.

Table 7: Label-wise F1 scores for the CoNLL-05 and CoNLL-12 development sets.

Top-k Beam Size As noted in §2.4, we used the top-k strategy to implement the constraint O. As a result, there is a certain chance for predicted label sequences to have non-exclusive overlap without our regularizer penalizing them. What we want instead is a good balance between coverage and runtime cost. To this end, we analyze the CoNLL-12 development set using the baseline trained on 3% of the CoNLL-12 data. Specifically, we count the examples which have such overlap but whose regularization loss is ≤ 0.001. In Table 8, we see that k = 4 yields good coverage.

Table 8: Impact of k for the top-k strategy, showing the number of missed examples for different k. We set k = 4 across all experiments.