Jointly Predicting Predicates and Arguments in Neural Semantic Role Labeling

Recent BIO-tagging-based neural semantic role labeling models are very high performing, but assume gold predicates as part of the input and cannot incorporate span-level features. We propose an end-to-end approach for jointly predicting all predicates, argument spans, and the relations between them. The model makes independent decisions about what relationship, if any, holds between every possible word-span pair, and learns contextualized span representations that provide rich, shared input features for each decision. Experiments demonstrate that this approach sets a new state of the art on PropBank SRL without gold predicates.


Introduction
Semantic role labeling (SRL) captures predicate-argument relations, such as "who did what to whom." Recent high-performing SRL models (Marcheggiani et al., 2017; Tan et al., 2018) are BIO-taggers, labeling argument spans for a single predicate at a time (as shown in Figure 1). They are typically only evaluated with gold predicates, and must be pipelined with error-prone predicate identification models for deployment.
We propose an end-to-end approach for predicting all the predicates and their argument spans in one forward pass. Our model builds on a recent coreference resolution model by making central use of learned, contextualized span representations. We use these representations to predict SRL graphs directly over text spans. Each edge is identified by independently predicting which role, if any, holds between every possible pair of text spans, while using aggressive beam pruning for efficiency. The final graph is simply the union of predicted SRL roles (edges) and their associated text spans (nodes). Our span-graph formulation overcomes a key limitation of semi-Markov and BIO-based models (Kong et al., 2016; Zhou and Xu, 2015; Yang and Mitchell, 2017; Tan et al., 2018): it can model overlapping spans across different predicates in the same output structure (see Figure 1). The span representations also generalize the token-level representations in BIO-based models, letting the model dynamically decide which spans and roles to include, without using previously standard syntactic features (Punyakanok et al., 2008; FitzGerald et al., 2015).
Code and models: https://github.com/luheng/lsgn
To the best of our knowledge, this is the first span-based SRL model that does not assume that predicates are given. In this more realistic setting, where the predicate must be predicted, our model achieves state-of-the-art performance on PropBank. It also reinforces the strong performance of similar span embedding methods for coreference, suggesting that this style of model could be used for other span-span relation tasks, such as syntactic parsing (Stern et al., 2017), relation extraction (Miwa and Bansal, 2016), and QA-SRL (FitzGerald et al., 2018).

Model
We consider the space of possible predicates to be all the tokens in the input sentence, and the space of arguments to be all continuous spans. Our model decides what relation exists between each predicate-argument pair (including no relation).
Formally, given a sequence $X = w_1, \ldots, w_n$, we wish to predict a set of labeled predicate-argument relations $Y \subseteq P \times A \times L$, where $P = \{w_1, \ldots, w_n\}$ is the set of all tokens (predicates), $A = \{(w_i, \ldots, w_j) \mid 1 \le i \le j \le n\}$ contains all the spans (arguments), and $L$ is the space of semantic role labels, including a null label $\epsilon$ indicating no relation. The final SRL output would be all the non-empty relations $\{(p, a, l) \in Y \mid l \ne \epsilon\}$.
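The size of this candidate space can be made concrete with a small sketch. The function name `enumerate_spans` and the example sentence are illustrative assumptions, not part of the released code; positions are 1-indexed and inclusive, following the definition of A above.

```python
def enumerate_spans(n):
    """Return all contiguous spans (i, j) with 1 <= i <= j <= n."""
    return [(i, j) for i in range(1, n + 1) for j in range(i, n + 1)]


# Hypothetical example sentence.
tokens = ["She", "ate", "an", "apple"]
n = len(tokens)

predicates = list(range(1, n + 1))  # every token is a candidate predicate
arguments = enumerate_spans(n)      # every contiguous span is a candidate argument

# One labeling decision is made per (predicate, argument) pair.
pairs = [(p, a) for p in predicates for a in arguments]
print(len(predicates), len(arguments), len(pairs))
```

For a sentence of n tokens there are n(n+1)/2 spans, so the number of pairs grows cubically in n, which is what motivates pruning.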
We then define a set of random variables, where each random variable $y_{p,a}$ corresponds to a predicate $p \in P$ and an argument $a \in A$, taking a value from the discrete label space $L$. The random variables $y_{p,a}$ are conditionally independent of each other given the input $X$:
\[ P(Y \mid X) = \prod_{p \in P,\, a \in A} P(y_{p,a} \mid X) \]
\[ P(y_{p,a} = l \mid X) = \frac{\exp(\phi(p, a, l))}{\sum_{l' \in L} \exp(\phi(p, a, l'))} \]
where $\phi(p, a, l)$ is a scoring function for a possible (predicate, argument, label) combination. $\phi$ is decomposed into two unary scores on the predicate and the argument (defined in Section 3), as well as a label-specific score for the relation:
\[ \phi(p, a, l) = \Phi_a(a) + \Phi_p(p) + \Phi^{(l)}_{\mathrm{rel}}(a, p) \]
The score for the null label is set to a constant: $\phi(p, a, \epsilon) = 0$, similar to logistic regression.
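A minimal sketch of the per-pair label distribution, assuming it is a softmax over the label scores with the null label pinned to a score of 0 and every other label scored as the sum of the two unary scores and a label-specific relation score. The function name, the `NULL` placeholder, and the toy scores are hypothetical.

```python
import math

NULL = "NULL"  # placeholder name for the null (no relation) label


def label_distribution(unary_pred, unary_arg, rel_scores):
    """Softmax over labels for a single (predicate, argument) pair.

    rel_scores maps each non-null label to its label-specific relation score.
    The null label always has total score 0.
    """
    scores = {NULL: 0.0}
    for label, rel in rel_scores.items():
        scores[label] = unary_pred + unary_arg + rel
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Toy scores: the pair is predicted as ARG0 because its score exceeds 0.
probs = label_distribution(1.0, 0.5, {"ARG0": 2.0, "ARG1": -1.0})
print(max(probs, key=probs.get), probs)
```

Because the null score is fixed at 0, a pair is assigned a role only when some non-null label has a positive total score, which is the sense in which the setup resembles logistic regression.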
Learning. For each input $X$, we minimize the negative log likelihood of the gold structure $Y^*$:
\[ J(X) = -\log P(Y^* \mid X) = \sum_{p \in P,\, a \in A} -\log P(y_{p,a} = y^*_{p,a} \mid X) \]
Beam pruning. As our model deals with $O(n^2)$ possible argument spans and $O(n)$ possible predicates, it needs to consider $O(n^3 |L|)$ possible relations, which is computationally impractical. To overcome this issue, we define two beams $B_a$ and $B_p$ for storing the candidate arguments and predicates, respectively. The candidates in each beam are ranked by their unary score ($\Phi_a$ or $\Phi_p$). The sizes of the beams are limited by $\lambda_a n$ and $\lambda_p n$. Elements that fall out of the beam do not participate in computing the edge factors $\Phi^{(l)}_{\mathrm{rel}}$, reducing the overall number of relational factors evaluated by the model to $O(n^2 |L|)$. We also limit the maximum width of spans to a fixed number $W$ (e.g. $W = 30$), further reducing the number of computed unary factors to $O(n)$.
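The pruning step can be sketched as follows. This is an illustration under stated assumptions, not the released implementation: the beam size is taken as the floor of the ratio times n (with a minimum of one), and the two scoring lambdas are toy stand-ins for the learned unary scores.

```python
def spans_up_to_width(n, max_width):
    """Contiguous spans (i, j), 1-indexed and inclusive, of width at most max_width."""
    return [(i, j)
            for i in range(1, n + 1)
            for j in range(i, min(n, i + max_width - 1) + 1)]


def prune(candidates, score, ratio, n):
    """Keep the top int(ratio * n) candidates by unary score (assumed rounding)."""
    k = max(1, int(ratio * n))
    return sorted(candidates, key=score, reverse=True)[:k]


n, max_width = 5, 3
lambda_a, lambda_p = 0.8, 0.4  # toy beam ratios

spans = spans_up_to_width(n, max_width)
# Toy unary scores: prefer short spans, and prefer tokens near position 2.
arg_beam = prune(spans, lambda a: -(a[1] - a[0]), lambda_a, n)
pred_beam = prune(range(1, n + 1), lambda p: -abs(p - 2), lambda_p, n)

# Relation scores are computed only for pairs that survive both beams.
edges = [(p, a) for p in pred_beam for a in arg_beam]
print(len(spans), len(arg_beam), len(pred_beam), len(edges))
```

Both beams hold a number of candidates linear in n, so the number of surviving pairs is quadratic in n rather than cubic.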

Neural Architecture
Our model builds contextualized representations for argument spans $a$ and predicate words $p$ based on BiLSTM outputs (Figure 2) and uses feed-forward networks to compute the factor scores in $\phi(p, a, l)$ described in Section 2 (Figure 3).

Word-level contexts
The bottom layer consists of pre-trained word embeddings concatenated with character-based representations, i.e. for each token $w_i$, we have $x_i = [\mathrm{WORDEMB}(w_i); \mathrm{CHARCNN}(w_i)]$. We then contextualize each $x_i$ using an $m$-layered bidirectional LSTM with highway connections (Zhang et al., 2016), which we denote as $\bar{x}_i$.
Argument and predicate representation. We build contextualized representations for all candidate arguments $a \in A$ and predicates $p \in P$. The argument representation contains the following: end points from the BiLSTM outputs ($\bar{x}_{\mathrm{START}(a)}, \bar{x}_{\mathrm{END}(a)}$), a soft head word $x_h(a)$, and embedded span width features $f(a)$. The predicate representation is simply the BiLSTM output at the position $\mathrm{INDEX}(p)$:
\[ g_a = [\bar{x}_{\mathrm{START}(a)}; \bar{x}_{\mathrm{END}(a)}; x_h(a); f(a)] \]
\[ g_p = \bar{x}_{\mathrm{INDEX}(p)} \]
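The soft head representation $x_h(a)$ is an attention mechanism over word inputs $x$ in the argument span, where the weights $e(a)$ are computed via a linear layer over the BiLSTM outputs $\bar{x}$:
\[ e(a) = \mathrm{SOFTMAX}(w_e^\top \bar{x}_{\mathrm{START}(a):\mathrm{END}(a)}) \quad (8) \]
\[ x_h(a) = x_{\mathrm{START}(a):\mathrm{END}(a)} \, e(a)^\top \]
Here $x_{\mathrm{START}(a):\mathrm{END}(a)}$ is a shorthand for stacking a list of vectors $x_t$, where $\mathrm{START}(a) \le t \le \mathrm{END}(a)$.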
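The soft head computation can be illustrated with plain Python lists. This is a sketch assuming the attention logits are a dot product between a weight vector and each BiLSTM output in the span, and that the head is the resulting weighted sum of the word inputs; the function name, the 0-indexed positions, and all numeric values are hypothetical.

```python
import math


def soft_head(word_inputs, contextual, w_e, start, end):
    """Attention-weighted sum of word inputs over the span [start, end].

    word_inputs: per-token input vectors x_t
    contextual:  per-token BiLSTM output vectors (used only for the weights)
    w_e:         weight vector of the linear scoring layer
    start, end:  0-indexed, inclusive span boundaries
    """
    logits = [sum(w * h for w, h in zip(w_e, contextual[t]))
              for t in range(start, end + 1)]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(word_inputs[0])
    head = [sum(weights[k] * word_inputs[start + k][d] for k in range(len(weights)))
            for d in range(dim)]
    return head, weights


# Toy three-token sentence with two-dimensional vectors.
word_inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = [[0.1, 0.3], [2.0, -1.0], [0.5, 0.5]]
head, weights = soft_head(word_inputs, contextual, [1.0, 0.5], 0, 2)
print(weights, head)
```

The weights come from the contextualized outputs while the summed vectors are the uncontextualized word inputs, matching the description above.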
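Scoring. The scoring functions $\Phi$ are implemented with feed-forward networks based on the predicate and argument representations $g$:
\[ \Phi_a(a) = w_a^\top \mathrm{MLP}_a(g_a) \]
\[ \Phi_p(p) = w_p^\top \mathrm{MLP}_p(g_p) \]
\[ \Phi^{(l)}_{\mathrm{rel}}(a, p) = w_r^{(l)\top} \mathrm{MLP}_r([g_a; g_p]) \]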

Experiments
We experiment on the CoNLL 2005 (Carreras and Màrquez, 2005) and CoNLL 2012 (OntoNotes 5.0; Pradhan et al., 2013) benchmarks, using two SRL setups: end-to-end and gold predicates. In the end-to-end setup, a system takes a tokenized sentence as input, and predicts all the predicates and their arguments. Systems are evaluated on the micro-averaged F1 for correctly predicting (predicate, argument span, label) tuples. For comparison with previous systems, we also report results with gold predicates, in which the complete set of predicates in the input sentence is given as well.
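The tuple-level metric can be sketched as below. This is a simplified illustration that counts exact matches of (predicate, argument span, label) tuples pooled over sentences; it is not the official CoNLL scoring script, which may handle details that this sketch ignores. The function name and the example tuples are hypothetical.

```python
def micro_f1(gold, predicted):
    """Micro-averaged precision, recall and F1 over exact tuple matches.

    gold, predicted: lists (one entry per sentence) of sets of
    (predicate_index, (span_start, span_end), label) tuples.
    """
    correct = sum(len(g & p) for g, p in zip(gold, predicted))
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# One toy sentence: one argument is exactly right, the other has a wrong span.
gold = [{(2, (1, 1), "ARG0"), (2, (3, 4), "ARG1")}]
pred = [{(2, (1, 1), "ARG0"), (2, (4, 4), "ARG1")}]
print(micro_f1(gold, pred))
```

In the end-to-end setup a tuple with a wrongly predicted predicate counts against precision, and a missed predicate loses all of its arguments in recall.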
Other experimental setups and hyperparameters are listed in Appendix A.1.
ELMo embeddings. To further improve performance, we also add ELMo word representations (Peters et al., 2018) to the BiLSTM input (in the +ELMo rows). Since the contextualized representations ELMo provides can be applied to most previous neural systems, the improvement is orthogonal to our contribution. In Tables 1 and 2, we organize all the results into two categories: the comparable single model systems, and the models augmented with ELMo or ensembling (in the PoE rows).
End-to-end results. As shown in Table 1, our joint model outperforms the previous best pipeline system by an F1 difference of anywhere between 1.3 and 6.0 in every setting. The improvement is larger on the Brown test set, which is out-of-domain, and the CoNLL 2012 test set, which contains nominal predicates. On all datasets, our model is able to predict over 40% of the sentences completely correctly.
Results with gold predicates. To compare with additional previous systems, we also conduct experiments with gold predicates by constraining our predicate beam to be gold predicates only. As shown in Table 2

Analysis
Our model's architecture differs significantly from previous BIO systems in terms of both input and decision space. To better understand our model's strengths and weaknesses, we perform analyses following previous work, studying the effectiveness of beam pruning, long-distance dependencies, agreement with syntax, and global consistency.
Effectiveness of beam pruning. Figure 4 shows the predicate and argument spans kept in the beam, sorted by their unary scores. Our model efficiently prunes unlikely argument spans and predicates, significantly reducing the number of edges it needs to consider. Figure 5 shows the recall of predicate words on the CoNLL 2012 development set. By retaining $\lambda_p = 0.4$ predicates per word, we are able to keep over 99.7% of argument-bearing predicates. Compared to having a part-of-speech tagger (POS:X in Figure 5), our joint beam pruning allows the model to have a soft trade-off between efficiency and recall.
Long-distance dependencies. Figure 6 shows the performance breakdown by binned distance between arguments and the given predicates. Our model is better at accurately predicting arguments that are farther away from the predicates, even compared to an ensemble model that has a higher overall F1. This is very likely due to architectural differences; in a BIO tagger, predicate information passes through many LSTM timesteps before reaching a long-distance argument, whereas our architecture enables direct connections between all predicate-argument pairs.
Agreement with syntax. As previously reported, a BIO-based SRL system has good agreement with gold syntactic span boundaries (94.3%) but falls short of previous syntax-based systems (Punyakanok et al., 2004). By directly modeling span information, our model achieves comparable syntactic agreement (95.0%) to Punyakanok et al. (2004).
Figure 5: Recall of gold argument-bearing predicates on the CoNLL 2012 development data as we increase the number of predicates kept per word. POS:X shows the gold predicate recall from using certain POS tags identified by the NLTK part-of-speech tagger (Bird, 2006).
[...] violations of global structural constraints compared to previous systems. Punyakanok et al. (2008) described a list of global constraints for SRL systems, e.g., there can be at most one core argument of each type for each predicate. Our model makes more constraint violations than previous systems. For example, our model predicts duplicate core arguments (arguments with labels ARG0, ARG1, ..., ARG5 and AA; shown in the U column in Table 3) more often than previous work. This is because our model uses independent classifiers to label each predicate-argument pair, making it difficult for them to implicitly track the decisions made for several arguments with the same predicate. The Ours+decode row in Table 3 shows SRL performance after enforcing the U-constraint using dynamic programming at decoding time. Constrained decoding at test time is effective at eliminating all the core-role inconsistencies (shown in the U column), but does not bring significant gain on the end result (shown in SRL F1), which only evaluates the piece-wise predicate-argument structures.
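To make the unique-core-role constraint concrete, the sketch below counts duplicate core arguments in a set of predicted tuples. It is only an illustration: the exact counting convention behind the U column is not specified here, so counting each extra occurrence of a (predicate, core label) pair as one violation is an assumption, as are the function name and the example tuples. It also does not implement the dynamic-programming decoder.

```python
from collections import Counter

# Core roles as listed in the text: ARG0 through ARG5, and AA.
CORE_ROLES = {"ARG0", "ARG1", "ARG2", "ARG3", "ARG4", "ARG5", "AA"}


def count_duplicate_core_roles(tuples):
    """Count extra occurrences of the same core label for the same predicate.

    tuples: iterable of (predicate_index, (span_start, span_end), label).
    """
    counts = Counter((pred, label)
                     for pred, _span, label in tuples
                     if label in CORE_ROLES)
    return sum(c - 1 for c in counts.values() if c > 1)


# Predicate 2 receives ARG0 twice; repeated modifiers are not core roles.
predicted = [
    (2, (1, 1), "ARG0"),
    (2, (3, 4), "ARG0"),
    (2, (5, 5), "ARGM-TMP"),
    (2, (6, 6), "ARGM-TMP"),
    (7, (1, 1), "ARG0"),
]
print(count_duplicate_core_roles(predicted))
```

Since each pair is labeled independently, nothing in the model prevents the first two tuples from co-occurring, which is why a decoding-time constraint is needed to remove them.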

Conclusion and Future Work
We proposed a new SRL model that is able to jointly predict all predicates and argument spans, generalized from a recent coreference system. Compared to previous BIO systems, our new model supports joint predicate identification and is able to incorporate span-level features. Empirically, the model does better at long-range dependencies and agreement with syntactic boundaries, but is weaker at global consistency, due to our strong independence assumption.
In the future, we could incorporate higher-order inference methods to relax this assumption. It would also be interesting to combine our span-based architecture with self-attention layers (Tan et al., 2018; Strubell et al., 2018) for more effective contextualization.