Semantic Role Labeling with Iterative Structure Refinement

Modern state-of-the-art Semantic Role Labeling (SRL) methods rely on expressive sentence encoders (e.g., multi-layer LSTMs) but tend to model only local (if any) interactions between individual argument labeling decisions. This contrasts with earlier work and also with the intuition that the labels of individual arguments are strongly interdependent. We model interactions between argument labeling decisions through iterative refinement. Starting with an output produced by a factorized model, we iteratively refine it using a refinement network. Instead of modeling arbitrary interactions among roles and words, we encode prior knowledge about the SRL problem by designing a restricted network architecture capturing non-local interactions. This modeling choice prevents overfitting and results in an effective model, outperforming strong factorized baseline models on all 7 CoNLL-2009 languages, and achieving state-of-the-art results on 5 of them, including English.


Introduction
Semantic role labeling (SRL), originally introduced by Gildea and Jurafsky (2000), involves the prediction of predicate-argument structure, i.e., identification of arguments and their assignment to underlying semantic roles. Semantic-role representations have been shown to be beneficial in many NLP applications, including question answering (Shen and Lapata, 2007), information extraction (Christensen et al., 2011) and machine translation (Marcheggiani et al., 2018). In this work, we focus on dependency-based SRL (Hajič et al., 2009), a popular version of the task which involves identifying syntactic heads of arguments rather than marking entire argument spans (see the graph in red in Figure 1). Edges in the dependency graphs are annotated with semantic roles (e.g., A0:PLEASER), and the predicates are labeled with their senses from a given sense inventory (e.g., SATISFY.01 in the example). Before the rise of deep learning methods, the most accurate SRL methods relied on modeling high-order interactions in the output space (e.g., between arguments, or between arguments and predicates) (Watanabe et al., 2010; Toutanova et al., 2008). Earlier neural methods could model such output interactions through a transition system and achieved competitive performance (Henderson et al., 2013). However, current state-of-the-art SRL systems use powerful sentence encoders (e.g., layers of LSTMs (He et al., 2017) or multi-head self-attention (Strubell et al., 2018)) and factorize over small fragments of the predicted structures. Specifically, most modern models process individual arguments and perform predicate disambiguation independently. The trend towards more factorizable models is not unique to dependency-based SRL but common to most structured prediction tasks in NLP (Kiperwasser and Goldberg, 2016; Dozat and Manning, 2017, 2018). The only major exception is language generation tasks, especially machine translation and language modeling, where larger amounts of text are typically used in training.
Powerful encoders, in principle, can capture long-distance dependencies and hence alleviate the need for modeling high-order interactions in the output. However, capturing these interactions in the encoder would require substantial amounts of data. Even if we have domain knowledge about likely interactions between components of the predicted graphs, it is hard to inject this knowledge in an encoder.
Consider the example in Figure 1. The argument 'state' appears in the highly ambiguous syntactic position '[.] to satisfy'. All three core semantic roles of the predicate can in principle appear here: patient (A1:ENTITY FULFILLED, as in 'a sweet tooth to satisfy'), instrument (A2:METHOD, as in 'a little dessert to satisfy your sweet tooth') and agent (A0:PLEASER, as in our actual example). The basic factorized model gets it wrong, assigning A1 to the argument 'state'. However, by taking other arguments into account, the model can correct the label. The configuration 'A1 to satisfy' is more likely when an agent (A0) is present in the sentence; the absence of an agent boosts the score for the correct configuration 'A0 to satisfy'.
Our iterative refinement approach encodes the above intuition. In iterative refinement, a refinement network repeatedly takes the previous output as input and produces a refined version. Formally, we have y^{t+1} = Refine(x, y^t).
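As a minimal sketch of the recurrence y^{t+1} = Refine(x, y^t) (not the paper's implementation), the loop below applies a toy refinement step to per-word role distributions; the step itself, the dimensions, and the random inputs are all illustrative assumptions:

```python
import numpy as np

def iterative_refinement(x, y0, refine, T=2):
    """Apply y_{t+1} = refine(x, y_t) for T iterations; return all iterates."""
    ys = [y0]
    for _ in range(T):
        ys.append(refine(x, ys[-1]))
    return ys

def toy_refine(x, y):
    # Hypothetical refinement step: mix the previous prediction with
    # input-conditioned scores, then renormalize each row to a distribution.
    z = np.log(y + 1e-9) + 0.1 * x
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 5))                # per-word scores (toy stand-in)
y0 = rng.dirichlet(np.ones(5), size=6)     # base model's role distributions
ys = iterative_refinement(x, y0, toy_refine, T=2)
```

In the paper's setting, y0 comes from the factorized base model and T = 2.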
Naturally, such a refinement strategy also requires an initial prediction y^0, which is produced by a ('base') factorized model. Refinement strategies have been successful in machine translation (Novak et al., 2017; Xia et al., 2017; Hassan et al., 2018), but their effectiveness in other NLP tasks is yet to be demonstrated (see extra discussion and related work in Section 2). We conjecture that this discrepancy is due to differences in data availability. Given the larger amounts of training data typically used in machine translation, their base models and refinement networks overfit to a lesser extent. Overfitting in (1) the base model and (2) the refinement network are both problematic. The first implies that either the base model makes no mistakes on the training set or the distribution of its mistakes is very different from that in the test regime, so the training material for the inference networks ends up being misleading. The second simply means that refinement will fail at test time. We address both issues by designing restricted inference networks and adding a specific form of noise when training them.
Our structured refinement network is simple but encodes non-local dependencies. Specifically, it takes into account information about the role distributions from the previous iteration, aggregated over the entire sentence, but not information about what the other arguments are. This is a coarse, compressed representation of the prediction, yet it conveys long-distance information not readily available to the factorized base model. While this is not the only possible design, we believe that the empirical gains from using this simple refinement network demonstrate the viability of our general framework of iterative refinement with restricted inference networks. They also suggest that the intuitions underlying declarative constraints used in early work on SRL (Punyakanok et al., 2008; Das et al., 2012) can be revisited, now encoded in a flexible, soft way to provide inductive biases for the refinement networks. We leave this for future work.
We consider the CoNLL-2009 dataset (Hajič et al., 2009). We start with a strong factorized baseline model, which already achieves state-of-the-art results on a subset of the languages. Then, using our structured refinement network, we improve on this baseline on all 7 CoNLL-2009 languages. The model achieves the best reported results on 5 languages, including English. We also observe improvements on out-of-domain test sets, confirming the robustness of our approach. We perform experiments demonstrating the importance of adding noise, and ablation studies showing the necessity of incorporating output interactions. Furthermore, we provide an analysis of constraint violations and errors on the English test set.

Related Work
Learning to refine predictions from neural structured prediction models has recently received significant attention. Our approach bears similarity to methods used in machine translation (Novak et al., 2017; Xia et al., 2017). All these methods refine a translated sentence produced by a seq2seq model with another seq2seq model. Among them, the deliberation networks of Xia et al. (2017) rely on BiLSTMs, improve initial predictions from a competitive baseline and obtain state-of-the-art results on English-to-French translation. Later, it has been shown that deliberation networks can also improve translation when used within the Transformer framework (Hassan et al., 2018).
Certain approaches, not necessarily directly optimized for refinement, can nevertheless be regarded as iterative refinement methods. Structured prediction energy networks (SPENs) are trained to assign global energy scores to output structures, and gradient descent is used during inference to minimize the global energy (Belanger and McCallum, 2016). As gradient descent involves iterative optimization, its steps can be viewed as iterative refinement. In particular, Belanger et al. (2017) build a SPEN for SRL, but for the span-based formalism, not the dependency-based one we consider in this work. While they improve over their baseline model, that baseline uses a multilayer perceptron to encode local factors, so the power of the encoder is limited. Moreover, their refined model performs worse in the out-of-domain setting than their baseline, indicating overfitting (Belanger et al., 2017).
In follow-up work, Tu and Gimpel (2018, 2019) introduce inference networks to replace gradient descent. Their inference networks directly refine the output. Improvements over competitive baselines are reported on part-of-speech tagging, named entity recognition and CCG supertagging (Tu and Gimpel, 2019). However, their inference networks distill knowledge from a tractable linear-chain conditional random field (CRF) model, so these methods do not provide direct performance gains. More importantly, the interactions captured by these models are likely local, as they learn to mimic Markov CRFs.
Denoising autoencoders (DAEs; Vincent et al., 2008) can also be used to refine structures. Indeed, image segmentation can be improved through iterative inference with denoising autoencoders (Romero et al., 2017; Drozdzal et al., 2018). Their framework is very similar to ours, although we work in a discrete domain. Another difference is that, by using a convolutional architecture in the refinement network, they still model only local interactions. At a more conceptual level, Bengio et al. (2013) argued that a denoising autoencoder should not be so robust to input variations that it ignores the input. This indicates that we should not expect refinement networks to correct all errors, even in theory, and hence that refinement networks do not need to be particularly powerful.
Very recently, Wang et al. (2019) used a high-order model for Semantic Dependency Parsing (Oepen et al., 2015) and obtained improvements over a strong BiLSTM baseline. They applied loopy belief propagation and mean-field variational inference for inference, and trained the model end-to-end. Such inference steps are well motivated. This work is similar to the energy network approach (Belanger and McCallum, 2016) in that a global score function is provided and approximate inference steps are used; their inference can also be regarded as iterative structure refinement. In contrast, we do not define a global score but directly model the refinement. In principle, this formulation gives us more freedom in designing the refinement network.

Dependency Semantic Role Labeling
In this section, we introduce the notation and present our factorized baseline model.

Notation
In dependency SRL, for each sentence of length n, we have a sequence of words w, dependency labels dep and part-of-speech tags pos, each a discrete sequence of length n. To simplify notation, we consider one predicate at a time. We denote the number of roles by r; this includes the 'null' role, signifying that the corresponding word is not an argument of the predicate. Formally, P ∈ ∆^{m−1} is the probability distribution over m predicate senses, where ∆^{m−1} denotes the corresponding probability simplex. We also have predicate sense embeddings Π ∈ R^{m×d_π}, and the index j, throughout the discussion, refers to the position of the target predicate in the sentence. R ∈ (∆^{r−1})^n is a matrix of size n × r such that each row sums to 1, i.e., each row is a probability distribution over roles. In particular, R_{i,0} is the probability of the i-th word not being an argument of the predicate.
We index role label and sense predictions from different refinement iterations ('time steps') with t, i.e., P^t and R^t. The index t ranges from 0 to T, and P^0 and R^0 denote the predictions from the factorized baseline model. Details (e.g., hyperparameters) are provided in the appendix.

Factorized Model
Similarly to recent approaches to SRL and semantic graph parsing (He et al., 2017;Dozat and Manning, 2018), our factorized baseline model starts with concatenated embeddings x. Then, we encode the sentence with a BiLSTM, further extract features with an MLP (multilayer perceptron) and apply a bi-affine classifier to the resulting features to label the words with roles. We also use a predicate-dependent linear layer for sense disambiguation.
More formally, we start by building a sentence representation from concatenated embeddings. We have x_w ∈ R^{n×d_w}, x_dep ∈ R^{n×d_δ} and x_pos ∈ R^{n×d_p} for words, dependency labels and part-of-speech tags, respectively, and concatenate them to form the sentence representation x, which we further encode with a BiLSTM. From these context-aware word representations, we produce features for argument identification and role labeling that will be used by a bi-affine classifier. Note that, for every potential predicate-argument dependency (i.e., a candidate edge), we need to produce representations of both endpoints: the argument and the predicate 'sides'. For the argument side, h^ρ_0 is used to compute the logits for argument identification and h^ρ_1 for deciding on its role. Similarly, for the predicate side, we extract two representations h_0 and h_1 (recall that the predicate is at position j). We then obtain logits I^ρ_0, corresponding to the decision to label arguments as null, and logits I^ρ_1 for the other roles. Unlike Dozat and Manning (2018), where argument identification and role labeling are trained with two losses, we feed them together into a single softmax layer to compute the semantic-role distribution R^0. Finally, for sense disambiguation, we extract yet another predicate representation h^π. In the formalism we use (PropBank), senses are predicate-specific, so we use predicate-specific sense embedding matrices Π; the matrix Π acts as a linear layer before the softmax. This completes the description of our baseline model, which we also use to obtain initial predictions for iterative refinement.
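To make the factorized scorer concrete, here is a minimal NumPy sketch of a bi-affine role classifier over fixed features. The dimensions, the random stand-in features, and the exact bilinear form are illustrative assumptions, not the paper's architecture or hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 6, 16, 5                # words, feature dim, roles incl. null (toy sizes)

h_arg = rng.normal(size=(n, d))   # argument-side features (after BiLSTM + MLP)
h_prd = rng.normal(size=(d,))     # predicate-side features at position j

U = rng.normal(size=(r, d, d)) / d   # one bilinear map per role (hypothetical)
b = rng.normal(size=(r,))

# Bi-affine logits: logits[i, k] = h_arg[i] @ U[k] @ h_prd + b[k]
logits = np.einsum('id,kde,e->ik', h_arg, U, h_prd) + b

# A single softmax over null-role and role logits yields R0, the initial
# semantic-role distribution for each word.
e = np.exp(logits - logits.max(axis=1, keepdims=True))
R0 = e / e.sum(axis=1, keepdims=True)
```

The key design point shown here is the single softmax over argument identification ('null') and role labels, as opposed to two separate losses.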

Structured Refinement Network
In this section, we introduce the structured refinement network for dependency SRL. When doing refinement, it has access to the role distributions R^t ∈ (∆^{r−1})^n and the sense distribution P^t ∈ ∆^{m−1} computed at the previous iteration (i.e., time t). In addition, it exploits the sentence representation x ∈ R^{n×d_x}. Our refinement network is limited and structured, in the sense that it only has access to a compressed version of the previous prediction, and the network itself is a simple MLP.
Similarly to our baseline model, we extract feature vectors g from the input x and further separately encode the argument representations g^α and the predicate token representation g^π. To simplify the notation, we omit indexing them by t, except for R^t and P^t. We use two refinement networks: one for roles and another for predicate senses.

Role Refinement Network
First, we describe our structured refinement network for role labeling. We use i to index arguments. We obtain a compressed representation o_i, used for refining R^t_i, by summing up the probability mass of all roles over the other arguments, excluding the null role. Intuitively, o_i aggregates all the other roles being labeled for the current predicate. We concatenate o_i with the feature vector of the current argument g^α, the predicate representation g^π, the relaxed predicate sense embedding Π·P^t and the role distribution itself (R^t_i) to form the input to a two-layer network, where σ is the logistic sigmoid function and W^α_1 ∈ R^{d_r×(2r−1+2d_g+d_π)}, W^α_2 ∈ R^{r×d_r} are learned linear mappings. This yields refined logits M^α_i for the i-th argument; M^α refers to the stacked matrix of logits for all arguments. To obtain the refined role distribution, we add M^α to the logits I^α from the baseline model and apply a softmax layer.
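The role refinement computation can be sketched as follows. This is our reading of the text, with toy dimensions; in particular, we assume o_i aggregates the non-null role mass of all *other* words, and that the two-layer network consumes the concatenation described above (the variable names W1, W2, sense are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, dg, dpi, dr = 6, 5, 8, 4, 16        # toy sizes

R_t = rng.dirichlet(np.ones(r), size=n)   # previous role distributions R^t
g_arg = rng.normal(size=(n, dg))          # argument feature vectors g_alpha
g_prd = rng.normal(size=(dg,))            # predicate feature vector g_pi
sense = rng.normal(size=(dpi,))           # relaxed sense embedding, Pi . P^t

W1 = rng.normal(size=(dr, 2 * r - 1 + 2 * dg + dpi)) / 4
W2 = rng.normal(size=(r, dr)) / 4

M = np.empty((n, r))
for i in range(n):
    # o_i: non-null role mass aggregated over all other arguments
    o_i = np.delete(R_t, i, axis=0)[:, 1:].sum(axis=0)
    z = np.concatenate([o_i, R_t[i], g_arg[i], g_prd, sense])
    # two-layer MLP with a sigmoid hidden layer -> refined logits M[i]
    M[i] = W2 @ (1.0 / (1.0 + np.exp(-(W1 @ z))))
```

Note that the input dimension (r−1) + r + d_g + d_g + d_π matches the stated shape 2r−1+2d_g+d_π of the first mapping.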

Sense Refinement Network
To build a representation for sense disambiguation, we simply compute the probability mass of each role (excluding the null role) to obtain r^π, and concatenate this with g^π and Π·P^t. Differently from the role refinement network, sense prediction is predicate-specific. Therefore, we first map z^π to R^{d_π} and then take the inner product with the predicate-specific sense embeddings Π to obtain the refined logits. As in role refinement, σ is the logistic function and W^π_1 ∈ R^{d_r×(r−1+d_g+d_π)}, W^π_2 ∈ R^{m×d_r} are learned linear mappings. Again, we combine the logits M^π and I^π before the softmax layer.

Weight Tying
Our refinement networks are similar to denoising autoencoders (DAEs; Vincent et al., 2008), so we use the weight-tying technique popular with DAEs. We believe the technique may be even more effective here, as the amount of labeled data for SRL is lower than in many usual applications of DAEs. We tie W^α_2 with the subset of the rows of W^α_1 acting on R^t_i in the computation of M^α_i (see equations 19 and 20). Similarly, we tie W^π_2 with the part of W^π_1 corresponding to Π·P^t (see equations 24 and 25). Formally, W[:k] takes the first k rows of matrix W.
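A small sketch of the DAE-style tying for the role network: the output projection reuses (transposed) the part of the encoder matrix that reads the previous role distribution. The exact slice position depends on the concatenation order in equations 19 and 20, which is not visible here, so the slice below is our assumption:

```python
import numpy as np

rng = np.random.default_rng(2)
r, dg, dpi, dr = 5, 8, 4, 16
W1 = rng.normal(size=(dr, 2 * r - 1 + 2 * dg + dpi))   # encoder of eq. 19

# DAE-style tying: reuse (transposed) the encoder columns that multiply
# the previous role distribution R^t_i. We assume R^t_i directly follows
# o_i in the concatenated input -- an assumption on our part.
cols_R = slice(r - 1, 2 * r - 1)
W2 = W1[:, cols_R].T      # tied output projection, shape (r, d_r)
```

With tying, the refinement network learns fewer free parameters, which matters for the smaller CoNLL-2009 languages.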

Self Refinement
We describe a simpler version of the refinement network which we use in experiments to test whether the improvements of the structured refinement network over the factorized baseline genuinely come from modeling interaction between arguments rather than from simply combining multiple classifiers. This simpler refinement network does not account for any interactions between arguments: in equations 19 and 24, the aggregated representations are dropped from the input. Everything else is kept the same as in the full model, except that the sizes of W^α_1 and W^π_1 need to be adjusted. We refer to this ablated network as the self-refinement network.

Training for Iterative Structure Refinement
In this section, we describe our training procedure.

Two-Stage Training
We have two models: the baseline model, producing the initial predictions, and the iterative refinement network, correcting them. While it is possible to train them jointly, we find joint training slow to converge. Instead, we train the factorized baseline model first and then optimize the refinement networks while keeping the baseline model fixed.

Stochastic Training
Our baseline model overfits to the training set and, if trained simply on its output, our refinement network would learn to copy the base predictions. Instead, we perturb the baseline predictions during training. Naturally, we can add dropout (Srivastava et al., 2014) and recurrent dropout (Gal and Ghahramani, 2016) to our neural networks. However, for the smaller datasets we use, we find this insufficient. In particular, we use Gumbel-Softmax instead of Softmax when producing the base predictions fed to the refinement network.
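A minimal sketch of the Gumbel-Softmax perturbation (the general technique, not the paper's exact implementation): logits are perturbed with Gumbel noise and pushed through a temperature-scaled softmax, so the refinement network sees noisy, sample-like distributions rather than the base model's deterministic output:

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Draw a relaxed (soft one-hot) sample from a categorical
    distribution by perturbing logits with Gumbel(0, 1) noise."""
    rng = rng if rng is not None else np.random.default_rng()
    u = rng.uniform(1e-9, 1.0, size=logits.shape)
    g = -np.log(-np.log(u))                  # Gumbel(0, 1) noise
    y = (logits + g) / tau
    e = np.exp(y - y.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(3)
logits = rng.normal(size=(6, 5))             # per-word role logits (toy)
noisy = gumbel_softmax(logits, tau=0.5, rng=rng)
```

Lower temperatures tau make the perturbed distributions closer to hard (one-hot) samples.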

Loss for Iterative Refinement
Let us denote the gold-standard labels for roles and predicates as R* and P*. We use two separate losses, L_base(R*, P*, x) and L_refine(R*, P*, x), for our two-stage training. We define losses for the predictions from each refinement iteration and sum them up. We adopt the softmax-margin loss (Gimpel and Smith, 2010; Blondel et al., 2019) for the individual loss terms L. Effectively, we subtract 1 from the logit of the gold label and apply the cross-entropy loss.
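The 'subtract 1 from the gold logit, then cross-entropy' recipe can be written down directly; this sketch assumes a unit cost for every incorrect label, the standard softmax-margin setting:

```python
import numpy as np

def softmax_margin_loss(logits, gold):
    """Softmax-margin: subtract 1 from the gold logit (i.e., a cost of 1
    for every incorrect label), then apply cross-entropy."""
    z = np.asarray(logits, dtype=float).copy()
    z[gold] -= 1.0
    z = z - z.max()                       # numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[gold]

def cross_entropy(logits, gold):
    z = np.asarray(logits, dtype=float)
    z = z - z.max()
    return -(z - np.log(np.exp(z).sum()))[gold]

loss = softmax_margin_loss([2.0, 0.5, -1.0], gold=0)
```

By construction the softmax-margin loss upper-bounds the plain cross-entropy, penalizing predictions that are correct but not correct by a margin.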

Experiments
Datasets We conduct experiments on the CoNLL-2009 (Hajič et al., 2009) datasets for all languages: Catalan (Ca), Chinese (Zh), Czech (Cz), English (En), German (De), Japanese (Jp) and Spanish (Es). We use the predicted part-of-speech tags, dependency labels, and pre-identified predicates provided with the dataset. Dataset statistics are shown in Table 2.
Hyperparameters We use ELMo for English, and FastText embeddings (Bojanowski et al., 2017; Grave et al., 2018) for all other languages. We train and run the refinement networks for two iterations. All other hyperparameters are the same for all languages, except that the BiLSTM for English is larger than for the others.
Training Details Training the refinement network takes roughly twice as long as training the baseline model, as it requires running the BiLSTMs; the extra computation for the structured refinement network itself is minimal. For English, training the iterative refinement model for one epoch takes about 6 minutes on a single 1080Ti GPU. We use Adam as the optimizer (Kingma and Ba, 2015) with a learning rate of 3e-4, and early stopping on the development set. We run 600 epochs for all baseline models and 300 epochs for the refinement networks. Batch sizes are chosen from 32, 64, or 128 to maximize GPU memory usage. Our implementation is based on PyTorch (Paszke et al., 2017) and AllenNLP (Gardner et al., 2018).

Results and Discussions
Test Results Results for all CoNLL-2009 languages on the standard (in-domain) datasets are presented in Table 1. We compare our best model to the best previous single model for the corresponding language (excluding ensemble ones). Most research has focused on English, but we include results of recent models which were evaluated on at least 3 languages. When compared to the previous models, both our models are very competitive, with the exception of German. On the German dataset, Mulcaire et al. (2018) also report a relatively weak result, when compared to Roth and Lapata (2016). The German dataset is the smallest one in terms of the number of predicates. Syntactic information used by Roth and Lapata (2016) may be very beneficial in this setting and may be the reason for this discrepancy. Our structured refinement approach improves over the best previous results on 5 out of 7 languages. Note that hyper-parameters of the refinement network are not tuned for individual languages, suggesting that the proposed method is robust and may be easy to apply to new languages and/or new base models.
The only case where the refinement network was not effective is Chinese, where it achieved only a negligible improvement.
Out-of-Domain Results on the out-of-domain test sets are presented in Table 4 (Roth and Lapata (2016) report a better in-domain test score but did not report an out-of-domain score). We observe improvements from using refinement in all cases. This shows that our refinement approach is robust to domain shift. (In Table 1, the previous best results are: Catalan from Zhao et al. (2009), Japanese from Watanabe et al. (2010), Czech from Henderson et al. (2013), German and Spanish from Roth and Lapata (2016), and Chinese from Cai et al. (2018); we report the best test results from Mulcaire et al. (2018).)
Ablations We report development-set results under different settings in Table 3. Our full model performs 2 refinement iterations, uses weight tying, and uses the Gumbel noise; we set λ^α_g = 5 for roles and λ^π_g = 50 for senses, so that the perturbed initial predictions contain around 20% errors. We select the best configuration for each language to report the test-set performance in Table 1 and Table 4. As expected, weight tying is beneficial for lower-resource languages such as Catalan, Japanese and Spanish (see Table 2 for dataset characteristics). The Gumbel noise helps for all languages except Czech and English, the two largest datasets. In particular, we observe almost no improvement on the Spanish dataset without the Gumbel noise. We note relatively consistent but small gains from using 2 refinement iterations. The magnitude of these gains may be an artifact of our having loss terms L(R*, R^t) and L(P*, P^t), encouraging not only the final (second) but also the first iteration to produce accurate predictions.
A potential alternative explanation is that our refinement network is restricted to simple interactions, so that a fixed point is reachable in a single step.

Constraints Violation
We consider violations of the unique core roles (U), continuation roles (C) and reference roles (R) constraints from Punyakanok et al. (2008) and FitzGerald et al. (2015) in Table 6. U is violated if a core role (A0-A5, AA) appears more than once; C is violated when a C-X role is not preceded by the X role (for some X); R is violated if an R-X role appears but the X role does not. Our approach results in a large reduction in uniqueness-constraint violations, and our model also slightly reduces the number of R violations. He et al. (2017) reported that deterministically enforcing constraints is not helpful (albeit in span-based SRL); however, learning those constraints in a soft way might be beneficial.
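For concreteness, the three constraint checks can be sketched as follows; the string label format ('_' for non-arguments, 'C-'/'R-' prefixes) is an assumption about the annotation scheme, not the paper's evaluation code:

```python
def violations(roles):
    """Count U/C/R constraint violations in one predicted frame.

    `roles` lists the role of each word ('_' for non-arguments), e.g.
    ['A0', '_', 'A1', 'C-A1'].
    """
    core = {'A0', 'A1', 'A2', 'A3', 'A4', 'A5', 'AA'}
    # U: a core role appears more than once
    u = sum(1 for x in core if roles.count(x) > 1)
    # C: a C-X role is not preceded by the X role
    c = sum(1 for i, x in enumerate(roles)
            if x.startswith('C-') and x[2:] not in roles[:i])
    # R: an R-X role appears without the X role anywhere
    r = sum(1 for x in roles
            if x.startswith('R-') and x[2:] not in roles)
    return u, c, r
```

For example, violations(['A0', 'A0', 'C-A1', 'R-A2']) reports one violation of each kind, while violations(['A0', '_', 'A1', 'C-A1', 'R-A1']) reports none.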

Argument Interaction vs. No Argument Interaction
We compare the structured refinement network and the self-refinement network in Table 5. Both networks share the same hyperparameters. The structured refinement network consistently outperforms the self-refinement counterpart. This suggests that the refinement model benefits from accessing information about other arguments when doing refinement. In other words, modeling argument interaction appears genuinely useful.

Improvement Decomposition
We report labeled role precision, recall and sense disambiguation accuracy in Table 7. Our structured refinement approach consistently improves over the baseline model on all metrics. While we cannot assert that the improvements on all metrics are significant, this suggests that it learns some non-trivial interactions rather than merely learning to balance precision and recall.
Error Correction Analysis We show the errors that the structured refinement network corrects in Figure 2. In the baseline's confusion matrix, we see that the errors are fairly balanced across all the roles we consider here. In the error-correction matrix, the corrections are also fairly evenly distributed, though not completely uniform: there is a tendency towards filtering out arguments rather than generating new ones.

Conclusions and Future Work
We propose the structured refinement network for dependency semantic role labeling. The structured refinement network corrects predictions made by a strong factorized baseline model while modeling interactions in the predicted structure. The resulting model achieves state-of-the-art results on 5 out of 7 languages in the CoNLL-2009 dataset, and substantially outperforms the factorized model on all of them. In future work, the structured refinement network can be further improved. For example, we can take inspiration either from declarative constraints used in previous work (Punyakanok et al., 2008) or from the literature on lexical semantics of verbs, which studies patterns of event and argument realization (e.g., Levin 1993). Indeed, the unique-role declarative constraint is one of the motivations for the concurrent work on modeling argument interaction in SRL (Chen et al., 2019); that work relies on capsule networks (Sabour et al., 2017) and focuses primarily on enforcing the role uniqueness constraint. The framework can also be extended to other tasks, for example, syntactic dependency parsing: the refinement network can rely on representations of grandparent nodes, siblings and children to propose a correction. In general, structured refinement networks should allow domain experts to incorporate prior knowledge about output dependencies and improve model performance.
