Capturing Argument Interaction in Semantic Role Labeling with Capsule Networks

Semantic role labeling (SRL) involves extracting propositions (i.e. predicates and their typed arguments) from natural language sentences. State-of-the-art SRL models rely on powerful encoders (e.g., LSTMs) and do not model non-local interaction between arguments. We propose a new approach to modeling these interactions while maintaining efficient inference. Specifically, we use Capsule Networks (Sabour et al., 2017): each proposition is encoded as a tuple of capsules, one capsule per argument type (i.e. role). These tuples serve as embeddings of entire propositions. In every network layer, the capsules interact with each other and with representations of words in the sentence. Each iteration results in updated proposition embeddings and updated predictions about the SRL structure. Our model substantially outperforms the non-refinement baseline model on all 7 CoNLL-2009 languages and achieves state-of-the-art results on 5 languages (including English) for dependency SRL. We analyze the types of mistakes corrected by the refinement procedure. For example, each role is typically (but not always) filled with at most one argument. Whereas enforcing this approximate constraint is not useful with modern SRL systems, the iterative procedure corrects the mistakes by capturing this intuition in a flexible and context-sensitive way.


Introduction
The task of semantic role labeling (SRL) involves the prediction of predicate-argument structure, i.e., both the identification of arguments and labeling them with underlying semantic roles. These shallow semantic structures have been shown to be beneficial in many natural language processing (NLP) applications, including information extraction (Christensen et al., 2011), question answering (Eckert and Neves, 2018) and machine translation (Marcheggiani et al., 2018). Code is available at https://github.com/DalstonChen/CapNetSRL.
In this work, we focus on the dependency version of the SRL task (Surdeanu et al., 2008). An example of a dependency semantic-role structure is shown in Figure 1. Edges in the graph are marked with semantic roles, whereas predicates (typically verbs or nominalized verbs) are annotated with their senses.
Intuitively, there are many restrictions on potential predicate-argument structures. Consider the 'role uniqueness' constraint: each role is typically, but not always, realized at most once. For example, predicates have at most one agent. Similarly, depending on the verb class, only certain subcategorization patterns are licensed. Nevertheless, rather than modeling the interaction between argument labeling decisions, state-of-the-art semantic role labeling models (Li et al., 2019b) rely on powerful sentence encoders, e.g., multi-layer Bi-LSTMs (Zhou and Xu, 2015; Qian et al., 2017). This contrasts with earlier work on SRL, which hinged on global declarative constraints on the labeling decisions (FitzGerald et al., 2015; Das et al., 2012). Modern SRL systems are much more accurate, and hence enforcing such hard and often approximate constraints is unlikely to be as beneficial (see our experiments in Section 7.2).
Instead of using hard constraints, we propose to use a simple iterative structure-refinement procedure. It starts with independent predictions and refines them in every subsequent iteration. When refining a role prediction for a given candidate argument, information about the assignment of roles to other arguments gets integrated. Our intuition is that modeling interactions through the output space, rather than hoping that the encoder somehow learns to capture them, provides a useful inductive bias to the model. In other words, capturing this interaction explicitly should mitigate overfitting. This may be especially useful in a lower-resource setting but, as we will see in experiments, it appears beneficial even in a high-resource setting.
We think of semantic role labeling as extracting propositions, i.e. predicates and argument-role tuples. In our example, the proposition is sign.01(Arg0:signer = Hiti, Arg1:document = contract). Across the iterations, we maintain and refine not only predictions about propositions but also their embeddings. Each 'slot' (e.g., the one corresponding to role Arg0:signer) is represented with a vector, encoding information about the arguments assumed to be filling this slot. For example, if Hiti is predicted to fill slot Arg0 in the first iteration, then the slot representation will be computed from the contextualized representation of that word and hence reflect this information. The combination of all slot embeddings (one per semantic role) constitutes the embedding of the proposition. The proposition embedding, along with the prediction about the semantic role structure, gets refined in every iteration of the algorithm.
Note that, in practice, the predictions in every iteration are soft, and hence proposition embeddings will encode current beliefs about the predicate-argument structure. The distributed representation of propositions provides an alternative ("dense-embedding") view on the currently considered semantic structure, i.e. information extracted from the sentence. Intuitively, this representation can be readily tested to see how well current predictions satisfy selection restrictions (e.g., contract is a very natural filler for Arg1:document) and check if the arguments are compatible.
To get an intuition for how the refinement mechanism may work, imagine that both Hiti and injury are labeled as Arg0 for sign.01 in the first iteration. Hence, the representation of the slot Arg0 will encode information about both predicted arguments. At the second iteration, the word injury will be aware that there is a much better candidate for filling the signer role, and the probability of assigning injury to this role will drop. As we will see in our experiments, whereas enforcing the hard uniqueness constraint is not useful, our iterative procedure corrects the mistakes of the base model by capturing interrelations between arguments in a flexible and context-sensitive way.
In order to operationalize the above idea, we take inspiration from Capsule Networks (CNs) (Sabour et al., 2017). Note that we are not simply replacing Bi-LSTMs with generic CNs. Instead, we use CNs to encode the structure of the refinement procedure sketched above. Each slot embedding is now a capsule and each proposition embedding is a tuple of capsules, one capsule per role. In every iteration (i.e. CN layer), the capsules interact with each other and with representations of words in the sentence.
We experiment with our model on standard benchmarks for 7 languages from CoNLL-2009. Compared with the non-refinement baseline model, we observe substantial improvements from using the iterative procedure. The model achieves state-of-the-art performance in 5 languages, including English.

Base Dependency SRL Model
In dependency SRL, for each predicate $p$ of a given sentence $x = \{x_1, x_2, \cdots, x_n\}$ with $n$ words, the model needs to predict roles $y = \{y_1, y_2, \cdots, y_n\}$ for every word. The role can be none, signifying that the corresponding word is not an argument of the predicate.
We start with describing the factorized baseline model which is similar to that of . It consists of three components: (1) an embedding layer; (2) an encoding layer and (3) an inference layer.

Embedding Layer
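The first step is to map the symbolic sentence $x$ and predicate $p$ into a continuous embedding space, yielding a word embedding $e_i \in \mathbb{R}^{d_e}$ for each word $x_i$ and a predicate embedding $p \in \mathbb{R}^{d_e}$.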

Encoding Layer
The encoding layer extracts features from the input sentence $x$ and predicate $p$. We extract features from the sentence using stacked bidirectional LSTMs (Hochreiter and Schmidhuber, 1997), which produce a representation $x_i \in \mathbb{R}^{d_l}$ for each word. We then compute the logit of each role for each word with a bilinear operation between the word representation and the target predicate, where $W \in \mathbb{R}^{d_l \times d_e}$ is a trainable parameter and $b_{j|i} \in \mathbb{R}$ is a scalar representing the logit of role $j$ for word $x_i$.

Inference Layer
The probability $P(y \mid x, p)$ is then computed by normalizing, independently for each word $x_i$, its role logits $b_{\cdot|i} = \{b_{1|i}, b_{2|i}, \cdots, b_{|T||i}\}$ with a softmax, where $|T|$ denotes the number of role types.
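The factorized baseline can be illustrated with a minimal sketch. The display equations are not available in this text, so the code below rests on two assumptions: that the bilinear logit takes the form $b_{j|i} = x_i^\top W_j p$ with one matrix per role (the role index on $W$ is our guess), and that each word's logits are normalized with a softmax. The function names and toy values are hypothetical.

```python
import math


def bilinear(x, W, p):
    """Return the scalar x^T W p for x in R^{d_l}, W in R^{d_l x d_e}, p in R^{d_e}."""
    return sum(
        x[a] * sum(W[a][b] * p[b] for b in range(len(p)))
        for a in range(len(x))
    )


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def role_distribution(x_i, role_matrices, p):
    """Distribution over roles for one word (assumes one matrix per role)."""
    logits = [bilinear(x_i, W_j, p) for W_j in role_matrices]
    return softmax(logits)


# Toy example: d_l = 2, d_e = 3, two roles (index 0 plays the part of "none").
x_i = [1.0, 0.0]
p = [1.0, 1.0, 0.0]
role_matrices = [
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],  # gives logit 0.0
    [[1.0, 1.0, 0.0], [0.0, 0.0, 0.0]],  # gives logit 2.0
]
probs = role_distribution(x_i, role_matrices, p)
```

Because every word is scored independently, nothing in this sketch lets one argument's label influence another's; that is the gap the refinement procedure targets.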

Dependency based Semantic Role Labeling using Capsule Networks
Inspired by capsule networks, we use a capsule structure for each role state to maintain information across iterations and employ the dynamic routing mechanism to derive the role logits $b_{j|i}$ iteratively. Figure 2 illustrates the architecture of the proposed model.

Capsule Structure
We start by introducing two capsule layers: (1) the word capsule layer and (2) the role capsule layer.

Word Capsule Layer
The word capsule layer consists of capsules representing the roles of each word. Given the sentence representation $x_i$ and the predicate embedding $p$, the word capsule layer is derived using trainable parameters $W_j^k \in \mathbb{R}^{d_l \times d_e}$, yielding a capsule vector $u_{j|i} \in \mathbb{R}^K$ for role $j$ of word $x_i$, where $K$ denotes the capsule size. Intuitively, the capsule encodes the argument-specific information relevant to deciding if role $j$ is suitable for the word. These capsules do not get iteratively updated.
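As an illustration of the shapes involved, the sketch below builds word capsules under the assumption that the $k$-th component of $u_{j|i}$ is the bilinear form $x_i^\top W_j^k p$. This form is inferred from the stated dimensions of $W_j^k$ and is not confirmed by the text; all names and values are hypothetical.

```python
def bilinear(x, W, p):
    """Return the scalar x^T W p."""
    return sum(
        x[a] * sum(W[a][b] * p[b] for b in range(len(p)))
        for a in range(len(x))
    )


def word_capsule(x_i, W_j, p):
    """Capsule u_{j|i} in R^K; W_j is a list of K matrices (assumed form)."""
    return [bilinear(x_i, W_jk, p) for W_jk in W_j]


def word_capsule_layer(xs, W, p):
    """Return u with u[i][j] the capsule for word i and role j."""
    return [[word_capsule(x_i, W_j, p) for W_j in W] for x_i in xs]


# Toy example: n = 2 words, |T| = 2 roles, K = 2, d_l = d_e = 2.
xs = [[1.0, 0.0], [0.0, 1.0]]
p = [1.0, 2.0]
W = [
    [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]],  # role 0, k = 0 and 1
    [[[2.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]],  # role 1, k = 0 and 1
]
u = word_capsule_layer(xs, W, p)
```

The result is computed once per sentence and predicate, matching the statement that word capsules are not updated during routing.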

Role Capsule Layer
As discussed in the introduction, the role capsule layer can be viewed as an embedding of a proposition. The capsule network generates the capsules in this layer using "routing-by-agreement", a process that can be regarded as a pooling operation. Capsules in the role capsule layer at the $t$-th iteration are derived by applying the Squash operation to an intermediate representation $s_j^{(t)}$. This representation $s_j^{(t)}$ is generated as a linear combination of the capsules in the word capsule layer with weights $c_{ij}^{(t)}$. Here, $c_{ij}^{(t)}$ are coupling coefficients, calculated by a softmax function over the role logits $b_{\cdot|i}^{(t)}$; $c_{ij}^{(t)}$ can be interpreted as the probability that word $x_i$ is assigned role $j$. The role logits $b_{j|i}^{(t)}$ are determined by the iterative dynamic routing process. The Squash operation deactivates capsules receiving a small input $s_j^{(t)}$ (i.e. roles not predicted in the sentence) by pushing them further towards the zero vector.
Algorithm 1: Dynamic routing algorithm. $l_w$ and $l_r$ denote the word capsule layer and the role capsule layer, respectively.
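The role capsule computation described above can be sketched as follows. The Squash function used here is the one from Sabour et al. (2017), $v = \frac{\|s\|^2}{1 + \|s\|^2} \frac{s}{\|s\|}$; we assume the paper uses the same form. The data layout (`u[i][j]` for word $i$ and role $j$) and the function names are illustrative choices, not taken from the paper's code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def squash(s):
    """Squash of Sabour et al. (2017): keeps direction, maps the norm into [0, 1)."""
    norm_sq = sum(v * v for v in s)
    if norm_sq == 0.0:
        return [0.0] * len(s)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * v for v in s]


def role_capsules(u, b):
    """One pooling step: coupling coefficients c, pooled inputs s, role capsules v."""
    n, num_roles, K = len(u), len(u[0]), len(u[0][0])
    c = [softmax(b_i) for b_i in b]  # per word, a distribution over roles
    s = [
        [sum(c[i][j] * u[i][j][k] for i in range(n)) for k in range(K)]
        for j in range(num_roles)
    ]
    v = [squash(s_j) for s_j in s]
    return c, s, v


# Toy example: two words, two roles, capsule size 2, all logits zero.
u = [
    [[0.0, 0.0], [2.0, 0.0]],
    [[0.0, 0.0], [1.0, 0.0]],
]
b = [[0.0, 0.0], [0.0, 0.0]]
c, s, v = role_capsules(u, b)
```

In the toy example, role 0 receives no input, so its capsule stays at the zero vector, while role 1 pools evidence from both words.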

Dynamic Routing
The dynamic routing process involves $T$ iterations. The role logits before the first iteration are all set to zero: $b^{(0)}_{j|i} = 0$. The dynamic routing process then updates the role logits $b_{j|i}$ by modeling agreement between capsules in the two layers, using a trainable matrix $W \in \mathbb{R}^{K \times K}$. The whole dynamic routing process is shown in Algorithm 1. The dynamic routing process can be regarded as the role refinement procedure (see Section 5 for details).
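A minimal sketch of the routing loop follows. It assumes the agreement update $b^{(t+1)}_{j|i} = b^{(t)}_{j|i} + v_j^{(t)\top} W u_{j|i}$ and the Squash function of Sabour et al. (2017); the exact update and the iteration indexing are assumptions, since the equation and algorithm body are not available in this text. The function returns the coupling coefficients computed from the logits after `num_iters` updates.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def squash(s):
    norm_sq = sum(v * v for v in s)
    if norm_sq == 0.0:
        return [0.0] * len(s)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * v for v in s]


def agreement(v, W, u):
    """Return v^T W u (assumed form of the agreement score)."""
    return sum(
        v[a] * sum(W[a][b] * u[b] for b in range(len(u)))
        for a in range(len(v))
    )


def dynamic_routing(u, W, num_iters):
    """u[i][j] is the word capsule for word i and role j; returns c[i][j]."""
    n, num_roles, K = len(u), len(u[0]), len(u[0][0])
    b = [[0.0] * num_roles for _ in range(n)]  # b^(0) = 0
    for _ in range(num_iters):
        c = [softmax(b_i) for b_i in b]
        s = [
            [sum(c[i][j] * u[i][j][k] for i in range(n)) for k in range(K)]
            for j in range(num_roles)
        ]
        v = [squash(s_j) for s_j in s]
        b = [
            [b[i][j] + agreement(v[j], W, u[i][j]) for j in range(num_roles)]
            for i in range(n)
        ]
    return [softmax(b_i) for b_i in b]


# Toy example: word 0 agrees with the pooled capsule of role 1, word 1 does not.
identity = [[1.0, 0.0], [0.0, 1.0]]
u = [
    [[0.0, 0.0], [2.0, 0.0]],
    [[0.0, 0.0], [-0.5, 0.0]],
]
c1 = dynamic_routing(u, identity, 1)
c2 = dynamic_routing(u, identity, 2)
```

In this toy setting, the word whose capsule agrees with the pooled role capsule gains probability for role 1 with each iteration, while the disagreeing word loses it.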

Incorporating Global Information
When computing the $j$-th role capsule representation (Eq 9), the information originating from the $i$-th word (i.e. $u_{j|i}$) is weighted by the probability of assigning role $j$ to word $x_i$ (i.e. $c_{ij}$). In other words, the role capsule receives messages only from words assigned to its role. This implies that the only interaction the capsule network can model is competition between arguments for a role. (In principle, it can also model the opposite, i.e. collaboration or attraction, but this is unlikely to be useful in SRL.) Note though that this is different from imposing the hard role-uniqueness constraint, as the network does this dynamically in a specific context. Still, this is a strong restriction.
In order to make the model more expressive, we further introduce a global node $g^{(t)}$ to incorporate global information about all arguments at the current iteration. The global node is a compressed representation of the entire proposition and is used in the prediction of all arguments, thus permitting arbitrary interaction across arguments. The global node $g^{(t)}$ at the $t$-th iteration is derived from a vector in $\mathbb{R}^{K \cdot |T|}$ that is the concatenation of all capsules in the role capsule layer.
We append an additional term, based on the global node, to the role logits update in Eq (11), where $W \in \mathbb{R}^{K \times K}$ and $W_g \in \mathbb{R}^{K \times K}$.
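One way the global node could enter the update is sketched below. Everything about the functional form is an assumption: we take $g$ to be a linear projection of the concatenated role capsules (the projection matrix `W_s` is a hypothetical name and shape), and we take the extra term to be $g^\top W_g u_{j|i}$, added to the local agreement term $v_j^\top W u_{j|i}$.

```python
def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(M[r][c] * v[c] for c in range(len(v))) for r in range(len(M))]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def global_node(role_capsules, W_s):
    """Compress the concatenated role capsules (R^{K*|T|}) into g in R^K.

    W_s (K x K*|T|) is a hypothetical projection; the source does not give it.
    """
    flat = [value for capsule in role_capsules for value in capsule]
    return matvec(W_s, flat)


def updated_logit(b_ij, v_j, u_ij, g, W, W_g):
    """Assumed update: local agreement plus a global-node term."""
    local_term = dot(v_j, matvec(W, u_ij))
    global_term = dot(g, matvec(W_g, u_ij))
    return b_ij + local_term + global_term


# Toy example: K = 2, |T| = 2.
identity = [[1.0, 0.0], [0.0, 1.0]]
caps = [[1.0, 0.0], [0.0, 2.0]]
W_s = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
g = global_node(caps, W_s)
new_b = updated_logit(0.0, [0.5, 0.0], [1.0, 1.0], g, identity, identity)
```

Because $g$ depends on every role capsule, the logit of role $j$ for a word can now react to which other roles are filled, not only to competitors for the same role.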

Refinement
The dynamic routing process can be seen as iterative role refinement. Concretely, the coupling coefficients $c_i^{(t)}$ in Eq (10) can be interpreted as the predicted distribution over semantic roles for word $x_i$ (as in Eq (5)) at the $t$-th iteration. Since dynamic routing is an iterative process, the semantic role distribution $c_i^{(t)}$ in iteration $t$ affects the semantic role distribution $c_i^{(t+1)}$ in the next iteration $t + 1$ through a refinement function $f(\cdot)$, defined by the operations in each dynamic routing iteration.

Training
We minimize a loss function $L(\theta)$ computed from the final-iteration predictions $P^{(T)}(y_i \mid x, i, p; \theta) = c_i^{(T)}$, plus a regularization term weighted by the hyper-parameter $\lambda$. Unlike standard refinement methods (Belanger et al., 2017), which sum losses across all refinement iterations, our loss is based only on the prediction made at the last iteration. This encourages the model to rely on the refinement process rather than make its prediction in one shot. Our baseline model is trained analogously, but using the cross-entropy for the independent classifiers (Eq (5)).
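A minimal sketch of a last-iteration loss follows. It assumes the data term is the negative log-likelihood of the gold roles under $c^{(T)}$ and that the regularizer is a squared L2 norm of the parameters; the text only says "regularization term", so both the form and the function name are assumptions.

```python
import math


def final_iteration_loss(c_final, gold_roles, params, lam):
    """Negative log-likelihood at the last iteration plus lam * ||params||^2.

    c_final[i][j] is the probability of role j for word i after T iterations.
    The L2 form of the regularizer is an assumption.
    """
    nll = -sum(math.log(c_final[i][y]) for i, y in enumerate(gold_roles))
    l2 = sum(w * w for w in params)
    return nll + lam * l2


# Toy example: two words, two roles, two scalar parameters.
c_final = [[0.9, 0.1], [0.2, 0.8]]
gold_roles = [0, 1]
params = [1.0, -2.0]
loss = final_iteration_loss(c_final, gold_roles, params, lam=0.0004)
worse = final_iteration_loss([[0.6, 0.4], [0.5, 0.5]], gold_roles, params, lam=0.0004)
```

Intermediate distributions $c^{(t)}$ for $t < T$ do not appear in the loss, so the earlier iterations are free to produce whatever is most useful for the final one.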
Uniqueness Constraint Assumption. As we discussed earlier, for a given target predicate, each semantic role will typically appear at most once. To encode this intuition, we propose another loss term $L_u(\theta)$, defined over $b^{(T)}_{j|\cdot}$, the semantic role logits at the $T$-th iteration. The final loss $L^*(\theta)$ is the combination of the two losses, where $\eta$ is a discount coefficient.
[...] for all other languages on both the baseline model and the proposed CapsuleNet. The LSTM state dimension $d_l$ is 500. The capsule size $K$ is 16. The batch size is 32. The coefficient for the regularization term $\lambda$ is 0.0004. We employ Adam (Kingma and Ba, 2015) as the optimizer, and the initial learning rate $\alpha$ is set to 0.0001. Syntactic information is not utilized in our model.
Table 1 shows the performance of our model trained with loss $L^*(\theta)$ for different values of the discount coefficient $\eta$ on the English development set. The model achieves the best performance when $\eta$ equals 0, which implies that adding the uniqueness constraint to the loss actually hurts performance. Thus, we use the loss $L^*(\theta)$ with $\eta = 0$ in the rest of the experiments, which is equivalent to the loss $L(\theta)$ in Eq (16). We also observe that the model with 2 refinement iterations performs the best (89.92% F1).
Table 2 (partial): F1 (%) of SRL systems on the English in-domain and out-of-domain sets. Entries whose system names did not survive are marked [...].
System                       In-domain   Out-of-domain
[...] (2015)                 87.70       75.50
Foland and Martin (2015)     86.00       75.90
Roth and Lapata (2016)       87.90       76.50
Swayamdipta et al. (2016)    85.00       -
[...]                        87.70       77.70
[...]                        89.10       78.90
[...]                        89.50       79.30
[...]                        89.80       79.80
[...]                        89.60       79.00
Mulcaire et al. (2018)       87 [...]

Overall Results
Table 2 compares our model with previous state-of-the-art SRL systems on English. Some of the systems (Lei et al., 2015) only use local features, whereas others (Swayamdipta et al., 2016) incorporate global information at the expense of greedy decoding. Additionally, a number of systems exploit syntactic information (Roth and Lapata, 2016). Some of the results are obtained with ensemble systems (FitzGerald et al., 2015; Roth and Lapata, 2016). As we observe, the baseline model (see Section 2) is quite strong, and outperforms the previous state-of-the-art systems on both in-domain and out-of-domain sets on English. The proposed CapsuleNet outperforms the baseline model on English (e.g. obtaining 91.06% F1 on the English test set), which shows the effectiveness of the capsule network framework. The improvement on the out-of-domain set implies that our model is robust to domain variations. Table 3 gives the performance of models with ablation on some key components, which shows the contribution of each component in our model.

Ablation
• The model without the global node is described in Section 3.
• The model that further removes the role capsule layer takes the mean over the components of the capsule $u_{j|i}$ (Eq (7)) in the word capsule layer as the semantic role logit: $b_{j|i} = \frac{1}{K} \sum_{k=1}^{K} u^{k}_{j|i}$, where $K$ denotes the capsule size.
• The model that additionally removes the word capsule layer is exactly equivalent to the baseline model described in Section 2.
As we observe, CapsuleNet with all components performs the best on both the development and test sets on English. The model without the global node performs well too: it obtains 91.05% F1 on the English test set, almost the same performance as the full CapsuleNet. On the English out-of-domain set, however, removing the global node drops performance from 82.72% F1 to 82.36% F1, which suggests that the global node helps the model generalize. Further, once the role capsule layer is removed, the performance drops sharply. Note that the model without the role capsule layer does not use refinement and hence does not model argument interaction. It takes the mean of capsules $u_{j|i}$ in the word capsule layer as the semantic role logits $b_{j|i}$ (see Eq (19)), and hence could be viewed as an enhanced ('ensembled') version of the baseline model. Note that we only introduced a very limited number of parameters for the dynamic routing mechanism (see Eq (11-13)). This suggests that the dynamic routing mechanism genuinely captures interactions between argument labeling decisions and that the performance benefits from the refinement process.

Error Analysis
The performance of our model while varying the sentence length and the number of arguments per proposition is shown in Figure 3. The statistics of the subsets are in Table 4. Our model consistently outperforms the baseline model, except on sentences of between 50 and 59 words. Note that this subset is small (only 391 sentences), so the effect may be random.
[...] node reduces the precision as the number of iterations grows. This model is less transparent, so it is harder to see why it chooses this refinement strategy. As expected, the F1 scores for both models peak at the second iteration, the iteration number used in the training phase. The trend for the exact match score is consistent with the F1 score.

Duplicate Arguments
We measure the degree to which the role uniqueness constraint is violated by both models. We plot the number of violations as a function of the number of iterations (Figure 5). Recall that violations do not imply errors in predictions, as even the gold-standard data does not always satisfy the constraint (see the orange line in the figure). As expected, CapsuleNet without the global node captures competition and focuses on enforcing this constraint. Interestingly, it converges to keeping the same fraction of violations as the one observed in the gold-standard data. In contrast, the full model increasingly ignores the constraint in later iterations. This is consistent with the overgeneration trend evident from the precision and recall plots discussed above.
Figure 6 illustrates how many roles get changed between consecutive iterations of CapsuleNet, with and without the global node. Green indicates how many correct changes have been made, while red shows how many errors have been introduced. Since the number of changes is very large, we represent each non-zero number $q$ in log space: $\tilde{q} = \mathrm{sign}(q) \log(|q|)$.
Figure 6: Changes in labeling between consecutive iterations on the English development set, for (a) Capsule Network w/o Global Node and (b) Capsule Network with Global Node. Only argument types appearing more than 50 times are listed. The "None" type denotes "NOT an argument". Green and red nodes denote the numbers of correct and wrong role transitions, respectively. The numbers are transformed into log space.
[...] Roth and Lapata (2016), Czech (Cz) is from Henderson et al. (2013), Chinese (Zh) is from [...] and English (En) is from Li et al. (2019b).
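The two quantities used in this analysis can be sketched as follows. How violations are counted is not specified in the text; the sketch assumes that each extra occurrence of a non-"None" role within one proposition counts as one violation. The signed-log transform follows the formula $\tilde{q} = \mathrm{sign}(q) \log(|q|)$, with zero mapped to zero by convention since the formula is only stated for non-zero $q$. Function names are hypothetical.

```python
import math
from collections import Counter


def count_uniqueness_violations(roles, none_label="None"):
    """Count extra occurrences of each role in one proposition (assumed metric)."""
    counts = Counter(r for r in roles if r != none_label)
    return sum(n - 1 for n in counts.values() if n > 1)


def signed_log(q):
    """sign(q) * log(|q|) for non-zero q; zero is mapped to 0.0 by convention."""
    if q == 0:
        return 0.0
    sign = 1.0 if q > 0 else -1.0
    return sign * math.log(abs(q))


# Toy example: Arg0 is assigned twice for the same predicate.
predicted = ["Arg0", "Arg0", "Arg1", "None", "None"]
violations = count_uniqueness_violations(predicted)
compressed = [signed_log(q) for q in (100, -100, 0)]
```

The signed form keeps the direction of a change while compressing its magnitude, so large and small counts can share one color scale.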
As we expected, for both models, the majority of changes are correct, leading to better overall performance. We can again see the same trends. CapsuleNet without the global node tends to filter out arguments by changing them to "None". The reverse is true for the full model.
Table 5 gives the results of the proposed CapsuleNet SRL (with global node) on the in-domain test sets of all languages from CoNLL-2009. As shown in Table 5, the proposed model consistently outperforms the non-refinement baseline model and achieves state-of-the-art performance on Catalan (Ca), Czech (Cz), English (En), Japanese (Jp) and Spanish (Es). Interestingly, the effectiveness of the refinement method does not seem to depend on the dataset size: the improvements on the smallest (Japanese) and the largest (English) datasets are among the largest.

Additional Related Work
Capsule networks have recently been applied to a number of NLP tasks (Xiao et al., 2018; Gong et al., 2018). In particular, Yang et al. (2018) represent text classification labels by a layer of capsules, and take the capsule actions as the classification probability. Using a similar method, Xia et al. (2018) perform intent detection with capsule networks.  and Li et al. (2019a) use capsule networks to capture rich features for machine translation. More closely related to our work,  and Zhang et al. (2019) adopt capsule networks for relation extraction. These previous models apply capsule networks to problems that have a fixed number of components in the output, so their approach cannot be directly applied to our setting.

Conclusions & Future Work
State-of-the-art dependency SRL methods do not account for any interaction between role labeling decisions. We propose an iterative approach to SRL. In each iteration, we refine both the predictions of the semantic structure (i.e. a discrete structure) and a distributed representation of the proposition (i.e. the predicate and its predicted arguments). The iterative refinement process lets the model capture interactions between the decisions. We relied on capsule networks to operationalize this intuition. We demonstrate that our model is effective, yielding improvements over a strong factorized baseline and state-of-the-art results on standard benchmarks for 5 languages (Catalan, Czech, English, Japanese and Spanish) from CoNLL-2009. In future work, we would like to extend the approach to modeling interaction between multiple predicate-argument structures in a sentence, as well as to other semantic formalisms (e.g., abstract meaning representations (Banarescu et al., 2013)).