Neural-Davidsonian Semantic Proto-role Labeling

We present a model for semantic proto-role labeling (SPRL) using an adapted bidirectional LSTM encoding strategy that we call "Neural-Davidsonian": predicate-argument structure is represented as pairs of hidden states corresponding to predicate and argument head tokens of the input sequence. We demonstrate: (1) state-of-the-art results in SPRL, and (2) that our network naturally shares parameters between attributes, allowing for learning new attribute types with limited added supervision.


Introduction
Universal Decompositional Semantics (UDS) (White et al., 2016) is a contemporary semantic representation of text (Abend and Rappoport, 2017) that forgoes traditional inventories of semantic categories in favor of bundles of simple, interpretable properties. In particular, UDS includes a practical implementation of Dowty's theory of thematic proto-roles (Dowty, 1991): arguments are labeled with properties typical of Dowty's proto-agent (AWARENESS, VOLITION ...) and proto-patient (CHANGED STATE ...).
Annotated corpora have allowed the exploration of Semantic Proto-role Labeling (SPRL) as a natural language processing task (Reisinger et al., 2015; White et al., 2016; Teichert et al., 2017). For example, consider the following sentence, in which a particular pair of predicate and argument heads have been emphasized: "The cat ate the rat." An SPRL system must infer from the context of the sentence whether the rat had VOLITION, CHANGED-STATE, and EXISTED-AFTER the eating event (see Table 2 for more properties).
We present an intuitive neural model that achieves state-of-the-art performance for SPRL. As depicted in Figure 1, our model's architecture is an extension of the bidirectional LSTM, capturing a Neo-Davidsonian-like intuition, wherein select pairs of hidden states are concatenated to yield a dense representation of predicate-argument structure and fed to a prediction layer for end-to-end training. We include a thorough quantitative analysis highlighting the contrasting errors between the proposed model and the previous (non-neural) state of the art. In addition, our network naturally shares a subset of parameters between attributes. We demonstrate how this allows learning to predict new attributes with limited supervision: a key finding that could support efficient expansion of new SPR attribute types in the future.

Figure 1: BiLSTM sentence encoder with SPR decoder. Semantic proto-role labeling is with respect to a specific predicate and argument within a sentence, so the decoder receives the two corresponding hidden states.

Table 1: Example SPR annotations for the toy example "The cat ate the rat," where the Predicate in question is "ate" and the Argument in question is either "cat" or "rat." Note that not all SPR properties are listed, and the binary labels (✓, ✗) are coarsened from a 5-point Likert scale.
SPR Property: Explanation of Property
INSTIGATION: Arg caused the Pred to happen? ✗
VOLITIONAL: Arg chose to be involved in the Pred? ✗
AWARE: Arg was/were aware of being involved in the Pred? ✓
PHYSICALLY EXISTED: Arg existed as a physical object? ✓
EXISTED AFTER: Arg existed after the Pred stopped? ✗
CHANGED STATE: The Arg was/were altered or somehow changed during or by the end of the Pred? ✓

Background
Davidson (1967) is credited with representations of meaning involving propositions composed of a fixed-arity predicate, all of its core arguments arising from the natural language syntax, and a distinguished event variable. The earlier example could thus be denoted (modulo tense) as (∃e)[eat(e, CAT, RAT)], where the variable e is a reification of the eating event. The order of the arguments in the predication implies their role, where leaving arguments unspecified (as in "The cat eats") can be handled either by introducing variables for unstated arguments, e.g., (∃e)(∃x)[eat(e, CAT, x)], or by creating new predicates that correspond to different arities, e.g., (∃e)[eat_intransitive(e, CAT)].³ The Neo-Davidsonian approach (Castañeda, 1967; Parsons, 1995), which we follow in this work, allows for variable arity by mapping the argument positions of individual predicates to generalized semantic roles, shared across predicates,⁴ e.g., AGENT, PATIENT and THEME, in: (∃e)[eat(e) ∧ Agent(e, CAT) ∧ Patient(e, RAT)].
Dowty (1991) conjectured that the distinction between the role of a prototypical Agent and prototypical Patient could be decomposed into a number of semantic properties such as "Did the argument change state?". Here we formulate this
³ This formalism aligns with that used in PropBank (Palmer et al., 2005), which associated numbered, core arguments with each sense of a verb in their corpus annotation.
⁴ For example, as seen in FrameNet (Baker et al., 1998

"Neural-Davidsonian" Model
Our proposed SPRL model (Fig. 1) determines the value of each attribute (e.g., VOLITION) on an argument (a) with respect to a particular predication (e) as a function of the latent states associated with the pair, (e, a), in the context of a full sentence. Our architecture encodes the sentence using a shared, one-layer, bidirectional LSTM (Hochreiter and Schmidhuber, 1997; Graves et al., 2013). We then obtain a continuous vector representation $h_{ea} = [h_e; h_a]$ for each predicate-argument pair as the concatenation of the hidden BiLSTM states $h_e$ and $h_a$ corresponding to the syntactic heads of the predicate of e and the argument a, respectively. These heads are obtained over gold syntactic parses using the predicate-argument detection tool PredPatt (White et al., 2016). For each SPR attribute, a score $\mathrm{Score}(\mathrm{attr}, h_{ea})$ is predicted by passing $h_{ea}$ through a separate two-layer perceptron, with the weights of the first layer shared across all attributes.
This architecture accommodates the definition of SPRL as multi-label binary classification given by Teichert et al. (2017) by treating the score as the log-odds of the attribute being present, i.e., $P(\mathrm{attr} \mid h_{ea}) = \frac{1}{1 + \exp[-\mathrm{Score}(\mathrm{attr}, h_{ea})]}$. This architecture also supports SPRL as a scalar regression task, where the parameters of the network are tuned to directly minimize the discrepancy between the predicted score and a reference scalar label. The losses for the binary and scalar models are negative log-probability and squared error, respectively; the losses are summed over all SPR attributes.
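The decoder described above can be sketched in a few lines of dependency-free Python. This is an illustrative sketch only, not the authors' PyTorch implementation: the class and function names, the tanh nonlinearity, the presence of bias terms, and the random initialization are all assumptions, and the BiLSTM states are simply passed in as lists of floats.

```python
import math
import random

def linear(weights, bias, x):
    """Return weights @ x + bias for a list-of-rows weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def init(rows, cols, rng):
    """Small random weight matrix plus zero bias (illustrative initialization)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)], [0.0] * rows

class SPRDecoder:
    """Two-layer perceptron per attribute; the first layer is shared by all attributes."""

    def __init__(self, state_dim, hidden_dim, attributes, seed=0):
        rng = random.Random(seed)
        # Shared first layer over the concatenation [h_e; h_a].
        self.shared_w, self.shared_b = init(hidden_dim, 2 * state_dim, rng)
        # One attribute-specific output layer (a single row) per SPR attribute.
        self.attr_layers = {a: init(1, hidden_dim, rng) for a in attributes}

    def scores(self, h_e, h_a):
        h_ea = h_e + h_a  # list concatenation, i.e. [h_e; h_a]
        # tanh is an assumed nonlinearity; the text does not name one.
        hidden = [math.tanh(z) for z in linear(self.shared_w, self.shared_b, h_ea)]
        return {a: linear(w, b, hidden)[0] for a, (w, b) in self.attr_layers.items()}

def sigmoid(score):
    """Interpret a score as log-odds: P(attr | h_ea) = 1 / (1 + exp(-score))."""
    return 1.0 / (1.0 + math.exp(-score))

def binary_loss(scores, gold):
    """Negative log-probability summed over attributes; gold maps attr -> bool."""
    total = 0.0
    for attr, present in gold.items():
        p = sigmoid(scores[attr])
        total -= math.log(p if present else 1.0 - p)
    return total

def scalar_loss(scores, gold):
    """Squared error summed over attributes; gold maps attr -> float."""
    return sum((scores[a] - y) ** 2 for a, y in gold.items())
```

Because only the output rows in `attr_layers` are attribute-specific, supporting a new attribute in this sketch adds a single small layer while reusing the shared first layer and the encoder.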
Training with Auxiliary Tasks
A benefit of the shared neural-Davidsonian representation is that it offers many levels at which multi-task learning may be leveraged to improve parameter estimation so as to produce semantically rich representations $h_{ea}$, $h_e$, and $h_a$. For example, the sentence encoder might be pre-trained as an encoder for machine translation, the argument representation, $h_a$, could be jointly trained to predict word sense, the predicate representation, $h_e$, could be jointly trained to predict factuality (Saurí and Pustejovsky, 2009; Rudinger et al., 2018), and the predicate-argument representation, $h_{ea}$, could be jointly trained to predict other semantic role formalisms (e.g., PropBank SRL, suggesting a neural-Davidsonian SRL model in contrast to recent BIO-style neural models of SRL (He et al., 2017)).
To evaluate this idea empirically, we experimented with a number of multi-task training strategies for SPRL. While all settings outperformed prior work in aggregate, simply initializing the BiLSTM parameters with a pretrained English-to-French machine translation encoder produced the best results, so we simplify discussion by focusing on that model. The efficacy of MT pretraining that we observe here comes as no surprise given prior work demonstrating, e.g., the utility of bitext for paraphrase (Ganitkevitch et al., 2013), that NMT pretraining yields improved contextualized word embeddings (McCann et al., 2017), and that NMT encoders specifically capture useful features for SPRL (Poliak et al., 2018).
Full details about each multi-task experiment, including a full set of ablation results, are reported in Appendix A; details about the corresponding datasets are in Appendix B.
Except in the ablation experiment of Figure 2, our model was trained on only the SPRL data and splits used by Teichert et al. (2017) (learning all properties jointly), using GloVe embeddings and with the MT-initialized BiLSTM. Models were implemented in PyTorch and trained end-to-end with Adam optimization (Kingma and Ba, 2014) and a default learning rate of $10^{-3}$. Each model was trained for ten epochs, selecting the best-performing epoch on dev.
Prior Work in SPRL
We additionally include results from prior work: "LR" is the logistic regression model introduced by Reisinger et al. (2015) and "CRF" is the CRF model (specifically SPRL⋆) from Teichert et al. (2017). Although White et al. (2016) released additional SPR annotations, we are unaware of any benchmark results on that data; however, our multi-task results in Appendix A do use the data and we find (unsurprisingly) that concurrent training on the two SPR datasets can be helpful. Using only data and splits from White et al. (2016), the scalar regression architecture of Table 6 achieves a Pearson's ρ of 0.577 on test.
There are a few noteworthy differences between our neural model and the CRF of prior work. As an adapted BiLSTM, our model easily ex-
2017) trained on the $10^9$ Fr-En corpus (Callison-Burch et al., 2009)
Table 3: ... (2) argument is a proper noun; (3) argument is an organization or institution; (4) argument is a pronoun; (5) predicate is phrasal or a particle verb construction; (6) predicate is used metaphorically; (7) predicate is a light-verb construction. #DIFFER is the size of the respective subset.

Experiments
original SPR annotations on a 5-point Likert scale, instead of a binary cut-point along that scale (> 3).
Manual Analysis
We select two properties (VOLITION and MAKES PHYSICAL CONTACT) to perform a manual error analysis with respect to CRF¹¹ and our binary model from Table 2. For each property, we sample 40 dev instances with gold labels of "True" (> 3) and 40 instances of "False" (≤ 3), restricted to cases where the two system predictions disagree. We manually label each of these instances for the six features shown in Table 3. For example, given the input "He sits down at the piano and plays," our neural model correctly predicts that He makes physical contact during the sitting, while CRF does not. Since He is a pronoun, and sits down is phrasal, this example contributes −1 to ΔFALSE− in rows 1, 4 and 5.
For both properties, our model appears more likely to correctly classify the argument in cases where the predicate is a phrasal verb. This is likely a result of the fact that the BiLSTM has stronger language-modeling capabilities than the CRF, particularly with MT pretraining. In general, our model increases the false-positive rate for MAKES PHYSICAL CONTACT, but especially when the argument is pronominal.
¹¹ We obtained the CRF dev system predictions of Teichert et al. (2017)
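The sampling procedure used for the manual analysis (binarize the Likert rating at > 3, keep only instances where the two systems disagree, then draw 40 per gold label) can be sketched as follows. The dictionary keys `likert`, `crf`, and `neural`, the function names, and the fixed seed are hypothetical and chosen only for illustration.

```python
import random

def binarize(likert, cut_point=3):
    """Coarsen a 5-point Likert rating into a binary label (True if > cut_point)."""
    return likert > cut_point

def sample_disagreements(instances, n_per_label=40, seed=0):
    """Sample up to n_per_label gold-True and gold-False instances where two systems disagree.

    Each instance is a dict with hypothetical keys 'likert' (gold rating),
    'crf' and 'neural' (binary system predictions).
    """
    rng = random.Random(seed)
    differ = [x for x in instances if x["crf"] != x["neural"]]
    sampled = {}
    for label in (True, False):
        pool = [x for x in differ if binarize(x["likert"]) == label]
        sampled[label] = rng.sample(pool, min(n_per_label, len(pool)))
    return sampled
```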
Learning New SPR Properties
One motivation for the decompositional approach adopted by SPRL is the ability to incrementally build up an inventory of annotated properties according to need and budget. Here we investigate (1) the degree to which having less training data for a single property degrades our F1 for that property on held-out data and (2) the effect on degradation of concurrent training with the other properties. We focus on two properties only: INSTIGATION, a canonical example of a proto-agent property, and MANIPULATED, which is a proto-patient property. For each we consider six training set sizes (1, 5, 10, 25, 50 and 100 percent of the instances). Starting with the same randomly initialized BiLSTM, we consider two training scenarios: (1) ignoring the remaining properties or (2) including the model's loss on other properties with a weight of λ = 0.1 in the training objective.
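The two training scenarios differ only in how the per-property losses are combined. A minimal sketch, assuming per-property losses have already been computed and using hypothetical function names, is:

```python
import random

# Training set sizes listed in the text, as fractions of the target-property instances.
TRAIN_FRACTIONS = [0.01, 0.05, 0.10, 0.25, 0.50, 1.00]

def subsample(instances, fraction, seed=0):
    """Keep a random `fraction` of the target-property training instances."""
    rng = random.Random(seed)
    k = max(1, round(fraction * len(instances)))
    return rng.sample(instances, k)

def training_objective(per_property_loss, target, other_weight=0.1):
    """Target-property loss plus the down-weighted loss on all other properties.

    Setting other_weight=0.0 recovers the "target property only" scenario.
    """
    other = sum(loss for prop, loss in per_property_loss.items() if prop != target)
    return per_property_loss[target] + other_weight * other
```

How the subsample is drawn (uniformly at random here) is an assumption; the text specifies only the six sizes.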
Results are presented in Figure 2. We see that, in every case, most of the performance is achieved with only 25% of the training data. The curves also suggest that training simultaneously on all SPR properties allows the model to learn the target property more quickly (i.e., with fewer training samples) than if trained on that property in isolation. For example, at 5% of the training data, the "all properties" models achieve roughly the same F1 on their respective target property as the "target property only" models achieve at 50% of the data. As the SPR properties currently annotated are by no means semantically exhaustive, this experiment indicates that future annotation efforts may be well served by favoring breadth over depth, collecting smaller numbers of examples for a larger set of attributes.

Conclusion
Inspired by (1) the SPR decomposition of predicate-argument relations into overlapping feature bundles and (2) the neo-Davidsonian formalism for variable-arity predicates, we have proposed a straightforward extension to a BiLSTM classification framework in which the states of pre-identified predicate and argument tokens are pairwise concatenated and used as the target for SPR prediction. We have shown that our Neural-Davidsonian model outperforms the prior state of the art in aggregate and shows especially large gains for the properties CHANGED-POSSESSION, STATIONARY, and LOCATION. Our architecture naturally supports discrete or continuous label paradigms, lends itself to multi-task initialization or concurrent training, and allows for parameter sharing across properties. We demonstrated that this sharing may be useful when some properties are only sparsely annotated in the training data, which is suggestive of future work in efficiently increasing the range of annotated SPR property types.

A Multi-Task Investigation
Multi-task learning has been found to improve performance on many NLP tasks, particularly for neural models, and is rapidly becoming de rigueur in the field. The strategy involves optimizing for multiple training objectives corresponding to different (but usually related) tasks. Collobert and Weston (2008) use multi-task learning to train a convolutional neural network to perform multiple core NLP tasks (POS tagging, named entity recognition, etc.). Multi-task learning has also been used to improve sentence compression (Klerke et al., 2016), chunking and dependency parsing (Hashimoto et al., 2017). Related work on UDS (White et al., 2016) shows improvements on event factuality prediction with multi-task learning on BiLSTM models (Rudinger et al., 2018). To complement the basic experiments reported in the main text, here we include an investigation of the impact of multi-task learning for SPRL.
We borrow insights from Mou et al. (2016), who explore different multi-task strategies for NLP, including the approach of initializing a network by training it on a related task ("INIT") versus interspersing tasks during training ("MULT"). Here we employ both of these strategies, referring to them as pretraining and concurrent training. We also use the terminology target task and auxiliary task to differentiate the primary task(s) we are interested in from those that play only a supporting role in training.
In order to tune the impact of auxiliary tasks on the learned representation, Luong et al. (2016) use a mixing parameter, $\alpha_i$, for each task $i$. Each parameter update consists of selecting a task with probability proportional to its $\alpha_i$ and then performing one update with respect to that task alone. They show that the choice of $\alpha$ has a large impact on the effect of multi-task training, which influences our experiments here.
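The task-mixing scheme attributed to Luong et al. (2016) above (pick one task per update with probability proportional to its mixing parameter) can be sketched as follows. The function names and the example task names are hypothetical, and the actual parameter update is left abstract.

```python
import random

def sample_task(mixing, rng):
    """Pick a task with probability proportional to its mixing parameter alpha_i."""
    tasks = list(mixing)
    return rng.choices(tasks, weights=[mixing[t] for t in tasks], k=1)[0]

def training_schedule(mixing, num_updates, seed=0):
    """Return the sequence of tasks visited; each entry stands for one parameter update."""
    rng = random.Random(seed)
    return [sample_task(mixing, rng) for _ in range(num_updates)]
```

For example, `training_schedule({"spr": 3.0, "wsd": 1.0}, 1000)` visits the hypothetical `spr` task for roughly three quarters of the updates.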
Please refer to Appendix B for details on the datasets used in this section. In particular, with a few exceptions, White et al. (2016)
In addition to the binary and scalar SPR architectures outlined in Section 3 of the main paper, we also considered concurrently training the BiLSTM on a fine-grained word-sense disambiguation task or on joint SPR1 and SPR2 prediction. We also experimented with using machine translation and PropBank SRL to initialize the parameters of the BiLSTM. Preliminary experimentation on dev data with other combinations helped prune down the set of interesting experiments to those listed in Table 4, which assigns names to the models explored here. Our ablation study in Section 4 of the main paper uses the model named SPR1, while the other results in the main paper correspond to MT:SPR1 in the case of binary prediction and MT:SPR1S in the case of scalar prediction. After detailing the additional components used for pretraining or concurrent training, we present aggregate results and, for the best-performing models (according to dev), property-level results.

A.1 Auxiliary Tasks
Each auxiliary task is implemented in the form of a task-specific decoder with access to the hidden states computed by the shared BiLSTM encoder. In this way, the losses from these tasks backpropagate through the BiLSTM. Here we describe each task-specific decoder.
PropBank Decoder
The network architecture for the auxiliary task of predicting abstract role types in PropBank is nearly identical to the architecture for SPRL described in Section 3 of the main paper. The main difference is that the PropBank task is a single-label, categorical classification task.
The loss from this decoder is the negative log of the probability assigned to the correct label.

Supersense Decoder
The word sense disambiguation decoder computes a probability distribution over 26 WordNet supersenses with a simple single-layer feedforward network applied to $h_a$, the RNN hidden state corresponding to the argument head token we wish to disambiguate, with weights $W \in \mathbb{R}^{1200 \times 26}$. Since the gold label in the supersense prediction task is a distribution over supersenses, the loss from this decoder is the cross-entropy between its predicted distribution and the gold distribution.
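A minimal sketch of such a decoder and its loss is given below. It assumes a softmax output and omits any bias term, since the text names only the weight matrix; the function names are hypothetical, and the toy dimensions in any usage stand in for the 1200-dimensional state and 26 supersenses.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def supersense_distribution(W, h_a):
    """Single-layer decoder: softmax over h_a projected by W (state_dim x num_senses)."""
    num_senses = len(W[0])
    logits = [sum(h_a[i] * W[i][k] for i in range(len(h_a))) for k in range(num_senses)]
    return softmax(logits)

def cross_entropy(predicted, gold):
    """Cross-entropy between a gold distribution and a predicted distribution."""
    return -sum(g * math.log(p) for g, p in zip(gold, predicted) if g > 0.0)
```

When the gold distribution puts all its mass on one supersense, this loss reduces to the usual negative log-probability of that supersense.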
French Translation Decoder
Given the encoder hidden states, the goal of translation is to generate the reference sequence of tokens $Y = y_1, \cdots, y_n$ in the target language, i.e., French. We employ the standard decoder architecture for neural machine translation. At each time step $i$, the probability distribution of the decoded token $y_i$ is computed from the decoder hidden state $s_i$ and the context vector $c_i$, using a transform matrix $W_{fr}$ and a bias $b_{fr}$. The decoder hidden state $s_i$ is computed by a recurrent neural network (RNN) using an $L$-layer stacked LSTM, whose input is the word embedding of token $y_{i-1}$; $s_0$ is initialized by the last encoder left-to-right hidden state. The context vector $c_i$ is computed by an attention mechanism (Bahdanau et al., 2014; Luong et al., 2015) with a transform matrix $W_\alpha$ and a bias $b_\alpha$. The loss is the negative log-probability of the decoded sequence.
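The text names the symbols of this decoder without showing its equations. One standard formulation consistent with those symbols is sketched below. The softmax output layer, the concatenation $[s_i; c_i]$, the previous-state argument $s_{i-1}$, the encoder states $h_t$, and in particular the form of the attention score are assumptions drawn from common attention-based NMT decoders, not a statement of the authors' exact equations.

```latex
\begin{align}
P(y_i \mid y_{<i}, c_i) &= \mathrm{softmax}\left(W_{fr}\,[s_i; c_i] + b_{fr}\right) \\
s_i &= \mathrm{RNN}(y_{i-1}, s_{i-1}) \\
c_i &= \sum_t a_{i,t}\, h_t, \qquad
a_{i,t} = \frac{\exp\left(s_i^{\top}(W_\alpha h_t + b_\alpha)\right)}
               {\sum_{t'} \exp\left(s_i^{\top}(W_\alpha h_{t'} + b_\alpha)\right)} \\
\mathcal{L} &= -\sum_{i=1}^{n} \log P(y_i \mid y_{<i}, c_i)
\end{align}
```

Here $y_{i-1}$ inside the RNN denotes the word embedding of the previous token, following the text's own notation.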

A.2 Results
In this section, we present a series of experiments using different components of the neural architecture described in Section 3, with various training regimes. Each experimental setting is given a name (in SMALLCAPS) and summarized in Table 4. Unless otherwise stated, the target task is SPR1 (classification). To ease comparison, we include results from the main paper as well as additional results.
Experiment 0: Embeddings
By default, all models reported in this paper employ pretrained word embeddings (GloVe). In this experiment we replaced the pretrained embeddings in the vanilla
The auxiliary task loss is further downweighted by a hyperparameter $\lambda \in \{1, 10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}\}$, which is chosen based on dev results. We apply this training regime with the auxiliary task of Supersense prediction (SPR1+WSD) and the scalar SPR2 prediction task (SPR1+SPR2), described in Experiment 2.
Experiment 1c: Multi-task Combination
This setting is identical to Experiment 1b, but includes MT pretraining (the best-performing pretraining setting on dev), as described in 1a. Accordingly, the two experiments are MT:SPR1+WSD and MT:SPR1+SPR2.
Experiment 1d: Property-Specific Model Selection (PS-MS)
Experiments 1a-1c consider a variety of pretraining tasks, co-training tasks, and weight values, λ, in an effort to improve aggregate F1 for SPR1. However, the SPR properties are diverse, and we expect to find gains by choosing training settings on a property-specific basis. Here, for each property, we select from the set of models considered in Experiments 1a-1c the one that achieves the highest dev F1 for the target property. We report the results of applying those property-specific models to the test data.
Table 8: SPR1 and SPR2 as scalar prediction tasks. The overall performance for each experimental setting is reported as the average Pearson correlation over all properties. Highest SPR1 and SPR2 results are in bold.
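The property-specific model selection of Experiment 1d amounts to an argmax over models on dev, taken separately for each property. A minimal sketch, assuming dev and test F1 are available as nested dictionaries and using hypothetical function names, is:

```python
def select_per_property(dev_f1):
    """Map each property to the model with the highest dev F1 for that property.

    dev_f1: {model_name: {property: dev F1}}; ties go to the first model listed.
    """
    properties = next(iter(dev_f1.values())).keys()
    return {p: max(dev_f1, key=lambda m: dev_f1[m][p]) for p in properties}

def apply_selection(selection, test_f1):
    """Report, per property, the test F1 of the model that was chosen on dev."""
    return {p: test_f1[m][p] for p, m in selection.items()}
```

Selecting on dev and reporting on test keeps the test data out of the model-selection loop, as the text describes.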
Experiment 2: SPR as a scalar task
In Experiment 2, we trained the SPR decoder to predict properties as scalar instead of binary values. Performance is measured by Pearson correlation and reported in Tables 7 and 8. In this case, we treat SPR1 and SPR2 both as target tasks (separately). By including SPR1 as a target task, we are able to compare (1) SPR as a binary task and a scalar task, as well as (2) SPR1 and SPR2 as scalar tasks. These results constitute the first reported numbers on SPR2.
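The evaluation metric for the scalar setting, Pearson correlation per property averaged over all properties, can be computed as in the following sketch (function names are hypothetical; the sketch assumes neither sequence is constant, so the denominator is nonzero).

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def average_pearson(predictions, gold):
    """Average per-property Pearson correlation; both map property -> list of scalars."""
    scores = [pearson(predictions[p], gold[p]) for p in gold]
    return sum(scores) / len(scores)
```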
We observe a few trends. First, it is generally the case that properties with high F1 on the SPR1 binary task also have high Pearson correlation on the SPR1 scalar task. The higher-scoring properties in SPR1 scalar are also generally the higher-scoring properties in SPR2 (where the SPR1 and SPR2 properties overlap), with a few notable exceptions, like INSTIGATION. Overall, correlation values are lower in SPR2 than SPR1. This may be the case for a few reasons: (1) the underlying data in SPR1 and SPR2 are quite different, as the former consists of sentences from the Wall Street Journal via PropBank (Palmer et al., 2005), while the latter consists of sentences from the English Web Treebank (Bies et al., 2012) via the Universal Dependencies; (2) certain filters were applied in the construction of the SPR1 dataset to remove instances where, e.g., predicates were embedded in a clause, possibly resulting in an easier task; (3) SPR1 labels came from a single annotator (after determining in pilot studies that annotations from this annotator correlated well with other annotators), whereas SPR2 labels came from 24 different annotators with scalar labels averaged over two-way redundancy.
Discussion
With SPR1 binary classification as the target task, we see overall improvements from various multi-task training regimes (Experiments 1a-d, Tables 5 and 6), using four different auxiliary tasks: machine translation into French, PropBank abstract role prediction, word sense disambiguation (WordNet supersenses), and SPR2. These auxiliary tasks exhibit a loose trade-off in terms of the quantity of available data and the semantic relatedness of the task: MT is the least related task with the most available (parallel) data, while SPR2 is the most related task with the smallest quantity of data. While we hypothesized that the relatedness of PropBank role labeling and word sense disambiguation tasks might lead to gains in SPR performance, we did not see substantial gains in our experiments (PB:SPR1, SPR1+WSD). We did, however, see improvements over the target-task-only model (SPR1) in the cases where we added MT pretraining (MT:SPR1) or SPR2 concurrent training (SPR1+2). Interestingly, combining MT pretraining with SPR2 concurrent training yielded no further gains (MT:SPR1+2).