Neural Semantic Role Labeling with Dependency Path Embeddings

This paper introduces a novel model for semantic role labeling that makes use of neural sequence modeling techniques. Our approach is motivated by the observation that complex syntactic structures and related phenomena, such as nested subordinations and nominal predicates, are not handled well by existing models. Our model treats such instances as sub-sequences of lexicalized dependency paths and learns suitable embedding representations. We experimentally demonstrate that such embeddings can improve results over previous state-of-the-art semantic role labelers, and showcase qualitative improvements obtained by our method.


Introduction
The goal of semantic role labeling (SRL) is to identify and label the arguments of semantic predicates in a sentence according to a set of predefined relations (e.g., "who" did "what" to "whom"). Semantic roles provide a layer of abstraction beyond syntactic dependency relations, such as subject and object, in that the provided labels are insensitive to syntactic alternations and can also be applied to nominal predicates. Previous work has shown that semantic roles are useful for a wide range of natural language processing tasks, with recent applications including statistical machine translation (Aziz et al., 2011; Xiong et al., 2012), plagiarism detection (Osman et al., 2012; Paul and Jamal, 2015), and multi-document abstractive summarization (Khan et al., 2015).
The task of semantic role labeling was pioneered by Gildea and Jurafsky (2002). In their work, features based on syntactic constituent trees were identified as most valuable for labeling predicate-argument relationships. Later work confirmed the importance of syntactic parse features (Pradhan et al., 2005; Punyakanok et al., 2008) and found that dependency parse trees provide a better form of representation to assign role labels to arguments (Johansson and Nugues, 2008).
Table 1: Outputs of SRL systems for the sentence He had trouble raising funds. Arguments of raise are shown with predicted roles as defined in PropBank (A0: getter of money; A1: money). Asterisks mark flawed analyses that miss the argument He.
Most semantic role labeling approaches to date rely heavily on lexical and syntactic indicator features. Through the availability of large annotated resources, such as PropBank (Palmer et al., 2005), statistical models based on such features achieve high accuracy. However, results often fall short when the input to be labeled involves instances of linguistic phenomena that are relevant for the labeling decision but appear infrequently at training time. Examples include control and raising verbs, nested conjunctions or other recursive structures, as well as rare nominal predicates. The difficulty is that simple lexical and syntactic indicator features are not able to model interactions triggered by such phenomena. For instance, consider the sentence He had trouble raising funds and the analyses provided by four publicly available tools in Table 1 (mate-tools, Björkelund et al. (2010); mateplus, Roth and Woodsend (2014); TensorSRL, Lei et al. (2015); and easySRL, Lewis et al. (2015)). Despite all systems claiming state-of-the-art or competitive performance, none of them is able to correctly identify He as the agent argument of the predicate raise. Given the complex dependency path relation between the predicate and its argument, none of the systems actually identifies He as an argument at all.
In this paper, we develop a new neural network model that can be applied to the task of semantic role labeling. The goal of this model is to better handle control predicates and other phenomena that can be observed from the dependency structure of a sentence. In particular, we aim to model the semantic relationships between a predicate and its arguments by analyzing the dependency path between the predicate word and each argument head word. We consider lexicalized paths, which we decompose into sequences of individual items, namely the words and dependency relations on a path. We then apply long short-term memory networks (Hochreiter and Schmidhuber, 1997) to find a recurrent composition function that can reconstruct an appropriate representation of the full path from its individual parts (Section 2). To ensure that representations are indicative of semantic relationships, we use semantic roles as target labels in a supervised setting (Section 3).
By modeling dependency paths as sequences of words and dependencies, we implicitly address the data sparsity problem. This is the case because we use single words and individual dependency relations as the basic units of our model. In contrast, previous SRL work only considered full syntactic paths. Experiments on the CoNLL-2009 benchmark dataset show that our model is able to outperform the state of the art in English (Section 4), and that it improves SRL performance in other languages, including Chinese, German and Spanish (Section 5).

Dependency Path Embeddings
In the context of neural networks, the term embedding refers to the output of a function $f$ within the network, which transforms an arbitrary input into a real-valued vector output. Word embeddings, for instance, are typically computed by forwarding a one-hot word vector representation from the input layer of a neural network to its first hidden layer, usually by means of matrix multiplication and an optional non-linear function whose parameters are learned during neural network training.
Here, we seek to compute real-valued vector representations for dependency paths between a pair of words $w_i$, $w_j$. We define a dependency path to be the sequence of nodes (representing words) and edges (representing relations between words) to be traversed on a dependency parse tree to get from node $w_i$ to node $w_j$. In the example in Figure 1, the dependency path from raising to he is raising $\xrightarrow{\text{NMOD}}$ trouble $\xrightarrow{\text{OBJ}}$ had $\xleftarrow{\text{SBJ}}$ he. Analogously to how word embeddings are computed, the simplest way to embed paths would be to represent each sequence as a one-hot vector. However, this is suboptimal for two reasons: Firstly, we expect only a subset of dependency paths to be attested frequently in our data and therefore many paths will be too sparse to learn reliable embeddings for them. Secondly, we hypothesize that dependency paths which share the same words, word categories or dependency relations should impact SRL decisions in similar ways. Thus, the words and relations on the path should drive representation learning, rather than the full path on its own. The following sections describe how we address representation learning by means of modeling dependency paths as sequences of items in a recurrent neural network.
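To make the definition of a dependency path concrete, the following minimal sketch walks a toy parse of the example sentence from the predicate up to the lowest common ancestor and down to the argument. The tuple layout, the `ROOT` label, the "up"/"down" direction markers and the function names are assumptions made for illustration only; they are not taken from the system described here.

```python
# Toy parse of "He had trouble raising funds": (word form, head index, relation).
# A head index of -1 marks the root. Relation labels follow the path in the text.
SENTENCE = [
    ("he", 1, "SBJ"),
    ("had", -1, "ROOT"),
    ("trouble", 1, "OBJ"),
    ("raising", 2, "NMOD"),
    ("funds", 3, "OBJ"),
]


def ancestors(sentence, index):
    """Return the node indices from `index` up to the root, inclusive."""
    chain = [index]
    while sentence[chain[-1]][1] != -1:
        chain.append(sentence[chain[-1]][1])
    return chain


def dependency_path(sentence, source, target):
    """Return the path as alternating words and (relation, direction) steps."""
    up = ancestors(sentence, source)
    down = ancestors(sentence, target)
    # The first ancestor of the source that also dominates the target.
    common = next(node for node in up if node in down)
    path = [sentence[source][0]]
    # Climb from the source to the common ancestor.
    for node in up[:up.index(common)]:
        path.append((sentence[node][2], "up"))
        path.append(sentence[sentence[node][1]][0])
    # Descend from the common ancestor to the target.
    for node in reversed(down[:down.index(common)]):
        path.append((sentence[node][2], "down"))
        path.append(sentence[node][0])
    return path


PATH = dependency_path(SENTENCE, 3, 0)  # from "raising" to "he"
```

For the toy parse, the resulting list alternates words and relation steps in the same order as the path given in the running example.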

Recurrent Neural Networks
The recurrent model we use in this work is a variant of the long short-term memory (LSTM) network. It takes a sequence of items $X = x_1, ..., x_n$ as input, recurrently processes one item $x_t \in X$ at a time, and finally returns one embedding state $e_n$ for the complete input sequence. For each time step $t$, the LSTM model updates an internal memory state $m_t$ that depends on the current input as well as the previous memory state $m_{t-1}$. In order to capture long-term dependencies, a so-called gating mechanism controls the extent to which each component of a memory cell state will be modified. In this work, we employ input gates $i$, output gates $o$ and (optional) forget gates $f$. We formalize the state of the network at each time step $t$ in equations (1)-(5). In each equation, $W$ describes a matrix of weights to project information between two layers, $b$ is a layer-specific vector of bias terms, and $\sigma$ is the logistic function. Superscripts indicate the corresponding layers or gates. Some models described in Section 3 do not make use of forget gates or memory-to-gate connections. In case no forget gate is used, we set $f_t = 1$. If no memory-to-gate connections are used, the terms in square brackets in (1), (2), and (4) are replaced by zeros.
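The equations themselves did not survive in this copy of the text. The block below is a tentative reconstruction, assuming a standard LSTM formulation with memory-to-gate connections that is consistent with the description above: five equations, bracketed memory terms in (1), (2) and (4), a logistic function $\sigma$, and $f_t = 1$ when the forget gate is disabled. The weight names (for example $W^{mi}$ for the memory-to-input-gate matrix) and the exact form of (3) and (5) are assumptions and should be checked against the published version.

```latex
\begin{align}
i_t &= \sigma\left([W^{mi} m_{t-1}] + W^{xi} x_t + b^{i}\right) \tag{1}\\
f_t &= \sigma\left([W^{mf} m_{t-1}] + W^{xf} x_t + b^{f}\right) \tag{2}\\
m_t &= i_t \odot \left(W^{xm} x_t\right) + f_t \odot m_{t-1} \tag{3}\\
o_t &= \sigma\left([W^{mo} m_t] + W^{xo} x_t + b^{o}\right) \tag{4}\\
e_t &= o_t \odot \sigma(m_t) \tag{5}
\end{align}
```

Under this reading, $\odot$ denotes element-wise multiplication, and the embedding returned for a full sequence is $e_n$, the value of (5) after the last item.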

Embedding Dependency Paths
We define the embedding of a dependency path to be the final memory output state of a recurrent LSTM layer that takes a path as input, with each input step representing a binary indicator for a part-of-speech tag, a word form, or a dependency relation. In the context of semantic role labeling, we define each path as a sequence from a predicate to its potential argument. Specifically, we define the first item $x_1$ to correspond to the part-of-speech tag of the predicate word $w_i$, followed by its actual word form, and the relation to the next word $w_{i+1}$. The embedding of a dependency path corresponds to the state $e_n$ returned by the LSTM layer after the input of the last item, $x_n$, which corresponds to the word form of the argument head word $w_j$. An example is shown in Figure 2.
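The input sequence described above can be sketched as follows. This is an illustration only: the item prefixes, the direction markers, the Penn-style tags and the assumption that every word on the path (not only the predicate) contributes its tag before its word form are choices made for this sketch.

```python
# Path from the predicate "raising" to the argument head "he", written as
# alternating words and (relation, direction) steps. Tags are illustrative.
PATH = ["raising", ("NMOD", "up"), "trouble", ("OBJ", "up"),
        "had", ("SBJ", "down"), "he"]
POS_TAGS = {"raising": "VBG", "trouble": "NN", "had": "VBD", "he": "PRP"}


def path_to_items(path, pos_tags):
    """Flatten a path into tag, word-form and relation items, predicate first."""
    items = []
    for step in path:
        if isinstance(step, tuple):
            relation, direction = step
            items.append("rel=" + relation + ":" + direction)
        else:
            items.append("pos=" + pos_tags[step])
            items.append("word=" + step)
    return items


def one_hot(item, vocabulary):
    """Binary indicator vector with a single 1 at the item's vocabulary index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(item)] = 1
    return vector


ITEMS = path_to_items(PATH, POS_TAGS)
VOCABULARY = sorted(set(ITEMS))
INPUTS = [one_hot(item, VOCABULARY) for item in ITEMS]
```

The first item is the tag of the predicate and the last item is the word form of the argument head, matching the definition of $x_1$ and $x_n$ in the text.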
The main idea of this model and representation is that word forms, word categories and dependency relations can all influence role labeling decisions. The word category and word form of the predicate first determine which roles are plausible and what kinds of path configurations are to be expected. The relations and words seen on the path can then manipulate these expectations. In Figure 2, for instance, the verb raising complements the phrase had trouble, which makes it likely that the subject he is also the logical subject of raising.
By using word forms, categories and dependency relations as input items, we ensure that specific words (e.g., those which are part of complex predicates) as well as various relation types (e.g., subject and object) can appropriately influence the representation of a path. While learning corresponding interactions, the network is also able to determine which phrases and dependency relations might not influence a role assignment decision (e.g., coordinations).
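As a rough illustration of how a sequence of path items is folded into a single embedding, the sketch below runs a tiny LSTM in pure Python. It assumes a common gate formulation (logistic input, forget and output gates, optional memory-to-gate connections, and an embedding computed as the output gate times the squashed memory); the parameter names, the random initialisation and the per-item weight lookup that stands in for multiplying by a one-hot vector are all assumptions, not the trained model.

```python
import math
import random


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def init_params(vocab_size, size, seed=0):
    """Random toy parameters; input weights are stored as one vector per item."""
    rng = random.Random(seed)

    def vec():
        return [rng.uniform(-0.5, 0.5) for _ in range(size)]

    params = {}
    for name in ("W_xi", "W_xf", "W_xm", "W_xo"):
        params[name] = [vec() for _ in range(vocab_size)]
    for name in ("W_mi", "W_mf", "W_mo"):
        params[name] = [vec() for _ in range(size)]
    for name in ("b_i", "b_f", "b_o"):
        params[name] = vec()
    return params


def gate(params, name, item_id, memory, memory_to_gate):
    """Logistic gate; the memory term is zero when connections are disabled."""
    size = len(memory)
    recurrent = matvec(params["W_m" + name], memory) if memory_to_gate else [0.0] * size
    inputs = params["W_x" + name][item_id]
    return [sigmoid(r + w + b) for r, w, b in zip(recurrent, inputs, params["b_" + name])]


def embed_path(item_ids, params, forget_gate=True, memory_to_gate=True):
    """Return the embedding state after the last item of the path."""
    size = len(params["b_i"])
    memory = [0.0] * size
    embedding = [0.0] * size
    for item_id in item_ids:
        i = gate(params, "i", item_id, memory, memory_to_gate)
        f = gate(params, "f", item_id, memory, memory_to_gate) if forget_gate else [1.0] * size
        candidate = params["W_xm"][item_id]
        memory = [ig * c + fg * m for ig, c, fg, m in zip(i, candidate, f, memory)]
        o = gate(params, "o", item_id, memory, memory_to_gate)
        embedding = [og * sigmoid(m) for og, m in zip(o, memory)]
    return embedding


PARAMS = init_params(vocab_size=11, size=4)
EMBEDDING = embed_path(list(range(11)), PARAMS)
```

Because every item updates the memory state, paths that share words, tags or relations reuse the same item weights, which is the property the text relies on to reduce sparsity.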

Joint Embedding and Feature Learning
Our SRL model consists of four components depicted in Figure 3: (1) an LSTM component takes lexicalized dependency paths as input, (2) an additional input layer takes binary features as input, (3) a hidden layer combines dependency path embeddings and binary features using rectified linear units, and (4) a softmax classification layer produces output based on the hidden layer state as input. We therefore learn path embeddings jointly with feature detectors based on traditional, binary indicator features.
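Components (3) and (4) can be sketched as below, assuming the usual forms of a rectified linear layer and a softmax layer. The weight names, the toy dimensions, the hand-picked weights and the three class labels are hypothetical and serve only to show how a path embedding and binary features are combined.

```python
import math


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def hidden_layer(embedding, binary_features, w_eh, w_bh, b_h):
    """Rectified linear units over the path embedding and the binary features."""
    from_path = matvec(w_eh, embedding)
    from_features = matvec(w_bh, binary_features)
    return [max(0.0, p + q + b) for p, q, b in zip(from_path, from_features, b_h)]


def softmax_output(hidden, w_hs, b_s):
    """Probability for each class, computed from the hidden layer state."""
    scores = [s + b for s, b in zip(matvec(w_hs, hidden), b_s)]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Toy example: 2-dimensional embedding, 3 binary features, 2 hidden units.
E_N = [0.2, 0.7]
B = [1.0, 0.0, 1.0]
W_EH = [[1.0, -1.0], [0.5, 0.5]]
W_BH = [[0.3, 0.0, -0.2], [0.5, 0.2, 0.1]]
B_H = [0.0, 0.0]
CLASSES = ["A0", "A1", "NONE"]
W_HS = [[0.0, 2.0], [1.0, 0.5], [0.0, -1.0]]
B_S = [0.0, 0.0, 0.0]

H = hidden_layer(E_N, B, W_EH, W_BH, B_H)
S = softmax_output(H, W_HS, B_S)
BEST = CLASSES[S.index(max(S))]
```

In a trained model the weights would be learned jointly with the LSTM parameters, so that the hidden units act as detectors over combinations of path embeddings and indicator features.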
Given a dependency path $X$, with steps $x_k \in \{x_1, ..., x_n\}$, and a set of binary features $B$ as input, we use the LSTM formalization from equations (1)-(5) to compute the embedding $e_n$ at time step $n$ and formalize the state of the hidden layer $h$ and softmax output $s_c$ for each class category $c$ as follows: [...]

System Architecture
The overall architecture of our SRL system closely follows that of previous work (Toutanova et al., 2008; Björkelund et al., 2009) and is depicted in Figure 4. We use a pipeline that consists of the following steps: predicate identification and disambiguation, argument identification, argument classification, and re-ranking. The neural-network components introduced in Section 2 are used in the last three steps. The following sub-sections describe all components in more detail.

Predicate Identification and Disambiguation
Given a syntactically analyzed sentence, the first two steps in an end-to-end SRL system are to identify and disambiguate the semantic predicates in the sentence. Here, we focus on verbal and nominal predicates but note that other syntactic categories have also been construed as predicates in the NLP literature (e.g., prepositions; Srikumar and Roth (2013)). For both identification and disambiguation steps, we apply the same logistic regression classifiers used in the SRL components of mate-tools (Björkelund et al., 2010). The classifiers for both tasks make use of a range of lexico-syntactic indicator features, including predicate word form, its predicted part-of-speech tag as well as dependency relations to all syntactic children.

Argument Identification and Classification
Given a sentence and a set of sense-disambiguated predicates in it, the next two steps of our SRL system are to identify all arguments of each predicate and to assign suitable role labels to them.
For both steps, we train several LSTM-based neural network models as described in Section 2. In particular, we train separate networks for nominal and verbal predicates and for identification and classification. Following the findings of earlier work (Xue and Palmer, 2004), we assume that different feature sets are relevant for the respective tasks and hence different embedding representations should be learned. As binary input features, we use the following sets from the SRL literature (Björkelund et al., 2010).
Other features: Relative position of the candidate argument with respect to the predicate (left, self, right); sequence of part-of-speech tags of all words between the predicate and the argument.

Reranker
As all argument identification (and classification) decisions are independent of one another, we apply as the last step of our pipeline a global reranker. Given a predicate $p$, the reranker takes as input the $n$ best sets of identified arguments as well as their $n$ best label assignments and predicts the best overall argument structure. We implement the reranker as a logistic regression classifier, with hidden and embedding layer states of identified arguments as features, offset by the argument label, and a binary label as output (1: best predicted structure, 0: any other structure). At test time, we select the structure with the highest overall score, which we compute as the geometric mean of the global regression and all argument-specific scores.
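The test-time selection rule can be illustrated with a short sketch. It assumes that all scores are probabilities strictly greater than zero; the candidate structures, their names and their scores are made up for the example.

```python
import math


def structure_score(global_score, argument_scores):
    """Geometric mean of the global score and all argument-specific scores."""
    scores = [global_score] + list(argument_scores)
    # Averaging logarithms avoids underflow when many scores are multiplied.
    return math.exp(sum(math.log(score) for score in scores) / len(scores))


def select_best(candidates):
    """Pick the candidate structure with the highest overall score."""
    return max(
        candidates,
        key=lambda cand: structure_score(cand["global"], cand["arguments"]),
    )


# Hypothetical candidates for one predicate.
CANDIDATES = [
    {"name": "A", "global": 0.9, "arguments": [0.4, 0.5]},
    {"name": "B", "global": 0.6, "arguments": [0.8, 0.7]},
]
BEST = select_best(CANDIDATES)
```

In this toy case candidate B wins although A has the higher global score, because the geometric mean penalises the two weak argument-specific scores of A.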

Experiments
In this section, we demonstrate the usefulness of dependency path embeddings for semantic role labeling. Our hypotheses are that (1) modeling dependency paths as sequences will lead to better representations for the SRL task, thus increasing labeling precision overall, and that (2) embeddings will address the problem of data sparsity, leading to higher recall. To test both hypotheses, we experiment on the in-domain and out-of-domain test sets provided in the CoNLL-2009 shared task (Hajič et al., 2009) and compare results of our system, henceforth PathLSTM, with systems that do not involve path embeddings. We compute precision, recall and F1-score using the official CoNLL-2009 scorer. The code is available at https://github.com/microth/PathLSTM.
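For readers unfamiliar with the metrics, the sketch below computes precision, recall and F1-score over sets of labeled arguments. It is not the official CoNLL-2009 scorer, which may count additional items; the triple format and the toy gold and predicted sets (modelled on the running example, where systems miss the argument He) are assumptions.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted (predicate, argument, role) triples against gold ones."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: the prediction finds "funds" but misses "he".
GOLD = {("raise", "he", "A0"), ("raise", "funds", "A1")}
PREDICTED = {("raise", "funds", "A1")}
P, R, F1 = precision_recall_f1(GOLD, PREDICTED)
```

A missed argument lowers recall but leaves precision untouched, which is why the two hypotheses above are stated separately in terms of precision and recall.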

Model selection
We train argument identification and classification models using the XLBP toolkit for neural networks (Monner and Reggia, 2012). The hyperparameters for each step were selected based on the CoNLL-2009 development set. For direct comparison with previous work, we use the same preprocessing models and predicate-specific SRL components as provided with mate-tools (Bohnet, 2010; Björkelund et al., 2010). The types and ranges of hyperparameters considered are as follows: learning rate $\alpha \in [0.00006, 0.3]$, dropout rate $d \in [0.0, 0.5]$, and hidden layer sizes $|e| \in [0, 100]$, $|h| \in [0, 500]$. In addition, we experimented with different gating mechanisms (with/without forget gate) and memory access settings (with/without connections between all gates and the memory layer, cf. Section 2). The best parameters were chosen using the Spearmint hyperparameter optimization toolkit (Snoek et al., 2012), applied for approximately 200 iterations, and are summarized in Table 2.
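The search space above can be written down as a small configuration. The sketch draws configurations by plain random sampling, which is a stand-in used here for illustration and not the Bayesian optimization performed by Spearmint; the dictionary keys and the log-uniform sampling of the learning rate are likewise assumptions.

```python
import math
import random

# Ranges as stated in the text; the key names are illustrative.
SEARCH_SPACE = {
    "learning_rate": (0.00006, 0.3),
    "dropout": (0.0, 0.5),
    "embedding_size": (0, 100),
    "hidden_size": (0, 500),
}


def sample_configuration(rng):
    """Draw one configuration uniformly (log-uniformly for the learning rate)."""
    low, high = SEARCH_SPACE["learning_rate"]
    return {
        "learning_rate": math.exp(rng.uniform(math.log(low), math.log(high))),
        "dropout": rng.uniform(*SEARCH_SPACE["dropout"]),
        "embedding_size": rng.randint(*SEARCH_SPACE["embedding_size"]),
        "hidden_size": rng.randint(*SEARCH_SPACE["hidden_size"]),
        "forget_gate": rng.choice([True, False]),
        "memory_to_gate": rng.choice([True, False]),
    }


RNG = random.Random(0)
CONFIGURATIONS = [sample_configuration(RNG) for _ in range(200)]
```

Each sampled configuration would then be trained and scored on the development set; a model-based optimizer differs in that it chooses the next configuration from the scores observed so far.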

Results
The results of our in- and out-of-domain experiments are summarized in Tables 3 and 5 [...] results by 0.4 and 0.2 percentage points, respectively. At an F1-score of 86.7%, our local model (using no reranker) reaches the same performance as state-of-the-art local models. Note that differences in results between systems might originate from the application of different preprocessing techniques as each system comes with its own syntactic components. For direct comparison, we evaluate against mate-tools, which use the same preprocessing techniques as PathLSTM. In comparison, we see improvements of +0.8-1.0 percentage points absolute in F1-score.
In the out-of-domain setting, our system achieves new state-of-the-art results of 76.1% (single) and 76.5% (ensemble) F1-score, outperforming the previous best system by Roth and Woodsend (2014) [...]
[...] the same preprocessing methods. Table 4 presents in-domain test results for our system when specific feature types are omitted. The overall low results indicate that a combination of dependency path embeddings and binary features is required to identify and label arguments with high precision.
Figure 5 shows the effect of dependency path embeddings at mitigating sparsity: if the path between a predicate and its argument has not been observed at training time or only infrequently, conventional methods will often fail to assign a role. This is represented by the recall curve of mate-tools, which converges to zero for arguments with unseen paths. The higher recall curve for PathLSTM demonstrates that path embeddings can alleviate this problem to some extent. For unseen paths, we observe that PathLSTM improves over mate-tools by an order of magnitude, from 0.9% to 9.6%. The highest absolute gain, from 12.8% to 24.2% recall, can be observed for dependency paths that occurred between 1 and 10 times during training.
Figure 7 plots role labeling performance for sentences with varying number of words. There are two categories of sentences in which the improvements of PathLSTM are most noticeable: Firstly, it better handles short sentences that contain expletives and/or nominal predicates (+0.8% absolute in F1-score). This is probably due to the fact that our learned dependency path representations are lexicalized, making it possible to model argument structures of different nominals and to distinguish between expletive occurrences of 'it' and other subjects. Secondly, it improves performance on longer sentences (up to +1.0% absolute in F1-score). This is mainly due to the handling of dependency paths that involve complex structures, such as coordinations, control verbs and nominal predicates.
We collect instances of different syntactic phenomena from the development set and plot the learned dependency path representations in the embedding space (see Figure 6). We obtain a projection onto two dimensions using t-SNE (Van der Maaten and Hinton, 2008). Interestingly, we can [...]
Finally, [...] terms of recall of proto-agent (A0) and proto-patient (A1) roles, with slight gains in precision for the A2 role. Overall, PathLSTM does slightly worse with respect to modifier roles, which it labels with higher precision but at the cost of recall.

Path Embeddings in other Languages
In this section, we report results from additional experiments on Chinese, German and Spanish data. The underlying question is to what extent the improvements of our SRL system for English also generalize to other languages. To answer this question, we train and test separate SRL models for each language, using the system architecture and hyperparameters discussed in Sections 3 and 4, respectively. We train our models on data from the CoNLL-2009 shared task, relying on the same features as one of the participating systems (Björkelund et al., 2009), and evaluate with the official scorer. For direct comparison, we rely on the (automatic) syntactic preprocessing information provided with the CoNLL test data and compare our results with the best two systems for each language that make use of the same preprocessing information.
The results, summarized in Table 7, indicate that PathLSTM performs better than the system by Björkelund et al. (2009) [...]
[...] semantic role labeling. They developed a feed-forward network that uses a convolution function over windows of words to assign SRL labels. Apart from constituency boundaries, their system does not make use of any syntactic information. Foland and Martin (2015) extended their model and showcased significant improvements when including binary indicator features for dependency paths. Similar features were used by FitzGerald et al. (2015), who include role labeling predictions by neural networks as factors in a global model.
These approaches all make use of binary features derived from syntactic parses either to indicate constituency boundaries or to represent full dependency paths. An extreme alternative has recently been proposed by Zhou and Xu (2015), who model SRL decisions with a multi-layered LSTM network that takes word sequences as input but no syntactic parse information at all.
Our approach falls in between the two extremes: we rely on syntactic parse information but rather than solely making use of sparse binary features, we explicitly model dependency paths in a neural network architecture.

Other SRL approaches
Within the SRL literature, recent alternatives to neural network architectures include sigmoid belief networks (Henderson et al., 2013) as well as low-rank tensor models (Lei et al., 2015). Whereas Lei et al. only make use of dependency paths as binary indicator features, Henderson et al. propose a joint model for syntactic and semantic parsing that learns and applies incremental dependency path representations to perform SRL decisions. The latter form of representation is closest to ours; however, we do not build syntactic parses incrementally. Instead, we take syntactically preprocessed text as input and focus on the SRL task only.
Apart from more powerful models, most recent progress in SRL can be attributed to novel features. For instance, Deschacht and Moens (2009) and Huang and Yates (2010) use latent variables, learned with a hidden Markov model, as features for representing words and word sequences. Zapirain et al. (2013) propose different selection preference models in order to deal with the sparseness of lexical features. Roth and Woodsend (2014) address the same problem with word embeddings and compositions thereof. Roth and Lapata (2015) recently introduced features that model the influence of discourse on role labeling decisions.
Rather than coming up with completely new features, in this work we proposed to revisit some well-known features and represent them in a novel way that generalizes better.Our proposed model is inspired both by the necessity to overcome the problems of sparse lexico-syntactic features and by the recent success of SRL models based on neural networks.

Dependency-based embeddings
The idea of embedding dependency structures has previously been applied to tasks such as relation classification and sentiment analysis. Xu et al. (2015) and Liu et al. (2015) use neural networks to embed dependency paths between entity pairs. To identify the relation that holds between two entities, their approaches make use of pooling layers that detect parts of a path that indicate a specific relation. In contrast, our work aims at modeling an individual path as a complete sequence, in which every item is of relevance. Tai et al. (2015) and Ma et al. (2015) learn embeddings of dependency structures representing full sentences, in a sentiment classification task. In our model, embeddings are learned jointly with other features, and as a result problems that may result from erroneous parse trees are mitigated.

Conclusions
We introduced a neural network architecture for semantic role labeling that jointly learns embeddings for dependency paths and feature combinations. Our experimental results indicate that our model substantially increases classification performance, leading to new state-of-the-art results. In a qualitative analysis, we found that our model is able to cover instances of various linguistic phenomena that are missed by other methods.
Beyond SRL, we expect dependency path embeddings to be useful in related tasks and downstream applications. For instance, our representations may be of direct benefit for semantic and discourse parsing tasks. The jointly learned feature space also makes our model a good starting point for cross-lingual transfer methods that rely on feature representation projection to induce new models (Kozhevnikov and Titov, 2014).

Figure 1: Dependency path (dotted) between the predicate raising and the argument he.

Figure 2: Example input and embedding computation for the path from raising to he, given the sentence he had trouble raising funds. LSTM time steps are displayed from right to left.

Figure 3: Neural model for joint learning of path embeddings and higher-order features: The path sequence $x_1 \ldots x_n$ is fed into an LSTM layer, a hidden layer $h$ combines the final embedding $e_n$ and binary input features $B$, and an output layer $s$ assigns the most probable class label $c$.

Figure 4: Pipeline architecture of our SRL system.

Figure 6: Dots correspond to the path representation of a predicate-argument instance in 2D space. White/black color indicates A0/A1 gold argument labels. Dotted ellipses denote instances exhibiting related syntactic phenomena (see rectangles for a description and dotted rectangles for linguistic examples). Example phrases show actual output produced by PathLSTM (underlined).

Figure 5: Results on in-domain test instances, grouped by the number of training instances that have an identical (unlexicalized) dependency path.

Figure 7: Results by sentence length. Improvements over mate-tools shown in parentheses.

Table 2: Hyperparameters selected for best models and training procedures.
Lexico-syntactic features: Word form and word category of the predicate and candidate argument; dependency relations from predicate and argument to their respective syntactic heads; full dependency path sequence from predicate to argument.
Local context features: Word forms and word categories of the candidate argument's and predicate's syntactic siblings and children words.

Table 3: Results on the CoNLL-2009 in-domain test set. All numbers are in percent.

Table 4: Ablation tests in the in-domain setting.

Table 5: Results on the CoNLL-2009 out-of-domain test set. All numbers are in percent.
Discussion
To determine the sources of individual improvements, we test PathLSTM models without specific feature types and directly compare PathLSTM and mate-tools, both of which use [...]
3 Results are taken from Lei et al. (2015).

Table 6: Results by word category and role label.
Table 6 shows results for nominal and verbal predicates as well as for different (gold) role labels. In comparison to mate-tools, we can see that PathLSTM improves precision for all argument types of nominal predicates. For verbal predicates, improvements can be observed in [...]

Table 7: Results (in percentage) on the CoNLL-2009 test sets for Chinese, German and Spanish.
[...] in all cases. For German and Chinese, PathLSTM achieves the best overall F1-scores of 80.1% and 79.4%, respectively.