A Simple and Accurate Syntax-Agnostic Neural Model for Dependency-based Semantic Role Labeling

We introduce a simple and accurate neural model for dependency-based semantic role labeling. Our model predicts predicate-argument dependencies relying on states of a bidirectional LSTM encoder. The semantic role labeler achieves competitive performance on English, even without any kind of syntactic information and only using local inference. However, when automatically predicted part-of-speech tags are provided as input, it substantially outperforms all previous local models and approaches the best reported results on the English CoNLL-2009 dataset. We also consider Chinese, Czech and Spanish, where our approach achieves competitive results. Syntactic parsers are unreliable on out-of-domain data, so standard (i.e., syntactically-informed) SRL models are hindered when tested in this setting. Our syntax-agnostic model appears more robust, resulting in the best reported results on standard out-of-domain test sets.


Introduction
The task of semantic role labeling (SRL), pioneered by Gildea and Jurafsky (2002), involves prediction of predicate-argument structure, i.e. both identification of arguments and their assignment to an underlying semantic role. These representations have been shown to be beneficial in many NLP applications, including question answering (Shen and Lapata, 2007) and information extraction (Christensen et al., 2011). Semantic banks (e.g., PropBank (Palmer et al., 2005)) typically represent arguments as syntactic constituents or, more generally, text spans (Baker et al., 1998). In contrast, the CoNLL-2008 and 2009 shared tasks (Surdeanu et al., 2008; Hajic et al., 2009) popularized dependency-based semantic role labeling, where the goal is to identify syntactic heads of arguments rather than entire constituents. Figure 1 shows an example of such a dependency-based representation: node labels are senses of predicates (e.g., "01" indicates that the first sense from the PropBank sense repository is used for the predicate makes in this sentence) and edge labels are semantic roles (e.g., A0 is a proto-agent, 'doer').
Until recently, state-of-the-art SRL systems relied on complex sets of lexico-syntactic features (Pradhan et al., 2005) as well as declarative constraints (Punyakanok et al., 2008; Roth and Yih, 2005). Neural SRL models instead exploited the feature induction capabilities of neural networks, largely eliminating the need for complex hand-crafted features. Initially achieving state-of-the-art results only in the multilingual setting, where careful feature engineering is not practical (Titov et al., 2009; Gesmundo et al., 2009; Henderson et al., 2013), neural SRL models now also outperform their traditional counterparts on standard benchmarks for English (FitzGerald et al., 2015; Roth and Lapata, 2016; Swayamdipta et al., 2016; Foland and Martin, 2015).
Recently, it has been shown that an accurate span-based SRL model can be constructed without relying on syntactic features (Zhou and Xu, 2015). Nevertheless, state-of-the-art methods for dependency-based SRL still heavily rely on syntactic features (Roth and Lapata, 2016; FitzGerald et al., 2015; Lei et al., 2015; Roth and Woodsend, 2014; Swayamdipta et al., 2016). In particular, Roth and Lapata (2016) argue that syntactic features are necessary and show that the performance of their model degrades dramatically if syntactic paths between arguments and predicates are not provided as input. In this work, we are the first to show how to construct a very accurate dependency-based semantic role labeler which either does not use any kind of syntactic information or uses very little (automatically predicted part-of-speech tags). This suggests that our LSTM model can largely capture syntactic information, and that this information can, to a large extent, substitute for treebank syntax.
Our model is inspired by recent work in syntactic dependency parsing (Kiperwasser and Goldberg, 2016; Cross and Huang, 2016). In the simplest version of those models, a sentence is encoded by a bidirectional LSTM, and dependency edges in a candidate dependency tree are then scored independently of each other, relying only on the concatenation of two LSTM states, one for the head word and one for the dependent word. We observe that the direct application of this idea does not lead to competitive results on dependency-based SRL. Instead, we find it necessary to use a multi-pass approach where we first identify predicates and disambiguate them; then, for each predicate, we re-encode the sentence with an LSTM while indicating (in the input) which word is chosen as the predicate. Finally, for each predicate, arguments and their roles are predicted in the same way as before, i.e. relying on the two LSTM states (a state of the predicate word and a state of the argument word). Intuitively, in this way, on each run the LSTM encoder does not need to represent all argument-predicate dependencies in its state trajectory but can focus on a single predicate at a time. We hypothesize that this constitutes a more effective way to use the LSTM capacity. This re-encoding idea is reminiscent of the region marking features used in the span-based model of Zhou and Xu (2015).
The resulting SRL model is very simple. Not only do we not rely on syntax, our model is also local, i.e. we do not globally score or constrain sets of arguments. On the standard in-domain CoNLL-2009 benchmark we achieve 87.6% F1, which compares favorably to the best local model (86.7% F1 for PathLSTM (Roth and Lapata, 2016)) and approaches the best results overall (87.9% for an ensemble of 3 PathLSTM models with a reranker on top). Moreover, as syntactic parsers are not reliable when used out-of-domain, standard (i.e., syntactically-informed) dependency SRL models are crippled when applied to such data. In contrast, our syntax-agnostic model appears to be considerably more robust: we achieve the best result so far on the out-of-domain Brown test set, 77.3% F1. This constitutes a 2% absolute improvement over the comparable previous model (75.3% for the local PathLSTM) and substantially outperforms any previous method (76.5% for the ensemble of 3 PathLSTMs). The key contributions can be summarized as follows:

• we propose the first effective syntax-agnostic model for dependency-based SRL;

• it achieves the best results among local models on the English in-domain test set;

• it substantially outperforms all previous methods on the out-of-domain test set.
Note that, in this work, we are not arguing that global inference or integration of treebank syntax cannot be beneficial to SRL. Instead, we leave these questions for future work. In fact, we believe that the proposed SRL model, given its simplicity and efficiency, can be used as a natural building block for future global and syntactically-informed SRL models.

Our Model
The focus of this paper is on argument identification and labeling, as these are the steps which have previously been believed to require syntactic information. For predicate disambiguation we use a simple LSTM model, described in Section 2.4. As sketched in the introduction, in order to identify and classify arguments, we use a bidirectional LSTM (BiLSTM). The LSTM takes as input word representations x_i of each word w_i in a sentence w. LSTM states provide dynamic representations of words and their contexts in a sentence. The actual prediction of roles is done by a classifier which takes as input the BiLSTM representation of the candidate argument and the BiLSTM representation of the predicate.

Word Representation
We represent each word w as the concatenation of four vectors: a randomly initialized word embedding x_re ∈ R^{d_w}, a pre-trained word embedding x_pe ∈ R^{d_w}, a randomly initialized part-of-speech tag embedding x_pos ∈ R^{d_p}, and a randomly initialized lemma embedding x_le ∈ R^{d_l} that is only active if the word is one of the predicates. The randomly initialized embeddings x_re, x_pos, and x_le are fine-tuned during training, while the pre-trained ones are kept fixed. The final word representation is given by x = x_re • x_pe • x_pos • x_le, where • represents the concatenation operator.
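As a concrete illustration, the word representation above can be sketched as follows; this is a minimal NumPy sketch, not the paper's implementation, and the dimensions are illustrative rather than the tuned values. We also make the assumption that "only active" means the lemma embedding is zeroed out for non-predicate words.

```python
import numpy as np

# Illustrative dimensions (not the paper's tuned hyperparameters).
d_w, d_p, d_l = 100, 16, 25

rng = np.random.default_rng(0)

def word_representation(x_re, x_pe, x_pos, x_le, is_predicate):
    """Concatenate the four embeddings; the lemma embedding is
    only active (non-zero) when the word is a predicate."""
    lemma = x_le if is_predicate else np.zeros_like(x_le)
    return np.concatenate([x_re, x_pe, x_pos, lemma])

x = word_representation(rng.normal(size=d_w), rng.normal(size=d_w),
                        rng.normal(size=d_p), rng.normal(size=d_l),
                        is_predicate=True)
```

The resulting vector has dimensionality 2·d_w + d_p + d_l, matching x = x_re • x_pe • x_pos • x_le.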

Bidirectional LSTM Encoder
One of the most effective ways to model sequences is recurrent neural networks (RNNs) (Elman, 1990), more precisely their gated versions, for example, Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997). Formally, we can define an LSTM as a function LSTM_θ(x_{1:i}) that takes as input the sequence x_{1:i} and returns a hidden state h_i ∈ R^{d_h}. This state can be regarded as a representation of the sentence from the start to position i, or, in other words, it encodes the word at position i along with its left context. Bidirectional LSTMs make use of two LSTMs: one for the forward pass and another for the backward pass, LSTM^F and LSTM^B, respectively. In this way the concatenation of forward and backward LSTM states encodes both the left and right contexts of a word: BiLSTM(x_{1:n}, i) = LSTM^F(x_{1:i}) • LSTM^B(x_{n:i}). In this work we stack k layers of bidirectional LSTMs, where each layer takes the output of the lower layer as its input.
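The bidirectional encoding above can be sketched in a few lines of NumPy; this is a toy single-layer BiLSTM with random, untrained weights, meant only to make the state concatenation concrete (the paper stacks k trained layers).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_run(xs, Wx, Wh, b):
    """Run one LSTM over a list of input vectors; return all hidden states.
    Wx, Wh, b pack the input/forget/output/candidate gates row-wise."""
    d = Wh.shape[1]
    h, c = np.zeros(d), np.zeros(d)
    states = []
    for x in xs:
        z = Wx @ x + Wh @ h + b
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        states.append(h)
    return states

def bilstm(xs, params_f, params_b):
    """BiLSTM(x_{1:n}, i): concatenate forward state at i with the
    backward state at i (backward pass re-ordered to sentence order)."""
    fwd = lstm_run(xs, *params_f)
    bwd = lstm_run(xs[::-1], *params_b)[::-1]
    return [np.concatenate([hf, hb]) for hf, hb in zip(fwd, bwd)]

rng = np.random.default_rng(0)
d_in, d_h, n = 8, 4, 5
make = lambda: (rng.normal(size=(4 * d_h, d_in)),
                rng.normal(size=(4 * d_h, d_h)),
                np.zeros(4 * d_h))
sentence = [rng.normal(size=d_in) for _ in range(n)]
states = bilstm(sentence, make(), make())
```

Each position i thus gets a 2·d_h state encoding both its left and right context.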
Since for each word in a sentence we want to predict its semantic role given a predicate, we concatenate the hidden states at the k-th layer of the current word and of the predicate word and use them as input to a classifier. Though we experimented with multilayer perceptrons, we obtained the best results with a simple log-linear model. The classifier computes the probability of the role (including a special 'NULL' role to indicate that a word is not an argument of the predicate) given the candidate argument and the predicate:

p(r | v_i, v_p, l) ∝ exp(W_{l,r} (v_i • v_p)),

where v_i and v_p are the hidden states calculated by BiLSTM(x_{1:n}, i) and BiLSTM(x_{1:n}, p), respectively, l is the lemma of predicate p, and the symbol ∝ signifies proportionality. Instead of using a fixed matrix W_{l,r} or simply assuming that W_{l,r} = W_r, we, inspired by FitzGerald et al. (2015), found it beneficial to jointly embed the role r and the predicate lemma l using a nonlinear transformation:

W_{l,r} = ReLU(U (u_l • v_r)),

where ReLU is the rectified linear activation function, U is a parameter matrix, and u_l ∈ R^{d_l} and v_r ∈ R^{d_r} are randomly initialized embeddings of predicate lemmas and roles. In this way each role prediction is predicate-specific, and at the same time we expect to learn good representations for roles associated with infrequent predicates.
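The two equations above compose into a small softmax over roles; the following NumPy sketch uses hypothetical dimensions and random untrained parameters purely to show the computation shape, not the paper's trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_l, d_r, n_roles = 8, 5, 5, 4   # hypothetical sizes

U = rng.normal(size=(2 * d_h, d_l + d_r), scale=0.1)  # parameter matrix
u_lemma = rng.normal(size=d_l)                 # embedding of predicate lemma l
v_roles = rng.normal(size=(n_roles, d_r))      # embeddings of the roles

def role_distribution(v_i, v_p, u_l):
    """p(r | v_i, v_p, l) ∝ exp(W_{l,r} · (v_i • v_p)),
    with W_{l,r} = ReLU(U (u_l • v_r))."""
    feats = np.concatenate([v_i, v_p])
    scores = []
    for v_r in v_roles:
        w_lr = np.maximum(0.0, U @ np.concatenate([u_l, v_r]))  # ReLU
        scores.append(w_lr @ feats)
    scores = np.array(scores)
    scores -= scores.max()            # for numerical stability
    p = np.exp(scores)
    return p / p.sum()

p = role_distribution(rng.normal(size=d_h), rng.normal(size=d_h), u_lemma)
```

Because W_{l,r} is built from the lemma and role embeddings, roles of rare predicates can share statistical strength with those of frequent ones.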

Predicate-Specific Encoder
As we will show in Section 3, although this one-pass model, where the sentence is encoded only once, is very effective for syntactic dependency parsing, it does not perform well in SRL (Table 3, '-predicate flag'). Though we found this dramatic drop in performance surprising, the nature of the dependencies, especially for nominal predicates, is different here, with many arguments being far away from their predicates. Inspired by the span-based SRL approach of Zhou and Xu (2015), we add a predicate-specific feature by concatenating a binary flag to the word representation of Section 2.1. The flag is set to 1 for the word corresponding to the currently considered predicate, and to 0 otherwise. In this way, sentences with more than one predicate are encoded multiple times.
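The predicate flag amounts to appending one extra dimension to every word vector before re-encoding; a minimal sketch, with toy word vectors standing in for the representations of Section 2.1:

```python
import numpy as np

def with_predicate_flag(word_vecs, predicate_index):
    """Append a binary flag column: 1 at the predicate position, 0 elsewhere."""
    flags = np.zeros((len(word_vecs), 1))
    flags[predicate_index, 0] = 1.0
    return np.hstack([word_vecs, flags])

sentence = np.ones((6, 3))   # 6 words, toy 3-dimensional representations
# A sentence with two predicates (say at positions 1 and 4) is encoded twice,
# once per predicate:
inputs_per_predicate = [with_predicate_flag(sentence, p) for p in (1, 4)]
```

Each flagged copy is then fed to the BiLSTM encoder, so the encoder can specialize its states for one predicate at a time.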

Predicate Disambiguation
We also implemented a syntax-agnostic predicate sense disambiguator. For this subtask, we represented a word as a concatenation of its pre-trained word embedding, the embedding of the predicate word we want to disambiguate, and the predicate flag. This word representation is fed to a single-layer BiLSTM. The concatenation of the hidden state of the predicate and the predicate word embedding is then passed to a linear classifier to obtain the predicate sense. At test time, if a predicate has never been seen during training, the first sense is predicted.

Experiments
We applied our model to the English CoNLL-2009 dataset with the standard split into training, test and development sets. For the semantic role labeler, we used the external embeddings of Dyer et al. (2015), learned using the structured skip n-gram approach of Ling et al. (2015). Similarly to Kiperwasser and Goldberg (2016), we used word dropout (Iyyer et al., 2015): we replace a word with the unknown token UNK with probability α / (fr(w) + α), where α is a hyperparameter and fr(w) is the frequency of the word w. We used the predicted POS tags provided by the CoNLL-2009 shared-task organizers. For the predicate disambiguator of Section 2.4 we used GloVe embeddings (Pennington et al., 2014). We optimized with Adam (Kingma and Ba, 2015). Hyperparameter tuning and all model selection were performed on the development set; see Table 4 for the chosen values.
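The frequency-dependent word dropout above is easy to make concrete; a short sketch, where the value α = 0.25 is illustrative rather than the paper's tuned setting:

```python
import random

def drop_probability(freq, alpha):
    """Probability of replacing a word with UNK: alpha / (fr(w) + alpha).
    Rare words are replaced often; frequent words almost never."""
    return alpha / (freq + alpha)

def word_dropout(tokens, freqs, alpha, rng):
    return ["UNK" if rng.random() < drop_probability(freqs[t], alpha) else t
            for t in tokens]

p_rare = drop_probability(1, 0.25)     # a hapax: dropped 20% of the time
p_freq = drop_probability(1000, 0.25)  # a frequent word: essentially kept
```

This biases the model to rely on context (and the other embedding channels) for rare words, for which the randomly initialized embeddings are poorly trained.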
The results indicate that our full model (with POS tags and re-encoding) significantly outperforms all the local counterparts on the in-domain tests (see Table 1, 87.6% F1 for our model vs. 86.7% for PathLSTM) and outperforms even ensemble models on the out-of-domain data (77.3% vs. 76.5% for the ensemble of PathLSTMs). The ablation studies (Table 3) demonstrate that POS tag information is beneficial, though not crucial for obtaining competitive performance. In contrast, one-pass processing without re-encoding badly hurts performance (a 6% drop in F1 on the development set).

Conclusions
Our syntax-agnostic method is simple and fast, and surpasses comparable approaches (no system combination, local inference) on the standard in-domain benchmark for English. Moreover, it outperforms all previous methods (including ensembles) in the arguably more realistic out-of-domain setting. In the future, we will consider integration of syntactic information and joint inference, as well as experiment with additional languages.

Figure 2: Predicting an argument and its label with an LSTM encoder.

Table 1: Results on the in-domain test set.

Table 2: Results on the out-of-domain test set.

Table 3: Ablation study on the development set.