Exploiting Mutual Benefits between Syntax and Semantic Roles using Neural Network

We investigate mutual benefits between syntax and semantic roles using neural network models, by studying a parsing → SRL pipeline, an SRL → parsing pipeline, and a simple joint model based on embedding sharing. The integration of syntactic and semantic features gives promising results on a Chinese Semantic Treebank, demonstrating the large potential of neural models for joint parsing and semantic role labeling.


Introduction
The correlation between syntax and semantics has been a fundamental problem in natural language processing (Steedman, 2000). As a shallow semantic task, semantic role labeling (SRL) models have traditionally been built upon syntactic parsing results (Gildea and Jurafsky, 2002; Gildea and Palmer, 2002; Punyakanok et al., 2005). It has been shown that parser output features play a crucial role in accurate SRL (Pradhan et al., 2005; Surdeanu et al., 2007).
In the reverse direction, semantic role features have been used to improve parsing (Boxwell et al., 2010). Existing methods typically use semantic features to rerank n-best lists of syntactic parsing models (Surdeanu et al., 2008; Hajič et al., 2009). There have also been attempts to learn syntactic parsing and semantic role labeling models jointly, but most such efforts have led to negative results (Sutton and McCallum, 2005; Van Den Bosch et al., 2012; Boxwell et al., 2010).
* Work done while the first author was visiting SUTD.
With the rise of deep learning, neural network models have been used for semantic role labeling (Collobert et al., 2011). Recently, it has been shown that a neural semantic role labeler can give state-of-the-art accuracies without using parser output features, thanks to the use of recurrent neural network structures that automatically capture syntactic information (Zhou and Xu, 2015). In the parsing domain, neural network models have also recently been shown to give state-of-the-art results (Weiss et al., 2015). The availability of parser-independent neural SRL models allows parsing and SRL to be performed by both parsing→SRL and SRL→parsing pipelines, and gives rise to the interesting research question of whether mutual benefits between syntax and semantic roles can be better exploited under the neural setting. Unlike traditional models, which rely on manual feature combinations for joint learning tasks (Sutton and McCallum, 2005; Zhang and Clark, 2008a; Finkel and Manning, 2009; Lewis et al., 2015), neural network models induce non-linear feature combinations automatically from input word and Part-of-Speech (POS) embeddings. This allows more complex feature sharing between multiple tasks to be achieved effectively (Collobert et al., 2011).
We take a first step in this investigation by coupling a state-of-the-art neural semantic role labeler and a state-of-the-art neural parser. First, we propose a novel parsing→SRL pipeline using a tree Long Short-Term Memory (LSTM) model (Tai et al., 2015) to represent parser outputs, before feeding them to the neural SRL model as inputs. Second, we investigate an SRL→parsing pipeline, using semantic role label embeddings to enrich parser features. Third, we build a joint training model by embedding sharing, which is the shallowest level of parameter sharing between deep neural networks. This simple strategy is immune to significant differences between the network structures of the two models, which prevent direct sharing of deeper network parameters. We choose a Chinese semantic role treebank (Qiu et al., 2016) for preliminary experiments, which offers consistent dependency between syntax and semantic role representations, thereby facilitating the application of standard LSTM models. Results show that the methods give improvements to both parsing and SRL accuracies, demonstrating the large potential of neural networks for the joint task.
Our contributions can be summarized as follows:
• We show that the state-of-the-art LSTM semantic role labeler of Zhou and Xu (2015), which has been shown to be able to induce syntactic features automatically, can still be improved using parser output features via tree LSTM (Tai et al., 2015);
• We show that state-of-the-art neural parsing can be improved by using semantic role features;
• We show that parameter sharing between neural parsing and SRL improves both subtasks, which is in line with the observation of Collobert et al. (2011) between POS tagging, chunking and SRL.

Semantic Role Labeler
We employ an existing neural SRL model, which uses a bidirectional Long Short-Term Memory (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005; Graves et al., 2013) for sequential labeling, and apply it to a Chinese dataset.
Given the sentence "人类 (human) 的 (de) 发展 (development) 面临 (face) 挑战 (challenge)", the structure of the model is shown in Figure 1. For each word $w_t$, the LSTM model uses a set of vectors to control information flow: an input gate $i_t$, a forget gate $f_t$, a memory cell $c_t$, an output gate $o_t$, and a hidden state $h_t$. In the computation of these vectors, $\sigma$ denotes the component-wise sigmoid function and $\odot$ denotes component-wise multiplication.
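For reference, one widely used LSTM formulation that is consistent with the symbols defined above is given below. The weight names $W^{(\cdot)}$, $U^{(\cdot)}$, the biases $b^{(\cdot)}$ and the candidate vector $\tilde{c}_t$ are assumptions introduced here for illustration, and the exact variant used by the labeler (for example, one with peephole connections or coupled gates) may differ.

```latex
\begin{aligned}
i_t &= \sigma\left(W^{(i)} x_t + U^{(i)} h_{t-1} + b^{(i)}\right) \\
f_t &= \sigma\left(W^{(f)} x_t + U^{(f)} h_{t-1} + b^{(f)}\right) \\
o_t &= \sigma\left(W^{(o)} x_t + U^{(o)} h_{t-1} + b^{(o)}\right) \\
\tilde{c}_t &= \tanh\left(W^{(c)} x_t + U^{(c)} h_{t-1} + b^{(c)}\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```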
The representation of $x_t$ is from four sources: an embedding for the word $w_t$, two hidden states of the last LSTM cells in a character-level bidirectional LSTM (denoted as $\overrightarrow{ch}_t$ and $\overleftarrow{ch}_t$, respectively), and a learned Part-of-Speech (POS) vector representation ($pos_t$). A linear transformation is applied to the vector representations before feeding them into a component-wise ReLU (Nair and Hinton, 2010) function.
The hidden state vectors at the $t$-th word from both directions (denoted as $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$, respectively) are passed through the ReLU function, before a softmax layer for semantic role detection.
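The data flow described in this section can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the labeler's actual implementation: the gate layout is the standard LSTM one, the two directional states are concatenated before the ReLU, all sizes are toy values, the character-level LSTM outputs are replaced by random vectors, and every function and parameter name (`compose_input`, `label_probs`, `W_out`, and so on) is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compose_input(word, ch_fwd, ch_bwd, pos, V, b):
    """x_t = ReLU(V [word; ch_fwd; ch_bwd; pos] + b): linear map, then ReLU."""
    return np.maximum(0.0, V @ np.concatenate([word, ch_fwd, ch_bwd, pos]) + b)

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One standard LSTM step; W, U, b stack the i, f, o and candidate blocks."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i, f, o = sigmoid(z[:n]), sigmoid(z[n:2 * n]), sigmoid(z[2 * n:3 * n])
    c = f * c_prev + i * np.tanh(z[3 * n:])
    return o * np.tanh(c), c

def run_lstm(xs, params, n_hidden):
    """Run an LSTM over xs in the given order and return all hidden states."""
    h, c = np.zeros(n_hidden), np.zeros(n_hidden)
    states = []
    for x in xs:
        h, c = lstm_step(x, h, c, *params)
        states.append(h)
    return states

def label_probs(xs, fwd, bwd, W_out, b_out, n_hidden):
    """Per-word label distribution from forward and backward hidden states."""
    hs_f = run_lstm(xs, fwd, n_hidden)
    hs_b = run_lstm(xs[::-1], bwd, n_hidden)[::-1]  # re-align to word order
    rows = []
    for hf, hb in zip(hs_f, hs_b):
        feat = np.maximum(0.0, np.concatenate([hf, hb]))  # ReLU (assumed placement)
        logits = W_out @ feat + b_out
        e = np.exp(logits - logits.max())  # numerically stable softmax
        rows.append(e / e.sum())
    return np.array(rows)

def init_lstm(rng, d_in, n_hidden):
    return (rng.normal(0, 0.1, (4 * n_hidden, d_in)),
            rng.normal(0, 0.1, (4 * n_hidden, n_hidden)),
            np.zeros(4 * n_hidden))

# Toy run: 5 words, random stand-ins for the four embedding sources.
rng = np.random.default_rng(0)
T, d_emb, d_in, n_hidden, n_labels = 5, 4, 8, 6, 3
V, b = rng.normal(0, 0.1, (d_in, 4 * d_emb)), np.zeros(d_in)
xs = [compose_input(*rng.normal(size=(4, d_emb)), V, b) for _ in range(T)]
probs = label_probs(xs, init_lstm(rng, d_in, n_hidden), init_lstm(rng, d_in, n_hidden),
                    rng.normal(0, 0.1, (n_labels, 2 * n_hidden)), np.zeros(n_labels),
                    n_hidden)
```

Each row of `probs` is a distribution over the toy label set for one word; a labeler would take the argmax (or decode over the sequence) to assign semantic roles.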

Stack-LSTM Dependency Parser
We employ a Stack-LSTM model for dependency parsing. As shown in Figure 2, it uses a buffer (B) to order input words, a stack (S) to store partially constructed syntactic trees, and the following transition actions:
• SHIFT, which pops the top element off the buffer and pushes it onto the stack.
• REDUCE-LEFT/REDUCE-RIGHT, which pop the top two elements off the stack and push back the composition of the two elements with a dependency relation.
The parser is initialized by pushing input embeddings into the buffer in reverse order. The representation of each token is the same as in the previous bidirectional LSTM (Bi-LSTM) model. The buffer (B), stack (S) and action history sequence (A) are all represented by LSTMs, with S being represented by a novel stack LSTM. At time step $t$, the parser predicts an action according to the current parser state $p_t$, where $W$, $V$ and $d$ are model parameters.
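The transition system described above can be sketched symbolically as follows. This is a minimal illustration that tracks word indices only and omits the LSTM representations of B, S and A. It assumes the usual arc-standard convention (REDUCE-LEFT makes the top of the stack the head of the element below it, REDUCE-RIGHT the reverse), and the action strings, dependency labels and the tree for the example sentence are hypothetical, not taken from the treebank.

```python
def apply_actions(words, actions):
    """Apply transitions and return the final stack, buffer and arcs.

    Arcs are (head_index, dependent_index, label) triples.
    """
    # The buffer is filled in reverse order so that the first word is on top.
    buffer = list(range(len(words)))[::-1]
    stack, arcs = [], []
    for action in actions:
        name, _, label = action.partition(":")
        if name == "SHIFT":
            stack.append(buffer.pop())
        elif name in ("REDUCE-LEFT", "REDUCE-RIGHT"):
            right = stack.pop()
            left = stack.pop()
            if name == "REDUCE-LEFT":
                head, dep = right, left   # top of stack becomes the head
            else:
                head, dep = left, right   # second element becomes the head
            arcs.append((head, dep, label))
            # The real parser pushes a composed vector; here only the head index.
            stack.append(head)
        else:
            raise ValueError(f"unknown action: {action}")
    return stack, buffer, arcs

words = ["人类", "的", "发展", "面临", "挑战"]
actions = ["SHIFT", "SHIFT", "REDUCE-RIGHT:de",
           "SHIFT", "REDUCE-LEFT:att",
           "SHIFT", "REDUCE-LEFT:sbj",
           "SHIFT", "REDUCE-RIGHT:obj"]
stack, buffer, arcs = apply_actions(words, actions)
```

After the last action the buffer is empty and the stack holds a single index, the root of the constructed tree.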

DEP→SRL Pipeline
In this pipeline model, we apply Stack-LSTM parsing first and feed the results to the SRL model as additional features. For each word $w_t$, the input to the SRL system additionally incorporates $dep_t$ through a weight matrix $V^{(dep)}$, where $dep_t$ is the $t$-th word's dependency information from the parser output. There are multiple ways to define $dep_t$. A simple method is to use the embedding of the dependency label at $w_t$. However, this input does not embody full arc information.
We propose a novel way of defining $dep_t$, by using the hidden vector $h_t$ of a dependency tree LSTM (Tai et al., 2015) at $w_t$ as $dep_t$. Given a dependency tree output, we define tree LSTM inputs $x_t$ in the same way as Section 2.1. The tree LSTM is a bottom-up generalization of the sequence LSTM, with a node $h_t$ having multiple predecessors $h^k_{t-1}$, which correspond to the syntactic dependents of the word $w_t$. The computation of $h_t$ for each $w_t$ proceeds bottom-up over the tree (here $t$ is a bottom-up index over tree nodes rather than the left-to-right index of the sequence LSTM, still with one $h_t$ being computed for each $w_t$). For training, we construct a corpus with all words being associated with automatic dependency labels by applying 10-fold jackknifing.
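A bottom-up tree LSTM of this kind can be sketched as follows. The sketch assumes the child-sum variant of Tai et al. (2015), which may not be the exact variant used here, and the parameter dictionary layout, the function names and the toy tree are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(rng, d_in, d_hid):
    """Random toy parameters for the input (i), forget (f), output (o), update (u) gates."""
    P = {}
    for g in "ifou":
        P["W_" + g] = rng.normal(0, 0.1, (d_hid, d_in))
        P["U_" + g] = rng.normal(0, 0.1, (d_hid, d_hid))
        P["b_" + g] = np.zeros(d_hid)
    return P

def gate(P, g, x, h, squash):
    return squash(P["W_" + g] @ x + P["U_" + g] @ h + P["b_" + g])

def tree_lstm(heads, xs, P):
    """Child-sum tree LSTM over a dependency tree.

    heads[t] is the index of word t's syntactic head (-1 for the root).
    Returns one hidden vector per word, stacked in word order.
    """
    n = len(heads)
    d_hid = P["b_i"].shape[0]
    children = [[] for _ in range(n)]
    for t, head in enumerate(heads):
        if head >= 0:
            children[head].append(t)
    h, c = [None] * n, [None] * n

    def visit(t):
        for k in children[t]:          # dependents are computed first (bottom-up)
            visit(k)
        h_sum = sum((h[k] for k in children[t]), np.zeros(d_hid))
        i = gate(P, "i", xs[t], h_sum, sigmoid)
        o = gate(P, "o", xs[t], h_sum, sigmoid)
        u = gate(P, "u", xs[t], h_sum, np.tanh)
        c[t] = i * u
        for k in children[t]:          # one forget gate per dependent
            c[t] = c[t] + gate(P, "f", xs[t], h[k], sigmoid) * c[k]
        h[t] = o * np.tanh(c[t])

    for t, head in enumerate(heads):
        if head < 0:
            visit(t)
    return np.stack(h)

rng = np.random.default_rng(0)
P = init_params(rng, d_in=4, d_hid=3)
heads = [2, 0, 3, -1, 3]               # toy tree, word 3 is the root
H = tree_lstm(heads, rng.normal(size=(5, 4)), P)
```

Row `H[t]` would then play the role of $dep_t$ for word $w_t$. Because information flows from dependents to heads, the vector at a word summarizes the whole subtree below it rather than a single arc label.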

SRL→DEP Pipeline
In this pipeline model, we conduct SRL first, and feed the output semantic roles to the Stack-LSTM parser at the token level. The representation of a token additionally incorporates $srl_t$ through a weight matrix $V^{(srl)}$, where $srl_t$ is the $t$-th word's predicted semantic role embedding. For training, we construct a training corpus with automatically tagged semantic role labels by using 10-fold jackknifing.
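The 10-fold jackknifing used to build both pipeline training corpora can be sketched as follows. This is a generic illustration: `train_fn` and `predict_fn` are hypothetical callables standing in for training and applying the first-stage model, and assigning sentences to folds by index modulo the fold count is an assumption.

```python
def jackknife(sentences, train_fn, predict_fn, n_folds=10):
    """Annotate every sentence with a model that never saw it in training."""
    auto = [None] * len(sentences)
    for fold in range(n_folds):
        held_out = [i for i in range(len(sentences)) if i % n_folds == fold]
        held_set = set(held_out)
        train = [sentences[i] for i in range(len(sentences)) if i not in held_set]
        model = train_fn(train)                      # train on the other folds
        for i in held_out:
            auto[i] = predict_fn(model, sentences[i])  # label the held-out fold
    return auto

# Toy check: the "model" memorizes its training items, and "prediction" asks
# whether the item was seen in training. With jackknifing it never was.
items = list(range(25))
seen = jackknife(items, train_fn=set, predict_fn=lambda model, s: s in model)
```

The point of the procedure is that the automatic labels on the training set carry realistic first-stage errors, matching what the second-stage model will see at test time.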

Joint Model by Parameter Sharing
The structure of the joint system is shown in Figure 3. Here the parser and semantic role labeler are coupled in the embedding layer, sharing the vector lookup tables for characters, words and POS. More specifically, the Bi-LSTM model of Section 2.1 and the Stack-LSTM model of Section 2.2 are used for the SRL task and the parsing task, respectively, and the two share the embedding layer. During training, the losses from the semantic role labeler and the parser both propagate to the embedding layer, resulting in a better vector representation of each token, which benefits both tasks at the same time. On the other hand, due to the different neural structures, no other parameters are shared. The joint model offers the simplest version of shared training (Collobert et al., 2011), but does not employ shared decoding (Sutton and McCallum, 2005; Zhang and Clark, 2008b). Syntax and semantic roles are assigned separately, avoiding error propagation.
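The effect of sharing only the embedding layer can be illustrated with a deliberately simplified sketch. Here two linear softmax heads stand in for the Bi-LSTM labeler and the Stack-LSTM parser, the two losses are summed with equal weight, and one plain SGD step is taken; all of these are assumptions made for illustration, and the names `joint_step`, `W_srl` and `W_dep` are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def joint_step(E, W_srl, W_dep, word, role, action, lr=0.1):
    """One SGD step on the summed cross-entropy of both tasks for one token.

    E is the shared embedding table; W_srl and W_dep are task-specific.
    Arrays are updated in place. Returns the joint loss before the update.
    """
    x = E[word].copy()
    p_srl = softmax(W_srl @ x)
    p_dep = softmax(W_dep @ x)
    loss = -np.log(p_srl[role]) - np.log(p_dep[action])

    # Gradients of each loss with respect to its own logits.
    g_srl = p_srl.copy()
    g_srl[role] -= 1.0
    g_dep = p_dep.copy()
    g_dep[action] -= 1.0

    # The shared embedding receives gradient from BOTH tasks.
    grad_x = W_srl.T @ g_srl + W_dep.T @ g_dep

    # Each head receives gradient only from its own task.
    W_srl -= lr * np.outer(g_srl, x)
    W_dep -= lr * np.outer(g_dep, x)
    E[word] -= lr * grad_x
    return loss

rng = np.random.default_rng(0)
E = rng.normal(0, 0.1, (10, 8))        # shared lookup table (toy sizes)
W_srl = rng.normal(0, 0.1, (5, 8))     # SRL-only parameters
W_dep = rng.normal(0, 0.1, (3, 8))     # parser-only parameters
losses = [joint_step(E, W_srl, W_dep, word=2, role=1, action=0) for _ in range(20)]
```

The design choice this illustrates is that the only coupling between the tasks is the row of `E` being updated by both gradients; nothing in either head depends on the structure of the other, which is why the strategy tolerates very different network architectures.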

Experimental Settings
Datasets. We choose the Chinese Semantic Treebank (Qiu et al., 2016) for our experiments. Similar to the CoNLL corpora (Surdeanu et al., 2008; Hajič et al., 2009) and different from PropBank (Kingsbury and Palmer, 2002; Xue and Palmer, 2005), it is a dependency-based corpus rather than a constituent-based corpus. The corpus contains syntactic dependency arc and semantic role annotations in a consistent form, hence facilitating the joint task. We follow the standard split for the training, development and test sets, as shown in Table 1.
Training Details. There is a large number of singletons in the training set and a large number of out-of-vocabulary (OOV) words in the development set. Following prior work, we stochastically replace singletons with the UNK token in each training iteration with probability $p_{unk}$. The hyperparameter $p_{unk}$ is set to 0.2.
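The singleton replacement step can be sketched as follows. The function names and the `"UNK"` string are hypothetical, and the sketch assumes that each occurrence is replaced independently every time the sentence is visited.

```python
import random
from collections import Counter

def find_singletons(corpus):
    """Words that occur exactly once in the training corpus."""
    counts = Counter(word for sentence in corpus for word in sentence)
    return {word for word, count in counts.items() if count == 1}

def replace_singletons(sentence, singletons, p_unk=0.2, rng=random, unk="UNK"):
    """Replace each singleton with the UNK token with probability p_unk."""
    return [unk if word in singletons and rng.random() < p_unk else word
            for word in sentence]

corpus = [["a", "b", "a"], ["c", "a"]]
singletons = find_singletons(corpus)          # {"b", "c"}
rng = random.Random(0)
# Called afresh in every training iteration, so the same sentence is seen
# both with and without its rare words over the course of training.
noisy = [replace_singletons(s, singletons, p_unk=0.2, rng=rng) for s in corpus]
```

Training on such inputs gives the UNK embedding a meaningful value, which is then used for OOV words at test time.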
For the parameters used in the Stack-LSTM, we follow previous work. We set the embedding sizes by intuition, making the word embedding twice as large as the character embedding, and the character embedding larger than the POS embedding. More specifically, we fix the size of word embeddings $n_w$ to 64, character embeddings $n_{char}$ to 32, POS embeddings $n_{pos}$ to 30, action embeddings $n_{dep}$ to 30, and semantic role embeddings $n_{srl}$ to 30. The LSTM input size is set to 128 and the LSTM hidden size to 128. We randomly initialize each parameter to a real value in $\left[-\sqrt{\frac{6}{r+c}}, \sqrt{\frac{6}{r+c}}\right]$, where $r$ is the number of input units and $c$ is the number of output units (Glorot and Bengio, 2010). To minimize the influence of external information, we did not pretrain the embedding values. In addition, we apply Gaussian noise $N(0, 0.2)$ to word embeddings during training to prevent overfitting.
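The initialization and noise injection can be sketched in NumPy as follows. The function names are hypothetical, and the sketch assumes that the second argument of $N(0, 0.2)$ is a standard deviation; if it denotes a variance, `sigma` should be its square root instead.

```python
import numpy as np

def glorot_uniform(r, c, rng):
    """Sample a (c x r) weight matrix uniformly from [-sqrt(6/(r+c)), sqrt(6/(r+c))]."""
    bound = np.sqrt(6.0 / (r + c))
    return rng.uniform(-bound, bound, size=(c, r))

def noisy_embedding(vec, rng, sigma=0.2, training=True):
    """Add zero-mean Gaussian noise to an embedding during training only."""
    if not training:
        return vec
    return vec + rng.normal(0.0, sigma, size=vec.shape)

rng = np.random.default_rng(0)
W = glorot_uniform(r=128, c=128, rng=rng)   # LSTM input and hidden sizes from the text
word_vec = glorot_uniform(r=1, c=64, rng=rng).ravel()  # toy 64-dimensional word embedding
train_vec = noisy_embedding(word_vec, rng)             # perturbed copy used in training
test_vec = noisy_embedding(word_vec, rng, training=False)
```

For `r = c = 128` the bound is $\sqrt{6/256} \approx 0.153$, so all initial weights lie in a narrow band around zero.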
We optimize model parameters using stochastic gradient descent with momentum, together with a learning rate decay mechanism from previous work. The best model parameters are selected according to a score metric on the development set. For different tasks, we use different score metrics to evaluate the parameters. Since there are three metrics, F1, UAS and LAS, possibly reported at the same time, we use a weighted average to consider the effect of all metrics when choosing the best model on the development set. In particular, we use F1 for SRL, 0.5 × LAS + 0.5 × UAS for parsing, and 0.5 × F1 + 0.25 × UAS + 0.25 × LAS for the joint task.
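The model selection rule above can be written as a small scoring function. The function names and the checkpoint records are hypothetical, and the numbers are made up for illustration; they are not results from this work.

```python
def dev_score(task, f1=None, uas=None, las=None):
    """Development-set score used to pick the best checkpoint for a task."""
    if task == "srl":
        return f1
    if task == "parsing":
        return 0.5 * las + 0.5 * uas
    if task == "joint":
        return 0.5 * f1 + 0.25 * uas + 0.25 * las
    raise ValueError(f"unknown task: {task}")

def select_best(checkpoints, task):
    """Return the checkpoint record with the highest development score."""
    return max(checkpoints, key=lambda ck: dev_score(task, **ck["metrics"]))

# Made-up development metrics for two hypothetical checkpoints.
checkpoints = [
    {"epoch": 1, "metrics": {"f1": 80.0, "uas": 90.0, "las": 86.0}},
    {"epoch": 2, "metrics": {"f1": 82.0, "uas": 89.0, "las": 85.0}},
]
best = select_best(checkpoints, "joint")
```

Under the joint metric, SRL carries half of the weight and the two parsing metrics share the other half, so a checkpoint that trades a little parsing accuracy for a larger F1 gain can still be preferred.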

Results
The final results are shown in Table 2, where F1 represents the F1-score of semantic roles, and UAS and LAS represent parsing accuracies. The Bi-LSTM row represents the bidirectional semantic role labeler, the S-LSTM row represents the Stack-LSTM parser, the DEP→SRL row represents the dependency parsing → SRL pipeline, the SRL→DEP row represents the SRL → dependency parsing pipeline, and the Joint row represents the parameter-shared model. For the DEP→SRL pipeline, lab and lstm represent the use of dependency label embeddings and tree LSTM hidden vectors for the additional SRL features $dep_t$, respectively. Comparison between Bi-LSTM and DEP→SRL shows that a slight improvement is brought by introducing dependency label features to the semantic role labeler (72.71→73.00). By introducing full tree information, the lstm integration leads to a much larger improvement (72.71→74.18). This demonstrates that the LSTM SRL model of Zhou and Xu (2015) can still benefit from parser outputs, even though it can learn syntactic information independently.
In the reverse direction, comparison between S-LSTM and SRL→DEP shows an improvement in UAS/LAS from integrating semantic role features (82.10→82.62). This demonstrates the usefulness of semantic roles to parsing and is consistent with observations on discrete models (Boxwell et al., 2010). To our knowledge, we are the first to report results using an SRL → parsing pipeline, which is enabled by the neural SRL model.
Using shared embeddings, the joint model gives improvements on both SRL and parsing. The most salient difference between the joint model and the two pipelines is the shared parameter space.
These results are consistent with the findings of Collobert et al. (2011), who show that POS, chunking and semantic role information can benefit each other in joint neural training. In contrast to their results (SRL 74.15→74.29, POS 97.12→97.22, CHUNK 93.37→93.75), we find that parsing and SRL benefit relatively more from each other (SRL 72.72→73.84, DEP 84.33→85.15). This is intuitive because parsing offers deeper syntactic information compared to POS and shallow syntactic chunking.

Conclusion
We investigated the mutual benefits between dependency syntax and semantic roles using two state-of-the-art LSTM models, finding that both can be further improved. In addition, simple multitask learning is also effective. These results demonstrate the potential for deeper joint neural models between these tasks.