Syntax Aware LSTM model for Semantic Role Labeling

In the Semantic Role Labeling (SRL) task, the tree-structured dependency relation is rich in syntactic information, but existing models do not handle it well. In this paper, we propose Syntax Aware Long Short-Term Memory (SA-LSTM). The structure of SA-LSTM changes according to the dependency structure of each sentence, so that SA-LSTM can model the whole tree structure of the dependency relation in an architecture engineering way. Experiments demonstrate that on Chinese Proposition Bank (CPB) 1.0, SA-LSTM improves F1 by 2.06% over an ordinary bi-LSTM with feature-engineered dependency relation information, and gives a state-of-the-art F1 of 79.92%. On the English CoNLL 2005 dataset, SA-LSTM brings an improvement (2.1%) to the bi-LSTM model and also brings a slight improvement (0.3%) when added to the state-of-the-art model.


Introduction
The task of Semantic Role Labeling (SRL) is to recognize the arguments of a given predicate in a sentence and assign semantic role labels to them. Many NLP tasks such as machine translation (Xiong et al., 2012; Aziz et al., 2011) benefit from SRL because of the semantic structure it provides. Figure 1 shows a sentence with semantic role labels.
Dependency relation is considered important for the SRL task (Xue, 2008; Punyakanok et al., 2008; Pradhan et al., 2005), since it provides rich structural and syntactic information. The bottom of Figure 1 shows the dependency structure of the sentence.
Figure 1: A sentence from the Chinese Proposition Bank (Xue and Palmer, 2003) with semantic role labels and dependency.
Traditional methods (Sun and Jurafsky, 2004; Xue, 2008; Chang, 2008, 2009; Sun, 2010) perform classification on manually designed features. Feature engineering requires expertise and is labor intensive. Recent works based on Recurrent Neural Networks (RNN) (Zhou and Xu, 2015; Wang et al., 2015; He et al., 2017) extract features automatically and significantly outperform traditional methods. However, because RNN methods treat language as sequential data, they fail to integrate the tree-structured dependency relation into the RNN.
We propose Syntax Aware Long Short-Term Memory (SA-LSTM) to directly model the complex tree structure of the dependency relation in an architecture engineering way. The architecture of SA-LSTM is shown in Figure 2. SA-LSTM is based on bidirectional LSTM (bi-LSTM). In order to model the whole dependency tree, we add directed connections between dependency related words in the bi-LSTM, so that the tree is integrated directly into the model. To take the dependency relation type into account, we also introduce a trainable weight for each type of dependency relation. The weights can be trained to indicate the importance of a dependency type.
Experiments show that SA-LSTM models the dependency relation better than the traditional feature engineering way. SA-LSTM gives state-of-the-art F1 on CPB 1.0 and also shows improvement on the English CoNLL 2005 dataset.

Syntax Aware LSTM
In this section, we first introduce the ordinary bi-LSTM. Based on the bi-LSTM, we then introduce the proposed SA-LSTM. Finally, we describe how SA-LSTM is optimized.

Conventional bi-LSTM Model for SRL
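In a corpus sentence, each word $w_t$ has a feature representation $x_t$, which is generated automatically as in Wang et al. (2015). $z_t$ is the feature embedding for $w_t$, calculated as follows:

$$z_t = f(W_1 x_t) \tag{1}$$

where $W_1 \in \mathbb{R}^{n_1 \times n_0}$ and $n_0$ is the dimension of the word feature representation. Each word $w_t$ has six internal vectors, $\hat{C}$, $g_i$, $g_f$, $g_o$, $C_t$, and $h_t$, shown in Equation 2:

$$\begin{aligned}
\hat{C} &= f(W_c z_t + U_c h_{t-1} + b_c) \\
g_j &= \sigma(W_j z_t + U_j h_{t-1} + b_j), \quad j \in \{i, f, o\} \\
C_t &= g_i \odot \hat{C} + g_f \odot C_{t-1} \\
h_t &= g_o \odot f(C_t)
\end{aligned} \tag{2}$$

where $\hat{C}$ is the candidate value of the current cell state, the $g$ are gates used to control the flow of information, $C_t$ is the current cell state, and $h_t$ is the hidden state of $w_t$. $W_x$ and $U_x$ are matrices used in linear transformation:

$$W_x \in \mathbb{R}^{n_h \times n_1}, \quad U_x \in \mathbb{R}^{n_h \times n_h}, \quad x \in \{c, i, f, o\} \tag{3}$$

By convention, $f$ stands for tanh and $\sigma$ stands for sigmoid; $\odot$ means element-wise multiplication.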
In order to make use of bidirectional information, the forward $\overrightarrow{h_t}^T$ and backward $\overleftarrow{h_t}^T$ are concatenated together, as shown in Equation 4:

$$a_t = [\overrightarrow{h_t}^T, \overleftarrow{h_t}^T] \tag{4}$$

Finally, $o_t$ is the result vector, with each dimension corresponding to the score of one semantic role tag. It is calculated as shown in Equation 5:

$$o_t = W_3 f(W_2 a_t) \tag{5}$$

where $W_2 \in \mathbb{R}^{n_3 \times n_2}$, $n_2 = 2 n_h$, $W_3 \in \mathbb{R}^{n_4 \times n_3}$, and $n_4$ is the number of tags in the IOBES tagging schema.

Figure 2: Structure of Syntax Aware LSTM. The purple square is the cell currently being calculated. The green square is a dependency related cell.
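As an illustration of the conventional bi-LSTM tagger described in this subsection, the following NumPy sketch runs one forward and one backward LSTM pass and projects the concatenated hidden states to tag scores. It assumes the standard LSTM recurrence, a tanh input embedding, and a tanh before the final projection. The helper names (`init_lstm`, `lstm_pass`, `bilstm_scores`), the initialization scheme, and the toy dimensions are hypothetical and chosen only to keep the example small.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_lstm(n1, nh, rng):
    """Hypothetical initializer: small random W_x, U_x and zero b_x for x in {c, i, f, o}."""
    p = {}
    for x in "cifo":
        p["W" + x] = rng.normal(0.0, 0.1, (nh, n1))
        p["U" + x] = rng.normal(0.0, 0.1, (nh, nh))
        p["b" + x] = np.zeros(nh)
    return p

def lstm_pass(Z, p, reverse=False):
    """Run one LSTM direction over Z (T x n1); return hidden states (T x nh)."""
    T, nh = Z.shape[0], p["bc"].shape[0]
    h, C = np.zeros(nh), np.zeros(nh)
    H = np.zeros((T, nh))
    steps = range(T - 1, -1, -1) if reverse else range(T)
    for t in steps:
        # Pre-activations W_x z_t + U_x h_prev + b_x for the candidate and the gates.
        pre = {x: p["W" + x] @ Z[t] + p["U" + x] @ h + p["b" + x] for x in "cifo"}
        C_hat = np.tanh(pre["c"])
        g_i, g_f, g_o = sigmoid(pre["i"]), sigmoid(pre["f"]), sigmoid(pre["o"])
        C = g_i * C_hat + g_f * C          # new cell state
        h = g_o * np.tanh(C)               # new hidden state
        H[t] = h
    return H

def bilstm_scores(X, W1, fwd, bwd, W2, W3):
    """Map word features X (T x n0) to tag scores (T x n4)."""
    Z = np.tanh(X @ W1.T)                                   # feature embedding z_t
    A = np.concatenate([lstm_pass(Z, fwd),
                        lstm_pass(Z, bwd, reverse=True)], axis=1)  # concatenated states
    return np.tanh(A @ W2.T) @ W3.T                         # tag scores o_t

rng = np.random.default_rng(0)
n0, n1, nh, n3, n4 = 8, 6, 5, 4, 3          # toy sizes, not the paper's settings
X = rng.normal(size=(7, n0))                # a 7-word toy sentence
W1 = rng.normal(0.0, 0.1, (n1, n0))
W2 = rng.normal(0.0, 0.1, (n3, 2 * nh))
W3 = rng.normal(0.0, 0.1, (n4, n3))
fwd, bwd = init_lstm(n1, nh, rng), init_lstm(n1, nh, rng)
O = bilstm_scores(X, W1, fwd, bwd, W2, W3)
print(O.shape)  # (7, 3)
```

Each row of `O` holds the scores of one word over the tag set; the sketch covers inference only and leaves out training.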

Syntax Aware LSTM Model for SRL
This section introduces the proposed SA-LSTM model, whose structure is shown in Figure 2. SA-LSTM is based on bidirectional LSTM. By architecture engineering, SA-LSTM can model the whole tree structure of the dependency relation.
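$S_t$ is the key component of SA-LSTM. It stands for information from other dependency related words, and is calculated as shown in Equation 6:

$$S_t = f\Big(\sum_{i<t} \alpha \times h_i\Big) \tag{6}$$

$S_t$ is computed from the weighted sum of all hidden state vectors $h_i$ that come from previous words $w_i$. Note that $\alpha \in \{0, 1\}$ indicates whether there is a dependency relation pointing from $w_i$ to $w_t$:

$$\alpha = \begin{cases} 1 & \text{if there is a dependency relation from } w_i \text{ to } w_t \\ 0 & \text{otherwise} \end{cases} \tag{7}$$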
We add a gate $g_s$ to constrain information from $S_t$, as shown in Equation 8:

$$g_s = \sigma(W_s z_t + U_s h_{t-1} + b_s) \tag{8}$$

To protect the original word information from being diluted (Wu et al., 2016) by $S_t$, we add $S_t$ to the hidden layer vector $h_t$ instead of adding it to the cell state $C_t$. So $h_t$ in an SA-LSTM cell is calculated as:

$$h_t = g_o \odot f(C_t) + g_s \odot S_t \tag{9}$$

For example, in Figure 2, there is a dependency relation "advmod" from the green square to the purple square. By Equation 7, only the hidden state of the green square is selected; it is then gated by $g_s$ (Equation 8) and finally added to the hidden layer of the purple cell.
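The following NumPy sketch illustrates how the syntax-aware term could enter a single forward-direction LSTM pass: hidden states of earlier, dependency related words are summed, squashed, gated by $g_s$, and added to the hidden state rather than to the cell state. It is a minimal sketch under stated assumptions: the matrix `alpha` (with `alpha[i, t]` nonzero when an arc points from word i to word t), the helper names `init_params` and `sa_lstm_pass`, the initialization, the tanh applied to the sum, and the toy sizes are all illustrative choices, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(n1, nh, rng):
    """Hypothetical initializer for W_x, U_x, b_x with x in {c, i, f, o, s}."""
    p = {}
    for x in "cifos":
        p["W" + x] = rng.normal(0.0, 0.1, (nh, n1))
        p["U" + x] = rng.normal(0.0, 0.1, (nh, nh))
        p["b" + x] = np.zeros(nh)
    return p

def sa_lstm_pass(Z, alpha, p):
    """Forward-direction SA-LSTM over Z (T x n1); alpha is (T x T), only i < t is used."""
    T, nh = Z.shape[0], p["bc"].shape[0]
    h_prev, C = np.zeros(nh), np.zeros(nh)
    H = np.zeros((T, nh))
    for t in range(T):
        pre = {x: p["W" + x] @ Z[t] + p["U" + x] @ h_prev + p["b" + x] for x in "cifos"}
        C_hat = np.tanh(pre["c"])
        g_i, g_f, g_o, g_s = (sigmoid(pre[x]) for x in "ifos")
        C = g_i * C_hat + g_f * C
        # Syntax term: weighted sum of hidden states of earlier words linked to word t.
        S = np.tanh(alpha[:t, t] @ H[:t])
        # S is added to the hidden state, not to the cell state.
        h_prev = g_o * np.tanh(C) + g_s * S
        H[t] = h_prev
    return H

rng = np.random.default_rng(0)
T, n1, nh = 5, 4, 3                       # toy sizes
p = init_params(n1, nh, rng)
Z = rng.normal(size=(T, n1))
alpha = np.zeros((T, T))
alpha[1, 3] = 1.0                         # hypothetical arc from word 1 to word 3
H_plain = sa_lstm_pass(Z, np.zeros((T, T)), p)
H_dep = sa_lstm_pass(Z, alpha, p)
print(np.allclose(H_plain[:3], H_dep[:3]))  # True
```

With an all-zero `alpha` the pass reduces to an ordinary LSTM. The single arc leaves words 0 to 2 unchanged and alters the hidden state from word 3 onward, which is the change in information flow that the architecture relies on.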
SA-LSTM changes structure by adding different connections according to dependency relation. In this way, SA-LSTM integrates the whole tree structure of dependency.
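However, the $\alpha$ of Equation 7 does not take the dependency type into account, so we further refine the way $\alpha$ is calculated, from Equation 7 to Equation 10:

$$\alpha = \begin{cases} \alpha_m & \text{if there is a type-}m \text{ dependency relation from } w_i \text{ to } w_t \\ 0 & \text{otherwise} \end{cases} \tag{10}$$

Each type $m$ of dependency relation is assigned a trainable weight $\alpha_m$. In this way, SA-LSTM can model differences between types of dependency relation.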
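The difference between the binary α of Equation 7 and the typed α of Equation 10 can be sketched as a lookup that turns a list of dependency arcs into a connection matrix. The function name `build_alpha`, the arc indices, and the weight values below are hypothetical; in the model the per-type weights are trainable parameters, whereas this sketch uses fixed numbers only to show the lookup.

```python
import numpy as np

def build_alpha(n_words, arcs, type_weight=None):
    """Return an (n_words x n_words) matrix A with A[i, t] = alpha for an arc w_i -> w_t.

    arcs: list of (i, t, dep_type) triples.
    type_weight: None for the binary variant, or a dict mapping each
                 dependency type m to its weight alpha_m for the typed variant.
    """
    A = np.zeros((n_words, n_words))
    for i, t, m in arcs:
        A[i, t] = 1.0 if type_weight is None else type_weight[m]
    return A

# Hypothetical arcs for a 4-word sentence (indices and types are made up).
arcs = [(0, 2, "nsubj"), (1, 2, "advmod"), (2, 3, "dobj")]
# Hypothetical weights; a weight of 0 switches a dependency type off entirely.
weights = {"nsubj": 0.8, "advmod": 0.0, "dobj": 1.2}

A_binary = build_alpha(4, arcs)
A_typed = build_alpha(4, arcs, weights)
print(int(np.count_nonzero(A_binary)), int(np.count_nonzero(A_typed)))  # 3 2
```

In the typed matrix the "advmod" arc disappears because its weight is 0, while the "dobj" arc contributes more strongly than in the binary case.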

Optimization
This section introduces the optimization method for SA-LSTM. We use the maximum likelihood criterion to train SA-LSTM and choose the stochastic gradient descent algorithm to optimize parameters. Given the current training pair $T = (x, y)$, $x$ denotes the current training sentence and $y$ is the corresponding correct answer path. $y_t = k$ means that the $t$-th word has the $k$-th semantic role label.
The score of a path is calculated from $o_t$ as:

$$s(x, y, \theta) = \sum_{t=1}^{N_i} o_{t y_t}$$

where $N_i$ is the number of words in the current sentence and $\theta$ stands for all parameters. So the log-likelihood of a single sentence is

$$\log p(y \mid x, \theta) = s(x, y, \theta) - \log \sum_{y'} \exp\big(s(x, y', \theta)\big)$$

where $y'$ ranges over all valid answer paths. We use the Viterbi algorithm to calculate the global normalization. In addition, we collect the impossible transitions from the corpus beforehand. When calculating the global normalization, we exclude paths that contain impossible transitions.
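A minimal sketch of the sentence-level log-likelihood with forbidden transitions, assuming that a path score is the sum of per-word tag scores and that the normalization sums over all paths without impossible transitions. The text attributes the normalization to the Viterbi algorithm; this sketch uses the sum-over-paths variant of the same lattice recursion, computed in log space. The function name, the `allowed` matrix, and the toy scores are hypothetical.

```python
import numpy as np

def sentence_log_likelihood(O, y, allowed):
    """Log-likelihood of tag path y given scores O (N x K).

    allowed[a, b] is True when tag b may directly follow tag a.
    The gold path y is assumed to contain only allowed transitions.
    """
    N, K = O.shape
    gold = sum(O[t, y[t]] for t in range(N))            # score of the gold path
    trans = np.where(allowed, 0.0, -np.inf)             # forbidden steps get weight exp(-inf) = 0
    log_alpha = O[0].copy()                             # log-sum of path scores ending in each tag
    for t in range(1, N):
        log_alpha = np.logaddexp.reduce(log_alpha[:, None] + trans, axis=0) + O[t]
    log_Z = np.logaddexp.reduce(log_alpha)              # log of the global normalization
    return gold - log_Z

# Toy example: 2 words, 2 tags, transition 0 -> 1 is impossible.
O_demo = np.array([[1.0, 0.5],
                   [0.2, 2.0]])
allowed_demo = np.array([[True, False],
                         [True, True]])
ll = sentence_log_likelihood(O_demo, [1, 1], allowed_demo)
print(ll < 0.0)  # True
```

Because the gold path is one of several valid paths, its log-likelihood is strictly negative. Maximizing it raises the gold path's score relative to the other valid paths.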

Experiment setting
In order to compare with previous Chinese SRL works, we run our experiments on CPB 1.0. We follow the same data setting as previous Chinese SRL work (Xue, 2008; Sun et al., 2009). Pre-trained¹ word embeddings are also tested with SA-LSTM and show improvement.
For English SRL, we test on the CoNLL 2005 dataset.
We use the Stanford Parser (Chen and Manning, 2014) to obtain the dependency relation. The training set of the Chinese parser overlaps with part of the CPB 1.0 test set, so we retrained the parser. Hyperparameters are tuned on the development set: n 1 = 200, n h = 100, n 2 = 200, n 3 = 100, learning rate = 0.001.
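For reference, the reported hyperparameters can be collected in a small configuration object. This is only a sketch: the class name, the field names, and the consistency check (that the concatenated bidirectional state has twice the hidden size) are assumptions added for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SALSTMConfig:
    """Hyperparameter values as reported in the experiment setting."""
    n1: int = 200                 # feature embedding size
    nh: int = 100                 # LSTM hidden size per direction
    n2: int = 200                 # size of the concatenated bidirectional state
    n3: int = 100                 # size of the layer before the tag scores
    learning_rate: float = 0.001

    def check(self):
        # Assumed consistency rule: forward and backward states are concatenated.
        if self.n2 != 2 * self.nh:
            raise ValueError("n2 must equal 2 * nh")
        return True

config = SALSTMConfig()
print(config.check())  # True
```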

Syntax Aware LSTM Performance
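To show that SA-LSTM models the dependency relation better than simple feature engineering, we compare it with a bi-LSTM baseline that receives dependency information as an additional feature $F_t$ ("bi-LSTM + feature engineering dependency"). $F_t$ is computed over the dependency related words of $w_t$, where $T$ is the number of dependency related words and $\alpha$ is a 0/1 variable calculated as in Equation 7. $F_t$ is then concatenated to $x_t$ to form a new feature representation, and these representations are fed into the bi-LSTM.
¹ Trained by word2vec on Chinese Gigaword Corpus.
² All experiment code and related files are available on request.
³ We test the model on CPB 1.0.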
As shown in Table 1, on CPB 1.0, SA-LSTM reaches a 79.81% F1 score with random initialization and a 79.92% F1 score with pre-trained word embeddings. Both are the best F1 scores published on the CPB 1.0 dataset.
Compared with the "bi-LSTM + feature engineering dependency" model, the architecture method of SA-LSTM clearly gains more improvement (77.09% to 79.81%) than the simple feature engineering method (77.09% to 77.75%). Path-LSTM (Roth and Lapata, 2016) embeds the dependency path between predicate and argument for each word using an LSTM, then performs classification according to this path embedding and some other features. SA-LSTM (79.81% F1) outperforms Path-LSTM (79.01% F1) on CPB 1.0.
Both "bi-LSTM + feature engineering dependency" and Path-LSTM model dependency parsing information only for each single word, and so cannot model the whole dependency tree structure. By building the dependency relation directly into the structure of SA-LSTM and changing the way information flows, SA-LSTM is able to model the whole tree structure of the dependency relation.
We also test SA-LSTM on the English CoNLL 2005 dataset. Replacing the conventional bi-LSTM with SA-LSTM brings a 1.7% F1 improvement. Replacing the bi-LSTM layers of the state-of-the-art model (He et al., 2017) with SA-LSTM brings a 0.3% F1 improvement.

Visualization of Trained Weights
According to Equation 10, the influence of a single type of dependency relation is multiplied by the type weight α m . When α m is 0, the influence of this type of dependency relation is ignored entirely. The larger the weight, the more influence this type of dependency relation has on the whole system.
As shown in Figure 3, the dependency relation type dobj receives the highest weight after training, as shown by the red bar. According to grammar knowledge, dobj should be an informative relation for the SRL task, and SA-LSTM gives dobj the most influence automatically. This further demonstrates that the result of SA-LSTM is highly consistent with grammar knowledge.

Related works
Semantic role labeling (SRL) was first defined by Gildea and Jurafsky (2002). Early works (Gildea and Jurafsky, 2002; Sun and Jurafsky, 2004) on SRL achieved promising results without large annotated SRL corpora. Xue and Palmer (2003) built the Chinese Proposition Bank to standardize Chinese SRL research.
Traditional works such as (Xue and Palmer, 2005; Xue, 2008; Ding and Chang, 2009; Sun et al., 2009; Chen et al., 2006; Yang et al., 2014) use feature engineering methods. These methods can take the dependency relation into account through engineered features, such as the syntactic path feature. Feature engineering methods, however, cannot fully capture the tree structure of the dependency relation.
More recent SRL works often use neural network based methods. Collobert and Weston (2008) proposed a Convolutional Neural Network (CNN) method for SRL. Zhou and Xu (2015) proposed a bidirectional RNN-LSTM method for English SRL, and Wang et al. (2015) proposed a bi-RNN-LSTM method for Chinese SRL, on which SA-LSTM is based. He et al. (2017) further extend the work of Zhou and Xu (2015). NN based methods extract features automatically and significantly outperform traditional methods. However, most NN based methods cannot utilize the dependency relation, which is considered important for semantics-related NLP tasks (Xue, 2008; Punyakanok et al., 2008; Pradhan et al., 2005).
The works of Roth and Lapata (2016) and Sha et al. (2016) share the motivation of SA-LSTM but pursue it in different ways. Sha et al. (2016) use the dependency relation as a feature for argument relation classification. Roth and Lapata (2016) embed the dependency path into the feature representation of each word using an LSTM. In contrast, SA-LSTM utilizes the dependency relation in an architecture engineering way, by integrating the whole dependency tree structure directly into the SA-LSTM structure.

Conclusion
We propose the Syntax Aware LSTM (SA-LSTM) model for Semantic Role Labeling (SRL). SA-LSTM is able to directly model the whole tree structure of the dependency relation in an architecture engineering way. Experiments show that SA-LSTM models the dependency relation better than the traditional feature engineering way. SA-LSTM gives state-of-the-art F1 on CPB 1.0 and also shows improvement on the English CoNLL 2005 dataset.