Neural Modeling of Multi-Predicate Interactions for Japanese Predicate Argument Structure Analysis

The performance of Japanese predicate argument structure (PAS) analysis has improved in recent years thanks to joint modeling of the interactions between multiple predicates. However, this approach relies heavily on syntactic information predicted by parsers and suffers from error propagation. To remedy this problem, we introduce a model that uses grid-type recurrent neural networks. The proposed model automatically induces features sensitive to multi-predicate interactions from the word sequence information of a sentence. Experiments on the NAIST Text Corpus demonstrate that, without syntactic information, our model outperforms previous syntax-dependent models.


Introduction
Predicate argument structure (PAS) analysis is a basic semantic analysis task, in which systems are required to identify the semantic units of a sentence, such as who did what to whom. In pro-drop languages such as Japanese, Chinese and Italian, arguments are often omitted in text, and such argument omission is regarded as one of the most problematic issues facing PAS analysis (Iida and Poesio, 2011; Sasano and Kurohashi, 2011; Hangyo et al., 2013).
In response to the argument omission problem, a joint model of the interactions between multiple predicates has been gaining popularity in Japanese PAS analysis and has achieved state-of-the-art results (Ouchi et al., 2015; Shibata et al., 2016). This approach is based on the linguistic intuition that the predicates in a sentence are semantically related to each other, and that capturing this relation can be useful for PAS analysis.

Figure 1: Example of Japanese PAS. The upper edges denote dependency relations, and the lower edges denote case arguments. "NOM" and "ACC" denote the nominative and accusative arguments, respectively. "ϕ_i" is a zero pronoun, referring to the antecedent "man_i".

In the example sentence in Figure 1, the word "man_i" is the accusative argument of the predicate "arrested" and is shared by the other predicate "escaped" as its nominative argument. Considering the semantic relation between "arrested" and "escaped", we intuitively know that the person arrested by someone is likely to be the escaper. That is, information about one predicate-argument relation can help to identify another predicate-argument relation. However, to model such multi-predicate interactions, the joint approach in previous studies relies heavily on syntactic information, such as part-of-speech (POS) tags and dependency relations predicted by POS taggers and syntactic parsers. Consequently, it suffers from error propagation caused by pipeline processing.
To remedy this problem, we propose a neural model which automatically induces features sensitive to multi-predicate interactions exclusively from the word sequence information of a sentence. The proposed model takes as input all predicates and their argument candidates in a sentence at a time, and captures the interactions using grid-type recurrent neural networks (Grid-RNNs) without syntactic information. In this paper, we first introduce a basic model that uses RNNs. This model independently estimates the arguments of each predicate without considering multi-predicate interactions (Sec. 3). Then, extending this model, we propose a neural model that uses Grid-RNNs to capture the multi-predicate interactions (Sec. 4).
Performing experiments on the NAIST Text Corpus (Iida et al., 2007), we demonstrate that even without syntactic information, our neural models outperform previous syntax-dependent models (Imamura et al., 2009; Ouchi et al., 2015). In particular, the neural model using Grid-RNNs achieved the best result. This suggests that the proposed grid-type neural architecture effectively captures multi-predicate interactions and contributes to performance improvements.

Japanese Predicate Argument Structure Analysis

Task Description
In Japanese PAS analysis, for each predicate, we identify the arguments that fill one of three major case roles: the nominative (NOM), accusative (ACC) and dative (DAT) cases. Arguments can be divided into the following three categories according to their positions relative to their predicates (Hayashibe et al., 2011; Ouchi et al., 2015):

Dep: Arguments that have a direct syntactic dependency on the predicate.

Zero: Arguments referred to by zero pronouns within the same sentence, with no direct syntactic dependency on the predicate.

Inter-Zero: Arguments referred to by zero pronouns outside of the sentence.
(Our source code is publicly available at https://github.com/hiroki13/neural-pasa-system)

For example, in Figure 1, the nominative argument "police" for the predicate "arrested" is regarded as a Dep argument, because the argument has a direct syntactic dependency on the predicate. By contrast, the nominative argument "man_i" for the predicate "escaped" is regarded as a Zero argument, because the argument has no direct syntactic dependency on the predicate.
In this paper, we focus on the analysis of these intra-sentential arguments, i.e., Dep and Zero. In order to identify inter-sentential arguments (Inter-Zero), a much broader space must be searched (e.g., the whole document), resulting in a much more complicated analysis than for intra-sentential arguments. Owing to this complication, Ouchi et al. (2015) and Shibata et al. (2016) focused exclusively on intra-sentential argument analysis. Following this trend, we also restrict our focus to intra-sentential argument analysis.

Challenging Problem
Arguments are often omitted in Japanese sentences. In Figure 1, ϕ_i represents an omitted argument, called a zero pronoun. This zero pronoun ϕ_i refers to "man_i". In Japanese PAS analysis, when an argument of the target predicate is omitted, we have to identify the antecedent of the omitted argument (i.e., the Zero argument).
The analysis of such Zero arguments is much more difficult than that of Dep arguments, owing to the lack of direct syntactic dependencies. For Dep arguments, the syntactic dependency between an argument and its predicate is a strong clue. In the sentence in Figure 1, for the predicate "arrested", the nominative argument is "police". This argument is easily identified by relying on the syntactic dependency. By contrast, because the nominative argument "man_i" has no syntactic dependency on its predicate "escaped", we must rely on other information to identify this zero argument.
As a solution to this problem, we exploit two kinds of information: (i) the context of the entire sentence, and (ii) multi-predicate interactions. For the former, we introduce a single-sequence model that induces context-sensitive representations from a sequence of argument candidates of a predicate. For the latter, we introduce a multi-sequence model that induces predicate-sensitive representations from multiple sequences of argument candidates of all predicates in a sentence (shown in Figure 2).

Single-Sequence Model
The single-sequence model exploits stacked bidirectional RNNs (Bi-RNNs) (Schuster and Paliwal, 1997; Graves et al., 2005, 2013; Zhou and Xu, 2015). Figure 3 shows the overall architecture, which consists of the following three components:

Input Layer: Map each word to a feature vector representation.

RNN Layer: Update the feature vectors recurrently using stacked Bi-RNNs.

Output Layer: Compute the probability of each case label for each word using the softmax function.

In the following subsections, we describe each of these three components in detail.

Input Layer
Given an input sentence w_1:T = (w_1, ..., w_T) and a predicate p, each word w_t is mapped to a feature representation x_t, which is the concatenation (⊕) of three types of vectors:

x_t = x_t^arg ⊕ x_t^pred ⊕ x_t^mark    (1)

where each vector is based on the following atomic features, inspired by Zhou and Xu (2015):

ARG: Word index of each word.
PRED: Word index of the target predicate and the words around the predicate.
MARK: Binary index that represents whether or not the word is the predicate.

Figure 4 presents an example of the atomic features. For the ARG feature, we extract a word index x_word ∈ V for each word. Similarly, for the PRED feature, we extract the word indices of the C words centered on the target predicate, where C denotes the window size. The MARK feature x_mark ∈ {0, 1} is a binary value that represents whether or not the word is the predicate. Then, using these feature indices, we extract feature vector representations from the embedding matrices. Figure 5 shows the process of creating the feature vector x_1 for the word w_1 ("she"). We set two embedding matrices: (i) a word embedding matrix E_word ∈ R^(d_word × |V|), and (ii) a mark embedding matrix E_mark ∈ R^(d_mark × 2). From each embedding matrix, we extract the corresponding column vectors and concatenate them into the feature vector x_t based on Eq. 1.
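The feature extraction above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the vocabulary dictionary, and the `<UNK>`/`<PAD>` entries are hypothetical, and out-of-sentence window positions are assumed to be padded.

```python
def extract_features(words, pred_index, vocab, window=5):
    """Sketch of the ARG / PRED / MARK atomic features (hypothetical helper).

    ARG:  vocabulary index of each word.
    PRED: indices of the `window` words centered on the target predicate.
    MARK: 1 if the word is the predicate, else 0.
    """
    unk = vocab["<UNK>"]
    arg = [vocab.get(w, unk) for w in words]
    half = window // 2
    pred = []
    for i in range(pred_index - half, pred_index + half + 1):
        if 0 <= i < len(words):
            pred.append(vocab.get(words[i], unk))
        else:
            pred.append(vocab["<PAD>"])  # assumed padding outside the sentence
    mark = [1 if t == pred_index else 0 for t in range(len(words))]
    return arg, pred, mark
```

Each index would then select a column from the corresponding embedding matrix, and the columns are concatenated per Eq. 1.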
Each feature vector x_t is then multiplied by a parameter matrix W_x:

h_t^(0) = W_x x_t    (2)

The resulting vector h_t^(0) is given to the first RNN layer as input.

RNN Layer
In the RNN layers, the feature vectors are updated recurrently using Bi-RNNs. The Bi-RNNs process an input sequence in a left-to-right manner for odd-numbered layers and in a right-to-left manner for even-numbered layers. By stacking these layers, we can construct deeper network structures.
Stacked Bi-RNNs consist of L layers, and the hidden state in layer ℓ ∈ (1, ..., L) is calculated as follows:

h_t^(ℓ) = g^(ℓ)(h_t^(ℓ−1), h_{t−1}^(ℓ))  (ℓ odd)
h_t^(ℓ) = g^(ℓ)(h_t^(ℓ−1), h_{t+1}^(ℓ))  (ℓ even)    (3)

Both the odd- and even-numbered layers receive h_t^(ℓ−1), the t-th hidden state of the (ℓ−1)-th layer, as the first input of the function g^(ℓ), which is an arbitrary recurrent function (GRUs in our experiments). For the second input of g^(ℓ), odd-numbered layers receive h_{t−1}^(ℓ), whereas even-numbered layers receive h_{t+1}^(ℓ). By calculating the hidden states up to the L-th layer, we obtain the hidden state sequence (h_1^(L), ..., h_T^(L)). From each h_t^(L), we calculate the probability of the case labels for each word in the output layer.
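The alternating-direction recurrence of Eq. 3 can be sketched as below. This is an illustrative simplification: a plain tanh cell stands in for the GRU used in the paper, and the per-layer weight pairs are hypothetical.

```python
import numpy as np

def stacked_birnn(x, weights, L):
    """Sketch of stacked Bi-RNNs with alternating directions (Eq. 3).

    Odd-numbered layers scan left-to-right; even-numbered layers scan
    right-to-left.  `x` holds the projected inputs h^(0), shape (T, d);
    `weights` holds a hypothetical (W_in, W_rec) pair per layer.
    """
    h = x  # h^(0): the input vectors from Eq. 2
    T, d = x.shape
    for layer in range(1, L + 1):
        W_in, W_rec = weights[layer - 1]
        order = range(T) if layer % 2 == 1 else range(T - 1, -1, -1)
        prev = np.zeros(d)           # boundary state (h_0 or h_{T+1})
        new_h = np.zeros_like(h)
        for t in order:
            # tanh cell in place of the GRU: combines the lower layer's
            # state at t with this layer's previous state in scan order
            prev = np.tanh(h[t] @ W_in + prev @ W_rec)
            new_h[t] = prev
        h = new_h
    return h  # h^(L): context-sensitive representations
```

Stacking an even number of layers gives every position access to both left and right context.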

Output Layer
For the output layer, multi-class classification is performed using the softmax function:

y_t = softmax(W_y h_t^(L))

where h_t^(L) denotes the vector representation propagated from the last RNN layer (Fig. 3). Each element of y_t is a probability value corresponding to one label, and the label with the maximum probability is output as the result. In this work, we set five labels: NOM, ACC, DAT, PRED, and null. PRED is the label for the predicate, and null denotes a word that does not fill any case role.
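A per-word softmax over the five labels can be sketched as follows; the parameter matrix `W_y` is a hypothetical (d, 5) projection, and the max-subtraction is only a standard numerical-stability trick.

```python
import numpy as np

LABELS = ["NOM", "ACC", "DAT", "PRED", "null"]

def output_layer(h_last, W_y):
    """Sketch of the output layer: softmax over the five case labels.

    h_last: (T, d) hidden states from the last RNN layer.
    W_y:    hypothetical (d, 5) parameter matrix.
    """
    scores = h_last @ W_y                        # (T, 5) label scores
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(scores)
    y = e / e.sum(axis=1, keepdims=True)         # each row sums to 1
    predicted = [LABELS[i] for i in y.argmax(axis=1)]
    return predicted, y
```

At inference time, the argmax label per word is taken as the analysis result.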

Multi-Sequence Model
Whereas the single-sequence model assumes independence between predicates, the multi-sequence model assumes multi-predicate interactions. To capture the interactions between all predicates in a sentence, we extend the single-sequence model to the multi-sequence model using Grid-RNNs (Graves and Schmidhuber, 2009; Kalchbrenner et al., 2016). Figure 6 presents the overall architecture of the multi-sequence model, which consists of three components:

Input Layer: Map words to M sequences of feature vectors for the M predicates.
Grid Layer: Update the hidden states over different sequences using Grid-RNNs.
Output Layer: Compute the probability of each case label for each word using the softmax function.
In the following subsections, we describe these three components in detail.

Input Layer
The multi-sequence model takes as input a sentence w_1:T = (w_1, ..., w_T) and all predicates {p_m}_{m=1}^M in the sentence. For each predicate p_m, the input layer creates a sequence of feature vectors X_m = (x_{m,1}, ..., x_{m,T}) by mapping each input word w_t to a feature vector x_{m,t} based on Eq. 1. That is, for M predicates, M sequences of feature vectors {X_m}_{m=1}^M are created. Then, using Eq. 2, each feature vector x_{m,t} is mapped to h_{m,t}^(0), which is given to the first grid layer as input.

Grid Layer
Inter-Sequence Connections

In the grid layers, we use Grid-RNNs to propagate feature information over the different sequences (inter-sequence connections). The figure on the right in Figure 6 shows the first grid layer. The hidden states are recurrently calculated from the upper-left (m = 1, t = 1) to the lower-right (m = M, t = T).
Formally, in the ℓ-th layer, the hidden state h_{m,t}^(ℓ) is calculated as follows:

h_{m,t}^(ℓ) = g^(ℓ)(h_{m,t}^(ℓ−1) ⊕ h_{m−1,t}^(ℓ), h_{m,t−1}^(ℓ))

This equation is similar to Eq. 3. The main difference is that the hidden state of a neighboring sequence, h_{m−1,t}^(ℓ) (or h_{m+1,t}^(ℓ) for layers scanning in the reverse direction), is concatenated (⊕) with the hidden state of the previous (ℓ−1)-th layer, h_{m,t}^(ℓ−1), and the concatenation is taken as the input of the function g^(ℓ). In the figure on the right in Figure 6, the blue curved lines represent the inter-sequence connections. Taking as input the hidden states of neighboring sequences, the network propagates feature information over multiple sequences (i.e., predicates). By calculating the hidden states up to the L-th layer, we obtain M sequences of hidden states, i.e., {H_m^(L)}_{m=1}^M.
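One forward-scanning grid layer can be sketched as below. Again a tanh cell stands in for the GRU, boundary states are assumed to be zero vectors, and only the odd-numbered (forward) direction is shown; even-numbered layers would scan from the lower-right corner instead.

```python
import numpy as np

def grid_layer(H_prev, W_in, W_rec):
    """Sketch of one (odd-numbered) grid layer.

    H_prev: (M, T, d) hidden states from the previous layer.
    The state h_{m,t} is computed from the upper-left (m=0, t=0) to the
    lower-right: the previous layer's h_{m,t} is concatenated with the
    neighboring sequence's h_{m-1,t} (the inter-sequence connection),
    and the recurrent input is h_{m,t-1}.
    """
    M, T, d = H_prev.shape
    H = np.zeros_like(H_prev)
    for m in range(M):
        for t in range(T):
            neighbor = H[m - 1, t] if m > 0 else np.zeros(d)
            recurrent = H[m, t - 1] if t > 0 else np.zeros(d)
            inp = np.concatenate([H_prev[m, t], neighbor])  # (2d,)
            # tanh cell in place of the GRU; W_in is (2d, d), W_rec (d, d)
            H[m, t] = np.tanh(inp @ W_in + recurrent @ W_rec)
    return H
```

Because each cell reads the neighboring predicate's state at the same time step, information about one predicate-argument decision can flow into the others.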

Residual Connections
As more layers are stacked, it becomes more difficult to learn the model parameters, owing to challenges such as the vanishing gradient problem (Pascanu et al., 2013). In this work, we integrate residual connections (He et al., 2015; Wu et al., 2016) into our networks to form shortcut connections between layers. Specifically, the input vector of each layer is added to its output, so that each layer learns a residual function and gradients can propagate directly across layers.
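The shortcut scheme can be sketched in a few lines; this is a generic residual stack under the assumption that each layer preserves the hidden dimension, not the authors' exact wiring.

```python
import numpy as np

def stack_with_residuals(x, layers):
    """Sketch of stacking layers with residual connections.

    Each layer's output is added to its input (He et al. (2015)-style
    shortcut), so deep RNN/grid stacks remain trainable: the identity
    path lets gradients bypass each transformation.
    """
    h = x
    for layer_fn in layers:
        h = layer_fn(h) + h   # residual: layer output + layer input
    return h
```

With this wiring, a layer only needs to learn a correction to its input rather than a full transformation.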

Output Layer
As with the single-sequence model, we use the softmax function to calculate the probability of the case labels of each word w_t for each predicate p_m:

y_{m,t} = softmax(W_y h_{m,t}^(L))

where h_{m,t}^(L) is the hidden state vector calculated in the last grid layer.

Japanese PAS Analysis Approaches
Existing approaches to Japanese PAS analysis are divided into two categories: (i) the pointwise approach and (ii) the joint approach. The pointwise approach estimates the score of each argument candidate for one predicate, and then selects the argument candidate with the maximum score as an argument (Taira et al., 2008; Imamura et al., 2009; Hayashibe et al., 2011; Iida et al., 2016). The joint approach scores all the predicate-argument combinations in one sentence, and then selects the combination with the highest score (Yoshikawa et al., 2011; Sasano and Kurohashi, 2011; Ouchi et al., 2015; Shibata et al., 2016). Compared with the pointwise approach, the joint approach achieves better results. Ouchi et al. (2015) reported that it is beneficial for Japanese PAS analysis to capture the interactions between all predicates in a sentence. This is based on the linguistic intuition that the predicates in a sentence are semantically related to each other, and that information regarding this semantic relation can be useful for PAS analysis.

Multi-Predicate Interactions
Similarly, in semantic role labeling (SRL), Yang and Zong (2014) reported that their reranking model, which captures the multi-predicate interactions, is effective for the English constituent-based SRL task (Carreras and Màrquez, 2005). Taking this a step further, we propose a neural architecture that effectively models the multi-predicate interactions.

Neural Approaches
Japanese PAS

In recent years, several attempts have been made to apply neural networks to Japanese PAS analysis (Shibata et al., 2016; Iida et al., 2016). In Shibata et al. (2016), a feed-forward neural network is used for the score calculation part of the joint model proposed by Ouchi et al. (2015). In Iida et al. (2016), multi-column convolutional neural networks are used for the zero anaphora resolution task.
Both models exploit syntactic and selectional preference information as the atomic features of neural networks. Overall, the use of neural networks has resulted in advantageous performance levels, mitigating the cost of manually designing combination features. In this work, we demonstrate that even without such syntactic information, our neural models can realize comparable performance exclusively using the word sequence information of a sentence.
English SRL

Some neural models have achieved high performance without syntactic information in English SRL. Collobert et al. (2011) and Zhou and Xu (2015) worked on the English constituent-based SRL task (Carreras and Màrquez, 2005) using neural networks. The model of Collobert et al. (2011) exploited a convolutional neural network and achieved a 74.15% F-measure without syntactic information.
The model of Zhou and Xu (2015) exploited bidirectional RNNs with linear-chain conditional random fields (CRFs) and achieved the state-of-the-art result, an 81.07% F-measure. Our models can be regarded as an extension of their model.
The main differences between Zhou and Xu (2015) and our work are: (i) constituent-based vs. dependency-based argument identification, and (ii) the consideration of multiple predicates. For constituent-based SRL, Zhou and Xu (2015) used CRFs to capture the IOB label dependencies, because systems are required to identify the spans of arguments for each predicate. By contrast, for Japanese dependency-based PAS analysis, we replaced the CRFs with the softmax function, because in Japanese, arguments are rarely adjacent to each other. Furthermore, whereas the model of Zhou and Xu (2015) predicts arguments for each predicate independently, our multi-sequence model jointly predicts arguments for all predicates in a sentence concurrently by considering the multi-predicate interactions.

Experimental Settings Dataset
We used the NAIST Text Corpus 1.5, which consists of 40,000 sentences from Japanese newspapers (Iida et al., 2007). For the experiments, we adopted the standard data splits (Taira et al., 2008; Imamura et al., 2009; Ouchi et al., 2015):

Train: Articles, Jan 1-11; Editorials, Jan-Aug

Dev: Articles, Jan 12-13; Editorials, Sept

Test: Articles, Jan 14-17; Editorials, Oct-Dec

We used the word boundaries annotated in the NAIST Text Corpus and the target predicates that have at least one argument in the same sentence. We did not use any external resources.

Learning
We trained the model parameters by minimizing the cross-entropy loss function:

L(θ) = − Σ_{(x,y)} log p(y | x; θ) + (λ/2) ||θ||²    (4)

where θ is the set of model parameters, and the hyper-parameter λ is the coefficient governing the L2 weight decay.
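The objective can be sketched numerically as follows; this is a generic cross-entropy-plus-L2 computation consistent with the description above, with hypothetical argument names, not the paper's training code.

```python
import numpy as np

def loss(prob_of_gold, params, lam):
    """Sketch of the training objective (Eq. 4).

    prob_of_gold: model probabilities assigned to the gold labels,
                  one per (predicate, word) decision in the batch.
    params:       list of parameter matrices making up θ.
    lam:          L2 weight-decay coefficient λ.
    """
    cross_entropy = -np.sum(np.log(prob_of_gold))
    l2 = (lam / 2.0) * sum(np.sum(W ** 2) for W in params)
    return cross_entropy + l2
```

The cross-entropy term drives the softmax outputs toward the gold labels, while the L2 term penalizes large weights.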

Implementation Details
We implemented our neural models using the deep learning library Theano (Bastien et al., 2012). The number of epochs was set to 50, and we report the result on the test set at the epoch with the best F-measure on the development set. The parameters were optimized using mini-batch stochastic gradient descent (SGD), with the batch size selected from {2, 4, 8}. The learning rate was automatically adjusted using Adam (Kingma and Ba, 2014). For the L2 weight decay, the hyper-parameter λ in Eq. 4 was selected from {0.001, 0.0005, 0.0001}.
In the neural models, the number of RNN and grid layers was selected from {2, 4, 6, 8}.
The window size C for the PRED feature (Sec. 3.1) was set to 5. Words with a frequency of 2 or more in the training set were mapped to word indices, and the remaining words were mapped to the unknown word index. The dimensions d_word and d_mark of the embeddings were set to 32. In the single-sequence model, the parameter matrices of the GRUs were 32 × 32. In the multi-sequence model, the parameter matrices of the GRUs related to the input values were 64 × 32, and the remaining matrices were 32 × 32. The initial values of all parameters were sampled from the uniform distribution over [−√6/√(row+col), +√6/√(row+col)], where row and col are the numbers of rows and columns of each matrix, respectively.
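The initialization scheme above (the standard Glorot/Xavier uniform initialization) can be sketched as follows; the function name and fixed seed are illustrative only.

```python
import numpy as np

def init_matrix(row, col, seed=0):
    """Sample a (row, col) parameter matrix uniformly from
    [-sqrt(6)/sqrt(row+col), +sqrt(6)/sqrt(row+col)],
    i.e. Glorot/Xavier uniform initialization."""
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0) / np.sqrt(row + col)
    return rng.uniform(-limit, limit, size=(row, col))
```

Scaling the range by the matrix dimensions keeps activation variances roughly constant across layers at the start of training.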

Baseline Models
We compared our models with existing models from previous work (Sec. 5.1) that use the NAIST Text Corpus 1.5. As a baseline for the pointwise approach, we used the pointwise model proposed by Imamura et al. (2009). In addition, as a baseline for the joint approach, we used the model proposed by Ouchi et al. (2015). These models exploit the gold POS tags and dependency relations annotated in the NAIST Text Corpus.

Table 1: F-measures for Dep, Zero, and All arguments. Ouchi+ 15 is the ALL-Cases Joint Model in Ouchi et al. (2015). The mark † denotes significantly better results at the significance level p < 0.05, comparing Single-Seq and Multi-Seq. (table omitted)

Results
Neural Models vs. Baseline Models

Table 1 presents the F-measures of our neural sequence models (with eight RNN or grid layers) and the baseline models on the test set. For the significance test, we used the bootstrap resampling method. According to all metrics, both the single-sequence (Single-Seq) and multi-sequence (Multi-Seq) models outperformed the baseline models. This confirms that our neural models realize high performance, even without syntactic information, by learning contextual information effective for PAS analysis from the word sequence of the sentence. In particular, for zero arguments (Zero), our models achieved a considerable improvement over the joint model of Ouchi et al. (2015). Specifically, the single-sequence model improved the F-measure by approximately 2.0 points, and the multi-sequence model by approximately 3.0 points. These results suggest that modeling the context of the entire sentence using RNNs is beneficial to Japanese PAS analysis, particularly to zero argument identification.

Effects of Multiple Predicate Consideration
As Table 1 shows, the multi-sequence model significantly outperformed the single-sequence model in terms of the overall F-measure (81.42% vs. 81.15%). These results demonstrate that the grid-type neural architecture can effectively capture multi-predicate interactions by connecting the sequences of argument candidates for all predicates in a sentence.
Compared with the single-sequence model, the improvement of the multi-sequence model was especially large for zero arguments (Zero). This shows that capturing multi-predicate interactions is particularly effective for zero arguments, which is consistent with the results in Ouchi et al. (2015).

Table 2: F-measures of the models with different numbers of layers L, with (+res.) and without (−res.) residual connections. (table omitted)

Table 2 presents the F-measures of the neural sequence models with different network depths, with and without residual connections. The performance tends to improve as the RNN or grid layers get deeper, when residual connections are used. In particular, the two models with eight layers and residual connections achieved considerable improvements of approximately 1.0 point in F-measure over the models without residual connections. This means that residual connections contribute to the effective parameter learning of deeper models.

Table 3 presents the F-measures of the neural sequence models for different numbers of predicates in a sentence. In Table 3, M denotes how many predicates appear in a sentence. For example, the sentence in Figure 1 includes two predicates, "arrested" and "escaped", and thus M = 2 in this example.

Effects of the Number of Predicates
Overall, the performance of both models gradually deteriorated as the number of predicates in a sentence increased, because sentences that contain many predicates are complex and difficult to analyze. However, compared with the single-sequence model, the multi-sequence model suppressed the performance degradation, especially for zero arguments (Zero). By contrast, for direct dependency arguments (Dep), the two models either achieved almost equivalent performance or the single-sequence model outperformed the multi-sequence model. A detailed investigation of the relation between the number of predicates in a sentence and the complexity of PAS analysis is an interesting line of future work.

Comparing the models on the NAIST Text Corpus 1.5, the single- and multi-sequence models outperformed the baseline models according to all metrics. In particular, for the dative case, the two neural models achieved much higher results, by approximately 30 points. This suggests that although dative arguments appear infrequently compared with the other two case arguments, the neural models can learn them robustly.

Comparison per Case Role
In addition, for zero arguments (Zero), the neural models achieved better results than the baseline models. In particular, for zero arguments of the nominative case (NOM), the multi-sequence model demonstrated a considerable improvement of approximately 2.5 points in F-measure over the joint model of Ouchi et al. (2015). To achieve high accuracy in the analysis of such zero arguments, it is necessary to capture long-distance dependencies (Iida et al., 2005; Sasano and Kurohashi, 2011; Iida et al., 2015). The improved results therefore suggest that the neural models effectively capture long-distance dependencies using RNNs, which can encode the context of the entire sentence.

Conclusion
In this work, we introduced neural sequence models that automatically induce effective feature representations from the word sequence information of a sentence for Japanese PAS analysis. The experiments on the NAIST Text Corpus demonstrated that the models realize high performance without the need for syntactic information. In particular, our multi-sequence model improved the performance of zero argument identification, one of the problematic issues facing Japanese PAS analysis, by considering the multi-predicate interactions with Grid-RNNs.
Because our neural models are applicable to SRL, applying them to multilingual SRL tasks presents an interesting future research direction. In addition, in this work, the model parameters were learned without any external resources. In future work, we plan to explore effective methods for exploiting large-scale unlabeled data to learn the neural models.