Improving Shared Argument Identification in Japanese Event Knowledge Acquisition

Event knowledge represents the knowledge of causal and temporal relations between events. Shared arguments of event knowledge encode patterns of role shifting in successive events. A two-stage framework was proposed for the task of Japanese event knowledge acquisition, in which related event pairs are first extracted, and shared arguments are then identified to form the complete event knowledge. This paper focuses on the second stage of this framework, and proposes a method to improve the shared argument identification of related event pairs. We constructed a gold dataset for shared argument learning. By evaluating our system on this gold dataset, we found that our proposed model outperformed the baseline models by a large margin.


Introduction
Natural language understanding requires not only linguistic knowledge but also common knowledge about the real world. Event relation knowledge is a type of common knowledge of critical importance, representing the knowledge of the relation between events as well as the typical patterns of role shifting between events. Event relation knowledge is useful for natural language understanding tasks as well as natural language generation tasks which require modeling of the possible event sequences.
In this paper, we define an event to be a predicate argument structure (PAS), which consists of a predicate and its relevant arguments. In addition, we define one unit of event relation knowledge to be a pair of successive events with one or more shared arguments. Figure 1 represents an example of event relation knowledge, which consists of two events, pas 1 and pas 2.

Figure 1: Event relation knowledge with shared arguments.
The shared arguments correspond to the common participants of the two events, such as A 1 and A 3 in the above example. These shared arguments play an important role in the application of event relation knowledge since they encode the correspondence relations between case slots within a piece of event relation knowledge.
In this paper, we aim to improve shared argument identification in Japanese event relation knowledge. Event relation knowledge acquisition in Japanese is a much more challenging task than its English counterpart, due to several linguistic properties of Japanese. For example:
(1) a. John attached a stamp to the letter, and he dropped it into the mailbox.
    b. John attached a stamp to the letter, and (ϕ he) dropped (ϕ letter) into the mailbox.
In the above example, (1-b) is the Japanese counterpart of (1-a), directly translated into English. We can observe that Japanese has an abundance of omitted arguments. In addition, Japanese lacks linguistic clues regarding agreement in gender, number, etc., such as 'he' and 'it' in (1-a). These linguistic properties hinder the performance of Japanese coreference resolution systems, and make it unsuitable to apply coreference-based methods of English event relation knowledge acquisition (Chambers and Jurafsky, 2008) directly to Japanese.
On the other hand, event relation knowledge can benefit the task of coreference resolution. The shared arguments within a piece of event relation knowledge provide direct clues that the case slots sharing an argument should hold co-referring arguments. These clues are particularly critical in cases in which selectional preference is not helpful, such as the coreference resolution problems presented in the Winograd Schema Challenge (Levesque et al., 2012; Rahman and Ng, 2012). Consider the following example:
(2) a. (Google-ga acquired Motorola-wo, because they-ga went bankrupt.)
    b. A 1 -ga go bankrupt → A 2 -ga A 1 -wo acquire
In the example of (2-a), both candidate antecedents of 'they', 'Google' and 'Motorola', are of the same category. While selectional preference is not helpful in this case, the event relation knowledge in (2-b) can help us resolve (2-a) correctly.
In this work, we adopted the two-stage framework for Japanese event relation knowledge acquisition (Shibata and Kurohashi, 2011). In the first stage of related event pair extraction, we adopted the method proposed by Shibata and Kurohashi (2011); and in the second stage of shared argument identification, we extended the model of Kohama et al. (2015) to incorporate all types of shared arguments in our gold dataset. We designed a richer feature representation for shared argument learning, which considers the interaction between shared arguments and the mechanism of argument omission in depth.
In addition, we manually constructed a gold dataset for shared argument learning. With the help of linguistic experts, we established an annotation scheme for shared argument. We classified the shared arguments into three types: standard shared argument, quasi shared argument, and multiple shared argument. We evaluated our method of shared argument identification on the gold dataset. By comparing our proposed methods with several baseline models, we observed a significant improvement for shared argument identification.

Related Work
English is a resource-rich language, and coreference resolution for English has achieved satisfactory performance. Thus, several works that utilize coreference information have been proposed for English event relation knowledge acquisition. Chambers and Jurafsky (2008) introduced the concept of narrative event chains as a representation of structured event relation knowledge. Their method utilizes the coreference chains within the input text to collect events involving the same entity, which they called the protagonist. Among the set of events involving the same entity, event sequences that are observed a significant number of times are extracted as typical event sequences. Pichotta and Mooney (2014) used a richer representation of events than Chambers and Jurafsky (2008) and achieved an improvement in prediction performance. Instead of representing an event as a (predicate, dependency) pair, they considered an event as a structure of a predicate and arguments with subject, object, direct object relations with the predicate. With this multi-argument event representation, their model performs better in the cases of ambiguous verbs, and is more capable of capturing complex interactions between multiple entities.
Several works have been proposed for Japanese event relation knowledge acquisition utilizing the co-occurrences of events. Abe et al. (2008) proposed a pattern-based method which utilized a predefined set of lexico-syntactic co-occurrence patterns to perform bootstrapping for event relation learning. Their work focused on the acquisition of related event pairs, but not the relations between the arguments of the related events. Shibata and Kurohashi (2011) proposed a two-stage approach for Japanese event relation knowledge acquisition (Figure 2). In the first stage, related event pairs are extracted from large-scale corpora by association rule mining. In the second stage, shared arguments of the event pairs are identified heuristically based on case slot similarity scores. Kohama et al. (2015) improved the work of Shibata and Kurohashi (2011) by utilizing crowdsourced data for shared argument learning. They proposed a joint model that simultaneously predicts the shared argument configuration and disambiguates the meaning of the predicates. However, their work failed to identify the shared arguments accurately for two reasons. First, the crowdsourced data they used is very noisy and lacks a well-defined standard of labeling. Second, the features used in their model are not sufficient for capturing the characteristics of shared arguments.

Shared Argument Identification
In this section, we introduce our method of shared argument identification. In Section 3.1, we first introduce the acquisition of related event pairs, which are the inputs to our shared argument identification model. We introduce the gold dataset used for model learning in Section 3.2. In Section 3.3, we describe the selection of case frames. These case frames will be used to model different meanings of predicates in our model. The remainder of the section is dedicated to the description of our proposed methods of shared argument identification.

Related Event Pairs
Our work is based on the two-stage framework of event relation knowledge proposed by Shibata and Kurohashi (2011). We adopt the first stage of related event pair extraction proposed in their work to obtain the related event pairs, which will be the input to our shared argument identification model.
Here, we briefly describe the first stage of related event pair extraction. Starting from the web corpus, we first extract the PAS pairs with syntactic dependency, and use the Apriori algorithm to pick out the related event pairs efficiently (Figure 2). In order to improve the quality of the extracted event pairs, we apply an additional filtering step based on the clause relations between event pairs as suggested in Kohama et al. (2015). Table 1 shows several examples of related event pairs extracted in this process. Each event pair R consists of two PASs, pas 1 and pas 2, and the sentences containing both pas 1 and pas 2 are regarded as the support sentences of R. These support sentences contain many valuable clues for the task of shared argument identification. Thus, the event pair R along with its support sentences will serve as the input to our shared argument identification model.
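The support-counting idea behind this extraction step can be illustrated with a minimal sketch. It is not the authors' pipeline and not a full Apriori implementation: it only counts ordered PAS pairs per sentence and keeps those above a threshold, while recording the support sentences. The function name `frequent_event_pairs`, the string encoding of a PAS, and the `min_support` value are assumptions made for illustration, and the clause-relation filtering step is omitted.

```python
from collections import Counter
from itertools import combinations


def frequent_event_pairs(sentences, min_support=2):
    """Count ordered PAS pairs per sentence and keep the frequent ones.

    `sentences` is a list of PAS lists in textual order; each PAS is a
    plain string here. `min_support` is a hypothetical threshold.
    Returns a dict mapping each frequent pair to the indices of its
    support sentences.
    """
    pair_counts = Counter()
    support_sentences = {}
    for idx, pas_list in enumerate(sentences):
        # A set avoids counting the same pair twice within one sentence.
        for pair in set(combinations(pas_list, 2)):
            pair_counts[pair] += 1
            support_sentences.setdefault(pair, []).append(idx)
    return {
        pair: support_sentences[pair]
        for pair, count in pair_counts.items()
        if count >= min_support
    }


corpus = [
    ["stamp-wo paste", "letter-wo put"],
    ["stamp-wo paste", "letter-wo put"],
    ["cow-wo raise", "cheese-wo make"],
]
related = frequent_event_pairs(corpus, min_support=2)
```

In this toy corpus only the pair (stamp-wo paste, letter-wo put) reaches the threshold, and sentences 0 and 1 become its support sentences.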

Table 2
Type | Event Pair | Shared Argument
Standard | (stamp-wo letter-ni paste) (letter-wo mailbox-ni put) | n-w
Quasi | (cow-wo raise) (milk-de cheese-wo make) | w-d'
Multiple | (tourist-ga town-wo/ni visit) (town-ga be crowded) | w/n-g

Table 3: Transforming different types of shared arguments to their standard shared argument sets.

Gold Dataset
We manually constructed a gold dataset for training the shared argument identification model. In this work, we train and evaluate our proposed model on this gold dataset.
This dataset contains 809 related event pairs, each annotated with its shared argument configuration. Three annotators with a linguistic background participated in the construction of this dataset.

Type of Shared Arguments
The gold dataset contains the following types of shared arguments (Table 2):

Standard Shared Argument:
The arguments shared between one case slot of the first event and another case slot of the second event. This type of shared argument represents the fact that the arguments of the two cases should correspond to an identical real-world entity.
In this work, we only consider the four main cases of ga (が), wo (を), ni (に), and de (で). From now on, we use the shorthand notation g, w, n, and d to represent these four main cases. The first example in Table 2 has a standard shared argument between the first ni-case and the second wo-case, which both correspond to the entity 'letter'. We use the notation n-w to represent it.

Quasi Shared Argument:
Quasi shared argument is a pair of arguments which are closely related to each other in the context of the given event relation knowledge. As can be seen from the example in Table 2, the arguments of the first wo-case and the second de-case are 'cow' and 'milk', respectively. These two arguments are considered to be closely related since the milk in the context corresponds to the specific milk which is produced by the cow in the same context.
We attach an apostrophe (') to the notation to denote a quasi shared argument, as in w-d'.

Multiple Shared Argument:
Multiple shared argument occurs when more than two case slots share the same argument. As can be seen from the example in Table 2, the argument 'town' is shared between three cases: wo-case or ni-case of the first event, and the ga-case of the second event.
We use the symbol '/' to separate different case slots of the same predicate which share arguments.

Preprocessing of Gold dataset
In this work, we focus only on the identification of standard shared arguments. To utilize the gold dataset with other shared argument types, we pre-process the gold annotation before model training. We transform each shared argument configuration into its corresponding standard configuration set. First, we define the corresponding standard shared argument set for each shared argument in the following manner (Table 3):
1. For each standard shared argument, we transform it into the standard shared argument set containing only itself.
2. For each quasi shared argument, we transform it into the standard shared argument set containing a null shared argument (ϕ) and its standard counterpart in which all apostrophe (') marks are removed. See the second example in Table 3.
3. For each multiple shared argument, we transform it into the standard shared argument set containing all the shared arguments that could be entailed from it. See the third example in Table 3.

Table 4: Transforming shared argument configuration to corresponding standard configuration set.
Standard configuration set: {[g-g, n-n], [g-g, n-w], [g-g, n-n, w-d], [g-g, n-w, w-d]}
For a given shared argument configuration, we first transform each shared argument it contains into its corresponding standard shared argument set in the above manner. By taking the product of these standard shared argument sets, we obtain the corresponding standard configuration set of the shared argument configuration. See Table 4 for examples.
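The transformation can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, `None` stands for the null shared argument (ϕ), and the input configuration `["g-g", "n-n/w", "w-d'"]` is an assumed example chosen because its expansion coincides with the standard configuration set listed in Table 4.

```python
from itertools import product


def standard_set(shared_arg):
    """Expand one annotated shared argument into its standard set.

    A trailing apostrophe marks a quasi shared argument; '/' separates
    case slots of a multiple shared argument. None is the null argument.
    """
    quasi = shared_arg.endswith("'")
    left, right = shared_arg.rstrip("'").split("-")
    expanded = [f"{l}-{r}" for l in left.split("/") for r in right.split("/")]
    # A quasi shared argument may also be dropped entirely.
    return ([None] if quasi else []) + expanded


def standard_configurations(configuration):
    """Take the product of the per-argument standard sets."""
    sets = [standard_set(arg) for arg in configuration]
    return [[a for a in combo if a is not None] for combo in product(*sets)]


configs = standard_configurations(["g-g", "n-n/w", "w-d'"])
```

Here `configs` holds four standard configurations: [g-g, n-n], [g-g, n-n, w-d], [g-g, n-w], and [g-g, n-w, w-d].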

Case Frame Selection
Selectional preferences provide important clues for the task of shared argument identification. Case frames are good sources of selectional preference information, and they handle the issue of predicate ambiguity by clustering the usages of each predicate by meaning. In turn, the meaning of a case frame is represented by the argument distribution in each of its case slots.
In this work, we consider wide-coverage case frames constructed automatically from a huge raw corpus as the source of selectional preference information (Hayashibe et al., 2015). For each event pair R(pas 1 → pas 2), we select 10 relevant case frames for both pas 1 and pas 2 by utilizing the supporting sentences S of R. Here, we describe the method for selecting relevant case frames for each event pair, which are used in our proposed models.
Given a case frame cf, we take the bag-of-words (BoW) representation of the arguments within each case slot of cf, and likewise the BoW representation of the arguments appearing in the corresponding case slots of the support sentences S. We define the relevance score of cf with respect to R as the sum of the cosine similarity scores between these two BoW representations over the four main cases. Finally, we rank all the case frames in descending order of relevance score and take the top 10 as the relevant case frames. Table 5 shows the first five relevant case frames of the predicate (visit) of the following event pair: (tourist-ga visit → be crowded)
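The ranking described above can be sketched as follows. The data layout (a dict from case name to a `Counter` of argument words), the function names, and the toy case frame ids are assumptions for illustration; only the scoring rule, a sum of per-case cosine similarities followed by a top-k cut, comes from the description.

```python
import math
from collections import Counter

MAIN_CASES = ("g", "w", "n", "d")


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def relevance(case_frame, support_bow):
    """Sum of cosine similarities over the four main cases."""
    return sum(
        cosine(case_frame.get(c, Counter()), support_bow.get(c, Counter()))
        for c in MAIN_CASES
    )


def select_case_frames(case_frames, support_bow, k=10):
    """Return the ids of the k case frames with the highest relevance."""
    ranked = sorted(
        case_frames,
        key=lambda name: relevance(case_frames[name], support_bow),
        reverse=True,
    )
    return ranked[:k]


frames = {
    "visit:1": {"g": Counter(tourist=3), "w": Counter(town=2)},
    "visit:2": {"g": Counter(doctor=1), "w": Counter(patient=4)},
}
support = {"g": Counter(tourist=1), "w": Counter(town=1, city=1)}
best = select_case_frames(frames, support, k=1)
```

With these toy counts, `visit:1` is ranked first because both its ga-case and wo-case overlap with the arguments seen in the support sentences.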

Joint Prediction of Shared Argument and Case Frame
As mentioned in Section 3.3, case frames provide important information of selectional preferences. However, the gold data does not provide the appropriate case frame of each predicate. To tackle this problem, we propose a model of shared argument identification that simultaneously predicts the appropriate case frame for each predicate.

Model
We adopt a maximum entropy (MaxEnt) classifier model. Given a related event pair R(pas 1 → pas 2) and its supporting sentences S, the conditional probability of a shared argument configuration A and case frame pair cf 1, cf 2 is modeled as:
\[ P(A, cf_1, cf_2 \mid R, S; w) = \frac{\exp\{w \cdot \phi(A, cf_1, cf_2, R, S)\}}{Z} \tag{2} \]
In the above equation, ϕ(A, cf 1, cf 2, R, S) is the feature representation of the shared argument configuration, w is the model parameter, and Z is the normalization constant. Table 6 summarizes the features used, taking the shared argument n-w as an example.

Table 5: Relevant case frames of (visit).

Table 6
Feature | Description
Configuration | Binary feature indicating the existence of the shared argument n-w.
Post-predicate | Binary feature indicating the existence of an argument in the w case of pas 2.
Core | Binary features indicating whether the n case of cf 1 and the w case of cf 2 are core cases. A case slot is defined as a core case if it takes an argument more than 10% of the time in the selected case frame.
Case slot similarity | The cosine similarity between the vocabulary distributions of the n case of cf 1 and the w case of cf 2.
Normalized case slot similarity | Case slot similarity of n-w normalized over the similarities of all case slots of cf 1. The same is computed for cf 2.
Conflict | The ratio of support sentences in S that hold different arguments in the first n case and the second w case.
Context | We collect words that appear in S but not within the event pair as context words. We calculate the relative probability of each context word appearing in the first n case compared to the other main cases, and similarly for the second w case. A tf-idf weighted sum of these probabilities is added as a feature.
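Two of the simpler features, Core and Conflict, can be illustrated with a short sketch. The function names and data layout are hypothetical. The description does not state which sentences form the denominator of the conflict ratio, so this sketch assumes the ratio is taken over all support sentences and that a conflict requires both slots to be filled.

```python
def is_core_case(slot_count, frame_count, threshold=0.10):
    """True if the slot is filled in more than 10% of the frame's uses."""
    return frame_count > 0 and slot_count / frame_count > threshold


def conflict_ratio(support_sentences, first_case="n", second_case="w"):
    """Share of support sentences whose two case slots hold different arguments.

    Each sentence is a pair of dicts (case -> argument) for pas1 and pas2.
    Assumption: sentences with an unfilled slot never count as conflicts,
    and the ratio is taken over all support sentences.
    """
    if not support_sentences:
        return 0.0
    conflicts = sum(
        1
        for args1, args2 in support_sentences
        if first_case in args1
        and second_case in args2
        and args1[first_case] != args2[second_case]
    )
    return conflicts / len(support_sentences)


support = [
    ({"w": "stamp", "n": "letter"}, {"w": "letter", "n": "mailbox"}),
    ({"n": "letter"}, {"w": "postcard"}),
    ({"w": "stamp"}, {"n": "mailbox"}),
    ({"n": "envelope"}, {"w": "envelope"}),
]
ratio = conflict_ratio(support, "n", "w")
```

Only the second sentence fills both slots with different words, so one sentence out of four counts as a conflict for n-w.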

Prediction
During the prediction phase, the shared argument configuration Â and case frame pair (ĉf 1, ĉf 2) that give the highest probability are chosen:
\[ (\hat{A}, \hat{cf}_1, \hat{cf}_2) = \operatorname*{argmax}_{A, cf_1, cf_2} P(A, cf_1, cf_2 \mid R, S; w) \tag{3} \]
For each related event pair R, we choose 10 relevant case frames for each predicate as candidates for cf 1 and cf 2, as described in Section 3.3.
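Equations (2) and (3) amount to a softmax over all candidate triples followed by an argmax. The sketch below assumes each candidate (configuration, cf 1, cf 2) has already been turned into a sparse feature dict; the feature names, case frame ids, and weights are made up for illustration and are not the features of the actual system.

```python
import math


def maxent_distribution(candidates, weights):
    """Return P(candidate) proportional to exp(w . phi) over all candidates.

    `candidates` maps a (configuration, cf1, cf2) key to a sparse feature
    dict; `weights` maps feature names to floats.
    """
    scores = {
        key: sum(weights.get(f, 0.0) * v for f, v in feats.items())
        for key, feats in candidates.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {key: math.exp(s - top) for key, s in scores.items()}
    z = sum(exps.values())  # normalization constant Z (after the shift)
    return {key: e / z for key, e in exps.items()}


def predict(candidates, weights):
    """Pick the candidate triple with the highest probability."""
    probs = maxent_distribution(candidates, weights)
    return max(probs, key=probs.get)


candidates = {
    (("n-w",), "cf1:a", "cf2:a"): {"config:n-w": 1.0, "sim:n-w": 0.8},
    (("g-g",), "cf1:a", "cf2:b"): {"config:g-g": 1.0, "sim:g-g": 0.1},
}
weights = {"sim:n-w": 2.0, "sim:g-g": 2.0}
best = predict(candidates, weights)
```

Because the case frame pair is part of the candidate key, choosing the argmax selects the configuration and the case frames in a single step.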

Model Training
In the training phase, the most probable case frame pair (ĉf 1, ĉf 2) and the model parameter w are updated alternately. The most probable gold configuration ĝ among the standard configuration set is updated along with the case frame pair.
The training algorithm is summarized below:
1. Initialize the model parameter w randomly.
2. Use the current parameter w to update the most probable gold configuration and the most probable case frame pair (ĝ, ĉf 1, ĉf 2):
\[ (\hat{g}, \hat{cf}_1, \hat{cf}_2) = \operatorname*{argmax}_{g, cf_1, cf_2} P(g, cf_1, cf_2 \mid R, S; w) \]
3. Use (ĝ, ĉf 1, ĉf 2) to update the model parameter w. In the objective function, the superscripts of g, cf 1, and cf 2 denote the id of the event pair, and N is the total number of training objects. (Hyper-parameter α is set to 1.0.)
4. Return to step 2 until convergence. The convergence condition is that the most probable (ĝ, ĉf 1, ĉf 2) for all event pairs are the same as in the previous iteration. If the convergence condition is not satisfied after 15 iterations, we terminate the training process.
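The alternating procedure can be sketched as below. Several points are assumptions rather than the method as published: the exact objective function is not reproduced in this text, so the sketch uses an L2-penalised log-likelihood with α as the penalty weight; plain gradient ascent stands in for L-BFGS; and the data layout, function names, learning rate, and step count are hypothetical.

```python
import math
import random


def softmax_probs(features, w):
    """Probabilities of each candidate under the current weights."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in features]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def fit_weights(data, targets, w, alpha=1.0, lr=0.1, steps=200):
    """Gradient ascent on an L2-penalised log-likelihood (stand-in for L-BFGS)."""
    w = list(w)
    for _ in range(steps):
        grad = [-alpha * wi for wi in w]  # derivative of -(alpha/2)*||w||^2
        for (features, _), target in zip(data, targets):
            probs = softmax_probs(features, w)
            for j in range(len(w)):
                expected = sum(p * x[j] for p, x in zip(probs, features))
                grad[j] += features[target][j] - expected
        w = [wi + lr * gi / len(data) for wi, gi in zip(w, grad)]
    return w


def train(data, dim, max_iter=15, seed=0):
    """Alternate between choosing the best gold-compatible candidate and refitting w.

    Each item in `data` is (features, gold_indices): one feature vector per
    (configuration, cf1, cf2) candidate, plus the indices of the candidates
    whose configuration belongs to the standard configuration set.
    """
    rng = random.Random(seed)
    w = [rng.uniform(-0.1, 0.1) for _ in range(dim)]  # step 1
    previous = None
    for _ in range(max_iter):
        targets = []
        for features, gold in data:  # step 2
            probs = softmax_probs(features, w)
            targets.append(max(gold, key=lambda i: probs[i]))
        if targets == previous:  # step 4: same choices as last iteration
            break
        w = fit_weights(data, targets, w)  # step 3
        previous = targets
    return w, targets


toy_data = [
    ([[1.0, 0.0], [0.0, 1.0]], [0]),
    ([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], [0, 1]),
]
learned_w, chosen = train(toy_data, dim=2)
```

The selected target for each event pair always stays within its gold-compatible candidates; only the choice among them changes between iterations.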

Shared Argument Learning with Combined Case Frame
Here, we introduce another model for learning shared arguments, which uses combined case frames. The joint prediction model (Section 3.4) picks exactly one case frame for each predicate. On the other hand, the combined case frame model combines the relevant case frames by taking their weighted sum, with the relevance scores with respect to the event pair as weights. This method does not decide the most appropriate case frame of each predicate. Instead, all of the relevant case frames are considered, and case frames with higher relevance scores have a larger influence on the feature representation.

Combined Case Frame
A combined case frame is obtained by combining the relevant case frames according to their relevance scores. The calculation of the relevance scores of each case frame is described in Section 3.3.
Given a set of relevant case frames CF, we define the combined case frame (Equation 8) in terms of V^x_{cf}, the vocabulary distribution vector of cf: the vectors of the case frames in CF are summed, weighted by their relevance scores.
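A minimal sketch of this combination follows. The function name and data layout are hypothetical, and the sketch assumes the raw relevance scores are used as weights without normalisation, which leaves cosine-based features unchanged; the original equation may normalise differently.

```python
from collections import Counter


def combine_case_frames(case_frames, relevance_scores):
    """Relevance-weighted sum of per-slot vocabulary vectors.

    `case_frames` maps a case frame id to {case: {word: value}};
    `relevance_scores` maps the same ids to their relevance scores.
    Assumption: weights are raw relevance scores, not normalised.
    """
    combined = {}
    for name, frame in case_frames.items():
        weight = relevance_scores[name]
        for case, vocab in frame.items():
            slot = combined.setdefault(case, Counter())
            for word, value in vocab.items():
                slot[word] += weight * value
    return combined


frames = {
    "a": {"w": {"town": 2, "city": 1}},
    "b": {"w": {"town": 1}, "n": {"shrine": 4}},
}
scores = {"a": 0.5, "b": 2.0}
combined = combine_case_frames(frames, scores)
```

Case frame `b` has the higher relevance score, so its words dominate the combined wo-case even though `a` has larger raw counts for 'town'.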

Model
Similar to the joint prediction model presented in Section 3.4, we adopt a MaxEnt classifier model. Given an event pair R(pas 1 → pas 2) and its supporting sentences S, we model the conditional probability of shared argument configuration A as:
\[ P(A \mid R, S; w) = \frac{\exp\{w \cdot \phi(A, cf_1, cf_2, R, S)\}}{Z} \tag{9} \]
In the above equation, ϕ is the feature representation as summarized in Table 6, w is the model parameter, and Z is the normalization constant.
The training algorithm is similar to the one described in Section 3.4. In the training phase, the most probable gold configuration ĝ and the model parameter w are updated alternately until convergence.

Settings
The case frames used in the experiments are built from a web corpus of four billion sentences, with the method proposed by Hayashibe et al. (2015).
We use Classias (Okazaki, 2009) as the implementation of maximum entropy classifier and L-BFGS (Nocedal, 1980) as the optimization algorithm for learning. We train and evaluate our proposed models by a 5-fold cross-validation test on the gold shared argument dataset.

Evaluation and Result
We apply three evaluation metrics, precision, recall, and F-score (F 1), to evaluate our shared argument identification models. We compared our proposed models with two baseline models. The first baseline model, denoted as Baseline[g-g] in Table 7, is the majority classifier, which outputs g-g regardless of the event pair given. The second baseline model, denoted as Baseline[Kohama+15], is the model proposed by Kohama et al. (2015).
The experimental results are summarized in Table 7. In addition, several examples of the acquired event relation knowledge are shown in Table 8.

Table 7: Evaluation result.
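The three metrics and the majority baseline can be illustrated with a short sketch. The averaging scheme is not specified in this text, so the sketch assumes micro-averaging over individual shared arguments and a single gold configuration per event pair. The toy gold data below is invented for illustration and is not taken from the gold dataset.

```python
def precision_recall_f1(predicted, gold):
    """Micro-averaged scores over shared arguments across event pairs.

    `predicted` and `gold` are parallel lists of shared argument
    configurations, each a list of notations such as "g-g" or "n-w".
    """
    tp = sum(len(set(p) & set(g)) for p, g in zip(predicted, gold))
    n_pred = sum(len(set(p)) for p in predicted)
    n_gold = sum(len(set(g)) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


toy_gold = [["g-g", "n-w"], ["w-g"], ["g-g"]]
# Majority baseline: always output g-g, whatever the event pair is.
baseline = [["g-g"] for _ in toy_gold]
p, r, f = precision_recall_f1(baseline, toy_gold)
```

On this toy data the baseline is right on two of its three outputs but recovers only two of the four gold shared arguments.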

Comparison with Baseline Models
As can be observed from Table 7, both of our proposed models outperformed the baseline models by a large margin.
Compared to the model proposed by Kohama et al. (2015), we use a richer feature representation for shared argument configuration. In their work, a shared argument is represented by the vocabulary distribution similarity between two case slots, such as the similarity between case frames, or the similarity between arguments in the supporting sentences. However, by considering only the distributional similarities between two case slots, their method overlooked two important intrinsic properties of the shared argument identification task:
1. The interaction between shared arguments: Different shared arguments are not independent, and shared arguments that share a case slot have repulsive effects on each other. For example, if a shared argument configuration already includes g-g, then it is unlikely that g-w also exists in the same configuration. To account for this property, we add the normalized case slot similarity feature, which considers not only the case slot similarity of a pair of case slots, but also their relative similarity.
2. The mechanism of argument omission in related event pairs: High vocabulary distribution similarity indicates the existence of shared arguments, but not vice versa. Consider the following example: (juice-ga become cheaper → juice-wo buy). Although there exists a shared argument of g-w, the vocabulary distributions of the two corresponding case slots are quite different. To address this property, we add the context feature, which considers each context word and its relative probability of appearing in each of the main case slots.
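The repulsive effect in the first property can be made concrete with a sketch of one possible normalisation. The exact formula is not given in this text, so the following is an assumed reading: the similarity of a slot pair is divided by the summed similarities over all main case slots of cf 1 (holding the second slot fixed), and likewise over all main case slots of cf 2. The function name and the similarity values are invented for illustration.

```python
MAIN_CASES = ("g", "w", "n", "d")


def normalized_similarities(sim, first_case, second_case):
    """Return the pair similarity normalised over cf1's and cf2's case slots.

    `sim[c1][c2]` is the case slot similarity between case c1 of cf1 and
    case c2 of cf2. This normalisation is an assumed reading of the
    feature description, not a confirmed formula.
    """
    own = sim[first_case][second_case]
    over_cf1 = sum(sim[c][second_case] for c in MAIN_CASES)
    over_cf2 = sum(sim[first_case][c] for c in MAIN_CASES)
    return (
        own / over_cf1 if over_cf1 else 0.0,
        own / over_cf2 if over_cf2 else 0.0,
    )


# Toy similarities: the first ga-case resembles the second ga-case far
# more than it resembles the second wo-case.
sim = {c1: {c2: 0.0 for c2 in MAIN_CASES} for c1 in MAIN_CASES}
sim["g"]["g"] = 0.8
sim["g"]["w"] = 0.2
gg = normalized_similarities(sim, "g", "g")
gw = normalized_similarities(sim, "g", "w")
```

Under this reading, g-w receives a low score relative to cf 2's slots because the first ga-case is already a much better match for the second ga-case, which is the competition the feature is meant to capture.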

Comparison Between Proposed Models
The major difference between the two proposed models lies in how case frames for feature construction are decided.
As can be observed from Table 7, the joint prediction model achieved a better F-score than the combined case frame model. We conclude that deciding one best case frame is a better way for modeling the selectional preference of a predicate, compared to combining case frames with respect to the relevance scores. The result also verified the effectiveness of the joint model of case frames and shared arguments.

Error Analysis
The following are several patterns of error observed in the system output. Examples of each error type are presented in Table 8.
1. Error due to case frame granularity (Error Type 1): Our proposed model jointly predicts the most appropriate case frame along with the shared argument configuration. By selecting a single case frame for each predicate, we are able to model the selectional preference of the predicates accurately. However, the automatically constructed case frames do not always provide the granularity suitable for our task. If a coarse-grained case frame is selected during the prediction phase, the prediction of shared arguments will also be affected.
For the example shown in Table 8, an appropriate case frame of the second predicate 'put' should contain words that support the n-w shared argument in the wo-case. Table 9 shows the most appropriate case frame of the predicate 'put' among all the case frames of this predicate. It can be observed that although the wo-case contains words relevant to the n-w shared argument, such as 'letter' and 'envelope', there are other irrelevant words dominating this case. These kinds of broad, somewhat noisy case frames hinder the performance of our shared argument identification model.
2. Error due to event participants with similar characteristics (Error Type 2): Our method relies largely on selectional preference information for identifying shared arguments. Thus, the prediction performance of our system is not very good for event pairs containing multiple participants with similar characteristics.
For the example shown in Table 8, our model wrongly identified the shared argument n-g. Although both cases are expected to hold human participants, the entity in the first ni-case should correspond to the victim of both actions 'persecute' and 'kill', while the second ga-case should hold the perpetrator of the two actions. In the scenario of the above event pair, there are two participants of similar characteristics, which are both expected to be human. Since selectional preference cannot effectively distinguish between these similar participants, our model often has difficulty dealing with event pairs with multiple similar participants.
3. Error due to fixed expression (Error Type 3): In a fixed expression, an argument often takes on a different meaning than it usually does. Fixed expressions within events sometimes cause problems in shared argument identification. For the example shown in Table 8, the system output is as follows: (face-ga become brighter → sun-ga face-wo appear). Independently, both PASs shown above are plausible. However, the first PAS, 'face-ga become brighter', means showing a cheerful look, while the second PAS, 'sun-ga face-wo appear', means the sun rising. Although both expressions contain the argument 'face', the shared argument of g-w does not exist.

Conclusion
This paper proposed a method for shared argument identification in event relation knowledge acquisition. By addressing several problems of previous work, we improved the shared argument identification model significantly. We proposed a richer feature representation of shared argument configuration which is more suitable for model learning. In order to incorporate different types of shared arguments in the gold dataset, we update the most appropriate gold configuration along with the case frames during the training process. We evaluated our model on a manually annotated gold dataset, and our model outperformed the baseline models by a large margin. Our proposed model jointly predicts the shared argument configuration and the appropriate case frames. By comparing the result of our proposed model with the combined case frame model, we verified the effectiveness of this joint model in predicting the appropriate case frames.