Entity-Centric Joint Modeling of Japanese Coreference Resolution and Predicate Argument Structure Analysis

Predicate argument structure analysis is a task of identifying structured events. To improve this field, we need to identify a salient entity, which cannot be identified without performing coreference resolution and predicate argument structure analysis simultaneously. This paper presents an entity-centric joint model for Japanese coreference resolution and predicate argument structure analysis. Each entity is assigned an embedding, and when the result of both analyses refers to an entity, the entity embedding is updated. The analyses take the entity embedding into consideration to access the global information of entities. Our experimental results demonstrate the proposed method can improve the performance of the inter-sentential zero anaphora resolution drastically, which is a notoriously difficult task in predicate argument structure analysis.


Introduction
Natural language often conveys a sequence of events like "who did what to whom", and extracting structured events from the raw text is a kind of touchstone for machine reading. This is realized by a combination of coreference resolution (called CR, hereafter) and predicate argument structure analysis (called PA, hereafter).
The characteristics and difficulties of these analyses vary among languages. In English, arguments are rarely omitted, and thus PA is relatively easy (around 83% accuracy), while CR is relatively difficult (around 70% accuracy).
On the other hand, in Japanese and Chinese, where arguments are often omitted, PA is a difficult task, and even state-of-the-art systems achieve only around 50% accuracy. Zero anaphora resolution (ZAR) is a difficult subtask of PA that consists of detecting a zero pronoun and identifying its referent. As the following example shows, CR in English (identifying the antecedent of it) and ZAR in Japanese (identifying the omitted nominative argument) are similar problems.
(1) a. John bought a car last month. It was made by Toyota.
b. John-TOP last month a car-ACC bought. (ϕ-NOM) Toyota made-COPULA.
PA (especially ZAR) needs to identify salient entities, which cannot be identified without performing CR and PA simultaneously. Our results support this claim, and suggest that the status quo of PA-exclusive research in Japanese is an insufficient approach.
Our work is inspired by Wiseman et al. (2016), who described an English CR system in which entities are represented by embeddings that are updated dynamically according to the CR results. We perform Japanese CR and PA by extending this idea. Our experimental results demonstrate that the proposed method drastically improves the performance of inter-sentential zero anaphora resolution.
Although most studies did not consider the notion of an entity, Sasano and Kurohashi (2011) did, calculating a salience score for each entity based on simple rules. However, they used gold coreference links to form the entities, and reported that the salience score did not improve the performance. In contrast, we perform CR automatically, and capture entity salience by using RNNs.
For Chinese, where zero anaphors are often used, neural network-based approaches (Chen and Ng, 2016; Yin et al., 2017) outperformed conventional machine learning approaches (Zhao and Ng, 2007).
Coreference Resolution. CR has been actively studied in English and Chinese. Neural network-based approaches (Wiseman et al., 2016; Clark and Manning, 2016b,a) outperformed conventional machine learning approaches (Clark and Manning, 2015). Wiseman et al. (2016) and Clark and Manning (2016b) learn an entity representation and integrate it into a mention-based model. Our work is inspired by Wiseman et al. (2016), who learn the entity representation by using Recurrent Neural Networks (RNNs). Clark and Manning (2016b) adopt a clustering approach for the entity representation. We do not use this approach because, in our setting, zero pronouns would need to be identified before clustering, which makes it hard to perform CR and PA jointly. End-to-end approaches have also been proposed, which avoid relying on a hand-engineered mention detector by considering all spans as potential mentions. In the Japanese evaluation corpora we use, the basic unit for the annotations and our analyses (CR and PA) is fixed, so we do not need to consider all spans.
In Japanese, CR has not been actively studied other than by Iida et al. (2003) and Sasano et al. (2007), since the use of zero pronouns is more common and problematic.
Semantic Role Labeling. Japanese PA is similar to Semantic Role Labeling (SRL) in English. Neural network-based approaches have improved the performance (Zhou and Xu, 2015). In these approaches, an appropriate argument for a predicate is searched for among the mentions in a text, and the notion of an entity is not considered.
Other Entity-Centric Study. There are several studies that consider the notion of an entity in other areas: text comprehension (Kobayashi et al., 2016; Henaff et al., 2016) and language modeling (Ji et al., 2017).

Japanese Preliminaries
Before presenting our proposed method, we describe the basics of Japanese predicate argument structure and its analysis.
Since the word order is relatively free among arguments in Japanese, an argument is followed by a case marking postposition. The postpositions が (ga), を (wo), and に (ni) indicate nominative (NOM), accusative (ACC) and dative (DAT), respectively. In the double nominative construction such as " " (My English is good), " " (English) is regarded as NOM, and " " (I), the outer nominative, is regarded as NOM2. This paper targets these four cases.
PA is tightly related to the dependency structure of a sentence. Depending on the relation between a predicate and its argument, the necessary analysis can be classified into the following three categories (see example sentence (2) below).

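[Example (2): a sentence whose basic phrases include John-TOP, bought, bread-ACC and ate, annotated with dependency arcs (D).]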
Overt case: When an argument with a case marking postposition has a dependency relation with a predicate, PA is not necessary. In example (2), since " " (bread-ACC) has a dependency relation with " " (ate), it is obvious that " " takes " " as its ACC argument.
Case analysis: When a topic marker (wa) is attached to an argument, the case marking postposition disappears, and the analysis of identifying the case role becomes necessary. The analysis is called case analysis. In the example, although " " (John-TOP) has a dependency relation with " " (ate), the analysis of identifying NOM is necessary. The same phenomenon happens when a relative clause is used. When an argument is modified by a relative clause, we do not know its case role to the predicate in the relative clause. In the example, although " " has a dependency relation with " " (bought), the analysis of identifying ACC is necessary.
Zero anaphora resolution (ZAR): Some arguments are not included in the phrases with which a predicate has a dependency relation. While pronouns are mostly used in English, they are rarely used in Japanese. This phenomenon is called zero anaphora, and the analysis of identifying an argument (referent of the zero pronoun) is called zero anaphora resolution (ZAR). In the example, although " " takes " " as its NOM argument, they do not have a dependency relation, and thus zero anaphora resolution is necessary.
When dependency relations are identified by parsing, what Japanese PA has to do is case analysis and zero anaphora resolution.
Each predicate has a set of required cases, which does not necessarily include all four cases. For example, " " (buy) takes NOM and ACC, but neither DAT nor NOM2. PA for " " in sentence (2) has to find John as NOM, and also has to judge that the predicate takes neither a DAT nor a NOM2 argument.
Another difficulty is that a predicate may take a case without taking a specific argument for it in a given sentence. For example, in the sentence "it is difficult to bake bread", NOM of "bake" is not a specific person, but means "anyone" or "in general". In such cases, PA has to regard the argument as unspecified.

Overview of Our Proposed Method
An overview of our proposed model is given with a motivating example (Figure 1). Our model is equipped with an entity buffer for entity management. At first, the buffer contains only the special entities author and reader.
In Japanese CR and PA, a basic phrase, which consists of one content word and zero or more function words, is adopted as a basic unit. When an input text is given, the contextual representations of basic phrases are obtained by using Convolutional Neural Network (CNN) and Bidirectional LSTM. Then, from the beginning of the text, CR is performed if a target phrase is a noun phrase, and PA is performed if a target phrase is a predicate phrase. Both of these analyses take into consideration not only the mentions in the text but also the entities in the entity buffer.
In CR, when a mention refers to an existing entity, the entity embedding in the entity buffer is updated. In Figure 1, " " (said person) is analyzed to refer to " " (Mr. Kovalyov), and the entity embedding of " " is updated. When a mention is analyzed to have no antecedent, it is registered in the entity buffer as a new entity.
In PA, when a predicate has no overt argument for a case, its argument is searched for among all mentions in the text as well as author and reader. In the same way as CR, PA takes into consideration not only the mentions but also the entities in the entity buffer, and updates the entity embedding.
In Figure 1, the predicate " " (run for) has no NOM argument. Our method finds " " as its NOM argument, and then updates its entity embedding. As mentioned before, the entity embedding of " " is updated by the coreference relation with " " in the second sentence. In the third sentence, the predicate " " (support) also has no NOM argument, and " " is identified as its NOM argument, because its frequent reference implies its salience.
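The left-to-right procedure described in this section can be sketched as follows. This is only an illustrative sketch, not our implementation: the names `EntityBuffer`, `analyze`, `toy_coref` and `toy_zero_arg` are hypothetical, the list of referring phrases stands in for an entity embedding, and the two toy resolvers (string match and "most frequently referenced entity") are simple placeholders for the neural scoring models.

```python
from typing import Callable, Dict, List, Optional, Tuple

Phrase = Tuple[str, str]  # (surface text, "noun" or "predicate")


class EntityBuffer:
    """Keeps, for each entity id, the phrases that have referred to it so far."""

    def __init__(self) -> None:
        # The buffer initially contains only the special entities.
        self.history: Dict[int, List[str]] = {0: ["author"], 1: ["reader"]}

    def new_entity(self, text: str) -> int:
        k = len(self.history)
        self.history[k] = [text]
        return k

    def update(self, k: int, text: str) -> None:
        self.history[k].append(text)


Resolver = Callable[[str, EntityBuffer], Optional[int]]


def analyze(phrases: List[Phrase], coref: Resolver, zero_arg: Resolver) -> EntityBuffer:
    """Process basic phrases in order: CR for noun phrases, PA for predicates."""
    buf = EntityBuffer()
    for text, kind in phrases:
        if kind == "noun":
            k = coref(text, buf)          # entity id, or None for "no antecedent"
            if k is None:
                buf.new_entity(text)      # register a new entity
            else:
                buf.update(k, text)       # update the referred entity
        elif kind == "predicate":
            k = zero_arg(text, buf)       # referent of the omitted argument, if any
            if k is not None:
                buf.update(k, text)
    return buf


def toy_coref(text: str, buf: EntityBuffer) -> Optional[int]:
    """Placeholder CR: corefer with an entity that already contains the same string."""
    for k, hist in buf.history.items():
        if text in hist:
            return k
    return None


def toy_zero_arg(text: str, buf: EntityBuffer) -> Optional[int]:
    """Placeholder PA: pick the most referenced (then most recent) entity."""
    return max(buf.history, key=lambda k: (len(buf.history[k]), k))
```

With the phrases `[("Kovalyov", "noun"), ("run for", "predicate"), ("Kovalyov", "noun"), ("support", "predicate")]`, the entity created for "Kovalyov" accumulates all four references, which mirrors how repeated reference makes an entity salient in Figure 1.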

Input Encoding
Conventional machine learning techniques have extracted features from a basic phrase, which require much effort on feature engineering. Our method obtains an embedding of each basic phrase using CNN and bi-LSTM as shown in Figure 2.
Suppose the $i$-th basic phrase $bp_i$ consists of $|bp_i|$ words. First, the embedding of each word is represented as a concatenation of word (lemma), part of speech (POS), sub-POS and conjugation embeddings. We append start-of-phrase and end-of-phrase special words to each phrase in order to better represent prefixes and suffixes. Let $W_i \in \mathbb{R}^{d \times (|bp_i|+2)}$ be an embedding matrix for $bp_i$, where $d$ denotes the dimension of the word representation.
The embedding of the basic phrase is obtained by applying a CNN to the sequence of words. A feature map $f_i$ is obtained by applying a convolution between $W_i$ and a filter $H$ of width $n$: the $m$-th element of $f_i$ is computed from $\langle W_i[\ast, m : m+n-1], H \rangle$, where $\langle \cdot, \cdot \rangle$ is the Frobenius inner product. Then, to capture the most important feature for a given filter in $bp_i$, max pooling is applied over the elements of $f_i$.
The process described so far is for one filter. Multiple filters of varying widths are applied to obtain the representation of $bp_i$. When we set $h$ filters, $x_i$, the embedding of the $i$-th basic phrase, is represented as the concatenation of the $h$ max-pooled values.
The embeddings of basic phrases are read by a bi-LSTM to capture their context, and the contextualized embedding $h_i$ of the $i$-th basic phrase is represented as a concatenation of the hidden layers of the forward and backward LSTMs.
This process is performed for each sentence. Since CR and PA are performed for a whole document $D$, the indices of basic phrases are reassigned consecutively from the beginning to the end of $D$. To handle exophora, author and reader are each assigned a unique trainable embedding.
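The CNN part of the encoder (convolution with the Frobenius inner product, followed by max pooling over each feature map) can be illustrated with the following minimal sketch in plain Python. The function names are hypothetical, and the `tanh` nonlinearity and the bias term are assumptions made for the illustration; the bi-LSTM step is not shown.

```python
import math
from typing import List

Matrix = List[List[float]]  # d rows (embedding dimensions), one column per word


def frobenius_window(W: Matrix, H: Matrix, m: int) -> float:
    """Frobenius inner product of filter H with the window of W starting at column m."""
    n = len(H[0])  # filter width
    return sum(W[r][m + c] * H[r][c] for r in range(len(H)) for c in range(n))


def feature_map(W: Matrix, H: Matrix, bias: float = 0.0) -> List[float]:
    """One feature map: slide the filter over all word positions.

    The tanh nonlinearity and the bias are assumptions of this sketch.
    """
    n = len(H[0])
    width = len(W[0])
    return [math.tanh(frobenius_window(W, H, m) + bias) for m in range(width - n + 1)]


def phrase_embedding(W: Matrix, filters: List[Matrix]) -> List[float]:
    """Max-pool each feature map and concatenate the results (one value per filter)."""
    return [max(feature_map(W, H)) for H in filters]


# A toy phrase with d = 2 and three (already padded) words.
W_toy = [[1.0, 0.0, 2.0],
         [0.0, 1.0, 0.0]]
filters_toy = [
    [[1.0], [0.0]],            # width-1 filter
    [[1.0, 1.0], [0.0, 0.0]],  # width-2 filter
]
x_toy = phrase_embedding(W_toy, filters_toy)  # one pooled value per filter
```

With $h$ filters, the resulting vector has $h$ elements, matching the description of $x_i$ above.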

Coreference Resolution
We adopt a mention-ranking model that assigns each mention its highest scoring candidate antecedent. This model assigns a score $s^m_{CR}(ant, m_i)$ to a target mention $m_i$ and its candidate antecedent $ant$ (see footnote 1). The candidate antecedents include i) mentions preceding $m_i$, ii) author and reader, and iii) $NA_{CR}$ (no antecedent). $s^m_{CR}(ant, m_i)$ is calculated by a neural network with weight matrices $W^{CR}_1$ and $W^{CR}_2$ whose input vector $v^{CR}_{input}$ is a concatenation of the following vectors:
• embeddings of $m_i$ and $ant$
• exact match or partial match between strings of $m_i$ and $ant$
• sentence distance between $m_i$ and $ant$, binned into one of the buckets [0, 1, 2, 3+]
• whether the pair of $m_i$ and $ant$ has an entry in a synonym dictionary
When a candidate antecedent is $NA_{CR}$, the input vector is just the embedding of the target mention $m_i$, and the same neural network with different weight matrices calculates the score.
The model is trained with a margin objective that, for each of the $N_m$ mentions in a document, compares the scores of the candidate antecedents in $ANT(m_i)$, the set of candidate antecedents of $m_i$, with the score of $\hat{t}_i$, the highest scoring true antecedent of $m_i$, defined as $\hat{t}_i = \arg\max_{t \in T(m_i)} s^m_{CR}(t, m_i)$, where $T(m_i)$ denotes the set of true antecedents of $m_i$.
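As an illustration of this kind of objective, the following sketch computes a max-margin loss for a single mention. It assumes a common hinge form with a fixed margin of 1 and is not necessarily the exact objective used in our model; the function name and the dictionary-based interface are hypothetical. The document-level loss would be the sum of this quantity over the $N_m$ mentions.

```python
from typing import Dict, Set


def margin_loss(scores: Dict[str, float], true_ants: Set[str], margin: float = 1.0) -> float:
    """Hinge loss for one mention (an assumed form, for illustration only).

    scores    : score of every candidate antecedent, including "NA"
    true_ants : the set of true antecedents T(m_i); {"NA"} if there is none
    """
    # Highest scoring true antecedent.
    t_hat = max(true_ants, key=lambda a: scores[a])
    # Margin violations of the incorrect candidates.
    violations = [
        margin + s - scores[t_hat]
        for a, s in scores.items()
        if a not in true_ants
    ]
    return max([0.0] + violations)


toy_scores = {"NA": 0.2, "John": 1.5, "car": 1.0, "author": -0.3}
loss_ok = margin_loss(toy_scores, {"John", "car"})  # best true antecedent wins by the margin
loss_bad = margin_loss(toy_scores, {"car"})         # "John" outscores the true antecedent
```

In the second call the incorrect candidate "John" scores higher than the only true antecedent, so the loss is positive; in the first call every incorrect candidate is at least a margin below the best true antecedent, so the loss is zero.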

Predicate Argument Structure Analysis
When a target phrase is a predicate phrase, PA is performed. For each case of a predicate, PA searches for an appropriate argument among the candidate arguments: i) basic phrases located in the sentence including the predicate and in preceding sentences, ii) author and reader, iii) unspecified, and iv) $NA_{PA}$, which means the predicate takes no argument for the case. The probability that the predicate $m_i$ takes an argument $arg$ for case $c$ is defined by normalizing, over the candidate arguments, a score $s^m_{PA}(arg, m_i, c)$ calculated by a neural network with weight matrices $W^{PA}_{1,c}$ and $W^{PA}_2$ whose input vector $v^{PA}_{input}$ is a concatenation of the following vectors:
• embeddings of $m_i$ and $arg$
• path embedding: the dependency path between a predicate and an argument is an important clue. Roth and Lapata (2016) learn a representation of a lexicalized dependency path for SRL. An LSTM reads words from an argument to a predicate along a dependency path, and the final hidden state is adopted as the embedding of the dependency path. For case analysis, the direct dependency relation between a predicate and its argument can be represented as the path embedding.
• selectional preference: selectional preference is another important clue for PA. A selectional preference score is learned in an unsupervised manner from automatic parses of a raw corpus (Shibata et al., 2016).
• sentence distance between $m_i$ and $arg$. The distance is binned in the same way as CR.
1 The superscript $m$ of $s^m_{CR}(ant, m_i)$ represents a mention-based score, which contrasts with an entity-based score introduced in Section 6.
The objective is to minimize the cross entropy between the predicted and true distributions, where $N_p$ denotes the number of predicates in a document, and the true distribution is given by the true argument.
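The following sketch shows how candidate scores for one case of one predicate can be turned into a probability distribution and a cross-entropy loss. It assumes that the normalization is a softmax over the candidate arguments; the function names and the toy scores are hypothetical.

```python
import math
from typing import Dict


def argument_distribution(scores: Dict[str, float]) -> Dict[str, float]:
    """Softmax over the candidate arguments of one case (assumed normalization)."""
    mx = max(scores.values())  # subtract the maximum for numerical stability
    exps = {a: math.exp(s - mx) for a, s in scores.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}


def cross_entropy(scores: Dict[str, float], true_arg: str) -> float:
    """Negative log probability assigned to the true argument."""
    return -math.log(argument_distribution(scores)[true_arg])


# Candidates for the NOM case of one predicate: a phrase, the special
# entities, "unspecified", and NA (the predicate takes no NOM argument).
toy_scores = {"John": 2.0, "author": 0.5, "reader": 0.1, "unspecified": 0.0, "NA": -1.0}
probs = argument_distribution(toy_scores)
loss = cross_entropy(toy_scores, "John")
```

The document-level objective would sum this loss over all predicates and their cases.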

Entity-Centric Model
While the base model performs mention-based CR and PA, our proposed model performs entity-based analyses as shown in Figure 1.

Entity Embedding Update
The entity embeddings are managed in an entity buffer. First, let us introduce time stamp $i$ for the entity embedding update. Time $i$ corresponds to the analysis for the $i$-th basic phrase in a document. If an entity is referred to by the analysis, its embedding is updated. Let $e^{(k)}_i$ be the embedding of an entity $k$ at time $i$ (after the entity embedding is updated).
In CR, following Wiseman et al. (2016), when a target phrase $m_i$ refers to the entity $k$, $e^{(k)}_i$ is obtained by updating $e^{(k)}_{i-1}$ with the phrase embedding $h_i$ using $LSTM_e$, an LSTM for the entity embedding update. When an antecedent is $NA_{CR}$, a new entity embedding is set up, initialized by a zero vector. The entity buffer maintains $K$ LSTMs ($K$ is the number of entities in a document), and their parameters are shared.
The proposed method updates the entity embedding not only in CR but also in PA. When the referent of a zero pronoun of case $c$ of predicate $p_i$ is entity $k$, the entity embedding is updated in the same way, using as input the predicate embedding $h_i$ multiplied by a weight matrix $W_c$ for case $c$.
In both CR and PA, the embeddings of entities other than the referred entity $k$ are not updated: $e^{(l)}_i \leftarrow e^{(l)}_{i-1}$ for $l \neq k$.
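The update rule can be sketched as follows. To keep the example short and dependency-free, a simple tanh recurrent cell is used as a stand-in for $LSTM_e$; this substitution, the class name `EmbeddingBuffer`, and the choice to feed the first mention to the cell immediately after the zero initialization are all assumptions of the sketch.

```python
import math
from typing import Dict, List, Optional

Vec = List[float]
Mat = List[Vec]


def matvec(M: Mat, v: Vec) -> Vec:
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def cell(x: Vec, e_prev: Vec, U: Mat, V: Mat) -> Vec:
    """Simplified recurrent cell standing in for LSTM_e (an assumption)."""
    return [math.tanh(a + b) for a, b in zip(matvec(U, x), matvec(V, e_prev))]


class EmbeddingBuffer:
    """Entity buffer whose update parameters (U, V) are shared by all entities."""

    def __init__(self, dim: int, U: Mat, V: Mat) -> None:
        self.dim, self.U, self.V = dim, U, V
        self.emb: Dict[int, Vec] = {}

    def new_entity(self, h: Vec) -> int:
        """Register a new entity: start from a zero vector, then read the mention."""
        k = len(self.emb)
        self.emb[k] = cell(h, [0.0] * self.dim, self.U, self.V)
        return k

    def update(self, k: int, h: Vec, W_c: Optional[Mat] = None) -> None:
        """Update only entity k; all other entity embeddings are left untouched.

        For CR, h is used as is. For PA, a case-specific matrix W_c is applied
        to the predicate embedding before the update.
        """
        x = matvec(W_c, h) if W_c is not None else h
        self.emb[k] = cell(x, self.emb[k], self.U, self.V)


identity = [[1.0, 0.0], [0.0, 1.0]]
buf = EmbeddingBuffer(2, identity, identity)
k0 = buf.new_entity([0.5, -0.5])                     # first mention of entity 0
k1 = buf.new_entity([1.0, 1.0])                      # first mention of entity 1
buf.update(k0, [0.2, 0.1])                           # coreferent mention (CR)
buf.update(k0, [0.3, 0.3], W_c=[[2.0, 0.0], [0.0, 2.0]])  # zero-pronoun referent (PA)
```

Sharing `U` and `V` across entities corresponds to the shared parameters of the $K$ LSTMs described above.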

Use of Entity Embedding in CR and PA
Both CR and PA are allowed to take the entity embeddings into consideration. In CR, let $z_{ant}$ denote the id of the entity to which the candidate antecedent $ant$ belongs. The entity-based score $s^e_{CR}$ is calculated as follows:
$$ s^e_{CR}(ant, m_i) = \begin{cases} h_i^{\top} e^{(z_{ant})}_{i-1} & \text{if } ant \neq NA_{CR} \\ g_{NA}(m_i) & \text{if } ant = NA_{CR} \end{cases} \quad (13) $$
The intuition behind the first case is that the dot product of $h_i$, the embedding of the target mention, and $e^{(z_{ant})}_{i-1}$, the embedding of the entity that $ant$ belongs to, indicates the plausibility of their coreference. $g_{NA}(m_i)$ is computed from $h_i$ and the sum of all the current entity embeddings, $\sum_k e^{(k)}_{i-1}$, using a weight matrix $W_{NA}$ and a weight vector $q$. The intuition is that whether a target phrase is $NA_{CR}$ can be judged from $h_i$, the embedding of the target mention itself, and the sum of all the current entity embeddings. $s^e_{CR}$ is added to $s^m_{CR}$, and the training objective is the same as the one described in Section 5.2.
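The two cases of the entity-based CR score can be sketched as follows. The function names are hypothetical, and the form used for $g_{NA}$ (a tanh layer with $W_{NA}$ over the concatenation of $h_i$ and the summed entity embeddings, followed by a dot product with $q$) is an assumption made for the illustration.

```python
import math
from typing import Dict, List, Optional

Vec = List[float]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def entity_score(
    h_i: Vec,
    ant_entity: Optional[int],
    entities: Dict[int, Vec],
    q: Vec,
    W_na: List[Vec],
) -> float:
    """Entity-based score for one candidate antecedent.

    ant_entity is the id of the entity the candidate belongs to, or None
    when the candidate is "no antecedent".
    """
    if ant_entity is not None:
        # Case 1: plausibility of coreference with an existing entity.
        return dot(h_i, entities[ant_entity])
    # Case 2: judge "no antecedent" from h_i and the sum of all entity embeddings.
    if entities:
        summed = [sum(col) for col in zip(*entities.values())]
    else:
        summed = [0.0] * len(h_i)
    x = h_i + summed                                     # concatenation
    hidden = [math.tanh(dot(row, x)) for row in W_na]    # assumed nonlinearity
    return dot(q, hidden)


toy_entities = {0: [0.5, 0.5], 1: [1.0, 0.0]}
toy_h = [1.0, 2.0]
toy_q = [1.0, 1.0]
toy_W_na = [[1.0, 0.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0]]
score_entity = entity_score(toy_h, 0, toy_entities, toy_q, toy_W_na)
score_na = entity_score(toy_h, None, toy_entities, toy_q, toy_W_na)
```

The returned value would be added to the mention-based score of the same candidate before ranking.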
In PA, the entity embedding corresponding to a candidate argument $arg$ is simply added to the input vector $v^{PA}_{input}$ described in Section 5.3, and the mention- and entity-based score $s^{m+e}_{PA}(arg, m_i, c)$ is calculated in the same way as $s^m_{PA}(arg, m_i, c)$. The training objective is again the same as the one in Section 5.3.
In Wiseman et al. (2016), the oracle entity assignment is used for the entity embedding update in training, and the system output is used in a greedy manner in testing. Since the performance of PA is lower than that of English CR, there might be a more significant gap between training and testing. Therefore, scheduled sampling (Bengio et al., 2015) is adopted to bridge the gap: in training, the oracle entity assignment is used with probability $\epsilon_t$ (at the $t$-th iteration) and the system output otherwise. Exponential decay is used: $\epsilon_t = k^t$ (we set $k = 0.75$ for our experiments).
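The sampling schedule can be sketched as below. The function names are hypothetical, and whether the iteration counter $t$ starts at 0 or 1 is an assumption of this sketch (here it starts at 0, so the oracle is always used in the first iteration).

```python
import random


def oracle_probability(t: int, k: float = 0.75) -> float:
    """Exponentially decaying probability of using the oracle assignment."""
    return k ** t


def choose_assignment(oracle: int, predicted: int, t: int,
                      rng: random.Random, k: float = 0.75) -> int:
    """Return the entity id used for the embedding update at iteration t."""
    if rng.random() < oracle_probability(t, k):
        return oracle       # gold entity assignment
    return predicted        # system output


rng = random.Random(0)
# Early in training the oracle dominates; later the system output is used
# more and more often, which narrows the gap between training and testing.
schedule = [oracle_probability(t) for t in range(4)]
```

Here `schedule` holds the oracle probabilities for the first four iterations, i.e. $0.75^0, 0.75^1, 0.75^2, 0.75^3$.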

Experimental Setting
Two kinds of evaluation sets were used for our experiments. One is the KWDLC (Kyoto University Web Document Leads Corpus) evaluation set (Hangyo et al., 2012), and the other is the Kyoto Corpus. KWDLC consists of the first three sentences of 5,000 Web documents (15,000 sentences) and the Kyoto Corpus consists of 550 News documents (5,000 sentences). Word segmentations, POSs, dependencies, PASs, and coreferences were manually annotated (the closest referents and antecedents were annotated for zero anaphora and coreferences, respectively). Since we want to focus on the accuracy of CR and PA, gold segmentations, POSs, and dependencies were used. KWDLC (Web) was divided into 3,694 documents (11,558 sents.) for training, 512 documents (1,585 sents.) for development, and 700 documents (2,195 sents.) for testing; the Kyoto Corpus (News) was divided into 360 documents (3,210 sents.) for training, 98 documents (971 sents.) for development, and 100 documents (967 sents.) for testing.
The evaluation measure is the F-measure, and the evaluation of both CR and PA was relaxed using a gold coreference chain, which leads to an entity-based evaluation. We did not use the conventional CR evaluation measures (MUC, $B^3$, CEAF and CoNLL) because our F-measure is almost the same as MUC, which is a link-based measure, and the other measures, which consider singletons, take excessively high values and thus do not accord with the actual performance in our setting.

Implementation Detail
The dimension of word embeddings was set to 100, and the word embeddings were initialized with embeddings pre-trained by Skip-gram with negative sampling (Mikolov et al., 2013) on a Japanese Web corpus consisting of 100M sentences. The dimensions of the POS, sub-POS and conjugation embeddings were each set to 10, and these embeddings were initialized randomly. The dimensions of the hidden layers in all the neural networks were set to 100. We used filter windows of 1, 2, and 3 with 33 feature maps each for the basic phrase CNN.
Adam (Kingma and Ba, 2014) was adopted as the optimizer. F measures were averaged over four runs.
Checkpoint ensemble (Chen et al., 2017) was adopted, where the k best models were taken in terms of validation score, and then the parameters from the k models were averaged for testing. This method requires only one training process. In our experiments, k was set to 5, and the maximum number of epochs was set to 10.
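The parameter averaging step of the checkpoint ensemble can be sketched as follows. The function name and the representation of a checkpoint as a `(validation_score, parameters)` pair with flat lists of floats are hypothetical simplifications.

```python
from typing import Dict, List, Tuple

Params = Dict[str, List[float]]


def average_checkpoints(checkpoints: List[Tuple[float, Params]], k: int) -> Params:
    """Average the parameters of the k checkpoints with the best validation score."""
    best = sorted(checkpoints, key=lambda c: c[0], reverse=True)[:k]
    names = best[0][1].keys()
    return {
        name: [sum(vals) / len(best) for vals in zip(*(p[name] for _, p in best))]
        for name in names
    }


# Three checkpoints saved during a single training run (toy values).
toy_checkpoints = [
    (0.50, {"w": [1.0, 2.0]}),
    (0.70, {"w": [3.0, 4.0]}),
    (0.60, {"w": [5.0, 6.0]}),
]
averaged = average_checkpoints(toy_checkpoints, k=2)
```

Only one training run is needed, since all checkpoints come from the same run; in the toy example the two best checkpoints (scores 0.70 and 0.60) are averaged.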
We used a single-layer bi-LSTM for the input encoding (Section 5.1); preliminary experiments with a stacked bi-directional LSTM with residual connections were not favorable. Although we also tried using a character-level embedding of each word obtained with a CNN, in the same way as the basic phrase embedding is obtained from the word sequence, the performance did not improve. The synonym dictionary used for CR (Section 5.2) was constructed from an ordinary dictionary and a Web corpus, and has about 7,300 entries (Sasano et al., 2007).

Experimental Result
The following three methods were compared:
• Baseline: the method described in Section 5.
• "+entity (CR)": this method corresponds to (Wiseman et al., 2016). Entity embedding is updated based on the CR result, and CR takes the entity embedding into consideration.
• "+entity (CR,PA)" (proposed method): entity embedding is updated based on PA as well as CR result, and the CR and PA take the entity embedding into consideration.
The performance of CR and PA (case analysis and zero anaphora resolution (ZAR)) is shown in Table 1. The performance of CR and case analysis was almost the same for all the methods. For ZAR, "+entity (CR,PA)" improved the performance drastically.
CR should benefit from entity salience. However, since entity embeddings are updated based on system outputs, the quality of those outputs matters. The performance of Japanese CR is lower than that of English CR. We therefore assume that there are both improved and worsened examples, so that our CR performance did not improve significantly. The performance of ZAR also matters. However, the performance of ZAR in our baseline model is extremely low, and thus there are few worsened examples.
Table 2 shows the performance of case analysis and zero anaphora resolution for each case and each argument position. Unspecified was counted as exophora. For both the News and Web evaluation sets, the performance for inter arguments of zero anaphora resolution, which was extremely difficult in the baseline method, was improved by a large margin by our proposed method.

Ablation Study
To reveal the importance of each clue for CR and PA, each clue was ablated. Table 3 shows the result on the development set. We found that the path embedding was effective for PA, and the string match was effective for CR. The sentence distance for both CR and PA was effective for News, but not for Web, since the Web evaluation corpus consists of three-sentence documents.

Comparison with Other Work
Table 3: Ablation study on the development set. Cells shaded gray are not directly affected by the ablation, but are affected through the result of the counterpart analysis.
It is difficult to compare the performance of our method with other studies directly because there are no studies handling both CR and PA. The comparisons with other studies are summarized as follows:
corpus contains a lot of annotation errors as pointed out in Iida et al. (2016), we did not conduct our experiments on the NAIST text corpus.
• Iida et al. (2003) reported an F-measure of about 0.7 on the News domain. A possible reason why our performance on News (0.541) is lower than theirs is that their basic unit is a compound noun while our basic unit is a noun, and thus our setting is more difficult than theirs.
Since we handle inter as well as intra and exophora arguments in PA, together with CR, we can say that our experimental setting is more practical in comparison with other studies.

Error Analysis
In example (3), although the NOM argument of the predicate " " (go to hospital) is author, our method wrongly classified it as unspecified.
(3) every day go to hospital! I myself-TOP very healthy.
((I) go to hospital every day! (I am) very healthy, though.)
In the second sentence, our method correctly identified the antecedent of " " (I) as author, and the NOM of " " (healthy) as " " (I). Because our method adopts greedy search, it cannot exploit this useful information in the analysis of the first sentence. Global modeling of a whole document using reinforcement learning (Clark and Manning, 2016a) is left for future work.
In example (4), although the NOM argument of " " (be decorated) in the second sentence is " " (dress) in the first sentence, our method wrongly classified it as $NA_{PA}$. " " (organdie) has a bridging relation to " ", which might help capture the salience of " ". Bridging reference resolution is our next target, and we expect that it can be easily incorporated into our model.

Conclusion
This paper has presented an entity-centric neural network-based joint model of coreference resolution and predicate argument structure analysis. Each entity has its embedding, and the embeddings are updated according to the result of both of these analyses dynamically. Both of these analyses took the entity embedding into consideration to access the global information of entities. The experimental results demonstrated that the proposed method could improve the performance of the inter-sentential zero anaphora resolution drastically, which has been regarded as a notoriously difficult task. We believe that our proposed method is also effective for other pro-drop languages such as Chinese and Korean.