Word-based Japanese typed dependency parsing with grammatical function analysis

We present a novel scheme for word-based Japanese typed dependency parsing that integrates syntactic structure analysis and grammatical function analysis such as predicate-argument structure analysis. Compared to bunsetsu-based dependency parsing, which is predominantly used in Japanese NLP, it provides a natural way of extracting syntactic constituents, which is useful for downstream applications such as statistical machine translation. It also makes it possible to decide dependency structure and predicate-argument structure jointly, whereas they are usually handled in two separate steps. We convert an existing treebank to the new dependency scheme and report parsing results as a baseline for future research. By using grammatical functions as dependency labels, we achieved better accuracy in assigning function labels than a dedicated predicate-argument structure analyzer.


Introduction
The goal of our research is to design a Japanese typed dependency parsing scheme that provides sufficient linguistically derived structural and relational information for NLP applications such as statistical machine translation. We focus on the Japanese-specific aspects of designing a counterpart to the Stanford typed dependencies (de Marneffe et al., 2008).
In Japanese, syntactic structures are usually represented as dependencies between chunks called bunsetsus. A bunsetsu is a Japanese grammatical and phonological unit that consists of one or more content words such as a noun, verb, or adverb, followed by a sequence of zero or more function words such as auxiliary verbs, postpositional particles, or sentence-final particles. Most publicly available Japanese parsers, including CaboCha (Kudo et al., 2002) and KNP (Kawahara et al., 2006), return bunsetsu-based dependencies as syntactic structure. Such parsers are generally highly accurate and have been widely used in various NLP applications.
However, bunsetsu-based representations also have two serious shortcomings: one is the discrepancy between syntactic and semantic units, and the other is insufficient syntactic information (Butler et al., 2012; Tanaka et al., 2013).
Bunsetsu chunks do not always correspond to constituents (e.g. NP, VP), which complicates the task of extracting semantic units from bunsetsu-based representations. This kind of problem often arises in handling nesting structures such as coordinating constructions. For example, there are three dependencies in sentence (1): a coordinating dependency b2-b3 and ordinary dependencies b1-b3 and b3-b4. In extracting predicate-argument structures, it is not possible to directly extract the coordinated noun phrase "wine and sake" as the direct object of the verb "drank". In other words, we need an implicit interpretation rule in order to extract the NP in the coordinating construction: the head bunsetsu b3 should be divided into its content word and its function word, and the content word should then be merged with the dependent bunsetsu b2.
(1) 'A list of wine and sake that (someone) drank'

Therefore, predicate-argument structure analysis is usually implemented as a post-processor of a bunsetsu-based syntactic parser, not only for assigning grammatical functions but also for identifying constituents; the analyzer SynCha (Iida et al., 2011), for example, uses the parsing results of CaboCha. We assume that using a word as the parsing unit instead of a bunsetsu chunk helps to maintain consistency between syntactic structure analysis and predicate-argument structure analysis.

Another problem is that linguistically different constructions share the same representation. The difference between a gapped relative clause and a gapless relative clause is a typical example. In sentences (2) and (3), we cannot discriminate between the two relations holding between bunsetsus b2 and b3 using unlabeled dependency: the former is a subject-predicate construction of the noun "cat" and the verb "eat" (a subject-gap relative clause), while the latter is not a predicate-argument construction (a gapless relative clause).

(3) ... hanashi 'story': 'the story about having eaten fish'

We aim to build a Japanese typed dependency scheme that can properly deal with syntactic constituency and grammatical functions in the same representation without implicit interpretation rules. The design of the Japanese typed dependencies is described in Section 3, and our evaluation of the dependency parsing results for a parser trained on a converted dependency corpus is presented in Section 4.

Related work
Mori et al. (2014) built word-based dependency corpora in Japanese. The reported parser achieved an unlabeled attachment score of over 90%; however, there is no information on the syntactic relations between the words in this corpus. Uchimoto et al. (2008) also proposed criteria and definitions for word-level dependency structure, mainly for the annotation of a spontaneous speech corpus, the Corpus of Spontaneous Japanese (CSJ) (Maekawa et al., 2000), and they do not make a distinction between detailed syntactic functions either. We propose a typed dependency scheme based on the well-known and widely used Stanford typed dependencies (SD), which originated in English and have since been extended to many languages, but not to Japanese. Universal Dependencies (UD) (McDonald et al., 2013; de Marneffe et al., 2014) have been developed based on SD in order to design cross-linguistically consistent treebank annotation. UD for Japanese has also been discussed, but no treebank has been provided yet. We focus on the feasibility of word-based Japanese typed dependency parsing rather than on cross-linguistic consistency. We plan to examine the conversion between UD and our scheme in the future.

Typed dependencies in Japanese
To design a scheme of Japanese typed dependencies, three points are essential: what should be used as the parsing unit, which dependency scheme is appropriate for Japanese sentence structure, and what should be defined as the dependency types.

Parsing unit
Defining a word unit is indispensable for word-based dependency parsing. However, this is not a trivial question, especially in Japanese, whose orthography does not separate words with white space. We adopted the two types of word units defined by NINJAL for building the Balanced Corpus of Contemporary Written Japanese (BCCWJ) (Maekawa et al., 2014): the short unit word (SUW) is the shortest token conveying morphological information, and the long unit word (LUW) is the basic unit for parsing, consisting of one or more SUWs. Figure 1 shows example results from the preprocessing for parsing. In the figure, "/" denotes an SUW boundary within an LUW, and "∥" denotes a bunsetsu boundary.
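The following is a minimal sketch, under our own assumptions about the data representation (it is not part of BCCWJ or Comainu tooling), of how LUWs can be modeled as sequences of SUWs; the attribute names and the example compound are illustrative.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SUW:
    surface: str   # surface form
    lemma: str     # dictionary form
    pos: str       # SUW-level part of speech

@dataclass
class LUW:
    suws: List[SUW]   # constituent SUWs, left to right
    pos: str          # LUW-level part of speech

    @property
    def surface(self) -> str:
        # an LUW's surface is the concatenation of its SUWs
        return "".join(s.surface for s in self.suws)

# e.g. a compound-noun LUW made of three SUWs (illustrative example)
luw = LUW(suws=[SUW("自然", "自然", "名詞"),
                SUW("言語", "言語", "名詞"),
                SUW("処理", "処理", "名詞")],
          pos="名詞")
print(luw.surface)  # 自然言語処理 'natural language processing'
```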

Dependency scheme
Basically, Japanese dependency structure can be regarded as an aggregation of pairs of a left-side dependent word and a right-side head word, i.e. right-headed dependencies, since Japanese is a head-final language. However, how to analyze a predicate constituent is a matter of debate. We define two types of schemes depending on the structure of the predicate constituent: one first conjoins the predicate and its arguments, and the other first conjoins the predicate and function words such as auxiliary verbs.
As shown in sentence (4), a predicate bunsetsu consists of a main verb followed by a sequence of auxiliary verbs in Japanese. We consider two ways of constructing a verb phrase (VP). One is first conjoining the main verb and its arguments to construct VP as in sentence (4a), and the other is first conjoining the main verb and auxiliary verbs as in sentence (4b). These two types correspond to sentences (5a) and (5b), respectively, in English. The structures in sentences (4a) and (5a) are similar to a structure based on generative grammar. On the other hand, the structures in sentences (4b) and (5b) are similar to the bunsetsu structure.
We defined two dependency schemes, Head Final type 1 (HF1) and Head Final type 2 (HF2), as shown in Figure 2, which correspond to the structures of sentences (4a) and (4b), respectively. Additionally, we introduced a Predicate Content word Head type (PCH), where a content word (e.g. a verb) is treated as the head of a predicate phrase so as to link the predicate to its arguments more directly.
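To make the difference between the three schemes concrete, here is a toy illustration with hypothetical tokens and head indices; it uses a made-up clause rather than the actual example in Figure 2, and it assumes, for illustration, that case particles head their noun phrases.

```python
# Hypothetical tokens for "neko ga sakana o tabe ta" ('the cat ate the fish');
# indices are 0-based, -1 marks the root.
tokens = ["cat", "ga", "fish", "o", "eat", "ta"]   # "ta" = past-tense auxiliary

# HF1: the main verb first conjoins with its arguments; the auxiliary
# then heads the whole predicate (strictly head-final).
hf1_heads = [1, 4, 3, 4, 5, -1]

# HF2: the main verb first conjoins with the auxiliary, mirroring the
# bunsetsu structure; the arguments attach to the auxiliary.
hf2_heads = [1, 5, 3, 5, 5, -1]

# PCH: the content word (main verb) heads the predicate phrase, so the
# auxiliary depends on the verb (a left-to-right dependency) and the
# verb is the root.
pch_heads = [1, 4, 3, 4, -1, 4]
```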

Dependency type
We defined 35 dependency types for Japanese based on SD, in which 4-50 types are assigned for syntactic relations in English and other languages. We also used a Japanese lexicon (Ikehara et al., 1997) to generalize the nouns. Table 1 shows the major dependency types. To discriminate between a gapped relative clause and a gapless relative clause as described in Section 1, we assigned the two dependency types rcmod and ncmod, respectively. Moreover, we introduced gap information by subdividing rcmod into three types to extract predicate-argument relations, while the original SD makes no distinction between them. The case labels and the gapped relative clause labels enable us to extract predicate-argument structures by simply tracing dependency paths. In the case of HF1 in Figure 2, we find two paths between content words: "fried fish" (NN) ←pobj← (case particle) ←dobj← "eat" (VB), and "eat" (VB) ←aux← (auxiliary) ←aux← (auxiliary) ←rcmod_nsubj← "calico cat" (NN). By marking the dependency types dobj and rcmod_nsubj, we can extract the arguments of the predicate "eat", i.e., "fried fish" as a direct object and "calico cat" as a subject.
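As a concrete illustration of this path tracing, the sketch below shows one way to read off (predicate, relation, argument) triples from such typed dependencies. This is our reading of the scheme, not the authors' implementation; the token format, the rcmod_nsubj-style label names, and the helper logic are assumptions.

```python
ARG_LABELS = {"nsubj", "dobj", "iobj"}
GAP_LABELS = {"rcmod_nsubj": "nsubj", "rcmod_dobj": "dobj", "rcmod_iobj": "iobj"}

def extract_args(tokens):
    """tokens: list of dicts with keys id, form, head, label;
    ids are 0-based list indices, the root has head == -1."""
    triples = []
    for t in tokens:
        if t["head"] < 0:
            continue
        head = tokens[t["head"]]
        if t["label"] in ARG_LABELS:
            # t is a case particle attached to the predicate; the argument's
            # head noun is the particle's pobj dependent
            noun = next((d["form"] for d in tokens
                         if d["head"] == t["id"] and d["label"] == "pobj"),
                        t["form"])
            triples.append((head["form"], t["label"], noun))
        elif t["label"] in GAP_LABELS:
            # gapped relative clause: the modified noun (the head) fills the
            # gap; descend the aux chain from t to reach the content verb
            pred = t
            while True:
                dep = next((d for d in tokens
                            if d["head"] == pred["id"] and d["label"] == "aux"),
                           None)
                if dep is None:
                    break
                pred = dep
            triples.append((pred["form"], GAP_LABELS[t["label"]], head["form"]))
    return triples
```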

Evaluation
We demonstrated the performance of typed dependency parsing based on our scheme by using a dependency corpus automatically converted from a constituent treebank and an off-the-shelf parser.

Resources
We used a dependency corpus converted from the Japanese constituent treebank (Tanaka et al., 2013) built by re-annotating the Kyoto University Text Corpus (Kurohashi et al., 2003) with phrase structure and function labels. The Kyoto corpus consists of approximately 40,000 sentences from newspaper articles, of which 17,953 sentences have been re-annotated. The treebank is designed to consist of complete binary trees, which can easily be converted to dependency trees by applying head rules and dependency-type rules to each local tree. We divided this corpus into 15,953 sentences (339,573 LUWs) for the training set and 2,000 sentences (41,154 LUWs) for the test set.
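The conversion from binary constituent trees to dependency trees can be sketched as follows. This is a generic head-rule procedure under our own assumptions (a default head-final rule and a placeholder dependency-type rule), not the authors' converter.

```python
class Tree:
    def __init__(self, label, children=None, word=None):
        self.label, self.children, self.word = label, children or [], word

def head_child_index(tree):
    # Japanese is head-final, so the right child is the head by default;
    # real head rules would special-case coordination, punctuation, etc.
    return len(tree.children) - 1

def dep_label(dep_cat, head_cat):
    # placeholder for the dependency-type rules keyed on the two categories
    return "dep"

def to_dependencies(tree, deps=None):
    """Return (lexical_head_word, deps), where deps is a list of
    (dependent_word, head_word, label) triples."""
    if deps is None:
        deps = []
    if tree.word is not None:                      # leaf node
        return tree.word, deps
    heads = [to_dependencies(c, deps)[0] for c in tree.children]
    h = head_child_index(tree)
    for i, c in enumerate(tree.children):
        if i != h:
            # every non-head child's lexical head depends on the head child's
            deps.append((heads[i], heads[h], dep_label(c.label, tree.children[h].label)))
    return heads[h], deps
```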

Parser and features
In the analysis process, sentences are first tokenized into SUWs and tagged with SUW POS tags by the morphological analyzer MeCab (Kudo et al., 2004). The LUW analyzer Comainu (Kozawa et al., 2014) then chunks the SUW sequences into LUW sequences. We used MaltParser (Nivre et al., 2007), which achieves a labeled attachment score (LAS) of over 81% for English SD. The stack algorithm (projective) and LIBLINEAR were chosen as the parsing algorithm and the learner, respectively. We built and tested three parsing models, one for each of the three dependency schemes.
Features of the parsing model are formed by combining word attributes, as shown in Table 2. We employed SUW-based attributes as well as LUW-based attributes because LUWs contain many multiword expressions such as compound nouns, and features combining only LUW-based attributes tend to be sparse. The SUW-based attributes are extracted from the leftmost or rightmost SUW of the target LUW. For instance, for the LUW in Figure 1, the SUW-based attributes are s_LEMMA_L (the leftmost SUW's lemma, "fish") and s_LEMMA_R (the rightmost SUW's lemma, "fry").
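A minimal sketch of the SUW-based attributes described above, assuming a simple dictionary representation of an LUW; the feature names mirror the s_LEMMA_L / s_LEMMA_R style of Table 2 but are otherwise illustrative.

```python
def suw_features(luw):
    """luw: dict with key 'suws', a left-to-right list of SUW dicts
    that each carry a 'lemma' and a 'pos'."""
    left, right = luw["suws"][0], luw["suws"][-1]
    return {
        "s_LEMMA_L": left["lemma"],    # leftmost SUW's lemma
        "s_LEMMA_R": right["lemma"],   # rightmost SUW's lemma
        "s_POS_L": left["pos"],        # leftmost SUW's POS
        "s_POS_R": right["pos"],       # rightmost SUW's POS
    }
```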

Results
The parsing results for the three dependency schemes are shown in Table 3(a). The dependency schemes HF1 and HF2 are comparable, but PCH is slightly lower than both, probably because PCH has a more complicated structure, with left-to-right dependencies in the predicate phrase, than the head-final types HF1 and HF2. The performance of the LUW-based parsing is considered comparable to that of the bunsetsu-dependency parser CaboCha on the same data set, i.e. a UAS of 92.7%, although we cannot compare them directly because of the difference in parsing units. The argument types (nsubj, dobj and iobj) resulted in relatively high scores in comparison with the temporal (tmod) and locative (lmod) cases. These types are typically assigned within postpositional phrases consisting of a noun phrase and particles, and case particles such as "ga", "o" and "ni" strongly suggest an argument in combination with the verb, while the particles "ni" and "de" are also widely used outside the temporal and locative cases.

Predicate-argument structure
We extracted predicate-argument structure information as triples, i.e. a predicate and an argument connected by a relation, (pred, rel, arg), from the dependency parsing results by tracing the paths with argument and gapped relative clause types. pred in a triple is a verb or an adjective, arg is the head noun of an argument, and rel is nsubj, dobj or iobj.
The gold standard data were built by converting the predicate-argument structures in the NAIST Text Corpus (Iida et al., 2007) into the above triples. Basically, the cases "ga", "o" and "ni" in the corpus correspond to "nsubj", "dobj" and "iobj", respectively; however, an alternative conversion has to be applied to passive or causative voice, since the annotation is based on active voice. This case alternation was converted manually for each triple. We filtered out the triples that include zero pronouns or whose arguments have no direct dependency on their predicates; 6,435 triples remained. Table 4 shows the results of comparing the extracted triples with the gold data. PCH marks the highest score here despite getting the lowest score in the parsing results, presumably because content words tend to be linked directly in PCH. The table also contains the results of the predicate-argument structure analyzer SynCha. Note that we focus only on the relations between a predicate and its dependents, while SynCha is designed to deal with zero anaphora resolution in addition to predicate-argument structure analysis over syntactic dependencies. Since SynCha uses the syntactic parsing results of CaboCha in a cascaded process, parsing errors may cause conflicts between the syntactic structure and the predicate-argument structure. A typical example is the case where a gapped relative clause modifies a noun phrase "A no B" ('B of A'), e.g., 'footprints of the cat that escaped from a garden.' If the noun A is an argument of the main predicate in the relative clause, the predicate should be a dependent of the noun A; however, this is not guaranteed because the two analyses are processed separately. There are 75 constructions of this type in the test set; the LUW-based dependency parsing captured 42 correct predicate-argument relations (and dependencies), while the cascaded parsing obtained only 6 such relations.
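The sketch below illustrates, under our own assumptions, the kind of case-to-relation conversion and triple comparison described above; the actual gold conversion, including the manual passive/causative alternation, is not reproduced here.

```python
CASE_TO_REL = {"ga": "nsubj", "o": "dobj", "ni": "iobj"}

def to_triples(pred, case_args):
    """case_args: dict mapping a surface case ("ga"/"o"/"ni") to its head noun.
    NAIST Text Corpus annotations are normalised to active voice, so passive and
    causative clauses would need a separate case alternation step (not shown)."""
    return {(pred, CASE_TO_REL[c], a) for c, a in case_args.items() if c in CASE_TO_REL}

def prf(system, gold):
    """Precision, recall and F1 between two sets of (pred, rel, arg) triples."""
    tp = len(system & gold)
    p = tp / len(system) if system else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f
```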

Conclusion
We proposed a scheme of Japanese typeddependency parsing for dealing with constituents and capturing the grammatical function as a dependency type that bypasses the traditional limitations of bunsetsu-based dependency parsing. The evaluations demonstrated that a word-based dependency parser achieves high accuracies that are comparable to those of a bunsetsu-based dependency parser, and moreover, provides detailed syntactic information such as predicate-argument structures. Recently, discussion has begun toward Universal Dependencies, including Japanese. The work presented here can be viewed as a feasibility study of UD for Japanese. We are planning to port our corpus and compare our scheme with UD to contribute to the improvement of UD for Japanese.