UD-Japanese BCCWJ: Universal Dependencies Annotation for the Balanced Corpus of Contemporary Written Japanese

In this paper, we describe a corpus UD Japanese-BCCWJ that was created by converting the Balanced Corpus of Contemporary Written Japanese (BCCWJ), a Japanese language corpus, to adhere to the UD annotation schema. The BCCWJ already assigns dependency information at the level of the bunsetsu (a Japanese syntactic unit comparable to the phrase). We developed a program to convert the BCCWJ to UD based on this dependency structure, and this corpus is the result of completely automatic conversion using the program. UD Japanese-BCCWJ is the largest-scale UD Japanese corpus and the second-largest of all UD corpora, including 1,980 documents, 57,109 sentences, and 1,273k words across six distinct domains.


Introduction
The field of Natural Language Processing has seen growing interest in multilingual and cross-linguistic research. One such cross-linguistic research initiative is the Universal Dependencies (UD) Project (McDonald et al., 2013), which defines standards and schemas for parts of speech and dependency structures and distributes multilingual corpora. As part of our efforts to apply the UD annotation schema to Japanese, we defined a part-of-speech (PoS) system and a set of dependency structure labels for Japanese, which are documented on GitHub (https://github.com/UniversalDependencies/), and we are currently preparing reference corpora. This paper describes our Japanese UD corpus UD Japanese-BCCWJ, which is based on the Balanced Corpus of Contemporary Written Japanese (BCCWJ) (Maekawa et al., 2014), and which we have prepared as part of our efforts to design a Japanese version of UD.
Previous applications of UD to Japanese corpora can be found in Table 1, which is based on . Tanaka et al. (2016) have published a Japanese UD treebank, UD Japanese-KTC, which was converted from the Japanese Phrase Structure Treebank (Tanaka and Nagata, 2013). Other corpora include an unlabelled UD Japanese treebank derived from Wikipedia, UD Japanese-GSD, and a treebank derived from parallel corpora, UD Japanese-PUD (Zeman et al., 2017), but all of these have had to be partially manually corrected. According to Table 1, UD Japanese-BCCWJ is the largest UD Japanese corpus. Furthermore, it is the second largest of all UD corpora and includes many documents across various domains, as shown in Table 3.
Existing Japanese-language corpora tagged with dependency structures include the Kyoto University Text Corpus (Kurohashi and Nagao, 2003) and the Japanese Dependency Corpus (Mori et al., 2014). These corpora frequently use bunsetsu as the syntactic dependency annotation units for Japanese. The BCCWJ, on which UD Japanese-BCCWJ is based, is likewise annotated with a bunsetsu-level dependency structure, which we must therefore convert to the Universal Dependencies schema. Figure 1 shows an example of BCCWJ with the UD annotation schema.
In this paper, we describe the conversion of the BCCWJ to the UD annotation schema. To accomplish the conversion, the following information must be combined: word-morphological information, bunsetsu-level dependency structure, coordination structure annotation, and predicate-argument structure information. We also attempt to convert the BCCWJ to a UD schema in a way that allows us to respond to changes in the tree structures based on ongoing discussions in the UD community.
Figure 1: Summary of conversion of BCCWJ to UD (the sample is from PB_00001). The left example is the BCCWJ schema with its bunsetsu-level dependency structure, and the right is the Universal Dependencies schema. English gloss of the example: '(There is an) impact experiment on the darkness drinking party on white night.'
The copyright negotiation process has also been completed for BCCWJ DVD purchasers. All BCCWJ data are automatically tokenized and PoS-tagged by NLP analysers in a three-layered tokenization of Short Unit Word (SUW), Long Unit Word (LUW), and bunsetsu, as in Figure 2. Some subcorpora are checked manually after analysis to improve their quality; among these, a subcorpus of 1% of the BCCWJ data called 'core data' consists of 1,980 samples and 57,256 sentences with morphological information (word boundaries and PoS information). Table 2 describes each genre in the BCCWJ core data. The distribution, including the BCCWJ core data, is shown in Table 3. UD Japanese-BCCWJ is based on the BCCWJ core data. The BCCWJ provides bunsetsu-level dependency information as BCCWJ-DepPara (Asahara and Matsumoto, 2016), including bunsetsu dependency structures and coordination structures, and information on predicate-argument structures through BCCWJ-DepPara-PAS (Ueda et al., 2015). This information is exploited in the conversion of BCCWJ to the UD schemas.
Table 3: Genre distribution including BCCWJ core data. A description of each genre is given in Table 2.

Conversion of BCCWJ to UD
As shown in Figure 1, there are some differences between the BCCWJ and UD schemas. The first concerns PoS: the BCCWJ and UD PoS systems are based on UniDic (Den et al., 2007) and Universal PoS (Petrov et al., 2012), respectively (e.g. noun(common.general) and NOUN in Figure 1). Second, the structure differs between bunsetsu-level and word-level dependency, for example in the directions and units of dependency (compare BCCWJ with the UD schema in Figure 1). Finally, the bunsetsu-level dependency structures in Japanese have less detailed syntactic dependency roles than the relations in Universal Dependencies, such as nmod and case. We need to convert the BCCWJ to UD Japanese-BCCWJ while taking these differences into consideration. In addition, we need to choose or detect an apposite basic word unit under the UD guidelines from among SUWs, LUWs, and others, because the layers given by the BCCWJ are not always appropriate. Therefore, we convert the BCCWJ to UD Japanese-BCCWJ using the following steps:
1. Detect the word unit.
2. Convert UniDic PoS to UD PoS.
3. Convert bunsetsu-level dependency to UD word-level dependency.
4. Attach a UD relation label to each dependency.
We will describe each step in the following sections.

Word Unit
In Japanese, unlike English and many other languages, text is not explicitly divided into words using spaces. UD guidelines specify that the basic units of annotation are syntactic words. The first task is therefore to decide what counts as a token and what counts as a syntactic word. All the samples in the BCCWJ are morphologically analysed based on linguistic units called 'Short Unit Words' (SUWs) and 'Long Unit Words' (LUWs), as in Figure 2. SUWs are defined on the basis of their morphological properties in the Japanese language. They are minimal atomic units that can be combined in ways specific to particular classes of Japanese words. LUWs are defined on the basis of their syntactic properties. The bunsetsu are word grouping units defined in terms of the dependency structure (the so-called bunsetsu-kakariuke). The bunsetsu-level dependency structure annotations in BCCWJ-DepPara (Asahara and Matsumoto, 2016) rely on LUWs. As shown in Figure 2, the SUWs, LUWs, and bunsetsu exist in a hierarchical relationship, SUW <= LUW <= bunsetsu: the example is segmented into three words as SUWs, two words as LUWs, and one word as a bunsetsu. SUWs and LUWs also entail different PoS systems, as will be described in Section 3.2.
UD Japanese-BCCWJ adopts the SUW word unit, which corresponds to the BCCWJ's basic PoS system, as its fundamental linguistic unit. However, as described in the following sections, usage information associated with LUWs is also required to conform to UD standards and to achieve consistency with annotations for other languages. We will discuss the differences between SUWs and LUWs in Section 5.1.
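The three-layer hierarchy can be illustrated with a minimal sketch. It assumes that each SUW carries two boundary flags, one marking the start of an LUW and one marking the start of a bunsetsu; the class name, the field names, and the placeholder forms w1 to w3 are hypothetical and do not reproduce the BCCWJ data format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SUW:
    form: str
    starts_luw: bool       # True if this SUW opens a new LUW (assumed flag)
    starts_bunsetsu: bool  # True if this SUW opens a new bunsetsu (assumed flag)


def group(suws: List[SUW], flag: str) -> List[List[str]]:
    """Group consecutive SUW forms into larger units using a boundary flag."""
    units: List[List[str]] = []
    for suw in suws:
        if getattr(suw, flag) or not units:
            units.append([])
        units[-1].append(suw.form)
    return units


# Placeholder forms: three SUWs forming two LUWs and one bunsetsu.
suws = [
    SUW("w1", starts_luw=True, starts_bunsetsu=True),
    SUW("w2", starts_luw=False, starts_bunsetsu=False),
    SUW("w3", starts_luw=True, starts_bunsetsu=False),
]
luws = group(suws, "starts_luw")
bunsetsu = group(suws, "starts_bunsetsu")
print(len(suws), len(luws), len(bunsetsu))
```

Storing the coarser layers as boundary flags on the SUWs keeps the SUW as the basic unit while leaving LUW and bunsetsu spans recoverable.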

Conversion to Universal PoS tags
UD has adopted Universal PoS tags, version 2.0 (Petrov et al., 2012), as a system for aggregating the parts of speech of all languages; in this system 17 distinct parts of speech are defined. For the Japanese-language version of UD, we defined the UD parts of speech by constructing a table of correspondences between UniDic (Den et al., 2007) and the Universal PoS tags. For SUWs, BCCWJ adopts a PoS system based on a word's possible lexical categories. For example, the PoS tag noun(common.adverbial) means that the word can be a common noun or an adverb. In contrast, LUWs are used to specify PoS tags based on usage principles, which resolve usage ambiguities based on context. The noun(common.adverbial) tag in the SUW PoS system resolves to a common noun or an adverb depending on context. We selected the SUW PoS system because SUWs are the base annotation of word units of the BCCWJ; broadly speaking, there is no significant difference between the SUW and LUW PoS systems for our purposes.
However, for certain words we need to use a LUW PoS system based on usage principles in order to conform to the UD standards and to achieve consistency with other languages. For example, in the case of a nominal verb (noun(common.verbal_suru), which can take -suru) or nominal adjective (noun(common.adjectival), which can take an adjectival ending), the SUW PoS system, based on lexical principles, is not appropriate: whether such a word functions as a verb or adjective depends on the context, and the SUW PoS system cannot detect this. Instead, here we use LUW PoS tags based on usage principles that resolve ambiguities based on context. The LUW PoS tags based on usage principles have the advantage of being easier to map onto other languages, and the reduced ambiguity associated with word endings makes it easier to specify the conditions for a VERB or ADJ tag. Table 4 shows the mapping between Universal PoS tags and UniDic based on these principles. Note that the mapping is for UniDic SUW PoS; using UniDic LUW PoS would be simpler, as described in Section 5.1. However, using LUW PoS involves several problems, as will be described later.
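The lookup strategy described above can be sketched as follows: a table maps SUW PoS tags to Universal PoS tags, and the LUW PoS is consulted only for the context-dependent tags. This is a sketch under stated assumptions, not the contents of Table 4. Only the pairing of noun(common.general) with NOUN comes from the text; the LUW labels, the fallback values, and the function name are hypothetical.

```python
# Hypothetical excerpt of an SUW PoS -> Universal PoS table.
SUW_TO_UPOS = {
    "noun(common.general)": "NOUN",  # pairing given in the text for Figure 1
}

# SUW tags whose category depends on usage in context.
CONTEXT_DEPENDENT = {
    "noun(common.verbal_suru)",
    "noun(common.adjectival)",
}

# Hypothetical LUW PoS labels, resolved by usage principles.
LUW_TO_UPOS = {"verb": "VERB", "adjective": "ADJ", "noun": "NOUN"}


def to_upos(suw_pos: str, luw_pos: str) -> str:
    """Map a word to a Universal PoS tag, using LUW PoS only when needed."""
    if suw_pos in CONTEXT_DEPENDENT:
        # Assumption: default to NOUN when the LUW label is not in the table.
        return LUW_TO_UPOS.get(luw_pos, "NOUN")
    # Assumption: unknown SUW tags fall back to the UD tag X ("other").
    return SUW_TO_UPOS.get(suw_pos, "X")


print(to_upos("noun(common.verbal_suru)", "verb"))
print(to_upos("noun(common.verbal_suru)", "noun"))
```

Restricting the LUW lookup to a small set of ambiguous tags keeps the SUW table as the primary source, which matches the choice of SUWs as the base unit.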

Conversion of dependency structure
For syntactic information for Japanese, we use BCCWJ-DepPara (Asahara and Matsumoto, 2016), which includes bunsetsu dependency and coordination information for the BCCWJ. In order to convert bunsetsu-level into word-level dependencies, we identify the head word in the bunsetsu and then attach all other elements in the bunsetsu to the head word, as in Figure 3. Note that the UD dependency arrow is from the head to the dependent word, whereas the BCCWJ dependency arrow is from the dependent to the head word; this is merely a notational issue and the substantive description is the same. Moreover, the head word in the bunsetsu is selected as the rightmost content word after separating content and function words; for example, the head word is 'experiments' in 'impact experiments' in Figure 3.
While BCCWJ-DepPara includes dependency information, it does not include syntactic dependency roles corresponding to the Universal Dependencies relations (de Marneffe et al., 2014) (such as the labels nsubj, obj, and iobj). We therefore determined and assigned the UD relation labels based on the case-marking (particle(case binding adverbial)) or predicate-argument structure information in BCCWJ-DepPara-PAS (Ueda et al., 2015). This predicate-argument structure information is semantic-level information, so basically we use the case-marking, and the predicate-argument information is just for reference. Since Japanese, unlike languages such as English, can omit core arguments and case-marking and the case-marking always corresponds with grammatical arguments in UD relations, predicate-argument structure is necessarily expressed by the case marker. For example, the case marker ha usually indicates a nominal subject nsubj, but also frequently appears as a topic marker. Table 5 shows the rules for assigning UD relations. These conversions combine various kinds of information, such as bunsetsu information, case information, and coordination relations between the head word and the dependent word.
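The bunsetsu-to-word conversion can be sketched as follows. This is a minimal illustration under stated assumptions and not the conversion program itself: the Word and Bunsetsu classes, the is_content flag, and the fallback for a bunsetsu without content words are hypothetical, and relation labelling (Table 5) is left out. English glosses stand in for the Japanese words.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    form: str
    is_content: bool  # True for content words, False for function words


@dataclass
class Bunsetsu:
    words: List[Word]
    head: int  # index of the governing bunsetsu, or -1 for the root


def head_word_index(bunsetsu: Bunsetsu) -> int:
    """Return the position of the rightmost content word in the bunsetsu."""
    for i in range(len(bunsetsu.words) - 1, -1, -1):
        if bunsetsu.words[i].is_content:
            return i
    # Assumption: fall back to the last word if no content word is present.
    return len(bunsetsu.words) - 1


def to_word_heads(bunsetsus: List[Bunsetsu]) -> List[int]:
    """Return a 1-based head ID for every word (0 = root), CoNLL-U style."""
    offsets, total = [], 0
    for b in bunsetsus:
        offsets.append(total)
        total += len(b.words)
    head_ids = [off + head_word_index(b) + 1 for off, b in zip(offsets, bunsetsus)]

    heads = []
    for bi, b in enumerate(bunsetsus):
        for wi in range(len(b.words)):
            word_id = offsets[bi] + wi + 1
            if word_id != head_ids[bi]:
                heads.append(head_ids[bi])      # attach to the bunsetsu head word
            elif b.head < 0:
                heads.append(0)                 # head word of the root bunsetsu
            else:
                heads.append(head_ids[b.head])  # head word of the governing bunsetsu
    return heads


sentence = [
    Bunsetsu([Word("party", True), Word("on", False)], head=1),
    Bunsetsu([Word("impact", True), Word("experiment", True)], head=-1),
]
print(to_word_heads(sentence))  # [4, 1, 4, 0]
```

In the example, 'experiment' is the rightmost content word of the root bunsetsu and becomes the root, 'impact' and 'party' attach to it, and the function word 'on' attaches to 'party'.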
Our current rules, which are unable to identify clauses, thus cannot effectively handle clause-related labels such as csubj, advcl, and acl; this is because clauses in Japanese are vaguer than in English, as described in Section 5.2. In the future, we will solve this problem by establishing criteria for identifying clauses. BCCWJ-DepPara also contains coordinate structure information, but our current conversion rules do not yet have defined rules related to coordinate structures such as cc and conj. The issue will be presented in .

Format
Through this process we can convert the BCCWJ to a UD schema. UD Japanese-BCCWJ is distributed in the CoNLL-U format. It provides the word form, the lemma of the word form, the universal part-of-speech tag, the language-specific part-of-speech tag (UniDic PoS), and the Universal Dependencies relation. Note that the SUW PoS serves as the language-specific PoS tag in UD Japanese-BCCWJ.
UD allows us to insert any annotation using the MISC field, so we can give syntactic information using this field for LUW word units and bunsetsu. This information may be useful for Japanese parsing. Table 6 summarizes the MISC fields in UD Japanese-BCCWJ.
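As an illustration of the format, the sketch below assembles one CoNLL-U token line with its ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The helper name and the MISC keys BunsetuBILabel and LUWBILabel are hypothetical stand-ins for the entries of Table 6, and the underscore in the FORM and LEMMA columns is a placeholder rather than corpus data.

```python
from typing import Dict, Optional


def conllu_line(idx: int, form: str, lemma: str, upos: str, xpos: str,
                head: int, deprel: str,
                misc: Optional[Dict[str, str]] = None) -> str:
    """Build one CoNLL-U token line; unused columns are filled with '_'."""
    misc_str = "|".join(f"{k}={v}" for k, v in misc.items()) if misc else "_"
    fields = [
        str(idx),   # ID
        form,       # FORM
        lemma,      # LEMMA
        upos,       # UPOS
        xpos,       # XPOS (language-specific PoS, here the SUW PoS)
        "_",        # FEATS
        str(head),  # HEAD
        deprel,     # DEPREL
        "_",        # DEPS
        misc_str,   # MISC
    ]
    return "\t".join(fields)


line = conllu_line(
    1, "_", "_", "NOUN", "noun(common.general)", 2, "nmod",
    misc={"BunsetuBILabel": "B", "LUWBILabel": "B"},  # hypothetical keys
)
print(line)
```

Carrying LUW and bunsetsu boundaries in MISC leaves the nine core columns fully UD-compliant while keeping the coarser layers available to Japanese parsers.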

Parsing by genre
UD Japanese-BCCWJ is attractive in that it includes documents in various genres. We present parsing results that indicate differences by genre. In this paper we do not show part-of-speech tagging results, because Japanese PoS tagging tools already exist (for example, MeCab, the implementation of Kudo et al. (2004)), which make it easy to convert UniDic PoS to UD PoS, as mentioned above.
We use UDPipe (Straka and Straková, 2017) as a tool to train the parsing model and evaluate the parsing accuracy. UDPipe is a trainable pipeline for tokenization, tagging, lemmatization, and dependency parsing from CoNLL-U format files. The parsing uses Parsito (Straka et al., 2015), which is a transition-based parser using a neural-network classifier. We use default parameters in UDPipe. 9 We use the labelled attachment score (LAS) and unlabelled attachment score (UAS) as evaluation metrics.
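For reference, the two metrics can be computed as in the sketch below: UAS is the fraction of words whose predicted head is correct, and LAS is the fraction whose predicted head and relation label are both correct. The sketch assumes that gold and predicted analyses share the same tokenization; the function name and the toy data are illustrative and are not taken from UDPipe.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head ID, relation label) for one word


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) over words aligned one-to-one in gold and pred."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equally long")
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    las_hits = sum(g == p for g, p in zip(gold, pred))
    return uas_hits / len(gold), las_hits / len(gold)


gold = [(2, "nmod"), (0, "root"), (2, "case"), (2, "obj")]
pred = [(2, "nmod"), (0, "root"), (2, "obl"), (1, "obj")]
uas, las = attachment_scores(gold, pred)
print(uas, las)  # 0.75 0.5
```

In the toy data, the third word has the correct head but the wrong label, so it counts towards UAS only, and the fourth word has the wrong head, so it counts towards neither.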
The results are shown in Table 7 and Table 8. The columns of the tables correspond to the genre on which the parsing model was trained and the rows to the genre of the test set; 'all' denotes the full core data. Each cell thus gives the result of evaluating the model trained on the column genre against the test set of the row genre.
Whereas the genres of OW, PB, PM, and PN contain more than 200K tokens, the genres of OC and OY contain only around 100K tokens, as shown in Table 3.
It is in principle one of the advantages of UD Japanese-BCCWJ that it can utilize a relatively large-scale sub-corpus. In fact, however, the UAS results show that if a genre has more than 200K tokens, the result from using only the in-domain data is better than that obtained with the data for all 1.2 million tokens, including the out-of-domain data.
9 The UDPipe version used is 1.2.1-devel, executed with no options.

Discussion
In this section, we will take up a problem related to UD Japanese that centres on UD Japanese-BCCWJ. The overall discussion of UD Japanese is summarized by .
We must also still discuss the issue of coordinate structures in Japanese. The issue will be presented in .

Word units
The choice of word unit is one of the important issues in UD Japanese. BCCWJ includes three sorts of word unit standards, as noted: SUWs, LUWs, and bunsetsu. We used SUWs for UD Japanese-BCCWJ.
However, the UD project stipulates that word delimitation in the UD standard should be for 'syntactic words'. LUWs in BCCWJ are thus a preferable word delimitation standard to SUWs. Figure 4 shows the difference between SUW PoS and LUW PoS. The top of Figure 4 shows the SUW-based PoS. The verb do and the verbal noun make a compound verb, as in the bottom of Figure 4 in the LUW-based segmentation. Figure 5 presents a functional multi-word expression, which includes three words in SUW units and one word in LUW units. We can mask the morphological construction of the syntactic word within a LUW.
Nevertheless, we currently continue to use SUWs as the UD Japanese word delimitation standard. This is because (1) LUWs are difficult to produce with word segmenters, and (2) some functional multi-word expressions in Japanese do not conform to the LUW standards.

Clause
The UD dependency labels are designed to distinguish between the word/phrase and the clause. The difference between clauses and words/phrases is vague in Japanese, because cases, including the subject, do not necessarily appear overtly in sentences. Figure 6 shows an adjective clause and an adjective phrase in Japanese. At the top of Figure 6 is an overt adjective clause with a nominal subject. In the example at the bottom of Figure 6, however, it cannot be determined whether the adjective is attributive or predicative, since the nominal subject of an adjectival predicate can be omitted in Japanese (in this case, 'tail' may be omitted). Thus, for the present we assign acl to all adjectives that attach to noun phrases.

Other UD Japanese resources
In this section, we describe other UD Japanese resources at the time of writing. Table 1 shows a summary of these. As noted, there are five UD Japanese corpora as of March 2018, which, with the addition of UD Japanese-BCCWJ, constitute the second largest of all UD corpora in scale. UD Japanese-KTC (Tanaka et al., 2016) is based on the NTT Japanese Phrase Structure Treebank (Tanaka and Nagata, 2013), which contains the same original text as the Kyoto Text Corpus (KTC) (Kurohashi and Nagao, 2003). KTC has a bunsetsu-level dependency structure like BCCWJ, but with its own word delimitation schema and PoS tag set. We are now modifying UD Japanese-KTC from the version 1.0 schema to version 2.0.
UD Japanese-GSD consists of sentences from Japanese Wikipedia that have been automatically split into words by IBM's word segmenter. The dependencies are automatically resolved using the bunsetsu-level dependency parser (Kanayama et al., 2000) with the attachment rules for functional words defined in UD Japanese.
UD Japanese-PUD (Zeman et al., 2017) was created in the same manner as UD Japanese-GSD, with the goal of maintaining consistency with UD Japanese-GSD. It is a parallel corpus with multiple other languages.
UD Japanese-Modern (Omura et al., 2017) is a small UD annotation corpus based on the Corpus of Historical Japanese: Meiji-Taisho Series I Magazines (CHJ) (Ogiso et al., 2017). The CHJ is a large-scale corpus of Old Japanese whose morphological information is compatible with that of the BCCWJ. We annotated bunsetsu-level syntactic dependency and coordination structures using the BCCWJ-DepPara annotation schema, together with predicate-argument relations, and utilized the conversion script used for UD Japanese-BCCWJ because the two corpora share the same annotation schema. There are two characteristic syntactic structures in Old Japanese. One is inversion, found in Sino-Japanese literary styles. The other is predicative adnominals.
As mentioned, each UD Japanese corpus has been developed in a different manner, since the resources are derived from annotations made under other standards. For example, UD Japanese-KTC is converted from a phrase structure treebank, while UD Japanese-Modern is based on annotation compatible with UD Japanese-BCCWJ. However, the syntactic structures of Old Japanese are very different from those of contemporary Japanese, as described above.
Presently we are trying to standardize UD Japanese resources under the UD Japanese-BCCWJ schema by annotating the other resources with the BCCWJ-DepPara standard for syntactic dependencies. We will then apply the conversion rules of this article to the other UD Japanese resources.

Summary and Outlook
In this paper, we described a corpus created by converting the Balanced Corpus of Contemporary Written Japanese (BCCWJ), a Japanese language corpus, into the UD annotation schema. There are differences between BCCWJ and UD schemas, and so we have tried to develop and implement rules to convert BCCWJ to UD.
UD Japanese-BCCWJ was released in March 2018. Note that although the corpus does not include the surface form, owing to the copyright on the original text, purchasers of the BCCWJ DVD Edition can add the surface form using the scripts in the UD package. However, several aspects of the conversion remain a matter of debate, as described in this paper, so we are going to continue to update the corpus based on ongoing discussion, for instance regarding the apposite word unit for Japanese.
At the time of writing, we have completed the process of UD conversion based on SUWs. We also need to implement a corpus based on LUWs, and will publicly release our Japanese UD data based on both SUW and LUW analyses.