Constructing a Japanese Basic Named Entity Corpus of Various Genres

This paper introduces a Japanese Named Entity (NE) corpus of various genres. We annotated 136 documents in the Balanced Corpus of Contemporary Written Japanese with the eight types of NE tags defined by IREX. The NE corpus consists of six types of genres of documents such as blogs, magazines, white papers


Introduction
Named Entity (NE) recognition is a process by which the names of particular classes and numeric expressions are recognized in text. NEs include person names, locations, organizations, dates, times, and so on. NE recognition is one of the basic technologies used in text processing, including Information Extraction (IE), Question Answering (QA), and Information Retrieval (IR).
In the early stages of NE recognizer development, newspaper articles were mainly used. For example, the following data sets consist of newspaper articles: the data sets for the eight basic Japanese NE types of the Information Retrieval and Extraction Exercise (IREX) (IREX Committee, 1999), the CoNLL'03 shared task data (Tjong Kim Sang and De Meulder, 2003), an English fine-grained NE data set that includes 64 classes (Weischedel and Brunstein, 2005), and data for Sekine's extended NE hierarchy, which includes about 200 classes of NEs (Sekine et al., 2002).
For Sekine's extended NE hierarchy, NE corpora have been created for documents of various genres in BCCWJ, such as blogs and white papers (Maekawa et al., 2010). In contrast, corpora for Japanese basic NEs have been created for fewer genres of documents, namely the newspaper articles of IREX and the leading sentences of Web pages (Hangyo et al., 2012).
This paper introduces a Japanese NE corpus of various genres called the BCCWJ Basic NE corpus. It was created to expand the genres of documents available for research on Japanese basic NEs. The corpus includes 136 documents from the core data of the Balanced Corpus of Contemporary Written Japanese (BCCWJ), annotated with the eight types of NE tags defined by IREX. The corpus contains 2,464 NE tags in total, and the documents cover the following genres: Yahoo! Chiebukuro (OC), White Paper (OW), Yahoo! Blog (OY), Books (PB), Magazines (PM), and Newspapers (PN). The corpus thus includes genres of documents that have not been targeted by existing NE corpora based on the IREX definition (IREX Committee, 1999; Hangyo et al., 2012).

BCCWJ Basic NE corpus
We annotated 136 documents included in the BCCWJ core data with IREX-defined NE tags by the following procedure. We chose the same documents as those of a Japanese morphological analysis corpus.
• Initial annotation: Six annotators, the authors and three university students, annotated all the documents with NEs. Each document was annotated by only one annotator.
• Modification: Four of the annotators checked all the annotated documents again and corrected annotation errors. Annotation disagreements were resolved through discussion among the annotators.
• Packaging: We prepared a package that includes only the annotated tags and their positions in each document. Users who have BCCWJ can reproduce the BCCWJ Basic NE corpus with the package.
Table 2 shows the number of documents and NE tags of each genre in the BCCWJ Basic NE corpus. For comparison, the statistics of the IREX data are also given. The number of documents in the BCCWJ Basic NE corpus is larger than the total number of documents in the IREX evaluation data (the GENERAL data and the ARREST data). In addition, the BCCWJ Basic NE corpus includes documents other than newspapers, such as Yahoo! Chiebukuro and White Paper. Table 3 shows the statistics of the BCCWJ Basic NE corpus. Table 4 shows the percentage of each NE type in each genre. These statistics show that the BCCWJ Basic NE corpus has different properties from the IREX data. For example, Yahoo! Chiebukuro and White Paper include more ARTIFACT NEs than Newspapers, and Magazines include more PERSON NEs than the other genres.
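The packaging step above distributes only tags and their positions, so the annotated text has to be rebuilt from a local copy of BCCWJ. The following is a minimal sketch of such a merge. The record layout (start offset, end offset, tag), the function name `apply_standoff`, the inline tag output, and the sample sentence are all assumptions for illustration; the actual package format is not described here.

```python
from typing import List, Tuple

# Hypothetical standoff record: (start offset, end offset, NE tag).
Annotation = Tuple[int, int, str]


def apply_standoff(text: str, annotations: List[Annotation]) -> str:
    """Insert inline NE tags into text from character-offset annotations."""
    pieces = []
    cursor = 0
    for start, end, tag in sorted(annotations):
        # Reject spans that overlap an earlier span or fall outside the text.
        if start < cursor or end > len(text) or start >= end:
            raise ValueError(f"invalid span: {(start, end, tag)}")
        pieces.append(text[cursor:start])                   # untagged text
        pieces.append(f"<{tag}>{text[start:end]}</{tag}>")  # tagged NE
        cursor = end
    pieces.append(text[cursor:])                            # trailing text
    return "".join(pieces)


# Illustrative sentence, not taken from the corpus.
raw = "太郎は東京へ行った。"
tags = [(0, 2, "PERSON"), (3, 5, "LOCATION")]
tagged = apply_standoff(raw, tags)
print(tagged)
```

Keeping annotations separate from the text in this way means the package itself never has to contain the BCCWJ documents.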

Example Uses of BCCWJ Basic NE corpus
This section describes some example uses of the BCCWJ Basic NE corpus.

Evaluation of an NE recognizer
We evaluated KNP, which extracts the eight types of NEs listed in Table 1 based on the IREX definition. KNP is one of the freely available state-of-the-art NE recognizers. We used KNP version 4.12 with the Japanese morphological analyzer JUMAN version 7.01 as its morphological analyzer. In the evaluation measures, NUM is the number of correct NEs recognized by KNP.
Compared with Newspapers, KNP showed lower performance on Yahoo! Chiebukuro and Yahoo! Blog. One likely reason is that KNP was trained on the IREX CRL data, which consists of news articles. Another is that Yahoo! Chiebukuro and Yahoo! Blog include more abbreviations and colloquial expressions than newspapers. Furthermore, KNP also showed lower performance on White Paper, even though White Paper documents are in written language. One likely reason is that White Paper includes more ARTIFACT NEs than Newspapers, and the accuracy of KNP for ARTIFACT was lower than for the other NE types on Newspapers.
This evaluation shows that using documents of different genres lets us evaluate NE recognizers from different perspectives.
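Only NUM is defined in the text above. Assuming the evaluation uses the usual recall, precision, and F-measure over exactly matching NEs (an assumption that is not confirmed here), the scores could be computed as in the following sketch. The function name `score`, the span representation, and the sample spans are illustrative.

```python
from typing import Set, Tuple

# Hypothetical NE representation: (start offset, end offset, NE type).
Span = Tuple[int, int, str]


def score(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (recall, precision, F-measure) under exact span and type match."""
    num = len(gold & predicted)  # NUM: correct NEs found by the recognizer
    recall = num / len(gold) if gold else 0.0
    precision = num / len(predicted) if predicted else 0.0
    total = recall + precision
    f_measure = 2 * recall * precision / total if total else 0.0
    return recall, precision, f_measure


# Made-up spans: two of the three predictions match the gold annotation.
gold = {(0, 2, "PERSON"), (3, 5, "LOCATION"), (10, 14, "DATE"), (20, 23, "ARTIFACT")}
pred = {(0, 2, "PERSON"), (3, 5, "ORGANIZATION"), (10, 14, "DATE")}
recall, precision, f_measure = score(gold, pred)
print(recall, precision, f_measure)
```

In this sketch a prediction with the right boundaries but the wrong type, such as the ORGANIZATION span, counts as an error for both recall and precision.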
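The observation above that White Paper contains relatively more ARTIFACT NEs than Newspapers can be checked from per-genre tag counts. The sketch below uses the counts reported for three genres of the corpus and computes each NE type's share within a genre. The dictionary layout and the function name `type_shares` are assumptions for illustration.

```python
from typing import Dict

NE_TYPES = ("ART", "DATE", "LOC", "MON", "OPT", "ORG", "PERC", "PERS", "TIME")

# Tag counts per genre, in the order of NE_TYPES, as reported for the corpus.
COUNTS = {
    "OC": (54, 19, 57, 9, 8, 19, 0, 6, 3),
    "OW": (163, 129, 140, 9, 39, 128, 33, 15, 0),
    "PN": (24, 165, 188, 59, 13, 118, 38, 78, 22),
}


def type_shares(counts: Dict[str, tuple]) -> Dict[str, Dict[str, float]]:
    """Convert raw tag counts into percentages within each genre."""
    shares = {}
    for genre, row in counts.items():
        total = sum(row)
        shares[genre] = {t: 100.0 * n / total for t, n in zip(NE_TYPES, row)}
    return shares


shares = type_shares(COUNTS)
for genre in COUNTS:
    print(genre, f"ART share: {shares[genre]['ART']:.1f}%")
```

Normalizing within each genre makes genres of different sizes comparable, which raw counts alone do not.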

Other Expected Uses
We also expect the BCCWJ Basic NE corpus to contribute to the following research.
• NE recognition for colloquial expressions: Yahoo! Blog contributes to research on NE recognition for colloquial expressions because it includes more colloquial expressions than Newspapers and White Paper.
• Domain adaptation: The BCCWJ Basic NE corpus includes six genres of documents, so we expect it to be useful for research on domain adaptation (Daumé III, 2007).
• Revision learning for NE recognition: We have uploaded not only the latest annotation but also older versions of the NE annotation results. Therefore, the corpus can be used for research on error detection or revision learning, as in Japanese morphological analysis (Nakagawa et al., 2002).
• Comparison of annotation performance on different genres of documents: We can use this corpus for evaluating annotation performance and annotation methods on different genres of documents. One example is described in (Komiya et al., 2016). That paper compared two methods by which non-expert annotators annotate a corpus for the NE recognition task. The first is annotation by revising the results of an existing NE recognizer. The other is annotation by hand from the beginning.

Number of NE tags of each type in each genre:

Genre   ART  DATE   LOC   MON   OPT   ORG  PERC  PERS  TIME
OC       54    19    57     9     8    19     0     6     3
OW      163   129   140     9    39   128    33    15     0
OY       25    60    52     7     9    61    11    79     3
PB       29    50    87     0    24    26     6   169     8
PM       13    42    32     5     4    17     1   203     2
PN       24   165   188    59    13   118    38    78    22
Total   308   465   557    89    97   369    89   550
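For the domain adaptation use listed above, one simple baseline is the feature augmentation approach of the cited work (Daumé III, 2007): each feature is copied into a shared version and a domain-specific version, so that a single model can learn both general and domain-specific weights. The sketch below treats the six genre codes of the corpus as domains. The feature names, the prefix scheme, and the function name `augment` are hypothetical.

```python
from typing import Dict

# Genre codes of the corpus, treated here as adaptation domains.
GENRES = ("OC", "OW", "OY", "PB", "PM", "PN")


def augment(features: Dict[str, float], genre: str) -> Dict[str, float]:
    """Duplicate each feature into a shared copy and a genre-specific copy."""
    if genre not in GENRES:
        raise ValueError(f"unknown genre: {genre}")
    augmented = {}
    for name, value in features.items():
        augmented[f"shared:{name}"] = value   # active for every genre
        augmented[f"{genre}:{name}"] = value  # active only for this genre
    return augmented


# Illustrative token features for a token in a Yahoo! Blog (OY) document.
token_features = {"word=東京": 1.0, "pos=noun": 1.0}
blog_features = augment(token_features, "OY")
print(sorted(blog_features))
```

The augmented feature dictionaries can be fed to any linear sequence labeler unchanged, which is what makes this baseline easy to try across the six genres.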

Conclusion
This paper introduced a Japanese Named Entity