User-Generated Text Corpus for Evaluating Japanese Morphological Analysis and Lexical Normalization

Morphological analysis (MA) and lexical normalization (LN) are both important tasks for Japanese user-generated text (UGT). To evaluate and compare different MA/LN systems, we have constructed a publicly available Japanese UGT corpus. Our corpus comprises 929 sentences annotated with morphological and normalization information, along with category labels that we defined for frequent UGT-specific phenomena. Experiments on the corpus demonstrated the low performance of existing MA/LN methods on non-general words and non-standard forms, indicating that the corpus would be a challenging benchmark for further research on UGT.


Introduction
Japanese morphological analysis (MA) is a fundamental and important task that involves word segmentation, part-of-speech (POS) tagging, and lemmatization, because the Japanese language has no explicit word delimiters. Although MA methods for well-formed text (Kudo et al., 2004; Neubig et al., 2011) have been actively developed by taking advantage of existing annotated corpora of newswire domains, they perform poorly on user-generated text (UGT), such as social media posts and blogs. Additionally, because of the frequent occurrence of informal words, lexical normalization (LN), which identifies standard word forms, is another important task for UGT. Several studies have addressed both tasks for Japanese UGT (Sasano et al., 2013; Kaji and Kitsuregawa, 2014; Saito et al., 2014, 2017) to achieve robust performance on noisy text. However, previous researchers have evaluated their systems on in-house data created by individual researchers, which makes it difficult to compare the performance of different systems and to discuss what issues remain in these two tasks. Therefore, publicly available data is necessary for a fair evaluation of MA and LN performance on Japanese UGT. In this paper, we present the blog and Q&A site normalization corpus (BQNC), 1 a public Japanese UGT corpus annotated with morphological and normalization information. We constructed the corpus under the following policies: (1) available and restorable; (2) compatible with the segmentation standard and POS tags used in existing representative corpora; and (3) enabling a detailed evaluation of UGT-specific problems.
For the first requirement, we extracted raw sentences from the blog and Q&A site registers in (the non-core data of) the Balanced Corpus of Contemporary Written Japanese (BCCWJ) (Maekawa et al., 2014), in which the original sentences are preserved. 2 For the second requirement, we followed the short unit word (SUW) criterion of the National Institute for Japanese Language and Linguistics (NINJAL), which is used in various NINJAL corpora, including the manually annotated sentences in the BCCWJ. For the third requirement, we organized the linguistic phenomena frequently observed in the two registers into word categories, and annotated each word with a category. We expect that this will contribute to future research on systems that manage UGT-specific problems.
The BQNC comprises sentence IDs and annotation information, including word boundaries, POS, lemmas, standard forms of non-standard word tokens, and word categories. We will release the annotation information, which enables BCCWJ applicants to replicate the full BQNC data from the original BCCWJ data. 3 Using the BQNC, we evaluated two existing methods: an MA method (Kudo et al., 2004) and a joint MA and LN method (Sasano et al., 2013). Our experiments and error analysis showed that these systems did not achieve satisfactory performance for non-general words. This indicates that our corpus would be a challenging benchmark for further research on UGT.

Overview of Word Categories
Based on our observations and the existing studies (Ikeda et al., 2010;Kaji et al., 2015), we organized word tokens that may often cause segmentation errors into two major types with several categories, as shown in Table 1. We classified each word token from two perspectives: the type of vocabulary to which it belongs and the type of variant form to which it corresponds. For example, ニホン nihon 'Japan' written in katakana corresponds to a proper name and a character type variant of its standard form 日本 written in kanji. Specifically, we classified vocabulary types into neologisms/slang, proper names, onomatopoeia, 4 interjections, (Japanese) dialect words, foreign words, and emoticons/ASCII art (AA), in addition to general words. 5 A common characteristic of these vocabularies, except for general words, is that a new word can be indefinitely invented or imported. We annotated word tokens with vocabulary type information, except for general words.
From another perspective, any word can have multiple variant forms. Because the Japanese writing system comprises multiple script types, including kanji and two types of kana, that is, hiragana and katakana, 6 words have orthographic variants written in different scripts. Among them, non-standard character type variants that rarely occur in well-formed text but occur in UGT can be problematic, for example, a non-standard form カワイイ for a standard form かわいい kawaī 'cute'. Additionally, ill-spelled words are frequently produced in UGT. We further divided them into two categories. The first is sound change variants, which have a phonetic difference from the original form and are typically derived by deletions, insertions, or substitutions of vowels, long sound symbols (chōon "ー"), long consonants (sokuon "っ"), and moraic nasals (hatsuon "ん"), for example, おいしーい oishīi for おいしい oishī 'tasty'. The second is alternative representations, which do not have a phonetic difference and are typically produced by substitution among uppercase or lowercase kana characters, or among vowel characters and long sound symbols, for example, 大きぃ for 大きい ōkī 'big'. Moreover, typographical errors can be seen as another type of variant form. We targeted these four types of non-standard forms for normalization to standard forms.
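Character type variants of the kind described above can often be surfaced automatically by comparing the scripts in which two forms are written. The sketch below classifies characters by Unicode block; the function names and the exact block ranges are our own illustrative choices, not part of the corpus tooling.

```python
# Classify Japanese characters by script using Unicode code-point ranges,
# a first step toward detecting character type variants such as
# katakana カワイイ versus hiragana かわいい.

def char_script(ch: str) -> str:
    """Return the script class of a single character."""
    cp = ord(ch)
    if 0x3041 <= cp <= 0x3096:
        return "hiragana"
    if 0x30A1 <= cp <= 0x30FA or cp == 0x30FC:  # treat ー as katakana here
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    if 0x0041 <= cp <= 0x005A or 0x0061 <= cp <= 0x007A:
        return "latin"
    return "other"

def token_scripts(token: str) -> set:
    """Set of script classes appearing anywhere in a token."""
    return {char_script(ch) for ch in token}
```

Two forms of the same lemma whose script sets differ (e.g., {"katakana"} vs. {"hiragana"}) are candidates for the character type variant category.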

Corpus Construction Process
The BQNC was constructed using the following steps. The annotation process was performed by the first author.
(1) Sentence Selection We manually selected sentences to include in our corpus from the blog and Q&A site registers in the BCCWJ non-core data. We preferentially extracted sentences that contained candidate UGT-specific words, that is, word tokens that may belong to non-general vocabularies or correspond to non-standard forms. As a result, we collected more than 900 sentences.
(2) First Annotation Sentences in the non-core data have been automatically annotated with word boundaries and word attributes, such as POS and lemma. Following the BCCWJ annotation guidelines (Ogura et al., 2011a,b) and UniDic (Den et al., 2007), an electronic dictionary database designed for the construction of NINJAL's corpora, we refined the original annotations of the selected sentences by manually checking them. The refined attributes were token, POS, conjugation type, conjugation form, pronunciation, lemma, and lemma ID. Additionally, we annotated each token with a word category shown in Table 1 and, if the token corresponded to a non-standard form, a standard form ID. Table 2 shows two examples of annotated sentences. We annotated each non-standard token with a standard form ID denoted as "[lemma ID]:[lemma](_[pronunciation])", which is associated with the set of acceptable standard forms shown in Table 3.
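A standard form ID of the shape above can be decomposed mechanically. The following is a minimal sketch assuming the pronunciation suffix is optional, as the parenthesized part of the format suggests; the lemma ID values used in the test are hypothetical, not real UniDic IDs.

```python
# Parse a standard form ID of the shape "[lemma ID]:[lemma](_[pronunciation])"
# into its components. Illustrative only, not the corpus's official tooling.

def parse_standard_form_id(sid: str) -> dict:
    lemma_id, _, rest = sid.partition(":")
    lemma, _, pron = rest.partition("_")
    return {
        "lemma_id": lemma_id,
        "lemma": lemma,
        "pronunciation": pron or None,  # None when no pronunciation suffix
    }
```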
(3) Second Annotation We rechecked all tokens in the sentences for which the first annotation had been finished, and fixed the annotation criteria, that is, the definitions of vocabulary types and variant form types, and the standard forms of each word. Through these steps, we obtained 929 annotated sentences.

Type of Vocabulary
Through the annotation process, we defined the criteria for vocabulary types as follows.
Neologisms/Slang: a newly invented or imported word that has come into collective use. Specifically, we used a corpus search application called Chunagon 7 and regarded a word as a neologism/slang if its frequency in the BCCWJ was less than five before the year 2000 and increased to more than ten in 2000 or later. 8
Proper names: following the BCCWJ guidelines, we regarded a single word that corresponded to a proper name, such as a person, organization, location, or product name, as a proper name. In contrast to the BCCWJ guidelines, we also regarded an abbreviation of a proper name as a proper name, for example, "ドラクエ" in Table 1.
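The frequency-based neologism/slang criterion above is simple enough to state as code. This sketch takes per-year frequency counts (the counts in the test are toy values, not real Chunagon query results) and applies the two thresholds.

```python
# Neologism/slang test: BCCWJ frequency below five before 2000 and
# above ten in 2000 or later, following the criterion in the text.

def is_neologism(freq_by_year: dict) -> bool:
    before = sum(f for y, f in freq_by_year.items() if y < 2000)
    after = sum(f for y, f in freq_by_year.items() if y >= 2000)
    return before < 5 and after > 10
```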
Onomatopoeia: a word that corresponds to onomatopoeia. We referred to a Japanese onomatopoeia dictionary (Yamaguchi, 2002) to assess whether a word is onomatopoeic. We followed the criteria in the BCCWJ guidelines on what forms of words are onomatopoeic and which words are associated with the same or different lemmas.
Interjections: a word whose POS corresponds to an interjection. Although we defined standard forms for idiomatic greeting expressions registered as single words in UniDic, 9 we did not define standard and non-standard forms for other interjections that express feelings or reactions, for example, ええ ē 'uh-huh' and うわあ uwā 'wow'.
Foreign words: a word from a non-Japanese language. We regarded a word written in the script of the original language as a foreign word, for example, English words written in the Latin alphabet such as "plastic". Conversely, we regarded loanwords written in Japanese scripts (hiragana, katakana, or kanji) as general words, for example, プラスチック 'plastic'. Moreover, we did not regard English acronyms and abbreviations written in uppercase letters as foreign words because such words are typically also written in the Latin alphabet in Japanese sentences, for example, ＳＮＳ.
Dialect words: a word from a Japanese dialect. We referred to a Japanese dialect dictionary (Sato, 2009) and regarded a word as a dialect word if it corresponded to an entry or occurred in an example sentence. We did not consider normalization from a dialect word to a corresponding word in the standard Japanese dialect.
Emoticons/AA: nonverbal expressions that comprise characters to express feelings or attitudes. Because the BCCWJ guidelines do not explicitly describe criteria on how to segment emoticon/AA expressions into words, we defined criteria that follow the emoticon/AA entries in UniDic. 10

Type of Variant Form
There are no trivial criteria to determine which variant forms of a word are standard forms because most Japanese words can be written in multiple ways. Therefore, we defined standard forms of a word as all forms whose occurrence rates were approximately equal to 10% or more in the BCCWJ among forms that were associated with the same lemma. For example, among variant forms of the lemma 面白い omoshiroi 'interesting' or 'funny' that occurred 7.9K times, major forms 面白い and おもしろい accounted for 72% and 27%, respectively, and other forms, such as オモシロイ and オモシロい, were very rare. In this case, the standard forms of this word are the two former variants. We annotated tokens corresponding to the two latter non-standard forms with the standard form IDs and the types of variant forms. We defined criteria for types of variant forms as follows.
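The occurrence-rate criterion above can be sketched directly. In this example, the counts are toy values chosen to be consistent with the 面白い percentages given in the text (72% and 27% for the two major forms, with the katakana forms very rare); they are not the actual BCCWJ counts.

```python
# Standard-form criterion: among variant forms sharing a lemma, forms
# whose occurrence rate is roughly 10% or more count as standard forms.

def standard_forms(form_counts: dict, threshold: float = 0.10) -> set:
    total = sum(form_counts.values())
    return {form for form, count in form_counts.items()
            if count / total >= threshold}
```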
Character type variants: among the variants written in different scripts, we regarded variants whose occurrence rates were approximately equal to 5% or less in the BCCWJ as non-standard character type variants. Specifically, variants written in kanji, hiragana, or katakana for native and Sino-Japanese words, variants written in katakana or hiragana for loanwords, and variants written in uppercase or lowercase Latin letters for English abbreviations are candidates for character type variants. We assessed whether these candidates were non-standard forms based on their occurrence rates.
Alternative representations: a form whose internal characters are (partially) replaced by special characters without phonetic differences. Specifically, non-standard forms of alternative representations include native and Sino-Japanese words written in historical kana orthography (e.g., 思ふ for 思う omou 'think'), and loanwords written as an unusual 11 katakana sequence (e.g., オオケストラ for オーケストラ 'orchestra'). Additionally, alternative representations include substitutions with respect to kana: substitution of a long vowel kana by the long sound symbol (e.g., おいしー for おいしい oishī 'tasty'), substitution of upper/lowercase kana by the other case (e.g., ゎたし for わたし watashi 'me'), and phonetic or visual substitution of kana characters by Latin letters and symbols (e.g., かわＥ for かわいい kawaī 'cute' and こωにちは for こんにちは konnichiwa 'hello').
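One of the substitution patterns above, small kana written in place of regular kana, can be reversed with a simple character mapping. The sketch below covers only a handful of illustrative pairs; a full system would need the complete small-kana inventory and a lexicon check to avoid rewriting legitimate small kana.

```python
# Map small (lowercase) kana back to their regular (uppercase) forms,
# reversing substitutions such as ゎたし for わたし. Partial mapping
# for illustration only.

SMALL_TO_REGULAR = {
    "ぁ": "あ", "ぃ": "い", "ぅ": "う", "ぇ": "え", "ぉ": "お",
    "ゃ": "や", "ゅ": "ゆ", "ょ": "よ", "ゎ": "わ",
}

def upcase_small_kana(text: str) -> str:
    return "".join(SMALL_TO_REGULAR.get(ch, ch) for ch in text)
```

Note that this over-generates: small kana in legitimate positions (e.g., in きょう) would also be rewritten, so candidates must be validated against a dictionary.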
Typographical errors: a form with typographical errors derived from character input errors, kana-kanji conversion errors, or the user's incorrect understanding, for example, つたい tsutai for つらい tsurai 'tough' and そｒ for それ sore 'it'.

Evaluation
We present the statistics of the BQNC in Table 4. It comprises 929 sentences, 12.6K word tokens, and 767 non-standard word tokens. As shown in Table 6, the corpus contains tokens of seven vocabulary types and four variant form types, although some categories, such as neologisms/slang, dialect words, and foreign words, have fewer than 40 instances each. In the following subsections, we evaluate existing methods for MA and LN on the BQNC and discuss correctly and incorrectly analyzed results.

Systems
We evaluated two existing methods. First, we used MeCab 0.996 (Kudo et al., 2004), 12 which is a popular Japanese MA toolkit based on conditional random fields. We used UniDicMA (unidic-cwj-2.3.0) 13 as the analysis dictionary, which contains attribute information of 873K words and MeCab's parameters (word occurrence costs and transition costs) learned from annotated corpora, including the BCCWJ (Den, 2009).
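MeCab's default output format is one tab-separated "surface\tfeatures" line per token, terminated by "EOS", with the top-level POS as the first feature field. The following sketch parses that format without requiring MeCab to be installed; the sample feature strings in the test are simplified, not a real UniDic analysis.

```python
# Parse MeCab's default output format into (surface, top-level POS) pairs.
# Each token line is "surface<TAB>feature,feature,...", and the analysis
# of a sentence ends with a line containing only "EOS".

def parse_mecab_output(output: str) -> list:
    tokens = []
    for line in output.splitlines():
        if line == "EOS" or not line:
            continue
        surface, _, features = line.partition("\t")
        pos = features.split(",")[0]  # first feature field is the POS
        tokens.append((surface, pos))
    return tokens
```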
Second, we used our implementation of Sasano et al. (2013)'s joint MA and LN method. They defined derivation rules to add new nodes to the word lattice of an input sentence built by their baseline system, JUMAN. Specifically, they used the following rules: (i) sequential voicing (rendaku), (ii) substitution with long sound symbols and lowercase kana, (iii) insertion of long sound symbols and lowercase kana, (iv) repetitive onomatopoeia (XYXY-form 14), and (v) non-repetitive onomatopoeia (XQYri-form and XXQto-form). For example, if an input sentence contains such character sequences, rule (iii) adds a node of 冷たぁぁい tsumetāi as a variant form of 冷たい tsumetai 'cold', and rule (iv) adds a node of うはうは uhauha 'exhilarated' as an onomatopoeic adverb. The original implementation by Sasano et al. (2013) was an extension of JUMAN and followed JUMAN's POS tags. To adapt their approach to the SUW criterion, we implemented their rules and used them to extend the first method of MeCab with UniDicMA. We set the costs of the new nodes by copying the costs of their standard forms or the most frequent costs of same-form onomatopoeia, whereas Sasano et al. (2013) manually defined the costs of each type of new word. We denote this method by MeCab+ER (Extension Rules). Notably, we did not conduct any additional training to update the models' parameters for either method. Table 5 shows the overall performance, that is, Precision, Recall, and F1 score, of both methods for SEGmentation, POS tagging 15 and NORmalization.

12 https://taku910.github.io/mecab/
13 https://unidic.ninjal.ac.jp/
14 "X" and "Y" represent the same kana character(s) corresponding to one mora, "Q" represents a long consonant character "っ/ッ", "ri" represents a character "り/リ", and "to" represents a character "と/ト".
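The string-level effect of rule (iii), recovering a standard-form candidate by stripping inserted long sound symbols and small kana, can be sketched as follows. A real system adds such candidates as extra lattice nodes and lets the decoder choose among them; this sketch only generates the candidate strings, and it over-generates (legitimate small kana and sokuon would also be stripped), so candidates must be filtered against a lexicon.

```python
import re

# Reverse of insertion rule (iii): delete long sound symbols and small
# kana to propose a standard-form candidate, e.g. 冷たぁぁい -> 冷たい.
# Over-generates; a lexicon check is needed before accepting a candidate.

INSERTED = "ーぁぃぅぇぉっゃゅょゎァィゥェォッャュョヮ"

def candidate_standard_form(token: str) -> str:
    return re.sub(f"[{INSERTED}]+", "", token)
```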
16 Compared with well-formed text domains, 17 the relatively lower performance (F1 of 90-95%) of both methods for segmentation and POS tagging indicates the difficulty of accurate segmentation and tagging in UGT. However, MeCab+ER outperformed MeCab by 2.5-2.9 F1 points because of the derivation rules. Regarding the normalization performance of MeCab+ER, the method achieved moderate precision but low recall, which indicates its limited coverage of the various variant forms in the dataset. Table 6 shows the segmentation and POS tagging recall of both methods for each category. In contrast to the sufficiently high performance for general words, both methods performed worse on words of the categories characteristic of UGT; micro-average recall was at most 79.6% for segmentation and 70.4% for POS tagging ("non-gen/std total" column). MeCab+ER outperformed MeCab particularly for onomatopoeia, character type variants, alternative representations, and sound change variants. The high scores for dialect words were probably because UniDicMA contains a large portion (19 out of 23) of the dialect word tokens. Interjections were a particularly difficult vocabulary type, for which both methods recognized only approximately 50% of the gold POS tags. We surmise that this is because the lexical variation of interjections is diverse; for example, many user-generated expressions imitate various human voices, such as laughing, crying, and screaming. Table 7 shows the recall of MeCab+ER's normalization for each category. The method correctly normalized tokens of alternative representations and sound change variants with 30-40% recall. However, it completely failed to normalize character type variants not covered by the derivation rules and more irregular typographical errors.

15 We only evaluated top-level POS.
16 We regarded a predicted standard form as correct if the prediction was equal to one of the gold standard forms.
17 For example, Kudo et al. (2004)
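Segmentation precision, recall, and F1 of the kind reported above are conventionally computed over character spans: a predicted word counts as correct only if both of its boundaries match the gold segmentation. A minimal sketch of this standard scoring (our own illustration, not the paper's evaluation script):

```python
# Word segmentation scoring over character spans: a predicted word is a
# true positive only when its (start, end) offsets match a gold word.

def spans(words):
    out, start = set(), 0
    for w in words:
        out.add((start, start + len(w)))
        start += len(w)
    return out

def seg_prf(gold_words, pred_words):
    g, p = spans(gold_words), spans(pred_words)
    tp = len(g & p)
    prec = tp / len(p)
    rec = tp / len(g)
    f1 = 2 * prec * rec / (prec + rec) if tp else 0.0
    return prec, rec, f1
```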

Analysis of the Segmentation Results
We performed an error analysis of the segmentation results of the two methods, comparing three cases: tokens that only MeCab correctly segmented (T-F), tokens that only MeCab+ER correctly segmented (F-T), and tokens that both methods incorrectly segmented (F-F).
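The three-way breakdown can be tallied from per-token correctness flags of the two systems, as in this small sketch (the bucket names mirror the T-F / F-T / F-F labels used in the analysis):

```python
# Bucket tokens by which system segmented them correctly:
# "T-F" = only the first system correct, "F-T" = only the second,
# "F-F" = both incorrect, "T-T" = both correct.

def compare_outcomes(pairs):
    """pairs: iterable of (first_correct, second_correct) booleans."""
    buckets = {"T-T": 0, "T-F": 0, "F-T": 0, "F-F": 0}
    for a, b in pairs:
        key = ("T" if a else "F") + "-" + ("T" if b else "F")
        buckets[key] += 1
    return buckets
```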
In Table 9, we show the actual segmentation/normalization examples using the methods for the three cases; the first, second, and third blocks show examples of T-F, F-T, and F-F cases, respectively. First, out of 32 T-F cases, MeCab+ER incorrectly segmented tokens as onomatopoeia in 18 cases. For example, (a) and (b) correspond to new nodes added by the rules for the XQYri-form and XYXY-form onomatopoeia, respectively, even though (a) is a verb phrase and (b) is a repetition of interjections.
Second, out of 200 F-T cases that only MeCab+ER correctly segmented, the method correctly normalized 119 cases, such as (c), (d), and the first word in (g), and incorrectly normalized 42 cases, such as (e) and the second word in (f). The remaining 39 cases were tokens that required no normalization, such as the first word in (f), the second word in (g), and (h). The method correctly normalized simple examples of sound change variants (c: しーかーも for しかも) and alternative representations (d: ぉぃら for おいら) because of the substitution and insertion rules, but failed to normalize character type variants (f: やきゅー for 野球) and complicated sound change variants (e: んまぃ for うまい).
Third, out of 413 F-F cases, 148 tokens were complicated variant forms, including a combination of historical kana orthography and the insertion of the long sound symbol (i), a combination of a character type variant and a sound change variant (j), and a variant written in romaji (k). The remaining 265 tokens were other unknown words, including emoticons (l), neologisms/slang (m), and proper names (n). 18 Among the incorrectly normalized examples, some were not necessarily inappropriate results; normalization between similar interjections and onomatopoeia was intuitively acceptable (e.g., おおー was normalized to おお ō 'oh' and サラサラー was normalized to サラサラ sarasara 'smoothly'). However, we assessed these as errors based on our criterion that interjections have no (non-)standard forms and the BCCWJ guidelines, which regard onomatopoeia with and without long sound insertion as different lemmas.

Discussion
The derivation rules used in MeCab+ER improved segmentation and POS tagging performance and contributed to the correct normalization of some variant forms, but the overall normalization performance was limited to an F1 of 35.3%.
We classified the main segmentation and normalization errors into two types: complicated variant forms and unknown words of specific vocabulary types, such as emoticons and neologisms/slang. The effective use of linguistic resources may be required to build more accurate systems, for example, discovering variant form candidates from large raw text as in Saito et al. (2017), and constructing or using term dictionaries of specific vocabulary types. For Chinese, Li and Yarowsky (2008) published a dataset of formal-informal word pairs collected from Chinese webpages. Wang et al. (2013) released a crowdsourced corpus constructed from microblog posts on Sina Weibo.

Classification of Linguistic Phenomena in UGT
To construct an MA dictionary, Nakamoto et al. (2000) classified unknown words occurring in Japanese chat text into contractions (e.g., すげー for すごい sugoi 'awesome'), exceptional kana variants (e.g., こんぴゅーた for コンピュータ 'computer'), abbreviations, typographical errors, fillers, phonomimes and phenomimes, proper nouns, and other types. Ikeda et al. (2010) classified "peculiar expressions" in Japanese blogs into visual substitution (e.g., わたＵ for わたし watashi 'me'), sound change (e.g., でっかい for でかい dekai 'big'), kana substitution (e.g., びたみん for ビタミン 'vitamin'), and other unknown words, using categories similar to those of Nakamoto et al. (2000). Kaji et al. (2015) performed an error analysis of Japanese MA methods on Twitter text. They classified missegmented words into a dozen categories, including spoken or dialect words, onomatopoeia, interjections, emoticons/AA, proper nouns, foreign words, misspelled words, and other non-standard word variants. Ikeda et al. (2010)'s classification of peculiar expressions is most similar to our types of variant forms, and Kaji et al. (2015)'s classification is most similar to our types of vocabulary (shown in Table 1), whereas we provide more detailed definitions of categories and criteria for standard and non-standard forms. Other work on Japanese MA and LN did not consider such diverse phenomena in UGT (Sasano et al., 2013; Saito et al., 2014).
Methods for MA and LN

In the last two decades, previous work has explored various rules and extraction methods for formal-informal word pairs to enhance Japanese MA and LN models for UGT. Nakamoto et al. (2000) proposed an alignment method based on string similarity between original and variant forms. Ikeda et al. (2010) automatically constructed normalization rules for peculiar expressions in blogs, based on frequency, edit distance, and estimated accuracy improvements. Sasano et al. (2013) defined derivation rules to recognize unknown onomatopoeia and variant forms of known words that frequently occur in webpages. Their rules were also implemented in a recent MA toolkit, Juman++ (Tolmachev et al., 2020), to handle unknown words. Saito et al. (2014) estimated character-level alignment from manually annotated pairs of formal and informal words on Twitter. Saito et al. (2017) extracted formal-informal word pairs from unlabeled Twitter data based on semantic and phonetic similarity.
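String similarity of the kind these alignment-based methods build on is typically measured with edit distance. As an illustration (our own sketch, not any of the cited systems), the standard Levenshtein distance scores つたい against つらい at one substitution:

```python
# Levenshtein edit distance via dynamic programming over two rows.

def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]
```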
For English and Chinese, various classification methods for normalization of informal words (Li and Yarowsky, 2008;Wang et al., 2013;Han and Baldwin, 2011;Jin, 2015;van der Goot, 2019) have been developed based on, for example, string, phonetic, semantic similarity, or co-occurrence frequency. Qian et al. (2015) proposed a transitionbased method with append(x), separate(x), and separate_and_substitute(x,y) operations for the joint word segmentation, POS tagging, and normalization of Chinese microblog text. Dekker and van der Goot (2020) automatically generated pseudo training data from English raw tweets using noise insertion operations to achieve comparable performance without manually annotated data to an existing LN system.

Conclusion
We presented a publicly available Japanese UGT corpus annotated with morphological and normalization information. Our corpus enables the performance comparison of existing and future systems and helps identify the main remaining issues in MA and LN of UGT. Experiments on our corpus demonstrated the limited performance of existing systems on non-general words and non-standard forms, mainly caused by two types of difficult examples: complicated variant forms and unknown words of non-general vocabulary types.
In the future, we plan to (1) expand the corpus by annotating 5-10 times more sentences for a more precise evaluation, and (2) develop a joint MA and LN method with high coverage.