Subcharacter Information in Japanese Embeddings: When Is It Worth It?

Languages with logographic writing systems present a difficulty for traditional character-level models. Leveraging subcharacter information was recently shown to be beneficial for a number of intrinsic and extrinsic tasks in Chinese. We examine whether the same strategies could be applied to Japanese, and contribute a new analogy dataset for this language.


Introduction
No matter how big a corpus is, there will always be rare and out-of-vocabulary (OOV) words, and they pose a problem for widely used word embedding models such as word2vec. A growing body of work on subword and character-level representations addresses this limitation by composing the representations for OOV words out of their parts (Kim et al., 2015; Zhang et al., 2015).
However, logographic writing systems consist of thousands of characters, varying in frequency in different domains. Fortunately, many Chinese characters (called kanji in Japanese) contain semantically meaningful components. For example, 木 (a standalone kanji for the word tree) also occurs as a component in 桜 (sakura) and 杉 (Japanese cypress).
We investigate the effect of explicit inclusion of kanjis and kanji components in the word embedding space on word similarity and word analogy tasks, as well as sentiment polarity classification. We show that the positive results reported for Chinese carry over to Japanese only partially, that the gains are not stable, and that in many cases character n-grams perform better than character-level models. We also contribute a new large dataset for word analogies, the first one for this relatively low-resourced language, and a tokenizer-friendly version of its only similarity dataset.

Related Work
To date, most work on representing subcharacter information relies on language-specific resources that list character components 1. A growing number of papers address various combinations of word-level, character-level and subcharacter-level embeddings in Chinese (Sun et al., 2014; Li et al., 2015; Yu et al., 2017). They have been successful on a range of tasks, including similarity and analogy (Yu et al., 2017; Yin et al., 2016), text classification (Li et al., 2015), sentiment polarity classification (Benajiba et al., 2017), segmentation, and POS-tagging (Shao et al., 2017).
Japanese kanjis were borrowed from Chinese, but it remains unclear whether these success stories could also carry over to Japanese. Chinese is an analytic language, but Japanese is agglutinative, which complicates tokenization. Also, in Japanese, words can be spelled either in kanji or in phonetic alphabets (hiragana and katakana), which further increases data sparsity. Numerous homonyms make this sparse data also noisy.
To the best of our knowledge, subcharacter information in Japanese has been addressed only by Nguyen et al. (2017) and Ke and Hagiwara (2017). The former consider the language modeling task and compare several kinds of kanji decomposition, evaluating on model perplexity. Ke and Hagiwara (2017) propose to use subcharacter information instead of characters, showing that such a model performs on par with word and character-level models on sentiment classification, with considerably smaller vocabulary. This study explores a model comparable to that proposed by Yu et al. (2017) for Chinese. We jointly learn a representation of words, kanjis, and kanjis' components, and we evaluate it on similarity, analogy, and sentiment classification tasks. We also contribute jBATS, the first analogy dataset for Japanese.

Incorporating Subcharacter Information
How a kanji can be analyzed depends on its complexity. Kanjis consisting of only 2-4 strokes may not be decomposable, or may contain only 1-2 simple components (bushu). More complex kanjis can usually be decomposed into analyzable bushu, either at the top level only (shallow decomposition) or further down to their smallest components (deep decomposition) (Figure 1a). Nguyen et al. (2017) compared several decomposition databases in language modeling and concluded that shallow decomposition yields lower perplexity. This is rather to be expected, since many "atomic" bushu are not clearly meaningful. For example, Figure 1a shows the kanji 劣 ("to be inferior") as decomposable into 少 ("little, few") and 力 ("strength"). At the deep decomposition level, only the bushu 小 ("small") can be clearly related to the meaning of the original kanji 劣.
Hence, we use shallow decomposition. The bushu are obtained from IDS 2, a database that performed well for Nguyen et al. (2017). IDS is generated with character topic maps, which enables wider coverage 3 than crowd-sourced alternatives such as GlyphWiki.
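The difference between shallow and deep decomposition can be illustrated with a minimal sketch. It assumes that each IDS entry is a tab-separated line of code point, character and description sequence, and that layout operators are the Ideographic Description Characters U+2FF0 to U+2FFB. The three sample entries and the helper names `parse_ids_line` and `deep_components` are illustrative, not part of the database or of our pipeline.

```python
import re

# Ideographic Description Characters describe layout only, not components.
IDC = {chr(cp) for cp in range(0x2FF0, 0x2FFC)}

def parse_ids_line(line):
    """Parse one assumed 'U+XXXX<TAB>char<TAB>IDS' entry into (char, components)."""
    fields = line.rstrip("\n").split("\t")
    char, ids = fields[1], fields[2]
    ids = re.sub(r"\[[^\]]*\]", "", ids)  # drop optional bracketed tags, if any
    return char, [ch for ch in ids if ch not in IDC]

def deep_components(char, table):
    """Recursively expand a character until no listed component can be split."""
    parts = table.get(char, [char])
    if parts == [char]:  # atomic entry or unknown character
        return [char]
    result = []
    for part in parts:
        result.extend(deep_components(part, table))
    return result

# Illustrative entries (hypothetical excerpt in the assumed format).
SAMPLE = ["U+52A3\t劣\t⿱少力", "U+5C11\t少\t⿱小丿", "U+529B\t力\t力"]
table = dict(parse_ids_line(line) for line in SAMPLE)

shallow = table["劣"]                 # top-level components only
deep = deep_components("劣", table)   # expanded down to atomic components
print(shallow, deep)
```

In this toy case the shallow result keeps 少 and 力 together, while the deep result replaces 少 with its own parts, which is where less meaningful atomic components start to appear.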
In pre-processing, each kanji was prepended with the list of its bushu (Figure 1b). Two corpora were used: the Japanese Wikipedia dump of April 01, 2018 and a collection of 1,859,640 Mainichi newspaper articles (Nichigai Associate, 1994-2009). We chose newspapers because this domain has a relatively higher rate of words spelled in kanji rather than hiragana.
As explained above, tokenization is not a trivial task in Japanese. The classic dictionary-based tokenizers such as MeCab or Juman, or their more recent ports such as Kuromoji, do not handle OOV words very well, and the newer ML-based tokenizers such as TinySegmenter or Micter are also not fully reliable. We tokenized the corpora with MeCab using a weekly updated neologism dictionary 4, which yielded roughly 357 million tokens for Mainichi and 579 million for Wikipedia 5. The tokenization was highly inconsistent: for example, 満腹感 ("feeling full") is split into 満腹 ("full stomach") and 感 ("feeling"), but 恐怖感 ("feeling fear") is a single word, rather than 恐怖 + 感 ("fear" and "feeling"). We additionally pre-processed the corpora to correct the tokenization for all the affixes.
2 http://github.com/cjkvi/cjkvi-ids
3 A limitation of IDS is that it does not unify the representations of several frequent bushu, which could decrease the overall quality of the resulting space (e.g. 心 "heart" is pictured as 心, 忄 and 㣺 depending on its position in the kanji).
4 http://github.com/neologd/mecab-ipadic-neologd
5 The Wikipedia tokenized corpus is available at http://vecto.space/data/corpora/ja
Original SG. Skip-Gram (SG) (Mikolov et al., 2013) is a popular word-level model. Given a target word in the corpus, the SG model uses the vector of this target word to predict its context words.
FastText. FastText (Bojanowski et al., 2017) is a state-of-the-art subword-level model that learns morphology from character n-grams. In this model, each word is represented as the sum of the vectors of all its character n-grams.

Characters and subcharacters
Characters (kanji). To take individual kanji into account we modified SG by summing the target word vector $w$ with the vectors of its constituent characters $c_1$ and $c_2$. This can be regarded as a special case of FastText, where the minimal and maximum n-gram sizes are both set to 1. Our model is similar to the one suggested by Yu et al. (2017), who learn Chinese word embeddings based on characters and sub-characters. We refer to this model as SG+kanji.
Subcharacters (bushu). Similarly to characters, we sum the vector of the target word, its constituent characters, and their constituent bushu to incorporate the bushu information. For example, Figure 3 shows that the vector of the word 仲間, the vectors of characters 仲 and 間, and the vectors of bushu 亻, 中, 門, 日 are summed to predict the contextual words. We refer to this model as SG+kanji+bushu.
Expanding vocabulary. FastText, SG+kanji and SG+kanji+bushu models can be used to compute the representation for any word as a sum of the vectors of its constituents. We collect the vocabulary of all the datasets used in this paper, calculate the vectors for any words missing in the embedding vocabulary, and add them. Such models will be referred to as MODEL+OOV.
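The composition and the vocabulary expansion step can be sketched as follows. This is a minimal illustration under stated assumptions rather than the released implementation: the three lookup tables, the decomposition mapping, the toy two-dimensional vectors and the function names `compose` and `expand_vocabulary` are all hypothetical.

```python
def compose(word, word_vecs, kanji_vecs, bushu_vecs, decomposition):
    """Sum the available vectors of a word, its characters and their bushu."""
    parts = []
    if word in word_vecs:
        parts.append(word_vecs[word])
    for ch in word:
        if ch in kanji_vecs:
            parts.append(kanji_vecs[ch])
        for bushu in decomposition.get(ch, []):
            if bushu in bushu_vecs:
                parts.append(bushu_vecs[bushu])
    if not parts:
        return None  # nothing to compose from (e.g. an unseen kana-only word)
    return [sum(dims) for dims in zip(*parts)]

def expand_vocabulary(dataset_words, word_vecs, kanji_vecs, bushu_vecs, decomposition):
    """Return a copy of word_vecs with composed vectors for missing words."""
    expanded = dict(word_vecs)
    for word in dataset_words:
        if word not in expanded:
            vec = compose(word, word_vecs, kanji_vecs, bushu_vecs, decomposition)
            if vec is not None:
                expanded[word] = vec
    return expanded

# Toy data (assumed values, 2 dimensions for readability).
decomposition = {"仲": ["亻", "中"], "間": ["門", "日"]}
kanji_vecs = {"仲": [1.0, 0.0], "間": [0.0, 1.0]}
bushu_vecs = {"亻": [0.5, 0.0], "中": [0.0, 0.5], "門": [1.0, 1.0], "日": [0.0, 0.0]}
word_vecs = {}  # 仲間 is out of vocabulary in this toy setting

expanded = expand_vocabulary(["仲間", "いつも"], word_vecs, kanji_vecs, bushu_vecs, decomposition)
print(expanded["仲間"])  # [2.5, 2.5]
```

A word with no known constituents, such as the hiragana-only いつも in this toy setting, stays out of the vocabulary, which is one reason why the benefit of composition depends on how much of the data is written in kanji.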

Implementation
All models were implemented in the Chainer framework (Tokui et al., 2015) with the following parameters: vector size 300, batch size 1000, negative sampling size 5, window size 2. For performance reasons all models were trained for 1 epoch. Words, kanjis and bushu appearing fewer than 50 times in the corpus were ignored. The optimizer was Adam (Kingma and Ba, 2014). The n-gram size of FastText 6 is set to 1.
For the SG+kanji+bushu model there were 2510 bushu in total, 1.47% of which were ignored in the model since they were not in the standard UTF-8 word ("\w") encoding. This affected 1.37% of tokens in Wikipedia.

Evaluation: jBATS
We present jBATS 9, a new analogy dataset for Japanese that is comparable to BATS, currently the largest analogy dataset for English. Like BATS, jBATS covers 40 linguistic relations which are listed in Table 1. There are 4 types of relations: inflectional and derivational morphology, and encyclopedic and lexicographic semantics. Each type has 10 categories, with 50 word pairs per category (except for E03 which has 47 pairs, since there are only 47 prefectures). This enables generation of 97,712 analogy questions.
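The number of questions can be checked with a short calculation. The sketch assumes that, within a category, every ordered combination of two distinct word pairs yields one question, so a category with n pairs contributes n(n - 1) questions; the function name is illustrative.

```python
def count_questions(pairs_per_category):
    """Each ordered combination of two distinct pairs in a category is one question."""
    return sum(n * (n - 1) for n in pairs_per_category)

# 39 categories with 50 pairs each, plus E03 with 47 pairs.
category_sizes = [50] * 39 + [47]
total_questions = count_questions(category_sizes)
print(total_questions)  # 97712
```

Under this assumption, 39 x 2,450 + 2,162 = 97,712, which matches the figure given above.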
The inflectional morphology set is based on the traditional Japanese grammar (Teramura, 1982), which lists 7 different forms of godan, shimoichidan and kamiichidan verbs, as well as 5 forms of i-adjectives. Including the past tense form, there are 8 and 6 forms for verbs and adjectives respectively. All categories were adjusted to the MeCab tokenization. After excluding redundant or rare forms there were 5 distinctive forms for verbs and 3 for adjectives, which were paired to form 7 verb and 3 adjective categories. The derivational morphology set includes 9 highly productive affixes which are usually represented by a single kanji character, and a set of pairs of transitive and intransitive verbs which are formed with several infix patterns.
8 http://vecto.space/data/embeddings/ja
9 http://vecto.space/projects/jBATS
The encyclopedic and lexicographic semantics sections were designed similarly to BATS, but adjusted for Japanese. For example, UK counties were replaced with Japanese prefectures. The E09 animal-young category of BATS would be rendered with a prefix in Japanese, and was replaced with plain: honorific word pairs, a concept highly relevant for the Japanese culture.
All tokens were chosen based on their frequencies in BCCWJ 10 (Maekawa, 2008), the Balanced Corpus of Contemporary Written Japanese, and the Mainichi newspaper corpus described in Section 3. We aimed to choose relatively frequent and not genre-specific words. For broader categories (adjectives and verbs) we balanced between BCCWJ and Mainichi corpora, choosing items of mean frequencies between 3,000 and 100,000 whenever possible.

Word similarity
The recent Japanese word similarity dataset (Sakaizawa and Komachi, 2017) contains 4,851 word pairs that were annotated by crowd workers with agreement 0.56-0.69. Like MEN (Bruni et al., 2014) and SimLex (Hill et al., 2015), this dataset is split by parts of speech: verbs, nouns, adjectives and adverbs. We refer to this dataset as jSIM.
The division by parts of speech is relevant for this study: many Japanese adverbs are written mostly in hiragana and would not benefit from bushu information. However, some pairs in jSIM were misclassified. Furthermore, since this dataset was based on paraphrases, many pairs contained phrases rather than words, and/or words in forms that would not be preserved in a corpus tokenized in the MeCab style (which is the most frequently used in Japanese NLP). Therefore, for embeddings with standard pre-processing jSIM would have a very high OOV rate. The authors of jSIM do not actually present any experiments with word embeddings.
We have prepared 3 versions of jSIM that are summarized in Table 2. The full version contains most word pairs of the original dataset (except those whose categories were ambiguous or mixed), with corrected POS attribution in 2-5% of pairs in each category 11: for example, the pair 苛立たしい - 忌ま忌ましい was moved from verbs to adjectives. The tokenized version contains only the items that could be identified by a MeCab-style tokenizer and had no more than one content-word stem: e.g. this would exclude phrases like 早く来る. However, many of the remaining items could become ambiguous when tokenized: 終わった would become 終わっ た, and 終わっ could map to 終わった, 終わって, 終わっちゃう, etc., and would therefore be more difficult to detect in the similarity task. Thus we also prepared the unambiguous subset, which contains only the words that could still be identified unambiguously even when tokenized (for example, 迷う remains 迷う). All these versions of jSIM are available for download 12.
Table 3 shows the results on all 3 datasets for all models, trained on the full Mainichi corpus, a half of the Mainichi corpus, and Wikipedia. The strongest effect for inclusion of bushu is observed in the OOV condition: in all datasets the Spearman's correlations are higher for SG+kanji+bushu than for other SG models, which suggests that this information is indeed meaningful and helpful. This even holds for the full version, where up to 90% of the vocabulary is missing and has to be composed. In the in-vocabulary condition this effect is noticeably absent in Wikipedia (perhaps due to the higher ratio of names, where the kanji meanings are often irrelevant).
However, in most cases the improvement due to inclusion of bushu, even when it is observed, is not sufficient to catch up with the FastText algorithm, and in most cases FastText has a substantial advantage.
This is significant, as it might warrant the review of the previous results for Chinese on this task: of all the studies on subcharacter information in Chinese that we reviewed, only one explicitly compared their model to FastText (Benajiba et al., 2017), and their task was different (sentiment analysis).
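For reference, the similarity evaluation can be sketched in a few lines. This is a simplified illustration, not the evaluation code behind Table 3: it assumes that pairs with a missing word are skipped (the in-vocabulary condition), implements Spearman's correlation as the Pearson correlation of average ranks, and uses hypothetical toy vectors and function names.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

def evaluate(pairs, vectors):
    """Return (Spearman's rho, number of pairs used), skipping OOV pairs."""
    human, model = [], []
    for w1, w2, score in pairs:
        if w1 in vectors and w2 in vectors:
            human.append(score)
            model.append(cosine(vectors[w1], vectors[w2]))
    return spearman(human, model), len(human)

# Toy example: the last pair contains a word without a vector and is skipped.
vectors = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [-1.0, 1.0]}
pairs = [("a", "b", 5.0), ("a", "c", 1.0), ("b", "c", 2.0), ("a", "zzz", 3.0)]
rho, used = evaluate(pairs, vectors)
```

In the OOV condition the same function would be called with a vocabulary that already contains the composed vectors, so that no pair is skipped.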
In terms of parts of speech, the only clear effect is for the adjectives, which we attribute to the fact that many Japanese adjectives contain a single kanji character, directly related to the meaning of the word (e.g. 惜しい). The adjectives category contains 55.45% such words, compared to 14.78% for nouns and 23.71% for adverbs in the full jSIM (the ratio is similar for the Tokenized and Unambiguous sets). On the other hand, all jSIM versions have over 70% of nouns with more than one kanji; some of them may not be directly related to the meaning of the word, and increase the noise. Accordingly, we observe the weakest effect for inclusion of bushu. However, the ratio of 1-kanji words for verbs is roughly the same as for the adjectives, but the pattern is less clear.

FastText+OOV .448 .184 .245 .242 .438 .222 .286 .410 .453 .202 .275 .405
SG+kanji+OOV .323 .195 .175 .210 .293 .262 .210 .353 .341 .250 .197 .363
SG+kanji+bushu+OOV .348 .171 .178 .201 .318 .231 .223 .330 .373 .249 .210
225 .909 .393 .269 .112 .192 .384 .301 .112 .203

FastText+OOV .451 .186 .242 .243 .442 .225 .281 .400 .455 .219 .270 .402
SG+kanji+OOV .296 .179 .146 .185 .240 .240 .191 .325 .270 .239 .184 .278
SG+kanji+bushu+OOV .313 .183 .159 .171 .249 .238 .208 .315 .292 .254 .197 .243
Table 3: Spearman's correlation with human similarity judgements. Boldface indicates the highest result on a given corpus (separately for in-vocabulary and OOV conditions). Shaded numbers indicate the highest result among the three Skip-Gram models.
Adverbs are the only category in which SG clearly outperforms FastText. This could be due to a high proportion of hiragana (about 50% in all datasets), which as single-character n-grams could not yield very meaningful representations. Also, the particles と and に, important for adverbs, are lost in tokenization.

jBATS
In this paper, we consider two methods for the word analogy task. 3CosAdd (Mikolov et al., 2013) is the original method based on the linear offset between 2 vector pairs. Given an analogy $a : a' :: b : b'$ ($a$ is to $a'$ as $b$ is to $b'$), the answer is calculated as
$$b' = \arg\max_{d \in V} \cos(d,\; b - a + a'), \quad \text{where } \cos(u, v) = \frac{u \cdot v}{\|u\| \cdot \|v\|}.$$
LRCos is a more recent and currently the best-performing method. It is based on a set of word pairs that have the same relation. For example, given a set of pairs such as husband:wife, uncle:aunt, all right-hand words are considered to be exemplars of a class ("women"), and a logistic regression classifier is trained for that class. The answer (e.g. queen) is determined as the word vector that is the most similar to the source word (e.g. king), but is likely to be a woman:
$$b' = \arg\max_{b' \in V} \big( P(b' \in \text{class}) \cdot \cos(b', b) \big).$$
Figure 3 shows that the overall pattern of accuracy for jBATS is comparable to what has been reported for English: derivational and inflectional morphology are much easier than either kind of semantics. In line with earlier results, LRCos significantly outperforms 3CosAdd, achieving much better accuracy on some encyclopedic categories with which 3CosAdd does not cope at all. Lexicographic semantics is a problem, as in English, because synonyms or antonyms of different words do not constitute a coherent semantic class by themselves.
Figure 3: Accuracy on jBATS by category (see Table 1 for the codes on x-axis).
Table 4 shows the average results per relation type for the better-performing LRCos (the pattern of results was similar for 3CosAdd). The morphology categories behave similarly to adjectives in the similarity task: SG+kanji beats the original SG by a large margin on inflectional and derivational morphology categories, and bushu improve accuracy even further. In this task, these models also win over FastText. However, these are the categories in which the words either contain a single kanji, or (in derivational morphology) a single kanji affix needs to be identified. Semantic categories contain a variety of nouns, mostly consisting of several kanjis with various morphological patterns. Moreover, many proper nouns as well as animal species are written in katakana, with no kanjis at all. This could be the reason why information from kanjis and bushu is not helpful or even detrimental in the semantic questions.
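The two analogy methods discussed in this section can be sketched as follows. The sketch makes several assumptions that are not part of the experimental setup: the three source words are excluded from the 3CosAdd candidates (a common convention), LRCos is taken to score each candidate by the product of its class probability and its cosine similarity to the source word, and the classifier is replaced by a hand-written stand-in `class_prob`. The toy vectors and function names are illustrative.

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def three_cos_add(a, a_prime, b, vectors):
    """Return the word closest to b - a + a', excluding the three source words."""
    target = [vb - va + vap for va, vap, vb in zip(vectors[a], vectors[a_prime], vectors[b])]
    candidates = [w for w in vectors if w not in {a, a_prime, b}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

def lr_cos(b, vectors, class_prob):
    """Return the word maximizing P(word in target class) * cos(word, b)."""
    candidates = [w for w in vectors if w != b]
    return max(candidates, key=lambda w: class_prob(w) * cosine(vectors[w], vectors[b]))

# Toy 3-dimensional embeddings (assumed values).
vectors = {
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.0, -1.0],
}

def class_prob(word):
    # Stand-in for a logistic regression classifier trained on the
    # right-hand words of the example pairs (hypothetical probabilities).
    return 0.9 if word in {"woman", "queen"} else 0.1

answer_add = three_cos_add("man", "woman", "king", vectors)
answer_lr = lr_cos("king", vectors, class_prob)
print(answer_add, answer_lr)  # queen queen
```

The contrast between the two functions shows why LRCos can succeed where the offset fails: it does not depend on a single pair offset, only on class membership and similarity to the source word.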
There is a clear corpus effect in that the encyclopedic semantic questions are (predictably) more successful with Wikipedia than with Mainichi, but at the expense of morphology. This could be interpreted as confirmation of the dependence of the current analogy methods on similarity (Rogers et al., 2017): not all words can be close to all other words, so a higher accuracy on some relation type has to come with a decrease in some other.

Sentiment analysis
The binary sentiment classification accuracy was tested with the Rakuten reviews dataset by Zhang and LeCun (2017). Although Benajiba et al. (2017) report that incorporating subcharacter information provided a boost in accuracy on this task in Chinese, we did not confirm this to be the case for Japanese: no model had a clear advantage. The lack of positive effect for inclusion of kanji and bushu is to be expected, as we found that most of the dataset is written informally, in hiragana, even for words that are normally written with kanjis. Once again, this shows that the results of incorporating (sub)character information in Japanese are not the same as in Chinese, and depend on the task and domain of the texts.
Interestingly, the accuracy is just as high for all OOV models, even though about 20% of the vocabulary had to be constructed.
13 The Chainer framework (Tokui et al., 2015) is used to implement the CNN classifier with default settings.

Error analysis
We conducted a manual analysis of 200 mispredictions of the 3CosAdd method in the I03, D02, E02 and L10 categories (50 examples in each). The percentage of different types of errors is shown in Table 5. Overall, most mistakes are interpretable, and only 10.5% of mispredicted vectors are not clearly related to the source words.
The most frequent type of error was predicting the wrong form but with the correct stem, especially in morphological categories. This is consistent with what has been reported for English and was especially frequent in the I03 and D02 categories (76% and 36% of errors per category respectively). It is not surprising since these categories consist of verbs (I03) and adjectives (D02). Furthermore, in 25% of cases the assigned item was from the same semantic category (for example, colours) and in 13% of cases an antonym was predicted. Other, relatively less frequent mistakes include semantic relations like predicting synonyms of the given word, words (or single kanji) related to either the target or the source pair, or simply returning the same token. Words which were not related in any way to any source word were very rare.
Table 7 shows that words, kanjis and bushu indeed share one semantic space. For example, the bushu 疒 (yamaidare "the roof from illness") is often used in kanjis which are related to a disease. Therefore kanji like 症 ("disease") would, of course, be similar to 疒 in the vector space. Interestingly, we also find that its close neighbors include kanjis that do not have this bushu, but are related to disease, such as 腫 and 患. Furthermore, even words written only in katakana, like インフルエザ, are correctly positioned in the same space.

疒 yamaidare (the roof from illness)
  患 (sickness), 症 (disease), 妊 (pregnancy), 臓 (internal organs, bowels), 腫 (tumor)
  インフルエザ (influenza), 関節リウマチ (articular rheumatism), リューマチ (rheumatism), リウマチ (rheumatism), メタボリックシンドローム (metabolic syndrome)
豸 mujina-hen (divine beast, insect without legs)
  爭 (to fight, to compete), 蝶 (butterfly), 皃 (shape), 貌 (shape, silhouette), 豹 (leopard)
  獅子 (lion, king of beasts), 同流 (same origin, same school), 本性 (true nature, human nature), 弥勒 (Maitreya Buddha), 無頼 (villain, scoundrel)
Table 7: Example bushu: closest single kanji (upper row) and multiple kanji/katakana (lower row) for SG+kanji+bushu model.
Similar observations can be made for the bushu 豸 (mujina-hen), which represents a divine beast, insects without legs, animals with a long spine, or the legendary Chinese beast Xiezhi.

Stability of the similarity results
Our similarity experiments showed that in many cases the gain of any one model over the other is not very significant and would not be reproduced in a different run and/or a different corpus. This could be due to skewed frequency distribution or the general instability of embeddings for rare words, recently demonstrated for word2vec (Wendlandt et al., 2018).
One puzzling observation is that sometimes the smaller corpus yielded better embeddings. Intuitively, the larger the corpus, the more informative the distributional representations that can be obtained. However, Table 3 shows that for adverbs and verbs in the full and tokenized versions of jSIM, a half of Mainichi was actually significantly better than the full Mainichi. It is not clear whether this is due to a lucky random initialization or to some other factors.