Word Familiarity Rate Estimation Using a Bayesian Linear Mixed Model

This paper presents research on word familiarity rate estimation using the ‘Word List by Semantic Principles’. We collected rating information on 96,557 words in the ‘Word List by Semantic Principles’ via Yahoo! crowdsourcing. We asked 3,392 subject participants to use their introspection to rate the familiarity of words based on the five perspectives of ‘KNOW’, ‘WRITE’, ‘READ’, ‘SPEAK’, and ‘LISTEN’, and each word was rated by at least 16 subject participants. We used Bayesian linear mixed models to estimate the word familiarity rates. We also explored the ratings with the semantic labels used in the ‘Word List by Semantic Principles’.


Introduction
Compiling a lexicon is difficult work. In lexicography, two main types of methodology are used to compile lexicons. One is the corpus-based methodology, which supports the objectivity of the language resources and results. This methodology requires large-scale, balanced corpora, which exist for several languages; for instance, Japanese has several corpora, such as the 'Balanced Corpus of Contemporary Written Japanese', the 'Corpus of Spontaneous Japanese' (Maekawa et al., 2000) and the 'NINJAL Web Japanese Corpus' (Asahara et al., 2014). In contrast to corpus-based lexicography, the intuition-based methodology is rooted in the subjective perspective of the lexicographer. Nowadays, however, we can perform large-scale experiments that gather enough crowdsourced subjective judgements to constitute objective linguistic data on individual words.
Generally, a lexicon covers several layers of linguistic features, such as pronunciation, morphological information, part-of-speech or word class, relevant syntactic phenomena, and semantic categories. In addition, the entries in a lexicon can carry further features that reflect how words are used in daily life. One such language resource in Japanese is the 'Word Familiarity Rate' compiled by NTT (Amano and Kondo, 1999), which measures how familiar people are with a specific word. However, this 'Word Familiarity Rate' experiment was completed more than twenty years ago, and it is therefore possible that the usage and register of words have changed in the intervening years.
In this study, we construct a word familiarity rate database using entries extracted from the 'Word List by Semantic Principles' (Bunrui goihyo, hereafter WLSP) (Kokuritsu Kokugo Kenkyusho, 2004). We used crowdsourcing to perform a large-scale subjective experiment on 96,557 WLSP entries. We asked the participants to rate the familiarity of words along five perspectives: KNOW, WRITE, READ, SPEAK, and LISTEN. The quality of results gathered by crowdsourcing may be lower than that of results collected in a controlled experiment; however, a crowdsourced study costs less than a controlled experiment. We used a Bayesian linear mixed model (Sorensen et al., 2016) to alleviate noise in the data.
Our work makes the following contributions to the literature:
• We compiled a word familiarity rate database for thesaurus entries.
• We used crowdsourcing via human subject participants to explore word ratings.
• We introduced a Bayesian linear mixed model to this type of rate modelling.
• The word list was taken from the surface forms of WLSP. This enabled us to connect word familiarity rates with the semantic categories in a thesaurus. Kondo et al. (2018) produced a correspondence table between WLSP and UniDic (a lexicon with morphological information). The morphological analyser MeCab enabled us to automatically annotate the familiarity rates using these resources.
• The preceding work introduced the contrast between character-based (WRITE, READ) and voice-based (SPEAK, LISTEN) perspectives. We contributed to the literature by also introducing a new contrast between production (WRITE, SPEAK) and reception (READ, LISTEN) perspectives.
The remainder of this paper is organised as follows. Section 2 presents related work on the 'Word List by Semantic Principles' and the 'Word Familiarity Rate' in Japanese. Section 3 describes the methodology that we used to develop the word familiarity ratings, namely, crowdsourcing and a Bayesian linear mixed model. Section 4 evaluates the results, and Section 5 presents a conclusion and discusses future research.

'Word List by Semantic Principles'
The 'Word List by Semantic Principles' (Bunrui goihyo, WLSP) is one of the major thesauri for contemporary Japanese. The first version of the WLSP was released in 1964 by Kokuritsu Kokugo Kenkyusho (Kokuritsu Kokugo Kenkyusho, 1964), and a newer, expanded version was published in 2004 (Kokuritsu Kokugo Kenkyusho, 2004). The comma-separated value (CSV) file of the expanded version can be used for research purposes; commercial use costs 200,000 yen (+ tax).
The data include more than 90,000 words with four syntactic categories (nominal word, verbal word, modifier word, and other) and several hierarchical semantic levels. The categories are indicated with one integer digit to the left of a radix point and four fractional digits to the right of the radix point. Table 1 shows an example, the word glossed as 'Last Year', which is assigned a value of 1.1642. Here, the first '1' represents the syntactic part, namely 'Nominal Word', while '1642' represents the hierarchical semantic part, as follows: the first digit, '.1', refers to the top-level semantic category 'Relation'; the two digits '.16' refer to the second-level semantic category 'Time'; and the four digits '.1642' refer to the finest-grained semantic category 'Past Time'. These five digits together are referred to as the 'WLSP number'. The syntactic categories are 1. Nominal Word, 2. Verbal Word, 3. Modifier Word, and 4. Other (e.g. Conjunction, Interjection, Greeting).
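The structure of a WLSP number can be illustrated with a short Python sketch. The class and function names (`WlspNumber`, `parse_wlsp_number`) are hypothetical and not part of the WLSP distribution; the sketch also assumes that each level is written with the syntactic digit as a prefix (e.g. '1.16'), which is the notation used for categories later in this paper.

```python
from dataclasses import dataclass

# Syntactic categories as listed in the text.
SYNTACTIC_CATEGORIES = {
    "1": "Nominal Word",
    "2": "Verbal Word",
    "3": "Modifier Word",
    "4": "Other",
}


@dataclass(frozen=True)
class WlspNumber:
    """Hypothetical container for the levels encoded in a WLSP number."""
    syntactic: str  # e.g. "1"      (Nominal Word)
    top: str        # e.g. "1.1"    (top-level semantic category)
    second: str     # e.g. "1.16"   (second-level semantic category)
    finest: str     # e.g. "1.1642" (finest-grained semantic category)


def parse_wlsp_number(number: str) -> WlspNumber:
    """Split a string such as '1.1642' into its hierarchical levels."""
    integer_part, _, fraction = number.partition(".")
    if integer_part not in SYNTACTIC_CATEGORIES:
        raise ValueError(f"unknown syntactic category in {number!r}")
    if len(fraction) != 4 or not fraction.isdigit():
        raise ValueError(f"expected four fractional digits in {number!r}")
    return WlspNumber(
        syntactic=integer_part,
        top=f"{integer_part}.{fraction[:1]}",
        second=f"{integer_part}.{fraction[:2]}",
        finest=f"{integer_part}.{fraction}",
    )


parsed = parse_wlsp_number("1.1642")
print(SYNTACTIC_CATEGORIES[parsed.syntactic], parsed.top, parsed.second, parsed.finest)
```

For the example in Table 1, the sketch yields the levels '1', '1.1', '1.16', and '1.1642', corresponding to 'Nominal Word', 'Relation', 'Time', and 'Past Time'.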
We used all the words as the target words to be annotated for familiarity rates.

Word Familiarity Rate in Japanese
Preceding work used two methods to estimate word familiarity ratings: a word frequency-based (objective) method and a cognitive experiment-based (subjective) method. The Nihongo no goi tokusei database (Amano and Kondo, 1999) includes both objective and subjective data for word familiarity ratings. The objective data were constructed from 14 years of Asahi Shinbun newspaper articles, from 1985 to 1998. The authors used a morphological analyser, Sumomo, to analyse the articles and split the sentences into words.
The subjective data are cognitive experiment-based. The 40 participants rated word familiarity for three types of stimuli: character-based, voice-based, and both. The participants were chosen based on 'Hyakurakan', a Japanese proficiency test, to control their linguistic competence. The rating score is an integer from 1 (lowest) to 7 (highest), and the number of target entries is 88,569 character and voice-based stimuli, from 69,084 words. The data were gathered from September 1995 to July 1996 at the NTT institute. Even though the rating environment was controlled, the estimation of word familiarity was based on the average of the participants' ratings. More sophisticated statistical analysis should be used to reduce the participant biases.

Design
In this section, we present our methodology for constructing a word familiarity rate lexicon at low cost. The word list consists of 96,557 words taken from the WLSP. We did not prepare any voice data (oral pronunciations) for the lexical entries, but we did cover speech and hearing as two of the following five perspectives: KNOW, WRITE, READ, SPEAK, and LISTEN. In this design, we split the judgements between character-based (WRITE and READ) and voice-based (SPEAK and LISTEN) judgements and between production (WRITE and SPEAK) and reception (READ and LISTEN) judgements. The participants rated each perspective on a five-point scale, ranging from 5 (well known/often used) to 1 (little known/rarely used).
The rating data were collected not in person but on a crowdsourcing platform. We used 'Yahoo! crowdsourcing'; 3,392 participants judged the word familiarity rates. The participants checked a stimulus word and gave rating scores for KNOW, WRITE, READ, SPEAK, and LISTEN; at least 16 answers were collected for each word. The data were gathered in November 2018. The data collection, which cost 1,455,494 yen, was completed within two weeks.

Model
The collected rating data are biased by the particular participants who provided them, so statistical methods are needed to reduce these biases. We used a Bayesian linear mixed model to estimate the ratings. The graphical model used to estimate the ratings is shown in Figure 3. Here, $N_{word}$ is the number of words and $N_{subj}$ is the number of participants; $i \in \{1, \ldots, N_{word}\}$ is the index of words and $j \in \{1, \ldots, N_{subj}\}$ is the index of participants. $y^{(i)(j)}$ is the rating of KNOW, WRITE, READ, SPEAK, or LISTEN, and it is generated by a Normal distribution with $\mu^{(i)(j)}$ and $\sigma$, as follows:

$$y^{(i)(j)} \sim \mathrm{Normal}(\mu^{(i)(j)}, \sigma)$$

Here, $\sigma$ is a hyper-parameter of the standard deviation, and $\mu^{(i)(j)}$ is a linear formula of the slopes $\gamma_{word}^{(i)}$, $\gamma_{subj}^{(j)}$ and an intercept $\alpha$:

$$\mu^{(i)(j)} = \alpha + \gamma_{word}^{(i)} + \gamma_{subj}^{(j)}$$

The slopes are modelled by Normal distributions with the hyper-parameters $\mu_{word}$, $\sigma_{word}$, $\mu_{subj}$, $\sigma_{subj}$ (means and standard deviations):

$$\gamma_{word}^{(i)} \sim \mathrm{Normal}(\mu_{word}, \sigma_{word}), \qquad \gamma_{subj}^{(j)} \sim \mathrm{Normal}(\mu_{subj}, \sigma_{subj})$$

The word familiarity rates are given by $\gamma_{word}^{(i)}$, whereas the biases of the participants are modelled by $\gamma_{subj}^{(j)}$. We set the means $\mu_{word}$ and $\mu_{subj}$ to 0.0 to make the average 0.0; we also set the standard deviations $\sigma_{word}$ and $\sigma_{subj}$ to 1.0. We used R and Stan to fit the model, running 4 chains of 5,000 iterations each with an initial warm-up of 100 iterations.
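To make the structure of the model concrete, the following Python sketch simulates ratings from a model of this form and recovers the word effects. It is only an illustration under stated assumptions, not the R/Stan implementation used in this study: it replaces MCMC sampling with a maximum a posteriori (MAP) estimate obtained by coordinate-wise updates, treats σ as fixed and known, draws continuous rather than five-point ratings, and uses hypothetical names (`simulate`, `fit_map`) and toy sizes.

```python
import numpy as np


def simulate(n_word, n_subj, ratings_per_word, alpha, sigma, rng):
    """Draw ratings y = alpha + gamma_word[i] + gamma_subj[j] + Normal(0, sigma) noise."""
    gamma_word = rng.normal(0.0, 1.0, n_word)   # word familiarity effects
    gamma_subj = rng.normal(0.0, 1.0, n_subj)   # participant biases
    word_idx = np.repeat(np.arange(n_word), ratings_per_word)
    subj_idx = rng.integers(0, n_subj, size=word_idx.size)
    noise = rng.normal(0.0, sigma, word_idx.size)
    y = alpha + gamma_word[word_idx] + gamma_subj[subj_idx] + noise
    return word_idx, subj_idx, y, gamma_word, gamma_subj


def fit_map(word_idx, subj_idx, y, n_word, n_subj,
            sigma=1.0, sigma_word=1.0, sigma_subj=1.0, n_iter=50):
    """MAP estimate with zero-mean Normal priors on both effects (assumption: sigma is fixed)."""
    alpha = y.mean()
    gamma_word = np.zeros(n_word)
    gamma_subj = np.zeros(n_subj)
    n_per_word = np.bincount(word_idx, minlength=n_word)
    n_per_subj = np.bincount(subj_idx, minlength=n_subj)
    for _ in range(n_iter):
        # Update word effects: shrunken mean of residuals per word.
        resid = y - alpha - gamma_subj[subj_idx]
        gamma_word = (np.bincount(word_idx, weights=resid, minlength=n_word)
                      / (n_per_word + sigma**2 / sigma_word**2))
        # Update participant biases in the same way.
        resid = y - alpha - gamma_word[word_idx]
        gamma_subj = (np.bincount(subj_idx, weights=resid, minlength=n_subj)
                      / (n_per_subj + sigma**2 / sigma_subj**2))
        # Update the intercept (flat prior): mean of the remaining residuals.
        alpha = np.mean(y - gamma_word[word_idx] - gamma_subj[subj_idx])
    return alpha, gamma_word, gamma_subj


rng = np.random.default_rng(0)
word_idx, subj_idx, y, true_word, true_subj = simulate(
    n_word=200, n_subj=50, ratings_per_word=16, alpha=3.0, sigma=0.5, rng=rng)
alpha_hat, word_hat, subj_hat = fit_map(word_idx, subj_idx, y, 200, 50, sigma=0.5)
print("correlation with true word effects:", np.corrcoef(true_word, word_hat)[0, 1])
```

Because each word estimate is computed from residuals after subtracting the estimated participant bias, a word rated mostly by lenient or strict participants is not pushed up or down as it would be by a plain average of its ratings.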

Data Analysis
This section describes the qualitative evaluation of the estimated word familiarity rate data. To evaluate the data, we first reviewed the distribution of the five perspectives and the biases of the participants. Second, we examined the top and bottom 10 words by estimated value. Third, we reviewed the top and bottom 10 categories at the WLSP's second semantic level by estimated value. Figure 1 displays the histogram of the estimated word familiarity rates. The x-axis specifies the estimated word familiarity rates $\gamma_{word}^{(i)}$, and the y-axis specifies the frequencies. The five perspectives are distinguished in the histogram with different colours. As illustrated in Figure 1, KNOW has a higher familiarity rating than the other perspectives, since it is the most fundamental perspective. The character-based perspectives (WRITE and READ) had lower familiarity ratings than the voice-based perspectives (SPEAK and LISTEN). Furthermore, the production perspectives (WRITE and SPEAK) had lower familiarity ratings than the reception perspectives (READ and LISTEN). Figure 2 displays the histogram of the estimated participant biases. The x-axis specifies the estimated participant biases $\gamma_{subj}^{(j)}$, and the y-axis specifies the frequencies. The participant biases are modelled with standard normal distributions. We attempted to use other distributions for the biases; however, only the model with the standard normal distribution converged. In future work, we will increase the amount of rating data and again attempt to use other distributions.

Evaluation by Words
In this section, we describe the top (KNOWN) and bottom (UNKNOWN) 10 words for several perspectives.

Known vs. Unknown
First, we reviewed KNOW, which is the most fundamental perspective.
Tables 2 and 3 display the top 10 known and unknown words for the perspective KNOW, respectively. The known words are ones that tend to be used in daily social life, while the unknown words are never or rarely used in Japan. Though we also analysed the other perspectives {WRITE, READ, SPEAK, LISTEN}, we omitted the tables for these four perspectives due to limited space.

Character-based vs. Voice-based
Next, we surveyed the difference between the character-based (WRITE/READ) and voice-based (SPEAK/LISTEN) results by evaluating the values for (WRITE + READ - SPEAK - LISTEN). A difference between character-based and voice-based stimuli can also be observed in the 'Nihongo no goi tokusei' database. Here, if the value is positive, the word tends to be used in written language. If the value is negative, the word tends to be used in spoken language. Table 4 shows the positively-valued examples. These words tend to be used in written documents or letters. Punctuation-related words (glossed as 'ampersand' and 'punctuation') also appeared in the top 10 words. Table 5 shows the negatively-valued examples. These words tend to be used in conversations in daily life. A greeting (glossed as 'bye bye') and an interjection (glossed as 'oof!') are also observed.

Production vs. Reception
We surveyed the difference between the production (WRITE/SPEAK) and reception (READ/LISTEN) results by evaluating the (WRITE + SPEAK - READ - LISTEN) values. To our knowledge, no existing research has evaluated this contrast.
The difference between production and reception seems to reflect whether a word is used only in mass media or also in normal speech. Table 6 shows the production-biased words, which tend to be technical terms. The work histories of some participants (e.g. in the medical or music fields) explain certain words in Table 6, such as the words glossed as 'capillary tube' and 'adhesive tape', or, for traditional music, 'sing a song'. Table 7 shows the reception-biased words, among which negatively connoted words (glossed as 'murder' and 'filing charges') appear. The name 'Takamori Saigo', the main character in a TV drama, also appears as a reception-biased word in Table 7.
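The two contrasts used in this subsection and the preceding one are simple signed sums of the four estimated values. The sketch below shows them in Python; the function names and the example values are hypothetical and are not taken from the database.

```python
from typing import Dict


def character_minus_voice(rates: Dict[str, float]) -> float:
    """WRITE + READ - SPEAK - LISTEN: positive values suggest written-language use."""
    return rates["WRITE"] + rates["READ"] - rates["SPEAK"] - rates["LISTEN"]


def production_minus_reception(rates: Dict[str, float]) -> float:
    """WRITE + SPEAK - READ - LISTEN: positive values suggest production bias."""
    return rates["WRITE"] + rates["SPEAK"] - rates["READ"] - rates["LISTEN"]


# Hypothetical estimated values for a single word (not real data).
example = {"KNOW": 1.0, "WRITE": 0.5, "READ": 1.0, "SPEAK": -0.25, "LISTEN": 0.25}

print(character_minus_voice(example))       # 1.5  (character-based bias)
print(production_minus_reception(example))  # -1.0 (reception bias)
```

Sorting the words by either value and taking the two ends of the ranking gives lists of the kind shown in Tables 4 to 7.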

Evaluation by WLSP categories
This section presents our evaluation of the WLSP categories. We evaluated the results using the second level of the semantic category in the WLSP, which includes two fractional digits to the right of the radix point (as explained in section 2.1). We also present the most and least familiar words in the same WLSP categories.
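A category-level evaluation of this kind can be sketched as follows. The aggregation is assumed here to be a plain mean of the word-level values within each second-level category; the aggregation rule is not specified in the text, and the function names, WLSP numbers, and values in the example are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def second_level(wlsp_number: str) -> str:
    """Return the second-level category, e.g. '1.1642' -> '1.16'."""
    integer_part, _, fraction = wlsp_number.partition(".")
    return f"{integer_part}.{fraction[:2]}"


def mean_by_category(entries: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average word-level values per second-level WLSP category (assumed aggregation)."""
    groups = defaultdict(list)
    for wlsp_number, value in entries:
        groups[second_level(wlsp_number)].append(value)
    return {category: mean(values) for category, values in groups.items()}


# Hypothetical (WLSP number, estimated value) pairs, not real data.
entries = [("1.1642", 1.0), ("1.1600", 0.5), ("3.5300", 2.0)]
by_category = mean_by_category(entries)

# Rank categories from most to least familiar.
ranking = sorted(by_category.items(), key=lambda item: item[1], reverse=True)
print(ranking)  # [('3.53', 2.0), ('1.16', 0.75)]
```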

Known vs. Unknown
Tables 8 and 9 display the top 10 known and unknown word categories based on the perspective KNOW, respectively. As illustrated in Tables 8 and 9, the known words tend to be modifiers or verbs, while the unknown words tend to be nouns. The most well-known category is 3.53 (Modifier-Nature-Creature), which includes gender-related words such as those glossed as 'feminine' (KNOW=1.81) and 'masculine' (1.71). The least known category is 3.52 (Modifier-Nature-World), which includes rarely used words such as those glossed as 'bleak' (-1.46) and 'big and high' (-1.35).

Character-based vs. Voice-based
Tables 10 and 11 display the results for the character-based biased and voice-based biased categories, respectively. As shown in these tables, the nominal action and subject categories tend to be character-based biased, whereas the voca-

Production vs. Reception
Tables 12 and 13 display the results for the production-biased and reception-biased categories, respectively. Generally, the reception values (READ, LISTEN) tend to be larger than the production values (WRITE, SPEAK). Therefore, the values for Pro-Rec (WRITE + SPEAK - READ - LISTEN) become negative, even for the production-biased categories. The categories of the 'Other' syntactic class (i.e. excluding nouns, verbs, and modifiers), such as the animal call, interjection, vocative, and conjunction categories, are production biased. One such production-biased category is 4.50 (Other-Animal Call), which includes two words glossed as 'croak' (WRITE + SPEAK - READ - LISTEN = 0.45 and 0.23). The reception-biased categories contain vocabulary used on the news or in TV shows, such as nominal organisation, treatment, or intercourse. The reception-biased category with the highest ranking is 1.27 (Noun-Subject-Organization), which includes words such as those glossed as 'Ministry of Health, Labour, and Welfare' (-2.23) and 'Financial Services Agency' (-2.18).

Discussions
In this paper, we presented the word familiarity rating tendencies based on a crowdsourced study. The contrast between the character-based (WRITE and READ) and voice-based (SPEAK and LISTEN) results confirms the findings in Nihongo no goi tokusei; in addition, our data reveal a contrast between the production and reception categories that has not been reported before.
However, we still face the issue of normalising the ratings. This study's proposed method, in which the mean and standard deviation are set to 0.0 and 1.0, respectively, is sufficient when rating relative values or when arranging ratings in a certain order. We also calculated the ratings with $\gamma_{word}^{(i)} + \mu_{subj} + \alpha$; with this calculation, the ratings range from 1.0 to 5.0, excluding outliers. Though the normalisation of ratings should be determined by the rating method used, calculating the value $\gamma_{word}^{(i)}$ is sufficient for most uses.

Conclusions
We have presented a Japanese word familiarity rate database for entries in the WLSP. To do so, we used crowdsourcing to explore the word familiarity ratings in terms of five perspectives: KNOW, WRITE, READ, SPEAK, and LISTEN. A Bayesian linear mixed model was used to estimate the ratings. The data and code are publicly available. Our future work on this topic is as follows. In this paper, we modelled the word familiarity rates and the participant biases with the standard normal distribution. While we did attempt to model the rates and biases with other distributions, the MCMC estimation did not converge. In the future, we hope to perform the survey on a yearly basis (to enlarge the data size) in order to fit models with other distributions. We will also extend the target word list to include UniDic entries for content words. In addition, we plan to create a morphological analyser that outputs the word familiarity rates.