Distinguishing Japanese Non-standard Usages from Standard Ones

We focus on non-standard usages of common words on social media. In the context of social media, words sometimes have usages that are totally different from their original meanings. In this study, we attempt to distinguish non-standard usages on social media from standard ones in an unsupervised manner. Our basic idea is that non-standardness can be measured by the inconsistency between the expected meaning of the target word and the given context. For this purpose, we use context embeddings derived from word embeddings. Our experimental results show that the model leveraging the context embedding outperforms other methods, and they provide findings on, for example, how to construct context embeddings and which corpus to use.


Introduction
On social media such as Twitter, we often find posts that are difficult to interpret without prior knowledge of non-standard usages of words. For example, consider a Japanese sentence with the following gloss: mackerel-POSS load-NOM increase-PRS, "The load on a mackerel increases". This does not make sense given the standard usages of the words in the sentence. Here, however, mackerel is a non-standard usage that means computer server, and the entire sentence should be interpreted as "The load on the computer server increases".
The Japanese word "鯖 (saba)" (i.e., mackerel) is used to mean computer server by Japanese computer geeks because saba happens to have a pronunciation that is similar to sābā (i.e., computer server). When a word is used with a meaning that is different from its dictionary meaning, we call such a usage non-standard. Non-standard usages can be found in many languages (Sboev, 2016). For example, the word "catfish" means a ray-finned fish in a standard dictionary, but on social media it can mean a person who pretends to be someone else in order to create a fake identity. Such non-standard usages are an obstacle to a variety of language processing tasks, including machine translation; Google Translate cannot correctly interpret examples such as this. Humans, however, are able to notice non-standard usages from the inconsistency between the expected word meaning and the context.
The purpose of this work is to develop a method for distinguishing non-standard usages of Japanese words from standard ones. Since it is impractical to construct a large labeled data set for each word, we focus on unsupervised approaches. The main idea in our method is that the difference between the target word's embedding learned from a general corpus and the embedding predicted from the given context would be a good indicator of the degree of non-standardness.

Data
We created a dataset for evaluating our method. First, we selected 40 words that have non-standard usages, including computer terms, company/service names, and other Internet slang. Ten of the 40 words were computer terms, another 10 were company/service names, and the remaining 20 were other Internet slang. For each of the 40 target words, we collected 100 tweets that contained the target word. We used Twitter as the source of examples since many non-standard usages appear on it. To segment tweets into words, we used the Japanese morphological analyzer MeCab with the standard IPA dictionary.
Next, we asked two human annotators to judge whether the usage of the target word in each tweet is standard, non-standard, a named entity, or undecidable. We excluded tweets that at least one annotator judged as undecidable (96 tweets). Cohen's kappa of the annotations for the remaining 3,904 tweets was 0.808. We further excluded tweets that at least one annotator judged as containing a named entity (772 tweets) in order to focus the dataset on our main purpose. Finally, to create the final dataset, we selected from the remaining 3,132 tweets the 2,973 tweets that both annotators judged as standard or both judged as non-standard. The selected 2,973 tweets amount to 94.9% of the remaining 3,132 tweets, which suggests that humans can reliably distinguish non-standard usages from standard ones. The statistics of the final dataset are shown in Table 1.

Table 1: Statistics of the dataset.

Category                 Standard   Non-standard
Computer terms                416            234
Company/Service names         440            252
Other Internet slang          817            814
Total                       1,673          1,300
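
The filtering steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `filter_stage`, the label strings, and the toy annotation pairs are hypothetical, while the counts in the arithmetic check are the ones reported in the text.

```python
from collections import Counter

def filter_stage(label_a, label_b):
    """Return the stage at which a tweet is dropped, or 'kept'."""
    # Step 1: drop tweets that at least one annotator marked undecidable.
    if "undecidable" in (label_a, label_b):
        return "undecidable"
    # Step 2: drop tweets that at least one annotator marked as a named entity.
    if "named entity" in (label_a, label_b):
        return "named entity"
    # Step 3: keep only tweets on which both annotators agree.
    if label_a != label_b:
        return "disagreement"
    return "kept"

# Toy annotation pairs (hypothetical, not taken from the dataset).
pairs = [
    ("standard", "standard"),
    ("non-standard", "non-standard"),
    ("standard", "non-standard"),
    ("named entity", "standard"),
    ("undecidable", "non-standard"),
]
stages = Counter(filter_stage(a, b) for a, b in pairs)

# Arithmetic check of the counts reported in the text.
total = 40 * 100                              # 40 target words, 100 tweets each
after_undecidable = total - 96                # tweets left after step 1
after_named_entity = after_undecidable - 772  # tweets left after step 2
final = 1673 + 1300                           # standard + non-standard
agreement_rate = final / after_named_entity   # share kept in step 3
```

The final ratio is taken over the 3,132 tweets that remain after the first two exclusions, not over all 4,000 collected tweets.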

Methodology
Our basic idea for distinguishing word usages is that if a word is used in a non-standard manner, the context words around it will tend to differ from standard context words. To implement this idea, we employed word embeddings. Below, we review the Skip-gram model used for obtaining the word embeddings in Section 3.1 and present our method in Section 3.2.

Skip-gram
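Skip-gram (Mikolov et al., 2013) is widely used for obtaining word embeddings. Given a sequence of words $w_1, w_2, \ldots, w_T$ as training data, Skip-gram maximizes the likelihood
$$\frac{1}{T}\sum_{t=1}^{T}\sum_{-m \le i \le m,\ i \ne 0} \log p(w_{t+i} \mid w_t), \qquad p(w_k \mid w_t) = \frac{\exp\left(v^{IN}_{w_t} \cdot v^{OUT}_{w_k}\right)}{\sum_{w=1}^{W}\exp\left(v^{IN}_{w_t} \cdot v^{OUT}_{w}\right)},$$
where $m$ is the window size and $W$ is the vocabulary size of the training data. Skip-gram learns a model predicting context words using word embeddings $v^{IN}$ and $v^{OUT}$, which are called the input embedding and the output embedding, respectively.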
The embeddings are learned in such a way that the dot-product $v^{IN}_{w_t} \cdot v^{OUT}_{w_k}$ tends to be large for word pairs that co-occur in the training corpus and small for word pairs that do not. We exploited this tendency for recognizing non-standard usages: if the dot-product between the embeddings of the target word and the context words is small, it should indicate a non-standard usage, on the condition that the embeddings have been learned on a general balanced corpus where words correspond to their standard meanings in most cases.
$v^{IN}$ is widely used as a word embedding in many studies, while $v^{OUT}$ has received little attention; only a few researchers have examined the effectiveness of $v^{OUT}$ (Mitra et al., 2016; Press and Wolf, 2017). In recent studies, the embeddings $v^{IN}$ are usually used for measuring the similarity between words. However, given the characteristics described in the previous paragraph and SGNS's equivalence with shifted positive pointwise mutual information (Levy and Goldberg, 2014), if we want to measure to what extent word $w_t$ tends to co-occur with $w_k$ in the training data, then we should use the similarity between $v^{IN}_{w_t}$ and $v^{OUT}_{w_k}$. In this study, we show the importance of using $v^{OUT}$ in a task where we need to see if a word matches its context.
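
To make the role of the two embedding types concrete, the following minimal sketch computes the Skip-gram softmax probability of a context word given a target word from one input embedding and a small table of output embeddings. The function names and the two-dimensional toy vectors are assumptions made for illustration; they do not come from trained embeddings, and practical implementations replace the full softmax with approximations such as negative sampling.

```python
import math

def dot(u, v):
    """Dot-product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def skipgram_prob(v_in_target, v_out_table, k):
    """Softmax probability of context word k given the target word.

    v_in_target: input embedding of the target word.
    v_out_table: output embeddings, one per vocabulary word.
    """
    scores = [dot(v_in_target, v_out) for v_out in v_out_table]
    max_score = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    return exps[k] / sum(exps)

# Toy embeddings (hypothetical values).
v_in_target = [1.0, 0.0]
v_out_table = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
probs = [skipgram_prob(v_in_target, v_out_table, k)
         for k in range(len(v_out_table))]
```

A larger dot-product between the target's input embedding and a word's output embedding gives that word a higher probability of appearing in the context, which is the tendency the method relies on.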

Distinguishing Non-standard Usages from Standard Ones
Following the idea described in Section 3.1, we propose a method for distinguishing non-standard usages from standard ones by leveraging word embeddings. An overview of our method is shown in Figure 1. We use Skip-gram with Negative Sampling (SGNS) (Mikolov et al., 2013) for obtaining the word embeddings. Given a target word $w_t$ and its context $w_c$ as input, we calculate the following weighted average of scaled dot-products as a measure of standardness:
$$\frac{\sum_{w_j \in w_c} \alpha_{w_j}\, \sigma\left(v^{IN}_{w_t} \cdot v^{OUT}_{w_j}\right)}{\sum_{w_j \in w_c} \alpha_{w_j}} \tag{1}$$
where $v^{IN}_{w_t}$ is the input embedding for the target word $w_t$ and $v^{OUT}_{w_j}$ is the output embedding for the context word $w_j$. $\alpha_{w_j}$ is a non-negative weight for the word $w_j$, and $\sigma$ is the sigmoid function used for scaling dot-products into a range from 0 to 1.
Although the values of $\alpha_{w_j}$ are arbitrary, we use the values given by the training algorithm used in word2vec and gensim (Řehůřek and Sojka, 2010), popular tools for obtaining word embeddings. In their training of word embeddings, context words that are closer to the target word are weighted higher. We therefore set $\alpha_{w_j}$ to $m + 1 - d_{w_j}$, where $m$ is the window size and $d_{w_j}$ is an integer that represents the distance between $w_j$ and the target word. We refer to this as decaying weighting. In contrast, with uniform weighting, we set $\alpha_{w_j}$ to 1 for all $w_j$ in the context.
We call the score of Equation (1) standardness. If the standardness is low, our method regards the instance as non-standard; otherwise, our method regards it as standard. We should note again that, in our method, word embeddings should be learned on a general balanced corpus that is different from the domain of the target instances.
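
The standardness score can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function and parameter names are hypothetical, the context is assumed to be given as pairs of an output embedding and a distance to the target word, and the toy vectors are invented.

```python
import math

def sigmoid(x):
    """Scale a dot-product into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    """Dot-product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def standardness(v_in_target, context, window=5, decaying=True):
    """Weighted average of sigmoid-scaled IN-OUT dot-products.

    v_in_target: input embedding of the target word.
    context: list of (v_out, distance) pairs, where v_out is the output
        embedding of a context word and distance (1..window) is its
        distance from the target word.
    decaying: if True, weight each word by window + 1 - distance;
        otherwise use a uniform weight of 1.
    """
    if not context:
        raise ValueError("context must contain at least one word")
    numerator = 0.0
    denominator = 0.0
    for v_out, distance in context:
        alpha = float(window + 1 - distance) if decaying else 1.0
        numerator += alpha * sigmoid(dot(v_in_target, v_out))
        denominator += alpha
    return numerator / denominator

# Toy example (hypothetical vectors): a nearby compatible context word
# and a distant incompatible one.
target = [1.0, 0.0]
context = [([3.0, 0.0], 1), ([-3.0, 0.0], 5)]
score_decaying = standardness(target, context, window=5, decaying=True)
score_uniform = standardness(target, context, window=5, decaying=False)
```

In this toy case the decaying scheme gives the adjacent, compatible word a weight of 5 and the distant, incompatible word a weight of 1, so the decaying score is higher than the uniform one.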

Methods for Comparative Evaluation
Our model has three characteristics: (input and output) word embeddings, decaying weights, and a general balanced corpus. We evaluated each of these characteristics in a task of distinguishing non-standard usages from standard ones.
First, we verified the effectiveness of the input and output embeddings. We tested a method in which only input embeddings are used to calculate the similarity: the cosine similarity between $v^{IN}_{w_t}$ and $v^{IN}_{w_j}$, a framework similar to that of previous work (Neelakantan et al., 2014; Gharbieh et al., 2016). We then tested a method based on the positive pointwise mutual information (PPMI) (Levy et al., 2015; Hamilton et al., 2016). Here, suppose that $M$ is a matrix in which each element is the PPMI of words $w_i$ and $w_j$. The dot-product $v^{IN}_{w_t} \cdot v^{OUT}_{w_j}$ in Equation (1) is replaced with the $(t, j)$-element of the low-rank approximation of $M$ obtained through singular value decomposition (SVD). We refer to this model as SVD.
Next, we replaced the decaying weights $\alpha$ with uniform weights to examine the impact of decaying weights.
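
As a reference for the PPMI-based baseline, the sketch below builds a PPMI matrix from a table of co-occurrence counts. The function name and the toy counts are assumptions for illustration only; the low-rank approximation through SVD is not shown, and the experiments in this paper rely on the code of Levy et al. (2015) rather than on this sketch.

```python
import math

def ppmi_matrix(counts):
    """Compute PPMI values from co-occurrence counts.

    counts[i][j] is the number of times word i co-occurs with context
    word j. PPMI is max(0, log(p(i, j) / (p(i) * p(j)))), and pairs that
    never co-occur are assigned 0.
    """
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    ppmi = []
    for i, row in enumerate(counts):
        ppmi_row = []
        for j, count in enumerate(row):
            if count == 0:
                ppmi_row.append(0.0)
            else:
                pmi = math.log(count * total / (row_sums[i] * col_sums[j]))
                ppmi_row.append(max(pmi, 0.0))
        ppmi.append(ppmi_row)
    return ppmi

# Toy counts (hypothetical): each word co-occurs only with its own context.
M = ppmi_matrix([[4, 0], [0, 4]])
# Words that co-occur exactly as often as chance predicts get PPMI 0.
M_independent = ppmi_matrix([[2, 2], [2, 2]])
```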
Finally, we conducted experiments with different training corpora to examine the impact of the balanced corpus. We used four corpora as training data for obtaining word embeddings. These corpora are described in Table 2.

Experimental Settings
In the training of the word embeddings, we set the window size to 5 and the dimensions of the word embeddings to 300. We regarded the words with frequency counts of 5 or less in the training data as unknown words and replaced those words with "<unk>". We used gensim (Řehůřek and Sojka, 2010) as an implementation of SGNS, where we set the number of negative samples to 10. We used the code provided by Levy et al. (2015) as the SVD implementation. For the evaluations, we ranked test instances in ascending order of standardness score and evaluated the ranking in terms of the area under the ROC curve (AUC) (Davis and Goadrich, 2006).
Table 3 shows the AUC for each model. First, we examined the impact of the choice of training corpus for obtaining word embeddings. The models with BCCWJ (the Balanced Corpus of Contemporary Written Japanese; Maekawa et al., 2010) are consistently better than those with other corpora, although BCCWJ is smaller than the others (Table 2). This result suggests that the use of a balanced corpus is crucial in our method for this task.
Next, we examined the impact of context embeddings. Table 3 shows that our model (SGNS IN-OUT) with BCCWJ achieved the best AUCs (.875 and .870), better than the AUCs of the SGNS model that uses only input embeddings. This result suggests that input embeddings should be used in combination with output embeddings for the task of judging whether a word matches its context or not. Table 3 also shows that SGNS-based models are better than SVD-based models.
As we discussed in Section 3.2, we used two weighting schemes for each model. Although the AUC of each decaying weight model is larger than that of the corresponding uniform weight model, the differences are not statistically significant.
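
The rare-word replacement used when preparing the training data can be sketched as follows. The function name, its parameters, and the toy sentences are hypothetical; only the threshold (frequency of 5 or less) and the "<unk>" token come from the settings described in this section.

```python
from collections import Counter

def replace_rare_words(sentences, max_rare_count=5, unk="<unk>"):
    """Replace words whose corpus frequency is max_rare_count or less."""
    freq = Counter(word for sentence in sentences for word in sentence)
    return [
        [word if freq[word] > max_rare_count else unk for word in sentence]
        for sentence in sentences
    ]

# Toy corpus (hypothetical): "saba" occurs 6 times, "load" 7 times,
# and "rare" only once.
sentences = [["saba", "load"]] * 6 + [["rare", "load"]]
processed = replace_rare_words(sentences)
```

Note that this keeps the rare words' positions in the sentence, which differs from simply discarding them and thereby shifting the context window.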
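
For reference, the AUC of a ranking by standardness can be computed with the pairwise formulation below, which is equivalent to the area under the ROC curve. This is a sketch under stated assumptions: non-standard usages are treated as the positive class, a lower standardness score ranks an instance as more likely non-standard, and the function name and toy scores are hypothetical.

```python
def auc_nonstandard(scores, is_nonstandard):
    """AUC for detecting non-standard usages from standardness scores.

    Returns the probability that a randomly chosen non-standard instance
    has a lower standardness score than a randomly chosen standard one,
    counting ties as one half.
    """
    positives = [s for s, y in zip(scores, is_nonstandard) if y]
    negatives = [s for s, y in zip(scores, is_nonstandard) if not y]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p < n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy scores (hypothetical): two non-standard and two standard instances.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [True, True, False, False]
auc = auc_nonstandard(scores, labels)
```

The pairwise loop is quadratic in the number of instances, which is acceptable for a test set of a few thousand tweets; a rank-based computation would scale better.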

Related Work
Previous studies have focused on distinguishing non-standard usages that are multi-word expressions or idiomatic expressions (Kiela and Clark, 2013; Salehi et al., 2015; Li and Sporleder, 2010). The task of this research is similar to new sense detection (Cook et al., 2014). Our research target includes jargon, whose actual meaning is difficult to infer without specific knowledge about its usage (Huang and Riloff, 2010). Recent studies in computational linguistics have used word embeddings and other techniques to capture various semantic changes in words, such as diachronic changes, geographical variations, and sentiment changes (Mitra et al., 2014; Kulkarni et al., 2015; Frermann and Lapata, 2016; Eisenstein et al., 2010; Hamilton et al., 2016; Yang and Eisenstein, 2016).
A few researchers have exploited output embeddings for natural language applications such as document ranking (Mitra et al., 2016) and improving language models (Press and Wolf, 2017).

Conclusion
We presented a model that uses context embeddings to distinguish Japanese non-standard usages from standard ones on social media. Our experimental results show that our model is better than the other models tested. They indicate the importance of context embeddings. To sum up, to distinguish non-standard usage, (1) using a balanced corpus as training data for obtaining word embeddings is crucial, (2) exploiting context embeddings derived from input and output word embeddings of SGNS achieves the best AUC, and (3) decaying weights have little impact on performance.
In future work, we are interested in extending our method to detect words that have non-standard usages. We are also interested in finding the meanings of the detected non-standard usages.