Keyboard Logs as Natural Annotations for Word Segmentation

In this paper we propose a framework to improve word segmentation accuracy using input method logs. An input method is software used to type sentences in languages which have far more characters than the number of keys on a keyboard. The main contributions of this paper are: 1) an input method server that proposes word candidates which are not included in the vocabulary, 2) a publicly usable input method that logs user behavior (such as typing and selecting word candidates), and 3) a method for improving word segmentation by using these logs. We conducted word segmentation experiments on tweets from Twitter, and showed that our method improves accuracy in this domain. Our method itself is domain-independent and only needs logs from the target domain.


Introduction
The first step of almost all natural language processing (NLP) for languages with ambiguous word boundaries (such as Japanese and Chinese) is solving the problem of word identification ambiguity. This task is called word segmentation (WS), and the accuracy of state-of-the-art methods based on machine learning techniques is more than 98% for Japanese and 95% for Chinese (Yang and Vozila, 2014). Compared to languages like English with clear word boundaries, this ambiguity poses an additional problem for NLP tasks in these languages. To make matters worse, the domains of the available training data often differ from domains where there is a high demand for NLP, which causes a severe degradation in WS performance. Examples include machine translation of patents, text mining of medical texts, and marketing on the micro-blog site Twitter. Some papers have reported low accuracy on WS or the joint task of WS and part-of-speech (POS) tagging of Japanese or Chinese in these domains (Mori and Neubig, 2014; Kaji and Kitsuregawa, 2014; Liu et al., 2014). To cope with this problem, we propose a way to collect information from people as they type Japanese or Chinese on computers. These languages use far more characters than the number of keys on a keyboard, so users rely on software called an input method (IM) to type text. Unlike written texts in these languages, which lack word boundary information, text entered with an IM can provide word boundary information that can be used by NLP systems. As we show in this paper, logs collected from IMs are a valuable source of word boundary information.
An IM consists of a client (front-end) and a server (back-end). The client receives a key sequence typed by the user and sends a phoneme sequence (kana in Japanese or pinyin in Chinese) or some predefined commands to the server. The server converts the phoneme sequence into normal written text as a word sequence or proposes word candidates for the phoneme sequence in the region specified by the user. We noticed that the actions performed by people using the IM (such as typing and selecting word candidates) provide information about word boundaries, including context information.
In this paper, we first describe an IM for Japanese which allows us to collect this information. We then propose an automatic word segmenter that uses IM logs as a language resource to improve its performance. Finally, we report experimental results showing that our method increases the accuracy of a word segmenter on Twitter texts by using logs collected from a browser add-on version of our IM.
The three main contributions of this paper are:
• an IM server that proposes word candidates which are not included in the vocabulary (Section 3),
• a publicly usable IM that logs user behavior (such as typing and selecting word candidates) (Section 4),
• a method for improving word segmentation by using these logs (Section 5).
To the best of our knowledge, this is the first paper proposing a method for using IM logs to successfully improve WS.

Related Work
The main focus of this paper is WS. Corpus-based, or empirical, methods were proposed in the early 90's (Nagata, 1994). Mori and Kurata (2005) then extended this line of work by lexicalizing the states, as many studies did in that era, grouping word-POS pairs into clusters inspired by the class-based n-gram model (Brown et al., 1992), and making the history length variable, like a POS tagger for English (Ron et al., 1996). In parallel, there were attempts at solving Chinese WS in a similar way (Sproat and Chang, 1996). WS, or the joint task of WS and POS tagging, can be seen as a sequence labeling problem, so conditional random fields (CRFs) (Lafferty et al., 2001; Peng et al., 2004) have been applied to this task and have shown better performance than POS-based Markov models (Kudo et al., 2004). The training time of sequence-based methods tends to be long, however, especially when we use partially annotated data. A simple method based on pointwise classification has been shown to be as accurate as sequence-based methods and fast enough to make active learning practically possible. Since the pointwise method decides whether there is a word boundary between two characters without referring to other word boundary decisions in the same sentence, it is straightforward to train the model from partially annotated sentences. We adopt this WS system for our experiments.
Along with the evolution of models, the NLP community has become increasingly aware of the importance of language resources (Neubig and Mori, 2010; Mori and Neubig, 2014). Mori and Oda (2009) proposed to incorporate dictionaries intended for humans into a WS system with a different word definition. CRFs were also extended to enable training from partially annotated sentences (Tsuboi et al., 2008). When partially annotated sentences are used as WS training data, word boundary information exists only between some character pairs and is absent for others. This extension was adopted in Chinese WS to make use of so-called natural annotations (Yang and Vozila, 2014; Jiang et al., 2013). In that work, tags in hypertexts were regarded as annotations and used to improve WS performance. The IM logs used in this paper are also classified as natural annotations, but contain much more noise. In addition, we need an IM that is specifically designed to collect logs as natural annotations.
Server design is the most important factor in capturing information from IM logs. The most popular IM servers are based on statistical language modeling (Mori et al., 1999; Chen and Lee, 2000; Maeta and Mori, 2012). Their parameters are trained from manually segmented sentences whose words are annotated with phoneme sequences, and from sentences automatically annotated with NLP tools, which are themselves based on machine learning models trained on annotated sentences. Thus normal IM servers are not capable of presenting out-of-vocabulary (OOV) words (which provide large amounts of information on word boundaries) as conversion candidates. To make our IM server capable of presenting OOV words, we extend the statistical IM server of Mori et al. (2006), and ensure that it is computationally efficient enough for practical use by the public.
The target domain in our experiments is Twitter, a site where users post short messages called tweets. Since tweets are an immediate and powerful reflection of public attitudes and social trends, there have been numerous attempts at extracting information from them. Examples include information analysis of disasters (Sakai et al., 2010), estimation of depressive tendencies (Tsugawa et al., 2013), speech diarization (Higashinaka et al., 2011), and many others. These works require preprocessing of tweets with NLP tools, and WS is the first step, so there is clearly strong demand for improving WS accuracy. Another reason why we chose Twitter as the test domain is that the tweets typed using our server are public, which allows us to avoid privacy problems. Our method does not utilize any other characteristics of tweets, so it should also work in other domains such as blogs.

Input Method Suggesting OOV Words
In this section we propose a practical statistical IM server that suggests OOV word candidates in addition to words in its vocabulary.

Statistical Input Method
An input method (IM) is software which converts a phoneme sequence into a word sequence. This is useful for languages which contain far more characters than keys on a keyboard. Since there are some ambiguities in conversion, a conversion engine based on a word n-gram model has been proposed (Chen and Lee, 2000). Today, almost all IM engines are based on statistical methods.
For the LM unit, instead of words we propose to adopt word-pronunciation pairs $u = \langle y, w \rangle$. Thus, given a phoneme sequence $y_1^l = y_1 y_2 \cdots y_l$ as the input, the goal of our IM engine is to output the word sequence $\hat{w}_1^m$ that maximizes the probability $P(w_1^m, y_1^l)$, which we model with an $n$-gram over the pairs as follows:

$$\hat{w}_1^m = \mathop{\mathrm{argmax}}_{u_1^m} \prod_{i=1}^{m+1} P(u_i \mid u_{i-n+1}^{i-1}), \qquad (1)$$

where the concatenation of the pronunciations $y_i$ in the pairs $u_i$ is equal to the input: $y_1^l = y_1 y_2 \cdots y_m$. In addition, $u_j$ ($j \leq 0$) are special symbols introduced to simplify the notation, and $u_{m+1}$ is a special symbol indicating a sentence boundary.
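To make the search concrete, the following is a minimal sketch (not our actual implementation) of how Equation (1) could be decoded with a pair bigram model: dynamic programming over all splits of the input phoneme sequence into lexicon pronunciations. The function name, the toy lexicon, and the probabilities are all hypothetical.

```python
# Hypothetical sketch of pair-bigram decoding for Equation (1).
# "BOS"/"EOS" pairs stand in for the special boundary symbols.
def convert(phonemes, lexicon, model, bos=("BOS", "BOS"), eos=("EOS", "EOS")):
    n = len(phonemes)
    # best[j] maps a pair <y, w> ending at position j to (prob, backpointer)
    best = [{} for _ in range(n + 1)]
    best[0][bos] = (1.0, None)
    for i in range(n):
        for prev, (prob, _) in best[i].items():
            for j in range(i + 1, n + 1):
                y = phonemes[i:j]
                for w in lexicon.get(y, []):
                    u = (y, w)
                    p = prob * model.get((prev, u), 0.0)
                    if p > best[j].get(u, (0.0, None))[0]:
                        best[j][u] = (p, (i, prev))
    # finish with the sentence-boundary symbol and trace back
    end = max(best[n], key=lambda u: best[n][u][0] * model.get((u, eos), 0.0))
    words, i, u = [], n, end
    while u != bos:
        words.append(u[1])
        i, u = best[i][u][1]
    return list(reversed(words))

# Toy lexicon and fabricated pair-bigram probabilities.
lexicon = {"kyo-u": ["today"], "wa": ["TOP"]}
model = {
    (("BOS", "BOS"), ("kyo-u", "today")): 1.0,
    (("kyo-u", "today"), ("wa", "TOP")): 0.5,
    (("wa", "TOP"), ("EOS", "EOS")): 1.0,
}
print(convert("kyo-uwa", lexicon, model))  # ['today', 'TOP']
```

The dynamic program mirrors the product in Equation (1): each edge scores one conditional pair probability, and the traceback recovers the maximizing word sequence.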
As in existing statistical IM engines, the parameters are estimated from a corpus whose sentences are segmented into words annotated with their pronunciations, as follows:

$$P(u_i \mid u_{i-n+1}^{i-1}) = \frac{F(u_{i-n+1}^{i})}{F(u_{i-n+1}^{i-1})},$$

where $F(\cdot)$ denotes the frequency of a pair sequence in the corpus. In contrast to IM engines based on a word n-gram model, ours does not require an additional model describing the relationships between words and pronunciations, and thus it is much simpler and more practical. Existing statistical IM engines only need an accurate automatic word segmenter to estimate the parameters of the word n-gram model. As the equation above shows, our pair-based engine also needs an accurate way of automatically estimating pronunciations (phoneme sequences). However, an automatic pronunciation estimator that is as accurate as state-of-the-art word segmenters has recently been proposed. As we explain in Section 6, in our experiments both our IM engine and existing ones delivered an accuracy of 91%.
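The frequency-ratio estimation above can be sketched in a few lines. The corpus format (sentences as lists of ⟨pronunciation, word⟩ pairs) and the toy data are assumptions for illustration.

```python
from collections import Counter

def train_pair_bigram(corpus):
    """Maximum-likelihood pair-bigram probabilities:
    P(u_i | u_{i-1}) = F(u_{i-1} u_i) / F(u_{i-1})."""
    unigram, bigram = Counter(), Counter()
    for sentence in corpus:
        # "BOS"/"EOS" pairs stand in for the special boundary symbols.
        units = [("BOS", "BOS")] + sentence + [("EOS", "EOS")]
        for prev, curr in zip(units, units[1:]):
            unigram[prev] += 1
            bigram[(prev, curr)] += 1
    return {pair: n / unigram[pair[0]] for pair, n in bigram.items()}

# Toy corpus: each sentence is a list of (pronunciation, word) pairs.
corpus = [
    [("kyo-u", "today"), ("wa", "TOP")],
    [("kyo-u", "today"), ("mo", "also")],
]
model = train_pair_bigram(corpus)
print(model[(("kyo-u", "today"), ("wa", "TOP"))])  # 0.5
```

Since each unit is a ⟨pronunciation, word⟩ pair, no separate word-to-pronunciation model is needed, exactly as the text argues.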

Enumerating Substrings as Candidate Words
Essentially, the IM engine explained above cannot enumerate words which are unknown to the word segmenter and the pronunciation estimator used to build the training data. The aim of our research is to gather language information from user behavior as users type with an IM. So we extend the basic IM engine to enumerate all the substrings in a corpus with all possible pronunciations. For that purpose, we adopt the notion of a stochastically segmented corpus (SSC) (Mori and Takuma, 2004) and extend it to the pronunciation annotation of words.

Stochastically Segmented Corpora
An SSC is defined as a combination of a raw corpus $C_r$ (hereafter referred to as the character sequence $x_1^{n_r}$) and word boundary probabilities $P_i$, where $P_i$ is the probability that a word boundary exists between the two characters $x_i$ and $x_{i+1}$. These probabilities are estimated by a model based on logistic regression (LR) (Fan et al., 2008) trained on a manually segmented corpus, referring to the same features as those used in our word segmenter. Since there are word boundaries before the first character and after the last character of the corpus, $P_0 = P_{n_r} = 1$. Word n-gram frequencies on an SSC are then calculated as follows.

Word 0-gram frequency: This is defined as the expected number of words in the SSC:

$$f_r(\varepsilon) = 1 + \sum_{i=1}^{n_r - 1} P_i.$$

Word n-gram frequency ($n \geq 1$): Consider the situation in which a word sequence $w_1^n$ occurs in the SSC as a subsequence beginning at the $(i+1)$-th character and ending at the $k$-th character, where each word $w_m$ in the word sequence is equal to the character sequence beginning at the $b_m$-th character and ending at the $e_m$-th character (see Figure 1 for an example), with $b_1 = i+1$, $b_{m+1} = e_m + 1$, and $e_n = k$. The word n-gram frequency $f_r(w_1^n)$ in the SSC is defined by summing the stochastic frequency at each occurrence of the character sequence of $w_1^n$ over all of its occurrences in the SSC:

$$f_r(w_1^n) = \sum_{(i,\, e_1^n) \in O_n} P_i \prod_{m=1}^{n} \left[ \prod_{j=b_m}^{e_m - 1} (1 - P_j) \right] P_{e_m},$$

where $e_1^n = (e_1, e_2, \cdots, e_n)$ and $O_n = \{(i, e_1^n) \mid x_{b_m}^{e_m} = w_m,\ 1 \leq m \leq n\}$. We calculate word n-gram probabilities by dividing word n-gram frequencies by word (n−1)-gram frequencies. For a detailed explanation and a mathematical proof of this method, please refer to Mori and Takuma (2004).
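As an illustration, the expected frequency of a single word (the n = 1 case of the formula above) can be computed directly from the boundary probabilities. The function and the toy data below are a hypothetical sketch.

```python
def ssc_word_frequency(chars, probs, word):
    """Expected frequency of `word` in a stochastically segmented corpus.

    chars: the raw corpus as a string x_1 ... x_nr (0-indexed here).
    probs: probs[i] is P_i, the probability of a boundary just before
           chars[i]; probs[0] and probs[len(chars)] are 1 (corpus edges).
    """
    n = len(word)
    total = 0.0
    for i in range(len(chars) - n + 1):
        if chars[i:i + n] != word:
            continue
        p = probs[i]                   # boundary before the occurrence
        for k in range(i + 1, i + n):  # no boundary inside the word
            p *= 1.0 - probs[k]
        p *= probs[i + n]              # boundary after the occurrence
        total += p
    return total

# Toy corpus "abab" with fabricated boundary probabilities.
probs = [1.0, 0.5, 0.5, 0.5, 1.0]
print(ssc_word_frequency("abab", probs, "ab"))  # 0.25 + 0.25 = 0.5
```

Each occurrence contributes the probability that a boundary sits on both sides of the span and nowhere inside it, matching one term of the sum in the n-gram frequency formula.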

Pseudo-Stochastically Segmented Corpora

The computational costs (in terms of both time and space) of calculating an n-gram model from an SSC are very high, so it is not a practical technique for implementing an IM engine. In order to reduce the computational costs, we approximate an SSC with a deterministically segmented corpus, called a pseudo-stochastically segmented corpus (pSSC) (Kameko et al., 2015). The following is the method for producing a pSSC from an SSC.
• For i = 1 to n_r − 1:
  1. output the character x_i,
  2. generate a random number 0 ≤ p < 1,
  3. output a word boundary if p < P_i, or output nothing otherwise.
Now we have a corpus in the same format as a standard segmented corpus, with variable (non-constant) segmentation.
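The sampling procedure above can be sketched as follows; the "|" boundary marker and the function name are illustrative assumptions.

```python
import random

def make_pssc(chars, probs, seed=0):
    """Sample one pseudo-stochastically segmented corpus (pSSC) from an
    SSC: emit each character, then emit a boundary "|" with probability
    P_i, following the procedure in the text."""
    rng = random.Random(seed)
    out = []
    for i, c in enumerate(chars, start=1):
        out.append(c)
        # probs[i] is the boundary probability between x_i and x_{i+1}
        if i < len(chars) and rng.random() < probs[i]:
            out.append("|")
    return "".join(out)

# Toy SSC: probs[0] and probs[-1] are 1 at the corpus edges.
print(make_pssc("abab", [1.0, 0.5, 0.5, 0.5, 1.0]))
```

Each run (seed) yields one deterministic segmentation, so standard n-gram counting applies to the result at ordinary cost.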

Pseudo-Stochastically Tagged Corpora
We can annotate a word with all of its possible pronunciations and their probabilities, as is done in an SSC. We call a corpus containing sequences of words (w_1 w_2 ⋯ w_i ⋯), each annotated with a sequence of pronunciation-probability pairs (⟨y_{i,1}, p_{i,1}⟩, ⟨y_{i,2}, p_{i,2}⟩, ⋯), where Σ_j p_{i,j} = 1 for all i, a stochastically tagged corpus (STC). We can estimate these probabilities using an LR model built from sentences annotated with pronunciations.
Similar to the pSSC, we then produce a pseudo-stochastically tagged corpus (pSTC) from an STC as follows:
• For each w_i in the sentence:
  1. generate a random number 0 ≤ p < 1,
  2. annotate w_i with its j-th pronunciation y_{i,j} such that Σ_{k<j} p_{i,k} ≤ p < Σ_{k≤j} p_{i,k}.

Now we have a corpus in the same format as a standard corpus annotated with variable pronunciations. By estimating the parameters in Equation (1) from a pSTC derived from a pSSC, our IM engine can also suggest OOV word candidates with various possible segmentations and pronunciations without incurring high computational costs.
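The pronunciation-sampling step can be sketched as follows, assuming each word carries a hypothetical list of (pronunciation, probability) candidates.

```python
import random

def sample_pronunciation(candidates, rng):
    """Pick the j-th pronunciation y_{i,j} for a word: the j such that
    the random draw p falls in [sum_{k<j} p_{i,k}, sum_{k<=j} p_{i,k})."""
    p = rng.random()
    cum = 0.0
    for pron, prob in candidates:
        cum += prob
        if p < cum:
            return pron
    return candidates[-1][0]  # guard against floating-point round-off

# Fabricated candidate list for one word: two pronunciations.
rng = random.Random(0)
print(sample_pronunciation([("yo-ko", 0.7), ("o-u", 0.3)], rng))
```

Sampling in proportion to p_{i,j} keeps rare but valid pronunciations in the pSTC, which is what lets the engine later propose OOV words under their less common readings.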

Suggestion of OOV Words
Here we give an intuitive explanation of why our IM engine can suggest OOV words for a certain phoneme sequence. Let us take an OOV word example: "yo-ko-a-ri," an abbreviation of the name of the Yokohama city arena. A WS system tends to segment it into the words pronounced "yo-ko" (side) and "a-ri" (ant), because these are frequent nouns. In a pSSC, however, some occurrences of the string remain concatenated as the correct word. As for pronunciation, the first character has two possible pronunciations, "yo-ko" and "o-u." So deterministic pronunciation estimation of this new word risks outputting the erroneous result "o-u-a-ri," which would prevent our engine from presenting the word as a conversion candidate for the input "yo-ko-a-ri." The pSTC, however, contains both possible pronunciations for this word and allows our engine to present the OOV word for the input "yo-ko-a-ri." Thus, when a user of our IM engine types "yo-ko-a-ri-ni-i-ku" and selects the conversion result, the engine can learn the OOV word "yo-ko-a-ri" together with its context "ni" (to) "i-ku" (go).

Input Method Logs
In this section we first propose an IM which allows us to collect user logs. We then examine the characteristics of these logs and some difficulties in using them as language resources.

Collecting Logs from an Input Method
As Figure 2 shows, the client of our IM, running on the user's PC, is used to input characters and to modify conversion results. The server logs both input from the client and the results of conversions performed in response to requests from the client. Our IM has two phases: phoneme typing and conversion result editing. In each phase, the client sends the typed keys to the server with a timestamp and its IP address.
Phoneme typing: First the user inputs ASCII characters for a phoneme sequence. If the phoneme sequence itself is what the user wants to write, the user may not go to the next phase. The server records the keys typed to enter the phoneme sequence, cursor movements, and the phoneme sequence if the user selects it as-is.
Conversion result editing: Then the user presses a space key to make the IM engine convert the phoneme sequence to the most likely word sequence based on Equation (1). Sometimes the user changes some word boundaries, makes the IM engine enumerate candidate words covering the region, and selects the intended one from the list of candidates.
The server records the space key and the final word sequence. Table 1 shows an example of interesting log messages from the same IP address. In many cases, users type sentence fragments rather than complete sentences. So in the example there are six fragments within a short period, as indicated by the timestamps. If the user selects the phoneme sequence as-is without going to the conversion result editing phase, we can expect that there are word boundaries on both sides of the phoneme sequence.

There are two main problems that make it difficult to use IM logs directly as a training corpus for word segmentation. The first problem is fragmentation. IM users send the phoneme sequences for sentence fragments to the engine to avoid editing long conversion results that require many cursor movements. Thus the phoneme sequence and the final word sequence tend to be sentence fragments (as noted above), and as a result they lose context information. The second problem is noise. Word boundary information is unreliable even when it is present, because of mistakenly selected conversions or words entered separately. Based on these observations, we treat the IM logs as partially segmented sentence fragments that include noise.

Word Segmentation Using Input Method Logs
In this section we first explain various ways to generate language resources for a word segmenter from IM logs. We then describe an automatic word segmenter which utilizes these resources.
In the examples below we use the three-valued notation (Mori and Oda, 2009) to denote partial segmentation as follows:
| : there is a word boundary,
- : there is not a word boundary,
(blank) : there is no information.
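As a small illustration of this notation, counting the annotated gaps (the "-" and "|" marks, as in the annotation counts reported below) might look like the following sketch.

```python
def annotations(tags):
    """Count annotated gaps in the three-valued notation: "-" and "|"
    carry word-boundary information, a blank carries none."""
    return sum(1 for t in tags if t in ("|", "-"))

# Four gaps: boundary, no-boundary, unannotated, boundary.
print(annotations(["|", "-", " ", "|"]))  # 3
```

Only the annotated gaps contribute training signal to the segmenter; blanks are simply skipped.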

Input Method Logs as Language Resources
The phoneme sequences and the edit results in the final selection are themselves considered to be partially segmented sentences. We call the corpus generated directly from the logs "Log-as-is." The examples in Table 1 are converted as follows.
Example of Log-as-is (12 annotations): Here the number of annotations is the sum of the "-" and "|" marks. In this example, one log entry corresponds to one entry of the training data for the word segmenter. As one can easily imagine, Log-as-is may contain mistaken results (noise) and short entries (fragmentation). Both are harmful to a word segmenter.
To cope with the fragmentation problem, we propose to connect some logs based on their timestamps. If the difference between the timestamps of two sequential logs is short, both logs are probably from the same sentence. So we connect two sequential logs if the time difference between the last key of the first log and the first key of the second log is smaller than a certain threshold s. In the experiment we set s = 500 [ms] based on observations of our own behavior. This method is referred to as "Log-chunk." Using this method, we obtain the following from the examples in Table 1.
Example of Log-chunk (15 annotations): We see that Log-chunk contains more context information than Log-as-is. To prevent the noise problem, we propose to filter out logs with a small number of conversions. We expect that a sentence that required many conversions will have many OOV words and not much noise. Therefore we use logs which were converted more than n_c times. In the experiment we set n_c = 2 based on observations of our own behavior. This method is referred to as "Log-mconv." Using this method, the examples in Table 1 become the following. Example of Log-mconv (3 annotations)

--|
As this example shows, Log-mconv contains short entries (fragmentation) like Log-as-is. However, we expect that the annotated tweets do not include mistaken boundaries or conversions that were discarded.
Obviously we can combine Log-chunk and Log-mconv to avoid both the fragmentation and noise problems. This combination is referred to as "Log-chunk-mconv."
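The three log-conversion methods can be sketched as below. The log-entry format (start and end timestamps in milliseconds, partially segmented text, and a conversion count) is an assumption for illustration, not our server's actual format.

```python
def chunk_logs(logs, s=500):
    """Log-chunk: concatenate two sequential entries when the gap between
    the last key of the first and the first key of the second is < s ms.
    Conversion counts of merged entries are summed (a design assumption)."""
    chunks = []
    for start, end, text, conv in logs:
        if chunks and start - chunks[-1][1] < s:
            prev_start, _, prev_text, prev_conv = chunks[-1]
            chunks[-1] = (prev_start, end, prev_text + text, prev_conv + conv)
        else:
            chunks.append((start, end, text, conv))
    return chunks

def mconv_filter(logs, n_c=2):
    """Log-mconv: keep only entries converted more than n_c times."""
    return [entry for entry in logs if entry[3] > n_c]

# Fabricated log entries: (start_ms, end_ms, partially segmented text, conversions).
logs = [(0, 100, "kyo-u|", 2), (400, 600, "wa ", 1), (5000, 5200, "i-ku", 3)]
# Log-chunk-mconv: chunk first, then filter.
print([t for _, _, t, _ in mconv_filter(chunk_logs(logs))])  # ['kyo-u|wa ', 'i-ku']
```

In the toy run, the first two entries are 300 ms apart and get merged, while the third starts a new chunk; the filter then keeps only chunks with more than n_c conversions.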

Training a Word Segmenter on Logs
The IM logs give us partially segmented sentence fragments, so we need a word segmenter capable of learning from them. We can use a word segmenter based on a sequence classifier (Tsuboi et al., 2008; Yang and Vozila, 2014; Jiang et al., 2013) or one based on a pointwise classifier. Although both types are viable, we adopt the latter in our experiments because it requires much less training time while delivering comparable accuracy.
Here is a brief explanation of the word segmenter based on the pointwise method. The input is an unsegmented character sequence x = x_1 x_2 ⋯ x_k. At each point between two characters, the word segmenter decides whether there is a word boundary (t_i = 1) or not (t_i = 0) by using support vector machines (SVMs) (Fan et al., 2008). The reason why we use SVMs for word segmentation is that their accuracy is generally higher than that of LR; this was the case in the experiments of this paper, where the F-measure of LR on TWI-test was 91.30 (Recall = 89.50, Precision = 93.17), lower than that of SVM (see Table 4). To make an SSC, however, we use an LR model because we need word boundary probabilities. The features are character n-grams and character type n-grams (n = 1, 2, 3) around the decision point in a window with a width of 6 characters. Additional features are triggered if character n-grams in the window match character sequences in the dictionary. This approach is called pointwise because each word boundary decision is made without referring to the other decisions at points j ≠ i. As the explanation above shows, we can also use partially segmented sentences from IM logs for training in the standard way. (Incidentally, the results were stable for different values of the Log-mconv threshold n_c in preliminary experiments.)
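A sketch of the character n-gram part of this feature set is shown below; the relative-position encoding and window handling are illustrative assumptions, and the character-type and dictionary features are omitted.

```python
def char_ngram_features(x, i, window=3, max_n=3):
    """Character n-gram features for the boundary decision between
    x[i-1] and x[i] (0-indexed). Each feature encodes an n-gram inside
    the 2*window-character window together with its position relative
    to the decision point."""
    left = max(0, i - window)
    right = min(len(x), i + window)
    feats = []
    for n in range(1, max_n + 1):
        for start in range(left, right - n + 1):
            feats.append(f"{start - i}/{x[start:start + n]}")
    return feats

# Decision point between "b" and "c" with a 4-character window.
print(char_ngram_features("abcd", 2, window=2, max_n=2))
# ['-2/a', '-1/b', '0/c', '1/d', '-2/ab', '-1/bc', '0/cd']
```

Because each decision point gets its own feature vector, a partially segmented fragment simply contributes training examples for its annotated points and nothing for the blanks.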

Evaluation
To evaluate our methods, we measured the accuracy of WS without using logs (the baseline) and using logs converted by the several methods described above. There are two test corpora: one is from the general domain corpus from which we built the baseline WS, and the other is from the domain the IM logs were collected in, Twitter.

Corpora
The annotated corpus we used to build the baseline word segmenter is the manually annotated part (core data) of the Balanced Corpus of Contemporary Written Japanese (BCCWJ) (Maekawa, 2008), plus newspaper articles and daily conversation sentences. We also used a 234,652-word dictionary (UniDic) provided with the BCCWJ. A small portion of the BCCWJ core data is reserved for testing. In addition, for the test corpus we manually segmented sentences randomly obtained from Twitter during the same period as the log collection. (We extracted body text from 1,592 tweets, excluding mentions, hash tags, URLs, and ticker symbols, and then divided the body text into sentences by separating on newline characters, resulting in 2,976 sentences.) Table 2 shows the details of these corpora.

Models using Input Method Logs
To make the training data for our IM server, we first randomly selected tweets (786,331 sentences) in addition to the unannotated part of the BCCWJ (358,078 sentences). We then trained LR models which estimate word boundary probabilities and pronunciation probabilities for words (and word candidates) from the training data shown in Table 2 and UniDic. We made a pSTC for our IM engine from 1,207,182 sentences randomly obtained from Twitter (with no overlap with the test data) by following the procedure explained in Subsection 3.2.3. We launched our IM as a browser add-on for Twitter and collected 19,770 IM logs from 7 users between April 24 and December 31, 2014. Following the procedures in Section 5.1, we obtained the language resources shown in Table 3. We combined them with the training corpus and dictionaries to build four WSs, which we compared with the baseline.

Results and Discussion
Following the standard in WS experiments, the evaluation criteria are recall, precision, and their harmonic mean, the F-measure. Recall is the number of correctly segmented words divided by the number of words in the test corpus. Precision is the number of correctly segmented words divided by the number of words in the system output. Tables 4 and 5 show WS accuracy on TWI-test and BCCWJ-test, respectively. The difference in accuracy of the baseline method on BCCWJ-test and TWI-test shows that WS of tweets is very difficult. The fact that the precision on TWI-test is much higher than the recall indicates that the baseline model suffers from over-segmentation. This over-segmentation problem is mainly caused by OOV words being divided into known words. For example, the word for "Yokohama arena" is divided into the two known words meaning "side" and "ant."
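These criteria can be computed by comparing character spans, as in the following sketch; a word counts as correctly segmented when the same span appears in both segmentations.

```python
def ws_scores(reference, system):
    """Recall, precision, and F-measure for word segmentation."""
    def spans(words):
        # Map a word list to the set of (start, end) character spans.
        out, pos = set(), 0
        for w in words:
            out.add((pos, pos + len(w)))
            pos += len(w)
        return out
    ref, hyp = spans(reference), spans(system)
    correct = len(ref & hyp)
    recall = correct / len(ref)
    precision = correct / len(hyp)
    f = 2 * recall * precision / (recall + precision)
    return recall, precision, f

# The reference segments "abcd" as ab|c|d; the system outputs ab|cd.
print(ws_scores(["ab", "c", "d"], ["ab", "cd"]))
# recall 1/3, precision 1/2, F-measure 0.4
```

Note that over-segmentation inflates the denominator of recall-side errors while leaving precision comparatively high, which is the pattern observed for the baseline on TWI-test.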
When we compare the F-measures on TWI-test, all the models referring to the IM logs outperform the baseline model trained only on the BCCWJ. The highest is the Log-chunk-mconv model, and its improvement over the baseline is statistically significant (significance level: 1%). In addition, the accuracies of the five methods on the BCCWJ (Table 5) are almost the same, and there is no statistically significant difference (significance level: 1%) between any two of them.
We analyzed the words misrecognized by the WSs, which we call error words. Table 6 shows the number of error words, the number of OOV words among them, and the ratio of OOV words to error words. Here the vocabulary is the set of words appearing in the training data or in UniDic (see Table 2). Although the result of the WS trained on Log-as-is contains more error words than the baseline, its OOV ratio is lower than the baseline's. This means that the IM logs have the potential to reduce errors caused by OOV words.
Table 6 also indicates that the best method, Log-chunk-mconv, had the greatest success in reducing errors caused by OOV words. However, the majority of error words are in-vocabulary words. It can be said that our log chunking methods (Log-chunk and Log-chunk-mconv) enabled the WSs to eliminate many known-word errors by using context information.
To investigate the impact of the log size, we measured WS accuracy on TWI-test while varying the log size used during training. Figure 3 shows the results. Table 4 shows that Log-chunk-mconv and Log-chunk increase the accuracy nicely. The graph, however, clarifies that Log-chunk-mconv achieves high accuracy with less training data converted from logs. In other words, Log-chunk-mconv is good at distilling the informative parts and filtering out the noisy parts. These are very important properties to have as we consider deploying our IM to a wider audience. An IM is needed to type Japanese, and the number of Japanese speakers is more than 100 million. If we could use the input logs of even 1% of them for the same or a longer period, the idea we propose in this paper could improve WS accuracy in various domains efficiently and automatically.
As a final remark, this paper describes a successful example of how to build a useful tool for the NLP community (in this paper, 7 users used our system over 8 months). This process has three steps: 1) design a useful NLP application that can collect user logs, 2) deploy it for public use, and 3) devise a method for mining data from the logs.

Conclusion
This paper described the design of a publicly usable IM which collects natural annotations for use as training data for another system. Specifically, we (1) described how to construct an IM server that suggests OOV word candidates, (2) designed a publicly usable IM that collects logs of user behavior, and (3) proposed a method for using this data to improve word segmenters. Tweets from Twitter are a promising source of data with great potential for NLP, which is one reason why we used them as the target domain for our experiments. The experimental results showed that our methods improve accuracy in this domain. Our method itself is domain-independent and only needs logs from the target domain, so it is worth testing on other domains and with much longer periods of data collection.