Can Symbol Grounding Improve Low-Level NLP? Word Segmentation as a Case Study

We propose a novel framework for improving a word segmenter using information acquired from symbol grounding. We generate a term dictionary in three steps: generating a pseudo-stochastically segmented corpus, building a symbol grounding model to enumerate word candidates, and filtering them according to the grounding scores. We applied our method to game records of Japanese chess with commentaries. The experimental results show that the accuracy of a word segmenter can be improved by incorporating the generated dictionary.


Introduction
Today we can easily obtain a large amount of text associated with multi-modal information, and there is growing interest in the use of non-textual information in the natural language processing (NLP) community. Many of these studies aim to output natural language sentences from a non-linguistic modality, such as images (Farhadi et al., 2010; Yang et al., 2011). Kiros et al. (2014) showed that multi-modal information improves the performance of a language model. Inspired by these studies, we explore a method for improving the performance of a low-level NLP task using multi-modal information. In this work, we focus on the task of word segmentation (WS) in Japanese. WS is often performed as the first processing step for languages without clear word boundaries, and it is as important as part-of-speech (POS) tagging in English. We assume that a large set of pairs of non-textual data and sentences describing them is available as the information source. In our experiments, the pairs consist of game states in Shogi (Japanese chess) and textual comments on them written by Shogi experts. We enumerate substrings (character sequences) in the sentences and match them with Shogi states using a neural network model. The rationale is that substrings that match the non-linguistic data well tend to be real words.
Our method consists of three steps (see Figure 1). First, we segment the commentary sentences for a game state in various ways to produce word candidates. Then, we match them with the game states as represented by a Shogi playing program. Finally, we compile the symbol grounding results over all states and incorporate them into an automatic word segmenter. To the best of our knowledge, this is the first reported performance improvement in an NLP task obtained through symbol grounding.

Stochastically Segmented Corpus
Before symbol grounding, we need to segment the text into words in a way that includes probable candidate words. For this purpose, we use a stochastically segmented corpus (SSC) (Mori and Takuma, 2004). We then propose to approximate it with a normal (deterministically segmented) corpus to avoid its computational cost.

Stochastically Segmented Corpora
An SSC is defined as a combination of a raw corpus $C_r$ (hereafter referred to as the character sequence $x_1^{n_r}$) and word boundary probabilities $P_i$, where $P_i$ is the probability that a word boundary exists between the two characters $x_i$ and $x_{i+1}$. These probabilities are estimated by a model based on logistic regression (LR) (Fan et al., 2008), trained on a manually segmented corpus, by referring to the surrounding characters. Since there are word boundaries before the first character and after the last character of the corpus, $P_0 = P_{n_r} = 1$. The expected frequency of a word $w$ of length $k$ in an SSC is calculated as follows:

$$f_r(w) = \sum_{i \in O} P_i \left\{ \prod_{j=1}^{k-1} (1 - P_{i+j}) \right\} P_{i+k},$$

where $O = \{ i \mid x_{i+1}^{i+k} = w \}$ is the set of all the occurrences of the string matching $w$. Each term is the probability that word boundaries exist immediately before and after the occurrence and nowhere inside it.
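As an illustration, the expected frequency can be computed by scanning for every occurrence of the string and multiplying the probability of a boundary on each side by the probability of no boundary at each internal position, following the SSC definition of Mori and Takuma (2004) as we understand it. The function name `expected_frequency` and the list layout of `boundary_probs` are assumptions made for this sketch only.

```python
from typing import List


def expected_frequency(text: str, boundary_probs: List[float], word: str) -> float:
    """Expected frequency of `word` in a stochastically segmented corpus.

    boundary_probs[i] plays the role of P_i, the probability of a boundary
    between characters x_i and x_{i+1} (1-indexed), so the list has
    len(text) + 1 entries and its first and last entries should be 1.0.
    """
    n, k = len(text), len(word)
    if len(boundary_probs) != n + 1:
        raise ValueError("need one probability per boundary position (n + 1)")
    if k == 0:
        return 0.0
    total = 0.0
    for i in range(n - k + 1):  # the candidate occupies x_{i+1} .. x_{i+k}
        if text[i:i + k] != word:
            continue
        # Boundaries must exist immediately before and after the occurrence.
        prob = boundary_probs[i] * boundary_probs[i + k]
        # No boundary may exist at any of the k - 1 internal positions.
        for j in range(1, k):
            prob *= 1.0 - boundary_probs[i + j]
        total += prob
    return total
```

For the toy text "abab" with every internal boundary probability set to 0.5, the string "ab" occurs twice and each occurrence contributes 0.25, giving an expected frequency of 0.5.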

Pseudo-Stochastically Segmented Corpora
The computational cost (in terms of both time and space) of calculating the expected frequencies in an SSC is very high, so it is not a practical approach for symbol grounding. In this work, we approximate an SSC using a deterministically segmented corpus, which we call a pseudo-stochastically segmented corpus (pSSC). The following is the process we use to produce a pSSC from an SSC.
• For $i = 1$ to $n_r − 1$:
  1. output a character $x_i$,
  2. generate a random number $0 ≤ p < 1$,
  3. output a word boundary if $p < P_i$, or output nothing otherwise.
Now we have a corpus in the same format as a standard segmented corpus but with variable (non-constant) segmentation, where $x_i$ and $x_{i+1}$ are separated with probability $P_i$. We execute the above procedure $m$ times and divide the resulting word counts by $m$ to approximate the expected frequencies. The law of large numbers guarantees that the approximation errors decrease to 0 as $m → ∞$.
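The sampling procedure above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the seeding scheme, and the explicit emission of the final character (the boundary after it has probability 1) are assumptions of the sketch.

```python
import random
from collections import Counter
from typing import Dict, List, Optional


def sample_segmentation(text: str, boundary_probs: List[float],
                        rng: random.Random) -> List[str]:
    """Draw one deterministic segmentation of `text`.

    boundary_probs[i] plays the role of P_i, the probability of a boundary
    after the i-th character (1-indexed); the list has len(text) + 1 entries.
    """
    if not text:
        return []
    n = len(text)
    words, current = [], []
    for i in range(1, n):                      # i = 1 .. n - 1
        current.append(text[i - 1])            # step 1: output x_i
        if rng.random() < boundary_probs[i]:   # steps 2 and 3
            words.append("".join(current))
            current = []
    # Assumption: the last character is emitted after the loop and closes a word.
    current.append(text[n - 1])
    words.append("".join(current))
    return words


def pseudo_expected_frequencies(text: str, boundary_probs: List[float],
                                m: int, seed: Optional[int] = None) -> Dict[str, float]:
    """Approximate expected word frequencies by averaging over m samples."""
    if m <= 0:
        raise ValueError("m must be positive")
    rng = random.Random(seed)
    counts: Counter = Counter()
    for _ in range(m):
        counts.update(sample_segmentation(text, boundary_probs, rng))
    return {word: count / m for word, count in counts.items()}
```

Because every sample is an ordinary segmented sentence, the m samples can be passed to any tool that expects a standard segmented corpus, which is the practical advantage of a pSSC over an SSC.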

Symbol Grounding
As the target of symbol grounding, we use states (piece positions) of a Shogi game and commentaries associated with them. We should note, however, that our framework is general and applicable to different types of combinations such as image/description pairs (Regneri et al., 2013).

Game Commentary
Japanese is a language without clear word boundaries, so automatic WS is needed as the first step of NLP. Shogi has many professional players, and many commentaries on game states are available.

Grounding Words
We build a symbol grounding model using a Shogi commentary dataset. We use a set of pairs of a Shogi state $S_i$ and a commentary sentence $C_i$ as the training set. A Shogi state $S_i$ is converted into a feature vector $f(S_i)$. From each $C_i$ we generate a pSSC $C'_i$ consisting of $m$ corpora ($m = 4$ in our experiment) that share the same text body but differ in word segmentation, $C'_{ij}$ ($j = 1, \ldots, m$). We treat these as $m$ pairs of the feature vector $f(S_i)$ and a sequence of words $C'_{ij}$. We train a model that predicts the words in $C'_{ij}$ using $f(S_i)$ as input.
We use a multi-layer perceptron as the prediction model. The input is a vector of the features of a state. The hidden layer is a 100-dimensional vector and is activated by a bipolar sigmoid function. Its output is a d-dimensional real-valued vector, each of whose elements indicates whether a word in the vocabulary of d words appears in the commentary or not. The output layer is activated by a binary sigmoid function.
As input, we use the features of Shogi states that a computer Shogi program called Gekisashi (Tsuruoka et al., 2002) uses to evaluate states in game tree search. The features used in this experiment are as follows:
a) Positions of pieces (e.g. my rook is at 2h).
b) Pieces captured (e.g. the opponent has a bishop).
c) Combinations of a) and b) (e.g. my king is at 7h and the opponent's rook is at 7b).
d) Other heuristic features.
Among them, a), b) and c) account for the majority of the features.
Unlike normal symbol grounding, the vocabulary contains many word candidates appearing in the pSSC generated from the commentaries. Some are real words and some are wrong fragments. These wrong fragments tend to appear more randomly across the commentaries than real words do. The perceptron therefore cannot acquire a strong relation between states and fragments, and its output values for fragments will be smaller than those for real words.
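The forward pass of the prediction model described above can be sketched in plain Python. The text specifies only the layer sizes and activations, so the weight initialization, the class name `GroundingMLP`, and the helper `candidate_targets` are assumptions of this sketch, and training (loss function and optimizer) is deliberately left out because it is not described.

```python
import math
import random
from typing import Dict, List, Sequence


def bipolar_sigmoid(x: float) -> float:
    """2 / (1 + exp(-x)) - 1, written as tanh(x / 2); range (-1, 1)."""
    return math.tanh(x / 2.0)


def binary_sigmoid(x: float) -> float:
    """1 / (1 + exp(-x)), written via tanh for numerical stability; range (0, 1)."""
    return 0.5 * (1.0 + math.tanh(x / 2.0))


class GroundingMLP:
    """State features -> hidden layer (100 units) -> one score per word candidate."""

    def __init__(self, n_features: int, vocab_size: int,
                 n_hidden: int = 100, seed: int = 0) -> None:
        rng = random.Random(seed)

        def init(rows: int, cols: int) -> List[List[float]]:
            # Assumption: small Gaussian initial weights.
            return [[rng.gauss(0.0, 0.01) for _ in range(cols)] for _ in range(rows)]

        self.w1, self.b1 = init(n_hidden, n_features), [0.0] * n_hidden
        self.w2, self.b2 = init(vocab_size, n_hidden), [0.0] * vocab_size

    def forward(self, features: Sequence[float]) -> List[float]:
        hidden = [bipolar_sigmoid(sum(w * x for w, x in zip(row, features)) + b)
                  for row, b in zip(self.w1, self.b1)]
        return [binary_sigmoid(sum(w * h for w, h in zip(row, hidden)) + b)
                for row, b in zip(self.w2, self.b2)]


def candidate_targets(words: Sequence[str], vocab: Dict[str, int]) -> List[float]:
    """Binary target vector: 1.0 for each vocabulary entry present in `words`."""
    target = [0.0] * len(vocab)
    for word in words:
        if word in vocab:
            target[vocab[word]] = 1.0
    return target
```

Each of the m segmentations of a commentary yields one target vector for the same state features, so a fragment that appears in only some segmentations receives a weaker and less consistent training signal than a real word.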

Word Segmentation Using Symbol Grounding Result
This section describes a baseline automatic word segmenter and a method for incorporating the symbol grounding result into it.

Baseline Word Segmenter
Among the many Japanese word segmenters and morphological analyzers (word segmentation and POS tagging), we adopt pointwise WS (Neubig et al., 2011), because it is the only word segmenter that is capable of adding new words without POS information.
The input of the pointwise WS is an unsegmented character sequence $x = x_1 x_2 \cdots x_k$. For each position between $x_i$ and $x_{i+1}$, the word segmenter decides whether there is a word boundary ($t_i = 1$) or not ($t_i = 0$) by using support vector machines (SVMs) (Fan et al., 2008). The features are character n-grams and character type n-grams ($n = 1, 2, 3$) around the decision point in a window with a width of 6 characters. Additional features are triggered if character n-grams in the window match character sequences in the dictionary.
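To make the feature description concrete, the following sketch extracts character n-gram, character type n-gram, and dictionary features for a single decision point. The character type inventory, the feature name strings, and the exact form of the dictionary features are assumptions for illustration and are not taken from Neubig et al. (2011).

```python
from typing import List, Set


def char_type(ch: str) -> str:
    """Coarse character class (an assumed inventory, not specified in the text)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "H"  # hiragana
    if 0x30A0 <= code <= 0x30FF:
        return "K"  # katakana
    if 0x4E00 <= code <= 0x9FFF:
        return "C"  # kanji (CJK unified ideographs)
    if ch.isdigit():
        return "N"  # digit
    if ch.isascii() and ch.isalpha():
        return "R"  # roman letter
    return "O"      # anything else


def boundary_features(text: str, pos: int, dictionary: Set[str],
                      half_window: int = 3, max_n: int = 3) -> List[str]:
    """Features for the decision point between text[pos - 1] and text[pos]."""
    if not 0 < pos < len(text):
        raise ValueError("pos must be an internal boundary")
    # A window of 6 characters: up to 3 on each side of the decision point.
    lo, hi = max(0, pos - half_window), min(len(text), pos + half_window)
    types = "".join(char_type(c) for c in text)
    feats: List[str] = []
    for n in range(1, max_n + 1):
        for start in range(lo, hi - n + 1):
            rel = start - pos  # offset of the n-gram relative to the decision point
            feats.append(f"C{n}[{rel}]={text[start:start + n]}")
            feats.append(f"T{n}[{rel}]={types[start:start + n]}")
    # Dictionary features fire for every substring of the window found in the dictionary.
    for start in range(lo, hi):
        for end in range(start + 1, hi + 1):
            if text[start:end] in dictionary:
                feats.append(f"D[{start - pos}:{end - start}]")
    return feats
```

Because each decision point is classified independently from such features, adding a word to the dictionary only changes which dictionary features fire, which is why new words can be added without POS information.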

Training a Word Segmenter with Grounded Words
As a first trial of incorporating symbol grounding results into an NLP task, we propose to generate a dictionary based on the symbol grounding result. We can expect that the word candidates that are given high scores by the perceptron have a strong relationship to the positions. In other words, we can make a good dictionary by selecting word candidates in descending order of the scores. Since a candidate can occur many times, we test the following three functions for taking all the occurrences into account:
• sum: the summation of the scores of all the output vectors,
• ave: the average of them,
• max: the maximum among them.
First, we acquire a d-dimensional real-valued vector (one element per word candidate in the vocabulary) for each Shogi state $S_i$ as the result of symbol grounding. Then, for each candidate in $C'_{ij}$, we take the element of the vector that corresponds to the candidate as the score of the candidate. After that, we compute the summation of, the average of, or the maximum among the scores of the same candidate over the whole dataset.
Finally, we select the top R percent of word candidates in descending order of the value of sum, ave, or max, add them to the WS dictionary, and retrain the model.
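The aggregation and selection steps can be sketched as follows. The data layout (a list of segmented-sentence and output-vector pairs plus a word-to-index map), the function names, counting every occurrence of a candidate separately, and rounding the cut-off down are all assumptions of this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def aggregate_scores(pairs: Iterable[Tuple[Sequence[str], Sequence[float]]],
                     vocab: Dict[str, int], how: str = "sum") -> Dict[str, float]:
    """Combine per-occurrence grounding scores into one score per candidate.

    Each pair holds one segmented sentence and the perceptron output vector
    for the Shogi state it is attached to.
    """
    reducers = {
        "sum": sum,
        "ave": lambda xs: sum(xs) / len(xs),
        "max": max,
    }
    if how not in reducers:
        raise ValueError("how must be 'sum', 'ave', or 'max'")
    per_candidate: Dict[str, List[float]] = defaultdict(list)
    for words, output in pairs:
        for word in words:
            if word in vocab:
                per_candidate[word].append(output[vocab[word]])
    return {word: reducers[how](scores) for word, scores in per_candidate.items()}


def select_top(scores: Dict[str, float], r_percent: float) -> List[str]:
    """Return the top r_percent of candidates in descending order of score."""
    ranked = sorted(scores, key=lambda w: (-scores[w], w))  # ties broken by spelling
    k = int(len(ranked) * r_percent / 100.0)                # assumption: round down
    return ranked[:k]
```

With sum, a candidate is rewarded both for occurring often and for being scored highly, whereas ave and max ignore frequency; the words returned by `select_top` are the ones that would be added to the segmenter's dictionary.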

Evaluation
We conducted word segmentation experiments in the following settings.

Corpora
The annotated corpus we used to build the baseline word segmenter is the manually annotated part (core data) of the Balanced Corpus of Contemporary Written Japanese (BCCWJ) (Maekawa, 2008), plus newspaper articles and daily conversation sentences. We also used a 234,652-word dictionary (UniDic) provided with the BCCWJ. A small portion of the BCCWJ core data is reserved for testing. In addition, we manually segmented sentences randomly obtained from Shogi commentaries. We divided these sentences into two parts: a development set and a test set. Table 1 shows the details of these corpora.
To make a pSSC, we prepared 33,151 pairs of a Shogi position and a commentary sentence. Each sentence is converted into a pSSC m = 4 times using an LR word segmentation model trained on the training data in Table 1, and the results are sent to the symbol grounding module.

Word Segmentation Systems
We built the following two word segmentation models (Neubig et al., 2011) to evaluate our framework.
• Baseline: The model is trained from the training data shown in Table 1 and UniDic.
• +Sym.Gro.: The model is trained from the language resources for the Baseline and the symbol grounding result.
To decide the function and the value of R for +Sym.Gro. (see Section 4.2), we measured the accuracy of every combination on the development set. The best combination was sum and R = 0.011 (see footnote 4). In this case, 127 words were added to the dictionary.

Results and Discussion
Following the standard in word segmentation experiments, the evaluation criteria are recall, precision, and F-measure (their harmonic mean). Tables 2 and 3 show WS accuracies on BCCWJ-test and Shogi-test, respectively. The difference in accuracy of the baseline method between BCCWJ-test and Shogi-test shows that WS of Shogi commentaries is very difficult. Like many other domains, Shogi commentaries contain many special words and expressions, which decrease the accuracy.
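For reference, word-level precision, recall, and F-measure can be computed by comparing the character spans of the words in the gold and system segmentations; a word counts as correct only if both of its boundaries match. This is the conventional definition, offered as a sketch, and the function names are illustrative rather than those of the evaluation script used in the experiments.

```python
from typing import List, Sequence, Set, Tuple


def to_spans(words: Sequence[str]) -> Set[Tuple[int, int]]:
    """Convert a segmented sentence into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def evaluate(gold: List[Sequence[str]],
             system: List[Sequence[str]]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) over a list of sentences."""
    if len(gold) != len(system):
        raise ValueError("gold and system must contain the same number of sentences")
    correct = n_gold = n_system = 0
    for gold_words, system_words in zip(gold, system):
        if "".join(gold_words) != "".join(system_words):
            raise ValueError("gold and system sentences must share the same characters")
        gold_spans, system_spans = to_spans(gold_words), to_spans(system_words)
        correct += len(gold_spans & system_spans)
        n_gold += len(gold_spans)
        n_system += len(system_spans)
    precision = correct / n_system if n_system else 0.0
    recall = correct / n_gold if n_gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

For example, if the gold segmentation is ab|c|de and the system outputs a|b|c|de, two of the four system words are correct, so precision is 0.5, recall is 2/3, and the F-measure is 4/7.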
When we compare the F-measures on Shogi-test (Table 3), +Sym.Gro. outperforms Baseline. The improvement is statistically significant (at the 5% level). The error reduction ratio is comparable to that of a natural annotation case (Liu et al., 2014), despite the fact that our method is unsupervised except for a hyperparameter. Thus we can say that the WS improvement obtained by symbol grounding is as valuable as that obtained by adding annotations.
Footnote 4: In addition, we measured the accuracies of all the combinations on the test set and found that the same function and the same parameter value are the best. This indicates the stability of the function and the parameter.
A close comparison of the recall and the precision shows that the improvement in recall is larger than that in precision. This result indicates that symbol grounding successfully acquired new words while introducing few erroneous ones. As a final remark, the result on the general domain (Table 2) shows that our framework does not cause a severe performance degradation in the general domain.

Related Work
The NLP task we focus on in this paper is word segmentation. One of the first empirical methods was based on a hidden Markov model (Nagata, 1994). In parallel, there were attempts at solving Chinese word segmentation in a similar way (Sproat and Chang, 1996). These methods take words as the modeling unit.
Recently, Neubig et al. (2011) presented a method for directly deciding whether there is a word boundary or not at each point between characters. For Chinese word segmentation, there have been attempts at tagging characters with BIES tags (Xue, 2003) using a sequence labeller such as CRFs (Lafferty et al., 2001), where B, I, E, and S mean the beginning of a word, the inside of a word, the end of a word, and a single-character word, respectively. The pointwise WS can be seen as character tagging with the BI tag system, in which there is no constraint between neighboring tags. For Japanese WS, our preliminary experiments showed that the combination of the BI tag system with SVMs is slightly better than the BIES tag system with CRFs. This is another reason why we used the former in this paper. Our extension of word segmentation is, however, applicable to the BIES/CRFs combination as well.
The method we describe in this paper is unsupervised and requires only a small amount of annotated data to tune the hyperparameter. From this viewpoint, the approach based on natural annotation (Yang and Vozila, 2014; Jiang et al., 2013; Liu et al., 2014) may come to readers' minds. In these studies, tags in hypertexts were regarded as partial annotations and used to improve WS performance using CRFs trainable from such data (Tsuboi et al., 2008). Mori and Nagao (1996) proposed a method for extracting new words from a large amount of raw text. Murawaki and Kurohashi (2008) proposed an online method in a similar setting. In contrast to these studies, this paper proposes to use modalities other than language, with game states as a first trial.

Conclusion
We have described an unsupervised method for improving word segmentation based on symbol grounding results. To extract word candidates from raw sentences, we first segment the sentences stochastically and then match the word candidate sequences with the game states that the sentences describe. Finally, we select word candidates according to the grounding scores. The experimental results showed that we can improve word segmentation by using symbol grounding results. Our framework is general, and it is worth testing on other NLP tasks. As future work, we will apply other deep neural network models to our approach. It would be interesting to apply the symbol grounding results to an embedding model-based word segmentation approach (Ma and Hinrichs, 2015). It would also be interesting to extend our method to deal with other types of non-textual information such as images and economic indices.