Segmentation-Free Word Embedding for Unsegmented Languages

In this paper, we propose a new word embedding pipeline for unsegmented languages, called segmentation-free word embedding, which does not require word segmentation as a preprocessing step. Unlike space-delimited languages, unsegmented languages, such as Chinese and Japanese, require word segmentation as a preprocessing step. However, word segmentation, which often requires manually annotated resources, is difficult and expensive, and unavoidable segmentation errors propagate to downstream tasks. To avoid these problems when learning word vectors for unsegmented languages, we consider word co-occurrence statistics over all possible segmentation candidates based on frequent character n-grams, instead of the segmented sentences produced by conventional word segmenters. Our experiments on noun category prediction tasks over raw Twitter, Weibo, and Wikipedia corpora show that the proposed method outperforms conventional approaches that require word segmenters.


Introduction
Word embedding, which learns dense vector representations of words from large text corpora, has received much attention in the natural language processing (NLP) community in recent years. It has been reported that these representations capture semantic and syntactic properties of words well (Bengio et al., 2003; Mikolov et al., 2013; Pennington et al., 2014), and that they are useful for many downstream NLP tasks, including part-of-speech tagging, syntactic parsing, and machine translation (Huang et al., 2011; Socher et al., 2013; Sutskever et al., 2014). In order to train word embedding models on a raw text corpus, we have to perform word segmentation as a preprocessing step. In space-delimited languages such as English and Spanish, simple rule-based and co-occurrence-based approaches yield reasonable segmentations. On the other hand, these approaches are impractical for unsegmented languages such as Chinese, Japanese, and Thai. Therefore, machine learning-based approaches are widely used in NLP for unsegmented languages. Conditional random field (CRF)-based supervised word segmentation (Kudo et al., 2004; Tseng et al., 2005) is still the most widely used approach in Japanese and Chinese NLP (Prettenhofer and Stein, 2010; Funaki and Nakayama, 2015; Ishiwatari et al., 2015; Nakazawa et al., 2016).

* This work was done while the author was at Shimodaira laboratory, Division of Mathematical Science, Graduate School of Engineering Science, Osaka University, and the Mathematical Statistics Team, RIKEN Center for Advanced Intelligence Project.
However, there are problems with using supervised word segmentation as a preprocessing step in a word embedding pipeline. First, it requires language-specific manually annotated resources such as word dictionaries and segmented corpora. Since such resources are typically unavailable for domain-specific corpora (e.g. Twitter or Weibo corpora, which contain many neologisms and informal words), we would have to create them ourselves when needed. Second, it cannot take advantage of word occurrence frequencies in a corpus. Even if a certain proper noun (e.g. the Japanese title of "The Old Man and the Sea") occurs frequently in a corpus, word segmenters will keep splitting it erroneously (e.g. into "an old man / and / a sea") as long as it is not registered in the word dictionary. Because of the segmentation errors incurred by these problems, the downstream word embedding model cannot learn vector representations of proper nouns, neologisms, and informal words.
In this paper, in order to learn word vectors from a raw text corpus while avoiding the above problems, we propose a new word-segmentation-free pipeline for word embedding, referred to as segmentation-free word embedding (sembei). Our framework first enumerates all possible segmentations (represented as a frequent n-gram lattice) based on character n-grams that occur frequently in the raw corpus, and then learns n-gram vectors from co-occurrence frequencies over the frequent n-gram lattice. Using this general idea, we can extend existing word embedding models. Specifically, in this paper, we propose a segmentation-free version of the widely used skip-gram model with negative sampling (SGNS) (Mikolov et al., 2013), which we refer to as SGNS-sembei.
Although the frequent character n-grams necessarily include many non-words (i.e. n-grams that are not words), remarkably, our results show that nearest neighbor search works well for frequent words and even proper nouns (e.g. the nearest neighbors of the n-gram for "Germany" are those for "China", "United Kingdom", etc.). This observation suggests that the proposed method can be used for automatic acquisition of synonyms from large raw text corpora.
We conduct experiments on a noun category prediction task on several corpora and observe that our method outperforms conventional approaches that use word segmenters. Fig. 1 shows a t-SNE projection of vector representations of Japanese nouns learned from a raw Twitter corpus alone. We can see that the proposed method learns vector representations of these nouns, and that the learned representations achieve good separation between categories.

Related Work
Some representation models do not rely on any segmenters. Dhingra et al. (2016) proposed a character-based RNN model for vector representations of tweets, and Schütze (2017) proposed a text embedding method that learns n-gram vectors from a randomly segmented corpus and then constructs text embeddings by summing the n-gram vectors. In the field of representation learning for biological sequences (e.g. DNA and RNA), Asgari and Mofrad (2015) applied the skip-gram model (Mikolov et al., 2013) to fixed-length fragments of biological sequences. These methods mainly aim at learning vector representations of whole texts or biological sequences rather than of words or fragments of sequences. In this paper, by contrast, we focus on learning vector representations of words from a raw corpus of an unsegmented language.

Conventional Approaches to Word Embeddings
Word embedding is also commonly used in NLP for unsegmented languages (Prettenhofer and Stein, 2010; Funaki and Nakayama, 2015; Ishiwatari et al., 2015). In these studies, a raw corpus is usually segmented into words using a word segmenter or a morphological analyzer, and the segmented corpus is then fed to a word embedding model (e.g. the skip-gram model (Mikolov et al., 2013) or GloVe (Pennington et al., 2014)), as in the case of space-delimited languages. A flowchart of this process is shown in the left part of Fig. 2.

The original SGNS
The original skip-gram model with negative sampling (Mikolov et al., 2013), which we refer to as the original SGNS, learns vector representations of words v_w and their contexts ṽ_c that maximize the following objective function:

ℓ = Σ_{(w,c)∈D} log σ(v_w · ṽ_c) + Σ_{(w,c)∈D′} log σ(−v_w · ṽ_c),   (1)

where σ(x) := (1 + e^{−x})^{−1}, D is a multiset (bag) of positive samples (i.e. co-occurring pairs in the corpus), and D′ is a multiset of negative samples. This objective function is maximized using stochastic gradient descent (SGD).

Figure 2: Flowcharts of the previous and proposed pipelines. Morphological analyzers are in charge of the shaded part. Our main idea is to replace the word dictionary with a set of frequent character n-grams and to omit the identification of the optimum path.
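As a concrete illustration, objective (1) and a single SGD update on one sample pair can be sketched in NumPy as follows. This is a minimal sketch: the function names and the plain per-pair update rule are illustrative, not the authors' implementation.

```python
import numpy as np

def sigma(x):
    """Logistic sigmoid, sigma(x) = (1 + e^{-x})^{-1}."""
    return 1.0 / (1.0 + np.exp(-x))

def sgns_objective(V, Vc, pos_pairs, neg_pairs):
    """Objective (1): log-sigmoid scores summed over positive pairs D
    and negative pairs D'. V, Vc: (vocab, dim) word/context matrices;
    pairs are lists of (word_id, context_id)."""
    obj = sum(np.log(sigma(V[w] @ Vc[c])) for w, c in pos_pairs)
    obj += sum(np.log(sigma(-(V[w] @ Vc[c]))) for w, c in neg_pairs)
    return obj

def sgd_step(V, Vc, w, c, label, lr=0.01):
    """One gradient-ascent step on a single pair; label is 1 for a
    positive sample and 0 for a negative sample."""
    g = label - sigma(V[w] @ Vc[c])
    V[w], Vc[c] = V[w] + lr * g * Vc[c], Vc[c] + lr * g * V[w]
```

Repeatedly applying `sgd_step` to pairs drawn from D (label 1) and D′ (label 0) increases the objective, pulling co-occurring pairs together and pushing negative samples apart.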

Segmentation-Free Word Embeddings
In this section, we first introduce the general idea of segmentation-free word embedding (sembei), and then propose a segmentation-free version of the SGNS. While conventional word embedding approaches learn word vectors from segmented corpora provided by word segmenters, our approach learns n-gram vectors from raw corpora, as in the right part of Fig. 2. In order to learn n-gram vectors from a raw corpus of an unsegmented language, we first construct a frequent n-gram lattice, which represents all possible segmentations based on frequent character n-grams of the corpus, in the same way as the word lattices used in morphological analysis are constructed. Then, we learn n-gram vectors using co-occurrence statistics over the frequent n-gram lattice instead of over segmented corpora as in conventional approaches.
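The lattice construction described above can be sketched as follows. The helper names `frequent_ngrams` and `lattice_edges` are hypothetical, and a real implementation would use much larger top-k cutoffs and a compact lattice data structure rather than explicit pair enumeration.

```python
from collections import Counter

def frequent_ngrams(corpus, max_n=3, top_k=10):
    """Collect the top_k most frequent character n-grams for each n."""
    vocab = set()
    for n in range(1, max_n + 1):
        counts = Counter(s[i:i + n] for s in corpus
                         for i in range(len(s) - n + 1))
        vocab |= {g for g, _ in counts.most_common(top_k)}
    return vocab

def lattice_edges(sentence, vocab):
    """Enumerate adjacent n-gram pairs (w, c) in the frequent n-gram
    lattice of one sentence: w ends exactly where c begins, and both
    are in the frequent n-gram vocabulary."""
    pairs = []
    for i in range(len(sentence)):
        for j in range(i + 1, len(sentence) + 1):
            w = sentence[i:j]
            if w not in vocab:
                continue
            for k in range(j + 1, len(sentence) + 1):
                c = sentence[j:k]
                if c in vocab:
                    pairs.append((w, c))
    return pairs
```

For example, on the sentence "abc" with all n-grams frequent, the lattice contains the edges ("a","b"), ("a","bc"), ("ab","c"), and ("b","c"), covering every possible segmentation boundary.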

Segmentation-Free Version of the SGNS
Here, we introduce a segmentation-free version of the SGNS, referred to as SGNS-sembei, as an application of the idea of segmentation-free word embedding. Our method simply optimizes the original SGNS objective function (1) with a slight modification: we change the definition of the multiset of positive samples D.
In SGNS-sembei, D is redefined as the multiset of character n-gram pairs (w, c) where w and c occur adjacently in the corpus (i.e. w and c are connected in the frequent n-gram lattice). In addition, to discriminate between co-occurrences with different orders in the frequent n-gram lattice, we tag contextual words with their relative positions to the center word, in the same way as Ling et al. (2015).
We also redefine the multiset of negative samples D′ from D in the same way as the original SGNS, and then optimize the objective function (1) using SGD.
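A rough sketch of both redefinitions follows. The `@+1`/`@-1` position tags and the uniform resampling of contexts are simplifications: the original SGNS draws negatives from a smoothed unigram distribution, and the helper names are hypothetical.

```python
import random

def positive_samples(edges):
    """Turn adjacent lattice pairs into positional (word, context)
    samples, tagging each context with its relative position to the
    center n-gram, in the spirit of Ling et al. (2015)."""
    D = []
    for w, c in edges:             # c immediately follows w in the lattice
        D.append((w, c + "@+1"))   # c is the right context of w
        D.append((c, w + "@-1"))   # w is the left context of c
    return D

def negative_samples(D, n_neg=10, seed=0):
    """Draw n_neg negatives per positive pair by resampling contexts
    from their empirical distribution over D (a simplification of the
    smoothed unigram sampling used by the original SGNS)."""
    rng = random.Random(seed)
    contexts = [c for _, c in D]
    return [(w, rng.choice(contexts)) for w, _ in D for _ in range(n_neg)]
```

The resulting D and D′ plug directly into objective (1); only the sample definitions differ from the original SGNS.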

Experiment
In this section, we evaluate our method by the noun category prediction task on Twitter, Weibo, and Wikipedia corpora.
The C++ implementation of the proposed method is available on GitHub 1 .

Settings
We used four raw text corpora: Wikipedia (Japanese), Wikipedia (Chinese), Twitter (Japanese), and Weibo (Chinese). The Wikipedia corpora consist of only a part of the Wikipedia dumps 2 (dated February 20th, 2017), with HTML tags removed. The Weiboscope corpus (Chau et al., 2013) consists of 226,841,122 posts, mainly in Chinese, and we use only a part of it. The Twitter corpus consists of 17,316,968 Japanese tweets collected from October 26th, 2016 to November 22nd, 2016 via the Twitter Streaming API. We removed hashtags, user IDs, and URLs from the Twitter and Weibo corpora. We extracted about 1,460k frequent n-grams 3 as the frequent character n-grams for our proposed method.
We extracted noun-category pairs from Wikidata (Vrandečić and Krötzsch, 2014) (we used the dump dated January 9th, 2017) as follows. We first extracted Wikidata entities whose headwords are also among the 1,460k frequent n-grams, then kept the entities whose "instance of" properties are in a predetermined category set 4 , and finally collected the names and categories of these entities. Examples of the extracted noun-category pairs are shown in Table 1.
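This filtering could be sketched as follows against the Wikidata JSON dump format (one entity per line, "instance of" being property P31). The `CATEGORIES` mapping below is a hypothetical placeholder, not the paper's actual predetermined category set.

```python
import json

# Hypothetical category set: Wikidata item id -> category label.
CATEGORIES = {"Q5": "human", "Q6256": "country"}

def noun_category_pairs(dump_lines, frequent_ngrams, lang="ja"):
    """Scan a Wikidata JSON dump and collect (label, category) pairs
    whose label is a frequent n-gram and whose 'instance of' (P31)
    target is in the predetermined category set."""
    pairs = []
    for line in dump_lines:
        line = line.strip().rstrip(",")
        if not line.startswith("{"):
            continue  # skip the dump's array brackets
        entity = json.loads(line)
        label = entity.get("labels", {}).get(lang, {}).get("value")
        if label is None or label not in frequent_ngrams:
            continue
        for claim in entity.get("claims", {}).get("P31", []):
            target = (claim["mainsnak"].get("datavalue", {})
                      .get("value", {}).get("id"))
            if target in CATEGORIES:
                pairs.append((label, CATEGORIES[target]))
    return pairs
```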
We randomly split the noun-category pairs into training (60%) and test (40%) sets. We trained linear C-SVM classifiers (Hastie et al., 2009) on the training set to predict categories from the vector representations of the nouns.
We performed a grid search over (C, classifier) ∈ {0.5, 1, 5, 10, 50, 100} × {one-vs-one, one-vs-rest} for the linear SVM using the training set for each vector representation.

2 We used {ja,zh}wiki-20170220-pages-articles1.xml in https://dumps.wikimedia.org

3 In this experiment, we defined the frequent n-grams as the union of the top-k_n most frequent n-grams for each n, where n and k_n are pre-specified numbers.
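The model selection above could be sketched with scikit-learn as follows. Selecting by training-set accuracy mirrors the description in the text, but details such as tie-breaking and the exact evaluation protocol are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

def best_classifier(X, y, seed=0):
    """Grid search over C and one-vs-one / one-vs-rest linear SVMs on a
    60/40 train/test split, selecting by training-set accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=0.6, random_state=seed)
    best, best_score = None, -1.0
    for wrapper in (OneVsOneClassifier, OneVsRestClassifier):
        for C in (0.5, 1, 5, 10, 50, 100):
            clf = wrapper(LinearSVC(C=C)).fit(X_tr, y_tr)
            score = clf.score(X_tr, y_tr)
            if score > best_score:
                best, best_score = clf, score
    return best, X_te, y_te
```

The selected classifier is then evaluated on the held-out 40% to produce the micro F-scores reported below.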

Baseline Systems
We compared SGNS-sembei with conventional approaches that use the original SGNS and word segmenters. To segment the raw corpora, we used MeCab (Kudo et al., 2004) for the Japanese corpora and the Stanford Word Segmenter (Tseng et al., 2005) for the Chinese corpora, with their default dictionaries 5 . We ignored words that occur fewer than 5 times. We also ran these baseline systems in an ideal setting: running the word segmenters with the default dictionaries plus additional dictionaries consisting of the nouns extracted in § 5.1.

Results
In both the original SGNS and SGNS-sembei, we fixed the dimensionality of the vector representations to 200 and the number of iterations to 5. In SGNS-sembei, we used n_neg = 10 negative samples, a context window size of h = 1, and an initial learning rate of α_init = 0.01.
The resulting micro F-scores and coverages (i.e. the percentages of noun-category pairs whose nouns have a vector representation) are shown in Table 2, and t-SNE (Maaten and Hinton, 2008) projections of Japanese noun vectors learned from the Twitter corpus are shown in Fig. 1. We observed that our proposed method outperforms the conventional approaches that use word segmenters. Furthermore, the coverages of our method were higher than those of the SGNS with the default dictionary (especially in Japanese), and competitive with those of the SGNS with the default dictionary plus Wikidata (an ideal setting), even though our method does not require any manually annotated resources. We can also see in Fig. 1 that the learned representations achieve good separation between categories. Nearest neighbor search using the Twitter and Weibo corpora was also performed as a preliminary experiment, and, surprisingly, it worked well for frequent words, as shown in Table 3.
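The nearest neighbor search over learned n-gram vectors reduces to cosine similarity against the embedding matrix; a minimal sketch (the function name is hypothetical):

```python
import numpy as np

def nearest_neighbors(query, vocab, V, k=5):
    """Return the k nearest n-grams to `query` by cosine similarity.
    vocab: list of n-grams; V: (len(vocab), dim) embedding matrix."""
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)  # unit-normalize rows
    q = Vn[vocab.index(query)]
    sims = Vn @ q                                      # cosine similarities
    order = np.argsort(-sims)
    return [vocab[i] for i in order if vocab[i] != query][:k]
```

Applied to the trained SGNS-sembei vectors, queries like "Germany" would be expected to return other country names among their top neighbors, as in Table 3.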

Conclusion
We proposed segmentation-free word embedding for unsegmented languages. Although our method does not rely on any manually annotated resources, experimental results of the noun category prediction task on several corpora showed that our method outperforms conventional approaches that rely on manually annotated resources.
As an anonymous reviewer suggested, a possible direction for future work is to leverage another word segmentation approach that uses linguistic features, such as the Stanford Word Segmenter (Tseng et al., 2005) with k-best segmentations.