A Syllable-based Technique for Word Embeddings of Korean Words

Word embeddings have become a fundamental component of many NLP tasks such as named entity recognition and machine translation. However, popular models that learn such embeddings are unaware of word morphology, so they are not directly applicable to highly agglutinative languages such as Korean. We propose a syllable-based learning model for Korean that uses a convolutional neural network, in which word representations are composed of trained syllable vectors. Our model produces morphologically meaningful representations of Korean words compared to the original Skip-gram embeddings. The results also show that it is quite robust to the Out-of-Vocabulary (OOV) problem.


Introduction
Continuous word representations have been a fundamental ingredient of many NLP tasks since the advent of simple and successful approaches such as Word2Vec (Mikolov et al., 2013a) and GloVe (Pennington et al., 2014). Although they have been shown to be effective in capturing semantic and syntactic relationships between words, they have some limitations. First, they are only available for words in a pre-defined vocabulary and thus are prone to the Out-of-Vocabulary (OOV) problem. Second, they cannot utilize subword information at all because they treat the word as the basic unit. These problems are magnified when applying word-based methods to agglutinative languages such as Korean, Japanese, Turkish, and Finnish. In this work, we propose a new model* that uses syllables as the basic components of word representation to alleviate these problems, especially for Korean. In our experiments, we confirm that our model constructs word representations that capture semantic and syntactic relationships between words. We also show that our model can handle the OOV problem and capture morphological information without a dedicated morphological analysis.

* Portions of this research were done while the author was a student at Seoul National University.

Related Work
Recent works that utilize subword information to construct word representations can be broadly divided into two families: models that use morphemes as components, and models that take advantage of characters.

Morpheme-based representation models
A morpheme is the smallest unit of meaning in linguistics, so many studies consider morphemes when building word representations (Luong et al., 2013; Botha and Blunsom, 2014; Cotterell and Schütze, 2015). Luong et al. (2013) apply a recursive neural network over morpheme embeddings to obtain word embeddings. Although morpheme-based models are good at capturing semantics, one major drawback is that most of them require manually annotated data or an explicit morphological analyzer, which can introduce unintended errors. Our model does not need such preprocessing.

Character-based representation models
[Figure 1: Overall architecture of our model. Each syllable is a d-dimensional vector. For a given word '안녕하세요' (hello, annyeonghaseyo), we concatenate the syllable vectors in order. After passing through the convolutional layer and the max-pooling layer, the word representation is produced. All parameters are jointly trained under the Skip-gram scheme.]

Recently, utilizing information from characters has become one of the active NLP research topics. One way to extract knowledge from a sequence of characters is to use character n-grams (Wieting et al., 2016; Bojanowski et al., 2016). Bojanowski et al. (2016) suggest an approach based on the Skip-gram model (Mikolov et al., 2013a), in which the model sums character n-gram vectors to represent a word. On the other hand,
there are some approaches (Dos Santos and Gatti, 2014; Ling et al., 2015; Santos and Guimaraes, 2015; Zhang et al., 2015; Kim et al., 2016; Jozefowicz et al., 2016; Chung et al., 2016) in which word representations are composed of character embeddings via deep neural networks such as convolutional neural networks (CNNs) or recurrent neural networks (RNNs). Kim et al. (2016) introduce a language model that aggregates subword information through a character-level CNN. Models based on characters have shown competitive results on many tasks. A problem with character-based models is that characters themselves carry no semantic meaning, so these models often concentrate only on local syntactic features of words. To avoid this problem, we select syllables, which are fine-grained like characters but carry their own meaning in Korean, as the basic components of word representations.

Proposed Model

Characteristics of Korean Words
Morphologically, unlike many other languages, a Korean word (Eojeol) is not just a concatenation of characters. It is constructed by the following hierarchy: a sequence of syllables (Eumjeol) forms a word, and the composition of 2 or 3 characters (Jaso) forms a syllable (Kang and Kim, 1994).
In linguistics, Korean is categorized as an agglutinative language, in which each word is made of a set of morphemes. To complete a Korean word (Eojeol), a root morpheme must be combined with a bound morpheme such as a postposition (Josa) or a verbal ending (Eomi). This derivation produces about 60 different forms of similar meaning, which causes an explosion of the vocabulary. For the same reason, the number of occurrences of each word is relatively small even in a large corpus, which prevents the model from learning efficiently. Thus, most Korean word representation models use morphemes as the embedding unit, though this requires additional preprocessing. The problem is that errors from an imperfect morpheme analyzer may be propagated to the word representation model. Moreover, a single Korean syllable can possess a semantic meaning of its own. For example, the word '대학' (college, daehag) is a composition of '대' (big, or great, dae) and '학' (learn, or a study, hag). Therefore, our model regards syllables as embedding units rather than words or morphemes. For instance, the representations of '나는' (I am, naneun), '나의' (my, naui), and '나에게' (to me, na-ege) are all constructed by leveraging the same syllable vector '나' (I, na).
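The word hierarchy above (Jaso → syllable → word) is directly visible in Unicode: in NFC-normalized text each Hangul syllable is one precomposed character, and NFD decomposition splits it back into its 2 or 3 Jaso. A minimal sketch (the helper names are ours, for illustration):

```python
import unicodedata

def syllables(word: str) -> list[str]:
    """Split a Korean word (Eojeol) into its syllables (Eumjeol).

    In NFC-normalized text each Hangul syllable is a single
    precomposed character, so syllable segmentation is just
    character segmentation.
    """
    return list(unicodedata.normalize("NFC", word))

def jaso(syllable: str) -> list[str]:
    """Decompose one syllable into its 2-3 component characters (Jaso)."""
    return list(unicodedata.normalize("NFD", syllable))

# The inflected forms below all share the root syllable '나' (I, na),
# so their representations share that syllable vector.
for word in ["나는", "나의", "나에게"]:
    print(word, syllables(word))

print(len(jaso("한")))  # 3 Jaso for this syllable
```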

Syllable-based Representation
Similar to Kim et al. (2016), let S be the set of all Korean syllables. We embed each syllable into a d-dimensional vector space, so that Q ∈ R^{d×|S|} becomes the syllable embedding matrix. Let (s_1, s_2, ..., s_l) denote a word t ∈ V consisting of l syllables; t is represented by concatenating its syllable vectors column-wise into a matrix C^t = (Qs_1, Qs_2, ..., Qs_l) ∈ R^{d×l}. Applying a convolution filter H ∈ R^{d×w} of width w then yields a feature map f^t ∈ R^{l−w+1}. Filters of width greater than 1 require zero padding when processing words consisting of a single syllable.
In detail, for a given filter H, the i-th entry of the feature map is calculated as

    f^t[i] = ⟨C^t[∗, i : i + w − 1], H⟩,

where C^t ∈ R^{d×l} is the matrix of concatenated syllable vectors and ⟨A, B⟩ = tr(AB^T) denotes the Frobenius inner product. We then apply max pooling, y^t = max_i f^t[i], to extract the most important feature. By using multiple filters H_1, H_2, ..., H_h, we obtain the final representation y^t = (y^t_1, ..., y^t_h) for the word t.
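The composition above can be sketched in a few lines of numpy. This is a toy illustration of the convolution-plus-max-pooling step, not the actual implementation; the sizes, random initialization, and filter widths are placeholder values (the paper uses d = 320 and widths 1 to 4):

```python
import numpy as np

rng = np.random.default_rng(0)

d, num_syllables = 8, 1000                 # toy sizes
Q = rng.normal(size=(d, num_syllables))    # syllable embedding matrix

def word_vector(syllable_ids, filters):
    """Compose a word representation from syllable vectors.

    syllable_ids: indices of the word's syllables into Q's columns.
    filters: list of convolution filters H_k, each of shape (d, w_k).
    Returns y^t = (y^t_1, ..., y^t_h): one max-pooled feature per filter.
    """
    C = Q[:, syllable_ids]                  # C^t in R^{d x l}
    features = []
    for H in filters:
        w = H.shape[1]
        if C.shape[1] < w:                  # zero-pad short words
            C_in = np.pad(C, ((0, 0), (0, w - C.shape[1])))
        else:
            C_in = C
        l = C_in.shape[1]
        # f^t[i] = <C[:, i:i+w-1], H>, the Frobenius inner product
        f = np.array([np.sum(C_in[:, i:i + w] * H) for i in range(l - w + 1)])
        features.append(f.max())            # max pooling over positions
    return np.array(features)

filters = [rng.normal(size=(d, w)) for w in (1, 2, 3, 4)]
y = word_vector([3, 14, 159], filters)      # a 3-syllable word
print(y.shape)                              # (4,): one value per filter
```

Note how the zero padding lets a one-syllable word pass through filters of width up to 4, as required above.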
For training, we adopt the Skip-gram (Mikolov et al., 2013b) scheme with negative sampling: for a given center word y^t, we maximize the log-probability of predicting its context words y^c. The syllable embedding matrix and the convolution filters are trained jointly. Figure 1 shows the overall architecture of our model.
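As a concrete reference, the negative-sampling objective for one (center, context) pair can be written as follows. The vectors here are random stand-ins; in the actual model, y^t comes from the CNN composition, so gradients of this loss flow back into the filters and the syllable matrix Q:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def skipgram_neg_loss(y_center, y_context, y_negatives):
    """Negative-sampling Skip-gram loss for one (center, context) pair.

    The objective is log sigma(y_c . y_t) + sum_k log sigma(-y_k . y_t);
    we return its negation so it can be minimized.
    """
    pos = np.log(sigmoid(y_context @ y_center))
    neg = sum(np.log(sigmoid(-y_k @ y_center)) for y_k in y_negatives)
    return -(pos + neg)

rng = np.random.default_rng(1)
h = 4                                    # toy representation size
y_t = rng.normal(size=h)                 # stand-in for the CNN output
y_c = rng.normal(size=h)
negatives = [rng.normal(size=h) for _ in range(7)]  # 7 negatives, as in our setup
loss = skipgram_neg_loss(y_t, y_c, negatives)
print(loss > 0)  # True: a negated sum of log-probabilities
```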

Datasets and Baselines
The experiments are performed on a randomly sampled subset of a Korean news corpus collected from 2012 to 2014, containing approximately 2.7M tokens, an 11k-word vocabulary, and 1k distinct syllables. We compare our model against the original Skip-gram model with negative sampling (Mikolov et al., 2013b) as a baseline.

Implementation details
For all experiments, we use the following common parameters for both our model and the baseline: vector representations of dimension 320, a window size of 4, and a negative-sampling parameter of 7. We train for twelve epochs. In our model, the dimension of the syllable embeddings is 320. Empirically, filters of widths 1 to 4 were sufficient, since most Korean words are composed of 2 to 4 syllables.

Quantitative Evaluation
We use the WordSim353 dataset (Finkelstein et al., 2001; Agirre et al., 2009) for the word similarity and relatedness tasks. As WordSim353 is an English dataset, we translated it into Korean. The quality of the word vector representations is evaluated by computing the Pearson correlation coefficient between human judgment scores and the cosine similarity of the corresponding word vectors.
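The evaluation procedure itself is simple enough to sketch. Below, the word vectors and human scores are hand-built toy values, not the paper's data; only the scoring logic mirrors the description above:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def wordsim_score(pairs, human_scores, vectors):
    """Pearson correlation between human similarity judgments and
    cosine similarities of the corresponding word vectors."""
    sims = [cosine(vectors[a], vectors[b]) for a, b in pairs]
    return float(np.corrcoef(sims, human_scores)[0, 1])

# Hypothetical toy vectors and judgments, for illustration only.
vecs = {
    "왕":   np.array([1.0, 0.0, 0.0]),   # king
    "여왕": np.array([0.9, 0.1, 0.0]),   # queen
    "대학": np.array([0.3, 1.0, 0.0]),   # college
    "바나나": np.array([0.0, 0.0, 1.0]), # banana
}
pairs = [("왕", "여왕"), ("왕", "대학"), ("왕", "바나나")]
human = [9.0, 3.0, 1.0]                  # made-up 0-10 similarity scores
r = wordsim_score(pairs, human, vecs)
print(round(r, 3))
```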
The graph in Figure 2 shows that our model outperforms the baseline on the WS353-Similarity subset. We attribute this to the fact that many similar words share the same syllable(s) in Korean. On the other hand, on WS353-Relatedness, the performance is not as good as on the similarity task. We presume that leveraging syllables when computing representations can introduce noise between related words that share no common syllables.

Out-Of-Vocabulary Test
Since our model uses syllable vectors to compute word representations, it can produce representations of OOV words by combining syllables. To evaluate the representations of OOV words, we manually chose four newly coined words that do not appear in the training data (Table 1). These words were derived from existing words. For example, '구글신' (God Google, gugeulsin) is derived from '구글' (Google, gugeul), and '갤노트' (Gal'Note, gaelnoteu) is an abbreviated form of '갤럭시노트' (Galaxy Note, gaelleogsinoteu). Morphologically, two of them concatenate additional syllables to the original word, and the other two remove some syllables.
We examined the nearest neighbors of the representations of the OOV words and confirmed that, in each case, the vector of the original word is the nearest. This is unsurprising, since almost every newly coined word keeps the syllables of the original word in their original positions.
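The nearest-neighbor check amounts to a cosine-similarity search over the in-vocabulary vectors. A minimal sketch with hand-built toy vectors (the numbers are hypothetical; in the real test the OOV vector would be composed from its syllables by the CNN):

```python
import numpy as np

def nearest_neighbor(query_vec, vocab_vectors):
    """Return the in-vocabulary word whose vector has the highest
    cosine similarity with query_vec."""
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return max(vocab_vectors, key=lambda w: cos(query_vec, vocab_vectors[w]))

# Toy vectors: the composed vector for the OOV word '구글신' should
# land near '구글' because the two share syllables in fixed positions.
vocab = {"구글": np.array([1.0, 0.2, 0.0]),
         "서울": np.array([0.0, 1.0, 0.5])}
oov = np.array([0.9, 0.25, 0.05])       # hypothetical composed OOV vector
print(nearest_neighbor(oov, vocab))     # 구글
```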

Morphological Representation Test
We now evaluate our model on morphology by observing how the word representations reflect morphological characteristics. As mentioned above, the process of forming a sentence in Korean differs greatly from that of many other languages: a word can function in a sentence only when it is combined with a bound morpheme. For example, '서울을' (Seoul + object particle, seoul-eul) is a combination of the full morpheme '서울' (Seoul, seoul) + the bound morpheme '을' (object particle, eul).
To compare how the models learn morphological characteristics, we randomly sampled a hundred words from the training data together with the same words combined with a certain postposition ('을', eul). The graph in Figure 3 shows the result clearly: in our model, the original words form discriminative clusters parallel to their postposition-combined counterparts, while the baseline shows no such structure.
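The parallel-cluster observation can be quantified by checking that the offset vectors from each word to its postposition-combined form point in roughly the same direction. A toy sketch with hand-built vectors (hypothetical values, constructed so the particle contributes a shared direction):

```python
import numpy as np

def offset(vectors, word, inflected):
    """Vector offset from a word to its postposition-combined form."""
    return vectors[inflected] - vectors[word]

# Hypothetical toy vectors in which the particle '을' adds a shared direction.
eul = np.array([0.0, 0.0, 1.0])
vecs = {
    "서울": np.array([1.0, 0.0, 0.0]), "서울을": np.array([1.0, 0.0, 0.0]) + eul,
    "대학": np.array([0.0, 1.0, 0.0]), "대학을": np.array([0.0, 1.0, 0.0]) + eul,
}
o1 = offset(vecs, "서울", "서울을")
o2 = offset(vecs, "대학", "대학을")
cos = o1 @ o2 / (np.linalg.norm(o1) * np.linalg.norm(o2))
print(round(cos, 2))  # 1.0: parallel offsets imply parallel clusters
```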

Conclusion
We present a syllable-based word representation model evaluated on Korean, a morphologically rich language. Our model keeps the characteristics of Skip-gram models, in which word representations are learned from context words, and it also captures morphological characteristics by sharing parameters among words that contain common syllables. We demonstrate that our model is competitive in quantitative evaluations. Furthermore, we show that the model can handle OOV words and capture morphological relationships. As future work, we plan to extend our model to utilize information extracted from words, morphemes, and characters together.