Subword-level Word Vector Representations for Korean

Research on distributed word representations has focused on widely-used languages such as English. Although the same methods can be applied to other languages, language-specific knowledge can enhance the accuracy and richness of word vector representations. In this paper, we improve distributed word representations for Korean using knowledge about its unique linguistic structure. Specifically, we decompose Korean words down to the jamo level, beyond the character level, allowing a systematic use of subword information. To evaluate the vectors, we develop Korean test sets for word similarity and analogy and make them publicly available. The results show that our simple method outperforms word2vec and character-level Skip-Grams on semantic and syntactic similarity and analogy tasks and contributes positively to downstream NLP tasks such as sentiment analysis.


Introduction
Word vector representations built from a large corpus embed useful semantic and syntactic knowledge. They can be used to measure the similarity between words and can be applied to various downstream tasks such as document classification (Yang et al., 2016), conversation modeling (Serban et al., 2016), and machine translation (Neishi et al., 2017). Most previous research on learning the vectors focuses on English (Collobert and Weston, 2008; Mikolov et al., 2013a,b; Pennington et al., 2014; Liu et al., 2015; Cao and Lu, 2017), which leads to difficulties and limitations in directly applying those techniques to a language whose internal structure differs from that of English.
The mismatch is especially significant for morphologically rich languages such as Korean, where the morphological richness can be captured by subword-level embeddings such as character embeddings. It has already been shown that decomposing a word into subword units and using them as inputs improves performance on downstream NLP tasks such as text classification (Zhang et al., 2015), language modeling, and machine translation (Ling et al., 2015). Despite their effectiveness in capturing syntactic features of diverse languages, decomposing a word into a set of n-grams and learning n-gram vectors does not consider the unique linguistic structures of various languages. Thus, researchers have integrated language-specific structures to learn word vectors, for example subcharacter components of Chinese characters, and syntactic information (such as prefixes or postfixes) derived from external sources for English (Cao and Lu, 2017).
For Korean, integrating Korean linguistic structure at the level of jamo, the consonants and vowels that are much more rigidly defined than in English, has been shown to be effective for sentence parsing (Stratos, 2017). Previous work has looked at improving the vector representations of Korean using character-level decomposition (Choi et al., 2017), but there is room for further investigation because Korean characters can be decomposed into jamos, which are smaller units than characters.
In this paper, we propose a method that integrates Korean-specific subword information to learn Korean word vectors, and we show improvements over previous baseline methods on word similarity, analogy, and sentiment analysis. Our first contribution is a method that decomposes words into both character-level and jamo-level units and trains the subword vectors through the Skip-Gram model. Our second major contribution is a pair of Korean evaluation datasets for word similarity and analogy tasks: a translation of WS-353 with annotations by 14 Korean native speakers, and 10,000 items for semantic and syntactic analogies, developed with Korean linguistic expertise. Using these datasets, we show that our model improves performance over other baseline methods without relying on external resources for word decomposition.
Related Work

Language-specific Features for NLP

Recent studies in the NLP field have flourished with the development of various word vector models. Although such studies aim for universal usage, the distinct characteristics of individual languages remain a barrier to a unified model. This issue is even more prominent for languages that have rich morphology but lack research resources (Berardi et al., 2015). Accordingly, various studies on language-specific NLP techniques have proposed incorporating linguistic traits into their models.
A large portion of these papers is dedicated to Chinese. Since Chinese is a logosyllabic language, relevant studies have focused on incorporating different subword-level features into word embeddings, such as word-internal structure (Wang et al., 2017), subcharacter components, syllables (Assylbekov et al., 2017), radicals (Yin et al., 2016), and sememes (Niu et al., 2017).
The Korean language is a member of the agglutinative languages (Song, 2006), so previous studies have tried fusing its complex internal structure into the model. For example, a grammatical particle called 'Josa' has been combined with word embeddings for semantic role labeling, and jamo has been exploited to handle morphological variation (Stratos, 2017). The syllable has also been considered in prior work to obtain word vector representations for Korean (Choi et al., 2017).

Subword features for NLP
Applying subword features to various NLP tasks has become popular in the NLP field. Typically, character-level information is useful when combined with neural network based models (Vania and Lopez, 2017; Assylbekov et al., 2017; Cao and Lu, 2017). Previous papers showed performance gains in various tasks including language modeling (Bojanowski et al., 2015, 2017), machine translation (Ling et al., 2015), text classification (Zhang et al., 2015; Ling et al., 2015), and parsing (Yu and Vu, 2017). In addition, a character n-gram fused model was suggested as a solution for small datasets due to its robustness against data sparsity (Cao and Lu, 2017).

Model
We introduce our model for training Korean word vector representations, based on a subword-level information Skip-Gram. First, we briefly explain the hierarchical composition structure of Korean words to show how we decompose a Korean word into a sequence of subword components (jamos). Then, we extract character- and jamo-level n-grams from the decomposed sequence and compute word vectors as the mean of the extracted n-gram vectors. We train the vectors with the widely-used Skip-Gram model.

Decomposition of Korean Words
Korean words are formed by an explicit hierarchical structure which can be exploited for better modeling. Every word can be decomposed into a sequence of characters, which in turn can be decomposed into jamos, the smallest lexicographic units representing the consonants and vowels of the language. Unlike English, which allows more flexible sequences of consonants and vowels within a syllable (e.g., "straight"), a Korean "character", which is similar to an English syllable, has a rigid structure of three jamos. The jamos have names that reflect their position in a character: 1) chosung (syllable onset), 2) joongsung (syllable nucleus), and 3) jongsung (syllable coda). The prefix cho in chosung means "first", joong in joongsung means "middle", and jong in jongsung means "end" of a character. Each component indicates how the character should be pronounced. With the exception of empty consonants, chosung and jongsung are consonants, while joongsung is a vowel. The jamos are written with the chosung on top, the joongsung to the right of or below the chosung, and the jongsung on the bottom (see Fig. 1).
As shown at the top of Fig. 1, some characters such as '해' (Sun) lack a jongsung. In this case, we add an empty jongsung symbol e so that a character always has three jamos. Thus, the character '달' (Moon) is decomposed into {ㄷ, ㅏ, ㄹ}, and '해' (Sun) into {ㅎ, ㅐ, e}.
When decomposing a word, we keep the order of the characters and the order of jamos (chosung, joongsung, jongsung) within each character. Following this rule, a Korean word with N characters always yields 3N jamos in order. Lastly, the symbols for the start of a word, <, and the end of a word, >, are added to the sequence. For example, the word '강아지' (puppy) is decomposed into the jamo sequence {<, ㄱ, ㅏ, ㅇ, ㅇ, ㅏ, e, ㅈ, ㅣ, e, >}.
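The decomposition above can be sketched in a few lines, since Unicode encodes each Hangul syllable arithmetically (code point = 0xAC00 + 588·cho + 28·joong + jong). This is an illustrative sketch, not the paper's code; the jamo tables and function name are ours:

```python
# Jamo inventories in Unicode order; "e" marks an empty jongsung.
CHO = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")                       # 19 onsets
JOONG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")               # 21 nuclei
JONG = ["e"] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 27 codas + empty

def decompose(word):
    """Decompose a Korean word into an ordered jamo sequence with
    word-boundary symbols < and >; every character yields 3 jamos."""
    jamos = ["<"]
    for ch in word:
        offset = ord(ch) - 0xAC00           # position in the Hangul syllable block
        cho, rest = divmod(offset, 21 * 28)
        joong, jong = divmod(rest, 28)
        jamos += [CHO[cho], JOONG[joong], JONG[jong]]
    jamos.append(">")
    return jamos
```

For example, decompose('강아지') yields ['<', 'ㄱ', 'ㅏ', 'ㅇ', 'ㅇ', 'ㅏ', 'e', 'ㅈ', 'ㅣ', 'e', '>'], matching the sequence above.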

Extracting N-grams from jamo Sequence
We extract the following jamo-level and character-level n-grams from the decomposed Korean words: 1) character-level n-grams, and 2) inter-character jamo-level n-grams. These two levels of subword features can be integrated seamlessly at the jamo level because adding the empty jongsung symbol ensures that every character has exactly three jamos. For illustration, we use the word '먹었다' (ate).

Character-level n-grams. Since we add the empty jongsung symbol e when decomposing characters, we can find jamo-level trigrams representing a single character in the decomposed jamo sequence of a word. For example, there are three character-level unigrams in the word '먹었다' (ate): {ㅁ, ㅓ, ㄱ}, {ㅇ, ㅓ, ㅆ}, {ㄷ, ㅏ, e}. Next, we find character-level n-grams using the extracted unigrams: adjacent unigrams are attached to construct n-grams. There are two character-level bigrams and one trigram in the example: {ㅁ, ㅓ, ㄱ, ㅇ, ㅓ, ㅆ}, {ㅇ, ㅓ, ㅆ, ㄷ, ㅏ, e}, and {ㅁ, ㅓ, ㄱ, ㅇ, ㅓ, ㅆ, ㄷ, ㅏ, e}. Lastly, we add the whole jamo sequence of the word, including < and >, to the set of extracted character-level n-grams.

Inter-character jamo-level n-grams. Since Korean is an agglutinative language, a syntactic character is attached to the semantic part of the word, and this generates many variations. These variations are often determined by jamo-level information. For example, the choice between the subjective case markers '이' and '가' is determined by the existence of a jongsung in the previous character. In order to learn these regularities, we consider jamo-level n-grams across adjacent characters as well. For instance, there are 6 inter-character jamo-level trigrams in the example: {<, ㅁ, ㅓ}, {ㅓ, ㄱ, ㅇ}, {ㄱ, ㅇ, ㅓ}, {ㅓ, ㅆ, ㄷ}, {ㅆ, ㄷ, ㅏ}, and {ㅏ, e, >}.
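A minimal sketch of both extraction steps (the function names are ours, not the paper's). It assumes the input is a jamo sequence with boundary symbols and empty-jongsung markers, and it reads the paper's count of 6 trigrams for '먹었다' as excluding the trigrams that are aligned with a single character:

```python
def char_ngrams(jamos, n_max=2):
    """Character-level n-grams: group the jamo sequence (without < and >)
    into 3-jamo characters, then concatenate n adjacent characters.
    The full sequence including < and > is added at the end."""
    body = jamos[1:-1]                                    # strip < and >
    chars = [tuple(body[i:i + 3]) for i in range(0, len(body), 3)]
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(chars) - n + 1):
            grams.append(tuple(j for c in chars[i:i + n] for j in c))
    grams.append(tuple(jamos))                            # whole word incl. < and >
    return grams

def inter_char_jamo_ngrams(jamos, n_min=3, n_max=5):
    """Jamo-level n-grams over the full sequence (with < and >), keeping
    only windows that are not aligned with a single character."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(jamos) - n + 1):
            aligned = (n == 3 and (i - 1) % 3 == 0)       # 3-jamo window on one character
            if not aligned:
                grams.append(tuple(jamos[i:i + n]))
    return grams
```

For '먹었다', char_ngrams returns the three unigrams, two bigrams, one trigram, and the full sequence, while inter_char_jamo_ngrams with n=3 returns exactly the six boundary-crossing trigrams listed above.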

Subword Information Skip-Gram
Suppose the training corpus contains a sequence of words {..., w_{t-2}, w_{t-1}, w_t, w_{t+1}, w_{t+2}, ...}. The Skip-Gram model maximizes the average log probability of a context word w_{t+j} given a target word w_t:

\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)    (1)

where c is the size of the context window and T is the total number of words in the corpus. The original Skip-Gram model uses softmax outputs for log p(w_{t+j} | w_t) in Eq. 1; however, this requires a large computational cost. To avoid computing the softmax exactly, we approximately maximize the log probability via Noise Contrastive Estimation, which can be simplified to negative sampling with the binary logistic loss:

\log \sigma(s(w_t, w_{t+j})) + \sum_{i=1}^{n_c} \log \sigma(-s(w_t, n_i))    (2)

where n_c is the number of negative samples n_i, and s(w_t, w_{t+j}) is a scoring function. The function computes the dot product between the input vector of the target word w_t and the output vector of the context word w_{t+j}. In Skip-Gram (Mikolov et al., 2013a), an input vector is uniquely assigned to each word over the training corpus; in the Subword Information Skip-Gram model (Bojanowski et al., 2017), however, the input vector is the mean vector of the set of n-grams extracted from the word. Formally, the scoring function s(w_t, w_{t+j}) is:

s(w_t, w_{t+j}) = \frac{1}{|G_t|} \sum_{g_t \in G_t} z_{g_t}^{\top} v_{w_{t+j}}    (3)

where G_t is the decomposed set of n-grams of w_t with elements g_t, z_{g_t} is the vector of n-gram g_t, v_{w_{t+j}} is the output vector of the context word, and |G_t| is the total number of elements of G_t. In general, the n-grams for 3 ≤ n ≤ 6 are extracted from a word, regardless of the subword level or compositionality of the word.

Similarly, we construct a vector representation of a Korean word using the two types of extracted n-grams. We compute the sum of the jamo-level n-gram vectors and the sum of the character-level n-gram vectors, and take the mean over all of them. Let G_{ct} denote the character-level n-grams of w_t and G_{jt} the inter-character jamo-level n-grams; then the scoring function s(w_t, w_{t+j}) is:

s(w_t, w_{t+j}) = \frac{1}{N} \left( \sum_{g_{jt} \in G_{jt}} z_{g_{jt}}^{\top} v_{w_{t+j}} + \sum_{g_{ct} \in G_{ct}} z_{g_{ct}}^{\top} v_{w_{t+j}} \right)    (4)

where z_{g_{jt}} is the vector representation of the jamo-level n-gram g_{jt}, z_{g_{ct}} is that of the character-level n-gram g_{ct}, and N = |G_{ct}| + |G_{jt}| is the total number of character-level and inter-character jamo-level n-grams.
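The scoring function can be sketched in numpy as follows. The names are hypothetical: `ngram_vecs` maps each n-gram to its input vector z, and `context_vec` is the context word's output vector:

```python
import numpy as np

def sisg_score(ngram_vecs, char_grams, jamo_grams, context_vec):
    """Score s(w_t, w_{t+j}): the mean over all subword n-gram vectors of
    the target word, dotted with the context word's output vector."""
    grams = list(char_grams) + list(jamo_grams)
    z = sum(ngram_vecs[g] for g in grams) / len(grams)   # N = |G_ct| + |G_jt|
    return float(np.dot(z, context_vec))
```

During training, the gradient of the loss flows into every n-gram vector of the target word, so frequent subword patterns are shared across all words containing them.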

Corpus
We collect a corpus of Korean documents from various sources to cover a wide range of word usage contexts. The corpora used to train the models are: 1) Korean Wikipedia, 2) online news articles, and 3) the Sejong Corpus. The combined corpus contains 0.12 billion tokens with 638,708 unique words; we discard words that occur fewer than ten times in the entire corpus. Details of the corpus are shown in Table 1.

Sejong Corpus. This is a publicly available corpus2 collected under a national research project named the "21st Century Sejong Project". The corpus was developed from 1998 to 2007, and contains formal text (newspapers, dictionaries, novels, etc.) and informal text (transcriptions of TV shows and radio programs, etc.). Thus, the corpus covers topics and contexts of language usage which could not be covered by Wikipedia or news articles. We exclude some documents containing unnatural sentences, such as POS-tagged sentences.
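The frequency cutoff described above amounts to a simple vocabulary filter; a sketch, with names of our own choosing:

```python
from collections import Counter

def build_vocab(tokens, min_count=10):
    """Keep only words occurring at least min_count times in the corpus."""
    counts = Counter(tokens)
    return {w for w, c in counts.items() if c >= min_count}
```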

Evaluation Tasks and Datasets
We evaluate the performance of word vectors on a word similarity task and a word analogy task. However, to the best of our knowledge, there is no Korean evaluation dataset for either task, so we first develop the evaluation datasets. We also test the word vectors on sentiment analysis.

Word Similarity Evaluation Dataset
Translating the test set. We develop a Korean version of the word similarity evaluation set. Two graduate students who are native Korean speakers translated the English word pairs in WS-353 (Finkelstein et al., 2001). Then, 14 native Korean speakers annotated the similarity between the translated pairs, giving scores from 0 to 10 while following written instructions; the original English instructions were translated into Korean as well. Among the 14 scores for each pair, we exclude the minimum and maximum scores and compute the mean of the rest. The correlation between the original scores and the annotated scores of the translated pairs is .82, which indicates that the translations are sufficiently reliable; we attribute the difference to linguistic and cultural differences. We make the Korean version of WS-353 publicly available.3
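The score aggregation described above (drop the single minimum and the single maximum among the 14 annotations, then average the remaining 12) is a trimmed mean; a small sketch:

```python
def aggregate(scores):
    """Trimmed mean: discard one minimum and one maximum score,
    then average the rest."""
    ordered = sorted(scores)
    trimmed = ordered[1:-1]
    return sum(trimmed) / len(trimmed)
```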

Word Analogy Evaluation Dataset
We develop word analogy test items to evaluate the performance of word vectors. The evaluation dataset consists of 10,000 items: 5,000 for evaluating semantic features and 5,000 for syntactic features. We also release our word analogy evaluation dataset for future research.

Semantic Feature Evaluation To evaluate the semantic features of word vectors, we refer to the English word analogy test sets (Mikolov et al., 2013a; Gladkova et al., 2016). We cover the features in both sets and translate the items into Korean. The items are clustered into five categories, including miscellaneous items, and each category consists of 1,000 items.
• Capital-Country (Capt.) includes two word pairs representing the relation between a country name and its capital.

Syntactic Feature Evaluation For the syntactic features, we develop Korean-specific test items, rather than trying to cover the existing categories in the original sets (Mikolov et al., 2013a; Gladkova et al., 2016), because most of the syntactic features in those sets are not available in Korean. We develop the test set with expert linguistic knowledge of Korean. The following case is a good example: in Korean, the subject marker is attached to the end of a word, and other case markers are also explicit at the word level, where word level refers to a phrase delimited by whitespace on both sides. Unlike Korean, in English, subjects are determined by their position in a sentence (i.e., the subject comes before the verb), so case is not explicitly marked on the word. Similarly, there are other important and unique syntactic features of the Korean language, of which we choose the following five categories to evaluate the word vectors:

• Case contains various case markers attached to common nouns. This evaluates case in Korean, which is represented within the word level: 교수 (professor) : 교수가 (professor + case marker 가)

Sentiment Analysis
We perform a binary sentiment classification task to evaluate the word vectors. Given a sequence of words, the trained classifier should predict the sentiment from the inputs while keeping the input word vectors fixed.

Dataset We use the Naver Sentiment Movie Corpus4. Scraped from the Korean portal site Naver, the dataset contains 200K movie reviews. Each review is no longer than 140 characters and carries a binary sentiment label (1 for positive and 0 for negative). The classes are balanced, with 100K positive and 100K negative reviews in total. We sample from the dataset for the training (100K), validation (25K), and test (25K) sets; the class ratio within each set remains balanced. Although we apply simple preprocessing to strip out punctuation and emoticons, the dataset remains noisy, with typos, segmentation errors, and abnormal word usage, since its original source is raw comments from a portal site.
Classifier To build the sentiment classifier, we adopt a single-layer LSTM with 300 hidden units and a dropout rate of 0.5. Given the final state of the LSTM, a sigmoid activation function is applied for output prediction. We use the cross-entropy loss and optimize parameters with the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.001.
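A from-scratch numpy sketch of the forward pass of such a classifier (in practice one would use a deep learning framework; the dimensions, parameter names, and gate stacking here are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_sentiment(x_seq, W, U, b, w_out, b_out):
    """Single-layer LSTM over fixed word vectors; a sigmoid on the final
    hidden state gives P(positive). Gates are stacked as [i; f; o; g]."""
    h_dim = len(w_out)
    h = np.zeros(h_dim)
    c = np.zeros(h_dim)
    for x in x_seq:
        z = W @ x + U @ h + b              # stacked gate pre-activations
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)         # cell update
        h = o * np.tanh(c)                 # hidden state
    return float(sigmoid(np.dot(w_out, h) + b_out))
```

Because the word vectors are held fixed, only the LSTM and output parameters are trained, so the task directly measures how much sentiment signal the pretrained vectors carry.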

Comparison Models
We compare the performance of our model to comparison models: word-level, character-level, and jamo-level Skip-Gram models trained with negative sampling. Hyperparameters of each model are tuned on the word similarity task, and we fix the number of training epochs to 5.

Skip-Gram (SG) We first compare with the word-level Skip-Gram model (Mikolov et al., 2013a), where a unique vector is assigned to every unique word in the corpus. We set the number of dimensions to 300, the number of negative samples to 5, and the window size to 5.

Character-level Skip-Gram (SISG(ch)) splits words into character-level n-grams based on the subword information skip-gram (Bojanowski et al., 2017). We set the number of dimensions to 300, the number of negative samples to 5, and the window size to 5; n is set to 2-4.

Jamo-level Skip-Gram with Empty Jongsung Symbol (SISG(jm)) splits words into jamo-level n-grams based on the subword information skip-gram (Bojanowski et al., 2017); in addition, if a character lacks a jongsung, the symbol e is added. We set the number of dimensions to 300, the number of negative samples to 5, and the window size to 5; n is set to 3-6. Note that setting n=3-6 and adding the jongsung symbol makes this model a specific case of our model, containing jamo-level n-grams (n=3-6) and character-level n-grams (n=1-2) as well.

Optimization
To train our model, we apply stochastic gradient descent with a linearly scheduled learning rate decay; the initial learning rate is set to .025. To speed up training, we train the vectors in parallel with shared parameters, which are updated asynchronously.
For our model, we set n of the character-level n-grams to 1-4 or 1-6, and n of the inter-character jamo-level n-grams to 3-5; we name these models SISG(ch4+jm) and SISG(ch6+jm), respectively. The number of dimensions is set to 300, the window size to 5, and the number of negative samples to 5. We train our model for 5 epochs over the training corpus.

Results
Word Similarity. We report Spearman's correlation coefficient between human judgments and the model's cosine similarity for the word pairs; Fig. 2 presents the results. For the word-level Skip-Gram, Spearman's correlation is .599. If we instead decompose words into character n-grams to construct word vectors (SISG(ch)), performance improves considerably, to .658, indicating that decomposing words is itself helpful for learning good word vectors in Korean, a morphologically rich language. Moreover, if the words are decomposed to a deeper level (SISG(jm)), performance is further improved to .671.

Table 2: Performance of our method and comparison models. The average cosine distance for each category in the word analogy task is reported. Overall, our model outperforms comparison models, showing a smaller distance between the predicted vector a + b − c and the target vector d (a:b=c:d). Performance improves most on syntactic analogies.
Next, the addition of an empty jongsung symbol e to the jamo sequence, which reflects Korean-specific linguistic regularities, improves the quality of word vectors: SISG(jm), a specific case of our model, shows a higher correlation coefficient than the other baselines. Lastly, when we extend the number of characters learned within a word to 4 or 6, our models outperform the others.

Word Analogy. In general, given an item a:b=c:d and corresponding word vectors u_a, u_b, u_c, u_d, the vector u_a + u_b − u_c is used to compute cosine distances to the other vectors. The vectors are then ranked by distance in ascending order, and if u_d is found at the top, the item is counted as correct. Top-1 accuracy or error rate per category is the most frequently used metric for this task; in our case, however, these rank-based measures may not be appropriate, since the total number of unique n-grams (e.g., SISG) or unique words (e.g., SG) over the same corpus differs greatly between models. For a fair comparison, we directly report the cosine distance between the vector u_a + u_b − u_c and u_d for each category, rather than evaluating vector ranks. Formally, given an item a:b=c:d, we compute the 3COSADD-based metric:

d(a, b, c, d) = 1 − \frac{(u_a + u_b − u_c) \cdot u_d}{\lVert u_a + u_b − u_c \rVert \, \lVert u_d \rVert}

We report the average of this distance over each category.

In semantic analogies, decomposing words into characters helps little for learning semantic features. However, jamo-level n-grams do help represent overall semantic features, and our model shows higher performance than the baseline models. One exception is the Name-Nationality category, since it mainly consists of items including proper nouns, and decomposing these nouns does not help learn the semantic features of the word. For example, the semantic features of the words '간디' (Gandhi) and '인도' (India) obviously cannot be derived from those of the characters or jamo n-grams comprising them.
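The reported metric, the cosine distance between the predicted vector u_a + u_b − u_c and the target u_d, can be sketched as follows (the function and argument names are ours):

```python
import numpy as np

def analogy_distance(emb, a, b, c, d):
    """Cosine distance between the predicted vector u_a + u_b - u_c and
    the target u_d for an analogy item a:b=c:d (lower is better)."""
    pred = emb[a] + emb[b] - emb[c]
    cos = np.dot(pred, emb[d]) / (np.linalg.norm(pred) * np.linalg.norm(emb[d]))
    return 1.0 - cos
```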
On the other hand, decomposing words does help to learn syntactic features in all categories, and decomposing a word to even deeper levels makes learning those features more effective. Our model outperforms all other baselines, and the reduction in cosine distance relative to the word-level Skip-Gram is larger than in the semantic categories. Korean is an agglutinative language in which character-level syntactic affixes are attached to the root of the word, and their combination determines the word's final form; the form can also be reduced through jamo-level transformation. This is the main reason we can learn the syntactic features of Korean words when we decompose a word into character and jamo levels simultaneously. We observe a similar tendency when using the 3COSMUL distance metric (Levy and Goldberg, 2014).

Sentiment Analysis. We report accuracy, loss, precision, recall, and F1 score for the binary sentiment classification task on the test set. Although overall performance is homogeneous, our method, which decomposes a word into 1-6 character n-grams and 3-5 jamo n-grams, shows slightly higher performance than the comparison models, including character-level SISG and jamo-level SISG. On the other hand, the word-level Skip-Gram shows an F1 score comparable to our model, and even higher than the other comparison models. This is because the dataset contains a significant number of proper nouns, such as movie and actor names, whose semantics are captured better by word-level representations, as shown in the word analogy task.

Table 3: Performance of the sentiment classification task. The model with 3-5 jamo n-grams and 1-6 character n-grams shows slightly higher performance in terms of accuracy and F1 score than the comparison models.

Table 4: Spearman's correlation coefficient on the word similarity task by n-gram size of jamos and characters. Performance improves with 3-5 grams of jamos and 1-4 or 1-6 grams of characters.
Effect of the Size n in Both n-grams. Table 4 shows the word similarity performance for each range of inter-character jamo-level n-grams and character-level n-grams. For the jamo-level n-grams, including n=5,6 and excluding bigrams yields higher performance. Meanwhile, for the character-level n-grams, including every character n-gram when decomposing a word does not guarantee a performance improvement. Since most Korean words consist of no more than 6 characters (97.2% of the corpus), a maximum of n=6 for character n-grams appears large enough to learn word vectors. In addition, words with no more than 4 characters account for 82.6% of the corpus, so n=4 is sufficient to learn character n-grams as well.

Conclusion and Discussions
In this paper, we presented how to decompose a Korean character into a sequence of jamos with empty jongsung symbols, and how to extract character-level n-grams and inter-character jamo-level n-grams from that sequence. Both types of n-grams construct a word vector representation by averaging the n-gram vectors, and these vectors are trained with the subword-level information Skip-Gram. Before evaluating the vectors, we developed test sets for Korean word similarity and word analogy tasks. We demonstrated the effectiveness of the learned word vectors in capturing semantic and syntactic information by evaluating them on these tasks. In particular, vectors using both jamo- and character-level information can represent syntactic features more precisely, even in an agglutinative language. Furthermore, the sentiment classification results indicate that the representational power of the vectors contributes positively to downstream NLP tasks.
Decomposing a Korean word into jamo-level or character-level unigrams helps capture syntactic information. For example, Korean words add a character to the root of the word (e.g., '-은' for the subjective case, '-었' for past tense, '-시-' for honorifics, '-히-' for voice, and '-고-' for the verb ending form). The composed word can then be reduced to fewer characters by transforming jamos, such as '되었다' to '됐다'; hence, the inter-character jamo-level n-grams also help capture these features. On the other hand, larger n-grams such as character-level trigrams will learn the unique meaning of a word, since those larger components mostly occur only with that word. By leveraging both features, our method produces word vectors that reflect linguistic features effectively and thus outperforms previous word-level approaches.
Since Korean words are divisible once more into the grapheme level, resulting in an even longer jamo sequence for a given word, we plan to explore the potential applicability of deeper levels of subword information in Korean. Meanwhile, we will further train our model on noisy data and investigate how it deals with noisy words. Informal Korean text commonly contains intentional typos ('맛잇다', 'delicious' with a typo), stand-alone jamos used as characters ('ㅋㅋ', 'lol'), and segmentation errors ('같이가다', 'go together' without a space). Since these errors occur frequently, it is important to apply the vectors to training NLP models on real-world data. We plan to apply these vectors to various neural network based NLP models, such as conversation modeling. Lastly, since our method can capture Korean syntactic features through jamo and character n-grams, we can apply the same idea to other tasks such as POS tagging and parsing.