A Sub-Character Architecture for Korean Language Processing

We introduce a novel sub-character architecture that exploits a unique compositional structure of the Korean language. Our method decomposes each character into a small set of primitive phonetic units called jamo letters from which character- and word-level representations are induced. The jamo letters divulge syntactic and semantic information that is difficult to access with conventional character-level units. They greatly alleviate the data sparsity problem, reducing the observation space to 1.6% of the original while increasing accuracy in our experiments. We apply our architecture to dependency parsing and achieve dramatic improvement over strong lexical baselines.


Introduction
Korean is generally recognized as a language isolate: that is, it has no apparent genealogical relationship with other languages (Song, 2006; Campbell and Mixco, 2007). A unique feature of the language is that each character is composed of a small, fixed set of basic phonetic units called jamo letters. Despite the important role jamo plays in encoding syntactic and semantic information of words, it has been neglected in existing modern Korean processing algorithms. In this paper, we bridge this gap by introducing a novel compositional neural architecture that explicitly leverages the sub-character information.
Specifically, we perform Unicode decomposition on each Korean character to recover its underlying jamo letters and construct character- and word-level representations from these letters. See Figure 1 for an illustration of the decomposition. The decomposition is deterministic; this is a crucial departure from previous work that uses language-specific sub-character information such as radicals (graphical components of Chinese characters). The radical structure of a Chinese character does not follow any systematic process, requiring an incomplete dictionary mapping between characters and radicals to take advantage of this information (Sun et al., 2014; Yin et al., 2016). In contrast, our Unicode decomposition does not need any supervision and can extract correct jamo letters for all possible Korean characters.

Figure 1: Korean sentence "산을 갔다" (I went to the mountain) decomposed to words, characters, and jamos.
Our jamo architecture is fully general and can be plugged into any Korean processing network. For a concrete demonstration of its utility, in this work we focus on dependency parsing. McDonald et al. (2013) note that "Korean emerges as a very clear outlier" in their cross-lingual parsing experiments on the universal treebank, implying a need to tailor a model for this language isolate. Because of its compositional morphology, Korean suffers extreme data sparsity at the word level: 2,703 out of 4,698 word types (> 57%) in the held-out portion of our treebank are OOV. This makes the language challenging for simple lexical parsers even when augmented with a large set of pre-trained word representations.
While such data sparsity can also be alleviated by incorporating more conventional character-level information, we show that incorporating jamo is an effective and economical new approach to combating the sparsity problem for Korean. In experiments, we decisively improve the LAS of the lexical BiLSTM parser of Kiperwasser and Goldberg (2016) from 82.77 to 91.46 while reducing the size of the input space by 98.4% when we replace words with jamos. As a point of reference, a strong feature-rich parser using gold POS tags obtains 88.61.
To summarize, we make the following contributions.
• To our knowledge, this is the first work that leverages jamo in end-to-end neural Korean processing. To this end, we develop a novel sub-character architecture based on deterministic Unicode decomposition.
• We perform extensive experiments on dependency parsing to verify the utility of the approach. We show a clear performance boost with a drastically smaller set of parameters. Our final model outperforms strong baselines by a large margin.
• We release an implementation of our jamo architecture which can be plugged into any Korean processing network. 1

Related Work
We make a few additional remarks on related work to better situate our contribution. Our work follows the successful line of research on incorporating sub-lexical information into neural models. Various character-based architectures have been proposed. For instance, Ma and Hovy (2016) and Kim et al. (2016) use CNNs over characters whereas Lample et al. (2016) use bidirectional LSTMs (BiLSTMs). Both approaches have been shown to be profitable; we employ a BiLSTM-based approach. Many previous works have also considered morphemes to augment lexical models (Luong et al., 2013; Botha and Blunsom, 2014; Cotterell et al., 2016). Sub-character models are substantially rarer; an extreme case is considered by Gillick et al. (2016), who process text as a sequence of bytes. We believe that such byte-level models are too general and that there are opportunities to exploit natural sub-character structure for certain languages such as Korean and Chinese.
There exists a line of work on exploiting graphical components of Chinese characters called radicals (Sun et al., 2014; Yin et al., 2016). For instance, 足 (foot) is the radical of 跑 (run). While related, our work on Korean is distinguished in critical ways and should not be thought of as just an extension to another language. First, as mentioned earlier, the compositional structure is fundamentally different between Chinese and Korean. The mapping between radicals and characters in Chinese is nondeterministic and can only be loosely approximated by an incomplete dictionary. In contrast, the mapping between jamos and Korean characters is deterministic (Section 3.1), allowing for systematic decomposition of all possible Korean characters. Second, the previous work on Chinese radicals was concerned with learning word embeddings. We develop an end-to-end compositional model for a downstream task: parsing.

Jamo Structure of the Korean Language
Let $\mathcal{W}$ denote the set of word types and $\mathcal{C}$ the set of character types. In many languages, $c \in \mathcal{C}$ is the most basic unit that is meaningful. In Korean, each character is further composed of a small fixed set of phonetic units called jamo letters $\mathcal{J}$, where $|\mathcal{J}| = 51$. The jamo letters are categorized as head consonants $\mathcal{J}_h$, vowels $\mathcal{J}_v$, or tail consonants $\mathcal{J}_t$. The composition is completely systematic. Given any character $c \in \mathcal{C}$, there exist $c_h \in \mathcal{J}_h$, $c_v \in \mathcal{J}_v$, and $c_t \in \mathcal{J}_t$ such that their composition yields $c$. Conversely, any $c_h \in \mathcal{J}_h$, $c_v \in \mathcal{J}_v$, and $c_t \in \mathcal{J}_t$ can be composed to yield a valid character $c \in \mathcal{C}$.
As an example, consider the word 갔다 (went). It is composed of two characters, 갔, 다 $\in \mathcal{C}$. Each character is furthermore composed of three jamo letters as follows:
• 갔 $\in \mathcal{C}$ is composed of ㄱ $\in \mathcal{J}_h$, ㅏ $\in \mathcal{J}_v$, and ㅆ $\in \mathcal{J}_t$.
• 다 $\in \mathcal{C}$ is composed of ㄷ $\in \mathcal{J}_h$, ㅏ $\in \mathcal{J}_v$, and an empty letter $\emptyset \in \mathcal{J}_t$.
The tail consonant can be empty; we assume a special symbol $\emptyset \in \mathcal{J}_t$ to denote an empty letter. Figure 1 illustrates the decomposition of a Korean sentence down to jamo letters. Note that the number of possible characters is combinatorial in the number of jamo letters, loosely upper bounded by $51^3 = 132{,}651$. This upper bound is loose because certain combinations are invalid: not every jamo letter can occupy all three positions.
The combinatorial nature of Korean characters motivates the compositional architecture below. For completeness, we describe the entire forward pass of the transition-based BiLSTM parser of Kiperwasser and Goldberg (2016) that we use in our experiments.
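The counting argument in this section can be checked with a few lines of arithmetic. The sketch below compares the loose bound of 51 cubed with the number of precomposed syllables in the Unicode Hangul Syllables block. The position counts 19, 21, and 28 (27 tail consonants plus the empty tail) are taken from the Unicode standard rather than from this paper and should be read as assumptions of the example.

```python
# Loose bound from the text: any of the 51 jamo letters in each of 3 positions.
NUM_JAMO = 51
loose_bound = NUM_JAMO ** 3

# Assumption (Unicode Hangul Syllables block, not stated in the paper):
# 19 head consonants, 21 vowels, 27 tail consonants plus the empty tail.
NUM_HEADS, NUM_VOWELS, NUM_TAILS = 19, 21, 28
valid_syllables = NUM_HEADS * NUM_VOWELS * NUM_TAILS

# The block spans U+AC00..U+D7A3; its size should equal the product above.
block_size = 0xD7A3 - 0xAC00 + 1

print(loose_bound, valid_syllables, block_size)
```

Under these assumptions the valid combinations are less than a tenth of the loose bound, which is consistent with the remark that the bound is far from tight.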

Jamo Architecture
The parameters associated with the jamo layer are
• Embedding $e^l \in \mathbb{R}^d$ for each letter $l \in \mathcal{J}$
Given a Korean character $c \in \mathcal{C}$, we perform Unicode decomposition (Section 3.3) to recover the underlying jamo letters $c_h, c_v, c_t \in \mathcal{J}$. We compose the embeddings of these letters to induce a representation of $c$. This representation is then concatenated with a character-level lookup embedding, and the result is fed into an LSTM to produce a word representation. We use an LSTM (Hochreiter and Schmidhuber, 1997) simply as a mapping $\phi : \mathbb{R}^{d_1} \times \mathbb{R}^{d_2} \to \mathbb{R}^{d_2}$ that takes an input vector $x$ and a state vector $h$ to output a new state vector $h' = \phi(x, h)$.
Given a word $w \in \mathcal{W}$ and its character sequence $c_1 \ldots c_m \in \mathcal{C}$, we run the LSTM over the concatenated character representations and induce a representation of $w$ from the resulting states. Lastly, this representation is concatenated with a word-level lookup embedding (which can be initialized with pre-trained word embeddings), and the result is fed into a BiLSTM network. The parameters associated with this layer are
• Embedding $e^w \in \mathbb{R}^{d_W}$ for each $w \in \mathcal{W}$
• Two-layer BiLSTM $\Phi$ that maps $h_1 \ldots h_n \in \mathbb{R}^{d+d_W}$ to $z_1 \ldots z_n \in \mathbb{R}^{d^*}$
• Feedforward network for predicting transitions
Given a sentence $w_1 \ldots w_n \in \mathcal{W}$, the final $d^*$-dimensional word representations are given by $(z_1 \ldots z_n) = \Phi(h_1 \ldots h_n)$, where $h_i$ is the concatenated input vector for $w_i$. The parser then uses the feedforward network to greedily predict transitions based on words that are active in the system. The model is trained end-to-end by optimizing a max-margin objective. Since this part is not a contribution of this paper, we refer to Kiperwasser and Goldberg (2016) for details.
By setting the embedding dimension of jamos $d$, characters $d'$, or words $d_W$ to zero, we can configure the network to use any combination of these units. We report these experiments in Section 4.
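To make the jamo layer more concrete, the following is a minimal sketch of one way to compose three jamo embeddings into a character vector. The exact composition function is not reproduced in the text above, so the form used here (a sum of position-specific linear maps followed by tanh) is an assumption, and the class name `JamoComposer` and its methods are hypothetical. It uses only the Python standard library and random initial weights; it is not the released implementation.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class JamoComposer:
    """Hypothetical jamo layer: tanh of summed per-position linear maps."""

    def __init__(self, letters, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # One d-dimensional embedding per jamo letter (None = empty tail).
        self.emb = {
            letter: [rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for letter in letters
        }
        # Assumed: separate d x d maps for head, vowel, and tail positions.
        self.maps = [
            [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]
            for _ in range(3)
        ]
        self.bias = [0.0] * dim

    def compose(self, head, vowel, tail):
        """Return a d-dimensional character vector from three jamo letters."""
        total = list(self.bias)
        for position_map, letter in zip(self.maps, (head, vowel, tail)):
            projected = matvec(position_map, self.emb[letter])
            total = [t + p for t, p in zip(total, projected)]
        return [math.tanh(t) for t in total]


# Usage: the character 다 has head U+1103, vowel U+1161, and an empty tail.
composer = JamoComposer(["\u1100", "\u1103", "\u1161", "\u11BB", None], dim=4)
char_vector = composer.compose("\u1103", "\u1161", None)
```

In a full model the resulting vector would be concatenated with a character lookup embedding and passed to the LSTM described above; setting a dimension to zero would correspond to omitting that unit.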

Unicode Decomposition
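Our architecture requires dynamically extracting jamo letters given any Korean character. This is achieved by simple Unicode manipulation: the code point of a character determines the indices of its head consonant, vowel, and tail consonant, where $c_t$ is set to $\emptyset$ if its tail index $T(c_t) = 0$.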
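As an illustration of this kind of Unicode manipulation, the sketch below applies the standard Hangul syllable arithmetic defined by the Unicode standard. The constant names and the function `decompose` are choices made for this example and are not taken from the paper or its released code; the empty tail is represented by `None`.

```python
# Constants from the Unicode standard's Hangul syllable decomposition.
S_BASE = 0xAC00   # first precomposed syllable
L_BASE = 0x1100   # first conjoining head consonant
V_BASE = 0x1161   # first conjoining vowel
T_BASE = 0x11A7   # one below the first conjoining tail consonant
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per head consonant
S_COUNT = 19 * N_COUNT        # 11172 precomposed syllables


def decompose(char):
    """Return (head, vowel, tail) for one precomposed Hangul syllable.

    The tail is None when the syllable has no tail consonant.
    """
    s_index = ord(char) - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable: %r" % char)
    head = chr(L_BASE + s_index // N_COUNT)
    vowel = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT
    tail = chr(T_BASE + t_index) if t_index else None
    return head, vowel, tail


# Usage: decompose the two characters of the word meaning "went".
went = [decompose(c) for c in "\uAC14\uB2E4"]
```

Because the mapping is pure integer arithmetic on code points, it needs no dictionary and covers every precomposed syllable, which is the property the architecture relies on.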

Why Use Jamo Letters?
The most obvious benefit of using jamo letters is alleviating data sparsity by flattening the combinatorial space of Korean characters. We discuss some additional explicit benefits. First, jamo letters often indicate syntactic properties of words. For example, a tail consonant ㅆ strongly implies that the word is a past tense verb as in 갔다 (went), 왔다 (came), and 했다 (did). Thus a jamo-level model can identify unseen verbs more effectively than word- or character-level models. Second, jamo letters dictate the sound of a character. For example, 갔 is pronounced as got because the head consonant ㄱ is associated with the sound g, the vowel ㅏ with o, and the tail consonant ㅆ with t. This is clearly critical for speech recognition/synthesis and indeed has been investigated in the speech community (Lee et al., 1994; Sakti et al., 2010). While speech processing is not our focus, the phonetic signals can capture useful lexical correlation (e.g., for onomatopoeic words).
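The past-tense example can be made tangible with a small check for the shared tail consonant. The sketch below uses canonical decomposition (NFD) from Python's `unicodedata` module to expose the tail of each syllable; the function names are hypothetical, and the check is only an illustration of the surface cue, not a tense classifier.

```python
import unicodedata

# Conjoining tail consonant that corresponds to the letter ㅆ.
SSANG_SIOS_TAIL = "\u11BB"


def tail_of(syllable):
    """Return the tail consonant of a Hangul syllable, or None if absent."""
    jamo = unicodedata.normalize("NFD", syllable)
    return jamo[2] if len(jamo) == 3 else None


def contains_tail(word, tail=SSANG_SIOS_TAIL):
    """True if any syllable of the word carries the given tail consonant."""
    word = unicodedata.normalize("NFC", word)
    return any(tail_of(syllable) == tail for syllable in word)


# Usage: the three verbs from the text differ as words and as characters,
# yet a jamo-level view exposes one shared unit.
verbs = ["갔다", "왔다", "했다"]
flags = [contains_tail(v) for v in verbs]
```

A word-level or character-level model sees three unrelated symbols here, whereas a jamo-level model receives the same tail embedding in all three cases.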

Experiments
Data. We use the publicly available Korean treebank in the universal treebank version 2.0 (McDonald et al., 2013). 2 The dataset comes with a train/development/test split; data statistics are shown in Table 1. Since the test portion is significantly smaller than the dev portion, we report performance on both.
As expected, we observe severe data sparsity with words: 24,814 out of 31,060 elements in the vocabulary appear only once in the training data. On the dev set, about 57% of word types and 3% of character types are OOV. Upon Unicode decomposition, we obtain 48 jamo types, none of which is OOV in the dev set.
2 https://github.com/ryanmcd/uni-dep-tb
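The type-level OOV statistics reported here can be computed as sketched below. The toy word lists are invented for illustration and are not drawn from the treebank; `type_oov_rate` and `jamo_units` are hypothetical helper names, and jamo extraction relies on NFD normalization from the standard library.

```python
import unicodedata


def type_oov_rate(train_units, heldout_units):
    """Fraction of held-out types that never occur in the training data."""
    train_types = set(train_units)
    heldout_types = set(heldout_units)
    if not heldout_types:
        return 0.0
    return len(heldout_types - train_types) / len(heldout_types)


def jamo_units(words):
    """Flatten words into conjoining jamo letters via canonical decomposition."""
    return [j for w in words for j in unicodedata.normalize("NFD", w)]


# Toy data (an assumption of this example, not treebank content).
train_words = ["갔다", "왔다"]
heldout_words = ["했다"]

word_rate = type_oov_rate(train_words, heldout_words)
jamo_rate = type_oov_rate(jamo_units(train_words), jamo_units(heldout_words))
```

Even on this tiny example the held-out word is entirely unseen at the word level while most of its jamo letters have been observed, which mirrors the sparsity pattern described above.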
Implementation and baselines. We implement our jamo architecture using the DyNet library (Neubig et al., 2017) and plug it into the BiLSTM parser of Kiperwasser and Goldberg (2016). 3 For Korean syllable manipulation, we use the freely available toolkit by Joshua Dong. 4 We train the parser for 30 epochs and use the dev portion for model selection. We compare our approach to the following baselines:
• McDonald13: A cross-lingual parser originally reported in McDonald et al. (2013).
• Yara: A beam-search transition-based parser of Rasooli and Tetreault (2015) based on the rich non-local features in Zhang and Nivre (2011). We use beam width 64. We use 5-fold jackknifing on the training portion to provide POS tag features. We also report on using gold POS tags.
• K&G16: The basic BiLSTM parser of Kiperwasser and Goldberg (2016) without the sublexical architecture introduced in this work.
• Stack LSTM: A greedy transition-based parser based on stack LSTM representations. Dyer15 denotes the word-level variant. Ballesteros15 denotes the character-level variant.
For pre-trained word embeddings, we apply the spectral algorithm of Stratos et al. (2015).
We observe decisive improvement when we incorporate sub-lexical information into the parser of K&G16. In fact, a strictly sub-lexical parser using only jamos or characters clearly outperforms its lexical counterpart despite the fact that the model is drastically smaller (e.g., 90.77 with 500 × 100 jamo embeddings vs 82.77 with 298,115 × 100 word embeddings). Notably, jamos alone achieve 91.46, which is not far behind the best result of 92.31 obtained by using word, character, and jamo units in conjunction. This demonstrates that our compositional architecture learns to build effective representations of Korean characters and words for parsing from a minuscule set of jamo letters.

Discussion of Future Work
We have presented a natural sub-character architecture to model the unique compositional orthography of the Korean language. The architecture induces word-/sentence-level representations from a small set of phonetic units called jamo letters. This is enabled by efficient and deterministic Unicode decomposition of characters.
We have focused on dependency parsing to demonstrate the utility of our approach as an economical and effective way to combat data sparsity. However, we believe that the true benefit of this architecture will be more evident in speech processing, as jamo letters are definitions of sound in the language. Another potentially interesting application is informal text on the internet. Ill-formed words such as ㅎㅎㅎ (shorthand for 하하하, an onomatopoeic expression of laughter) and ㄴㄴ (shorthand for 노노, a transcription of no no) are omnipresent in social media. The jamo architecture can be useful in this scenario, for instance by correlating ㅎㅎㅎ and 하하하, which might otherwise be treated as independent.
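One simple way such a correlation could surface is through the head consonants that the shorthand and the full form share. The sketch below is only an illustration of that idea, not part of the proposed architecture: it uses compatibility decomposition (NFKD) from Python's `unicodedata` module to map both standalone jamo letters and full syllables to conjoining jamo, and the function name `head_consonants` is hypothetical.

```python
import unicodedata


def head_consonants(text):
    """Map each character to the first jamo of its decomposition.

    Standalone compatibility jamo such as U+314E decompose to the matching
    conjoining head consonant, and full syllables decompose to
    head + vowel (+ tail), so both forms yield the same head letter.
    """
    text = unicodedata.normalize("NFC", text)
    heads = []
    for char in text:
        decomposed = unicodedata.normalize("NFKD", char)
        heads.append(decomposed[0])
    return "".join(heads)


# Usage: the shorthand forms and their full spellings from the text.
laugh_short = head_consonants("ㅎㅎㅎ")
laugh_full = head_consonants("하하하")
no_short = head_consonants("ㄴㄴ")
no_full = head_consonants("노노")
```

A learned jamo-level model would not need this explicit rule, but the example shows that the shared signal is present in the input once characters are viewed as jamo sequences.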