Korean Morphological Analysis with Tied Sequence-to-Sequence Multi-Task Model

Korean morphological analysis has been regarded as a sequence of morpheme processing and POS tagging, so previous studies have widely adopted a pipeline model of the two tasks. However, a pipeline model cannot exploit the interactions between the tasks. This paper formulates Korean morphological analysis as a combination of the tasks and presents a tied sequence-to-sequence multi-task model that trains the two tasks simultaneously without any explicit regularization. The experiments show that the proposed model achieves state-of-the-art performance.


Introduction
Korean is an agglutinative language (Song, 2006). Thus, analyzing the grammatical structure of eojeols is a fundamental step in understanding a sentence, where an eojeol is a linguistic unit delimited by white space. An eojeol is composed of one or more morphemes. As a result, one eojeol can be analyzed into several morpheme combinations depending on the context, and these combinations carry different part-of-speech (POS) tags. In addition, some morphemes have a surface form within an eojeol that differs from their base form. Therefore, the goal of a Korean morphological analyzer is not only to decompose and recover morphemes from eojeols precisely (morpheme processing), but also to assign POS tags to the decomposed and/or recovered morphemes accurately according to the context (POS tagging).
Traditional approaches to Korean morphological analysis have adopted a pipeline model of morpheme processing and POS tagging (Lee and Rim, 2009; Na, 2015; Choi et al., 2016; Matteson et al., 2018; Song and Park, 2018). That is, they first decompose and recover morphemes from eojeols or assign so-called POSMORPH tags (Heigold et al., 2016), and then an actual POS tag sequence is determined or resolved from the POSMORPH tags using a sequential labeling algorithm. However, this pipeline model has two weaknesses. One is that errors from morpheme processing are apt to propagate to POS tagging, and the other is that it is difficult to model the mutual interactions between morpheme processing and POS tagging under the pipeline model.
Figure 1: Korean morphological analysis of the sentence "나는 하늘을 나는 새를 봤다", which means "I saw a bird flying in the sky". Correct morpheme analysis helps predict POS tags (blue dotted arrows), while POS tagging affects morpheme analysis (red arrows). Orange morphemes in the morpheme sequence are recovered morphemes. Best viewed in color.
In Korean, morpheme processing and POS tagging affect each other. That is, correct morpheme analysis helps POS tagging, and POS tags are helpful in analyzing morphemes. Figure 1 shows such an example. The second eojeol '하늘을 (in the sky)' is composed of two morphemes, '하늘 (sky)' and '을 (objective postposition)'. If '하늘을' is correctly decomposed into '하늘' and '을', their POS tags can be predicted easily. This is the information flow on which previous studies focus. On the other hand, the eojeol '나는', which appears twice in the figure, is morphologically ambiguous and can be analyzed in two ways. If morpheme processing obtains some information from POS tagging, it can decompose such ambiguous eojeols correctly. That is, the first '나는' is decomposed into '나 (I)' and '는 (topic postposition)', while the second '나는' is decomposed into '날 (fly)' and '는 (verbal ending)'. Therefore, morpheme processing and POS tagging should be trained simultaneously.
This paper proposes a model that trains morpheme processing and POS tagging simultaneously for Korean morphological analysis. The proposed model regards morpheme processing and POS tagging as individual tasks, and the two tasks are jointly trained under a sequence-to-sequence multi-task framework (Luong et al., 2016; Anastasopoulos and Chiang, 2018). The main characteristic of the proposed model is that a single decoder both generates a morpheme sequence and assigns POS tags to the generated morphemes. As a result, the decoder shares general representations of both tasks and forces a one-to-one mapping between morphemes and POS tags without additional regularization. Morpheme processing is accomplished by a pointer-generator network (See et al., 2017), while POS tagging is done by a CRF network (Huang et al., 2015; Lample et al., 2016) that utilizes the information from morpheme processing. Our experimental results show that the proposed model outperforms existing Korean morphological analyzers and achieves state-of-the-art performance.

Morpheme Processing and POS Tagging
Given an input sequence $x = (w_1, \ldots, w_n)$ where $w_i$ is the $i$-th eojeol, Korean morphological analysis aims to produce an output sequence $y = ((m_1, t_1), \ldots, (m_k, t_k))$ where $m_j$ is the $j$-th morpheme and $t_j$ is its POS tag. Korean morphological analysis is different from existing NLP tasks such as English POS tagging (Toutanova et al., 2003; Manning, 2011) and joint word segmentation and POS tagging for Chinese (Zhang and Clark, 2008; Shao et al., 2017; Chen et al., 2017). In English POS tagging, a word and its tag have a one-to-one mapping, so the length of an input sequence is equal to that of an output sequence. In Korean, on the other hand, the two lengths usually differ because an eojeol consists of several morphemes. That is, $n < k$ in general. The key distinguishing factor between Chinese and Korean morphological analysis is that Korean requires lemmatization, which recovers the base form of every morpheme, in addition to morpheme segmentation, while Chinese morphological analysis requires only word segmentation. That is, in general
$$w_1 \oplus \cdots \oplus w_n \neq m_1 \oplus \cdots \oplus m_k,$$
where $\oplus$ is a concatenation operator. According to Figure 1, the eojeol sequence '나는 하늘을 나는 새를 봤다' is different from its morpheme sequence '나는 하늘을 날는 새를 보았다': the third eojeol ('나는' versus '날는') and the fifth eojeol ('봤다' versus '보았다') differ. Given an annotated corpus $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ where $N$ is the number of sentences, a Korean morphological analyzer is obtained by maximizing the conditional probability $p(y|x)$. The conditional probability can be calculated as
$$p(y|x) = p(m, t|x) = \underbrace{p(m|x)}_{\text{Morpheme processing}} \cdot \underbrace{p(t|m, x)}_{\text{POS tagging}} \qquad (1)$$
where $m = (m_1, \ldots, m_k)$ is a morpheme sequence and $t = (t_1, \ldots, t_k)$ is a tag sequence. Equation (1) implies that $p(y|x)$ can be decomposed into two conditional probabilities. One, $p(m|x)$, generates a morpheme sequence $m$ from the input sequence $x$, and the other, $p(t|m, x)$, assigns POS tags given the morpheme sequence $m$ and the input sequence $x$. Therefore, $p(m|x)$ corresponds to morpheme processing, while $p(t|m, x)$ corresponds to POS tagging.
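As a small illustration of the input and output sequences defined above, the Figure 1 sentence can be written out as plain Python data. This is only a sketch: the morpheme split follows the morpheme sequence quoted in the text, while the POS tag names (Sejong-style tags such as NP, JX, VV) are an assumption of this example and do not come from the text.

```python
# Figure 1 sentence as an eojeol sequence x = (w_1, ..., w_n).
eojeols = ["나는", "하늘을", "나는", "새를", "봤다"]

# One list of (morpheme, tag) pairs per eojeol.
# The tag names are assumed Sejong-style tags, used only for illustration.
analysis = [
    [("나", "NP"), ("는", "JX")],
    [("하늘", "NNG"), ("을", "JKO")],
    [("날", "VV"), ("는", "ETM")],
    [("새", "NNG"), ("를", "JKO")],
    [("보", "VV"), ("았", "EP"), ("다", "EF")],
]

# Flatten into the output sequence y and its two views m and t.
y = [pair for pairs in analysis for pair in pairs]
m = [morpheme for morpheme, _ in y]
t = [tag for _, tag in y]
n, k = len(eojeols), len(y)

# Eojeols whose concatenated morphemes differ from the surface form,
# i.e. positions where a base form had to be recovered.
recovered = [
    i
    for i, (w, pairs) in enumerate(zip(eojeols, analysis))
    if "".join(morpheme for morpheme, _ in pairs) != w
]

print(n, k, recovered)
```

Here the input has fewer units than the output, and the two positions listed in `recovered` are the eojeols where concatenating the morphemes does not reproduce the surface string.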

Linguistic Unit in Morphological Analysis
An eojeol is the unit of spacing in Korean sentences. However, it is inappropriate to use eojeols directly in sequence-to-sequence models, because the number of distinct eojeols is extremely large due to the agglutinative nature of Korean. For instance, there are 624,655 distinct eojeols in the corpus used in the experiments. Such a large eojeol vocabulary causes a complexity problem in morphological analysis. According to Korean orthography, an eojeol is a sequence of syllables. For instance, the eojeol '하늘을' is a sequence of three syllables, '하', '늘', and '을'. A morpheme in Korean is also a sequence of syllables, like an eojeol. The number of distinct syllables in the corpus above is just 5,245, which is much smaller than that of eojeols. Therefore, this paper adopts the syllable as the unit of sequence-to-sequence models for Korean morphological analysis.
Figure 2: The proposed model for Korean morphological analysis based on a tied sequence-to-sequence multi-task model. Since the syllable is adopted as the unit for the proposed model, the input to the encoder is a syllable sequence s. For clarity's sake, some dependencies are not shown.
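The difference between eojeol units and syllable units can be sketched in a few lines of Python. This is a minimal sketch that assumes the text is NFC-normalized, so that each precomposed Hangul syllable is a single code point; the helper name `to_syllables` is hypothetical.

```python
import unicodedata


def to_syllables(sentence):
    """Split a sentence into syllable units, keeping white spaces as units."""
    # NFC composes Hangul jamo into one code point per syllable.
    return list(unicodedata.normalize("NFC", sentence))


sentence = "나는 하늘을 나는 새를 봤다"
eojeols = sentence.split(" ")
syllables = to_syllables(sentence)

eojeol_vocab = set(eojeols)
syllable_vocab = set(syllables)

print(len(eojeols), len(eojeol_vocab))
print(len(syllables), len(syllable_vocab))
```

On a single sentence the syllable vocabulary is not smaller than the eojeol vocabulary; the advantage described in the text appears at corpus scale, where eojeol types keep growing while the syllable inventory stays bounded.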

Tied Sequence-to-Sequence Multi-Task Model
The proposed model is based on the sequence-to-sequence model with attention (Bahdanau et al., 2015) and extends it for multi-task learning, similarly to the work of Anastasopoulos and Chiang (2018). The proposed model consists of four parts: a recurrent encoder, a recurrent decoder, attention, and task-dependent networks. Figure 2 shows the overall structure of the proposed model. The encoder encodes an input sequence $x$ into a sequence of hidden states. Since the syllable is adopted as the unit for the model, the encoder (a bidirectional LSTM) actually transforms an input syllable sequence $s = (s_1, \ldots, s_{len(x)})$ into a sequence of encoder hidden states $h_1, \ldots, h_{len(x)}$, where $len(\cdot)$ is the number of syllables including white spaces. The attention transforms the encoder hidden states $h$ into a sequence of context vectors $c$ through attention weights. The context vectors capture relevant source-side information for both morpheme processing and POS tagging. Unlike other multi-task models, the proposed model has only one decoder for the two tasks of morpheme processing and POS tagging. Note that a morpheme and its POS tag should have a one-to-one mapping. Since the two tasks are forced to share the same decoder states, a one-to-one mapping between a morpheme and its POS tag is guaranteed without any explicit regularization or task-specific token. In detail, the decoder (a unidirectional LSTM) first computes a sequence of decoder hidden states $z$. At every time step $t$, the decoder hidden state $z_t$ is calculated from the embedding of the previous syllable $s_{t-1}$, the previous decoder state $z_{t-1}$, and the context vector $c_t$. Then, the two task-dependent networks each solve their own task using the same $z$.
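The decoder step described above can be written out as follows. This is a sketch under assumptions: the text does not give the scoring function or the exact way the three inputs are combined, so $\mathrm{score}(\cdot,\cdot)$ stands for an unspecified attention scoring function in the style of Bahdanau et al. (2015), and $e(\cdot)$ denotes the syllable embedding.

```latex
\begin{aligned}
\alpha_{t,i} &= \frac{\exp\big(\mathrm{score}(z_{t-1}, h_i)\big)}
                     {\sum_{j=1}^{len(x)} \exp\big(\mathrm{score}(z_{t-1}, h_j)\big)}, \\
c_t &= \sum_{i=1}^{len(x)} \alpha_{t,i}\, h_i, \\
z_t &= \mathrm{LSTM}\big(e(s_{t-1}),\, z_{t-1},\, c_t\big).
\end{aligned}
```

Both task-dependent networks then read the same $z_t$, which is what ties the two tasks to a single decoder.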
In morpheme processing represented by solid arrows in Figure 2, a syllable s t is produced at every time t. Most syllables are directly copied from the input syllables while some syllables are newly generated from a vocabulary to recover the base form of a morpheme. To do this, we adopt the pointer-generator network (See et al., 2017) which allows both copying syllables via pointing and generating syllables from a fixed vocabulary. The advantage of this approach is that we can handle non-Korean characters such as hanja (Chinese character) and special symbols with ease.
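The copy-or-generate behavior can be illustrated with the mixture used by pointer-generator networks (See et al., 2017), where a generation probability weights the vocabulary distribution against the attention (copy) distribution. The sketch below uses made-up numbers and a hypothetical helper name, `final_distribution`; it is not the configuration used in the experiments.

```python
def final_distribution(p_gen, vocab_dist, attention, source_syllables):
    """Mix the generation and copy distributions of a pointer-generator step.

    p_gen            -- probability of generating from the fixed vocabulary
    vocab_dist       -- dict: syllable -> probability under the vocabulary
    attention        -- attention weights over the source positions
    source_syllables -- input syllables, aligned with `attention`
    """
    final = {syl: p_gen * p for syl, p in vocab_dist.items()}
    for weight, syl in zip(attention, source_syllables):
        # Copy mass is added for every source position holding this syllable,
        # so syllables outside the fixed vocabulary can still be produced.
        final[syl] = final.get(syl, 0.0) + (1.0 - p_gen) * weight
    return final


# Toy step for the eojeol '봤다', whose base form is '보' + '았' + '다'.
source = ["봤", "다"]
attention = [0.9, 0.1]
vocab_dist = {"보": 0.7, "았": 0.2, "다": 0.1}

dist = final_distribution(0.5, vocab_dist, attention, source)
print(dist)
```

In this toy step '봤' is absent from the vocabulary distribution yet receives probability through copying, which mirrors how hanja or special symbols in the input can be reproduced without being in the fixed vocabulary.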
In POS tagging, represented by dotted arrows in Figure 2, it is possible to generate a POS tag at every time step $t$ by attending to the encoder syllable states, as is done in morpheme processing. However, this approach does not reflect the global dependency among adjacent tags. To solve this problem, this paper adopts a CRF network (Huang et al., 2015; Lample et al., 2016) on top of the pointer-generator network. Given the morpheme sequence generated by morpheme processing, the CRF network predicts POS tags for the morpheme sequence by considering the global dependency among tags (Lafferty et al., 2001). Furthermore, this paper uses skip-connections (He et al., 2016) so that the CRF network can attend to the hidden states of the decoder. Since some POS tags can be predicted directly from input syllables, the skip-connections help the CRF network consider the information of the input syllables. In detail, the sequence of syllables generated by the pointer-generator network is transformed into vectors, and the vectors are concatenated with the decoder state $z$ and the context vector $c$. Then, the concatenated vectors are fed to the CRF layer to predict POS tags. Note that the syllable is the unit for the model and one POS tag spans the several syllables of a morpheme. Thus, to identify morpheme boundaries, morpheme processing in the proposed model generates a special morpheme-ending symbol at every explicit morpheme boundary instead of adopting the IOB tagging scheme. This approach reduces the number of POS tags and helps segment eojeols into morphemes precisely.
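To illustrate why a CRF layer can capture dependencies that per-step tagging misses, the following sketch runs standard Viterbi decoding over toy emission and transition scores. The scores and the Sejong-style tag names are invented for this example, and the function is a generic linear-chain decoder, not the paper's implementation.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag path of a linear-chain CRF.

    emissions   -- list of dicts (one per step): tag -> emission score
    transitions -- dict: (previous tag, current tag) -> score, default 0.0
    """
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []
    for step in emissions[1:]:
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(
                tags,
                key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0),
            )
            scores[cur] = (
                best[-1][prev] + transitions.get((prev, cur), 0.0) + step[cur]
            )
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy scores for an ambiguous two-morpheme eojeol such as '나는'.
tags = ["NP", "VV", "JX", "ETM"]
emissions = [
    {"NP": 1.0, "VV": 1.2, "JX": -5.0, "ETM": -5.0},
    {"NP": -5.0, "VV": -5.0, "JX": 1.5, "ETM": 1.0},
]
transitions = {
    ("NP", "JX"): 2.0, ("VV", "ETM"): 2.0,
    ("VV", "JX"): -2.0, ("NP", "ETM"): -2.0,
}

greedy = [max(step, key=step.get) for step in emissions]
decoded = viterbi(emissions, transitions, tags)
print(greedy, decoded)
```

Picking the best tag at each step independently yields an inconsistent pair here, while the sequence-level decoder prefers a pair of tags that is jointly plausible.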
The proposed model is trained to maximize the weighted sum of the conditional log-likelihoods of the two tasks, where the weights are set to be equal.
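Using the factorization in Equation (1), this training objective can be written as below. The weight symbols $\lambda_m$ and $\lambda_t$ and the parameter symbol $\theta$ are notation introduced here for illustration; the text only states that the two weights are equal.

```latex
\mathcal{L}(\theta) = \sum_{(x, m, t) \in D}
  \Big[ \lambda_m \log p(m \mid x; \theta)
      + \lambda_t \log p(t \mid m, x; \theta) \Big],
\qquad \lambda_m = \lambda_t .
```

With equal weights, maximizing $\mathcal{L}(\theta)$ is equivalent, up to a constant factor, to maximizing $\sum \log p(y \mid x; \theta)$ over the corpus.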

Experimental Settings
The same data set as in the work of Na (2015) is used for the experiments. This data set is derived from the Sejong corpus. Each eojeol in the Sejong corpus is annotated with pairs of a morpheme and its POS tag. Simple statistics on the data set used in the experiments are given in Table 2.
We set the syllable embedding size and the hidden size to 100. The number of LSTM layers is set to three and the batch size is 128. We use the gradient normalization with a threshold of five.
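For reference, the settings stated above can be collected in a small configuration object. The class and field names are hypothetical; only the values come from the text, and anything not stated there (optimizer, learning rate, dropout) is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters stated in the experimental settings (names assumed)."""

    syllable_embedding_size: int = 100
    hidden_size: int = 100
    num_lstm_layers: int = 3
    batch_size: int = 128
    grad_norm_threshold: float = 5.0  # threshold for gradient normalization


config = TrainingConfig()
print(config)
```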
Three baselines are adopted for comparison with the proposed model. The first baseline is the model of Na (2015). This model consists of three sub-models (CRF segmentation, CRF tagging, and post-processing) and executes them in consecutive order. The second is the model of Song and Park (2018), which consists of two sub-models (a generator and a CRF tagger). The last is khaiii, a publicly available CNN-based morphological analyzer. This analyzer assigns a POSMORPH tag to every syllable using a CNN classifier and then resolves the tags through post-processing. These baselines are all pipeline models. We also compare the proposed model with a single-task non-pipeline model that generates a single sequence of morphemes and POS tags using the pointer-generator network.
The F1-measure at the morpheme level and the accuracy at the eojeol level are used as evaluation metrics, and all performances are micro-averaged. Table 1 shows the performances of the proposed model and the baselines. It also contains the results of an ablation study on the proposed model in which the networks of the sub-tasks or the structure of the multi-task model are changed. 'Non-cascade' is similar to the standard multi-task model (Dong et al., 2015), in which the tasks just share a decoder and there is no direct connection between them. In terms of Figure 2, the 'Non-cascade' model has no dotted arrows (marked s and c) from the pointer-generator network and the CRF network. 'Cascade' is a multi-task model in which the POS tagging network has a connection from the morpheme processing network but no skip-connection from the decoder. In Figure 2, the 'Cascade' model has no dotted arrow (marked z) from the decoder to the CRF network. 'Tied' is the proposed model, which has both connections.
According to the table, all multi-task models that adopt Pointer-Generator and/or CRF outperform the baseline Generator-Generator, which shows that adopting them is effective in improving the performance of morphological analysis. The table also shows that the 'Cascade' multi-task model achieves better performance than the 'Non-cascade' model, which implies that a direct flow of information from morpheme processing to POS tagging helps predict accurate POS tags. The proposed 'Tied' multi-task model yields higher performance than the 'Cascade' model. When the 'Tied' multi-task model is trained, the feedback from the POS tagging network influences the decoder states directly. As a result, the 'Tied' model is able to learn better representations for both morpheme processing and POS tagging. All these results indicate that the 'Tied' multi-task model with the Pointer-Generator network and CRF is the best choice among the compared models for Korean morphological analysis.
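The two evaluation metrics can be sketched as follows. The text does not spell out the matching procedure, so this sketch makes assumptions: a morpheme counts as correct when its (morpheme, tag) pair matches a gold pair within the same sentence (multiset intersection), counts are pooled over all sentences for micro-averaging, and an eojeol is correct only if its whole analysis matches. The tag names are again assumed Sejong-style tags.

```python
from collections import Counter


def morpheme_f1(gold_sents, pred_sents):
    """Micro-averaged F1 over (morpheme, tag) pairs, pooled across sentences."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sents, pred_sents):
        g = Counter(pair for eojeol in gold for pair in eojeol)
        p = Counter(pair for eojeol in pred for pair in eojeol)
        tp += sum((g & p).values())  # multiset intersection
        n_gold += sum(g.values())
        n_pred += sum(p.values())
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def eojeol_accuracy(gold_sents, pred_sents):
    """Fraction of eojeols whose full (morpheme, tag) analysis is correct."""
    correct = total = 0
    for gold, pred in zip(gold_sents, pred_sents):
        for g, p in zip(gold, pred):
            correct += int(g == p)
            total += 1
    return correct / total if total else 0.0


gold = [[
    [("나", "NP"), ("는", "JX")],
    [("하늘", "NNG"), ("을", "JKO")],
    [("날", "VV"), ("는", "ETM")],
    [("새", "NNG"), ("를", "JKO")],
    [("보", "VV"), ("았", "EP"), ("다", "EF")],
]]
# Hypothetical system output that misanalyzes the second '나는'.
pred = [[
    [("나", "NP"), ("는", "JX")],
    [("하늘", "NNG"), ("을", "JKO")],
    [("나", "NP"), ("는", "JX")],
    [("새", "NNG"), ("를", "JKO")],
    [("보", "VV"), ("았", "EP"), ("다", "EF")],
]]

f1 = morpheme_f1(gold, pred)
acc = eojeol_accuracy(gold, pred)
print(round(f1, 4), acc)
```

In this toy case one wrong eojeol costs two morphemes out of eleven at the morpheme level and one eojeol out of five at the eojeol level, which shows why the two metrics can diverge.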

Experimental Results
Compared with the pipeline models that previously showed state-of-the-art performance, even simple multi-task models report similar performance, which means that morphological analysis can be improved by training morpheme processing and POS tagging jointly. Although the single-task non-pipeline model achieves reasonable performance, it does not perform as well as the proposed model. This is because the single-task model does not consider the global tag dependency explicitly. To sum up, Korean morphological analysis should be solved by training the two tasks jointly with appropriate networks.

Error Analysis
Although the proposed method achieves an F1-score of over 97.4% at the morpheme level, we believe that there is still room to improve the performance of Korean morphological analysis. To this end, we analyzed the errors made by the proposed method. Table 3 shows the error types and their percentages. A morpheme segmentation error occurs when all results are correct except that some morphemes are decomposed wrongly. This type happens mostly with Korean compound nouns, which can be written as one or more eojeols. If a compound noun written as one eojeol is given, the proposed method sometimes decomposes it into a series of nouns. This type accounts for 5.8% of the errors. A POS tagging error occurs when morphemes are correctly recovered and segmented but their POS tags are predicted wrongly. This type accounts for 39.3% of the errors, and 37% of the errors of this type are confusions between noun and proper noun. The remaining errors, which form the majority, are related to morpheme recovery. That is, these errors occur when the proposed method fails to recover morphemes from eojeols accurately. This type accounts for 54.9% of the errors. Therefore, the error analysis suggests that a more accurate morpheme processing model is required to improve the performance of the proposed Korean morphological analyzer.
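As a quick check on the figures above, the three error types cover all errors, and the noun versus proper noun confusions can be expressed as a share of all errors. The second line assumes that the 37% figure is measured within the POS tagging error type, as the wording suggests.

```latex
\begin{aligned}
5.8\% + 39.3\% + 54.9\% &= 100\%, \\
0.393 \times 0.37 &\approx 0.145 \quad (\text{about } 14.5\% \text{ of all errors}).
\end{aligned}
```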

Conclusion
This paper has formulated Korean morphological analysis as a combination of morpheme processing and POS tagging. The two tasks are trained simultaneously through the tied sequence-to-sequence multi-task model with the pointer-generator network and the CRF network. According to the experimental results, the jointly trained morphological analyzer achieves higher performance than previous analyzers, which are pipeline models of morpheme processing and POS tagging.