Zero-shot North Korean to English Neural Machine Translation by Character Tokenization and Phoneme Decomposition

The primary limitation of North Korean to English translation is the lack of a parallel corpus; therefore, high translation accuracy cannot be achieved. To address this problem, we propose a zero-shot approach using South Korean data, which are remarkably similar to North Korean data. We train a neural machine translation model after tokenizing South Korean text at the character level and decomposing characters into phonemes. We demonstrate that our method can effectively learn North Korean to English translation and improve the BLEU score by +1.01 points in comparison with the baseline.


Introduction
Neural machine translation (NMT) has been adapted to many languages; however, machine translation of the North Korean language 1 has seldom been performed. One of the reasons is the lack of large-scale bilingual data for training North Korean neural models. It is known that large-scale bilingual data are required to improve the translation accuracy of an NMT model. For example, one of the previous works suggests that an NMT system is less accurate than a phrase-based statistical machine translation system if there are no more than 100 million words in the bilingual training data (Koehn and Knowles, 2017).
There are three approaches to solving the low-resource bottleneck. First, Wang et al. (2006) proposed a method to train a translation model using a pivot language as an intermediate language. This approach translates from the source language to the pivot language and from the pivot language to the target language. However, there is no good pivot language between North Korean and English. Second, Johnson et al. (2017) proposed a many-to-many translation model, where multiple languages are translated into other languages using a single shared encoder and decoder. They demonstrated that this model can translate a language pair that is unseen in the training data. However, there are no bilingual data between North Korean and any other language. Third, Marujo et al. (2011) proposed a rule-based method to convert similar languages into a target language, such as Brazilian Portuguese to European Portuguese, and extended the target language resources. North Korean is remarkably similar to South Korean, but conversion from South Korean to North Korean must take the context into account, which makes rule-based conversion difficult.
1 Korean is a language mainly used in the Korean peninsula; however, there are some grammatical differences between the Republic of Korea and the Democratic People's Republic of Korea. In this study, we refer to the Korean language used in the Republic of Korea as "South Korean," and the Korean language used in the Democratic People's Republic of Korea as "North Korean."
Therefore, in this study, we propose a method to tokenize South Korean input sentences at the character level and decompose them into phonemes to mitigate the grammatical differences between South Korean and North Korean, and demonstrate that the translation model from North Korean to English can be effectively learned using bilingual South Korean-English data. The main contributions of this study are as follows.
• Because there is no evaluation dataset between North Korean and English, we create a North Korean-English evaluation dataset by manually translating the South Korean-English bilingual evaluation dataset into a North Korean one.
• We demonstrate that the North Korean-English translation model can be trained effectively on bilingual South Korean-English data by character-level tokenization and phoneme-level decomposition.

Related Work
The pivot language approach increases the translation error between the source language and the target language, because the translation model of each language is independently trained. This problem has been addressed by allowing interaction during the translation model training. Moreover, a method has been proposed to train a source-to-target model using a pretrained teacher model as its guide. Marujo et al. (2011) proposed a rule-based method to convert similar languages into a target language to extend the language resources of the target side. Wang et al. (2016) presented a method to extract the conversion rules between similar languages. Firat et al. (2016) proposed a many-to-many translation model with several encoders and decoders. However, the accuracy of a many-to-many translation model with a single shared encoder and decoder was found to be higher (Johnson et al., 2017).
Finally, the translation accuracy was improved by preprocessing of the bilingual data. Zhang and Komachi (2018) demonstrated that higher translation accuracy can be obtained by decomposing Kanji into ideographic characters and strokes in Japanese-Chinese NMT. Stratos (2017) proposed a speech-parsing model for South Korean with character-level tokenization and decomposition into phonemes, demonstrating an improvement in the speech-parsing accuracy.

Grammatical differences
The two Korean languages have grammatical differences, including differences in word segmentation (WS), initial sound rule (ISR), and compound words. Table 1 presents examples of grammatical differences between South Korean and North Korean words or phrases that have the same meaning. We only consider the differences in the WS and ISR in our study, as differences in compound words in the evaluation data rarely appear.

Word segmentation.
South Korean and North Korean differ in the way they segment words containing formal and proper nouns and in quantitative expressions. For example, words are separated in both South Korean and North Korean when particles appear; however, they are not separated in North Korean if the next word after a particle is a formal noun. In Table 1, the word meaning "many things" is written as "많은 것" in South Korean and is separated because "은" is a particle. However, since "것" is a formal noun, it is written consecutively in North Korean as "많은것." To convert WS from South Korean grammar to North Korean grammar, it is necessary to consider the context.

Initial sound rule.
In South Korean, the consonant "ㄹ" changes into "ㅇ" or "ㄴ" when it is combined with "ㅑ, ㅕ, ㅛ, ㅜ, ㅠ, ㅣ, ㅖ," or other vowels, whereas it does not change in North Korean. For example, the word that means "basketball" in Table 1 is represented as "농구" in South Korean because of the ISR, but is represented as "롱구" in North Korean. Additionally, some South Korean words become polysemous owing to the ISR. In Table 1, the words that mean "fulfillment" and "move" both become "이행" in South Korean, but remain "리행" and "이행" in North Korean, respectively. It is difficult to mitigate the difference in the ISR without considering the context.

Creating North Korean Evaluation Data
We created the North Korean to English translation evaluation dataset by having a North Korean native speaker manually convert the evaluation dataset in the News Korean-English parallel corpus 2 into

Korean Neural Machine Translation using Character Tokenization and Phoneme Decomposition
We propose a method to tokenize input sentences into characters or decompose them into phonemes. Using this method, it is possible to reduce the influence of grammatical differences between South Korean and North Korean to train a machine translation model in North Korean using bilingual South Korean data. In the following South Korean or North Korean sentences, we indicate the word boundary as ⬚ for better understanding.

Character model.
In character-level tokenization, we split each word into characters. For example, the word that means "many things" in Table 1 is written as "많은⬚것" in South Korean and "많은것" in North Korean, but when we tokenize it at the character level, it becomes "많⬚은⬚것" in both cases, and there is no difference between the two languages. Therefore, character-level tokenization can overcome the difference in WS to some extent.
Word (phoneme BPE) model.
In word (phoneme BPE) tokenization, we decompose the characters in a word into phonemes (vowels and consonants). As a result, we can reduce the effect of ISR. For example, the word "basketball" is written as "농구" in South Korean and "롱구" in North Korean; therefore, only one out of two tokens is common at the character level. When they are decomposed into phonemes, the former is "ㄴㅗㅇㄱㅜ" in South Korean, and the latter is "ㄹㅗㅇㄱㅜ" in North Korean, resulting in four out of five tokens being common. In this way, decomposition into phonemes can reduce the effect of ISR.
In addition, we retain the word or phrase boundary in the input sentence in this model. For example, when decomposing the sentence "롱구는⬚운동" into phonemes, it is decomposed as "ㄹㅗㅇㄱㅜㄴㅡㄴ⬚ㅇㅜㄴㄷㅗㅇ." By applying byte-pair encoding (BPE, Sennrich et al., 2016) to the sentence that has been decomposed into phonemes, it is possible to segment the sentence at the phoneme level while considering word or phrase boundaries.
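The decomposition described above can be sketched with plain Unicode arithmetic. This is a minimal illustration, not the authors' pipeline: it relies on the standard layout of precomposed Hangul syllables (starting at U+AC00, 21 vowels by 28 finals per initial consonant) instead of a toolkit, and the function names are hypothetical. An ordinary space stands in for the ⬚ boundary marker.

```python
# Compatibility jamo in the order used by the Unicode Hangul syllable block.
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"            # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"       # 21 vowels
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 finals


def decompose_syllable(ch):
    """Split one precomposed Hangul syllable into its phonemes (jamo)."""
    offset = ord(ch) - 0xAC00
    if not 0 <= offset < 11172:      # not a Hangul syllable: leave untouched
        return ch
    lead, rest = divmod(offset, 21 * 28)
    vowel, tail = divmod(rest, 28)
    return CHOSEONG[lead] + JUNGSEONG[vowel] + JONGSEONG[tail]


def decompose_word(word):
    """Decompose every syllable of a word; no boundary is kept between syllables."""
    return "".join(decompose_syllable(ch) for ch in word)


def decompose_sentence(sentence):
    """Decompose a sentence while retaining the original word boundaries."""
    return " ".join(decompose_word(word) for word in sentence.split())


south = decompose_word("농구")   # South Korean spelling of "basketball"
north = decompose_word("롱구")   # North Korean spelling of "basketball"
shared = sum(a == b for a, b in zip(south, north))
print(south, north, f"{shared}/{len(south)} phonemes shared")
print(decompose_sentence("롱구는 운동"))
```

BPE would then be learned on the decomposed text, so that merges never cross the retained word boundaries.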
Character (phoneme BPE) model. In character (phoneme BPE) tokenization, we tokenize a sentence at the character level and decompose it into phonemes. Tokenization at the character level and decomposition into phonemes can mitigate the differences in WS and ISR, and it is possible to combine both. For example, when the sentence "롱구는⬚운동" is tokenized at the character level and decomposed into phonemes, it becomes "ㄹㅗㅇ⬚ㄱㅜ⬚ㄴㅡㄴ⬚ㅇㅜㄴ⬚ㄷㅗㅇ." By applying BPE to this sentence, it is possible to segment the sentence at the phoneme level while considering character boundaries.
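As a rough sketch of the combined preprocessing, the following code tokenizes at the character level and decomposes each character into phonemes, so that each token is the phoneme string of one syllable. The helper names are hypothetical, the decomposition uses Unicode arithmetic as an assumption in place of any particular toolkit, and non-Hangul characters are simply passed through one by one, which is a simplification.

```python
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")


def to_phonemes(syllable):
    """Return the phoneme string of one Hangul syllable (other characters unchanged)."""
    offset = ord(syllable) - 0xAC00
    if not 0 <= offset < 11172:
        return syllable
    lead, rest = divmod(offset, 21 * 28)
    vowel, tail = divmod(rest, 28)
    return LEADS[lead] + VOWELS[vowel] + TAILS[tail]


def char_phoneme_tokens(sentence):
    """Drop the original spacing, then emit one phoneme-string token per character."""
    return [to_phonemes(ch) for ch in sentence if not ch.isspace()]


# Character boundaries replace word boundaries, so the WS difference disappears ...
south_ws = char_phoneme_tokens("많은 것")
north_ws = char_phoneme_tokens("많은것")
print(south_ws == north_ws)

# ... and the ISR difference is confined to a single phoneme inside one token.
print(" ".join(char_phoneme_tokens("롱구는 운동")))
print(" ".join(char_phoneme_tokens("농구는 운동")))
```

Because the original spacing is discarded before tokenization, the two spellings of "many things" yield the same token sequence, while "롱구" and "농구" differ only in the first phoneme of the first token.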

Settings
We train a BiDeep recurrent neural network using Nematus 3 for implementation. We set the hyperparameters as in Sennrich and Zhang (2019) (Table 2). We use the News Korean-English parallel corpus for training the model and its evaluation data converted into North Korean grammar (3.2) for evaluating the model. We perform tokenization and truecasing using Moses scripts for all the input sentence pairs. We delete sentences with more than 200 words from the training data. Table 3 presents the training, development, and test data statistics. In the evaluation, we perform detruecasing and detokenization for the translation outputs using Moses scripts and compute the bilingual evaluation understudy (BLEU) score using sacreBLEU (Post, 2018). We select the model using South Korean and North Korean development data.
In this study, in addition to the word-level data of South Korean and North Korean as input languages, we use the four preprocessing methods, which are described in the following paragraphs and presented in Table 5.
3 https://github.com/EdinburghNLP/nematus
Word (character BPE) model. Following Sennrich and Zhang (2019), we apply character-level BPE to each of the South Korean, North Korean, and English sides that had been split into words. We set the number of merge operations to 30k and the frequency threshold to 10. For the following South Korean and North Korean preprocessing steps, the English side used only the word (character BPE) model. In addition to our re-implementation of Sennrich and Zhang (2019), we cite the BLEU score reported in their paper.
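To make the role of the merge operations concrete, here is a toy version of BPE learning in the spirit of Sennrich et al. (2016). It is only a sketch: the function name and the toy word counts are invented for illustration, and the way the frequency threshold is applied is an assumption (here, learning stops once the most frequent symbol pair falls below it), since the text does not spell this out.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges, min_frequency=2):
    """Learn up to `num_merges` merges from a {word: count} dictionary.

    Assumption: `min_frequency` stops learning when the best pair is rarer
    than the threshold. Ties are broken by the larger pair in tuple order.
    """
    # Each word starts as a sequence of single symbols plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        best, count = max(pairs.items(), key=lambda item: (item[1], item[0]))
        if count < min_frequency:
            break
        merges.append(best)
        # Rewrite every word, replacing each occurrence of the best pair.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges


toy_counts = {"low": 12, "lower": 4, "slow": 3}   # invented counts
merges = learn_bpe(toy_counts, num_merges=30, min_frequency=10)
print(merges)
```

With a budget of 30k merges on real data the threshold rarely matters early on; in this toy run it is what ends learning after a handful of merges.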
Character model. We perform character level tokenization. As for English and Hanja included in the South Korean and North Korean data, we treat them as words without further tokenization. In addition, we limit the token types to a maximum frequency of 1,700.
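A minimal sketch of this tokenization is given below. The regular expression and the function name are assumptions made for illustration: Hangul syllables (U+AC00 to U+D7A3) are split one per token, runs of Hanja (approximated here by the CJK Unified Ideographs block) and runs of any other non-space characters, such as English words, are kept whole.

```python
import re

# One Hangul syllable | a run of Hanja | a run of anything else that is not whitespace.
TOKEN_PATTERN = re.compile(
    r"[\uAC00-\uD7A3]"
    r"|[\u4E00-\u9FFF]+"
    r"|[^\s\uAC00-\uD7A3\u4E00-\u9FFF]+"
)


def char_tokenize(sentence):
    """Character-level tokens for Hangul; English and Hanja stay whole words."""
    return TOKEN_PATTERN.findall(sentence)


print(char_tokenize("GM의 자회사가"))
print(char_tokenize("많은 것") == char_tokenize("많은것"))
```

Since whitespace is never emitted as a token, the South Korean and North Korean spacings of the same phrase map to the same character sequence.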
Word (phoneme BPE) model. We decompose the words into phonemes and apply BPE. We set the merge operation to 30k and the frequency threshold to 10. We use hgtk (Hangul toolkit) 4 for the decomposition into phonemes.
Character (phoneme BPE) model. We perform character-level tokenization, decompose the characters into phonemes, and apply BPE. We set the number of merge operations to 1k.
Table 4 presents the BLEU scores for the evaluation data. For both South Korean and North Korean, the char (phonBPE) models achieved the highest scores on the development data. On the test data, they improve over the word (charBPE) model by +0.67 points for South Korean and +1.01 points for North Korean.

Reference: A division of General Motors is getting some financial help from the Federal Reserve:

South Korean
Source: GM의 자회사가 연방준비제도로부터 재정적 지원을 받게 되었습니다.
word (charBPE): GM's job company is getting financial assistance from the Federal Reserve.
char: GM's automaker has been receiving financial assistance from the Federal Reserve.
word (phonBPE): GM's company has received financial assistance from the Federal Reserve.
char (phonBPE): GM's company has been receiving financial assistance from the Federal Reserve.

North Korean
Source: GM의 자회사가 련방준비제도로부터 재정적지원을 받게 되었습니다.
word (charBPE): GM's own company is getting money from a scusty system.
char: GM's automaker has been receiving financial assistance from the Federal Reserve.
word (phonBPE): GM's ZGM company gets financial assistance from the getaway.
char (phonBPE): GM has received financial assistance from the Federal Reserve.

Table 6: Translation examples that differ in the WS and ISR (upper: South Korean, lower: North Korean). The word that means "financial help" is written as "재정적 지원" in South Korean, and in North Korean, it is written consecutively as "재정적지원." Additionally, in South Korean, the word that means "federal" becomes "연방" because of the ISR but remains "련방" in North Korean.

Reference: It added that it was consulting with the Ministry of Unification on the plan.

The Ministry ··· said it is discussing the plan.
char (phonBPE): The Ministry ··· said it was discussing the plan.

Source: 해양수산부는 이 방안에 대해 통일부와 론의중이라고 덧붙였다.
char: The Ministry ··· said the plan is under way with the Unification Ministry.
char (phonBPE): The Ministry ··· said the plan would be discussed with the Unification Ministry.

Discussion
We extract two subsets that have differences in the WS or ISR in the test data to test the hypothesis that each preprocessing step can absorb the grammatical differences. Table 3 presents the WS and ISR subset data statistics.
Output of each model. Table 6 presents the outputs of each model. The words that include grammatical differences, such as "재정적지원" and "련방," are not well-translated in the word-based models. However, the character-based models can translate them correctly. Character-level tokenization can mitigate both grammatical differences, as shown in the example of Table 6; however, it cannot solve all the grammatical differences. For example, in Table 7, the word "론의중" is affected by the ISR, and only the char (phonBPE) model can translate it in the North Korean translation. Therefore, both tokenization at the character level and decomposition into phonemes are necessary to reduce the differences of the WS and ISR.
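The text does not say how the WS and ISR subsets were extracted. One possible heuristic, shown purely as an assumption, compares each South Korean test sentence with its North Korean conversion: a WS difference is flagged when the word boundaries fall at different positions, and an ISR difference when two aligned syllables share vowel and final consonant but differ in the initial consonant by one of the pairs ㄹ/ㅇ, ㄹ/ㄴ, or ㄴ/ㅇ. All names are hypothetical.

```python
def split_syllable(ch):
    """Return (initial, vowel, final) indices of a Hangul syllable, or None."""
    offset = ord(ch) - 0xAC00
    if not 0 <= offset < 11172:
        return None
    lead, rest = divmod(offset, 21 * 28)
    vowel, tail = divmod(rest, 28)
    return lead, vowel, tail


NIEUN, RIEUL, IEUNG = 2, 5, 11          # indices of ㄴ, ㄹ, ㅇ among the initials
# Assumed (North Korean initial, South Korean initial) pairs produced by the ISR.
ISR_PAIRS = {(RIEUL, IEUNG), (RIEUL, NIEUN), (NIEUN, IEUNG)}


def boundaries(sentence):
    """Positions of word boundaries, counted in characters with spaces removed."""
    position, result = 0, set()
    for word in sentence.split()[:-1]:
        position += len(word)
        result.add(position)
    return result


def has_ws_difference(south, north):
    return boundaries(south) != boundaries(north)


def is_isr_pair(south_ch, north_ch):
    s, n = split_syllable(south_ch), split_syllable(north_ch)
    if s is None or n is None:
        return False
    return s[1:] == n[1:] and (n[0], s[0]) in ISR_PAIRS


def has_isr_difference(south, north):
    s, n = "".join(south.split()), "".join(north.split())
    if len(s) != len(n):                # alignment unclear: be conservative
        return False
    return any(is_isr_pair(a, b) for a, b in zip(s, n))


south = "GM의 자회사가 연방준비제도로부터 재정적 지원을 받게 되었습니다."
north = "GM의 자회사가 련방준비제도로부터 재정적지원을 받게 되었습니다."
print(has_ws_difference(south, north), has_isr_difference(south, north))
```

Applied to the sentence pair of Table 6, this heuristic flags both kinds of difference; it would miss ISR cases in sentences whose lengths differ for other reasons, so it should be read as a starting point only.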

Human evaluation
We randomly extracted 50 lines from each model output in the North Korean to English test. Three evaluators evaluated the fluency and adequacy on a scale of 1-5. Table 8 presents the results of the human evaluation. The char (phonBPE) model exhibits the highest scores in both metrics, with an improvement of +0.11 points in the fluency evaluation and +0.02 points in the adequacy evaluation in comparison with the word (charBPE) model. Additionally, the human evaluation results indicate that character tokenization and phoneme decomposition can improve the accuracy of the North Korean to English translation.

Conclusions and Future Work
In this study, to solve the language resource bottleneck in North Korean translation, we proposed a method to tokenize input sentences in South Korean and North Korean at the character level and decompose them into phonemes. This method is simple and mitigates the grammatical differences between South Korean and North Korean; moreover, it improves the translation accuracy for North Korean to English translation. However, the differences between South Korean and North Korean are not only grammatical. There are some words that have the same pronunciation and notation but different meanings. For example, the meaning of "낙지" is "octopus" in South Korean, but "squid" in North Korean. Handling such differences in word meaning remains a major challenge. In the future, we intend to use the English translation data of North Korean news articles to create an evaluation dataset that considers differences in words, and attempt to develop a translation method using a language model with context, such as BERT (Devlin et al., 2019).