KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding

Natural language inference (NLI) and semantic textual similarity (STS) are key tasks in natural language understanding (NLU). Although several benchmark datasets for those tasks have been released in English and a few other languages, there are no publicly available NLI or STS datasets in the Korean language. Motivated by this, we construct and release new datasets for Korean NLI and STS, dubbed KorNLI and KorSTS, respectively. Following previous approaches, we machine-translate existing English training sets and manually translate development and test sets into Korean. To accelerate research on Korean NLU, we also establish baselines on KorNLI and KorSTS. Our datasets are publicly available at https://github.com/kakaobrain/KorNLUDatasets.


Introduction
Natural language inference (NLI) and semantic textual similarity (STS) are considered two of the central tasks in natural language understanding (NLU). They are not only featured in GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019), two popular benchmarks for NLU, but are also known to be useful for supplementary training of pre-trained language models (Phang et al., 2018) as well as for building and evaluating fixed-size sentence embeddings (Reimers and Gurevych, 2019). Accordingly, several benchmark datasets have been released for both NLI (Bowman et al., 2015; Williams et al., 2018) and STS (Cer et al., 2017) in the English language.
When it comes to the Korean language, however, benchmark datasets for NLI and STS do not exist. Popular benchmark datasets for Korean NLU typically involve question answering 1,2 and sentiment analysis 3 , but not NLI or STS. We believe that the lack of publicly available benchmark datasets for Korean NLI and STS has led to a lack of interest in building Korean NLU models suited for these key understanding tasks.
Motivated by this, we construct and release KorNLI and KorSTS, two new benchmark datasets for NLI and STS in the Korean language. Following previous work (Conneau et al., 2018), we construct our datasets by machine-translating existing English training sets and by translating English development and test sets via human translators. We then establish baselines for both KorNLI and KorSTS to facilitate research on Korean NLU.
* Equal Contribution.
1 https://korquad.github.io/ (Lim et al., 2019)
2 http://www.aihub.or.kr/aidata/84

NLI and the {S,M,X}NLI Datasets
In an NLI task, a system receives a pair of sentences, a premise and a hypothesis, and classifies their relationship into one of three categories: entailment, contradiction, and neutral.
There are several publicly available NLI datasets in English. Bowman et al. (2015) introduced the Stanford NLI (SNLI) dataset, which consists of 570K English sentence pairs based on image captions. Williams et al. (2018) introduced the Multi-Genre NLI (MNLI) dataset, which consists of 455K English sentence pairs from ten genres. Conneau et al. (2018) released the Cross-lingual NLI (XNLI) dataset by extending the development and test data of the MNLI corpus to 15 languages. Note that Korean is not one of the 15 languages in XNLI. There are also publicly available NLI datasets in a few other non-English languages (Fonseca et al., 2016; Real et al., 2019; Hayashibe, 2020), but none exists for Korean at the time of publication.
Figure 1: Data construction process. MT and PE indicate machine translation and post-editing, respectively. We translate original English data into Korean using an internal translation engine. For development and test data, the machine translation outputs are further post-edited by human experts.

STS and the STS-B Dataset
STS is a task that assesses the gradations of semantic similarity between two sentences. The similarity score ranges from 0 (completely dissimilar) to 5 (completely equivalent). It is commonly used to evaluate either how well a model grasps the closeness of two sentences in meaning, or how well a sentence embedding embodies the semantic representation of the sentence.
The STS-B dataset consists of 8,628 English sentence pairs selected from the STS tasks organized in the context of SemEval between 2012 and 2017 (Agirre et al., 2012, 2013, 2014, 2015, 2016). The domain of input sentences covers image captions, news headlines, and user forums. For details, we refer readers to Cer et al. (2017).

Data Construction
We explain how we develop two new Korean language understanding datasets: KorNLI and KorSTS. The KorNLI dataset is derived from three different sources: SNLI, MNLI, and XNLI, while the KorSTS dataset stems from the STS-B dataset. The overall construction process, which is applied identically to the two new datasets, is illustrated in Figure 1.
First, we translate the training sets of the SNLI, MNLI, and STS-B datasets, as well as the development and test sets of the XNLI 4 and STS-B datasets, into Korean using an internal neural machine translation engine. Then, the translation results of the development and test sets are post-edited by professional translators in order to guarantee the quality of evaluation. This multi-stage translation strategy aims not only to expedite the translators' work, but also to help maintain translation consistency between the training and evaluation datasets. It is worth noting that the post-editing procedure does not simply mean proofreading. Rather, it refers to human translation based on the prior machine translation results, which serve as first drafts.
4 Only English examples count.

Translation Quality
To ensure translation quality, we hired two professional translators with at least seven years of experience who specialize in academic papers/books as well as business contracts. The two translators each post-edited half of the dataset and cross-checked each other's translation afterward. This was further examined by one of the authors, who is fluent in both English and Korean.
We also note that the professional translators did not have to edit much during post-editing, suggesting that the machine-translated sentences were often good enough to begin with. We found that the BLEU scores between the machine-translated and post-edited sentences were 63.30 for KorNLI and 73.26 for KorSTS, and approximately half the time (47% for KorNLI and 53% for KorSTS), the translators did not have to change the machine-translated sentence at all.
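The share of sentences left untouched by the translators amounts to an exact-match rate between the machine-translated drafts and their post-edited versions. The following is a minimal sketch of that comparison, not the authors' tooling: the function name `unchanged_fraction` and the toy sentence pairs are assumptions made for illustration, and the BLEU computation is not reproduced here.

```python
def unchanged_fraction(mt_outputs, post_edited):
    """Fraction of sentences whose post-edited form equals the MT draft."""
    if len(mt_outputs) != len(post_edited):
        raise ValueError("both lists must be aligned sentence by sentence")
    if not mt_outputs:
        return 0.0
    # Compare after trimming surrounding whitespace only; any other
    # difference counts as an edit.
    same = sum(mt.strip() == pe.strip() for mt, pe in zip(mt_outputs, post_edited))
    return same / len(mt_outputs)


# Hypothetical toy data: two of the four drafts were accepted as-is.
mt = ["draft sentence one", "draft sentence two", "draft sentence three", "draft sentence four"]
pe = ["edited sentence one", "draft sentence two", "draft sentence three", "edited sentence four"]
print(unchanged_fraction(mt, pe))
```

A stricter or looser notion of "unchanged" (for example, ignoring punctuation) would change the reported percentage, so the matching rule should be stated alongside the number.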
Finally, we note that translators did not see the English gold labels during post-editing, in order to expedite the post-editing process. See Section 5 for a discussion on the effect of translation on data quality.
[...] are almost twice as long as the hypotheses, as reported in Conneau et al. (2018). We present a few examples in Table 2.

KorSTS
As provided in Table 3 Table 4.

Baselines
In this section, we provide baselines for the Korean NLI and STS tasks using our newly created benchmark datasets. Because both tasks receive a pair of sentences as an input, there are two different approaches depending on whether the model encodes the sentences jointly ("cross-encoding") or separately ("bi-encoding"). 5

Cross-encoding Approaches
As illustrated with BERT (Devlin et al., 2019) and many of its variants, the de facto standard approach for NLU tasks is to pre-train a large language model and fine-tune it on each task. In the cross-encoding approach, the pre-trained language model takes each sentence pair as a single input for fine-tuning. These cross-encoding models typically outperform bi-encoding models, which encode each input sentence separately.
For both KorNLI and KorSTS, we consider two pre-trained language models. We first pre-train a Korean RoBERTa (Liu et al., 2019), both base and large versions, on a collection of internally collected Korean corpora (65GB). We construct a byte pair encoding (BPE) (Gage, 1994; Sennrich et al., 2016) dictionary of 32K tokens using SentencePiece (Kudo and Richardson, 2018). We train our models using fairseq with 32 V100 GPUs for the base model (25 days) and 64 for the large model (20 days).
5 These terms (cross-encoding and bi-encoding) are adopted from Humeau et al. (2020).
We also use XLM-R (Conneau and Lample, 2019), a publicly available cross-lingual language model that was pre-trained on 2.5TB of Common Crawl corpora in 100 languages including Korean (54GB). Note that the base and large architectures of XLM-R are identical to those of RoBERTa, except that the vocabulary size is significantly larger (250K), making the embedding and output layers that much larger.
In Table 5, we report the test set scores for cross-encoding models fine-tuned on KorNLI (accuracy) and KorSTS (Spearman correlation). For KorNLI, we additionally include results for XLM-R models fine-tuned on the original MNLI training set (also known as cross-lingual transfer in XNLI). To ensure comparability across settings, we only train on the MNLI portion when fine-tuning on KorNLI.
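The two evaluation metrics can be sketched in a few lines. The code below is an illustrative, dependency-free sketch rather than the evaluation script used for the reported numbers: the function names and toy predictions are assumptions, and Spearman correlation is computed as the Pearson correlation of average ranks (assuming neither score list is constant).

```python
import math


def accuracy(gold_labels, predicted_labels):
    """Percentage of NLI pairs whose predicted label matches the gold label."""
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return 100.0 * correct / len(gold_labels)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman(gold_scores, predicted_scores):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    x, y = average_ranks(gold_scores), average_ranks(predicted_scores)
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical toy predictions, for illustration only.
gold = ["entailment", "neutral", "contradiction", "neutral"]
pred = ["entailment", "neutral", "neutral", "neutral"]
print(accuracy(gold, pred))

sts_gold = [0.0, 1.5, 3.2, 5.0]
sts_pred = [0.1, 0.4, 0.7, 0.9]
print(100 * spearman(sts_gold, sts_pred))
```

Because Spearman correlation depends only on the ordering of the predictions, a model's similarity scores do not need to lie on the 0 to 5 scale of the gold annotations.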
Overall, the Korean RoBERTa models outperform the XLM-R models, regardless of whether they are fine-tuned on Korean or English training sets. For each model, the larger variant outperforms the base one, consistent with previous findings. The large version of Korean RoBERTa performs the best for both KorNLI (83.67%) and KorSTS (85.27%) among all models tested. Among the XLM-R models for KorNLI, those fine-tuned on the Korean training set consistently outperform the cross-lingual transfer variants.

Bi-encoding Approaches
We also report the KorSTS scores of bi-encoding models. The bi-encoding approach bears practical importance in applications such as semantic search, where computing pairwise similarity among a large set of sentences is computationally expensive with cross-encoding.
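A rough count of encoder forward passes illustrates why cross-encoding becomes expensive for all-pairs comparison. This sketch assumes one forward pass per sentence pair for a cross-encoder and one per sentence for a bi-encoder; the function names are hypothetical and the cheap vector comparisons that follow bi-encoding are not counted.

```python
def cross_encoder_passes(num_sentences):
    """One forward pass per unordered sentence pair: n * (n - 1) / 2."""
    return num_sentences * (num_sentences - 1) // 2


def bi_encoder_passes(num_sentences):
    """One forward pass per sentence; pairs are then compared as vectors."""
    return num_sentences


for n in (100, 10_000):
    print(n, cross_encoder_passes(n), bi_encoder_passes(n))
```

The quadratic growth on the cross-encoding side is what makes bi-encoding attractive for semantic search, even when its accuracy is lower.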
Here, we first provide two baselines that do not use pre-trained language models: Korean fastText and the multilingual universal sentence encoder (M-USE). Korean fastText (Bojanowski et al., 2017) is a pre-trained word embedding model 6 trained on Korean text from Common Crawl. To produce sentence embeddings, we take the average of fastText word embeddings for each sentence. M-USE 7 (Yang et al., 2019) is a CNN-based sentence encoder model trained for NLI, question answering, and translation ranking across 16 languages including Korean. For both Korean fastText and M-USE, we compute the cosine similarity between two input sentence embeddings to make an unsupervised STS prediction.
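The averaged-embedding baseline can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the tiny two-dimensional `toy_vectors` table stands in for real fastText vectors, tokens are assumed to be already split, and out-of-vocabulary tokens are simply skipped.

```python
import math


def mean_embedding(tokens, word_vectors, dim):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical 2-dimensional vectors, for illustration only.
toy_vectors = {"man": [1.0, 0.0], "eats": [0.0, 1.0], "speaks": [1.0, 1.0]}
s1 = mean_embedding(["man", "eats"], toy_vectors, dim=2)
s2 = mean_embedding(["man", "speaks"], toy_vectors, dim=2)
print(cosine_similarity(s1, s2))
```

The resulting cosine similarity serves directly as the unsupervised STS prediction; no rescaling is needed when the evaluation metric is rank-based.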
Pre-trained language models can also be used as bi-encoding models following the approach of SentenceBERT (Reimers and Gurevych, 2019), which involves fine-tuning a BERT-like model with a Siamese network structure on NLI and/or STS. We use the SentenceBERT approach for both Korean RoBERTa ("Korean SRoBERTa") and XLM-R ("SXLM-R"). We adopt the MEAN pooling strategy, i.e., computing the sentence vector as the mean of all contextualized word vectors.
In Table 6, we present the KorSTS test set scores (100 × Spearman correlation) for the bi-encoding models. We categorize each result based on whether the model was additionally trained on KorNLI and/or KorSTS. Note that models that are not fine-tuned at all or only fine-tuned on KorNLI can be considered unsupervised w.r.t. KorSTS. Also note that M-USE is trained on a machine-translated version of SNLI, which is a subset of KorNLI, as part of its pre-training step.
First, given each model, we find that supplementary training on KorNLI consistently improves the KorSTS scores for both unsupervised and supervised settings, as was the case with English models (Conneau et al., 2017; Reimers and Gurevych, 2019). This shows that the KorNLI dataset can be an effective intermediate training source for bi-encoding approaches. When comparing the baseline models in each setting, we find that both M-USE and the SentenceBERT-based models trained on KorNLI achieve competitive unsupervised KorSTS scores. Both significantly outperform the averaged fastText embeddings and the Korean SRoBERTa and SXLM-R models without fine-tuning. Among our baselines, large SXLM-R trained on KorNLI followed by KorSTS achieves the best score (81.84).
6 https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.ko.300.bin.gz
7 https://tfhub.dev/google/universal-sentence-encoder-multilingual/3

Effect of Translation on Data Quality
As noted by Conneau et al. (2018), translation quality does not necessarily guarantee that the semantic relationships between sentences are preserved. We also translated each sentence independently and took the gold labels from the original English pair, so the resulting label might no longer be "gold," due to both incorrect translations and (in rarer cases) linguistic differences that make it difficult to translate specific concepts.
Fortunately, Conneau et al. (2018) also pointed out that annotators could recover the NLI labels at a similar accuracy in translated pairs (83% in French) as in original pairs (85% in English). In addition, our baseline experiments in Section 4.1 show that supplementary training on KorNLI improves KorSTS performance (+1% for RoBERTa and +4-11% for XLM-R), suggesting that the labels of KorNLI are still meaningful. Another piece of quantitative evidence is that the performance of XLM-R fine-tuned on KorNLI (80.3% with cross-lingual transfer) is within a comparable range of the model's performance on other XNLI languages (80.1% on average).
Nevertheless, we also found some (not many) examples in which the gold label becomes incorrect after translating the input sentences to Korean. For example, there were cases in which the two input sentences for KorSTS were so similar (with 4+ similarity scores) that upon translation, the two inputs simply became identical. In another case, the English word sir appeared in the premise of an NLI example and was translated to 선생님, which is a correct word translation but is a gender-neutral noun, because there is no gender-specific counterpart to the word in Korean. As a result, when the hypothesis referencing the entity as the man got translated into 남자 (gender-specific), the English gold label (entailment) was no longer correct in the translated example. Analyzing these errors more systematically is interesting future work, although the amount of human effort involved in this analysis would match that of labeling a new dataset.

Conclusion
We introduced KorNLI and KorSTS, new datasets for Korean natural language understanding. Using these datasets, we also established baselines for Korean NLI and STS with both cross-encoding and bi-encoding approaches. Looking forward, we hope that our datasets and baselines will facilitate future research on not only improving Korean NLU systems but also increasing language diversity in NLU research.

A Korean RoBERTa Pre-training
For the Korean RoBERTa baselines used in §4, we pre-train a RoBERTa (Liu et al., 2019) model on an internal Korean corpus of size 65GB, consisting of online news articles (56GB), encyclopedia (7GB), movie subtitles (∼1GB), and the Sejong corpus 8 (∼0.5GB). We use fairseq, which includes the official implementation for RoBERTa. In Table 7, we list all hyperparameters we use for Korean RoBERTa pre-training. Note that, [...]
The fine-tuning hyperparameters are summarized in Table 8. For each dataset and model size, we choose the hyperparameter configurations that are used in the corresponding English version of the dataset and model size (except for the XLM-R cross-lingual transfer using MNLI, where we also use the same hyperparameters as RoBERTa and XLM-R on KorNLI). We find that the hyperparameters used for English models and datasets give sufficiently good performance on the development set, so we do not perform an additional hyperparameter search. After training each model for 10 epochs, we choose the model checkpoint that achieves the highest score on the development set and evaluate it on the test set to obtain our final results in §4.1.
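The checkpoint selection rule described above can be sketched as follows. This is a minimal illustration with hypothetical per-epoch scores and a hypothetical function name; it only shows that the test score is read off at the epoch chosen on the development set, never chosen on the test set itself.

```python
def select_checkpoint(dev_scores):
    """Return the 1-based epoch with the highest dev score (earliest wins ties)."""
    best_index = max(range(len(dev_scores)), key=lambda i: dev_scores[i])
    return best_index + 1


# Hypothetical scores for a 4-epoch run, for illustration only.
dev_scores = [80.1, 82.3, 83.0, 82.9]
test_scores = [79.5, 81.8, 82.4, 82.6]

best_epoch = select_checkpoint(dev_scores)
reported_test_score = test_scores[best_epoch - 1]
print(best_epoch, reported_test_score)
```

In this toy run the reported test score is not the highest test score observed, which is the expected consequence of selecting on the development set only.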
We also report the development set scores for the best checkpoint in Table 9. We observe that the XLM-R models fine-tuned on KorNLI and KorSTS achieve the highest scores on the development set, although the Korean RoBERTa models perform better on the test set.
Table 9: KorNLI and KorSTS development set scores for fine-tuned cross-encoding language models. KorNLI scores are accuracy (%) and KorSTS scores are 100 × Spearman correlation. † To ensure comparability with XNLI, we only use the MNLI portion of the KorNLI dataset.

C Fine-tuning with Bi-encoding Approaches
To fine-tune Korean RoBERTa and XLM-R models using the bi-encoding approach (§4.2), we train Korean Sentence RoBERTa ("Korean SRoBERTa") and Sentence XLM-R ("SXLM-R"), following the fine-tuning procedure of SentenceBERT (Reimers and Gurevych, 2019). Unless described otherwise, we follow the experimental settings, including all hyperparameters, of SentenceBERT 10 . For each model size, we manually search among learning rates {2e-5, 1e-5} for training on KorNLI, {1e-5, 2e-6} for training on KorSTS, and {1e-5, 2e-6} for training on KorSTS after KorNLI. After training until convergence, we choose the learning rate that leads to the highest KorSTS score on the development set. These hyperparameters are shown in Table 10.