Open Korean Corpora: A Practical Report

Korean is often referred to as a low-resource language in the research community. While this claim is partially true, it is also the case that the available resources are inadequately advertised and curated. This work curates and reviews a list of Korean corpora, first describing institution-level resource development, then iterating through the current open datasets for different types of tasks. We then propose a direction for how open-source dataset construction and release should be done for less-resourced languages to promote research.


Introduction
The Korean language is less explored in terms of corpus and computational linguistics, but its prevalence is often underrated. It has about 80 million users and has recently been adopted in multilingual research, both as part of CJK (Chinese, Japanese, and Korean) and for its distinctive writing system.
However, compared to the industrial need, interest in Korean natural language processing (NLP) has not developed much from an international viewpoint, which recurrently hinders related publication and further academic extension. Besides, in recent NLP, where benchmarking is standard practice, such systems are lacking for Korean, deterring international and even native researchers who start Korean NLP from finding directions. Park et al. (2016) provide a decent survey, but its techniques mainly concern the NLP pipeline. Also, although there are some curations of Korean NLP1 and datasets2, we consider that a little more organization is required, preferably in an internationally available form. Our attempt is expected to mitigate the challenges faced by researchers who handle Korean from a multi- or cross-lingual viewpoint.
In this paper, we scrutinize the efforts of government, institutes, industry, and individuals to construct public Korean NLP resources. First, we describe how institutional organizations have tackled the issue by producing accessible resources, and point out their limitations regarding international availability and licensing; we then introduce and curate the fully public datasets along with our proposed criteria. Through this, we aim to establish the current state of Korean corpora across NLP tasks and whether they are freely or conditionally available. Our survey is curated and updated in a public repository3.

Accessible Resources
The National Institute of Korean Language (NIKL) has long developed fundamental resources for Korean linguistics4. At the same time, it has also undertaken massive dataset construction from the viewpoint of computational linguistics, adapting to the new wave of language artificial intelligence (AI). Widely known outcomes include Korean word dictionaries5 and the Sejong Corpus (Kim, 2006)6. The dictionaries contain the fundamental and newly coined lexicons that make up Korean (along with their content), and the Sejong Corpus is a large-scale labeled NLP pipeline corpus for tasks such as constituency and dependency parsing, mainly provided in an .xml-like format. Besides, labeled corpora of about 300 million words have recently been released7, covering inter-sentence tasks such as similarity and entailment. The corpora (about 37 datasets as of May 2023) are continuously updated regarding typos and inappropriate contents, upon user reports and academic feedback.
Electronics and Telecommunications Research Institute (ETRI) has been collecting, refining, and tagging language processing and speech learning data over a long period of time8. Aside from NIKL, which mainly focuses on classical NLP pipelines, ETRI has also built a database for semantic analysis and question answering (QA), the outcomes of the project Exo-brain9. The project includes syntax-semantic tasks such as part-of-speech (POS) tagging and semantic role labeling (SRL), and also provides construction guidelines for the corpora.
AI HUB is a platform organized by the National Information Society Agency (NIA) in which large-scale datasets are integrated10. The datasets are built for various tasks at the government level to promote the development of the AI industry. The provided resources are labeled or parallel corpora in real-life domains: law, patent, common sense, open dialog, machine reading comprehension, and machine translation. Also, about 1,000 hours of speech corpus is provided for use in spoken language modeling11. Recently, new datasets on wellness and emotional dialog have been distributed, so that many people can experiment for social good and public AI. Also, the open dictionary NIADic12 is freely available, provided by the K-ICT Big Data Center. As of May 2023, 97 Korean language datasets are provided and are continuously managed by the institution.

Accessibility
The above datasets guarantee high quality, with well-defined guidelines and well-educated workers. However, their usage is unfortunately often confined to domestic researchers for procedural reasons. Researchers abroad can indeed access the data, but they may face difficulty filling out and submitting the particular application forms, in contrast to a barrier-free downloading system. Also, in most cases, modification and redistribution are restricted, which hinders quality enhancement (Han et al., 2017).
Here, we introduce datasets that can be utilized as alternatives to the limitedly accessible Korean NLP resources. Instead of scrutinizing all available corpora, we curate them under specific criteria.

Open Datasets
All the datasets introduced from now on are fully open access. This means that the dataset is downloadable with a single click or cloning, or at least one can acquire the dataset with simple signing. We set three checklists for the status of each corpus, namely documentation, usage, and redistribution. The first concerns how fine-grained the corpus description is.
• Does the corpus have any documentation on the usage? (doc)
• Does the corpus have a related article?13 (paper)
• Does the corpus have an internationally available publication? (int'l)
Next, we check whether the dataset is both academically and commercially available, academic use only, or unknown (all, academic, unknown). For the last checklist, we also investigate whether redistribution is allowed with or without modification, neither, or unknown (rd, rd/mod-x, none, unknown)14. These attributes are noted along with each corpus title.
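As a hypothetical illustration of this scheme (the field names and record shape are ours, not an official format), an entry in such a curation could be encoded and queried as follows:

```python
# Hypothetical record encoding the three checklists above.
# Field names are illustrative; this is not an official schema.
entry = {
    "name": "ExampleCorpus",
    "documentation": "int'l",   # doc / paper / int'l
    "usage": "all",             # all / academic / unknown
    "redistribution": "rd",     # rd / rd/mod-x / none / unknown
}

def commercially_usable(e):
    """A corpus is commercially usable only if its usage is tagged 'all'."""
    return e["usage"] == "all"

print(commercially_usable(entry))  # True
```

Encoding the attributes this way makes the checklist filterable, which matters for practitioners who need commercial or redistributable corpora only.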
Primarily, we introduce recent efforts to construct Korean NLP benchmarks in the current pretrained language model (PLM) era, and then survey specific areas of Korean NLP.

Benchmark studies
Due to emerging PLM studies in Korean and their public release (Yang, 2021), the need for fair evaluation has grown over the past few years. This led to the construction of benchmark datasets following GLUE, the General Language Understanding Evaluation benchmark (Wang et al., 2018), and such need called for the participation of companies and institutions in creating new benchmarks that aim to evaluate PLMs' Korean understanding capabilities.

KoBEST
[int'l, all, rd] Jang et al. (2022) is a new Korean benchmark dataset designed for more challenging language understanding tasks. It comprises five newly constructed datasets: BoolQ (6K paragraph-sentence pairs), COPA (5K sentence triplets), KB-WiC (6K sentence pairs), KB-HellaSwag (3K paragraphs and four pairs of sentences), and SentiNeg (4K sentence pairs). KoBEST aims to evaluate a model's ability to reason based on more complex knowledge beyond textual form, such as the passage of time, meaning of text, and causality16.

Parsing and tagging
As a part of the classical NLP pipeline, we aggregate studies on POS tagging, tree tagging and dependency parsing, named entity recognition (NER), and SRL.

KAIST Morpho-Syntactically Annotated Corpus
[paper, academic, none] Lee et al. (1999) applies morphological analysis to the freely available KAIST raw corpus17. The scale is about 70M words, and the domains include novels, non-literature, articles, etc.

OpenKorPOS
[int'l, all, rd] Moon et al. (2022) is a semi-automatically constructed corpus for Korean part-of-speech tagging, built with multiple open-source Korean POS analyzers on the Wikipedia dataset18. The corpus contains about 55M words (eojeols) and inherits the license of Wikipedia.

KAIST Korean Tree-Tagging Corpus
[int'l, academic, none] Choi et al. (1994)19 is based on 30K independently collected sentences annotated according to a tree-tagging scheme for Korean.

KMOU NER
[paper, academic, rd] is an NER dataset built by Korea Maritime and Ocean University23. Named entities are tagged for about 24K utterances according to name, time, and number. The data sources are Exo-brain (by ETRI) and their own data combined, while redistribution is allowed only for the latter.

AIR×NAVER NER/SRL
[doc, academic, none] adopted NER24 and SRL25 data constructed by Changwon National University for the purpose of a public competition26, annotated according to the CoNLL format (Tjong Kim Sang and De Meulder, 2003). The corpus sizes are about 90K and 35K, respectively.

KoNEC & KoNNEC
[paper, all, rd] KoNEC (Cheong et al., 2022) is an NER dataset27 that annotates 150 types of named entities on the raw corpus of the KLUE-NER data, and KoNNEC28 is annotated using a nested entity annotation approach on the KoNEC data.

Entailment, sentence similarity, and paraphrase
Here we aggregate corpora for logical inference and similarity checking, as well as style transfer datasets as a part of paraphrase datasets.

Question Pair
[doc, all, rd] consists of about 10,000 open-domain sentence pairs29, with binary labels hand-annotated on whether the sentences are paraphrases or irrelevant.

KorNLI/KorSTS
[int'l, all, rd] Ham et al. (2020) is a natural language inference (NLI) and semantic textual similarity (STS) dataset for Korean30. For KorNLI, the train set was constructed by machine translating SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018), and the valid and test sets were constructed by human translation of XNLI (Conneau et al., 2018). Just as in the original datasets, the pairs are labeled with entailment, contradiction, or neutral. About 940K examples are provided for training, and 2,490 and 5,010 for dev and test, respectively. For KorSTS, scoring ranges from 0 to 5, a more elaborate scheme than a binary paraphrase label. Following the NLI scheme, the 5,749 training pairs were machine translated from the STS-B dataset (Cer et al., 2017), while the 1,500 dev and 1,379 test pairs are human translated.

ParaKQC
[int'l, all, rd] Cho et al. (2020a)31 originally consists of 10,000 questions and commands; each instance is labeled with one of 4 topics (mail, smart agent, scheduling, and weather) and 4 speech acts (wh-question, alternative question, prohibition, and requirement). The sentence set can be extended to about 540K sentence pairs labeled for sentence similarity and paraphrase.

StyleKQC
[int'l, all, rd] Cho et al. (2022b) is a Korean style transfer and paraphrase dataset that deals with formal and informal Korean questions and commands32. It builds upon the construction scheme of ParaKQC and contains 30K sentences, namely 15K each for formal and informal style, built upon 3,000 source phrases, and covers six domains regarding smart agents.

Korean Smile Style Dataset
[doc, academic, rd] Kim (2022b) is a colloquial style transfer dataset containing sentences in 17 styles, built upon a total of about 2,500 dialogs33. Styles include formal and informal, robot-like, chat-style, etc., casually classified by the dataset builder.

Intention understanding, sentiment analysis, and offensive language detection
Beyond corpora covering sentence pairs, here we introduce corpora suited to single-sentence classification tasks, mainly dealing with intention or sentiment.

NSMC
[doc, all, rd] is a review sentiment corpus35 of size 200K, consisting of Naver movie comments automatically labeled according to the methodology of Maas et al. (2011). It adopts pos/neg binary labels and has been widely used as a benchmark for pretrained language models.

Kocasm
[doc, all, rd] Kim and Cho (2019)36 is a Korean sarcasm dataset constructed following the collection scheme of Ghosh and Veale (2016). It contains about 9K Korean tweets crawled online according to sarcasm-related hashtags and manually binary-classified by the authors.

BEEP!
[int'l, all, rd] Moon et al. (2020) is a hand-labeled, crowdsourced dataset of about 9.4K Naver entertainment news comments with hate speech and social bias37. The bias and hate attributes consist of 3 labels each, namely gender/others/none and hate/offensive/none, respectively.

APEACH
[int'l, all, rd] Yang et al. (2022) is a balanced evaluation set containing a total of 4K Korean sentences that are either hate speech or non-hate speech38. All sentences in the dataset were generated by human participants under the instruction of task managers (the authors) and the moderator (of the crowdsourcing platform), given one of ten topics (racism, sexual harassment, gender stereotypes, etc.) per sentence as a condition.
All instances are classified into hate speech, offensive language (comments and profanity terms), and clean expressions, while hate speech can be further annotated with seven multi-label topics.

HateScore
[int'l, academic, rd] Kang et al. (2022) is a multi-label hate speech detection corpus that shares a similar construction scheme with Unsmile39. It contains 35K instances, consisting of 24K online comments, 2.2K neutral sentences from Wikipedia, 1.7K human-in-the-loop generated sentences, and 7.1K rule-generated sentences.

KOLD
[int'l, all, rd] Jeong et al. (2022) is a Korean offensive language detection corpus constructed upon 40K Korean comments from NAVER news and YouTube40. Comments are hierarchically annotated with the type (offensive or not-offensive) and the target (untargeted, individual, or group) of offensive language, including the corresponding text spans. The target group and its attributes are also annotated.

K-MHaS
[int'l, all, rd] Lee et al. (2022a) is a multi-labeled Korean hate speech dataset built upon 109K utterances from Korean online news comments, tagged in (a) a binary manner and (b) 8 fine-grained hate speech classes, including politics, origin, physical, age, gender, religion, race, and profanity41.

KODOLI
[int'l, all, rd] Park et al. (2023) is a recently published Korean dataset for offensive language identification of about 38K sentences42. It consists of various texts collected and sampled from online communities and news articles, tagged with offensive, likely-offensive, and none labels. It also contains two auxiliary annotations regarding abusive language and sentiment.

DKTC
[doc, academic, rd] Cho et al. (2022a) is a Korean dataset of threatening conversations, consisting of 4K training conversations regarding threats, blackmail, bullying, and other harassment (1K each), and a test set of 500 conversations without threats43.

QA and dialogue
In this section, we list QA and dialogue datasets, which include question passages or conversations that are significantly longer than sentence-level instances.
KorQuAD 1.0, 2.0 [int'l, all, rd/mod-x] provides a human-generated QA corpus and leaderboard for Korean44. KorQuAD 1.0 (Lim et al., 2019) benchmarks SQuAD 1.0 (Rajpurkar et al., 2016) and consists of 70K questions in total. KorQuAD 2.0, of size 100K, aims at machine reading comprehension for structured HTML natural questions, and was created referring to the scheme of Google Natural Questions (Kwiatkowski et al., 2019)45.

HuLiC
[doc, academic, rd] consists of human-human conversations of 40K turns and human-machine conversations of 75K turns, where humans talk about movies and human-machine pairs talk about open topics46. Workers' demographics and other evaluation attributes, such as sensibleness, specificity, human-likeness (evaluated turn-wise), and preference (evaluated every 20 turns), are also provided.

OPELA
[int'l, academic, rd] Lee et al. (2022b) is a Korean persona dialogue dataset consisting of about 600 conversations created by human participants, namely eleven persona actors (with accompanying character profiles) and about 500 user actors47. Conversations were held on a chatting app provided by the crowdsourcing platform, where moderators and task managers could monitor the conversation process and moderate probable issues. Dialogues are additionally annotated with six psychology-related attributes, and one third of the created data was published online for public use.

CareCall
[int'l, academic, rd] Bae et al. (2022) is a Korean role-specified open-domain dialogue corpus in the domain of caring for senior citizens48. The dataset is created using LLMs along with human support. 10K filtered dialogues are bot-generated via one-shot dialogue generation and human filtering; each consists of a list of utterances, where each line is tagged with the role (system or user), text, and out-of-bounds (a boolean that checks whether the system utterance violates the role specifications).
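As a rough sketch of the described record format (the field names are our guess, not the official release schema), a dialogue could be checked for role violations like this:

```python
# Hypothetical shape of CareCall-style dialogue turns, inferred from the
# description above; actual field names in the release may differ.
dialogue = [
    {"role": "system", "text": "안녕하세요, 잘 지내셨어요?", "out_of_bounds": False},
    {"role": "user", "text": "네, 잘 지냈어요.", "out_of_bounds": False},
]

def violates_role_spec(turns):
    """True if any system utterance is flagged as out-of-bounds."""
    return any(t["role"] == "system" and t["out_of_bounds"] for t in turns)

print(violates_role_spec(dialogue))  # False
```

The out-of-bounds flag only applies to system turns, since the role specification constrains the bot, not the user.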

Summarization, Translation, and Transliteration
Summarization, translation, and transliteration datasets are grouped separately to reflect their usual use with sequence-to-sequence architectures.
Sci-news-sum-kr [doc, academic, rd] contains about 50 Korean news summaries generated by two Korean natives49. Since the size is not large, it is recommended for use as a dev set.

sae4K
[int'l, all, rd] Cho et al. (2020c) contains sentence-level summarizations of directive sentences50. It includes about 50K utterance and natural language query pairs for questions and commands, where the data is partly based on 3i4K (Cho and Kim, 2022) and partly human-generated in concurrence with Cho et al. (2020a).

Korean Parallel Corpora
[int'l, academic, rd/mod-x] Park et al. (2016) contains about 100K en-ko sentence pairs for machine translation (MT). The data is mainly based on news articles, and data on North Korean is now also provided51.

KAIST Translation Evaluation Set
[doc, academic, none] is an evaluation set of about 3,000 sentences for en-ko MT52, augmented with index, original sentence, translation, related articles, and text source.

Transliteration Dataset
[doc, all, rd] is not an official data repository54, but a collection of en-ko transliterations drawn from public dictionaries such as NIKL or Wiktionary55. A total of about 35K en (word) - ko (pronunciation) pairs are included.

KAIST Transliteration Evaluation Set
[doc, academic, none] is a word-pronunciation pair set for en-ko phonotactics56, consisting of 7,186 words excerpted from the loanword dictionary57.

Korean in multilingual corpora
We also investigate Korean examples in multilingual corpora, where each dataset may belong to one of the types discussed above.

Multilingual G2P Conversion
[int'l, all, rd] Gorman et al. (2020) is a shared task of SIGMORPHON 202058, which aims to transform a grapheme sequence into a phoneme sequence. The dataset was created with WikiPron59 (Lee et al., 2020) and covers 10 languages including Korean (3,600 pairs for train and 450 each for dev/test).

PAWS-X
[int'l, all, rd] Yang et al. (2019) is a dataset consisting of 23,659 human-translated PAWS evaluation pairs (Zhang et al., 2019) and about 300K machine-translated ones, for 6 languages including Korean60. Among them, Korean occupies about 5K train pairs, and 1,965 and 1,972 pairs for dev and test, respectively.

Multilingual Tweet Intimacy Analysis
[int'l, unk, unk] Pei et al. (2022) is a multilingual tweet intimacy dataset (MINT) suggested as a task of SemEval 202364. Tweets of six languages labeled with scores 1-5 are used in both training and test (2K instances each), and tweets of four additional languages including Korean are used only for zero-shot evaluation (500 instances each).

IWSLT 2023
[int'l, all, rd] holds a formality track that accommodates research in formality translation, where the released dataset consists of 1,000 pairs each for en-ko and en-vi, and 600 zero-shot examples each for en-pt and en-ru65. Each pair includes a source English sentence and an informal/formal version of the sentence in the target language.

Speech corpora
Though most of the datasets discussed in this paper are in text format, we also incorporate speech corpora, which can be usefully utilized in speech recognition, speech synthesis, or spoken language processing. Speech datasets are usually massive, so downloading via a single click is not necessarily guaranteed. Thus, we list some of them as open even if they require an application form.

KSS
[doc, academic, rd] Park (2018) is a book corpus read by a female voice actress. 12K speech utterances and transcriptions are provided66.

Zeroth
[doc, all, rd] is an automatic speech recognition (ASR) dataset that contains approximately 50 hours of well-refined training data67. The speech corpus is provided free upon request and can be used for both research and commercial purposes.

ClovaCall
[int'l, academic, none] Ha et al. (2020) is an ASR dataset that consists of approximately 80 hours of telephone speech68. The corpus is provided upon request, for research purposes only.

Pansori-TED×KR
[int'l, academic, rd/mod-x] Choi and Lee (2018) is an ASR dataset obtained by extracting the voices of Korean speakers from Pansori (traditional Korean song in colloquial style) and TED videos, with transcriptions added69. The total reaches only 3 hours, but it incorporates unique phonations that are not found in other datasets.

ProSem
[int'l, all, rd] Cho et al. (2019) is a spoken language understanding corpus for syntactic ambiguity resolution in Korean, classifying spoken utterances into 7 speech acts70. For about 7,100 utterances recorded by two speakers, one male and one female, the ground-truth text and label are annotated along with the English translation.

kosp2e
[int'l, academic, rd] Cho et al. (2021) is a Korean speech-to-English text translation dataset consisting of 30K utterances and their Korean script/English translation71. The dataset is based on four publicly available corpora, and the license follows each source corpus. It covers various text domains such as news, textbooks, AI agent, and diary, and all translations were performed manually.

JIT/JSS
[int'l, all, rd] Park et al. (2020) are datasets containing audio files recorded by a native Jejueo speaker, along with transcript files72.

Other topics
Here, we accommodate other Korean datasets that are substantial in quantity, quality, and documentation but were not covered in the previous sections.

KoCHET
[int'l, academic, unk] Kim et al. (2022) is a Korean cultural heritage corpus for entity-related tasks, covering named entity recognition, relation extraction (RE), and entity typing (ET), and consisting of 112K NER, 38K RE, and 113K ET examples73. The construction of the dataset was advised by Korean cultural heritage experts, and modified redistribution is allowed for worldwide researchers.

KommonGen
[int'l, all, rd] Seo et al. (2021) is a Korean generation dataset that translates MS-COCO captions into Korean and uses them for generalized commonsense inference74. Though only partly available at first, the team expanded the research to be internationally available (Seo et al., 2022) and published the official train set (43K instances) and test set (2K instances) online. The full dataset is available on AI HUB, a government-driven dataset hub for Korean AI, but access is restricted to Korean citizens75.

LBox Open
[int'l, academic, rd] Hwang et al. (2022) is a multi-task benchmark for Korean legal language understanding and judgement prediction, where the precedent corpus consists of 150K Korean legal precedent cases (80K from law open data and 70K from the creator's own database)76. Tasks include case name classification (100 classes, 10K pairs), statute classification (46 classes, 2,760 pairs), fine/imprisonment range prediction (10K examples), summarization, etc.

K2NLG
[doc, academic, rd] is a dataset for generating summaries from knowledge sets (or knowledge graphs). The object types and relationships of the target knowledge follow the KBox ontology77.

Korean Ambiguity Dataset
[int'l, all, rd] Bareun (2023) is a word sense disambiguation dataset constructed by Bareun NLP in collaboration with the Korean linguistics department of Seoul National University, containing 35K sentences and about 8,200 surface forms of Korean78. It aims at building a comprehensive and objective benchmark for the morpheme-level decomposition of Korean sentences.

Korean GEC dataset
[int'l, academic, rd] Yoon et al. (2022) is a secondarily processed dataset for correcting grammatical errors in Korean, derived from a Korean language learner corpus. The Korean GEC dataset can only be used under the same license as the original data79.

Summary
In total, we surveyed 62 corpora, namely 40 Korean text corpora, 14 multilingual corpora, and 8 speech corpora. They comprise 2 benchmark studies, 10 datasets on parsing and tagging, 6 datasets on entailment, sentence similarity, and paraphrase, 11 datasets on intention understanding, sentiment analysis, and offensive language detection, 5 datasets on QA and dialogue, 7 datasets on summarization, translation, and transliteration, 7 datasets on Korean in multilingual corpora, 8 datasets on speech (pre-)processing, and 6 Korean datasets on other underclassified topics80. We highlight the advent of Korean-specific benchmark datasets and the increasing number of datasets concerning offensive language, which relates closely to the prevalence of large LMs and PLM studies regarding evaluation and risk prevention. We deem that this trend is not temporary, but the future trend is currently obscure due to the decreased need for task-specific corpora. We will update the full specification table of datasets in our GitHub repository81.

Documentation
Ensuring that a curated list of resources is up-to-date is a challenge. In this regard, we aim to make our work open and canonical, as an online repository of curated resources for Korean. For the research community to have unconstrained access to all current open resources, while endorsing community contributions, the following criteria are crucial:
• The canonical, current version of this paper will be regularly published as a revision, e.g., on arxiv.org, based on a community-open version of this paper.
• The resources will also have a corresponding registry, following the same metadata protocol we used in this survey, for usability in different types of research.
• Each new contribution to the resource list will have a corresponding entry in the acknowledgments section.
We will make the registry machine-parseable, so that other curated sites, such as nlpprogress.com, can use the registry to automate updates. The project will be maintained as an open-source project under a permissive license. A living document is new territory for academia, but we strongly believe that, given the rapid progress of NLP research, this is an experiment worth attempting; we hope that a successful effort can inspire other languages to follow the same approach. Our approach is described in the public repository, guaranteeing accessibility for domestic and international researchers. Also, a large portion of the data is expected to be more easily accessible via Koco82 and Korpora83, recently constructed dataset wrappers for Korean NLP.
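As a minimal sketch of what such a machine-parseable registry could look like (the JSON schema here is our assumption, not a published format), a downstream site could consume and filter it as follows:

```python
import json

# Two hypothetical registry entries using the attribute scheme of this survey
# (doc/paper/int'l, all/academic/unknown, rd/rd-mod-x/none/unknown).
registry_json = """
[
  {"name": "NSMC", "doc": "doc", "usage": "all", "redistribution": "rd"},
  {"name": "ClovaCall", "doc": "int'l", "usage": "academic", "redistribution": "none"}
]
"""
registry = json.loads(registry_json)

# Keep only corpora that are usable for any purpose and redistributable.
open_corpora = [e["name"] for e in registry
                if e["usage"] == "all" and e["redistribution"].startswith("rd")]
print(open_corpora)  # ['NSMC']
```

A flat JSON list like this is trivial for aggregator sites to poll, and the attribute fields map one-to-one onto the bracketed tags used throughout this survey.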
Limitation
Given recent developments in building Korean PLMs, the need for (commercially) available Korean raw texts has become significant in both industry and open-source domains. Though a large portion of government- or institution-driven corpora satisfies this need, and Korean web texts have been managed within multilingual web-crawl corpora disclosed in venues such as LREC, we have not covered those raw texts in this paper, since the two types of corpora (namely, annotated ones and those for PLM pretraining) differ greatly in the cleanliness/format of the data and the objective of construction. One may certainly utilize the annotated corpora for PLM pretraining and vice versa (that is, annotate a part of a massive raw text for a specific task), but we deemed that a survey of raw texts can be handled more thoroughly within reports of open-source (or industry-driven) Korean PLM building projects such as Polyglot (Ko et al., 2022), and we redirect readers there for further knowledge.
Another limitation of our study is that, while the recent development of large language models such as ChatGPT84 has brought a wave of change in data annotation and construction schemes, our study still focuses on datasets built in conventional ways, i.e., manually or semi-automatically. We believe that human-generated or human-annotated datasets remain a valuable reference for machine annotation, prompting, and future studies of human behaviors or thoughts, especially in Korean NLP, where model-centric research is actively ongoing but computational analysis of language data is still less highlighted.

Conclusion
In this paper, we investigated Korean NLP datasets constructed and released as public resources. Our curation presents a variety of open corpora that are freely available. This information will be helpful not only for Korean researchers who want to start NLP, but also for international researchers who are interested in Korean NLP. Nonetheless, we think that Korean open corpora are still under-disclosed or insufficient. It is notable that the Korean government is currently supplying substantial funds to build databases. To guide this well, appropriate management and documentation should be guaranteed, so that the construction is meaningful and the outcomes are internationally available.