Data sets for NLG

This page lists data sets and corpora used for research in natural language generation. They are available for download over the web. If you know of a dataset that is not listed here, you can email siggen-board@aclweb.org, or just click on Edit in the upper left corner of this page and add the dataset yourself.

We also have a blog page about data sets, which includes comments about appropriate and inappropriate usage, additional information about data sets, and pointers to related resources.

Data-to-text/Concept-to-text Generation

These datasets contain data and corresponding texts based on this data.

boxscore-data (Rotowire) and SportSett

https://github.com/harvardnlp/boxscore-data/

Boxscore-data consists of (human-written) NBA basketball game summaries aligned with their corresponding box- and line-scores.

https://github.com/nlgcat/sport_sett_basketball

SportSett is an expanded data set which includes additional information about basketball games. It is structured as a relational database.

E2E

http://www.macs.hw.ac.uk/InteractionLab/E2E/#data (blog comments)

Crowdsourced restaurant descriptions with corresponding restaurant data. English.

Methodius Corpus

https://www.inf.ed.ac.uk/research/isdd/admin/package?view=1&id=197

This dataset consists of 5000 short texts describing ancient Greek artefacts, generated by the Methodius NLG system. Each text is linked to its corresponding content plan (including rhetorical relations) and OpenCCG logical form (which describes the syntactic structure).

Personage Stylistic Variation for NLG

https://nlds.soe.ucsc.edu/stylistic-variation-nlg

This dataset provides training data for natural language generation of restaurant descriptions in different Big-Five personality styles.

Personage Sentence Planning for NLG

https://nlds.soe.ucsc.edu/sentence-planning-NLG

This dataset provides training data for natural language generation of restaurant descriptions using sentence planning operations of various kinds.

SUMTIME

https://ehudreiter.files.wordpress.com/2016/12/sumtime.zip (blog comments)

Weather forecasts written by human forecasters, with corresponding forecast data, for UK North Sea marine forecasts.

ToTTo

https://github.com/google-research-datasets/ToTTo/

100,000 examples of descriptions of the content of highlighted cells in a Wikipedia table.
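As a rough illustration of what "highlighted cells" means here, the sketch below models a table as a list of rows and a highlight as (row, column) index pairs, then pulls out the highlighted values that a description would be expected to cover. The table contents, the data layout, and the function name `highlighted_values` are illustrative assumptions, not the dataset's actual schema; consult the repository for the real file format.

```python
from typing import List, Tuple


def highlighted_values(table: List[List[str]],
                       highlighted_cells: List[Tuple[int, int]]) -> List[str]:
    """Return the cell values at the given (row, column) positions."""
    return [table[row][col] for row, col in highlighted_cells]


# Made-up toy table (hypothetical, not taken from the dataset).
table = [
    ["Year", "Team", "Games"],
    ["2001", "Example FC", "30"],
    ["2002", "Sample United", "28"],
]

# Hypothetical highlight: the team and games cells of the first data row.
highlighted_cells = [(1, 1), (1, 2)]

print(highlighted_values(table, highlighted_cells))
```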

Weather

https://github.com/facebookresearch/TreeNLG

~30K human-annotated utterances for tree-structured weather meaning representations.

WeatherGov

https://cs.stanford.edu/~pliang/data/weather-data.zip (blog comments)

Computer-generated weather forecasts from weather.gov (US public forecast), along with corresponding weather data.

WebNLG

http://webnlg.loria.fr/pages/data.html (blog comments)

Crowdsourced descriptions of semantic web entities, with corresponding RDF triples.
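To make the pairing of RDF triples with text concrete, here is a minimal sketch of one way such an instance could be held in memory and flattened into a single input string for a generator. The `Triple` class, the `linearize` function, the separator characters, and the example entities are all hypothetical; they are not the dataset's file format or an official API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Triple:
    """One RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


def linearize(triples: List[Triple]) -> str:
    """Flatten triples into one string (separator choice is an assumption)."""
    return " ; ".join(f"{t.subject} | {t.predicate} | {t.obj}" for t in triples)


# Made-up example instance (hypothetical entities and text).
example_triples = [
    Triple("Example_City", "country", "Example_Land"),
    Triple("Example_City", "populationTotal", "12345"),
]
example_text = "Example City is in Example Land and has 12345 inhabitants."

print(linearize(example_triples))
```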

WikiBio (Wikipedia biography dataset)

https://github.com/DavidGrangier/wikipedia-biography-dataset (blog comments)

This dataset gathers 728,321 biographies from Wikipedia. It consists of the first paragraph and the infobox (both tokenized).

WikiBio German and French (Wikipedia biography dataset)

https://github.com/PrekshaNema25/StructuredData_To_Descriptions

This dataset consists of the first paragraph and the infobox from German and French Wikipedia biography pages.

Wikipedia Person and Animal Dataset

https://eaglew.github.io/dataset/narrating

This dataset gathers 428,748 person and 12,236 animal infoboxes with descriptions, based on a Wikipedia dump (2018/04/01) and Wikidata (2018/04/12).

The Wikipedia company corpus

https://gricad-gitlab.univ-grenoble-alpes.fr/getalp/wikipediacompanycorpus

Company descriptions collected from Wikipedia. The dataset contains semantic representations, short descriptions, and long descriptions for 51K companies in English.

Referring Expressions Generation

Referring expression generation is a sub-task of NLG that focuses only on the generation of referring expressions (descriptions) that identify specific entities called targets.

GRE3D3 and GRE3D7: Spatial Relations in Referring Expressions

http://jetteviethen.net/research/spatial.html

Two web-based production experiments were conducted by Jette Viethen under the supervision of Robert Dale. The resulting corpora GRE3D3 and GRE3D7 contain 720 and 4480 referring expressions, respectively. Each referring expression describes a simple object in a simple 3D scene. GRE3D3 scenes contain 3 objects and GRE3D7 scenes contain 7 objects.

RefClef, RefCOCO, RefCOCO+ and RefCOCOg

https://github.com/lichengunc/refer

Referring expressions for objects in images, and the corresponding images.

The REAL dataset

https://datastorre.stir.ac.uk/handle/11667/82

Referring expressions for real-world objects in images, and the corresponding images.

GeoDescriptors

https://gitlab.citius.usc.es/alejandro.ramos/geodescriptors

Geographical descriptions (e.g., "Norte de Galicia") and corresponding regions on a map.

TUNA Reference Corpus

https://www.abdn.ac.uk/ncs/departments/computing-science/corpus-496.php (blog comments)

https://www.abdn.ac.uk/ncs/documents/corpus.zip [direct download]

The TUNA Reference Corpus is a semantically and pragmatically transparent corpus of identifying references to objects in visual domains. It was constructed via an online experiment and has since been used in a number of evaluation studies on Referring Expressions Generation, as well as in two Shared Tasks: the Attribute Selection for Referring Expressions Generation task (2007), and the Referring Expression Generation task (2008). Main authors: Kees van Deemter, Albert Gatt, Ielka van der Sluis.

COCONUT Corpus

http://www.pitt.edu/~coconut/coconut-corpus.html

http://www.pitt.edu/%7Ecoconut/corpora/corpus.tar.gz [direct download]

COCONUT was a project on “Cooperative, coordinated natural language utterances”. The COCONUT corpus is a collection of computer-mediated dialogues in which two subjects collaborate on a simple task, namely buying furniture. SGML annotations were added according to the COCONUT-DRI coding scheme.

Stars2 corpus of referring expressions

A collection of 884 annotated definite descriptions produced by 56 subjects in collaborative communication involving speaker-hearer pairs, in situations designed to challenge existing REG algorithms, with a particular focus on the issue of attribute choice in referential overspecification. Link: https://drive.google.com/file/d/0B-KyU7T8S8bLZ1lEQmJRdUc1V28/view?usp=sharing Cite: https://link.springer.com/article/10.1007/s10579-016-9350-y

b5 corpus of text and referring expressions labelled with personality information

A collection of crowdsourced scene descriptions and an annotated REG corpus, both labelled with the Big Five personality scores of their authors. Suitable for studies in personality-dependent text generation and referring expression generation. Link: https://drive.google.com/open?id=0B-KyU7T8S8bLTHpaMnh2U2NWZzQ Cite: https://www.aclweb.org/anthology/L18-1183

Surface Realisation

Surface Realization Shared Task 2018 (SR'18) dataset

http://taln.upf.edu/pages/msr2018-ws/SRST.html#data

A multilingual dataset automatically converted from Universal Dependencies v2.0, comprising unordered syntactic structures (10 languages) and predicate-argument structures (3 languages).

Finnish morphology

https://www.kaggle.com/mikahama/finnish-locative-cases-for-nouns

Dataset for picking the correct locative case for Finnish nouns (e.g. Venäjällä vs Suomessa).

https://www.kaggle.com/mikahama/cases-of-complements-of-finnish-verbs

Dataset for picking the right case for objects of verbs in Finnish (e.g. näen talon vs uneksin talosta).

Dialogue

Alex Context NLG Dataset

https://github.com/UFAL-DSG/alex_context_nlg_dataset

A dataset for NLG in dialogue systems in the public transport information domain. Each data instance includes its preceding context, which should allow NLG systems trained on this data to adapt to the user's way of speaking and thereby improve perceived naturalness. Papers: http://workshop.colips.org/re-wochat/documents/02_Paper_6.pdf, https://www.aclweb.org/anthology/W16-3622

Cam4NLG

https://github.com/shawnwun/RNNLG/tree/master/data

Cam4NLG contains 4 NLG datasets for dialogue system development, each in a unique domain. Each data point contains a (dialogue act, ground truth, handcrafted baseline) tuple.
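The tuple structure described above could be modelled along the following lines. This is only a sketch: the `DataPoint` class, its field names, the helper `references`, and the dialogue-act string and sentences are hypothetical illustrations rather than the repository's actual file format, which should be checked before loading the data.

```python
from typing import List, NamedTuple


class DataPoint(NamedTuple):
    """One instance: input dialogue act plus two candidate realisations."""
    dialogue_act: str   # meaning representation given to the generator
    ground_truth: str   # human-written reference utterance
    baseline: str       # output of a handcrafted generator


def references(points: List[DataPoint]) -> List[str]:
    """Collect the human references, e.g. for computing automatic metrics."""
    return [p.ground_truth for p in points]


# Made-up example (hypothetical act syntax and text).
points = [
    DataPoint(
        dialogue_act="inform(name=example_place;food=italian)",
        ground_truth="Example Place serves Italian food.",
        baseline="example_place is a restaurant that serves italian food.",
    ),
]

print(references(points))
```

Keeping the handcrafted baseline next to each reference makes it straightforward to compare a trained generator against both on the same input.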

CLASSiC WOZ corpus on Information Presentation in Spoken Dialogue Systems

http://www.classic-project.org/corpora

CLASSiC is a project on Computational Learning in Adaptive Systems for Spoken Conversation. The Wizard-of-Oz corpus on Information Presentation in Spoken Dialogue Systems contains the wizards' choices of Information Presentation strategy (summary, compare, recommend, or a combination of those) and attribute selection. The domain is restaurant search in Edinburgh. Objective measures (such as dialogue length, number of database hits, and number of sentences generated) as well as subjective measures (the user scores) were logged.

CODA corpus Release 1.0

http://computing.open.ac.uk/coda/resources/code_form.html

This release contains approximately 700 turns of human-authored expository dialogue (by Mark Twain and George Berkeley) which has been aligned with monologue that expresses the same information as the dialogue. The monologue side is annotated with Coherence Relations (RST), and the dialogue side with Dialogue Act tags.

Hotel Dialogs for NLG

https://nlds.soe.ucsc.edu/hotels

This set of hotel corpora includes a set of paraphrases, room and property descriptions, and full hotel dialogues aimed at exploring different ways of eliciting dialogic, conversational descriptions about hotels.

Summarisation

CASS (French)

https://github.com/euranova/CASS-dataset

This dataset is composed of decisions made by the French Court of Cassation and summaries of these decisions written by lawyers.

TL;DR

https://toolbox.google.com/datasetsearch/search?query=Webis-TLDR-17%20Corpus&docid=kzcwbWD9z3B4Ah3wAAAAAA%3D%3D

Dataset for abstractive summarization constructed from Reddit posts. It is the largest corpus (approximately 3 million posts) of informal text such as social media text, and can be used to train neural networks for summarization.

Image description

Question Generation

QGSTEC 2010 Generating Questions from Sentences Corpus

http://computing.open.ac.uk/coda/resources/qg_form.html

A corpus of over 1000 questions (both human and machine generated). The automatically generated questions have been rated by several raters according to five criteria (relevance, question type, syntactic correctness and fluency, ambiguity, and variety).

QGSTEC+

https://github.com/Keith-Godwin/QG-STEC-plus

Improved annotations for the QGSTEC corpus (with higher inter-rater reliability) as described in Godwin and Piwek (2016).

Paper Generation

ACL Title and Abstract Dataset

https://github.com/EagleW/ACL_titles_abstracts_dataset

This dataset gathers 10,874 title and abstract pairs from the ACL Anthology Network (until 2016).

PubMed Term, Abstract, Conclusion, Title Dataset

https://eaglew.github.io/dataset/paperrobot_writing

This dataset gathers three types of pairs from PubMed: Title-to-Abstract (Training: 22,811/Development: 2095/Test: 2095), Abstract-to-Conclusion and Future work (Training: 22,811/Development: 2095/Test: 2095), and Conclusion and Future work-to-Title (Training: 15,902/Development: 2095/Test: 2095). Each pair contains an input and an output as well as the corresponding terms (from the original KB and link prediction results).

ReviewRobot Dataset

https://github.com/EagleW/ReviewRobot/blob/master/dataset/README.md

This dataset contains 8,110 paper and review pairs and background KG from 174,165 papers. It also contains information extraction results from SciIE and various knowledge graphs built on the IE results.

Challenge Data Repository

https://sites.google.com/site/genchalrepository/

Other

PIL: Patient Information Leaflet corpus

http://mcs.open.ac.uk/nlg/old_projects/pills/corpus/PIL/

http://mcs.open.ac.uk/nlg/old_projects/pills/corpus/PIL-corpus-2.0.tar.gz [direct download]

The Patient Information Leaflet (PIL) corpus is a searchable and browsable collection of patient information leaflets available in various document formats as well as structurally annotated SGML. The PIL corpus was initially developed as part of the ICONOCLAST project at ITRI, Brighton.

Validity of BLEU Evaluation Metric

https://abdn.pure.elsevier.com/en/datasets/data-for-structured-review-of-the-validity-of-bleu

https://abdn.pure.elsevier.com/files/125166547/bleu_survey_data.zip [direct download]

Correlations between BLEU and human evaluations (for MT as well as NLG), extracted from papers in the ACL Anthology.

This page was imported semi-automatically from the NLG Resources Wiki which was run by ACL SIGGEN in the years 2005–2009. Please correct conversion errors and help update its contents.

Now this page is associated with the Natural Language Generation Portal.