Data sets for NLG

2019-04-12T19:14:41Z

Emiel: Adding some image description datasets.

This page lists datasets and corpora used for research in natural language generation. They are available for download over the web. If you know of a dataset which is not listed here, you can email siggen-board@aclweb.org, or just click on Edit in the upper left corner of this page and add the dataset yourself.

==Data-to-text/Concept-to-text Generation==
These datasets pair structured data with texts that describe that data.

=== boxscore-data ===
https://github.com/harvardnlp/boxscore-data/

This dataset consists of (human-written) NBA basketball game summaries aligned with their corresponding box- and line-scores.

=== E2E ===
http://www.macs.hw.ac.uk/InteractionLab/E2E/#data

Crowdsourced restaurant descriptions with corresponding restaurant data. English.

=== SUMTIME ===
https://ehudreiter.files.wordpress.com/2016/12/sumtime.zip

Weather forecasts written by human forecasters, with corresponding forecast data, for UK North Sea marine forecasts.

=== WeatherGov ===
https://cs.stanford.edu/~pliang/data/weather-data.zip

Computer-generated weather forecasts from weather.gov (US public forecast), along with corresponding weather data.

=== WebNLG ===
https://github.com/ThiagoCF05/webnlg

Crowdsourced descriptions of semantic web entities, with corresponding RDF triples.

=== The Wikipedia company corpus ===
https://gricad-gitlab.univ-grenoble-alpes.fr/getalp/wikipediacompanycorpus

Company descriptions collected from Wikipedia. The dataset contains semantic representations and short and long descriptions for 51K companies, in English.

== Referring Expression Generation ==
Referring expression generation is a sub-task of NLG that focuses only on the generation of referring expressions (descriptions) that identify specific entities called targets.

=== GRE3D3 and GRE3D7: Spatial Relations in Referring Expressions ===
http://jetteviethen.net/research/spatial.html

Two web-based production experiments were conducted by Jette Viethen under the supervision of Robert Dale.
The resulting corpora GRE3D3 and GRE3D7 contain 720 and 4480 referring expressions, respectively. Each referring expression describes a simple object in a simple 3D scene. GRE3D3 scenes contain 3 objects and GRE3D7 scenes contain 7 objects.

=== RefClef, RefCOCO, RefCOCO+ and RefCOCOg ===
https://github.com/lichengunc/refer

Referring expressions for objects in images, and the corresponding images.

=== The REAL dataset ===
https://datastorre.stir.ac.uk/handle/11667/82

Referring expressions for real-world objects in images, and the corresponding images.

=== GeoDescriptors ===
https://gitlab.citius.usc.es/alejandro.ramos/geodescriptors

Geographical descriptions (e.g., "Norte de Galicia") and the corresponding regions on a map.

=== TUNA Reference Corpus ===
https://www.abdn.ac.uk/ncs/departments/computing-science/corpus-496.php

https://www.abdn.ac.uk/ncs/documents/corpus.zip [direct download]

The TUNA Reference Corpus is a semantically and pragmatically transparent corpus of identifying references to objects in visual domains. It was constructed via an online experiment and has since been used in a number of evaluation studies on Referring Expressions Generation, as well as in two Shared Tasks: the Attribute Selection for Referring Expressions Generation task (2007), and the Referring Expression Generation task (2008). Main authors: Kees van Deemter, Albert Gatt, Ielka van der Sluis.

=== COCONUT Corpus ===
http://www.pitt.edu/~coconut/coconut-corpus.html

http://www.pitt.edu/%7Ecoconut/corpora/corpus.tar.gz [direct download]

COCONUT was a project on “Cooperative, coordinated natural language utterances”. The COCONUT corpus is a collection of computer-mediated dialogues in which two subjects collaborate on a simple task, namely buying furniture. SGML annotations were added according to the [http://www.pitt.edu/%7Epjordan/papers/coconut-manual.pdf COCONUT-DRI coding scheme].

=== Stars2 corpus of referring expressions ===
A collection of 884 annotated definite descriptions produced by 56 subjects in collaborative communication involving speaker-hearer pairs in situations designed so as to challenge existing REG algorithms, with a particular focus on the issue of attribute choice in referential overspecification.
Link: https://drive.google.com/file/d/0B-KyU7T8S8bLZ1lEQmJRdUc1V28/view?usp=sharing
Cite: https://link.springer.com/article/10.1007/s10579-016-9350-y

=== b5 corpus of text and referring expressions labelled with personality information ===
A collection of crowdsourced scene descriptions and an annotated REG corpus, both labelled with the Big Five personality scores of their authors. Suitable for studies in personality-dependent text generation and referring expression generation.
Link: https://drive.google.com/open?id=0B-KyU7T8S8bLTHpaMnh2U2NWZzQ
Cite: https://www.aclweb.org/anthology/L18-1183

== Dialogue ==

=== Cam4NLG ===
https://github.com/shawnwun/RNNLG/tree/master/data

Cam4NLG contains four NLG datasets for dialogue system development, each in a different domain. Each data point is a (dialogue act, ground truth, handcrafted baseline) tuple.

=== CLASSiC WOZ corpus on Information Presentation in Spoken Dialogue Systems ===
http://www.classic-project.org/corpora

CLASSiC is a project on [http://www.classic-project.org/ Computational Learning in Adaptive Systems for Spoken Conversation]. The Wizard-of-Oz corpus on Information Presentation in Spoken Dialogue Systems contains the wizards' choices of Information Presentation strategy (summary, compare, recommend, or a combination of those) and attribute selection. The domain is restaurant search in Edinburgh. Objective measures (such as dialogue length, number of database hits, and number of sentences generated) as well as subjective measures (the user scores) were logged.

== Summarisation ==

=== TL;DR ===
https://toolbox.google.com/datasetsearch/search?query=Webis-TLDR-17%20Corpus&docid=kzcwbWD9z3B4Ah3wAAAAAA%3D%3D

Dataset for abstractive summarization constructed from Reddit posts. With approximately 3 million posts, it is presented as the largest corpus of informal text such as social media text, and it can be used to train neural summarization models.

== Image description ==

===Chinese===
* Flickr8k-CN: http://lixirong.net/datasets/flickr8kcn

===Dutch===

* DIDEC: http://didec.uvt.nl
* Flickr30K: https://github.com/cltl/DutchDescriptions

===German===
* Multi30K: http://www.statmt.org/wmt16/multimodal-task.html

== Other ==
=== PIL: Patient Information Leaflet corpus ===
http://mcs.open.ac.uk/nlg/old_projects/pills/corpus/PIL/

http://mcs.open.ac.uk/nlg/old_projects/pills/corpus/PIL-corpus-2.0.tar.gz [direct download]

The Patient Information Leaflet (PIL) corpus is a [http://www.itri.brighton.ac.uk/projects/pills/corpus/PIL/searchtool/search.html searchable] and [http://mcs.open.ac.uk/nlg/old_projects/pills/corpus/PIL/ browsable] collection of patient information leaflets available in various document formats as well as structurally annotated SGML. The PIL corpus was initially developed as part of the ICONOCLAST project at ITRI, Brighton.

=== Validity of BLEU Evaluation Metric ===
https://abdn.pure.elsevier.com/en/datasets/data-for-structured-review-of-the-validity-of-bleu

https://abdn.pure.elsevier.com/files/125166547/bleu_survey_data.zip [direct download]

Correlations between BLEU and human evaluations (for MT as well as NLG), extracted from papers in the ACL Anthology.

[[Category:Knowledge Collections and Datasets]]
{{SIGGEN Wiki}}
