DiscoFuse: A Large-Scale Dataset for Discourse-Based Sentence Fusion

Mor Geva, Eric Malmi, Idan Szpektor, Jonathan Berant


Abstract
Sentence fusion is the task of joining several independent sentences into a single coherent text. Current datasets for sentence fusion are small and insufficient for training modern neural models. In this paper, we propose a method for automatically generating fusion examples from raw text and present DiscoFuse, a large-scale dataset for discourse-based sentence fusion. We author a set of rules for identifying a diverse set of discourse phenomena in raw text and for decomposing the text into two independent sentences. We apply our approach to two document collections, Wikipedia and sports articles, yielding 60 million fusion examples annotated with the discourse information required to reconstruct the fused text. We develop a sequence-to-sequence model on DiscoFuse and thoroughly analyze its strengths and weaknesses with respect to the various discourse phenomena, using both automatic and human evaluation. Finally, we conduct transfer learning experiments with WebSplit, a recent dataset for text simplification. We show that pretraining on DiscoFuse substantially improves performance on WebSplit when viewed as a sentence fusion task.
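To make the rule-based decomposition described in the abstract more concrete, the following is a minimal sketch of a single splitting rule. It is an illustration only, not the authors' rule set: the connective list, the regular expression, and the function name `split_fused` are all hypothetical, and the sketch handles just one pattern (two clauses joined by a comma and a connective).

```python
import re

# Hypothetical connective list; the paper's actual rules are not reproduced here.
CONNECTIVES = ("but", "although", "because")

# Assumed pattern: "<clause A>, <connective> <clause B>."
PATTERN = re.compile(
    r"^(?P<first>[^,]+), (?P<conn>" + "|".join(CONNECTIVES) + r") (?P<second>.+?)\.?$"
)


def split_fused(text):
    """Return (sentence1, sentence2, connective), or None if the rule does not apply."""
    match = PATTERN.match(text.strip())
    if match is None:
        return None
    # Close the first clause with a period.
    first = match.group("first").strip() + "."
    # Capitalize the second clause and close it with a period.
    second = match.group("second").strip()
    second = second[0].upper() + second[1:] + "."
    # The removed connective is kept as the annotation needed to re-fuse the pair.
    return first, second, match.group("conn")


example = split_fused("The team played well, but it lost the final.")
```

Under these assumptions, the two returned sentences would serve as the model input, the original text as the fusion target, and the connective as the discourse annotation.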
Anthology ID: N19-1348
Volume: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
Month: June
Year: 2019
Address: Minneapolis, Minnesota
Editors: Jill Burstein, Christy Doran, Thamar Solorio
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 3443–3455
URL: https://aclanthology.org/N19-1348
DOI: 10.18653/v1/N19-1348
Cite (ACL): Mor Geva, Eric Malmi, Idan Szpektor, and Jonathan Berant. 2019. DiscoFuse: A Large-Scale Dataset for Discourse-Based Sentence Fusion. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3443–3455, Minneapolis, Minnesota. Association for Computational Linguistics.
Cite (Informal): DiscoFuse: A Large-Scale Dataset for Discourse-Based Sentence Fusion (Geva et al., NAACL 2019)
PDF: https://aclanthology.org/N19-1348.pdf
Video: https://aclanthology.org/N19-1348.mp4
Data: DiscoFuse