On the Creation of a Corpus for Coherence Evaluation of Discursive Units

Elham Mohammadi, Timothe Beiko, Leila Kosseim


Abstract
In this paper, we report on our experiments towards the creation of a corpus for coherence evaluation. Most corpora for textual coherence evaluation are composed of randomly shuffled sentences and focus on sentence ordering, regardless of whether the sentences were originally related by a discourse relation. To the best of our knowledge, no publicly available corpus has been designed specifically for evaluating the coherence of known discursive units. We focus on coherence modeling at the intra-discursive level and describe our approach to building a corpus of incoherent pairs of sentences. We experiment with a variety of corruption strategies to create synthetic incoherent pairs of discourse arguments from coherent ones. Using discourse argument pairs from the Penn Discourse Treebank, we generate incoherent discourse argument pairs by swapping either their discourse connective or a discourse argument. To evaluate how incoherent the generated corpora are, we use a convolutional neural network to try to distinguish the original pairs from the corrupted ones. Results of the classifier, as well as a manual inspection of the corpora, show that generating such corpora is still a challenge, as the generated instances are clearly not “incoherent enough”. This indicates that more effort should be spent on developing more robust ways of generating incoherent corpora.
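The two corruption strategies named in the abstract (swapping the discourse connective, or swapping a discourse argument) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `DiscoursePair` schema, the function names, the choice to draw replacements from other instances in the same set, and the decision to always corrupt the second argument are all assumptions made for illustration. The abstract does not say how replacements are selected.

```python
import random
from typing import List, NamedTuple


class DiscoursePair(NamedTuple):
    """One coherent instance: two arguments linked by a connective (hypothetical schema)."""
    arg1: str
    connective: str
    arg2: str


def swap_connective(pairs: List[DiscoursePair], rng: random.Random) -> List[DiscoursePair]:
    """Replace each connective with a different one drawn from the other instances (assumed policy)."""
    corrupted = []
    for pair in pairs:
        candidates = [p.connective for p in pairs if p.connective != pair.connective]
        if candidates:  # skip instances for which no alternative connective exists
            corrupted.append(pair._replace(connective=rng.choice(candidates)))
    return corrupted


def swap_argument(pairs: List[DiscoursePair], rng: random.Random) -> List[DiscoursePair]:
    """Replace each second argument with a different one drawn from the other instances (assumed policy)."""
    corrupted = []
    for pair in pairs:
        candidates = [p.arg2 for p in pairs if p.arg2 != pair.arg2]
        if candidates:  # skip instances for which no alternative argument exists
            corrupted.append(pair._replace(arg2=rng.choice(candidates)))
    return corrupted


# Toy data invented for this sketch; real instances would come from an annotated corpus.
examples = [
    DiscoursePair("The roads were icy", "so", "the school closed early"),
    DiscoursePair("She studied all week", "but", "she failed the exam"),
    DiscoursePair("He stayed home", "because", "he had a fever"),
]

rng = random.Random(13)  # fixed seed so the sketch is reproducible
connective_corrupted = swap_connective(examples, rng)
argument_corrupted = swap_argument(examples, rng)
```

A randomly substituted connective or argument can still yield a pair that reads as coherent, which is consistent with the abstract's observation that the generated instances are not “incoherent enough”.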
Anthology ID:
2020.lrec-1.134
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
1067–1072
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.134
Cite (ACL):
Elham Mohammadi, Timothe Beiko, and Leila Kosseim. 2020. On the Creation of a Corpus for Coherence Evaluation of Discursive Units. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 1067–1072, Marseille, France. European Language Resources Association.
Cite (Informal):
On the Creation of a Corpus for Coherence Evaluation of Discursive Units (Mohammadi et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.134.pdf