SLäNDa: An Annotated Corpus of Narrative and Dialogue in Swedish Literary Fiction

Sara Stymne, Carin Östman


Abstract
We describe a new corpus, SLäNDa, the Swedish Literary corpus of Narrative and Dialogue. It contains Swedish literary fiction, which has been manually annotated for cited materials, with a focus on dialogue. The annotation covers excerpts from eight Swedish novels written between 1879 and 1940, a period of modernization of the Swedish language. SLäNDa contains annotations for all cited materials that are separate from the main narrative, such as quotations and signs. The main focus is on dialogue, for which we annotate speech segments, speech tags, and speakers. In this paper we describe the annotation protocol and procedure and show that we can reach a high inter-annotator agreement. In total, SLäNDa contains annotations of 44 chapters with over 220K tokens. The annotation identified 4,733 instances of cited material and 1,143 named speaker–speech mappings. The corpus is useful for developing computational tools for different types of analysis of literary narrative and speech. We perform a small pilot study where we show how our annotation can help in analyzing language change in Swedish. We find that, for a number of common function words, the modern form appears earlier in speech than in narrative.
Anthology ID:
2020.lrec-1.103
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
826–834
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.103
Cite (ACL):
Sara Stymne and Carin Östman. 2020. SLäNDa: An Annotated Corpus of Narrative and Dialogue in Swedish Literary Fiction. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 826–834, Marseille, France. European Language Resources Association.
Cite (Informal):
SLäNDa: An Annotated Corpus of Narrative and Dialogue in Swedish Literary Fiction (Stymne & Östman, LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.103.pdf