Recognizing Sentence-level Logical Document Structures with the Help of Context-free Grammars

Jonathan Hildebrand, Wahed Hemati, Alexander Mehler


Abstract
Current sentence boundary detectors split documents into sequentially ordered sentences by detecting their beginnings and ends. Sentences, however, are more deeply structured even before constituent and dependency structure are considered: they can consist of a main clause and several subordinate clauses as well as further segments (e.g. parenthetical insertions); they can even recursively embed whole sentences and thus contain multiple sentence beginnings and ends. In this paper, we introduce a tool that segments sentences into tree structures to detect this type of recursive structure. To this end, we retrain different constituency parsers on modified training data to turn them into sentence segmenters. With these segmenters, documents are mapped to sequences of sentence-related “logical document structures”. The resulting segmenters aim to improve downstream tasks by providing additional structural information. In this context, we experiment with German dependency parsing. We show that for certain sentence categories, which can be determined automatically, German dependency parsing improves when our segmenter is used for preprocessing. This suggests that similar improvements may be achievable for other languages and tasks.
Anthology ID:
2020.lrec-1.650
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
5282–5290
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.650
Cite (ACL):
Jonathan Hildebrand, Wahed Hemati, and Alexander Mehler. 2020. Recognizing Sentence-level Logical Document Structures with the Help of Context-free Grammars. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 5282–5290, Marseille, France. European Language Resources Association.
Cite (Informal):
Recognizing Sentence-level Logical Document Structures with the Help of Context-free Grammars (Hildebrand et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.650.pdf
Data
Penn Treebank