Extraction of unmarked quotations in Newspapers

Stéphanie Weiser, Patrick Watrin


Abstract
This paper presents work in progress on the automatic extraction of quotation sentences from newspaper articles, with a focus on extracting and annotating unmarked quotation sentences. A linguistic study shows that unmarked quotation sentences can be formalised into 16 patterns, which can be used to develop an extraction grammar. The question of identifying the boundaries of unmarked quotations is also raised, as these boundaries are often ambiguous. An annotation scheme is defined that describes all the elements that can occur in a quotation sentence. The paper presents the creation of two resources required by our system. First, a dictionary of verbs that introduce quotations has been built automatically, using a grammar of marked quotation sentences to identify the verbs able to introduce quotations. Second, a grammar formalising the patterns of unmarked quotation sentences has been developed with Unitex, a tool based on finite state machines. A short experiment on two patterns shows some promising results.
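
The two-resource idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the paper uses Unitex grammars, whereas this sketch swaps in a plain regular expression. The pattern `MARKED`, the helper names `collect_reporting_verbs` and `is_candidate_unmarked`, and the English example sentences are all assumptions made for illustration only.

```python
import re
from collections import Counter

# Hypothetical stand-in for a grammar of marked quotation sentences:
# a double-quoted span, an optional comma, one word (the candidate verb),
# then a capitalised word assumed to be the speaker.
MARKED = re.compile(r'"[^"]+",?\s+(\w+)\s+[A-Z]\w+')


def collect_reporting_verbs(sentences):
    """Count the words that follow a quoted span in marked quotations."""
    counts = Counter()
    for sentence in sentences:
        for match in MARKED.finditer(sentence):
            counts[match.group(1).lower()] += 1
    return counts


def is_candidate_unmarked(sentence, verbs):
    """Flag a sentence with no quotation marks that contains a known verb."""
    if '"' in sentence:
        return False  # marked quotation, not the case of interest here
    tokens = re.findall(r"\w+", sentence.lower())
    return any(token in verbs for token in tokens)


# Illustrative (invented) input sentences.
marked_examples = [
    '"We will win," said Smith.',
    '"It is late", explained Dupont.',
]
verbs = collect_reporting_verbs(marked_examples)
```

With the dictionary built from the marked examples, `is_candidate_unmarked("The plan will fail, said Smith.", verbs)` flags the sentence as a candidate, while a sentence without any collected verb is not flagged. A real system would also need lemmatisation and the boundary detection discussed in the abstract, which this sketch leaves out.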
Anthology ID:
L12-1320
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
559–562
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/566_Paper.pdf
Cite (ACL):
Stéphanie Weiser and Patrick Watrin. 2012. Extraction of unmarked quotations in Newspapers. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 559–562, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
Extraction of unmarked quotations in Newspapers (Weiser & Watrin, LREC 2012)