Anaphoric Annotation of Wikipedia and Blogs in the Live Memories Corpus

Kepa Joseba Rodríguez, Francesca Delogu, Yannick Versley, Egon W. Stemle, Massimo Poesio


Abstract
The Live Memories corpus is an Italian corpus annotated for anaphoric relations. This annotation effort aims to contribute to two significant issues for the CL research: the lack of annotated anaphoric resources for Italian and the increasing interest for the social Web. The Live Memories Corpus contains texts from the Italian Wikipedia about the region Trentino/Süd Tirol and from blog sites with users' comments. It is planned to add a set of articles of local news papers. The corpus includes manual annotated information about morphosyntactic agreement, anaphoricity, and semantic class of the NPs. The anaphoric annotation includes discourse deixis, bridging relations and markes cases of ambiguity with the annotation of alternative interpretations. For the annotation of the anaphoric links the corpus takes into account specific phenomena of the Italian language like incorporated clitics and phonetically non realized pronouns. Reliability studies for the annotation of the mentioned phenomena and for annotation of anaphoric links in general offer satisfactory results. The Wikipedia and blogs dataset will be distributed under Creative Commons Attributions licence.
Anthology ID:
L10-1295
Volume:
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Month:
May
Year:
2010
Address:
Valletta, Malta
Editors:
Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias
Venue:
LREC
SIG:
Publisher:
European Language Resources Association (ELRA)
Note:
Pages:
Language:
URL:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/431_Paper.pdf
DOI:
Bibkey:
Cite (ACL):
Kepa Joseba Rodríguez, Francesca Delogu, Yannick Versley, Egon W. Stemle, and Massimo Poesio. 2010. Anaphoric Annotation of Wikipedia and Blogs in the Live Memories Corpus. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).
Cite (Informal):
Anaphoric Annotation of Wikipedia and Blogs in the Live Memories Corpus (Rodríguez et al., LREC 2010)
Copy Citation:
PDF:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/431_Paper.pdf