Hungarian Word-Sense Disambiguated Corpus

Veronika Vincze, György Szarvas, Attila Almási, Dóra Szauter, Róbert Ormándi, Richárd Farkas, Csaba Hatvani, János Csirik


Abstract
To create the first Hungarian WSD corpus, 39 suitable word form samples were selected for the purpose of word sense disambiguation. Among others, selection criteria required the given word form to be frequent in Hungarian language usage, and to have more than one sense considered frequent in usage. HNC and its Heti Világgazdaság subcorpus provided the basis for corpus text selection. This way, each sample has a relevant context (whole article), and information on the lemma, POS-tagging and automatic tokenization is also available. When planning the corpus, 300-500 samples of each word form were to be annotated. This size makes it possible that the subcorpora prepared for the individual word forms can be compared to data available for other languages. However, the finalized database also contains unannotated samples and samples with single annotation, which were annotated only by one of the linguists. The corpus follows the ACL’s SensEval/SemEval WSD tasks format. The first version of the corpus was developed within the scope of the project titled The construction Hungarian WordNet Ontology and its application in Information Extraction Systems (Hatvani et al., 2007). The corpus “ for research and educational purposes” is available and can be downloaded free of charge.
Anthology ID:
L08-1609
Volume:
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Month:
May
Year:
2008
Address:
Marrakech, Morocco
Editors:
Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Daniel Tapias
Venue:
LREC
SIG:
Publisher:
European Language Resources Association (ELRA)
Note:
Pages:
Language:
URL:
http://www.lrec-conf.org/proceedings/lrec2008/pdf/847_paper.pdf
DOI:
Bibkey:
Cite (ACL):
Veronika Vincze, György Szarvas, Attila Almási, Dóra Szauter, Róbert Ormándi, Richárd Farkas, Csaba Hatvani, and János Csirik. 2008. Hungarian Word-Sense Disambiguated Corpus. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco. European Language Resources Association (ELRA).
Cite (Informal):
Hungarian Word-Sense Disambiguated Corpus (Vincze et al., LREC 2008)
Copy Citation:
PDF:
http://www.lrec-conf.org/proceedings/lrec2008/pdf/847_paper.pdf