QurSim: A corpus for evaluation of relatedness in short texts

Abdul-Baquee Sharaf, Eric Atwell


Abstract
This paper presents a large corpus created from the original Quranic text, in which semantically similar or related verses are linked together. This corpus will be a valuable evaluation resource for computational linguists investigating similarity and relatedness in short texts, and it can also be used to evaluate paraphrase analysis and machine translation tasks. Our dataset is characterised by: (1) superior quality of relatedness assignment: because we have incorporated relations marked by well-known domain experts, the dataset could be considered a gold standard corpus for various evaluation tasks; and (2) its size: over 7,600 pairs of related verses were collected from scholarly sources, with several degrees of relatedness. The dataset could be extended to over 13,500 pairs of related verses by observing the commutative property of strongly related pairs. It has been incorporated into online query pages where users can visualize, for a given verse, a network of all directly and indirectly related verses. Empirical experiments showed that only 33% of related pairs shared root words, emphasising the need to go beyond common lexical matching methods and to incorporate, in addition, semantic, domain-knowledge, and other corpus-based approaches.
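The two operations mentioned in the abstract, extending the pair list through the commutative property of strongly related pairs and collecting the network of directly and indirectly related verses, can be illustrated with a minimal sketch. This is not the authors' implementation: the verse identifiers, the degree labels "strong" and "weak", and the function names `symmetric_extension` and `related_network` are all assumptions made for illustration, and the toy data is invented.

```python
from collections import defaultdict, deque

# Hypothetical toy data: (source verse, target verse, degree of relatedness).
# Verse IDs and degree labels are invented for illustration only.
pairs = [
    ("1:2", "6:45", "strong"),
    ("6:45", "10:10", "strong"),
    ("10:10", "37:182", "weak"),
]

def symmetric_extension(pairs):
    """Add the reverse of every strongly related pair, dropping duplicates."""
    extended = set(pairs)
    for src, dst, degree in pairs:
        if degree == "strong":
            extended.add((dst, src, degree))
    return extended

def related_network(pairs, start):
    """Map every verse reachable from `start` to its hop distance (BFS)."""
    graph = defaultdict(set)
    for src, dst, _ in pairs:
        graph[src].add(dst)
    distance = {start: 0}
    queue = deque([start])
    while queue:
        verse = queue.popleft()
        for neighbour in graph[verse]:
            if neighbour not in distance:
                distance[neighbour] = distance[verse] + 1
                queue.append(neighbour)
    del distance[start]  # report only the other verses
    return distance

extended = symmetric_extension(pairs)
network = related_network(extended, "6:45")
```

In this sketch a hop distance of 1 marks a directly related verse and larger distances mark indirectly related ones; only pairs labelled as strong are mirrored, which is one possible reading of the commutative extension described above.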
Anthology ID:
L12-1051
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
2295–2302
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/190_Paper.pdf
Cite (ACL):
Abdul-Baquee Sharaf and Eric Atwell. 2012. QurSim: A corpus for evaluation of relatedness in short texts. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2295–2302, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
QurSim: A corpus for evaluation of relatedness in short texts (Sharaf & Atwell, LREC 2012)