Finding Domain Terms using Wikipedia

Jorge Vivaldi, Horacio Rodríguez


Abstract
In this paper we present a new approach for obtaining the terminology of a given domain using the category and page structures of Wikipedia in a language-independent way. For each domain, our approach navigates the Wikipedia category graph starting from the root nodes associated with the domain. A heavy filtering mechanism is applied to prevent, as far as possible, the inclusion of spurious categories. For each selected category, all the pages belonging to it are then retrieved and filtered. This procedure is iterated several times until convergence is reached. Both category names and page names are considered candidate terms for the domain. The approach has been applied to three broad-coverage domains (astronomy, chemistry and medicine) and two languages (English and Spanish), showing promising performance.
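
The traversal described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy category graph, the function name `collect_domain_terms`, and the two filter predicates are hypothetical stand-ins, and "convergence" is simplified here to "no new category was accepted in the last pass". The filtering and iteration criteria of the actual method are not given in the abstract.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple


def collect_domain_terms(
    roots: Iterable[str],
    subcategories: Dict[str, List[str]],
    pages_of: Dict[str, List[str]],
    keep_category: Callable[[str], bool],
    keep_page: Callable[[str], bool],
) -> Tuple[Set[str], Set[str]]:
    """Expand the category set from the domain roots until it stops growing,
    then gather the filtered pages of every accepted category."""
    categories: Set[str] = set(roots)
    changed = True
    while changed:  # assumed convergence test: a pass that adds nothing
        changed = False
        for cat in list(categories):
            for child in subcategories.get(cat, []):
                # A rejected category is never expanded, so its
                # descendants are not reached through it.
                if child not in categories and keep_category(child):
                    categories.add(child)
                    changed = True
    pages = {
        page
        for cat in categories
        for page in pages_of.get(cat, [])
        if keep_page(page)
    }
    return categories, pages


# Hypothetical toy data standing in for the Wikipedia category graph.
subcategories = {
    "Astronomy": ["Stars", "Astronomers"],
    "Stars": ["Binary stars"],
    "Astronomers": ["Italian astronomers"],
}
pages_of = {
    "Astronomy": ["Telescope"],
    "Stars": ["Red giant", "List of stars"],
    "Binary stars": ["Eclipsing binary"],
    "Astronomers": ["Galileo Galilei"],
}

categories, pages = collect_domain_terms(
    roots=["Astronomy"],
    subcategories=subcategories,
    pages_of=pages_of,
    keep_category=lambda c: "astronomers" not in c.lower(),  # illustrative filter
    keep_page=lambda p: not p.startswith("List of"),  # illustrative filter
)
candidate_terms = categories | pages  # both kinds of names are candidates
```

In this toy run the biographical branch is pruned at "Astronomers", so neither its subcategory nor its pages become candidates, while category names and page names from the remaining branches are pooled into one candidate set.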
Anthology ID:
L10-1518
Volume:
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Month:
May
Year:
2010
Address:
Valletta, Malta
Editors:
Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
URL:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/748_Paper.pdf
Cite (ACL):
Jorge Vivaldi and Horacio Rodríguez. 2010. Finding Domain Terms using Wikipedia. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).
Cite (Informal):
Finding Domain Terms using Wikipedia (Vivaldi & Rodríguez, LREC 2010)