Comparing the Quality of Focused Crawlers and of the Translation Resources Obtained from them

Bruno Laranjeira, Viviane Moreira, Aline Villavicencio, Carlos Ramisch, Maria José Finatto


Abstract
Comparable corpora have been used as an alternative to parallel corpora as resources for computational tasks that involve domain-specific natural language processing. One way to gather documents related to a specific topic of interest is to traverse a portion of the web graph in a targeted way, using focused crawling algorithms. In this paper, we compare several focused crawling algorithms by using them to collect comparable corpora on a specific domain. Then, we compare the evaluation of the focused crawling algorithms to the performance of linguistic processes executed after training with the corresponding generated corpora. We also propose a novel approach for focused crawling that exploits the expressive power of multiword expressions.
Anthology ID:
L14-1072
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
3572–3578
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/1095_Paper.pdf
Cite (ACL):
Bruno Laranjeira, Viviane Moreira, Aline Villavicencio, Carlos Ramisch, and Maria José Finatto. 2014. Comparing the Quality of Focused Crawlers and of the Translation Resources Obtained from them. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3572–3578, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
Comparing the Quality of Focused Crawlers and of the Translation Resources Obtained from them (Laranjeira et al., LREC 2014)
PDF:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/1095_Paper.pdf