Domain-Specific Corpus Expansion with Focused Webcrawling

Steffen Remus, Chris Biemann


Abstract
This work presents a straightforward method for extending or creating in-domain web corpora by focused webcrawling. The focused webcrawler uses statistical N-gram language models to estimate the relatedness of documents and weblinks. As input it needs only N-grams or plain texts of a predefined domain, plus seed URLs as starting points. Two experiments demonstrate that our focused crawler is able to stay focused in domain and language: the first shows that the crawler stays in a focused domain, and the second demonstrates that language models trained on focused crawls obtain better perplexity scores on in-domain corpora. We distribute the focused crawler as open source software.
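The idea of scoring documents with an N-gram language model and using that score to order weblinks can be illustrated with a minimal sketch. It is not the authors' released crawler: the bigram order, add-one smoothing, whitespace tokenization, the rule that a link inherits the perplexity of the page it was found on, and all names (BigramModel, rank_links, the example URLs) are assumptions made for illustration.

```python
import heapq
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for a sketch.
    return text.lower().split()


class BigramModel:
    """Add-one smoothed bigram model trained on in-domain plain texts."""

    def __init__(self, texts):
        self.contexts = Counter()  # how often a token occurs as a left context
        self.bigrams = Counter()   # how often a (previous, current) pair occurs
        vocab = set()
        for text in texts:
            tokens = ["<s>"] + tokenize(text)
            vocab.update(tokens[1:])
            self.contexts.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(vocab) + 1  # +1 reserves mass for unseen words

    def perplexity(self, text):
        """Lower perplexity means the text looks more like the domain."""
        tokens = ["<s>"] + tokenize(text)
        if len(tokens) < 2:
            return float("inf")
        log_prob = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            p = (self.bigrams[(prev, word)] + 1) / (
                self.contexts[prev] + self.vocab_size
            )
            log_prob += math.log(p)
        return math.exp(-log_prob / (len(tokens) - 1))


def rank_links(model, pages):
    """Order outgoing links so that links from in-domain pages come first.

    pages maps a fetched URL to a (page_text, outgoing_links) pair.
    Assumption: each link inherits the perplexity of its source page.
    """
    frontier = []
    for _url, (text, links) in pages.items():
        score = model.perplexity(text)
        for link in links:
            heapq.heappush(frontier, (score, link))
    return [heapq.heappop(frontier)[1] for _ in range(len(frontier))]


# Hypothetical in-domain training texts and fetched pages.
domain_model = BigramModel([
    "the crawler downloads web pages",
    "language models score web pages",
])
fetched = {
    "http://a.example/": ("cheap flights and hotel deals", ["http://a.example/1"]),
    "http://b.example/": ("language models score web pages", ["http://b.example/2"]),
}
ordered_links = rank_links(domain_model, fetched)
```

In a real crawl the frontier would be updated continuously as pages are fetched, and a larger N with stronger smoothing would likely be preferable to this toy model.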
Anthology ID:
L16-1572
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
3607–3611
URL:
https://aclanthology.org/L16-1572
Cite (ACL):
Steffen Remus and Chris Biemann. 2016. Domain-Specific Corpus Expansion with Focused Webcrawling. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3607–3611, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
Domain-Specific Corpus Expansion with Focused Webcrawling (Remus & Biemann, LREC 2016)
PDF:
https://aclanthology.org/L16-1572.pdf