Evaluation of Web-based Corpora: Effects of Seed Selection and Time Interval

Motoko Ueyama


Abstract
Recently, there have been efforts to construct written corpora by using the WWW. A promising approach to build Web corpora is to run automated queries to search engines and download pages found in this way. This makes it possible to build corpora rapidly and economically, but we cannot control what are contained in resulting corpora. Under these circumstances, it is important to verify the general nature of Web corpora. This study, in particular, investigated effects of two essential factors on three Japanese corpora that we built: seed terms used for queries; and time interval between different corpus construction sessions, which measures the stability of query results over time. We evaluated the corpora qualitatively, in terms of domains, genres and typical lexical items. Results show these two patterns: 1) both seed selection and time interval affect the distribution of text and lexicon; 2) the effect of seed selection is much stronger. The prominent effect of seed selection suggests that a good understanding of the cause-and-effect relation between seeds and retrieved documents is an important step to gain some control over the characteristics of Web corpora, in particular, for the construction of general corpora meant to represent a language as a whole.
Anthology ID:
L06-1432
Volume:
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
Month:
May
Year:
2006
Address:
Genoa, Italy
Editors:
Nicoletta Calzolari, Khalid Choukri, Aldo Gangemi, Bente Maegaard, Joseph Mariani, Jan Odijk, Daniel Tapias
Venue:
LREC
SIG:
Publisher:
European Language Resources Association (ELRA)
Note:
Pages:
Language:
URL:
http://www.lrec-conf.org/proceedings/lrec2006/pdf/697_pdf.pdf
DOI:
Bibkey:
Cite (ACL):
Motoko Ueyama. 2006. Evaluation of Web-based Corpora: Effects of Seed Selection and Time Interval. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06), Genoa, Italy. European Language Resources Association (ELRA).
Cite (Informal):
Evaluation of Web-based Corpora: Effects of Seed Selection and Time Interval (Ueyama, LREC 2006)
Copy Citation:
PDF:
http://www.lrec-conf.org/proceedings/lrec2006/pdf/697_pdf.pdf