New Tools for Web-Scale N-grams

Dekang Lin, Kenneth Church, Heng Ji, Satoshi Sekine, David Yarowsky, Shane Bergsma, Kailash Patil, Emily Pitler, Rachel Lathbury, Vikram Rao, Kapil Dalwani, Sushant Narsale


Abstract
While the web provides a fantastic linguistic resource, collecting and processing data at web-scale is beyond the reach of most academic laboratories. Previous research has relied on search engines to collect online information, but this is hopelessly inefficient for building large-scale linguistic resources, such as lists of named-entity types or clusters of distributionally similar words. An alternative to processing web-scale text directly is to use the information provided in an N-gram corpus. An N-gram corpus is an efficient compression of large amounts of text. An N-gram corpus states how often each sequence of words (up to length N) occurs. We propose tools for working with enhanced web-scale N-gram corpora that include richer levels of source annotation, such as part-of-speech tags. We describe a new set of search tools that make use of these tags, and collectively lower the barrier for lexical learning and ambiguity resolution at web-scale. They will allow novel sources of information to be applied to long-standing natural language challenges.
Anthology ID:
L10-1158
Volume:
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Month:
May
Year:
2010
Address:
Valletta, Malta
Editors:
Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias
Venue:
LREC
SIG:
Publisher:
European Language Resources Association (ELRA)
Note:
Pages:
Language:
URL:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/233_Paper.pdf
DOI:
Bibkey:
Cite (ACL):
Dekang Lin, Kenneth Church, Heng Ji, Satoshi Sekine, David Yarowsky, Shane Bergsma, Kailash Patil, Emily Pitler, Rachel Lathbury, Vikram Rao, Kapil Dalwani, and Sushant Narsale. 2010. New Tools for Web-Scale N-grams. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).
Cite (Informal):
New Tools for Web-Scale N-grams (Lin et al., LREC 2010)
Copy Citation:
PDF:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/233_Paper.pdf