Corpora for English
For languages other than English, see List of resources by language.
Free and Downloadable
- American National Corpus (ANC)
- Congressional floor-debate transcripts, with support/oppose labels
- Dialogue Diversity Corpus
- English stop words (from SMART)
- Groningen Meaning Bank semantically annotated corpus
- GUM - Georgetown University Multilayer corpus, with multiple parses, coreference, entities, sentence types and RST annotations
- Project Gutenberg
- International Corpus of English
- HamleDT, harmonized dependency treebanks of many languages in a common annotation style
- Hutter Prize for Lossless Compression of Human Knowledge, 100 MB sample of Wikipedia
- Large Text Compression Benchmark's 1 GB sample of Wikipedia
- Movie Review Data
- Multiword Expression Resources
- SUSANNE: Annotated American English Corpus
- SUSANNE Analytic Scheme
- The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English
- The LUCY Corpus - Documentation
- TRAINS Dialogue Corpus
- UMBC Webbase Corpus
- UN parallel corpora
- VP Ellipsis corpus
- WMT corpora, including Europarl, News Commentary, and News Crawl
Proprietary or Require Prior Permission
- Araneum Anglicum, Gigaword English web corpus
- Araneum Anglicum Asiaticum, Gigaword Asian English web corpus
- British National Corpus (BNC)
- ClueWeb
- Corpus of Spoken Professional English
- English Intonation in the British Isles - The IViE Corpus
- English Verb Classes And Alternations: A Preliminary Investigation (Index)
- GOV2 Corpus - 426 gigabytes of text
- Multi-Perspective Question Answering (MPQA)
- Oxford English Corpus
- Sketch Engine
- WaCky
- WebCorp
Link collections
- Collections of texts and corpora
- Manuel Barbera: General Corpora and Corpus Linguistics Resources
- Annotated list of resources on statistical NLP and corpus-based CL
Corpora tools
- ANNIS - open source search tool for complex multilayer corpora
- List of stop words
- Poliqarp - open source XML-aware indexer, search engine and concordancer
- The Sketch Engine
- Treebank tokenization scheme