Corpora for English
This list needs some cleaning. Please help.
English
- American English SpeechDat-Car
- American National Corpus (ANC)
- American National Corpus First Release
- Biomedical corpora
- BNCweb a web-based interface to the British National Corpus
- Bookmarks for Corpus-based Linguists
- British National Corpus (from Oxford University)
- British National Corpus (BNC)
- British National Corpus project page (from UCREL)
- Brown Corpus
- Collins Wordbanks
- Corpus of Spoken Professional English
- Dialogue Diversity Corpus
- Electronic Text Center -- University of Virginia
- English Intonation in the British Isles - The IViE Corpus
- English stop words (from SMART)
- English Verb Classes And Alternations: A Preliminary Investigation (Index)
- Exploring Words and Phrases from the British National Corpus
- Gutenberg
- ICAME
- List of English stopwords
- Mapping WordNet Versions 1.6 and 2.0
- Movie Review Data
- Multiword Expression Resources
- Oxford English Corpus
- Phrases in English
- Restricted English Corpus from Dr. Caroline Lyon for PhD
- Sketch Engine
- Susanne: Annotated American English Corpus
- The BNC Index (for the BNCWorld Edition)
- The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English
- The Dialogue Diversity Corpus
- The LUCY Corpus - Documentation
- TRAINS Dialogue Corpus
- WebCorp
German
Multilingual
- ACQUIS COMMUNAUTAIRE Multilingual Corpus
- Bank of Swedish
- Croatian National Corpus (HNK)
- Czech National Corpus (CNC)
- CELEX - The Dutch Center for Lexical Information
- Centre for Disease Control - Chinese, French, Japanese, Spanish info on SARS
- COMPARA corpus
- Debian free software community
- EMILLE corpus
- European Parliament Proceedings Parallel Corpus 1996-2003
- EuroWordNet
- French Foreign Ministry's magazine
- GlossaNet
- Haitian Creole corpus - Teknoloji pou lang kreyol
- Hungarian National Corpus
- Hansard French-English parallel corpus
- ICE corpora
- IPI PAN Corpus of Polish
- Learner Behaviour on the Internet
- MuchMore Springer Bilingual Corpus
- MULTEXT-East: Multilingual Corpora for Eastern and Central European Languages
- Multilingual Corpora: Available Resources
- Tanaka Corpus: Japanese-English sentence pairs
- MultiSemCor
- Newspapers on the Internet
- OPUS - an open source parallel corpus
- Oslo Corpus of Bosnian
- PolyU Language Bank
- Portuguese Corpus
- Public registry of the Council of the EU
- Russian National Corpus (RNK)
- The Bible as a Resource for Translation Software
- The ECI Multilingual corpus
- Slovenian Corpus FIDA and FIDA+
- Spanish Corpus
- UN declaration of human rights in multiple languages
- UNITEX
- Useful links about parallel corpora, by Olivier Kraif
- WaCky Project
- Wortlisten: spoken German, English, French, and Dutch
Russian
- Russian Corpora
- Russian Corpus Page
- Russian Corpus Site
- Russian Newspaper Corpus
- Russicon Resources
- Bokr Russian Reference Corpus
Slovak
Italian
- LIP - Lessico di frequenza dell'Italiano Parlato - Access via BADIP
- ColFIS Corpus e Lessico di Frequenza dell'Italiano Scritto
- Corpus di Italiano Scritto contemporaneo (CORIS/CODIS)
- Tesoro della lingua italiana delle origini (TLIO)
Link Collections
- Collections of texts and corpora
- Manuel Barbera: General Corpora and Corpus Linguistics Resources
- Isabella Chiari: Corpora, Software and Linguistic resources
- Annotated list of resources on statistical NLP and corpus-based CL
Corpora Tools
Uncategorized
- 1963 Time Magazine corpus
- 2000 NIST Speaker Recognition Evaluation Corpus
- A Syntactically Annotated Corpus of German Newspaper Texts
- A Web Corpus and Topic Signatures for All WordNet 1.6 Nominal Senses (v 1.0)
- Alpino Treebank
- An Empirical Grammar of the English Verb System
- AOT
- Arabic Newswire Part 1
- Base Textuelle de Moyen Francais
- BNC Online Service
- British National Corpus - World Edition
- Corpus de referencia de la lengua Espanola contemporanea: corpus oral peninsular
- Corpus del Espanol
- Corpus of spoken Bulgarian
- Corpus Resources (Chulalongkorn University, Thailand)
- Cranfield collection
- CREA
- Czech National Corpus
- Danish news corpus
- Edinburgh Associative Thesaurus (EAT)
- Experimental Corpus Query System (University of Stuttgart, Germany)
- Finnish text bank
- Haitian Creole Electronic Texts
- Hansards Corpus - Searchable
- HCRC Map Task Corpus XML annotations
- Helsinki Corpus of Swahili (HCS)
- ICOPOST
- IMS Corpus Toolbox, Univ. of Stuttgart
- IMS Corpus Workbench (CWB)
- International Corpus of Learner English
- IPI PAN Polish Corpus
- Kiel University's Institute on Phonetics and Speech Processing
- Lacio Web Corpora
- Language Learning Center - Academic Corpus
- List of Japanese transitive-intransitive verb pairs
- Medlars collection
- Miscellaneous Word Lists from Oxford University
- Multilingual Text Tools and Corpora
- Name lists from US census
- Nexing Corpus
- On-line books at CMU
- OPUS -- An Open Source Parallel Corpus
- Oxford Text Archive Corpus of Italian Newspapers
- Polish subcorpus of the International Corpus of Learner English
- Ramon Piero Center for Research
- Reuters Corpus
- Romanian NLP
- Sanskrit Library
- Slovene-English Parallel Corpus
- Speech in Noisy Environments 1 (SPINE1 CODED) Coded Audio
- Speech in Noisy Environments 2 (SPINE2 CODED) Coded Audio
- Survey of Electronic Corpora (by Jane A. Edwards, file at CMU)
- Survey of English Usage, University College, London
- Switchboard Transcription Project
- TELRI Research Archive of Computational Tools and Resources
- The CHILDES Corpus - Children's language
- The CORPORA DataCenter (Norway)
- The Moby Corpus
- The Oslo Corpus of Bosnian Texts
- The Sofie Treebank - A Parallel Treebank of North European Languages