Difference between revisions of "Corpora, datasets, lexicons"

From ACL Wiki
Line 5: Line 5:
 
== Corpora ==
 
  
+ === English ===
+ (alphabetical order)
 
* [http://americannationalcorpus.org/ American National Corpus (ANC)]
 
 
* [http://compbio.uchsc.edu/ccp/corpora/index.shtml Biomedical corpora]
 
− * [http://www.tekstlab.uio.no/Bosnian/Corpus.html The Oslo Corpus of Bosnian]
 
 
* [http://www.natcorp.ox.ac.uk/ British National Corpus (BNC)]
 
 
* [http://clwww.essex.ac.uk/w3c/corpus_ling/content/corpora/list/private/brown/brown.html Brown Corpus]
 
 
* [http://www.collins.co.uk/books.aspx?group=154 Collins Wordbanks]
 
+ * [http://www.gutenberg.org/wiki/Main_Page Gutenberg]
+ * [http://www.askoxford.com/oec/mainpage/?view=uk Oxford English Corpus]
+ * [http://www.webcorp.org.uk/guide/ WebCorp]
+
+ === Multilingual ===
+ (alphabetical order)
+ * [http://spraakbanken.gu.se/ Bank of Swedish]
+ * [http://www.tekstlab.uio.no/Bosnian/Corpus.html Oslo Corpus of Bosnian]
 
* [http://hnk.ffzg.hr/ Croatian National Corpus (HNK)]
 
 
* [http://ucnk.ff.cuni.cz/ Czech National Corpus (CNC)]
 
− * [http://devoted.to/corpora David Lee's Bookmarks for Corpus-based Linguists]
− * [http://www.gutenberg.org/wiki/Main_Page Gutenberg]
 
 
* [http://corpus.nytud.hu/mnsz/ Hungarian National Corpus]
 
 
* [http://korpus.pl/ IPI PAN Corpus of Polish]
 
− * [http://www.askoxford.com/oec/mainpage/?view=uk Oxford English Corpus]
 
 
* [http://www.corpusdoportugues.org/ Portuguese Corpus]
 
 
* [http://www.ruscorpora.ru/ Russian National Corpus (RNK)]
 
Line 23: Line 29:
 
* [http://www.fida.net/ Slovenian Corpus FIDA] and [http://www.fidaplus.net/ FIDA+]
 
 
* [http://www.corpusdelespanol.org/ Spanish Corpus]
 
− * [http://spraakbanken.gu.se/ Bank of Swedish]
 
 
* [http://www.csse.monash.edu.au/~jwb/tanakacorpus.html Tanaka Corpus: Japanese-English sentence pairs]
 
− * [http://www.webcorp.org.uk/guide/ WebCorp]
+
+ === Other lists of corpora ===
+ (alphabetical order)
+ * [http://devoted.to/corpora David Lee's Bookmarks for Corpus-based Linguists]
  
 
== Datasets ==
 

Revision as of 07:45, 2 November 2006

Miscellaneous

Corpora

English

(alphabetical order)

Multilingual

(alphabetical order)

Other lists of corpora

(alphabetical order)

Datasets

Lexicons

  • WordNet - the original
    • eXtended WordNet - glosses are syntactically parsed and transformed into logic forms, and their content words are semantically disambiguated
    • WordNet Domains - synsets are augmented with domain labels, such as POLITICS, ECONOMY, and SPORT
    • SentiWordNet - assigns three sentiment scores to each WordNet synset: positivity, negativity, and objectivity