ACL Wiki - User contributions [en]

Corpora for English

2015-03-08T19:33:41Z

Vladob54: Added: Araneum

For languages other than English, see [[List of resources by language]].


*[ftp://ftp.cs.cornell.edu/pub/smart/time/ 1963 Time Magazine corpus]
*[http://www.elda.fr/catalogue/en/speech/S0115.html American English SpeechDat-Car]
*[http://americannationalcorpus.org/ American National Corpus (ANC)]
*[http://americannationalcorpus.org/FirstRelease/ American National Corpus, First Release]
*[http://ucts.uniba.sk/aranea_about/ Araneum Anglicum], Gigaword English web corpus
*[http://ucts.uniba.sk/aranea_about/ Araneum Anglicum Asiaticum], Gigaword Asian English web corpus
*[http://compbio.uchsc.edu/ccp/corpora/index.shtml Biomedical corpora]
*[http://homepage.mac.com/bncweb/ BNCweb a web-based interface to the British National Corpus]
*[http://devoted.to/corpora Bookmarks for Corpus-based Linguists]
*[http://info.ox.ac.uk/bnc/ British National Corpus (from Oxford University)]
*[http://www.natcorp.ox.ac.uk/ British National Corpus (BNC)]
*[http://www.comp.lancs.ac.uk/computing/research/ucrel/bnc.html British National Corpus project page (from UCREL)]
*[http://clwww.essex.ac.uk/w3c/corpus_ling/content/corpora/list/private/brown/brown.html Brown Corpus]
*[http://boston.lti.cs.cmu.edu/Data/clueweb09/ ClueWeb]
*[http://computing.open.ac.uk/coda/data.html CODA Parallel Annotated Monologue-Dialogue Corpus]
*[http://www.collins.co.uk/books.aspx?group=154 Collins Wordbanks]
*[http://www.cs.cornell.edu/home/llee/data/convote.html Congressional floor-debate transcripts, with support/oppose labels]
*[http://www.athel.com/corpdes.html Corpus of Spoken Professional English]
*[http://www-rcf.usc.edu/~billmann/diversity/DDivers-site.htm Dialogue Diversity Corpus]
*[http://etext.lib.virginia.edu/ Electronic Text Center -- University of Virginia]
*[http://www.phon.ox.ac.uk/~esther/ivyweb/ English Intonation in the British Isles - The IViE Corpus]
*[http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/naive-bayes/bow-0.8/stopwords.c English stop words (from SMART)]
*[http://www-personal.umich.edu/~jlawler/levin.html English Verb Classes And Alternations: A Preliminary Investigation (Index)]
*[http://usna.edu/LangStudy/BNC/ Exploring Words and Phrases from the British National Corpus]
*[http://ir.dcs.gla.ac.uk/test_collections/gov2-summary.htm GOV2 Corpus] - 426 gigabytes of text
*[http://gmb.let.rug.nl Groningen Meaning Bank] semantically annotated corpus
*[http://www.gutenberg.org/wiki/Main_Page Gutenberg]
*[http://ufal.mff.cuni.cz/hamledt HamleDT], harmonized dependency treebanks of many languages, common annotation style.
*[http://prize.hutter1.net/ Hutter Prize for Lossless Compression of Human Knowledge 100M sample of Wikipedia]
*[http://nora.hd.uib.no/icame.html ICAME]
*[http://www.cs.fit.edu/~mmahoney/compression/text.html Large Text Compression Benchmark's 1G sample of Wikipedia]
*[http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/naive-bayes/bow-0.8/stopwords.c List of English stopwords]
*[http://www.cs.cornell.edu/People/pabo/movie-review-data/ Movie Review Data]
*[http://www.cs.pitt.edu/mpqa/ Multi-Perspective Question Answering (MPQA)]
*[http://mwe.stanford.edu/resources/ Multiword Expression Resources]
*[http://www.askoxford.com/oec/mainpage/?view=uk Oxford English Corpus]
*[http://pie.usna.edu/ Phrases in English]
*[http://homepages.feis.herts.ac.uk/~comrcml/Lyon-thesis.ps Restricted English Corpus from Dr. Caroline Lyon for PhD]
*[http://www.sketchengine.co.uk/ Sketch Engine]
*[http://www-2.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/susanne/0.html Susanne: Annotated American English Corpus]
*[http://clix.to/davidlee00 The BNC Index (for the BNCWorld Edition)]
*[http://www-users.york.ac.uk/~sp20/corpus.html The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English]
*[http://www.grsampson.net/LucyDoc.html The LUCY Corpus - Documentation]
*[http://www.cs.rochester.edu/research/cisd/resources/trains.html TRAINS Dialogue Corpus]
*[http://ebiquity.umbc.edu/resource/html/id/351 UMBC Webbase Corpus]
*[http://www.euromatrixplus.net/multi-un/ UN parallel corpora]
*[http://www.let.rug.nl/~bos/vpe/ VP Ellipsis corpus]
*[http://wacky.sslmit.unibo.it/ WaCky]
*[http://www.webcorp.org.uk/guide/ WebCorp]
* [http://www.statmt.org/wmt13/translation-task.html#download WMT corpora], including [http://en.wikipedia.org/wiki/Europarl_corpus Europarl], News Commentary, and News Crawl

==Link collections==


*[http://www.dcs.gla.ac.uk/idom/ir_resources/ Collections of texts and corpora]
*[http://www.bmanuel.org/clr2_mp.html Manuel Barbera: General Corpora and Corpus Linguistics Resources]
*[http://www.sultry.arts.usyd.edu.au/links/statnlp.html Annotated list of resources on statistical NLP and corpus-based CL]

==Corpora tools==


*[http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words List of stop words]
*[http://korpus.pl/index.php?page=poliqarp Poliqarp] - open source XML-aware indexer, search engine and concordancer
*[http://www.sketchengine.co.uk/ The Sketch Engine]
*[http://www.cis.upenn.edu/~treebank/tokenization.html Treebank tokenization scheme]

[[Category:Corpora|*]]

Resources for Chinese

2015-03-08T19:31:19Z

Vladob54: Added: Araneum

==Tools==
===Free software===
* [https://github.com/yzhang/rseg rseg] word segmentation; written in Ruby (no compilation, no hard dependencies apart from Ruby), comes with a model (MIT license)
* [https://code.google.com/p/ctbparser/ ctbparser] word segmentation, POS tagging, NER, dependency parsing, all using Conditional Random Fields; written in C++ (LGPL license)
* [http://www.cl.cam.ac.uk/~yz360/zpar.html ZPar] word segmentation, POS tagging, CFG/dep/CCG parsing of Chinese and English; written in C++ (GPL3 license)
* [http://code.google.com/p/duduplus/ DuDuPlus: a graph-based dependency parser for English and Chinese] ("Other Open Source" license?)
** where is the source code?

==Corpora==
===Free license===
* [http://corpora.heliohost.org/ HC Corpora] 1,606,811 lines of [http://en.wikipedia.org/wiki/Fair_use Fair Use] excerpts from news, blogs, and Twitter
* [http://www.euromatrixplus.net/multi-un/ UN parallel corpora]

===Nonfree or Unknown license===
* [http://ucts.uniba.sk/aranea_about/ Araneum Sinicum], Gigaword Chinese web corpus
* [http://www.chinesecomputing.com Chinese Computing]
* [http://www.icl.pku.edu.cn/icl_groups/corpus/dwldform1.asp Word Segmented and POS tagged People Daily Corpus at ICL of Peking University]
* [http://corpus.leeds.ac.uk/frqc/i-zh-char.num.html Frequency list of characters in the Internet corpus]
* [http://corpus.leeds.ac.uk/frqc/internet-zh.num Frequency list of lexical items in the Internet corpus]
* [http://www.ling.lancs.ac.uk/corplang/lcmc/ Lancaster Corpus of Mandarin Chinese]
* [http://corpus.leeds.ac.uk/query-zh.html A collection of Chinese corpora and frequency lists] Online query with three corpora

[[Category:Resources by language|Chinese]]

Resources for Spanish

2015-03-08T19:29:31Z

Vladob54: Added: Araneum

==Corpora==
* [http://ucts.uniba.sk/aranea_about/ Araneum Hispanicum], Gigaword Spanish web corpus
* [http://www.corpusdelespanol.org/ Corpus del Español] (website only)
* [http://www.lllf.uam.es/~fmarcos/informes/corpus/corpulee.html Corpus de referencia de la lengua Española contemporánea: corpus oral peninsular]
* [http://ufal.mff.cuni.cz/hamledt HamleDT], harmonized dependency treebanks of many languages, common annotation style.
* [http://www.statmt.org/wmt13/translation-task.html#download WMT corpora], including [http://en.wikipedia.org/wiki/Europarl_corpus Europarl], News Commentary, and News Crawl
* [http://www.euromatrixplus.net/multi-un/ UN parallel corpora]

== Grammars ==
* [[Generation grammars|KPML generation grammar]]

[[Category:Resources by language|Spanish]]

Resources for Slovak

2015-03-08T19:28:02Z

Vladob54: Added: Araneum

==Corpora==
===Free license===
* [http://www.statmt.org/europarl Europarl corpus], sentence aligned with English

===Unknown license===

* [http://ucts.uniba.sk/aranea_about/ Araneum Slovacum], Gigaword Slovak web corpus
* [http://ufal.mff.cuni.cz/hamledt HamleDT], harmonized dependency treebanks of many languages, common annotation style.
* [http://korpus.juls.savba.sk/ Slovenský národný korpus / Slovak National Corpus]

==Lexical resources==
===Free software===
* [http://www.sk-spell.sk.cx/mass-msas Malý Anglicko-Slovenský a Slovensko-Anglický Slovník (mass/msas)] is a Slovak-English-Slovak dictionary, available in the StarDict format, under the [[GNU FDL]].

===Proprietary===

[[Category:Resources by language|Slovak]]

Resources for Russian

2015-03-08T19:23:43Z

Vladob54: Typo

==Corpora==
===Free open source===
* [http://www.euromatrixplus.net/multi-un/ MultiUN] "A Multilingual corpus from United Nation Documents"; the Russian portion is 876 MB. The other languages in the multilingual corpus are English, French, Spanish, Arabic, Chinese, and German.
* [http://www.statmt.org/wmt13/translation-task.html#download WMT corpora], including the Yandex 1M corpus, News Commentary, and News Crawl

===Unknown license===


* [http://ucts.uniba.sk/aranea_about/ Araneum Russicum], Gigaword Russian web corpus
* [http://ufal.mff.cuni.cz/hamledt HamleDT], harmonized dependency treebanks of many languages, common annotation style.
* [http://www.helsinki.fi/venaja/english/e-material/hanco/index.htm HANCO: The Helsinki annotated corpus of Russian texts] (searchable, no visible download links)
* [http://www.sfb441.uni-tuebingen.de/b1/korpora.html Russian Corpora (uni-tuebingen.de)] (searchable, no visible download links)
* [http://corpus.leeds.ac.uk/ruscorpora.html Russian Internet Corpus]
* [http://www.ruscorpora.ru/ Russian National Corpus]
* [http://www.philol.msu.ru/~lex/corpus/ Russian Newspaper Corpus]
* [http://lib.ru/ Various texts in Russian (lib.ru)]

== POS taggers ==

* [http://www.aot.ru/ AOT, morphological analyser]
* [http://corpus.leeds.ac.uk/mocky/ Mocky, statistical taggers and lemmatiser]
* [http://company.yandex.ru/technology/mystem/ Mystem, morphological analyser]

== Grammars ==
* [[Generation grammars|KPML generation grammar]]
* [http://abisource.com/projects/link-grammar/ Link Grammar Parser], includes Russian dictionaries.

==Various resources==
* [http://rykov-cl.narod.ru/r.html Russian Corpora (rykov-cl.narod.ru)]
* [http://corpus.leeds.ac.uk/serge/frqlist/ Russian frequency lists]
* [http://www.philol.msu.ru/rus/galya-1 Russian Phonetics on the Web]
* [http://schools.keldysh.ru/uvk1838/Sciper/volume2/langres/russiclr.htm Russicon Resources]

[[Category:Resources by language|Russian]]

Resources for Russian

2015-03-08T19:23:17Z

Vladob54: Added: Araneum

==Corpora==
===Free open source===
* [http://www.euromatrixplus.net/multi-un/ MultiUN] "A Multilingual corpus from United Nation Documents"; the Russian portion is 876 MB. The other languages in the multilingual corpus are English, French, Spanish, Arabic, Chinese, and German.
* [http://www.statmt.org/wmt13/translation-task.html#download WMT corpora], including the Yandex 1M corpus, News Commentary, and News Crawl

===Unknown license===


* [http://ucts.uniba.sk/aranea_about/ Araneum Russicum], Gigaword Russian web corpus
* [http://ufal.mff.cuni.cz/hamledt HamleDT], harmonized dependency treebanks of many languages, common annotation style.
* [http://www.helsinki.fi/venaja/english/e-material/hanco/index.htm HANCO: The Helsinki annotated corpus of Russian texts] (searchable, no visible download links)
* [http://www.sfb441.uni-tuebingen.de/b1/korpora.html Russian Corpora (uni-tuebingen.de)] (searchable, no visible download links)
* [http://corpus.leeds.ac.uk/ruscorpora.html Russian Internet Corpus]
* [http://www.ruscorpora.ru/ Russian National Corpus]
* [http://www.philol.msu.ru/~lex/corpus/ Russian Newspaper Corpus]
* [http://lib.ru/ Various texts in Russian (lib.ru)]

== POS taggers ==

* [http://www.aot.ru/ AOT, morphological analyser]
* [http://corpus.leeds.ac.uk/mocky/ Mocky, statistical taggers and lemmatiser]
* [http://company.yandex.ru/technology/mystem/ Mystem, morphological analyser]

== Grammars ==
* [[Generation grammars|KPML generation grammar]]
* [http://abisource.com/projects/link-grammar/ Link Grammar Parser], includes Russian dictionaries.

==Various resources==
* [http://rykov-cl.narod.ru/r.html Russian Corpora (rykov-cl.narod.ru)]
* [http://corpus.leeds.ac.uk/serge/frqlist/ Russian frequency lists]
* [http://www.philol.msu.ru/rus/galya-1 Russian Phonetics on the Web]
* [http://schools.keldysh.ru/uvk1838/Sciper/volume2/langres/russiclr.htm Russicon Resources]

[[Category:Resources by language|Russian]]

Resources for Polish

2015-03-08T19:22:13Z

Vladob54: Added: Araneum

==Corpora==
* [http://ucts.uniba.sk/aranea_about/ Araneum Polonicum], Gigaword Polish web corpus
* [http://www.statmt.org/europarl Europarl corpus], sentence aligned with English
* [http://korpus.pl/en/ IPI PAN Corpus] - The IPI PAN Corpus is a large (currently over 250 million segments), morphosyntactically annotated, publicly available corpus of Polish, developed by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences (ICS PAS)
* [http://korpus.pwn.pl/index_en.php PWN Corpus] - PWN has prepared and made available an online version of the Corpus of Polish consisting of 40 million words. The samples were taken from 386 books, 977 editions selected from 185 different press publications, 84 transcribed spoken texts, 207 web sites and several hundred advertising leaflets and other ephemera. The full version of the corpus is available on payment for access, while a demonstration version of over 7.5 million words is available free of charge.

==Taggers, parsers, morphology analysers==

===Free/Open Source Software===
* [http://morfologik.blogspot.com/ Morfologik] -- morphological dictionary by Marcin Miłkowski (of LanguageTool), licensed under CC-SA / GNU LGPL
** [http://nlp.pwr.wroc.pl/redmine/projects/libpltagger/wiki/Morfologik_converted Morfologik converted to the IKIPI tagset] (the tagset of the IPI PAN Corpus)
* [http://nlp.pwr.wroc.pl/en/tools-and-resources/narzedzia-przetwarzania-morfosyntaktycznego Morphosyntactic Toolchain] by WrocUT Language Technology Group G4.19, licensed under GNU LGPL (some optional addons are GNU GPL). Command-line utilities providing tokenisation, morphological analysis, morphosyntactic tagging, shallow parsing (chunking), WCCL feature vectors for machine learning.

===Unknown license===
* [http://nlp.ipipan.waw.pl/~wolinski/morfeusz/ "Morfeusz"] - morphological analyser of Polish (Wolinski, 2005)
** [http://www.springerlink.com/content/l101v8823391j568/ main reference] Morfeusz — a Practical Tool for the Morphological Analysis of Polish
* "AMOR" - morphological analyser of Polish (Joanna Rabiega, 2000)
** [http://members.chello.pl/jrw/doc/jr_ma.pdf/ main reference] Podstawy lingwistyczne automatycznego analizatora morfologicznego AMOR
* [http://duch.mimuw.edu.pl/~kszafran/index.php?option=com_docman&task=cat_view&gid=33&Itemid=43 "SAM"] - morphological analyser of Polish (Krzysztof Szafran, 1994)
* [http://sourceforge.net/project/showfiles.php?group_id=166344 Morfologik] - Polish morphological analyzer based on current ispell dictionaries, with Java libraries that interface with it. Described as the first completely open-source and comprehensive morphological tools for Polish; intended for use in grammar correction tools (to be included in the future).
* [http://nlp.ipipan.waw.pl/Spejd/ Spejd - Shallow Parsing and Disambiguation Engine]
* [http://www.cs.put.poznan.pl/dweiss/xml/projects/lametyzator/index.xml lemmatizer] - Dawid Weiss

==Lexical resources==
* [http://plwordnet.pwr.wroc.pl/wordnet/ plWordnet] - a lexico-semantic database of Polish language.
* [https://play.google.com/store/apps/details?id=com.pwr.plwordnet Mobile plWordNet] - free mobile application for plWordNet browsing.

==Bibliography==

==External links==
* [http://bach.ipipan.waw.pl/mailman/listinfo/ling Polish linguistics mailing list] - mainly in Polish

[[Category:Resources by language|Polish]]

Resources for Italian

2015-03-08T19:21:15Z

Vladob54: Added: Araneum

== Tools for Italian ==

=== Tokenisers ===
* [http://tcc.itc.it/projects/textpro/index.php TextPro]

=== POS taggers ===
* [http://tcc.itc.it/projects/textpro/index.php TextPro]
* [http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html TreeTagger]

===Morphology===
====Free software====
* [http://sslmitdev-online.sslmit.unibo.it/linguistics/morph-it.php Morph-It! version 0.47] - a free morphological resource for the Italian language, includes [[SFST]] sources. [[LGPL]] license.

====Unknown license====
* [http://archivium.biz/ dic_it: il Verbiario] - a morphological analyzer and verb conjugator for Italian verbs (web interface only?)

=== Named Entity Recognisers ===
* [http://tcc.itc.it/projects/ontotext/entitypro.html EntityPro]

=== Temporal Expressions ===
* [http://tcc.itc.it/projects/ontotext/ita-chronos.html ITA-Chronos]

=== Parsers ===
* [http://ai-nlp.info.uniroma2.it/external/chaosproject/ Chaos] - Robust syntactic parser for Italian and for English

=== Generators ===
* [http://tcc.itc.it/projects/xig/index.html XIG] - Interchange to Italian Generator

== Resources for Italian ==

=== Corpora ===


* [http://ucts.uniba.sk/aranea_about/ Araneum Italicum], Gigaword Italian web corpus
* [http://www.istc.cnr.it/material/database/colfis/ ColFIS Corpus e Lessico di Frequenza dell'Italiano Scritto]
* [http://corpus.cilta.unibo.it:8080/coris_ita.html Corpus di Italiano Scritto contemporaneo (CORIS/CODIS)]
* [http://www.statmt.org/europarl Europarl corpus], sentence aligned with English
* [http://ufal.mff.cuni.cz/hamledt HamleDT], harmonized dependency treebanks of many languages, common annotation style.
* [http://corpora.informatik.uni-leipzig.de/ Italian plain text and Co-occurrences at LCC]
* [http://languageserver.uni-graz.at/badip/badip/20_corpusLip.php LIP - Lessico di frequenza dell'Italiano Parlato - Access via BADIP]
* [http://multisemcor.itc.it/ MultiSemCor] - English/Italian parallel corpus
* [http://www.uni-duisburg.de/Fak2/FremdPhil/Romanistik/Personal/Burr/humcomp/ Oxford Text Archive Corpus of Italian Newspapers]
* [http://tlio.ovi.cnr.it/TLIO/ Tesoro della lingua italiana delle origini (TLIO)]

=== Tagsets ===
* [http://tcc.itc.it/projects/textpro/index.php LemmaPro] - Italian POS tagset for LemmaPro

=== Treebanks ===
* [http://catalog.elra.info/retd/product_info.phpproducts_id=879&osCsid=0cef41a96779ef79b67c71bbf35e6eaa ISST] - Italian Syntactic-Semantic Treebank
* [http://www.di.unito.it/~tutreeb/ TUT] - Turin University Treebank
* [http://157.138.41.87/HTMLipar/indexparsing_a.htm VIT] - Venice Italian Treebank

=== WordNets ===
* [http://www.elda.fr/ EuroWordNet]
* [http://multiwordnet.itc.it/english/home.php MultiWordNet] - a multilingual lexical database in which the Italian WordNet is strictly aligned with Princeton WordNet 1.6

=== Lexicons ===
* [http://www.ilc.cnr.it/clips/PSC_decription.htm PAROLE-SIMPLE-CLIPS] - a four-layered, general purpose computational lexicon

== Links ==
* [http://evalita.itc.it/ Evalita] - Evaluation of NLP tools for Italian

[[Category:Resources by language|Italian]]

Resources for German

2015-03-08T19:18:53Z

Vladob54: Added: Araneum

==Corpora==
===Free license===
* [http://www.computing.dcu.ie/~ygraham/software.html RIA Open Source Rule Induction Tool] includes an LFG-parsed German-English phrase-aligned parallel corpus, a subset of the EuroParl corpus (4000 sentences for each language, the tool at least is LGPL)
* [http://www.euromatrixplus.net/multi-un/ UN parallel corpora]
* [http://www.statmt.org/wmt13/translation-task.html#download WMT corpora], including [http://en.wikipedia.org/wiki/Europarl_corpus Europarl], News Commentary, and News Crawl

===Unknown license===


* [http://ucts.uniba.sk/aranea_about/ Araneum Germanicum], Gigaword German web corpus
* [http://www.phonetik.uni-muenchen.de/Bas/BasKorporaeng.html Bavarian Archive for Speech Signals Corpora]
* [http://corpora.ids-mannheim.de/~cosmas/ COSMAS II]
* [http://www.ims.uni-stuttgart.de/projekte/tc/CQP.html Experimental Corpus Query System (University of Stuttgart, Germany)]
* [http://www.wortschatz.uni-leipzig.de/ German plain text and Co-occurrences at LCC]
* [http://ufal.mff.cuni.cz/hamledt HamleDT], harmonized dependency treebanks of many languages, common annotation style.
* [http://www.coli.uni-sb.de/sfb378/negra-corpus/negra-corpus.html NEGRA Corpus]
* [http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus/ TIGER treebank]
* [http://www.sfs.uni-tuebingen.de/en_tuebadz.shtml Tübingen Treebank of Written German (TüBa-D/Z)]
* [http://www.sfs.uni-tuebingen.de/en_tuebads.shtml Tübingen Treebank of Spoken German (TüBa-D/S, aka Verbmobil treebank)]
* [http://www.sfs.uni-tuebingen.de/en_tuepp.shtml Tübingen Partially Parsed Corpus of Written German (TüPP-D/Z)]
* [http://www.coli.uni-saarland.de/~gparis/LMD-TAZ_corpus/ Le Monde Diplomatique-Die Tageszeitung Translation Corpus] - French-German, aligned (parallel)

==Evaluation datasets==
* [http://www.ukp.tu-darmstadt.de/data/semRelDatasets Semantic relatedness evaluation]

== Grammars ==
* [[Generation grammars|KPML generation grammar]]

== Morphological analysis ==
=== Free software ===
* [https://code.google.com/p/morphisto/ Morphisto], based on [[SMOR]], is an [[SFST]]-based analyser and generator for German. (The morphology is GPLv2, but the lexicon is proprietary/non-commercial: CC-BY-SA-NC v3)
* [http://www.danielnaber.de/morphologie/index_en.html German morphology data], based on [http://www.wolfganglezius.de/doku.php?id=cl:morphy Morphy], licensed under CC-BY-SA 3.0

==Lexicons==
===Free software===
* [http://www-user.tu-chemnitz.de/~fri/ding/ DING] - German-English Dictionary with approximately 253,000 entries (GPL 2 or later).
* [http://www.openthesaurus.de/ OpenThesaurus] - German synonyms and associated terms (LGPL)

===Proprietary/gratis===
* [http://www.ims.uni-stuttgart.de/tcl/RESOURCES/German-Lexicon-en.html Lexical information for German] ("The data is freely available for education, research and other '''non-commercial''' purposes.")
* [http://www.canoo.net/ Canoo.net] - German Dictionaries and Grammars

===Unknown license===
* [http://www.ims.uni-stuttgart.de/projekte/IMSLex/ IMSLex German Lexicon] (no license information, but only "sample" download)
* [http://www.cl.uzh.ch/CL/siclemat/sprachanalyse/molif/ mOlif morphological analyzer] (broken link)

==Resource Access==
* [http://wortschatz.uni-leipzig.de/Webservices/ Web service access to German language statistics]

==Timeline Analysis==
* [http://wortschatz.uni-leipzig.de/wort-des-tages/ German Words of the Day]
* [http://www.sfs.uni-tuebingen.de/~lothar/nw/ Wortwarte (selection of German neologisms for each day)]

[[Category:Resources by language|German]]

Resources for French

2015-03-08T19:17:25Z

Vladob54: Added: Araneum

==Corpora==
* [http://www.statmt.org/wmt10/training-giga-fren.tar 10^9 French-English corpus]
* [http://ucts.uniba.sk/aranea_about/ Araneum Francogallicum], Gigaword French web corpus
* [http://atilf.atilf.fr/dmf.htm Base Textuelle de Moyen Francais]
* [http://corpora.informatik.uni-leipzig.de/ French plain text and Co-occurrences at LCC]
* [http://www.up.univ-mrs.fr/veronis/donnees/index.html French Stopword List]
* [http://www.cnrtl.fr/lexiques/morphalou/ Lexique Morphalou]
* [http://w3.univ-tlse2.fr/erss/verbaction/main.html Lexique Verbaction]
* [http://www.coli.uni-saarland.de/~gparis/LMD-TAZ_corpus/ Le Monde Diplomatique-Die Tageszeitung Translation Corpus] - French-German, aligned (parallel)
* [http://www.euromatrixplus.net/multi-un/ UN parallel corpora]
* [http://www.statmt.org/wmt13/translation-task.html#download WMT corpora], including [http://en.wikipedia.org/wiki/Europarl_corpus Europarl], News Commentary, and News Crawl
* [http://88milsms.huma-num.fr/ Large SMS corpus in French (88milSMS)]

== Grammars/parsers ==
===Free software===
* [http://led.loria.fr/en_outils.php#114 HPSG FroG] (under the LGPLLR according to [http://2009.rmll.info/IMG/pdf/RMLL2009-Sciences-Sebastien_Paumier-LGPLLR.pdf this presentation])
* [http://alpage.inria.fr/~sagot/wolf.html WOLF] - Wordnet Libre du Français (free French WordNet), distributed under the Cecill-C license (LGPL-compatible)
* [http://alpage.inria.fr/~sagot/lefff.html Lefff] - Lexique des Formes Fléchies du Français, a wide-coverage morphological and syntactic lexicon, distributed under the free LGPL-LR license (Lesser General Public License For Linguistic Resources), see also [http://gforge.inria.fr/projects/alexina/ Alexina]
* [http://sites.google.com/site/morfetteweb/ Morfette] data driven PoS tagger and lemmatizer, New BSD License
* [http://wiki.apertium.org/wiki/Main_Page Apertium] has analysers/generators in the [[lttoolbox]] format for French, along with statistical disambiguation models, see e.g. the files in [https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-fr-ca fr-ca], [https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-fr-es fr-es] and [https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-br-fr br-fr]

===Unknown license===
* [[Generation grammars|KPML generation grammar]]
* [http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html TreeTagger] has some French support (gratis for research)
* [https://gforge.inria.fr/frs/download.php/27240/melt-0.6.tar.gz MeLT], data driven POS tagger

==Morphology, dictionaries==
===Free software===
* [http://www.dicollecte.org/ Dicollecte] - French lexicon, list of inflected forms (MPL/GPL/LGPL)
* [http://www.univ-nancy2.fr/pers/namer/Telecharger_Flemm.html Flemmv3.1] - inflectional morphology parser for French, written in Perl (GPL license)

[[Category:Resources by language|French]]

Resources for Hungarian

2015-03-08T19:15:06Z

Vladob54: Added: Araneum

==Corpora==
* [http://ucts.uniba.sk/aranea_about/ Araneum Hungaricum], Gigaword Hungarian web corpus
* [http://www.statmt.org/europarl Europarl corpus], sentence aligned with English
* Hunglish parallel corpus ([http://mokk.bme.hu/resources/hunglishcorpus download], [http://hunglish.hu/search search])
* [http://mokk.bme.hu/resources/webcorpus/ Hungarian Webcorpus]
* [http://ufal.mff.cuni.cz/hamledt HamleDT], harmonized dependency treebanks of many languages, common annotation style.

== Tools ==
* [http://code.google.com/p/hunpos/ hunpos] (open-source POS-tagger)
* [http://mokk.bme.hu/resources/hunmorph/ hunmorph] (open-source morphological analyzer)

[[Category:Resources by language|Hungarian]]

Resources for Dutch

2015-03-08T19:12:30Z

Vladob54: Added: Araneum

== Corpora ==
* [http://ucts.uniba.sk/aranea_about/ Araneum Nederlandicum], Gigaword Dutch web corpus
* [http://corpora.informatik.uni-leipzig.de/ Dutch Plain text and Co-occurrences at LCC]
* [http://www.statmt.org/europarl Europarl corpus] - sentence-aligned with English
* [http://www.clips.uantwerpen.be/datasets/csi-corpus CLiPS Stylometry Investigation (CSI) corpus] - multi-purpose text corpus, main use in stylometry
* [http://ufal.mff.cuni.cz/hamledt HamleDT], harmonized dependency treebanks of many languages, common annotation style.

== Tools ==
* [http://www.let.rug.nl/~vannoord/alp/Alpino/ Dutch HPSG-based parser] Includes the Alpino treebank (7137 sentences, newspaper, manually corrected)

== Grammars ==
* [[Generation grammars|KPML generation grammar]]

[[Category:Resources by language|Dutch]]

Resources for Finnish

2015-03-08T19:10:15Z

Vladob54: Added: Araneum

==Corpora==
* [http://ucts.uniba.sk/aranea_about/ Araneum Finnicum], Gigaword Finnish web corpus
* [http://www.statmt.org/europarl Europarl corpus], sentence aligned with English
* [http://corpora.informatik.uni-leipzig.de/ Finnish plain text and Co-occurrences at LCC]
* [http://www.csc.fi/english/research/sciences/linguistics/index_html CSC Kielipankki] Language Bank at the [http://www.csc.fi/ CSC] Scientific Computing Centre, including some 200 million word tokens of Finnish texts.
* [http://ufal.mff.cuni.cz/hamledt HamleDT], harmonized dependency treebanks of many languages, common annotation style.

==Morphological analysers==
===Free software===
* [https://gna.org/projects/omorfi/ Omorfi] is an Open Morphology for Finnish, in association with the [[voikko]] speller project, see also https://kitwiki.csc.fi/twiki/bin/view/KitWiki/OmorfiHFSTVersion for installing with [[HFST]]. (LGPL/GPL)

[[Category:Resources by language|Finnish]]