Difference between revisions of "Resources for Croatian"

Revision as of 13:07, 25 March 2010

IHJJ - Institute of Croatian Language and Linguistics
Croatian Language Technologies Portal - exhaustive lists of corpora, dictionaries, tools, associations, institutions and projects in language technologies (LT). Developed at the Institute of Linguistics, Faculty of Humanities and Social Sciences, University of Zagreb.

Croatian National Corpus - 101.2 million tokens; synchronic (texts from 1990 onward) standard Croatian reference corpus; lemmatised and MSD-tagged with the Croatian MULTEXT-East tagset using the hybrid tagger CroTag and a lemmatiser. Developed at the Institute of Linguistics, Faculty of Humanities and Social Sciences, University of Zagreb since 1998.
Croatian Language Corpus (continuously growing corpus of Croatian, currently approx. 100 million tokens, covering various genres and time periods; uses PhiloLogic for online search)

Southeast European Times (sentence aligned corpus, Albanian, Bulgarian, English, Greek, Macedonian, Romanian, Serbo-Croatian, Turkish — approximately 4.5 million words per language)

Croatian Morphological Lexicon - Croatian inflectional lexicon comprising more than 110,000 lemmas yielding more than 3.8 million word-forms; freely searchable. Developed at the Institute of Linguistics, Faculty of Humanities and Social Sciences, University of Zagreb.

@@ Line 12: / Line 12: @@
 <!-- Please keep this list in alphabetical order -->
-* [http://xixona.dlsi.ua.es/~fran/setimes/ Southeast European Times] (paragraph aligned corpus, Albanian, Bosnian, Bulgarian, Croatian, English, Greek, Macedonian, Romanian, Serbian, Turkish &mdash; 9,678 paragraphs, 92,450&mdash; 122,912 words per language)
+* [http://www.statmt.org/setimes/ Southeast European Times] (sentence aligned corpus, Albanian, Bulgarian, English, Greek, Macedonian, Romanian, Serbo-Croatian, Turkish &mdash; approximately 4.5 million words per language)
 ==Lexicons==