Difference between revisions of "Language Identification Tools"

From ACL Wiki
Line 11: Line 11:
 
** https://github.com/crodas/PHPTextCat/ – a php module for libtextcat
 
 
** https://launchpad.net/pylibtextcat python2 / https://github.com/bbqsrc/pylibtextcat/ python3 interface to libtextcat
 
- * Nutch Language Identifier https://wiki.apache.org/nutch/LanguageIdentifier (Apache license)
+ * Nutch Language Identifier https://wiki.apache.org/nutch/LanguageIdentifier Java (Apache 2.0 license)
+ ** https://code.google.com/p/language-detection/ source code, data for 53 languages
+ ** https://code.google.com/p/lang-guess/ lang-guess is a fork of language-detection
 
* Compact Language Detector for Javascript https://github.com/jaukia/cld-js (3-clause license)
 
 
** doesn't seem to include a method to add new languages, the existing ones were presumably generated by Google
 
+ * LID http://www.cavar.me/damir/LID/ Python and Scheme (GPL3)
  
 
==Proprietary==
 

Revision as of 03:01, 6 December 2012

A listing of language identification tools. Language identification can mean both identifying text type (e.g. news vs literature) and language (e.g. English vs Frisian vs Dutch).

Most of these tools must be trained on a large corpus (see List of resources by language for corpora per language), but many ship with some prebuilt language models.
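
To illustrate what "training on a corpus" can look like, here is a minimal sketch of a character n-gram identifier in the style of the rank-order profile method of Cavnar and Trenkle, which TextCat-style tools are based on. It is not the code of any tool listed on this page: the function names (ngram_profile, out_of_place, train, identify), the parameter defaults and the two toy corpora are assumptions made for illustration only. Real tools train on far larger corpora.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Return the top_k most frequent character n-grams (length 1..n_max),
    ordered from most to least frequent."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile
    receive a fixed maximum penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def train(corpora, **kw):
    """Build one profile ("language model") per language from raw text."""
    return {lang: ngram_profile(text, **kw) for lang, text in corpora.items()}


def identify(text, models, **kw):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text, **kw)
    return min(models, key=lambda lang: out_of_place(doc, models[lang]))


# Toy corpora (hypothetical, far too small for real use).
corpora = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "nl": "de snelle bruine vos springt over de luie hond en de kat slaapt in het huis",
}
models = train(corpora)
print(identify("the dog sleeps in the house", models))
```

A "prebuilt language model" in this sketch is simply the stored list returned by ngram_profile for a language, so a tool can ship such lists and skip the training step.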

Free Software

Proprietary

See also