Difference between revisions of "Language Identification Tools"

Revision as of 03:25, 6 December 2012

A listing of language identification tools. Language identification can mean both identifying text type (e.g. news vs literature) and language (e.g. English vs Frisian vs Dutch).

Most of these tools require training on a big corpus (see List of resources by language for corpora per language), but many come with some prebuilt language models.
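To make "training on a corpus" concrete, here is a minimal sketch of the character n-gram rank-profile approach that TextCat-style tools are based on. It is an illustration only: the function names (ngram_profile, out_of_place, train, identify), the profile size of 300 and the toy corpora are assumptions made for this example, not the API or settings of any tool listed on this page.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Rank the top_k most frequent character n-grams (lengths 1..max_n).

    Returns a dict mapping each n-gram to its rank (0 = most frequent).
    """
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top_k)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the model get a fixed penalty."""
    max_penalty = len(lang_profile)
    distance = 0
    for gram, rank in doc_profile.items():
        if gram in lang_profile:
            distance += abs(rank - lang_profile[gram])
        else:
            distance += max_penalty
    return distance


def train(corpora):
    """Build one profile ("language model") per language from raw training text."""
    return {lang: ngram_profile(text) for lang, text in corpora.items()}


def identify(text, models):
    """Return the language whose profile is closest to the profile of text."""
    doc_profile = ngram_profile(text)
    return min(models, key=lambda lang: out_of_place(doc_profile, models[lang]))


if __name__ == "__main__":
    # Toy corpora for illustration; real models need far more text per language.
    toy_corpora = {
        "en": "the quick brown fox jumps over the lazy dog and the cat",
        "nl": "de snelle bruine vos springt over de luie hond en de kat",
    }
    models = train(toy_corpora)
    print(identify("the dog and the fox", models))
```

A prebuilt language model in this scheme is simply a stored profile, which is why a tool can ship with ready-made models and still let users train additional languages from their own corpora.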

Free Software

LibTextCat http://software.wise-guys.nl/libtextcat/ C library (BSD license)
- Interfaces to the C library libtextcat:
  - http://www.jedi.be/pages/JTextCat/ – a Java interface to libtextcat
  - https://github.com/crodas/PHPTextCat/ – a PHP module for libtextcat
  - https://launchpad.net/pylibtextcat (Python 2) / https://github.com/bbqsrc/pylibtextcat/ (Python 3) – Python interfaces to libtextcat
- http://odur.let.rug.nl/~vannoord/TextCat/ – original Perl TextCat implementation
  - http://opus.lingfil.uu.se/tools/public/language_guesser/textcat – Perl version with more language models and encoding fixes
- http://olivo.net/software/lc4j/ – a Java reimplementation
- http://thomas.mangin.com//content/texcat-in-python.html – a Python implementation by Thomas Mangin
- http://www.mnogosearch.org/guesser/ – another C reimplementation

Nutch Language Identifier https://wiki.apache.org/nutch/LanguageIdentifier Java (Apache 2.0 license)
- https://code.google.com/p/language-detection/ source code, data for 53 languages
- https://code.google.com/p/lang-guess/ lang-guess is a fork of language-detection

Compact Language Detector for JavaScript https://github.com/jaukia/cld-js (3-clause license)
- does not seem to include a way to add new languages; the existing language models were presumably generated by Google

LID http://www.cavar.me/damir/LID/ Python and Scheme (GPL3)

Proprietary

Google Language Identification API
Lingua-Systems lid http://www.lingua-systems.com/language-identifier/

@@ Line 1: / Line 1: @@
 A listing of language identification tools. Language identification can mean both identifying text type (e.g. news vs literature) and language (e.g. English vs Frisian vs Dutch).
-Most of these tools require training on a big corpus (see [[Category:Resources_by_language]] for corpora per language), but many come with some prebuilt language models.
+Most of these tools require training on a big corpus (see [[List of resources by language]] for corpora per language), but many come with some prebuilt language models.
+==Free Software==
+* LibTextCat http://software.wise-guys.nl/libtextcat/ C library (BSD license)
+** Interfaces to the C library libtextcat:
+*** http://www.jedi.be/pages/JTextCat/ – a java interface to libtextcat
+*** https://github.com/crodas/PHPTextCat/ – a php module for libtextcat
+*** https://launchpad.net/pylibtextcat python2 / https://github.com/bbqsrc/pylibtextcat/ python3 interface to libtextcat
+** http://odur.let.rug.nl/~vannoord/TextCat/ – original perl TextCat implementation
+*** http://opus.lingfil.uu.se/tools/public/language_guesser/textcat – perl version with more language models, encoding fixes
+** http://olivo.net/software/lc4j/ – a java reimplementation
+** http://thomas.mangin.com//content/texcat-in-python.html – a python implementation by Thomas Mangin
+** http://www.mnogosearch.org/guesser/ – another C reimplementation
+* Nutch Language Identifier https://wiki.apache.org/nutch/LanguageIdentifier Java (Apache 2.0 license)
+** https://code.google.com/p/language-detection/ source code, data for 53 languages
+** https://code.google.com/p/lang-guess/ lang-guess is a fork of language-detection
-==Free Software==
-* TextCat
-** http://opus.lingfil.uu.se/tools/public/language_guesser/textcat/LM - language models for the perl version
-** http://olivo.net/software/lc4j/ - a java implementation
-* Nutch Language Identifier https://wiki.apache.org/nutch/LanguageIdentifier (Apache license)
 * Compact Language Detector for Javascript https://github.com/jaukia/cld-js (3-clause license)
 ** doesn't seem to include a method to add new languages, the existing ones were presumably generated by Google
+* LID http://www.cavar.me/damir/LID/ Python and Scheme (GPL3)
 ==Proprietary==
@@ Line 19: / Line 34: @@
 * [[Language Identification (State of the art)]]
 * [https://en.wikipedia.org/wiki/Language_detection English Wikipedia on Language detection]
+* [http://www.let.rug.nl/~vannoord/TextCat/competitors.html TextCat competitors] – list compiled by Gertjan van Noord
