Language Identification Tools
A listing of language identification tools. Language identification can mean identifying either the text type (e.g. news vs. literature) or the language (e.g. English vs. Frisian vs. Dutch).
Most of these tools require training on a large corpus (see List of resources by language for corpora per language), but many ship with some prebuilt language models.
- LibTextCat http://software.wise-guys.nl/libtextcat/ C library (BSD license)
  - Interfaces to the C library libtextcat:
    - http://odur.let.rug.nl/~vannoord/TextCat/ – original Perl TextCat implementation
    - http://opus.lingfil.uu.se/tools/public/language_guesser/textcat – Perl version with more language models and encoding fixes
    - http://olivo.net/software/lc4j/ – a Java reimplementation
    - http://thomas.mangin.com//content/texcat-in-python.html – a Python implementation by Thomas Mangin
    - http://www.mnogosearch.org/guesser/ – another C reimplementation
- Nutch Language Identifier https://wiki.apache.org/nutch/LanguageIdentifier Java (Apache 2.0 license)
  - Does not seem to include a way to add new languages; the existing models were presumably generated by Google.
- LID http://www.cavar.me/damir/LID/ Python and Scheme (GPL3)
- Google Language Identification API
- Lingua-Systems lid http://www.lingua-systems.com/language-identifier/
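
To give a rough idea of what "training on a corpus" means for the TextCat family of tools listed above, here is a minimal Python sketch of character n-gram rank profiles compared with an out-of-place distance, the general approach TextCat is based on. It is not the code of any tool in this list. The function names (`ngram_profile`, `out_of_place`, `identify`), the parameters `max_n` and `top_k`, and the two toy training sentences are assumptions made up for illustration.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Return the top_k character n-grams of text, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries with underscores
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the model get a fixed penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    total = 0
    for r, gram in enumerate(doc_profile):
        total += abs(r - rank[gram]) if gram in rank else max_penalty
    return total


def identify(text, models):
    """Return the key of the language model whose profile is closest to text."""
    doc_profile = ngram_profile(text)
    return min(models, key=lambda lang: out_of_place(doc_profile, models[lang]))


# Hypothetical toy "corpora"; real models are trained on far more text.
TRAINING = {
    "english": "the quick brown fox jumps over the lazy dog and the cat "
               "sat on the mat with the other animals",
    "dutch": "de snelle bruine vos springt over de luie hond en de kat "
             "zat op de mat met de andere dieren",
}
MODELS = {lang: ngram_profile(text) for lang, text in TRAINING.items()}
```

Calling `identify(some_text, MODELS)` returns the key of the closest model. With training texts this small the guess is unreliable for unseen input, which is why the tools above either need a large corpus or ship prebuilt models.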