Difference between revisions of "Language Identification Tools"

From ACL Wiki
(wops)
 
(3 intermediate revisions by the same user not shown)
Line 14:

** http://thomas.mangin.com//content/texcat-in-python.html – a Python implementation by Thomas Mangin
** http://www.mnogosearch.org/guesser/ – another C reimplementation
+
+ * Languid/GuessLanguage, trigram based
+ ** http://languid.cantbedone.org/ (dead link) – original Perl version by Maciej Ceglowski
+ ** http://websvn.kde.org/branches/work/sonnet-refactoring/common/nlp/guesslanguage.cpp?view=markup – C++ version by Jacob R Rideout for KDE
+ ** https://bitbucket.org/spirit/guess_language – Python 3 version by Phi-Long Do, supports Python 2 via lib3to2
+

* Nutch Language Identifier https://wiki.apache.org/nutch/LanguageIdentifier Java (Apache 2.0 license)

Latest revision as of 08:41, 19 December 2012

A listing of language identification tools. Language identification can mean both identifying the text type (e.g. news vs. literature) and identifying the language (e.g. English vs. Frisian vs. Dutch).

Most of these tools require training on a large corpus (see List of resources by language for corpora per language), but many come with some prebuilt language models.
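To make "training" and "language model" concrete, here is a minimal sketch of a trigram-based identifier in Python. It is not the code of any tool listed on this page. It assumes a model is simply a ranked list of the most frequent character trigrams in a training text, and that texts are compared by the rank-order ("out-of-place") distance described by Cavnar and Trenkle. The function names `trigram_profile`, `out_of_place`, `train` and `identify`, and the `top_n=300` cutoff, are illustrative choices only.

```python
from collections import Counter


def trigram_profile(text, top_n=300):
    """Return the top_n most frequent character trigrams, most frequent first."""
    # Lowercase, collapse whitespace, and pad so word boundaries form trigrams too.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    return [gram for gram, _ in counts.most_common(top_n)]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; trigrams missing from the model get a fixed penalty."""
    ranks = {gram: rank for rank, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - ranks[gram]) if gram in ranks else max_penalty
        for rank, gram in enumerate(doc_profile)
    )


def train(corpora, top_n=300):
    """Build one trigram profile per language from a {language: text} mapping."""
    return {lang: trigram_profile(text, top_n) for lang, text in corpora.items()}


def identify(text, models, top_n=300):
    """Return the language whose profile is closest to the text's profile."""
    doc = trigram_profile(text, top_n)
    return min(models, key=lambda lang: out_of_place(doc, models[lang]))
```

With models built by `train({"en": english_text, "nl": dutch_text})`, a call such as `identify(sample, models)` returns the key of the closest profile. Very short training texts give unreliable profiles, which is why the tools above need a large corpus or ship prebuilt models.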

Free Software




  • Compact Language Detector for JavaScript https://github.com/jaukia/cld-js (3-clause license)
    • doesn't seem to include a method to add new languages; the existing ones were presumably generated by Google


Proprietary

See also