Proposal for "LanguageDetect"

» Description
Text_LanguageDetect can identify the human language in which a sample of text is written. 52 languages are currently supported.

This package works by calculating the relative frequency of every sequence of three contiguous characters (trigram) in the sample text. For instance, frequent trigrams in English include ' I ' and 'the', while in German 'ich' and 'die' occur frequently. A simple multibyte character iterator is included for multibyte string support. The set of frequencies is converted into ranks, and a crude measure of correlation is calculated against the known frequencies for each language in the supported set. The results are then sorted, and the best-scoring language is presumed to be the language of the sample string.
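The steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the package's PHP code. The description does not name the exact correlation measure, so the sketch assumes a simple sum of rank differences (an "out-of-place" distance) with a fixed penalty for trigrams the model lacks. The function names, the `limit` and `penalty` values, and the one-sentence toy models are all hypothetical; the real package ships a prebuilt database of language models. Python strings are already Unicode, so no separate multibyte iterator is needed here.

```python
from collections import Counter


def trigram_ranks(text, limit=300):
    """Map each trigram in `text` to its frequency rank (0 = most frequent)."""
    # Normalize whitespace and pad with spaces so that word boundaries
    # produce trigrams such as ' i ' or ' th'.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda t: (-counts[t], t))[:limit]
    return {trigram: rank for rank, trigram in enumerate(ordered)}


def rank_distance(sample_ranks, model_ranks, penalty=300):
    """Sum of rank differences; a trigram missing from the model costs `penalty`.

    Assumed measure: the source only says "a crude measure of correlation".
    """
    return sum(
        abs(rank - model_ranks[t]) if t in model_ranks else penalty
        for t, rank in sample_ranks.items()
    )


def detect(sample, models):
    """Return (language, distance) pairs, smallest distance (best match) first."""
    sample_ranks = trigram_ranks(sample)
    scores = [
        (lang, rank_distance(sample_ranks, ranks))
        for lang, ranks in models.items()
    ]
    return sorted(scores, key=lambda pair: pair[1])


# Toy models built from one sentence each (hypothetical; far too small for real use).
MODELS = {
    "english": trigram_ranks("the quick brown fox jumps over the lazy dog and the cat"),
    "german": trigram_ranks("ich habe die katze und den hund in der stadt gesehen"),
}

if __name__ == "__main__":
    print(detect("the dog and the fox", MODELS))
```

With models this small the distances are only meaningful relative to each other; the first entry of the returned list is the presumed language.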

The database of language models was converted from the one provided with Perl's Language::Guess, which is available under the GPL. Update: the database is not subject to copyright.

Limitations of this release include a lack of support for Korean and Japanese, lack of intelligent character set detection, and lack of support for developing and using alternative language models.

Otherwise, this package is fully functional. Check out the demo! Comments appreciated.
» Timeline
  • First Draft: 2005-11-30
  • Proposal: 2005-12-04
  • Call for Votes: 2005-12-19
» Changelog
  • Nicholas Pisarro
    [2005-12-05 13:49 UTC]

    Improved error reporting
  • Nicholas Pisarro
    [2005-12-07 19:31 UTC]

    Updated license information (database is not subject to copyright)
    Many minor style changes
  • Nicholas Pisarro
    [2005-12-13 17:26 UTC]

    Reorganized files, updated package file. Data file now installs in the correct place.
  • Nicholas Pisarro
    [2005-12-15 04:10 UTC]

    Style changes
  • Nicholas Pisarro
    [2005-12-18 23:31 UTC]

    Fixed encoding problem in web example