| » Description | |
|
Text_LanguageDetect can identify the human language in which a sample of text is written. 52 languages are currently supported. The package works by calculating the relative frequency of every sequence of three contiguous characters (trigram) in the sample text. For instance, ' I ' and 'the' are frequent trigrams in English, while 'ich' and 'die' occur frequently in German. A simple multibyte character iterator is included for multibyte string support. The set of frequencies is converted into ranks, and a crude measure of correlation is calculated against the known frequencies for each language in the known language set. The results are then sorted, and the best scorer is presumed to be the language of the sample string. The database of language models was converted from the one provided with Perl's Language::Guess. Limitations of this release include a lack of support for Korean and Japanese, a lack of intelligent character set detection, and a lack of support for developing and using alternative language models. Otherwise, this package is fully functional. Check out the demo! Comments appreciated.
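The ranking approach described above can be illustrated with a minimal sketch. It is written in Python rather than PHP and is not the package's code or API: the function names, the `limit` parameter, the toy training sentences, and the choice of a summed rank difference as the "crude measure of correlation" are all assumptions made for illustration.

```python
from collections import Counter

def trigram_ranks(text, limit=300):
    """Map each trigram in `text` to its frequency rank (0 = most frequent)."""
    # Normalise whitespace and pad with spaces so word boundaries form trigrams.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda t: (-counts[t], t))[:limit]
    return {trigram: rank for rank, trigram in enumerate(ordered)}

def rank_distance(sample, model, limit=300):
    """Sum of rank differences; a trigram missing from the model costs `limit`."""
    return sum(
        abs(rank - model[trigram]) if trigram in model else limit
        for trigram, rank in sample.items()
    )

def detect(text, models, limit=300):
    """Return language names sorted from best match to worst (lowest distance first)."""
    sample = trigram_ranks(text, limit)
    scores = {lang: rank_distance(sample, model, limit) for lang, model in models.items()}
    return sorted(scores, key=scores.get)

# Hypothetical toy models built from one sentence each; a real model would be
# derived from a large corpus per language.
MODELS = {
    "english": trigram_ranks("the cat and the dog and the bird are in the house with the other animals"),
    "german": trigram_ranks("ich und die katze und der hund sind in dem haus mit den anderen tieren"),
}

if __name__ == "__main__":
    print(detect("the bird and the cat", MODELS))
    print(detect("der hund und die katze", MODELS))
```

Because Python strings are sequences of Unicode code points, the sketch needs no separate multibyte iterator. A language whose byte-oriented strings split multibyte characters would need one, which is the role of the iterator mentioned in the description.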