Proposal for "I18N_UnicodeNormalizer"

» Metadata » Status
  • Category: Internationalization
  • Proposer: Michel Corne 
  • License: BSD Style
» Description
Unicode Normalizer

"...Unicode's normalization is the concept of character composition and decomposition. Character composition is the process of combining simpler characters into fewer precomposed characters, such as the n character and the combining ~ character into the single ñ character. Decomposition is the opposite process, breaking precomposed characters back into their component pieces... Normalization is important when comparing text strings for searching and sorting (collation)..." [Wikipedia]

Performs the four Unicode normalization forms:
  • NFD: Canonical Decomposition
  • NFC: Canonical Decomposition, followed by Canonical Composition
  • NFKD: Compatibility Decomposition
  • NFKC: Compatibility Decomposition, followed by Canonical Composition
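
To make the four forms concrete, here is a minimal sketch of what each one does to two sample strings. It is an illustration only: it uses PHP's bundled intl `Normalizer` class as a stand-in rather than the proposed I18N_UnicodeNormalizer, and it assumes the intl extension is loaded. The variable names are made up for the example.

```php
<?php
// Illustration only: PHP's intl Normalizer class is used as a stand-in,
// NOT the proposed I18N_UnicodeNormalizer package.

// "ñ" written as two code points: "n" followed by U+0303 COMBINING TILDE.
$decomposed = "n\xCC\x83";
// "ñ" written as the single precomposed code point U+00F1.
$precomposed = "\xC3\xB1";
// U+FB01 LATIN SMALL LIGATURE FI, which has a compatibility mapping to "fi".
$ligature = "\xEF\xAC\x81";

// Canonical forms: compose or decompose without changing the meaning.
$nfc = Normalizer::normalize($decomposed, Normalizer::FORM_C);
$nfd = Normalizer::normalize($precomposed, Normalizer::FORM_D);

// Compatibility forms: additionally replace compatibility characters.
$nfkc = Normalizer::normalize($ligature, Normalizer::FORM_KC);
$nfkd = Normalizer::normalize($ligature, Normalizer::FORM_KD);

// The canonical forms leave the ligature untouched.
$nfcLigature = Normalizer::normalize($ligature, Normalizer::FORM_C);

var_dump($nfc === $precomposed);      // bool(true)
var_dump($nfd === $decomposed);       // bool(true)
var_dump($nfkc === 'fi');             // bool(true)
var_dump($nfkd === 'fi');             // bool(true)
var_dump($nfcLigature === $ligature); // bool(true)
```

The first two checks show why normalization matters for comparison: the two spellings of "ñ" differ byte for byte until both are brought to the same form.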

Complies with the official Unicode.org regression test.
Fully tested with PHPUnit. Code coverage is close to 100%.
Optimized to run up to 9 times faster than other implementations.
Works natively on UTF-8 binary strings, but can also normalize strings in any other UTF format or in other encodings such as ISO-8859-1 (see Example 2).
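
One plausible way to support a non-UTF-8 input is to transcode it to UTF-8, normalize, and transcode the result back. The sketch below shows that idea under stated assumptions: `normalizeWithEncoding()` is a hypothetical helper, not part of I18N_UnicodeNormalizer, and nothing here describes how the package handles encodings internally. It relies on the mbstring and intl extensions.

```php
<?php
// Hypothetical helper for illustration; NOT the package's actual API.
// Assumes the mbstring and intl extensions are loaded.
function normalizeWithEncoding(string $text, int $form, string $encoding = 'UTF-8'): string
{
    // Step 1: bring the input into UTF-8 if it is not already.
    $utf8 = ($encoding === 'UTF-8')
        ? $text
        : mb_convert_encoding($text, 'UTF-8', $encoding);

    // Step 2: normalize the UTF-8 string (intl Normalizer used as a stand-in).
    $normalized = Normalizer::normalize($utf8, $form);
    if ($normalized === false) {
        throw new InvalidArgumentException('Input could not be normalized.');
    }

    // Step 3: return the result in the caller's original encoding.
    // This is a design choice of the sketch, not documented package behavior.
    return ($encoding === 'UTF-8')
        ? $normalized
        : mb_convert_encoding($normalized, $encoding, 'UTF-8');
}

// "n" + U+0303 COMBINING TILDE in UTF-16BE composes to U+00F1 under NFC.
$utf16 = "\x00n\x03\x03";
$result = normalizeWithEncoding($utf16, Normalizer::FORM_C, 'UTF-16BE');
var_dump($result === "\x00\xF1"); // bool(true)
```

Whether the result should come back in the source encoding or stay in UTF-8 is a choice left to the caller; the sketch converts back so the input and output encodings match.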

Example 1: NFC normalization of the UTF-8 string 'foo'
$normalized = I18N_UnicodeNormalizer::toNFC('foo');
or
$normalizer = new I18N_UnicodeNormalizer();
$normalized = $normalizer->normalize('foo', 'NFC');

Example 2: NFC normalization of the ISO-8859-1 string 'foo'
$normalized = I18N_UnicodeNormalizer::toNFC('foo', 'ISO-8859-1');
or
$normalizer = new I18N_UnicodeNormalizer();
$normalized = $normalizer->normalize('foo', 'NFC', 'ISO-8859-1');
» Dependencies » Links
» Timeline » Changelog
  • First Draft: 2007-06-11
  • Proposal: 2007-06-11
  • Call for Votes: 2007-06-20