utfnormal
utfnormal is a library that contains Unicode normalization routines. It includes pure PHP implementations, and automatically uses the php-intl extension if installed.
The main function to care about is UtfNormal\Validator::cleanUp(). This will strip illegal UTF-8 sequences and characters that are illegal in XML, and if necessary convert to normalization form C (NFC). See also "Unicode equivalence" on Wikipedia.
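A minimal sketch of calling cleanUp() on untrusted input. It assumes the package was installed with Composer and that vendor/autoload.php sits next to the script; the path and the sample string are illustrative assumptions, not taken from the library's documentation.

```php
<?php
// Assumption: the library was installed via Composer and the autoloader
// lives at vendor/autoload.php relative to this script.
require_once __DIR__ . '/vendor/autoload.php';

use UtfNormal\Validator;

// Illustrative input: "Cafe" + U+0301 COMBINING ACUTE ACCENT (a decomposed
// "é"), followed by a lone 0xFF byte, which can never occur in valid UTF-8.
$raw = "Cafe\u{0301}\xFF";

$clean = Validator::cleanUp( $raw );

// preg_match() with the /u flag fails on malformed UTF-8, so it doubles as
// a cheap validity check on the cleaned string.
$isValidUtf8 = preg_match( '//u', $clean ) === 1;
var_dump( $isValidUtf8 );
```

Because cleanUp() validates as well as normalizes, it is the safer choice for any string whose origin is not fully trusted, such as form input or imported files.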
If you know the string is already valid UTF-8, you can directly call UtfNormal\Validator::toNFC(), UtfNormal\Validator::toNFK(), or UtfNormal\Validator::toNFKC().
These functions convert a given UTF-8 string to Normalization Form C, K, or KC if it is not already in that form. They assume that the input string is already valid UTF-8; if it contains corrupt characters, the result may be erroneous.
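As a rough illustration of normalizing a string that is already known to be valid UTF-8, the sketch below composes a decomposed "é" with toNFC(). The autoloader path and the sample string are assumptions made for the example.

```php
<?php
// Assumption: Composer autoloader at vendor/autoload.php next to this script.
require_once __DIR__ . '/vendor/autoload.php';

use UtfNormal\Validator;

// Valid UTF-8 in decomposed form: "e" followed by U+0301 COMBINING ACUTE ACCENT.
$decomposed = "e\u{0301}";

// NFC composes the pair into the single code point U+00E9.
$composed = Validator::toNFC( $decomposed );

var_dump( $composed === "\u{00E9}" );  // bool(true)
var_dump( strlen( $decomposed ) );     // int(3): 1 byte for "e" + 2 for U+0301
var_dump( strlen( $composed ) );       // int(2): U+00E9 encodes as 2 bytes
```

The two strings render identically but differ byte for byte, which is why normalizing before comparing or storing text matters.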
Performance is kind of stinky in absolute terms, though it should be speedy on pure ASCII text. ;) On text that can be determined quickly to already be in NFC it's not too awful but it can quickly get uncomfortably slow, particularly for Korean text (the Hangul decomposition/composition code is extra slow).
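One way to check how this plays out on your own data is a quick timing comparison. The sketch below is only a rough measurement under stated assumptions: timeCleanUp() is a hypothetical helper written for this example (it is not part of the library), the sample texts and sizes are arbitrary, and if the php-intl extension is loaded the numbers reflect the native code path, not the pure PHP one.

```php
<?php
// Assumption: Composer autoloader at vendor/autoload.php next to this script.
require_once __DIR__ . '/vendor/autoload.php';

use UtfNormal\Validator;

/**
 * Hypothetical helper (not part of the library): average wall-clock
 * seconds per cleanUp() call over $rounds runs.
 */
function timeCleanUp( string $text, int $rounds = 20 ): float {
	$start = microtime( true );
	for ( $i = 0; $i < $rounds; $i++ ) {
		Validator::cleanUp( $text );
	}
	return ( microtime( true ) - $start ) / $rounds;
}

$samples = [
	'ascii'  => str_repeat( 'The quick brown fox. ', 5000 ),
	// U+D55C U+AE00 are precomposed Hangul syllables.
	'hangul' => str_repeat( "\u{D55C}\u{AE00} ", 5000 ),
];

foreach ( $samples as $label => $text ) {
	printf( "%-6s %.6f s per call\n", $label, timeCleanUp( $text ) );
}
```

Absolute figures depend on the PHP version and hardware, so only the relative difference between the samples is meaningful.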
Bugs should be filed in Wikimedia's Phabricator under the "utfnormal" project.
To use it in your project, run composer require wikimedia/utfnormal.
This library was first introduced in MediaWiki 1.3 (rev:4965). It was split out of the MediaWiki codebase and published as an independent library during the MediaWiki 1.25 development cycle.
External links
- Source code (Phabricator mirror, GitHub mirror)
- Composer package
- API Documentation
- Test coverage report
- Issue tracker