Bug #10682 :: Text_Wiki not multi-byte safe

Bug #10682	Text_Wiki not multi-byte safe
Submitted:	2007-04-11 00:44 UTC
From:	quinncom	Assigned:
Status:	Bogus	Package:	Text_Wiki (version 1.1.0)
PHP Version:	5.2.1	OS:	Linux 2.6
Roadmaps:	(Not assigned)

[2007-04-11 00:44 UTC] quinncom (Quinn Comendant)

Description:
------------
Much of Text_Wiki is not multi-byte safe but can be fixed by searching for string functions and replacing them with mb_ equivalents.

Also the render method of Text_Wiki has this at line 1037:

    $char = $this->source{$i};

Which isn't friendly to multi-byte characters, and needs to be changed to:

    $char = mb_substr($this->source, $i, 1);
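A minimal sketch of the difference between the two lines above, assuming the mbstring extension is loaded and the source text is UTF-8. The sample string is made up for illustration, and the encoding is passed explicitly so the result does not depend on mb_internal_encoding().

```php
<?php
// "naïve" in UTF-8; the 'ï' is two bytes (0xC3 0xAF).
$source = "na\xC3\xAFve";

// Byte indexing: offset 2 yields only the first byte of 'ï'.
$byte = $source[2];

// Character indexing: offset 2 yields the whole 'ï'.
$char = mb_substr($source, 2, 1, 'UTF-8');

echo strlen($byte), "\n";                // 1
echo strlen($char), "\n";                // 2
echo strlen($source), "\n";              // 6 bytes
echo mb_strlen($source, 'UTF-8'), "\n";  // 5 characters
```

If the loop bound is computed with strlen(), it presumably needs to become mb_strlen() as well, otherwise $i runs past the last character.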

Comments

[2007-04-11 01:51 UTC] quinncom (Quinn Comendant)

However, mb_substr used in this way is extremely slow. Rendering texts with more than 10K characters is a server killer.

I've done some research but can't find a faster way to loop through multi-byte characters. Does anybody know of one?

Perhaps the only solution is to update the render method to use a non-looping algorithm. Possible?
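One commonly used alternative, sketched here under the assumption that PCRE was built with UTF-8 support: split the string into characters once with preg_split() and the /u modifier, then index the resulting array inside the loop. The likely reason mb_substr() is slow in a loop is that each call has to scan from the start of the string to find character $i. The helper name split_utf8_chars is hypothetical and not part of Text_Wiki.

```php
<?php
// Hypothetical helper: split a UTF-8 string into an array of characters
// in a single pass. Does not need the mbstring extension.
function split_utf8_chars($text)
{
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split() returns false when $text is not valid UTF-8.
    return $chars === false ? array() : $chars;
}

$chars = split_utf8_chars("a\xC3\xA9\xE2\x82\xACz"); // "a", "é", "€", "z"
$count = count($chars);
for ($i = 0; $i < $count; $i++) {
    $char = $chars[$i];
    // ... handle $char the way the render loop would ...
}
```

Note that this only works if the whole source is valid UTF-8; any stray byte makes the split fail, so the caller needs a fallback for that case.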

[2007-04-11 02:18 UTC] quinncom (Quinn Comendant)

Aha. I totally didn't see the $this->renderingType == 'preg' option. That seems to work for multi-byte chars. How did I miss that? ;P

[2007-04-11 05:12 UTC] justinpatrin (Justin Patrin)

I added some code a little while ago to do the post-processing with preg but it had some issues and doesn't support the stack-based rendering method recently added.

The other problem with using mb_* is that it will restrict Text_Wiki to systems whose PHP has the mb_* functions. IIRC they are not enabled by default in many PHP4 installations, although I may be wrong.

Perhaps we could take the non-markup text and put it into an array, then pull it back out at render time to put it back in... then again, I don't know how we'd do that without mb_substr.

[2007-04-11 05:42 UTC] quinncom (Quinn Comendant)

Are you saying post-processing with preg is broken?

I have heard some hosting companies are not supporting mb_* functions. However, this is totally going to change as Unicode gains recognition. It's not a problem for us as we maintain our own servers. If this is a problem, it would be easy enough to offer a switch to enable mb_ functions, perhaps using variable functions like this:

    if ($use_mb_functions === true) {
        $strlen = 'mb_strlen';
        ...
    } else {
        $strlen = 'strlen';
        ...
    }

    $length = $strlen($mytxt);
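A variant of the switch above that could pick the mb_ functions automatically when they exist, which would cover hosts without the extension. This is only a sketch: the helper name wiki_string_functions and the returned map are hypothetical, not part of Text_Wiki.

```php
<?php
// Hypothetical helper: choose mb_* functions when available (or when
// forced), otherwise fall back to the byte-based equivalents.
function wiki_string_functions($use_mb_functions = null)
{
    if ($use_mb_functions === null) {
        // Auto-detect: true only if the mbstring extension is loaded.
        $use_mb_functions = function_exists('mb_strlen');
    }
    if ($use_mb_functions) {
        return array('strlen' => 'mb_strlen', 'substr' => 'mb_substr');
    }
    return array('strlen' => 'strlen', 'substr' => 'substr');
}

$fn = wiki_string_functions();
$strlen = $fn['strlen'];
// Counts characters with mb_strlen (subject to mb_internal_encoding()),
// bytes with strlen.
$length = $strlen("caf\xC3\xA9");
```

With the fallback the code still runs everywhere, but lengths and offsets are counted in bytes again, so it trades correctness for portability.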

I totally don't know what you mean by putting non-markup text into an array. For what purpose? At the moment I'm quite pleased with how the preg rendering is working for me.

[2007-04-11 14:33 UTC] justinpatrin (Justin Patrin)

Are you using the preg rendering option that was introduced 2 versions ago? As I mentioned it is flawed in a few ways and was disabled in the newest release. The parsers still use preg of course but the rendering needs to use the byte-by-byte system to render correctly in all cases and allow for the stack rendering scheme (which, admittedly, is not used much yet).

[2007-06-09 23:21 UTC] justinpatrin (Justin Patrin)

I am unsure how to properly fix this with the current

[2007-07-18 16:04 UTC] floele (Florian Schmitz)

There is a simple way to at least make Text_Wiki *usable* with multibyte chars, even if it's not entirely safe. The current main problem is the use of htmlentities, which needs to be replaced with htmlspecialchars (just two occurrences).

To further ensure UTF-8 compatibility, you could make use of the UTF-8 functions (inc/utf8.php) that ship with DokuWiki. These are able to work without the mb_ functions, but the mb_ functions will be used if available.

The problem with

    $char = $this->source{$i};

could probably be solved by first running the DokuWiki function $array = utf8_to_unicode($s) on the whole string (once), which creates an array of Unicode representations (integers) of each char, and then looping over the array and doing

    $char = unicode_to_utf8($array[$i]);

This should be able to run at acceptable speed. If not, there could be an option for whether or not to be UTF-8 safe. Then the user has to make use of appropriate caching methods when using UTF-8 support.
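The "decode once, then index an array" idea can be sketched without DokuWiki or mbstring. The function below is a hypothetical stand-in, not DokuWiki's code: it skips the conversion to code points and splits the string directly into character strings by looking at each lead byte. It assumes well-formed UTF-8 and does not validate continuation bytes.

```php
<?php
// Hypothetical stand-in (NOT DokuWiki's utf8.php): split a UTF-8 string
// into characters by reading the length from each lead byte.
function utf8_split_bytes($s)
{
    $chars = array();
    $len = strlen($s);
    $i = 0;
    while ($i < $len) {
        $b = ord($s[$i]);
        if ($b < 0x80) {
            $n = 1;                     // ASCII
        } elseif (($b & 0xE0) === 0xC0) {
            $n = 2;                     // 110xxxxx: two-byte sequence
        } elseif (($b & 0xF0) === 0xE0) {
            $n = 3;                     // 1110xxxx: three-byte sequence
        } elseif (($b & 0xF8) === 0xF0) {
            $n = 4;                     // 11110xxx: four-byte sequence
        } else {
            $n = 1;                     // stray or invalid byte: pass through
        }
        $chars[] = substr($s, $i, $n);
        $i += $n;
    }
    return $chars;
}

$array = utf8_split_bytes("a\xC3\xA9\xE2\x82\xACz"); // "a", "é", "€", "z"
$char = $array[2];                                    // "€"
```

Because invalid bytes are passed through one at a time, joining the array always reproduces the original string.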

[2007-07-24 18:05 UTC] floele (Florian Schmitz)

Oops, looks like I didn't check carefully enough. This can actually be configured with

    $wiki->setFormatConf('Xhtml', 'translate', HTML_SPECIALCHARS);

but it took me quite a while to find out how (splendid docs weren't any help). So sorry for my rather useless advice ;)
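For context, a small sketch of why the choice between the two translations matters for UTF-8 text. It assumes the bytes end up being interpreted as ISO-8859-1, which was the default charset of htmlentities() before PHP 5.4; the charset is passed explicitly here so the output is the same on any PHP version. How Text_Wiki itself calls these functions is not shown in this report.

```php
<?php
$text = "caf\xC3\xA9 <b>"; // UTF-8 for "café <b>"

// htmlentities() treats each byte as an ISO-8859-1 character, so the
// two bytes of 'é' become two unrelated entities.
$mangled = htmlentities($text, ENT_QUOTES, 'ISO-8859-1');

// htmlspecialchars() only escapes & < > and quotes; the bytes of 'é'
// pass through untouched.
$safe = htmlspecialchars($text, ENT_QUOTES, 'ISO-8859-1');

echo $mangled, "\n"; // caf&Atilde;&copy; &lt;b&gt;
echo $safe, "\n";    // café &lt;b&gt;
```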

[2011-03-20 00:35 UTC] till (Till Klampaeckel)

-Status:           Open
+Status:           Bogus
-Roadmap Versions: 2.0.0
+Roadmap Versions:
Per last comment, it's not an issue in Text_Wiki.