
Bug #13385 If mb_detect_encoding() fails conversion will fail to an error
Submitted: 2008-03-13 07:40 UTC Modified: 2008-07-09 21:43 UTC
From: dreiska Assigned: taak
Status: Closed Package: Text_LanguageDetect (version CVS)
PHP Version: 5.1.6 OS: RHEL
Roadmaps: (Not assigned)    


 [2008-03-13 07:40 UTC] dreiska (Tomi Heiskanen)
Description:
------------
The LanguageDetect class does not handle the case where the string encoding cannot be detected. It still attempts the conversion, which results in a PHP warning.

Test script:
---------------
$str = 'I¨tvanas I buvo karunuotas sostineje E¨tergome 1000 m.';
$obj_detect = new Text_LanguageDetect();
$obj_detect->detect( $str );

Expected result:
----------------
No errors.

Actual result:
--------------
Warning: mb_convert_encoding() [function.mb-convert-encoding]: Illegal character encoding specified in /usr/share/pear/Text/LanguageDetect.php on line 699
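
For illustration, the kind of guard this report asks for might look like the sketch below. It is not the actual Text_LanguageDetect code: the function name to_utf8_or_null() and the candidate list of ASCII and UTF-8 are assumptions made for the example, and it needs the mbstring extension and PHP 7.1 or later.

```php
<?php
// Hypothetical helper, NOT the code in Text/LanguageDetect.php.
// Converts $text to UTF-8 only when its encoding could be detected.
function to_utf8_or_null(string $text, array $candidates = ['ASCII', 'UTF-8']): ?string
{
    // Strict mode: the result is false unless the bytes are valid
    // in one of the candidate encodings.
    $encoding = mb_detect_encoding($text, $candidates, true);

    if ($encoding === false) {
        // Detection failed, so do not hand a bogus encoding name
        // to mb_convert_encoding().
        return null;
    }

    return mb_convert_encoding($text, 'UTF-8', $encoding);
}
```

A caller can then decide what to do with a null result, for example skip the string or fall back to treating it as raw bytes.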

Comments

 [2008-05-01 20:51 UTC] taak (Nicholas Pisarro)
How is your test string actually encoded? What is the output of

var_dump(mb_detect_encoding($str));

and

echo preg_replace('/[\x7f-\xff]/e', '"\x".dechex(ord(\\0))', $str);

?
 [2008-05-27 02:37 UTC] dreiska (Tomi Heiskanen)
I'm running PHP 5.2.4 now, so that might affect things. However, there is clearly a bug in the detect() function: if the string encoding is not recognized, conversion of the string should not be attempted.

Output:

string 'UTF-8' (length=5)
Notice: Use of undefined constant � - assumed '�' in (6) : regexp code on line 1
Notice: Use of undefined constant � - assumed '�' in (6) : regexp code on line 1
Notice: Use of undefined constant � - assumed '�' in (6) : regexp code on line 1
Notice: Use of undefined constant � - assumed '�' in (6) : regexp code on line 1
I\xc5\xa1tvanas I buvo karunuotas sostineje E\xc5\xa1tergome 1000 m.
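
The notices in the output above most likely come from the /e replacement, which splices each matched byte into the evaluated code unquoted, so PHP reads it as a bare constant. The /e modifier was also removed in PHP 7.0. A sketch of the same hex dump using preg_replace_callback() follows; the function name hex_escape_high_bytes() is an assumption made for the example.

```php
<?php
// Hypothetical helper: show every byte in the range 0x7f-0xff
// as a literal \xNN escape, leaving all other bytes untouched.
function hex_escape_high_bytes(string $text): string
{
    return preg_replace_callback(
        '/[\x7f-\xff]/',              // byte-wise match, no /u modifier
        function (array $m): string {
            // $m[0] is the single matched byte.
            return '\\x' . dechex(ord($m[0]));
        },
        $text
    );
}

echo hex_escape_high_bytes("I\xc5\xa1tvanas"), "\n"; // prints I\xc5\xa1tvanas
```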
 [2008-06-30 20:51 UTC] taak (Nicholas Pisarro)
I was not able to reproduce this from the description initially. However, based on the ¨ char in the string, I tried converting it to ISO-8859-2, after which I was able to reproduce the error. I can't see how the error could occur if the encoding really was UTF-8, as reported by the var_dump(mb_detect_encoding($str)) test and as the \xc5\xa1 chars, which are valid UTF-8, suggest. Do you think it's possible that the string was converted to UTF-8 by some copying and pasting at some point? Regardless, I've fixed the problem on the assumption that the string is not UTF-8, and will commit the change to CVS shortly.
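
The reproduction described above can be sketched roughly as follows. The shortened sample word and the ASCII/UTF-8 candidate list are assumptions made for the example; the thread does not say which detect order was in effect on either machine. Strict detection is used so that invalid input yields false rather than a best guess.

```php
<?php
// "Ištvanas" written out as UTF-8 bytes (š is \xc5\xa1).
$utf8 = "I\xc5\xa1tvanas";

// Re-encode to ISO-8859-2, where š becomes the single byte 0xb9.
$latin2 = mb_convert_encoding($utf8, 'ISO-8859-2', 'UTF-8');

// Assumed candidate list for detection.
$candidates = ['ASCII', 'UTF-8'];

var_dump(mb_detect_encoding($utf8, $candidates, true));   // string(5) "UTF-8"
var_dump(mb_detect_encoding($latin2, $candidates, true)); // bool(false)
```

The second result is the case the original report describes: detection fails, and any conversion attempted with that result has no valid source encoding to work from.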
 [2008-07-09 21:43 UTC] taak (Nicholas Pisarro)
This bug has been fixed in CVS. If this was a documentation problem, the fix will appear on pear.php.net by the end of next Sunday (CET). If this was a problem with the pear.php.net website, the change should be live shortly. Otherwise, the fix will appear in the package's next release. Thank you for the report and for helping us make PEAR better.

In CVS. Let me know if you don't think I got this right, or if you have any other issues.