
Bug #52 XML_RPC processes data as UTF-8 regardless of actual encoding
Submitted: 2003-10-01 09:16 UTC
From: daniel at ilse dot net Assigned: danielc
Status: Closed Package: XML_RPC
PHP Version: Irrelevant OS:
Roadmaps: (Not assigned)    


 [2003-10-01 09:16 UTC] daniel at ilse dot net
Description:
------------
The PEAR XMLRPC package handles XML data as a UTF-8 stream, even if it is encoded otherwise. In my case data was being sent with encoding='ISO-8859-1' and contained valid 8-bit characters; the result was an XMLRPC error 'invalid_return'.

Fix: get the encoding from the actual XML data, something like this:

(patch against $Id: RPC.php,v 1.11 2002/10/02 21:10:19 ssb Exp $)

--- /usr/local/lib/php/XML/RPC.php.orig Tue Sep 30 15:17:25 2003
+++ /usr/local/lib/php/XML/RPC.php Tue Sep 30 15:20:31 2003
@@ -616,12 +616,46 @@
         return $this->parseResponse($ipd);
     }
+    function getEncoding($str)
+    {
+        $rc=$GLOBALS['XML_RPC_defencoding'];
+
+        $lines=preg_split('/(\r\n|\r|\n)/',$str);
+
+        foreach($lines as $line)
+        {
+            if(preg_match('/<\?xml[^>]*\s*encoding\s*=\s*[\'"]([^"\']*)[\'"]/i',$line,$match))
+            {
+                $rc=trim($match[1]);
+
+                break;
+            }
+        }
+
+        return $rc;
+    }
+
     function parseResponse($data="")
     {
         global $XML_RPC_xh,$XML_RPC_err,$XML_RPC_str;
         global $XML_RPC_defencoding;
-        $parser = xml_parser_create($XML_RPC_defencoding);
+        if($data!="")
+        {
+            $encoding=$this->getEncoding($data);
+        }
+        else
+        {
+            $encoding=$XML_RPC_defencoding;
+        }
+
+        $parser = xml_parser_create($encoding);
         $XML_RPC_xh[$parser]=array();

Comments

 [2003-10-11 14:35 UTC] mroch
This looks pretty good to me, except I'm wondering if there's a better way to get the first line (the XML declaration "<?xml...>" must be on the first line). Also, how can this be tested?
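
One possible answer to both questions, as a rough sketch only: because the XML declaration has to come first in the document, the pattern can be anchored at the start of the string instead of splitting the payload into lines, and a standalone function is easy to exercise with a few sample strings. The function name `detectDeclaredEncoding` and its `$default` parameter are hypothetical and are not part of XML_RPC.

```php
<?php
/**
 * Hypothetical helper (not part of XML_RPC): return the encoding named in
 * the XML declaration, or $default when the document has none.
 */
function detectDeclaredEncoding($xml, $default = 'UTF-8')
{
    // \A anchors the match at the very start of the string, and [^>]*?
    // cannot run past the end of the declaration, so text such as
    // encoding="..." inside the document body is never picked up.
    $pattern = '/\A<\?xml\s[^>]*?\bencoding\s*=\s*(["\'])([A-Za-z][A-Za-z0-9._-]*)\1/';

    if (preg_match($pattern, $xml, $match)) {
        return $match[2];
    }

    return $default;
}
```

Calling it with a payload that starts with `<?xml version="1.0" encoding="ISO-8859-1"?>` returns the declared name; a payload without a declaration returns the default.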
 [2004-03-12 11:09 UTC] pajoye
Why not only overwrite the default encoding?

require_once 'XML/RPC.php';
$GLOBALS['XML_RPC_defencoding'] = "ISO-8859-1";

I set it to "won't fix" for now. If you have some arguments to add this function, please open it again.

pierre
 [2004-03-12 14:42 UTC] daniel at ilse dot net
|Why not only overwrite the default encoding?

Because I don't know (and shouldn't care) what the encoding will be... For example, I get my XML messages from a third party; if they change their output encoding, I don't want to have to change my code. The XMLRPC package (in combination with mbstring or iconv) should be able to handle this gracefully. I'll look further into this myself and will probably supply a more useful patch in the not so near future.
 [2004-05-02 00:33 UTC] pear dot php dot net at chsc dot dk
I believe this is a genuine bug that should be fixed. The XML-RPC standard does not specify that the XML must be in UTF-8, so the code should not assume it is. Different clients may choose to send data in different encodings, so hardcoding the default encoding will not work.

The fix mentioned above is not correct, though. According to RFC 3023 section 3.1, the encoding specified in the <?xml encoding=... ?> tag should be ignored for XML received over HTTP in favor of the encoding specified in the Content-Type header (e.g. "Content-Type: text/xml; charset=iso-8859-1").

I have made a patch that examines the Content-Type header and uses the charset specified there. If no charset is specified, RFC 3023 states that the data should be treated as US-ASCII, even if a different encoding is specified using <?xml encoding=...?>. However, since some clients do not properly specify the encoding, the patch checks the <?xml?> tag for an encoding attribute and uses the encoding specified there. If no encoding is specified there either, it defaults to UTF-8. This will not break standards-compliant clients, since all encodings supported by the XML parser (UTF-8 and ISO-8859-1) are supersets of US-ASCII. I chose UTF-8 as the default so that XML_RPC_Server would remain compatible with earlier versions.

The patch also adds the possibility of specifying the charset used in PHP strings that are passed as arguments to and returned from methods in XML_RPC_Server. Currently all strings are assumed to be UTF-8. The encoding is specified as the optional third argument of the constructor. If it is omitted, the code defaults to UTF-8 for compatibility with earlier versions of XML_RPC_Server.

The patch can be downloaded from http://serber.chsc.dk/misc/XML_RPC_Server.patch
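
The lookup order described in this comment (Content-Type charset first, then the XML declaration, then UTF-8) could look roughly like the sketch below. This is not the code from the linked patch; the function name `resolveRequestEncoding` and its parameters are hypothetical, and the caller is assumed to pass in the raw Content-Type header value (for example from `$_SERVER['CONTENT_TYPE']`, where the SAPI provides it).

```php
<?php
/**
 * Hypothetical helper (not the code from the patch): choose the encoding of
 * an incoming request from the Content-Type header, then from the XML
 * declaration, then fall back to $default.
 */
function resolveRequestEncoding($contentType, $xml, $default = 'UTF-8')
{
    // 1. The charset parameter of the Content-Type header, quoted or bare.
    if (preg_match('/;\s*charset\s*=\s*"?([A-Za-z0-9._:-]+)"?/i', (string) $contentType, $m)) {
        return strtoupper($m[1]);
    }

    // 2. Fallback for clients that omit the charset: the XML declaration,
    //    which must sit at the very start of the payload.
    if (preg_match('/\A<\?xml\s[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']/', $xml, $m)) {
        return strtoupper($m[1]);
    }

    // 3. Nothing declared anywhere.
    return $default;
}
```

The result is upper-cased only to make it easier to compare against the short list of encodings the XML parser accepts before handing it to xml_parser_create().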
 [2004-05-09 21:06 UTC] pear dot php dot net at chsc dot dk
I suggest reopening this bug based on the new arguments and the availability of a patch. I do not have sufficient permissions to reopen it myself.
 [2004-05-31 19:38 UTC] daniel at ilse dot net
I'd love to reopen the bug, but this form only lets me select either the current status (won't fix) or close. I suggest that someone reports the bug again; I'll stick to my patch.
 [2004-06-08 12:46 UTC] neufeind
Due to several requests I herewith reopen the bug again. Please add comments / patches / suggestions! Mail from Sebastien Person (sperson at easter-eggs.com) to pear-dev today: I can confirm that the issue is still true. I have the same problem as described and I need to modify pear lib in order to have a working XML_RPC based webservice.
 [2004-12-16 23:35 UTC] techtonik at tut dot by
I don't think this would be useful, taking into account that the only encodings supported by xml_parser_create() are ISO-8859-1, UTF-8 and US-ASCII. What will it return with a "Content-Type: text/xml; charset=Windows-1251" header? I prefer to have an automatic conversion from the source encoding to UTF-8 done by Client methods, and leave the Server side to operate only with UTF-8.
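
For charsets outside that short list, one option is to re-encode the payload as UTF-8 before it reaches the parser. The following is a minimal sketch, assuming the iconv extension is available and that the source charset is already known; the function name `xmlToUtf8` is hypothetical and nothing like it is confirmed to exist in XML_RPC. It also rewrites the declared encoding so that the declaration agrees with the converted bytes.

```php
<?php
/**
 * Hypothetical helper: re-encode an XML payload as UTF-8 and update its
 * declaration to match. Returns false if the conversion fails.
 */
function xmlToUtf8($xml, $sourceEncoding)
{
    if (strcasecmp($sourceEncoding, 'UTF-8') !== 0) {
        // iconv() returns false for an unknown charset or an invalid
        // byte sequence in the input.
        $converted = iconv($sourceEncoding, 'UTF-8', $xml);
        if ($converted === false) {
            return false;
        }
        $xml = $converted;
    }

    // Replace the encoding name in the declaration, if there is one,
    // keeping everything up to and including the opening quote.
    return preg_replace(
        '/\A(<\?xml\s[^>]*?\bencoding\s*=\s*["\'])[^"\']*/',
        '${1}UTF-8',
        $xml,
        1
    );
}
```

With this approach the parser can always be created for UTF-8, at the cost of depending on iconv (or mbstring) for the charsets it knows about.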
 [2004-12-19 21:39 UTC] pear dot php dot net at chsc dot dk
>I prefer to have an automatic conversion from source
>encoding to UTF-8 done by Client methods and leave Server
>side operate only with UTF-8.

This is only possible when you control both the client and the server. When providing a public XML-RPC service you have no control over which character sets the clients use. Some client programs and XML-RPC libraries may not even allow the user to specify which character set is used when generating the XML (the actual character set that is sent over the wire should be transparent to both the client and the server applications).

The parser may not support every character set in the world, but supporting both UTF-8 and ISO-8859-1 is much better than supporting only UTF-8, especially since this can be accomplished very easily (as shown in the patch).
 [2004-12-30 21:36 UTC] danielc
Fixed in CVS.