
Bug #30 Mail_Mime: _encodeHeaders is not RFC-2047 compliant.
Submitted: 2003-09-23 09:13 UTC
From: ed at avi dot ru Assigned: cipri
Status: Closed Package: Mail_Mime
PHP Version: Irrelevant OS: All
Roadmaps: 1.4.0, 1.4.0a1    


 [2003-09-23 09:13 UTC] ed at avi dot ru
Description:
------------
Hello!

I tried to use the PEAR module Mail_mime to compose a message with a Cyrillic subject and noticed that the whitespace between Cyrillic words is missing in the resulting message (Cyrillic letters are in the 0x80-0xFF range). Investigation showed that the Mail_mime::_encodeHeaders() method works incorrectly.

Suppose we have the following subject (or any other header, it doesn't matter):

AAA BBB BBB AAA BBB

where A is a Latin character and B is a Cyrillic character. The current regular expression detects three BBB patterns and encodes them separately:

AAA =?charset?Q?=xx=xx=xx?= =?charset?Q?=xx=xx=xx?= AAA =?charset?Q?=xx=xx=xx?=

And what do we see? The first two BBB patterns produce two sequential RFC-2047 'encoded-word' patterns, which a) is incorrect, because two sequential 'encoded-word' patterns must be separated by CRLF SPACE, not by SPACE only; b) is understood by mailers as CRLF SPACE and thus translated as an empty string. Oops...

The solution is to change the regular expression so that it finds not single 'words-with-[0x80-0xFF]' patterns but space-separated sequences of such words. It is also better to include the '-' symbol alongside \w, so that words containing '-' are treated as whole words. The regex is:

/([\w\-]*[\x80-\xFF]+[\w\-]*(\s+[\w\-]*[\x80-\xFF]+[\w\-]*)*)\s*/

The replacing regex also needs improvement: as per RFC-2047, an 'encoded-word' must not contain spaces or tabs if it is to be treated as a single atom; it also must not contain underscores, equals signs or question marks (they are special characters and must be escaped). The correct regex is:

/([\s_=\?\x80-\xFF])/e

One more improvement: it is sometimes better (as per RFC-2047) to encode headers not as quoted-printable but as base64 (if the header consists mainly of 0x80-0xFF characters).

The extended and corrected function is here:

    /**
     * Encodes a header as per RFC2047
     *
     * @param string $input The header data to encode
     * @return string Encoded data
     * @access private
     */
    function _encodeHeaders($input)
    {
        foreach ($input as $hdr_name => $hdr_value) {
            preg_match_all('/([\w\-]*[\x80-\xFF]+[\w\-]*(\s+[\w\-]*[\x80-\xFF]+[\w\-]*)*)\s*/', $hdr_value, $matches);
            foreach ($matches[1] as $value) {
                switch ($head_encoding = $this->_build_params['head_encoding']) {
                case 'base64':
                    $symbol = 'B';
                    $replacement = base64_encode($value);
                    break;
                default:
                    if ($head_encoding != 'quoted-printable') {
                        PEAR::raiseError(
                            'Invalid header encoding specified; using `quoted-printable` instead',
                            NULL,
                            PEAR_ERROR_TRIGGER,
                            E_USER_WARNING
                        );
                    }
                    $symbol = 'Q';
                    $replacement = preg_replace('/([\s_=\?\x80-\xFF])/e', '"=" . strtoupper(dechex(ord("\1")))', $value);
                }
                $hdr_value = str_replace($value, '=?' . $this->_build_params['head_charset'] . '?' . $symbol . '?' . $replacement . '?=', $hdr_value);
            }
            $input[$hdr_name] = $hdr_value;
        }
        return $input;
    }

The function may still need some improvement, because headers must be divided into CRLF SPACE-separated parts no longer than 76 characters, but that's another story :)

Anyway, thanks for your attention. Let me know if you find these improvements useful and include them in the next version of Mail_mime.

Bye! With best regards, Edward Surov
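For readers on current PHP, where the /e modifier no longer exists, the grouping idea from the report can be sketched with preg_replace_callback(). This is only an illustration under stated assumptions: the function names qEncodeRun() and encodeHeaderValue() are hypothetical and are not part of Mail_mime, the word pattern is slightly widened so that a word with several separate 8-bit runs stays in one piece, and no line folding is attempted.

```php
<?php
// Hypothetical helpers for illustration only; not the Mail_mime implementation.

// Q-encode one run of text and wrap it as a single RFC 2047 encoded-word.
function qEncodeRun(string $text, string $charset): string
{
    // Escape whitespace, "_", "=", "?" and every 8-bit byte as =XX.
    $q = preg_replace_callback(
        '/[\s_=?\x80-\xFF]/',
        function (array $m): string {
            return sprintf('=%02X', ord($m[0]));
        },
        $text
    );
    return '=?' . $charset . '?Q?' . $q . '?=';
}

// Encode every space-separated run of words that contain 8-bit bytes
// as ONE encoded-word, so no whitespace is left between encoded-words.
function encodeHeaderValue(string $value, string $charset): string
{
    // A "word" here is letters, digits, "_", "-" or 8-bit bytes,
    // with at least one 8-bit byte somewhere in it (assumption of this sketch).
    $word = '[\w\-\x80-\xFF]*[\x80-\xFF][\w\-\x80-\xFF]*';
    return preg_replace_callback(
        '/' . $word . '(?:\s+' . $word . ')*/',
        function (array $m) use ($charset): string {
            return qEncodeRun($m[0], $charset);
        },
        $value
    );
}
```

With the "AAA BBB BBB AAA BBB" shape from the report, the two adjacent 8-bit words end up inside one encoded-word with the space escaped as =20, while the ASCII words are left as they are.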

Comments

 [2003-10-09 00:40 UTC] samm at os2 dot ru
Thanks for the solution. I used your patch and it works correctly for my subjects.
 [2003-10-30 22:33 UTC] heino at php dot net
I concur on the fact that the current code is not RFC compliant! I haven't tested the suggested code, but it should solve some of the problems; it needs some modification though...
 [2003-11-04 09:40 UTC] ed at avi dot ru
A small bug in this patch: the switch statement doesn't check whether $this->_build_params['head_encoding'] exists. It's better to implement this part with the following code:

    $head_encoding = array_key_exists('head_encoding', $this->_build_params) ? $this->_build_params['head_encoding'] : 'quoted-printable';
 [2004-09-27 07:41 UTC] struchkov at ma-journal dot ru
I rewrote _encodeHeaders as follows:

    function _encodeHeaders($input)
    {
        foreach ($input as $hdr_name => $hdr_value) {
            if (preg_match('/(\w*[\x80-\xFF]+\w*)/', $hdr_value))
                $hdr_value = '=?' . $this->_build_params['head_charset'] . '?B?' . base64_encode($hdr_value) . '?=';
            $input[$hdr_name] = $hdr_value;
        }
        return $input;
    }

Is it correct?
 [2004-09-28 08:06 UTC] ed at avi dot ru
struchkov at ma-journal dot ru: your solution will work correctly (let's forget about max. encoded string size), but it encodes even words that are not required to be encoded, and thus a) the size of headers is greater than it could be; b) headers are harder to read with text editors. Besides, your program doesn't allow user to choose between quoted-printable and base64 encoding - these options could help to decrease encoded data size.
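The size trade-off between the two encodings can be estimated before encoding. The sketch below is an assumption-laden illustration (pickHeaderEncoding() is a hypothetical name, not a Mail_mime method): it counts the bytes that Q encoding would have to escape and compares the resulting length with the base64 length.

```php
<?php
// Hypothetical helper for illustration only; not part of Mail_mime.
// Returns 'B' when base64 would be shorter than Q encoding, else 'Q'.
function pickHeaderEncoding(string $text): string
{
    // Bytes that this sketch escapes in Q encoding: whitespace, "_", "=", "?",
    // control characters and all 8-bit bytes. Each becomes three characters.
    $escaped = preg_match_all('/[\s_=?\x00-\x1F\x80-\xFF]/', $text);
    $qLength = strlen($text) + 2 * $escaped;

    // Base64 maps every started group of 3 input bytes to 4 output characters.
    $bLength = 4 * (int) ceil(strlen($text) / 3);

    return ($bLength < $qLength) ? 'B' : 'Q';
}
```

A mostly ASCII header stays readable as Q, while a header made up almost entirely of 8-bit bytes comes out shorter as base64.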
 [2005-02-18 10:57 UTC] minakov at mail dot ru
    function _encodeHeaders($input)
    {
        foreach ($input as $hdr_name => $hdr_value) {
            if (preg_match('/(\w*[\x80-\xFF]+\w*)/', $hdr_value)) {
                $replacement = preg_replace('/([\x80-\xFF])/e', '"=" . strtoupper(dechex(ord("\1")))', $hdr_value);
                $input[$hdr_name] = '=?' . $this->_build_params['head_charset'] . '?Q?' . $replacement . '?=';
            }
        }
        return $input;
    }
 [2005-04-07 06:13 UTC] laurynas dot butkus at gmail dot com
last fix by minakov at mail dot ru works nicely for me with UTF-8. thanks!! I hope the bug will be fixed in future releases..
 [2005-05-27 10:06 UTC] mdv at inyourpocket dot com
I think that, according to RFC 2047, when Q encoding is used you must also encode space, = and _ (see RFC 2047 sections 2 and 4.2), so when you apply Q encoding you must replace space, = and _:

    function _encodeHeaders($input)
    {
        foreach ($input as $hdr_name => $hdr_value)
            if (preg_match('/[\x80-\xFF]?/', $hdr_value))
                $input[$hdr_name] = '=?' . $this->_build_params['head_charset'] . '?Q?'
                    . preg_replace('/([\x80-\xFF]|\x20|\x3D|\x5F)/e', '"=".strtoupper(dechex(ord("\1")))', $hdr_value)
                    . '?=';
        return $input;
    }
 [2005-05-27 10:16 UTC] mdv at inyourpocket dot com
Of course the regexp should be /[\x80-\xFF]/ instead of /[\x80-\xFF]?/
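To see why "=", "_" and "?" need escaping, it helps to round-trip a string through Q encoding. The following is a minimal sketch, assuming the common convention of writing a space as "_"; qEncode() and qDecode() are hypothetical names used only for this illustration and are not Mail_mime methods.

```php
<?php
// Hypothetical helpers for illustration only; not part of Mail_mime.

// Q-encode raw header text: escape "=", "_", "?", control characters
// and 8-bit bytes as =XX, then write each space as "_".
function qEncode(string $text): string
{
    $escaped = preg_replace_callback(
        '/[=_?\x00-\x1F\x7F-\xFF]/',
        function (array $m): string {
            return sprintf('=%02X', ord($m[0]));
        },
        $text
    );
    // Literal underscores were already escaped above, so every "_" that
    // appears from here on stands for a space.
    return str_replace(' ', '_', $escaped);
}

// Reverse the transformation: "_" back to space, then =XX back to bytes.
function qDecode(string $payload): string
{
    return quoted_printable_decode(str_replace('_', ' ', $payload));
}
```

If "_" or "=" were left unescaped, the decoder could not tell a literal underscore from an encoded space, or a literal "=" from the start of an =XX escape.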
 [2005-06-02 14:21 UTC] mdv at inyourpocket dot com
KMail (the KDE mail client) works perfectly, so I "converted" the KMail C++ code to PHP (file kmmsgbase.cpp, function encodeRFC2047String()). I don't know how the KMail function codec->encode() works, so I replaced it with preg_replace('/([\x00-\x20\x80-\xFF])/e', '"=".strtoupper(dechex(ord("\1")))', $s). Any comments are welcome.

    /**
     * Encodes a header as per RFC2047
     *
     * @param string $input The header data to encode
     * @return string Encoded data
     * @access private
     */
    function _encodeHeaders($input)
    {
        foreach ($input as $k => $v)
            $input[$k] = $this->encodeRFC2047String($v);
        return $input;
    }

    function encodeRFC2047Quoted($s, $b)
    {
        if ($b)
            return base64_encode($s);
        else
            return preg_replace('/([\x00-\x20\x80-\xFF])/e', '"=".strtoupper(dechex(ord("\1")))', $s);
    }

    // Note: $this->dontQuote is used below but is not defined anywhere in this post.
    function encodeRFC2047String($s)
    {
        $strLength = strlen($s);
        $breakLine = $start = $stop = $p = $pos = $encLength = 0;
        $maxLen = 75 - 7 - strlen($this->_build_params['head_charset']);
        $nonAscii = 0;
        // Count 8-bit bytes; ord() is needed to compare a byte with a number
        for ($i = 0; $i < $strLength; $i++)
            if (ord($s[$i]) >= 128) $nonAscii++;
        $useBase64 = ($nonAscii * 6 > $strLength);
        $result = '';
        while ($pos < $strLength) {
            $start = $pos;
            $p = $pos;
            while ($p < $strLength) {
                if (!$breakLine && ($s[$p] == ' ' || (strpos($this->dontQuote, $s[$p]) !== false)))
                    $start = $p + 1;
                if (ord($s[$p]) >= 128 || ord($s[$p]) < 32)
                    break;
                $p++;
            }
            if ($breakLine || $p < $strLength) {
                while (strpos($this->dontQuote, $s[$start]) !== false)
                    $start++;
                $stop = $start;
                while ($stop < $strLength && (strpos($this->dontQuote, $s[$stop]) === false))
                    $stop++;
                $result .= substr($s, $pos, $start - $pos);
                $encLength = strlen($this->encodeRFC2047Quoted(substr($s, $start, $stop - $start), $useBase64));
                $breakLine = ($encLength > $maxLen);
                if ($breakLine) {
                    $dif = ($stop - $start) / 2;
                    $step = $dif;
                    while (abs($step) > 1) {
                        $encLength = strlen($this->encodeRFC2047Quoted(substr($s, $start, $dif), $useBase64));
                        $step = ($encLength > $maxLen) ? (-abs($step) / 2) : (abs($step) / 2);
                        $dif += $step;
                    }
                    $stop = $start + $dif;
                }
                $p = $stop;
                //echo 'DEBUG: p: ', $p, ', start:: ', $start, ', strlen:: ', $strLength, '<br>';
                while (($p > $start) && !isset($s[$p]))
                    $p--;
                if ($p > $start)
                    $stop = $p;
                if (substr($result, -3) == "?= ")
                    $start--;
                if (substr($result, -5) == "?=\n ") {
                    $start--;
                    $result = substr($result, 0, strlen($result) - 1);
                }
                $lastNewLine = strripos($result, "\n ");
                if (trim(substr($result, $lastNewLine)) && ((strlen($result) - $lastNewLine + $encLength + 2) > $maxLen))
                    $result .= "\n ";
                $result .= "=?";
                $result .= $this->_build_params['head_charset'];
                $result .= $useBase64 ? "?b?" : "?q?";
                $result .= $this->encodeRFC2047Quoted(substr($s, $start, $stop - $start), $useBase64);
                //encodeRFC2047Quoted(codec->fromUnicode(_str.mid(start, stop - start)), useBase64);
                $result .= "?=";
                if ($breakLine)
                    $result .= "\n ";
                $pos = $stop;
            } else {
                $result .= substr($s, $pos);
                break;
            }
        }
        return $result;
    }
 [2005-06-13 12:39 UTC] krumb at valentins dot de
I have the same problem with German umlauts like äöü. The whitespace between two words containing umlauts disappears.

    include('Mail.php');
    include('Mail/mime.php');

    $text = 'Text version of email';
    $html = '<html><body>HTML version of email<br>message äöü </body></html>';
    $file = '/home/richard/example.php';
    $crlf = "\n";
    $hdrs = array(
        'From'    => 'user@domain.com',
        'Subject' => 'Süper gröse tolle grüße von mir'
    );

    $mime = new Mail_mime($crlf);
    $mime->setTXTBody($text);
    $mime->setHTMLBody($html);
    $mime->addAttachment($file, 'text/plain');

    $body = $mime->get();
    $hdrs = $mime->headers($hdrs);

    $mail =& Mail::factory('mail');
    $mail->send('xxx@test.de', $hdrs, $body);
 [2005-09-02 08:15 UTC] eric at persson dot tm
I have the same problem with Swedish characters like åäö in words following each other. I solved it temporarily by adding $value = $value.' '; as in the following:

    function _encodeHeaders($input)
    {
        foreach ($input as $hdr_name => $hdr_value) {
            preg_match_all('/(\w*[\x80-\xFF]+\w*)/', $hdr_value, $matches);
            foreach ($matches[1] as $value) {
                $value = $value.' ';
                $replacement = preg_replace('/([\x80-\xFF])/e', '"=" . strtoupper(dechex(ord("\1")))', $value);
                $hdr_value = str_replace($value, '=?' . $this->_build_params['head_charset'] . '?Q?' . $replacement . '?=', $hdr_value);
            }
            $input[$hdr_name] = $hdr_value;
        }
        return $input;
    }
 [2005-09-30 11:20 UTC] roundcube at gmail dot com
Why is this bug marked as "irrelevant"? To me it seems very important to have headers encoded correctly. For example, Gmail can't read headers with two separately encoded words and will parse the message completely wrong. All headers following the encoded line will then appear as part of the body. Also important to me are the additional chars (\s_=?) that should be encoded as well. It's now more than two years since this was reported, but it has not been implemented in any release. Please fix it! The solution of ed[at]avi[dot]ru (2003-09-23) will do it. With my best regards, Thomas
 [2005-09-30 12:16 UTC] roundcube at gmail dot com
Sorry for the mistake: of course this bug is not marked as "irrelevant" (watched the wrong column...) but seems to be a long term issue. Please process soon.
 [2006-02-09 09:46 UTC] cipri
Bug #3578 was a duplicate of this.
 [2006-02-09 20:41 UTC] vilius dot simonaitis at gmail dot com
The funny thing is that the _encodeHeaders function matches only a single UTF-8 character, or the characters must be in a row. It matches "čiukce" correctly, but the word "čiukčė" is split into pieces and the encoder misbehaves. The other problem concerns spaces: for some reason, some spaces are not encoded correctly in headers. Here is the corrected source (notice the first preg):

    function _encodeHeaders($input)
    {
        foreach ($input as $hdr_name => $hdr_value) {
            // Notice that this preg matches the whole word with any number of UTF-8 characters.
            preg_match_all('/((\w*[\x80-\xFF]+\w*)+\s*)/', $hdr_value, $matches);
            foreach ($matches[1] as $value) {
                $replacement = preg_replace('/([\x80-\xFF])/e', '"=" . strtoupper(dechex(ord("\1")))', $value);
                $hdr_value = str_replace($value, '=?' . $this->_build_params['head_charset'] . '?Q?' . $replacement . '?=', $hdr_value);
            }
            $input[$hdr_name] = $hdr_value;
        }
        return $input;
    }
 [2006-04-02 00:02 UTC] okin7 at yahoo dot fr (Nicolas Grekas)
Here is what I personally use; I hope it helps.

=================================
    function _encodeHeaders($input)
    {
        foreach ($input as $hdr_name => $hdr_value) {
            if (preg_match('/[\x80-\xFF]/', $hdr_value)) {
                $hdr_value = preg_replace(
                    '/[=_\x80-\xFF]/e',
                    '"=".strtoupper(dechex(ord("\0")))',
                    $hdr_value
                );
                $hdr_value = str_replace(' ', '_', $hdr_value);
                $input[$hdr_name] = '=?' . $this->_build_params['head_charset'] . '?Q?' . $hdr_value . '?=';
            }
        }
        return $input;
    }
=================================
 [2006-04-02 00:16 UTC] okin7 at yahoo dot fr (Nicolas Grekas)
Even better (sorry for the double post): the second regex, '/[=_\x80-\xFF]/e', should be replaced with '/[=_\?\x00-\x1F\x80-\xFF]/e', so my corrected version is:

=================================
    function _encodeHeaders($input)
    {
        foreach ($input as $hdr_name => $hdr_value) {
            if (preg_match('/[\x80-\xFF]/', $hdr_value)) {
                $hdr_value = preg_replace(
                    '/[=_\?\x00-\x1F\x80-\xFF]/e',
                    '"=".strtoupper(dechex(ord("\0")))',
                    $hdr_value
                );
                $hdr_value = str_replace(' ', '_', $hdr_value);
                $input[$hdr_name] = '=?' . $this->_build_params['head_charset'] . '?Q?' . $hdr_value . '?=';
            }
        }
        return $input;
    }
 [2006-04-05 14:38 UTC] okin7 at yahoo dot fr (Nicolas Grekas)
So, I've read RFC 2047, and my previous proposals were not so good. I've rewritten my code; here is my new proposal. It should be more RFC compliant now. The contribution of mdv at inyourpocket dot com might also be good, but if there is a licensing issue coming from KMail, my version has no such issue: I've written it from scratch.

=======================================
    function _encodeHeaders($input)
    {
        $ns = "[^\(\)<>@,;:\"\/\[\]\r\n]*";
        foreach ($input as $hdr_name => $hdr_value) {
            $input[$hdr_name] = preg_replace_callback(
                "/{$ns}(?:[\\x80-\\xFF]{$ns})+/",
                array($this, '_encodeHeaderWord'),
                $hdr_value
            );
        }
        return $input;
    }

    function _encodeHeaderWord($word)
    {
        $word = preg_replace('/[=_\?\x00-\x1F\x80-\xFF]/e', '"=".strtoupper(dechex(ord("\0")))', $word[0]);
        preg_match('/^( *)(.*?)( *)$/', $word, $w);
        $word =& $w[2];
        $word = str_replace(' ', '_', $word);
        $start = '=?' . $this->_build_params['head_charset'] . '?Q?';
        $offsetLen = strlen($start) + 2;
        $w[1] .= $start;
        while ($offsetLen + strlen($word) > 75) {
            $splitPos = 75 - $offsetLen;
            // Move the split point back so that an "=XX" escape is never cut in two
            switch ('=') {
            case substr($word, $splitPos - 2, 1): --$splitPos;
            case substr($word, $splitPos - 1, 1): --$splitPos;
            }
            $w[1] .= substr($word, 0, $splitPos) . "?={$this->_eol} {$start}";
            $word = substr($word, $splitPos);
        }
        return $w[1] . $word . '?=' . $w[3];
    }
 [2006-04-09 15:10 UTC] cipri (Cipriano Groenendal)
Bug #4317 was a duplicate of this.
 [2006-04-09 15:11 UTC] cipri (Cipriano Groenendal)
I'm currently testing some code that does not just include the patches provided here, but takes care of all RFC-2047 requirements. Stay tuned for a fix, hopefully sometime soon.
 [2006-04-26 07:24 UTC] erik at forss dot se (Erik Jansson)
Until the problem is resolved, this code fixed my problems. I just use the class fMail_Mime instead of Mail_Mime:

    class fMail_Mime extends Mail_Mime
    {
        function _encodeHeaders($input)
        {
            foreach ($input as $hdr_name => $hdr_value) {
                preg_match_all('/(\w*[\x80-\xFF ]+\w*)/', $hdr_value, $matches);
                foreach ($matches[1] as $value) {
                    $replacement = preg_replace('/([\x80-\xFF ])/e', '"=" . strtoupper(dechex(ord("\1")))', $value);
                    if ($replacement == "=20") break;
                    $hdr_value = str_replace($value, '=?' . $this->_build_params['head_charset'] . '?Q?' . $replacement . '?=', $hdr_value);
                }
                $input[$hdr_name] = $hdr_value;
            }
            return $input;
        }
    }
 [2006-04-28 13:43 UTC] cipri (Cipriano Groenendal)
I've just committed an initial fix to CVS that follows almost all of the quoted-printable encoding rules specified in RFC2047:

It properly encodes whitespace.
It properly encodes = ? and _
It properly encodes non-ASCII chars [\x80-\xFF]
It properly splits the headers at ~75 chars.

I'm now implementing the head_encoding option so one can select "base64" as the encoding instead of the default quoted-printable. I'd greatly appreciate it if people could help me test this latest function, available in CVS (mime.php,v 1.53), especially if you use Unicode or any charset other than ISO-8859-1, which is all I've been able to test with so far.
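The "split at ~75 chars" rule can be illustrated in isolation. The sketch below is not the code committed to CVS; splitQPayload() is a hypothetical name, the payload is assumed to be already Q-encoded, and the charset is assumed to be single-byte (for a multi-byte charset such as UTF-8 the split point would also have to avoid cutting a character in half, which this sketch does not handle).

```php
<?php
// Hypothetical helper for illustration only; not the Mail_mime implementation.
// Splits an already Q-encoded payload into several encoded-words of at most
// 75 characters each, joined by a line break plus one space.
function splitQPayload(string $payload, string $charset, string $eol = "\r\n"): string
{
    $prefix = '=?' . $charset . '?Q?';
    // Room left for payload characters once "=?charset?Q?" and "?=" are counted.
    $room = 75 - strlen($prefix) - 2;

    $words = [];
    while (strlen($payload) > $room) {
        $cut = $room;
        // Never cut through an "=XX" escape: step back to just before its "=".
        if ($payload[$cut - 1] === '=') {
            $cut -= 1;
        } elseif ($payload[$cut - 2] === '=') {
            $cut -= 2;
        }
        $words[] = $prefix . substr($payload, 0, $cut) . '?=';
        $payload = substr($payload, $cut);
    }
    $words[] = $prefix . $payload . '?=';

    return implode($eol . ' ', $words);
}
```

Each resulting line is a complete encoded-word of its own, so a decoder can join the pieces again by dropping the whitespace between them.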
 [2006-04-28 18:06 UTC] cipri (Cipriano Groenendal)
I just committed mime.php,v 1.54 to CVS. This version adds the extra build flag 'head_encoding', which lets you pick the encoding to use when encoding the headers. When set to 'base64' it will use base64 encoding for the headers. Any other setting will default to quoted-printable encoding.
 [2006-05-03 07:33 UTC] laurynas dot butkus at gmail dot com (Laurynas Butkus)
This bug tracking system does not display Baltic characters, so I made an HTML copy of this report (to be able to display the real test data): http://lauris.night.lt/php/mail_mime_bug_notes.html

Tested with the latest mime.php version from CVS (1.55 2006/04/28). I tried to send a message using UTF-8 Baltic letters in From: and in Subject.

Test data:
From: avardė <some@email.com>
Subject: U¸pildytčėįčęėsdfsčęqwzxą ęčėnaujėa Ųojalumo ¦rogramos Čanketa

When using the default head-encoding, Mail can't send the email and spits out an error:
Validation failed for: =?UTF-8?Q?avard=C4=97_?=

When changing head-encoding to 'base64', the mail is sent but the headers are messed up (in Thunderbird):
Subject: U¸pildytčėįčęėsdfsčęqwzxą ęčėnauj� =?UTF-8?B?==?=
"From" also does not look as it should.

I tried the other suggested fixes; all have problems with my test data: either Mail spits out an error or the headers are messed up (even the body in one case). The least messed-up message comes from the fix suggested by Nicolas Grekas at 2006-04-05 14:38 UTC. Only the subject is cut down to "U¸pildytčėįčęėsdfsč��", but everything else looks OK.
 [2006-06-07 09:23 UTC] mp at webfactory dot de (Matthias Pigulla)
RFC2047 paragraph 5 states where encoded-words may appear. In particular, it says that "an 'encoded-word' MUST NOT appear in any portion of an 'addr-spec'." The current revision 1.56 in CVS is broken in this regard as it simply encodes the whole header value. That's ok for headers defined as "*text" (eg Subject:), but not for headers like To: or From:. These will end up with the address being part of the encoded-word, confusing mail readers (not decoding sender's name when displaying it etc.)
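One way to respect this rule is to encode only the display name and leave the address untouched. The following is a minimal sketch, not the Mail_mime implementation: encodeAddressHeader() is a hypothetical name, only the simple "Display Name <local@domain>" form is handled, and base64 is chosen purely to keep the example short.

```php
<?php
// Hypothetical helper for illustration only; not part of Mail_mime.
// Assumes a single address in the form: Display Name <local@domain>
function encodeAddressHeader(string $value, string $charset): string
{
    // Separate the display name from the <addr-spec> part.
    if (!preg_match('/^(.*?)\s*<([^<>]+)>\s*$/', $value, $m)) {
        return $value; // some other form: left untouched in this sketch
    }
    list(, $name, $addr) = $m;

    // Pure ASCII display names need no encoding at all.
    if (!preg_match('/[\x80-\xFF]/', $name)) {
        return $value;
    }

    // Only the display name becomes an encoded-word; the address stays as is.
    return '=?' . $charset . '?B?' . base64_encode($name) . '?= <' . $addr . '>';
}
```

The address part then still passes plain address validation, because no encoded-word ever reaches it.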
 [2006-12-03 22:53 UTC] cipri (Cipriano Groenendal)
This bug has been fixed in CVS. If this was a documentation problem, the fix will appear on pear.php.net by the end of next Sunday (CET). If this was a problem with the pear.php.net website, the change should be live shortly. Otherwise, the fix will appear in the package's next release. Thank you for the report and for helping us make PEAR better.

All words are now encoded separately. This prevents the encoding of e-mail addresses (which should never contain non-US-ASCII chars, and thus will never get encoded).
 [2007-02-27 21:23 UTC] stephen dot bigelis at gmail dot com (stephen bigelis)
I am posting this here because I couldn't find it anywhere else and it took me forever to figure out what was going wrong (and this page popped up all the time during my endless search for a solution).

Gmail does not like "\r\n". It will feed two lines and cause the email to appear as an attachment, because the boundary marker cannot be read. If you try to define MAIL_MIMEPART_CRLF as "\n" it will still get messed up. Removing the CRLF in the boundary line fixes this problem, but then the section headers also need to be adjusted. The best fix I can find is to edit encode() and _quotedPrintableEncode() in mimePart.php. Here are my temporary fixes; does anyone know of a permanent one?

    function encode()
    {
        $encoded =& $this->_encoded;
        if (!empty($this->_subparts)) {
            srand((double)microtime()*1000000);
            $boundary = '=_' . md5(rand() . microtime());
            $this->_headers['Content-Type'] .= ';' . "\t" . 'boundary="' . $boundary . '"';
            // Add body parts to $subparts
            for ($i = 0; $i < count($this->_subparts); $i++) {
                $headers = array();
                $tmp = $this->_subparts[$i]->encode();
                foreach ($tmp['headers'] as $key => $value) {
                    $headers[] = $key . ': ' . $value;
                }
                $subparts[] = implode("\n", $headers) . MAIL_MIMEPART_CRLF . MAIL_MIMEPART_CRLF . $tmp['body'];
            }
            $encoded['body'] = '--' . $boundary . "\n" . implode('--' . $boundary . "\n", $subparts) . '--' . $boundary.'--' . "\n";
        } else {
            $encoded['body'] = $this->_getEncodedData($this->_body, $this->_encoding) . MAIL_MIMEPART_CRLF;
        }
        // Add headers to $encoded
        $encoded['headers'] =& $this->_headers;
        return $encoded;
    }

and in _quotedPrintableEncode just change $eol = "\n".

So far this is working; hopefully I will remember to update this when I work out a solid fix.
 [2007-05-05 15:09 UTC] cipri (Cipriano Groenendal)
Thank you for your bug report. This issue has been fixed in the latest released version of the package, which you can download at http://pear.php.net/get/Mail_Mime Fixed in 1.4.0
 [2009-09-09 18:22 UTC] shot (Piotr Szotkowski)
I don’t think this issue is fixed – Matthias Pigulla’s comment from http://pear.php.net/bugs/bug.php?id=30#1149654190 was not addressed and current Mail_mime still encodes From: Józef Szotkowski <shot@devielle> to From: =?utf-8?Q?J=C3=B3zef=20Szotkowski=20<shot@devielle>?= (which even subsequently fails the test done by Mail_RFC822).
 [2009-09-09 20:42 UTC] shot (Piotr Szotkowski)
FWIW, I fixed the issue of _encodeHeaders() also encoding email addresses by subclassing Mail_mime and adding a wrapper method: http://svn.civicrm.org/civicrm/trunk/CRM/Utils/Mail/FixedMailMIME.php