
Bug #30 Mail_Mime: _encodeHeaders is not RFC-2047 compliant.
Submitted: 2003-09-23 09:13 UTC
From: ed at avi dot ru Assigned: cipri
Status: Closed Package: Mail_Mime
PHP Version: Irrelevant OS: All
Roadmaps: 1.4.0, 1.4.0a1    


 [2003-09-23 09:13 UTC] ed at avi dot ru
Description:
------------
Hello!

I tried to use the PEAR module Mail_mime to compose a message with a Cyrillic subject and noticed that the whitespace between Cyrillic words is missing in the resulting message (Cyrillic letters are in the 0x80-0xFF range). Investigation showed that the Mail_mime::_encodeHeaders() method works incorrectly.

Let's imagine that we have the following subject (or any other header, it doesn't matter):

    AAA BBB BBB AAA BBB

where A is a Latin character and B is a Cyrillic character. Your regular expression will detect 3 BBB patterns and will encode them separately:

    AAA =?charset?Q?=xx=xx=xx?= =?charset?Q?=xx=xx=xx?= AAA =?charset?Q?=xx=xx=xx?=

And what do we see? The first two BBB patterns produced two sequential RFC-2047 'encoded-word' patterns, which
a) is incorrect, because two sequential 'encoded-word' patterns must be separated by CRLF SPACE, not by SPACE only;
b) is understood by mailers as CRLF SPACE and thus translated to an empty string.
Oops...

The solution is to change the regular expression so that it finds not single 'words-with-[0x80-0xFF]' patterns, but space-separated sequences of such words. It would also be better to include the '-' symbol alongside \w, so that words containing '-' are treated as solid words. The regex is:

    /([\w\-]*[\x80-\xFF]+[\w\-]*(\s+[\w\-]*[\x80-\xFF]+[\w\-]*)*)\s*/

The replacing regex also needs improvement: as per RFC-2047, an 'encoded-word' must not contain spaces and tabs if it is to be treated as a single atom; it also must not contain underscores, equals signs and question marks (they are special characters and must be escaped). The correct regex is:

    /([\s_=\?\x80-\xFF])/e

And one more improvement: it is sometimes better (as per RFC-2047) to encode headers not as quoted-printable but as base64 (if the header consists mainly of 0x80-0xFF characters).

The extended and corrected function is here:

    /**
     * Encodes a header as per RFC2047
     *
     * @param string $input The header data to encode
     * @return string Encoded data
     * @access private
     */
    function _encodeHeaders($input)
    {
        foreach ($input as $hdr_name => $hdr_value) {
            preg_match_all('/([\w\-]*[\x80-\xFF]+[\w\-]*(\s+[\w\-]*[\x80-\xFF]+[\w\-]*)*)\s*/', $hdr_value, $matches);
            foreach ($matches[1] as $value) {
                switch ($head_encoding = $this->_build_params['head_encoding']) {
                case 'base64':
                    $symbol = 'B';
                    $replacement = base64_encode($value);
                    break;
                default:
                    if ($head_encoding != 'quoted-printable') {
                        PEAR::raiseError(
                            'Invalid header encoding specified; using `quoted-printable` instead',
                            NULL, PEAR_ERROR_TRIGGER, E_USER_WARNING
                        );
                    }
                    $symbol = 'Q';
                    $replacement = preg_replace('/([\s_=\?\x80-\xFF])/e', '"=" . strtoupper(dechex(ord("\1")))', $value);
                }
                $hdr_value = str_replace($value, '=?' . $this->_build_params['head_charset'] . '?' . $symbol . '?' . $replacement . '?=', $hdr_value);
            }
            $input[$hdr_name] = $hdr_value;
        }
        return $input;
    }

The function may still need some improvement, because headers must be divided into CRLF SPACE-separated parts no longer than 76 characters, but that's another story :)

Anyway, thanks for your attention. Let me know if you find these improvements useful and include them in the next version of Mail_mime.

Bye!
With best regards, Edward Surov
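For readers on current PHP: the /e modifier used in the patch above was deprecated in PHP 5.5 and removed in PHP 7.0, so the same idea has to be written with preg_replace_callback(). The following is only a minimal sketch of the grouping approach described in the report, not the code that went into Mail_Mime. The function name encodeHeaderRunsQ() is hypothetical, it handles Q encoding only, and it does no line folding. It uses sprintf('=%02X') rather than dechex(), because dechex() would emit a single digit for bytes below 0x10 (a tab would become "=9").

```php
<?php
// Hypothetical helper (not part of Mail_Mime): Q-encode each space-separated
// run of words that contain 0x80-0xFF bytes as ONE encoded-word, so that no
// two encoded-words end up next to each other with the space between them lost.
function encodeHeaderRunsQ(string $value, string $charset): string
{
    // Same grouping idea as the regex in the report: a word containing a
    // high byte, followed by any number of further such words.
    $run = '/[\w\-]*[\x80-\xFF]+[\w\-]*(?:\s+[\w\-]*[\x80-\xFF]+[\w\-]*)*/';

    return preg_replace_callback($run, function (array $m) use ($charset): string {
        // Escape whitespace, "_", "=", "?" and all 8-bit bytes as =XX.
        $q = preg_replace_callback('/[\s_=?\x80-\xFF]/', function (array $c): string {
            return sprintf('=%02X', ord($c[0]));
        }, $m[0]);

        return '=?' . $charset . '?Q?' . $q . '?=';
    }, $value);
}

// Usage: two adjacent non-ASCII words become a single encoded-word.
echo encodeHeaderRunsQ("abc \xD0\x9F\xD1\x80 \xD0\xB8\xD0\xB2 def", 'UTF-8'), "\n";
```

The space between the two non-ASCII words is carried inside the encoded-word as =20, so a decoder has no whitespace between adjacent encoded-words to discard.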

Comments

 [2003-10-09 00:40 UTC] samm at os2 dot ru
Thanks for the solution. I used your patch and it works correctly for my subjects.
 [2003-10-30 22:33 UTC] heino at php dot net
I concur on the fact that the current code is not RFC compliant! I haven't tested the suggested code, but it should solve some of the problems; it needs some modification though...
 [2003-11-04 09:40 UTC] ed at avi dot ru
A small bug in this patch: in the switch statement the script doesn't check whether $this->_build_params['head_encoding'] exists. It's better to implement this part with the following code:

    $head_encoding = array_key_exists('head_encoding', $this->_build_params)
        ? $this->_build_params['head_encoding']
        : 'quoted-printable';
 [2004-09-27 07:41 UTC] struchkov at ma-journal dot ru
I rewrote _encodeHeaders as follows:

    function _encodeHeaders($input)
    {
        foreach ($input as $hdr_name => $hdr_value) {
            if (preg_match('/(\w*[\x80-\xFF]+\w*)/', $hdr_value))
                $hdr_value = '=?'.$this->_build_params['head_charset'].'?B?'.base64_encode($hdr_value).'?=';
            $input[$hdr_name] = $hdr_value;
        }
        return $input;
    }

Is it correct?
 [2004-09-28 08:06 UTC] ed at avi dot ru
struchkov at ma-journal dot ru: your solution will work correctly (let's forget about the maximum encoded string size), but it encodes even words that do not need to be encoded, and thus a) the headers are larger than they could be; b) the headers are harder to read in a text editor. Besides, your code doesn't let the user choose between quoted-printable and base64 encoding; this option could help to decrease the encoded data size.
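To make the size argument concrete, here is a rough sketch that compares the two encodings for a given text and picks the shorter one. The function name pickHeaderEncoding() is made up for illustration. The sketch assumes spaces are written as "_" in Q encoding (one character each), that every other escaped byte costs three characters, and it ignores the fixed "=?charset?X?...?=" overhead, which is the same for both.

```php
<?php
// Hypothetical helper: returns 'Q' or 'B', whichever yields the shorter payload.
function pickHeaderEncoding(string $text): string
{
    $len = strlen($text);

    // Bytes that need an "=XX" escape in Q encoding: "=", "?", "_",
    // control characters and everything from 0x7F upwards.
    $escaped = preg_match_all('/[=?_\x00-\x1F\x7F-\xFF]/', $text);

    $qLength = $len + 2 * $escaped;       // each escape adds two extra characters
    $bLength = intdiv($len + 2, 3) * 4;   // base64: 4 characters per 3 input bytes

    return $bLength < $qLength ? 'B' : 'Q';
}

// Mostly ASCII text stays in Q; pure Cyrillic (UTF-8) text goes to base64.
echo pickHeaderEncoding("Hello world w\xC3\xB6rld"), "\n";
echo pickHeaderEncoding("\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2"), "\n";
```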
 [2005-02-18 10:57 UTC] minakov at mail dot ru
    function _encodeHeaders($input)
    {
        foreach ($input as $hdr_name => $hdr_value) {
            if (preg_match('/(\w*[\x80-\xFF]+\w*)/', $hdr_value)) {
                $replacement = preg_replace('/([\x80-\xFF])/e', '"=" . strtoupper(dechex(ord("\1")))', $hdr_value);
                $input[$hdr_name] = '=?' . $this->_build_params['head_charset'] . '?Q?' . $replacement . '?=';
            }
        }
        return $input;
    }
 [2005-04-07 06:13 UTC] laurynas dot butkus at gmail dot com
The last fix by minakov at mail dot ru works nicely for me with UTF-8. Thanks! I hope the bug will be fixed in future releases.
 [2005-05-27 10:06 UTC] mdv at inyourpocket dot com
I think that, according to RFC 2047, when Q encoding is used you must encode space, = and _ (see RFC 2047 sections 2 and 4.2), so when you apply Q encoding you must replace space, = and _:

    function _encodeHeaders($input)
    {
        foreach ($input as $hdr_name => $hdr_value)
            if (preg_match('/[\x80-\xFF]?/', $hdr_value))
                $input[$hdr_name] = '=?'.$this->_build_params['head_charset'].'?Q?'.preg_replace('/([\x80-\xFF]|\x20|\x3D|\x5F)/e', '"=".strtoupper(dechex(ord("\1")))', $hdr_value).'?=';
        return $input;
    }
 [2005-05-27 10:16 UTC] mdv at inyourpocket dot com
Of course the regular expression should be /[\x80-\xFF]/ instead of /[\x80-\xFF]?/
 [2005-06-02 14:21 UTC] mdv at inyourpocket dot com
KMail (the KDE mail client) handles this perfectly, so I "converted" the KMail C++ code to PHP (file kmmsgbase.cpp, function encodeRFC2047String()). I don't know how the KMail function codec->encode() works, so I replaced it with

    preg_replace('/([\x00-\x20\x80-\xFF])/e', '"=".strtoupper(dechex(ord("\1")))', $s)

Any comments are welcome.

    /**
     * Encodes a header as per RFC2047
     *
     * @param string $input The header data to encode
     * @return string Encoded data
     * @access private
     */
    function _encodeHeaders($input)
    {
        foreach ($input as $k => $v)
            $input[$k] = $this->encodeRFC2047String($v);
        return $input;
    }

    function encodeRFC2047Quoted($s, $b)
    {
        if ($b)
            return base64_encode($s);
        else
            return preg_replace('/([\x00-\x20\x80-\xFF])/e', '"=".strtoupper(dechex(ord("\1")))', $s);
    }

    // NOTE: $this->dontQuote is used below but is not defined in this post.
    function encodeRFC2047String($s)
    {
        $strLength = strlen($s);
        $breakLine = $start = $stop = $p = $pos = $encLength = 0;
        $maxLen = 75 - 7 - strlen($this->_build_params['head_charset']);
        $nonAscii = 0;
        for ($i = 0; $i < $strLength; $i++)
            if (ord($s[$i]) >= 128)   // compare the byte value, not the character
                $nonAscii++;
        $useBase64 = ($nonAscii * 6 > $strLength);
        $result = '';
        while ($pos < $strLength) {
            $start = $pos;
            $p = $pos;
            while ($p < $strLength) {
                if (!$breakLine && ($s[$p] == ' ' || (strpos($this->dontQuote, $s[$p]) !== false)))
                    $start = $p + 1;
                if (ord($s[$p]) >= 128 || ord($s[$p]) < 32)
                    break;
                $p++;
            }
            if ($breakLine || $p < $strLength) {
                while (strpos($this->dontQuote, $s[$start]) !== false)
                    $start++;
                $stop = $start;
                while ($stop < $strLength && (strpos($this->dontQuote, $s[$stop]) === false))
                    $stop++;
                $result .= substr($s, $pos, $start - $pos);
                $encLength = strlen($this->encodeRFC2047Quoted(substr($s, $start, $stop - $start), $useBase64));
                $breakLine = ($encLength > $maxLen);
                if ($breakLine) {
                    $dif = ($stop - $start) / 2;
                    $step = $dif;
                    while (abs($step) > 1) {
                        $encLength = strlen($this->encodeRFC2047Quoted(substr($s, $start, $dif), $useBase64));
                        $step = ($encLength > $maxLen) ? (-abs($step) / 2) : (abs($step) / 2);
                        $dif += $step;
                    }
                    $stop = $start + $dif;
                }
                $p = $stop;
                //echo 'DEBUG: p: ', $p, ', start:: ', $start, ', strlen:: ', $strLength, '<br>';
                while (($p > $start) && !isset($s[$p]))
                    $p--;
                if ($p > $start)
                    $stop = $p;
                if (substr($result, -3) == "?= ")
                    $start--;
                if (substr($result, -5) == "?=\n ") {
                    $start--;
                    $result = substr($result, 0, strlen($result) - 1);
                }
                $lastNewLine = strripos($result, "\n ");
                if (trim(substr($result, $lastNewLine)) && ((strlen($result) - $lastNewLine + $encLength + 2) > $maxLen))
                    $result .= "\n ";
                $result .= "=?";
                $result .= $this->_build_params['head_charset'];
                $result .= $useBase64 ? "?b?" : "?q?";
                $result .= $this->encodeRFC2047Quoted(substr($s, $start, $stop - $start), $useBase64);
                //encodeRFC2047Quoted(codec->fromUnicode(_str.mid(start, stop - start)), useBase64);
                $result .= "?=";
                if ($breakLine)
                    $result .= "\n ";
                $pos = $stop;
            } else {
                $result .= substr($s, $pos);
                break;
            }
        }
        return $result;
    }
 [2005-06-13 12:39 UTC] krumb at valentins dot de
I have the same problem with German umlauts like äöü. The whitespace between two words containing umlauts goes missing.

    include('Mail.php');
    include('Mail/mime.php');

    $text = 'Text version of email';
    $html = '<html><body>HTML version of email<br>message äöü </body></html>';
    $file = '/home/richard/example.php';
    $crlf = "\n";
    $hdrs = array(
        'From'    => 'user@domain.com',
        'Subject' => 'Süper gröse tolle grüße von mir'
    );

    $mime = new Mail_mime($crlf);
    $mime->setTXTBody($text);
    $mime->setHTMLBody($html);
    $mime->addAttachment($file, 'text/plain');

    $body = $mime->get();
    $hdrs = $mime->headers($hdrs);

    $mail =& Mail::factory('mail');
    $mail->send('xxx@test.de', $hdrs, $body);
 [2005-09-02 08:15 UTC] eric at persson dot tm
I have the same problem with Swedish characters like åäö in words following each other. I solved it temporarily by adding $value = $value.' '; as in the following:

    function _encodeHeaders($input)
    {
        foreach ($input as $hdr_name => $hdr_value) {
            preg_match_all('/(\w*[\x80-\xFF]+\w*)/', $hdr_value, $matches);
            foreach ($matches[1] as $value) {
                $value = $value.' ';
                $replacement = preg_replace('/([\x80-\xFF])/e', '"=" . strtoupper(dechex(ord("\1")))', $value);
                $hdr_value = str_replace($value, '=?' . $this->_build_params['head_charset'] . '?Q?' . $replacement . '?=', $hdr_value);
            }
            $input[$hdr_name] = $hdr_value;
        }
        return $input;
    }
 [2005-09-30 11:20 UTC] roundcube at gmail dot com
Why is this bug marked as "irrelevant"? To me it seems very important to have headers encoded correctly. For example, Gmail can't read headers with two separately encoded words and will parse the message completely wrong. All headers following the encoded line will then appear as part of the body. Also important to me are the additional chars (\s_=?) that should be encoded as well. It's now more than two years since this was reported, but it has not been implemented in any release. Please fix it! The solution of ed[at]avi[dot]ru (2003-09-23) will do it. With my best regards, Thomas
 [2005-09-30 12:16 UTC] roundcube at gmail dot com
Sorry for the mistake: of course this bug is not marked as "irrelevant" (watched the wrong column...) but seems to be a long term issue. Please process soon.
 [2006-02-09 20:41 UTC] vilius dot simonaitis at gmail dot com
The funny thing is that the _encodeHeaders function matches only a single UTF-8 character, or else the characters must be in a row. It matches "èiukce" correctly, but the word "èiukèë" is split into pieces and the encoder misbehaves. The other problem is with spaces. For some reason, some spaces are not encoded correctly in headers. Here is the correct source (notice the first preg):

    function _encodeHeaders($input)
    {
        foreach ($input as $hdr_name => $hdr_value) {
            // Notice that this preg matches the whole word with any number of UTF-8 characters.
            preg_match_all('/((\w*[\x80-\xFF]+\w*)+\s*)/', $hdr_value, $matches);
            foreach ($matches[1] as $value) {
                $replacement = preg_replace('/([\x80-\xFF])/e', '"=" . strtoupper(dechex(ord("\1")))', $value);
                $hdr_value = str_replace($value, '=?' . $this->_build_params['head_charset'] . '?Q?' . $replacement . '?=', $hdr_value);
            }
            $input[$hdr_name] = $hdr_value;
        }
        return $input;
    }
 [2006-04-02 00:02 UTC] okin7 at yahoo dot fr (Nicolas Grekas)
Here is what I personally use, hope it can help.

=================================
    function _encodeHeaders($input)
    {
        foreach ($input as $hdr_name => $hdr_value) {
            if (preg_match('/[\x80-\xFF]/', $hdr_value)) {
                $hdr_value = preg_replace(
                    '/[=_\x80-\xFF]/e',
                    '"=".strtoupper(dechex(ord("\0")))',
                    $hdr_value
                );
                $hdr_value = str_replace(' ', '_', $hdr_value);
                $input[$hdr_name] = '=?' . $this->_build_params['head_charset'] . '?Q?' . $hdr_value . '?=';
            }
        }
        return $input;
    }
=================================
 [2006-04-02 00:16 UTC] okin7 at yahoo dot fr (Nicolas Grekas)
Even better, and sorry for the double post: the second regex, '/[=_\x80-\xFF]/e', should be replaced with '/[=_\?\x00-\x1F\x80-\xFF]/e', so my corrected version is:

=================================
    function _encodeHeaders($input)
    {
        foreach ($input as $hdr_name => $hdr_value) {
            if (preg_match('/[\x80-\xFF]/', $hdr_value)) {
                $hdr_value = preg_replace(
                    '/[=_\?\x00-\x1F\x80-\xFF]/e',
                    '"=".strtoupper(dechex(ord("\0")))',
                    $hdr_value
                );
                $hdr_value = str_replace(' ', '_', $hdr_value);
                $input[$hdr_name] = '=?' . $this->_build_params['head_charset'] . '?Q?' . $hdr_value . '?=';
            }
        }
        return $input;
    }
 [2006-04-05 14:38 UTC] okin7 at yahoo dot fr (Nicolas Grekas)
So, I've read RFC 2047, and my previous proposals were not so good. I've rewritten my code; here is my new proposal. It should be more RFC compliant now. The contribution of mdv at inyourpocket dot com might be good too, but if there is a licensing issue coming from KMail, my version has none: I've written it from scratch.

=======================================
    function _encodeHeaders($input)
    {
        $ns = "[^\(\)<>@,;:\"\/\[\]\r\n]*";
        foreach ($input as $hdr_name => $hdr_value) {
            $input[$hdr_name] = preg_replace_callback(
                "/{$ns}(?:[\\x80-\\xFF]{$ns})+/",
                array($this, '_encodeHeaderWord'),
                $hdr_value
            );
        }
        return $input;
    }

    function _encodeHeaderWord($word)
    {
        $word = preg_replace('/[=_\?\x00-\x1F\x80-\xFF]/e', '"=".strtoupper(dechex(ord("\0")))', $word[0]);
        preg_match('/^( *)(.*?)( *)$/', $word, $w);
        $word =& $w[2];
        $word = str_replace(' ', '_', $word);
        $start = '=?' . $this->_build_params['head_charset'] . '?Q?';
        $offsetLen = strlen($start) + 2;
        $w[1] .= $start;
        while ($offsetLen + strlen($word) > 75) {
            $splitPos = 75 - $offsetLen;
            switch ('=') {
                case substr($word, $splitPos - 2, 1): --$splitPos;
                case substr($word, $splitPos - 1, 1): --$splitPos;
            }
            $w[1] .= substr($word, 0, $splitPos) . "?={$this->_eol} {$start}";
            $word = substr($word, $splitPos);
        }
        return $w[1] . $word . '?=' . $w[3];
    }
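The splitting step in the proposal above (keep every encoded-word within 75 characters and never cut through an "=XX" escape) can be looked at in isolation. Below is a small standalone sketch of just that step. The name foldQEncoded() and its parameters are invented for illustration; it expects text that is already Q-encoded and a charset name short enough to leave room for a payload. One known limitation: it does not keep the bytes of a multi-byte character together in the same encoded-word, which RFC 2047 also asks for.

```php
<?php
// Hypothetical helper: wrap already Q-encoded text into encoded-words of at
// most 75 characters each, joined by EOL + SPACE (header folding).
function foldQEncoded(string $q, string $charset, string $eol = "\r\n"): string
{
    $prefix = '=?' . $charset . '?Q?';
    $room   = 75 - strlen($prefix) - 2;   // 2 characters are reserved for "?="
    $words  = [];

    while (strlen($q) > $room) {
        $cut = $room;
        // In Q-encoded text "=" only ever starts an escape, so step back
        // if the cut would land inside "=XX".
        if ($q[$cut - 1] === '=') {
            $cut -= 1;
        } elseif ($q[$cut - 2] === '=') {
            $cut -= 2;
        }
        $words[] = $prefix . substr($q, 0, $cut) . '?=';
        $q = substr($q, $cut);
    }
    $words[] = $prefix . $q . '?=';

    return implode($eol . ' ', $words);
}

// Usage: 91 characters of payload end up in two encoded-words.
echo foldQEncoded('a' . str_repeat('=D0', 30), 'UTF-8'), "\n";
```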
 [2006-04-26 07:24 UTC] erik at forss dot se (Erik Jansson)
Until the problem is resolved, this code fixes my problems. I just use the class fMail_Mime instead of Mail_Mime:

    class fMail_Mime extends Mail_Mime
    {
        function _encodeHeaders($input)
        {
            foreach ($input as $hdr_name => $hdr_value) {
                preg_match_all('/(\w*[\x80-\xFF ]+\w*)/', $hdr_value, $matches);
                foreach ($matches[1] as $value) {
                    $replacement = preg_replace('/([\x80-\xFF ])/e', '"=" . strtoupper(dechex(ord("\1")))', $value);
                    if ($replacement == "=20") break;
                    $hdr_value = str_replace($value, '=?' . $this->_build_params['head_charset'] . '?Q?' . $replacement . '?=', $hdr_value);
                }
                $input[$hdr_name] = $hdr_value;
            }
            return $input;
        }
    }
 [2006-05-03 07:33 UTC] laurynas dot butkus at gmail dot com (Laurynas Butkus)
This bug tracking system does not display Baltic characters, so I made an HTML copy of this report (to be able to display the real test data): http://lauris.night.lt/php/mail_mime_bug_notes.html

Tested with the latest mime.php version from CVS (1.55 2006/04/28). I tried to send a message using UTF-8 Baltic letters in From: and in Subject.

Test data:
From: avardė <some@email.com>
Subject: U¸pildytčėįčęėsdfsčęqwzxą ęčėnaujėa Ųojalumo ¦rogramos Čanketa

When using the default head-encoding, Mail can't send the email and spits out an error:
Validation failed for: =?UTF-8?Q?avard=C4=97_?=

If I change head-encoding to 'base64', the mail is sent but the headers are messed up (in Thunderbird):
Subject: U¸pildytčėįčęėsdfsčęqwzxą ęčėnauj� =?UTF-8?B?==?=
"From" also does not look as it should.

I tried the other suggested fixes, and all of them have problems with my test data: either Mail spits out an error or the headers are messed up (even the body in one case). The least messed-up message comes from the fix suggested by Nicolas Grekas at 2006-04-05 14:38 UTC. Only the subject is cut down to "U¸pildytčėįčęėsdfsč��", but everything else looks OK.
 [2006-06-07 09:23 UTC] mp at webfactory dot de (Matthias Pigulla)
RFC 2047 section 5 states where encoded-words may appear. In particular, it says that "an 'encoded-word' MUST NOT appear in any portion of an 'addr-spec'." The current revision 1.56 in CVS is broken in this regard, as it simply encodes the whole header value. That's OK for headers defined as "*text" (e.g. Subject:), but not for headers like To: or From:. These will end up with the address being part of the encoded-word, confusing mail readers (not decoding the sender's name when displaying it, etc.)
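As an illustration of that rule, the sketch below encodes only the display name of an address header and leaves the addr-spec untouched. It is not taken from Mail_Mime; encodeAddressHeader() and its parameters are hypothetical, it always uses base64 for simplicity, and it assumes the caller has already split the header into a display name and an address. ASCII display names containing special characters would still need quoting, which is left out here.

```php
<?php
// Hypothetical helper: build "display-name <addr-spec>" where only the
// display name may become an encoded-word; the address is never encoded.
function encodeAddressHeader(string $displayName, string $addrSpec, string $charset): string
{
    if (preg_match('/[\x80-\xFF]/', $displayName)) {
        // 8-bit bytes present: wrap the name (and only the name) in a B encoded-word.
        $displayName = '=?' . $charset . '?B?' . base64_encode($displayName) . '?=';
    }

    return $displayName . ' <' . $addrSpec . '>';
}

// Usage with the From: test data from the 2006-05-03 comment.
echo encodeAddressHeader("avard\xC4\x97", 'some@email.com', 'UTF-8'), "\n";
```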
 [2007-02-27 21:23 UTC] stephen dot bigelis at gmail dot com (stephen bigelis)
I am posting this here because I couldn't find it anywhere else and it took me forever to figure out what was going wrong (and this page popped up all the time during my endless search for a solution).

Gmail does not like "\r\n". It will feed two lines and cause the email to appear as an attachment, because the boundary marker cannot be read. If you try to define MAIL_MIMEPART_CRLF as "\n", it still gets messed up. Removing the CRLF in the boundary line fixes this problem, but then the section headers also need to be adjusted. The best fix I can find is to edit encode() and _quotedPrintableEncode() in mimePart.php.

Here are my temporary fixes; does anyone know of a permanent one?

    function encode()
    {
        $encoded =& $this->_encoded;
        if (!empty($this->_subparts)) {
            srand((double)microtime()*1000000);
            $boundary = '=_' . md5(rand() . microtime());
            $this->_headers['Content-Type'] .= ';' . "\t" . 'boundary="' . $boundary . '"';
            // Add body parts to $subparts
            for ($i = 0; $i < count($this->_subparts); $i++) {
                $headers = array();
                $tmp = $this->_subparts[$i]->encode();
                foreach ($tmp['headers'] as $key => $value) {
                    $headers[] = $key . ': ' . $value;
                }
                $subparts[] = implode("\n", $headers) . MAIL_MIMEPART_CRLF . MAIL_MIMEPART_CRLF . $tmp['body'];
            }
            $encoded['body'] = '--' . $boundary . "\n" . implode('--' . $boundary . "\n", $subparts) . '--' . $boundary.'--' . "\n";
        } else {
            $encoded['body'] = $this->_getEncodedData($this->_body, $this->_encoding) . MAIL_MIMEPART_CRLF;
        }
        // Add headers to $encoded
        $encoded['headers'] =& $this->_headers;
        return $encoded;
    }

and in _quotedPrintableEncode just change $eol = "\n".

So far this is working; hopefully I will remember to update this when I work out a solid fix.