
Bug #20425 Incomplete percent-encoding of userinfo, path and query
Submitted: 2014-10-09 06:23 UTC
From: pracj3am Assigned:
Status: Open Package: Net_URL2 (version 2.0.9)
PHP Version: Irrelevant OS:
Roadmaps: (Not assigned)    


 [2014-10-09 06:23 UTC] pracj3am (Jan Prachar)
Description:
------------
When parsing a URI, invalid characters are percent-encoded in the userinfo, path and query parts (method _encodeData). But according to RFC 3986 more characters should be percent-encoded, such as [ ] | ` { }. Concretely, this is the whole set:

[\x00-\x20\x22\x3C\x3E\x5B-\x5E\x60\x7B-\x7D\x7F-\xFF]

The same characters should also be percent-encoded in the fragment part.

Test script:
---------------
echo (new Net_URL2('http://user[1]@example.com/p\s/|" ?{}#^'))->getUrl();

Expected result:
----------------
http://user%5B1%5D@example.com/p%5Cs/%7C%22%20?%7B%7D#%5E

Actual result:
--------------
http://user[1]@example.com/p\s/|%22%20?{}#^
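
For illustration, the requested behavior could be sketched roughly as follows. This is not Net_URL2 code: encodeInvalidChars is a hypothetical helper, and it assumes the character class above is meant to read [\x00-\x20\x22\x3C\x3E\x5B-\x5E\x60\x7B-\x7D\x7F-\xFF]. It is applied to one component at a time, so separators such as @, ? and # are never passed through it.

```php
<?php
// Hypothetical helper, not part of Net_URL2: percent-encode every byte of a
// single URI component that falls into the set named in the report.
// "%" is not in the set, so existing percent-escapes are left untouched.
function encodeInvalidChars($component)
{
    return preg_replace_callback(
        '/[\x00-\x20\x22\x3C\x3E\x5B-\x5E\x60\x7B-\x7D\x7F-\xFF]/',
        function ($m) {
            // rawurlencode() turns a single byte into an uppercase %XX escape
            return rawurlencode($m[0]);
        },
        $component
    );
}

// Components of the test URI from the report, already split by hand.
echo encodeInvalidChars('user[1]'), "\n";   // userinfo
echo encodeInvalidChars('/p\s/|" '), "\n";  // path
echo encodeInvalidChars('{}'), "\n";        // query
echo encodeInvalidChars('^'), "\n";         // fragment
```

Because the pattern has no /u modifier it works byte by byte, so each byte of a multi-byte UTF-8 sequence gets its own escape.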

Comments

 [2014-10-09 15:46 UTC] tkli (Tom Klingenberg)
IIRC, that special handling was added to align the handling of wrong input with how browsers treat URIs. Strictly speaking, Net_URL2 expects those parts to be correctly encoded already. The extra handling is meant to make it more robust, so that Net_URL2 can also accept URIs that browsers accept, without running into double-encoding problems.

The example URI you give,

http://user[1]@example.com/p\s/|" ?{}#^

is turned by Chromium into the following effective request URI (the fragment is kept in the client):

http://user%5B1%5D@example.com/p/s/%7C%22%20?{}

This is similar to what Net_URL2 already does:

http://user[1]@example.com/p\s/|%22%20?{}#^

The differences I see are the square brackets, the slash correction and the pipe symbol. Angle brackets do not need to be converted, and converting the question mark would result in data loss, because it is a separator.

There is a documentation problem, however: the docblock of Net_URL2::_encodeData does not mention the userinfo part:

 * Encode characters that might have been forgotten to encode when passing
 * in an URL. Applied onto Path and Query.

As with any fuzzy logic, this method is a best guess. When I introduced it, I checked it against browser behavior. Re-checking it now and seeing the differences to Chromium, I can't say why I didn't cover square brackets, for example.

It's perhaps best to research browser behavior again and list the results together with the test URIs. I might still have some notes about that on one computer or another, and may be able to gather them later on.
 [2014-10-09 17:38 UTC] pracj3am (Jan Prachar)
I also experimented with different browsers. For example, the following URL

'http://example.com/ "<>[]\{}|`^? "<>[]\{}|`^'

is turned by Chromium into

GET /%20%22%3C%3E[]/%7B%7D%7C%60%5E?%20%22%3C%3E[]\{}|`^

and by Firefox into

GET /%20%22%3C%3E%5B%5D%5C%7B%7D|%60%5E?%20%22%3C%3E[]\{}|%60^

So in the path component Chromium encodes everything except square brackets and the backslash (which is turned into a slash), while Firefox encodes everything but |. In the query component both are quite permissive.

Note that not encoding square brackets was reported as a bug in Firefox and fixed recently, see https://bugzilla.mozilla.org/show_bug.cgi?id=473822

Anyway, I think you cannot do any harm if you encode all invalid characters.
 [2014-10-09 18:14 UTC] tkli (Tom Klingenberg)
That's good info. I think we should build a matrix specifying which part (userinfo, host, path, query, fragment) should deal with which characters. E.g. the Firefox issue you refer to is about the query, if I understood it correctly. We can then put it to a test and have it properly specified. This should make clear what the intent is and how it was solved.
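
As a starting point for such a matrix, here is a minimal sketch based only on a strict reading of the RFC 3986 ABNF, not on the browser comparison discussed in this thread. The $allowed table and encodeComponent are hypothetical names; the host part is left out because IP literals need their own rules.

```php
<?php
// Hypothetical table for discussion only: literal characters RFC 3986
// permits in each component (written as regex character-class fragments).
$unreserved = 'A-Za-z0-9\-._~';
$subDelims  = '!$&\'()*+,;=';
$allowed = array(
    'userinfo' => $unreserved . $subDelims . ':',
    'path'     => $unreserved . $subDelims . ':@\/',
    'query'    => $unreserved . $subDelims . ':@\/?',
    'fragment' => $unreserved . $subDelims . ':@\/?',
);

// Percent-encode every byte of $value that the given part does not allow.
// "%" is always kept so that existing escapes are not double-encoded.
function encodeComponent($value, $part, array $allowed)
{
    $pattern = '/[^%' . $allowed[$part] . ']/';
    return preg_replace_callback(
        $pattern,
        function ($m) {
            return rawurlencode($m[0]);
        },
        $value
    );
}

echo encodeComponent('user[1]', 'userinfo', $allowed), "\n";
echo encodeComponent('/p\s/|" ', 'path', $allowed), "\n";
echo encodeComponent('a=b&c={}', 'query', $allowed), "\n";
```

Each row could later be narrowed or widened once the browser results are collected, and the same table could drive the unit tests.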
 [2014-10-09 18:24 UTC] tkli (Tom Klingenberg)
Colons in the path perhaps shouldn't be translated, for interoperability reasons: http://en.wikipedia.org/wiki/File_URI_scheme#Windows_2
 [2014-10-10 04:23 UTC] tkli (Tom Klingenberg)
At least the documentation problem will be resolved in the upcoming 2.0.10 release (just around the corner).
 [2014-11-24 15:18 UTC] pracj3am (Jan Prachar)
Do you need any help?
 [2014-11-27 15:52 UTC] tkli (Tom Klingenberg)
Hi Jan, some help would be great. A structured compilation of which browser auto-converts which characters (and in which part) would be great, so that it is easier to review (and implement) an accepted / safe way to deal with "readable" URLs as input. Any other feedback is always welcome, too!
 [2014-12-29 16:57 UTC] tkli (Tom Klingenberg)
Hi Jan, I have now found out more about the issue and how RFC 3986 covers it. I've looked into it for the query part. It is possible to leave certain characters unencoded even though they are reserved. This basically works for any character that is not otherwise defined as a separator for that part. However, a URI class should not simply change these, because the resulting URIs are not equivalent / identical: encoding or not encoding such a character can make a difference. I've also checked this quickly with Chromium, and the URI is not changed in these parts. This is an additional implication that I think Net_URL2 should handle as well.

While looking into this I've also identified a flaw in URI normalization. I don't think it's grave, but technically it's a flaw: per the ABNF, many parts already allow more characters from the reserved set, but normalization does a urlencode(urldecode()), which does not take the concrete rules into account. I'm thinking about fixing this first.
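
The effect of such a round trip can be seen with plain PHP functions. The following is only a sketch: roundTrip is a hypothetical stand-in, and Net_URL2's actual normalization code is not reproduced here.

```php
<?php
// Hypothetical stand-in for a urlencode(urldecode()) style normalization.
function roundTrip($part)
{
    return urlencode(urldecode($part));
}

// ":" and "@" may appear literally in a path segment or query per the ABNF,
// and "a:b@c" is not equivalent to "a%3Ab%40c" because both are reserved.
$literal = roundTrip('a:b@c');
$escaped = roundTrip('a%3Ab%40c');

echo $literal, "\n";
echo $escaped, "\n";
var_dump($literal === $escaped);
```

Both inputs end up as the same string, so the distinction between the literal and the escaped reserved characters is lost.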