Bug #20425 :: Incomplete percent-encoding of userinfo, path and query


Bug #20425	Incomplete percent-encoding of userinfo, path and query
Submitted:	2014-10-09 06:23 UTC
From:	pracj3am	Assigned:
Status:	Open	Package:	Net_URL2 (version 2.0.9)
PHP Version:	Irrelevant	OS:
Roadmaps:	(Not assigned)


[2014-10-09 06:23 UTC] pracj3am (Jan Prachar)

Description:
------------
When parsing a URI, invalid characters are percent-encoded in the userinfo, path 
and query parts (method _encodeData). But according to RFC 3986 more characters should be 
percent-encoded, such as [ ] | ` { }. Concretely, this is the whole set:
[\x00-\x20\x22\x3C\x3E\x5B-\x5E\x60\x7B-\x7D\x7F-\xFF]

The same characters should also be percent-encoded in the fragment part.
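
A minimal sketch of what encoding that set could look like. It assumes the byte ranges are the ones the report intends (written out with full \x escapes), and `encodeInvalidBytes` is a hypothetical helper, not part of Net_URL2. Because `%` is not in the set, existing escapes are left alone and nothing gets double-encoded.

```php
<?php
// Hypothetical helper (not part of Net_URL2): percent-encode every byte in
// the set from the report. "%" is not in the set, so existing %XX escapes
// pass through unchanged.
function encodeInvalidBytes(string $part): string
{
    return preg_replace_callback(
        '/[\x00-\x20\x22\x3C\x3E\x5B-\x5E\x60\x7B-\x7D\x7F-\xFF]/',
        static function (array $m): string {
            // rawurlencode() turns a single byte into its uppercase %XX form.
            return rawurlencode($m[0]);
        },
        $part
    );
}

// Applied per part, so separators such as "@", "?" and "#" are never touched.
echo encodeInvalidBytes('user[1]'), "\n";
echo encodeInvalidBytes('/p\s/|" '), "\n";
```

Applying it to each part separately, after splitting the URI, keeps the separators intact.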

Test script:
---------------
echo (new Net_URL2('http://user[1]@example.com/p\s/|" ?{}#^'))->getUrl();

Expected result:
----------------
http://user%5B1%5D@example.com/p%5Cs/%7C%22%20?%7B%7D#%5E

Actual result:
--------------
http://user[1]@example.com/p\s/|%22%20?{}#^

Comments

[2014-10-09 15:46 UTC] tkli (Tom Klingenberg)

IIRC that special handling was added to align the handling of malformed input with how browsers 
treat URIs. Strictly speaking, Net_URL2 expects those parts to be correctly encoded already. 
The special handling is meant to make it more robust, so that Net_URL2 can accept URIs that 
browsers accept as well, without running into double-encoding problems:

The example URI you give:

    http://user[1]@example.com/p\s/|" ?{}#^

is, for example, turned by Chromium into the following effective request URI (the fragment 
is kept in the client):

    http://user%5B1%5D@example.com/p/s/%7C%22%20?{}

This is similar to how Net_URL2 already does it:

    http://user[1]@example.com/p\s/|%22%20?{}#^

The differences I see are the square brackets, the slash correction and the pipe symbol. 

Angle brackets do not need to be converted, and converting the question mark would result in 
data loss because it is the separator.

There is a documentation problem, however, because the docblock of Net_URL2::_encodeData 
does not cover the userinfo part:

     * Encode characters that might have been forgotten to encode when passing
     * in an URL. Applied onto Path and Query.

As with any fuzzy logic, this method is a best guess. When I introduced it, I checked it against 
browser behavior. Re-checking it now and seeing the differences from Chromium, I can't say why 
I didn't cover square brackets, for example.

It's perhaps best to research browser behavior again and list it, including the results and the test URIs.

I might still have some notes about that on one computer or another. I might be able to gather them 
later on.

[2014-10-09 17:38 UTC] pracj3am (Jan Prachar)

I also experimented with different browsers. For example, the following URL
'http://example.com/ "<>[]\{}|`^? "<>[]\{}|`^'

Chromium turns into
GET /%20%22%3C%3E[]/%7B%7D%7C%60%5E?%20%22%3C%3E[]\{}|`^

Firefox
GET /%20%22%3C%3E%5B%5D%5C%7B%7D|%60%5E?%20%22%3C%3E[]\{}|%60^

So in the path component Chromium encodes everything except square brackets and the backslash (which is turned into a slash), while Firefox encodes everything but |. In the query component both are quite permissive.

Notice that not encoding square brackets was reported as a bug in Firefox and fixed recently, see https://bugzilla.mozilla.org/show_bug.cgi?id=473822

Anyway, I think you cannot do any harm if you encode all invalid characters.

[2014-10-09 18:14 UTC] tkli (Tom Klingenberg)

That's good info.

I think we should draw up a matrix specifying which part (userinfo, host, path, query, fragment) 
should deal with which characters.

E.g. the Firefox issue you refer to is about the query if I grasped it right.

We can then put it to a test and have it properly specified. This should make clear what the 
intent is and how it was solved.
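
One possible shape for such a matrix, as a sketch only: it takes the RFC 3986 ABNF as the specification (browsers deviate from it, as the results above show), leaves out the host, and encodes every byte a part does not allow. The names `allowedCharsMatrix` and `encodePart` are hypothetical and not part of Net_URL2.

```php
<?php
// Character sets from the RFC 3986 ABNF, written as regex class fragments.
const UNRESERVED = 'A-Za-z0-9\-._~';
const SUB_DELIMS = '!$&\'()*+,;=';

// Hypothetical matrix: URI part => characters that may stay unencoded.
function allowedCharsMatrix(): array
{
    $pchar = UNRESERVED . SUB_DELIMS . ':@';
    return [
        'userinfo' => UNRESERVED . SUB_DELIMS . ':',
        'path'     => $pchar . '\/',
        'query'    => $pchar . '\/?',
        'fragment' => $pchar . '\/?',
    ];
}

// Hypothetical helper: encode every byte the matrix does not allow for a part.
function encodePart(string $part, string $value): string
{
    $allowed = allowedCharsMatrix()[$part];
    // Well-formed %XX triplets match first and are passed through unchanged.
    $pattern = '/%[0-9A-Fa-f]{2}|[^' . $allowed . ']/';
    return preg_replace_callback(
        $pattern,
        static function (array $m): string {
            return strlen($m[0]) === 3 ? $m[0] : rawurlencode($m[0]);
        },
        $value
    );
}

echo encodePart('userinfo', 'user[1]'), "\n";
echo encodePart('path', '/p\s/|" '), "\n";
```

Keeping the matrix as plain data means each row can double as a test case, and rows can be relaxed one by one where browser behavior is preferred over the ABNF.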

[2014-10-09 18:24 UTC] tkli (Tom Klingenberg)

Colons in the path perhaps shouldn't be encoded, for interoperability reasons:

http://en.wikipedia.org/wiki/File_URI_scheme#Windows_2

[2014-10-10 04:23 UTC] tkli (Tom Klingenberg)

at least the documentation problem will be resolved in the next 2.0.10 release (just around the 
corner).

[2014-11-24 15:18 UTC] pracj3am (Jan Prachar)

Do you need any help?

[2014-11-27 15:52 UTC] tkli (Tom Klingenberg)

Hi Jan,

some help would be great. A structured compilation of which browser auto-converts which 
characters in which part would be great, so it is easier to review (and implement) an 
accepted / safe way to deal with "readable" URLs as input.

Additionally any kind of feedback is always welcome!

[2014-12-29 16:57 UTC] tkli (Tom Klingenberg)

Hi Jan, 

I could now find out more about the issue and how RFC 3986 covers it. I've looked into it for 
the query part.

It is possible to leave certain characters un-percent-encoded even though they are reserved. 
This basically works for any reserved character that is not defined as a separator for that part.

However, a URI class should not just change these, because the resulting URIs are not 
equivalent / identical. So encoding or not encoding such a character can make a difference.

I've also checked this quickly with Chromium, and the URI is not changed in these parts.

This is an additional implication I think Net_URL2 should handle as well.

I've also identified a flaw in URI normalization while looking into this. I don't think it's 
grave, but technically it's a flaw: per the ABNF, many parts already allow more characters 
from the reserved set, but normalization does an urlencode(urldecode()), which does not take 
the concrete rules into account. I'm thinking about fixing this first.
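
To illustrate the normalization concern, here is a sketch of a narrower alternative. It follows the percent-encoding normalization rules of RFC 3986 (decode only escapes of unreserved characters, uppercase the hex digits of the rest) and is not the fix that went into Net_URL2. `normalizePercentEncoding` is a hypothetical name.

```php
<?php
// Hypothetical helper (not Net_URL2 code): decode only escapes of unreserved
// characters and uppercase the hex digits of all other escapes. Reserved
// characters keep whichever form (literal or escaped) they arrived in.
function normalizePercentEncoding(string $part): string
{
    return preg_replace_callback(
        '/%([0-9A-Fa-f]{2})/',
        static function (array $m): string {
            $char = chr(hexdec($m[1]));
            if (preg_match('/^[A-Za-z0-9\-._~]\z/', $char) === 1) {
                return $char;
            }
            return '%' . strtoupper($m[1]);
        },
        $part
    );
}

$query = 'a=b&c%26d%7e';
// The round trip escapes the literal "=" and "&", so the separator "&" can no
// longer be told apart from the escaped data "%26".
echo urlencode(urldecode($query)), "\n";
// The narrower rule keeps that distinction.
echo normalizePercentEncoding($query), "\n";
```

The round trip treats literal and escaped reserved characters as the same thing, which is exactly the non-equivalence described above.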