S3: Plus sign is interpreted as space in the path part of URLs

(forums.aws.amazon.com)

97 points · ysh7 · 8y ago · 46 comments

ryanbrunner · 8y ago

I don't necessarily think this is even breaking the HTTP standard. While '+' should not be interpreted as spaces as part of a URL while it's being treated as a URL, the HTTP spec doesn't specify / care what file that may map to on a server.

Edit: As mentioned below, this isn't correct since URLs should be able to be escaped and return the same resource, and an escaped + differs from an unescaped + on S3.

asdfaoeu · 8y ago

Sure but /%2B should resolve to the same thing as /+

ryanbrunner · 8y ago

Ah, fair enough, that's a good point.

jamix · 8y ago

Exactly! The OP's point is summarized in this sentence:

> My point is that the spec requires + to be escaped only inside the querystring.

So what? What the standard mandates for query strings is irrelevant here. It's up to the server how to interpret and map the URLs. "Unconventional and unfortunate" - yes, but breaking the HTTP spec? No.

zAy0LfpBZLC8mAC · 8y ago

Please read the actual spec before telling people whether something is conforming to it or not. Just making stuff up is exactly how this mess is created. The relevant section in this case:

https://tools.ietf.org/html/rfc3986#section-6.2.2.2

jchw · 8y ago

It breaks the HTTP spec because it internally is decoding the URL wrong. This is important because things that speak HTTP are free to choose to percent encode, or not, the plus sign in a path, and the canonical URL should not differ. If it mapped even an escaped plus to a space, it'd be consistent, though still questionable, behavior.
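A minimal Python sketch of the equivalence being argued here, using only the standard library. It is not a description of how S3 decodes anything; the helper names `decode_path` and `decode_form_value` are made up for illustration.

```python
from urllib.parse import unquote, unquote_plus


def decode_path(raw_path: str) -> str:
    """Percent-decode a URL path. '+' has no special meaning in a path."""
    return unquote(raw_path)


def decode_form_value(raw_value: str) -> str:
    """Decode a form-encoded value, where '+' does stand for a space."""
    return unquote_plus(raw_value)


# In a path, the escaped and unescaped plus decode to the same string.
print(decode_path("/a+b") == decode_path("/a%2Bb"))  # True

# Form decoding distinguishes them: a literal '+' becomes a space,
# while '%2B' becomes a plus sign.
print(repr(decode_form_value("a+b")))    # 'a b'
print(repr(decode_form_value("a%2Bb")))  # 'a+b'
```

Applying the form-style decoder to a path is what makes `/a+b` and `/a%2Bb` end up naming different resources.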

mfer · 8y ago

tl;dr: A legacy behavior is to treat + as a space. When you've been around you need to keep backwards compatibility.

URLs and URIs have separate standards from HTTP and they have changed over time (been replaced by newer ones).

Many years ago it was common to encode a space as a + sign. For example, the PHP function urlencode[1] does the same thing with a + sign. If you're a PHP user, don't use this function unless you know you need to. There are better functions now.

[1] http://php.net/manual/en/function.urlencode.php

brlewis · 8y ago

When was + treated as space in the path part of the URL? Sure it's been treated as space in the query part, but that would be a weird breaking change if early web treated path and query the same way, and then later standards made them different.

mfer · 8y ago

At the time S3 launched the URL spec was RFC 1738[1] and we had HTML 4.01[2]. And, the URI syntax (all the way back in 1998) noted to use %20 for a space[3].

As far as I can tell, this traces its history back to encoding for forms[4]. It's been used far beyond the encoding for forms and maybe someone can explain why.

It's also not just PHP whose function is that way. In Python, urlencode encodes a space as a + (at least in 2.7.x).

I remember working on the web many years ago where "+" is what was used. This may have been a spec misinterpretation or something else. In any case, it was common enough.

Note, I'm not saying it was right. Just not uncommon.

[1] https://www.ietf.org/rfc/rfc1738.txt
[2] https://www.w3.org/TR/html401/
[3] https://www.ietf.org/rfc/rfc2396.txt
[4] https://www.w3.org/TR/html4/interact/forms.html#h-17.13.4.1
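For a concrete look at the two conventions, here is a small sketch using Python 3's `urllib.parse` (the comment above refers to 2.7.x, where the same functions lived in the `urllib` module). The wrapper name `encode_space_variants` is hypothetical.

```python
from urllib.parse import quote, quote_plus, urlencode


def encode_space_variants(text: str) -> dict:
    """Encode the same text with path-style and form-style rules."""
    return {
        "quote": quote(text),            # path-style: space -> %20
        "quote_plus": quote_plus(text),  # form-style: space -> +
    }


print(encode_space_variants("a b"))
# {'quote': 'a%20b', 'quote_plus': 'a+b'}

# Both escape a literal plus sign, so it cannot be confused with a space.
print(encode_space_variants("a+b"))
# {'quote': 'a%2Bb', 'quote_plus': 'a%2Bb'}

# urlencode builds a query string and uses the form-style rule by default.
print(urlencode({"q": "a b"}))                   # q=a+b
print(urlencode({"q": "a b"}, quote_via=quote))  # q=a%20b
```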

jordanlev · 8y ago

> If you're a PHP user, don't use this function unless you know you need to. There are better functions now.

Don't leave me hanging! What are the better functions now?

godDLL · 8y ago

`rawurlencode()` is what you're after.

And here is where you'd ask that question, a coding forum https://stackoverflow.com/questions/996139/urlencode-vs-rawu...

Crespyl · 8y ago

Knowing PHP's standard library, probably something like "urlencode_safe_for_real_this_time".

Kidding aside, IIRC "rawurlencode" is the RFC compliant one.

marindez · 8y ago

It's reasonable they don't want to fix it because it will break existing URLs. Welcome to the ugly world of back compatibility.

Liquid_Fire · 8y ago

They could make it configurable on a per bucket basis (perhaps defaulting to the old behaviour if necessary; ideally you would make the conformant behaviour the default, of course).

That way you could opt in to the standard conformant behaviour if you require it, but they can still keep backward compatibility.

majewsky · 8y ago

I'm not familiar with how S3 works in detail, but I imagine this could require additional API calls in the backend which increases the latency and resource usage of API requests. In the worst case, such a change could easily require Amazon to purchase dozens, if not hundreds of additional servers.

lallysingh · 8y ago

Or use a different domain for the same buckets, and resolve the name correctly on this new domain.

Coding_Cat · 8y ago

They could compromise by adding a few more lines of code and having '+' resolve to ' ' if and only if the file can't be found with '+', or vice versa.

Immediately mark this behaviour as deprecated and switch over to proper '+' == '+' behaviour later.

edit: Liquid_Fire's idea is better.
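The fallback lookup could be sketched roughly like this. It is purely illustrative: a plain `dict` stands in for the object store, and `resolve_key` is a hypothetical helper, not anything S3 exposes.

```python
from typing import Optional


def resolve_key(store: dict, requested: str) -> Optional[str]:
    """Return the stored key that a requested key should map to.

    Prefer the literal key (so '+' means '+'); fall back to the legacy
    reading, where '+' means a space, only if the literal key is absent.
    """
    if requested in store:
        return requested
    legacy = requested.replace("+", " ")
    if legacy in store:
        return legacy
    return None


store = {"a+b.txt": b"plus", "c d.txt": b"space"}
print(resolve_key(store, "a+b.txt"))  # a+b.txt (literal match wins)
print(resolve_key(store, "c+d.txt"))  # c d.txt (legacy fallback)
print(resolve_key(store, "x+y.txt"))  # None
```

In a single process this is two dictionary lookups; the sketch says nothing about what the second lookup would cost in a distributed store.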

viraptor · 8y ago

That would require synchronisation, potentially between multiple servers. Doing this efficiently, without race conditions could be very tricky at their size.

bmn__ · 8y ago

In response to the reported RFC violation, elving@AWS writes: "I agree that's unconventional and unfortunate." My corporate bullshit detector is off the scale.

In earlier times, we would have both the ability and the balls to treat that unwillingness to uphold the rules we all set out with as damage to the Internet, and route around it. But sadly, AWS has become too big to fail, so the engineers introduce special cases into their products and deploy them.

mmahemoff · 8y ago

To the contrary, I think it's actually a refreshingly honest response. A "corporate bullshit" response would be to ignore it altogether, try to argue it's a feature not a bug, or give a canned statement about how we respect the environment and want the world to be a better place.

The AWS support is explicitly acknowledging it's an issue, while giving a rational reason why it probably won't be fixed (even if you disagree with the reason). The back-compat concern is unfortunate but a good argument can be made it's not in users' interests either (beyond being just a cost to AWS to implement the change).

cm2187 · 8y ago

But can they even change it without risking breaking tens of thousands of websites?

jrochkind1 · 8y ago

Eh, URL/URI escaping is an interesting example, because people have been doing inconsistent and sometimes standards-problematic things with it pretty much as long as URLs have existed. And indeed it's been a perpetual pain and problem. (For just one example, read up on `&` and `;`, and whether `&` can/should/must be escaped in what contexts; that's not the only one, `+` is another long-standing one.) So it's not a great example of how everyone used to always be consistently standards-compliant in "earlier times", more like a counter-example. I don't think it's unique: my experience is not that everything on the web used to be more consistent and standards-compliant in "earlier times" than it is now; if anything, the reverse.

kinkrtyavimoodh · 8y ago

How is this 'corporate bullshit'? Corporate BS is about giving vague circumlocutionary responses that try to just press all the right PR buttons.

This is the opposite of that.

viraptor · 8y ago

When were these "earlier times" for the web? Not during browser wars, that's for sure. Not when web2.0 started with crazy ideas about rest. Not during flash-everywhere era. Etc...

tazjin · 8y ago

Amazon has a difficult time with the HTTP standard sometimes. Last time I had to touch an AWS project we discovered a bug[1] in the C++ code backing a Java library (sic).

They had implemented their own HTTP client, but forgot to add the "Host" header to requests which is required by HTTP 1.1.

Interestingly this client sent requests only to their own services, which means that they either released that without testing it or the backend once accepted faulty requests.

[1]: https://github.com/awslabs/amazon-kinesis-producer/issues/61

hnlmorg · 8y ago

It's common for HTTP servers to accept requests without a host header. It's not usually needed by the server unless you're hardening it (I don't class it as a security issue but some security audits will flag it up if you don't force the server to reject invalid host headers) or running named virtual hosts (which is more common than it used to be thanks to SNI but you still often see a 1:1 relationship between (virtual) hosts and IPs). So Amazon could easily have tested their client on 3rd party servers and still not spotted the problem.

As an anecdote, about 15 to 20 years ago I wrote my own web browser. Obviously it was highly rudimentary, although browsers were much easier to implement back then anyway. I was too lazy to read the HTTP spec (it was a hobby project and I was young and impatient), so a lot of what I did was trial and error. I wasn't sending a Host header either, but it took a long while before I ran into any sites that rejected my HTTP requests. The web landscape was very different back then, and IPs were plentiful, but it just goes to show how servers have coded around bad clients for years.

tazjin · 8y ago

> So Amazon could easily have tested their client on 3rd party servers and still not spotted the problem

This would still be a red flag, as the service in question is their instance metadata service that provides authentication tokens.

Something that important should be integration-tested with the actual service.

mmahemoff · 8y ago

(Update - the original title mentioned AWS has been breaking standards since 2010. The new title is fine. Thanks for updating it.)

Little bit of hyperbole in the title imo. S3 has generally been very good at embracing the fundamental principles of HTTP and REST, leaving aside corner cases like this.

majewsky · 8y ago

I don't see any hyperbole. The title is technically correct (the best kind of correct), although questionable from a grammar standpoint.

_pmf_ · 8y ago

Should this really be considered to be a spec violation? It's a restriction, sure, but S3 is to be considered a specific application with specific constraints.

rkeene2 · 8y ago

Does S3 use HTTP? If so, it's a violation of the specification of S3 by way of incorporation of the HTTP specification.

Otherwise, if S3 does not use HTTP, we would need to see the S3 specification to determine whether it (the implementation Amazon uses) is in violation.

gldalmaso · 8y ago

Does anyone know if this behavior persists when the bucket is served as a website?

ComputerGuru · 8y ago

Anyone that's dealt with S3 in any capacity should be aware of this, it's literally one of the first encoding problems to come up when dealing with signing requests.

@dang can you please add (2010) to the title?

tolmasky · 8y ago

Funnily enough, "URLs and plus signs" is still my most upvoted question on Stack Overflow ( https://stackoverflow.com/questions/1005676/urls-and-plus-si... ) -- same a+b example too. 7 years later, it seems even the big names have issues with this.

mike503 · 8y ago

This burned me and because of it I can't host a specific static site on S3 because it requires plus signs. Can't change the files being uploaded due to the system generating them... tried to rig up some sort of Akamai rewrite rule to change it at the CDN level but couldn't get it to work.

pawelkomarnicki · 8y ago

I changed the "+" into its escaped code :-) It helped.
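A rough sketch of that workaround in Python: percent-escape the key before putting it in the URL path, so a literal plus sign goes out as `%2B`. The `object_url` helper and the bucket hostname are made up for illustration, and this does not cover request signing.

```python
from urllib.parse import quote


def object_url(base_url: str, key: str) -> str:
    """Build a URL for an object key, escaping everything except '/'.

    quote() leaves letters, digits and '_.-~' alone and escapes the rest,
    so '+' becomes '%2B' and a space becomes '%20'.
    """
    return base_url.rstrip("/") + "/" + quote(key, safe="/")


# Hypothetical bucket hostname, for illustration only.
base = "https://example-bucket.s3.amazonaws.com"
print(object_url(base, "reports/a+b.txt"))
# https://example-bucket.s3.amazonaws.com/reports/a%2Bb.txt
print(object_url(base, "reports/a b.txt"))
# https://example-bucket.s3.amazonaws.com/reports/a%20b.txt
```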
