As of Ruby 1.9, there's a pretty sensible solution to this in the language, and I haven't had any encoding problems in some time. I appreciate I might have missed something though!
I assume that with "Ruby 1.9 solution", you refer to the fact that Ruby source code is by default evaluated as UTF-8, right?
That's definitely a good thing, but in Python 3 that wasn't the only change brought to the language.
I said "if Ruby ever decides to fix" because the need for a change is not obvious and not universally accepted: it's basically the same kind of issue as automatic type coercion (aka weak vs. strong typing) and early versus late raising/handling of exceptions.
Basically: in Python 2 and Ruby you have one or two string types (runtime types; in this discussion I only care about those), with Ruby's strings internally tagged with the encoding they use. In Python 3 the types are bytes ("anything goes" binary strings, the old Python 2 str) and unicode str (whose actual internal encoding is an implementation detail).
The problem (if you agree that it is one) is that you can easily mix and match them, and everything will work fine only as long as the operation makes sense; once it no longer does, you'll get an exception.
This is a problem when you don't completely control the type/encoding of your input (e.g. when handling an HTTP request whose string depends on the type/charset specified in its Content-Type header).
A dumb example of what could happen:
a = "Ё".force_encoding("ISO-8859-5")    # UTF-8 bytes of "Ё", re-tagged as ISO-8859-5
a2 = "®".force_encoding("ISO-8859-1")   # UTF-8 bytes of "®", re-tagged as ISO-8859-1
# a + a2 raises Encoding::CompatibilityError: both strings contain
# non-ASCII bytes, and their encodings are incompatible
A similar thing can happen in Python 2, while Python 3 rejects the same operation as soon as the types come into contact with each other (still at runtime, but it's like doing `1+"1"` in Python or Ruby: you'll spot it right away). I wrote a fairly lengthy blog post about this change in Python 3, but I haven't translated it into English yet; if there's some interest I could try to do it a bit sooner.
Anyhow, I don't want to start a flame war or anything like that. I just wanted to explain why the Python 3 choice was made, and why a destructive change might have had its merits. While I prefer Python 3's approach, and I'm definitely not a Ruby developer, I still appreciate these updates to Ruby: for example, I actually touched the internal encoding-handling code of Ruby (Rubinius) first hand some time ago with a friend of mine: http://git.io/7kM4Gw and I can benefit from the new GC code in newer Rubies, which makes Metasploit 4 times faster to load.
In contrast, Python 3 changed the meaning of "foo". It also supported only u"foo" in Python 2 (to opt-in to unicode strings) and only b"foo" in Python 3 (to opt-in to byte-strings) for a fairly long period of time, making it extremely, extremely awkward (at best) to write a program with shims as abstractions that let most of the program remain oblivious to the differences.
Python 3.3 and 2.7 finally landed a lot of fixes to this kind of problem, but they landed fairly late, after most of the community had already formed a sense of the relative difficulty of a transition to Python 3 that maintained support for Python 2 at the same time.
Both Ruby and JavaScript have taught me the value of a transition path to a new version that allows people to write libraries that support both the old and new version at the same time. Communities move a little at a time, especially long-term production projects. The best way to move them is via libraries that can serve as a bridge and target both the old and new version together.
This was a pretty major change; I think I'd call it a 'destructive' change. It was indeed a big pain upgrading apps from Ruby 1.8 to 1.9, and character encoding was generally the major issue.
I'm not sure I understand what you're saying about python 2 vs 3, or what you think needs to be changed in ruby. If I understand right, you're saying that it ought to be guaranteed to raise if you try to concatenate strings with different encoding.
Instead, at present, for encodings that are ASCII-compatible (which is most encodings), Ruby will let you concatenate as long as at least one of the two strings is composed entirely of ASCII characters; otherwise it will raise.
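For illustration, here's a sketch of that behavior (the strings and encodings are just examples):

```ruby
# Both strings contain only ASCII characters, so Ruby considers their
# (ASCII-compatible) encodings compatible and concatenates happily.
ascii_latin = "foo".force_encoding("ISO-8859-1")
ascii_utf8  = "bar".force_encoding("UTF-8")
both = ascii_latin + ascii_utf8          # "foobar", no exception

# Give each side a non-ASCII character and the same operation raises.
latin = "caf\xE9".force_encoding("ISO-8859-1")   # "café" as Latin-1 bytes
utf8  = "Ё"                                      # UTF-8 literal
raised = false
begin
  latin + utf8
rescue Encoding::CompatibilityError
  raised = true
end
puts raised          # => true
```

So whether a given concatenation blows up depends on the bytes that happen to be in the strings, not just on their declared encodings.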
I think you're probably (although I'm not 100% confident) right that it would be better to 'fail fast' and always raise, requiring explicit treatment of encodings, instead of depending on the nature of the arguments (which may have come from I/O), which makes bugs less predictable. There continues to be a lot of confusion about how char encodings work in rubyland, and it's possible a simpler model would be less confusing (although I suspect char encoding issues are confusing to some extent no matter what, by their nature).
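That explicit treatment would amount to normalizing everything to one encoding at the boundary, e.g. (a sketch; the inputs are simulated):

```ruby
# Simulated input bytes arriving in two different encodings.
latin = "caf\xE9".force_encoding("ISO-8859-1")   # "café" in Latin-1
cyr   = "Ё".encode("ISO-8859-5")                 # "Ё" in ISO-8859-5

# Fail-fast style: transcode explicitly up front, so later operations
# can never hit Encoding::CompatibilityError by surprise.
joined = latin.encode("UTF-8") + cyr.encode("UTF-8")
puts joined                                      # => caféЁ
```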
In general, even as it is, I find dealing with char encodings more sane in ruby (1.9+) than any other language I've worked in (but I haven't worked in python).
If Ruby ever decides to make things even more strict, I don't think it'll actually be as disruptive as the 1.8 to 1.9 transition. Anyone who ever deals with not-entirely-ASCII text (and who doesn't?) basically already had to deal with the issue. Ruby was trying to make the transition easier on the developer by allowing some circumstances where it would let you get away with being sloppy with encodings. I'm not sure it succeeded in making anything easier; the transition was pretty challenging anyway, and "fail fast" might actually have been easier. I think I agree, if that's what you're saying.
I don't know enough about Python to have an answer, but I continue to be curious about what differences resulted in the entire Ruby community pretty much coming along on the 1.8 to 1.9 jump (and subsequent, less disruptive jumps), while the Python community seems to have had more of a split. I don't know if it was helped by Ruby's attempt to make the encoding switch less painful with its current behavior, or if it's as simple as the 800-pound gorilla of Rails being able to make the community follow in Ruby-land.
Whoops, you're right... I mixed up the versions. What I had in mind is "source code as UTF-8 by default", which wasn't introduced in Ruby 1.9 but in Ruby 2.0.
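Right: on Ruby 1.9 a non-ASCII literal still needed a magic comment to declare the source encoding; only 2.0 made UTF-8 the default. A minimal illustration (the comment is redundant on 2.0+ but mandatory on 1.9):

```ruby
# encoding: utf-8
# On Ruby 1.9 the magic comment above is required for the literal below
# (otherwise: "invalid multibyte char (US-ASCII)"); on 2.0+ UTF-8 is
# already the default source encoding.
s = "Ё"
puts s.encoding      # => UTF-8
```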
> If ruby ever decides to make things even more strict, I don't think it'll actually be as disruptive as the 1.8 to 1.9 transition.
Admittedly, I almost never touched Ruby 1.8, so I've no idea how hard the transition from it actually was.
I'm under the impression that before 1.9, Ruby was simply encoding-oblivious, and for any encoding-sensitive piece of code people relied on things like libuconv. Am I mistaken?
If that's the case, the change from 1.8 to 1.9 was painful for sure, but it was more a case of a codebase actually caring about encodings for the very first time.
This is quite recent (and it deals with Jruby, which is different underneath): http://blog.rayapps.com/2013/03/11/7-things-that-can-go-wron...
but reading this blog post, I'm under the impression that most of the breakage you'd get moving to Ruby 1.9 wouldn't show up as exceptions, but as string corruption.
Migrating to a fail-fast approach (like Python 3's), imho, makes things more difficult ecosystem-wise, because you'll get plenty of exceptions as soon as you import the library when first trying to use/update it.
With the Ruby 1.9 upgrade, I'd assume you could still use a library even if it wasn't 100% compatible and correctly working on 1.9. This lets people migrate and port their code gradually, reporting corruption issues and fixing them as they appear.
Instead, if you're the author of a big Python 2 library that relies on the encoding, maybe you won't prioritize the porting work, because you realize how much work it is, and the fact that unless you've correctly migrated 100% of the codebase your users won't benefit from it (so you have less of an incentive to start by porting a couple of classes/modules/packages).
That's compounded by the fact that, in Python 2 as in Ruby, your libraries and your codebase already work in an internationalized environment: things might get more robust, but in the meanwhile everything will break, and the benefit isn't immediately available nor obvious.
The last factor is then obviously community and memes: I don't believe that Python developers are intrinsically more conservative, preferring stabler things (at least the ones that use virtualenv, and I'd assume that's most of them in the web development industry; things might be different in scientific computing, ops, desktop GUIs, pentesting, etc.). Not more than Ruby developers, at least.
But for sure, memes like "Python 2 and Python 3 are two different languages" can demoralize and stifle initiatives to port libraries. And without any doubt some mistakes happened (mistakes that embittered part of the community), but they were only recognized in hindsight: I'm talking about not keeping the u'' literal (which was reintroduced in Python 3.3) and proposing 2to3 as a tool to be used at build/installation time, instead of only as a helper during migration to a single Python 2/3 codebase.
> If I understand right, you're saying that it ought to be guaranteed to raise if you try to concatenate strings with different encoding.
Let's say that while I'd prefer it if Ruby behaved like this, I'm not advocating at all for such a change, due to all the problems I just mentioned, and the fact that I wouldn't want any such responsibility :)