>assuming the world is Unicode is flat out wrong
True, but Py2's approach makes lots of developers assume the world is Latin-1. I see way too many examples of things broken on a Chinese locale environment, including Python's official IDLE ([1]).
[1] https://bugs.python.org/issue15809 (Summary of this bug: in 2.x IDLE, an explicit unicode literal used to still be encoded using system's ANSI encoding instead of, well, unicode.)
> This ground rule meant that a mass insertion of b'' prefixes everywhere was not desirable, as that would require developers to think about whether a type was a bytes or str, a distinction they didn't have to worry about on Python 2 because we practically never used the Unicode-based string type in Mercurial.
Requiring developers to think which one it should be is, of course, the whole point of the changes in Python 3 - and it's what produces better apps that are more aware of i18n issues in general and Unicode in particular.
And the complaint doesn't even make sense if taken at face value - if all strings in Mercurial are byte strings, then what is there to think about? just use b'' throughout, no need to worry about anything else. Of course, the devil is in the details, which is reflected by the word "practically" in that sentence - this kinda implies that there are places where Unicode strings are used. At which point you do want the developers to think about bytes vs Unicode.
So the real complaint is that Python switched the defaults in a way that made bytes-centric code more complicated - because it has to be explicit now, instead of the Python 2 world, where bytes was the default, and Unicode had to be requested explicitly. Which, of course, is the right change for the vast majority of code out there that operates at a higher level of abstraction, where "all strings are Unicode by default" is a perfectly reasonable assumption to force.
The article directly answers that question. Many, many things in the standard library now only accept unicode strings, not byte strings. So a wholesale change to b'' everywhere breaks lots of stuff.
> So the real complaint is that Python switched the defaults in a way that made bytes-centric code more complicated - because it has to be explicit now, instead of the Python 2 world, where bytes was the default, and Unicode had to be requested explicitly.
Once again, the article directly states that the default is not the problem. The lack of escape hatches is. Paths are not unicode strings, and pretending they are does not work. Using bytes when you need bytes works only until you need to call a library function that only accepts strings.
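For what it's worth, Python 3 does have one escape hatch for undecodable path bytes: the `surrogateescape` error handler (PEP 383), which `os.fsdecode` and `os.fsencode` use on POSIX. A minimal sketch of the round trip, with a made-up Latin-1 filename and with UTF-8 hard-coded as an assumption instead of the real filesystem encoding:

```python
# Hypothetical filename stored on disk as Latin-1 bytes; not valid UTF-8.
raw_name = b"caf\xe9.txt"

# Strict decoding fails, which is the "paths are not Unicode" problem.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape smuggles each undecodable byte through as a lone surrogate.
as_text = raw_name.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = as_text.encode("utf-8", "surrogateescape")
```

The resulting str still cannot be encoded strictly or printed safely, so this only helps when every API in between tolerates the smuggled surrogates.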
The author explains later in the article that many system level python 3 apis that are important to a vcs require unicode and won't accept bytes. So apparently it wasn't as easy as just sticking 'b' in front of every literal.
The author made it clear. The issue wasn't just that the default changed. It was that 3.0 took away the ability to always make your choice explicit.
Changing the default would have no effect on code that was always explicit. Going over the code and making all implicit strings explicit would allow them to know when they had full coverage, and also make the code work with both 2 and 3.
With 3, any implicit had to get b added, while any string with u had to be made implicit (drop the u). You couldn't tell by looking at code if it was converted or not. At least that's how I read it.
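To illustrate the "always explicit" style: a sketch of literals that mean the same thing on Python 2.7 and on Python 3.3+ (where PEP 414 restored the u prefix). The variable names are made up.

```python
# Explicitly bytes on both Python 2.7 and Python 3.
separator = b"\n"

# Explicitly text on Python 2.7 and on Python 3.3+.
greeting = u"Hello World"

# An unprefixed literal is the ambiguous one: bytes on 2, text on 3.
native = "Hello World"

# On Python 3, text and bytes never compare equal, even for pure ASCII.
same_as_bytes = greeting == b"Hello World"
```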
I'm also a "non-latin" user and I will keep repeating this point ad nauseam: there would have been many strictly superior solutions to this problem and most of them would have been closer to what we had in Python 2 than 3.
Both rust and go decided to go with Unicode support that is largely based around utf-8 with free transmutations to bytes and almost free conversions from bytes. This could have been Python 3 but that path was dismissed without a lot of evidence.
A Unicode model that was a bad idea in 2005 was picked and we now have it in 2020 where it's a lot worse because thanks to emojis we now are well outside the basic plane.
Both of those are newer languages that happened to take a stance from day 1. So not quite comparable.
That said, UTF-8 is one of the best pragmatic solutions to this Unicode problem. Most engineers I meet who throw their hands up in the air complaining about Unicode haven't read the simple Wikipedia page for utf-8.
Python 2 was already halfway there; they just had to tweak a few places where bytes are converted to strings. Of course this is easier for newer languages to solve. We can't blame Python for having to provide backward compatibility.
PS: I also blame all the "encoding detection" libraries which exist to try to solve an unsolvable problem. Nobody can detect an encoding, at least not reliably. If these half-assed libraries did not exist, people would have finally settled on UTF-8 and given up on others by now.
What do you mean by "free"? Rust requires you to explicitly convert a string to bytes or vice versa, no? Which is pretty much what you do in Python - the only difference I can see is that you have shortcut methods to encode/decode using UTF-8, but semantically they're no different from encode/decode in Python.
I'm pretty dubious about specifying that the internal representation must be UTF-8. That's a failure of abstraction (because the program shouldn't know or care what the internal representation is), leads to inherent performance/interop problems on several compile targets (Windows, the JVM, Javascript), and seems to imply that Han unification is forced at the language level.
Do you mean that if you have bytes, but you want to send them to a function that expects a string, then it would automatically interpret the bytes as UTF-8?
If so, that violates the "Explicit is better than implicit" part of the Zen of Python. Encoding/Decoding bytes to/from strings shouldn't happen automatically because doing so means you have to make an assumption about the encoding.
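As a concrete sketch of that behaviour in Python 3: mixing the two types raises instead of guessing an encoding, and the caller has to name one. The sample values are made up.

```python
payload = b"caf\xc3\xa9"   # UTF-8 bytes for "café"
label = "name: "

# Python 3 refuses to guess how the bytes should be interpreted.
try:
    label + payload
    mixed_ok = True
except TypeError:
    mixed_ok = False

# The conversion has to be spelled out, encoding included.
combined = label + payload.decode("utf-8")
```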
Hindsight is 20/20 naturally, but in retrospect, they should have just made `bytes` into the name for old `str` and used `from __future__ import` to create a gradual system for moving from 2 to 3 instead of a big bang "we'll break everything once and then never again".
I think this is misreading the author's criticism. The fact that string literals are now Unicode is not the fundamental problem; the fact that standard library APIs that formerly took bytes now incorrectly take Unicode strings is the problem.
IMO it's great that the world is moving towards opaque blobs of Unicode for strings, but that requires understanding when something shouldn't simply be a string in the first place (for reasons of legacy or otherwise).
>Perhaps my least favorite feature of Python 3 is its insistence that the world is Unicode
>standard library APIs that formerly took bytes now incorrectly take Unicode strings
What do you mean by "incorrectly"?
C# char is a UTF-16 code unit, not a Unicode code point.
Most code points "fit" into just one UTF-16 code unit, but not all.
For example: 𝐀 ("Mathematical Bold Capital A", code point U+1D400) is encoded in UTF-16 as a surrogate pair of code units: U+D835 and U+DC00. So reversing "x𝐀y" should produce "y𝐀x" ("y\ud835\udc00x") - note how U+D835 and U+DC00 were not reversed in the result.
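The surrogate arithmetic above can be checked with a short sketch. It is in Python instead of C#, since Python 3 strings index by code point; the helper name `utf16_surrogates` is made up for illustration.

```python
def utf16_surrogates(code_point):
    """Return the (high, low) UTF-16 surrogate pair for an astral code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the Basic Multilingual Plane")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

high, low = utf16_surrogates(0x1D400)

# Python 3 reverses by code point, so the astral character stays intact.
reversed_text = "x\U0001D400y"[::-1]

# In UTF-16 the same three characters occupy four 16-bit code units.
code_units = len("x\U0001D400y".encode("utf-16-le")) // 2
```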
API members that operate on code points universally take a string and an index.
That being said, treating strings as arrays of characters is fraught with peril in most cases anyway. You can't trivially reverse strings in any encoding, as you need to reverse the sequence of grapheme clusters (to account for diacritics, etc.). You can't trivially truncate strings either, for pretty much the same reason. You can't trivially grab a single character from the middle of a string, again, for the same reason. So basically, indexing, reversing, truncating, copying a subsequence, etc. are all not trivially possible regardless of the encoding. UTF-16 is not the main problem here, as even in UTF-32 it'd be broken.
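A small Python sketch of the grapheme cluster point, using a made-up sample word with a combining accent. Even with code point indexing, a naive reverse moves the accent onto the wrong letter.

```python
import unicodedata

# "noél" spelled with a combining acute accent (U+0301) after the "e".
decomposed = "noe\u0301l"

# Reversing code points moves the accent onto the "l".
naive_reverse = decomposed[::-1]

# NFC normalization composes this pair into U+00E9, but it is not a
# general fix: many clusters (emoji sequences, for example) have no
# precomposed form.
composed = unicodedata.normalize("NFC", decomposed)
```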
Making strings Unicode by default is wonderful compared to the alternatives (and OP's assertion that this amounts to "assuming the world is Unicode" is disingenuous: there's nothing stopping programs from handling bytes correctly - Python 3 merely resolved the ambiguity).
The decision of a default encoding surely dates back to Python 1.0 or earlier, which predates not just UTF-8 but even Unicode itself. Python is an old language!
And if the assertion is that Python 2.0 should have made the tumultuous Unicode jump when it released in 2000, I could get behind that (especially in retrospect!), but enthusiasm for both Unicode and UTF-8 was not nearly as high then as it is today, so I don't begrudge them for not jumping at the opportunity.
First was the idea that normal feature contributors should not see any b"" or any sign of python3 support for the first couple years of the effort. Huge mistake. You need some b"".
But you don't need all b"" everywhere. That was the second huge mistake. Don't just convert every natural string in the whole codebase to b"". The natural string type is the right type in many places, both for python2 (bytes-like) and python3 (unicode-like). The helpers for converting kwargs keys to/from bytes are a sign that you are way off track. This guy got really hung up on the fact that the python2 natural string type is bytes-like, and tried to force explicit bytes everywhere (dict keys, http headers, etc) and was really tilting at windmills for most of these past 5 years.
Yes, you pretty much had to wait for python-3.4 to be released and for python-2.6 to be mostly retired in favor of python-2.7. Then, starting in early 2014, it was pretty straightforward to make a clean codebase compatible with python-2.7 and python-3.4+, and I saw it done for Tornado, paramiko, and a few other smaller projects.
For many programs, yes. Not for a revision control system that needs to be sure it's working with the exact binary data that's stored in the repository. Repository data is bytes, not Unicode.
I think this article is an excellent illustration of the Python developers' failure to properly recognize this use case in the 2 to 3 transition.
For example, when I converted our existing Subversion repository to Mercurial I had to rename a couple of files that had non ASCII characters in their names because Mercurial couldn't handle it. At least on Windows file names would either be broken in Explorer or in the command line.
In fact I just checked and it is STILL broken in Mercurial 4.8.2 which I happened to have installed on my work laptop with Windows. Any file with non ASCII characters in the name is shown as garbled in the command line interface on Windows.
I remember some mailing list post way back when where mpm said that it was very important that hg was 8-bit clean since a Makefile might contain some random string of bytes that indicated a file and for that Makefile to work the file in question had to have the exact same string of bytes for a name. Of course, if file names are just strings of bytes instead of text, you can't display them, or send them over the internet to a machine with another file name encoding or do hardly anything useful with them. So basic functionality still seems to be broken to support unix systems with non-ascii filenames that aren't in UTF-8.
For all programs, for the simple reason that:
> Various standard library functionality now wanted unicode str and didn't accept bytes, even though the Python 2 implementation used the equivalent of bytes.
Much of the stdlib works with native strings and will either blow up or misbehave if fed anything else[0], which means much of your codebase will necessarily be native strings, with a subset being explicitly bytes or unicode.
> Repository data is bytes, not Unicode.
It's also mostly absent from the source code, and where it is present (e.g. placeholders or separators) it's easy to flag as explicitly bytes.
[0] though some e.g. the encoding layers or io module want either bytes or unicode depending what you're doing specifically, and not always the most sensible, like baseXY being bytes -> bytes conversions where 95% of the use case is to smuggle binary data through text… oh well
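The base64 case from the footnote, as a short sketch with a made-up two-byte blob: the encoder returns bytes even though its output is pure ASCII, so putting it into text takes one more explicit step.

```python
import base64

blob = b"\x00\xff"

# b64encode returns bytes, even though the result is pure ASCII.
encoded = base64.b64encode(blob)

# Embedding it in text therefore needs one more explicit step.
as_text = encoded.decode("ascii")

# b64decode accepts either an ASCII str or bytes and returns bytes.
decoded = base64.b64decode(as_text)
```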
You just need to be aware that in some cases the work is already done for you by the language. For example, in Python, if you open a file without the "b" option, Python will do the translation on the fly and you don't need to worry about it.
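A minimal sketch of the two modes, using a throwaway temp file and a made-up sample word. Naming the encoding explicitly is an addition here; without it Python falls back to a platform-dependent default.

```python
import os
import tempfile

# Write raw UTF-8 bytes for "naïve" to a throwaway file.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write("na\u00efve".encode("utf-8"))

    # Text mode decodes on the fly; naming the encoding avoids
    # depending on the platform default.
    with open(path, encoding="utf-8") as f:
        as_text = f.read()

    # Binary mode ("b") hands back the bytes untouched.
    with open(path, "rb") as f:
        as_bytes = f.read()
finally:
    os.remove(path)
```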
TBH I do think the problem is easier to address in a statically typed world.
The entire 2 to 3 transition is an excellent illustration of Python developers failing to properly recognize the challenges in transition. What other popular language intentionally broke backwards compatibility? It's hard to think of any.
Python set the entire community back 10 years or more by making this drastic mistake.
If we imagine an alternative reality where Rust started only with byte-strings and added unicode as an afterthought like Python did, you'd definitely face a massive amount of churn but at least the compiler would yell at you every time you pass a byte string where unicode is expected and vice-versa. Once you've fixed all of the errors, in the vast majority of cases there's a good chance that your program would work again. It would be very annoying but at least you know clearly where the problems occur.
In Python on the other hand this type of code refactoring is very painful in my experience. You may end up with the same function being called sometimes with unicode and sometimes with bytes. And then you have to look at the call stack to figure out where it comes from. And then you realize that you end up with, say, a list of records which sometimes contain unicode and sometimes byte arrays depending on whether the code that updated them used the old or the new version etc...
And if it turns out that you can't easily reproduce the problem and you just get a bug report sent from somewhere in production then Good Luck; Have Fun.
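One way to make that kind of failure show up closer to its cause is to normalize at a single boundary and fail loudly on anything else. A minimal sketch; `ensure_bytes` is a hypothetical helper, not something from Mercurial or the standard library, and UTF-8 is an assumed default.

```python
def ensure_bytes(value, encoding="utf-8"):
    """Hypothetical boundary helper: accept text or bytes, return bytes."""
    if isinstance(value, bytes):
        return value
    if isinstance(value, str):
        return value.encode(encoding)
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)

# Records of mixed origin: some written by old code, some by new.
records = [b"alpha", "beta", b"gamma"]
normalized = [ensure_bytes(r) for r in records]
```

This does not remove the need to find every call site, it only makes the mixed case harmless at the one place it is applied.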
I agree with you on the benefits of static typing, but let's be clear: Python didn't add unicode as an "afterthought". The initial release of Python predates the initial release of the Unicode standard, by almost a year.
Furthermore, even if this were not the case, it took a while before Unicode got any significant adoption among programming languages, well after the release of Python 1.0. I think Java in 1996 was the first language to adopt Unicode.
When I read that, I was angry on behalf of the people doing the porting work who had their hands tied by it, and I was angry on behalf of the Mercurial developers who, I think, must have been underestimated. It's normal that platforms don't stand still and coding standards on a project evolve over time. Obviously it's not going to fly for open source contributors to be "voluntold" to do porting work, but to be aware of it and accommodate it and know enough about the new platform to mostly avoid creating new work for the porters seems like a small and reasonable ask, especially when you compare it to the effort required to make high-quality contributions in the first place.
I get that there are people who are bitter to this day about Python having a version 3, but surely by 2017 the vast, vast majority of developers who were going to rage quit the Python community over it were already gone.
Keeping blame details (and line-lengths, ha!) was given as the excuse and that is a nice feature and all. However they could have copied the repo over before porting to keep that information and saved time. Wouldn't be surprised if it was eventually lost anyway.
I had to switch back to treating headers as bytes for as long as possible.
It is a stupid client which doesn't send valid ascii for http headers of course.
...or a smart malicious actor.
as a mercurial user i never understood this decision. for instance look at this recent commit: https://www.mercurial-scm.org/repo/hg/rev/b4c82b704180
would anyone disagree with the fact that an error message should be a string?
a source transformer to add b'' all over the place? really?
and i still don't understand why the hg transition had to be more complex than: https://docs.djangoproject.com/en/1.11/topics/python3/
... and of course now this: https://www.mercurial-scm.org/wiki/OxidationPlan
i wonder what does matt mackall think of all these developments?
Isn't this more a problem with Python not easily differentiating between String and Byte types? Both Go and Rust ("""systems""" level languages) have decided that "utf-8 ought to be enough for anybody" and that seems to be a good decision.
There was this assumption that Unicode code points were the correct single unit to talk about Unicode. You iterate over code points, you talk about string lengths in terms of code points, you slice in terms of code points. Much like the infamy of 16-bit Unicode, this is an assumption that has kinda gotten worse over time. Now we can and do want to talk about bytes, code points, and newer sets like extended grapheme clusters. I think this is probably the big failing of Python 3's Unicode model. Making a string type operate on extended grapheme clusters might fix it, but we'd be in for the same sort of pain, and the flexibility of "everything is bytes, we can iterate over it differently" of Go and Rust is much nicer in comparison.
The second thing was this assumption that everything remotely looking like text was Unicode, despite this maybe not being true. HTTP has parts that look like plain text, like "GET" and "POST" and the headers like "Content-Type: text/html". But the correct way to view this is as ASCII bytes, and no other encoding makes sense; binary data intermixed with "plain text" definitely happens, and the need to pick and choose between either Unicode or Bytes caused major damage in the standard library which still persists to this day -- some parts definitely chose the wrong side. Take a look at the craziness in the "zipfile" module for one other example. It's probably fixed now, but back then, I basically had to rewrite it from scratch in one of my other projects.
They eventually relented and added back a lot of the conveniences to blur the line between bytes and unicode again, like adding the % formatting operator for bytes, which I think shows that their insistence on separating the two didn't really pan out in practice. And yet, migration is still a pain.
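For reference, the bytes % operator came back in Python 3.5 via PEP 461. A short sketch with a made-up header; note that it still does not accept str for %s, so the line is only partly blurred.

```python
# PEP 461 (Python 3.5+): %-formatting on bytes. %s takes bytes-like
# objects, %d takes integers; passing a str to %s raises TypeError.
header = b"%s: %d\r\n" % (b"Content-Length", 42)

try:
    b"%s" % ("text",)
    str_accepted = True
except TypeError:
    str_accepted = False
```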
It would "kinda work out", if your Unicode strings were ASCII in practice, and only then. Because whenever a Unicode and a non-Unicode string had to be combined, it used ASCII as the default encoding to coerce them.
Which is to say, it only worked out for English input, and even then only until the point where you hit a foreign name, or something like "naïve". Then you'd suddenly get an exception - and it happened not at the point where the offending input was generated, but at the point where two strings happened to be combined.
This was a horrible state of affairs for basically everybody except the English speakers, because there was a lot of Python code out there that was written against and tested solely on inputs that wouldn't break it like that.
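The Python 2 coercion can't be run on Python 3, but the decode it performed implicitly can be reproduced by hand. A sketch under that assumption, with made-up inputs:

```python
# In Python 2, u"user: " + "na\xc3\xafve" implicitly ran the
# equivalent of this decode, with ASCII as the assumed encoding.
utf8_input = b"na\xc3\xafve"

try:
    utf8_input.decode("ascii")
    failed = False
except UnicodeDecodeError:
    failed = True

# Pure-ASCII input goes through, which is why English-only tests passed.
ascii_ok = b"naive".decode("ascii")
```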
Intermixing binary data with text can be represented just fine in a type system where the two are different. For your HTTP example, the obvious answer is that the values that are fundamentally binary, like the method name or the headers, should be bytes, while the parts that have a known encoding should be str - there's nothing there that requires actually mixing them in a single value. In those very rare cases where you genuinely do have something like Unicode followed by binary followed by Unicode in a single value, that is trivially represented by a (str, bytes, str) tuple.
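A rough sketch of that split for the HTTP example: the request line and headers stay bytes, and only the body, whose charset is declared, becomes str. The request is made up and the parsing is deliberately naive (no folding, no validation, charset hard-coded instead of read from the header).

```python
raw = (b"POST /submit HTTP/1.1\r\n"
       b"Content-Type: text/plain; charset=utf-8\r\n"
       b"\r\n"
       b"na\xc3\xafve")

# Split the protocol part from the body without decoding anything.
head, _, body = raw.partition(b"\r\n\r\n")
request_line, *header_lines = head.split(b"\r\n")

# Method, target, and headers remain bytes.
method, target, version = request_line.split(b" ")
headers = dict(line.split(b": ", 1) for line in header_lines)

# Only the part with a known encoding is turned into text.
text = body.decode("utf-8")
```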
The problem with the Python stdlib isn't that bytes and Unicode are distinct. It's that it's overly strict about only accepting Unicode in some places where bytes should be legal, too. This is orthogonal to them being separate types.
The most messed-up thing about Python 3 is that it's supposed to be justified by doing Unicode right and they still got it wrong.
Having strings be sequences of Unicode code points is a super-bizarre design. That is, Python 3 strings indeed are semantically sequences of Unicode code points rather than sequences of Unicode scalar values. You can not only materialize lone surrogates (defensible for compatibility with UTF-16) but you can also materialize surrogate pairs in addition to actual astral characters. You still can't materialize units that are above the Unicode range, though, so it's not like C++'s std::u32string.
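These claims are easy to check in CPython 3. A short sketch, using U+1D400 as the sample astral character:

```python
# Two lone surrogates side by side are two code points, not one character.
pair = "\ud835\udc00"
astral = "\U0001D400"
pair_length = len(pair)
equal = pair == astral

# Strict UTF-8 refuses to encode surrogates.
try:
    pair.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# Going through UTF-16 with surrogatepass merges the pair into the
# real astral character.
merged = pair.encode("utf-16-le", "surrogatepass").decode("utf-16-le")

# Values above U+10FFFF cannot be materialized at all.
try:
    chr(0x110000)
    above_range = True
except ValueError:
    above_range = False
```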
Looking at the old PEPs, it appears to have arisen by accident rather than as an actual design.
Go has string and []byte, and you can't mix them; you have to convert. Java has String, char[] and byte[], and similarly you need to convert explicitly. Rust has Bytes and String (I don't know Rust well enough, but I'm pretty sure it doesn't do implicit conversion between them).
Also, Python 3 doesn't distinguish between Bytes and Unicode; Python 3 distinguishes between bytes and text (str - BTW: Guido actually expressed regret that he used "str" instead of "text", because it would be much clearer)
In Python 3 you don't have Unicode (as far as you should be concerned), you have text and bytes, how the bytes are stored internally is an implementation detail, if you need to write to a file or to network, you encode the text using various encodings (most popular is UTF-8) and you decode it back when reading.
When working with e.g. filepaths, Rust has an OsStr type.
I look at Perl, which was a juggernaut when I first used Python, and announcements of Perl 6 certainly didn’t help Perl’s slide. Often cited is the fact that Perl 6 is a totally different language unrelated by anything but creator and name. The Perl brand was not enough to carry the bulk of Perl users from Perl 5 to Perl 6. Perl 6 is now called Raku, which probably better reflects the magnitude of the change.
On the other hand Python 3 is a small but still significant departure from Python 2. If they’d called Python 3 something else, we’d probably be griping about how superficially different from Python 2 it was without bringing substantially new ideas.
Oddly my feeling is that Racket, in its departure from mainline Scheme, largely did retain its core audience, but that may have been a feature of its usage in academia.
Fast forward to last year when a prominent Racket architect announced “Racket 2” which would completely change the syntax of the language. Prominent community members reacted negatively, due to fears of Perl 6’s fate. But now they’ve decided to simply call the new research language Rhombus and have reiterated plans to continue supporting Racket. I went from feeling very negative to the change to being okay with the direction.
I’m not sure there are lessons to draw, other than noting that version bumping versus making a new language with a new name can be bad for entirely different reasons.
I think the takeaway is that if you want to make breaking changes, make a new thing and turn the old thing over to a maintenance team. If after a while you learn some things that could improve the old thing see if they can be incorporated without breaking compatibility or if the new thing is really so much better people will switch.
I mean Perl5 is still mostly backwards compatible back to the original version released in 1987. (There were a few rarely used bad features that should have never been there which have been removed.)
The way it does this is by having you specifically ask for the new features, if they would otherwise break code.
> While hindsight is 20/20, many of the issues with Python 3 were obvious at the time and could have been mitigated had the language maintainers been more accommodating - and dare I say empathetic - to its users.
In contrast, the original porting guidance for module authors was actually to maintain the Python 2 source as the master copy, and use 2to3 to transform it for running tests or cutting a Python 3 release. How is a transition ever supposed to happen if the new hotness is perpetually a second class citizen?
Python 3 is a “success” in that a lot of people have moved. But it was, as you rightly point out, a hard won victory that left a lot of people unhappy.
FTFY
Indeed. That is why Perl 6 has been renamed to Raku (https://raku.org using the #rakulang tag on social media).
I'm not sure what lessons can be drawn from this, other than being indecisive has its price.
That said, Racket is open source, and the maintainers have good reasons for a change based on getting a larger user base. I wish them great success.
But Racket as a project was always about language experiments. I certainly didn’t bristle at Typed Racket or any of the other languages. And so, in changing “Racket 2” to “Rhombus” and committing to mainline support of Racket, I feel pretty comfortable with the direction. I find this fascinating that I feel this way given that nothing has really changed but the name.
The end result of this is that I just spent a good chunk of last week reviewing a pull request with 70,000 lines of changes, which was one of the final in a series of ~10k line pull requests that came in through the fall.
All of this was the heroic effort of one of my coworkers who had the unenviable task of combing through our entire codebase to determine "This is unicode. This is bytes. Here is an api boundary where we need to encode / decode." etc.
It was a nightmare of effort that I'm glad to have behind us.
Dynamic typing!
The issue is they changed the types out from underneath you.
And then left it to each library to decide which type it was actually going to accept.
However the makers of Delphi spent many years preparing for this, so when the time came for us to switch we only had to spend half a day or so to migrate our half a million lines of code.
u"Hello World"
> In the early days of Mercurial's Python 3 port in 2015, Mercurial's project maintainer (Matt Mackall) set a ground rule that the Python 3 port shouldn't overly disrupt others: he wanted the Python 3 port to more or less happen in the background and not require every developer to be aware of Python 3's low-level behavior in order to get work done on the existing Python 2 code base. This may seem like a questionable decision (and I probably disagreed with him to some extent at the time because I was doing Python 3 porting work and the decision constrained this work). But it was the correct decision. Matt knew that it would be years before the Python 3 port was either necessary or resulted in a meaningful return on investment (the value proposition of Python 3 has always been weak to Mercurial because Python 3 doesn't demonstrate a compelling advantage over Python 2 for our use case).
As a general rule, this seems like good practice, but surely b-strings, print_function, etc are a trivial upfront cost, and one that would have to be paid sooner or later anyway?
The static compiler would notice all the breaking type changes at compile time and you can systematically fix all of them at once. You wouldn't miss one or two and have to run your unit testing suite to exercise the type system underneath.
I really believe big breaking changes like this in a language causing migration stagnation is a property of dynamically typed languages. With other statically typed languages like swift or rust, it happened quite frequently but wasn't as big of a deal in practice.
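Python can get part of the way there with type hints and an external checker. A sketch, assuming a checker such as mypy is run over the code; `write_chunk` is a made-up function. The interpreter itself ignores the annotations, so without the checker the mistake only surfaces at run time.

```python
def write_chunk(buffer: bytearray, chunk: bytes) -> int:
    """Append chunk to buffer and return the new length."""
    buffer.extend(chunk)
    return len(buffer)

buf = bytearray()
total = write_chunk(buf, b"abc")

# A checker such as mypy would reject the call below before it runs;
# without one, the mistake only surfaces as a TypeError at run time.
try:
    write_chunk(buf, "abc")  # type: ignore[arg-type]
    caught = False
except TypeError:
    caught = True
```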
The language wasn’t ready for the transition, but it feels like it may have been even harder on them because of the requirements imposed on their project.
I think the core error here was in NOT doing what he calls a "flag day" conversion. Sometimes it is easier to do something quickly, than to live with it happening slowly. I've done "flag day" conversions, and they were pretty painless, if stressful at the time.
It matters a lot where you work. If you are in high level land Python 3 is not much of a change. If you work at the boundary (wire protocols, OS interop, text transformation) then Python 3 is a significant step back, especially before 3.6. A lot of the mud that Mercurial stepped through is also where I went through with my libraries. The day I managed to get the PEP through that reintroduced the u prefix on strings was also the last time I voluntarily participated in a language summit. The atmosphere was awful and not evidence based.
Read through the receipts here: https://bugs.python.org/issue3982
Same thing applies to using Mypy. Some modules are easy to add annotations for, other modules have insanely complicated types.
Futurize and Pasteurize in particular provide essentially all of the features that this post laments missing.
When Mercurial accepts a 3rd party package, downstream packagers like Debian get all hot and bothered and end up making questionable patches to our source code.
Some environments just can't use dependencies like this. IMO Python 3 was too much of a breaking change, and in particular, the ability to transition from 2 to 3 should have been better in Python itself.
I can't imagine what it would do to Mercurial's performance to have picked the wrong migration library early on.
Having just done transitions on a number of much smaller projects I had the same thought. Changes to string handling tripped me up and the changes to relative imports took some thinking. But the biggest frustration was the nagging question: Why am I doing this?
edit: missing word
Lack of security updates past 2019 forced our hand. Did you find a way around that?
Amazon is maintaining Python 2 for at least 4 years, as part of their Amazon Linux long term support release. Google app engine will support Python 2 for an unknown amount of time; they haven't announced an end date. PyPy is Python 2, with (to the best of my limited knowledge) no plans to deprecate support. There are also other LTS releases out there which include Python 2 support.
IOW, the forcing function of the PSF no longer supporting Python 2 is not as big a factor as was hoped.
It's particularly uncool that Guido brought up the prospect of lawyers (https://github.com/naftaliharris/tauthon/issues/47#issuecomm...) to force it not to be called Python and was opposed to letting people who care about keeping Python 2 alive evolve it as "Python 2". (I know he has the legal right to insist on the name change. Still uncool.)
It took years before the advent of six, Python 3 u’’ literals, and modernize. The author discusses this at length.
> So I'm not sure six would have saved enough effort to justify the baggage of integrating a 3rd party package into Mercurial. (When Mercurial accepts a 3rd party package, downstream packagers like Debian get all hot and bothered and end up making questionable patches to our source code. So we prefer to minimize the surface area for problems by minimizing dependencies on 3rd party packages.)
Every huge task, like porting from Python 2 to Python 3, is either everybody's task or just a small group's. And while the latter seems more reasonable because it doesn't interfere with ongoing development, the former is the only way I have seen such tasks succeed.
Artificial rules to create comfort for one group at the expense of another group, like the following
>> This ground rule meant that a mass insertion of b'' prefixes everywhere was not desirable, as that would require developers to think about whether a type was a bytes or str, a distinction they didn't have to worry about on Python 2 because we practically never used the Unicode-based string type in Mercurial.
sound pretty much wrong to me.
If there is pain, it should become everybody's pain, or otherwise people will simply burn out and hate their own work, like the author did. There is no way porting to Python 3 can be harder than porting to Rust. Rust is statically typed and not garbage collected. Everyone would have to think about whether they need a string or an array of bytes anyway, but also about who owns them.
Overall, the described situation looks like a management issue and not a technical one to me.
Edit: typos.
The author addresses this. The difference is that when porting to Rust you'd likely get a faster and more correct program in the end. (Huge caveat of big rewrites, of course). Whereas with Python 3 they feel like they did all the porting work and got nothing valuable in return.
The Rust compiler statically checks those decisions, while in Python issues with string types will only be caught at run-time, so everywhere your test suite has missing coverage, porting is likely to introduce regressions. That is one way in which a Rust port would be easier.
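To make the run-time point concrete, here is a minimal sketch (the `join_path` helper is hypothetical, not Mercurial code). Nothing complains when the function is defined; the bytes/str mix-up only surfaces when the offending call actually runs, which is exactly where missing test coverage hurts.

```python
def join_path(root, name):
    # Hypothetical helper: fine as long as both arguments are bytes.
    # A str argument is only rejected when this line executes.
    return root + b"/" + name


print(join_path(b"repo", b"file.txt"))

try:
    join_path(b"repo", "file.txt")  # str slipped in where bytes was expected
except TypeError as exc:
    print("only caught at run time:", exc)
```

A static checker such as mypy can flag the second call before it runs, but only if the function carries type hints.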
It would take quite a bit of change in a language for a port to be safer than an upgrade, but it's not completely impossible.
I also remember my first forays into Python 3 and the annoyance I had at some of the decisions. I recall when they relented on the % operator for string interpolation and I agree it was a poor initial choice to leave it out. I totally agree with the author that Python 3 could have made some subtle changes earlier on to help those with massive codebases.
And I still feel it was the right move. Somehow Python is even more relevant today than it was when this painful process began. While some may say that popularity is despite missteps I actually believe the general slow and cautious push forward is one of the primary reasons Python continues to succeed. There is a balance between completely abandoning old users (e.g. Perl 5 to Perl 6) and keeping every historical wart (e.g. C++). IMO, the Python community found a middle ground and made it work.
I know this because every change I've heard about is reminiscent of a Perl5 change where backwards compatibility was not broken.
The transition to Python 3 was not handled anywhere near as well as it could have been.
There is a reason Go2 isn't copying Python3. (Strangely they seem to be copying the Perl5 update model even though they don't realize it.)
The thing is that because Python has an unhealthy fixation on “There should only be one way to do things” they rejected things that would have made the transition easier. (Or even less necessary.)
I think it is kind-of telling that someone thought it necessary to create Tauthon. Tauthon is sort-of applying the Perl5 update model to Python 2.7.
Please note that Perl 6 has been renamed to Raku (https://raku.org using the #rakulang tag on social media).
In the original design of Perl 6, a source-level compatibility layer ("use v5") was envisioned that should allow Perl 5 source code to run inside of Perl 6. So the plan was to actually not abandon old users.
In my opinion, this failed for two reasons:
1. Most of Perl 5 actually depends on XS code, the hastily devised and not very well thought out interface to C code of Perl 5. Being able to run Perl 5 source code in Perl 6 doesn't bring you much, unless you have a complete stack free of XS. Although some people tried to achieve that (with many PurePerl initiatives), this really never materialized.
2. Then the Inline::Perl5 module came along, allowing seamless integration of a Perl 5 interpreter inside Perl 6 and the use of Perl 5 modules inside Perl 6 as if they were Perl 6 modules. That basically put the final nail in the coffin the "use v5" initiative already found itself in.
And now they're considered different languages after the rename to Raku, dividing already limited resources. I guess that's the way of life.
More amazing to me is that in Catalina, the release famous for breaking just about everything else, “Python 2” is still there and works as it always has! Of course, Apple did announce that it will be ripped out in the next release. :)
What I don't think people realize is that not only are you expected to move to 3.x, but you'll have to keep up or fall behind with new 3.x releases. During that same period (since 2008) 3.x has had 9 big releases. Of course that 2.x stability was done with the assumption you'd move to 3.x and isn't sustainable for PSF indefinitely.
They did? Damn, I was using that...
I ported two projects with ~200000 Python-SLOC (about the same size as Mercurial according to sloccount) back in the early 3.x days. Doing this via more or less flag-day conversions within a few months, converting the codebases first to 2to3-able subset, and as a second step later on dropping 2to3 via common dialect of Python 2/3 with six, was not very painful in the end.
Sounds like you used the same method, just over a smaller timeframe: convert to a common 2/3 subset, then drop Python 2 at some later point.
The project lead said to not push `b""` on people. That was a mistake imo that led them down a very frustrating rabbit hole (transformers, `pycompat`) that probably greatly extended their port time. One reason given is to not confuse devs with those details but they are critical details and ones you can't avoid with Rust. This inconsistency makes me wonder if the post is mostly misdirected frustration. A lot of it centers on.
I agree about the early python3 releases making it harder. I don't remember what the python leadership's intent was, but I think I actually agree with what they did, now. Over my career, I've come to appreciate starting with the ideal and working backwards. This lets you learn what is needed rather than wasting time on speculation (planning or dev) or making a more crippled product.
I can understand frustrations with bugs / differences in python versions. I ran into that a lot just within `2.7.*`
In my mind, the most notable complaint is the stdlib's mixed efforts in supporting str or bytes. I feel "batteries included" made this harder. They had to port a lot. Not everything can get the same level of scrutiny, especially from domain experts that represent a variety of use cases. They also can't break compat. If they weren't batteries included, the porting efforts would be more directed, pull in the right people, and you could fix things later if you got it wrong.
What I find interesting is how different our experiences are that lead to the same place. My frustrations with python are rooted in build tools and packaging, and I have been loving Rust.
EDIT: I'm also surprised at the hostility towards distribution packagers. Instead of working with them to find mutually valid solutions, they express frustration at distributions and cripple themselves by not allowing third-party dependencies.
These days, it's "cool" to hate your downstreams (y'know, bite the hands that feed you and all that).
Seriously though, as one of those "distribution packagers" (Fedora, Mageia, OpenMandriva, and openSUSE!), it sucks that I encounter this more and more often. I try to be somewhat involved in the projects I package and contribute where I can, be it code, advice, or anything in between. Ten years ago, people were generally friendly to me. These days? It's rare to get a thank-you. Usually I get grumbles and anger for daring to ship it in a distro package. I've even had a couple of patches rejected that fix real bugs simply because they were discovered as part of my packaging and testing something because it doesn't happen on the dev's machine in his virtualenv on his Mac...
On the other hand, Python adoption has really taken off since Python 3.5-ish. Python has never been more popular.
So while you may wonder what might have been, had the transition been smoother, it’s hard to argue that Python 3 is a failure. All’s well that ends well, I guess?
Although it’s sad that Guido felt the need to step down. It’ll be interesting to see where Python goes this decade, now the transition is over and there’s a wealth of possibilities in front of it.
I expect there’ll be a lot of people looking to replace JavaScript with Python once you can run it in the browser with WASM.
You can see that at work in the responses here. "And I still feel it was the right move. Somehow Python is even more relevant today than it was when this painful process began." I.e. success is thought to justify every decision made along the way.
I see this fallacy at work in Linux too. "Linux is successful, therefore haphazard CI and using email to track bugs and patches must be a fine way to operate".
Not what you want to hear about an operating system :p
Good lord, how much ignorance can hackernews handle??
I think there are two mental models for how to approach the str/bytes split:
1.) A `str` is for unicode use cases, and a `bytes` is better for cases that don't support unicode.
2.) A `bytes` is an array of numbers between 0 and 255. A `str` should almost always be used when your value is conceptually a sequence of characters. `str` doesn't imply that arbitrary unicode is allowed, and it's fine to have a convention that a particular `str` is ASCII-only, just like other conventions you might have on variable values.
My impression is that #1 is the Python 2 mental model and is tempting for Python 3, but that #2 often works better when writing Python 3 code. Under mental model #2, asking for "%s" formatting is really asking for a replacement strategy that detects the number 37 followed by the number 115 in an array of numbers and fills in a sub-array, which seems stranger and more likely to produce false positives if you're really working with binary data like the bytes of a .jpg file.
That said, I'm sure the devil is in the details, and maybe a project like mercurial has to stay backcompat with bytes data that is neither ASCII nor valid UTF-8, or some other compelling reason to stick with bytes everywhere.
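A small sketch of mental model #2, for anyone who wants to see it in the interpreter. Indexing a `bytes` object yields integers, and `%` and `s` are just the values 37 and 115. The last line relies on PEP 461, which brought `%`-formatting back for `bytes` in Python 3.5; the variable names are only for illustration.

```python
template = b"%s"
as_numbers = list(template)   # bytes iterate as ints
first_byte = template[0]      # an int, not a 1-byte string

text_template = "%s"
first_char = text_template[0]  # a 1-character str

# PEP 461 (Python 3.5+): %-formatting works on bytes again
formatted = b"rev %s" % b"abc123"

print(as_numbers, first_byte, first_char, formatted)
```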
I've had the same problem with a few Python 2 -> 3 conversions -- everything is fine until you have to operate on text or filenames which aren't valid utf8/unicode.
I'm tempted to say "nobody should have filenames like that", but I guess a project like Mercurial needs to be as compatible as possible. Are there modern use cases for filenames like that, or is it fair to say it's all legacy data?
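For what it's worth, here is a sketch of what happens with such a name, using a made-up Latin-1 encoded filename. A strict UTF-8 decode fails, while the `surrogateescape` error handler (PEP 383) smuggles the undecodable byte through a `str` and restores it on encode. As far as I know `os.fsdecode` and `os.fsencode` use this handler on POSIX systems, but the example below sticks to explicit `decode`/`encode` calls so it behaves the same everywhere.

```python
raw_name = b"caf\xe9.txt"  # hypothetical Latin-1 name; not valid UTF-8

try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError as exc:
    strict_ok = False
    print("strict decode fails:", exc)

# Each undecodable byte becomes a lone surrogate code point...
as_text = raw_name.decode("utf-8", "surrogateescape")

# ...and encoding with the same handler restores the original bytes.
round_tripped = as_text.encode("utf-8", "surrogateescape")
print(round_tripped == raw_name)
```

The catch is that `as_text` is not printable or encodable as ordinary UTF-8, so the escape hatch has to be used consistently at every boundary.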
> One was that the added b characters would cause a lot of lines to grow beyond our length limits and we'd have to reformat code
ORLY?! Well, guess what: hard line size limits are stupid. Now you know why.
That's why "foolish consistency is the hobgoblin of little minds" is one of the 1st phrases of PEP-8.
But I'm tired of people saying "oooh let's cut all lines to be under 80-characters" like it's some kind of Biblical Mandate. No, it isn't. And the 80 chars limit is BS. Probably the part I hate the most about PEP-8 (and especially how people interpret the PEP-8)
> is its insistence that the world is Unicode
Oh please. Yes, the world is Unicode. Get over it. Maybe not bytes on disk/network. But apart from that? Yes. If libraries take bytes or unicode, I can agree it's a thorny issue, but let's move on, because a happy day is a day where I don't get a UnicodeDecodeError because Python 2, to add insult to injury, thinks the world is not only not Unicode, but all ASCII.
Windows made the right call a long time ago when it decided to make all strings Unicode. Ok, maybe UTF-8 would be better than 16, but it still does the job.
But I have to agree with them that any version < Py3.4 or 3.5 was really not worth it.
So, just ignoring the 2 things that a version control system exists to work with directly?
Otherwise just convert to and from when saving and sending it to network
Core developers made the design decisions that made nobody want to adopt it.
> the ecosystem of users and projects are collectively much better-off than if the transition had not occurred at all.
The question seems more like, "could the same benefits have been had with less pain", and a reasonable reading is that the answer is yes. (ex. 4 years of not being able to work with bytes reasonably even if you did need them)
Let's just hope there will never be a Python 4 and the developers now finally start focusing on the greatest flaw of Python: performance.
I feel like this is the essence of the article: specific constraints/choices of Mercurial made their port to Python 3 difficult. Working with early Python 3 certainly did not help. But there seems to have been some stubbornness here mixed with a lot of retroactive justification.
> One was that the added b characters would cause a lot of lines to grow beyond our length limits and we'd have to reformat code.
This is almost ridiculous. You are going to write a JIT partial 2to3 instead of just increasing your length limits and/or using an autoformatter? (Of course, it turns out they eventually did do that... after a bit more stubbornness regarding the autoformatter.)
> So I'm not sure six would have saved enough effort to justify the baggage of integrating a 3rd party package into Mercurial.
Couldn't this have been a very occasional copy and paste, instead of a downstream dependency? [six](https://six.readthedocs.io/) "consists of only one Python file, so it is painless to copy into a project."
> Initially, Python 3 had a rather cavalier attitude towards backwards and forwards compatibility.
Yes, can't disagree. Early adopters who attempted to write 2- and 3- compatible code suffered the most.
lib3to6 is a Python compatibility library similar to BabelJS. It translates (most) valid Python 3.7 syntax to syntax that is valid on both Python 2.7 and Python 3 (aka universal Python). If you would like to develop with a modern Python version and still maintain backward compatibility, or if you want to bring a legacy codebase forward step by step (my use case), then please have a look.
Luckily we already had mypy hints; this helped us find a lot of mistakes when (not) using bytes.
Now that we're on Python 3, we're auto-migrating the type hints to inline annotations with tools like com2ann https://github.com/ilevkivskyi/com2ann and we're auto-rewriting code to be more Python 3-like with libcst and custom codemods...
> The only Python 3 feature that Mercurial developers seem to almost universally get excited about is type annotations. We already have some people playing around with pytype using comment-based annotations and pytype has already caught a few bugs. We're eager to go all in on type annotations and uncover lots of dynamic typing bugs and poorly implemented APIs.
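For readers who haven't seen both styles, a rough sketch of the difference between comment-based and inline annotations. The function names are made up for illustration. The comment form is ignored by the interpreter and only read by checkers such as mypy or pytype, which is why it can live in code that still has to parse as Python 2; the inline form needs Python 3 syntax.

```python
from typing import List


def decode_names_comment(raw_names):
    # type: (List[bytes]) -> List[str]
    # Comment-style hint: just a comment as far as the interpreter cares.
    return [name.decode("utf-8") for name in raw_names]


def decode_names_inline(raw_names: List[bytes]) -> List[str]:
    # Inline annotations: same information, Python 3 only syntax.
    return [name.decode("utf-8") for name in raw_names]


print(decode_names_comment([b"a", b"b"]))
print(decode_names_inline([b"a", b"b"]))
```

With either form, a checker can complain if a caller passes a list of `str` where `bytes` is expected.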
Over in perl land people still spill their hate on types, which caused hard forks.
2. Don't even pretend to be interested in trying to do a migration until seven years later.
3. Make sure that your migration plan includes a development cycle that's deliberately hostile to the migration process.
4. ?
5. How could the python maintainers do this to us.
The description of the migration process was a good read. The fud afterwards... wasn't.
And there were a few inaccuracies (I'm being charitable, some of them were straight up lies).
> Python 3.0 was released on December 3, 2008. And it took the better part of a decade for the community to embrace it.
False, I've been using python 3, python 3 exclusively, since 2014, for all my projects.
> Yes, Python is still healthy today and Python 3 is (finally) being adopted at scale
False, same as above.
> I am ecstatic the community is finally rallying around Python 3
Again, false. Not only did "the community" rally around python 3 years ago, he isn't really happy about it, but I'll get to that later.
> For nearly 4 years, Python 3 took away the consistent syntax for denoting bytes/Unicode string literals.
Or, to put it another way, python 3 was compatible with python 2's string types almost eight years before python 2 reached end of life.
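For context on what the quoted sentence refers to: on Python 3.0 through 3.2 the `u` prefix was a syntax error, and PEP 414 restored it in Python 3.3 as a no-op so the same literal could be written for both major versions. A tiny sketch that runs on 3.3 or later:

```python
text = u"caf\u00e9"      # accepted again since Python 3.3 (PEP 414)
data = b"caf\xc3\xa9"    # bytes literal, same spelling on 2.6+ and 3.x

same_as_plain = text == "caf\u00e9"   # the u prefix changes nothing on Python 3
decoded_matches = data.decode("utf-8") == text

print(type(text).__name__, same_as_plain, decoded_matches)
```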
> An ecosystem that falters for that long is generally not healthy
This entire paragraph was a hypothetical. It seems he really wanted to criticize something that did not happen.
> The only language I've seen properly implement higher-order abstractions on top of operating system facilities is Rust
And here's where his true point becomes evident: this is a hype piece for a language he found that he likes better. He's just attacking something in his previous language that he thinks is valid just as an attempt to highlight why the new toy is truly better. In short: He felt like complaining about the migration would be a good way to proselytize.
Just in case: no, it isn't better, and I say this as someone who currently isn't using python nor rust. I'm using a language that I'm quickly growing to hate more than I do either of them at their worst (no, it's not JavaScript).
> if Rust were at its current state 5 years ago, Mercurial would have likely ported from Python 2 to Rust instead of Python 3. As crazy as it initially sounded, I think I agree with that assessment.
So... The best he can say about rust is that it might be better than python 3 five years ago that, by his own opinion on everything he wrote before this, was terrible? Well, that's a recommendation not to use rust if I ever saw any.
When a hype piece defeats its own point.
> And speaking as a maintainer, I have mad respect for the people leading such a large community.
No, he doesn't; he used several appeals to emotion beforehand to try to paint them as terrible people.
> It should not have taken 11 years to get to where we are today.
This statement by itself is a truism that doesn't really mean anything, but the implication is that python 3 is only worthwhile 11 years later and it took that long for it to be so I'll reply to that.
No, it didn't. It didn't even take that long for mercurial; they started the migration four and a half years ago, not eleven.
> am confident it will grow stronger by taking the time to do so
What is it to him? He should just move on to rust and be happy with it (sure, there are many people unhappy with it, but he wouldn't take the effort to proselytize if he wasn't).
In conclusion, I just don't understand the need to tear something else down to prop up a new thing. I'm sure I would have liked a post about things he could do with rust, but now...