Those issues are common when you have Python 2 code that uses the unicode datatype and you have the task of migrating it to Python 3.
You run your Python 2 code on Python 3 and it fails. Most people at that point will place an encode() or decode() at the spot where it failed, when the correct fix would be to encode/decode at the I/O boundary (writing to files (and in Python 3 even that is not needed if you open files in text mode), network, etc.).
Ironically, Python 2 code that doesn't use unicode is easier to port.
When you program in Python 3 from the start, it's very rare to need to encode/decode strings. You only do that if you are working at the I/O level.
> And the documentation was particularly horrible regarding that, not even the experienced pythoners knew how to deal with it properly.
Because it's not really Python-specific knowledge. It's really about understanding what unicode is, what bytes are, and when to use each.
The general practice is to keep everything you work with as text, and do the conversion only when doing I/O. You should think of unicode/text as a representation of text, the same way you think of a picture or a sound. Just like images and audio, text can be encoded as bytes. Once it is bytes, it can be transmitted over the network, written to a file, etc. When you read the data back, you decode it into text.
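A minimal sketch of that pattern (the function name and payload are made up for illustration): decode once on the way in, work purely with str in the middle, encode once on the way out.

```python
def handle_request(raw: bytes) -> bytes:
    # I/O boundary: decode incoming bytes into text, once, up front
    text = raw.decode("utf-8")

    # Everything in the middle works purely with str -- no bytes in sight
    reply = text.upper()

    # I/O boundary: encode back to bytes only when sending the response
    return reply.encode("utf-8")

print(handle_request("héllo".encode("utf-8")))  # b'H\xc3\x89LLO'
```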
This is what Python 3 does:
- by default every string is of type str, which is unicode
- bytes are meant for binary data
- you can open files in text or binary mode; if you open in text mode, the encoding happens for you
- socket communication is where you need to convert strings to bytes and back
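Those points can be shown with runnable Python 3 (the temp file path here is created on the fly, just for the demo):

```python
import os
import tempfile

# str is unicode by default; bytes is a separate type for binary data
assert isinstance("héllo", str)
assert isinstance("héllo".encode("utf-8"), bytes)

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Text mode: you write str and Python encodes for you
with open(path, "w", encoding="utf-8") as f:
    f.write("héllo")

# Binary mode: you get raw bytes and must decode yourself,
# just like you would with data read from a socket
with open(path, "rb") as f:
    data = f.read()
assert data == b"h\xc3\xa9llo"      # the UTF-8 encoding of "héllo"
assert data.decode("utf-8") == "héllo"
```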
Python 2 is a tire fire in this area:
- text is bytes
- text can also be unicode (so there are two ways to represent the same thing)
- binary data can also be text
- I/O accepts text/bytes, with no conversion happening
- a lot of (most? all?) the stdlib actually expects str/bytes as input and output
- the cherry on top is that Python 2 also implicitly converts between unicode and str, so you can do crazy things like my_string.encode().encode() or my_string.decode()
So now you get Python 2 code where someone wanted to be correct (which is actually quite hard to do, mainly because of the implicit conversion), so the existing code has plenty of encode() and decode() calls, because some functions expect str and some expect unicode. Depending on the function, a "string" might be bytes or unicode.
Now you take such code and try to move it to Python 3, which no longer does implicit conversion and throws an error when it expected text and got bytes, or vice versa. str is now unicode, the unicode type no longer exists, and bytes is no longer the same thing as str. So your code blows up.
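This is what the blow-up looks like in Python 3: the implicit conversions that papered over mistakes in Python 2 are now hard errors.

```python
# Mixing text and bytes no longer silently converts; it raises
try:
    "price: " + b"42"
    mixed = True
except TypeError:
    # TypeError: str and bytes don't concatenate in Python 3
    mixed = False
assert not mixed

# str has no decode() and bytes has no encode() any more, so the crazy
# Python 2 chains like my_string.encode().encode() fail immediately
assert not hasattr("text", "decode")
assert not hasattr(b"data", "encode")
```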
Most people see an error and add encode() or decode(), often just trying which one works (like what you were removing), when the proper fix would actually be to remove the encode() and decode() calls elsewhere in the code.
It's quite a difficult task when your code base is big, which is why Guido put so much effort into type annotations and mypy. One of their intended benefits is supposed to be helping with exactly these issues.
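For example, annotating the I/O boundary lets mypy flag a misplaced call statically (the function names here are hypothetical, and you would run mypy separately to get the check):

```python
def parse_header(line: str) -> str:
    # Pure text processing: annotated so mypy rejects bytes arguments
    return line.split(":", 1)[0].strip()

def read_header(raw: bytes) -> str:
    # The one place where bytes become text
    return parse_header(raw.decode("utf-8"))

# mypy would reject a call like parse_header(b"Host: example.com")
# with an incompatible-type error, before anything runs
print(read_header(b"Host: example.com"))  # Host
```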