>>> x = '日本語'*100000000
>>> import time
>>> t = time.time(); y = x.encode(); time.time() - t # takes nontrivial time
>>> t = time.time(); y = x.encode(); time.time() - t # not cached; not any faster
Generally, the only reason this would happen implicitly is for I/O; actual operations on the string operate directly on the internal representation.
CPython (since 3.3, per PEP 393) uses 8, 16 or 32 bits per character, chosen according to the maximum code point found in the string; uint8 is thus used for all strings representable in Latin-1, not just "ASCII". (It does have other optimizations for ASCII strings.)
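A rough way to see the per-character width for yourself, assuming CPython 3.3 or later (other implementations store strings differently). The helper name bytes_per_char is made up for this sketch; it compares the sizes of two freshly built strings so that the fixed object header cancels out:

```python
import sys

def bytes_per_char(ch):
    """Estimate storage per character for a string made only of `ch`.

    Hypothetical helper for illustration; relies on CPython's
    sys.getsizeof reporting the flexible string representation.
    """
    # Both strings have the same header, so the difference is pure payload.
    small = sys.getsizeof(ch * 1000)
    large = sys.getsizeof(ch * 2000)
    return (large - small) // 1000

for sample in ('a', 'é', '日', '😀'):
    print(f'U+{ord(sample):04X}: {bytes_per_char(sample)} byte(s) per character')
```

On CPython this should report 1 byte for both 'a' and 'é' (Latin-1 range), 2 for '日' and 4 for '😀'.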
The reason Windows is stuck with UTF-16 is quite easy to understand: backwards compatibility. Those APIs were introduced before there were supplementary Unicode planes, so "UTF-16" could be equated with UCS-2; the surrogate-pair logic was then bolted on top of that. Basically the same thing happened in Java.
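To make the surrogate-pair part concrete, here is a small sketch that lists the 16-bit code units Python's own UTF-16 encoder produces. The helper name utf16_code_units is made up for illustration; it says nothing about how Windows or Java implement this internally:

```python
import struct

def utf16_code_units(s):
    """Return the UTF-16 code units of `s` as a list of ints.

    Hypothetical helper for illustration only.
    """
    data = s.encode('utf-16-le')  # little-endian, no BOM
    # Each code unit is one unsigned 16-bit integer.
    return list(struct.unpack(f'<{len(data) // 2}H', data))

for sample in ('日', '😀'):
    units = ' '.join(f'{u:04X}' for u in utf16_code_units(sample))
    print(f'U+{ord(sample):04X} -> {units}')
```

A BMP character such as '日' (U+65E5) fits in one code unit, which is all UCS-2 ever had to handle; '😀' (U+1F600) lies in a supplementary plane and comes out as the surrogate pair D83D DE00.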