I’m guessing that, by now, most developers on Unix-like systems would be using UTF-8 for filenames – though a decade after these articles were published, there still doesn’t seem to be any good/universal solution to the problem of characters with multiple Unicode representations.
¹ https://dwheeler.com/essays/fixing-unix-linux-filenames.html
So if you have four possible normalizations (NFD, NFC, NFKD, NFKC) and each of the N ambiguous codepoints in your string could be in any of those forms, the number of possible strings you need to try is up to 4^N.
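To make the blow-up concrete, here is a rough Python sketch that only looks at the two canonical spellings (composed vs. decomposed) of each character. The helper name `spelling_variants` is made up for illustration, and the compatibility forms (NFKC/NFKD) are deliberately left out, so treat the count as a lower bound rather than the full picture.

```python
import itertools
import unicodedata

def spelling_variants(name):
    """Return every mix of composed/decomposed spellings of `name`.

    Hypothetical helper: each character is either kept in its NFC
    (composed) form or replaced by its NFD (decomposed) form.
    """
    choices = []
    for ch in unicodedata.normalize("NFC", name):
        decomposed = unicodedata.normalize("NFD", ch)
        # A set collapses the two options when the character has no decomposition.
        choices.append({ch, decomposed})
    return {"".join(combo) for combo in itertools.product(*choices)}

# "été": two ambiguous characters, so 2 * 1 * 2 = 4 distinct spellings.
variants = spelling_variants("\u00e9t\u00e9")
for v in sorted(variants, key=len):
    print(len(v), v.encode("utf-8"))
```

All of these render identically, yet a byte-oriented file system would treat each one as a different name.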
NFD was used by HFS+, but got abandoned by the current, insecure APFS (which uses unidentifiable names). NFD is faster to produce, but NFC is complete and needs less space. With NFD you can still have reordered sequence variants, and thus non-identifiable names.
And the NFKD and NFKC hacks should only be used with Python internally, because they didn't understand Unicode, or just read the TRs without understanding them.
It will take several more decades until filesystems find out about their wrong decisions. Maybe I'll bug them with CVEs some day.
Nowadays, there’s an understanding to assume those bytes are strings encoded in some ISO-8859 variant or UTF-8, but technically, the creat system call doesn’t receive strings; it receives byte arrays.
Historically, that was the (somewhat) right decision because it meant file systems didn’t need to know much about character encodings (they only needed to know the byte value of ‘/’ and that zero is the name terminator), giving you a nice separation of concerns.
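As a minimal sketch of how little the traditional model needs to know, the check below treats a name purely as bytes. The function name `is_valid_component` is hypothetical, and it ignores length limits and the special entries `.` and `..`.

```python
def is_valid_component(name: bytes) -> bool:
    """Accept any non-empty byte string that contains neither b'/' nor NUL.

    No character encoding is consulted at all; this mirrors the classic
    Unix view of a file name as an opaque byte array.
    """
    return len(name) > 0 and b"/" not in name and b"\x00" not in name

# The same visible name in two encodings: both pass, and they are different names.
latin1_name = "caf\u00e9".encode("latin-1")
utf8_name = "caf\u00e9".encode("utf-8")
print(is_valid_component(latin1_name), is_valid_component(utf8_name))
print(latin1_name == utf8_name)

# Bytes that are not valid UTF-8 are still acceptable under this rule.
print(is_valid_component(b"\xff\xfe"))
```

Whether those bytes mean anything as text is left entirely to user space, which is the separation of concerns described above.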
With Unicode, if you want to normalize names on write, or even only reject incorrectly normalized names, or have case-insensitive file names, your file system code needs to know a lot of Unicode. That can be problematic on small embedded systems.
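For comparison, here is roughly what the "reject incorrectly normalized names" and "case-insensitive" policies look like in user space with Python's `unicodedata` module (`is_normalized` needs Python 3.8+). The names `check_name` and `fold_key` are invented for this sketch, the choice of NFC is an assumption, and the case folding is a simplification of full Unicode caseless matching.

```python
import unicodedata

def check_name(name: bytes) -> str:
    """Reject names that are not valid UTF-8 or not already in NFC."""
    try:
        text = name.decode("utf-8")
    except UnicodeDecodeError:
        raise ValueError("name is not valid UTF-8")
    if not unicodedata.is_normalized("NFC", text):
        raise ValueError("name is not NFC-normalized")
    return text

def fold_key(text: str) -> str:
    """Simplified lookup key for case-insensitive comparison."""
    return unicodedata.normalize("NFC", text).casefold()

print(check_name("caf\u00e9".encode("utf-8")))      # composed form is accepted
try:
    check_name("cafe\u0301".encode("utf-8"))        # decomposed form is rejected
except ValueError as err:
    print("rejected:", err)
print(fold_key("README.TXT") == fold_key("readme.txt"))
```

Even this small version leans on the Unicode character database behind `unicodedata`, which is the kind of dependency that gets awkward inside a file system driver on a small embedded target.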
I know zsh handles auto-complete well but I can `y-a-w` out of my `nvim` `:terminal` a lot easier.
In Mac OS, extensions are important but they are hidden by default. I always force them to be shown when I set up a new Mac. Hiding them obscures important information and offers nearly no benefit.
Any tool that can't handle that is faulty, and I'll either stop using it or make a temporary symlink to get the job done. But most tools are fine with Unicode these days. Of course, I speak from the perspective of a personal computer situation with no deadlines, business requirements, or software limitations.
I still use the old cmd.exe with the Terminal font. It doesn't even render most Unicode properly; the characters come out as ?. I deal with it because I like cmd.exe and I like Terminal and I'd rather see ? than butcher the file's name by romanizing it.