Filenames with Accents (2011)

(nedbatchelder.com)

21 points · FrankSansC · 4y ago · 18 comments

baal80spam · 4y ago

Side note: after all these years I still don't feel comfortable using special characters (like ą, ż, ź) and spaces in filenames on Windows. DOS times sit deep in my soul and it just doesn't feel right.

renewiltord · 4y ago

This is amusing but my filesystem has no spaces in it. It's still a pain in the ass from the command line to handle spaces and instead of `-print0` and all that shit I just ban spaces from my filesystem. `-` and `_` are sufficient as spacers.

I know zsh handles auto-complete well but I can `y-a-w` out of my `nvim` `:terminal` a lot easier.

enriquto · 4y ago

I like this approach. How do you enforce your ban? Is there a mount(8) option for that? Or do you simply avoid these filenames "by hand"? What happens if you untar an archive and it contains filenames with spaces?
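One way to check such a ban "by hand" after unpacking an archive, as a minimal sketch: the helper names `names_with_spaces` and `despaced` are made up for illustration, and the script only reports offenders and suggests a replacement name rather than renaming anything.

```python
import os

def names_with_spaces(root):
    """Yield paths under root whose final component contains a space."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if " " in name:
                yield os.path.join(dirpath, name)

def despaced(name, spacer="-"):
    """Return name with every space replaced by the chosen spacer."""
    return name.replace(" ", spacer)

if __name__ == "__main__":
    # Report only; renaming is left to the user because the new name
    # could collide with an existing file.
    for path in names_with_spaces("."):
        print(path, "->", despaced(os.path.basename(path)))
```

This does not prevent such names from being created; it only finds them afterwards.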

klyrs · 4y ago

I've learned to live without file extensions... but I still wince at 4-letter file extensions. I haven't used Windows in almost 20 years.

Tagbert · 4y ago

Do your files really not have file extensions or is it just that they are hidden by default in the file browser?

In macOS, extensions are important but they are hidden by default. I always force them to be shown when I set up a new Mac. Hiding them obscures important information and offers nearly no benefit.

voussoir · 4y ago

Windows user here. I feel that the computer should work for the human, not the other way around. I use whatever Unicode characters are appropriate for the file, except the minimum set of disallowed characters: colon, slash, asterisk...

Any tool that can't handle that is faulty, and I'll either stop using it or make a temporary symlink to get the job done. But most tools are fine with Unicode these days. Of course, I speak from the perspective of a personal computer with no deadlines, business requirements, or software limitations.

I still use the old cmd.exe with the Terminal font. It doesn't even render most Unicode properly; the characters come out as ?. I deal with it because I like cmd.exe and I like Terminal and I'd rather see ? than butcher the file's name by romanizing it.

KSPAtlas · 4y ago

It hurts readability because l and ł are completely different.

juancn · 4y ago

You should normalize names on write; fixing them on read is very hard. You can have perfectly valid, denormalized strings in which different code points are in different normalization forms.

So if you have four possible normalizations (NFD, NFC, NFKD, NFKC) and your string has N ambiguous code points, the number of possible strings you need to try is up to 4^N.
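To make the ambiguity concrete, here is a small sketch using Python's standard `unicodedata` module. The helpers `variants` and `same_name` are illustrative names, and comparing in NFC is just one reasonable choice, not the only one.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def variants(name):
    """Return the distinct spellings of name under the four normalization forms."""
    return {unicodedata.normalize(form, name) for form in FORMS}

def same_name(a, b):
    """Compare two names after bringing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "caf\u00e9.txt"      # é as a single code point
decomposed = "cafe\u0301.txt"   # e followed by a combining acute accent

# The two strings render identically but are different code point sequences,
# so a byte-wise filename lookup treats them as different files.
print(composed == decomposed)
print(same_name(composed, decomposed))
print(len(variants(composed)))
```

Many characters have fewer than four distinct spellings (for é, NFC and NFKC coincide, as do NFD and NFKD), so the real number of candidates is usually well below the worst case.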

rurban · 4y ago

only use NFC.

NFD was used by HFS+, but got abandoned by the current insecure APFS (which uses unidentifiable names). NFD is faster to produce, but NFC is complete and needs less space. with NFD you can still have reordered sequence variants, and thus nonidentifiable names.

and the NFKD, NFKC hacks should only be used with Python internally, because they didn't understand Unicode, or just read the TRs without understanding them.

it will take several more decades until filesystems find out about their wrong decisions. maybe I'll bug them with CVEs some day.

asddubs · 4y ago

seems like the only way to really achieve that is to bake it into the API for storing/naming files of the respective operating system
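Short of changing the OS, an application can approximate normalize-on-write in userspace. A minimal sketch, assuming NFC is the target form and that only the final path component needs normalizing; `nfc_path` and `create_normalized` are hypothetical helpers, not part of any OS or library API.

```python
import os
import unicodedata

def nfc_path(path):
    """Return path with its final component normalized to NFC."""
    head, tail = os.path.split(path)
    return os.path.join(head, unicodedata.normalize("NFC", tail))

def create_normalized(path, mode="w", **kwargs):
    """Open a file for writing under its NFC-normalized name."""
    return open(nfc_path(path), mode, **kwargs)
```

This only helps for files created through the wrapper; anything written by other programs, or unpacked from an archive, bypasses it, which is the point of the comment above.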

enriquto · 4y ago

Indeed. The unix world would be a much happier place if the creat system call normalized the strings it receives to replace literal spaces with non-breaking spaces, and similar stuff. Regular users wouldn't notice, and it would simplify tons of shell scripts.

Anthony-G · 4y ago

In 2009, David A. Wheeler wrote a comprehensive article covering problems with Unix/Linux/POSIX filenames¹. Given that the OS naïvely treats filenames as a simple stream of bytes, he advocated that developers use UTF-8 for encoding filenames. He mentioned the issue of multiple normalisation systems being used to encode characters that have more than one Unicode representation but glossed over it because such problems are “overshadowed by the terrible awful even worse problems caused by filenames all being in random unguessable charsets”.

I’m guessing that, by now, most developers on Unix-like systems would be using UTF-8 for filenames – though a decade after these articles were published, there still doesn’t seem to be any good/universal solution to the problem of characters with multiple Unicode representations.

¹ https://dwheeler.com/essays/fixing-unix-linux-filenames.html
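Since the OS hands back filenames as bytes, checking whether a directory already follows the UTF-8 convention is straightforward. A minimal sketch; `is_valid_utf8` and `non_utf8_names` are illustrative names, and the check says nothing about which normalization form a valid name uses.

```python
import os

def is_valid_utf8(raw_name):
    """Return True if the raw filename bytes decode as UTF-8."""
    try:
        raw_name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

def non_utf8_names(directory):
    """List raw filenames in directory that are not valid UTF-8."""
    # Passing a bytes path makes os.listdir return names as raw bytes,
    # so nothing is decoded (or mangled) before the check.
    return [name for name in os.listdir(os.fsencode(directory))
            if not is_valid_utf8(name)]
```

For example, `b"caf\xc3\xa9.txt"` passes the check, while the Latin-1 encoded `b"caf\xe9.txt"` does not.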
