> Windows was an early adopter of Unicode, and its file APIs use UTF‑16 internally since Windows 2000
Wrong. Windows uses WTF-16 [0] despite what the documentation says.
It's an old problem that people on different OS's need to access the same filesystem, particularly Windows clients versus UNIXy clients.
While UNIX filesystems traditionally accept any sequence of bytes except slash and NUL, and treat "." and ".." specially, Windows filesystems have had many additional restrictions on valid filenames, e.g., a list of "reserved names" that should not be used, short and long names, and case-insensitive names [1].
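To make the kind of restriction concrete, here is a minimal sketch of a Win32-style name check. It covers only a simplified subset of the conventions documented in [1]; the helper name `win32_name_problems` and the exact rule set are assumptions for illustration, not a complete or authoritative validator.

```python
# Simplified subset of the Win32 naming conventions described in [1].
RESERVED_DEVICE_NAMES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)
RESERVED_CHARS = set('<>:"/\\|?*')


def win32_name_problems(name: str) -> list[str]:
    """Return the reasons a single path component would be rejected (hypothetical helper)."""
    problems = []
    if any(ch in RESERVED_CHARS or ord(ch) < 32 for ch in name):
        problems.append("reserved or control character")
    # The device-name check ignores case and any extension ("nul.txt" counts).
    stem = name.split(".", 1)[0].rstrip(" ").upper()
    if stem in RESERVED_DEVICE_NAMES:
        problems.append("reserved device name")
    if name.endswith((" ", ".")):
        problems.append("trailing space or period")
    return problems


print(win32_name_problems("nul.txt"))
print(win32_name_problems("fus\u00e9e.txt"))
```

None of these names would trouble a traditional UNIX filesystem, which is exactly why a server exporting the same files to both kinds of client needs extra machinery.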
The NetApp filer, a NAS storage appliance, runs a specialized OS called Data ONTAP, which maintains a special on-disk representation called WAFL to enable this. WAFL is famous for its copy-on-write representation for file contents [3]. But in practice, most people in the real world are affected by its treatment of filenames. It's useful to take a look at how it solves these problems.
The NAS presents the same set of files both as an NFS volume for UNIXy clients and as a CIFS volume for Windows clients. It does this by enhancing the directory entries with additional information and configuration features to satisfy both requirements. For typical problems and the features offered to solve them, see the NetApp documentation page about naming files and paths [2].
[1] https://learn.microsoft.com/en-us/windows/win32/fileio/namin...
[2] https://docs.netapp.com/us-en/ontap/nfs-admin/multi-byte-fil...
[3] Dave Hitz, "File System Design for an NFS File Server Appliance", https://www.cs.princeton.edu/courses/archive/fall17/cos318/r...
The APIs, Win32 in the case of [1], have further restrictions. If you want you can use a different API/personality and write whatever value you'd like (sans NUL and /) provided said personality has no limits -- NTFS does no validation itself.
In practice, Win32 being the default personality makes said plethora of restrictions true, but it isn't a "filesystem" limitation, rather an API restriction. A nuanced if unimportant difference.
If you look at OS X there are similar issues. The Apple File System is case insensitive for a particular Unicode version.
In practice, “UTF-16” means “potentially-ill-formed UTF-16”. It’s that simple.
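A small sketch of what "potentially ill-formed" means, using Python's codecs as a stand-in (this illustrates the concept only and says nothing about how Windows implements it): a lone surrogate is a perfectly good 16-bit code unit, but a strict UTF-16 encoder refuses it.

```python
# A lone high surrogate: a valid 16-bit code unit, but not valid UTF-16 on its own.
lone = "\ud800"

# Strict UTF-16 encoding refuses it.
try:
    lone.encode("utf-16-le")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# "surrogatepass" writes the code unit through unchanged, which matches the
# "potentially ill-formed UTF-16" behaviour described above.
raw = lone.encode("utf-16-le", errors="surrogatepass")

print(strict_ok)  # False
print(raw.hex())  # 00d8
```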
1. On Windows, accessed by WSL
2. On Linux (WSL), using UTF-8 locale
3. On Linux (WSL), using POSIX locale
The difference is weird for me as a user. I'd like to know about the decisions made behind these. If anyone has information, please let me know.
Beyond that, it is up to the API you're choosing to use to read the volume. Win32 has of course many more restrictions than POSIX would, but since Windows NT supports multiple personalities, you could still RW illegal Win32 characters under NT, e.g. with SFU.
The real solution is to force the entire world population to use the Rotokas language of Papua New Guinea.
Apple HFS+ did some things right. They did at least NFD. But Linux insanities brought them back to -Whomoglyph attacks
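For anyone unfamiliar with why normalization matters for filenames, here is a rough sketch using Python's `unicodedata`. It assumes plain NFC/NFD as the standard library implements them; what a particular filesystem actually does may differ in detail.

```python
import unicodedata

name = "fus\u00e9e.txt"
nfc = unicodedata.normalize("NFC", name)  # e-acute as one code point, U+00E9
nfd = unicodedata.normalize("NFD", name)  # "e" followed by combining U+0301

print(nfc == nfd)            # False
print(len(nfc), len(nfd))    # 9 10
print(nfc.encode("utf-8").hex(" "))
print(nfd.encode("utf-8").hex(" "))
```

The two strings render identically, yet a byte-oriented filesystem treats them as two different filenames.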
Of course, you lose round trip ability with legacy encodings, which is why we have the mess that is unicode. Oh and silly things like unicode flag emojis wouldn't work, but honestly maybe that would be for the best. Oh well, it's too late now so I guess we just accept it.
Also, backspace is locale dependent. In some languages, backspace removes the accent, which makes sense with combining characters, and in others it removes the whole character. Which is going to be fun when a whole syllable is a single code point.
Languages are hard, there is no way to make them simple.
I don't think we're there yet. I think if someone did make a complete list of "valid" emoji right now, which for the sake of argument I'll call "formally defined in the Unicode standard", it would even on an absolute scale look like we're a long ways away from a full 32-bits of valid combinations. But you have to think of this on the log scale because this is about "bits" and those four-person families are already quite a long ways along to a full 32 bits. It wouldn't take much more customization, or the formal addition of more people in a group, to get there.
And someone who knows more about Unicode than I do may be able to establish that there are already in the standard ways to get to more than 32 bits' worth of data in a single standardized glyph; I certainly wouldn't bet much against that already being true.
(Personally, I'll go with "worse". In hindsight, we should probably have frozen Unicode into the original Docomo (and the other phone company that had them) emoji necessary for interoperability, and then created the emoji as an extension into Unicode. It seems like it would be useful to "support Unicode" without having to come with the complete understanding of what is increasingly the most complicated "language" in Unicode; forget doing good Arabic rendering or trying to understand an ideographic language, the emojis blow all that complexity away now. But here we are.)
[1]: https://unicode.org/emoji/charts/emoji-zwj-sequences.html
[2]: https://www.unicode.org/reports/tr51/#Multi_Person_Groupings
Why not? They're just two (or more) characters from a special set next to each other that a font may combine. (and some ad-hoc ZWJ sequences) I don't think windows even ships a font that does that by default.
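To illustrate the "two characters from a special set" point, here is a minimal sketch that builds a flag sequence from a two-letter code. The helper `flag_sequence` is hypothetical; the only fact relied on is that the regional indicator symbols start at U+1F1E6 for the letter A.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def flag_sequence(country_code: str) -> str:
    """Map a two-letter code such as "FR" to its pair of regional indicators (hypothetical helper)."""
    if len(country_code) != 2 or not country_code.isascii() or not country_code.isalpha():
        raise ValueError("expected two ASCII letters")
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(ch) - ord("A")) for ch in country_code.upper()
    )


flag = flag_sequence("FR")
print([f"U+{ord(ch):04X}" for ch in flag])  # ['U+1F1EB', 'U+1F1F7']
print(len(flag))                            # 2
```

Whether that pair is drawn as one flag glyph or as two boxed letters is entirely up to the font and the renderer.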
Ext4 filenames have a maximum length of 255 bytes. That is the only legacy limit you have to deal with as a Linux user. And even that can be avoided by using more modern filesystems.
And we get filesystem level snapshots etc...
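One detail about the 255 limit: on ext4 it is counted in bytes, so with UTF-8 names the number of characters that fit depends on what those characters are. A quick sketch, where `NAME_MAX` and `fits_name_max` are illustrative names rather than anything read from the system:

```python
NAME_MAX = 255  # assumed per-component limit in bytes, as on ext4


def fits_name_max(name: str, limit: int = NAME_MAX) -> bool:
    """Check the UTF-8 encoded length of a filename against the byte limit."""
    return len(name.encode("utf-8")) <= limit


ascii_name = "a" * 255
accented_name = "\u00e9" * 128  # 128 characters, 256 bytes in UTF-8

print(fits_name_max(ascii_name))     # True
print(fits_name_max(accented_name))  # False
```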
If you change the locale to an uninstalled one, then yes. But if the locale is installed, then I don't see a problem.
echo $LANG
# output: en_US.UTF-8
touch fusée.txt
LANG=fr_FR.UTF-8 ls
# output: 'fus'$'\303\251''e.txt'
sudo locale-gen fr_FR.UTF-8
sudo update-locale
LANG=fr_FR.UTF-8 ls
# output: fusée.txt
Are you maybe using a non-UTF-8 locale?
Second, your comment shows you are lacking knowledge of Linux as well. Linux, as I have written in the footnote, accepts anything but 0x00 (NUL) and 0x2F (“/”). Other than that, all byte sequences are valid path components. If you consider these a problem, I'd like to remind you that the 2048 surrogate code points are a really small subset of the unrenderable combinations allowed in Linux.
Anyone is free to have their opinions, but before making bold claims, please at least do your due diligence.
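As a small illustration of how arbitrary bytes in a name can be handled in practice, here is a sketch using Python's `surrogateescape` error handler, which exists precisely so that undecodable filename bytes survive a round trip. The sample name is made up.

```python
raw_name = b"caf\xe9.txt"  # Latin-1 bytes; not valid UTF-8, yet a legal Linux filename

# Strict decoding fails...
try:
    raw_name.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False

# ...while "surrogateescape" maps each undecodable byte 0xNN to U+DCNN,
# so the original bytes can be recovered exactly.
as_text = raw_name.decode("utf-8", errors="surrogateescape")
round_trip = as_text.encode("utf-8", errors="surrogateescape")

print(decodable)               # False
print(ascii(as_text))          # 'caf\udce9.txt'
print(round_trip == raw_name)  # True
```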
POSIX 2024 encourages (but doesn’t require) implementations to disallow newline in file names, returning EILSEQ if you try to create a new file or directory with a name containing a newline. Thus far Linux hasn’t adopted that recommendation, but I personally hope it does some day.
For backward compatibility, it would have to be a mount option. It could be done at VFS level so it applies to all filesystems.
Personally I would go even further and introduce a “require_sane_filenames” mount option, which would block you (at the VFS layer) from creating any file name containing invalid UTF-8 (including overlong sequences and UTF-8 encoded surrogates), C0 controls or (UTF-8 encoded) C1 controls.
Also I think it would be great if filesystems had a superblock bit that declared they only supported “sane filenames”. Then even accessing such a file would error because it would be a sign of filesystem corruption.
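A userspace sketch of what such a check might look like. To be clear, `is_sane_filename` is a hypothetical function mirroring the rules listed above, not an existing kernel or libc API, and treating DEL (U+007F) as a forbidden control is an extra assumption.

```python
def is_sane_filename(name: bytes) -> bool:
    """Hypothetical check mirroring the proposed rules; not an existing kernel API."""
    if not name or b"/" in name or b"\x00" in name:
        return False
    try:
        # Strict UTF-8 decoding already rejects overlong forms and encoded surrogates.
        text = name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # C0 controls are U+0000..U+001F (DEL, U+007F, is included here as an assumption);
    # C1 controls are U+0080..U+009F.
    return not any(ord(ch) < 0x20 or 0x7F <= ord(ch) <= 0x9F for ch in text)


print(is_sane_filename("fus\u00e9e.txt".encode("utf-8")))  # True
print(is_sane_filename(b"\xed\xa0\x80.exe"))               # False
print(is_sane_filename(b"line\nbreak"))                    # False
```

The strict decoder does most of the work; only the control-character ranges need an explicit test.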
ls -l win32/
total 0
-rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\277\237''.exe'
-rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\267\213''.exe'
-rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\240\220''.exe'
-rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\274\273''.exe'
-rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\251\205''.exe'
-rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\255\223''.exe'
-rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\272\257''.exe'
-rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\264\207''.exe'
-rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\261\246''.exe'
-rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\254\266''.exe'
...
LANG=en_IE.UTF-8
LANGUAGE=en_IE:en
LC_CTYPE="en_IE.UTF-8"
LC_NUMERIC="en_IE.UTF-8"
LC_TIME="en_IE.UTF-8"
LC_COLLATE="en_IE.UTF-8"
LC_MONETARY="en_IE.UTF-8"
LC_MESSAGES="en_IE.UTF-8"
LC_PAPER="en_IE.UTF-8"
LC_NAME="en_IE.UTF-8"
LC_ADDRESS="en_IE.UTF-8"
LC_TELEPHONE="en_IE.UTF-8"
LC_MEASUREMENT="en_IE.UTF-8"
LC_IDENTIFICATION="en_IE.UTF-8"
LC_ALL=
https://learn.microsoft.com/en-us/previous-versions/windows/...
I was a bit confused by the detour via utf-8 to arrive at the code points and had to look up UTF-8 encoding first to understand how they relate. Then I tried out the following
candidate = chr(0xD800)
candidate2 = bytes([0xED, 0xA0, 0x80]).decode('utf-8', errors='surrogatepass')
print(candidate == candidate2) # True
and it seems that you could just iterate over code points directly with the `chr()` function.
It seems obvious that attempts to create files with such filenames ought to be blocked.
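For what it's worth, the `chr()` approach mentioned above can be sketched as follows. It also shows how a code point relates to the octal escapes that `ls` prints; U+DFDF is used because its encoding matches the first entry in the listing.

```python
# Every surrogate code point, U+D800 through U+DFFF, built directly with chr().
surrogates = [chr(cp) for cp in range(0xD800, 0xE000)]
print(len(surrogates))  # 2048

# "surrogatepass" yields the three-byte form that ls shows in octal.
encoded = chr(0xDFDF).encode("utf-8", errors="surrogatepass")
print(encoded.hex(" "))                        # ed bf 9f
print("".join(f"\\{b:03o}" for b in encoded))  # \355\277\237
```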