Understanding Surrogate Pairs: Why Some Windows Filenames Can't Be Read

(zaferbalkan.com)

85 points · feldrim · 1y ago · 62 comments

account42 1y ago

Falsehoods programmers believe about filenames #1: Filenames are text and can be represented in common text encodings.

> Windows was an early adopter of Unicode, and its file APIs use UTF‑16 internally since Windows 2000

Wrong. Windows uses WTF-16 [0] despite what the documentation says.

[0] https://simonsapin.github.io/wtf-8/#ill-formed-utf-16

ripe 1y ago

Thank you for posting the WTF-16 document. Very relevant for OP.

It's an old problem that people on different OSes need to access the same filesystem, particularly Windows clients versus UNIXy clients.

While UNIX filesystems traditionally accept any sequence of bytes except slash and NUL, and treat "." and ".." specially, Windows filesystems have had many additional restrictions on valid filenames, e.g., a list of "reserved names" that should not be used, short and long names, and case-insensitive names [1].

The NetApp filer, a NAS storage appliance, runs a specialized OS called Data ONTAP, which maintains a special on-disk representation called WAFL to enable this. WAFL is famous for its copy-on-write representation for file contents. [3] But in practice, most people in the real world are affected by its treatment of filenames. It's useful to take a look at how it solves these problems.

The NAS presents the same set of files both as an NFS volume for UNIXy clients and as a CIFS volume for Windows clients. It does this by enhancing the directory entries with additional information and configuration features to satisfy both requirements. For typical problems and the features they offer to solve them, see their documentation page about naming files and paths [2].

[1] https://learn.microsoft.com/en-us/windows/win32/fileio/namin...

[2] https://docs.netapp.com/us-en/ontap/nfs-admin/multi-byte-fil...

[3] Dave Hitz, "File System Design for an NFS File Server Appliance", https://www.cs.princeton.edu/courses/archive/fall17/cos318/r...

sumtechguy 1y ago

If you want to subtly break a Windows install, you can use setCaseSensitiveInfo with fsutil. It turns case sensitivity on or off for a directory. There is also a similar set of options for Samba shares, which comes with interesting tradeoffs for the speed of reading a directory listing.

p_ing 1y ago

NTFS has the same two restrictions that many UNIX file systems have, NUL and slash.

The APIs, Win32 in the case of [1], have further restrictions. If you want you can use a different API/personality and write whatever value you'd like (sans NUL and /) provided said personality has no limits -- NTFS does no validation itself.

In practice, Win32 being the default personality makes said plethora of restrictions true, but it isn't a "filesystem" limitation, rather an API restriction. A nuanced if unimportant difference.

zombot 1y ago

Microsoft never implements a standard, they only ever implement their own shit. Sometimes it's a close enough parody of a standard to fool superficial onlookers, but that's as close as you'll ever get.

formerly_proven 1y ago

Java, NT, .NET, "wide" C and C++ and a few others from the same time frame ended up with WTF-16 because surrogate pairs didn't exist when they were designed. They were designed with UCS-2, which is a fixed-length encoding. Unicode 2.0 then extended that to be variable length (16/32-bit) using surrogate pairs and that's where all the systems come from which don't validate surrogate pairs.

nitwit005 1y ago

In this case, they correctly met a standard, and the standard changed.

If you look at OS X, there are similar issues. The Apple File System is case insensitive for a particular Unicode version.

FirmwareBurner 1y ago

>Microsoft never implements a standard

Win32 ?

chrismorgan 1y ago

Nothing uses well-formed UTF-16. I don’t think I know of a single piece of software or library that uses 16-bit code units that validates them.

In practice, “UTF-16” means “potentially-ill-formed UTF-16”. It’s that simple.

layer8 1y ago

Historically, this is because Windows NT used UCS-2 [0]. Unicode only moved beyond 65,536 characters, and introduced the concept of surrogate pairs, with Unicode 2.0 in 1996.

[0] https://www.unicode.org/faq/utf_bom.html#utf16-11

feldrim (OP) 1y ago

Hi all. OP here. I added a Postscriptum about the surrogate pairs and their status in Linux. I used WSL to access those files under Windows, and generated the same on Linux. You can see that the behavior differs for the same file names:

1. On Windows, accessed by WSL

2. On Linux (WSL), using UTF-8 locale

3. On Linux (WSL), using POSIX locale

The difference is weird for me as a user. I'd like to know about the decisions made behind these. If anyone has information, please let me know.

formerly_proven 1y ago

The Linux section just seems to be artifacts of the WSL hacks, it has nothing to do with how Linux filenames function. Those are simply bags of bytes, the encoding only matters for displaying them, and isn't interpreted internally. ls failing to access the .exe is clearly a WSL filesystem issue and not a Linux / ls issue. You also can't set a UTF-16 locale because that's not what a locale is. UTF-16/32 vs SBCS and UTF-8 is the wide/narrow character distinction, which is a whole separate thing, different ABIs, different APIs.

feldrim (OP) 1y ago

WSL-to-Windows, yes, it is due to translations. But within WSL, I'm not sure. I'll try to replicate them on an Ubuntu VM for comparison.

p_ing 1y ago

Your subscript 2 applies to NTFS. The only characters NTFS does not allow are NUL and "/".

Beyond that, it is up to the API you're choosing to use to read the volume. Win32 has of course many more restrictions than POSIX would, but since Windows NT supports multiple personalities, you could still RW illegal Win32 characters under NT, e.g. with SFU.

zombot 1y ago

WSL is not Linux, despite whatever Microsoft says.

p_ing 1y ago

WSL is Linux -- it's an automatically managed VM with some special sauce for connectivity between the parent partition and guest.

Devasta 1y ago

Stuff like this is why UTF and any attempt at trying to encode all characters is a mistake.

The real solution is to force the entire world population to use the Rotokas language of Papua New Guinea.

rurban 1y ago

No, the real solution is to follow the Unicode security guidelines for identifiers. Especially on Linux, where the silly garbage-in, garbage-out mantra doesn't fly with identifiers, because identifiers need to stay identifiable.

Apple HFS+ did some things right. They did at least NFD. But Linux insanities brought them back to -Whomoglyph attacks

wheybags 1y ago

I still think we should have forced everything into a 32-bit char, with no distinction between code points and grapheme clusters. One press on backspace removes one char. The address of char 7 is base + 7×4. String length is byte length / 4. cat /dev/urandom is a valid string; it's the font's job to deal with unknown values, and if you just want to process the text you don't need to care. Everything about text processing becomes super easy, like in the old ASCII-only K&R C example code. I'm not 100% certain, but I don't think there's a widely used language that couldn't be represented that way.

Of course, you lose round trip ability with legacy encodings, which is why we have the mess that is unicode. Oh and silly things like unicode flag emojis wouldn't work, but honestly maybe that would be for the best. Oh well, it's too late now so I guess we just accept it.
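
The indexing arithmetic described above can be tried with UTF-32, which already stores one code point per 32-bit unit. This is only a rough sketch: the helper `char_at` and the sample strings are made up for illustration, and the last lines show that a combining sequence still occupies two units, which is the part a fixed-width scheme does not solve on its own.

```python
# Fixed-width 32-bit units: every code point takes exactly 4 bytes.
text = "na\u00efve caf\u00e9"          # precomposed ï and é
buf = text.encode("utf-32-le")         # little-endian, no byte order mark


def char_at(buf: bytes, index: int) -> str:
    """Return the code point stored at unit `index` (offset = index * 4)."""
    offset = index * 4
    return buf[offset:offset + 4].decode("utf-32-le")


length = len(buf) // 4                 # string length = byte length / 4
seventh = char_at(buf, 7)              # "address of char 7 is base + 7*4"

# A base letter plus a combining accent is one grapheme but two units.
decomposed = "e\u0301"
decomposed_units = len(decomposed.encode("utf-32-le")) // 4
```

So random access by code point is cheap, but "one backspace removes one character" still depends on how the text was composed.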

ianburrell 1y ago

Grapheme clusters are locale dependent. Also, if you aren't allowing combining characters, then you are going to need lots of extra code points. In some languages, like Indian ones, vowels are combining characters. And there are languages where multiple code points produce a single grapheme cluster, like Hangul syllables. You are going to need a lot more code points to represent all possible strings. Text processing is going to be much harder because there would be a thousand different representations of a Hangul character.

Also, backspace is locale dependent. In some languages, backspace removes the accent, which makes sense with combining characters, and in others it removes the whole character. Which is going to be fun when a whole syllable is a single code point.

Languages are hard, there is no way to make them simple.
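
To make the Hangul point concrete, here is a small sketch using Python's standard `unicodedata` module. The sample syllable is an arbitrary choice; the counts (19 leading consonants, 21 vowels, 28 trailing options including "none") are the ones used by the Unicode Hangul composition algorithm.

```python
import unicodedata

syllable = "\ud55c"                                  # HANGUL SYLLABLE HAN (U+D55C)
jamo = unicodedata.normalize("NFD", syllable)        # split into conjoining jamo
jamo_code_points = [f"U+{ord(c):04X}" for c in jamo]

# Recomposing gives back the single precomposed code point.
recomposed = unicodedata.normalize("NFC", jamo)

# Number of precomposed modern syllables: leading * vowel * trailing.
precomposed_count = 19 * 21 * 28
```

One visible syllable is therefore either one code point or three, depending on normalization, and text processing has to cope with both.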

jerf 1y ago

For better or worse, thanks to emoji Zero Width Joiner support [1], we're well on our way to there being more than 4 billion potential Unicode "characters". 4 billion is only 32 bits and you start spending a few bits here on hair style and a few bits there on skin color and a few bits on "misc" and then allow arbitrary combinations of them into composite families [2] and you can burn through 32-bits fairly quickly.

I don't think we're there yet. I think if someone did make a complete list of "valid" emoji right now, which for the sake of argument I'll call "formally defined in the Unicode standard", it would even on an absolute scale look like we're a long ways away from a full 32-bits of valid combinations. But you have to think of this on the log scale because this is about "bits" and those four-person families are already quite a long ways along to a full 32 bits. It wouldn't take much more customization, or the formal addition of more people in a group, to get there.

And someone who knows more about Unicode than I do may be able to establish that there are already in the standard ways to get to more than 32 bits' worth of data in a single standardized glyph; I certainly wouldn't bet much against that already being true.

(Personally, I'll go with "worse". In hindsight, we should probably have frozen Unicode into the original Docomo (and the other phone company that had them) emoji necessary for interoperability, and then created the emoji as an extension into Unicode. It seems like it would be useful to "support Unicode" without having to come with the complete understanding of what is increasingly the most complicated "language" in Unicode; forget doing good Arabic rendering or trying to understand an ideographic language, the emojis blow all that complexity away now. But here we are.)

[1]: https://unicode.org/emoji/charts/emoji-zwj-sequences.html

[2]: https://www.unicode.org/reports/tr51/#Multi_Person_Groupings

extraduder_ire 1y ago

>Oh and silly things like unicode flag emojis wouldn't work, but honestly maybe that would be for the best.

Why not? They're just two (or more) characters from a special set next to each other that a font may combine (plus some ad-hoc ZWJ sequences). I don't think Windows even ships a font that does that by default.
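
As a sketch of the "two characters from a special set" idea: flag emoji are pairs of regional indicator symbols, which start at U+1F1E6 for the letter A. The function name `flag` is made up for this example, and whether the pair renders as a flag is up to the font.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def flag(country_code: str) -> str:
    """Map each ASCII letter of a two-letter code to its regional indicator."""
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(c) - ord("A"))
        for c in country_code.upper()
    )


ie = flag("IE")
code_points = [f"U+{ord(c):04X}" for c in ie]
utf16_units = len(ie.encode("utf-16-le")) // 2   # each symbol needs a surrogate pair
```

In a 32-bit-per-character scheme this is still two units; nothing about it requires variable-width encoding.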

theiebrjfb 1y ago

Yet another reason to use Linux everywhere. It is 2025 and Windows (and probably Mac) users have to deal with weird Unicode filesystem issues. Good luck putting Chinese characters or emoticons into filenames.

An ext4 filename has a maximum length of 255 bytes. That is the only legacy limit you have to deal with as a Linux user. And even that can be avoided by using more modern filesystems.

And we get filesystem level snapshots etc...
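
Since the ext4 limit is counted in bytes rather than characters, non-ASCII names hit it sooner. A quick sketch, assuming names are stored as UTF-8 (the constant and the helper `fits` are illustrative, not an API):

```python
NAME_MAX = 255  # bytes per file name component on ext4


def fits(name: str) -> bool:
    """Check the UTF-8 encoded length of a name against the byte limit."""
    return len(name.encode("utf-8")) <= NAME_MAX


ascii_ok = fits("a" * 255)                # 255 bytes
cjk_ok = fits("\u6587" * 85)              # 3 bytes each -> 255 bytes
cjk_too_long = fits("\u6587" * 86)        # 258 bytes
emoji_too_long = fits("\U0001F600" * 64)  # 4 bytes each -> 256 bytes
```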

layer8 1y ago

You have the same, if not worse, issue on Linux with filenames that aren’t valid UTF-8 sequences. Not to mention that on Linux switching the locale may change the interpretation of filenames as characters, which isn’t the case with NTFS.
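
A minimal illustration of the locale point, treating a file name as the raw bytes a Unix filesystem stores. The two codecs stand in for a UTF-8 locale and a Latin-1 locale; the bytes never change, only how they are read as characters.

```python
raw = b"fus\xc3\xa9e.txt"           # bytes as stored in the directory entry

as_utf8 = raw.decode("utf-8")       # what a UTF-8 locale would display
as_latin1 = raw.decode("latin-1")   # what an ISO-8859-1 locale would display

# Both interpretations map back to the same stored bytes.
same_bytes = (
    as_utf8.encode("utf-8") == raw and as_latin1.encode("latin-1") == raw
)
```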

kwertzzz 1y ago

> Not to mention that on Linux switching the locale may change the interpretation of filenames as characters, which isn’t the case with NTFS.

If you change the locale to an uninstalled one, then yes. But if the locale is installed, then I don't see a problem.

    echo $LANG
    # output: en_US.UTF-8
    touch fusée.txt
    LANG=fr_FR.UTF-8 ls
    # output: 'fus'$'\303\251''e.txt'
    sudo locale-gen fr_FR.UTF-8
    sudo update-locale
    LANG=fr_FR.UTF-8 ls
    # output: fusée.txt

Are you maybe using non-UTF-8 locale?

feldrim (OP) 1y ago

I see two points here. First, you did not read the article and did not see the footnote that these are valid in Linux as well.

Second, your comment shows that you lack knowledge of Linux as well. Linux, as I have written in the footnote, accepts anything but 0x00 (null) and 0x2F (“/”). Other than that, all bytes are valid in path names. If you consider these a problem, I'd like to remind you that the 2048 surrogate code points are a really small subset of the unrenderable combinations allowed in Linux.

Anyone is free to have their opinions, but please do your due diligence before making bold claims.
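
A rough sketch of the "anything but 0x00 and 0x2F" rule, plus how a program can carry such names around without losing bytes. The helper `is_valid_component` is hypothetical; the `surrogateescape` error handler is the standard Python mechanism for round-tripping undecodable bytes.

```python
def is_valid_component(raw: bytes) -> bool:
    """Byte-level rule only: non-empty, no NUL, no slash."""
    return len(raw) > 0 and b"\x00" not in raw and b"/" not in raw


raw = b"report-\xff\xfe.txt"    # not valid UTF-8, still an acceptable name

# Undecodable bytes become lone surrogates U+DC80..U+DCFF and map back exactly.
name = raw.decode("utf-8", errors="surrogateescape")
roundtrip = name.encode("utf-8", errors="surrogateescape")
```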

skissane 1y ago

> In Linux, as I have written in the foot note, accepts anything but 0x00 (null) and 0x2F (“/”)

POSIX 2024 encourages (but doesn’t require) implementations to disallow newline in file names, returning EILSEQ if you try to create a new file or directory with a name containing a newline. Thus far Linux hasn’t adopted that recommendation, but I personally hope it does some day.

For backward compatibility, it would have to be a mount option. It could be done at VFS level so it applies to all filesystems.

Personally I would go even further and introduce a “require_sane_filenames” mount option, which would block you (at the VFS layer) from creating any file name containing invalid UTF-8 (including overlong sequences and UTF-8 encoded surrogates), C0 controls or (UTF-8 encoded) C1 controls.

Also I think it would be great if filesystems had a superblock bit that declared they only supported “sane filenames”. Then even accessing such a file would error because it would be a sign of filesystem corruption.
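
What such a check might look like, as a user-space sketch only: `require_sane_filenames` is the hypothetical option named above, and `is_sane_filename` is a made-up helper, not a kernel interface. Python's strict UTF-8 decoder already rejects overlong sequences and encoded surrogates, so only the control ranges need an explicit test.

```python
def is_sane_filename(raw: bytes) -> bool:
    """Reject invalid UTF-8, C0 controls (U+0000-U+001F) and C1 controls (U+0080-U+009F)."""
    try:
        text = raw.decode("utf-8")      # strict: overlongs and surrogates fail here
    except UnicodeDecodeError:
        return False
    for ch in text:
        cp = ord(ch)
        if cp <= 0x1F or 0x80 <= cp <= 0x9F:
            return False
    return True


ok = is_sane_filename("fus\u00e9e.txt".encode("utf-8"))
overlong = is_sane_filename(b"\xc0\xaf")            # overlong encoding of "/"
surrogate = is_sane_filename(b"\xed\xa0\x80.exe")   # UTF-8 encoded U+D800
newline = is_sane_filename(b"a\nb")
c1_control = is_sane_filename(b"a\xc2\x85b")        # U+0085 NEL
```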

dwdz 1y ago

The script works just fine on real Linux, it creates 2048 files and ls command lists them all with different names.

    ls -l win32/
    total 0
    -rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\277\237''.exe'
    -rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\267\213''.exe'
    -rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\240\220''.exe'
    -rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\274\273''.exe'
    -rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\251\205''.exe'
    -rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\255\223''.exe'
    -rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\272\257''.exe'
    -rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\264\207''.exe'
    -rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\261\246''.exe'
    -rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\254\266''.exe'
    ...
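
The octal escapes in the listing above are the raw bytes of each name. As a small sketch, decoding the first and third entries with Python's `surrogatepass` handler shows which surrogate code points they stand for; the helper name is made up for this example.

```python
def decode_listing_bytes(raw: bytes) -> str:
    """Decode UTF-8-style bytes while letting surrogate code points through."""
    return raw.decode("utf-8", errors="surrogatepass")


first = decode_listing_bytes(bytes([0o355, 0o277, 0o237]))   # \355\277\237
third = decode_listing_bytes(bytes([0o355, 0o240, 0o220]))   # \355\240\220

first_cp = f"U+{ord(first):04X}"
third_cp = f"U+{ord(third):04X}"

# A strict decoder refuses the same bytes, which is why ls prints escapes.
try:
    bytes([0o355, 0o277, 0o237]).decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
```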

feldrim (OP) 1y ago

Oh, great. Can you also share the locale? I'll write another Postscriptum section then.

dwdz 1y ago

    LANG=en_IE.UTF-8
    LANGUAGE=en_IE:en
    LC_CTYPE="en_IE.UTF-8"
    LC_NUMERIC="en_IE.UTF-8"
    LC_TIME="en_IE.UTF-8"
    LC_COLLATE="en_IE.UTF-8"
    LC_MONETARY="en_IE.UTF-8"
    LC_MESSAGES="en_IE.UTF-8"
    LC_PAPER="en_IE.UTF-8"
    LC_NAME="en_IE.UTF-8"
    LC_ADDRESS="en_IE.UTF-8"
    LC_TELEPHONE="en_IE.UTF-8"
    LC_MEASUREMENT="en_IE.UTF-8"
    LC_IDENTIFICATION="en_IE.UTF-8"
    LC_ALL=

kzrdude 1y ago

I remember that in Mac OS X times, sometime between OS X v10.1 and 10.4, a system upgrade caused a bunch of Unicode-named files to become inaccessible/untouchable (but still present in a directory listing). At the time I didn't have the skills to figure out what had happened. I'm still curious to know if it was an intended breaking change.

kps 1y ago

OS X (at the POSIX level) assumes UTF-8 and normalizes file names to decomposed form (NFD). If, for example, you run `date >$'\xC3\xBC'` (i.e. ‘ü’, U+00FC), then the actual stored file name is `$'\x75\xCC\x88'` (i.e. ‘u’ followed by U+0308, assuming HN or my browser don't normalize!), and both `cat $'\xC3\xBC'` and `cat $'\x75\xCC\x88'` (typed in either form) work.
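
The byte sequences quoted above can be reproduced with Python's `unicodedata` module. This uses plain NFD as an approximation; the exact normalization variant applied by the filesystem is not something this sketch tries to model.

```python
import unicodedata

composed = "\u00fc"                                   # precomposed u with diaeresis
decomposed = unicodedata.normalize("NFD", composed)   # "u" + U+0308

composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

# The two spellings compare equal only after normalizing both sides.
equal_raw = composed == decomposed
equal_normalized = (
    unicodedata.normalize("NFD", composed) == unicodedata.normalize("NFD", decomposed)
)
```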

kzrdude 1y ago

Unrelated to my question above, I think

n_plus_1_acc 1y ago

I think it's hilarious that the event viewer XML gets borked.

feldrim (OP) 1y ago

I am not 100% sure, but mmc.exe has not been updated for years and it must be relying on the WebBrowser control of Internet Explorer. Yes, IE is still alive in Windows.

https://learn.microsoft.com/en-us/previous-versions/windows/...

account42 1y ago

And we should all be thankful for that. Just imagine if all those system tools were as "useful" as the modernized windows settings.

rob74 1y ago

Or otherwise said: surrogate pairs are used in UTF-16 (which uses two-byte code units, so a single unit can take only 65,536 values) to encode Unicode characters whose code points can't be encoded using just two bytes.
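
For reference, the pairing arithmetic looks roughly like this. The function name is made up; the constants are the standard UTF-16 ones (subtract 0x10000, put the top 10 bits in a high surrogate starting at 0xD800 and the bottom 10 bits in a low surrogate starting at 0xDC00).

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20 bits of payload
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)     # GRINNING FACE

# Cross-check against Python's own UTF-16 encoder (big-endian for readability).
encoded = "\U0001F600".encode("utf-16-be")
```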

feldrim (OP) 1y ago

Yep. The quirk here is that the surrogates, which are merely enablers for other characters, can be paired with each other. In the absence of other valid characters, they are not enabling anything. One would assume there is validation, but it does not exist here.

layer8 1y ago

There is no validation on the file system level because file names in NTFS are sequences of arbitrary 16-bit values, similar to how on Unix file systems, file names are sequences of arbitrary 8-bit values. Arguably the situation on Unix is worse, because there the interpretation and validity depends on the current locale.
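
The validation that the file system skips would amount to something like the following sketch, which checks a sequence of 16-bit values for well-formed UTF-16. The function name is illustrative; it is not how any particular OS implements this.

```python
def is_well_formed_utf16(units: list[int]) -> bool:
    """True if every high surrogate is immediately followed by a low surrogate."""
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:                       # high surrogate
            has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
            if not has_low:
                return False
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:                     # stray low surrogate
            return False
        else:
            i += 1
    return True


valid_pair = is_well_formed_utf16([0x61, 0xD83D, 0xDE00])
lone_high = is_well_formed_utf16([0xD800, 0x2E])
reversed_pair = is_well_formed_utf16([0xDC00, 0xD800])
two_highs = is_well_formed_utf16([0xD800, 0xD800])
```

A name that fails this check is still a perfectly storable sequence of 16-bit values, which is the WTF-16 situation discussed at the top of the thread.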

mofeien 1y ago

Hi, thanks for the interesting submission!

I was a bit confused by the detour via UTF-8 to arrive at the code points and had to look up UTF-8 encoding first to understand how they relate. Then I tried out the following:

    candidate = chr(0xD800)
    candidate2 = bytes([0xED, 0xA0, 0x80]).decode('utf-8', errors='surrogatepass')
    print(candidate == candidate2) # True

and it seems that you could just iterate over code points directly with the `chr()` function.
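
A sketch of that direct iteration, covering the whole surrogate block U+D800 to U+DFFF. The `.exe` suffix follows the listing elsewhere in the thread; the variable names are made up, and this is not the article's actual script.

```python
# All surrogate code points, high (U+D800-U+DBFF) then low (U+DC00-U+DFFF).
surrogates = [chr(cp) for cp in range(0xD800, 0xE000)]

high = [c for c in surrogates if ord(c) <= 0xDBFF]
low = [c for c in surrogates if ord(c) >= 0xDC00]

# Candidate file names, one per lone surrogate.
names = [c + ".exe" for c in surrogates]

# The same bytes as in the snippet above, obtained in the other direction.
first_bytes = surrogates[0].encode("utf-8", errors="surrogatepass")
```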

feldrim (OP) 1y ago

If I remember correctly, I tried that, but in order to cover the exact range I need, the high and low surrogates, I picked this way out of practicality. It was just easier.

ooterness 1y ago

Why does the Windows filesystem allow filenames with invalid strings?

It seems obvious that attempts to create files with such filenames ought to be blocked.

feldrim (OP) 1y ago

A comment here mentions that the existing restrictions come from the Windows APIs and that NTFS itself does not validate file names. Whether the developers want to filter these names out in the API is another story.

qingcharles 1y ago

Aha! Found the name of my next album. Try downloading me on Napster now!

somewhereoutth 1y ago

Ah, got caught by surrogate pairs recently: JavaScript sees them as 2 chars when e.g. slicing strings, so it is possible to end up with invalid strings if you chop between a pair.
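
The same effect can be simulated in Python by slicing a list of UTF-16 code units, which is roughly what a JavaScript string slice operates on. The helper names are made up for this sketch.

```python
def to_units(text: str) -> list[int]:
    """Encode to UTF-16 and return the 16-bit code units."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def decodes_cleanly(units: list[int]) -> bool:
    """True if the code units form well-formed UTF-16 under a strict decoder."""
    data = b"".join(u.to_bytes(2, "little") for u in units)
    try:
        data.decode("utf-16-le")
    except UnicodeDecodeError:
        return False
    return True


units = to_units("hi\U0001F600")      # two letters plus one surrogate pair
chopped = units[:3]                   # slice lands between the two surrogates

whole_ok = decodes_cleanly(units)
chopped_ok = decodes_cleanly(chopped)
```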

ge96 1y ago

I remember that a long time ago I accidentally put some symbol like ? in a folder name in Windows and had problems.
