> If the poll passes, which is likely, AssemblyScript will be severely impacted as the tools it has developed must be deprecated due to unresolvable correctness and security problems the decision imposes upon languages utilizing JavaScript-like 16-bit string semantics and its users.
So, the problem is that AssemblyScript wants to keep using UTF-16? I'm not sure I understand.
Is AssemblyScript the thing that lets you hand-write WebAsm?
I'm confused why they can't just switch their (nascent) language to UTF-8, and if they can, why the alarmist attitude? I didn't think they were mature enough to claim no breaking changes, for example.
I probably prefer we drag the web (and .Net and Java) platforms towards UTF-8, to be honest… but maybe that’s just me.
P.S. the web will never switch to UTF-8. It would break too many web pages. Most browser vendors won't even accept breaking 0.1% of web pages, unless they're doing it to show you more ads (i.e. Chrome).
That's not what the web needs. The web needs WebAssembly to work flawlessly with JavaScript for maximal potential, so the web will be great and not just a performance landmine that native developers will laugh (as much) at.
If you want to chime in and retrieve more context, here are some relevant issues:
* https://github.com/WebAssembly/interface-types/issues/135
How does a compiler ensure that when that string is passed to a Rust Wasm module it goes to it in UTF-8 and then when moments later the same string is passed by the same module to JS it goes over as WTF-16?
How will the compiler know where the string is being passed after compilation (at runtime)?
What new syntax would you propose for TypeScript to make it possible to work with all strings types? How would you keep TS/JS developer ergonomics up to par with what currently exists?
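To make the boundary question concrete, here's a hedged sketch of the kind of glue code a toolchain could emit today. The `memory`, `alloc` exports and both function names are hypothetical stand-ins for illustration, not any real ABI: the point is that re-encoding happens at each crossing, and lone surrogates are lost on the way into UTF-8.

```javascript
// Hypothetical glue code marshalling a JS string into a Wasm module's linear
// memory as UTF-8, and reading one back out for JS. `memory` and `alloc` are
// stand-ins for a real module's exports.

function passStringToWasm(str, { memory, alloc }) {
  // WTF-16 -> UTF-8; isolated surrogates become U+FFFD (lossy).
  const utf8 = new TextEncoder().encode(str);
  const ptr = alloc(utf8.length); // ask the module for a buffer
  new Uint8Array(memory.buffer, ptr, utf8.length).set(utf8);
  return { ptr, len: utf8.length };
}

function readStringFromWasm(ptr, len, { memory }) {
  const bytes = new Uint8Array(memory.buffer, ptr, len);
  return new TextDecoder().decode(bytes); // UTF-8 -> JS string
}
```

The open question in the comment above is exactly this: the compiler can emit such conversions at statically known boundaries, but it cannot know at compile time which runtime boundary a given string will eventually cross.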
If Interface Types were to consider the web a first-class citizen (because Wasm originated as a web feature), then interop between Wasm modules and JS would be considered of utmost importance, without making a web language (such as AssemblyScript) go to great lengths to engineer around the aforementioned complication.
For FFI there's nothing a compiler can do. That's why FFI is unsafe and restricted to rudimentary types in most languages - it's up to the caller to ensure the data is laid out as the callee expects.
I also don't know what interface types have to do with anything. Wasm is far lower level than interfaces, and nothing is stopping you from implementing interfaces in your language and doing automatic type conversion through them to handle string representations as required.
Look past the web for a moment - wasm is a competitor with the JVM, GraalVM, and LLVM as a platform and implementation independent byte code. Think about how your language would be implemented on those targets before the web.
This announcement is deliberately phrased to scare people who do not have sufficient context. I don't know why some AssemblyScript maintainers have decided to act in this extreme way over what is quite a niche issue. The vote that this announcement is sounding the alarm over is _not_ a vote on whether UTF-16 should be supported.
There has been a longstanding debate as part of the Wasm interface types proposal regarding whether UTF-8 should be privileged as a canonical string representation. Recently, we have moved in the direction of supporting both UTF-8 and UTF-16, although a vote to confirm this is still pending (but I personally believe would pass uncontroversially).
However, JavaScript strings are not always well-formed UTF-16 - in particular some validation is deferred for performance reasons, meaning that strings can contain invalid code points called isolated surrogates. Again, the referenced vote is _not_ a vote on whether UTF-16 should be supported, but is in fact a vote on whether we should require that invalid code points should be sanitised when strings are copied across component boundaries. Some AS maintainers have developed a strong opinion that such sanitisation would somehow be a webcompat/security hazard and have campaigned stridently against it. However sanitising strings in this way is actually a recommended security practice (https://websec.github.io/unicode-security-guide/character-tr...), so they haven't gained the traction they were hoping for with their objections.
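For readers without the context: here is what an isolated surrogate and its sanitisation look like in plain JavaScript today, using `TextEncoder`/`TextDecoder` (which, per the WHATWG Encoding standard, replace each unpaired surrogate with U+FFFD when encoding to UTF-8):

```javascript
// A lone (isolated) surrogate is a legal JS string value,
// but not well-formed UTF-16.
const lone = "ab\uD800cd"; // \uD800 is a high surrogate with no pair

// Crossing an encoding boundary sanitises it: TextEncoder replaces the
// isolated surrogate with U+FFFD, the replacement character.
const roundTripped = new TextDecoder().decode(new TextEncoder().encode(lone));

console.log(roundTripped);          // "ab\uFFFDcd"
console.log(roundTripped === lone); // false: the copy was sanitised
```

This is essentially the behaviour the vote would require at component boundaries.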
The announcement is worded to obscure this point - talking about "JavaScript-like 16-bit string semantics" (i.e. where isolated surrogates are not sanitised) as opposed to merely "UTF-16", which forbids isolated surrogates by definition, but inviting the conflation of the two.
AS does not need to radically alter its string representation - if we were to support UTF-16 with sanitisation, they could simply document that their potentially invalid UTF-16 strings will be sanitised when passed between components. Note that the component model is actually still being specified, so this design choice doesn't even affect any currently existing AS code. I interpret the announcement's threat of radical change as some maintainers holding AS hostage over the (again, very niche) string sanitisation issue, which is frankly pretty poor behaviour.
You previously posted yourself that documenting sanitisation at the component boundary would be an acceptable solution: (https://web.archive.org/web/20210726140105if_/https://github...).
I don't understand why you have so radically changed your opinion since then.
We must get rid of legacy encodings no matter the cost, I'm tired of seeing Java and Qt apps wasting millions of CPU cycles mindlessly converting stuff back and forth from UTF-16. It's plain madness, and sometimes you just need the courage to destroy everything and start again.
UTF-8 is a great hack that works wonderfully on Linux and BSD, because neither actually supported internationalisation properly until recently. They clung to 8-bit ASCII with white knuckles until they could bear it no longer, but then UTF-8 came to the rescue and there was much rejoicing. "It's the inevitable future!" cried millions of Linux devs... in English. I mention this because UTF-8 is a bit... shit... if you're from Asia.
Meanwhile, in the other universe, UCS-2 or UTF-16 have been around for forever because in that Universe people do things for money and had to take internationalisation seriously. Not just recently, but decades ago. Before some Linux developers were born. In this Universe, an ungodly amount of Real Important Code was written by Big Business and Big Government. The type of code that processes trillions of dollars, not the type used to call MySQL unreliably from some Python ML bullshit running in a container or whatever the kids are doing these days.
So, yes. Clearly UTF-16 has to "die" because it's inconvenient for C developers who never figured out how to deal with strings in more than one encoding.
PS: There are several Unicode compression formats that blow UTF-8 out of the water if used in the right way. If you can support those, then you can support UTF-16. If you can't, then you can't claim that you chose UTF-8 because you care about performance.
The needs of all the different WASM consumers also create tension here. A C# programmer trying to ship a webapp has very different needs from a C programmer trying to run WASM on a Cloudflare edge node, and you can't really satisfy both of them, so you end up having to tell one of them to go take a walk into the sea.
- an extra performance cost due to format conversion at the boundary,
- as well as negative implications on security and data integrity,
thus making this a loss for the web if Interface Types will not be fully compatible with the web (JavaScript) by default.
Hope that sums it up in one sentence. :)
https://github.com/WebAssembly/meetings/blob/main/main/2021/...
This influx is the reason why AssemblyScript is now in the top three WebAssembly languages next to C++ and Rust (https://blog.scottlogic.com/2021/06/21/state-of-wasm.html) and should not be taken lightly.
There is a huge opportunity here to build an optimal foundation for these incoming developers, so that they won't be let down.
The influx has only just begun.
Ideally though, interface types would give languages options: the ability to choose which format their boundary will use. Obviously a JS host and a language like AssemblyScript would align on WTF-16, while a Rust Wasm module running on a Rust-powered Wasm runtime like wasmtime could optimally choose UTF-8.
I'm hoping things will be designed with flexibility in mind for this upcoming most-generic runtime feature.
In the past year it has gained numerous libraries and bindings, including contributions from Surma at Google. Stay tuned...
[1] http://utf8everywhere.org/
[2] https://github.com/microsoft/language-server-protocol/issues...
So it's not just UTF-16 that has problems and can cause security issues. I just wanted to emphasize that.
Every problem that UTF-8 has, it shares with UTF-16. It also shares every one of those problems with UTF-32.
You can certainly decide you don't care about any existing code, and that anyone using UTF-16 based platforms (Windows, .NET, Java, JavaScript) should get a bad experience, but I don't think the case for that is as obvious as you believe it is.