Process the input as quickly as possible and never fail; instead, replace each invalid byte sequence with U+FFFD (encoded as 0xEF 0xBF 0xBD).
Edit: It looks like Rust's std already does this, except that it doesn't preallocate a result buffer of exactly the right size: https://doc.rust-lang.org/src/alloc/string.rs.html#538
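For what it's worth, the std behavior being described is `String::from_utf8_lossy`, which substitutes U+FFFD for each invalid sequence. A minimal sketch:

```rust
fn main() {
    // Valid ASCII, then 0xFF, which can never appear in well-formed UTF-8.
    let bytes = [b'h', b'i', 0xFF, b'!'];

    // from_utf8_lossy never fails: invalid sequences become U+FFFD,
    // which is encoded in UTF-8 as the three bytes 0xEF 0xBF 0xBD.
    let s = String::from_utf8_lossy(&bytes);
    assert_eq!(s, "hi\u{FFFD}!");
    println!("{}", s);
}
```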
[0]: https://github.com/mooman219/fontdue/blob/master/src/platfor...
The Eigen library did recently combine most SSE and AVX code paths, however: https://eigen.tuxfamily.org/index.php?title=3.4
http://openjdk.java.net/jeps/8261663
(I'm completely ignorant on the subject)
I really think the general push right now is towards stdsimd being the future for portability.
It would be great to have things like high-performance unicode handling with consistent semantics across multiple languages!
- https://doc.rust-lang.org/reference/linkage.html
- https://doc.rust-lang.org/book/ch19-01-unsafe-rust.html#call...
I have a SQL parsing library (shameless plug) that is 50x faster than any other Python implementation; it's just a very simple wrapper around a Rust crate.
The https://cxx.rs/ project is also a major crate for C++ interoperability.
https://www.reddit.com/r/rust/comments/mvc6o5/incredibly_fas...
tl;dr: it's not straightforward, mostly because the standard functionality lives in `core` which doesn't have access to OS support for CPU feature detection.
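To illustrate the distinction: runtime feature detection is exposed through `std` (e.g. `is_x86_feature_detected!`), which is one reason `core`-only code can't dispatch this way. A hedged sketch of the usual pattern:

```rust
// Pick a code path at runtime. The macro queries the CPU (and, where
// needed, the OS) for feature support, so it lives in `std`, not `core`.
fn simd_path() -> &'static str {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return "avx2";
        }
    }
    // Portable scalar fallback for other CPUs (or compiled-out arches).
    "scalar"
}

fn main() {
    println!("dispatching to the {} path", simd_path());
}
```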
The algorithm is the one from simdjson; the main difference is that it adds an extra step at the beginning to align reads to the SIMD block size.
I didn't really understand this part. Aligned to what? To the cache line? A SIMD load always reads a full block, unless I'm missing something here.
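As I read it, the alignment is of the read *address*, not the read width: a scalar prologue consumes bytes until the pointer is a multiple of the SIMD block size, so the main loop can use aligned loads. A hypothetical sketch (assuming a 32-byte AVX2 block; the constant and function name are illustrative, not from the library):

```rust
// Assumed SIMD block size (32 bytes for AVX2 registers).
const BLOCK: usize = 32;

/// Number of leading bytes to handle with scalar code so that the
/// remainder of the buffer starts on a BLOCK-byte boundary.
fn prologue_len(data: &[u8]) -> usize {
    let addr = data.as_ptr() as usize;
    (BLOCK - (addr % BLOCK)) % BLOCK
}

fn main() {
    let buf = vec![0u8; 100];
    let n = prologue_len(&buf);
    // After skipping `n` bytes, the rest is BLOCK-aligned, so the main
    // loop could use aligned (rather than unaligned) SIMD loads.
    assert_eq!((buf[n..].as_ptr() as usize) % BLOCK, 0);
    println!("scalar prologue: {} bytes", n);
}
```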
The plan is for Rust to eventually have a portable SIMD abstraction built into the standard library, reducing the need for CPU-specific code.