Validating UTF-8 bytes using only 0.45 cycles per byte (AVX edition)

(lemire.me)

148 points | akarambir | 7y ago | 51 comments

the_clarence 7y ago

I see a lot of applications trying to take advantage of SIMD, but what happens when you try to run them on systems that don't support these instructions? My guess is that you need to write multiple files taking advantage of different sets of instructions and then dynamically figure out which to use at runtime with cpuid, but isn't that cumbersome, and a way to inflate a codebase dramatically?
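
For illustration, the runtime-dispatch idea asked about above might look like the following minimal C sketch. It assumes GCC or Clang on x86 (for `__builtin_cpu_supports` and the `target` attribute) and falls back to scalar code everywhere else. The names `is_ascii_scalar`, `is_ascii_avx2` and `pick_is_ascii` are hypothetical and are not taken from the article.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: the AVX2 path is only built with GCC/Clang on x86. */
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1
#else
#define HAVE_X86_DISPATCH 0
#endif

/* Portable fallback: returns 1 if every byte is ASCII (< 0x80). */
static int is_ascii_scalar(const uint8_t *p, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (p[i] & 0x80) return 0;
    return 1;
}

#if HAVE_X86_DISPATCH
/* The target attribute enables AVX2 for this one function only,
   so the rest of the file still builds for a baseline CPU. */
__attribute__((target("avx2")))
static int is_ascii_avx2(const uint8_t *p, size_t n) {
    size_t i = 0;
    __m256i acc = _mm256_setzero_si256();
    /* OR all 32-byte chunks together; any high bit survives in acc. */
    for (; i + 32 <= n; i += 32)
        acc = _mm256_or_si256(acc, _mm256_loadu_si256((const __m256i *)(p + i)));
    if (_mm256_movemask_epi8(acc) != 0) return 0;
    /* Handle the tail (fewer than 32 bytes) with the scalar code. */
    return is_ascii_scalar(p + i, n - i);
}
#endif

typedef int (*is_ascii_fn)(const uint8_t *, size_t);

/* Picks an implementation once, based on what the running CPU reports. */
static is_ascii_fn pick_is_ascii(void) {
#if HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return is_ascii_avx2;
#endif
    return is_ascii_scalar;
}
```

The selection runs once, e.g. `is_ascii_fn f = pick_is_ascii();`, and later calls go through the pointer. Only the hot kernel is duplicated per instruction set, which is why the extra code tends to stay small.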

en4bz 7y ago

https://gcc.gnu.org/wiki/FunctionMultiVersioning

wmu 7y ago

Speaking of the Intel world, it's not that bad. There are three major versions right now: SSE4.1, AVX and AVX2 (AVX512 is not popular yet).

In the past (roughly 10 years ago) it was a problem, as there were: MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, XOP, 3DNow and perhaps a few more extensions.

it's not a typo, there are three 'S' :)

wmu 7y ago

Sorry, I forgot that in HN comments the asterisk char is an italics indicator. There should be a mark after SSSE3.

oconnor663 7y ago

> inflate a codebase dramatically

This is usually only done for very specific algorithms: Unicode validation, hash functions, things like that. Unless you have an absolutely tiny application (which you might, if you're targeting some kind of microcontroller), it's going to be a small percentage of your overall code size.

londons_explore 7y ago

In a microcontroller, I don't think you'll be needing AVX2...

Rebelgecko 7y ago

I'm not sure where exactly the line is drawn between a microcontroller and a CPU, but even some of the lower end ARMs support SIMD instructions.

jmgrosen 7y ago

Generally speaking, I think if you care enough about performance to write manual SIMD code, being a little more cumbersome is a tradeoff you’re willing to make.

why_only_15 7y ago

In my understanding, when you use intrinsics and build for a processor without support for them, GCC for example will replace them with equivalent code.

mcbain 7y ago

Unfortunately, no.

That is the case with GCC's __builtin functions. With a few exceptions, intrinsics are basically macros for inline asm that the compiler can reason about.

If on x86-64 you use a _mm256* intrinsic and compile without AVX support you just get a compile error, not a pair of equivalent SSE instructions.
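
Since the compiler will not substitute a fallback on its own, one common pattern is to guard the intrinsic path with the `__AVX2__` macro, which GCC and Clang define when AVX2 is enabled (e.g. with -mavx2). The following is a minimal sketch under that assumption. `count_non_ascii` is a hypothetical helper, and `__builtin_popcount` is a GCC/Clang builtin.

```c
#include <stddef.h>
#include <stdint.h>

#ifdef __AVX2__
#include <immintrin.h>
#endif

/* Counts bytes with the high bit set, i.e. non-ASCII bytes. */
static size_t count_non_ascii(const uint8_t *p, size_t n) {
    size_t count = 0, i = 0;
#ifdef __AVX2__
    /* Compiled only when AVX2 is enabled for this translation unit. */
    for (; i + 32 <= n; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(p + i));
        /* movemask gathers the top bit of each of the 32 bytes. */
        count += (size_t)__builtin_popcount((unsigned)_mm256_movemask_epi8(v));
    }
#endif
    /* Scalar path: the whole input without AVX2, or just the tail with it. */
    for (; i < n; i++)
        count += p[i] >> 7;
    return count;
}
```

Built without AVX2 enabled, the intrinsic code is simply not compiled and the scalar loop handles everything, so the choice is made at build time rather than at run time.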

eesmith 7y ago

That is true. Here are a couple of downsides. First, you still need to build once for each architecture, either as different executables or as different object files, and provide some dispatch mechanism to use the right one based on what hardware is available.

Second, if the intrinsics aren't built-in then there may be faster alternatives than using the GCC emulated version.

BeeOnRope 7y ago

You must be thinking about GCC "builtins", because there is no emulation for x86 SIMD intrinsics (i.e. the things in <immintrin.h>).

saagarjha 7y ago

Darwin platforms ship binaries with different slices for different versions of Intel processors. You have the generic x86_64 and the newer x86_64h which supports more features.

bradleyjg 7y ago

Under the new string model in Java 9 and later, a fairly frequent workflow is:

1) get external string

2) figure out if it is UTF-8, UTF-16, or some other recognizable encoding

3) validate the byte stream

4) figure out if the code points in the incoming string can be represented in Latin-1

5) instantiate a Java string using either the Latin-1 encoder or the UTF-16 encoder

I know some or all of these steps are done using HotSpot intrinsics, and then the JIT/VM does inlining, folding and so on, but I wonder how fast a custom assembly function to do all these steps at once could be.
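
As a rough illustration of fusing steps 3 and 4, here is a scalar C sketch (not assembly, and not how HotSpot does it) that validates UTF-8 and checks in the same pass whether every code point fits in Latin-1 (U+0000 to U+00FF). The `scan_result` type and `scan_utf8` function are hypothetical names introduced for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    bool valid_utf8;   /* input is well-formed UTF-8 */
    bool fits_latin1;  /* valid, and every code point is <= U+00FF */
} scan_result;

static scan_result scan_utf8(const uint8_t *p, size_t n) {
    const scan_result invalid = { false, false };
    scan_result r = { true, true };
    size_t i = 0;
    while (i < n) {
        uint8_t b = p[i];
        uint32_t cp, min;
        size_t len;
        /* Decode the lead byte: sequence length and smallest legal value. */
        if (b < 0x80)                { cp = b;        len = 1; min = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; len = 4; min = 0x10000; }
        else return invalid;              /* stray continuation or bad lead */
        if (len > n - i) return invalid;  /* truncated sequence */
        for (size_t k = 1; k < len; k++) {
            uint8_t c = p[i + k];
            if ((c & 0xC0) != 0x80) return invalid;
            cp = (cp << 6) | (c & 0x3F);
        }
        /* Reject overlong forms, surrogates and values past U+10FFFF. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return invalid;
        if (cp > 0xFF) r.fits_latin1 = false;
        i += len;
    }
    return r;
}
```

A caller would then pick the Latin-1 or UTF-16 representation based on `fits_latin1`. A vectorized version would follow the same logic but process many bytes per iteration.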

Twirrim 7y ago

You might be interested in his blog on the same subject a few days ago: https://lemire.me/blog/2018/10/16/validating-utf-8-bytes-jav...

adamretter 7y ago

If you are given the external string as bytes, which is all you can have if you don't know the encoding, then steps 2, 3 and 4 can all be done as one step, I would have thought. Something like https://github.com/adamretter/utf8-validator/blob/optimize-u...

jwilk 7y ago

Previous blog post on HN:

https://news.ycombinator.com/item?id=17081571

kissiel 7y ago

I wonder about the Joules per byte. AFAIK AVX units are quite expensive energy-wise.

masklinn 7y ago

Don't they also tend to work at a lower clock due to their higher energy requirements?

edit: though this is AVX2 ("AVX-256") rather than AVX-512, and Lemire has covered AVX and the possibility of throttling (with or without AVX) in the past, so they're probably aware of the potential issue and consider that either it won't get triggered or the gain is good enough to compensate for the lower frequency.

kissiel 7y ago

Nice. So I understand that AVX2 is not bringing the CPU's clock down.

Got any sources for power consumption figures/comparisons of those AVX units?

lorenzhs 7y ago

Heavy use of complex AVX2 operations causes downclocking, too, but typically less so than AVX-512. More details are documented in https://en.wikichip.org/wiki/intel/frequency_behavior -- also see e.g. https://en.wikichip.org/wiki/intel/xeon_gold/6138#Frequencie... for an example of how the frequencies differ depending on the number of active cores.

I think the reason for reducing clock speed when vector units are in heavy use is to keep power usage in check.

You might also find https://blog.cloudflare.com/on-the-dangers-of-intels-frequen... helpful, which goes into detail about a specific case where dynamic frequency scaling resulted in AVX-512 code running slower than AVX2 code.

twtw 7y ago

It could well be lower than a scalar approach. SIMD units like AVX are power hungry, but a greater fraction of that power goes to relevant computation rather than to control, scheduling, etc. Ideally, the constant per-instruction overhead of getting work onto a functional unit is amortized over the width of the vector.

akarambir (OP) 7y ago

What do Linux utilities like sed and awk use for text manipulation? They were very slow when I was changing a few table names in a SQL file.

zorked 7y ago

I don't think they use anything in common. Try to set your locale to "C" as otherwise string comparisons will do extra work handling your locale's notions of equivalent characters.

coldtea 7y ago

What was the size of the SQL file?

A "few table names" doesn't mean much if the SQL file is 20GB.

In any case, sed and awk are plenty fast, but not the fastest methods of text manipulation. You could write a custom C program for that.

Thiez 7y ago

While it sure is possible to do text manipulation in C, I don't think it should ever be the first choice, even if 'fastest' is a goal. A 0 byte is perfectly acceptable in a UTF-8 string (or any Unicode string, really). But C has those annoying zero-terminated strings, so if you want to manipulate arbitrary Unicode strings, the first thing you can do is kiss the string functions in the C standard library goodbye. Which you probably want to do anyway, because Pascal strings are simply better.

I would use Rust or C++ for this task.
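
To make the zero-byte point concrete, here is a small C sketch of a length-carrying byte slice, roughly in the spirit of a Pascal string. The `byte_slice` type and `slice_count` function are hypothetical names for this example. `memchr` takes an explicit length, so unlike `strlen` or `strchr` it does not stop at an embedded NUL.

```c
#include <stddef.h>
#include <string.h>

/* A pointer plus an explicit length; NUL bytes are ordinary data. */
typedef struct {
    const char *data;
    size_t len;
} byte_slice;

/* Counts occurrences of `needle` in the slice, including past NUL bytes. */
static size_t slice_count(byte_slice s, char needle) {
    size_t count = 0;
    const char *p = s.data;
    const char *end = s.data + s.len;
    /* memchr searches exactly (end - p) bytes and ignores terminators. */
    while (p < end && (p = memchr(p, needle, (size_t)(end - p))) != NULL) {
        count++;
        p++;  /* resume the search just after the match */
    }
    return count;
}
```

For the five bytes `a b \0 a b`, `strlen` reports 2 because it stops at the NUL, while `slice_count` with a length of 5 finds both occurrences of `a`.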

masklinn 7y ago

Note that this and that are not necessarily related: you're talking about performing Unicode-aware text matching and manipulation, while TFA is solely about validating a buffer's content as UTF-8.

rurban 7y ago

They are still mostly not multi-byte string (i.e. Unicode) aware after decades of work. That is, you cannot really search for strings with case-folding or normalized variants.

See http://crashcourse.housegordon.org/coreutils-multibyte-suppo... and http://perl11.org/blog/foldcase.html for an overview of the performance problems.

This tool only does the minor task of validation of the UTF-8 encoding, nothing else. There are still the major tasks of decoding, folding and normalization to do.

akx 7y ago

How slow? On my 2013 MBP, `gsed` (GNU sed) can do a replacement like that at about 350 MiB/s (of which most seems to be spent writing to disk, since writing to /dev/null hikes it up to 800 MiB/s).

akarambir (OP) 7y ago

It was a sed substitute command on a ~800 MB file, on a Thinkpad T470 with an SSD. It was taking around 40-50 sec for each substitution. Though, as others have pointed out, it may not be directly related to the article under discussion.

coldtea 7y ago

>It was taking around 40-50 sec for each substitution.

The number of substitutions should not really be a relevant metric, as it wouldn't influence the result much. Sed/Awk will still have to go through the whole file to find all the occurrences they should substitute (and when they do find an occurrence, the substitution itself takes nanoseconds).

The size of the file is a better metric (e.g. how many seconds for that 800 MB in total).

Also, whether you used regex in your awk/sed, and what kind. A badly written regex can slow down search very much.

etatoby 7y ago

Did you use a regex with quadratic or worse behavior, such as one with more than one .* in it?

Did you set LANG=C before running sed, to bypass the UTF-8 logic?

Also, if you had a list of substitutions to perform, did you try writing them as a single sed script?
