jstimpfle · 1mo ago · 0 points

I asked where the part about unaligned pointers is in your string processing example. Saying that you want to load multiple bytes at a time does not at all imply that you have to do unaligned loads.

Doing unaligned loads using SSE or AVX might have been possible on Intel architectures for a long time, but it is still a little bit slower afaik. But anyway, when you get into sub-architecture-specific details like that, you've essentially left C-land and are doing assembler-level programming.

simonask · 1mo ago

Every vectorized string search algorithm (including those treating an unsigned long as a "vector" of 8 bytes) currently needs a prelude that performs the search up to the first alignment boundary, then performs the bulk of the search on well-aligned blocks, and finally runs a postlude search over the tail of the string, where the tail is shorter than the block size/alignment.
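For illustration, here is a minimal sketch of that prelude / bulk / postlude structure for a single-byte search, treating a `uint64_t` as a "vector" of 8 bytes. The names `has_zero_byte` and `find_byte_aligned` are hypothetical, not taken from any library, and the word load goes through `memcpy` purely to sidestep aliasing questions. It is a sketch under those assumptions, not a tuned implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: nonzero iff at least one byte of w is zero
 * (the classic word-at-a-time "haszero" test). */
static int has_zero_byte(uint64_t w)
{
    return ((w - 0x0101010101010101ULL) & ~w & 0x8080808080808080ULL) != 0;
}

/* Hypothetical name: first occurrence of byte c in s[0..n), or NULL. */
static const unsigned char *find_byte_aligned(const unsigned char *s,
                                              size_t n, unsigned char c)
{
    assert(s != NULL || n == 0);
    const uint64_t pattern = 0x0101010101010101ULL * c; /* c in every byte */
    size_t i = 0;

    /* Prelude: byte by byte until s + i sits on an 8-byte boundary. */
    while (i < n && ((uintptr_t)(s + i) % sizeof(uint64_t)) != 0) {
        if (s[i] == c)
            return s + i;
        i++;
    }

    /* Bulk: whole 8-byte blocks, all of them aligned at this point. */
    while (n - i >= sizeof(uint64_t)) {
        uint64_t w;
        memcpy(&w, s + i, sizeof w);
        if (has_zero_byte(w ^ pattern))
            break; /* a match is somewhere in this block */
        i += sizeof(uint64_t);
    }

    /* Postlude: the matching block (if any) or the short tail. */
    for (; i < n; i++) {
        if (s[i] == c)
            return s + i;
    }
    return NULL;
}
```

The prelude loop and its pointer-to-integer arithmetic are the part that exists only to satisfy the alignment requirement.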

Using unaligned loads, you can get rid of the prelude, including the associated branches and intptr arithmetic, and just have to deal with the tail.
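A rough sketch of what that looks like, assuming the unaligned load is spelled as a `memcpy` into a local `uint64_t` (which compilers commonly turn into a single load on targets that allow it, though that is not guaranteed). The name `find_byte_unaligned` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name: first occurrence of byte c in s[0..n), or NULL.
 * No prelude: the bulk loop starts at s regardless of its alignment. */
static const unsigned char *find_byte_unaligned(const unsigned char *s,
                                                size_t n, unsigned char c)
{
    assert(s != NULL || n == 0);
    const uint64_t ones = 0x0101010101010101ULL;
    const uint64_t highs = 0x8080808080808080ULL;
    const uint64_t pattern = ones * c; /* c replicated into every byte */
    size_t i = 0;

    /* Bulk: 8 bytes at a time from the very first byte. */
    for (; n - i >= sizeof(uint64_t); i += sizeof(uint64_t)) {
        uint64_t w;
        memcpy(&w, s + i, sizeof w); /* well-defined at any alignment */
        w ^= pattern;                /* matching bytes become zero */
        if (((w - ones) & ~w & highs) != 0)
            break;                   /* some byte in this block matched */
    }

    /* Tail: the matching block (if any) or the last few bytes. */
    for (; i < n; i++) {
        if (s[i] == c)
            return s + i;
    }
    return NULL;
}
```

Only the tail loop remains as special-case code; the alignment check and its branch are gone.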

If you're comparing short-ish strings, almost all of the time is spent in the prelude and postlude, even if the entire substring fits in a register. This is a silly language limitation when the hardware can easily support the unaligned load.

In particular, it doesn't seem justified that what at most amounts to a tiny inefficiency in hardware turns into a very expensive class of bugs (UB).

jstimpfle (OP) · 1mo ago

Have you ever wanted to do this? I find the premise ridiculous.

But anyway, you're complaining that you have to work too hard to do unaligned loads (i.e. the wrong thing even if it should work on a particular machine) in C, when basically every other language makes you work more for basic systems programming tasks?

Whether unaligned loads work at the machine level depends on the hardware. On some other architectures, you probably get anything from traps to unpredictable behaviour. It's totally fine that C does not define the behaviour for unaligned loads.

If you want to do some weird stuff like loading a single unaligned 16-byte quantity, where there was no "middle part" to begin with, just use memcpy. The compiler might just do the appropriate thing on this architecture. Or if you need to closely control what happens, write assembly. But again, why would you even do this?
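As a minimal sketch of the memcpy route, assuming a 16-byte quantity is modelled as two `uint64_t` halves (the type `block16` and the functions `load16` and `equal16` are made-up names for illustration):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte value type; assumed to have no padding. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} block16;

/* Load 16 bytes from p, whatever its alignment. The behaviour is
 * defined by the C standard; whether this compiles down to one
 * unaligned vector load is up to the compiler and the target. */
static block16 load16(const void *p)
{
    block16 b;
    assert(sizeof b == 16);
    memcpy(&b, p, sizeof b);
    return b;
}

/* Nonzero iff the 16 bytes at a equal the 16 bytes at b. */
static int equal16(const void *a, const void *b)
{
    block16 x = load16(a);
    block16 y = load16(b);
    return ((x.lo ^ y.lo) | (x.hi ^ y.hi)) == 0;
}
```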