0 points · pclmulqdq · 4y ago · 0 comments

AVX-512 finally gets a lot of things right about vector manipulation and plugs a lot of the holes in the instruction set. Part of me is upset that it came with the "512" name - they could have called it "AVX3" or "AVX Version 2" (since it's Intel and they love confusing names).

6 comments · 3 top-level

adrian_b · 4y ago · 3 in thread

Actually AVX-512 predates AVX and Sandy Bridge.

The original name of AVX-512 was "Larrabee New Instructions" (LRBni). Unlike with the other Intel instruction set extensions, the team that defined the "Larrabee New Instructions" included graphics experts hired from outside Intel, which is probably the reason why AVX-512 is a better SIMD instruction set than all the others designed by Intel.

Unfortunately, Sandy Bridge (2011), instead of implementing a scaled-down version of the "Larrabee New Instructions", implemented the significantly worse AVX instruction set.

A couple of years later, Intel Haswell (2013), added to AVX a few of the extra instructions of the "Larrabee New Instructions", e.g. fused multiply-add and memory gather instructions. The Haswell AVX2 was thus a great improvement over the Sandy Bridge AVX, but it remained far from having all the features that had already existed in LRBni (made public in 2009).

After the Intel Larrabee project flopped, LRBni passed through a few name changes, until 2016, when it was renamed to AVX-512 after a small change in the binary encoding of the instructions.

I also dislike the name "AVX-512", but my reason is different. "AVX-512" is made to sound like an evolution of AVX, while the truth is the other way around: AVX was an involution of LRBni. Its purpose was to maximize the profits of Intel by minimizing the CPU manufacturing costs, taking advantage of the weak competition. The buyers had to be content with the crippled Intel CPUs with AVX, because nobody offered anything better.

The existence of AVX has caused a lot of additional work for many programmers, who had to write programs much more complex than would have been necessary with LRBni. From the beginning, LRBni had features designed to simplify programming, e.g. mask registers, which allow much simpler prologues and epilogues for loops, and both gather loads and scatter stores for accessing memory.
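To illustrate the point about loop epilogues, here is a minimal sketch (not taken from LRBni or from any Intel sample) of an element-wise add written with AVX-512 intrinsics. The function name `add_arrays` is hypothetical. The idea is that the last, partial iteration reuses the same loop body with a narrower mask, so no separate scalar tail loop is needed. The scalar fallback is only there so the sketch still builds on compilers or targets without AVX-512F.

```c
#include <stddef.h>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* out[i] = a[i] + b[i] for i in [0, n).
 * add_arrays is a hypothetical name used only for this illustration. */
void add_arrays(float *out, const float *a, const float *b, size_t n)
{
#if defined(__AVX512F__)
    for (size_t i = 0; i < n; i += 16) {
        size_t left = n - i;
        /* All 16 lanes are active, except in the final partial iteration,
         * where only the low `left` lanes are enabled. */
        __mmask16 k = left >= 16 ? (__mmask16)0xFFFF
                                 : (__mmask16)((1u << left) - 1u);
        /* Masked-out lanes are neither read from nor written to memory. */
        __m512 va = _mm512_maskz_loadu_ps(k, a + i);
        __m512 vb = _mm512_maskz_loadu_ps(k, b + i);
        _mm512_mask_storeu_ps(out + i, k, _mm512_add_ps(va, vb));
    }
#else
    /* Portable fallback so the sketch still builds without AVX-512F. */
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
#endif
}
```

With plain AVX or AVX2, the same function would typically need a main loop over full 8-float vectors plus a separate scalar (or specially handled) remainder loop for the last `n % 8` elements.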

boulos · 4y ago

Hmm. That's not how I recall it. The folks in Israel working on Sandy Bridge (Gesher) already had their AVX plans in place before LRBni was "finalized" (even by the time of "our" SIGGRAPH paper -- I was only tangentially involved, not listed -- new instructions were being added all the time).

So it's more like both groups knew what the other was doing, but LRBni was free to focus primarily on graphics and a clean slate, while the AVX folks shot for "SSE but wider, and a few more".

AVX-512 is sort of a franken-combo of what AVX3 would have been, plus many of the LRBni instructions that shipped in the poorly named MIC parts, plus some more (e.g., now including a VNNI dialect, bf16 ops, etc.).

adrian_b · 4y ago

Indeed, as you say, the development of both LRBni and of AVX by 2 separate Intel teams stretched over many years.

Most of the development of LRBni was between 2005 and 2009, when it became publicly known. The first product with LRBni was Knights Ferry, which was introduced in 2010, being made with the older 45-nm process. Knights Ferry was used only in development systems, due to insufficient performance.

Sandy Bridge, using the newer 32-nm process, was launched in 2011. I do not know when the development of Sandy Bridge had started, but in any case the first few years of development must have overlapped with the last few years of the development of LRBni.

I suppose that there was little, if any, communication between the 2 Intel teams.

AVX was developed as an instruction set extension in the same way as the majority of the instruction set extensions had been developed by Intel since the days of Intel 8008 (1972) and until the present x86 ISA.

Intel has only very seldom introduced new instructions that had been designed having a global view of the instruction set and making a thorough analysis of which instructions should exist in order to reach either the best performance or the least programming effort.

In most cases the new instructions have been chosen so that they would need only minimal hardware changes from the previous CPU generation for their implementation, while still providing a measurable improvement in some benchmark. The most innovative additions to the Intel ISA had usually been included in the instruction sets of other CPUs many years before, but Intel delayed adding them for as long as possible.

This strategy of Intel is indeed the best for ensuring the largest profits from making CPUs, as long as there is no strong competition.

Moreover, the quality of the ISA now matters much less for performance than it did earlier, because today's very complex CPUs can perform a lot of transformations on the instruction stream, like splitting / reordering / fusion, which can remove performance bottlenecks due to poor instruction encoding.

Most programmers use only high-level languages, so only the compiler writers and those that have to write extremely optimized programs have to deal with various ugly parts of the Intel-AMD ISA.

So AVX for Sandy Bridge was designed in the typical Intel way, with the goal of being a minimal improvement over SSE.

On the other hand, LRBni was designed from the ground up to be the best instruction set that they knew how to implement for performing its tasks.

So it was normal that the end results were different.

For the Intel customers, it would have been much better if Intel had not had 2 divergent developments for its future SIMD ISA, and had instead established a single, coherent roadmap for SIMD ISA development during the next generations of Intel CPUs.

In an ideal company such a roadmap should have been established after discussions with a wide participation, from all the relevant Intel teams.

For cost reasons, it is obvious that it would not have been good for Sandy Bridge to implement the full LRBni ISA. Nevertheless, it would have been very easy to implement an LRBni subset better than AVX.

Sandy Bridge should still have implemented only 256-bit operations, and the implementation of some operations, e.g. gather and scatter, could have been delayed for a later CPU generation.

However, other LRBni features should have been present from the beginning, e.g. the mask registers, because they influence the instruction encoding formats.

The mask registers would have required very little additional hardware resources (the actual hardware registers can reuse the 8087 registers), but they would have simplified AVX programming a lot, by removing the complicated code needed to correctly handle different data sizes and alignments.

The current CPUs with AVX-512 support would also have been simpler, because they would not have to decode 2 completely distinct binary instruction formats, one for AVX and one for AVX-512. Having to decode both made the implementation of AVX-512 in the small cores of Alder Lake difficult.

pclmulqdq (OP) · 4y ago

TIL. Thank you for the history lesson on AVX. Compared to SVE and the RISC-V vector instructions, AVX feels so clunky, but I guess that was part of the "Intel tax."

atq2119 · 4y ago

Agreed. Though I feel that for the most part, size-agnostic vector instructions a la SVE would be the way to go.

janwas · 4y ago

:) I have actually heard it referred to as AVX3; we also adopted that name in Highway.
