Future instruction set: AVX-512

(agner.org)

102 points · przemoc · 12y ago · 66 comments

35 comments · 9 top-level

3pt14159 · 12y ago · 6 in thread

Is there a good place to begin to learn about this stuff from the ground up? Maybe a user friendly compiler written for the purpose of education?

hvs · 12y ago

From the ground up? Well, there's always the source:

http://www.intel.com/content/www/us/en/processors/architectu...

For learning assembly? For MASM I enjoyed this book years ago:

http://www.amazon.com/Assembly-Language-Intel-Based-Computer...

For GAS something like this might be more appropriate:

http://www.amazon.com/Professional-Assembly-Language-Richard...

Patterson & Hennessy is used a lot in colleges to teach low-level architecture and assembly:

http://www.amazon.com/Computer-Architecture-Fifth-Quantitati...

minimax · 12y ago

The shallow end of the pool is simple RISC CPUs e.g. Atmel 8-bit AVR. The complete instruction manual is something like 160 pages (compare to x86 at 3k+) and there are tons of beginner resources for doing assembler on those chips.

xradionut · 12y ago

There's more structured material for MIPS than AVR. Plus the AVR has some funky memory maps and modes.

hmottestad · 12y ago

There is a developer friendly virtual machine called Bochs. It can be hooked to a debugger so you can see what is happening. I prefer DDD for debugging hooked to GDB (backend).

Here is an assembly cheat sheet that I like a lot: http://www.jegerlehner.ch/intel/

Also, GCC can output assembly if you want to see what simple C code looks like in assembly: http://stackoverflow.com/questions/137038/how-do-you-get-ass...

Finally, I would say that the recommended method of learning about all of this and more is to take a course at uni. You should take C programming and operating system architecture, where you should learn assembly. Then you should take a course where you get to build your own OS (http://www.uio.no/studier/emner/matnat/ifi/INF3151/index-eng...) followed by a course on multi core architecture (http://www.uio.no/studier/emner/matnat/ifi/INF5063/)

polymathist · 12y ago

I would recommend not jumping right into x86 and instead starting with something simpler. I learned MIPS in school. The instruction set is very easy to wrap your head around.

MIPS Quick Tutorial: http://logos.cs.uic.edu/366/notes/mips%20quick%20tutorial.ht...

A simulator that will let you run MIPS on windows/linux/mac: http://pages.cs.wisc.edu/~larus/spim.html

deletes · 12y ago

http://stackoverflow.com/questions/27568/assembler-ide-simul...

WhitneyLand · 12y ago · 5 in thread

And yet phones are already doing real-time 4k video encoding.

In other words, since a lot of compute intensive scenarios are already being served by less general purpose hardware, what are the most recognizable scenarios where AVX-512 will make a big difference over AVX-256?

I know video encoding can be an integer only algorithm while AVX seems to help floats more but still...

protopete · 12y ago

Which phone can do realtime 4k encoding? Can you tell me which SoC it uses?

Edit: It's the Acer Liquid S2, with Qualcomm Snapdragon 800 SoC

devx · 12y ago

All Snapdragon 800-based phones, and there are quite a few of them, and there will be more in the coming months (and of course later even more chips doing the same). Right now it's the one you said, LG G2, Sony Xperia Z1, Samsung Galaxy Note 3, and soon the Nexus 5.

Now I'm not 100 percent sure if every single one of them has that as a user-centric feature (maybe they didn't enable it), but the chip supports 4k video recording.

alayne · 12y ago

Surely you'd prefer to encode video much faster than real time.

deathanatos · 12y ago

If it can do it instantly, that would be great. But why does it need to be faster than real-time? You can't capture video any faster. (Though, "real-time" probably needs some sort of framerate qualifier… no doubt 60 fps is harder to encode than 30 fps.)

Unless you're thinking of re-encoding … on a phone?

wmf · 12y ago

I think AVX is for scientific computing where there are a variety of algorithms that can't justify specific offloads like video encoding.

Symmetry · 12y ago · 4 in thread

I really wonder what this is for. Isn't AVX-256 code usually already limited by memory bandwidth? There are design tradeoffs between memory latency and memory bandwidth and in order for Intel CPUs to keep their advantage in high single threaded performance they have to lean towards the low latency side of things.

Is this aimed at Larrabee?

reitzensteinm · 12y ago

SKUs of Haswell with the GT3e graphics configuration include 128 MB of on-package DRAM, which is intended to give more memory bandwidth to the GPU, but it also acts as a cache for the CPU.

Based on what you're saying, it seems as though AVX-512 and future models with larger, faster embedded DRAM might play very nicely together.

Symmetry · 12y ago

I should have been clearer. I was talking about the cache hierarchy and CPU memory pipes. I suppose there's a divergence in main memory too, with CPUs using DDR3 and GPUs using GDDR5, but as you say you can just throw cache at that problem.

caf · 12y ago

Larrabee itself is defunct, but one of its successors is the "Xeon Phi" line, and the Knights Landing generation of that will have AVX-512.

corresation · 12y ago

Memory technology keeps improving as well, and will have improved by late-2014/2015. Regardless, even if an eight-core chip would be starved for memory running eternal AVX-512 instructions, the advantage would be power and heat savings with a single core running a limited set of computations as quickly and efficiently as possible, doing the most with the least. As the power profile of chips gets smaller and smaller, that would be the biggest advantage.

tachyonbeam · 12y ago · 3 in thread

Seems to me like Intel and AMD aren't very forward thinking, adding new instructions and registers every year or two. As if the x86 instruction set wasn't bloated enough, now they're going to have instructions with 4-byte prefixes, and new registers you can only access with AVX. What's next after that, AVX-1024 with 6-byte prefixes? Meanwhile this renders MMX and SSE sort of redundant. Seems to me we might be better served with some kind of vector coprocessor and instructions that can operate directly on large vectors in memory, instead of doubling the size of the vector registers all the time and making x86 an ever harder target to generate efficient code for.

Maybe this is another area where ARM can beat x86. Have a better planned out vector instruction set that can be expanded without adding hundreds of new instructions all the time, and more compact machine code.

corresation · 12y ago

  > Meanwhile this renders MMX and SSE sort of redundant.

It's an evolution of the same SIMD ideas. Yes, the newer variants do render the older variants redundant, but the older ones hang around because existing code might use them.

MMX - SIMD, integer only, reused the existing floating point registers, making it a PITA that often didn't even pay off because of the expensive state switching between MMX and floating point.

SSE - Started as floating point SIMD with its own 128-bit registers. Evolved through SSE 4.2 with more instructions (functionality in hardware), flexibility (e.g. operate on 4 singles or 2 doubles or...) and the addition of integer functionality.

AVX - Doubled the size of the vector registers to 256 bits and added lots of new instructions and functionality. The successor of SSE, at least on the floating point side. AVX2 brought the integer functionality.

AVX-512 - Doubles the size of the vector registers again, to 512 bits, and doubles the number of them to 32. 16 single float operations in one go.
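A quick sketch of the arithmetic behind that list, for anyone who wants the lane counts spelled out. The `lanes` helper and the `REGISTER_BITS` table are made up for illustration; only the register widths come from the comment above.

```python
# Register widths in bits, as listed in the comment above.
REGISTER_BITS = {"MMX": 64, "SSE": 128, "AVX": 256, "AVX-512": 512}


def lanes(register_bits, element_bits):
    """Number of elements of the given width that fit in one register."""
    return register_bits // element_bits


for name, bits in REGISTER_BITS.items():
    # 32-bit lanes are ints for MMX and ints or single floats for the rest.
    print(f"{name:8} {bits:3}-bit register: "
          f"{lanes(bits, 32):2} x 32-bit, {lanes(bits, 64)} x 64-bit")
```

The last row is where the "16 single float operations in one go" figure comes from: 512 / 32 = 16.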

This kind of sounds like baseless griping. Unless you write a compiler, why do you care? Do you really sweat the instruction prefixes?

ChuckMcM · 12y ago

I wish that Intel would have spent some transistors on an arbitrary precision decimal arithmetic floating point unit. That would have helped scientific processing but in the past has been 'too expensive' in terms of transistors to implement. Now that we have more transistors than we know what to do with, seems like that should be revisited.

Then generalize the vector coprocessing abilities of the GPU and that would be a pretty flexible base to work from.

Symmetry · 12y ago

Well, all those instruction prefixes mean that decoding x86 instructions is really hard. That leads to more cycles to decode, for a larger branch mis-predict penalty. And your decode unit takes as much power as your integer cluster (which is still small potatoes compared to all the OoO resources). And you're limited to 1 decoded instruction per cycle when executing instructions brought into cache for the first time, before the processor can tag the boundaries (but again, most instructions are executed many times).

Aardwolf · 12y ago · 3 in thread

Is there any x86 instruction set that supports quadruple precision floating point numbers? If not, why not? Is it not useful enough?

nwhitehead · 12y ago

There aren't a lot of advantages to putting quadruple precision into the hardware. Typically with numeric code you don't run out of exponent space; you run into precision limits. To increase precision you can use software techniques like double-double representation. This doubles precision and keeps the exponent range the same at the cost of increased numbers of instructions.
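For the curious, here is a minimal sketch of the double-double idea in Python, whose `float` is an IEEE double. A value is kept as a `(hi, lo)` pair of doubles; `two_sum` is the classic error-free addition, while `dd_add` is a simplified addition written for this example (the function names are mine, and a production library would use a more careful variant).

```python
def two_sum(a, b):
    """Error-free addition: returns (s, err) where s is the rounded sum
    a + b and err is the rounding error, so s + err equals a + b exactly."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err


def dd_add(x, y):
    """Add two double-double values, each given as a (hi, lo) pair.
    Simplified variant: good enough to show the idea, not the tightest bound."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    return two_sum(s, e)  # renormalise so hi carries the leading bits


tiny = 2.0 ** -60

# Plain doubles lose the tiny term entirely.
print((1.0 + tiny) - 1.0)  # 0.0

# Double-double keeps it in the low word and recovers it.
x = dd_add((1.0, 0.0), (tiny, 0.0))
print(dd_add(x, (-1.0, 0.0)))  # (8.673617379884035e-19, 0.0), i.e. 2**-60
```

Note how one addition turned into a dozen or so floating point operations, which is the "increased numbers of instructions" cost mentioned above.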

The real action is in FMA (fused multiply-add) instructions. These instructions do two operations then a correct rounding of the result (e.g. round(a*b+c)). FMA in hardware is great. It lets you write functions with provably tight errors or even provably correct rounding of the result. More and more platforms are providing FMA [1].

[1]: http://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_ope...
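To see what the single rounding in round(a*b+c) buys you, here is a small sketch that emulates a fused multiply-add with exact rational arithmetic. `fma_emulated` is a hypothetical helper for illustration, not a hardware FMA; it relies on `Fraction` being exact and on `float()` of a `Fraction` rounding once.

```python
from fractions import Fraction


def fma_emulated(a, b, c):
    """Compute a*b + c exactly as a rational, then round to a double once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

# Exact product is 1 - 2**-60, which rounds to 1.0 before the add happens.
unfused = a * b + c
# With one rounding at the end, the -2**-60 survives.
fused = fma_emulated(a, b, c)

print(unfused)  # 0.0
print(fused)    # -8.673617379884035e-19
```

The unfused version loses the answer completely, while the fused one returns it exactly.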

ansgri · 12y ago

Where exactly is it useful? Double is already overkill in all multimedia operations, and finance folks use fixed-point anyway.

fdej · 12y ago

Quadruple (or higher) precision is needed in many scientific applications.

waterlesscloud · 12y ago · 3 in thread

This might well be a dumb question, but why don't we have 128 bit processors?

No advantage? Crazy expense?

lambda · 12y ago

At some point, you need to ask what you mean by 128 bit. When people talk about an 8 bit, 16 bit, 32 bit, or 64 bit processor, they are actually generally conflating two or more things. There's the size of general purpose register, the size of the data bus (how much you can load from memory in a single transfer), the size of the address bus (how many lines you have for addressing RAM), and the size of pointers. In many machines, these have been the same, though for example, 8 bit processors frequently had 14 or 16 bit addresses and busses so they could access up to 16 or 64k of memory; but there's also, for example, the 68008 with 32 bit registers, a 20 bit address bus, and an 8 bit data bus.

So, when people talk about 32 or 64 bits, they generally mean two things: the size of general purpose registers, and the size of addresses.

There's basically no need for addresses beyond 64 bits, at least for quite some time. With 64 bits, you can address 16,384 petabytes (16 exabytes) in a single process. Since the biggest single machines I can find these days support a maximum of 4 TB of RAM (if you filled it with 32 GB DIMMs that aren't yet available), we have a long way to go before you will need more than 64 bits of address space.
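The numbers are easy to check. A minimal sketch, assuming binary units (1 PB = 2^50 bytes and so on) and taking the 4 TB machine from the paragraph above as the comparison point:

```python
ADDRESS_BITS = 64
addressable_bytes = 2 ** ADDRESS_BITS

TiB = 2 ** 40
PiB = 2 ** 50
EiB = 2 ** 60

print(addressable_bytes // PiB)  # 16384 petabytes
print(addressable_bytes // EiB)  # 16 exabytes

# The 4 TB machine mentioned above, for comparison.
biggest_machine = 4 * TiB
print(addressable_bytes // biggest_machine)  # 4194304
```

So the address space is about four million times larger than the RAM of that machine.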

Furthermore, increasing address size can hurt performance. If your pointers are all 128 bit, they take up twice the space of 64 bit pointers. There have already been plenty of workloads that show a reduction in performance when ported to 64 bit machines, just because the 64 bit pointers fill up so much valuable cache space. In fact, for this reason, Linux even has support for the x32 ABI, which uses an x86-64 processor in 64 bit mode but only uses 32 bit pointers, so they can take advantage of extra registers available to x86-64 without paying the price for the larger pointers. https://en.wikipedia.org/wiki/X32_ABI
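As a rough illustration of the cache cost, consider a hypothetical binary tree node holding two pointers and a 4-byte key. The node layout, the padding rule (round up to the pointer alignment, as a C compiler typically would) and the 64-byte cache line are all assumptions made for this sketch.

```python
CACHE_LINE_BYTES = 64  # assumption: a common x86 cache line size


def node_size(pointer_bytes, payload_bytes=4, n_pointers=2):
    """Size of a hypothetical node: n pointers plus a small payload,
    padded up to a multiple of the pointer alignment."""
    raw = n_pointers * pointer_bytes + payload_bytes
    align = pointer_bytes
    return -(-raw // align) * align  # ceiling division, then scale back up


for bits in (32, 64, 128):
    size = node_size(bits // 8)
    print(f"{bits:3}-bit pointers: {size:2} bytes per node, "
          f"{CACHE_LINE_BYTES / size:.2f} nodes per cache line")
```

Under these assumptions the node grows from 12 to 24 to 48 bytes, so each doubling of pointer width halves how many nodes fit in cache.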

So, there's no benefit to 128 bit addresses and lots of potential downside, so it's not going to happen for quite some time. How about for data, though?

Well, most software doesn't really need to work with integers or floating point numbers larger than 64 bits, anyhow. For lots of applications, 64 or even 32 bits is sufficient. Public key crypto can frequently take advantage of large integers, though it generally needs even bigger integers, like 2048 bits, so you generally have to do bignum arithmetic anyhow.

Lots of the gains that you get from working with larger types come from working on vectors of smaller types. But for those purposes, chips have had 128 bit registers for quite some time. SSE, introduced in 1999, included 128 bit vector registers, which could be treated as four 32 bit floats (AltiVec on PowerPC had introduced the same idea a few years earlier; the idea of SIMD has been around in supercomputers for many years). Later extensions like SSE2 expanded their use to allow you to treat them as two 64 bit floats, four 32 bit integers, two 64 bit integers, 8 16 bit shorts, and 16 8-bit bytes.
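The "same 128 bits, different element types" idea can be mimicked with Python's `struct` module. This is only an analogy: the 16 bytes below stand in for one vector register, the data is arbitrary, and little-endian layout is assumed because that is what x86 uses.

```python
import struct

register = bytes(range(16))  # 16 bytes = 128 bits of arbitrary example data

# Each format string reinterprets the same 16 bytes (little-endian).
views = {
    "4 x 32-bit float": "<4f",
    "2 x 64-bit float": "<2d",
    "4 x 32-bit int": "<4I",
    "2 x 64-bit int": "<2Q",
    "8 x 16-bit int": "<8H",
    "16 x 8-bit int": "<16B",
}

for name, fmt in views.items():
    print(f"{name:17}", struct.unpack(fmt, register))
```

Nothing is converted here; the bits stay put and only the interpretation changes, which is also how the hardware treats a vector register.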

So, for the only use case for which it's particularly valuable, working on vectors in aggregate, we've had 128 bit registers for quite some time. We've had 256 bit registers for a couple of years now in the form of AVX. Now this promises to expand those to 512 bits. There's no good reason to expand your addresses in the same way; at that point, you're just wasting space.

josephlord · 12y ago

Upvoted. Although 16 exabytes is less headroom if you are memory mapping persistent storage rather than just RAM, which makes increasing sense with SSDs. 64 bit addressing is still plenty for most scenarios for some time to come, though, even if this approach is taken.

wmf · 12y ago

There's no particular advantage to having 128-bit integers or pointers. 128-bit or larger SIMD has existed for 15 years or so.

carterschonwald · 12y ago · 2 in thread

The AVX / SIMD part of the x86 instruction set is probably the most understandable subset to focus on learning! And I'm very excited about AVX-512.

I'm actually going to be spending some of my time over the next year adding proper SIMD support (including all the shuffles) to the main Haskell compiler, GHC!

There are some really interesting constraints on the SIMD shuffle primops that need some type system cleverness to compile correctly!

Namely, you need to know "statically, at code gen time", the shuffle constants that are given as "immediates" to the instructions! Normal values don't quite have the right semantics, and accordingly the SIMD intrinsics in C compilers kinda lie about the types they expect (i.e. if you give them a variable of the right type, they'll give you an error saying they need an actual constant literal).
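To make the "immediate" point concrete, here is a small emulation of one such shuffle, PSHUFD on four 32-bit lanes. The selector is a single byte baked into the instruction encoding, with one 2-bit field per result lane. The Python function is only a model of the lane selection; it takes `imm8` as an ordinary argument, which is exactly what the real instruction cannot do.

```python
def pshufd(src, imm8):
    """Model of PSHUFD: result lane i is src[(imm8 >> 2*i) & 3],
    so each 2-bit field of imm8 picks one of the four source lanes."""
    assert 0 <= imm8 <= 0xFF, "the selector must fit in one immediate byte"
    return [src[(imm8 >> (2 * i)) & 3] for i in range(4)]


v = [10, 20, 30, 40]
print(pshufd(v, 0x1B))  # fields 3,2,1,0 reverse the lanes: [40, 30, 20, 10]
print(pshufd(v, 0x00))  # every field is 0, so lane 0 is broadcast
print(pshufd(v, 0xE4))  # fields 0,1,2,3 give the identity shuffle
```

Since the byte lives in the instruction stream, the compiler has to know its value when it emits the instruction, hence the need for a compile-time constant.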

tl;dr I'm going to make sure that GHC (and Haskell) can support AVX-512 by the time thusly equipped CPUs are made available.

nkurz · 12y ago

  > Namely, you need to know "statically, at code gen time",
  > the shuffle constants that are given as "immediates" to 
  > the instructions!

Can you clarify where the extra difficulty is?

I'm ignorant of GHC, but I'd think that from the compiler's POV all that matters is that the operand is a constant. Then it's just a matter of putting the value in instead of the name. In GCC inline assembly, you can use the poorly documented '%c' prefix to have a number treated as an immediate, so I'd guess this must be possible internally too. Also possibly worth noting is that unlike the others, PSHUFB works from a register rather than a value encoded into the instruction.

carterschonwald · 12y ago

There's more than one shuffle instruction, in fact there's quite a few! You're right that some can take register args, but those aren't the ones I care about as much.

You're right, there are hacks in C that handle that. My goal for GHC is to actually have a systematic solution for handling any sort of constant literal expression at compile time. This includes making it easy to add new primops that require compile time literal data.

There are some interesting implications if you want that restriction to be typecheckable! This includes having a "static data kind". Part of why you want that is also because GHC is great at common subexpression elimination, and I consider any implementation strategy that could be broken by compiler optimization to be unacceptable.

[edit to clarify: just naively using normal literal values would likely be subject to CSE optimization, and having code gen need to look around to look up a variable rather than being a localized process is somewhat horrifying]

One particular end goal of mine is this: SIMD isn't that complicated, and it's really easy to experiment with (but only if you can cope with C). I want to make experimenting with SIMD much more accessible.

Interestingly enough, the notion of static data I want seems like it might be an especially strong version of the notion of static data that Cloud Haskell (the distributed process lib) would like. So there may be some trickle down there!

One really cool optimization having a proper notion of static literals might enable is making it much easier to generate things like static lookup tables and related data structures that are small and perf sensitive.

Edit: also if you want to try and stare at the source for a serious compiler, GHC (while huge) is pretty darn readable. Just pick a piece you want to understand and stare at the code for a while!

Edit: I should add that Geoff Mainland has some great preliminary work adding experimental SIMD support to GHC that's in HEAD, pending 7.8. That said, GHC support for interesting SIMD won't be ready for prime time till 7.10 in a yearish.

noahl · 12y ago

The mask registers that can turn off operations on individual elements of a vector reminded me of CUDA. It might be possible to emulate individual "threads" on these pretty easily.
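A rough sketch of how that emulation might look, using plain Python lists as 8-lane vectors and an integer as the mask register. `masked_op` is a hypothetical helper that models merge-masking (masked-off lanes keep their old value); it is not an AVX-512 intrinsic. An if/else over the lanes then becomes two passes with complementary masks, which is roughly how a GPU handles divergent threads.

```python
def masked_op(op, mask, dst, a, b):
    """Merge-masking: lane i becomes op(a[i], b[i]) if bit i of mask is set,
    otherwise it keeps the old value dst[i]."""
    return [op(x, y) if (mask >> i) & 1 else old
            for i, (old, x, y) in enumerate(zip(dst, a, b))]


a = [1, -2, 3, -4, 5, -6, 7, -8]
b = [10] * 8

# Per-lane "if a > 0: r = a + b  else: r = a - b", done for all lanes at once.
mask = sum(1 << i for i, x in enumerate(a) if x > 0)  # vector compare -> mask
r = [0] * 8
r = masked_op(lambda x, y: x + y, mask, r, a, b)          # "then" lanes
r = masked_op(lambda x, y: x - y, ~mask & 0xFF, r, a, b)  # "else" lanes

print(bin(mask))  # 0b1010101
print(r)          # [11, -12, 13, -14, 15, -16, 17, -18]
```

Both branches run over the whole vector, so the cost is the sum of the two paths, the same trade-off as warp divergence in CUDA.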

Does OpenCL have a similar threading concept? I don't know much about it, sadly.

erichocean · 12y ago

Floating point vector instructions have options for specifying the rounding mode and for suppressing exceptions.

That's huge.
