http://www.intel.com/content/www/us/en/processors/architectu...
For learning assembly? For MASM I enjoyed this book years ago:
http://www.amazon.com/Assembly-Language-Intel-Based-Computer...
For GAS something like this might be more appropriate:
http://www.amazon.com/Professional-Assembly-Language-Richard...
Hennessy & Patterson is used a lot in colleges to teach low-level architecture and assembly:
http://www.amazon.com/Computer-Architecture-Fifth-Quantitati...
Here is an assembly cheat sheet that I like a lot: http://www.jegerlehner.ch/intel/
Also GCC can output assembly if you want to see what simple C code looks like in assembly http://stackoverflow.com/questions/137038/how-do-you-get-ass...
Finally I would say that the recommended method of learning about all of this and more is to take a course at uni. You should take C programming and operating system architecture, where you should learn assembly. Then you should take a course where you get to build your own OS (http://www.uio.no/studier/emner/matnat/ifi/INF3151/index-eng...) followed by a course on multi-core architecture (http://www.uio.no/studier/emner/matnat/ifi/INF5063/)
MIPS Quick Tutorial: http://logos.cs.uic.edu/366/notes/mips%20quick%20tutorial.ht...
A simulator that will let you run MIPS on Windows/Linux/Mac: http://pages.cs.wisc.edu/~larus/spim.html
In other words, since a lot of compute-intensive scenarios are already being served by less general-purpose hardware, what are the most recognizable scenarios where AVX-512 will make a big difference over 256-bit AVX/AVX2?
I know video encoding can be an integer-only algorithm, while AVX seems to help floats more, but still...
Edit: It's the Acer Liquid S2, with Qualcomm Snapdragon 800 SoC
Now I'm not 100 percent sure if every single one of them has that as a user-centric feature (maybe they didn't enable it), but the chip supports 4k video recording.
Unless you're thinking of re-encoding … on a phone?
Is this aimed at Larrabee?
Based on what you're saying, it seems as though AVX-512 and future models with larger, faster embedded DRAM might play very nicely together.
Maybe this is another area where ARM can beat x86. Have a better planned out vector instruction set that can be expanded without adding hundreds of new instructions all the time, and more compact machine code.
It's an evolution of the same SIMD ideas. Yes, the newer variants do render the older variants redundant, but the older ones hang around because existing code might use them.
MMX - SIMD, integer only. It reused the existing floating point registers, making it a PITA that often didn't even pay off because of the expensive state switching between MMX and floating point.
SSE - Started as floating point SIMD with its own 128-bit registers. Evolved through SSE 4.2 with more instructions (functionality in hardware), flexibility (e.g. operate on 4 singles or 2 doubles or...) and the addition of integer functionality.
AVX - Doubled the size of the vector registers to 256 bits and added lots of new instructions and functionality. The successor of SSE, at least on the floating point side. AVX2 brought the integer functionality.
AVX-512 - Doubles the size of the vector registers again, to 512 bits, and doubles how many there are (32 in 64-bit mode). 16 single float operations in one go.
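The "16 single float operations in one go" figure is just register width divided by element width. A minimal sketch of that arithmetic follows; the `lanes` helper and the two tables are names made up for this illustration, and the output says nothing about which element types a given extension actually supports (MMX, for example, is integer only).

```python
# Lanes per vector register = register width / element width.
# Register widths are the ones named in the list above.
REGISTER_BITS = {"MMX": 64, "SSE": 128, "AVX": 256, "AVX-512": 512}
ELEMENT_BITS = {"int8": 8, "int16": 16, "float32": 32, "float64": 64}


def lanes(register_bits, element_bits):
    """Number of elements of the given width that fit in one register."""
    return register_bits // element_bits


for name, reg in REGISTER_BITS.items():
    row = ", ".join(
        f"{lanes(reg, bits)} x {elem}" for elem, bits in ELEMENT_BITS.items()
    )
    print(f"{name:8s} {reg:3d} bits: {row}")
```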
> Maybe this is another area where ARM can beat x86. Have a better planned out vector instruction set that can be expanded without adding hundreds of new instructions all the time, and more compact machine code.
This kind of sounds like baseless griping. Unless you write a compiler, why do you care? Do you really sweat the instruction prefixes?
Then generalize the vector coprocessing abilities of the GPU and that would be a pretty flexible base to work from.
The real action is in FMA (fused multiply-add) instructions. These instructions perform a multiply and an add with a single, correct rounding of the final result (e.g. round(a*b+c)), instead of rounding once after the multiply and again after the add. FMA in hardware is great. It lets you write functions with provably tight errors or even provably correct rounding of the result. More and more platforms are providing FMA [1].
[1]: http://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_ope...
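To make the single-rounding point concrete, here is a rough sketch that models round(a*b+c) with exact rational arithmetic rather than a hardware FMA. `fma_model` and `mul_then_add` are hypothetical helper names, the model only handles finite inputs, and it relies on CPython converting a `Fraction` to the nearest double. The inputs are chosen so that a*b is exactly 1 - 2^-60, which rounds to 1.0 when the product is rounded on its own.

```python
from fractions import Fraction


def fma_model(a, b, c):
    """Model of a fused multiply-add: form a*b+c exactly, then round once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # the only rounding step


def mul_then_add(a, b, c):
    """Unfused version: the product is rounded before the add."""
    return a * b + c


a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

unfused = mul_then_add(a, b, c)  # product rounds to 1.0, so the sum is 0.0
fused = fma_model(a, b, c)       # keeps the -2**-60 that the product lost
print(unfused, fused)
```

The unfused result loses the answer entirely, while the fused one is exact. That difference is what makes tight error bounds provable.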
No advantage? Crazy expense?
So, when people talk about 32 or 64 bits, they generally mean two things: the size of general purpose registers, and the size of addresses.
There's basically no need for addresses beyond 64 bits, at least for quite some time. With 64 bits, you can address 16,384 petabytes (16 exabytes) in a single process. Since the biggest single machines I can find these days support a maximum of 4 TB of RAM (if you filled it with 32 GB DIMMs that aren't yet available), we have a long way to go before you will need more than 64 bits of address space.
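The numbers above are easy to check. A quick sketch, reading the quoted "4 TB" as 4 TiB (that reading is an assumption):

```python
# Binary unit sizes in bytes: 2**10, 2**20, ..., 2**60.
KIB, MIB, GIB, TIB, PIB, EIB = (2 ** (10 * i) for i in range(1, 7))

address_space = 2**64      # bytes addressable with a 64-bit pointer
largest_machine = 4 * TIB  # the 4 TB figure from the comment, taken as 4 TiB

print(address_space // PIB)  # 16384 (petabytes, binary)
print(address_space // EIB)  # 16 (exabytes, binary)

# How many times over the address space covers that machine's RAM.
headroom = address_space // largest_machine
print(headroom, "=", f"2**{headroom.bit_length() - 1}")
```

So a 4 TiB machine uses 42 of the 64 address bits, leaving 22 doublings of RAM before the address space runs out.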
Furthermore, increasing address size can hurt performance. If your pointers are all 128 bit, they take up twice the space of 64 bit pointers. There have already been plenty of workloads that show a reduction in performance when ported to 64 bit machines, just because the 64 bit pointers fill up so much valuable cache space. In fact, for this reason, Linux even has support for the x32 ABI, which uses an x86-64 processor in 64 bit mode but only uses 32 bit pointers, so programs can take advantage of the extra registers available to x86-64 without paying the price for the larger pointers. https://en.wikipedia.org/wiki/X32_ABI
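As a back-of-the-envelope illustration of the cache pressure, consider a hypothetical linked-list or tree node with two pointers and a 4-byte payload. The node layout, the 32 KiB cache size, and the rule that a struct is padded to its pointer alignment are all assumptions picked for the sketch, not measurements of any real workload.

```python
def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def node_size(pointer_bytes, payload_bytes=4, pointers=2):
    """Size of a hypothetical node: `pointers` pointers, then a payload,
    padded so that an array of nodes keeps every pointer aligned."""
    size = pointers * pointer_bytes
    size = align_up(size, payload_bytes) + payload_bytes
    return align_up(size, pointer_bytes)


CACHE_BYTES = 32 * 1024  # assumed L1 data cache size

for bits in (32, 64, 128):
    size = node_size(bits // 8)
    print(f"{bits:3d}-bit pointers: {size:2d} bytes/node, "
          f"{CACHE_BYTES // size} nodes fit in cache")
```

Under these assumptions each doubling of pointer width halves how many nodes fit in the same cache, even though the payload never changes.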
So, there's no benefit to 128 bit addresses and lots of potential downside, so it's not going to happen for quite some time. How about for data, though?
Well, most software doesn't really need to work with integers or floating point numbers larger than 64 bits, anyhow. For lots of applications, 64 or even 32 bits is sufficient. Public key crypto can frequently take advantage of large integers, though it generally needs even bigger integers, like 2048 bits, so you generally have to do bignum arithmetic anyhow.
Lots of the gains that you get from working with larger types come from working on vectors of smaller types. But for those purposes, chips have had 128 bit registers for quite some time. SSE, introduced in 1999, included 128 bit vector registers, which could be treated as four 32 bit floats (AltiVec on PowerPC brought the same idea at around the same time; the idea of SIMD has been around in supercomputers for many years). Later extensions like SSE2 expanded their use to allow you to treat them as two 64 bit floats, two 64 bit integers, four 32 bit integers, eight 16 bit shorts, or sixteen 8 bit bytes.
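The "same 128 bits, different lane layouts" idea can be mimicked with the `struct` module by unpacking one 16-byte buffer several ways. This is only an analogy for how the register contents get interpreted; the byte values are arbitrary and little-endian order is assumed because that is what x86 uses.

```python
import struct

register = bytes(range(16))  # 128 bits of arbitrary data

# Each format string consumes exactly 16 bytes, little-endian.
views = {
    "4 x float32": struct.unpack("<4f", register),
    "2 x float64": struct.unpack("<2d", register),
    "2 x int64": struct.unpack("<2q", register),
    "4 x int32": struct.unpack("<4i", register),
    "8 x int16": struct.unpack("<8h", register),
    "16 x int8": struct.unpack("<16b", register),
}

for layout, lanes in views.items():
    print(f"{layout:12s} -> {len(lanes)} lanes")
```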
So, for the only use case for which it's particularly valuable, working on vectors in aggregate, we've had 128 bit registers for quite some time. We've had 256 bit registers for a couple of years now in the form of AVX. Now this promises to expand those to 512 bits. There's no good reason to expand your addresses in the same way; at that point, you're just wasting space.
I'm actually going to be spending some of my time over the next year adding proper SIMD support (including all the shuffles) to the main Haskell compiler, GHC!
There are some really interesting constraints on the SIMD shuffle primops that need some type system cleverness to compile correctly!
Namely, you need to know "statically, at code gen time", the shuffle constants that are given as "immediates" to the instructions! Normal values don't quite have the right semantics, and accordingly the SIMD intrinsics in C compilers kinda lie about the types they expect (i.e. if you give them a variable of the right type, they'll give you an error saying they need an actual constant literal).
tl;dr I'm going to make sure that GHC (and Haskell) can support AVX-512 by the time CPUs equipped with it are made available
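For readers who haven't met shuffle immediates: below is a small model of what an 8-bit immediate does in a four-lane shuffle, following the lane selection rule of SSE2's PSHUFD (result lane i takes source lane `(imm8 >> 2*i) & 3`). `pshufd_model` is a made-up name, and because the immediate is an ordinary Python argument here, the sketch shows only the selection semantics, not the constraint that the real value is baked into the instruction encoding and so must be known at code gen time.

```python
def pshufd_model(src, imm8):
    """Model of an immediate-controlled shuffle of four 32-bit lanes.

    Result lane i is src[(imm8 >> (2 * i)) & 3], as in SSE2's PSHUFD.
    """
    if len(src) != 4:
        raise ValueError("expected exactly four lanes")
    if not 0 <= imm8 <= 0xFF:
        raise ValueError("immediate must fit in 8 bits")
    return [src[(imm8 >> (2 * i)) & 0b11] for i in range(4)]


# Two bits per result lane, lane 3 in the top bits, lane 0 in the bottom.
REVERSE = 0b00_01_10_11   # 0x1B
IDENTITY = 0b11_10_01_00  # 0xE4

lanes = [10, 11, 12, 13]
print(pshufd_model(lanes, REVERSE))
print(pshufd_model(lanes, IDENTITY))
```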
> Namely, you need to know "statically, at code gen time",
> the shuffle constants that are given as "immediates" to
> the instructions!
Can you clarify where the extra difficulty is? I'm ignorant of GHC, but I'd think that from the compiler's POV all that matters is that the operand is a constant. Then it's just a matter of putting the value in instead of the name. In GCC inline assembly, you can use the poorly documented '%c' operand modifier to have a number treated as an immediate, so I'd guess this must be possible internally too. Also possibly worth noting is that unlike the others, PSHUFB works from a register rather than a value encoded into the instruction.
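To illustrate the PSHUFB point, here is a sketch of a byte shuffle whose control is itself a 16-byte value, the way it would sit in a register, so it can be computed at run time. It follows the 128-bit PSHUFB rule (a control byte with its high bit set produces zero, otherwise its low four bits select a source byte); `pshufb_model` is a hypothetical name for this illustration.

```python
def pshufb_model(src, control):
    """Model of a 16-byte shuffle driven by a run-time control vector.

    For each position, a control byte with bit 7 set yields 0; otherwise
    its low 4 bits pick the source byte, as in the 128-bit PSHUFB.
    """
    if len(src) != 16 or len(control) != 16:
        raise ValueError("both operands must be 16 bytes")
    return bytes(0 if c & 0x80 else src[c & 0x0F] for c in control)


data = bytes(range(100, 116))

# The control vector is ordinary data, built at run time.
reverse_control = bytes(range(15, -1, -1))
print(list(pshufb_model(data, reverse_control)))

# High bit set in every control byte clears the whole result.
print(list(pshufb_model(data, bytes([0x80] * 16))))
```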
You're right, there are hacks in C that handle that. My goal for GHC is to actually have a systematic solution for handling any sort of constant literal expression at compile time. This includes making it easy to add new primops that require compile time literal data.
There are some interesting implications if you want that restriction to be typecheckable! This includes having a "static data kind". Part of why you want that is also because GHC is great at common subexpression elimination, and I consider any implementation strategy that could be broken by compiler optimization to be unacceptable.
[edit to clarify: just naively using normal literal values would likely be subject to CSE optimization, and having code gen need to look around to look up a variable rather than being a localized process is somewhat horrifying]
One particular end goal of mine is this: SIMD isn't that complicated, and it's really easy to experiment with (but only if you can cope with C). I want to make experimenting with SIMD much more accessible.
Interestingly enough, the notion of static data I want seems like it might be an especially strong version of the notion of static data that Cloud Haskell (the distributed process lib) would like. So there may be some trickle-down there!
One really cool optimization that having a proper notion of static literals might enable is making it much easier to generate things like static lookup tables and related data structures that are small and perf sensitive.
Edit: also if you want to try and stare at the source for a serious compiler, GHC (while huge) is pretty darn readable. Just pick a piece you want to understand and stare at the code for a while!
Edit: I should add that Geoff Mainland has some great preliminary work adding experimental SIMD support to GHC that's in HEAD, pending 7.8. That said, GHC support for interesting SIMD won't be ready for prime time till 7.10 in a yearish
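As a loose illustration of the kind of small, perf-sensitive table meant here, this sketch precomputes a 256-entry bit-count table and uses it for a 32-bit popcount. It is not how GHC would do it: Python builds the table when the module loads, which only stands in for generating it at compile time, and `POPCOUNT_TABLE` and `popcount32` are names invented for the example.

```python
# 256-entry table: number of set bits in each possible byte value.
# Built once up front; a compiler with static literals could emit it as data.
POPCOUNT_TABLE = tuple(bin(i).count("1") for i in range(256))


def popcount32(x):
    """Count set bits in the low 32 bits of x using four table lookups."""
    x &= 0xFFFFFFFF
    return sum(POPCOUNT_TABLE[(x >> shift) & 0xFF] for shift in (0, 8, 16, 24))


print(popcount32(0b1011), popcount32(0xFFFFFFFF))
```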
Does OpenCL have a similar threading concept? I don't know much about it, sadly.
That's huge.