We had numerous different examples of this in the Hyperscan project and I broke out something similar on my blog: https://branchfree.org/2018/05/30/smh-the-swiss-army-chainsa...
We also used SIMD quite extensively as a 'wider GPR' - not doing stuff over tons of input characters but instead using the superior size of SIMD registers to implement things like bitwise string and NFA matchers.
A SIMD instruction can be a reasonable proxy for a wide vector processor but the reverse is not true - a specialized vector architecture is unlikely to be very helpful for this kind of 'mixed' SIMD processing. Almost any "argument from DAXPY" fails for the much richer uses of SIMD processing among active practitioners using modern SIMD.
For a while, it seemed like a set of instructions based on a rank-2 CLOS network (aka butterfly network) would get the job done. It scales at O(log(N)) in gate delay+ and O(N * log(N)) in area, and is very capable. Fanout is O(1). You can do all kinds of inter- and intra-lane swaps, rotations, and shifts with it. You can even do things like expand packed 16-bit RGB into 32-bit uniform RGBA.
But things like that SMH algorithm are definitely out of scope: Each input bit can only appear in at most one output location with the butterfly. So the cost to replicate a prefix scales at O(repetitions), which is unfortunate. Some algorithms based on using general shuffle are also relying on the use of PSHUFB's functionality as a complete lookup table, which the butterfly network can't do, either.
My conclusion was that you're basically stuck with a general permute instruction for a modern SIMD ISA, scaling be damned.
+ The latency scaling is somewhat misleading thanks to wire delay - they are both O(N) in wire delay.
However the RISC-V Vector Extensions basically let you use the processor in "SIMD mode" by setting the vector length to a small value. It will depend on the processor architecture whether a vector length of e.g. 4 is efficient or not, but I expect for many implementations it will be relatively efficient (and you can definitely just use it as a "wider GPR" if you want to).
The only catch at the ISA level is it costs an instruction to change the vector size. So if you keep swiching between e.g. 128-bit and 512-bit instructions at a very fine granularity, that might add overhead... I'm not sure that's a very common case though?
With a SIMD architecture, as registers get ever wider you need to recompile your code, and binary releases require multiple code paths.
With a vector architecture, the hardware designers can just increase the machine's internal vector size and existing code will benefit immediately.
Furthermore, it seems reasonable to count GPU users among "active practitioners of modern SIMD". All code that is written in a GPU-style -- which admittedly isn't too common yet on CPUs, but it would certainly be feasible and is possible since ispc came along -- immediately becomes a beneficiary of a well-designed vector instruction architecture.
Are you getting hung up on the term 'vector'? That doesn't assume you're just doing linear algebra.
As for SIMD, it's a huge benefit when used in the right context. I applied it image processing and video compression algorithms in the past, with significant performance gains.
Me too.
> As for SIMD, it's a huge benefit when used in the right context.
Sure, but as the article points out, that benefit could be even huger when used on a "proper" vector architecture with veeeery wide vector registers that do not also double as not-very-wide scalar registers. "The SIMD instructions execute 10 to 20 times more instructions than RV32V because each SIMD loop does only 2 or 4 elements instead of 64 in the vector case."
I think adding SIMD instructions to x86 was a good trade-off at the time, but I also think the authors are correct that new ISAs designed now are better off with a vector architecture like they propose. In the end it's apples vs. oranges because the two contexts are not comparable.
It all started with the famous Edsger Dijkstra paper titled “Go To Statement Considered Harmful” [1], which lead to an endless series of subsequent papers later patterned after its title [2].
I agree with the sentiment that this title pattern is overused currently, to the point of being cliche — with perhaps a bit of presumptuousness as well, due to the implicit suggestion that the author’s claim of “Considered Harmful” will withstand the test of time as well as Dijkstra’s paper (though perhaps I’m reading too much into it, in this case).
Same. Please, when you write a title for your article, be specific about what you think is wrong. "X Considered Harmful" is lazy writing and devalues any substantive argument you make because it's become a cliche to the point where it's almost anti-clickbait.
GUILTY ADMISSION: a related trap I started falling into when writing content for work (email, documents, whatever) and I was in a hurry for a subject line or title was "Thoughts on X". That's right, "thoughts": I communed with the creator and distilled them from on high. Yeah, ain't nobody got time to read that.
Holy quack! I didn't even know there were 80 (feels too much already, I barely used a tiny portion when exercising in assembly), 1400 sounds really insane.
Much in the same way, AVX adds only 8 completely new instructions, but adds new 256-bit variants for many pre-existing SSE instructions. This generates enormous amounts of opcodes, without actually increasing complexity _that_ much.
The problem with this is that when the processor stalls on an instruction fetch for the next cacheline of code, it just sits there idle for the entire time. This greatly elongates your tails when looking at performance in terms of latency percentiles.
It makes me wonder if Intel or AMD have investigated a MicroVAX-like trimming or compression of the opcodes so that the most common/useful codes fit in the fewest bytes. In particular it seems like SIMD lengths are inverted, the longest vectors should have shorter opcodes since they're more useful. It might even be worth deprecating MMX/128-bit SSE.
AMD64 came out in 2003, a new decoder might be appropriate by 2023.
In the future, instructions will be designed by machine, for example by considering millions of permutations of possible instruction "combined add with shift with multiply by 8 and set the 6th bit", "double indirect program counter jump with offset 63", etc.
Each permutation will be added to various compilers and simulated by running benchmarks on complete architecture simulators to find out which new instruction adds the most to the power to die area to performance to code size tradeoff.
I predict there will be many more future instructions with 'fuzzy' effects which don't affect correctness, only performance. Eg. 'Set the branch predictor state to favor the next branch for the number of times in register EAX', or 'go start executing the following code speculatively, because you'll jump to it in a few hundred clock cycles and it would be handy to have it all pre-executed'.
Until you're debugging broken compiler/JIT output, which I've had to do multiple times in the last year while using .NET Core.
You don't normally need to write assembly, but you do need to use compiler intrinsics, which map 1-1 with assembly.
It's not as ambitious as your approach though, more like a 1:1 translation and thus cannot take advantage of wider vectors.
The x86 SIMD extensions such as SSE and AVX, on the other hand, aim to integrate that concept with x86 and are therefore pretty similar.
My understanding is that the largest difference is that some of the Cell cores had different opcodes that meant you could schedule some threads on some cores but not any thread on any core.
> An architect partitions the existing 64-bit registers
> The IA-32 instruction set has grown from 80 to around 1400 instructions since 1978, largely fueled by SIMD.
Wait, what. IA-32 started in 1985 not 1978. It didn't have any existing 64 bit registers. It was called IA-32 because of the 32 bit registers, like EAX and EBX. And then looking at the 1986 reference manual https://css.csail.mit.edu/6.858/2014/readings/i386.pdf I count 96 instructions under 17.2.2.11. The IA-32 instruction set didn't grow much all these years, IA-64 did to the best my knowledge but please let me know if I am wrong here. As for IA-64, I looked at https://www.intel.com/content/dam/www/public/us/en/documents... and it's hard to get an accurate count because some instructions are grouped together, it's either 627 or 996 (and I may have made a counting mistake given I started from a PDF, but it should be close) which is indeed very high but even our best attempt only finds a tenfold growth (and perhaps only a 6.5) instead of the 17.5 the article suggested.
I see how this could be beneficial, specially when writing codes, as it's way closer to just a normal loop.
Then again, why not combine both ideas and pipeline chunks of SIMD type? Say we have 4 execution stages and 32bit SIMD types (unrealistic, I know) and want to process 8-bit numbers. Wouldn't we be able to process 16 of them at the same time? Actually, isn't that kind of what GPUs already do?
I'm sure smarter people than I have reasoned about this, maybe someone can link a good article. I only know of this one [1] and one about GPGPU that I just can't find any more (but which was also very interesting)
https://massivebottleneck.com/2019/02/17/vector-vs-simd-dyna...
I think I'm kinda explaining how it's similar (and different) to what modern GPUs do but I'm not sure I understand what you mean by "wouldn't we be able to process 16 of them at the same time" - do you mean a throughput of 16/clock, or just that 16 are "in flight" through the pipeline with a throughput of 4/clock?
I'm not sure I'm clear enough about it for those without a GPU HW background. If it's not clear I'm happy to write down a more detailed explanation here!
I disagree. GPUs have a fixed vector width. AMD GPUs have 64x32-bit vectors, while NVidia GPUs have 32x32-bit vectors.
What's being discussed here is a variable lengthed vector being supported on the hardware level, which is very, very different than how GPUs work.
Yes you would if you had 4 execution ports available and no data dependencies. Of course, those execution ports could also be processing 256 bit wide SIMD registers instead of just 32 bits. So it's a bad idea.
Instruction count is also higher, which is never a good thing.
> Actually, isn't that kind of what GPUs already do?
No.
Strikes me as much cleaner design-wise than stuff like CUDA or openCL or SIMD of today.
For Mill CPU variants with wide vector units the CPU could execute certain instructions in one go, while for variants with narrow units it might have to issue multiple instructions.
Their idea is to handle this by basically doing ahead of time compilation of a generic program image, turning it into a specialized version for the installed CPU.
Sounds neat, proof is in the pudding.
There are reasons to prefer SIMD or vectors machines or, put another way, packed or unpacked vectors. But this is a very one-sided presentation. Also, some SIMD ISAs like Arm's SVE can handle different widths pretty nicely.
https://www.youtube.com/watch?v=gFrMcRqNH90
It's an entirely different approach than what the RISC-V folks are pushing for. It's great that this guy is working with them on the vector instructions, but I'm afraid it's too soon to claim a "right" way to go.
It's also not fair to compare instructions executed between SIMD and some huge vector register implementation. Most common RISC-V CPUs are likely to have smaller vector register from 256 to 512 bits wide.
I watched that presentation a while ago, and while the figures that are shown look nice, I suspect the crux is that I'm not sure whether MXP is practically implementable? I'm not at all an expert on this topic, so take this with a large grain of salt. Anyway:
1) With MXP instead of a vector register file you have a scratchpad memory, i.e. a chunk of byte-addressable memory in the CPU. Now, if you want multiple vector ALU's (lanes), that scratchpad then needs to be multi-ported, which quickly starts to eat up a lot of area and power. In contrast, a vector regfile can be split into single-ported per lane chunks, saving power and area.
2) MXP seems to be dependent on these shuffling engines to align the data and feed to the ALU's. What's the overhead of these? Seems far from trivial?
As for other potential models, I have to admit I'm not entirely convinced by their dismissal of the SIMT style model. Sure, it needs a bit more micro-architectural state, a program counter per vector lane, basically. But there's also a certain simplicity in the model, no need for separate vector instructions, for the really basic stuff you need only the fork/join type instructions to switch from scalar execution to SIMT and back. And there's no denying that SIMT has been an extremely successful programming model in the GPU world.
> It's also not fair to compare instructions executed between SIMD and some huge vector register implementation. Most common RISC-V CPUs are likely to have smaller vector register from 256 to 512 bits wide.
True; the more interesting parts is the overhead stuff. Does your ISA require vector load/stores to be aligned one a vector-size boundary? Well, then when vectorizing a loop you need a scalar pre-loop to handle the first elements until you hit the right alignment and can use the vectorized stuff. Similarly, how do you handle the tail of the loop if the number of elements is not a multiple of the vector length? If you don't have a vector length register or such you need a separate tail loop. Or is the data in memory contiguous? Without scatter-gather and strided load/store you have to choose between not vectorizing or packing the data.
That bloats the code and is one of the reasons why autovectorizing for SIMD ISA's is difficult for compilers, as often the compiler doesn't know how many iterations a loop will be executed, and due to the above a large number of iterations are necessary to amortize the overhead. With a "proper" vector ISA the overhead is very small and it's profitable to vectorize all loops the compiler is able to.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc....
So does this argument boil down to an inversion of control which in turn removes unnecessary instructions? It certainly sounds more elegant to my naive ISA understanding.
Can I ask, someone with hands on SIMD experience: does relinquishing control over exactly what and how many "vector" operations occur in a single clock make any real world difference?
https://en.wikipedia.org/wiki/AltiVec
I have a computer engineering degree from UIUC and my very first thought upon seeing MMX/SSE/Altivec/etc was "why didn't they make this arbitrarily scalable?" I was excited to be able to perform multiple computations at once, but the implementation seemed really bizarre to me (hardcoding various arithmetic operations in 128 bits or whatnot).
If it had been up to me, I would have probably added an instruction that executed a certain number of other instructions as a single block and let the runtime or CPU/cluster divvy them up to its cores/registers in microcode internally.
It turns out that something like this is conceptually what happens in vector languages like MATLAB (and Octave, Scilab, etc), which I first used around 2005. It's implementation is not terribly optimized, but in practice it doesn't need to be, because all personal computers since the mid 1990s are limited by memory bandwidth, not processing power.
For what it's worth, we're seeing similar ideas in things like graphics shaders, where the user writes a loop that appears to be serial and synchronous, but is parallelized by the runtime. I'm saddened that they had to evolve via graphics cards with their unfortunate memory segmentation (inspired by DOS?) but IMHO the future of programming will look like general-purpose shaders that abstract away caching so that eventually CPU memory, GPU memory, mass storage, networks, even the whole internet looks like a software-transactional memory or content-addressable memory.
We'll also ditch frictional abstractions like asynchronous promises in favor of something like the Actor model from Erlang/Go or a data graph that is lazily evaluated as each dependency is satisfied so it can be treated as a single synchronous serial computation. I've never found a satisfactory name for that last abstraction, so if someone knows what it's called, please let us know thanks!
P.S. the point of all this is to provide an efficient transform between functional programming and imperative programming so we can begin dealing in abstractions and stop prematurely optimizing our programs (which limits them to running on specific operating systems or hardware configurations).