SIMD Instructions Considered Harmful (2017)

(sigarch.org)

152 points · nuclx · 7y ago · 118 comments

69 comments · 17 top-level

glangdale7y ago· 13 in thread

This argument is less effective given that SIMD is not always a straightforward substitute for vector processing. Sometimes we want 128, 256 or 512 bits of processing as a unit and will follow it up with something different, not a repeated instance of that same process.

We had numerous different examples of this in the Hyperscan project and I broke out something similar on my blog: https://branchfree.org/2018/05/30/smh-the-swiss-army-chainsa...

We also used SIMD quite extensively as a 'wider GPR' - not doing stuff over tons of input characters but instead using the superior size of SIMD registers to implement things like bitwise string and NFA matchers.

A SIMD instruction can be a reasonable proxy for a wide vector processor but the reverse is not true - a specialized vector architecture is unlikely to be very helpful for this kind of 'mixed' SIMD processing. Almost any "argument from DAXPY" fails for the much richer uses of SIMD processing among active practitioners using modern SIMD.
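
To make the "wider GPR" idea concrete, here is a minimal sketch of a Shift-And (bitap) matcher, assuming a Python integer masked to 128 bits as a stand-in for one SIMD register. The names build_masks and shift_and_search are made up for illustration; this is not Hyperscan's actual code, just the general shape of a bitwise NFA matcher whose whole state lives in one wide register.

```python
def build_masks(pattern: bytes) -> dict[int, int]:
    """For each byte value, build a bitmask of the pattern positions holding it."""
    masks: dict[int, int] = {}
    for i, b in enumerate(pattern):
        masks[b] = masks.get(b, 0) | (1 << i)
    return masks


def shift_and_search(pattern: bytes, text: bytes, width: int = 128) -> list[int]:
    """Return the end offsets (exclusive) of every match of pattern in text.

    The whole NFA state is one `width`-bit value, standing in for a single
    SIMD register used as a wide general-purpose register (an assumption
    made for this sketch).
    """
    if not 0 < len(pattern) <= width:
        raise ValueError("pattern must fit in one register")
    masks = build_masks(pattern)
    reg_mask = (1 << width) - 1
    accept = 1 << (len(pattern) - 1)
    state = 0
    hits = []
    for pos, b in enumerate(text):
        # Advance every active state by one position, start a new attempt
        # at bit 0, then keep only states whose pattern byte matches b.
        state = (((state << 1) | 1) & masks.get(b, 0)) & reg_mask
        if state & accept:
            hits.append(pos + 1)
    return hits
```

Note that each step is one shift, one OR and one AND over the full register, followed by something different (the accept test), rather than the same operation repeated over a long array.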

brandmeyer7y ago

I went on a bit of a research expedition to see if there was something that scaled better than a general permutation instruction for SIMD machines. General permute scales at O(log(N)) in gate delay+ and O(N^2 * log(N)) in area, where N is the vector length. It's a full crossbar, but the fanout on the wires adds an additional log(N) factor in buffers.

For a while, it seemed like a set of instructions based on a rank-2 CLOS network (aka butterfly network) would get the job done. It scales at O(log(N)) in gate delay+ and O(N * log(N)) in area, and is very capable. Fanout is O(1). You can do all kinds of inter- and intra-lane swaps, rotations, and shifts with it. You can even do things like expand packed 16-bit RGB into 32-bit uniform RGBA.

But things like that SMH algorithm are definitely out of scope: Each input bit can only appear in at most one output location with the butterfly. So the cost to replicate a prefix scales at O(repetitions), which is unfortunate. Some algorithms based on using general shuffle are also relying on the use of PSHUFB's functionality as a complete lookup table, which the butterfly network can't do, either.
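
The difference between a pure permutation and PSHUFB-as-lookup-table can be sketched in a few lines. This is a byte-level emulation of 128-bit PSHUFB semantics (output byte is zero if the index byte has bit 7 set, otherwise the table byte selected by the low 4 bits); the helper names and the nibble-popcount table are assumptions chosen for illustration, not anything taken from the SMH post.

```python
def pshufb(table: bytes, idx: bytes) -> bytes:
    """Emulate 128-bit PSHUFB: out[i] = 0 if idx[i] & 0x80 else table[idx[i] & 0x0F]."""
    assert len(table) == 16 and len(idx) == 16
    return bytes(0 if i & 0x80 else table[i & 0x0F] for i in idx)


def uses_each_source_at_most_once(idx: bytes) -> bool:
    """True if no source byte is read by more than one output byte.

    This is the restriction the comment attributes to the butterfly network.
    """
    lanes = [i & 0x0F for i in idx if not i & 0x80]
    return len(lanes) == len(set(lanes))


# Hypothetical example table: number of set bits in each 4-bit value.
NIBBLE_POPCOUNT = bytes(bin(n).count("1") for n in range(16))


def popcount_bytes(data: bytes) -> bytes:
    """Per-byte popcount of 16 bytes using two table lookups (low and high nibble)."""
    lo_idx = bytes(b & 0x0F for b in data)
    hi_idx = bytes(b >> 4 for b in data)
    lo = pshufb(NIBBLE_POPCOUNT, lo_idx)
    hi = pshufb(NIBBLE_POPCOUNT, hi_idx)
    return bytes(l + h for l, h in zip(lo, hi))
```

In the lookup-table use, the index vector is data-dependent and routinely reads the same table byte many times, so uses_each_source_at_most_once is false for almost any real input.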

My conclusion was that you're basically stuck with a general permute instruction for a modern SIMD ISA, scaling be damned.

+ The latency scaling is somewhat misleading thanks to wire delay - they are both O(N) in wire delay.

arundemeure7y ago

I completely agree in principle (some applications are a good fit for traditional long vector processing, but it's a terrible fit for others).

However the RISC-V Vector Extensions basically let you use the processor in "SIMD mode" by setting the vector length to a small value. It will depend on the processor architecture whether a vector length of e.g. 4 is efficient or not, but I expect for many implementations it will be relatively efficient (and you can definitely just use it as a "wider GPR" if you want to).

The only catch at the ISA level is it costs an instruction to change the vector size. So if you keep switching between e.g. 128-bit and 512-bit instructions at a very fine granularity, that might add overhead... I'm not sure that's a very common case though?

creato7y ago

I agree. It seems like this strategy makes only the most braindead applications of SIMD better (simple loops that can be vectorized by an arbitrary factor), but doesn't really do anything to help the meatier SIMD workloads. Most SIMD code isn't as simple as this, and the code that is usually isn't a significant factor in either developer experience or runtime.

glangdale7y ago

Seriously. A lot of these proposals go veering off into second-order considerations ("Easier to decode!" "A few picojoules less energy"), when I'd be very surprised if the bottlenecks are going to come from SIMD vs vector architecture ISA issues, as compared to, say, memory bandwidth or multiply-add bandwidth.

atq21197y ago

Agreed, but then again, the vector architectures have the undeniable benefit of being agnostic to the underlying hardware.

With a SIMD architecture, as registers get ever wider you need to recompile your code, and binary releases require multiple code paths.

With a vector architecture, the hardware designers can just increase the machine's internal vector size and existing code will benefit immediately.

Furthermore, it seems reasonable to count GPU users among "active practitioners of modern SIMD". All code that is written in a GPU-style -- which admittedly isn't too common yet on CPUs, but it would certainly be feasible and is possible since ispc came along -- immediately becomes a beneficiary of a well-designed vector instruction architecture.

skybrian7y ago

Yes, recompiling can be a pain, but there's an argument that we should get used to recompiling things from source and make sure this path is smooth, rather than trying to make old binaries work as long as possible. Static binaries are another reason.

lallysingh7y ago

Sounds like you're getting fixated on the example. Short vectors would do what you want. And it's a lot easier to add specific vector instructions for particular use cases than adding N of them for each use case (for N simd sizes).

Are you getting hung up on the term 'vector'? That doesn't assume you're just doing linear algebra.

bloomer7y ago

No, he's not getting hung up on the term vector. He is saying that SIMD is a fixed, if wider, width and that this is not equivalent to arbitrary width. For example, SIMD shuffle type instructions have no arbitrary length equivalent.

dbaupp7y ago

Yeah, other examples include image codecs, such as JPEG: the DCT performed on the 8x8 blocks can benefit from SIMD, but the lanes aren't independent at all (matrix transposes, various intra-block additions).

wbl7y ago

Do it across blocks and you can squeeze out more parallelism.

jabl7y ago

In a vector arch like the risc-v vector extension, if you want to process exactly 256 or whatever at a time, just set the configuration or vector length register and off you go with SIMD-style programming?

glangdale7y ago

It is very unlikely that such a configuration will perform remotely as fast as a native SIMD implementation, unless there is some truly heroic specialization going on under the hood. Obviously the work is still possible with vector ops, in the same way that it's still possible with scalar ops too. But will it be fast? My guess is no.

petermcneeley7y ago

Agreed. SIMD's true competitor is ironically the super-scalar architecture itself and its relative the VLIW.

NL8077y ago· 9 in thread

Roll my eyes every time I see a "Considered Harmful" headline.

As for SIMD, it's a huge benefit when used in the right context. I applied it to image processing and video compression algorithms in the past, with significant performance gains.

tom_mellior7y ago

> Roll my eyes every time I see a "Considered Harmful" headline.

Me too.

> As for SIMD, it's a huge benefit when used in the right context.

Sure, but as the article points out, that benefit could be even huger when used on a "proper" vector architecture with veeeery wide vector registers that do not also double as not-very-wide scalar registers. "The SIMD instructions execute 10 to 20 times more instructions than RV32V because each SIMD loop does only 2 or 4 elements instead of 64 in the vector case."

I think adding SIMD instructions to x86 was a good trade-off at the time, but I also think the authors are correct that new ISAs designed now are better off with a vector architecture like they propose. In the end it's apples vs. oranges because the two contexts are not comparable.

dkersten7y ago

I feel like even though full vector architecture might perform a lot better, the use cases may be much narrower than SIMD, especially on typical desktop or server (web applications etc, not scientific computing, deep learning or image processing -- many of which are already vectorised on GPUs) workloads. As others have mentioned, SIMD allows you to do a little bit of vectorisation in an otherwise non-vector workload, or use it for the wider registers or whatever. I don't know enough about it personally to be able to judge either way, though. I just know that I've attempted to vectorise some hobby game code for fun a few times and typically found it much harder to achieve than it first seemed, even though the data seemed trivially vectorisable at first. Perhaps that's just lack of experience.

lukego7y ago

I haven't written code for a vector architecture yet but, wow, they look so much nicer to program than SIMD on casual inspection.

kazinator7y ago

"Harmful" doesn't mean "so dead in the water that it doesn't actually have performance gains". Intel wouldn't have integrated SIMD if it didn't work at all; that's not the criticism. Rather, the claim is that the performance gains are not well justified by the technical debt they bring to the architecture. The performance gains are not as good as they could be with the vector approach, which has better dynamic and static code density, doesn't horribly proliferate the instruction set, and allows for tuning without recompilation of code to use different instructions.

electrograv7y ago

For those who don’t get the title reference: “[Programming Pattern] Considered Harmful” is a title meme that began in 1968, over fifty years ago, and is still going strong today!

It all started with the famous Edsger Dijkstra paper titled “Go To Statement Considered Harmful” [1], which led to an endless series of subsequent papers patterned after its title [2].

I agree with the sentiment that this title pattern is overused currently, to the point of being cliche — with perhaps a bit of presumptuousness as well, due to the implicit suggestion that the author’s claim of “Considered Harmful” will withstand the test of time as well as Dijkstra’s paper (though perhaps I’m reading too much into it, in this case).

[1] https://dl.acm.org/citation.cfm?doid=362929.362947

[2] https://en.m.wikipedia.org/wiki/Considered_harmful

DerekL7y ago

In your opinion, how does SIMD compare with “vector architectures” described in the article?

bartread7y ago

> Roll my eyes every time I see a "Considered Harmful" headline.

Same. Please, when you write a title for your article, be specific about what you think is wrong. "X Considered Harmful" is lazy writing and devalues any substantive argument you make because it's become a cliche to the point where it's almost anti-clickbait.

GUILTY ADMISSION: a related trap I started falling into when writing content for work (email, documents, whatever) and I was in a hurry for a subject line or title was "Thoughts on X". That's right, "thoughts": I communed with the creator and distilled them from on high. Yeah, ain't nobody got time to read that.

vermilingua7y ago

https://meyerweb.com/eric/comment/chech.html

pjscott7y ago

This is one of the few times when a Considered Harmful article deserves the title. Did you see the RISC-V assembly code? Holy shit! It's so much nicer than every other SIMD ISA I've ever used! So much easier to program, so much easier to write compilers for -- and it should also be easier for CPU designers to handle efficiently. What's not to like?

qwerty4561277y ago· 8 in thread

> IA-32 instruction set has grown from 80 to around 1400 instructions since 1978, largely fueled by SIMD.

Holy quack! I didn't even know there were 80 (feels too much already, I barely used a tiny portion when exercising in assembly), 1400 sounds really insane.

pdpi7y ago

This is more than a little misleading. There's 8 different opcodes for each of INC, DEC, ADD and SUB, 6 each for AND, OR, XOR. MOV alone is 28 different opcodes. All of these groups of opcodes represent the same basic operation, but each opcode varies on addressing modes, and types of arguments (e.g. there's a separate INC/DEC opcode for each register)

Much in the same way, AVX adds only 8 completely new instructions, but adds new 256-bit variants for many pre-existing SSE instructions. This generates enormous amounts of opcodes, without actually increasing complexity _that_ much.

mcguire7y ago

Are they new opcodes, or are they one opcode parameterized with a few bits of length?

CoolGuySteve7y ago

A bigger problem is the relative length of these new instructions. I've noticed some more recent AVX instructions spreading over 8 or 9 bytes in disassembly.

The problem with this is that when the processor stalls on an instruction fetch for the next cacheline of code, it just sits there idle for the entire time. This greatly elongates your tails when looking at performance in terms of latency percentiles.

It makes me wonder if Intel or AMD have investigated a MicroVAX-like trimming or compression of the opcodes so that the most common/useful codes fit in the fewest bytes. In particular it seems like SIMD lengths are inverted, the longest vectors should have shorter opcodes since they're more useful. It might even be worth deprecating MMX/128-bit SSE.

AMD64 came out in 2003, a new decoder might be appropriate by 2023.

londons_explore7y ago

We're past the time that a human needs to understand assembly instructions.

In the future, instructions will be designed by machine, for example by considering millions of permutations of possible instruction "combined add with shift with multiply by 8 and set the 6th bit", "double indirect program counter jump with offset 63", etc.

Each permutation will be added to various compilers and simulated by running benchmarks on complete architecture simulators to find out which new instruction adds the most to the power to die area to performance to code size tradeoff.

I predict there will be many more future instructions with 'fuzzy' effects which don't affect correctness, only performance. Eg. 'Set the branch predictor state to favor the next branch for the number of times in register EAX', or 'go start executing the following code speculatively, because you'll jump to it in a few hundred clock cycles and it would be handy to have it all pre-executed'.

cameronh907y ago

"We're past the time that a human needs to understand assembly instructions."

Until you're debugging broken compiler/JIT output, which I've had to do multiple times in the last year while using .NET Core.

richardwhiuk7y ago

Nah, that's just not true. Auto-vectorisation just isn't good enough at the moment.

You don't normally need to write assembly, but you do need to use compiler intrinsics, which map 1-1 with assembly.

jerf7y ago

Security may throw a wrench into that. Preventing Spectre et al in such an environment would be a challenge. Not mathematically insurmountable, but possibly unsurmountable with real humans and real economics.

tenebrisalietum7y ago

Itanium had a lot of instructions like that IIRC. But it has been a while since I read anything about that.

ajayjain7y ago· 4 in thread

In some recent work from my group [1], we reduce the complexity of keeping up with new SIMD ISAs by retargeting code between generations. For example, a compiler pass can take code written to target SSE2 (with intrinsics) and emit AVX-512 - it auto-vectorizes hand-vectorized code. With a more capable compiler, if the ISA grows in complexity, programmers and users of libraries get speedups without rewriting their code or relying on scalar auto-vectorization. However, the x86 ISA growth certainly pushed some complexity on us as compiler writers - we had to write a pass to retarget instructions!

[1] https://www.nextgenvec.org/#revec

jabl7y ago

Recently a patch was contributed to gcc that converts mmx intrinsics to sse. Also the gcc power target supports x86 vector intrinsics, converting them to the power equivalents.

It's not as ambitious as your approach though, more like a 1:1 translation and thus cannot take advantage of wider vectors.

glangdale7y ago

That patch primarily is there to avoid the pitfalls of MMX on modern architectures; it is gradually becoming deprecated. On SKX, operations that are available on both ports 0 and 1 for SSE or AVX are only available on port 0 for MMX. So code that uses MMX is getting half the throughput (which may or may not matter, but still).

wmu7y ago

Sorry for a non-constructive comment, just wanted to say your paper is great. :)

ajayjain7y ago

Thank you! :)

hohohmm7y ago· 4 in thread

Isn't the whole point of SIMD to be as similar to the original x86 instructions as possible, reusing as much of the existing CPU as possible? Otherwise you would have something like the PS3?

Sebb7677y ago

Yes and no. SIMD (Single Instruction Multiple Data) as a concept has nothing to do with x86, it's basically just the concept of vectorizing the code and is used on many platforms.

The x86 SIMD extensions such as SSE and AVX, on the other hand, aim to integrate that concept with x86 and are therefore pretty similar.

pjmlp7y ago

Not at all, SIMD is a concept used across all CPU architectures, including the PS3.

petermcneeley7y ago

It was primarily the memory architecture that made the PS3 unique.

CoolGuySteve7y ago

If you care about latency, a modern 8-or-more core x86 with its L1/L2 cache segmentation and penalized-but-shared L3 cache is almost as complex. It becomes even more complex if you use the CPU topology to make inferences about hyperthreading and shared caches, or need to deal with the shared FPU on older AMD processors.

My understanding is that the largest difference is that some of the Cell cores had different opcodes that meant you could schedule some threads on some cores but not any thread on any core.

chx7y ago· 3 in thread

There is probably a lot of merit in the advantages of vectors but it weakens the article to set them up as against SIMD when the presented facts are dubious at best:

> An architect partitions the existing 64-bit registers

> The IA-32 instruction set has grown from 80 to around 1400 instructions since 1978, largely fueled by SIMD.

Wait, what. IA-32 started in 1985 not 1978. It didn't have any existing 64 bit registers. It was called IA-32 because of the 32 bit registers, like EAX and EBX. And then looking at the 1986 reference manual https://css.csail.mit.edu/6.858/2014/readings/i386.pdf I count 96 instructions under 17.2.2.11. The IA-32 instruction set didn't grow much all these years, IA-64 did to the best of my knowledge but please let me know if I am wrong here. As for IA-64, I looked at https://www.intel.com/content/dam/www/public/us/en/documents... and it's hard to get an accurate count because some instructions are grouped together, it's either 627 or 996 (and I may have made a counting mistake given I started from a PDF, but it should be close) which is indeed very high but even our best attempt only finds a tenfold growth (and perhaps only a 6.5) instead of the 17.5 the article suggested.
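
For anyone who wants to check the ratios, here is the arithmetic as a small sketch. The instruction counts are the ones quoted in this comment and in the article; the growth helper is just a made-up name for the division.

```python
def growth(start_count: int, end_count: int) -> float:
    """Growth factor of an instruction set from start_count to end_count."""
    return end_count / start_count


article_claim = growth(80, 1400)   # article: 80 -> ~1400 instructions
low_estimate = growth(96, 627)     # comment: 96 -> 627 (grouped count)
high_estimate = growth(96, 996)    # comment: 96 -> 996 (ungrouped count)

print(f"article: {article_claim:.1f}x")   # 17.5x
print(f"low:     {low_estimate:.1f}x")    # 6.5x
print(f"high:    {high_estimate:.1f}x")   # 10.4x
```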

CalChris7y ago

Small nit. IA-64 refers to Itanium. I think you meant Intel 64.

https://en.wikipedia.org/wiki/IA-64

chx7y ago

You are correct.

Iwan-Zotow7y ago

well, he might be counting AMD and things like 3DNow!, which is now defunct but was (another) legit extension to IA-32

DarkWiiPlayer7y ago· 3 in thread

So, if I understand it correctly, the text argues in favor of the GPU approach of pipelining independent vector operations instead of the current SIMD approach.

I see how this could be beneficial, especially when writing code, as it's way closer to just a normal loop.

Then again, why not combine both ideas and pipeline chunks of SIMD type? Say we have 4 execution stages and 32bit SIMD types (unrealistic, I know) and want to process 8-bit numbers. Wouldn't we be able to process 16 of them at the same time? Actually, isn't that kind of what GPUs already do?

I'm sure smarter people than I have reasoned about this, maybe someone can link a good article. I only know of this one [1] and one about GPGPU that I just can't find any more (but which was also very interesting)

[1] http://www.lighterra.com/papers/modernmicroprocessors/

arundemeure7y ago

By coincidence I started a new blog a few days ago and my first article is about the SIMD Instructions Considered Harmful post from a power efficiency perspective... maybe I should post it separately on HN? :)

https://massivebottleneck.com/2019/02/17/vector-vs-simd-dyna...

I think I'm kinda explaining how it's similar (and different) to what modern GPUs do but I'm not sure I understand what you mean by "wouldn't we be able to process 16 of them at the same time" - do you mean a throughput of 16/clock, or just that 16 are "in flight" through the pipeline with a throughput of 4/clock?

I'm not sure I'm clear enough about it for those without a GPU HW background. If it's not clear I'm happy to write down a more detailed explanation here!

dragontamer7y ago

> GPU approach of pipelining independent vector operations

I disagree. GPUs have a fixed vector width. AMD GPUs have 64x32-bit vectors, while NVidia GPUs have 32x32-bit vectors.

What's being discussed here is a variable-length vector being supported on the hardware level, which is very, very different than how GPUs work.

vardump7y ago

> Wouldn't we be able to process 16 of them at the same time?

Yes you would if you had 4 execution ports available and no data dependencies. Of course, those execution ports could also be processing 256 bit wide SIMD registers instead of just 32 bits. So it's a bad idea.

Instruction count is also higher, which is never a good thing.

> Actually, isn't that kind of what GPUs already do?

No.

etaioinshrdlu7y ago· 3 in thread

Maybe CPU architectures should just have data-parallel loop support of arbitrary width. The CPU can implement it in microcode however it feels like, or perhaps a kernel can trap it and send it off to a GPU transparently.

Strikes me as much cleaner design-wise than stuff like CUDA or openCL or SIMD of today.

vnorilo7y ago

This sounds quite optimistic. How would microcode deal with allocating registers, or nested data parallelism? You are describing transformations that usually happen fairly early in compiler optimization pipelines, and pushing that down to microcode would bring huge complexity.

magicalhippo7y ago

IIRC the Mill CPU handles this by performing a translation at install time.

For Mill CPU variants with wide vector units the CPU could execute certain instructions in one go, while for variants with narrow units it might have to issue multiple instructions.

Their idea is to handle this by basically doing ahead of time compilation of a generic program image, turning it into a specialized version for the installed CPU.

Sounds neat, proof is in the pudding.

mattnewport7y ago

Modern x86 processors already do a lot of register renaming and speculative and out of order execution. Much of the huge complexity you're worried about already exists in modern CPUs in order to track and eliminate false data dependencies and to keep the CPU busy in the face of data hazards.

Symmetry7y ago· 2 in thread

Comparing dynamic instructions between a SIMD architecture with a 32 byte vector width versus a vector architecture with 8*64=512 byte vectors is laughably misleading. Of course you can use fewer instructions if you're willing to throw hugely more transistors at the problem and carry around so much more architectural state.

There are reasons to prefer SIMD or vector machines or, put another way, packed or unpacked vectors. But this is a very one-sided presentation. Also, some SIMD ISAs like Arm's SVE can handle different widths pretty nicely.

jabl7y ago

I'd guess in the classification of the authors, SVE would qualify as a "real" vector ISA. SVE resembles the risc-v vector extension quite a lot.

Dylan168077y ago

Keep in mind this was after configuring it to have exactly two vectors active. There might only be 1KB of state in there. That's halfway between AVX and AVX-512, so it doesn't strike me as particularly biased.
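
A rough sketch of the state-size arithmetic, assuming the vector register counts for 64-bit mode (16 YMM registers of 32 bytes for AVX, 32 ZMM registers of 64 bytes for AVX-512, mask registers ignored) and the 8*64 = 512 byte vectors from the parent comment. The helper name is made up for illustration.

```python
def register_file_bytes(num_registers: int, bytes_per_register: int) -> int:
    """Architectural vector state in bytes for a given register file shape."""
    return num_registers * bytes_per_register


avx_state = register_file_bytes(16, 32)           # 512 bytes
avx512_state = register_file_bytes(32, 64)        # 2048 bytes
two_vector_state = register_file_bytes(2, 8 * 64) # 1024 bytes

# 1 KB is 2x AVX and half of AVX-512, i.e. the midpoint on a log scale.
print(avx_state, two_vector_state, avx512_state)  # 512 1024 2048
```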

phkahler7y ago· 1 in thread

And yet the current RISC-V approach is not as good as MXP:

https://www.youtube.com/watch?v=gFrMcRqNH90

It's an entirely different approach than what the RISC-V folks are pushing for. It's great that this guy is working with them on the vector instructions, but I'm afraid it's too soon to claim a "right" way to go.

It's also not fair to compare instructions executed between SIMD and some huge vector register implementation. Most common RISC-V CPUs are likely to have smaller vector registers from 256 to 512 bits wide.

jabl7y ago

> And yet the current RISC-V approach is not as good as MXP

I watched that presentation a while ago, and while the figures that are shown look nice, I suspect the crux is that I'm not sure whether MXP is practically implementable? I'm not at all an expert on this topic, so take this with a large grain of salt. Anyway:

1) With MXP instead of a vector register file you have a scratchpad memory, i.e. a chunk of byte-addressable memory in the CPU. Now, if you want multiple vector ALU's (lanes), that scratchpad then needs to be multi-ported, which quickly starts to eat up a lot of area and power. In contrast, a vector regfile can be split into single-ported per lane chunks, saving power and area.

2) MXP seems to be dependent on these shuffling engines to align the data and feed to the ALU's. What's the overhead of these? Seems far from trivial?

As for other potential models, I have to admit I'm not entirely convinced by their dismissal of the SIMT style model. Sure, it needs a bit more micro-architectural state, a program counter per vector lane, basically. But there's also a certain simplicity in the model, no need for separate vector instructions, for the really basic stuff you need only the fork/join type instructions to switch from scalar execution to SIMT and back. And there's no denying that SIMT has been an extremely successful programming model in the GPU world.

> It's also not fair to compare instructions executed between SIMD and some huge vector register implementation. Most common RISC-V CPUs are likely to have smaller vector registers from 256 to 512 bits wide.

True; the more interesting part is the overhead stuff. Does your ISA require vector load/stores to be aligned on a vector-size boundary? Well, then when vectorizing a loop you need a scalar pre-loop to handle the first elements until you hit the right alignment and can use the vectorized stuff. Similarly, how do you handle the tail of the loop if the number of elements is not a multiple of the vector length? If you don't have a vector length register or such you need a separate tail loop. Or is the data in memory contiguous? Without scatter-gather and strided load/store you have to choose between not vectorizing or packing the data.

That bloats the code and is one of the reasons why autovectorizing for SIMD ISA's is difficult for compilers, as often the compiler doesn't know how many iterations a loop will be executed, and due to the above a large number of iterations are necessary to amortize the overhead. With a "proper" vector ISA the overhead is very small and it's profitable to vectorize all loops the compiler is able to.
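
The tail-loop point can be sketched with DAXPY (y = a*x + y) in plain Python. This only models the loop structure, not real hardware: each list-slice update stands in for one vector instruction, setvl is a hypothetical stand-in for a set-vector-length instruction (loosely in the spirit of RISC-V's vsetvli), and the alignment pre-loop mentioned above is left out.

```python
def daxpy_fixed_simd(a, x, y, lanes=4):
    """Fixed-width SIMD style: full-width main loop plus a separate scalar tail loop."""
    n = len(x)
    out = list(y)
    i = 0
    # Main loop: only runs while a full group of `lanes` elements remains.
    while i + lanes <= n:
        out[i:i + lanes] = [a * xi + yi
                            for xi, yi in zip(x[i:i + lanes], y[i:i + lanes])]
        i += lanes
    # Tail loop: leftover elements handled one at a time.
    while i < n:
        out[i] = a * x[i] + y[i]
        i += 1
    return out


def setvl(remaining, max_vl):
    """Hypothetical set-vector-length: process min(remaining, hardware maximum)."""
    return min(remaining, max_vl)


def daxpy_vector(a, x, y, max_vl=64):
    """Vector-length-register style: one strip-mined loop, no separate tail."""
    n = len(x)
    out = list(y)
    i = 0
    while i < n:
        vl = setvl(n - i, max_vl)  # the last iteration simply gets a shorter vl
        out[i:i + vl] = [a * xi + yi
                         for xi, yi in zip(x[i:i + vl], y[i:i + vl])]
        i += vl
    return out
```

The second version has a single loop body regardless of n, and the same code gives the same result whether max_vl is 4 or 64, which is the "no recompilation" argument made elsewhere in the thread.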

CalChris7y ago· 1 in thread

Why doesn't this comparison include ARMv8 and ARMv8 NEON? ARMv8 NEON does support double precision and that can help DAXPY. I believe this has been the case since 2011 when AArch64 was announced (well, at least 2015).

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc....

wmu7y ago

Well, there is also no AVX512 mentioned.

auggierose7y ago· 1 in thread

I feel if you have a strong need for vectors then you should consider running (part of) your code on the GPU.

_chris_7y ago

Too far away and requires huge thread counts to make up for its overheads. A good vector unit should work well even for very short vectors (e.g., any size of memcpy).

yxhuvud7y ago

The discussion in the comments beneath the article was more interesting than the article itself.

hsivonen7y ago

Prediction: In order to be performance-competitive with other ISAs for software written for SIMD, RISC-V will get a SIMD extension. However, because it wasn't there from the start, Linux distros will not compile their packages with SIMD enabled and the result will be sad like NEON on 32-bit ARM.

tomxor7y ago

> While a simple vector processor might execute one vector element at a time, element operations are independent by definition, and so a processor could theoretically compute all of them simultaneously. The widest data for RISC-V is 64 bits, and today’s vector processors typically execute two, four, or eight 64-bit elements per clock cycle.

So does this argument boil down to an inversion of control which in turn removes unnecessary instructions? It certainly sounds more elegant to my naive ISA understanding.

Can I ask, someone with hands on SIMD experience: does relinquishing control over exactly what and how many "vector" operations occur in a single clock make any real world difference?

zackmorris7y ago

I've been saying this for nearly 20 years. My first experience with it was Altivec on PowerPC:

https://en.wikipedia.org/wiki/AltiVec

I have a computer engineering degree from UIUC and my very first thought upon seeing MMX/SSE/Altivec/etc was "why didn't they make this arbitrarily scalable?" I was excited to be able to perform multiple computations at once, but the implementation seemed really bizarre to me (hardcoding various arithmetic operations in 128 bits or whatnot).

If it had been up to me, I would have probably added an instruction that executed a certain number of other instructions as a single block and let the runtime or CPU/cluster divvy them up to its cores/registers in microcode internally.

It turns out that something like this is conceptually what happens in vector languages like MATLAB (and Octave, Scilab, etc), which I first used around 2005. Its implementation is not terribly optimized, but in practice it doesn't need to be, because all personal computers since the mid 1990s are limited by memory bandwidth, not processing power.

For what it's worth, we're seeing similar ideas in things like graphics shaders, where the user writes a loop that appears to be serial and synchronous, but is parallelized by the runtime. I'm saddened that they had to evolve via graphics cards with their unfortunate memory segmentation (inspired by DOS?) but IMHO the future of programming will look like general-purpose shaders that abstract away caching so that eventually CPU memory, GPU memory, mass storage, networks, even the whole internet looks like a software-transactional memory or content-addressable memory.

We'll also ditch frictional abstractions like asynchronous promises in favor of something like the Actor model from Erlang/Go or a data graph that is lazily evaluated as each dependency is satisfied so it can be treated as a single synchronous serial computation. I've never found a satisfactory name for that last abstraction, so if someone knows what it's called, please let us know thanks!

P.S. the point of all this is to provide an efficient transform between functional programming and imperative programming so we can begin dealing in abstractions and stop prematurely optimizing our programs (which limits them to running on specific operating systems or hardware configurations).

etjossem7y ago

Considered Harmful Clickbait Considered Harmful (2019)

j / k navigate · click thread line to collapse

118 comments

69 comments · 17 top-level

glangdale7y ago· 13 in thread

We had numerous different examples of this in the Hyperscan project and I broke out something similar on my blog: https://branchfree.org/2018/05/30/smh-the-swiss-army-chainsa...

brandmeyer7y ago

My conclusion was that you're basically stuck with a general permute instruction for a modern SIMD ISA, scaling be damned.

+ The latency scaling is somewhat misleading thanks to wire delay - they are both O(N) in wire delay.

arundemeure7y ago

I completely agree in principle (some applications are good fit for traditional long vector processing, but it's a terrible fit for others).

creato7y ago

glangdale7y ago

1 more reply

atq21197y ago

Agreed, but then again, the vector architectures have the undeniable benefit of being agnostic to the underlying hardware.

With a SIMD architecture, as registers get ever wider you need to recompile your code, and binary releases require multiple code paths.

With a vector architecture, the hardware designers can just increase the machine's internal vector size and existing code will benefit immediately.
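A rough way to picture the difference is a strip-mined loop in which the code asks the hardware how many elements it will take on each pass. The sketch below models that in plain C. `set_vl` and `HW_MAX_VL` are hypothetical stand-ins for a "set vector length" instruction and the machine's internal vector size; they are not any real ISA's API, and the inner loop merely models what would be a single vector instruction.

```c
#include <stddef.h>

/* Hypothetical hardware limit: how many elements one vector operation
   can handle. On a real vector machine this is a property of the chip,
   not of the binary; it is a constant here only to make the sketch run. */
#define HW_MAX_VL 8

/* Hypothetical stand-in for a "set vector length" instruction:
   returns how many of the remaining elements the hardware will take. */
static size_t set_vl(size_t remaining)
{
    return remaining < HW_MAX_VL ? remaining : HW_MAX_VL;
}

/* y[i] = a * x[i] + y[i], strip-mined in chunks chosen by the "hardware". */
void daxpy_vla(size_t n, double a, const double *x, double *y)
{
    while (n > 0) {
        size_t vl = set_vl(n);
        for (size_t i = 0; i < vl; i++)  /* models one vector instruction */
            y[i] = a * x[i] + y[i];
        x += vl;
        y += vl;
        n -= vl;
    }
}
```

Nothing in `daxpy_vla` depends on the value of `HW_MAX_VL`, and the final short chunk needs no separate tail loop. A fixed-width SIMD version would instead bake the register width into the loop body, which is why wider registers mean a recompile or an extra code path.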

skybrian7y ago

1 more reply

lallysingh7y ago

Are you getting hung up on the term 'vector'? That doesn't assume you're just doing linear algebra.

bloomer7y ago

1 more reply

dbaupp7y ago

wbl7y ago

Do it across blocks and you can squeeze out more parallelism.

2 more replies

jabl7y ago

glangdale7y ago

1 more reply

petermcneeley7y ago

Agreed. SIMD's true competitor is ironically the super-scalar architecture itself and its relative the VLIW.

NL8077y ago· 9 in thread

Roll my eyes every time I see a "Considered Harmful" headline.

As for SIMD, it's a huge benefit when used in the right context. I applied it to image processing and video compression algorithms in the past, with significant performance gains.

tom_mellior7y ago

> Roll my eyes every time I see a "Considered Harmful" headline.

Me too.

> As for SIMD, it's a huge benefit when used in the right context.

dkersten7y ago

lukego7y ago

I haven't written code for a vector architecture yet but, wow, they look so much nicer to program than SIMD on casual inspection.

kazinator7y ago

electrograv7y ago

For those who don’t get the title reference: “[Programming Pattern] Considered Harmful” is a title meme that began in 1968 (over fifty years ago) and is still going strong today!

It all started with the famous Edsger Dijkstra paper titled “Go To Statement Considered Harmful” [1], which led to an endless series of subsequent papers patterned after its title [2].

[1] https://dl.acm.org/citation.cfm?doid=362929.362947

[2] https://en.m.wikipedia.org/wiki/Considered_harmful

DerekL7y ago

In your opinion, how does SIMD compare with “vector architectures” described in the article?

bartread7y ago

> Roll my eyes every time I see a "Considered Harmful" headline.

vermilingua7y ago

https://meyerweb.com/eric/comment/chech.html

pjscott7y ago

1 more reply

qwerty4561277y ago· 8 in thread

> IA-32 instruction set has grown from 80 to around 1400 instructions since 1978, largely fueled by SIMD.

Holy quack! I didn't even know there were 80 (feels too much already, I barely used a tiny portion when exercising in assembly), 1400 sounds really insane.

pdpi7y ago

mcguire7y ago

Are they new opcodes, or are they one opcode parameterized with a few bits of length?

2 more replies

CoolGuySteve7y ago

A bigger problem is the relative length of these new instructions. I've noticed some more recent AVX instructions spreading over 8 or 9 bytes in disassembly.

AMD64 came out in 2003, a new decoder might be appropriate by 2023.

londons_explore7y ago

We're past the time that a human needs to understand assembly instructions.

cameronh907y ago

"We're past the time that a human needs to understand assembly instructions."

Until you're debugging broken compiler/JIT output, which I've had to do multiple times in the last year while using .NET Core.

1 more reply

richardwhiuk7y ago

Nah, that's just not true. Auto-vectorisation just isn't good enough at the moment.

You don't normally need to write assembly, but you do need to use compiler intrinsics, which map 1-1 with assembly.
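As a small illustration of that near one-to-one mapping, here is a sketch using SSE2 intrinsics from `<emmintrin.h>`. The function name `add4_i32` is made up for this example, and the instruction names in the comments are what a compiler typically emits; it may choose VEX-encoded forms or fold the loads into the add instead.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Adds four 32-bit integers lane by lane: out[i] = a[i] + b[i]. */
void add4_i32(const int32_t *a, const int32_t *b, int32_t *out)
{
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* typically MOVDQU */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);  /* typically MOVDQU */
    __m128i vs = _mm_add_epi32(va, vb);                /* typically PADDD  */
    _mm_storeu_si128((__m128i *)out, vs);              /* typically MOVDQU */
#else
    /* Portable fallback so the sketch still builds on non-x86 targets. */
    for (int i = 0; i < 4; i++)
        out[i] = a[i] + b[i];
#endif
}
```

The 128-bit width is fixed in the source, so moving to wider registers means switching to a different set of intrinsics and types.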

jerf7y ago

tenebrisalietum7y ago

Itanium had a lot of instructions like that IIRC. But it has been a while since I read anything about that.

ajayjain7y ago· 4 in thread

[1] https://www.nextgenvec.org/#revec

jabl7y ago

Recently a patch was contributed to gcc that converts mmx intrinsics to sse. Also the gcc power target supports x86 vector intrinsics, converting them to the power equivalents.

It's not as ambitious as your approach though, more like a 1:1 translation and thus cannot take advantage of wider vectors.

glangdale7y ago

1 more reply

wmu7y ago

Sorry for a non-constructive comment, just wanted to say your paper is great. :)

ajayjain7y ago

Thank you! :)

hohohmm7y ago· 4 in thread

Isn't the whole point of SIMD to be as similar to the original x86 instructions as possible, reusing as much of the existing CPU as possible? Otherwise you would have something like the PS3?

Sebb7677y ago

Yes and no. SIMD (Single Instruction Multiple Data) as a concept has nothing to do with x86, it's basically just the concept of vectorizing the code and is used on many platforms.

The x86 SIMD extensions such as SSE and AVX, on the other hand, aim to integrate that concept with x86 and are therefore pretty similar.

pjmlp7y ago

Not at all, SIMD is a concept used across all CPU architectures, including the PS3.

petermcneeley7y ago

It was primarily the memory architecture that made the PS3 unique.

CoolGuySteve7y ago

My understanding is that the largest difference is that some of the Cell cores had different opcodes that meant you could schedule some threads on some cores but not any thread on any core.

1 more reply

chx7y ago· 3 in thread

There is probably a lot of merit in the advantages of vectors but it weakens the article to set them up as against SIMD when the presented facts are dubious at best:

> An architect partitions the existing 64-bit registers

> The IA-32 instruction set has grown from 80 to around 1400 instructions since 1978, largely fueled by SIMD.

CalChris7y ago

Small nit. IA-64 refers to Itanium. I think you meant Intel 64.

https://en.wikipedia.org/wiki/IA-64

chx7y ago

You are correct.

Iwan-Zotow7y ago

well, he might be counting AMD and things like 3dnow! which are now defunct but were (another) legit extension to IA-32

DarkWiiPlayer7y ago· 3 in thread

So, if I understand it correctly, the text argues in favor of the GPU approach of pipelining independent vector operations instead of the current SIMD approach.

I see how this could be beneficial, especially when writing code, as it's way closer to just a normal loop.

[1] http://www.lighterra.com/papers/modernmicroprocessors/

arundemeure7y ago

https://massivebottleneck.com/2019/02/17/vector-vs-simd-dyna...

I'm not sure I'm clear enough about it for those without a GPU HW background. If it's not clear I'm happy to write down a more detailed explanation here!

dragontamer7y ago

> GPU approach of pipelining independent vector operations

I disagree. GPUs have a fixed vector width. AMD GPUs have 64x32-bit vectors, while NVidia GPUs have 32x32-bit vectors.

What's being discussed here is a variable-length vector being supported at the hardware level, which is very, very different from how GPUs work.

vardump7y ago

> Wouldn't we be able to process 16 of them at the same time?

Instruction count is also higher, which is never a good thing.

> Actually, isn't that kind of what GPUs already do?

No.

etaioinshrdlu7y ago· 3 in thread

Strikes me as much cleaner design-wise than stuff like CUDA or openCL or SIMD of today.

vnorilo7y ago

magicalhippo7y ago

IIRC the Mill CPU handles this by performing a translation at install time.

For Mill CPU variants with wide vector units the CPU could execute certain instructions in one go, while for variants with narrow units it might have to issue multiple instructions.

Their idea is to handle this by basically doing ahead of time compilation of a generic program image, turning it into a specialized version for the installed CPU.

Sounds neat, proof is in the pudding.

3 more replies

mattnewport7y ago

1 more reply

Symmetry7y ago· 2 in thread

jabl7y ago

I'd guess in the classification of the authors, SVE would qualify as a "real" vector ISA. SVE resembles the risc-v vector extension quite a lot.

Dylan168077y ago

phkahler7y ago· 1 in thread

And yet the current RISC-V approach is not as good as MXP:

https://www.youtube.com/watch?v=gFrMcRqNH90

jabl7y ago

> And yet the current RISC-V approach is not as good as MXP

2) MXP seems to be dependent on these shuffling engines to align the data and feed it to the ALUs. What's the overhead of these? Seems far from trivial?

CalChris7y ago· 1 in thread

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc....

wmu7y ago

Well, there is also no AVX512 mentioned.

auggierose7y ago· 1 in thread

I feel if you have a strong need for vectors then you should consider running (part of) your code on the GPU.

_chris_7y ago

Too far away and requires huge thread counts to make up for its overheads. A good vector unit should work well even for very short vectors (e.g., any size of memcpy).

yxhuvud7y ago

The discussion in the comments beneath the article was more interesting than the article itself.

hsivonen7y ago

tomxor7y ago

So does this argument boil down to an inversion of control which in turn removes unnecessary instructions? It certainly sounds more elegant to my naive ISA understanding.

Can I ask someone with hands-on SIMD experience: does relinquishing control over exactly what and how many "vector" operations occur in a single clock make any real-world difference?

zackmorris7y ago

I've been saying this for nearly 20 years. My first experience with it was Altivec on PowerPC:

https://en.wikipedia.org/wiki/AltiVec
