I read-ish the whole set of slides and it sounds pretty good (the devil is always in the details), but I got a little worried about VLIW-ish issues when [in the slides] he said on slide #57:
The compiler controls when ops issue
One of the big issues with VLIW was that the compiler had to be intimately aware of processor architecture. So when you upgraded your '886 to a '986 you needed new binaries because the '986 had more registers or execution units. [I assume Itanium fixed some of this, but it also sunk my interest in VLIW.] Is this architecture going to face the same issue?
Edit: I'm watching the video and heard that "nearly all of what the super scalar is doing is [not calculating]". One of the other VLIW issues was that chip area was dominated by cache area, so all the stuff about [not calculating] shrank and shrank relatively as cache area grew (see: http://techreport.com/review/15818/intel-core-i7-processors). This claim concerns me.
Edit V2: but damn... Exciting stuff.
All Mill family members have a common architecture that the middle end knows about. ME output (specializer input) is a complete CFG and DFG for the abstract Mill. The specializer first replaces operations that are not present on the target with calls to functions; on a Mill a function call is semantically equivalent to an op like an add.
There is no instruction-selection phase. The Mill in general has only one way to implement any given source-level action, and the correct (abstract) code has already been selected by the ME. The BE does instruction scheduling, introducing spill where necessary, using standard schedule algorithms long used for in-order machines and VLIWs. The result may be listed as assembler source or emitted as binary. The schedule algorithms have quadratic worst case but in practice are linear. The generated binaries are cached in the load module, so subsequent runs skip the specializer step.
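For anyone who hasn't seen the "standard" scheduling mentioned above, here is a minimal sketch of greedy list scheduling for an in-order machine with fixed latencies. The op names, latencies, and issue width are made up for illustration, and it assumes the dependency graph is acyclic; it is not the Mill specializer, just the general shape of the technique.

```python
def list_schedule(ops, deps, latency, width):
    """Greedy list scheduling for an in-order machine with fixed latencies.

    ops:     op names in priority order (hypothetical names)
    deps:    dict op -> list of ops whose results it consumes
    latency: dict op -> cycles until its result is available
    width:   max ops issued per cycle (assumed issue width)
    Returns dict op -> issue cycle. Assumes deps form a DAG.
    """
    issue = {}
    cycle = 0
    remaining = list(ops)
    while remaining:
        issued_now = []
        for op in remaining:
            if len(issued_now) == width:
                break
            # Ready when every producer has issued and its result has arrived.
            ready = all(
                d in issue and issue[d] + latency[d] <= cycle
                for d in deps.get(op, [])
            )
            if ready:
                issued_now.append(op)
        for op in issued_now:
            issue[op] = cycle
            remaining.remove(op)
        cycle += 1
    return issue


# Hypothetical five-op fragment: two loads feed a mul, then an add, then a store.
ops = ["load_a", "load_b", "mul", "add", "store"]
deps = {"mul": ["load_a", "load_b"], "add": ["mul", "load_a"], "store": ["add"]}
latency = {"load_a": 3, "load_b": 3, "mul": 2, "add": 1, "store": 1}

schedule = list_schedule(ops, deps, latency, width=2)
print(schedule)
```

With these made-up latencies the loads issue in cycle 0, the mul in cycle 3, the add in cycle 5, and the store in cycle 6. Each pass over the remaining ops is linear, which is where the quadratic worst case comes from.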
Current status is that work on the new (LLVM-based) compiler is on hold while we complete the patent filings, and the old one is out of date already. When the filings are in we hope to make the tool chain and sim available on-line for those who want to play. We could do that now with the assembler and sim, except that the asm instruction set exposes some things that the patents aren't in on yet. Grrr - patents!
But perhaps that "final optimization pass" would be nearly as hard as the whole compilation problem in the first place; dunno. I wasn't on the compiler team, so this is perhaps a naive viewpoint.
Pragmatically, exactly how fast is that run-time optimization? Could you realistically JIT it, or should the more-optimal, chip-specific asm be cached between loads? Or is this so slow you'd only ever want to do it once?
If there's a bad gate in the cache then the fab process simply uses one of the spare cells inside; the user never realizes that a spare has been used. If there's a bad gate in the core proper then that chip is gone, lowering the fab yield.
The same can be done at the level of other regular structures. Your two-core chip is really a four-core chip with a couple of bad cores. That's one of the reasons why the vendors are so eager to convince you that multicore is the Pearly Gates - it increases their effective yield.
Re cache area:
To a first approximation, power cost of a cache is constant independent of size, whereas core power is superlinear. Consequently the limiting factor for increasing cache size is latency, not power. See ootbcomp.com/docs/encoding for how the Mill doubles instruction cache size without increasing latency.
But the lack of latency hiding means you have to generate code to schedule starting operations at the right time. Mul taking 2 cycles means you want to start any dependent, lower-latency instructions only once its result is ready. But if at some point in the future mul is scaled down to one cycle for whatever reason, your timings will all be off and your belt occupancy will be less than ideal.
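To put rough numbers on that worry, here is a toy sketch. It computes earliest issue cycles for a three-op chain under an "old" latency table (mul = 2) and a "new" one (mul = 1), then compares running the stale schedule on the new hardware against re-scheduling. The op names and latencies are invented, issue width is ignored, and it assumes the old issue cycles would still be legal on the new hardware, which a real statically scheduled machine may not guarantee.

```python
def asap_schedule(ops, deps, latency):
    """Earliest issue cycle per op, ignoring issue-width limits.

    ops must be listed in topological order (producers before consumers).
    """
    issue = {}
    for op in ops:
        issue[op] = max(
            (issue[d] + latency[d] for d in deps.get(op, [])), default=0
        )
    return issue


def finish_time(issue, latency):
    """Cycle at which the last result becomes available."""
    return max(issue[op] + latency[op] for op in issue)


ops = ["mul", "add", "sub"]                 # hypothetical dependent chain
deps = {"add": ["mul"], "sub": ["add"]}
old_latency = {"mul": 2, "add": 1, "sub": 1}
new_latency = {"mul": 1, "add": 1, "sub": 1}

old_schedule = asap_schedule(ops, deps, old_latency)
new_schedule = asap_schedule(ops, deps, new_latency)

stale = finish_time(old_schedule, new_latency)   # old binary, new chip
fresh = finish_time(new_schedule, new_latency)   # re-scheduled for new chip
print(stale, fresh)
```

In this toy case the stale schedule finishes at cycle 4 and the re-scheduled one at cycle 3, so the cost of not re-scheduling is the one cycle the faster mul should have saved.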
I kind of wonder if the reality is simply that there is nothing better equipped to schedule instructions than the cpu itself. Maybe we could have better ways to explicitly hint meaningful information to it, but I'm not sure shoving all the decisions to the compiler is a long term solution.
Why don't you just recompile?
Some criticism however: no sign yet of how variable latencies in memory will be tolerated. Requiring fixed latency for FUs is problematic with cache misses and FU sharing between pipelines. Also his comparison between a 64-item belt and the 300+ registers of OoOE is unfair, since the 300+ registers will likely tolerate more latency than the smaller belt.
I wrote a review of what I get from his first two talks here: http://staff.science.uva.nl/~poss/posts/2013/08/01/mill-cpu/
I think Torben Mogensen posted the same basic idea of replacing registers with 'temporal addressing' to Usenet back in the 90s -- comp.arch? comp.compilers? Boy, it's been a long time.
It's quite a common and recurrent idea really. However promises / dataflow tokens / I-structures / etc all are subject to a common flaw / problem: when you receive multiple completions simultaneously, which of them are you going to schedule first? This choice is highly non-obvious and has tremendous impacts on data locality.
It really does seem like a Lisp would map much better, with the whole caller/callee and those private data belts that looked like hardware level closures!
That said, I got the sense that this was what Intel was going for when they did Larrabee [2] and just missed because of the focus on Graphics. Unlike Larrabee, I suspect OOTBC will need to build it themselves like Chip did for the Propeller [3].
That said, the challenge of these bespoke architectures is the requirement for software, or first a GCC port :-). I believe Ivan said they had a port that talked to their simulator, but I don't know if that was an optimizing thing like SGI's compiler for Itanium or a proof of concept thing.
The weird thing is of course "Why?", and one might say "But Chuck, faster and cheaper, why not?" and I look at the explosion of ARM SoC's (cheaper, not necessarily faster than x86) and look back at ARM and think 99% of this was getting the ecosystem built, not the computer architecture. So who can afford to invest the billions to build the eco-system? Who would risk that investment? (Google might but that is a different spin on things).
So playing around with the Zynq-7020 (same chip that is on the Parallella, but without the Epiphany co-processor) I can see a nice dual-core ARM Cortex-A9 where you have full speed access to a bespoke DSP if you want for the 'tricky' bits. Will that be "good enough" for the kinds of things this would also excel at? I don't know, so I don't know how to handicap OOTBC's chances for success. But I really enjoy novel computer architectures, like this one and Chuck Moore's '1000 forth chickens' chip [4] (it was a reference to Seymour Cray's quote, "Would you have 1,000 chickens pull your plow or an Ox?").
A really interesting time will be had when 'paid off' fab capacity is sitting idle and the cost for a wafer start becomes a function of how badly the fab wants to keep the line busy.
[1] http://en.wikipedia.org/wiki/Intel_iAPX_432
[2] http://en.wikipedia.org/wiki/Larrabee_(microarchitecture)#Di...
Of course, there's always the outside chance that the experiment worked so well that Intel is keeping an architecture inspired by the Larrabee research project in reserve for after it's gone as far as it can shrinking transistors and needs something new to sell.
[1] http://www.trustedreviews.com/opinions/intel-larrabee-an-int...
The different VLIW execution units can only process a subset of the instruction set. That means you need the right mix of instructions in your algorithm to take advantage of the full throughput. If you have any sort of serial dependency you won't be able to take advantage of all the execution slots. It basically excels at signal processing (and even then only a subset of it). That said, when you hit the sweet spot it's pretty good.
When someone like TI compares their DSP to ARM they usually tend to ignore SIMD (very conveniently). SIMD (NEON on ARM or SSE on x86) can buy you almost another order of magnitude performance on those super-scalar CPUs if you're dealing with vectors of smaller quantities (16 bit or 8 bit). So while on paper the VLIW DSP should blow the general purpose superscalar out of the water for signal processing algorithms at comparable clock rates it's not quite like that when you can use SIMD. It also takes a lot of manual optimization to get the most out of both architectures.
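The "almost another order of magnitude" figure follows from simple lane arithmetic. A quick sketch, assuming the 128-bit vector registers that both NEON and SSE provide; the lane count is only an upper bound on speedup, not a measurement, and the function name is made up.

```python
def simd_lanes(register_bits, element_bits):
    """Number of elements processed by one vector instruction."""
    return register_bits // element_bits


REGISTER_BITS = 128  # width of a NEON Q register or an SSE XMM register

for element_bits in (32, 16, 8):
    lanes = simd_lanes(REGISTER_BITS, element_bits)
    print(f"{element_bits}-bit elements: {lanes} lanes per instruction")
```

So 16-bit data gives 8 lanes and 8-bit data gives 16 lanes per instruction, which is where the rough factor of ten comes from before memory bandwidth and shuffling overhead eat into it.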
So when you're in the VLIW's sweet spot your performance/power/price is pretty good. But the general purpose processors can get you performance in and out of that sweet spot (you're probably sucking more power but that's life).
You really can't look at "peak" instructions per second on the VLIW as any sort of reliable metric and you need any comparison to include the SIMD units...
EDIT: Another note is that for many applications the external memory bandwidth is really important. The DSPs benefit from their various DMAs and being able to parallelize processing and data transfer but generally x86 blows away all those DSPs and ARM. I guess in a modern system you may also want to throw the GPU into the comparison mix.
Decades-old compiler tech can produce near-perfect schedules for in-order machines (including VLIWs) for all operations with a statically-fixed latency, which in practice means everything but loads. A load miss will stall an in-order machine.
Load misses are of two kinds: where the load result is a data-flow ancestor of all the rest of the program, and where the load is incidental to the program dataflow. In the former case neither in-order nor out-of-order machines can do anything but wait for memory. An example is making random references into a hash table that is bigger than cache - every machine of every architecture is going to run at DRAM speed.
However, there are also incidental misses, where the program has other things to do that do not require the result from the load miss. For these, the OOO machine can keep going, while an in-order machine stalls.
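A toy cycle model makes the two kinds of miss concrete. This is only a sketch under loud assumptions: both machines are single-issue, the in-order one stalls at the first use of a missing value, the out-of-order one has an unbounded window, and a miss costs a made-up 100 cycles. It does not model the Mill.

```python
def in_order_cycles(program, latency):
    """Single-issue in-order: an op waits for its operands,
    and everything behind it in program order waits too."""
    ready_at = {}
    cycle = 0
    for name, srcs in program:
        start = max([cycle] + [ready_at[s] for s in srcs])
        ready_at[name] = start + latency[name]
        cycle = start + 1
    return max(ready_at.values())


def out_of_order_cycles(program, latency):
    """Single-issue OOO with an unbounded window: each cycle,
    issue the oldest op whose operands are already available."""
    ready_at = {}
    pending = list(program)
    cycle = 0
    while pending:
        for i, (name, srcs) in enumerate(pending):
            if all(s in ready_at and ready_at[s] <= cycle for s in srcs):
                ready_at[name] = cycle + latency[name]
                del pending[i]
                break
        cycle += 1
    return max(ready_at.values())


latency = {"ld": 100, "use": 1, "a": 1, "b": 1, "c": 1}  # ld is the miss

# Everything depends on the load: nobody can do better than waiting.
dependent = [("ld", []), ("a", ["ld"]), ("b", ["a"]), ("c", ["b"])]

# Incidental miss: a, b, c do not need the loaded value.
incidental = [("ld", []), ("use", ["ld"]), ("a", []), ("b", ["a"]), ("c", ["b"])]

print(in_order_cycles(dependent, latency), out_of_order_cycles(dependent, latency))
print(in_order_cycles(incidental, latency), out_of_order_cycles(incidental, latency))
```

In this model the dependent case costs 103 cycles on both machines, while the incidental case costs 104 in-order against 101 out-of-order. Moving a, b, c above the use by hand brings the in-order figure down to 101 as well, which is the kind of gap a static scheduler has to close.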
The Mill does not face this issue. It is an in-order machine, yet does not suffer from miss stalls, or at least not from unnecessary ones. In general, the Mill avoids all the misses that an OOO can avoid. Or roughly so - there are a few cases where the Mill stalls but an OOO doesn't, and a few where the OOO stalls but the Mill doesn't, but it's roughly a wash.
Most of how this works will be described in the next presentation (#3) in the series, although some won't be until a later presentation.
- To achieve instruction-level parallelism, traditional architectures (his example: Haswell) often employ very messy techniques like register renaming, which create a huge amount of complexity and increase power consumption.
- He has focused on one such technique called Very Long Instruction Word (VLIW), and has taken it to an extreme: the technique he proposes is to throw away general purpose registers, and replace them with a "belt": a write-once linear tape of memory (implemented using stacks).
- He then points out various advantages of this model, including in-order execution (ILP traditionally requires reordering), short pipeline, and overall simplification of architectures.
All this looks fine on paper, but I don't see a proposed instruction set, or any indication of what realization of this model will require. In short, it's a cute theoretical exercise.
So, let's look at what the unworkable problems with Haswell are, and what's being done to fix them. Yes, nobody can seem to be able to figure out how to reduce power consumption on performant x86 microarchitectures beyond a point. It's a very old architecture, and I'm hopeful about the rise of ARM. The solution is not to throw away general purpose registers, but rather to cut register renaming and make ILP easier to implement by using a weak memory model (which is exactly what ARM does). ARM64 is emerging and the successes of x86 are slowly percolating to it [1]. Moreover, the arch/arm64 tree in linux.git is 1.5 years old and is under active development; we even got virt/kvm/arm merged in recently (3 months ago), although I'm not sure how it works without processor extensions (it's not pure pvops). ARM32 already rules embedded and mobile devices, and manufacturers are interested in taking it to heavy computing. In short, the roadmap is quite clear: ARM64 is the future.
The core of the Linux memory barriers model described in Documentation/memory-barriers.txt [2] (heavily inspired by the relaxed DEC Alpha) should tell you what you need to know about why a weak memory model is desirable.
[1]: http://www.arm.com/files/downloads/ARMv8_Architecture.pdf
[2]: https://github.com/torvalds/linux/blob/master/Documentation/...
To exploit ILP, ARM has to do the same thing as every other register architecture: move to a superscalar out-of-order architecture. There's no real alternative. The weak memory model makes cache coherency protocols simpler/faster, but won't save you from false register dependencies.
Three-address code might make it possible to defer the costs of complex register renaming, but ARM has discovered the importance of code density, and Thumb-2 (2-address code) is preferred for most functions.
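A small sketch of what "false register dependencies" means and how renaming removes them. Instructions are modelled as (destination, sources) pairs; the register names, the four-instruction fragment, and the helper functions are all invented for illustration, and real renaming hardware has a finite physical register file that this ignores.

```python
from itertools import count


def rename(instrs):
    """Give every write a fresh physical register, so that only true
    (read-after-write) dependencies remain."""
    fresh = count()
    current = {}  # architectural name -> latest physical name
    out = []
    for dest, srcs in instrs:
        # Sources read whichever physical register last held that name.
        # Names never written in this fragment are live-in and kept as-is.
        phys_srcs = [current.get(s, s) for s in srcs]
        phys_dest = f"p{next(fresh)}"
        current[dest] = phys_dest
        out.append((phys_dest, phys_srcs))
    return out


def false_deps(instrs):
    """Count WAR/WAW pairs: a later write to a register that an
    earlier instruction reads or writes."""
    n = 0
    for j, (dest_j, _) in enumerate(instrs):
        for dest_i, srcs_i in instrs[:j]:
            if dest_j == dest_i or dest_j in srcs_i:
                n += 1
    return n


code = [
    ("r1", ["r2", "r3"]),  # r1 = r2 + r3
    ("r4", ["r1"]),        # r4 = f(r1)     true dependency on instr 0
    ("r1", ["r5", "r6"]),  # r1 = r5 * r6   reuses r1: WAW with 0, WAR with 1
    ("r7", ["r1"]),        # r7 = g(r1)     true dependency on instr 2
]

renamed = rename(code)
print(false_deps(code), false_deps(renamed))
```

The third instruction has no data relationship with the first two; it only collides with them because it reuses the name r1. After renaming, the two chains are independent and could run in parallel, while the two true dependencies are untouched.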
What are false register dependencies? And how does a three-address code help in register renaming?
Some reading references would help. I'm currently reading [1].
[1]: http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2...
This design choice had two motivations: 1) membar is fabulously expensive; and 2) weaker consistency models are incomprehensible to mere mortals.
Keep it up! Maybe tomorrow we'll see an idea to replace threads and locks.
Did you even watch his talk? I didn't see anyone arguing against Bret's scaling concerns with threads in yesterday's HN comments.
If we don’t come up with a new system, I can’t imagine how hard it will be to efficiently program CPUs when they start coming out with 128, 256, and 512 cores.
Therefore, it's better to use a static-sized queue structure. Hence "belt" - it's like a conveyor belt.
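The conveyor-belt picture can be sketched in a few lines. This is a toy model only: the class name, the belt length of 4, and the values are made up, and it says nothing about how the hardware implements it. Operands are named by position (how recently they were produced) rather than by a register name.

```python
from collections import deque


class Belt:
    """Toy fixed-length belt: new results drop at the front,
    and the oldest result falls off the far end."""

    def __init__(self, length=8):
        # With maxlen set, appendleft on a full deque discards the
        # item at the opposite (oldest) end.
        self.slots = deque(maxlen=length)

    def drop(self, value):
        self.slots.appendleft(value)
        return value

    def __getitem__(self, pos):
        # belt[0] is the most recent result, belt[1] the one before, ...
        return self.slots[pos]


belt = Belt(length=4)
belt.drop(3)
belt.drop(4)
belt.drop(belt[0] + belt[1])   # "add b0, b1" drops 7
belt.drop(belt[0] * belt[2])   # "mul b0, b2" drops 7 * 3 = 21
belt.drop(0)                   # belt is full, so the oldest value (3) falls off
print(list(belt.slots))
```

Note that nothing is ever overwritten in place: every result gets a new position, and the same value is addressed by a different number as newer results push it along.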
I'm still watching it though. The topic is interesting but man, his voice is droning.
Ivan
Too bad they can't release their emulators/simulators, but I sympathize with their desire to be first-to-file to protect what sounds like years of work.
Also the implementation of the belt itself could be quite nasty. Just look at the face of the guy who asked about it after he hears the answer. ;-)
Finally, they will not get anywhere near peak performance with general purpose code. The parallelism is just not there at the instruction level. They would do well on high performance computing or digital signal processing with enough floating point units and memory bandwidth.
Instead they seem to target big data. An interesting move. It will be interesting to see actual performance numbers (even from a simulator). I wish them luck.
There is an architectural minimal size to Mills; there must be at least one ALU and one memory unit, and all Mills must use a 64-bit address; the z80 market is safe from us, and you won't see Mills in your toaster or thermostat. There's no architectural maximal Mill, but there are diminishing returns; in current process technology we feel that eight ALUs are getting close to that edge, but those decisions are made independently for each market and process.
Ivan
As far as the meat goes, I'm gratified that I'm not the only one to think this is mindbogglingly elegant. I see some misconceptions in this thread, but I'm finding it difficult to explain the divergences without beginning to inexpertly regurgitate large portions of this and the previous talk (on their instruction encoding).
It seems like quite a reach to pick a 'latest and greatest' IA processor (why is this $885? Because it can be) as a point of contrast. We can hit half the performance envelope and/or about an eighth the price while keeping many of the features that make a desktop processor what it is (like, for example, a whopping big cache hierarchy). Picking on x86 seems like a good way of overstating the case; if we're getting a new architecture, we may as well pick on ARM or Tilera or what have you.
Am also, as an early believer in the Itanium, somewhat nervous about static scheduling for any purpose. Dynamic branch prediction is very accurate on today's architectures; this does not mean that static scheduling can emulate this accuracy and enjoy the benefits.
I work daily with a VLIW architecture and the only place I ever see a plain "move" instruction is in loops. Everywhere else, the compiler just churns through the registers as if they were in a belt anyway – just the names are absolute instead of relative.
I can imagine this might simplify the processor logic some – results are primarily read from one of the first few belt locations, and are always written to the first location. "Register moves" aside, is this the primary benefit?