The Belt CPU Architecture (opens in new tab)

(ootbcomp.com)

183 pointsmomerath12y ago79 comments

79 comments

58 comments · 20 top-level

CoffeeDregs12y ago· 9 in thread

First off, I love the post! Super meaty goodness.

I read-ish the whole set of slides and it sounds pretty good (the devil is always in the details), but I got a little worried about VLIW-ish issues when [in the slides] he said on slide #57:

    The compiler controls when ops issue

One of the big issues with VLIW was that the compiler had to be intimately aware of processor architecture. So when you upgraded your '886 to a '986 you needed new binaries because the '986 had more registers or executions units. [I assume Itanium fixed some of this, but it also sunk my interest in VLIW.]

Is this architecture going to face the same issue?

Edit: I watching the video and heard that "nearly all of what the super scalar is doing is [not calculating]". One of the other VLIW issues was that chip area was dominated by cache-area, so all the stuff about [not calculating] shrank and shrank relatively as cache area grew (see: http://techreport.com/review/15818/intel-core-i7-processors). This claim concerns me.

Edit V2: but damn... Exciting stuff.

igodard12y ago

The Mill compiler is a conventional three-phase tool. The first version used the EDG front end and home-grown middle and back ends, but we are converting to a clang front end, LLVM middle, and home-grown back end. All optimization is done in the middle end; the back end (the "specializer" runs at program install (or load) time and does only scheduling and emit.

All Mill family members have a common architecture, that the middle end knows about. ME output (specializer input) is a complete CFG and DFG for the abstract Mill. The specializer first replaces operations that are not present on the target with calls to functions; on a Mill function call is semantically equivalent to an op like an add.

There is no instruction-selection phase. The Mill in general has only one way to implement any given source-level action, and the correct (abstract) code has already been selected by the ME. The BE does instruction scheduling, introducing spill where necessary, using standard schedule algorithms long used for in-order machines and VLIWs. The result may be listed as assembler source or emitted as binary. The schedule algorithms have quadratic worst case but in practice are linear. The generated binaries are cached in the load module, so subsequent runs skip the specializer step.

Current status is that work on the new (LLVM-based) compiler is on hold while we complete the patent filings, and the old one is out of date already. When the filings are in we hope to make the tool chain and sim available on-line for those who want to play. We could do that now with the assembler and sim, except that the asm instruction set exposes some things that the patents aren't in on yet. Grrr - patents!

cpr12y ago

As someone who worked on operating systems back at Multiflow (late 80's, the first VLIW start-up as a spin-off from Yale research), it struck me recently that something like an LLVM representation of binary code might solve the "compiler problem" (needing to know the exact specifics of the chip's latencies). I.e., you'd run and load LLVM binaries, and have a runtime final optimization pass that took into account the specific latencies of each opcode for a particular implementation. (The LLVM architecture is actually already set up to do optimizations at runtime on LLVM "binaries".)

But perhaps that "final optimization pass" would be nearly as hard as the whole compilation problem in the first place; dunno. I wasn't on the compiler team, so this is perhaps a naive viewpoint.

arh6812y ago

> The LLVM architecture is actually already set up to do optimizations at runtime on LLVM "binaries".

Pragmatically, exactly how fast is that run-time optimization? Could you realistically JIT it, or should the more-optimal, chip-specific asm be cached between loads? Or is this so slow you'd only ever want to do it once?

1 more reply

fulafel12y ago

Transmeta was a little like that, except it was x86 to VLIW.

1 more reply

igodard12y ago

In terms of square millimeters the caches dominate chip area. In terms of fab yield (which is the cost driver) they do not, because caches are regular structures that are built with built-in sparing.

If there's a bad gate in the cache then the fab process simply uses one of the spare cells inside; the user never realizes that a spare has been used. If there's a bad gate in the core proper then that chip is gone, lowering the fab yield.

The same can be done at the level of other regular structures. Your two-core chip is really a four-core chip with a couple of bad cores. That's one of the reasons why the vendors are so eager to convince you that multicore is the Pearly Gates - it increases their effective yield. Re cache area:

To a first approximation, power cost of a cache is constant independent of size, whereas core power is superlinear. Consequently the limiting factor for increasing cache size is latency, not power. See ootbcomp.com/docs/encoding for how the Mill doubles instruction cache size without increasing latency.

stormbrew12y ago

I feel like the issue this faces isn't so much the belt length, since I think you'd probably just end up with a dominant choice (much like x86 managed to stick to a roughly unchanging number of registers for a very long time).

But the lack of latency hiding means you have to generate code to schedule starting operations at the right time. Mul taking 2 cycles means you want to start any dependent, but lower latency instructions after it. But if at some point in the future mul is scaled down to one cycle for whatever reason your timings will all be off and your belt occupancy will be less than ideal.

I kind of wonder if the reality is simply that there is nothing better equipped to schedule instructions than the cpu itself. Maybe we could have better ways to explicitly hint meaningful information to it, but I'm not sure shoving all the decisions to the compiler is a long term solution.

gbhn12y ago

The obstacle would simply be recompiling from IR. That doesn't seem like a big challenge in today's environment for a large swath of devices -- nearly mobile devices, for instance, have nearly-complete top-to-bottom source code for all software running on the chip available. So that approach doesn't seem nearly as crazy as it would have 10-20 years ago.

1 more reply

nightcracker12y ago

This question is actually asked: https://www.youtube.com/watch?v=QGw-cy0ylCc&feature=player_d...

gngeal12y ago

Is this architecture going to face the same issue?

Why don't you just recompile?

knz4212y ago· 7 in thread

I like this guy and his presentation style, it is good to see someone delivering seriously on an old idea: queue machines. (The "Belt" is the queue machine model revisited. I can't see a theoretical breakthrough here). But the Mill's realization that combines a queue model with VLIW and embedded spiller machinery is worth attention.

Some criticism however: no sign yet of how variable latencies in memory will be tolerated. Requiring fixed latency for FUs is problematic with cache misses and FU sharing between pipelines. Also his comparison between a 64-item belt and the 300+ registers of OoOE is unfair, since the 300+ registers will likely tolerate more latency than the smaller belt.

I wrote a review of what I get from his first two talks here: http://staff.science.uva.nl/~poss/posts/2013/08/01/mill-cpu/

abecedarius12y ago

Nice post; I had the same thought of using promises for memory access. I guess you could extend that to a massively-parallel design (promises for communication).

I think Torben Mogensen posted the same basic idea of replacing registers with 'temporal addressing' to Usenet back in the 90s -- comp.arch? comp.compilers? Boy, it's been a long time.

knz4212y ago

Every successive incarnation of dataflow scheduling uses a matching store and synchronization tokens on long-latency operations. You have them in Tomasulo's reservation stations at a very small scale (considering FUs as unpredictable), in the MTA throughout the memory system, in the D-RISC core of my research group, and quite a few others.

It's quite a common and recurrent idea really. However promises / dataflow tokens / I-structures / etc all are subject to a common flaw / problem: when you receive multiple completions simultaneously, which of them are you going to schedule first? This choice is highly non-obvious and has tremendous impacts on data locality.

1 more reply

Scaevolus12y ago

The memory architecture hasn't been detailed, but from his claims of latency hiding I could see a memory load succeeding in a constant number of cycles and pushing a belt item tagged as unfilled memory, and then an actual attempt to read that belt item would cause a stall if it hasn't been filled yet.

knz4212y ago

As Ivan just pointed out to me privately, I meant "is the stack machine model revisited", not "queue". Queue and stack machines are quite different things. But the rest of the argument holds.

igodard12y ago

Part of the way the Mill handles memory latency will be in the next talk. Sign up at ootbcomp.com/mailing-list for an announcement of date and venue.

qznc12y ago

The spiller machinery seems to be quite similar to SPARCs register windows. SPARC requires software handlers for the overflow though.

igodard12y ago

The spiller concept is not new with us; there have been others besides SPARC to use the idea. Ours has no software handlers.

Ivan

gruseom12y ago· 4 in thread

What would be the ideal language to compile to this?

kazagistar12y ago

C would probably be fine. One interesting point seems to be the ease with which many parameters can be both passed and returned. So, languages that return multiple values from functions, and that allow for good, automatic software pipe lining.

tlb12y ago

Fortran or numerical code in C. It seems best suited for statically unrollable loops with few data-dependent branches or random memory accesses.

williamcotton12y ago

C expects addressable registers. From what I can tell this doesn't have those!

It really does seem like a Lisp would map much better, with the whole caller/callee and those private data belts that looked like hardware level closures!

2 more replies

joelg23612y ago

My intuition says haskell or lisp. Don't take my word for it.

ChuckMcM12y ago· 3 in thread

I got to talk with Ivan after he gave an earlier version of this talk at Asilomar and the good news was he's the real deal, pretty much every question I could think to throw at him he had a solid answer for, the bad news was I felt the same way about the Mill architecture as I did about Intel's iapx432 architecture [1], which was elegant to a fault.

That said, I got the sense that this was what Intel was going for when they did Larrabee [2] and just missed because of the focus on Graphics. Unlike Larrabee is suspect OOTBC will need to build it themselves like Chip did for the Propeller [3].

That said, the challenge of these bespoke architectures are the requirement for software, or first a GCC port :-). I believe Ivan said they had a port that talked to their simulator, but I don't know if that was an optimizing thing like SGI's compiler for Itanium or a proof of concept thing.

The weird thing is of course "Why?", and one might say "But Chuck, faster and cheaper, why not?" and I look at the explosion of ARM SoC's (cheaper, not necessarily faster than x86) and look back at ARM and think 99% of this was getting the ecosystem built, not the computer architecture. So who can afford to invest the billions to build the eco-system? Who would risk that investment? (Google might but that is a different spin on things).

So playing around with the Zynq-7020 (same chip that is on the Parallea but not the Epiphany co-processor) I can see a nice dual core ARM9 where you have full speed access to a bespoke DSP if you want for the 'tricky' bits. Will that be "good enough" for the kinds of things this would also excel at? I don't know, so I don't know how to handicap OOTBC's chances for success. But I really enjoy novel computer architectures, like this one and Chuck Moore's '1000 forth chickens' chip [4] (it was a reference to the Seymour Cray's quote, "Would you have 1,000 chickens pull your plow or an Ox?"

A really interesting time will be had when 'paid off' fab capacity is sitting idle and the cost for a wafer start becomes a function of how badly the fab wants to keep the line busy.

[1] http://en.wikipedia.org/wiki/Intel_iAPX_432

[2] http://en.wikipedia.org/wiki/Larrabee_(microarchitecture)#Di...

[3] http://www.parallax.com/propeller/

[4] http://www.colorforth.com/S40.htm

igodard12y ago

I'll try to address some of the questions and issues raised here; the Mill has been more fully discussed on comp.arch, and there is more material at ootbcomp.com/docs. Sign up at ootbcomp.com/mailing-list to get announcements of additional videos, white papers etc as they become available.

solarexplorer12y ago

I just signed up. Can't wait to see more of this!

thisishugo12y ago

I remember reading the Larrabee white papers, press material, etc. for an article[1] I wrote some five years ago. I think the reason that Larrabee was pitched for graphics initially was just to limit the domain of "things" (tessellation, texture sampling, etc.) that had to be catered to in silicon. I've no doubt that had it come to market, and had it worked well, we would be looking at a radically different CPU architecture than Haswell on the market today.

Of course, there's always the outside chance that the experiment worked so well that Intel is keeping an architecture inspired by the Larrabee research project it in reserves for after it's gone as far as it can shrinking transistors and needs something new to sell.

[1] http://www.trustedreviews.com/opinions/intel-larrabee-an-int...

YZF12y ago· 3 in thread

As someone who has worked implementing the same algorithms on a VLIW DSP (TI C64x) and ARM and x86 a few comments after watching the beginning of the talk. It's been a while so this is from memory...

The different VLIW execution units can only process a subset of the instructions set. That means you need the right mix of instructions in your algorithm to take advantage of the full throughput. If you have any sort of serial dependency you won't be able to take advantage of all the execution slots. It basically excels at - signal processing (and even a subset of that). That said, when you hit the sweet spot it's pretty good.

When someone like TI compares their DSP to ARM they usually tend to ignore SIMD (very conveniently). SIMD (NEON on ARM or SSE on x86) can buy you almost another order of magnitude performance on those super-scalar CPUs if you're dealing with vectors of smaller quantities (16 bit or 8 bit). So while on paper the VLIW DSP should blow the general purpose superscalar out of the water for signal processing algorithms at comparable clock rates it's not quite like that when you can use SIMD. It also takes a lot of manual optimization to get the most out of both architectures.

So when you're in the VLIW's sweet spot your performance/power/price is pretty good. But the general purpose processors can get you performance in and out of that sweet spot (you're probably sucking more power but that's life).

You really can't look at "peak" instructions per second on the VLIW as any sort of reliable metric and you need any comparison to include the SIMD units...

EDIT: Another note is that for many applications the external memory bandwidth is really important. The DSPs benefit from their various DMAs and being able to parallelize processing and data transfer but generally x86 blows away all those DSPs and ARM. I guess in a modern system you may also want to throw the GPU into the comparison mix.

igodard12y ago

Re Scheduling for latency:

Decades-old compiler tech can produce near-perfect schedules for in-order machines (including VLIWs) for all operations with a statically-fixed latency, which in practice means everything but loads. A load miss will stall an in-order machine.

Load misses are of two kinds: where the load result is a data-flow ancestor of all the rest of the program, and when the load is incidental to the program dataflow. In the former case neither in-order nor out-of-order machines can do anything but wait for memory. An example is making random references into a hash table that is bigger than cache - every machine of every architecture is going to run at DRAM speed.

However, there are also incidental misses, where the program has other things to do that do not require the result from the load miss. For these, the OOO machine can keep going, while an in-order machine stalls.

The Mill does not face this issue. It is an in-order machine, yet does not suffer from miss stalls, or at least not from unnecessary ones. In general, the Mill avoids all the misses that an OOO can avoid. Or roughly so - there are a few cases where the Mill stalls but an OOO doesn't, and a few where the OOO stalls but the Mill doesn't, but it's roughly a wash.

Most of how this works will be described in the next presentation (#3) in the series, although some won't be until a later presentation.

Symmetry12y ago

Both SIMD and VLIW are techniques that only work on programs with a good amount of statically accessible instruction level parallelism, but VLIW is much more general since it can be used in cases where you know in advance that you want to be executing a number of instructions in parallel, rather than one instruction multiple times. The programs that can be sped up by SIMD are a subset of the programs that can be sped up by VLIW, are a subset of the programs that can be sped up by OoO engines. Of course, SIMD has a big advantage in instruction bandwidth vs VLIW, and if you're already going OoO the SIMD instructions only use up a minimum of the OoO engine's resources. Combining VLIW and OoO, on the other hand, is very difficult to do.

YZF12y ago

Agree.

artagnon12y ago· 3 in thread

My takeaway from this presentation:

- To achieve instruction-level parallelism, traditional architectures (his example: Haswell) often employ very messy techniques like register renaming which create a huge amount of complexity increasing power-consumption.

- He has focused on one such technique called Very Long instruction word (VLIW), and has taken it to an extreme: the technique he proposes is to throw away general purpose registers, and replace it with a "belt": a write-once linear tape of memory (implemented using stacks).

- He then points out various advantages of this model, including in-order execution (ILP traditionally requires reordering), short pipeline, and overall simplification of architectures.

All this looks fine on paper, but I don't see a proposed instruction set, or any indication of what realization of this model will require. In short, it's a cute theoretical exercise.

So, let's look at what the unworkable problems with Haswell are, and what's being done to fix them. Yes, nobody can seem to be able to figure out how to reduce power consumption on performant x86 microarchitectures beyond a point. It's a very old architecture, and I'm hopeful about the rise of ARM. The solution is not to throw away general purpose registers, but rather to cut register renaming and make ILP easier to implement by using a weak memory model (which is exactly what ARM does). ARM64 is emerging and the successes of x86 are slowly percolating to it [1]. Moreover, the arch/arm64 tree in linux.git is 1.5 years old and is under active development; we even got virt/kvm/arm merged in recently (3 months ago), although I'm not sure how it works without processor extensions (it's not pure pvops). ARM32 already rules embedded and mobile devices, and manufacturers are interested in taking it to heavy computing. In short, the roadmap is quite clear: ARM64 is the future.

The core of the Linux memory barriers model described in Documentation/memory-barriers.txt [2] (heavily inspired by the relaxed DEC Alpha) should tell you what you need to know about why a weak memory model is desirable.

[1]: http://www.arm.com/files/downloads/ARMv8_Architecture.pdf

[2]: https://github.com/torvalds/linux/blob/master/Documentation/...

Scaevolus12y ago

A weak memory model has almost nothing to do with ILP.

To exploit ILP, ARM has to do the same thing as every other register architecture: move to a superscalar out-of-order architecture. There's no real alternative. The weak memory model makes cache coherency protocols simpler/faster, but won't save you from false register dependencies.

Three-address code might make it possible to defer the costs of complex register renaming, but ARM has discovered the importance of code density, and Thumb-2 (2-address code) is preferred for most functions.

artagnon12y ago

I see. Can you explain to me how it fixes the power consumption problem? Yes, I know that it needs to reorder instructions to achieve ILP, but isn't that the entire point of a weak architecture? To reduce the memory barriers and weaken the ordering constraints, making it easier to reorder?

What are false register dependencies? And how does a three-address code help in register renaming?

Some reading references would help. I'm currently reading [1].

[1]: http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2...

3 more replies

igodard12y ago

The Mill is strong sequentially consistent throughout. There are no membar operations, and the Linux macroes for that purpose are empty.

This design choice had two motivations: 1) membar is fabulously expensive; and 2) weaker consistency models are incomprehensible to mere mortals.

slacka12y ago· 3 in thread

After Bret Victor's inspiring talk yesterday, it's fitting to see a new CPU architecture make it to the top of HN.

Keep it up! Maybe tomorrow we'll see an idea to replace threads and locks.

qznc12y ago

You can replace threads and locks, if you use non-blocking programming techniques.

http://en.wikipedia.org/wiki/Non-blocking_algorithm

slacka12y ago

With massively parallel hardware, concurrency is the biggest challenge facing programmers today. It's a terrible programming paradigm. And your Non-blocking algorithms only add complexity to an already difficult problem. For example, I've seen NBAs lead to a priority inversion bug.

Did you even watch his talk? I didn't see anyone arguing against Bret's scaling concerns with threads in yesterday's HN comments.

lbenes12y ago

In my experience, the worse bugs I have encountered are deadlocks and race conditions. Saying just use a “non-blocking algorithm” is no silver bullet to the difficultly of working with threads.

If we don’t come up with a new system, I can’t imagine how hard it will be to efficiently program CPUs when they start coming out with 128, 256, and 512 cores.

chewxy12y ago· 2 in thread

At 17 mins in, when he talked about rename registers, it reminds me of SSAs. It's quite fascinating to see that something like that happens in the hardware level as well

axman612y ago

My first thought was "Huh, that sounds like an architecture that would map perfectly to LLVM's SSA IR". Can't wait to watch the video now!

chewxy12y ago

nah it's a different thing. He says general purpose processor architectures have so many rename registers because of single assignments. He then proceed to talk about percentage of data being Single Assignment.

Therefore, it's better to use a statick-sized queue structure. Hence "belt" - it's like a conveyor belt.

I'm still watching it though. The topic is interesting but man, his voice is droning.

1 more reply

ajb12y ago· 2 in thread

Interesting idea. A limitation of the temporal addressing is that variables live on entry to a basic block have to be put in the same position on the belt on each path which could enter the block. This is okay for loops, but it's going to require lots of shuffling about in conditional code. Still could be worth it, of course.

solarexplorer12y ago

They will probably do lots of if conversions to avoid this and to make use of the 33 issue slots. They could also add special instructions to push several values on the belt to keep different paths aligned.

igodard12y ago

Bingo!

Ivan

johnpmayer12y ago· 1 in thread

Very interesting that the architecture has no condition codes, or any global state for that matter. Also, that the belts are logical, in that there really isn't a chunk of memory where the belt lives, but the members of a belt live where they live and know "which belt" they belong to and "where on that belt" they are.

Too bad they can't release their emulators/simulators, but I sympathize with their desire to be first-to-file to protect what sounds like years of work.

igodard12y ago

Exactly :-) We will be publishing as fast as we get the filings done.

solarexplorer12y ago· 1 in thread

This was a very nice presentation indeed. Too bad they could not tell us how they handle the memory hierarchy. Because that part is crucial for any statically scheduled architecture.

Also the implementation of the belt itself could be quite nasty. Just look at the face of the guy who asked about it after he hears the answer. ;-)

Finally, they will not get anywhere near peak performance with general purpose code. The parallelism is just not there at the instruction level. They would do well on high performance computing or digital signal processing with enough floating point units and memory bandwidth.

Instead they seem to target big data. An interesting move. It will be interesting to see actual performance numbers (even from a simulator). I wish them luck.

igodard12y ago

We consider the Mill family to be general-purpose, with the same potential markets as any other GP architecture (x86, ARM, zSeries, PowerPC, etc.). Individual family members can be configured for particular markets; the Gold mentioned in the talk is indeed somewhat Big Data oriented, with eight ALUs and only two FPUs. A High Silver, with six FPUs and a SIMD height of 256 bits, is an FP banger. A Tin, with one ALU, no FPU, and 8-byte SIMD, is at the low end.

There is an architectural minimal size to Mills; there must be at least one ALU and one memory unit, and all Mills must use a 64-bit address; the z80 market is safe from us, and you won't see Mills in your toaster or thermostat. There's no architectural maximal Mill, but there are diminishing returns; in current process technology we feel that eight ALUs are getting close to that edge, but those decisions are made independently for each market and process.

Ivan

momerathOP12y ago

Wow, the one submission I've made anywhere that hit the front page, and I'm kicking myself for the title. 'Belt' is really their name for the category of machine architecture in the sense of register and stack machines. 'Mill' is the name of their specific architecture/family.

As far as the meat goes, I'm gratified that I'm not the only one to think this is mindbogglingly elegant. I see some misconceptions in this thread, but I'm finding it difficult to explain the divergences without beginning to inexpertly regurgitate large portions of this and the previous talk (on their instruction encoding).

onan_barbarian12y ago

This looks fascinating - and "the real deal" (not crackpottery) but I'm hesitant about a number of items.

It seems like quite a reach to pick a 'latest and greatest' IA processor (why is this $885? Because it can be) as a point of contrast. We can hit half the performance envelope and/or about an eighth the price while keeping many of the features that make a desktop processor what it is (like, for example, a whopping big cache hierarchy). Picking on x86 seems like a good way of overstating the case; if we're getting a new architecture, we may as well pick on ARM or Tilera or what have you.

Am also, as an early believer in the Itanium, somewhat nervous about static scheduling for any purpose. Dynamic branch prediction is very accurate on today's architectures; this does not mean that static scheduling can emulate this accuracy and enjoy the benefits.

gmrple12y ago

This feels like a functional architecture. It's really cool.

colanderman12y ago

I don't get how this is different than an in-order register-based VLIW architecture, except that it makes branching more difficult to deal with because you need to synchronize belt-deltas between branches (which sounds like the dual of a register move to me)?

I work daily with a VLIW architecture and the only place I ever see a plain "move" instruction is in loops. Everywhere else, the compiler just churns through the registers as if they were in a belt anyway – just the names are absolute instead of relative.

I can imagine this might simplify the processor logic some – results are primarily read from one of the first few belt locations, and are always written to the first location. "Register moves" aside, is this the primary benefit?

X412y ago

This guy looks and is awesome! Can't wait to have them produce many-core Mill CPUs, that I'd like to throw algorithms onto. Then I'm gonna Buy Buy Buy

pavlov19912y ago

This reminds me a lot of SSAs. This is really really cool stuff. One of those genuine progress type things in this space.

tankenmate12y ago

This strikes me as a further progression of the idea of register windows; in particular the SPARC architecture.

microcolonel12y ago

Wow, this addresses quite a number of my concerns with aging CPU architectures, specifically with throwaway registers for fundamentally low-latency operations like chaining ALU functions.

auggierose12y ago

Is there a simulator for this available somewhere?

j / k navigate · click thread line to collapse

79 comments

58 comments · 20 top-level

CoffeeDregs12y ago· 9 in thread

First off, I love the post! Super meaty goodness.

I read-ish the whole set of slides and it sounds pretty good (the devil is always in the details), but I got a little worried about VLIW-ish issues when [in the slides] he said on slide #57:

    The compiler controls when ops issue

Is this architecture going to face the same issue?

Edit V2: but damn... Exciting stuff.

igodard12y ago

cpr12y ago

But perhaps that "final optimization pass" would be nearly as hard as the whole compilation problem in the first place; dunno. I wasn't on the compiler team, so this is perhaps a naive viewpoint.

arh6812y ago

> The LLVM architecture is actually already set up to do optimizations at runtime on LLVM "binaries".

1 more reply

fulafel12y ago

Transmeta was a little like that, except it was x86 to VLIW.

1 more reply

igodard12y ago

In terms of square millimeters the caches dominate chip area. In terms of fab yield (which is the cost driver) they do not, because caches are regular structures that are built with built-in sparing.

stormbrew12y ago

gbhn12y ago

1 more reply

nightcracker12y ago

This question is actually asked: https://www.youtube.com/watch?v=QGw-cy0ylCc&feature=player_d...

gngeal12y ago

Is this architecture going to face the same issue?

Why don't you just recompile?

knz4212y ago· 7 in thread

I wrote a review of what I get from his first two talks here: http://staff.science.uva.nl/~poss/posts/2013/08/01/mill-cpu/

abecedarius12y ago

Nice post; I had the same thought of using promises for memory access. I guess you could extend that to a massively-parallel design (promises for communication).

I think Torben Mogensen posted the same basic idea of replacing registers with 'temporal addressing' to Usenet back in the 90s -- comp.arch? comp.compilers? Boy, it's been a long time.

knz4212y ago

1 more reply

Scaevolus12y ago

knz4212y ago

As Ivan just pointed out to me privately, I meant "is the stack machine model revisited", not "queue". Queue and stack machines are quite different things. But the rest of the argument holds.

igodard12y ago

Part of the way the Mill handles memory latency will be in the next talk. Sign up at ootbcomp.com/mailing-list for an announcement of date and venue.

qznc12y ago

The spiller machinery seems to be quite similar to SPARCs register windows. SPARC requires software handlers for the overflow though.

igodard12y ago

The spiller concept is not new with us; there have been others besides SPARC to use the idea. Ours has no software handlers.

Ivan

gruseom12y ago· 4 in thread

What would be the ideal language to compile to this?

kazagistar12y ago

tlb12y ago

Fortran or numerical code in C. It seems best suited for statically unrollable loops with few data-dependent branches or random memory accesses.

williamcotton12y ago

C expects addressable registers. From what I can tell this doesn't have those!

It really does seem like a Lisp would map much better, with the whole caller/callee and those private data belts that looked like hardware level closures!

2 more replies

joelg23612y ago

My intuition says haskell or lisp. Don't take my word for it.

ChuckMcM12y ago· 3 in thread

A really interesting time will be had when 'paid off' fab capacity is sitting idle and the cost for a wafer start becomes a function of how badly the fab wants to keep the line busy.

[1] http://en.wikipedia.org/wiki/Intel_iAPX_432

[2] http://en.wikipedia.org/wiki/Larrabee_(microarchitecture)#Di...

[3] http://www.parallax.com/propeller/

[4] http://www.colorforth.com/S40.htm

igodard12y ago

solarexplorer12y ago

I just signed up. Can't wait to see more of this!

thisishugo12y ago

[1] http://www.trustedreviews.com/opinions/intel-larrabee-an-int...

YZF12y ago· 3 in thread

As someone who has worked implementing the same algorithms on a VLIW DSP (TI C64x) and ARM and x86 a few comments after watching the beginning of the talk. It's been a while so this is from memory...

You really can't look at "peak" instructions per second on the VLIW as any sort of reliable metric and you need any comparison to include the SIMD units...

igodard12y ago

Re Scheduling for latency:

Most of how this works will be described in the next presentation (#3) in the series, although some won't be until a later presentation.

Symmetry12y ago

YZF12y ago

Agree.

artagnon12y ago· 3 in thread

My takeaway from this presentation:

- He then points out various advantages of this model, including in-order execution (ILP traditionally requires reordering), short pipeline, and overall simplification of architectures.

All this looks fine on paper, but I don't see a proposed instruction set, or any indication of what realization of this model will require. In short, it's a cute theoretical exercise.

[1]: http://www.arm.com/files/downloads/ARMv8_Architecture.pdf

[2]: https://github.com/torvalds/linux/blob/master/Documentation/...

Scaevolus12y ago

A weak memory model has almost nothing to do with ILP.

artagnon12y ago

What are false register dependencies? And how does a three-address code help in register renaming?

Some reading references would help. I'm currently reading [1].

[1]: http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2...

3 more replies

igodard12y ago

The Mill is strong sequentially consistent throughout. There are no membar operations, and the Linux macroes for that purpose are empty.

This design choice had two motivations: 1) membar is fabulously expensive; and 2) weaker consistency models are incomprehensible to mere mortals.

slacka12y ago· 3 in thread

After Bret Victor's inspiring talk yesterday, it's fitting to see a new CPU architecture make it to the top of HN.

Keep it up! Maybe tomorrow we'll see an idea to replace threads and locks.

qznc12y ago

You can replace threads and locks, if you use non-blocking programming techniques.

http://en.wikipedia.org/wiki/Non-blocking_algorithm

slacka12y ago

Did you even watch his talk? I didn't see anyone arguing against Bret's scaling concerns with threads in yesterday's HN comments.

lbenes12y ago

In my experience, the worse bugs I have encountered are deadlocks and race conditions. Saying just use a “non-blocking algorithm” is no silver bullet to the difficultly of working with threads.

If we don’t come up with a new system, I can’t imagine how hard it will be to efficiently program CPUs when they start coming out with 128, 256, and 512 cores.

chewxy12y ago· 2 in thread

At 17 mins in, when he talked about rename registers, it reminds me of SSAs. It's quite fascinating to see that something like that happens in the hardware level as well

axman612y ago

My first thought was "Huh, that sounds like an architecture that would map perfectly to LLVM's SSA IR". Can't wait to watch the video now!

chewxy12y ago

Therefore, it's better to use a statick-sized queue structure. Hence "belt" - it's like a conveyor belt.

I'm still watching it though. The topic is interesting but man, his voice is droning.

1 more reply

ajb12y ago· 2 in thread

solarexplorer12y ago

igodard12y ago

Bingo!

Ivan

johnpmayer12y ago· 1 in thread

Too bad they can't release their emulators/simulators, but I sympathize with their desire to be first-to-file to protect what sounds like years of work.

igodard12y ago

Exactly :-) We will be publishing as fast as we get the filings done.

solarexplorer12y ago· 1 in thread

This was a very nice presentation indeed. Too bad they could not tell us how they handle the memory hierarchy. Because that part is crucial for any statically scheduled architecture.

Also the implementation of the belt itself could be quite nasty. Just look at the face of the guy who asked about it after he hears the answer. ;-)

Instead they seem to target big data. An interesting move. It will be interesting to see actual performance numbers (even from a simulator). I wish them luck.

igodard12y ago

Ivan

momerathOP12y ago

onan_barbarian12y ago

This looks fascinating - and "the real deal" (not crackpottery) but I'm hesitant about a number of items.

gmrple12y ago

This feels like a functional architecture. It's really cool.

colanderman12y ago

X412y ago

This guy looks and is awesome! Can't wait to have them produce many-core Mill CPUs, that I'd like to throw algorithms onto. Then I'm gonna Buy Buy Buy

pavlov19912y ago

This reminds me a lot of SSAs. This is really really cool stuff. One of those genuine progress type things in this space.

tankenmate12y ago

This strikes me as a further progression of the idea of register windows; in particular the SPARC architecture.

microcolonel12y ago

Wow, this addresses quite a number of my concerns with aging CPU architectures, specifically with throwaway registers for fundamentally low-latency operations like chaining ALU functions.

auggierose12y ago

Is there a simulator for this available somewhere?

j / k navigate · click thread line to collapse