Bend: a high-level language that runs on GPUs (via HVM2)

(github.com)

1041 points · LightMachine · 2y ago · 253 comments

183 comments · 83 top-level

CorrectingYou2y ago· 12 in thread

OP comes around with some of the coolest things posted in HN recently, and all he gets is extensive criticism, when it is clear that this is an early version :/

imranq2y ago

I think HN is a community where people want to post something novel or new. When someone wants to post a kudos, most likely they'll upvote someone else instead of posting yet another "awesome job" (even if it is certainly warranted). Criticism, on the other hand, can be endlessly diverse, since there's usually only a limited number of ways to get it right but plenty of ways to get it wrong.

In the end, HN comments fall prey to this truth and you see a handful of positive comments, with the majority being criticisms or "I wish this did X". No one person is to blame. It's just the culture of technologists today.

eating5552y ago

I would really appreciate it if people criticized my project. That is how you grow. If people tended to hide the cruel truth behind applause, the world would just crumble.

diego_sandoval2y ago

My observation is that most criticism is useless, because people don't understand why you did things the way you did them.

If you explain why, they either still don't understand, or don't agree.

If the first iPhone had been presented on HN/Reddit/Twitter, everyone would criticize the lack of physical keyboard.

robocat2y ago

What you appreciate has little to do with whether we should assume others are thick-skinned. If someone has always been knocked down they will struggle to positively accept criticism regardless of how well meant it might be.

LightMachineOP2y ago

I really think I take criticism well... The problem is that people were criticizing us for not doing things that were literally done in the second paragraph. So at this point it didn't feel like productive criticism? That's like being criticized for being naked when you're fully clothed. How do you even make sense of that...

vitiral2y ago

It has 905 upvotes, it has received a fair share of positivity as well. Even criticism is often positive, since it expresses interest and engagement with the ideas and approach.

jules2y ago

Not criticizing new projects is a good social norm, because starting new and ambitious projects is good and should not be discouraged. However, criticizing projects that make misleading, unsubstantiated or false claims is also a good social norm, because it discourages people from making misleading, unsubstantiated or false claims.

swayvil2y ago

The coolest things are often the most difficult to understand.

Difficult to understand is often threatening.

Criticism is a popular response to threat and is the form of reply that requires the least understanding.

riku_iki2y ago

it also could be half cooked and that's why criticism arrives.

metadat2y ago

Correction for you - This is patently false, OP has had three hits -- this one, and two one hundred pointers out of 100-200 submissions.

P.s. it seems rather likely the op is Victor Taelin, they mostly submit his tweets and gists.

Who are you rooting for, exactly, newcomer?

P.p.s. Victor Taelin just happens to be the most recent committer on this submission, imagine that.

https://news.ycombinator.com/item?id=35363400

foota2y ago

We're a bit off-topic, but there's no requirement that your account be associated with your identity, especially when the op is pretty clearly involved with the project (as opposed to if they were claiming not to be or something).

LightMachineOP2y ago

I have no idea what you're trying to convey, but I'm Victor Taelin. Also very cool comment on that thread, hypothesizing on whether we'd be able to ever run it on GPUs. We did it! That is what we're announcing today.

ziedaniel12y ago· 12 in thread

Very cool idea - but unless I'm missing something, this seems very slow.

I just wrote a simple loop in C++ to sum up 0 to 2^30. With a single thread without any optimizations it runs in 1.7s on my laptop -- matching Bend's performance on an RTX 4090! With -O3 it vectorizes the loop to run in less than 80ms.

    #include <iostream>

    int main() {
      int sum = 0;
      for (int i = 0; i < 1024*1024*1024; i++) {
        sum += i; 
      }
      std::cout << sum << "\n";
      return 0;
    }

LightMachineOP2y ago

Bend has no tail-call optimization yet. It is allocating a 1-billion long stack, while C is just looping. If you compare against a C program that does actual allocations, Bend will most likely be faster with a few threads.

Bend's codegen is still abysmal, but these are all low-hanging fruit. Most of the work went into making the parallel evaluator correct (which is extremely hard!). I know that sounds like "trust me", but the single-thread performance will get much better once we start compiling procedures, generating loops, etc. It just hasn't been done.

(I wonder if I should have waited a little bit more before actually posting it)

jay-barronville2y ago

> (I wonder if I should have waited a little bit more before actually posting it)

No. You built something that’s pretty cool. It’s not done yet, but you’ve accomplished a lot! I’m glad you posted it. Thank you. Ignore the noise and keep cooking!

phkahler2y ago

>> Bend has no tail-call optimization yet.

I've never understood the fascination with tail calls and recursion among computer science folks. Just write a loop, it's what it optimises to anyway.
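
To make the comparison concrete, here is a minimal Python sketch of an accumulator-style tail-recursive sum next to the loop that a tail-call-optimizing compiler would effectively turn it into. The function names are made up for illustration, and CPython itself does not eliminate tail calls, so the recursive version still grows the stack here.

```python
def sum_to_tail(n, acc=0):
    # Tail-recursive form: the recursive call is the last thing the function does,
    # so a compiler with tail-call optimization could reuse the current stack frame.
    if n == 0:
        return acc
    return sum_to_tail(n - 1, acc + n)


def sum_to_loop(n):
    # The equivalent loop: the same state (n, acc) is updated in place
    # instead of being passed to a new call.
    acc = 0
    while n != 0:
        acc += n
        n -= 1
    return acc


print(sum_to_tail(500), sum_to_loop(500))
```

Both functions compute the same value; the difference is only in whether each step costs a stack frame, which is the cost being discussed for Bend above.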

nneonneo2y ago

If they’re low-hanging fruit, why not do that before posting about it publicly? All that happens is that you push yourself into a nasty situation: people get a poor first impression of the system and are less likely to trust you the second time around, and in the (possibly unlikely) event that the problems turn out to be harder than you expect, you wind up in the really nasty situation of having to deal with failed expectations and pressure to fix them quickly.

nneonneo2y ago

You might want to double check with objdump if the loop is actually vectorized, or if the compiler just optimizes it out. Your loop actually performs signed integer overflow, which is UB in C++; the compiler could legally output anything. If you want to avoid the UB, declare sum as unsigned (unsigned integer overflow is well-defined); the optimization will still happen but at least you’ll be guaranteed that it’ll be correct.
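
The size of the overflow can be checked with a short Python sketch. Python integers do not overflow, so the wraparound is modelled explicitly with a modulus, and the sketch assumes a 32-bit `unsigned` (typical, but not guaranteed by the C++ standard). The variable names are illustrative.

```python
n = 2**30
true_sum = n * (n - 1) // 2      # closed form for 0 + 1 + ... + (n - 1)
wrapped_u32 = true_sum % 2**32   # value a 32-bit unsigned accumulator would hold

print(true_sum)                  # 576460751766552576
print(true_sum.bit_length())     # 59, so it cannot fit in 32 bits
print(wrapped_u32)               # 3758096384
```

With a signed `int` there is no such defined answer, which is the point of the comment above.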

ziedaniel12y ago

I did make sure to check before posting.

Good point about the signed integer overflow, though!

molenzwiebel2y ago

If compiled with -O3 on clang, the loop is entirely optimized out: https://godbolt.org/z/M1rMY6qM9. Probably not the fairest comparison.

LightMachineOP2y ago

Exactly, this kind of thing always happens with these loops, which is why I think programs that allocate are fairer. But then people point out that the C allocator is terrible, so we can't make that point :')

ziedaniel12y ago

I used GCC and checked that it wasn't optimized out (which actually surprised me!)

rroriz2y ago

I think the point is that Bend is at a much higher level than C++. But to be fair: I also may be missing the point!

gslepak2y ago

The point is that Bend parallelizes everything that can be parallelized without developers having to do that themselves.

5-2y ago

here is the same loop finishing in one second on my laptop, single-threaded, in a very high-level language, q:

    q)\t sum til floor 2 xexp 30
    1031

Twirrim2y ago· 8 in thread

For what it's worth, I ported the sum example to pure python.

    def sum(depth, x):
        if depth == 0:
            return x
        else:
            fst = sum(depth-1, x*2+0) # adds the fst half
            snd = sum(depth-1, x*2+1) # adds the snd half
            return fst + snd

    print(sum(30, 0))

Under pypy3 it executes in 0m4.478s, single-threaded. Under Python 3.12 it executes in 1m42.148s, again single-threaded. I mention that because you include benchmark information:

    CPU, Apple M3 Max, 1 thread: 3.5 minutes
    CPU, Apple M3 Max, 16 threads: 10.26 seconds
    GPU, NVIDIA RTX 4090, 32k threads: 1.88 seconds

The bend single-threaded version has been running for 42 minutes on my laptop, is consuming 6GB of memory, and still hasn't finished (12th Gen Intel(R) Core(TM) i7-1270P, Ubuntu 24.04). That seems to be an incredibly slow interpreter. Has this been tested or developed on anything other than Macs / aarch64?

I appreciate this is early days, but it's hard to get excited about what seems to be incredibly slow performance from a really simple example you give. If the simple stuff is slow, what does that mean for the complicated stuff?

If I get a chance tonight, I'll re-run it with `-s` argument, see if I get anything helpful.

LightMachineOP2y ago

Running for 42 minutes is most likely a bug. Yes, we haven't done much testing outside of M3 Max yet. I'm aware it is 2x slower on non-Apple CPUs. We'll work on that.

For the `sum` example, Bend has a huge disadvantage, because it is allocating 2 IC nodes for each numeric operation, while Python is not. This is obviously terribly inefficient. We'll avoid that soon (just like HVM1 did it). It just wasn't implemented in HVM2 yet.

Note most of the work behind Bend went into making the parallel evaluator correct. Running closures and unrestricted recursion on GPUs is extremely hard. We've just finished that part, so basically zero effort has gone into micro-optimizations. HVM2's codegen is still abysmal. (And I was very clear about it in the docs!)

That said, please try comparing the Bitonic Sort example, where both are doing the same amount of allocations. I think it will give a much fairer idea of how Bend will perform in practice. HVM1 used to be 3x slower than GHC in a single core, which isn't bad. HVM2 should get to that point not far in the future.

Now, I totally acknowledge these "this is still bad but we promise it will get better!!" can be underwhelming, and I understand if you don't believe on my words. But I actually believe that, with the foundation set, these micro optimizations will be the easiest part, and performance will skyrocket from here. In any case, we'll keep working on making it better, and reporting the progress as milestones are reached.

vrmiguel2y ago

> it is allocating 2 IC nodes for each numeric operation, while Python is not

While that's true, Python would be using big integers (PyLongObject) for most of the computations, meaning every number gets allocated on the heap.

If we use a Python implementation that would avoid this, like PyPy or Cython, the results change significantly:

    % cat sum.py 
    def sum(depth, x):
        if depth == 0:
            return x
        else:
            fst = sum(depth-1, x*2+0) # adds the fst half
            snd = sum(depth-1, x*2+1) # adds the snd half
        return fst + snd

    if __name__ == '__main__':
        print(sum(30, 0))

    % time pypy sum.py
    576460751766552576
    pypy sum.py  4.26s user 0.06s system 96% cpu 4.464 total

That's on an M2 Pro. I also imagine the result in Bend would not be correct since it only supports 24 bit integers, meaning it'd overflow quite quickly when summing up to 2^30, is that right?

[Edit: just noticed the previous comment had already mentioned pypy]

> I'm aware it is 2x slower on non-Apple CPUs.

Do you know why? As far as I can tell, HVM has no aarch64/Apple-specific code. Could it be because Apple Silicon has wider decode blocks?

> can be underwhelming, and I understand if you don't believe on my words

I don't think anyone wants to rain on your parade, but extraordinary claims require extraordinary evidence.

The work you've done in Bend and HVM sounds impressive, but I feel the benchmarks need more evaluation/scrutiny. Since your main competitor would be Mojo and not Python, comparisons to Mojo would be nice as well.

Twirrim2y ago

Bitonic sort runs in 0m2.035s. Transpiled to C and compiled, it takes 0m0.425s.

The sum example, transpiled to C and compiled, takes 1m12.704s, so it looks like it's just the VM case that is having serious issues of some description!

glitchc2y ago

I have no dog in this fight, but feel compelled to defend the authors here. Recursion does not test compute, rather it tests the compiler's/interpreter's efficiency at standing up and tearing down the call stack.

Clearly this language is positioned at using the GPU for compute-heavy applications and it's still in its early stages. Recursion is not the target application and should not be a relevant benchmark.

fulafel2y ago

The term "thread" means different things on GPUs and CPUs; on GPUs it's more like a SIMD lane. It's a bit like how ISPC can compile your code so that there are, e.g., 32 invocations of your function per CPU thread running at the same time (if you're using 16-bit datums on AVX512), and you could have 2048 executions going on after multiplying up 32 cores * 2 SMT threads/core * 32 compiler executions.

tinyspacewizard2y ago

Python is really bad at recursion (part of why it's not appropriate for functional programming), so perhaps an unfair benchmark?

A Pythonic implementation would use loops and mutation.
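
As a rough sketch of what that could look like (the function names are illustrative, not from the thread), the recursion can be replaced by an explicit stack, or by a plain loop once you notice that the leaves are exactly the integers 0 to 2^depth - 1:

```python
def sum_iterative(depth):
    # An explicit stack of (remaining_depth, x) pairs replaces the call stack.
    total = 0
    stack = [(depth, 0)]
    while stack:
        d, x = stack.pop()
        if d == 0:
            total += x                      # leaf: contributes its index
        else:
            stack.append((d - 1, x * 2 + 0))
            stack.append((d - 1, x * 2 + 1))
    return total


def sum_range(depth):
    # The leaves are exactly 0 .. 2**depth - 1, so a plain loop also works.
    total = 0
    for x in range(2 ** depth):
        total += x
    return total


print(sum_iterative(10), sum_range(10))
```

A small depth is used here on purpose: depth 30 means about a billion leaves, which is slow in CPython either way.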

metadat2y ago

Why `+0`, is this not a pointless no-op?

pests2y ago

Yes, but when looking at the source it's more obvious this is a repeating pattern.

"Hey, I'm accessing the 0th element here, just want to make that clear"

Without the +0, that statement looks disconnected from the +1 even though conceptually it's the same.

Say somebody adds some special marker/tombstone/whatever into element 0 and now all those additions need to be bumped up by one. Someone else may go and see the +1, +2, +3 and just change them to +2, +3, +4, etc. while completely missing the lone variable by itself, as it's visually dissimilar.

I've usually seen it used in longer lists of statements. It also keeps everything lined up formatting-wise.

davidw2y ago· 6 in thread

As a resident of Bend, Oregon... it was kind of funny to read this and I'm curious about the origin of the name.

developedby2y ago

Bending is an operation similar to folding, both in real life and in the language. While fold is recursive on data, bend is recursive on a boolean condition (like a pure while that supports multiple branching recursion points).

I was actually looking forward to seeing someone from Bend make a comment like this

bytK72y ago

As a fellow resident of Bend I felt the same way when I saw this.

noumenon11112y ago

As a native Bendite but not current Bend resident, seeing that word with a capital letter always makes me smell juniper and sagebrush a little bit.

alex_lav2y ago

Totally off topic but I'll be driving there later this afternoon. Hoping it's as beautiful as last time!

davidw2y ago

If you're going to be here for a bit (I am heading out of town on a bike trip for a few days), always happy to grab a beer with fellow HN people!

blinded2y ago

Thought the same thing!

anentropic2y ago· 5 in thread

I remember seeing HVM on here a year or two back when it came out and it looked intriguing. Exciting to see something being built on top of it!

I would say that the play on words that gives the language its name ("Bend") doesn't really make sense...

https://github.com/HigherOrderCO/bend/blob/main/GUIDE.md

> Bending is the opposite of folding. Whatever fold consumes, bend creates.

But in everyday language bending is not the opposite of folding, they are more or less the same thing. Why not "unfold", which also has a connotation of "the process of happening" as well as merely the opposite of folding?

I have a question about the example code and output for bending:

    type Tree:
      Node { ~lft, ~rgt }
      Leaf { val }

    def main():
      bend x = 0:
        when x < 3:
          tree = Tree/Node { lft: fork(x + 1), rgt: fork(x + 1) }
        else:
          tree = Tree/Leaf { val: 7 }
      return tree

    tree = fork(0)
    tree = ![fork(1), fork(1)]
    tree = ![![fork(2),fork(2)], ![fork(2),fork(2)]]
    tree = ![![![fork(3),fork(3)], ![fork(3),fork(3)]], ![![fork(3),fork(3)], ![fork(3),fork(3)]]]
    tree = ![![![7,7], ![7,7]], ![![7,7], ![7,7]]]

Where does the initial "tree = fork(0)" come from?

redbar0n2y ago

`bend` is a convenience syntax that "just creates an in-place recursive function, immediately calls it with an initial state, and then assigns the end result to a local variable" ... "in a single statement, rather than needing to name a separate external auxilliary function that you'll only use once", according to the original author on Twitter: https://x.com/VictorTaelin/status/1791964640533958924 and https://x.com/VictorTaelin/status/1791996185449791932

But contrary to this, I think explicitly separating function declaration and function calling, in the following kind of syntax, would make it much clearer and less complected where the initial condition `tree = fork(0)` comes from. In the original example it came from `bend x = 0`, but here the function declaration is separate and the call more explicit: so it more obviously comes from `createTree(0)`:

    type Tree
      Branch { left, right }
      Leaf { value }

    def main():

      createTree(x):
        x < 3 ?
          Tree.Branch { left: createTree(x+1), right: createTree(x+1) }
        Tree.Leaf { value: 7 }

      createTree(0)

Besides not needing a local variable `tree` here, the unique thing here is the elimination of the else-clause, to reduce unnecessary nesting, and a rule that the language just early returns the last result of any nested condition. If it doesn't go into any nested condition, then it just returns the last result in the main function body (like Ruby). Without any `return` keywords needed in either case. Wouldn't this be quite beautiful?

barfbagginus2y ago

Re: name. Fold and bend are indeed called fold and unfold in Haskell and traditional functional programming literature.

I wonder if bend has to do with how we manipulate the computation's interaction graph while evaluating a bend. There might be some bending of wires!

Re: code example

In the code example, x=0 is the seed value. tree = fork(0) must mean "fork off to evaluate the bend at the seed value". In that first fork, we fork twice with the value x=1, to get the left and right subtrees of the top level node. We then fork four instances of x=2, eight instances of x = 3, and finally get our balanced binary tree with eight 7s.

Note this is guesswork. I don't know what the ![a, b] syntax means, and I haven't read much of the guide.

Appendix: Notes on Fold Vs Bend

I wrote these for an earlier draft while reminding myself about these operations. I include them more for my benefit, and in case they help you or the audience.

Fold and bend are categorical duals, aka catamorphisms and anamorphisms. One takes a monadic value and reduces it into an ordinary value. The other takes an ordinary value and expands it into a comonadic value.

Fold starts with a value in an inductive data type, and then replaces its constructors with a function. For example it takes a list (1:2):3, and replaces the constructor : with the function `+`, to get (1+2)+3 = 6

Bend starts with a seed value and a function taking values into constructor expressions for a coinductive data type. It then grows the seed into a potentially infinite AST. For example the seed value 1 and the function f(x:xs) = (x+1) : (x:xs) gives us the infinite lazy list [1, 2, 3, ...]
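
A small Python sketch of the two directions may help. It is only an illustration of the general fold/unfold idea, not of how Bend implements either; trees are modelled as nested tuples and all names are made up for the example.

```python
def unfold_tree(seed, keep_growing, step, make_leaf):
    # Unfold ("bend"): grow a binary tree from a seed value.
    if keep_growing(seed):
        return (unfold_tree(step(seed), keep_growing, step, make_leaf),
                unfold_tree(step(seed), keep_growing, step, make_leaf))
    return make_leaf(seed)


def fold_tree(tree, on_node, on_leaf):
    # Fold: replace each constructor (tuple = node, anything else = leaf)
    # with a function, collapsing the tree into a single value.
    if isinstance(tree, tuple):
        left, right = tree
        return on_node(fold_tree(left, on_node, on_leaf),
                       fold_tree(right, on_node, on_leaf))
    return on_leaf(tree)


# Same shape as the guide's example: grow while x < 3, leaves hold 7.
tree = unfold_tree(0, lambda x: x < 3, lambda x: x + 1, lambda x: 7)
total = fold_tree(tree, lambda l, r: l + r, lambda v: v)
print(tree)
print(total)
```

The unfold produces the balanced tree with eight 7s described above, and the fold consumes it again by summing the leaves.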

anentropic2y ago

The question that comes to me is: can I use fork(x) outside of a bend?

Seems like probably not, there doesn't seem to be enough information in the 'argument' to this 'function' to do anything useful without the implicit context of the bend construct.

For that reason I think I'd prefer it if fork was a keyword (like 'bend' and 'when') rather than a 'function', just at the surface syntax level to give a clue it is something special.

I guess fork is a kind of 'magic' function that represents the body of the bend. It's a bit like a 'self' or 'this'.

At the moment this syntax is in a weird half-way point ...the underlying concept is necessarily functional but it's trying to look kind of like an imperative for-loop still.

I wonder if we couldn't just explicitly create a 'bendable' recursive function that can be 'bent' by calling it. But I guess it's like this because it needs to be tightly constrained by the 'when' and 'else' forms.

TBH the more I look at this example the more confusing it is. The other part I wonder about is the assigning of new values to tree var... can I set other local vars from outside the bend scope? I don't think so, I guess it'd be a syntax error if the var names assigned in the 'when' and 'else' clauses didn't match?

Again it's sort of overloading an imperative-looking syntax to implicitly do the 'return' from the implicit recursive function.

Later on there is this example:

    def render(depth, shader):
      bend d = 0, i = 0:
        when d < depth:
          color = (fork(d+1, i*2+0), fork(d+1, i*2+1))
        else:
          width = depth / 2
          color = shader(i % width, i / width)
      return color

And here I wonder - does 'width' have a value after the bend? Or it's only the last assignment in each clause that is privileged?

That's an odd mix in a language which otherwise has explicit returns like Python.

If so I wonder if a syntax something like this might be clearer:

    def render(depth, shader):
      bend color with d = 0, i = 0:
        when d < depth:
          yield (fork(d+1, i*2+0), fork(d+1, i*2+1))
        else:
          width = depth / 2
          return shader(i % width, i / width)
      return color

i.e. name the return var once in the bend itself, yield intermediate values (to itself, recursively) and return the final state.

developedby2y ago

The first `fork` is from using bend and passing the initial state

  The program above will initialize a state (`x = 0`), and then, for as long as `x < 3`,
  it will "fork" that state in two, creating a `Tree/Node`, and continuing with `x + 1`.
  When `x >= 3`, it will halt and return a `Tree/Leaf` with `7`.
  When all is done, the result will be assigned to the `tree` variable:

anentropic2y ago

I would have described the logic in the exact same way, but I still don't see where initial tree = fork(0) state comes from

all the other "fork"s in the output are produced explicitly by:

    Tree/Node { lft: fork(x + 1), rgt: fork(x + 1) }
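
One way to read it, going by the description of `bend` as an in-place recursive function that is immediately called with the initial state, is the following Python sketch. This is a hypothetical desugaring for illustration only, not Bend's actual implementation; dictionaries stand in for `Tree/Node` and `Tree/Leaf`.

```python
def main():
    # Hypothetical desugaring: `bend x = 0:` becomes a local recursive function...
    def fork(x):
        if x < 3:                                            # the `when` branch
            return {"lft": fork(x + 1), "rgt": fork(x + 1)}  # Tree/Node
        else:                                                # the `else` branch
            return {"val": 7}                                # Tree/Leaf

    # ...which is called once with the initial state. This call is the
    # `tree = fork(0)` line at the top of the trace.
    tree = fork(0)
    return tree


def count_leaves(t):
    # Helper for the example only: count the Tree/Leaf entries.
    return 1 if "val" in t else count_leaves(t["lft"]) + count_leaves(t["rgt"])


print(count_leaves(main()))
```

Under this reading the initial `fork(0)` is never written out in the source because the `bend x = 0` header stands for both the definition and the first call.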

yetihehe2y ago· 4 in thread

Bend looks like a nice language.

> That's a 111x speedup by doing nothing. No thread spawning, no explicit management of locks, mutexes. We just asked bend to run our program on RTX, and it did. Simple as that. Note that, for now, Bend only supports 24-bit machine ints (u24), thus, results are always mod 2^24.

Ahh, not even 32-bit? Hmm, that seems pretty arbitrary for someone not accustomed to GPUs and wanting to solve some problems requiring 64 bits (gravitational simulation of solar system at millimeter resolution could use ~58bit ints for position).
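
For anyone wondering what "results are always mod 2^24" means in practice, here is a tiny Python model of it. The helper name is made up, and this only mimics the wraparound arithmetic, not Bend's runtime.

```python
MASK_U24 = (1 << 24) - 1   # keep only the low 24 bits


def add_u24(a, b):
    # Addition modulo 2**24, modelling the wraparound described in the quote.
    return (a + b) & MASK_U24


print(add_u24(MASK_U24, 1))              # 0: the largest u24 value wraps around
print(add_u24(10_000_000, 10_000_000))   # 3222784, i.e. 20_000_000 - 2**24
```

Anything above 16,777,215 silently wraps, so sums over large ranges need wider numbers to be meaningful.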

LightMachineOP2y ago

We will have 64-bit boxed numbers really soon! As in, next month, or earlier if users find this to be a higher priority.

yetihehe2y ago

What other types are you planning? Maybe some floats (even if only on cpu targets, would be nice).

Archit3ch2y ago

Is there a platform with native hardware u64? Maybe some FPGA?

Archit3ch2y ago

Sorry, meant u24.

KingOfCoders2y ago· 4 in thread

The website claims "automatically achieves near-ideal speedup"

12x for 16x threads

51x for 16,000x threads

Can someone point me to a website where it explains that this is the "ideal speedup"? Is there a formula?
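
For reference, the usual definitions are speedup S(p) = T(1) / T(p) and parallel efficiency S(p) / p, with "ideal" meaning linear speedup (efficiency of 1). Amdahl's law gives an upper bound when only part of the work can run in parallel. Below is a minimal Python sketch of both; the function names and the 95% parallel fraction are illustrative assumptions, not figures from the Bend site.

```python
def efficiency(speedup, workers):
    # Parallel efficiency: fraction of the ideal (linear) speedup actually achieved.
    return speedup / workers


def amdahl_speedup(parallel_fraction, workers):
    # Amdahl's law: upper bound on speedup when only part of the work parallelizes.
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


print(efficiency(12, 16))          # 0.75
print(efficiency(51, 16000))       # far below 1
print(amdahl_speedup(0.95, 16))    # bound for a hypothetical 95%-parallel program
```

These formulas assume every worker is equally fast, so they only compare cleanly within one kind of hardware.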

lmeyerov2y ago

Bend is intriguing --

1. Some potentially useful perspectives:

* Weak scaling vs strong scaling: https://www.kth.se/blogs/pdc/2018/11/scalability-strong-and-... ?

* ... Strong scaling, especially comparing to a modern sequential baseline, seems to be where folks are noting the author still has some work to do wrt getting to ideal speedups for what performance people care about

* There are parallel models of computation like PRAM for describing asymptotically idealized speedups of trickier aspects of parallel code like heap usage. Bend currently seems to do a lot of stack allocations that someone writing in most parallel systems wouldn't, and the asymptotic slowdowns would show up in these models, eg, asymptotically many unnecessary heap/stack data movements. There are a lot of these models, which are useful for being precise when making ideal speedup claims. NUMA, network topology, etc. ("Assume everyone is a sphere and...")

2. The comparisons I'd really like to see are:

* cudf, heavy.ai: how does it compare to high-level python dataframe and SQL that already run in GPUs? How is perf, and what programs do you want people to be able to write and they cannot?

* Halide and other more general purpose languages that compile to GPUs that seem closer to where Bend is going

FWIW, it's totally fine to compare to other languages.

Instead of showing it is beating everywhere, or saying ideal speedups and no comparisons, show where it is strong vs weak compared to others and diff tasks, especially progression across different releases (bend1 vs 2 vs ...), and let folks decide. There is some subset of tasks you already care about, so separate those out and show the quality you get in them, so people know what it looks like when you care + happy path. The rest becomes 'if you need these others, stay clear for now, check in again later as we know and are tracking.' Being clear that wall clock time can be slow and performance per watt can be wasteful is OK, you are looking for early adopters, not OpenAI core engineers.

KingOfCoders2y ago

Thanks for the long reply.

LightMachineOP2y ago

This is on CPU vs GPU.

A GPU core (shading unit) is 100x weaker than a CPU core, thus the difference.

On the GPU, HVM's performance scales almost 16000x with 16000x cores. Thus the "near ideal speedup".

Not everyone knows how GPUs work, so we should have been more clear about that!

andersa2y ago

It's not.

KeplerBoy2y ago· 4 in thread

What's going on with the super-linear speedup going from one thread to all 16?

210 seconds (3.5 minutes) to 10.5 seconds is a 20x speedup, which isn't really expected.

LightMachineOP2y ago

the single-thread case ran a little slower than it should on this live demo due to a mistake on my part: `run` redirected to the Rust interpreter, rather than the C interpreter. the Rust one is a little bit slower. the numbers on the site and on all docs are correct though, and the actual speedup is ~12x, not ~16x.

KeplerBoy2y ago

Thanks for the explanation and the cool project.

I will give bend a shot on some radar signal processing algorithms.

LightMachineOP2y ago

I apologize, I gave you the wrong answer.

I thought you were talking about the DEMO example, which ran ~30% slower than expected. Instead, you were talking about the README, which was actually incorrect. I noticed the error and edited it. I explained the issue in another comment.

byteknight2y ago

It's possible to see such scaling if any level of cache or I/O is involved.

vegadw2y ago· 3 in thread

A lot of negativity in these threads. I say ~cudas~ kudos to the author for getting this far! The only similar project I'm aware of is Futhark, and that has Haskell-y syntax: great for some people, but to the general class of C/C++/Python/JS/Java/etc. devs pretty arcane and hard to work with. My biggest complaint with this is that, unlike Futhark, it only targets CUDA or multi-core. Futhark can target OpenCL, CUDA, ISPC, HIP, single-core CPU, or multi-core CPU. I'm certain the performance problems others are pointing out can be tackled.

neonsunset2y ago

Take a look at ILGPU. It's very nice and has been around for a long time! (just no one knows about it, sadly)

Short example: https://github.com/m4rs-mt/ILGPU/blob/master/Samples/SimpleM...

Supports even advanced bits like inline PTX assembly: https://github.com/m4rs-mt/ILGPU/blob/master/Samples/InlineP...

pjmlp2y ago

Chapel has a decent use in HPC.

Also NVidia has sponsored variants of Haskell, .NET, Java, Julia on CUDA, have a Python JIT and are collaborating with Mojo folks.

MarcusE1W2y ago

ParaSail also goes in that direction: https://github.com/parasail-lang/parasail

It was made by Tucker Taft, the designer of Ada since 1995. Some of the parallel features of ParaSail made it into Ada 2022.

praetor222y ago· 3 in thread

Look, I understand the value proposition and how cool it is from a theoretical standpoint, but I honestly don't think this will ever become relevant.

Here are some notes from my first impressions and after skimming through the paper. And yes, I am aware that this is very very early software.

1. Bend looks like an extremely limited DSL. No FFI. No way of interacting with raw buffers. Weird 24bit floating point format.

2. There's a reason why ICs are not relevant: performance is and will always be terrible. There is no other way to put it, graph traversal simply doesn't map well on hardware.

3. The premise of optimal reduction is valid. However, you still need to write the kernels in a way that can be parallelized (ie. no data dependencies, use of recursion).

4. There are no serious examples that directly compare Bend/HVM code with its equivalent OMP/CUDA program. How am I supposed to evaluate the reduction in implementation complexity and what to expect on performance? So many claims, so few actual comparisons.

5. In the real world of high performance parallel computing, tree-like structures are non-existent. Arrays are king. And that's because of the physical nature of how memory works on a hardware level. And do you know what works best on mutable contiguous memory buffers ? Loops. We'll see when HVM will implement this.

In the end, what we currently have is a half-baked language that is (almost) fully isolated from external data, extremely slow, and a massive abstraction over the underlying hardware (unutilised features: multilevel caches, tensor cores, SIMD, atomics).

I apologize if this comes across as harsh, I still find the technical implementation and the theoretical background to be very interesting. I'm simply not (yet) convinced of its usefulness in the real world.

LightMachineOP2y ago

Thanks for the feedback. Some corrections:

We do use multi-level caching, and you can achieve 5x higher performance by using it correctly. FFI is already implemented, just not published, because we want to release it with graphics rendering, which I think will be really cool. Haskell/GHC uses a graph and trees too, and nobody would say it is not practical or useful. And while it is true that arrays are king, there are many SOTA algorithms that are implemented in Haskell (including compilers, type-checkers, solvers) because they do not map well to arrays at all.

The main reason ICs are not fast is that nobody has ever done low-level optimization work on them. All previous implementations were terribly inefficient. And my own work is too, because I spent all my time so far trying to get it to run *correctly* on GPUs, which was very hard. As you said yourself, there aren't even loops yet. So, how can we solve that? By adding the damn loops! Or do you think there is some inherent limitation preventing us from doing that? If you do, you'll be surprised.

HVM2 is finally a correct algorithm that scales. Now we'll optimize it for the actual low-level performance.

mst2y ago

> HVM2 is finally a correct algorithm that scales.

This, I think, is the key thing people are missing.

Maybe your low level performance will never be as good as hoped, but for this sort of task, "the parallelisation part works and produces correct results" might not be sufficient but is absolutely necessary, and any optimisation work done before that has such a high probability of having to be thrown away that under similar circumstances I wouldn't bother in advance either.

physicsguy2y ago

Re: 5, trees are fairly widely used (though not as most CS people would implement them) with Morton or H index ordering in things like the Fast Multipole and Barnes-Hut algorithms, which reduce O(n^2) pairwise ops to O(n) and O(n log n) respectively. BH is more common in astro, FMM in chemical molecular dynamics.
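For readers unfamiliar with Morton ordering, here is a minimal Python sketch of the 2D case. It is only an illustration of the general idea (interleaving coordinate bits so that spatially close cells get nearby indices); the function name `morton2d` and the choice of putting x on the even bits are assumptions made for this example, not something taken from any FMM or Barnes-Hut code base.

```python
def morton2d(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one Z-order index.

    Assumption for this sketch: x occupies the even bit positions and
    y the odd ones. Real codes often use bit tricks instead of a loop.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)       # bit i of x -> position 2i
        code |= ((y >> i) & 1) << (2 * i + 1)   # bit i of y -> position 2i+1
    return code


# Sorting grid cells by their Morton index lays a quadtree out in a flat array.
cells = [(x, y) for y in range(2) for x in range(2)]
cells.sort(key=lambda c: morton2d(*c))
print(cells)
```

Sorting particles by such a key tends to keep each tree node's children contiguous in memory, which is the array-friendly tree layout the comment alludes to.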

delu2y ago· 3 in thread

Ten years ago, I took a course on parallel algorithms (15-210 at CMU). It pitched parallelism as the future of computing as Moore's law would hit inevitable limits. I was sold and I was excited to experiment with it. Unfortunately, there weren't many options for general parallel programming. Even the language we used for class (SML) wasn't parallel (there was a section at the end about using extensions and CUDA but it was limited from what I recall).

Since then, I was able to make some experiments with multithreading (thanks Rust) and getting very creative with shaders (thanks Shadertoy). But a general parallel language on the GPU? I'm super excited to play with this!

shwestrick2y ago

Nowadays 210 is actually parallel! You can run 210-style code using MaPLe (https://github.com/MPLLang/mpl) and get competitive performance with respect to C/C++.

If you liked 210, you might also like https://futhark-lang.org/ which is an ML-family language that compiles to GPU with good performance.

amelius2y ago

Huh, the Maple name is already used by a well known computer algebra project.

https://en.wikipedia.org/wiki/Maple_(software)

Rodeoclash2y ago

The trend towards multiple cores in machines was one of the reasons I decided to learn Elixir.

xiaodai2y ago· 3 in thread

Looks cool but what's one toy problem that it can solve more efficiently than others?

JackMorgan2y ago

Here is an example of it summing a huge set of numbers 100x faster than in C.

https://github.com/HigherOrderCO/bend/blob/main/GUIDE.md#par...

ashdnazg2y ago

Note that it's not 100x faster than C, but than bend running on one CPU thread.

Running the equivalent C code takes ~2.3 seconds on my machine. Same order of magnitude as bend on the beefy GPU.

mjaniczek2y ago

This is unproven (and not a toy problem), but I imagine it's going to do pretty well at compilers. The amount of time I'm waiting at work, hypnotizing the tsc process that sits at 100% CPU, wishing it was parallel...

anonzzzies2y ago· 3 in thread

What is the terminal used for that demo? https://github.com/HigherOrderCO/Bend does it just skip commands it cannot execute or?

LightMachineOP2y ago

It was actually just me recording iTerm2 with OBS. The theme is Solarized Light. What do you mean by skip commands?

anonzzzies2y ago

Ah just ctrl-c was it? Sometimes I just think way too difficult. Keep up the good work!

mcintyre19942y ago

They're probably hitting ctrl+c at the end of the lines they don't want to run, that's telling the terminal "cancel that" but it'll usually just go to the next line and leave what you typed in place, like in this video.

animaomnium2y ago· 2 in thread

Fala Taelin, nice work! Does HVM2 compile interaction nets to e.g. spirv, or is this an interpreter (like the original HVM) that happens to run on the GPU?

I ask because a while back I was messing around with compiling interaction nets to C after reducing as much of the program as possible (without reducing the inputs), as a form of whole program optimization. Wouldn't be too much harder to target a shader language.

Edit: Oh I see...

> This repository provides a low-level IR language for specifying the HVM2 nets, and a compiler from that language to C and CUDA HVM

Will have to look at the code then!

https://github.com/HigherOrderCO/HVM

Edit: Wait nvm, it looks like the HVM2 cuda runtime is an interpreter, that traverses an in-memory graph and applies reductions.

https://github.com/HigherOrderCO/HVM/blob/5de3e7ed8f1fcee6f2...

I was talking about traversing an interaction net to recover a lambda-calculus-like term, which can be lowered to C a la lisp in small pieces with minimal runtime overhead.

Honestly the motivation is, you are unlikely to outperform a hand-written GPU kernel for like ML workloads using Bend. In theory, HVM could act as glue, stitching together and parallelizing the dispatch order of compute kernels, but you need a good FFI to do that. Interaction nets are hard to translate across FFI boundaries. But, if you compile nets to C, keeping track of FFI compute kernel nodes embedded in the interaction network, you can recover a sensible FFI with no translation overhead.

The other option is implementing HVM in hardware, which I've been messing around with on a spare FPGA.

LightMachineOP2y ago

It is an interpreter that runs on GPUs, and a compiler to native C and CUDA. We don't target SPIR-V directly, but aim to. Sadly, while the C compiler results in the expected speedups (3x-4x, and much more soon), the CUDA runtime didn't achieve substantial speedups, compared to the non-compiled version. I believe this is due to warp-divergence: with non-compiled procedures, we can actually merge all function calls into a single "generic" interpreted function expander that can be reduced by warp threads without divergence. We'll be researching this more extensively looking forward.

animaomnium2y ago

Oh that's cool! Interested to see where your research leads. Could you drop me a link to where the interaction net → cuda compiler resides? I skimmed through the HVM2 repo and just read the .cu runtime file.

Edit: nvm, I read through the rest of the codebase. I see that HVM compiles the inet to a large static term and then links against the runtime.

https://github.com/HigherOrderCO/HVM/blob/5de3e7ed8f1fcee6f2...

Will have to play around with this and look at the generated assembly, see how much of the runtime a modern c/cu compiler can inline.

Btw, nice code, very compact and clean, well-organized, easy to read. Rooting for you!

MrLeap2y ago· 2 in thread

This is incredible. This is the kind of work we need to crack open the under utilized GPUs out there. I know LLMs are all the rage, but there's more gold in them hills.

anon2912y ago

Except... it's not. Coming from a Haskell background and following the author since the early days, I think his work is excellent w.r.t Interaction Combinators and Nets. However, to do LLM work you need to cooperate with the chip, which means doing things in the manner most expeditious to the intricacies of Computer Architecture. That's not what this does. I don't see how Bend would modify its runtime to take advantage of all the things that modern GPU-based BLAS implementations do (which is what I currently do), but would love to be surprised.

As a whole, the speedups claimed are not actually that great. Going from 1 core to 16k cores increases performance by 50x. That's not actually very good.

Like, I really truly love what the author has contributed to functional languages and Interaction Nets. He has good ideas, but while it's cool that this can be done, things like LLMs require very practical tuning.

Finally, the author has a history of making fantastical claims. Again, it's true there is a speedup, but in my view, this is like making an extremely slow language and then optimizing it and then announcing that you've figured out how to improve your language's performance by 50x. While true, it neglects the fact it was very slow to begin with.

LightMachineOP2y ago

You're comparing CPU cores to GPU cores!

It is "only" 50x because a single GPU core is 100x weaker than a CPU core!

Within CUDA cores, it is actually a linear speedup! It does 2k MIPS with 1 CUDA core, and ~28000 MIPS with 16k CUDA cores. If we double the performance of single-core GPU evaluation, we almost double the performance with 16k cores!

api2y ago· 2 in thread

Oh wow do I wish this existed when I was playing with evolutionary computation and genetic algorithms in college…

zackmorris2y ago

Me too, now you see why they never took off.

api2y ago

They never took off because we discovered, to our surprise to some extent, that gradient descent through back propagation works better than expected if you give it the right learning media and the right input and output encodings. It took a ton of fiddling ("graduate student descent") to figure those out.

Back then everyone thought it was doomed to get stuck at local minima, but it turns out that has a lower probability of happening if the search space has enough dimensions. It works well enough to make the sand talk back to us and now that particular design has sucked all the air out of the room.

Nobody has tried EC at anywhere near the scale of GPTs/LLMs because that amount of compute is expensive and at this point we know those will at least work.

I still think EC is fascinating and would love to play with it some more at some point, maybe trying it combined with back propagation in novel ways. Compute only gets cheaper.

yetihehe2y ago· 2 in thread

Wow, Bend looks like a nice language.

trenchgun2y ago

> Ahh, not even 32bit?

This is a proof of concept version which focuses on the provable correctness of the parallel compiler.

GGO2y ago

64bit coming soon

anonzzzies2y ago· 2 in thread

Maybe I missed it, but there seems to be no license attached to HVM2, nor to Bend or Kind?

drtournier2y ago

https://x.com/VictorTaelin/status/1791241244468806117

tekknolagi2y ago

(Taelin says will likely be MIT or similar)

egnehots2y ago· 2 in thread

the interesting comparison nowadays would be against mojo:

https://www.modular.com/max/mojo

ZitchDog2y ago

I think this is quite different- I don’t think mojo runs on the GPU unless I am mistaken.

witherk2y ago

Being able to compile to different hardware including GPUs and TPUs seems to be one of the core goals of Mojo, based on what Chris Lattner was saying in his Lex Fridman interview. It doesn't seem to come up much on Modular's website though, so I can see why you would think that.

highfrequency2y ago· 2 in thread

> CPU, Apple M3 Max, 1 thread: 3.5 minutes

> CPU, Apple M3 Max, 16 threads: 10.26 seconds

Surprised to see a more than linear speedup in CPU threads. What’s going on here?

LightMachineOP2y ago

I believe the single-core version was running slower due to the memory getting full. The benchmark was adding 2^30 numbers, but HVM2 32-bit has a limit of 2^29 nodes. I've re-run it with 2^28 instead, and the numbers are `33.39 seconds` (1 core) vs `2.94 seconds` (16 cores). You can replicate the benchmark on an Apple M3 Max. I apologize for the mistake.
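To make the two sets of figures easier to compare, here is a small Python sketch that computes speedup and parallel efficiency from the timings quoted in this sub-thread. The helper names `speedup` and `efficiency` are made up for this example; the only inputs are the numbers stated above (3.5 minutes vs 10.26 seconds, and 33.39 seconds vs 2.94 seconds, both on 16 threads).

```python
def speedup(t_serial, t_parallel):
    """How many times faster the parallel run is than the serial run."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, workers):
    """Speedup divided by worker count; 1.0 would be perfectly linear."""
    return speedup(t_serial, t_parallel) / workers


# Original README figures: 3.5 minutes on 1 thread, 10.26 s on 16 threads.
original = speedup(3.5 * 60, 10.26)
# Re-run with 2^28 numbers: 33.39 s on 1 core, 2.94 s on 16 cores.
rerun = speedup(33.39, 2.94)

print(f"original: {original:.2f}x, efficiency {efficiency(3.5 * 60, 10.26, 16):.2f}")
print(f"re-run:   {rerun:.2f}x, efficiency {efficiency(33.39, 2.94, 16):.2f}")
```

The original figures work out to roughly 20.5x on 16 threads (efficiency above 1, hence the "more than linear" question), while the re-run gives roughly 11.4x, which is sub-linear as one would normally expect.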

Archit3ch2y ago

More cores = more caches?

abeppu2y ago· 2 in thread

I for one found the 'how is this possible' video near the bottom of the page to be unhelpful:

- surely for `3 x 3 = 9`, there is some concept of primitive operations?

- I get that replacement of patterns in a graph can be done in parallel, but (a) identifying when a rewrite rule should apply and (b) communicating the state of the updated graph to worker threads and (c) organizing worker threads to agree on which does each task all take some effort. When is this more work than the original computation (as in the 3x3 example)?

jiehong2y ago

The flip with Chinese characters in the middle tripped me. I guess they wanted to look like “complicated”…

nojvek2y ago

3 x 3 seemed like a pretty bad example to show how they parallelize.

chc42y ago· 2 in thread

24bit integers and floats, no array datatype, and a maximum 4GB heap of nodes are very harsh restrictions, especially for any workloads that would actually want to be running on a GPU. The limitations in the HVM2 whitepaper about unsound evaluation around closures and infinite loops because it is evaluating both sides of a conditional without any short circuiting are also extremely concerning.

Before you reply "these are things we can address in the future": that doesn't matter. Everyone can address everything in the future. They are currently hard technical barriers to its use, with no way of knowing the level of effort that will require or the knock-on effects, especially since some of these issues have been "we can fix that later" for ten years.

I also highly recommend changing your benchmark numbers from "interactions per second" to a standard measurement like FLOPS. No one else on earth knows how many of those interactions are pure overhead from your evaluation semantics, and not doing useful work. They come across as attempting to wow an audience with high numbers and not communicating an apples to apples comparison with other languages.

LightMachineOP2y ago

So use a metric that makes absolutely no sense in the given domain, instead of one that is completely correct, sensible, accurate, established in the literature, and vastly superior in context? What even is a FLOPS in the context of Interaction Net evaluation? These things aren't even interchangeable.

hahajahen2y ago

The fact that you don’t know the answer to this question, and don’t even seem to think it is relevant, is chilling.

People want to be able to ground your work—which you are claiming is the “parallel future of computation”—in something familiar. Insulting them and telling them their concerns are irrelevant just isn’t going to work.

I would urge you to think about what a standard comparison versus Haskell would look like. Presumably it would be something that dealt with a large state space, but also top down computation (something you couldn’t easily do with matrices). Big examples might include simply taking a giant Haskell benchmark (given the setting of inets it seems like a natural fit) that is implemented in a fairly optimal way—-both algorithmically and also wrt performance—-and compare directly on large inputs.

Sorry to trash on you here, not trying to come across as insulting, but I agree that “reductions per second” is meaningless without a nuanced understanding of the potentially massive encoding blowup that compilation introduces.

We want to believe, but the claims here are big

light_hue_12y ago· 2 in thread

Massive promises of amazing performance but they can't find one convincing example to showcase. It's hard to see what they're bringing to the table when even the simplest possible Haskell code is just as fast on my 4 year old laptop with an ancient version of GHC (8.8). No need for an RTX 4090.

   module Main where
   
   sum' :: Int -> Int -> Int
   sum' 0 x = x
   sum' depth x = sum' (depth - 1) ((x * 2) + 0) + sum' (depth - 1) ((x * 2) + 1)
   
   main = print $ sum' 30 0

Runs in 2.5s. Sure it's not on a GPU, but it's faster! And things don't get much more high level.

If you're going to promise amazing performance from a high level language, I'd want to see a comparison against JAX.

It's an improvement over traditional interaction nets, sure! But interaction nets have always been a failure performance-wise. Interaction nets are the PL equivalent of genetic algorithms in ML: they sound like a cool idea and have a nice story, but then they always seem to be a dead end.

Interaction nets optimize parallelism at the cost of everything else. Including single-threaded performance. You're just warming up the planet by wasting massive amounts of parallel GPU cores to do what a single CPU core could do more easily. They're just the wrong answer to this problem.

LightMachineOP2y ago

You're wrong. The Haskell code is compiled to a loop, which we didn't optimize for yet. I've edited the README to use the Bitonic Sort instead, on which allocations are unavoidable. Past N=20, HVM2 performs 4x faster than GHC -O2.

light_hue_12y ago

What? I ran your example, from your readme, where you promise a massive performance improvement, and you're accusing me of doing something wrong?

This is exactly what a scammer would say.

I guess that's the point here. Scam people who don't know anything about parallel computing by never comparing against any other method?

andrewp1232y ago· 1 in thread

I just wanted to comment on how good the homepage is - it's immediately clear what you do. Most people working with "combinators" would feel a need to use lots of scary lingo, but OP actually shows the simple idea behind the tool (this is the opposite take of most academics, who instead show every last detail and never tell you what's going on). I really appreciate it - we need more of this.

topspin2y ago

I'm ashamed that I didn't think to write this. Well deserved praise.

zackmorris2y ago· 1 in thread

This is nice, and obvious. I've waited about 20 years since I learned MATLAB and GNU Octave for someone to make a graph solver like this. And about 25 years since I first had the idea, when I was learning VLSI with VHDL in college and didn't see anything like the functional programming of circuits in what at the time was the imperative C++ world. The closest thing then was Lisp, but nobody talked about how the graph representation (intermediate code or i-code in the imperative world) could be solved in an auto-parallelized way.

We still see this today in how languages go out of their way to implement higher order method libraries (map/reduce/filter) but then under the hood there is no multithreading, they just expect the developer to annotate their loops to be parallel because the languages aren't formal enough to know about side effects in the innermost logic, and don't support immutability or performant pass-by-value semantics with copy-on-write anyway. So we end up with handwavy languages like Rust that put all of that mental load onto the developer for basically no gain, they just save memory by performing computation in-place imperatively.

I also like how Bend sidesteps the nonexistence of highly scaled symmetric multiprocessing CPUs by supporting GPUs. It makes the argument moot that GPUs can't be stopped because they're too big to fail. Julia is the only other language I've seen that tries this. I wish Clojure did, although it's been a long time since I followed it so maybe it has some parallelism?

I would have dearly loved to work on something like Bend, had someone solved the funding issue. Nobody wants to pay for pure research, and nobody sees the need for languages that do what everyone else is doing except easier. We have Kickstarter for widgets and Patreon for influencers, but makers have to bootstrap themselves or learn everything about finance or live in the right city or have a large network to hopefully meet an angel investor or work in academia and lose all rights to what they invent while spending the majority of their time hustling for grants anyway. So it just never happens and we're stuck with the same old busted techniques. Like how Hollywood only has money for sequels and reboots or the recording industry only has money for canned corporate music and hits from already famous artists and yet another cover that yanks the original better song off the radio.

A quarter of a century can go by in the blink of an eye if you get suckered into building other people's dreams as a people-pleaser. Be careful what you work on.

jjtheblunt2y ago

> A quarter of a century can go by in the blink of an eye if you get suckered into building other people's dreams as a people-pleaser. Be careful what you work on

well said! i find myself reflecting the same sentiment when away from the computer (and i've avoided the people-pleaser thing, but what you said resonates as i watch the world)

notfed2y ago· 1 in thread

This is really, really cool. This makes me think, "I could probably write a high performance GPU program fairly easily"...a sentence that's never formed in my head.

developedby2y ago

That's the main idea!

klabb32y ago· 1 in thread

This is very exciting. I don’t have any GPU background, but I have been worrying a lot about CUDA cementing itself in the ecosystem. Here devs don’t need CUDA directly which would help decoupling the ecosystem from cynical mega corps, always good! Anyway enough politics..

Tried to see what the language is like beyond hello world and found the guide[1]. It looks like a Python and quacks like a Haskell? For instance, variables are immutable, and tree-like divide and conquer data structures/algorithms are promoted for getting good results. That makes sense I guess! I’m not surprised to see a functional core, but I’m surprised to see the pythonic frontend, not that it matters much. I must say I highly doubt that it will make it much easier for Python devs to learn Bend though, although I don’t know if that’s the goal.

What are some challenges in programming with these kinds of restrictions in practice? Also, are there good FFI options?

[1]: https://github.com/HigherOrderCO/bend/blob/main/GUIDE.md

mathiasgredal2y ago

We have a replacement for CUDA, it is called C++17 parallel algorithms. It has vendor support for running on the GPU by Intel, AMD and NVIDIA and will also run on all your cores on the CPU. It uses the GPU vendor's compiler to convert your C++ to something that can natively run on the GPU. With unified memory support, it becomes very fast to run computations on heap allocated memory using the GPU, but implementations also support non-unified memory.

Vendor support:

- https://www.intel.com/content/www/us/en/developer/articles/g...

- https://rocm.blogs.amd.com/software-tools-optimization/hipst...

- https://docs.nvidia.com/hpc-sdk/archive/20.7/pdf/hpc207c++_p...

magnio2y ago· 1 in thread

Congrats on the launch.

I know the docs say this will be fixed soon, but what is the main reason for restricting number types to 24 bits? I saw in the code that they are wrappers around the 32-bit system number types, so what prevents Bend from changing them to U32(u32) right now?

LightMachineOP2y ago

Great question!

Short answer: GPU

Long answer: CUDA

Seriously though. Implementing a full high-level lang in parallel is HARD, so, to simplify it greatly, we made IC nodes 64-bit, which allows us to use native 64-bit atomic operations in many parts of the implementation. Since each 64-bit node has 2 ports, that gives us 32 bits per port. And since we use 3 bits for the tag, that leaves us with 29 bit payloads. We used that space to easily implement unboxed numbers (f24, u24, i24).

That said, we will have (boxed) 64-bit numbers soon! With this foundation in place, adding them is a matter of coding. I just want to have some time to let people use the limited version, find bugs, etc., before I add more stuff.
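The bit budget described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic (64-bit node, two 32-bit ports, 3-bit tag, 29-bit payload); the helper names and the decision to put the tag in the low bits are assumptions made for this example and are not taken from the HVM2 source.

```python
TAG_BITS = 3                       # per the comment: 3 bits for the tag
VAL_BITS = 29                      # which leaves a 29-bit payload
PORT_BITS = TAG_BITS + VAL_BITS    # 32 bits per port
PORT_MASK = (1 << PORT_BITS) - 1


def make_port(tag, val):
    """Pack a tag and payload into one 32-bit port (tag in the low bits: assumed)."""
    assert 0 <= tag < (1 << TAG_BITS)
    assert 0 <= val < (1 << VAL_BITS)
    return (val << TAG_BITS) | tag


def port_tag(port):
    return port & ((1 << TAG_BITS) - 1)


def port_val(port):
    return port >> TAG_BITS


def make_node(p1, p2):
    """Pack two 32-bit ports into a single 64-bit node word."""
    return (p2 << PORT_BITS) | p1


def node_ports(node):
    return node & PORT_MASK, node >> PORT_BITS


# A 24-bit number fits in the 29-bit payload with 5 bits to spare.
num = make_port(tag=5, val=0xFFFFFF)
node = make_node(num, make_port(tag=1, val=42))
print(node.bit_length() <= 64, port_val(num) == 0xFFFFFF)
```

Keeping the whole node in one 64-bit word is what makes a single native 64-bit atomic operation sufficient to read or swap it. What the 5 spare payload bits are used for is not stated in the comment.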

jes51992y ago· 1 in thread

I think I have a use for this but I’m realizing that I don’t know how to build a mental model of what is going to parallelize in this system. Surely some algorithms are better at getting chopped up than others - how can I tell what is going on?

mjaniczek2y ago

I think this is an unsolved tooling question right now.

You could get some sense of the parallelism by using `/usr/bin/time` and dividing the user time by the wall time.
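The same estimate can be scripted. Below is a minimal, Unix-only Python sketch that runs a command and reports the ratio of CPU (user) time to wall-clock time, which approximates how many cores were busy on average. The function names are invented for this example, and the command passed in is whatever you want to measure; nothing here is part of Bend or HVM.

```python
import resource      # Unix-only: per-process resource usage
import subprocess
import time


def effective_parallelism(user_seconds, wall_seconds):
    """Average number of busy cores: CPU time divided by elapsed time."""
    return user_seconds / wall_seconds


def measure(cmd):
    """Run `cmd` (a list of strings) and estimate its average core usage."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime
    start = time.perf_counter()
    subprocess.run(cmd, check=True)            # waits for the child to exit
    wall = time.perf_counter() - start
    user = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime - before
    return effective_parallelism(user, wall)


# Example with made-up numbers: 24 s of CPU time in 3 s of wall time.
print(effective_parallelism(24.0, 3.0))
```

A value near 1 means the run was effectively sequential; a value near the core count means the work spread out well. This only says something about CPU runs, since time spent on the GPU does not show up as user CPU time.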

You could look at the Task Manager / Activity Monitor / htop and see if it's using 800% CPU or whatever.

You could use psrecord (https://pypi.org/project/psrecord/) to get a relatively finegrained CPU+mem usage graph across the duration of the program.

But it would probably still be best to record some sort of stats in the Bend/HVM itself, enabled via a CLI flag. Reductions per ms, sampled across the program duration, or something like that.

I'd be interested in anybody's ideas of what a good metric would be here!

EDIT: CLI flag, not CPU flag

mbforbes2y ago· 1 in thread

Congratulations on the launch and hard work so far! We need projects like this. Great readme and demo as well.

Every time I try to write shaders, or even peek through my fingers at CUDA C(++) code, I recoil in disbelief that we don't have high level programming yet on the GPU. I can't wait until we do. The more great projects attacking it the better in my book.

Munksgaard2y ago

Have you looked at Futhark?

gigatexal2y ago· 1 in thread

The first graphic midway or so down the page has this tag:

tested on: CPU - Apple M3 Max, GPU - NVIDIA RTX 4090

But how? I thought eGPUs don’t work on apple silicon and the pci-e having Mac Pro is still M2 based, no?

GGO2y ago

2 different machines

mccoyb2y ago· 1 in thread

This is cool! Is the idea to put Kind2 on top of this in some way?

I’d also love to find an example of writing a small interpreter in Bend - which runs on the GPU.

LightMachineOP2y ago

Yes, Kind2 will be a type layer on top of Bend, with a similar relationship as in JavaScript / TypeScript (but much more integrated, less ad-hoc and with proofs!). I don't want Kind2 to compete directly with Lean though, as it is doing an amazing job and I'm rooting for it. So, Kind2 will be just a type system for Bend that happens to let you prove theorems about your programs, rather than a higher promise to digitalize all of maths and stuff.

funny_name2y ago· 1 in thread

What kind of software would this language be good for? I assume it's not the kind of language you'd use for web servers exactly.

trenchgun2y ago

Erlang-like actor models would be well suited, so yeah, you could use it for web servers (assuming they are able to finish the language). It's a general purpose high level programming language.

Arch4852y ago

I want to congratulate the author on this, it's super cool. Making correct automatic parallelization is nothing to sneeze at, and something you should absolutely be proud of.

I'm excited to see how this project progresses.

jjovan12y ago

Why so much negativity? An angry crowd sounded more like bots trying to test OP's intelligence by exploiting the ReadMe file imperfections while trying to change the context and intent of the post. It's so ignorant and brutal. They spent hours arguing without taking 2 minutes to properly read the ReadMe file. OP is a one-man show and now they all want to piss on OP. Keep going OP!

ruste2y ago

Been watching your development for a while on Twitter. This is a monumental achievement and I hope it gets the recognition it deserves.

npalli2y ago

Is the recursive sum the best function to show multi-threading or GPU speedups? Seems unlikely. FWIW, I ported the Python example to Julia and it ran in about 2.5 seconds, the same as the C++ version. Pure Python 3.12 took 183 seconds.

  function sum(depth, x)
      if depth == 0
          return x
      else
          fst = sum(depth-1, x*2+0)
          snd = sum(depth-1, x*2+1)
      end
      return fst + snd
  end

  println(sum(30,0))

robust-cactus2y ago

This is awesome and much needed. Keep going, forget the overly pedantic folks, the vision is great and early results are exciting.

mjaniczek2y ago

I've made a benchmark of Bend running a simple counter program on CPU vs GPU, vs Haskell, Node, Python, C that I plan to write a blogpost about, probably this Sunday:

https://docs.google.com/spreadsheets/d/1V_DZPpc7_BP3bmOR8Ees...

It's magical how the GPU version is basically flat (although with a high runtime init cost).

gsuuon2y ago

Congrats on the HVM2 launch! Been following for a while, excited to see where this project goes. For others who are lost on the interaction net stuff, there was a neat show hn that gave a more hands-on interactive intro: https://news.ycombinator.com/item?id=37406742 (the 'Get Started' writeup was really helpful)

smusamashah2y ago

I have no interest in this tech as it's apparently for backend stuff and not actually rendering things by itself.

But the demo gif is probably the best I have seen in a Github readme. I watched it till the end. It was instantly engaging. I wanted to see the whole story unfold.

darlansbjr2y ago

Would a compiler be faster by using HVM? Would love to see a fully parallel version of typescript tsc

throwaway25622y ago

i-combinators https://www.semanticscholar.org/paper/Interaction-Combinator...

mattnewport2y ago

This looks cool, I find myself wishing for a language and introductory tutorial that isn't so targeted at Python programmers however (though I understand from a commercial point of view why that may make sense).

It seems like this is actually an elegant typed functional language but the Python syntax looks ugly and verbose and like it's trying to hide that compared to something more ML/F# or Haskell inspired.

I'll try and get past that though as it does look like there's something pretty interesting here.

magicalhippo2y ago

> Everything that can run in parallel, will run in parallel.

On the CPU, there's typically a threshold where dividing and coordinating the parallel work takes more time than simply doing the work on a single thread. Thus you can make the overall runtime much faster by not dividing the work all the way, but rather stop at that optimal threshold and then just loop over the remaining work in the worker threads.
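The cutoff idea described above can be sketched with the recursive-sum benchmark from this thread. This is a minimal Python illustration of the CPU-side technique only (split the tree down to a fixed depth, then let each worker loop over its subtree sequentially); the names `leaf_sum`, `tree_sum` and `split_depth` are invented for this example, and it says nothing about how Bend or HVM2 schedule work.

```python
from concurrent.futures import ThreadPoolExecutor


def leaf_sum(depth, x):
    """Sequential part: add up the 2**depth leaves below node x in a plain loop."""
    total = 0
    for leaf in range(x << depth, (x + 1) << depth):
        total += leaf
    return total


def tree_sum(depth, x=0, split_depth=3, workers=8):
    """Divide only `split_depth` levels deep, then hand each subtree to a worker."""
    split = min(split_depth, depth)
    # The nodes `split` levels below x are numbered x*2**split .. (x+1)*2**split - 1.
    roots = range(x << split, (x + 1) << split)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda r: leaf_sum(depth - split, r), roots)
        return sum(partials)


print(tree_sum(10))  # same result as the fully recursive sum(10, 0)
```

Note that CPython threads share the global interpreter lock, so this version shows the structure of the cutoff rather than an actual speedup; a process pool or a language with real threads would be needed to measure one.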

How does this work on the GPU using Bend? Been too long since I did any GPU programming.

mjaniczek2y ago

I've made a benchmark of the current version of Bend running a simple counter program on CPU vs GPU, vs Haskell, Node, Python, C: https://docs.google.com/spreadsheets/d/1V_DZPpc7_BP3bmOR8Ees...

It's magical how the GPU version is basically flat (although with a high runtime init cost)

hintymad2y ago

Speaking of parallel computing, any book or series of books that can help an engineer learn parallel programming and go from zero to hero? Ideally the books will cover intuition, in-depth details, and theory. Something like The Art of Multiprocessor Programming by Herlihy et al. for concurrent programming, even though the book arguably still has too steep of a learning curve.

i5heu2y ago

This is a very cool project! I sometimes dream about an instruction set architecture (ISA) that runs in some kind of VM that allows for existing languages to be able to run on CPU/GPU/FPGAs/ASICs automatically.

I think this is a much more practical approach and I hope this will give some inspiration to this possibility.

runeks2y ago

What's the expected effort needed for supporting GPUs other than Nvidia — e.g. AMD GPUs or the GPU in a MacBook Pro M1/2/3?

As I understand, it's a lot of work because there's no common way to target these different GPUs. Is this correctly understood?

croemer2y ago

Dupe of https://news.ycombinator.com/item?id=40387394 and https://news.ycombinator.com/item?id=40383196

temp1237892462y ago

Congrats!

I’ve been watching HVM for a while and think it’s extremely cool.

My intuition is that this will eventually be a really big deal.

britannio2y ago

Incredible feat, congratulations!

flakiness2y ago

This kind of sounds like Mojo. I wonder how these compare? (Besides HVM/Bend being open source, which is awesome.)

https://www.modular.com/max/mojo

zmmmmm2y ago

Reminds me a little bit of Concurnas [0] which sadly seemed to get abandoned right at the point when it was nearly viable.

[0] https://concurnas.com/

shadowpho2y ago

Wow this is very impressive!

exitheone2y ago

This seems pretty cool!

Question: Does this take into account memory bandwidth and caches between cores? Because getting them wrong can easily make parallel programs slower than sequential ones.

netbioserror2y ago

So HVM finally yields fruit. I've been eagerly awaiting this day! Bend seems like a very suitable candidate for a Lispy S-expression makeover.

mattdesl2y ago

Looking forward to using this. Curious about how far away WebGPU/WASM support might be, it could provide a single cross-platform backend.

zacksiri2y ago

What would be super interesting is having this as a library, and being able to use something like it inside Elixir or Ruby to optimize hotspots.

anonzzzies2y ago

This and HVM2 are some of the most interesting work I know of currently. Nice break from all the LLM stuff.

Now I just need a Common Lisp implemented using it!

idiomaxiom2y ago

If folds and bends are isomorphic to loops then loops can be parallelized ala Occam-Pi?

I am really enjoying this implementation :)

programjames2y ago

I just read through the HVM2 and Lafont papers, and I'm pretty impressed with this style of computation!

thinking_banana2y ago

It's pretty cool that you *actually* built this necessary Developer interface aimed towards accessibility of HPC!

kerkeslager2y ago

This looks like the language I've wanted for a long time. I'm excited to see how this plays out.

KingOfCoders2y ago

Has someone written the example in "native" GPU (C/Cuda) to compare performance?

totorovirus2y ago

I can already see how many people are so illiterate about GPUs.

DrNosferatu2y ago

Is there an OpenCL backend?

buckley_19912y ago

Awesome! It looks similar to the problems that Modular AI aims to solve.

wolfspaw2y ago

Nice!

Python-like + High-performance.

And, different from Mojo, it's fully open-source.

IsTom2y ago

Reminds me a lot of the reduceron, sans FPGA.

markush_2y ago

Exciting project, congrats on the release!

kkukshtel2y ago

Honestly incredible, and congrats on the release after what looks like an insane amount of work.

toastal2y ago

Another project locking its communications to the Discord black hole.

naltroc2y ago

so cool. I see it has a lib target, can we use it as a crate instead of an external program?

efferifick2y ago

Thank you for sharing!

exabrial2y ago

> First, install Rust nightly.

Eeek.

JayShower2y ago

This is really cool!

jgarzon2y ago

Very nice!

hypersimplex2y ago

congratz on the release

gingfreecss2y ago

amazing

Archit3ch2y ago

Pure functions only? This is disappointing. Furthermore, it invites a comparison with JAX.

tinydev2y ago

This is cool as shit.

3abiton2y ago

> That's a 57x speedup by doing nothing.

Okay, I'll have what you're having.


CorrectingYou2y ago· 12 in thread

OP comes around with some of the coolest things posted in HN recently, and all he gets is extensive criticism, when it is clear that this is an early version :/

imranq2y ago

1 more reply

eating5552y ago

I would really appreciate it if people criticized my project. That is how you grow. If people tended to hide the cruel truth behind applause, the world would just crumble.

diego_sandoval2y ago

My observation is that most criticism is useless, because people don't understand why you did things the way you did them.

If you explain why, they either still don't understand, or don't agree.

If the first iPhone had been presented on HN/Reddit/Twitter, everyone would criticize the lack of physical keyboard.

1 more reply

robocat2y ago

LightMachineOP2y ago

2 more replies

vitiral2y ago

It has 905 upvotes, it has received a fair share of positivity as well. Even criticism is often positive, since it expresses interest and engagement with the ideas and approach.

jules2y ago

swayvil2y ago

The coolest things are often the most difficult to understand.

Difficult to understand is often threatening.

Criticism is a popular response to threat and is the form of reply that requires the least understanding.

riku_iki2y ago

it also could be half cooked and that's why criticism arrives.

1 more reply

metadat2y ago

Correction for you - This is patently false, OP has had three hits -- this one, and two one hundred pointers out of 100-200 submissions.

P.s. it seems rather likely the op is Victor Taelin, they mostly submit his tweets and gists.

Who are you rooting for, exactly, newcomer?

P.p.s. Victor Taelin just happens to be the most recent committer on this submission, imagine that.

https://news.ycombinator.com/item?id=35363400

foota2y ago

1 more reply

LightMachineOP2y ago

1 more reply

ziedaniel12y ago· 12 in thread

Very cool idea - but unless I'm missing something, this seems very slow.

    #include <iostream>

    int main() {
      int sum = 0;
      for (int i = 0; i < 1024*1024*1024; i++) {
        sum += i; 
      }
      std::cout << sum << "\n";
      return 0;
    }

LightMachineOP2y ago

(I wonder if I should have waited a little bit more before actually posting it)

jay-barronville2y ago

> (I wonder if I should have waited a little bit more before actually posting it)

No. You built something that’s pretty cool. It’s not done yet, but you’ve accomplished a lot! I’m glad you posted it. Thank you. Ignore the noise and keep cooking!

phkahler2y ago

>> Bend has no tail-call optimization yet.

I've never understood the fascination with tail calls and recursion among computer science folks. Just write a loop, it's what it optimises to anyway.
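To make the equivalence concrete, here is a small Python sketch (the function names are made up for illustration, and it says nothing about how Bend itself compiles recursion). The tail-recursive version carries its running total in an accumulator, so nothing is left to do after the recursive call; a compiler with tail-call optimization can reuse the stack frame, which is what the loop does by hand. CPython does not perform this optimization, so the recursive form is still bounded by the recursion limit there.

```python
def sum_to_tail(n, acc=0):
    """Tail-recursive: the recursive call is the very last thing evaluated."""
    if n == 0:
        return acc
    return sum_to_tail(n - 1, acc + n)


def sum_to_loop(n):
    """The loop a tail-call-optimizing compiler would effectively produce."""
    acc = 0
    while n != 0:
        # Same state update as the recursive call's arguments.
        acc, n = acc + n, n - 1
    return acc


# Kept small on purpose: CPython has no tail-call optimization.
print(sum_to_tail(100), sum_to_loop(100))
```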

2 more replies

nneonneo2y ago

2 more replies

nneonneo2y ago

ziedaniel12y ago

I did make sure to check before posting.

Good point about the signed integer overflow, though!

molenzwiebel2y ago

If compiled with -O3 on clang, the loop is entirely optimized out: https://godbolt.org/z/M1rMY6qM9. Probably not the fairest comparison.

LightMachineOP2y ago

1 more reply

ziedaniel12y ago

I used GCC and checked that it wasn't optimized out (which actually surprised me!)

rroriz2y ago

I think the point is that Bend is at a much higher level than C++. But to be fair: I also may be missing the point!

gslepak2y ago

The point is that Bend parallelizes everything that can be parallelized without developers having to do that themselves.
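A rough way to see what "can be parallelized" means for the recursive sum discussed in this thread: the two recursive calls share no data, so each subtree can be handed to a different worker. The Python sketch below only illustrates that independence. It is not how Bend or HVM2 schedule work, the names `tree_sum` and `parallel_tree_sum` are hypothetical, and Python threads will not give a real speedup here because of the GIL.

```python
from concurrent.futures import ThreadPoolExecutor


def tree_sum(depth, x):
    """Sequential version: sums the 2**depth leaves below node x."""
    if depth == 0:
        return x
    return tree_sum(depth - 1, x * 2) + tree_sum(depth - 1, x * 2 + 1)


def parallel_tree_sum(depth, split=3):
    """Expand the top `split` levels by hand, then sum each subtree separately."""
    split = min(split, depth)
    # After `split` levels starting from x = 0, the subtree roots are 0 .. 2**split - 1.
    roots = range(2 ** split)
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda root: tree_sum(depth - split, root), roots)
        return sum(parts)


print(parallel_tree_sum(10), tree_sum(10, 0))
```

Both calls return the same total because addition does not care which subtree finishes first; that order-independence is what an automatic parallelizer relies on.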

5-2y ago

here is the same loop finishing in one second on my laptop, single-threaded, in a very high-level language, q:

  q)\t sum til floor 2 xexp 30
  1031

Twirrim2y ago· 8 in thread

For what it's worth, I ported the sum example to pure python.

    def sum(depth, x):
        if depth == 0:
          return x
        else:
          fst = sum(depth-1, x*2+0) # adds the fst half
          snd = sum(depth-1, x*2+1) # adds the snd half
          return fst + snd
        
    print(sum(30, 0))

under pypy3 it executes in 0m4.478s, single threaded. Under python 3.12, it executed in 1m42.148s, again single threaded. I mention that because you include benchmark information:

    CPU, Apple M3 Max, 1 thread: 3.5 minutes
    CPU, Apple M3 Max, 16 threads: 10.26 seconds
    GPU, NVIDIA RTX 4090, 32k threads: 1.88 seconds

If I get a chance tonight, I'll re-run it with `-s` argument, see if I get anything helpful.

LightMachineOP2y ago

Running for 42 minutes is most likely a bug. Yes, we haven't done much testing outside of M3 Max yet. I'm aware it is 2x slower on non-Apple CPUs. We'll work on that.

vrmiguel2y ago

> it is allocating 2 IC nodes for each numeric operation, while Python is not

While that's true, Python would be using big integers (PyLongObject) for most of the computations, meaning every number gets allocated on the heap.

If we use a Python implementation that would avoid this, like PyPy or Cython, the results change significantly:

    % cat sum.py 
    def sum(depth, x):
        if depth == 0:
            return x
        else:
            fst = sum(depth-1, x*2+0) # adds the fst half
            snd = sum(depth-1, x*2+1) # adds the snd half
        return fst + snd

    if __name__ == '__main__':
        print(sum(30, 0))

    % time pypy sum.py
    576460751766552576
    pypy sum.py  4.26s user 0.06s system 96% cpu 4.464 total

That's on an M2 Pro. I also imagine the result in Bend would not be correct since it only supports 24 bit integers, meaning it'd overflow quite quickly when summing up to 2^30, is that right?
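A quick way to check the overflow concern, as a Python sketch. It assumes a 24-bit unsigned integer simply wraps modulo 2^24; that wraparound behaviour is an assumption about Bend's `u24`, not something confirmed in this thread. The closed form n(n-1)/2 with n = 2^depth stands in for the recursive sum.

```python
U24_MODULUS = 1 << 24  # assumed wraparound point for a 24-bit unsigned integer


def exact_sum(depth):
    """Sum of 0 .. 2**depth - 1 using arbitrary-precision integers."""
    n = 1 << depth
    return n * (n - 1) // 2


def wrapped_sum(depth):
    """The same sum reduced modulo 2**24, i.e. what a wrapping u24 would hold."""
    return exact_sum(depth) % U24_MODULUS


print(exact_sum(30))               # matches the pypy output above
print(exact_sum(30).bit_length())  # bits needed to hold the exact result
print(wrapped_sum(30))             # what survives in 24 bits under the assumption
```

Under that assumption the exact result needs far more than 24 bits, so a depth-30 run cannot be used to compare numeric output between Bend and Python, only run time.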

[Edit: just noticed the previous comment had already mentioned pypy]

> I'm aware it is 2x slower on non-Apple CPUs.

Do you know why? As far as I can tell, HVM has no aarch64/Apple-specific code. Could it be because Apple Silicon has wider decode blocks?

> can be underwhelming, and I understand if you don't believe on my words

I don't think anyone wants to rain on your parade, but extraordinary claims require extraordinary evidence.

1 more reply

Twirrim2y ago

Bitonic sort runs in 0m2.035s. Transpiled to c and compiled it takes 0m0.425s.

that sum example, transpiled to C and compiled takes 1m12.704s, so it looks like it's just the VM case that is having serious issues of some description!

glitchc2y ago

Clearly this language is positioned at using the gpu for compute-heavy applications and it's still in its early stages. Recursion is not the target application and should not be a relevant benchmark.

1 more reply

fulafel2y ago

tinyspacewizard2y ago

Python is really bad at recursion (part of why it's not appropriate for functional programming), so perhaps an unfair benchmark?

A Pythonic implementation would use loops and mutation.

metadat2y ago

Why `+0`, is this not a pointless no-op?

pests2y ago

Yes, but when looking at the source it's more obvious this is a repeating pattern.

"Hey, I'm accessing the 0th element here, just want to make that clear"

Without the +0, that statement looks disconnected from the +1 even though conceptually it's the same.

I've usually seen it used in longer lists of statements. It also keeps everything lined up formatting-wise.

davidw2y ago· 6 in thread

As a resident of Bend, Oregon... it was kind of funny to read this and I'm curious about the origin of the name.

developedby2y ago

I was actually looking forward to seeing someone from Bend make a comment like this

bytK72y ago

As a fellow resident of Bend I felt the same way when I saw this.

noumenon11112y ago

As a native Bendite but not current Bend resident, seeing that word with a capital letter always makes me smell juniper and sagebrush a little bit.

alex_lav2y ago

Totally off topic but I'll be driving there later this afternoon. Hoping it's as beautiful as last time!

davidw2y ago

If you're going to be here for a bit (I am heading out of town on a bike trip for a few days), always happy to grab a beer with fellow HN people!

1 more reply

blinded2y ago

Thought the same thing!

anentropic2y ago· 5 in thread

I remember seeing HVM on here a year or two back when it came out and it looked intriguing. Exciting to see something being built on top of it!

I would say that the play on words that gives the language its name ("Bend") doesn't really make sense...

https://github.com/HigherOrderCO/bend/blob/main/GUIDE.md

> Bending is the opposite of folding. Whatever fold consumes, bend creates.

I have a question about the example code and output for bending:

    type Tree:
      Node { ~lft, ~rgt }
      Leaf { val }

    def main():
      bend x = 0:
        when x < 3:
          tree = Tree/Node { lft: fork(x + 1), rgt: fork(x + 1) }
        else:
          tree = Tree/Leaf { val: 7 }
      return tree

    tree = fork(0)
    tree = ![fork(1), fork(1)]
    tree = ![![fork(2),fork(2)], ![fork(2),fork(2)]]
    tree = ![![![fork(3),fork(3)], ![fork(3),fork(3)]], ![![fork(3),fork(3)], ![fork(3),fork(3)]]]
    tree = ![![![7,7], ![7,7]], ![![7,7], ![7,7]]]

Where does the initial "tree = fork(0)" come from?

redbar0n2y ago

    type Tree
      Branch { left, right }
      Leaf { value }

    def main():

      createTree(x):
        x < 3 ?
          Tree.Branch { left: createTree(x+1), right: createTree(x+1) }
        Tree.Leaf { value: 7 }

      createTree(0)

barfbagginus2y ago

Re: name. Fold and bend are indeed called fold and unfold in Haskell and traditional functional programming literature.

I wonder if bend has to do with how we manipulate the computation's interaction graph while evaluating a bend. There might be some bending of wires!

Re: code example

Note this is guesswork. I don't know what the ![a, b] syntax means, and I haven't read much of the guide.

Appendix: Notes on Fold Vs Bend

I wrote these for an earlier draft while reminding myself about these operations. I include them more for my benefit, and in case they help you or the audience.
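As a rough illustration of the fold/unfold pairing, here is a Python sketch. The names `unfold`, `fold`, `grow` and `finish` are made up for this example and are not Bend or Haskell APIs. `unfold` grows a tree outward from a seed, which is the direction `bend` works in, and `fold` collapses a finished tree into a single value.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    val: int


@dataclass
class Node:
    lft: "Tree"
    rgt: "Tree"


Tree = Union[Leaf, Node]


def unfold(seed, grow, finish):
    """Build a tree from a seed. `grow` returns two child seeds, or None to stop."""
    children = grow(seed)
    if children is None:
        return Leaf(finish(seed))
    left_seed, right_seed = children
    return Node(unfold(left_seed, grow, finish), unfold(right_seed, grow, finish))


def fold(tree, on_leaf, on_node):
    """Consume a tree bottom-up, combining the results of the two subtrees."""
    if isinstance(tree, Leaf):
        return on_leaf(tree.val)
    return on_node(fold(tree.lft, on_leaf, on_node), fold(tree.rgt, on_leaf, on_node))


# Same shape as the guide's example: keep splitting while the seed is below 3, leaves hold 7.
tree = unfold(0, lambda d: (d + 1, d + 1) if d < 3 else None, lambda d: 7)
total = fold(tree, lambda v: v, lambda a, b: a + b)
print(total)  # 8 leaves of 7 each -> 56
```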

anentropic2y ago

The question that comes to me is: can I use fork(x) outside of a bend?

Seems like probably not, there doesn't seem to be enough information in the 'argument' to this 'function' to do anything useful without the implicit context of the bend construct.

For that reason I think I'd prefer it if fork was a keyword (like 'bend' and 'when') rather than a 'function', just at the surface syntax level to give a clue it is something special.

I guess fork is a kind of 'magic' function that represents the body of the bend. It's a bit like a 'self' or 'this'.

At the moment this syntax is in a weird half-way point ...the underlying concept is necessarily functional but it's trying to look kind of like an imperative for-loop still.

Again it's sort of overloading an imperative-looking syntax to implicitly do the 'return' from the implicit recursive function.

Later on there is this example:

    def render(depth, shader):
      bend d = 0, i = 0:
        when d < depth:
          color = (fork(d+1, i*2+0), fork(d+1, i*2+1))
        else:
          width = depth / 2
          color = shader(i % width, i / width)
      return color

And here I wonder - does 'width' have a value after the bend? Or it's only the last assignment in each clause that is privileged?

That's an odd mix in a language which otherwise has explicit returns like Python.

If so I wonder if a syntax something like this might be clearer:

    def render(depth, shader):
      bend color with d = 0, i = 0:
        when d < depth:
          yield (fork(d+1, i*2+0), fork(d+1, i*2+1))
        else:
          width = depth / 2
          return shader(i % width, i / width)
      return color

i.e. name the return var once in the bend itself, yield intermediate values (to itself, recursively) and return the final state.

developedby2y ago

The first `fork` is from using bend and passing the initial state

  The program above will initialize a state (`x = 0`), and then, for as long as `x < 3`,
  it will "fork" that state in two, creating a `Tree/Node`, and continuing with `x + 1`.
  When `x >= 3`, it will halt and return a `Tree/Leaf` with `7`.
  When all is done, the result will be assigned to the `tree` variable:
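One way to read this, sketched in Python: treat the whole `bend` block as the body of an implicit recursive function, with `fork` as the name of that function. That reading is an interpretation of the guide, not a description of the actual compiler output. The initial state `x = 0` is then the first call, which is where `tree = fork(0)` comes from. Nested lists stand in for the `![..., ...]` notation.

```python
def fork(x):
    """Hypothetical desugaring of the `bend x = 0:` block into a plain function."""
    if x < 3:                              # the `when x < 3:` branch
        return [fork(x + 1), fork(x + 1)]  # a Tree/Node with two forked states
    return 7                               # the `else:` branch, a Tree/Leaf holding 7


# `bend x = 0:` corresponds to the first call, with the initial state as argument.
tree = fork(0)
print(tree)
```

Expanding the calls by hand, one level at a time, reproduces the trace from the guide line by line.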

anentropic2y ago

I would have described the logic in the exact same way, but I still don't see where initial tree = fork(0) state comes from

all the other "fork"s in the output are produced explicitly by:

    Tree/Node { lft: fork(x + 1), rgt: fork(x + 1) }

1 more reply

yetihehe2y ago· 4 in thread

Bend looks like a nice language.

LightMachineOP2y ago

We will have 64-bit boxed numbers really soon! As in, next month, or earlier if users find this to be a higher priority.

yetihehe2y ago

What other types are you planning? Maybe some floats (even if only on cpu targets, would be nice).

1 more reply

Archit3ch2y ago

Is there a platform with native hardware u64? Maybe some FPGA?

Archit3ch2y ago

Sorry, meant u24.

KingOfCoders2y ago· 4 in thread

The website claims "automatically achieves near-ideal speedup"

12x for 16x threads

51x for 16.000x threads

Can someone point me to a website where it explains that this is the "ideal speedup"? Is there a formula?

lmeyerov2y ago

Bend is intriguing --

1. Some potentially useful perspectives:

* Weak scaling vs strong scaling: https://www.kth.se/blogs/pdc/2018/11/scalability-strong-and-... ?

2. The comparisons I'd really like to see are:

* cudf, heavy.ai: how does it compare to high-level python dataframe and SQL that already run in GPUs? How is perf, and what programs do you want people to be able to write and they cannot?

* Halide and other more general purpose languages that compile to GPUs that seem closer to where Bend is going

FWIW, it's totally fine to compare to other languages.

KingOfCoders2y ago

Thanks for the long reply.

LightMachineOP2y ago

This is on CPU vs GPU.

A GPU core (shading unit) is 100x weaker than a CPU core, thus the difference.

On the GPU, HVM's performance scales almost 16000x with 16000x cores. Thus the "near ideal speedup".

Not everyone knows how GPUs work, so we should have been more clear about that!
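On the question of a formula: the usual definitions are speedup S = T1 / Tn and parallel efficiency E = S / n, and Amdahl's law bounds the speedup at 1 / ((1 - p) + p / n) when only a fraction p of the work can run in parallel. The Python sketch below applies them to the timings quoted in this thread (3.5 minutes, 10.26 seconds, 1.88 seconds). The function names are made up for illustration, and the 0.95 parallel fraction is an arbitrary example value, not a measurement of Bend.

```python
def speedup(t_serial, t_parallel):
    """S = T1 / Tn."""
    return t_serial / t_parallel


def efficiency(s, workers):
    """E = S / n; 1.0 would be the ideal, perfectly linear case."""
    return s / workers


def amdahl(parallel_fraction, workers):
    """Upper bound on speedup when only part of the program is parallel."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)


cpu_16 = speedup(210.0, 10.26)  # 1 CPU thread vs 16 CPU threads
gpu = speedup(210.0, 1.88)      # 1 CPU thread vs the RTX 4090 run
print(f"CPU 16 threads: {cpu_16:.1f}x, efficiency {efficiency(cpu_16, 16):.2f}")
print(f"GPU vs 1 CPU thread: {gpu:.1f}x")
print(f"website's 12x on 16 threads: efficiency {efficiency(12, 16):.2f}")
print(f"Amdahl bound, p = 0.95, 16 workers: {amdahl(0.95, 16):.2f}x")
```

The GPU figure divides a single CPU thread's time by the time of thousands of much slower GPU threads, so it is not an efficiency in the E = S / n sense unless the per-core difference is factored in.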

andersa2y ago

It's not.

KeplerBoy2y ago· 4 in thread

What's going on with the super-linear speedup going from one thread to all 16?

210 seconds (3.5 minutes) to 10.5 seconds is a 20x speedup, which isn't really expected.

LightMachineOP2y ago

KeplerBoy2y ago

Thanks for the explanation and the cool project.

I will give bend a shot on some radar signal processing algorithms.

LightMachineOP2y ago

I apologize, I gave you the wrong answer.

byteknight2y ago

It's possible to see such scaling if any level of cache or I/O is involved.

vegadw2y ago· 3 in thread

neonsunset2y ago

Take a look at ILGPU. It's very nice and has been around for a long time! (just no one knows about it, sadly)

Short example: https://github.com/m4rs-mt/ILGPU/blob/master/Samples/SimpleM...

Supports even advanced bits like inline PTX assembly: https://github.com/m4rs-mt/ILGPU/blob/master/Samples/InlineP...

pjmlp2y ago

Chapel has a decent use in HPC.

Also NVidia has sponsored variants of Haskell, .NET, Java, Julia on CUDA, have a Python JIT and are collaborating with Mojo folks.

MarcusE1W2y ago

ParaSail also goes into that direction https://github.com/parasail-lang/parasail.

Made by Tucker Taft, the designer of Ada since 1995. Some of the parallel features of ParaSail made it into Ada 2022.

praetor222y ago· 3 in thread

Look, I understand the value proposition and how cool it is from a theoretical standpoint, but I honestly don't think this will ever become relevant.

Here are some notes from my first impressions and after skimming through the paper. And yes, I am aware that this is very very early software.

1. Bend looks like an extremely limited DSL. No FFI. No way of interacting with raw buffers. Weird 24bit floating point format.

2. There's a reason why ICs are not relevant: performance is and will always be terrible. There is no other way to put it, graph traversal simply doesn't map well on hardware.

3. The premise of optimal reduction is valid. However, you still need to write the kernels in a way that can be parallelized (ie. no data dependencies, use of recursion).

LightMachineOP2y ago

Thanks for the feedback. Some corrections:

HVM2 is finally a correct algorithm that scales. Now we'll optimize it for the actual low-level performance.

mst2y ago

> HVM2 is finally a correct algorithm that scales.

This, I think, is the key thing people are missing.

physicsguy2y ago

delu2y ago· 3 in thread

shwestrick2y ago

Nowadays 210 is actually parallel! You can run 210-style code using MaPLe (https://github.com/MPLLang/mpl) and get competitive performance with respect to C/C++.

If you liked 210, you might also like https://futhark-lang.org/ which is an ML-family language that compiles to GPU with good performance.

amelius2y ago

Huh, the Maple name is already used by a well known computer algebra project.

https://en.wikipedia.org/wiki/Maple_(software)

Rodeoclash2y ago

The trend towards multiple cores in machines was one of the reasons I decided to learn Elixir.

xiaodai2y ago· 3 in thread

Looks cool but what's one toy problem that it can solve more efficiently than others?

JackMorgan2y ago

Here is an example of it summing a huge set of numbers 100x faster than in C.

https://github.com/HigherOrderCO/bend/blob/main/GUIDE.md#par...

ashdnazg2y ago

Note that it's not 100x faster than C, but than bend running on one CPU thread.

Running the equivalent C code takes ~2.3 seconds on my machine. Same order of magnitude as bend on the beefy GPU.

mjaniczek2y ago

anonzzzies2y ago· 3 in thread

What is the terminal used for that demo? https://github.com/HigherOrderCO/Bend does it just skip commands it cannot execute or?

LightMachineOP2y ago

It was actually just me recording iTerm2 with OBS. The theme is Solarized Light. What do you mean by skip commands?

anonzzzies2y ago

Ah just ctrl-c was it? Sometimes I just think way too difficult. Keep up the good work!

1 more reply

mcintyre19942y ago

animaomnium2y ago· 2 in thread

Fala Taelin, nice work! Does HVM2 compile interaction nets to e.g. spirv, or is this an interpreter (like the original HVM) that happens to run on the GPU?

Edit: Oh I see...

> This repository provides a low-level IR language for specifying the HVM2 nets, and a compiler from that language to C and CUDA HVM

Will have to look at the code then!

https://github.com/HigherOrderCO/HVM

Edit: Wait nvm, it looks like the HVM2 cuda runtime is an interpreter, that traverses an in-memory graph and applies reductions.

https://github.com/HigherOrderCO/HVM/blob/5de3e7ed8f1fcee6f2...

I was talking about traversing an interaction net to recover a lambda-calculus-like term, which can be lowered to C a la lisp in small pieces with minimal runtime overhead.

The other option is implementing HVM in hardware, which I've been messing around with on a spare FPGA.

LightMachineOP2y ago

animaomnium2y ago

Edit: nvm, I read through the rest of the codebase. I see that HVM compiles the inet to a large static term and then links against the runtime.

https://github.com/HigherOrderCO/HVM/blob/5de3e7ed8f1fcee6f2...

Will have to play around with this and look at the generated assembly, see how much of the runtime a modern c/cu compiler can inline.

Btw, nice code, very compact and clean, well-organized easy to read. Rooting for you!

MrLeap2y ago· 2 in thread

This is incredible. This is the kind of work we need to crack open the under utilized GPUs out there. I know LLMs are all the rage, but there's more gold in them hills.

anon2912y ago

As a whole, the speedups claimed are not actually that great. Going from 1 core to 16k cores increases performance by 50x. That's not actually very good.

LightMachineOP2y ago

You're comparing CPU cores to GPU cores!

It is "only" 50x because a single GPU core is 100x weaker than a CPU core!

1 more reply

api2y ago· 2 in thread

Oh wow do I wish this existed when I was playing with evolutionary computation and genetic algorithms in college…

zackmorris2y ago

Me too, now you see why they never took off.

api2y ago

Nobody has tried EC at anywhere near the scale of GPTs/LLMs because that amount of compute is expensive and at this point we know those will at least work.

I still think EC is fascinating and would love to play with it some more at some point, maybe trying it combined with back propagation in novel ways. Compute only gets cheaper.

yetihehe2y ago· 2 in thread

Wow, Bend looks like a nice language.

trenchgun2y ago

> Ahh, not even 32bit?

This is a proof of concept version which focuses on the provable correctness of the parallel compiler.

GGO2y ago

64bit coming soon

anonzzzies2y ago· 2 in thread

Maybe I missed it, but there seems to be no license attached to HVM2, nor to Bend or Kind?

drtournier2y ago

https://x.com/VictorTaelin/status/1791241244468806117

tekknolagi2y ago

(Taelin says will likely be MIT or similar)

egnehots2y ago· 2 in thread

the interesting comparison nowadays would be against mojo:

https://www.modular.com/max/mojo

ZitchDog2y ago

I think this is quite different- I don’t think mojo runs on the GPU unless I am mistaken.

witherk2y ago

highfrequency2y ago· 2 in thread

> CPU, Apple M3 Max, 1 thread: 3.5 minutes

> CPU, Apple M3 Max, 16 threads: 10.26 seconds

Surprised to see a more than linear speedup in CPU threads. What’s going on here?

LightMachineOP2y ago

Archit3ch2y ago

More cores = more caches?

abeppu2y ago· 2 in thread

I for one found the 'how is this possible' video near the bottom of the page to be unhelpful:

- surely for `3 x 3 = 9`, there is some concept of primitive operations?

jiehong2y ago

The flip with Chinese characters in the middle tripped me. I guess they wanted to look like “complicated”…

nojvek2y ago

3 x 3 seemed like a pretty bad example to show how they parallelize.

chc42y ago· 2 in thread

LightMachineOP2y ago

hahajahen2y ago

The fact that you don’t know the answer to this question, and don’t even seem to think it is relevant, is chilling.

We want to believe, but the claims here are big

1 more reply

light_hue_12y ago· 2 in thread

   module Main where
   
   sum' :: Int -> Int -> Int
   sum' 0 x = x
   sum' depth x = sum' (depth - 1) ((x * 2) + 0) + sum' (depth - 1) ((x * 2) + 1)
   
   main = print $ sum' 30 0

Runs in 2.5s. Sure it's not on a GPU, but it's faster! And things don't get much more high level.

If you're going to promise amazing performance from a high level language, I'd want to see a comparison against JAX.

LightMachineOP2y ago

light_hue_12y ago

What? I ran your example, from your readme, where you promise a massive performance improvement, and you're accusing me of doing something wrong?

This is exactly what a scammer would say.

I guess that's the point here. Scam people who don't know anything about parallel computing by never comparing against any other method?

2 more replies

andrewp1232y ago· 1 in thread

topspin2y ago

I'm ashamed that I didn't think to write this. Well deserved praise.

zackmorris2y ago· 1 in thread

A quarter of a century can go by in the blink of an eye if you get suckered into building other people's dreams as a people-pleaser. Be careful what you work on.

jjtheblunt2y ago

> A quarter of a century can go by in the blink of an eye if you get suckered into building other people's dreams as a people-pleaser. Be careful what you work on

well said! i find myself reflecting the same sentiment when away from the computer (and i've avoided the people-pleaser thing, but what you said resonates as i watch the world)

notfed2y ago· 1 in thread

This is really, really cool. This makes me think, "I could probably write a high performance GPU program fairly easily"...a sentence that's never formed in my head.

developedby2y ago

That's the main idea!

klabb32y ago· 1 in thread

What are some challenges in programming with these kinds of restrictions in practice? Also, are there good FFI options?

[1]: https://github.com/HigherOrderCO/bend/blob/main/GUIDE.md

mathiasgredal2y ago

Vendor support:

- https://www.intel.com/content/www/us/en/developer/articles/g...

- https://rocm.blogs.amd.com/software-tools-optimization/hipst...

- https://docs.nvidia.com/hpc-sdk/archive/20.7/pdf/hpc207c++_p...

magnio2y ago· 1 in thread

Congrats on the launch.

LightMachineOP2y ago

Great question!

Short answer: GPU

Long answer: CUDA

jes51992y ago· 1 in thread

mjaniczek2y ago

I think this is an unsolved tooling question right now.

You could get some sense of the parallelism by using `/usr/bin/time` and dividing the user time by the wall time.

You could look at the Task Manager / Activity Monitor / htop and see if it's using 800% CPU or whatever.

You could use psrecord (https://pypi.org/project/psrecord/) to get a relatively finegrained CPU+mem usage graph across the duration of the program.

But it would probably still be best to record some sort of stats in the Bend/HVM itself, enabled via a CLI flag. Reductions per ms, sampled across the program duration, or something like that.

I'd be interested in anybody's ideas of what a good metric would be here!

EDIT: CLI flag, not CPU flag
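The `/usr/bin/time` idea can be scripted. The Python sketch below runs a command and reports CPU time (user plus system) divided by wall time, which comes out near 1.0 for a single busy thread and near N for N busy cores. The helper names are made up for illustration, and the command in the usage line is a placeholder to be replaced with whatever invocation you actually use. On Windows, `os.times()` reports zero for child processes, so this only works on Unix-like systems. GPU work is not captured at all.

```python
import os
import subprocess
import sys
import time


def estimate_parallelism(user_s, sys_s, wall_s):
    """Average number of busy cores: CPU seconds consumed per wall-clock second."""
    return (user_s + sys_s) / wall_s


def measure(cmd):
    """Run `cmd` to completion and estimate its average CPU parallelism."""
    before = os.times()
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    wall = time.perf_counter() - start
    after = os.times()
    # Child CPU times are updated once the child has been waited for.
    user = after.children_user - before.children_user
    system = after.children_system - before.children_system
    return estimate_parallelism(user, system, wall)


# Placeholder command; substitute the program you want to profile.
print(measure([sys.executable, "-c", "sum(range(10**6))"]))
```

This is only an average over the whole run, so it hides phases where the program is briefly serial; a per-interval sampler like psrecord shows those.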

mbforbes2y ago· 1 in thread

Congratulations on the launch and hard work so far! We need projects like this. Great readme and demo as well.

Munksgaard2y ago

Have you looked at Futhark?

gigatexal2y ago· 1 in thread

The first graphic midway or so down the page has this tag:

tested on: CPU - Apple M3 Max, GPU - NVIDIA RTX 4090

But how? I thought eGPUs don’t work on apple silicon and the pci-e having Mac Pro is still M2 based, no?

GGO2y ago

2 different machines

mccoyb2y ago· 1 in thread

This is cool! Is the idea to put Kind2 on top of this in some way?

I’d also love to find an example of writing a small interpreter in Bend - which runs on the GPU.

LightMachineOP2y ago

funny_name2y ago· 1 in thread

What kind of software would this language be good for? I assume it's not the kind of language you'd use for web servers exactly.

trenchgun2y ago

Erlang-like actor models would be well suited, so yeah, you could use it for web servers (assuming they are able to finish the language). It's a general purpose high level programming language.

Arch4852y ago

I want to congratulate the author on this, it's super cool. Making correct automatic parallelization is nothing to sneeze at, and something you should absolutely be proud of.

I'm excited to see how this project progresses.

jjovan12y ago

ruste2y ago

Been watching your development for a while on Twitter. This is a monumental achievement and I hope it gets the recognition it deserves.

npalli2y ago

  function sum(depth, x)
      if depth == 0
          return x
      else
          fst = sum(depth-1, x*2+0)
          snd = sum(depth-1, x*2+1)
      end
      return fst + snd
  end

  println(sum(30,0))

robust-cactus2y ago

This is awesome and much needed. Keep going, forget the overly pedantic folks, the vision is great and early results are exciting.

mjaniczek2y ago

I've made a benchmark of Bend running a simple counter program on CPU vs GPU, vs Haskell,Node,Python,C that I plan to write a blogpost about, probably this Sunday:

https://docs.google.com/spreadsheets/d/1V_DZPpc7_BP3bmOR8Ees...

It's magical how the GPU version is basically flat (although with a high runtime init cost).

gsuuon2y ago

smusamashah2y ago

I have no interest in this tech as it's apparently for backend stuff and not actually rendering things by itself.

But the demo gif is probably the best I have seen in a Github readme. I watched it till the end. It was instantly engaging. I wanted to see the whole story unfold.

darlansbjr2y ago

Would a compiler be faster by using HVM? Would love to see a fully parallel version of typescript tsc

throwaway25622y ago

i-combinators https://www.semanticscholar.org/paper/Interaction-Combinator...

mattnewport2y ago

I'll try and get past that though as it does look like there's something pretty interesting here.

magicalhippo2y ago

> Everything that can run in parallel, will run in parallel.

How does this work on the GPU using Bend? Been too long since I did any GPU programming.

mjaniczek2y ago

I've made a benchmark of the current version of Bend running a simple counter program on CPU vs GPU, vs Haskell,Node,Python,C: https://docs.google.com/spreadsheets/d/1V_DZPpc7_BP3bmOR8Ees...

It's magical how the GPU version is basically flat (although with a high runtime init cost)

hintymad2y ago

i5heu2y ago

I think this is a much more practical approach, and I hope this will give some inspiration for this possibility.
