> That said, I don't think we can be faulted for that one, because I don't think anybody really has a good answer to this particular design problem.
Agreed! To be clear, if there's any implication of "fault", it was certainly not in a moral sense, or even about making poor design decisions. Julia's compiler is being asked to do many new things with semantics that necessarily predate many advances in PL.
Re kernel fusion, there's another piece here, which you may or may not have included in "array-level optimizations". Julia's "just write loops" ethos is awesome until you get to accelerators; then we're back to an "optimizer-defined sub-language", as TKF puts it. People like loops and flexibility, and Dex, FLoops.jl, Tullio, Loopvec and KA.jl show that it's possible to retain structure and emit accelerator-able loopy code. But none of those, except for Dex, has a solution for fusing kernels that rely on loops. I'm still using the concept of kernels because there's still a bit of a separation between low-level CUDA.jl code/these various DSLs and higher-level array code, even if it's not as stark as in Python or C++.
Would be really cool if, like Dex, there were a plan to fuse these sorts of structured loops as well. Dex does it by having type-level indexing and loop effects (they're actually moving to a user-defined parallel effect handler system: https://arxiv.org/abs/2110.07493). The latter can tell the compiler when it's safe to parallelize and fuse + beta-reduce loops. But that relies on structured semantics/effects and a higher-level IR than exists in Julia.
Not sure what a Julian solution would look like, or whether one is even possible. But given the usability wins, it would be great to have in Julia as well.
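To make the fusion gap concrete, here's a minimal CPU-only sketch in plain Base Julia (no GPU packages). The function names `fused`, `sin_kernel`, `affine_kernel` and `unfused` are hypothetical, made up for illustration. Dotted broadcast expressions fuse syntactically into a single pass, while two hand-written loop "kernels" composed together generally stay as two passes with an intermediate array, since as far as I know nothing in the language guarantees they get merged.

```julia
# Fused: the dotted calls lower to one broadcast, so this is a single pass
# over `x` with no temporary array for `sin.(x)`.
fused(x) = 2 .* sin.(x) .+ 1

# Hand-written loop "kernels" (hypothetical names). Each one allocates and
# fills its own output array.
function sin_kernel(x)
    y = similar(x)
    for i in eachindex(x)
        y[i] = sin(x[i])
    end
    return y
end

function affine_kernel(y)
    z = similar(y)
    for i in eachindex(y)
        z[i] = 2 * y[i] + 1
    end
    return z
end

# Composing the loop kernels: same math, but written as two separate loops
# with an intermediate array `y` between them.
unfused(x) = affine_kernel(sin_kernel(x))

x = rand(Float64, 1024)
@assert fused(x) ≈ unfused(x)  # same values; only the loop structure differs
```

On an accelerator, the unfused version would typically mean two kernel launches plus a round trip through memory for the intermediate, which is the cost a Dex-style loop fusion is meant to remove.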