For more generic GPU targets there's Triton [5], [6].
[1] NVIDIA CUDA 13.1 Powers Next-Gen GPU Programming with NVIDIA CUDA Tile and Performance Gains:
https://developer.nvidia.com/blog/nvidia-cuda-13-1-powers-ne...
[2] Nvidia Tilus: A Tile-Level GPU Kernel Programming Language:
https://github.com/NVIDIA/tilus
[3] Simplify GPU Programming with NVIDIA CUDA Tile in Python:
https://developer.nvidia.com/blog/simplify-gpu-programming-w...
[4] Tile Language:
https://github.com/tile-ai/tilelang
[5] Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations:
https://dl.acm.org/doi/10.1145/3315508.3329973
[6] Triton:
Whatever this is doing could be wrapped up in another language.
Either way, it's arguable whether that is even a good idea, since dealing with a regular thread in the same memory space, getting data to and from the GPU, and doing computations on the GPU are three completely separate things with different latency characteristics.
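
To make the three phases concrete, here is a minimal CUDA sketch that times the host-to-device copy, the kernel, and the copy back as separate steps. The kernel name `scale`, the array size, and the launch configuration are made up for illustration, and error checking is left out to keep it short, so treat it as a sketch rather than a benchmark.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: doubles every element in place.
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;                  // arbitrary size chosen for the sketch
    const size_t bytes = n * sizeof(float);

    // Phase 1: ordinary host memory, touched by a regular CPU thread.
    std::vector<float> host(n, 1.0f);

    float *dev = nullptr;
    cudaMalloc(&dev, bytes);

    cudaEvent_t t0, t1, t2, t3;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventCreate(&t2);
    cudaEventCreate(&t3);

    // Phase 2: move the data to the GPU.
    cudaEventRecord(t0);
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);

    // Phase 3: compute on the GPU.
    scale<<<(n + 255) / 256, 256>>>(dev, n);
    cudaEventRecord(t2);

    // Phase 2 again, in the other direction.
    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(t3);
    cudaEventSynchronize(t3);

    float h2d = 0.0f, kernel = 0.0f, d2h = 0.0f;
    cudaEventElapsedTime(&h2d, t0, t1);     // milliseconds
    cudaEventElapsedTime(&kernel, t1, t2);
    cudaEventElapsedTime(&d2h, t2, t3);
    printf("H2D copy: %.3f ms, kernel: %.3f ms, D2H copy: %.3f ms\n",
           h2d, kernel, d2h);

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaEventDestroy(t2);
    cudaEventDestroy(t3);
    cudaFree(dev);
    return 0;
}
```

The actual numbers depend entirely on the hardware and the bus, but each phase shows up as its own measurable step, which is the distinction a single-language wrapper would have to either expose or hide.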