> Something else would not be
ChatGPT tuned for your site.
I suspect a lot of people would be satisfied with anything functionally equivalent regardless of whether it is ChatGPT(TM)-brand.
> GPGPU is not gaming. Unified memory means that Apple Silicon's "RAM" can be compared to VRAM for inference.
The M1 and M2 have a 128-bit memory bus, the same as an ordinary dual-channel system. Only the Pro and Max have more (2x and 4x respectively), and it's not obvious bandwidth is even the bottleneck here: they have more because the GPU and CPU both need enough at the same time, not because a GPU of that size needs that much memory bandwidth while the CPU is idle.
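A quick sketch of the arithmetic here (peak bandwidth = bus width in bytes × transfer rate); the LPDDR5-6400 figure is an assumption for illustration, and Apple's marketing numbers are rounded slightly differently:

```python
def peak_bandwidth_gb_s(bus_width_bits: int, mt_per_s: int) -> float:
    """Peak memory bandwidth in GB/s from bus width and transfer rate."""
    return bus_width_bits / 8 * mt_per_s / 1000

base = peak_bandwidth_gb_s(128, 6400)  # 128-bit bus, like ordinary dual-channel
pro = peak_bandwidth_gb_s(256, 6400)   # Pro: 2x the bus width
mx = peak_bandwidth_gb_s(512, 6400)    # Max: 4x the bus width
print(base, pro, mx)  # 102.4 204.8 409.6
```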
For example, the RTX 4070 Ti is about twice as fast at inference as the RTX 3070 Ti, even though it has slightly less memory bandwidth. And the 4070 Ti has only ~25% more memory bandwidth than the M2 Max GPU but is many times faster.
There is presumably a point at which inference becomes bottlenecked by memory bandwidth rather than compute hardware, but the garden-variety x86_64 iGPU may not even be past it, and if it is, it's not by much.
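The memory-bound ceiling can be sketched as a roofline-style upper bound: for batch-1 token generation, every weight has to be streamed from memory once per token, so throughput can't exceed bandwidth divided by model size. The model size and bandwidth figures below are illustrative assumptions, not measurements:

```python
def max_tokens_per_s(bandwidth_gb_s: float, params_billions: float,
                     bytes_per_param: float) -> float:
    """Upper bound on batch-1 decode throughput when memory-bandwidth-bound."""
    model_size_gb = params_billions * bytes_per_param
    return bandwidth_gb_s / model_size_gb

# Hypothetical 7B-parameter model quantized to 4 bits (~3.5 GB of weights):
print(max_tokens_per_s(100, 7, 0.5))  # ~28.6 tok/s at ~100 GB/s (128-bit class)
print(max_tokens_per_s(500, 7, 0.5))  # ~143 tok/s at ~500 GB/s (4070 Ti class)
```

If measured throughput sits well below this bound, the system is compute-bound rather than bandwidth-bound, which is the point being made about the 4070 Ti vs. 3070 Ti comparison.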
The interesting things are a) getting the code written to make existing hardware easy to use, and b) maybe introducing some hefty iGPUs into server systems with 12 memory channels per socket, which wouldn't run out of memory bandwidth even with significantly more compute hardware and could then be supplied with hundreds of GB worth of RDIMMs.