[1] https://github.com/AsahiLinux/gpu
[2] https://github.com/dougallj/applegpu
[3] https://github.com/antgroup-skyward/ANETools/tree/main/ANEDi...
[4] https://github.com/hollance/neural-engine
You can use high-level APIs like MLX, Metal or CoreML to compute other things on the GPU and NPU.
Shadama [5] is an example programming language that translates (with OMeta) matrix calculations into WebGPU or WebGL APIs (I forget which). You can do exactly the same with the MLX, Metal or CoreML APIs and only pay around 3% overhead going through the translation stages.
[5] https://github.com/yoshikiohshima/Shadama
I estimate it will cost around $22K at my hourly rate to completely reverse engineer the latest A16 and M4 CPU (ARMv9), GPU and NPU instruction sets. I think I am halfway on the reverse engineering; the debugging part is the hardest problem. You would however not be able to sell software built with it on the App Store, as Apple forbids undocumented APIs and bare-metal instructions.
Very interesting. A steal for $22k but I guess very niche for now...
But yes, it will be possible to use all 140 cores of the M2 Ultra or the 36 cores of the M4. There will be an M6 Extreme some day, maybe 500 cores?
Actually, the GPU and ANE cores themselves are built from teams of smaller cores, maybe a few dozen, hundreds or thousands in all, same as in most Nvidia chips.
>A steal for $22k but I guess very niche for now...
A single iPhone or Mac app (a game, an LLM, pattern recognition, a security app, a VPN, encryption/decryption, a video encoder/decoder) that can be sped up by 80%-200% can afford my faster assembly-level API.
A whole series of hardware level zero-day exploits for iPhone and Mac would become possible, now that won't be very niche at all. It is worth millions to reverse Apple Silicon instruction sets.
Several people have already contacted me today with this request. This is how I give out details and share current progress with you.
Yes, you can help; most people on HN could. The work is not that difficult, and it is not just low-level debugging, coding and FPGA hardware. It is also organizing and even simple sales, talking to funders. With patience, you could even get paid to help.
>any place you have your current progress written up on?
Not any place in public, because of its value for zero-day exploits. This knowledge is worth millions.
I'm in the process of rewriting my three scientific papers on reverse engineering Apple Silicon low level instructions.
>it seems vague with how far people have currently gotten and exact issues.
Yes, I'm afraid you're right, my apologies. It's very much detailed and technical stuff, some of it under patent and NDA, some even sensitive for winning economic wars and ongoing wars (you can guess those are exciting stories). It even plays a role in the $52.7 billion US, €43 billion EU and $150 billion (unconfirmed) Chinese chips acts. Apple Silicon is the main reason TSMC opened a US factory [1]; keeping its instruction set details secret is deemed important.
If you want more information, you should join our offline video discussions. Maybe sometimes sign an NDA for the juicy bits.
[1] https://www.cnbc.com/2024/12/13/inside-tsmcs-new-chip-fab-wh...
It might be the same reason that is behind Nvidia's CUDA moat. CUDA lock-in prevented competitors like AMD and Intel from convincing programmers and their customers to switch away from CUDA, so no software was ported to their competing GPUs. So you get antitrust lawsuits [1].
I think you should put yourself in Apple's management mindset and then reason. I suspect they think they will not sell more iPhones or Macs if they let third-party developers access the low-level APIs and write faster software.
They might reason that if no one knows the instruction sets, hackers will write less code to break security. Security by obscurity.
They certainly think that blocking competitors from reverse engineering the low power Apple Silicon and blocking them from using TSMC manufacturing capacity will keep them the most profitable company for another decade.
My use case would be hooking up a device which spews out sensor data at 100 Gbit/s over QSFP28 Ethernet as directly to a GPU as possible. The new Mac mini has the GPU power, but there's no way to get the data into it.
# 2x Gen4x16 + 4x Gen3x8 = 2 * 31.508 GB/s + 4 * 7.877 GB/s ≈ 90 GB/s = 720 gbit/s
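Redoing that arithmetic with the usual effective per-lane rates (after 128b/130b encoding overhead), the sum actually comes out slightly higher than the rounded 90 GB/s / 720 Gbit/s, before any lane sharing is accounted for. A minimal sketch:

```python
# Per-direction PCIe bandwidth sum for the quoted slot layout.
# Effective per-lane rates after 128b/130b encoding:
# PCIe 4.0 ≈ 1.969 GB/s/lane, PCIe 3.0 ≈ 0.985 GB/s/lane.
gen4_x16 = 16 * 1.969   # ≈ 31.5 GB/s per Gen4 x16 slot
gen3_x8  = 8 * 0.985    # ≈ 7.9 GB/s per Gen3 x8 slot

total_gbytes = 2 * gen4_x16 + 4 * gen3_x8
total_gbits  = total_gbytes * 8

print(f"{total_gbytes:.1f} GB/s ≈ {total_gbits:.0f} Gbit/s")  # 94.5 GB/s ≈ 756 Gbit/s
```

Either way, as noted below, the theoretical slot sum overstates what the machine can actually move.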
We both should restate and specify the calculation for each different Apple Silicon chip and the PCB/machine model it is wired onto.
The $599 M4 Mac mini base model networking (aggregated WiFi, USB-C, 10G Ethernet, Thunderbolt PCIe) is almost 270 Gbps. Your 720 Gbps is for a >$8000 Mac Pro M2 Ultra, but the number is too high because the 2x Gen4 x16 is shared/oversubscribed with the other PCIe lanes for the x8 PCIe slots, SSD and Thunderbolt. You need to measure/benchmark it, not read the marketing PR.
I estimate the $1400 M4 Pro Mac mini networking bandwidth by adding the external WiFi, 10 Gbps Ethernet, two USB-C ports (2 x 10 Gbps) and three Thunderbolt 5 ports (3 x 80/120 Gbps), but subtracting the PCIe 64 Gbps limit and not counting the internal SSD. Two $599 M4 Mac mini base models are faster and cheaper than one M4 Pro Mac mini.
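My reading of that estimating method as a back-of-envelope sketch, in Gbps. The Wi-Fi figure is a rough assumption of mine, and each 80/120 Gbps port (Thunderbolt 5 on the M4 Pro mini) carries usable data only up to its 64 Gbps PCIe tunnel:

```python
# Sum of the M4 Pro Mac mini's external links, nominal rates in Gbps.
wifi     = 2.4           # assumed effective Wi-Fi throughput (my guess)
ethernet = 10            # 10G Ethernet
usb_c    = 2 * 10        # two 10 Gbps USB-C ports
tb_link  = 3 * 80        # three Thunderbolt ports at nominal link rate
tb_pcie  = 3 * 64        # same ports, capped at the 64 Gbps PCIe tunnel

nominal = wifi + ethernet + usb_c + tb_link
capped  = wifi + ethernet + usb_c + tb_pcie
print(f"nominal ≈ {nominal:.0f} Gbps, PCIe-capped ≈ {capped:.0f} Gbps")
```

The gap between the nominal and PCIe-capped sums is exactly why measured numbers, not port sums, should settle the comparison.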
The point of the precise actual measurements I did of the trillions of operations per second and the billions of bits per second of networking/interconnect of the M4 Mac mini against all the other Apple Silicon machines is to find which package (chip plus PCB plus case) has the best price/performance/watt balance when networked together. In January 2025 you can build the cheapest, fastest supercomputer in the world from just off-the-shelf M4 16GB Mac mini base models with 10G Ethernet, MikroTik 100G switches and a few FPGAs. It would outperform all Nvidia, Cerebras, Tenstorrent and datacenter clusters I know of, mainly because of the low-power Apple Silicon.
Note that the M4 has only 1.2 Tbps unified memory bandwidth and the M4 Pro has double that. The 8 Tbps unified memory bandwidth is on the M1 and M2 Studio Ultra with 64/128/192GB DRAM. Without it you can't reach 50 trillion operations per second. A Mac Studio has only around 190 Gbps external networking bandwidth but does not reach 43 TOPS, and neither does the 720 Gbps of your Mac Pro estimate. By reverse engineering the instruction set you could squeeze a few percent extra performance out of this M4 cluster.
The 43 TOPS of the M4 itself is an estimate. The ANE does 34 TOPS, the CPU less than 5 TOPS depending on float type, and we have no reliable benchmarks for the CPU floating point.
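Back-solving those component figures shows what the GPU would have to contribute for the total to hold. The ANE number is Apple's quoted figure and the CPU bound is from above; the GPU share is a residual guess, not a measurement:

```python
# Back-solve the GPU's share of the M4's ~43 TOPS estimate.
total_tops = 43    # estimated whole-chip figure
ane_tops   = 34    # Apple Neural Engine, Apple's quoted figure
cpu_tops   = 5     # "less than 5", float-type dependent upper bound
gpu_tops   = total_tops - ane_tops - cpu_tops   # residual, not measured
print(f"implied GPU share ≈ {gpu_tops} TOPS")   # implied GPU share ≈ 4 TOPS
```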
I'm confident you can get 100 Gbps in by aggregating M4 Mac mini ports.
I resell a $199 MikroTik CCR2004-1G-2XS-PCIe SmartNIC with 2 x 25 Gbps SFP28 ports that connects to an x8 PCIe 3.0 slot. (I still have a few available for $140 plus shipping, plus a few refurbished 16 x 10 Gbps switches for $400 and 8 x 100 Gbps switches for $800.)
Theoretically you can connect that SmartNIC to two of the three M4 Mac mini Thunderbolt 4/USB4 ports, which each pass through x4 PCIe 3.0, if you can figure out how to aggregate the two x4 PCIe links into a single x8 port. The driver source code is for Linux and could be ported to macOS. You then aggregate the ports with the 100 Gbps switch.
I'm pretty sure you could create a new PCB design with a larger Broadcom switch chip model to attach to the 10G Ethernet, the two 10 Gbps USB-C ports plus the three Thunderbolt 4/USB4 ports, and write a new driver to aggregate over the six ports. You'd have 126 Gbps minus the PCIe overhead and could combine it into a single 100 Gbps QSFP28 port.
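The 126 Gbps figure appears to be the plain sum of the six links, counting each Thunderbolt 4/USB4 port at its x4 PCIe 3.0 data ceiling of roughly 32 Gbps (my reading of how the number was reached):

```python
# 10G Ethernet + two 10 Gbps USB-C ports + three TB4/USB4 ports
# counted at ~32 Gbps of PCIe data each.
links_gbps = [10, 10, 10, 32, 32, 32]
print(sum(links_gbps))  # 126
```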
I already warned this is still theoretical. Broadcom might not sell you the switch chip, Intel might not sell you the Thunderbolt chip and Apple might block the installation of your device driver code.
But people already proved the interconnect with the Apple Thunderbolt Bridge driver at 3 x 10 Gbps connected via large expensive Thunderbolt hubs [2]. Others just connect each port to different M4 Macs [1][3][4] in various ways.
[1] https://x.com/alexocheema/status/1807882764261417000
[2] https://www.youtube.com/watch?v=GBR6pHZ68Ho
There is Metal development. You want to learn Apple M-series GPU and GPGPU development? Learn Metal!
That's what GPGPU stands for. So your two sentences contradict each other.
<Insert your favorite LLM> helped me write some simple Metal-accelerated code by scaffolding the compute pipeline, which took most of the nuisance out of learning the API and let me focus on writing the kernel code.
Here's the code if it's helpful at all. https://github.com/rgov/thps-crack
With that base, I’ve found their docs decent enough, especially coupled with the Metal Shader Language pdf they provide (https://developer.apple.com/metal/Metal-Shading-Language-Spe...), and quite a few code samples you can download from the docs site (e.g. https://developer.apple.com/documentation/metal/performing_c...).
I’d note a lot of their stuff was still written in Objective-C, which I’m not that familiar with. But most of that is boilerplate and the rest is largely C/C++ based (including the Metal shader language).
I just ported some CPU/SIMD number crunching (complex matrices) to Metal, and the speedup has been staggering. What used to take days now takes minutes. It is the hottest my M3 MacBook has ever been though! (See https://x.com/billticehurst/status/1871375773413876089 :-)
Metal, and Apple's docs are the place to start.
There is also a Vulkan path if you want one: MoltenVK runs Vulkan on top of Metal.
If you want portability, use WebGPU, either via wgpu for Rust or Dawn for C++. They actually do run portably on Windows, Linux, Mac, iOS, and Android.