[1] https://github.com/AsahiLinux/gpu
[2] https://github.com/dougallj/applegpu
[3] https://github.com/antgroup-skyward/ANETools/tree/main/ANEDi...
[4] https://github.com/hollance/neural-engine
You can use high-level APIs like MLX, Metal or CoreML to compute other things on the GPU and NPU.
Shadama [5] is an example programming language that translates (with OMeta) matrix calculations into WebGPU or WebGL APIs (I forget which). You can do exactly the same with the MLX, Metal or CoreML APIs and only pay around 3% overhead going through the translation stages.
[5] https://github.com/yoshikiohshima/Shadama
I estimate it will cost around $22K at my hourly rate to completely reverse engineer the latest A16 and M4 CPU (ARMv9), GPU and NPU instruction sets. I think I am halfway on the reverse engineering; the debugging part is the hardest problem. You would however not be able to sell software built with it on the App Store, as Apple forbids undocumented APIs and bare-metal instructions.
Very interesting. A steal for $22k but I guess very niche for now...
But yes, it will be possible to use all 140 cores of the M2 Ultra or the 36 cores of the M4. There will be an M6 Extreme some day, maybe 500 cores?
Actually, the GPU and ANE cores themselves are built from teams of smaller cores, maybe a few dozen, hundreds or thousands in all, same as in most Nvidia chips.
>A steal for $22k but I guess very niche for now...
A single iPhone or Mac app (a game, an LLM, pattern recognition, a security app, a VPN, encryption/decryption, a video encoder/decoder) that can be sped up by 80%-200% can afford my faster assembly-level API.
A whole series of hardware level zero-day exploits for iPhone and Mac would become possible, now that won't be very niche at all. It is worth millions to reverse Apple Silicon instruction sets.
Several people have already contacted me today with this request. This is how I give out details and share current progress with you.
Yes, you can help; most people on HN could. The work is not that difficult, and it is not just low-level debugging, coding and FPGA hardware. It is also organizing and even simple sales, talking to funders. With patience, you could even get paid to help.
>any place you have your current progress written up on?
Not any place in public, because of its value for zero-day exploits. This knowledge is worth millions.
I'm in the process of rewriting my three scientific papers on reverse engineering Apple Silicon low level instructions.
>it seems vague with how far people have currently gotten and exact issues.
Yes, I'm afraid you're right, my apologies. It's very much detailed and technical stuff, some of it under patent and NDA, some even sensitive for winning economic wars and ongoing wars (you can guess those are exciting stories). It even plays a role in the $52.7 billion US, €43 billion EU and $150 billion (unconfirmed) Chinese chips acts. Apple Silicon is the main reason TSMC opened a US factory [1]; keeping its instruction set details secret is deemed important.
If you want more information, you should join our offline video discussions. Maybe sometimes sign an NDA for the juicy bits.
[1] https://www.cnbc.com/2024/12/13/inside-tsmcs-new-chip-fab-wh...
It might be the same reason that is behind Nvidia's CUDA moat. CUDA lock-in prevented competitors like AMD and Intel from convincing programmers and their customers to switch away from CUDA, so no software was ported to their competing GPUs. So you get antitrust lawsuits [1].
I think you should put yourself in Apple's management mindset and then reason. I suspect they think they will not sell more iPhones or Macs if they let third-party developers access the low-level APIs and write faster software.
They might reason that if no one knows the instruction sets, hackers will write less code to break security. Security by obscurity.
They certainly think that blocking competitors from reverse engineering the low power Apple Silicon and blocking them from using TSMC manufacturing capacity will keep them the most profitable company for another decade.
My use case would be hooking up a device which spews out sensor data at 100 Gbit/s over QSFP28 Ethernet as directly to a GPU as possible. The new Mac mini has the GPU power, but there's no way to get the data into it.
# 2x Gen4x16 + 4x Gen3x8 = 2 * 31.508 GB/s + 4 * 7.877 GB/s ≈ 90 GB/s = 720 gbit/s
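Redoing that arithmetic with the usual effective per-lane rates (after 128b/130b encoding overhead), the sum actually comes out slightly higher than the rounded 90 GB/s / 720 Gbit/s, before any lane sharing is accounted for. A minimal sketch:

```python
# Per-direction PCIe bandwidth sum for the quoted slot layout.
# Effective per-lane rates after 128b/130b encoding:
# PCIe 4.0 ≈ 1.969 GB/s/lane, PCIe 3.0 ≈ 0.985 GB/s/lane.
gen4_x16 = 16 * 1.969   # ≈ 31.5 GB/s per Gen4 x16 slot
gen3_x8  = 8 * 0.985    # ≈ 7.9 GB/s per Gen3 x8 slot

total_gbytes = 2 * gen4_x16 + 4 * gen3_x8
total_gbits  = total_gbytes * 8

print(f"{total_gbytes:.1f} GB/s ≈ {total_gbits:.0f} Gbit/s")  # 94.5 GB/s ≈ 756 Gbit/s
```

Either way, as noted below, the theoretical slot sum overstates what the machine can actually move.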
We both should restate and specify the calculation for each different Apple Silicon chip and the PCB/machine model it is wired onto.
The $599 M4 Mac mini base model networking (aggregated WiFi, USB-C, 10G Ethernet, Thunderbolt PCIe) is almost 270 Gbps. Your 720 Gbps is for a >$8000 Mac Pro M2 Ultra, but the number is too high because the 2x Gen4 x16 is shared/oversubscribed with the other PCIe lanes for the x8 PCIe slots, SSD and Thunderbolt. You need to measure/benchmark it, not read the marketing PR.
I estimate the $1400 M4 Pro Mac mini networking bandwidth by adding the external WiFi, 10 Gbps Ethernet, two USB-C ports (2 x 10 Gbps) and three Thunderbolt 5 ports (3 x 80/120 Gbps), but subtracting the PCIe 64 Gbps limit and not counting the internal SSD. Two $599 M4 Mac mini base models are faster and cheaper than one M4 Pro Mac mini.
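My reading of that estimating method as a back-of-envelope sketch, in Gbps. The Wi-Fi figure is a rough assumption of mine, and each 80/120 Gbps port (Thunderbolt 5 on the M4 Pro mini) carries usable data only up to its 64 Gbps PCIe tunnel:

```python
# Sum of the M4 Pro Mac mini's external links, nominal rates in Gbps.
wifi     = 2.4           # assumed effective Wi-Fi throughput (my guess)
ethernet = 10            # 10G Ethernet
usb_c    = 2 * 10        # two 10 Gbps USB-C ports
tb_link  = 3 * 80        # three Thunderbolt ports at nominal link rate
tb_pcie  = 3 * 64        # same ports, capped at the 64 Gbps PCIe tunnel

nominal = wifi + ethernet + usb_c + tb_link
capped  = wifi + ethernet + usb_c + tb_pcie
print(f"nominal ≈ {nominal:.0f} Gbps, PCIe-capped ≈ {capped:.0f} Gbps")
```

The gap between the nominal and PCIe-capped sums is exactly why measured numbers, not port sums, should settle the comparison.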
The point of the precise actual measurements I did of the trillions of operations per second and the billions of bits per second of networking/interconnect of the M4 Mac mini against all the other Apple Silicon machines is to find which package (chip plus PCB plus case) has the best price/performance/watt balance when networked together. In January 2025 you can build the cheapest, fastest supercomputer in the world from just off-the-shelf M4 16GB Mac mini base models with 10G Ethernet, MikroTik 100G switches and a few FPGAs. It would outperform all Nvidia, Cerebras, Tenstorrent and datacenter clusters I know of, mainly because of the low-power Apple Silicon.
Note that the M4 has only 1.2 Tbps unified memory bandwidth and the M4 Pro has double that. The 8 Tbps unified memory bandwidth is on the M1 and M2 Studio Ultra with 64/128/192GB DRAM. Without it you can't reach 50 trillion operations per second. A Mac Studio has only around 190 Gbps external networking bandwidth but does not reach 43 TOPS, and neither does the 720 Gbps of your Mac Pro estimate. By reverse engineering the instruction set you could squeeze a few percent extra performance out of this M4 cluster.
The 43 TOPS of the M4 itself is an estimate. The ANE does 34 TOPS, the CPU less than 5 TOPS depending on float type, and we have no reliable benchmarks for the CPU floating point.
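Back-solving those component figures shows what the GPU would have to contribute for the total to hold. The ANE number is Apple's quoted figure and the CPU bound is from above; the GPU share is a residual guess, not a measurement:

```python
# Back-solve the GPU's share of the M4's ~43 TOPS estimate.
total_tops = 43    # estimated whole-chip figure
ane_tops   = 34    # Apple Neural Engine, Apple's quoted figure
cpu_tops   = 5     # "less than 5", float-type dependent upper bound
gpu_tops   = total_tops - ane_tops - cpu_tops   # residual, not measured
print(f"implied GPU share ≈ {gpu_tops} TOPS")   # implied GPU share ≈ 4 TOPS
```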
I'm confident you can get 100 Gbps in by aggregating M4 Mac mini ports.
I resell a $199 MikroTik CCR2004-1G-2XS-PCIe SmartNIC with 2 x 25 Gbps SFP28 ports that connects to an x8 PCIe 3.0 slot. (I still have a few available for $140 plus shipping, plus a few refurbished 16 x 10 Gbps switches for $400 and 8 x 100 Gbps switches for $800.)
Theoretically you can connect that SmartNIC to two of the three M4 Mac mini Thunderbolt 4/USB4 ports, which each pass through x4 PCIe 3.0, if you can figure out how to aggregate the two x4 PCIe links into a single x8 port. The driver source code is for Linux and could be ported to macOS. You then aggregate the ports with the 100 Gbps switch.
I'm pretty sure you could create a new PCB design with a larger Broadcom switch chip model to attach to the 10G Ethernet, the two 10 Gbps USB-C ports plus the three Thunderbolt 4/USB4 ports, and write a new driver to aggregate over the six ports. You'd have 126 Gbps minus the PCIe overhead and could combine it into a single 100 Gbps QSFP28 port.
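The 126 Gbps figure appears to be the plain sum of the six links, counting each Thunderbolt 4/USB4 port at its x4 PCIe 3.0 data ceiling of roughly 32 Gbps (my reading of how the number was reached):

```python
# 10G Ethernet + two 10 Gbps USB-C ports + three TB4/USB4 ports
# counted at ~32 Gbps of PCIe data each.
links_gbps = [10, 10, 10, 32, 32, 32]
print(sum(links_gbps))  # 126
```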
I already warned this is still theoretical. Broadcom might not sell you the switch chip, Intel might not sell you the Thunderbolt chip and Apple might block the installation of your device driver code.
But people already proved the interconnect with the Apple Thunderbolt Bridge driver at 3 x 10 Gbps connected via large expensive Thunderbolt hubs [2]. Others just connect each port to different M4 Macs [1][3][4] in various ways.
[1] https://x.com/alexocheema/status/1807882764261417000
[2] https://www.youtube.com/watch?v=GBR6pHZ68Ho
There is Metal development. You want to learn Apple M-series GPU and GPGPU development? Learn Metal!
That's what GPGPU stands for. So your two sentences contradict each other.
<Insert your favorite LLM> helped me write some simple Metal-accelerated code by scaffolding the compute pipeline, which took most of the nuisance out of learning the API and let me focus on writing the kernel code.
Here's the code if it's helpful at all. https://github.com/rgov/thps-crack
With that base, I’ve found their docs decent enough, especially coupled with the Metal Shader Language pdf they provide (https://developer.apple.com/metal/Metal-Shading-Language-Spe...), and quite a few code samples you can download from the docs site (e.g. https://developer.apple.com/documentation/metal/performing_c...).
I’d note a lot of their stuff was still written in Objective-C, which I’m not that familiar with. But most of that is boilerplate and the rest is largely C/C++ based (including the Metal shader language).
I just ported some CPU/SIMD number crunching (complex matrices) to Metal, and the speedup has been staggering. What used to take days now takes minutes. It is the hottest my M3 MacBook has ever been though! (See https://x.com/billticehurst/status/1871375773413876089 :-)
Metal, and Apple's docs are the place to start.
There is also a Vulkan path if you want one: MoltenVK runs Vulkan on top of Metal.
If you want portability, use WebGPU, either via wgpu for Rust or Dawn for C++. They actually do run portably on Windows, Linux, Mac, iOS, and Android.