The article mentions that they explored a not-so-large hyper-parameter space (i.e. they trained multiple models, each with different hyper-parameters).
It would be interesting to know how long the whole process takes on the M1 vs. the V100.
For the small models covered in the article, I'd guess that the V100 can train them all concurrently using MPS (Multi-Process Service: multiple processes can use the GPU concurrently).
In particular, it would be interesting to know whether the V100 trains all the models in the same time it takes to train one, and whether the M1 does the same, or whether the M1 takes N times longer to train N models.
This could paint a completely different picture, particularly from the user's perspective. When I go for lunch, coffee, or home, I usually spawn jobs that train a large number of models, so that they are all trained by the time I get back.
I only start training a small number of models in the later phases of development, once I have already explored a large part of the model space.
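That fan-out workflow can be sketched with plain Python multiprocessing. Here `train_one` is a hypothetical stand-in for a real training run (in practice it would call into TensorFlow/PyTorch, and on NVIDIA hardware MPS is what would let the worker processes share one GPU):

```python
from multiprocessing import Pool

def train_one(config):
    """Hypothetical stand-in for one training job.

    A real version would build and fit a model with these
    hyper-parameters; under MPS, each worker process could
    use the same NVIDIA GPU concurrently.
    """
    lr, width = config
    # ... actual training would happen here ...
    return {"lr": lr, "width": width, "val_loss": 1.0 / (lr * width)}

if __name__ == "__main__":
    # Small hyper-parameter grid: 3 learning rates x 2 widths = 6 models.
    grid = [(lr, w) for lr in (1e-3, 1e-2, 1e-1) for w in (64, 128)]
    with Pool(processes=len(grid)) as pool:
        results = pool.map(train_one, grid)  # all jobs run concurrently
    best = min(results, key=lambda r: r["val_loss"])
    print(f"trained {len(results)} models, best: lr={best['lr']}, width={best['width']}")
```

The interesting measurement is exactly the one asked for above: whether wall-clock time for the whole pool is close to the time for a single job (V100 with MPS) or N times it (if the device serializes the jobs).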
---
To draw an analogy, what this article is doing is similar to benchmarking a 64-core CPU against a 1-core CPU using a single-threaded benchmark. The 64-core CPU happens to be slightly beefier and faster than the 1-core CPU, but it is more expensive and consumes more power because... it has 64x more cores. So to put things in perspective, it would make sense to also show a benchmark that can use all 64 cores, which is the reason somebody would buy a 64-core CPU, and see how the single-core one compares (typically 64x slower).
---
To me, the only news here is that Apple's GPU cores are not very far behind NVIDIA's for ML training, but there is much more to a GPGPU than the performance you get for small models on a small number of cores. Apple would still need to (1) catch up, and (2) massively scale up their design. They probably can do both if they set their eyes on it. Exciting times.
Depends on your field. In reinforcement learning you often really do train just one, at least on the same data set (since the data set is often dynamically generated based on the behavior of the previous iteration of the model).
I completely agree with your conclusion here.
Also, I tend to do this initially, when I am exploring the hyperparameter space, for which I tend to use more, smaller models.
I find that using big models initially is just a waste of time. You want to try many things as quickly as possible.
I found setting up Apple’s M1 fork of TensorFlow to be fairly easy, BTW.
I am writing a new book on using Swift for AI applications, motivated by the “niceness” of the Swift language and Apple’s CoreML libraries.
If you are on Linux, then Swift for TensorFlow is OK. You will save some effort by using Google Colab notebooks, which support Swift and Swift for TensorFlow.
(and that's on CIFAR-10). But why not report these results and also test on more realistic datasets? The internet is full of M1 TF benchmarks on CIFAR or MNIST; has anyone seen anything different?
I wish ML used more than CIFNISTNet, but unfortunately there's not a lot of standard datasets yet. (Even Imagenet is an absolute pain to set up.)
Maybe there's something interesting there, definitely, but the overhype of the title takes away any significant amount of clout I'd give to the publishers for research. If you find something interesting, say it, and stop making vapid generalizations for the sake of more clicks.
Remember, we only feed the AI hype bubble when we do this. The results might be good, but we need to be at least realistic about them, or there won't be an economy of innovation for people to listen to in the future, because they'll have tuned it out along with all the crap marketing that came before.
Thanks for coming to my TED Talk!
It's pretty surprising to me that an M1 performs anywhere near a V100 on model training and I guess the most striking thing is the energy efficiency of the M1.
Though per your note on the scales, those are really interesting empirical results. I'll have to look into that; thanks for passing it along.
Wasn't this whole "M1 memory" thing decided to be a myth now that some more technical people have dissected it?
1. There is an idea that the M1 has RAM with vastly higher bandwidth than Intel/AMD machines. In reality it is the same laptop DDR RAM that other machines have, albeit at a very high clock rate, though not higher than in the best Intel laptops. So the bandwidth is no more amazing than a top-end Intel laptop's, and latency is no different.
2. But in this case I believe they are talking about the CPU and GPU both being able to freely access the same RAM, as compared to a setup where you have a discrete GPU with its own RAM, where data must first be copied to the GPU's RAM for the GPU to do something with it. In some workloads this can be an inferior approach, in others superior, as the GPU's RAM is faster. The M1 model again isn't unique, as it's similar to how game consoles work, I believe.
LPDDR4 is actually better known from cell phones than laptops. I think it shows the stagnation of the laptop market (and DDR4) that LPDDR4 is really catching up (and then some). Or maybe, because cell phones are more widespread these days, cell phones just naturally get the better tech?
On the other hand, M1 is pretty wide. Apple clearly is tackling the memory bottleneck very strongly in its design.
DDR5 is going to be the next major step forward for desktops/laptops.
> 2. But in this case I believe they are talking about the CPU and GPU both being able to freely access the same RAM, as compared to a setup where you have a discrete GPU with its own RAM, where data must first be copied to the GPU's RAM for the GPU to do something with it. In some workloads this can be an inferior approach, in others superior, as the GPU's RAM is faster. The M1 model again isn't unique, as it's similar to how game consoles work, I believe.
More than just the "same RAM": they probably even share the same last-level cache. Both AMD's APUs and Intel's chips with iGPUs share the last-level cache between CPU and GPU in their hybrid architectures.
However: it seems like on-core SIMD units (aka: AVX or ARM NEON / SVE) are even lower latency, since those share L1 cache.
Any situation where you need low latency but SIMD, it makes more sense to use AVX / SVE than even waiting for L3 cache to talk to the iGPU. Any situation where you need massive parallelism, a dedicated 3090 is more useful.
It's going to be tough to figure out a good use for iGPUs: they're being squeezed on the latency front (by things like the A64FX, with 512-bit ARM SIMD, as well as AVX-512 on the Intel side), and also on the bandwidth front (by classic GPUs).
Has anyone spotted any work allowing a mainstream tensor library (e.g. jax, tf, pytorch) to run on the neural engine?
We don’t have apples-to-apples benchmarks like SPECint/SPECfp for the SoC accelerators in the M1 (GPU, NPU, etc.), so these early attempts are both facile and critical as we try to categorize and compare the trade-offs between the SoC/discrete and performance/perf-per-watt options available.
Power efficient SoC for desktops is new and we are learning as we go.
We do: https://mlperf.org/
Just run their benchmarks. Submitting your results there is a bit more complicated, because all results there are "verified" by independent entities.
If you feel like your AI use case is not well represented by any of the MLPerf benchmarks, open a discussion thread about it, propose a new benchmark, etc.
The set of benchmarks there increases all the time to cover new applications. For example, on top of the MLPerf Training and MLPerf Inference benchmark suites, we now have a new MLPerf HPC suite to capture ML of very large models.
I think the challenge is selecting the tests that best represent the typical ML/DL use cases for the M1 and comparing it to an alternative such as the V100 using a common toolchain like TensorFlow. One of the problems I see is that the optimizer/codegen of the toolchain is a key component; the M1 has both a GPU and a Neural Engine, and we don't know which accelerator is targeted, or possibly both. Should we benchmark Create ML on the M1 vs. the A14 or A12X? Perhaps it is my ignorance, but I don't think we are at a point where our existing benchmarks can be applied meaningfully to the M1, though I'm sure we will get there soon.
The special thing about the V100 is that its driver EULA allows data center usage. If you don't need that, there are other much cheaper options.
For much closer performance you should get a 2080ti, which should be roughly comparable in speed and have 11GB [edit: wrongly wrote 14GB before] of memory (against the 16GB for the V100). Price-wise you still save a lot of money, after quickly googling around, roughly $1200 vs. $15k-$20k.
But you still lose something: e.g., if you use half precision on a V100 you get virtually double speed, whereas on a 1080 you get... nothing, because fast FP16 isn't supported.
(and more importantly for companies, you can actually use only V100-style stuff on servers [edit: as you mentioned already, although I'm not 100% sure it's just drivers that are the issue?])
[1] I've not used the 1080 myself, but I've used the 1080 Ti and V100 extensively, and the latter is about 30% faster. Hence my estimate for the 1080 comparison.
Wait what? Is it the only thing?
That sounds hard to believe: if true, using the open driver (Nouveau) instead of NVIDIA's proprietary one would be a massive money saver for datacenter operators (and even if Nouveau doesn't already support the features you'd want, supporting their development would be much cheaper for a company like Amazon than paying a premium on every GPU they buy)
laughs
(for comparison, GPT3: 175,000,000,000 parameters)
Can Apple's M1 help you train tiny toy examples with no real-world relevance? You bet it can!
Plus it looks like they are comparing Apples to Oranges ;) This seems to be 16-bit precision on the M1 and 32-bit on the V100. So the M1-trained model will most likely yield worse or unusable results, due to lack of precision.
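A quick toy numpy sketch of what the precision gap means (not tied to this particular benchmark): FP16 carries only 11 significand bits, so it can't represent every integer above 2048, and it silently drops small updates that FP32 would keep.

```python
import numpy as np

# FP16 has an 11-bit significand: not every integer above 2048 is representable.
print(np.float16(2049))            # rounds back to 2048.0

# Machine epsilon: the smallest relative step FP16 can resolve.
print(np.finfo(np.float16).eps)    # ~0.000977, vs ~1.19e-07 for FP32

# A tiny gradient-style update can vanish outright in FP16.
w = np.float16(1.0)
update = np.float16(1e-4)
print(w + update == w)             # True: the update is lost
```

This is why FP16 training usually needs tricks like loss scaling or mixed precision rather than a straight dtype swap.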
And lastly, they are plainly testing against the wrong target. The V100 is great, but it is far from NVIDIA's flagship for training small low-precision models. At the FP16 that the M1 is using, the correct target would have been an RTX 3090 or the like, which has 35 TFLOPS. The V100 only gets 14 TFLOPS because it lacks the dedicated TensorRT accelerator hardware.
So they compare the M1 against an NVIDIA model from 2017 that lacks the relevant hardware acceleration and, thus, is a whopping 60% slower than what people actually use for such training workloads.
I'm sure my bicycle will also compare very favorably against a car that is lacking two wheels :p
The arguments made (and I use the word arguments loosely):
"Too few trainable_params compared to GPT-3".
GPT-3 is several orders of magnitude larger than what most people train, so it's a useless comparison. It's like comparing a bike to an e-bike, and someone says "yeah, but can the e-bike go faster than a rocket?"
Second argument: "Sure, it's faster than a machine that costs 3-4 times more, but you should instead compare it to a machine that costs even more than that".
I can only take it as a troll comment.
A huge number of models are "small". I'm currently training game units for autonomous behaviors. The M1 is massively oversized for my need.
Saying "Oh look, GPT-3" just stupidifies the conversation, and is classic dismissive nonsense.
I wouldn't complain about a benchmark executing a real-world SOTA model on the M1 and V100, but those will most likely not even run on the M1 due to memory constraints.
So this article is like using an iOS game to evaluate a Mac Pro. You can do it, but it's not really useful.
V100 has both vec2 hfma (i.e. fp16 multiply-add is twice the rate of fp32), getting ~30 TFLOPS, and tensor cores which can achieve up to 4x that for matrix multiplications.
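Those figures check out with back-of-envelope arithmetic from NVIDIA's published V100 specs (5120 CUDA cores, 640 tensor cores; the ~1.53 GHz boost clock assumed here is the SXM2 part and varies by SKU):

```python
cores = 5120          # CUDA cores on the V100
clock = 1.53e9        # boost clock in Hz (SXM2 variant; an assumption)

fp32 = cores * 2 * clock            # 1 FMA = 2 FLOPs per core per clock
fp16 = fp32 * 2                     # vec2 hfma: twice the FP32 rate
tensor = 640 * 64 * 2 * clock       # 640 tensor cores, 64 FMAs each per clock

print(f"FP32:   {fp32 / 1e12:.1f} TFLOPS")    # ~15.7
print(f"FP16:   {fp16 / 1e12:.1f} TFLOPS")    # ~31.3
print(f"Tensor: {tensor / 1e12:.1f} TFLOPS")  # ~125.3
```

Note the tensor-core figure is exactly 4x the vec2 FP16 rate, matching the "up to 4x for matrix multiplications" claim.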
> trainable parameters: 2236682

The V100 is almost 5-10x the price of an M1.