The article mentions that they explored a not-so-large hyper-parameter space (i.e. they trained multiple models, each with different hyper-parameters).
It would be interesting to know how long the whole process takes on the M1 vs. the V100.
For the small models covered in the article, I'd guess that the V100 can train them all concurrently using MPS (Multi-Process Service: multiple processes can use the GPU concurrently).
In particular, it would be interesting to know whether the V100 trains all the models in the same time it takes to train one, and whether the M1 does the same, or whether the M1 takes N times longer to train N models.
This could paint a completely different picture, particularly from the user's perspective. When I go for lunch, coffee, or home, I usually spawn jobs that train a large number of models, so that they are all trained by the time I get back.
I only start training a small number of models in the later phases of development, once I have already explored a large part of the model space.
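That fan-out workflow can be sketched with plain Python multiprocessing. Here `train_one` is a hypothetical stand-in for a real training run (in practice it would call into TensorFlow/PyTorch, and on NVIDIA hardware MPS is what would let the worker processes share one GPU):

```python
from multiprocessing import Pool

def train_one(config):
    """Hypothetical stand-in for one training job.

    A real version would build and fit a model with these
    hyper-parameters; under MPS, each worker process could
    use the same NVIDIA GPU concurrently.
    """
    lr, width = config
    # ... actual training would happen here ...
    return {"lr": lr, "width": width, "val_loss": 1.0 / (lr * width)}

if __name__ == "__main__":
    # Small hyper-parameter grid: 3 learning rates x 2 widths = 6 models.
    grid = [(lr, w) for lr in (1e-3, 1e-2, 1e-1) for w in (64, 128)]
    with Pool(processes=len(grid)) as pool:
        results = pool.map(train_one, grid)  # all jobs run concurrently
    best = min(results, key=lambda r: r["val_loss"])
    print(f"trained {len(results)} models, best: lr={best['lr']}, width={best['width']}")
```

The interesting measurement is exactly the one asked for above: whether wall-clock time for the whole pool is close to the time for a single job (V100 with MPS) or N times it (if the device serializes the jobs).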
---
To draw an analogy, what this article is doing is similar to benchmarking a 64-core CPU against a 1-core CPU using a single-threaded benchmark. The 64-core CPU happens to be slightly beefier and faster than the 1-core CPU, but it is more expensive and consumes more power because... it has 64x more cores. So to put things in perspective, it would make sense to also show a benchmark that can use all 64 cores, which is the reason somebody would buy a 64-core CPU, and see how the single-core one compares (typically 64x slower).
---
To me, the only news here is that Apple's GPU cores are not very far behind NVIDIA's for ML training, but there is much more to a GPGPU than the performance you get for small models on a small number of cores. Apple would still need to (1) catch up, and (2) massively scale up their design. They probably can do both if they set their eyes on it. Exciting times.
Depends on your field. In reinforcement learning you often really do train just one, at least on the same data set (since the data set is often dynamically generated based on the behavior of the previous iteration of the model).
I completely agree with your conclusion here.
Also, I tend to do this initially, when I am exploring the hyperparameter space, for which I tend to use more, smaller models.
I find that using big models initially is just a waste of time. You want to try many things as quickly as possible.
I found setting up Apple’s M1 fork of TensorFlow to be fairly easy, BTW.
I am writing a new book on using Swift for AI applications, motivated by the “niceness” of the Swift language and Apple’s CoreML libraries.
If you are on Linux, then Swift for TensorFlow is OK. You will save some effort by using Google Colab notebooks, which support Swift and Swift for TensorFlow.
(and that's on CIFAR-10). But why not report these results and also test on more realistic datasets? The internet is full of M1 TF benchmarks on CIFAR or MNIST; has anyone seen anything different?
I wish ML used more than CIFNISTNet, but unfortunately there's not a lot of standard datasets yet. (Even Imagenet is an absolute pain to set up.)
Maybe there's something interesting there, definitely, but the overhype of the title takes away any significant amount of clout I'd give to the publishers for research. If you find something interesting, say it, and stop making vapid generalizations for the sake of more clicks.
Remember, we only feed the AI hype bubble when we do this. The results might be good, but we need to be at least realistic about them, or there won't be an economy of innovation for people to listen to in the future, because they'll have tuned it out along with all the crap marketing that came before.
Thanks for coming to my TED Talk!
It's pretty surprising to me that an M1 performs anywhere near a V100 on model training and I guess the most striking thing is the energy efficiency of the M1.
Though per your note on the scales, those are really interesting empirical results. I'll have to look into that; thanks for passing it along.
Wasn't this whole "M1 memory" thing decided to be a myth now that some more technical people have dissected it?
1. There is an idea that the M1 has RAM with vastly higher bandwidth than Intel/AMD machines. In reality it is the same laptop DDR RAM that other machines have, albeit at a very high clock rate, though not higher than in the best Intel laptops. So the bandwidth is no more amazing than a top-end Intel laptop's, and latency is no different.
2. But in this case I believe they are talking about the CPU and GPU both being able to freely access the same RAM, as compared to a setup where you have a discrete GPU with its own RAM, where data must first be copied to the GPU's RAM for the GPU to do something with it. In some workloads this can be an inferior approach, in others superior, as the GPU's RAM is faster. The M1 model again isn't unique, as it's similar to how game consoles work, I believe.
LPDDR4 is actually better known from cell phones than laptops. I think it shows the stagnation of the laptop market (and DDR4) that LPDDR4 is really catching up (and then some). Or maybe, because cell phones are more widespread these days, cell phones just naturally get the better tech?
On the other hand, M1 is pretty wide. Apple clearly is tackling the memory bottleneck very strongly in its design.
DDR5 is going to be the next major step forward for desktops/laptops.
> 2. But in this case I believe they are talking about the CPU and GPU both being able to freely access the same RAM, as compared to a setup where you have a discrete GPU with its own RAM, where data must first be copied to the GPU's RAM for the GPU to do something with it. In some workloads this can be an inferior approach, in others superior, as the GPU's RAM is faster. The M1 model again isn't unique, as it's similar to how game consoles work, I believe.
More than just the "same RAM": they probably even share the same last-level cache. Both AMD's APUs and Intel's chips with iGPUs share the last-level cache between CPU and GPU in their hybrid architectures.
However: it seems like on-core SIMD units (aka: AVX or ARM NEON / SVE) are even lower latency, since those share L1 cache.
Any situation where you need low latency but SIMD, it makes more sense to use AVX / SVE than even waiting for L3 cache to talk to the iGPU. Any situation where you need massive parallelism, a dedicated 3090 is more useful.
It's going to be tough to figure out a good use for iGPUs: they're being squeezed on the latency front (by things like the A64FX, with 512-bit ARM SIMD, as well as AVX-512 on the Intel side), and also on the bandwidth front (by classic GPUs).
Has anyone spotted any work allowing a mainstream tensor library (e.g. jax, tf, pytorch) to run on the neural engine?
We don’t have apples-to-apples benchmarks like SPECint/SPECfp for the SoC accelerators in the M1 (GPU, NPU, etc.), so these early attempts are both facile and critical as we try to categorize and compare the trade-offs between the SoC/discrete and performance/perf-per-watt options available.
Power efficient SoC for desktops is new and we are learning as we go.
We do: https://mlperf.org/
Just run their benchmarks. Submitting your results there is a bit more complicated, because all results there are "verified" by independent entities.
If you feel like your AI use case is not well represented by any of the MLPerf benchmarks, open a discussion thread about it, propose a new benchmark, etc.
The set of benchmarks there increases all the time to cover new applications. For example, on top of the MLPerf Training and MLPerf Inference benchmark suites, we now have a new MLPerf HPC suite to capture ML of very large models.
I think the challenge is selecting the tests that best represent the typical ML/DL use cases for the M1 and comparing it to an alternative such as the V100 using a common toolchain like TensorFlow. One of the problems I see is that the optimizer/codegen of the toolchain is a key component; the M1 has both a GPU and a Neural Engine, and we don't know which accelerator is targeted, or possibly both. Should we benchmark Create ML on the M1 vs. the A14 or A12X? Perhaps it is my ignorance, but I don't think we are at a point where our existing benchmarks can be applied meaningfully to the M1, though I'm sure we will get there soon.
The special thing about the V100 is that its driver EULA allows data center usage. If you don't need that, there are other much cheaper options.
For much closer performance you should get a 2080ti, which should be roughly comparable in speed and have 11GB [edit: wrongly wrote 14GB before] of memory (against the 16GB for the V100). Price-wise you still save a lot of money, after quickly googling around, roughly $1200 vs. $15k-$20k.
But you still lose something: e.g., if you use half precision on a V100 you get virtually double speed, whereas on a 1080 you get... nothing, because fast FP16 isn't supported.
(and more importantly for companies, you can actually use only V100-style stuff on servers [edit: as you mentioned already, although I'm not 100% sure it's just drivers that are the issue?])
[1] I've not used the 1080 myself, but I've used the 1080 Ti and V100 extensively, and the latter is about 30% faster. Hence my estimate for the 1080 comparison.
Wait what? Is it the only thing?
That sounds hard to believe: if true, using the open driver (Nouveau) instead of NVIDIA's proprietary one would be a massive money saver for datacenter operators (and even if Nouveau doesn't already support the features you'd want, supporting their development would be much cheaper for a company like Amazon than paying a premium on every GPU they buy)
laughs
(for comparison, GPT3: 175,000,000,000 parameters)
Can Apple's M1 help you train tiny toy examples with no real-world relevance? You bet it can!
Plus it looks like they are comparing Apples to Oranges ;) This seems to be 16-bit precision on the M1 and 32-bit on the V100. So the M1-trained model will most likely yield worse or unusable results, due to lack of precision.
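A quick toy numpy sketch of what the precision gap means (not tied to this particular benchmark): FP16 carries only 11 significand bits, so it can't represent every integer above 2048, and it silently drops small updates that FP32 would keep.

```python
import numpy as np

# FP16 has an 11-bit significand: not every integer above 2048 is representable.
print(np.float16(2049))            # rounds back to 2048.0

# Machine epsilon: the smallest relative step FP16 can resolve.
print(np.finfo(np.float16).eps)    # ~0.000977, vs ~1.19e-07 for FP32

# A tiny gradient-style update can vanish outright in FP16.
w = np.float16(1.0)
update = np.float16(1e-4)
print(w + update == w)             # True: the update is lost
```

This is why FP16 training usually needs tricks like loss scaling or mixed precision rather than a straight dtype swap.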
And lastly, they are plainly testing against the wrong target. The V100 is great, but it is far from NVIDIA's flagship for training small low-precision models. At the FP16 that the M1 is using, the correct target would have been an RTX 3090 or the like, which has 35 TFLOPS. The V100 only gets 14 TFLOPS because it lacks the dedicated TensorRT accelerator hardware.
So they compare the M1 against an NVIDIA model from 2017 that lacks the relevant hardware acceleration and, thus, is a whopping 60% slower than what people actually use for such training workloads.
I'm sure my bicycle will also compare very favorably against a car that is lacking two wheels :p
The arguments made (and I use the word arguments loosely):
"Too few trainable_params compared to GPT-3".
GPT-3 is several orders of magnitude larger than what most people train, so it's a useless comparison. It's like comparing a bike to an e-bike, and someone says "yeah, but can the e-bike go faster than a rocket?"
Second argument: "Sure, it's faster than a machine that costs 3-4 times more, but you should instead compare it to a machine that costs even more than that".
I can only take it as a troll comment.
A huge number of models are "small". I'm currently training game units for autonomous behaviors. The M1 is massively oversized for my need.
Saying "Oh look, GPT-3" just stupidifies the conversation, and is classic dismissive nonsense.
I wouldn't complain about a benchmark executing a real-world SOTA model on the M1 and V100, but those will most likely not even run on the M1 due to memory constraints.
So this article is like using an iOS game to evaluate a Mac Pro. You can do it, but it's not really useful.
V100 has both vec2 hfma (i.e. fp16 multiply-add is twice the rate of fp32), getting ~30 TFLOPS, and tensor cores which can achieve up to 4x that for matrix multiplications.
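Those figures check out with back-of-envelope arithmetic from NVIDIA's published V100 specs (5120 CUDA cores, 640 tensor cores; the ~1.53 GHz boost clock assumed here is the SXM2 part and varies by SKU):

```python
cores = 5120          # CUDA cores on the V100
clock = 1.53e9        # boost clock in Hz (SXM2 variant; an assumption)

fp32 = cores * 2 * clock            # 1 FMA = 2 FLOPs per core per clock
fp16 = fp32 * 2                     # vec2 hfma: twice the FP32 rate
tensor = 640 * 64 * 2 * clock       # 640 tensor cores, 64 FMAs each per clock

print(f"FP32:   {fp32 / 1e12:.1f} TFLOPS")    # ~15.7
print(f"FP16:   {fp16 / 1e12:.1f} TFLOPS")    # ~31.3
print(f"Tensor: {tensor / 1e12:.1f} TFLOPS")  # ~125.3
```

Note the tensor-core figure is exactly 4x the vec2 FP16 rate, matching the "up to 4x for matrix multiplications" claim.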
> trainable parameters: 2236682

The V100 is almost 5-10x the price of an M1.