Show HN: TensorFlow for AWS on a Real GPU

(github.com)

77 points · alexkern · 10y ago · 12 comments

tacos · 10y ago

Top link points to a non-existent AMI so this "recipe" -- like almost every "here's how to ..." recipe on GitHub that gets posted on HN -- doesn't work. I swear, it's like 100% fail on these types of posts.

Also this "real GPU" is explicitly called out in the Google docs as unsupported.

michaelZejoop · 10y ago

I can't ssh into the instance using the AWS documentation (I was pretty careful following the instructions and know I have the right instance ID and region). I get "Permission denied (publickey)."

(I've since fixed this - I hadn't chmod'd the key file correctly, and didn't account for working from an Ubuntu machine)
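For anyone hitting the same error: OpenSSH ignores a private key file that is readable by group or others, which is one common route to "Permission denied (publickey)." Below is a minimal Python sketch of the check and the fix, roughly what `chmod 400` on the key does. The helper names are hypothetical, and this covers only the key-permission cause, not every reason the login can fail.

```python
import os
import stat


def key_is_too_open(path):
    """Return True if group or others have any permission bits set on the key file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return bool(mode & (stat.S_IRWXG | stat.S_IRWXO))


def restrict_key(path):
    """Make the key readable by its owner only (same effect as `chmod 400`)."""
    os.chmod(path, stat.S_IRUSR)
```

Call `key_is_too_open` on the downloaded `.pem` file and, if it returns True, call `restrict_key` before retrying ssh.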

droque · 10y ago

I tried with the Oregon region and it failed. I changed to the N. California region and it worked. (No comment about the GPU though.)

tacos · 10y ago

GitHub recipes are the "works on my machine!" of the new millennium. Just like pecking in programs from computer magazines in the 1980s, only better!

I like the ones that are hardcoded to a specific name in a home directory best. Especially when it doesn't match the GitHub name of the "creator."

alexkern (OP) · 10y ago

I've added a note that the AMI's region must be us-west-1. Thanks for the heads up!

fred256 · 10y ago

An AMI ID is region-specific; the original poster should probably have mentioned which region he created the AMI in.
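Since an AMI ID only resolves in the region where the image lives, a launch script can fail early with a readable message instead of a confusing "not found" error. A rough sketch, assuming a hand-maintained table: the `AMI_BY_REGION` name, the `ami_for_region` helper, and the placeholder ID are all hypothetical, and the real ID would come from the README.

```python
# Hypothetical region -> AMI ID table; the ID below is a placeholder, not a real image.
AMI_BY_REGION = {
    "us-west-1": "ami-xxxxxxxx",  # substitute the ID published in the README
}


def ami_for_region(region):
    """Return the AMI ID for a region, or raise a readable error if none is listed."""
    try:
        return AMI_BY_REGION[region]
    except KeyError:
        known = ", ".join(sorted(AMI_BY_REGION))
        raise ValueError(
            f"No AMI listed for {region!r}; available regions: {known}"
        ) from None
```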

Smerity · 10y ago

Honestly, there are enough issues with TensorFlow right now around CUDA compute capability 3.0 that using it on AWS is highly problematic. I appreciate the author's attempt, but there's no way the five lines of code he changed to allow compute capability 3.0 have fixed any of the issues found in [1], such as NaNs during training or training being just as slow on a g2.8xlarge as on a g2.2xlarge.

If you're just interested in playing around, then your laptop will do fine - TensorFlow is happy with just about any hardware you throw at it. Hell, your modern Android phone will run it =]

If you're interested in a more involved experiment, develop and debug your task locally on your laptop. By the time you're ready for large scale training, there might be a stable and battle tested AMI such that people are no longer reporting issues in [1] about it.

Again, if you're interested, follow the CUDA compute capability 3.0 issue on GitHub [1] - this is nowhere near a solved problem and will only cause headaches if you're using it for education.

[1]: https://github.com/tensorflow/tensorflow/issues/25
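One cheap way to notice the NaN problem early is to check the loss at every step and stop at the first NaN rather than discovering it hours later. A minimal sketch in plain Python, assuming the per-step losses are available as floats; `first_nan_step` is a hypothetical helper, not part of TensorFlow.

```python
import math


def first_nan_step(losses):
    """Return the index of the first NaN loss, or None if no loss is NaN."""
    for step, loss in enumerate(losses):
        if math.isnan(loss):
            return step
    return None
```

In a real training loop the same `math.isnan` check would sit right after each loss evaluation, so the run can be aborted at the offending step.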

alexkern (OP) · 10y ago

Thanks for the feedback! I've added a note to the README that support is still experimental. I'll be tracking the issue and updating the repo + AMI as it develops. Will be compiling with the latest commit (72a5a60) for configurable CUDA Compute support soon.

vrv · 10y ago

Thanks, would be good to have multiple sources of verification that HEAD now supports this natively without issues like unexpected NaNs.

https://github.com/tensorflow/tensorflow/issues/25#issuecomm... is one verification :)

erikbern · 10y ago

I think the NaN issue you are referring to was caused by some weird stuff with Bazel.

I put together an AMI in Virginia: ami-cf5028a5, and if you have a masochistic streak, here are the steps to do it yourself: https://gist.github.com/erikbern/78ba519b97b440e10640

The main issue I'm still seeing is that g2.8xlarge with my AMI doesn't run 4x faster than g2.2xlarge even though it correctly detects the 4 GPUs. Haven't had time to identify the issue though.
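To put a number on "doesn't run 4x faster," one can compare wall-clock time for the same job on both instance types and compute speedup and per-GPU efficiency. A small sketch; the function name is hypothetical and the timings in the usage lines are made up purely for illustration, not measurements from either instance.

```python
def scaling_efficiency(time_single, time_multi, n_gpus):
    """Return (speedup, efficiency) for a multi-GPU run versus a single-GPU run.

    speedup    = single-GPU time / multi-GPU time
    efficiency = speedup / number of GPUs (1.0 would be perfect linear scaling)
    """
    speedup = time_single / time_multi
    return speedup, speedup / n_gpus


# Illustrative, made-up timings in seconds: 100 s on one GPU, 80 s on four.
speedup, efficiency = scaling_efficiency(100.0, 80.0, 4)
print(speedup, efficiency)  # prints: 1.25 0.3125
```

An efficiency well below 1.0 suggests that the extra GPUs are mostly waiting, for example on data transfers.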

varelse · 10y ago

Xen virtualization disables P2P copies, ergo the GPUs have what we call a "failure to communicate," and some GPUs you just can't reach (without going through the CPU, that is).
