4000x Speedup in Reinforcement Learning with Jax

(chrislu.page)

131 points · _hark · 3y ago · 30 comments

27 comments · 8 top-level

percentcer · 3y ago · 10 in thread

Reminds me of this evergreen tweet from ryg: https://mobile.twitter.com/rygorous/status/12712968344392826...

  if you made something 2x faster, you might have done something smart
  if you made something 100x faster, you definitely just stopped doing something stupid

canucker2016 · 3y ago

I recall the Programming Pearls article from the Sept. 1984 issue of the Communications of the ACM journal, which compared various algorithms to determine the max sum of a contiguous subarray of an array.

The article showed how a linear, O(N), algorithm running on a lowly 8-bit CPU can beat a cubic algorithm, O(N^3), running on a Cray supercomputer, when N is sufficiently large.

see https://www.cs.rpi.edu/~moorthy/Courses/CSCI2300/p865-bentle...
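A rough sketch of the two extremes being compared, for anyone who hasn't seen the problem. The function names and sample data are illustrative, and the convention that an empty subarray has sum 0 is an assumption of this sketch, not something stated in the comment above.

```python
def max_subarray_cubic(xs):
    """O(N^3): try every (start, end) pair and re-sum the slice each time."""
    best = 0  # assumption: the empty subarray is allowed and sums to 0
    n = len(xs)
    for i in range(n):
        for j in range(i, n):
            best = max(best, sum(xs[i:j + 1]))
    return best


def max_subarray_linear(xs):
    """O(N): one pass, tracking the best sum of a subarray ending at each index."""
    best = 0
    ending_here = 0
    for x in xs:
        # Either extend the current run or drop it and start fresh from empty.
        ending_here = max(0, ending_here + x)
        best = max(best, ending_here)
    return best


data = [31, -41, 59, 26, -53, 58, 97, -93, -23, 84]
print(max_subarray_cubic(data), max_subarray_linear(data))  # both print 187
```

Both return the same answer; only the amount of repeated work differs, which is why the constant-factor advantage of faster hardware stops mattering once N is large.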

hyperbovine · 3y ago

Meh. This tweet is a lot less clever than it seems. Shave a factor of n off the complexity of your algorithm, as happens regularly in CS and informatics, and have all the 1000x speedups you want.

squeaky-clean · 3y ago

If you shave a factor of n off of your algorithm, it usually isn't the same algorithm anymore. That's what they mean, the previous algorithm choice was "stupid" and you've stopped doing something stupid.

dan-robertson · 3y ago

I don’t think it seems clever so much as pithy? I don’t think it’s necessarily about algorithms so much as it is a claim that the main way to be 1000x slower is by doing 999x too much work or waiting.

An example that is basically unrelated to complexity theory is something like talking to a distant service but keeping only a small number of requests in flight, because e.g. you were worried about load, or didn’t notice you were waiting for acks, or created a new TCP connection for each request, or had a small sendbuf, or somehow sent way too fast, got rate limited, and needed to retry.
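A toy illustration of the "waiting" case, assuming a made-up 10 ms round trip simulated with `asyncio.sleep`. Nothing here talks to a real service; `fake_request`, `LATENCY`, and `max_in_flight` are hypothetical names for this sketch.

```python
import asyncio

LATENCY = 0.01  # simulated round-trip time in seconds (made-up figure)


async def fake_request(i):
    # Stand-in for a remote call: wait one round trip, then "respond".
    await asyncio.sleep(LATENCY)
    return i * i


async def one_at_a_time(n):
    # Each request waits for the previous reply: about n round trips in total.
    return [await fake_request(i) for i in range(n)]


async def pipelined(n, max_in_flight=16):
    # Keep up to max_in_flight requests outstanding at once.
    limit = asyncio.Semaphore(max_in_flight)

    async def bounded(i):
        async with limit:
            return await fake_request(i)

    return await asyncio.gather(*(bounded(i) for i in range(n)))


slow = asyncio.run(one_at_a_time(32))
fast = asyncio.run(pipelined(32))
```

Both calls return the same results. The second spends roughly 2 round trips waiting instead of 32, with no change in algorithmic complexity.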

amscanne · 3y ago

I believe the underlying point is that the bigger the N, the more obvious it probably is. If you’re in 1000x territory, there’s a strong likelihood that the improvement is so obvious that doing the opposite might be called “stupid”.

jakeinspace · 3y ago

I think it’s fair to classify using O(n^2) sorting as stupid (if the input size is significant). In numerical computing, using a naive routine for computing eigenvalues in O(n^4) would equally be considered stupid, unless the input sizes are known to be small of course.

yieldcrv · 3y ago

I love the quote, I’ll be using it the rest of my life

quickthrower2 · 3y ago

Just prove P = NP while you are at it

thatguymike · 3y ago

Or, you started doing something stupid!

__turbobrew__ · 3y ago

if you made something 100x faster you don’t understand how mmap works

sillysaurusx · 3y ago · 3 in thread

It's a little disingenuous to say that the 4000x speedup is due to Jax. I'm a huge Jax fanboy (one of the biggest) but the speedup here is thanks to running the simulation environment on a GPU. But as much as I love Jax, it's still extraordinarily difficult to implement even simple environments purely on a GPU.

My long-term ambition is to replicate OpenAI's Dota 2 reinforcement learning work, since it's one of the most impactful (or at least most entertaining) uses of RL. It would be more or less impossible to translate the game logic into Jax, short of transpiling C++ to Jax somehow. Which isn't a bad idea – someone should make that.

It should also be noted that there's a long history of RL being done on accelerators. AlphaZero's chess evaluations ran entirely on TPUs. Pytorch CUDA graphs also make it easier to implement this kind of thing nowadays, since (again, as much as I love Jax) some Pytorch constructs are simply easier to use than turning everything into a functional programming paradigm.

All that said, you should really try out Jax. The fact that you can calculate gradients w.r.t. any arbitrary function is just amazing, and you have complete control over what's JIT'ed into a GPU graph and what's not. It's a wonderful feeling compared to using Pytorch's accursed .backward() accumulation scheme.
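For readers who haven't used it, a minimal sketch of those two features: `jax.grad` turns a plain function into its gradient function, and `jax.jit` is applied only where you ask for it. The `loss` function and the numbers are made up for illustration.

```python
import jax
import jax.numpy as jnp


def loss(w, x, y):
    # An arbitrary differentiable function of w; nothing framework-specific.
    pred = jnp.tanh(x @ w)
    return jnp.mean((pred - y) ** 2)


# jax.grad returns a new function computing d(loss)/d(w) (the first argument).
grad_loss = jax.grad(loss)

# Compilation is opt-in: only this wrapped version is JIT-compiled.
fast_grad_loss = jax.jit(grad_loss)

w = jnp.array([0.5, -0.25])
x = jnp.array([[1.0, 2.0], [3.0, 4.0]])
y = jnp.array([0.0, 1.0])

g = fast_grad_loss(w, x, y)  # same shape as w
```

The gradient is returned as a value rather than accumulated into a `.grad` attribute, which is the contrast with `.backward()` being drawn above.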

Can't wait for a framework that feels closer to pure arbitrary Python. Maybe AI can figure out how to do it.

luchris429 · 3y ago

Author here! I didn't realize this got posted on HN. While indeed we do get a speedup by putting the environments on the GPU, most of the speedup seems to come from the ability to easily parallelize RL training with Jax.

While there is work on putting RL environments on accelerators, the main speedup from this work comes from also training many RL agents in parallel. This is largely because the neural networks we use in RL are relatively small and thus don't utilize the GPU very efficiently.

While this was always possible to do, Jax makes it way easier because we just need to call `jax.vmap` to get it to work.
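A minimal sketch of the idea, not the code from the blog post: a toy supervised update stands in for an RL training step, and `init_params`, `train_step`, and the shapes are hypothetical. The point is only that `jax.vmap` turns a single-agent step into a step for a whole batch of agents.

```python
import jax
import jax.numpy as jnp


def init_params(key):
    # Hypothetical tiny "agent": a single linear layer, 4 inputs -> 2 outputs.
    return jax.random.normal(key, (4, 2)) * 0.1


def loss(params, obs, target):
    return jnp.mean((obs @ params - target) ** 2)


def train_step(params, obs, target, lr=0.1):
    # One plain gradient-descent update for a single agent.
    grads = jax.grad(loss)(params, obs, target)
    return params - lr * grads


num_agents = 8
keys = jax.random.split(jax.random.PRNGKey(0), num_agents)
params = jax.vmap(init_params)(keys)  # shape (8, 4, 2): one parameter set per agent

obs = jnp.ones((16, 4))
target = jnp.zeros((16, 2))

# Map over the leading axis of params; share obs and target across agents.
batched_step = jax.jit(jax.vmap(train_step, in_axes=(0, None, None)))
new_params = batched_step(params, obs, target)
```

Because each network is tiny, stacking many of them into one batched computation is what lets the GPU do a useful amount of work per launch.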

Inufu · 3y ago

AlphaZero did not run game logic on TPUs (neither chess nor other games), implementing it in C++ is more than fast enough and much simpler.

TPUs were used for neural network inference and training, but game logic as well as MCTS was on the CPU using C++.

JAX is awesome though, I use it for all my neural network stuff!

sillysaurusx · 3y ago

According to the AlphaZero paper (https://arxiv.org/pdf/1712.01815.pdf) they ran game logic on TPUs:

> Training proceeded for 700,000 steps (mini-batches of size 4,096) starting from randomly initialised parameters, using 5,000 first-generation TPUs to generate self-play games and 64 second-generation TPUs to train the neural networks. Further details of the training procedure are provided in the Methods.

xyzzy4747 · 3y ago · 2 in thread

How does this compare with PyTorch / Tensorflow / etc.? Obviously doing heavy data processing on the GPU will have a large speedup compared to a single thread on the CPU.

It's almost like the author is claiming credit for creating Nvidia, when in fact he is just calling its APIs.

luchris429 · 3y ago

The baseline we are comparing to is standard RL training that is widely used in academia. The technique mentioned in the blog post is not widely used amongst researchers.

The reason we write about Jax is that doing this technique is really hard in PyTorch / Tensorflow. This is because:

1. Jax has vmap. (PyTorch does now too, but it is far more recent).

2. There are RL environments that others have written in pure Jax (see the blog post for four different repos of RL environments)

3. As m00x hints at, Jax replicates Numpy's API. This makes it way easier to use for non-neural network programming (e.g. RL environments).
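To make points 1 and 3 concrete, here is a rough sketch of an environment written with the NumPy-style `jax.numpy` API and then batched with `jax.vmap`. The 1-D environment, the fixed policy, and all names are invented for illustration and are not taken from the repos mentioned in the blog post.

```python
import jax
import jax.numpy as jnp


def env_reset(key):
    # Hypothetical 1-D environment: the state is a position drawn from [-1, 1).
    return jax.random.uniform(key, (), minval=-1.0, maxval=1.0)


def env_step(state, action):
    # Move by the action, clip to the bounds; reward is closeness to the origin.
    next_state = jnp.clip(state + 0.1 * action, -1.0, 1.0)
    reward = -jnp.abs(next_state)
    return next_state, reward


def rollout(key, num_steps=10):
    state = env_reset(key)

    def body(state, _):
        action = -jnp.sign(state)  # fixed stand-in policy: step toward the origin
        next_state, reward = env_step(state, action)
        return next_state, reward

    # lax.scan runs the loop inside the compiled graph instead of in Python.
    final_state, rewards = jax.lax.scan(body, state, None, length=num_steps)
    return final_state, rewards.sum()


# 1024 independent environments stepped together on the accelerator.
keys = jax.random.split(jax.random.PRNGKey(0), 1024)
final_states, returns = jax.jit(jax.vmap(rollout))(keys)
```

The environment code reads like ordinary NumPy, and the same `rollout` function works for one environment or a thousand.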

m00x · 3y ago

I don't get that sense at all.

You could do the same with tensorflow and pytorch, but in my experience, with more difficulty since they're more opinionated about how you should do your operations.

JAX definitely makes it easier to do things that aren't on rails.

_hark (OP) · 3y ago · 2 in thread

jax.vmap() is all you need?

schizo89 · 3y ago

Not only vectorization, but the plethora of environments written in jax. Hopefully someone will port MuJoCo to jax soon

erwincoumans · 3y ago

There is Brax, a differentiable physics simulator written in Jax. It includes Gym tasks such as Ant, Humanoid and more: https://github.com/google/brax It is not full MuJoCo but a good base to add more features. Aside from position based dynamics (xpbd) it features motion in generalized coordinates using the same accurate robot dynamics algorithms as MuJoCo and TDS (Tiny Differentiable Simulator).

ssivark · 3y ago · 1 in thread

From what I understand of Jax, it feels somewhat similar in flavor to Julia, but trying to live with the language constraints (and ecosystem benefits) of Python.

I wonder how Julia is placed for running reinforcement learning algorithms (efficiently) — particularly in cases when the “environment” is nicely wrapped in Python to fit some standardized interface.

Buttons840 · 3y ago

I've done some RL experiments in Julia, and having all the in-between be fast was helpful; I saw significant speed increases. That said, Julia was probably just compensating for my own stupidity, because I was converting my environment objects into tensors over and over and over.

nothrowaways · 3y ago · 1 in thread

It is misleading, the speedup is not just because it is Jax. The devil is in the GPU

luchris429 · 3y ago

Indeed the devil is in the GPU! Jax and its ecosystem just make it much easier to use the GPU.

ipsum2 · 3y ago

Strange that the author claims Jax's vmap is what's doing the heavy lifting, but doesn't use PyTorch vmap to make the benchmark comparable.

schizo89 · 3y ago

Neural differential equations are also easier with jax. sim2real may be easier with a simulator where some of the hard computations are replaced with neural approximations.
