I didn't know that cross-"async" communication was cheaper. That does seem like a good selling point, but what exactly makes it cheaper? After all, threads share the same address space, so you can pass pointers between threads the same way you would within a single thread. I expected the overhead to be roughly similar.
Things can get expensive in terms of cache traffic if the code is running on different cores, but then again, using all the available hardware resources is generally what you want if you care about performance.
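
To make concrete what I mean by passing pointers between threads, here is a minimal sketch in Python. The helper name `pass_between_threads` and the 1 MiB buffer are made up for illustration. It only shows that the object handed to another thread is the same object and not a copy; it says nothing about the synchronization or cache costs, which is the part I'm asking about.

```python
import queue
import threading


def pass_between_threads(payload):
    """Hand `payload` to a worker thread and return what the worker received."""
    inbox = queue.Queue()
    outbox = queue.Queue()

    def worker():
        # The worker receives a reference to the object, not a copy of it.
        received = inbox.get()
        outbox.put(received)

    t = threading.Thread(target=worker)
    t.start()
    inbox.put(payload)
    result = outbox.get()
    t.join()
    return result


data = bytearray(1024 * 1024)  # hypothetical 1 MiB buffer
echoed = pass_between_threads(data)
print(echoed is data)  # True: both names refer to the same object
```

The buffer itself never moves. Whatever overhead there is comes from the queue's locking and from the cores involved, not from the size of the data.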