undefined | Better HN

0 pointshnuser1234561y ago0 comments

I think it does?: (the comment is in the original source)

    print("Adding matrices using GPU...")
    start_time = time.time()
    gpu_result = add_matrices(gpu_matrices)
    cp.cuda.get_current_stream().synchronize() # Not 100% sure what this does
    elapsed_time = time.time() - start_time

I was going to ask, any CUDA professionals who want to give a crash course on what us python guys will need to know?

print("Adding matrices using GPU...") start_time = time.time() gpu_result = add_matrices(gpu_matrices) cp.cuda.get_current_stream().synchronize() # Not 100% sure what this does elapsed_time = time.time() - start_time

0 comments

5 comments · 1 top-level

apbytes1y ago· 4 in thread

When you call a cuda method, it is launched asynchronously. That is the function queues it up for execution on gpu and returns.

So if you need to wait for an op to finish, you need to `synchronize` as shown above.

`get_current_stream` because the queue mentioned above is actually called stream in cuda.

If you want to run many independent ops concurrently, you can use several streams.

Benchmarking is one use case for synchronize. Another would be if you let's say run two independent ops in different streams and need to combine their results.

Btw, if you work with pytorch, when ops are run on gpu, they are launched in background. If you want to bench torch models on gpu, they also provide a sync api.

claytonjy1y ago

I’ve always thought it was weird GPU stuff in python doesn’t use asyncio, and mostly assumed it was because python-on-GPU predates asyncio. But I was hoping a new lib like this might right that wrong, but it doesn’t. Maybe for interop reasons?

Do other languages surface the asynchronous nature of GPUs in language-level async, avoiding silly stuff like synchronize?

ImprobableTruth1y ago

The reason is that the usage is completely different from coroutine based async. With GPUs you want to queue _as many async operations as possible_ and only then synchronize. That is, you would have a program like this (pseudocode):

  b = foo(a)
  c = bar(b)
  d = baz(c)
  synchronize()

With coroutines/async await, something like this

  b = await foo(a)
  c = await bar(b)
  d = await baz(c)

would synchronize after every step, being much more inefficient.

2 more replies

apbytes1y ago

Might have to look at specific lib implementations, but I'd guess that mostly gpu calls from python are actually happening in c++ land. And internally a lib might be using synchronize calls where needed.

hnuser123456OP1y ago

Thank you kindly!

j / k navigate · click thread line to collapse