> You will need at least 350GB GPU memory on your entire cluster to serve the OPT-175B model. For example, you can use 4 x AWS p3.16xlarge instances, which provide 4 (instance) x 8 (GPU/instance) x 16 (GB/GPU) = 512GB memory.
https://alpa.ai/tutorials/opt_serving.html
(Scroll down to the second "Note", not far from the top)
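For what it's worth, the numbers in that note are easy to sanity-check. Below is a rough back-of-the-envelope sketch; the 2 bytes per parameter (fp16 weights) is my assumption and is not stated in the note, and the variable names are made up for illustration.

```python
# Back-of-the-envelope check of the figures in the quoted note.
PARAMS = 175e9           # OPT-175B parameter count
BYTES_PER_PARAM = 2      # assumption: weights stored in fp16

# Memory for the weights alone, in GB (1 GB = 1e9 bytes here).
weights_gb = PARAMS * BYTES_PER_PARAM / 1e9

# Cluster from the note: 4 instances x 8 GPUs/instance x 16 GB/GPU.
instances, gpus_per_instance, gb_per_gpu = 4, 8, 16
cluster_gb = instances * gpus_per_instance * gb_per_gpu

# Whatever is left over has to cover activations, KV cache, etc.
headroom_gb = cluster_gb - weights_gb

print(f"weights: {weights_gb:.0f} GB, cluster: {cluster_gb} GB, "
      f"headroom: {headroom_gb:.0f} GB")
```

So under that assumption the weights alone account for the 350GB, and the 512GB cluster leaves some room for everything else.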
I wonder what FlexGen is doing. A naive guess is that it offloads to a mix of SSD and system memory. I'm definitely curious what FlexGen's underlying strategy translates to in terms of actual data paths.