Ah, good catch. On closer examination, the attention layers (~2B params) are shared across experts, so in theory you'd need 2B for the shared attention layers plus 5B for each of the two active experts in RAM.
That's 2B + 2×5B = 12B total, meaning it should be runnable on the same hardware as 13B models, with some loading time between generations when the active experts swap.
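To make the back-of-the-envelope math above concrete, here's a quick sketch. The figures (2B shared attention params, 5B per expert, 2 experts active per token, and the bytes-per-param values) are assumptions from this thread, not measured numbers:

```python
def active_params_b(shared_b=2.0, per_expert_b=5.0, active_experts=2):
    # Parameters that must be resident in RAM for a forward pass, in billions:
    # the shared attention layers plus each active expert's weights.
    return shared_b + per_expert_b * active_experts

def ram_gb(params_b, bytes_per_param=2.0):
    # fp16/bf16 weights take 2 bytes per parameter; ~0.5 for 4-bit quantization.
    return params_b * bytes_per_param

active = active_params_b()          # 2 + 2*5 = 12.0 billion
print(active)                       # 12.0
print(ram_gb(active))               # 24.0 GB at fp16
print(ram_gb(active, 0.5))          # 6.0 GB at 4-bit
```

So at fp16 the 12B active parameters land in the same ballpark as a dense 13B model (~24 GB), which is where the hardware comparison comes from.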