I tried mark's OMP_NUM_THREADS suggestion (https://news.ycombinator.com/item?id=35018559) and did not see an obvious change that would make it parallel. Also, the huggingface patch (https://github.com/huggingface/transformers/pull/21955), once it gets in, is supposed to allow streaming from RAM to the GPU. So, for me it was not worth the effort to keep working on the CPU version, since even the best-case ~30X speedup would still take around a minute to run the 7B.
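For anyone wanting to try the same thing, a minimal sketch of what I mean by the OMP_NUM_THREADS approach (the thread count and script name here are just placeholders; the env var has to be set before the framework is imported, since libraries like PyTorch read it at startup):

```shell
# Cap OpenMP worker threads for CPU inference; adjust 8 to your core count.
export OMP_NUM_THREADS=8
# Verify the setting is visible to the child process (your_inference_script.py
# is hypothetical -- substitute whatever you're running).
python3 -c "import os; print(os.environ['OMP_NUM_THREADS'])"
```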