Llama-1-33B was trained on 40% more tokens than Llama-1-13B, which explained some of the disparity. This time around both share the same data scale (2T tokens of pretraining + 500B of code finetuning), but the 34B also uses GQA, which is slightly noisier than MHA (see the sketch below). Furthermore, there have been some odd indications in the original Llama-2 paper that the 34B base model is something… even more special: it was trained on a separate internal cluster with undervolted/underclocked GPUs (though that by itself can't hurt training results), its scores came in below expectations, and it's been less "aligned". Here, Code-Llama-Instruct-13B is superior to the 34B on HumanEval pass@1. So yes, it's desirable, but I wouldn't get my hopes up.
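
On the GQA point, here is a minimal sketch (not Meta's implementation) of how grouped-query attention differs from multi-head attention: a group of query heads shares a single key/value head, so the KV projections carry less information per query head than full MHA. The head counts and shapes below are illustrative assumptions, not Llama-2-34B's actual configuration.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)
    n_q_heads, n_kv_heads = q.shape[1], k.shape[1]
    if n_kv_heads != n_q_heads:
        # GQA: repeat each KV head so a whole group of query heads shares it
        group = n_q_heads // n_kv_heads
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

batch, seq, head_dim = 2, 16, 64
q = torch.randn(batch, 8, seq, head_dim)

# MHA: 8 query heads, 8 KV heads (one KV head per query head)
mha_out = attention(q, torch.randn(batch, 8, seq, head_dim), torch.randn(batch, 8, seq, head_dim))
# GQA: 8 query heads, 2 KV heads (groups of 4 query heads share a KV head)
gqa_out = attention(q, torch.randn(batch, 2, seq, head_dim), torch.randn(batch, 2, seq, head_dim))
print(mha_out.shape, gqa_out.shape)  # both (2, 8, 16, 64), but GQA stores 4x fewer KV heads
```

The output shapes are identical; the saving is in the KV cache and KV projection parameters, which is also why GQA is a slightly lossier approximation of MHA.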