Out of curiosity, I was trying out gpt-4-0613 and claude-v2 with https://github.com/getcursor/eval, but sadly I'm getting hangs at 3% with both of them (maybe hitting rate limits?).
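If it really is rate limiting, one generic way to check is to wrap each API call in a retry with exponential backoff, so the run either recovers or fails loudly instead of hanging. This is only a rough sketch, not how the eval repo does it: `RateLimitError` and `call_with_backoff` are hypothetical names, and you'd swap in whatever exception your API client actually raises on a 429.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error an API client raises on HTTP 429."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # give up loudly instead of hanging forever
            # Delay doubles each attempt (1s, 2s, 4s, ...), capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter so parallel workers don't retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

Usage would be something like `call_with_backoff(lambda: client_call(prompt))`, where `client_call` is whatever function sends the request (also a placeholder here).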
Pre-training a foundation model is what you're thinking of for the "absurdly expensive" part, but fine-tunes are extremely cheap and are undoubtedly being done constantly. (You can see just how cheap by looking at the papers for Alpaca, Vicuna, Koala, etc.) Prices dropped from about $600 to $10 for smaller models. Guanaco, using QLoRA, fine-tuned llama-65b in about 1 day on a single GPU.
Another way to test this empirically, btw, is to search for all the articles pointing out what ChatGPT gets wrong (3 or 4). I recently tested those when looking for evals, and it now answers the large majority of them correctly (maybe 80-90%).
After the base training, there are three separate sets of additional training: to align the model, to convince it to do question-response, and to improve the quality via feedback.
If you update the original base model, then all the tuning steps need to be repeated.
For a model the size of GPT-4 this is expensive and slow, which is why OpenAI hasn’t bothered.
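To make the dependency concrete, here's a toy sketch of why touching the base model is the expensive case. The stage names and their ordering are my own labels for the steps described above, not anything OpenAI has published, and `stages_to_rerun` is a made-up helper.

```python
# Hypothetical stage names; each stage is trained on top of the previous one.
PIPELINE = ["base", "alignment", "question_response", "feedback"]


def stages_to_rerun(changed_stage, pipeline=PIPELINE):
    """Return the changed stage plus every stage that was built on top of it."""
    start = pipeline.index(changed_stage)
    return pipeline[start:]


# Changing the base invalidates the whole chain; changing only the last
# tuning step leaves everything before it untouched.
everything = stages_to_rerun("base")
last_only = stages_to_rerun("feedback")
```

Under these assumptions, `stages_to_rerun("base")` comes back with all four stages, while `stages_to_rerun("feedback")` is just the final one.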