Anything that compares proprietary models will be badly miscalibrated and may not be indicative. There have been too many model changes, in both chat and the API, where the providers didn't say a word until it got too noticeable.
Quality would be performance against different given benchmarks, I assume?
There are multiple open-weight models that you can run on a pretty standard computer at home and that match the quality of GPT-4. I guess that would also change the equation.