At least they're honest about it in the specs they've published - there's a graph there that clearly shows their server-side model underperforming GPT-4. A refreshing change from the usual "we trained a 7B model and it's almost as good as GPT-4 on benchmarks" hype train.
(see "Apple Foundation Model Human Evaluation" here: https://machinelearning.apple.com/research/introducing-apple...)