The will not measure up. Notice they're comparing it to Gemma, Google's open weight model, not to Gemini, Sonnet, or GPT. That's fine - this is a tiny model.
If you want something closer to the frontier models, Qwen3.6-Plus (not open) is doing quite well[1] (I've not tested it extensively personally):
https://qwen.ai/blog?id=qwen3.6