No, they are bad models. They were benchmaxxed on LMArena and a few other benchmarks, but as soon as you try them yourself they fall to pieces.
I have my own agentic benchmark[1] that I use to compare models.
Llama-4-scout-17b-16e scores 14/25, while llama-4-maverick-17b-128e scores 12/25.
By comparison, gemma-4-E4B-it-GGUF:Q4_K_M scores 15/25 (that is a 4B-parameter model!), and even GPT-3.5 scores 13/25 (with some adjustment, because it doesn't do tool calling).
Llama 4 was a bad model, unfortunately.