Ask the LLM for examples of LLMs fucking up on simple tasks. Either it succeeds, proving the point, or fails, also proving the point.
I had both GPT-4o and Llama 3.1, through duck.ai, make up kscreen-doctor commands the other day. The correct syntax would have been trivial to get right by simply looking at the output of kscreen-doctor --help.
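For reference, the real syntax (from memory, so double-check it against kscreen-doctor --help on your own machine) looks roughly like this; the output name HDMI-A-1 is just a placeholder from my setup:

  # list connected outputs, their modes and current settings
  kscreen-doctor -o

  # enable an output and set its scale factor in one invocation
  kscreen-doctor output.HDMI-A-1.enable output.HDMI-A-1.scale.1.5

The hallucinated commands didn't follow this output.NAME.setting.value pattern at all, which is exactly what a quick glance at the help text would have caught.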