Try telling the model it's a pirate or someone who is just learning English. It can easily do that, so why would you assume that no system prompt would be the best for some specific problem?
You can tell them to be more critical, that's a useful one. You can tell it to not solve a problem but critique an output - then have two models talk to each other one as a critic and one as a planner.
I can help show the difference but I'm not sure quite what you think doesn't matter and feel like that's important to nail down first.
> You say that I should be "testing and measuring" as I go. How? What is the metric to measure?
Tools like promptfoo can help with some of this.
You can do comparisons, blind tests, measuring what your users prefer, you can use high quality models to test things like "does not mention it's an AI bot" or similar. It depends on what your task is.
Edit -
A lot of people don't properly test and have lots of things in their prompts that aren't necessarily helping, or may have been required in an earlier model but now aren't needed. Prompt engineering is more important in less powerful models or higher stakes situations.