You could circumvent the counter issue by pinning to a specific core; otherwise you'd have to use scheduler events (ftrace/perf on Linux) to figure out which CPU you were running on at which times, so you could subtract the right counters (I've had to do this before, and it isn't pretty). Pinning also prevents the case where your task gets preempted, moved to another core, and has to warm the caches all over again. On Linux, options like isolcpus and nohz_full ensure you're not measuring the scheduler or other tasks, but they have to be set at boot. That said, if the loops between your clock_gettime calls take substantially longer than clock_gettime itself, then at least the timer overhead is probably not that important.
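For reference, pinning on Linux is only a few lines with sched_setaffinity; here's a rough sketch (the core number and clock choice are arbitrary, adjust for your setup):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <time.h>

int main(void) {
    /* Pin the calling process/thread to CPU 2 (arbitrary pick). */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* Time the region of interest; CLOCK_MONOTONIC_RAW avoids NTP slewing. */
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC_RAW, &start);
    /* ... workload under test ... */
    clock_gettime(CLOCK_MONOTONIC_RAW, &end);

    double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9
                      + (end.tv_nsec - start.tv_nsec);
    printf("elapsed: %.0f ns\n", elapsed_ns);
    return 0;
}
```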
As for branch prediction, I'd expect most of the branches to be predicted correctly, since you usually get long runs of "not x" before each "is x". Switching between many loops, of course, hurts the prediction.
My SIMD knowledge only extends as far as which compiler options make my code faster.
No shame in using LLMs (I use them extensively), but I find I have to write some of the code myself, because I make noticeably more mistakes coding if I've let the LLM do everything for a couple of weeks.