We built AccountingBench, a test where LLMs must "close the books" for a real SaaS business using one year of Stripe/Ramp/Rippling/Mercury data.
Claude 4 and Grok 4 start strong - within 1% of human CPA baselines in month 1.
But as the months go on, every model accumulates compounding errors and starts behaving erratically, drifting significantly from the human baseline.
That said, the early accuracy here is promising. With targeted post-training, models may be able to replace humans for this kind of work.
Ask Claude to multiply two ten-digit numbers. It gets the first one or two digits correct, and then makes up the rest.
ChatGPT used to have the same problem, but now it writes a program to perform the math for it.
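For a sense of what "writes a program to perform the math" can look like, here is a minimal sketch in Python. The two operands and the `exact_product` helper are made up for illustration; this is not the code any particular model actually generates. It only relies on the fact that Python integers are arbitrary precision, so the product of two ten-digit numbers comes out exact.

```python
def exact_product(a: int, b: int) -> int:
    """Multiply two integers exactly (Python ints never overflow)."""
    return a * b


# Hypothetical ten-digit operands, chosen only for illustration.
a = 1234567890
b = 9876543210

product = exact_product(a, b)

# Sanity check: dividing the product by one operand must give back
# the other operand with no remainder.
quotient, remainder = divmod(product, a)
assert (quotient, remainder) == (b, 0)

print(product)
```

The point is the delegation: the model only has to produce the two operands and a one-line program, and the interpreter handles every digit.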