What you describe is essentially what happened: the AI result, working from specs and tests, was more performant than the original. The real AI you describe just rewrote chardet without looking at the source, only better.
Is there any visibility or accountability mechanism that records exactly what it did and did not look at? I doubt it. So we're left with a kind of Rorschach test: some people think LLMs follow rules like law-abiding citizens, and others distrust commercial LLMs because they understand that these systems were never designed for visibility and accountability.