Where do you draw the line between this and coverage-guided fuzzing? A lot of what you describe (parallel, adaptive, finds edge cases in unbounded input spaces) maps cleanly onto the fuzzing playbook, which has decades of theory behind it: corpus management, mutation scheduling, crash minimization.
Are you borrowing from that literature or treating agent testing as a distinct problem? Feels like there's real transfer available if you're not already pulling from it.