> Good webcam setups have a sharp depth of field, which makes the speaker look sharp and the background blurred
I'm guessing you mean "shallow depth of field" rather than "sharp" here.
And you probably don't want it so shallow that a few centimeters matter: then your nose could be in focus but your ears not, or vice versa. That's a bit too shallow.
It also doesn't require auto-focus. It requires being able to put the camera at a good distance with the correct f-stop, or being able to change the f-stop one way or another.
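To put rough numbers on how distance and f-stop set the depth of field, here's a minimal sketch using the standard thin-lens approximations (hyperfocal distance, then near and far limits of acceptable sharpness). The function name, the 50 mm lens, and the 0.03 mm circle of confusion (a common full-frame convention) are my assumptions for illustration; a small-sensor webcam would need a much smaller value.

```python
import math

def depth_of_field(focal_mm, f_number, distance_mm, coc_mm=0.03):
    """Return (near, far) limits of acceptable sharpness in mm.

    Thin-lens approximation. coc_mm is the circle of confusion;
    0.03 mm is an assumed full-frame value, not a webcam spec.
    """
    # Hyperfocal distance: focusing here keeps everything out to infinity sharp.
    hyperfocal = focal_mm ** 2 / (f_number * coc_mm) + focal_mm
    near = distance_mm * (hyperfocal - focal_mm) / (hyperfocal + distance_mm - 2 * focal_mm)
    if distance_mm >= hyperfocal:
        far = math.inf  # everything behind the subject is acceptably sharp
    else:
        far = distance_mm * (hyperfocal - focal_mm) / (hyperfocal - distance_mm)
    return near, far

# Hypothetical setup: 50 mm lens, subject 1 m away, two different f-stops.
for n in (1.8, 8):
    near, far = depth_of_field(50, n, 1000)
    print(f"f/{n}: sharp from {near:.0f} mm to {far:.0f} mm ({far - near:.0f} mm deep)")
```

With those assumed numbers the sharp zone is about 4 cm deep at f/1.8 and about 18 cm at f/8, so wide open really is in nose-versus-ears territory, and stopping down or moving the camera back fixes it without any auto-focus.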
Auto-focus is more for cases where the subject moves a lot toward or away from the camera, so you don't have to manually adjust things.