This is simple with a small, experimental fleet, so we are looking at the best-case scenario for these vehicles. The question is what that falls to in a realistic commercial application.
The same goes for comparing safety, because averages are skewed by bad apples (e.g. 1 driver with 20 collisions brings the average to 2 for a group of 10 drivers with no other collisions). We at least need to start talking about medians, standard deviations, groups and such.
And we need autonomous vehicles to beat or match good drivers; otherwise, good drivers are worse off on the streets (and because a few bad drivers pull the average up, good drivers might be more than 50% of all drivers). Not sure why that's so controversial?
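
To put numbers on the bad-apple example above, here is a rough sketch in Python. The data is just the hypothetical group from my example (9 drivers with no collisions, 1 driver with 20), not real collision data, and the variable names are only for illustration.

```python
from statistics import mean, median, pstdev

# Hypothetical group from the example: 9 drivers with no collisions,
# and 1 driver with 20 collisions. Not real data.
collisions = [0] * 9 + [20]

avg = mean(collisions)       # pulled up by the single outlier
med = median(collisions)     # what the typical driver looks like
spread = pstdev(collisions)  # population standard deviation

# Share of drivers who have fewer collisions than the group average.
share_better = sum(c < avg for c in collisions) / len(collisions)

print(f"mean={avg}, median={med}, stdev={spread:.1f}")
print(f"share of drivers better than the mean: {share_better:.0%}")
```

In this toy group the mean is 2 collisions per driver while the median is 0, and 9 out of 10 drivers are better than the "average driver". A vehicle that only matches the mean would be worse than most of the group.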