In the very beginning I think they (DragonFly) used pgbench-style benchmarks (the 'smp' benchmark), since a lot of people were running Postgres and it was really easy to compare QPS and transactions per second. Of course it also tests UFS2 vs HAMMER, if I remember correctly.
It was a long time ago, but FreeBSD 5 felt more like a new OS than just a 4.11 -> 5.0 bump, particularly with the push to remove the Giant lock and all the witness(4) work. It took a while to figure out how to fine-tune it, as a lot of subsystems were Giant-free but not all. And of course moving from one lock to many small locks means a lot of spinning, so certain workload patterns were slower than before. It took until 7.0 to get amazing, and by 8 or so I think it was super solid.
DragonFly went with kernel messaging and one scheduler per core, while FreeBSD put a lot of time into building a preemptive scheduler (sched_ule(4): http://fxr.watson.org/fxr/source/kern/sched_ule.c?v=FREEBSD-...).
Weird times, but I am super grateful that both the FreeBSD team and the DragonFly BSD team did what they thought was right.
Mad respect to the people who are just coding what they think is right.