To put it simply: in my testing, with 4 threads (the default), memcached and my clone perform about the same. With 8 threads (the -t 8 option), however, memcached becomes a little slower (!) while my clone becomes a little faster. I am still planning some optimizations, so I hope the performance and scalability margin will improve further.