The short answer is that blocking is expensive: every blocking call implies a context switch, which costs CPU time directly and also hurts cache locality. As computers become faster, a larger percentage of CPU time goes to context-switching overhead, and non-blocking architectures eliminate that overhead. For applications like databases, where this problem is more severe, the difference in throughput between a blocking architecture and a non-blocking architecture can be 10x on the same hardware, so it is a very important optimization if you want your software to have competitive performance.
A modern thread-per-core shared-nothing architecture takes this even further and tries to eliminate blocking at the hardware level for the same basic reason.
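
To make the blocking versus non-blocking distinction concrete, here is a minimal sketch in Python using the standard `asyncio` module. It is only an illustration of the structure, not a benchmark: `asyncio.sleep` stands in for a real I/O wait, and the names `handle_request`, `serve`, and `run_demo` are made up for this example. The point is that many in-flight waits are multiplexed on one OS thread, so a wait hands control back to a userspace event loop instead of blocking the thread and forcing a kernel context switch.

```python
import asyncio
import threading


async def handle_request(request_id, io_delay):
    # Simulated I/O wait (assumption: stands in for a socket or disk read).
    # Awaiting yields to the event loop rather than blocking the OS thread.
    await asyncio.sleep(io_delay)
    # Report which OS thread finished the request.
    return request_id, threading.get_ident()


async def serve(n_requests, io_delay=0.01):
    # Start all requests at once; gather returns results in submission order.
    pending = [handle_request(i, io_delay) for i in range(n_requests)]
    return await asyncio.gather(*pending)


def run_demo(n_requests=100):
    results = asyncio.run(serve(n_requests))
    request_ids = [rid for rid, _ in results]
    thread_ids = {tid for _, tid in results}
    return request_ids, thread_ids


if __name__ == "__main__":
    request_ids, thread_ids = run_demo()
    print(f"{len(request_ids)} requests handled on {len(thread_ids)} OS thread(s)")
```

A blocking design would typically give each of those requests its own thread and let the kernel switch between them. This sketch does not measure the throughput gap described above; that depends heavily on the workload and the runtime.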