Well, a lot changed since the article. For one the test tool now eats more CPU than RNG.
From my dumb tests (run DD in one, then many threads), the 4 thread run have 4x the performance of single thread one (I have 4 core CPU), while 16 thread one have predictably same-ish total throughput, so if there are any serialization still there it is not noticeable much.