this looks pretty recent, good to know that memcpy is improving (maybe). i suspect the reason icc is winning so hard is exactly what i mentioned - cache hints and good register usage. the thing is, that stuff isn't new... it's much more than 10 years old now.
no memcpy implementation has an excuse to be that slow under any compiler imo.
i'd imagine the intel guys will make use of this stuff because it's why they put it in the hardware to start with... :)
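for anyone wondering what "cache hints and good register usage" looks like in practice, here is a rough sketch using the sse2 intrinsics. to be clear, this is not what icc or any libc actually does (i haven't looked at their code). `stream_copy` is a made-up name, and the 64-byte unroll and 256-byte prefetch distance are guesses for illustration, not tuned values. it prefetches the source with a non-temporal hint, keeps four xmm registers in flight per iteration, and writes with streaming stores so a big copy doesn't evict everything else from the cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics; also pulls in _mm_prefetch */
#endif

/* Hypothetical helper for illustration only; not part of any real library. */
void *stream_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

#if defined(__SSE2__)
    /* _mm_stream_si128 needs a 16-byte aligned destination,
       so copy the unaligned head the ordinary way first. */
    size_t head = (16 - ((uintptr_t)d & 15)) & 15;
    if (head > n)
        head = n;
    memcpy(d, s, head);
    d += head;
    s += head;
    n -= head;

    while (n >= 64) {
        /* Cache hint: ask for source data ahead of time, marked non-temporal.
           The 256-byte distance is an assumed value, not a tuned one. */
        if (n >= 64 + 256)
            _mm_prefetch((const char *)(s + 256), _MM_HINT_NTA);

        /* Register usage: four independent 16-byte loads in flight. */
        __m128i a = _mm_loadu_si128((const __m128i *)(s + 0));
        __m128i b = _mm_loadu_si128((const __m128i *)(s + 16));
        __m128i c = _mm_loadu_si128((const __m128i *)(s + 32));
        __m128i e = _mm_loadu_si128((const __m128i *)(s + 48));

        /* Streaming stores write around the cache instead of through it. */
        _mm_stream_si128((__m128i *)(d + 0), a);
        _mm_stream_si128((__m128i *)(d + 16), b);
        _mm_stream_si128((__m128i *)(d + 32), c);
        _mm_stream_si128((__m128i *)(d + 48), e);

        s += 64;
        d += 64;
        n -= 64;
    }

    /* Streaming stores are weakly ordered; fence before anyone reads dst. */
    _mm_sfence();
#endif

    /* Tail bytes (or the whole copy when SSE2 is not available). */
    memcpy(d, s, n);
    return dst;
}
```

streaming stores only tend to pay off for copies bigger than the cache. for small copies they can easily be slower than a plain `memcpy`, so a real implementation would pick a path based on size.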
incidentally, as a complete tangent, really great bunch of guys to work with if you get the chance - at least in my experience. a definite passion for and focus on performance in a serious way. :)