Intel's take on GCC's memcpy implementation

(software.intel.com)

76 points · mtdev · 14y ago · 13 comments

memset · 14y ago · 3 in thread

Someone tell me if I am mistaken - but it looks like the main difference between GCC's and Intel's memcpy() boils down to gcc using `rep movsl` and icc using `movdqa`, the latter having a shorter decode time and possibly shorter execution time?

bdonlan · 14y ago

No, the problem is with x86-64, which apparently doesn't use `rep movsl`; as far as I can tell, GCC's x86-64 backend assumes that SSE will be available, and so only has an SSE inline memcpy. However, in the kernel SSE is not available (SSE registers aren't normally saved, to save time), so the inline version is disabled. With no non-SSE fallback (such as `rep movsl` on x86), gcc falls back to a function call, with the performance impact this implies.

sliverstorm · 14y ago

From the sound of it, the function call was not the issue, so much as the function that gets called is old and non-optimal with modern tools.

Tuna-Fish · 14y ago

`rep movsl` moves data 32 bits at a time, while `movdqu`/`movdqa` move data 128 bits at a time. The advantage is not only in decoding -- the data paths in modern Intel processors are really 128-bit, so `movdqu`/`movdqa` get 4 times the throughput out of the system. (Until you run out of L1 cache, after which you really slow down.)
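A minimal sketch of the 128-bit copy described above, not GCC's or ICC's actual implementation. The function name `copy_128` is hypothetical. It assumes the SSE2 intrinsics from `<emmintrin.h>` when the compiler defines `__SSE2__`, and falls back to a plain byte loop otherwise. `_mm_loadu_si128` and `_mm_storeu_si128` are the unaligned forms (they map to `movdqu`), so no alignment is required of either pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: forward copy of n bytes, 16 bytes per iteration
 * where SSE2 is available. The regions must not overlap. */
void *copy_128(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n == 0 || (dst != NULL && src != NULL));

#if defined(__SSE2__)
    /* One unaligned 128-bit load and store per iteration. */
    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, v);
        s += 16;
        d += 16;
        n -= 16;
    }
#endif

    /* Tail (or the whole copy when SSE2 is not available). */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

Whether this beats `rep movsl` on a given machine depends on the CPU and the copy size; the sketch only shows the wider transfer unit.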

JoeAltmaier · 14y ago · 2 in thread

I'm sad that computers in this modern age still require me to be in their business. Doesn't it seem like the CPU's own business to move bytes efficiently? Why is the compiler, much less the programmer, involved? The tests being made in the compiler/lib are of factors better known at runtime (overlap, size, alignment) and better handled by microcode.

Andys · 14y ago

Hardware improvements necessarily move slower than software, especially when carrying the complex historical baggage of out-of-order execution on x86.

To be fair, things are improving. E.g., the latest Intel CPUs no longer need aligned memory to avoid slowing down.

JoeAltmaier · 14y ago

Really? That's huge!

A really robust memmove library routine should handle about eleven different factors, one of which is alignment. I don't know of ANY library that handled that right, probably because it's so hard. E.g., an unaligned source and an unaligned destination with different alignments is very hard. Usually they settle on aligning the destination (unaligned cache writes are more expensive). The true solution is to load the partial source word, then loop loading whole aligned source words, shifting values across multiple registers to create aligned destination words to store.

That all requires about 16 different unrolled code loops to cover all the cases. Nobody bothers. So nobody ever got the best performance in a general memmove anywhere. Sigh.
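A rough sketch of the shift-and-merge idea described above, reduced to one case: the destination is word-aligned and the source starts at a byte offset of 0 to 3 inside an aligned word. The names `shift_copy` and `is_little_endian` are hypothetical, 32-bit words stand in for whatever register width a real routine would use, and the head and tail handling a full memmove needs is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Detect byte order at run time so the shift direction is right. */
static int is_little_endian(void)
{
    const uint32_t one = 1;
    unsigned char first;
    memcpy(&first, &one, 1);
    return first == 1;
}

/* Hypothetical helper: copy nwords aligned 32-bit words into dst, taking the
 * bytes that start byte_off (0..3) bytes into src_words. Only aligned word
 * loads and stores are issued. When byte_off != 0 this reads
 * src_words[0..nwords], i.e. one word more than it writes. */
void shift_copy(uint32_t *dst, const uint32_t *src_words,
                unsigned byte_off, size_t nwords)
{
    assert(byte_off < 4);
    if (nwords == 0)
        return;
    if (byte_off == 0) {
        /* Same alignment: plain word copy, no shifting needed. */
        for (size_t i = 0; i < nwords; i++)
            dst[i] = src_words[i];
        return;
    }

    const unsigned s = 8 * byte_off;      /* 8, 16 or 24 bits */
    const int le = is_little_endian();
    uint32_t cur = src_words[0];          /* partial first source word */

    for (size_t i = 0; i < nwords; i++) {
        uint32_t next = src_words[i + 1]; /* whole aligned source word */
        /* Merge the tail of cur with the head of next. */
        dst[i] = le ? (cur >> s) | (next << (32 - s))
                    : (cur << s) | (next >> (32 - s));
        cur = next;
    }
}
```

Each of the three non-zero offsets needs its own shift pair, which hints at why a fully unrolled version for every source and destination combination grows to the number of loops mentioned above.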

abrahamsen · 14y ago · 1 in thread

> the developer communications don't appear on a public list. There is no visible public help forum or mail list

http://dir.gmane.org/index.php?prefix=gmane.comp.lib.glibc

Seems public to me.

ominous_prime · 14y ago

The list is publicly archived, but glibc's maintainer (Ulrich Drepper) actively discourages public interaction for the project. The project's policy is that bug reports should almost always go through a Linux distribution, and to say it nicely, Drepper can be difficult to persuade.

Debian was in the process of switching to eglibc in order to avoid glibc (and Drepper), and fix issues they saw with the library.

wolf550e · 14y ago

This article is old: March 9, 2009 1:00 AM PDT

Nowadays glibc has modern SSE code and the kernel uses "rep movsb". The kernel can store and restore FPU state if the copy is long and doing SSE/AVX is worth it. Someone on the Linux kernel mailing list measured that performance depends on src and dest being 64-byte aligned compared to each other: if they are aligned, "rep movsb" is faster than SSE.

The thread: https://lkml.org/lkml/2011/9/1/229

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git...

http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_...
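A small sketch of the two ideas in this comment, under stated assumptions. `relatively_aligned` is a hypothetical helper that reads "aligned compared to each other" as "source and destination have the same offset within a 64-byte block"; that reading is an assumption, not something taken from the linked thread. `copy_rep_movsb` is also hypothetical: it issues `rep movsb` through GCC-style extended asm on x86 and falls back to `memcpy` elsewhere. It is not the kernel's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: true when dst and src share the same offset inside an
 * align-byte block. align must be a power of two, e.g. 64. */
int relatively_aligned(const void *dst, const void *src, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (((uintptr_t)dst ^ (uintptr_t)src) & (align - 1)) == 0;
}

/* Hypothetical helper: forward byte copy using the string instruction.
 * Assumes the direction flag is clear, as the x86 ABIs require at function
 * boundaries. The regions must not overlap. */
void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    void *d = dst;
    const void *s = src;
    /* rep movsb copies (E/R)CX bytes from [SI] to [DI]. */
    __asm__ volatile("rep movsb"
                     : "+D"(d), "+S"(s), "+c"(n)
                     :
                     : "memory");
#else
    memcpy(dst, src, n);   /* portable fallback for other targets */
#endif
    return dst;
}
```

A caller could test `relatively_aligned(dst, src, 64)` and pick between this path and an SSE path; which one is actually faster would have to be measured on the target CPU.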

shin_lao · 14y ago

A couple of years ago, before SSE existed, I wrote a highly optimized memory copy routine. It did more than just use movntq and the like (non-temporal stores are important to avoid cache pollution): for large data, I copied each chunk into a local buffer smaller than one page and then copied that to the destination. Sounds crazy? It was actually much faster because of page locality.

For small chunks, however, nothing was faster than rep movsb, which moves one byte at a time.
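A minimal sketch of the bounce-buffer scheme described above, not the commenter's routine. The name `copy_via_bounce` and the 2048-byte buffer size are assumptions chosen only to stay under a typical 4 KiB page, and plain `memcpy` stands in for the non-temporal stores mentioned in the comment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed size: smaller than a typical 4 KiB page. */
enum { BOUNCE_BYTES = 2048 };

/* Hypothetical helper: copy n bytes by staging each chunk in a small local
 * buffer, so every pass reads from one region and then writes to another
 * instead of interleaving the two. The regions must not overlap. */
void *copy_via_bounce(void *dst, const void *src, size_t n)
{
    unsigned char bounce[BOUNCE_BYTES];
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n == 0 || (dst != NULL && src != NULL));

    while (n > 0) {
        size_t chunk = n < BOUNCE_BYTES ? n : BOUNCE_BYTES;
        memcpy(bounce, s, chunk);   /* source -> local buffer */
        memcpy(d, bounce, chunk);   /* local buffer -> destination */
        s += chunk;
        d += chunk;
        n -= chunk;
    }
    return dst;
}
```

The speedup reported in the comment applies to the hardware of that era; on a current machine the extra pass through the buffer may well cost more than it saves, so this is worth benchmarking before use.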

vz0 · 14y ago

Agner Fog found this issue a year earlier, in 2008:

http://www.cygwin.com/ml/libc-help/2008-08/msg00007.html