1. Better memory aliasing to use SIMD instructions. But you have to trade in pointer arithmetic for those cases. Only useful for number-crunching and GPU-like workload AFAIK.
2. Pre-compute time-consuming constants during compile time. Though I see no reason why you cannot do this in C.
3. JIT and runtime optimization. I doubt this though, given various overhead of JIT and runtime optimization (GC, memory, etc), that JIT can beat carefully crafted C. Of course it will make it easier to write many types of programs.
I assumed that a big part of a language being faster than C is really about the compiler generating faster code than a human programmer in either language could do without thinking too hard about optimization, and not so much the theoretical top speed. As I understand this is the case with assembly vis a vis C.
Yes, but one runs into the same problems as for carefully crafted assembly:
- People who have the skills necessary to produce "carefully crafted" code don't come cheap.
- All careful crafting often ties the code to a specific target platform. For example, the gyrations one might through to make the code more SIMD-friendly might be detrimental to performance on a platform that lacks SIMD.
In C, you would have to code the assumptions into the runtime yourself, as opposed to letting the compiler or JIT do it. This isn't viable if you have time constraints or a lack of knowledge in the required subject area.
There's also the chance that you'll introduce bugs that the compiler/JIT developers have already encountered and fixed.
Still, compilers are increasingly starting to do this.
There is a reason why we have tons of `#define` macros in many C programs.
A language with the constraints of Haskell allows the JIT to recognise that fib(20) is being called a lot, or will be called again, and simply inline that function call itself. That is completely impossible for a C compiler to achieve unless the developer writes: if (arg == 20) { return X; }
That has two problems: a) The developer has to profile and look for specific behaviour in the system b) The optimisation may change depending on the state of the world e.g. the system may pick a random constant every day to compute the Fibonacci sequence for.
Doesn't the example below show otherwise? Note the presence of movdqu in both memcopy_normal and memcopy_restrict and how GCC emits five LEA instructions without restrict, but only two when restrict is present.
This does not support the author's assertion that "Aliasing information is the only one, where I am certain about speed improvements, because it is impossible to reach Fortran-speed in C."
#include "stddef.h"
void* memcopy_normal(void* dst, const void* src, size_t count) {
while (count--) *(char*)dst++ = *(char*)src++;
return dst;
}
void* memcopy_restrict(void* restrict dst, const void* restrict src, size_t count) {
while (count--) *(char* restrict)dst++ = *(char* restrict)src++;
return dst;
}
> gcc -S -masm=intel -std=c99 -m64 -march=corei7 -O3 restrict2.c
.file "restrict2.c"
.intel_syntax noprefix
.text
.p2align 4,,15
.globl memcopy_normal
.type memcopy_normal, @function
memcopy_normal:
.LFB0:
.cfi_startproc
test rdx, rdx
mov rax, rdi
je .L2
lea r10, [rdx-1]
mov r8, rdx
shr r8, 4
mov r9, r8
sal r9, 4
test r9, r9
je .L7
lea rcx, [rsi+16]
cmp rdx, 15
lea r11, [rax+16]
seta dil
cmp rax, rcx
seta cl
cmp rsi, r11
seta r11b
or ecx, r11d
test dil, cl
je .L7
xor ecx, ecx
xor edi, edi
.p2align 4,,10
.p2align 3
.L4:
movdqu xmm0, XMMWORD PTR [rsi+rcx]
add rdi, 1
movdqu XMMWORD PTR [rax+rcx], xmm0
add rcx, 16
cmp r8, rdi
ja .L4
lea r8, [rax+r9]
add rsi, r9
sub r10, r9
cmp rdx, r9
je .L5
.L3:
lea r9, [r10+1]
xor ecx, ecx
.p2align 4,,10
.p2align 3
.L6:
movzx edi, BYTE PTR [rsi+rcx]
mov BYTE PTR [r8+rcx], dil
add rcx, 1
cmp r9, rcx
jne .L6
.L5:
add rax, rdx
.L2:
rep
ret
.L7:
mov r8, rax
jmp .L3
.cfi_endproc
.LFE0:
.size memcopy_normal, .-memcopy_normal
.p2align 4,,15
.globl memcopy_restrict
.type memcopy_restrict, @function
memcopy_restrict:
.LFB1:
.cfi_startproc
push r12
.cfi_def_cfa_offset 16
.cfi_offset 12, -16
test rdx, rdx
mov rax, rdi
push rbp
.cfi_def_cfa_offset 24
.cfi_offset 6, -24
push rbx
.cfi_def_cfa_offset 32
.cfi_offset 3, -32
je .L12
lea r10, [rdx-1]
mov r11, rsi
mov r8, rsi
neg r11
and r11d, 15
cmp r11, rdx
cmova r11, rdx
test r11, r11
je .L13
xor ecx, ecx
.p2align 4,,10
.p2align 3
.L14:
movzx r9d, BYTE PTR [r8]
add rcx, 1
add r8, 1
sub r10, 1
mov BYTE PTR [rdi], r9b
add rdi, 1
cmp r11, rcx
ja .L14
cmp rdx, r11
je .L15
.L13:
mov r12, rdx
sub r12, r11
mov rbx, r12
shr rbx, 4
mov rbp, rbx
sal rbp, 4
test rbp, rbp
je .L16
add rsi, r11
xor ecx, ecx
add r11, rax
xor r9d, r9d
.p2align 4,,10
.p2align 3
.L17:
movdqa xmm0, XMMWORD PTR [rsi+rcx]
add r9, 1
movdqu XMMWORD PTR [r11+rcx], xmm0
add rcx, 16
cmp rbx, r9
ja .L17
add rdi, rbp
add r8, rbp
sub r10, rbp
cmp r12, rbp
je .L15
.L16:
lea r9, [r10+1]
xor ecx, ecx
.p2align 4,,10
.p2align 3
.L18:
movzx esi, BYTE PTR [r8+rcx]
mov BYTE PTR [rdi+rcx], sil
add rcx, 1
cmp r9, rcx
jne .L18
.L15:
add rax, rdx
.L12:
pop rbx
.cfi_def_cfa_offset 24
pop rbp
.cfi_def_cfa_offset 16
pop r12
.cfi_def_cfa_offset 8
ret
.cfi_endproc
.LFE1:
.size memcopy_restrict, .-memcopy_restrict
.ident "GCC: (GNU) 4.6.3"
.section .note.GNU-stack,"",@progbitsIt is true though that restrict doesn't solve all of C's aliasing problems though. For example:
#include <stdlib.h>
struct array_3d {
float * restrict data;
size_t xmin, ymin, zmin;
size_t xmax, ymax, zmax;
size_t allocated;
};
static inline float *
array_3d_elem_addr(struct array_3d *a3d, size_t x, size_t y, size_t z) {
return a3d->data + z +
y * (a3d->zmax - a3d->zmin) +
x * (a3d->zmax - a3d->zmin) * (a3d->ymax - a3d->ymin);
}
void array_3d_add(struct array_3d *restrict dst, struct array_3d *restrict src) {
for (size_t x = dst->xmin; x != dst->xmax; ++x)
for (size_t y = dst->ymin; y != dst->ymax; ++y)
for (size_t z = dst->zmin; z != dst->zmax; ++z)
*array_3d_elem_addr(dst, x, y, z) += *array_3d_elem_addr(src, x, y, z);
}
No amount of restrict keywords can tell the compiler that src->data and dst->data don't alias here (putting restrict on the 'data' member declaration inside array_3d doesn't mean anything here, because of the way restrict is defined).GCC manages to vectorize this loop, but only conditionally, with runtime alias checks. That's ok for this trivial example, but it isn't always viable in more complex examples. It's also worth noting that there are ways to rewrite this code so that it does fully vectorize, though again it's helped by the example being so trivial. The main point is that in Fortran, the alias rules are strong by default, without the programmer having to do acrobatics.
Edit: formatting fixes
mans or astrange can probably elaborate.
MSVC on Windows can be 40% faster than gcc on the same platform. I think icc (Intel C Compiler) can be several times faster than gcc for some workloads.
I can write OpenCL code, that will be up to 100 times faster than plain C.
EDIT:
The reason for this, is that we still use procedural languages rather than declarative.
In declarative language I would say: "The content of this block of memory should be identical to the content of that block of memory." I would also specify that blocks are non-overlapping, if it's not possible to infer from the block definitions themselves.
Instead of this, memcopy C code describes procedure of iterative byte-by-byte copying. This is only one way out of many to do it and maybe it even was the fastest way to do it on PDP-11, but not on modern or future machines.
Similarly if you trying to micromanage your employees, you will not get most optimized result.
To have the best performance, you do not need to be Close-to-The-Metal(tm). You need to be able to express what your desired result is in the highest possible abstracted, but still correct, way.
- A very clever and complete type analysis. The compiler can thus optimize at best the code without runtime type-checking code.
- If possible, a flow analysis : By analysing the bounds of each function, compiler is able to remove a lot of unused code. For instance, if a variable is defined by t = 2*n and long afterwards, there's a test which check primality of t, the compiler can remove the test.
- Compile code by rewriting algorithms that "takes in mind" locality of memory, ie. all things described by U. Drepper in "What every programmers should know about computer memory", ie2. rewriting algorithms to avoid cache miss, etc...
- Automatic template-like mecanism : the compiler is able to make partial execution of the code, even in array, so all the code which can be calculated because you have the datas should be directly compiled. Imagine you have a regular expression library and you write your regexp in your code (just the regexp) : the compiler should be able to produce an automata.
- It is also said : be able to use SIMD instruction.
Because of this, it is better to compile a high-level language with a very clean semantics : you can much easily reason about your semantics and more easily rewrite some algorithms.
Does the C semantics prevent the compiler from issuing a run-time check for the (rare) aliasing case and proceed with the fast version in the common case? (probably unrolled, too?)
Out of curiosity: The code uses a special counter variable count which has to be decremented separately - is this faster than testing for dst != behind_last_dst_prt ?
(aside: my gut-feeling tells me that if the combination of C/8086 can't pull his example of at the maximum memory to processor transfer speed for sufficiently large input vectors, there is something seriously rotten ...)
Turtles. Turtles all the way down.
Translation: If your language is "Turtles" then you need to use "Turtles" all the way down the entire stack, and that includes your "Turtles" CPUs.
Inter-module optimizations (e.g., removing dead code) can occur at link time as well.
So there are better tools now that compiler writers can target to start making progress in at least one of these areas.