However, I doubt the efficacy of your C++ experts: most of the people I know who write C++ are actually really bad at optimizing code. They mostly use it for legacy reasons. If you get a team of experienced (and expensive) systems programmers, you will likely get a slightly better result than your GC algorithm.
Which is a very good reason to develop an optimized GC algorithm: the domain experts can crank out code without having to optimize every single memory (de)allocation, which sounds like a waste of their time.
It’s funny: people don’t usually doubt that a modern compiler can do a better optimization job than an expert, but add a memory management algorithm and that’s a bridge too far.
However, a GC is a lot slower than manual memory management, which contrasts with the fact that most compiler activities are actually pretty low in overhead (now; it didn't use to be this way). Really, the only cost overhead left is the abstraction mismatch, and that is not too bad when you compare it to how bad humans are at writing assembly.
That said, this case looks like one where the C++ experts spent very little time optimizing (mostly writing business logic), and probably made a very poor choice of tools.
There is a triangle of GC performance: throughput, latency (i.e. pause length), and memory overhead. Manual memory management will often be slower (in the throughput sense) than a throughput-tuned GC because:
1. Manual memory management typically precludes moving live data
2. Manual memory management often frees data as soon as it is dead
GC will often have faster allocations than manual memory management because #1 makes it possible to just use a pointer-increment for allocation. GC will often have faster freeing of data because of #2; in particular using a nursery with Cheney's algorithm makes it O(1) to free an arbitrary amount of data.
Where a throughput optimized GC falls down is in that any code that allocates may have an unpredictable amount of delay.
Also note that for video games, both typical GC and malloc/free are often too slow for per-frame data, so arena allocators are used, which sidestep #2, and allow a pointer-increment allocation without needing #1. This is specifically because there are a lot of objects with exactly the same bounds on their lifetime. Special-purpose algorithms will almost always trump general-purpose algorithms when run on the workload they are optimized for.
extern void foo(T *p); // some arbitrary function

void bar1(bool cond)
{
    ...
    std::unique_ptr<T, your_deleter> p(new T()); // note: std::make_unique doesn't accept a custom deleter
    if (cond) { return foo(p.release()); }
    ...
}
This requires the compiler to emit the unique_ptr destructor's cleanup path (at minimum a null check on the released pointer) regardless of whether cond is true or false, even though it's unnecessary (and can thus be slower) in the case where cond is true. Moreover, the obvious way to avoid it is to write it "C-style":

void bar2(bool cond)
{
    ...
    auto p = new T();
    if (cond) { return foo(p); }
    ...
    your_deleter()(p);
}
which can end up being faster when cond is true. But this isn't something an expert would generally want to do, as the C++ code now becomes unidiomatic, fragile, and unmaintainable. In an ideal world, though, you could have an optimizer smart enough to do that transformation automatically. C++ compilers already do it in trivial cases, but they can't do it in general. My impression is that their Haskell compiler exploits internal knowledge of what your_deleter does (i.e. reference counting) in order to optimize the code in various ways: optimizing out such code, consolidating refcount updates, etc. And if I understand this correctly, there's no surprise at all that it can be faster than idiomatic C++ code written even by experts.
The question for me isn't the expertise of their programmers. Perhaps in their case they genuinely do need to have lots of objects on the heap, have (say) tight loops where they (for whatever reason) nevertheless cannot avoid the heap allocations, and don't have much of a use for finalizers besides freeing memory. In which case, I'm not surprised their solution clearly delivers better results than the C++ equivalent. The question for me, instead, is how well they think that generalizes, such as to (a) well-written Haskell programs in general, (b) well-written C++ programs in general, and/or (c) other domains. It would be one thing if their solution delivers better results in Haskell than C++ for their use case; it would be another thing if they could claim their solution delivers better results in Haskell than C++ for most use cases.
This is a non-issue.
The point is that check itself is an extra instruction (or two, rather) that would otherwise be skipped entirely.
I'm not saying this commonly makes a difference. I'm just saying this might be something that does make a difference for them in their particular use case.
Also note that I was trying to describe the general phenomenon with a simple example, but this obviously isn't limited to std::unique_ptr.