Update:
>To handle both these cases, we will use checked addition and return a null pointer if either addition overflows. Here is the new Rust source code:
I’m still filling this under “Author made arbitrary decision and result is arbitrary solution is slower as a result of arbitrary decision”.
If you're doing small bump allocations and you're anywhere near pointer overflow, it means you're way out of bounds already; the bump allocations you already made were wrong.
You need to check whether the allocation increment is out of the zone from which you're allocating, and you need that no matter which direction you go.
If the requests are small, you will hit the end of your arena long before you worry about pointer overflow at the zero address or at address 0xFF..FF!
That said, aligning down is a shade faster than up. To align down to a boundary divisible by ALIGN_MASK + 1 we just truncate some low order bits to zero:
addr &= ~ALIGN_MASK;
If the bits are already zero, the address doesn't move; all is cool.but aligning up, where we don't care about overflow, requires handling the case where the address is already aligned and doesn't have to move:
if (addr & ALIGN_MASK) {
addr |= ALIGN_MASK;
addr++;
}
Or a trick like this where we bias the address with an offset of -1 during the masking calculation: addr = ((addr - 1) | ALIGN_MASK) + 1;
(We could get underflow here if addr is the zero address (null pointer on most systems), but it's reversible if the pointer arithmetic is done as unsigned. Zero aligns to zero: it goes to 0xFFF..FFF which stays the same after | 7, and then increments back to zero.)Either of these is worse than just addr &= ~7.
I also wouldn't mind the a default "fast" bump allocator library to do all tricks it can without sacrificing safety.
Is rust warning you about pointer overflow? My understanding is that rust is complaining about integer overflow but perhaps that's insufficient?
(I don't actually know which solution is better/worse from perf perspective besides author's post)
I don't think the lesson is "let's just write unsafe code"
Let's face facts, the allocator, even with this additional check, is still going to be blazing fast compared to malloc() or whatever. If you're allocating so much that the handful of extra instructions is going to be a significant slowdown, maybe structure your allocations differently?
Quick example: https://godbolt.org/z/rdv4qnrs8.
*edited to update the example, realized I messed up the comparison logic.
Link: https://godbolt.org/z/f1jGW6Pa3
Update: NVM, definitely not being handled correctly. https://godbolt.org/z/cMTe1o979
edit: code moved to a toplevel comment
My favorite is one where I only use subtraction but the direction of allocation is in the positive direction (I subtract a `remaining` counter, and the returned address is `end - remaining`).
But I have seen many others. A few of my colleagues have similarly geeked out on this and come up with splendid bumpers (ggaren wrote a great one in bmalloc, and I remember the MMTk folks put hella thought into theirs).
I'm surprised that there was no discussion of memory system performance. There are tons of OS-level and hardware prefetcher optimizations for forward-marching pointer references and zero-page allocation.
I’ve seen multiple ways to do it. And I’ve lost days of my life to coming up with pointlessly amusing variants that get there in weird ways.
Yeah I’m also amused that they didn’t get into the fact that downward bump is just not what the HW expects you to do. You’d need a real benchmark to see that effect. And, it’s an open question whether this would matter unless your allocation rate was obscene (if it isn’t then most scans of memory will be normal loops and not the allocator).
Where do you get this fact from? ARM and x86 have a descending stack (<= ARMv7 supported stack growing in either direction).
It's about "bump allocators", where you allocate memory just by incrementing/decrementing a pointer (and don't free it until you're ready to free up all the memory all at once).
These can either allocate "upwards", starting at low addresses and giving you memory at higher addresses for successive allocations, or "downwards", going the other way.
The claim being made is that "downwards" is better, because the bounds checks you need to do turn out to be more efficient that way; allocation ends up using fewer instructions, fewer registers, and fewer conditional jumps.
Given a choice, the OP implies that you should position small-but-numerous allocations next to the top, and larger infrequent allocations next to the bottom.
The tricky part is choosing in a way that puts you noticeably ahead of the unidirectional allocator re: what problems you can solve, without putting excessive mental load on yourself. I've found a pattern of "long-lived allocations on one end, short-lived allocations on the other" to work well here (which, yes, doesn't always coincide with the numerous vs. infrequent axis mentioned in my previous comment).
For one thing, this particular allocator is separate from the general allocator so the calling code can choose when to use the bump allocator.
Quoting the crate doc:
> [B]ump allocation [is] well-suited for phase-oriented allocations. That is, a group of objects that will all be allocated during the same program phase, used, and then can all be deallocated together as a group.
I can see this being useful as a way to optimize allocation/deallocation speed for specific use-cases.
Hard-coding the alignment simplifies many questions (especially if you can guarantee the allocation size being a multiple of the alignment, at which point it becomes a complete non-issue).
And if you have an upper bound on the requested allocation size, the integer overflow checks can be dropped too (e.g. in a Java "new int[n]", "n" is an int, which has a max value of 2^31-1, which can safely be added to a bump pointer, provided it's further than that away from the address space end (which it's gonna be due to the kernel reserving the upper half of the address space)).
And for constant-size objects, your entire bump allocator can become just
if (ptr > precomputedEndMinusObjectSize) fallback();
result = ptr;
ptr += size;[0] https://shipilev.net/jvm/anatomy-quarks/4-tlab-allocation/
size_t alignment_needed = align - (ptr & (align - 1));
if (alignment_needed > SIZE_MAX - size) return nullptr;
size_t space_needed = size + alignment_needed;
size_t space_remaining = end - ptr;
if (space_needed > space_remaining) return nullptr;
char *result = ptr + alignment_needed;
ptr += space_needed;
return result; let new_ptr = new_ptr & !(align - 1);
But I don't see rdx decremented before being negated and AND'ed with rax, in the assembly. What am I missing?Edit: I get it now. My brain was interpreting NEG as NOT for some reason.
In addition to simpler arithmetic, rounding down always succeeds - 0 is a multiple of all N - while rounding up may fail if a larger multiple does not exist. So you need to check in the round-up case.