Always Bump Downwards (2019) (opens in new tab)

(fitzgeraldnick.com)

99 pointstjalfi2y ago72 comments

72 comments

47 comments · 10 top-level

lowbloodsugar2y ago· 10 in thread

So lesson is don’t use safe Rust code when writing something like a memory allocator where an overflow check is deemed by the author to be too slow?

Update:

>To handle both these cases, we will use checked addition and return a null pointer if either addition overflows. Here is the new Rust source code:

I’m still filling this under “Author made arbitrary decision and result is arbitrary solution is slower as a result of arbitrary decision”.

pavon2y ago

I don't think Rust is the issue here. The same integer overflow can occur in any language and should be checked. Integer overflow and underflow is one of the most common security bugs after memory access errors. It is more that bump allocators are so efficient that a relatively inexpensive overflow check can be a significant fraction of their runtime.

kazinator2y ago

It's not integer overflow but pointer overflow.

If you're doing small bump allocations and you're anywhere near pointer overflow, it means you're way out of bounds already; the bump allocations you already made were wrong.

You need to check whether the allocation increment is out of the zone from which you're allocating, and you need that no matter which direction you go.

If the requests are small, you will hit the end of your arena long before you worry about pointer overflow at the zero address or at address 0xFF..FF!

That said, aligning down is a shade faster than up. To align down to a boundary divisible by ALIGN_MASK + 1 we just truncate some low order bits to zero:

  addr &= ~ALIGN_MASK;

If the bits are already zero, the address doesn't move; all is cool.

but aligning up, where we don't care about overflow, requires handling the case where the address is already aligned and doesn't have to move:

  if (addr & ALIGN_MASK) {
    addr |= ALIGN_MASK;
    addr++;
  }

Or a trick like this where we bias the address with an offset of -1 during the masking calculation:

  addr = ((addr - 1) | ALIGN_MASK) + 1;

(We could get underflow here if addr is the zero address (null pointer on most systems), but it's reversible if the pointer arithmetic is done as unsigned. Zero aligns to zero: it goes to 0xFFF..FFF which stays the same after | 7, and then increments back to zero.)

Either of these is worse than just addr &= ~7.

tel2y ago

I would definitely be in favor of an even faster `SmallBump` variant which assumes that the allocations are small w.r.t. usize::MAX for some additional speed.

I also wouldn't mind the a default "fast" bump allocator library to do all tricks it can without sacrificing safety.

1 more reply

wrsh072y ago

Since we're both thinking about this...

Is rust warning you about pointer overflow? My understanding is that rust is complaining about integer overflow but perhaps that's insufficient?

(I don't actually know which solution is better/worse from perf perspective besides author's post)

lowbloodsugar2y ago

Or you just add ALIGNMENT-1 and then mask. So if alignment is 16 then you add 15.

1 more reply

cratermoon2y ago

No, the lesson is to work with the safety mechanisms to accomplish both fast and safe result. If I'm writing code in language A, I don't choose to throw out one of the foundational aspects of that language on the grounds that it's merely inconvenient.

wrsh072y ago

I think the lesson is "safety comes at a cost. If you bump downwards you can avoid those safety checks"

I don't think the lesson is "let's just write unsafe code"

kazinator2y ago

Caller asks for a 500 megabyte downward bump allocation in a 32 bit system. Do you check for underflow or not?

bitwize2y ago

You check to make sure the size requested is less than or equal to current pointer - chunk start and fail if it is bigger.

Let's face facts, the allocator, even with this additional check, is still going to be blazing fast compared to malloc() or whatever. If you're allocating so much that the handful of extra instructions is going to be a significant slowdown, maybe structure your allocations differently?

saagarjha2y ago

Most allocators will do this, yes.

dev_dwarf2y ago· 8 in thread

In the "bump up" version you could remove both the checked_add branches and replace them with a single check at the end, making the amount of branches the same.

Quick example: https://godbolt.org/z/rdv4qnrs8.

*edited to update the example, realized I messed up the comparison logic.

ridiculous_fish2y ago

I think this doesn't work because `aligned + size` may wrap all the way around into the valid region again. For example if aligned == ptr + 1, and size is usize::MAX, we will end up with new_ptr == ptr and the allocation will wrongly succeed.

dev_dwarf2y ago

Interesting point. I modified my example to test what you described. I had to play with the compilation flags to get the allocs to not be optimized out and to not panic when the integer overflow happens, but otherwise I didn't change the logic. I'm pretty sure my implementation is correctly handling the case you mention, evidenced by it returning a null pointer.

Link: https://godbolt.org/z/f1jGW6Pa3

Update: NVM, definitely not being handled correctly. https://godbolt.org/z/cMTe1o979

1 more reply

loeg2y ago

The straightforward answer is: don't tolerate ridiculous alignments.

saagarjha2y ago

You'd need to check for that, though.

1 more reply

sltkr2y ago

That version is unsafe: what if size == 0xfff..fff and alignment is needed? You will end up with ptr <= new_ptr < end, seemingly a valid result, but actually with not enough space.

edit: code moved to a toplevel comment

loeg2y ago

No allocator can be expected to allocate usize::MAX, so it doesn't really matter.

sltkr2y ago

It matters because if the allocator cannot allocate a given amount, it should reliably return NULL to iform the caller that the allocation failed, not return a random invalid pointer that's not usable, which will lead to undefined behavior.

1 more reply

dev_dwarf2y ago

Agreed. The question really is if you should demand the user to enforce that constraint on the size they pass to you, or if the function itself should signal an error in that case.

1 more reply

pizlonator2y ago· 4 in thread

It’s possible to write many other bump algorithms, including ones that bump upwards but generate much much better code than OP’s forward upward bump. In particular, no overflow checks are needed if you write it carefully enough.

My favorite is one where I only use subtraction but the direction of allocation is in the positive direction (I subtract a `remaining` counter, and the returned address is `end - remaining`).

But I have seen many others. A few of my colleagues have similarly geeked out on this and come up with splendid bumpers (ggaren wrote a great one in bmalloc, and I remember the MMTk folks put hella thought into theirs).

titzer2y ago

In most situations, the allocation size is a constant and bumping upwards can be done without an overflow check because the region cannot be close enough to the upper part of memory to wrap around.

I'm surprised that there was no discussion of memory system performance. There are tons of OS-level and hardware prefetcher optimizations for forward-marching pointer references and zero-page allocation.

pizlonator2y ago

You can bump upwards without an overflow check even if the size is variable.

I’ve seen multiple ways to do it. And I’ve lost days of my life to coming up with pointlessly amusing variants that get there in weird ways.

Yeah I’m also amused that they didn’t get into the fact that downward bump is just not what the HW expects you to do. You’d need a real benchmark to see that effect. And, it’s an open question whether this would matter unless your allocation rate was obscene (if it isn’t then most scans of memory will be normal loops and not the allocator).

zeusk2y ago

> fact that downward bump is just not what the HW expects you to do.

Where do you get this fact from? ARM and x86 have a descending stack (<= ARMv7 supported stack growing in either direction).

1 more reply

saagarjha2y ago

Even when size is a full 64-bit value (…in that it might actually overflow the address space)?

1 more reply

gjm112y ago· 3 in thread

Since the title is less than perfectly perspicuous:

It's about "bump allocators", where you allocate memory just by incrementing/decrementing a pointer (and don't free it until you're ready to free up all the memory all at once).

These can either allocate "upwards", starting at low addresses and giving you memory at higher addresses for successive allocations, or "downwards", going the other way.

The claim being made is that "downwards" is better, because the bounds checks you need to do turn out to be more efficient that way; allocation ends up using fewer instructions, fewer registers, and fewer conditional jumps.

chrchang5232y ago

Incidentally, you can choose to bump in both directions. It's more complicated (you need to keep track of which end you allocated each data structure on), but in exchange, the allocator becomes sufficient for many more use cases.

Given a choice, the OP implies that you should position small-but-numerous allocations next to the top, and larger infrequent allocations next to the bottom.

dev_dwarf2y ago

This sounds like it would make the alloc logic much more complicated and branch-y, defeating the purpose of bumping down anyway, unless your implying some compile-time way to do this.

chrchang5232y ago

No, the idea is that you manually make some allocations downward from the top and some allocations upward from the bottom. The bumping code is as simple as in the unidirectional case.

The tricky part is choosing in a way that puts you noticeably ahead of the unidirectional allocator re: what problems you can solve, without putting excessive mental load on yourself. I've found a pattern of "long-lived allocations on one end, short-lived allocations on the other" to work well here (which, yes, doesn't always coincide with the numerous vs. infrequent axis mentioned in my previous comment).

2 more replies

justin_oaks2y ago· 3 in thread

I only recently learned about bump allocators. I was very confused as to how such an allocator could be useful. In reading the author's bumpalo crate [1] documentation, he cleared up my confusion.

For one thing, this particular allocator is separate from the general allocator so the calling code can choose when to use the bump allocator.

Quoting the crate doc:

> [B]ump allocation [is] well-suited for phase-oriented allocations. That is, a group of objects that will all be allocated during the same program phase, used, and then can all be deallocated together as a group.

I can see this being useful as a way to optimize allocation/deallocation speed for specific use-cases.

[1] https://docs.rs/bumpalo/latest/bumpalo/

loeg2y ago

It's great when you don't need destructors and have a bunch of small objects to allocate with similar lifetimes.

ph4evers2y ago

It is also used in WASM. I guess to increase memory safety

saagarjha2y ago

Bump allocators are very simple in an environment where everything you want must be brought with you.

sam_bishop2y ago· 3 in thread

I've heard that the Hotspot JVM uses a bump allocator but don't know the details. I'm sure it's heavily optimized though, so I'm curious about how this compares.

dzaima2y ago

For most specific purposes, a specialized bump allocator can be significantly more optimized.

Hard-coding the alignment simplifies many questions (especially if you can guarantee the allocation size being a multiple of the alignment, at which point it becomes a complete non-issue).

And if you have an upper bound on the requested allocation size, the integer overflow checks can be dropped too (e.g. in a Java "new int[n]", "n" is an int, which has a max value of 2^31-1, which can safely be added to a bump pointer, provided it's further than that away from the address space end (which it's gonna be due to the kernel reserving the upper half of the address space)).

And for constant-size objects, your entire bump allocator can become just

    if (ptr > precomputedEndMinusObjectSize) fallback();
    result = ptr;
    ptr += size;

saagarjha2y ago

I believe HotSpot can move allocations out of the arena when garbage collection happens, which means fragmentation is less of an issue.

tjalfiOP2y ago

Here's an article[0] with some information on Hotspot's bump allocators.

[0] https://shipilev.net/jvm/anatomy-quarks/4-tlab-allocation/

benmanns2y ago· 2 in thread

Somewhat aside, I love the graphical display of the benchmark results: https://fitzgeraldnick.com/media/bumpalo-switch-to-downwards... --I think I'll start doing that when I run benchmarks in the future.

thethirdone2y ago

Notably, that graph starts at 25 which seems misleading.

wrsh072y ago

Eh it was clearly labeled, and I pretty quickly gleaned that it was like 5 microseconds faster

1 more reply

sltkr2y ago· 2 in thread

You can also implement BumpUp() with only two conditionals and no overflow like this (in C, because I don't know Rust, but the logic is identical):

    size_t alignment_needed = align - (ptr & (align - 1));
    if (alignment_needed > SIZE_MAX - size) return nullptr;
    size_t space_needed = size + alignment_needed;
    size_t space_remaining = end - ptr;
    if (space_needed > space_remaining) return nullptr;
    char *result = ptr + alignment_needed;
    ptr += space_needed;
    return result;

loeg2y ago

Yeah, and you only need one check if you can restrict the allocation pointer to be at least max alignment away from SIZE_MAX, which is very easy to do in practice. Max alignment needed is usually no more than 64 bytes, but even 4kB isn't especially burdensome. On Linux/amd64, you don't even need to do anything special -- the high virtual address space is likely reserved for the kernel anyway.

dev_dwarf2y ago

Nice. Seems to work for case you mentioned under my comment: https://godbolt.org/z/TYorcd8b6

asalahli2y ago· 2 in thread

The rust code for rounding down is given as

    let new_ptr = new_ptr & !(align - 1);

But I don't see rdx decremented before being negated and AND'ed with rax, in the assembly. What am I missing?

conradludgate2y ago

It's being converted into `-align` which is equivalent under twos compliment

asalahli2y ago

I don't follow. Are you saying `new_ptr & -align` is equivalent to `new_ptr & !(align - 1)`?

Edit: I get it now. My brain was interpreting NEG as NOT for some reason.

ridiculous_fish2y ago

The insight here is that, for unsigned values, rounding down to a multiple of N is cheaper than rounding up to a multiple of N.

In addition to simpler arithmetic, rounding down always succeeds - 0 is a multiple of all N - while rounding up may fail if a larger multiple does not exist. So you need to check in the round-up case.

j / k navigate · click thread line to collapse

72 comments

47 comments · 10 top-level

lowbloodsugar2y ago· 10 in thread

So lesson is don’t use safe Rust code when writing something like a memory allocator where an overflow check is deemed by the author to be too slow?

Update:

>To handle both these cases, we will use checked addition and return a null pointer if either addition overflows. Here is the new Rust source code:

I’m still filling this under “Author made arbitrary decision and result is arbitrary solution is slower as a result of arbitrary decision”.

pavon2y ago

kazinator2y ago

It's not integer overflow but pointer overflow.

If you're doing small bump allocations and you're anywhere near pointer overflow, it means you're way out of bounds already; the bump allocations you already made were wrong.

You need to check whether the allocation increment is out of the zone from which you're allocating, and you need that no matter which direction you go.

If the requests are small, you will hit the end of your arena long before you worry about pointer overflow at the zero address or at address 0xFF..FF!

That said, aligning down is a shade faster than up. To align down to a boundary divisible by ALIGN_MASK + 1 we just truncate some low order bits to zero:

  addr &= ~ALIGN_MASK;

If the bits are already zero, the address doesn't move; all is cool.

but aligning up, where we don't care about overflow, requires handling the case where the address is already aligned and doesn't have to move:

  if (addr & ALIGN_MASK) {
    addr |= ALIGN_MASK;
    addr++;
  }

Or a trick like this where we bias the address with an offset of -1 during the masking calculation:

  addr = ((addr - 1) | ALIGN_MASK) + 1;

Either of these is worse than just addr &= ~7.

tel2y ago

I would definitely be in favor of an even faster `SmallBump` variant which assumes that the allocations are small w.r.t. usize::MAX for some additional speed.

I also wouldn't mind the a default "fast" bump allocator library to do all tricks it can without sacrificing safety.

1 more reply

wrsh072y ago

Since we're both thinking about this...

Is rust warning you about pointer overflow? My understanding is that rust is complaining about integer overflow but perhaps that's insufficient?

(I don't actually know which solution is better/worse from perf perspective besides author's post)

lowbloodsugar2y ago

Or you just add ALIGNMENT-1 and then mask. So if alignment is 16 then you add 15.

1 more reply

cratermoon2y ago

wrsh072y ago

I think the lesson is "safety comes at a cost. If you bump downwards you can avoid those safety checks"

I don't think the lesson is "let's just write unsafe code"

kazinator2y ago

Caller asks for a 500 megabyte downward bump allocation in a 32 bit system. Do you check for underflow or not?

bitwize2y ago

You check to make sure the size requested is less than or equal to current pointer - chunk start and fail if it is bigger.

saagarjha2y ago

Most allocators will do this, yes.

dev_dwarf2y ago· 8 in thread

In the "bump up" version you could remove both the checked_add branches and replace them with a single check at the end, making the amount of branches the same.

Quick example: https://godbolt.org/z/rdv4qnrs8.

*edited to update the example, realized I messed up the comparison logic.

ridiculous_fish2y ago

dev_dwarf2y ago

Link: https://godbolt.org/z/f1jGW6Pa3

Update: NVM, definitely not being handled correctly. https://godbolt.org/z/cMTe1o979

1 more reply

loeg2y ago

The straightforward answer is: don't tolerate ridiculous alignments.

saagarjha2y ago

You'd need to check for that, though.

1 more reply

sltkr2y ago

That version is unsafe: what if size == 0xfff..fff and alignment is needed? You will end up with ptr <= new_ptr < end, seemingly a valid result, but actually with not enough space.

edit: code moved to a toplevel comment

loeg2y ago

No allocator can be expected to allocate usize::MAX, so it doesn't really matter.

sltkr2y ago

1 more reply

dev_dwarf2y ago

Agreed. The question really is if you should demand the user to enforce that constraint on the size they pass to you, or if the function itself should signal an error in that case.

1 more reply

pizlonator2y ago· 4 in thread

My favorite is one where I only use subtraction but the direction of allocation is in the positive direction (I subtract a `remaining` counter, and the returned address is `end - remaining`).

titzer2y ago

In most situations, the allocation size is a constant and bumping upwards can be done without an overflow check because the region cannot be close enough to the upper part of memory to wrap around.

pizlonator2y ago

You can bump upwards without an overflow check even if the size is variable.

I’ve seen multiple ways to do it. And I’ve lost days of my life to coming up with pointlessly amusing variants that get there in weird ways.

zeusk2y ago

> fact that downward bump is just not what the HW expects you to do.

Where do you get this fact from? ARM and x86 have a descending stack (<= ARMv7 supported stack growing in either direction).

1 more reply

saagarjha2y ago

Even when size is a full 64-bit value (…in that it might actually overflow the address space)?

1 more reply

gjm112y ago· 3 in thread

Since the title is less than perfectly perspicuous:

It's about "bump allocators", where you allocate memory just by incrementing/decrementing a pointer (and don't free it until you're ready to free up all the memory all at once).

These can either allocate "upwards", starting at low addresses and giving you memory at higher addresses for successive allocations, or "downwards", going the other way.

chrchang5232y ago

Given a choice, the OP implies that you should position small-but-numerous allocations next to the top, and larger infrequent allocations next to the bottom.

dev_dwarf2y ago

This sounds like it would make the alloc logic much more complicated and branch-y, defeating the purpose of bumping down anyway, unless your implying some compile-time way to do this.

chrchang5232y ago

No, the idea is that you manually make some allocations downward from the top and some allocations upward from the bottom. The bumping code is as simple as in the unidirectional case.

2 more replies

justin_oaks2y ago· 3 in thread

I only recently learned about bump allocators. I was very confused as to how such an allocator could be useful. In reading the author's bumpalo crate [1] documentation, he cleared up my confusion.

For one thing, this particular allocator is separate from the general allocator so the calling code can choose when to use the bump allocator.

Quoting the crate doc:

I can see this being useful as a way to optimize allocation/deallocation speed for specific use-cases.

[1] https://docs.rs/bumpalo/latest/bumpalo/

loeg2y ago

It's great when you don't need destructors and have a bunch of small objects to allocate with similar lifetimes.

ph4evers2y ago

It is also used in WASM. I guess to increase memory safety

saagarjha2y ago

Bump allocators are very simple in an environment where everything you want must be brought with you.

sam_bishop2y ago· 3 in thread

I've heard that the Hotspot JVM uses a bump allocator but don't know the details. I'm sure it's heavily optimized though, so I'm curious about how this compares.

dzaima2y ago

For most specific purposes, a specialized bump allocator can be significantly more optimized.

Hard-coding the alignment simplifies many questions (especially if you can guarantee the allocation size being a multiple of the alignment, at which point it becomes a complete non-issue).

And for constant-size objects, your entire bump allocator can become just

    if (ptr > precomputedEndMinusObjectSize) fallback();
    result = ptr;
    ptr += size;

saagarjha2y ago

I believe HotSpot can move allocations out of the arena when garbage collection happens, which means fragmentation is less of an issue.

tjalfiOP2y ago

Here's an article[0] with some information on Hotspot's bump allocators.

[0] https://shipilev.net/jvm/anatomy-quarks/4-tlab-allocation/

benmanns2y ago· 2 in thread

thethirdone2y ago

Notably, that graph starts at 25 which seems misleading.

wrsh072y ago

Eh it was clearly labeled, and I pretty quickly gleaned that it was like 5 microseconds faster

1 more reply

sltkr2y ago· 2 in thread

You can also implement BumpUp() with only two conditionals and no overflow like this (in C, because I don't know Rust, but the logic is identical):

    size_t alignment_needed = align - (ptr & (align - 1));
    if (alignment_needed > SIZE_MAX - size) return nullptr;
    size_t space_needed = size + alignment_needed;
    size_t space_remaining = end - ptr;
    if (space_needed > space_remaining) return nullptr;
    char *result = ptr + alignment_needed;
    ptr += space_needed;
    return result;

loeg2y ago

dev_dwarf2y ago

Nice. Seems to work for case you mentioned under my comment: https://godbolt.org/z/TYorcd8b6

asalahli2y ago· 2 in thread

The rust code for rounding down is given as

    let new_ptr = new_ptr & !(align - 1);

But I don't see rdx decremented before being negated and AND'ed with rax, in the assembly. What am I missing?

conradludgate2y ago

It's being converted into `-align` which is equivalent under twos compliment

asalahli2y ago

I don't follow. Are you saying `new_ptr & -align` is equivalent to `new_ptr & !(align - 1)`?

Edit: I get it now. My brain was interpreting NEG as NOT for some reason.

ridiculous_fish2y ago

The insight here is that, for unsigned values, rounding down to a multiple of N is cheaper than rounding up to a multiple of N.

j / k navigate · click thread line to collapse