I’ve been meaning to ask Qi if he’d be open to cutting a final 6.0 release on the repo before re-archiving.
At the same time it’d be nice to modernize the default settings for the final release. Disabling the (somewhat confusingly backwardly-named) “cache oblivious” setting by default so that the 16 KiB size-class isn’t bloated to 20 KiB would be a major improvement. This isn’t to disparage your (i.e. Jason’s) original choice here; IIRC when I last talked to Qi and David about this they made the point that at the time you chose this default, typical TLB associativity was much lower than it is now. On a similar note, increasing the default “page size” from 4 KiB to something larger (probably 16 KiB), which would correspondingly increase the large size-class cutoff (i.e. the point at which the allocator switches from placing multiple allocations onto a slab, to backing individual allocations with their own extent directly) from 16 KiB up to 64 KiB would be pretty impactful. One of the last things I looked at before leaving Meta was making this change internally for major services, as it was worth a several percent CPU improvement (at the cost of a minor increase in RAM usage due to increased fragmentation). There’s a few other things I’d tweak (e.g. switching the default setting of metadata_thp from “disabled” to “auto”, changing the extent-sizing for slabs from using the nearest exact multiple of the page size that fits the size-class to instead allowing ~1% guaranteed wasted space in exchange for reducing fragmentation), but the aforementioned settings are the biggest ones.
A fellow traveler, ahoy!
i think, because Itanic broke a ton of assumptions
What's hard about using TCMalloc if you're not using bazel? (Not asking to imply that it's not, but because I'm genuinely curious.)
1. Use it as a dynamically linked library. This is not great because you’re taking at a minimum the performance hit of going through the PLT for every call. The forfeited performance is even larger if you compare against statically linking with LTO (i.e. so that you can inline calls to malloc, get the benefit of FDO , etc.). Not to mention all the deployment headaches associated with shared libraries.
2. Painfully manually create a static library. I’ve done this, it’s awful; especially if you want to go the extra mile to capture as much performance as possible and at least get partial LTO (i.e. of TCMalloc independent of your application code, compiling all of TCMalloc’s compilation units together to create a single object file).
When I was at Meta I imported TCMalloc to benchmark against (to highlight areas where we could do better in Jemalloc) by pain-stakingly hand-translating its bazel BUILD files to buck2 because there was legitimately no better option.
As a consequence of being so hard to use outside of Google, TCMalloc has many more unexpected (sometimes problematic) behaviors than Jemalloc when used as a general purpose allocator in other environments (e.g. it basically assumes that you are using a certain set of Linux configuration options [1] and behaves rather poorly if you’re not)
[1] https://google.github.io/tcmalloc/tuning.html#system-level-o...
For the non low-level programmers in the bowels of memory allocators among us, why is this a "lol"?
Why not? I mean this is complete drive-by comment, so please correct me, but there was a fully staffed team at Meta that maintained it, but was not in the best place to manage the issues?
custom-malloc-newbie question: Why is the choice of build system (generator) significant when evaluating the usability of a library?
I've done both before, and seen libraries at various levels of complexity; there is definitely a point where you just want to give up and not use the thing when it's very complex.
One fine day, we discovered Jemalloc and put it in our service, which was causing a lot of memory fragmentation. We did not think that those 2 lines of changes in Dockerfile were going to fix all of our woes, but we were pleasantly surprised. Every single issue went away.
Today, our multi-million dollar revenue company is using your memory allocator on every single service and on every single Dockerfile.
Thank you! From the bottom of our hearts!
the top 3 from https://github.com/topics/resize-images (as of 2025-06-13)
imaginary: https://github.com/h2non/imaginary/blob/1d4e251cfcd58ea66f83...
imgproxy: https://web.archive.org/web/20210412004544/https://docs.imgp... (linked from a discussion in the imaginary repo)
imagor: https://github.com/cshum/imagor/blob/f6673fa6656ee8ef17728f2...
libvips is fairly highly threaded and does a lot of alloc/free, so it's challenging for most heap implementations.
FWIW while it was a factor it was just one of a number: https://github.com/rust-lang/rust/issues/36963#issuecomment-...
And jemalloc was only removed two years after that issue was opened: https://github.com/rust-lang/rust/pull/55238
I wonder if some kind of dynamic page-size (with dynamic ftrace-style binary patching for performance?) would have been that much slower.
I learned of it from it's integration in FreeBSD and never looked back.
jemalloc has help entertained a lot of people :)
windows def allocator is pos. Jemalloc rules
Wow, still? I remember allocator benchmarks from 10-15 years ago where there were some notable differences between allocators... and then Windows with like 20% the performance of everything else!
Which one of them? These days it could mean HeapAlloc, or it could mean malloc from uCRT.
Or I wonder if they could simply use tcmalloc or another allocator these days?
Facebook infrastructure engineering reduced investment in core technology, instead emphasizing return on investment.
> Or I wonder if they could simply use tcmalloc or another allocator these days?
Jemalloc is very deeply integrated there, so this is a lot harder than it sounds. From the telemetry being plumbed through in Strobelight, to applications using every highly Jemalloc-specific extension under the sun (e.g. manually created arenas with custom extent hooks), to the convergent evolution of applications being written in ways such that they perform optimally with respect to Jemalloc’s exact behavior.
> as a result of recent changes within Meta we no longer have anyone shepherding long-term jemalloc development with an eye toward general utility
> we reached a sad end for jemalloc in the hands of Facebook/Meta
> Meta’s needs stopped aligning well with those of external uses some time ago, and they are better off doing their own thing.
Though this was not the post I was expecting to show up today, it was super awesome for me to get to have played my tiny part in this big journey. Thanks for everything @je (and qi + david -- and all the contributors before and after my time!).
I will say, the Facebook people were very excited to share jemalloc with us when they acquired my employer, but we were using FreeBSD so we already had it and thought it was normal. :)
A while back, I had a conversation with an engineer who maintained an OS allocator, and their claim was that custom allocators tend to make one process's memory allocation faster at the expense of the rest of the system. System allocators are less able to make allocation fair holistically, because one process isn't following the same patterns as the rest.
Which is why you see it recommended so frequently with services, where there is generally one process that you want to get preferential treatment over everything else.
It would be interested in hearing their thoughts directly, I'm also not an allocator engineer and someone who maintains an OS allocator probably knows wayyy more about this stuff than me. I'm sure there's some missing nuance or context or which would've made it make sense.
There's also the fact that ... a lot of processes only ever have a single thread, or at most have a few background threads that do very little of interest. So all these "multi-threading-first allocators" aren't actually buying anything of value, and they do have a lot of overhead.
Semi-related: one thing that most people never think about: it is exactly the same amount of work for the kernel to zero a page of memory (in preparation for a future mmap) as for a userland process to zero it out (for its own internal reuse)
Possibly more work since the kernel can't use SIMD
> And people find themselves in impossible situations where the main choices are 1) make poor decisions under extreme pressure, 2) comply under extreme pressure, or 3) get routed around.
It doesn't sound like a work place :-(
- fsociety
As much as it's nice to think software can be done, I think something so closely tied to the kernel and hardware and the application layer, which all change constantly, never can be.
Even from a quick look at the open issues, I can see https://github.com/jemalloc/jemalloc/issues/2838, and https://github.com/jemalloc/jemalloc/issues/2815 as two examples, but there's a fair number of issues still open against the repository.
So that'll leave projects like redis & valkey with some decisions to make.
1) Keep jemalloc and accept things like memory leak bugs
2) Fork and maintain their own version of jemalloc.
3) Spend time replacing it entirely.
4) Hope someone else picks it up?
Lol absolutely not
So the flow is like this: user has an allocation looking issue. Picks up $allocator. If they have an $allocator type problem then they keep using it, otherwise they use something else.
There are tons of users if these allocators but many rarely engage with the developers. Many wouldn’t even notice improvements or regressions on upgrades because after the initial choice they stop looking.
I’m not sure how to fix that, but this is not healthy for such projects.
In fact, Jason himself (the author of jemalloc and TFA) posted an article on glibc malloc fragmentation 15 years ago: https://web.archive.org/web/20160417080412/http://www.canonw...
And it's an issue to this day: https://blog.arkey.fr/drafts/2021/01/22/native-memory-fragme...
How would the allocator know that some block is unused, short of `free` being called? Does glibc not return all memory after a `free`? Do other allocators do something clever to automatically release things? Is there just a lot of bookkeeping overhead that some allocators are better at handling?
jemalloc is always the first thing I installed whenever I had to provision bare servers.
If jemalloc is somehow the default allocator in Linux, I think it will not have a hard time retaining contributors.
I'm not really sure what I expected, but somehow I expect a memory allocator to be ... smaller, simpler perhaps?
However it is typically always more complex to make production quality software, especially in a performance sensitive domain.
I think we lost a great deal of potential when ORCA was too tied to Pony and not extracted to a framework, tool, and/or library useful outside of it such as integrated or working with LLVM.
You can write a naive mark-and-sweep in an afternoon. You can write a reference counter in even less time. And for some runtimes this is fine.
But writing a generational, concurrent, moving GC takes a lot of time. But if you can achieve it, you can get amazing performance gains. Just look at recent versions of Java.
I wonder if you did get everything you should from the companies that use it. I mean sometimes I feel that big tech firms only use free software, never giving anything to it, so I hope you were the exception here.
He and the other previous contributors are free to find new employers to continue such an arrangement, if any are willing to make that investment. Alternatively they could cobble together funding from a variety of smaller vendors. I think the author is happy to move on to other projects, after spending a long time in this problem space.
I don’t think that “don’t let one megacorp hire a team of contributors for your FOSS project” is the lesson here. I’d say it’s a lesson in working upstream - the contributions made during their Facebook / Meta investment are available for the community to build upon. They could’ve just as easily been made in a closed source fork inside Facebook, without violating the terms of the license.
Also Mozilla were unable to switch from their fork to the upstream version, and didn’t easily benefit from the Facebook / Meta investment as a result.
Good times.
"The jemalloc memory allocator was first conceived in early 2004, and has been in public use for about 20 years now. Thanks to the nature of open source software licensing, jemalloc will remain publicly available indefinitely. But active upstream development has come to an end. This post briefly describes jemalloc’s development phases, each with some success/failure highlights, followed by some retrospective commentary."
Surely a "retrospective" would be a better word for a look back. It even means "look back.