During my time at Facebook, I maintained a bunch of kernel patches to improve jemalloc purging mechanisms. It wasn't popular in the kernel or the security community, but it was more efficient on benchmarks for sure.
Many programs run multiple threads, allocating in one and freeing in another. Jemalloc's primary mechanism used to be: madvise the page back to the kernel, then let the kernel hand it out again into another thread's pool.
One problem: this involves zeroing memory, which hurts cache locality and overall app performance. It's completely unnecessary if the page is being recirculated within the same security domain.
The problem was getting everyone to agree on what that security domain is, even if the mechanism was opt-in.
We did extensive benchmarking of HHVM with and without your patches, and they showed no statistically significant difference in high-level metrics. So we dropped them from the kernel, and they never went back in.
I don't doubt for a second you can come up with specific counterexamples and microbenchmarks that show a benefit. But you were unable to show an advantage at the system level when challenged on it, and that's what matters.
By the time you joined and benchmarked these systems, the continuous rolling deployment had taken over. If you're restarting the server every few hours, of course the memory fragmentation isn't much of an issue.
> But you were unable to show an advantage at the system level when challenged on it, and that's what matters.
You mean 5 years after I stopped working on the kernel and the underlying system had changed?
I don't recall ever talking to you on the matter.
If you don't like the idea of memory cgroups as a security domain, you could tighten it to be a process. But kernel developers have been opposed to tracking pages on a per-address-space basis for a long time. On the other hand, memory cgroup tracking happens by construction.
https://research.google/pubs/google-wide-profiling-a-continu... https://engineering.fb.com/2025/01/21/production-engineering...
The profiling clearly showed kernel functions doing memzero at the top of the profiles which motivated the change. The performance impact (A/B testing and measuring the throughput) also showed a benefit at the point the change was committed.
This was when "facebook" was a ~1GB ELF binary. https://en.wikipedia.org/wiki/HipHop_for_PHP
The change stopped being impactful sometime after 2013, when a JIT replaced the transpiler. I'm guessing likely before 2016 when continuous deployment came into play. But that was continuously deploying PHP code, not HHVM itself.
By the time the patches were reevaluated I was working on a Graph Database, which sounded a lot more interesting than going back to my old job function and defending a patch that may or may not be relevant.
I'm still working on one. Guilty as charged of carrying ideas in my head for 10+ years and acting on them later. Link in my profile.
There needs to be more competition in the malloc space. Between various huge page sizes and transparent huge pages, there are a lot of gains to be had over what you get from a default GNU libc.
Our results from July 2025:
rows are <allocator>: <RSS>, <time spent for allocator operations>
app1:
glibc: 215,580 KB, 133 ms
mimalloc 2.1.7: 144,092 KB, 91 ms
mimalloc 2.2.4: 173,240 KB, 280 ms
tcmalloc: 138,496 KB, 96 ms
jemalloc: 147,408 KB, 92 ms
app2, bench1:
glibc: 1,165,000 KB, 1.4 s
mimalloc 2.1.7: 1,072,000 KB, 5.1 s
mimalloc 2.2.4:
tcmalloc: 1,023,000 KB, 530 ms
app2, bench2:
glibc: 1,190,224 KB, 1.5 s
mimalloc 2.1.7: 1,128,328 KB, 5.3 s
mimalloc 2.2.4: 1,657,600 KB, 3.7 s
tcmalloc: 1,045,968 KB, 640 ms
jemalloc: 1,210,000 KB, 1.1 s
app3:
glibc: 284,616 KB, 440 ms
mimalloc 2.1.7: 246,216 KB, 250 ms
mimalloc 2.2.4: 325,184 KB, 290 ms
tcmalloc: 178,688 KB, 200 ms
jemalloc: 264,688 KB, 230 ms
tcmalloc was from github.com/google/tcmalloc/tree/24b3f29. I don't recall which jemalloc was tested.
tcmalloc (thread caching malloc) assumes memory allocations have good thread locality. This is often a double win (less false sharing of cache lines, and most allocations hit thread-local data structures in the allocator).
Multithreaded async systems destroy that locality, so it constantly has to run through the exception case: A allocated a buffer, went async, the request wakes up on thread B, which frees the buffer, and has to synchronize with A to give it back.
Are you using async rust, or sync rust?
Edit: I see mimalloc v3 is out – I missed that! That probably moots this discussion altogether.
Even toolchains like Turbo Pascal for MS-DOS had an API to customise the memory allocator.
One size fits all was never a solution.
If you got a web request, you could allocate a memory pool for it, then you would do all your memory allocations from that pool. And when your web request ended - either cleanly or with a hundred different kinds of errors, you could just free the entire pool.
It was nice and made an impression on me.
I think the lowly malloc probably has lots of interesting ways of growing and changing.
Yes, if you want to use huge pages with arbitrary alloc/free, then use a third-party malloc. If your alloc/free patterns are not arbitrary, you can do even better. We treat malloc as a magic black box but it's actually not very good.
https://jemalloc.net/jemalloc.3.html
One thing to call out: sdallocx integrates well with C++'s sized delete semantics: https://isocpp.org/files/papers/n3778.html
The nice thing about mimalloc is that there are a ton of configurable knobs available via env vars. I'm able to hand those 16 1 GiB pages to the program at launch via `MIMALLOC_RESERVE_HUGE_OS_PAGES=16`.
EDIT: after re-reading your comment a few times, I apologize if you already knew this (which it sounds like you did).
Last time I checked mimalloc, which was admittedly a while ago (probably 5 years), it was noticeably worse, and I saw a lot of people on their GitHub issues agreeing with me, so I just never looked at it again.
Jemalloc Postmortem - https://news.ycombinator.com/item?id=44264958 - June 2025 (233 comments)
Jemalloc Repositories Are Archived - https://news.ycombinator.com/item?id=44161128 - June 2025 (7 comments)
Most of the savings seemed to come from HVAC costs, followed by buying fewer computers and, in turn, building fewer data centers. I'm sure these days saving memory is also a big deal, but it doesn't seem to have been then.
The above was already the case 10 years ago, so LLMs are at most another factor added on.
In startups I've put more effort into squeezing blood from a stone for far less change, even if the change was proportionally more significant to the business. Sometimes it would be neat to say "something I did saved $X million or saved Y kWh of energy" or whatever.
At most... Think 10x rather than 0.1x or 1x.
they've been using jemalloc (and employing "je") since 2009.
I'm saddened that the job market in Australia is largely React CRUD applications and that it's unlikely I will find a role that lets me leverage my niche skill set (which is also my hobby)
Link in bio.
I applied for both and got ghosted, haha.
I also saw a government role as a security researcher. Involves reverse engineering, ghidra and that sort of thing. Super awesome - but the pay is extremely uncompetitive. Such a shame.
Other than that, the most interesting roles are in finance (like HFT) - where you need to juggle memory allocations, threads and use C++ (hoping I can pitch Rust but unlikely).
Sadly, they have a reputation for pretty rough cultures and uncompetitive salaries, and it's all in-office.
The one I know of (IMC trading) does a lot of low level stuff like this and is currently hiring.
Facebook's coding AIs to the rescue, maybe? I wonder how good all these "agentic" AIs are at dreaded refactoring jobs like these.
This doesn't quite read properly to me. What does it actually mean, does anyone know?
For example they use AsmJit in a lot of projects (both internal and open-source) and it's now unmaintained because of funding issues. Maybe they have now internal forks too.
https://technology.blog.gov.uk/2015/12/11/using-jemalloc-to-...
Initially the idea was diagnostics; instead the problem disappeared on its own.
Second thoughts: Actually the fb.com post is more transparent than I'd have predicted. Not bad at all. Of course it helps that they're delivering good news!
He's doing just fine. If you're looking for a story about a FAANG company not paying engineers well for their work, this isn't it.
When I preloaded jemalloc, memory remained at significantly lower levels and, more importantly, it was stable.
There seems to be no single correct solution to memory allocation; it depends on the workload.
From the Department of Redundancy Department.
I was recently debugging an app double-free segfault on my android 13 samsung galaxy A51 phone, and the internal stack trace pointed to jemalloc function calls (je_free).
"To help personalize content, tailor and measure ads and provide a safer experience, we use cookies. By clicking or navigating the site, you agree to allow our collection of information on and off Facebook through cookies. Learn more, including about available controls:
They should have just called it an ivory tower, as that's what they're building whenever they're not busy destroying democracy with OS Backdoor lobbyism or Cambridge Analytica shenanigans.
Edit: If every thread about any of Elon Musk's companies can contain at least 10 comments talking about Elon's purported crimes against humanity, threads about Zuckerberg's companies can contain at least 1 comment. Without reminders like this, stories like last week's might as well remain non-consequential.
https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...
A 0.5% improvement may not be a lot to you, but at hyperscaler scale it's well worth staffing a team to work on it, with the added benefit of having people on hand that can investigate subtle bugs and pathological perf behaviors.
But as usual, there is an xkcd for that. https://xkcd.com/1205/
One project I spent a bunch of time optimizing the write path of I/O. It was just using standard fwrite. But by staging items correctly it was an easy 10x speed win. Those optimizations sometimes stack up and count big. But it also had a few edges on it, so use with care.
#include <stddef.h>
#include <sys/mman.h>

void *malloc(size_t size) {
    void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANON, -1, 0);
    return (ptr == MAP_FAILED) ? NULL : ptr;
}

void free(void *ptr) { /* YOLO */ }
/s