A little offtopic but the default Delphi 7 memory allocator did this, except that it also merged blocks that it obtained from different OS allocation calls.
This worked fine for regular usage, but if that memory was ever used for Bitmaps for UI stuff, it wouldn't work: Since Windows does some of its UI stuff in kernel mode, before doing anything Windows would attempt to lock the entire allocation's VAD entry to prevent you from messing with it in another thread while it was using it. If the Bitmap you were working on happened to belong to two different OS-level allocations, this lock function would fail since it wasn't meant to handle that case.
This would lead to random, extremely cryptic errors such as ERROR_NO_SCROLLBARS "The window does not have scroll bars." since the lock call was deep in the stack and the callers replaced the error with another one as it bubbled up.
In the end we had to replace the allocator to avoid the issue. That was a fun day of debugging.
We just had it restart itself and try again whenever it got one of those errors when printing, but eventually we wanted to add a feature that required the process not to die. By that time I was already 99% sure it wasn't something in our code, and I had already ruled out threading issues.
I ended up putting it in a VM with a kernel debugger attached and having a script make a snapshot and make it print over and over until it errored, then following along in IDA until I saw what was going on.
Having a way to trigger it (by restoring the snapshot) on demand helped a lot, otherwise it would have taken forever to make sense of it as it could sit without crashing for nearly an hour.
Here's a 2003 forum post from someone else having the same problem: http://www.delphigroups.info/2/1/749064.html
(Your application wasn't even a "process" per se. Until Windows 95, everything was just happening in one shared address space, in real mode. In fact, it was only in Windows 3.1 where user applications stopped running in ring 0!)
If you think about it, this "the kernel is a game engine and your application is the game" approach isn't necessarily a bad design... for a single-tasking OS's library exokernel, like the Wii's https://wiibrew.org/wiki/IOS.
But, of course, Windows claimed to be a multitasking OS. But it actually wasn't! And I don't mean the obvious thing about it not having pre-emption. Lots of multitasking OSes didn't have pre-emption.
No, what I mean is that the concurrency primitive for the cooperative scheduling wasn't the "task" (i.e. the process or thread. Which, again, there weren't any of.) Instead, the concurrency primitive was the window. Until Windows 95, Windows was a multi-windowing OS.
Each control was owned by a window. Each window had a WndProc. If your Windows executable (i.e. application delegate module) set up two windows, then each window participated separately in the global Windows event loop, up-to-and-including things like having its own set of loaded fonts, its own clipboard state, and its own interned strings table. In OOP terms†, your application was just a dead "class object", running no logic of its own save for one-time load and unload hooks. It was the windows themselves that were the "instances" of your class.
This might make you realize why MDI (or Multiple Document Interface, where there are multiple small per-document "windows" inside one big window) was so popular back then. The MDI "windows" weren't actually windows — they didn't have their own WndProc. They were just controls, like a tab view is a control. Only the big container window was a real window, and so all the resources within that big window were shared between all the virtual windows. MDI was a memory-saving trick!
---
† The actual more interesting analogy is that Windows was essentially a (single-threaded, cooperatively-scheduled) actor system, where windows were the actors. There is a very close parallel between (Windows 3.1 executables, Windows 3.1 windows) and (Erlang modules, Erlang processes).
E.g. the Linux kernel used (not sure if it's still like this) a buddy allocator to allocate pages and power-of-two blocks of pages and slab allocators to subdivide those pages and allocate data structures.
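The merge step that makes buddy allocators cheap comes down to one XOR: a block's "buddy" sits at its offset XOR its size. A toy sketch of that arithmetic (an illustration of the idea, not the kernel's actual code; `buddy_of` and `round_up_pow2` are made-up names):

```c
#include <assert.h>
#include <stddef.h>

/* Toy buddy-allocator arithmetic: for a block at `offset` of size
 * `size` (a power of two), its buddy lives at offset XOR size.
 * Freeing checks whether the buddy is also free and, if so, merges
 * the pair into one block of double the size. */
static size_t buddy_of(size_t offset, size_t size) {
    return offset ^ size;
}

/* Buddy systems round every request up to the next power of two. */
static size_t round_up_pow2(size_t n) {
    size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}
```

The XOR works because splitting a power-of-two block always produces two halves whose offsets differ in exactly one bit.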
Another thing that the article doesn't mention that is important is that most production allocators make use of thread-local storage and either have per-thread caches of free blocks or sometimes whole per-thread memory regions. This is to reduce lock contention and provide memory that is more likely to be in the current core's cache.
I had originally written about threading and locality but it made the post too long and complicated, so I cut it out for the final draft. You can see remnants of it if you check the HTML comments in the post :D
Then Linux has various slab allocators (slub/slob/slab), built on top of the "buddy allocator".
User-level code should use mmap-ed regions that are not virtual-address-stable (slabs + offsets). Legacy "libc" services were built as virtual-address-stable services... which are kind of expensive to manage on a modern paged system. Virtual-address-stable regions should be kept to a minimum (that horrible ELF static TLS). There is a workaround, though (but Linux's default overcommit policy could kill your process): a user process queries the amount of RAM on the system and mmaps a region of roughly the same size (care of the overcommit policy), only once per process lifetime. Then you have a virtual-address-stable region that could use most if not all of the available RAM (excluding hot memory addition...). Should be very easy to manage with lists.
I'm inspired by Bartosz Ciechanowski and Julia Evans. The web is such a powerful toolbox. So many concepts are more complex than text alone can hope to explain. Those two are so creative and full of energy.
And you're right, it's incredibly time intensive to put these together. Thousands of lines of code, plus the text content, plus reaching out to domain experts for reviews (shout out Chris Down, kernel wizard extraordinaire).
Communicating complex subjects through interactive visual displays is very effective -- it's worth the effort. Thank you!
As an old-timer in writing code, I think my generation's most-challenging legacies (=== the things we f**ked up) are the non-robust malloc/free concept and null-terminated text strings. Bugs in code using those conventions have been exploitable to a really damaging extent. I learned to program in C from K&R. And getting data-structure code right, and safe to deploy, in that language and its derivatives is HARD.
The C inventors are (were, in Dennis Ritchie's case) brilliant and Bell Labs was great. Their ideas shaped the stuff the global internet runs on. But these parts of what they invented ..... ouch. (All OSs had the same problems.)
I wish the wonderful article presented here carried a more prominent warning about this. Many will read it as they learn to code. The history of our profession can teach about what NOT to do as well as what to do.
He recently published a similar guide on load balancing that is equally intuitive and insightful: https://samwho.dev/load-balancing/
I can't wait to see what he puts out next!
We never missed good things lol!
If I were to make some suggestions, based on how I know they would receive this:
- I would make explicit reference to heap and stack. Students who are learning this material are learning about the heap/stack dichotomy, and I think it would really improve the exposition to make clear that not all memory is allocated this way in a program.
- I would remove this line: "At the end of this post, you should know everything you need to know to write your own allocator." I can confidently say that my students will not be able to write an allocator after reading this. It's nothing to do with your piece, it's just that the intersection of people who don't understand hex and people who could build an allocator after a short blog post is very, very small. Students who read this post and at the end still feel like they can't build an allocator will end up discouraged, with a feeling that maybe they are missing something.
- Consider rethinking how you handle hex numbers throughout. You introduce them and say they are distinguished by a preceding "0x", but then you immediately omit the 0x to save space in the figure. In my experience, students have a lot of trouble with understanding that hex and dec have the same underlying representation. They will not be able to distinguish between hex and dec bases implicitly, so from a pedagogical standpoint, it's better to be consistent throughout and include the prefix.
- Finally, at the end I would mention that there are other strategies for memory management to encourage further exploration. Other languages do it differently, and for students it's important to know which other avenues are out there. Otherwise they might think heap-based malloc/free is the only way to do things, the same way they often think imperative programming is the only way to program.
Anyway, thank you for creating this, and I'm sure it will really help my students. In a just world, "seeing" the memory like this would ideally be first-class tooling for languages like C.
I tried a couple of different ways to introduce the stack and the heap but it always felt like it made the post too long and complicated. In the end I decided to take a pure, idealistic view of memory in order to focus on the algorithms used to pack allocations effectively. You can see some of my abandoned efforts as HTML comments in the post :D
Introducing the 0x prefix and immediately dropping it hurt me as well, but I didn't have a better way to make the visualisation work on mobile. I completely agree with you that it's not ideal.
I'd like to do a post of this style about garbage collection at some point.
Would definitely like to see more thoughts from those cute corgis.
I wrote a JIT compiler and I didn't bother calling free much, I just let the operating system free up all allocated memory.
I got into this situation often:
return_struct = do_something(mystruct);
return_struct->inner_struct = malloc(sizeof(struct my_inner_struct));
Now, who owns inner_struct? Who is responsible for freeing it? Do I free it when I assign to it?

I feel this ownership complicates cross-language FFI API calls, because who is responsible for freeing structures depends on the application and the platform you're running under. For example, Rust code being called from Erlang. You have to be extra careful when memory is managed by a different language runtime.
I feel I am at the beginning of intuitively understanding what memory really is: memory is just a huge contiguous region of numbered locations. Bump allocators allocate in one direction and free all at once. Arena allocators allocate to a preallocated region, I think.
Memory is a logistical problem of how you arrange and allocate finite resources.
I am thinking of alternative visualizations of understanding memory, for example, I started writing an animation of binary search:
https://replit.com/@Chronological/ProgrammingRTS
The idea is that you see values and memory locations move around with the final goal being able to command memory to move around and be computed, such as real time strategy game.
I think if we could visualise memory as cars on a road, we would see obvious traffic jams.
return_struct does since it is the only thing that knows the address.
> Who is responsible for freeing it?
return_struct is, unless you hand that responsibility over to something else.
> Do I free it when I assign to it?
Yes, unless you want leaks.
> I think if we could visualise memory as cars on a road, we would see obvious traffic jams.
That visualisation is helpful for threads, where the program is the road/map and the cars are the threads. I don't see how it's useful for memory.
There might be an analogy here that could help you reason about your nested structure allocations…
Memory is an array of bytes owned by the OS. While there are all kinds of implementation details about addressing and storage and performance and paging and virtual memory, it’s really just an array. The OS gives you a way to reserve pieces of the array for your own use, and you’re responsible for giving them back if you want to play nice and/or run for a long time, otherwise (as a safety net) the OS will take them back as soon as you exit.
This is, in a sense, very similar to the question you posed. An outer routine owns the outer structure, and an inner routine allocates some inner structure. The simplest, most intuitive, and generally best advice is that whoever allocates is also responsible for freeing memory. In other words, one way to define ownership of memory is by who allocates it. Implicitly and automatically, the responsibility to free that memory belongs to the owner that allocated it. It's okay to explicitly transfer ownership, but that can easily get complicated and unintuitive.

You can also consider letting the parent free your struct to be similar to not calling free() in your JIT compiler: it's a 'lazy' optimization to have the parent clean up. I don't mean that in a judgemental sense; I mean it's valid to let the parent handle it, if you know that it will, and this can be done without getting confused about who owns the memory and who was actually responsible for freeing it.

Note that when you leave the parent to clean up, you are foregoing the ability to re-use the memory. This is true in your JIT compiler, and it's true for malloc() and free() as well. If you let the OS handle it, you're in effect declaring that you believe you do not need to recycle the memory allocated in your program during its execution. (This might be true, and it might stay that way, but it's always worth asking whether it will remain true, since lots of people have been eventually bitten and had to retroactively refactor for memory management when their requirements changed.)
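A minimal sketch of that convention applied to the struct from the question upthread (the type names are hypothetical, borrowed from that snippet): the outer struct's create/destroy pair owns the inner allocation, so callers never have to guess who frees what.

```c
#include <stdlib.h>

/* Hypothetical types mirroring the question above. */
struct my_inner_struct { int x; };
struct my_struct { struct my_inner_struct *inner_struct; };

/* The create/destroy pair makes ownership explicit: my_struct owns
 * its inner struct, so destroying the outer frees the inner too. */
struct my_struct *my_struct_create(void) {
    struct my_struct *s = malloc(sizeof *s);
    if (!s) return NULL;
    s->inner_struct = malloc(sizeof *s->inner_struct);
    if (!s->inner_struct) { free(s); return NULL; }
    return s;
}

void my_struct_destroy(struct my_struct *s) {
    if (!s) return;
    free(s->inner_struct); /* the owner frees what it allocated */
    free(s);
}
```

Whoever calls my_struct_create owns the whole thing and pairs it with my_struct_destroy; the inner pointer is never freed on its own.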
Arena allocators are cool, the idea is you allocate a large-ish region of memory and sub-allocate into it (often with a fast, simple allocator like a bump allocator) and then free the large-ish block when you're done. It's a way to take knowing how much memory you need as a whole and optimise that to a single call to malloc/free.
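A minimal arena along those lines might look like this (a sketch of the idea, not any particular library's API):

```c
#include <stdlib.h>
#include <stdint.h>
#include <stddef.h>

/* Minimal arena: one malloc up front, bump-allocate inside it,
 * one free at the end. No per-allocation free is possible. */
typedef struct {
    uint8_t *base;
    size_t   used, cap;
} arena;

arena arena_create(size_t cap) {
    arena a = { malloc(cap), 0, cap };
    return a;
}

void *arena_alloc(arena *a, size_t n) {
    n = (n + 7) & ~(size_t)7;          /* keep allocations 8-byte aligned */
    if (!a->base || a->used + n > a->cap) return NULL;
    void *p = a->base + a->used;
    a->used += n;
    return p;
}

void arena_destroy(arena *a) {
    free(a->base);                     /* frees every sub-allocation at once */
    a->base = NULL;
}
```

Everything allocated from the arena dies together in arena_destroy, which is exactly the "free the large-ish block when you're done" step.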
You may enjoy looking through https://www.cs.usfca.edu/~galles/visualization/Algorithms.ht....
I want an extremely performant deep copy solution, I've been thinking of using an allocator to implement it.
If we have a tree data structure or a nested hashmap, then we want to copy it cheaply, there is copy on write. But most copies of hashmaps are slow because they instantiate every child object in a recursive loop.
So I want to be able to memcpy a complicated data structure for cheap copies.
While it would be nice to have next to no overhead for FFI, it's not always tractable. That's why you have to serialize across boundaries, the same as if you're serializing across processes or the network. At least in a single virtual memory space you can have a caller allocate a buffer and the callee fill it, with the caller being responsible for deserializing and freeing later. That gets you pretty far, and is relatively safe.
The alternative is to be callee managed, and for the callee to return things by handle and not necessarily by pointer, but that is also fraught.
This was on a 386sx with 8M RAM and it was pretty much all the available memory after the OS was loaded and settled down.
A MILLION BYTES!!
Didn't do anything with it, but still, after DOS and EMS/XMS memory and all the other falderol of managing memory. (At the time, it was also the only x86 OS that would format a floppy drive without bringing all the other threads to a crawl. UI was still available-ish, BiModem was still BiModeming...)
> "List allocators used by malloc() and free() are, by necessity, general purpose allocators that aren’t optimized for any particular memory allocation pattern... To understand, let’s examine commonly used list allocation algorithms: first-fit, next-fit, best-fit and quick-fit."
That's from an article on custom memory management (aimed at embedded programming issues) I've found pretty useful, and it picks up where this leaves off:
https://www.embedded-software-engineering.de/dynamic-memory-...
You can practice writing custom memory managers in an application that runs on an operating system by only using the stack (just create a big array of int etc. and call that your memory space):
> "For the safety-critical tasks, the developer can replace the standard allocator with a custom allocator that sets aside a buffer for the exclusive use of that task, and satisfies all memory allocation requests out of that buffer... The remainder of this paper presents four examples of custom allocator algorithms: the block allocator, stack allocator, bitmap allocator and thread-local allocator. Custom memory managers can use any one of these algorithms or a combination of different algorithms."
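The "big array as your memory space" exercise can start as small as this (a hypothetical practice allocator, not code from the paper):

```c
#include <stddef.h>

/* Treat a static array as our entire "memory space" and hand out
 * pieces of it in order -- the simplest possible custom allocator
 * to practice with before adding free lists, bitmaps, etc. */
static unsigned char heap[1024];
static size_t heap_top = 0;

void *my_alloc(size_t n) {
    if (heap_top + n > sizeof heap) return NULL;  /* out of "memory" */
    void *p = &heap[heap_top];
    heap_top += n;
    return p;
}

void my_reset(void) { heap_top = 0; }  /* "free" everything at once */
```

From there you can layer on any of the paper's algorithms (block, stack, bitmap) inside the same array.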
This is not strictly true, it depends on the environment you're using. Some older operating systems and some modern embedded systems have memory mapped at the zero address.
On the other hand, most people expect the usual layout that grows from zero upwards. CPU designers are people, too, and make similar assumptions about the layout and operation of memory (if they have to), which then affect systems that use their processors. Also, with all its portability, C still presumed a certain class of hardware. It was not invented as a language for every microprocessor and microcontroller and each odd memory model; the microcontroller evolution was instead shaped by alignment with coding in C.
This ugly hack (using the same object to hold the address value that can be operated on, and its own validity information) might be the most well known.
> address + <value at address>
> address - <value at address-1>

Shouldn't this be?

address + <value at address> + 3
address - <value at address-1> - 3

> "As a general-purpose memory allocator, though, we can't get away with having no free implementation."
I have a belief that the future of software is short-lived programs that never free memory. Programs allocate and terminate. Short-lived programs communicate with each other via blocking CSP-style channels (see Reppy's Concurrent Programming in ML).
If you could also educate me on why this is a bad idea I would appreciate.
Interesting to use hacker news as the blog's own comment section in this way.
I wrote this:
<div class="memory" bytes="32">
<malloc size="4" addr="0x0"></malloc>
<malloc size="5" addr="0x4"></malloc>
<malloc size="6" addr="0x9"></malloc>
<malloc size="7" addr="0xa"></malloc>
<free addr="0x0"></free>
<free addr="0x4"></free>
<free addr="0x9"></free>
<free addr="0xa"></free>
</div>
Instead of this:

<div class="memory" bytes="32">
<malloc size="4" addr="0x0"></malloc>
<malloc size="5" addr="0x4"></malloc>
<malloc size="6" addr="0x9"></malloc>
<malloc size="7" addr="0xf"></malloc>
<free addr="0x0"></free>
<free addr="0x4"></free>
<free addr="0x9"></free>
<free addr="0xf"></free>
</div>

My high congratulations for creating a most friendly, readable and useful lesson!
How I wish I had something like that when I first learned C.
Admittedly the playground is hard to understand at first, and a touch janky in places. But the workloads are taken from for-real malloc/free traces on programs!
Especially because you can scroll through all the steps.
While aggressively optimizing we replaced malloc in the end with a function we called "cooloc", that traded portability and security for speed. The debug tool here would have been useful.
* https://github.com/DaveJarvis/mandelbrot/blob/master/memory....
I then apply this same principle of "opening" and "closing" structures throughout the application. Generally, I can quickly verify that the calls are balanced:
* https://github.com/DaveJarvis/mandelbrot/blob/master/threads...
What's nice about this pattern is that the underlying implementation of exactly how the memory is maintained for a particular data structure (e.g., whether malloc or gdImageCreateTrueColor is called) becomes an implementation detail:
* https://github.com/DaveJarvis/mandelbrot/blob/master/image.c
The main application opens then closes structures in the reverse order:
* https://github.com/DaveJarvis/mandelbrot/blob/master/main.c
This means there's only one call to malloc and one call to free throughout the entire application (third-party libraries notwithstanding), allowing them to be swapped out with ease.
Aside, logging can follow this same idea by restricting where any text is written to the console to a single location in the code base:
* https://github.com/DaveJarvis/mandelbrot/blob/master/logging...
As a very minor point, calling free(NULL) is well-defined and safe so there is no need for the if-statement in memory_close(). This is very clearly stated in the manual page [1] for instance:
> If ptr is a null pointer, no action shall occur.
Probably showing my age. SunOS 4, PalmOS, and 3BSD reputedly crashed. (There were also double free exploits back in 1996.)
This further illustrates my point, though: Removing the NULL check is a single conditional to remove, as opposed to littering the free + guard everywhere. In effect, by isolating duplicated pieces of logic, it keeps possibilities open.
http://igoro.com/archive/volatile-keyword-in-c-memory-model-...
It's a bit old by now (2010), but I always remembered the mental model Igor created.
My only piece of feedback would be for the "Inline Bookkeeping" section (https://samwho.dev/memory-allocation/#inline-bookkeeping), it took a while for me to grok the numbered list to figure out which block corresponded to address + X. I wonder if there is a better way to visualize the 4 numbered bullet points? Maybe just arrows and text pointing to the visualization?
Thanks again for this wonderful article!
In another universe I'd hook in to the page scroll and highlight each block being referred to as I talk about it in the text. I'm probably not explaining that well here, but imagine something like the way this article works: https://pudding.cool/2017/05/song-repetition/.
It seems to imply that malloc/free works by boundary tag? Which I don't think is the case? (and if not, it leaves the reader confused as to how it then actually works).
I know "some" languages use the tag technique (e.g. julia), but the article seems to imply that this also applies to the c code above? Or am I wrong and c also makes use of such a scheme when you use pointer arithmetic?? (I don't see how that would work with arrays if that's the case though)
The post is intending to talk about the topic of memory allocation in a general sense. The way that malloc works on any given system is implementation dependent, it's not possible to say "malloc works in this way" because Debian can differ from Arch can differ from BSD, etc.
It's not my goal to tell you exactly how modern malloc/free implementations work. I could probably be more clear about this at the start of the post.
I'd love to also see applications of custom memory allocations. I know about usecases in building game engines and the importance of hitting cache there, but I'm not sure where else in the world this would be as useful.
> Couldn't we rearrange the memory to get a block of 6 contiguous bytes? Some sort of defragmentation process?
> Sadly not. Remember earlier we talked about how the return value of malloc is the address of a byte in memory? Moving allocations won't change the pointers we have already returned from malloc. We would change the value those pointers are pointed at, effectively breaking them. This is one of the downsides of the malloc/free API.
But why not? Couldn't we store information about old pointers somewhere and match them with new addresses when defragmenting? Some kind of virtual memory driver that would map old pointers to new addresses transparently for the programs? Or would it be too much overhead for too little benefit?
In OSes without virtual memory, one option is to do the same non-transparently: instead of returning pointers, malloc and friends work with "pointers to pointers" (called handles), so there is one extra level of indirection, and now the OS is free to rearrange this 2nd level as it sees fit. Whenever a program wants to read/write the data behind a handle, it must dereference the handle to get to the real pointer, but it also must let the OS know that it is currently using the real pointer -- this is to avoid the OS moving it around. This is usually called "locking/unlocking" the handle.
Some classic examples are Windows 3.x, Mac OS (toolbox), PalmOS, etc.
https://en.wikipedia.org/wiki/Classic_Mac_OS_memory_manageme...
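A toy version of the handle idea, in the spirit of those systems but not their real APIs (all names here are made up): callers hold an index into a pointer table, so the "OS" can relocate any unlocked block behind their backs.

```c
#include <stdlib.h>

/* Toy handle table: double indirection lets the allocator move
 * blocks around. Locking pins a block while the caller holds the
 * raw pointer, which is why the Win16-era lock call existed. */
#define MAX_HANDLES 16

static void *table[MAX_HANDLES];
static int   locked[MAX_HANDLES];

typedef int handle_t;

handle_t h_alloc(size_t n) {
    for (handle_t h = 0; h < MAX_HANDLES; h++) {
        if (!table[h]) { table[h] = malloc(n); return table[h] ? h : -1; }
    }
    return -1;
}

void *h_lock(handle_t h)   { locked[h] = 1; return table[h]; }
void  h_unlock(handle_t h) { locked[h] = 0; }

/* The allocator may relocate any unlocked block, e.g. to defragment.
 * (realloc stands in for "move the bytes somewhere better".) */
int h_try_move(handle_t h, size_t n) {
    if (locked[h]) return 0;            /* pinned: cannot move */
    void *p = realloc(table[h], n);
    if (!p) return 0;
    table[h] = p;
    return 1;
}
```

The contract is that callers only ever store the handle, never the raw pointer, and re-lock before each use.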
You would need hardware support for this, since the hardware is what decides what gets returned when a program attempts to read from a memory location.
Hardware already does support virtual memory but the granularity is the page (which are a minimum of 4KiB in most OSs).
Once you have it call free() for you, your piece of code is now a compacting GC, like Java's for example.
In languages where memory is managed for you, you can absolutely do this, since the runtime can find every single pointer to an object and rewrite it.
Virtual memory can let you do this, but would require a separate page for each allocation (since virtual memory operates on a page-level). Given that the smallest page on modern architectures is 4k, this would mean using 4k of ram for each allocation (and rounding up each allocation to a 4k page boundary).
On top of that, it's an OS system call to map and unmap pages, which means you incur a system-call on every allocation and deallocation, which is much slower than using a user-space allocator.
Someone else already mentioned, but garbage collected languages do actually do this. Because they’re fully in control of memory (the language exposes no raw pointers), they’re able to create the layer of indirection you’ve suggested and move things around as they please. I know at minimum the JVM does this. The term to search for is “heap compaction.”
There’s also the weird and wonderful work of Emery Berger et al with their “Mesh” malloc implementation, which blows my mind: https://youtu.be/XRAP3lBivYM.
If malloc were to return something like an address that holds the address of the allocated memory, there is nothing preventing the program from reading that address, doing math on it, and storing it somewhere else.
I'm glad you like the playground. If you don't mind me asking, what/where/how do you teach? I was actually hoping to get the attention of educators with this tool to see if it would make sense in, e.g., undergrad CS courses.
I've started writing my next post and have learnt a bit more about PixiJS/GSAP3, and think I know a way to do it that would work nicely but would require changing some of the underlying code. I can't promise I'll revisit this soon, but I would like to in future.
malloc/free: 207294429ns
slab: 74795526ns
A roughly 64% reduction in runtime is pretty nice.
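For context, slab-style allocators get their speed by making alloc/free a pointer push/pop over fixed-size slots. A sketch of that shape (an illustration of the technique, not the code that produced the numbers above):

```c
#include <stdlib.h>

/* Fixed-size free-list ("slab-style") allocator: one malloc carves
 * out N equal slots, then alloc/free are just a push/pop on a list
 * threaded through the free slots themselves -- no searching, no
 * size bookkeeping. slot_size must be >= sizeof(void *). */
typedef struct slot { struct slot *next; } slot;

typedef struct {
    void *mem;
    slot *free_list;
} slab;

slab slab_create(size_t slot_size, size_t nslots) {
    slab s = { malloc(slot_size * nslots), NULL };
    if (!s.mem) return s;
    for (size_t i = 0; i < nslots; i++) {   /* thread all slots onto the list */
        slot *sl = (slot *)((char *)s.mem + i * slot_size);
        sl->next = s.free_list;
        s.free_list = sl;
    }
    return s;
}

void *slab_alloc(slab *s) {
    slot *sl = s->free_list;
    if (sl) s->free_list = sl->next;        /* pop */
    return sl;
}

void slab_free(slab *s, void *p) {
    slot *sl = p;
    sl->next = s->free_list;                /* push */
    s->free_list = sl;
}
```

Both operations are a few instructions with no locks or searches, which is where most of the gap to a general-purpose malloc comes from.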
There's something refreshing about firing up a C (or maybe now Zig, in my case) compiler and allocating a gigabyte of memory, and seeing that your process is using exactly 1GB.
How did you measure that? What was fastest? I'd imagine sbrk is faster than mmap, but I haven't checked. If you're on Linux, then doing a virtual memory allocation will not cause the memory to have physical representation. Did you populate the pages through looping? Or did you use MAP_POPULATE?
I saw that madvise was going to get MADV_POPULATE_(READ|WRITE) in the Linux kernel mailing lists back in 2021, but I haven't seen it on the kernels that I use (5.4, 5.15)
It would make the site much more accessible and clear in the event you didn't realize you had to click forward.
While I also hate the overuse of JavaScript, this is exactly how it's meant to be used – adding small bits of interactivity to documents.
Definitely open to ideas.
Most of the research for this post was done by reading papers for various malloc implementations (phkmalloc, dlmalloc, tcmalloc, mimalloc) and reading their source code.
It instills the idea that you should be asking a question at this point. Some information has been provided that should generate more questions in your head, if you're keeping pace.