After almost six months, I finally found a spot where I could monkey patch a function to wrap it with a short circuit if the coordinates were out of bounds. Not only did it fix the bug, it made drag and drop several times faster. I couldn’t share this with the world because they weren’t accepting PRs against the old widgets.
I’ve worked harder on bug fixes, but I think that’s the longest I’ve worked on one.
Debugging errors in JS crypto and compression implementations was not fun: they only occurred at random, after at least some ten thousand iterations, on a mobile browser back when those were awful, and only if the debugger was closed/detached, because opening it disabled the JIT.
It taught me to go into debugging with no assumptions about what can and cannot be to blame, which has been very useful later in even trickier scenarios.
This is why D, by default, initializes all variables. Note that the optimizer removes dead assignments, so this is runtime cost-free. D's implementation of C, ImportC, also default initializes all locals. Why let that stupid C bug continue?
Another bug that repeatedly bit me was adding a field and neglecting to initialize it in all the constructors.
This is why D guarantees that all fields are initialized.
If native code calls back into Java, and the GC kicks in, all the objects the native code can see can be compacted and moved. So my implementation worked fine for all of the smaller test fixtures, and blew up half the time with the largest. Because I skipped a line to make it “go faster”.
I finally realized I was seeing raw Java objects in the middle of my “array” and changing the value of final fields into illegal pairings which blew everything the fuck up.
Level 2 systems programmer: "oh no, my memory allocator is a garbage collector"
As painful as that debugging story was, I have spent vastly more time working around garbage collectors to ship performant code.
( https://github.com/jemalloc/jemalloc/issues/1317 Unlike what the title says, it's not Windows-specific.)
(*): The application uses libc malloc normally, but in some places it allocates pages using `mmap(non_anonymous_tempfile)` and then uses jemalloc to partition them. jemalloc has a feature called "extent hooks" where you can customize how jemalloc gets underlying pages for its allocations, which we use to give it pages via such mmaps. Then the higher layers of the code that just want to allocate don't have to care whether those allocations came from libc malloc or an mmap-backed disk file.
If there were 20 million rooms in the world, with a price for each day of the year, we’d be looking at around 7 billion prices per year. That’d be, say, 4 TB of storage without indexes.
The problem space seems to have a bunch of options to partition - by locality, by date etc.
I’m curious if there’s a commonly understood match for this problem?
FWIW, with that dataset size, my first experiments would be with a SQL server, because that data will fit in RAM. I don’t know if that’s where I’d end up, but I’m pretty sure it’s where I’d start my performance testing when grappling with this problem.
[1]: https://github.com/microsoft/mimalloc/blob/dev/src/heap.c#L1...
The underlying sys crate provides the binding for mimalloc API like `mi_collect`: https://docs.rs/libmimalloc-sys/0.1.39/libmimalloc_sys/fn.mi...
“C programmers think memory management is too important to be left to the computer. LISP programmers think memory management is too important to be left to the user.”
But far better to just use integer cents.
Every OS will provide some mechanism to get more pages. But it turns out that managing the use of those pages requires specialized handling, depending on the use case, as well as a bunch of boilerplate. Hence, we also have malloc and its many, many cousins to allocate arbitrary size objects.
You're always welcome to use brk(2) or your OS's equivalent if you just want pages. The question is, what are you going to do with each page once you have it? That's where the next level comes in ...
For high performance stuff where you need low, predictable latency, you're probably not going to want to use dynamic memory at all.
The downside is that it makes things like "print" a pain in the ass.
The upside is that you can have multiple memory allocators with hugely different characteristics (arena for per frame resources, bump allocator for network resources, etc.).
Generally, given that page size isn't something you know at compile time (or even at install time), that it can vary between restarts, that it can be anything between ~4 KiB and 1 GiB, and that most natural memory objects are much smaller than 4 KiB while some are potentially much larger than 1 GiB, you kind of don't want to leak anything related to page sizes into your business logic if it can be helped. If you still need to, most languages have memory/allocation pools you can use to get a bit more control over memory allocation/free and reuse.
Also, the performance issues mentioned don't have much to do with memory pages or anything like that _instead they are rooted in the concurrency controls of a global resource (memory)_. I.e., thread-local concurrency synchronization vs. process-wide concurrency synchronization.
Mainly, instead of using a fully general-purpose allocator, they used an allocator which is still general-purpose but has a design bias that improves same-thread (de)allocation performance at the cost of cross-thread (de)allocation performance. And they were doing a ton of cross-thread (de)allocations, leading to noticeable performance degradation.
The thing is, even if you hypothetically only had allocations at sizes that are multiples of a memory page, or used a ton of manual mmap, you would still want to use an allocator and not always return freed memory directly to the OS, as doing so (a syscall on every allocation/free) tends to lead to major performance degradation in many use cases. So you still need concurrency controls, but they come at a cost, especially for cross-thread synchronization. Even lock-free controls based on atomics have a cost over thread-local controls, often caused largely by cache invalidation/synchronization.
The concept of memory that is allocated by a thread and can only be deallocated by that thread is useful and valid, but as TFA demonstrates, it can also cause problems if you're not careful with your overall architecture. If the language you're using even allows you to use this concept, it almost certainly will not protect you from having to get the architecture correct.
Interestingly, it would seem that Java programmers play with garbage collectors while Rust programmers play with memory allocators.
*system