Someone needs to come up with something like a functional language based on a trick like this. Or maybe a meta-language akin to RPython, so people can write domain specific little languages for doing things like serving web requests, combined with domain specific "cheating" GC that can get away with doing much less work than a full general purpose GC.
Couldn't a pure functional programming environment be structured to allow for such GC "cheating?"
Also, anytime a function's scope terminates, all its memory immediately goes away.
This can be done because it's a functional language with immutable data structures.
Additionally, all memory used by a page is cleaned up when the page is finished processing. This has to do with "Memory is scoped" more than immutable data structures.
I don't think this is accurate; those values remain until GC is triggered for the process or the process terminates.
As usual, excellent prog21 article: http://prog21.dadgum.com/16.html (I miss prog21 already…)
The GC is a fallback mechanism to catch any refcounting loops, but the reference counting is still in force, so non-looped objects still do get collected.
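A minimal sketch of that distinction, assuming CPython (the `Payload` class and the `freed_without_gc` helper are made-up names for illustration). With the cycle collector switched off, an object that is not part of a loop is still freed the moment its last reference goes away, while a self-referencing one lingers until someone runs a collection.

```python
import gc
import weakref


class Payload:
    """Hypothetical stand-in for any ordinary Python object."""


def freed_without_gc(make_cycle):
    """Return True if dropping the last name frees the object with GC disabled."""
    was_enabled = gc.isenabled()
    gc.disable()  # only the cycle collector is turned off; refcounting stays on
    try:
        obj = Payload()
        if make_cycle:
            obj.me = obj  # reference loop: the refcount can never reach zero
        probe = weakref.ref(obj)  # a weak reference does not keep obj alive
        del obj
        return probe() is None
    finally:
        if was_enabled:
            gc.enable()
```

Calling `gc.collect()` afterwards reclaims the looped object, which is the "fallback" part.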
When the GC runs, everything gets touched. Without it, the refcounting still hits some shared objects, but this is reduced enough to be worth it, and the heap doesn't grow without bounds.
Even if you never deploy, it's not like Python processes will run forever if you have GC on. Big deployments usually recycle their processes regularly. Why wouldn't you? There's no downside once you have a big enough cluster.
I recall hearing that YouTube also runs with GC off, but I don't have a source for that.
Another way of looking at it is that you are optimizing away a feature that Python needs in order to be a general-purpose language, but that is not needed for your purpose.
I like sounding "bad."
[1]: https://www.erlang-solutions.com/blog/erlang-19-0-garbage-co... [2]: https://hamidreza-s.github.io/erlang%20garbage%20collection%...
Forking threads for web pages is so old school... And Python is a terrible choice for something at their scale.
Just redo the hosting bit in Java or golang and call it a day. If their UI code is sufficiently isolated from the back end it's not a huge deal.
Instagram is a pretty small application feature-wise; a few devs could probably do it in a couple of months.
So what? Sure they would scale better when using Erlang, Java or Go... but sometimes it is wiser to finish building something than making the best ultrascalable system. If you are really successful you will find ways to scale.
Facebook is a good example because of all the money they've thrown at a problem they shouldn't have. First was a custom PHP interpreter, then a compiler, and now Hack. If they didn't have nearly unlimited money to throw at it, would things have ended differently?
Language choice is one of the easiest choices to make. Pick a fast one out of the box if you plan to get big. It's not like the faster languages take orders of magnitude longer to write code in, the effect is minimal at best.
I don't think you understand how uWSGI works.
> And Python is a terrible choice for something at their scale.
Maybe they could use whatever YouTube uses for their frontend instead?
...or gunicorn. Or twisted / hendrix. Or any WSGI container with process consciousness in the past 5+ years.
Java/Golang/Rust/C++/C# are all about 100X faster than Python. Wouldn't it be more reasonable to use one of those than mess with the runtime to squeeze out 20% more performance?
I get that software is complex and people have simple deadlines...
The effect is faster shutdown with less "copy on read" but some __del__ methods and finally: blocks in some generators might never get called.
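A rough sketch of the skipped-cleanup part, under the assumption that the fast shutdown is done with something like `os._exit` (the comment above does not say how it is done). `CHILD_SCRIPT` and `run_child` are hypothetical names. The child leaves a generator suspended inside a `try` block and then either exits normally or exits hard.

```python
import subprocess
import sys

# Script run in a fresh interpreter; sys.argv[1] selects the exit style.
CHILD_SCRIPT = """
import os, sys

def gen():
    try:
        yield 1
    finally:
        print("cleanup ran", flush=True)

g = gen()
next(g)            # generator is now suspended inside the try block
if sys.argv[1] == "hard":
    os._exit(0)    # leave immediately, skipping interpreter finalization
"""


def run_child(mode):
    """Run the script with the given mode and return whatever it printed."""
    done = subprocess.run(
        [sys.executable, "-c", CHILD_SCRIPT, mode],
        capture_output=True, text=True, check=True,
    )
    return done.stdout
```

`run_child("hard")` comes back empty because the `finally:` block never runs. With any other mode the interpreter shuts down normally and gets a chance to finalize the generator.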
That being said, I wonder if their team considered implementing a different language that was meant to work without GC overhead. I'm all for working with something you're familiar with, but this seems like they've hit the point where they know enough of the problem surface area that they should be able to start optimizing for more than just 10% efficiency by turning off a selling point of safer languages.
Er, hold on there. As discussed in the article, Python only uses GC to cope with cases which the refcount can't handle. Which turns out in some situations to be a relatively small amount. The performance problem here didn't actually come from any "fundamental overhead" of GC, just an incidental implementation detail.
Deciding to jump languages just because of a 10% performance gain which pointed them in that direction (which, as I've discussed above, it didn't) would be ill advised.
> by turning off a selling point of safer languages
I don't know if I'd say GC is a selling point of a safer language, it doesn't really have a lot to do with safety in itself and the authors certainly didn't manage to remove any language safety in disabling the GC.
I suppose people at Instagram didn't just stop there, but are also planning a longer-term solution for optimizing their stack (aka migration to a more performant language).
It appears to me this is the fate of anyone that has to make garbage collection scale. [1] I guess it's all worth it; build fast, win big and then struggle with the GC as an exercise in technical debt once you can afford enough staff to focus on the problem. Dreary.
The insight that due to reference counting GC turns reads into writes is interesting; I think I've encountered that before but can't remember where. As the gap between CPU throughput and RAM latency grows this becomes an ever greater point of pain.
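A small illustration of the reads-into-writes effect, assuming CPython, where the reference count is stored in the object's own header. `refcount_after_sharing` is a made-up helper. The code never modifies `obj`, yet merely holding more references to it changes a counter that lives inside the object, and in a forked worker that write is enough to dirty the memory page.

```python
import sys


def refcount_after_sharing(obj, n):
    """Return obj's refcount before and after n extra references are taken."""
    before = sys.getrefcount(obj)
    readers = [obj for _ in range(n)]  # "read-only" users, each holding a reference
    after = sys.getrefcount(obj)
    del readers
    return before, after
```

For an ordinary object the second number is `n` higher than the first. Objects that newer CPython versions treat as immortal (such as `None`) are an exception.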
> aka migration to a more performant language
That happens occasionally with small systems, and it usually works. But big operations with complex code bases don't do that very often. Facebook has re-implemented PHP at least twice now to improve run time performance without changing their language. Just one anecdote in a long list of heroic anecdotes about those who will do any number of backflips and somersaults to avoid replacing their chosen language.
Seems quite risky/costly for a mere 10% computational efficiency gain. If you're going to change the memory model of a programming language, might as well shoot for 10x improvement instead of 10%.
So this is all a workaround for Python's inability to use threads effectively. Instead of one process with lots of threads, they have many processes with shared memory.
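For anyone who hasn't seen the pattern, here is a bare-bones sketch of the prefork model, not how uWSGI actually implements it. It is POSIX-only, and `prefork_sum` is an invented name. The parent builds the data once, then forks workers that read it through copy-on-write pages instead of each loading their own copy.

```python
import os


def prefork_sum(shared, workers):
    """Fork `workers` children that each read `shared`; collect their results."""
    children = []
    for _ in range(workers):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:  # child: sees the parent's data via copy-on-write pages
            os.close(read_fd)
            try:
                os.write(write_fd, str(sum(shared)).encode())
            finally:
                os._exit(0)  # never fall back into the parent's code path
        os.close(write_fd)  # parent keeps only the read end
        children.append((pid, read_fd))

    results = []
    for pid, read_fd in children:
        with os.fdopen(read_fd) as pipe:
            results.append(int(pipe.read()))
        os.waitpid(pid, 0)  # reap the child
    return results
```

The catch discussed in this thread is that even a pure read like `sum(shared)` touches reference counts, so the "shared" pages gradually get copied anyway.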
Look in this doc for "Optional Collection of Cyclical Garbage": https://www.python.org/download/releases/2.0/
With something like Django, there's quite a big startup tax you have to pay, so at the scale of Instagram they would need quite a lot more servers to handle the same load if every request was served lambda style.
The point of Lambda is that you don't startup Django with each request. You can think of the fixed/shared part of Django being replaced by API Gateway. Lambda then replaces the non-shared threads that get started for each request.
However, when the Ruby community moved to Puma, which is based on both processes and threads, it was needed less. Not that this is rocket science (it's still far behind the JVM and .NET), I assume a hybrid process/thread model is something that hadn't reached a critical mass in the Python/Django/Flask/Bottle community?
Ruby has a patch to do the same thing -- increase sharing by moving reference counts out of the object itself:
Index here:
http://www.rubyenterpriseedition.com/faq.html#what_is_this
First post in a long series:
http://izumi.plan99.net/blog/index.php/2007/07/25/making-rub...
I think these patches or something similar may have made it into Ruby 2.0:
http://patshaughnessy.net/2012/3/23/why-you-should-be-excite...
https://medium.com/@rcdexta/whats-the-deal-with-ruby-gc-and-...
The Dalvik VM (now replaced by ART) also did this to run on phones with 64 MiB of memory:
https://www.youtube.com/watch?v=ptjedOZEXPM
I think PHP might do it too. It feels like Python should be doing this as well.
I realized that the workload (which involved a large amount of long-lived static data on the heap) would have seen enormous memory savings, if only we weren't running with Ruby 1.9's mark-and-sweep GC algorithm that marked every object during the mark phase.
I briefly experimented with turning off GC and periodically killing workers. Thankfully, in that situation, all we actually had to do was upgrade to Ruby 2.2, which does have a proper CoW-friendly incremental GC algorithm.
`fork` is awesome.
https://github.com/msgpack/msgpack-python/blob/2481c64cf162d...
> At Instagram, we do the simple thing first. [...] Lesson learned: prove your theory before going for it.
So do they no longer do the simple thing first?
More on topic: this seems like they optimized something in a way that might really constrain them down the road. Now if anyone creates an object that isn't covered by ref-counting they will get OOMs.
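One way to guard against that, sketched here as an assumption rather than anything Instagram describes: run suspect code in a test with `gc.DEBUG_SAVEALL` set, so that a manual `gc.collect()` parks whatever refcounting could not free in `gc.garbage` for inspection. `Node`, `make_cycle`, `no_cycle` and `find_cyclic_garbage` are illustrative names.

```python
import gc


class Node:
    """Hypothetical object that can point at another Node."""

    def __init__(self):
        self.other = None


def make_cycle():
    a, b = Node(), Node()
    a.other, b.other = b, a  # two objects keeping each other alive


def no_cycle():
    Node()  # freed by refcounting as soon as the call returns


def find_cyclic_garbage(fn):
    """Run fn and return the objects that only the cycle collector could free."""
    was_enabled = gc.isenabled()
    old_flags = gc.get_debug()
    gc.collect()  # start from a clean slate
    gc.garbage.clear()
    gc.disable()
    gc.set_debug(gc.DEBUG_SAVEALL)  # keep unreachable objects instead of freeing them
    try:
        fn()
        gc.collect()
        return list(gc.garbage)
    finally:
        gc.set_debug(old_flags)
        gc.garbage.clear()
        if was_enabled:
            gc.enable()
```

Filtering the result by type (for example `isinstance(o, Node)`) shows which classes are leaking through reference loops.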
Maybe I misunderstood how page faults work, but I thought this process was reversed, i.e. each page fault triggers a CoW, not the other way around?