no_circuit4y ago

The article is from a database company, so I'll assume that approximates the scope. My scope for the GC discussion would include other software that could be considered similar: cluster control planes (Kubernetes), other databases, and possibly the first level of API services, such as internal users/profiles or auth endpoints.

The tricky thing is GC works most of the time, but if you are working at scale you really can't predict user behavior, and so all of those GC-tuning parameters that were set six months ago no longer work properly. A good portion of production outages are likely related to cascading failures due to too long GC pauses, and a good portion of developer time is spent testing and tuning GC parameters. It is easier to remove and/or just not allow GC languages at these levels in the first place.

On the other hand IMO GC-languages at the frontend level are OK since you'd just need to scale horizontally.

22 comments · 7 top-level

EdwardDiego4y ago· 5 in thread

> A good portion of production outages are likely related to cascading failures due to too long GC pauses, and a good portion of developer time is spent testing and tuning GC parameters

After 14 years in JVM dev in areas where latency and reliability are business critical, I disagree.

Yes, excessive GC stop the world pauses can cause latency spikes, and excessive GC time is bad, and yes, when a new GC algorithm is released that you think might offer improvements, you test it thoroughly to determine if it's better or worse for your workload.

But a "good portion" of outages and developer time?

Nope. Most outages occur for the same old boring reasons - someone smashed the DB with an update that hits a pathological case and deadlocks processes using the same table, a DC caught fire, someone committed code with a very bad logical bug, someone considered a guru heard that gRPC was cool and used it without adequate code review and didn't understand that gRPC's load balancing defaults to pick first, etc. etc.

The outages caused by GC were very very few.

Outages caused by screw-ups, or by not understanding the subtleties of a piece of tech, were as common as they are in every other field of development.

Then there's the question of what outages GCed languages _don't_ suffer.

I've never had to debug corrupted memory, or how a use after free bug let people exfiltrate data.

zinekeller4y ago

> I've never had to debug corrupted memory

You're lucky! Back when OpenJDK was still Sun's closed-source HotSpot, we chased bugs that Sun confirmed were defects in how HotSpot handled memory (and this was on an ECC'd system, of course), although these days I can't recall anything remotely related.

> or how a use after free bug let people exfiltrate data.

Technically you're just outsourcing it :)

EdwardDiego4y ago

Yeah, have only ever hit one or two JVM bugs in very rare circumstances - which we usually fixed by upgrading.

> Technically you're just outsourcing it :)

Haha, very true. Luckily, to developers who are far better at that stuff than the average bear.

The recent log4j rigmarole is a great example of what I was describing in JVM dev though - no complicated memory issues involved, definitely not GC related, just developers making decisions using technologies that had very subtle footguns they didn't understand (the capacity to load arbitrary code via LDAP was, AFAIK, very poorly known, if not forgotten, until Log4Shell).

Karrot_Kream4y ago

> You're lucky! When OpenJDK was still closed-sourced Hotspot from Sun, we have chased bugs that Sun confirmed was a defect on how Hotspot handle memory (and this is on a ECC'd system of course), although these days I can't remind of anything remotely related.

I mean sure. I remember having similar issues with early (< 2.3) Python builds as well. But in the last decade of my career, only a handful of outages were caused by Java GC issues. Most of them happened for a myriad of other architectural reasons.

zepolen4y ago

> After 14 years in JVM dev in areas where latency and reliability are business critical

What sort of industry/use cases are we talking about here? There is business critical and there is mission critical, and if your experience is in network applications, as your next paragraph seems to imply, then no offence, but you have never worked with critical systems where a nondeterministic GC pause can send billions' worth of metal into the sun or kill people.

EdwardDiego4y ago

Um, how did you derive from this conversation that the "outages" in question were about space missions failing?

Curious, and a tad confused.

1 more reply

coder5434y ago· 5 in thread

Go doesn’t offer a bunch of GC tuning parameters. Really only one parameter, so your concerns about complex GC tuning here seem targeted at some other language like Java.

This is a drawback in some cases, since one size never truly fits all, but it dramatically simplifies things for most applications, and the Go GC has been tuned for many years to work well in most places where Go is commonly used. The developers of Go continue to fix shortcomings that are identified.
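For readers who have not met it, the single knob referred to here is most likely `GOGC`, which can also be changed at run time through `runtime/debug.SetGCPercent`. A minimal sketch, assuming nothing beyond the standard library; the helper name `swapGCPercent` is made up for illustration:

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// swapGCPercent sets the collector's target percentage (the value the GOGC
// environment variable controls) and returns the previous setting.
// The helper name is hypothetical; debug.SetGCPercent is the real API.
func swapGCPercent(percent int) int {
	return debug.SetGCPercent(percent)
}

func main() {
	// A value of 200 lets the heap grow by about 200% over the live data
	// left by the previous cycle: fewer collections, more memory used.
	previous := swapGCPercent(200)
	fmt.Println("previous GC percent:", previous)

	// Restore the old value; the call returns what we had just set.
	fmt.Println("value being replaced:", swapGCPercent(previous))
}
```

The first line printed depends on whether `GOGC` is set in the environment, so no fixed output is claimed for it.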

Go’s GC prioritizes very short STWs and predictable latency, instead of total GC throughput, and Go makes GC throughput more manageable by stack allocating as much as it can to reduce GC pressure.
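The stack-allocation point can be seen in a small experiment. This is only a sketch: the `point` type and the function names are invented for the example, and `//go:noinline` is there to keep the compiler from optimizing the difference away.

```go
package main

import (
	"fmt"
	"testing"
)

type point struct{ x, y int }

// Package-level sinks keep results reachable so the calls are not discarded.
var (
	sinkInt   int
	sinkPoint *point
)

// sumOnStack builds a point that never leaves the function, so escape
// analysis can keep it off the heap.
//
//go:noinline
func sumOnStack(x, y int) int {
	p := point{x, y}
	return p.x + p.y
}

// newOnHeap returns a pointer that outlives the call, so the point must be
// heap allocated and later reclaimed by the GC.
//
//go:noinline
func newOnHeap(x, y int) *point {
	return &point{x, y}
}

// allocsPerCall reports the average number of heap allocations per call of f.
func allocsPerCall(f func()) float64 {
	return testing.AllocsPerRun(1000, f)
}

func main() {
	onStack := allocsPerCall(func() { sinkInt = sumOnStack(1, 2) })
	onHeap := allocsPerCall(func() { sinkPoint = newOnHeap(1, 2) })
	fmt.Printf("sumOnStack: %.0f allocs/call, newOnHeap: %.0f allocs/call\n", onStack, onHeap)
}
```

The first function should report no heap allocations and the second one per call. Building with `go build -gcflags=-m` prints the compiler's own escape decisions if you want to check a real code path.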

Generally speaking, Go is also known for using very little memory compared to Java.

socialdemocrat4y ago

Java _needs_ lots of GC tuning parameters because you have practically no way of tuning the way your memory is used and organized in Java code. In Go you can actually do that. You can decide how data structures are nested, and you can take pointers to the inside of a block of memory. You could make, e.g., a secondary allocator that allocates objects from a contiguous block of memory.

Java doesn't allow those things, and thus it must instead give you lots of levers to pull on to tune the GC.

It is just a different strategy of achieving the same thing:

https://itnext.io/go-does-not-need-a-java-style-gc-ac99b8d26...
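The secondary-allocator idea mentioned in this comment could look roughly like the following. It is a deliberately tiny sketch, not taken from the linked article; `node` and `nodeArena` are hypothetical names, and the arena is fixed-size so that pointers into it stay valid.

```go
package main

import "fmt"

type node struct {
	value int
	next  *node
}

// nodeArena hands out *node values from one contiguous backing slice, so the
// GC sees a single allocation instead of one per node.
type nodeArena struct {
	buf []node
}

func newNodeArena(capacity int) *nodeArena {
	return &nodeArena{buf: make([]node, 0, capacity)}
}

// alloc returns a pointer to the inside of the backing slice, or nil when the
// arena is full. It never grows the slice, so earlier pointers stay valid.
func (a *nodeArena) alloc(value int) *node {
	if len(a.buf) == cap(a.buf) {
		return nil
	}
	a.buf = append(a.buf, node{value: value})
	return &a.buf[len(a.buf)-1]
}

// reset makes every slot reusable. Pointers handed out before the reset must
// not be used afterwards, which is the usual arena trade-off.
func (a *nodeArena) reset() { a.buf = a.buf[:0] }

func main() {
	arena := newNodeArena(3)
	var head *node
	for v := 1; v <= 3; v++ {
		n := arena.alloc(v)
		n.next = head
		head = n
	}
	sum := 0
	for n := head; n != nil; n = n.next {
		sum += n.value
	}
	fmt.Println("sum:", sum, "arena full:", arena.alloc(4) == nil)
}
```

The caller now owns the lifetime rules that the GC would otherwise enforce, so whether this is a win depends on the workload.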

native_samples4y ago

That's the Go party line but not really true.

Counter-example: The Go GC is tuned for HTTP servers at latency sensitive companies like Google. It therefore prioritizes latency over throughput to an astonishing degree, which means it is extremely bad at batch jobs - like compilers.

What language is the Go compiler written in? Go.

This isn't fixable by simply writing the code differently. What you're talking about is in the limit equivalent to not using a GCd language at all, and you can do that with Java too via the Unsafe allocators. But it's not a great idea to do that too much, because then you may as well just bite the bullet and write C++.

Java doesn't actually need lots of GC tuning parameters. Actually most of the time you can ignore them, because the defaults balance latency and throughput for something reasonable for the vast majority of companies that aren't selling ad clicks. But, if you want, you can tell the JVM more about your app to get better results like whether it's latency or throughput sensitive. The parameters are there mostly to help people with unusual or obscure workloads where Go simply gives up and says "if you have this problem, Go is not for you".

1 more reply

no_circuitOP4y ago

Yes, my comments were targeted to Java and Scala. Java has paid the bills for me for many years. I'd use Java for just about anything except for high load infrastructure systems. And if you're in, or want to be in, that situation, then why risk finding out two years later that a GC-enabled app is suboptimal?

I'd guess you'd have no choice if, in order to hire developers, you had to choose a language that people found fun to use.

astrange4y ago

Is Go's GC not copying/generational? I think "stack allocation" doesn't really make sense in a generational GC, as everything sort of gets stack allocated. Of course, compile-time lifetime hints might still be useful somehow.

coder5434y ago

> Is go's GC not copying/generational?

Nope, Go does not use a copying or generational GC. Go uses a concurrent mark and sweep GC.

Even then, generational GCs are not as cheap as stack allocation.

1 more reply

apalmer4y ago· 3 in thread

> A good portion of production outages are likely related to cascading failures due to too long GC pauses, and a good portion of developer time is spent testing and tuning GC parameters.

Can’t really accept that without some kind of quantitative evidence.

no_circuitOP4y ago

No worries. It is not meant to be quantitative. For a few years of my career that has been my experience. For this type of software, if I'm making the decision on what technology to use, it won't be any GC-based language. I'd rather not rely on promises that GC works great, or is very tunable.

One could argue that I could just tune my services from time to time. But I'd just reduce the surface area for problems by not relying upon it at all -- both a technical and a business decision.

throwawaylala14y ago

If you're needing to fight the GC to prevent crashes or whatever then you have a system design issue not a tooling/language/ecosystem issue. There are exceptions to this but they're rare and not worth mentioning in a broad discussion like this.

Sadly very few people take interest in learning how to design systems properly.

Instead they find comfort in tools that allow them to over-engineer the problems away. Like falling into zealotry on things like FP, zero-overhead abstractions, "design patterns", containerization, manual memory management, etc, etc. These are all nice things when properly applied in context but they're not a substitute for making good system design decisions.

Good system design starts with understanding what computers are good at and what they suck at. That's a lot more difficult than it sounds because today's abstractions try to hide what computers suck at.

Example: Computers suck at networking. We have _a lot_ of complex layers to help make it feel somewhat reliable. But as a fundamental concept, it sucks. The day you network two computers together is the day you've opened yourself up to a world of hurt (think race conditions) - so, like, don't do it if you don't absolutely have to.

3 more replies

LosWochosWeek4y ago

> I'd rather not rely on promises that GC works great, or is very tunable.

I'm always puzzled by statements like these. What else do you want to rely on? The best answer I can think of is "The promise that my own code will work better", but even then: I don't trust my own code, my past self has let me down too many times. The promise that code from my colleagues will do better than GC? God forbid.

It's not like not having a GC means that you're reducing the surface area. You're not. What you're doing is taking on the responsibility of the GC and gambling on the fact that you'll do the things it does better.

The only thing that I can think of that manually memory managed languages offer vs GC languages is the fact that you can "fix locally". But then again, you're fixing problems created by yourself or your colleagues.

initplus4y ago· 2 in thread

It's impossible to spend any time tuning Go's GC parameters as they intentionally do not provide any.

Go's GC is optimized for latency; it doesn't see the same kind of 1% peak latency issues you get in languages with a long tail of high-latency pauses.
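A claim like this is easy to check on your own workload, because the runtime records recent pause times. The sketch below uses `runtime/debug.ReadGCStats`; the `churn` workload is a made-up stand-in, so the numbers it prints say nothing about a real service.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
	"time"
)

// sink keeps the most recent buffer reachable so allocations are not elided.
var sink []byte

// churn allocates and drops short-lived buffers to give the collector work.
func churn(rounds int) {
	for i := 0; i < rounds; i++ {
		sink = make([]byte, 1<<20)
	}
}

// pauseSummary returns the number of completed collections and the longest
// pause in the runtime's recent pause history.
func pauseSummary() (int64, time.Duration) {
	var stats debug.GCStats
	debug.ReadGCStats(&stats)
	var longest time.Duration
	for _, p := range stats.Pause {
		if p > longest {
			longest = p
		}
	}
	return stats.NumGC, longest
}

// measure runs the toy workload, forces one final cycle so that at least one
// collection has completed, and returns the pause summary.
func measure(rounds int) (int64, time.Duration) {
	churn(rounds)
	runtime.GC()
	return pauseSummary()
}

func main() {
	cycles, longest := measure(200)
	fmt.Printf("GC cycles: %d, longest recent pause: %v\n", cycles, longest)
}
```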

Also consider API design - Java APIs (both in the standard library and in third-party libs) tend to be on the verbose side and build complex structures out of many nested objects. Most Go applications will have less nesting depth, so it's inherently an easier GC problem.

System designs that rely on allocating a huge amount of memory to a single process exist in a weird space - big enough that perf is really important, but small enough that single-process is still a viable design. Building massive monoliths that allocate hundreds of GB at peak load just doesn't seem "in vogue" anymore.

If you are building a distributed system, keeping any individual process's peak allocation to a reasonable size is almost automatic.

erik_seaberg4y ago

You tune Go’s GC by rewriting your code. It’s like turning a knob but slower and riskier.

coder5434y ago

You tune GC in Go by profiling allocations, CPU, and memory usage. Profiling shows you where the problems are, and Go has some surprisingly nice profiling tools built in.
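As a rough idea of what the built-in tooling looks like, the sketch below captures a heap profile with `runtime/pprof`. The `allocateSome` workload and the helper names are assumptions for the example; a real program would more likely write the bytes to a file and open it with `go tool pprof`.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
)

// sink keeps the sample allocations alive until the profile is taken.
var sink [][]byte

// allocateSome is a hypothetical stand-in for the code being investigated.
func allocateSome(n int) {
	for i := 0; i < n; i++ {
		sink = append(sink, make([]byte, 64<<10))
	}
}

// heapProfile forces a collection so the statistics are current, then
// returns the heap profile in pprof's binary format.
func heapProfile() ([]byte, error) {
	runtime.GC()
	var buf bytes.Buffer
	if err := pprof.WriteHeapProfile(&buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	allocateSome(100)
	data, err := heapProfile()
	if err != nil {
		fmt.Println("profile failed:", err)
		return
	}
	fmt.Println("heap profile bytes:", len(data))
}
```

For long-running servers, importing `net/http/pprof` exposes the same profiles over HTTP, which is often more convenient than writing files.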

Unlike turning a knob, which has wide reaching and unpredictable effects that may cause problems to just move around from one part of your application to another, you can address the actual problems with near-surgical precision in Go. You can even add tests to the code to ensure that you're meeting the expected number of allocations along a certain code path if you need to guarantee against regressions... but the GC is so rarely the problem in Go compared to Java, it's just not something to worry about 99% of the time.
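The allocation-count tests mentioned here are usually built on `testing.AllocsPerRun`. A minimal sketch follows; `appendID` is a hypothetical hot-path helper, and the check is written as a plain function so it runs standalone, whereas in a real `_test.go` file it would call `t.Errorf` when the budget is exceeded.

```go
package main

import (
	"fmt"
	"strconv"
	"testing"
)

// appendID is a hypothetical hot-path helper that formats into a caller-owned
// buffer instead of returning a freshly allocated string.
func appendID(dst []byte, id int64) []byte {
	return strconv.AppendInt(dst, id, 10)
}

// withinAllocBudget measures the average heap allocations per call of f and
// reports whether the result stays at or below budget.
func withinAllocBudget(budget float64, f func()) (float64, bool) {
	avg := testing.AllocsPerRun(1000, f)
	return avg, avg <= budget
}

// appendIDAllocs checks appendID against a budget of zero allocations when
// the buffer is reused between calls.
func appendIDAllocs() (float64, bool) {
	buf := make([]byte, 0, 32)
	return withinAllocBudget(0, func() {
		buf = appendID(buf[:0], 123456789)
	})
}

func main() {
	avg, ok := appendIDAllocs()
	fmt.Printf("appendID: %.0f allocs/call, within budget: %v\n", avg, ok)
}
```

A check like this fails as soon as a later change makes the path allocate, which is the regression guarantee the comment describes.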

If knobs had a "fix the problem" setting, they would already be set to that value. Instead, every value is a trade off, and since you have hundreds of knobs, you're playing an impossible optimization game with hundreds of parameters to try to find the set of parameter values that make your entire application perform the way you want it to. You might as well have a meta-tuner that just randomly turns the knobs to collect data on all the possible combinations of settings... and just hope that your next code change doesn't throw all that hard work out the window. Go gives you the tools to tune different parts of your code to behave in ways that are optimal for them.

It's worth pointing out that languages like Rust and C++ also require you to tune allocations and deallocations... this is not strictly a GC problem. In those languages, like in Go, you have to address the actual problems instead of spinning knobs and hoping the problem goes away.

The one time I actually ran up against Go's GC, I was writing code that tried to push the absolute limits of what could be done on a fleet of rather resource-constrained cloud instances, and I wished I was writing Rust for that particular problem... I definitely wasn't wishing I could be spinning Java's GC knobs. But I was still able to optimize things to work in Go the way I needed them to, even in that case, even if the level of control isn't as granular as Rust would have provided.

throwaway8943454y ago

> The tricky thing is GC works most of the time, but if you are working at scale you really can't predict user behavior, and so all of those GC-tuning parameters that were set six months ago no longer work properly. A good portion of production outages are likely related to cascading failures due to too long GC pauses, and a good portion of developer time is spent testing and tuning GC parameters. It is easier to remove and/or just not allow GC languages at these levels in the first place.

Getting rid of the GC doesn't absolve you of the problem, it just means that rather than tuning GC parameters, you've encoded usage assumptions in thousands of places scattered throughout your code base.

exdsq4y ago

I think I've tinkered with the GC for less than a week in my eight years of experience, including some systems stuff - maybe this is true at FANG scale but not for me!

eudoxus4y ago

As many have replied, the available levers for 'GC-tuning' in Go are almost non-existent. However, what we do have influence on is "GC pressure", which is a very important metric that we can move in the right direction if the application requires it.
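Reading "GC pressure" as the amount of garbage produced per unit of work, one common way to lower it is to reuse short-lived objects instead of reallocating them. The sketch below uses `sync.Pool` for that; `render` and `bufPool` are made-up names, and whether pooling helps at all is something to confirm with a profile first.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool recycles buffers between calls so that each call does not leave a
// fresh bytes.Buffer behind for the collector.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// render builds a greeting using a pooled buffer. The returned string is a
// copy, so handing the buffer back to the pool afterwards is safe.
func render(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a pooled buffer may still hold data from an earlier call
	defer bufPool.Put(buf)

	buf.WriteString("hello, ")
	buf.WriteString(name)
	return buf.String()
}

func main() {
	fmt.Println(render("gopher"))
}
```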
