I have two tips I can share based on my experience optimizing OctoSQL[0].
First, some applications keep a fairly constant live heap size but still do a lot of allocations (like OctoSQL, where each processed record is a new allocation, but the records might be consumed by a very slowly growing group by). In that case the GC threshold (which is based on the live heap size after the last collection) can be low and result in very frequent garbage collection runs, even though your application is using just megabytes of memory. Using debug.SetGCPercent at startup to move that threshold closer to 10x the live heap size can yield enormous performance benefits while sacrificing very little memory.
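As a rough sketch of the first tip: the helper name `raiseGCThreshold` and the value 900 are my own picks for illustration, not what OctoSQL necessarily uses. A GC percent of 900 means the heap may grow by 900% over the live heap before the next collection, so roughly 10x the live heap.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// raiseGCThreshold (hypothetical helper) sets the GC target percentage and
// returns the previous setting. It is equivalent to starting the program
// with the GOGC environment variable set to the same value.
func raiseGCThreshold(percent int) int {
	return debug.SetGCPercent(percent)
}

func main() {
	// Call this once at startup, before the allocation-heavy work begins.
	prev := raiseGCThreshold(900)
	fmt.Printf("GC percent changed from %d to 900\n", prev)
}
```

Setting `GOGC=900` in the environment should have the same effect without a code change, which is handy for a quick experiment.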
Second, even if the CPU profiler tells you the GC is consuming a lot of time, that doesn't mean it's taking that time away from your app if the app is single-threaded, since most of the GC work can run concurrently on other cores. `go tool trace` can give you a much better overview of how computationally intensive and problematic the GC really is, even though reading it takes some getting used to.
I have experienced the same issue here. Our load balancer used CPU usage as a proxy for deciding how much traffic to assign to each instance. When the app was written in Go, we consistently found that the GC was consuming a lot of CPU time even though all other metrics, like request latency, were very good, even in the microseconds range. This was the case even when the app was massively parallel with lots of goroutines. But the load balancer kept sloshing traffic around unnecessarily, based on its observation that the GC was consuming a lot of CPU time.
If you have a lot of small requests, with only a few requests active at the same time but many requests per second overall, and each request makes a few allocations, you will have a small live heap size while quickly reaching the threshold for another GC.
This way you get a lot of GC runs. Latency isn't affected too much because Go is quite good at keeping the stop-the-world pauses short. You might end up with application work and stop-the-world pauses interleaving in a 50/50 ratio of computation time (that's something you can diagnose very easily with `go tool trace`, btw).
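To see the "small live heap, lots of GC runs" effect for yourself, here is a minimal sketch. The names `churn` and `sink` are made up for the example, and the 1 KiB buffer size is arbitrary. It only counts GC cycles through `runtime.MemStats`; it doesn't replace looking at a real trace.

```go
package main

import (
	"fmt"
	"runtime"
)

// sink keeps the compiler from optimizing the allocations away.
// Only the most recent buffer stays reachable, so the live heap stays tiny.
var sink []byte

// churn allocates n short-lived 1 KiB buffers and returns how many
// GC cycles completed while doing so.
func churn(n int) uint32 {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	for i := 0; i < n; i++ {
		sink = make([]byte, 1024)
	}
	runtime.ReadMemStats(&after)
	return after.NumGC - before.NumGC
}

func main() {
	cycles := churn(1_000_000)
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("GC cycles: %d, HeapAlloc: %d KiB, total pause: %d ns\n",
		cycles, m.HeapAlloc/1024, m.PauseTotalNs)
}
```

Running it once with the default settings and once with a much higher `GOGC` should show the cycle count dropping sharply while the heap stays small.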
Having a higher GOGC threshold might help a lot there, since it will make the stop-the-world pauses less frequent while keeping their duration mostly unchanged (as that scales proportionally to the live heap size).
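The arithmetic behind that, as a simplified model: it ignores GC roots such as stacks and globals, which newer Go versions also count towards the target, and `heapTarget` and `cyclesFor` are names invented for this sketch. With a constant live heap, the number of cycles is roughly the total bytes allocated divided by the headroom between the live heap and the target.

```go
package main

import "fmt"

const MiB = 1 << 20

// heapTarget models the heap size at which the next GC cycle starts:
// the live heap plus GOGC percent of it.
func heapTarget(liveBytes, gogc uint64) uint64 {
	return liveBytes + liveBytes*gogc/100
}

// cyclesFor estimates how many GC cycles are needed to allocate allocBytes
// in total while the live heap stays constant at liveBytes.
func cyclesFor(allocBytes, liveBytes, gogc uint64) uint64 {
	headroom := heapTarget(liveBytes, gogc) - liveBytes
	if headroom == 0 {
		return 0 // the model breaks down with no headroom
	}
	return allocBytes / headroom
}

func main() {
	live := uint64(10 * MiB)
	total := uint64(1024 * MiB)
	for _, gogc := range []uint64{100, 900} {
		fmt.Printf("GOGC=%d: target %d MiB, ~%d cycles per GiB allocated\n",
			gogc, heapTarget(live, gogc)/MiB, cyclesFor(total, live, gogc))
	}
}
```

For a 10 MiB live heap this model gives about 102 cycles per GiB allocated at GOGC=100 versus about 11 at GOGC=900, at the cost of the heap peaking around 100 MiB instead of 20 MiB.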
That's obviously just a guess based on the limited info I have though.
Though I can tell that the biggest improvement to my profiling flow was adding a `--profile` flag to OctoSQL itself. This way I can easily create CPU/memory/trace profiles of whole OctoSQL command invocations, which makes experiments and debugging on weird inputs much quicker.
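A minimal sketch of what such a flag could look like. This is not OctoSQL's actual implementation: the accepted values (`cpu`, `memory`, `trace`), the output file naming, and the `startProfile` helper are all assumptions made for the example. It only uses the standard `runtime/pprof` and `runtime/trace` packages.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
	"runtime/trace"
)

// startProfile (hypothetical helper) starts the requested kind of profile,
// writing to path, and returns a function that finishes and closes it.
func startProfile(kind, path string) (func() error, error) {
	f, err := os.Create(path)
	if err != nil {
		return nil, err
	}
	switch kind {
	case "cpu":
		if err := pprof.StartCPUProfile(f); err != nil {
			f.Close()
			return nil, err
		}
		return func() error { pprof.StopCPUProfile(); return f.Close() }, nil
	case "trace":
		if err := trace.Start(f); err != nil {
			f.Close()
			return nil, err
		}
		return func() error { trace.Stop(); return f.Close() }, nil
	case "memory":
		// The heap profile is a snapshot, so it is written when stopping.
		return func() error {
			runtime.GC() // refresh allocation statistics first
			if err := pprof.WriteHeapProfile(f); err != nil {
				f.Close()
				return err
			}
			return f.Close()
		}, nil
	default:
		f.Close()
		os.Remove(path)
		return nil, fmt.Errorf("unknown profile kind %q", kind)
	}
}

func main() {
	// The flag package accepts both -profile and --profile.
	kind := flag.String("profile", "", "profile to record: cpu, memory or trace")
	flag.Parse()
	if *kind != "" {
		stop, err := startProfile(*kind, *kind+".prof")
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		defer stop()
	}
	// ... run the actual command here ...
}
```

The resulting files can then be opened with `go tool pprof` (CPU and memory) or `go tool trace` (trace).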
I haven't tried it yet but it seems like an Arena/bump allocator for example should be possible now.
It would be good for the official runtime to be designed in a pluggable way, so that third parties could experiment with their own implementations of some aspects of the runtime.
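// Requires: import "unsafe". Arena.Bump is assumed to return an unsafe.Pointer.
func Allocate[T any](arena *Arena) *T {
	var zero T // only used to query size and alignment
	p := arena.Bump(unsafe.Sizeof(zero), unsafe.Alignof(zero))
	return (*T)(p)
}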
Any reason why this would not work? Maybe you need to cast to unsafe.Pointer or something before returning, but in theory this _should_ work.
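For what it's worth, here is a self-contained sketch of how that could look with generics and `unsafe`. The `Arena` type, `NewArena`, and `Bump` are hypothetical and written just for this example. One big caveat, as far as I understand it: the backing `[]byte` is not scanned for pointers by the GC, so this is only safe for types that contain no pointers into the Go heap.

```go
package main

import (
	"fmt"
	"unsafe"
)

// Arena is a hypothetical fixed-size bump allocator backed by a byte slice.
type Arena struct {
	buf []byte
	off uintptr // offset of the first free byte
}

func NewArena(size int) *Arena { return &Arena{buf: make([]byte, size)} }

// Bump reserves size bytes aligned to align (a power of two) and returns a
// pointer to them, or nil if the arena has no room left.
func (a *Arena) Bump(size, align uintptr) unsafe.Pointer {
	if len(a.buf) == 0 {
		return nil
	}
	base := unsafe.Pointer(&a.buf[0])
	// Pad so that the absolute address is a multiple of align.
	addr := uintptr(base) + a.off
	pad := (align - addr%align) % align
	start := a.off + pad
	if start+size > uintptr(len(a.buf)) {
		return nil
	}
	a.off = start + size
	return unsafe.Add(base, start)
}

// Allocate returns a pointer to a zeroed T inside the arena, or nil if full.
// T must not contain pointers to GC-managed memory (see the caveat above).
func Allocate[T any](arena *Arena) *T {
	var zero T
	return (*T)(arena.Bump(unsafe.Sizeof(zero), unsafe.Alignof(zero)))
}

type point struct{ X, Y int64 }

func main() {
	arena := NewArena(1024)
	p := Allocate[point](arena)
	p.X, p.Y = 3, 4
	q := Allocate[point](arena)
	fmt.Println(*p, *q, p != q)
}
```

Freeing is just dropping the whole arena (or resetting `off` to zero once nothing points into it anymore), which is the entire appeal of the approach.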
`GOMEMLIMIT` described in the document is a new tuning option.
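For reference, the same limit can also be set from code via `debug.SetMemoryLimit` (Go 1.19 and later). A small sketch, where the 512 MiB value and the `setSoftLimit` wrapper are arbitrary choices for illustration:

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// setSoftLimit (hypothetical wrapper) sets the runtime's soft memory limit
// in bytes and returns the previous limit. A negative value leaves the limit
// unchanged and just returns the current one.
func setSoftLimit(bytes int64) int64 {
	return debug.SetMemoryLimit(bytes)
}

func main() {
	prev := setSoftLimit(512 << 20) // same effect as GOMEMLIMIT=512MiB
	fmt.Printf("previous limit: %d bytes, current limit: %d bytes\n",
		prev, setSoftLimit(-1))
}
```

It is a soft limit: the runtime collects more aggressively as the limit gets close, but it won't refuse allocations if the live heap simply doesn't fit.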
EDIT: found it at the top of the file.