In practice though, for most enterprise web services, a lot of real world performance comes down to how efficiently you are calling external services (including the database). Just converting a loop of queries into bulk ones can help loads (and then tweaking the query to make good use of indexes, doing upserts, removing unneeded data, etc.)
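A minimal sketch of that loop-to-bulk rewrite, assuming a hypothetical `orders` table: instead of issuing one single-row SELECT per id, build one parameterized IN query and make a single round trip. (Only the SQL-building step is shown; table and column names are invented.)

```java
import java.util.List;
import java.util.stream.Collectors;

public class BulkQuery {
    // One round trip instead of ids.size() round trips.
    public static String bulkSelect(List<Long> ids) {
        String placeholders = ids.stream()
                .map(id -> "?")
                .collect(Collectors.joining(", "));
        return "SELECT id, status FROM orders WHERE id IN (" + placeholders + ")";
    }

    public static void main(String[] args) {
        // In real code this string would go into a PreparedStatement,
        // binding each id to one placeholder.
        System.out.println(bulkSelect(List.of(1L, 2L, 3L)));
    }
}
```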
I'm hopeful that improvements in LLMs mean we can ditch ORMs (adopted on the premise that they make it quicker to write queries and the in-between mapping code) and instead make good use of SQL to harness the power that modern databases provide.
Also, before jsonb existed, you'd often run into big blobs of properties you don't care to split up into tables. Now it takes some discipline to avoid shoving things into jsonb that shouldn't be.
https://learn.microsoft.com/en-us/dotnet/csharp/linq/
It solves all of your issues with “ORMs” (it’s really more than just an ORM)
In my demo app, the CPU hotspots were entirely in application code, not I/O wait. And across a fleet, even "smaller" gains in CPU and heap compound into real cost and throughput differences. They're different problems, but your point is valid. Goal here is to get more folks thinking about other aspects of performance especially when the software is running at scale.
Maybe we can ditch active models like those we see in sqlalchemy, but the typed query builders that come with ORMs are going to become more important, not less. Leveraging the compiler to catch bad queries is a huge win.
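To illustrate what "leveraging the compiler" means here, a toy typed query builder (all names invented, nothing like a real library's API): the column's value type is part of its static type, so comparing a column to a value of the wrong type fails at compile time instead of at runtime.

```java
public class TypedQuery {
    // A column carries its value type as a generic parameter.
    public record Column<T>(String name) {
        public String eq(T value) { return name + " = " + value; }
    }

    public static final Column<Integer> AGE = new Column<>("age");
    public static final Column<String> NAME = new Column<>("name");

    public static void main(String[] args) {
        System.out.println("SELECT * FROM users WHERE " + AGE.eq(30));
        // AGE.eq("thirty") would be a compile error, not a runtime bug.
    }
}
```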
My experience with something like the latest Claude Code models these days has been that they are pretty good at SQL. I think some combination of LLM review of SQL code with smoke tests would do the trick here.
Apart from that, my experience over the last 20 years is that a lot of performance is lost to memory allocation (in GCed languages like Java or JavaScript). Removing allocation in hot loops really goes a long way and can lead to 10- or 100-fold runtime improvements.
Parts of the GC language crowd in particular have come to hold some false optimistic beliefs about how well a GC can handle allocations. Also, Java and C# can sneak in silly heap allocations in the wrong places (e.g. autoboxing). So there is a tendency for programs to overload the GC with avoidable work.
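A small sketch of one of those "sneaky" allocations: iterating a `List<Integer>` works on boxed values (and boxing on insert may allocate, outside the small-int cache), while the same loop over an `int[]` touches no heap objects at all.

```java
import java.util.ArrayList;
import java.util.List;

public class Boxing {
    public static long sumBoxed(List<Integer> xs) {
        long sum = 0;
        for (Integer x : xs) sum += x; // unboxes each element; the list holds boxes
        return sum;
    }

    public static long sumPrimitive(int[] xs) {
        long sum = 0;
        for (int x : xs) sum += x; // no boxes anywhere
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> boxed = new ArrayList<>();
        for (int i = 0; i < 1000; i++) boxed.add(i); // autoboxing; may allocate an Integer per add
        int[] prim = java.util.stream.IntStream.range(0, 1000).toArray();
        System.out.println(sumBoxed(boxed) == sumPrimitive(prim));
    }
}
```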
I think Java (or other JVM languages) are then best positioned, because of jooq. Still the best SQL generation library I've used.
Much cleaner, shorter code and type safety with Postgres (my schema tends to be highly normalized too). And these days I’ve got it well integrated with Zod for type safe JS/TS front-ends as well.
This is usually the first thing I look for when someone is complaining about speed. Developers often miss it because they are developing against a database on their local machine which removes any of the network latency that exists in deployed environments.
There's a balance with a DB. Doing 1- or 2-row queries 1000 times is obviously inefficient, but making a 1M-row query can have its own set of problems all the same (even if you need that 1M).
It'll depend on the hardware, but you really want to make sure that anything you do with a DB gives other instances of your application a chance to also interact with it. Nothing worse than finding out the 2-row insert is being blocked by a million-row read for 20 seconds.
There's also a question of when you should and shouldn't join data. It's not always a black and white "just let the DB handle it". Sometimes the better route to go down is to make 2 queries rather than joining, particularly if it's something where the main table pulls in 1000 rows with only 10 unique rows pulled from the subtable. Of course, this all depends on how wide these things are as well.
But 100% agree, ORMs are the worst way to handle all these things. They very rarely do the right thing out of the box and to make them fast you ultimately end up needing to comprehend the SQL they are emitting in the first place and potentially you end up writing custom SQL anyways.
They store up conserved programming time and then spend it all at once when you hit the edge case.
If you never hit the case, it's great. As soon as you do, it's all returned with interest :)
I’d also argue “micro-ORMs” like Diesel (which isn’t really much like ActiveRecord, Hibernate, etc., but more a very thin DSL/interface that maps SQL types to Rust types), combined with LLMs, are the ideal solution (assuming we still want humans to be able to easily understand and trust the code generated). And there’s a big argument to be made for schema migration management being done at the app level (with plain SQL for migrations).
All that said, at work, we use Rails. And ActiveRecord’s “includes/preload/eager_load” methods are fantastic solutions to 99% of cases of querying for things efficiently, and are far more clear than all the SQL you’d have to write to replicate them.
When using JDBC I quickly found myself implementing a poor man's ORM.
Or even the local filesystem :)
CPU calls are cheap, memory is pretty cheap, disk is bad, spinning disk is very bad, network is 'good luck'.
You can O(pretty bad) most of the time as long as you stay within the right category of those.
I recently fixed a treesitter perf issue (for myself) in neovim by just dfsing down the parse tree instead of what most textobject plugins do, which is:
-> walk the entire tree for all subtrees that match this metadata
-> now you have a list of matching subtrees, iterate through said subtree nodes, and see which ones are "close" to your cursor.
But in neovim, when I type "daf", I usually just want to delete the function right under my cursor. So you can just implement the same algorithm by just... dfsing down the parse tree (which has line numbers embedded per nodes) and detecting the matches yourself.
In school, when I did competitive programming and TCS, these gains often came from super clever invariants that you would just sit there for hours, days, weeks, just mulling it over. Then suddenly realize how to do it more cleverly and the entire problem falls away (and a bunch of smart people praise you for being smart :D). This was not one of them - it was just, "go bypass the API and do it faster, but possibly less maintainably".
In industry, it's often trying to manage the tradeoff between readability, maintainability, etc. I'm very much happy to just use some dumb n^2 pattern for n <= 10 in some loop that I don't really care much about, rather than start pulling out some clever state manipulation that could lead to pretty "menial" issues such as:
- accidental mutable variables and duplicating / reusing them later in the code
- when I look back in a week, "What the hell am I doing here?"
- or just tricky logic in general
I only noticed the treesitter textobject issue because I genuinely started working with 1MB autogen C files at work. So... yeah...
I could go and bug the maintainers to expose a "query over text range" API (they only have query and node text range separately, I believe; at least from the minimal research I've done, as I haven't kept up to date with it). But now that ties into considerations far beyond myself: does this expose state in a way that isn't intuitive? Are we adding composable primitives, or just ad hoc features that make the library faster through tighter coupling? etc. etc.
I used to think of all of that as just kind of "bs accidentals" and "why shouldn't we just be able to write the best algorithms possible". As a maintainer of some systems now... nah, the architectural design is sometimes more fun!
I may not have these super clever flashes of insight anymore but I feel like my horizons have broadened (though part of it is because GPT Pro started 1 shotting my favorite competitive programming problems circa late 2025 D: )
After all, even if one has some slow, beastly, unoptimized Spring Boot container that chews through RAM, it's not that expensive (in the grand scheme of things) to just replicate more instances of it.
The String.format() problem is most immediately a bad compiler and bad implementation, IMO. It's not difficult to special-case literal strings as the first argument, do parsing at compile time, and pass in a structured representation. The method could also do runtime caching. Even a very small LRU cache would fix a lot of common cases. At the very least they should let you make a formatter from a specific format string and reuse it, like you can with regexes, to explicitly opt into better performance.
But ultimately the string templates proposal should come back and fix this at the language level. Better syntax and guaranteed compile-time construction of the template. The language should help the developer do the fast thing.
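For the "make a formatter and reuse it" pattern, the stdlib does already have one parse-once formatter: `java.text.MessageFormat` parses its pattern at construction, so a reused instance amortizes the parsing cost that `String.format()` pays on every call. A small sketch (the synchronization is needed because MessageFormat instances are not thread-safe):

```java
import java.text.MessageFormat;

public class ReusedFormatter {
    // Pattern is parsed once, here, not on every call.
    private static final MessageFormat GREETING =
            new MessageFormat("Hello, {0}! You have {1} messages.");

    public static String greet(String name, int count) {
        synchronized (GREETING) { // MessageFormat is mutable and not thread-safe
            return GREETING.format(new Object[]{name, count});
        }
    }

    public static void main(String[] args) {
        System.out.println(greet("Ada", 3));
    }
}
```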
String concatenation is a little trickier. In a JIT'ed language you have a lot of options for making a hierarchy of string implementations that optimize different usage patterns, and still be fast - and what you really want for concatenation is a RopeString, like JS VMs have, that simply references the other strings. The issue is that you don't want virtual calls for hot-path string method calls.
Java chose a single final class so all calls are direct. But they should have been able to have a very small sealed class hierarchy where most methods are final and directly callable, and the virtual methods for accessing storage are devirtualized in optimized methods that only ever see one or two classes through a call site.
To me, that's a small complexity cost to make common string patterns fast, instead of requiring StringBuilder.
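A toy version of that "small sealed hierarchy" idea, using the sealed classes Java already has: a flat string plus a rope node that concatenates in O(1) by reference and only copies characters when the result is actually flattened. This is a sketch of the design, not of String itself.

```java
public class Ropes {
    public sealed interface Str permits Flat, Concat {
        int length();
        void appendTo(StringBuilder sb);
    }

    public record Flat(String s) implements Str {
        public int length() { return s.length(); }
        public void appendTo(StringBuilder sb) { sb.append(s); }
    }

    public record Concat(Str left, Str right, int length) implements Str {
        public Concat(Str left, Str right) {
            this(left, right, left.length() + right.length()); // O(1), no char copying
        }
        public void appendTo(StringBuilder sb) { left.appendTo(sb); right.appendTo(sb); }
    }

    public static String flatten(Str s) {
        StringBuilder sb = new StringBuilder(s.length()); // single exact-size buffer
        s.appendTo(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Str s = new Concat(new Flat("foo"), new Concat(new Flat("bar"), new Flat("baz")));
        System.out.println(flatten(s));
    }
}
```

With only two implementations behind each call site, a JIT can often devirtualize the `length()`/`appendTo()` calls, which is the point the comment is making.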
In CL, there's a general infrastructure called "compiler macros" that is intended as a hint to the compiler to expand calls as macros at compile time. The macro is also allowed to just leave the form unexpanded, in which case it defaults to an unexpanded function call. And the function can be turned into a value itself and passed around, even if the compiler macro exists.
For CL's format, this means an implementation will typically have a compiler macro (or some similar mechanism) that does an expansion if the format is a string constant.
CL also has a function called formatter that takes a format string and returns a function that acts like (lambda (&rest args) (apply #'format <the format string> args)). This function can be implemented as something that expands the format string into code and then compiles that code.
The mechanisms in CL would allow a user to implement the equivalent of a format compiler macro (and formatter) even if the implementation didn't provide them.
They tried; its opponents diluted it to the point of uselessness and will now forever use this failed attempt as a wedge.
I'm sorry, I don't believe Java will get sensible String templates in our life time.
I love how Zig, D and Rust do exactly what you say: parse the format string at compile time, making it super efficient at runtime (no parsing, no regex, just the optimal code to get the string you need).
I say this but I write most of my code in Java/Kotlin :D . I just wish I could write more low-level languages for super efficient code, but for what I do, Java is more than enough.
Also C++, which works the same way.
I was listening to someone say they write fast code in Java by avoiding allocations with a PoolAllocator that would "cache" small objects with poolAllocator.alloc(), poolAllocator.release(). So just manual memory management with extra steps. At that point why not use a better language for the task?
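For concreteness, the pattern being described is roughly this (a minimal sketch; real pools add thread-safety, size caps, etc.): reuse mutable scratch objects instead of allocating in the hot loop.

```java
import java.util.ArrayDeque;

public class Pool {
    public static final class Vec { double x, y, z; }

    private final ArrayDeque<Vec> free = new ArrayDeque<>();

    public Vec alloc() {
        Vec v = free.poll();
        return v != null ? v : new Vec(); // allocate only when the pool is empty
    }

    public void release(Vec v) {
        v.x = v.y = v.z = 0; // scrub before reuse
        free.push(v);
    }

    public int available() { return free.size(); }

    public static void main(String[] args) {
        Pool pool = new Pool();
        Vec v = pool.alloc();
        pool.release(v);
        // The second alloc reuses the same instance: no new allocation.
        System.out.println(pool.alloc() == v);
    }
}
```

Which is exactly the "manual memory management with extra steps" the comment is objecting to: forget a `release()` and you leak; release twice and you alias.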
Well, the whole thing was standard Java OOP, except they also had a bunch of functional programming stuff on top of that. I can relate to that -- I think they were university students when they started, and I definitely had an OOP and FP phase. But then they just... kept it, 10+ years later.
So while it's true that you can write C in any language... those kind of folks don't tend to use Java in the first place ;)
--
(Except Notch? Well, his code looks like C, not sure if it's actually fast! I really enjoyed his 4 kilobyte java games back in the day, I think he published the source for each one too.)
EDIT: Found it!
https://web.archive.org/web/20120317121029/http://www.mojang...
Edit 2: This one has a download, still works!
https://web.archive.org/web/20120301015921/http://www.mojang...
A project might also grow into these requirements. I can easily imagine that something wasn't problematic for a long time but suddenly emerged as an issue over time. At that point you wouldn't want to migrate the whole codebase to a better language anymore.
Even JavaScript is much better for this, much, much better.
But JS has another problem: there's no way to force a number to be unboxed (no primitive vs boxed types), so the array of doubles might very well be an array of pointers to numbers[1].
But with hidden-class optimizations an object might be able to store a float directly in a field, so the array of objects will have one box per (x,y,z), while an array of "numbers" might have one box per number, so 3x as many. My guess, without benchmarking, is that JS is much worse than Java here, because the "optimization" ends up being worse.
[1]: Most JS engines have an optimization for small ints, called SMIs, that use pointer tagging to support either an int or a references, but I don't think they typically do this optimization for floats.
I've spent a fair few years developing lowish (10-20us wire to wire) latency trading systems and the majority of the code does not need to go fast. It's just wasted effort, a debugging headache, and technical debt. So the natural trade off is a bit of pain to make the hot path fast through spans, unsafe code, pre-allocated object pools, etc and in return you get to use a safe and easy programming language everywhere else.
In C# low latency dev is not even that painful, as there are a lot of tools available specifically for this purpose by the runtime.
Doing it to avoid memory pressure generally means you simply have a bad algorithm that needs to be tweaked. It's very rarely the right solution.
The JVM may optimize many short-lived objects better than a pool of objects with less predictable lifetimes.
This is actually the perfect situation: you are allowed to do it carefully and manually for 1% of code on the hot path, but you don't have to worry about it for the 99% of the code that's not.
Such as?
The orders-by-hour code could be made faster. The issue is that it uses a map when an array is faster and works just fine.
On top of that, the map boxes the "hour" which is undesirable.
This is how I'd write it
long[] ordersByHour = new long[24];
var defaultTimezone = ZoneId.systemDefault();
for (Order order : orders) {
    int hour = order.timestamp().atZone(defaultTimezone).getHour();
    ordersByHour[hour]++;
}
If you know the bound of an array, it's not large, and you are directly indexing into it, you really can't do any better performance-wise. It's also not less readable, just less familiar, as Java devs don't tend to use arrays that much.
Practically speaking, that would be pretty unusual. I don't think I've ever seen that sort of construct in my day to day coding (which could realistically have more than 1B elements).
I wish Java had a proper compiler.
https://foojay.io/today/how-is-leyden-improving-java-perform...
I long ago concluded that Java was not a client or systems programming language because of the implementation priorities of the JVM maintainers. Note that I say priorities--they are extremely bright and capable engineers that focus on different use cases, and there isn't much money to be made from a client ecosystem.
There are JITs that use dynamic profile guided optimization which can adjust the emitted binary at runtime to adapt to the real world workload. You do not need to have a profile ahead of time like with ordinary PGO. Java doesn't have this yet (afaik), but .NET does and it's a huge deal for things like large scale web applications.
https://devblogs.microsoft.com/dotnet/bing-on-dotnet-8-the-i...
AOT options like GraalVM Native Image can help cold starts a lot, but then half your favorite frameworks break and you trade one set of hoops for another. Pick which pain you want.
The folks on embedded get to play with PTC and Aicas.
Android, even if not proper Java, has dex2oat.
There are options to turn on which cause the JVM to save off and reload compiled classes. It pretty massively improves performance.
You can get even faster if you do that plus building a custom runtime with jlink. But that's more of a pain; the AOT cache is a lot simpler to use.
If it is valuable, I'd be surprised if you couldn't freeze/resume the state and use it for instantaneous, workload-optimized startup.
I mean, both of your points are a thing, see https://www.azul.com/products/components/falcon-jit-compiler... for LLVM as a JIT compiler
and https://openjdk.org/jeps/483 (and in general, project Leyden)
Too many folks have this mindset that there is only one JVM, when that has never been the case since the 2000s, after Java for various reasons started popping up everywhere.
in practice, for web applications exposing some sort of `WarmupTask` abstraction in your service chassis that devs can implement will get you quite far. just delay serving traffic on new deployments until all tasks complete. that way users will never hit a cold node
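A sketch of what that chassis hook could look like (all names here are invented, not any particular framework's API): run every registered task, and only flip the readiness flag the load balancer's health check reads once they all complete.

```java
import java.util.List;

public class Warmup {
    public interface WarmupTask { void warmUp(); }

    private volatile boolean ready = false;

    public void runAll(List<WarmupTask> tasks) {
        // e.g. exercise hot endpoints a few thousand times, prime caches,
        // force classloading/JIT compilation of the critical paths.
        for (WarmupTask t : tasks) t.warmUp();
        ready = true; // only now let the health check pass
    }

    public boolean isReady() { return ready; }

    public static void main(String[] args) {
        Warmup w = new Warmup();
        w.runAll(List.of((WarmupTask) () -> { /* warm a hot code path */ }));
        System.out.println(w.isReady());
    }
}
```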
Because in my experience as of 2026, Java programs are consistently among the most painful or unpleasant to interact with.
And aside from algorithms, it usually comes down to avoiding memory allocations.
I have my go-to zero-alloc grpc and parquet and json and time libs etc and they make everything fast.
It’s mostly how idiomatic Java uses objects for everything that makes it slow overall.
But eventually after making a JVM app that keeps data in something like data frames etc and feels a long way from J2EE beans you can finally bump up against the limits that only c/c++/rust/etc can get you past.
I’ve heard about HFT people using Java for workloads where micro optimization is needed.
To be frank, I just never understood it. From what I've seen and heard, you have to write the code in such a way that it looks clumsy and is incompatible with pretty much any third-party dependencies out there.
And at that point, why are you even using Java? Surely you could use C, C++, or any variety of popular or unpopular languages that would be more fitting and ergonomic (sorry, but as a language Java just feels inferior even to C#). The biggest selling point of Java is the ecosystem, and you can't even really use that.
But on Java specifically: every Java object still carries roughly 16 bytes of header overhead. How doesn't that thrash your cache?
The advice on avoiding allocations in Java also results in terrible code. For example, in math libraries you'll often see void add(Vector3 a, Vector3 b, Vector3 out) as opposed to the more natural Vector3 add(Vector3 a, Vector3 b). There you go: function composition goes out the window, and the resulting code is garbage to read and write. Not even C is that bad; the compiler will optimize the temporaries away. So you end up with Java that is worse than a low-level imperative language.
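The two styles side by side, as a sketch: the out-parameter version avoids a temporary per call, at the cost of forcing callers to thread scratch objects through their code.

```java
public class Vectors {
    public static final class Vector3 { public double x, y, z; }

    // Allocation-free, but awkward: no function composition.
    public static void add(Vector3 a, Vector3 b, Vector3 out) {
        out.x = a.x + b.x;
        out.y = a.y + b.y;
        out.z = a.z + b.z;
    }

    // Natural and composable, but allocates a Vector3 per call
    // (unless escape analysis happens to elide it).
    public static Vector3 add(Vector3 a, Vector3 b) {
        Vector3 r = new Vector3();
        add(a, b, r);
        return r;
    }

    public static void main(String[] args) {
        Vector3 a = new Vector3(); a.x = 1;
        Vector3 b = new Vector3(); b.x = 2;
        System.out.println(add(a, b).x);
    }
}
```

(Value types from Project Valhalla are the long-promised way out of this trade-off.)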
And, as far as I know, the best GC for Java still incurs no less than 1ms pauses? I think the stock ones are as bad as 10ms. How anyone does low-latency anything in Java then boggles my mind.
The code will not look pretty but it will be very fast.
I am trying to aim for something like Mapstruct for the developer experience, since everybody loves Mapstruct. With Jagger, you bring a parser (JSON/XML/whatever), define some databind methods, and it generates the implementation for you.
So far I've only had my own use cases, so I would love to have some wider feedback.
Spring autowiring makes Java as a whole seem unnecessarily complex. I think it should be highly discouraged (unless it is revamped and made a part of the compiler).
... not sure how this applies to the ObjectMapper, as I haven't programmed in Java in a while. ... and my gripe doesn't apply to Spring Boot, though :)
Constructor autowiring is the application of the inversion of control and dependency injection pattern. If there was no autowiring, you could autowire the components together just the same with normal code calling constructors in the correct order. Spring just finds the components and does the construction for you.
I've seen classes that are meant to be used by multiple threads where literally every method has `synchronized` because "that was the only way they could get it to work". Of course, if every method is synchronized, the class can't actually be used by multiple threads concurrently; it just looks like it can.
Generally speaking I work pretty hard to avoid any kind of locks. Locks can be an anti-pattern in my mind: for a lot of problems, if I am reaching for a lock, it's because I haven't actually thought through the problems well enough. They're a bandaid and they create potential choke-points in the app. I also think that they're a crappy fix to try and shoehorn non-concurrent patterns into a concurrent landscape.
I personally think that making something thread-safe and concurrent while also being maintainable and fast is a hard problem, and I think lazily bolting threads onto non-concurrent applications is a good way to write terrible code that is impossible to debug.
Obviously no accounting for taste, but when I write programs now, I kind of always make them concurrent-first (generally using and/or reinventing the actor model). I try and build my initial algorithm to accept that concurrency is inevitable and start that from the get go. I can't remember the last time I reached for `synchronized`, though every now and then I do have to reach for ReentrantLock, and I always feel dirty doing so.
The point of view is usually also wrong, they focus on the method call flow while they should think about protecting access to shared data.
As I said, I feel like when I reach for a lock, about 95% of the time it’s because I don’t really understand the problem well enough.
1. Avoid abstraction and convoluted control flow as much as possible, and reduce useless object creation
2. Learn how to manage concurrency correctly: focus on the data being accessed by multiple threads and favor sequential access
3. Don't use bloated frameworks (all of them)
4. Consider rewriting common libraries following the principles above, with only the functionality you actually need.
Easy 10x improvement, try it.
The rest of the advice is great: things compilers can't really catch but a good code reviewer should point out.
A second bug is that Character.isDigit() returns true for non-ASCII Unicode digits as well, while Integer.parseInt() only supports ASCII digits.
Another bug is that the code will fail on the input string "-".
Lastly, using value.isBlank() is a pessimization over value.isEmpty() (or just checking value.length(), which is read anyway in the next line), given that the loop would break on the first blank character. It makes the function not be constant-time, along with the first point above that the length of the digit sequence isn’t being limited.
[0]
public int parseOrDefault(String value, int defaultValue) {
    if (value == null || value.isBlank()) return defaultValue;
    for (int i = 0; i < value.length(); i++) {
        char c = value.charAt(i);
        if (i == 0 && c == '-') continue;
        if (!Character.isDigit(c)) return defaultValue;
    }
    return Integer.parseInt(value);
}
This is probably worse than Integer.parseInt alone, since parseInt can still throw NumberFormatException for values that overflow (which is no longer handled!). I would maybe fix that. Unfortunately this is a major flaw in the Java standard library; parsing numbers shouldn't throw expensive exceptions. It doesn't excuse the "use exceptions for control flow" anti-pattern, but it is a quick patch.
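One way to patch the bugs discussed above, as a sketch: accept only ASCII digits (so validation matches what parseInt accepts), reject a lone "-", and keep a catch for the overflow case that validation alone cannot rule out.

```java
public class SafeParse {
    public static int parseOrDefault(String value, int defaultValue) {
        if (value == null || value.isEmpty()) return defaultValue;
        int start = value.charAt(0) == '-' ? 1 : 0;
        if (start == value.length()) return defaultValue; // input was just "-"
        for (int i = start; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c < '0' || c > '9') return defaultValue; // ASCII digits only
        }
        try {
            return Integer.parseInt(value);
        } catch (NumberFormatException e) {
            return defaultValue; // overflow, e.g. "99999999999"
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrDefault("-42", 0));
    }
}
```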
This one is so prevalent that the JVM has an optimization where it gives up on filling in the stack trace for an exception that is thrown over and over from the exact same place (controlled by -XX:-OmitStackTraceInFastThrow).
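Relatedly, if an exception really is being used for control flow, the stack-trace cost can be opted out of explicitly via the protected four-argument Throwable constructor, rather than relying on the JVM's fast-throw heuristic:

```java
public class FastThrow {
    public static final class ParseFailure extends RuntimeException {
        public ParseFailure(String msg) {
            // enableSuppression = false, writableStackTrace = false:
            // no stack walk is performed at construction time.
            super(msg, null, false, false);
        }
    }

    public static void main(String[] args) {
        ParseFailure e = new ParseFailure("bad input");
        System.out.println(e.getStackTrace().length); // nothing was captured
    }
}
```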
The rest were all very familiar. Well, apart from the new stuff. I think most of my code was running in java 6...
Most of this stuff is just central knowledge of the language that you pick up over time. Certainly, AI can also pick this stuff up instantly, but will it always pick the most efficient path when generating code for you?
Probably not, until we get benchmarks into the hot path of our test suite. That is something someone should work on.
Concatenating of thousands of individual Strings was already considered a well-known performance killer back then.
Interesting to see this in the wild in 2026.
> StringBuilder works off a single mutable character buffer. One allocation.
It's one allocation to instantiate the builder and _any_ number of allocations after that (noting that it's optimized to reduce allocations, so it's not allocating on every append() unless they're huge).
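A sketch of keeping it close to one backing allocation anyway: when the final size is computable up front, presize the builder so append() never has to grow and re-copy the buffer.

```java
public class Presize {
    public static String joinLines(String[] lines) {
        int total = 0;
        for (String s : lines) total += s.length() + 1; // +1 for '\n'
        StringBuilder sb = new StringBuilder(total);    // one backing array, no regrowth
        for (String s : lines) sb.append(s).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(joinLines(new String[]{"a", "bb"}));
    }
}
```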
Knock knock.
Who’s there?
(long pause)
Java.
Java is only fast-ish even on its best day. The more typical performance is much worse because the culture around the language usually doesn't consider performance or efficiency to be a priority. Historically it was even a bit hostile to it.
Modern Java runtimes are pretty good, though.
For example, appending to a string in a loop. That only happens because of how Java handles strings. In C++, that's very fast. As fast as it can get, really. Since it all goes into the same buffer that gets mutated and expands at a good growth rate. Basically, equivalent to StringBuilder, but that's just all strings.
Or, boxing. C++ doesn't have to box generics to store them in a container.
Well, everything comes at a price. Just imagine how many billions of very-hard-to-debug bugs Java saved the world from by making Strings immutable, at the expense of some slowdown when they're used without thought; and pretty much every Java book has started with how to avoid that for decades.
Same for Java: in my entire career I have yet to see enterprise Java be performant and not memory-intensive.
At the end of the day, if you care about performance at the app layer, you will use a language better suited to that.
Factories! Factories everywhere!
https://gwern.net/doc/cs/2005-09-30-smith-whyihateframeworks...
Man that code looks awful. Really reminds me of why I drifted away from Java over time. Not just the algorithm, of course; the repetitiveness, the hoops that you have to jump through in order to do pretty "stream processing"... and then it's not even an FP algorithm in the end, either way!
Honestly the only time I can imagine the "process the whole [collection] per iteration" thing coming up is where either you really do need to compare (or at least really are intentionally comparing) each element to each other element, or else this exact problem of building a histogram. And for the latter I honestly haven't seen people fully fall into this trap very often. More commonly people will try to iterate over the possible buckets (here, hour values), sometimes with a first pass to figure out what those might be. That's still extra work, but at least it's O(kn) instead of O(n^2).
You can do this sort of thing in an elegant, "functional" looking way if you sort the data first and then group it by the same key. That first pass is O(n lg n) if you use a classical sort; making a histogram like this in the first place is basically equivalent to radix sort, but it's nice to not have to write that yourself. I just want to show off what it can look like e.g. in Python:
def local_hour(order):
    return datetime.datetime.fromtimestamp(order.timestamp).hour

groups = itertools.groupby(sorted(orders, key=local_hour), key=local_hour)
orders_by_hour = {hour: len(list(orders)) for (hour, orders) in groups}
Anyway, overall I feel like these kinds of things are mostly done by people who don't need to have the problem explained, who have simply been lazy or careless and just need to be made to look in the right place to see the problem. Cf. Dan Luu's anecdotes https://danluu.com/algorithms-interviews/ , and (I can't seem to find it right now) the story about saving a company millions of dollars by finding Java code that was, IIRC, resizing an array one element at a time.

(Another edit: originally I missed that the code was only trying to count the number of orders in each hour, rather than collecting them. I fixed the code above, but the discussion makes less sense for the simplified problem. In Python we can do this with `collections.Counter`, but it wouldn't be unreasonable to tally things up in a pre-allocated `counts_by_hour = [0] * 24` either.)
----
Edit:
> String.format() came in last in every category. It has to... StringBuilder was consistently the fastest. The fix: [code not using StringBuilder]... Use String.format() for the numeric formatting where you need it, and let the compiler optimize the rest. Or just use a StringBuilder if you need full control.
Yeah, this is confused in a way that I find fairly typical of LLM output. The attitude towards `String.format` is just plain inconsistent. And there's no acknowledgment of how multiple `+`s in a line get optimized behind the scenes. And the "fix" still uses `String.format` to format the floating-point value, and there's no investigation of what that does to performance or whether it can be avoided.
any other resources like this?
Same AI slop.
Maven on the other hand, is just plain boring tech that works. There's plenty of documentation on how to use it properly for many different environments/scenarios, it's declarative while enabling plug-ins for bespoke customisations, it has cruft from its legacy but it's quite settled and it just works.
Could Maven be more modern if it was invented now? Yeah, sure, many other package managers were developed since its inception with newer/more polished concepts but it's dependable, well documented, and it just plain works.
It isn't great for really strange and odd builds, but in that case you should probably be breaking your project down into smaller components (each with its own Maven file) anyway.
Gradle does suck and maven is ok but a bit ugly.
* Most mature Java projects have moved to Kotlin.
* The standard build system uses gradle, which is either groovy or kotlin, which gets compiled to java which then compiles java.
* Log4shell, amongst other vulnerabilities.
* Super slow to adopt features like async execution
* Standard repo usage is terrible.
There is no point in using Java anymore. I don't agree that Rust is a replacement, but between Python, Node, and C/C++ extensions to those, you can do everything you need.
It gets a reaction, though, so great for social media.
Programming in Rust is a constant negotiation with the compiler. That isn't necessarily good or bad but I have far more control in Zig, and flexibility in Java.
That said, the article does have the "LLM stank" on it, which is always offputting, but the content itself seems solid.
Oracle was the one who open-sourced the whole of the JDK, and is the main contributor to OpenJDK by far, which is completely open-source with the same license as the Linux kernel. It's fine to criticize them on e.g. Oracle db licenses and stuff like that, but they have been excellent stewards of Java and all the bad language around this java lawyering stuff is just FUD.