Based on a problem I'm facing with Postgres today, I wonder if this really progresses as linearly as the article makes it out to be.
We're in the middle of evaluating Postgres as a replacement for MySQL, and we're seeing a notable slow-down for plain multi-row inserts due to index growth as soon as the table reaches just a couple of dozen million rows. It's an uncomplicated, flat (no constraints, foreign keys etc.) medium-width table of about 10-15 columns and a handful of non-composite btree indices - and/or hash indices; we've tried mixing and matching just to see what happens - but ingestion drops to less than half already before 50m rows. At 100m rows the insertion performance is down to a fraction, and from there it just gets worse the larger the table and its indices grow. It's as if there's some specific cut-off point past which everything goes awry. However, if we simply remove all indices from the table, Postgres will happily insert hundreds of millions of rows at a steady and near-identical pace from start to end. The exact same table and indices on MySQL, matched as closely as we can between the two systems, running on the same OS and hardware, maintain more or less linear insertion performance well beyond 500m rows.
Now, there's a lot to say about the whys and why-nots of keeping tables of this size in an RDBMS and of application designs that rely on it working out, and probably a fair amount more about tuning Postgres' config, but we're stumped as to why PG's indexing performance falters this early when contrasted against InnoDB/MySQL. 50-100m rows really isn't much. Would greatly appreciate it if anyone with insight could shed some light on it and maybe offer a few ideas to test out.
(addendum: during these stress tests the hardware is nowhere close to over-encumbered, and there's consistent headroom in memory, CPU and disk I/O)
Separate your table into Cold (with all indexes and bells and whistles) and Hot (heap table with no indexes except PK).
Insert as many rows as you want into Hot heap, and then move them in the background into cold in batches, so that index recalculation is amortized across many rows, instead of per-row.
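A minimal sketch of the background drain step, assuming hypothetical tables events_hot and events_cold with identical columns:

    BEGIN;
    -- grab a batch out of the unindexed hot heap and land it in the indexed cold table
    WITH moved AS (
        DELETE FROM events_hot
        WHERE id IN (SELECT id FROM events_hot LIMIT 10000)
        RETURNING *
    )
    INSERT INTO events_cold
    SELECT * FROM moved;
    COMMIT;

Run it on a timer, or whenever the hot table crosses a row threshold; each pass pays the index maintenance for 10k rows in one go instead of per row.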
Another poster suggested partitioning; that's the same idea: separate hot and cold data into partitions and keep the hot partition as a plain heap.
Contrast that with ClickHouse, for example. It operates in a different niche than Postgres (OLAP instead of OLTP) – its merge tree engine family [1] does data deduplication in the background.
There is one project of modernizing Postgres' storage engine called OrioleDB [2], but I think the company got acquihired by Supabase [3] and maybe the project has not been progressing very quickly since then.
[1] https://clickhouse.com/docs/engines/table-engines/mergetree-... [2] https://www.orioledb.com/ [3] https://supabase.com/blog/supabase-acquires-oriole
If you committed a row update but didn't update the index, a subsequent query using the not-yet-updated index would not find the updated row correctly.
It would also only work for certain types of indexes; you couldn't do it for a uniqueness constraint, for example.
I do agree that in theory you could have some extension to the index declaration that covers all that, but my worry there would be that it would be non-obvious and a footgun. Doing it the way described above makes that break in semantics clear.
Yes, this is understood. In particular for b-trees, which require some rebalancing (page splits) as they grow. What's less understood is why Postgres hasn't solved this in a way similar to how InnoDB solves it behind the scenes.
> during these stress tests the hardware is nowhere close to over-encumbered, and there's consistent headroom in memory, CPU and disk I/O
This assertion is likely wrong - you're likely skipping over some metrics that hold clues to what we need to know. Here are some questions to get the discussion moving.
- Is this PostgreSQL managed or self-hosted?
Your mention of "consistent headroom on both memory, CPU and disk I/O" gives me hope you're self-hosting it but I've heard the same thing in the past from people attempting to use RDS and wondering the same as you are, so no assumptions.
- Are you using COPY or multi-row INSERT statements?
- How much RAM does that server have?
- What is the fillfactor, max_wal_size and checkpoint_timeout?
- Is the WAL on NVMe?
- What's the iostat or wa during the slowdown?
- Are random UUIDs (part of) the index?
Have you posted to https://dba.stackexchange.com/
If I were you, I would create a GitHub repo that has scripts that synthesize the data and reproduce the issues you're seeing.
We can clearly see with iostat/iotop that the server and its storage isn't overworked.
> Is this PostgreSQL managed or self-hosted?
We're evaluating on our own bare metal. It's an 8c/16t Zen 3 with 32 GB of RAM. Storage (where everything, incl. the WAL, lives) is NVMe, and the drives' true sustainable/synchronous write speed is about 750 MiB/second.
The specs are far, far higher than required for something this basic. Total memory usage when we begin observing the problem is a fraction of what's available. The MySQL production and test environments running this without hindrances actually have only half the amount of RAM.
> Are you using COPY or multi-row INSERT statements?
Multi-row INSERTs, one per transaction, with anything from 100 to 500 rows. Evaluation simulates the volumes we can see through our APIs on production, though it omits everything but Postgres itself in order to test Postgres' ingestion capability without other factors getting in the way - it's just Postgres and a light-weight data generator mimicking production data.
> What is the fillfactor, max_wal_size and checkpoint_timeout?
Not sure about the fill factor. Everything is running on default Postgres 17.9 as packaged for Debian 13, so that would be 1GB max_wal_size, and checkpoint_timeout is commented out with a default of 5 minutes. Haven't gotten to any performance tuning yet; would be thankful for any suggestions to try out. The only thing we've tried is disabling auto-vacuuming to see if it was running too frequently, causing I/O contention or otherwise hogging throughput. Not really any noticeable change.
> Are random UUIDs (part of) the index?
No, neither v4 UUIDs nor b-tree-friendlier v7 UUIDs, but a couple of the indexed columns contain random integral numbers that can become quite sprawly inside a b-tree. We do observe somewhat better results indexing these with HASH indices instead, which also makes a lot more sense for that particular data and how we query it. For the evaluation our starting point was to stick with b-tree indices because that's what's used on the MySQL setup (InnoDB does not support on-disk hash indices).
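For illustration (the real DDL is in the paste linked below; names here are placeholders), the difference is just:

    -- btree version we started with:
    CREATE INDEX events_token_btree ON events (token);
    -- hash version: equality lookups only, no range scans or ORDER BY support:
    CREATE INDEX events_token_hash ON events USING hash (token);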
> What's the iostat or wa during the slowdown?
When we reach the point where ingestion speed has shrunk to about a third we observe iowait peaks having grown to some 15%, which tells us the problem is likely Postgres spending more and more of its time shoveling in the indices rather than storing actual row data. Maximum written data at about 150 MB/second is just a fraction of what the NVMe drives can sustain. None of Postgres' individual processes ever top out anywhere close to 100% of a single core on the machine. Total memory usage is less than 2 GB, and here we suspect we have a lot of tuning to look into. To contrast, the MySQL setup is greedier with both CPU and memory usage, and an educated guess is that the major difference allowing it to keep ingesting hundreds of millions of rows without slowing down is that InnoDB, without us having to jump through any hoops, on one hand defers persisting new index data so that the DML can be finalized as quickly as possible, and on the other hand operates on its indices in a much more efficient way than Postgres.
Everything needed to reproduce the problem is in this paste, which contains a neutral version of the full table and indices, and a simplified version of the data generator: https://paste.debian.net/plainh/ddc819cb
I strongly encourage you to post to https://dba.stackexchange.com/, as a HN thread is the wrong place for this discussion (there's a lot of tuning ahead of you, and others in your situation in the future might never find this nested thread), but be forewarned: the TL;DR from dba.stackexchange will either be a quick "you need at least 64 GB RAM for your PostgreSQL use case" or there will be a better, more thorough discussion of increasing max_wal_size and lowering the fillfactor (which is what you're looking for). The ideal answer should even walk you through BRIN (vs btree) indexes. I'm asking you to post there because it will enumerate the WHY of all of these. At that point you can make an informed decision about whether it would all be worth it.
Now I empathize that MySQL is doing a phenomenal job at only a quarter of that (16 GB you said?), but not for the reasons you might think (and one can certainly argue an end user shouldn't have to care!). MySQL's method of buffering (InnoDB change buffers) and its clustered index give you the performance you like when suddenly doing bursty writes, aka a "write sprint". I need you to be aware of that (and dba.stackexchange responses will certainly address it).
I would have written a lengthy post on what to do next but I must first ask:
- Why are you evaluating PostgreSQL in this case when MySQL seems to work well? For example: does it feel like your aggregations are getting slower? As you can see, with PostgreSQL you will have a different set of tradeoffs (RAM, tuning, VACUUM).
- Are there real, limiting business constraints that force you to operate on less than 64 GB of RAM given your volume and throughput expectations (like FF limitations, or these being smaller machines on the edge, etc.)?
- If you can, as an experiment while you write your dba.stackexchange question (and one for the PostgreSQL mailing list), tweak these parameters and tell me what you see:
- I'm concerned you have a `shared_buffers = 128MB`. Set it to `shared_buffers = 8GB` (give the B-trees room to live in RAM)
- Increase `max_wal_size = 16GB` (stop the checkpoint flooding and let Postgres "breathe" during your batch inserts).
- Increase `checkpoint_timeout = 30min` (or set it to the actual window your load takes; this is temporary, but it should push checkpoints out so they don't interrupt your "write sprint").
- Set `maintenance_work_mem = 2GB` (should speed up index creation and vacuuming).
- Lower `fillfactor` on those specific 8 indexes from the default (90 for btree indexes) down to 70. The B-trees should then be able to absorb those inserts better.
This should get you headed in the right direction (the settings are sketched as SQL below) and googling for the right documentation, but there's even more ahead of you, including a separate discussion about BRIN (vs btree) indexes [1].
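As SQL, roughly (ALTER SYSTEM writes to postgresql.auto.conf; shared_buffers needs a restart, the rest take effect on reload; the index name is a placeholder):

    ALTER SYSTEM SET shared_buffers = '8GB';           -- requires a restart
    ALTER SYSTEM SET max_wal_size = '16GB';
    ALTER SYSTEM SET checkpoint_timeout = '30min';
    ALTER SYSTEM SET maintenance_work_mem = '2GB';
    SELECT pg_reload_conf();

    -- lower fillfactor per index; existing pages only repack on REINDEX
    ALTER INDEX your_index_name SET (fillfactor = 70);
    REINDEX INDEX CONCURRENTLY your_index_name;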
> defers persisting new index data so that the DML can be finalized as quickly as possible
"defers persisting" might get misread as if it doesn't write to disk - it does but you're close and you will uncover more :)
MySQL's defaults are specifically tuned for your "write sprint" use case - in fact, to mask the I/O latency of secondary index updates - but if your real use case is more than just large sudden bursts of writes to a table with a btree index, you will certainly appreciate this tuning effort. Happy weekend!
[1] https://www.postgresql.org/docs/current/brin.html [2] https://www.postgresql.org/docs/current/bloom.html
I asked them for specifics.
The issue I'm facing is a very large table, that is both write and read heavy, and the reads do not fall into a specific range of values for any particular column, so I don't think partitioning is an option.
It is a lot of admin work, but if you really need to scale up Postgres write throughput, I don't see many other options without increasing hardware costs.
I assume you have already picked the low-hanging fruit discussed in the neighboring comments - batch writes, make sure you are using COPY instead of INSERT, tune Postgres parameters adequately and use the fastest disk you can grab for the WAL.
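For the COPY part, a minimal sketch of a batch load (table and columns are made up):

    -- from psql, streaming a pre-generated CSV batch from the client side:
    \copy events (device_id, metric, value, recorded_at) FROM 'batch_0001.csv' WITH (FORMAT csv)

    -- server-side variant, if the file is readable by the postgres process:
    COPY events (device_id, metric, value, recorded_at) FROM '/var/tmp/batch_0001.csv' WITH (FORMAT csv);

Most drivers also expose COPY ... FROM STDIN directly, which avoids the intermediate file entirely.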
[1] - https://andersmurphy.com/2025/12/02/100000-tps-over-a-billio...
That, however, came with its own set of problems. Of course you have to tune for concurrent connections as you scale wider, but there were much more serious contention issues than you'd expect, and the compilation times were terrible too. It turned out to be because those transactions locked the DB during their (synchronous) operations, and we had a globally distributed set of compilers in order to serve globally distributed traffic.
The solution ended up being to run a separate cluster of API servers in the same region as the DB. The expensive calls from the compilers to the API servers were largely async https so they didn't have to wait on the API nodes, and the API nodes could talk to the DB synchronously with low latency.
While I really like ZFS personally for 99% of things, for "need maximum performance with PostgreSQL and we're already pushing high end hardware to the maximum" ZFS is reportedly the wrong choice.
I would try filling up the table, forcing a vacuum, then timing the inserts afterwards. Not that you should need to vacuum in real usage, but it might uncover a problem.
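Something along these lines between load phases (table name is a placeholder):

    VACUUM (ANALYZE, VERBOSE) events;
    \timing on
    -- ...then re-run one representative multi-row INSERT batch and compare
    -- against a batch timed before the VACUUM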
Pretty much, yes. High volume rapid multi-row INSERTs. Nothing special involved in the DDL or how the data goes into the table. Vacuuming the table (whether automatically or between every 10-20m rows) makes no difference. The indices still penalize the performance substantially the larger they grow.
I've run MySQL for years in production and have spent probably 30 minutes thinking about managing it. Unless there's some Postgres feature you need (unlikely), it'll just become a severe pain in your ass down the road because you set something up "wrong."
Just vacuuming can barf completely, leaving you dead in the water.
TL;DR: if you aren't a DBA and don't want to play one on TV, don't bother with Postgres.
The constant problem is that "big scale" always means "larger than I've seen", so on any project larger than a person has encountered, they assume they need to pull out the big guns. Also, people worry about things like what happens if they really *do* scale 10 years from now.
Neither is a practical concern for nearly anyone who will ever face this decision.
And then yes, of course, some people have problems that actually can't be solved by Postgres. But verify this first, don't assume.
And in many cases it's those architectures that force more complexity and make it appear like they have much bigger challenges than they do. Great for resume-driven development, but often you can get away with far less.
Everyone is hung up on making their shit "scalable" like it's a systems design interview at Google in 2010.
Nowadays you can get a box with 600+ cores and 4 TB of RAM. That is going to cover a very, very large percentage of most enterprises.
The issue is there's a lot of lore and esoterica required to get it to scale.
Every time there's an issue with Postgres there's someone that's all "just do xyzzy, it's super obvious that this undocumented setting fixes it."
Just look at the slowdown/big pages problem.
If it works for you, use it.
There is a steep learning curve, sure, and there are a lot of manual steps for some critical actions, perhaps.
Documentation, however, is the last thing I would criticize PostgreSQL for. PostgreSQL's documentation is by far the best among all databases, open or closed. It is exhaustive, well maintained, well written, quite accurate, and has been kept up for decades - the oldest version still covered by the documentation is 6.3, released in 1998.
4-8 cores is no problem at all, though.
One warning--the setting that would increase throughput the most (synchronous_commit = off) sacrifices durability to do so.
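If you do go that route, it can at least be scoped so only the bulk-load transactions take the durability risk (table and columns are hypothetical):

    -- session-wide:
    SET synchronous_commit = off;

    -- or per transaction, leaving everything else fully durable:
    BEGIN;
    SET LOCAL synchronous_commit = off;
    INSERT INTO metrics_raw (recorded_at, value) VALUES (now(), 42);
    COMMIT;

A crash can lose the last few asynchronous commits, but it can't corrupt the database.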
Based on the shown graph, this is misleading at best, essentially false. After 120K writes/s, p50 spikes from 10ms to 1s (1 second for a write!!!!). That's a two-orders-of-magnitude latency spike, and an unacceptable one for an OLTP workload. It clearly shows the server is completely saturated, which is clearly a non-operational regime. Quoting 144K is equivalent to quoting the throughput of a highway at the moment traffic comes to a standstill.
Based on this graph the highest number I'd quote is 120K. And probably you want to keep operating the server within a safe margin below peak, but since this is a benchmark, let's call 120K the peak. Actually p50 isn't even the right cut-off; it should be a higher percentile (say p95) at which latency is within reasonable bounds. But for the sake of not overcomplicating, it can be taken as a reference.
> We found that the bottleneck was in flushing the Postgres write-ahead log (WAL) to disk
Therefore, you are not measuring Postgres peak performance, but rather Postgres performance under the IO constraints of this particular system. Certainly, 120K IOPS is the maximum that this particular instance can have. But it doesn't show if Postgres could do better under a more performant IO disk. Actually, a good test would have been to try the next instance (db.m7i.48xlarge) with 240K IOPS and see if performance doubles (within the same envelope of p50 latency) or not. And afterwards to test on an instance with local NVMe (you won't find this in RDS).
> From [1]:
>
> Postgres insert throughput
>
> uv run python benchmarks/postgres_insert.py --rps 1000 --duration 300
300 seconds test duration?? This is not operational. You are not accounting for checkpoints, background writer, and especially autovacuum. Given that workflow pattern includes UPDATEs, you must validate bloat generation (or, equivalently, bloat removal) by a) observing much longer periods of time (e.g. 1h) and b) making sure the autovacuum configuration (and/or individual table vacuum configuration if required) makes bloat contained in a stable way. Otherwise, shown performance numbers will degrade over time, making them not realistic.
Only after proper autovacuum tuning, and under the effects of the bgwriter (also tuned!) and checkpoints (all required, especially for write-intensive workloads), over much larger periods of time, could the benchmark be considered meaningful.
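For example, keeping bloat contained on a hot, UPDATE-heavy table usually means per-table settings along these lines (table name and values are illustrative):

    ALTER TABLE workflow_status SET (
        autovacuum_vacuum_scale_factor = 0.01,  -- trigger vacuum at ~1% dead tuples instead of 20%
        autovacuum_vacuum_cost_delay   = 0,     -- let autovacuum run at full speed
        fillfactor                     = 80     -- leave room on each page for HOT updates
    );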
> We next measure the scalability of Postgres-backed queues.
I'd recommend benchmarking the recently announced PgQue project, which operates bloat-free (bloat being one of the largest operational hurdles in queue-like Postgres workloads). See [2] for a previous discussion on HN.
[1]: https://github.com/dbos-inc/dbos-postgres-benchmark [2]: https://news.ycombinator.com/item?id=47817349
> Based on this graph the highest number I'd quote is 120K. And probably you want to keep operating the server within a safe margin below peak, but since this is a benchmark, let's call 120K the peak. Actually p50 isn't even the right cut-off; it should be a higher percentile (say p95) at which latency is within reasonable bounds. But for the sake of not overcomplicating, it can be taken as a reference.
You definitely don't want to run a production system at saturation! But it's worthwhile to measure a complex system like Postgres at saturation, see when it gets there and how it behaves there, and then run at a slightly lower throughput.
> Therefore, you are not measuring Postgres peak performance, but rather Postgres performance under the IO constraints of this particular system. Certainly, 120K IOPS is the maximum that this particular instance can have. But it doesn't show if Postgres could do better under a more performant IO disk. Actually, a good test would have been to try the next instance (db.m7i.48xlarge) with 240K IOPS and see if performance doubles (within the same envelope of p50 latency) or not. And afterwards to test on an instance with local NVMe (you won't find this in RDS).
I've done some testing (not in the blog post)--doubling instance size/IOPS doesn't improve performance significantly because it doesn't affect the WAL bottleneck. Local NVMe should have a significant impact in theory, but I haven't tested this myself.
> 300 seconds test duration?? This is not operational. You are not accounting for checkpoints, background writer, and especially autovacuum. Given that workflow pattern includes UPDATEs, you must validate bloat generation (or, equivalently, bloat removal) by a) observing much longer periods of time (e.g. 1h) and b) making sure the autovacuum configuration (and/or individual table vacuum configuration if required) makes bloat contained in a stable way. Otherwise, shown performance numbers will degrade over time, making them not realistic.
Those are usage examples (notice the 1000 rps)--the actual benchmarks were run, and were stable, at much longer durations.
I disagree. A number at saturation is worthless, because "a slightly lower throughput" is at best unqualified hand-waving. Real numbers can be quite far from that saturation point.
Quote real production numbers instead. You can define them clearly; it's not that hard. E.g.: the throughput at which p95 latency stays below 10ms. That's it. Measure and report that number.
> I've done some testing (not in the blog post)--doubling instance size/IOPS doesn't improve performance significantly because it doesn't affect the WAL bottleneck. Local NVMe should have a significant impact in theory, but I haven't tested this myself.
But those would be interesting numbers to share! "Doesn't improve performance significantly" -- sorry, I'm not a big fan of unqualified data points. Is it 10%, 20%, 50%? And definitely, when measured at saturation, surely you don't see improvements. But if measured at an operational regime, you should probably see notable improvements (unless other scaling factors start to dominate, in which case your benchmark becomes much more meaningful, because then you are finding Postgres's scaling limits and not just the limits of the disk on which it's running). That changes the picture dramatically.
> Those are usage examples (notice the 1000 rps)--the actual benchmarks were run, and were stable, at much longer durations.
Sorry, but if you use that as an example, it gives me little confidence about the real intent. But glad to hear you ran at longer durations -- add that information to the post! Again though, that's not enough. Show the bloat and demonstrate how stable it is, given the tuning required to keep it contained, of course. Also show the tps over time -- I'm sure it drops notably in the presence of checkpoints -- and then the "under 10ms latency at p95" will become dominated by write performance during checkpoints.
Because when you determine your SLOs, it's not at the happy path, but the opposite. And saying "Postgres can do 144K writes/sec on this machine" is beyond the happy path, so it's not meaningful for me.
But how about horizontally? It would be nice to have high availability, or even to be able to upgrade the OS and postgres itself without downtime.
At least I hope you are! Nothing else has been as well battle-tested. Unfortunately, perhaps because of its name, it gets no face time on HN. Its last few mentions here barely received the attention they deserved.
Theoretically with 2 to 10x the compute required, and in practice 100 to 500x.
Horizontal scaling for reads, on the other hand, is much easier. If you have multiple replicas for high availability, you might as well put them to work. It can also reduce the risk of read-heavy tasks interfering with transaction processing. You can even go a step further and replicate to a database that's optimized for analytical tasks.
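For the last point, if the analytical target is another Postgres, logical replication of just the relevant tables is one way to do it (names are made up):

    -- on the primary:
    CREATE PUBLICATION analytics_pub FOR TABLE orders, order_items;

    -- on the analytics instance (matching tables must already exist there):
    CREATE SUBSCRIPTION analytics_sub
        CONNECTION 'host=primary.example.com dbname=app user=replicator'
        PUBLICATION analytics_pub;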
Horizontal scaling for stateless applications (e.g. web servers or job processors) is often easier and more robust than vertical scaling, with little to no downsides.
If you're doing it for other reasons it's usually a mistake.
If you're running it in kubernetes with cloudnativepg it's even easier.
The only thing it doesn't do well is master-master replication, which is why most of these "does it scale" posts mostly talk about how slow writes are. And they are pretty slow.
I don't understand why DBOS is not nearly as popular as Temporal, but it has made a world of difference building Durable Queues and Long-Running, Durable Workflows in Python (it supports other languages too).
As they show in this article, Postgres scales impressively well (4 billion workflows per day on a db.m7i.24xlarge, enough for most applications), which is why, if you have your PostgreSQL backup/restore strategy knocked out and dialed in, you should really take a close look at DBOS to handle your cloud-agnostic or self-hosted Durable Queues and Durable Workflows. It's an amazing piece of software founded by the original author of Ingres (the precursor to Postgres). The story of DBOS itself is captivating - I believe it started from being unable to scale Spark job scheduling.
I give them about a year or two before the wheels fall off, then it's off to Broadcom and friends.
But I could be wrong, as now they're not in the 'durable execution' space at all - it's 'durable execution for AI' according to their latest conference.
Got to spend that VC dosh somewhere I suppose, they're certainly not spending it on making a good product.
It's true that we recently had a Series D and that VC firms recognize the value of what we do. The Temporal Server software is 100% open source (MIT license: https://github.com/temporalio/temporal/blob/main/LICENSE). It's totally free and you don't even need to fill out a registration form; just download precompiled binaries from GitHub or clone the repo and build it yourself. You can self-host it anywhere you like, with no restrictions on scale or commercial usage. We offer SaaS (Temporal Cloud), which customers can choose as an alternative to self-hosting, based on their needs. The migration path is bi-directional, so not a trap by any definition.
Regarding AI, Temporal is widely used in that space, but that does not negate the thousands of other companies that use Temporal for other things (e.g., order management systems, customer onboarding, loan origination, money movement, cloud infrastructure management, and so on). In fact, our growth in the AI market came about because companies who were already using Temporal for other use cases realized that it also solved the problems they encountered in their AI projects.
And to your last point, we've made dozens of enhancements to the product (here's a small sample: https://temporal.io/blog/categories/product-news). I'd encourage you to follow the news from next week's Replay conference (https://replay.temporal.io/) because we'll be announcing many more.
https://docs.dbos.dev/production/workflow-recovery#recovery-...
>When self-hosting in a distributed setting without Conductor, it is important to manage workflow recovery so that when an executor crashes, restarts, or is shut down, its workflows are recovered. You should assign each executor running a DBOS application an executor ID through DBOS configuration. Each workflow is tagged with the ID of the executor that started it. When an application with an executor ID restarts, it only recovers pending workflows assigned to that executor ID.
https://docs.dbos.dev/production/hosting-conductor
> Self-hosted Conductor is released under a proprietary license. Self-hosting Conductor for commercial or production use requires a paid license key.
DBOS Conductor is an out-of-band management service that, IIRC, mainly helps you observe your DBOS applications and recover from failures in a seamless way. As far as I could see, it's not necessary in order to use DBOS workflows and queues. Don't quote me though - reach out to their forum to verify, in case I'm missing certain use cases.
Personally, I do not use DBOS Conductor - I have my own observability setup using Grafana/VictoriaMetrics, as my workflows are instrumented with OTel. I had initially set Conductor up for development (it looked to be free for development, although I recall some major limitations on how many workflows etc.), which is why I put together my own alternate monitoring setup.
They also have a very reasonably priced cloud hosted DBOS Conductor. I think my first 30 days were completely free and then they moved me to a "hobby" tier. It's a fantastic way to help decide whether it's for you.
I believe DBOS Conductor is how DBOS pays the bills, but you can use DBOS workflows and queues without limit and without DBOS Conductor. If you don't want to pay for Conductor - their out-of-band management service - you can put together your own just fine, like I did. My own Grafana/VictoriaMetrics setup answers my questions, but I would imagine Claude/Codex/Cursor could put together something fairly useful if you didn't want to go down my route.
> executor elasticity appears to be locked behind license purchase
DBOS has designed their system to be extremely flexible and extensible. While yes, Conductor can absolutely manage your executors for you, it's not the only thing that can. You're not limited to using Conductor. As I said, I manage my own - everything you need to know to do so is in the code and documentation. They even have a document for LLMs and agents. I have had to interact with the DBOS team 0 times to set everything up.
I prefer this business model (an optional tool - Conductor - is paid) vs. DBOS offering everything across the stack on a "free tier" but with caps on DBOS workflows and queues. In their current business model, DBOS workflows and queues are completely uncapped (at least from what I can make out).
If you do reach out to them, I would appreciate if you let me know anything to the contrary.
We have a simple worker setup, and Temporal is pretty easy to set up.
Our only issue is really needing an intermediary data store for task result storage.
We are using DBOS in new projects as it's even simpler, and the downside (the task log interface being behind SaaS) is easily remedied with a Copilot-generated task viewer.