Forking an industrial-grade tool means the entire lifespan of the product becomes your responsibility to your client. Tracking the major upgrade changes might be a pain in the arse, but that's nothing compared to tracking every security and data-loss fix that flows through the Postgres community.
It's not just developer time that's the cost here. They'd have had to compile the whole Postgres+Citus database for every platform they support, test it, and distribute packages, all in a timely manner. Think of all the CPU cycles and bandwidth they're saving by only having to compile an extension against public headers.
Functioning as an extension means Postgres and its distributors (e.g. Ubuntu) are the ones responsible for keeping Postgres alive and secure. Citus only have to support their own thing.
Why aren't they talking about how much this move is saving them day-to-day? There's no shame in being efficient.
But what you're saying (which wasn't immediately obvious, and correct me if I'm wrong) is that your users are using your database product, not Postgres, so you can hold them back as long as you like while they're on a forked product. They won't be carried away by an automatic update, and it's much harder for them to jump ship.
And while there is some truth to that, it comes with a karmic cost. People picked you because you were based on their favourite, industry-tested database. If you slip behind in features, or (more importantly) can't backport security fixes instantly, you're dead.
The stuff Citus has been landing in PostgreSQL is fantastic.
They call out a number of forks that do quite nicely. I work for Pivotal, which sponsors Greenplum. Companies pay handsomely for the capabilities it brings to the table.
But they are right that rebasing is a nightmare. My understanding (possibly wrong) is that the broad selection of APIs that makes an extension-only approach viable did not appear in PostgreSQL until more recent versions -- anyone who forked earlier (such as Greenplum) has to first catch up and then migrate.
I do know that the Greenplum team have decided to keep rebasing until they are working against mainline. It is, as you might imagine, a slow process: rebasing millions of lines of code a release at a time is not the easiest task on earth. But maintaining a fork will, in the long run, be harder.
Even for smaller changes, it is well known that you're better off contributing back to the free-software project, because having your patch included and maintained there is a lot less trouble than maintaining your fork and updating your patches every few months.
One exception might be one-off changes, but we all know that nothing is more permanent than the temporary.
The only real exception is when your patch is superseded by another patch (or a better solution). Then you maintain your private patch only until the next version is finalized.
In other words: does Postgres 10 offer the same clustering features as Citus does?
Citus is an extension that takes several database nodes and makes them appear as a single logical database server (at the table level, by automatically sharding tables on a chosen column).
https://wiki.postgresql.org/wiki/Replication,_Clustering,_an...
And Citus is the first link in that list.
Data for different customers can be stored on separate nodes, so your DB is not limited by the capacity of one node.
In regular PG, all data needs to fit on a single node.
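To make that concrete, here's a minimal Python sketch of the idea behind column-based sharding: each row is routed to a node by hashing its distribution column, so one customer's rows colocate on one shard. (Citus itself does this in C inside Postgres; the node names, helper functions, and hash choice here are illustrative, not Citus's actual implementation.)

```python
import zlib

# Hypothetical set of worker nodes in the cluster.
NODES = ["node-0", "node-1", "node-2", "node-3"]

def node_for(customer_id: str, nodes=NODES) -> str:
    """Pick a node by hashing the distribution column value.
    crc32 is stable across runs, unlike Python's builtin hash()."""
    return nodes[zlib.crc32(customer_id.encode()) % len(nodes)]

def insert(row: dict, shards: dict) -> None:
    """Route a row to its shard based on the distribution column."""
    shards.setdefault(node_for(row["customer_id"]), []).append(row)

shards = {}
insert({"customer_id": "acme", "order": 1}, shards)
insert({"customer_id": "acme", "order": 2}, shards)
insert({"customer_id": "globex", "order": 1}, shards)
# All of acme's rows land on the same node, so per-customer queries
# touch a single shard; different customers may live on different nodes.
```

The payoff is the one described above: the dataset as a whole can exceed any single node's capacity, while queries filtered on the distribution column still hit only one node.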
Is there any interest in RDS as a service? So basically setting up and running a completely fault-aware Postgres cluster on any infrastructure, public or private?
Thanks for your feedback :)
Since it uses BOSH, you can deploy to a wide range of targets. OpenStack, vSphere, AWS, Azure, GCP and I forget what else.
Disclosure: I work for Pivotal.
Basically as if I hired a contractor to install, monitor and upgrade, but automated. Existing services charge too much since they resell VMs and storage, while also being less flexible with access and performance.
There's also the rise of Kubernetes (with operators, Helm charts and persistent storage) that takes away much of the complexity. By version 2.0, it should make it easy to turn any legacy single-node system into a fault-tolerant service.
Ideally, I would just need an SSH key on your machines and the ability to open an SSH tunnel through the firewall to scrape metrics.
Ideally, the metrics would also be exposed back to the customer.
I am not a big fan of containers when working with data that is irreplaceable, but the use of k8s may really help.
If I could get something like that on Digital Ocean I’d be all over it.
They are closed-source and require enterprise licensing based on a RAM quota, so automated cloud provisioning isn't simple. They do have their own MemSQL cloud offering, so you might inquire into that. Also, MemSQL Ops is probably the easiest and most reliable operations software for any database; it takes just a few clicks to install and upgrade your cluster.
By the way, I don't have any experience with MemSQL.