This feature truly is a game-changer for adoption IMO.
Congrats on shipping to all involved!
Are there any plans to support recursive CTEs? What are the technical challenges there?
It's worth noting that MySQL has no problem with this kind of situation. It never cares about the existence of orphaned rows; it only cares about not letting you create them in the first place, and about cleaning them up. But it doesn't blow up if orphaned rows do exist. They just become ghosts.
"Changing a table's schema is one of the most challenging problems in relational databases, and in MySQL in particular. In today's accelerated and rapid development cycles, engineers find that they need to make schema changes sometimes on a daily basis."
Okay, but is this really true? Is it really necessary to have no downtime except in a very few cases? It's honestly rare to have systems that absolutely must stay up all the time, and expecting them to allow live fiddling with the schema on top of that just seems excessive.
I'd also say that, as a software dev, I'd be very uncomfortable with a system like yours that claims to maintain consistency while running and changing the schema, short of a formal proof.
We aren't talking about zero downtime here, but continual, recurring downtime due to schema changes. Once you have beyond a few million rows in a normal RDBMS, schema changes can take minutes to hours depending on the type. Do this a few times per month and you now have 'lots' of downtime and you are blocking other engineering work from happening. It eventually becomes so much of a hassle that engineers don't want to do schema changes, blocking feature work. The more seamless and painless you can make them, the better.
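For what it's worth, this is why the usual workaround is an "online schema change": copy into a shadow table in small batches, then swap, so no single statement holds a lock for minutes. Here's a minimal sketch of the pattern using Python's `sqlite3` as a stand-in (table and column names are invented; real tools like gh-ost or pt-online-schema-change additionally replay concurrent writes via triggers or the binlog):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, f"user{i}") for i in range(1, 1001)])
conn.commit()

# 1. Create a shadow table with the desired new schema (added column).
conn.execute("CREATE TABLE users_new "
             "(id INTEGER PRIMARY KEY, name TEXT, email TEXT)")

# 2. Backfill in small keyset-paginated batches, so no single
#    statement holds locks for minutes at a time.
last_id, BATCH = 0, 100
while True:
    rows = conn.execute(
        "SELECT id, name FROM users WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, BATCH)).fetchall()
    if not rows:
        break
    conn.executemany("INSERT INTO users_new (id, name) VALUES (?, ?)", rows)
    conn.commit()
    last_id = rows[-1][0]

# 3. Swap the tables. In MySQL this would be a single atomic
#    RENAME TABLE users TO users_old, users_new TO users.
conn.execute("ALTER TABLE users RENAME TO users_old")
conn.execute("ALTER TABLE users_new RENAME TO users")
conn.commit()
```

The point isn't the copy itself but the batching: each batch is a short transaction, so normal reads and writes interleave with the migration instead of waiting behind one multi-hour ALTER.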
How often do you expect to make schema changes? I mean, I quoted this bit "...make schema changes sometimes on a daily basis" – is this realistic, or a kind of business insanity typically caused by bad management? Ditto "...but continual, recurring downtime due to schema changes". This really looks like a failure of management rather than a technical problem to be solved.
Also, aren't you quite often likely to be doing something larger than just a schema change, in which case you'd also need to replace your application, so changes are not restricted to the database? You now have a bigger problem of co-ordinating app and DB changes.
I also asked whether you need permanent uptime because in a lot of systems, especially smaller ones (and by the long tail, most systems are going to be smallish), the users are very tolerant of an hour's downtime a month, for example.
"Once you have beyond a few million rows in a normal RDBMS, schema changes can take minutes to hours depending on the type"
That's a pretty strong claim; what kind of thing is going to take hours that your database can do consistently? Does it even take hours? I had a 100,000,000-row table of unique ints lying around, so I added a foreign key from itself to itself (a bit daft, but just for timing purposes; the DB is MS SQL and the table is fully hot in memory):
alter table [tmp_ints_clustered] add constraint ffffkkkkk foreign key (x) references [tmp_ints_clustered](x);
21 seconds.
What you're doing (if you can get it correct! Which I have to wonder at) is doubtless excellent for some very large companies, but in general... I'm afraid I'm not so sure.
Edit: I feel I'm perhaps missing your bigger picture.
Surely that's a bad design, and capability to support it is enabling continued bad design.
I've been researching databases for several days straight now; the choices are overwhelming, but I've pretty much narrowed my use case down to an RDBMS.
I essentially need to handle hundreds of millions of "leads" (tens of millions per day, over 1B total), which can be made up of any number of user fields.
I need to resolve duplicate leads in realtime or near-realtime. A duplicate can occur across a combination of 1 or more fields, so basically OLTP-type operations (select, update, delete on single rows).
I do need to run large OLAP queries as well across all data
I've looked at things like Scylla and whatnot, but they seem too heavy-duty for my volume. It's not like I need to store trillions of messages like Discord in some huge event log.
I was considering these 3 options...
1. PlanetScale
2. Citus
3. CockroachDB
I haven't really narrowed it down further than this, but I liked the idea of still having RDBMS features without needing to worry about storage and scaling under sheer write volume.
It seemed I could then do the basic OLTP stuff that I need, and Citus had a cool demo where an OLAP query on 1B rows ran in 20s with 10 nodes, which also fits a reasonable time for queries (BI tools will be used for that).
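To make the dedup part concrete, here's roughly what I have in mind, sketched with Python's `sqlite3` (the `email`/`phone` dedup key and all names are invented): each "combination of 1 or more fields" that defines a duplicate becomes a unique index, and resolution happens atomically via an upsert, so the database does the realtime check rather than application code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE leads (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    email TEXT NOT NULL,
    phone TEXT NOT NULL,
    score INTEGER)""")
# One unique index per field combination that defines "duplicate".
conn.execute("CREATE UNIQUE INDEX leads_email_phone ON leads (email, phone)")

UPSERT = """INSERT INTO leads (email, phone, score) VALUES (?, ?, ?)
            ON CONFLICT (email, phone) DO UPDATE SET score = excluded.score"""
conn.execute(UPSERT, ("a@example.com", "555-0100", 1))
conn.execute(UPSERT, ("a@example.com", "555-0100", 7))  # duplicate lead
conn.commit()

# One row survives, carrying the latest score.
count, score = conn.execute("SELECT COUNT(*), MAX(score) FROM leads").fetchone()
```

All three options on my list speak SQL, and both Postgres (`INSERT ... ON CONFLICT`) and MySQL (`INSERT ... ON DUPLICATE KEY UPDATE`) support this pattern.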
OLAP works here, but is not great depending on how fast you need info to be available. If you're generating reports, Postgres is fine as long as your queries are properly optimized, and you can get them within minutes for massive workloads. If you need near-instant (sub 1s) results, I would recommend you sync your RDBMS to a columnar database like ClickHouse, and let the better data layout work in your favor, rather than trying to constrain a row-based DB to act like it's not.
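To make the "better data layout" point concrete, here's a toy illustration in Python (field names and sizes invented): the same aggregate over a row layout has to walk every field of every record, while a columnar layout scans one contiguous, typed array. That locality difference, at billions of rows, is most of why columnar stores win at OLAP scans:

```python
from array import array

# Row layout: one record per lead, all fields together
# (roughly what an OLTP table feels like in memory).
rows = [{"id": i, "clicks": i % 7, "region": i % 3} for i in range(100_000)]

# Column layout: one flat, typed array per field
# (roughly how a columnar store like ClickHouse lays data out).
clicks = array("q", (r["clicks"] for r in rows))

# The aggregate touches every field of every record in the first case,
# but only one contiguous array in the second.
row_sum = sum(r["clicks"] for r in rows)
col_sum = sum(clicks)
```

Same answer either way; the columnar scan just reads a small fraction of the bytes, and real columnar engines add vectorized execution and per-column compression on top.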
Otherwise, both are rock solid and simple to use. I've dealt with more intensive workloads than you mentioned, with the same use-case and Postgres worked very well. ClickHouse never had a problem.
I was considering using something like Airbyte, but then I thought this may actually be complex if PG rows are updating/deleting it means I also need to sync single rows (or groups of rows) to clickhouse, and I wasn't sure how the support was for that.
What kind of aggregation queries? Can you do pre-aggregation?
Citus would probably be my pick if you want SQL.
Feel free to email.
Any time I am doing something that could violate some logical constraint, I am probably trying to fix something that is already fucked. The safety nannies simply add insult to injury at this stage. I will delete broken rows in whatever order I please.
If constraint violations are adding value to your solution, you probably have a far more severe problem repeatedly slamming into the wall somewhere.
Having to write code that can handle foreign key violations because the DB doesn't check them is a major pain. (We use Cassandra, for example, so there is usually a "foreign key" from a PG row to a Cassandra row; obviously that can't be enforced at the DB level, so application code has to do the work.)
As for deleting/updating data, FKs can be a bit annoying, but postgresql for example has two (possibly more) options.
1) The (possibly dangerous) cascade delete, which traverses the FKs for you and deletes the dependent rows.
2) Deferrable constraints: instead of checking whether every individual delete/update statement causes an FK violation, it checks at the end of the transaction, after all the delete/update statements have run.
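Option 2 can be shown in a few lines. This sketch uses Python's `sqlite3` (SQLite also accepts the PostgreSQL-style `DEFERRABLE INITIALLY DEFERRED` syntax; table names are invented): the FK check moves from statement time to commit time, so you can delete parent and child in whichever order you please, as long as the transaction ends consistent:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite: FK enforcement is off by default
conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
conn.execute("""CREATE TABLE child (
    id INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES parent(id)
        DEFERRABLE INITIALLY DEFERRED)""")
conn.execute("INSERT INTO parent VALUES (1)")
conn.execute("INSERT INTO child VALUES (10, 1)")
conn.commit()

# Deleting the parent first would fail under an immediate FK check,
# but the deferred constraint is only verified at COMMIT.
conn.execute("DELETE FROM parent WHERE id = 1")
conn.execute("DELETE FROM child WHERE id = 10")
conn.commit()  # no orphans remain at commit time, so this succeeds

# Leaving an orphan, by contrast, still blows up -- just at commit:
conn.execute("INSERT INTO parent VALUES (2)")
conn.execute("INSERT INTO child VALUES (20, 2)")
conn.commit()
conn.execute("DELETE FROM parent WHERE id = 2")
try:
    conn.commit()  # raises a constraint error (IntegrityError)
    committed = True
except sqlite3.DatabaseError:
    conn.rollback()  # a failed deferred COMMIT leaves the txn open
    committed = False
```

So "delete broken rows in whatever order I please" and "the database still guarantees integrity" aren't mutually exclusive; deferral only postpones the check, it doesn't waive it.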
If there are any downstream queries that assume the relationship between X and Y, and you accidentally violated that contract, wouldn't you WANT the database to tell you? Without a FK constraint in place, I would just have to know that I needed to update hundreds of lines of (incidentally) dependent code!
It's roughly analogous to the static vs. dynamic typing debate. Do you want to discover FK violations at runtime in the future, or head them off at insert time? Either way, you cannot sweep referential integrity under the rug... you only shift the burden from one place to another.
So I thank my database daily for its work enforcing FK constraints. Otherwise I'd have to write that code! You can't ignore the lifecycle of references.
If code has many maintainers, pinky promises to keep referential integrity are bound to be broken eventually. Humans like shortcuts.
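The typing analogy is easy to demonstrate. Here's a toy sketch with Python's `sqlite3` (schema and names invented): without the constraint, the orphan insert "succeeds" and the bug surfaces later in some join; with it, the violation is rejected at insert time, like a type error at compile time:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite: FK enforcement is off by default
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id))""")
conn.execute("INSERT INTO users VALUES (1)")

conn.execute("INSERT INTO orders VALUES (100, 1)")  # valid reference
try:
    conn.execute("INSERT INTO orders VALUES (101, 999)")  # no user 999
    caught = False
except sqlite3.IntegrityError:
    caught = True  # the "type error" is reported at insert time
```

No pinky promise required: every maintainer, script, and backfill job goes through the same check.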
https://www.youtube.com/@PlanetScale
Subscribed.