Zero downtime migrations at petabyte scale (2024) (opens in new tab)

(planetscale.com)

112 pointsOzzie_osman4mo ago42 comments

42 comments

20 comments · 6 top-level

mattlord4mo ago· 6 in thread

Blog post author here. I'm happy to answer any related questions you may have.

That 400TB in the image is a large database! I'm guessing that's not the largest in the PlanetScale fleet either. Very impressive and a reminder that you're strongly differentiated against some of the recent database upstarts in terms of battle tested mission critical scale. Out of curiosity how many of these large clusters are using your true managed 'as a service' offering or are they mostly in the bring your own cloud mode? Do you offer zero downtime migrations from bring your own cloud to true as a service?

mattlord4mo ago

That particular cluster has grown significantly since the post was written, and yes there are now quite a few others that are challenging it for the "largest" claim. :-)

These larger ones are fully using the PlanetScale SaaS, but they are using Managed -- meaning that there are resources dedicated to and owned by them. You can read more about that here: https://planetscale.com/docs/vitess/managed

All of the PlanetScale features, including imports and online schema migrations or deployment requests (https://planetscale.com/docs/vitess/schema-changes/deploy-re...) are fully supported with PlaneScale Managed.

1 more reply

willquack4mo ago

> you can run an initial VDiff, and then resume that one as you get closer to the cutover point.

VDiff (v2) only compares the source and destination at a specific point in time with resume only comparing rows with PK higher than the last one compared before it was paused. I assume this means:

1. VDiff doesn't catch updates to rows with PK lower than the point it was paused which could have become corrupt, and

2. VDiff doesn't continuously validate cdc changes meaning (unless you enforce extra downtime to run / resume a vdiff) you can never be 100% sure if your data is valid before SwitchTraffic

I'm curious if this is something customers even care about, or is point in time data validation sufficient enough to catch any issues that could occur during migrations?

mattlord4mo ago

You are correct about resuming. If you do an initial VDiff and then resume that same VDiff say 1 month later it would only diff rows with a higher PK value.

But there's also nothing stopping you from doing a new VDiff to cover all data at that later point in time.

2 more replies

l5870uoo9y4mo ago

What does it cost to host a 400TB database?

freakynit4mo ago

Enterprise grade nvme ssd's typically cost around 150$/TB. For RF of 3, this comes to around: 400 x 3 x 150: 180K USD. With a minimum of 5 year lifecycle for these enterprise SSD's, we are looking at 36K USD/year.

Going through their pricing (https://planetscale.com/pricing?engine=vitess&cluster=M-5120...), for just 15TB storage with RF=3, the pricing comes to around 24000 USD/MONTH, not year. Adjusted for 400TB and per year, this becomes 7.6 million usd. Of course, you also get a lot more, but, the difference is just insane.

1 more reply

mystifyingpoi4mo ago· 4 in thread

While this is cool and I dig it, I'm really, really thankful for maintenance windows at the current job. In the real world, 99.9% of systems aren't used 24/7/365. Just do the cutoff when everyone is asleep. Then restart everything to be sure.

embedding-shape4mo ago

> In the real world, 99.9% of systems aren't used 24/7/365. Just do the cutoff when everyone is asleep

"Real world" being something that covers max what, 10 hours of a day? What about things that are used by the entire world? I think there is more than you realize of those sort of services underpinning the entire internet and the web, serving a global user base.

MagicMoonlight4mo ago

Almost nothing in the world is used globally. You have a handful of things like YouTube and Facebook and the visa network.

Nobody is using slopwork’s new CrudX at a global scale.

3 more replies

mystifyingpoi4mo ago

> What about things that are used by the entire world?

Well, for the remaining 0.1% - go ahead and use the fancy hot replication thingy. Sometimes there is no choice, and that's fine. Although that might mean, that the system architecture is busted.

ayuhito4mo ago

> Just do the cutoff when everyone is asleep.

In this age, many smaller companies serve customers across the globe. There is no common “asleep”.

Thaxll4mo ago· 3 in thread

We need more details on 6. This is the hard part, like you swap connection from A to B, but if B is not synced properly and you write to it then you start having diff between the two and there is no way back.

Like B is slightly out of date ( replication wise ) the service modify something, then A comes with change that modify the same data that you just wrote.

How do you ensure that B is up to date without stopping write to A ( no downtime ).

mattlord4mo ago

It's open source. You can get as many details as you like :)

https://github.com/vitessio/vitess

https://vitess.io/docs/reference/vreplication/

https://vitess.io/docs/reference/features/vtgate-buffering/

Kaliboy4mo ago

Not sure how they do it, but I would do it like so:

Have old database be master. Let new be a slave. Load in latest db dump, may take as long as it wants.

Then start replication and catch up on the delay.

You would need, depending on the db type, a load balancer/failover manager. PgBouncer and PgPoolII come to mind, but MySQL has some as well. Let that connect to the master and slave, connect the application to the database through that layer.

Then trigger a failover. That should be it.

Snelius4mo ago

> Load in latest db dump, may take as long as it wants.

400TB its about a week+ ?

> Then start replication and catch up on the delay.

Then u have a changes in the delay about +- 1TB. It means a changes syncing about few days more while changes still coming.

They said "current requests are buffered" which is impossible, especial for long distributed (optional) transactions which in a progress (it can spend a hours, days (for analitycs)).

Overwall this article is a BS or some super custom case which irrelevant for common systems. You can't migrate w/o downtime, it's a physical impossible.

2 more replies

WaitWaitWha4mo ago· 1 in thread

I split step 4 in their "high level, this is the general flow for data migrations".

4.0 Freeze old system

4.1 Cut over application traffic to the new system.

4.2 merge any diff that happened between snapshot 1. and cutover 4.1

4.3 go live

to me, the above reduces the pressure on downtime because the merge is significantly smaller between freeze and go live, than trying to go live with entire environment. If timed well, the diff could be minuscule.

What they are describing is basically, live mirror the resource. Okay, that is fancy nice. Love to be able to do that. Some of us have a mildly chewed bubble gum, a foot of duct tape, and a shoestring.

dheera4mo ago

Yeah it depends on what the system is.

Lots of systems can tolerate a lot more downtime than the armchair VPs want them to have.

If people don't access to Instagram for 6 hours, the world won't end. Gmail or AWS S3 is a different story. Therefore Instagram should give their engineers a break and permit a migration with downtime. It makes the job a lot easier, requires fewer engineers and cost, and is much less likely to have bugs.

redwood4mo ago

Worth underlining that this is data migrations from one database server or system to another rather than schema migrations

ksec4mo ago

Missing 2024 in the Title.

j / k navigate · click thread line to collapse

42 comments

20 comments · 6 top-level

mattlord4mo ago· 6 in thread

Blog post author here. I'm happy to answer any related questions you may have.

redwood4mo ago

mattlord4mo ago

That particular cluster has grown significantly since the post was written, and yes there are now quite a few others that are challenging it for the "largest" claim. :-)

1 more reply

willquack4mo ago

> you can run an initial VDiff, and then resume that one as you get closer to the cutover point.

VDiff (v2) only compares the source and destination at a specific point in time with resume only comparing rows with PK higher than the last one compared before it was paused. I assume this means:

1. VDiff doesn't catch updates to rows with PK lower than the point it was paused which could have become corrupt, and

2. VDiff doesn't continuously validate cdc changes meaning (unless you enforce extra downtime to run / resume a vdiff) you can never be 100% sure if your data is valid before SwitchTraffic

I'm curious if this is something customers even care about, or is point in time data validation sufficient enough to catch any issues that could occur during migrations?

mattlord4mo ago

You are correct about resuming. If you do an initial VDiff and then resume that same VDiff say 1 month later it would only diff rows with a higher PK value.

But there's also nothing stopping you from doing a new VDiff to cover all data at that later point in time.

2 more replies

l5870uoo9y4mo ago

What does it cost to host a 400TB database?

freakynit4mo ago

1 more reply

mystifyingpoi4mo ago· 4 in thread

embedding-shape4mo ago

> In the real world, 99.9% of systems aren't used 24/7/365. Just do the cutoff when everyone is asleep

MagicMoonlight4mo ago

Almost nothing in the world is used globally. You have a handful of things like YouTube and Facebook and the visa network.

Nobody is using slopwork’s new CrudX at a global scale.

3 more replies

mystifyingpoi4mo ago

> What about things that are used by the entire world?

Well, for the remaining 0.1% - go ahead and use the fancy hot replication thingy. Sometimes there is no choice, and that's fine. Although that might mean, that the system architecture is busted.

ayuhito4mo ago

> Just do the cutoff when everyone is asleep.

In this age, many smaller companies serve customers across the globe. There is no common “asleep”.

Thaxll4mo ago· 3 in thread

Like B is slightly out of date ( replication wise ) the service modify something, then A comes with change that modify the same data that you just wrote.

How do you ensure that B is up to date without stopping write to A ( no downtime ).

mattlord4mo ago

It's open source. You can get as many details as you like :)

https://github.com/vitessio/vitess

https://vitess.io/docs/reference/vreplication/

https://vitess.io/docs/reference/features/vtgate-buffering/

Kaliboy4mo ago

Not sure how they do it, but I would do it like so:

Have old database be master. Let new be a slave. Load in latest db dump, may take as long as it wants.

Then start replication and catch up on the delay.

Then trigger a failover. That should be it.

Snelius4mo ago

> Load in latest db dump, may take as long as it wants.

400TB its about a week+ ?

> Then start replication and catch up on the delay.

Then u have a changes in the delay about +- 1TB. It means a changes syncing about few days more while changes still coming.

They said "current requests are buffered" which is impossible, especial for long distributed (optional) transactions which in a progress (it can spend a hours, days (for analitycs)).

Overwall this article is a BS or some super custom case which irrelevant for common systems. You can't migrate w/o downtime, it's a physical impossible.

2 more replies

WaitWaitWha4mo ago· 1 in thread

I split step 4 in their "high level, this is the general flow for data migrations".

4.0 Freeze old system

4.1 Cut over application traffic to the new system.

4.2 merge any diff that happened between snapshot 1. and cutover 4.1

4.3 go live

dheera4mo ago

Yeah it depends on what the system is.

Lots of systems can tolerate a lot more downtime than the armchair VPs want them to have.

redwood4mo ago

Worth underlining that this is data migrations from one database server or system to another rather than schema migrations

ksec4mo ago

Missing 2024 in the Title.

j / k navigate · click thread line to collapse