* It supports adding a nullable column immediately.
* It supports slowly filling in a possibly-default value on all records.
* It supports setting the column to non-null when every row has a value.
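To make those three steps concrete, here is a rough sketch that just builds the SQL for each one. The table and column names, the integer `id` key used for batching, and the helper itself are all made up for illustration; nothing here is taken from the article. The identifiers are interpolated directly, so this is only suitable for trusted, hard-coded names.

```python
def three_step_migration(table, column, sql_type, default_literal, batch_size, max_id):
    """Build SQL for: add nullable column, backfill in batches, set NOT NULL.

    Assumes (hypothetically) an integer primary key named `id` running 1..max_id.
    """
    statements = [
        # Step 1: add the column as nullable with no default.
        f"ALTER TABLE {table} ADD COLUMN {column} {sql_type} NULL;",
    ]
    # Step 2: backfill in small id ranges so each UPDATE touches few rows.
    for start in range(1, max_id + 1, batch_size):
        end = min(start + batch_size - 1, max_id)
        statements.append(
            f"UPDATE {table} SET {column} = {default_literal} "
            f"WHERE id BETWEEN {start} AND {end} AND {column} IS NULL;"
        )
    # Step 3: once every row has a value, forbid NULLs.
    statements.append(f"ALTER TABLE {table} ALTER COLUMN {column} SET NOT NULL;")
    return statements


if __name__ == "__main__":
    for stmt in three_step_migration("accounts", "plan", "text", "'free'", 1000, 2500):
        print(stmt)
```

In practice you would run each batch in its own transaction and pause between them; the sketch only shows the shape of the statements.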
What pain points are you hitting, and is my analysis totally off-base?
Though a quick search suggests that step three (setting the column to non-null) might ignore indexes and be slow; that would be disappointing.
You can work around this by adding a `CHECK` constraint which does the same thing, but it's a little inconvenient. There was some work being done to add this feature to not-null constraints (http://www.postgresql.org/message-id/20140517155857.GD7857@e...), but unfortunately it looks like it's gone stale.
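For anyone who hasn't seen the `CHECK` workaround, a minimal sketch, assuming it means the usual `NOT VALID` followed by `VALIDATE CONSTRAINT` pattern (the comment above doesn't spell that out). The helper and the constraint naming scheme are hypothetical; the two `ALTER TABLE` forms are standard PostgreSQL.

```python
def not_null_via_check(table, column):
    """Build SQL that enforces non-null through a CHECK constraint.

    The constraint name is a hypothetical convention, not anything Postgres requires.
    """
    constraint = f"{table}_{column}_not_null"
    return [
        # Added as NOT VALID: new and updated rows are checked,
        # existing rows are not scanned at this point.
        f"ALTER TABLE {table} ADD CONSTRAINT {constraint} "
        f"CHECK ({column} IS NOT NULL) NOT VALID;",
        # Existing rows are checked later, in a separate statement.
        f"ALTER TABLE {table} VALIDATE CONSTRAINT {constraint};",
    ]


if __name__ == "__main__":
    for stmt in not_null_via_check("accounts", "plan"):
        print(stmt)
```

The inconvenient part is that the column still shows up as nullable in the catalog, so tools that look for a real `NOT NULL` won't see it.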
I think MySQL shops that use pt-online-schema-change generally don't create foreign key constraints in production, sidestepping the whole problem this article is about.
They did test against the full production data afterwards and could not reproduce the problem. That was what they expected, since the migrations were on empty tables that just happened to have foreign key constraints against large tables, with no rows for those constraints to actually apply to.
So for them to "fully" test this migration before applying it to production, they would need to be replaying all production queries against the testing database as well, and maybe even test the migration multiple times to get a statistical sense of the possible latencies.
(I've actually done something like that before, but it's not something you do for every little change.)
Most of our queries are fast (tens of milliseconds or lower), which is how we got away without knowing about this for so long. Unsurprisingly, we've been making a bigger effort to eliminate any slow queries we do find lately. ;)
It's true the table wasn't available for 15 seconds, but current connections weren't dropped or lost, so a robust interface with the API should survive a 15-second delay.
Shouldn't a system like this be built to handle long delays?
Those situations can be smoothed over with retries, but we try pretty hard to avoid delays like this in the first place.
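By "smoothed over with retries" I mean something roughly like the following on the client side. This is a generic sketch of retrying with exponential backoff and jitter, not our actual code; the function name, the defaults, and the choice of `TimeoutError` as the retryable failure are all assumptions.

```python
import random
import time


def call_with_retries(operation, attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep):
    """Call operation(), retrying on TimeoutError with jittered backoff.

    `sleep` is injectable so the waiting can be stubbed out in tests.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Backoff cap doubles each attempt, bounded by max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a random amount up to the cap.
            sleep(random.uniform(0, cap))
```

Retries like this hide a short stall from the caller, but every retry still adds latency, which is why avoiding the delay in the first place is preferable.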