How We Release So Frequently

(engineering.skybettingandgaming.com)

201 pointsTomNomNom10y ago50 comments

rachelbythebay10y ago· 9 in thread

Get off NFS while you still can. Thank me later.

ptrincr10y ago

I think NFS often gets an undue bad reputation. I work at a company which uses NFS at scale and it does the job without too much trouble. NFS is used as the storage for VMware datastores, Xen primary storage and also for shared storage mounted between servers.

For the latter case, the mounting of these partitions can be automated with config management on the Linux servers. You have to be careful with UIDs and GIDs, but config management helps with this.

The filers supplying the NFS storage can be exploited to provide replication to other datacenters and snapshots, and also provide redundancy with multiple heads serving the volumes.

In the past I've used Fibre Channel (found it overly complex) and iSCSI. iSCSI was fairly straightforward to use, but I've never tried to automate it. I guess there isn't a reason you couldn't, however. For complexity I guess it's Fibre Channel > iSCSI > NFS.

Performance-wise we don't have any issues with NFS itself; the bottleneck is sometimes the filer trying to keep up :-)

Anyhow, in complex environments, sometimes it's good to keep things simple where you can. NFS helps with that: it's stable, scalable and the performance is comparable to iSCSI.

Removing the need for shared storage on the OS where possible is the ultimate aim though.

digi_owl10y ago

I wonder how much the experience differs based on the NFS version being used.

gaius10y ago

Agreed. I've run Oracle with thousands of commits/sec on NFS with no problems. Or no more problems than we'd have had on any storage.

jordanb10y ago

Yep. I work for a big company with a bureaucratic system engineering department that puts all program code on NFS (SANs).

It's been the cause of virtually all of our service outages and many of our performance problems---and it's completely obsolete in the era of Jenkins and Ansible.

simonw10y ago

It sounds like they are using NFS for distributing their code, which should be almost entirely read-only - the only place that writes new files to the file system is the single Jenkins machine that manages the deploys. It seems likely to me that this would avoid most of the risks inherent in running NFS at scale.

rachelbythebay10y ago

Let's say you distribute a binary over NFS -- compile some C or C++ or whatever you like. Then various hosts run /path/to/binary. At some point, the NFS server changes that file out from underneath those hosts, because, well, it can. The usual "text file busy" you'd get when trying to do that on a local filesystem never happens.

At some point after that, the hosts running the binary will try to page something involving that binary and will SIGBUS and die.

That's just one of many failure modes.

jagsta10y ago

That's not the only issue with running NFS with a horizontally scaled web cluster, stuff like this starts to hurt: http://www.serverphorums.com/read.php?7,655118

TomNomNomOP10y ago

That is something we're working on at the moment. It is read-only from the web servers' perspective, but it's definitely not without its problems.

Outdoorsman10y ago

Looks like a large distributed company...I have some experience with casinos...similar...

I'd be interested to know more about your experiences with an NFS "share"...which he mentions...?

Thanks!

ktRolster10y ago· 7 in thread

The biggest difficulty is managing database rollbacks. When you mess up your database, rolling back can be tough.

These guys avoid the problem by never rolling back the database, and never making changes that might require that.

axelfontaine10y ago

Flyway author here. Rollback is an illusion once your transaction has been committed. At best you can attempt to issue a compensating transaction to undo some of the effects. More details: https://flywaydb.org/documentation/faq.html#downgrade
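
A minimal sketch of the distinction drawn above, using Python's built-in sqlite3 module. The `ledger` table and the account name are hypothetical and exist only for illustration; nothing here is Flyway's API. It shows that `rollback()` cannot touch committed work, and that the "undo" is really a second, forward transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ledger (id INTEGER PRIMARY KEY, account TEXT, amount INTEGER)"
)

# Forward transaction: credit 50 to alice, then commit.
conn.execute("INSERT INTO ledger (account, amount) VALUES ('alice', 50)")
conn.commit()

# rollback() only discards uncommitted work, so the committed credit stays.
conn.rollback()
balance_after_rollback = conn.execute(
    "SELECT COALESCE(SUM(amount), 0) FROM ledger WHERE account = 'alice'"
).fetchone()[0]

# Compensating transaction: a new, explicit change that offsets the old one.
conn.execute("INSERT INTO ledger (account, amount) VALUES ('alice', -50)")
conn.commit()
balance_after_compensation = conn.execute(
    "SELECT COALESCE(SUM(amount), 0) FROM ledger WHERE account = 'alice'"
).fetchone()[0]

# Both rows remain: history was extended, not erased.
row_count = conn.execute("SELECT COUNT(*) FROM ledger").fetchone()[0]
```

The balance returns to zero, but the table now holds two rows, and anything that read the first row in between has already acted on it. That is why the compensation only undoes "some of the effects".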

w4tson10y ago

Currently using Flyway on an enterprise Java project. Brilliant library. Excellent documentation, developers sing its praises all day, every day.

benjiweber10y ago

Database schema changes can almost always be moved forward in an always-safe manner using the expand-contract pattern http://martinfowler.com/bliki/ParallelChange.html

As for reducing the risk of damage to the data itself from badly behaved application code, I think the best approach is to design your architecture such that you can't lose important data.

There are various other techniques that can help. I wrote about some of them here http://benjiweber.co.uk/blog/2015/03/21/minimising-the-risk-...

Releasing less frequently actually only makes the problem worse. Infrequent changes are often too big to have a chance of understanding their potential effect on the production system. You're also less likely to immediately know how to respond to a problem.

kul_10y ago

Never rolling back makes a lot of sense, and is often borne out by the fact that many migration frameworks don't even provide a rollback facility, e.g. Flyway. A database migration is a major high-risk change IMO and should be well thought through.

gaius10y ago

If you add a column and real customer data is written to that column, then "rolling back" means deleting it. That isn't a problem that can be "solved".

ozim10y ago

I never think about doing a rollback on production; downgrade is nice on a dev machine when you work on multiple branches. I'd rather do a database restore and a code restore to the previous version.

I would not roll out new code with migrations without an ad-hoc DB backup. Any new code as well.

QuercusMax10y ago

It's always possible to make changes to your schema without needing to potentially roll back the database. It may just be more annoying than you are willing to attempt. You might need to roll back your code to a previous version, but if you structure your changes right, you'll do it in multiple steps. (These are basically what's laid out in the article, but broken down more.)

  1. Change your code to write to old and new schema; keep reading from old schema. 
  2. Migrate data to new schema in background. 
  3. Add a flag to control where you're reading from. Default it to read from old location. Keep writing to both. 
  4. Flip the flag on some subset of your jobs. Ensure everything is still running smoothly for as long as you like. 
  5. Change the default to read from new schema. Wait as long as you like to be comfortable that the change is working properly. 
  6. Delete the code that reads from the old schema. 
  7. Delete the code that writes to the old schema. 
  8. Drop the data in the old schema.

At any point prior to deleting the old data, if you encounter problems you can roll back to an old version of the code. If your schema changes are incompatible, you can make an entire new database with the new schema. This may temporarily waste some storage space, but it's very safe.
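
The steps above can be sketched in a few lines of Python. This is a toy under stated assumptions, not anyone's production code: `UserStore`, its two dicts standing in for the old and new schema, and the `read_from_new` flag are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class UserStore:
    """Toy repository that dual-writes while a schema migration is in flight."""

    old_rows: dict = field(default_factory=dict)  # old schema: id -> "First Last"
    new_rows: dict = field(default_factory=dict)  # new schema: id -> (first, last)
    read_from_new: bool = False                   # step 3: flag, defaults to old

    def save(self, user_id: int, first: str, last: str) -> None:
        # Steps 1 and 3: always write both representations.
        self.old_rows[user_id] = f"{first} {last}"
        self.new_rows[user_id] = (first, last)

    def backfill(self) -> None:
        # Step 2: copy rows that predate dual-writing into the new schema.
        for user_id, full_name in self.old_rows.items():
            if user_id not in self.new_rows:
                first, _, last = full_name.partition(" ")
                self.new_rows[user_id] = (first, last)

    def full_name(self, user_id: int) -> str:
        # Steps 4 and 5: the flag decides which schema serves reads.
        if self.read_from_new:
            first, last = self.new_rows[user_id]
            return f"{first} {last}"
        return self.old_rows[user_id]


store = UserStore(old_rows={1: "Ada Lovelace"})  # row written before the change
store.save(2, "Grace", "Hopper")                  # dual-written row
store.backfill()

store.read_from_new = True                        # flip the flag
names_from_new = [store.full_name(1), store.full_name(2)]

store.read_from_new = False                       # problem found: flip back
names_from_old = [store.full_name(1), store.full_name(2)]
```

Because writes keep going to both places, flipping the flag back is safe at any point before step 7; steps 6 to 8 are simply deletions of the `old_rows` code paths once nobody reads them.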

pbreit10y ago· 5 in thread

Are frequent releases some sort of advantage?

FordPrefectAO10y ago

Whatever you are developing has no value until someone actually uses it. Therefore the quicker you can confidently release something (feature, bugfix, etc.), the more value you deliver.

This idea comes from the concept of inventory waste from lean manufacturing.

suresk10y ago

Another side benefit is that it generally leads to higher quality releases and more stability, for a few reasons:

1) If you can do quick, easy releases, people release small, narrowly-scoped changes instead of huge batches of changes that may interact with each other in all sorts of unknown ways.

2) If you need to fix something, having a process that lets you release quickly and easily means you can get a fix deployed that much sooner. I hate being in a situation where an emergency patch that is really small still takes hours to get out.

sdrothrock10y ago

In addition, faster releases can also reduce sunk costs. For example, if you release new features in stages, you can correct course quickly from the beginning based on actual user use/feedback instead of spending months building up something that may not meet actual needs.

pilsetnieks10y ago

Provided all else is equal (bugs, feature requests, customer feedback,) yes. If it gives any business advantage, the sooner you release, the sooner it starts giving returns, and those returns eventually compound over time. So, yes.

Outdoorsman10y ago

End result to a user...potentially yes...though they may never be aware of your efforts...

fancy_pantser10y ago· 3 in thread

I literally checked the date twice to make sure this wasn't 10+ years old.

vemv10y ago

I guess your point being that these aren't novel techniques?

The author sure didn't invent them, but they aren't widespread enough yet. In fact e.g. Rails puts you in the opposite mindset (which is OK at early stages).

jdietrich10y ago

Betting is a highly regulated environment. The kind of mistake that startups make all the time could result in a heavy fine or the loss of a license. In such a climate, this workflow is exceptionally modern.

mbesto10y ago

For a company like Sky Betting this is actually fairly monumental. Not every company looks and behaves like every trending SV startup.

http://www.cvc.com/Our-Portfolio.htmx?ordertype=1&itemid=112...

PS - They actually make pretty good money too: http://www.cvc.com/Our-Portfolio.htmx?ordertype=1&itemid=112...

stuaxo10y ago· 3 in thread

Hm, is it any better working there than for the TV part?

unistdh10y ago

There are many tech teams at Sky. Each one is very different in terms of culture, tech, product etc.

As I understand it, Sky Bet is quite separate from the rest of the company. This is probably a good thing...

michae1m10y ago

Separate company these days.

TomNomNomOP10y ago

Yes! Although I am obviously biased.

simonw10y ago· 2 in thread

This is a good solid write-up of techniques that I think are emerging as best practices generally for high scale, complex sites. The approach described matches what we do at Eventbrite pretty closely, and I know companies like Etsy and Slack use the same kind of process.

Feature flags in particular are a very powerful tool for managing feature releases independent of the deploy cycle. Here's a good recent article that dives into those in more detail: http://martinfowler.com/articles/feature-toggles.html
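
As a rough illustration of releasing a feature independently of the deploy, here is a minimal percentage-rollout flag in Python. The function names, the `new_checkout` flag and the hashing scheme are assumptions made for this sketch; they are not taken from the article or from any particular feature-flag library.

```python
import hashlib


def bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Return True if this user falls inside the flag's rollout percentage."""
    return bucket(flag_name, user_id) < rollout_percent


def checkout(user_id: str, rollout_percent: int) -> str:
    # The new code path is deployed but only reachable behind the flag.
    if is_enabled("new_checkout", user_id, rollout_percent):
        return "new checkout flow"
    return "old checkout flow"
```

With `rollout_percent` read from config rather than hard-coded, the code ships dark at 0, is widened gradually, and can be turned off without another deploy. Hashing on the user ID keeps each user on the same side of the flag between requests.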

kough10y ago

I just want to throw out that there's nothing about this that won't work for !("high scale, complex") sites. I worked at a ~20 person startup and we arrived at the same decision regarding database migrations and, you know what, it wasn't that hard and it didn't take that much time and it made deploys a lot smoother.

I see a lot of other people mentioning the annoyance factor. Like anything else, you get used to it, and appreciate its advantages.

QuercusMax10y ago

All the principles in here sound exactly like what is SOP at Google (and probably Facebook and others). It's a pain in the butt sometimes, especially when making schema changes, but ensuring everything works properly even when you have multiple versions running in production can really help with confidence in making changes.

You do have to think several steps ahead when making big changes; the (internal-only) project I work on is in the process of completely changing our DB schema and we're redoing our API completely as well. We're attempting to keep our old API running in parallel while migrating literally everything underneath it, which is a fun challenge. It results in a lot of what can feel like busywork, but when the alternative is bringing down your service temporarily, it's worth it. An hour of planned downtime to do an offline migration can easily turn into several days when Murphy strikes. That's OK when you're first building a system, but once anyone is relying on you to get their work done, it is just pure pain.

rixed10y ago· 1 in thread

In the old days of relational databases we had an abstraction layer between actual storage and applications called a query language, decoupling them with functions and views, which were helpful to change the schema independently of the code...

QuercusMax10y ago

We don't use database views, but kinda-sorta mimic them by separating the persistence layer from the API layer. It can be annoying to maintain (at the big G there are lots of SWEs who joke that their job is writing code to copy fields between protobufs), but the alternative is to couple your API directly to what you're storing in the database. That is a road that leads to pain.

The benefit over using views is that all your code is written in the same language, instead of having a whole bunch of logic running semi-hidden in the database. If you have a bug in your view, you have to update your DB schema (or at least roll out new PL/SQL DB code or whatever). And if you're working with a planet-scale distributed app, it just plain won't work.

girvo10y ago· 1 in thread

Neat. Interestingly, for all its faults PHP in my experience has made this a little easier to achieve than other languages we also use: shared-nothing and no internal process state between requests makes cutting over a bit easier than our equivalent node servers. Some great practical advice in this article.

qwer10y ago

That's a flaw in how you use node, not node itself. We run dozens of instances of node without internal process state, the need for sticky sessions, etc. You're losing a lot of load-balancing ability by not keeping this discipline too.

tenken10y ago· 1 in thread

What language, framework, tools do you use to manage these phased migrations? It sounds like a lot of extra code to write.

Also 1 site update, or version number, could really be N releases until fruition -- which don't sound like traditional releases to me.

d0m10y ago

>> It sounds like a lot of extra code to write.

What he's saying is more about convention than writing code. For instance, instead of adding a column "abc" and doing:

foo.abc = 123;

they would do something like:

if (foo.abc) { foo.abc = 123; }

make sure all tests pass, and then migrate the db.

If you're asking about tools to migrate code, all popular languages have one (e.g. Django comes with one that's already really good).
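
One way to read the guard above is "only touch the column if the schema already has it". A small sketch of that idea with Python's built-in sqlite3 module follows; the `foo` table and `abc` column come from the comment, while `has_column` and `save_foo` are hypothetical helpers, and a real application would more likely lean on its ORM or migration tool.

```python
import sqlite3


def has_column(conn: sqlite3.Connection, table: str, column: str) -> bool:
    """Check the live schema; PRAGMA table_info returns one row per column."""
    # The table name is interpolated, so only pass trusted, hard-coded names.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return any(row[1] == column for row in rows)  # row[1] is the column name


def save_foo(conn: sqlite3.Connection, foo_id: int) -> None:
    """Write a row that is valid both before and after the migration."""
    conn.execute("INSERT INTO foo (id) VALUES (?)", (foo_id,))
    if has_column(conn, "foo", "abc"):
        # Only touch the new column once the migration has added it.
        conn.execute("UPDATE foo SET abc = 123 WHERE id = ?", (foo_id,))
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY)")

save_foo(conn, 1)                                       # code deployed first
conn.execute("ALTER TABLE foo ADD COLUMN abc INTEGER")  # then the migration
save_foo(conn, 2)                                       # same code, new schema

rows = conn.execute("SELECT id, abc FROM foo ORDER BY id").fetchall()
```

The same `save_foo` runs unchanged on both sides of the `ALTER TABLE`, which is the property that lets the code deploy and the migration happen as separate steps.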

Can_Not10y ago

I can see the infrastructural scenarios where this would be beneficial, but maybe even then, I think this is a convoluted way to not use what I consider critical workflow tools, specifically environmental files, config files, git merge, git rebase, branches. I think you would be better off looking to see if you can organize your files better, restrict merging major branches to employees who are properly trained/competent in merging.

DigitalJack10y ago

Seems like there is a race condition on the docroot switchover, but maybe with their forward-only migrations it's a non-issue.
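
For context, a common way to narrow this kind of race is to make the docroot a symlink and repoint it with a single rename. The sketch below assumes a POSIX system and a hypothetical `releases/<n>` plus `current` layout; the thread does not say whether the switchover discussed here works this way.

```python
import os
import tempfile


def switch_docroot(link_path: str, new_release_dir: str) -> None:
    """Repoint the docroot symlink at a new release in one rename."""
    tmp_link = f"{link_path}.tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)           # clear a leftover from a failed run
    os.symlink(new_release_dir, tmp_link)
    os.replace(tmp_link, link_path)   # rename(2): atomic on POSIX filesystems


# Hypothetical layout: <base>/releases/<n> and a <base>/current symlink.
base = tempfile.mkdtemp()
release_1 = os.path.join(base, "releases", "1")
release_2 = os.path.join(base, "releases", "2")
os.makedirs(release_1)
os.makedirs(release_2)
docroot = os.path.join(base, "current")

switch_docroot(docroot, release_1)
first_target = os.readlink(docroot)
switch_docroot(docroot, release_2)
second_target = os.readlink(docroot)
```

The link itself never disappears or points at a half-written target, but a request that resolves several paths while the swap happens may still load files from both releases. Keeping consecutive releases compatible with each other, as with the forward-only migrations, is what makes that window tolerable.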

tbarbugli10y ago

Interesting 8/10 years ago!
