When organizations scale up, and especially when they're managing risk, it's easy for them to shift toward control. This is especially true when people can score internal points by assigning or shifting blame.
Controlling and blaming are terrible for creative work, though. And they're also terrible for increasing safety beyond a certain pretty low level. (For those interested, I strongly recommend Sidney Dekker's "Field Guide to Understanding Human Error" [1], a great book on how to investigate airplane accidents, and how blame-focused approaches deeply harm real safety efforts.) So it's great to see Slack finding a way to scale up without losing something that has allowed them to make such a lovely product.
[1] https://www.amazon.com/Field-Guide-Understanding-Human-Error...
We had a guy who more or less appointed himself manager when the previous engineering manager decided he couldn't deal with the environment anymore. His insistence on controlling everything led to a conscious decision to destroy the engineering wiki and knowledge base, forcing everyone to funnel through him as the single source of truth. Once his mind was made up on something, he would berate other engineers, developers, and team members to get what he wanted. Features stopped being developed and things began to fail chronically, but because senior leadership weren't tech people, they all deferred to him. When they officially made him engineering manager (for no reason other than that he had been on the team the longest, since people were beginning to wise up and quit the company), all but 2 of the 12 people in the engineering department quit because no one wanted to work for him.
Imagine my schadenfreude after leaving that environment to find out they were forced to close after years of failing to innovate, resulting in the market catching up and passing them. Never in my adult life have I seen a company inflict so many wounds on itself and then be shocked when competitors start plucking customers off like grapes.
The reviewer rejected my Slack add-on twice, but was really nice about it, gave specific reasons, encouraged me to fix it and reapply, etc.
A very pleasant experience compared to some of the other systems where it feels like you're begging to be capriciously rejected.
I strongly dislike repetitive mental work, and writing a checklist essentially means resigning myself to the fact that such work will be necessary. Until I write it, I can still convince myself I'll be able to automate the process.
Similar boring processes in a tech business can be checklisted to increase efficiency, and the checklist itself can be iterated over. Everything from designing a new feature, to troubleshooting an error, to addressing a customer support ticket, to getting access to a new resource should use checklists.
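One way to make that concrete: treat the checklist itself as data, so it can be reviewed and iterated on like any other artifact. A minimal sketch, with hypothetical step names (this isn't any particular team's process):

```python
# Hypothetical example: a "getting access to a new resource" checklist
# kept as plain data, so the steps can be versioned and iterated on.
ACCESS_REQUEST_CHECKLIST = [
    "Confirm the requester's team and role",
    "Verify the resource owner has approved access",
    "Grant the minimum permissions needed",
    "Record the grant and its expiry in the access log",
]

def outstanding_steps(steps, is_done):
    """Return the steps not yet completed, given a predicate."""
    return [step for step in steps if not is_done(step)]

# Nothing done yet, so every step is still outstanding.
remaining = outstanding_steps(ACCESS_REQUEST_CHECKLIST, lambda s: False)
print(len(remaining))  # 4
```

Because the list is just data, iterating on the process is a one-line diff rather than retraining everyone on a new ritual.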
If I had to switch to some other note taking platform it would probably break my flow enough that I wouldn't do it.
People too often get used to the routine and end up skipping bits from checklists, or even outright missing steps. It's strange, given that the whole idea of having something to actually check should mitigate exactly that.
Generally I do my utmost to avoid having checklists, reserving them for things where they can't be avoided (e.g. where automation makes no sense, or would potentially make things worse).
It seems like just a simple checklist app, but having a non-Jira process that takes only a few minutes is so valuable, while "security reviews" and "threat models" as part of your SDLC take insane amounts of time and honestly aren't super helpful.
That's a lot of people...
However, I would want to caution: I think this model works because Slack has a self-described "culture of developer trust". I tend to think they hire bright engineers and ensure they are equipped to do the right thing. I believe the vast majority of organizations are NOT ready for this. I direly want them to be, but the simple fact is there are too many mediocre developers, and they can't be trusted without guardrails (and some straight up need babysitters).
No, seriously, I was wondering if that tool has a CLI? It might make it more accessible for some devs.
What? They spend 1000 minutes out of every 1440 deploying to production? The deployment process is occurring over 16 hours out of every 24? Am I the only one who is nonplussed by this?
EDIT: Ok I get it, I get it. I guess I always worked in much smaller companies where CD meant deploying about 10 times a day tops. TIL big companies are big.
A culture of continuous deployment is often hard to fathom for people who've never worked at a company with one. Everything, down to what you write and how you write it, is influenced by being able to deploy it and see its effects almost instantly.
These aren't huge, sweeping changes being deployed. They're small pieces of larger feature sets. It's more like: deploy a conditional statement with logging and confirm from the logs that it works; next, deploy the view you're testing, behind a feature flag toggled so that only you and the PM can see it, called from the earlier conditional; when things look good, deploy some controller code that handles form requests from the view; and so on.
You deploy small changes piecemeal and so spread out the risk over a larger period of time. It makes identifying issues with a new piece of code almost trivial. Needing to debug 30 lines of code is so much less harrowing than needing to look over 900.
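The flag-gated pattern described above looks roughly like this. A sketch only: the flag store, names, and render functions are all made up for illustration, and a real system would toggle flags through a config service rather than a dict:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

# Hypothetical flag store; in production this would be togglable
# without a redeploy, so you can flip it for just you and the PM.
FLAGS = {"new_checkout_view": False}

def render_checkout(user):
    # Deploy 1: just this conditional plus logging, to confirm from
    # the logs that the code path is actually reached in production.
    if FLAGS["new_checkout_view"]:
        log.info("new checkout view shown to %s", user)
        return render_new_view(user)  # Deploy 2: the view itself
    return render_old_view(user)

def render_new_view(user):
    return f"new view for {user}"

def render_old_view(user):
    return f"old view for {user}"

print(render_checkout("alice"))  # old view for alice
```

Each deploy in the sequence is tiny, so when something breaks, the suspect diff is 30 lines, not 900.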
(Likely) various groups of people are deploying to production throughout the day. Out of those 100 deploys, an individual is probably only involved in 1 or 2 a day. As soon as you're ready to deploy your code, you queue up and see it all the way through to production along with probably a few other people doing the same thing.
The actual "change the servers over to the new production code" process is usually instantaneous or extremely quick, the 10 minutes is mostly spent testing/building/etc.
People (including myself) enjoy this because you can push very small incremental changes to production, which significantly reduces the chance of confounding errors or major issues.
Note that this would be a Sisyphean task if your company doesn't have great logging/metrics reporting/testing/etc.
I was excited for the move to a large corporation where there would be amazing room for growth and learning.
I have to say that almost a year into my work on this project, I was absolutely stunned how inept this company was at coordinating a technology project.
Something a small team could accomplish in a matter of months was taking hundreds of developers, and hundreds more in supporting and operational roles, years to accomplish. My guess is the developers on this project would gladly trade places with Sisyphus.
There are some legitimate needs for continuous deployment, the rest of it is cargo culting.
Instead imagine each deploy as an edge, and imagine them to be near instantaneous. With respect to an instance, and a user, and the observable effects of a deploy, this paints a more accurate picture. 100 times a day means one deploy every 14.4 minutes.
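The arithmetic behind that figure, for anyone checking (assuming deploys are spread evenly across the day):

```python
MINUTES_PER_DAY = 24 * 60  # 1440

def deploy_interval(deploys_per_day):
    """Average minutes between deploys, assuming an even spread."""
    return MINUTES_PER_DAY / deploys_per_day

print(deploy_interval(100))  # 14.4
```

The cutover itself is near-instant, so multiplying 100 deploys by a 10-minute pipeline double-counts time when pipelines overlap.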
How many times a day do you think Amazon deploys changes? Or Facebook? Or Google?
So having more pushes per day isn't necessarily the metric to maximize. Quality of code changes for each push is important, and this is where automated testing can be very valuable. The goal is for automated testing to be a "gatekeeper of bad code".
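At its simplest, the "gatekeeper" is just a pipeline step that refuses to ship on any failure. A minimal sketch, not any particular CI system's API; the callables stand in for whatever your pipeline actually runs:

```python
def gated_deploy(run_tests, deploy):
    """Run deploy() only if run_tests() reports zero failures.

    run_tests: callable returning a list of failing test names.
    deploy: callable performing the actual release.
    """
    failures = run_tests()
    if failures:
        # Bad code stops here and never reaches production.
        raise SystemExit(f"blocked: {len(failures)} failing test(s): {failures}")
    deploy()

# Example with stub callables: an empty failure list lets the deploy run.
gated_deploy(run_tests=lambda: [], deploy=lambda: print("deployed"))
```

The gate is only as good as the suite behind it, which is the caveat the next paragraph gets into.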
But even this system isn't perfect, and it's possible to deploy things that pass tests but still have show-stopping bugs. Or for the code to cause your tests to misbehave: I'm seeing this now with Tape.js on Travis, where Tape sees my S3 init calls as extra tests. Then my build fails because, of the 2 tests specified, 3 failed and another 4 passed.