To continue a discussion:
- How does your engineering team track new "debt" after releasing code? (if at all, and why not)
- Do you pay anyone for centralized logging, or wish you didn't? Are you making it useful?
- Do you feel like your company is good at managing access when hiring / firing people?
Otherwise, thanks for any feedback; I enjoy writing these!

- Technical debt of custom-coded solutions is a known issue across our organisation. The new strategy is to move to market solutions, thereby outsourcing the risk to organisations with (hopefully) better code management than we have. For my corner, we don't have technical debt measured accurately enough for my liking.
- Yes, we pay for and use centralised logging. We've actually been through two solutions, and are now moving to a third due to various factors (cost, integrations, speed, out-of-the-box metrics). Integration into the centralised logging system is part of our Request for Tender marking criteria.
- Relatively good at disabling access after someone leaves. We integrate as much as possible into a central repository. It's just the outliers that tend to outlast someone's time in the organisation. Critical systems are absolutely shut down within 24 hours of a leaver departing (usually immediately if they're a bad leaver).
Currently, I mainly use a separate "secrets.yml" file that gets deployed via Ansible and is stored there encrypted using Ansible Vault with a strong password. Is that a reasonable approach? What is your opinion about storing secrets in environment variables? It seems that some people advise this over storing them in files, but I have seen some cases where environment variables can be exposed to the web client as well.
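For reference, the workflow described above looks roughly like this (the file path and variable names are my own illustrations, not from the post):

```yaml
# group_vars/all/secrets.yml -- illustrative names only.
# Encrypt at rest with:  ansible-vault encrypt group_vars/all/secrets.yml
# Deploy with:           ansible-playbook site.yml --ask-vault-pass
db_password: "example-only"
smtp_api_key: "example-only"
```

Ansible decrypts the file at playbook run time, so the plaintext never needs to live in the repository.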
The big win is simply keeping secrets out of source code, out of a general engineer's copy/paste buffer, and out of errors that flow to a logging platform with single-factor access. Your likelihood of a short-term incident decreases dramatically, especially if those secrets have well-segmented access (i.e., not a single AWS key with `AdministratorAccess` everywhere).
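As a sketch of what "well-segmented" can mean in IAM terms: instead of `AdministratorAccess`, a key might carry a policy scoped to a single bucket (the bucket name below is made up):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::example-app-uploads/*"
    }
  ]
}
```

If that key leaks, the blast radius is one bucket's objects rather than the whole account.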
- poorly, really.
- for network and security stuff, absolutely: Splunk is the bee's knees. For apps, each team tends to run its own mix (Graylog2/ELK/custom). I have pushed for more security-type events from apps into Splunk for correlation, but it just costs too damn much.
- depends on the region. I find US / UK do okay, but the more emerging/growth markets where we have employees, the worse it gets.
Do you mean "that _at_ least respect" instead?
I ask only because the two have different meanings.
Are there hosted installs of Elasticsearch/Logstash/Kibana? Is ELK even what I want?
Every time I start looking at centralized logging stuff it seems like a rabbit hole of problems we're too small to be worrying about, stuff that's not shipping features on my app.
CloudWatch works fine too, and it comes integrated with AWS services out of the box. It can be more annoying to get your logs into it than into ELK (the latter seems more popular overall). Its alerting and AWS CLI integration are pretty slick, though.
You should also go turn on CloudTrail right now. It lets you automatically log side-effectful API calls. It is not a replacement for a centralized logging pipeline, but it's great high-signal data to put into one.
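Turning CloudTrail on is a couple of CLI calls. This is a configuration sketch: the trail and bucket names are illustrative, and the bucket needs the usual CloudTrail bucket policy attached first.

```shell
# Create a trail that delivers API-call logs to S3, then start it:
aws cloudtrail create-trail --name org-trail --s3-bucket-name example-trail-logs
aws cloudtrail start-logging --name org-trail
```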
I appreciate that your complaint (totally valid!) was "this is a rabbit hole", and I just gave you two options, and that might not help your perception that it's a rabbit hole. If you find yourself paralyzed by choice, either choice is much better than deferring the choice! Just pick one. Heck, if you can't pick, let me help: pick AWS hosted Elasticsearch.
A lot of people (also in the security space) like Splunk. I find it annoying to deploy (I've heard rsyslog-in-front-of-forwarders described as the canonical deployment method for just ingesting syslog more than once, because reasons) and overpriced. YMMV.
Disclaimer: shameless plug! You're not the only one with your hair on fire. One of the first things we're doing for Latacora customers is setting up a centralized logging pipeline.
I think it's really important to internalize the idea that there is no Platonic ideal of a logging solution. It's a fundamentally frustrating manifestation of entropy that you're going to wrestle with, but it's a really necessary goal to work towards long term. Sort of a "the first step is admitting powerlessness" kind of deal.
The trick to Cloudwatch is --- like most AWS services --- never using the web UI.
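In that spirit, here are CLI equivalents for the two most common UI tasks (the log-group name is illustrative; `aws logs tail` needs AWS CLI v2):

```shell
# Tail a log group live:
aws logs tail /aws/lambda/example-fn --follow

# Search for errors in a log group:
aws logs filter-log-events \
  --log-group-name /aws/lambda/example-fn \
  --filter-pattern "ERROR"
```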
I've been using Loggly for my personal machines (~8, mostly cloud VPSes). On the plus side, it's free at my scale, and the analysis and reporting tools are nice at least in theory. On the minus side, I can't get my logs past 7 days archived to S3 without paying $150/month, which I really want since my main use-case is longer-term analysis and forensics.
I'm planning to switch to Papertrail, which for the princely sum of $7/mo will give me a simpler UI and a year's archiving to S3.
Loggly and Papertrail both use the same deployment strategy (you hook them up to syslog and/or your app's logging package), and I had Loggly up and running and providing useful feedback in solidly under four hours.
The killer feature for me is searching structured (JSON) logs. Just use the Logstash/Graylog library in the language of your choice and send the logs to Loggly, and you quickly have a logging system where you can zoom in on the logs coming from different subsystems of your codebase or produced by a specific user.
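If you're not using one of those libraries, emitting one JSON object per log line is small enough to sketch with the Python standard library (the field names here are my own choice, not a Loggly requirement):

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge structured context attached to the record, if any.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def make_logger(name):
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Anything attached as `record.context` (e.g. a user ID or subsystem name) becomes a top-level searchable field downstream.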
Disclaimer: I work at Sumo Logic. I would recommend https://www.sumologic.com. On top of grep-like searches, you can do analytical searches (SQL on text data).
* Sentry: https://sentry.io/welcome/
* Logentries: https://logentries.com/
* Loggly: https://www.loggly.com/
* Opbeat: https://opbeat.com/
* Papertrail: https://papertrailapp.com/
Sentry is open source and there is even an official up-to-date docker image: https://hub.docker.com/_/sentry/
Loggly published an "Ultimate Guide to Logging": https://www.loggly.com/ultimate-guide/
We also have streaming log parsers to connect your data. That whole thing about "creating new alerts in minutes" is trivial in our platform, since everything is based on SQL.
Unlike Splunk or ELK, our solution is based on in-memory streams so you don't have to wait for data to be indexed to fire off alerts on anomalous activity. Feel free to message me to find out more or simply download the product from http://www.striim.com/
See https://logentries.com/ for an example
> The discovery of a root cause is an important milestone that dictates the emotional environment an incident will take place in, and whether it becomes unhealthy or not.
> A grey cloud will hover over a team until a guiding root cause is discovered. This can make people bad to one another. I work very hard to avoid this toxicity with teams. I remember close calls when massive blame, panic, and resignations felt like they were just one tough conversation away.
[1] https://medium.com/starting-up-security/red-teams-6faa8d95f6...
One piece of advice that I'd give out with such cases is to listen to your Spidey Sense. A lot of organizations will say, after the fact, "well... something didn't seem right with Bob...". If you sense something isn't right, prepare to secure evidence and analyze it. Don't put IT assets back into circulation if there's doubt, and don't sit on it.
- Yes, centralized logging is the biggest thing. What you put into it matters; queryability matters; but nothing matters as much as having that centralized logging pipeline to begin with. Once you have that, you can start adding other relevant metadata, like host config states, API calls, et cetera.
- Giving employees a budget to buy the device they want is probably a better idea than BYOD. Strong password policies still matter. If it's BYOD, you probably still want to bring the device into policy. That can include physical rules (only do work on the VPN or from the office) and software ones (you can use any device you want, but it has to be running our osqueryd or whatever). Unfortunately, visibility becomes a double-edged sword: there are good legal and ethical reasons for not wanting to see everything on an employee's laptop. (Overall, I think BYOD is a bad idea for most companies.)
- 2FA is pretty cool. It doesn't just solve the usual "bad/compromised password" model -- it also typically makes it a lot harder for employees to mismanage their credentials (e.g. re-use the same SSH keys and have their personal box be compromised). For some reason, having that around seems to remind developers that you can make users re-authenticate for important/unusual actions -- you don't just have to count on the ambient authority of a session cookie.
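For a sense of how little magic sits behind the most common second factor: here is a minimal TOTP sketch per RFC 6238 (HMAC-SHA1 over a 30-second time counter). This is for illustration; a real deployment should use a maintained library.

```python
import base64
import hashlib
import hmac
import struct
import time


def totp(secret_b32, at=None, step=30, digits=6):
    """Compute an RFC 6238 TOTP code from a base32-encoded shared secret."""
    key = base64.b32decode(secret_b32, casefold=True)
    # The moving factor is the number of `step`-second intervals since the epoch.
    counter = int((at if at is not None else time.time()) // step)
    msg = struct.pack(">Q", counter)
    digest = hmac.new(key, msg, hashlib.sha1).digest()
    # Dynamic truncation (RFC 4226, section 5.3).
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % (10 ** digits)).zfill(digits)
```

Both sides share only the secret; the server recomputes the code for the current window and compares.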
- We'd all like to imagine that we're going to be attacked by space alien 0day ninjas. Realistically, the main vector is an employee (rogue or confused deputy). Trainings are boring and don't work. Signature-based detection gets outdated pretty quickly. I've done a little work on faster analysis tools -- I'm hoping we get a lot better at unobtrusively protecting people from even spearphishing in the next few years. (The tools we're building at Latacora are ready to beat a lot of attacker tactics right now, but I think we have an arms race ahead of us. Boring domain generation algorithms still aren't detected by most organizations, so there's not a lot of evolutionary pressure.)
- I have no idea if we'll get better at quantifying metrics for debt and security risk. I did a little bit of research into this, and it's a wide open field. You can get decent high-level reports with a "DEFCON number", but most of these models are not sophisticated in the sense you'd expect actuarial tables to be. And that's what they should be! It's revenue-at-risk! Step one here is fortunately getting all of that data into that centralized logging pipeline, and security professionals seem to mostly agree that's what you do first, so hopefully we get better here.
> This can either mean one of a few things: These environments don’t exist at all, there aren’t many of them, or they don’t see incidents that would warrant involving IR folks like myself.
What are these secrets stores? Do they exist?
Sometimes, it's as simple as a shared password store (I've used one powered by GPG, for example). This is better than YOLO password policy, but not by much: humans still see individual keys.
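A GPG-powered shared store like the one described is often just `pass` (password-store). Roughly, assuming illustrative key IDs and entry names:

```shell
# Initialise a store encrypted to the team's GPG keys:
pass init alice@example.com bob@example.com

# Add an entry, then retrieve it (decrypts with your GPG key):
pass insert services/db-admin
pass show services/db-admin
```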
If you want to be really fancy, you authenticate the human and then decide what they get to do, in a centralized fashion. This is often tricky to do, because you either don't have the funds to do that if you're small, or you have too many services to interact with if you're big. (Many organizations get pretty close -- I'm told that the DoD pretty much authenticates everything with smart cards, for example.)
Sometimes, it means a more automated system where software authenticates instead of a human, and it gets e.g. a certificate. Usually it's still the same certificate every time, though; so the main difference is just whether a human or a machine is authenticating.
Sometimes, it means an HSM (hardware security module). These are secure physical devices that perform cryptographic operations for you, so that the key stays on the device.
I fail to see how it is secured. (Though I can understand that it is less bad than a YOLO policy.)
> Many organizations get pretty close -- I'm told that the DoD pretty much authenticates everything with smart cards, for example.
I've been at a place with RSA SecurID (smart card and OTP) + an Active Directory account as SSO authentication for everything (use one or both for 2FA). It was nice and well done.
I thought banks seem to have solved a lot of that.