What DevOps pager/on-call schedule works best for your team? And are there any best practices that are noteworthy?
Never page if it's not an absolute dire emergency. One server out of a cluster - Next Business Day. Failed disk - NBD, unless you're out of hot spares.
Automate as much remediation as possible, so problems get fixed without you having to touch anything. Service down? Try restarting it. Still down? Maybe then consider an email or page.
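A minimal sketch of that "restart first, page second" idea, in Python. The names here (`remediate`, `check`, `restart`, `notify`) are made up for illustration; how you actually check health, restart, and send the page depends entirely on your own tooling.

```python
from typing import Callable


def remediate(
    check: Callable[[], bool],      # hypothetical: returns True if the service is healthy
    restart: Callable[[], None],    # hypothetical: restarts the service
    notify: Callable[[str], None],  # hypothetical: sends the email or page
    max_restarts: int = 1,
) -> str:
    """Try automated restarts before involving a human."""
    if check():
        return "healthy"
    for _ in range(max_restarts):
        restart()
        if check():
            # The restart fixed it, so nobody gets woken up.
            return "recovered"
    # Still down after the automated attempts: now a human is needed.
    notify("service still down after automated restart")
    return "escalated"
```

The point of the design is that `notify` is only reached once the cheap automated fix has already failed.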
Other stuff
+Monthly or quarterly sync up meetings between all pager people. Doubly so during super critical times for the business to ensure stability.
+Single email list/PDL for the on-call (+ manager) so they can communicate about issues, as well as be cc'd on vendor support tickets (helps with hand offs)
+FAQ for your services so you don't have to wake the DBA or web admin until you know it's really hosed.
+(Sounds silly, but bears mentioning) During pager hand-off, last week's guy and this week's guy should talk about what happened and whether there's anything the new guy should know.
Agreed, we were thinking of doing week long rotations (Tuesday - Tuesday) with a "hand off conversation" happening on Tuesdays.
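For what it's worth, a Tuesday-to-Tuesday weekly rotation is easy to compute from a roster and the date of the first hand-off. This is only a sketch; the function name, the roster, and the dates in the usage lines are all made up for illustration.

```python
from datetime import date


def on_call_for(day: date, roster: list[str], first_handoff: date) -> str:
    """Return who carries the pager on `day` for a weekly Tuesday hand-off."""
    if first_handoff.weekday() != 1:  # Monday is 0, so Tuesday is 1
        raise ValueError("first_handoff must be a Tuesday")
    # Whole weeks elapsed since the first hand-off; the roster simply cycles.
    weeks = (day - first_handoff).days // 7
    return roster[weeks % len(roster)]


# Hypothetical roster and start date (2024-01-02 was a Tuesday).
roster = ["alice", "bob", "carol"]
start = date(2024, 1, 2)
print(on_call_for(date(2024, 1, 8), roster, start))  # Monday, still week 0
print(on_call_for(date(2024, 1, 9), roster, start))  # Tuesday, hand-off day
```

The hand-off conversation would then happen on each date where the result changes.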
The reason for this discussion is that, up until a certain seniority level, you get "hazard pay" for carrying the pager. You get paid 1 hour for every so many hours you're on call. A weekend day or holiday counts as 24 hours, instead of 8 on the day you receive it or 16 on a weekday.
You should also cover rules for holding the pager. Ours include no alcohol, and no more than 1 hour away from the site (certain emergencies may require on-site visits). You also need to respond within 20 minutes, otherwise it gets escalated, or in certain larger locations, sent to the backup on-call person.
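The 20-minute response rule can be written down as a tiny piece of logic. This is a rough sketch only: the function name and arguments are invented here, and a real setup would normally let the paging tool handle escalation.

```python
from datetime import datetime, timedelta
from typing import Optional

RESPONSE_WINDOW = timedelta(minutes=20)  # the response time from the rule above


def current_target(
    paged_at: datetime,
    now: datetime,
    acked_at: Optional[datetime],
    primary: str,
    backup: str,
) -> Optional[str]:
    """Return who should be holding the page right now, or None if it was answered."""
    if acked_at is not None:
        # Somebody responded, so there is nothing left to escalate.
        return None
    if now - paged_at >= RESPONSE_WINDOW:
        # No response inside the window: hand it to the backup on-call person.
        return backup
    return primary
```

Whether the escalation target is a backup person or a manager is a local policy choice; the sketch just assumes a single backup.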
http://blog.pagerduty.com/2011/03/on-call-best-practices-par...
This is a series of posts that have pretty sane defaults. I personally would not do a daily rotation, but rather rotations of 5 days with alternating weekends (one guy does M-F, one guy does Sat/Sun), and then you switch off.
First let me give you some advice about "what an emergency is" and "how we are alerted". You need to define what an emergency is in your company and notify everyone (with clear guidelines on "how to get help"), so that you limit pages to critical issues. Post this on an internal wiki (you have a wiki, right?). What is really worth getting woken up and coming into the office for? Alerts are issued like this: Nagios alerts go to email and are generally not emergencies (a couple of checks do fire SMS alerts), so I treat email alerts as issues for the workday and do not check them on the weekend. An automated system scans syslog for alerts that might be an emergency, based on prior experience (e.g. db errors about a disk subsystem). We also have our apps log emergency issues to syslog, and if one is triggered, an SMS goes out to the group. Finally, any helpdesk ticket (you have a helpdesk, right?) with "emergency" in the title issues a page; users know to do this via the "what an emergency is" wiki page.
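To give a rough idea of the syslog scanning, here is a minimal Python sketch. It is not our actual system: the patterns and the sample log lines are invented for illustration, and sending the SMS is left to whatever the caller uses.

```python
import re
from typing import Iterable, Iterator

# Illustrative patterns only; build your own list from prior incidents.
EMERGENCY_PATTERNS = [
    re.compile(r"I/O error", re.IGNORECASE),  # e.g. disk subsystem trouble
    re.compile(r"\bEMERGENCY\b"),             # marker an app might log itself
]


def emergency_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the syslog lines that match a known emergency pattern."""
    for line in lines:
        if any(p.search(line) for p in EMERGENCY_PATTERNS):
            yield line.rstrip("\n")


# Made-up sample input; in practice this would be a tail of the syslog file.
sample = [
    "Jan  2 03:04:05 db1 kernel: Buffer I/O error on device sda1\n",
    "Jan  2 03:04:06 web1 app: EMERGENCY payment queue stalled\n",
    "Jan  2 03:04:07 web1 sshd: Accepted publickey for deploy\n",
]
for hit in emergency_lines(sample):
    print("would page:", hit)
```

Keeping the pattern list short and driven by real incidents is what keeps the pages meaningful.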
When a page comes in, if you can take it, you simply send an SMS "ACK" to the group. This tells everyone that you have accepted the page and are now the owner. This helps us load balance across everyone's lives. If you need help, you pull in other people as needed. You also send an SMS "All Clear" when the issue is resolved; this will typically go alongside an email to the group with an issue summary.
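The ACK / All Clear convention boils down to a very small state machine. A sketch, assuming the first "ACK" wins and that an "All Clear" only counts once someone owns the page (both are assumptions made here; the class and method names are invented too):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    summary: str
    owner: Optional[str] = None
    resolved: bool = False

    def handle_sms(self, sender: str, text: str) -> None:
        """Update the page from a group SMS reply."""
        msg = text.strip().lower()
        if msg == "ack" and self.owner is None:
            # Assumption: the first ACK takes ownership, later ones are ignored.
            self.owner = sender
        elif msg == "all clear" and self.owner is not None:
            self.resolved = True


page = Page("db disk errors")     # hypothetical incident
page.handle_sms("ann", "ACK")     # ann now owns the page
page.handle_sms("bob", "ACK")     # ignored, ann already accepted it
page.handle_sms("ann", "All Clear")
```

A page that stays with `owner` unset for too long is the case worth escalating.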
This entire system does not need to be complex. Start simple and iterate as needed. There also needs to be a process for finding out what happened afterwards: do we need more monitoring, additional syslog triggers, etc.?
ps. our UPS, HVAC, and security systems can issue pages via SMS too as needed. I didn't mention this because it is highly dependent on our environment. We also use a modem and landline to issue these pages. We have a Linux server with qpage [1] running on it, which issues the pages by dialing a landline at a telco. This allows us to issue pages even if our network link goes down.
pps. check out my website @ http://sysadmincasts.com/ where I plan to cover issues like this.
We're an established team/product, so we have an internal wiki, help/support desk, and use PagerDuty. We just want to shift away from only a few people (basically 2) handling DevOps emergencies and spread the experience over more members of our engineering team.
With the "all on call/no schedule" route, have you ever had a scenario where no one acknowledged an issue?
Keep a release schedule, stick to it, do not deviate. If you can't get your stuff tested before the deadline, that's on you, and your peers should not suffer. Make sure that all engineers receive alerts via email; it's necessary to "share the pain" so that people get an idea of what their mistakes cause. Weekly rotation is probably the best thing to do, so that there is a consistent point person for the week.
Our overall goal is to keep the DevOps skills sharp across as many members of the team as possible...