The job of IT operations (IT Ops) has always included many an underappreciated responsibility, but perhaps none as thankless as “on call.” The need, it seems, is obvious. After all, we live in a 24-hour-a-day economy, and someone has to keep an eye on the services that the business relies on — even at 2:00 a.m., on a Tuesday, during the holidays. So, like it or not, on call is just something IT Ops has to do in order to make sure the business continues to run and the rest of us don’t walk into a huge business disaster when we show up to the office the next morning.
Or is it?
In his breakout session, “Skeptics in the Church of Data Getting Evangelical,” Aaron Aldrich of DevOps consulting firm Cage Data questioned that very notion and proposed two simple ideas that could help eliminate the need for on call altogether:
- Get smarter about what you’re monitoring (and why you’re monitoring it).
- Automate (intelligently) anywhere you can.
At Zenoss, we love to tout the extensibility of our platform. And while it’s true that we really can monitor just about anything, if you happen to be the person whose phone wakes you up multiple times a night with alerts that are purely informational or can’t be fixed until the morning anyway, you might actually find yourself wishing our coverage weren’t so broad.
So, what was Aaron’s advice? Well, as a starting point, ask some fundamental questions:
- Am I collecting the right data? Low-priority alerts, or those that applications like Zenoss can provide for informational or planning purposes, should have policies that ensure they’re not alerting in the same way as critical-severity issues, and they should not be going to on call. Data out of context* can do more harm than good, as it can “train” operators to ignore more important information.
- Why am I monitoring the things I’m monitoring? Often, systems and services, especially business-critical ones, have built-in redundancies. If your web service runs on a cluster of virtual servers that automatically load balance, you really don’t need (or want) to get a full-blown alert every time one of them reaches a capacity threshold.
- Are the right people (or systems) getting the right information? This is where automation and orchestration between tools can really pay dividends. If low-level tasks, like system restarts, can be automated, and tools can provision additional resources for critical services based on policies and runbooks, then nobody needs to spring into action.
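To make those three questions a little more concrete, here is a minimal sketch of what such a routing policy might look like in Python. Everything in it is an assumption made for illustration: the `Alert` fields, the `Severity` levels, and the destination names are hypothetical and are not part of Zenoss or any other product’s API.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    INFO = 1
    WARNING = 2
    CRITICAL = 3


@dataclass
class Alert:
    source: str
    severity: Severity
    healthy_replicas: int = 0             # other cluster members still serving traffic
    has_runbook_automation: bool = False  # a scripted fix exists for this alert


def route(alert: Alert) -> str:
    """Return a destination: 'automation', 'page', 'ticket', or 'log'."""
    # Right people (or systems): if a tool can handle it, no human is needed.
    if alert.has_runbook_automation:
        return "automation"
    # Only wake someone when it is critical AND nothing redundant is covering.
    if alert.severity == Severity.CRITICAL and alert.healthy_replicas == 0:
        return "page"
    # Worth a look, but it can wait until the morning.
    if alert.severity >= Severity.WARNING:
        return "ticket"
    # Informational or planning data: record it, alert no one.
    return "log"


print(route(Alert("web-03", Severity.WARNING, healthy_replicas=3)))
print(route(Alert("db-01", Severity.CRITICAL)))
```

In this sketch, a capacity warning on one node of a healthy cluster becomes a ticket, while a critical alert with no redundancy left is the only thing that pages on call. Where exactly to draw those lines is a judgment call for each team.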
Aaron also did a great job of calling out some practical ways to decide what to automate. For instance, consider whether the policies that currently trigger an alert could be turned into a self-healing process. To borrow one of his examples: if every time you get a “high memory usage” alert you know to check a leaky application and restart its daemon, maybe it’s possible to create a process that runs that task automatically when notified of the same issue with the same parameters, instead of triggering an alert to IT Ops. Then, instead of babysitting a misbehaving process, the team can look into ways of fixing the underlying bug.
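As a rough illustration of that “high memory usage” example, the sketch below shows one way a self-healing handler could be wired up in Python. It is not Aaron’s implementation or anything shipped by Zenoss: the `SELF_HEALING` table, the `leaky-app` service name, and the `restart` and `page` callbacks are all hypothetical stand-ins for whatever your own tooling provides. The restart cap is an added assumption, there so that a fix that isn’t working still reaches a human.

```python
from collections import Counter
from typing import Callable

# Hypothetical alert signatures that have a known, safe remediation.
SELF_HEALING = {("high memory usage", "leaky-app")}


def handle_alert(
    kind: str,
    service: str,
    restart: Callable[[str], bool],  # returns True if the restart succeeded
    page: Callable[[str], None],     # notifies on call
    attempts: Counter,               # restarts already tried, per signature
    max_restarts: int = 3,
) -> str:
    """Run the known fix for a recognized alert; otherwise page a human."""
    key = (kind, service)
    if key not in SELF_HEALING:
        page(f"{kind} on {service}")
        return "paged"

    attempts[key] += 1
    if attempts[key] > max_restarts:
        # The automated fix keeps being needed: escalate instead of looping.
        page(f"{kind} on {service}: restarted {max_restarts} times, still failing")
        return "paged"

    if restart(service):
        return "restarted"

    page(f"{kind} on {service}: restart failed")
    return "paged"


pages = []
attempts = Counter()
result = handle_alert(
    "high memory usage", "leaky-app",
    restart=lambda service: True,  # stand-in for a real restart command
    page=pages.append,
    attempts=attempts,
)
print(result, pages)
```

Keeping the count of automated restarts also gives the team a number to point at when making the case for fixing the underlying bug.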
Given Aaron’s advice, it’s no wonder that companies like VictorOps and SaltStack sponsored this year’s GalaxZ17 event. Used correctly, integration and automation could indeed be gifts from on high that ease the burden of IT Ops tasks such as on call. And given Urban Dictionary’s definition of “getting religion” (“To change up ways and policy as if having had an epiphany”), it makes sense that those who think of data as the most sacred resource within IT monitoring are getting evangelical about cleaning up which data they pay attention to.
I guess it was a good thing I wore my Sunday best this year.
*“Data” out of context (something else I borrowed from Aaron!):