Applying Google’s SRE Principles to IT Monitoring

At one of my annual reviews a while ago, I said, “I’m frustrated. Thinking back over last year, I can’t remember any one thing that made a difference to the business.”

I don’t recommend telling your manager that unless you have a great relationship! But I bet sometimes you feel the same way, especially in IT operations. Often, it seems like a never-ending stream of issues to be solved. How can you communicate your accomplishments effectively?

Here’s what I suggest: identify a handful of big objectives that make sense to the business and that you can measure. The objectives should be bigger than your own narrow tasks, because you’re part of a team, after all.

This is exactly the approach that Google suggests in their book on site reliability engineering, which is fast becoming a critical read for IT operations teams. In it, they say, “It’s impossible to manage a service correctly, let alone well, without understanding which behaviors really matter for that service and how to measure and evaluate those behaviors.”

Or, translated: know the primary objectives and key performance indicators (KPIs) so you don’t waste your life reacting constantly without getting anything important done.

With Zenoss as a Service, we’re thinking about key objectives that we can accomplish for customers, as well as the KPIs we can use to measure them. Here are five we’re considering.

1. Metric Collection Success

You use Zenoss to proactively measure IT components in many, many ways. We want to help you by ensuring that collection is straightforward.

The calculation is simple:

Data points collected per day

-----------------------------  = Metric Collection Success Rate

Data points desired per day

 

A moderately sized customer can easily attempt 1,000 metric measurements each second, which at a steady rate is roughly 86 million measurements per day.

If you can rely on our collection service, you will have useful, timely information with which to make informed decisions.
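To make the arithmetic concrete, here is a minimal sketch in Python. It assumes a steady 1,000 measurements per second for a full day, and the collected count is a made-up sample. The function name `success_rate` is hypothetical and is not part of any Zenoss API.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def success_rate(succeeded: int, attempted: int) -> float:
    """Return succeeded / attempted as a fraction between 0 and 1."""
    if attempted <= 0:
        raise ValueError("attempted must be positive")
    if not 0 <= succeeded <= attempted:
        raise ValueError("succeeded must be between 0 and attempted")
    return succeeded / attempted


# Assumption: a steady 1,000 measurements per second, all day.
desired_per_day = 1_000 * SECONDS_PER_DAY

# Hypothetical sample: one measurement in every thousand was missed.
collected_per_day = desired_per_day - desired_per_day // 1_000

rate = success_rate(collected_per_day, desired_per_day)
print(f"Metric Collection Success Rate: {rate:.1%}")
```

The same ratio works for any "successes over attempts" objective, so a helper like this could be reused wherever the numerator and denominator are simple daily counts.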

2. Event Collection Success

IT components have become very good at self-monitoring. We continuously listen for self-reported errors and manage risk by proactively checking for continued operations.

Again, the calculation is simple:

Events collected per day

------------------------  = Event Collection Success Rate

Events generated per day

 

3. Model Update Success

It’s a given that systems and applications fail. With Zenoss software, our customers can focus on rapid recovery — generally reducing downtime by 50 percent or more.

A key tool is our continuously updated model of the configuration, dependencies and relationships among the components. The model lets customers rapidly determine the root cause of a failure across multiple connected systems. Our software typically updates the model thousands of times a day, focusing on rapidly changing systems like public clouds and virtualization.

The calculation is, again, straightforward:

Successful model updates per day

--------------------------------  = Model Update Success Rate

Total model updates per day

 

The change to the Solr index in Resource Manager (RM) 6 has created a huge improvement in this objective for customers who have updated, by the way — more than a 90 percent reduction in failures is common.

4. Noise Reduction

We collect far more data from measurements, events and model changes than any set of people can properly evaluate. A key focus for our team is to identify actionable events from the sea of data. We employ key technologies like data deduplication, fixed and variable thresholding, trend analysis, analytical predictions, and automated root-cause analysis. We’re also exploring various machine learning and anomaly detection techniques.

We calculate this objective by considering events.

Total collected events per day - Actionable events per day

----------------------------------------------------------  = Noise Reduction Success Rate

Total collected events per day
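As an illustration only, the ratio could be computed from a day’s worth of events like this. The `Event` class and its `actionable` flag are hypothetical stand-ins for whatever the deduplication, thresholding and root-cause steps decide. They are not Zenoss data structures.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Event:
    device: str
    summary: str
    actionable: bool  # hypothetical flag set by upstream filtering


def noise_reduction_rate(events: Iterable[Event]) -> float:
    """Return the fraction of collected events that were filtered out as noise."""
    collected = list(events)
    total = len(collected)
    if total == 0:
        raise ValueError("no events collected")
    actionable = sum(1 for event in collected if event.actionable)
    return (total - actionable) / total


# Made-up sample: four collected events, one of which needs a human.
sample = [
    Event("switch-01", "interface flapping", False),
    Event("switch-01", "interface flapping", False),
    Event("db-02", "disk 91% full", False),
    Event("db-02", "disk 98% full, growth trend predicts outage", True),
]
rate = noise_reduction_rate(sample)
print(f"Noise Reduction Success Rate: {rate:.0%}")
```

A higher number means fewer events reach a person, so this metric is only meaningful alongside a check that the actionable events are the right ones.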

 

5. Access to Data

All that data collection and processing does you no good unless everybody on your team can get to whatever data they need, whenever they need it. You need the user interface to be there.

We calculate this by considering every request to the user interface.

Successful user interface requests per day

------------------------------------------  = Data Access Success Rate

Total user interface requests per day
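What counts as a "successful" request is a judgment call. One possible sketch, assuming the user interface is served over HTTP and that only server-side errors (status 500 and above) count as failures, looks like this. Both the function name and that definition of failure are assumptions for illustration.

```python
from typing import Iterable


def data_access_success_rate(status_codes: Iterable[int]) -> float:
    """Return the fraction of requests that did not end in a server error."""
    codes = list(status_codes)
    if not codes:
        raise ValueError("no requests recorded")
    # Assumption: 5xx responses are failures; 4xx responses such as a
    # mistyped URL are treated as the service working correctly.
    failures = sum(1 for code in codes if code >= 500)
    return (len(codes) - failures) / len(codes)


# Made-up sample of one day's response codes.
sample_codes = [200, 200, 304, 404, 500, 200, 503, 200]
rate = data_access_success_rate(sample_codes)
print(f"Data Access Success Rate: {rate:.0%}")
```

A stricter definition could also count slow responses as failures, which would require recording latency alongside each status code.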

 

We Need Your Help

It’s true that the simple calculations above hide an enormous amount of implementation work, mostly for Zenoss. We’re starting to do that work, and you can see some of the results in the self-monitoring features in RM 6.

We’d like to take it further, particularly as more and more customers begin to rely on Zenoss to operate their monitoring systems. It’s only fair for us to tell you our objectives and help you identify obstacles in your way.

Before we charge ahead with this big investment, I’d like to hear what you think about the five objectives above. Are they sufficient? Are they useful? Could you use them to communicate internally? Please email me at kerickson@zenoss.com with the subject line “Objective Feedback.”

If you’re able to attend our GalaxZ user conference in June, I’d be very happy to talk to you individually, too!