Is Your Performance “Normal”? How Do You Know?


IT operations environments are very “noisy”. With tens of thousands of events coming in from thousands of different devices, it’s hard to know when things are “normal” and when things are “not normal”. This is particularly challenging because event traffic and volume from devices can vary depending on the time of the day, the day of week, the week of month, and even the time of year.

Given the large volume of performance events generated by the thousands of devices in a typical enterprise data center, and the wide range of “normal” that can exist across those devices, how can you tell when a performance variation on a device is aberrant behavior (a problem, or “abnormal”) and when it is just “business as usual”, part of “normal” operations?

Knowing What “Normal” Is

The first step in knowing whether or not your devices are performing “normally” is defining what “normal” means.

The Predictive Thresholding ZenPack from Zenoss is designed specifically to help IT Operations teams better understand what “normal” performance is for devices in an enterprise IT infrastructure monitoring environment. With predictive thresholding, you can see how your data points vary over time for devices during normal operations. In addition, once you know what “normal” for a device is, you can create threshold events in Zenoss that notify you when device performance is outside of the predicted bands of “normal” behavior.

Zenoss uses RRDTool to store performance data. RRDTool includes an implementation of the Holt-Winters time series forecasting algorithm, which adaptively predicts future observations in a time series. The Predictive Thresholding ZenPack uses this Holt-Winters exponential smoothing implementation for predictive thresholding.
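
As a rough illustration of what a Holt-Winters archive looks like at the RRDTool level, the sketch below builds the arguments for an `rrdtool create` call. The `RRA:HWPREDICT:rows:alpha:beta:seasonal period` form comes from the rrdcreate documentation; the file name, data source, parameter values, and the two helper functions are assumptions made for this example, and it is not meant to show how the ZenPack itself creates its files.

```python
# Sketch only: builds the argument list for an `rrdtool create` call that
# includes a Holt-Winters forecasting archive. The file name, data source,
# and parameter values are illustrative assumptions, not Zenoss defaults.

def hwpredict_rra(rows, alpha, beta, seasonal_period):
    """Format an RRA:HWPREDICT definition (rows:alpha:beta:seasonal period)."""
    return f"RRA:HWPREDICT:{rows}:{alpha}:{beta}:{seasonal_period}"


def create_command(path, step_seconds, data_source, archives):
    """Assemble the full command line as a list of arguments."""
    return ["rrdtool", "create", path, "--step", str(step_seconds),
            data_source, *archives]


SAMPLES_PER_DAY = 24 * 60 * 60 // 300  # 288 samples with a 5-minute step

command = create_command(
    "throughput.rrd",
    300,
    "DS:ifOutOctets:COUNTER:1800:0:U",
    [
        "RRA:AVERAGE:0.5:1:2016",
        # Keep five days of forecasts; one seasonal cycle is one day.
        hwpredict_rra(rows=5 * SAMPLES_PER_DAY, alpha=0.1, beta=0.0035,
                      seasonal_period=SAMPLES_PER_DAY),
    ],
)
print(" ".join(command))
```

According to the rrdcreate documentation, specifying the HWPREDICT archive on its own causes RRDTool to create the companion seasonal, deviation, and failure archives implicitly.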

The forecast created by the Holt-Winters algorithm is the sum of three components: a baseline (or intercept), a linear trend over time (or slope), and a seasonal coefficient (a periodic effect, such as a daily cycle). After a value is observed, each of these components is updated via exponential smoothing, so the algorithm learns from past values and uses them to predict the future. The rate of adaptation is governed by three parameters: alpha (intercept), beta (slope), and gamma (seasonal). The prediction can also be viewed as a “smoothed” value for the time series.

Note: For more detailed information on RRDTool and Holt-Winters, see the following resources: rrdcreate Command and Notes on RRDTOOL implementation of Aberrant Behavior Detection.
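
To make the three components concrete, here is a minimal Python sketch of one additive Holt-Winters step: the forecast is baseline + slope + seasonal coefficient, and each component is then updated by exponential smoothing. The class and attribute names are hypothetical, and the code is a simplified illustration rather than the RRDTool or ZenPack implementation.

```python
from dataclasses import dataclass, field


@dataclass
class HoltWinters:
    """Additive Holt-Winters model (illustrative sketch, hypothetical names)."""
    alpha: float          # adaptation rate for the baseline (intercept)
    beta: float           # adaptation rate for the slope (linear trend)
    gamma: float          # adaptation rate for the seasonal coefficients
    period: int           # observations per seasonal cycle (e.g. per day)
    baseline: float = 0.0
    slope: float = 0.0
    seasonal: list = field(default_factory=list)
    step: int = 0

    def __post_init__(self):
        if not self.seasonal:
            self.seasonal = [0.0] * self.period

    def predict(self) -> float:
        """Forecast the next value as the sum of the three components."""
        return self.baseline + self.slope + self.seasonal[self.step % self.period]

    def update(self, observed: float) -> float:
        """Return the forecast for this step, then smooth in the observation."""
        forecast = self.predict()
        i = self.step % self.period
        old_baseline, old_season = self.baseline, self.seasonal[i]
        self.baseline = (self.alpha * (observed - old_season)
                         + (1 - self.alpha) * (old_baseline + self.slope))
        self.slope = (self.beta * (self.baseline - old_baseline)
                      + (1 - self.beta) * self.slope)
        self.seasonal[i] = (self.gamma * (observed - self.baseline)
                            + (1 - self.gamma) * old_season)
        self.step += 1
        return forecast


model = HoltWinters(alpha=0.5, beta=0.5, gamma=0.5, period=2)
model.update(8.0)          # first observation moves all three components
print(model.predict())     # forecast for the next time step
```

Larger values of alpha, beta, and gamma make the model follow recent observations more closely; smaller values make it adapt more slowly and smooth more heavily.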

Detecting Aberrant Behavior

Once you know what “normal” is, the Predictive Thresholding ZenPack uses the Holt-Winters algorithm to predict what the next data point should be. If the observed data point then falls outside the predicted range, Zenoss can detect this aberrant behavior and generate an event that notifies you of the situation so you can investigate further.

Aberrant behavior detection in Holt-Winters is decomposed into three pieces, each building on its predecessor:

  1. An algorithm for predicting the values of a time series one time step into the future.
  2. A measure of deviation between the predicted values and the observed values.
  3. A mechanism to decide if and when an observed value or sequence of observed values is “too deviant” from the predicted value(s).
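
The second and third pieces can be sketched as follows, assuming the deviation measure is an exponentially smoothed absolute error and the "too deviant" test is a band around the forecast scaled by that deviation. This follows the general approach described in the RRDTool aberrant behavior notes, but the function names and the default scale of 2.0 are assumptions for illustration.

```python
def update_deviation(prev_deviation, observed, forecast, gamma):
    """Exponentially smooth the absolute forecast error."""
    return gamma * abs(observed - forecast) + (1 - gamma) * prev_deviation


def confidence_band(forecast, deviation, scale=2.0):
    """Return the (lower, upper) bounds of expected behavior."""
    return forecast - scale * deviation, forecast + scale * deviation


def is_violation(observed, forecast, deviation, scale=2.0):
    """True when the observation falls outside the confidence band."""
    lower, upper = confidence_band(forecast, deviation, scale)
    return observed < lower or observed > upper


deviation = update_deviation(prev_deviation=4.0, observed=14.0,
                             forecast=10.0, gamma=0.5)
print(confidence_band(10.0, deviation))
print(is_violation(90.0, forecast=10.0, deviation=deviation))
```

A single violation is not necessarily an alert; deciding how many violations matter is a separate step.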

You can configure predictive thresholding in Zenoss using a variety of parameters, including the window length (the number of recent observations considered for alerting), the number of observations within that window that must violate the threshold before alerting, and the number of standard deviations away from the predicted value an observation must be before it counts as a violation.
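
One way to picture the window and failure-count parameters is a sliding window over recent observations that alerts only when enough of them violated the band. The sketch below is an assumption-laden illustration: the class name and the default values of 9 and 7 are chosen for the example and are not taken from the ZenPack.

```python
from collections import deque


class FailureWindow:
    """Alert when enough recent observations violated the predicted band."""

    def __init__(self, window_length=9, threshold=7):
        self.threshold = threshold
        # deque(maxlen=...) silently drops the oldest entry when full.
        self.recent = deque(maxlen=window_length)

    def observe(self, violated: bool) -> bool:
        """Record one observation and report whether to alert."""
        self.recent.append(violated)
        return sum(self.recent) >= self.threshold


# Alert once 2 of the last 3 observations were outside the band.
window = FailureWindow(window_length=3, threshold=2)
for violated in [True, False, True, False]:
    print(window.observe(violated))
```

Requiring several violations within the window trades a slightly later alert for fewer notifications caused by one-off spikes.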

Predictive Thresholding Examples

Now that you have a bit of a background on how predictive thresholding and the Holt-Winters algorithm work in Zenoss, use the following examples to help you see how you can use the Predictive Thresholding ZenPack in your own monitoring environment to detect aberrant behavior and notify you when it occurs.

  • Example #1 – Monitoring Router or Switch Throughput
  • Example #2 – Monitoring CPU Usage
  • Example #3 – Monitoring Disk Utilization
  • Example #4 – Monitoring the Number of Failed Connections

Example #1 – Monitoring Router or Switch Throughput

In this example, assume you are monitoring a router or a switch for throughput throughout the day. Also assume you have known, expected spikes in the amount of data you are sending and receiving throughout the day.

For example, when you do a backup at night between 1:00 a.m. and 2:00 a.m., your network pipe is typically full, and your switch or router runs at 90% capacity during the backup window. In your environment, this is normal, expected nighttime behavior that has been occurring for the last 12-18 months. You do not want to be notified every night when the backup kicks off just because the router or switch suddenly starts running at 90% capacity.

However, in this example, you do want to know if, at 3:00 p.m., your router or switch that typically runs at 10% capacity during the day has a sudden spike and is now running at 90-100% capacity.

In this scenario, when Zenoss detects that your throughput is outside the band of normally expected behavior for the time period, Zenoss generates an event that notifies you that your router or switch is no longer within its “normal” range. When you receive this notification, you know that you need to go and investigate the router or switch capacity issue Zenoss identified.
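
The scenario above can be approximated with a much simpler seasonal model than full Holt-Winters: learn an expected utilization for each hour of the day, then compare new readings against the value expected for that hour. Everything in this sketch, including the function names, the 14 days of identical training data, and the 5-point minimum margin, is a hypothetical setup for illustration.

```python
HOURS = 24


def learn_hourly_profile(days, gamma=0.5):
    """Exponentially smooth utilization and its deviation per hour of day."""
    profile = list(days[0])
    deviation = [0.0] * HOURS
    for day in days[1:]:
        for hour, value in enumerate(day):
            deviation[hour] = (gamma * abs(value - profile[hour])
                               + (1 - gamma) * deviation[hour])
            profile[hour] = gamma * value + (1 - gamma) * profile[hour]
    return profile, deviation


def is_abnormal(profile, deviation, hour, value, scale=2.0, floor=5.0):
    """Compare a reading with what is expected for that hour of the day."""
    # `floor` is an assumed minimum margin (percentage points) so that a
    # perfectly regular history does not produce a zero-width band.
    margin = max(scale * deviation[hour], floor)
    return abs(value - profile[hour]) > margin


# Backup window at 1:00 a.m. runs at 90%; the rest of the day sits at 10%.
typical_day = [90.0 if hour == 1 else 10.0 for hour in range(HOURS)]
profile, deviation = learn_hourly_profile([typical_day] * 14)

print(is_abnormal(profile, deviation, hour=1, value=90.0))   # nightly backup
print(is_abnormal(profile, deviation, hour=15, value=95.0))  # 3:00 p.m. spike
```

The same 90% reading is treated differently depending on the hour, which is the point of a seasonal baseline.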

Example #2 – Monitoring CPU Usage

If you are monitoring CPU usage on a server, you can use predictive thresholding to understand what the daily CPU usage patterns are on the server. For example, you can see that CPU utilization for a certain server runs at 10% at night. However, CPU utilization on the same server typically runs at 80% during the day.

Using the Predictive Thresholding ZenPack, you can have Zenoss monitor the amount of CPU you are using over time to see if CPU utilization is within expected ranges during day and evening hours. Then, if Zenoss detects two periods where your CPU is running at 10% or less during the day, Zenoss can generate an event notifying you of this aberrant behavior. You can then check the server to see if the application running on the server crashed.

Example #3 – Monitoring Disk Utilization

In this example, assume you are a storage administrator. You have a backup that runs every night. The backup adds 5 GB to your storage device every night, and this is expected behavior.

Now assume that one of your storage administrator “buddies” decides he needs a spot to store 50 GB of data related to a project he is working on. He puts it out on your 500 GB storage device, but “accidentally” forgets to tell you.

After your buddy deposits his data on your storage device, Zenoss detects that disk usage on your storage device has suddenly spiked outside the range of expected behavior. Zenoss generates an event that notifies you of this aberrant behavior, which in turn lets you know that someone has put something on your storage device that you did not expect. You can then investigate this issue further, and then go and have a talk with your “buddy”.
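
Disk usage that grows steadily is mostly a trend problem, so this scenario can be sketched with only the baseline and slope components. The code below tracks nightly usage, forecasts the next value as baseline + slope, and flags nights where the observed usage is far from the forecast. The usage figures, the 20 GB tolerance, and the function names are assumptions invented for the example.

```python
def nightly_residuals(usage_gb, alpha=0.5, beta=0.5):
    """Return observed minus forecast for each night after the first two."""
    baseline = usage_gb[1]
    slope = usage_gb[1] - usage_gb[0]   # initial estimate of nightly growth
    residuals = []
    for observed in usage_gb[2:]:
        forecast = baseline + slope
        residuals.append(observed - forecast)
        new_baseline = alpha * observed + (1 - alpha) * forecast
        slope = beta * (new_baseline - baseline) + (1 - beta) * slope
        baseline = new_baseline
    return residuals


def flag_spikes(usage_gb, tolerance_gb=20.0):
    """Return indexes into usage_gb where growth was far from expected."""
    return [night + 2
            for night, residual in enumerate(nightly_residuals(usage_gb))
            if abs(residual) > tolerance_gb]


# Usage grows 5 GB per night, then an unexpected 50 GB lands on night 5.
usage = [300.0, 305.0, 310.0, 315.0, 320.0, 375.0, 380.0]
print(flag_spikes(usage))
```

The expected 5 GB of nightly growth produces no residual, while the extra 50 GB shows up as a 50 GB gap between the forecast and the observation.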

Example #4 – Monitoring the Number of Failed Connections

In this example, assume you are logging the number of failed connections to your servers. Typically this is between zero and three failed connections an hour. Now assume that you are suddenly getting 3,000 failed connections in an hour. If you are logging your number of failed connections per hour and graphing this data as a performance metric, Zenoss can detect this aberrant behavior and notify you. You can then investigate the issue to see, for example, if you are facing a denial of service attack.

Try It Out!

If you are interested in trying out the Predictive Thresholding ZenPack in your environment, there are two versions available:

  • If you are a Zenoss Service Dynamics customer, use the Commercial Zenoss Predictive Thresholding ZenPack. This ZenPack is developed and maintained by Zenoss and requires a Zenoss paid subscription for use.
  • If you are a Zenoss Core user, use the free, open-source Community-developed and Community-supported Predictive Threshold ZenPack.

Learn More!

Interested in learning more about other types of Zenoss ZenPacks you can use to enhance and extend your monitoring capabilities?

Check out What Zenoss Monitors, and feel free to download one of our many data sheets that describe Zenoss extended monitoring capabilities in more detail.

See the ZenPack Catalog on the Zenoss Wiki for a complete listing of all of our more than 300 ZenPacks.

If you are new to Zenoss, learn more about how others are using the Zenoss Service Dynamics Unified IT Operations platform today to improve their monitoring efficiency and productivity and avoid outages: Four Profiles in Unified Monitoring Success.

Also learn more about the Zenoss Service Dynamics unified monitoring architecture: Zenoss Service Dynamics Architecture Overview.

Share This Tip!

If you’ve found this article helpful, feel free to share it with others on LinkedIn, Twitter, Google+ or Facebook, or follow our blog to get the latest news and information from Zenoss.