Want Better Event Correlation? AIOps or Better Monitoring – Where to Start

We all hate those phone calls. “Hey, did you know that the lead flow into Salesforce is down?” No matter how far IT has come, we still get calls when our customers find problems first. And then the finger-pointing starts. We’ve spent the last 10 years trying to make business run more smoothly by connecting our applications. But when something breaks, we’re left figuring it out by looking at one tool, then another, then another, until we’re stuck in a room arguing about whose tool is right. You’ve got too many tools, too many events and alerts, and more disjointed pieces of information than you know what to do with.

Enter event correlation — an insightful way to find a needle in a haystack or to find a needle you weren’t otherwise aware of. Wikipedia defines event correlation as “a technique for making sense of a large number of events and pinpointing the few events that are really important in that mass of information – this is accomplished by looking for and analyzing relationships between events.”  

This is what AIOps vendors say they can do, and they say they can do it with cutting-edge machine learning techniques (though today it’s mostly statistical analytics). With analyst firms like Gartner and AIA releasing buyer guides for AIOps, it sure seems tempting to look into the vendors’ promise to solve it all! Many think that they can just add an AIOps tool on top of their existing mix without actually having to touch their suite of disjointed monitoring tools.

But is this really the way to solve your problems? We think there are actually two ways to do event correlation. Let’s break it down and compare the two.

1.  The Statistical Event Correlation Approach (The AIOps Way)

AIOps tools use statistical analytics along with user-defined rules to bubble up and prioritize incidents. These statistical methods rely on indicators such as time (Did multiple events occur simultaneously?), network proximity (Are two disruptions located on the same subnet?), and number of like qualities (Did a particular word show up in many events?).  
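To make the three indicators concrete, here is a minimal Python sketch that scores a pair of events on time, network proximity, and shared words. The `Event` class, the `correlation_score` function, the 60-second window, and the /24 subnet size are all assumptions chosen for illustration; this is not how any particular AIOps product works.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Event:
    """Hypothetical event record used only for this illustration."""
    source_ip: str
    timestamp: datetime
    message: str


def correlation_score(a: Event, b: Event,
                      window: timedelta = timedelta(seconds=60),
                      prefix_len: int = 24) -> int:
    """Count how many of the three indicators two events share (0 to 3)."""
    score = 0
    # Time: did the two events occur within the same window?
    if abs(a.timestamp - b.timestamp) <= window:
        score += 1
    # Network proximity: are the two sources on the same subnet?
    subnet = ip_network(f"{a.source_ip}/{prefix_len}", strict=False)
    if ip_address(b.source_ip) in subnet:
        score += 1
    # Like qualities: do the two messages share at least one word?
    if set(a.message.lower().split()) & set(b.message.lower().split()):
        score += 1
    return score


t0 = datetime(2018, 1, 1, 12, 0, 0)
printer = Event("10.0.0.5", t0, "paper jam error")
website = Event("10.0.0.9", t0 + timedelta(seconds=10), "HTTP 500 error")
print(correlation_score(printer, website))  # 3
```

The two sample events score the maximum because they are close in time, share a subnet, and both contain the word "error". Nothing in the score reflects what the devices actually are.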

These tools can be really great at spotting a pattern. Given a set of whitelisted and blacklisted events, they can perform brute-force pattern matching to classify new incoming events as good or bad. They can also detect these patterns themselves, but for this, they need long training times and large sets of data.   
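As a rough illustration of list-based classification, the sketch below checks each incoming message against every known pattern in turn. The pattern lists and the `classify` function are hypothetical; a real tool would maintain much larger lists or learn them from training data.

```python
import re

# Hypothetical pattern lists, assumed for illustration only.
KNOWN_BAD = [re.compile(p, re.IGNORECASE)
             for p in (r"disk failure", r"link down")]
KNOWN_GOOD = [re.compile(p, re.IGNORECASE)
              for p in (r"backup completed", r"login succeeded")]


def classify(message: str) -> str:
    """Return 'bad', 'good', or 'unknown' by trying every pattern in turn."""
    if any(p.search(message) for p in KNOWN_BAD):
        return "bad"
    if any(p.search(message) for p in KNOWN_GOOD):
        return "good"
    # Nothing matched: the event has not been seen before in any form.
    return "unknown"


print(classify("Link down on eth0"))         # bad
print(classify("Nightly backup completed"))  # good
print(classify("Fan speed anomaly"))         # unknown
```

Anything that matches neither list comes back as "unknown", which is where a purely pattern-based classifier runs out of answers.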

But there are two challenges with this approach. The first is that these AIOps tools are not domain-aware and don’t inherently understand the IT elements themselves, which can lead them to surface false positives: correlations that you know at a glance make no sense. For example, you know a printer and a website aren’t related even though they generated events at the same time. But a statistical approach may not know this.

The other challenge: around 70 percent of IT incidents, per analyst firm EMA, are completely new and haven’t occurred in the past. So, relying on past behavior means that you can miss first-time issues that are important to know about. Wouldn’t you want to be sure that you’ll catch more than just 30 percent of IT issues?

A related limitation: because AIOps tools tend to rely on other monitoring tools for events, they are looking retroactively at older data, and that data is only as good as what each monitoring tool was configured to send. Garbage in, garbage out. Often, the ingested events have already been processed and deduplicated, which further skews the statistical analysis. Tools with more sophisticated machine learning and AI also require in-demand data science skills, which are often hard to come by.
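A small Python sketch shows why upstream deduplication matters to a frequency-based analysis. The event strings and the assumption that the upstream tool collapses exact repeats into a single event are made up for illustration.

```python
from collections import Counter

# Raw stream: one noisy problem repeated 40 times, one rare problem twice.
raw_events = ["link down sw1"] * 40 + ["disk full db1"] * 2

# What an upstream tool might forward after collapsing exact repeats
# (assumed behavior; dict.fromkeys keeps the first occurrence of each).
deduped_events = list(dict.fromkeys(raw_events))

raw_counts = Counter(raw_events)
deduped_counts = Counter(deduped_events)

print(raw_counts["link down sw1"], deduped_counts["link down sw1"])  # 40 1
print(raw_counts["disk full db1"], deduped_counts["disk full db1"])  # 2 1
```

After deduplication, both problems appear exactly once, so any statistic built on how often an event fires can no longer tell them apart.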

That said, an AIOps tool can be really great at analyzing unstructured data (for example, parsing the text in events and applying natural language processing to service desk tickets) in order to identify higher-level correlations that an infrastructure monitoring tool would surely miss. Some even stretch beyond the IT Ops realm and ingest data from other streams like social media, allowing companies to truly understand when their users or brand are impacted.

2.  The Intelligent, Domain-Based Correlation Approach

This approach involves modernizing your infrastructure monitoring solution itself. The secret here is selecting a platform with an ability to perform event correlation based on a native, deep understanding of the IT infrastructure components and dependencies. One that is domain-aware as well as service-aware. One that knows how the infrastructure works in order to determine logical relationships. Understanding how these individual monitored elements support a critical service at any given point in time helps to prioritize the most important issues to investigate and resolve first.

An infrastructure monitoring solution like Zenoss helps you to understand IT service risks in real time and to sift through noise by bubbling up service-impacting events (with prioritized root-cause analysis to ease resolution). Zenoss has inherent domain understanding about the devices themselves and how they work. It knows that a failing fan in a converged infrastructure server will affect its performance and that a printer error is not going to take down infrastructure for an e-commerce website even if it is on the same subnet. It knows that an issue on a backup server will affect the mobile app it supports, even if it isn’t at peak use and customers aren’t yet affected.
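The idea of domain-based correlation can be sketched as a walk over a dependency map. The Python example below is a simplified illustration, not Zenoss's implementation: the element names, the `SUPPORTS` map, and the `impacted_services` function are all hypothetical.

```python
from collections import deque

# Hypothetical dependency map: each element lists what directly depends on it.
SUPPORTS = {
    "fan-3": ["server-a"],
    "server-a": ["ecommerce-site"],
    "backup-server": ["mobile-app"],
    "printer-2": [],
}
# Hypothetical set of business services worth alerting on.
SERVICES = {"ecommerce-site", "mobile-app"}


def impacted_services(element: str) -> set:
    """Follow dependencies breadth-first and collect every service reached."""
    seen, queue, hits = {element}, deque([element]), set()
    while queue:
        node = queue.popleft()
        if node in SERVICES:
            hits.add(node)
        for dependent in SUPPORTS.get(node, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return hits


print(impacted_services("fan-3"))          # {'ecommerce-site'}
print(impacted_services("backup-server"))  # {'mobile-app'}
print(impacted_services("printer-2"))      # set()
```

Because the relationships are modeled explicitly, the printer event maps to no service regardless of which subnet it sits on, while the fan and backup server events are tied to the services they support.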

The benefits of using this approach go even further. By consolidating your monitoring toolset, you can also reduce license and labor costs. By handling a significant amount of event correlation within your monitoring tools, you can prevent event storms from moving to the service desk, where they are more costly to manage.

The Best of Both Worlds

Before you slap yet another tool on top of your monitoring solution, stop and think about what you are solving for. We know companies that pursued an AIOps solution instead of tackling their monitoring, only to realize that the approach fell short.

So, our prescription for the ailment? Tune your monitoring approach first. Your monitoring is your first line of defense, and ensuring that you have a unified view with quality insights is key.  You can then unleash an AIOps tool on a broader set of data to catch things that are missed, thus supplementing and complementing your monitoring.  

Here are some tips on how you should leverage each type of tool to get the best of both worlds.

Use Zenoss software for:

  • Unified monitoring of infrastructure performance and availability across your hybrid IT environment
  • Amalgamating event data from your other monitoring tools (In other words, your infrastructure monitoring “monitor of monitors”)
  • Generating infrastructure-related insights that are timely, actionable and enriched with contextual data for resolution
  • Automating alerts and ticketing for infrastructure-related, service-impacting events

Use the AIOps platform for:

  • Fusing data across broader IT domains, leading to higher-level business insights
  • Cross-functional collaboration and visualization of important data across internal teams
  • Finding higher-level, seasonal trends beyond your usual monitoring metrics and indicators

To learn more about how AIOps tools can complement solutions like Zenoss, consider attending the upcoming GalaxZ18 conference in Austin, Texas.