About two weeks ago, Kent Erickson of Zenoss teamed up with Cisco’s Russell Fishman to host the webinar Eliminating IT Process Integration Work. They discussed why the two companies partnered to create Zenoss Cloud Service Assurance, or “ZCSA”: software that monitors next-generation data centers, most of which incorporate cloud infrastructure “end-to-end,” as Russell put it.
Service Assurance is an important topic, especially as data centers become more complex and the expectations of end users become more sophisticated. At least a decade has passed since the average end user would readily put up with slow-loading apps, let alone an animated Under Construction sign. Deepak Kanwar gives a good overview of the consequences of downtime in his latest post, something I’ve discussed on this blog as well. Unplanned downtime can frustrate customers, prevent your IT staff from doing their intended jobs, and hurt your reputation, all of which lead to financial pain. And that’s just a partial list.
Next-generation data centers work differently from those of even five years ago. Here are some of the principles Kent listed for them:
Based on virtualization and cloud principles
Applications use pools of compute, storage, and network resources rather than dedicated infrastructure components
Infrastructure standards that enable reliable deployment and redeployment of resources on-demand
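The pooled-resource principle above can be sketched in a few lines of Python, purely as a toy illustration. The pool names and the `allocate` helper are hypothetical, not any vendor’s actual API; the point is just that applications draw capacity from shared pools on demand instead of being pinned to dedicated boxes.

```python
# Toy illustration of pooled resources (hypothetical names, not a real API):
# applications lease capacity from shared pools rather than owning hardware.
pools = {"compute_ghz": 64.0, "storage_tb": 40.0, "network_gbps": 20.0}

def allocate(app, requests, pools):
    """Reserve capacity from shared pools; fail fast if any pool runs dry."""
    if any(pools[res] < amount for res, amount in requests.items()):
        raise RuntimeError(f"insufficient capacity for {app}")
    for res, amount in requests.items():
        pools[res] -= amount
    return {"app": app, **requests}

lease = allocate("shopping-cart", {"compute_ghz": 8.0, "storage_tb": 2.0}, pools)
print(pools["compute_ghz"])  # 56.0 remaining for other tenants
```

Releasing the lease would add the amounts back, which is what makes redeployment of resources on demand cheap compared with racking dedicated gear.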
Clearly, next-generation data centers are more complex than those of the pre-virtualization, pre-cloud days. And trying to assure that everything stays up and running may be beyond the reach of traditional service assurance tools, which can only monitor one aspect of the whole entity.
As Russell pointed out:
If you don’t have a [Service Assurance] toolset that understands the dynamic tenant nature of the way these cloud environments are built, you end up with flat views of environment, which makes it difficult to triage when things go wrong. If you don’t understand the context of a particular resource failing, you can’t prioritize your IT service restoration efforts.
The word “triage” really stuck out for me. When I think “triage,” I think of doctors in an emergency room trying to determine whether the woman fighting an allergic reaction to a wasp sting should be treated before the man with the broken arm. And then it clicked. We may not be discussing sentient robots or other forms of artificial intelligence, but these next-generation data centers are like complex organisms. You can no longer cure the part without looking at the whole.
Diagnosing the Flu
My minor epiphany occurred about 10 minutes into the webinar when Russell discussed how the aggregation of disparate components could come together to make a cloud environment:
When we talk about cloud as a system, we’re talking about managing this whole system as a single combined entity…It’s nonsense to talk about different domains, when you’re really managing a single system. The lines between those different traditional silos are absolutely blurred.
As a result, any event or glitch needs to be seen in context with everything else within this data center “organism.”
It reminded me of my alcoholic cousin “Al” coming down with the flu last Thanksgiving. Looking back, my family had monitored Al in a siloed fashion using our personal observation as our “universe of context”. I concluded that the tears in his eyes were Al’s response to watching his favorite team, the Detroit Lions, lose yet again. Hearing Al complain of nausea, my uncle concluded that it was a direct result of drinking and eating too much. And Al’s aches and pains were quickly dismissed by his brother as a consequence of his extreme hours at the gym.
Fortunately for us, Al provided us all with a consolidated view when he called out, “I’m sick for real, man.”
Minimizing the Frankenstein Effect
At this point, no next-generation data center, even the latest, greatest Cisco UCS, can express itself in Al’s simple, declarative way. Moreover, while certain converged infrastructures like Cisco UCS do come in discrete containers, they don’t evince problems in a way recognizable to many IT people, let alone end users.
Hence, the need for monitoring. And not just the typical proliferation of point tools but something that works end-to-end, as Russell explained:
We need a single way to correlate the different events that are coming from the various parts of the infrastructure and understand the root causes in a unified fashion that matches the cloud infrastructure and architecture.
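Conceptually, that kind of correlation means grouping low-level events by the service they affect rather than by the silo that emitted them, then ranking candidates for root cause. Here is a minimal sketch under stated assumptions: the event format, the `SERVICE_MAP` topology, and the layer ordering are all hypothetical, not ZCSA’s actual model.

```python
from collections import defaultdict

# Hypothetical: map infrastructure components to the services they support.
# A real tool would discover this topology rather than hard-code it.
SERVICE_MAP = {
    "esx-host-01": "checkout-app",
    "vm-web-03":   "checkout-app",
    "san-lun-7":   "checkout-app",
    "vm-db-02":    "reporting",
}

# Rough stack ordering: failures lower in the stack are more likely root causes.
LAYER = {"san-lun-7": 0, "esx-host-01": 1, "vm-web-03": 2, "vm-db-02": 2}

def correlate(events):
    """Group raw events by affected service and pick a likely root cause."""
    by_service = defaultdict(list)
    for ev in events:
        by_service[SERVICE_MAP.get(ev["component"], "unknown")].append(ev)
    return {
        svc: min(evs, key=lambda e: LAYER.get(e["component"], 99))
        for svc, evs in by_service.items()
    }

events = [
    {"component": "vm-web-03",   "msg": "HTTP 500s spiking"},
    {"component": "esx-host-01", "msg": "datastore latency high"},
    {"component": "san-lun-7",   "msg": "LUN offline"},
]
print(correlate(events)["checkout-app"]["component"])  # san-lun-7
```

Three alarms from three silos collapse into one actionable finding: the checkout app is down because a LUN went offline, which is exactly the context a flat, per-silo view can’t give you.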
A siloed approach to monitoring treats your next-generation data center as if it were Frankenstein’s monster rather than my naturally evolving organism. And I can understand the desire to stick with the former. Although today’s data centers have certainly evolved from the days of one app/one server, they didn’t evolve the way humans have. To make a Cisco UCS box, you have to build it much as Frankenstein built his monster; you can’t grow it from stem cells in test tubes. I’m reluctant to put forth exact analogies, but the physical servers might correspond to the monster’s organs, storage to the brain, and networking to the circulatory system. You may choose NetApp for the brain, while others choose EMC or Hitachi.
But you can no longer piece together various events and glitches to figure out why a shopping cart app has stopped updating customer changes, to give one common example. The inherent dynamism that comes with virtualization and cloud computing means your requirements are constantly changing, and computing resources are being allocated to different parts of the “body,” as needed. It’s a similar scenario to the way a big Thanksgiving repast can make you sleepy because the body has taken some resources from the brain to help digest all that food.
According to Russell, most Service Assurance tools struggle with treating the data center in a holistic (whole body) way:
[These tools] try to keep up with what’s going on, but [they] often don’t reflect the reality of how the environment is configured or how those resources are allocated to various customers. Can we be sure that when something goes wrong, we know what group of people that is using my cloud it affects, or do we just see it as a bunch of VMs, relying on arcane concepts like naming conventions for servers to work out who owns what? How can we even be sure that people will name their resources appropriately?
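The naming-convention fragility Russell describes is easy to demonstrate. Below is a small sketch contrasting the two approaches: parsing ownership out of a server name versus carrying it as explicit metadata. The naming scheme, VM record shape, and both helper functions are hypothetical illustrations, not any tool’s real API.

```python
import re

# The fragile approach from the quote: infer the owner by parsing a
# hypothetical naming convention such as "<tenant>-<role>-<nn>".
def owner_from_name(vm_name):
    m = re.match(r"([a-z]+)-\w+-\d+$", vm_name)
    return m.group(1) if m else None

# A model-driven alternative: ownership travels with the resource as
# explicit metadata, so a rename can't orphan it.
def owner_from_tags(vm):
    return vm.get("tags", {}).get("tenant")

vm = {"name": "webserver2", "tags": {"tenant": "acme"}}
print(owner_from_name(vm["name"]))  # None: the convention wasn't followed
print(owner_from_tags(vm))          # acme
```

The first helper works only as long as every admin names every resource correctly forever; the second keeps working no matter what the VM is called, which is the difference between hoping for context and modeling it.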
Given the limitations of most service assurance tools, it’s not surprising that Cisco teamed up with Zenoss to develop a service assurance monitoring solution that can speak in a way that makes sense to people. It may not offer quite the clarity my cousin Al could, but fortunately, it won’t barf all over everything in the process.
And check out the webinar below if you have some time. Kent and Russell have put together a solid demo of ZCSA that will give you a better sense of how it works.