In my last post, “Service Assurance and Why You Should Care,” I pointed out that it is increasingly clear that tools developed to keep a single system or a small cluster running are no longer sufficient in today’s highly distributed, complex environments. A new set of tools is needed — tools that offer service assurance rather than merely high availability. These tools must be able to examine and manage all of the components of the IT infrastructure, not just a few selected components, to achieve the goal of preventing service interruptions that have a negative impact on customers. Let’s take a deeper look at implementing service assurance.
The traditional focus: disaster recovery
Traditionally, IT organizations focused on disaster recovery — that is, “picking up the pieces after everything collapsed.” Some more forward-looking organizations focused instead on disaster prevention, maintaining high levels of availability through high-availability clusters or even fault-tolerant computers.
Their goals typically included keeping individual components — systems, storage, networking components and, perhaps, complete workloads — functioning. The approach they took was to deploy redundancy plus some form of clustering, workload management or other failover software.
While this approach is quite workable for monolithic workloads deployed on a small number of servers, it really isn’t enough when we consider what has been happening in IT over the last 10 or 15 years.
Quick review of service assurance
Service assurance means having the tools to monitor all parts of an application, regardless of where it is running or on what type of system; to detect issues with an application component or with how it is interacting with other application components; to present a real-time view of what is happening to IT administrators; and to proactively address those issues before they have a chance to become problems. This means being able to examine all parts of a service. Performance and configuration data concerning components such as the servers, the operating system, network components, the storage system, virtualization tools, application frameworks and the applications themselves must be readily available to the IT administrator.
It also means understanding how all of the various components of the complete computing solution interact with one another. The tools must understand that an application slowdown may be caused by a database problem, a networking problem or failing hardware within the system cabinet.
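To make this concrete, here is a minimal sketch of the kind of cross-component reasoning described above: given a slowdown in one tier, walk the service’s dependency graph and flag unhealthy components as candidate root causes. The component names, metrics, thresholds and topology are all hypothetical, and real service-assurance tools do this with far richer telemetry.

```python
# Illustrative sketch: correlating an application slowdown with the health
# of its dependencies. All names, metrics and thresholds are hypothetical.

# Latest metrics for each component of a multi-tier service.
metrics = {
    "web_frontend": {"latency_ms": 950, "error_rate": 0.01},
    "app_server":   {"latency_ms": 900, "error_rate": 0.02},
    "database":     {"latency_ms": 870, "error_rate": 0.15},
    "network_link": {"latency_ms": 12,  "error_rate": 0.00},
}

# Which components each component depends on (hypothetical topology).
depends_on = {
    "web_frontend": ["app_server", "network_link"],
    "app_server":   ["database", "network_link"],
    "database":     [],
    "network_link": [],
}

def suspect_components(component, error_threshold=0.05, seen=None):
    """Walk the dependency graph below a slow component and flag any
    dependency whose error rate exceeds the threshold -- these are the
    candidate root causes of the slowdown."""
    if seen is None:
        seen = set()
    suspects = []
    for dep in depends_on.get(component, []):
        if dep in seen:
            continue
        seen.add(dep)
        if metrics[dep]["error_rate"] > error_threshold:
            suspects.append(dep)
        suspects.extend(suspect_components(dep, error_threshold, seen))
    return suspects

print(suspect_components("web_frontend"))  # flags the database, not the network
```

The point of the sketch is the shape of the problem: the slow tier (the web frontend) is not the faulty one, and only a tool that understands the whole topology can point at the database instead.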
The whole point of service assurance is to mitigate the impact of service disruptions on customers regardless of whether the source of the disruption is internal, such as a component failure, or external, such as a networking problem. This is increasingly difficult.
What’s needed is thinking differently about reliability
IT organizations have a different goal today: preventing slowdowns or failures rather than dealing with the aftermath. Today’s workloads are architected as multi-tier, distributed application services that have been woven together. These application services increasingly run on off-the-shelf, industry-standard operating systems, such as Windows, Linux and UNIX, rather than on vendor-specific operating systems.
The industry is also seeing organizations deploy cloud-based applications, application components or storage.
The environment is far more complex and there are many more ways for a slowdown or failure to occur.
Tools that offer service assurance must be able to wend their way through complex application systems and prevent application slowdowns or application failures. These complex systems now include physical systems, virtual systems and even systems found in a cloud service provider’s data center. This can be a very challenging task because of today’s distributed, multi-tier application design.
Service Assurance must come to the forefront
Since the goal is preventing problems rather than dealing with the aftermath, IT administrators are now expected to wade through more operational data, from more layers of software and hardware, and make real-time decisions. Service assurance, rather than just high availability and failover, must be considered.
This means that operational data from many systems housed in different data centers must be gathered, analyzed and interpreted. Predictive analysis tools must be used to predict when and where problems are going to arise rather than merely reporting on what’s happening. There is simply too much data for human beings to manage if real-time operational decisions must be made.
The tools used must take advantage of the capabilities of management frameworks and tools that have already been deployed. These new tools must add real-time analysis of operational data to what the installed tools offer. This analysis must result in predictions about where the next slowdown or failure might come from, and the tools must also be capable of preventing those problems.
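As a toy illustration of the “predict rather than report” idea, the sketch below fits a simple least-squares trend line to a sampled metric and estimates how long until it breaches a threshold. The sample data (disk usage climbing 2% per minute) and the threshold are hypothetical; real predictive-analysis tools use far more sophisticated models.

```python
# Illustrative sketch: predicting when a metric will cross a threshold
# using a least-squares trend line. Sample data and threshold are hypothetical.

def linear_fit(samples):
    """Least-squares slope and intercept for (time, value) samples."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = num / den
    return slope, mean_v - slope * mean_t

def minutes_until_breach(samples, threshold):
    """Estimate minutes until the trend crosses the threshold.

    Returns None when the trend is flat or improving."""
    slope, intercept = linear_fit(samples)
    if slope <= 0:
        return None
    t_breach = (threshold - intercept) / slope
    return max(0.0, t_breach - samples[-1][0])

# Disk usage (%) sampled once a minute (hypothetical data).
usage = [(0, 70.0), (1, 72.0), (2, 74.0), (3, 76.0), (4, 78.0)]
print(minutes_until_breach(usage, threshold=90.0))  # prints 6.0
```

A forecast like this lets the tool raise an alert (or trigger remediation) several minutes before the failure, instead of merely reporting the breach after the fact.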
Tools such as those offered by Zenoss should be considered as an important improvement to today’s management environment.