It’s now normal practice for our customers to monitor tens of thousands of devices from a single Zenoss installation. Making the software work for these large environments has taken a large amount of product work, as you’d expect.
We’ve found that as we support larger and larger environments, high-availability designs are increasingly requested. Failures and recovery procedures that can be tolerated when only a few dozen machines are affected carry real economic cost when they affect monitoring for thousands or tens of thousands of resources.
Our customers have noted that data center monitoring requires an application more complex than any line-of-business application, because it must watch all of the infrastructure and the most complex applications simultaneously and continuously. I’ll use this article to explore how Zenoss achieves high availability today, and how that compares to other approaches.
Zenoss Version 5 Delivers Standard, Highly Available Design
When we built Zenoss Version 5, we decided that every customer deserved a highly available implementation. So, Zenoss doesn’t charge extra for a high-availability version. It’s a standard feature in the Version 5 architecture. Any data center using Zenoss software, large or small, can easily benefit from monitoring that just doesn’t stop.
The design considerations for a high-availability application are complex. By choosing Docker and writing an orchestrator, we were able to deliver a reliable implementation that operates with very low administrative overhead.
Through Docker orchestration provided by Control Center, we provided for both scale and high-availability needs with data delivery queues, multiple process instances, automatic load balancing, stateless processes, and automatic network configuration. The performance metric database is a multi-instance store, and all database writes are transactional.
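To make the queueing idea concrete, here is a minimal sketch of acknowledge-on-success delivery: a message stays "in flight" until the consumer acknowledges it, so a consumer crash triggers redelivery instead of data loss. This is illustrative only; the class and method names are hypothetical, and a real deployment would use a durable message broker rather than an in-process queue.

```python
import queue

class AckQueue:
    """Sketch of a delivery queue that only drops a message after an ack.

    A message pulled by a consumer is parked in an in-flight table; if
    the consumer fails, nack() requeues it for another worker to retry.
    """

    def __init__(self):
        self._ready = queue.Queue()   # messages awaiting delivery
        self._in_flight = {}          # delivered but not yet acknowledged
        self._next_id = 0

    def publish(self, payload):
        self._ready.put((self._next_id, payload))
        self._next_id += 1

    def consume(self):
        msg_id, payload = self._ready.get()
        self._in_flight[msg_id] = payload   # held until acked
        return msg_id, payload

    def ack(self, msg_id):
        self._in_flight.pop(msg_id)         # processed; safe to drop

    def nack(self, msg_id):
        # Consumer failed mid-processing: requeue for redelivery.
        self._ready.put((msg_id, self._in_flight.pop(msg_id)))
```

The key property is that a payload consumed but never acknowledged is never lost; it simply becomes eligible for redelivery to a healthy consumer.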
The real punch for high availability came from the orchestrator, which provides monitoring and automatic restart for all the Docker processes that make up the Zenoss application. When you implement Zenoss Service Dynamics, you give Control Center a resource pool — a collection of virtualized or physical hosts — and it distributes the processes across all the hosts for an all-active design. Should a host have issues, Control Center will restart the Zenoss processes on other hosts automatically and within a few seconds. Data in motion is protected from loss via message queues. If a host fails completely, the Zenoss application will work around the failure by restarting processes on other hosts in the resource pool.
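As a rough illustration of the restart-on-failure idea (this is not Control Center’s actual algorithm, and the service and host names are hypothetical), a scheduler can reassign the services of a failed host round-robin across the surviving hosts in the pool:

```python
import itertools

def rebalance(assignments, healthy_hosts):
    """Reassign services from failed hosts to healthy ones.

    `assignments` maps service name -> current host. Services on a
    healthy host stay put; services stranded on a failed host are
    restarted round-robin across the remaining hosts.
    """
    targets = itertools.cycle(sorted(healthy_hosts))
    placed = {}
    for service, host in sorted(assignments.items()):
        if host in healthy_hosts:
            placed[service] = host            # host is fine; leave it
        else:
            placed[service] = next(targets)   # restart on a survivor
    return placed
```

For example, if `h2` drops out of a three-host pool, the services it was running are spread across `h1` and `h3` while everything else stays where it is.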
A very nice part of this design is that you’re not limited to a traditional two-host HA implementation as in a cluster. You can add three, four, or more hosts at any time for really easy scale-out and to limit the performance hit should a single host be lost. If you want to maintain performance, it’s far cheaper to add 25 percent more capacity by moving from four to five hosts than it is to double it with a two-server cluster.
A second nice part is that the entire application is highly available. Data collection, business logic, routing and queueing, data storage, and the presentation layer are all protected by the same mechanism with the same no-administration process.
We’ve built Zenoss 5 so any application component can fail without issue, except perhaps a brief loss of capacity. It took a total redesign because great HA is not something that can be bolted on afterward.
We’re not perfect yet. The Control Center server still needs to run with a hot-standby. But that’s just one (typically virtual) server in one location instead of a potentially broadly distributed set of servers. And we’re working on turning that into active-active because of customer demand.
Typical High-Availability Implementations Can Be Challenging and Even Fail
Let’s compare this to how other vendors achieve high availability. It’s something we think you’ll want to consider as the number of applications and management points grows.
- Many vendors deliver a distributed collection architecture that relies on virtual or physical appliances. To protect well against data loss, you almost need two instances of each appliance to be running at all times. And then you have to contend with what to do with two possibly different sets of collected data from each appliance’s set of machines. Yuck.
- Traditional clustering designs are at the end of their popularity. Clusters are often used to make databases highly available, but the administrative headaches of managing them have kept them from broad adoption. One customer told me that the downtime associated with patch management alone was enough to drop the availability of their clustered servers below 99 percent; the standard server deployment without clustering was achieving another full nine of uptime! Look for low administrative costs in high availability. We’re trying to free up IT time to focus on applications, after all.
- VMware has produced a revolution in availability that has lessened the overall need for highly available hardware designs. When it only takes a few minutes to restart a virtual server on another host, much of the pain of downtime is avoided. But if you’re relying on a large-scale data center monitoring solution, a few minutes of downtime may not be acceptable. And if the database writes aren’t transactional, you could wind up in a confused state, maybe even needing a full restore with loss of hours of data.
- I’ve worked on a few versions of software that had highly available features. For example, the first version of Microsoft Operations Manager agents would recover from mid-level collector failures by automatically sending data to a backup collector. But that was only a partial implementation of high availability. It’s important to understand exactly what is protected and what isn’t as you consider a high-availability solution.
When You Expect Your Applications to Fail, Your Monitoring System Can’t
I’ve written about how IT Ops needs to make life easy for DevOps teams. DevOps teams are deploying new versions of their applications frequently, as often as multiple times each day. Part of making it easy is ensuring that when DevOps teams need information, it’s there for them. An intensely reliable monitoring system is a crucial part of the service that IT Ops should be providing — and that takes a high-availability deployment.