Six Things We Learned While Migrating Our Training Infrastructure to Amazon AWS

When the Customer Enablement team here at Zenoss found ourselves at the end of a replacement cycle for our self-hosted training infrastructure, we faced a classic CTO decision tree: should we replace our existing hardware with new? Or move to a managed hosting solution? Or should we join the herd and head skyward for the clouds?

On the surface, our infrastructure needs looked a lot like the poster-child case for a Cloud-based solution such as Amazon AWS:

  • Our compute loads are “bursting.” We fire up classes of 20-40 student servers for two days, then delete everything to set up for the next class
  • Elasticity would be a huge benefit. As a growing company, it’s always difficult to predict how many students we will train next quarter – or next year for that matter. Could our solution scale up or down if our estimated volumes were off?
  • Geo-diversity would be a big plus, as our department planned for the ever-increasing globalization of our customer base

After a careful analysis, AWS won out, but it wasn’t as clear-cut a decision as the above bullet points might imply. In fact, it turned out to be a very close call for us. During our selection process and our subsequent roll-out, we learned a tremendous amount about cloud migrations. Here are some of the highlights:

  1. The contrasting cost models (large upfront costs / nominal operating costs for self-hosting, zero startup costs / high operating costs for Cloud, and a blend of the two for managed hosting) made a discounted cash flow financial model ideal for weighing the quantitative factors of our decision. In fact, this decision was so well suited to discounted cash flows that it could serve as a textbook business school case.
  2. The internal control implications of using Cloud platforms need to be carefully considered. Here is why: in a traditional capital procurement process, outlays are evaluated and approved by management before they occur. By contrast, the costs associated with cloud solutions are incurred daily by operational staff with the click of a mouse (or the running of a script). No direct procurement process is practical or even available when real (not budgetary) costs are incurred. The risk of budget overruns as a result of unnecessary instance run times is therefore significant. Worse still, overruns are unlikely to be detected before budgets have been exceeded and the money has been spent. Service providers have shared tales with us of “saw-toothed” cost graphs for cloud-based solutions, which are generated when discipline degrades over time and costs creep higher. Renewed management attention to (or more accurately, management anger about) cloud invoices causes discipline to be restored and costs to drop, resulting in the vertical cliffs in the graphs. Then the process repeats, again and again! We have been extremely disciplined in managing our run times, but we’ve learned that it takes constant vigilance.
  3. Cloud cost management starts with the assignment of cost accountability to a particular individual. Whenever cloud platforms are managed in a “federated” (department by department) IT model, a particular individual should be tasked with approving and managing run times, and actual incurred expenses should be carefully monitored by the finance department and contrasted with forecast run times to facilitate rapid remediation of budget overruns.
  4. Although the operational benefits of cloud automation are significant, the development and testing resources needed to produce that automation proved to be very significant for us. They can also be difficult to forecast, so managers should leave plenty of breathing room in the budget for these initiatives.
  5. In federated IT environments, there can be a high risk of waste and inefficiency when multiple departments duplicate automation development efforts. As such, whenever possible teams should be compelled to collaborate on their development efforts or, better still, automation initiatives should be centralized.
  6. When practical, management should consider establishing a centralized team that provides “Cloud Services” to other departments. The team can be tasked with the cradle-to-grave life cycle of Cloud management: evaluating adoption versus alternatives by conducting a rigorous cost forecast for each department’s proposed Cloud deployment, conducting regular and disciplined cost oversight to provide early warning of budget overruns, and standardizing automation development to prevent expensive and unnecessary duplication of effort.
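
To make the discounted cash flow comparison from the first point concrete, here is a minimal Python sketch. Every figure in it (the 10% discount rate, the three-year horizon, and the cash flows for each option) is a made-up placeholder chosen only to show the three cost shapes, not a number from our actual analysis, and the helper name `npv` is ours for illustration.

```python
def npv(rate, cash_flows):
    """Net present value of yearly cash flows; cash_flows[0] occurs today."""
    return sum(cf / (1 + rate) ** year for year, cf in enumerate(cash_flows))


# Hypothetical inputs (assumptions for illustration only).
discount_rate = 0.10

# Costs are negative cash flows: [upfront, year 1, year 2, year 3].
options = {
    "self_hosted": [-100_000, -5_000, -5_000, -5_000],   # large upfront, nominal operating
    "managed":     [-40_000, -25_000, -25_000, -25_000], # a blend of the two
    "cloud":       [0, -40_000, -40_000, -40_000],       # zero upfront, high operating
}

# Express each option as a positive present-value cost so lower is better.
costs = {name: -npv(discount_rate, flows) for name, flows in options.items()}
cheapest = min(costs, key=costs.get)

for name, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{name:12s} present-value cost: {cost:,.0f}")
print(f"lowest present-value cost: {cheapest}")
```

Because the options differ mainly in when the money is spent, the ranking can shift with the discount rate and the time horizon, so it is worth rerunning a model like this under several assumptions before committing.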

We’ve been delighted with our choice to host our training infrastructure on a Cloud-based platform, but we learned that we’re reaping the benefits of Amazon’s AWS only after a rigorous platform selection process and disciplined daily management of our infrastructure.
