Introduction

To begin, let's first address the elephant-shaped cloud in the sky. The concept of public cloud capacity management is an oxymoron of sorts. One of the selling points for customers of any cloud provider is that their usage of the platform can scale up and down in minutes, based on demand, without any usage planning, upfront costs or long-term commitment, which gives their business a greater level of agility. Need a VM? Stand one up! Need to turn down a server? Go right ahead. With tools such as Cloud Application Manager this is easier than ever, since they control the deployment or deletion of instances in any cloud seamlessly.

While this capability has solved many of the operational flexibility issues businesses encountered running applications in their own data centers, it has also exposed others, such as variable costs and governance models, that were indirectly controlled by the inherent constraints of running their own data center. Thus, the traditional capacity management practice hasn't disappeared for enterprises so much as its scope and expectations have changed, and with that change has come a resurgence in the value the position provides.

Let's quickly cover what has changed or will change for capacity managers as enterprises move to the cloud:

  • Q: How has the cloud changed the capacity management function?

    A: The practice has changed from an infrequent physical capability scope to a daily financial scope.

  • Q: Why is capacity management in the cloud important for companies?

    A: While moving to the cloud provides strategic advantages for enterprises from an operational agility perspective due to its ability to scale practically infinitely, the budget to run the services is never as flexible. Thus, enterprises need to drive towards financial optimization of their cloud environment.

  • Q: Who in an enterprise is responsible for capacity management functions in the cloud?

    A: Usually, whoever is responsible for the budget of the unit running a particular cloud environment. This means that sometimes someone in finance or accounting becomes responsible, or is at minimum involved.

  • Q: What is the financial impact of not controlling capacity in the cloud?

    A: It has been reported that 80% of enterprises that run their own data center have more server capacity than they actually need to operate. Consequently, if enterprises were to just 'lift and shift' services from their data center to the cloud, 75% would see their costs increase! With the enterprise data center market estimated at \$170 billion, a 25% reduction in costs means approximately \$40 billion in savings that can be reinvested elsewhere within the enterprise. (1)

In conclusion, enterprises are not moving to the cloud for cost savings, yet ignoring the possibility of controlling costs through a capacity management function will be expensive.

The rest of this post will chronicle the basic steps and processes that enterprises will need to implement to control costs while maintaining operational agility.

Step 1 - Know your Environment

A fairly typical practice when a company runs its own data center is capacity planning: purchasing the equipment it currently needs plus as much capacity as it foresees needing over the next 12 to 24 months, and sometimes as far as 36 months out, depending on its financial and budget situation.

During this time, as the money for the equipment has already been spent and is only being depreciated internally, it doesn't necessarily matter whether the equipment is being utilized efficiently or not. This is how the proverbial Zombie infrastructure or IT Sprawl is created: the business doesn't need or care to know where and how its resources are being utilized on a daily basis, as it won't necessarily make a financial difference. Only when the utilization of the infrastructure starts to exceed capacity and negatively impact performance and operational activities is there usually a large one-off effort to cull excess resources, before the cycle naturally begins again.

Whether the cause is that the tools to address the root cause don't exist, or that it just isn't a priority for upper management because it doesn't reduce actual costs or expenses, is immaterial at this point. However, we will use this scenario as a baseline as we review the initial steps to control capacity.

Financial Reporting

Whoever is responsible for or controls the finances of an enterprise's cloud infrastructure must monitor cost on a daily basis to achieve the best financial results, making sure spend stays within expectations and budget. To begin with, this can be as simple as estimating daily and monthly spend amounts and comparing them to actual usage, in order to know where you stand financially at any given time and be able to act accordingly.
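As a rough illustration of that first comparison, here is a minimal Python sketch. It assumes a straight-line daily estimate (monthly budget divided by days in the month) and a list of daily billed amounts pulled from your provider's billing export. The function name, fields and figures are hypothetical and only there to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class SpendStatus:
    expected_to_date: float     # straight-line budget through today
    actual_to_date: float       # what has actually been billed so far
    variance_pct: float         # positive means over plan
    projected_month_end: float  # naive run-rate projection


def spend_status(monthly_budget: float, days_in_month: int,
                 daily_actuals: list[float]) -> SpendStatus:
    """Compare month-to-date spend with a straight-line daily estimate."""
    if not daily_actuals:
        raise ValueError("need at least one day of billing data")
    days_elapsed = len(daily_actuals)
    daily_estimate = monthly_budget / days_in_month
    expected = daily_estimate * days_elapsed
    actual = sum(daily_actuals)
    variance_pct = (actual - expected) / expected * 100
    # Assumption: the rest of the month looks like the average day so far.
    projected = actual / days_elapsed * days_in_month
    return SpendStatus(expected, actual, variance_pct, projected)


# Made-up numbers: a 30,000 budget for a 30-day month, three days billed.
status = spend_status(monthly_budget=30_000, days_in_month=30,
                      daily_actuals=[900, 1100, 1300])
print(f"{status.variance_pct:+.1f}% vs plan, "
      f"projected month end: {status.projected_month_end:,.0f}")
```

A straight-line estimate is the simplest possible baseline. Once you know your own weekly or monthly usage cycles, the estimate can be replaced with something that reflects them.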

Over time, this financial litmus test should become more complex and rigorous as an enterprise better understands what, where, when and how cloud resources are being used by its engineers and developers, so that it knows when spikes and dips in cost are to be expected and can further control costs and expenses. We will discuss later how this knowledge and experience of how your enterprise is using the cloud can be leveraged outside of a financial view.
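One simple way to start separating expected movement from unexpected spikes and dips is to compare each day's cost against a trailing average. The sketch below is only one possible approach, not a prescribed method. The 7-day window and the 25% tolerance are assumptions you would tune to your own spend patterns.

```python
from statistics import mean


def flag_cost_anomalies(daily_costs: list[float], window: int = 7,
                        tolerance: float = 0.25) -> list[int]:
    """Return the indexes of days whose cost differs from the average of the
    previous `window` days by more than `tolerance` (expressed as a fraction)."""
    flagged = []
    for day in range(window, len(daily_costs)):
        baseline = mean(daily_costs[day - window:day])
        if baseline == 0:
            continue  # nothing to compare against
        deviation = abs(daily_costs[day] - baseline) / baseline
        if deviation > tolerance:
            flagged.append(day)
    return flagged


# Made-up daily costs with one spike (day 7) and one dip (day 9).
costs = [100, 102, 98, 101, 99, 100, 100, 160, 101, 60]
print(flag_cost_anomalies(costs))
```

Days that are flagged are not necessarily problems. The point is to prompt the question "did we expect this?" while the answer can still change the month's outcome.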

Inventory & Utilization Reporting

After a company gets an initial handle on the financials, an additional layer of control is necessary, because financial controls alone won't make the usage of a platform efficient, especially at the beginning of a cloud journey or if your company's overall usage is exceedingly large.

As a company becomes more confident in understanding how it utilizes one or many cloud providers' platforms, patterns start to emerge of what is normal and abnormal behavior. This is further reinforced by lessons learned from monitoring costs: when costs go up, one will know what usage caused the increase and can be on the lookout for it going forward.

Here are some common examples of situations that should be caught with Inventory and Utilization Reporting:

  1. Block storage volumes that aren't attached to a server.

  2. Servers that run at ~20% of their CPU, memory or network limits.

  3. Databases that are utilizing ~20% of provisioned IOPS.

  4. Servers in higher cost regions (e.g. Asia/Pacific) without any local presence.

  5. Servers using VM types that aren't appropriate for the workload (e.g. a GPU server for a back-end API application).
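The five situations above can be expressed as simple rules over an inventory export. The following Python sketch assumes a hypothetical `Resource` record and made-up region names and thresholds. A real report would be built from whatever inventory and metrics data your cloud providers expose.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds and region lists; tune these to your environment.
LOW_UTILIZATION_PCT = 20.0
HIGH_COST_REGIONS = {"ap-example-1"}
LOCAL_PRESENCE_REGIONS = {"us-example-1"}


@dataclass
class Resource:
    resource_id: str
    kind: str                                     # "volume", "server" or "database"
    region: str
    attached_to: Optional[str] = None             # volumes only
    peak_utilization_pct: Optional[float] = None  # CPU/memory/network, or IOPS
    vm_family: Optional[str] = None               # e.g. "gpu", "general"
    workload: Optional[str] = None                # e.g. "api", "ml-training"


def findings(resource: Resource) -> list[str]:
    """Return the reporting rules that this resource trips."""
    issues = []
    if resource.kind == "volume" and resource.attached_to is None:
        issues.append("unattached volume")
    if (resource.kind in ("server", "database")
            and resource.peak_utilization_pct is not None
            and resource.peak_utilization_pct <= LOW_UTILIZATION_PCT):
        issues.append("underutilized")
    if (resource.kind == "server"
            and resource.region in HIGH_COST_REGIONS
            and resource.region not in LOCAL_PRESENCE_REGIONS):
        issues.append("high-cost region without local presence")
    if (resource.kind == "server"
            and resource.vm_family == "gpu" and resource.workload == "api"):
        issues.append("VM type does not match workload")
    return issues


inventory = [
    Resource("vol-1", "volume", "us-example-1"),
    Resource("srv-1", "server", "ap-example-1", peak_utilization_pct=18.0,
             vm_family="gpu", workload="api"),
    Resource("db-1", "database", "us-example-1", peak_utilization_pct=65.0),
]
for item in inventory:
    for issue in findings(item):
        print(f"{item.resource_id}: {issue}")
```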

When an asset is identified as being outside of the company's defined requirements and expectations, a corrective action that is inherently more cost effective should be taken, such as:

  1. Delete the abandoned storage volumes.
  2. Downsize the server so that it runs closer to ~80-90%.
  3. Reduce provisioned IOPS so that utilization is closer to ~80-90%.
  4. Delete or migrate servers to lower cost regions.
  5. Delete or migrate workloads to appropriate VM types.
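To make the downsizing action concrete, here is a small Python sketch that picks the smallest server size able to keep the observed peak at or below a target utilization. It assumes CPU demand scales linearly with vCPU count, which is a simplification. The size ladder is hypothetical, and the 85% target is simply the midpoint of the ~80-90% range above.

```python
# Hypothetical size ladder in vCPUs; real providers publish their own sizes.
SIZE_LADDER = [1, 2, 4, 8, 16, 32]


def recommend_vcpus(current_vcpus: int, peak_cpu_pct: float,
                    target_pct: float = 85.0) -> int:
    """Smallest size on the ladder that keeps the observed peak at or below
    `target_pct`, assuming load scales linearly with vCPU count."""
    needed = current_vcpus * peak_cpu_pct / target_pct
    for size in SIZE_LADDER:
        if size >= needed:
            return size
    return SIZE_LADDER[-1]  # already beyond the largest size we know about


def projected_utilization(current_vcpus: int, peak_cpu_pct: float,
                          new_vcpus: int) -> float:
    """Estimated peak utilization after moving to `new_vcpus`."""
    return current_vcpus * peak_cpu_pct / new_vcpus


# A 16 vCPU server peaking at 20% CPU.
new_size = recommend_vcpus(current_vcpus=16, peak_cpu_pct=20.0)
print(new_size, projected_utilization(16, 20.0, new_size))
```

The same calculation applies to provisioned IOPS by swapping the vCPU ladder for the IOPS tiers your database service offers.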

As this process matures internally over time, a governance model can be established to proactively prevent wasteful use of resources, rather than catching it reactively, and so save additional costs.

The lesson here is that an enterprise must first have the ability to quickly and efficiently sort through all of the data that each cloud provider offers in order to find outliers, and then make informed and timely decisions on the course of action to take. Secondly, companies must adopt processes and techniques to quickly and efficiently communicate and act on the outliers found in their analytic reporting to produce measurable results. This leads us to the next section.

Step 2 - Cloud Governance

Once companies have the ability to monitor and analyze usage, and an understanding of who should and shouldn't utilize which resources, in which locations, and so on, a governance model can be put in place to proactively control cost rather than reacting after someone has done something they shouldn't be able to do.

While a governance model is an important factor in proactively controlling costs, preventing excess costs from ever occurring rather than addressing them once the money is already out the door, it should never be counted on to control costs by itself. Instead, the previously built analytic capability should be used to validate and measure the success of any governance program. This helps a customer further refine their budgets and reduce 'fudge' factors by narrowing the amount of overhead one has to put in place to meet budget targets.
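In practice, a governance model often boils down to checking a provisioning request against a policy before anything is created. The Python sketch below shows that idea with a hypothetical per-team policy (allowed regions, allowed VM families, a vCPU ceiling). The team names, regions and limits are invented for illustration and are not taken from any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    allowed_regions: frozenset[str]
    allowed_vm_families: frozenset[str]
    max_vcpus: int


@dataclass(frozen=True)
class Request:
    team: str
    region: str
    vm_family: str
    vcpus: int


# Hypothetical policies keyed by team.
POLICIES = {
    "web": Policy(frozenset({"us-example-1"}), frozenset({"general"}), 8),
    "ml": Policy(frozenset({"us-example-1"}), frozenset({"general", "gpu"}), 32),
}


def violations(request: Request,
               policies: dict[str, Policy] = POLICIES) -> list[str]:
    """Return the reasons a request should be rejected (empty if it is allowed)."""
    policy = policies.get(request.team)
    if policy is None:
        return [f"no policy defined for team {request.team!r}"]
    problems = []
    if request.region not in policy.allowed_regions:
        problems.append(f"region {request.region} is not allowed")
    if request.vm_family not in policy.allowed_vm_families:
        problems.append(f"VM family {request.vm_family} is not allowed")
    if request.vcpus > policy.max_vcpus:
        problems.append(
            f"{request.vcpus} vCPUs exceeds the limit of {policy.max_vcpus}")
    return problems


print(violations(Request("web", "us-example-1", "general", 4)))
print(violations(Request("web", "ap-example-1", "gpu", 16)))
```

Rejected requests are worth logging alongside the utilization reports, since the two together show whether the policy is actually preventing the waste the reports used to catch.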

Conclusion

After a company has implemented a governance system and an analytic capability for finances, inventory and utilization, it is safe to say its cloud capacity management function is keeping the enterprise cloud environment operating efficiently, both financially and operationally. The company could stop there and keep costs under control. However, doing so would leave an opportunity to be even more financially efficient by leveraging Reserved Instances.

We will discuss how to best leverage Reserved Instances (RIs) in a future blog post, so stay tuned!


(1) Reference: