One of the biggest challenges in managing an enterprise platform (Compute, Storage, or even platforms like OpenShift) is choosing the right Capacity Planning Strategy. It is a huge effort to find the right strategy for the respective platform type that supports the business demands as well as possible while, on the other hand, keeping the costs as low as possible without violating any SLO.
I have had insight into many different business models and their respective environments and will try to describe some of those experiences here. There is no “One Size Fits All” in Capacity Planning Strategies, so this blog post is meant as an inspiration for you. All comments and discussions are very welcome.
Capacity Planning is often a process that is introduced only after the environment is already up and running, but this does not work for all business models. This quote from the VMware Operations Management guide is, in my opinion, completely correct:
> For IaaS or DaaS, capacity management begins long before hardware is deployed. It begins with a business plan, which decides on what class of service will be provided.

Source: https://www.vmwareopsguide.com/operations-management/chapter-3-capacity-management/1.3.3-capacity-planning/
Annual Forecast Strategy
The Annual Forecast is one of the preferred Capacity Planning Strategies for large enterprises with on-premises private cloud environments. The demand for the Annual Forecast is often based on the budget planning for the business units.
The yearly budget of the service owner of the platform needs to include:
- Costs for Hardware (renewal and additional)
- Costs for Licenses / Software / Support
These costs can only be determined once the capacity planning for the coming year has been done. But how can the capacity need be determined? I have thought about this very question and would like to outline my approach.
The simple answer might be a tool like vRealize Operations Manager, which can use past growth to generate a forecast for the next 12 months. But my experience over the last years has been that large enterprises do not grow that linearly. They have a lifecycle of Operating Systems and Applications, they migrate from other platforms into the private cloud environment, and they have new business demands (additional services) or even acquisitions. The much more accurate method is therefore to ask the departments about their upcoming projects and to know their current capacity usage.
I have created a simple example that demonstrates how this forecast can be done (Full Spreadsheet) for compute resources. The “Base Growth” represents a 5% quarterly growth based on the current capacity, plus the demands of the projects reported by the business units.
With this information (in combination with your SLOs, like CPU overcommitment), spread over quarters, it should now be possible not only to plan an annual budget but also to schedule the procurements.
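The quarterly calculation described above can be sketched in a few lines of Python. The numbers below (1000 vCPUs current usage, 5% organic growth, the per-quarter project demands) are purely illustrative assumptions, not figures from the spreadsheet:

```python
# Minimal sketch of the annual forecast: each quarter applies the organic
# "Base Growth" to the current capacity and adds the project demands
# reported by the business units for that quarter.

def annual_forecast(current, base_growth, project_demands):
    """Return the projected capacity need for the next quarters.

    current         -- current capacity usage (e.g. vCPUs or GB RAM)
    base_growth     -- organic growth per quarter (e.g. 0.05 for 5%)
    project_demands -- extra demand reported by the business units,
                       one value per quarter (Q1..Q4)
    """
    forecast = []
    for demand in project_demands:
        current = current * (1 + base_growth) + demand
        forecast.append(current)
    return forecast

# Example: 1000 vCPUs today, 5% organic growth, reported project demands
print(annual_forecast(1000, 0.05, [120, 0, 250, 80]))
```

The last value of the list is the capacity you need to budget for by year-end; the intermediate values let you schedule the procurements per quarter.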
This Capacity Planning Strategy fits perfectly for business models/enterprises that are not very agile, where the demand for fast procurements and on-demand infrastructure provisioning is quite low.
Entitlement Based Strategy
Entitlement Based Capacity Planning is often driven by a Cloud Management Portal like VMware vRealize Automation, but not necessarily. One of the key prerequisites of Entitlement Based Capacity Planning is the ability for the business units to monitor and manage their services and entitlements. So, how does Entitlement Based Capacity Planning work?
Each business unit that wants to make use of the private cloud platform (in this case often an IaaS platform) is assigned a certain amount of resources, the entitlement. The sum of all entitlements is basically the capacity the service owner of the private cloud platform must provide. The entitlements are usually provided with factors; this offers the capability to overprovision the real resources by a defined amount.
If a business unit needs to expand its entitlement, the service owner of the private cloud platform can specify a lead time, e.g. up to 30 days for the expansion (the “On-Demand” chapter goes a little deeper into the topic of expansion time). The business unit thus has the responsibility to request the expansion in a timely manner. The lead time gives time for new resources to be deployed. But new resources do not need to be deployed in all cases; the service owner of the private cloud platform can temporarily overprovision more aggressively or may keep a buffer anyway. On the other hand, the business unit has the ability to downscale, shut down, or de-provision other services to free up resources in the entitlement; this typically leads to more rightsizing and less waste if the proper tools are provided.
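The entitlement bookkeeping can be sketched as follows. The business unit names, entitlement sizes, and the 3:1 overprovisioning factor are hypothetical examples, not values from a real environment:

```python
# Minimal sketch of Entitlement Based Capacity Planning: the sum of all
# entitlements, divided by the overprovisioning factor, is the physical
# capacity the service owner must provide.

def required_physical_capacity(entitlements, overprovision_factor):
    """Physical capacity the service owner must provide.

    entitlements         -- dict of business unit -> entitled vCPUs
    overprovision_factor -- e.g. 3.0 means 3 entitled vCPUs per physical core
    """
    return sum(entitlements.values()) / overprovision_factor

entitlements = {"finance": 600, "engineering": 1200, "hr": 200}
# 2000 entitled vCPUs at a 3:1 factor need roughly 667 physical cores.
print(required_physical_capacity(entitlements, 3.0))
```

When a business unit requests an expansion, the service owner can compare the new entitlement sum against the installed hardware to decide whether the buffer absorbs it or a procurement must be triggered within the lead time.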
On-Demand Strategy
On-Demand growth of capacity is, in my opinion, the most complex of the highlighted Capacity Planning Strategies. This sounds strange at first because there is no need for forward-looking capacity planning; the pitfalls lie elsewhere. The biggest challenge is to understand how long it takes to increase capacity. This duration depends on many factors, and the entire team (purchasing, datacenter infrastructure, network, SAN, etc.) must work well together to stay on schedule. Since a well-rehearsed process and a lot of experience are required, this strategy is mostly used in the service provider sector or in large enterprise environments.
If you are aware of the worst-case time for each cluster/environment/resource type, you need a tool (e.g. vRealize Operations Manager) or process to identify when you need to start expanding the capacity. For example, if you need 30 days to order, mount, and install new VMware ESXi hosts, you have to start 30 days before you run out of CPU or Memory resources.
Just as with the other Capacity Planning Strategies, your On-Demand strategy should include the defined HA capacity and can include an additional buffer. As a fallback to the required lead time, a fixed percentage can be set as a hard limit:
30 days before you run out of CPU or Memory resources, BUT at least 15% headroom.
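This expansion trigger can be sketched as a simple rule. The 30-day lead time and 15% headroom floor are the example values from the text; the constant daily growth rate is a simplifying assumption (a tool like vRealize Operations Manager derives this from historical trends):

```python
# Minimal sketch of the On-Demand expansion trigger: start procurement
# when the usage projected over the lead time would exceed capacity,
# OR when free headroom drops below the hard limit.

def must_expand(used, capacity, daily_growth, lead_time_days=30,
                min_headroom=0.15):
    projected = used + daily_growth * lead_time_days   # usage at end of lead time
    headroom = (capacity - used) / capacity            # free share right now
    return projected >= capacity or headroom <= min_headroom

# 800 of 1000 vCPUs used, growing 8 vCPUs/day:
# projected usage in 30 days is 1040 >= 1000 -> time to order hardware.
print(must_expand(800, 1000, 8))
```

The check should run continuously (e.g. as a daily report), because with On-Demand growth the trigger date is the only thing standing between you and a resource shortage.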
Every time I have implemented an On-Demand growth strategy, another question came up pretty soon: in what steps do we expand? This is a quite complex question and heavily depends on your business model and/or growth profile. But from the technical perspective, a building block model is the easiest to handle. If we stay with the ESXi host example, it is way easier to scale in full racks (n ESXi hosts + TOR switches) than in single ESXi hosts, which sometimes require an additional rack, datacenter cabling, and TOR switches (causing fluctuating lead times). Also, for time planning and the WIP flow, it is usually easier to install a larger number of ESXi hosts in one step.
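The building block idea boils down to rounding every expansion up to whole blocks. The rack size of 16 ESXi hosts below is a hypothetical example; use whatever your datacenter building block actually contains:

```python
import math

# Minimal sketch of building-block expansion: instead of adding single
# ESXi hosts, always grow in full racks (n hosts + TOR switches), so the
# procurement and installation effort stays predictable.

def expansion_step(hosts_needed, hosts_per_rack=16):
    """Round the required host count up to whole racks.

    Returns (racks_to_order, hosts_actually_added).
    """
    racks = math.ceil(hosts_needed / hosts_per_rack)
    return racks, racks * hosts_per_rack

# Forecast says 5 more hosts are needed -> order one full rack anyway.
print(expansion_step(5))
```

The surplus hosts of a partially needed rack are not waste; they become the buffer that absorbs the next growth spurt before a new procurement cycle starts.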
Conclusion
As I initially said, there is no “One Size Fits All” in Capacity Planning Strategies. Possibly, one of the strategies fits your business demand and works well for your team, or a combination of two strategies (e.g. “Annual Forecast” as a baseline and “On-Demand” if required) works great for your environment.
In any case, it is important to first understand the consumers and business needs. Once you have figured out your Capacity Planning Strategy, you can start to optimize your processes and technical tools/skills to be more efficient with this strategy (and generate more business value).