COVID-19 has changed the trajectory of cloud adoption and consumption in many industries, including online retail, online learning, online collaboration, telemedicine, and many SaaS applications. In these industries, accelerated consumption of cloud services has caused a sharp increase in cloud costs. In other industries (such as hotels and airlines), demand plummeted. Either way, cloud cost optimisation has become a top priority for most businesses.
That is the focus of this post, in which I share 5 actionable insights on how to optimise cloud costs quickly.
Action number 1 – Establish Cloud Cost Optimisation Team
Cost optimisation is a team sport, so there is a need for a cost optimisation team. The team should comprise:
– Cloud Architect. This role introduces cost optimisation as a pillar of the design, giving consideration to resource sizing, network topology, application architecture, etc.
– DevOps Engineer. This role focuses on automating manual tasks, such as producing utilisation reports, shutting down underutilised resources, and restarting them when needed.
– FinOps Analyst. FinOps, short for financial operations, is a relatively new discipline in the cloud space. The FinOps analyst liaises with consumers of cloud services (application owners and business units) and with finance to ensure collaboration. FinOps analysts are often responsible for enforcing resource naming and tagging. I have seen many clients struggle to understand their complex monthly cloud bills, and tagging goes a long way towards simplifying billing.
Cost optimisation is not a one-hit wonder; it is an ongoing process that needs a dedicated team to drive it.
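To make the tagging responsibility of the FinOps analyst a little more concrete, here is a minimal sketch of a tag compliance check. The required tag names, the resource names, and the inventory structure are all hypothetical; in practice the inventory would come from your cloud provider's export or API, which is not shown here.

```python
# Hypothetical tagging policy: every resource must carry these tags.
REQUIRED_TAGS = {"owner", "cost-centre", "environment"}


def missing_tags(resource_tags):
    """Return the required tags that a resource does not have (case-insensitive)."""
    present = {key.lower() for key in resource_tags}
    return REQUIRED_TAGS - present


# Made-up inventory: resource name -> tags found on that resource.
inventory = {
    "vm-web-01": {"owner": "retail", "cost-centre": "cc-100", "environment": "prod"},
    "vm-test-07": {"Owner": "platform"},
}

# Keep only the resources that break the policy, with the tags they lack.
non_compliant = {
    name: sorted(missing_tags(tags))
    for name, tags in inventory.items()
    if missing_tags(tags)
}

for name, tags in non_compliant.items():
    print(f"{name} is missing: {', '.join(tags)}")
```

A report like this, run on a schedule, gives the team a simple list of owners to chase before the monthly bill arrives.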
Action number 2 – Right Sizing
I am a firm believer that right sizing needs to happen before a single workload is migrated to the public cloud. Workloads should first be baselined on premise. A common mistake I see in pre-migration sizing is using provisioned capacity as the baseline. For example, if we provisioned a virtual machine with 6GHz CPU and 16GB RAM, then this becomes the baseline and similarly sized resources are provisioned in the cloud. The problem is that when you examine the actual utilisation of those resources on premise, it is often somewhere between 15% and 20%. This comes from the habit of sizing for peaks and then adding 30%-40% as a safety cushion. Carrying this on-premise approach over will yield a significant waste of resources in the public cloud's OPEX-based model.
The right approach is to monitor resource utilisation on premise for a month and look at the average and peak utilisation. In the typical example above with 6GHz CPU and 16GB RAM, I often see reports showing average CPU utilisation of 0.85GHz with a peak of 1.5GHz, and average RAM utilisation of 6GB with a peak of 8.5GB. We can use a similar approach for storage and network. As you can see, there is a significant difference between a baseline based on provisioned resources and one based on average and peak utilisation.
Establishing an accurate baseline could save an organisation a significant amount of money.
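As a rough illustration of the baseline approach described above, the sketch below derives average and peak utilisation from a set of monitoring samples and suggests a size from the observed peak. The sample values are made up, and the 20% headroom is my own assumption for illustration, not a recommendation; choose a margin that fits your workload.

```python
from statistics import mean


def sizing_baseline(samples, headroom=0.2):
    """Return (average, peak, suggested size) for a list of utilisation samples.

    headroom is a hypothetical safety margin applied on top of the observed peak.
    """
    if not samples:
        raise ValueError("need at least one utilisation sample")
    avg = mean(samples)
    peak = max(samples)
    suggested = peak * (1 + headroom)
    return avg, peak, suggested


# Illustrative CPU samples in GHz (made-up values, not real monitoring data).
cpu_ghz = [0.6, 0.7, 0.85, 1.5, 0.6]
provisioned_ghz = 6.0

avg, peak, suggested = sizing_baseline(cpu_ghz)
print(f"average {avg:.2f}GHz, peak {peak:.2f}GHz, suggested {suggested:.2f}GHz")
print(f"provisioned capacity is {provisioned_ghz / suggested:.1f}x the suggested size")
```

The same function can be fed RAM, storage, or network samples; only the units change.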
Action number 3 – Identify and Shut Down Idle Resources
I recently had a conversation with a client's head of operations about their DevOps environment running in the public cloud. I had noticed it was running 24/7. He explained that the DevOps team was used to running its environment 24/7 on premise, and had only moved to the cloud on the condition of keeping the same freedom. Attempting to change that would lead to political battles, which he wanted to avoid.
I totally understand the political issue, and my response was: 168!
There are 168 hours in a week. The workday is 7.5 hours (more or less, depending on where you are). I will take a conservative approach and assume the development team works 10 hours a day, Monday to Friday. This means developers are working 50 hours a week, yet the organisation is paying for 168 hours! Developers are using resources less than 30% of the time, yet the organisation is paying for 100% utilisation. In this case, with a DevOps bill of £15,000/month, shutting down resources during non-working hours will save over £10,000 a month. Is that figure worth a conversation about reinvesting that £10k in other strategic parts of the business?
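The arithmetic behind those figures can be written out as a short calculation. It assumes the bill scales linearly with running hours, which is a simplification: items such as storage usually keep billing while servers are stopped.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def shutdown_saving(monthly_bill, hours_per_day, days_per_week):
    """Return (fraction of the week in use, monthly saving from shutting down otherwise).

    Simplifying assumption: cost is proportional to the hours resources are running.
    """
    used_hours = hours_per_day * days_per_week
    used_fraction = used_hours / HOURS_PER_WEEK
    saving = monthly_bill * (1 - used_fraction)
    return used_fraction, saving


used_fraction, saving = shutdown_saving(15_000, hours_per_day=10, days_per_week=5)
print(f"in use {used_fraction:.1%} of the week, potential saving £{saving:,.0f}/month")
```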
Automation should be leveraged to shut down and restart resources to optimise costs.
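As a minimal sketch of the decision such automation has to make, the function below answers one question: should the environment be running right now? The 08:00 to 18:00, Monday to Friday window is an assumption for illustration. The actual stop and start calls depend on your cloud provider and scheduler and are deliberately left out.

```python
from datetime import datetime, time

# Assumed working window; adjust to your own team's hours.
WORK_START = time(8, 0)
WORK_END = time(18, 0)
WORK_DAYS = {0, 1, 2, 3, 4}  # Monday=0 ... Friday=4


def should_be_running(now: datetime) -> bool:
    """True if 'now' falls inside the working window, False otherwise."""
    return now.weekday() in WORK_DAYS and WORK_START <= now.time() < WORK_END


# A scheduled job could call this periodically and stop or start resources accordingly.
if should_be_running(datetime.now()):
    print("inside working hours: resources should be up")
else:
    print("outside working hours: resources can be shut down")
```

Keeping the schedule in one small function also makes it easy to agree the window with the development team before anything is switched off.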
Action number 4 – Network Topology
There are different options for connecting to the cloud. For example, Microsoft Azure offers three options: Point-to-Site VPN, Site-to-Site VPN, and ExpressRoute (a dedicated private connection). For organisations that are still early in their cloud journey and running a small environment, a VPN will be more cost-effective, although it might not be acceptable because it lacks QoS and is less secure than a dedicated private line.
Although the major cloud vendors (Azure, AWS, GCP) do not charge for ingress traffic (traffic coming from the internet into your cloud environment), you are charged for egress traffic (traffic from your cloud environment to the internet and to other regions). The architecture needs to keep traffic within the same region where possible to reduce egress costs. There are many other network design considerations that can be leveraged to optimise costs.
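To see how much of your traffic is exposed to egress charges, a simple tally like the one below can help. The region labels, the traffic volumes, and the per-GB rate are all placeholders I made up for illustration, not real prices, and the sketch treats all traffic inside a region as free, which is a simplification.

```python
# Placeholder rate for illustration only; check your provider's price list.
RATE_PER_GB = 0.05


def billable_egress_gb(flows):
    """Sum the GB that leave their source region.

    flows: iterable of (source_region, destination, gb). The source is always
    a cloud region; the destination is another region or "internet".
    """
    return sum(gb for source, destination, gb in flows if source != destination)


# Made-up monthly traffic figures.
flows = [
    ("uk-south", "uk-south", 800),  # stays in region: not counted
    ("uk-south", "uk-west", 150),   # crosses regions: counted
    ("uk-south", "internet", 50),   # leaves the cloud: counted
]

egress_gb = billable_egress_gb(flows)
print(f"{egress_gb}GB billable, about {egress_gb * RATE_PER_GB:.2f} at the placeholder rate")
```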
Action number 5 – Reserved Instances and Spot Instances
Major cloud providers offer significant cost savings to enterprises that commit to using a certain vCPU series over a fixed period (1 or 3 years). These commitments are known as reserved instances, and the discounts can be as high as 70%.
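One way to judge whether a commitment is worthwhile is a break-even check, sketched below. It assumes the reserved price is paid for every hour of the term whether or not the server is used, and that the discount is measured against the on-demand rate; the function name is my own.

```python
def cheaper_option(hours_used_fraction, reserved_discount):
    """Compare costs relative to running on-demand 24/7 (which costs 1.0).

    hours_used_fraction: share of the time the resource actually needs to run.
    reserved_discount: discount off the on-demand rate, e.g. 0.70 for 70%.
    """
    on_demand_cost = hours_used_fraction   # pay only for the hours used
    reserved_cost = 1 - reserved_discount  # pay for all hours, at the discounted rate
    return "reserved" if reserved_cost < on_demand_cost else "on-demand"


print(cheaper_option(1.0, 0.70))       # always-on workload
print(cheaper_option(50 / 168, 0.70))  # environment used 50 hours a week
```

The takeaway is that a reservation pays off only when the resource runs for a larger share of the time than one minus the discount, so it is worth deciding what can be shut down before committing.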
Cloud providers also offer their unused capacity to enterprises at up to a 90% discount, as what they call spot instances. The catch is that spot servers are only available when the provider has unused capacity, and they can be shut down with little or no notice. Spot instances are ideal for stateless, non-critical workloads. I have mitigated the risk of sudden shutdown by using auto scaling to add servers that replace the ones shut down, and by using queues so that tasks remain queued while another server is powered up.
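The queue-based mitigation can be illustrated with a small simulation. This is a toy model, not a real cloud integration: interruptions are random, the interruption rate is invented, and the "replacement server" is just the next loop iteration standing in for auto scaling. The point it shows is that a task is removed from the queue only after it has completed.

```python
import random
from collections import deque


def run_on_spot(tasks, interruption_rate=0.3, seed=0):
    """Process tasks on simulated spot servers; return (completed, replacements).

    interruption_rate is a made-up probability that a server is reclaimed mid-task.
    """
    if not 0 <= interruption_rate < 1:
        raise ValueError("interruption_rate must be in [0, 1)")
    rng = random.Random(seed)
    queue = deque(tasks)
    completed = []
    replacements = 0
    while queue:
        task = queue[0]  # peek only: the task stays queued while in progress
        if rng.random() < interruption_rate:
            replacements += 1  # server reclaimed; a replacement picks the task up
            continue
        completed.append(task)
        queue.popleft()  # remove the task only once it has finished
    return completed, replacements


completed, replacements = run_on_spot(["resize-1", "resize-2", "resize-3"])
print(f"completed {len(completed)} tasks using {replacements} replacement servers")
```

Because nothing is lost when a server disappears, the workload only needs to tolerate delay, which is why this pattern suits stateless, non-critical jobs.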
The above are 5 actions out of a portfolio of levers that can be utilised to optimise public cloud costs quickly.
I hope you have found this post informative and thank you for reading.