The Accelerate State of DevOps Report 2019 found that elite performers were 24 times more likely than low performers to have met essential cloud characteristics. One of those characteristics is Measured Service: cloud systems automatically control, optimize, and report resource use based on the type of service, such as storage, processing, bandwidth, and active user accounts. While this may have been sufficient for companies focused only on growth in the pre-COVID world, it is no longer enough now that keeping infrastructure cost down is as important as growth.
We at Devtron have been helping our customers achieve cost-effective growth on AWS and Kubernetes for a long time. Below are the four heads under which companies should analyze their infrastructure to optimize cost. Though we use AWS cost optimization examples wherever required, the same ideas can be extended to any cloud provider.
1. Cost-Effective Resources
Provisioning – Many infrastructure requirements are temporary in nature, for example setting up new testing environments. Quick provisioning can speed up go-to-market by letting teams test multiple features in parallel. And if the teams sharing infrastructure are not distributed across timezones, quick provisioning and deprovisioning can also save cost. As a bonus, you are better prepared for disaster recovery.
Take Away: Identify temporary workloads and automate their provisioning and deprovisioning.
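A minimal sketch of automated deprovisioning using boto3, assuming temporary instances carry a hypothetical env=ephemeral tag and a ttl-hours tag set at creation time; run it on a schedule to reap test environments that have outlived their TTL:

```python
from datetime import datetime, timedelta, timezone

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def reap_expired_test_envs():
    """Terminate ephemeral instances whose TTL has elapsed."""
    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[
            {"Name": "tag:env", "Values": ["ephemeral"]},           # hypothetical tag
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    expired = []
    for page in pages:
        for reservation in page["Reservations"]:
            for inst in reservation["Instances"]:
                tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
                ttl = timedelta(hours=int(tags.get("ttl-hours", "24")))
                if datetime.now(timezone.utc) - inst["LaunchTime"] > ttl:
                    expired.append(inst["InstanceId"])
    if expired:
        ec2.terminate_instances(InstanceIds=expired)
    return expired

if __name__ == "__main__":
    print("Terminated:", reap_expired_test_envs())
```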
Right-Sizing – Are you using the cheapest resource type that meets your requirements? An analysis of OS instances running on AWS in North America found that 84% of instances were not rightly sized, and estimated that right-sizing them could reduce AWS cost by 36% (around USD 55 million).
Take Away: Periodically measure resource utilization over a broad time range and benchmark it against the cost of different AWS resources (not only instance types but also storage types, managed instances, etc.), then apply the strategies mentioned under point 2 below.
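A minimal right-sizing sketch using boto3 and CloudWatch; the 20% CPU threshold and 14-day window are illustrative assumptions, not recommendations. It flags instances whose average CPU suggests they are oversized:

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def flag_oversized(instance_ids, avg_cpu_threshold=20.0, days=14):
    """Return instances whose average CPU over the window sits below the threshold."""
    end = datetime.now(timezone.utc)
    candidates = []
    for instance_id in instance_ids:
        stats = cloudwatch.get_metric_statistics(
            Namespace="AWS/EC2",
            MetricName="CPUUtilization",
            Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
            StartTime=end - timedelta(days=days),
            EndTime=end,
            Period=3600,  # hourly datapoints
            Statistics=["Average", "Maximum"],
        )
        points = stats["Datapoints"]
        if not points:
            continue
        avg = sum(p["Average"] for p in points) / len(points)
        peak = max(p["Maximum"] for p in points)
        if avg < avg_cpu_threshold:
            candidates.append((instance_id, round(avg, 1), round(peak, 1)))
    return candidates

print(flag_oversized(["i-0123456789abcdef0"]))  # hypothetical instance ID
```

Checking the peak alongside the average matters: an instance that idles at 5% but spikes to 95% may be correctly sized for its bursts.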
Purchasing – Different resources suit different kinds of workload, e.g. frequently accessed data storage vs. latency-sensitive data storage vs. very infrequently accessed data storage. Similarly, depending on your future visibility and failure tolerance, different contracts give you different pricing, e.g. always-up load vs. baseline load vs. periodic jobs vs. fault-tolerant batch applications vs. long-term minimum growth commitments.
Take Away: Know your application's workload type and map it against the different SLA and pricing options provided by cloud providers. AWS, for example, offers Spot Instances, Reserved Instances with different upfront payment options, On-Demand, minimum-commitment plans, etc.
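For fault-tolerant workloads, a quick way to see what Spot could save is to compare current Spot prices against the On-Demand rate. A minimal sketch with boto3; the On-Demand figure below is a placeholder assumption, so look yours up in the AWS price list:

```python
from datetime import datetime, timezone

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ON_DEMAND_HOURLY = 0.096  # placeholder: verify against the current price list

def spot_discount(instance_type="m5.large"):
    """Print the latest Spot price per AZ and its discount vs On-Demand."""
    history = ec2.describe_spot_price_history(
        InstanceTypes=[instance_type],
        ProductDescriptions=["Linux/UNIX"],
        StartTime=datetime.now(timezone.utc),  # "now" returns the current price
        MaxResults=10,
    )
    for entry in history["SpotPriceHistory"]:
        spot = float(entry["SpotPrice"])
        discount = (1 - spot / ON_DEMAND_HOURLY) * 100
        print(f'{entry["AvailabilityZone"]}: ${spot:.4f}/h '
              f"({discount:.0f}% below On-Demand)")

spot_discount()
```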
Geographical Selection – We all know it's better to move data and computation closer to the user for lower latency, and that data sovereignty rules constrain where data can live, but that does not apply to every workload, e.g. a QA environment. Prices for the same resources can vary by 30% to 170% across regions, so region choice is a real AWS cost optimization lever. Also remember that dependent applications should be colocated: moving data across geographies can negate the savings.
Take Away: Critically examine the latency and data sovereignty requirements of each workload and move it to the region that offers the lowest cost while still meeting those requirements, but remember to colocate dependent applications.
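A minimal sketch using the AWS Pricing API (served from us-east-1) to compare On-Demand prices for one instance type across regions. The filter fields follow the public price-list schema, and the region display names are assumptions to verify for your account:

```python
import json

import boto3

pricing = boto3.client("pricing", region_name="us-east-1")

def on_demand_price(instance_type, location):
    """Fetch the On-Demand USD hourly price for a shared-tenancy Linux instance."""
    resp = pricing.get_products(
        ServiceCode="AmazonEC2",
        Filters=[
            {"Type": "TERM_MATCH", "Field": "instanceType", "Value": instance_type},
            {"Type": "TERM_MATCH", "Field": "location", "Value": location},
            {"Type": "TERM_MATCH", "Field": "operatingSystem", "Value": "Linux"},
            {"Type": "TERM_MATCH", "Field": "tenancy", "Value": "Shared"},
            {"Type": "TERM_MATCH", "Field": "preInstalledSw", "Value": "NA"},
            {"Type": "TERM_MATCH", "Field": "capacitystatus", "Value": "Used"},
        ],
        MaxResults=1,
    )
    product = json.loads(resp["PriceList"][0])  # entries arrive as JSON strings
    term = next(iter(product["terms"]["OnDemand"].values()))
    dimension = next(iter(term["priceDimensions"].values()))
    return float(dimension["pricePerUnit"]["USD"])

for region in ["US East (N. Virginia)", "Asia Pacific (Mumbai)", "South America (Sao Paulo)"]:
    print(region, on_demand_price("m5.large", region))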
2. Matching Supply and Demand
Demand-based – Applications with varying but not spiky load across the day can use demand-based scaling against CPU utilization, RAM utilization, or custom metrics. In fact, autoscaling is one of the easiest yet most often overlooked capabilities, and it can deliver substantial AWS cost optimization.
Take Away: Identify applications that have variable load and configure them for autoscaling, but remember that it may take up to 5 minutes for a new instance to become available, so account for that lead time when configuring autoscaling.
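A minimal sketch (boto3) attaching a target-tracking policy to a hypothetical Auto Scaling group named web-asg; AWS then adds or removes instances to hold average CPU near the target:

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",          # hypothetical ASG name
    PolicyName="cpu-target-60",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        "TargetValue": 60.0,                 # aim for ~60% average CPU
    },
    # New instances take minutes to boot; don't re-trigger scaling
    # while the previous instances are still warming up.
    EstimatedInstanceWarmup=300,
)
```

The warmup setting is how you "provision for" the instance startup delay mentioned above: without it, the policy keeps launching instances before the first batch starts absorbing load.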
Buffer-based – Queue-based scaling systems have garnered great interest in recent years; they let developers build services that scale based on factors like queue length. This works well for intermittent and spiky loads, where requests alternate between very low volumes and sudden surges.
Take Away: Queue-based scaling can be great for intermittent and spiky loads, but configure it carefully, with sensible upper bounds, or a huge burst of load can result in a very high bill.
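A minimal buffer-based scaling sketch: size a hypothetical worker Auto Scaling group from SQS backlog, with a hard cap so a burst cannot scale costs without bound. The queue URL, group name, and messages-per-worker ratio are all assumptions:

```python
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
autoscaling = boto3.client("autoscaling", region_name="us-east-1")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"  # hypothetical
MESSAGES_PER_WORKER = 100        # assumed throughput of a single worker
MIN_WORKERS, MAX_WORKERS = 1, 20  # the cap keeps burst cost bounded

def rescale_workers():
    """Set desired capacity proportional to queue backlog, within bounds."""
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])
    desired = max(MIN_WORKERS,
                  min(MAX_WORKERS, -(-backlog // MESSAGES_PER_WORKER)))  # ceil
    autoscaling.set_desired_capacity(
        AutoScalingGroupName="queue-workers",  # hypothetical ASG
        DesiredCapacity=desired,
    )
    return backlog, desired

print(rescale_workers())
```

MAX_WORKERS is the line that protects your bill: a million-message burst drains more slowly, but spend stays bounded.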
Time-based – This lets you apply different scaling at different times of day. For example, if your customers reside in only one country, traffic may drop to nearly zero at night. In our experience, companies rarely use scheduled scaling for AWS cost optimization.
Take Away: Do this only after demand-based scaling, and only if the difference in workload across time slots cannot be handled by configuring an appropriate min-max range in demand-based scaling.
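A minimal sketch (boto3) of scheduled scaling for the same hypothetical ASG, shrinking overnight and restoring capacity in the morning; cron expressions are evaluated in UTC by default:

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Scale down at 22:00 when traffic drops to near zero.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="web-asg",        # hypothetical ASG name
    ScheduledActionName="night-scale-down",
    Recurrence="0 22 * * *",
    MinSize=1, MaxSize=2, DesiredCapacity=1,
)

# Restore daytime capacity at 06:00, before traffic picks up.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="web-asg",
    ScheduledActionName="morning-scale-up",
    Recurrence="0 6 * * *",
    MinSize=3, MaxSize=10, DesiredCapacity=4,
)
```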
3. Expenditure Monitoring
Utilization reports – Periodically monitor the utilization of the resources in use so that appropriate action can be taken based on the measures under points 1 and 2.
Coverage reports – Periodically monitor how much of your usage is covered under each pricing strategy and shift toward the most cost-effective ones. Most cloud providers offer a dashboard for this, such as Cost Explorer on AWS.
Take Away: One can only optimize if one knows where money is being spent. Build systems that raise alerts at the right time on the right channels.
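A minimal monitoring sketch using the Cost Explorer API: pull yesterday's spend per service and flag anything above a threshold. The threshold and the alerting channel (here, just a print) are assumptions; in practice this would post to Slack, email, or a pager:

```python
from datetime import date, timedelta

import boto3

ce = boto3.client("ce", region_name="us-east-1")
DAILY_ALERT_THRESHOLD_USD = 50.0  # illustrative assumption

def check_yesterday_spend():
    """Flag any service whose spend yesterday crossed the threshold."""
    yesterday = date.today() - timedelta(days=1)
    resp = ce.get_cost_and_usage(
        TimePeriod={  # Start is inclusive, End is exclusive
            "Start": yesterday.isoformat(),
            "End": date.today().isoformat(),
        },
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )
    for group in resp["ResultsByTime"][0]["Groups"]:
        service = group["Keys"][0]
        cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
        if cost > DAILY_ALERT_THRESHOLD_USD:
            print(f"ALERT: {service} cost ${cost:.2f} yesterday")

check_yesterday_spend()
```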
4. Optimizing Time
Cost optimization team – Create a cross-functional team that owns the ROI of applications. This team owns the cost optimization activity, creates a roadmap, and involves stakeholders to ensure these activities happen on a timely basis.
Take Away: More often than not, cost is not optimized because nobody owns it, or the people who own it lack the right authority and skill set. To achieve cost optimization consistently and in a time-bound fashion, it's important to form the right team with the right skill set and authority.
So how are you optimizing your infrastructure cost? Share with us in the comments below.