
Application Scaling in Kubernetes

One of the primary advantages of running applications on Kubernetes is the ability to automatically scale them in response to real-world demand. Proper scaling ensures that your application has the resources it needs to be responsive and available, while also optimizing costs by not over-provisioning resources.

The Contain Platform provides the foundational components necessary to implement both horizontal and vertical scaling for your applications.

Further Reading

For a more comprehensive deep-dive into this topic, please refer to the official Kubernetes Autoscaling Workloads documentation.

Scaling Concepts

There are two primary ways to scale an application in Kubernetes:

  1. Horizontal Scaling: This involves increasing or decreasing the number of running instances (pods) for your application. If your application's load increases, you add more pods; if the load decreases, you remove pods. This is the most common scaling strategy for stateless applications (a quick manual example follows this list).

  2. Vertical Scaling: This involves increasing or decreasing the resources (CPU and memory) allocated to the existing pods. If a pod is consistently using 100% of its allocated CPU, you can give it more CPU. This is often used for stateful applications like databases that may not be able to scale horizontally with ease.
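
Before adding any autoscalers, it helps to see the manual equivalent of horizontal scaling. The command below sets a Deployment to a fixed replica count; the Deployment name and namespace are placeholders:

kubectl scale deployment my-webapp --replicas=5 -n my-app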

The Role of the Metrics Server

To make intelligent scaling decisions, Kubernetes needs to know how much CPU and memory your application's pods are currently consuming. This is the job of the Kubernetes Metrics Server.

The Contain Platform includes the Metrics Server as a standard component in every cluster. It collects resource usage data from every pod and exposes it through the Kubernetes Metrics API, making it available for autoscalers to use.
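
As a quick sanity check, you can confirm that the Metrics API is serving data by listing current pod usage (the namespace here is illustrative):

kubectl top pods -n my-app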

Horizontal Scaling with the Horizontal Pod Autoscaler (HPA)

The most common way to automatically scale your application is with a HorizontalPodAutoscaler (HPA). The HPA automatically adjusts the number of replicas in a Deployment, StatefulSet, or other scalable resource based on observed CPU or memory usage.

How It Works

The HPA controller periodically queries the Metrics Server for the resource utilization of the pods it's targeting. It then compares the current utilization to the target you've defined and calculates the optimal number of replicas.
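
Roughly speaking, the calculation follows the formula from the upstream HPA documentation:

desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue)

For example, if 4 pods average 100% CPU utilization against an 80% target, the HPA scales to ceil(4 * 100 / 80) = 5 replicas.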

Example: Scaling on CPU Utilization

Let's say you have a Deployment for your web application and you want to ensure the pods' average CPU usage stays around 80%. If the average CPU usage exceeds this target, the HPA will add more pods. If it drops well below, it will remove pods.

You would define an HPA resource like this:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-webapp-hpa
  namespace: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-webapp
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80

  • scaleTargetRef: Points to the Deployment we want to scale.
  • minReplicas / maxReplicas: Define the lower and upper bounds for the number of pods. The HPA will never scale below 2 or above 10 pods.
  • metrics: Defines the metric to scale on. In this case, it targets an average CPU utilization of 80% across all pods.
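
Assuming the manifest above is saved as my-webapp-hpa.yaml (the filename is arbitrary), you can apply it and then watch the HPA's status and current replica count:

kubectl apply -f my-webapp-hpa.yaml
kubectl get hpa my-webapp-hpa -n my-app --watch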

Set Resource Requests

For the HPA to work effectively, you must set CPU resource requests on your Pods' containers. The utilization percentage is calculated based on the requested amount (e.g., 80% of a 500m CPU request is 400m). Without requests, the HPA cannot calculate utilization and will not function.
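
For reference, here is a minimal sketch of the relevant part of the Deployment's pod template; the container name, image, and request values are illustrative. With a 500m CPU request, the 80% utilization target corresponds to 400m of actual usage per pod:

spec:
  template:
    spec:
      containers:
      - name: my-webapp
        image: my-webapp:1.0        # illustrative image
        resources:
          requests:
            cpu: 500m               # 80% target = 400m of actual usage
            memory: 256Mi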

Vertical Scaling with the Vertical Pod Autoscaler (VPA)

While HPA changes the number of pods, the VerticalPodAutoscaler (VPA) adjusts the CPU and memory resource requests of the pods themselves. This helps you "right-size" your pods, ensuring they have the resources they need without being over-provisioned.

How It Works

The VPA analyzes the historical resource consumption of your pods. Based on this analysis, it can recommend or automatically apply updated resource requests and limits to your pod specifications.

The safest and most common way to use VPA is in "recommendation mode." In this mode, the VPA does not automatically change your pods' resources. Instead, it provides a recommendation that you can review and apply manually. This is a great tool for determining the optimal resource requests for your application.

Here is an example of a VPA in recommendation mode:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-webapp-vpa
  namespace: my-app
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind:       Deployment
    name:       my-webapp
  updatePolicy:
    updateMode: "Off"

  • targetRef: Points to the Deployment we want to analyze.
  • updatePolicy.updateMode: "Off": This is the key field. It tells the VPA to only generate recommendations and not to apply any changes.

Once the VPA has had time to gather usage data, you can check its recommendations with kubectl describe vpa my-webapp-vpa.

Automatic Updates

The VPA can also be configured with an updateMode of "Auto". In this mode, it applies its recommendations automatically by evicting pods and letting them be recreated with the updated resource requests, which can cause a brief service interruption. This mode should be used with caution and is best suited for applications that can handle rolling restarts gracefully.
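
If you do opt in to automatic updates, the only change from the earlier manifest is the update mode (shown here as a fragment of the VPA spec):

  updatePolicy:
    updateMode: "Auto"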

Combining HPA and VPA: The Best of Both Worlds

You can use HPA and VPA together to achieve a highly efficient scaling strategy.

  1. Use VPA in "Off" mode: Deploy a VPA with updateMode: "Off" for your application. Let it run for a period to gather data and generate stable recommendations for CPU and memory requests.
  2. Apply the Recommendations: Manually update your Deployment's pod template with the resource requests recommended by the VPA (one way to do this is shown after this list).
  3. Use HPA for Horizontal Scaling: With your pods now "right-sized," create an HPA to scale the number of replicas horizontally based on CPU or memory utilization.
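
For step 2, one way to apply the recommended values without hand-editing the manifest is kubectl set resources; the request values below are placeholders for whatever the VPA recommends:

kubectl set resources deployment my-webapp -n my-app --requests=cpu=500m,memory=256Mi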

Conflict Warning

You should not use an HPA and a VPA in "Auto" mode on the same workload for the same metric (CPU or memory). The two autoscalers will fight each other: the VPA changes the resource requests that the HPA uses to calculate utilization, which leads to unstable scaling behavior. The recommended pattern is to use the VPA in "Off" mode to inform the resource requests that the HPA's utilization targets are based on.