Workload Resilience and Scheduling¶
A resilient application is one that remains available and functional even in the face of infrastructure failures, planned maintenance, or high resource demand. The Contain Platform is built on a foundation of Kubernetes primitives that, when used correctly, enable you to build and deploy highly resilient application architectures.
This document provides a high-level, architectural overview of the key concepts and tools for managing workload resilience on the platform. It is not intended to be an exhaustive "how-to" guide but rather to explain the platform's philosophy and guide architects and senior engineers toward the most important patterns for building robust systems.
The Pillars of Workload Resilience¶
We believe that a resilient application architecture rests on four key pillars. By addressing each of these areas, you can ensure your application is protected against a wide range of potential disruptions.
- Placement: Controlling where your application instances run to survive infrastructure failures.
- Health: Defining how the platform can determine if your application is working correctly.
- Priority: Declaring which applications are most important when cluster resources are scarce.
- Disruption: Protecting your application from voluntary disruptions, such as cluster upgrades.
Placement: Surviving Infrastructure Failure¶
The most fundamental aspect of resilience is ensuring that the failure of a single piece of infrastructure (like a server node or a rack) does not take down your entire application. The platform's scheduler provides tools to control the placement of your application's pods across these physical or logical failure domains.
Pod Anti-Affinity¶
Pod Anti-Affinity is the primary architectural pattern for achieving high availability. It allows you to create rules that prevent the scheduler from placing multiple replicas of the same application onto the same underlying node, rack, or availability zone.
For any production workload with two or more replicas, we strongly recommend using pod anti-affinity to ensure that replicas are spread across different failure domains. This is the most effective way to protect your service from a single point of infrastructure failure. The scheduler does try to spread replicas across failure domains on its own, but it weighs many other factors when choosing a placement, so an explicit anti-affinity rule is the only reliable way to guarantee the spread.
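The sketch below shows what such a rule might look like for a hypothetical three-replica Deployment named checkout. The labels, image, and the choice of kubernetes.io/hostname as the topology key are illustrative; adapt them to your own failure domains (for example, topology.kubernetes.io/zone to spread across availability zones).

```yaml
# Illustrative sketch: a Deployment whose replicas must not share a node.
# The name "checkout", its labels, and the image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: checkout
              topologyKey: kubernetes.io/hostname   # spread replicas across nodes
      containers:
        - name: checkout
          image: registry.example.com/checkout:1.0.0
```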
Further Reading
For detailed syntax and examples, see the official Kubernetes documentation on Assigning Pods to Nodes.
Health: Enabling Self-Healing¶
The Contain Platform has a powerful self-healing capability, but it relies on a "contract" with your application: your application must be able to signal its own health status. Kubernetes provides health probes for this purpose. Properly configured probes are essential for achieving zero-downtime deployments and automated recovery.
Liveness Probes¶
A Liveness Probe answers the question, "Is my application container alive and functional?" If a liveness probe fails (e.g., due to a deadlock or a crashed process), the platform will automatically restart the container, attempting to recover it to a healthy state.
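As an illustration, a liveness probe might look like the fragment below; the /healthz endpoint, port, and timing values are assumptions and should match whatever your application actually exposes.

```yaml
# Illustrative pod spec fragment: /healthz, port 8080, and the timings are
# placeholders; the endpoint should report only on the health of this process.
containers:
  - name: api
    image: registry.example.com/api:1.0.0
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 10   # allow the process time to start before probing
      periodSeconds: 10         # probe every 10 seconds
      failureThreshold: 3       # restart the container after 3 consecutive failures
```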
Readiness Probes¶
A Readiness Probe answers the question, "Is my application ready to accept new traffic?" This is a critical component for safe deployments. When a new replica of your application is deployed, it will not receive any user traffic until its readiness probe passes. This prevents situations where traffic is sent to a pod that is still in the middle of a lengthy startup process. During a rolling update, the old replicas are also not stopped until the new ones report ready (subject to the Deployment's configured rollout strategy).
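A readiness probe is declared in the same way. In this hypothetical fragment, /ready is assumed to return success only once caches are warm and downstream connections are established.

```yaml
# Illustrative pod spec fragment: /ready and the thresholds are placeholders.
containers:
  - name: api
    image: registry.example.com/api:1.0.0
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5          # re-check readiness every 5 seconds
      failureThreshold: 3       # mark the pod NotReady after 3 consecutive failures
```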
Further Reading
For a detailed guide on how to configure health probes, please see the official Kubernetes documentation on Configure Liveness, Readiness and Startup Probes.
Priority: Managing Resource Contention¶
In a cluster, it's possible for the demand for resources to exceed the available capacity. When this happens, the platform needs a way to decide which workloads are most important. This is managed through a system of priority and preemption.
The platform's own system components run at the highest priority to ensure the cluster remains stable. This same mechanism is extended to your applications. By assigning a higher priority to a critical back-end service, you instruct the scheduler to protect it, even if that means evicting (preempting) a less-critical pod to make room. This is a useful tool for preventing cascading failures caused by resource starvation.
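In practice this is a single field on the pod spec. In the sketch below, the class name business-critical is a placeholder for one of the PriorityClasses described in the Resource Management documentation.

```yaml
# Hedged sketch: attaching a PriorityClass to a Deployment's pods.
# "business-critical" is a placeholder name, not a class guaranteed to exist.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments
  template:
    metadata:
      labels:
        app: payments
    spec:
      priorityClassName: business-critical   # scheduler may preempt lower-priority pods for these
      containers:
        - name: payments
          image: registry.example.com/payments:1.0.0
```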
Implementation Details
For information on the available PriorityClasses and how to apply them, please see the Pod Priority and Preemption section in the Resource Management documentation.
Disruption: Protecting Against Planned Maintenance¶
Not all disruptions are unexpected failures. Many are planned, voluntary events, such as when a cluster node must be drained of its workloads to perform a software upgrade. To ensure your application remains highly available during these events, you can create a Pod Disruption Budget (PDB).
A PDB is a contract you make with the platform. It allows you to declare, for example, "you can drain nodes for maintenance, but you must always ensure that at least 90% of my application's replicas are running at all times." If a planned maintenance action would violate the budget, the platform will wait until it is safe to proceed.
Using PDBs for your critical production workloads is essential for achieving high availability during the routine lifecycle management of the platform.
We recommend setting the minAvailable field to a value between 1 and n-1, where n is the number of replicas and n is greater than 1. Setting minAvailable equal to the full replica count would prevent the platform from ever evicting a pod voluntarily, blocking routine maintenance such as node drains.
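A minimal sketch, assuming the hypothetical checkout Deployment from earlier with three replicas: minAvailable: 2 stays within the 1 to n-1 guidance while still allowing one pod at a time to be drained.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb
spec:
  minAvailable: 2          # at least 2 of the 3 replicas must stay up during voluntary disruptions
  selector:
    matchLabels:
      app: checkout        # must match the labels on the protected pods
```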
Further Reading
For a detailed guide on how to configure PDBs, please see the official Kubernetes documentation on Specifying a Disruption Budget for your Application.
Rolling Upgrades
Pods which are deleted or unavailable due to a rolling upgrade to an application do count against the disruption budget, but workload resources (such as Deployment and StatefulSet) are not limited by PDBs when doing rolling upgrades. Instead, the handling of failures during application updates is configured in the spec for the specific workload resource.