| 1 | +--- |
| 2 | +title: Workload-Aware Preemption |
| 3 | +content_type: concept |
| 4 | +weight: 80 |
| 5 | +--- |
| 6 | + |
| 7 | +<!-- overview --> |
| 8 | +{{< feature-state feature_gate_name="WorkloadAwarePreemption">}} |
| 9 | + |
| 10 | +Workload-aware preemption introduces a preemption mechanism specifically designed for PodGroups. |
| 11 | +When a PodGroup cannot be scheduled, the scheduler uses preemption logic that tries to |
| 12 | +make room for the whole PodGroup. This logic is used exclusively during PodGroup scheduling |
| 13 | +and replaces the default preemption mechanism for the pods in that PodGroup. |
| 14 | + |
| 15 | +When this feature is enabled, the scheduler treats the PodGroup as a single preemptor unit, |
| 16 | +rather than evaluating the individual pods of the PodGroup in isolation. To make room for the pending pods in the group, |
| 17 | +it searches for victims across the entire cluster. |
| 18 | +It can also preempt other PodGroups as victims, according to their disruption modes. |
| 19 | + |
| 20 | +This feature depends on [Gang Scheduling](/docs/concepts/scheduling-eviction/gang-scheduling/) |
| 21 | +and the [Workload API](/docs/concepts/workloads/workload-api/). |
| 22 | +Ensure the [`GenericWorkload`](/docs/reference/command-line-tools-reference/feature-gates/#GenericWorkload) |
| 23 | +and [`GangScheduling`](/docs/reference/command-line-tools-reference/feature-gates/#GangScheduling) feature gates |
| 24 | +and the `scheduling.k8s.io/v1alpha2` {{< glossary_tooltip text="API group" term_id="api-group" >}} are enabled in the cluster. |
| 25 | + |
| 26 | +<!-- body --> |
| 27 | + |
| 28 | +## How it works |
| 29 | + |
| 30 | +The workload-aware preemption process follows the same principles |
| 31 | +as [default preemption](/docs/concepts/scheduling-eviction/pod-priority-preemption/#preemption) |
| 32 | +with a few differences: |
| 33 | + |
| 34 | +1. Cluster-wide domain: Instead of evaluating preemption node by node, |
| 35 | + the scheduler evaluates the entire cluster as a single domain. |
| 36 | + It selects a set of victims across multiple nodes that can be removed |
| 37 | + to make enough room for the preemptor PodGroup to be scheduled. |
| 38 | + |
| 39 | +2. Victim importance hierarchy: The scheduler decides which preemption units |
| 40 | + (individual pods or PodGroups) are more critical and should be spared from preemption |
| 41 | + using a strict hierarchy, in which each criterion is considered only when the previous ones are tied: |
| 42 | + * Priority: Higher priority units are always more important. |
| 43 | + * Workload type: PodGroups are considered more important than individual Pods of the same priority. |
| 44 | + * Group size (PodGroups): If both units are PodGroups, |
| 45 | + the one with more members (larger size) is considered more important. |
| 46 | + * Start time: Units that started earlier are more important. |
| 47 | + |
| 48 | +3. PodGroup priority and disruption: The scheduler considers the specific |
| 49 | + [priority and disruption mode](/docs/concepts/workloads/workload-api/disruption-and-priority/) of a PodGroup |
| 50 | + to decide whether and how its pods can be preempted. |
| 51 | + |
| 52 | +{{< note >}} |
| 53 | +When scheduling a single Pod, the default pod preemption applies. |
| 54 | +As of Kubernetes v1.36, when the scheduler performs default preemption for a single Pod |
| 55 | +and attempts to preempt a Pod that belongs to a PodGroup, it does **not** |
| 56 | +respect the `priority` or `disruptionMode` fields of that PodGroup. |
| 57 | +{{< /note >}} |
| 58 | + |
| 59 | +## {{% heading "whatsnext" %}} |
| 60 | + |
| 61 | +* Learn more about [PodGroup Priority and Disruption](/docs/concepts/workloads/workload-api/disruption-and-priority/). |
| 62 | +* Learn about the [Workload API](/docs/concepts/workloads/workload-api/). |
| 63 | +* Read more about [Gang scheduling](/docs/concepts/scheduling-eviction/gang-scheduling/). |