Commit 29bd816

KEP-5710: Update scheduler docs for workload aware preemption
1 parent 2c2ed1c commit 29bd816

5 files changed

Lines changed: 154 additions & 0 deletions

‎content/en/docs/concepts/scheduling-eviction/_index.md‎

Lines changed: 1 addition & 0 deletions
@@ -27,6 +27,7 @@ of terminating one or more Pods on Nodes.
2727
* [PodGroup Scheduling](/docs/concepts/scheduling-eviction/podgroup-scheduling/)
2828
* [Gang Scheduling](/docs/concepts/scheduling-eviction/gang-scheduling/)
2929
* [Topology-aware Scheduling](/docs/concepts/scheduling-eviction/topology-aware-scheduling/)
30+
* [Workload-Aware Preemption](/docs/concepts/scheduling-eviction/workload-aware-preemption/)
3031
* [Descheduler](https://github.com/kubernetes-sigs/descheduler#descheduler-for-kubernetes)
3132
* [Node Declared Features](/docs/concepts/scheduling-eviction/node-declared-features/)
3233

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
1+
---
2+
title: Workload-Aware Preemption
3+
content_type: concept
4+
weight: 80
5+
---
6+
7+
<!-- overview -->
8+
{{< feature-state feature_gate_name="WorkloadAwarePreemption">}}
9+
10+
Workload-aware preemption introduces a preemption mechanism specifically designed for PodGroups.
11+
When a PodGroup cannot be scheduled, the scheduler uses preemption logic that tries to
12+
make scheduling of this PodGroup possible. This approach is used exclusively during PodGroup scheduling
13+
and replaces the default preemption mechanism for pods from a given PodGroup.
14+
15+
When this feature is enabled, the scheduler treats the PodGroup as a single preemptor unit,
16+
rather than evaluating individual pods from a PodGroup in isolation. To make room for the pending pods in the group,
17+
it searches for victims across the entire cluster,
18+
and preempts other PodGroups as victims according to their disruption modes.
19+
20+
This feature depends on [Gang Scheduling](/docs/concepts/scheduling-eviction/gang-scheduling/)
21+
and the [Workload API](/docs/concepts/workloads/workload-api/).
22+
Ensure the [`GenericWorkload`](/docs/reference/command-line-tools-reference/feature-gates/#GenericWorkload)
23+
and [`GangScheduling`](/docs/reference/command-line-tools-reference/feature-gates/#GangScheduling) feature gates
24+
and the `scheduling.k8s.io/v1alpha2` {{< glossary_tooltip text="API group" term_id="api-group" >}} are enabled in the cluster.
25+
26+
<!-- body -->
27+
28+
## How it works
29+
30+
The workload-aware preemption process follows the same principles
31+
as [default preemption](/docs/concepts/scheduling-eviction/pod-priority-preemption/#preemption)
32+
with a few differences:
33+
34+
1. Cluster-wide domain: Instead of evaluating preemption node by node,
35+
the scheduler evaluates the entire cluster as a single domain.
36+
It selects a set of victims across multiple nodes that can be removed
37+
to make enough room for the preemptor PodGroup to be scheduled.
38+
39+
2. Victim importance hierarchy: The scheduler decides which preemption units
40+
(individual pods or PodGroups) are more critical and should be spared from preemption
41+
using a strict hierarchy:
42+
* Priority: Higher priority units are always more important.
43+
* Workload type: PodGroups are considered more important than individual Pods of the same priority.
44+
* Group size (PodGroups): If both units are PodGroups,
45+
the one with more members (larger size) is considered more important.
46+
* Start time: Units that started earlier are more important.
47+
48+
3. PodGroup priority and disruption: The scheduler considers the specific
49+
[priority and disruption mode](/docs/concepts/workloads/workload-api/disruption-and-priority/) of a PodGroup
50+
to decide whether and how its pods can be preempted.
51+
52+
{{< note >}}
53+
When scheduling a single Pod, the default pod preemption applies.
54+
As of Kubernetes v1.36, when the scheduler performs default preemption for a single Pod
55+
and it attempts to preempt a Pod belonging to a PodGroup, it does **not**
56+
respect the `priority` or `disruptionMode` fields of that PodGroup.
57+
{{< /note >}}
58+
59+
## {{% heading "whatsnext" %}}
60+
61+
* Learn more about [PodGroup Priority and Disruption](/docs/concepts/workloads/workload-api/disruption-and-priority/).
62+
* Learn about the [Workload API](/docs/concepts/workloads/workload-api/).
63+
* Read more about [Gang scheduling](/docs/concepts/scheduling-eviction/gang-scheduling/).

‎content/en/docs/concepts/workloads/workload-api/_index.md‎

Lines changed: 4 additions & 0 deletions
@@ -39,6 +39,8 @@ Each entry in `podGroups` must have:
3939
1. A unique `name` that can be used in the Pod's [Workload reference](/docs/concepts/workloads/pods/workload-reference/).
4040
2. A [scheduling policy](/docs/concepts/workloads/workload-api/policies/) (`basic` or `gang`).
4141

42+
If the [`WorkloadAwarePreemption`](/docs/reference/command-line-tools-reference/feature-gates/#WorkloadAwarePreemption) [feature gate](/docs/reference/command-line-tools-reference/feature-gates/) is enabled, each entry in `podGroups` can also specify a [priority and disruption mode](/docs/concepts/workloads/workload-api/disruption-and-priority/).
43+
4244
```yaml
4345
apiVersion: scheduling.k8s.io/v1alpha1
4446
kind: Workload
@@ -56,6 +58,8 @@ spec:
5658
gang:
5759
# The gang is schedulable only if 4 pods can run at once
5860
minCount: 4
61+
priorityClassName: high-priority # Only applicable with WorkloadAwarePreemption feature gate
62+
disruptionMode: PodGroup # Only applicable with WorkloadAwarePreemption feature gate
5963
```
6064
6165
### Referencing a workload controlling object
Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
1+
---
2+
title: PodGroup Disruption and Priority
3+
content_type: concept
4+
weight: 10
5+
---
6+
7+
<!-- overview -->
8+
{{< feature-state feature_gate_name="WorkloadAwarePreemption" >}}
9+
10+
A PodGroup can declare a disruption mode. This mode dictates how
11+
the scheduler can disrupt a running PodGroup, for example to accommodate
12+
a higher priority PodGroup. A PodGroup also has a priority,
13+
which overrides the priority of the individual pods from the group
14+
for [workload-aware preemption](/docs/concepts/scheduling-eviction/workload-aware-preemption/) events.
15+
16+
<!-- body -->
17+
18+
## Disruption mode types
19+
20+
{{< note >}}
21+
As of Kubernetes v1.36, the `priority` and `disruptionMode` fields of a PodGroup are respected only
22+
by [workload-aware preemption](/docs/concepts/scheduling-eviction/workload-aware-preemption/).
23+
During the pod scheduling phase, the scheduler does not take into account
24+
the `priority` or `disruptionMode` fields of the PodGroup.
25+
{{< /note >}}
26+
27+
The API supports two disruption modes: `Pod` and `PodGroup`.
28+
The default is `Pod`.
29+
30+
### Pod
31+
32+
The `Pod` mode instructs the scheduler to treat all Pods in the group as separate entities,
33+
allowing independent disruption of a single pod from a PodGroup.
34+
35+
### PodGroup
36+
37+
The `PodGroup` mode emphasizes "all-or-nothing" semantics for disruption.
38+
It instructs the scheduler to disrupt all pods of the PodGroup together.
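
To illustrate the difference between the two modes, the Go sketch below computes which pods would have to be disrupted together when one pod is picked as a victim. It is an illustration under stated assumptions, not scheduler code: the `pod` type, the `modes` map, and the `victimsFor` function are hypothetical, and a group with no entry in `modes` is treated as using the default `Pod` mode.

```go
package main

import "fmt"

// pod is a hypothetical, simplified view of a running Pod.
type pod struct {
	name  string
	group string // name of the owning PodGroup; empty if the Pod has none
}

// victimsFor returns the names of the pods that must be disrupted together
// if target is chosen as a victim. modes maps a PodGroup name to its
// disruption mode ("Pod" or "PodGroup"); a missing entry means "Pod".
func victimsFor(target pod, all []pod, modes map[string]string) []string {
	if target.group == "" || modes[target.group] != "PodGroup" {
		// Pod mode: the pod can be disrupted independently.
		return []string{target.name}
	}
	// PodGroup mode: all-or-nothing, so every member is disrupted.
	var out []string
	for _, p := range all {
		if p.group == target.group {
			out = append(out, p.name)
		}
	}
	return out
}

func main() {
	all := []pod{
		{name: "a-0", group: "job-a"},
		{name: "a-1", group: "job-a"},
		{name: "b-0", group: "job-b"},
		{name: "solo"},
	}
	modes := map[string]string{"job-a": "PodGroup", "job-b": "Pod"}

	fmt.Println(victimsFor(all[0], all, modes)) // members of job-a
	fmt.Println(victimsFor(all[2], all, modes)) // only b-0
	fmt.Println(victimsFor(all[3], all, modes)) // only solo
}
```

In this sketch, choosing `a-0` pulls in `a-1` because `job-a` uses the `PodGroup` mode, while `b-0` and `solo` are disrupted on their own.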
39+
40+
## PodGroup priority
41+
42+
A PodGroup uses the same [PriorityClass](/docs/concepts/scheduling-eviction/pod-priority-preemption/#priorityclass) concept as individual Pods.
43+
Once you have created one or more PriorityClasses,
44+
you can create a PodGroup that specifies one of those PriorityClass names in its specification.
45+
The priority admission controller uses the `priorityClassName` field and populates the integer value of the priority.
46+
If the priority class is not found, the PodGroup is rejected.
47+
When `priorityClassName` is not set for a PodGroup, Kubernetes looks for a default (a PriorityClass with `globalDefault` set to `true`).
48+
If there is no PriorityClass with `globalDefault` set to `true`, a PodGroup with no `priorityClassName` has priority zero.
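
The resolution rules above can be summarized in a few lines of Go. This is a minimal sketch of the described behavior, not the admission controller's code: the `priorityClass` type, the `resolvePriority` function, and the error value are hypothetical names chosen for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// priorityClass is a hypothetical, trimmed-down view of a PriorityClass.
type priorityClass struct {
	name          string
	value         int32
	globalDefault bool
}

var errClassNotFound = errors.New("priority class not found")

// resolvePriority returns the integer priority for a PodGroup:
//   - a named class resolves to that class's value, or an error if it is missing;
//   - an empty name falls back to the globalDefault class, if one exists;
//   - otherwise the priority is zero.
func resolvePriority(className string, classes []priorityClass) (int32, error) {
	if className != "" {
		for _, pc := range classes {
			if pc.name == className {
				return pc.value, nil
			}
		}
		return 0, errClassNotFound
	}
	for _, pc := range classes {
		if pc.globalDefault {
			return pc.value, nil
		}
	}
	return 0, nil
}

func main() {
	classes := []priorityClass{
		{name: "high-priority", value: 1000000},
		{name: "standard", value: 1000, globalDefault: true},
	}
	fmt.Println(resolvePriority("high-priority", classes))
	fmt.Println(resolvePriority("", classes))
	fmt.Println(resolvePriority("missing", classes))
	fmt.Println(resolvePriority("", nil))
}
```

The error return models the case where the PodGroup is rejected because the named class does not exist. The `standard` class and its value are sample data only.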
49+
50+
The priority of the PodGroup is the authoritative priority for all pods in the group during [workload-aware preemption](/docs/concepts/scheduling-eviction/workload-aware-preemption/) events, even when the priorities of the individual pods in the PodGroup differ.
51+
52+
The following YAML is an example of a PodGroup configuration that uses the `high-priority` PriorityClass,
53+
which maps to the integer priority value of 1000000.
54+
The priority admission controller checks the specification and resolves the priority of the PodGroup to 1000000.
55+
56+
```yaml
57+
apiVersion: scheduling.k8s.io/v1alpha2
58+
kind: PodGroup
59+
metadata:
60+
namespace: ns-1
61+
name: job-1
62+
spec:
63+
priorityClassName: high-priority
64+
```
65+
66+
## {{% heading "whatsnext" %}}
67+
68+
* Read about the [Workload-Aware Preemption](/docs/concepts/scheduling-eviction/workload-aware-preemption/) algorithm.
69+
* Learn about the [Workload API](/docs/concepts/workloads/workload-api/).
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
1+
---
2+
title: WorkloadAwarePreemption
3+
content_type: feature_gate
4+
_build:
5+
list: never
6+
render: false
7+
8+
stages:
9+
- stage: alpha
10+
defaultValue: false
11+
fromVersion: "1.36"
12+
---
13+
14+
Enables support for [Workload-aware preemption](/docs/concepts/scheduling-eviction/workload-aware-preemption/).
15+
16+
When enabled, if a PodGroup fails to schedule, the scheduler uses a workload-aware preemption
17+
algorithm to select victims to preempt instead of the default pod preemption algorithm.
