kubernetes
diff --git a/content/en/docs/concepts/scheduling-eviction/_index.md b/content/en/docs/concepts/scheduling-eviction/_index.md
diff --git a/content/en/docs/concepts/scheduling-eviction/workload-aware-preemption.md b/content/en/docs/concepts/scheduling-eviction/workload-aware-preemption.md
diff --git a/content/en/docs/concepts/workloads/workload-api/_index.md b/content/en/docs/concepts/workloads/workload-api/_index.md
diff --git a/content/en/docs/concepts/workloads/workload-api/disruption-and-priority.md b/content/en/docs/concepts/workloads/workload-api/disruption-and-priority.md
diff --git a/content/en/docs/reference/command-line-tools-reference/feature-gates/WorkloadAwarePreemption.md b/content/en/docs/reference/command-line-tools-reference/feature-gates/WorkloadAwarePreemption.md
@@ -27,6 +27,7 @@ of terminating one or more Pods on Nodes.
 * [PodGroup Scheduling](/docs/concepts/scheduling-eviction/podgroup-scheduling/)
 * [Gang Scheduling](/docs/concepts/scheduling-eviction/gang-scheduling/)
 * [Topology-aware Scheduling](/docs/concepts/scheduling-eviction/topology-aware-scheduling/)
+* [Workload-Aware Preemption](/docs/concepts/scheduling-eviction/workload-aware-preemption/)
 * [Descheduler](https://github.com/kubernetes-sigs/descheduler#descheduler-for-kubernetes)
 * [Node Declared Features](/docs/concepts/scheduling-eviction/node-declared-features/)
 
 
@@ -0,0 +1,63 @@
+---
+title: Workload-Aware Preemption
+content_type: concept
+weight: 80
+---
+
+<!-- overview -->
+{{< feature-state feature_gate_name="WorkloadAwarePreemption">}}
+
+Workload-aware preemption is a preemption mechanism designed specifically for PodGroups.
+When a PodGroup cannot be scheduled, the scheduler uses this preemption logic to try to
+make room so that the PodGroup can be scheduled. This approach is used only during PodGroup scheduling,
+where it replaces the default preemption mechanism for the pods in that PodGroup.
+
+When this feature is enabled, the scheduler treats the PodGroup as a single preemptor unit,
+rather than evaluating the individual pods of the PodGroup in isolation. To make room for the pending pods in the group,
+it searches for victims across the entire cluster.
+Other PodGroups can be selected as victims, and they are preempted according to their disruption modes.
+
+This feature depends on [Gang Scheduling](/docs/concepts/scheduling-eviction/gang-scheduling/)
+and the [Workload API](/docs/concepts/workloads/workload-api/).
+Ensure the [`GenericWorkload`](/docs/reference/command-line-tools-reference/feature-gates/#GenericWorkload)
+and [`GangScheduling`](/docs/reference/command-line-tools-reference/feature-gates/#GangScheduling) feature gates
+and the `scheduling.k8s.io/v1alpha2` {{< glossary_tooltip text="API group" term_id="api-group" >}} are enabled in the cluster.
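+
+One way to experiment with these settings locally is a [kind](https://kind.sigs.k8s.io/) cluster.
+The following is a minimal sketch, not a recommended setup: the use of kind is an assumption made
+for illustration, and it requires a node image that runs Kubernetes v1.36 or later. The configuration
+enables the feature gates and API group named above, plus `WorkloadAwarePreemption` itself.
+
+```yaml
+# Hypothetical kind cluster configuration for local experimentation only.
+kind: Cluster
+apiVersion: kind.x-k8s.io/v1alpha4
+featureGates:
+  GenericWorkload: true
+  GangScheduling: true
+  WorkloadAwarePreemption: true
+runtimeConfig:
+  # Alpha API groups are disabled by default and must be turned on explicitly.
+  "scheduling.k8s.io/v1alpha2": "true"
+```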
+
+<!-- body -->
+
+## How it works
+
+The workload-aware preemption process follows the same principles
+as [default preemption](/docs/concepts/scheduling-eviction/pod-priority-preemption/#preemption)
+with a few differences:
+
+1. Cluster-wide domain: Instead of evaluating preemption node by node,
+   the scheduler evaluates the entire cluster as a single domain.
+   It selects a set of victims across multiple nodes that can be removed
+   to make enough room for the preemptor PodGroup to be scheduled.
+
+2. Victim importance hierarchy: The scheduler decides which preemption units
+   (individual pods or PodGroups) are more critical and should be spared from preemption
+   using a strict hierarchy:
+   * Priority: Higher-priority units are always more important.
+   * Workload type: PodGroups are considered more important than individual Pods of the same priority.
+   * Group size (PodGroups): If both units are PodGroups,
+     the one with more members (larger size) is considered more important.
+   * Start time: Units that started earlier are more important.
+
+3. Pod group priority and disruption: The scheduler uses the
+   [priority and disruption mode](/docs/concepts/workloads/workload-api/disruption-and-priority/) of a PodGroup
+   to decide whether and how its pods can be preempted.
+
+{{< note >}}
+When scheduling a single Pod, the default pod preemption applies.
+As of Kubernetes v1.36, when the scheduler performs default preemption for a single Pod
+and attempts to preempt a Pod that belongs to a PodGroup, it does **not**
+respect the `priority` or `disruptionMode` fields of that PodGroup.
+{{< /note >}}
+
+## {{% heading "whatsnext" %}}
+
+* Learn more about [PodGroup Priority and Disruption](/docs/concepts/workloads/workload-api/disruption-and-priority/).
+* Learn about the [Workload API](/docs/concepts/workloads/workload-api/).
+* Read more about [Gang scheduling](/docs/concepts/scheduling-eviction/gang-scheduling/).
@@ -39,6 +39,8 @@ Each entry in `podGroups` must have:
 1. A unique `name` that can be used in the Pod's [Workload reference](/docs/concepts/workloads/pods/workload-reference/).
 2. A [scheduling policy](/docs/concepts/workloads/workload-api/policies/) (`basic` or `gang`).
 
+If the [`WorkloadAwarePreemption`](/docs/reference/command-line-tools-reference/feature-gates/#WorkloadAwarePreemption) [feature gate](/docs/reference/command-line-tools-reference/feature-gates/) is enabled, each entry in `podGroups` can also specify a [priority and disruption mode](/docs/concepts/workloads/workload-api/disruption-and-priority/).
+
 ```yaml
 apiVersion: scheduling.k8s.io/v1alpha1
 kind: Workload
@@ -56,6 +58,8 @@ spec:
       gang:
         # The gang is schedulable only if 4 pods can run at once
         minCount: 4
+    priorityClassName: high-priority # Only applies when the WorkloadAwarePreemption feature gate is enabled
+    disruptionMode: PodGroup # Only applies when the WorkloadAwarePreemption feature gate is enabled
 ```
 
 ### Referencing a workload controlling object
 
@@ -0,0 +1,69 @@
+---
+title: Pod Group Disruption and Priority
+content_type: concept
+weight: 10
+---
+
+<!-- overview -->
+{{< feature-state feature_gate_name="WorkloadAwarePreemption" >}}
+
+A PodGroup can declare a disruption mode. This mode dictates how
+the scheduler can disrupt a running PodGroup, for example to accommodate
+a higher-priority PodGroup. A PodGroup also has a priority,
+which overrides the priorities of the individual pods in the group
+during [workload-aware preemption](/docs/concepts/scheduling-eviction/workload-aware-preemption/) events.
+
+<!-- body -->
+
+## Disruption mode types
+
+{{< note >}}
+As of Kubernetes v1.36, the `priority` and `disruptionMode` fields of a PodGroup are respected only
+by [workload-aware preemption](/docs/concepts/scheduling-eviction/workload-aware-preemption/).
+During the pod scheduling phase, the scheduler does not take
+either field into account.
+{{< /note >}}
+
+The API supports two disruption modes: `Pod` and `PodGroup`.
+The default is `Pod`.
+
+### Pod
+
+The `Pod` mode instructs the scheduler to treat all Pods in the group as separate entities,
+allowing independent disruption of a single pod from a PodGroup.
+
+### PodGroup
+
+The `PodGroup` mode enforces "all-or-nothing" semantics for disruption.
+It instructs the scheduler to disrupt all pods in the PodGroup together.
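+
+As an illustration, a PodGroup that opts into all-or-nothing disruption might look like the following sketch.
+It assumes that `disruptionMode` sits in the PodGroup `spec`; the namespace and name are placeholders,
+and other fields are omitted.
+
+```yaml
+apiVersion: scheduling.k8s.io/v1alpha2
+kind: PodGroup
+metadata:
+  namespace: ns-1
+  name: job-1
+spec:
+  # Assumed field placement: disrupt all pods of this group together.
+  # Omitting the field falls back to the default mode, Pod.
+  disruptionMode: PodGroup
+```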
+
+## Pod group priority
+
+A PodGroup uses the same [PriorityClass](/docs/concepts/scheduling-eviction/pod-priority-preemption/#priorityclass) concept as an individual Pod.
+Once you have created one or more PriorityClasses,
+you can create a PodGroup that specifies one of those PriorityClass names in its specification.
+The priority admission controller reads the `priorityClassName` field and populates the integer value of the priority.
+If the PriorityClass is not found, the PodGroup is rejected.
+When `priorityClassName` is not set for a PodGroup, Kubernetes looks for a default PriorityClass (one with `globalDefault` set to `true`).
+If no PriorityClass has `globalDefault` set to `true`, a PodGroup with no `priorityClassName` has priority zero.
+
+The priority of the PodGroup is the authoritative priority for all pods in the group during [workload-aware preemption](/docs/concepts/scheduling-eviction/workload-aware-preemption/) events, even when the priorities of the individual pods in the PodGroup differ.
+
+The following YAML is an example of a PodGroup configuration that uses the `high-priority` PriorityClass,
+which maps to the integer priority value of 1000000.
+The priority admission controller checks the specification and resolves the priority of the PodGroup to 1000000.
+
+```yaml
+apiVersion: scheduling.k8s.io/v1alpha2
+kind: PodGroup
+metadata:
+  namespace: ns-1
+  name: job-1
+spec:
+  priorityClassName: high-priority
+```
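+
+The example above assumes that a PriorityClass named `high-priority` already exists in the cluster.
+A minimal sketch of such a PriorityClass, with the value chosen to match the text and an illustrative
+description, could look like this:
+
+```yaml
+apiVersion: scheduling.k8s.io/v1
+kind: PriorityClass
+metadata:
+  name: high-priority
+# Integer priority that the admission controller resolves for the PodGroup
+value: 1000000
+# Not a cluster-wide default; PodGroups must reference it by name
+globalDefault: false
+description: "Illustrative PriorityClass for high-priority PodGroups."
+```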
+
+## {{% heading "whatsnext" %}}
+
+* Read about the [Workload-Aware Preemption](/docs/concepts/scheduling-eviction/workload-aware-preemption/) algorithm.
+* Learn about the [Workload API](/docs/concepts/workloads/workload-api/).
@@ -0,0 +1,17 @@
+---
+title: WorkloadAwarePreemption
+content_type: feature_gate
+_build:
+  list: never
+  render: false
+
+stages:
+  - stage: alpha
+    defaultValue: false
+    fromVersion: "1.36"
+---
+
+Enables support for [workload-aware preemption](/docs/concepts/scheduling-eviction/workload-aware-preemption/).
+
+When enabled, if a PodGroup fails to schedule, the scheduler uses a workload-aware preemption
+algorithm, instead of the default pod preemption algorithm, to select victims to preempt.