DRA: Add configurable health check timeout per device by harche · Pull Request #135147 · kubernetes/kubernetes

harche · 2025-11-05T16:34:48Z

What type of PR is this?

/kind feature
/kind api-change

What this PR does / why we need it:

Implements device-specific health check timeouts in the DRA health monitoring system as defined in KEP-4680. This allows DRA drivers to specify custom timeout values for individual devices through the gRPC health API.

Note that this PR builds on top of #133752. It was opened separately because of the limited time remaining before code freeze.

Which issue(s) this PR is related to:

Ref: KEP-4680 (Add Resource Health to Pod Status)
Ref: kubernetes/enhancements#5476

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Added configurable per-device health check timeouts to the DRA health monitoring API.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

Implements device-specific health check timeouts in the DRA health monitoring system as defined in KEP-4680. This allows DRA drivers to specify custom timeout values for individual devices through the gRPC health API.

Changes:
- Add HealthCheckTimeout field to state.DeviceHealth struct to store device-specific timeout durations
- Add health_check_timeout_seconds field to DeviceHealth proto message in the DRA health gRPC API (v1alpha1)
- Update manager.go to extract timeout from gRPC responses and apply DefaultHealthTimeout (30s) when not specified
- Handle negative timeout values defensively by logging a warning and falling back to the default timeout
- Simplify healthinfo.go by removing redundant fallback logic since timeouts are now always set at creation time
- Update tests to include HealthCheckTimeout in test fixtures

The timeout behavior is:
- Positive values: Use the specified timeout in seconds
- Zero or unspecified: Use DefaultHealthTimeout (30 seconds)
- Negative values: Log warning and use DefaultHealthTimeout

This implementation provides flexibility for DRA drivers to define appropriate health check intervals for different device types while maintaining backward compatibility through sensible defaults.

Ref: KEP-4680 (Add Resource Health to Pod Status)
Ref: kubernetes/enhancements#5476

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

Signed-off-by: Harshal Patil <12152047+harche@users.noreply.github.com>

harche · 2025-11-05T17:22:08Z

/retest-required

harche · 2025-11-05T18:11:19Z

/retest pull-kubernetes-node-e2e-containerd

harche · 2025-11-05T18:16:03Z

/retest-required

haircommander · 2025-11-05T18:17:55Z

@harche can you update with a release note #133752 (comment)?

kannon92 · 2025-11-05T18:30:29Z

/sig node
/wg device-management

Jpsassine · 2025-11-05T18:40:16Z

@harche thank you for opening this to get the change through.

k8s-ci-robot · 2025-11-05T19:06:01Z

@harche: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kubernetes-unit-windows-master | `374baac` | link | false | `/test pull-kubernetes-unit-windows-master` |


Jpsassine · 2025-11-05T19:30:10Z

cc. @SergeyKanzhelev

harche · 2025-11-06T17:00:03Z

/retest-required

SergeyKanzhelev

/lgtm
/approve

harche · 2025-11-06T19:24:26Z

/assign @liggitt

Ready for final approval.

liggitt · 2025-11-06T20:58:43Z

/kind feature
/approve
for API bit

k8s-ci-robot · 2025-11-06T20:58:57Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: harche, liggitt, SergeyKanzhelev


Needs approval from an approver in each of these files:

~~pkg/kubelet/cm/OWNERS~~ [SergeyKanzhelev,liggitt]
~~staging/src/k8s.io/kubelet/pkg/apis/OWNERS~~ [liggitt]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

harche · 2025-11-06T22:08:07Z

/label priority/important-soon

harche · 2025-11-06T22:09:44Z

/triage accepted

DRA: Add configurable health check timeout per device

ArangoGutierrez and others added 2 commits August 28, 2025 16:34

Check HealthCheckTimeout in updateHealthInfo comparison

374baac

Signed-off-by: Harshal Patil <12152047+harche@users.noreply.github.com>

github-project-automation Bot added this to SIG Node: code and documentation PRs and Dynamic Resource Allocation Nov 5, 2025

k8s-ci-robot removed the do-not-merge/needs-sig label Nov 5, 2025

github-project-automation Bot moved this to 🆕 New in Dynamic Resource Allocation Nov 5, 2025

github-project-automation Bot moved this to Triage in SIG Node: code and documentation PRs Nov 5, 2025

k8s-ci-robot requested review from mtaufen and natasha41575 November 5, 2025 16:35

harche changed the title ~~WIP: Healh check timeout~~ Nov 5, 2025

k8s-ci-robot added the release-note label and removed the release-note-none label Nov 5, 2025

harche mentioned this pull request Nov 5, 2025

DRA: Add configurable health check timeout per device #133752

Closed

pohly moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation Nov 6, 2025

SergeyKanzhelev approved these changes Nov 6, 2025


k8s-ci-robot assigned SergeyKanzhelev Nov 6, 2025

k8s-ci-robot assigned liggitt Nov 6, 2025

k8s-ci-robot added the kind/feature label and removed the do-not-merge/needs-kind label Nov 6, 2025

liggitt moved this to API review completed, 1.35 in API Reviews Nov 6, 2025

k8s-ci-robot added the approved label Nov 6, 2025

liggitt added this to the v1.35 milestone Nov 6, 2025

k8s-ci-robot added the triage/accepted label and removed the needs-triage label Nov 6, 2025

k8s-ci-robot merged commit ec5211c into kubernetes:master Nov 6, 2025
22 checks passed

github-project-automation Bot moved this from Triage to Done in SIG Node: code and documentation PRs Nov 6, 2025

pohly moved this from 👀 In review to ✅ Done in Dynamic Resource Allocation Nov 7, 2025

haircommander mentioned this pull request Nov 11, 2025

Add Resource Health Status to the Pod Status for Device Plugin and DRA kubernetes/enhancements#4680

Open

19 tasks

Jpsassine mentioned this pull request Nov 20, 2025

[Flaky Test] [sig-node] E2eNode Suite.[It] [sig-node] [DRA] [Feature:DynamicResourceAllocation] [FeatureGate:DynamicResourceAllocation] Resource Health [Serial] should not add health status to Pod when feature gate is disabled #135317

Closed

harche mentioned this pull request Mar 6, 2026

[KEP-4680] DRA: Make device health check timeout configurable #133118

Closed

harche mentioned this pull request May 28, 2026

self-nominate harche to be a sig-node reviewer #139365

Merged

mhan8796 pushed a commit to mhan8796/kubernetes that referenced this pull request Jun 27, 2026

Merge pull request kubernetes#135147 from harche/HealhCheckTimeout

67090a9

DRA: Add configurable health check timeout per device

DRA: Add configurable health check timeout per device #135147
k8s-ci-robot merged 2 commits into kubernetes:master from harche:HealhCheckTimeout
