Skip to content

DRA: Add configurable health check timeout per device#135147

Merged
k8s-ci-robot merged 2 commits into
kubernetes:masterfrom
harche:HealhCheckTimeout
Nov 6, 2025
Merged

DRA: Add configurable health check timeout per device#135147
k8s-ci-robot merged 2 commits into
kubernetes:masterfrom
harche:HealhCheckTimeout

Conversation

@harche

@harche harche commented Nov 5, 2025

Copy link
Copy Markdown
Contributor

What type of PR is this?

/kind feature
/kind api-change

What this PR does / why we need it:

Implements device-specific health check timeouts in the DRA health monitoring system as defined in KEP-4680. This allows DRA drivers to specify custom timeout values for individual devices through the gRPC health API.

Please note that, this PR builds on top of #133752. It has been created in the interest of time remaining for the code freeze.

Which issue(s) this PR is related to:

Ref: KEP-4680 (Add Resource Health to Pod Status)
Ref: kubernetes/enhancements#5476

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Added configurable per-device health check timeouts to the DRA health monitoring API.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


ArangoGutierrez and others added 2 commits August 28, 2025 16:34
Implements device-specific health check timeouts in the DRA health monitoring
system as defined in KEP-4680. This allows DRA drivers to specify custom
timeout values for individual devices through the gRPC health API.

Changes:
- Add HealthCheckTimeout field to state.DeviceHealth struct to store
  device-specific timeout durations
- Add health_check_timeout_seconds field to DeviceHealth proto message
  in the DRA health gRPC API (v1alpha1)
- Update manager.go to extract timeout from gRPC responses and apply
  DefaultHealthTimeout (30s) when not specified
- Handle negative timeout values defensively by logging a warning and
  falling back to the default timeout
- Simplify healthinfo.go by removing redundant fallback logic since
  timeouts are now always set at creation time
- Update tests to include HealthCheckTimeout in test fixtures

The timeout behavior is:
- Positive values: Use the specified timeout in seconds
- Zero or unspecified: Use DefaultHealthTimeout (30 seconds)
- Negative values: Log warning and use DefaultHealthTimeout

This implementation provides flexibility for DRA drivers to define
appropriate health check intervals for different device types while
maintaining backward compatibility through sensible defaults.

Ref: KEP-4680 (Add Resource Health to Pod Status)
Ref: kubernetes/enhancements#5476

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Harshal Patil <12152047+harche@users.noreply.github.com>
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. wg/device-management Categorizes an issue or PR as relevant to WG Device Management. labels Nov 5, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Nov 5, 2025
@harche

harche commented Nov 5, 2025

Copy link
Copy Markdown
Contributor Author

/retest-required

@harche

harche commented Nov 5, 2025

Copy link
Copy Markdown
Contributor Author

/retest pull-kubernetes-node-e2e-containerd

@harche harche changed the title WIP: Healh check timeout Nov 5, 2025
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Nov 5, 2025
@harche

harche commented Nov 5, 2025

Copy link
Copy Markdown
Contributor Author

/retest-required

@haircommander

Copy link
Copy Markdown
Contributor

@harche can you update with a release note #133752 (comment)?

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Nov 5, 2025
@kannon92

kannon92 commented Nov 5, 2025

Copy link
Copy Markdown
Contributor

/sig node
/wg device-management

@Jpsassine

Copy link
Copy Markdown
Contributor

@harche thank you for opening this to get the change through.

@k8s-ci-robot

k8s-ci-robot commented Nov 5, 2025

Copy link
Copy Markdown
Contributor

@harche: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-unit-windows-master 374baac link false /test pull-kubernetes-unit-windows-master

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@Jpsassine

Copy link
Copy Markdown
Contributor
@pohly pohly moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation Nov 6, 2025
@harche

harche commented Nov 6, 2025

Copy link
Copy Markdown
Contributor Author

/retest-required

@SergeyKanzhelev SergeyKanzhelev left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@harche

harche commented Nov 6, 2025

Copy link
Copy Markdown
Contributor Author

/assign @liggitt

Ready for final approval.

@liggitt

liggitt commented Nov 6, 2025

Copy link
Copy Markdown
Member

/kind feature
/approve
for API bit

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Nov 6, 2025
@liggitt liggitt moved this to API review completed, 1.35 in API Reviews Nov 6, 2025
@k8s-ci-robot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: harche, liggitt, SergeyKanzhelev

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 6, 2025
@liggitt liggitt added this to the v1.35 milestone Nov 6, 2025
@harche

harche commented Nov 6, 2025

Copy link
Copy Markdown
Contributor Author

/label priority/important-soon

@harche

harche commented Nov 6, 2025

Copy link
Copy Markdown
Contributor Author

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 6, 2025
@k8s-ci-robot k8s-ci-robot merged commit ec5211c into kubernetes:master Nov 6, 2025
22 checks passed
@pohly pohly moved this from 👀 In review to ✅ Done in Dynamic Resource Allocation Nov 7, 2025
mhan8796 pushed a commit to mhan8796/kubernetes that referenced this pull request Jun 27, 2026
DRA: Add configurable health check timeout per device
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api-review Categorizes an issue or PR as actively needing an API review. approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.

10 participants