DRA: Add configurable health check timeout per device #135147
Implements device-specific health check timeouts in the DRA health monitoring system as defined in KEP-4680. This allows DRA drivers to specify custom timeout values for individual devices through the gRPC health API.

Changes:
- Add HealthCheckTimeout field to state.DeviceHealth struct to store device-specific timeout durations
- Add health_check_timeout_seconds field to DeviceHealth proto message in the DRA health gRPC API (v1alpha1)
- Update manager.go to extract timeout from gRPC responses and apply DefaultHealthTimeout (30s) when not specified
- Handle negative timeout values defensively by logging a warning and falling back to the default timeout
- Simplify healthinfo.go by removing redundant fallback logic since timeouts are now always set at creation time
- Update tests to include HealthCheckTimeout in test fixtures

The timeout behavior is:
- Positive values: Use the specified timeout in seconds
- Zero or unspecified: Use DefaultHealthTimeout (30 seconds)
- Negative values: Log warning and use DefaultHealthTimeout

This implementation provides flexibility for DRA drivers to define appropriate health check intervals for different device types while maintaining backward compatibility through sensible defaults.

Ref: KEP-4680 (Add Resource Health to Pod Status)
Ref: kubernetes/enhancements#5476

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Harshal Patil <12152047+harche@users.noreply.github.com>
|
/retest-required |
|
/retest pull-kubernetes-node-e2e-containerd |
|
/retest-required |
|
@harche can you update with a release note #133752 (comment)? |
|
/sig node |
|
@harche thank you for opening this to get the change through. |
|
@harche: The following test failed, say
|
cc. @SergeyKanzhelev |
|
/retest-required |
|
/assign @liggitt Ready for final approval. |
|
/kind feature |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: harche, liggitt, SergeyKanzhelev |
|
/label priority/important-soon |
|
/triage accepted |
DRA: Add configurable health check timeout per device
What type of PR is this?
/kind feature
/kind api-change
What this PR does / why we need it:
Implements device-specific health check timeouts in the DRA health monitoring system as defined in KEP-4680. This allows DRA drivers to specify custom timeout values for individual devices through the gRPC health API.
Please note that this PR builds on top of #133752. It was created in the interest of time, given how little remains before the code freeze.
Which issue(s) this PR is related to:
Ref: KEP-4680 (Add Resource Health to Pod Status)
Ref: kubernetes/enhancements#5476
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: