Commit 1df0651

chipspeak, laurafitzgerald, kryanbeane, Future-Outlier, and Cursor authored
[Feature] Add NetworkIsolation support to RayClusters (#4638)
* [Feature] Add NetworkIsolation support to RayClusters
  Co-authored-by: Pat O'Connor <paoconno@redhat.com>
  Co-authored-by: Bryan Keane <bkeane@redhat.com>
* add feature gate reg to integration tests + removed redundant if err
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* separate function for head and base ingress rules
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* logger updates as per review
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* adjust DNS egress rule to allow all port 53 egress
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* resolve helm chart CI failure
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* restrict KubeRay rule via NamespaceSelector
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* check if update is necessary on NP via DeepEqual
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* prevent same-name NP mod + add namespace fallback to default
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* add missing rayStartParams ports to custom ports example
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* emit warning when existing NP conflicts with new one via name + fix test
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* remove redundant samples + add new samples
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* remove permissive pod selector rule + test updates
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* prop labels to jobsubmitter for networkpolicy rule
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* Add opt-in ALLOW_ALL_RAYJOB_SUBMITTERS env var for broad submitter ingress on standalone RayClusters
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* updated API and config to fix CI failures
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* Moved label prop to getSummitterTemplate + review changes
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* Review feedback: failed deletion event + reconcile on dupe + nits
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* use rayStartParams for ports + review nits
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* test fix + convert constants to CamelCase
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* helm updates + check if RayCluster is externally managed
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* API updates per review + autoscaling egress rule
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* Use standard lib for CIDR + use logConstructor consistently
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* Use EndpointSlice IPs for API server egress rule
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* Deduplicates ports + peers across EndpointSlices in egress rules
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* Remove operator namespace fallback and prop error if namespace not found
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* Remove patch verb for networkpolicies
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* Remove client port rule for operator
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* Add reconcileConcurrency and GenerationChangedPredicate to NetworkPolicy controller setup
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* solve 3 points, close to merge
  Signed-off-by: Future-Outlier <eric901201@gmail.com>
* helper
  Signed-off-by: Future-Outlier <eric901201@gmail.com>
* update
  Signed-off-by: Future-Outlier <eric901201@gmail.com>
* CI failure fixes after DNS logic removal
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* Revert "helper"
  This reverts commit 26c4053.
* Skip networkpolicy integration test when missing namespace
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* Normalize defaulted NetworkPolicy ports before comparing
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* Drop network policy integration tests pending followup e2e
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* Remove operator namespace ingress rule
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* networkpolicy: separate head/worker custom rules in API

  Replace the flat ingressRules/egressRules on NetworkIsolationConfig with per-role head/worker sub-structs. Since the API has not shipped yet, this avoids locking in a shared-only design that would require a backward-compatible migration later.

  Head and workers have fundamentally different security profiles: the head needs external access (dashboard, submitter, Prometheus), while workers typically need outbound access (S3, model registries). A shared-only API forces users to over-permit workers just to allow access to the head.

  New API shape:

      networkIsolation:
        mode: DenyAll
        head:
          ingressRules: [...]
          egressRules: [...]
        worker:
          ingressRules: [...]
          egressRules: [...]

  Co-authored-by: Cursor <cursor@cursor.sh>
* NetworkIsolation API change cleanup
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* Unit tests to ensure custom policies don't leak between head and worker
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* Sample comment indentation and removal of operator rule from API docs
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* Indent update to pass linter
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* Additional review nits - comments etc
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>
* Remove redundant comments in controller code
  Signed-off-by: Pat O'Connor <paoconno@redhat.com>

---------

Signed-off-by: Pat O'Connor <paoconno@redhat.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Co-authored-by: Laura Fitzgerald <lfitzger@redhat.com>
Co-authored-by: Bryan Keane <bkeane@redhat.com>
Co-authored-by: Future-Outlier <eric901201@gmail.com>
Co-authored-by: Cursor <cursor@cursor.sh>
1 parent 99f542a commit 1df0651
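
The per-role split described in the commit message can be illustrated with a small manifest fragment. This is a sketch only, not taken from the commit's samples: the cluster name and the `monitoring` namespace are hypothetical, port 8265 is assumed to be the Ray dashboard's default port, and the feature gate the commit message mentions may need to be enabled on the operator. It uses `DenyAllIngress` so that no egress (and therefore no DNS) rule is required.

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: isolated-cluster          # hypothetical name
spec:
  networkIsolation:
    mode: DenyAllIngress          # egress stays open, so no DNS rule is needed
    head:
      ingressRules:
        # Allow pods in a hypothetical "monitoring" namespace to reach
        # the Ray dashboard on the head pod.
        - from:
            - namespaceSelector:
                matchLabels:
                  kubernetes.io/metadata.name: monitoring
          ports:
            - protocol: TCP
              port: 8265          # assumed Ray dashboard port
    # No worker rules: workers accept only the intra-cluster traffic
    # that the base worker policy already allows.
  # headGroupSpec and workerGroupSpecs omitted for brevity.
```

Because the rule sits under `head`, it is appended only to the head pod's policy; the worker policy is left at its base rules, which is the over-permitting problem the per-role API is meant to avoid.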

27 files changed

Lines changed: 4768 additions & 2 deletions

‎docs/reference/api.md‎

Lines changed: 58 additions & 0 deletions
@@ -287,6 +287,63 @@ _Appears in:_
 | `SidecarMode` | |
 
 
+#### NetworkIsolationConfig
+
+
+
+NetworkIsolationConfig defines network isolation settings for Ray cluster.
+All modes permit intra-cluster pod-to-pod traffic.
+DNS egress is not included automatically; see NetworkPolicyRules.EgressRules
+for why it must be added under DenyAll/DenyAllEgress.
+
+
+
+_Appears in:_
+- [RayClusterSpec](#rayclusterspec)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `mode` _[NetworkIsolationMode](#networkisolationmode)_ | Mode controls the security level. All modes permit intra-cluster pod-to-pod<br />traffic (DNS egress excluded, see EgressRules).<br />- "DenyAll": Denies all Ingress and Egress.<br />- "DenyAllIngress": Denies all Ingress.<br />- "DenyAllEgress": Denies all Egress. | DenyAll | Enum: [DenyAll DenyAllIngress DenyAllEgress] <br /> |
+| `head` _[NetworkPolicyRules](#networkpolicyrules)_ | Head specifies custom NetworkPolicy rules applied only to the head pod's policy.<br />The base head policy always allows intra-cluster traffic and (for K8sJobMode<br />RayJob-owned clusters) the submitter pod. Rules here are appended to those<br />base rules. Platforms that need operator dashboard access should add it here<br />(e.g. via a mutating webhook). | | |
+| `worker` _[NetworkPolicyRules](#networkpolicyrules)_ | Worker specifies custom NetworkPolicy rules applied only to worker pods' policy.<br />The base worker policy always allows intra-cluster traffic.<br />Rules here are appended to that base rule. | | |
+
+
+#### NetworkIsolationMode
+
+_Underlying type:_ _string_
+
+NetworkIsolationMode is the type for network isolation mode constants.
+
+_Validation:_
+- Enum: [DenyAll DenyAllIngress DenyAllEgress]
+
+_Appears in:_
+- [NetworkIsolationConfig](#networkisolationconfig)
+
+| Field | Description |
+| --- | --- |
+| `DenyAll` | NetworkIsolationDenyAll denies all ingress and egress traffic.<br /> |
+| `DenyAllIngress` | NetworkIsolationDenyAllIngress denies all ingress traffic.<br /> |
+| `DenyAllEgress` | NetworkIsolationDenyAllEgress denies all egress traffic.<br /> |
+
+
+#### NetworkPolicyRules
+
+
+
+NetworkPolicyRules defines custom ingress and egress rules for a NetworkPolicy.
+
+
+
+_Appears in:_
+- [NetworkIsolationConfig](#networkisolationconfig)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `ingressRules` _[NetworkPolicyIngressRule](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#networkpolicyingressrule-v1-networking) array_ | IngressRules specifies custom ingress rules appended to the base policy.<br />Only meaningful when the mode includes ingress denial (DenyAll or DenyAllIngress). | | |
+| `egressRules` _[NetworkPolicyEgressRule](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#networkpolicyegressrule-v1-networking) array_ | EgressRules specifies custom egress rules appended to the base policy.<br />Only meaningful when the mode includes egress denial (DenyAll or DenyAllEgress).<br />DNS egress is NOT added automatically: under DenyAll/DenyAllEgress you MUST<br />add a DNS rule here (e.g. to kube-system pods labeled k8s-app=kube-dns on<br />port 53), because Ray workers reach the head via its service FQDN and cannot<br />resolve it without DNS. See the network-isolation-deny-all sample. | | |
+
+
 #### RayCluster
 
 
@@ -330,6 +387,7 @@ _Appears in:_
 | `headServiceAnnotations` _object (keys:string, values:string)_ | | | |
 | `enableInTreeAutoscaling` _boolean_ | EnableInTreeAutoscaling indicates whether operator should create in tree autoscaling configs | | |
 | `gcsFaultToleranceOptions` _[GcsFaultToleranceOptions](#gcsfaulttoleranceoptions)_ | GcsFaultToleranceOptions for enabling GCS FT | | |
+| `networkIsolation` _[NetworkIsolationConfig](#networkisolationconfig)_ | NetworkIsolation specifies optional configuration for network isolation.<br />When set, separate NetworkPolicies are created for head and worker pods.<br />The reconciler always permits intra-cluster pod-to-pod traffic.<br />Note: under DenyAll/DenyAllEgress, DNS egress is not added<br />automatically; since Ray pods reach the head via its service FQDN, you must<br />allow DNS egress via Head/Worker EgressRules or the cluster will fail to start. | | |
 | `headGroupSpec` _[HeadGroupSpec](#headgroupspec)_ | HeadGroupSpec is the spec for the head pod | | |
 | `rayVersion` _string_ | RayVersion is used to determine the command for the Kubernetes Job managed by RayJob | | |
 | `workerGroupSpecs` _[WorkerGroupSpec](#workergroupspec) array_ | WorkerGroupSpecs are the specs for the worker pods | | |
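
The DNS requirement stated in the `egressRules` and `networkIsolation` descriptions can be sketched as a manifest fragment. This is an illustrative guess at what such a rule might look like, not a copy of the `network-isolation-deny-all` sample the docs refer to: the cluster name is hypothetical, and the selectors assume the cluster's DNS pods run in `kube-system` with the label `k8s-app=kube-dns`, as the field description suggests. The same rule is repeated for head and worker because each role has its own policy.

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: deny-all-cluster            # hypothetical name
spec:
  networkIsolation:
    mode: DenyAll
    head:
      egressRules:
        # DNS is not added automatically under DenyAll/DenyAllEgress.
        - to:
            - namespaceSelector:
                matchLabels:
                  kubernetes.io/metadata.name: kube-system
              podSelector:
                matchLabels:
                  k8s-app: kube-dns
          ports:
            - protocol: UDP
              port: 53
            - protocol: TCP
              port: 53
    worker:
      egressRules:
        # Workers resolve the head service FQDN, so they need DNS too.
        - to:
            - namespaceSelector:
                matchLabels:
                  kubernetes.io/metadata.name: kube-system
              podSelector:
                matchLabels:
                  k8s-app: kube-dns
          ports:
            - protocol: UDP
              port: 53
            - protocol: TCP
              port: 53
  # headGroupSpec and workerGroupSpecs omitted for brevity.
```

Placing `namespaceSelector` and `podSelector` in the same `to` entry means both must match, which keeps the rule limited to DNS pods in `kube-system` rather than every pod in that namespace. Clusters that run a different DNS deployment or a node-local cache would need different selectors.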
