[Feature] Add NetworkIsolation support to RayClusters by chipspeak · Pull Request #4638 · ray-project/kuberay

chipspeak · 2026-03-23T11:37:32Z

Why are these changes needed?

KubeRay currently provides a flexible foundation for deploying Ray on Kubernetes but relies on users to manually configure critical security features. As per this proposal doc, this PR is one tentpole of a larger feature set aimed at security-conscious network templates.

This PR handles the automatic creation of NetworkPolicies for RayClusters via a new controller in KubeRay. The controller utilises a new field in the RayCluster CR's spec, networkIsolation, whose mode field accepts three values:

DenyAll - this default prevents all ingress and egress with the exception of intra-cluster pod communication. This includes blocking egress needed for RayJobs. It serves as a sensible starting point for any additional ingressRules or egressRules that a user wishes to add. The intended approach is to use a webhook to add any additional rules. This maintains an initially conservative security posture from which to build a user-specific template.
DenyAllIngress - restricts all inbound traffic to RayCluster pods while leaving egress unrestricted. Intra-cluster communication and KubeRay operator access are still permitted by default. Like denyAll, any user-specified ingressRules are appended to the base policy, and RayJob submitter pods are automatically allowed when the RayCluster is owned by a RayJob. This mode is useful when Ray workloads need outbound access (e.g. pulling from external data sources or reaching cloud APIs) but should not be reachable from arbitrary pods in the namespace.
DenyAllEgress - restricts all outbound traffic from RayCluster pods while leaving ingress unrestricted. Intra-cluster communication and DNS resolution (port 53) are preserved as base rules to ensure the cluster remains functional. Any user-specified egressRules are appended alongside these defaults. This mode suits environments where inbound access to the Ray dashboard or client port should remain open but workloads should not be able to reach external services without explicit allowlisting. NOTE: To use pip (for example, in the context of a RayJob) or Ray Serve, the user must add egressRules that permit that traffic.

In all three modes, the controller generates separate NetworkPolicy resources for head and worker pods. The base rules always ensure inter-pod communication within the cluster is unimpeded and that the KubeRay operator can reach the head node's dashboard and client ports. As previously outlined, users can extend any mode via the ingressRules and egressRules fields on the networkIsolation spec, which are appended verbatim to the generated policies.
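As a rough illustration of the three modes described above, the sketch below maps each mode to the NetworkPolicy policyTypes it would restrict. The `Mode` type, the constant names and `policyTypesFor` are hypothetical and are not taken from the PR's controller code; the only Kubernetes fact relied on is that a direction left out of policyTypes is not restricted by that policy.

```go
package main

import "fmt"

// Mode mirrors the three networkIsolation mode values named in the PR description.
// The exact casing used by the real API is an assumption here.
type Mode string

const (
	DenyAll        Mode = "DenyAll"
	DenyAllIngress Mode = "DenyAllIngress"
	DenyAllEgress  Mode = "DenyAllEgress"
)

// policyTypesFor returns the NetworkPolicy policyTypes a mode would restrict.
// A direction that is absent from the result stays unrestricted.
func policyTypesFor(m Mode) ([]string, error) {
	switch m {
	case DenyAll:
		return []string{"Ingress", "Egress"}, nil
	case DenyAllIngress:
		return []string{"Ingress"}, nil
	case DenyAllEgress:
		return []string{"Egress"}, nil
	default:
		return nil, fmt.Errorf("unknown networkIsolation mode %q", m)
	}
}

func main() {
	for _, m := range []Mode{DenyAll, DenyAllIngress, DenyAllEgress} {
		types, err := policyTypesFor(m)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s restricts %v\n", m, types)
	}
}
```

Rejecting unknown values rather than falling back to a default keeps a typo in the mode field from silently producing a weaker policy.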

These resources are owned by the RayCluster (or the RayJob that owns the respective RayCluster) and share its lifecycle, so users do not need to clean them up. These rules assume that good RBAC practice is maintained in your cluster.

NOTE: RayClusters that are owned by RayJobs function the same as outlined above, but an additional rule accounts for the RayJob submitter pod (this rule is present on all RayClusters). This rule uses the OwnerReference to validate the submitter, but that reference isn't present when a RayJob is submitted against a pre-existing cluster. To account for this, a new env var (which defaults to false) can be used to add a more permissive rule that will facilitate these submitter pods.
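The OwnerReference check mentioned in this note can be pictured with a minimal sketch. `OwnerRef` is a hypothetical stand-in for the Kind and Name fields of a Kubernetes owner reference, and `rayJobOwner` is an illustrative helper, not the controller's actual function.

```go
package main

import "fmt"

// OwnerRef is a hypothetical stand-in for the Kind and Name fields
// of a Kubernetes owner reference.
type OwnerRef struct {
	Kind string
	Name string
}

// rayJobOwner returns the name of the owning RayJob, if there is one.
// A RayCluster created directly by a user has no such owner, so the
// submitter rule would not be generated for it.
func rayJobOwner(refs []OwnerRef) (string, bool) {
	for _, r := range refs {
		if r.Kind == "RayJob" {
			return r.Name, true
		}
	}
	return "", false
}

func main() {
	owned := []OwnerRef{{Kind: "RayJob", Name: "nightly-train"}}
	standalone := []OwnerRef{}

	if name, ok := rayJobOwner(owned); ok {
		fmt.Println("add submitter rule for RayJob", name)
	}
	if _, ok := rayJobOwner(standalone); !ok {
		fmt.Println("pre-existing cluster: no owner reference to key the rule on")
	}
}
```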

The proposal doc features additional information about the specific rules for each mode and the rationale behind them.

Related issue number

Closes: 3987

Checks

I've made sure the tests are passing.
Testing Strategy
- Unit tests
- Manual tests
- This PR is not tested :(

machichima · 2026-03-30T13:45:42Z

Should we also add NetworkPolicy to RayJob submitter pod?

chipspeak · 2026-03-30T14:22:26Z

It might add a little too much complexity in terms of configuration. A user can define a submitterPodTemplate so we'd need to account for situations in which our defaults might conflict and interfere with that.

Would it be an idea for a follow-up PR maybe pending some discussion? Given how ephemeral submitter Pods are, I feel the attack surface is relatively small as is.

machichima · 2026-03-31T13:50:49Z

Of course! We can discuss this in follow-ups

machichima · 2026-03-31T14:06:09Z

+		// NamespaceSelector is intentionally empty (matches all namespaces) so that
+		// the operator pod is allowed regardless of which namespace it runs in.


Would it be better to limit it to the namespace where the kuberay-operator is in? Based on this rule, any pod with the label app.kubernetes.io/component: kuberay can connect to head pod

That was the way we had it in our mid-stream, I'd just assumed we'd want a little more flexibility here. If a user deployed it in a custom namespace, they'd need to manually adjust the auto-generated network policies in this instance! I can certainly move it back to being namespace specific but it's security theatre at a certain point because if a bad actor has permissions to add that label, they'd also have no issue moving resources to a hard-coded namespace if need be. Does that make sense?

I don't think adding this label requires high privilege?

What I was thinking is that when creating a RayCluster, a user can specify labels for the worker and head pods. If they set app.kubernetes.io/component: kuberay for RayClusterA, it means that pods in RayClusterA are able to access other RayClusters' head Pods on the dashboard/client ports. This may allow job submission to other RayClusters.

This is one of the cases that we wanted to prevent

kuberay/ray-operator/apis/ray/v1/raycluster_types.go

Line 226 in e10a435

Labels map[string]string `json:"labels,omitempty"`
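The concern can be made concrete with a deliberately simplified model of peer matching. `Pod`, `Peer` and `Allows` below are hypothetical stand-ins, not Kubernetes API types; an empty `Namespace` on the peer models the empty namespaceSelector from the diff (matches all namespaces), and the namespace names are made up for illustration.

```go
package main

import "fmt"

// Pod is a hypothetical stand-in for the parts of a pod that a peer rule inspects.
type Pod struct {
	Namespace string
	Labels    map[string]string
}

// Peer is a simplified model of a NetworkPolicy peer.
// Namespace == "" models an empty namespaceSelector, i.e. every namespace.
type Peer struct {
	Namespace string
	PodLabels map[string]string
}

// Allows reports whether the pod satisfies both the namespace and label constraints.
func (p Peer) Allows(pod Pod) bool {
	if p.Namespace != "" && pod.Namespace != p.Namespace {
		return false
	}
	for k, v := range p.PodLabels {
		if pod.Labels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	operatorLabels := map[string]string{"app.kubernetes.io/component": "kuberay"}
	operator := Pod{Namespace: "kuberay-system", Labels: operatorLabels}
	// A RayCluster pod in another namespace that copies the operator's label.
	spoofed := Pod{Namespace: "team-a", Labels: operatorLabels}

	anyNamespace := Peer{PodLabels: operatorLabels}
	operatorOnly := Peer{Namespace: "kuberay-system", PodLabels: operatorLabels}

	fmt.Println("label only, spoofed pod allowed:", anyNamespace.Allows(spoofed))
	fmt.Println("label + namespace, spoofed pod allowed:", operatorOnly.Allows(spoofed))
	fmt.Println("label + namespace, operator allowed:", operatorOnly.Allows(operator))
}
```

In this model the label-only peer admits any pod that copies the label, while pinning the namespace as well admits only pods that also run where the operator does.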

Sorry I mean permissions in the context of applying the RayCluster CR at all but fair point! I'll move it back to our more conservative mid-stream approach with the operator namespace validated too! Thanks!

I've updated it just now. It now uses a NamespaceSelector and pulls the namespace from the service account on the pod, so that the user doesn't have to deploy the operator with the POD_NAMESPACE env var if they don't want to. WDYT?

machichima

I think there's some redundant / missing sample YAMLs

ray-cluster.network-isolation-monitoring.yaml, ray-cluster.network-isolation-complex-rules.yaml, and ray-cluster.network-isolation-custom-rules.yaml
- Those 3 are similar, all using denyAll + ingressRules, I think we can just keep one
egress related YAML is not included

Maybe can just include following YAMLs:

denyAll with ingress and egress set
denyAllIngress with ingressrule set
denyAllEgress with egressrule set
custom port

WDYT?

chipspeak · 2026-04-02T16:09:38Z

Sounds good! I've updated the samples in line with this in the relevant commit.

machichima · 2026-04-04T03:48:53Z

Hi @chipspeak,

A follow-up on this comment: https://github.com/ray-project/kuberay/pull/4638/changes#r3009809497

After discussing with the team offline, we decided it’s better to remove this rule for now to keep the default behavior strict. Instead of handling this on our side, we will leave it up to the users who create the RayCluster to explicitly add the necessary IngressRules.

We should also document this behavior in IngressRules docstring and the official docs.

kuberay/ray-operator/apis/ray/v1/raycluster_types.go

Lines 157 to 159 in 02fe0ff

    
// IngressRules specifies custom ingress rules for Ray cluster pods.
// +optional
IngressRules []networkingv1.NetworkPolicyIngressRule `json:"ingressRules,omitempty"`

For example, we can mention that users can set the PodSelector to match pods with a specific label (e.g., allow-submit: true), and then apply this label to the SubmitterPodTemplate when creating a RayJob.

Our reasoning is that it's much safer to shift this responsibility to the user by default. If there is significant user demand in the future, we can then revisit this. However, if we release a non-strict version now, tightening the security later will be much harder as it could break existing user applications.

cc @rueian

chipspeak · 2026-04-07T13:35:10Z

Cool that makes sense!

Just for clarity, I assume this refers specifically to the permissive podselector rule that allowed the job submitter pod (and other pods within the same namespace) to access a pre-existing RayCluster?

Are we still ok with how it works for RayJob-owned RayClusters? Where the controller detects that a RayCluster has a RayJob ownerReference and then injects an ingress rule on the head NetworkPolicy allowing the submitter pod to reach the dashboard port.

machichima · 2026-04-08T13:25:24Z

Yes! My comment is referring to the head pod ingress rule that allows pods within the same namespace. For RayJob-owned RayClusters, the current implementation in buildRayJobPeer looks good to me, as it does not allow access for arbitrary pods.

Future-Outlier

is this used in the production env?

…lat ingressRules/egressRules on NetworkIsolationConfig with per-role head/worker sub-structs.

Since the API has not shipped yet, this avoids locking in a shared-only design that would require a backward-compatible migration later. Head and workers have fundamentally different security profiles: head needs external access (dashboard, submitter, Prometheus) while workers typically need outbound access (S3, model registries). A shared-only API forces users to over-permit workers just to allow access to the head.

New API shape:

    networkIsolation:
      mode: DenyAll
      head:
        ingressRules: [...]
        egressRules: [...]
      worker:
        ingressRules: [...]
        egressRules: [...]

Co-authored-by: Cursor <cursor@cursor.sh>

Signed-off-by: Pat O'Connor <paoconno@redhat.com>
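The per-role shape described in this commit message can be sketched as Go structs. `RoleRules`, `NetworkIsolation`, `parse` and `ruleCounts` are hypothetical names chosen for illustration; only the JSON keys (mode, head, worker, ingressRules, egressRules) come from the commit message, and the rule bodies are kept as raw JSON rather than modelled. The ports in the example document are illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RoleRules holds the custom rules for one role (head or worker).
type RoleRules struct {
	IngressRules []json.RawMessage `json:"ingressRules,omitempty"`
	EgressRules  []json.RawMessage `json:"egressRules,omitempty"`
}

// NetworkIsolation mirrors the per-role API shape from the commit message.
type NetworkIsolation struct {
	Mode   string     `json:"mode"`
	Head   *RoleRules `json:"head,omitempty"`
	Worker *RoleRules `json:"worker,omitempty"`
}

// parse decodes a JSON document into the sketch type.
func parse(doc string) (NetworkIsolation, error) {
	var ni NetworkIsolation
	err := json.Unmarshal([]byte(doc), &ni)
	return ni, err
}

// ruleCounts returns how many ingress and egress rules a role carries.
func ruleCounts(r *RoleRules) (ingress, egress int) {
	if r == nil {
		return 0, 0
	}
	return len(r.IngressRules), len(r.EgressRules)
}

// example opens one ingress port on the head and one egress port on workers.
const example = `{
  "mode": "DenyAll",
  "head": {"ingressRules": [{"ports": [{"port": 8265}]}]},
  "worker": {"egressRules": [{"ports": [{"port": 443}]}]}
}`

func main() {
	ni, err := parse(example)
	if err != nil {
		panic(err)
	}
	hi, he := ruleCounts(ni.Head)
	wi, we := ruleCounts(ni.Worker)
	fmt.Printf("mode=%s head(ingress=%d, egress=%d) worker(ingress=%d, egress=%d)\n", ni.Mode, hi, he, wi, we)
}
```

Because each role has its own rule lists, opening the head to a dashboard client does not add any ingress rule to the workers.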

chipspeak requested review from MortalHappiness, andrewsykim, kevin85421 and rueian as code owners March 23, 2026 11:37

chipspeak mentioned this pull request Mar 23, 2026

[Feature] Add mTLS Support via Cert Manager to RayCluster #4566

Open

4 tasks

cursor Bot reviewed Mar 23, 2026

Comment thread ray-operator/controllers/ray/networkpolicy_controller_test.go Outdated

Comment thread ray-operator/controllers/ray/networkpolicy_controller.go Outdated

rueian reviewed Mar 23, 2026

Comment thread ray-operator/controllers/ray/networkpolicy_controller.go Outdated

Comment thread ray-operator/controllers/ray/networkpolicy_controller.go Outdated

rueian reviewed Mar 23, 2026

Comment thread ray-operator/controllers/ray/networkpolicy_controller.go Outdated

Comment thread ray-operator/controllers/ray/networkpolicy_controller.go Outdated

Comment thread ray-operator/controllers/ray/networkpolicy_controller.go Outdated

rueian reviewed Mar 23, 2026

Comment thread ray-operator/controllers/ray/networkpolicy_controller.go Outdated

Comment thread ray-operator/controllers/ray/networkpolicy_controller.go Outdated

Comment thread ray-operator/controllers/ray/networkpolicy_controller.go Outdated

cursor Bot reviewed Mar 24, 2026

Comment thread ray-operator/controllers/ray/networkpolicy_controller.go Outdated

Comment thread ray-operator/controllers/ray/networkpolicy_controller.go

cursor Bot reviewed Mar 24, 2026

Comment thread ray-operator/controllers/ray/networkpolicy_controller.go

machichima reviewed Mar 31, 2026

Comment thread ray-operator/controllers/ray/networkpolicy_controller.go Outdated

machichima reviewed Mar 31, 2026

machichima reviewed Apr 1, 2026

Comment thread ray-operator/config/samples/ray-cluster.network-isolation-custom-ports.yaml Outdated

cursor Bot reviewed Apr 2, 2026

Comment thread ray-operator/controllers/ray/networkpolicy_controller.go

cursor Bot reviewed Apr 2, 2026

Comment thread ray-operator/controllers/ray/networkpolicy_controller.go

Comment thread ray-operator/controllers/ray/networkpolicy_controller.go

Comment thread ray-operator/config/samples/ray-cluster.network-isolation-custom-ports.yaml Outdated

cursor Bot reviewed Apr 2, 2026

Comment thread ray-operator/controllers/ray/networkpolicy_controller.go Outdated

cursor Bot reviewed Apr 2, 2026

Comment thread ray-operator/controllers/ray/networkpolicy_controller_test.go Outdated

fscnick reviewed Apr 7, 2026

Comment thread ray-operator/controllers/ray/networkpolicy_controller_test.go Outdated

Comment thread ray-operator/controllers/ray/networkpolicy_controller_test.go Outdated

cursor Bot reviewed Apr 7, 2026

Comment thread helm-chart/kuberay-operator/templates/_helpers.tpl

Comment thread ray-operator/config/rbac/role.yaml

chipspeak force-pushed the network-policies-with-gate branch from bd3876b to 6aa8ab1 Compare April 8, 2026 06:58

cursor Bot reviewed Apr 15, 2026

Comment thread ray-operator/controllers/ray/networkpolicy_controller.go Outdated

Future-Outlier reviewed Apr 17, 2026

chipspeak and others added 26 commits June 23, 2026 10:17

test fix + convert constants to CamelCase

c7b6042

Signed-off-by: Pat O'Connor <paoconno@redhat.com>

helm updates + check if RayCluster is externally managed

314413f

Signed-off-by: Pat O'Connor <paoconno@redhat.com>

API updates per review + autoscaling egress rule

556aaea

Signed-off-by: Pat O'Connor <paoconno@redhat.com>

Use standard lib for CIDR + use logConstructor consistently

35a814e

Signed-off-by: Pat O'Connor <paoconno@redhat.com>

Use EndpointSlice IPs for API server egress rule

c980e5a

Signed-off-by: Pat O'Connor <paoconno@redhat.com>

Deduplicates ports + peers across EndpointSlices in egress rules

07d3439

Signed-off-by: Pat O'Connor <paoconno@redhat.com>

Remove operator namespace fallback and prop error if namespace not found

c37f308

Signed-off-by: Pat O'Connor <paoconno@redhat.com>

Remove patch verb for networkpolicies

b9a133c

Signed-off-by: Pat O'Connor <paoconno@redhat.com>

Remove client port rule for operator

c442195

Signed-off-by: Pat O'Connor <paoconno@redhat.com>

Add reconcileConcurrency and GenerationChangedPredicate to NetworkPolicy controller setup

6f81b86

Signed-off-by: Pat O'Connor <paoconno@redhat.com>

solve 3 points, close to merge

8123a87

Signed-off-by: Future-Outlier <eric901201@gmail.com>

helper

e39932d

Signed-off-by: Future-Outlier <eric901201@gmail.com>

update

c679afe

Signed-off-by: Future-Outlier <eric901201@gmail.com>

CI failure fixes after DNS logic removal

ad3ae25

Signed-off-by: Pat O'Connor <paoconno@redhat.com>

Revert "helper"

c50271a

This reverts commit 26c4053.

Skip networkpolicy integration test when missing namespace

c5d0917

Signed-off-by: Pat O'Connor <paoconno@redhat.com>

Normalize defaulted NetworkPolicy ports before comparing

3828605

Signed-off-by: Pat O'Connor <paoconno@redhat.com>

Drop network policy integration tests pending followup e2e

1caeb1c

Signed-off-by: Pat O'Connor <paoconno@redhat.com>

Remove operator namespace ingress rule

b124c87

Signed-off-by: Pat O'Connor <paoconno@redhat.com>

NetworkIsolation API change cleanup

decccb2

Signed-off-by: Pat O'Connor <paoconno@redhat.com>

Unit tests to ensure custom policies don't leak between head and worker

822d98b

Signed-off-by: Pat O'Connor <paoconno@redhat.com>

Sample comment indendation and removal of operator rule from API docs

4950eb9

Signed-off-by: Pat O'Connor <paoconno@redhat.com>

Indent update to pass linter

9dbf7d9

Signed-off-by: Pat O'Connor <paoconno@redhat.com>

Additional review nits - comments etc

3e754b3

Signed-off-by: Pat O'Connor <paoconno@redhat.com>

Remove redundant comments in controller code

f53cf82

Signed-off-by: Pat O'Connor <paoconno@redhat.com>

chipspeak force-pushed the network-policies-with-gate branch from 15a2156 to f53cf82 Compare June 23, 2026 09:20

Future-Outlier approved these changes Jun 23, 2026

andrewsykim merged commit 1df0651 into ray-project:master Jun 23, 2026
32 checks passed

github-project-automation Bot moved this from can be merged to Done in @Future-Outlier's kuberay project Jun 23, 2026

8 participants
