
[Feature] Add NetworkIsolation support to RayClusters #4638

Merged
andrewsykim merged 45 commits into ray-project:master from opendatahub-io:network-policies-with-gate
Jun 23, 2026

Conversation

@chipspeak

@chipspeak chipspeak commented Mar 23, 2026

Contributor

Why are these changes needed?

KubeRay currently provides a flexible foundation for deploying Ray on Kubernetes but relies on users to manually configure critical security features. As per this proposal doc, this PR is one tentpole of a larger feature set aimed at security-conscious network templates.

This PR handles the automatic creation of NetworkPolicies for RayClusters via a new controller in KubeRay. The controller uses a new field in the RayCluster CR's spec, networkIsolation, whose mode field accepts three values:

  • DenyAll - the default. It prevents all ingress and egress with the exception of intra-cluster pod communication. This includes blocking the egress needed for RayJobs. It serves as a sensible starting point for any additional ingressRules or egressRules that a user wishes to add. The intended approach here would be to use a webhook to add any additional rules. This should maintain an initially conservative security posture from which to build a user-specific template.

  • DenyAllIngress - restricts all inbound traffic to RayCluster pods while leaving egress unrestricted. Intra-cluster communication and KubeRay operator access are still permitted by default. Like denyAll, any user-specified ingressRules are appended to the base policy, and RayJob submitter pods are automatically allowed when the RayCluster is owned by a RayJob. This mode is useful when Ray workloads need outbound access (e.g. pulling from external data sources or reaching cloud APIs) but should not be reachable from arbitrary pods in the namespace.

  • DenyAllEgress - restricts all outbound traffic from RayCluster pods while leaving ingress unrestricted. Intra-cluster communication and DNS resolution (port 53) are preserved as base rules to ensure the cluster remains functional. Any user-specified egressRules are appended alongside these defaults. This mode suits environments where inbound access to the Ray dashboard or client port should remain open but workloads should not be able to reach external services without explicit allowlisting. NOTE: To use pip (or similar tools) in the context of a RayJob, or to use Ray Serve, the user must add egressRules for that traffic.

In all three modes, the controller generates separate NetworkPolicy resources for head and worker pods. The base rules always ensure inter-pod communication within the cluster is unimpeded and that the KubeRay operator can reach the head node's dashboard and client ports. As previously outlined, users can extend any mode via the ingressRules and egressRules fields on the networkIsolation spec, which are appended verbatim to the generated policies.
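As a rough illustration of the mode semantics described above, the sketch below maps each mode to the NetworkPolicy policyTypes it would need to restrict. The `Mode` type and the `policyTypesFor` helper are hypothetical names used for illustration only and are not taken from the PR's code; the mapping itself follows from how Kubernetes policyTypes select which traffic directions a policy restricts.

```go
package main

import "fmt"

// Mode is a hypothetical stand-in for the networkIsolation mode field.
type Mode string

const (
	DenyAll        Mode = "DenyAll"
	DenyAllIngress Mode = "DenyAllIngress"
	DenyAllEgress  Mode = "DenyAllEgress"
)

// policyTypesFor returns the NetworkPolicy policyTypes a mode would restrict.
// A direction that is not listed stays unrestricted by the generated policy.
func policyTypesFor(m Mode) ([]string, error) {
	switch m {
	case DenyAll:
		return []string{"Ingress", "Egress"}, nil
	case DenyAllIngress:
		return []string{"Ingress"}, nil
	case DenyAllEgress:
		return []string{"Egress"}, nil
	default:
		return nil, fmt.Errorf("unknown network isolation mode %q", m)
	}
}

func main() {
	for _, m := range []Mode{DenyAll, DenyAllIngress, DenyAllEgress} {
		types, err := policyTypesFor(m)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(m, "restricts", types)
	}
}
```

The base allow rules (intra-cluster traffic, operator access, DNS) and any user-supplied rules would then be added on top of whichever directions are restricted.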

These resources are owned by the RayCluster (or the RayJob that owns the respective RayCluster) and are life-cycled accordingly, so users do not need to clean them up manually. These rules work on the expectation that good RBAC practice is maintained in your cluster.

NOTE: RayClusters that are owned by RayJobs function the same as outlined above, but an additional rule accounts for the RayJob submitter pod (this rule is present on all RayClusters). This rule uses the OwnerReference to validate the submitter, but that reference isn't present when a RayJob is submitted against a pre-existing cluster. To account for this, a new env var (which defaults to false) can be used to add a more permissive rule that admits these submitter pods.
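The owner check described in the note above could look roughly like this. This is a minimal sketch, assuming a trimmed stand-in for metav1.OwnerReference; the `OwnerReference` struct and the `rayJobOwner` helper are hypothetical and are not the controller's actual code.

```go
package main

import "fmt"

// OwnerReference is a trimmed, hypothetical stand-in for metav1.OwnerReference.
type OwnerReference struct {
	Kind string
	Name string
}

// rayJobOwner returns the name of the owning RayJob, if the RayCluster has one.
func rayJobOwner(refs []OwnerReference) (string, bool) {
	for _, ref := range refs {
		if ref.Kind == "RayJob" {
			return ref.Name, true
		}
	}
	return "", false
}

func main() {
	// A RayCluster created by a RayJob carries an owner reference to it.
	owned := []OwnerReference{{Kind: "RayJob", Name: "training-job"}}
	if name, ok := rayJobOwner(owned); ok {
		fmt.Println("add submitter ingress rule for RayJob", name)
	}

	// A pre-existing RayCluster has no RayJob owner, so no rule is derived.
	if _, ok := rayJobOwner(nil); !ok {
		fmt.Println("no RayJob owner: no owner-based submitter rule")
	}
}
```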

The proposal doc features additional information about the specific rules for each mode and the rationale behind them.

Related issue number

Closes: 3987

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(
@machichima

machichima commented Mar 30, 2026

Collaborator

Should we also add NetworkPolicy to RayJob submitter pod?

@chipspeak

Contributor Author

It might add a little too much complexity in terms of configuration. A user can define a submitterPodTemplate so we'd need to account for situations in which our defaults might conflict and interfere with that.

Would it be an idea for a follow-up PR maybe pending some discussion? Given how ephemeral submitter Pods are, I feel the attack surface is relatively small as is.

@machichima

Collaborator

Of course! We can discuss this in follow-ups

Comment thread ray-operator/controllers/ray/networkpolicy_controller.go
Comment on lines +283 to +284
// NamespaceSelector is intentionally empty (matches all namespaces) so that
// the operator pod is allowed regardless of which namespace it runs in.

Collaborator

Would it be better to limit it to the namespace where the kuberay-operator runs? Based on this rule, any pod with the label app.kubernetes.io/component: kuberay can connect to the head pod.

Contributor Author

That was the way we had it in our mid-stream, I'd just assumed we'd want a little more flexibility here. If a user deployed it in a custom namespace, they'd need to manually adjust the auto-generated network policies in this instance! I can certainly move it back to being namespace specific but it's security theatre at a certain point because if a bad actor has permissions to add that label, they'd also have no issue moving resources to a hard-coded namespace if need be. Does that make sense?

@machichima machichima Apr 1, 2026

Collaborator

I don't think adding this label requires high privilege?

What I was thinking is that when creating a RayCluster, users can specify labels for the worker and head pods. If they set app.kubernetes.io/component: kuberay for RayClusterA, it means that pods in RayClusterA are able to access other RayClusters' head Pods on the dashboard/client ports. This may allow job submission to other RayClusters.

This is one of the cases that we wanted to prevent.

Labels map[string]string `json:"labels,omitempty"`
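
To make the concern concrete, here is a small simulation of peer matching with and without a namespace constraint. It is only a sketch: `Pod`, `Peer`, and `Allows` are hypothetical stand-ins rather than Kubernetes API types, `AnyNamespace` models an empty namespaceSelector, and the `kuberay-system` and `team-a` namespaces are assumed names.

```go
package main

import "fmt"

// Pod is a minimal stand-in: a namespace plus pod labels.
type Pod struct {
	Namespace string
	Labels    map[string]string
}

// Peer models a NetworkPolicy peer. AnyNamespace stands in for an empty
// namespaceSelector (all namespaces); otherwise Namespace must match exactly.
type Peer struct {
	PodLabels    map[string]string
	AnyNamespace bool
	Namespace    string
}

// matches reports whether every selector label is present with the same value.
func matches(selector, labels map[string]string) bool {
	for k, want := range selector {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// Allows reports whether traffic from pod would be admitted by this peer.
func (p Peer) Allows(pod Pod) bool {
	if !p.AnyNamespace && pod.Namespace != p.Namespace {
		return false
	}
	return matches(p.PodLabels, pod.Labels)
}

func main() {
	operatorLabel := map[string]string{"app.kubernetes.io/component": "kuberay"}

	// A user-created pod in another namespace that copies the operator label.
	spoofed := Pod{Namespace: "team-a", Labels: operatorLabel}

	open := Peer{PodLabels: operatorLabel, AnyNamespace: true}
	scoped := Peer{PodLabels: operatorLabel, Namespace: "kuberay-system"}

	fmt.Println(open.Allows(spoofed))   // true: the label alone is enough
	fmt.Println(scoped.Allows(spoofed)) // false: the namespace does not match
}
```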

Contributor Author

Sorry I mean permissions in the context of applying the RayCluster CR at all but fair point! I'll move it back to our more conservative mid-stream approach with the operator namespace validated too! Thanks!

Contributor Author

I've updated it just now. Using NamespaceSelector and pulling the namespace from the service account on the pod so that the user doesn't have to deploy the operator with the POD_NAMESPACE env if they don't want to. WDYT?

@machichima machichima left a comment

Collaborator

I think there are some redundant / missing sample YAMLs

  • ray-cluster.network-isolation-monitoring.yaml, ray-cluster.network-isolation-complex-rules.yaml, and ray-cluster.network-isolation-custom-rules.yaml
    • Those 3 are similar, all using denyAll + ingressRules, I think we can just keep one
  • egress related YAML is not included

Maybe we can just include the following YAMLs:

  1. denyAll with ingress and egress set
  2. denyAllIngress with ingressRules set
  3. denyAllEgress with egressRules set
  4. custom port

WDYT?

@chipspeak

Contributor Author

Sounds good! I've updated the samples in line with this in the relevant commit.

@machichima

Collaborator

Hi @chipspeak,

A follow-up on this comment: https://github.com/ray-project/kuberay/pull/4638/changes#r3009809497

After discussing with the team offline, we decided it’s better to remove this rule for now to keep the default behavior strict. Instead of handling this on our side, we will leave it up to the users who create the RayCluster to explicitly add the necessary IngressRules.

We should also document this behavior in the IngressRules docstring and the official docs.

// IngressRules specifies custom ingress rules for Ray cluster pods.
// +optional
IngressRules []networkingv1.NetworkPolicyIngressRule `json:"ingressRules,omitempty"`

For example, we can mention that users can set the PodSelector to match pods with a specific label (e.g., allow-submit: true), and then apply this label to the SubmitterPodTemplate when creating a RayJob.

Our reasoning is that it's much safer to shift this responsibility to the user by default. If there is significant user demand in the future, we can then revisit this. However, if we release a non-strict version now, tightening the security later will be much harder as it could break existing user applications.
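
A rough sketch of the kind of user-supplied rule suggested above: an ingress rule that admits only pods labeled allow-submit: "true" on the dashboard port. The struct types here are simplified stand-ins for the networkingv1 types so the example stays self-contained, `submitterIngressRule` is a hypothetical helper, and port 8265 (Ray's default dashboard port) is an assumption about the cluster's configuration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Simplified stand-ins for the networkingv1 rule types (assumption: only the
// fields needed for this example are modeled).
type LabelSelector struct {
	MatchLabels map[string]string `json:"matchLabels,omitempty"`
}

type Peer struct {
	PodSelector *LabelSelector `json:"podSelector,omitempty"`
}

type Port struct {
	Protocol string `json:"protocol,omitempty"`
	Port     int    `json:"port"`
}

type IngressRule struct {
	From  []Peer `json:"from,omitempty"`
	Ports []Port `json:"ports,omitempty"`
}

// submitterIngressRule builds a rule admitting pods that carry the given label
// on a single TCP port.
func submitterIngressRule(labelKey, labelValue string, port int) IngressRule {
	return IngressRule{
		From: []Peer{{
			PodSelector: &LabelSelector{MatchLabels: map[string]string{labelKey: labelValue}},
		}},
		Ports: []Port{{Protocol: "TCP", Port: port}},
	}
}

func main() {
	rule := submitterIngressRule("allow-submit", "true", 8265)
	out, err := json.MarshalIndent(rule, "", "  ")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(out))
}
```

The same label would then be set on the RayJob's SubmitterPodTemplate so that only the submitter pod matches the selector.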

cc @rueian

@chipspeak

Contributor Author

Cool that makes sense!

Just for clarity, I assume this refers specifically to the permissive podselector rule that allowed the job submitter pod (and other pods within the same namespace) to access a pre-existing RayCluster?

Are we still ok with how it works for RayJob-owned RayClusters? Where the controller detects that a RayCluster has a RayJob ownerReference and then injects an ingress rule on the head NetworkPolicy allowing the submitter pod to reach the dashboard port.

@chipspeak chipspeak force-pushed the network-policies-with-gate branch from bd3876b to 6aa8ab1 Compare April 8, 2026 06:58
@machichima

Collaborator

Yes! My comment refers to the head pod ingress rule that allows pods within the same namespace. For RayJob-owned RayClusters, the current implementation in buildRayJobPeer looks good to me, since it does not allow access from arbitrary pods.


@Future-Outlier Future-Outlier left a comment

Member

Is this used in a production environment?

chipspeak and others added 26 commits June 23, 2026 10:17
Signed-off-by: Pat O'Connor <paoconno@redhat.com>
…icy controller setup

Signed-off-by: Pat O'Connor <paoconno@redhat.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Signed-off-by: Pat O'Connor <paoconno@redhat.com>
This reverts commit 26c4053.
Signed-off-by: Pat O'Connor <paoconno@redhat.com>
…lat ingressRules/egressRules on NetworkIsolationConfig with per-role head/worker sub-structs.

Since the API has not shipped yet, this avoids locking in a shared-only design that would require a backward-compatible migration later. Head and workers have fundamentally different security profiles: head needs external access (dashboard, submitter, Prometheus) while workers typically need outbound access (S3, model registries). A shared-only API forces users to over-permit workers just to allow access to the head.

New API shape:

    networkIsolation:
      mode: DenyAll
      head:
        ingressRules: [...]
        egressRules: [...]
      worker:
        ingressRules: [...]
        egressRules: [...]

Co-authored-by: Cursor <cursor@cursor.sh>
Signed-off-by: Pat O'Connor <paoconno@redhat.com>
@chipspeak chipspeak force-pushed the network-policies-with-gate branch from 15a2156 to f53cf82 Compare June 23, 2026 09:20
@andrewsykim andrewsykim merged commit 1df0651 into ray-project:master Jun 23, 2026
32 checks passed
@github-project-automation github-project-automation Bot moved this from can be merged to Done in @Future-Outlier's kuberay project Jun 23, 2026
8 participants