Serve LLMs with multi-cluster Ray Serve and GKE Inference Gateway

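This document describes how to configure the Kubernetes Gateway API and GKE Inference Gateway to manage inference requests across multiple Ray Serve clusters on Google Kubernetes Engine (GKE). With this configuration, you can centralize traffic management for multiple teams, distribute workloads across regions to increase capacity, and implement model-aware routing based on the content of the request body.

Benefits of using GKE Inference Gateway and Ray Serve

Using GKE Inference Gateway with Ray Serve provides the following benefits: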

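- Path routing: Configure each RayService with a path prefix, and then serve them through a single gateway that routes to multiple Ray services.
  - For details about setting up path prefix rules, see the Gateway API documentation.
- Model-aware routing: Select the RayService to route to based on the request body (for example, by extracting the requested model from an OpenAI-API JSON request).
- Governance: Require API keys to use the service, or enforce per-user quotas by using Apigee for authentication and API management.
- Multi-region: Use a multi-cluster gateway to split traffic across multiple GKE clusters that run RayServices, which increases availability or capacity.
- Separation of concerns: Use separate RayServices that can be managed by separate teams, follow separate rollouts, and run in different topologies.
- Security: Use the gateway as an SSL terminator to protect user traffic that crosses the internet. For more information, see Gateway security.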

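To configure routing, you deploy a Gateway, an HTTPRoute, and the RayServices. The Kubernetes Service for each target Ray cluster is typically created by KubeRay. Ray Serve load-balances requests within each cluster, so you don't need to create an InferencePool or an Endpoint Picker.

Model-aware routing for Ray Serve on GKE

Model-aware routing is enabled by the body-based routing extension. With body-based routing, you can route traffic to different RayServices based on the model named in the user's request, so a single endpoint can serve multiple models that are hosted on multiple Ray clusters. Users get simplified access, and app developers keep control over the configuration of each Ray endpoint.

To configure model-aware routing, you deploy the following key components: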

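- A body-based router extension that extracts the model name from the JSON payload. You deploy this router extension by using Helm.
- A GKE Gateway (L7 regional internal Application Load Balancer) that handles incoming traffic.
- HTTPRoute rules that use the header populated by the router extension to direct traffic to the correct Ray service.
- Multiple Ray Serve clusters that manage the lifecycle and autoscaling of the siloed models.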

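Before you begin

Before you start, make sure that you have performed the following tasks:

- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.
  Note: For existing gcloud CLI installations, make sure to set the compute/region property. If you primarily use zonal clusters, set compute/zone instead. Setting a default location helps you avoid gcloud CLI errors such as One of [--zone, --region] must be supplied: Please specify location. If your cluster's location differs from the default that you set, you might need to specify the location in certain commands.
- Make sure that Helm is installed.
- Create a Hugging Face account, if you don't already have one.
- Make sure that you have a Hugging Face token.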

Prepare your development environment

Set the environment variables:

export CLUSTER=$(whoami)-ray-bbr
export PROJECT_ID=$(gcloud config get-value project)
export LOCATION=us-central1-b
export REGION=us-central1
export HUGGING_FACE_TOKEN=YOUR_HUGGING_FACE_TOKEN

Replace YOUR_HUGGING_FACE_TOKEN with your Hugging Face access token.

Prepare the infrastructure

In this section, you set up a Ray-enabled, Gateway-enabled GKE cluster that uses L4 GPUs.

Create a cluster with the Ray Operator and the Gateway API enabled:

gcloud container clusters create ${CLUSTER} \
    --project ${PROJECT_ID} \
    --location ${LOCATION} \
    --cluster-version 1.35 \
    --gateway-api standard \
    --addons HttpLoadBalancing,RayOperator \
    --enable-ray-cluster-logging \
    --enable-ray-cluster-monitoring \
    --machine-type e2-standard-4

Create a GPU node pool for the model workloads:

gcloud container node-pools create gpu-pool \
    --cluster=${CLUSTER} \
    --location=${LOCATION} \
    --accelerator="type=nvidia-l4,count=1,gpu-driver-version=latest" \
    --machine-type=g2-standard-8 \
    --num-nodes=4

Create the proxy-only subnet for the regional internal Application Load Balancer, which body-based routing requires:

gcloud compute networks subnets create bbr-proxy-only-subnet \
    --purpose=REGIONAL_MANAGED_PROXY \
    --role=ACTIVE \
    --region=${REGION} \
    --network=default \
    --range=192.168.10.0/24

Deploy the Hugging Face secret:

kubectl create secret generic hf-secret \
    --from-literal=hf_api_token=${HUGGING_FACE_TOKEN}

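Deploy the body-based router for model-aware routing

The body-based router extension intercepts each request, parses the JSON body, and copies the value of the model field into the X-Gateway-Model-Name header.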
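
The transformation that the router performs can be illustrated with a minimal Python sketch. This is not the extension's implementation; the helper name `extract_model_header` and its behavior on malformed bodies (returning no header) are assumptions made for illustration only.

```python
import json

MODEL_HEADER = "X-Gateway-Model-Name"  # header name used in this document


def extract_model_header(body: bytes, field_name: str = "model") -> dict[str, str]:
    """Return the header a body-based router would add for this request body.

    Hypothetical helper: returns an empty dict when the body is not a JSON
    object or when the field is missing or not a non-empty string.
    """
    try:
        payload = json.loads(body)
    except (json.JSONDecodeError, UnicodeDecodeError):
        return {}
    if not isinstance(payload, dict):
        return {}
    model = payload.get(field_name)
    if not isinstance(model, str) or not model:
        return {}
    return {MODEL_HEADER: model}


request_body = b'{"model": "gemma-2b-it", "messages": [{"role": "user", "content": "Hi"}]}'
print(extract_model_header(request_body))
```

A request whose body has no model field produces no header in this sketch, so it would not match any header-based routing rule.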

Create a file named helm-values.yaml with the following content:

bbr:
  plugins:
    - type: "body-field-to-header"
      name: "openai-model-extractor"
      json:
        field_name: "model"
        header_name: "X-Gateway-Model-Name"

Install the body-based router by using Helm:

helm install body-based-router \
    oci://registry.k8s.io/gateway-api-inference-extension/charts/body-based-routing \
    --version v1.4.0 \
    --set provider.name=gke \
    --set inferenceGateway.name=ray-multi-model-gateway \
    --values helm-values.yaml

Deploy the RayServices

To deploy the models, you apply RayService manifests. Each manifest defines a Ray cluster that runs a specific LLM.

Create a file named gemma-2b-it.yaml with the following content:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: gemma-2b-it
spec:
  serveConfigV2: |
    applications:
    - name: llm_app
      route_prefix: "/"
      import_path: ray.serve.llm:build_openai_app
      args:
        llm_configs:
            - model_loading_config:
                model_id: gemma-2b-it
                model_source: google/gemma-2b-it
              accelerator_type: L4
              log_engine_metrics: true
              deployment_config:
                autoscaling_config:
                    min_replicas: 2
                    max_replicas: 2
                health_check_period_s: 600
                health_check_timeout_s: 300
  rayClusterConfig:
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
        num-cpus: "0"
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray-llm:2.54.0-py311-cu128
              resources:
                limits:
                  memory: "8Gi"
                  ephemeral-storage: "32Gi"
                requests:
                  cpu: "2"
                  memory: "8Gi"
                  ephemeral-storage: "32Gi"
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
              env:
                - name: RAY_SERVE_THROUGHPUT_OPTIMIZED
                  value: "1"
                - name: RAY_SERVE_ENABLE_HA_PROXY
                  value: "1"
                - name: HUGGING_FACE_HUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-secret
                      key: hf_api_token
    rayVersion: 2.54.0
    workerGroupSpecs:
      - replicas: 2
        minReplicas: 2
        maxReplicas: 2
        groupName: gpu-group
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: llm
                image: rayproject/ray-llm:2.54.0-py311-cu128
                env:
                  - name: RAY_SERVE_THROUGHPUT_OPTIMIZED
                    value: "1"
                  - name: RAY_SERVE_ENABLE_HA_PROXY
                    value: "1"
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                resources:
                  limits:
                    nvidia.com/gpu: "1"
                    ephemeral-storage: "24Gi"
                  requests:
                    cpu: "6"
                    memory: "24Gi"
                    nvidia.com/gpu: "1"
                    ephemeral-storage: "24Gi"
            nodeSelector:
              cloud.google.com/gke-accelerator: nvidia-l4

Create a file named qwen2.5-3b.yaml with the following content:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: qwen-25-3b
spec:
  serveConfigV2: |
    applications:
    - name: llm_app
      route_prefix: "/"
      import_path: ray.serve.llm:build_openai_app
      args:
        llm_configs:
            - model_loading_config:
                model_id: qwen-2.5-3b
                model_source: Qwen/Qwen2.5-3B
              accelerator_type: L4
              log_engine_metrics: true
              deployment_config:
                autoscaling_config:
                    min_replicas: 2
                    max_replicas: 2
                health_check_period_s: 600
                health_check_timeout_s: 300
  rayClusterConfig:
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
        num-cpus: "0"
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray-llm:2.54.0-py311-cu128
              resources:
                limits:
                  memory: "8Gi"
                  ephemeral-storage: "32Gi"
                requests:
                  cpu: "2"
                  memory: "8Gi"
                  ephemeral-storage: "32Gi"
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
              env:
                - name: RAY_SERVE_THROUGHPUT_OPTIMIZED
                  value: "1"
                - name: RAY_SERVE_ENABLE_HA_PROXY
                  value: "1"
                - name: HUGGING_FACE_HUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-secret
                      key: hf_api_token
    rayVersion: 2.54.0
    workerGroupSpecs:
      - replicas: 2
        minReplicas: 2
        maxReplicas: 2
        groupName: gpu-group
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: llm
                image: rayproject/ray-llm:2.54.0-py311-cu128
                env:
                  - name: RAY_SERVE_THROUGHPUT_OPTIMIZED
                    value: "1"
                  - name: RAY_SERVE_ENABLE_HA_PROXY
                    value: "1"
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                resources:
                  limits:
                    nvidia.com/gpu: "1"
                    ephemeral-storage: "24Gi"
                  requests:
                    cpu: "6"
                    memory: "24Gi"
                    nvidia.com/gpu: "1"
                    ephemeral-storage: "24Gi"
            nodeSelector:
              cloud.google.com/gke-accelerator: nvidia-l4

Deploy the models:

kubectl apply -f gemma-2b-it.yaml
kubectl apply -f qwen2.5-3b.yaml
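
Before you apply both manifests, it can help to check that the GPU requests fit the node pool. The following Python sketch does that arithmetic with the values from the commands and manifests in this document; the function name `gpu_budget` is hypothetical, and the check ignores CPU, memory, and any cluster autoscaling.

```python
# Values copied from the commands and manifests in this document.
GPU_NODES = 4            # --num-nodes=4 in the gpu-pool node pool
GPUS_PER_NODE = 1        # count=1 in the --accelerator flag
WORKERS_PER_SERVICE = 2  # replicas: 2 in each workerGroupSpecs entry
GPUS_PER_WORKER = 1      # nvidia.com/gpu: "1" per worker container
SERVICES = ["gemma-2b-it", "qwen-25-3b"]


def gpu_budget(nodes, gpus_per_node, services, workers_per_service, gpus_per_worker):
    """Return (available, requested) GPU counts for the deployment."""
    available = nodes * gpus_per_node
    requested = len(services) * workers_per_service * gpus_per_worker
    return available, requested


available, requested = gpu_budget(
    GPU_NODES, GPUS_PER_NODE, SERVICES, WORKERS_PER_SERVICE, GPUS_PER_WORKER
)
print(f"GPUs available: {available}, requested: {requested}")
if requested > available:
    print("Some Ray worker Pods would not be schedulable until the node pool grows.")
```

With these values, the two RayServices request exactly the four GPUs that the node pool provides, so adding a third model with the same shape would require a larger node pool.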

Configure health checks

To make sure that the load balancer accurately monitors the health of the Ray workers, you apply HealthCheckPolicy resources.

Create a file named healthcheck-policy.yaml with the following content:

apiVersion: networking.gke.io/v1
kind: HealthCheckPolicy
metadata:
  name: gemma-serve-healthcheck
  namespace: default
spec:
  default:
    checkIntervalSec: 5
    timeoutSec: 5
    healthyThreshold: 2
    unhealthyThreshold: 2
    config:
      type: HTTP
      httpHealthCheck:
        port: 8000
        requestPath: /-/healthz
  targetRef:
    group: ""
    kind: Service
    name: gemma-2b-it-serve-svc
---
apiVersion: networking.gke.io/v1
kind: HealthCheckPolicy
metadata:
  name: qwen-serve-healthcheck
  namespace: default
spec:
  default:
    checkIntervalSec: 5
    timeoutSec: 5
    healthyThreshold: 2
    unhealthyThreshold: 2
    config:
      type: HTTP
      httpHealthCheck:
        port: 8000
        requestPath: /-/healthz
  targetRef:
    group: ""
    kind: Service
    name: qwen-25-3b-serve-svc

Apply the health check policies:

kubectl apply -f healthcheck-policy.yaml
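
The healthyThreshold and unhealthyThreshold fields count consecutive probe results. The following Python sketch models that behavior in a simplified way; it is not the load balancer's implementation, the class name `HealthTracker` is hypothetical, and starting in the unhealthy state is an assumption.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Tracks backend health from consecutive probe results (simplified model)."""

    healthy_threshold: int = 2    # healthyThreshold in the policy
    unhealthy_threshold: int = 2  # unhealthyThreshold in the policy
    healthy: bool = False         # assumption: backends start as unhealthy
    _streak: int = 0              # consecutive results that disagree with the current state

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result and return the resulting health state."""
        if probe_ok == self.healthy:
            # The probe agrees with the current state, so any opposing streak is broken.
            self._streak = 0
            return self.healthy
        self._streak += 1
        needed = self.healthy_threshold if probe_ok else self.unhealthy_threshold
        if self._streak >= needed:
            self.healthy = probe_ok
            self._streak = 0
        return self.healthy


tracker = HealthTracker()
for ok in [True, True, False, True, False, False]:
    print(ok, "->", tracker.record(ok))
```

In this model, a single failed probe between successes does not remove a backend; with checkIntervalSec set to 5, two consecutive failures take roughly two probe intervals to accumulate.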

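Configure routing

To configure routing, you apply Gateway and HTTPRoute manifests. The HTTPRoute contains rules that match the X-Gateway-Model-Name header (populated by the body-based router) and route traffic to the appropriate Ray service.

Create a file named gateway.yaml with the following content: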

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: ray-multi-model-gateway
  namespace: default
spec:
  gatewayClassName: gke-l7-rilb
  listeners:
  - allowedRoutes:
      namespaces:
        from: Same
    name: http
    port: 80
    protocol: HTTP
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: ray-multi-model-route
spec:
  parentRefs:
  - name: ray-multi-model-gateway
  rules:
  - matches:
    - headers:
      - type: Exact
        name: X-Gateway-Model-Name
        value: gemma-2b-it  # Must match model named in JSON request!
      path:
        type: PathPrefix
        value: /
    backendRefs:
    - name: gemma-2b-it-serve-svc  # Ray service name plus "-serve-svc".
      kind: Service
      port: 8000

  - matches:
    - headers:
      - type: Exact
        name: X-Gateway-Model-Name
        value: qwen-2.5-3b  # Matches another extracted model name
      path:
        type: PathPrefix
        value: /
    backendRefs:
    - name: qwen-25-3b-serve-svc  # Target Ray Service.
      kind: Service
      port: 8000

Apply the gateway and route:
```
kubectl apply -f gateway.yaml
```
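
The header matching in the HTTPRoute can be pictured with the following Python sketch. It mirrors the two rules in gateway.yaml under simplifying assumptions: the function name `pick_backend` is hypothetical, the PathPrefix "/" match is omitted because it accepts every path, and the real matching is performed by the load balancer, not by code like this.

```python
from typing import Optional

MODEL_HEADER = "X-Gateway-Model-Name"

# (header value, backend Service name, port) mirrored from the HTTPRoute rules.
ROUTES = [
    ("gemma-2b-it", "gemma-2b-it-serve-svc", 8000),
    ("qwen-2.5-3b", "qwen-25-3b-serve-svc", 8000),
]


def pick_backend(headers: dict[str, str]) -> Optional[tuple[str, int]]:
    """Return (service, port) for the first rule whose Exact header match succeeds."""
    # HTTP header names are case-insensitive, so normalize them before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    model = normalized.get(MODEL_HEADER.lower())
    for value, service, port in ROUTES:
        if model == value:  # Exact match: the header value is compared as-is
            return service, port
    return None  # no rule matched


print(pick_backend({"X-Gateway-Model-Name": "gemma-2b-it"}))
print(pick_backend({"X-Gateway-Model-Name": "qwen-25-3b"}))
```

The second call returns None because the header carries the model_id from the request (qwen-2.5-3b), not the RayService name (qwen-25-3b). Keep the value in each HTTPRoute rule aligned with the model_id in the matching RayService manifest.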

Test the deployment

After the gateway is provisioned and both Ray clusters are ready, you can test routing by sending requests that name different models in the JSON body.

Get the gateway IP address:

kubectl get gateways ray-multi-model-gateway

Start a shell on a network that can reach the gateway address. For example, you can use curl from one of the Ray cluster Pods:
```
POD_NAME=$(kubectl get pods -l ray.io/node-type=head -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it $POD_NAME -- bash
```

Send a request to test routing to Gemma:

curl http://GATEWAY_IP_ADDRESS/v1/chat/completions \
    --header 'Content-Type: application/json' \
    --data '{
    "model": "gemma-2b-it",
    "messages": [{"role": "user", "content": "Tell me about GKE."}]
    }'

Replace GATEWAY_IP_ADDRESS with the gateway IP address that you retrieved earlier.

The output is similar to the following:

{"id":"chatcmpl-594f7cab-f991-4522-9829-acdbb65d9f67","object":"chat.completion","created":1776379509,"model":"gemma-2b-it","choices":[{"index":0,"message":{"role":"assistant","content":"**Google Kubernetes Engine (GKE)** is a fully managed container orchestration service for Kubernetes [...]

Test routing to Qwen:

curl http://GATEWAY_IP_ADDRESS/v1/chat/completions \
    --header 'Content-Type: application/json' \
    --data '{
    "model": "qwen-2.5-3b",
    "messages": [{"role": "user", "content": "How does Ray Serve work?"}]
    }'
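
If you prefer to script these checks, the same requests can be built in Python with only the standard library. This is a minimal sketch: the helper names `build_chat_request` and `first_reply` are hypothetical, and 192.0.2.10 is a documentation-only placeholder address that you would replace with your gateway IP address.

```python
import json
import urllib.request


def build_chat_request(gateway_ip: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"http://{gateway_ip}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_reply(response_body: bytes) -> str:
    """Return the assistant text from an OpenAI-style chat completion response."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Placeholder address; nothing is sent over the network in this sketch.
    for model in ("gemma-2b-it", "qwen-2.5-3b"):
        req = build_chat_request("192.0.2.10", model, "Tell me about GKE.")
        print(req.get_method(), req.full_url, json.loads(req.data)["model"])
```

To send a request from a host that can reach the gateway, pass the built request to `urllib.request.urlopen` with a timeout and give the bytes that you read from the response to `first_reply`. The model field in the response should name the model that you requested, which confirms that the route matched.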