通过 KubeRay,使用 GKE 上的 TPU 应用 LLM

本教程介绍如何使用 Google Kubernetes Engine (GKE) 上的张量处理单元 (TPU) 以及 Ray Operator 插件vLLM 部署框架来部署大语言模型 (LLM)。

在本教程中,您可以在 TPU v5e 或 TPU Trillium (v6e) 上应用 LLM 模型,如下所示:

本指南适用于想要使用 Kubernetes 容器编排功能,在采用 vLLM 的 TPU 上使用 Ray 应用模型的生成式 AI 客户、新的和现有的 GKE 用户、机器学习工程师、MLOps (DevOps) 工程师或平台管理员。

背景

本部分介绍本指南中使用的关键技术。

GKE 托管式 Kubernetes 服务

Google Cloud 提供各种各样的服务,包括 GKE,该服务非常适合用于部署和管理 AI/机器学习工作负载。GKE 是一项托管式 Kubernetes 服务,可简化容器化应用的部署、扩缩和管理。GKE 提供必要的基础设施(包括可伸缩资源、分布式计算和高效网络),以满足 LLM 的计算需求。

如需详细了解关键 Kubernetes 概念,请参阅开始了解 Kubernetes。如需详细了解 GKE 以及它如何帮助您扩缩、自动执行和管理 Kubernetes,请参阅 GKE 概览

Ray Operator

GKE 上的 Ray Operator 插件提供了一个端到端 AI/机器学习平台,用于部署、训练和微调机器学习工作负载。在本教程中,您将使用 Ray 中的框架 Ray Serve,通过 Hugging Face 应用热门 LLM。

TPU

TPU 是 Google 定制开发的应用专用集成电路 (ASIC),用于加速机器学习和使用 TensorFlowPyTorchJAX 等框架构建的 AI 模型。

本教程介绍如何在 TPU v5e 或 TPU Trillium (v6e) 节点上应用 LLM 模型,这些节点根据每个模型要求配置 TPU 拓扑,从而以低延迟响应提示。

vLLM

vLLM 是一个经过高度优化的开源 LLM 服务框架,可提高 TPU 上的服务吞吐量,具有如下功能:

  • 具有 PagedAttention 且经过优化的 Transformer(转换器)实现
  • 连续批处理,可提高整���服务吞吐量
  • 多个 GPU 上的张量并行处理和分布式服务

如需了解详情,请参阅 vLLM 文档

目标

本教程介绍以下步骤:

  1. 创建具有 TPU 节点池的 GKE 集群。
  2. 使用单主机 TPU 切片部署 RayCluster 自定义资源。GKE 会将 RayCluster 自定义资源部署为 Kubernetes Pod。
  3. 应用 LLM。
  4. 与模型交互。

您可以选择性地配置 Ray Serve 框架支持的以下模型部署资源和方法:

  • 部署 RayService 自定义资源。
  • 通过模型组合实现多个模型的组合。

准备工作

在开始之前,请确保您已执行以下任务:

  • 启用 Google Kubernetes Engine API。
  • 启用 Google Kubernetes Engine API
  • 如果您要使用 Google Cloud CLI 执行此任务,请安装初始化 gcloud CLI。 如果您之前安装了 gcloud CLI,请通过运行 gcloud components update 命令来获取最新版本。较早版本的 gcloud CLI 可能不支持运行本文档中的命令。
  • 如果您还没有 Hugging Face 账号,请创建一个。
  • 确保您拥有 Hugging Face 令牌
  • 确保您有权访问要使用的 Hugging Face 模型。通常需要在 Hugging Face 模型页面上签署一个协议并向模型所有者申请使用权,从而获得访问权限。
  • 确保您拥有以下 IAM 角色
    • roles/container.admin
    • roles/iam.serviceAccountAdmin
    • roles/container.clusterAdmin
    • roles/artifactregistry.writer

准备环境

  1. 检查您的 Google Cloud 项目中有足够的配额来使用单主机 TPU v5e 或单主机 TPU Trillium (v6e)。如需管理配额,请参阅 TPU 配额

  2. 在 Google Cloud 控制台中,启动 Cloud Shell 实例:
    打开 Cloud Shell

  3. 克隆示例代码库:

    git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples.git
    cd kubernetes-engine-samples
    
  4. 导航到工作目录:

    cd ai-ml/gke-ray/rayserve/llm
    
  5. 为 GKE 集群创建设置默认环境变量:

    Llama-3-8B-Instruct

    export PROJECT_ID=$(gcloud config get project)
    export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
    export CLUSTER_NAME=vllm-tpu
    export COMPUTE_REGION=REGION
    export COMPUTE_ZONE=ZONE
    export HF_TOKEN=HUGGING_FACE_TOKEN
    export GSBUCKET=vllm-tpu-bucket
    export KSA_NAME=vllm-sa
    export NAMESPACE=default
    export MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"
    export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
    export SERVICE_NAME=vllm-tpu-head-svc
    

    替换以下内容:

    • HUGGING_FACE_TOKEN:您的 Hugging Face 访问令牌。
    • REGION:您拥有 TPU 配额的区域。确保您要使用的 TPU 版本在此区域可用。如需了解详情,请参阅 GKE 中的 TPU 可用性
    • ZONE:具有可用 TPU 配额的可用区。
    • VLLM_IMAGE:vLLM TPU 映像。您可以使用公共 docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 映像,也可以构建自己的 TPU 映像

    Mistral-7B

    export PROJECT_ID=$(gcloud config get project)
    export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
    export CLUSTER_NAME=vllm-tpu
    export COMPUTE_REGION=REGION
    export COMPUTE_ZONE=ZONE
    export HF_TOKEN=HUGGING_FACE_TOKEN
    export GSBUCKET=vllm-tpu-bucket
    export KSA_NAME=vllm-sa
    export NAMESPACE=default
    export MODEL_ID="mistralai/Mistral-7B-Instruct-v0.3"
    export TOKENIZER_MODE=mistral
    export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
    export SERVICE_NAME=vllm-tpu-head-svc
    

    替换以下内容:

    • HUGGING_FACE_TOKEN:您的 Hugging Face 访问令牌。
    • REGION:您拥有 TPU 配额的区域。确保您要使用的 TPU 版本在此区域可用。如需了解详情,请参阅 GKE 中的 TPU 可用性
    • ZONE:具有可用 TPU 配额的可用区。
    • VLLM_IMAGE:vLLM TPU 映像。您可以使用公共 docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 映像,也可以构建自己的 TPU 映像

    Llama 3.1 70B

    export PROJECT_ID=$(gcloud config get project)
    export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
    export CLUSTER_NAME=vllm-tpu
    export COMPUTE_REGION=REGION
    export COMPUTE_ZONE=ZONE
    export HF_TOKEN=HUGGING_FACE_TOKEN
    export GSBUCKET=vllm-tpu-bucket
    export KSA_NAME=vllm-sa
    export NAMESPACE=default
    export MODEL_ID="meta-llama/Llama-3.1-70B"
    export MAX_MODEL_LEN=8192
    export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
    export SERVICE_NAME=vllm-tpu-head-svc
    

    替换以下内容:

    • HUGGING_FACE_TOKEN:您的 Hugging Face 访问令牌。
    • REGION:您拥有 TPU 配额的区域。确保您要使用的 TPU 版本在此区域可用。如需了解详情,请参阅 GKE 中的 TPU 可用性
    • ZONE:具有可用 TPU 配额的可用区。
    • VLLM_IMAGE:vLLM TPU 映像。您可以使用公共 docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 映像,也可以构建自己的 TPU 映像
  6. 拉取 vLLM 容器映像:

    sudo usermod -aG docker ${USER}
    newgrp docker
    docker pull ${VLLM_IMAGE}
    

创建集群

您可以使用 Ray Operator 插件,在 GKE Autopilot 或 Standard 集群中通过 Ray 在 GPU 上应用 LLM。

最佳做法

使用 Autopilot 集群可获得全托管式 Kubernetes 体验。如需选择最适合您的工作负载的 GKE 操作模式,请参阅选择 GKE 操作模式

使用 Cloud Shell 创建 Autopilot 或 Standard 集群:

Autopilot

  1. 创建启用 Ray Operator 插件的 GKE Autopilot 集群:

    gcloud container clusters create-auto ${CLUSTER_NAME}  \
        --enable-ray-operator \
        --release-channel=rapid \
        --location=${COMPUTE_REGION}
    

标准

  1. 创建启用了 Ray Operator 插件的 Standard 集群:

    gcloud container clusters create ${CLUSTER_NAME} \
        --release-channel=rapid \
        --location=${COMPUTE_ZONE} \
        --workload-pool=${PROJECT_ID}.svc.id.goog \
        --machine-type="n1-standard-4" \
        --addons=RayOperator,GcsFuseCsiDriver
    
  2. 创建单主机 TPU 切片节点池:

    Llama-3-8B-Instruct

    gcloud container node-pools create tpu-1 \
        --location=${COMPUTE_ZONE} \
        --cluster=${CLUSTER_NAME} \
        --machine-type=ct5lp-hightpu-8t \
        --num-nodes=1
    

    GKE 会创建一个具有 ct5lp-hightpu-8t 机器类型的 TPU v5e 节点池。

    Mistral-7B

    gcloud container node-pools create tpu-1 \
        --location=${COMPUTE_ZONE} \
        --cluster=${CLUSTER_NAME} \
        --machine-type=ct5lp-hightpu-8t \
        --num-nodes=1
    

    GKE 会创建一个具有 ct5lp-hightpu-8t 机器类型的 TPU v5e 节点池。

    Llama 3.1 70B

    gcloud container node-pools create tpu-1 \
        --location=${COMPUTE_ZONE} \
        --cluster=${CLUSTER_NAME} \
        --machine-type=ct6e-standard-8t \
        --num-nodes=1
    

    GKE 会创建一个具有 ct6e-standard-8t 机器类型的 TPU v6e 节点池。

配置 kubectl 以与您的集群通信

如需配置 kubectl 以与您的集群通信,请运行以下命令:

Autopilot

gcloud container clusters get-credentials ${CLUSTER_NAME} \
    --location=${COMPUTE_REGION}

标准

gcloud container clusters get-credentials ${CLUSTER_NAME} \
    --location=${COMPUTE_ZONE}

为 Hugging Face 凭据创建 Kubernetes Secret

如需创建包含 Hugging Face 令牌的 Kubernetes Secret,请运行以下命令:

kubectl create secret generic hf-secret \
    --from-literal=hf_api_token=${HF_TOKEN} \
    --dry-run=client -o yaml | kubectl --namespace ${NAMESPACE} apply -f -

创建 Cloud Storage 存储桶

为了缩短 vLLM 部署启动时间并尽可能减少每个节点需要的磁盘空间,请使用 Cloud Storage FUSE CSI 驱动程序将下载的模型和编译缓存装载到 Ray 节点。

在 Cloud Shell 中,运行以下命令:

gcloud storage buckets create gs://${GSBUCKET} \
    --uniform-bucket-level-access

此命令会创建一个 Cloud Storage 存储桶,用于存储您从 Hugging Face 下载的模型文件。

设置 Kubernetes ServiceAccount 以访问存储桶

  1. 创建 Kubernetes ServiceAccount:

    kubectl create serviceaccount ${KSA_NAME} \
        --namespace ${NAMESPACE}
    
  2. 向 Kubernetes ServiceAccount 授予 Cloud Storage 存储桶的读写权限:

    gcloud storage buckets add-iam-policy-binding gs://${GSBUCKET} \
        --member "principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/${NAMESPACE}/sa/${KSA_NAME}" \
        --role "roles/storage.objectUser"
    

    GKE 会为 LLM 创建以下资源:

    1. 存储下载的模型和编译缓存的 Cloud Storage 存储桶。Cloud Storage FUSE CSI 驱动程序会读取存储桶的内容。
    2. 启用文件缓存的卷和 Cloud Storage FUSE 的并行下载功能。
    最佳实践

    根据模型内容(例如权重文件)的预期大小,使用由 tmpfsHyperdisk / Persistent Disk 提供支持的文件缓存。在本教程中,您将使用由 RAM 提供支持的 Cloud Storage FUSE 文件缓存。

部署 RayCluster 自定义资源

部署 RayCluster 自定义资源,该资源通常由一个系统 Pod 和多个工作器 Pod 组成。

Llama-3-8B-Instruct

完成以下步骤,创建 RayCluster 自定义资源以部署 Llama 3 8B 指令调���模型:

  1. 检查 ray-cluster.tpu-v5e-singlehost.yaml 清单:

    apiVersion: ray.io/v1
    kind: RayCluster
    metadata:
      name: vllm-tpu
    spec:
      headGroupSpec:
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
              - name: ray-head
                image: $VLLM_IMAGE
                imagePullPolicy: IfNotPresent
                resources:
                  limits:
                    cpu: "2"
                    memory: 8G
                  requests:
                    cpu: "2"
                    memory: 8G
                env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  - containerPort: 8471
                    name: slicebuilder
                  - containerPort: 8081
                    name: mxla
                volumeMounts:
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
      workerGroupSpecs:
      - groupName: tpu-group
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        numOfHosts: 1
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
              - name: ray-worker
                image: $VLLM_IMAGE
                imagePullPolicy: IfNotPresent
                resources:
                  limits:
                    cpu: "100"
                    google.com/tpu: "8"
                    ephemeral-storage: 40G
                    memory: 200G
                  requests:
                    cpu: "100"
                    google.com/tpu: "8"
                    ephemeral-storage: 40G
                    memory: 200G
                env:
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                volumeMounts:
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
              cloud.google.com/gke-tpu-topology: 2x4
  2. 应用清单:

    envsubst < tpu/ray-cluster.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
    

    envsubst 命令会替换清单中的环境变量。

GKE 会创建一个 RayCluster 自定义资源,其中 workergroup2x4 拓扑中包含 TPU v5e 单主机。

Mistral-7B

完成以下步骤,创建 RayCluster 自定义资源以部署 Mistral-7B 模型:

  1. 检查 ray-cluster.tpu-v5e-singlehost.yaml 清单:

    apiVersion: ray.io/v1
    kind: RayCluster
    metadata:
      name: vllm-tpu
    spec:
      headGroupSpec:
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
              - name: ray-head
                image: $VLLM_IMAGE
                imagePullPolicy: IfNotPresent
                resources:
                  limits:
                    cpu: "2"
                    memory: 8G
                  requests:
                    cpu: "2"
                    memory: 8G
                env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  - containerPort: 8471
                    name: slicebuilder
                  - containerPort: 8081
                    name: mxla
                volumeMounts:
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
      workerGroupSpecs:
      - groupName: tpu-group
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        numOfHosts: 1
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
              - name: ray-worker
                image: $VLLM_IMAGE
                imagePullPolicy: IfNotPresent
                resources:
                  limits:
                    cpu: "100"
                    google.com/tpu: "8"
                    ephemeral-storage: 40G
                    memory: 200G
                  requests:
                    cpu: "100"
                    google.com/tpu: "8"
                    ephemeral-storage: 40G
                    memory: 200G
                env:
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                volumeMounts:
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
              cloud.google.com/gke-tpu-topology: 2x4
  2. 应用清单:

    envsubst < tpu/ray-cluster.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
    

    envsubst 命令会替换清单中的环境变量。

GKE 会创建一个 RayCluster 自定义资源,其中 workergroup2x4 拓扑中包含 TPU v5e 单主机。

Llama 3.1 70B

完成以下步骤,创建 RayCluster 自定义资源以部署 Llama 3.1 70B 模型:

  1. 检查 ray-cluster.tpu-v6e-singlehost.yaml 清单:

    apiVersion: ray.io/v1
    kind: RayCluster
    metadata:
      name: vllm-tpu
    spec:
      headGroupSpec:
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
              - name: ray-head
                image: $VLLM_IMAGE
                imagePullPolicy: IfNotPresent
                resources:
                  limits:
                    cpu: "2"
                    memory: 8G
                  requests:
                    cpu: "2"
                    memory: 8G
                env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  - containerPort: 8471
                    name: slicebuilder
                  - containerPort: 8081
                    name: mxla
                volumeMounts:
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
      workerGroupSpecs:
      - groupName: tpu-group
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        numOfHosts: 1
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
              - name: ray-worker
                image: $VLLM_IMAGE
                imagePullPolicy: IfNotPresent
                resources:
                  limits:
                    cpu: "100"
                    google.com/tpu: "8"
                    ephemeral-storage: 40G
                    memory: 200G
                  requests:
                    cpu: "100"
                    google.com/tpu: "8"
                    ephemeral-storage: 40G
                    memory: 200G
                env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                volumeMounts:
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
              cloud.google.com/gke-tpu-topology: 2x4
  2. 应用清单:

    envsubst < tpu/ray-cluster.tpu-v6e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
    

    envsubst ������������换清单中的环境变量。

GKE 会创建一个 RayCluster 自定义资源,其中 workergroup2x4 拓扑中包含 TPU v6e 单主机。

连接到 RayCluster 自定义资源

创建 RayCluster 自定义资源后,您可以连接到 RayCluster 资源并开始应用模型。

  1. 验证 GKE 是否已创建 RayCluster 服务:

    kubectl --namespace ${NAMESPACE} get raycluster/vllm-tpu \
        --output wide
    

    输出类似于以下内容:

    NAME       DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   TPUS   STATUS   AGE   HEAD POD IP      HEAD SERVICE IP
    vllm-tpu   1                 1                   ###    ###G     0      8      ready    ###   ###.###.###.###  ###.###.###.###
    

    等待 STATUS 变为 ready,并且 HEAD POD IPHEAD SERVICE IP 列具有 IP 地址。

  2. 建立与 Ray 头节点的 port-forwarding 会话:

    pkill -f "kubectl .* port-forward .* 8265:8265"
    pkill -f "kubectl .* port-forward .* 10001:10001"
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8265:8265 2>&1 >/dev/null &
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 10001:10001 2>&1 >/dev/null &
    
  3. 验证 Ray 客户端可以连接到远程 RayCluster 自定义资源:

    docker run --net=host -it ${VLLM_IMAGE} \
    ray list nodes --address http://localhost:8265
    

    输出类似于以下内容:

    ======== List: YYYY-MM-DD HH:MM:SS.NNNNNN ========
    Stats:
    ------------------------------
    Total: 2
    
    Table:
    ------------------------------
        NODE_ID    NODE_IP          IS_HEAD_NODE  STATE    STATE_MESSAGE    NODE_NAME          RESOURCES_TOTAL                   LABELS
    0  XXXXXXXXXX  ###.###.###.###  True          ALIVE                     ###.###.###.###    CPU: 2.0                          ray.io/node_id: XXXXXXXXXX
                                                                                               memory: #.### GiB
                                                                                               node:###.###.###.###: 1.0
                                                                                               node:__internal_head__: 1.0
                                                                                               object_store_memory: #.### GiB
    1  XXXXXXXXXX  ###.###.###.###  False         ALIVE                     ###.###.###.###    CPU: 100.0                       ray.io/node_id: XXXXXXXXXX
                                                                                               TPU: 8.0
                                                                                               TPU-v#e-8-head: 1.0
                                                                                               accelerator_type:TPU-V#E: 1.0
                                                                                               memory: ###.### GiB
                                                                                               node:###.###.###.###: 1.0
                                                                                               object_store_memory: ##.### GiB
                                                                                               tpu-group-0: 1.0
    

使用 vLLM 部署模型

如需使用 vLLM 部署特定模型,请按照以下说明操作。

Llama-3-8B-Instruct

docker run \
    --env MODEL_ID=${MODEL_ID} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"MODEL_ID": "meta-llama/Meta-Llama-3-8B-Instruct"}}'

Mistral-7B

docker run \
    --env MODEL_ID=${MODEL_ID} \
    --env TOKENIZER_MODE=${TOKENIZER_MODE} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.3", "TOKENIZER_MODE": "mistral"}}'

Llama 3.1 70B

docker run \
    --env MAX_MODEL_LEN=${MAX_MODEL_LEN} \
    --env MODEL_ID=${MODEL_ID} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"MAX_MODEL_LEN": "8192", "MODEL_ID": "meta-llama/Meta-Llama-3.1-70B"}}'

查看 Ray 信息中心

您可以通过 Ray 信息中心查看 Ray Serve 部署和相关日志。

  1. 点击 Cloud Shell 任务栏右上角的 “网页预览”图标 网页预览按钮。
  2. 点击更改端口,然后将端口号设置为 8265
  3. 点击更改并预览
  4. 在 Ray 信息中心内,点击 Serve 标签页。

在 Serve 部署的状态变为 HEALTHY 后,模型便可开始处理输入。

应用模型

本指南重点介绍支持文本生成的模型,这是一种可根据提示创建文本内容的方法。

Llama-3-8B-Instruct

  1. 设置到服务器的端口转发:

    pkill -f "kubectl .* port-forward .* 8000:8000"
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null &
    
  2. 向 Serve 端点发送提示:

    curl -X POST http://localhost:8000/v1/generate -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'
    

Mistral-7B

  1. 设置到服务器的端口转发:

    pkill -f "kubectl .* port-forward .* 8000:8000"
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null &
    
  2. 向 Serve 端点发送提示:

    curl -X POST http://localhost:8000/v1/generate -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'
    

Llama 3.1 70B

  1. 设置到服务器的端口转发:

    pkill -f "kubectl .* port-forward .* 8000:8000"
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null &
    
  2. 向 Serve 端点发送提示:

    curl -X POST http://localhost:8000/v1/generate -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'
    

其他配置

您可以选择性地配置 Ray Serve 框架支持的以下模型部署资源和方法:

部署 RayService

您可以使用 RayService 自定义资源部署本教程中的相同模型。

  1. 删除您在本教程中创建的 RayCluster 自定义资源:

    kubectl --namespace ${NAMESPACE} delete raycluster/vllm-tpu
    
  2. 创建 RayService 自定义资源以部署模型:

    Llama-3-8B-Instruct

    1. 检查 ray-service.tpu-v5e-singlehost.yaml 清单:

      apiVersion: ray.io/v1
      kind: RayService
      metadata:
        name: vllm-tpu
      spec:
        serveConfigV2: |
          applications:
            - name: llm
              import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model
              deployments:
              - name: VLLMDeployment
                num_replicas: 1
              runtime_env:
                working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
                env_vars:
                  MODEL_ID: "$MODEL_ID"
                  MAX_MODEL_LEN: "$MAX_MODEL_LEN"
                  DTYPE: "$DTYPE"
                  TOKENIZER_MODE: "$TOKENIZER_MODE"
                  TPU_CHIPS: "8"
        rayClusterConfig:
          headGroupSpec:
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-head
                  image: $VLLM_IMAGE
                  imagePullPolicy: IfNotPresent
                  ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  resources:
                    limits:
                      cpu: "2"
                      memory: 8G
                    requests:
                      cpu: "2"
                      memory: 8G
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          workerGroupSpecs:
          - groupName: tpu-group
            replicas: 1
            minReplicas: 1
            maxReplicas: 1
            numOfHosts: 1
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                  - name: ray-worker
                    image: $VLLM_IMAGE
                    imagePullPolicy: IfNotPresent
                    resources:
                      limits:
                        cpu: "100"
                        google.com/tpu: "8"
                        ephemeral-storage: 40G
                        memory: 200G
                      requests:
                        cpu: "100"
                        google.com/tpu: "8"
                        ephemeral-storage: 40G
                        memory: 200G
                    env:
                      - name: JAX_PLATFORMS
                        value: "tpu"
                      - name: HUGGING_FACE_HUB_TOKEN
                        valueFrom:
                          secretKeyRef:
                            name: hf-secret
                            key: hf_api_token
                      - name: VLLM_XLA_CACHE_PATH
                        value: "/data"
                    volumeMounts:
                    - name: gcs-fuse-csi-ephemeral
                      mountPath: /data
                    - name: dshm
                      mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
                nodeSelector:
                  cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
                  cloud.google.com/gke-tpu-topology: 2x4
    2. 应用清单:

      envsubst < tpu/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
      

      envsubst 命令会替换清单中的环境变量。

      GKE 会创建一个 RayService,其中 workergroup2x4 拓扑中包含 TPU v5e 单主机。

    Mistral-7B

    1. 检查 ray-service.tpu-v5e-singlehost.yaml 清单:

      apiVersion: ray.io/v1
      kind: RayService
      metadata:
        name: vllm-tpu
      spec:
        serveConfigV2: |
          applications:
            - name: llm
              import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model
              deployments:
              - name: VLLMDeployment
                num_replicas: 1
              runtime_env:
                working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
                env_vars:
                  MODEL_ID: "$MODEL_ID"
                  MAX_MODEL_LEN: "$MAX_MODEL_LEN"
                  DTYPE: "$DTYPE"
                  TOKENIZER_MODE: "$TOKENIZER_MODE"
                  TPU_CHIPS: "8"
        rayClusterConfig:
          headGroupSpec:
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-head
                  image: $VLLM_IMAGE
                  imagePullPolicy: IfNotPresent
                  ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  resources:
                    limits:
                      cpu: "2"
                      memory: 8G
                    requests:
                      cpu: "2"
                      memory: 8G
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          workerGroupSpecs:
          - groupName: tpu-group
            replicas: 1
            minReplicas: 1
            maxReplicas: 1
            numOfHosts: 1
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                  - name: ray-worker
                    image: $VLLM_IMAGE
                    imagePullPolicy: IfNotPresent
                    resources:
                      limits:
                        cpu: "100"
                        google.com/tpu: "8"
                        ephemeral-storage: 40G
                        memory: 200G
                      requests:
                        cpu: "100"
                        google.com/tpu: "8"
                        ephemeral-storage: 40G
                        memory: 200G
                    env:
                      - name: JAX_PLATFORMS
                        value: "tpu"
                      - name: HUGGING_FACE_HUB_TOKEN
                        valueFrom:
                          secretKeyRef:
                            name: hf-secret
                            key: hf_api_token
                      - name: VLLM_XLA_CACHE_PATH
                        value: "/data"
                    volumeMounts:
                    - name: gcs-fuse-csi-ephemeral
                      mountPath: /data
                    - name: dshm
                      mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
                nodeSelector:
                  cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
                  cloud.google.com/gke-tpu-topology: 2x4
    2. 应用清单:

      envsubst < tpu/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
      

      envsubst 命令会替换清单中的环境变量。

      GKE 会创建一个 RayService,其中 workergroup2x4 拓扑中包含 TPU v5e 单主机。

    Llama 3.1 70B

    1. 检查 ray-service.tpu-v6e-singlehost.yaml 清单:

      apiVersion: ray.io/v1
      kind: RayService
      metadata:
        name: vllm-tpu
      spec:
        serveConfigV2: |
          applications:
            - name: llm
              import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model
              deployments:
              - name: VLLMDeployment
                num_replicas: 1
              runtime_env:
                working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
                env_vars:
                  MODEL_ID: "$MODEL_ID"
                  MAX_MODEL_LEN: "$MAX_MODEL_LEN"
                  DTYPE: "$DTYPE"
                  TOKENIZER_MODE: "$TOKENIZER_MODE"
                  TPU_CHIPS: "8"
        rayClusterConfig:
          headGroupSpec:
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-head
                  image: $VLLM_IMAGE
                  imagePullPolicy: IfNotPresent
                  ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  resources:
                    limits:
                      cpu: "2"
                      memory: 8G
                    requests:
                      cpu: "2"
                      memory: 8G
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          workerGroupSpecs:
          - groupName: tpu-group
            replicas: 1
            minReplicas: 1
            maxReplicas: 1
            numOfHosts: 1
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                  - name: ray-worker
                    image: $VLLM_IMAGE
                    imagePullPolicy: IfNotPresent
                    resources:
                      limits:
                        cpu: "100"
                        google.com/tpu: "8"
                        ephemeral-storage: 40G
                        memory: 200G
                      requests:
                        cpu: "100"
                        google.com/tpu: "8"
                        ephemeral-storage: 40G
                        memory: 200G
                    env:
                      - name: JAX_PLATFORMS
                        value: "tpu"
                      - name: HUGGING_FACE_HUB_TOKEN
                        valueFrom:
                          secretKeyRef:
                            name: hf-secret
                            key: hf_api_token
                      - name: VLLM_XLA_CACHE_PATH
                        value: "/data"
                    volumeMounts:
                    - name: gcs-fuse-csi-ephemeral
                      mountPath: /data
                    - name: dshm
                      mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
                nodeSelector:
                  cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
                  cloud.google.com/gke-tpu-topology: 2x4
    2. 应用清单:

      envsubst < tpu/ray-service.tpu-v6e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
      

      envsubst 命令会替换清单中的环境变量。

    GKE 会创建一个 RayCluster 自定义资源,其中部署了 Ray Serve 应用并创建了后续的 RayService 自定义资源。

  3. 验证 RayService 资源的状态:

    kubectl --namespace ${NAMESPACE} get rayservices/vllm-tpu
    

    等待 Service 状态变为 Running

    NAME       SERVICE STATUS   NUM SERVE ENDPOINTS
    vllm-tpu   Running          1
    
  4. 检索 RayCluster 头服务的名称:

    SERVICE_NAME=$(kubectl --namespace=${NAMESPACE} get rayservices/vllm-tpu \
        --template={{.status.activeServiceStatus.rayClusterStatus.head.serviceName}})
    
  5. 建立与 Ray 头节点的 port-forwarding 会话以查看 Ray 信息中心:

    pkill -f "kubectl .* port-forward .* 8265:8265"
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8265:8265 2>&1 >/dev/null &
    
  6. 查看 Ray 信息中心

  7. 应用模型

  8. 清理 RayService 资源:

    kubectl --namespace ${NAMESPACE} delete rayservice/vllm-tpu
    

通过模型组合实现多个模型的组合

模型组合是一种可将多个模型组合到单个应用中的方法。

在本部分中,您将使用 GKE 集群将 Llama 3 8B IT 和 Gemma 7B IT 这两个模型组合到单个应用中:

  • 第一个模型是回答提示中问题的智能助理模型。
  • 第二个模型是摘要器模型。助理型模型的输出会链接到摘要器模型的输入中。最终结果是助理型模型所提供回答的摘要版本。
  1. 完成以下步骤,获取对 Gemma 模型的访问权限:

    1. 登录 Kaggle 平台,���署许可同意协议,并获取 Kaggle API 令牌。在本教程中,您会将 Kubernetes Secret 用��� Kaggle ������。
    2. 访问 Kaggle.com 上的模型同意页面
    3. 如果您尚未登录 Kaggle,请进行登录。
    4. 点击申请访问权限
    5. Choose Account for Consent(选择进行同意的账号)部分中,选择 Verify via Kaggle Account(通过 Kaggle 账号验证),以使用您的 Kaggle 账号进行同意。
    6. 接受模型条款及条件
  2. 设置环境:

    export ASSIST_MODEL_ID=meta-llama/Meta-Llama-3-8B-Instruct
    export SUMMARIZER_MODEL_ID=google/gemma-7b-it
    
  3. 对于 Standard 集群,创建一个额外的单主机 TPU 切片节点池:

    gcloud container node-pools create tpu-2 \
      --location=${COMPUTE_ZONE} \
      --cluster=${CLUSTER_NAME} \
      --machine-type=MACHINE_TYPE \
      --num-nodes=1
    

    MACHINE_TYPE 替换为以下某个机器类型:

    • ct5lp-hightpu-8t,用于预配 TPU v5e。
    • ct6e-standard-8t,用于预配 TPU v6e。

    Autopilot 集群会自动预配所需的节点。

  4. 根据您要使用的 TPU 版本部署 RayService 资源:

    TPU v5e

    1. 检查 ray-service.tpu-v5e-singlehost.yaml 清单:

      apiVersion: ray.io/v1
      kind: RayService
      metadata:
        name: vllm-tpu
      spec:
        serveConfigV2: |
          applications:
          - name: llm
            route_prefix: /
            import_path:  ai-ml.gke-ray.rayserve.llm.model-composition.serve_tpu:multi_model
            deployments:
            - name: MultiModelDeployment
              num_replicas: 1
            runtime_env:
              working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
              env_vars:
                ASSIST_MODEL_ID: "$ASSIST_MODEL_ID"
                SUMMARIZER_MODEL_ID: "$SUMMARIZER_MODEL_ID"
                TPU_CHIPS: "16"
                TPU_HEADS: "2"
        rayClusterConfig:
          headGroupSpec:
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-head
                  image: $VLLM_IMAGE
                  resources:
                    limits:
                      cpu: "2"
                      memory: 8G
                    requests:
                      cpu: "2"
                      memory: 8G
                  ports:
                  - containerPort: 6379
                    name: gcs-server
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  env:
                    - name: HUGGING_FACE_HUB_TOKEN
                      valueFrom:
                        secretKeyRef:
                          name: hf-secret
                          key: hf_api_token
                    - name: VLLM_XLA_CACHE_PATH
                      value: "/data"
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          workerGroupSpecs:
          - replicas: 2
            minReplicas: 1
            maxReplicas: 2
            numOfHosts: 1
            groupName: tpu-group
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: llm
                  image: $VLLM_IMAGE
                  env:
                    - name: HUGGING_FACE_HUB_TOKEN
                      valueFrom:
                        secretKeyRef:
                          name: hf-secret
                          key: hf_api_token
                    - name: VLLM_XLA_CACHE_PATH
                      value: "/data"
                  resources:
                    limits:
                      cpu: "100"
                      google.com/tpu: "8"
                      ephemeral-storage: 40G
                      memory: 200G
                    requests:
                      cpu: "100"
                      google.com/tpu: "8"
                      ephemeral-storage: 40G
                      memory: 200G
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
                nodeSelector:
                  cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
                  cloud.google.com/gke-tpu-topology: 2x4
    2. 应用清单:

      envsubst < model-composition/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
      

    TPU v6e

    1. 检查 ray-service.tpu-v6e-singlehost.yaml 清单:

      apiVersion: ray.io/v1
      kind: RayService
      metadata:
        name: vllm-tpu
      spec:
        serveConfigV2: |
          applications:
          - name: llm
            route_prefix: /
            import_path:  ai-ml.gke-ray.rayserve.llm.model-composition.serve_tpu:multi_model
            deployments:
            - name: MultiModelDeployment
              num_replicas: 1
            runtime_env:
              working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
              env_vars:
                ASSIST_MODEL_ID: "$ASSIST_MODEL_ID"
                SUMMARIZER_MODEL_ID: "$SUMMARIZER_MODEL_ID"
                TPU_CHIPS: "16"
                TPU_HEADS: "2"
        rayClusterConfig:
          headGroupSpec:
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-head
                  image: $VLLM_IMAGE
                  resources:
                    limits:
                      cpu: "2"
                      memory: 8G
                    requests:
                      cpu: "2"
                      memory: 8G
                  ports:
                  - containerPort: 6379
                    name: gcs-server
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  env:
                    - name: HUGGING_FACE_HUB_TOKEN
                      valueFrom:
                        secretKeyRef:
                          name: hf-secret
                          key: hf_api_token
                    - name: VLLM_XLA_CACHE_PATH
                      value: "/data"
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          workerGroupSpecs:
          - replicas: 2
            minReplicas: 1
            maxReplicas: 2
            numOfHosts: 1
            groupName: tpu-group
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: llm
                  image: $VLLM_IMAGE
                  env:
                    - name: HUGGING_FACE_HUB_TOKEN
                      valueFrom:
                        secretKeyRef:
                          name: hf-secret
                          key: hf_api_token
                    - name: VLLM_XLA_CACHE_PATH
                      value: "/data"
                  resources:
                    limits:
                      cpu: "100"
                      google.com/tpu: "8"
                      ephemeral-storage: 40G
                      memory: 200G
                    requests:
                      cpu: "100"
                      google.com/tpu: "8"
                      ephemeral-storage: 40G
                      memory: 200G
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
                nodeSelector:
                  cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
                  cloud.google.com/gke-tpu-topology: 2x4
    2. 应用清单:

      envsubst < model-composition/ray-service.tpu-v6e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
      
  5. 等待 RayService 资源的状态变为 Running

    kubectl --namespace ${NAMESPACE} get rayservice/vllm-tpu
    

    输出类似于以下内容:

    NAME       SERVICE STATUS   NUM SERVE ENDPOINTS
    vllm-tpu   Running          2
    

    在此输出中,RUNNING 状态表示 RayService 资源已准备就绪。

  6. 确认 GKE 为 Ray Serve 应用创建了该 Service:

    kubectl --namespace ${NAMESPACE} get service/vllm-tpu-serve-svc
    

    输出类似于以下内容:

    NAME                 TYPE        CLUSTER-IP        EXTERNAL-IP   PORT(S)    AGE
    vllm-tpu-serve-svc   ClusterIP   ###.###.###.###   <none>        8000/TCP   ###
    
  7. 建立与 Ray 头节点的 port-forwarding 会话:

    pkill -f "kubectl .* port-forward .* 8265:8265"
    pkill -f "kubectl .* port-forward .* 8000:8000"
    kubectl --namespace ${NAMESPACE} port-forward service/vllm-tpu-serve-svc 8265:8265 2>&1 >/dev/null &
    kubectl --namespace ${NAMESPACE} port-forward service/vllm-tpu-serve-svc 8000:8000 2>&1 >/dev/null &
    
  8. 向模型发送请求:

    curl -X POST http://localhost:8000/ -H "Content-Type: application/json" -d '{"prompt": "What is the most popular programming language for machine learning and why?", "max_tokens": 1000}'
    

    输出类似于以下内容:

      {"text": [" used in various data science projects, including building machine learning models, preprocessing data, and visualizing results.\n\nSure, here is a single sentence summarizing the text:\n\nPython is the most popular programming language for machine learning and is widely used in data science projects, encompassing model building, data preprocessing, and visualization."]}
    

构建和部署 TPU 映像

本教程使用来自 vLLM 的托管 TPU 映像。vLLM 提供一个 Dockerfile.tpu 映像,该映像会在包含 TPU 依赖项的所需 PyTorch XLA 映像的基础上构建 vLLM。不过,您也可以构建和部署自己的 TPU 映像,以便对 Docker 映像的内容进行更精细的控制。

  1. 创建一个 Docker 仓库以存储本指南中的容器映像:

    gcloud artifacts repositories create vllm-tpu --repository-format=docker --location=${COMPUTE_REGION} && \
    gcloud auth configure-docker ${COMPUTE_REGION}-docker.pkg.dev
    
  2. 克隆 vLLM 仓库:

    git clone https://github.com/vllm-project/vllm.git
    cd vllm
    
  3. 构建映像:

    docker build -f ./docker/Dockerfile.tpu . -t vllm-tpu
    
  4. 使用您的 Artifact Registry 名称标记 TPU 映像:

    export VLLM_IMAGE=${COMPUTE_REGION}-docker.pkg.dev/${PROJECT_ID}/vllm-tpu/vllm-tpu:TAG
    docker tag vllm-tpu ${VLLM_IMAGE}
    

    TAG 替换为您要定义的标记的名称。如果您未指定标记,Docker 将应用默认的最新标记。

  5. 将该映像推送到 Artifact Registry。

    docker push ${VLLM_IMAGE}
    

逐个删除资源

如果您使用的是现有项目,并且不想将其删除,则可逐个删除不再需要的资源。

  1. 删除 RayCluster 自定义资源:

    kubectl --namespace ${NAMESPACE} delete rayclusters vllm-tpu
    
  2. 删除 Cloud Storage 存储桶:

    gcloud storage rm -r gs://${GSBUCKET}
    
  3. 删除 Artifact Registry 代码库:

    gcloud artifacts repositories delete vllm-tpu \
        --location=${COMPUTE_REGION}
    
  4. 删除集群:

    gcloud container clusters delete ${CLUSTER_NAME} \
        --location=LOCATION
    

    LOCATION 替换为以下某个环境变量:

    • 对于 Autopilot 集群,使用 COMPUTE_REGION
    • 对于 Standard 集群,使用 COMPUTE_ZONE

删除项目

如果您在新的 Google Cloud 项目中部署了教程,但不再需要该项目,请完成以下步骤来将其删除:

  1. 在 Google Cloud 控制台中,前往管理资源页面。

    转到“管理资源”

  2. 在项目列表中,选择要删除的项目,然后点击删除
  3. 在对话框中输入项目 ID,然后点击关闭以删除项目。

后续步骤