
NVIDIA NIM is a great way to run AI inference workloads in containers. I deploy primarily to Kubernetes, so I wanted to dig into deploying NIM using the Kubernetes NIM Operator and using GPUs in Google Cloud. I actually started by going to ChatGPT and asking it for a step-by-step guide for doing this on GKE. The results it gave seemed impressive, until I started following the steps. ChatGPT is good at a lot of things, but in this case it gave me complete and utter nonsense. So I thought I’d write up a guide for others who want to deploy NIM on GKE using the NIM Operator, with some learnings along the way.

Deploy the GKE Cluster

The first step is to deploy the GKE cluster. I’m going to use the Google Cloud CLI to do this. I will spin up the cluster with CPU nodes first and then add the GPU nodes later. This way I can manage the GPU nodes separately (i.e., scale them down when I’m not using them, since they are expensive!).
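
The commands below assume a couple of environment variables are set. Here’s a minimal example with placeholder values (use whatever cluster name you like, and make sure the zone offers the GPU machine type you’ll add later):

export CLUSTER_NAME=nim-demo-cluster
export ZONE=us-central1-a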

gcloud container clusters create $CLUSTER_NAME \
    --zone $ZONE \
    --node-locations $ZONE \
    --release-channel "regular" \
    --machine-type "n1-standard-8" \
    --image-type "UBUNTU_CONTAINERD" \
    --disk-type "pd-standard" \
    --disk-size "100" \
    --no-enable-intra-node-visibility \
    --metadata disable-legacy-endpoints=true \
    --max-pods-per-node "110" \
    --num-nodes "2" \
    --logging=SYSTEM,WORKLOAD \
    --monitoring=SYSTEM \
    --enable-ip-alias \
    --default-max-pods-per-node "110" \
    --no-enable-master-authorized-networks

Next we deploy the GPU nodes:

gcloud container node-pools create gpu-pool \
    --cluster $CLUSTER_NAME \
    --zone $ZONE \
    --node-locations $ZONE \
    --machine-type "g2-standard-16" \
    --image-type "UBUNTU_CONTAINERD" \
    --node-labels="gke-no-default-nvidia-gpu-device-plugin=true" \
    --metadata disable-legacy-endpoints=true \
    --disk-size "100" \
    --num-nodes "1" \
    --tags=nvidia-ingress-all

Things to Note

  • I picked the g2-standard-16 machine type because it has a single GPU (I didn’t want the demo to get too expensive) and 24GB of GPU memory (the model I’ll run has 8B parameters at float16 precision, which at roughly 2 bytes per parameter needs about 16GB of GPU memory). I also need a decent amount of main memory on the VM, and this machine type has 64GB. The model will struggle to start up without enough memory.
  • When deploying the GPU nodes, pick a GPU accelerator that can actually run the LLM you want. For example, a lot of guides install the Tesla T4 GPU, and that does not work well for the model I wanted to use (i.e., Llama 3.1 8B).
  • We disable GKE’s default NVIDIA GPU device plugin by setting the gke-no-default-nvidia-gpu-device-plugin node label. We do this because we want the NVIDIA GPU Operator to manage the GPU drivers and the underlying container runtime instead (which we’ll set up next).
  • For the g2-standard-16 machine type, which runs an NVIDIA L4 GPU, we can expect to spend around $600 a month on the GPU node. We could use a more powerful GPU (an A100), but those are a lot more expensive (around $2200 a month).
  • Check the GPU machine types that might be appropriate for your workloads; the gcloud command after this list shows which accelerators are available in your zone.
  • Use the GCP Price Calculator to estimate your costs for different GPU machine types
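
For example, you can list the GPU accelerator types available in your zone with gcloud (a quick check; this uses the ZONE variable from earlier):

gcloud compute accelerator-types list --filter="zone:$ZONE"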

Deploy the NVIDIA GPU Operator

I wanted to use the NVIDIA GPU Operator to manage the GPU drivers and underlying container runtime. The NVIDIA GPU Operator makes it easier to configure and manage multiple software components on the nodes that support GPU workloads. I use Helm for this deployment.

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

When installing the GPU Operator, be prepared for it to take a little while (a few minutes; if it’s taking longer than 5-10 minutes, something may have gone wrong).

For the operator to install correctly (it installs pods that use the system-node-critical priority class, which GKE restricts outside of kube-system), we need to create a ResourceQuota resource. It should look similar to this:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gcp-critical-pods
  namespace: gpu-operator
spec:
  hard:
    pods: 100
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
      - system-node-critical
      - system-cluster-critical

If we save that into a file called gke-resourcequota.yaml, we can apply it to the GKE cluster:

export GPU_OPERATOR_VERSION=v24.9.1

kubectl create namespace gpu-operator

kubectl apply -f ./gke-resourcequota.yaml

helm upgrade --install gpu-operator nvidia/gpu-operator --wait  \
    -n gpu-operator \
    --version=$GPU_OPERATOR_VERSION \
    -f ./nim/gpu-operator-values.yaml \
    --timeout 2m

As I mentioned, this may take a few minutes to install.

Verify the installation

We should verify the device drivers were installed correctly and that the GPU nodes are correctly recognized.

Check all GPU operator pods are running

kubectl get pods -n gpu-operator

Debugging events
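
If any of the operator pods get stuck, the events in the namespace are usually the quickest clue; for example:

kubectl get events -n gpu-operator --sort-by='.lastTimestamp'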

Verify GPU nodes are properly recognized

kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

Verify with nvidia-smi

POD=$(kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it $POD -n gpu-operator -- nvidia-smi

The output should look like this:

$ kubectl exec -it $POD -n gpu-operator -- nvidia-smi
Wed Feb  5 21:13:19 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.08             Driver Version: 550.127.08     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      On  |   00000000:00:03.0 Off |                    0 |
| N/A   73C    P0             41W /   72W |      23MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

Install the NIM Operator

The NIM Operator is a Kubernetes operator that manages the deployment and lifecycle of NVIDIA NIM. It’s a great way to deploy NIM in a Kubernetes cluster. You can take advantage of things like automatically downloading and caching NIM models, running them in containers, and scaling them as needed. We will also use Helm for this deployment.

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

You’ll need to create Kubernetes secrets to access the NGC container registry, which hosts the model catalog. In this example I create a secret for the NGC Docker API key and a secret for the NGC API key. They are slightly different. The first one is for accessing the container registry (i.e., as a Docker registry). The second is a generic API key for accessing the NGC service APIs. It was a little confusing at first, but you get to the API keys by clicking “Setup” under your profile in the top right corner of the NGC website.

Create the keys using the kubectl create secret command. Note the username for the docker-username is $oauthtoken.
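
The commands below assume both keys are exported as environment variables, for example (placeholder values):

export NGC_DOCKER_API_KEY=<your NGC registry API key>
export NGC_API_KEY=<your NGC API key>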

kubectl create secret docker-registry ngc-secret \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' \
    --docker-password="$NGC_DOCKER_API_KEY" \
    --docker-email='your.email@solo.io' 

kubectl create secret generic ngc-api-secret \
    --from-literal=NGC_API_KEY="$NGC_API_KEY" 

Install the operator

Now we can install the operator. I’m using version v1.0.1 of the operator.

export NIM_OPERATOR_VERSION=v1.0.1
helm upgrade --install nim-operator nvidia/k8s-nim-operator --wait  \
    -n nvidia-nim \
    --create-namespace \
    --version=$NIM_OPERATOR_VERSION \
    --set global.registry.secretName=ngc-secret

Refer to the docs if you have other questions about the installation options.
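
A quick way to confirm the operator is up before moving on:

kubectl get pods -n nvidia-nim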

Deploy a NIM Model

Now we can deploy a NIM inferencing model. I’m going to deploy the Llama 3.1 8B Instruct model. First we create a NIMCache resource, which downloads and caches the model:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-8b-instruct
spec:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.3.3
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        engine: tensorrt_llm
        tensorParallelism: "1"
        qosProfile: "throughput"
        profiles:
        - 193649a2eb95e821309d6023a2cabb31489d3b690a9973c7ab5d1ff58b0aa7eb
        - 7cc8597690a35aba19a3636f35e7f1c7e7dbc005fe88ce9394cad4a4adeed414
  storage:
    pvc:
      create: true
      storageClass: standard-rwo
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce
  resources: {}
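
Save the NIMCache into a file (I’ll assume the name nim-cache.yaml) and apply it. The cache has to finish downloading the model before a NIMService can use it, so keep an eye on its status:

kubectl apply -f nim-cache.yaml
kubectl get nimcaches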

Things to Note:

  • You may need to add the tolerations section to the NIMCache as the Operator fails to do this correctly.
  • Your storage class may not support the “ReadWriteMany” access mode, so check your storage class options. I changed to ReadWriteOnce to work with the default standard-rwo storage class available on the GKE nodes I picked.
  • You may need to specify “profiles”, which are the GPU profiles that NVIDIA NIM supports. Which profiles should you pick? The best way is to run a test on your nodes to see. I’ll show that next.

Choosing the right NIM model profiles

NIM model profiles configure the model engine to take advantage of specific features in the hardware and software of the machine to improve performance. You can run this simple job to determine what profiles are supported on your nodes:

apiVersion: batch/v1
kind: Job
metadata:
  name: nim-profile-job
spec:
  template:
    metadata:
      name: nim-profile-pod
    spec:
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Equal"
          value: "present"
          effect: "NoSchedule"
      containers:
        - name: nim-profile
          image: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.3.3
          args: ["list-model-profiles"]
          resources:
            limits:
              nvidia.com/gpu: 1  # Request 1 GPU
          env:
            - name: NIM_CACHE_PATH
              value: /tmp
            - name: NGC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: ngc-api-secret
                  key: NGC_API_KEY
      imagePullSecrets:
        - name: ngc-secret
      restartPolicy: Never
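
Assuming you save the Job into a file (hypothetical name nim-profile-job.yaml), apply it and read its logs once it completes:

kubectl apply -f nim-profile-job.yaml
kubectl wait --for=condition=complete job/nim-profile-job --timeout=10m
kubectl logs job/nim-profile-job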

This Job will output something similar to this:

MODEL PROFILES
- Compatible with system and runnable:
  - 193649a2eb95e821309d6023a2cabb31489d3b690a9973c7ab5d1ff58b0aa7eb (vllm-bf16-tp1)
  - With LoRA support:
    - 3ad27031e57506f12b47d3cd74d045fe05977f0bb92e2f8c80a3240748482578 (vllm-bf16-tp1-lora)
- Compilable to TRT-LLM using just-in-time compilation of HF models to TRTLLM engines:
  - 7cc8597690a35aba19a3636f35e7f1c7e7dbc005fe88ce9394cad4a4adeed414 (tensorrt_llm-trtllm_buildable-bf16-tp1)
  - With LoRA support:
    - df4113435195daa68b56c83741d66b422c463c556fc1669f39f923427c1c57c5 (tensorrt_llm-trtllm_buildable-bf16-tp1-lora)
- Incompatible with system:
  - fa55c825306dfc09c9d0e7ef423e897d91fe8334a3da87d284f45f45cbd4c1b0 (tensorrt_llm-h100-fp8-tp2-pp1-latency)
  - 33e38db03bb29d47b6c6e604c5e3686f224a1f88e97cd3e0e18cf83e71a949fb (tensorrt_llm-h100_nvl-fp8-tp2-pp1-latency)
  - f8b5f71dd66c36c70deac7927cbd98b1c4f78caf1abf01f768be7118e1daa278 (tensorrt_llm-h100-fp8-tp1-pp1-throughput)
  ...

From this output, you can pick the model profiles to cache in your NIMCache resource. The NIMCache will download the model and profiles so it is ready when a NIMService is deployed that refers to the cached model.

Deploy the NIMService

Our last step is to deploy the NIMService resource. This will instantiate the NIM model on Kubernetes and make it available to serve inference requests.

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
    tag: 1.3.3
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama3-8b-instruct
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
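
Save the NIMService into a file (I’ll assume nim-service.yaml), apply it, and wait for the pod to come up; it can take a few minutes to pull the image and load the model:

kubectl apply -f nim-service.yaml
kubectl get nimservices
kubectl get pods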

Now, if you have kubectl access to the GKE cluster, you can test the NIMService by running a throwaway curl pod:

kubectl run curl-test --image=curlimages/curl:latest --rm -i --restart=Never -- curl "http://meta-llama3-8b-instruct.default:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

You should see output like this:

{"id":"chat-ed3d7d2a37de47e4aa62116a6af64b13","object":"chat.completion","created":1739155307,"model":"meta/llama-3.1-8b-instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! How can I assist you today?"},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":12,"total_tokens":21,"completion_tokens":9},"prompt_logprobs":null}pod "curl-test" deleted

Wrapping Up

Running your own LLM inferencing service on Kubernetes is a great way to get started with AI. If you’re looking to run NIM microservices on Kubernetes, and especially GKE, I hope this guide helps you out. If you have any questions, please reach out!