A Unified Approach with vLLM, Ray Serve, KubeRay & Kubernetes
Based on CNCF presentation by Lily (Xiaoxuan) Liu
AutoscaleWorks — Saddle River Consulting LLC
Deploying large language models in production is fundamentally different from training. Serving demands low latency, high throughput, and cost efficiency — all at massive scale.
Models like Llama 3.1 405B exceed single-node GPU memory, requiring distributed inference across multiple nodes
Autoscaling GPU workloads requires orchestration-aware infrastructure that understands model topology
Managing model lifecycle, rolling updates, health checks, and multi-model serving without downtime
A high-throughput, memory-efficient serving engine for LLMs, purpose-built for production inference.
from vllm import LLM

# Create an LLM from a HuggingFace model
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B"
)

# Generate text from a batch of prompts
prompts = [
    "Hello, my name is",
    "The capital of France is",
]
outputs = llm.generate(prompts)

# Each RequestOutput carries the generated completions
for output in outputs:
    print(output.outputs[0].text)
vLLM supports nearly all popular open-source LLMs and VLMs out of the box.
Official launch partner — 8B, 70B, 405B parameter variants. Day-one optimized support.
MoE architecture with Wide-EP support for distributed expert parallelism across nodes.
First-class support for all major model families from HuggingFace Hub.
Mistral's multimodal model — contributed natively by model creators to vLLM.
Vision-language model with dynamic resolution support, contributed by Alibaba.
Growing ecosystem of multimodal models with optimized attention and image encoders.
vLLM leverages PyTorch as the hardware abstraction layer, enabling deployment across any accelerator.
Key Insight: PyTorch serves as a universal abstraction layer. Write model code once and deploy on any hardware backend — from NVIDIA H100s to Google TPUv5e to AMD MI300X.
A scalable, framework-agnostic model serving library built on the Ray distributed runtime.
KubeRay bridges Ray's distributed compute with Kubernetes-native orchestration.
The orchestration layer that makes GPU-accelerated LLM serving reliable and scalable.
NVIDIA device plugin, GPU sharing (MIG, time-slicing), topology-aware scheduling for NVLink/NVSwitch
Dedicated GPU node pools (L4, A100, H100), spot/preemptible nodes for cost optimization, cluster autoscaler
RDMA/InfiniBand for multi-node tensor parallelism, service mesh for traffic management, ingress for API endpoints
Persistent volumes for model weights, HuggingFace cache mounts, shared storage across Ray workers
RBAC, network policies, secrets management for HF tokens and API keys, pod security standards
Prometheus metrics from vLLM + Ray, Grafana dashboards, GPU utilization monitoring, cost tracking
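The storage point above can be sketched as a Kubernetes manifest. This is an illustrative fragment, not a complete deployment; the claim name `hf-cache`, the 500Gi size, and the ReadWriteMany access mode are assumptions you would adapt to your cluster's storage class.

```yaml
# Illustrative: a shared PersistentVolumeClaim for the HuggingFace
# cache, so Ray workers reuse downloaded model weights instead of
# each pulling them from the Hub.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hf-cache          # hypothetical name
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 500Gi      # size to fit your model set
```

Each worker pod would then mount this claim at the default HuggingFace cache path (`/root/.cache/huggingface`) via `volumeMounts`.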
Process the entire input prompt in parallel. This is compute-bound and benefits from high FLOPS GPUs. Prefix caching avoids recomputation for shared system prompts.
Generate tokens one at a time autoregressively. This is memory-bandwidth-bound. Continuous batching interleaves multiple requests to keep GPUs saturated.
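The effect of continuous batching can be seen in a toy simulation (a sketch of the scheduling idea, not vLLM's actual scheduler): each step decodes one token for every active request, and a finished request's batch slot is refilled from the queue immediately rather than waiting for the whole batch to drain.

```python
from collections import deque

def continuous_batching(requests, max_batch):
    """Toy continuous-batching simulation.

    requests: list of (request_id, tokens_to_generate) pairs.
    Returns the number of decode steps needed to finish them all.
    """
    queue = deque(requests)   # waiting requests
    active = {}               # request_id -> tokens remaining
    steps = 0
    while queue or active:
        # Admit new requests into any free batch slots.
        while queue and len(active) < max_batch:
            rid, todo = queue.popleft()
            active[rid] = todo
        # One decode step: every active request emits one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]   # slot frees up this same step
        steps += 1
    return steps

# Mixed-length requests: short ones finish early and free slots
# for waiting requests, keeping the batch (and GPU) full.
reqs = [("a", 2), ("b", 8), ("c", 2), ("d", 4)]
print(continuous_batching(reqs, max_batch=2))  # 8 steps
```

With static batching the same workload takes max(2, 8) + max(2, 4) = 12 steps, because each batch waits for its longest request; here 16 tokens finish in 8 steps at batch size 2, i.e. full slot utilization.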
Key vLLM Optimizations: PagedAttention reduces memory waste by up to 55%. Continuous batching improves throughput by 2-4x over static batching. Prefix caching accelerates repeated prompts by 3-10x.
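The memory argument behind PagedAttention comes down to simple arithmetic: because the KV cache is allocated in fixed-size blocks rather than one contiguous max-length buffer per request, at most one partially filled block per sequence is wasted. A sketch, using illustrative Llama-3.1-8B-style dimensions (32 layers, 8 KV heads with GQA, head dim 128, fp16):

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    # One K and one V vector per layer, each kv_heads * head_dim wide.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def blocks_needed(seq_len, block_size=16):
    # PagedAttention allocates the KV cache in fixed-size token blocks,
    # so only the final block of a sequence can be partially unused.
    return -(-seq_len // block_size)  # ceiling division

per_tok = kv_bytes_per_token(32, 8, 128)
print(per_tok)              # 131072 bytes = 128 KiB per token
print(blocks_needed(1000))  # 63 blocks (8 tokens of the last block unused)
```

Contrast this with reserving a contiguous buffer for a 4096-token maximum: a 1000-token request would strand the other 3096 slots, which is exactly the waste block-level paging eliminates.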
| Feature | vLLM | TGI | Triton + TRT-LLM |
|---|---|---|---|
| PagedAttention | ✓ | ✓ | Custom KV Cache |
| Continuous Batching | ✓ | ✓ | ✓ |
| Multi-Node TP/PP | ✓ | Limited | ✓ |
| Open Source | Apache 2.0 | Apache 2.0 | Mixed |
| KubeRay Native | ✓ | ✗ | ✗ |
Deploy Llama 3.1 8B on GKE with KubeRay in a single manifest.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: llama-31-serve
spec:
  serveConfigV2: |
    applications:
    - name: llm
      import_path: ray_serve_llm:app
      deployments:
      - name: VLLMDeployment
        num_replicas: 2
        ray_actor_options:
          num_gpus: 1
  rayClusterConfig:
    headGroupSpec:
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray-ml:2.40.0-gpu
            resources:
              limits:
                nvidia.com/gpu: "1"
            env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: token
Tensor Parallelism — Shard each layer's weight matrices across 2-8 GPUs on a single node, using NVLink for fast inter-GPU communication. Ideal for models up to ~70B parameters.
llm = LLM(
    model="meta-llama/Llama-3.1-70B",
    tensor_parallel_size=4
)
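The back-of-envelope memory check for this configuration (a sketch; ignores activations and KV cache, and treats GB loosely):

```python
def tp_weight_gb_per_gpu(params_billions, dtype_bytes, tp):
    # Each parameter occupies dtype_bytes bytes; tensor parallelism
    # shards every layer's weight matrices evenly across tp GPUs.
    return params_billions * dtype_bytes / tp

# 70B parameters in fp16 (2 bytes) with tensor_parallel_size=4:
print(tp_weight_gb_per_gpu(70, 2, 4))  # 35.0 GB of weights per GPU
```

35 GB of weights per GPU leaves headroom on 80 GB A100/H100 cards for the KV cache and activations, which is why TP=4 on one node is a common choice for 70B-class models.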
Pipeline Parallelism — Distribute model stages across nodes. KubeRay manages the Ray cluster spanning multiple K8s pods with RDMA networking.
llm = LLM(
    model="meta-llama/Llama-3.1-405B",
    tensor_parallel_size=8,
    pipeline_parallel_size=4
)  # 32 GPUs across 4 nodes
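Pipeline parallelism amounts to assigning contiguous runs of decoder layers to pipeline ranks. A sketch of that stage split (the even-split-with-remainder policy is illustrative, not vLLM's exact partitioning; 126 decoder layers matches the published Llama 3.1 405B architecture):

```python
def pipeline_stages(num_layers, pp):
    # Split transformer layers into pp contiguous stages, one per
    # pipeline rank; earlier ranks absorb any remainder layers.
    base, extra = divmod(num_layers, pp)
    stages, start = [], 0
    for rank in range(pp):
        n = base + (1 if rank < extra else 0)
        stages.append(range(start, start + n))
        start += n
    return stages

# 126 decoder layers over pp=4 stages -> 32 + 32 + 31 + 31
for rank, layers in enumerate(pipeline_stages(126, 4)):
    print(rank, layers.start, layers.stop)
```

Each stage then runs TP=8 within its node, and only stage-boundary activations cross the RDMA network between nodes.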
For throughput scaling, deploy multiple independent model replicas behind Ray Serve's load balancer. Each replica uses TP within its node; Ray Serve routes requests across replicas with prefix-cache affinity.
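The affinity idea can be sketched as hash-based routing on a prompt prefix (illustrative only: real prefix-aware routers such as Ray Serve's work on token-level cache state, not a fixed word count, and the function and parameter names here are invented):

```python
import hashlib

def route(prompt, num_replicas, prefix_words=5):
    # Requests that share a leading prefix (e.g. the same system
    # prompt) hash to the same replica, so that replica's prefix
    # cache stays warm for them.
    prefix = " ".join(prompt.split()[:prefix_words])
    digest = hashlib.sha256(prefix.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_replicas

system = "You are a helpful assistant. "
r1 = route(system + "What is vLLM?", num_replicas=4)
r2 = route(system + "Explain KubeRay.", num_replicas=4)
print(r1 == r2)  # same system prompt -> same replica
```

Deterministic hashing keeps the router stateless, at the cost of ignoring replica load; production routers typically blend affinity with load-aware tie-breaking.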
Unified metrics pipeline from vLLM engine through Ray to Kubernetes.
Unified Prometheus Endpoint: Ray integrates with vLLM to expose all engine-level metrics alongside cluster metrics through a single Prometheus scrape target. Grafana dashboards provide real-time visibility across the full stack.
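What a scrape of that endpoint yields is plain Prometheus text exposition format, which is easy to consume. A minimal parser sketch (the sample metric names are illustrative of what a vLLM/Ray endpoint exposes, not a guaranteed schema):

```python
def parse_prom(text):
    """Minimal Prometheus text-format parser: returns
    {metric_name: value}, ignoring HELP/TYPE comments and labels."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_part, value = line.rsplit(" ", 1)
        name = name_part.split("{", 1)[0]  # strip {label="..."} block
        metrics[name] = float(value)
    return metrics

# Sample scrape mixing engine- and cluster-level series
# (names illustrative).
sample = """\
# HELP vllm:num_requests_running Requests currently in model execution.
vllm:num_requests_running{model="llama-3.1-8b"} 12
vllm:gpu_cache_usage_perc{model="llama-3.1-8b"} 0.63
ray_serve_num_ongoing_http_requests 27
"""
m = parse_prom(sample)
print(m["vllm:num_requests_running"])  # 12.0
```

In practice you would point Prometheus at the endpoint and let Grafana do this, but the single-scrape-target design is what makes one dashboard cover engine, Serve, and cluster metrics at once.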
Separate prefill and decode into dedicated GPU pools optimized for each phase's compute profile. vLLM + Ray Serve already supports this pattern.
Continued advancements in purpose-built inference chips (NVIDIA Blackwell, Google TPUv6, AMD MI400) drive serving cost down by 2-3x per generation.
Speculative decoding, chunked prefill, and structured output guarantees continue to improve vLLM's tokens/s per dollar efficiency.
Projects like llm-d bring vLLM-native workload orchestration directly into Kubernetes, with inference-aware scheduling and routing.
Kubernetes and KubeRay enable scalable, adaptable infrastructure to handle the demands of GenAI, simplifying LLM pipelines from development to production.
Ray Serve and vLLM provide a streamlined, unified approach to efficiently manage LLM inference with OpenAI-compatible APIs out of the box.
Continued advancements in specialized hardware and optimized serving engines like vLLM are critical to improving serving efficiency and reducing cost.
By leveraging Kubernetes, KubeRay, and vLLM together, we tackle current infrastructure challenges and make LLMs accessible for widespread, high-performance use.
Self-Hosted LLM Serving • Agentic RAG Systems • GPU Kubernetes Infrastructure
[email protected]
autoscaleworks.ai
Saddle River Consulting LLC