Technical Deep Dive

Efficient LLM Deployment

A Unified Approach with vLLM, Ray Serve, KubeRay & Kubernetes


Based on CNCF presentation by Lily (Xiaoxuan) Liu
AutoscaleWorks — Saddle River Consulting LLC

Context

The LLM Infrastructure Challenge

Deploying large language models in production is fundamentally different from training. Serving demands low latency, high throughput, and cost efficiency — all at massive scale.

1. GPU Memory Constraints

Models like Llama 3.1 405B exceed single-node GPU memory, requiring distributed inference across multiple nodes

2. Scaling Complexity

Autoscaling GPU workloads requires orchestration-aware infrastructure that understands model topology

3. Operational Overhead

Managing model lifecycle, rolling updates, health checks, and multi-model serving without downtime

Architecture

End-to-End LLM Serving Stack

  • vLLM: LLM Inference Engine (PagedAttention, Continuous Batching, Tensor Parallelism)
  • Ray Serve: Model Deployment & Scaling (Request Routing, Load Balancing, Lifecycle Management)
  • KubeRay: Kubernetes Operator for Ray Clusters (Automated Cluster Management)
  • Kubernetes: Container Orchestration (GPU Scheduling, Node Pools, Networking)
Layer 1

vLLM — The Inference Engine

A high-throughput, memory-efficient serving engine for LLMs, purpose-built for production inference.

Core Innovations

  • PagedAttention — Near-zero memory waste for KV cache via OS-inspired virtual memory paging
  • Continuous Batching — Dynamically adds/removes requests from running batch for max GPU utilization
  • Tensor Parallelism — Splits model layers across GPUs for distributed inference
  • Pipeline Parallelism — Distributes model stages across nodes for multi-node serving
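The PagedAttention idea can be illustrated with a toy block allocator (illustrative only, not vLLM's internals): the KV cache is carved into fixed-size blocks, and each sequence maps logical block indices to physical block ids, so memory is claimed on demand rather than reserved up front for the maximum sequence length.

```python
# Toy sketch of PagedAttention-style KV cache paging (not vLLM's real code).
# A pool of fixed-size blocks is handed out on demand; each sequence keeps a
# block table mapping logical block index -> physical block id.

BLOCK_SIZE = 16  # tokens per KV cache block

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        return self.free.pop()

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical -> physical block ids
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=64)
seq = Sequence(allocator)
for _ in range(40):  # a sequence that has generated 40 tokens so far
    seq.append_token()

# 40 tokens at 16 tokens/block -> ceil(40/16) = 3 blocks allocated,
# instead of a worst-case reservation for the maximum sequence length.
print(len(seq.block_table))  # 3
```

Freed blocks can be returned to the pool and reused by other sequences, which is where the "near-zero waste" claim comes from.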

Python API for Offline Inference

from vllm import LLM

# Create an LLM with HuggingFace model
llm = LLM(
  model="meta-llama/Meta-Llama-3.1-8B"
)

# Generate text from prompts
prompts = [
  "Hello, my name is",
  "The capital of France is"
]
outputs = llm.generate(prompts)

# Print the generated continuation for each prompt
for output in outputs:
  print(output.outputs[0].text)
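vLLM also ships an OpenAI-compatible HTTP server, started with "vllm serve <model>". A sketch of the request body a client would POST to its /v1/completions endpoint (the server address and sampling parameters here are assumptions, not taken from the talk):

```python
import json

# Request body for a vLLM server started with:
#   vllm serve meta-llama/Meta-Llama-3.1-8B
# POST this to http://localhost:8000/v1/completions with any HTTP client,
# or point the official openai SDK at that base_url.
body = {
    "model": "meta-llama/Meta-Llama-3.1-8B",
    "prompt": "The capital of France is",
    "max_tokens": 16,
    "temperature": 0.0,
}
payload = json.dumps(body)
print(payload)
```

Because the wire format matches OpenAI's, existing OpenAI SDK code can switch to a self-hosted model by changing only the base URL and model name.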

Broad Model Support

vLLM supports nearly all popular open-source LLMs and VLMs out of the box.

Large Language Models

Meta Llama 3.1

Official launch partner — 8B, 70B, 405B parameter variants. Day-one optimized support.

DeepSeek R1 / V3

MoE architecture with Wide-EP support for distributed expert parallelism across nodes.

Qwen, Mistral, Gemma, Phi

First-class support for all major model families from HuggingFace Hub.

Vision Language Models

Pixtral 12B

Mistral's multimodal model — contributed natively by model creators to vLLM.

Qwen2-VL

Vision-language model with dynamic resolution support, contributed by Alibaba.

LLaVA, InternVL, Molmo

Growing ecosystem of multimodal models with optimized attention and image encoders.

Hardware

PyTorch as the Narrow Waist

vLLM leverages PyTorch as the hardware abstraction layer, enabling deployment across any accelerator.

Models & Utilities
  ↓
PyTorch
  ↓
NVIDIA GPU (CUDA) | AMD GPU (ROCm) | Intel GPU (XPU) | Google TPU | AWS Trainium

Key Insight: PyTorch serves as a universal abstraction layer. Write model code once and deploy on any hardware backend — from NVIDIA H100s to Google TPUv5e to AMD MI300X.

Layer 2

Ray Serve — Serving Framework

A scalable, framework-agnostic model serving library built on the Ray distributed runtime.

Model Deployment & Scaling

  • Autoscaling based on request queue depth and latency targets
  • Multi-model serving on shared GPU pools
  • Rolling updates with zero-downtime deployments
  • Dynamic model loading/unloading
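Conceptually, Ray Serve's autoscaler sizes a deployment from the number of in-flight requests relative to a per-replica target. A toy sketch of that calculation (function and parameter names are illustrative, not Ray's API):

```python
import math

def desired_replicas(ongoing_requests, target_per_replica,
                     min_replicas=1, max_replicas=8):
    """Scale so each replica carries ~target_per_replica requests."""
    want = math.ceil(ongoing_requests / target_per_replica)
    return max(min_replicas, min(max_replicas, want))

# 50 in-flight requests with a target of 8 per replica -> 7 replicas
print(desired_replicas(50, 8))   # 7
# An idle service never scales below min_replicas
print(desired_replicas(0, 8))    # 1
# A burst is capped at max_replicas
print(desired_replicas(1000, 8)) # 8
```

In the real system this target, plus latency signals and min/max bounds, is set per deployment in the Serve config.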

Request Routing & Load Balancing

  • Prefix-cache-affinity routing for higher cache hit rates
  • Locality-aware scheduling across GPU nodes
  • OpenAI-compatible API endpoints out of the box

Model Lifecycle Management

  • Health checks and automatic failure recovery
  • Graceful shutdown with request draining
  • Model versioning and A/B testing
  • Resource quotas per deployment

Advanced Serving Patterns

  • Disaggregated Prefill/Decode — Separate prefill and decode phases onto dedicated GPU pools
  • Wide Expert Parallelism — For MoE models like DeepSeek
Layer 3

KubeRay — Kubernetes Operator

KubeRay bridges Ray's distributed compute with Kubernetes-native orchestration.

Without KubeRay

  • Manual Ray head/worker node configuration
  • Hand-managed networking between Ray nodes
  • No auto-recovery on node failures
  • Manual GPU resource allocation
  • No integration with K8s RBAC, secrets, or storage

With KubeRay

  • RayCluster CRD — Declarative cluster provisioning
  • RayService CRD — Manages Ray Serve deployments with rolling updates
  • RayJob CRD — Batch jobs with automatic cluster lifecycle
  • Auto-healing, GPU-aware scheduling, node pool integration
  • Native K8s secrets, ConfigMaps, PVCs, and service mesh support
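For the RayJob CRD mentioned above, a minimal manifest sketch (the job name, entrypoint script, and image tag are illustrative assumptions): the operator provisions a cluster, runs the entrypoint, and tears the cluster down when the job finishes.

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: batch-eval                  # illustrative name
spec:
  entrypoint: python eval.py        # illustrative script
  shutdownAfterJobFinishes: true    # automatic cluster lifecycle
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.40.0
```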
Layer 4

Kubernetes — The Foundation

The orchestration layer that makes GPU-accelerated LLM serving reliable and scalable.

01. GPU Scheduling

NVIDIA device plugin, GPU sharing (MIG, time-slicing), topology-aware scheduling for NVLink/NVSwitch

02. Node Pool Management

Dedicated GPU node pools (L4, A100, H100), spot/preemptible nodes for cost optimization, cluster autoscaler

03. Networking

RDMA/InfiniBand for multi-node tensor parallelism, service mesh for traffic management, ingress for API endpoints

04. Storage

Persistent volumes for model weights, HuggingFace cache mounts, shared storage across Ray workers

05. Security

RBAC, network policies, secrets management for HF tokens and API keys, pod security standards

06. Observability

Prometheus metrics from vLLM + Ray, Grafana dashboards, GPU utilization monitoring, cost tracking

Data Flow

End-to-End Request Flow

Client (OpenAI SDK) → K8s Ingress (Load Balancer) → Ray Serve (Router) → vLLM Engine (GPU Worker) → Response (Streaming)

Prefill Phase

Process the entire input prompt in parallel. This is compute-bound and benefits from high FLOPS GPUs. Prefix caching avoids recomputation for shared system prompts.

Decode Phase

Generate tokens one at a time autoregressively. This is memory-bandwidth-bound. Continuous batching interleaves multiple requests to keep GPUs saturated.
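A toy illustration of why the two phases behave so differently: prefill touches every prompt token in one parallel pass, while decode needs one sequential pass per generated token.

```python
def serving_steps(prompt_tokens, output_tokens):
    """Count sequential model passes for one request (toy model).
    Prefill processes the whole prompt in a single parallel pass
    (compute-bound: many tokens per pass); decode is autoregressive,
    one pass per new token (bandwidth-bound: one token per pass)."""
    prefill_steps = 1
    decode_steps = output_tokens
    return prefill_steps + decode_steps

# A 2,000-token prompt generating 100 tokens costs ~101 sequential
# passes: end-to-end latency is dominated by decode, not prompt length.
print(serving_steps(2000, 100))  # 101
```

This asymmetry is what motivates disaggregated prefill/decode serving: the two phases want different hardware profiles.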

Performance

Production Benchmarks

  • 2.2K tokens/s per H200
  • 24x throughput vs naive serving
  • <100 ms time to first token
  • 95%+ GPU utilization

Key vLLM Optimizations: PagedAttention reduces memory waste by up to 55%. Continuous batching improves throughput by 2-4x over static batching. Prefix caching accelerates repeated prompts by 3-10x.
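The prefix-caching win can be sketched with a toy cache keyed on token blocks (illustrative only, not vLLM's implementation): requests that share a system prompt reuse its cached KV blocks instead of recomputing them.

```python
BLOCK = 4   # tokens per cached block (toy size)
cache = {}  # full token prefix up to a block boundary -> cached flag

def prefill_with_prefix_cache(tokens):
    """Return (#tokens computed, #tokens served from cache)."""
    computed = cached = 0
    full_blocks = len(tokens) - len(tokens) % BLOCK
    for i in range(0, full_blocks, BLOCK):
        key = tuple(tokens[: i + BLOCK])  # block identity = its full prefix
        if key in cache:
            cached += BLOCK
        else:
            cache[key] = True
            computed += BLOCK
    computed += len(tokens) % BLOCK  # partial tail block is always computed
    return computed, cached

system = list(range(16))  # shared 16-token system prompt
a = prefill_with_prefix_cache(system + [100, 101])
b = prefill_with_prefix_cache(system + [200, 201])
print(a)  # (18, 0)  first request computes everything
print(b)  # (2, 16)  second request reuses the shared prefix blocks
```

The longer and more widely shared the prefix (system prompts, few-shot examples), the larger the saving, which is where the 3-10x figure comes from.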

Feature             | vLLM       | TGI        | Triton + TRT-LLM
PagedAttention      | Yes        | Yes        | Custom KV Cache
Continuous Batching | Yes        | Yes        | Yes
Multi-Node TP/PP    | Yes        | Limited    | Yes
Open Source         | Apache 2.0 | Apache 2.0 | Mixed
KubeRay Native      | Yes        | No         | No
Implementation

RayService Deployment

Deploy Llama 3.1 8B on GKE with KubeRay in a single manifest.

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: llama-31-serve
spec:
  serveConfigV2: |
    applications:
      - name: llm
        import_path: ray_serve_llm:app
        deployments:
          - name: VLLMDeployment
            num_replicas: 2
            ray_actor_options:
              num_gpus: 1
  rayClusterConfig:
    headGroupSpec:
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray-ml:2.40.0-gpu
              resources:
                limits:
                  nvidia.com/gpu: "1"
              env:
                - name: HUGGING_FACE_HUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-secret
                      key: token
Scale

Scaling Patterns

Single-Node, Multi-GPU

Tensor Parallelism — Split model layers across 2-8 GPUs on a single node using NVLink for fast inter-GPU communication. Ideal for models up to ~70B parameters.

llm = LLM(
  model="meta-llama/Llama-3.1-70B",
  tensor_parallel_size=4
)

Multi-Node, Multi-GPU

Pipeline Parallelism — Distribute model stages across nodes. KubeRay manages the Ray cluster spanning multiple K8s pods with RDMA networking.

llm = LLM(
  model="meta-llama/Llama-3.1-405B",
  tensor_parallel_size=8,
  pipeline_parallel_size=4
)  # 32 GPUs across 4 nodes

Data Parallel Replicas

For throughput scaling, deploy multiple independent model replicas behind Ray Serve's load balancer. Each replica uses TP within its node; Ray Serve routes requests across replicas with prefix-cache affinity.
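Prefix-cache-affinity routing can be sketched as hashing the prompt's leading tokens to pick a replica, so requests that share a prefix land on the replica whose cache is already warm (illustrative only, not Ray Serve's router):

```python
import hashlib

REPLICAS = 4
PREFIX_CHARS = 64  # route on the first N characters of the prompt (assumed)

def route(prompt: str) -> int:
    """Pick a replica deterministically from the prompt prefix."""
    h = hashlib.sha256(prompt[:PREFIX_CHARS].encode()).hexdigest()
    return int(h, 16) % REPLICAS

sys_prompt = "You are a helpful assistant. " * 4  # shared system prompt
r1 = route(sys_prompt + "Summarize this article...")
r2 = route(sys_prompt + "Translate this paragraph...")
print(r1 == r2)  # True: same prefix -> same replica -> warm KV cache
```

A production router would additionally weigh replica load so affinity never overrides backpressure; this sketch shows only the affinity half.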

Monitoring

Observability Stack

Unified metrics pipeline from vLLM engine through Ray to Kubernetes.

vLLM Engine Metrics

  • Time to First Token (TTFT)
  • Time per Output Token (TPOT)
  • Tokens/second throughput
  • KV cache utilization %
  • Prefix cache hit rate
  • Batch size distribution
  • Request queue depth

Ray Serve Metrics

  • Replica count & health
  • Request latency (p50/p95/p99)
  • Queue length per deployment
  • Autoscaling events
  • Error rates by endpoint
  • Active connections
  • Routing decisions

Kubernetes Metrics

  • GPU utilization (DCGM)
  • GPU memory usage
  • Node pool capacity
  • Pod restart counts
  • Network throughput
  • Storage IOPS
  • Cost per inference

Unified Prometheus Endpoint: Ray integrates with vLLM to expose all engine-level metrics alongside cluster metrics through a single Prometheus scrape target. Grafana dashboards provide real-time visibility across the full stack.

Looking Ahead

Future of LLM Infrastructure

01. Disaggregated Serving

Separate prefill and decode into dedicated GPU pools optimized for each phase's compute profile. vLLM and Ray Serve already support this pattern.

02. Specialized Hardware

Continued advancements in purpose-built inference chips (NVIDIA Blackwell, Google TPUv6, AMD MI400) drive serving cost down by 2-3x per generation.

03. Optimized Serving Engines

Speculative decoding, chunked prefill, and structured output guarantees continue to improve vLLM's tokens/s per dollar efficiency.

04. Kubernetes-Native LLM Ops

Projects like llm-d bring vLLM-native workload orchestration directly into Kubernetes, with inference-aware scheduling and routing.

Conclusion

Key Takeaways

Scalable Infrastructure

Kubernetes and KubeRay enable scalable, adaptable infrastructure to handle the demands of GenAI, simplifying LLM pipelines from development to production.

Efficient LLM Serving

Ray Serve and vLLM provide a streamlined, unified approach to efficiently manage LLM inference with OpenAI-compatible APIs out of the box.

Future-Ready Architecture

Continued advancements in specialized hardware and optimized serving engines like vLLM are critical to improving serving efficiency and reducing cost.

Unified Stack

By leveraging Kubernetes, KubeRay, and vLLM together, we tackle current infrastructure challenges and make LLMs accessible for widespread, high-performance use.

AutoscaleWorks

Self-Hosted LLM Serving • Agentic RAG Systems • GPU Kubernetes Infrastructure


[email protected]
autoscaleworks.ai

Saddle River Consulting LLC