Technical Deep Dive

Efficient LLM Deployment

A Unified Approach with vLLM, Ray Serve, KubeRay & Kubernetes


Based on CNCF presentation by Lily (Xiaoxuan) Liu
AutoscaleWorks — Saddle River Consulting LLC

Context

The LLM Infrastructure Challenge

Deploying large language models in production is fundamentally different from training. Serving demands low latency, high throughput, and cost efficiency — all at massive scale.

1. GPU Memory Constraints

Models like Llama 3.1 405B exceed single-node GPU memory, requiring distributed inference across multiple nodes

2. Scaling Complexity

Autoscaling GPU workloads requires orchestration-aware infrastructure that understands model topology

3. Operational Overhead

Managing model lifecycle, rolling updates, health checks, and multi-model serving without downtime

Architecture

End-to-End LLM Serving Stack

  • vLLM: LLM Inference Engine (PagedAttention, Continuous Batching, Tensor Parallelism)
  • Ray Serve: Model Deployment & Scaling (Request Routing, Load Balancing, Lifecycle Management)
  • KubeRay: Kubernetes Operator for Ray Clusters (Automated Cluster Management)
  • Kubernetes: Container Orchestration (GPU Scheduling, Node Pools, Networking)
Layer 1

vLLM — The Inference Engine

A high-throughput, memory-efficient serving engine for LLMs, purpose-built for production inference.

Core Innovations

  • PagedAttention — Near-zero memory waste for KV cache via OS-inspired virtual memory paging
  • Continuous Batching — Dynamically adds/removes requests from running batch for max GPU utilization
  • Tensor Parallelism — Splits model layers across GPUs for distributed inference
  • Pipeline Parallelism — Distributes model stages across nodes for multi-node serving
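The PagedAttention idea can be illustrated with a toy block allocator (illustrative only, not vLLM's internals): the KV cache is carved into fixed-size blocks, and each sequence maps logical block indices to physical block ids, so memory is claimed on demand rather than reserved up front for the maximum sequence length.

```python
# Toy sketch of PagedAttention-style KV cache paging (not vLLM's real code).
# A pool of fixed-size blocks is handed out on demand; each sequence keeps a
# block table mapping logical block index -> physical block id.

BLOCK_SIZE = 16  # tokens per KV cache block

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        return self.free.pop()

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical -> physical block ids
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=64)
seq = Sequence(allocator)
for _ in range(40):  # a sequence that has generated 40 tokens so far
    seq.append_token()

# 40 tokens at 16 tokens/block -> ceil(40/16) = 3 blocks allocated,
# instead of a worst-case reservation for the maximum sequence length.
print(len(seq.block_table))  # 3
```

Freed blocks can be returned to the pool and reused by other sequences, which is where the "near-zero waste" claim comes from.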

Python API for Offline Inference

from vllm import LLM

# Create an LLM with HuggingFace model
llm = LLM(
  model="meta-llama/Meta-Llama-3.1-8B"
)

# Generate text from prompts
prompts = [
  "Hello, my name is",
  "The capital of France is"
]
outputs = llm.generate(prompts)

# Print the generated continuation for each prompt
for output in outputs:
  print(output.outputs[0].text)
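vLLM also ships an OpenAI-compatible HTTP server, started with "vllm serve <model>". A sketch of the request body a client would POST to its /v1/completions endpoint (the server address and sampling parameters here are assumptions, not taken from the talk):

```python
import json

# Request body for a vLLM server started with:
#   vllm serve meta-llama/Meta-Llama-3.1-8B
# POST this to http://localhost:8000/v1/completions with any HTTP client,
# or point the official openai SDK at that base_url.
body = {
    "model": "meta-llama/Meta-Llama-3.1-8B",
    "prompt": "The capital of France is",
    "max_tokens": 16,
    "temperature": 0.0,
}
payload = json.dumps(body)
print(payload)
```

Because the wire format matches OpenAI's, existing OpenAI SDK code can switch to a self-hosted model by changing only the base URL and model name.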

Broad Model Support

vLLM supports nearly all popular open-source LLMs and VLMs out of the box.

Large Language Models

Meta Llama 3.1

Official launch partner — 8B, 70B, 405B parameter variants. Day-one optimized support.

DeepSeek R1 / V3

MoE architecture with Wide-EP support for distributed expert parallelism across nodes.

Qwen, Mistral, Gemma, Phi

First-class support for all major model families from HuggingFace Hub.

Vision Language Models

Pixtral 12B

Mistral's multimodal model — contributed natively by model creators to vLLM.

Qwen2-VL

Vision-language model with dynamic resolution support, contributed by Alibaba.

LLaVA, InternVL, Molmo

Growing ecosystem of multimodal models with optimized attention and image encoders.

Hardware

PyTorch as the Narrow Waist

vLLM leverages PyTorch as the hardware abstraction layer, enabling deployment across any accelerator.

Models & Utilities
  ↓
PyTorch
  ↓
NVIDIA GPU (CUDA) | AMD GPU (ROCm) | Intel GPU (XPU) | Google TPU | AWS Trainium

Key Insight: PyTorch serves as a universal abstraction layer. Write model code once and deploy on any hardware backend — from NVIDIA H100s to Google TPUv5e to AMD MI300X.

Layer 2

Ray Serve — Serving Framework

A scalable, framework-agnostic model serving library built on the Ray distributed runtime.

Model Deployment & Scaling

  • Autoscaling based on request queue depth and latency targets
  • Multi-model serving on shared GPU pools
  • Rolling updates with zero-downtime deployments
  • Dynamic model loading/unloading
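Conceptually, Ray Serve's autoscaler sizes a deployment from the number of in-flight requests relative to a per-replica target. A toy sketch of that calculation (function and parameter names are illustrative, not Ray's API):

```python
import math

def desired_replicas(ongoing_requests, target_per_replica,
                     min_replicas=1, max_replicas=8):
    """Scale so each replica carries ~target_per_replica requests."""
    want = math.ceil(ongoing_requests / target_per_replica)
    return max(min_replicas, min(max_replicas, want))

# 50 in-flight requests with a target of 8 per replica -> 7 replicas
print(desired_replicas(50, 8))   # 7
# An idle service never scales below min_replicas
print(desired_replicas(0, 8))    # 1
# A burst is capped at max_replicas
print(desired_replicas(1000, 8)) # 8
```

In the real system this target, plus latency signals and min/max bounds, is set per deployment in the Serve config.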

Request Routing & Load Balancing

  • Prefix-cache-affinity routing for higher cache hit rates
  • Locality-aware scheduling across GPU nodes
  • OpenAI-compatible API endpoints out of the box

Model Lifecycle Management

  • Health checks and automatic failure recovery
  • Graceful shutdown with request draining
  • Model versioning and A/B testing
  • Resource quotas per deployment

Advanced Serving Patterns

  • Disaggregated Prefill/Decode — Separate prefill and decode phases onto dedicated GPU pools
  • Wide Expert Parallelism — For MoE models like DeepSeek
Layer 3

KubeRay — Kubernetes Operator

KubeRay bridges Ray's distributed compute with Kubernetes-native orchestration.

Without KubeRay

  • Manual Ray head/worker node configuration
  • Hand-managed networking between Ray nodes
  • No auto-recovery on node failures
  • Manual GPU resource allocation
  • No integration with K8s RBAC, secrets, or storage

With KubeRay

  • RayCluster CRD — Declarative cluster provisioning
  • RayService CRD — Manages Ray Serve deployments with rolling updates
  • RayJob CRD — Batch jobs with automatic cluster lifecycle
  • Auto-healing, GPU-aware scheduling, node pool integration
  • Native K8s secrets, ConfigMaps, PVCs, and service mesh support
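For the RayJob CRD mentioned above, a minimal manifest sketch (the job name, entrypoint script, and image tag are illustrative assumptions): the operator provisions a cluster, runs the entrypoint, and tears the cluster down when the job finishes.

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: batch-eval                  # illustrative name
spec:
  entrypoint: python eval.py        # illustrative script
  shutdownAfterJobFinishes: true    # automatic cluster lifecycle
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.40.0
```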
Layer 4

Kubernetes — The Foundation

The orchestration layer that makes GPU-accelerated LLM serving reliable and scalable.

01. GPU Scheduling

NVIDIA device plugin, GPU sharing (MIG, time-slicing), topology-aware scheduling for NVLink/NVSwitch

02. Node Pool Management

Dedicated GPU node pools (L4, A100, H100), spot/preemptible nodes for cost optimization, cluster autoscaler

03. Networking

RDMA/InfiniBand for multi-node tensor parallelism, service mesh for traffic management, ingress for API endpoints

04. Storage

Persistent volumes for model weights, HuggingFace cache mounts, shared storage across Ray workers

05. Security

RBAC, network policies, secrets management for HF tokens and API keys, pod security standards

06. Observability

Prometheus metrics from vLLM + Ray, Grafana dashboards, GPU utilization monitoring, cost tracking

Data Flow

End-to-End Request Flow

Client (OpenAI SDK) → K8s Ingress (Load Balancer) → Ray Serve (Router) → vLLM Engine (GPU Worker) → Response (Streaming)

Prefill Phase

Process the entire input prompt in parallel. This is compute-bound and benefits from high FLOPS GPUs. Prefix caching avoids recomputation for shared system prompts.

Decode Phase

Generate tokens one at a time autoregressively. This is memory-bandwidth-bound. Continuous batching interleaves multiple requests to keep GPUs saturated.
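A toy illustration of why the two phases behave so differently: prefill touches every prompt token in one parallel pass, while decode needs one sequential pass per generated token.

```python
def serving_steps(prompt_tokens, output_tokens):
    """Count sequential model passes for one request (toy model).
    Prefill processes the whole prompt in a single parallel pass
    (compute-bound: many tokens per pass); decode is autoregressive,
    one pass per new token (bandwidth-bound: one token per pass)."""
    prefill_steps = 1
    decode_steps = output_tokens
    return prefill_steps + decode_steps

# A 2,000-token prompt generating 100 tokens costs ~101 sequential
# passes: end-to-end latency is dominated by decode, not prompt length.
print(serving_steps(2000, 100))  # 101
```

This asymmetry is what motivates disaggregated prefill/decode serving: the two phases want different hardware profiles.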

Performance

Production Benchmarks

  • 2.2K tokens/s per H200
  • 24x throughput vs naive serving
  • <100 ms time to first token
  • 95%+ GPU utilization

Key vLLM Optimizations: PagedAttention reduces memory waste by up to 55%. Continuous batching improves throughput by 2-4x over static batching. Prefix caching accelerates repeated prompts by 3-10x.
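The prefix-caching win can be sketched with a toy cache keyed on token blocks (illustrative only, not vLLM's implementation): requests that share a system prompt reuse its cached KV blocks instead of recomputing them.

```python
BLOCK = 4   # tokens per cached block (toy size)
cache = {}  # full token prefix up to a block boundary -> cached flag

def prefill_with_prefix_cache(tokens):
    """Return (#tokens computed, #tokens served from cache)."""
    computed = cached = 0
    full_blocks = len(tokens) - len(tokens) % BLOCK
    for i in range(0, full_blocks, BLOCK):
        key = tuple(tokens[: i + BLOCK])  # block identity = its full prefix
        if key in cache:
            cached += BLOCK
        else:
            cache[key] = True
            computed += BLOCK
    computed += len(tokens) % BLOCK  # partial tail block is always computed
    return computed, cached

system = list(range(16))  # shared 16-token system prompt
a = prefill_with_prefix_cache(system + [100, 101])
b = prefill_with_prefix_cache(system + [200, 201])
print(a)  # (18, 0)  first request computes everything
print(b)  # (2, 16)  second request reuses the shared prefix blocks
```

The longer and more widely shared the prefix (system prompts, few-shot examples), the larger the saving, which is where the 3-10x figure comes from.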

Feature             | vLLM       | TGI        | Triton + TRT-LLM
PagedAttention      | Yes        | Yes        | Custom KV Cache
Continuous Batching | Yes        | Yes        | Yes
Multi-Node TP/PP    | Yes        | Limited    | Yes
Open Source         | Apache 2.0 | Apache 2.0 | Mixed
KubeRay Native      | Yes        | No         | No
Implementation

RayService Deployment

Deploy Llama 3.1 8B on GKE with KubeRay in a single manifest.

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: llama-31-serve
spec:
  serveConfigV2: |
    applications:
      - name: llm
        import_path: ray_serve_llm:app
        deployments:
          - name: VLLMDeployment
            num_replicas: 2
            ray_actor_options:
              num_gpus: 1
  rayClusterConfig:
    headGroupSpec:
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray-ml:2.40.0-gpu
              resources:
                limits:
                  nvidia.com/gpu: "1"
              env:
                - name: HUGGING_FACE_HUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-secret
                      key: token
Scale

Scaling Patterns

Single-Node, Multi-GPU

Tensor Parallelism — Split model layers across 2-8 GPUs on a single node using NVLink for fast inter-GPU communication. Ideal for models up to ~70B parameters.

llm = LLM(
  model="meta-llama/Llama-3.1-70B",
  tensor_parallel_size=4
)

Multi-Node, Multi-GPU

Pipeline Parallelism — Distribute model stages across nodes. KubeRay manages the Ray cluster spanning multiple K8s pods with RDMA networking.

llm = LLM(
  model="meta-llama/Llama-3.1-405B",
  tensor_parallel_size=8,
  pipeline_parallel_size=4
)  # 32 GPUs across 4 nodes

Data Parallel Replicas

For throughput scaling, deploy multiple independent model replicas behind Ray Serve's load balancer. Each replica uses TP within its node; Ray Serve routes requests across replicas with prefix-cache affinity.
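Prefix-cache-affinity routing can be sketched as hashing the prompt's leading tokens to pick a replica, so requests that share a prefix land on the replica whose cache is already warm (illustrative only, not Ray Serve's router):

```python
import hashlib

REPLICAS = 4
PREFIX_CHARS = 64  # route on the first N characters of the prompt (assumed)

def route(prompt: str) -> int:
    """Pick a replica deterministically from the prompt prefix."""
    h = hashlib.sha256(prompt[:PREFIX_CHARS].encode()).hexdigest()
    return int(h, 16) % REPLICAS

sys_prompt = "You are a helpful assistant. " * 4  # shared system prompt
r1 = route(sys_prompt + "Summarize this article...")
r2 = route(sys_prompt + "Translate this paragraph...")
print(r1 == r2)  # True: same prefix -> same replica -> warm KV cache
```

A production router would additionally weigh replica load so affinity never overrides backpressure; this sketch shows only the affinity half.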

Monitoring

Observability Stack

Unified metrics pipeline from vLLM engine through Ray to Kubernetes.

vLLM Engine Metrics

  • Time to First Token (TTFT)
  • Time per Output Token (TPOT)
  • Tokens/second throughput
  • KV cache utilization %
  • Prefix cache hit rate
  • Batch size distribution
  • Request queue depth

Ray Serve Metrics

  • Replica count & health
  • Request latency (p50/p95/p99)
  • Queue length per deployment
  • Autoscaling events
  • Error rates by endpoint
  • Active connections
  • Routing decisions

Kubernetes Metrics

  • GPU utilization (DCGM)
  • GPU memory usage
  • Node pool capacity
  • Pod restart counts
  • Network throughput
  • Storage IOPS
  • Cost per inference

Unified Prometheus Endpoint: Ray integrates with vLLM to expose all engine-level metrics alongside cluster metrics through a single Prometheus scrape target. Grafana dashboards provide real-time visibility across the full stack.

Looking Ahead

Future of LLM Infrastructure

01. Disaggregated Serving

Separate prefill and decode into dedicated GPU pools optimized for each phase's compute profile. vLLM and Ray Serve already support this pattern.

02. Specialized Hardware

Continued advancements in purpose-built inference chips (NVIDIA Blackwell, Google TPUv6, AMD MI400) drive serving cost down by 2-3x per generation.

03. Optimized Serving Engines

Speculative decoding, chunked prefill, and structured output guarantees continue to improve vLLM's tokens/s per dollar efficiency.

04. Kubernetes-Native LLM Ops

Projects like llm-d bring vLLM-native workload orchestration directly into Kubernetes, with inference-aware scheduling and routing.

Conclusion

Key Takeaways

Scalable Infrastructure

Kubernetes and KubeRay enable scalable, adaptable infrastructure to handle the demands of GenAI, simplifying LLM pipelines from development to production.

Efficient LLM Serving

Ray Serve and vLLM provide a streamlined, unified approach to efficiently manage LLM inference with OpenAI-compatible APIs out of the box.

Future-Ready Architecture

Continued advancements in specialized hardware and optimized serving engines like vLLM are critical to improving serving efficiency and reducing cost.

Unified Stack

By leveraging Kubernetes, KubeRay, and vLLM together, we tackle current infrastructure challenges and make LLMs accessible for widespread, high-performance use.

AutoscaleWorks

Self-Hosted LLM Serving • Agentic RAG Systems • GPU Kubernetes Infrastructure


[email protected]
autoscaleworks.ai

Saddle River Consulting LLC