From 856325fe2b3c894e689512b705e42e7bbf55d72d Mon Sep 17 00:00:00 2001
From: GitLab CI
Date: Wed, 8 Oct 2025 14:40:57 +0000
Subject: [PATCH] chore: Regenerate all playbooks

---
 nvidia/vllm/README.md | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/nvidia/vllm/README.md b/nvidia/vllm/README.md
index 1e9a6e4..8672aca 100644
--- a/nvidia/vllm/README.md
+++ b/nvidia/vllm/README.md
@@ -13,6 +13,14 @@
 
 ## Overview
 
+## Basic idea
+
+vLLM is an inference engine designed to run large language models efficiently. The key idea is **maximizing throughput and minimizing memory waste** when serving LLMs.
+
+- It uses a memory-efficient attention algorithm called **PagedAttention**, which stores the KV cache in small pages so long sequences don't run out of GPU memory.
+- New requests can be added to a batch that is already in progress through **continuous batching**, keeping GPUs fully utilized.
+- It exposes an **OpenAI-compatible API**, so applications built for the OpenAI API can switch to a vLLM backend with little or no modification.
+
 ## What you'll accomplish
 
 You'll set up vLLM high-throughput LLM serving on DGX Spark with Blackwell architecture,
@@ -40,7 +48,7 @@ support for ARM64.
 
 ## Time & risk
 
-**Time estimate:** 30 minutes for Docker approach
+**Duration:** 30 minutes for the Docker approach
 
 **Risks:** Container registry access requires internal credentials
 
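
As an illustration of the drop-in compatibility described in the "Basic idea" bullets above, here is a minimal sketch that points the standard `openai` Python client at a local vLLM server. The base URL assumes vLLM's default port 8000, and the model name is a placeholder; adjust both to the actual deployment.

```python
# Minimal sketch: reuse the standard OpenAI client against a vLLM backend.
# Assumptions: a vLLM OpenAI-compatible server is already running locally on
# the default port 8000, and the model name below is a placeholder.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # any value works unless the server enforces a key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder; use the model the server was started with
    messages=[{"role": "user", "content": "In one sentence, what is PagedAttention?"}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```

Because only the client's base URL (and model name) changes, applications already built against the OpenAI API need little or no modification to use vLLM as the backend.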