mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git
synced 2026-04-23 02:23:53 +00:00
chore: Regenerate all playbooks
This commit is contained in:
parent bfde041ae0
commit 856325fe2b
@@ -13,6 +13,14 @@
## Overview
## Basic idea
vLLM is an inference engine designed to run large language models efficiently. The key idea is **maximizing throughput and minimizing memory waste** when serving LLMs.
- It uses a memory-efficient attention algorithm called **PagedAttention** to handle long sequences without running out of GPU memory.
- New requests can be added to a batch already in progress through **continuous batching** to keep GPUs fully utilized (illustrated in the first sketch below).
- It has an **OpenAI-compatible API**, so applications built for the OpenAI API can switch to a vLLM backend with little or no modification (see the second sketch below).
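
As a concrete illustration of batched generation, here is a minimal sketch using vLLM's offline Python API. It is not part of this playbook: the model name is a placeholder, and it assumes vLLM is installed (`pip install vllm`).

```python
from vllm import LLM, SamplingParams

# Sketch of vLLM's offline batch interface (model name is a placeholder).
# The engine schedules all prompts together and applies continuous
# batching internally, so adding prompts raises throughput rather than
# multiplying latency.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=64)

prompts = [
    "Explain PagedAttention in one sentence.",
    "Why does batching improve GPU utilization?",
    "What does an inference engine do?",
]

# generate() accepts a whole list of prompts; vLLM batches them on the GPU.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text.strip())
```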
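To make the API-compatibility point concrete, the second sketch queries a locally running vLLM server through the standard `openai` Python client. The port, model name, and API key value are illustrative assumptions; a default vLLM deployment ignores the key, but the client requires one.

```python
from openai import OpenAI

# Point the unmodified OpenAI client at a local vLLM server.
# Port and model name are placeholders, not values from this playbook.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "In one sentence, what is PagedAttention?"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```

Because only the base URL changes, an application written against the OpenAI API can be repointed at vLLM without rewriting its request code.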
## What you'll accomplish
You'll set up vLLM for high-throughput LLM serving on DGX Spark with Blackwell architecture,
@@ -40,7 +48,7 @@ support for ARM64.
## Time & risk
-**Time estimate:** 30 minutes for Docker approach
+**Duration:** 30 minutes for Docker approach

**Risks:** Container registry access requires internal credentials