From 856325fe2b3c894e689512b705e42e7bbf55d72d Mon Sep 17 00:00:00 2001
From: GitLab CI
Date: Wed, 8 Oct 2025 14:40:57 +0000
Subject: [PATCH] chore: Regenerate all playbooks

---
 nvidia/vllm/README.md | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/nvidia/vllm/README.md b/nvidia/vllm/README.md
index 1e9a6e4..8672aca 100644
--- a/nvidia/vllm/README.md
+++ b/nvidia/vllm/README.md
@@ -13,6 +13,14 @@
 
 ## Overview
 
+## Basic idea
+
+vLLM is an inference engine designed to run large language models efficiently. The key idea is **maximizing throughput and minimizing memory waste** when serving LLMs.
+
+- It uses a memory-efficient attention algorithm called **PagedAttention**, which stores the KV cache in small pages so long sequences don't run out of GPU memory.
+- New requests can be added to a batch that is already in progress through **continuous batching**, keeping GPUs fully utilized.
+- It exposes an **OpenAI-compatible API**, so applications built for the OpenAI API can switch to a vLLM backend with little or no modification.
+
 ## What you'll accomplish
 
 You'll set up vLLM high-throughput LLM serving on DGX Spark with Blackwell architecture,
@@ -40,7 +48,7 @@ support for ARM64.
 
 ## Time & risk
 
-**Time estimate:** 30 minutes for Docker approach
+**Duration:** 30 minutes for the Docker approach
 
 **Risks:** Container registry access requires internal credentials
 
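
As an illustration of the drop-in compatibility described in the "Basic idea" bullets above, here is a minimal sketch that points the standard `openai` Python client at a local vLLM server. The base URL assumes vLLM's default port 8000, and the model name is a placeholder; adjust both to the actual deployment.

```python
# Minimal sketch: reuse the standard OpenAI client against a vLLM backend.
# Assumptions: a vLLM OpenAI-compatible server is already running locally on
# the default port 8000, and the model name below is a placeholder.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # any value works unless the server enforces a key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder; use the model the server was started with
    messages=[{"role": "user", "content": "In one sentence, what is PagedAttention?"}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```

Because only the client's base URL (and model name) changes, applications already built against the OpenAI API need little or no modification to use vLLM as the backend.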