diff --git a/nvidia/vllm/README.md b/nvidia/vllm/README.md index 1e9a6e4..8672aca 100644 --- a/nvidia/vllm/README.md +++ b/nvidia/vllm/README.md @@ -13,6 +13,14 @@ ## Overview +## Basic idea + +vLLM is an inference engine designed to run large language models efficiently. The key idea is **maximizing throughput and minimizing memory waste** when serving LLMs. + +- It uses a memory-efficient attention algoritm called **PagedAttention** to handle long sequences without running out of GPU memory. +- New requests can be added to a batch already in process through **continuous batching** to keep GPUs fully utilized. +- It has an **OpenAI-compatible API** so applications built for the OpenAI API can switch to a vLLM backend with little or no modification. + ## What you'll accomplish You'll set up vLLM high-throughput LLM serving on DGX Spark with Blackwell architecture, @@ -40,7 +48,7 @@ support for ARM64. ## Time & risk -**Time estimate:** 30 minutes for Docker approach +**Duration:** 30 minutes for Docker approach **Risks:** Container registry access requires internal credentials