dgx-spark-playbooks/nvidia/station-sglang-inference/endpoint-test.yaml
2026-05-26 18:25:53 +00:00

kind: Playbook
metadata:
name: station-sglang-inference
displayName: LLM Inference with SGLang
shortDescription: Serve LLMs with SGLang on DGX Station for prefix-cached multi-turn and structured output inference
publisher: nvidia
description: |
# REPLACE THIS WITH YOUR MODEL CARD
https://gitlab-master.nvidia.com/api-catalog/examples/-/blob/main/modelcard-example-mixtral8x7b.md?ref_type=heads
labelsV2:
- gpuType:playbook:gpu_type_station
- DGX Station
- GB300
- Inference
- SGLang
- Blackwell
- Structured Output
- RadixAttention
attributes:
- key: DURATION
value: 25 MIN
spec:
artifactName: station-sglang-inference
nvcfFunctionId: None
attributes:
showUnavailableBanner: false
apiDocsUrl: None
termsOfUse: |
cta:
text: View on GitHub
url: https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/station-sglang-inference/
tabs:
-
id: overview
label: Overview
content: |
# Basic idea
SGLang is a high-performance serving framework for large language models, optimized for workloads where requests share common prefixes — multi-turn conversations, RAG pipelines, and agentic workflows. Its core innovation, **RadixAttention**, automatically caches and reuses KV cache entries across requests using a radix tree, eliminating redundant prefill computation. SGLang also provides best-in-class **structured output generation** (JSON, regex, grammar-constrained decoding) through its xGrammar backend, running up to 3x faster than standard guided decoding.
- **RadixAttention** — Automatically reuses KV cache across requests sharing common prefixes. Multi-turn conversations and repeated system prompts skip re-computation entirely, reducing first-token latency and increasing throughput.
- **Structured output** — Compressed finite-state machine decoding with grammar mask generation overlapped with the LLM forward pass. Produces valid JSON, regex-matched, or grammar-constrained output with minimal overhead.
- **OpenAI-compatible API** — Drop-in replacement for OpenAI and vLLM endpoints. Supports `/v1/chat/completions`, `/v1/completions`, and `/v1/embeddings`.
- **Blackwell optimized** — SGLang includes optimizations for SM100+ GPUs and CUDA 13, with NVFP4 GEMM support and accelerated softmax kernels.
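The prefix reuse idea behind RadixAttention can be illustrated with a toy token-level prefix tree. This is a minimal sketch for intuition only, not SGLang's implementation: `PrefixCache` and its methods are hypothetical names, and a real radix tree compresses runs of tokens into single edges and stores KV tensors rather than empty nodes.

```python
class PrefixCache:
    """Toy prefix tree over token IDs (hypothetical, for illustration only)."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        """Record a processed token sequence so later requests can match it."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def match_prefix(self, tokens):
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched


cache = PrefixCache()
turn1 = [1, 2, 3, 4, 5]        # stands in for system prompt + first user turn
turn2 = turn1 + [6, 7, 8]      # same conversation with one more turn appended
cache.insert(turn1)
reused = cache.match_prefix(turn2)
print(f"reused {reused} of {len(turn2)} tokens; prefill needed for {len(turn2) - reused}")
```

In this toy model only the three new tokens of the second turn would need prefill, which is the effect the multi-turn steps in this playbook try to make visible.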
# What you'll accomplish
Launch SGLang on DGX Station to serve an LLM, then exercise its two key differentiators: prefix-cached multi-turn chat and structured JSON output generation. You will also benchmark multi-turn throughput to see RadixAttention's effect.
- Serve Qwen3-8B with SGLang's Blackwell-optimized backend
- Send multi-turn conversations and observe prefix cache hits in server metrics
- Generate structured JSON output using schema-constrained decoding
- Benchmark multi-turn throughput with and without prefix caching
# What to know before starting
- Basic Docker container usage
- Familiarity with REST APIs (curl or Python requests)
# Prerequisites
- NVIDIA DGX Station with GB300 GPU (Blackwell SM103)
- Docker installed: `docker --version`
- NVIDIA Container Toolkit configured: `nvidia-smi` should show the GB300
- HuggingFace account with access token
- Network access to HuggingFace and Docker Hub
# Ancillary files
- `assets/benchmark_multiturn.py` — Python script that benchmarks multi-turn conversation throughput and demonstrates structured output generation
# Time & risk
* **Duration:** 20-25 minutes (including model download)
* **Risks:** Model download requires HuggingFace authentication
* **Rollback:** Stop and remove the container. Model weights downloaded to `~/.cache/huggingface/hub` and the pulled image persist until you delete them.
* **Last Updated:** 04/06/2026
* First Publication
-
id: instructions
label: Instructions
content: |
# Step 1. Set up Docker permissions
If you haven't already, add your user to the docker group to run Docker without sudo:
```bash
sudo usermod -aG docker $USER
newgrp docker
```
# Step 2. Set up environment variables
```bash
# HuggingFace token (required)
# Get a token from https://huggingface.co/settings/tokens
export HF_TOKEN="your_huggingface_token"
# Model to serve
export MODEL_HANDLE="Qwen/Qwen3-8B"
# Maximum context length
export MAX_MODEL_LEN=8192
```
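Before launching the container it can help to confirm that the variables above are actually set in the current shell. The following is a small sketch of such a check; the helper name `missing_settings` is hypothetical, and treating the literal `your_huggingface_token` as unset is an assumption based on the placeholder used above.

```python
import os

REQUIRED = ("HF_TOKEN", "MODEL_HANDLE", "MAX_MODEL_LEN")
PLACEHOLDERS = {"your_huggingface_token"}  # placeholder value from the export above


def missing_settings(env, required=REQUIRED):
    """Return the names of required variables that are unset, empty, or placeholders."""
    return [
        name for name in required
        if not env.get(name) or env.get(name) in PLACEHOLDERS
    ]


problems = missing_settings(os.environ)
print("missing:", problems if problems else "none")
```

Run it from the same shell session in which the variables were exported, since `export` only affects that session and its children.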
# Step 3. Pull the SGLang container
Pull the SGLang container image with CUDA 13.0 support (required for Blackwell SM103):
```bash
docker pull lmsysorg/sglang:latest-cu130
```
# Step 4. Identify the GB300 GPU
On DGX Station with multiple GPUs, identify the GB300's device index:
```bash
nvidia-smi --query-gpu=index,name --format=csv,noheader
```
Look for the row showing `NVIDIA GB300`. Note its index (e.g., `1`).
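If you prefer to pick the index programmatically, the CSV output of the query above is easy to parse. This is a sketch under the assumption that each row has the form `index, name`; the function names are hypothetical and the sample rows are illustrative, not captured from a real system.

```python
import subprocess


def find_gpu_index(csv_text, name_fragment="GB300"):
    """Return the index of the first GPU whose name contains name_fragment, or None."""
    for line in csv_text.strip().splitlines():
        index, _, name = line.partition(",")
        if name_fragment in name:
            return int(index.strip())
    return None


def query_gpus():
    """Run the same nvidia-smi query as above and return its raw CSV output."""
    return subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout


# Illustrative output only; the names on your system may differ.
sample = "0, Example Workstation GPU\n1, NVIDIA GB300\n"
print(find_gpu_index(sample))
```

On the Station itself, `find_gpu_index(query_gpus())` would return the value to substitute for `device=1` in the next step.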
# Step 5. Start SGLang server
Launch the SGLang server:
```bash
# Replace device=1 with your GB300's index from Step 4
docker run -d \
--name sglang-server \
--gpus '"device=1"' \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 30000:30000 \
-e HF_TOKEN="$HF_TOKEN" \
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
lmsysorg/sglang:latest-cu130 \
sglang serve --model-path "$MODEL_HANDLE" \
--host 0.0.0.0 \
--port 30000 \
--context-length "$MAX_MODEL_LEN" \
--mem-fraction-static 0.85
```
Check the server logs:
```bash
docker logs -f sglang-server
```
Wait for the server to show it is ready:
```
INFO: Uvicorn running on http://0.0.0.0:30000
```
Press `Ctrl+C` to exit the log view.
> [!NOTE]
> First launch downloads the model and compiles kernels. Subsequent starts are faster thanks to cached weights and compiled artifacts.
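
Instead of watching the logs, readiness can also be polled over HTTP. The sketch below assumes the server answers `GET /v1/models` with HTTP 200 once it is ready; the function names, retry count, and delay are illustrative choices, not SGLang defaults.

```python
import time
import urllib.error
import urllib.request


def server_is_up(base_url="http://localhost:30000", timeout=2.0):
    """Return True if GET /v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe=server_is_up, attempts=60, delay=5.0, sleep=time.sleep):
    """Call probe() until it returns True; return the attempt number, or None on give-up."""
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        sleep(delay)
    return None


# Demo with a fake probe that succeeds on the third call (no server needed).
answers = iter([False, False, True])
print(wait_until_ready(probe=lambda: next(answers), sleep=lambda s: None))
```

Against the real server, call `wait_until_ready()` with no arguments; with the values above it gives up after roughly five minutes.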
# Step 6. Test basic inference
Send a chat completion request using the OpenAI-compatible API:
```bash
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "'"$MODEL_HANDLE"'",
"messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}],
"max_tokens": 256
}'
```
The response follows the standard OpenAI format with a `choices` array containing the model's answer.
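The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl call above; the helper names are hypothetical, and the fallback model name is simply the value exported in Step 2.

```python
import json
import os
import urllib.request


def build_chat_request(model, prompt, max_tokens=256):
    """Build the same JSON body as the curl example above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def extract_answer(response):
    """Pull the assistant text out of an OpenAI-format response dict."""
    return response["choices"][0]["message"]["content"]


def post_chat(payload, base_url="http://localhost:30000"):
    """POST the payload to /v1/chat/completions and return the parsed JSON."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


payload = build_chat_request(
    os.environ.get("MODEL_HANDLE", "Qwen/Qwen3-8B"),
    "Explain quantum computing in simple terms.",
)
# With the server running: print(extract_answer(post_chat(payload)))
```

The network call is left commented out so the snippet can be read and imported without a running server.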
# Step 7. Multi-turn conversation with prefix caching
SGLang's RadixAttention automatically caches the KV cache for processed tokens. When follow-up messages share the same conversation prefix, the cached entries are reused — skipping prefill for all previously seen tokens.
Send a multi-turn conversation. The system prompt is deliberately long so the shared prefix exceeds SGLang's page size (64 tokens), which is the minimum unit for cache reuse:
```bash
# Turn 1
curl -s http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "'"$MODEL_HANDLE"'",
"messages": [
{"role": "system", "content": "You are an expert physics tutor who explains concepts clearly and concisely. You use real-world analogies and everyday examples to make abstract ideas concrete. When answering, first state the key concept in one sentence, then give a short explanation with an example."},
{"role": "user", "content": "What is the difference between speed and velocity?"}
],
"max_tokens": 256
}' | python3 -m json.tool
# Turn 2 — extends the same conversation
curl -s http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "'"$MODEL_HANDLE"'",
"messages": [
{"role": "system", "content": "You are an expert physics tutor who explains concepts clearly and concisely. You use real-world analogies and everyday examples to make abstract ideas concrete. When answering, first state the key concept in one sentence, then give a short explanation with an example."},
{"role": "user", "content": "What is the difference between speed and velocity?"},
{"role": "assistant", "content": "Speed is a scalar quantity that measures how fast an object moves, while velocity is a vector quantity that includes both speed and direction. For example, a car driving at 60 km/h has a speed of 60 km/h regardless of where it is headed. But if that car is driving 60 km/h north, that is its velocity — change direction to south and the velocity changes even though the speed stays the same."},
{"role": "user", "content": "Can you give me another example that shows why the distinction matters in real physics problems?"}
],
"max_tokens": 256
}' | python3 -m json.tool
```
The second request reuses the KV cache from the shared prefix (system message + first user turn + assistant response), so prefill runs only over the tokens of the new user message. This reduces first-token latency for follow-up turns.
Check the cache hit rate in the server logs. SGLang logs each prefill batch with the number of cached tokens reused:
```bash
docker logs sglang-server 2>&1 | grep "cached-token" | tail -10
```
Look for `#cached-token` values greater than 0 on later turns — this confirms RadixAttention is reusing the KV cache from the shared prefix.
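The same check can be scripted by extracting the `#cached-token` values from the log text. This is a sketch: the regular expression assumes the counter is printed as `#cached-token: N`, and the sample lines are illustrative stand-ins rather than real server output.

```python
import re

CACHED_RE = re.compile(r"#cached-token:\s*(\d+)")


def cached_token_counts(log_text):
    """Return every #cached-token value found in the log text, in order."""
    return [int(m.group(1)) for m in CACHED_RE.finditer(log_text)]


# Illustrative log lines (assumed format; real lines carry more fields).
sample_log = """\
Prefill batch. #new-seq: 1, #new-token: 75, #cached-token: 0
Prefill batch. #new-seq: 1, #new-token: 40, #cached-token: 128
"""
counts = cached_token_counts(sample_log)
print(counts, "cache hit seen:", any(c > 0 for c in counts))
```

To use it on real output, pass in the text produced by `docker logs sglang-server 2>&1`.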
# Step 8. Structured JSON output
SGLang's constrained decoding guarantees valid JSON output matching a schema. This uses the xGrammar backend to overlap grammar mask generation with the model's forward pass, adding minimal latency.
Generate a structured response:
```bash
curl -s http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "'"$MODEL_HANDLE"'",
"messages": [
{"role": "user", "content": "List three programming languages with their primary use case and year created."}
],
"max_tokens": 512,
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "languages",
"schema": {
"type": "object",
"properties": {
"languages": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"primary_use": {"type": "string"},
"year_created": {"type": "integer"}
},
"required": ["name", "primary_use", "year_created"]
}
}
},
"required": ["languages"]
}
}
}
}' | python3 -m json.tool
```
The response content is constrained to valid JSON matching the provided schema, as long as generation is not cut off by `max_tokens`. Parse the `choices[0].message.content` field; it holds the JSON object as a string.
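Client code will usually still want to parse the content string and sanity-check it. The sketch below does this for the schema used above with the standard library only; `parse_languages` is a hypothetical helper and the sample string merely stands in for a real `choices[0].message.content` value.

```python
import json


def parse_languages(content):
    """Parse the message content and check the fields the schema marks as required."""
    data = json.loads(content)
    expected = {"name": str, "primary_use": str, "year_created": int}
    for item in data["languages"]:
        for key, typ in expected.items():
            if not isinstance(item[key], typ):
                raise TypeError(f"{key} should be {typ.__name__}")
    return data["languages"]


# Illustrative content string standing in for choices[0].message.content.
sample_content = (
    '{"languages": [{"name": "Python", "primary_use": "scripting", '
    '"year_created": 1991}]}'
)
for lang in parse_languages(sample_content):
    print(lang["name"], lang["year_created"])
```

A `json.JSONDecodeError` here would typically indicate a truncated response, in which case raising `max_tokens` is the first thing to try.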
# Step 9. Benchmark multi-turn throughput
Run the included benchmark script to measure how prefix caching improves multi-turn latency. The script is in the `assets/` directory of this playbook.
```bash
python3 -m venv .venv && source .venv/bin/activate
pip install requests
```
```bash
python3 assets/benchmark_multiturn.py \
--base-url http://localhost:30000 \
--model "$MODEL_HANDLE" \
--num-conversations 20 \
--turns-per-conversation 5
```
The script sends parallel multi-turn conversations and measures:
- **Per-turn latency** for turn 1 vs subsequent turns (shows prefix caching effect)
- **Total throughput** in tokens per second
- **Cache statistics** from server metrics
You should see latency decrease for later turns in each conversation as the shared prefix grows, demonstrating RadixAttention's cache reuse.
> [!TIP]
> If you downloaded this playbook as a zip, the `assets/` directory is already present. If you cloned the full repository, navigate to `nvidia/station-sglang-inference/` first.
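
To make the per-turn comparison concrete, the aggregation can be pictured as grouping latencies by turn number and averaging them. This is a hypothetical sketch, not the logic of `assets/benchmark_multiturn.py`; the record format and the timings are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_latency_by_turn(records):
    """records: iterable of (turn_index, latency_seconds). Returns {turn: mean latency}."""
    buckets = defaultdict(list)
    for turn, latency in records:
        buckets[turn].append(latency)
    return {turn: mean(vals) for turn, vals in sorted(buckets.items())}


# Made-up timings for two conversations of three turns each.
records = [(1, 0.5), (2, 0.25), (3, 0.25), (1, 0.75), (2, 0.25), (3, 0.5)]
for turn, avg in mean_latency_by_turn(records).items():
    print(f"turn {turn}: {avg:.3f}s")
```

Comparing the turn 1 average with later turns is one simple way to read the effect of prefix caching from such measurements.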
# Step 10. Cleanup
Stop and remove the container:
```bash
docker stop sglang-server
docker rm sglang-server
```
Optionally remove the image:
```bash
docker rmi lmsysorg/sglang:latest-cu130
```
-
id: troubleshooting
label: Troubleshooting
content: |
# Common issues
| Symptom | Cause | Fix |
|---------|-------|-----|
| "permission denied" when running docker | User not in docker group | Run `sudo usermod -aG docker $USER && newgrp docker` |
| Container fails to start with GPU error | NVIDIA Container Toolkit not configured | Run `sudo nvidia-ctk runtime configure --runtime=docker`, then restart Docker with `sudo systemctl restart docker` |
| `device >= 0 && device < num_gpus INTERNAL ASSERT FAILED` | Using `--gpus all` on a multi-GPU system | Use `--gpus '"device=N"'` to target the GB300 specifically (check index with `nvidia-smi`) |
| "Token is required" or 401 error | Missing HuggingFace token | Ensure `HF_TOKEN` is exported before running the docker command |
| Server exits with OOM error | Model too large for available GPU memory | Lower `--mem-fraction-static` (e.g., 0.7) or reduce `--context-length`. Check GPU memory with `nvidia-smi` |
| `json_schema` response_format returns error | SGLang version too old | Ensure you are using `lmsysorg/sglang:latest-cu130`. Older versions may not support `json_schema` format |
| Server starts but CUDA errors on inference | Wrong CUDA version for Blackwell | Use the `latest-cu130` image tag. SM103 requires CUDA 13.0+ |
| Model runs on wrong GPU | Default GPU selection | Use `--gpus '"device=N"'` to select the GB300 specifically |
| Slow first request after server start | Kernel JIT compilation | First request triggers kernel compilation. Subsequent requests are fast. Wait ~30 seconds |
| Connection refused on port 30000 | Server still loading model | Check `docker logs sglang-server` — wait for the Uvicorn startup message |
| `/server_info` shows no cache stats | Endpoint may differ across versions | Try `curl http://localhost:30000/v1/models` to verify the server is responsive. Cache metrics may be under `/metrics` (requires `--enable-metrics` server flag) |
> [!NOTE]
> On DGX Station, the GB300 is typically device 1 (device 0 is the RTX Pro 6000 workstation GPU). Always verify with `nvidia-smi --query-gpu=index,name --format=csv,noheader`.
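
If `/metrics` is enabled, its Prometheus-style text output can be filtered for cache-related entries. The sketch below assumes plain `name value` lines without trailing timestamps; the metric names in the sample are placeholders, since the real names depend on the SGLang version.

```python
def cache_metrics(metrics_text):
    """Return {metric name (with labels): value} for sample lines mentioning 'cache'."""
    found = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "cache" not in line:
            continue
        name, _, value = line.rpartition(" ")
        found[name] = float(value)
    return found


# Placeholder metric names; the real names depend on the SGLang version.
sample = """\
# HELP example_cache_hit_rate Placeholder help text
# TYPE example_cache_hit_rate gauge
example_cache_hit_rate{model="demo"} 0.5
example_requests_total 12
"""
print(cache_metrics(sample))
```

Feed it the body returned by `curl http://localhost:30000/metrics` once the server has been started with `--enable-metrics`.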
resources:
- name: SGLang (GitHub)
url: https://github.com/sgl-project/sglang
- name: SGLang Documentation
url: https://docs.sglang.io/
- name: SGLang OpenAI API Reference
url: https://docs.sglang.io/basic_usage/openai_api_completions.html