# LLM Inference with SGLang

> Serve LLMs with SGLang on DGX Station (Qwen3-8B default; Qwen3.6 MoE optional)—prefix-cached multi-turn, structured output, benchmarks, and inference-server guidance


## Table of Contents

- [Overview](#overview)
- [Instructions](#instructions)
  - [Example model IDs (`MODEL_HANDLE`)](#example-model-ids-modelhandle)
  - [Choosing an inference backend (DGX Station)](#choosing-an-inference-backend-dgx-station)
  - [Next steps: heavier models on Station](#next-steps-heavier-models-on-station)
- [Troubleshooting](#troubleshooting)

---

## Overview

## Basic idea

SGLang is a high-performance serving framework for large language models, optimized for workloads where requests share common prefixes — multi-turn conversations, RAG pipelines, and agentic workflows. Its core innovation, **RadixAttention**, automatically caches and reuses KV cache entries across requests using a radix tree, eliminating redundant prefill computation. SGLang also provides best-in-class **structured output generation** (JSON, regex, grammar-constrained decoding) through its xGrammar backend, running up to 3x faster than standard guided decoding.

- **RadixAttention** — Automatically reuses KV cache across requests sharing common prefixes. Multi-turn conversations and repeated system prompts skip re-computation entirely, reducing first-token latency and increasing throughput.
- **Structured output** — Compressed finite-state machine decoding with grammar mask generation overlapped with the LLM forward pass. Produces valid JSON, regex-matched, or grammar-constrained output with minimal overhead.
- **OpenAI-compatible API** — Drop-in replacement for OpenAI and vLLM endpoints. Supports `/v1/chat/completions`, `/v1/completions`, and `/v1/embeddings`.
- **Blackwell optimized** — SGLang includes optimizations for SM100+ GPUs and CUDA 13, with NVFP4 GEMM support and accelerated softmax kernels.
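To make the prefix-reuse idea concrete, here is a minimal sketch of a token-level prefix cache. It is a plain trie rather than a compressed radix tree, it stores no KV tensors, and the class and method names are hypothetical; it only illustrates how a second request that extends the first one finds its leading tokens already cached.

```python
class PrefixCache:
    """Toy token trie. Real RadixAttention maps prefixes to KV cache blocks."""

    def __init__(self):
        self.root = {}

    def match_and_insert(self, tokens):
        """Return how many leading tokens were already cached, then cache the rest."""
        node = self.root
        cached = 0
        matching = True
        for tok in tokens:
            if matching and tok in node:
                cached += 1  # this token's KV entry would be reused
            else:
                matching = False  # first miss: everything after needs prefill
                node.setdefault(tok, {})
            node = node[tok]
        return cached


cache = PrefixCache()
# Hypothetical token sequences standing in for two turns of one conversation.
turn1 = ["<sys>", "You", "are", "a", "tutor", "<user>", "Q1"]
turn2 = turn1 + ["<assistant>", "A1", "<user>", "Q2"]

print(cache.match_and_insert(turn1))  # 0: nothing cached yet
print(cache.match_and_insert(turn2))  # 7: all of turn 1 is a shared prefix
```

In this toy model only the four new tokens of the second turn would need prefill, which is the effect the `#cached-token` log field reports later in this playbook.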

## What you'll accomplish

Launch SGLang on DGX Station to serve an LLM, then exercise its two key differentiators: prefix-cached multi-turn chat and structured JSON output generation. You will also benchmark multi-turn throughput and interpret results together with server logs (wall time alone is not a reliable cache signal under parallel load).

- Serve **Qwen3-8B** (`Qwen/Qwen3-8B` by default for fast first-run validation) or another checkpoint from the in-playbook model table — including the larger **Qwen3.6 MoE** (`Qwen/Qwen3.6-35B-A3B`) once the workflow is verified
- Send multi-turn conversations and observe prefix cache hits in **Docker logs** (`#cached-token`)
- Generate structured JSON output using schema-constrained decoding
- Benchmark multi-turn throughput, optionally rerun with a **single conversation** to reduce contention, and review the full cache/metrics scrape written to a **log file**
- Optional next step: large MoE such as **DeepSeek-V4** on Station when your SGLang build and VRAM allow

## What to know before starting

- Basic Docker container usage
- Familiarity with REST APIs (curl or Python requests)

## Prerequisites

- NVIDIA DGX Station with GB300 GPU (Blackwell SM103)
- Docker installed: `docker --version`
- NVIDIA Container Toolkit configured: `nvidia-smi` should show the GB300
- HuggingFace account with access token
- Network access to HuggingFace and Docker Hub

## Ancillary files

- `assets/benchmark_multiturn.py`: benchmarks multi-turn chat under parallel load and structured JSON output, and writes the full `/server_info` and `/metrics` bodies to a detail log (the terminal shows only a short summary)

## Time & risk

* **Duration:** 20–30 minutes for the default `Qwen/Qwen3-8B`; 45–60 minutes if you switch to `Qwen/Qwen3.6-35B-A3B` (download + Blackwell CUDA-graph capture)
* **Risks:** Gated models (e.g., Llama 3.3) require HuggingFace authentication and license acceptance
* **Rollback:** Stop and remove the container to restore state
* **Last Updated:** 05/26/2026
  * First Publication

## Instructions

## Step 1. Set up Docker permissions

If you haven't already, add your user to the docker group to run Docker without sudo:

```bash
sudo usermod -aG docker $USER
newgrp docker
```

## Step 2. Set up environment variables

```bash
## HuggingFace token (only required for gated models such as Llama 3.3).
## Leave empty for public models like Qwen3-8B; for gated models get a token at
## https://huggingface.co/settings/tokens.
export HF_TOKEN=""

## Model to serve (see **Example model IDs** below).
## Default uses Qwen3-8B for fast first-run validation (~10–15 min boot on Station).
## Switch to Qwen3.6-35B-A3B once the workflow is working end-to-end.
export MODEL_HANDLE="Qwen/Qwen3-8B"

## Maximum context length
export MAX_MODEL_LEN=8192
```

### Example model IDs (`MODEL_HANDLE`)

Use any **Hugging Face text-generation or chat** checkpoint that your SGLang build supports. The table below lists common starting points on DGX Station; always check the model card for **license / gated access**, **VRAM**, and **context length**.

| Model ID | Notes |
|----------|--------|
| `Qwen/Qwen3-8B` | **Default in this playbook.** Dense Qwen3 8B; ~16 GB download, fast warmup, ideal for validating the workflow end-to-end. |
| `Qwen/Qwen3.6-35B-A3B` | Qwen3.6 MoE (~3B active parameters per token); strong quality per GPU hour on Blackwell. ~70 GB download; allow ~30–45 min to first request. |
| `Qwen/Qwen3.6-27B` | Dense Qwen3.6; higher VRAM than the MoE row above at equal batch settings. |
| `google/gemma-3-12b-it` | Popular Gemma 3 instruct (text + vision in full stack; chat API usage is typically text-only). |
| `google/gemma-3-27b-it` | Larger Gemma 3 instruct variant. |
| `meta-llama/Llama-3.3-70B-Instruct` | Llama 3.3 70B instruct (gated on Hugging Face; accept the license in the model card before download). |

Heavyweight MoE (very large weights; confirm **SGLang version + GPU memory** before serving):

| Model ID | Notes |
|----------|--------|
| `deepseek-ai/DeepSeek-V4-Flash` | DeepSeek-V4 family (MoE). Intended to showcase **large local models on Station**; expect long downloads, strict VRAM headroom, and possible extra flags per SGLang docs. |
| `deepseek-ai/DeepSeek-V4-Pro` | Larger V4 variant; only if you have sufficient GPU memory and a supported SGLang build. |

### Choosing an inference backend (DGX Station)

Several OpenAI-compatible servers run well on NVIDIA hardware. None is universally “best”—pick by workload shape and operational constraints.

| Backend | Strengths | Typical “use this when…” |
|---------|-----------|---------------------------|
| **SGLang** | RadixAttention for shared-prefix workloads; strong structured / grammar decoding; active Blackwell + CUDA 13 paths. | Highly **multi-turn**, **RAG** (repeated system + documents), **agents**, or **schema-constrained** JSON at scale. |
| **vLLM** | Mature PagedAttention, broad model coverage, common default in examples. | You want a **well-trodden** OSS server with **maximum community recipes** and straightforward PagedAttention behavior. |
| **TensorRT-LLM** | NVIDIA-optimized kernels and quantization workflows for throughput-focused deployment. | You are **productionizing** on NVIDIA GPUs and can invest in **TensorRT-LLM export / engines** for peak throughput. |

This playbook focuses on **SGLang**; consult each project’s documentation for model support matrices and quantization modes.

## Step 3. Pull the SGLang container

Pull the SGLang container image with CUDA 13.0 support (required for Blackwell SM103):

```bash
docker pull lmsysorg/sglang:latest-cu130
```

## Step 4. Identify the GB300 GPU

Identify the GB300's device index:

```bash
nvidia-smi --query-gpu=index,name --format=csv,noheader
```

Look for the row showing `NVIDIA GB300`. Note its index — on DGX Station the GB300 may be at index `0` or `1` depending on configuration. If `nvidia-smi` shows only a single GB300, you can simply use `--gpus all` in the next step.
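If you prefer to pick the index programmatically, the following sketch parses the CSV output of the command above. The function names and the sample text are hypothetical (the second GPU name is a placeholder, not a real product string); only the `index, name` layout produced by `--format=csv,noheader` is assumed.

```python
import subprocess


def find_gpu_indices(csv_text, name_fragment="GB300"):
    """Return the indices of GPUs whose name contains name_fragment."""
    indices = []
    for line in csv_text.strip().splitlines():
        index, _, name = line.partition(",")
        if name_fragment in name:
            indices.append(int(index.strip()))
    return indices


def query_gpus():
    """Run the same nvidia-smi query as the step above and return its stdout."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Hypothetical output from a Station that also exposes a second GPU.
sample = "0, Example Workstation GPU\n1, NVIDIA GB300\n"
print(find_gpu_indices(sample))  # [1]
```

On the Station itself, `find_gpu_indices(query_gpus())` would return the index to use in Step 5.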

## Step 5. Start SGLang server

Launch the SGLang server. The flags below are tuned for GB300 (Blackwell SM103) — see notes after the command:

```bash
## Use --gpus all on a single-GPU Station, or --gpus '"device=N"' with the
## index from Step 4 if multiple GPUs are present.
docker run -d \
  --name sglang-server \
  --gpus all \
  --ipc host \
  --cap-add SYS_NICE \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 30000:30000 \
  -e HF_TOKEN="$HF_TOKEN" \
  -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
  lmsysorg/sglang:latest-cu130 \
  sglang serve --model-path "$MODEL_HANDLE" \
    --host 0.0.0.0 \
    --port 30000 \
    --context-length $MAX_MODEL_LEN \
    --mem-fraction-static 0.85 \
    --attention-backend flashinfer \
    --enable-cache-report
```

> [!IMPORTANT]
> **Why these flags on GB300:**
> - `--attention-backend flashinfer` — the auto-selected `trtllm_mha` backend currently fails CUDA-graph capture on Blackwell SM103 with `buildNdTmaDescriptor` errors; `fa3` is also rejected (it requires SM ≤ 90). FlashInfer is the safe default.
> - `--cap-add SYS_NICE` — lets SGLang set NUMA affinity; otherwise the server logs a warning on every launch.
> - `--enable-cache-report` — populates `usage.prompt_tokens_details.cached_tokens` in OpenAI-style responses so the benchmark in Step 9 can report cached prefill tokens.

Check the server logs:

```bash
docker logs -f sglang-server
```

Wait for the server to show it is ready:

```
INFO:     Uvicorn running on http://0.0.0.0:30000
```

Press `Ctrl+C` to exit the log view.
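For scripted setups, watching the logs by hand can be replaced with a polling loop. This is a minimal sketch under assumptions: the helper names are hypothetical, and it assumes the server answers `GET /v1/models` with HTTP 200 once it is ready (adjust the URL if your SGLang build exposes a different readiness endpoint).

```python
import time
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, and connection resets are all OSError subclasses.
        return False


def wait_until_ready(url, probe=http_ok, timeout_s=3600.0, interval_s=10.0,
                     sleep=time.sleep):
    """Poll probe(url) until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Example call (not executed here; it blocks until the server responds):
# wait_until_ready("http://localhost:30000/v1/models")
```

The generous default timeout reflects the first-launch times quoted in this step; the `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.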

> [!NOTE]
> First launch downloads the model and captures CUDA graphs. Plan for ~10–15 min for `Qwen/Qwen3-8B` and ~30–45 min for `Qwen/Qwen3.6-35B-A3B` before the server is ready. Subsequent starts are faster thanks to cached weights and compiled artifacts.

## Step 6. Test basic inference

Send a chat completion request using the OpenAI-compatible API:

```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$MODEL_HANDLE"'",
    "messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}],
    "max_tokens": 256
  }'
```

The response follows the standard OpenAI format with a `choices` array containing the model's answer.
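The same request can be sent from Python with only the standard library. This is a sketch mirroring the `curl` call above; the helper names are hypothetical, and the fallback model ID is simply this playbook's default.

```python
import json
import os
import urllib.request


def build_chat_payload(model, user_text, max_tokens=256):
    """Build the same JSON body as the curl example."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
    }


def post_chat(base_url, payload, timeout=120.0):
    """POST a chat completion request and return the decoded JSON response."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def first_answer(response):
    """Extract the model's answer from the choices array."""
    return response["choices"][0]["message"]["content"]


model = os.environ.get("MODEL_HANDLE", "Qwen/Qwen3-8B")
payload = build_chat_payload(model, "Explain quantum computing in simple terms.")
print(json.dumps(payload, indent=2))
# With the server from Step 5 running:
# print(first_answer(post_chat("http://localhost:30000", payload)))
```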

## Step 7. Multi-turn conversation with prefix caching

SGLang's RadixAttention automatically keeps the KV cache entries for tokens it has already processed. When follow-up messages share the same conversation prefix, the cached entries are reused, so prefill is skipped for all previously seen tokens.

Send a multi-turn conversation. The system prompt is deliberately long so the shared prefix exceeds SGLang's page size (64 tokens), which is the minimum unit for cache reuse:

```bash
## Turn 1
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$MODEL_HANDLE"'",
    "messages": [
      {"role": "system", "content": "You are an expert physics tutor who explains concepts clearly and concisely. You use real-world analogies and everyday examples to make abstract ideas concrete. When answering, first state the key concept in one sentence, then give a short explanation with an example."},
      {"role": "user", "content": "What is the difference between speed and velocity?"}
    ],
    "max_tokens": 256
  }' | python3 -m json.tool

## Turn 2 — extends the same conversation
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$MODEL_HANDLE"'",
    "messages": [
      {"role": "system", "content": "You are an expert physics tutor who explains concepts clearly and concisely. You use real-world analogies and everyday examples to make abstract ideas concrete. When answering, first state the key concept in one sentence, then give a short explanation with an example."},
      {"role": "user", "content": "What is the difference between speed and velocity?"},
      {"role": "assistant", "content": "Speed is a scalar quantity that measures how fast an object moves, while velocity is a vector quantity that includes both speed and direction. For example, a car driving at 60 km/h has a speed of 60 km/h regardless of where it is headed. But if that car is driving 60 km/h north, that is its velocity — change direction to south and the velocity changes even though the speed stays the same."},
      {"role": "user", "content": "Can you give me another example that shows why the distinction matters in real physics problems?"}
    ],
    "max_tokens": 256
  }' | python3 -m json.tool
```

The second request **reuses the KV cache for the shared prefix** (system message + first user turn + assistant response) via RadixAttention, so repeated **prefill** work on that prefix is avoided. **End-to-end HTTP latency can still go up** on later turns: the transcript is longer (more tokens to attend to even with cache hits on the prefix), each assistant reply adds decode work, and the client measures full request time—not prefill alone.
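Prefix reuse depends on each follow-up request resending the earlier messages unchanged. A minimal sketch of that bookkeeping, with hypothetical helper names and shortened placeholder message text:

```python
def next_turn(history, assistant_reply, user_text):
    """Return a new message list that extends history without altering it."""
    return history + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": user_text},
    ]


def shared_prefix_len(a, b):
    """Count how many leading messages two requests have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Placeholder messages; the real prompts are in the curl example above.
turn1 = [
    {"role": "system", "content": "You are an expert physics tutor."},
    {"role": "user", "content": "What is the difference between speed and velocity?"},
]
turn2 = next_turn(turn1, "Speed is a scalar; velocity is a vector.",
                  "Can you give me another example?")

print(shared_prefix_len(turn1, turn2))  # 2: system + first user message
```

Editing an earlier message (even whitespace in the system prompt) shortens the shared prefix, so a client that wants cache hits should append to the history rather than rebuild it.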

Check cache reuse in the server logs. SGLang logs each prefill batch with the number of cached tokens reused:

```bash
docker logs sglang-server 2>&1 | grep "cached-token" | tail -10
```

Look for `#cached-token` values greater than 0 on later turns — this confirms RadixAttention is reusing the KV cache from the shared prefix. Treat that as the primary signal of prefix caching; wall-clock `curl` latency alone can be misleading.
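The same check can be scripted. The sketch below pulls the numbers out of log text; only the `#cached-token` marker comes from this playbook, while the surrounding line layout and the counts in the sample are assumptions for illustration.

```python
import re

# Matches the "#cached-token: <n>" field wherever it appears on a line.
CACHED_RE = re.compile(r"#cached-token:\s*(\d+)")


def cached_token_counts(log_text):
    """Return the cached-token count from every matching log line, in order."""
    return [int(m.group(1)) for m in CACHED_RE.finditer(log_text)]


# Hypothetical log excerpt; real lines carry more fields and other values.
sample_log = """\
Prefill batch. #new-seq: 1, #new-token: 75, #cached-token: 0
Prefill batch. #new-seq: 1, #new-token: 110, #cached-token: 64
"""

counts = cached_token_counts(sample_log)
print(counts)                      # [0, 64]
print(any(c > 0 for c in counts))  # True: at least one prefill reused the cache
```

To run it against the live container, feed it the combined output of `docker logs sglang-server 2>&1`.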

## Step 8. Structured JSON output

SGLang's constrained decoding guarantees valid JSON output matching a schema. This uses the xGrammar backend to overlap grammar mask generation with the model's forward pass, adding minimal latency.
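As a rough illustration of what "constrained decoding" means, the toy below masks out tokens that a tiny grammar forbids before picking the next token. It is not xGrammar and not SGLang code: the vocabulary, the two-state "one or more digits" grammar, and the scoring function are all made up for the example.

```python
DIGITS = set("0123456789")
VOCAB = sorted(DIGITS) + ["a", "b", "<eos>"]


def allowed(state):
    """Tokens the toy grammar permits: at least one digit, then digits or stop."""
    if state == "start":
        return DIGITS
    return DIGITS | {"<eos>"}


def constrained_decode(score_fn, max_steps=8):
    """Greedy decoding where grammar-forbidden tokens are masked out."""
    state, out = "start", []
    for _ in range(max_steps):
        scores = score_fn(out)  # stand-in for the model's logits
        mask = allowed(state)
        token = max((t for t in VOCAB if t in mask),
                    key=lambda t: scores.get(t, float("-inf")))
        if token == "<eos>":
            break
        out.append(token)
        state = "digits"
    return "".join(out)


def toy_scores(prefix):
    """Hypothetical model that would prefer the letter 'a' if unconstrained."""
    if len(prefix) >= 2:
        return {"<eos>": 5.0, "a": 4.0, "4": 1.0}
    return {"a": 9.0, "4": 3.0 - 2 * len(prefix), "2": 2.0}


print(constrained_decode(toy_scores))  # 42
```

The unconstrained favorite (`a`) is never emitted because the mask removes it at every step; a JSON schema works the same way in principle, with a far larger grammar.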

Generate a structured response:

```bash
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$MODEL_HANDLE"'",
    "messages": [
      {"role": "user", "content": "List three programming languages with their primary use case and year created."}
    ],
    "max_tokens": 512,
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "languages",
        "schema": {
          "type": "object",
          "properties": {
            "languages": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "name": {"type": "string"},
                  "primary_use": {"type": "string"},
                  "year_created": {"type": "integer"}
                },
                "required": ["name", "primary_use", "year_created"]
              }
            }
          },
          "required": ["languages"]
        }
      }
    }
  }' | python3 -m json.tool
```

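The response content is valid JSON matching the provided schema, as long as generation finishes on its own rather than being cut off by `max_tokens` (check that `finish_reason` is `stop`). Parse the `choices[0].message.content` field; it contains the JSON object as a string.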
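A client will usually parse the content and sanity-check it before use. The sketch below does a hand-written check against the schema from this step using only the standard library; the function name and the sample content are hypothetical.

```python
import json

# Field names and types taken from the schema in the request above.
REQUIRED = {"name": str, "primary_use": str, "year_created": int}


def parse_languages(content):
    """Parse message content and verify it has the shape the schema requires."""
    data = json.loads(content)
    languages = data.get("languages") if isinstance(data, dict) else None
    if not isinstance(languages, list):
        raise ValueError("missing 'languages' array")
    for item in languages:
        if not isinstance(item, dict):
            raise ValueError("array items must be objects")
        for key, expected in REQUIRED.items():
            value = item.get(key)
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, expected):
                raise ValueError(f"bad or missing field: {key}")
    return languages


# Hypothetical content string, as it would appear in choices[0].message.content.
sample = '{"languages": [{"name": "Python", "primary_use": "scripting", "year_created": 1991}]}'
print(parse_languages(sample)[0]["name"])  # Python
```

A truncated response fails at `json.loads`, which is a cheap way to detect that `max_tokens` was too small for the requested structure.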

## Step 9. Benchmark multi-turn throughput

This step uses `benchmark_multiturn.py` from this playbook's `assets/` directory. Clone (or download) the playbook repository first so the script is available locally:

```bash
git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-spark-playbooks/nvidia/station-sglang-inference
```

> [!TIP]
> If `git` is not available, download the repository as a ZIP from the [playbook repository](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/station-sglang-inference/) and extract it. All commands below assume your working directory is the playbook root (`dgx-spark-playbooks/nvidia/station-sglang-inference/`), so `assets/benchmark_multiturn.py` resolves correctly.

The benchmark stress-tests the server with **parallel** conversations (default: 20) and reports **per-turn wall time**, token counts, and (when the API exposes it) **cached prefill tokens**.
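The general shape of such a parallel multi-turn load test can be sketched as follows. This is not the shipped `benchmark_multiturn.py`; the function names are hypothetical, and `fake_send` stands in for a real HTTP call so the example runs without a server.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_conversation(conv_id, turns, send):
    """Play one conversation and return the wall time of each turn in seconds."""
    messages, timings = [], []
    for turn in range(turns):
        messages.append({"role": "user",
                         "content": f"conversation {conv_id}, question {turn + 1}"})
        start = time.perf_counter()
        reply = send(messages)  # the full history is resent on every turn
        timings.append(time.perf_counter() - start)
        messages.append({"role": "assistant", "content": reply})
    return timings


def run_parallel(num_conversations, turns, send):
    """Run all conversations concurrently, one worker thread per conversation."""
    with ThreadPoolExecutor(max_workers=num_conversations) as pool:
        futures = [pool.submit(run_conversation, i, turns, send)
                   for i in range(num_conversations)]
        return [f.result() for f in futures]


def fake_send(messages):
    """Stand-in for a POST to /v1/chat/completions."""
    return f"reply to {len(messages)} messages"


results = run_parallel(num_conversations=4, turns=3, send=fake_send)
print(len(results), [len(r) for r in results])  # 4 [3, 3, 3, 3]
```

Because every conversation competes for the same GPU, per-turn wall time in a run like this reflects queueing as well as prefill and decode work.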

Install the `requests` dependency. The **virtualenv approach below is the preferred, default installation path** — it keeps the script's dependencies isolated from the system Python interpreter so you cannot accidentally damage Ubuntu's own Python packages. Ubuntu 24.04 on DGX Station does not ship `python3-venv` by default, so install it once before creating the virtualenv:

```bash
sudo apt update && sudo apt install -y python3-venv
python3 -m venv .venv && source .venv/bin/activate
pip install requests
```

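If you cannot run `sudo apt install python3-venv` (for example, a locked-down host), the next safest option is a **per-user install** that still respects PEP 668. Because it respects that guard, `pip` may refuse the install with an `externally-managed-environment` error on Ubuntu 24.04: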

```bash
python3 -m pip install --user requests
```

> [!CAUTION]
> **Last-resort only — `--break-system-packages` can damage your system Python.**
> Ubuntu 24.04 ships an "externally managed" system Python (PEP 668). The `--break-system-packages` flag tells `pip` to ignore that guard and install into the system or per-user site-packages anyway. This can shadow or conflict with packages installed by `apt` and break system tooling that depends on them. Only use this command when both the **venv** and plain **`--user`** paths above are unavailable, and only if you are willing to take that risk on the host you are running on:
> ```bash
> python3 -m pip install --user --break-system-packages requests
> ```

With `requests` installed, run the benchmark with 20 parallel conversations of 5 turns each:

```bash
python3 assets/benchmark_multiturn.py \
  --base-url http://localhost:30000 \
  --model "$MODEL_HANDLE" \
  --num-conversations 20 \
  --turns-per-conversation 5 \
  --cache-detail-file ./sglang_benchmark_cache_details.log
```

The script prints:
- **Median / P90 wall time per turn** — often **increases** as prompts grow and under parallel load; that does **not** contradict RadixAttention.
- **Median prompt tokens per turn** — should climb as history lengthens.
- **Median cached prefill tokens** (when returned in `usage`) — populated by `--enable-cache-report` (already set in Step 5); this is the primary cache signal from the OpenAI-style `usage` payload.
- A **short summary** of cache-related `/server_info` or `/metrics` lines; the **full** responses are written to `--cache-detail-file` (default `./sglang_benchmark_cache_details.log`) so you are not flooded with an unparsed metrics blob in the terminal.
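For reference, median and P90 can be computed from a list of per-turn wall times as sketched below. The shipped script may use a different percentile definition; this example uses the nearest-rank method, and the timings are made-up values in milliseconds.

```python
import statistics


def p90_nearest_rank(values):
    """90th percentile by nearest rank: the value at position ceil(0.9 * n)."""
    ordered = sorted(values)
    rank = (9 * len(ordered) + 9) // 10  # integer form of ceil(0.9 * n)
    return ordered[rank - 1]


def summarize_turn(wall_times_ms):
    """Return the median and P90 for one turn across all conversations."""
    return {
        "median": statistics.median(wall_times_ms),
        "p90": p90_nearest_rank(wall_times_ms),
    }


# Hypothetical wall times (ms) for the same turn in ten conversations.
turn_times_ms = [800, 1100, 900, 1400, 1000, 2500, 1200, 950, 1050, 1300]
print(summarize_turn(turn_times_ms))  # {'median': 1075.0, 'p90': 1400}
```

The single slow outlier (2500 ms) does not move the median and sits above P90 here, which is why these two statistics are more robust than a mean under parallel load.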

> [!NOTE]
> The Step 5 launch enables `--enable-cache-report` (which fills `usage.prompt_tokens_details.cached_tokens`) but does **not** enable the Prometheus `/metrics` endpoint, since cached-prefill data is already exposed through `usage` and the `docker logs` `#cached-token` lines. If `/metrics` returns `404`/empty in the detail log, that is expected — the benchmark's primary cache signals (`usage.prompt_tokens_details.cached_tokens` and Docker logs) still work. To populate `/metrics` as well, add `--enable-metrics` to the `sglang serve` invocation in Step 5 and restart the container.

To isolate prefix-cache behavior from multi-client contention, rerun with a single conversation:

```bash
python3 assets/benchmark_multiturn.py \
  --base-url http://localhost:30000 \
  --model "$MODEL_HANDLE" \
  --num-conversations 1 \
  --turns-per-conversation 5
```

Always correlate behavior with **`docker logs`** (`#cached-token` lines) as in Step 7.
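Alongside the log lines, the cached-token count can be read from each response's `usage` payload when `--enable-cache-report` is set. A minimal sketch with hypothetical helper names and a made-up response; the field path is the one named in this playbook.

```python
def cached_prefill_tokens(response):
    """Return usage.prompt_tokens_details.cached_tokens, or None if absent."""
    usage = response.get("usage") or {}
    details = usage.get("prompt_tokens_details") or {}
    return details.get("cached_tokens")


def cache_hit_ratio(response):
    """Fraction of prompt tokens served from cache, or None if unavailable."""
    cached = cached_prefill_tokens(response)
    prompt = (response.get("usage") or {}).get("prompt_tokens")
    if cached is None or not prompt:
        return None
    return cached / prompt


# Hypothetical response fragment for a later turn of a conversation.
later_turn = {"usage": {"prompt_tokens": 200,
                        "prompt_tokens_details": {"cached_tokens": 150}}}

print(cached_prefill_tokens(later_turn))      # 150
print(cache_hit_ratio(later_turn))            # 0.75
print(cached_prefill_tokens({"usage": {}}))   # None: cache report not enabled
```

Returning `None` instead of `0` keeps "the server did not report it" distinct from "nothing was cached", which matches the `n/a` the benchmark prints in that case.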

### Next steps: heavier models on Station

To stress GPU memory and throughput after completing the steps above, point `MODEL_HANDLE` at a larger checkpoint (for example `deepseek-ai/DeepSeek-V4-Flash`), **lower** `--mem-fraction-static` if you hit OOM, and reduce `--context-length` until the server starts cleanly. Confirm your SGLang image version supports the architecture (see [SGLang documentation](https://docs.sglang.io/)) and accept any **gated** model licenses on Hugging Face before pulling weights.
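When deciding how far to reduce `--context-length`, a back-of-the-envelope KV cache estimate can help. The sketch below uses the usual formula for standard or grouped-query attention (one key and one value vector per layer per KV head); models with a different attention design, such as compressed-KV variants, will not follow it. The layer count, KV head count, and head dimension are illustrative placeholders, so read the real values from the model's `config.json`.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies: K and V for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_gib(num_layers, num_kv_heads, head_dim, tokens, dtype_bytes=2):
    """KV cache size in GiB for a given number of cached tokens."""
    per_token = kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes)
    return per_token * tokens / 2**30


# Illustrative architecture values (assumptions, not taken from this playbook).
layers, kv_heads, head_dim = 36, 8, 128

print(kv_bytes_per_token(layers, kv_heads, head_dim))        # 147456
print(kv_cache_gib(layers, kv_heads, head_dim, tokens=8192))  # 1.125
```

The estimate scales linearly with token count, so halving the context length halves the per-sequence KV footprint; weights are a separate, fixed cost on top.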

## Step 10. Cleanup

Stop and remove the container:

```bash
docker stop sglang-server
docker rm sglang-server
```

Optionally remove the image:

```bash
docker rmi lmsysorg/sglang:latest-cu130
```

## Troubleshooting

## Common issues

| Symptom | Cause | Fix |
|---------|-------|-----|
| "permission denied" when running docker | User not in docker group | Run `sudo usermod -aG docker $USER && newgrp docker` |
| Container fails to start with GPU error | NVIDIA Container Toolkit not configured | Run `sudo nvidia-ctk runtime configure --runtime=docker`, then restart Docker with `sudo systemctl restart docker` |
| `device >= 0 && device < num_gpus INTERNAL ASSERT FAILED` | `--gpus '"device=N"'` index does not exist on this Station | Re-run `nvidia-smi --query-gpu=index,name --format=csv,noheader` and use the actual GB300 index, or `--gpus all` if there is only one GPU |
| `RuntimeError: ... buildNdTmaDescriptor ... Check failed: false` during CUDA-graph capture | Default `trtllm_mha` attention backend is incompatible with Blackwell SM103 | Pass `--attention-backend flashinfer` to `sglang serve` |
| `AssertionError: FlashAttention v3 Backend requires SM>=80 and SM<=90` | `--attention-backend fa3` selected on Blackwell (SM103) | Use `--attention-backend flashinfer` instead |
| `User lacks permission to set NUMA affinity ... try adding --cap-add SYS_NICE` warning | Docker dropped the `SYS_NICE` capability | Add `--cap-add SYS_NICE` to the `docker run` command |
| `python3 -m venv .venv` fails with `apt install python3.12-venv` hint | Ubuntu 24.04 ships without `python3-venv` | Run `sudo apt update && sudo apt install -y python3-venv`. If you cannot install it, follow the fallback options in Step 9, treating `--break-system-packages` as a last resort |
| "Token is required" or 401 error | Missing HuggingFace token for a gated model | Export `HF_TOKEN` before running the docker command and accept the model license on huggingface.co |
| Server exits with OOM error | Model too large for available GPU memory | Lower `--mem-fraction-static` (e.g., 0.7) or reduce `--context-length`. Check GPU memory with `nvidia-smi` |
| `json_schema` response_format returns error | SGLang version too old | Ensure you are using `lmsysorg/sglang:latest-cu130`. Older versions may not support `json_schema` format |
| Server starts but CUDA errors on inference | Wrong CUDA version for Blackwell | Use the `latest-cu130` image tag. SM103 requires CUDA 13.0+ |
| Slow first request after server start | Kernel JIT + CUDA-graph capture | First launch can take 10–15 min for `Qwen/Qwen3-8B` and 30–45 min for `Qwen/Qwen3.6-35B-A3B` before the server prints "fired up and ready to roll!". Subsequent requests are fast. |
| Connection refused on port 30000 | Server still loading model or capturing CUDA graphs | Check `docker logs sglang-server` — wait for the Uvicorn startup message and "The server is fired up and ready to roll!" |
| `Med cached prefill` column is `n/a` in the benchmark | OpenAI-style `cached_tokens` not enabled on the server | Add `--enable-cache-report` to `sglang serve` so `usage.prompt_tokens_details.cached_tokens` is populated |
| `/server_info` body floods the benchmark "cache highlights" output | Older `benchmark_multiturn.py` matched any line containing "cache" — including the single-line `/server_info` JSON | Use the version of `benchmark_multiturn.py` shipped with this playbook (it skips JSON blobs and lines longer than 200 chars); the full body is still saved to `--cache-detail-file` |
| Benchmark shows **higher** median latency on later turns | Expected under parallel load + longer transcripts | RadixAttention reduces repeated **prefill** on shared prefixes—use `docker logs` (`#cached-token`) and optionally `--num-conversations 1`. See Step 9 and `sglang_benchmark_cache_details.log` |
| `deepseek-ai/DeepSeek-V4-*` fails to load | Unsupported in this SGLang build or insufficient VRAM | Check [SGLang docs](https://docs.sglang.io/) for model support; try `DeepSeek-V4-Flash` before Pro; lower `--mem-fraction-static` and `--context-length` |

> [!NOTE]
> On DGX Station the GB300 may be at device `0` or `1` depending on configuration (some Stations also expose a workstation GPU at `0`). Always verify with `nvidia-smi --query-gpu=index,name --format=csv,noheader` before launching the container.