sarman/dgx-spark-playbooks

Fork 0

mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git synced 2026-06-24 15:19:30 +00:00

GitLab CI 1d1a95b3cb chore: Regenerate all playbooks

2026-05-30 11:49:27 +00:00

4.1 KiB

Raw Blame History

name

description

metadata

sglang-setup

Deploy an SGLang inference server on an NVIDIA DGX Station GB300 with the cu130 container, RadixAttention prefix caching, and structured JSON output support. Use when the user asks to serve a model with SGLang, start an SGLang endpoint, or needs structured-output inference on DGX Station.

publisher	hardware
nvidia	DGX Station GB300

SGLang Setup on DGX Station

Deploy an SGLang inference server on DGX Station with validated configuration.

Steps

Find the GB300 GPU index. Run:
```
nvidia-smi --query-gpu=index,name --format=csv,noheader
```
Identify the device index for the GB300 (typically device 1). Use this index for --gpus below. Do NOT use --gpus all — mixed coherency will cause CUDA failures.
Ask the user which model to serve. If they don't have a preference, suggest:
- Qwen/Qwen3-8B — small, fast, good for testing
- Qwen/Qwen3-32B — medium, good balance
- meta-llama/Llama-3.1-70B-Instruct — large general-purpose
Check if the user has an HF_TOKEN. Pass inline with -e HF_TOKEN="...".

Deploy the container. Use this validated configuration:

docker pull lmsysorg/sglang:latest-cu130

docker run -d \
  --name sglang-server \
  --gpus '"device=<GB300_INDEX>"' \
  --ipc host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 30000:30000 \
  -e HF_TOKEN="<TOKEN>" \
  -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
  lmsysorg/sglang:latest-cu130 \
  sglang serve --model-path "<MODEL>" \
    --host 0.0.0.0 \
    --port 30000 \
    --context-length 32768 \
    --mem-fraction-static 0.85

Container version: Use lmsysorg/sglang:latest-cu130. The cu130 tag is required for Blackwell SM103 support.

First launch downloads the model and compiles kernels. This takes extra time — subsequent starts are faster.

Wait for the server to be ready. Monitor logs:
```
docker logs -f sglang-server
```

Test the server:

curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<MODEL>",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64
  }'

Report the result to the user, including:
- Model loaded and serving on port 30000
- How to stop: docker stop sglang-server && docker rm sglang-server

Key features

RadixAttention — automatic KV cache reuse across requests sharing prefixes. On by default, no flag needed. Verify with: docker logs sglang-server 2>&1 | grep "cached-token" | tail -5
Structured JSON output — use response_format.json_schema in API requests for guaranteed valid JSON.
Chunked prefill — add --chunked-prefill-size 8192 to break long prefills into chunks, reducing time-to-first-token.

Tuning parameters

Parameter	Default	Agent workloads	Throughput workloads
`--context-length`	32768	32768-65536	8192-16384
`--mem-fraction-static`	0.85	0.80-0.85	0.85-0.88
`--chunked-prefill-size`	off	4096-8192	8192
`--enable-metrics`	off	Optional	Recommended

Structured output example

curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<MODEL>",
    "messages": [{"role": "user", "content": "List three programming languages."}],
    "max_tokens": 512,
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "languages",
        "schema": {
          "type": "object",
          "properties": {
            "languages": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "name": {"type": "string"},
                  "primary_use": {"type": "string"}
                },
                "required": ["name", "primary_use"]
              }
            }
          },
          "required": ["languages"]
        }
      }
    }
  }'

4.1 KiB Raw Blame History