dgx-spark-playbooks/nvidia/station-ai-skills/assets/skills/sglang-setup/SKILL.md

---
name: sglang-setup
description: Deploy an SGLang inference server on an NVIDIA DGX Station GB300 with the cu130 container, RadixAttention prefix caching, and structured JSON output support. Use when the user asks to serve a model with SGLang, start an SGLang endpoint, or needs structured-output inference on DGX Station.
metadata:
  publisher: nvidia
  hardware: DGX Station GB300
---

# SGLang Setup on DGX Station

Deploy an SGLang inference server on DGX Station with validated configuration.

## Steps

1. **Find the GB300 GPU index.** Run:
   ```bash
   nvidia-smi --query-gpu=index,name --format=csv,noheader
   ```
   Identify the device index for the GB300 (typically device 1). Use this index for `--gpus` below. Do NOT use `--gpus all` — mixed coherency will cause CUDA failures.

2. **Ask the user which model to serve.** If they don't have a preference, suggest:
   - `Qwen/Qwen3-8B` — small, fast, good for testing
   - `Qwen/Qwen3-32B` — medium, good balance
   - `meta-llama/Llama-3.1-70B-Instruct` — large general-purpose

3. **Check if the user has an HF_TOKEN.** Pass inline with `-e HF_TOKEN="..."`.

4. **Deploy the container.** Use this validated configuration:

   ```bash
   docker pull lmsysorg/sglang:latest-cu130

   docker run -d \
     --name sglang-server \
     --gpus '"device=<GB300_INDEX>"' \
     --ipc host \
     --ulimit memlock=-1 \
     --ulimit stack=67108864 \
     -p 30000:30000 \
     -e HF_TOKEN="<TOKEN>" \
     -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
     lmsysorg/sglang:latest-cu130 \
     sglang serve --model-path "<MODEL>" \
       --host 0.0.0.0 \
       --port 30000 \
       --context-length 32768 \
       --mem-fraction-static 0.85
   ```

   **Container version:** Use `lmsysorg/sglang:latest-cu130`. The `cu130` tag is required for Blackwell SM103 support.

   **First launch** downloads the model and compiles kernels. This takes extra time — subsequent starts are faster.

5. **Wait for the server to be ready.** Monitor logs:
   ```bash
   docker logs -f sglang-server
   ```

6. **Test the server:**
   ```bash
   curl http://localhost:30000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
       "model": "<MODEL>",
       "messages": [{"role": "user", "content": "Hello"}],
       "max_tokens": 64
     }'
   ```

7. **Report the result** to the user, including:
   - Model loaded and serving on port 30000
   - How to stop: `docker stop sglang-server && docker rm sglang-server`

## Key features

- **RadixAttention** — automatic KV cache reuse across requests sharing prefixes. On by default, no flag needed. Verify with: `docker logs sglang-server 2>&1 | grep "cached-token" | tail -5`
- **Structured JSON output** — use `response_format.json_schema` in API requests for guaranteed valid JSON.
- **Chunked prefill** — add `--chunked-prefill-size 8192` to break long prefills into chunks, reducing time-to-first-token.

## Tuning parameters

| Parameter | Default | Agent workloads | Throughput workloads |
|-----------|---------|-----------------|---------------------|
| `--context-length` | 32768 | 32768-65536 | 8192-16384 |
| `--mem-fraction-static` | 0.85 | 0.80-0.85 | 0.85-0.88 |
| `--chunked-prefill-size` | off | 4096-8192 | 8192 |
| `--enable-metrics` | off | Optional | Recommended |

## Structured output example

```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<MODEL>",
    "messages": [{"role": "user", "content": "List three programming languages."}],
    "max_tokens": 512,
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "languages",
        "schema": {
          "type": "object",
          "properties": {
            "languages": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "name": {"type": "string"},
                  "primary_use": {"type": "string"}
                },
                "required": ["name", "primary_use"]
              }
            }
          },
          "required": ["languages"]
        }
      }
    }
  }'
```
chore: Regenerate all playbooks 2026-05-30 11:49:27 +00:00			`---`
			`name: sglang-setup`
			`description: Deploy an SGLang inference server on an NVIDIA DGX Station GB300 with the cu130 container, RadixAttention prefix caching, and structured JSON output support. Use when the user asks to serve a model with SGLang, start an SGLang endpoint, or needs structured-output inference on DGX Station.`
			`metadata:`
			`publisher: nvidia`
			`hardware: DGX Station GB300`
			`---`

			`# SGLang Setup on DGX Station`

			`Deploy an SGLang inference server on DGX Station with validated configuration.`

			`## Steps`

			`1. Find the GB300 GPU index. Run:`
			```bash
			`nvidia-smi --query-gpu=index,name --format=csv,noheader`
			```
			Identify the device index for the GB300 (typically device 1). Use this index for `--gpus` below. Do NOT use `--gpus all` — mixed coherency will cause CUDA failures.

			`2. Ask the user which model to serve. If they don't have a preference, suggest:`
			- `Qwen/Qwen3-8B` — small, fast, good for testing
			- `Qwen/Qwen3-32B` — medium, good balance
			- `meta-llama/Llama-3.1-70B-Instruct` — large general-purpose

			3. Check if the user has an HF_TOKEN. Pass inline with `-e HF_TOKEN="..."`.

			`4. Deploy the container. Use this validated configuration:`

			```bash
			`docker pull lmsysorg/sglang:latest-cu130`

			`docker run -d \`
			`--name sglang-server \`
			`--gpus '"device=<GB300_INDEX>"' \`
			`--ipc host \`
			`--ulimit memlock=-1 \`
			`--ulimit stack=67108864 \`
			`-p 30000:30000 \`
			`-e HF_TOKEN="<TOKEN>" \`
			`-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \`
			`lmsysorg/sglang:latest-cu130 \`
			`sglang serve --model-path "<MODEL>" \`
			`--host 0.0.0.0 \`
			`--port 30000 \`
			`--context-length 32768 \`
			`--mem-fraction-static 0.85`
			```

			Container version: Use `lmsysorg/sglang:latest-cu130`. The `cu130` tag is required for Blackwell SM103 support.

			`First launch downloads the model and compiles kernels. This takes extra time — subsequent starts are faster.`

			`5. Wait for the server to be ready. Monitor logs:`
			```bash
			`docker logs -f sglang-server`
			```

			`6. Test the server:`
			```bash
			`curl http://localhost:30000/v1/chat/completions \`
			`-H "Content-Type: application/json" \`
			`-d '{`
			`"model": "<MODEL>",`
			`"messages": [{"role": "user", "content": "Hello"}],`
			`"max_tokens": 64`
			`}'`
			```

			`7. Report the result to the user, including:`
			`- Model loaded and serving on port 30000`
			- How to stop: `docker stop sglang-server && docker rm sglang-server`

			`## Key features`

			- RadixAttention — automatic KV cache reuse across requests sharing prefixes. On by default, no flag needed. Verify with: `docker logs sglang-server 2>&1 \| grep "cached-token" \| tail -5`
			- Structured JSON output — use `response_format.json_schema` in API requests for guaranteed valid JSON.
			- Chunked prefill — add `--chunked-prefill-size 8192` to break long prefills into chunks, reducing time-to-first-token.

			`## Tuning parameters`

			`\| Parameter \| Default \| Agent workloads \| Throughput workloads \|`
			`\|-----------\|---------\|-----------------\|---------------------\|`
			\| `--context-length` \| 32768 \| 32768-65536 \| 8192-16384 \|
			\| `--mem-fraction-static` \| 0.85 \| 0.80-0.85 \| 0.85-0.88 \|
			\| `--chunked-prefill-size` \| off \| 4096-8192 \| 8192 \|
			\| `--enable-metrics` \| off \| Optional \| Recommended \|

			`## Structured output example`

			```bash
			`curl http://localhost:30000/v1/chat/completions \`
			`-H "Content-Type: application/json" \`
			`-d '{`
			`"model": "<MODEL>",`
			`"messages": [{"role": "user", "content": "List three programming languages."}],`
			`"max_tokens": 512,`
			`"response_format": {`
			`"type": "json_schema",`
			`"json_schema": {`
			`"name": "languages",`
			`"schema": {`
			`"type": "object",`
			`"properties": {`
			`"languages": {`
			`"type": "array",`
			`"items": {`
			`"type": "object",`
			`"properties": {`
			`"name": {"type": "string"},`
			`"primary_use": {"type": "string"}`
			`},`
			`"required": ["name", "primary_use"]`
			`}`
			`}`
			`},`
			`"required": ["languages"]`
			`}`
			`}`
			`}`
			`}'`
			```