dgx-spark-playbooks/nvidia/station-ai-skills/assets/skills/sglang-setup/SKILL.md
2026-05-30 11:49:27 +00:00

4.1 KiB

name description metadata
sglang-setup Deploy an SGLang inference server on an NVIDIA DGX Station GB300 with the cu130 container, RadixAttention prefix caching, and structured JSON output support. Use when the user asks to serve a model with SGLang, start an SGLang endpoint, or needs structured-output inference on DGX Station.
publisher hardware
nvidia DGX Station GB300

SGLang Setup on DGX Station

Deploy an SGLang inference server on DGX Station with validated configuration.

Steps

  1. Find the GB300 GPU index. Run:

    nvidia-smi --query-gpu=index,name --format=csv,noheader
    

    Identify the device index for the GB300 (typically device 1). Use this index for --gpus below. Do NOT use --gpus all — mixed coherency will cause CUDA failures.

  2. Ask the user which model to serve. If they don't have a preference, suggest:

    • Qwen/Qwen3-8B — small, fast, good for testing
    • Qwen/Qwen3-32B — medium, good balance
    • meta-llama/Llama-3.1-70B-Instruct — large general-purpose
  3. Check if the user has an HF_TOKEN. Pass inline with -e HF_TOKEN="...".

  4. Deploy the container. Use this validated configuration:

    docker pull lmsysorg/sglang:latest-cu130
    
    docker run -d \
      --name sglang-server \
      --gpus '"device=<GB300_INDEX>"' \
      --ipc host \
      --ulimit memlock=-1 \
      --ulimit stack=67108864 \
      -p 30000:30000 \
      -e HF_TOKEN="<TOKEN>" \
      -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
      lmsysorg/sglang:latest-cu130 \
      sglang serve --model-path "<MODEL>" \
        --host 0.0.0.0 \
        --port 30000 \
        --context-length 32768 \
        --mem-fraction-static 0.85
    

    Container version: Use lmsysorg/sglang:latest-cu130. The cu130 tag is required for Blackwell SM103 support.

    First launch downloads the model and compiles kernels. This takes extra time — subsequent starts are faster.

  5. Wait for the server to be ready. Monitor logs:

    docker logs -f sglang-server
    
  6. Test the server:

    curl http://localhost:30000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "<MODEL>",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 64
      }'
    
  7. Report the result to the user, including:

    • Model loaded and serving on port 30000
    • How to stop: docker stop sglang-server && docker rm sglang-server

Key features

  • RadixAttention — automatic KV cache reuse across requests sharing prefixes. On by default, no flag needed. Verify with: docker logs sglang-server 2>&1 | grep "cached-token" | tail -5
  • Structured JSON output — use response_format.json_schema in API requests for guaranteed valid JSON.
  • Chunked prefill — add --chunked-prefill-size 8192 to break long prefills into chunks, reducing time-to-first-token.

Tuning parameters

Parameter Default Agent workloads Throughput workloads
--context-length 32768 32768-65536 8192-16384
--mem-fraction-static 0.85 0.80-0.85 0.85-0.88
--chunked-prefill-size off 4096-8192 8192
--enable-metrics off Optional Recommended

Structured output example

curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<MODEL>",
    "messages": [{"role": "user", "content": "List three programming languages."}],
    "max_tokens": 512,
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "languages",
        "schema": {
          "type": "object",
          "properties": {
            "languages": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "name": {"type": "string"},
                  "primary_use": {"type": "string"}
                },
                "required": ["name", "primary_use"]
              }
            }
          },
          "required": ["languages"]
        }
      }
    }
  }'