| name | description | metadata |
|---|---|---|
| sglang-setup | Deploy an SGLang inference server on an NVIDIA DGX Station GB300 with the cu130 container, RadixAttention prefix caching, and structured JSON output support. Use when the user asks to serve a model with SGLang, start an SGLang endpoint, or needs structured-output inference on DGX Station. | |

SGLang Setup on DGX Station
Deploy an SGLang inference server on DGX Station with validated configuration.
Steps
- Find the GB300 GPU index. Run:

      nvidia-smi --query-gpu=index,name --format=csv,noheader

  Identify the device index for the GB300 (typically device 1). Use this index for --gpus below. Do NOT use --gpus all: mixed coherency will cause CUDA failures.
- Ask the user which model to serve. If they don't have a preference, suggest:
  - Qwen/Qwen3-8B: small, fast, good for testing
  - Qwen/Qwen3-32B: medium, good balance
  - meta-llama/Llama-3.1-70B-Instruct: large, general-purpose
- Check whether the user has an HF_TOKEN (needed for gated models such as meta-llama/Llama-3.1-70B-Instruct). If so, pass it inline with -e HF_TOKEN="...".
- Deploy the container. Use this validated configuration:

      docker pull lmsysorg/sglang:latest-cu130
      docker run -d \
        --name sglang-server \
        --gpus '"device=<GB300_INDEX>"' \
        --ipc host \
        --ulimit memlock=-1 \
        --ulimit stack=67108864 \
        -p 30000:30000 \
        -e HF_TOKEN="<TOKEN>" \
        -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
        lmsysorg/sglang:latest-cu130 \
        sglang serve --model-path "<MODEL>" \
        --host 0.0.0.0 \
        --port 30000 \
        --context-length 32768 \
        --mem-fraction-static 0.85

  Container version: Use lmsysorg/sglang:latest-cu130. The cu130 tag is required for Blackwell SM103 support.

  First launch downloads the model and compiles kernels, so it takes longer; subsequent starts are faster.
- Wait for the server to be ready. Monitor logs:

      docker logs -f sglang-server

- Test the server:

      curl http://localhost:30000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
          "model": "<MODEL>",
          "messages": [{"role": "user", "content": "Hello"}],
          "max_tokens": 64
        }'

- Report the result to the user, including:
  - Model loaded and serving on port 30000
  - How to stop:

        docker stop sglang-server && docker rm sglang-server
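
The GPU lookup and the readiness wait in the steps above can be scripted rather than done by eye. The following is a minimal sketch, not a validated part of this setup: pick_gpu_index and wait_until are hypothetical helper names, and it assumes that the GPU name printed by nvidia-smi contains the string GB300 and that the server answers on a /health endpoint once it is ready. Check both on the target machine before relying on it.

```shell
# Hypothetical helpers; the names are illustrative and not part of SGLang or Docker.

# Read "index, name" CSV lines on stdin (the format printed by
# nvidia-smi --query-gpu=index,name --format=csv,noheader) and print the
# index of the first GPU whose name contains the given substring.
# Returns non-zero if no GPU matches.
pick_gpu_index() {
  awk -F', *' -v pat="$1" '
    index($2, pat) { print $1; found = 1; exit }
    END { exit !found }
  '
}

# Retry a command until it succeeds.
# Usage: wait_until <max_attempts> <delay_seconds> <command> [args...]
# Returns 0 on the first success, 1 if every attempt fails.
wait_until() {
  _wu_max="$1"
  _wu_delay="$2"
  shift 2
  _wu_n=0
  while [ "$_wu_n" -lt "$_wu_max" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    _wu_n=$((_wu_n + 1))
    sleep "$_wu_delay"
  done
  return 1
}

# Intended use on the DGX Station (assumed endpoint, left commented out):
#   GB300_INDEX="$(nvidia-smi --query-gpu=index,name --format=csv,noheader | pick_gpu_index GB300)"
#   wait_until 120 5 curl -sf http://localhost:30000/health
```

Taking the probe command as an argument keeps wait_until independent of the endpoint, so the same loop can poll /v1/models or any other check if /health turns out not to be available in the deployed version.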
Key features
- RadixAttention: automatic KV cache reuse across requests sharing prefixes. On by default, no flag needed. Verify with:

      docker logs sglang-server 2>&1 | grep "cached-token" | tail -5

- Structured JSON output: use response_format.json_schema in API requests for guaranteed valid JSON.
- Chunked prefill: add --chunked-prefill-size 8192 to break long prefills into chunks, reducing time-to-first-token.
Tuning parameters
| Parameter | Default | Agent workloads | Throughput workloads |
|---|---|---|---|
| --context-length | 32768 | 32768-65536 | 8192-16384 |
| --mem-fraction-static | 0.85 | 0.80-0.85 | 0.85-0.88 |
| --chunked-prefill-size | off | 4096-8192 | 8192 |
| --enable-metrics | off | Optional | Recommended |
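
One way to apply the table is to map a workload name to a set of flags. This is a sketch only: sglang_tuning_flags is a hypothetical helper, and it picks the low end of each range in the table as a conservative starting point, which is an assumption rather than a validated choice.

```shell
# Hypothetical helper: print serve flags for a workload profile.
# Each value is the low end of the matching range in the tuning table.
sglang_tuning_flags() {
  case "$1" in
    agent)
      echo "--context-length 32768 --mem-fraction-static 0.80 --chunked-prefill-size 4096"
      ;;
    throughput)
      echo "--context-length 8192 --mem-fraction-static 0.85 --chunked-prefill-size 8192 --enable-metrics"
      ;;
    *)
      echo "unknown profile: $1" >&2
      return 1
      ;;
  esac
}
```

The output is meant to replace the --context-length and --mem-fraction-static arguments in the docker run command, for example by ending the command with $(sglang_tuning_flags agent) left unquoted so the shell splits it into separate arguments.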
Structured output example
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "<MODEL>",
"messages": [{"role": "user", "content": "List three programming languages."}],
"max_tokens": 512,
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "languages",
"schema": {
"type": "object",
"properties": {
"languages": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"primary_use": {"type": "string"}
},
"required": ["name", "primary_use"]
}
}
},
"required": ["languages"]
}
}
}
}'
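
The request above returns an OpenAI-style chat completion in which the schema-constrained JSON arrives as a string in choices[0].message.content, so it has to be parsed a second time before use. A minimal sketch of that step, assuming jq is installed on the client and that the response follows the OpenAI chat completion shape; list_languages is a hypothetical helper name.

```shell
# Hypothetical helper: read a chat completion response on stdin and print
# one "name: primary_use" line per entry in the languages array.
# Requires jq. Returns non-zero if the content string is not valid JSON,
# for example when the reply was cut off by max_tokens.
list_languages() {
  jq -r '.choices[0].message.content
         | fromjson
         | .languages[]
         | "\(.name): \(.primary_use)"'
}

# Intended use (left commented out; needs the running server and the
# request body from the example above saved as request.json):
#   curl -s http://localhost:30000/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d @request.json | list_languages
```

Treating a parse failure as an error rather than printing partial output makes truncated responses visible, which is the main way a schema-constrained reply can still be unusable.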