mirror of
https://github.com/NVIDIA/dgx-spark-playbooks.git
synced 2026-06-24 07:09:31 +00:00
chore: Regenerate all playbooks
This commit is contained in:
parent
bc6bf2251e
commit
97ae853a23
@ -54,6 +54,8 @@ The following models are supported with vLLM on Spark. All listed models are ava
|
|||||||
|
|
||||||
| Model | Quantization | Support Status | HF Handle |
|
| Model | Quantization | Support Status | HF Handle |
|
||||||
|-------|-------------|----------------|-----------|
|
|-------|-------------|----------------|-----------|
|
||||||
|
| **DiffusionGemma 26B A4B IT** | BF16 | ✅ | [`google/diffusiongemma-26B-A4B-it`](https://huggingface.co/google/diffusiongemma-26B-A4B-it) |
|
||||||
|
| **DiffusionGemma 26B A4B IT** | NVFP4 | ✅ | [`nvidia/diffusiongemma-26B-A4B-it-NVFP4`](https://huggingface.co/nvidia/diffusiongemma-26B-A4B-it-NVFP4) |
|
||||||
| **Nemotron-3-Nano-Omni-30B-A3B-Reasoning** | BF16 | ✅ | [`nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16`](https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16) |
|
| **Nemotron-3-Nano-Omni-30B-A3B-Reasoning** | BF16 | ✅ | [`nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16`](https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16) |
|
||||||
| **Nemotron-3-Nano-Omni-30B-A3B-Reasoning** | FP8 | ✅ | [`nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8`](https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8) |
|
| **Nemotron-3-Nano-Omni-30B-A3B-Reasoning** | FP8 | ✅ | [`nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8`](https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8) |
|
||||||
| **Nemotron-3-Nano-Omni-30B-A3B-Reasoning** | NVFP4 | ✅ | [`nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4`](https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4) |
|
| **Nemotron-3-Nano-Omni-30B-A3B-Reasoning** | NVFP4 | ✅ | [`nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4`](https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4) |
|
||||||
@ -97,8 +99,8 @@ Reminder: not all model architectures are supported for NVFP4 quantization.
|
|||||||
* **Duration:** 30 minutes for Docker approach
|
* **Duration:** 30 minutes for Docker approach
|
||||||
* **Risks:** Container registry access requires internal credentials
|
* **Risks:** Container registry access requires internal credentials
|
||||||
* **Rollback:** Container approach is non-destructive.
|
* **Rollback:** Container approach is non-destructive.
|
||||||
* **Last Updated:** 04/28/2026
|
* **Last Updated:** 06/10/2026
|
||||||
* Add support for Nemotron-3-Nano-Omni reasoning BF16, FP8, NVFP4
|
* Add models
|
||||||
|
|
||||||
## Instructions
|
## Instructions
|
||||||
|
|
||||||
@ -133,9 +135,13 @@ newgrp docker
|
|||||||
Find the latest container build from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm
|
Find the latest container build from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
|
## HuggingFace token (required)
|
||||||
|
## Get a token from https://huggingface.co/settings/tokens
|
||||||
|
export HF_TOKEN="your_huggingface_token"
|
||||||
|
|
||||||
export LATEST_VLLM_VERSION=<latest_container_version>
|
export LATEST_VLLM_VERSION=<latest_container_version>
|
||||||
## example
|
## example
|
||||||
## export LATEST_VLLM_VERSION=26.02-py3
|
## export LATEST_VLLM_VERSION=26.05.post1-py3
|
||||||
|
|
||||||
export HF_MODEL_HANDLE=<HF_HANDLE>
|
export HF_MODEL_HANDLE=<HF_HANDLE>
|
||||||
## example
|
## example
|
||||||
@ -144,7 +150,12 @@ export HF_MODEL_HANDLE=<HF_HANDLE>
|
|||||||
docker pull nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION}
|
docker pull nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION}
|
||||||
```
|
```
|
||||||
|
|
||||||
For Gemma 4 model family, use vLLM custom containers:
|
For DiffusionGemma models, use vLLM custom container:
|
||||||
|
```bash
|
||||||
|
docker pull vllm/vllm-openai:gemma
|
||||||
|
```
|
||||||
|
|
||||||
|
For Gemma 4 model family, use vLLM custom container:
|
||||||
```bash
|
```bash
|
||||||
docker pull vllm/vllm-openai:gemma4-cu130
|
docker pull vllm/vllm-openai:gemma4-cu130
|
||||||
```
|
```
|
||||||
@ -159,6 +170,31 @@ nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION} \
|
|||||||
vllm serve ${HF_MODEL_HANDLE}
|
vllm serve ${HF_MODEL_HANDLE}
|
||||||
```
|
```
|
||||||
|
|
||||||
|
To run DiffusionGemma models (e.g. `google/diffusiongemma-26B-A4B-it`):
|
||||||
|
```bash
|
||||||
|
docker run -it \
|
||||||
|
-p 8000:8000 \
|
||||||
|
--gpus all \
|
||||||
|
--shm-size=16g \
|
||||||
|
-e HF_TOKEN="$HF_TOKEN" \
|
||||||
|
-e VLLM_USE_V2_MODEL_RUNNER=1 \
|
||||||
|
vllm/vllm-openai:gemma ${HF_MODEL_HANDLE} \
|
||||||
|
--gpu-memory-utilization 0.8 \
|
||||||
|
--max-model-len 262144 \
|
||||||
|
--attention-backend TRITON_ATTN \
|
||||||
|
--max-num-seqs 10 \
|
||||||
|
--diffusion-config '{"canvas_length":256}' \
|
||||||
|
--override-generation-config '{"max_new_tokens": null}' \
|
||||||
|
--enable-auto-tool-choice \
|
||||||
|
--tool-call-parser gemma4 \
|
||||||
|
--reasoning-parser gemma4 \
|
||||||
|
--enable-prefix-caching \
|
||||||
|
--default-chat-template-kwargs '{"enable_thinking": true}' \
|
||||||
|
--load-format fastsafetensors
|
||||||
|
|
||||||
|
## For BF16 checkpoint add "--moe-backend triton" for better performance
|
||||||
|
```
|
||||||
|
|
||||||
To run models from Gemma 4 model family, (e.g. `google/gemma-4-31B-it`):
|
To run models from Gemma 4 model family, (e.g. `google/gemma-4-31B-it`):
|
||||||
```bash
|
```bash
|
||||||
docker run -it --gpus all -p 8000:8000 \
|
docker run -it --gpus all -p 8000:8000 \
|
||||||
@ -188,11 +224,19 @@ Expected response should contain `"content": "204"` or similar mathematical calc
|
|||||||
|
|
||||||
For container approach (non-destructive):
|
For container approach (non-destructive):
|
||||||
|
|
||||||
|
NGC Container:
|
||||||
```bash
|
```bash
|
||||||
docker rm $(docker ps -aq --filter ancestor=nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION})
|
docker rm $(docker ps -aq --filter ancestor=nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION})
|
||||||
docker rmi nvcr.io/nvidia/vllm
|
docker rmi nvcr.io/nvidia/vllm
|
||||||
```
|
```
|
||||||
|
|
||||||
|
Upstream Container:
|
||||||
|
```bash
|
||||||
|
docker stop "<container name>"
|
||||||
|
docker rm "<container name>"
|
||||||
|
docker rmi "<container image name>"
|
||||||
|
```
|
||||||
|
|
||||||
## Step 6. Next steps
|
## Step 6. Next steps
|
||||||
|
|
||||||
- **Production deployment:** Configure vLLM with your specific model requirements
|
- **Production deployment:** Configure vLLM with your specific model requirements
|
||||||
@ -623,6 +667,7 @@ http://<head-node-ip>:8265
|
|||||||
| CUDA version mismatch errors | Wrong CUDA toolkit version | Reinstall CUDA 12.9 using exact installer |
|
| CUDA version mismatch errors | Wrong CUDA toolkit version | Reinstall CUDA 12.9 using exact installer |
|
||||||
| Container registry authentication fails | Invalid or expired GitLab token | Generate new auth token |
|
| Container registry authentication fails | Invalid or expired GitLab token | Generate new auth token |
|
||||||
| SM_121a architecture not recognized | Missing LLVM patches | Verify SM_121a patches applied to LLVM source |
|
| SM_121a architecture not recognized | Missing LLVM patches | Verify SM_121a patches applied to LLVM source |
|
||||||
|
| CUDA out of memory | Insufficient GPU memory | Reduce --max-model-len and --max-num-seqs parameters |
|
||||||
|
|
||||||
## Common Issues for running on two Sparks
|
## Common Issues for running on two Sparks
|
||||||
| Symptom | Cause | Fix |
|
| Symptom | Cause | Fix |
|
||||||
@ -631,7 +676,7 @@ http://<head-node-ip>:8265
|
|||||||
| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser |
|
| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser |
|
||||||
| Model download fails | Authentication or network issue | Re-run `huggingface-cli login`, check internet access |
|
| Model download fails | Authentication or network issue | Re-run `huggingface-cli login`, check internet access |
|
||||||
| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your HuggingFace token; and request access to the gated model on your web browser |
|
| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your HuggingFace token; and request access to the gated model on your web browser |
|
||||||
| CUDA out of memory with 405B | Insufficient GPU memory | Use 70B model or reduce max_model_len parameter |
|
| CUDA out of memory | Insufficient GPU memory | Reduce --max-model-len and --max-num-seqs parameters |
|
||||||
| Container startup fails | Missing ARM64 image | Rebuild vLLM image following ARM64 instructions |
|
| Container startup fails | Missing ARM64 image | Rebuild vLLM image following ARM64 instructions |
|
||||||
|
|
||||||
> [!NOTE]
|
> [!NOTE]
|
||||||
|
|||||||
Loading…
Reference in New Issue
Block a user