mirror of
https://github.com/NVIDIA/dgx-spark-playbooks.git
synced 2026-04-27 04:13:52 +00:00
Compare commits
10 Commits: 09fada6242 ... 89bc56cb35

| SHA1 |
|------|
| 89bc56cb35 |
| 8452a1c5b1 |
| 9414a5141f |
| 911ca6db8b |
| 08c06d5bd9 |
| 87796cfb06 |
| 36ac5b74eb |
| c88578ffe1 |
| 532624b364 |
| e542e522c5 |
@ -31,6 +31,7 @@ Each playbook includes prerequisites, step-by-step instructions, troubleshooting
|
||||
- [Install and Use Isaac Sim and Isaac Lab](nvidia/isaac/)
|
||||
- [Optimized JAX](nvidia/jax/)
|
||||
- [Live VLM WebUI](nvidia/live-vlm-webui/)
|
||||
- [Run models with llama.cpp on DGX Spark](nvidia/llama-cpp/)
|
||||
- [LLaMA Factory](nvidia/llama-factory/)
|
||||
- [LM Studio on DGX Spark](nvidia/lm-studio/)
|
||||
- [Build and Deploy a Multi-Agent Chatbot](nvidia/multi-agent-chatbot/)
|
||||
|
||||
269
nvidia/llama-cpp/README.md
Normal file
269
nvidia/llama-cpp/README.md
Normal file
@ -0,0 +1,269 @@
|
||||
# Run models with llama.cpp on DGX Spark
|
||||
|
||||
> Build llama.cpp with CUDA and serve models via an OpenAI-compatible API (Gemma 4 31B IT as an example)
|
||||
|
||||
|
||||
## Table of Contents
|
||||
|
||||
- [Overview](#overview)
|
||||
- [Instructions](#instructions)
|
||||
- [Troubleshooting](#troubleshooting)
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
## Basic idea
|
||||
|
||||
[llama.cpp](https://github.com/ggml-org/llama.cpp) is a lightweight C/C++ inference stack for large language models. You build it with CUDA so tensor work runs on the DGX Spark GB10 GPU, then load GGUF weights and expose chat through `llama-server`’s OpenAI-compatible HTTP API.
|
||||
|
||||
This playbook walks through that stack end to end. As the model example, it uses **Gemma 4 31B IT** - a frontier reasoning model built by Google DeepMind that llama.cpp supports, with strengths in coding, agentic workflows, and fine-tuning. The instructions download its **F16** GGUF from Hugging Face. The same build and server steps apply to other GGUFs (including other sizes in the support matrix below).
|
||||
|
||||
## What you'll accomplish
|
||||
|
||||
You will build llama.cpp with CUDA for GB10, download a Gemma 4 31B IT model checkpoint, and run **`llama-server`** with GPU offload. You get:
|
||||
|
||||
- Local inference through llama.cpp (no separate Python inference framework required)
|
||||
- An OpenAI-compatible `/v1/chat/completions` endpoint for tools and apps
|
||||
- A concrete validation that **Gemma 4 31B IT** runs on this stack on DGX Spark
|
||||
|
||||
## What to know before starting
|
||||
|
||||
- Basic familiarity with Linux command line and terminal commands
|
||||
- Understanding of git and building from source with CMake
|
||||
- Basic knowledge of REST APIs and cURL for testing
|
||||
- Familiarity with Hugging Face Hub for downloading GGUF files
|
||||
|
||||
## Prerequisites
|
||||
|
||||
**Hardware requirements**
|
||||
|
||||
- NVIDIA DGX Spark with GB10 GPU
|
||||
- Sufficient unified memory for the F16 checkpoint (on the order of **~62GB** for weights alone; more when KV cache and runtime overhead are included)
|
||||
- At least **~70GB** free disk for the F16 download plus build artifacts (use a smaller quant from the same repo if you need less disk and VRAM); a quick capacity check is shown below
|
||||
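Before downloading, you can do a quick capacity check (a minimal sketch using standard tools; adjust the path if you plan to store the model somewhere other than your home directory):

```bash
# Unified memory visible to the system (the F16 weights alone need roughly 62GB)
free -g

# Free disk space on the filesystem that will hold the GGUF and the build
df -h ~
```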
|
||||
**Software requirements**
|
||||
|
||||
- NVIDIA DGX OS
|
||||
- Git: `git --version`
|
||||
- CMake (3.14+): `cmake --version`
|
||||
- CUDA Toolkit: `nvcc --version`
|
||||
- Network access to GitHub and Hugging Face
|
||||
|
||||
## Model Support Matrix
|
||||
|
||||
The following models are supported with llama.cpp on Spark. All listed models are available and ready to use:
|
||||
|
||||
| Model | Support Status | HF Handle |
|
||||
|-------|----------------|-----------|
|
||||
| **Gemma 4 31B IT** | ✅ | `ggml-org/gemma-4-31B-it-GGUF` |
|
||||
| **Gemma 4 26B A4B IT** | ✅ | `ggml-org/gemma-4-26B-A4B-it-GGUF` |
|
||||
| **Gemma 4 E4B IT** | ✅ | `ggml-org/gemma-4-E4B-it-GGUF` |
|
||||
| **Gemma 4 E2B IT** | ✅ | `ggml-org/gemma-4-E2B-it-GGUF` |
|
||||
| **Nemotron-3-Nano** | ✅ | `unsloth/Nemotron-3-Nano-30B-A3B-GGUF` |
|
||||
|
||||
## Time & risk
|
||||
|
||||
* **Estimated time:** About 30 minutes, plus downloading the ~62GB example
|
||||
* **Risk level:** Low — build is local to your clone; no system-wide installs required for the steps below
|
||||
* **Rollback:** Remove the `llama.cpp` clone and the model directory under `~/models/` to reclaim disk space
|
||||
* **Last updated:** 04/02/2026
|
||||
* First Publication
|
||||
|
||||
## Instructions
|
||||
|
||||
## Step 1. Verify prerequisites
|
||||
|
||||
This walkthrough uses **Gemma 4 31B IT** (`gemma-4-31B-it-f16.gguf`) as the example checkpoint. You can substitute another GGUF from [`ggml-org/gemma-4-31B-it-GGUF`](https://huggingface.co/ggml-org/gemma-4-31B-it-GGUF) (for example `Q4_K_M` or `Q8_0`) by changing the `hf download` filename and `--model` path in later steps.
|
||||
|
||||
Ensure the required tools are installed:
|
||||
|
||||
```bash
|
||||
git --version
|
||||
cmake --version
|
||||
nvcc --version
|
||||
```
|
||||
|
||||
All commands should return version information. If any are missing, install them before continuing.
|
||||
|
||||
Install the Hugging Face CLI:
|
||||
|
||||
```bash
|
||||
python3 -m venv llama-cpp-venv
|
||||
source llama-cpp-venv/bin/activate
|
||||
pip install -U "huggingface_hub[cli]"
|
||||
```
|
||||
|
||||
Verify installation:
|
||||
|
||||
```bash
|
||||
hf version
|
||||
```
|
||||
|
||||
## Step 2. Clone the llama.cpp repository
|
||||
|
||||
Clone upstream llama.cpp—the framework you are building:
|
||||
|
||||
```bash
|
||||
git clone https://github.com/ggml-org/llama.cpp
|
||||
cd llama.cpp
|
||||
```
|
||||
|
||||
## Step 3. Build llama.cpp with CUDA
|
||||
|
||||
Configure CMake with CUDA and GB10’s **sm_121** architecture so GGML’s CUDA backend matches your GPU:
|
||||
|
||||
```bash
|
||||
mkdir build && cd build
|
||||
cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121" -DLLAMA_CURL=OFF
|
||||
make -j8
|
||||
```
|
||||
|
||||
The build usually takes on the order of 5–10 minutes. When it finishes, binaries such as `llama-server` appear under `build/bin/`.
|
||||
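To sanity-check the result from the `build` directory, confirm the binary exists and prints its build info (`--version` is available in recent llama.cpp releases):

```bash
# Confirm the server binary was produced
ls -lh bin/llama-server

# Print version and build information
./bin/llama-server --version
```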
|
||||
## Step 4. Download Gemma 4 31B IT GGUF (supported model example)
|
||||
|
||||
llama.cpp loads models in **GGUF** format. **gemma-4-31B-it** is available in GGUF from Hugging Face; this playbook uses an F16 variant that balances quality and memory on GB10-class hardware.
|
||||
|
||||
```bash
|
||||
hf download ggml-org/gemma-4-31B-it-GGUF \
|
||||
gemma-4-31B-it-f16.gguf \
|
||||
--local-dir ~/models/gemma-4-31B-it-GGUF
|
||||
```
|
||||
|
||||
The F16 file is large (**~62GB**). The download can be resumed if interrupted.
|
||||
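Once it completes, confirm the file landed where the later `--model` flag expects it:

```bash
# The F16 GGUF should be roughly 62GB
ls -lh ~/models/gemma-4-31B-it-GGUF/
```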
|
||||
## Step 5. Start llama-server with Gemma 4 31B IT
|
||||
|
||||
From your `llama.cpp/build` directory, launch the OpenAI-compatible server with GPU offload:
|
||||
|
||||
```bash
|
||||
./bin/llama-server \
|
||||
--model ~/models/gemma-4-31B-it-GGUF/gemma-4-31B-it-f16.gguf \
|
||||
--host 0.0.0.0 \
|
||||
--port 30000 \
|
||||
--n-gpu-layers 99 \
|
||||
--ctx-size 8192 \
|
||||
--threads 8
|
||||
```
|
||||
|
||||
**Parameters (short):**
|
||||
|
||||
- `--host` / `--port`: bind address and port for the HTTP API
|
||||
- `--n-gpu-layers 99`: offload layers to the GPU (adjust if you use a different model)
|
||||
- `--ctx-size`: context length (can be increased up to model/server limits; uses more memory)
|
||||
- `--threads`: CPU threads for non-GPU work
|
||||
|
||||
You should see log lines similar to:
|
||||
|
||||
```
|
||||
llama_new_context_with_model: n_ctx = 8192
|
||||
...
|
||||
main: server is listening on 0.0.0.0:30000
|
||||
```
|
||||
|
||||
**Keep this terminal open** while testing. Large GGUFs can take several minutes to load; until you see `server is listening`, nothing accepts connections on port 30000 (see Troubleshooting if `curl` reports connection refused).
|
||||
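From a second terminal, you can poll the server's `/health` endpoint until loading finishes (a small helper loop that assumes the default host and port from the command above):

```bash
# Returns non-zero while the port is closed or the model is still loading
until curl -sf http://127.0.0.1:30000/health > /dev/null; do
  echo "waiting for llama-server..."
  sleep 5
done
echo "llama-server is ready"
```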
|
||||
## Step 6. Test the API
|
||||
|
||||
Use a **second terminal on the same machine** that runs `llama-server` (for example another SSH session into DGX Spark). If you run `curl` on your laptop while the server runs only on Spark, use the Spark hostname or IP instead of `localhost`.
|
||||
|
||||
```bash
|
||||
curl -X POST http://127.0.0.1:30000/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "gemma4",
|
||||
"messages": [{"role": "user", "content": "New York is a great city because..."}],
|
||||
"max_tokens": 100
|
||||
}'
|
||||
```
|
||||
|
||||
If you see `curl: (7) Failed to connect`, the server is still loading, the process exited (check the server log for OOM or path errors), or you are not curling the host that runs `llama-server`.
|
||||
|
||||
Example shape of the response (fields vary by llama.cpp version; `message` may include extra keys):
|
||||
|
||||
```json
|
||||
{
|
||||
"choices": [
|
||||
{
|
||||
"finish_reason": "length",
|
||||
"index": 0,
|
||||
"message": {
|
||||
"role": "assistant",
|
||||
"content": "New York is a great city because it's a living, breathing collage of cultures, ideas, and possibilities—all stacked into one vibrant, never‑sleeping metropolis. Here are just a few reasons that many people ("
|
||||
}
|
||||
}
|
||||
],
|
||||
"created": 1765916539,
|
||||
"model": "gemma-4-31B-it-f16.gguf",
|
||||
"object": "chat.completion",
|
||||
"usage": {
|
||||
"completion_tokens": 100,
|
||||
"prompt_tokens": 25,
|
||||
"total_tokens": 125
|
||||
},
|
||||
"id": "chatcmpl-...",
|
||||
"timings": {
|
||||
...
|
||||
}
|
||||
}
|
||||
```
|
||||
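If `jq` is installed, you can pull just the generated text out of the response (a convenience one-liner based on the response shape above):

```bash
curl -s -X POST http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4",
    "messages": [{"role": "user", "content": "Give me one fun fact about New York."}],
    "max_tokens": 60
  }' | jq -r '.choices[0].message.content'
```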
|
||||
## Step 7. Longer completion (with example model)
|
||||
|
||||
Try a slightly longer prompt to confirm stable generation with **Gemma 4 31B IT**:
|
||||
|
||||
```bash
|
||||
curl -X POST http://127.0.0.1:30000/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "gemma4",
|
||||
"messages": [{"role": "user", "content": "Solve this step by step: If a train travels 120 miles in 2 hours, what is its average speed?"}],
|
||||
"max_tokens": 500
|
||||
}'
|
||||
```
|
||||
|
||||
## Step 8. Cleanup
|
||||
|
||||
Stop the server with `Ctrl+C` in the terminal where it is running.
|
||||
|
||||
To remove this tutorial’s artifacts:
|
||||
|
||||
```bash
|
||||
rm -rf ~/llama.cpp
|
||||
rm -rf ~/models/gemma-4-31B-it-GGUF
|
||||
```
|
||||
|
||||
Deactivate the Python venv if you no longer need `hf`:
|
||||
|
||||
```bash
|
||||
deactivate
|
||||
```
|
||||
|
||||
## Step 9. Next steps
|
||||
|
||||
1. **Context length:** Increase `--ctx-size` for longer chats (watch memory; 1M-token class contexts are possible only when the build, model, and hardware allow).
|
||||
2. **Other models:** Point `--model` at any compatible GGUF; the llama.cpp server API stays the same.
|
||||
3. **Integrations:** Point Open WebUI, Continue.dev, or custom clients at `http://<spark-host>:30000/v1` using the OpenAI client pattern.
|
||||
|
||||
The server implements the usual OpenAI-style chat features your llama.cpp build enables (including streaming and tool-related flows where supported).
|
||||
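For example, where streaming is enabled, a client can request server-sent events the same way it would against the OpenAI API (a sketch using the endpoint from Step 5):

```bash
# -N disables curl buffering so tokens appear as they are generated
curl -N -X POST http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "max_tokens": 100,
    "stream": true
  }'
```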
|
||||
## Troubleshooting
|
||||
|
||||
| Symptom | Cause | Fix |
|
||||
|---------|-------|-----|
|
||||
| `cmake` fails with "CUDA not found" | CUDA toolkit not in PATH | Run `export PATH=/usr/local/cuda/bin:$PATH` and re-run CMake from a clean build directory |
|
||||
| Build errors mentioning wrong GPU arch | CMake `CMAKE_CUDA_ARCHITECTURES` does not match GB10 | Use `-DCMAKE_CUDA_ARCHITECTURES="121"` for DGX Spark GB10 as in the instructions |
|
||||
| GGUF download fails or stalls | Network or Hugging Face availability | Re-run `hf download`; it resumes partial files |
|
||||
| "CUDA out of memory" when starting `llama-server` | Model too large for current context or VRAM | Lower `--ctx-size` (e.g. 4096) or use a smaller quantization from the same repo |
|
||||
| Server runs but latency is high | Layers not on GPU | Confirm `--n-gpu-layers` is high enough for your model; check `nvidia-smi` during a request |
|
||||
| `curl: (7) Failed to connect` on port 30000 | No listener yet, wrong host, or crash | Wait for `server is listening`; run `curl` on the same host as `llama-server` (or Spark’s IP); run `ss -tln` and confirm `:30000`; read server stderr for OOM or bad `--model` path |
|
||||
| Chat API errors or empty replies | Wrong `--model` path or incompatible GGUF | Verify the path to the `.gguf` file; update llama.cpp if the GGUF requires a newer format |
|
||||
|
||||
> [!NOTE]
|
||||
> DGX Spark uses Unified Memory Architecture (UMA), which allows flexible sharing between GPU and CPU memory. Some software is still catching up to UMA behavior. If you hit memory pressure unexpectedly, you can try flushing the page cache (use with care on shared systems):
|
||||
```bash
|
||||
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
|
||||
```
|
||||
|
||||
For the latest platform issues, see the [DGX Spark known issues](https://docs.nvidia.com/dgx/dgx-spark/known-issues.html) documentation.
|
||||
@ -6,6 +6,7 @@
|
||||
|
||||
- [Overview](#overview)
|
||||
- [Instructions](#instructions)
|
||||
- [Run on two Sparks](#run-on-two-sparks)
|
||||
- [Troubleshooting](#troubleshooting)
|
||||
|
||||
---
|
||||
@ -47,8 +48,8 @@ All necessary files for the playbook can be found [here on GitHub](https://githu
|
||||
* **Duration:** 45-90 minutes for complete setup and initial model fine-tuning
|
||||
* **Risks:** Model downloads can be large (several GB); ARM64 package compatibility issues may require troubleshooting; distributed training setup complexity increases with multi-node configurations
|
||||
* **Rollback:** Virtual environments can be completely removed; no system-level changes are made to the host system beyond package installations.
|
||||
* **Last Updated:** 01/15/2026
|
||||
* Fix qLoRA fine-tuning workflow
|
||||
* **Last Updated:** 03/04/2026
|
||||
* Recommend running Nemo finetune workflow via Docker
|
||||
|
||||
## Instructions
|
||||
|
||||
@ -296,8 +297,296 @@ python3 my_custom_training.py
|
||||
|
||||
Explore the [NeMo AutoModel GitHub repository](https://github.com/NVIDIA-NeMo/Automodel) for more recipes, documentation, and community examples. Consider setting up custom datasets, experimenting with different model architectures, and scaling to multi-node distributed training for larger models.
|
||||
|
||||
## Run on two Sparks
|
||||
|
||||
## Step 1. Configure network connectivity
|
||||
|
||||
Follow the network setup instructions from the [Connect two Sparks](https://build.nvidia.com/spark/connect-two-sparks/stacked-sparks) playbook to establish connectivity between your DGX Spark nodes.
|
||||
|
||||
This includes:
|
||||
- Physical QSFP cable connection
|
||||
- Network interface configuration (automatic or manual IP assignment)
|
||||
- Passwordless SSH setup
|
||||
- Network connectivity verification
|
||||
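Before moving on, a quick connectivity check from the master node is worth running (a sketch; `192.168.100.11` is an assumed worker address matching the `192.168.100.10` master example used later in this playbook, so substitute your actual address and user):

```bash
# Confirm the worker is reachable over the direct link
ping -c 3 192.168.100.11

# Confirm passwordless SSH works (should print the worker's hostname without prompting)
ssh <your-user>@192.168.100.11 hostname
```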
|
||||
> [!NOTE]
|
||||
> Steps 2 to 8 must be conducted on each node.
|
||||
|
||||
## Step 2. Configure Docker permissions
|
||||
|
||||
To easily manage containers without sudo, you must be in the `docker` group. If you choose to skip this step, you will need to run Docker commands with sudo.
|
||||
|
||||
Open a new terminal and test Docker access. In the terminal, run:
|
||||
|
||||
```bash
|
||||
docker ps
|
||||
```
|
||||
|
||||
If you see a permission denied error (something like `permission denied while trying to connect to the Docker daemon socket`), add your user to the `docker` group so that you don't need to run the command with `sudo`.
|
||||
|
||||
```bash
|
||||
sudo usermod -aG docker $USER
|
||||
newgrp docker
|
||||
```
|
||||
|
||||
## Step 3. Install NVIDIA Container Toolkit & set up Docker environment
|
||||
|
||||
Ensure the NVIDIA drivers and the NVIDIA Container Toolkit are installed on each node (both manager and workers) that will provide GPU resources. This package enables Docker containers to access the host's GPU hardware. Ensure you complete the [installation steps](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html), including the [Docker configuration](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-docker) for NVIDIA Container Toolkit.
|
||||
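A quick way to confirm the toolkit works end to end on each node (the same style of check used elsewhere in these playbooks; it pulls the `ubuntu` image if it is not already present):

```bash
# Toolkit CLI is installed
nvidia-ctk --version

# Containers can see the GPU
docker run --rm --gpus all ubuntu nvidia-smi
```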
|
||||
## Step 4. Deploy Docker Containers
|
||||
|
||||
Download the [**pytorch-ft-entrypoint.sh**](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/pytorch-fine-tune/assets/pytorch-ft-entrypoint.sh) script into your home directory and run the following command to make it executable:
|
||||
|
||||
```bash
|
||||
chmod +x $HOME/pytorch-ft-entrypoint.sh
|
||||
```
|
||||
|
||||
Deploy the docker container by running the following command:
|
||||
```bash
|
||||
docker run -d \
|
||||
--name automodel-node \
|
||||
--gpus all \
|
||||
--network host \
|
||||
--ulimit memlock=-1 \
|
||||
--ulimit stack=67108864 \
|
||||
--device=/dev/infiniband \
|
||||
-v "$PWD"/pytorch-ft-entrypoint.sh:/opt/pytorch-ft-entrypoint.sh \
|
||||
-v "$HOME/.cache/huggingface/":/root/.cache/huggingface/ \
|
||||
-v "$HOME/.ssh":/tmp/.ssh:ro \
|
||||
-e UCX_NET_DEVICES=enp1s0f0np0,enp1s0f1np1 \
|
||||
-e NCCL_SOCKET_IFNAME=enp1s0f0np0,enp1s0f1np1 \
|
||||
-e GLOO_SOCKET_IFNAME=enp1s0f0np0,enp1s0f1np1 \
|
||||
-e NCCL_DEBUG=INFO \
|
||||
-e TORCH_NCCL_ASYNC_ERROR_HANDLING=1 \
|
||||
-e TORCH_DISTRIBUTED_DEBUG=INFO \
|
||||
-e CUDA_DEVICE_MAX_CONNECTIONS=1 \
|
||||
-e CUDA_VISIBLE_DEVICES=0 \
|
||||
nvcr.io/nvidia/pytorch:25.10-py3 \
|
||||
/opt/pytorch-ft-entrypoint.sh
|
||||
```
|
||||
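After starting it, confirm the container stayed up and inspect the entrypoint output if it did not:

```bash
# Container should be listed as Up
docker ps --filter name=automodel-node

# Follow the entrypoint logs (Ctrl+C to stop following)
docker logs -f automodel-node
```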
|
||||
## Step 5. Install package management tools
|
||||
|
||||
Launch a terminal into your docker container on the node.
|
||||
|
||||
```bash
|
||||
docker exec -it automodel-node bash
|
||||
```
|
||||
|
||||
> [!NOTE]
|
||||
> All subsequent steps and commands, other than "Cleanup and rollback", should be run from within the docker container terminal.
|
||||
|
||||
Install `uv` for efficient package management and virtual environment isolation. NeMo AutoModel uses `uv` for dependency management and automatic environment handling.
|
||||
|
||||
```bash
|
||||
## Install uv package manager
|
||||
pip3 install uv
|
||||
|
||||
## Verify installation
|
||||
uv --version
|
||||
```
|
||||
|
||||
## Step 6. Clone NeMo AutoModel repository
|
||||
|
||||
Clone the official NeMo AutoModel repository to access recipes and examples. This provides ready-to-use training configurations for various model types and training scenarios.
|
||||
|
||||
```bash
|
||||
## Clone the repository
|
||||
git clone https://github.com/NVIDIA-NeMo/Automodel.git
|
||||
|
||||
## Navigate to the repository
|
||||
cd Automodel
|
||||
```
|
||||
|
||||
## Step 7. Install NeMo AutoModel
|
||||
|
||||
Set up the virtual environment and install NeMo AutoModel. Choose between wheel package installation for stability or source installation for latest features.
|
||||
|
||||
**Install from wheel package (recommended):**
|
||||
|
||||
```bash
|
||||
## Initialize virtual environment
|
||||
uv venv --system-site-packages
|
||||
|
||||
## Install packages with uv
|
||||
uv sync --inexact --frozen --all-extras \
|
||||
--no-install-package torch \
|
||||
--no-install-package torchvision \
|
||||
--no-install-package triton \
|
||||
--no-install-package nvidia-cublas-cu12 \
|
||||
--no-install-package nvidia-cuda-cupti-cu12 \
|
||||
--no-install-package nvidia-cuda-nvrtc-cu12 \
|
||||
--no-install-package nvidia-cuda-runtime-cu12 \
|
||||
--no-install-package nvidia-cudnn-cu12 \
|
||||
--no-install-package nvidia-cufft-cu12 \
|
||||
--no-install-package nvidia-cufile-cu12 \
|
||||
--no-install-package nvidia-curand-cu12 \
|
||||
--no-install-package nvidia-cusolver-cu12 \
|
||||
--no-install-package nvidia-cusparse-cu12 \
|
||||
--no-install-package nvidia-cusparselt-cu12 \
|
||||
--no-install-package nvidia-nccl-cu12 \
|
||||
--no-install-package transformer-engine \
|
||||
--no-install-package nvidia-modelopt \
|
||||
--no-install-package nvidia-modelopt-core \
|
||||
--no-install-package flash-attn \
|
||||
--no-install-package transformer-engine-cu12 \
|
||||
--no-install-package transformer-engine-torch
|
||||
|
||||
## Install bitsandbytes
|
||||
CMAKE_ARGS="-DCOMPUTE_BACKEND=cuda -DCOMPUTE_CAPABILITY=80;86;87;89;90" \
|
||||
CMAKE_BUILD_PARALLEL_LEVEL=8 \
|
||||
uv pip install --no-deps git+https://github.com/bitsandbytes-foundation/bitsandbytes.git@50be19c39698e038a1604daf3e1b939c9ac1c342
|
||||
```
|
||||
|
||||
## Step 8. Verify installation
|
||||
|
||||
Confirm NeMo AutoModel is properly installed and accessible. This step validates the installation and checks for any missing dependencies.
|
||||
|
||||
```bash
|
||||
## Test NeMo AutoModel import
|
||||
uv run --frozen --no-sync python -c "import nemo_automodel; print('✅ NeMo AutoModel ready')"
|
||||
```
|
||||
> [!NOTE]
|
||||
> You might see a warning stating `grouped_gemm is not available`. You can ignore this warning if you see '✅ NeMo AutoModel ready'.
|
||||
|
||||
> [!NOTE]
|
||||
> Ensure steps 2 to 8 were conducted on all nodes for correct setup.
|
||||
|
||||
## Step 9. Run sample multi-node fine-tuning
|
||||
The following commands show how to perform full fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) with LoRA across both Spark devices using `torch.distributed.run`.
|
||||
|
||||
First, export your HF_TOKEN on both nodes so that gated models can be downloaded.
|
||||
|
||||
```bash
|
||||
export HF_TOKEN=<your_huggingface_token>
|
||||
```
|
||||
> [!NOTE]
|
||||
> Replace `<your_huggingface_token>` with your personal Hugging Face access token. A valid token is required to download any gated model.
|
||||
>
|
||||
> - Generate a token: [Hugging Face tokens](https://huggingface.co/settings/tokens), guide available [here](https://huggingface.co/docs/hub/en/security-tokens).
|
||||
> - Request and receive access on each model's page (and accept license/terms) before attempting downloads.
|
||||
> - Llama-3.1-8B: [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B)
|
||||
> - Qwen3-8B: [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)
|
||||
> - Mixtral-8x7B: [mistralai/Mixtral-8x7B](https://huggingface.co/mistralai/Mixtral-8x7B)
|
||||
>
|
||||
> The same steps apply for any other gated model you use: visit its model card on Hugging Face, request access, accept the license, and wait for approval.
|
||||
|
||||
Next, export a few multi-node PyTorch configuration environment variables.
|
||||
- `MASTER_ADDR`: IP address of your master node as set in [Connect two Sparks](https://build.nvidia.com/spark/connect-two-sparks/stacked-sparks). \(ex: 192.168.100.10\)
|
||||
- `MASTER_PORT`: Set a port number that can be used on your master node. \(ex: 12345\)
|
||||
- `NODE_RANK`: Master rank is set to 0 and Worker rank is set to 1
|
||||
|
||||
Run this on the Master node:
|
||||
```bash
|
||||
export MASTER_ADDR=<TODO: specify IP>
|
||||
export MASTER_PORT=<TODO: specify port>
|
||||
export NODE_RANK=0
|
||||
```
|
||||
|
||||
Run this on the Worker node:
|
||||
```bash
|
||||
export MASTER_ADDR=<TODO: specify IP>
|
||||
export MASTER_PORT=<TODO: specify port>
|
||||
export NODE_RANK=1
|
||||
```
|
||||
|
||||
**LoRA fine-tuning example:**
|
||||
|
||||
Execute a basic fine-tuning example to validate the complete setup. This demonstrates parameter-efficient fine-tuning using a small model suitable for testing.
|
||||
For the examples below, we are using YAML for configuration, and parameter overrides are passed as command line arguments.
|
||||
|
||||
Run this on all nodes:
|
||||
```bash
|
||||
uv run --frozen --no-sync python -m torch.distributed.run \
|
||||
--nnodes=2 \
|
||||
--nproc_per_node=1 \
|
||||
--node_rank=${NODE_RANK} \
|
||||
--rdzv_backend=static \
|
||||
--rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
|
||||
examples/llm_finetune/finetune.py \
|
||||
-c examples/llm_finetune/llama3_2/llama3_2_1b_squad_peft.yaml \
|
||||
--model.pretrained_model_name_or_path meta-llama/Llama-3.1-8B \
|
||||
--packed_sequence.packed_sequence_size 1024 \
|
||||
--step_scheduler.max_steps 100
|
||||
```
|
||||
The following `torch.distributed.run` parameters configure our dual-node distributed PyTorch workload and communication:
|
||||
- `--nnodes`: sets the total number of nodes participating in the distributed training. This is 2 for our dual-node case.
|
||||
- `--nproc_per_node`: sets the number of processes to be executed on each node. 1 fine-tuning process will occur on each node in our example.
|
||||
- `--node_rank`: sets the rank of the current node. Again, Master rank is set to 0 and Worker rank is set to 1.
|
||||
- `--rdzv_backend`: sets the backend used for the rendezvous mechanism. The rendezvous mechanism allows nodes to discover each other and establish communication channels before beginning the distributed workload. We use `static` for a pre-configured rendezvous setup.
|
||||
- `--rdzv_endpoint`: sets the endpoint on which the rendezvous is expected to occur. This will be the Master node IP address and port specified earlier.
|
||||
|
||||
These config overrides ensure the Llama-3.1-8B LoRA run behaves as expected:
|
||||
- `--model.pretrained_model_name_or_path`: selects the Llama-3.1-8B model to fine-tune from the Hugging Face model hub (weights fetched via your Hugging Face token).
|
||||
- `--packed_sequence.packed_sequence_size`: sets the packed sequence size to 1024 to enable packed sequence training.
|
||||
- `--step_scheduler.max_steps`: sets the maximum number of training steps. We set it to 100 for demonstration purposes; adjust this based on your needs.
|
||||
|
||||
> [!NOTE]
|
||||
> `NCCL WARN NET/IB : roceP2p1s0f1:1 unknown event type (18)` logs during multi-node workloads can be ignored and are a sign that RoCE is functional.
|
||||
|
||||
**Full Fine-tuning example:**
|
||||
|
||||
Run this on all nodes:
|
||||
```bash
|
||||
uv run --frozen --no-sync python -m torch.distributed.run \
|
||||
--nnodes=2 \
|
||||
--nproc_per_node=1 \
|
||||
--node_rank=${NODE_RANK} \
|
||||
--rdzv_backend=static \
|
||||
--rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
|
||||
examples/llm_finetune/finetune.py \
|
||||
-c examples/llm_finetune/qwen/qwen3_8b_squad_spark.yaml \
|
||||
--model.pretrained_model_name_or_path Qwen/Qwen3-8B \
|
||||
--step_scheduler.local_batch_size 1 \
|
||||
--step_scheduler.max_steps 100 \
|
||||
--packed_sequence.packed_sequence_size 1024
|
||||
```
|
||||
These config overrides ensure the Qwen3-8B SFT run behaves as expected:
|
||||
- `--model.pretrained_model_name_or_path`: selects the Qwen/Qwen3-8B model to fine-tune from the Hugging Face model hub (weights fetched via your Hugging Face token). Adjust this if you want to fine-tune a different model.
|
||||
- `--step_scheduler.max_steps`: sets the maximum number of training steps. We set it to 100 for demonstration purposes; adjust this based on your needs.
|
||||
- `--step_scheduler.local_batch_size`: sets the per-GPU micro-batch size to 1 to fit in memory; overall effective batch size is still driven by gradient accumulation and data/tensor parallel settings from the recipe.
|
||||
|
||||
|
||||
## Step 10. Validate successful training completion
|
||||
|
||||
Validate the fine-tuned model by inspecting artifacts contained in the checkpoint directory on your Master node.
|
||||
|
||||
```bash
|
||||
## Inspect logs and checkpoint output.
|
||||
## The LATEST is a symlink pointing to the latest checkpoint.
|
||||
## The checkpoint is the one that was saved during training.
|
||||
## below is an example of the expected output (username and domain-users are placeholders).
|
||||
ls -lah checkpoints/LATEST/
|
||||
|
||||
## root@gx10-f154:/workspace/Automodel# ls -lah checkpoints/LATEST/
|
||||
## total 36K
|
||||
## drwxr-xr-x 6 username domain-users 4.0K Dec 8 20:16 .
|
||||
## drwxr-xr-x 3 username domain-users 4.0K Dec 8 20:16 ..
|
||||
## -rw-r--r-- 1 username domain-users 1.6K Dec 8 20:16 config.yaml
|
||||
## drwxr-xr-x 2 username domain-users 4.0K Dec 8 20:16 dataloader
|
||||
## -rw-r--r-- 1 username domain-users 66 Dec 8 20:16 losses.json
|
||||
## drwxr-xr-x 3 username domain-users 4.0K Dec 8 20:16 model
|
||||
## drwxr-xr-x 2 username domain-users 4.0K Dec 8 20:16 optim
|
||||
## drwxr-xr-x 2 username domain-users 4.0K Dec 8 20:16 rng
|
||||
## -rw-r--r-- 1 username domain-users 1.3K Dec 8 20:16 step_scheduler.pt
|
||||
```
|
||||
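As an additional spot check, you can print the recorded losses from the listing above (the exact format depends on your NeMo AutoModel version):

```bash
# Loss values logged during the run
cat checkpoints/LATEST/losses.json
```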
|
||||
## Step 11. Cleanup and rollback
|
||||
|
||||
Stop and remove containers by using the following command on all nodes:
|
||||
|
||||
```bash
|
||||
docker stop automodel-node
|
||||
docker rm automodel-node
|
||||
```
|
||||
|
||||
> [!WARNING]
|
||||
> This removes all training data and performance reports. Copy `checkpoints/` out of the container in advance if you want to keep it.
|
||||
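One way to preserve the checkpoints is to copy them to the host before removing the container (a sketch; adjust the in-container path if you ran training from a directory other than the `/workspace/Automodel` shown in Step 10):

```bash
# Copy the checkpoint directory out of the running container
docker cp automodel-node:/workspace/Automodel/checkpoints ./automodel-checkpoints
```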
|
||||
## Troubleshooting
|
||||
|
||||
## Common issues for running on a single Spark
|
||||
|
||||
| Symptom | Cause | Fix |
|
||||
|---------|--------|-----|
|
||||
| `nvcc: command not found` | CUDA toolkit not in PATH | Add CUDA toolkit to PATH: `export PATH=/usr/local/cuda/bin:$PATH` |
|
||||
@ -307,6 +596,19 @@ Explore the [NeMo AutoModel GitHub repository](https://github.com/NVIDIA-NeMo/Au
|
||||
| ARM64 package compatibility issues | Package not available for ARM architecture | Use source installation or build from source with ARM64 flags |
|
||||
| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser |
|
||||
|
||||
## Common issues for running on two Sparks
|
||||
|
||||
| Symptom | Cause | Fix |
|
||||
|---------|-------|-----|
|
||||
| `nvcc: command not found` | CUDA toolkit not in PATH | Add CUDA toolkit to PATH: `export PATH=/usr/local/cuda/bin:$PATH` |
|
||||
| Container exits immediately | Missing entrypoint script | Ensure `pytorch-ft-entrypoint.sh` download succeeded and has executable permissions |
|
||||
| `The container name "/automodel-node" is already in use` | Another docker container of the same name is in use on the node (likely forgotten during clean up) | Remove (or rename) the old container or rename the new one |
|
||||
| GPU not detected in training | CUDA driver/runtime mismatch | Verify driver compatibility: `nvidia-smi` and reinstall CUDA if needed |
|
||||
| Out of memory during training | Model too large for available GPU memory | Reduce batch size, enable gradient checkpointing, or use model parallelism |
|
||||
| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser |
|
||||
| Checkpoint loading failure when running fine-tuning examples consecutively: `No such file or directory: 'checkpoints/epoch_0_step_*/*'` | Fine-tuning script attempts to load old checkpoints unsuccessfully | Remove the `checkpoints/` directory before running again |
|
||||
| `Unable to find address for: enp1s0f0np0` when attempting single node fine-tuning run on multi-node container | `enp1s0f0np0` is not configured with an IP | Verify network configuration or, if you configured the devices on `enp1s0f1np1`, set `NCCL_SOCKET_IFNAME` and `GLOO_SOCKET_IFNAME` to only `enp1s0f1np1` |
|
||||
|
||||
> [!NOTE]
|
||||
> DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
|
||||
> With many applications still updating to take advantage of UMA, you may encounter memory issues even when within
|
||||
|
||||
@ -172,12 +172,15 @@ Verify the NVIDIA runtime works:
|
||||
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
|
||||
```
|
||||
|
||||
If you get a permission denied error on `docker`, add your user to the Docker group and log out/in:
|
||||
If you get a permission denied error on `docker`, add your user to the Docker group and activate the new group in your current session:
|
||||
|
||||
```bash
|
||||
sudo usermod -aG docker $USER
|
||||
newgrp docker
|
||||
```
|
||||
|
||||
This applies the group change immediately. Alternatively, you can log out and back in instead of running `newgrp docker`.
|
||||
|
||||
> [!NOTE]
|
||||
> DGX Spark uses cgroup v2. OpenShell's gateway embeds k3s inside Docker and needs host cgroup namespace access. Without `default-cgroupns-mode: host`, the gateway can fail with "Failed to start ContainerManager" errors.
|
||||
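If you need to set this on the Docker daemon, one option is the `default-cgroupns-mode` key in `/etc/docker/daemon.json` (a sketch only; it overwrites the file, so merge the key into your existing settings, such as the NVIDIA runtime entry, rather than pasting it verbatim):

```bash
# Overwrites /etc/docker/daemon.json -- merge by hand if you already have settings there
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
  "default-cgroupns-mode": "host"
}
EOF
sudo systemctl restart docker
```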
|
||||
@ -237,7 +240,7 @@ You should see `nemotron-3-super:120b` in the output.
|
||||
This single command handles everything: installs Node.js (if needed), installs OpenShell, clones NemoClaw at the pinned stable release (`v0.0.1`), builds the CLI, and runs the onboard wizard to create a sandbox.
|
||||
|
||||
```bash
|
||||
curl -fsSL https://www.nvidia.com/nemoclaw.sh | NEMOCLAW_INSTALL_TAG=v0.0.1 bash
|
||||
curl -fsSL https://www.nvidia.com/nemoclaw.sh | NEMOCLAW_INSTALL_TAG=v0.0.4 bash
|
||||
```
|
||||
|
||||
The onboard wizard walks you through setup:
|
||||
@ -322,13 +325,21 @@ http://127.0.0.1:18789/#token=<long-token-here>
|
||||
|
||||
**If accessing the Web UI from a remote machine**, you need to set up port forwarding.
|
||||
|
||||
First, find your Spark's IP address. On the Spark, run:
|
||||
|
||||
```bash
|
||||
hostname -I | awk '{print $1}'
|
||||
```
|
||||
|
||||
This prints the primary IP address (e.g. `192.168.1.42`). You can also find it in **Settings > Wi-Fi** or **Settings > Network** on the Spark's desktop, or check your router's connected-devices list.
|
||||
|
||||
Start the port forward on the Spark host:
|
||||
|
||||
```bash
|
||||
openshell forward start 18789 my-assistant --background
|
||||
```
|
||||
|
||||
Then from your remote machine, create an SSH tunnel to the Spark:
|
||||
Then from your remote machine, create an SSH tunnel to the Spark (replace `<your-spark-ip>` with the IP address from above):
|
||||
|
||||
```bash
|
||||
ssh -L 18789:127.0.0.1:18789 <your-user>@<your-spark-ip>
|
||||
|
||||
@ -31,12 +31,14 @@ Spark & Reachy Photo Booth is an interactive and event-driven photo booth demo t
|
||||
- **User position tracking** built with `facebookresearch/detectron2` and `FoundationVision/ByteTrack`
|
||||
- **MinIO** for storing captured/generated images as well as sharing them via QR-code
|
||||
|
||||
The demo is based on a several services that communicate through a message bus.
|
||||
The demo is based on several services that communicate through a message bus.
|
||||
|
||||

|
||||
|
||||
See also the walk-through video for this playbook: [Video](https://www.youtube.com/watch?v=6f1x8ReGLjc)
|
||||
|
||||
> [!NOTE]
|
||||
> This playbook applies to both the Reachy Mini and Reachy Mini Lite robots. For simplicity, we’ll refer to the robot as Reachy throughout this playbook.
|
||||
> This playbook applies to Reachy Mini Lite. Reachy Mini (with on-board Raspberry Pi) might require minor adaptations. For simplicity, we’ll refer to the robot as Reachy throughout this playbook.
|
||||
|
||||
## What you'll accomplish
|
||||
|
||||
@ -57,7 +59,7 @@ You'll deploy a complete photo booth system on DGX Spark running multiple infere
|
||||
> [!TIP]
|
||||
> Make sure your Reachy robot firmware is up to date. You can find instructions to update it [here](https://huggingface.co/spaces/pollen-robotics/Reachy_Mini).
|
||||
**Software Requirements:**
|
||||
- The official DGX Spark OS image including all required utilities such as Git, Docker, NVIDIA drivers, and the NVIDIA Container Toolkit
|
||||
- The official [DGX Spark OS](https://docs.nvidia.com/dgx/dgx-spark/dgx-os.html) image including all required utilities such as Git, Docker, NVIDIA drivers, and the NVIDIA Container Toolkit
|
||||
- An internet connection for the DGX Spark
|
||||
- NVIDIA NGC Personal API Key (**`NVIDIA_API_KEY`**). [Create a key](https://org.ngc.nvidia.com/setup/api-keys) if necessary. Make sure to enable the `NGC Catalog` scope when creating the key.
|
||||
- Hugging Face access token (**`HF_TOKEN`**). [Create a token](https://huggingface.co/settings/tokens) if necessary. Make sure to create a token with _Read access to contents of all public gated repos you can access_ permission.
|
||||
@ -77,8 +79,9 @@ All required assets can be found in the [Spark & Reachy Photo Booth repository](
|
||||
* **Estimated time:** 2 hours including hardware setup, container building, and model downloads
|
||||
* **Risk level:** Medium
|
||||
* **Rollback:** Docker containers can be stopped and removed to free resources. Downloaded models can be deleted from cache directories. Robot and peripheral connections can be safely disconnected. Network configurations can be reverted by removing custom settings.
|
||||
* **Last Updated:** 01/27/2026
|
||||
* 1.0.0 First Publication
|
||||
* **Last Updated:** 04/01/2026
|
||||
* 1.0.0 First publication
|
||||
* 1.0.1 Documentation improvements
|
||||
|
||||
## Governing terms
|
||||
Your use of the Spark Playbook scripts is governed by [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0) and enables use of separate open source and proprietary software governed by their respective licenses: [Flux.1-Kontext NIM](https://catalog.ngc.nvidia.com/orgs/nim/teams/black-forest-labs/containers/flux.1-kontext-dev?version=1.1), [Parakeet 1.1b CTC en-US ASR NIM](https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/containers/parakeet-1-1b-ctc-en-us?version=1.4), [TensorRT-LLM](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release?version=1.3.0rc1), [minio/minio](https://hub.docker.com/r/minio/minio), [arizephoenix/phoenix](https://hub.docker.com/r/arizephoenix/phoenix), [grafana/otel-lgtm](https://hub.docker.com/r/grafana/otel-lgtm), [Python](https://hub.docker.com/_/python), [Node.js](https://hub.docker.com/_/node), [nginx](https://hub.docker.com/_/nginx), [busybox](https://hub.docker.com/_/busybox), [UV Python Packager](https://docs.astral.sh/uv/), [Redpanda](https://www.redpanda.com/), [Redpanda Console](https://www.redpanda.com/), [gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b), [FLUX.1-Kontext-dev](https://huggingface.co/black-forest-labs/FLUX.1-Kontext-dev), [FLUX.1-Kontext-dev-onnx](https://huggingface.co/black-forest-labs/FLUX.1-Kontext-dev-onnx).
|
||||
@ -277,7 +280,7 @@ uv sync --all-packages
|
||||
Every folder suffixed by `-service` is a standalone Python program that runs in its own container. You must always start the services by interacting with the `docker-compose.yaml` at the root of the repository. You can enable code hot reloading for all the Python services by running:
|
||||
|
||||
```bash
|
||||
docker compose up -d --build --watch
|
||||
docker compose up --build --watch
|
||||
```
|
||||
|
||||
Whenever you change some Python code in the repository, the associated container will be updated and automatically restarted.
|
||||
@ -315,6 +318,7 @@ The [Writing Your First Service](https://github.com/NVIDIA/spark-reachy-photo-bo
|
||||
|---------|-------|-----|
|
||||
| No audio from robot (low volume) | Reachy speaker volume set too low by default | Increase Reachy speaker volume to maximum |
|
||||
| No audio from robot (device conflict) | Another application capturing Reachy speaker | Check `animation-compositor` logs for "Error querying device (-1)", verify Reachy speaker is not set as system default in Ubuntu sound settings, ensure no other apps are capturing the speaker, then restart the demo |
|
||||
| Image-generation fails on first start | Transient initialization issue | Rerun `docker compose up --build -d` to resolve the issue |
|
||||
|
||||
If you have any issues with Reachy that are not covered by this guide, please read [Hugging Face's official troubleshooting guide](https://huggingface.co/docs/reachy_mini/troubleshooting).
|
||||
|
||||
|
||||
@ -442,7 +442,7 @@ Replace the IP addresses with your actual node IPs.
|
||||
On **each node** (primary and worker), run the following command to start the TRT-LLM container:
|
||||
|
||||
```bash
|
||||
docker run -d --rm \
|
||||
docker run -d --rm \
|
||||
--name trtllm-multinode \
|
||||
--gpus '"device=all"' \
|
||||
--network host \
|
||||
@ -456,9 +456,11 @@ docker run -d --rm \
|
||||
-e OMPI_MCA_rmaps_ppr_n_pernode="1" \
|
||||
-e OMPI_ALLOW_RUN_AS_ROOT="1" \
|
||||
-e OMPI_ALLOW_RUN_AS_ROOT_CONFIRM="1" \
|
||||
-e CPATH=/usr/local/cuda/include \
|
||||
-e TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas \
|
||||
-v ~/.cache/huggingface/:/root/.cache/huggingface/ \
|
||||
-v ~/.ssh:/tmp/.ssh:ro \
|
||||
nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 \
|
||||
nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc5 \
|
||||
sh -c "curl https://raw.githubusercontent.com/NVIDIA/dgx-spark-playbooks/refs/heads/main/nvidia/trt-llm/assets/trtllm-mn-entrypoint.sh | sh"
|
||||
```
|
||||
|
||||
@ -477,7 +479,7 @@ You should see output similar to:
|
||||
|
||||
```
|
||||
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
|
||||
abc123def456 nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 "sh -c 'curl https:…" 10 seconds ago Up 8 seconds trtllm-multinode
|
||||
abc123def456 nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc5 "sh -c 'curl https:…" 10 seconds ago Up 8 seconds trtllm-multinode
|
||||
```
|
||||
|
||||
### Step 6. Copy hostfile to primary container
|
||||
|
||||
@ -27,8 +27,8 @@ services:
|
||||
# Ollama configuration
|
||||
- OLLAMA_BASE_URL=http://ollama:11434/v1
|
||||
- OLLAMA_MODEL=llama3.1:8b
|
||||
# Disable vLLM
|
||||
- VLLM_BASE_URL=http://localhost:8001/v1
|
||||
# vLLM disabled in default Ollama mode
|
||||
# - VLLM_BASE_URL=http://localhost:8001/v1
|
||||
- VLLM_MODEL=disabled
|
||||
# Vector DB configuration
|
||||
- QDRANT_URL=http://qdrant:6333
|
||||
|
||||
@ -108,7 +108,7 @@ export class TextProcessor {
|
||||
|
||||
// Determine which LLM provider to use based on configuration
|
||||
// Priority: vLLM > NVIDIA > Ollama
|
||||
if (process.env.VLLM_BASE_URL) {
|
||||
if (process.env.VLLM_BASE_URL && process.env.VLLM_MODEL && process.env.VLLM_MODEL !== 'disabled') {
|
||||
this.selectedLLMProvider = 'vllm';
|
||||
} else if (process.env.NVIDIA_API_KEY) {
|
||||
this.selectedLLMProvider = 'nvidia';
|
||||
|
||||
@ -54,6 +54,11 @@ The following models are supported with vLLM on Spark. All listed models are ava
|
||||
|
||||
| Model | Quantization | Support Status | HF Handle |
|
||||
|-------|-------------|----------------|-----------|
|
||||
| **Gemma 4 31B IT** | Base | ✅ | [`google/gemma-4-31B-it`](https://huggingface.co/google/gemma-4-31B-it) |
|
||||
| **Gemma 4 31B IT** | NVFP4 | ✅ | [`nvidia/Gemma-4-31B-IT-NVFP4`](https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4) |
|
||||
| **Gemma 4 26B A4B IT** | Base | ✅ | [`google/gemma-4-26B-A4B-it`](https://huggingface.co/google/gemma-4-26B-A4B-it) |
|
||||
| **Gemma 4 E4B IT** | Base | ✅ | [`google/gemma-4-E4B-it`](https://huggingface.co/google/gemma-4-E4B-it) |
|
||||
| **Gemma 4 E2B IT** | Base | ✅ | [`google/gemma-4-E2B-it`](https://huggingface.co/google/gemma-4-E2B-it) |
|
||||
| **Nemotron-3-Super-120B** | NVFP4 | ✅ | [`nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4`](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4) |
|
||||
| **GPT-OSS-20B** | MXFP4 | ✅ | [`openai/gpt-oss-20b`](https://huggingface.co/openai/gpt-oss-20b) |
|
||||
| **GPT-OSS-120B** | MXFP4 | ✅ | [`openai/gpt-oss-120b`](https://huggingface.co/openai/gpt-oss-120b) |
|
||||
@ -89,9 +94,8 @@ Reminder: not all model architectures are supported for NVFP4 quantization.
|
||||
* **Duration:** 30 minutes for Docker approach
|
||||
* **Risks:** Container registry access requires internal credentials
|
||||
* **Rollback:** Container approach is non-destructive.
|
||||
* **Last Updated:** 03/12/2026
|
||||
* Added support for Nemotron-3-Super-120B model
|
||||
* Updated container to Feb 2026 release (26.02-py3)
|
||||
* **Last Updated:** 04/02/2026
|
||||
* Add support for Gemma 4 model family
|
||||
|
||||
## Instructions
|
||||
|
||||
@ -117,13 +121,21 @@ Find the latest container build from https://catalog.ngc.nvidia.com/orgs/nvidia/
|
||||
|
||||
```bash
|
||||
export LATEST_VLLM_VERSION=<latest_container_version>
|
||||
|
||||
## example
|
||||
## export LATEST_VLLM_VERSION=26.02-py3
|
||||
|
||||
export HF_MODEL_HANDLE=<HF_HANDLE>
|
||||
## example
|
||||
## export HF_MODEL_HANDLE=openai/gpt-oss-20b
|
||||
|
||||
docker pull nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION}
|
||||
```
|
||||
|
||||
For the Gemma 4 model family, use the vLLM custom container:
|
||||
```bash
|
||||
docker pull vllm/vllm-openai:gemma4-cu130
|
||||
```
|
||||
|
||||
## Step 3. Test vLLM in container
|
||||
|
||||
Launch the container and start vLLM server with a test model to verify basic functionality.
|
||||
@ -131,7 +143,13 @@ Launch the container and start vLLM server with a test model to verify basic fun
|
||||
```bash
|
||||
docker run -it --gpus all -p 8000:8000 \
|
||||
nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION} \
|
||||
vllm serve "Qwen/Qwen2.5-Math-1.5B-Instruct"
|
||||
vllm serve ${HF_MODEL_HANDLE}
|
||||
```
|
||||
|
||||
To run models from the Gemma 4 model family (e.g. `google/gemma-4-31B-it`):
|
||||
```bash
|
||||
docker run -it --gpus all -p 8000:8000 \
|
||||
vllm/vllm-openai:gemma4-cu130 ${HF_MODEL_HANDLE}
|
||||
```
|
||||
|
||||
Expected output should include:
|
||||
@ -145,7 +163,7 @@ In another terminal, test the server:
|
||||
curl http://localhost:8000/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "Qwen/Qwen2.5-Math-1.5B-Instruct",
|
||||
"model": "'"${HF_MODEL_HANDLE}"'",
|
||||
"messages": [{"role": "user", "content": "12*17"}],
|
||||
"max_tokens": 500
|
||||
}'
|
||||
|
||||