Mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git
Synced 2026-04-23 02:23:53 +00:00

chore: Regenerate all playbooks

Parent: 70bbbbfab8
Commit: c3793552fe
@ -65,13 +65,31 @@ All necessary files can be found in the TensorRT repository [here on GitHub](htt
- Remove downloaded models from HuggingFace cache
- Then exit the container environment
* **Last Updated:** 12/15/2025
* **Last Updated:** 12/22/2025
* Upgrade to latest pytorch container version nvcr.io/nvidia/pytorch:25.11-py3
* Add HuggingFace token setup instructions for model access
* Add docker container permission setup instructions
## Instructions

## Step 1. Launch the TensorRT container environment
## Step 1. Configure Docker permissions
To easily manage containers without sudo, you must be in the `docker` group. If you choose to skip this step, you will need to run Docker commands with sudo.

Open a new terminal and test Docker access. In the terminal, run:

```bash
docker ps
```

If you see a permission denied error (something like "permission denied while trying to connect to the Docker daemon socket"), add your user to the docker group so that you don't need to run the command with sudo.

```bash
sudo usermod -aG docker $USER
newgrp docker
```
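A quick way to confirm the group change took effect is to list the groups of your current session. This is a minimal sketch, not part of the official playbook:

```shell
# List the groups of the current session and look for "docker".
# Note: usermod -aG only affects new login sessions (or `newgrp docker`).
if id -nG | tr ' ' '\n' | grep -qx docker; then
  echo "docker group: active in this session"
else
  echo "docker group: not active -- log out/in or run 'newgrp docker'"
fi
```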
## Step 2. Launch the TensorRT container environment

Start the NVIDIA PyTorch container with GPU access and HuggingFace cache mounting. This provides the TensorRT development environment with all required dependencies pre-installed.

@ -83,7 +101,7 @@ docker run --gpus all --ipc=host --ulimit memlock=-1 \
nvcr.io/nvidia/pytorch:25.11-py3
```
## Step 2. Clone and set up TensorRT repository
## Step 3. Clone and set up TensorRT repository

Download the TensorRT repository and configure the environment for diffusion model demos.

@ -93,7 +111,7 @@ export TRT_OSSPATH=/workspace/TensorRT/
cd $TRT_OSSPATH/demo/Diffusion
```
## Step 3. Install required dependencies
## Step 4. Install required dependencies

Install NVIDIA ModelOpt and other dependencies for model quantization and optimization.

@ -113,7 +131,7 @@ Set up your HuggingFace token to access open models.
export HF_TOKEN=<YOUR_HUGGING_FACE_TOKEN>
```
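Before moving on, it can help to sanity-check the token variable. The `hf_` prefix check below is only a heuristic (Hugging Face user access tokens typically start with `hf_`), not an API validation:

```shell
# Heuristic check that HF_TOKEN is exported and looks like a HF token.
case "${HF_TOKEN:-}" in
  hf_*) echo "HF_TOKEN looks set" ;;
  "")   echo "HF_TOKEN is not set" ;;
  *)    echo "HF_TOKEN is set but does not start with hf_" ;;
esac
```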
## Step 4. Run Flux.1 Dev model inference
## Step 5. Run Flux.1 Dev model inference

Test multi-modal inference using the Flux.1 Dev model with different precision formats.

@ -138,7 +156,7 @@ python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry b
--hf-token=$HF_TOKEN --fp4 --download-onnx-models
```
## Step 5. Run Flux.1 Schnell model inference
## Step 6. Run Flux.1 Schnell model inference

Test the faster Flux.1 Schnell variant with different precision formats.

@ -168,7 +186,7 @@ python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry b
--fp4 --download-onnx-models
```
## Step 6. Run SDXL model inference
## Step 7. Run SDXL model inference

Test the SDXL model for comparison with different precision formats.

@ -186,7 +204,7 @@ python3 demo_txt2img_xl.py "a beautiful photograph of Mt. Fuji during cherry blo
--hf-token=$HF_TOKEN --version xl-1.0 --download-onnx-models --fp8
```
## Step 7. Validate inference outputs
## Step 8. Validate inference outputs

Check that the models generated images successfully and measure performance differences.

@ -201,7 +219,7 @@ nvidia-smi
python3 -c "import tensorrt as trt; print(f'TensorRT version: {trt.__version__}')"
```
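To complement the checks above, you can count the images the demos wrote. The exact output folder can vary by demo version, so treat the path below as an assumption and adjust it:

```shell
# Count PNG files in the assumed output location; adjust "outdir" if
# your demo version writes images somewhere else.
outdir="${OUTPUT_DIR:-.}"
count=$(ls -1 "$outdir"/*.png 2>/dev/null | wc -l)
echo "PNG images in $outdir: $count"
```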
## Step 8. Cleanup and rollback
## Step 9. Cleanup and rollback

Remove downloaded models and exit container environment to free disk space.

@ -216,7 +234,7 @@ exit
rm -rf $HOME/.cache/huggingface/
```
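Running a size report before and after the `rm -rf` makes the reclaimed space visible; a small sketch:

```shell
# Show how much disk the HuggingFace cache uses, or confirm it is gone.
cache="$HOME/.cache/huggingface"
if [ -d "$cache" ]; then
  du -sh "$cache"
else
  echo "cache already removed: $cache"
fi
```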
## Step 9. Next steps
## Step 10. Next steps

Use the validated setup to generate custom images or integrate multi-modal inference into your
applications. Try different prompts or explore model fine-tuning with the established TensorRT
@ -47,8 +47,9 @@ All necessary files for the playbook can be found [here on GitHub](https://githu
* **Duration:** 45-90 minutes for complete setup and initial model fine-tuning
* **Risks:** Model downloads can be large (several GB), ARM64 package compatibility issues may require troubleshooting, and distributed training setup complexity increases with multi-node configurations
* **Rollback:** Virtual environments can be completely removed; no system-level changes are made to the host system beyond package installations.
* **Last Updated:** 12/15/2025
* **Last Updated:** 12/22/2025
* Upgrade to latest pytorch container version nvcr.io/nvidia/pytorch:25.11-py3
* Add docker container permission setup instructions

## Instructions
@ -70,13 +71,37 @@ nvidia-smi
free -h
```

## Step 2. Get the container image
## Step 2. Configure Docker permissions

To easily manage containers without sudo, you must be in the `docker` group. If you choose to skip this step, you will need to run Docker commands with sudo.

Open a new terminal and test Docker access. In the terminal, run:

```bash
docker ps
```
If you see a permission denied error (something like "permission denied while trying to connect to the Docker daemon socket"), add your user to the docker group so that you don't need to run the command with sudo.

```bash
sudo usermod -aG docker $USER
newgrp docker
```

## Step 3. Get the container image

```bash
docker pull nvcr.io/nvidia/pytorch:25.11-py3
```
## Step 3. Launch Docker
If you see a permission denied error (something like "permission denied while trying to connect to the Docker daemon socket"), add your user to the docker group so that you don't need to run the command with sudo.

```bash
sudo usermod -aG docker $USER
newgrp docker
```

## Step 4. Launch Docker
```bash
docker run \
@ -87,7 +112,7 @@ docker run \
--rm nvcr.io/nvidia/pytorch:25.11-py3
```
## Step 4. Install package management tools
## Step 5. Install package management tools

Install `uv` for efficient package management and virtual environment isolation. NeMo AutoModel uses `uv` for dependency management and automatic environment handling.

@ -109,7 +134,7 @@ pip3 install --user uv
export PATH="$HOME/.local/bin:$PATH"
```
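After updating PATH you can confirm that `uv` resolves. A minimal check; the fallback hint assumes the `pip3 install --user uv` route above:

```shell
# Verify uv is reachable on PATH; print its version or a hint.
export PATH="$HOME/.local/bin:$PATH"
if command -v uv >/dev/null 2>&1; then
  uv --version
else
  echo "uv not found on PATH -- re-run: pip3 install --user uv"
fi
```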
## Step 5. Clone NeMo AutoModel repository
## Step 6. Clone NeMo AutoModel repository

Clone the official NeMo AutoModel repository to access recipes and examples. This provides ready-to-use training configurations for various model types and training scenarios.

@ -121,7 +146,7 @@ git clone https://github.com/NVIDIA-NeMo/Automodel.git
cd Automodel
```
## Step 6. Install NeMo AutoModel
## Step 7. Install NeMo AutoModel

Set up the virtual environment and install NeMo AutoModel. Choose between wheel package installation for stability or source installation for the latest features.

@ -161,7 +186,7 @@ CMAKE_BUILD_PARALLEL_LEVEL=8 \
uv pip install --no-deps git+https://github.com/bitsandbytes-foundation/bitsandbytes.git@50be19c39698e038a1604daf3e1b939c9ac1c342
```
## Step 7. Verify installation
## Step 8. Verify installation

Confirm NeMo AutoModel is properly installed and accessible. This step validates the installation and checks for any missing dependencies.

@ -186,7 +211,7 @@ ls -la examples/
## drwxr-xr-x 2 username domain-users 4096 Oct 14 09:27 vlm_generate
```
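A one-line import probe is another quick verification; the module name `nemo_automodel` is an assumption based on the repository name, so adjust it if your install exposes a different package:

```shell
# Probe for the installed package without importing heavy dependencies.
python3 -c "
import importlib.util
name = 'nemo_automodel'  # assumed module name; adjust if needed
spec = importlib.util.find_spec(name)
print(name, 'found' if spec else 'NOT found')
"
```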
## Step 8. Explore available examples
## Step 9. Explore available examples

Review the pre-configured training recipes available for different model types and training scenarios. These recipes provide optimized configurations for ARM64 and Blackwell architecture.

@ -198,7 +223,7 @@ ls examples/llm_finetune/
cat examples/llm_finetune/finetune.py | head -20
```
## Step 9. Run sample fine-tuning
## Step 10. Run sample fine-tuning
The following commands show how to perform full fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) with LoRA and QLoRA.

First, export your HF_TOKEN so that gated models can be downloaded.

@ -280,7 +305,7 @@ These overrides ensure the Qwen3-8B SFT run behaves as expected:
- `--step_scheduler.local_batch_size`: sets the per-GPU micro-batch size to 1 to fit in memory; the overall effective batch size is still driven by gradient accumulation and data/tensor parallel settings from the recipe.
## Step 10. Validate successful training completion
## Step 11. Validate successful training completion

Validate the fine-tuned model by inspecting artifacts contained in the checkpoint directory.

@ -303,7 +328,7 @@ ls -lah checkpoints/LATEST/
## -rw-r--r-- 1 username domain-users 1.3K Oct 16 22:33 step_scheduler.pt
```
## Step 11. Cleanup and rollback (Optional)
## Step 12. Cleanup and rollback (Optional)

Remove the installation and restore the original environment if needed. These commands safely remove all installed components.

@ -324,7 +349,7 @@ pip3 uninstall uv
## Clear Python cache
rm -rf ~/.cache/pip
```
## Step 12. Optional: Publish your fine-tuned model checkpoint on Hugging Face Hub
## Step 13. Optional: Publish your fine-tuned model checkpoint on Hugging Face Hub

Publish your fine-tuned model checkpoint on Hugging Face Hub.
> [!NOTE]
@ -359,7 +384,7 @@ hf upload my-cool-model checkpoints/LATEST/model
> ```
> To fix this, create an access token with *write* permissions; see the Hugging Face guide [here](https://huggingface.co/docs/hub/en/security-tokens) for instructions.
## Step 12. Next steps
## Step 14. Next steps

Begin using NeMo AutoModel for your specific fine-tuning tasks. Start with provided recipes and customize based on your model requirements and dataset.
@ -61,8 +61,9 @@ You'll launch a NIM container on your DGX Spark device to expose a GPU-accelerat
* GPU memory requirements vary by model size
* Container startup time depends on model loading
* **Rollback:** Stop and remove containers with `docker stop <CONTAINER_NAME> && docker rm <CONTAINER_NAME>`. Remove cached models from `~/.cache/nim` if disk space recovery is needed.
* **Last Updated:** 12/09/2025
* **Last Updated:** 12/22/2025
* Update docker container version to cuda:13.0.1-devel-ubuntu24.04
* Add docker container permission setup instructions

## Instructions
@ -76,6 +77,13 @@ docker --version
docker run --rm --gpus all nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 nvidia-smi
```

If you see a permission denied error (something like "permission denied while trying to connect to the Docker daemon socket"), add your user to the docker group so that you don't need to run the command with sudo.

```bash
sudo usermod -aG docker $USER
newgrp docker
```

### Step 2. Configure NGC authentication

Set up access to NVIDIA's container registry using your NGC API key.
@ -76,6 +76,13 @@ docker run --rm --gpus all lmsysorg/sglang:spark nvidia-smi
df -h /
```

If you see a permission denied error (something like "permission denied while trying to connect to the Docker daemon socket"), add your user to the docker group so that you don't need to run the command with sudo.

```bash
sudo usermod -aG docker $USER
newgrp docker
```

## Step 2. Pull the SGLang Container

Download the latest SGLang container. This step runs on the host and may take
@ -52,20 +52,38 @@ support for ARM64.
* **Duration:** 30 minutes for Docker approach
* **Risks:** Container registry access requires internal credentials
* **Rollback:** Container approach is non-destructive.
* **Last Updated:** 12/11/2025
* **Last Updated:** 12/22/2025
* Upgrade vLLM container to latest version nvcr.io/nvidia/vllm:25.11-py3
* Improve cluster setup instructions for "Run on two Sparks"
* Add docker container permission setup instructions

## Instructions
## Step 1. Pull vLLM container image
## Step 1. Configure Docker permissions

To easily manage containers without sudo, you must be in the `docker` group. If you choose to skip this step, you will need to run Docker commands with sudo.

Open a new terminal and test Docker access. In the terminal, run:

```bash
docker ps
```

If you see a permission denied error (something like "permission denied while trying to connect to the Docker daemon socket"), add your user to the docker group so that you don't need to run the command with sudo.

```bash
sudo usermod -aG docker $USER
newgrp docker
```
## Step 2. Pull vLLM container image

Find the latest container build from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?version=25.11-py3
```
docker pull nvcr.io/nvidia/vllm:25.11-py3
```
## Step 2. Test vLLM in container
## Step 3. Test vLLM in container

Launch the container and start the vLLM server with a test model to verify basic functionality.

@ -94,7 +112,7 @@ curl http://localhost:8000/v1/chat/completions \

Expected response should contain `"content": "204"` or a similar mathematical result.
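Before sending the test request, it can help to wait until the server is actually up. A minimal polling sketch; the `/health` route and port 8000 match common vLLM server defaults, so adjust them if your setup differs (requires curl):

```shell
# Poll an HTTP endpoint until it answers or attempts run out.
wait_for_server() {
  url="$1"; tries="${2:-30}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -sf "$url" >/dev/null 2>&1; then
      echo "server ready: $url"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "server not ready after $tries attempts: $url"
  return 1
}
# Example: wait_for_server http://localhost:8000/health 60
```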
## Step 3. Cleanup and rollback
## Step 4. Cleanup and rollback

For container approach (non-destructive):

@ -110,7 +128,7 @@ To remove CUDA 12.9:
sudo /usr/local/cuda-12.9/bin/cuda-uninstaller
```
## Step 4. Next steps
## Step 5. Next steps

- **Production deployment:** Configure vLLM with your specific model requirements
- **Performance tuning:** Adjust batch sizes and memory settings for your workload