chore: Regenerate all playbooks

GitLab CI 2025-10-06 13:35:52 +00:00
parent 7773c86f7c
commit c20b49d138
15 changed files with 228 additions and 227 deletions

@ -5,36 +5,20 @@
## Table of Contents ## Table of Contents
- [Overview](#overview) - [Overview](#overview)
- [What you'll accomplish](#what-youll-accomplish)
- [What to know before starting](#what-to-know-before-starting)
- [Prerequisites](#prerequisites)
- [Ancillary files](#ancillary-files)
- [Time & risk](#time-risk)
- [Instructions](#instructions) - [Instructions](#instructions)
- [Step 1. Verify system prerequisites](#step-1-verify-system-prerequisites)
- [Step 2. Launch PyTorch container with GPU support](#step-2-launch-pytorch-container-with-gpu-support)
- [Step 3. Clone LLaMA Factory repository](#step-3-clone-llama-factory-repository)
- [Step 4. Install LLaMA Factory with dependencies](#step-4-install-llama-factory-with-dependencies) - [Step 4. Install LLaMA Factory with dependencies](#step-4-install-llama-factory-with-dependencies)
- [Step 5. Configure PyTorch for CUDA 12.9 (if needed)](#step-5-configure-pytorch-for-cuda-129-if-needed)
- [Step 6. Prepare training configuration](#step-6-prepare-training-configuration)
- [Step 7. Launch fine-tuning training](#step-7-launch-fine-tuning-training)
- [Step 8. Validate training completion](#step-8-validate-training-completion)
- [Step 9. Test inference with fine-tuned model](#step-9-test-inference-with-fine-tuned-model)
- [Step 10. Troubleshooting](#step-10-troubleshooting)
- [Step 11. Cleanup and rollback](#step-11-cleanup-and-rollback)
- [Step 12. Next steps](#step-12-next-steps)
--- ---
## Overview ## Overview
### What you'll accomplish ## What you'll accomplish
You'll set up LLaMA Factory on NVIDIA Spark with Blackwell architecture to fine-tune large You'll set up LLaMA Factory on NVIDIA Spark with Blackwell architecture to fine-tune large
language models using LoRA, QLoRA, and full fine-tuning methods. This enables efficient language models using LoRA, QLoRA, and full fine-tuning methods. This enables efficient
model adaptation for specialized domains while leveraging hardware-specific optimizations. model adaptation for specialized domains while leveraging hardware-specific optimizations.
### What to know before starting ## What to know before starting
- Basic Python knowledge for editing config files and troubleshooting - Basic Python knowledge for editing config files and troubleshooting
- Command line usage for running shell commands and managing environments - Command line usage for running shell commands and managing environments
@ -44,7 +28,7 @@ model adaptation for specialized domains while leveraging hardware-specific opti
- Dataset preparation: formatting text data into JSON structure for instruction tuning - Dataset preparation: formatting text data into JSON structure for instruction tuning
- Resource management: adjusting batch size and memory settings for GPU constraints - Resource management: adjusting batch size and memory settings for GPU constraints
### Prerequisites ## Prerequisites
- NVIDIA Spark device with Blackwell architecture - NVIDIA Spark device with Blackwell architecture
@ -60,7 +44,7 @@ model adaptation for specialized domains while leveraging hardware-specific opti
- Internet connection for downloading models from Hugging Face Hub - Internet connection for downloading models from Hugging Face Hub
### Ancillary files ## Ancillary files
- Official LLaMA Factory repository: https://github.com/hiyouga/LLaMA-Factory - Official LLaMA Factory repository: https://github.com/hiyouga/LLaMA-Factory
@ -70,7 +54,7 @@ model adaptation for specialized domains while leveraging hardware-specific opti
- Documentation: https://llamafactory.readthedocs.io/en/latest/getting_started/data_preparation.html - Documentation: https://llamafactory.readthedocs.io/en/latest/getting_started/data_preparation.html
### Time & risk ## Time & risk
**Duration:** 30-60 minutes for initial setup, 1-7 hours for training depending on model size **Duration:** 30-60 minutes for initial setup, 1-7 hours for training depending on model size
and dataset. and dataset.
@ -83,7 +67,7 @@ saved locally and can be deleted to reclaim storage space.
## Instructions ## Instructions
### Step 1. Verify system prerequisites ## Step 1. Verify system prerequisites
Check that your NVIDIA Spark system has the required components installed and accessible. Check that your NVIDIA Spark system has the required components installed and accessible.
@ -95,7 +79,7 @@ python --version
git --version git --version
``` ```
### Step 2. Launch PyTorch container with GPU support ## Step 2. Launch PyTorch container with GPU support
Start the NVIDIA PyTorch container with GPU access and mount your workspace directory. Start the NVIDIA PyTorch container with GPU access and mount your workspace directory.
> **Note:** This NVIDIA PyTorch container supports CUDA 13 > **Note:** This NVIDIA PyTorch container supports CUDA 13
@ -104,7 +88,7 @@ Start the NVIDIA PyTorch container with GPU access and mount your workspace dire
docker run --gpus all --ipc=host --ulimit memlock=-1 -it --ulimit stack=67108864 --rm -v "$PWD":/workspace nvcr.io/nvidia/pytorch:25.08-py3 bash docker run --gpus all --ipc=host --ulimit memlock=-1 -it --ulimit stack=67108864 --rm -v "$PWD":/workspace nvcr.io/nvidia/pytorch:25.08-py3 bash
``` ```
### Step 3. Clone LLaMA Factory repository ## Step 3. Clone LLaMA Factory repository
Download the LLaMA Factory source code from the official repository. Download the LLaMA Factory source code from the official repository.
@ -121,9 +105,7 @@ Install the package in editable mode with metrics support for training evaluatio
pip install -e ".[metrics]" pip install -e ".[metrics]"
``` ```
### Step 5. Configure PyTorch for CUDA 12.9 (if needed) ## Step 5. Configure PyTorch for CUDA 12.9 (skip if using Docker container from Step 2)
#### If using standalone Python (skip if using Docker container)
In a Python virtual environment, uninstall the existing PyTorch and reinstall it with CUDA 12.9 support for the ARM64 architecture. In a Python virtual environment, uninstall the existing PyTorch and reinstall it with CUDA 12.9 support for the ARM64 architecture.
@ -132,7 +114,7 @@ pip uninstall torch torchvision torchaudio
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu129 pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu129
``` ```
#### If using Docker container *If using Docker container*
PyTorch is pre-installed with CUDA support. Verify installation: PyTorch is pre-installed with CUDA support. Verify installation:
@ -140,7 +122,7 @@ PyTorch is pre-installed with CUDA support. Verify installation:
python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')" python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"
``` ```
### Step 6. Prepare training configuration ## Step 6. Prepare training configuration
Examine the provided LoRA fine-tuning configuration for Llama-3. Examine the provided LoRA fine-tuning configuration for Llama-3.
@ -148,7 +130,7 @@ Examine the provided LoRA fine-tuning configuration for Llama-3.
cat examples/train_lora/llama3_lora_sft.yaml cat examples/train_lora/llama3_lora_sft.yaml
``` ```
### Step 7. Launch fine-tuning training ## Step 7. Launch fine-tuning training
> **Note:** Log in to Hugging Face Hub before downloading if the model is gated > **Note:** Log in to Hugging Face Hub before downloading if the model is gated
Execute the training process using the pre-configured LoRA setup. Execute the training process using the pre-configured LoRA setup.
@ -170,7 +152,7 @@ Example output:
Figure saved at: saves/llama3-8b/lora/sft/training_loss.png Figure saved at: saves/llama3-8b/lora/sft/training_loss.png
``` ```
### Step 8. Validate training completion ## Step 8. Validate training completion
Verify that training completed successfully and checkpoints were saved. Verify that training completed successfully and checkpoints were saved.
@ -186,7 +168,7 @@ Expected output should show:
- Training metrics showing decreasing loss values - Training metrics showing decreasing loss values
- Training loss plot saved as PNG file - Training loss plot saved as PNG file
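
One way to check the "decreasing loss" criterion programmatically is to compare the average loss at the start and at the end of a run. The sketch below is a hypothetical helper, not part of LLaMA Factory. The function name `loss_is_decreasing` and the loss values are illustrative assumptions; you would supply values taken from your own training logs.

```python
from statistics import mean


def loss_is_decreasing(losses: list[float], window: int = 3) -> bool:
    """Return True if the mean of the last `window` losses is lower than
    the mean of the first `window` losses."""
    if len(losses) < 2 * window:
        raise ValueError("need at least 2 * window loss values")
    return mean(losses[-window:]) < mean(losses[:window])


# Illustrative values only; replace with losses from your own run.
example_losses = [2.31, 2.10, 1.95, 1.72, 1.60, 1.48, 1.41]
print(loss_is_decreasing(example_losses))
```

Comparing windowed averages rather than single steps keeps the check from tripping on normal step-to-step noise.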
### Step 9. Test inference with fine-tuned model ## Step 9. Test inference with fine-tuned model
Run a simple inference test to verify the fine-tuned model loads correctly. Run a simple inference test to verify the fine-tuned model loads correctly.
@ -194,7 +176,7 @@ Run a simple inference test to verify the fine-tuned model loads correctly.
llamafactory-cli chat examples/inference/llama3_lora_sft.yaml llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
``` ```
### Step 10. Troubleshooting ## Step 10. Troubleshooting
| Symptom | Cause | Fix | | Symptom | Cause | Fix |
|---------|--------|-----| |---------|--------|-----|
@ -202,7 +184,7 @@ llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
| Model download fails or is slow | Network connectivity or Hugging Face Hub issues | Check internet connection, try using `HF_HUB_OFFLINE=1` for cached models | | Model download fails or is slow | Network connectivity or Hugging Face Hub issues | Check internet connection, try using `HF_HUB_OFFLINE=1` for cached models |
| Training loss not decreasing | Learning rate too high/low or insufficient data | Adjust `learning_rate` parameter or check dataset quality | | Training loss not decreasing | Learning rate too high/low or insufficient data | Adjust `learning_rate` parameter or check dataset quality |
### Step 11. Cleanup and rollback ## Step 11. Cleanup and rollback
> **Warning:** This will delete all training progress and checkpoints. > **Warning:** This will delete all training progress and checkpoints.
@ -220,7 +202,7 @@ exit # Exit container
docker container prune -f docker container prune -f
``` ```
### Step 12. Next steps ## Step 12. Next steps
Test your fine-tuned model with custom prompts: Test your fine-tuned model with custom prompts:

@ -35,14 +35,14 @@ FP8, FP4).
## Prerequisites ## Prerequisites
- [ ] NVIDIA Spark device with Blackwell GPU architecture - NVIDIA Spark device with Blackwell GPU architecture
- [ ] Docker installed and accessible to current user - Docker installed and accessible to current user
- [ ] NVIDIA Container Runtime configured - NVIDIA Container Runtime configured
- [ ] Hugging Face account with valid token - Hugging Face account with valid token
- [ ] At least 48GB VRAM available for FP16 Flux.1 Schnell operations - At least 48GB VRAM available for FP16 Flux.1 Schnell operations
- [ ] Verify GPU access: `nvidia-smi` - Verify GPU access: `nvidia-smi`
- [ ] Check Docker GPU integration: `docker run --rm --gpus all nvidia/cuda:12.0-base-ubuntu20.04 nvidia-smi` - Check Docker GPU integration: `docker run --rm --gpus all nvidia/cuda:12.0-base-ubuntu20.04 nvidia-smi`
- [ ] Confirm your Hugging Face token is set and has access to the FLUX repos: `echo $HF_TOKEN`. If you need a token, sign in to your Hugging Face account and create one at https://huggingface.co/settings/tokens, granting it permissions for the black-forest-labs/FLUX.1-dev and black-forest-labs/FLUX.1-dev-onnx repos (search for these repos when creating the token). - Confirm your Hugging Face token is set and has access to the FLUX repos: `echo $HF_TOKEN`. If you need a token, sign in to your Hugging Face account and create one at https://huggingface.co/settings/tokens, granting it permissions for the black-forest-labs/FLUX.1-dev and black-forest-labs/FLUX.1-dev-onnx repos (search for these repos when creating the token).
## Ancillary files ## Ancillary files

@ -35,22 +35,22 @@ You'll establish a complete fine-tuning environment for large language models (1
## Prerequisites ## Prerequisites
- [ ] NVIDIA Spark device with Blackwell architecture GPU access - NVIDIA Spark device with Blackwell architecture GPU access
- [ ] CUDA toolkit 12.0+ installed and configured - CUDA toolkit 12.0+ installed and configured
```bash ```bash
nvcc --version nvcc --version
``` ```
- [ ] Python 3.10+ environment available - Python 3.10+ environment available
```bash ```bash
python3 --version python3 --version
``` ```
- [ ] Minimum 32GB system RAM for efficient model loading and training - Minimum 32GB system RAM for efficient model loading and training
- [ ] Active internet connection for downloading models and packages - Active internet connection for downloading models and packages
- [ ] Git installed for repository cloning - Git installed for repository cloning
```bash ```bash
git --version git --version
``` ```
- [ ] SSH access to your NVIDIA Spark device configured - SSH access to your NVIDIA Spark device configured
## Ancillary files ## Ancillary files

@ -1,6 +1,6 @@
# Use a NIM on Spark # Use a NIM on Spark
> Run a NIM on Spark > Run an LLM NIM on Spark
## Table of Contents ## Table of Contents
@ -40,19 +40,19 @@ completions.
### Prerequisites ### Prerequisites
- [ ] DGX Spark device with NVIDIA drivers installed - DGX Spark device with NVIDIA drivers installed
```bash ```bash
nvidia-smi nvidia-smi
``` ```
- [ ] Docker with NVIDIA Container Toolkit configured, instructions here: https://******.nvidia.com/dgx-docs/review/621/dgx-spark/latest/nvidia-container-runtime-for-docker.html - Docker with NVIDIA Container Toolkit configured, instructions here: https://******.nvidia.com/dgx-docs/review/621/dgx-spark/latest/nvidia-container-runtime-for-docker.html
```bash ```bash
docker run -it --gpus=all nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 nvidia-smi docker run -it --gpus=all nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 nvidia-smi
``` ```
- [ ] NGC account with API key from https://ngc.nvidia.com/setup/api-key - NGC account with API key from https://ngc.nvidia.com/setup/api-key
```bash ```bash
echo $NGC_API_KEY | grep -E '^[a-zA-Z0-9]{86}==' echo $NGC_API_KEY | grep -E '^[a-zA-Z0-9]{86}=='
``` ```
- [ ] Sufficient disk space for model caching (varies by model, typically 10-50GB) - Sufficient disk space for model caching (varies by model, typically 10-50GB)
```bash ```bash
df -h ~ df -h ~
``` ```
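
The key-format check above can also be done from Python, which may be convenient inside a setup script. This is a minimal sketch that reuses the same pattern as the `grep` command (86 alphanumeric characters followed by `==`). The helper name `looks_like_ngc_key` is an assumption for illustration, and the check only validates the shape of the key, not whether NGC accepts it.

```python
import os
import re

# Same pattern as the grep check: 86 alphanumeric characters followed by "==".
NGC_KEY_PATTERN = re.compile(r"[a-zA-Z0-9]{86}==")


def looks_like_ngc_key(value: str) -> bool:
    """Return True if `value` starts with the expected key shape."""
    return NGC_KEY_PATTERN.match(value) is not None


key = os.environ.get("NGC_API_KEY", "")
if looks_like_ngc_key(key):
    print("NGC_API_KEY format OK")
else:
    print("NGC_API_KEY missing or malformed")
```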

@ -6,7 +6,7 @@
- [Overview](#overview) - [Overview](#overview)
- [NVFP4 on Blackwell](#nvfp4-on-blackwell) - [NVFP4 on Blackwell](#nvfp4-on-blackwell)
- [Desktop Access](#desktop-access) - [Instructions](#instructions)
--- ---
@ -40,11 +40,11 @@ inside a TensorRT-LLM container, producing an NVFP4 quantized model for deployme
## Prerequisites ## Prerequisites
- [ ] NVIDIA Spark device with Blackwell architecture GPU - NVIDIA Spark device with Blackwell architecture GPU
- [ ] Docker installed with GPU support - Docker installed with GPU support
- [ ] NVIDIA Container Toolkit configured - NVIDIA Container Toolkit configured
- [ ] At least 32GB of available storage for model files and outputs - At least 32GB of available storage for model files and outputs
- [ ] Hugging Face account with access to the target model - Hugging Face account with access to the target model
Verify your setup: Verify your setup:
```bash ```bash
@ -71,7 +71,7 @@ huggingface-cli whoami
**Rollback**: Remove the output directory and any pulled Docker images to restore original state. **Rollback**: Remove the output directory and any pulled Docker images to restore original state.
## Desktop Access ## Instructions
## Step 1. Prepare the environment ## Step 1. Prepare the environment

@ -6,7 +6,6 @@
- [Overview](#overview) - [Overview](#overview)
- [Instructions](#instructions) - [Instructions](#instructions)
- [Access with NVIDIA Sync](#access-with-nvidia-sync)
--- ---
@ -36,12 +35,9 @@ the powerful GPU capabilities of your Spark device without complex network confi
## Prerequisites ## Prerequisites
- [ ] DGX Spark device set up and connected to your network - DGX Spark device set up and connected to your network
- Verify with: `nvidia-smi` (should show Blackwell GPU information) - NVIDIA Sync installed and connected to your Spark
- [ ] NVIDIA Sync installed and connected to your Spark - Terminal access to your local machine for testing API calls
- Verify connection status in NVIDIA Sync system tray application
- [ ] Terminal access to your local machine for testing API calls
- Verify with: `curl --version`
@ -233,7 +229,3 @@ Monitor GPU and system usage during inference using the DGX Dashboard available
Build applications using the Ollama API by integrating with your preferred programming language's Build applications using the Ollama API by integrating with your preferred programming language's
HTTP client libraries. HTTP client libraries.
## Access with NVIDIA Sync
## Step 1. (DRAFT)

@ -30,23 +30,23 @@ RTX Pro 6000 or DGX Spark workstation.
## Prerequisites ## Prerequisites
- [ ] NVIDIA GPU (RTX Pro 6000 or DGX Spark recommended) - NVIDIA GPU (RTX Pro 6000 or DGX Spark recommended)
```bash ```bash
nvidia-smi # Should show GPU with CUDA ≥12.9 nvidia-smi # Should show GPU with CUDA ≥12.9
``` ```
- [ ] NVIDIA drivers and CUDA toolkit installed - NVIDIA drivers and CUDA toolkit installed
```bash ```bash
nvcc --version # Should show CUDA 12.9 or higher nvcc --version # Should show CUDA 12.9 or higher
``` ```
- [ ] Docker with NVIDIA Container Toolkit - Docker with NVIDIA Container Toolkit
```bash ```bash
docker run --rm --gpus all nvidia/cuda:12.9.0-base-ubuntu22.04 nvidia-smi docker run --rm --gpus all nvidia/cuda:12.9.0-base-ubuntu22.04 nvidia-smi
``` ```
- [ ] Python 3.8+ environment - Python 3.8+ environment
```bash ```bash
python3 --version # Should show 3.8 or higher python3 --version # Should show 3.8 or higher
``` ```
- [ ] Sufficient disk space for databases (>3TB recommended) - Sufficient disk space for databases (>3TB recommended)
```bash ```bash
df -h # Check available space df -h # Check available space
``` ```
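
The "CUDA 12.9 or higher" requirement is easy to get wrong with a string comparison, since "12.10" sorts before "12.9" as text. The sketch below parses the release number out of `nvcc --version` output and compares it numerically. The helper names are hypothetical, and the sample line is an illustrative assumption about the usual `release X.Y` format rather than output captured from a real system.

```python
import re


def parse_cuda_release(nvcc_output: str) -> tuple[int, int]:
    """Extract (major, minor) from text containing 'release X.Y'."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no CUDA release found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(nvcc_output: str, minimum: tuple[int, int] = (12, 9)) -> bool:
    """Compare versions as integer tuples, not as strings."""
    return parse_cuda_release(nvcc_output) >= minimum


# Illustrative sample line, not captured from a real system.
sample = "Cuda compilation tools, release 13.0, V13.0.48"
print(meets_minimum(sample))
```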

@ -13,74 +13,101 @@
## Basic Idea ## Basic Idea
This playbook guides you through setting up and using PyTorch for fine-tuning large language models on NVIDIA Spark devices. This playbook guides you through setting up and using PyTorch for fine-tuning large language models and vision-language models on NVIDIA Spark devices. NeMo AutoModel provides GPU-accelerated, end-to-end training for Hugging Face models with native PyTorch support, enabling instant fine-tuning without conversion delays. The framework supports distributed training from a single GPU to multi-node clusters, with optimized kernels and memory-efficient recipes specifically designed for ARM64 architecture and Blackwell GPU systems.
## What you'll accomplish ## What you'll accomplish
You'll establish a complete fine-tuning environment for large language models (1-70B parameters) on your NVIDIA Spark device. By the end, you'll have a working installation that supports parameter-efficient fine-tuning (PEFT) and supervised fine-tuning (SFT) You'll establish a complete fine-tuning environment for large language models (1-70B parameters) and vision-language models using NeMo AutoModel on your NVIDIA Spark device. By the end, you'll have a working installation that supports parameter-efficient fine-tuning (PEFT), supervised fine-tuning (SFT), and distributed training capabilities with FP8 precision optimizations, all while maintaining compatibility with the Hugging Face ecosystem.
## What to know before starting ## What to know before starting
## Prerequisites ## Prerequisites
Recipes are specifically for DIGITS SPARK. Please make sure that the OS and drivers are up to date.
## Ancillary files ## Ancillary files
All files required for fine-tuning are included.
## Time & risk ## Time & risk
**Time estimate:** 30-45 mins for setup and running fine-tuning. Fine-tuning run time varies depending on model size **Time estimate:**
**Risks:** Model downloads can be large (several GB), ARM64 package compatibility issues may require troubleshooting. **Risks:** Model downloads can be large (several GB), ARM64 package compatibility issues may require troubleshooting, distributed training setup complexity increases with multi-node configurations
**Rollback:** **Rollback:**
## Instructions ## Instructions
## Step 1. Pull the latest Pytorch container ## Step 1. Verify system requirements
Check your NVIDIA Spark device meets the prerequisites for NeMo AutoModel installation. This step runs on the host system to confirm CUDA toolkit availability and Python version compatibility.
```bash ```bash
docker pull nvcr.io/nvidia/pytorch:25.09-py3 ## Verify CUDA installation
nvcc --version
## Verify GPU accessibility
nvidia-smi
## Check available system memory
free -h
``` ```
## Step 2. Launch Docker ## Step 2. Get the container image
```bash ```bash
docker run --gpus all -it --rm --ipc=host \ docker pull nvcr.io/nvidia/pytorch:25.08-py3
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
-v ${PWD}:/workspace -w /workspace \
nvcr.io/nvidia/pytorch:25.09-py3
``` ```
## Step 3. Install dependencies inside the container ## Step 3. Launch Docker
```bash ```bash
pip install transformers peft datasets "trl==0.19.1" "bitsandbytes==0.48" docker run \
--gpus all \
--ulimit memlock=-1 \
-it --ulimit stack=67108864 \
--entrypoint /usr/bin/bash \
--rm nvcr.io/nvidia/pytorch:25.08-py3
``` ```
## Step 4: authenticate with huggingface
## Step 10. Troubleshooting
Common issues and solutions for NeMo AutoModel setup on NVIDIA Spark devices.
| Symptom | Cause | Fix |
|---------|--------|-----|
| `nvcc: command not found` | CUDA toolkit not in PATH | Add CUDA toolkit to PATH: `export PATH=/usr/local/cuda/bin:$PATH` |
| `pip install uv` permission denied | System-level pip restrictions | Use `pip3 install --user uv` and update PATH |
| GPU not detected in training | CUDA driver/runtime mismatch | Verify driver compatibility: `nvidia-smi` and reinstall CUDA if needed |
| Out of memory during training | Model too large for available GPU memory | Reduce batch size, enable gradient checkpointing, or use model parallelism |
| ARM64 package compatibility issues | Package not available for ARM architecture | Use source installation or build from source with ARM64 flags |
## Step 11. Cleanup and rollback
Remove the installation and restore the original environment if needed. These commands safely remove all installed components.
> **Warning:** This will delete all virtual environments and downloaded models. Ensure you have backed up any important training checkpoints.
```bash ```bash
huggingface-cli login ## Remove virtual environment
##<input your huggingface token. rm -rf .venv
##<Enter n for git credential>
``` ## Remove cloned repository
To run LoRA on Llama3 use the following command: cd ..
rm -rf Automodel
```bash ## Remove uv (if installed with --user)
python Llama3_8B_LoRA_finetuning.py pip3 uninstall uv
## Clear Python cache
rm -rf ~/.cache/pip
``` ```
To run qLoRA finetuning on llama3-70B use the following command: ## Step 12. Next steps
```bash
python Llama3_70B_qLoRA_finetuning.py
```
To run full finetuning on llama3-3B use the following command:
```bash
python Llama3_3B_full_finetuning.py
```

@ -35,12 +35,12 @@ vision-language tasks using models like DeepSeek-V2-Lite.
## Prerequisites ## Prerequisites
- [ ] NVIDIA Spark device with Blackwell architecture - NVIDIA Spark device with Blackwell architecture
- [ ] Docker Engine installed and running: `docker --version` - Docker Engine installed and running: `docker --version`
- [ ] NVIDIA GPU drivers installed: `nvidia-smi` - NVIDIA GPU drivers installed: `nvidia-smi`
- [ ] NVIDIA Container Toolkit configured: `docker run --rm --gpus all nvidia/cuda:12.9-base nvidia-smi` - NVIDIA Container Toolkit configured: `docker run --rm --gpus all nvidia/cuda:12.9-base nvidia-smi`
- [ ] Sufficient disk space (>20GB available): `df -h` - Sufficient disk space (>20GB available): `df -h`
- [ ] Network connectivity for pulling NGC containers: `ping nvcr.io` - Network connectivity for pulling NGC containers: `ping nvcr.io`
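
The disk-space prerequisite can be checked in a script instead of reading `df -h` by eye. This is a minimal sketch using the standard library. The helper names and the default path `/` are assumptions, and the threshold is expressed in GiB, which is slightly stricter than the decimal "20GB" figure above.

```python
import shutil


def free_gib(path: str = "/") -> float:
    """Free space on the filesystem containing `path`, in GiB."""
    return shutil.disk_usage(path).free / 1024**3


def has_free_space(path: str = "/", required_gib: float = 20.0) -> bool:
    """Return True if more than `required_gib` GiB are free at `path`."""
    return free_gib(path) > required_gib


print(f"{free_gib('/'):.1f} GiB free, enough: {has_free_space('/')}")
```

Point `path` at the filesystem that actually holds your Docker data directory if it differs from the root filesystem.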
## Ancillary files ## Ancillary files

@ -40,17 +40,17 @@ These examples demonstrate how to accelerate large language model inference whil
## Prerequisites ## Prerequisites
- [ ] NVIDIA Spark device with sufficient GPU memory available (80GB+ recommended for GPT-OSS 120B) - NVIDIA Spark device with sufficient GPU memory available (80GB+ recommended for GPT-OSS 120B)
- [ ] Docker with GPU support enabled - Docker with GPU support enabled
```bash ```bash
docker run --gpus all nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev nvidia-smi docker run --gpus all nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev nvidia-smi
``` ```
- [ ] Access to NVIDIA's internal container registry (for Eagle3 example) - Access to NVIDIA's internal container registry (for Eagle3 example)
- [ ] HuggingFace authentication configured (if needed for model downloads) - HuggingFace authentication configured (if needed for model downloads)
```bash ```bash
huggingface-cli login huggingface-cli login
``` ```
- [ ] Network connectivity for model downloads - Network connectivity for model downloads
## Time & risk ## Time & risk

@ -51,13 +51,13 @@ all traffic automatically encrypted and NAT traversal handled transparently.
## Prerequisites ## Prerequisites
- [ ] NVIDIA Spark device running Ubuntu (ARM64/AArch64) - NVIDIA Spark device running Ubuntu (ARM64/AArch64)
- [ ] Client device (Mac, Windows, or Linux) for remote access - Client device (Mac, Windows, or Linux) for remote access
- [ ] Internet connectivity on both devices - Internet connectivity on both devices
- [ ] Valid email account for Tailscale authentication (Google, GitHub, Microsoft) - Valid email account for Tailscale authentication (Google, GitHub, Microsoft)
- [ ] SSH server availability check: `systemctl status ssh` - SSH server availability check: `systemctl status ssh`
- [ ] Package manager working: `sudo apt update` - Package manager working: `sudo apt update`
- [ ] User account with sudo privileges on Spark device - User account with sudo privileges on Spark device
## Time & risk ## Time & risk

@ -54,13 +54,13 @@ inference through kernel-level optimizations, efficient memory layouts, and adva
## Prerequisites ## Prerequisites
- [ ] NVIDIA Spark device with Blackwell architecture GPUs - NVIDIA Spark device with Blackwell architecture GPUs
- [ ] NVIDIA drivers compatible with CUDA 12.x: `nvidia-smi` - NVIDIA drivers compatible with CUDA 12.x: `nvidia-smi`
- [ ] Docker installed and GPU support configured: `docker run --rm --gpus all nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev nvidia-smi` - Docker installed and GPU support configured: `docker run --rm --gpus all nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev nvidia-smi`
- [ ] Hugging Face account with token for model access: `echo $HF_TOKEN` - Hugging Face account with token for model access: `echo $HF_TOKEN`
- [ ] Sufficient GPU VRAM (16GB+ recommended for 70B models) - Sufficient GPU VRAM (16GB+ recommended for 70B models)
- [ ] Internet connectivity for downloading models and container images - Internet connectivity for downloading models and container images
- [ ] Network: open TCP ports 8355 (LLM) and 8356 (VLM) on host for OpenAI-compatible serving - Network: open TCP ports 8355 (LLM) and 8356 (VLM) on host for OpenAI-compatible serving
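
Once a server has been started, a quick way to confirm that something is accepting connections on the two ports listed above is a plain TCP connect. This sketch assumes the server runs on the local host (`127.0.0.1`), and the helper name `port_is_listening` is hypothetical. It only shows whether a process is listening; it does not inspect firewall rules.

```python
import socket


def port_is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


for name, port in (("LLM", 8355), ("VLM", 8356)):
    print(f"{name} port {port} listening: {port_is_listening('127.0.0.1', port)}")
```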
## Model Support Matrix ## Model Support Matrix

@ -36,10 +36,10 @@ parameter-efficient fine-tuning methods like LoRA and QLoRA.
## Prerequisites ## Prerequisites
- [ ] NVIDIA Spark device with Blackwell GPU architecture - NVIDIA Spark device with Blackwell GPU architecture
- [ ] `nvidia-smi` shows a summary of GPU information - `nvidia-smi` shows a summary of GPU information
- [ ] CUDA 13.0 installed: `nvcc --version` - CUDA 13.0 installed: `nvcc --version`
- [ ] Internet access for downloading models and datasets - Internet access for downloading models and datasets
## Ancillary files ## Ancillary files

@ -5,9 +5,9 @@
## Table of Contents ## Table of Contents
- [Overview](#overview) - [Overview](#overview)
- [Instructions](#instructions)
- [Run on two Sparks](#run-on-two-sparks) - [Run on two Sparks](#run-on-two-sparks)
- [Step 14. (Optional) Launch 405B inference server](#step-14-optional-launch-405b-inference-server) - [Step 14. (Optional) Launch 405B inference server](#step-14-optional-launch-405b-inference-server)
- [Access through terminal](#access-through-terminal)
--- ---
@ -29,14 +29,14 @@ support for ARM64.
## Prerequisites ## Prerequisites
- [ ] DGX Spark device with ARM64 processor and Blackwell GPU architecture - DGX Spark device with ARM64 processor and Blackwell GPU architecture
- [ ] CUDA 12.9 or CUDA 13.0 toolkit installed: `nvcc --version` shows CUDA toolkit version. - CUDA 12.9 or CUDA 13.0 toolkit installed: `nvcc --version` shows CUDA toolkit version.
- [ ] Docker installed and configured: `docker --version` succeeds - Docker installed and configured: `docker --version` succeeds
- [ ] NVIDIA Container Toolkit installed - NVIDIA Container Toolkit installed
- [ ] Python 3.12 available: `python3.12 --version` succeeds - Python 3.12 available: `python3.12 --version` succeeds
- [ ] Git installed: `git --version` succeeds - Git installed: `git --version` succeeds
- [ ] Network access to download packages and container images - Network access to download packages and container images
- [ ] > TODO: Verify memory and storage requirements for builds
## Time & risk ## Time & risk
@ -46,6 +46,77 @@ support for ARM64.
**Rollback:** Container approach is non-destructive. **Rollback:** Container approach is non-destructive.
## Instructions
## Step 1. Pull vLLM container image
Find the latest container build from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?version=25.09-py3
```
docker pull nvcr.io/nvidia/vllm:25.09-py3
```
## Step 2. Test vLLM in container
Launch the container and start vLLM server with a test model to verify basic functionality.
```bash
docker run -it --gpus all -p 8000:8000 \
nvcr.io/nvidia/vllm:25.09-py3 \
vllm serve "Qwen/Qwen2.5-Math-1.5B-Instruct"
```
Expected output should include:
- Model loading confirmation
- Server startup on port 8000
- GPU memory allocation details
In another terminal, test the server:
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-Math-1.5B-Instruct",
"messages": [{"role": "user", "content": "12*17"}],
"max_tokens": 500
}'
```
Expected response should contain `"content": "204"` or similar mathematical calculation.
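
To check the response in a script rather than by eye, the reply text can be pulled out of the JSON body. The sketch below assumes the OpenAI-compatible chat completions shape, where the text sits at `choices[0].message.content`. The helper names and the sample payload are illustrative assumptions, not output captured from the server.

```python
import json


def extract_reply(raw_body: str) -> str:
    """Return the assistant text from a chat completions JSON body."""
    body = json.loads(raw_body)
    return body["choices"][0]["message"]["content"]


def mentions_answer(reply: str, expected: str = "204") -> bool:
    """Return True if the expected answer appears anywhere in the reply."""
    return expected in reply


# Hypothetical payload for illustration; a real response has more fields.
sample_body = json.dumps(
    {"choices": [{"message": {"role": "assistant", "content": "12 * 17 = 204"}}]}
)
reply = extract_reply(sample_body)
print(reply, mentions_answer(reply))
```

A substring match is used because the model may wrap the number in an explanation instead of returning it alone.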
## Step 3. Troubleshooting
| Symptom | Cause | Fix |
|---------|--------|-----|
| CUDA version mismatch errors | Wrong CUDA toolkit version | Reinstall CUDA 12.9 using exact installer |
| Container registry authentication fails | Invalid or expired GitLab token | Generate new auth token |
| SM_121a architecture not recognized | Missing LLVM patches | Verify SM_121a patches applied to LLVM source |
| | | Reduce MAX_JOBS to 1-2, add swap space |
| | Environment variables not set | |
## Step 4. Cleanup and rollback
For container approach (non-destructive):
```bash
docker rm $(docker ps -aq --filter ancestor=******:5005/dl/dgx/vllm*)
docker rmi ******:5005/dl/dgx/vllm:main-py3.31165712-devel
```
To remove CUDA 12.9:
```bash
sudo /usr/local/cuda-12.9/bin/cuda-uninstaller
```
## Step 5. Next steps
- **Production deployment:** Configure vLLM with your specific model requirements
- **Performance tuning:** Adjust batch sizes and memory settings for your workload
- **Monitoring:** Set up logging and metrics collection for production use
- **Model management:** Explore additional model formats and quantization options
## Run on two Sparks ## Run on two Sparks
## Step 1. Verify hardware connectivity ## Step 1. Verify hardware connectivity
@ -310,74 +381,3 @@ http://192.168.100.10:8265
## - Persistent model caching across restarts ## - Persistent model caching across restarts
## - Alternative quantization methods (FP8, INT4) ## - Alternative quantization methods (FP8, INT4)
``` ```
## Access through terminal
## Step 1. Pull vLLM container image
Find the latest container build from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?version=25.09-py3
```
docker pull nvcr.io/nvidia/vllm:25.09-py3
```
## Step 2. Test vLLM in container
Launch the container and start vLLM server with a test model to verify basic functionality.
```bash
docker run -it --gpus all -p 8000:8000 \
nvcr.io/nvidia/vllm:25.09-py3 \
vllm serve "Qwen/Qwen2.5-Math-1.5B-Instruct"
```
Expected output should include:
- Model loading confirmation
- Server startup on port 8000
- GPU memory allocation details
In another terminal, test the server:
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-Math-1.5B-Instruct",
"messages": [{"role": "user", "content": "12*17"}],
"max_tokens": 500
}'
```
Expected response should contain `"content": "204"` or similar mathematical calculation.
## Step 3. Troubleshooting
| Symptom | Cause | Fix |
|---------|--------|-----|
| CUDA version mismatch errors | Wrong CUDA toolkit version | Reinstall CUDA 12.9 using exact installer |
| Container registry authentication fails | Invalid or expired GitLab token | Generate new auth token |
| SM_121a architecture not recognized | Missing LLVM patches | Verify SM_121a patches applied to LLVM source |
| | | Reduce MAX_JOBS to 1-2, add swap space |
| | Environment variables not set | |
## Step 4. Cleanup and rollback
For container approach (non-destructive):
```bash
docker rm $(docker ps -aq --filter ancestor=******:5005/dl/dgx/vllm*)
docker rmi ******:5005/dl/dgx/vllm:main-py3.31165712-devel
```
To remove CUDA 12.9:
```bash
sudo /usr/local/cuda-12.9/bin/cuda-uninstaller
```
## Step 5. Next steps
- **Production deployment:** Configure vLLM with your specific model requirements
- **Performance tuning:** Adjust batch sizes and memory settings for your workload
- **Monitoring:** Set up logging and metrics collection for production use
- **Model management:** Explore additional model formats and quantization options

@ -43,14 +43,14 @@ You will deploy NVIDIA's VSS AI Blueprint on NVIDIA Spark hardware with Blackwel
## Prerequisites ## Prerequisites
- [ ] NVIDIA Spark device with ARM64 architecture and Blackwell GPU - NVIDIA Spark device with ARM64 architecture and Blackwell GPU
- [ ] FastOS 1.81.38 or compatible ARM64 system - FastOS 1.81.38 or compatible ARM64 system
- [ ] Driver version 580.82.09 installed: `nvidia-smi | grep "Driver Version"` - Driver version 580.82.09 installed: `nvidia-smi | grep "Driver Version"`
- [ ] CUDA version 13.0 installed: `nvcc --version` - CUDA version 13.0 installed: `nvcc --version`
- [ ] Docker installed and running: `docker --version && docker compose version` - Docker installed and running: `docker --version && docker compose version`
- [ ] Access to NVIDIA Container Registry with NGC API Key - Access to NVIDIA Container Registry with NGC API Key
- [ ] [Optional] NVIDIA API Key for remote model endpoints (hybrid deployment only) - [Optional] NVIDIA API Key for remote model endpoints (hybrid deployment only)
- [ ] Sufficient storage space for video processing (>10GB recommended in `/tmp/`) - Sufficient storage space for video processing (>10GB recommended in `/tmp/`)
## Ancillary files ## Ancillary files