chore: Regenerate all playbooks

This commit is contained in:
GitLab CI 2025-10-06 13:35:52 +00:00
parent 7773c86f7c
commit c20b49d138
15 changed files with 228 additions and 227 deletions

View File

@ -5,36 +5,20 @@
## Table of Contents
- [Overview](#overview)
- [What you'll accomplish](#what-youll-accomplish)
- [What to know before starting](#what-to-know-before-starting)
- [Prerequisites](#prerequisites)
- [Ancillary files](#ancillary-files)
- [Time & risk](#time-risk)
- [Instructions](#instructions)
- [Step 1. Verify system prerequisites](#step-1-verify-system-prerequisites)
- [Step 2. Launch PyTorch container with GPU support](#step-2-launch-pytorch-container-with-gpu-support)
- [Step 3. Clone LLaMA Factory repository](#step-3-clone-llama-factory-repository)
- [Step 4. Install LLaMA Factory with dependencies](#step-4-install-llama-factory-with-dependencies)
- [Step 5. Configure PyTorch for CUDA 12.9 (if needed)](#step-5-configure-pytorch-for-cuda-129-if-needed)
- [Step 6. Prepare training configuration](#step-6-prepare-training-configuration)
- [Step 7. Launch fine-tuning training](#step-7-launch-fine-tuning-training)
- [Step 8. Validate training completion](#step-8-validate-training-completion)
- [Step 9. Test inference with fine-tuned model](#step-9-test-inference-with-fine-tuned-model)
- [Step 10. Troubleshooting](#step-10-troubleshooting)
- [Step 11. Cleanup and rollback](#step-11-cleanup-and-rollback)
- [Step 12. Next steps](#step-12-next-steps)
---
## Overview
### What you'll accomplish
## What you'll accomplish
You'll set up LLaMA Factory on NVIDIA Spark with Blackwell architecture to fine-tune large
language models using LoRA, QLoRA, and full fine-tuning methods. This enables efficient
model adaptation for specialized domains while leveraging hardware-specific optimizations.
### What to know before starting
## What to know before starting
- Basic Python knowledge for editing config files and troubleshooting
- Command line usage for running shell commands and managing environments
@ -44,7 +28,7 @@ model adaptation for specialized domains while leveraging hardware-specific opti
- Dataset preparation: formatting text data into JSON structure for instruction tuning
- Resource management: adjusting batch size and memory settings for GPU constraints
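The "JSON structure for instruction tuning" mentioned above generally means Alpaca-style records with `instruction`/`input`/`output` fields, the layout LLaMA Factory's built-in alpaca template consumes. A minimal sketch (the record contents here are illustrative, not taken from the playbook):

```python
import json

# Alpaca-style instruction-tuning records: each entry carries an
# instruction, an optional input context, and the expected output.
records = [
    {
        "instruction": "Summarize the following text.",
        "input": "LoRA adapts large models by training low-rank matrices.",
        "output": "LoRA fine-tunes models via small low-rank updates.",
    },
]

def to_dataset_json(records):
    """Validate and serialize records into the instruction-tuning JSON file."""
    for r in records:
        # Every record needs all three keys; "input" may be an empty string.
        missing = {"instruction", "input", "output"} - set(r)
        if missing:
            raise ValueError(f"record missing keys: {missing}")
    return json.dumps(records, indent=2, ensure_ascii=False)
```

Point your dataset entry in LLaMA Factory's `data/dataset_info.json` at the file this produces; check the data-preparation docs linked under Ancillary files for the exact registration format.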
### Prerequisites
## Prerequisites
- NVIDIA Spark device with Blackwell architecture
@ -60,7 +44,7 @@ model adaptation for specialized domains while leveraging hardware-specific opti
- Internet connection for downloading models from Hugging Face Hub
### Ancillary files
## Ancillary files
- Official LLaMA Factory repository: https://github.com/hiyouga/LLaMA-Factory
@ -70,7 +54,7 @@ model adaptation for specialized domains while leveraging hardware-specific opti
- Documentation: https://llamafactory.readthedocs.io/en/latest/getting_started/data_preparation.html
### Time & risk
## Time & risk
**Duration:** 30-60 minutes for initial setup, 1-7 hours for training depending on model size
and dataset.
@ -83,7 +67,7 @@ saved locally and can be deleted to reclaim storage space.
## Instructions
### Step 1. Verify system prerequisites
## Step 1. Verify system prerequisites
Check that your NVIDIA Spark system has the required components installed and accessible.
@ -95,7 +79,7 @@ python --version
git --version
```
### Step 2. Launch PyTorch container with GPU support
## Step 2. Launch PyTorch container with GPU support
Start the NVIDIA PyTorch container with GPU access and mount your workspace directory.
> **Note:** This NVIDIA PyTorch container supports CUDA 13
@ -104,7 +88,7 @@ Start the NVIDIA PyTorch container with GPU access and mount your workspace dire
docker run --gpus all --ipc=host --ulimit memlock=-1 -it --ulimit stack=67108864 --rm -v "$PWD":/workspace nvcr.io/nvidia/pytorch:25.08-py3 bash
```
### Step 3. Clone LLaMA Factory repository
## Step 3. Clone LLaMA Factory repository
Download the LLaMA Factory source code from the official repository.
@ -121,9 +105,7 @@ Install the package in editable mode with metrics support for training evaluatio
pip install -e ".[metrics]"
```
### Step 5. Configure PyTorch for CUDA 12.9 (if needed)
#### If using standalone Python (skip if using Docker container)
## Step 5. Configure PyTorch for CUDA 12.9 (skip if using Docker container from Step 2)
In a Python virtual environment, uninstall the existing PyTorch and reinstall it with CUDA 12.9 support for the ARM64 architecture.
@ -132,7 +114,7 @@ pip uninstall torch torchvision torchaudio
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu129
```
#### If using Docker container
*If using Docker container*
PyTorch is pre-installed with CUDA support. Verify installation:
@ -140,7 +122,7 @@ PyTorch is pre-installed with CUDA support. Verify installation:
python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"
```
### Step 6. Prepare training configuration
## Step 6. Prepare training configuration
Examine the provided LoRA fine-tuning configuration for Llama-3.
@ -148,7 +130,7 @@ Examine the provided LoRA fine-tuning configuration for Llama-3.
cat examples/train_lora/llama3_lora_sft.yaml
```
### Step 7. Launch fine-tuning training
## Step 7. Launch fine-tuning training
> **Note:** Log in to the Hugging Face Hub to download the model if it is gated
Execute the training process using the pre-configured LoRA setup.
@ -170,7 +152,7 @@ Example output:
Figure saved at: saves/llama3-8b/lora/sft/training_loss.png
```
### Step 8. Validate training completion
## Step 8. Validate training completion
Verify that training completed successfully and checkpoints were saved.
@ -186,7 +168,7 @@ Expected output should show:
- Training metrics showing decreasing loss values
- Training loss plot saved as PNG file
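The "decreasing loss" check can be scripted rather than eyeballed. Recent LLaMA Factory runs write a `trainer_log.jsonl` next to the checkpoints; treat that filename and the `loss` field as assumptions to verify against your own output directory. A sketch:

```python
import json

def losses_from_log(jsonl_text):
    """Extract the 'loss' value from each JSON line of a trainer log."""
    losses = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        if "loss" in entry:
            losses.append(entry["loss"])
    return losses

def loss_is_decreasing(losses):
    """Crude sanity check: the last recorded loss is below the first."""
    return len(losses) >= 2 and losses[-1] < losses[0]

# Illustrative log lines; real entries carry more fields (lr, epoch, step).
sample = '{"current_steps": 10, "loss": 1.42}\n{"current_steps": 20, "loss": 0.97}'
```

Run it against `saves/llama3-8b/lora/sft/trainer_log.jsonl` (path per the example output above) to confirm the trend matches the loss plot.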
### Step 9. Test inference with fine-tuned model
## Step 9. Test inference with fine-tuned model
Run a simple inference test to verify the fine-tuned model loads correctly.
@ -194,7 +176,7 @@ Run a simple inference test to verify the fine-tuned model loads correctly.
llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
```
### Step 10. Troubleshooting
## Step 10. Troubleshooting
| Symptom | Cause | Fix |
|---------|--------|-----|
@ -202,7 +184,7 @@ llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
| Model download fails or is slow | Network connectivity or Hugging Face Hub issues | Check internet connection, try using `HF_HUB_OFFLINE=1` for cached models |
| Training loss not decreasing | Learning rate too high/low or insufficient data | Adjust `learning_rate` parameter or check dataset quality |
### Step 11. Cleanup and rollback
## Step 11. Cleanup and rollback
> **Warning:** This will delete all training progress and checkpoints.
@ -220,7 +202,7 @@ exit # Exit container
docker container prune -f
```
### Step 12. Next steps
## Step 12. Next steps
Test your fine-tuned model with custom prompts:

View File

@ -35,14 +35,14 @@ FP8, FP4).
## Prerequisites
- [ ] NVIDIA Spark device with Blackwell GPU architecture
- [ ] Docker installed and accessible to current user
- [ ] NVIDIA Container Runtime configured
- [ ] Hugging Face account with valid token
- [ ] At least 48GB VRAM available for FP16 Flux.1 Schnell operations
- [ ] Verify GPU access: `nvidia-smi`
- [ ] Check Docker GPU integration: `docker run --rm --gpus all nvidia/cuda:12.0-base-ubuntu20.04 nvidia-smi`
- [ ] Confirm HF token access with permissions to the FLUX repos: `echo $HF_TOKEN`. Sign in to your Hugging Face account and create a token at https://huggingface.co/settings/tokens, granting it read access to the black-forest-labs/FLUX.1-dev and black-forest-labs/FLUX.1-dev-onnx repositories (search for these repos when creating the token).
- NVIDIA Spark device with Blackwell GPU architecture
- Docker installed and accessible to current user
- NVIDIA Container Runtime configured
- Hugging Face account with valid token
- At least 48GB VRAM available for FP16 Flux.1 Schnell operations
- Verify GPU access: `nvidia-smi`
- Check Docker GPU integration: `docker run --rm --gpus all nvidia/cuda:12.0-base-ubuntu20.04 nvidia-smi`
- Confirm HF token access with permissions to the FLUX repos: `echo $HF_TOKEN`. Sign in to your Hugging Face account and create a token at https://huggingface.co/settings/tokens, granting it read access to the black-forest-labs/FLUX.1-dev and black-forest-labs/FLUX.1-dev-onnx repositories (search for these repos when creating the token).
## Ancillary files

View File

@ -35,22 +35,22 @@ You'll establish a complete fine-tuning environment for large language models (1
## Prerequisites
- [ ] NVIDIA Spark device with Blackwell architecture GPU access
- [ ] CUDA toolkit 12.0+ installed and configured
- NVIDIA Spark device with Blackwell architecture GPU access
- CUDA toolkit 12.0+ installed and configured
```bash
nvcc --version
```
- [ ] Python 3.10+ environment available
- Python 3.10+ environment available
```bash
python3 --version
```
- [ ] Minimum 32GB system RAM for efficient model loading and training
- [ ] Active internet connection for downloading models and packages
- [ ] Git installed for repository cloning
- Minimum 32GB system RAM for efficient model loading and training
- Active internet connection for downloading models and packages
- Git installed for repository cloning
```bash
git --version
```
- [ ] SSH access to your NVIDIA Spark device configured
- SSH access to your NVIDIA Spark device configured
## Ancillary files

View File

@ -1,6 +1,6 @@
# Use a NIM on Spark
> Run a NIM on Spark
> Run an LLM NIM on Spark
## Table of Contents
@ -40,19 +40,19 @@ completions.
### Prerequisites
- [ ] DGX Spark device with NVIDIA drivers installed
- DGX Spark device with NVIDIA drivers installed
```bash
nvidia-smi
```
- [ ] Docker with NVIDIA Container Toolkit configured, instructions here: https://******.nvidia.com/dgx-docs/review/621/dgx-spark/latest/nvidia-container-runtime-for-docker.html
- Docker with NVIDIA Container Toolkit configured, instructions here: https://******.nvidia.com/dgx-docs/review/621/dgx-spark/latest/nvidia-container-runtime-for-docker.html
```bash
docker run -it --gpus=all nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 nvidia-smi
```
- [ ] NGC account with API key from https://ngc.nvidia.com/setup/api-key
- NGC account with API key from https://ngc.nvidia.com/setup/api-key
```bash
echo $NGC_API_KEY | grep -E '^[a-zA-Z0-9]{86}=='
```
- [ ] Sufficient disk space for model caching (varies by model, typically 10-50GB)
- Sufficient disk space for model caching (varies by model, typically 10-50GB)
```bash
df -h ~
```

View File

@ -6,7 +6,7 @@
- [Overview](#overview)
- [NVFP4 on Blackwell](#nvfp4-on-blackwell)
- [Desktop Access](#desktop-access)
- [Instructions](#instructions)
---
@ -40,11 +40,11 @@ inside a TensorRT-LLM container, producing an NVFP4 quantized model for deployme
## Prerequisites
- [ ] NVIDIA Spark device with Blackwell architecture GPU
- [ ] Docker installed with GPU support
- [ ] NVIDIA Container Toolkit configured
- [ ] At least 32GB of available storage for model files and outputs
- [ ] Hugging Face account with access to the target model
- NVIDIA Spark device with Blackwell architecture GPU
- Docker installed with GPU support
- NVIDIA Container Toolkit configured
- At least 32GB of available storage for model files and outputs
- Hugging Face account with access to the target model
Verify your setup:
```bash
@ -71,7 +71,7 @@ huggingface-cli whoami
**Rollback**: Remove the output directory and any pulled Docker images to restore original state.
## Desktop Access
## Instructions
## Step 1. Prepare the environment

View File

@ -6,7 +6,6 @@
- [Overview](#overview)
- [Instructions](#instructions)
- [Access with NVIDIA Sync](#access-with-nvidia-sync)
---
@ -36,12 +35,9 @@ the powerful GPU capabilities of your Spark device without complex network confi
## Prerequisites
- [ ] DGX Spark device set up and connected to your network
- Verify with: `nvidia-smi` (should show Blackwell GPU information)
- [ ] NVIDIA Sync installed and connected to your Spark
- Verify connection status in NVIDIA Sync system tray application
- [ ] Terminal access to your local machine for testing API calls
- Verify with: `curl --version`
- DGX Spark device set up and connected to your network
- NVIDIA Sync installed and connected to your Spark
- Terminal access to your local machine for testing API calls
@ -233,7 +229,3 @@ Monitor GPU and system usage during inference using the DGX Dashboard available
Build applications using the Ollama API by integrating with your preferred programming language's
HTTP client libraries.
## Access with NVIDIA Sync
## Step 1. (DRAFT)

View File

@ -30,23 +30,23 @@ RTX Pro 6000 or DGX Spark workstation.
## Prerequisites
- [ ] NVIDIA GPU (RTX Pro 6000 or DGX Spark recommended)
- NVIDIA GPU (RTX Pro 6000 or DGX Spark recommended)
```bash
nvidia-smi # Should show GPU with CUDA ≥12.9
```
- [ ] NVIDIA drivers and CUDA toolkit installed
- NVIDIA drivers and CUDA toolkit installed
```bash
nvcc --version # Should show CUDA 12.9 or higher
```
- [ ] Docker with NVIDIA Container Toolkit
- Docker with NVIDIA Container Toolkit
```bash
docker run --rm --gpus all nvidia/cuda:12.9.0-base-ubuntu22.04 nvidia-smi
```
- [ ] Python 3.8+ environment
- Python 3.8+ environment
```bash
python3 --version # Should show 3.8 or higher
```
- [ ] Sufficient disk space for databases (>3TB recommended)
- Sufficient disk space for databases (>3TB recommended)
```bash
df -h # Check available space
```

View File

@ -13,74 +13,101 @@
## Basic Idea
This playbook guides you through setting up and using PyTorch for fine-tuning large language models on NVIDIA Spark devices.
This playbook guides you through setting up and using PyTorch for fine-tuning large language models and vision-language models on NVIDIA Spark devices. NeMo AutoModel provides GPU-accelerated, end-to-end training for Hugging Face models with native PyTorch support, enabling instant fine-tuning without conversion delays. The framework supports distributed training across single GPU to multi-node clusters, with optimized kernels and memory-efficient recipes specifically designed for ARM64 architecture and Blackwell GPU systems.
## What you'll accomplish
You'll establish a complete fine-tuning environment for large language models (1-70B parameters) on your NVIDIA Spark device. By the end, you'll have a working installation that supports parameter-efficient fine-tuning (PEFT) and supervised fine-tuning (SFT)
You'll establish a complete fine-tuning environment for large language models (1-70B parameters) and vision-language models using NeMo AutoModel on your NVIDIA Spark device. By the end, you'll have a working installation that supports parameter-efficient fine-tuning (PEFT), supervised fine-tuning (SFT), and distributed training capabilities with FP8 precision optimizations, all while maintaining compatibility with the Hugging Face ecosystem.
## What to know before starting
## Prerequisites
These recipes are specifically for DGX Spark. Make sure that the OS and drivers are up to date.
## Ancillary files
All files required for fine-tuning are included.
## Time & risk
**Time estimate:** 30-45 minutes for setup and running fine-tuning. Fine-tuning run time varies depending on model size.
**Time estimate:**
**Risks:** Model downloads can be large (several GB), ARM64 package compatibility issues may require troubleshooting.
**Risks:** Model downloads can be large (several GB), ARM64 package compatibility issues may require troubleshooting, distributed training setup complexity increases with multi-node configurations
**Rollback:**
## Instructions
## Step 1. Pull the latest Pytorch container
## Step 1. Verify system requirements
Check your NVIDIA Spark device meets the prerequisites for NeMo AutoModel installation. This step runs on the host system to confirm CUDA toolkit availability and Python version compatibility.
```bash
docker pull nvcr.io/nvidia/pytorch:25.09-py3
## Verify CUDA installation
nvcc --version
## Verify GPU accessibility
nvidia-smi
## Check available system memory
free -h
```
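If you'd rather script the memory check than eyeball `free -h`, a small sketch reading `/proc/meminfo` (Linux-only; the 32 GiB threshold is an illustrative floor — size it to the models you plan to fine-tune):

```python
def mem_total_gib(meminfo_text):
    """Parse MemTotal (reported in kB) out of /proc/meminfo content."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])
            return kib / (1024 * 1024)
    raise ValueError("MemTotal not found")

MIN_GIB = 32  # assumed minimum; adjust for your target model size

# Illustrative /proc/meminfo content (~125 GiB machine):
sample = "MemTotal:       131072000 kB\nMemFree:        9876543 kB"
```

On the device itself you would pass `open("/proc/meminfo").read()` instead of `sample`.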
## Step 2. Launch Docker
## Step 2. Get the container image
```bash
docker run --gpus all -it --rm --ipc=host \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
-v ${PWD}:/workspace -w /workspace \
nvcr.io/nvidia/pytorch:25.09-py3
docker pull nvcr.io/nvidia/pytorch:25.08-py3
```
## Step 3. Install dependencies inside the container
## Step 3. Launch Docker
```bash
pip install transformers peft datasets "trl==0.19.1" "bitsandbytes==0.48"
docker run \
--gpus all \
--ulimit memlock=-1 \
-it --ulimit stack=67108864 \
--entrypoint /usr/bin/bash \
--rm nvcr.io/nvidia/pytorch:25.08-py3
```
## Step 4. Authenticate with Hugging Face
## Step 10. Troubleshooting
Common issues and solutions for NeMo AutoModel setup on NVIDIA Spark devices.
| Symptom | Cause | Fix |
|---------|--------|-----|
| `nvcc: command not found` | CUDA toolkit not in PATH | Add CUDA toolkit to PATH: `export PATH=/usr/local/cuda/bin:$PATH` |
| `pip install uv` permission denied | System-level pip restrictions | Use `pip3 install --user uv` and update PATH |
| GPU not detected in training | CUDA driver/runtime mismatch | Verify driver compatibility: `nvidia-smi` and reinstall CUDA if needed |
| Out of memory during training | Model too large for available GPU memory | Reduce batch size, enable gradient checkpointing, or use model parallelism |
| ARM64 package compatibility issues | Package not available for ARM architecture | Use source installation or build from source with ARM64 flags |
## Step 11. Cleanup and rollback
Remove the installation and restore the original environment if needed. These commands safely remove all installed components.
> **Warning:** This will delete all virtual environments and downloaded models. Ensure you have backed up any important training checkpoints.
```bash
# Remove virtual environment
rm -rf .venv
# Remove cloned repository
cd ..
rm -rf Automodel
# Remove uv (if installed with --user)
pip3 uninstall uv
# Clear Python cache
rm -rf ~/.cache/pip
```
```bash
huggingface-cli login
# Enter your Hugging Face token when prompted
# Enter n at the git credential prompt
```
To run LoRA fine-tuning on Llama3-8B use the following command:
```bash
python Llama3_8B_LoRA_finetuning.py
```
To run QLoRA fine-tuning on Llama3-70B use the following command:
```bash
python Llama3_70B_qLoRA_finetuning.py
```
To run full fine-tuning on Llama3-3B use the following command:
```bash
python Llama3_3B_full_finetuning.py
```
## Step 12. Next steps

View File

@ -35,12 +35,12 @@ vision-language tasks using models like DeepSeek-V2-Lite.
## Prerequisites
- [ ] NVIDIA Spark device with Blackwell architecture
- [ ] Docker Engine installed and running: `docker --version`
- [ ] NVIDIA GPU drivers installed: `nvidia-smi`
- [ ] NVIDIA Container Toolkit configured: `docker run --rm --gpus all nvidia/cuda:12.9-base nvidia-smi`
- [ ] Sufficient disk space (>20GB available): `df -h`
- [ ] Network connectivity for pulling NGC containers: `ping nvcr.io`
- NVIDIA Spark device with Blackwell architecture
- Docker Engine installed and running: `docker --version`
- NVIDIA GPU drivers installed: `nvidia-smi`
- NVIDIA Container Toolkit configured: `docker run --rm --gpus all nvidia/cuda:12.9-base nvidia-smi`
- Sufficient disk space (>20GB available): `df -h`
- Network connectivity for pulling NGC containers: `ping nvcr.io`
## Ancillary files

View File

@ -40,17 +40,17 @@ These examples demonstrate how to accelerate large language model inference whil
## Prerequisites
- [ ] NVIDIA Spark device with sufficient GPU memory available (80GB+ recommended for GPT-OSS 120B)
- [ ] Docker with GPU support enabled
- NVIDIA Spark device with sufficient GPU memory available (80GB+ recommended for GPT-OSS 120B)
- Docker with GPU support enabled
```bash
docker run --gpus all nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev nvidia-smi
```
- [ ] Access to NVIDIA's internal container registry (for Eagle3 example)
- [ ] HuggingFace authentication configured (if needed for model downloads)
- Access to NVIDIA's internal container registry (for Eagle3 example)
- HuggingFace authentication configured (if needed for model downloads)
```bash
huggingface-cli login
```
- [ ] Network connectivity for model downloads
- Network connectivity for model downloads
## Time & risk

View File

@ -51,13 +51,13 @@ all traffic automatically encrypted and NAT traversal handled transparently.
## Prerequisites
- [ ] NVIDIA Spark device running Ubuntu (ARM64/AArch64)
- [ ] Client device (Mac, Windows, or Linux) for remote access
- [ ] Internet connectivity on both devices
- [ ] Valid email account for Tailscale authentication (Google, GitHub, Microsoft)
- [ ] SSH server availability check: `systemctl status ssh`
- [ ] Package manager working: `sudo apt update`
- [ ] User account with sudo privileges on Spark device
- NVIDIA Spark device running Ubuntu (ARM64/AArch64)
- Client device (Mac, Windows, or Linux) for remote access
- Internet connectivity on both devices
- Valid email account for Tailscale authentication (Google, GitHub, Microsoft)
- SSH server availability check: `systemctl status ssh`
- Package manager working: `sudo apt update`
- User account with sudo privileges on Spark device
## Time & risk

View File

@ -54,13 +54,13 @@ inference through kernel-level optimizations, efficient memory layouts, and adva
## Prerequisites
- [ ] NVIDIA Spark device with Blackwell architecture GPUs
- [ ] NVIDIA drivers compatible with CUDA 12.x: `nvidia-smi`
- [ ] Docker installed and GPU support configured: `docker run --rm --gpus all nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev nvidia-smi`
- [ ] Hugging Face account with token for model access: `echo $HF_TOKEN`
- [ ] Sufficient GPU VRAM (16GB+ recommended for 70B models)
- [ ] Internet connectivity for downloading models and container images
- [ ] Network: open TCP ports 8355 (LLM) and 8356 (VLM) on host for OpenAI-compatible serving
- NVIDIA Spark device with Blackwell architecture GPUs
- NVIDIA drivers compatible with CUDA 12.x: `nvidia-smi`
- Docker installed and GPU support configured: `docker run --rm --gpus all nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev nvidia-smi`
- Hugging Face account with token for model access: `echo $HF_TOKEN`
- Sufficient GPU VRAM (16GB+ recommended for 70B models)
- Internet connectivity for downloading models and container images
- Network: open TCP ports 8355 (LLM) and 8356 (VLM) on host for OpenAI-compatible serving
## Model Support Matrix

View File

@ -36,10 +36,10 @@ parameter-efficient fine-tuning methods like LoRA and QLoRA.
## Prerequisites
- [ ] NVIDIA Spark device with Blackwell GPU architecture
- [ ] `nvidia-smi` shows a summary of GPU information
- [ ] CUDA 13.0 installed: `nvcc --version`
- [ ] Internet access for downloading models and datasets
- NVIDIA Spark device with Blackwell GPU architecture
- `nvidia-smi` shows a summary of GPU information
- CUDA 13.0 installed: `nvcc --version`
- Internet access for downloading models and datasets
## Ancillary files

View File

@ -5,9 +5,9 @@
## Table of Contents
- [Overview](#overview)
- [Instructions](#instructions)
- [Run on two Sparks](#run-on-two-sparks)
- [Step 14. (Optional) Launch 405B inference server](#step-14-optional-launch-405b-inference-server)
- [Access through terminal](#access-through-terminal)
---
@ -29,14 +29,14 @@ support for ARM64.
## Prerequisites
- [ ] DGX Spark device with ARM64 processor and Blackwell GPU architecture
- [ ] CUDA 12.9 or CUDA 13.0 toolkit installed: `nvcc --version` shows CUDA toolkit version.
- [ ] Docker installed and configured: `docker --version` succeeds
- [ ] NVIDIA Container Toolkit installed
- [ ] Python 3.12 available: `python3.12 --version` succeeds
- [ ] Git installed: `git --version` succeeds
- [ ] Network access to download packages and container images
- [ ] > TODO: Verify memory and storage requirements for builds
- DGX Spark device with ARM64 processor and Blackwell GPU architecture
- CUDA 12.9 or CUDA 13.0 toolkit installed: `nvcc --version` shows CUDA toolkit version.
- Docker installed and configured: `docker --version` succeeds
- NVIDIA Container Toolkit installed
- Python 3.12 available: `python3.12 --version` succeeds
- Git installed: `git --version` succeeds
- Network access to download packages and container images
## Time & risk
@ -46,6 +46,77 @@ support for ARM64.
**Rollback:** Container approach is non-destructive.
## Instructions
## Step 1. Pull vLLM container image
Find the latest container build from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?version=25.09-py3
```
docker pull nvcr.io/nvidia/vllm:25.09-py3
```
## Step 2. Test vLLM in container
Launch the container and start vLLM server with a test model to verify basic functionality.
```bash
docker run -it --gpus all -p 8000:8000 \
nvcr.io/nvidia/vllm:25.09-py3 \
vllm serve "Qwen/Qwen2.5-Math-1.5B-Instruct"
```
Expected output should include:
- Model loading confirmation
- Server startup on port 8000
- GPU memory allocation details
In another terminal, test the server:
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-Math-1.5B-Instruct",
"messages": [{"role": "user", "content": "12*17"}],
"max_tokens": 500
}'
```
Expected response should contain `"content": "204"` or similar mathematical calculation.
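The same request can be scripted with the Python standard library; a sketch of the curl call above (it assumes the vLLM server from Step 2 is listening on localhost:8000, so the response extractor is exercised here against a canned response of the shape the OpenAI-compatible endpoint returns):

```python
import json
from urllib import request

def chat_payload(model, prompt, max_tokens=500):
    """Build an OpenAI-style chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def first_content(response_body):
    """Pull the assistant text out of a chat-completions response."""
    return response_body["choices"][0]["message"]["content"]

def ask(prompt, url="http://localhost:8000/v1/chat/completions"):
    """POST a prompt to the running vLLM server (requires Step 2)."""
    body = json.dumps(
        chat_payload("Qwen/Qwen2.5-Math-1.5B-Instruct", prompt)
    ).encode()
    req = request.Request(url, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return first_content(json.load(resp))

# Canned illustration of the response shape the server returns:
canned = {"choices": [{"message": {"role": "assistant", "content": "204"}}]}
```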
## Step 3. Troubleshooting
| Symptom | Cause | Fix |
|---------|--------|-----|
| CUDA version mismatch errors | Wrong CUDA toolkit version | Reinstall CUDA 12.9 using exact installer |
| Container registry authentication fails | Invalid or expired GitLab token | Generate new auth token |
| SM_121a architecture not recognized | Missing LLVM patches | Verify SM_121a patches applied to LLVM source |
| Out of memory during build | Too many parallel compile jobs | Reduce MAX_JOBS to 1-2, add swap space |
| Build cannot locate CUDA toolchain | Environment variables not set | Export the required CUDA/build environment variables and retry |
## Step 4. Cleanup and rollback
For container approach (non-destructive):
```bash
docker rm $(docker ps -aq --filter ancestor=******:5005/dl/dgx/vllm*)
docker rmi ******:5005/dl/dgx/vllm:main-py3.31165712-devel
```
To remove CUDA 12.9:
```bash
sudo /usr/local/cuda-12.9/bin/cuda-uninstaller
```
## Step 5. Next steps
- **Production deployment:** Configure vLLM with your specific model requirements
- **Performance tuning:** Adjust batch sizes and memory settings for your workload
- **Monitoring:** Set up logging and metrics collection for production use
- **Model management:** Explore additional model formats and quantization options
## Run on two Sparks
## Step 1. Verify hardware connectivity
@ -310,74 +381,3 @@ http://192.168.100.10:8265
## - Persistent model caching across restarts
## - Alternative quantization methods (FP8, INT4)
```
## Access through terminal
## Step 1. Pull vLLM container image
Find the latest container build from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?version=25.09-py3
```
docker pull nvcr.io/nvidia/vllm:25.09-py3
```
## Step 2. Test vLLM in container
Launch the container and start vLLM server with a test model to verify basic functionality.
```bash
docker run -it --gpus all -p 8000:8000 \
nvcr.io/nvidia/vllm:25.09-py3 \
vllm serve "Qwen/Qwen2.5-Math-1.5B-Instruct"
```
Expected output should include:
- Model loading confirmation
- Server startup on port 8000
- GPU memory allocation details
In another terminal, test the server:
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-Math-1.5B-Instruct",
"messages": [{"role": "user", "content": "12*17"}],
"max_tokens": 500
}'
```
Expected response should contain `"content": "204"` or similar mathematical calculation.
## Step 3. Troubleshooting
| Symptom | Cause | Fix |
|---------|--------|-----|
| CUDA version mismatch errors | Wrong CUDA toolkit version | Reinstall CUDA 12.9 using exact installer |
| Container registry authentication fails | Invalid or expired GitLab token | Generate new auth token |
| SM_121a architecture not recognized | Missing LLVM patches | Verify SM_121a patches applied to LLVM source |
| Out of memory during build | Too many parallel compile jobs | Reduce MAX_JOBS to 1-2, add swap space |
| Build cannot locate CUDA toolchain | Environment variables not set | Export the required CUDA/build environment variables and retry |
## Step 4. Cleanup and rollback
For container approach (non-destructive):
```bash
docker rm $(docker ps -aq --filter ancestor=******:5005/dl/dgx/vllm*)
docker rmi ******:5005/dl/dgx/vllm:main-py3.31165712-devel
```
To remove CUDA 12.9:
```bash
sudo /usr/local/cuda-12.9/bin/cuda-uninstaller
```
## Step 5. Next steps
- **Production deployment:** Configure vLLM with your specific model requirements
- **Performance tuning:** Adjust batch sizes and memory settings for your workload
- **Monitoring:** Set up logging and metrics collection for production use
- **Model management:** Explore additional model formats and quantization options

View File

@ -43,14 +43,14 @@ You will deploy NVIDIA's VSS AI Blueprint on NVIDIA Spark hardware with Blackwel
## Prerequisites
- [ ] NVIDIA Spark device with ARM64 architecture and Blackwell GPU
- [ ] FastOS 1.81.38 or compatible ARM64 system
- [ ] Driver version 580.82.09 installed: `nvidia-smi | grep "Driver Version"`
- [ ] CUDA version 13.0 installed: `nvcc --version`
- [ ] Docker installed and running: `docker --version && docker compose version`
- [ ] Access to NVIDIA Container Registry with NGC API Key
- [ ] [Optional] NVIDIA API Key for remote model endpoints (hybrid deployment only)
- [ ] Sufficient storage space for video processing (>10GB recommended in `/tmp/`)
- NVIDIA Spark device with ARM64 architecture and Blackwell GPU
- FastOS 1.81.38 or compatible ARM64 system
- Driver version 580.82.09 installed: `nvidia-smi | grep "Driver Version"`
- CUDA version 13.0 installed: `nvcc --version`
- Docker installed and running: `docker --version && docker compose version`
- Access to NVIDIA Container Registry with NGC API Key
- [Optional] NVIDIA API Key for remote model endpoints (hybrid deployment only)
- Sufficient storage space for video processing (>10GB recommended in `/tmp/`)
## Ancillary files