chore: Regenerate all playbooks

GitLab CI 2025-10-06 02:38:01 +00:00
parent 895e2d9d69
commit 6818481902
14 changed files with 167 additions and 186 deletions


@@ -5,36 +5,20 @@
## Table of Contents
- [Overview](#overview)
-- [What you'll accomplish](#what-youll-accomplish)
-- [What to know before starting](#what-to-know-before-starting)
-- [Prerequisites](#prerequisites)
-- [Ancillary files](#ancillary-files)
-- [Time & risk](#time-risk)
- [Instructions](#instructions)
-- [Step 1. Verify system prerequisites](#step-1-verify-system-prerequisites)
-- [Step 2. Launch PyTorch container with GPU support](#step-2-launch-pytorch-container-with-gpu-support)
-- [Step 3. Clone LLaMA Factory repository](#step-3-clone-llama-factory-repository)
- [Step 4. Install LLaMA Factory with dependencies](#step-4-install-llama-factory-with-dependencies)
-- [Step 5. Configure PyTorch for CUDA 12.9 (if needed)](#step-5-configure-pytorch-for-cuda-129-if-needed)
-- [Step 6. Prepare training configuration](#step-6-prepare-training-configuration)
-- [Step 7. Launch fine-tuning training](#step-7-launch-fine-tuning-training)
-- [Step 8. Validate training completion](#step-8-validate-training-completion)
-- [Step 9. Test inference with fine-tuned model](#step-9-test-inference-with-fine-tuned-model)
-- [Step 10. Troubleshooting](#step-10-troubleshooting)
-- [Step 11. Cleanup and rollback](#step-11-cleanup-and-rollback)
-- [Step 12. Next steps](#step-12-next-steps)
---
## Overview
-### What you'll accomplish
+## What you'll accomplish
You'll set up LLaMA Factory on NVIDIA Spark with Blackwell architecture to fine-tune large
language models using LoRA, QLoRA, and full fine-tuning methods. This enables efficient
model adaptation for specialized domains while leveraging hardware-specific optimizations.
-### What to know before starting
+## What to know before starting
- Basic Python knowledge for editing config files and troubleshooting
- Command line usage for running shell commands and managing environments
@@ -44,7 +28,7 @@ model adaptation for specialized domains while leveraging hardware-specific opti
- Dataset preparation: formatting text data into JSON structure for instruction tuning (see the sketch below)
- Resource management: adjusting batch size and memory settings for GPU constraints
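For the dataset-preparation point, here is a minimal sketch of an instruction-tuning record in LLaMA Factory's alpaca format; the file name `data/my_dataset.json` and the record text are illustrative, and note that new dataset files must also be registered in `data/dataset_info.json`:

```bash
# Write a one-record alpaca-style dataset (illustrative content):
cat > data/my_dataset.json <<'EOF'
[
  {
    "instruction": "Summarize the following sentence.",
    "input": "LLaMA Factory fine-tunes LLMs with LoRA, QLoRA, and full fine-tuning.",
    "output": "LLaMA Factory supports several parameter-efficient fine-tuning methods."
  }
]
EOF
```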
-### Prerequisites
+## Prerequisites
- NVIDIA Spark device with Blackwell architecture
@@ -60,7 +44,7 @@ model adaptation for specialized domains while leveraging hardware-specific opti
- Internet connection for downloading models from Hugging Face Hub
-### Ancillary files
+## Ancillary files
- Official LLaMA Factory repository: https://github.com/hiyouga/LLaMA-Factory
@@ -70,7 +54,7 @@ model adaptation for specialized domains while leveraging hardware-specific opti
- Documentation: https://llamafactory.readthedocs.io/en/latest/getting_started/data_preparation.html
-### Time & risk
+## Time & risk
**Duration:** 30-60 minutes for initial setup, 1-7 hours for training depending on model size
and dataset.
@@ -83,7 +67,7 @@ saved locally and can be deleted to reclaim storage space.
## Instructions
-### Step 1. Verify system prerequisites
+## Step 1. Verify system prerequisites
Check that your NVIDIA Spark system has the required components installed and accessible.
@@ -95,7 +79,7 @@ python --version
git --version
```
-### Step 2. Launch PyTorch container with GPU support
+## Step 2. Launch PyTorch container with GPU support
Start the NVIDIA PyTorch container with GPU access and mount your workspace directory.
> **Note:** This NVIDIA PyTorch container supports CUDA 13
@@ -104,7 +88,7 @@ Start the NVIDIA PyTorch container with GPU access and mount your workspace dire
docker run --gpus all --ipc=host --ulimit memlock=-1 -it --ulimit stack=67108864 --rm -v "$PWD":/workspace nvcr.io/nvidia/pytorch:25.08-py3 bash
```
-### Step 3. Clone LLaMA Factory repository
+## Step 3. Clone LLaMA Factory repository
Download the LLaMA Factory source code from the official repository.
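The clone commands themselves sit in a collapsed part of this hunk; for reference, a typical invocation against the repository linked under Ancillary files looks like:

```bash
# Shallow-clone the official repository and enter it
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
```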
@@ -121,9 +105,9 @@ Install the package in editable mode with metrics support for training evaluatio
pip install -e ".[metrics]"
```
-### Step 5. Configure PyTorch for CUDA 12.9 (if needed)
-#### If using standalone Python (skip if using Docker container)
+## Step 5. Configure PyTorch for CUDA 12.9 (if needed)
+*If using standalone Python (skip if using Docker container)*
In a Python virtual environment, uninstall the existing PyTorch and reinstall it with CUDA 12.9 support for ARM64 architecture.
@@ -132,7 +116,7 @@ pip uninstall torch torchvision torchaudio
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu129
```
-#### If using Docker container
+*If using Docker container*
PyTorch is pre-installed with CUDA support. Verify installation:
@@ -140,7 +124,7 @@ PyTorch is pre-installed with CUDA support. Verify installation:
python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"
```
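As an optional extra check (standard PyTorch API, not part of the original playbook), you can also confirm which GPU PyTorch sees:

```bash
# Print the name of the first visible CUDA device
python -c "import torch; print(torch.cuda.get_device_name(0))"
```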
-### Step 6. Prepare training configuration
+## Step 6. Prepare training configuration
Examine the provided LoRA fine-tuning configuration for Llama-3.
@@ -148,7 +132,7 @@ Examine the provided LoRA fine-tuning configuration for Llama-3.
cat examples/train_lora/llama3_lora_sft.yaml
```
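For orientation, these are the fields worth checking in that file; the values in the comment are reconstructed from LLaMA Factory's published example config and may differ in your checkout:

```bash
# Key fields (expected values are illustrative):
#   model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
#   finetuning_type: lora
#   dataset: identity,alpaca_en_demo
#   output_dir: saves/llama3-8b/lora/sft
grep -E 'model_name_or_path|finetuning_type|dataset|output_dir' \
  examples/train_lora/llama3_lora_sft.yaml
```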
-### Step 7. Launch fine-tuning training
+## Step 7. Launch fine-tuning training
> **Note:** Log in to your Hugging Face Hub account to download the model if it is gated
Execute the training process using the pre-configured LoRA setup.
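The launch command sits in the collapsed part of this hunk; the documented LLaMA Factory invocation, preceded by the Hugging Face login the note mentions, is:

```bash
# Authenticate once if the base model is gated (prompts for a token)
huggingface-cli login
# Launch LoRA fine-tuning with the example configuration
llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
```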
@@ -170,7 +154,7 @@ Example output:
Figure saved at: saves/llama3-8b/lora/sft/training_loss.png
```
-### Step 8. Validate training completion
+## Step 8. Validate training completion
Verify that training completed successfully and checkpoints were saved.
@@ -186,7 +170,7 @@ Expected output should show:
- Training metrics showing decreasing loss values
- Training loss plot saved as PNG file
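A quick way to confirm those artifacts, using the output directory shown in the example output above (exact file names may vary by version):

```bash
# List training artifacts; expect adapter weights, trainer logs,
# and the loss plot mentioned above
ls saves/llama3-8b/lora/sft/
```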
-### Step 9. Test inference with fine-tuned model
+## Step 9. Test inference with fine-tuned model
Run a simple inference test to verify the fine-tuned model loads correctly.
@@ -194,7 +178,7 @@ Run a simple inference test to verify the fine-tuned model loads correctly.
llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
```
-### Step 10. Troubleshooting
+## Step 10. Troubleshooting
| Symptom | Cause | Fix |
|---------|--------|-----|
@@ -202,7 +186,7 @@ llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
| Model download fails or is slow | Network connectivity or Hugging Face Hub issues | Check internet connection, or try `HF_HUB_OFFLINE=1` for cached models (see example below) |
| Training loss not decreasing | Learning rate too high/low or insufficient data | Adjust `learning_rate` parameter or check dataset quality |
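Spelling out the `HF_HUB_OFFLINE` fix from the table, assuming the model is already in the local cache:

```bash
# Force Hugging Face Hub to use only locally cached files
HF_HUB_OFFLINE=1 llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
```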
-### Step 11. Cleanup and rollback
+## Step 11. Cleanup and rollback
> **Warning:** This will delete all training progress and checkpoints.
@@ -220,7 +204,7 @@ exit # Exit container
docker container prune -f
```
-### Step 12. Next steps
+## Step 12. Next steps
Test your fine-tuned model with custom prompts:
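The prompt examples are truncated by this diff view; a minimal sketch of the idea, reusing the inference config from Step 9, is:

```bash
# Start an interactive chat against the fine-tuned adapter, then type
# domain-specific prompts and compare answers with the base model
llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
```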


@@ -35,14 +35,14 @@ FP8, FP4).
## Prerequisites
-- [ ] NVIDIA Spark device with Blackwell GPU architecture
-- [ ] Docker installed and accessible to current user
-- [ ] NVIDIA Container Runtime configured
-- [ ] Hugging Face account with valid token
-- [ ] At least 48GB VRAM available for FP16 Flux.1 Schnell operations
-- [ ] Verify GPU access: `nvidia-smi`
-- [ ] Check Docker GPU integration: `docker run --rm --gpus all nvidia/cuda:12.0-base-ubuntu20.04 nvidia-smi`
-- [ ] Confirm HF token access with permissions to the FLUX repos: `echo $HF_TOKEN`. Sign in to your Hugging Face account and create a token at https://huggingface.co/settings/tokens, granting it read access to the black-forest-labs/FLUX.1-dev and black-forest-labs/FLUX.1-dev-onnx repositories (search for these repos when creating the token).
+- NVIDIA Spark device with Blackwell GPU architecture
+- Docker installed and accessible to current user
+- NVIDIA Container Runtime configured
+- Hugging Face account with valid token
+- At least 48GB VRAM available for FP16 Flux.1 Schnell operations
+- Verify GPU access: `nvidia-smi`
+- Check Docker GPU integration: `docker run --rm --gpus all nvidia/cuda:12.0-base-ubuntu20.04 nvidia-smi`
+- Confirm HF token access with permissions to the FLUX repos: `echo $HF_TOKEN`. Sign in to your Hugging Face account and create a token at https://huggingface.co/settings/tokens, granting it read access to the black-forest-labs/FLUX.1-dev and black-forest-labs/FLUX.1-dev-onnx repositories (search for these repos when creating the token).
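A minimal sketch of the token setup described in the last item; the token value is a placeholder:

```bash
# Export the token created at https://huggingface.co/settings/tokens
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx   # placeholder, not a real token
huggingface-cli whoami                # should print your account name
```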
## Ancillary files


@@ -35,22 +35,22 @@ You'll establish a complete fine-tuning environment for large language models (1
## Prerequisites
-- [ ] NVIDIA Spark device with Blackwell architecture GPU access
-- [ ] CUDA toolkit 12.0+ installed and configured
+- NVIDIA Spark device with Blackwell architecture GPU access
+- CUDA toolkit 12.0+ installed and configured
```bash
nvcc --version
```
-- [ ] Python 3.10+ environment available
+- Python 3.10+ environment available
```bash
python3 --version
```
-- [ ] Minimum 32GB system RAM for efficient model loading and training
-- [ ] Active internet connection for downloading models and packages
-- [ ] Git installed for repository cloning
+- Minimum 32GB system RAM for efficient model loading and training
+- Active internet connection for downloading models and packages
+- Git installed for repository cloning
```bash
git --version
```
-- [ ] SSH access to your NVIDIA Spark device configured
+- SSH access to your NVIDIA Spark device configured
## Ancillary files


@@ -40,19 +40,19 @@ completions.
### Prerequisites
-- [ ] DGX Spark device with NVIDIA drivers installed
+- DGX Spark device with NVIDIA drivers installed
```bash
nvidia-smi
```
-- [ ] Docker with NVIDIA Container Toolkit configured, instructions here: https://******.nvidia.com/dgx-docs/review/621/dgx-spark/latest/nvidia-container-runtime-for-docker.html
+- Docker with NVIDIA Container Toolkit configured, instructions here: https://******.nvidia.com/dgx-docs/review/621/dgx-spark/latest/nvidia-container-runtime-for-docker.html
```bash
docker run -it --gpus=all nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 nvidia-smi
```
-- [ ] NGC account with API key from https://ngc.nvidia.com/setup/api-key
+- NGC account with API key from https://ngc.nvidia.com/setup/api-key (login example after this list)
```bash
echo $NGC_API_KEY | grep -E '^[a-zA-Z0-9]{86}=='
```
-- [ ] Sufficient disk space for model caching (varies by model, typically 10-50GB)
+- Sufficient disk space for model caching (varies by model, typically 10-50GB)
```bash
df -h ~
```
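The NGC API key checked above is what you log in to `nvcr.io` with; the documented pattern (with `$oauthtoken` as the literal username) is:

```bash
# Log Docker in to the NGC registry using the API key
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
```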


@@ -40,11 +40,11 @@ inside a TensorRT-LLM container, producing an NVFP4 quantized model for deployme
## Prerequisites
-- [ ] NVIDIA Spark device with Blackwell architecture GPU
-- [ ] Docker installed with GPU support
-- [ ] NVIDIA Container Toolkit configured
-- [ ] At least 32GB of available storage for model files and outputs
-- [ ] Hugging Face account with access to the target model
+- NVIDIA Spark device with Blackwell architecture GPU
+- Docker installed with GPU support
+- NVIDIA Container Toolkit configured
+- At least 32GB of available storage for model files and outputs
+- Hugging Face account with access to the target model
Verify your setup:
```bash


@@ -36,12 +36,9 @@ the powerful GPU capabilities of your Spark device without complex network confi
## Prerequisites
-- [ ] DGX Spark device set up and connected to your network
-  - Verify with: `nvidia-smi` (should show Blackwell GPU information)
-- [ ] NVIDIA Sync installed and connected to your Spark
-  - Verify connection status in NVIDIA Sync system tray application
-- [ ] Terminal access to your local machine for testing API calls
-  - Verify with: `curl --version`
+- DGX Spark device set up and connected to your network
+- NVIDIA Sync installed and connected to your Spark
+- Terminal access to your local machine for testing API calls


@@ -30,23 +30,23 @@ RTX Pro 6000 or DGX Spark workstation.
## Prerequisites
-- [ ] NVIDIA GPU (RTX Pro 6000 or DGX Spark recommended)
+- NVIDIA GPU (RTX Pro 6000 or DGX Spark recommended)
```bash
nvidia-smi # Should show GPU with CUDA ≥12.9
```
-- [ ] NVIDIA drivers and CUDA toolkit installed
+- NVIDIA drivers and CUDA toolkit installed
```bash
nvcc --version # Should show CUDA 12.9 or higher
```
-- [ ] Docker with NVIDIA Container Toolkit
+- Docker with NVIDIA Container Toolkit
```bash
docker run --rm --gpus all nvidia/cuda:12.9.0-base-ubuntu22.04 nvidia-smi
```
-- [ ] Python 3.8+ environment
+- Python 3.8+ environment
```bash
python3 --version # Should show 3.8 or higher
```
-- [ ] Sufficient disk space for databases (>3TB recommended)
+- Sufficient disk space for databases (>3TB recommended)
```bash
df -h # Check available space
```


@@ -35,12 +35,12 @@ vision-language tasks using models like DeepSeek-V2-Lite.
## Prerequisites
-- [ ] NVIDIA Spark device with Blackwell architecture
-- [ ] Docker Engine installed and running: `docker --version`
-- [ ] NVIDIA GPU drivers installed: `nvidia-smi`
-- [ ] NVIDIA Container Toolkit configured: `docker run --rm --gpus all nvidia/cuda:12.9-base nvidia-smi`
-- [ ] Sufficient disk space (>20GB available): `df -h`
-- [ ] Network connectivity for pulling NGC containers: `ping nvcr.io`
+- NVIDIA Spark device with Blackwell architecture
+- Docker Engine installed and running: `docker --version`
+- NVIDIA GPU drivers installed: `nvidia-smi`
+- NVIDIA Container Toolkit configured: `docker run --rm --gpus all nvidia/cuda:12.9-base nvidia-smi`
+- Sufficient disk space (>20GB available): `df -h`
+- Network connectivity for pulling NGC containers: `ping nvcr.io`
## Ancillary files


@@ -40,17 +40,17 @@ These examples demonstrate how to accelerate large language model inference whil
## Prerequisites
-- [ ] NVIDIA Spark device with sufficient GPU memory available (80GB+ recommended for GPT-OSS 120B)
-- [ ] Docker with GPU support enabled
+- NVIDIA Spark device with sufficient GPU memory available (80GB+ recommended for GPT-OSS 120B)
+- Docker with GPU support enabled
```bash
docker run --gpus all nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev nvidia-smi
```
-- [ ] Access to NVIDIA's internal container registry (for Eagle3 example)
-- [ ] HuggingFace authentication configured (if needed for model downloads)
+- Access to NVIDIA's internal container registry (for Eagle3 example)
+- HuggingFace authentication configured (if needed for model downloads)
```bash
huggingface-cli login
```
-- [ ] Network connectivity for model downloads
+- Network connectivity for model downloads
## Time & risk


@@ -51,13 +51,13 @@ all traffic automatically encrypted and NAT traversal handled transparently.
## Prerequisites
-- [ ] NVIDIA Spark device running Ubuntu (ARM64/AArch64)
-- [ ] Client device (Mac, Windows, or Linux) for remote access
-- [ ] Internet connectivity on both devices
-- [ ] Valid email account for Tailscale authentication (Google, GitHub, Microsoft)
-- [ ] SSH server availability check: `systemctl status ssh`
-- [ ] Package manager working: `sudo apt update`
-- [ ] User account with sudo privileges on Spark device
+- NVIDIA Spark device running Ubuntu (ARM64/AArch64)
+- Client device (Mac, Windows, or Linux) for remote access
+- Internet connectivity on both devices
+- Valid email account for Tailscale authentication (Google, GitHub, Microsoft)
+- SSH server availability check: `systemctl status ssh`
+- Package manager working: `sudo apt update`
+- User account with sudo privileges on Spark device
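The installation itself happens later in the playbook; for reference, the standard Tailscale bring-up on Ubuntu (from Tailscale's own documentation) is:

```bash
# Install Tailscale and join the tailnet
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up        # prints an auth URL to open in a browser
tailscale ip -4          # note the Spark's tailnet IPv4 address
```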
## Time & risk


@@ -54,13 +54,13 @@ inference through kernel-level optimizations, efficient memory layouts, and adva
## Prerequisites
-- [ ] NVIDIA Spark device with Blackwell architecture GPUs
-- [ ] NVIDIA drivers compatible with CUDA 12.x: `nvidia-smi`
-- [ ] Docker installed and GPU support configured: `docker run --rm --gpus all nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev nvidia-smi`
-- [ ] Hugging Face account with token for model access: `echo $HF_TOKEN`
-- [ ] Sufficient GPU VRAM (16GB+ recommended for 70B models)
-- [ ] Internet connectivity for downloading models and container images
-- [ ] Network: open TCP ports 8355 (LLM) and 8356 (VLM) on host for OpenAI-compatible serving
+- NVIDIA Spark device with Blackwell architecture GPUs
+- NVIDIA drivers compatible with CUDA 12.x: `nvidia-smi`
+- Docker installed and GPU support configured: `docker run --rm --gpus all nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev nvidia-smi`
+- Hugging Face account with token for model access: `echo $HF_TOKEN`
+- Sufficient GPU VRAM (16GB+ recommended for 70B models)
+- Internet connectivity for downloading models and container images
+- Network: open TCP ports 8355 (LLM) and 8356 (VLM) on host for OpenAI-compatible serving
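Once a model is being served, the OpenAI-compatible endpoint on port 8355 can be probed as sketched below; the model ID is a placeholder to be replaced with a value returned by `/v1/models`:

```bash
# List served models, then send a minimal chat request
curl -s http://localhost:8355/v1/models
curl -s http://localhost:8355/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "MODEL_ID_FROM_ABOVE",
       "messages": [{"role": "user", "content": "Hello"}],
       "max_tokens": 32}'
```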
## Model Support Matrix


@@ -36,10 +36,10 @@ parameter-efficient fine-tuning methods like LoRA and QLoRA.
## Prerequisites
-- [ ] NVIDIA Spark device with Blackwell GPU architecture
-- [ ] `nvidia-smi` shows a summary of GPU information
-- [ ] CUDA 13.0 installed: `nvcc --version`
-- [ ] Internet access for downloading models and datasets
+- NVIDIA Spark device with Blackwell GPU architecture
+- `nvidia-smi` shows a summary of GPU information
+- CUDA 13.0 installed: `nvcc --version`
+- Internet access for downloading models and datasets
## Ancillary files


@@ -5,9 +5,9 @@
## Table of Contents
- [Overview](#overview)
+- [Instructions](#instructions)
- [Run on two Sparks](#run-on-two-sparks)
- [Step 14. (Optional) Launch 405B inference server](#step-14-optional-launch-405b-inference-server)
-- [Access through terminal](#access-through-terminal)
---
@@ -29,14 +29,14 @@ support for ARM64.
## Prerequisites
-- [ ] DGX Spark device with ARM64 processor and Blackwell GPU architecture
-- [ ] CUDA 12.9 or CUDA 13.0 toolkit installed: `nvcc --version` shows CUDA toolkit version.
-- [ ] Docker installed and configured: `docker --version` succeeds
-- [ ] NVIDIA Container Toolkit installed
-- [ ] Python 3.12 available: `python3.12 --version` succeeds
-- [ ] Git installed: `git --version` succeeds
-- [ ] Network access to download packages and container images
-- [ ] > TODO: Verify memory and storage requirements for builds
+- DGX Spark device with ARM64 processor and Blackwell GPU architecture
+- CUDA 12.9 or CUDA 13.0 toolkit installed: `nvcc --version` shows CUDA toolkit version.
+- Docker installed and configured: `docker --version` succeeds
+- NVIDIA Container Toolkit installed
+- Python 3.12 available: `python3.12 --version` succeeds
+- Git installed: `git --version` succeeds
+- Network access to download packages and container images
## Time & risk
@@ -46,6 +46,77 @@ support for ARM64.
**Rollback:** Container approach is non-destructive.
## Instructions
## Step 1. Pull vLLM container image
Find the latest container build from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?version=25.09-py3
```bash
docker pull nvcr.io/nvidia/vllm:25.09-py3
```
## Step 2. Test vLLM in container
Launch the container and start vLLM server with a test model to verify basic functionality.
```bash
docker run -it --gpus all -p 8000:8000 \
nvcr.io/nvidia/vllm:25.09-py3 \
vllm serve "Qwen/Qwen2.5-Math-1.5B-Instruct"
```
Expected output should include:
- Model loading confirmation
- Server startup on port 8000
- GPU memory allocation details
In another terminal, test the server:
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-Math-1.5B-Instruct",
"messages": [{"role": "user", "content": "12*17"}],
"max_tokens": 500
}'
```
Expected response should contain `"content": "204"` or similar mathematical calculation.
## Step 3. Troubleshooting
| Symptom | Cause | Fix |
|---------|--------|-----|
| CUDA version mismatch errors | Wrong CUDA toolkit version | Reinstall CUDA 12.9 using exact installer |
| Container registry authentication fails | Invalid or expired GitLab token | Generate new auth token |
| SM_121a architecture not recognized | Missing LLVM patches | Verify SM_121a patches applied to LLVM source |
| Out of memory during build | Too many parallel build jobs | Reduce MAX_JOBS to 1-2, add swap space |
| | Environment variables not set | |
## Step 4. Cleanup and rollback
For container approach (non-destructive):
```bash
docker rm $(docker ps -aq --filter ancestor=nvcr.io/nvidia/vllm:25.09-py3)
docker rmi nvcr.io/nvidia/vllm:25.09-py3
```
To remove CUDA 12.9:
```bash
sudo /usr/local/cuda-12.9/bin/cuda-uninstaller
```
## Step 5. Next steps
- **Production deployment:** Configure vLLM with your specific model requirements
- **Performance tuning:** Adjust batch sizes and memory settings for your workload (see the sketch after this list)
- **Monitoring:** Set up logging and metrics collection for production use
- **Model management:** Explore additional model formats and quantization options
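As a starting point for the tuning bullet above, here are two standard vLLM serve flags; the values are illustrative rather than recommended:

```bash
# Cap GPU memory use and context length while experimenting
vllm serve "Qwen/Qwen2.5-Math-1.5B-Instruct" \
  --gpu-memory-utilization 0.90 \
  --max-model-len 4096
```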
## Run on two Sparks
## Step 1. Verify hardware connectivity
@@ -310,74 +381,3 @@ http://192.168.100.10:8265
## - Persistent model caching across restarts
## - Alternative quantization methods (FP8, INT4)
```


@@ -43,14 +43,14 @@ You will deploy NVIDIA's VSS AI Blueprint on NVIDIA Spark hardware with Blackwel
## Prerequisites
-- [ ] NVIDIA Spark device with ARM64 architecture and Blackwell GPU
-- [ ] FastOS 1.81.38 or compatible ARM64 system
-- [ ] Driver version 580.82.09 installed: `nvidia-smi | grep "Driver Version"`
-- [ ] CUDA version 13.0 installed: `nvcc --version`
-- [ ] Docker installed and running: `docker --version && docker compose version`
-- [ ] Access to NVIDIA Container Registry with NGC API Key
-- [ ] [Optional] NVIDIA API Key for remote model endpoints (hybrid deployment only)
-- [ ] Sufficient storage space for video processing (>10GB recommended in `/tmp/`)
+- NVIDIA Spark device with ARM64 architecture and Blackwell GPU
+- FastOS 1.81.38 or compatible ARM64 system
+- Driver version 580.82.09 installed: `nvidia-smi | grep "Driver Version"`
+- CUDA version 13.0 installed: `nvcc --version`
+- Docker installed and running: `docker --version && docker compose version`
+- Access to NVIDIA Container Registry with NGC API Key
+- [Optional] NVIDIA API Key for remote model endpoints (hybrid deployment only)
+- Sufficient storage space for video processing (>10GB recommended in `/tmp/`)
## Ancillary files