diff --git a/nvidia/llama-factory/README.md b/nvidia/llama-factory/README.md index 07ebac6..37867a6 100644 --- a/nvidia/llama-factory/README.md +++ b/nvidia/llama-factory/README.md @@ -5,20 +5,36 @@ ## Table of Contents - [Overview](#overview) + - [What you'll accomplish](#what-youll-accomplish) + - [What to know before starting](#what-to-know-before-starting) + - [Prerequisites](#prerequisites) + - [Ancillary files](#ancillary-files) + - [Time & risk](#time-risk) - [Instructions](#instructions) + - [Step 1. Verify system prerequisites](#step-1-verify-system-prerequisites) + - [Step 2. Launch PyTorch container with GPU support](#step-2-launch-pytorch-container-with-gpu-support) + - [Step 3. Clone LLaMA Factory repository](#step-3-clone-llama-factory-repository) - [Step 4. Install LLaMA Factory with dependencies](#step-4-install-llama-factory-with-dependencies) + - [Step 5. Configure PyTorch for CUDA 12.9 (if needed)](#step-5-configure-pytorch-for-cuda-129-if-needed) + - [Step 6. Prepare training configuration](#step-6-prepare-training-configuration) + - [Step 7. Launch fine-tuning training](#step-7-launch-fine-tuning-training) + - [Step 8. Validate training completion](#step-8-validate-training-completion) + - [Step 9. Test inference with fine-tuned model](#step-9-test-inference-with-fine-tuned-model) + - [Step 10. Troubleshooting](#step-10-troubleshooting) + - [Step 11. Cleanup and rollback](#step-11-cleanup-and-rollback) + - [Step 12. Next steps](#step-12-next-steps) --- ## Overview -## What you'll accomplish +### What you'll accomplish You'll set up LLaMA Factory on NVIDIA Spark with Blackwell architecture to fine-tune large language models using LoRA, QLoRA, and full fine-tuning methods. This enables efficient model adaptation for specialized domains while leveraging hardware-specific optimizations. 
-## What to know before starting +### What to know before starting - Basic Python knowledge for editing config files and troubleshooting - Command line usage for running shell commands and managing environments @@ -28,7 +44,7 @@ model adaptation for specialized domains while leveraging hardware-specific opti - Dataset preparation: formatting text data into JSON structure for instruction tuning - Resource management: adjusting batch size and memory settings for GPU constraints -## Prerequisites +### Prerequisites - NVIDIA Spark device with Blackwell architecture @@ -44,7 +60,7 @@ model adaptation for specialized domains while leveraging hardware-specific opti - Internet connection for downloading models from Hugging Face Hub -## Ancillary files +### Ancillary files - Official LLaMA Factory repository: https://github.com/hiyouga/LLaMA-Factory @@ -54,7 +70,7 @@ model adaptation for specialized domains while leveraging hardware-specific opti - Documentation: https://llamafactory.readthedocs.io/en/latest/getting_started/data_preparation.html -## Time & risk +### Time & risk **Duration:** 30-60 minutes for initial setup, 1-7 hours for training depending on model size and dataset. @@ -67,7 +83,7 @@ saved locally and can be deleted to reclaim storage space. ## Instructions -## Step 1. Verify system prerequisites +### Step 1. Verify system prerequisites Check that your NVIDIA Spark system has the required components installed and accessible. @@ -79,7 +95,7 @@ python --version git --version ``` -## Step 2. Launch PyTorch container with GPU support +### Step 2. Launch PyTorch container with GPU support Start the NVIDIA PyTorch container with GPU access and mount your workspace directory. 
> **Note:** This NVIDIA PyTorch container supports CUDA 13 @@ -88,7 +104,7 @@ Start the NVIDIA PyTorch container with GPU access and mount your workspace dire docker run --gpus all --ipc=host --ulimit memlock=-1 -it --ulimit stack=67108864 --rm -v "$PWD":/workspace nvcr.io/nvidia/pytorch:25.08-py3 bash ``` -## Step 3. Clone LLaMA Factory repository +### Step 3. Clone LLaMA Factory repository Download the LLaMA Factory source code from the official repository. @@ -105,7 +121,9 @@ Install the package in editable mode with metrics support for training evaluatio pip install -e ".[metrics]" ``` -## Step 5. Configure PyTorch for CUDA 12.9 (skip if using Docker container from Step 2) +### Step 5. Configure PyTorch for CUDA 12.9 (if needed) + +#### If using standalone Python (skip if using Docker container) In a python virtual environment, uninstall existing PyTorch and reinstall with CUDA 12.9 support for ARM64 architecture. @@ -114,7 +132,7 @@ pip uninstall torch torchvision torchaudio pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu129 ``` -*If using Docker container* +#### If using Docker container PyTorch is pre-installed with CUDA support. Verify installation: @@ -122,7 +140,7 @@ PyTorch is pre-installed with CUDA support. Verify installation: python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')" ``` -## Step 6. Prepare training configuration +### Step 6. Prepare training configuration Examine the provided LoRA fine-tuning configuration for Llama-3. @@ -130,7 +148,7 @@ Examine the provided LoRA fine-tuning configuration for Llama-3. cat examples/train_lora/llama3_lora_sft.yaml ``` -## Step 7. Launch fine-tuning training +### Step 7. Launch fine-tuning training > **Note:** Login to your hugging face hub to download the model if the model is gated Execute the training process using the pre-configured LoRA setup. 
@@ -152,7 +170,7 @@ Example output: Figure saved at: saves/llama3-8b/lora/sft/training_loss.png ``` -## Step 8. Validate training completion +### Step 8. Validate training completion Verify that training completed successfully and checkpoints were saved. @@ -168,7 +186,7 @@ Expected output should show: - Training metrics showing decreasing loss values - Training loss plot saved as PNG file -## Step 9. Test inference with fine-tuned model +### Step 9. Test inference with fine-tuned model Run a simple inference test to verify the fine-tuned model loads correctly. @@ -176,7 +194,7 @@ Run a simple inference test to verify the fine-tuned model loads correctly. llamafactory-cli chat examples/inference/llama3_lora_sft.yaml ``` -## Step 10. Troubleshooting +### Step 10. Troubleshooting | Symptom | Cause | Fix | |---------|--------|-----| @@ -184,7 +202,7 @@ llamafactory-cli chat examples/inference/llama3_lora_sft.yaml | Model download fails or is slow | Network connectivity or Hugging Face Hub issues | Check internet connection, try using `HF_HUB_OFFLINE=1` for cached models | | Training loss not decreasing | Learning rate too high/low or insufficient data | Adjust `learning_rate` parameter or check dataset quality | -## Step 11. Cleanup and rollback +### Step 11. Cleanup and rollback > **Warning:** This will delete all training progress and checkpoints. @@ -202,7 +220,7 @@ exit # Exit container docker container prune -f ``` -## Step 12. Next steps +### Step 12. Next steps Test your fine-tuned model with custom prompts: diff --git a/nvidia/multi-modal-inference/README.md b/nvidia/multi-modal-inference/README.md index eba93aa..8b33be9 100644 --- a/nvidia/multi-modal-inference/README.md +++ b/nvidia/multi-modal-inference/README.md @@ -35,14 +35,14 @@ FP8, FP4). 
## Prerequisites -- NVIDIA Spark device with Blackwell GPU architecture -- Docker installed and accessible to current user -- NVIDIA Container Runtime configured -- Hugging Face account with valid token -- At least 48GB VRAM available for FP16 Flux.1 Schnell operations -- Verify GPU access: `nvidia-smi` -- Check Docker GPU integration: `docker run --rm --gpus all nvidia/cuda:12.0-base-ubuntu20.04 nvidia-smi` -- Confirm HF token access with permissions to FLUX repos: `echo $HF_TOKEN`, Sign in to your huggingface account You can create the token from create your token here (make sure you provide permissions to the token): https://huggingface.co/settings/tokens , Note the permissions to be checked and the repos: black-forest-labs/FLUX.1-dev and black-forest-labs/FLUX.1-dev-onnx (search for these repos when creating the user token) to be added. +- [ ] NVIDIA Spark device with Blackwell GPU architecture +- [ ] Docker installed and accessible to current user +- [ ] NVIDIA Container Runtime configured +- [ ] Hugging Face account with valid token +- [ ] At least 48GB VRAM available for FP16 Flux.1 Schnell operations +- [ ] Verify GPU access: `nvidia-smi` +- [ ] Check Docker GPU integration: `docker run --rm --gpus all nvidia/cuda:12.0-base-ubuntu20.04 nvidia-smi` +- [ ] Confirm HF token access with permissions to FLUX repos: `echo $HF_TOKEN`. Sign in to your Hugging Face account and create a token at https://huggingface.co/settings/tokens, granting it permissions for the repos black-forest-labs/FLUX.1-dev and black-forest-labs/FLUX.1-dev-onnx (search for these repos when creating the user token). 
## Ancillary files diff --git a/nvidia/nemo-fine-tune/README.md b/nvidia/nemo-fine-tune/README.md index 641c571..a6a8b2c 100644 --- a/nvidia/nemo-fine-tune/README.md +++ b/nvidia/nemo-fine-tune/README.md @@ -35,22 +35,22 @@ You'll establish a complete fine-tuning environment for large language models (1 ## Prerequisites -- NVIDIA Spark device with Blackwell architecture GPU access -- CUDA toolkit 12.0+ installed and configured +- [ ] NVIDIA Spark device with Blackwell architecture GPU access +- [ ] CUDA toolkit 12.0+ installed and configured ```bash nvcc --version ``` -- Python 3.10+ environment available +- [ ] Python 3.10+ environment available ```bash python3 --version ``` -- Minimum 32GB system RAM for efficient model loading and training -- Active internet connection for downloading models and packages -- Git installed for repository cloning +- [ ] Minimum 32GB system RAM for efficient model loading and training +- [ ] Active internet connection for downloading models and packages +- [ ] Git installed for repository cloning ```bash git --version ``` -- SSH access to your NVIDIA Spark device configured +- [ ] SSH access to your NVIDIA Spark device configured ## Ancillary files diff --git a/nvidia/nim-llm/README.md b/nvidia/nim-llm/README.md index b482c8b..9093709 100644 --- a/nvidia/nim-llm/README.md +++ b/nvidia/nim-llm/README.md @@ -1,6 +1,6 @@ # Use a NIM on Spark -> Run an LLM NIM on Spark +> Run a NIM on Spark ## Table of Contents @@ -40,19 +40,19 @@ completions. 
### Prerequisites -- DGX Spark device with NVIDIA drivers installed +- [ ] DGX Spark device with NVIDIA drivers installed ```bash nvidia-smi ``` -- Docker with NVIDIA Container Toolkit configured, instructions here: https://******.nvidia.com/dgx-docs/review/621/dgx-spark/latest/nvidia-container-runtime-for-docker.html +- [ ] Docker with NVIDIA Container Toolkit configured, instructions here: https://******.nvidia.com/dgx-docs/review/621/dgx-spark/latest/nvidia-container-runtime-for-docker.html ```bash docker run -it --gpus=all nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 nvidia-smi ``` -- NGC account with API key from https://ngc.nvidia.com/setup/api-key +- [ ] NGC account with API key from https://ngc.nvidia.com/setup/api-key ```bash echo $NGC_API_KEY | grep -E '^[a-zA-Z0-9]{86}==' ``` -- Sufficient disk space for model caching (varies by model, typically 10-50GB) +- [ ] Sufficient disk space for model caching (varies by model, typically 10-50GB) ```bash df -h ~ ``` diff --git a/nvidia/nvfp4-quantization/README.md b/nvidia/nvfp4-quantization/README.md index 046460b..f9ee9fd 100644 --- a/nvidia/nvfp4-quantization/README.md +++ b/nvidia/nvfp4-quantization/README.md @@ -40,11 +40,11 @@ inside a TensorRT-LLM container, producing an NVFP4 quantized model for deployme ## Prerequisites -- NVIDIA Spark device with Blackwell architecture GPU -- Docker installed with GPU support -- NVIDIA Container Toolkit configured -- At least 32GB of available storage for model files and outputs -- Hugging Face account with access to the target model +- [ ] NVIDIA Spark device with Blackwell architecture GPU +- [ ] Docker installed with GPU support +- [ ] NVIDIA Container Toolkit configured +- [ ] At least 32GB of available storage for model files and outputs +- [ ] Hugging Face account with access to the target model Verify your setup: ```bash diff --git a/nvidia/ollama/README.md b/nvidia/ollama/README.md index 776203d..ad0ffe8 100644 --- a/nvidia/ollama/README.md +++ 
b/nvidia/ollama/README.md @@ -6,6 +6,7 @@ - [Overview](#overview) - [Instructions](#instructions) +- [Access with NVIDIA Sync](#access-with-nvidia-sync) --- @@ -35,9 +36,12 @@ the powerful GPU capabilities of your Spark device without complex network confi ## Prerequisites -- DGX Spark device set up and connected to your network -- NVIDIA Sync installed and connected to your Spark -- Terminal access to your local machine for testing API calls +- [ ] DGX Spark device set up and connected to your network + - Verify with: `nvidia-smi` (should show Blackwell GPU information) +- [ ] NVIDIA Sync installed and connected to your Spark + - Verify connection status in NVIDIA Sync system tray application +- [ ] Terminal access to your local machine for testing API calls + - Verify with: `curl --version` @@ -229,3 +233,7 @@ Monitor GPU and system usage during inference using the DGX Dashboard available Build applications using the Ollama API by integrating with your preferred programming language's HTTP client libraries. + +## Access with NVIDIA Sync + +## Step 1. (DRAFT) diff --git a/nvidia/protein-folding/README.md b/nvidia/protein-folding/README.md index 60f970b..e777cec 100644 --- a/nvidia/protein-folding/README.md +++ b/nvidia/protein-folding/README.md @@ -30,23 +30,23 @@ RTX Pro 6000 or DGX Spark workstation. 
## Prerequisites -- NVIDIA GPU (RTX Pro 6000 or DGX Spark recommended) +- [ ] NVIDIA GPU (RTX Pro 6000 or DGX Spark recommended) ```bash nvidia-smi # Should show GPU with CUDA ≥12.9 ``` -- NVIDIA drivers and CUDA toolkit installed +- [ ] NVIDIA drivers and CUDA toolkit installed ```bash nvcc --version # Should show CUDA 12.9 or higher ``` -- Docker with NVIDIA Container Toolkit +- [ ] Docker with NVIDIA Container Toolkit ```bash docker run --rm --gpus all nvidia/cuda:12.9.0-base-ubuntu22.04 nvidia-smi ``` -- Python 3.8+ environment +- [ ] Python 3.8+ environment ```bash python3 --version # Should show 3.8 or higher ``` -- Sufficient disk space for databases (>3TB recommended) +- [ ] Sufficient disk space for databases (>3TB recommended) ```bash df -h # Check available space ``` diff --git a/nvidia/pytorch-fine-tune/README.md b/nvidia/pytorch-fine-tune/README.md index e921c16..9aeab87 100644 --- a/nvidia/pytorch-fine-tune/README.md +++ b/nvidia/pytorch-fine-tune/README.md @@ -13,101 +13,74 @@ ## Basic Idea -This playbook guides you through setting up and using Pytorch for fine-tuning large language models and vision-language models on NVIDIA Spark devices. NeMo AutoModel provides GPU-accelerated, end-to-end training for Hugging Face models with native PyTorch support, enabling instant fine-tuning without conversion delays. The framework supports distributed training across single GPU to multi-node clusters, with optimized kernels and memory-efficient recipes specifically designed for ARM64 architecture and Blackwell GPU systems. +This playbook guides you through setting up and using Pytorch for fine-tuning large language models on NVIDIA Spark devices. ## What you'll accomplish -You'll establish a complete fine-tuning environment for large language models (1-70B parameters) and vision-language models using NeMo AutoModel on your NVIDIA Spark device. 
By the end, you'll have a working installation that supports parameter-efficient fine-tuning (PEFT), supervised fine-tuning (SFT), and distributed training capabilities with FP8 precision optimizations, all while maintaining compatibility with the Hugging Face ecosystem. - +You'll establish a complete fine-tuning environment for large language models (1-70B parameters) on your NVIDIA Spark device. By the end, you'll have a working installation that supports parameter-efficient fine-tuning (PEFT) and supervised fine-tuning (SFT). ## What to know before starting ## Prerequisites - +Recipes are specifically for DIGITS SPARK. Please make sure that the OS and drivers are up to date. ## Ancillary files - +All files required for fine-tuning are included. ## Time & risk -**Time estimate:** +**Time estimate:** 30-45 mins for setup and running fine-tuning. Fine-tuning run time varies depending on model size. -**Risks:** Model downloads can be large (several GB), ARM64 package compatibility issues may require troubleshooting, distributed training setup complexity increases with multi-node configurations +**Risks:** Model downloads can be large (several GB), ARM64 package compatibility issues may require troubleshooting. **Rollback:** ## Instructions -## Step 1. Verify system requirements - -Check your NVIDIA Spark device meets the prerequisites for NeMo AutoModel installation. This step runs on the host system to confirm CUDA toolkit availability and Python version compatibility. +## Step 1. Pull the latest PyTorch container ```bash -## Verify CUDA installation -nvcc --version - -## Verify GPU accessibility -nvidia-smi - -## Check available system memory -free -h +docker pull nvcr.io/nvidia/pytorch:25.09-py3 ``` -## Step 2. Get the container image +## Step 2. 
Launch Docker ```bash -docker pull nvcr.io/nvidia/pytorch:25.08-py3 +docker run --gpus all -it --rm --ipc=host \ +-v $HOME/.cache/huggingface:/root/.cache/huggingface \ +-v ${PWD}:/workspace -w /workspace \ +nvcr.io/nvidia/pytorch:25.09-py3 + ``` -## Step 3. Launch Docker +## Step 3. Install dependencies inside the container ```bash -docker run \ - --gpus all \ - --ulimit memlock=-1 \ - -it --ulimit stack=67108864 \ - --entrypoint /usr/bin/bash \ - --rm nvcr.io/nvidia/pytorch:25.08-py3 +pip install transformers peft datasets "trl==0.19.1" "bitsandbytes==0.48" ``` - - - - -## Step 10. Troubleshooting - -Common issues and solutions for NeMo AutoModel setup on NVIDIA Spark devices. - -| Symptom | Cause | Fix | -|---------|--------|-----| -| `nvcc: command not found` | CUDA toolkit not in PATH | Add CUDA toolkit to PATH: `export PATH=/usr/local/cuda/bin:$PATH` | -| `pip install uv` permission denied | System-level pip restrictions | Use `pip3 install --user uv` and update PATH | -| GPU not detected in training | CUDA driver/runtime mismatch | Verify driver compatibility: `nvidia-smi` and reinstall CUDA if needed | -| Out of memory during training | Model too large for available GPU memory | Reduce batch size, enable gradient checkpointing, or use model parallelism | -| ARM64 package compatibility issues | Package not available for ARM architecture | Use source installation or build from source with ARM64 flags | - -## Step 11. Cleanup and rollback - -Remove the installation and restore the original environment if needed. These commands safely remove all installed components. - -> **Warning:** This will delete all virtual environments and downloaded models. Ensure you have backed up any important training checkpoints. +## Step 4. Authenticate with Hugging Face ```bash -## Remove virtual environment -rm -rf .venv +huggingface-cli login +## -## Remove cloned repository -cd .. 
-rm -rf Automodel +``` +To run LoRA on Llama3 use the following command: -## Remove uv (if installed with --user) -pip3 uninstall uv - -## Clear Python cache -rm -rf ~/.cache/pip +```bash +python Llama3_8B_LoRA_finetuning.py ``` -## Step 12. Next steps +To run qLoRA finetuning on llama3-70B use the following command: +```bash +python Llama3_70B_qLoRA_finetuning.py +``` +To run full finetuning on llama3-3B use the following command: +```bash +python Llama3_3B_full_finetuning.py +``` diff --git a/nvidia/sglang/README.md b/nvidia/sglang/README.md index 895e927..230b35d 100644 --- a/nvidia/sglang/README.md +++ b/nvidia/sglang/README.md @@ -35,12 +35,12 @@ vision-language tasks using models like DeepSeek-V2-Lite. ## Prerequisites -- NVIDIA Spark device with Blackwell architecture -- Docker Engine installed and running: `docker --version` -- NVIDIA GPU drivers installed: `nvidia-smi` -- NVIDIA Container Toolkit configured: `docker run --rm --gpus all nvidia/cuda:12.9-base nvidia-smi` -- Sufficient disk space (>20GB available): `df -h` -- Network connectivity for pulling NGC containers: `ping nvcr.io` +- [ ] NVIDIA Spark device with Blackwell architecture +- [ ] Docker Engine installed and running: `docker --version` +- [ ] NVIDIA GPU drivers installed: `nvidia-smi` +- [ ] NVIDIA Container Toolkit configured: `docker run --rm --gpus all nvidia/cuda:12.9-base nvidia-smi` +- [ ] Sufficient disk space (>20GB available): `df -h` +- [ ] Network connectivity for pulling NGC containers: `ping nvcr.io` ## Ancillary files diff --git a/nvidia/speculative-decoding/README.md b/nvidia/speculative-decoding/README.md index 028f072..d23ec15 100644 --- a/nvidia/speculative-decoding/README.md +++ b/nvidia/speculative-decoding/README.md @@ -40,17 +40,17 @@ These examples demonstrate how to accelerate large language model inference whil ## Prerequisites -- NVIDIA Spark device with sufficient GPU memory available (80GB+ recommended for GPT-OSS 120B) -- Docker with GPU support enabled +- [ ] 
NVIDIA Spark device with sufficient GPU memory available (80GB+ recommended for GPT-OSS 120B) +- [ ] Docker with GPU support enabled ```bash docker run --gpus all nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev nvidia-smi ``` -- Access to NVIDIA's internal container registry (for Eagle3 example) -- HuggingFace authentication configured (if needed for model downloads) +- [ ] Access to NVIDIA's internal container registry (for Eagle3 example) +- [ ] HuggingFace authentication configured (if needed for model downloads) ```bash huggingface-cli login ``` -- Network connectivity for model downloads +- [ ] Network connectivity for model downloads ## Time & risk diff --git a/nvidia/tailscale/README.md b/nvidia/tailscale/README.md index 73311aa..574c10c 100644 --- a/nvidia/tailscale/README.md +++ b/nvidia/tailscale/README.md @@ -51,13 +51,13 @@ all traffic automatically encrypted and NAT traversal handled transparently. ## Prerequisites -- NVIDIA Spark device running Ubuntu (ARM64/AArch64) -- Client device (Mac, Windows, or Linux) for remote access -- Internet connectivity on both devices -- Valid email account for Tailscale authentication (Google, GitHub, Microsoft) -- SSH server availability check: `systemctl status ssh` -- Package manager working: `sudo apt update` -- User account with sudo privileges on Spark device +- [ ] NVIDIA Spark device running Ubuntu (ARM64/AArch64) +- [ ] Client device (Mac, Windows, or Linux) for remote access +- [ ] Internet connectivity on both devices +- [ ] Valid email account for Tailscale authentication (Google, GitHub, Microsoft) +- [ ] SSH server availability check: `systemctl status ssh` +- [ ] Package manager working: `sudo apt update` +- [ ] User account with sudo privileges on Spark device ## Time & risk diff --git a/nvidia/trt-llm/README.md b/nvidia/trt-llm/README.md index 47c3465..359955f 100644 --- a/nvidia/trt-llm/README.md +++ b/nvidia/trt-llm/README.md @@ -54,13 +54,13 @@ inference through kernel-level optimizations, 
efficient memory layouts, and adva ## Prerequisites -- NVIDIA Spark device with Blackwell architecture GPUs -- NVIDIA drivers compatible with CUDA 12.x: `nvidia-smi` -- Docker installed and GPU support configured: `docker run --rm --gpus all nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev nvidia-smi` -- Hugging Face account with token for model access: `echo $HF_TOKEN` -- Sufficient GPU VRAM (16GB+ recommended for 70B models) -- Internet connectivity for downloading models and container images -- Network: open TCP ports 8355 (LLM) and 8356 (VLM) on host for OpenAI-compatible serving +- [ ] NVIDIA Spark device with Blackwell architecture GPUs +- [ ] NVIDIA drivers compatible with CUDA 12.x: `nvidia-smi` +- [ ] Docker installed and GPU support configured: `docker run --rm --gpus all nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev nvidia-smi` +- [ ] Hugging Face account with token for model access: `echo $HF_TOKEN` +- [ ] Sufficient GPU VRAM (16GB+ recommended for 70B models) +- [ ] Internet connectivity for downloading models and container images +- [ ] Network: open TCP ports 8355 (LLM) and 8356 (VLM) on host for OpenAI-compatible serving ## Model Support Matrix diff --git a/nvidia/unsloth/README.md b/nvidia/unsloth/README.md index 49f69de..a96c201 100644 --- a/nvidia/unsloth/README.md +++ b/nvidia/unsloth/README.md @@ -36,10 +36,10 @@ parameter-efficient fine-tuning methods like LoRA and QLoRA. 
## Prerequisites -- NVIDIA Spark device with Blackwell GPU architecture -- `nvidia-smi` shows a summary of GPU information -- CUDA 13.0 installed: `nvcc --version` -- Internet access for downloading models and datasets +- [ ] NVIDIA Spark device with Blackwell GPU architecture +- [ ] `nvidia-smi` shows a summary of GPU information +- [ ] CUDA 13.0 installed: `nvcc --version` +- [ ] Internet access for downloading models and datasets ##Ancillary files diff --git a/nvidia/vllm/README.md b/nvidia/vllm/README.md index ec8c6a3..24bd32f 100644 --- a/nvidia/vllm/README.md +++ b/nvidia/vllm/README.md @@ -5,9 +5,9 @@ ## Table of Contents - [Overview](#overview) -- [Instructions](#instructions) - [Run on two Sparks](#run-on-two-sparks) - [Step 14. (Optional) Launch 405B inference server](#step-14-optional-launch-405b-inference-server) +- [Access through terminal](#access-through-terminal) --- @@ -29,14 +29,14 @@ support for ARM64. ## Prerequisites -- DGX Spark device with ARM64 processor and Blackwell GPU architecture -- CUDA 12.9 or CUDA 13.0 toolkit installed: `nvcc --version` shows CUDA toolkit version. -- Docker installed and configured: `docker --version` succeeds -- NVIDIA Container Toolkit installed -- Python 3.12 available: `python3.12 --version` succeeds -- Git installed: `git --version` succeeds -- Network access to download packages and container images - +- [ ] DGX Spark device with ARM64 processor and Blackwell GPU architecture +- [ ] CUDA 12.9 or CUDA 13.0 toolkit installed: `nvcc --version` shows CUDA toolkit version. +- [ ] Docker installed and configured: `docker --version` succeeds +- [ ] NVIDIA Container Toolkit installed +- [ ] Python 3.12 available: `python3.12 --version` succeeds +- [ ] Git installed: `git --version` succeeds +- [ ] Network access to download packages and container images +- [ ] > TODO: Verify memory and storage requirements for builds ## Time & risk @@ -46,77 +46,6 @@ support for ARM64. 
**Rollback:** Container approach is non-destructive. -## Instructions - -## Step 1. Pull vLLM container image - -Find the latest container build from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?version=25.09-py3 -``` -docker pull nvcr.io/nvidia/vllm:25.09-py3 -``` - -## Step 2. Test vLLM in container - -Launch the container and start vLLM server with a test model to verify basic functionality. - -```bash -docker run -it --gpus all -p 8000:8000 \ -nvcr.io/nvidia/vllm:25.09-py3 \ -vllm serve "Qwen/Qwen2.5-Math-1.5B-Instruct" -``` - -Expected output should include: -- Model loading confirmation -- Server startup on port 8000 -- GPU memory allocation details - -In another terminal, test the server: - -```bash -curl http://localhost:8000/v1/chat/completions \ --H "Content-Type: application/json" \ --d '{ - "model": "Qwen/Qwen2.5-Math-1.5B-Instruct", - "messages": [{"role": "user", "content": "12*17"}], - "max_tokens": 500 -}' -``` - -Expected response should contain `"content": "204"` or similar mathematical calculation. - -## Step 3. Troubleshooting - -| Symptom | Cause | Fix | -|---------|--------|-----| -| CUDA version mismatch errors | Wrong CUDA toolkit version | Reinstall CUDA 12.9 using exact installer | -| Container registry authentication fails | Invalid or expired GitLab token | Generate new auth token | -| SM_121a architecture not recognized | Missing LLVM patches | Verify SM_121a patches applied to LLVM source | -| Reduce MAX_JOBS to 1-2, add swap space | -| Environment variables not set | - -## Step 4. Cleanup and rollback - -For container approach (non-destructive): - -```bash -docker rm $(docker ps -aq --filter ancestor=******:5005/dl/dgx/vllm*) -docker rmi ******:5005/dl/dgx/vllm:main-py3.31165712-devel -``` - - -To remove CUDA 12.9: - -```bash -sudo /usr/local/cuda-12.9/bin/cuda-uninstaller -``` - -## Step 5. 
Next steps - -- **Production deployment:** Configure vLLM with your specific model requirements -- **Performance tuning:** Adjust batch sizes and memory settings for your workload -- **Monitoring:** Set up logging and metrics collection for production use -- **Model management:** Explore additional model formats and quantization options - ## Run on two Sparks ## Step 1. Verify hardware connectivity @@ -381,3 +310,74 @@ http://192.168.100.10:8265 ## - Persistent model caching across restarts ## - Alternative quantization methods (FP8, INT4) ``` + +## Access through terminal + +## Step 1. Pull vLLM container image + +Find the latest container build from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?version=25.09-py3 +``` +docker pull nvcr.io/nvidia/vllm:25.09-py3 +``` + +## Step 2. Test vLLM in container + +Launch the container and start vLLM server with a test model to verify basic functionality. + +```bash +docker run -it --gpus all -p 8000:8000 \ +nvcr.io/nvidia/vllm:25.09-py3 \ +vllm serve "Qwen/Qwen2.5-Math-1.5B-Instruct" +``` + +Expected output should include: +- Model loading confirmation +- Server startup on port 8000 +- GPU memory allocation details + +In another terminal, test the server: + +```bash +curl http://localhost:8000/v1/chat/completions \ +-H "Content-Type: application/json" \ +-d '{ + "model": "Qwen/Qwen2.5-Math-1.5B-Instruct", + "messages": [{"role": "user", "content": "12*17"}], + "max_tokens": 500 +}' +``` + +Expected response should contain `"content": "204"` or similar mathematical calculation. + +## Step 3. 
Troubleshooting + +| Symptom | Cause | Fix | +|---------|--------|-----| +| CUDA version mismatch errors | Wrong CUDA toolkit version | Reinstall CUDA 12.9 using exact installer | +| Container registry authentication fails | Invalid or expired GitLab token | Generate new auth token | +| SM_121a architecture not recognized | Missing LLVM patches | Verify SM_121a patches applied to LLVM source | +| Reduce MAX_JOBS to 1-2, add swap space | +| Environment variables not set | + +## Step 4. Cleanup and rollback + +For container approach (non-destructive): + +```bash +docker rm $(docker ps -aq --filter ancestor=******:5005/dl/dgx/vllm*) +docker rmi ******:5005/dl/dgx/vllm:main-py3.31165712-devel +``` + + +To remove CUDA 12.9: + +```bash +sudo /usr/local/cuda-12.9/bin/cuda-uninstaller +``` + +## Step 5. Next steps + +- **Production deployment:** Configure vLLM with your specific model requirements +- **Performance tuning:** Adjust batch sizes and memory settings for your workload +- **Monitoring:** Set up logging and metrics collection for production use +- **Model management:** Explore additional model formats and quantization options diff --git a/nvidia/vss/README.md b/nvidia/vss/README.md index 78b41d1..d457e12 100644 --- a/nvidia/vss/README.md +++ b/nvidia/vss/README.md @@ -43,14 +43,14 @@ You will deploy NVIDIA's VSS AI Blueprint on NVIDIA Spark hardware with Blackwel ## Prerequisites -- NVIDIA Spark device with ARM64 architecture and Blackwell GPU -- FastOS 1.81.38 or compatible ARM64 system -- Driver version 580.82.09 installed: `nvidia-smi | grep "Driver Version"` -- CUDA version 13.0 installed: `nvcc --version` -- Docker installed and running: `docker --version && docker compose version` -- Access to NVIDIA Container Registry with NGC API Key -- [Optional] NVIDIA API Key for remote model endpoints (hybrid deployment only) -- Sufficient storage space for video processing (>10GB recommended in `/tmp/`) +- [ ] NVIDIA Spark device with ARM64 architecture and Blackwell 
GPU +- [ ] FastOS 1.81.38 or compatible ARM64 system +- [ ] Driver version 580.82.09 installed: `nvidia-smi | grep "Driver Version"` +- [ ] CUDA version 13.0 installed: `nvcc --version` +- [ ] Docker installed and running: `docker --version && docker compose version` +- [ ] Access to NVIDIA Container Registry with NGC API Key +- [ ] [Optional] NVIDIA API Key for remote model endpoints (hybrid deployment only) +- [ ] Sufficient storage space for video processing (>10GB recommended in `/tmp/`) ## Ancillary files