commit c76a2721b8a3f699505ce9644cd8746c12cdbb55 Author: GitLab CI Date: Fri Oct 3 20:46:11 2025 +0000 chore: Regenerate all playbooks diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000..04dc85d --- /dev/null +++ b/LICENSE @@ -0,0 +1,191 @@ + + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). 
+ + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. + + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. 
Grant of Patent License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative 
Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. 
Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. 
+ + END OF TERMS AND CONDITIONS + + Copyright 2025 NVIDIA Corporation + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. \ No newline at end of file diff --git a/README.md b/README.md new file mode 100644 index 0000000..53fa636 --- /dev/null +++ b/README.md @@ -0,0 +1,62 @@ + +

+ NVIDIA DGX Spark +

+ +# DGX Spark Playbooks + +Collection of step-by-step playbooks for setting up AI/ML workloads on NVIDIA DGX Spark devices with Blackwell architecture. + +## About + +These playbooks provide detailed instructions for: +- Installing and configuring popular AI frameworks +- Running inference with optimized models +- Setting up development environments +- Connecting and managing your DGX Spark device + +Each playbook includes prerequisites, step-by-step instructions, troubleshooting guidance, and example code. + +## Available Playbooks + +### NVIDIA + +- [Comfy UI](nvidia/comfy-ui/) +- [Connect to your Spark](nvidia/connect-to-your-spark/) +- [DGX Dashboard](nvidia/dgx-dashboard/) +- [FLUX.1 Dreambooth LoRA Fine-tuning](nvidia/flux-finetuning/) +- [Optimized Jax](nvidia/jax/) +- [Llama Factory](nvidia/llama-factory/) +- [MONAI-Reasoning-CXR-3B Model](nvidia/monai-reasoning/) +- [Build and Deploy a Multi-Agent Chatbot](nvidia/multi-agent-chatbot/) +- [Multi-modal Inference](nvidia/multi-modal-inference/) +- [NCCL for Two Sparks](nvidia/nccl/) +- [Fine tune with Nemo](nvidia/nemo-fine-tune/) +- [Use a NIM on Spark](nvidia/nim-llm/) +- [Quantize to NVFP4](nvidia/nvfp4-quantization/) +- [Ollama](nvidia/ollama/) +- [Use Open WebUI](nvidia/open-webui/) +- [Use Open Fold](nvidia/protein-folding/) +- [RAG application in AI Workbench](nvidia/rag-ai-workbench/) +- [SGLang Inference Server](nvidia/sglang/) +- [Speculative Decoding](nvidia/speculative-decoding/) +- [Stack two Sparks](nvidia/stack-sparks/) +- [Setup Tailscale on your Spark](nvidia/tailscale/) +- [TRT LLM for Inference](nvidia/trt-llm/) +- [Unsloth on DGX Spark](nvidia/unsloth/) +- [Install and use vLLM](nvidia/vllm/) +- [Vision-Language Model Fine-tuning](nvidia/vlm-finetuning/) +- [Install VS Code](nvidia/vscode/) +- [Video Search and Summarization](nvidia/vss/) + +## Resources + +- **Documentation**: https://www.nvidia.com/en-us/products/workstations/dgx-spark/ +- **Developer Forum**: 
https://forums.developer.nvidia.com/c/accelerated-computing/dgx-spark-gb10 +- **Terms of Service**: https://assets.ngc.nvidia.com/products/api-catalog/legal/NVIDIA%20API%20Trial%20Terms%20of%20Service.pdf + +## License + +See: +- [LICENSE](LICENSE) for licensing information. +- [LICENSE-3rd-party](LICENSE-3rd-party) for third-party licensing information. \ No newline at end of file diff --git a/nvidia/comfy-ui/README.md b/nvidia/comfy-ui/README.md new file mode 100644 index 0000000..4c665c0 --- /dev/null +++ b/nvidia/comfy-ui/README.md @@ -0,0 +1,194 @@ +# Comfy UI + +> Install and use ComfyUI to generate images + +## Table of Contents + +- [Overview](#overview) +- [Instructions](#instructions) + +--- + +## Overview + +## Basic idea + +ComfyUI is a visual, node-based interface for AI image generation. +Each step—like loading a model, adding text, sampling, or saving an image—is a node you connect with wires to form a workflow. +Workflows can be saved and shared as JSON files, making results easy to reproduce. +It’s flexible, letting you swap models, add effects, or combine tools, and it’s popular for Stable Diffusion because it gives precise control without needing to write code. +Think of it like building with LEGO blocks, but for AI image pipelines. + + + +## What you'll accomplish + +You'll install and configure ComfyUI, a powerful node-based GUI for Stable Diffusion, on NVIDIA Spark +devices with Blackwell architecture. By the end, you'll have a fully functional web interface accessible +via browser for AI image generation workloads. 
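As the overview notes, a ComfyUI workflow is just a JSON graph of nodes that can be saved and shared. A minimal sketch of that serialization in Python — the `class_type`/`inputs` shape follows ComfyUI's API-format export, but the specific node IDs and wiring here are illustrative, not a complete runnable workflow:

```python
import json

# Illustrative sketch of ComfyUI's workflow serialization: each node has an id,
# a class_type, and inputs; wires reference other nodes as [node_id, output_index].
workflow = {
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "v1-5-pruned-emaonly-fp16.safetensors"}},
    "2": {"class_type": "CLIPTextEncode",
          "inputs": {"text": "a photo of an astronaut", "clip": ["1", 1]}},
    "3": {"class_type": "KSampler",
          "inputs": {"model": ["1", 0], "positive": ["2", 0],
                     "seed": 42, "steps": 20}},
}

# Saving and re-loading the JSON reproduces the same graph, which is why
# workflows are easy to share and version-control.
serialized = json.dumps(workflow, indent=2)
assert json.loads(serialized) == workflow
```

Because the graph round-trips through plain JSON, a workflow file checked into a repository fully reproduces the pipeline on another machine with the same models installed.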
## What to know before starting

- Experience working with Python virtual environments and package management
- Familiarity with command-line operations and terminal usage
- Basic understanding of deep learning model deployment and checkpoints
- Knowledge of container workflows and GPU acceleration concepts
- Understanding of network configuration for accessing web services

## Prerequisites

**Hardware Requirements:**
- [ ] NVIDIA Spark device with Blackwell architecture
- [ ] Minimum 8GB GPU memory for Stable Diffusion models
- [ ] At least 20GB available storage space

**Software Requirements:**
- [ ] Python 3.8 or higher installed: `python3 --version`
- [ ] pip package manager available: `pip3 --version`
- [ ] CUDA toolkit compatible with Blackwell: `nvcc --version`
- [ ] Git version control: `git --version`
- [ ] Network access to download models from Hugging Face
- [ ] Web browser access to port 8188 on the device (`http://<device-ip>:8188`)

## Ancillary files

All required assets can be found [in the ComfyUI repository on GitHub](https://github.com/comfyanonymous/ComfyUI):

- `requirements.txt` - Python dependencies for ComfyUI installation
- `main.py` - Primary ComfyUI server application entry point
- `v1-5-pruned-emaonly-fp16.safetensors` - Stable Diffusion 1.5 checkpoint model (downloaded separately from Hugging Face)

## Time & risk

**Estimated time:** 30-45 minutes (including model download)

**Risk level:** Medium
- Model downloads are large (~2GB) and may fail due to network issues
- PyTorch nightly builds may have compatibility issues with ARM64 architecture
- Port 8188 must be accessible for web interface functionality

**Rollback:** The virtual environment can be deleted to remove all installed packages. Downloaded models can be removed manually from the checkpoints directory.

## Instructions

## Step 1. Verify system prerequisites

Check that your NVIDIA Spark device meets the requirements before proceeding with installation.
+ +```bash +python3 --version +pip3 --version +nvcc --version +nvidia-smi +``` + +Expected output should show Python 3.8+, pip available, CUDA toolkit, and GPU detection. + +## Step 2. Create Python virtual environment + +Create an isolated environment to avoid conflicts with system packages. This runs on the host system. + +```bash +python3 -m venv comfyui-env +source comfyui-env/bin/activate +``` + +Verify the virtual environment is active by checking the command prompt shows `(comfyui-env)`. + +## Step 3. Install PyTorch with CUDA support + +Install PyTorch nightly build with CUDA 12.9 support optimized for ARM64 architecture. + +```bash +pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu129 +``` + +This installation targets CUDA 12.9 compatibility with Blackwell architecture GPUs. + +## Step 4. Clone ComfyUI repository + +Download the ComfyUI source code from the official repository. + +```bash +git clone https://github.com/comfyanonymous/ComfyUI.git +cd ComfyUI/ +``` + +## Step 5. Install ComfyUI dependencies + +Install the required Python packages for ComfyUI operation. + +```bash +pip install -r requirements.txt +``` + +This installs all necessary dependencies including web interface components and model handling libraries. + +## Step 6. Download Stable Diffusion checkpoint + +Navigate to the checkpoints directory and download the Stable Diffusion 1.5 model. + +```bash +cd models/checkpoints/ +wget https://huggingface.co/Comfy-Org/stable-diffusion-v1-5-archive/resolve/main/v1-5-pruned-emaonly-fp16.safetensors +cd ../../ +``` + +The download will be approximately 2GB and may take several minutes depending on network speed. + +## Step 7. Launch ComfyUI server + +Start the ComfyUI web server with network access enabled. + +```bash +python main.py --listen 0.0.0.0 +``` + +The server will bind to all network interfaces on port 8188, making it accessible from other devices. + +## Step 8. 
Validate installation

Check that ComfyUI is running correctly and accessible via web browser.

```bash
curl -I http://localhost:8188
```

Expected output should show an HTTP 200 response indicating the web server is operational.

Open a web browser and navigate to `http://<device-ip>:8188`, where `<device-ip>` is your device's IP address.

## Step 9. Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| PyTorch CUDA not available | Incorrect CUDA version or missing drivers | Verify `nvcc --version` matches cu129, reinstall PyTorch |
| Model download fails | Network connectivity or storage space | Check internet connection, verify 20GB+ available space |
| Web interface inaccessible | Firewall blocking port 8188 | Configure firewall to allow port 8188, check IP address |
| Out of GPU memory errors | Insufficient VRAM for model | Use smaller models or enable CPU fallback mode |

## Step 10. Cleanup and rollback

If you need to remove the installation completely, follow these steps:

> **Warning:** This will delete all installed packages and downloaded models.

```bash
deactivate
rm -rf comfyui-env/
rm -rf ComfyUI/
```

To roll back during installation, press `Ctrl+C` to stop the server and remove the virtual environment.

## Step 11. Next steps

Test the installation with a basic image generation workflow:

1. Access the web interface at `http://<device-ip>:8188`
2. Load the default workflow (should appear automatically)
3. Click "Queue Prompt" to generate your first image
4. Monitor GPU usage with `nvidia-smi` in a separate terminal

The image generation should complete within 30-60 seconds depending on your hardware configuration.
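Beyond the web UI, the running server can also be driven programmatically. The sketch below assumes the server from Step 7 is up on `localhost:8188`; the `/prompt` endpoint and `{"prompt": ..., "client_id": ...}` body follow ComfyUI's HTTP API, but verify the payload shape against your installed version before relying on it:

```python
import json
import urllib.request

def build_queue_request(workflow: dict, client_id: str = "playbook-demo") -> bytes:
    """Wrap an API-format workflow graph in the JSON body the /prompt endpoint expects."""
    return json.dumps({"prompt": workflow, "client_id": client_id}).encode("utf-8")

def queue_prompt(workflow: dict, host: str = "http://localhost:8188") -> dict:
    """POST a workflow to a running ComfyUI server and return the parsed response."""
    req = urllib.request.Request(
        f"{host}/prompt",
        data=build_queue_request(workflow),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# The request body can be inspected without a server running:
body = json.loads(build_queue_request({"1": {"class_type": "KSampler",
                                             "inputs": {}}}).decode("utf-8"))
assert set(body) == {"prompt", "client_id"}

# Usage, once the Step 7 server is up (not executed here):
#   queue_prompt(my_exported_api_workflow)
```

A workflow exported from the web UI in API format can be loaded with `json.load()` and passed straight to `queue_prompt`, which makes batch generation over SSH practical.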
diff --git a/nvidia/connect-to-your-spark/README.md b/nvidia/connect-to-your-spark/README.md new file mode 100644 index 0000000..e53a677 --- /dev/null +++ b/nvidia/connect-to-your-spark/README.md @@ -0,0 +1,350 @@ +# Connect to your Spark + +> Use NVIDIA Sync or manual SSH to connect to your Spark + +## Table of Contents + +- [Overview](#overview) +- [Connect with NVIDIA Sync](#connect-with-nvidia-sync) + - [For macOS](#for-macos) + - [For Windows](#for-windows) + - [For Debian/Ubuntu](#for-debianubuntu) +- [Connect with Manual SSH](#connect-with-manual-ssh) + - [Testing mDNS Resolution](#testing-mdns-resolution) + +--- + +## Overview + +## Basic idea + +If you primarily work on another system, such as a laptop, and want to use your DGX Spark as a +remote resource, this playbook shows you how to connect and work over SSH. With SSH, you can +securely open a terminal session or tunnel ports to access web apps and APIs on your DGX Spark +from your local machine. + +There are two approaches: **NVIDIA Sync (recommended)** for streamlined +device management, or **manual SSH** for direct command-line control. + +Before you get started, there are some important concepts to understand: + +**Secure Shell (SSH)** is a cryptographic protocol for securely connecting to a remote computer +over an untrusted network. It lets you open a terminal on your DGX Spark as if you were sitting +at it, run commands, transfer files, and manage services—all encrypted end-to-end. + +**SSH tunneling** (also called port forwarding) securely maps a port on your laptop +(for example, localhost:8888) to a port on the DGX Spark where an app is listening +(such as JupyterLab on port 8888). Your browser connects to localhost, and SSH forwards +the traffic through the encrypted connection to the remote service without exposing +that port on the wider network. + +**mDNS (Multicast DNS)** lets devices discover each other by name on a local network without +needing a central DNS server. 
Your DGX Spark advertises its hostname via mDNS, so you can +connect using a name like `spark-abcd.local` (note the .local suffix), rather than looking +up its IP address. + +## What you'll accomplish + +You will establish secure SSH access to your DGX Spark device using either NVIDIA Sync or manual +SSH configuration. NVIDIA Sync provides a graphical interface for device management with +integrated app launching, while manual SSH gives you direct command-line control with port +forwarding capabilities. Both approaches enable you to run terminal commands, access web +applications, and manage your DGX Spark remotely from your laptop. + + +## What to know before starting + +- Basic terminal/command line usage +- Understanding of SSH concepts and key-based authentication +- Familiarity with network concepts like hostnames, IP addresses, and port forwarding + +## Prerequisites + +- DGX Spark device is set up and you have created a local user account +- Your laptop and DGX Spark are on the same network +- You have your DGX Spark username and password +- You have your device's mDNS hostname (printed on quick start guide) or IP address +- For the manual SSH approach, SSH client available: + + ```bash + ssh -V + ``` + +## Time & risk + +**Time estimate:** 5-10 minutes + +**Risk level:** Low - SSH setup involves credential configuration but no system-level changes +to the DGX Spark device + +**Rollback:** SSH key removal can be done by editing `~/.ssh/authorized_keys` on the DGX Spark + +## Connect with NVIDIA Sync + +## Step 1. Install NVIDIA Sync + +Download and install NVIDIA Sync for your operating system. NVIDIA Sync provides a unified +interface for managing SSH connections and launching development tools on your DGX Spark device. 
+ +::spark-download + +Internal URLs To be removed for launch: +* [Windows Arm64](https://workbench.ngc.nvidia.com/internal/nvidia-connect/latest/dgx-connect-arm64-setup.exe) +* [Windows x86_64](https://workbench.ngc.nvidia.com/internal/nvidia-connect/latest/dgx-connect-x64-setup.exe) +* [macOS](https://workbench.ngc.nvidia.com/internal/nvidia-connect/latest/dgx-connect.dmg) +* [Linux x86_64](https://workbench.ngc.nvidia.com/internal/nvidia-connect/latest/dgx-connect-amd64.deb) +* [Linux Arm64](https://workbench.ngc.nvidia.com/internal/nvidia-connect/latest/dgx-connect-arm64.deb) + +### For macOS +* After download, open `nvidia-sync.dmg` +* Drag and drop the app into your Applications folder +* Open `NVIDIA Sync` from the Applications folder + +### For Windows +* After download, run the installer .exe +* NVIDIA Sync will automatically start after installation completes + + +### For Debian/Ubuntu +* Configure the package repository: + ``` + curl -fsSL https://workbench.download.nvidia.com/stable/linux/gpgkey | sudo tee -a /etc/apt/trusted.gpg.d/ai-workbench-desktop-key.asc + echo "deb https://workbench.download.nvidia.com/stable/linux/debian default proprietary" | sudo tee -a /etc/apt/sources.list + ``` +* Update package lists + ``` + sudo apt update + ``` +* Install NVIDIA Sync + ``` + sudo apt install nvidia-sync + ``` + +## Step 2. Configure Apps + +After starting NVIDIA Sync and agreeing to the EULA, select which development tools you want +to use. These are tools installed on your laptop that Sync can configure and launch connected to your Spark. + +You can modify these selections later in the Settings window. Applications marked "unavailable" +require installation on your laptop. 
**Default Options:**
- **DGX Dashboard**: Web application pre-installed on DGX Spark for system management and integrated JupyterLab access
- **Terminal**: Your system's built-in terminal with automatic SSH connection

**Optional apps (require separate installation):**
- **VS Code**: Download from https://code.visualstudio.com/download
- **Cursor**: Download from https://cursor.com/downloads
- **NVIDIA AI Workbench**: Download from https://nvidia.com/workbench

## Step 3. Add your DGX Spark device

Finally, connect your DGX Spark by filling out the form:

- **Name**: A descriptive name (e.g., "My DGX Spark")
- **Hostname or IP**: The mDNS hostname from your quick start guide (e.g. `spark-abcd.local`) or an IP address
- **Username**: Your DGX Spark user account name
- **Password**: Your DGX Spark user account password

**Note:** Your password is used only during this initial setup to configure SSH key-based authentication. It is not stored or transmitted after setup completion. NVIDIA Sync will SSH into your device and configure its locally provisioned SSH key pair.

Click "Add" and NVIDIA Sync will automatically:

1. Generate an SSH key pair on your laptop
2. Connect to your DGX Spark using your provided username and password
3. Add the public key to `~/.ssh/authorized_keys` on your device
4. Create an SSH alias locally for future connections
5. Discard your username and password information

> **_Wait for update:_** After completing system setup for the first time, your device may take several minutes to update and become available on the network. If NVIDIA Sync fails to connect, please wait 3-4 minutes and try again.

## Step 4. Access your DGX Spark

Once connected, NVIDIA Sync appears as a system tray/taskbar application. Click the NVIDIA Sync icon to open the device management interface.

Clicking the large "Connect" and "Disconnect" buttons controls the overall SSH connection to your device.
**Set working directory** (optional): Choose a default directory that Apps will open in when launched through NVIDIA Sync. This defaults to your home directory on the remote device.

**Launch applications**: Click on any configured app to open it with automatic SSH connection to your DGX Spark.

"Custom Ports" are configured on the Settings screen to provide access to custom web apps or APIs running on your device.

## Step 5. Validate SSH setup

Verify your local SSH configuration is correct by using the SSH alias:

* Test direct SSH connection (should not prompt for password)

  ```bash
  # If you added the device by its mDNS hostname
  ssh <hostname>.local
  ```

  or

  ```bash
  # If you added the device by its IP address
  ssh <ip-address>
  ```

* On the DGX Spark, verify you're connected
  ```bash
  hostname
  whoami
  ```

* Exit the SSH session
  ```bash
  exit
  ```

## Step 6. Troubleshooting

| Symptom | Cause | Fix |
|---------|--------|-----|
| Device name doesn't resolve | mDNS blocked on network | Use IP address instead of hostname.local |
| Connection refused/timeout | DGX Spark not booted or SSH not ready | Wait for device boot completion; SSH available after updates finish |
| Authentication failed | SSH key setup incomplete | Re-run device setup in NVIDIA Sync; check credentials |

## Step 7. Next steps

Test your setup by launching a development tool:
- Click the NVIDIA Sync system tray icon
- Select "Terminal" to open a terminal session on your DGX Spark
- Or click "DGX Dashboard" to access the web interface at the forwarded localhost port

Learn more about NVIDIA Sync features and custom tool integration in the documentation.

## Connect with Manual SSH

## Step 1. Verify SSH client availability

Confirm that you have an SSH client installed on your system. Most modern operating systems include SSH by default.
Run the following in your terminal:

```bash
# Check SSH client version
ssh -V
```

Expected output should show OpenSSH version information.

## Step 2. Gather connection information

Collect the required connection details for your DGX Spark:

- **Username**: Your DGX Spark user account name
- **Password**: Your DGX Spark account password
- **Hostname**: Your device's mDNS hostname (from quick start guide, e.g., `spark-abcd.local`)
- **IP Address**: An alternative, only needed if mDNS doesn't work on your network, as described below

In some network configurations, like complex corporate environments, mDNS won't work as expected and you'll have to use your device's IP address directly to connect. You know you are in this situation when you try to SSH and the command hangs indefinitely or you get an error like:

```
ssh: Could not resolve hostname spark-abcd.local: Name or service not known
```

### Testing mDNS Resolution

To test if mDNS is working, use the `ping` utility.

```bash
ping spark-abcd.local
```

If mDNS is working and you can SSH using the hostname, you should see something like this:

```
$ ping -c 3 spark-abcd.local
PING spark-abcd.local (10.9.1.9): 56 data bytes
64 bytes from 10.9.1.9: icmp_seq=0 ttl=64 time=6.902 ms
64 bytes from 10.9.1.9: icmp_seq=1 ttl=64 time=116.335 ms
64 bytes from 10.9.1.9: icmp_seq=2 ttl=64 time=33.301 ms
```

If mDNS is **not** working and you have to use the IP address directly, you should see something like this:

```
$ ping -c 3 spark-abcd.local
ping: cannot resolve spark-abcd.local: Unknown host
```

If you can't resolve the hostname and don't know the device's IP address, you'll need to:
- Log into your router's admin panel to find the IP address
- Connect a display, keyboard, and mouse to check from the Ubuntu desktop

## Step 3.
Test initial connection

Connect to your DGX Spark for the first time to verify basic connectivity:

```bash
# Connect using mDNS hostname (preferred)
ssh <username>@<hostname>.local
```

```bash
# Alternative: Connect using IP address
ssh <username>@<ip-address>
```

Replace the placeholders with your actual values:
- `<username>`: Your DGX Spark account name
- `<hostname>`: Device hostname without the .local suffix
- `<ip-address>`: Your device's IP address

On first connection, you'll see a host fingerprint warning. Type `yes` and press Enter, then enter your password when prompted.

## Step 4. Verify remote connection

Once connected, confirm you're on the DGX Spark device:

```bash
# Check hostname
hostname
# Check system information
uname -a
# Exit the session
exit
```

## Step 5. Use SSH tunneling for web applications

To access web applications running on your DGX Spark, use SSH port forwarding. In this example we'll access the DGX Dashboard web application, but this works with any service running on localhost.

DGX Dashboard runs on localhost, port 11000.

Open the tunnel:

```bash
# local port 11000 → remote port 11000
ssh -L 11000:localhost:11000 <username>@<hostname>.local
```

After establishing the tunnel, access the forwarded web app in your browser: [http://localhost:11000](http://localhost:11000)

## Step 6. Troubleshooting

| Symptom | Cause | Fix |
|---------|--------|-----|
| `ssh: Could not resolve hostname` | mDNS not working | Use IP address instead of .local hostname |
| `Connection refused` | Device not booted or SSH disabled | Wait for full boot; SSH available after system updates complete |
| `Port forwarding fails` | Service not running or port conflict | Verify remote service is active; try different local port |

## Step 7.
Next steps
+
+With SSH access configured, you can:
+- Open persistent terminal sessions: `ssh <username>@<hostname>.local`
+- Forward web application ports: `ssh -L <local-port>:localhost:<remote-port> <username>@<hostname>.local`
diff --git a/nvidia/dgx-dashboard/README.md b/nvidia/dgx-dashboard/README.md
new file mode 100644
index 0000000..02767b1
--- /dev/null
+++ b/nvidia/dgx-dashboard/README.md
@@ -0,0 +1,233 @@
+# DGX Dashboard
+
+> Manage your DGX system and launch JupyterLab
+
+## Table of Contents
+
+- [Overview](#overview)
+- [Instructions](#instructions)
+  - [Option A: Desktop shortcut (local access)](#option-a-desktop-shortcut-local-access)
+  - [Option B: NVIDIA Sync (recommended for remote access)](#option-b-nvidia-sync-recommended-for-remote-access)
+  - [Option C: Manual SSH tunnels](#option-c-manual-ssh-tunnels)
+
+---
+
+## Overview
+
+## Basic Idea
+
+The DGX Dashboard is a web application that runs locally on DGX Spark devices, providing a graphical interface for system updates, resource monitoring, and an integrated JupyterLab environment. Users can access the dashboard locally from the app launcher or remotely through NVIDIA Sync or SSH tunneling. The dashboard is the easiest way to update system packages and firmware when working remotely.
+
+## What you'll accomplish
+
+You will learn how to access and use the DGX Dashboard on your DGX Spark device. By the end of this walkthrough, you will be able to launch JupyterLab instances with pre-configured Python environments, monitor GPU performance, manage system updates, and run a sample AI workload using Stable Diffusion. You'll understand multiple access methods including desktop shortcuts, NVIDIA Sync, and manual SSH tunneling.
+
+## What to know before starting
+
+- Basic terminal usage for SSH connections and port forwarding
+- Understanding of Python environments and Jupyter notebooks
+
+## Prerequisites
+
+- DGX Spark device with Ubuntu Desktop environment
+- NVIDIA Sync installed (for remote access method) or SSH client configured
+
+## Ancillary files
+
+- Python code snippet for SDXL found [here on GitLab](https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}/assets/jupyter-cell.py)
+
+
+## Time & risk
+
+**Duration:** 15-30 minutes for the complete walkthrough, including the sample AI workload
+
+**Risk level:** Low - Web interface operations with minimal system impact
+
+**Rollback:** Stop JupyterLab instances through the dashboard interface; no permanent system changes are made during normal usage.
+
+## Instructions
+
+## Step 1. Access DGX Dashboard
+
+Choose one of the following methods to access the DGX Dashboard web interface:
+
+### Option A: Desktop shortcut (local access)
+
+If you have physical or remote desktop access to the Spark device:
+
+1. Log into the Ubuntu Desktop environment on your Spark device
+2. Open the Ubuntu app launcher by clicking on the bottom left corner of the screen
+3. Click on the DGX Dashboard shortcut in the app launcher
+4. The dashboard will open in your default web browser at `http://localhost:11000`
+
+### Option B: NVIDIA Sync (recommended for remote access)
+
+If you have NVIDIA Sync installed on your local machine:
+
+1. Click the NVIDIA Sync icon in your system tray
+2. Select your Spark device from the device list
+3. Click "Connect"
+4. Click "DGX Dashboard" to launch the dashboard
+5. The dashboard will open in your default web browser at `http://localhost:11000` using an automatic SSH tunnel
+
+Don't have NVIDIA Sync? [Install it here](TODO!!!!!!)
+
+### Option C: Manual SSH tunnels
+
+For manual remote access without NVIDIA Sync, you must first manually configure SSH tunnels.
+
+You must open a tunnel for the Dashboard server (port 11000) and for JupyterLab if you want to access it remotely. Each user account is assigned a different port number for JupyterLab.
+
+1. Check your assigned JupyterLab port by SSHing into the Spark device and running the following command:
+
+```bash
+cat /opt/nvidia/dgx-dashboard-service/jupyterlab_ports.yaml
+```
+
+2. Look for your username and note the assigned port number
+3. Create a new SSH tunnel including both ports:
+
+```bash
+ssh -L 11000:localhost:11000 -L <jupyter-port>:localhost:<jupyter-port> <username>@<ip-address>
+```
+Replace `<username>` with your Spark device username and `<ip-address>` with the device's IP address.
+
+Replace `<jupyter-port>` with the port number from the YAML file.
+
+Open your web browser and navigate to `http://localhost:11000`.
+
+
+## Step 2. Log into DGX Dashboard
+
+Once the dashboard loads in your browser:
+
+1. Enter your Spark device system username in the username field
+2. Enter your system password in the password field
+3. Click "Login" to access the dashboard interface
+
+You should see the main dashboard with panels for JupyterLab management, system monitoring, and settings.
+
+## Step 3. Launch JupyterLab instance
+
+Create and start a JupyterLab environment:
+
+1. Click the "Start" button in the right panel
+2. Monitor the status as it transitions through: Starting → Preparing → Running
+3. Wait for the status to show "Running" (this may take several minutes on first launch)
+4. Once "Running", if JupyterLab does not automatically open in your browser (the pop-up may have been blocked), you can click the "Open In Browser" button
+
+When starting, a default working directory (`/home/<username>/jupyterlab`) is created and a virtual environment is set up automatically. You can
+review the installed packages by looking at the `requirements.txt` file that is created in the working directory.
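To sanity-check the environment that was created for you, you can parse that `requirements.txt` from a notebook cell. A minimal sketch — the sample contents below are hypothetical; on the device, read `/home/<username>/jupyterlab/requirements.txt` instead:

```python
def list_pinned_packages(requirements_text: str) -> list[str]:
    """Return package names from pip requirements text, skipping comments and blanks."""
    names = []
    for line in requirements_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split off version specifiers like ==, >=, ~= to keep just the name.
        for sep in ("==", ">=", "<=", "~=", ">", "<"):
            if sep in line:
                line = line.split(sep)[0]
                break
        names.append(line.strip())
    return names

# Hypothetical file contents for illustration only.
sample = "jupyterlab==4.2.0\nnumpy>=1.26\n# a comment\ntorch"
print(list_pinned_packages(sample))  # → ['jupyterlab', 'numpy', 'torch']
```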
+ +In the future, you can change the working directory, creating a new isolated environment, by clicking the "Stop" button, changing the path to the new working directory, and then clicking the "Start" button again. + +## Step 4. Test with sample AI workload + +Verify your setup by running a simple Stable Diffusion XL image generation example: + +1. In JupyterLab, create a new notebook: File → New → Notebook +2. Click "Python 3 (ipykernel)" to create the notebook +3. Add a new cell and paste the following code: + +```python +from diffusers import DiffusionPipeline +import torch +from PIL import Image +from datetime import datetime +from IPython.display import display + +## --- Model setup --- +MODEL_ID = "stabilityai/stable-diffusion-xl-base-1.0" +dtype = torch.float16 if torch.cuda.is_available() else torch.float32 + +pipe = DiffusionPipeline.from_pretrained( + MODEL_ID, + torch_dtype=dtype, + variant="fp16" if dtype==torch.float16 else None, +) +pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu") + +## --- Prompt setup --- +prompt = "a cozy modern reading nook with a big window, soft natural light, photorealistic" +negative_prompt = "low quality, blurry, distorted, text, watermark" + +## --- Generation settings --- +height = 1024 +width = 1024 +steps = 30 +guidance = 7.0 + +## --- Generate --- +result = pipe( + prompt=prompt, + negative_prompt=negative_prompt, + num_inference_steps=steps, + guidance_scale=guidance, + height=height, + width=width, +) + +## --- Save to file --- +image: Image.Image = result.images[0] +display(image) +image.save(f"sdxl_output.png") +print(f"Saved image as sdxl_output.png") +``` + +4. Run the cell (Shift+Enter or click the Run button) +5. The notebook will download the model and generate an image (first run may take several minutes) + +## Step 5. Monitor GPU utilization + +While the image generation is running: + +1. Switch back to the DGX Dashboard tab in your browser +2. 
Observe the GPU telemetry data in the monitoring panels
+
+## Step 6. Stop JupyterLab instance
+
+When finished with your session:
+
+1. Return to the main DGX Dashboard tab
+2. Click the "Stop" button in the JupyterLab panel
+3. Confirm the status changes from "Running" to "Stopped"
+
+## Step 7. Manage system updates
+
+If system updates are available, this is indicated by a banner or on the Settings page.
+
+From the Settings page, under the "Updates" tab:
+
+1. Click "Update" to open the confirmation dialog
+2. Click "Update Now" to initiate the update process
+3. Wait for the update to complete and your device to reboot
+
+> **Warning**: System updates will upgrade packages and firmware (if available) and trigger a reboot. Save your work before proceeding.
+
+
+## Step 8. Troubleshooting
+
+| Symptom | Cause | Fix |
+|---------|-------|-----|
+| User can't run updates | User not in sudo group | Add user to sudo group: `sudo usermod -aG sudo <username>` |
+| JupyterLab won't start | Issue with current virtual environment | Change the working directory in the JupyterLab panel and start a new instance |
+| SSH tunnel connection refused | Incorrect IP or port | Verify Spark device IP and ensure SSH service is running |
+| GPU not visible in monitoring | Driver issues | Check GPU status with `nvidia-smi` |
+
+## Step 9. Cleanup and rollback
+
+To clean up resources and return the system to its original state:
+
+1. Stop any running JupyterLab instances via the dashboard
+2. Delete the JupyterLab working directory
+
+> **Warning**: If you ran system updates, the only rollback is to restore from a system backup or recovery media.
+
+No permanent changes are made to the system during normal dashboard usage.
+
+## Step 10. 
Next steps + +Now that you have DGX Dashboard configured, you can: + +- Create additional JupyterLab environments for different projects +- Use the dashboard to manage system maintenance and updates diff --git a/nvidia/dgx-dashboard/assets/jupyter-cell.py b/nvidia/dgx-dashboard/assets/jupyter-cell.py new file mode 100644 index 0000000..7307db9 --- /dev/null +++ b/nvidia/dgx-dashboard/assets/jupyter-cell.py @@ -0,0 +1,59 @@ +# +# SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# + +from diffusers import DiffusionPipeline +import torch +from PIL import Image +from datetime import datetime +from IPython.display import display + +# --- Model setup --- +MODEL_ID = "stabilityai/stable-diffusion-xl-base-1.0" +dtype = torch.float16 if torch.cuda.is_available() else torch.float32 + +pipe = DiffusionPipeline.from_pretrained( + MODEL_ID, + torch_dtype=dtype, + variant="fp16" if dtype==torch.float16 else None, +) +pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu") + +# --- Prompt setup --- +prompt = "a cozy modern reading nook with a big window, soft natural light, photorealistic" +negative_prompt = "low quality, blurry, distorted, text, watermark" + +# --- Generation settings --- +height = 1024 +width = 1024 +steps = 30 +guidance = 7.0 + +# --- Generate --- +result = pipe( + prompt=prompt, + negative_prompt=negative_prompt, + num_inference_steps=steps, + guidance_scale=guidance, + height=height, + width=width, +) + +# --- Save to file --- +image: Image.Image = result.images[0] +display(image) +image.save(f"sdxl_output.png") +print(f"Saved image as sdxl_output.png") diff --git a/nvidia/flux-finetuning/README.md b/nvidia/flux-finetuning/README.md new file mode 100644 index 0000000..d5e1f8e --- /dev/null +++ b/nvidia/flux-finetuning/README.md @@ -0,0 +1,165 @@ +# FLUX.1 Dreambooth LoRA Fine-tuning + +> Fine-tune FLUX.1-dev 11B model using multi-concept Dreambooth LoRA for custom image generation + +## Table of Contents + +- [Overview](#overview) +- [Instructions](#instructions) + +--- + +## Overview + +## Basic Idea + +This playbook demonstrates how to fine-tune the FLUX.1-dev 11B model using multi-concept Dreambooth LoRA (Low-Rank Adaptation) for custom image generation on DGX Spark. 
+
+With 128GB of unified memory and powerful GPU acceleration, DGX Spark provides an ideal environment for training an image generation model with multiple models loaded in memory, such as the Diffusion Transformer, CLIP Text Encoder, T5 Text Encoder, and the Autoencoder.
+
+Multi-concept Dreambooth LoRA fine-tuning allows you to teach FLUX.1 new concepts, characters, and styles. The trained LoRA weights can be easily integrated into existing ComfyUI workflows, making it perfect for prototyping and experimentation.
+Moreover, this playbook demonstrates that DGX Spark can not only load several models in memory, but also run training and generate high-resolution images at 1024px and above.
+
+## What you'll accomplish
+
+You will have a fine-tuned FLUX.1 model capable of generating images with your custom concepts, readily available for ComfyUI workflows.
+The setup includes:
+- FLUX.1-dev model fine-tuning using the Dreambooth LoRA technique
+- Training on custom concepts ("tjtoy" toy and "sparkgpu" GPU)
+- High-resolution 1K diffusion training and inference
+- ComfyUI integration for intuitive visual workflows
+- Docker containerization for reproducible environments
+
+## Prerequisites
+
+- DGX Spark device is set up and accessible
+- No other processes running on the DGX Spark GPU
+- Enough disk space for model downloads
+- NVIDIA Docker installed and configured
+
+
+## Time & risk
+
+**Duration**:
+- 15 minutes for initial setup and model download
+- 1-2 hours for Dreambooth LoRA training
+
+**Risks**:
+- Docker permission issues may require user group changes and a session restart
+- The recipe requires hyperparameter tuning and a high-quality dataset for the best results
+
+**Rollback**: Stop and remove Docker containers; delete downloaded models if needed
+
+## Instructions
+
+## Step 1. Configure Docker permissions
+
+To manage containers without sudo, you must be in the `docker` group.
If you choose to skip this step, you will need to run Docker commands with sudo. + +Open a new terminal and test Docker access. In the terminal, run: + +```bash +docker ps +``` + +If you see a permission denied error (something like `permission denied while trying to connect to the Docker daemon socket`), add your user to the docker group: + +```bash +sudo usermod -aG docker $USER +``` + +> **Warning**: After running usermod, you must log out and log back in to start a new +> session with updated group permissions. + +## Step 2. Clone the repository + +In a terminal, clone the repository and navigate to the flux-finetuning directory. + +```bash +git clone https://******/spark-playbooks/dgx-spark-playbook-assets.git +cd dgx-spark-playbook-assets/flux-finetuning +``` + +## Step 3. Build the Docker container + +This docker image will download the required models and set up the environment for training and inference. +- `flux1-dev.safetensors` +- `ae.safetensors` +- `clip_l.safetensors` +- `t5xxl_fp16.safetensors` +```bash +docker build -t flux-training . +``` + +## Step 4. Run the Docker container + +```bash +## Run with GPU support and mount current directory +docker run --gpus all -it --rm \ + -v $(pwd):/workspace \ + -p 8188:8188 \ + flux-training +``` + +## Step 5. Train the model + +Inside the container, navigate to the sd-scripts directory and run the training script: + +```bash +cd /workspace/sd-scripts +../train.sh +``` + +The training will: +- Use LoRA with dimension 256 +- Train for 100 epochs (saves every 25 epochs) +- Learn custom concepts: "tjtoy toy" and "sparkgpu gpu" +- Output trained LoRA weights to `saved_models/flux_dreambooth.safetensors` + +## Step 6. Generate images with command-line inference + +After training completes, generate sample images: + +```bash +../inference.sh +``` + +This will generate several images demonstrating the learned concepts, stored in the `outputs` directory. + +## Step 7. 
Spin up ComfyUI for visual workflows
+
+Start ComfyUI for an intuitive interface:
+
+```bash
+cd /workspace/ComfyUI
+python main.py --listen 0.0.0.0 --port 8188
+```
+
+Access ComfyUI at `http://localhost:8188`
+
+## Step 8. Deploy the trained LoRA in ComfyUI
+
+Feel free to deploy the trained LoRA in ComfyUI in existing or custom workflows.
+Use your trained concepts in prompts:
+- `"tjtoy toy"` - Your custom toy concept
+- `"sparkgpu gpu"` - Your custom GPU concept
+- `"tjtoy toy holding sparkgpu gpu"` - Combined concepts
+
+## Step 9. Cleanup
+
+Exit the container (it was started with `--rm`, so it is removed automatically) and optionally remove the Docker image:
+
+```bash
+## Exit container
+exit
+
+## Remove Docker image (optional)
+docker rmi flux-training
+```
+
+## Step 10. Next steps
+
+- Experiment with different LoRA strengths (0.8-1.2) in ComfyUI
+- Train on your own custom concepts by replacing images in the `data/` directory
+- Combine multiple LoRA models for complex compositions
+- Integrate the trained LoRA into other FLUX workflows
diff --git a/nvidia/flux-finetuning/assets/after_finetuning.png b/nvidia/flux-finetuning/assets/after_finetuning.png
new file mode 100644
index 0000000..07455cc
Binary files /dev/null and b/nvidia/flux-finetuning/assets/after_finetuning.png differ
diff --git a/nvidia/flux-finetuning/assets/before_finetuning.png b/nvidia/flux-finetuning/assets/before_finetuning.png
new file mode 100644
index 0000000..48031cf
Binary files /dev/null and b/nvidia/flux-finetuning/assets/before_finetuning.png differ
diff --git a/nvidia/flux-finetuning/assets/comfyui_workflow.png b/nvidia/flux-finetuning/assets/comfyui_workflow.png
new file mode 100644
index 0000000..769cafc
Binary files /dev/null and b/nvidia/flux-finetuning/assets/comfyui_workflow.png differ
diff --git a/nvidia/jax/README.md b/nvidia/jax/README.md
new file mode 100644
index 0000000..24fd865
--- /dev/null
+++ b/nvidia/jax/README.md
@@ -0,0 +1,217 @@
+# Optimized JAX
+
+> Develop with Optimized JAX
+
+## Table of Contents
+
+- [Overview](#overview)
+- [Instructions](#instructions)
+
+---
+
+## Overview
+
+## Basic idea
+
+JAX lets you write **NumPy-style Python code** and run it fast on GPUs without writing CUDA. It does this by:
+
+- **NumPy on accelerators**: Use `jax.numpy` just like NumPy, but arrays live on the GPU.
+- **Function transformations**:
+  - `jit` → Compiles your function into fast GPU code.
+  - `grad` → Gives you automatic differentiation.
+  - `vmap` → Vectorizes your function across batches.
+  - `pmap` → Runs across multiple GPUs in parallel.
+- **XLA backend**: JAX hands your code to XLA (the Accelerated Linear Algebra compiler), which fuses operations and generates optimized GPU kernels.
+
+## What you'll accomplish
+
+You'll set up a JAX development environment on NVIDIA Spark with Blackwell architecture that enables
+high-performance machine learning prototyping using familiar NumPy-like abstractions, complete with
+GPU acceleration and performance optimization capabilities.
+
+## What to know before starting
+
+- Comfortable with Python and NumPy programming
+- General understanding of machine learning workflows and techniques
+- Experience working in a terminal
+- Experience using and building containers
+- Familiarity with different versions of CUDA
+- Basic understanding of linear algebra (high-school level math sufficient)
+
+## Prerequisites
+
+- [ ] NVIDIA Spark device with Blackwell architecture
+- [ ] ARM64 (AArch64) processor architecture
+- [ ] Docker or container runtime installed
+- [ ] NVIDIA Container Toolkit configured
+- [ ] Verify GPU access: `nvidia-smi`
+- [ ] Verify Docker GPU support: `docker run --gpus all --rm nvcr.io/nvidia/cuda:13.0.1-runtime-ubuntu24.04 nvidia-smi`
+- [ ] Port 8080 available for marimo notebook access
+
+## Ancillary files
+
+All required assets can be found [here on GitLab](https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main)
+
+- [**JAX introduction
notebook**](https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}/assets/jax-intro.py) — covers JAX programming model differences from NumPy and performance evaluation +- [**NumPy SOM implementation**](https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}/assets/numpy-som.py) — reference implementation of self-organized map training algorithm in NumPy +- [**JAX SOM implementations**](https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}/assets/som-jax.py) — multiple iteratively refined implementations of SOM algorithm in JAX +- [**Environment configuration**](https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}/assets/Dockerfile) — package dependencies and container setup specifications +- [**Course guide notebook**]() — overall material navigation and learning path + +## Time & risk + +**Duration:** 2-3 hours including setup, tutorial completion, and validation + +**Risks:** +- Package dependency conflicts in Python environment +- Performance validation may require architecture-specific optimizations + +**Rollback:** Container environments provide isolation; remove containers and restart to reset state. + +## Instructions + +## Step 1. Verify system prerequisites + +Confirm your NVIDIA Spark system meets the requirements and has GPU access configured. + +```bash +## Verify GPU access +nvidia-smi + +## Verify ARM64 architecture +uname -m + +## Check Docker GPU support +docker run --gpus all --rm nvcr.io/nvidia/cuda:13.0.1-runtime-ubuntu24.04 nvidia-smi +``` + +If the `docker` command fails with a permission error, you can either + +1. run it with `sudo`, e.g., `sudo docker run --gpus all --rm nvcr.io/nvidia/cuda:13.0.1-runtime-ubuntu24.04 nvidia-smi`, or +2. add yourself to the `docker` group so you can use `docker` without `sudo`. 
+ +To add yourself to the `docker` group, first run `sudo usermod -aG docker $USER`. Then, as your user account, either run `newgrp docker` or log out and log back in. + +## Step 2. Build a Docker image + + +> **Warning:** This command will download a base image and build a container locally to support this environment + +```bash +cd jax-assets +docker build -t jax-on-spark . +``` + +## Step 3. Launch Docker container + +Run the JAX development environment in a Docker container with GPU support and port forwarding for marimo access. + +```bash +docker run --gpus all --rm -it \ + --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \ + -p 8080:8080 \ + jax-on-spark +``` + +## Step 4. Access marimo interface + +Connect to the marimo notebook server to begin the JAX tutorial. + +```bash +## Access via web browser +## Navigate to: http://localhost:8080 +``` + +The interface will load a table-of-contents display and brief introduction to marimo. + +## Step 5. Complete JAX introduction tutorial + +Work through the introductory material to understand JAX programming model differences from NumPy. + +Navigate to and complete the JAX introduction notebook, which covers: +- JAX programming model fundamentals +- Key differences from NumPy +- Performance evaluation techniques + +## Step 6. Implement NumPy baseline + +Complete the NumPy-based self-organized map (SOM) implementation to establish a performance +baseline. + +Work through the NumPy SOM notebook to: +- Understand the SOM training algorithm +- Implement the algorithm using familiar NumPy operations +- Record performance metrics for comparison + +## Step 7. Optimize with JAX implementations + +Progress through the iteratively refined JAX implementations to see performance improvements. + +Complete the JAX SOM notebook sections: +- Basic JAX port of NumPy implementation +- Performance-optimized JAX version +- GPU-accelerated parallel JAX implementation +- Compare performance across all versions + +## Step 8. 
Validate performance gains
+
+The notebooks will show you how to check the performance of each SOM training implementation; you'll see that the JAX implementations show performance improvements over the NumPy baseline (and some will be quite a lot faster).
+
+Visually inspect the SOM training output on random color data to confirm algorithm correctness.
+
+## Step 9. Validate installation
+
+Confirm all components are working correctly and notebooks execute successfully.
+
+```bash
+## Test GPU JAX functionality
+python -c "import jax; print(jax.devices()); print(jax.device_count())"
+
+## Verify JAX can access GPU
+python -c "import jax.numpy as jnp; x = jnp.array([1, 2, 3]); print(x.device)"
+```
+
+Expected output should show GPU devices detected and JAX arrays placed on the GPU.
+
+## Step 10. Troubleshooting
+
+Common issues and their solutions:
+
+| Symptom | Cause | Fix |
+|---------|--------|-----|
+| `nvidia-smi` not found | Missing NVIDIA drivers | Install NVIDIA drivers for ARM64 |
+| Container fails to access GPU | Missing NVIDIA Container Toolkit | Install nvidia-container-toolkit |
+| JAX only uses CPU | CUDA/JAX version mismatch | Reinstall JAX with CUDA support |
+| Port 8080 unavailable | Port already in use | Use `-p 8081:8080` or kill the process on 8080 |
+| Package conflicts in Docker build | Outdated environment file | Update the environment file for Blackwell |
+
+## Step 11. Cleanup and rollback
+
+Remove containers and reset the environment if needed.
+
+> **Warning:** This will remove all container data and downloaded images.
+
+```bash
+## Stop and remove containers
+docker stop $(docker ps -q)
+docker system prune -f
+```
+
+To roll back: Re-run the installation steps from Step 2.
+
+## Step 12. Next steps
+
+Apply JAX optimization techniques to your own NumPy-based machine learning code.
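As a concrete sketch of the NumPy-to-JAX conversion practiced above — a toy distance function, not the workshop's SOM code, assuming `jax` is installed:

```python
import jax
import jax.numpy as jnp

# A NumPy-style function: squared Euclidean distance from each row of X to a vector w.
def dists(X, w):
    return jnp.sum((X - w) ** 2, axis=1)

# jit compiles the function with XLA; grad differentiates a scalar-valued loss.
dists_fast = jax.jit(dists)
loss = lambda w, X: jnp.mean(dists(X, w))
grad_loss = jax.grad(loss)

X = jnp.ones((4, 3))
w = jnp.zeros(3)
print(dists_fast(X, w))  # → [3. 3. 3. 3.]
print(grad_loss(w, X))   # → [-2. -2. -2.]
```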
+ +```bash +## Example: Profile your existing NumPy code +python -m cProfile your_numpy_script.py + +## Then adapt to JAX and compare performance +``` + +Try adapting your favorite NumPy algorithms to JAX and measure performance improvements on +Blackwell GPU architecture. diff --git a/nvidia/jax/assets/00-toc.py b/nvidia/jax/assets/00-toc.py new file mode 100644 index 0000000..5c8fba3 --- /dev/null +++ b/nvidia/jax/assets/00-toc.py @@ -0,0 +1,140 @@ +# /// script +# requires-python = ">=3.12" +# dependencies = [ +# "marimo", +# "numpy==2.2.6", +# "plotly==6.3.0", +# ] +# /// + +# +# SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import marimo + +__generated_with = "0.16.3" +app = marimo.App() + + +@app.cell(hide_code=True) +def _(): + import marimo as mo + return (mo,) + + +@app.cell(hide_code=True) +def _(mo): + mo.md( + r""" + # Getting started with marimo notebooks + + This is a [marimo](https://marimo.io) notebook. You can learn more about marimo notebooks [here](https://docs.marimo.io), but you only need to know a few things to get started. + + 1. Notebooks are made up of _cells_ that can contain prose, executable code, or interactive UI elements. This cell is a prose cell. + 2. Cells can be edited and executed. To edit a cell, double-click until the frame around it turns green. 
By default, we'll execute the whole notebook when you load it, but to explicitly execute a cell (whether you've edited it or not), select it and press Shift+Enter. This will execute the cell, record its output, and advance to the next cell. + 3. Unlike Jupyter notebooks, which you may have used before, marimo notebooks are _reactive_, meaning that changing the code in a cell will cause any cell depending on the changed cell's outputs to re-run. This makes it possible to develop interactive apps and dashboards, but it also limits a potential reproducibility challenge that can come up while editing non-reactive notebooks. + + The next cell is a code cell. It should have run automatically, but try executing it if not. You'll be editing and re-running this cell in the next step. + """ + ) + return + + +@app.cell +def _(): + import numpy as np + import plotly.express as px + import plotly.io as pio + pio.renderers.default='notebook' + + x_size = 192 + y_size = 108 + feature_dims = 3 + + r_mult = 0.33 + g_mult = 0.33 + b_mult = 1.0 + + random_image = np.random.random(size=(y_size, x_size, feature_dims)) * np.array([r_mult, g_mult, b_mult]) + + px.imshow(random_image) + return + + +@app.cell(hide_code=True) +def _(mo): + mo.md( + r""" + ✅ Change the code above to show a purple-tinted image instead of a blue-tinted one. (If you don't know exactly what it's doing, try changing some of the variables to see how it affects the picture!) + + In the notebooks we'll be working on today, there will be many opportunities to change cells and see how small code changes influence the results and performance of our models. You should always feel free to edit notebook cells and experiment with changes to code. We'll call out several places where we've made it especially easy to try new things (or where we've given you exercises to try) with a checkmark emoji, like this: ✅ + + You just did that, so you'll notice it next time! 
+ + # Recovering from mistakes + + Next up, we'll see how to backtrack. Say you make a mistake in your notebook and overwrite a variable declaration that you need, or perhaps you get stuck and can't figure out how to get back to a clean slate. Run the next cell. + """ + ) + return + + +@app.cell +def _(): + def foo(x): + """ paradoxically, this is an unhelpful function """ + return "sorry, I don't know anything about %r" % x + + help = foo + + help(foo) + return + + +@app.cell(hide_code=True) +def _(mo): + mo.md( + r""" + Oh, no! We've overwritten the definition of the `help` built-in function. This is an important way to access documentation as we're experimenting with new Python libraries. Fortunately, we can use this as an opportunity to learn how to clean up after our mistake. + + ✅ First, edit the previous cell, deleting the line `help = foo`. + + Then, re-run the cell. How does it change? + + # Cleaning up + + Since some of the code we'll execute will allocate memory on the GPU, we may need to clean up after it before moving on to other notebooks. When you're done with a notebook, simply go to the drop-down menu and select `Kernel -> Shut Down Kernel` before proceeding to the next notebook. + + # These notebooks + + In these notebooks, you'll be using JAX to accelerate prototype implementations of a machine learning technique. Here's how to proceed: + + * Start with [an introduction to JAX](/?file=jax-intro.py). + * Review our basic technique by studying [an implementation of self-organizing maps in numpy](/?file=numpy-som.py). + * Conclude with [two accelerated implementations of self-organizing maps in JAX](/?file=som-jax.py). 
+ """ + ) + return + + +@app.cell +def _(): + return + + +if __name__ == "__main__": + app.run() diff --git a/nvidia/jax/assets/Dockerfile b/nvidia/jax/assets/Dockerfile new file mode 100644 index 0000000..dc3a75b --- /dev/null +++ b/nvidia/jax/assets/Dockerfile @@ -0,0 +1,29 @@ +# +# SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +FROM nvidia/cuda:13.0.1-runtime-ubuntu24.04 + +RUN ldconfig +COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/ + +RUN mkdir /app +WORKDIR /app + +RUN uv init && uv venv && uv pip install marimo && uv pip install "jax[cuda13]==0.7.2" && uv pip install "numpy==2.3.3" && uv pip install "plotly==6.3.0" && uv pip install "opencv-python-headless==4.12.0.88" && uv pip install "tqdm==4.67.1" + +COPY *.py *.mp4 /app + +CMD ["sh", "-c", "uv run marimo edit 00-toc.py --host 0.0.0.0 --port 8080 --headless --no-token --no-sandbox"] diff --git a/nvidia/jax/assets/batch-som.mp4 b/nvidia/jax/assets/batch-som.mp4 new file mode 100644 index 0000000..4dfdaa5 Binary files /dev/null and b/nvidia/jax/assets/batch-som.mp4 differ diff --git a/nvidia/jax/assets/jax-intro.py b/nvidia/jax/assets/jax-intro.py new file mode 100644 index 0000000..7dd9301 --- /dev/null +++ b/nvidia/jax/assets/jax-intro.py @@ -0,0 +1,662 @@ +# /// script +# requires-python = ">=3.12" +# dependencies = [ +# "jax==0.7.1", +# "marimo", +# 
"numpy==2.2.6",
+# ]
+# ///
+
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import marimo
+
+__generated_with = "0.15.0"
+app = marimo.App()
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(
+        r"""
+    # Introducing JAX
+
+    [JAX](https://jax.readthedocs.io) is an implementation of a significant subset of the NumPy API with some features that make it especially suitable for machine learning research and high-performance computing. As we'll see, these same features also make JAX extremely useful for accelerating functions and prototypes that we've developed in NumPy. This notebook will provide a quick introduction to just some of the features of JAX that we'll be using in the rest of this workshop -- as well as pointers to some of the potential pitfalls you might run into with it. There's a lot more to JAX than we'll be able to cover in this notebook (and the rest of the workshop), so you'll want to read the (great) documentation as you dive deeper.
+
+    We'll start by importing JAX and its implementation of the NumPy API. By convention, we'll import `jax.numpy` as `jnp` and regular `numpy` as `np` — this is because we can use both in our programs and we may want to use both for different things.
+    """
+    )
+    return
+
+
+@app.cell
+def _():
+    import jax
+    import numpy as np
+    import jax.numpy as jnp
+    from timeit import timeit
+    return jax, jnp, np, timeit
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""We can use `jax.numpy` as a drop-in replacement for `numpy` in many cases (we'll see some caveats later in this notebook), and JAX arrays often interoperate transparently with NumPy arrays.""")
+    return
+
+
+@app.cell
+def _(jnp):
+    za = jnp.zeros(7)
+    za
+    return (za,)
+
+
+@app.cell
+def _(np, za):
+    # note that we're doing elementwise addition
+    # between a NumPy array and a JAX array
+
+    zna = np.ones(7)
+    za + zna
+    return (zna,)
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""There are some differences, though, and an important one is that JAX arrays can be stored in GPU memory. If you're running this notebook with a GPU-enabled version of JAX, you can see where our array is stored:""")
+    return
+
+
+@app.cell
+def _(za):
+    za.device
+    return
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(
+        r"""
+    ## NumPy and JAX
+
+    By themselves, JAX's implementations of NumPy operations are unlikely to be much faster than those from NumPy or (especially) CuPy, but — as we'll see — JAX offers some special functionality that can make JAX code much faster, especially on a GPU. Let's start by doing some simple timings of operations, though.
+
+    We'll use the `device_put` function to convert a NumPy array to a JAX array.
+ """ + ) + return + + +@app.cell +def _(jax, np): + random_shape = (8192, 2048) + random_values = np.random.random(size=random_shape) + jrandom_values = jax.device_put(random_values) + return jrandom_values, random_values + + +@app.cell +def _(np, random_values): + np.matmul(random_values, random_values.T) + return + + +@app.cell +def _(np, random_values, timeit): + _result = timeit(lambda: + np.matmul(random_values, random_values.T), + number=10) + + print(f"NumPy matrix multiplication took {_result/10:.4f} seconds per iteration") + return + + +@app.cell +def _(jnp, jrandom_values, timeit): + _result = timeit(lambda: + jnp.matmul(jrandom_values, jrandom_values.T).block_until_ready(), + number=10) + + print(f"JAX matrix multiplication took {_result/10:.4f} seconds per iteration") + return + + +@app.cell(hide_code=True) +def _(mo): + mo.md( + r""" + The `block_until_ready` is a special detail that is important when we're getting timings of single lines of JAX code — basically, JAX dispatches our code to the GPU asynchronously and we need to make sure that the operation has completed before we consider it done for the purposes of timing it. + + ✅ Try running the JAX code without `block_until_ready()` to see how much of a difference it makes to time the actual code we want to execute. + + ✅ You probably saw that JAX was faster than NumPy with the matrix shape we provided (in `random_shape`). Make sure that you've added `.block_until_ready()` back to the JAX code and try some other matrix shapes (both larger and smaller). Does JAX exhibit more of a speed advantage on some matrix sizes than others? Is JAX slower than NumPy for some of these? Why (or why not), do you suppose? + """ + ) + return + + +@app.cell(hide_code=True) +def _(mo): + mo.md( + r""" + ## Functional array updates + + One major difference between JAX and NumPy is that JAX arrays are _immutable_. This means that once you create an array, you can't update it in place. 
In NumPy, you'd do this:
+    """
+    )
+    return
+
+
+@app.cell
+def _(zna):
+    zna[3] = 5.0
+    zna
+    return
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""...whereas in JAX, you'd use the `at` indexing helper to make a copy of the array that changes only one value:""")
+    return
+
+
+@app.cell
+def _(za):
+    za_1 = za.at[3].set(5.0)
+    za_1
+    return
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""Those of you who have used functional languages will likely be comfortable with this style, but it may be an adjustment. (It's not necessarily as inefficient as it sounds! See [here](https://jax.readthedocs.io/en/latest/faq.html#buffer-donation) for more details on how to avoid copies — and what JAX does under the hood.)""")
+    return
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(
+        r"""
+    ## Random number generation
+
+    Recall that a pseudorandom number generator is, under the hood, a function with a very long period that maps from a state to a next state and a value. The state is typically hidden from the user as an implementation detail: you initialize a generator with a seed, it creates an initial state, and then it updates that state every time you draw numbers from the generator. Implicit state makes parallelism difficult, so JAX takes a different approach -- generator state (a so-called key) is explicitly passed to each method, and clients must explicitly split a key to get fresh state before drawing from the generator.
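As a quick standalone sketch of this explicit-state style (not one of this notebook's cells — just an illustration you can try in any JAX session):

```python
import jax

key = jax.random.PRNGKey(0)

# drawing twice with the same key gives the same value...
a = jax.random.normal(key)
b = jax.random.normal(key)

# ...so we split the key to get fresh, independent state
key, subkey = jax.random.split(key)
c = jax.random.normal(subkey)

bool(a == b)  # True: identical keys produce identical samples
```

Reusing a key is almost always a bug in real code; splitting before every draw keeps streams reproducible and independent.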
+
+    So, in NumPy, drawing some numbers from a Poisson distribution with a λ of 7 would look like this:
+    """
+    )
+    return
+
+
+@app.cell
+def _(np):
+    SEED = 0x12345678
+
+    rng = np.random.default_rng(SEED)
+
+    # sample 4,096 values from a Poisson distribution
+    npa = rng.poisson(lam=7, size=4096)
+
+    # mean and variance should both be close to 7
+    np.mean(npa), np.var(npa)
+    return (SEED,)
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""...whereas in JAX, it'd look like this:""")
+    return
+
+
+@app.cell
+def _(SEED, jax, jnp):
+    key = jax.random.PRNGKey(SEED)
+
+    # "split" the key to explicitly manage state --
+    # both key values will be different from the
+    # key value we generated above
+
+    key, nextkey = jax.random.split(key)
+
+    jnpa = jax.random.poisson(nextkey, lam=7, shape=(4096,))
+
+    # mean and variance should both be close to 7
+    jnp.mean(jnpa), jnp.var(jnpa)
+    return
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""When we split `key`, we generated two new keys. The advantage of using `nextkey` in the call to `jax.random.poisson` is that we don't have to explicitly assign to `key` later on (e.g., if we were in a loop).""")
+    return
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(
+        r"""
+    ## Just-in-time compilation
+
+    If you're running with accelerated hardware, JAX's implementations of NumPy functions require sending code (and sometimes data) to the GPU. This may not be noticeable if you're doing a lot of work, but it can impact performance if you're invoking many small functions. JAX provides a mechanism for _just-in-time compilation_, so that the first time you execute a function it produces a specialized version that can execute more efficiently.
+
+    We'll see the `jax.jit` function in action later in this workshop. There are some things we'll need to keep in mind to use it effectively, and we'll cover those when we get to them!
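As a tiny preview (a standalone sketch, not the workshop's code — the function here is just an invented example), jitting a small function looks like this:

```python
import jax
import jax.numpy as jnp

@jax.jit
def scaled_diff(x, y):
    # a few elementwise ops that jit can fuse into a single compiled kernel
    return (x - y) / (jnp.abs(x) + jnp.abs(y) + 1e-8)

x = jnp.arange(4.0)
y = jnp.ones(4)

out = scaled_diff(x, y)  # the first call traces and compiles; later calls reuse the compiled code
```

Subsequent calls with arguments of the same shape and dtype skip compilation entirely, which is where the speedup comes from.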
+ """ + ) + return + + +@app.cell(hide_code=True) +def _(mo): + mo.md(r"""## Parallelizing along axes""") + return + + +@app.cell(hide_code=True) +def _(mo): + mo.md(r"""A powerful feature of JAX is the capability to parallelize functions along axes. So if, for example, you want to calculate the norm of each row in a matrix, you can do each of these in parallel. We'll start by generating a random matrix and moving it to the GPU again:""") + return + + +@app.cell +def _(jax, np): + random_shape_vmap = (8192, 16384) + random_values_vmap = np.random.random(size=random_shape_vmap) + jrandom_values_vmap = jax.device_put(random_values_vmap) + return jrandom_values_vmap, random_values_vmap + + +@app.cell(hide_code=True) +def _(mo): + mo.md(r"""We'll compute the norm of each row using both NumPy and JAX and collect timings for each:""") + return + + +@app.cell +def _(np, random_values_vmap, timeit): + _result = timeit(lambda: + np.linalg.norm(random_values_vmap, axis=1), + number=10) + + print(f"NumPy row-wise norm took {_result/10:.4f} seconds per iteration") + return + + +@app.cell +def _(jnp, jrandom_values_vmap, timeit): + _result = timeit(lambda: + jnp.linalg.norm(jrandom_values_vmap, axis=1).block_until_ready(), + number=100) + + print(f"JAX row-wise norm took {_result/100:.7f} seconds per iteration") + return + + +@app.cell(hide_code=True) +def _(mo): + mo.md(r"""We can also try using `jax.jit` and `jax.vmap` to see how this impacts performance.""") + return + + +@app.cell +def _(jax, jnp, jrandom_values_vmap, timeit): + jit_norm = jax.jit(jax.vmap(jnp.linalg.norm, in_axes=0)) + + _result = timeit(lambda: + jit_norm(jrandom_values_vmap).block_until_ready(), + number=100) + + print(f"JAX vmapped norm took {_result/100:.7f} seconds per iteration") + return + + +@app.cell(hide_code=True) +def _(mo): + mo.md( + r""" + Calculating the norm of every row is already pretty efficient in JAX, so we don't see much (if any) performance improvement from mapping over axes. 
(This will likely be the case for most NumPy functions in JAX that have an `axis` argument.) However, it's an easy example to understand and we'll see a higher-impact application of `vmap` in the next notebook. + + ## Timings + + In the above cell, we've used the `timeit` module in the standard library to repeatedly execute a small code snippet and get the average execution time. For longer-executing cells, we can simply use marimo's direct support for recording cell timings to see how long they executed. + + ✅ Mouse over a cell that has executed and look for timing information. In the version of marimo I'm using now, it will show up in the right margin, but only when you mouse over the cell. For the cells above, the timing should be roughly the value printed out times the number of iterations in the `timeit` call, so if the cell printed something like: + + ```JAX vmapped norm took 0.0355922 seconds per iteration``` + + then you'd expect the cell timing to show something like 3.6 seconds, given that we ran 100 iterations of the code. + + ## Automatic differentiation + + A particularly interesting feature of JAX is its support for _automatic differentiation_. This means that, given a function $f(x)$, JAX can automatically calculate $f'(x)$, or the _derivative_ of $f$, which is a function describing the rate of change between $f(x)$ and $f(x + \epsilon)$, where $\epsilon$ is a very small number. (JAX can also calculate the derivative for functions of multiple arguments, but our running example in this notebook will be a single-argument function.) + + If your daily work regularly involves implementing machine learning and optimization algorithms, you probably already have some ideas why this functionality could be useful. (If it doesn't and you're curious, [here's an explanation](https://en.wikipedia.org/wiki/Gradient_descent) to read on your own time.) 
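To make the connection to gradient descent concrete, here's a minimal sketch of minimizing a one-dimensional toy loss with `jax.grad` (which we'll introduce properly below) — the loss function and step size are invented for illustration:

```python
import jax

def loss(w):
    # a toy convex objective with its minimum at w = 3.0
    return (w - 3.0) ** 2

grad_loss = jax.grad(loss)

w = 0.0
for _ in range(100):
    w = w - 0.1 * grad_loss(w)  # step downhill along the gradient
# after the loop, w is close to 3.0
```

Training a neural network follows the same pattern, just with a much higher-dimensional `w` and a loss computed over data.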
+
+    In the rest of this notebook, we'll show an example of a problem we can solve with the help of JAX's support for automatic differentiation. Since not everyone spends their days thinking about optimizing functions, we've chosen a problem that doesn't require any specialized mathematical or machine learning background to understand, but we'll throw in a wrinkle at the end to show everyone why JAX's automatic differentiation is especially cool.
+
+    Let's start with a very simple Python function:
+    """
+    )
+    return
+
+
+@app.function
+def square(x):
+    return x * x
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""We can calculate the derivative of `square` numerically, by calculating the slope of `x * x` while making a very small change to `x`.""")
+    return
+
+
+@app.function
+def square_num_prime(x, h=1e-5):
+    above = x + h
+    below = x - h
+
+    rise = (above ** 2) - (below ** 2)
+    run = h * 2
+
+    return rise / run
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""You may remember the [power rule](https://en.wikipedia.org/wiki/Power_rule), which states that the derivative of $x^a$ is $ax^{a-1}$. Given this rule, the derivative of $x^2$ is $2x^{2-1} = 2x$. We can use this to check our answer for several values of $x$.""")
+    return
+
+
+@app.cell
+def _():
+    [square_num_prime(x) for x in [16.0,32.0,64.0,128.0,256.0,512.0,1024.0]]
+    return
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(
+        r"""
+    You may have noticed that not every result is what we'd expect! There are several ways in which numerical differentiation may not produce a precise result, but for this example we may be able to improve the results by changing the range around $x$ for which we're measuring the change in $x^2$.
+
+    ✅ Try some different values for the `h` parameter (both larger and smaller) and see if you can get more precise results.
+ """ + ) + return + + +@app.cell +def _(): + H = 1e-9 + + [square_num_prime(x, h=H) for x in [16.0,32.0,64.0,128.0,256.0,512.0,1024.0]] + return + + +@app.cell(hide_code=True) +def _(mo): + mo.md( + r""" + Numerical differentiation is straightforward and useful, but it can be imprecise when dealing with very large or very small values -- and analogous techniques for functions of multiple variables are far more complicated. Since machine learning and data processing algorithms often involve functions operating on multiple variables (or vectors of numbers), can involve very large or very small numbers, and may involve repeated calculations that would propagate imprecision, we want a more flexible and less limited technique for differentiation. Fortunately, JAX provides just this in the form of _automatic differentiation_. + + We'll use the `grad` function to automatically generate the derivative of `square`. + """ + ) + return + + +@app.cell +def _(jax): + # this is trivial but it works! + + square_prime = jax.grad(square) + square_prime(4.0) + return (square_prime,) + + +@app.cell(hide_code=True) +def _(mo): + mo.md(r"""Notice that `square_prime` returns an array.""") + return + + +@app.cell +def _(square_prime): + square_prime(4.0) + return + + +@app.cell(hide_code=True) +def _(mo): + mo.md(r"""... in fact, a _zero-dimensional_ JAX array. We can access the element with the `item()` function.""") + return + + +@app.cell +def _(square_prime): + [square_prime(x).item() for x in [16.0,32.0,64.0,128.0,256.0,512.0,1024.0]] + return + + +@app.cell(hide_code=True) +def _(mo): + mo.md( + r""" + We can use `square_prime` to implement [Newton's method](https://en.wikipedia.org/wiki/Newton%27s_method#Square_root) for the specific problem of approximating square roots. Basically the idea is that we'll start with an initial guess as our current guess and then repeatedly: + + 1. 
subtract our goal number (i.e., the number we want the square root for) from our current guess squared, + 2. divide that difference by the derivative of `square` at our current guess, and + 3. update our current guess by subtracting that quotient from it. + + After the third step, we'll compare the square of our guess to our goal number and stop if it's close enough or if we've gone a certain number of iterations. (You may have used a similar but less-efficient method of iteratively refining guesses for square roots on paper in a primary school arithmetic class!) + + We'll start with a function that updates a guess value given a guess and a goal: + """ + ) + return + + +@app.cell +def _(square_prime): + def guess_sqrt(guess, goal): + n = ((guess * guess) - goal) + d = square_prime(guess) + return guess - (n / d) + return (guess_sqrt,) + + +@app.cell(hide_code=True) +def _(mo): + mo.md(r"""We'll then build out the whole method, including a tolerance value (i.e., how close does `guess ** 2` need to be to `target` for us to accept it?) 
and a maximum number of iterations so we don't accidentally get into an infinite loop.""") + return + + +@app.cell +def _(guess_sqrt, np): + def newton_sqrt(initial_guess, target, tolerance=1e-4, max_iter=20): + guess = initial_guess + guesses = [initial_guess] + + while np.abs((guess * guess) - target) > tolerance and max_iter > 0: + guess = guess_sqrt(guess, target) + guesses.append(guess.item()) + max_iter = max_iter - 1 + + return guesses + return (newton_sqrt,) + + +@app.cell(hide_code=True) +def _(mo): + mo.md(r"""You can try this out on a few examples (with some unreasonable initial guesses).""") + return + + +@app.cell +def _(newton_sqrt): + newton_sqrt(3.0, 25.0) + return + + +@app.cell +def _(newton_sqrt): + newton_sqrt(12.0, 65536.0) + return + + +@app.cell +def _(newton_sqrt): + newton_sqrt(256.0, 123456789.0) + return + + +@app.cell(hide_code=True) +def _(mo): + mo.md( + r""" + ✅ Try some different values for initial guesses and targets and see if you can find some examples that take more or fewer iterations to come to an acceptable solution. + + Automatic differentiation is a cool technology, but — as you may object — differentiating $x^2$ isn't a particularly cool application. If this were the extent of our requirements, we could simply implement a few rules that inspected Python functions and replaced functions by their derivatives. If we didn't want to get our hands dirty, we could probably also use a library like `sympy` or hire a first-year undergraduate to perform our calculations for us. + + JAX isn't limited to trivial functions, though. Let's take a look at a more syntactically (and semantically) complex implementation of the square function. We'll call it `bogus_square` to emphasize that it is a contrived example that is meant to be difficult for JAX to deal with. 
+ """ + ) + return + + +@app.cell +def _(jnp): + def bogus_square(x): + result = 0 + for _ in range(int(jnp.floor(x))): + result = result + x + return result + x * (x - jnp.floor(x)) + return (bogus_square,) + + +@app.cell(hide_code=True) +def _(mo): + mo.md(r"""We can check our results on some examples:""") + return + + +@app.cell +def _(bogus_square): + square_examples = [1.4142135, 2.0, 4.0, 7.9372539, 8.0, 15.9687194226713, 16.0] + + [bogus_square(x) for x in square_examples] + return (square_examples,) + + +@app.cell(hide_code=True) +def _(mo): + mo.md(r"""We can also differentiate this function, just as we could with the simpler `square`:""") + return + + +@app.cell +def _(bogus_square, jax): + bogus_square_prime = jax.grad(bogus_square) + return (bogus_square_prime,) + + +@app.cell +def _(bogus_square_prime, square_examples): + [bogus_square_prime(x) for x in square_examples] + return + + +@app.cell +def _(bogus_square, bogus_square_prime): + def bogus_guess_sqrt(guess, goal): + n = (bogus_square(guess) - goal) + d = bogus_square_prime(guess) + return guess - (n / d) + return + + +@app.cell(hide_code=True) +def _(mo): + mo.md(r"""✅ Edit the next cell, adapting `newton_sqrt` to use `bogus_square` and `bogus_guess_sqrt`.""") + return + + +@app.function +def newton_bogus_sqrt(initial_guess, target, tolerance=1e-4, max_iter=20): + results = [initial_guess] + # exercise: fill in the body of this function + return results + + +@app.cell(hide_code=True) +def _(mo): + mo.md(r"""✅ Test your function out with a few examples. You may want to avoid finding square roots of larger numbers (we'll see why in a second).""") + return + + +@app.cell +def _(): + newton_bogus_sqrt(8.0, 72.0) + return + + +@app.cell(hide_code=True) +def _(mo): + mo.md( + r""" + It's impressive that JAX can differentiate `bogus_square`, but this doesn't mean that we're free to use pathological implementations in our code. 
The derivative of `bogus_square` is much slower to compute than that of `square`, which means that the performance of an iterative process that depends on computing this many times (like machine learning model training or even like approximating square roots) will suffer. + + ✅ Make sure that the two following cells have executed and then mouse over them to see how long they took to execute. + """ + ) + return + + +@app.cell +def _(square_prime): + for _ in range(10): + square_prime(10000.0) + return + + +@app.cell +def _(bogus_square_prime): + for _ in range(10): + bogus_square_prime(10000.0) + return + + +@app.cell(hide_code=True) +def _(mo): + mo.md( + r""" + ✅ Can you think of a _worse_ way to implement `square`? Were you able to stump `jax.grad` with it? + + Once you're done here, go on to [the next notebook, where we'll introduce self-organizing maps in numpy](./?file=numpy-som.py). + """ + ) + return + + +@app.cell +def _(): + import marimo as mo + return (mo,) + + +if __name__ == "__main__": + app.run() diff --git a/nvidia/jax/assets/numpy-som.py b/nvidia/jax/assets/numpy-som.py new file mode 100644 index 0000000..36e1d97 --- /dev/null +++ b/nvidia/jax/assets/numpy-som.py @@ -0,0 +1,363 @@ +# /// script +# requires-python = ">=3.12" +# dependencies = [ +# "marimo", +# "numpy==2.2.6", +# "opencv-python==4.12.0.88", +# "plotly==6.3.0", +# ] +# /// + +# +# SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +# + + +import marimo + +__generated_with = "0.15.0" +app = marimo.App() + + +@app.cell(hide_code=True) +def _(): + import marimo as mo + return (mo,) + + +@app.cell(hide_code=True) +def _(mo): + mo.output.append(mo.md( + r""" + # Self-organizing maps in numpy + + The [self-organizing map](https://en.wikipedia.org/wiki/Self-organizing_map) is a classic unsupervised learning technique for dimensionality reduction. It takes a set of high-dimensional training examples and produces a low-dimensional map, typically consisting of a grid or cube of high-dimensional vectors. + + Informally, the training algorithm proceeds by examining each training example, identifying the most similar node to the given example in the map and influencing the neighborhood of nodes around this one to become slightly more like the training example. The size of the influenced neighborhood and the amount of influence exerted both decrease over time. (There is also a batch algorithm, which calculates map weights directly from the sets of examples that mapped to a given neighborhood with a previous version of the map.) + + The following video shows an example of training a self-organizing map from color data using a batch algorithm, from a random initial map to a relatively converged set of colorful clusters. + """ + )) + mo.output.append(mo.image("batch-som.mp4")) + mo.output.append(mo.md( + r""" + + In this notebook, we'll see how to develop two implementations of this algorithm in numpy. **If you're primarily interested in acceleration, feel free to just skim this notebook without running it!** In future notebooks, you'll learn how to accelerate your implementation with JAX. 
The implementation we'll develop is a prototype — there are lots of things you might want to improve or change before using it in a real system — but it will show how we can accelerate a relatively realistic codebase realizing a more involved ML technique. + """ + )) + return + + +@app.cell +def _(): + import numpy as np + return (np,) + + +@app.cell(hide_code=True) +def _(mo): + mo.md(r"""There are several ways to initialize a self-organizing map. Here, we'll initialize our map with random vectors. The result of this function is a matrix with a row for every element in a self-organizing map; each row contains uniformly-sampled random numbers between 0 and 1.""") + return + + +@app.cell +def _(np): + def init_som(xdim, ydim, fdim, seed): + rng = np.random.default_rng(seed) + return rng.random(size=(xdim, ydim, fdim)).reshape(xdim * ydim, fdim) + return (init_som,) + + +@app.cell(hide_code=True) +def _(mo): + mo.md(r"""It's often helpful to visualize the results of our code. Here we'll plot an example random map of three-element vectors, interpreting each vector as a color by interpreting each feature value as a red, green, or blue intensity. Because our code will refer to maps with x- and y-coordinates but represent them as matrices (where we'd list rows first and then columns), we will often rearrange or transpose our data (with methods like `reshape` and `swapaxes` or `T`) before plotting it.""") + return + + +@app.cell +def _(init_som): + import plotly.express as px + import plotly.io as pio + pio.renderers.default='notebook' + + x_size = 192 + y_size = 108 + feature_dims = 3 + + random_map = init_som(x_size, y_size, feature_dims, 42) + + px.imshow(random_map.reshape(x_size, y_size, feature_dims).swapaxes(0,1)) + return px, x_size, y_size + + +@app.cell(hide_code=True) +def _(mo): + mo.md(r"""Our neighborhood function describes the part of the map influenced by a given sample. 
It takes two ranges (corresponding to every index along the x dimension and every index along the y dimension), the coordinates of the center of the neighborhood, and the radiuses of influence along each dimension.""") + return + + +@app.cell +def _(np): + def neighborhood(range_x, range_y, center_x, center_y, x_sigma, y_sigma): + x_distance = np.abs(center_x - range_x) + x_neighborhood = np.exp(- np.square(x_distance) / np.square(x_sigma)) + + y_distance = np.abs(center_y - range_y) + y_neighborhood = np.exp(- np.square(y_distance) / np.square(y_sigma)) + + return np.outer(x_neighborhood, y_neighborhood) + return (neighborhood,) + + +@app.cell(hide_code=True) +def _(mo): + mo.md(r"""We can see an example neighborhood in the heatmap below.""") + return + + +@app.cell +def _(neighborhood, np, px, x_size, y_size): + center_x = 12 + center_y = 48 + sigma_x = 96 + sigma_y = 54 + + px.imshow(neighborhood(np.arange(x_size), np.arange(y_size), center_x, center_y, sigma_x, sigma_y).T) + return + + +@app.cell(hide_code=True) +def _(mo): + mo.md(r"""Here's the basic online (i.e., one sample at a time) training algorithm.""") + return + + +@app.cell +def _(init_som, neighborhood, np): + def train_som_online(examples, xdim, ydim, x_sigma, y_sigma, max_iter, seed=None, frame_callback=None): + t = -1 + + exs = examples.copy() + fdim = exs.shape[-1] + + x_sigmas = np.linspace(x_sigma, max(2, x_sigma*.05), max_iter) + y_sigmas = np.linspace(y_sigma, max(2, y_sigma*.05), max_iter) + alphas = np.geomspace(0.35, 0.01, max_iter) + + range_x, range_y = np.arange(xdim), np.arange(ydim) + + hood = None + som = init_som(xdim, ydim, fdim, seed) + + rng = np.random.default_rng(seed) + while t < max_iter: + rng.shuffle(exs) + for ex in exs: + t = t + 1 + if t == max_iter: + break + + # best matching unit (by euclidean distance) + bmu_idx = np.argmin(((ex - som) ** 2).sum(axis = 1)) + + bmu = som[bmu_idx] + + center_x, center_y = np.divmod(bmu_idx, ydim) + + hood = neighborhood(range_x, 
range_y, center_x, center_y, x_sigmas[t], y_sigmas[t]).reshape(-1, 1) + + update = np.multiply(((ex - som) * alphas[t]), hood) + + frame_callback and frame_callback(t - 1, ex, hood, som) + np.add(som, update, som) + np.clip(som, 0, 1, som) + + frame_callback and frame_callback(t, ex, hood, som) + return som + return (train_som_online,) + + +@app.cell(hide_code=True) +def _(mo): + mo.md(r"""We'll now introduce a small class to track our history at each training epoch or example -- this is useful for debugging and visualizing an entire training process.""") + return + + +@app.class_definition +class HistoryCallback(object): + + def __init__(self, xdim, ydim, fdim, epoch_pred): + self.frames = dict() + self.meta = dict() + self.xdim = xdim + self.ydim = ydim + self.fdim = fdim + self.epoch_pred = epoch_pred + + def __call__(self, epoch, ex, hood, som, **meta): + if self.epoch_pred(epoch): + self.frames[epoch] = (ex, hood, som.copy()) + if meta is not None: + self.meta[epoch] = meta + + +@app.cell(hide_code=True) +def _(mo): + mo.md(r"""Here we'll train a small map on random color data, storing one history snapshot for every 20 examples.""") + return + + +@app.cell +def _(np, train_som_online): + fc = HistoryCallback(240,135,3, lambda x: x%20 == 0) + color_som =\ + train_som_online(np.random.random(size=(1000, 3)), + 240, 135, + 120, 70, + 50000, None, fc) + return (color_som,) + + +@app.cell(hide_code=True) +def _(mo): + mo.md( + r""" + ✅ Mouse over the above cell. How long did it take to execute? 
+ + Here's a little example function you can pass a history callback object to in order to save every updated map to a PNG file: + """ + ) + return + + +@app.cell +def _(np): + import cv2 + + def plot_history(fc, plot_prefix="plot"): + for k in fc.frames.keys(): + ex, hood, som = fc.frames[k] + + # convert image to BGR data to save with opencv + brg = (np.roll(som, 1, axis=1) * 255).astype("uint8") + brg = brg.reshape(fc.xdim, fc.ydim, fc.fdim).swapaxes(0, 1).reshape(fc.ydim, fc.xdim, fc.fdim) + cv2.imwrite(f"{plot_prefix}-{k:06}.png", brg) + return + + +@app.cell(hide_code=True) +def _(mo): + mo.md(r"""We can also plot just the final map.""") + return + + +@app.cell +def _(color_som, px): + px.imshow(color_som.reshape(240,135,3).swapaxes(0,1)).show() + return + + +@app.cell(hide_code=True) +def _(mo): + mo.md(r"""Finally, let's consider the batch variant of the algorithm.""") + return + + +@app.cell +def _(init_som, neighborhood, np): + def train_som_batch(examples, xdim, ydim, x_sigma, y_sigma, epochs, min_sigma_frac=0.1, seed=None, frame_callback=None): + t = 0 + + exs = examples.copy() + fdim = exs.shape[-1] + + x_sigmas = np.geomspace(x_sigma, max(2, xdim*min_sigma_frac), epochs) + y_sigmas = np.geomspace(y_sigma, max(2, ydim*min_sigma_frac), epochs) + + range_x, range_y = np.arange(xdim), np.arange(ydim) + + hood = None + som = init_som(xdim, ydim, fdim, seed) + + rng = np.random.default_rng(seed) + while t < epochs: + hoods = np.zeros((xdim*ydim, xdim * ydim, 1)) + hoods_accum = np.zeros((xdim * ydim, 1)) + updates = np.zeros((len(examples), xdim*ydim, fdim)) + for (i, ex) in enumerate(exs): + + # best matching unit (euclidean distance) + + bmu_idx = np.argmin(((ex - som) ** 2).sum(axis = 1)) + + bmu = som[bmu_idx] + + # cache the neighborhood for this unit (if we haven't seen it yet) + if np.max(hoods[bmu_idx]) == 0: + center_x, center_y = np.divmod(bmu_idx, ydim) + hoods[bmu_idx] = neighborhood(range_x, range_y, center_x, center_y, x_sigmas[t], 
y_sigmas[t]).reshape(-1, 1) + + hood = hoods[bmu_idx] + hoods_accum = hoods_accum + hood + updates[i] = ex * hood + + frame_callback and frame_callback(t, ex, hood, som) + + som = np.divide(np.sum(updates, axis=0), hoods_accum + 1e-8) + t = t + 1 + + frame_callback and frame_callback(t, ex, hood, som) + return som + return (train_som_batch,) + + +@app.cell +def _(np, train_som_batch): + bfc = HistoryCallback(240,135,3, lambda x: True) + color_som_batch =\ + train_som_batch(np.random.random(size=(1000, 3)), + 240, 135, + 120, 70, + 50, min_sigma_frac=.2, seed=None, frame_callback=bfc) + return (color_som_batch,) + + +@app.cell(hide_code=True) +def _(mo): + mo.md( + r""" + ✅ Mouse over the above cell. How long did it take to execute? + + ✅ You can inspect the map in the next cell. The batch algorithm will almost certainly have a different result than the online algorithm. Does it matter? Are there qualitative differences between the mappings? How would you quantify these differences, and for which applications would they be relevant? + """ + ) + return + + +@app.cell +def _(color_som_batch, px): + px.imshow(color_som_batch.reshape(240,135,3).swapaxes(0,1)).show() + return + + +@app.cell(hide_code=True) +def _(mo): + mo.md(r"""Once you're ready to move on, click through to [the final notebook](./?file=som-jax.py).""") + return + + +if __name__ == "__main__": + app.run() diff --git a/nvidia/jax/assets/som-jax.py b/nvidia/jax/assets/som-jax.py new file mode 100644 index 0000000..9feb23e --- /dev/null +++ b/nvidia/jax/assets/som-jax.py @@ -0,0 +1,448 @@ +# /// script +# requires-python = ">=3.12" +# dependencies = [ +# "jax==0.7.2", +# "marimo", +# "numpy==2.2.6", +# "plotly==6.3.0", +# "tqdm==4.67.1", +# ] +# /// + + +# +# SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import marimo + +__generated_with = "0.15.0" +app = marimo.App() + + +@app.cell(hide_code=True) +def _(mo): + mo.md( + r""" + # Self-organizing maps in JAX + + In this notebook, we'll develop implementations of online and batch self-organizing map training in JAX, refining each as we go to get better performance. We'll start with the easiest option: simply using JAX as a drop-in replacement for numpy. + + ## Accelerating NumPy functions with JAX + """ + ) + return + + +@app.cell +def _(): + import jax + import jax.numpy as jnp + import numpy as np + return jax, jnp, np + + +@app.cell(hide_code=True) +def _(mo): + mo.md( + r""" + Recall that we're initializing our map with random vectors. The result of this function is a matrix with a row for every element in a self-organizing map; each row contains uniformly-sampled random numbers between 0 and 1. + + Because JAX uses a purely functional approach to random number generation, we'll need to rewrite this code from the numpy implementation -- instead of using a stateful generator like numpy's `Generator` or `RandomState`, we'll create a `PRNGKey` object and pass that to `jax.random.uniform`. (For this example, we're not doing anything with the key — for a real application, we'd want to _split_ it so we could get the next number in the seeded sequence.) 
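For instance, a minimal sketch of what splitting looks like in a loop -- each draw consumes a fresh subkey, and the whole sequence is reproducible from the seed:

```python
import jax

key = jax.random.PRNGKey(0)

# Each draw consumes a fresh subkey; the carried key is only ever split again.
draws = []
for _ in range(3):
    key, subkey = jax.random.split(key)
    draws.append(float(jax.random.uniform(subkey)))

# The sequence is reproducible from the seed alone.
key2, sub2 = jax.random.split(jax.random.PRNGKey(0))
print(float(jax.random.uniform(sub2)) == draws[0])  # → True
```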
+ """ + ) + return + + +@app.cell +def _(jax, jnp): + def init_som(xdim, ydim, fdim, seed): + key = jax.random.PRNGKey(seed) + return jnp.array(jax.random.uniform(key, shape=(xdim * ydim * fdim,)).reshape(xdim * ydim, fdim)) + return (init_som,) + + +@app.cell(hide_code=True) +def _(mo): + mo.md(r"""We can see that JAX is not returning a numpy array:""") + return + + +@app.cell +def _(init_som): + x_size = 192 + y_size = 108 + feature_dims = 3 + + random_map = init_som(x_size, y_size, feature_dims, 42) + type(random_map) + return feature_dims, random_map, x_size, y_size + + +@app.cell(hide_code=True) +def _(mo): + mo.md(r"""...and we should be able to see that this array is stored in GPU memory (if we're actually running on a GPU).""") + return + + +@app.cell +def _(random_map): + random_map.device + return + + +@app.cell(hide_code=True) +def _(mo): + mo.md(r"""As before, you can visualize the result if you want — JAX will transfer arrays directly to device memory when needed by plotting libraries.""") + return + + +@app.cell +def _(feature_dims, random_map, x_size, y_size): + import plotly.express as px + import plotly.io as pio + pio.renderers.default='notebook' + + px.imshow(random_map.reshape(x_size, y_size, feature_dims).swapaxes(0,1)) + return (px,) + + +@app.cell(hide_code=True) +def _(mo): + mo.md(r"""Our neighborhood function is very similar to the numpy implementation; the only difference is that we need to change `np` to `jnp`.""") + return + + +@app.cell +def _(jnp): + def neighborhood(range_x, range_y, center_x, center_y, x_sigma, y_sigma): + x_distance = jnp.abs(center_x - range_x) + x_neighborhood = jnp.exp(- jnp.square(x_distance) / jnp.square(x_sigma)) + + y_distance = jnp.abs(center_y - range_y) + y_neighborhood = jnp.exp(- jnp.square(y_distance) / jnp.square(y_sigma)) + + return jnp.outer(x_neighborhood, y_neighborhood) + return (neighborhood,) + + +@app.cell(hide_code=True) +def _(mo): + mo.md(r"""Plotting results is a good way to make sure that 
they look the way we expect them to.""") + return + + +@app.cell +def _(neighborhood, np, px, x_size, y_size): + center_x = 12 + center_y = 48 + sigma_x = 96 + sigma_y = 54 + + px.imshow(neighborhood(np.arange(x_size), np.arange(y_size), center_x, center_y, sigma_x, sigma_y).T) + return center_x, center_y + + +@app.cell(hide_code=True) +def _(mo): + mo.md( + r""" + We're now ready to see the basic online (i.e., one sample at a time) training algorithm. Most of it is unchanged from the numpy implementation, with a few key differences: + + 1. The first differences are related to how we shuffle the example array. Because we aren't using a stateful random number generator, we'll need to split the random state key into two parts (one representing the key for the very next generation and one representing the key for the rest of the stream). We'll declare a little helper function that splits the key, shuffles the array, and returns both the key and the shuffled array. + 2. The second difference relates to how JAX handles arrays. In JAX, arrays offer an _immutable_ interface: instead of changing an array directly, JAX's API lets you make a copy of the array with a change. (In practice, this does not always mean the array is actually copied!) This impacts our code because the numpy version used some functions with output parameters, which indicate where to write the return value (rather than merely returning a new array). So, instead of `np.add(a, b, a)`, we'd do `a = jnp.add(a, b)`. 
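A minimal sketch of this functional style (including the `.at` interface JAX provides for indexed updates, which we don't need below but which you'll see throughout JAX code):

```python
import jax.numpy as jnp

a = jnp.zeros(3)
b = jnp.ones(3)

# There is no in-place np.add(a, b, a) in JAX; rebind the name to the new array.
a = jnp.add(a, b)

# Indexed updates go through the functional .at interface:
a = a.at[0].set(5.0)

print(a.tolist())  # → [5.0, 1.0, 1.0]
```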
+ """ + ) + return + + +@app.cell +def _(init_som, jax, jnp, neighborhood): + def shuffle(key, examples): + key, nextkey = jax.random.split(key) + examples = jax.random.permutation(nextkey, examples) + return (key, examples) + + def train_som_online(examples, xdim, ydim, x_sigma, y_sigma, max_iter, seed=42, frame_callback=None): + t = -1 + exs = examples.copy() + fdim = exs.shape[-1] + x_sigmas = jnp.linspace(x_sigma, max(5, x_sigma * 0.15), max_iter) + y_sigmas = jnp.linspace(y_sigma, max(5, y_sigma * 0.15), max_iter) + alphas = jnp.geomspace(0.35, 0.01, max_iter) + range_x, range_y = (jnp.arange(xdim), jnp.arange(ydim)) + hood = None + som = init_som(xdim, ydim, fdim, seed) + key = jax.random.PRNGKey(seed) + while t < max_iter: + key, exs = shuffle(key, exs) + for ex in exs: + t = t + 1 + if t == max_iter: + break + bmu_idx = jnp.argmin(jnp.linalg.norm(ex - som, axis=1)) + bmu = som[bmu_idx] + center_x = bmu_idx // ydim + center_y = bmu_idx % ydim + hood = neighborhood(range_x, range_y, center_x, center_y, x_sigmas[t], y_sigmas[t]).reshape(-1, 1) + update = jnp.multiply((ex - som) * alphas[t], hood) + frame_callback and frame_callback(t - 1, ex, hood, som) + som = jnp.add(som, update) + som = jnp.clip(som, 0, 1) + frame_callback and frame_callback(t, ex, hood, som) + return som + return shuffle, train_som_online + + +@app.cell(hide_code=True) +def _(mo): + mo.md(r"""As [in our NumPy version](./?file=numpy-som.py), we'll use a history callback class to track our progress.""") + return + + +@app.class_definition +class HistoryCallback(object): + + def __init__(self, xdim, ydim, fdim, epoch_pred): + self.frames = dict() + self.meta = dict() + self.xdim = xdim + self.ydim = ydim + self.fdim = fdim + self.epoch_pred = epoch_pred + + def __call__(self, epoch, ex, hood, som, **meta): + if self.epoch_pred(epoch): + self.frames[epoch] = (ex, hood, som) + if meta is not None: + self.meta[epoch] = meta + + +@app.cell(hide_code=True) +def _(mo): + mo.md(r"""Here we'll train 
a small map on random color data, storing one history snapshot for every 20 examples.""") + return + + +@app.cell +def _(jax, train_som_online): + fc = HistoryCallback(240, 135, 3, lambda x: x % 20 == 0) + examples = jax.random.uniform(jax.random.PRNGKey(42), shape=(1000, 3)) + color_som = train_som_online(examples, 240, 135, 120, 70, 50000, 42, fc) + return color_som, examples + + +@app.cell +def _(color_som, px): + px.imshow(color_som.reshape(240,135,3).swapaxes(0,1)).show() + return + + +@app.cell(hide_code=True) +def _(mo): + mo.md( + r""" + ✅ Mouse over the above cell. How long did it take to execute? + + Depending on your computer, this may have actually been slower than the numpy version! Let's try using JAX's _just-in-time_ compilation to improve our performance. We'll make just-in-time compiled versions of our `neighborhood` and `shuffle` functions (as well as of the inner part of the training loop). We'll also add a progress bar. + """ + ) + return + + +@app.cell +def _(center_x, center_y, init_som, jax, jnp, neighborhood, shuffle): + import tqdm + jit_neighborhood = jax.jit(neighborhood) + jit_shuffle = jax.jit(shuffle) + + @jax.jit + def som_step(ex, som, xdim, ydim, range_x, range_y, center_x, center_y, x_sigma, y_sigma, alpha): + bmu_idx = jnp.argmin(jnp.linalg.norm(ex - som, axis=1)) + bmu = som[bmu_idx] + center_x, center_y = jnp.divmod(bmu_idx, ydim) + hood = jit_neighborhood(range_x, range_y, center_x, center_y, x_sigma, y_sigma).reshape(-1, 1) + update = jnp.multiply((ex - som) * alpha, hood) + return jnp.clip(jnp.add(som, update), 0, 1) + + def train_som_online2(examples, xdim, ydim, x_sigma, y_sigma, max_iter, seed=42, frame_callback=None): + t = 0 + exs = examples.copy() + fdim = exs.shape[-1] + x_sigmas = jnp.linspace(x_sigma, max(5, x_sigma * 0.2), max_iter) + y_sigmas = jnp.linspace(y_sigma, max(5, y_sigma * 0.2), max_iter) + alphas = jnp.geomspace(0.35, 0.01, max_iter) + range_x, range_y = (jnp.arange(xdim), jnp.arange(ydim)) + hood = None 
+ som = init_som(xdim, ydim, fdim, seed) + key = jax.random.PRNGKey(seed) + with tqdm.tqdm(total=max_iter) as progress: + while t < max_iter: + key, exs = jit_shuffle(key, exs) + for ex in exs: + t = t + 1 + progress.update(1) + if t == max_iter: + break + som = som_step(ex, som, xdim, ydim, range_x, range_y, center_x, center_y, x_sigmas[t], y_sigmas[t], alphas[t]) + frame_callback and frame_callback(t, ex, hood, som) + return som + return jit_neighborhood, tqdm, train_som_online2 + + +@app.cell +def _(jax, train_som_online2): + _fc = HistoryCallback(240, 135, 3, lambda x: x % 20 == 0) + _examples = jax.random.uniform(jax.random.PRNGKey(42), shape=(1000, 3)) + color_som_1 = train_som_online2(_examples, 240, 135, 120, 70, 50000, 42, _fc) + return (color_som_1,) + + +@app.cell(hide_code=True) +def _(mo): + mo.md( + r""" + ✅ Mouse over the above cell. How long did it take to execute? + + Let's check our final map to make sure it looks somewhat reasonable. + """ + ) + return + + +@app.cell +def _(color_som_1, px): + px.imshow(color_som_1.reshape(240, 135, 3).swapaxes(0, 1)).show() + return + + +@app.cell(hide_code=True) +def _(mo): + mo.md( + r""" + ✅ One challenging aspect of the online algorithm is its sensitivity to hyperparameter settings: + + * Try running the code again with some different values for `x_sigma` and `y_sigma` and see how your results change! (Consider a minimum size for this value based on the size of the map and the number of training examples.) + * The `alphas` variable (which we didn't expose as a parameter) indicates how much of an effect each example has on the map. We've set it to `jnp.geomspace(0.35, 0.01, max_iter)`; try some different values and see if you get better or worse results! + + Let's now consider the batch variant of the algorithm. It can be much faster, can be implemented in parallel (or even on a cluster) and is less sensitive to hyperparameter settings. 
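If you haven't used `jax.vmap` before, here's a minimal sketch of what it does to a per-example function (toy sizes, unrelated to the map above):

```python
import jax
import jax.numpy as jnp

# A function written for a single example...
def sq_dists(ex, som):
    return ((ex - som) ** 2).sum(axis=1)

som = jnp.zeros((6, 3))
batch = jnp.ones((4, 3))

# ...becomes a batched function: map over axis 0 of `ex`, share `som`.
batched = jax.vmap(sq_dists, in_axes=(0, None))
print(batched(batch, som).shape)  # → (4, 6)
```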
In order to exploit additional parallelism, we're going to use `jax.vmap` to calculate weight updates for each training example in parallel. This should result in a dramatic performance improvement. + """ + ) + return + + +@app.cell +def _(init_som, jax, jit_neighborhood, jnp, tqdm): + from functools import partial + + @partial(jax.vmap, in_axes=(0, None, None, None, None, None, None, None), out_axes=0) + def batch_step(ex, som, range_x, range_y, xdim, ydim, x_sigma, y_sigma): + bmu_idx = jnp.argmin(((ex - som) ** 2).sum(axis=1)) + bmu = som[bmu_idx] + center_x, center_y = jnp.divmod(bmu_idx, ydim) + hood = jit_neighborhood(range_x, range_y, center_x, center_y, x_sigma, y_sigma).reshape(-1, 1) + return (ex * hood, hood) + + def train_som_batch(examples, xdim, ydim, x_sigma, y_sigma, epochs, min_sigma_frac=0.1, seed=None, frame_callback=None): + t = 0 + exs = examples.copy() + fdim = exs.shape[-1] + x_sigmas = jnp.linspace(x_sigma, max(2, xdim * min_sigma_frac), epochs) + y_sigmas = jnp.linspace(y_sigma, max(2, ydim * min_sigma_frac), epochs) + range_x, range_y = (jnp.arange(xdim), jnp.arange(ydim)) + hood = None + som = init_som(xdim, ydim, fdim, seed) + for t in tqdm.trange(epochs): + updates, hoods = batch_step(examples, som, range_x, range_y, xdim, ydim, x_sigmas[t], y_sigmas[t]) + frame_callback and frame_callback(t, None, hood, som) + som = jnp.divide(jnp.sum(updates, axis=0), jnp.sum(hoods, axis=0).reshape(-1, 1) + 1e-10) + frame_callback and frame_callback(t, None, hood, som) + return som + return (train_som_batch,) + + +@app.cell +def _(examples, train_som_batch): + bfc = HistoryCallback(240, 135, 3, lambda x: True) + color_som_batch = train_som_batch(examples, 240, 135, 120, 70, 50, min_sigma_frac=0.25, seed=42, frame_callback=bfc) + return (color_som_batch,) + + +@app.cell(hide_code=True) +def _(mo): + mo.md( + r""" + ✅ Mouse over the above cell. How long did it take to execute? How does this compare to our other implementations? 
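If you'd rather measure than mouse over, remember that JAX dispatches work asynchronously: call `block_until_ready()` on a result before stopping the clock, or you'll time only the dispatch. A sketch:

```python
import time
import jax
import jax.numpy as jnp

@jax.jit
def work(x):
    return (x @ x.T).sum()

x = jnp.ones((500, 500))
work(x).block_until_ready()  # warm-up call so we don't time compilation

start = time.perf_counter()
work(x).block_until_ready()  # block, or we'd only time the async dispatch
elapsed = time.perf_counter() - start
print(elapsed > 0.0)  # → True
```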
+
+        We have only optimized the batch step here (i.e., we're calculating the best matching unit and map updates for many examples in parallel and then summing these all at once). There are more opportunities to optimize this code with JAX, but we haven't exploited them in order to make it possible to use code that has side effects within `train_som_batch` -- in particular, we're
+
+        1. using the `HistoryCallback` buffer so we could debug our implementation if necessary (or render a movie of training), and
+        2. using `tqdm` for a nice progress bar.
+
+        ✅ Try removing `tqdm` and `HistoryCallback` and then JIT-compiling `train_som_batch`. Does the performance improve? How much?
+
+        ✅ Using JAX looping constructs instead of Python looping (e.g., `for t in tqdm.trange(epochs):`) may enable further optimizations and performance improvements. Try rewriting `train_som_batch` to use JAX's `lax.fori_loop` (use `help` or see the JAX documentation for details). How does the performance change?
+        """
+    )
+    return
+
+
+@app.cell
+def _(color_som_batch, px):
+    px.imshow(color_som_batch.reshape(240,135,3).swapaxes(0,1)).show()
+    return
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(
+        r"""
+        ✅ What would you need to do to rewrite the SOM training to _not_ use `jax.vmap`? (You don't have to actually implement this unless you're interested in a puzzle!)
+
+        ✅ Modify `batch_step` to use an alternate distance metric. This will involve modifying the following line of code:
+
+        ```bmu_idx = jnp.argmin(((ex - som) ** 2).sum(axis=1))```
+
+        so that you're taking the `argmin` (or `argmax`, if you're looking for similarity!) of a different function over each entry in the map and the current example. 
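For instance, a sketch of the same BMU lookup using Manhattan (L1) distance (toy sizes, not the notebook's map):

```python
import jax.numpy as jnp
import numpy as np

som = jnp.asarray(np.random.default_rng(0).random(size=(16, 3)))
ex = jnp.asarray(np.random.default_rng(1).random(size=(3,)))

# Manhattan (L1) distance to every map entry; argmin still picks the BMU.
bmu_idx = jnp.argmin(jnp.abs(ex - som).sum(axis=1))
print(0 <= int(bmu_idx) < 16)  # → True
```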
If you don't have a favorite distance or similarity measure, a common example is cosine similarity, which you can calculate for two vectors by dividing their dot product by the product of their magnitudes, like this: + """ + ) + return + + +@app.cell +def _(jnp, np): + example_som = np.random.random(size=(16, 3)) + example_vec = np.random.random(size=(3,)) + + jnp.divide(jnp.dot(example_som, example_vec), (jnp.linalg.norm(example_som, axis=1) * jnp.linalg.norm(example_vec))) + return + + +@app.cell(hide_code=True) +def _(mo): + mo.md(r"""✅ If you implemented cosine similarity, what change did you notice to the performance of batch training? What changes could you make to `train_som_batch` to improve performance?""") + return + + +@app.cell +def _(): + import marimo as mo + return (mo,) + + +if __name__ == "__main__": + app.run() diff --git a/nvidia/llama-factory/README.md b/nvidia/llama-factory/README.md new file mode 100644 index 0000000..37867a6 --- /dev/null +++ b/nvidia/llama-factory/README.md @@ -0,0 +1,236 @@ +# Llama Factory + +> Install and fine-tune models with LLama Factory + +## Table of Contents + +- [Overview](#overview) + - [What you'll accomplish](#what-youll-accomplish) + - [What to know before starting](#what-to-know-before-starting) + - [Prerequisites](#prerequisites) + - [Ancillary files](#ancillary-files) + - [Time & risk](#time-risk) +- [Instructions](#instructions) + - [Step 1. Verify system prerequisites](#step-1-verify-system-prerequisites) + - [Step 2. Launch PyTorch container with GPU support](#step-2-launch-pytorch-container-with-gpu-support) + - [Step 3. Clone LLaMA Factory repository](#step-3-clone-llama-factory-repository) + - [Step 4. Install LLaMA Factory with dependencies](#step-4-install-llama-factory-with-dependencies) + - [Step 5. Configure PyTorch for CUDA 12.9 (if needed)](#step-5-configure-pytorch-for-cuda-129-if-needed) + - [Step 6. Prepare training configuration](#step-6-prepare-training-configuration) + - [Step 7. 
Launch fine-tuning training](#step-7-launch-fine-tuning-training) + - [Step 8. Validate training completion](#step-8-validate-training-completion) + - [Step 9. Test inference with fine-tuned model](#step-9-test-inference-with-fine-tuned-model) + - [Step 10. Troubleshooting](#step-10-troubleshooting) + - [Step 11. Cleanup and rollback](#step-11-cleanup-and-rollback) + - [Step 12. Next steps](#step-12-next-steps) + +--- + +## Overview + +### What you'll accomplish + +You'll set up LLaMA Factory on NVIDIA Spark with Blackwell architecture to fine-tune large +language models using LoRA, QLoRA, and full fine-tuning methods. This enables efficient +model adaptation for specialized domains while leveraging hardware-specific optimizations. + +### What to know before starting + +- Basic Python knowledge for editing config files and troubleshooting +- Command line usage for running shell commands and managing environments +- Familiarity with PyTorch and Hugging Face Transformers ecosystem +- GPU environment setup including CUDA/cuDNN installation and VRAM management +- Fine-tuning concepts: understanding tradeoffs between LoRA, QLoRA, and full fine-tuning +- Dataset preparation: formatting text data into JSON structure for instruction tuning +- Resource management: adjusting batch size and memory settings for GPU constraints + +### Prerequisites + +- NVIDIA Spark device with Blackwell architecture + +- CUDA 12.9 or newer version installed: `nvcc --version` + +- Docker installed and configured for GPU access: `docker run --gpus all nvidia/cuda:12.9-devel nvidia-smi` + +- Git installed: `git --version` + +- Python environment with pip: `python --version && pip --version` + +- Sufficient storage space (>50GB for models and checkpoints): `df -h` + +- Internet connection for downloading models from Hugging Face Hub + +### Ancillary files + +- Official LLaMA Factory repository: https://github.com/hiyouga/LLaMA-Factory + +- NVIDIA PyTorch container: 
https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch + +- Example training configuration: `examples/train_lora/llama3_lora_sft.yaml` (from repository) + +- Documentation: https://llamafactory.readthedocs.io/en/latest/getting_started/data_preparation.html + +### Time & risk + +**Duration:** 30-60 minutes for initial setup, 1-7 hours for training depending on model size +and dataset. + +**Risks:** Model downloads require significant bandwidth and storage. Training may consume +substantial GPU memory and require parameter tuning for hardware constraints. + +**Rollback:** Remove Docker containers and cloned repositories. Training checkpoints are +saved locally and can be deleted to reclaim storage space. + +## Instructions + +### Step 1. Verify system prerequisites + +Check that your NVIDIA Spark system has the required components installed and accessible. + +```bash +nvcc --version +docker --version +nvidia-smi +python --version +git --version +``` + +### Step 2. Launch PyTorch container with GPU support + +Start the NVIDIA PyTorch container with GPU access and mount your workspace directory. +> **Note:** This NVIDIA PyTorch container supports CUDA 13 + +```bash +docker run --gpus all --ipc=host --ulimit memlock=-1 -it --ulimit stack=67108864 --rm -v "$PWD":/workspace nvcr.io/nvidia/pytorch:25.08-py3 bash +``` + +### Step 3. Clone LLaMA Factory repository + +Download the LLaMA Factory source code from the official repository. + +```bash +git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git +cd LLaMA-Factory +``` + +### Step 4. Install LLaMA Factory with dependencies + +Install the package in editable mode with metrics support for training evaluation. + +```bash +pip install -e ".[metrics]" +``` + +### Step 5. Configure PyTorch for CUDA 12.9 (if needed) + +#### If using standalone Python (skip if using Docker container) + +In a python virtual environment, uninstall existing PyTorch and reinstall with CUDA 12.9 support for ARM64 architecture. 
+
+```bash
+pip uninstall torch torchvision torchaudio
+pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu129
+```
+
+#### If using Docker container
+
+PyTorch is pre-installed with CUDA support. Verify installation:
+
+```bash
+python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"
+```
+
+### Step 6. Prepare training configuration
+
+Examine the provided LoRA fine-tuning configuration for Llama-3.
+
+```bash
+cat examples/train_lora/llama3_lora_sft.yaml
+```
+
+### Step 7. Launch fine-tuning training
+
+> **Note:** Log in to the Hugging Face Hub first if the model you want to download is gated.
+
+Execute the training process using the pre-configured LoRA setup.
+
+```bash
+huggingface-cli login # if the model is gated
+llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
+```
+
+Example output:
+```bash
+***** train metrics *****
+  epoch                    =        3.0
+  total_flos               = 22851591GF
+  train_loss               =     0.9113
+  train_runtime            = 0:22:21.99
+  train_samples_per_second =      2.437
+  train_steps_per_second   =      0.306
+Figure saved at: saves/llama3-8b/lora/sft/training_loss.png
+```
+
+### Step 8. Validate training completion
+
+Verify that training completed successfully and checkpoints were saved.
+
+```bash
+ls -la saves/llama3-8b/lora/sft/
+file saves/llama3-8b/lora/sft/training_loss.png
+```
+
+Expected output should show:
+- Final checkpoint directory (`checkpoint-21` or similar)
+- Model configuration files (`config.json`, `adapter_config.json`)
+- Training metrics showing decreasing loss values
+- Training loss plot saved as PNG file
+
+### Step 9. Test inference with fine-tuned model
+
+Run a simple inference test to verify the fine-tuned model loads correctly.
+
+```bash
+llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
+```
+
+### Step 10.
Troubleshooting + +| Symptom | Cause | Fix | +|---------|--------|-----| +| CUDA out of memory during training | Batch size too large for GPU VRAM | Reduce `per_device_train_batch_size` or increase `gradient_accumulation_steps` | +| Model download fails or is slow | Network connectivity or Hugging Face Hub issues | Check internet connection, try using `HF_HUB_OFFLINE=1` for cached models | +| Training loss not decreasing | Learning rate too high/low or insufficient data | Adjust `learning_rate` parameter or check dataset quality | + +### Step 11. Cleanup and rollback + +> **Warning:** This will delete all training progress and checkpoints. + +To remove all generated files and free up storage space: + +```bash +cd /workspace +rm -rf LLaMA-Factory/ +docker system prune -f +``` + +To rollback Docker container changes: +```bash +exit # Exit container +docker container prune -f +``` + +### Step 12. Next steps + +Test your fine-tuned model with custom prompts: + +```bash +llamafactory-cli chat examples/inference/llama3_lora_sft.yaml +## Type: "Hello, how can you help me today?" +## Expect: Response showing fine-tuned behavior +``` + +For production deployment, export your model: +```bash +llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml +``` diff --git a/nvidia/monai-reasoning/README.md b/nvidia/monai-reasoning/README.md new file mode 100644 index 0000000..9280c50 --- /dev/null +++ b/nvidia/monai-reasoning/README.md @@ -0,0 +1,294 @@ +# MONAI-Reasoning-CXR-3B Model + +> Work with a MONAI vision-language model through Open WebUI + +## Table of Contents + +- [Overview](#overview) +- [Instructions](#instructions) + +--- + +## Overview + +## Basic idea + +The MONAI Reasoning CXR 3B model is a **medical AI model** designed for **chest X-ray (CXR) interpretation** with reasoning capabilities. It combines imaging analysis with large-scale language modeling: + +- **Medical focus**: Built within the MONAI framework for healthcare imaging tasks. 
+- **Vision + language**: Takes CXR images as input and produces diagnostic text or reasoning outputs. +- **Reasoning layer**: Goes beyond simple classification to explain intermediate steps (e.g., opacity → pneumonia suspicion). +- **3B scale**: A moderately large multimodal model (~3 billion parameters). +- **Trust and explainability**: Aims to make results more interpretable and clinically useful. + +## What you'll accomplish + +You'll deploy the MONAI-Reasoning-CXR-3B model, a specialized vision-language model for chest X-ray +analysis, on an NVIDIA Spark device with Blackwell GPU architecture. By the end of this +walkthrough, you will have a complete system running with VLLM serving the model for +high-performance inference and Open WebUI providing an easy-to-use interface for interacting +with the model. This setup is ideal for clinical demonstrations and research that requires +transparent AI reasoning. + +## What to know before starting + +* Experience with the Linux command line and shell scripting +* A basic understanding of Docker, including running containers and managing images +* Familiarity with Python and using pip for package management +* Knowledge of Large Language Models (LLMs) and how to interact with API endpoints +* Basic understanding of NVIDIA GPU hardware and CUDA drivers + +## Prerequisites + +**Hardware Requirements:** +* NVIDIA Spark device with ARM64 (AArch64) architecture +* NVIDIA Blackwell GPU architecture +* At least 24GB of GPU VRAM + +**Software Requirements:** + +* **NVIDIA Driver**: Ensure the driver is installed and the GPU is recognized +```bash +nvidia-smi +``` + +* **Docker Engine**: Docker must be installed and the daemon running +```bash +docker --version +``` + +* **NVIDIA Container Toolkit**: Required for GPU access in containers +```bash +docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi +``` + +* **Hugging Face CLI**: You'll need this to download the model +```bash +pip install -U huggingface_hub 
+huggingface-cli whoami +``` + +* **System Architecture**: Verify your system architecture for proper container selection +```bash +uname -m +## Should output: aarch64 for ARM64 systems like NVIDIA Spark +``` + +## Time & risk + +* **Estimated time:** 20-35 minutes (not including model download) +* **Risk level:** Low. All steps use publicly available containers and models +* **Rollback:** The entire deployment is containerized. To roll back, you can simply stop +and remove the Docker containers + +## Instructions + +## Step 1. Create the Project Directory + +First, create a dedicated directory to store your model weights and configuration files. This +keeps the project organized and provides a clean workspace. + +```bash +## Create the main directory +mkdir -p ~/monai-reasoning-spark +cd ~/monai-reasoning-spark + +## Create a subdirectory for the model +mkdir -p models +``` + +## Step 2. Download the MONAI-Reasoning-CXR-3B Model + +Use the Hugging Face CLI to download the model weights into the directory you just created. +The model is approximately 6GB and will take several minutes to download depending on your +internet connection. + +```bash +huggingface-cli download monai/monai-reasoning-cxr-3b \ +--local-dir ./models/monai-reasoning-cxr-3b \ +--local-dir-use-symlinks False +``` + +**Verification Step:** +```bash +ls -la ./models/monai-reasoning-cxr-3b +## You should see model files including config.json and model weights +``` + +> **Important Note:** Currently, a custom internal VLLM container is required until the sm121 support is available in the public image. The instructions below use the internal container `******:5005/dl/dgx/vllm:main-py3.31165712-devel`. + +## Step 3. Verify System Architecture + +Before proceeding, confirm your system architecture is ARM64 for proper container selection +on your NVIDIA Spark device: + +```bash +## Check your system architecture +uname -m +## Should output: aarch64 for ARM64 systems like NVIDIA Spark +``` + +## Step 4. 
Create a Docker Network + +Create a dedicated Docker bridge network to allow the VLLM and Open WebUI containers to +communicate with each other easily and reliably. + +```bash +docker network create monai-net +``` + +## Step 5. Deploy the VLLM Server + +Launch the VLLM container with ARM64 architecture support, attaching it to the network you +created and mounting your local model directory. This step configures the server for optimal +performance on NVIDIA Spark hardware. + +```bash +## Stop and remove existing container if running +docker stop vllm-server 2>/dev/null || true +docker rm vllm-server 2>/dev/null || true + +## Run the VLLM server with internal container +docker run --rm -d \ +--name vllm-server \ +--gpus all \ +--ipc=host \ +--ulimit memlock=-1 \ +--ulimit stack=67108864 \ +--network monai-net \ +--platform linux/arm64 \ +-v ./models/monai-reasoning-cxr-3b:/model \ +-p 8000:8000 \ +******:5005/dl/dgx/vllm:main-py3.31165712-devel \ +vllm serve /model \ +--host 0.0.0.0 \ +--port 8000 \ +--dtype bfloat16 \ +--trust-remote-code \ +--gpu-memory-utilization 0.5 \ +--enforce-eager \ +--served-model-name monai-reasoning-cxr-3b +``` + +**Wait for startup and verify:** +```bash +## Wait for the model to load (can take 1-2 minutes on Spark hardware) +sleep 90 + +## Check if container is running +docker ps + +## Test the VLLM API +curl http://localhost:8000/v1/models +``` + +You should see JSON output showing the model is loaded and available. + +## Step 6. Deploy Open WebUI + +Launch the Open WebUI container with ARM64 architecture support for your NVIDIA Spark device. + +```bash +## Define custom prompt suggestions for medical X-ray analysis +PROMPT_SUGGESTIONS='[ +{ + "title": ["Analyze X-Ray Image", "Find abnormalities and support devices"], + "content": "Find abnormalities and support devices in the image." 
}
]'

## Stop and remove existing container if running
docker stop open-webui 2>/dev/null || true
docker rm open-webui 2>/dev/null || true
sleep 5

## Run Open WebUI with custom configuration
docker run -d --rm \
--name open-webui \
--network monai-net \
--platform linux/arm64 \
-p 3000:8080 \
-e WEBUI_AUTH=0 \
-e WEBUI_NAME=monai-reasoning \
-e ENABLE_SIGNUP=0 \
-e ENABLE_ADMIN_CHAT_ACCESS=0 \
-e ENABLE_VERSION_UPDATE_CHECK=0 \
-e OPENAI_API_BASE_URL="http://vllm-server:8000/v1" \
-e DEFAULT_PROMPT_SUGGESTIONS="$PROMPT_SUGGESTIONS" \
ghcr.io/open-webui/open-webui:main
```

**Verify deployment:**
```bash
## Wait for startup
sleep 15

## Check both containers are running
docker ps

## Test Open WebUI accessibility
curl -f http://localhost:3000 || echo "Still starting up"
```

## Step 7. Validate the Complete Deployment

Check that both containers are running properly and all endpoints are accessible:

```bash
## Check container status
docker ps
## You should see both vllm-server and open-webui containers running

## Test the VLLM API
curl http://localhost:8000/v1/models
## Should return JSON with model information

## Test Open WebUI accessibility
curl -f http://localhost:3000
## Should return HTTP 200 response
```

## Step 8. Configure Open WebUI

Configure the front-end interface to connect to your VLLM backend:

1. Open your web browser and navigate to `http://<spark-ip>:3000`, replacing `<spark-ip>` with your Spark's IP address (or use `localhost` when browsing on the device itself)
2. Since authentication is disabled, you'll have direct access to the interface
3. The OpenAI API connection is pre-configured through environment variables
4. Go to the main chat screen, click **"Select a model"**, and choose **monai-reasoning-cxr-3b**
5. **Important:** Navigate to **Chat Controls** → **Advanced Params** and disable **"Reasoning Tags"** to get the full reasoning output from the model

You can now upload a chest X-ray image and ask questions directly in the chat interface.
The custom prompt suggestion "Find abnormalities and support devices in the image" will be available for quick access.

## Step 9. Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| VLLM container fails to start | Insufficient GPU memory | Reduce `--gpu-memory-utilization` to 0.25 |
| Model download fails | Network connectivity or HF auth | Check `huggingface-cli whoami` and internet |
| Open WebUI shows connection error | Wrong backend URL | Verify `OPENAI_API_BASE_URL` is set correctly |
| Model doesn't show full reasoning | Reasoning tags enabled | Disable "Reasoning Tags" in Chat Controls → Advanced Params |

## Step 10. Cleanup and Rollback

To stop and remove the containers and network, run the following commands. This will not delete your downloaded model weights.

> **Warning:** This will stop all running containers and remove the network.

```bash
## Stop containers
docker stop vllm-server open-webui

## Remove network
docker network rm monai-net

## Optional: Remove model directory to free disk space
## rm -rf ~/monai-reasoning-spark/models
```

## Step 11. Next Steps

Your MONAI reasoning system is now ready for use. Upload chest X-ray images through the web interface at `http://<spark-ip>:3000` and interact with the MONAI-Reasoning-CXR-3B model for medical image analysis and reasoning tasks.

diff --git a/nvidia/multi-agent-chatbot/README.md b/nvidia/multi-agent-chatbot/README.md new file mode 100644 index 0000000..81faa1d --- /dev/null +++ b/nvidia/multi-agent-chatbot/README.md @@ -0,0 +1,130 @@

# Build and Deploy a Multi-Agent Chatbot

> Deploy a multi-agent chatbot system and chat with agents on your Spark

## Table of Contents

- [Overview](#overview)
- [Instructions](#instructions)

---

## Overview

## Basic Idea

This playbook shows you how to use DGX Spark to prototype, build and deploy a fully local multi-agent system.
+With 128GB of unified memory, DGX Spark can run multiple LLMs and VLMs in parallel — enabling interactions across agents. + +At the core is a supervisor agent powered by gpt-oss-120B, orchestrating specialized downstream agents for coding, retrieval-augmented generation (RAG), and image understanding. Thanks to DGX Spark's out-of-the-box support for popular AI frameworks and libraries, development and prototyping were fast and frictionless. +Together, these components demonstrate how complex, multimodal workflows can be executed efficiently on local, high-performance hardware. + +## What you'll accomplish + +You will have a full-stack multi-agent chatbot system running on your DGX Spark, accessible through +your local web browser. +The setup includes: +- LLM and VLM model serving using llama.cpp servers and TensorRT LLM servers +- GPU acceleration for both model inference and document retrieval +- Multi-agent system orchestration using a supervisor agent powered by gpt-oss-120B +- MCP (Model Context Protocol) servers as tools for the supervisor agent + +## Prerequisites + +- DGX Spark device is set up and accessible +- No other processes running on the DGX Spark GPU +- Enough disk space for model downloads + + +## Time & risk + +**Duration**: 30 minutes for initial setup, plus model download time (varies by model size) + +**Risks**: +- Docker permission issues may require user group changes and session restart +- Large model downloads may take significant time depending on network speed + +**Rollback**: Stop and remove Docker containers using provided cleanup commands + +## Instructions + +## Step 1. Configure Docker permissions + +To easily manage containers without sudo, you must be in the `docker` group. If you choose to skip this step, you will need to run Docker commands with sudo. + +Open a new terminal and test Docker access. 
In the terminal, run:

```bash
docker ps
```

If you see a permission denied error (something like `permission denied while trying to connect to the Docker daemon socket`), add your user to the docker group:

```bash
sudo usermod -aG docker $USER
```

> **Warning**: After running usermod, you must log out and log back in to start a new
> session with updated group permissions.

## Step 2. Clone the repository

In a terminal, clone the [playbook assets repository](https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main) and navigate to the root directory of the multi-agent-chatbot project (adjust the path below if your checkout location differs).

```bash
## Clone the repository linked above (clone URL derived from that link)
git clone https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets.git
cd dgx-spark-playbook-assets/multi-agent-chatbot
```

## Step 3. Run the setup script

```bash
chmod +x setup.sh
./setup.sh
```

This script will:
- Pull model GGUF files from HuggingFace
- Build the base llama.cpp server images
- Start the required Docker containers: the model servers, the backend API server, and the frontend UI

## Step 4. Wait for all the containers to become ready and healthy

```bash
watch 'docker ps --format "table {{.ID}}\t{{.Names}}\t{{.Status}}"'
```

This step can take about 20 minutes: pulling model files may take 10 minutes, and starting the containers may take another 10, depending on network speed.

## Step 5. Access the frontend UI

Open your browser and go to: http://localhost:3000

> **Note**: If you are running this on a remote GPU via an SSH connection, run the following command in a new terminal window so that you can access the UI at localhost:3000 and the UI can communicate with the backend at localhost:8000:
>
> `ssh -L 3000:localhost:3000 -L 8000:localhost:8000 username@IP-address`

## Step 6. Try out the sample prompts

Click on any of the tiles on the frontend to try out the supervisor and the other agents.

**RAG Agent**:
Before trying out the RAG agent, upload the example PDF document (the NVIDIA Blackwell whitepaper) as context by clicking on the "Attach" icon in the text input area at the bottom of the UI.
Make sure to check the box in the "Select Sources" section on the left side of the UI before submitting the query.


## Step 7. Cleanup and rollback

Steps to completely remove the containers and free up resources.

From the root directory of the multi-agent-chatbot project, run the following commands:

```bash
docker compose -f docker-compose.yml -f docker-compose-models.yml down
docker volume rm chatbot-spark_model-data chatbot-spark_postgres_data
```

## Step 8. Next steps

- Try different prompts with the multi-agent chatbot system.
- Try different models by following the instructions in the repository.
- Try adding new MCP (Model Context Protocol) servers as tools for the supervisor agent.

diff --git a/nvidia/multi-agent-chatbot/assets/document-ingestion.png b/nvidia/multi-agent-chatbot/assets/document-ingestion.png new file mode 100644 index 0000000..c10bbd5 Binary files /dev/null and b/nvidia/multi-agent-chatbot/assets/document-ingestion.png differ diff --git a/nvidia/multi-agent-chatbot/assets/multi-agent-chatbot.png b/nvidia/multi-agent-chatbot/assets/multi-agent-chatbot.png new file mode 100644 index 0000000..a97c02a Binary files /dev/null and b/nvidia/multi-agent-chatbot/assets/multi-agent-chatbot.png differ diff --git a/nvidia/multi-agent-chatbot/assets/system-diagram.png b/nvidia/multi-agent-chatbot/assets/system-diagram.png new file mode 100644 index 0000000..641cfd2 Binary files /dev/null and b/nvidia/multi-agent-chatbot/assets/system-diagram.png differ diff --git a/nvidia/multi-agent-chatbot/assets/upload-image.png b/nvidia/multi-agent-chatbot/assets/upload-image.png new file mode 100644 index 0000000..7bd2a7e Binary files /dev/null and b/nvidia/multi-agent-chatbot/assets/upload-image.png differ diff --git
a/nvidia/multi-modal-inference/README.md b/nvidia/multi-modal-inference/README.md new file mode 100644 index 0000000..8b33be9 --- /dev/null +++ b/nvidia/multi-modal-inference/README.md @@ -0,0 +1,215 @@ +# Multi-modal Inference + +> Setup multi-modal inference with TensorRT + +## Table of Contents + +- [Overview](#overview) +- [Instructions](#instructions) + - [Substep A. BF16 quantized precision](#substep-a-bf16-quantized-precision) + - [Substep B. FP8 quantized precision](#substep-b-fp8-quantized-precision) + - [Substep C. FP4 quantized precision](#substep-c-fp4-quantized-precision) + - [Substep A. FP16 precision (high VRAM requirement)](#substep-a-fp16-precision-high-vram-requirement) + - [Substep B. FP8 quantized precision](#substep-b-fp8-quantized-precision) + - [Substep C. FP4 quantized precision](#substep-c-fp4-quantized-precision) + - [Substep A. BF16 precision](#substep-a-bf16-precision) + - [Substep B. FP8 quantized precision](#substep-b-fp8-quantized-precision) + +--- + +## Overview + +## What you'll accomplish + +You'll deploy GPU-accelerated multi-modal inference capabilities on NVIDIA Spark using TensorRT to run +Flux.1 and SDXL diffusion models with optimized performance across multiple precision formats (FP16, +FP8, FP4). 
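A rough weight-memory calculation shows why the lower-precision paths matter on Spark. The sketch below assumes a transformer of roughly 12B parameters (approximately Flux.1's size; the exact count is an assumption for illustration, and activations, text encoders, and the VAE add to the real footprint):

```python
def weights_gib(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight memory in GiB at a given precision.

    Weights only -- activations, the text encoders, and the VAE
    add a significant amount on top of this.
    """
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 2**30

# ~12B parameters assumed, used only to illustrate scaling between precisions
for name, bits in [("FP16/BF16", 16), ("FP8", 8), ("FP4", 4)]:
    print(f"{name}: ~{weights_gib(12, bits):.1f} GiB of weights")
```

Each halving of precision halves the weight footprint, which is why the FP8 and FP4 paths below fit comfortably where FP16 needs >48GB once activations are included.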

## What to know before starting

- Working with Docker containers and GPU passthrough
- Using TensorRT for model optimization
- Hugging Face model hub authentication and downloads
- Command-line tools for GPU workloads
- Basic understanding of diffusion models and image generation

## Prerequisites

- [ ] NVIDIA Spark device with Blackwell GPU architecture
- [ ] Docker installed and accessible to current user
- [ ] NVIDIA Container Runtime configured
- [ ] Hugging Face account with valid token
- [ ] At least 48GB VRAM available for FP16 Flux.1 Schnell operations
- [ ] Verify GPU access: `nvidia-smi`
- [ ] Check Docker GPU integration: `docker run --rm --gpus all nvidia/cuda:12.0-base-ubuntu20.04 nvidia-smi`
- [ ] Confirm Hugging Face token access to the FLUX repos: `echo $HF_TOKEN`. Sign in to your Hugging Face account and create a token at https://huggingface.co/settings/tokens, granting it read access to the gated repos `black-forest-labs/FLUX.1-dev` and `black-forest-labs/FLUX.1-dev-onnx` (search for these repos when creating the token).

## Ancillary files

All necessary files can be found in the TensorRT repository [here on GitHub](https://github.com/NVIDIA/TensorRT)
- **requirements.txt** - Python dependencies for TensorRT demo environment
- **demo_txt2img_flux.py** - Flux.1 model inference script
- **demo_txt2img_xl.py** - SDXL model inference script
- **TensorRT repository** - Contains diffusion demo code and optimization tools

## Time & risk

**Duration**: 45-90 minutes depending on model downloads and optimization steps

**Risks**: Large model downloads may timeout; high VRAM requirements may cause OOM errors; quantized models may show quality degradation

**Rollback**: Remove downloaded models from HuggingFace cache, exit container environment

## Instructions

## Step 1.
Launch the TensorRT container environment

Start the NVIDIA PyTorch container with GPU access and HuggingFace cache mounting. This provides the TensorRT development environment with all required dependencies pre-installed.

```bash
docker run --gpus all --ipc=host --ulimit memlock=-1 \
--ulimit stack=67108864 -it --rm \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
nvcr.io/nvidia/pytorch:25.09-py3
```

## Step 2. Clone and set up TensorRT repository

Download the TensorRT repository and configure the environment for diffusion model demos.

```bash
git clone https://github.com/NVIDIA/TensorRT.git -b main --single-branch && cd TensorRT
export TRT_OSSPATH=/workspace/TensorRT/
cd $TRT_OSSPATH/demo/Diffusion
```

## Step 3. Install required dependencies

Install NVIDIA ModelOpt and other dependencies for model quantization and optimization.

```bash
## Install OpenGL libraries
apt update
apt install -y libgl1 libglu1-mesa libglib2.0-0t64 libxrender1 libxext6 libx11-6 libxrandr2 libxss1 libxcomposite1 libxdamage1 libxfixes3 libxcb1

pip install nvidia-modelopt[torch,onnx]
sed -i '/^nvidia-modelopt\[.*\]=.*/d' requirements.txt
pip3 install -r requirements.txt
```

## Step 4. Run Flux.1 Dev model inference

Test multi-modal inference using the Flux.1 Dev model with different precision formats.

### Substep A. BF16 quantized precision

```bash
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
    --hf-token=$HF_TOKEN --download-onnx-models --bf16
```

### Substep B. FP8 quantized precision

```bash
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
    --hf-token=$HF_TOKEN --quantization-level 4 --fp8 --download-onnx-models
```

### Substep C. FP4 quantized precision

```bash
python3 demo_txt2img_flux.py "a beautiful photograph of Mt.
Fuji during cherry blossom" \ + --hf-token=$HF_TOKEN --fp4 --download-onnx-models +``` + +## Step 5. Run Flux.1 Schnell model inference + +Test the faster Flux.1 Schnell variant with different precision formats. + +> **Warning**: FP16 Flux.1 Schnell requires >48GB VRAM for native export + +### Substep A. FP16 precision (high VRAM requirement) + +```bash +python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \ + --hf-token=$HF_TOKEN --version="flux.1-schnell" +``` + +### Substep B. FP8 quantized precision + +```bash +python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \ + --hf-token=$HF_TOKEN --version="flux.1-schnell" \ + --quantization-level 4 --fp8 --download-onnx-models +``` + +### Substep C. FP4 quantized precision + +```bash +python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \ + --hf-token=$HF_TOKEN --version="flux.1-schnell" \ + --fp4 --download-onnx-models +``` + +## Step 6. Run SDXL model inference + +Test the SDXL model for comparison with different precision formats. + +### Substep A. BF16 precision + +```bash +python3 demo_txt2img_xl.py "a beautiful photograph of Mt. Fuji during cherry blossom" \ + --hf-token=$HF_TOKEN --version xl-1.0 --download-onnx-models +``` + +### Substep B. FP8 quantized precision + +```bash +python3 demo_txt2img_xl.py "a beautiful photograph of Mt. Fuji during cherry blossom" \ + --hf-token=$HF_TOKEN --version xl-1.0 --download-onnx-models --fp8 +``` + +## Step 7. Validate inference outputs + +Check that the models generated images successfully and measure performance differences. + +```bash +## Check for generated images in output directory +ls -la *.png *.jpg 2>/dev/null || echo "No image files found" + +## Verify CUDA is accessible +nvidia-smi + +## Check TensorRT version +python3 -c "import tensorrt as trt; print(f'TensorRT version: {trt.__version__}')" +``` + +## Step 8. 
Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| "CUDA out of memory" error | Insufficient VRAM for model | Use FP8/FP4 quantization or smaller model |
| "Invalid HF token" error | Missing or expired HuggingFace token | Set valid token: `export HF_TOKEN=<your_token>` |
| Model download timeouts | Network issues or rate limiting | Retry command or pre-download models |

## Step 9. Cleanup and rollback

Remove downloaded models and exit container environment to free disk space.

> **Warning**: This will delete all cached models and generated images

```bash
## Exit container
exit

## Remove HuggingFace cache (optional)
rm -rf $HOME/.cache/huggingface/
```

## Step 10. Next steps

Use the validated setup to generate custom images or integrate multi-modal inference into your applications. Try different prompts or explore model fine-tuning with the established TensorRT environment.

diff --git a/nvidia/nccl/README.md b/nvidia/nccl/README.md new file mode 100644 index 0000000..5af095b --- /dev/null +++ b/nvidia/nccl/README.md @@ -0,0 +1,287 @@

# NCCL for Two Sparks

> Install and test NCCL on two Sparks

## Table of Contents

- [Overview](#overview)
- [Run on two Sparks](#run-on-two-sparks)
  - [Option 1: Suggested - Netplan configuration](#option-1-suggested-netplan-configuration)
  - [Option 2: Manual IP assignment (advanced)](#option-2-manual-ip-assignment-advanced)

---

## Overview

## Basic Idea

NCCL (NVIDIA Collective Communication Library) enables high-performance GPU-to-GPU communication across multiple nodes. This walkthrough sets up NCCL for multi-node distributed training on DGX Spark systems with Blackwell architecture. You'll configure networking, build NCCL from source with Blackwell support, and validate communication performance between nodes.
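The all-gather collective benchmarked later in this walkthrough can be pictured with a toy pure-Python model (no GPUs or NCCL involved): each rank contributes one chunk and, after n-1 ring steps, every rank holds all n chunks.

```python
def ring_all_gather(chunks):
    """Toy model of a ring all-gather (pure Python, no GPUs).

    Rank i starts with only chunks[i]; in each of the n-1 steps every
    rank forwards the chunk it most recently received to its right
    neighbour, so every rank ends up holding all n chunks.
    """
    n = len(chunks)
    buffers = [{i: c} for i, c in enumerate(chunks)]   # rank -> {chunk index: chunk}
    for step in range(n - 1):
        for i in range(n):
            src = (i - step) % n                       # chunk rank i forwards this step
            buffers[(i + 1) % n][src] = buffers[i][src]
    return [[buf[i] for i in range(n)] for buf in buffers]

print(ring_all_gather(["a", "b", "c", "d"]))
```

Each rank sends (n-1)/n of the total data over its link, which is exactly the scaling factor nccl-tests uses when it converts algorithm bandwidth into bus bandwidth.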

## What you'll accomplish

You'll have a working multi-node NCCL environment that enables high-bandwidth GPU communication across DGX Spark systems for distributed training workloads, with validated network performance and proper GPU topology detection.

## What to know before starting

- Working with Linux network configuration and netplan
- Docker container management and multi-container deployments
- Basic understanding of MPI (Message Passing Interface) concepts
- SSH key management and passwordless authentication setup
- NVIDIA GPU architecture fundamentals and CUDA toolkit usage

## Prerequisites

- [ ] Two DGX Spark systems with Blackwell GPUs: `nvidia-smi --query-gpu=gpu_name --format=csv`
- [ ] ConnectX-7 InfiniBand network cards installed: `ibdev2netdev`
- [ ] Docker installed on both nodes: `docker --version`
- [ ] CUDA toolkit available: `nvcc --version`
- [ ] SSH access between nodes: `ssh <other-node> echo "success"`
- [ ] Root/sudo privileges: `sudo whoami`

## Ancillary files

- `cx7-netplan.yaml` - Network configuration template for ConnectX-7 interfaces
- `discover-sparks` - Script to discover DGX Spark nodes and configure SSH keys
- `trtllm-mn-entrypoint.sh` - Container entrypoint script for multi-node setup

## Time & risk

**Duration**: 45-60 minutes for setup and validation
**Risk level**: Medium - involves network configuration changes and container networking
**Rollback**: Network changes can be reverted using `sudo netplan apply` with original configs, containers can be stopped with `docker stop`

## Run on two Sparks

## Step 1. Setup networking between nodes

Configure network interfaces for high-performance inter-node communication. Choose one option based on your network requirements.
### Option 1: Suggested - Netplan configuration

Configure network interfaces using netplan on both DGX Spark nodes for automatic link-local addressing. The `cx7-netplan.yaml` template from the ancillary files provides the configuration; a representative example is shown below (adjust the interface names to match your system):

```bash
## On both nodes, create the netplan configuration file
sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<'EOF'
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      link-local: [ ipv4 ]
    enp1s0f1np1:
      link-local: [ ipv4 ]
EOF

## Apply the configuration
sudo netplan apply
```

### Option 2: Manual IP assignment (advanced)

Assign static addresses directly to the active ConnectX-7 interface on each node. The addresses below are the ones used by the performance test in Step 7:

```bash
## On node 1
sudo ip addr add 192.168.100.10/24 dev enp1s0f0np0

## On node 2
sudo ip addr add 192.168.100.11/24 dev enp1s0f0np0
```

## Step 2. Configure SSH access between nodes

Run the `discover-sparks` helper script from the ancillary files to discover the DGX Spark nodes and distribute SSH keys for passwordless authentication:

```bash
./discover-sparks
```

## Step 3. Identify the active network interfaces

Map the ConnectX-7 InfiniBand devices to their Ethernet interface names and check link status:

```bash
ibdev2netdev
```

Expected output similar to:

```
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
```

Note the active interface names (marked "Up") for use in container configuration.

## Step 4. Launch TensorRT-LLM containers on both nodes

Start containers with appropriate network and GPU configuration for NCCL communication:

```bash
## On both nodes, launch the container
docker run --name trtllm --rm -d \
    --gpus all --network host --ipc=host \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    -e UCX_NET_DEVICES=enp1s0f0np0,enp1s0f1np1 \
    -e NCCL_SOCKET_IFNAME=enp1s0f0np0,enp1s0f1np1 \
    -e OMPI_MCA_btl_tcp_if_include=enp1s0f0np0,enp1s0f1np1 \
    -e OMPI_ALLOW_RUN_AS_ROOT=1 \
    -e OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 \
    -v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \
    -v ./trtllm-mn-entrypoint.sh:/opt/trtllm-mn-entrypoint.sh \
    -v ~/.ssh:/tmp/.ssh:ro \
    --entrypoint /opt/trtllm-mn-entrypoint.sh \
    nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc3
```

## Step 5. Build NCCL with Blackwell support

Execute these commands inside both containers to build NCCL from source with Blackwell architecture support:

```bash
## Install dependencies and build NCCL
sudo apt-get update && sudo apt-get install -y libopenmpi-dev
git clone -b v2.28.3-1 https://github.com/NVIDIA/nccl.git /opt/nccl/
cd /opt/nccl/
make -j src.build NVCC_GENCODE="-gencode=arch=compute_121,code=sm_121"

## Set environment variables
export MPI_HOME="/usr/lib/aarch64-linux-gnu/openmpi"
export NCCL_HOME="/opt/nccl/build/"
export LD_LIBRARY_PATH="$NCCL_HOME/lib:$CUDA_HOME/lib64/:$MPI_HOME/lib:$LD_LIBRARY_PATH"
```

## Step 6.
Build NCCL test suite + +Compile the NCCL test suite to validate communication performance: + +```bash +## Clone and build NCCL tests +git clone https://github.com/NVIDIA/nccl-tests.git /opt/nccl-tests/ +cd /opt/nccl-tests/ +make MPI=1 +``` + +## Step 7. Run NCCL communication test + +Execute multi-node NCCL performance test using the active network interface: + +```bash +## Set network interface environment variables (use your active interface from Step 3) +export UCX_NET_DEVICES=enp1s0f0np0 +export NCCL_SOCKET_IFNAME=enp1s0f0np0 +export OMPI_MCA_btl_tcp_if_include=enp1s0f0np0 + +## Run the all_gather performance test across both nodes +mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 \ + -x NCCL_DEBUG=VERSION -x NCCL_DEBUG_SUBSYS=TUNING \ + -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \ + -x NCCL_MERGE_LEVEL=SYS -x NCCL_PROTO="SIMPLE" \ + /opt/nccl-tests/build/all_gather_perf -b 32G -e 32G -f 2 +``` + +## Step 8. Validate NCCL installation + +Verify successful NCCL compilation and multi-node communication: + +```bash +## Check NCCL library build +ls -la /opt/nccl/build/lib/ + +## Verify NCCL test binaries +ls -la /opt/nccl-tests/build/ + +## Check MPI configuration +mpirun --version +``` + +Expected output should show NCCL libraries in `/opt/nccl/build/lib/` and test binaries in +`/opt/nccl-tests/build/`. + +## Step 9. Performance validation + +Review the all_gather test output for communication performance metrics: + +Expected metrics from the test output: +- Bandwidth measurements between nodes +- Latency for different message sizes +- GPU-to-GPU communication confirmation +- No error messages or communication failures + +## Step 10. 
Troubleshooting + +| Symptom | Cause | Fix | +|---------|-------|-----| +| "Network unreachable" errors | Network interfaces not configured | Verify netplan config and `sudo netplan apply` | +| SSH authentication failures | SSH keys not properly distributed | Re-run `./discover-sparks` and enter passwords | +| NCCL build failures with Blackwell | Wrong compute capability specified | Verify `NVCC_GENCODE="-gencode=arch=compute_121,code=sm_121"` | +| MPI communication timeouts | Wrong network interfaces specified | Check `ibdev2netdev` and update interface names | +| Container networking issues | Host network mode problems | Ensure `--network host --ipc=host` in docker run | + +## Step 11. Cleanup and rollback + +**Warning**: These steps will stop containers and reset network configuration. + +```bash +## Stop containers on both nodes +docker stop trtllm + +## Remove containers (optional) +docker rm trtllm + +## Rollback network configuration (if needed) +sudo rm /etc/netplan/40-cx7.yaml +sudo netplan apply +``` + +## Step 12. Next steps + +Test your NCCL setup with a simple distributed training example: + +```bash +## Example: Run a simple NCCL bandwidth test +/opt/nccl-tests/build/all_reduce_perf -b 1M -e 1G -f 2 + +## Example: Verify GPU topology detection +nvidia-smi topo -m +``` + +Your NCCL environment is ready for multi-node distributed training workloads on DGX Spark +systems with Blackwell GPUs. 
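When reading the all_gather test output, the reported bus bandwidth is the algorithm bandwidth (total data / time) scaled by (n-1)/n, the fraction of the data each rank actually moves over its link. A small sketch of that arithmetic, with illustrative numbers only (not measurements from any particular Spark pair):

```python
def all_gather_busbw_gbs(total_bytes: int, seconds: float, n_ranks: int) -> float:
    """Bus bandwidth in GB/s as nccl-tests derives it for all_gather:
    algorithm bandwidth (total data / time) scaled by (n-1)/n, the
    fraction of the data each rank actually moves over the link."""
    algbw = total_bytes / seconds / 1e9
    return algbw * (n_ranks - 1) / n_ranks

# Illustrative numbers only: 32 GiB gathered across 2 ranks in 2.0 s
print(all_gather_busbw_gbs(32 * 2**30, 2.0, 2))
```

For two ranks the factor is 1/2, so the bus bandwidth column will read half of the algorithm bandwidth; compare it against your link's line rate to judge efficiency.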
diff --git a/nvidia/nemo-fine-tune/README.md b/nvidia/nemo-fine-tune/README.md new file mode 100644 index 0000000..a6a8b2c --- /dev/null +++ b/nvidia/nemo-fine-tune/README.md @@ -0,0 +1,351 @@ +# Fine tune with Nemo + +> Use NVIDIA NeMo to fine-tune models locally + +## Table of Contents + +- [Overview](#overview) +- [Instructions](#instructions) + - [If system installation fails](#if-system-installation-fails) + - [Install from wheel package (recommended)](#install-from-wheel-package-recommended) + - [Full Fine-tuning example:](#full-fine-tuning-example) + - [LoRA fine-tuning example:](#lora-fine-tuning-example) + - [QLoRA fine-tuning example:](#qlora-fine-tuning-example) + - [Step 9. Configure distributed training (optional)](#step-9-configure-distributed-training-optional) + +--- + +## Overview + +## Basic Idea + +This playbook guides you through setting up and using NVIDIA NeMo AutoModel for fine-tuning large language models and vision-language models on NVIDIA Spark devices. NeMo AutoModel provides GPU-accelerated, end-to-end training for Hugging Face models with native PyTorch support, enabling instant fine-tuning without conversion delays. The framework supports distributed training across single GPU to multi-node clusters, with optimized kernels and memory-efficient recipes specifically designed for ARM64 architecture and Blackwell GPU systems. + +## What you'll accomplish + +You'll establish a complete fine-tuning environment for large language models (1-70B parameters) and vision-language models using NeMo AutoModel on your NVIDIA Spark device. By the end, you'll have a working installation that supports parameter-efficient fine-tuning (PEFT), supervised fine-tuning (SFT), and distributed training capabilities with FP8 precision optimizations, all while maintaining compatibility with the Hugging Face ecosystem. 
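A quick parameter count shows why the PEFT recipes below fit large models in limited memory. The sketch uses a hypothetical 4096x4096 projection and rank 16, typical Llama-style dimensions rather than values taken from any specific recipe:

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters LoRA trains for one frozen d_out x d_in weight:
    a rank x d_in matrix A plus a d_out x rank matrix B."""
    return rank * d_in + d_out * rank

full = 4096 * 4096                 # one hypothetical 4096x4096 projection
lora = lora_trainable_params(4096, 4096, 16)
print(f"full: {full:,}  lora(r=16): {lora:,}  ratio: {lora / full:.2%}")
```

Under one percent of the weights receive gradients, which is what makes the LoRA and QLoRA runs later in this playbook feasible on a single Spark.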
+ +## What to know before starting + +- Working in Linux terminal environments and SSH connections +- Basic understanding of Python virtual environments and package management +- Familiarity with GPU computing concepts and CUDA toolkit usage +- Experience with containerized workflows and Docker/Podman operations +- Understanding of machine learning model training concepts and fine-tuning workflows + +## Prerequisites + +- [ ] NVIDIA Spark device with Blackwell architecture GPU access +- [ ] CUDA toolkit 12.0+ installed and configured + ```bash + nvcc --version + ``` +- [ ] Python 3.10+ environment available + ```bash + python3 --version + ``` +- [ ] Minimum 32GB system RAM for efficient model loading and training +- [ ] Active internet connection for downloading models and packages +- [ ] Git installed for repository cloning + ```bash + git --version + ``` +- [ ] SSH access to your NVIDIA Spark device configured + +## Ancillary files + +All necessary files for the playbook can be found [here on GitHub](https://github.com/NVIDIA-NeMo/Automodel) + +## Time & risk + +**Time estimate:** 45-90 minutes for complete setup and initial model fine-tuning + +**Risks:** Model downloads can be large (several GB), ARM64 package compatibility issues may require troubleshooting, distributed training setup complexity increases with multi-node configurations + +**Rollback:** Virtual environments can be completely removed; no system-level changes are made to the host system beyond package installations + +## Instructions + +## Step 1. Verify system requirements + +Check your NVIDIA Spark device meets the prerequisites for NeMo AutoModel installation. This step runs on the host system to confirm CUDA toolkit availability and Python version compatibility. + +```bash +## Verify CUDA installation +nvcc --version + +## Check Python version (3.10+ required) +python3 --version + +## Verify GPU accessibility +nvidia-smi + +## Check available system memory +free -h +``` + +## Step 2. 
Get the container image + +```bash +docker pull nvcr.io/nvidia/pytorch:25.08-py3 +``` + +## Step 3. Launch Docker + +```bash +docker run \ + --gpus all \ + --ulimit memlock=-1 \ + -it --ulimit stack=67108864 \ + --entrypoint /usr/bin/bash \ + --rm nvcr.io/nvidia/pytorch:25.08-py3 +``` + +## Step 4. Install package management tools + +Install `uv` for efficient package management and virtual environment isolation. NeMo AutoModel uses `uv` for dependency management and automatic environment handling. + +```bash +## Install uv package manager +pip3 install uv + +## Verify installation +uv --version +``` + +### If system installation fails + +```bash +## Install for current user only +pip3 install --user uv + +## Add to PATH if needed +export PATH="$HOME/.local/bin:$PATH" +``` + +## Step 5. Clone NeMo AutoModel repository + +Clone the official NeMo AutoModel repository to access recipes and examples. This provides ready-to-use training configurations for various model types and training scenarios. + +```bash +## Clone the repository +git clone https://github.com/NVIDIA-NeMo/Automodel.git + +## Navigate to the repository +cd Automodel +``` + +## Step 6. Install NeMo AutoModel + +Set up the virtual environment and install NeMo AutoModel. Choose between wheel package installation for stability or source installation for latest features. 

### Install from wheel package (recommended)

```bash
## Initialize virtual environment
uv venv --system-site-packages

## Install packages with uv
uv sync --inexact --frozen --all-extras \
    --no-install-package torch \
    --no-install-package torchvision \
    --no-install-package triton \
    --no-install-package nvidia-cublas-cu12 \
    --no-install-package nvidia-cuda-cupti-cu12 \
    --no-install-package nvidia-cuda-nvrtc-cu12 \
    --no-install-package nvidia-cuda-runtime-cu12 \
    --no-install-package nvidia-cudnn-cu12 \
    --no-install-package nvidia-cufft-cu12 \
    --no-install-package nvidia-cufile-cu12 \
    --no-install-package nvidia-curand-cu12 \
    --no-install-package nvidia-cusolver-cu12 \
    --no-install-package nvidia-cusparse-cu12 \
    --no-install-package nvidia-cusparselt-cu12 \
    --no-install-package nvidia-nccl-cu12 \
    --no-install-package transformer-engine \
    --no-install-package nvidia-modelopt \
    --no-install-package nvidia-modelopt-core \
    --no-install-package flash-attn \
    --no-install-package transformer-engine-cu12 \
    --no-install-package transformer-engine-torch

## Install bitsandbytes
CMAKE_ARGS="-DCOMPUTE_BACKEND=cuda -DCOMPUTE_CAPABILITY=80;86;87;89;90" \
CMAKE_BUILD_PARALLEL_LEVEL=8 \
uv pip install --no-deps git+https://github.com/bitsandbytes-foundation/bitsandbytes.git@50be19c39698e038a1604daf3e1b939c9ac1c342
```

## Step 7. Verify installation

Confirm NeMo AutoModel is properly installed and accessible. This step validates the installation and checks for any missing dependencies.

```bash
## Test NeMo AutoModel import
uv run --frozen --no-sync python -c "import nemo_automodel; print('✅ NeMo AutoModel ready')"

## Check available examples
ls -la examples/
```

## Step 8. Explore available examples

Review the pre-configured training recipes available for different model types and training scenarios. These recipes provide optimized configurations for ARM64 and Blackwell architecture.

```bash
## List LLM fine-tuning examples
ls examples/llm_finetune/

## View example recipe configuration
cat examples/llm_finetune/finetune.py | head -20
```

## Step 9. Run sample fine-tuning

The following commands show how to perform full fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) with LoRA and QLoRA.

First, export your `HF_TOKEN` so that gated models can be downloaded.

```bash
## Export your Hugging Face access token
export HF_TOKEN=<your_hf_token>
```

> **Note:** Replace `<your_hf_token>` with your Hugging Face access token to access gated models (e.g., Llama).

### Full fine-tuning example

Once inside the `Automodel` directory you cloned from GitHub, run:

```bash
uv run --frozen --no-sync \
examples/llm_finetune/finetune.py \
-c examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml \
--step_scheduler.local_batch_size 1 \
--loss_fn._target_ nemo_automodel.components.loss.te_parallel_ce.TEParallelCrossEntropy \
--model.pretrained_model_name_or_path Qwen/Qwen3-8B
```

These overrides ensure the Qwen3-8B SFT run behaves as expected:
- `--model.pretrained_model_name_or_path`: selects the Qwen/Qwen3-8B model to fine-tune (weights fetched via your Hugging Face token).
- `--loss_fn._target_`: uses the TransformerEngine-parallel cross-entropy loss variant compatible with tensor-parallel training for large LLMs.
- `--step_scheduler.local_batch_size`: sets the per-GPU micro-batch size to 1 to fit in memory; the overall effective batch size is still driven by gradient accumulation and data/tensor parallel settings from the recipe.

### LoRA fine-tuning example

Execute a basic LoRA fine-tuning run to validate the complete setup. This demonstrates parameter-efficient fine-tuning using a small model suitable for testing.

```bash
## Run LoRA fine-tuning example
uv run --frozen --no-sync \
examples/llm_finetune/finetune.py \
-c examples/llm_finetune/llama3_2/llama3_2_1b_squad_peft.yaml \
--model.pretrained_model_name_or_path meta-llama/Llama-3.1-8B
```

### QLoRA fine-tuning example

We can use QLoRA to fine-tune large models in a memory-efficient manner.

```bash
uv run --frozen --no-sync \
examples/llm_finetune/finetune.py \
-c examples/llm_finetune/llama3_1/llama3_1_8b_squad_qlora.yaml \
--model.pretrained_model_name_or_path meta-llama/Meta-Llama-3-70B \
--loss_fn._target_ nemo_automodel.components.loss.te_parallel_ce.TEParallelCrossEntropy \
--step_scheduler.local_batch_size 1
```

These overrides ensure the 70B QLoRA run behaves as expected:
- `--model.pretrained_model_name_or_path`: selects the 70B base model to fine-tune (weights fetched via your Hugging Face token).
- `--loss_fn._target_`: uses the TransformerEngine-parallel cross-entropy loss variant compatible with tensor-parallel training for large LLMs.
- `--step_scheduler.local_batch_size`: sets the per-GPU micro-batch size to 1 to fit 70B in memory; the overall effective batch size is still driven by gradient accumulation and data/tensor parallel settings from the recipe.

## Step 10. Validate training output

Check that fine-tuning completed successfully and inspect the generated model artifacts. This confirms the training pipeline works correctly on your Spark device.

```bash
## Check training logs
ls -la logs/

## Verify model checkpoint creation
ls -la checkpoints/

## Test model inference (if applicable)
uv run python -c "
import torch
print('GPU available:', torch.cuda.is_available())
print('GPU count:', torch.cuda.device_count())
"
```

## Step 11. Validate complete setup

Perform final validation to ensure all components are working correctly. This comprehensive check confirms the environment is ready for production fine-tuning workflows.

```bash
## Test complete pipeline
uv run python -c "
import nemo_automodel
import torch
print('✅ NeMo AutoModel version:', nemo_automodel.__version__)
print('✅ CUDA available:', torch.cuda.is_available())
print('✅ GPU count:', torch.cuda.device_count())
print('✅ Setup complete')
"
```

## Step 12. Troubleshooting

Common issues and solutions for NeMo AutoModel setup on NVIDIA Spark devices.

| Symptom | Cause | Fix |
|---------|--------|-----|
| `nvcc: command not found` | CUDA toolkit not in PATH | Add CUDA toolkit to PATH: `export PATH=/usr/local/cuda/bin:$PATH` |
| `pip install uv` permission denied | System-level pip restrictions | Use `pip3 install --user uv` and update PATH |
| GPU not detected in training | CUDA driver/runtime mismatch | Verify driver compatibility: `nvidia-smi` and reinstall CUDA if needed |
| Out of memory during training | Model too large for available GPU memory | Reduce batch size, enable gradient checkpointing, or use model parallelism |
| ARM64 package compatibility issues | Package not available for ARM architecture | Use source installation or build from source with ARM64 flags |

## Step 13. Cleanup and rollback

Remove the installation and restore the original environment if needed. These commands safely remove all installed components.

> **Warning:** This will delete all virtual environments and downloaded models. Ensure you have backed up any important training checkpoints.

```bash
## Remove virtual environment
rm -rf .venv

## Remove cloned repository
cd ..
rm -rf Automodel

## Remove uv (if installed with --user)
pip3 uninstall uv

## Clear Python cache
rm -rf ~/.cache/pip
```

## Step 14. Next steps

Begin using NeMo AutoModel for your specific fine-tuning tasks. Start with provided recipes and customize based on your model requirements and dataset.

```bash
## Copy a recipe for customization
cp examples/llm_finetune/finetune.py my_custom_training.py

## Edit configuration for your specific model and data
## Then run: uv run my_custom_training.py
```

Explore the [NeMo AutoModel GitHub repository](https://github.com/NVIDIA-NeMo/Automodel) for advanced recipes, documentation, and community examples. Consider setting up custom datasets, experimenting with different model architectures, and scaling to multi-node distributed training for larger models.

diff --git a/nvidia/nim-llm/README.md b/nvidia/nim-llm/README.md
new file mode 100644
index 0000000..8581112
--- /dev/null
+++ b/nvidia/nim-llm/README.md
@@ -0,0 +1,214 @@
# Use a NIM on Spark

> Run a NIM on Spark

## Table of Contents

- [Overview](#overview)
  - [Basic Idea](#basic-idea)
  - [What you'll accomplish](#what-youll-accomplish)
  - [What to know before starting](#what-to-know-before-starting)
  - [Prerequisites](#prerequisites)
  - [Time & risk](#time-risk)
- [Instructions](#instructions)
  - [Step 2. Configure NGC authentication](#step-2-configure-ngc-authentication)

---

## Overview

### Basic Idea

NVIDIA Inference Microservices (NIMs) provide optimized containers for deploying large language models with simplified APIs. This playbook demonstrates how to run LLM NIMs on DGX Spark devices, enabling GPU-accelerated inference through Docker containers. You'll set up authentication with NVIDIA's registry, launch a containerized LLM service, and perform basic inference testing to verify functionality.

### What you'll accomplish

You'll deploy an LLM NIM container on your DGX Spark device, configure it for GPU acceleration, and establish a working inference endpoint that responds to HTTP API calls with generated text completions.
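The endpoint you will stand up speaks the OpenAI-compatible chat completions protocol. As a rough sketch of what an API call looks like from code — the base URL and model name here are assumptions matching the examples later in this playbook; substitute the values your container reports:

```python
import json
import urllib.request

# Build (but do not yet send) an OpenAI-compatible chat completion request.
# The base URL and model name are assumptions taken from this playbook's
# curl examples; adjust them for your NIM container.
def build_chat_request(model, prompt, max_tokens=64,
                       base_url="http://0.0.0.0:8000"):
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"accept": "application/json",
                 "Content-Type": "application/json"},
    )

req = build_chat_request("meta/llama-3.1-8b-instruct", "Can you write me a song?")
print(req.full_url)  # http://0.0.0.0:8000/v1/chat/completions
```

Once the container is running (Step 4 below), `urllib.request.urlopen(req)` would return the JSON completion; the curl command in Step 5 is the shell equivalent of this request.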
+ +### What to know before starting + +- Working in a terminal environment +- Using Docker commands and GPU-enabled containers +- Basic familiarity with REST APIs and curl commands +- Understanding of NVIDIA GPU environments and CUDA + +### Prerequisites + +- [ ] DGX Spark device with NVIDIA drivers installed + ```bash + nvidia-smi + ``` +- [ ] Docker with NVIDIA Container Toolkit configured + ```bash + docker run --rm --gpus all nvidia/cuda:12.0-base-ubuntu20.04 nvidia-smi + ``` +- [ ] NGC account with API key from https://ngc.nvidia.com/setup/api-key + ```bash + echo $NGC_API_KEY | grep -E '^[a-zA-Z0-9]{86}==' + ``` +- [ ] Sufficient disk space for model caching (varies by model, typically 10-50GB) + ```bash + df -h ~ + ``` + + +### Time & risk + +**Estimated time:** 15-30 minutes for setup and validation + +**Risks:** +- Large model downloads may take significant time depending on network speed +- GPU memory requirements vary by model size +- Container startup time depends on model loading + +**Rollback:** Stop and remove containers with `docker stop && docker rm `. Remove cached models from `~/.cache/nim` if disk space recovery is needed. + +## Instructions + +## Step 1. Verify environment prerequisites + +Check that your system meets the basic requirements for running GPU-enabled containers. + +```bash +nvidia-smi +docker --version +docker run --rm --gpus all nvidia/cuda:12.0-base-ubuntu20.04 nvidia-smi +``` + +### Step 2. Configure NGC authentication + +Set up access to NVIDIA's container registry using your NGC API key. + +```bash +export NGC_API_KEY="" +echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin +``` + +## Step 3. Select and configure NIM container + +Choose a specific LLM NIM from NGC and set up local caching for model assets. 
+ +> TODO: Replace with actual available NIM container image from NGC catalog + +```bash +export CONTAINER_NAME="nim-llm-demo" +export IMG_NAME="nvcr.io/nim/meta/llama-3.1-8b-instruct-dgx-spark:latest" +export LOCAL_NIM_CACHE=~/.cache/nim +mkdir -p "$LOCAL_NIM_CACHE" +chmod -R a+w "$LOCAL_NIM_CACHE" +``` + +## Step 4. Launch NIM container + +Start the containerized LLM service with GPU acceleration and proper resource allocation. + +```bash +docker run -it --rm --name=$CONTAINER_NAME \ + --runtime=nvidia \ + --gpus all \ + --shm-size=16GB \ + -e NGC_API_KEY=$NGC_API_KEY \ + -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \ + -u $(id -u) \ + -p 8000:8000 \ + $IMG_NAME +``` + +The container will download the model on first run and may take several minutes to start. Look for +startup messages indicating the service is ready. + +## Step 5. Validate inference endpoint + +Test the deployed service with a basic completion request to verify functionality. Run the following curl command in a new terminal. + +> TODO: Replace NIM_MODEL with actual model identifier from the container + +```bash +curl -X 'POST' \ + 'http://0.0.0.0:8000/v1/chat/completions' \ + -H 'accept: application/json' \ + -H 'Content-Type: application/json' \ + -d '{ + "model": "meta/llama-3.1-8b-instruct", + "messages": [ + { + "role":"system", + "content":"detailed thinking on" + }, + { + "role":"user", + "content":"Can you write me a song?" + } + ], + "top_p": 1, + "n": 1, + "max_tokens": 15, + "frequency_penalty": 1.0, + "stop": ["hello"] + + }' + +``` + +Expected output should be a JSON response containing a completion field with generated text. + +## Step 6. Test additional functionality + +Perform extended validation with different prompts and parameters. 

> TODO: Add tool calling examples if supported by selected model

```bash
curl -X 'POST' \
  'http://0.0.0.0:8000/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "",
    "prompt": "Explain quantum computing in simple terms:",
    "max_tokens": 128,
    "temperature": 0.7
  }'
```

## Step 7. Troubleshooting

| Symptom | Cause | Fix |
|---------|--------|-----|
| Container fails to start with GPU error | NVIDIA Container Toolkit not configured | Install nvidia-container-toolkit and restart Docker |
| "Invalid credentials" during docker login | Incorrect NGC API key format | Verify API key from NGC portal, ensure no extra whitespace |
| Model download hangs or fails | Network connectivity or insufficient disk space | Check internet connection and available disk space in cache directory |
| API returns 404 or connection refused | Container not fully started or wrong port | Wait for container startup completion, verify port 8000 is accessible |

## Step 8. Cleanup and rollback

Stop the running container and optionally clean up cached model files. Because the container was started with `--rm`, Docker removes it automatically once it stops.

> **Warning:** Removing cached models will require re-downloading on next run.

```bash
docker stop $CONTAINER_NAME
```

To remove cached models and free disk space:
```bash
rm -rf "$LOCAL_NIM_CACHE"
```

## Step 9. Next steps

With a working NIM deployment, you can:

- Integrate the API endpoint into your applications using the OpenAI-compatible interface
- Experiment with different models available in the NGC catalog
- Scale the deployment using container orchestration tools
- Monitor resource usage and optimize container resource allocation

Test the integration with your preferred HTTP client or SDK to begin building applications.
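If you drive the endpoint from a script rather than curl, unpacking the response is one dictionary access deep. A minimal sketch — the sample dict below is illustrative, not actual NIM output, and real responses carry extra fields (`usage`, `id`, and so on):

```python
# Pull the generated text out of an OpenAI-compatible chat completion
# response. The sample dict below only mimics the response shape.
def extract_completion(response: dict) -> str:
    return response["choices"][0]["message"]["content"]

sample = {
    "model": "meta/llama-3.1-8b-instruct",
    "choices": [
        {"index": 0,
         "message": {"role": "assistant", "content": "Verse one..."},
         "finish_reason": "stop"}
    ],
}
print(extract_completion(sample))  # Verse one...
```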
diff --git a/nvidia/nvfp4-quantization/README.md b/nvidia/nvfp4-quantization/README.md new file mode 100644 index 0000000..f9ee9fd --- /dev/null +++ b/nvidia/nvfp4-quantization/README.md @@ -0,0 +1,208 @@ +# Quantize to NVFP4 + +> Quantize a model to NVFP4 to run on Spark + +## Table of Contents + +- [Overview](#overview) + - [NVFP4 on Blackwell](#nvfp4-on-blackwell) +- [Desktop Access](#desktop-access) + +--- + +## Overview + +## Basic Idea + +### NVFP4 on Blackwell + +- **What it is:** A new 4-bit floating-point format for NVIDIA Blackwell GPUs. +- **How it works:** Uses two levels of scaling (local per-block + global tensor) to keep accuracy while using fewer bits. +- **Why it matters:** + - Cuts memory use ~3.5× vs FP16 and ~1.8× vs FP8 + - Keeps accuracy close to FP8 (usually <1% loss) + - Improves speed and energy efficiency for inference +- **Ecosystem:** Supported in NVIDIA tools (TensorRT, LLM Compressor, vLLM) and Hugging Face models. + + +## What you'll accomplish + +You'll quantize the DeepSeek-R1-Distill-Llama-8B model using NVIDIA's TensorRT Model Optimizer +inside a TensorRT-LLM container, producing an NVFP4 quantized model for deployment on NVIDIA DGX Spark. + + +## What to know before starting + +- Working with Docker containers and GPU-accelerated workloads +- Understanding of model quantization concepts and their impact on inference performance +- Experience with NVIDIA TensorRT and CUDA toolkit environments +- Familiarity with Hugging Face model repositories and authentication + +## Prerequisites + +- [ ] NVIDIA Spark device with Blackwell architecture GPU +- [ ] Docker installed with GPU support +- [ ] NVIDIA Container Toolkit configured +- [ ] At least 32GB of available storage for model files and outputs +- [ ] Hugging Face account with access to the target model + +Verify your setup: +```bash +## Check Docker GPU access +docker run --rm --gpus all nvidia/cuda:12.0-base-ubuntu20.04 nvidia-smi + +## Verify sufficient disk space +df -h . 
+ +## Check Hugging Face CLI (install if needed: pip install huggingface_hub) +huggingface-cli whoami +``` + + + +## Time & risk + +**Estimated duration**: 45-90 minutes depending on network speed and model size + +**Risks**: +- Model download may fail due to network issues or Hugging Face authentication problems +- Quantization process is memory-intensive and may fail on systems with insufficient GPU memory +- Output files are large (several GB) and require adequate storage space + +**Rollback**: Remove the output directory and any pulled Docker images to restore original state. + +## Desktop Access + +## Step 1. Prepare the environment + +Create a local output directory where the quantized model files will be stored. This directory will be mounted into the container to persist results after the container exits. + +```bash +mkdir -p ./output_models +chmod 755 ./output_models +``` + +## Step 2. Authenticate with Hugging Face + +Ensure you have access to the DeepSeek model by logging in to Hugging Face. If you don't have the CLI installed, install it first. + +```bash +## Install Hugging Face CLI if needed +pip install huggingface_hub + +## Login to Hugging Face +huggingface-cli login +``` + +Enter your Hugging Face token when prompted. The token will be cached in `~/.cache/huggingface/token`. + +## Step 3. Run the TensorRT Model Optimizer container + +Launch the TensorRT-LLM container with GPU access, IPC settings optimized for multi-GPU workloads, and volume mounts for model caching and output persistence. 
+ +```bash +docker run --rm -it --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \ + -v "$(pwd)/output_models:/workspace/outputs" \ + -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \ + nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc3 \ + bash -c "git clone --single-branch https://github.com/NVIDIA/TensorRT-Model-Optimizer.git /app/TensorRT-Model-Optimizer && \ + cd /app/TensorRT-Model-Optimizer && pip install -e '.[dev]' && \ + export ROOT_SAVE_PATH='/workspace/outputs' && \ + time /app/TensorRT-Model-Optimizer/examples/llm_ptq/scripts/huggingface_example.sh --model 'deepseek-ai/DeepSeek-R1-Distill-Llama-8B' --quant nvfp4 --tp 1 --export_fmt hf" +``` + +This command: +- Runs the container with full GPU access and optimized shared memory settings +- Mounts your output directory to persist quantized model files +- Mounts your Hugging Face cache to avoid re-downloading the model +- Clones and installs the TensorRT Model Optimizer from source +- Executes the quantization script with NVFP4 quantization parameters + +## Step 4. Monitor the quantization process + +The quantization process will display progress information including: +- Model download progress from Hugging Face +- Quantization calibration steps +- Model export and validation phases +- Total execution time + +Expected output includes lines similar to: +``` +Downloading model... +Starting quantization... +Calibrating with NVFP4... +Exporting to Hugging Face format... +``` + +## Step 5. Validate the quantized model + +After the container completes, verify that the quantized model files were created successfully. + +```bash +## Check output directory contents +ls -la ./output_models/ + +## Verify model files are present +find ./output_models/ -name "*.bin" -o -name "*.safetensors" -o -name "config.json" +``` + +You should see model weight files, configuration files, and tokenizer files in the output directory. + +## Step 6. 
Test model loading + +Verify the quantized model can be loaded properly using a simple Python test. + +```bash +## Create test script +cat > test_model.py << 'EOF' +import os +from transformers import AutoTokenizer, AutoModelForCausalLM + +model_path = "./output_models" +try: + tokenizer = AutoTokenizer.from_pretrained(model_path) + model = AutoModelForCausalLM.from_pretrained(model_path) + print(f"✓ Model loaded successfully from {model_path}") + print(f"Model config: {model.config}") +except Exception as e: + print(f"✗ Error loading model: {e}") +EOF + +## Run the test +python test_model.py +``` + +## Step 7. Troubleshooting + +| Symptom | Cause | Fix | +|---------|--------|-----| +| "Permission denied" when accessing Hugging Face | Missing or invalid HF token | Run `huggingface-cli login` with valid token | +| Container exits with CUDA out of memory | Insufficient GPU memory | Reduce batch size or use a machine with more GPU memory | +| Model files not found in output directory | Volume mount failed or wrong path | Verify `$(pwd)/output_models` resolves correctly | +| Git clone fails inside container | Network connectivity issues | Check internet connection and retry | +| Quantization process hangs | Container resource limits | Increase Docker memory limits or use --ulimit flags | + +## Step 8. Cleanup and rollback + +To clean up the environment and remove generated files: + +> **Warning:** This will permanently delete all quantized model files and cached data. + +```bash +## Remove output directory and all quantized models +rm -rf ./output_models + +## Remove Hugging Face cache (optional) +rm -rf ~/.cache/huggingface + +## Remove Docker image (optional) +docker rmi nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc3 +``` + +## Step 9. Next steps + +The quantized model is now ready for deployment. 
Common next steps include: +- Benchmarking inference performance compared to the original model +- Integrating the quantized model into your inference pipeline +- Deploying to NVIDIA Triton Inference Server for production serving +- Running additional validation tests on your specific use cases diff --git a/nvidia/ollama/README.md b/nvidia/ollama/README.md new file mode 100644 index 0000000..ad0ffe8 --- /dev/null +++ b/nvidia/ollama/README.md @@ -0,0 +1,239 @@ +# Ollama + +> Install and use Ollama + +## Table of Contents + +- [Overview](#overview) +- [Instructions](#instructions) +- [Access with NVIDIA Sync](#access-with-nvidia-sync) + +--- + +## Overview + +## Basic Idea + +This playbook demonstrates how to set up remote access to an Ollama server running on your NVIDIA +Spark device using NVIDIA Sync's Custom Apps feature. You'll install Ollama on your Spark device, +configure NVIDIA Sync to create an SSH tunnel, and access the Ollama API from your local machine. +This eliminates the need to expose ports on your network while enabling AI inference from your +laptop through a secure SSH tunnel. + +## What you'll accomplish + +You will have Ollama running on your NVIDIA Spark with Blackwell architecture and accessible via +API calls from your local laptop. This setup allows you to build applications or use tools on your +local machine that communicate with the Ollama API for large language model inference, leveraging +the powerful GPU capabilities of your Spark device without complex network configuration. 
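Once the tunnel described below is active, any HTTP client on your laptop can drive the API exactly as if Ollama were local. A hedged sketch in Python — the model name and port match this playbook's examples, and the request shape follows Ollama's `/api/chat` endpoint:

```python
import json
import urllib.request

# Compose a non-streaming request for Ollama's /api/chat endpoint, reached
# through the NVIDIA Sync tunnel on localhost:11434. Sending is deferred so
# the payload can be inspected before the tunnel is up.
def build_ollama_chat(model, prompt, host="http://localhost:11434"):
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return one JSON object instead of a token stream
    }
    return urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_ollama_chat("qwen2.5:32b", "Write me a haiku about GPUs and AI.")
# With the tunnel active, the reply text would be:
# json.load(urllib.request.urlopen(req))["message"]["content"]
```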

## What to know before starting

- Working with SSH connections and system tray applications
- Basic familiarity with terminal commands and cURL for API testing
- Understanding of REST API concepts and JSON formatting
- Experience with container environments and GPU-accelerated workloads

## Prerequisites

- [ ] DGX Spark device set up and connected to your network
  - Verify with: `nvidia-smi` (should show Blackwell GPU information)
- [ ] NVIDIA Sync installed and connected to your Spark
  - Verify connection status in NVIDIA Sync system tray application
- [ ] Terminal access to your local machine for testing API calls
  - Verify with: `curl --version`

## Time & risk

**Duration**: 10-15 minutes for initial setup, plus model download time (varies by model size and network speed)

**Risk level**: Low - the Ollama install is the only system-level change; the tunnel is easily reversible by stopping the custom app

**Rollback**: Stop the custom app in NVIDIA Sync and, if needed, uninstall Ollama using the cleanup commands in Step 10

## Instructions

## Step 1. Verify Ollama installation status

**Description**: Check if Ollama is already installed on your NVIDIA Spark device. This runs on the Spark device through NVIDIA Sync terminal to determine if installation is needed.

```bash
ollama --version
```

If you see version information, skip to Step 3. If you get "command not found", proceed to Step 2.

## Step 2. Install Ollama on your Spark device

**Description**: Download and install Ollama using the official installation script. This runs on the Spark device and installs the Ollama binary and service components.

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

Wait for the installation to complete. You should see output indicating successful installation.

## Step 3. Download and verify a language model

**Description**: Pull a language model to your Spark device. This downloads the model files and makes them available for inference. 
The example uses Qwen2.5 32B.

```bash
ollama pull qwen2.5:32b
```

Expected output:
```
pulling manifest
pulling 58574f2e94b9: 100% ████████████████████████████   18 GB
pulling 53e4ea15e8f5: 100% ████████████████████████████  1.5 KB
pulling d18a5cc71b84: 100% ████████████████████████████   11 KB
pulling cff3f395ef37: 100% ████████████████████████████   120 B
pulling 3cdc64c2b371: 100% ████████████████████████████   494 B
verifying sha256 digest
writing manifest
success
```

## Step 4. Access NVIDIA Sync settings

**Description**: Open the NVIDIA Sync configuration interface on your local machine to add a new custom application tunnel. This runs on your local laptop/workstation.

1. Click on the NVIDIA Sync logo in your system tray/taskbar
2. Click on the gear icon in the top right corner to open the Settings window
3. Click on the "Custom" tab

## Step 5. Configure Ollama custom app in NVIDIA Sync

**Description**: Create a new custom application entry that will establish an SSH tunnel to the Ollama server running on port 11434. This configuration runs on your local machine.

1. Click the "Add New" button
2. Fill out the form with these values:
   - **Name**: `Ollama Server`
   - **Port**: `11434`
   - **Auto open in browser**: Leave unchecked (this is an API, not a web interface)
   - **Start Script**: Leave empty
3. Click "Add"

The new Ollama Server entry should now appear in your NVIDIA Sync custom apps list.

## Step 6. Start the SSH tunnel

**Description**: Activate the SSH tunnel to make the remote Ollama server accessible on your local machine. This creates a secure connection from localhost:11434 to your Spark device.

1. Click on the NVIDIA Sync logo in your system tray/taskbar
2. Under the "Custom" section, click on "Ollama Server"

The tunnel is active when you see the connection status indicator in NVIDIA Sync.

## Step 7. 
Validate API connectivity + +**Description**: Test the Ollama API connection from your local machine to ensure the tunnel is +working correctly. This runs on your local laptop terminal. + +```bash +curl http://localhost:11434/api/chat -d '{ + "model": "qwen2.5:32b", + "messages": [{ + "role": "user", + "content": "Write me a haiku about GPUs and AI." + }], + "stream": false +}' +``` + +Expected response format: +```json +{ + "model": "qwen2.5:32b", + "created_at": "2024-01-15T12:30:45.123Z", + "message": { + "role": "assistant", + "content": "Silicon power flows\nThrough circuits, dreams become real\nAI awakens" + }, + "done": true +} +``` + +## Step 8. Test additional API endpoints + +**Description**: Verify other Ollama API functionality to ensure full operation. These commands +run on your local machine and test different API capabilities. + +Test model listing: +```bash +curl http://localhost:11434/api/tags +``` + +Test streaming responses: +```bash +curl -N http://localhost:11434/api/chat -d '{ + "model": "qwen2.5:32b", + "messages": [{"role": "user", "content": "Count to 5 slowly"}], + "stream": true +}' +``` + +## Step 9. Troubleshooting + +**Description**: Common issues and their solutions when setting up Ollama with NVIDIA Sync. + +| Symptom | Cause | Fix | +|---------|--------|-----| +| "Connection refused" on localhost:11434 | SSH tunnel not active | Start Ollama Server in NVIDIA Sync custom apps | +| Model download fails with disk space error | Insufficient storage on Spark | Free up space or choose smaller model (e.g., qwen2.5:7b) | +| Ollama command not found after install | Installation path not in PATH | Restart terminal session or run `source ~/.bashrc` | +| API returns "model not found" error | Model not pulled or wrong name | Run `ollama list` to verify available models | +| Slow inference on Spark | Model too large for GPU memory | Try smaller model or check GPU memory with `nvidia-smi` | + +## Step 10. 
Cleanup and rollback

**Description**: How to remove the setup and return to the original state.

To stop the tunnel:
1. Open NVIDIA Sync and click "Ollama Server" to deactivate

To remove the custom app:
1. Open NVIDIA Sync Settings → Custom tab
2. Select "Ollama Server" and click "Remove"

**Warning**: To completely uninstall Ollama from your Spark device:

```bash
sudo systemctl stop ollama
sudo systemctl disable ollama
sudo rm /usr/local/bin/ollama
sudo rm -rf /usr/share/ollama
sudo userdel ollama
```

This will remove all Ollama files and downloaded models.

## Step 11. Next steps

**Description**: Explore additional functionality and integration options with your working Ollama setup.

Test different models from the [Ollama library](https://ollama.com/library):
```bash
ollama pull llama3.1:8b
ollama pull codellama:13b
ollama pull phi3.5:3.8b
```

Monitor GPU and system usage during inference using the DGX Dashboard available through NVIDIA Sync.

Build applications using the Ollama API by integrating with your preferred programming language's HTTP client libraries.

## Access with NVIDIA Sync

## Step 1. (DRAFT)

diff --git a/nvidia/open-webui/README.md b/nvidia/open-webui/README.md
new file mode 100644
index 0000000..0c843b3
--- /dev/null
+++ b/nvidia/open-webui/README.md
@@ -0,0 +1,401 @@
# Use Open WebUI

> Install Open WebUI and chat with models on your Spark

## Table of Contents

- [Overview](#overview)
- [Instructions](#instructions)
- [Setup Open WebUI on Remote Spark with NVIDIA Sync](#setup-open-webui-on-remote-spark-with-nvidia-sync)

---

## Overview

## Basic Idea

Open WebUI is an extensible, self-hosted AI interface that operates entirely offline. This playbook shows you how to deploy Open WebUI with an integrated Ollama server on your DGX Spark device using NVIDIA Sync. 
The setup creates a secure SSH tunnel that lets you access the web +interface from your local browser while the models run on Spark's GPU. + +## What you'll accomplish + +You will have a fully functional Open WebUI installation running on your DGX Spark, accessible through +your local web browser via NVIDIA Sync's managed SSH tunneling. The setup includes integrated Ollama +for model management, persistent data storage, and GPU acceleration for model inference. + +## What to know before starting + +- How to use NVIDIA Sync to connect to your DGX Spark device + +## Prerequisites + +- DGX Spark device is set up and accessible +- NVIDIA Sync installed and connected to your DGX Spark +- Enough disk space for the Open WebUI container image and model downloads + + +## Time & risk + +**Duration**: 15-20 minutes for initial setup, plus model download time (varies by model size) + +**Risks**: +- Docker permission issues may require user group changes and session restart +- Large model downloads may take significant time depending on network speed + +**Rollback**: Stop and remove Docker containers using provided cleanup commands, remove Custom Port +from NVIDIA Sync settings + +## Instructions + +## Step 1. Configure Docker permissions + +To easily manage containers without sudo, you must be in the `docker` group. If you choose to skip this step, you will need to run Docker commands with sudo. + +Open a new terminal and test Docker access. In the terminal, run: + +```bash +docker ps +``` + +If you see a permission denied error (something like `permission denied while trying to connect to the Docker daemon socket`), add your user to the docker group: + +```bash +sudo usermod -aG docker $USER +``` + +> **Warning**: After running usermod, you must log out and log back in to start a new +> session with updated group permissions. + +## Step 2. 
Verify Docker setup and pull container + +Open a new terminal, pull the Open WebUI container image with integrated Ollama: + +```bash +docker pull ghcr.io/open-webui/open-webui:ollama +``` + +## Step 3. Start the Open WebUI container + +Start the Open WebUI container by running: + +```bash +docker run -d -p 8080:8080 --gpus=all \ + -v open-webui:/app/backend/data \ + -v open-webui-ollama:/root/.ollama \ + --name open-webui ghcr.io/open-webui/open-webui:ollama +``` + +This will start the Open WebUI container and make it accessible at `http://localhost:8080`. You can access the Open WebUI interface from your local web browser. + +Application data will be stored in the `open-webui` volume and model data will be stored in the `open-webui-ollama` volume. + +## Step 4. Create administrator account + +This step sets up the initial administrator account for Open WebUI. This is a local account that you will use to access the Open WebUI interface. + +In the Open WebUI interface, click the "Get Started" button at the bottom of the screen. + +Fill out the administrator account creation form with your preferred credentials. + +Click the registration button to create your account and access the main interface. + +## Step 5. Download and configure a model + +This step downloads a language model through Ollama and configures it for use in +Open WebUI. The download happens on your DGX Spark device and may take several minutes. + +Click on the "Select a model" dropdown in the top left corner of the Open WebUI interface. + +Type `gpt-oss:20b` in the search field. + +Click the "Pull 'gpt-oss:20b' from Ollama.com" button that appears. + +Wait for the model download to complete. You can monitor progress in the interface. + +Once complete, select "gpt-oss:20b" from the model dropdown. + +## Step 6. Test the model + +This step verifies that the complete setup is working properly by testing model +inference through the web interface. 
In the chat textarea at the bottom of the Open WebUI interface, enter:

```
Write me a haiku about GPUs
```

Press Enter to send the message and wait for the model's response.

## Step 7. Troubleshooting

Common issues and their solutions.

| Symptom | Cause | Fix |
|---------|-------|-----|
| Permission denied on docker ps | User not in docker group | Run Step 1 completely, including logging out and back in, or use sudo |
| Model download fails | Network connectivity issues | Check internet connection, retry download |
| GPU not detected in container | Missing --gpus=all flag | Recreate container with correct command |
| Port 8080 already in use | Another application using port | Change port in docker command or stop conflicting service |

## Step 8. Cleanup and rollback

Steps to completely remove the Open WebUI installation and free up resources.

> **Warning**: These commands will permanently delete all Open WebUI data and downloaded models.

Stop and remove the Open WebUI container:

```bash
docker stop open-webui
docker rm open-webui
```

Remove the downloaded image:

```bash
docker rmi ghcr.io/open-webui/open-webui:ollama
```

Remove persistent data volumes:

```bash
docker volume rm open-webui open-webui-ollama
```

To roll back the permission change: `sudo deluser $USER docker`

## Step 9. Next steps

Try downloading different models from the Ollama library at https://ollama.com/library.

You can monitor GPU and memory usage through the DGX Dashboard available in NVIDIA Sync as you try different models.

If Open WebUI reports an update is available, you can update the container image by running:

```bash
docker pull ghcr.io/open-webui/open-webui:ollama
```

## Set up Open WebUI on Remote Spark with NVIDIA Sync

> **Note**: If you haven't already installed NVIDIA Sync, [learn how here.](/spark/connect-to-your-spark/sync)

## Step 1. Configure Docker permissions

To manage containers using NVIDIA Sync, you must be able to run Docker commands without sudo.

Open the Terminal app from NVIDIA Sync to start an interactive SSH session and test Docker access. In the terminal, run:

```bash
docker ps
```

If you see a permission denied error (something like `permission denied while trying to connect to the Docker daemon socket`), add your user to the docker group:

```bash
sudo usermod -aG docker $USER
```

> **Warning**: After running usermod, you must close the terminal window completely to start a new
> session with updated group permissions.

## Step 2. Verify Docker setup and pull container

This step confirms Docker is working properly and downloads the Open WebUI container
image. This runs on the DGX Spark device and may take several minutes depending on network speed.

Open a new Terminal app from NVIDIA Sync and pull the Open WebUI container image with integrated Ollama:

```bash
docker pull ghcr.io/open-webui/open-webui:ollama
```

Once the container image is downloaded, continue to set up NVIDIA Sync.

## Step 3. Open NVIDIA Sync Settings

Click on the NVIDIA Sync icon in your system tray or taskbar to open the main application window.

Click the gear icon in the top right corner to open the Settings window.

Click on the "Custom" tab to access Custom Ports configuration.

## Step 4. Add Open WebUI Custom Port

This step creates a new entry in NVIDIA Sync that will manage the Open
WebUI container and create the necessary SSH tunnel.

Click the "Add New" button in the Custom section.
Fill out the form with these values:

**Name**: Open WebUI

**Port**: 12000

**Auto open in browser at the following path**: Check this checkbox

**Start Script**: Copy and paste this entire script:

```bash
#!/usr/bin/env bash
set -euo pipefail

NAME="open-webui"
IMAGE="ghcr.io/open-webui/open-webui:ollama"

cleanup() {
  echo "Signal received; stopping ${NAME}..."
  docker stop "${NAME}" >/dev/null 2>&1 || true
  exit 0
}
trap cleanup INT TERM HUP QUIT EXIT

# Ensure Docker CLI and daemon are available
if ! docker info >/dev/null 2>&1; then
  echo "Error: Docker daemon not reachable." >&2
  exit 1
fi

# Already running?
if [ -n "$(docker ps -q --filter "name=^${NAME}$" --filter "status=running")" ]; then
  echo "Container ${NAME} is already running."
else
  # Exists but stopped? Start it.
  if [ -n "$(docker ps -aq --filter "name=^${NAME}$")" ]; then
    echo "Starting existing container ${NAME}..."
    docker start "${NAME}" >/dev/null
  else
    # Not present: create and start it.
    echo "Creating and starting ${NAME}..."
    docker run -d -p 12000:8080 --gpus=all \
      -v open-webui:/app/backend/data \
      -v open-webui-ollama:/root/.ollama \
      --name "${NAME}" "${IMAGE}" >/dev/null
  fi
fi

echo "Running. Press Ctrl+C to stop ${NAME}."
# Keep the script alive until a signal arrives
while :; do sleep 86400; done
```

Click the "Add" button to save the configuration.

## Step 5. Launch Open WebUI

This step starts the Open WebUI container on your DGX Spark and establishes the SSH
tunnel. The browser will open automatically if configured correctly.

Click on the NVIDIA Sync icon in your system tray or taskbar to open the main application window.

Under the "Custom" section, click on "Open WebUI".

Your default web browser should automatically open to the Open WebUI interface at `http://localhost:12000`.

## Step 6. Create administrator account

This step sets up the initial administrator account for Open WebUI. This is a local account that you will use to access the Open WebUI interface.

In the Open WebUI interface, click the "Get Started" button at the bottom of the screen.

Fill out the administrator account creation form with your preferred credentials.

Click the registration button to create your account and access the main interface.

## Step 7. Download and configure a model

This step downloads a language model through Ollama and configures it for use in
Open WebUI. The download happens on your DGX Spark device and may take several minutes.

Click on the "Select a model" dropdown in the top left corner of the Open WebUI interface.

Type `gpt-oss:20b` in the search field.

Click the "Pull 'gpt-oss:20b' from Ollama.com" button that appears.

Wait for the model download to complete. You can monitor progress in the interface.

Once complete, select "gpt-oss:20b" from the model dropdown.

## Step 8. Test the model

This step verifies that the complete setup is working properly by testing model
inference through the web interface.

In the chat textarea at the bottom of the Open WebUI interface, enter:

```
Write me a haiku about GPUs
```

Press Enter to send the message and wait for the model's response.

## Step 9. Stop the Open WebUI

When you are finished with your session and want to stop the Open WebUI server and reclaim resources, close Open WebUI from NVIDIA Sync.

Click on the NVIDIA Sync icon in your system tray or taskbar to open the main application window.

Under the "Custom" section, click the `x` icon on the right of the "Open WebUI" entry.

This will close the tunnel and stop the Open WebUI Docker container.

## Step 10. Troubleshooting

Common issues and their solutions.
| Symptom | Cause | Fix |
|---------|-------|-----|
| Permission denied on docker ps | User not in docker group | Run Step 1 completely, including terminal restart |
| Browser doesn't open automatically | Auto-open setting disabled | Manually navigate to localhost:12000 |
| Model download fails | Network connectivity issues | Check internet connection, retry download |
| GPU not detected in container | Missing --gpus=all flag | Recreate container with correct start script |
| Port 12000 already in use | Another application using port | Change port in Custom App settings or stop conflicting service |

## Step 11. Cleanup and rollback

Steps to completely remove the Open WebUI installation and free up resources.

> **Warning**: These commands will permanently delete all Open WebUI data and downloaded models.

Stop and remove the Open WebUI container:

```bash
docker stop open-webui
docker rm open-webui
```

Remove the downloaded image:

```bash
docker rmi ghcr.io/open-webui/open-webui:ollama
```

Remove persistent data volumes:

```bash
docker volume rm open-webui open-webui-ollama
```

To roll back the permission change: `sudo deluser $USER docker`

Remove the Custom App from NVIDIA Sync by opening Settings > Custom tab and deleting the entry.

## Step 12. Next steps

Try downloading different models from the Ollama library at https://ollama.com/library.

You can monitor GPU and memory usage through the DGX Dashboard available in NVIDIA Sync as you try different models.

If Open WebUI reports an update is available, you can update the container image by running:

```bash
docker pull ghcr.io/open-webui/open-webui:ollama
```

Then launch Open WebUI again from NVIDIA Sync.
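Models can also be pulled from the command line instead of the UI. A sketch, assuming the `open-webui` container created by the start script is running and that the bundled image ships the `ollama` CLI:

```bash
# Pull a model into the open-webui-ollama volume (assumes the bundled ollama CLI)
docker exec open-webui ollama pull gpt-oss:20b

# Confirm it shows up in the model list
docker exec open-webui ollama list
```

Models pulled this way appear in the Open WebUI model dropdown the same as UI-initiated downloads.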
diff --git a/nvidia/protein-folding/README.md b/nvidia/protein-folding/README.md
new file mode 100644
index 0000000..4a70479
--- /dev/null
+++ b/nvidia/protein-folding/README.md
@@ -0,0 +1,415 @@

# Use OpenFold

> Use OpenFold with TensorRT optimization

## Table of Contents

- [Overview](#overview)
- [Access through terminal](#access-through-terminal)
  - [Step 7. Option B - Run locally with demo script](#step-7-option-b-run-locally-with-demo-script)
  - [Using a custom FASTA file](#using-a-custom-fasta-file)

---

## Overview

## What you'll accomplish

You'll set up a GPU-accelerated protein folding workflow on NVIDIA Spark devices using OpenFold
with TensorRT optimization and MMseqs2-GPU. After completing this walkthrough, you'll be able to
fold proteins in under 60 seconds using either NVIDIA's cloud UI or running locally on your
RTX Pro 6000 or DGX Spark workstation.

## What to know before starting

- Installing Python packages via pip
- Using Docker and the NVIDIA Container Toolkit for GPU workflows
- Running basic Linux commands and setting environment variables
- Understanding FASTA files and basics of protein structure workflows
- Working with CUDA-enabled applications

## Prerequisites

- [ ] NVIDIA GPU (RTX Pro 6000 or DGX Spark recommended)
  ```bash
  nvidia-smi  # Should show GPU with CUDA ≥12.9
  ```
- [ ] NVIDIA drivers and CUDA toolkit installed
  ```bash
  nvcc --version  # Should show CUDA 12.9 or higher
  ```
- [ ] Docker with NVIDIA Container Toolkit
  ```bash
  docker run --rm --gpus all nvidia/cuda:12.9.0-base-ubuntu22.04 nvidia-smi
  ```
- [ ] Python 3.8+ environment
  ```bash
  python3 --version  # Should show 3.8 or higher
  ```
- [ ] Sufficient disk space for databases (>3TB recommended)
  ```bash
  df -h  # Check available space
  ```

## Ancillary files

- OpenFold parameters (`finetuning_ptm_2.pt`) — pre-trained model weights for structure prediction
- PDB70 database — template structures for homology modeling
- UniRef90 database — sequence database for MSA generation
- MGnify database — metagenomic sequences for MSA generation
- Uniclust30 database — clustered UniProt sequences for MSA generation
- BFD database — large sequence database for MSA generation
- MMCIF files — template structure files in mmCIF format
- py3Dmol package — Python library for 3D protein visualization

## Time & risk

**Duration:** Initial setup takes 2-4 hours (mainly downloading databases). Each protein fold takes
<60 seconds on GPU vs hours on CPU.

**Risks:**
- Database downloads may fail due to network interruptions
- Insufficient disk space for full databases
- GPU memory limitations for very large proteins (>2000 residues)

**Rollback:** All operations are read-only after setup. Remove downloaded databases and output
directories to clean up.

## Access through terminal

## Step 1. Verify GPU and CUDA installation

Confirm your system has the required GPU and CUDA version for running OpenFold with TensorRT
optimization.

```bash
nvidia-smi
```

Expected output should show an NVIDIA GPU with CUDA version ≥12.9. For DGX Spark or RTX Pro
6000, you should see the appropriate GPU model listed.

```bash
nvcc --version
```

This should display CUDA compilation tools, release 12.9 or higher.

## Step 2. Set up Python environment

Create a Python virtual environment and install the required packages for protein folding and
visualization.

```bash
python3 -m venv openfold_env
source openfold_env/bin/activate
pip install --upgrade pip
```

Install the py3Dmol visualization package:

```bash
pip install py3Dmol
```

## Step 3. Download OpenFold and databases

Download the OpenFold repository and required databases. This step requires significant disk
space and network bandwidth.
> TODO: Add specific download URLs for OpenFold repository from official GitHub

```bash
# Clone OpenFold repository
git clone
cd openfold
pip install -e .
```

Download the model parameters:

> TODO: Add direct download URL for finetuning_ptm_2.pt

```bash
mkdir -p openfold_params
wget -O openfold_params/finetuning_ptm_2.pt
```

## Step 4. Download sequence databases

Download all required databases for MSA generation. Each database serves a specific purpose in
the folding pipeline.

> TODO: Add specific download URLs for each database from official sources

```bash
# Create database directory
mkdir -p databases
cd databases

# Download PDB70 (for template structures)
wget
tar -xzf pdb70.tar.gz

# Download UniRef90 (for MSA)
wget
tar -xzf uniref90.tar.gz

# Download MGnify (metagenomic sequences)
wget
tar -xzf mgnify.tar.gz

# Download Uniclust30 (clustered sequences)
wget
tar -xzf uniclust30.tar.gz

# Download BFD (large sequence database)
wget
tar -xzf bfd.tar.gz

# Download MMCIF files (structure templates)
wget
tar -xzf mmcif.tar.gz

cd ..
```

## Step 5. Configure environment variables

Set up environment variables pointing to your downloaded databases and parameters.

```bash
export OF_PARAM_PATH="$(pwd)/openfold_params/finetuning_ptm_2.pt"
export OF_DB_PDB70="$(pwd)/databases/pdb70"
export OF_DB_UNIREF90="$(pwd)/databases/uniref90"
export OF_DB_MGNIFY="$(pwd)/databases/mgnify"
export OF_DB_UNICLUST30="$(pwd)/databases/uniclust30"
export OF_DB_BFD="$(pwd)/databases/bfd"
export OF_DB_MMCIF="$(pwd)/databases/pdb_mmcif/mmcif_files"
export OF_DB_OBSOLETE="$(pwd)/databases/pdb_mmcif/obsolete.dat"
export OF_DEVICE="cuda:0"
export OF_OUTDIR="openfold_out"
export OF_JOB="demo"
```

## Step 6. Option A - Use NVIDIA Build Portal (Cloud UI)

For quick testing without local setup, use NVIDIA's online demo interface.

1. Navigate to the OpenFold2 page on NVIDIA Build Portal
   > TODO: Add specific URL for NVIDIA Build Portal OpenFold2 demo

2. Paste your protein sequence in FASTA format

3. Click "Run" to execute the folding pipeline

4. View results in the integrated Mol* or py3Dmol viewer

### Step 7. Option B - Run locally with demo script

Create and run the OpenFold demo script for local execution on your DGX Spark or RTX Pro 6000.

Create the demo script file:

```bash
cat > openfold_demo.py << 'EOF'
#!/usr/bin/env python3
"""
Single-file OpenFold runner + py3Dmol viewer.
"""
import os, subprocess as sp, sys, tempfile, textwrap

# Paths (edit for your system)
PARAM = os.getenv("OF_PARAM_PATH", "/path/to/openfold_params/finetuning_ptm_2.pt")
PDB70 = os.getenv("OF_DB_PDB70", "/path/to/pdb70")
UNIREF90 = os.getenv("OF_DB_UNIREF90", "/path/to/uniref90")
MGNIFY = os.getenv("OF_DB_MGNIFY", "/path/to/mgnify")
UNICLUST30 = os.getenv("OF_DB_UNICLUST30", "/path/to/uniclust30")
BFD = os.getenv("OF_DB_BFD", "/path/to/bfd")
MMCIF = os.getenv("OF_DB_MMCIF", "/path/to/pdb_mmcif/mmcif_files")
OBSOLETE = os.getenv("OF_DB_OBSOLETE", "/path/to/pdb_mmcif/obsolete.dat")
DEVICE = os.getenv("OF_DEVICE", "cuda:0")
OUTDIR = os.getenv("OF_OUTDIR", "openfold_out")
JOB = os.getenv("OF_JOB", "demo")

SEQ = """>demo
MGSDKIHHHHHHENLYFQGAMASMTGGQQMGRGSMAAAAKKVVAGAAAAGGQAGD"""

def ensure_py3dmol():
    # Install py3Dmol on first run if it is missing
    try:
        import py3Dmol
    except ImportError:
        sp.check_call([sys.executable, "-m", "pip", "install", "py3Dmol"])

def run_openfold(fasta_path):
    cmd = [
        sys.executable, "openfold/run_pretrained_openfold.py",
        "--fasta_path", fasta_path,
        "--job_name", JOB,
        "--output_dir", OUTDIR,
        "--model_device", DEVICE,
        "--param_path", PARAM,
        "--pdb70_database_path", PDB70,
        "--uniref90_database_path", UNIREF90,
        "--mgnify_database_path", MGNIFY,
        "--uniclust30_database_path", UNICLUST30,
        "--bfd_database_path", BFD,
        "--template_mmcif_dir", MMCIF,
        "--obsolete_pdbs_path", OBSOLETE,
        "--skip_relaxation"
    ]
    sp.check_call(cmd)

def visualize():
    import py3Dmol
    pdb = open(f"{OUTDIR}/{JOB}/ranked_0.pdb").read()
    view = py3Dmol.view(width=800, height=520)
    view.addModel(pdb, "pdb")
    view.setStyle({"cartoon": {"arrows": True}})
    view.zoomTo()
    open(f"{OUTDIR}/{JOB}_view.html", "w").write(view._make_html())
    print(f"Viewer written to {OUTDIR}/{JOB}_view.html")

def main():
    ensure_py3dmol()
    with tempfile.TemporaryDirectory() as td:
        fasta_path = os.path.join(td, f"{JOB}.fasta")
        open(fasta_path, "w").write(textwrap.dedent(SEQ).strip() + "\n")
        run_openfold(fasta_path)
        visualize()

if __name__ == "__main__":
    main()
EOF
```

Make the script executable and run it:

```bash
chmod +x openfold_demo.py
python openfold_demo.py
```

## Step 8. Validate the output

Check that the folding completed successfully and view the generated structure.

```bash
# Verify PDB file was created
ls -la openfold_out/demo/ranked_0.pdb
```

The file should exist and be non-empty (typically >10KB for a small protein).

```bash
# Check the HTML viewer was generated
ls -la openfold_out/demo_view.html
```

Open the HTML file in a web browser to visualize the folded protein structure:

```bash
# On Linux with GUI
xdg-open openfold_out/demo_view.html

# Or copy the full path and open in browser manually
realpath openfold_out/demo_view.html
```

## Step 9. Run with custom sequences

To fold your own protein sequences, modify the demo script or create a new FASTA file.
### Using a custom FASTA file

```bash
# Create your FASTA file
cat > my_protein.fasta << 'EOF'
>my_protein
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLPARTVETRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPGCMNCKCVIS
EOF

# Run OpenFold directly
python openfold/run_pretrained_openfold.py \
  --fasta_path my_protein.fasta \
  --job_name my_protein \
  --output_dir openfold_out \
  --model_device cuda:0 \
  --param_path $OF_PARAM_PATH \
  --pdb70_database_path $OF_DB_PDB70 \
  --uniref90_database_path $OF_DB_UNIREF90 \
  --mgnify_database_path $OF_DB_MGNIFY \
  --uniclust30_database_path $OF_DB_UNICLUST30 \
  --bfd_database_path $OF_DB_BFD \
  --template_mmcif_dir $OF_DB_MMCIF \
  --obsolete_pdbs_path $OF_DB_OBSOLETE \
  --skip_relaxation
```

## Step 10. Troubleshooting common issues

| Symptom | Cause | Fix |
|---------|-------|-----|
| CUDA out of memory error | Protein too large for GPU | Reduce max_template_date or use smaller sequence |
| Database file not found | Incomplete download or wrong path | Verify all databases downloaded and paths in env vars |
| ImportError: No module named 'openfold' | OpenFold not installed | Run `pip install -e .` in openfold directory |
| nvidia-smi command not found | NVIDIA drivers not installed | Install NVIDIA drivers for your GPU |
| Folding takes hours instead of minutes | Running on CPU instead of GPU | Check OF_DEVICE="cuda:0" and GPU availability |
| py3Dmol viewer shows blank page | JavaScript blocked or path issue | Use absolute path to HTML file or check browser console |

## Step 11. Cleanup and rollback

Remove generated outputs and optionally remove downloaded databases.

```bash
# Remove output files only (safe)
rm -rf openfold_out/

# Remove virtual environment (reversible)
deactivate
rm -rf openfold_env/
```

> **Warning:** The following will delete downloaded databases (>3TB). Only run if you need to
> free disk space and are willing to re-download.

```bash
# Remove all databases (requires re-download)
rm -rf databases/

# Remove OpenFold repository
rm -rf openfold/
```

## Step 12. Next steps

Test the installation with a well-known protein structure to verify accuracy:

```bash
# Test with ubiquitin (PDB: 1UBQ)
cat > test_ubiquitin.fasta << 'EOF'
>1UBQ
MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG
EOF

python openfold/run_pretrained_openfold.py \
  --fasta_path test_ubiquitin.fasta \
  --job_name ubiquitin_test \
  --output_dir openfold_out \
  --model_device cuda:0 \
  --param_path $OF_PARAM_PATH \
  --pdb70_database_path $OF_DB_PDB70 \
  --uniref90_database_path $OF_DB_UNIREF90 \
  --mgnify_database_path $OF_DB_MGNIFY \
  --uniclust30_database_path $OF_DB_UNICLUST30 \
  --bfd_database_path $OF_DB_BFD \
  --template_mmcif_dir $OF_DB_MMCIF \
  --obsolete_pdbs_path $OF_DB_OBSOLETE \
  --skip_relaxation
```

For production use, consider:
- Enabling structure relaxation for higher accuracy (remove `--skip_relaxation`)
- Setting up batch processing for multiple sequences
- Integrating with drug discovery pipelines
- Scaling to full proteomes using DGX Spark clusters

diff --git a/nvidia/rag-ai-workbench/README.md b/nvidia/rag-ai-workbench/README.md
new file mode 100644
index 0000000..ea4579e
--- /dev/null
+++ b/nvidia/rag-ai-workbench/README.md
@@ -0,0 +1,189 @@

# RAG application in AI Workbench

> Install and use AI Workbench to clone and run a reproducible RAG application

## Table of Contents

- [Overview](#overview)
- [Instructions](#instructions)
  - [Troubleshooting installation issues](#troubleshooting-installation-issues)
  - [Substep A: Upload sample dataset](#substep-a-upload-sample-dataset)
  - [Substep B: Test custom dataset (optional)](#substep-b-test-custom-dataset-optional)

---

## Overview

## Basic Idea

This walkthrough demonstrates how to set up and run an agentic retrieval-augmented generation (RAG)
project using NVIDIA AI Workbench. You'll use AI Workbench to clone and run a pre-built agentic RAG
application that intelligently routes queries, evaluates responses for relevancy and hallucination, and
iterates through evaluation and generation cycles. The project uses a Gradio web interface and can work
with either NVIDIA-hosted API endpoints or self-hosted models.

## What you'll accomplish

You'll have a fully functional agentic RAG application running in NVIDIA AI Workbench with a web
interface where you can submit queries and receive intelligent responses. The system will demonstrate
advanced RAG capabilities including query routing, response evaluation, and iterative refinement,
giving you hands-on experience with both AI Workbench's development environment and sophisticated RAG
architectures.

## What to know before starting

- Basic familiarity with retrieval-augmented generation (RAG) concepts
- Understanding of API keys and how to generate them
- Comfort working with web applications and browser interfaces
- Basic understanding of containerized development environments

## Prerequisites

- [ ] DGX Spark system with NVIDIA AI Workbench installed or ready to install
- [ ] Free NVIDIA API key: Generate at [NGC API Keys](https://org.ngc.nvidia.com/setup/api-keys)
- [ ] Free Tavily API key: Generate at [Tavily](https://tavily.com/)
- [ ] Internet connection for cloning repositories and accessing APIs
- [ ] Web browser for accessing the Gradio interface

**Verification:**

* Verify the NVIDIA AI Workbench application exists on your DGX Spark system
* Verify your API keys are valid and up-to-date

## Time & risk

**Estimated time:** 30-45 minutes (including AI Workbench installation if needed)

**Risk level:** Low - Uses pre-built containers and established APIs

**Rollback:** Simply delete the cloned project from AI Workbench to remove all components. No system
changes are made outside the AI Workbench environment.

## Instructions

## Step 1. Install NVIDIA AI Workbench

This step installs AI Workbench on your DGX Spark system and completes the initial setup wizard.

On your DGX Spark system, open the **NVIDIA AI Workbench** application and click **Begin Installation**.

1. The installation wizard will prompt for authentication
2. Wait for the automated install to complete (several minutes)
3. Click "Let's Get Started" when installation finishes

### Troubleshooting installation issues

If you encounter the error message `An error occurred ... container tool failed to reach ready state. try again: docker is not running`, reboot your DGX Spark system to restart the Docker service, then reopen NVIDIA AI Workbench.

## Step 2. Verify API key requirements

This step ensures you have the required API keys before proceeding with the project setup.

Verify you have both required API keys, and keep them safe:

* Tavily API Key: https://tavily.com/
* NVIDIA API Key: https://org.ngc.nvidia.com/setup/api-keys
  * Ensure this key has ``Public API Endpoints`` permissions

Keep both keys available for the next step.

## Step 3. Clone the agentic RAG project

This step clones the pre-built agentic RAG project from GitHub into your AI Workbench environment.

From the AI Workbench landing page, select the **Local** location if you have not done so already, then click **Clone Project** in the top right corner.

Paste this Git repository URL in the clone dialog: ``https://github.com/NVIDIA/workbench-example-agentic-rag``.

Click **Clone** to begin the clone and build process.

## Step 4. Configure project secrets

This step configures the API keys required for the agentic RAG application to function properly.

While the project builds, configure the API keys using the yellow warning banner that appears:

1. Click **Configure** in the yellow banner
2. Enter your ``NVIDIA_API_KEY``
3. Enter your ``TAVILY_API_KEY``
4. Save the configuration

Wait for the project build to complete before proceeding.

## Step 5. Launch the chat application

This step starts the web-based chat interface where you can interact with the agentic RAG system.

Navigate to **Environment** > **Project Container** > **Apps** > **Chat** and start the web application.

A browser window will open automatically and load the Gradio chat interface.

## Step 6. Test the basic functionality

This step verifies the agentic RAG system is working by submitting a sample query.

In the chat application, click on or type a sample query such as: `How do I add an integration in the CLI?`

Wait for the agentic system to process and respond. The response, while general, should demonstrate intelligent routing and evaluation.

## Step 7. Validate project

This step confirms the complete setup is working correctly by testing the core features.

Verify the following components are functioning:

```
✓ Web application loads without errors
✓ Sample queries return responses
✓ No API authentication errors appear
✓ The agentic reasoning process is visible in the interface under "Monitor"
```

## Step 8. Complete optional quickstart

This step demonstrates advanced features by uploading data, retrieving context, and testing custom queries.

### Substep A: Upload sample dataset

Complete the in-app quickstart instructions to upload the sample dataset and test improved RAG-based responses.

### Substep B: Test custom dataset (optional)

Upload a custom dataset, adjust the Router prompt, and submit custom queries to test customization.

## Step 9. Troubleshooting

This step provides solutions for common issues you might encounter while using the chat interface.
| Symptom | Cause | Fix |
|---------|-------|-----|
| Tavily API Error | Internet connection or DNS issues | Wait and retry query |
| 401 Unauthorized | Wrong or malformed API key | Replace key in Project Secrets and restart |
| 403 Forbidden | API key lacks permissions | Generate new key with proper access |
| Agentic loop timeout | Complex query exceeding time limit | Try simpler query or retry |

## Step 10. Cleanup and rollback

This step explains how to remove the project if needed and what changes were made to your system.

> **Warning:** This will permanently delete the project and all associated data.

To remove the project completely:

1. In AI Workbench, click on the three dots next to the project
2. Select "Delete Project"
3. Confirm deletion when prompted

**Rollback notes:** All changes are contained within AI Workbench. No system-level modifications were made outside the AI Workbench environment.

## Step 11. Next steps

This step provides guidance on further exploration and development with the agentic RAG system.

Explore advanced features:

* Modify component prompts in the project code
* Upload different documents to test routing and customization
* Experiment with different query types and complexity levels
* Review the agentic reasoning logs in the "Monitor" tab to understand decision-making

Consider customizing the Gradio UI or integrating the agentic RAG components into your own projects.
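To sanity-check your NVIDIA key from a terminal before entering it as a project secret, a hedged sketch (it assumes the key is accepted by the `integrate.api.nvidia.com` endpoint used for NVIDIA-hosted models; expect `200` for a valid key and `401` otherwise):

```bash
# Print only the HTTP status code of an authenticated request
curl -s -o /dev/null -w "%{http_code}\n" \
  -H "Authorization: Bearer $NVIDIA_API_KEY" \
  https://integrate.api.nvidia.com/v1/models
```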
diff --git a/nvidia/sglang/README.md b/nvidia/sglang/README.md
new file mode 100644
index 0000000..230b35d
--- /dev/null
+++ b/nvidia/sglang/README.md
@@ -0,0 +1,226 @@

# SGLang Inference Server

> Install and use SGLang on DGX Spark

## Table of Contents

- [Overview](#overview)
  - [Time & risk](#time-risk)
- [Instructions](#instructions)

---

## Overview

## Basic Idea

SGLang is a fast serving framework for large language models and vision language models that makes
your interaction with models faster and more controllable by co-designing the backend runtime and
frontend language. This setup uses the optimized NVIDIA SGLang NGC Container on a single NVIDIA
Spark device with Blackwell architecture, providing GPU-accelerated inference with all dependencies
pre-installed.

## What you'll accomplish

You'll deploy SGLang in both server and offline inference modes on your NVIDIA Spark device,
enabling high-performance LLM serving with support for text generation, chat completion, and
vision-language tasks using models like DeepSeek-V2-Lite.
## What to know before starting

- Working in a terminal environment on Linux systems
- Basic understanding of Docker containers and container management
- Familiarity with NVIDIA GPU drivers and CUDA toolkit concepts
- Experience with HTTP API endpoints and JSON request/response handling

## Prerequisites

- [ ] NVIDIA Spark device with Blackwell architecture
- [ ] Docker Engine installed and running: `docker --version`
- [ ] NVIDIA GPU drivers installed: `nvidia-smi`
- [ ] NVIDIA Container Toolkit configured: `docker run --rm --gpus all nvidia/cuda:12.9.0-base-ubuntu22.04 nvidia-smi`
- [ ] Sufficient disk space (>20GB available): `df -h`
- [ ] Network connectivity for pulling NGC containers: `ping nvcr.io`

## Ancillary files

- An offline inference python script [found here on GitLab](https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}/assets/offline-inference.py)

### Time & risk

**Duration:** 15-30 minutes for initial setup and validation

**Risk level:** Low - Uses pre-built, validated NGC container with minimal configuration

**Rollback:** Stop and remove containers with `docker stop` and `docker rm` commands

## Instructions

## Step 1. Verify system prerequisites

Check that your NVIDIA Spark device meets all requirements before proceeding. This step runs on
your host system and ensures Docker, GPU drivers, and container toolkit are properly configured.

```bash
# Verify Docker installation
docker --version

# Check NVIDIA GPU drivers
nvidia-smi

# Test NVIDIA Container Toolkit
docker run --rm --gpus all nvidia/cuda:12.9.0-base-ubuntu22.04 nvidia-smi

# Check available disk space
df -h /
```

## Step 2. Pull the SGLang NGC Container

Download the latest SGLang container from NVIDIA NGC. This step runs on the host and may take
several minutes depending on your network connection.
+ +> TODO: Verify the exact container tag/version for SGLang NGC container + +```bash +## Pull the SGLang container +docker pull nvcr.io/nvidia/sglang:-py3 + +## Verify the image was downloaded +docker images | grep sglang +``` + +## Step 3. Launch SGLang container for server mode + +Start the SGLang container in server mode to enable HTTP API access. This runs the inference +server inside the container, exposing it on port 30000 for client connections. + +```bash +## Launch container with GPU support and port mapping +docker run --gpus all -it --rm \ + -p 30000:30000 \ + -v /tmp:/tmp \ + nvcr.io/nvidia/sglang:-py3 \ + bash +``` + +## Step 4. Start the SGLang inference server + +Inside the container, launch the HTTP inference server with a supported model. This step runs +inside the Docker container and starts the SGLang server daemon. + +```bash +## Start the inference server with DeepSeek-V2-Lite model +python3 -m sglang.launch_server \ + --model-path deepseek-ai/DeepSeek-V2-Lite \ + --host 0.0.0.0 \ + --port 30000 \ + --trust-remote-code \ + --tp 1 & + +## Wait for server to initialize +sleep 30 + +## Check server status +curl http://localhost:30000/health +``` + +## Step 5. Test client-server inference + +From a new terminal on your host system, test the SGLang server API to ensure it's working +correctly. This validates that the server is accepting requests and generating responses. + +```bash +## Test with curl +curl -X POST http://localhost:30000/generate \ + -H "Content-Type: application/json" \ + -d '{ + "text": "What does NVIDIA love?", + "sampling_params": { + "temperature": 0.7, + "max_new_tokens": 100 + } + }' +``` + +## Step 6. Test Python client API + +Create a simple Python script to test programmatic access to the SGLang server. This runs on +the host system and demonstrates how to integrate SGLang into applications. 
+
+```python
+import requests
+
+## Send prompt to server
+response = requests.post('http://localhost:30000/generate', json={
+    'text': 'What does NVIDIA love?',
+    'sampling_params': {
+        'temperature': 0.7,
+        'max_new_tokens': 100,
+    },
+})
+
+print(f"Response: {response.json()['text']}")
+```
+
+## Step 7. Test offline inference mode
+
+Launch a new container instance for offline inference to demonstrate local model usage without
+an HTTP server. This runs entirely within the container for batch processing scenarios.
+
+> TODO: Incorporate the offline inference script from the assets. [See here](https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}/assets)
+
+## Step 8. Validate installation
+
+Confirm that both server and offline modes are working correctly. This step verifies the
+complete SGLang setup and ensures reliable operation.
+
+```bash
+## Check server mode (from host)
+curl http://localhost:30000/health
+curl -X POST http://localhost:30000/generate -H "Content-Type: application/json" \
+  -d '{"text": "Hello", "sampling_params": {"max_new_tokens": 10}}'
+
+## Check container logs
+docker ps
+docker logs <container-id>
+```
+
+## Step 9. Troubleshooting
+
+Common issues and their resolutions:
+
+| Symptom | Cause | Fix |
+|---------|-------|-----|
+| Container fails to start with GPU errors | NVIDIA drivers/toolkit missing | Install nvidia-container-toolkit, restart Docker |
+| Server responds with 404 or connection refused | Server not fully initialized | Wait 60 seconds, check container logs |
+| Out of memory errors during model loading | Insufficient GPU memory | Use a smaller model or increase the `--tp` parameter |
+| Model download fails | Network connectivity issues | Check internet connection, retry download |
+| Permission denied accessing /tmp | Volume mount issues | Use full path: `-v /tmp:/tmp` or create a dedicated directory |
+
+## Step 10. Cleanup and rollback
+
+Stop and remove containers to clean up resources.
This step returns your system to its +original state. + +> **Warning:** This will stop all SGLang containers and remove temporary data. + +```bash +## Stop all SGLang containers +docker ps | grep sglang | awk '{print $1}' | xargs docker stop + +## Remove stopped containers +docker container prune -f + +## Remove SGLang images (optional) +docker rmi nvcr.io/nvidia/sglang:-py3 +``` + +## Step 11. Next steps + +With SGLang successfully deployed, you can now: + +- Integrate the HTTP API into your applications using the `/generate` endpoint +- Experiment with different models by changing the `--model-path` parameter +- Scale up using multiple GPUs by adjusting the `--tp` (tensor parallel) setting +- Deploy production workloads using the container orchestration platform of your choice diff --git a/nvidia/sglang/assets/offline-inference.py b/nvidia/sglang/assets/offline-inference.py new file mode 100644 index 0000000..3b91543 --- /dev/null +++ b/nvidia/sglang/assets/offline-inference.py @@ -0,0 +1,30 @@ +# +# SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import sglang as sgl + +def main(): + llm = sgl.Engine(model_path="deepseek-ai/DeepSeek-V2-Lite", trust_remote_code=True) + + prompt = "What does NVIDIA love?" 
+    sampling_params = {"temperature": 0.7, "max_new_tokens": 100}
+
+    output = llm.generate(prompt, sampling_params)
+    print(f"Output: {output}")
+
+if __name__ == '__main__':
+    main()
\ No newline at end of file
diff --git a/nvidia/speculative-decoding/README.md b/nvidia/speculative-decoding/README.md
new file mode 100644
index 0000000..d23ec15
--- /dev/null
+++ b/nvidia/speculative-decoding/README.md
@@ -0,0 +1,214 @@
+# Speculative Decoding
+
+> Learn how to set up speculative decoding for fast inference on Spark
+
+## Table of Contents
+
+- [Overview](#overview)
+- [How to run inference with speculative decoding](#how-to-run-inference-with-speculative-decoding)
+  - [Step 1. Run Eagle3 with GPT-OSS 120B](#step-1-run-eagle3-with-gpt-oss-120b)
+  - [Step 2. Test the Eagle3 setup](#step-2-test-the-eagle3-setup)
+  - [Step 1. Run Draft-Target Speculative Decoding](#step-1-run-draft-target-speculative-decoding)
+  - [Step 2. Test the Draft-Target setup](#step-2-test-the-draft-target-setup)
+  - [Troubleshooting](#troubleshooting)
+  - [Cleanup](#cleanup)
+  - [Next Steps](#next-steps)
+
+---
+
+## Overview
+
+## Basic idea
+
+Speculative decoding speeds up text generation by using a **small, fast model** to draft several tokens ahead, then having the **larger model** quickly verify or adjust them.
+This way, the big model doesn't need to predict every token step-by-step, reducing latency while keeping output quality.
+
+## What you'll accomplish
+
+You'll explore two different speculative decoding approaches using TensorRT-LLM on NVIDIA Spark:
+1. **Eagle3 with GPT-OSS 120B** - Advanced speculative decoding using Eagle3 draft models
+2. **Traditional Draft-Target** - Classic speculative decoding with smaller model pairs
+
+These examples demonstrate how to accelerate large language model inference while maintaining output quality.
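The draft-and-verify loop described above can be sketched with toy stand-in "models" (a real system compares logits and accepts tokens probabilistically, but the control flow is the same, and verification is what preserves the target model's output):

```python
## Toy sketch of the draft-and-verify loop behind speculative decoding.
## target_model and draft_model are deterministic stand-ins that map a
## token sequence to a next token; they are NOT real language models.

def target_model(tokens):
    """Slow, authoritative next-token predictor (stand-in)."""
    return (sum(tokens) * 31 + 7) % 100

def draft_model(tokens):
    """Fast, imperfect approximation of the target (stand-in)."""
    s = sum(tokens)
    return (s * 31 + 7) % 100 if s % 3 else 0

def speculative_generate(prompt, n_tokens, k=4):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < n_tokens:
        # 1. Draft k tokens ahead with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))
        # 2. Verify: accept drafted tokens only while the target agrees,
        #    so the final sequence is exactly what the target would emit.
        accepted = 0
        for t in draft:
            if target_model(tokens) == t:
                tokens.append(t)
                accepted += 1
            else:
                break
        # 3. On a mismatch, fall back to one target-model token,
        #    which guarantees forward progress every iteration.
        if accepted < k:
            tokens.append(target_model(tokens))
    return tokens[len(prompt):len(prompt) + n_tokens]

print(speculative_generate([1, 2, 3], 8))
```

When the draft model agrees with the target, up to `k` tokens are committed per (expensive) verification pass instead of one, which is where the latency win comes from.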
+ +## What to know before starting + +- Experience with Docker and containerized applications +- Understanding of speculative decoding concepts (Eagle3 vs traditional draft-target) +- Familiarity with TensorRT-LLM serving and API endpoints +- Knowledge of GPU memory management for large language models + +## Prerequisites + +- [ ] NVIDIA Spark device with sufficient GPU memory available (80GB+ recommended for GPT-OSS 120B) +- [ ] Docker with GPU support enabled + ```bash + docker run --gpus all nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev nvidia-smi + ``` +- [ ] Access to NVIDIA's internal container registry (for Eagle3 example) +- [ ] HuggingFace authentication configured (if needed for model downloads) + ```bash + huggingface-cli login + ``` +- [ ] Network connectivity for model downloads + + +## Time & risk + +**Duration:** 10-20 minutes for Eagle3 setup, additional time for model downloads (varies by network speed) + +**Risks:** GPU memory exhaustion with large models, container registry access issues, network timeouts during downloads + +**Rollback:** Stop Docker containers and optionally clean up downloaded model cache + +## How to run inference with speculative decoding + +## Example 1: Eagle3 Speculative Decoding with GPT-OSS 120B + +Eagle3 is an advanced speculative decoding technique that uses a specialized draft model to accelerate inference of large language models. + +### Step 1. 
Run Eagle3 with GPT-OSS 120B
+
+Execute the following command to download models and run Eagle3 speculative decoding:
+
+```bash
+docker run \
+  -v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \
+  --rm -it --ulimit memlock=-1 --ulimit stack=67108864 \
+  --gpus=all --ipc=host --network host nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
+  bash -c '
+  hf download openai/gpt-oss-120b && \
+  hf download nvidia/gpt-oss-120b-Eagle3 \
+    --local-dir /opt/gpt-oss-120b-Eagle3/ && \
+  cat > /tmp/extra-llm-api-config.yml <<EOF
+print_iter_log: false
+disable_overlap_scheduler: true
+speculative_config:
+  decoding_type: Eagle
+  max_draft_len: 3
+  speculative_model_dir: /opt/gpt-oss-120b-Eagle3/
+kv_cache_config:
+  enable_block_reuse: false
+EOF
+
+  # Start TensorRT-LLM server
+  trtllm-serve openai/gpt-oss-120b \
+    --backend pytorch --tp_size 1 \
+    --max_batch_size 1 \
+    --kv_cache_free_gpu_memory_fraction 0.9 \
+    --extra_llm_api_options /tmp/extra-llm-api-config.yml
+  '
+```
+
+> TODO: Verify the exact Eagle3 configuration values and serve flags for this release
+
+### Step 2. Test the Eagle3 setup
+
+Once the server is running, test it with API calls:
+
+```bash
+## Test completion endpoint
+curl -X POST http://localhost:8000/v1/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "openai/gpt-oss-120b",
+    "prompt": "Explain the benefits of speculative decoding:",
+    "max_tokens": 150,
+    "temperature": 0.7
+  }'
+```
+
+## Example 2: Draft-Target Speculative Decoding
+
+Traditional draft-target speculative decoding pairs a small draft model with a larger target model from the same family.
+
+### Step 1. Run Draft-Target Speculative Decoding
+
+Execute the following command to download the models and start the Draft-Target server:
+
+```bash
+docker run \
+  -v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \
+  --rm -it --ulimit memlock=-1 --ulimit stack=67108864 \
+  --gpus=all --ipc=host --network host nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
+  bash -c "
+  hf download nvidia/Llama-3.3-70B-Instruct-FP4 && \
+  hf download nvidia/Llama-3.1-8B-Instruct-FP4 \
+    --local-dir /opt/Llama-3.1-8B-Instruct-FP4/ && \
+  cat <<EOF > extra-llm-api-config.yml
+print_iter_log: false
+disable_overlap_scheduler: true
+speculative_config:
+  decoding_type: DraftTarget
+  max_draft_len: 4
+  speculative_model_dir: /opt/Llama-3.1-8B-Instruct-FP4/
+kv_cache_config:
+  enable_block_reuse: false
+EOF
+
+  # Start TensorRT-LLM server
+  trtllm-serve nvidia/Llama-3.3-70B-Instruct-FP4 \
+    --backend pytorch --tp_size 1 \
+    --max_batch_size 1 \
+    --kv_cache_free_gpu_memory_fraction 0.9 \
+    --extra_llm_api_options ./extra-llm-api-config.yml
+  "
+```
+
+### Step 2.
Test the Draft-Target setup
+
+Once the server is running, test it with API calls:
+
+```bash
+## Test completion endpoint
+curl -X POST http://localhost:8000/v1/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "nvidia/Llama-3.3-70B-Instruct-FP4",
+    "prompt": "Explain the benefits of speculative decoding:",
+    "max_tokens": 150,
+    "temperature": 0.7
+  }'
+```
+
+#### Key Features of Draft-Target:
+- **Efficient resource usage**: 8B draft model accelerates 70B target model
+- **Flexible configuration**: Adjustable draft token length for optimization
+- **Memory efficient**: Uses FP4 quantized models for reduced memory footprint
+- **Compatible models**: Uses Llama family models with consistent tokenization
+
+### Troubleshooting
+
+Common issues and solutions:
+
+| Symptom | Cause | Fix |
+|---------|--------|-----|
+| "CUDA out of memory" error | Insufficient GPU memory | Lower `kv_cache_free_gpu_memory_fraction` below 0.9 (e.g., 0.8) or use a device with more VRAM |
+| Container fails to start | Docker GPU support issues | Verify `nvidia-docker` is installed and `--gpus=all` flag is supported |
+| Model download fails | Network or authentication issues | Check HuggingFace authentication and network connectivity |
+| Server doesn't respond | Port conflicts or firewall | Check if port 8000 is available and not blocked |
+
+### Cleanup
+
+Stop the Docker container when finished:
+
+```bash
+## Find and stop the container
+docker ps
+docker stop <container-id>
+
+## Optional: Clean up downloaded models from cache
+## rm -rf $HOME/.cache/huggingface/hub/models--*gpt-oss*
+```
+
+### Next Steps
+
+- Compare both Eagle3 and Draft-Target performance with baseline inference
+- Experiment with different `max_draft_len` values (1, 2, 3, 4, 8) for both approaches
+- Monitor token acceptance rates and throughput improvements across different model pairs
+- Test with different prompt lengths and generation parameters
+- Compare Eagle3 vs Draft-Target approaches for your specific use
case
+- Benchmark memory usage differences between the two methods
diff --git a/nvidia/speculative-decoding/assets/example b/nvidia/speculative-decoding/assets/example
new file mode 100644
index 0000000..e69de29
diff --git a/nvidia/stack-sparks/README.md b/nvidia/stack-sparks/README.md
new file mode 100644
index 0000000..84793df
--- /dev/null
+++ b/nvidia/stack-sparks/README.md
@@ -0,0 +1,319 @@
+# Stack two Sparks
+
+> Connect two Spark devices and set them up for inference and fine-tuning
+
+## Table of Contents
+
+- [Overview](#overview)
+- [Run on two Sparks](#run-on-two-sparks)
+  - [Option 1: Automatic IP Assignment (Recommended)](#option-1-automatic-ip-assignment-recommended)
+  - [Option 2: Manual IP Assignment (Advanced)](#option-2-manual-ip-assignment-advanced)
+
+---
+
+## Overview
+
+## Basic Idea
+
+Configure two DGX Spark systems for high-speed inter-node communication using 200GbE direct
+QSFP connections and NCCL multi-node communication. This setup enables distributed training
+and inference workloads across multiple Blackwell GPUs by establishing network connectivity,
+configuring SSH authentication, and validating communication with NCCL performance tests.
+
+## What you'll accomplish
+
+You will physically connect two DGX Spark devices with a QSFP cable, configure network
+interfaces for cluster communication, establish passwordless SSH between nodes, and validate
+the setup with NCCL multi-node tests to create a functional distributed computing environment.
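Several of the steps that follow depend on picking out the network interfaces that `ibdev2netdev` reports as "Up". The selection logic, which is the same thing the `discover-sparks` ancillary script's awk one-liner does, can be sketched as:

```python
## Pick the network interface names that ibdev2netdev reports as "Up".
## SAMPLE mirrors the expected ibdev2netdev output shown later in this playbook.

SAMPLE = """\
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
"""

def active_interfaces(ibdev2netdev_output):
    names = []
    for line in ibdev2netdev_output.splitlines():
        fields = line.split()
        # Field 5 is the interface name, field 6 the link state.
        if len(fields) >= 6 and fields[5] == "(Up)":
            names.append(fields[4])
    return names

print(active_interfaces(SAMPLE))   # → ['enp1s0f0np0', 'enP2p1s0f0np0']
```

The interface names this yields are the ones you pass to `UCX_NET_DEVICES`, `NCCL_SOCKET_IFNAME`, and `OMPI_MCA_btl_tcp_if_include` when launching the containers.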
+
+## What to know before starting
+
+- Working with network interface configuration and netplan
+- Using Docker containers with GPU and network access
+- Basic understanding of distributed computing concepts
+- Experience with SSH key management
+- Familiarity with NVIDIA GPU architectures and CUDA environments
+
+## Prerequisites
+
+- [ ] Two DGX Spark systems with NVIDIA Blackwell GPUs available
+- [ ] QSFP cable for direct 200GbE connection between devices
+- [ ] Docker installed on both systems: `docker --version`
+- [ ] CUDA toolkit installed: `nvcc --version` (should show 12.9 or higher)
+- [ ] SSH access available on both systems: `ssh-keygen -t rsa` (if keys don't exist)
+- [ ] Git available for source code compilation: `git --version`
+- [ ] Root or sudo access on both systems: `sudo whoami`
+
+## Ancillary files
+
+All required files for this playbook can be found [here on GitLab](https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}/)
+
+- `discover-sparks` script for automatic node discovery and SSH key distribution
+- `trtllm-mn-entrypoint.sh` container entrypoint script for multi-node setup
+- Network interface mapping tools (`ibdev2netdev`, `ip link show`)
+
+## Time & risk
+
+**Duration:** 2-3 hours including validation tests
+
+**Risk level:** Medium - involves network reconfiguration and container setup
+
+**Rollback:** Network changes can be reversed by removing netplan configs or IP assignments
+
+## Run on two Sparks
+
+## Step 1. Physical Hardware Connection
+
+Connect the QSFP cable between both DGX Spark systems using the rightmost QSFP interface
+on each device. This establishes the 200GbE direct connection required for high-speed
+inter-node communication.
+
+```bash
+## Check QSFP interface availability on both nodes
+ip link show | grep enP2p1s0f1np1
+```
+
+Expected output shows the interface exists but may be down initially.
+
+## Step 2.
Network Interface Configuration
+
+Choose one option based on your network requirements.
+
+### Option 1: Automatic IP Assignment (Recommended)
+
+Configure network interfaces using netplan on both DGX Spark nodes for automatic
+link-local addressing:
+
+```bash
+## On both nodes, create the netplan configuration file
+## (also provided as cx7-netplan.yaml in the ancillary files)
+sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
+network:
+  version: 2
+  ethernets:
+    enp1s0f0np0:
+      link-local: [ ipv4 ]
+    enp1s0f1np1:
+      link-local: [ ipv4 ]
+EOF
+
+## Apply the configuration
+sudo netplan apply
+```
+
+### Option 2: Manual IP Assignment (Advanced)
+
+Assign static addresses on the QSFP interface of each node:
+
+```bash
+## On node 1
+sudo ip addr add 192.168.100.10/24 dev enP2p1s0f1np1
+sudo ip link set enP2p1s0f1np1 up
+
+## On node 2
+sudo ip addr add 192.168.100.11/24 dev enP2p1s0f1np1
+sudo ip link set enP2p1s0f1np1 up
+```
+
+## Step 3. Configure Passwordless SSH
+
+Run the `discover-sparks` script from the ancillary files to discover the nodes and
+distribute SSH keys in both directions. Enter your password on each node when prompted.
+
+```bash
+./discover-sparks
+```
+
+## Step 4. Identify Active Network Interfaces
+
+Map RDMA devices to their network interface names. This runs on both nodes.
+
+```bash
+ibdev2netdev
+```
+
+Expected output:
+
+```
+rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
+rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
+roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
+roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
+```
+
+Note the active interface names (marked "Up") for use in container configuration.
+
+## Step 5. Launch Containers with Network Configuration
+
+Start containers with appropriate network and GPU configuration for NCCL communication.
+This step runs on both nodes.
+
+```bash
+## On both nodes, launch the container
+docker run --name trtllm --rm -d \
+  --gpus all --network host --ipc=host \
+  --ulimit memlock=-1 --ulimit stack=67108864 \
+  -e UCX_NET_DEVICES=enp1s0f0np0,enp1s0f1np1 \
+  -e NCCL_SOCKET_IFNAME=enp1s0f0np0,enp1s0f1np1 \
+  -e OMPI_MCA_btl_tcp_if_include=enp1s0f0np0,enp1s0f1np1 \
+  -e OMPI_ALLOW_RUN_AS_ROOT=1 \
+  -e OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 \
+  -v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \
+  -v ./trtllm-mn-entrypoint.sh:/opt/trtllm-mn-entrypoint.sh \
+  -v ~/.ssh:/tmp/.ssh:ro \
+  --entrypoint /opt/trtllm-mn-entrypoint.sh \
+  nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc3
+```
+
+## Step 6. Build NCCL with Blackwell Support
+
+Execute these commands inside both containers to build NCCL from source with Blackwell
+architecture support. Access the container with `docker exec -it trtllm bash`.
+ +```bash +## Install dependencies and build NCCL +sudo apt-get update && sudo apt-get install -y libopenmpi-dev +git clone -b v2.28.3-1 https://github.com/NVIDIA/nccl.git /opt/nccl/ +cd /opt/nccl/ +make -j src.build NVCC_GENCODE="-gencode=arch=compute_121,code=sm_121" + +## Set environment variables +export MPI_HOME="/usr/lib/aarch64-linux-gnu/openmpi" +export NCCL_HOME="/opt/nccl/build/" +export LD_LIBRARY_PATH="$NCCL_HOME/lib:$CUDA_HOME/lib64/:$MPI_HOME/lib:$LD_LIBRARY_PATH" +``` + +## Step 7. Build NCCL Test Suite + +Compile the NCCL test suite to validate communication performance. This runs inside +both containers. + +```bash +## Clone and build NCCL tests +git clone https://github.com/NVIDIA/nccl-tests.git /opt/nccl-tests/ +cd /opt/nccl-tests/ +make MPI=1 +``` + +## Step 8. Run NCCL Communication Test + +Execute multi-node NCCL performance test using the active network interface. This runs +from one of the containers. + +```bash +## Set network interface environment variables (use your active interface from Step 4) +export UCX_NET_DEVICES=enp1s0f0np0 +export NCCL_SOCKET_IFNAME=enp1s0f0np0 +export OMPI_MCA_btl_tcp_if_include=enp1s0f0np0 + +## Run the all_gather performance test across both nodes +mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 \ + -x NCCL_DEBUG=VERSION -x NCCL_DEBUG_SUBSYS=TUNING \ + -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \ + -x NCCL_MERGE_LEVEL=SYS -x NCCL_PROTO="SIMPLE" \ + /opt/nccl-tests/build/all_gather_perf -b 32G -e 32G -f 2 +``` + +## Step 9. Validate NCCL Installation + +Verify successful NCCL compilation and multi-node communication by checking built +components. + +```bash +## Check NCCL library build +ls -la /opt/nccl/build/lib/ + +## Verify NCCL test binaries +ls -la /opt/nccl-tests/build/ + +## Check MPI configuration +mpirun --version +``` + +Expected output should show NCCL libraries in `/opt/nccl/build/lib/` and test binaries +in `/opt/nccl-tests/build/`. + +## Step 10. 
Performance Validation + +Review the all_gather test output for communication performance metrics from Step 8. + +Expected metrics from the test output: +- Bandwidth measurements between nodes +- Latency for different message sizes +- GPU-to-GPU communication confirmation +- No error messages or communication failures + +## Step 11. Additional NCCL Tests + +Run additional performance validation tests to verify the complete setup. + +```bash +## Example: Run a simple NCCL bandwidth test +/opt/nccl-tests/build/all_reduce_perf -b 1M -e 1G -f 2 + +## Example: Verify GPU topology detection +nvidia-smi topo -m +``` + +## Step 12. Troubleshooting + +| Symptom | Cause | Fix | +|---------|-------|-----| +| "Network unreachable" errors | Network interfaces not configured | Verify netplan config and `sudo netplan apply` | +| SSH authentication failures | SSH keys not properly distributed | Re-run `./discover-sparks` and enter passwords | +| NCCL build failures with Blackwell | Wrong compute capability specified | Verify `NVCC_GENCODE="-gencode=arch=compute_121,code=sm_121"` | +| MPI communication timeouts | Wrong network interfaces specified | Check `ibdev2netdev` and update interface names | +| Container networking issues | Host network mode problems | Ensure `--network host --ipc=host` in docker run | +| Node 2 not visible in cluster | Network connectivity issue | Verify QSFP cable connection, check IP configuration | + +## Step 13. Cleanup and Rollback + +> **Warning**: These steps will stop containers and reset network configuration. + +```bash +## Stop containers on both nodes +docker stop trtllm +docker rm trtllm + +## Rollback network configuration (if using Option 1) +sudo rm /etc/netplan/40-cx7.yaml +sudo netplan apply + +## Rollback network configuration (if using Option 2) +sudo ip addr del 192.168.100.10/24 dev enP2p1s0f1np1 # Node 1 +sudo ip addr del 192.168.100.11/24 dev enP2p1s0f1np1 # Node 2 +sudo ip link set enP2p1s0f1np1 down +``` + +## Step 14. 
Next Steps + +Your NCCL environment is ready for multi-node distributed training workloads on DGX Spark +systems with Blackwell GPUs. + +```bash +## Test basic multi-node functionality +mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 hostname + +## Verify GPU visibility across nodes +mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 nvidia-smi -L +``` diff --git a/nvidia/stack-sparks/assets/cx7-netplan.yaml b/nvidia/stack-sparks/assets/cx7-netplan.yaml new file mode 100644 index 0000000..795c616 --- /dev/null +++ b/nvidia/stack-sparks/assets/cx7-netplan.yaml @@ -0,0 +1,24 @@ +# +# SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +network: + version: 2 + ethernets: + enp1s0f0np0: + link-local: [ ipv4 ] + enp1s0f1np1: + link-local: [ ipv4 ] \ No newline at end of file diff --git a/nvidia/stack-sparks/assets/discover-sparks b/nvidia/stack-sparks/assets/discover-sparks new file mode 100755 index 0000000..1d516ad --- /dev/null +++ b/nvidia/stack-sparks/assets/discover-sparks @@ -0,0 +1,174 @@ +# +# SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +#!/bin/env bash + +# discover-sparks.sh +# Discover available systems using avahi-browse and generate MPI hosts file +# Searches all active interfaces automatically +# +# Usage: bash ./discover-sparks + +set -euo pipefail + +# Dynamically get interface names from ibdev2netdev output +# Use ibdev2netdev to list Infiniband devices and their network interfaces. +# The awk command searches for lines containing 'Up)' (i.e., interfaces that are up) +# and prints the 5th field, which is the interface name (e.g., enp1s0f0np0). +# The tr command removes any parentheses from the output. +INTERFACES=($(ibdev2netdev | awk '/Up\)/ {print $5}' | tr -d '()')) +if [ ${#INTERFACES[@]} -eq 0 ]; then + echo "ERROR: No active interfaces found via ibdev2netdev." + exit 1 +fi +OUTPUT_FILE="/tmp/stacked-sparks-hostfile" + +# Check if avahi-browse is available +if ! command -v avahi-browse &> /dev/null; then + echo "Error: avahi-browse not found. Please install avahi-utils package." + exit 1 +fi + +# Check if ssh-copy-id is available +if ! command -v ssh-copy-id &> /dev/null; then + echo "Error: ssh-copy-id not found. Please install openssh-client package." 
+ exit 1 +fi + +# Create temporary file for processing +TEMP_FILE=$(mktemp) +trap 'rm -f "$TEMP_FILE"' EXIT + +# Run avahi-browse and filter for SSH services on specified interfaces +# -p: parseable output +# -r: resolve host names and addresses +# -f: terminate after dumping all entries available at startup +avahi_output=$(avahi-browse -p -r -f -t _ssh._tcp 2>/dev/null) + +# Filter for both interfaces +found_services=false +for interface in "${INTERFACES[@]}"; do + if echo "$avahi_output" | grep "$interface" >> "$TEMP_FILE"; then + found_services=true + fi +done + +if [ "$found_services" = false ]; then + echo "Warning: No services found on any specified interface" + touch "$OUTPUT_FILE" + echo "Created empty hosts file: $OUTPUT_FILE" + exit 0 +fi + +# Extract IPv4 addresses from the avahi-browse output +# Format: =;interface;IPv4;hostname\032service;description;local;fqdn;ip_address;port; + +# Clear the output file +> "$OUTPUT_FILE" + +# Parse IPv4 entries and extract IP addresses +grep "^=" "$TEMP_FILE" | grep "IPv4" | while IFS=';' read -r prefix interface protocol hostname_service description local fqdn ip_address port rest; do + # Clean up any trailing data + clean_ip=$(echo "$ip_address" | sed 's/;.*$//') + + # Validate IP address format + if [[ $clean_ip =~ ^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$ ]]; then + echo "$clean_ip" >> "$OUTPUT_FILE" + echo "Found: $clean_ip ($fqdn)" + else + echo "Warning: Invalid IP format: $clean_ip" + fi +done + +# Sort and remove duplicates +if [[ -s "$OUTPUT_FILE" ]]; then + sort -u "$OUTPUT_FILE" -o "$OUTPUT_FILE" +else + echo "No IPv4 addresses found." + exit 1 +fi + +# Check if SSH key exists, if not, prompt to generate +if [[ ! -f "$HOME/.ssh/id_rsa.pub" && ! -f "$HOME/.ssh/id_ed25519.pub" ]]; then + ssh-keygen -t ed25519 -N "" -f "$HOME/.ssh/id_ed25519" -q +fi + +echo "" +echo "Setting up bidirectional SSH access (local <-> remote nodes)..." +echo "You may be prompted for your password on each node." 
+ +# Ensure authorized_keys file exists +mkdir -p "$HOME/.ssh" +touch "$HOME/.ssh/authorized_keys" +chmod 700 "$HOME/.ssh" +chmod 600 "$HOME/.ssh/authorized_keys" + +while read -r node_ip; do + if [[ -n "$node_ip" ]]; then + echo "" + echo "Setting up SSH access for $node_ip ..." + + # Step 1: Copy local SSH key to remote node + echo " Copying local SSH key to $node_ip ..." + if ssh-copy-id -i "$HOME/.ssh/id_ed25519" -o StrictHostKeyChecking=accept-new "$USER@$node_ip" &>/dev/null; then + echo " ✓ Successfully copied local key to $node_ip" + + # Step 2: Set up reverse SSH access (remote -> local) + echo " Setting up reverse SSH access from $node_ip ..." + + # Generate SSH key on remote node if it doesn't exist and get its public key + remote_pubkey=$(ssh -o StrictHostKeyChecking=accept-new "$USER@$node_ip" ' + # Ensure SSH directory exists + mkdir -p ~/.ssh + chmod 700 ~/.ssh + + # Generate key if it doesn'"'"'t exist + if [[ ! -f ~/.ssh/id_ed25519.pub ]]; then + ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519 -q + fi + + # Output the public key + cat ~/.ssh/id_ed25519.pub + ' 2>/dev/null) + + if [[ -n "$remote_pubkey" ]]; then + # Add remote public key to local authorized_keys if not already present + if ! grep -q "$remote_pubkey" "$HOME/.ssh/authorized_keys" 2>/dev/null; then + echo "$remote_pubkey" >> "$HOME/.ssh/authorized_keys" + echo " ✓ Added $node_ip's public key to local authorized_keys" + else + echo " ✓ $node_ip's public key already in local authorized_keys" + fi + else + echo " ✗ Failed to get public key from $node_ip" + fi + else + echo " ✗ Failed to copy local SSH key to $node_ip as $USER" + fi + fi +done < "$OUTPUT_FILE" + +# Add hostfile to remote nodes +while read -r node_ip; do + if [[ -n "$node_ip" ]]; then + echo " Adding hostfile to $node_ip ..." + scp "$OUTPUT_FILE" "$USER@$node_ip:$OUTPUT_FILE" + fi +done < "$OUTPUT_FILE" + +echo "" +echo "Bidirectional SSH setup complete!" 
+echo "Both local and remote nodes can now SSH to each other without passwords."
diff --git a/nvidia/stack-sparks/assets/docker-compose.yml b/nvidia/stack-sparks/assets/docker-compose.yml
new file mode 100644
index 0000000..71cf36c
--- /dev/null
+++ b/nvidia/stack-sparks/assets/docker-compose.yml
@@ -0,0 +1,62 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+version: '3.8'
+
+services:
+  trtllm:
+    image: nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc3
+    deploy:
+      replicas: 2
+      restart_policy:
+        condition: any
+        delay: 5s
+        max_attempts: 3
+        window: 120s
+      resources:
+        reservations:
+          generic_resources:
+            - discrete_resource_spec:
+                kind: 'NVIDIA_GPU'
+                value: 1
+    environment:
+      - UCX_NET_DEVICES=enp1s0f0np0,enp1s0f1np1
+      - NCCL_SOCKET_IFNAME=enp1s0f0np0,enp1s0f1np1
+      - OMPI_MCA_btl_tcp_if_include=enp1s0f0np0,enp1s0f1np1
+      - OMPI_MCA_orte_default_hostfile=/etc/openmpi-hostfile
+      - OMPI_MCA_rmaps_ppr_n_pernode=1
+      - OMPI_ALLOW_RUN_AS_ROOT=1
+      - OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
+    entrypoint: /opt/trtllm-mn-entrypoint.sh
+    volumes:
+      - ~/.cache/huggingface/:/root/.cache/huggingface/
+      - ~/trtllm-mn-entrypoint.sh:/opt/trtllm-mn-entrypoint.sh
+      - ~/.ssh:/tmp/.ssh:ro
+    ulimits:
+      memlock: -1
+      stack: 67108864
+    networks:
+      - host
+
+networks:
+  host:
+    name: host
+    external: true
\ No newline at end of file
diff --git a/nvidia/stack-sparks/assets/instructions-docker-swarm.md b/nvidia/stack-sparks/assets/instructions-docker-swarm.md
new file mode 100644
index 0000000..d3ac198
--- /dev/null
+++ b/nvidia/stack-sparks/assets/instructions-docker-swarm.md
@@ -0,0 +1,247 @@
+
+
+# TensorRT-LLM on Stacked Spark Instructions
+
+## Step 1. Set up networking between nodes
+Configure network interfaces using netplan on both DGX Spark nodes:
+
+```bash
+# On both nodes, create the netplan configuration file (also available in cx7-netplan.yaml in this repository)
+sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
+network:
+  version: 2
+  ethernets:
+    enp1s0f0np0:
+      link-local: [ ipv4 ]
+    enp1s0f1np1:
+      link-local: [ ipv4 ]
+EOF
+
+# Apply the configuration on both nodes
+sudo netplan apply
+```
+
+### Substep A: Initialize the Docker swarm
+On your primary node, initialize the swarm:
+
+```bash
+docker swarm init
+```
+
+Successful initialization prints output similar to:
+```
+Swarm initialized: current node (<node-id>) is now a manager.
+
+To add a worker to this swarm, run the following command:
+
+    docker swarm join --token <token> <ip>:<port>
+
+To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.
+```
+
+### Substep B: Join worker nodes and deploy
+Now we can proceed with setting up other nodes of your cluster:
+
+```bash
+# Run the command suggested by docker swarm init on each worker node to join the Docker swarm
+docker swarm join --token <token> <ip>:<port>
+
+# On your primary node, deploy the stack using the following command
+docker stack deploy -c docker-compose.yml trtllm-multinode
+
+# You can verify the status of your worker nodes using the following
+docker stack ps trtllm-multinode
+
+# In case you see any errors reported by docker ps for any node, you can verify using
+docker service logs <service-name>
+```
+
+If everything is healthy, you should see a similar output to the following:
+```
+nvidia@spark-1b3b:~/draft-playbooks/trt-llm-on-stacked-spark$ docker stack ps trtllm-multinode
+ID             NAME                        IMAGE                                          NODE         DESIRED STATE   CURRENT STATE           ERROR     PORTS
+oe9k5o6w41le   trtllm-multinode_trtllm.1   nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc3   spark-1d84   Running         Running 2 minutes ago
+phszqzk97p83   trtllm-multinode_trtllm.2   nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc3   spark-1b3b   Running         Running 2 minutes ago
+```
+
+### Substep C. Create hosts file
+
+You can check the available nodes using `docker node ls`:
+```
+nvidia@spark-1b3b:~$ docker node ls
+ID                            HOSTNAME     STATUS    AVAILABILITY   MANAGER STATUS   ENGINE VERSION
+hza2b7yisatqiezo33zx4in4i *   spark-1b3b   Ready     Active         Leader           28.3.3
+m1k22g3ktgnx36qz4jg5fzhr4     spark-1d84   Ready     Active                          28.3.3
+```
+
+Generate a file containing all Docker Swarm node addresses for MPI operations, and then copy it over to your container:
+```bash
+docker node ls --format '{{.ID}}' | xargs -n1 docker node inspect --format '{{ .Status.Addr }}' > ~/openmpi-hostfile
+docker cp ~/openmpi-hostfile $(docker ps -q -f name=trtllm-multinode):/etc/openmpi-hostfile
+```
+
+### Substep D. Find your Docker container ID
+You can use `docker ps` to find your Docker container ID.
Alternatively, you can save the container ID in a variable: +``` +export TRTLLM_MN_CONTAINER=$(docker ps -q -f name=trtllm-multinode) +``` + +### Substep E. Generate configuration file + +```bash +docker exec $TRTLLM_MN_CONTAINER bash -c 'cat <<EOF > /tmp/extra-llm-api-config.yml +print_iter_log: false +kv_cache_config: + dtype: "fp8" + free_gpu_memory_fraction: 0.9 +cuda_graph_config: + enable_padding: true +EOF' +``` + +### Substep F. Download model + +```bash +docker exec \ + -e MODEL="nvidia/Qwen3-235B-A22B-FP4" \ + -e HF_TOKEN="hf_..." \ + -it $TRTLLM_MN_CONTAINER bash -c 'mpirun -x HF_TOKEN bash -c "huggingface-cli download $MODEL"' +``` + +### Substep G. Prepare dataset and benchmark + +```bash +docker exec \ + -e ISL=128 -e OSL=128 \ + -e MODEL="nvidia/Qwen3-235B-A22B-FP4" \ + -e HF_TOKEN="" \ + -it $TRTLLM_MN_CONTAINER bash -c ' + mpirun -x HF_TOKEN bash -c "python benchmarks/cpp/prepare_dataset.py --tokenizer=$MODEL --stdout token-norm-dist --num-requests=1 --input-mean=$ISL --output-mean=$OSL --input-stdev=0 --output-stdev=0 > /tmp/dataset.txt" && \ + mpirun -x HF_TOKEN trtllm-llmapi-launch trtllm-bench -m $MODEL throughput \ + --tp 2 \ + --dataset /tmp/dataset.txt \ + --backend pytorch \ + --max_num_tokens 4096 \ + --concurrency 1 \ + --max_batch_size 4 \ + --extra_llm_api_options /tmp/extra-llm-api-config.yml \ + --streaming' +``` + +### Substep H. Serve the model + +```bash +docker exec \ + -e MODEL="nvidia/Qwen3-235B-A22B-FP4" \ + -e HF_TOKEN="" \ + -it $TRTLLM_MN_CONTAINER bash -c ' + mpirun -x HF_TOKEN trtllm-llmapi-launch trtllm-serve $MODEL \ + --tp_size 2 \ + --backend pytorch \ + --max_num_tokens 32768 \ + --max_batch_size 4 \ + --extra_llm_api_options /tmp/extra-llm-api-config.yml \ + --port 8000' +``` + +This will start the TensorRT-LLM server on port 8000. You can then make inference requests to `http://localhost:8000` using the OpenAI-compatible API format. + +**Expected output:** Server startup logs and ready message.
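Startup can take a while once mpirun spins up the ranks. Before sending the example request in the next section, you can poll the endpoint until it answers. A minimal sketch; the `/health` path is an assumption, so swap in whatever your trtllm-serve build exposes (for example `/v1/models`):

```shell
# Poll an HTTP endpoint until it answers, or give up after a number of retries.
# The health-check path is an assumption; adjust it for your server version.
wait_for_server() {
  local url="$1" retries="${2:-60}" delay="${3:-5}"
  local i
  for ((i = 0; i < retries; i++)); do
    # -s silences progress output, -f makes curl fail on HTTP errors
    curl -sf "$url" > /dev/null 2>&1 && return 0
    sleep "$delay"
  done
  return 1
}

# Example (on the primary node):
# wait_for_server http://localhost:8000/health && echo "server is ready"
```

This avoids firing requests into a server that is still loading the 235B checkpoint across both nodes.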
+ +### Example inference request + +Once the server is running, you can test it with a curl request. Please ensure the curl request is run on the primary node where you previously ran Substep H. + +```bash +curl -X POST http://localhost:8000/v1/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "nvidia/Qwen3-235B-A22B-FP4", + "prompt": "What is artificial intelligence?", + "max_tokens": 100, + "temperature": 0.7, + "stream": false + }' +``` + +## Step 6. Troubleshooting + +| Symptom | Cause | Fix | +|---------|-------|-----| +| MPI hostname test returns single hostname | Network connectivity issues | Verify both nodes are on 192.168.100.0/24 subnet | +| "Permission denied" on HuggingFace download | Invalid or missing HF_TOKEN | Set valid token: `export HF_TOKEN=<your_token>` | +| "CUDA out of memory" errors | Insufficient GPU memory | Reduce `--max_batch_size` or `--max_num_tokens` | +| Container exits immediately | Missing entrypoint script | Ensure `trtllm-mn-entrypoint.sh` download succeeded and has executable permissions | + +## Step 7. Cleanup and rollback + +Stop and remove containers by using the following command on the leader node: + +```bash +docker stack rm trtllm-multinode +``` + +> **Warning:** This removes all inference data and performance reports. Copy `/opt/*perf-report.json` files before cleanup if needed. + +Remove downloaded models to free disk space: + +```bash +rm -rf $HOME/.cache/huggingface/hub/models--nvidia--Qwen3* +``` + +## Step 8. Next steps + +Compare performance metrics between speculative decoding and baseline reports to quantify speed improvements. Use the multi-node setup as a foundation for deploying other large models requiring tensor parallelism, or scale to additional nodes for higher throughput workloads.
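For the report comparison suggested above, a tiny helper can turn two throughput figures into a speedup factor. A sketch only: perf-report field names vary between trtllm-bench versions, so the tokens/sec numbers are passed in directly rather than parsed here:

```shell
# Compute the relative speedup between a baseline and a candidate throughput.
# Arguments are plain numbers (e.g. tokens/sec taken from your perf reports).
speedup() {
  awk -v base="$1" -v cand="$2" 'BEGIN { printf "%.2fx\n", cand / base }'
}

speedup 950 1330   # prints "1.40x"
```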
diff --git a/nvidia/stack-sparks/assets/trtllm-mn-entrypoint.sh b/nvidia/stack-sparks/assets/trtllm-mn-entrypoint.sh new file mode 100755 index 0000000..63cf0fa --- /dev/null +++ b/nvidia/stack-sparks/assets/trtllm-mn-entrypoint.sh @@ -0,0 +1,61 @@ +#!/bin/bash +# +# SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +set -e + +SSH_PORT="${SSH_PORT:-2233}" + +# Install and configure OpenSSH server +apt-get update && \ + apt-get install -y openssh-server && \ + mkdir -p /var/run/sshd + +ls -lha /tmp/.ssh +cp -R /tmp/.ssh /root/ +ls -lha /root/.ssh +chown -R $USER: /root/.ssh +chmod 700 /root/.ssh +chmod 600 /root/.ssh/* +if compgen -G "/root/.ssh/*.pub" > /dev/null; then + chmod 644 /root/.ssh/*.pub +fi + + +# Allow root login and key-based auth, move port to 2233 +sed -i.bak \ + -e 's/^#\?\s*PermitRootLogin\s.*/PermitRootLogin yes/' \ + -e 's/^#\?\s*PubkeyAuthentication\s.*/PubkeyAuthentication yes/' \ + -e 's/^#\?\s*Port\s\+22\s*$/Port '$SSH_PORT'/' \ + /etc/ssh/sshd_config + +# Set root password +echo "root:root" | chpasswd + +# Configure SSH client for root to disable host key checks within * +echo -e '\nHost *\n StrictHostKeyChecking no\n Port '$SSH_PORT'\n UserKnownHostsFile=/dev/null' > /etc/ssh/ssh_config.d/trt-llm.conf && \ + chmod 600 /etc/ssh/ssh_config.d/trt-llm.conf + +# Fix login session for container +sed 
's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd + + +# Start SSHD +echo "Starting SSH" +if ! /usr/sbin/sshd -D; then + sshd_rc=$? + echo "Failed to start SSHD, rc $sshd_rc" + exit $sshd_rc +fi \ No newline at end of file diff --git a/nvidia/tailscale/README.md b/nvidia/tailscale/README.md new file mode 100644 index 0000000..574c10c --- /dev/null +++ b/nvidia/tailscale/README.md @@ -0,0 +1,341 @@ +# Setup Tailscale on your Spark + +> Use Tailscale to connect to your Spark on your home network no matter where you are + +## Table of Contents + +- [Overview](#overview) +- [Instructions](#instructions) + - [Step 1. Verify system requirements](#step-1-verify-system-requirements) + - [Step 2. Install SSH server (if needed)](#step-2-install-ssh-server-if-needed) + - [Step 3. Install Tailscale on NVIDIA Spark](#step-3-install-tailscale-on-nvidia-spark) + - [Step 4. Verify Tailscale installation](#step-4-verify-tailscale-installation) + - [Step 5. Connect Spark device to Tailscale network](#step-5-connect-spark-device-to-tailscale-network) + - [Step 6. Install Tailscale on client devices](#step-6-install-tailscale-on-client-devices) + - [Step 7. Connect client devices to tailnet](#step-7-connect-client-devices-to-tailnet) + - [Step 8. Verify network connectivity](#step-8-verify-network-connectivity) + - [Step 9. Configure SSH authentication](#step-9-configure-ssh-authentication) + - [Step 10. Test SSH connection](#step-10-test-ssh-connection) + - [Step 11. Validate installation](#step-11-validate-installation) + - [Step 12. Troubleshooting](#step-12-troubleshooting) + - [Step 13. Cleanup and rollback](#step-13-cleanup-and-rollback) + - [Step 14. Next steps](#step-14-next-steps) + +--- + +## Overview + +## Basic Idea + +Tailscale creates an encrypted peer-to-peer mesh network that allows secure access +to your NVIDIA Spark device from anywhere without complex firewall configurations +or port forwarding.
By installing Tailscale on both your Spark and client devices, +you establish a private "tailnet" where each device gets a stable private IP +address and hostname, enabling seamless SSH access whether you're at home, work, +or a coffee shop. + +## What you'll accomplish + +You will set up Tailscale on your NVIDIA Spark device and client machines to +create secure remote access. After completion, you'll be able to SSH into your +Spark from anywhere using simple commands like `ssh user@spark-hostname`, with +all traffic automatically encrypted and NAT traversal handled transparently. + +## What to know before starting + +- Working with terminal/command line interfaces +- Basic SSH concepts and usage +- Installing packages using `apt` on Ubuntu +- Understanding of user accounts and authentication +- Familiarity with systemd service management + +## Prerequisites + +- [ ] NVIDIA Spark device running Ubuntu (ARM64/AArch64) +- [ ] Client device (Mac, Windows, or Linux) for remote access +- [ ] Internet connectivity on both devices +- [ ] Valid email account for Tailscale authentication (Google, GitHub, Microsoft) +- [ ] SSH server availability check: `systemctl status ssh` +- [ ] Package manager working: `sudo apt update` +- [ ] User account with sudo privileges on Spark device + +## Time & risk + +**Time estimate**: 15-30 minutes for initial setup, 5 minutes per additional device + +**Risks**: +- Potential SSH service configuration conflicts +- Network connectivity issues during initial setup +- Authentication provider service dependencies + +**Rollback**: Tailscale can be completely removed with `sudo apt remove tailscale` +and all network routing automatically reverts to default settings. + +## Instructions + +### Step 1. Verify system requirements + +Check that your NVIDIA Spark device is running a supported Ubuntu version and +has internet connectivity. This step runs on the Spark device to confirm +prerequisites. 
+ +```bash +## Check Ubuntu version (should be 20.04 or newer) +lsb_release -a + +## Test internet connectivity +ping -c 3 google.com + +## Verify you have sudo access +sudo whoami +``` + +### Step 2. Install SSH server (if needed) + +Ensure SSH server is running on your Spark device since Tailscale provides +network connectivity but requires SSH for remote access. This step runs on +the Spark device. + +```bash +## Check if SSH is running +systemctl status ssh +``` + +#### If SSH is not installed or running + +```bash +## Install OpenSSH server +sudo apt update +sudo apt install -y openssh-server + +## Enable and start SSH service +sudo systemctl enable ssh --now + +## Verify SSH is running +systemctl status ssh +``` + +### Step 3. Install Tailscale on NVIDIA Spark + +Install Tailscale on your ARM64 Spark device using the official Ubuntu +repository. This step adds the Tailscale package repository and installs +the client. + +```bash +## Update package list +sudo apt update + +## Install required tools for adding external repositories +sudo apt install -y curl gnupg + +## Add Tailscale signing key +curl -fsSL https://pkgs.tailscale.com/stable/ubuntu/noble.noarmor.gpg | \ + sudo tee /usr/share/keyrings/tailscale-archive-keyring.gpg > /dev/null + +## Add Tailscale repository +curl -fsSL https://pkgs.tailscale.com/stable/ubuntu/noble.tailscale-keyring.list | \ + sudo tee /etc/apt/sources.list.d/tailscale.list + +## Update package list with new repository +sudo apt update + +## Install Tailscale +sudo apt install -y tailscale +``` + +### Step 4. Verify Tailscale installation + +Confirm Tailscale installed correctly on your Spark device before proceeding +with authentication. + +```bash +## Check Tailscale version +tailscale version + +## Check Tailscale service status +sudo systemctl status tailscaled +``` + +### Step 5. Connect Spark device to Tailscale network + +Authenticate your Spark device with Tailscale using your chosen identity +provider. 
This creates your private tailnet and assigns a stable IP address. + +```bash +## Start Tailscale and begin authentication +sudo tailscale up + +## Follow the URL displayed to complete login in browser +## Choose from: Google, GitHub, Microsoft, or other supported providers +``` + +### Step 6. Install Tailscale on client devices + +Install Tailscale on the devices you'll use to connect to your Spark remotely. +Choose the appropriate method for your client operating system. + +#### On macOS + +```bash +## Option 1: Install from Mac App Store +## Search for "Tailscale" and click Get → Install + +## Option 2: Download from website +## Visit https://tailscale.com/download and download .pkg installer +``` + +#### On Windows + +```bash +## Download installer from https://tailscale.com/download +## Run the .msi file and follow installation prompts +## Launch Tailscale from Start Menu or system tray +``` + +#### On Linux + +```bash +## Use same installation steps as Spark device (Steps 3-4) +## Adjust repository URLs for your specific distribution if needed +``` + +### Step 7. Connect client devices to tailnet + +Log in to Tailscale on each client device using the same identity provider +account you used for the Spark device. + +#### On macOS/Windows (GUI) +- Launch Tailscale app +- Click "Log in" button +- Sign in with same account used on Spark + +#### On Linux (CLI) +```bash +## Start Tailscale on client +sudo tailscale up + +## Complete authentication in browser using same account +``` + +### Step 8. Verify network connectivity + +Test that devices can communicate through the Tailscale network before +attempting SSH connections. + +```bash +## On any device, check tailnet status +tailscale status + +## Test ping to Spark device (use hostname or IP from status output) +tailscale ping + +## Example output should show successful pings +``` + +### Step 9. Configure SSH authentication + +Set up SSH key authentication for secure access to your Spark device. 
This +step runs on your client device and Spark device. + +#### Generate SSH key on client (if not already done) + +```bash +## Generate new SSH key pair +ssh-keygen -t ed25519 -f ~/.ssh/tailscale_spark + +## Display public key to copy +cat ~/.ssh/tailscale_spark.pub +``` + +#### Add public key to Spark device + +```bash +## On Spark device, add client's public key +echo "" >> ~/.ssh/authorized_keys + +## Set correct permissions +chmod 600 ~/.ssh/authorized_keys +chmod 700 ~/.ssh +``` + +### Step 10. Test SSH connection + +Connect to your Spark device using SSH over the Tailscale network to verify +the complete setup works. + +```bash +## Connect using Tailscale hostname (preferred) +ssh -i ~/.ssh/tailscale_spark @ + +## Or connect using Tailscale IP address +ssh -i ~/.ssh/tailscale_spark @ + +## Example: +## ssh -i ~/.ssh/tailscale_spark nvidia@my-spark-device +``` + +### Step 11. Validate installation + +Verify that Tailscale is working correctly and your SSH connection is stable. + +```bash +## From client device, check connection status +tailscale status + +## Test file transfer over SSH +scp -i ~/.ssh/tailscale_spark test.txt @:~/ + +## Verify you can run commands remotely +ssh -i ~/.ssh/tailscale_spark @ 'nvidia-smi' +``` + +Expected output should show: +- Tailscale status displaying both devices as "active" +- Successful file transfers +- Remote command execution working + +### Step 12. Troubleshooting + +Common issues and their solutions: + +| Symptom | Cause | Fix | +|---------|-------|-----| +| `tailscale up` auth fails | Network issues | Check internet, try `curl -I login.tailscale.com` | +| SSH connection refused | SSH not running | Run `sudo systemctl start ssh` on Spark | +| SSH auth failure | Wrong SSH keys | Check public key in `~/.ssh/authorized_keys` | +| Cannot ping hostname | DNS issues | Use IP from `tailscale status` instead | +| Devices missing | Different accounts | Use same identity provider for all devices | + +### Step 13. 
Cleanup and rollback + +Remove Tailscale completely if needed. This will disconnect devices from the +tailnet and remove all network configurations. + +> **Warning**: This will permanently remove the device from your Tailscale +> network and require re-authentication to rejoin. + +```bash +## Stop Tailscale service +sudo tailscale down + +## Remove Tailscale package +sudo apt remove --purge tailscale + +## Remove repository and keys (optional) +sudo rm /etc/apt/sources.list.d/tailscale.list +sudo rm /usr/share/keyrings/tailscale-archive-keyring.gpg + +## Update package list +sudo apt update +``` + +To restore: Re-run installation steps 3-5. + +### Step 14. Next steps + +Your Tailscale setup is complete. You can now: + +- Access your Spark device from any network with: `ssh @` +- Transfer files securely: `scp file.txt @:~/` +- Run Jupyter notebooks remotely by SSH tunneling: + `ssh -L 8888:localhost:8888 @` diff --git a/nvidia/trt-llm/README.md b/nvidia/trt-llm/README.md new file mode 100644 index 0000000..359955f --- /dev/null +++ b/nvidia/trt-llm/README.md @@ -0,0 +1,549 @@ +# TRT LLM for Inference + +> Install and configure TRT LLM to run on a single Spark or on two Sparks + +## Table of Contents + +- [Overview](#overview) +- [Single Spark](#single-spark) + - [Step 1. Verify environment prerequisites](#step-1-verify-environment-prerequisites) + - [Step 2. Set environment variables](#step-2-set-environment-variables) + - [Step 3. Validate TensorRT-LLM installation](#step-3-validate-tensorrt-llm-installation) + - [Step 4. Create cache directory](#step-4-create-cache-directory) + - [Step 5. Validate setup with quickstart_advanced](#step-5-validate-setup-with-quickstartadvanced) + - [LLM quickstart example](#llm-quickstart-example) + - [Step 6. Validate setup with quickstart_multimodal](#step-6-validate-setup-with-quickstartmultimodal) + - [VLM quickstart example](#vlm-quickstart-example) + - [Step 7. 
Serve LLM with OpenAI-compatible API](#step-7-serve-llm-with-openai-compatible-api) + - [Step 8. Troubleshooting](#step-8-troubleshooting) + - [Step 9. Cleanup and rollback](#step-9-cleanup-and-rollback) +- [Run on two Sparks](#run-on-two-sparks) + - [Step 1. Review Spark clustering documentation](#step-1-review-spark-clustering-documentation) + - [Step 2. Verify connectivity and SSH setup](#step-2-verify-connectivity-and-ssh-setup) + - [Step 3. Install NVIDIA Container Toolkit](#step-3-install-nvidia-container-toolkit) + - [Step 4. Enable resource advertising](#step-4-enable-resource-advertising) + - [Step 5. Initialize Docker Swarm](#step-5-initialize-docker-swarm) + - [Step 6. Join worker nodes and deploy](#step-6-join-worker-nodes-and-deploy) + - [Step 7. Create hosts file](#step-7-create-hosts-file) + - [Step 8. Find your Docker container ID](#step-8-find-your-docker-container-id) + - [Step 9. Generate configuration file](#step-9-generate-configuration-file) + - [Step 10. Download model](#step-10-download-model) + - [Step 11. Serve the model](#step-11-serve-the-model) + - [Step 12. Validate API server](#step-12-validate-api-server) + - [Step 13. Troubleshooting](#step-13-troubleshooting) + - [Step 14. Cleanup and rollback](#step-14-cleanup-and-rollback) + - [Step 15. Next steps](#step-15-next-steps) + +--- + +## Overview + +## What you'll accomplish + +You'll set up TensorRT-LLM to optimize and deploy large language models on NVIDIA Spark with +Blackwell GPUs, achieving significantly higher throughput and lower latency than standard PyTorch +inference through kernel-level optimizations, efficient memory layouts, and advanced quantization. 
+ +## What to know before starting + +- Python proficiency and experience with PyTorch or similar ML frameworks +- Command-line comfort for running CLI tools and Docker containers +- Basic understanding of GPU concepts including VRAM, batching, and quantization (FP16/INT8) +- Familiarity with NVIDIA software stack (CUDA Toolkit, drivers) +- Experience with inference servers and containerized environments + +## Prerequisites + +- [ ] NVIDIA Spark device with Blackwell architecture GPUs +- [ ] NVIDIA drivers compatible with CUDA 12.x: `nvidia-smi` +- [ ] Docker installed and GPU support configured: `docker run --rm --gpus all nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev nvidia-smi` +- [ ] Hugging Face account with token for model access: `echo $HF_TOKEN` +- [ ] Sufficient GPU VRAM (16GB+ recommended for 70B models) +- [ ] Internet connectivity for downloading models and container images +- [ ] Network: open TCP ports 8355 (LLM) and 8356 (VLM) on host for OpenAI-compatible serving + +## Model Support Matrix + +The following models are supported with TensorRT-LLM on Spark. 
All listed models are available and ready to use: + +| Model | Quantization | Support Status | HF Handle | +|-------|-------------|----------------|-----------| +| **Llama-3.1-8B-Instruct** | FP8 | ✅ | `nvidia/Llama-3.1-8B-Instruct-FP8` | +| **Llama-3.1-8B-Instruct** | NVFP4 | ✅ | `nvidia/Llama-3.1-8B-Instruct-FP4` | +| **Llama-3.3-70B-Instruct** | NVFP4 | ✅ | `nvidia/Llama-3.3-70B-Instruct-FP4` | +| **Qwen3-8B** | FP8 | ✅ | `nvidia/Qwen3-8B-FP8` | +| **Qwen3-8B** | NVFP4 | ✅ | `nvidia/Qwen3-8B-FP4` | +| **Qwen3-14B** | FP8 | ✅ | `nvidia/Qwen3-14B-FP8` | +| **Qwen3-14B** | NVFP4 | ✅ | `nvidia/Qwen3-14B-FP4` | +| **Phi-4-multimodal-instruct** | FP8 | ✅ | `nvidia/Phi-4-multimodal-instruct-FP8` | +| **Phi-4-multimodal-instruct** | NVFP4 | ✅ | `nvidia/Phi-4-multimodal-instruct-FP4` | +| **Phi-4-reasoning-plus** | FP8 | ✅ | `nvidia/Phi-4-reasoning-plus-FP8` | +| **Phi-4-reasoning-plus** | NVFP4 | ✅ | `nvidia/Phi-4-reasoning-plus-FP4` | +| **Llama-3_3-Nemotron-Super-49B-v1_5** | FP8 | ✅ | `nvidia/Llama-3_3-Nemotron-Super-49B-v1_5-FP8` | +| **Qwen3-30B-A3B** | NVFP4 | ✅ | `nvidia/Qwen3-30B-A3B-FP4` | +| **Qwen2.5-VL-7B-Instruct** | FP8 | ✅ | `nvidia/Qwen2.5-VL-7B-Instruct-FP8` | +| **Qwen2.5-VL-7B-Instruct** | NVFP4 | ✅ | `nvidia/Qwen2.5-VL-7B-Instruct-FP4` | +| **Llama-4-Scout-17B-16E-Instruct** | NVFP4 | ✅ | `nvidia/Llama-4-Scout-17B-16E-Instruct-FP4` | +| **Qwen3-235B-A22B (two Sparks only)** | NVFP4 | ✅ | `nvidia/Qwen3-235B-A22B-FP4` | + +**Note:** You can use the NVFP4 Quantization documentation to generate your own NVFP4-quantized checkpoints for your favorite models, taking advantage of the performance and memory benefits of NVFP4 quantization even for models not already published by NVIDIA. Not all model architectures are supported for NVFP4 quantization.
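A rough way to sanity-check a model from the matrix against your available memory: quantized weight size is approximately parameter count times bits per weight divided by 8 (about 4 bits for NVFP4, 8 for FP8). This back-of-envelope figure excludes KV cache, activations, and runtime overhead, so treat it as a lower bound:

```shell
# Approximate weight memory: params (in billions) * bits per weight / 8, in GB.
# Excludes KV cache, activations, and runtime overhead; leave headroom.
weight_gb() {
  awk -v params_b="$1" -v bits="$2" 'BEGIN { printf "%.1f GB\n", params_b * bits / 8 }'
}

weight_gb 70 4   # Llama-3.3-70B at NVFP4: prints "35.0 GB"
weight_gb 8 8    # Llama-3.1-8B at FP8:   prints "8.0 GB"
```

At NVFP4 the 70B checkpoint works out to roughly 35 GB of weights versus ~140 GB at FP16, which is the memory saving the note above refers to.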
+ +## Time & risk + +**Duration**: 45-60 minutes for setup and API server deployment +**Risk level**: Medium - container pulls and model downloads may fail due to network issues +**Rollback**: Stop inference servers and remove downloaded models to free resources + +## Single Spark + +### Step 1. Verify environment prerequisites + +Confirm your Spark device has the required GPU access and network connectivity for downloading +models and containers. + +```bash +## Check GPU visibility and driver +nvidia-smi + +## Verify Docker GPU support +docker run --rm --gpus all nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev nvidia-smi + +``` + +### Step 2. Set environment variables + +Set `HF_TOKEN` for model access. + +```bash +export HF_TOKEN= +``` + +### Step 3. Validate TensorRT-LLM installation + +After confirming GPU access, verify that TensorRT-LLM can be imported inside the container. + +```bash +docker run --rm -it --gpus all \ + nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \ + python -c "import tensorrt_llm; print(f'TensorRT-LLM version: {tensorrt_llm.__version__}')" +``` + +Expected output: +``` +[TensorRT-LLM] TensorRT-LLM version: 1.1.0rc3 +TensorRT-LLM version: 1.1.0rc3 +``` + +### Step 4. Create cache directory + +Set up local caching to avoid re-downloading models on subsequent runs. + +```bash +## Create Hugging Face cache directory +mkdir -p $HOME/.cache/huggingface/ +``` + +### Step 5. Validate setup with quickstart_advanced + +This quickstart validates your TensorRT-LLM setup end-to-end by testing model loading, inference engine initialization, and GPU execution with real text generation. It's the fastest way to confirm everything works before starting the inference API server. 
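The quickstart commands that follow download full checkpoints into the mounted Hugging Face cache, so a quick free-space check first can save a failed run. A sketch assuming GNU `df` (standard on Ubuntu); the thresholds are rough, for example ~40 GB for the 70B FP4 checkpoint and a few GB for the 8B ones:

```shell
# Fail fast if a directory's filesystem has less free space than needed (in GB).
# Assumes GNU df (Ubuntu); the threshold you pass is a rough estimate.
require_free_gb() {
  local dir="$1" need_gb="$2"
  local free_gb
  free_gb=$(df -BG --output=avail "$dir" | tail -1 | tr -dc '0-9')
  if [ "$free_gb" -lt "$need_gb" ]; then
    echo "only ${free_gb}G free under $dir, need ${need_gb}G" >&2
    return 1
  fi
}

# Example: require_free_gb "$HOME/.cache/huggingface" 40 || exit 1
```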
+ +### LLM quickstart example + +#### Llama 3.1 8B Instruct +```bash +export MODEL_HANDLE="nvidia/Llama-3.1-8B-Instruct-FP4" + +docker run \ + -e MODEL_HANDLE=$MODEL_HANDLE \ + -e HF_TOKEN=$HF_TOKEN \ + -v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \ + --rm -it --ulimit memlock=-1 --ulimit stack=67108864 \ + --gpus=all --ipc=host --network host \ + nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \ + bash -c ' + hf download $MODEL_HANDLE && \ + python examples/llm-api/quickstart_advanced.py \ + --model_dir $MODEL_HANDLE \ + --prompt "Paris is great because" \ + --max_tokens 64 + ' +``` + +#### GPT-OSS 20B +```bash +export MODEL_HANDLE="openai/gpt-oss-20b" + +docker run \ + -e MODEL_HANDLE=$MODEL_HANDLE \ + -e HF_TOKEN=$HF_TOKEN \ + -v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \ + --rm -it --ulimit memlock=-1 --ulimit stack=67108864 \ + --gpus=all --ipc=host --network host \ + nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \ + bash -c ' + hf download $MODEL_HANDLE && \ + python examples/llm-api/quickstart_advanced.py \ + --model_dir $MODEL_HANDLE \ + --prompt "Paris is great because" \ + --max_tokens 64 + ' +``` + +#### GPT-OSS 120B +```bash +export MODEL_HANDLE="openai/gpt-oss-120b" + +docker run \ + -e MODEL_HANDLE=$MODEL_HANDLE \ + -e HF_TOKEN=$HF_TOKEN \ + -v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \ + --rm -it --ulimit memlock=-1 --ulimit stack=67108864 \ + --gpus=all --ipc=host --network host \ + nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \ + bash -c ' + hf download $MODEL_HANDLE && \ + python examples/llm-api/quickstart_advanced.py \ + --model_dir $MODEL_HANDLE \ + --prompt "Paris is great because" \ + --max_tokens 64 + ' +``` +### Step 6. Validate setup with quickstart_multimodal + +### VLM quickstart example + +This demonstrates vision-language model capabilities by running inference with image understanding. 
The example uses multimodal inputs to validate both text and vision processing pipelines. + +#### Qwen2.5-VL-7B-Instruct + +```bash +export MODEL_HANDLE="nvidia/Qwen2.5-VL-7B-Instruct-FP4" + +docker run \ + -e MODEL_HANDLE=$MODEL_HANDLE \ + -e HF_TOKEN=$HF_TOKEN \ + -v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \ + --rm -it --ulimit memlock=-1 --ulimit stack=67108864 \ + --gpus=all --ipc=host --network host \ + nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \ + bash -c ' + python3 examples/llm-api/quickstart_multimodal.py \ + --model_dir $MODEL_HANDLE \ + --modality image \ + --media "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore.png" \ + --prompt "What is happening in this image?" \ + ' +``` + +#### Phi-4-multimodal-instruct + +This model requires LoRA (Low-Rank Adaptation) configuration as it uses parameter-efficient fine-tuning. The `--load_lora` flag enables loading the LoRA weights that adapt the base model for multimodal instruction following. +```bash +export MODEL_HANDLE="nvidia/Phi-4-multimodal-instruct-FP4" + +docker run \ + -e MODEL_HANDLE=$MODEL_HANDLE \ + -e HF_TOKEN=$HF_TOKEN \ + -v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \ + --rm -it --ulimit memlock=-1 --ulimit stack=67108864 \ + --gpus=all --ipc=host --network host \ + nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \ + bash -c ' + python3 examples/llm-api/quickstart_multimodal.py \ + --model_type phi4mm \ + --model_dir $MODEL_HANDLE \ + --modality image \ + --media "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore.png" \ + --prompt "What is happening in this image?" \ + --load_lora \ + --auto_model_name Phi4MMForCausalLM + ' +``` + + +> Note: If you hit a host OOM during downloads or first run, free the OS page cache on the host (outside the container) and retry: +```bash +sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches' +``` + +### Step 7. 
Serve LLM with OpenAI-compatible API + +Serve with OpenAI-compatible API via trtllm-serve: + +```bash +export MODEL_HANDLE="nvidia/Llama-3.1-8B-Instruct-FP4" + +docker run --name trtllm_llm_server --rm -it --gpus all --ipc host --network host \ + -e HF_TOKEN=$HF_TOKEN \ + -e MODEL_HANDLE="$MODEL_HANDLE" \ + -v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \ + nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \ + bash -c ' + hf download $MODEL_HANDLE && \ + cat > /tmp/extra-llm-api-config.yml < **Warning:** This will delete all cached models and may require re-downloading for future runs. + +```bash +## Remove Hugging Face cache +sudo chown -R "$USER:$USER" "$HOME/.cache/huggingface" +rm -rf $HOME/.cache/huggingface/ + +## Clean up Docker images +docker image prune -f +docker rmi nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev +``` + +## Run on two Sparks + +### Step 1. Review Spark clustering documentation + +Go to the official DGX Spark clustering documentation to understand the networking requirements and setup procedures: + +[DGX Spark Clustering Documentation](https://docs.nvidia.com/dgx/dgx-spark/spark-clustering.html) + +Review the networking configuration options and choose the appropriate setup method for your environment. + +### Step 2. Verify connectivity and SSH setup + +Verify that the two Spark nodes can communicate with each other using ping and that SSH passwordless authentication is properly configured. + +```bash +## Test network connectivity between nodes (replace with your actual node IPs) +ping -c 3 +``` + +```bash +## Test SSH passwordless authentication (replace with your actual node IP) +ssh nvidia@ hostname +``` + +**Expected results:** +- Ping should show successful packet transmission with 0% packet loss +- SSH command should execute without prompting for a password and return the remote hostname + +### Step 3. 
Install NVIDIA Container Toolkit + +Ensure the NVIDIA drivers and the NVIDIA Container Toolkit are installed on each node (both manager and workers) that will provide GPU resources. This package enables Docker containers to access the host's GPU hardware. Ensure you complete the [installation steps](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html), including the [Docker configuration](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-docker) for NVIDIA Container Toolkit. + +### Step 4. Enable resource advertising + +Modify the NVIDIA Container Runtime to advertise the GPUs to the Swarm by uncommenting the swarm-resource line in the config.toml file. You can do this either with your preferred text editor (e.g., vim, nano...) or with the following command: +```bash +sudo sed -i 's/^#\s*\(swarm-resource\s*=\s*".*"\)/\1/' /etc/nvidia-container-runtime/config.toml +``` +To apply the changes, restart the Docker daemon: +```bash +sudo systemctl restart docker +``` + +### Step 5. Initialize Docker Swarm + +On whichever node you want to use as primary, run the following swarm initialization command: +```bash +docker swarm init --advertise-addr $(ip -o -4 addr show enp1s0f0np0 | awk '{print $4}' | cut -d/ -f1) +``` +Note that `--advertise-addr` accepts a single address; substitute enp1s0f1np1 if that interface carries your inter-node traffic. + +The typical output of the above would be similar to the following: +``` +Swarm initialized: current node (node-id) is now a manager. + +To add a worker to this swarm, run the following command: + + docker swarm join --token <worker-join-token> <manager-ip>:<port> + +To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions. +``` + +### Step 6. 
Join worker nodes and deploy + +Now we can proceed with setting up other nodes of your cluster: + +```bash +## Run the command suggested by the docker swarm init on each worker node to join the Docker swarm +docker swarm join --token <worker-join-token> <manager-ip>:<port> + +## On your primary node, deploy the stack using the following command +## Note: You'll need a docker-compose.yml file for TRT-LLM deployment +docker stack deploy -c docker-compose.yml trtllm-multinode + +## You can verify the status of your worker nodes using the following +docker stack ps trtllm-multinode + +## In case you see any errors reported by docker ps for any node, you can verify using +docker service logs <service-name> +``` + +If everything is healthy, you should see a similar output to the following: +``` +nvidia@spark-1b3b:~/draft-playbooks/trt-llm-on-stacked-spark$ docker stack ps trtllm-multinode +ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS +oe9k5o6w41le trtllm-multinode_trtllm.1 nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev spark-1d84 Running Running 2 minutes ago +phszqzk97p83 trtllm-multinode_trtllm.2 nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev spark-1b3b Running Running 2 minutes ago +``` + +### Step 7. Create hosts file + +You can check the available nodes using `docker node ls` +``` +nvidia@spark-1b3b:~$ docker node ls +ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION +hza2b7yisatqiezo33zx4in4i * spark-1b3b Ready Active Leader 28.3.3 +m1k22g3ktgnx36qz4jg5fzhr4 spark-1d84 Ready Active 28.3.3 +``` + +Generate a file containing all Docker Swarm node addresses for MPI operations, and then copy it over to your container: +```bash +docker node ls --format '{{.ID}}' | xargs -n1 docker node inspect --format '{{ .Status.Addr }}' > ~/openmpi-hostfile +docker cp ~/openmpi-hostfile $(docker ps -q -f name=trtllm-multinode):/etc/openmpi-hostfile +``` + +### Step 8. Find your Docker container ID + +You can use `docker ps` to find your Docker container ID.
Alternatively, you can save the container ID in a variable:
```bash
export TRTLLM_MN_CONTAINER=$(docker ps -q -f name=trtllm-multinode)
```

### Step 9. Generate configuration file

```bash
docker exec $TRTLLM_MN_CONTAINER bash -c 'cat <<EOF > /tmp/extra-llm-api-config.yml
print_iter_log: false
kv_cache_config:
  dtype: "fp8"
  free_gpu_memory_fraction: 0.9
cuda_graph_config:
  enable_padding: true
EOF'
```

### Step 10. Download model

```bash
## Need to specify a Hugging Face token for the model download.
export HF_TOKEN=

docker exec \
  -e MODEL="nvidia/Qwen3-235B-A22B-FP4" \
  -e HF_TOKEN=$HF_TOKEN \
  -it $TRTLLM_MN_CONTAINER bash -c 'mpirun -x HF_TOKEN bash -c "huggingface-cli download $MODEL"'
```

### Step 11. Serve the model

```bash
docker exec \
  -e MODEL="nvidia/Qwen3-235B-A22B-FP4" \
  -e HF_TOKEN=$HF_TOKEN \
  -it $TRTLLM_MN_CONTAINER bash -c '
  mpirun -x HF_TOKEN trtllm-llmapi-launch trtllm-serve $MODEL \
    --tp_size 2 \
    --backend pytorch \
    --max_num_tokens 32768 \
    --max_batch_size 4 \
    --extra_llm_api_options /tmp/extra-llm-api-config.yml \
    --port 8000'
```

This starts the TensorRT-LLM server on port 8000. You can then make inference requests to `http://localhost:8000` using the OpenAI-compatible API format.

**Expected output:** Server startup logs and a ready message.

### Step 12. Validate API server

Verify successful deployment by checking container status and testing the API endpoint.

```bash
docker stack ps trtllm-multinode
```

**Expected output:** Two running containers in the stack across different nodes.

Once the server is running, you can test it with a curl request. Ensure the request is run on the primary node where you previously ran Step 11. 

```bash
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Qwen3-235B-A22B-FP4",
    "prompt": "What is artificial intelligence?",
    "max_tokens": 100,
    "temperature": 0.7,
    "stream": false
  }'
```

Note that because the request body uses a `prompt` field, it targets the `/v1/completions` endpoint; the `/v1/chat/completions` endpoint expects a `messages` array instead.

**Expected output:** JSON response with the generated text completion.

### Step 13. Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| MPI hostname test returns a single hostname | Network connectivity issues | Verify both nodes are on reachable IP addresses |
| "Permission denied" on Hugging Face download | Invalid or missing HF_TOKEN | Set a valid token: `export HF_TOKEN=` |
| "CUDA out of memory" errors | Insufficient GPU memory | Reduce `--max_batch_size` or `--max_num_tokens` |
| Container exits immediately | Missing entrypoint script | Ensure `trtllm-mn-entrypoint.sh` downloaded successfully and has executable permissions |

### Step 14. Cleanup and rollback

Stop and remove the containers by running the following command on the leader node:

```bash
docker stack rm trtllm-multinode
```

> **Warning:** This removes all inference data and performance reports. Copy `/opt/*perf-report.json` files before cleanup if needed.

Remove downloaded models to free disk space:

```bash
rm -rf $HOME/.cache/huggingface/hub/models--nvidia--Qwen3*
```

### Step 15. Next steps

Compare performance metrics between the speculative-decoding and baseline reports to quantify speed improvements. Use the multi-node setup as a foundation for deploying other large models that require tensor parallelism, or scale to additional nodes for higher-throughput workloads. 
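As a quick way to sanity-check responses from the Step 12 curl test programmatically, the completion text can be extracted from the returned JSON with a few lines of Python. This is a minimal sketch: the payload below is illustrative, not real server output, and it assumes the standard OpenAI-compatible completions layout (`choices[0].text`).

```python
import json

# Example payload in the OpenAI-compatible completions format.
# The text and token counts below are illustrative, not real server output.
raw = '''{
  "id": "cmpl-1",
  "object": "text_completion",
  "model": "nvidia/Qwen3-235B-A22B-FP4",
  "choices": [{"index": 0, "text": "Artificial intelligence is ...", "finish_reason": "length"}],
  "usage": {"prompt_tokens": 5, "completion_tokens": 100, "total_tokens": 105}
}'''

resp = json.loads(raw)
completion = resp["choices"][0]["text"]  # the generated text lives here
print(completion)
```

The same parsing works on the output of `curl ... | python3 -c '...'` if you prefer a one-liner on the primary node.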
diff --git a/nvidia/trt-llm/assets/example b/nvidia/trt-llm/assets/example
new file mode 100644
index 0000000..e69de29
diff --git a/nvidia/unsloth/README.md b/nvidia/unsloth/README.md
new file mode 100644
index 0000000..a96c201
--- /dev/null
+++ b/nvidia/unsloth/README.md
@@ -0,0 +1,147 @@
# Unsloth on DGX Spark

> Optimized fine-tuning with Unsloth

## Table of Contents

- [Overview](#overview)
- [Instructions](#instructions)

---

## Overview

## Basic idea

- **Performance-first**: It claims to speed up training (e.g. 2× faster on a single GPU, up to 30× in multi-GPU setups) and reduce memory usage compared to standard methods.
- **Kernel-level optimizations**: Core compute is built with custom kernels (e.g. with Triton) and hand-optimized math to boost throughput and efficiency.
- **Quantization & model formats**: Supports dynamic quantization (4-bit, 16-bit) and GGUF formats to reduce footprint, while aiming to retain accuracy.
- **Broad model support**: Works with many LLMs (LLaMA, Mistral, Qwen, DeepSeek, etc.) and allows training, fine-tuning, and exporting to formats like Ollama, vLLM, GGUF, and Hugging Face.
- **Simplified interface**: Provides easy-to-use notebooks and tools so users can fine-tune models with minimal boilerplate.

## What you'll accomplish

You'll set up Unsloth for optimized fine-tuning of large language models on NVIDIA Spark devices,
achieving up to 2x faster training speeds with reduced memory usage through efficient
parameter-efficient fine-tuning methods like LoRA and QLoRA. 
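The memory savings behind LoRA-style methods come from training small low-rank factors instead of the full weight matrices. A back-of-the-envelope sketch (the layer size is illustrative for a ~7B model; `r = 16` matches the rank used by the test script in this playbook):

```python
# Trainable-parameter count: full fine-tuning vs. a rank-r LoRA adapter
# on a single d_out x d_in projection matrix (W stays frozen under LoRA).

def full_params(d_out, d_in):
    # Every entry of W is trainable.
    return d_out * d_in

def lora_params(d_out, d_in, r):
    # LoRA trains two low-rank factors instead: A (r x d_in) and B (d_out x r).
    return r * d_in + d_out * r

d_out = d_in = 4096   # e.g. one attention projection (illustrative size)
r = 16                # rank used in test_unsloth.py

full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, r)
print(f"full: {full:,}  lora: {lora:,}  ratio: {full // lora}x")
```

At rank 16 the adapter trains roughly 1/128th of the parameters of the full matrix, which is why QLoRA fits on a single device where full fine-tuning would not.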

## What to know before starting

- Python package management with pip and virtual environments
- Hugging Face Transformers library basics (loading models, tokenizers, datasets)
- GPU fundamentals (CUDA/GPU vs CPU, VRAM constraints, device availability)
- Basic understanding of LLM training concepts (loss functions, checkpoints)
- Familiarity with prompt engineering and base model interaction
- Optional: LoRA/QLoRA parameter-efficient fine-tuning knowledge

## Prerequisites

- [ ] NVIDIA Spark device with Blackwell GPU architecture
- [ ] `nvidia-smi` shows a summary of GPU information
- [ ] CUDA 13.0 installed: `nvcc --version`
- [ ] Internet access for downloading models and datasets

## Ancillary files

The Python test script can be found [here on GitHub](https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}/assets/test_unsloth.py)

## Time & risk

- **Duration**: 30-60 minutes for initial setup and test run
- **Risks**:
  - Triton compiler version mismatches may cause compilation errors
  - CUDA toolkit configuration issues may prevent kernel compilation
  - Memory constraints on smaller models require batch size adjustments
- **Rollback**: Uninstall packages with `pip uninstall unsloth torch torchvision`

## Instructions

## Step 1. Verify prerequisites

Confirm your NVIDIA Spark device has the required CUDA toolkit and GPU resources available.

```bash
nvcc --version
```
The output should show CUDA 13.0.

```bash
nvidia-smi
```
The output should show a summary of GPU information.

## Step 2. Get the container image
```bash
docker pull nvcr.io/nvidia/pytorch:25.08-py3
```

## Step 3. Launch Docker
```bash
docker run --gpus all --ulimit memlock=-1 -it --ulimit stack=67108864 --entrypoint /usr/bin/bash --rm nvcr.io/nvidia/pytorch:25.08-py3
```

## Step 4. 
Install dependencies inside Docker

```bash
pip install transformers peft datasets "trl==0.19.1"
pip install --no-deps unsloth unsloth_zoo
```

## Step 5. Build and install bitsandbytes inside Docker
```bash
git clone https://github.com/bitsandbytes-foundation/bitsandbytes.git
cd bitsandbytes
cmake -S . -B build -DCOMPUTE_BACKEND=cuda -DCOMPUTE_CAPABILITY="80;86;87;89;90"
cd build
make -j
cd ..
pip install .
```

## Step 6. Create Python test script

Download the test script [here](https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}/assets/test_unsloth.py) into the container with curl. Use the raw URL (not the `/blob/` page URL) so you fetch the file itself rather than an HTML page:

```bash
curl -O https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/raw/main/${MODEL}/assets/test_unsloth.py
```

We will use this test script to validate the installation with a simple fine-tuning task.

## Step 7. Run the validation test

Execute the test script to verify Unsloth is working correctly.

```bash
python test_unsloth.py
```

Expected output in the terminal window:
- "Unsloth: Will patch your computer to enable 2x faster free finetuning"
- Training progress bars showing loss decreasing over 60 steps
- Final training metrics showing completion

## Step 8. 
Next steps + +Test with your own model and dataset by updating the `test_unsloth.py` file: + +```python +## Replace line 32 with your model choice +model_name = "unsloth/Meta-Llama-3.1-8B-bnb-4bit" + +## Load your custom dataset in line 8 +dataset = load_dataset("your_dataset_name") + +## Adjust training parameter args at line 61 +per_device_train_batch_size = 4 +max_steps = 1000 +``` + +Visit https://github.com/unslothai/unsloth/wiki +for advanced usage instructions, including: +- [Saving models in GGUF format for vLLM](https://github.com/unslothai/unsloth/wiki#saving-to-gguf) +- [Continued training from checkpoints](https://github.com/unslothai/unsloth/wiki#loading-lora-adapters-for-continued-finetuning) +- [Using custom chat templates](https://github.com/unslothai/unsloth/wiki#chat-templates) +- [Running evaluation loops](https://github.com/unslothai/unsloth/wiki#evaluation-loop---also-fixes-oom-or-crashing) diff --git a/nvidia/unsloth/assets/test_unsloth.py b/nvidia/unsloth/assets/test_unsloth.py new file mode 100644 index 0000000..a11367d --- /dev/null +++ b/nvidia/unsloth/assets/test_unsloth.py @@ -0,0 +1,90 @@ +# +# SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# + +from unsloth import FastLanguageModel, FastModel +import torch +from trl import SFTTrainer, SFTConfig +from datasets import load_dataset +max_seq_length = 2048 # Supports RoPE Scaling internally, so choose any! +# Get LAION dataset +url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl" +dataset = load_dataset("json", data_files = {"train" : url}, split = "train") + +# 4bit pre quantized models we support for 4x faster downloading + no OOMs. +fourbit_models = [ + "unsloth/Meta-Llama-3.1-8B-bnb-4bit", # Llama-3.1 2x faster + "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit", + "unsloth/Meta-Llama-3.1-70B-bnb-4bit", + "unsloth/Meta-Llama-3.1-405B-bnb-4bit", # 4bit for 405b! + "unsloth/Mistral-Small-Instruct-2409", # Mistral 22b 2x faster! + "unsloth/mistral-7b-instruct-v0.3-bnb-4bit", + "unsloth/Phi-3.5-mini-instruct", # Phi-3.5 2x faster! + "unsloth/Phi-3-medium-4k-instruct", + "unsloth/gemma-2-9b-bnb-4bit", + "unsloth/gemma-2-27b-bnb-4bit", # Gemma 2x faster! + + "unsloth/Llama-3.2-1B-bnb-4bit", # NEW! Llama 3.2 models + "unsloth/Llama-3.2-1B-Instruct-bnb-4bit", + "unsloth/Llama-3.2-3B-bnb-4bit", + "unsloth/Llama-3.2-3B-Instruct-bnb-4bit", + + "unsloth/Llama-3.3-70B-Instruct-bnb-4bit" # NEW! Llama 3.3 70B! +] # More models at https://huggingface.co/unsloth + +model, tokenizer = FastModel.from_pretrained( + model_name = "unsloth/gemma-3-4B-it", + max_seq_length = 2048, # Choose any for long context! + load_in_4bit = True, # 4 bit quantization to reduce memory + load_in_8bit = False, # [NEW!] A bit more accurate, uses 2x memory + full_finetuning = False, # [NEW!] We have full finetuning now! 
+ # token = "hf_...", # use one if using gated models +) + +# Do model patching and add fast LoRA weights +model = FastLanguageModel.get_peft_model( + model, + r = 16, + target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", + "gate_proj", "up_proj", "down_proj",], + lora_alpha = 16, + lora_dropout = 0, # Supports any, but = 0 is optimized + bias = "none", # Supports any, but = "none" is optimized + # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes! + use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context + random_state = 3407, + max_seq_length = max_seq_length, + use_rslora = False, # We support rank stabilized LoRA + loftq_config = None, # And LoftQ +) + +trainer = SFTTrainer( + model = model, + train_dataset = dataset, + tokenizer = tokenizer, + args = SFTConfig( + max_seq_length = max_seq_length, + per_device_train_batch_size = 2, + gradient_accumulation_steps = 4, + warmup_steps = 10, + max_steps = 60, + logging_steps = 1, + output_dir = "outputs", + optim = "adamw_8bit", + seed = 3407, + ), +) +trainer.train() \ No newline at end of file diff --git a/nvidia/vllm/README.md b/nvidia/vllm/README.md new file mode 100644 index 0000000..2beb419 --- /dev/null +++ b/nvidia/vllm/README.md @@ -0,0 +1,383 @@ +# Install and use vLLM + +> Use a container or build vLLM from source for Spark + +## Table of Contents + +- [Overview](#overview) +- [Run on two Sparks](#run-on-two-sparks) + - [Step 14. (Optional) Launch 405B inference server](#step-14-optional-launch-405b-inference-server) +- [Access through terminal](#access-through-terminal) + +--- + +## Overview + +## What you'll accomplish + +You'll set up vLLM high-throughput LLM serving on DGX Spark with Blackwell architecture, +either using a pre-built Docker container or building from source with custom LLVM/Triton +support for ARM64. 
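To get a feel for why serving across two Sparks matters for the larger models in this playbook, a rough per-node weight-memory estimate under tensor parallelism can be sketched as follows. This counts weights only and ignores KV cache and activation overhead, so real usage is higher; the byte-per-parameter figures are the usual assumptions for bf16 and INT4-AWQ weights.

```python
def weight_gib(n_params_billion, bytes_per_param, tp_size):
    """Approximate per-node weight memory (GiB) under tensor parallelism."""
    total_bytes = n_params_billion * 1e9 * bytes_per_param
    return total_bytes / tp_size / 2**30

# Llama 3.3 70B in bf16 (2 bytes/param), split across 2 Sparks:
print(f"70B bf16, TP=2: {weight_gib(70, 2, 2):.0f} GiB per node")

# Llama 3.1 405B with INT4 AWQ weights (~0.5 byte/param), split across 2 Sparks:
print(f"405B int4, TP=2: {weight_gib(405, 0.5, 2):.0f} GiB per node")
```

Even quantized, the 405B model leaves little room on a 128GB node once KV cache is added, which is why the 405B steps later in this guide run with tightly constrained parameters.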
+ +## What to know before starting + +- Experience building and configuring containers with Docker +- Familiarity with CUDA toolkit installation and version management +- Understanding of Python virtual environments and package management +- Knowledge of building software from source using CMake and Ninja +- Experience with Git version control and patch management + +## Prerequisites + +- [ ] DGX Spark device with ARM64 processor and Blackwell GPU architecture +- [ ] CUDA 12.9 or CUDA 13.0 toolkit installed: `nvcc --version` shows CUDA toolkit version. +- [ ] Docker installed and configured: `docker --version` succeeds +- [ ] NVIDIA Container Toolkit installed +- [ ] Python 3.12 available: `python3.12 --version` succeeds +- [ ] Git installed: `git --version` succeeds +- [ ] Network access to download packages and container images +- [ ] > TODO: Verify memory and storage requirements for builds + +## Time & risk + +**Time estimate:** 30 minutes for Docker approach + +**Risks:** Container registry access requires internal credentials + +**Rollback:** Container approach is non-destructive. + +## Run on two Sparks + +## Step 1. Verify hardware connectivity + +Connect the QSFP cable between both DGX Spark systems using the rightmost QSFP interface on each device. This step establishes the 200GbE direct connection required for high-speed inter-node communication. + +```bash +## Check QSFP interface availability on both nodes +ip link show | grep enP2p1s0f1np1 +``` + +Expected output shows the interface exists but may be down initially. + +## Step 2. Configure cluster network on Node 1 + +Set up the static IP address for the cluster network interface on the first DGX Spark system. This creates a dedicated network segment for distributed inference communication. + +```bash +## Configure static IP on Node 1 +sudo ip addr add 192.168.100.10/24 dev enP2p1s0f1np1 +sudo ip link set enP2p1s0f1np1 up +``` + +## Step 3. 
Configure cluster network on Node 2

Configure the second node with a corresponding static IP in the same network segment.

```bash
## Configure static IP on Node 2
sudo ip addr add 192.168.100.11/24 dev enP2p1s0f1np1
sudo ip link set enP2p1s0f1np1 up
```

## Step 4. Verify network connectivity

Test the direct connection between both nodes to ensure the cluster network is functional.

```bash
## From Node 1, test connectivity to Node 2
ping -c 3 192.168.100.11

## From Node 2, test connectivity to Node 1
ping -c 3 192.168.100.10
```

Expected output shows successful ping responses with low latency.

## Step 5. Download cluster deployment script

Obtain the vLLM cluster deployment script on both nodes. This script orchestrates the Ray cluster setup required for distributed inference.

```bash
## Download on both nodes
wget https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/examples/online_serving/run_cluster.sh
chmod +x run_cluster.sh
```

## Step 6. Pull the NVIDIA vLLM image from NGC

First, configure Docker to pull from NGC. If this is your first time using Docker, run:
```bash
sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker
```

You can now run Docker commands without `sudo`.

Next, ensure you have an NGC API key so you can pull containers from NGC. For more information on setup, see the [NGC Private Registry User Guide](https://docs.nvidia.com/ngc/latest/ngc-private-registry-user-guide.html#accessing-the-ngc-container-registry).

With your API key ready, configure Docker to pull from NGC and pull down the vLLM image:

```bash
docker login nvcr.io
## Username will be `$oauthtoken` and the password is your NGC API key
docker pull nvcr.io/nvidia/vllm:25.09-py3
export VLLM_IMAGE=nvcr.io/nvidia/vllm:25.09-py3
```


## Step 7. Start Ray head node

Launch the Ray cluster head node on Node 1. This node coordinates the distributed inference and serves the API endpoint. 
+ +```bash +## On Node 1, start head node +export MN_IF_NAME=enP2p1s0f1np1 +bash run_cluster.sh $VLLM_IMAGE 192.168.100.10 --head ~/.cache/huggingface \ +-e VLLM_HOST_IP=192.168.100.10 \ +-e UCX_NET_DEVICES=$MN_IF_NAME \ +-e NCCL_SOCKET_IFNAME=$MN_IF_NAME \ +-e OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME \ +-e GLOO_SOCKET_IFNAME=$MN_IF_NAME \ +-e TP_SOCKET_IFNAME=$MN_IF_NAME \ +-e RAY_memory_monitor_refresh_ms=0 \ +-e MASTER_ADDR=192.168.100.10 +``` + + +## Step 8. Start Ray worker node + +Connect Node 2 to the Ray cluster as a worker node. This provides additional GPU resources for tensor parallelism. + +```bash +## On Node 2, join as worker +export MN_IF_NAME=enP2p1s0f1np1 +bash run_cluster.sh $VLLM_IMAGE 192.168.100.10 --worker ~/.cache/huggingface \ +-e VLLM_HOST_IP=192.168.100.11 \ +-e UCX_NET_DEVICES=$MN_IF_NAME \ +-e NCCL_SOCKET_IFNAME=$MN_IF_NAME \ +-e OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME \ +-e GLOO_SOCKET_IFNAME=$MN_IF_NAME \ +-e TP_SOCKET_IFNAME=$MN_IF_NAME \ +-e RAY_memory_monitor_refresh_ms=0 \ +-e MASTER_ADDR=192.168.100.10 +``` + +## Step 9. Verify cluster status + +Confirm both nodes are recognized and available in the Ray cluster. + +```bash +## On Node 1 (head node) +docker exec node ray status +``` + +Expected output shows 2 nodes with available GPU resources. + +## Step 10. Download Llama 3.3 70B model + +Authenticate with Hugging Face and download the recommended production-ready model. + +```bash +## On Node 1, authenticate and download +huggingface-cli login +huggingface-cli download meta-llama/Llama-3.3-70B-Instruct +``` + +## Step 11. Launch inference server for Llama 3.3 70B + +Start the vLLM inference server with tensor parallelism across both nodes. + +```bash +## On Node 1, enter container and start server +docker exec -it node /bin/bash +vllm serve meta-llama/Llama-3.3-70B-Instruct \ +--tensor-parallel-size 2 --max_model_len 2048 +``` + +## Step 12. Test 70B model inference + +Verify the deployment with a sample inference request. 
+ +```bash +## Test from Node 1 or external client +curl http://localhost:8000/v1/completions \ +-H "Content-Type: application/json" \ +-d '{ +"model": "meta-llama/Llama-3.3-70B-Instruct", +"prompt": "Write a haiku about a GPU", +"max_tokens": 32, +"temperature": 0.7 +}' +``` + +Expected output includes a generated haiku response. + +## Step 13. (Optional) Deploy Llama 3.1 405B model + +> **Warning:** 405B model has insufficient memory headroom for production use. + +Download the quantized 405B model for testing purposes only. + +```bash +## On Node 1, download quantized model +huggingface-cli download hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4 +``` + +### Step 14. (Optional) Launch 405B inference server + +Start the server with memory-constrained parameters for the large model. + +```bash +## On Node 1, launch with restricted parameters +docker exec -it node /bin/bash +vllm serve hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4 \ +--tensor-parallel-size 2 --max-model-len 256 --gpu-memory-utilization 1.0 \ +--max-num-seqs 1 --max_num_batched_tokens 256 +``` + +## Step 15. (Optional) Test 405B model inference + +Verify the 405B deployment with constrained parameters. + +```bash +curl http://localhost:8000/v1/completions \ +-H "Content-Type: application/json" \ +-d '{ +"model": "hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4", +"prompt": "Write a haiku about a GPU", +"max_tokens": 32, +"temperature": 0.7 +}' +``` + +## Step 16. Validate deployment + +Perform comprehensive validation of the distributed inference system. + +```bash +## Check Ray cluster health +docker exec node ray status + +## Verify server health endpoint +curl http://192.168.100.10:8000/health + +## Monitor GPU utilization on both nodes +nvidia-smi +docker exec node nvidia-smi --query-gpu=memory.used,memory.total --format=csv +``` + +## Step 17. 
Troubleshooting + +Common issues and their resolutions: + +| Symptom | Cause | Fix | +|---------|--------|-----| +| Node 2 not visible in Ray cluster | Network connectivity issue | Verify QSFP cable connection, check IP configuration | +| Model download fails | Authentication or network issue | Re-run `huggingface-cli login`, check internet access | +| CUDA out of memory with 405B | Insufficient GPU memory | Use 70B model or reduce max_model_len parameter | +| Container startup fails | Missing ARM64 image | Rebuild vLLM image following ARM64 instructions | + +## Step 18. Cleanup and rollback + +Remove temporary configurations and containers when testing is complete. + +> **Warning:** This will stop all inference services and remove cluster configuration. + +```bash +## Stop containers on both nodes +docker stop node +docker rm node + +## Remove network configuration on both nodes +sudo ip addr del 192.168.100.10/24 dev enP2p1s0f1np1 # Node 1 +sudo ip addr del 192.168.100.11/24 dev enP2p1s0f1np1 # Node 2 +sudo ip link set enP2p1s0f1np1 down +``` + +## Step 19. Next steps + +Access the Ray dashboard for cluster monitoring and explore additional features: + +```bash +## Ray dashboard available at: +http://192.168.100.10:8265 + +## Consider implementing for production: +## - Health checks and automatic restarts +## - Log rotation for long-running services +## - Persistent model caching across restarts +## - Alternative quantization methods (FP8, INT4) +``` + +## Access through terminal + +## Step 1. Pull vLLM container image + +Find the latest container build from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?version=25.09-py3 +``` +docker pull nvcr.io/nvidia/vllm:25.09-py3 +``` + +## Step 2. Test vLLM in container + +Launch the container and start vLLM server with a test model to verify basic functionality. 

```bash
docker run -it --gpus all -p 8000:8000 \
nvcr.io/nvidia/vllm:25.09-py3 \
vllm serve "Qwen/Qwen2.5-Math-1.5B-Instruct"
```

Expected output should include:
- Model loading confirmation
- Server startup on port 8000
- GPU memory allocation details

In another terminal, test the server:

```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "Qwen/Qwen2.5-Math-1.5B-Instruct",
  "messages": [{"role": "user", "content": "12*17"}],
  "max_tokens": 500
}'
```

Expected response should contain `"content": "204"` or a similar mathematical calculation.

## Step 3. Troubleshooting

| Symptom | Cause | Fix |
|---------|--------|-----|
| CUDA version mismatch errors | Wrong CUDA toolkit version | Reinstall CUDA 12.9 using the exact installer |
| Container registry authentication fails | Invalid or expired GitLab token | Generate a new token from ****** |
| SM_121a architecture not recognized | Missing LLVM patches | Verify SM_121a patches applied to LLVM source |
| Out-of-memory failures during source build | Too many parallel compile jobs | Reduce MAX_JOBS to 1-2, add swap space |
| Build cannot find the CUDA toolkit | Environment variables not set | Re-export the CUDA paths before building |

## Step 4. Cleanup and rollback

For the container approach (non-destructive):

```bash
docker rm $(docker ps -aq --filter ancestor=******:5005/dl/dgx/vllm*)
docker rmi ******:5005/dl/dgx/vllm:main-py3.31165712-devel
```


To remove CUDA 12.9:

```bash
sudo /usr/local/cuda-12.9/bin/cuda-uninstaller
```

## Step 5. 
Next steps

- **Production deployment:** Configure vLLM with your specific model requirements
- **Performance tuning:** Adjust batch sizes and memory settings for your workload
- **Monitoring:** Set up logging and metrics collection for production use
- **Model management:** Explore additional model formats and quantization options
diff --git a/nvidia/vlm-finetuning/README.md b/nvidia/vlm-finetuning/README.md
new file mode 100644
index 0000000..9bcb68d
--- /dev/null
+++ b/nvidia/vlm-finetuning/README.md
@@ -0,0 +1,204 @@
# Vision-Language Model Fine-tuning

> Fine-tune Vision-Language Models for image and video understanding tasks using Qwen2.5-VL and InternVL3

## Table of Contents

- [Overview](#overview)
- [Instructions](#instructions)
  - [Video VLM Testing](#video-vlm-testing)

---

## Overview

## Basic Idea

This playbook demonstrates how to fine-tune Vision-Language Models (VLMs) for both image and video understanding tasks on DGX Spark.
With 128GB of unified memory and powerful GPU acceleration, DGX Spark provides an ideal environment for training VRAM-intensive multimodal models that can understand and reason about visual content.

The playbook covers two distinct VLM fine-tuning approaches:
- **Image VLM Fine-tuning**: Using Qwen2.5-VL-7B for wildfire detection from satellite imagery with GRPO (Group Relative Policy Optimization)
- **Video VLM Fine-tuning**: Using InternVL3 8B for dangerous driving detection and structured metadata generation from driving videos

Both approaches leverage advanced training techniques, including LoRA fine-tuning, preference optimization, and structured reasoning, to achieve superior performance on specialized tasks.

## What you'll accomplish

You will have fine-tuned VLM models capable of understanding and analyzing both images and videos for specialized use cases, accessible through interactive Web UIs. 
+The setup includes: +- **Image VLM**: Qwen2.5-VL fine-tuned for wildfire detection with reasoning capability +- **Video VLM**: InternVL3 fine-tuned for dangerous driving analysis and structured metadata generation +- Interactive Streamlit interfaces for both training and inference +- Side-by-side model comparison (base vs fine-tuned) in the Web UIs +- Docker containerization for reproducible environments + +## Prerequisites + +- DGX Spark device is set up and accessible +- No other processes running on the DGX Spark GPU +- Enough disk space for model downloads and datasets +- NVIDIA Docker installed and configured +- Weights & Biases account for training monitoring (optional but recommended) + + +## Time & risk + +**Duration**: +- 15-20 minutes for initial setup and model downloads +- 30-60 minutes for image VLM training (depending on dataset size) +- 1-2 hours for video VLM training (depending on video dataset size) + +**Risks**: +- Docker permission issues may require user group changes and session restart +- Large model downloads and datasets may require significant disk space and time +- Training requires sustained GPU usage and memory +- Dataset preparation may require manual steps (Kaggle downloads, video processing) + +**Rollback**: Stop and remove Docker containers, delete downloaded models and datasets if needed + +## Instructions + +## Step 1. Configure Docker permissions + +To easily manage containers without sudo, you must be in the `docker` group. If you choose to skip this step, you will need to run Docker commands with sudo. + +Open a new terminal and test Docker access. 
In the terminal, run: + +```bash +docker ps +``` + +If you see a permission denied error (something like `permission denied while trying to connect to the Docker daemon socket`), add your user to the docker group: + +```bash +sudo usermod -aG docker $USER +``` + +> **Warning**: After running usermod, you must log out and log back in to start a new +> session with updated group permissions. + +## Step 2. Clone the repository + +In a terminal, clone the repository and navigate to the VLM fine-tuning directory. + +```bash +git clone https://******/spark-playbooks/dgx-spark-playbook-assets.git +cd dgx-spark-playbook-assets/vlm-finetuning +``` + +## Step 3. Build the Docker container + +Build the Docker image. This will set up the environment for both image and video VLM fine-tuning: + +```bash +docker build -t vlm-finetuning . +``` + +## Step 4. Run the Docker container + +```bash +## Run with GPU support and mount current directory +docker run --gpus all -it --rm \ + -v $(pwd):/workspace \ + -p 8501:8501 \ + -p 8888:8888 \ + -p 6080:6080 \ + vlm-finetuning +``` + +## Step 5. [Option A] For image VLM fine-tuning (Wildfire Detection) +#### 5.1. Set up Weights & Biases + +Configure your wandb credentials for training monitoring: + +```bash +export WANDB_ENTITY= +export WANDB_PROJECT="vlm_finetuning" +export WANDB_API_KEY= +``` + +#### 5.2. Download the wildfire dataset from Kaggle and place it in the `data` directory + +The wildfire dataset can be found here: https://www.kaggle.com/datasets/abdelghaniaaba/wildfire-prediction-dataset + +#### 5.3. Launch the Image VLM UI + +```bash +cd ui_image +streamlit run Image_VLM_Finetuning.py +``` + +Access the interface at `http://localhost:8501` + +#### 5.4. 
Configure and start training + +- Configure training parameters through the web interface +- Choose fine-tuning method (LoRA, QLoRA, or Full-Finetuning) +- Set hyperparameters (epochs, batch size, learning rate) +- Click "▶️ Start Finetuning" to begin GRPO training +- Monitor progress via embedded wandb charts + +#### 5.5. Test the fine-tuned model + +After training completes: +1. Bring down the UI with Ctrl+C +2. Edit `src/image_vlm_config.yaml` and update `finetuned_model_id` to point to your model in `saved_model/` +3. Restart the interface to test your fine-tuned model + +## Step 6. [Option B] For video VLM fine-tuning (Driver Behaviour Analysis) + +#### 6.1. Prepare your video dataset + +Structure your dataset as follows: +``` +dataset/ +├── videos/ +│ ├── video1.mp4 +│ ├── video2.mp4 +│ └── ... +└── metadata.jsonl +``` + +#### 6.2. Start Jupyter Lab + +```bash +jupyter lab --ip=0.0.0.0 --port=8888 --allow-root +``` + +Access Jupyter at `http://localhost:8888` + +#### 6.3. Run the training notebook + +```bash +cd ui_video/train +## Open and run internvl3_dangerous_driving.ipynb +## Update dataset path in the notebook to point to your data +``` + +#### 6.4. Run inference + +### Video VLM Testing +- Use the inference notebook to test on dashcam footage videos +- Generate structured JSON metadata for dangerous driving events +- Analyze traffic violations and safety risks + +## Step 7. Cleanup + +Exit the container and optionally remove the Docker image: + +```bash +## Exit container +exit + +## Remove Docker image (optional) +docker stop +docker rmi vlm-finetuning +``` + +## Step 8. 
Next steps + +- Train on your own custom datasets for specialized use cases +- Combine multiple VLM models for comprehensive multimodal analysis +- Explore other VLM architectures and training techniques +- Deploy fine-tuned models in production environments diff --git a/nvidia/vscode/README.md b/nvidia/vscode/README.md new file mode 100644 index 0000000..c46f807 --- /dev/null +++ b/nvidia/vscode/README.md @@ -0,0 +1,197 @@ +# Install VS Code + +> Install and use VS Code locally or remotely on Spark + +## Table of Contents + +- [Overview](#overview) +- [Instructions](#instructions) +- [Access with NVIDIA Sync](#access-with-nvidia-sync) + +--- + +## Overview + +## Basic Idea +This walkthrough establishes a local Visual Studio Code development environment directly on DGX Spark devices. By installing VSCode natively on the ARM64-based Spark system, you gain access to a full-featured IDE with extensions, integrated terminal, and Git integration while leveraging the specialized hardware for development and testing. + +## What you'll accomplish +You will have Visual Studio Code running natively on your DGX Spark device with access to the system's ARM64 architecture and GPU resources. This setup enables direct code development, debugging, and execution on the target hardware without remote development overhead. 
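Since the installer download is keyed by CPU architecture, the mapping from `uname -m` output to the VS Code `.deb` build identifier can be sketched in a few lines (the build names mirror those used by the code.visualstudio.com download links; treat the helper itself as illustrative):

```python
import platform

# Map `uname -m` style machine names to VS Code .deb build identifiers.
DEB_BUILDS = {
    "aarch64": "linux-deb-arm64",   # DGX Spark (ARM64)
    "x86_64": "linux-deb-x64",
    "armv7l": "linux-deb-armhf",
}

def vscode_deb_url(machine=None):
    """Return the stable-channel download URL for this machine's .deb build."""
    machine = machine or platform.machine()
    build = DEB_BUILDS[machine]
    return f"https://code.visualstudio.com/sha/download?build=stable&os={build}"

print(vscode_deb_url("aarch64"))
```

On a DGX Spark, `platform.machine()` reports `aarch64`, so the helper resolves to the same ARM64 URL used by the wget command in the installation steps.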

## What to know before starting

- Basic experience working with the Visual Studio Code interface and features
- Familiarity with package management on Linux systems
- Understanding of file permissions and authentication on Linux

## Prerequisites

- DGX Spark device with administrative privileges
- Active internet connection for downloading the VSCode installer
- Verify ARM64 architecture:
  ```bash
  uname -m
  ## Expected output: aarch64
  ```
- Verify a GUI desktop environment is available:
  ```bash
  echo $DISPLAY
  ## Should return display information like :0 or :10.0
  ```

## Time & risk

**Duration:** 10-15 minutes

**Risk level:** Low - installation uses official packages with standard rollback

**Rollback:** Standard package removal via system package manager

## Instructions

## Step 1. Verify system requirements

Before installing VSCode, confirm your DGX Spark system meets the requirements and has GUI support.

```bash
## Verify ARM64 architecture
uname -m

## Check available disk space (VSCode requires ~200MB)
df -h /

## Verify desktop environment is running
ps aux | grep -E "(gnome|kde|xfce)"
```

## Step 2. Download VSCode ARM64 installer

Navigate to the VSCode [download](https://code.visualstudio.com/download) page and download the appropriate ARM64 `.deb` package for your system.

Alternatively, you can download the installer with this command:

```bash
wget "https://code.visualstudio.com/sha/download?build=stable&os=linux-deb-arm64" -O vscode-arm64.deb
```

## Step 3. Install VSCode package

Install the downloaded package using the system package manager.

You can click on the installer file directly or use the command line.

```bash
## Install the downloaded .deb package
sudo dpkg -i vscode-arm64.deb

## Fix any dependency issues if they occur
sudo apt-get install -f
```

## Step 4. Verify installation

Confirm the VSCode app is installed successfully and can launch.

You can open the app directly from the list of applications or use the command line.

```bash
## Check if VSCode is installed
which code

## Verify version
code --version

## Test launch (will open VSCode GUI)
code &
```

VSCode should launch and display the welcome screen.

## Step 5. Configure for Spark development

Set up VSCode for development on the DGX Spark platform.

```bash
## Launch VSCode if not already running
code

## Or create a new project directory and open it
mkdir ~/spark-dev-workspace
cd ~/spark-dev-workspace
code .
```

From within VSCode:

* Open **File** > **Preferences** > **Settings**
* Search for "terminal integrated shell" to configure the default terminal
* Install recommended extensions via the **Extensions** tab (left sidebar)

## Step 6. Validate setup and test functionality

Test core VSCode functionality to ensure proper operation on ARM64.

Create a test file:

```bash
## Create test directory and file
mkdir ~/vscode-test
cd ~/vscode-test
echo 'print("Hello from DGX Spark!")' > test.py
code test.py
```

Within VSCode:

* Verify syntax highlighting works
* Open the integrated terminal (**Terminal** > **New Terminal**)
* Run the test script: `python3 test.py`
* Test Git integration by running `git status` in the terminal

## Step 7. Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| `dpkg: dependency problems` during install | Missing dependencies | Run `sudo apt-get install -f` |
| VSCode won't launch with GUI error | No display server/X11 | Verify GUI desktop is running: `echo $DISPLAY` |
| Extensions fail to install | Network connectivity or ARM64 compatibility | Check internet connection, verify extension ARM64 support |

## Step 8. Uninstalling VSCode

> **Warning:** Uninstalling VSCode will remove all user settings and extensions.

To remove VSCode if needed:

```bash
## Remove VSCode package
sudo apt-get remove code

## Remove configuration files (optional)
rm -rf ~/.config/Code
rm -rf ~/.vscode
```

## Access with NVIDIA Sync

## Step 1. Install and Open NVIDIA Sync

## Step 2. Add your Spark to NVIDIA Sync

## Step 3. Install VS Code locally

## Step 4. Open Sync and launch VS Code

- Wait for the remote connection to be established (it may ask your local machine for a password or to authorize the connection)
- It may prompt you to "trust the authors of the files in this folder" when you first land in the home directory after a successful SSH connection.

## Step 5. Validation and Follow-ups

- Verify that you can access your Spark's filesystem with VSCode as a text editor. Run test commands in the terminal like `hostnamectl` and `whoami` to confirm you are remotely accessing your Spark.
- Specify a file path or directory and start editing/writing files
- Install extensions
- Clone repos
- Locally host an LLM code assistant

diff --git a/nvidia/vss/README.md b/nvidia/vss/README.md
new file mode 100644
index 0000000..d457e12
--- /dev/null
+++ b/nvidia/vss/README.md
@@ -0,0 +1,326 @@
# Video Search and Summarization

> Run the VSS Blueprint on your Spark

## Table of Contents

- [Overview](#overview)
- [Instructions](#instructions)
  - [Navigate to Event Verification directory](#navigate-to-event-verification-directory)
  - [Configure NGC API Key](#configure-ngc-api-key)
  - [Start VSS Event Verification services](#start-vss-event-verification-services)
  - [Navigate to CV Event Detector directory](#navigate-to-cv-event-detector-directory)
  - [Start DeepStream CV pipeline](#start-deepstream-cv-pipeline)
  - [Wait for service initialization](#wait-for-service-initialization)
  - [Validate Event Reviewer deployment](#validate-event-reviewer-deployment)
  - [Navigate to remote LLM deployment directory](#navigate-to-remote-llm-deployment-directory)
  - [Configure environment variables](#configure-environment-variables)
  - [Review model configuration](#review-model-configuration)
  - [Launch Standard VSS deployment](#launch-standard-vss-deployment)
  - [Validate Standard VSS deployment](#validate-standard-vss-deployment)
  - [For Event Reviewer deployment](#for-event-reviewer-deployment)
  - [For Standard VSS deployment](#for-standard-vss-deployment)

---

## Overview

## Basic Idea

Deploy NVIDIA's Video Search and Summarization (VSS) AI Blueprint to build intelligent video analytics systems that combine vision language models, large language models, and retrieval-augmented generation. The system transforms raw video content into real-time actionable insights with video summarization, Q&A, and real-time alerts. You'll set up either a completely local Event Reviewer deployment or a hybrid deployment using remote model endpoints.

## What you'll accomplish

You will deploy NVIDIA's VSS AI Blueprint on NVIDIA Spark hardware with Blackwell architecture, choosing between two deployment scenarios: VSS Event Reviewer (completely local with VLM pipeline) or Standard VSS (hybrid deployment with remote LLM/embedding endpoints). This includes setting up the Alert Bridge, VLM Pipeline, Alert Inspector UI, Video Storage Toolkit, and an optional DeepStream CV pipeline for automated video analysis and event verification.

## What to know before starting

- Working with NVIDIA Docker containers and container registries
- Setting up Docker Compose environments with shared networks
- Managing environment variables and authentication tokens
- Working with NVIDIA DeepStream and computer vision pipelines
- Basic understanding of video processing and analysis workflows

## Prerequisites

- [ ] NVIDIA Spark device with ARM64 architecture and Blackwell GPU
- [ ] FastOS 1.81.38 or compatible ARM64 system
- [ ] Driver version 580.82.09 installed: `nvidia-smi | grep "Driver Version"`
- [ ] CUDA version 13.0 installed: `nvcc --version`
- [ ] Docker installed and running: `docker --version && docker compose version`
- [ ] Access to NVIDIA Container Registry with NGC API Key
- [ ] [Optional] NVIDIA API Key for remote model endpoints (hybrid deployment only)
- [ ] Sufficient storage space for video processing (>10GB recommended in `/tmp/`)

## Ancillary files

- [VSS Blueprint GitHub Repository](https://github.com/NVIDIA-AI-Blueprints/video-search-and-summarization) - Main codebase and Docker Compose configurations
- [Sample CV Detection Pipeline](https://github.com/NVIDIA-AI-Blueprints/video-search-and-summarization/tree/main/examples/cv-event-detector) - Reference CV pipeline for event reviewer workflow
- [VSS Official Documentation](https://docs.nvidia.com/vss/latest/index.html) - Complete system documentation

## Time & risk

**Duration:** 30-45 minutes for initial setup, additional time for video processing validation

**Risks:**
- Container startup can be resource-intensive and time-consuming with large model downloads
- Network configuration conflicts if shared network already exists
- Remote API endpoints may have rate limits or connectivity issues (hybrid deployment)

**Rollback:** Stop all containers with `docker compose down`, remove shared network with `docker network rm vss-shared-network`, and clean up temporary media directories.
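The rollback described above boils down to stopping the compose stacks, removing the shared network, and deleting the temporary media directory. As a convenience, that teardown can be sketched as a small script. This is only a sketch: the directory names are the ones used in this playbook, and the `EXECUTE` flag and dry-run behaviour are additions of the sketch, not part of the blueprint.

```bash
#!/usr/bin/env bash
set -u

## Run a command, or just print it during a dry run (the default).
run() {
  if [ "${EXECUTE:-0}" = "1" ]; then
    eval "$*"
  else
    echo "would run: $*"
  fi
}

## Teardown plan, executed from the repository root
run "(cd deploy/docker/event_reviewer/ && docker compose down)"
run "(cd examples/cv-event-detector/ && docker compose down)"
run "docker network rm vss-shared-network"
run "rm -rf /tmp/alert-media-dir"
run "sudo pkill -f sys_cache_cleaner.sh"
```

Run it once from the repository root to review the printed plan, then re-run with `EXECUTE=1` to perform the actual teardown.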

## Instructions

## Step 1. Verify environment requirements

Check that your system meets the hardware and software prerequisites.

```bash
## Verify driver version
nvidia-smi | grep "Driver Version"
## Expected output: Driver Version: 580.82.09

## Verify CUDA version
nvcc --version
## Expected output: release 13.0

## Verify Docker is running
docker --version && docker compose version
```

## Step 2. Clone the VSS repository

Clone the Video Search and Summarization repository from NVIDIA's public GitHub.

```bash
## Clone the VSS AI Blueprint repository
git clone https://github.com/NVIDIA-AI-Blueprints/video-search-and-summarization.git
cd video-search-and-summarization
```

## Step 3. Run the cache cleaner script

Start the system cache cleaner to optimize memory usage during container operations.

```bash
## Start the cache cleaner script in the background
sudo sh deploy/scripts/sys_cache_cleaner.sh &
```

## Step 4. Set up Docker shared network

Create a Docker network that will be shared between VSS services and CV pipeline containers.

```bash
## Create shared network (may require sudo depending on Docker configuration)
docker network create vss-shared-network
```

> **Warning:** If the network already exists, you may see an error. Remove it first with `docker network rm vss-shared-network` if needed.

## Step 5. Authenticate with NVIDIA Container Registry

Log in to NVIDIA's container registry using your NGC API Key.

```bash
## Log in to NVIDIA Container Registry
docker login nvcr.io
## Username: $oauthtoken
## Password: <your NGC API key>
```

## Step 6. Choose deployment scenario

Choose between two deployment options based on your requirements:

| Deployment Scenario | VLM (Cosmos-Reason1-7B) | LLM (Llama 3.1 70B) | Embedding/Reranker | CV Pipeline |
|----------------------|--------------------------|---------------------|--------------------|-------------|
| VSS Event Reviewer | Local | Not Used | Not Used | Local |
| Standard VSS (Hybrid) | Local | Remote | Remote | Optional |

Proceed with **Option A** for Event Reviewer or **Option B** for Standard VSS.

## Step 7. Option A - VSS Event Reviewer (Completely Local)

### Navigate to Event Verification directory

Change to the directory containing the Event Verification Docker Compose configuration.

```bash
cd deploy/docker/event_reviewer/
```

### Configure NGC API Key

Update the environment file with your NGC API Key.

```bash
## Add your NGC API key to the .env file
echo "NGC_API_KEY=<your-ngc-api-key>" >> .env
```

### Start VSS Event Verification services

Launch the complete VSS Event Verification stack including the Alert Bridge, VLM Pipeline, Alert Inspector UI, and Video Storage Toolkit.

```bash
## Start VSS Event Verification with ARM64 and SBSA optimizations
IS_SBSA=1 IS_AARCH64=1 ALERT_REVIEW_MEDIA_BASE_DIR=/tmp/alert-media-dir docker compose up
```

> **Note:** This step will take several minutes as containers are pulled and services initialize. The VSS backend requires additional startup time.

### Navigate to CV Event Detector directory

In a new terminal session, navigate to the computer vision event detector configuration.

```bash
cd video-search-and-summarization/examples/cv-event-detector
```

### Start DeepStream CV pipeline

Launch the DeepStream computer vision pipeline and CV UI services.

```bash
## Start CV pipeline with ARM64 and SBSA optimizations
IS_SBSA=1 IS_AARCH64=1 ALERT_VERIFICATION_MEDIA_BASE_DIR=/tmp/alert-media-dir docker compose up
```

### Wait for service initialization

Allow time for all containers to fully initialize before accessing the user interfaces.

```bash
## Monitor container status
docker ps
## Verify all containers show "Up" status and VSS backend logs show a ready state
```

### Validate Event Reviewer deployment

Access the web interfaces to confirm successful deployment and functionality.

```bash
## Test CV UI accessibility (replace <your-spark-ip> with your system's IP)
curl -I http://<your-spark-ip>:7862
## Expected: HTTP 200 response

## Test Alert Inspector UI accessibility
curl -I http://<your-spark-ip>:7860
## Expected: HTTP 200 response
```

Open these URLs in your browser:
- `http://<your-spark-ip>:7862` - CV UI to launch and monitor the CV pipeline
- `http://<your-spark-ip>:7860` - Alert Inspector UI to view clips and verification results

## Step 8. Option B - Standard VSS (Hybrid Deployment)

### Navigate to remote LLM deployment directory

```bash
cd deploy/docker/remote_llm_deployment/
```

### Configure environment variables

Update the environment file with your API keys and deployment preferences.

```bash
## Edit .env file with required keys
echo "NVIDIA_API_KEY=<your-nvidia-api-key>" >> .env
echo "NGC_API_KEY=<your-ngc-api-key>" >> .env
echo "DISABLE_CV_PIPELINE=true" >> .env # Set to false to enable CV
echo "INSTALL_PROPRIETARY_CODECS=false" >> .env # Set to true to enable CV
```

### Review model configuration

Verify that the config.yaml file contains the correct remote endpoints.

```bash
## Check model server endpoints in config.yaml
grep -A 10 "model_server" config.yaml
```

### Launch Standard VSS deployment

```bash
## Start Standard VSS with hybrid deployment
docker compose up
```

### Validate Standard VSS deployment

Access the VSS UI to confirm successful deployment.

```bash
## Test VSS UI accessibility (replace <your-spark-ip> with your system's IP)
curl -I http://<your-spark-ip>:9100
## Expected: HTTP 200 response
```

Open `http://<your-spark-ip>:9100` in your browser to access the VSS interface.

## Step 9. Test video processing workflow

Run a basic test to verify the video analysis pipeline is functioning, based on your deployment.

### For Event Reviewer deployment
- Access the CV UI at `http://<your-spark-ip>:7862` to upload and process videos
- Monitor results in the Alert Inspector UI at `http://<your-spark-ip>:7860`

### For Standard VSS deployment
- Access the VSS interface at `http://<your-spark-ip>:9100`
- Upload videos and test summarization features

## Step 10. Troubleshooting

| Symptom | Cause | Fix |
|---------|--------|-----|
| Container fails to start with "pull access denied" | Missing or incorrect nvcr.io credentials | Re-run `docker login nvcr.io` with valid credentials |
| Network creation fails | Existing network with same name | Run `docker network rm vss-shared-network` then recreate |
| Services fail to communicate | Incorrect environment variables | Verify `IS_SBSA=1 IS_AARCH64=1` are set correctly |
| Web interfaces not accessible | Services still starting or port conflicts | Wait 2-3 minutes, check `docker ps` for container status |

## Step 11. Cleanup and rollback

To completely remove the VSS deployment and free up system resources:

> **Warning:** This will destroy all processed video data and analysis results.

```bash
## For Event Reviewer deployment
cd deploy/docker/event_reviewer/
docker compose down
cd ../../../examples/cv-event-detector/
docker compose down

## For Standard VSS deployment
cd deploy/docker/remote_llm_deployment/
docker compose down

## Remove shared network (if using Event Reviewer)
docker network rm vss-shared-network

## Clean up temporary media files and stop the cache cleaner
rm -rf /tmp/alert-media-dir
sudo pkill -f sys_cache_cleaner.sh
```

## Step 12. Next steps

With VSS deployed, you can now:

**Event Reviewer deployment:**
- Upload video files through the CV UI at port 7862
- Monitor automated event detection and verification
- Review analysis results in the Alert Inspector UI at port 7860
- Configure custom event detection rules and thresholds

**Standard VSS deployment:**
- Access full VSS capabilities at port 9100
- Test video summarization and Q&A features
- Configure knowledge graphs and graph databases
- Integrate with existing video processing workflows

diff --git a/src/images/dgx-spark-banner.png b/src/images/dgx-spark-banner.png
new file mode 100644
index 0000000..15df222
Binary files /dev/null and b/src/images/dgx-spark-banner.png differ