diff --git a/nvidia/cutile-kernels/README.md b/nvidia/cutile-kernels/README.md index dfeeda0..ebbed5f 100644 --- a/nvidia/cutile-kernels/README.md +++ b/nvidia/cutile-kernels/README.md @@ -122,15 +122,18 @@ All required assets can be found in the [TileGym repository](https://github.com/ * Large downloads may fail due to network issues * First run includes JIT compilation overhead * **Rollback:** Remove Docker container to undo all changes -* **Last Updated:** February 2026 - * First Publication +* **Last Updated:** 06/16/2026 + * Upgrade CUDA container to 13.2.0-devel-ubuntu22.04 + * Upgrade Nsight Systems to 2025.1.3 + * Add docker preparation steps for TileGym + * Pin TileGym to v1.3.0 ## Kernel Benchmarks ## Step 1. Pull CUDA NGC container with CTK 13.x ```bash -docker pull nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 +docker pull nvcr.io/nvidia/cuda:13.2.0-devel-ubuntu22.04 ``` Launch an interactive session with GPU access: @@ -138,18 +141,26 @@ Launch an interactive session with GPU access: ```bash docker run --gpus all -it --rm \ -v ~/TileGym:/workspace/TileGym \ - nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 \ + nvcr.io/nvidia/cuda:13.2.0-devel-ubuntu22.04 \ /bin/bash ``` > [!NOTE] > The `-v` flag mounts a local directory to persist the TileGym repository. The `--rm` flag automatically removes the container when you exit; omit it if you want to keep the container for later use. -Or if running outside a container, install Tile IR directly: +Prepare the docker for installing TileGym. ```bash -## Requires root privileges - run with sudo or as root -sudo apt-get install cuda-tile-ir-13-1 cuda-compiler-13-1 +apt-get update && apt-get install -y --no-install-recommends \ + python3-pip python3-dev python-is-python3 \ + git wget curl build-essential nsight-systems-2025.1.3 +update-alternatives --install /usr/bin/nsys nsys /opt/nvidia/nsight-systems/2025.1.3/bin/nsys 100 && hash -r +python -m pip install --upgrade pip setuptools wheel + +pip install --no-cache-dir --pre "torch==2.9.1" --index-url https://download.pytorch.org/whl/cu130 + +pip install --no-cache-dir --no-deps accelerate==1.13.0 && \ + pip install --no-cache-dir sentencepiece protobuf ``` ## Step 2. Clone TileGym repository @@ -157,18 +168,32 @@ sudo apt-get install cuda-tile-ir-13-1 cuda-compiler-13-1 ```bash git clone https://github.com/NVIDIA/TileGym cd TileGym +git checkout v1.3.0 pip install . ``` -## Step 3. Run benchmark suite +## Step 3. Run individual benchmarks + +To run specific kernel benchmarks: ```bash cd tests/benchmark/ -bash run_all.sh -``` -> [!NOTE] -> The benchmark runs sequentially to ensure accurate timing results. This may take 10-15 minutes to complete all kernels. +## Flash Multi-Head Attention +python bench_fused_attention.py + +## Matrix Multiplication +python bench_matrix_multiplication.py + +## RMSNorm +python bench_rmsnorm.py + +## RoPE +python bench_rope.py + +## SwiGLU +python bench_swiglu.py +``` ## Step 4. View results @@ -190,27 +215,17 @@ fused-attention-batch4-head32-d128-fwd-causal=True-float16-TFLOPS: ✓ PASSED: bench_fused_attention.py ``` -## Step 5. Run individual benchmarks - -To run specific kernel benchmarks: +## Step 5. Run benchmark suite ```bash -## Flash Multi-Head Attention -python bench_fused_attention.py - -## Matrix Multiplication -python bench_matrix_multiplication.py - -## RMSNorm -python bench_rmsnorm.py - -## RoPE -python bench_rope.py - -## SwiGLU -python bench_swiglu.py +cd tests/benchmark/ +bash run_all.sh ``` +> [!NOTE] +> NOT RECOMMENDED: The benchmark runs sequentially to ensure accurate timing results. This may take 40-60 minutes to complete all kernels. + + ## Step 6. Clean up Exit the container: @@ -223,7 +238,7 @@ Remove this workflow's containers (if you ran without `--rm`): ```bash ## Preferred: remove only containers from this workflow's image -docker rm $(docker ps -a --filter ancestor=nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 --format '{{.ID}}') +docker ps -a --filter ancestor=nvcr.io/nvidia/cuda:13.2.0-devel-ubuntu22.04 -q | xargs -r docker rm ## Alternative: prune all stopped containers (will prompt for confirmation) ## docker container prune @@ -232,7 +247,7 @@ docker rm $(docker ps -a --filter ancestor=nvcr.io/nvidia/cuda:13.1-devel-ubuntu Remove the image (optional): ```bash -docker rmi nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 +docker rmi nvcr.io/nvidia/cuda:13.2.0-devel-ubuntu22.04 ``` ## Step 7. Repeat on B300 @@ -250,6 +265,8 @@ First, clone TileGym on the host: ```bash mkdir -p ~/TileGym git clone https://github.com/NVIDIA/TileGym ~/TileGym +cd ~/TileGym +git checkout v1.3.0 ``` Then launch the container with the repository mounted: @@ -258,13 +275,28 @@ Then launch the container with the repository mounted: docker run --gpus all -it --rm \ -v ~/TileGym:/workspace/TileGym \ -v ~/.cache/huggingface:/root/.cache/huggingface \ - nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 \ + nvcr.io/nvidia/cuda:13.2.0-devel-ubuntu22.04 \ /bin/bash ``` > [!NOTE] > The `-v ~/.cache/huggingface:/root/.cache/huggingface` mounts your HuggingFace cache to avoid re-downloading models. +Prepare the container for installing TileGym: + +```bash +apt-get update && apt-get install -y --no-install-recommends \ + python3-pip python3-dev python-is-python3 \ + git wget curl build-essential nsight-systems-2025.1.3 +update-alternatives --install /usr/bin/nsys nsys /opt/nvidia/nsight-systems/2025.1.3/bin/nsys 100 && hash -r +python -m pip install --upgrade pip setuptools wheel + +pip install --no-cache-dir --pre "torch==2.9.1" --index-url https://download.pytorch.org/whl/cu130 + +pip install --no-cache-dir --no-deps accelerate==1.13.0 && \ + pip install --no-cache-dir sentencepiece protobuf +``` + Install TileGym inside the container: ```bash @@ -286,7 +318,7 @@ export HF_TOKEN= Navigate to the transformers benchmark directory: ```bash -cd modeling/transformers +cd /workspace/TileGym/modeling/transformers ``` **Option A: Run Qwen2-7B benchmark** @@ -841,7 +873,7 @@ Use the ratios below as a reference for how kernel performance scales from DGX S | Slow first run | JIT compilation | Normal - cuTile compiles kernels on first run | | `FileNotFoundError: input_prompt_small.txt` | Missing input file | Run from `modeling/transformers` directory | | `torch.cuda.OutOfMemoryError` | Insufficient GPU memory | Reduce `--batch_size` parameter | -| `ImportError: cuda.tile` | Missing Tile IR | Install: `apt-get install cuda-tile-ir-13-1` | +| `ImportError: cuda.tile` | Missing Tile IR | Install: `apt-get install cuda-tile-ir-13-2` | | Benchmark hangs | GPU busy or locked | Check `nvidia-smi` for other processes | > [!NOTE] diff --git a/nvidia/vllm/README.md b/nvidia/vllm/README.md index 5d44665..b722da1 100644 --- a/nvidia/vllm/README.md +++ b/nvidia/vllm/README.md @@ -7,7 +7,6 @@ - [Overview](#overview) - [Instructions](#instructions) - [Run on two Sparks](#run-on-two-sparks) - - [Step 11. (Optional) Launch 405B inference server](#step-11-optional-launch-405b-inference-server) - [Run on multiple Sparks through a switch](#run-on-multiple-sparks-through-a-switch) - [Run Agent Ready Qwen3.6 35B Model with vLLM](#run-agent-ready-qwen36-35b-model-with-vllm) - [Troubleshooting](#troubleshooting) @@ -257,46 +256,56 @@ This includes: - Passwordless SSH setup - Network connectivity verification +> **Heads up:** the `discover-sparks` script in the linked playbook writes its SSH key to `~/.ssh/` and fails if the directory does not exist yet. Run `mkdir -p ~/.ssh && chmod 700 ~/.ssh` on both nodes first if you have never used SSH on them. + ## Step 2. Download cluster deployment script Obtain the vLLM cluster deployment script on both nodes. This script orchestrates the Ray cluster setup required for distributed inference. ```bash -## Download on both nodes -wget https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/examples/ray_serving/run_cluster.sh +## Download on both nodes — pinned to a known-good commit so upstream changes +## can't silently break this playbook against the 26.05-py3 image. +wget https://raw.githubusercontent.com/vllm-project/vllm/51c1ee9b7c8acbba4899a8ebffd390685d171946/examples/ray_serving/run_cluster.sh + +## Patch the script to pip-install ray inside the container before ray starts. +## The 26.05-py3 NGC image ships without ray (upstream made it an optional CUDA dep); +## the install takes ~10s on first container launch. +sed -i 's|^RAY_START_CMD="ray start|RAY_START_CMD="pip install -q --root-user-action=ignore '\''ray[default]>=2.9'\'' \&\& ray start|' run_cluster.sh + chmod +x run_cluster.sh ``` -## Step 3. Pull the NVIDIA vLLM Image from NGC +## Step 3. Pull the NVIDIA vLLM image from NGC -First, you will need to configure docker to pull from NGC -If this is your first time using docker run: +First, configure docker. If this is your first time using docker, run: ```bash sudo groupadd docker sudo usermod -aG docker $USER newgrp docker ``` -After this, you should be able to run docker commands without using `sudo`. +After this, you should be able to run docker commands without `sudo`. +Pull the image **on both nodes**: ```bash -docker pull nvcr.io/nvidia/vllm:25.11-py3 -export VLLM_IMAGE=nvcr.io/nvidia/vllm:25.11-py3 +docker pull nvcr.io/nvidia/vllm:26.05-py3 +export VLLM_IMAGE=nvcr.io/nvidia/vllm:26.05-py3 ``` - ## Step 4. Start Ray head node Launch the Ray cluster head node on Node 1. This node coordinates the distributed inference and serves the API endpoint. ```bash -## On Node 1, start head node +## On Node 1, start head node. Run inside tmux/screen so an SSH drop doesn't +## tear down the cluster (run_cluster.sh has an EXIT trap that stops the container). ## Get the IP address of the high-speed interface ## Use the interface that shows "(Up)" from ibdev2netdev (enp1s0f0np0 or enp1s0f1np1) export MN_IF_NAME=enp1s0f1np1 export VLLM_HOST_IP=$(ip -4 addr show $MN_IF_NAME | grep -oP '(?<=inet\s)\d+(\.\d+){3}') +export VLLM_IMAGE=nvcr.io/nvidia/vllm:26.05-py3 echo "Using interface $MN_IF_NAME with IP $VLLM_HOST_IP" @@ -311,10 +320,11 @@ bash run_cluster.sh $VLLM_IMAGE $VLLM_HOST_IP --head ~/.cache/huggingface \ -e MASTER_ADDR=$VLLM_HOST_IP ``` +Leave this terminal open — closing it stops the head node and tears down the cluster. ## Step 5. Start Ray worker node -Connect Node 2 to the Ray cluster as a worker node. This provides additional GPU resources for tensor parallelism. +Open a second terminal, SSH to Node 2 (`ssh user@`), and join the Ray cluster as a worker. Replace `` below with the QSFP-side IP from Node 1 (run `echo $VLLM_HOST_IP` on Node 1 to print it). Run inside tmux/screen on Node 2 as well. ```bash ## On Node 2, join as worker @@ -325,10 +335,12 @@ export MN_IF_NAME=enp1s0f1np1 ## Get Node 2's own IP address export VLLM_HOST_IP=$(ip -4 addr show $MN_IF_NAME | grep -oP '(?<=inet\s)\d+(\.\d+){3}') -## IMPORTANT: Set HEAD_NODE_IP to Node 1's IP address -## You must get this value from Node 1 (run: echo $VLLM_HOST_IP on Node 1) +## Set this to Node 1's QSFP IP (see step header) export HEAD_NODE_IP= +## Set the image tag (same as Step 3) +export VLLM_IMAGE=nvcr.io/nvidia/vllm:26.05-py3 + echo "Worker IP: $VLLM_HOST_IP, connecting to head node at: $HEAD_NODE_IP" bash run_cluster.sh $VLLM_IMAGE $HEAD_NODE_IP --worker ~/.cache/huggingface \ @@ -341,7 +353,6 @@ bash run_cluster.sh $VLLM_IMAGE $HEAD_NODE_IP --worker ~/.cache/huggingface \ -e RAY_memory_monitor_refresh_ms=0 \ -e MASTER_ADDR=$HEAD_NODE_IP ``` -> **Note:** Replace `` with the actual IP address from Node 1, specifically the QSFP interface nep1s0f1np1 configured in the [Connect two Sparks](https://build.nvidia.com/spark/connect-two-sparks) playbook. ## Step 6. Verify cluster status @@ -360,12 +371,12 @@ Expected output shows 2 nodes with available GPU resources. ## Step 7. Download Llama 3.3 70B model -Authenticate with Hugging Face and download the recommended production-ready model. +Llama 3.3 70B is a gated model — first accept its license at and create an HF access token with read permission. Then authenticate inside the container so the cache lands at `/root/.cache/huggingface` (mounted from `~/.cache/huggingface`). ```bash -## From within the same container where `ray status` ran, run the following -hf auth login -hf download meta-llama/Llama-3.3-70B-Instruct +docker exec -it $VLLM_CONTAINER /bin/bash -c ' + hf auth login + hf download meta-llama/Llama-3.3-70B-Instruct' ``` ## Step 8. Launch inference server for Llama 3.3 70B @@ -374,18 +385,17 @@ Start the vLLM inference server with tensor parallelism across both nodes. ```bash ## On Node 1, enter container and start server -export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$') docker exec -it $VLLM_CONTAINER /bin/bash -c ' vllm serve meta-llama/Llama-3.3-70B-Instruct \ - --tensor-parallel-size 2 --max_model_len 2048' + --tensor-parallel-size 2 --max-model-len 2048 \ + --distributed-executor-backend ray' ``` ## Step 9. Test 70B model inference -Verify the deployment with a sample inference request. +Verify the deployment with a sample inference request. Run this on Node 1 itself; from an external client, replace `localhost` with Node 1's reachable IP. ```bash -## Test from Node 1 or external client curl http://localhost:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ @@ -403,29 +413,36 @@ Expected output includes a generated haiku response. > [!WARNING] > 405B model has insufficient memory headroom for production use. -Download the quantized 405B model for testing purposes only. +Download the quantized 405B model for testing purposes only. Runs inside the head container so the cache lands in the mounted HF directory. ```bash -## On Node 1, download quantized model -huggingface-cli download hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4 +docker exec -it $VLLM_CONTAINER /bin/bash -c ' + hf download hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4' ``` -### Step 11. (Optional) Launch 405B inference server +## Step 11. (Optional) Launch 405B inference server Start the server with memory-constrained parameters for the large model. ```bash ## On Node 1, launch with restricted parameters -export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$') docker exec -it $VLLM_CONTAINER /bin/bash -c ' vllm serve hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4 \ --tensor-parallel-size 2 --max-model-len 64 --gpu-memory-utilization 0.9 \ - --max-num-seqs 1 --max_num_batched_tokens 64' + --max-num-seqs 1 --max-num-batched-tokens 64 \ + --distributed-executor-backend ray' +``` + +Startup is slow for 405B — expect several minutes of model-loading logs across both nodes. The server is ready to take traffic once you see: + +``` +INFO: Application startup complete. +INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit) ``` ## Step 12. (Optional) Test 405B model inference -Verify the 405B deployment with constrained parameters. +Verify the 405B deployment with constrained parameters. As in Step 9, run on Node 1 or replace `localhost` with Node 1's reachable IP from an external client. ```bash curl http://localhost:8000/v1/completions \ @@ -444,33 +461,32 @@ Perform comprehensive validation of the distributed inference system. ```bash ## Check Ray cluster health -export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$') docker exec $VLLM_CONTAINER ray status ## Verify server health endpoint -curl http://192.168.100.10:8000/health +curl http://localhost:8000/health -## Monitor GPU utilization on both nodes +## Monitor GPU utilization on both nodes (DGX Spark has unified memory, +## so the --query-gpu memory fields report N/A; use raw nvidia-smi instead). nvidia-smi -export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$') -docker exec $VLLM_CONTAINER nvidia-smi --query-gpu=memory.used,memory.total --format=csv ``` ## Step 14. Next steps -Access the Ray dashboard for cluster monitoring and explore additional features: +The Ray dashboard runs on port 8265 of the head node. It binds to the container's network (host networking), so it is only directly reachable from Node 1 itself. From an external workstation, tunnel it over SSH: ```bash -## Ray dashboard available at: -http://:8265 - -## Consider implementing for production: -## - Health checks and automatic restarts -## - Log rotation for long-running services -## - Persistent model caching across restarts -## - Alternative quantization methods (FP8, INT4) +## From your workstation: +ssh -L 8265:localhost:8265 nvidia@ +## then open http://localhost:8265 in a local browser ``` +Consider for production: +- Health checks and automatic restarts +- Log rotation for long-running services +- Persistent model caching across restarts +- Alternative quantization methods (FP8, INT4) + ## Run on multiple Sparks through a switch ## Step 1. Configure network connectivity @@ -640,8 +656,6 @@ curl http://localhost:8000/health ## Monitor GPU utilization on all nodes nvidia-smi -export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$') -docker exec $VLLM_CONTAINER nvidia-smi --query-gpu=memory.used,memory.total --format=csv ``` ## Step 11. Next steps