chore: Regenerate all playbooks

2026-06-24 23:29:31 +00:00 · 2026-06-24 14:57:10 +00:00 · 2026-06-24 14:57:10 +00:00 · 797933babb
commit 797933babb
parent 6a749bdcb0
2 changed files with 126 additions and 80 deletions
--- a/nvidia/cutile-kernels/README.md
+++ b/nvidia/cutile-kernels/README.md
@ -122,15 +122,18 @@ All required assets can be found in the [TileGym repository](https://github.com/
  * Large downloads may fail due to network issues
  * First run includes JIT compilation overhead
 * **Rollback:** Remove Docker container to undo all changes
-* **Last Updated:** February 2026
+* **Last Updated:** 06/16/2026
-  * First Publication
+  * Upgrade CUDA container to 13.2.0-devel-ubuntu22.04
  * Upgrade Nsight Systems to 2025.1.3
  * Add docker preparation steps for TileGym
  * Pin TileGym to v1.3.0
 ## Kernel Benchmarks
 ## Step 1. Pull CUDA NGC container with CTK 13.x
 ```bash
-docker pull nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04
+docker pull nvcr.io/nvidia/cuda:13.2.0-devel-ubuntu22.04
 ```
 Launch an interactive session with GPU access:
@ -138,18 +141,26 @@ Launch an interactive session with GPU access:
 ```bash
 docker run --gpus all -it --rm \
  -v ~/TileGym:/workspace/TileGym \
-  nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 \
+  nvcr.io/nvidia/cuda:13.2.0-devel-ubuntu22.04 \
  /bin/bash
 ```
 > [!NOTE]
 > The `-v` flag mounts a local directory to persist the TileGym repository. The `--rm` flag automatically removes the container when you exit; omit it if you want to keep the container for later use.
-Or if running outside a container, install Tile IR directly:
+Prepare the docker for installing TileGym.
 ```bash
-## Requires root privileges - run with sudo or as root
+apt-get update && apt-get install -y --no-install-recommends \
-sudo apt-get install cuda-tile-ir-13-1 cuda-compiler-13-1
+    python3-pip python3-dev python-is-python3 \
    git wget curl build-essential nsight-systems-2025.1.3
 update-alternatives --install /usr/bin/nsys nsys /opt/nvidia/nsight-systems/2025.1.3/bin/nsys 100 && hash -r
 python -m pip install --upgrade pip setuptools wheel
 pip install --no-cache-dir --pre "torch==2.9.1" --index-url https://download.pytorch.org/whl/cu130
 pip install --no-cache-dir --no-deps accelerate==1.13.0 && \
    pip install --no-cache-dir sentencepiece protobuf
 ```
 ## Step 2. Clone TileGym repository
@ -157,18 +168,32 @@ sudo apt-get install cuda-tile-ir-13-1 cuda-compiler-13-1
 ```bash
 git clone https://github.com/NVIDIA/TileGym
 cd TileGym
 git checkout v1.3.0
 pip install .
 ```
-## Step 3. Run benchmark suite
+## Step 3. Run individual benchmarks
 To run specific kernel benchmarks:
 ```bash
 cd tests/benchmark/
 bash run_all.sh
 ```
-> [!NOTE]
+## Flash Multi-Head Attention
-> The benchmark runs sequentially to ensure accurate timing results. This may take 10-15 minutes to complete all kernels.
+python bench_fused_attention.py
 ## Matrix Multiplication
 python bench_matrix_multiplication.py
 ## RMSNorm
 python bench_rmsnorm.py
 ## RoPE
 python bench_rope.py
 ## SwiGLU
 python bench_swiglu.py
 ```
 ## Step 4. View results
@ -190,27 +215,17 @@ fused-attention-batch4-head32-d128-fwd-causal=True-float16-TFLOPS:
 ✓ PASSED: bench_fused_attention.py
 ```
-## Step 5. Run individual benchmarks
+## Step 5. Run benchmark suite
 To run specific kernel benchmarks:
 ```bash
-## Flash Multi-Head Attention
+cd tests/benchmark/
-python bench_fused_attention.py
+bash run_all.sh
 ## Matrix Multiplication
 python bench_matrix_multiplication.py
 ## RMSNorm
 python bench_rmsnorm.py
 ## RoPE
 python bench_rope.py
 ## SwiGLU
 python bench_swiglu.py
 ```
 > [!NOTE]
 > NOT RECOMMENDED: The benchmark runs sequentially to ensure accurate timing results. This may take 40-60 minutes to complete all kernels.
 ## Step 6. Clean up
 Exit the container:
@ -223,7 +238,7 @@ Remove this workflow's containers (if you ran without `--rm`):
 ```bash
 ## Preferred: remove only containers from this workflow's image
-docker rm $(docker ps -a --filter ancestor=nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 --format '{{.ID}}')
+docker ps -a --filter ancestor=nvcr.io/nvidia/cuda:13.2.0-devel-ubuntu22.04 -q | xargs -r docker rm
 ## Alternative: prune all stopped containers (will prompt for confirmation)
 ## docker container prune
@ -232,7 +247,7 @@ docker rm $(docker ps -a --filter ancestor=nvcr.io/nvidia/cuda:13.1-devel-ubuntu
 Remove the image (optional):
 ```bash
-docker rmi nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04
+docker rmi nvcr.io/nvidia/cuda:13.2.0-devel-ubuntu22.04
 ```
 ## Step 7. Repeat on B300
@ -250,6 +265,8 @@ First, clone TileGym on the host:
 ```bash
 mkdir -p ~/TileGym
 git clone https://github.com/NVIDIA/TileGym ~/TileGym
 cd ~/TileGym
 git checkout v1.3.0
 ```
 Then launch the container with the repository mounted:
@ -258,13 +275,28 @@ Then launch the container with the repository mounted:
 docker run --gpus all -it --rm \
  -v ~/TileGym:/workspace/TileGym \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
-  nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 \
+  nvcr.io/nvidia/cuda:13.2.0-devel-ubuntu22.04 \
  /bin/bash
 ```
 > [!NOTE]
 > The `-v ~/.cache/huggingface:/root/.cache/huggingface` mounts your HuggingFace cache to avoid re-downloading models.
 Prepare the container for installing TileGym:
 ```bash
 apt-get update && apt-get install -y --no-install-recommends \
    python3-pip python3-dev python-is-python3 \
    git wget curl build-essential nsight-systems-2025.1.3
 update-alternatives --install /usr/bin/nsys nsys /opt/nvidia/nsight-systems/2025.1.3/bin/nsys 100 && hash -r
 python -m pip install --upgrade pip setuptools wheel
 pip install --no-cache-dir --pre "torch==2.9.1" --index-url https://download.pytorch.org/whl/cu130
 pip install --no-cache-dir --no-deps accelerate==1.13.0 && \
    pip install --no-cache-dir sentencepiece protobuf
 ```
 Install TileGym inside the container:
 ```bash
@ -286,7 +318,7 @@ export HF_TOKEN=<your_huggingface_token>
 Navigate to the transformers benchmark directory:
 ```bash
-cd modeling/transformers
+cd /workspace/TileGym/modeling/transformers
 ```
 **Option A: Run Qwen2-7B benchmark**
@ -841,7 +873,7 @@ Use the ratios below as a reference for how kernel performance scales from DGX S
 | Slow first run | JIT compilation | Normal - cuTile compiles kernels on first run |
 | `FileNotFoundError: input_prompt_small.txt` | Missing input file | Run from `modeling/transformers` directory |
 | `torch.cuda.OutOfMemoryError` | Insufficient GPU memory | Reduce `--batch_size` parameter |
-| `ImportError: cuda.tile` | Missing Tile IR | Install: `apt-get install cuda-tile-ir-13-1` |
+| `ImportError: cuda.tile` | Missing Tile IR | Install: `apt-get install cuda-tile-ir-13-2` |
 | Benchmark hangs | GPU busy or locked | Check `nvidia-smi` for other processes |
 > [!NOTE] 
--- a/nvidia/vllm/README.md
+++ b/nvidia/vllm/README.md
@ -7,7 +7,6 @@
 - [Overview](#overview)
 - [Instructions](#instructions)
 - [Run on two Sparks](#run-on-two-sparks)
  - [Step 11. (Optional) Launch 405B inference server](#step-11-optional-launch-405b-inference-server)
 - [Run on multiple Sparks through a switch](#run-on-multiple-sparks-through-a-switch)
 - [Run Agent Ready Qwen3.6 35B Model with vLLM](#run-agent-ready-qwen36-35b-model-with-vllm)
 - [Troubleshooting](#troubleshooting)
@ -257,46 +256,56 @@ This includes:
 - Passwordless SSH setup
 - Network connectivity verification
 > **Heads up:** the `discover-sparks` script in the linked playbook writes its SSH key to `~/.ssh/` and fails if the directory does not exist yet. Run `mkdir -p ~/.ssh && chmod 700 ~/.ssh` on both nodes first if you have never used SSH on them.
 ## Step 2. Download cluster deployment script
 Obtain the vLLM cluster deployment script on both nodes. This script orchestrates the Ray cluster setup required for distributed inference.
 ```bash
-## Download on both nodes
+## Download on both nodes — pinned to a known-good commit so upstream changes
-wget https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/examples/ray_serving/run_cluster.sh
+## can't silently break this playbook against the 26.05-py3 image.
 wget https://raw.githubusercontent.com/vllm-project/vllm/51c1ee9b7c8acbba4899a8ebffd390685d171946/examples/ray_serving/run_cluster.sh
 ## Patch the script to pip-install ray inside the container before ray starts.
 ## The 26.05-py3 NGC image ships without ray (upstream made it an optional CUDA dep);
 ## the install takes ~10s on first container launch.
 sed -i 's|^RAY_START_CMD="ray start|RAY_START_CMD="pip install -q --root-user-action=ignore '\''ray[default]>=2.9'\'' \&\& ray start|' run_cluster.sh
 chmod +x run_cluster.sh
 ```
-## Step 3. Pull the NVIDIA vLLM Image from NGC
+## Step 3. Pull the NVIDIA vLLM image from NGC
-First, you will need to configure docker to pull from NGC
+First, configure docker. If this is your first time using docker, run:
 If this is your first time using docker run:
 ```bash
 sudo groupadd docker
 sudo usermod -aG docker $USER
 newgrp docker
 ```
-After this, you should be able to run docker commands without using `sudo`.
+After this, you should be able to run docker commands without `sudo`.
 Pull the image **on both nodes**:
 ```bash
-docker pull nvcr.io/nvidia/vllm:25.11-py3
+docker pull nvcr.io/nvidia/vllm:26.05-py3
-export VLLM_IMAGE=nvcr.io/nvidia/vllm:25.11-py3
+export VLLM_IMAGE=nvcr.io/nvidia/vllm:26.05-py3
 ```
 ## Step 4. Start Ray head node
 Launch the Ray cluster head node on Node 1. This node coordinates the distributed inference and serves the API endpoint.
 ```bash
-## On Node 1, start head node
+## On Node 1, start head node. Run inside tmux/screen so an SSH drop doesn't
 ## tear down the cluster (run_cluster.sh has an EXIT trap that stops the container).
 ## Get the IP address of the high-speed interface
 ## Use the interface that shows "(Up)" from ibdev2netdev (enp1s0f0np0 or enp1s0f1np1)
 export MN_IF_NAME=enp1s0f1np1
 export VLLM_HOST_IP=$(ip -4 addr show $MN_IF_NAME | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
 export VLLM_IMAGE=nvcr.io/nvidia/vllm:26.05-py3
 echo "Using interface $MN_IF_NAME with IP $VLLM_HOST_IP"
@ -311,10 +320,11 @@ bash run_cluster.sh $VLLM_IMAGE $VLLM_HOST_IP --head ~/.cache/huggingface \
  -e MASTER_ADDR=$VLLM_HOST_IP
 ```
 Leave this terminal open — closing it stops the head node and tears down the cluster.
 ## Step 5. Start Ray worker node
-Connect Node 2 to the Ray cluster as a worker node. This provides additional GPU resources for tensor parallelism.
+Open a second terminal, SSH to Node 2 (`ssh user@<NODE_2_IP>`), and join the Ray cluster as a worker. Replace `<NODE_1_IP_ADDRESS>` below with the QSFP-side IP from Node 1 (run `echo $VLLM_HOST_IP` on Node 1 to print it). Run inside tmux/screen on Node 2 as well.
 ```bash
 ## On Node 2, join as worker
@ -325,10 +335,12 @@ export MN_IF_NAME=enp1s0f1np1
 ## Get Node 2's own IP address
 export VLLM_HOST_IP=$(ip -4 addr show $MN_IF_NAME | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
-## IMPORTANT: Set HEAD_NODE_IP to Node 1's IP address
+## Set this to Node 1's QSFP IP (see step header)
 ## You must get this value from Node 1 (run: echo $VLLM_HOST_IP on Node 1)
 export HEAD_NODE_IP=<NODE_1_IP_ADDRESS>
 ## Set the image tag (same as Step 3)
 export VLLM_IMAGE=nvcr.io/nvidia/vllm:26.05-py3
 echo "Worker IP: $VLLM_HOST_IP, connecting to head node at: $HEAD_NODE_IP"
 bash run_cluster.sh $VLLM_IMAGE $HEAD_NODE_IP --worker ~/.cache/huggingface \
@ -341,7 +353,6 @@ bash run_cluster.sh $VLLM_IMAGE $HEAD_NODE_IP --worker ~/.cache/huggingface \
  -e RAY_memory_monitor_refresh_ms=0 \
  -e MASTER_ADDR=$HEAD_NODE_IP
 ```
 > **Note:** Replace `<NODE_1_IP_ADDRESS>` with the actual IP address from Node 1, specifically the QSFP interface nep1s0f1np1 configured in the [Connect two Sparks](https://build.nvidia.com/spark/connect-two-sparks) playbook.
 ## Step 6. Verify cluster status
@ -360,12 +371,12 @@ Expected output shows 2 nodes with available GPU resources.
 ## Step 7. Download Llama 3.3 70B model
-Authenticate with Hugging Face and download the recommended production-ready model.
+Llama 3.3 70B is a gated model — first accept its license at <https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct> and create an HF access token with read permission. Then authenticate inside the container so the cache lands at `/root/.cache/huggingface` (mounted from `~/.cache/huggingface`).
 ```bash
-## From within the same container where `ray status` ran, run the following
+docker exec -it $VLLM_CONTAINER /bin/bash -c '
-hf auth login
+  hf auth login
-hf download meta-llama/Llama-3.3-70B-Instruct
+  hf download meta-llama/Llama-3.3-70B-Instruct'
 ```
 ## Step 8. Launch inference server for Llama 3.3 70B
@ -374,18 +385,17 @@ Start the vLLM inference server with tensor parallelism across both nodes.
 ```bash
 ## On Node 1, enter container and start server
 export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
 docker exec -it $VLLM_CONTAINER /bin/bash -c '
  vllm serve meta-llama/Llama-3.3-70B-Instruct \
-    --tensor-parallel-size 2 --max_model_len 2048'
+    --tensor-parallel-size 2 --max-model-len 2048 \
    --distributed-executor-backend ray'
 ```
 ## Step 9. Test 70B model inference
-Verify the deployment with a sample inference request.
+Verify the deployment with a sample inference request. Run this on Node 1 itself; from an external client, replace `localhost` with Node 1's reachable IP.
 ```bash
 ## Test from Node 1 or external client
 curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
@ -403,29 +413,36 @@ Expected output includes a generated haiku response.
 > [!WARNING]
 > 405B model has insufficient memory headroom for production use.
-Download the quantized 405B model for testing purposes only.
+Download the quantized 405B model for testing purposes only. Runs inside the head container so the cache lands in the mounted HF directory.
 ```bash
-## On Node 1, download quantized model
+docker exec -it $VLLM_CONTAINER /bin/bash -c '
-huggingface-cli download hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4
+  hf download hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4'
 ```
-### Step 11. (Optional) Launch 405B inference server
+## Step 11. (Optional) Launch 405B inference server
 Start the server with memory-constrained parameters for the large model.
 ```bash
 ## On Node 1, launch with restricted parameters
 export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
 docker exec -it $VLLM_CONTAINER /bin/bash -c '
  vllm serve hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4 \
    --tensor-parallel-size 2 --max-model-len 64 --gpu-memory-utilization 0.9 \
-    --max-num-seqs 1 --max_num_batched_tokens 64'
+    --max-num-seqs 1 --max-num-batched-tokens 64 \
    --distributed-executor-backend ray'
 ```
 Startup is slow for 405B — expect several minutes of model-loading logs across both nodes. The server is ready to take traffic once you see:
 ```
 INFO:     Application startup complete.
 INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
 ```
 ## Step 12. (Optional) Test 405B model inference
-Verify the 405B deployment with constrained parameters.
+Verify the 405B deployment with constrained parameters. As in Step 9, run on Node 1 or replace `localhost` with Node 1's reachable IP from an external client.
 ```bash
 curl http://localhost:8000/v1/completions \
@ -444,33 +461,32 @@ Perform comprehensive validation of the distributed inference system.
 ```bash
 ## Check Ray cluster health
 export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
 docker exec $VLLM_CONTAINER ray status
 ## Verify server health endpoint
-curl http://192.168.100.10:8000/health
+curl http://localhost:8000/health
-## Monitor GPU utilization on both nodes
+## Monitor GPU utilization on both nodes (DGX Spark has unified memory,
 ## so the --query-gpu memory fields report N/A; use raw nvidia-smi instead).
 nvidia-smi
 export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
 docker exec $VLLM_CONTAINER nvidia-smi --query-gpu=memory.used,memory.total --format=csv
 ```
 ## Step 14. Next steps
-Access the Ray dashboard for cluster monitoring and explore additional features:
+The Ray dashboard runs on port 8265 of the head node. It binds to the container's network (host networking), so it is only directly reachable from Node 1 itself. From an external workstation, tunnel it over SSH:
 ```bash
-## Ray dashboard available at:
+## From your workstation:
-http://<head-node-ip>:8265
+ssh -L 8265:localhost:8265 nvidia@<NODE_1_IP>
-
+## then open http://localhost:8265 in a local browser
 ## Consider implementing for production:
 ## - Health checks and automatic restarts
 ## - Log rotation for long-running services
 ## - Persistent model caching across restarts
 ## - Alternative quantization methods (FP8, INT4)
 ```
 Consider for production:
 - Health checks and automatic restarts
 - Log rotation for long-running services
 - Persistent model caching across restarts
 - Alternative quantization methods (FP8, INT4)
 ## Run on multiple Sparks through a switch
 ## Step 1. Configure network connectivity
@ -640,8 +656,6 @@ curl http://localhost:8000/health
 ## Monitor GPU utilization on all nodes
 nvidia-smi
 export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
 docker exec $VLLM_CONTAINER nvidia-smi --query-gpu=memory.used,memory.total --format=csv
 ```
 ## Step 11. Next steps