mirror of
https://github.com/NVIDIA/dgx-spark-playbooks.git
synced 2026-06-24 23:29:31 +00:00
chore: Regenerate all playbooks
parent 6a749bdcb0
commit 797933babb
@ -122,15 +122,18 @@ All required assets can be found in the [TileGym repository](https://github.com/
|
||||
* Large downloads may fail due to network issues
|
||||
* First run includes JIT compilation overhead
|
||||
* **Rollback:** Remove Docker container to undo all changes
|
||||
* **Last Updated:** February 2026
|
||||
* First Publication
|
||||
* **Last Updated:** 06/16/2026
|
||||
* Upgrade CUDA container to 13.2.0-devel-ubuntu22.04
|
||||
* Upgrade Nsight Systems to 2025.1.3
|
||||
* Add docker preparation steps for TileGym
|
||||
* Pin TileGym to v1.3.0
|
||||
|
||||
## Kernel Benchmarks
|
||||
|
||||
## Step 1. Pull CUDA NGC container with CTK 13.x
|
||||
|
||||
```bash
|
||||
docker pull nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04
|
||||
docker pull nvcr.io/nvidia/cuda:13.2.0-devel-ubuntu22.04
|
||||
```
|
||||
|
||||
Launch an interactive session with GPU access:
|
||||
@ -138,18 +141,26 @@ Launch an interactive session with GPU access:
|
||||
```bash
|
||||
docker run --gpus all -it --rm \
|
||||
-v ~/TileGym:/workspace/TileGym \
|
||||
nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 \
|
||||
nvcr.io/nvidia/cuda:13.2.0-devel-ubuntu22.04 \
|
||||
/bin/bash
|
||||
```
|
||||
|
||||
> [!NOTE]
|
||||
> The `-v` flag mounts a local directory to persist the TileGym repository. The `--rm` flag automatically removes the container when you exit; omit it if you want to keep the container for later use.
|
||||
|
||||
Or if running outside a container, install Tile IR directly:
|
||||
Prepare the container for installing TileGym:
|
||||
|
||||
```bash
|
||||
## Requires root privileges - run with sudo or as root
|
||||
sudo apt-get install cuda-tile-ir-13-1 cuda-compiler-13-1
|
||||
apt-get update && apt-get install -y --no-install-recommends \
|
||||
python3-pip python3-dev python-is-python3 \
|
||||
git wget curl build-essential nsight-systems-2025.1.3
|
||||
update-alternatives --install /usr/bin/nsys nsys /opt/nvidia/nsight-systems/2025.1.3/bin/nsys 100 && hash -r
|
||||
python -m pip install --upgrade pip setuptools wheel
|
||||
|
||||
pip install --no-cache-dir --pre "torch==2.9.1" --index-url https://download.pytorch.org/whl/cu130
|
||||
|
||||
pip install --no-cache-dir --no-deps accelerate==1.13.0 && \
|
||||
pip install --no-cache-dir sentencepiece protobuf
|
||||
```
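
Before moving on, it can help to confirm that the tools installed above are actually on `PATH` inside the container. The following is a minimal sketch rather than part of the playbook: `missing_tools` is a hypothetical helper name, and the tool list simply mirrors the packages installed in this step.

```bash
# Hypothetical helper (not part of TileGym): print every command name
# from the argument list that cannot be found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tools this step is expected to provide inside the container.
missing=$(missing_tools python pip git nsys)
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line.
  echo "Missing tools:" $missing
else
  echo "All expected tools found"
fi
```

If anything is reported missing, re-running the `apt-get` line is usually the first thing to try.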
|
||||
|
||||
## Step 2. Clone TileGym repository
|
||||
@ -157,18 +168,32 @@ sudo apt-get install cuda-tile-ir-13-1 cuda-compiler-13-1
|
||||
```bash
|
||||
git clone https://github.com/NVIDIA/TileGym
|
||||
cd TileGym
|
||||
git checkout v1.3.0
|
||||
pip install .
|
||||
```
|
||||
|
||||
## Step 3. Run benchmark suite
|
||||
## Step 3. Run individual benchmarks
|
||||
|
||||
To run specific kernel benchmarks:
|
||||
|
||||
```bash
|
||||
cd tests/benchmark/
|
||||
bash run_all.sh
|
||||
```
|
||||
|
||||
> [!NOTE]
|
||||
> The benchmark runs sequentially to ensure accurate timing results. This may take 10-15 minutes to complete all kernels.
|
||||
## Flash Multi-Head Attention
|
||||
python bench_fused_attention.py
|
||||
|
||||
## Matrix Multiplication
|
||||
python bench_matrix_multiplication.py
|
||||
|
||||
## RMSNorm
|
||||
python bench_rmsnorm.py
|
||||
|
||||
## RoPE
|
||||
python bench_rope.py
|
||||
|
||||
## SwiGLU
|
||||
python bench_swiglu.py
|
||||
```
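
To run only a subset of the benchmarks above and still get a per-script summary, a small wrapper can help. This is a sketch under assumptions: `run_benchmarks` is a hypothetical helper, not something TileGym ships, and its `PASSED`/`FAILED` lines only approximate the summary format shown in Step 4.

```bash
# Hypothetical helper: run each script with the given runner command and
# print a PASSED/FAILED line per script. Returns non-zero if any failed.
run_benchmarks() {
  runner=$1
  shift
  failed=0
  for script in "$@"; do
    if "$runner" "$script"; then
      echo "PASSED: $script"
    else
      echo "FAILED: $script"
      failed=1
    fi
  done
  return "$failed"
}

# Example (run from tests/benchmark/):
#   run_benchmarks python bench_rmsnorm.py bench_rope.py
```

The scripts still run one after another, which keeps timing results from being skewed by concurrent GPU work.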
|
||||
|
||||
## Step 4. View results
|
||||
|
||||
@ -190,27 +215,17 @@ fused-attention-batch4-head32-d128-fwd-causal=True-float16-TFLOPS:
|
||||
✓ PASSED: bench_fused_attention.py
|
||||
```
|
||||
|
||||
## Step 5. Run individual benchmarks
|
||||
|
||||
To run specific kernel benchmarks:
|
||||
## Step 5. Run benchmark suite
|
||||
|
||||
```bash
|
||||
## Flash Multi-Head Attention
|
||||
python bench_fused_attention.py
|
||||
|
||||
## Matrix Multiplication
|
||||
python bench_matrix_multiplication.py
|
||||
|
||||
## RMSNorm
|
||||
python bench_rmsnorm.py
|
||||
|
||||
## RoPE
|
||||
python bench_rope.py
|
||||
|
||||
## SwiGLU
|
||||
python bench_swiglu.py
|
||||
cd tests/benchmark/
|
||||
bash run_all.sh
|
||||
```
|
||||
|
||||
> [!NOTE]
|
||||
> NOT RECOMMENDED: The benchmark runs sequentially to ensure accurate timing results. This may take 40-60 minutes to complete all kernels.
|
||||
|
||||
|
||||
## Step 6. Clean up
|
||||
|
||||
Exit the container:
|
||||
@ -223,7 +238,7 @@ Remove this workflow's containers (if you ran without `--rm`):
|
||||
|
||||
```bash
|
||||
## Preferred: remove only containers from this workflow's image
|
||||
docker rm $(docker ps -a --filter ancestor=nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 --format '{{.ID}}')
|
||||
docker ps -a --filter ancestor=nvcr.io/nvidia/cuda:13.2.0-devel-ubuntu22.04 -q | xargs -r docker rm
|
||||
|
||||
## Alternative: prune all stopped containers (will prompt for confirmation)
|
||||
## docker container prune
|
||||
@ -232,7 +247,7 @@ docker rm $(docker ps -a --filter ancestor=nvcr.io/nvidia/cuda:13.1-devel-ubuntu
|
||||
Remove the image (optional):
|
||||
|
||||
```bash
|
||||
docker rmi nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04
|
||||
docker rmi nvcr.io/nvidia/cuda:13.2.0-devel-ubuntu22.04
|
||||
```
|
||||
|
||||
## Step 7. Repeat on B300
|
||||
@ -250,6 +265,8 @@ First, clone TileGym on the host:
|
||||
```bash
|
||||
mkdir -p ~/TileGym
|
||||
git clone https://github.com/NVIDIA/TileGym ~/TileGym
|
||||
cd ~/TileGym
|
||||
git checkout v1.3.0
|
||||
```
|
||||
|
||||
Then launch the container with the repository mounted:
|
||||
@ -258,13 +275,28 @@ Then launch the container with the repository mounted:
|
||||
docker run --gpus all -it --rm \
|
||||
-v ~/TileGym:/workspace/TileGym \
|
||||
-v ~/.cache/huggingface:/root/.cache/huggingface \
|
||||
nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 \
|
||||
nvcr.io/nvidia/cuda:13.2.0-devel-ubuntu22.04 \
|
||||
/bin/bash
|
||||
```
|
||||
|
||||
> [!NOTE]
|
||||
> The `-v ~/.cache/huggingface:/root/.cache/huggingface` mounts your HuggingFace cache to avoid re-downloading models.
|
||||
|
||||
Prepare the container for installing TileGym:
|
||||
|
||||
```bash
|
||||
apt-get update && apt-get install -y --no-install-recommends \
|
||||
python3-pip python3-dev python-is-python3 \
|
||||
git wget curl build-essential nsight-systems-2025.1.3
|
||||
update-alternatives --install /usr/bin/nsys nsys /opt/nvidia/nsight-systems/2025.1.3/bin/nsys 100 && hash -r
|
||||
python -m pip install --upgrade pip setuptools wheel
|
||||
|
||||
pip install --no-cache-dir --pre "torch==2.9.1" --index-url https://download.pytorch.org/whl/cu130
|
||||
|
||||
pip install --no-cache-dir --no-deps accelerate==1.13.0 && \
|
||||
pip install --no-cache-dir sentencepiece protobuf
|
||||
```
|
||||
|
||||
Install TileGym inside the container:
|
||||
|
||||
```bash
|
||||
@ -286,7 +318,7 @@ export HF_TOKEN=<your_huggingface_token>
|
||||
Navigate to the transformers benchmark directory:
|
||||
|
||||
```bash
|
||||
cd modeling/transformers
|
||||
cd /workspace/TileGym/modeling/transformers
|
||||
```
|
||||
|
||||
**Option A: Run Qwen2-7B benchmark**
|
||||
@ -841,7 +873,7 @@ Use the ratios below as a reference for how kernel performance scales from DGX S
|
||||
| Slow first run | JIT compilation | Normal - cuTile compiles kernels on first run |
|
||||
| `FileNotFoundError: input_prompt_small.txt` | Missing input file | Run from `modeling/transformers` directory |
|
||||
| `torch.cuda.OutOfMemoryError` | Insufficient GPU memory | Reduce `--batch_size` parameter |
|
||||
| `ImportError: cuda.tile` | Missing Tile IR | Install: `apt-get install cuda-tile-ir-13-1` |
|
||||
| `ImportError: cuda.tile` | Missing Tile IR | Install: `apt-get install cuda-tile-ir-13-2` |
|
||||
| Benchmark hangs | GPU busy or locked | Check `nvidia-smi` for other processes |
|
||||
|
||||
> [!NOTE]
|
||||
|
||||
@ -7,7 +7,6 @@
|
||||
- [Overview](#overview)
|
||||
- [Instructions](#instructions)
|
||||
- [Run on two Sparks](#run-on-two-sparks)
|
||||
- [Step 11. (Optional) Launch 405B inference server](#step-11-optional-launch-405b-inference-server)
|
||||
- [Run on multiple Sparks through a switch](#run-on-multiple-sparks-through-a-switch)
|
||||
- [Run Agent Ready Qwen3.6 35B Model with vLLM](#run-agent-ready-qwen36-35b-model-with-vllm)
|
||||
- [Troubleshooting](#troubleshooting)
|
||||
@ -257,46 +256,56 @@ This includes:
|
||||
- Passwordless SSH setup
|
||||
- Network connectivity verification
|
||||
|
||||
> **Heads up:** the `discover-sparks` script in the linked playbook writes its SSH key to `~/.ssh/` and fails if the directory does not exist yet. Run `mkdir -p ~/.ssh && chmod 700 ~/.ssh` on both nodes first if you have never used SSH on them.
|
||||
|
||||
## Step 2. Download cluster deployment script
|
||||
|
||||
Obtain the vLLM cluster deployment script on both nodes. This script orchestrates the Ray cluster setup required for distributed inference.
|
||||
|
||||
```bash
|
||||
## Download on both nodes
|
||||
wget https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/examples/ray_serving/run_cluster.sh
|
||||
## Download on both nodes — pinned to a known-good commit so upstream changes
|
||||
## can't silently break this playbook against the 26.05-py3 image.
|
||||
wget https://raw.githubusercontent.com/vllm-project/vllm/51c1ee9b7c8acbba4899a8ebffd390685d171946/examples/ray_serving/run_cluster.sh
|
||||
|
||||
## Patch the script to pip-install ray inside the container before ray starts.
|
||||
## The 26.05-py3 NGC image ships without ray (upstream made it an optional CUDA dep);
|
||||
## the install takes ~10s on first container launch.
|
||||
sed -i 's|^RAY_START_CMD="ray start|RAY_START_CMD="pip install -q --root-user-action=ignore '\''ray[default]>=2.9'\'' \&\& ray start|' run_cluster.sh
|
||||
|
||||
chmod +x run_cluster.sh
|
||||
```
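
Because `sed -i` exits successfully even when its pattern matches nothing, it may be worth confirming that the patch actually landed. The check below is a sketch: `ray_patch_applied` is a hypothetical helper, and it assumes (as the `sed` pattern above implies) that the script defines `RAY_START_CMD` at the start of a line.

```bash
# Hypothetical check: succeeds only if RAY_START_CMD now begins with the
# injected pip install of ray[default].
ray_patch_applied() {
  grep -q '^RAY_START_CMD="pip install .*ray\[default\]' "$1"
}

# Usage, in the directory holding the downloaded script:
#   ray_patch_applied run_cluster.sh && echo "patched" || echo "NOT patched"
```

Re-running the `sed` command on an already patched file should be harmless, since the patched line no longer starts with `RAY_START_CMD="ray start`.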
|
||||
|
||||
## Step 3. Pull the NVIDIA vLLM Image from NGC
|
||||
## Step 3. Pull the NVIDIA vLLM image from NGC
|
||||
|
||||
First, you will need to configure docker to pull from NGC
|
||||
If this is your first time using docker run:
|
||||
First, configure docker. If this is your first time using docker, run:
|
||||
```bash
|
||||
sudo groupadd docker
|
||||
sudo usermod -aG docker $USER
|
||||
newgrp docker
|
||||
```
|
||||
|
||||
After this, you should be able to run docker commands without using `sudo`.
|
||||
After this, you should be able to run docker commands without `sudo`.
|
||||
|
||||
Pull the image **on both nodes**:
|
||||
|
||||
```bash
|
||||
docker pull nvcr.io/nvidia/vllm:25.11-py3
|
||||
export VLLM_IMAGE=nvcr.io/nvidia/vllm:25.11-py3
|
||||
docker pull nvcr.io/nvidia/vllm:26.05-py3
|
||||
export VLLM_IMAGE=nvcr.io/nvidia/vllm:26.05-py3
|
||||
```
|
||||
|
||||
|
||||
## Step 4. Start Ray head node
|
||||
|
||||
Launch the Ray cluster head node on Node 1. This node coordinates the distributed inference and serves the API endpoint.
|
||||
|
||||
```bash
|
||||
## On Node 1, start head node
|
||||
## On Node 1, start head node. Run inside tmux/screen so an SSH drop doesn't
|
||||
## tear down the cluster (run_cluster.sh has an EXIT trap that stops the container).
|
||||
|
||||
## Get the IP address of the high-speed interface
|
||||
## Use the interface that shows "(Up)" from ibdev2netdev (enp1s0f0np0 or enp1s0f1np1)
|
||||
export MN_IF_NAME=enp1s0f1np1
|
||||
export VLLM_HOST_IP=$(ip -4 addr show $MN_IF_NAME | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
|
||||
export VLLM_IMAGE=nvcr.io/nvidia/vllm:26.05-py3
|
||||
|
||||
echo "Using interface $MN_IF_NAME with IP $VLLM_HOST_IP"
|
||||
|
||||
@ -311,10 +320,11 @@ bash run_cluster.sh $VLLM_IMAGE $VLLM_HOST_IP --head ~/.cache/huggingface \
|
||||
-e MASTER_ADDR=$VLLM_HOST_IP
|
||||
```
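
If `MN_IF_NAME` names an interface that is down or has no address, `VLLM_HOST_IP` ends up empty and the cluster fails later in less obvious ways. A guard along these lines can catch that before `run_cluster.sh` is launched. It is a sketch only: `is_ipv4` is a hypothetical helper, and it checks the shape of the string, not octet ranges or reachability.

```bash
# Hypothetical guard: succeed only for a dotted-quad string such as
# 192.168.100.10 (digits and dots only; octet ranges are not checked).
is_ipv4() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]{1,3}(\.[0-9]{1,3}){3}$'
}

# Example, after exporting VLLM_HOST_IP:
#   is_ipv4 "$VLLM_HOST_IP" || echo "No IPv4 address on $MN_IF_NAME" >&2
```

The same check would also reject an unreplaced `<NODE_1_IP_ADDRESS>` placeholder in `HEAD_NODE_IP` on the worker node.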
|
||||
|
||||
Leave this terminal open — closing it stops the head node and tears down the cluster.
|
||||
|
||||
## Step 5. Start Ray worker node
|
||||
|
||||
Connect Node 2 to the Ray cluster as a worker node. This provides additional GPU resources for tensor parallelism.
|
||||
Open a second terminal, SSH to Node 2 (`ssh user@<NODE_2_IP>`), and join the Ray cluster as a worker. Replace `<NODE_1_IP_ADDRESS>` below with the QSFP-side IP from Node 1 (run `echo $VLLM_HOST_IP` on Node 1 to print it). Run inside tmux/screen on Node 2 as well.
|
||||
|
||||
```bash
|
||||
## On Node 2, join as worker
|
||||
@ -325,10 +335,12 @@ export MN_IF_NAME=enp1s0f1np1
|
||||
## Get Node 2's own IP address
|
||||
export VLLM_HOST_IP=$(ip -4 addr show $MN_IF_NAME | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
|
||||
|
||||
## IMPORTANT: Set HEAD_NODE_IP to Node 1's IP address
|
||||
## You must get this value from Node 1 (run: echo $VLLM_HOST_IP on Node 1)
|
||||
## Set this to Node 1's QSFP IP (see step header)
|
||||
export HEAD_NODE_IP=<NODE_1_IP_ADDRESS>
|
||||
|
||||
## Set the image tag (same as Step 3)
|
||||
export VLLM_IMAGE=nvcr.io/nvidia/vllm:26.05-py3
|
||||
|
||||
echo "Worker IP: $VLLM_HOST_IP, connecting to head node at: $HEAD_NODE_IP"
|
||||
|
||||
bash run_cluster.sh $VLLM_IMAGE $HEAD_NODE_IP --worker ~/.cache/huggingface \
|
||||
@ -341,7 +353,6 @@ bash run_cluster.sh $VLLM_IMAGE $HEAD_NODE_IP --worker ~/.cache/huggingface \
|
||||
-e RAY_memory_monitor_refresh_ms=0 \
|
||||
-e MASTER_ADDR=$HEAD_NODE_IP
|
||||
```
|
||||
> **Note:** Replace `<NODE_1_IP_ADDRESS>` with the actual IP address from Node 1, specifically the QSFP interface enp1s0f1np1 configured in the [Connect two Sparks](https://build.nvidia.com/spark/connect-two-sparks) playbook.
|
||||
|
||||
## Step 6. Verify cluster status
|
||||
|
||||
@ -360,12 +371,12 @@ Expected output shows 2 nodes with available GPU resources.
|
||||
|
||||
## Step 7. Download Llama 3.3 70B model
|
||||
|
||||
Authenticate with Hugging Face and download the recommended production-ready model.
|
||||
Llama 3.3 70B is a gated model — first accept its license at <https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct> and create an HF access token with read permission. Then authenticate inside the container so the cache lands at `/root/.cache/huggingface` (mounted from `~/.cache/huggingface`).
|
||||
|
||||
```bash
|
||||
## From within the same container where `ray status` ran, run the following
|
||||
docker exec -it $VLLM_CONTAINER /bin/bash -c '
|
||||
hf auth login
|
||||
hf download meta-llama/Llama-3.3-70B-Instruct
|
||||
hf download meta-llama/Llama-3.3-70B-Instruct'
|
||||
```
|
||||
|
||||
## Step 8. Launch inference server for Llama 3.3 70B
|
||||
@ -374,18 +385,17 @@ Start the vLLM inference server with tensor parallelism across both nodes.
|
||||
|
||||
```bash
|
||||
## On Node 1, enter container and start server
|
||||
export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
|
||||
docker exec -it $VLLM_CONTAINER /bin/bash -c '
|
||||
vllm serve meta-llama/Llama-3.3-70B-Instruct \
|
||||
--tensor-parallel-size 2 --max_model_len 2048'
|
||||
--tensor-parallel-size 2 --max-model-len 2048 \
|
||||
--distributed-executor-backend ray'
|
||||
```
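
Loading a 70B model across two nodes takes a while, so requests sent right after launch are likely to fail. One option is to poll the server's `/health` endpoint (the same one used in the validation step) from a second terminal on Node 1. The helper below is a sketch: `wait_until` is a hypothetical name, and the attempt count and delay are arbitrary assumptions to adjust.

```bash
# Hypothetical helper: retry a command up to <attempts> times, sleeping
# <delay> seconds between tries. Succeeds as soon as the command does.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example: poll for up to 10 minutes (60 tries, 10 s apart).
#   wait_until 60 10 curl -sf http://localhost:8000/health && echo "server ready"
```

Taking the probe as an argument keeps the helper reusable, for example with `docker exec $VLLM_CONTAINER ray status` while waiting for the worker to join.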
|
||||
|
||||
## Step 9. Test 70B model inference
|
||||
|
||||
Verify the deployment with a sample inference request.
|
||||
Verify the deployment with a sample inference request. Run this on Node 1 itself; from an external client, replace `localhost` with Node 1's reachable IP.
|
||||
|
||||
```bash
|
||||
## Test from Node 1 or external client
|
||||
curl http://localhost:8000/v1/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
@ -403,29 +413,36 @@ Expected output includes a generated haiku response.
|
||||
> [!WARNING]
|
||||
> 405B model has insufficient memory headroom for production use.
|
||||
|
||||
Download the quantized 405B model for testing purposes only.
|
||||
Download the quantized 405B model for testing purposes only. The download runs inside the head container so the cache lands in the mounted HF directory.
|
||||
|
||||
```bash
|
||||
## On Node 1, download quantized model
|
||||
huggingface-cli download hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4
|
||||
docker exec -it $VLLM_CONTAINER /bin/bash -c '
|
||||
hf download hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4'
|
||||
```
|
||||
|
||||
### Step 11. (Optional) Launch 405B inference server
|
||||
## Step 11. (Optional) Launch 405B inference server
|
||||
|
||||
Start the server with memory-constrained parameters for the large model.
|
||||
|
||||
```bash
|
||||
## On Node 1, launch with restricted parameters
|
||||
export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
|
||||
docker exec -it $VLLM_CONTAINER /bin/bash -c '
|
||||
vllm serve hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4 \
|
||||
--tensor-parallel-size 2 --max-model-len 64 --gpu-memory-utilization 0.9 \
|
||||
--max-num-seqs 1 --max_num_batched_tokens 64'
|
||||
--max-num-seqs 1 --max-num-batched-tokens 64 \
|
||||
--distributed-executor-backend ray'
|
||||
```
|
||||
|
||||
Startup is slow for 405B — expect several minutes of model-loading logs across both nodes. The server is ready to take traffic once you see:
|
||||
|
||||
```
|
||||
INFO: Application startup complete.
|
||||
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
|
||||
```
|
||||
|
||||
## Step 12. (Optional) Test 405B model inference
|
||||
|
||||
Verify the 405B deployment with constrained parameters.
|
||||
Verify the 405B deployment with constrained parameters. As in Step 9, run on Node 1 or replace `localhost` with Node 1's reachable IP from an external client.
|
||||
|
||||
```bash
|
||||
curl http://localhost:8000/v1/completions \
|
||||
@ -444,33 +461,32 @@ Perform comprehensive validation of the distributed inference system.
|
||||
|
||||
```bash
|
||||
## Check Ray cluster health
|
||||
export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
|
||||
docker exec $VLLM_CONTAINER ray status
|
||||
|
||||
## Verify server health endpoint
|
||||
curl http://192.168.100.10:8000/health
|
||||
curl http://localhost:8000/health
|
||||
|
||||
## Monitor GPU utilization on both nodes
|
||||
## Monitor GPU utilization on both nodes (DGX Spark has unified memory,
|
||||
## so the --query-gpu memory fields report N/A; use raw nvidia-smi instead).
|
||||
nvidia-smi
|
||||
export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
|
||||
docker exec $VLLM_CONTAINER nvidia-smi --query-gpu=memory.used,memory.total --format=csv
|
||||
```
|
||||
|
||||
## Step 14. Next steps
|
||||
|
||||
Access the Ray dashboard for cluster monitoring and explore additional features:
|
||||
The Ray dashboard runs on port 8265 of the head node. It binds to the container's network (host networking), so it is only directly reachable from Node 1 itself. From an external workstation, tunnel it over SSH:
|
||||
|
||||
```bash
|
||||
## Ray dashboard available at:
|
||||
http://<head-node-ip>:8265
|
||||
|
||||
## Consider implementing for production:
|
||||
## - Health checks and automatic restarts
|
||||
## - Log rotation for long-running services
|
||||
## - Persistent model caching across restarts
|
||||
## - Alternative quantization methods (FP8, INT4)
|
||||
## From your workstation:
|
||||
ssh -L 8265:localhost:8265 nvidia@<NODE_1_IP>
|
||||
## then open http://localhost:8265 in a local browser
|
||||
```
|
||||
|
||||
Consider for production:
|
||||
- Health checks and automatic restarts
|
||||
- Log rotation for long-running services
|
||||
- Persistent model caching across restarts
|
||||
- Alternative quantization methods (FP8, INT4)
|
||||
|
||||
## Run on multiple Sparks through a switch
|
||||
|
||||
## Step 1. Configure network connectivity
|
||||
@ -640,8 +656,6 @@ curl http://localhost:8000/health
|
||||
|
||||
## Monitor GPU utilization on all nodes
|
||||
nvidia-smi
|
||||
export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
|
||||
docker exec $VLLM_CONTAINER nvidia-smi --query-gpu=memory.used,memory.total --format=csv
|
||||
```
|
||||
|
||||
## Step 11. Next steps
|
||||
|
||||