mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git
synced 2026-06-24 23:29:31 +00:00
chore: Regenerate all playbooks
parent 6a749bdcb0
commit 797933babb
@ -122,15 +122,18 @@ All required assets can be found in the [TileGym repository](https://github.com/
|
|||||||
* Large downloads may fail due to network issues
|
* Large downloads may fail due to network issues
|
||||||
* First run includes JIT compilation overhead
|
* First run includes JIT compilation overhead
|
||||||
* **Rollback:** Remove Docker container to undo all changes
|
* **Rollback:** Remove Docker container to undo all changes
|
||||||
* **Last Updated:** February 2026
|
* **Last Updated:** 06/16/2026
|
||||||
* First Publication
|
* Upgrade CUDA container to 13.2.0-devel-ubuntu22.04
|
||||||
|
* Upgrade Nsight Systems to 2025.1.3
|
||||||
|
* Add docker preparation steps for TileGym
|
||||||
|
* Pin TileGym to v1.3.0
|
||||||
|
|
||||||
## Kernel Benchmarks
|
## Kernel Benchmarks
|
||||||
|
|
||||||
## Step 1. Pull CUDA NGC container with CTK 13.x
|
## Step 1. Pull CUDA NGC container with CTK 13.x
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
docker pull nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04
|
docker pull nvcr.io/nvidia/cuda:13.2.0-devel-ubuntu22.04
|
||||||
```
|
```
|
||||||
|
|
||||||
Launch an interactive session with GPU access:
|
Launch an interactive session with GPU access:
|
||||||
@ -138,18 +141,26 @@ Launch an interactive session with GPU access:
|
|||||||
```bash
|
```bash
|
||||||
docker run --gpus all -it --rm \
|
docker run --gpus all -it --rm \
|
||||||
-v ~/TileGym:/workspace/TileGym \
|
-v ~/TileGym:/workspace/TileGym \
|
||||||
nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 \
|
nvcr.io/nvidia/cuda:13.2.0-devel-ubuntu22.04 \
|
||||||
/bin/bash
|
/bin/bash
|
||||||
```
|
```
|
||||||
|
|
||||||
> [!NOTE]
|
> [!NOTE]
|
||||||
> The `-v` flag mounts a local directory to persist the TileGym repository. The `--rm` flag automatically removes the container when you exit; omit it if you want to keep the container for later use.
|
> The `-v` flag mounts a local directory to persist the TileGym repository. The `--rm` flag automatically removes the container when you exit; omit it if you want to keep the container for later use.
|
||||||
|
|
||||||
Or if running outside a container, install Tile IR directly:
|
Prepare the container for installing TileGym:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
## Requires root privileges - run with sudo or as root
|
apt-get update && apt-get install -y --no-install-recommends \
|
||||||
sudo apt-get install cuda-tile-ir-13-1 cuda-compiler-13-1
|
python3-pip python3-dev python-is-python3 \
|
||||||
|
git wget curl build-essential nsight-systems-2025.1.3
|
||||||
|
update-alternatives --install /usr/bin/nsys nsys /opt/nvidia/nsight-systems/2025.1.3/bin/nsys 100 && hash -r
|
||||||
|
python -m pip install --upgrade pip setuptools wheel
|
||||||
|
|
||||||
|
pip install --no-cache-dir --pre "torch==2.9.1" --index-url https://download.pytorch.org/whl/cu130
|
||||||
|
|
||||||
|
pip install --no-cache-dir --no-deps accelerate==1.13.0 && \
|
||||||
|
pip install --no-cache-dir sentencepiece protobuf
|
||||||
```
|
```
|
||||||
|
|
||||||
## Step 2. Clone TileGym repository
|
## Step 2. Clone TileGym repository
|
||||||
@ -157,18 +168,32 @@ sudo apt-get install cuda-tile-ir-13-1 cuda-compiler-13-1
|
|||||||
```bash
|
```bash
|
||||||
git clone https://github.com/NVIDIA/TileGym
|
git clone https://github.com/NVIDIA/TileGym
|
||||||
cd TileGym
|
cd TileGym
|
||||||
|
git checkout v1.3.0
|
||||||
pip install .
|
pip install .
|
||||||
```
|
```
|
||||||
|
|
||||||
## Step 3. Run benchmark suite
|
## Step 3. Run individual benchmarks
|
||||||
|
|
||||||
|
To run specific kernel benchmarks:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
cd tests/benchmark/
|
cd tests/benchmark/
|
||||||
bash run_all.sh
|
|
||||||
```
|
|
||||||
|
|
||||||
> [!NOTE]
|
## Flash Multi-Head Attention
|
||||||
> The benchmark runs sequentially to ensure accurate timing results. This may take 10-15 minutes to complete all kernels.
|
python bench_fused_attention.py
|
||||||
|
|
||||||
|
## Matrix Multiplication
|
||||||
|
python bench_matrix_multiplication.py
|
||||||
|
|
||||||
|
## RMSNorm
|
||||||
|
python bench_rmsnorm.py
|
||||||
|
|
||||||
|
## RoPE
|
||||||
|
python bench_rope.py
|
||||||
|
|
||||||
|
## SwiGLU
|
||||||
|
python bench_swiglu.py
|
||||||
|
```
|
||||||
|
|
||||||
## Step 4. View results
|
## Step 4. View results
|
||||||
|
|
||||||
@ -190,27 +215,17 @@ fused-attention-batch4-head32-d128-fwd-causal=True-float16-TFLOPS:
|
|||||||
✓ PASSED: bench_fused_attention.py
|
✓ PASSED: bench_fused_attention.py
|
||||||
```
|
```
|
||||||
|
|
||||||
## Step 5. Run individual benchmarks
|
## Step 5. Run benchmark suite
|
||||||
|
|
||||||
To run specific kernel benchmarks:
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
## Flash Multi-Head Attention
|
cd tests/benchmark/
|
||||||
python bench_fused_attention.py
|
bash run_all.sh
|
||||||
|
|
||||||
## Matrix Multiplication
|
|
||||||
python bench_matrix_multiplication.py
|
|
||||||
|
|
||||||
## RMSNorm
|
|
||||||
python bench_rmsnorm.py
|
|
||||||
|
|
||||||
## RoPE
|
|
||||||
python bench_rope.py
|
|
||||||
|
|
||||||
## SwiGLU
|
|
||||||
python bench_swiglu.py
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
> [!NOTE]
|
||||||
|
> NOT RECOMMENDED: The benchmark runs sequentially to ensure accurate timing results. This may take 40-60 minutes to complete all kernels.
|
||||||
|
|
||||||
|
|
||||||
## Step 6. Clean up
|
## Step 6. Clean up
|
||||||
|
|
||||||
Exit the container:
|
Exit the container:
|
||||||
@ -223,7 +238,7 @@ Remove this workflow's containers (if you ran without `--rm`):
|
|||||||
|
|
||||||
```bash
|
```bash
|
||||||
## Preferred: remove only containers from this workflow's image
|
## Preferred: remove only containers from this workflow's image
|
||||||
docker rm $(docker ps -a --filter ancestor=nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 --format '{{.ID}}')
|
docker ps -a --filter ancestor=nvcr.io/nvidia/cuda:13.2.0-devel-ubuntu22.04 -q | xargs -r docker rm
|
||||||
|
|
||||||
## Alternative: prune all stopped containers (will prompt for confirmation)
|
## Alternative: prune all stopped containers (will prompt for confirmation)
|
||||||
## docker container prune
|
## docker container prune
|
||||||
@ -232,7 +247,7 @@ docker rm $(docker ps -a --filter ancestor=nvcr.io/nvidia/cuda:13.1-devel-ubuntu
|
|||||||
Remove the image (optional):
|
Remove the image (optional):
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
docker rmi nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04
|
docker rmi nvcr.io/nvidia/cuda:13.2.0-devel-ubuntu22.04
|
||||||
```
|
```
|
||||||
|
|
||||||
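Review note on the cleanup change above: the pipeline form matters when no containers match the filter. With the old `docker rm $(...)` form, an empty substitution leaves `docker rm` with no arguments, which the Docker CLI rejects with a usage error. `xargs -r` (GNU `--no-run-if-empty`) skips the command entirely in that case. A minimal sketch of the difference, using `echo would-remove` as a stand-in for `docker rm` and made-up container IDs so it runs without Docker:

```bash
# `echo would-remove` stands in for `docker rm`; the IDs below are invented.

# Case 1: the filter matched two container IDs.
with_ids=$(printf 'abc123\ndef456\n' | xargs -r echo would-remove)

# Case 2: the filter matched nothing; -r tells xargs not to run the command at all.
with_none=$(printf '' | xargs -r echo would-remove)

echo "matched: ${with_ids}"
echo "empty:   '${with_none}'"
```

In the second case nothing is executed, so the cleanup step exits cleanly on a machine where the containers were already removed by `--rm`.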
## Step 7. Repeat on B300
|
## Step 7. Repeat on B300
|
||||||
@ -250,6 +265,8 @@ First, clone TileGym on the host:
|
|||||||
```bash
|
```bash
|
||||||
mkdir -p ~/TileGym
|
mkdir -p ~/TileGym
|
||||||
git clone https://github.com/NVIDIA/TileGym ~/TileGym
|
git clone https://github.com/NVIDIA/TileGym ~/TileGym
|
||||||
|
cd ~/TileGym
|
||||||
|
git checkout v1.3.0
|
||||||
```
|
```
|
||||||
|
|
||||||
Then launch the container with the repository mounted:
|
Then launch the container with the repository mounted:
|
||||||
@ -258,13 +275,28 @@ Then launch the container with the repository mounted:
|
|||||||
docker run --gpus all -it --rm \
|
docker run --gpus all -it --rm \
|
||||||
-v ~/TileGym:/workspace/TileGym \
|
-v ~/TileGym:/workspace/TileGym \
|
||||||
-v ~/.cache/huggingface:/root/.cache/huggingface \
|
-v ~/.cache/huggingface:/root/.cache/huggingface \
|
||||||
nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 \
|
nvcr.io/nvidia/cuda:13.2.0-devel-ubuntu22.04 \
|
||||||
/bin/bash
|
/bin/bash
|
||||||
```
|
```
|
||||||
|
|
||||||
> [!NOTE]
|
> [!NOTE]
|
||||||
> The `-v ~/.cache/huggingface:/root/.cache/huggingface` mounts your HuggingFace cache to avoid re-downloading models.
|
> The `-v ~/.cache/huggingface:/root/.cache/huggingface` mounts your HuggingFace cache to avoid re-downloading models.
|
||||||
|
|
||||||
|
Prepare the container for installing TileGym:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
apt-get update && apt-get install -y --no-install-recommends \
|
||||||
|
python3-pip python3-dev python-is-python3 \
|
||||||
|
git wget curl build-essential nsight-systems-2025.1.3
|
||||||
|
update-alternatives --install /usr/bin/nsys nsys /opt/nvidia/nsight-systems/2025.1.3/bin/nsys 100 && hash -r
|
||||||
|
python -m pip install --upgrade pip setuptools wheel
|
||||||
|
|
||||||
|
pip install --no-cache-dir --pre "torch==2.9.1" --index-url https://download.pytorch.org/whl/cu130
|
||||||
|
|
||||||
|
pip install --no-cache-dir --no-deps accelerate==1.13.0 && \
|
||||||
|
pip install --no-cache-dir sentencepiece protobuf
|
||||||
|
```
|
||||||
|
|
||||||
Install TileGym inside the container:
|
Install TileGym inside the container:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
@ -286,7 +318,7 @@ export HF_TOKEN=<your_huggingface_token>
|
|||||||
Navigate to the transformers benchmark directory:
|
Navigate to the transformers benchmark directory:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
cd modeling/transformers
|
cd /workspace/TileGym/modeling/transformers
|
||||||
```
|
```
|
||||||
|
|
||||||
**Option A: Run Qwen2-7B benchmark**
|
**Option A: Run Qwen2-7B benchmark**
|
||||||
@ -841,7 +873,7 @@ Use the ratios below as a reference for how kernel performance scales from DGX S
|
|||||||
| Slow first run | JIT compilation | Normal - cuTile compiles kernels on first run |
|
| Slow first run | JIT compilation | Normal - cuTile compiles kernels on first run |
|
||||||
| `FileNotFoundError: input_prompt_small.txt` | Missing input file | Run from `modeling/transformers` directory |
|
| `FileNotFoundError: input_prompt_small.txt` | Missing input file | Run from `modeling/transformers` directory |
|
||||||
| `torch.cuda.OutOfMemoryError` | Insufficient GPU memory | Reduce `--batch_size` parameter |
|
| `torch.cuda.OutOfMemoryError` | Insufficient GPU memory | Reduce `--batch_size` parameter |
|
||||||
| `ImportError: cuda.tile` | Missing Tile IR | Install: `apt-get install cuda-tile-ir-13-1` |
|
| `ImportError: cuda.tile` | Missing Tile IR | Install: `apt-get install cuda-tile-ir-13-2` |
|
||||||
| Benchmark hangs | GPU busy or locked | Check `nvidia-smi` for other processes |
|
| Benchmark hangs | GPU busy or locked | Check `nvidia-smi` for other processes |
|
||||||
|
|
||||||
> [!NOTE]
|
> [!NOTE]
|
||||||
|
|||||||
@ -7,7 +7,6 @@
|
|||||||
- [Overview](#overview)
|
- [Overview](#overview)
|
||||||
- [Instructions](#instructions)
|
- [Instructions](#instructions)
|
||||||
- [Run on two Sparks](#run-on-two-sparks)
|
- [Run on two Sparks](#run-on-two-sparks)
|
||||||
- [Step 11. (Optional) Launch 405B inference server](#step-11-optional-launch-405b-inference-server)
|
|
||||||
- [Run on multiple Sparks through a switch](#run-on-multiple-sparks-through-a-switch)
|
- [Run on multiple Sparks through a switch](#run-on-multiple-sparks-through-a-switch)
|
||||||
- [Run Agent Ready Qwen3.6 35B Model with vLLM](#run-agent-ready-qwen36-35b-model-with-vllm)
|
- [Run Agent Ready Qwen3.6 35B Model with vLLM](#run-agent-ready-qwen36-35b-model-with-vllm)
|
||||||
- [Troubleshooting](#troubleshooting)
|
- [Troubleshooting](#troubleshooting)
|
||||||
@ -257,46 +256,56 @@ This includes:
|
|||||||
- Passwordless SSH setup
|
- Passwordless SSH setup
|
||||||
- Network connectivity verification
|
- Network connectivity verification
|
||||||
|
|
||||||
|
> **Heads up:** the `discover-sparks` script in the linked playbook writes its SSH key to `~/.ssh/` and fails if the directory does not exist yet. Run `mkdir -p ~/.ssh && chmod 700 ~/.ssh` on both nodes first if you have never used SSH on them.
|
||||||
|
|
||||||
## Step 2. Download cluster deployment script
|
## Step 2. Download cluster deployment script
|
||||||
|
|
||||||
Obtain the vLLM cluster deployment script on both nodes. This script orchestrates the Ray cluster setup required for distributed inference.
|
Obtain the vLLM cluster deployment script on both nodes. This script orchestrates the Ray cluster setup required for distributed inference.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
## Download on both nodes
|
## Download on both nodes — pinned to a known-good commit so upstream changes
|
||||||
wget https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/examples/ray_serving/run_cluster.sh
|
## can't silently break this playbook against the 26.05-py3 image.
|
||||||
|
wget https://raw.githubusercontent.com/vllm-project/vllm/51c1ee9b7c8acbba4899a8ebffd390685d171946/examples/ray_serving/run_cluster.sh
|
||||||
|
|
||||||
|
## Patch the script to pip-install ray inside the container before ray starts.
|
||||||
|
## The 26.05-py3 NGC image ships without ray (upstream made it an optional CUDA dep);
|
||||||
|
## the install takes ~10s on first container launch.
|
||||||
|
sed -i 's|^RAY_START_CMD="ray start|RAY_START_CMD="pip install -q --root-user-action=ignore '\''ray[default]>=2.9'\'' \&\& ray start|' run_cluster.sh
|
||||||
|
|
||||||
chmod +x run_cluster.sh
|
chmod +x run_cluster.sh
|
||||||
```
|
```
|
||||||
|
|
||||||
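Review note on the `sed` patch in Step 2: `sed -i` exits successfully even when its pattern matches nothing, so a future change to the pinned script could leave `ray` uninstalled without any error. The sketch below applies the same substitution to a one-line stand-in file and then checks that it took effect. The stand-in line (`ray start --block`) is an assumption for illustration; the real `run_cluster.sh` line may carry different arguments.

```bash
# Hypothetical one-line stand-in for run_cluster.sh; the real script has more content.
demo=$(mktemp)
printf '%s\n' 'RAY_START_CMD="ray start --block"' > "$demo"

# Same substitution as in Step 2, applied to the stand-in file.
sed -i 's|^RAY_START_CMD="ray start|RAY_START_CMD="pip install -q --root-user-action=ignore '\''ray[default]>=2.9'\'' \&\& ray start|' "$demo"

patched_line=$(cat "$demo")
echo "$patched_line"

# sed -i exits 0 even if the pattern never matched, so check explicitly.
if grep -q '^RAY_START_CMD="pip install' "$demo"; then
  patch_status=applied
else
  patch_status=missing
fi
echo "patch: $patch_status"
rm -f "$demo"
```

Running the same `grep -q` against the downloaded `run_cluster.sh` would confirm that the pinned commit still contains the line the substitution expects.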
## Step 3. Pull the NVIDIA vLLM Image from NGC
|
## Step 3. Pull the NVIDIA vLLM image from NGC
|
||||||
|
|
||||||
First, you will need to configure docker to pull from NGC
|
First, configure docker. If this is your first time using docker, run:
|
||||||
If this is your first time using docker run:
|
|
||||||
```bash
|
```bash
|
||||||
sudo groupadd docker
|
sudo groupadd docker
|
||||||
sudo usermod -aG docker $USER
|
sudo usermod -aG docker $USER
|
||||||
newgrp docker
|
newgrp docker
|
||||||
```
|
```
|
||||||
|
|
||||||
After this, you should be able to run docker commands without using `sudo`.
|
After this, you should be able to run docker commands without `sudo`.
|
||||||
|
|
||||||
|
Pull the image **on both nodes**:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
docker pull nvcr.io/nvidia/vllm:25.11-py3
|
docker pull nvcr.io/nvidia/vllm:26.05-py3
|
||||||
export VLLM_IMAGE=nvcr.io/nvidia/vllm:25.11-py3
|
export VLLM_IMAGE=nvcr.io/nvidia/vllm:26.05-py3
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
## Step 4. Start Ray head node
|
## Step 4. Start Ray head node
|
||||||
|
|
||||||
Launch the Ray cluster head node on Node 1. This node coordinates the distributed inference and serves the API endpoint.
|
Launch the Ray cluster head node on Node 1. This node coordinates the distributed inference and serves the API endpoint.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
## On Node 1, start head node
|
## On Node 1, start head node. Run inside tmux/screen so an SSH drop doesn't
|
||||||
|
## tear down the cluster (run_cluster.sh has an EXIT trap that stops the container).
|
||||||
|
|
||||||
## Get the IP address of the high-speed interface
|
## Get the IP address of the high-speed interface
|
||||||
## Use the interface that shows "(Up)" from ibdev2netdev (enp1s0f0np0 or enp1s0f1np1)
|
## Use the interface that shows "(Up)" from ibdev2netdev (enp1s0f0np0 or enp1s0f1np1)
|
||||||
export MN_IF_NAME=enp1s0f1np1
|
export MN_IF_NAME=enp1s0f1np1
|
||||||
export VLLM_HOST_IP=$(ip -4 addr show $MN_IF_NAME | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
|
export VLLM_HOST_IP=$(ip -4 addr show $MN_IF_NAME | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
|
||||||
|
export VLLM_IMAGE=nvcr.io/nvidia/vllm:26.05-py3
|
||||||
|
|
||||||
echo "Using interface $MN_IF_NAME with IP $VLLM_HOST_IP"
|
echo "Using interface $MN_IF_NAME with IP $VLLM_HOST_IP"
|
||||||
|
|
||||||
@ -311,10 +320,11 @@ bash run_cluster.sh $VLLM_IMAGE $VLLM_HOST_IP --head ~/.cache/huggingface \
|
|||||||
-e MASTER_ADDR=$VLLM_HOST_IP
|
-e MASTER_ADDR=$VLLM_HOST_IP
|
||||||
```
|
```
|
||||||
|
|
||||||
|
Leave this terminal open — closing it stops the head node and tears down the cluster.
|
||||||
|
|
||||||
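Review note on the `VLLM_HOST_IP` line in Steps 4 and 5: the `grep -oP` pattern uses a lookbehind to keep only the IPv4 address that follows `inet `, dropping the prefix length and the broadcast address. A minimal sketch against sample text, so it runs on any machine with GNU grep; the interface details and address below are made up for illustration:

```bash
# Made-up sample of what `ip -4 addr show enp1s0f1np1` might print.
sample_output='3: enp1s0f1np1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    inet 192.168.100.10/24 brd 192.168.100.255 scope global enp1s0f1np1
       valid_lft forever preferred_lft forever'

# Same pattern as the playbook: the lookbehind drops "inet ", the rest matches a dotted quad.
extracted_ip=$(printf '%s\n' "$sample_output" | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
echo "VLLM_HOST_IP would be: $extracted_ip"

# If the interface is down or misnamed, the result is empty. Report that early
# instead of passing an empty address to run_cluster.sh.
if [ -z "$extracted_ip" ]; then
  echo "no IPv4 address found; check the interface name with ibdev2netdev" >&2
fi
```

The same empty-value check can follow the real `export VLLM_HOST_IP=...` line on each node.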
## Step 5. Start Ray worker node
|
## Step 5. Start Ray worker node
|
||||||
|
|
||||||
Connect Node 2 to the Ray cluster as a worker node. This provides additional GPU resources for tensor parallelism.
|
Open a second terminal, SSH to Node 2 (`ssh user@<NODE_2_IP>`), and join the Ray cluster as a worker. Replace `<NODE_1_IP_ADDRESS>` below with the QSFP-side IP from Node 1 (run `echo $VLLM_HOST_IP` on Node 1 to print it). Run inside tmux/screen on Node 2 as well.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
## On Node 2, join as worker
|
## On Node 2, join as worker
|
||||||
@ -325,10 +335,12 @@ export MN_IF_NAME=enp1s0f1np1
|
|||||||
## Get Node 2's own IP address
|
## Get Node 2's own IP address
|
||||||
export VLLM_HOST_IP=$(ip -4 addr show $MN_IF_NAME | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
|
export VLLM_HOST_IP=$(ip -4 addr show $MN_IF_NAME | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
|
||||||
|
|
||||||
## IMPORTANT: Set HEAD_NODE_IP to Node 1's IP address
|
## Set this to Node 1's QSFP IP (see step header)
|
||||||
## You must get this value from Node 1 (run: echo $VLLM_HOST_IP on Node 1)
|
|
||||||
export HEAD_NODE_IP=<NODE_1_IP_ADDRESS>
|
export HEAD_NODE_IP=<NODE_1_IP_ADDRESS>
|
||||||
|
|
||||||
|
## Set the image tag (same as Step 3)
|
||||||
|
export VLLM_IMAGE=nvcr.io/nvidia/vllm:26.05-py3
|
||||||
|
|
||||||
echo "Worker IP: $VLLM_HOST_IP, connecting to head node at: $HEAD_NODE_IP"
|
echo "Worker IP: $VLLM_HOST_IP, connecting to head node at: $HEAD_NODE_IP"
|
||||||
|
|
||||||
bash run_cluster.sh $VLLM_IMAGE $HEAD_NODE_IP --worker ~/.cache/huggingface \
|
bash run_cluster.sh $VLLM_IMAGE $HEAD_NODE_IP --worker ~/.cache/huggingface \
|
||||||
@ -341,7 +353,6 @@ bash run_cluster.sh $VLLM_IMAGE $HEAD_NODE_IP --worker ~/.cache/huggingface \
|
|||||||
-e RAY_memory_monitor_refresh_ms=0 \
|
-e RAY_memory_monitor_refresh_ms=0 \
|
||||||
-e MASTER_ADDR=$HEAD_NODE_IP
|
-e MASTER_ADDR=$HEAD_NODE_IP
|
||||||
```
|
```
|
||||||
> **Note:** Replace `<NODE_1_IP_ADDRESS>` with the actual IP address from Node 1, specifically the QSFP interface enp1s0f1np1 configured in the [Connect two Sparks](https://build.nvidia.com/spark/connect-two-sparks) playbook.
|
|
||||||
|
|
||||||
## Step 6. Verify cluster status
|
## Step 6. Verify cluster status
|
||||||
|
|
||||||
@ -360,12 +371,12 @@ Expected output shows 2 nodes with available GPU resources.
|
|||||||
|
|
||||||
## Step 7. Download Llama 3.3 70B model
|
## Step 7. Download Llama 3.3 70B model
|
||||||
|
|
||||||
Authenticate with Hugging Face and download the recommended production-ready model.
|
Llama 3.3 70B is a gated model — first accept its license at <https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct> and create an HF access token with read permission. Then authenticate inside the container so the cache lands at `/root/.cache/huggingface` (mounted from `~/.cache/huggingface`).
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
## From within the same container where `ray status` ran, run the following
|
docker exec -it $VLLM_CONTAINER /bin/bash -c '
|
||||||
hf auth login
|
hf auth login
|
||||||
hf download meta-llama/Llama-3.3-70B-Instruct
|
hf download meta-llama/Llama-3.3-70B-Instruct'
|
||||||
```
|
```
|
||||||
|
|
||||||
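Review note on `$VLLM_CONTAINER`: the new side removes the repeated `export VLLM_CONTAINER=...` lines from Steps 8, 11, and 13, so those steps rely on the variable already being set in the current shell (presumably in Step 6, which this diff does not show). In a fresh terminal it would be empty. A sketch of re-deriving it with the same name filter as the removed lines; the container names are invented stand-ins so the filter can be shown without Docker:

```bash
# Invented names standing in for `docker ps --format '{{.Names}}'` output.
sample_names='node-12345
some-other-container'

# Same filter as the removed lines: keep only names of the form node-<digits>.
VLLM_CONTAINER=$(printf '%s\n' "$sample_names" | grep -E '^node-[0-9]+$')
echo "VLLM_CONTAINER=$VLLM_CONTAINER"

# Guard for later steps: report early if nothing matched.
if [ -z "$VLLM_CONTAINER" ]; then
  echo "no node-* container is running; start the head node first" >&2
fi
```

On Node 1 the real command is `export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')`, as in the lines this commit removes.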
## Step 8. Launch inference server for Llama 3.3 70B
|
## Step 8. Launch inference server for Llama 3.3 70B
|
||||||
@ -374,18 +385,17 @@ Start the vLLM inference server with tensor parallelism across both nodes.
|
|||||||
|
|
||||||
```bash
|
```bash
|
||||||
## On Node 1, enter container and start server
|
## On Node 1, enter container and start server
|
||||||
export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
|
|
||||||
docker exec -it $VLLM_CONTAINER /bin/bash -c '
|
docker exec -it $VLLM_CONTAINER /bin/bash -c '
|
||||||
vllm serve meta-llama/Llama-3.3-70B-Instruct \
|
vllm serve meta-llama/Llama-3.3-70B-Instruct \
|
||||||
--tensor-parallel-size 2 --max_model_len 2048'
|
--tensor-parallel-size 2 --max-model-len 2048 \
|
||||||
|
--distributed-executor-backend ray'
|
||||||
```
|
```
|
||||||
|
|
||||||
## Step 9. Test 70B model inference
|
## Step 9. Test 70B model inference
|
||||||
|
|
||||||
Verify the deployment with a sample inference request.
|
Verify the deployment with a sample inference request. Run this on Node 1 itself; from an external client, replace `localhost` with Node 1's reachable IP.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
## Test from Node 1 or external client
|
|
||||||
curl http://localhost:8000/v1/completions \
|
curl http://localhost:8000/v1/completions \
|
||||||
-H "Content-Type: application/json" \
|
-H "Content-Type: application/json" \
|
||||||
-d '{
|
-d '{
|
||||||
@ -403,29 +413,36 @@ Expected output includes a generated haiku response.
|
|||||||
> [!WARNING]
|
> [!WARNING]
|
||||||
> 405B model has insufficient memory headroom for production use.
|
> 405B model has insufficient memory headroom for production use.
|
||||||
|
|
||||||
Download the quantized 405B model for testing purposes only.
|
Download the quantized 405B model for testing purposes only. The download runs inside the head container so the cache lands in the mounted Hugging Face directory.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
## On Node 1, download quantized model
|
docker exec -it $VLLM_CONTAINER /bin/bash -c '
|
||||||
huggingface-cli download hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4
|
hf download hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4'
|
||||||
```
|
```
|
||||||
|
|
||||||
### Step 11. (Optional) Launch 405B inference server
|
## Step 11. (Optional) Launch 405B inference server
|
||||||
|
|
||||||
Start the server with memory-constrained parameters for the large model.
|
Start the server with memory-constrained parameters for the large model.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
## On Node 1, launch with restricted parameters
|
## On Node 1, launch with restricted parameters
|
||||||
export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
|
|
||||||
docker exec -it $VLLM_CONTAINER /bin/bash -c '
|
docker exec -it $VLLM_CONTAINER /bin/bash -c '
|
||||||
vllm serve hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4 \
|
vllm serve hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4 \
|
||||||
--tensor-parallel-size 2 --max-model-len 64 --gpu-memory-utilization 0.9 \
|
--tensor-parallel-size 2 --max-model-len 64 --gpu-memory-utilization 0.9 \
|
||||||
--max-num-seqs 1 --max_num_batched_tokens 64'
|
--max-num-seqs 1 --max-num-batched-tokens 64 \
|
||||||
|
--distributed-executor-backend ray'
|
||||||
|
```
|
||||||
|
|
||||||
|
Startup is slow for 405B — expect several minutes of model-loading logs across both nodes. The server is ready to take traffic once you see:
|
||||||
|
|
||||||
|
```
|
||||||
|
INFO: Application startup complete.
|
||||||
|
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
|
||||||
```
|
```
|
||||||
|
|
||||||
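Review note on the readiness message: as an alternative to watching the logs, readiness could be polled through the `/health` endpoint that the validation step already uses. The helper below is a sketch only; the name `wait_for_health` and its default attempt count and delay are my own choices, not part of the playbook.

```bash
# wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]
# Hypothetical helper: returns 0 once URL answers with a 2xx status,
# or 1 after ATTEMPTS failed tries. Variables are global for POSIX sh compatibility.
wait_for_health() {
  url=$1
  attempts=${2:-60}
  delay=${3:-10}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: treat HTTP errors as failure; -s: quiet; --max-time: do not hang on one try.
    if curl -sf -o /dev/null --max-time 5 "$url"; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    # Sleep between tries, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
    i=$((i + 1))
  done
  echo "not ready after $attempts attempt(s)" >&2
  return 1
}
```

On Node 1 this could be called as `wait_for_health http://localhost:8000/health` before sending the test request in Step 12. With the defaults it gives up after roughly ten minutes, which may need raising for the 405B model.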
## Step 12. (Optional) Test 405B model inference
|
## Step 12. (Optional) Test 405B model inference
|
||||||
|
|
||||||
Verify the 405B deployment with constrained parameters.
|
Verify the 405B deployment with constrained parameters. As in Step 9, run on Node 1 or replace `localhost` with Node 1's reachable IP from an external client.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
curl http://localhost:8000/v1/completions \
|
curl http://localhost:8000/v1/completions \
|
||||||
@ -444,33 +461,32 @@ Perform comprehensive validation of the distributed inference system.
|
|||||||
|
|
||||||
```bash
|
```bash
|
||||||
## Check Ray cluster health
|
## Check Ray cluster health
|
||||||
export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
|
|
||||||
docker exec $VLLM_CONTAINER ray status
|
docker exec $VLLM_CONTAINER ray status
|
||||||
|
|
||||||
## Verify server health endpoint
|
## Verify server health endpoint
|
||||||
curl http://192.168.100.10:8000/health
|
curl http://localhost:8000/health
|
||||||
|
|
||||||
## Monitor GPU utilization on both nodes
|
## Monitor GPU utilization on both nodes (DGX Spark has unified memory,
|
||||||
|
## so the --query-gpu memory fields report N/A; use raw nvidia-smi instead).
|
||||||
nvidia-smi
|
nvidia-smi
|
||||||
export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
|
|
||||||
docker exec $VLLM_CONTAINER nvidia-smi --query-gpu=memory.used,memory.total --format=csv
|
|
||||||
```
|
```
|
||||||
|
|
||||||
## Step 14. Next steps
|
## Step 14. Next steps
|
||||||
|
|
||||||
Access the Ray dashboard for cluster monitoring and explore additional features:
|
The Ray dashboard runs on port 8265 of the head node. It binds to the container's network (host networking), so it is only directly reachable from Node 1 itself. From an external workstation, tunnel it over SSH:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
## Ray dashboard available at:
|
## From your workstation:
|
||||||
http://<head-node-ip>:8265
|
ssh -L 8265:localhost:8265 nvidia@<NODE_1_IP>
|
||||||
|
## then open http://localhost:8265 in a local browser
|
||||||
## Consider implementing for production:
|
|
||||||
## - Health checks and automatic restarts
|
|
||||||
## - Log rotation for long-running services
|
|
||||||
## - Persistent model caching across restarts
|
|
||||||
## - Alternative quantization methods (FP8, INT4)
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
Consider for production:
|
||||||
|
- Health checks and automatic restarts
|
||||||
|
- Log rotation for long-running services
|
||||||
|
- Persistent model caching across restarts
|
||||||
|
- Alternative quantization methods (FP8, INT4)
|
||||||
|
|
||||||
## Run on multiple Sparks through a switch
|
## Run on multiple Sparks through a switch
|
||||||
|
|
||||||
## Step 1. Configure network connectivity
|
## Step 1. Configure network connectivity
|
||||||
@ -640,8 +656,6 @@ curl http://localhost:8000/health
|
|||||||
|
|
||||||
## Monitor GPU utilization on all nodes
|
## Monitor GPU utilization on all nodes
|
||||||
nvidia-smi
|
nvidia-smi
|
||||||
export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
|
|
||||||
docker exec $VLLM_CONTAINER nvidia-smi --query-gpu=memory.used,memory.total --format=csv
|
|
||||||
```
|
```
|
||||||
|
|
||||||
## Step 11. Next steps
|
## Step 11. Next steps
|
||||||
|
|||||||