
# Install and Use vLLM for Inference

> Use a container or build vLLM from source for Spark

## Table of Contents

- [Overview](#overview)
- [Instructions](#instructions)
- [Run on two Sparks](#run-on-two-sparks)
  - [Step 11. (Optional) Launch 405B inference server](#step-11-optional-launch-405b-inference-server)
- [Troubleshooting](#troubleshooting)

---

## Overview

## Basic idea

vLLM is an inference engine designed to run large language models efficiently. The key idea is **maximizing throughput and minimizing memory waste** when serving LLMs.

- It uses a memory-efficient attention algorithm called **PagedAttention** to handle long sequences without running out of GPU memory.
- New requests can be added to a batch already in process through **continuous batching** to keep GPUs fully utilized.
- It has an **OpenAI-compatible API** so applications built for the OpenAI API can switch to a vLLM backend with little or no modification.
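
To make the paging idea concrete, here is a small arithmetic sketch of why allocating the KV cache in fixed-size blocks can waste less memory than reserving a full-length slot per request. This is a toy illustration, not vLLM code: the function name `blocks_needed` is hypothetical and the 16-token block size is an assumption chosen for the example.

```bash
# Toy illustration of block-based KV-cache allocation (not vLLM code).
# blocks_needed TOKENS BLOCK_SIZE -> number of fixed-size blocks required.
blocks_needed() {
    tokens=$1
    block_size=$2
    # Ceiling division: a partially filled last block still occupies a block.
    echo $(( (tokens + block_size - 1) / block_size ))
}

# A 100-token sequence with 16-token blocks (assumed block size):
blocks_needed 100 16     # prints 7
# Reserving room for a 2048-token maximum up front would instead take:
blocks_needed 2048 16    # prints 128
```

Under these assumptions, a short request holds 7 blocks instead of 128, and the remaining blocks stay available for other requests in the batch.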

## What you'll accomplish

You'll set up vLLM high-throughput LLM serving on DGX Spark with Blackwell architecture,
either using a pre-built Docker container or building from source with custom LLVM/Triton
support for ARM64.

## What to know before starting

- Experience building and configuring containers with Docker
- Familiarity with CUDA toolkit installation and version management
- Understanding of Python virtual environments and package management
- Knowledge of building software from source using CMake and Ninja
- Experience with Git version control and patch management

## Prerequisites

- DGX Spark device with ARM64 processor and Blackwell GPU architecture
- CUDA 13.0 toolkit installed: `nvcc --version` shows CUDA toolkit version.
- Docker installed and configured: `docker --version` succeeds
- NVIDIA Container Toolkit installed
- Python 3.12 available: `python3.12 --version` succeeds
- Git installed: `git --version` succeeds
- Network access to download packages and container images
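
The command-line checks in this list can be run in one pass. The following is a minimal sketch: the helper name `missing_tools` is hypothetical, and it only checks that each command is on `PATH`, not that the version matches.

```bash
# missing_tools TOOL...: print each tool that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Report anything from the prerequisites list that is not installed.
missing=$(missing_tools nvcc docker python3.12 git)
if [ -n "$missing" ]; then
    echo "Missing prerequisites:" $missing
else
    echo "All listed command-line prerequisites found."
fi
```

The NVIDIA Container Toolkit and network access are not covered by this check and still need to be verified separately.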


## Time & risk

* **Duration:** 30 minutes for Docker approach
* **Risks:** Container registry access requires internal credentials
* **Rollback:** Container approach is non-destructive.

## Instructions

## Step 1. Pull vLLM container image

Find the latest container build in the NGC catalog: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?version=25.09-py3

```bash
docker pull nvcr.io/nvidia/vllm:25.09-py3
```

## Step 2. Test vLLM in container

Launch the container and start vLLM server with a test model to verify basic functionality.

```bash
docker run -it --gpus all -p 8000:8000 \
nvcr.io/nvidia/vllm:25.09-py3 \
vllm serve "Qwen/Qwen2.5-Math-1.5B-Instruct"
```

Expected output should include:
- Model loading confirmation
- Server startup on port 8000
- GPU memory allocation details
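
Model loading can take a while, so it may help to poll the server before sending a request instead of watching the log. The sketch below assumes the server exposes the `/health` endpoint that is queried in the validation step of this playbook; the function name `wait_for_server` and its default retry values are hypothetical.

```bash
# wait_for_server URL [TRIES] [DELAY_SECONDS]
# Polls URL until it returns an HTTP success status or TRIES is exhausted.
wait_for_server() {
    url=$1
    tries=${2:-60}
    delay=${3:-5}
    i=0
    while [ "$i" -lt "$tries" ]; do
        # -s: quiet, -f: treat HTTP errors as failure, --max-time: bound each probe
        if curl -sf --max-time 5 -o /dev/null "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

For example, `wait_for_server http://localhost:8000/health && echo "server is ready"` returns once the server answers, or gives up after the configured number of tries.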

In another terminal, test the server:

```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "Qwen/Qwen2.5-Math-1.5B-Instruct",
    "messages": [{"role": "user", "content": "12*17"}],
    "max_tokens": 500
}'
```

Expected response should contain `"content": "204"` or similar mathematical calculation.
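
If you want to try several prompts, the request body can be generated rather than edited by hand. This is a minimal sketch with a hypothetical helper named `chat_payload`; it does no JSON escaping, so it assumes the model name and prompt contain no double quotes or backslashes.

```bash
# chat_payload MODEL PROMPT [MAX_TOKENS]
# Prints a minimal chat-completions request body.
# Note: MODEL and PROMPT are inserted verbatim (no JSON escaping is done).
chat_payload() {
    printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "max_tokens": %s}\n' \
        "$1" "$2" "${3:-500}"
}

# Print the body for the test request used in this step.
chat_payload "Qwen/Qwen2.5-Math-1.5B-Instruct" "12*17"
```

The output can be passed to the same endpoint with `curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "$(chat_payload "Qwen/Qwen2.5-Math-1.5B-Instruct" "12*17")"`.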

## Step 3. Cleanup and rollback

For container approach (non-destructive):

```bash
docker rm $(docker ps -aq --filter ancestor=nvcr.io/nvidia/vllm:25.09-py3)
docker rmi nvcr.io/nvidia/vllm:25.09-py3
```


To remove CUDA 12.9:

```bash
sudo /usr/local/cuda-12.9/bin/cuda-uninstaller
```

## Step 4. Next steps

- **Production deployment:** Configure vLLM with your specific model requirements
- **Performance tuning:** Adjust batch sizes and memory settings for your workload
- **Monitoring:** Set up logging and metrics collection for production use
- **Model management:** Explore additional model formats and quantization options

## Run on two Sparks

## Step 1. Configure network connectivity

Follow the network setup instructions from the [Connect two Sparks](https://build.nvidia.com/spark/stack-sparks/stacked-sparks) playbook to establish connectivity between your DGX Spark nodes.

This includes:
- Physical QSFP cable connection
- Network interface configuration (automatic or manual IP assignment)
- Passwordless SSH setup
- Network connectivity verification

## Step 2. Download cluster deployment script

Obtain the vLLM cluster deployment script on both nodes. This script orchestrates the Ray cluster setup required for distributed inference.

```bash
## Download on both nodes
wget https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/examples/online_serving/run_cluster.sh
chmod +x run_cluster.sh
```

## Step 3. Pull the NVIDIA vLLM Image from NGC

First, configure Docker so you can pull from NGC. If this is your first time running Docker on this machine, add your user to the `docker` group:
```bash
sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker
```

After this, you should be able to run docker commands without using `sudo`.


```bash
docker pull nvcr.io/nvidia/vllm:25.09-py3
export VLLM_IMAGE=nvcr.io/nvidia/vllm:25.09-py3
```


## Step 4. Start Ray head node

Launch the Ray cluster head node on Node 1. This node coordinates the distributed inference and serves the API endpoint.

```bash
## On Node 1, start head node
export MN_IF_NAME=enP2p1s0f1np1
bash run_cluster.sh $VLLM_IMAGE 192.168.100.10 --head ~/.cache/huggingface \
-e VLLM_HOST_IP=192.168.100.10 \
-e UCX_NET_DEVICES=$MN_IF_NAME \
-e NCCL_SOCKET_IFNAME=$MN_IF_NAME \
-e OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME \
-e GLOO_SOCKET_IFNAME=$MN_IF_NAME \
-e TP_SOCKET_IFNAME=$MN_IF_NAME \
-e RAY_memory_monitor_refresh_ms=0 \
-e MASTER_ADDR=192.168.100.10
```


## Step 5. Start Ray worker node

Connect Node 2 to the Ray cluster as a worker node. This provides additional GPU resources for tensor parallelism.

```bash
## On Node 2, join as worker
export MN_IF_NAME=enP2p1s0f1np1
bash run_cluster.sh $VLLM_IMAGE 192.168.100.10 --worker ~/.cache/huggingface \
-e VLLM_HOST_IP=192.168.100.11 \
-e UCX_NET_DEVICES=$MN_IF_NAME \
-e NCCL_SOCKET_IFNAME=$MN_IF_NAME \
-e OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME \
-e GLOO_SOCKET_IFNAME=$MN_IF_NAME \
-e TP_SOCKET_IFNAME=$MN_IF_NAME \
-e RAY_memory_monitor_refresh_ms=0 \
-e MASTER_ADDR=192.168.100.10
```
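
The head and worker commands share the same set of `-e` flags and differ only in `VLLM_HOST_IP`. One way to avoid copy-paste drift between nodes is to generate the flags from a small helper. This is a sketch only: the function name `cluster_env_args` is hypothetical, and it assumes none of the values contain spaces.

```bash
# cluster_env_args IFACE HEAD_IP HOST_IP
# Prints the -e flags shared by the head and worker commands, one per line.
cluster_env_args() {
    iface=$1
    head_ip=$2
    host_ip=$3
    printf '%s\n' \
        "-e VLLM_HOST_IP=$host_ip" \
        "-e UCX_NET_DEVICES=$iface" \
        "-e NCCL_SOCKET_IFNAME=$iface" \
        "-e OMPI_MCA_btl_tcp_if_include=$iface" \
        "-e GLOO_SOCKET_IFNAME=$iface" \
        "-e TP_SOCKET_IFNAME=$iface" \
        "-e RAY_memory_monitor_refresh_ms=0" \
        "-e MASTER_ADDR=$head_ip"
}

# Print the flags for the worker node used in this playbook.
cluster_env_args enP2p1s0f1np1 192.168.100.10 192.168.100.11
```

On the worker, the flags could then be spliced in with an unquoted substitution, for example `bash run_cluster.sh $VLLM_IMAGE 192.168.100.10 --worker ~/.cache/huggingface $(cluster_env_args "$MN_IF_NAME" 192.168.100.10 192.168.100.11)`.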

## Step 6. Verify cluster status

Confirm both nodes are recognized and available in the Ray cluster.

```bash
## On Node 1 (head node)
docker exec node ray status
```

Expected output shows 2 nodes with available GPU resources.

## Step 7. Download Llama 3.3 70B model

Authenticate with Hugging Face and download the recommended production-ready model.

```bash
## On Node 1, authenticate and download
huggingface-cli login
huggingface-cli download meta-llama/Llama-3.3-70B-Instruct
```

## Step 8. Launch inference server for Llama 3.3 70B

Start the vLLM inference server with tensor parallelism across both nodes.

```bash
## On Node 1, enter container and start server
docker exec -it node /bin/bash
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 2 --max-model-len 2048
```

## Step 9. Test 70B model inference

Verify the deployment with a sample inference request.

```bash
## Test from Node 1 or external client
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.3-70B-Instruct",
"prompt": "Write a haiku about a GPU",
"max_tokens": 32,
"temperature": 0.7
}'
```

Expected output includes a generated haiku response.

## Step 10. (Optional) Deploy Llama 3.1 405B model

> [!WARNING]
> 405B model has insufficient memory headroom for production use.

Download the quantized 405B model for testing purposes only.

```bash
## On Node 1, download quantized model
huggingface-cli download hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4
```
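
A rough estimate shows why memory headroom is tight for this model. The sketch below computes only the approximate size of the weights and ignores the KV cache, activations, and quantization metadata. The helper name `weights_gb` is hypothetical, and the comparison assumes 128 GB of unified memory per DGX Spark, a figure that is not stated in this playbook.

```bash
# weights_gb PARAMS_BILLIONS BITS_PER_WEIGHT
# Approximate weight size in GB (integer, rounded down).
weights_gb() {
    echo $(( $1 * $2 / 8 ))
}

weights_gb 70 16    # Llama 3.3 70B at 16-bit: prints 140
weights_gb 405 4    # Llama 3.1 405B at INT4:  prints 202
```

Under the 128 GB assumption, two nodes provide 256 GB in total, so roughly 202 GB of weights leaves little room for the operating system, runtime, and KV cache. That is consistent with the small context length and single-sequence settings used when launching the 405B server.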

## Step 11. (Optional) Launch 405B inference server

Start the server with memory-constrained parameters for the large model.

```bash
## On Node 1, launch with restricted parameters
docker exec -it node /bin/bash
vllm serve hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4 \
--tensor-parallel-size 2 --max-model-len 256 --gpu-memory-utilization 1.0 \
--max-num-seqs 1 --max-num-batched-tokens 256
```

## Step 12. (Optional) Test 405B model inference

Verify the 405B deployment with constrained parameters.

```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4",
"prompt": "Write a haiku about a GPU",
"max_tokens": 32,
"temperature": 0.7
}'
```

## Step 13. Validate deployment

Perform comprehensive validation of the distributed inference system.

```bash
## Check Ray cluster health
docker exec node ray status

## Verify server health endpoint
curl http://192.168.100.10:8000/health

## Monitor GPU utilization on both nodes
nvidia-smi
docker exec node nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```

## Step 14. Cleanup and rollback

Remove temporary configurations and containers when testing is complete.

> [!WARNING]
> This will stop all inference services and remove cluster configuration.

```bash
## Stop containers on both nodes
docker stop node
docker rm node

## Remove network configuration on both nodes
sudo ip addr del 192.168.100.10/24 dev enP2p1s0f1np1  # Node 1
sudo ip addr del 192.168.100.11/24 dev enP2p1s0f1np1  # Node 2
sudo ip link set enP2p1s0f1np1 down
```
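
Because cleanup is destructive, it can be useful to preview the commands first. The following sketch wraps the container commands in a dry-run switch; the helper names `run` and `cleanup_container` and the `DRY_RUN` variable are hypothetical, not part of Docker or this playbook.

```bash
# run CMD...: execute CMD, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# cleanup_container NAME: stop and remove the named container.
cleanup_container() {
    run docker stop "$1"
    run docker rm "$1"
}

# Preview the container cleanup without changing anything.
DRY_RUN=1 cleanup_container node
```

Running `cleanup_container node` without `DRY_RUN=1` executes the same two commands for real.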

## Step 15. Next steps

Access the Ray dashboard for cluster monitoring at http://192.168.100.10:8265 and explore additional features.

Consider implementing for production:
- Health checks and automatic restarts
- Log rotation for long-running services
- Persistent model caching across restarts
- Alternative quantization methods (FP8, INT4)

## Troubleshooting

## Common issues for running on a single Spark

| Symptom | Cause | Fix |
|---------|--------|-----|
| CUDA version mismatch errors | Wrong CUDA toolkit version | Reinstall CUDA 12.9 using exact installer |
| Container registry authentication fails | Invalid or expired GitLab token | Generate new auth token |
| SM_121a architecture not recognized | Missing LLVM patches | Verify SM_121a patches applied to LLVM source |

## Common issues for running on two Sparks

| Symptom | Cause | Fix |
|---------|--------|-----|
| Node 2 not visible in Ray cluster | Network connectivity issue | Verify QSFP cable connection, check IP configuration |
| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser |
| Model download fails | Authentication or network issue | Re-run `huggingface-cli login`, check internet access |
| CUDA out of memory with 405B | Insufficient GPU memory | Use the 70B model or reduce the `--max-model-len` value |
| Container startup fails | Missing ARM64 image | Rebuild vLLM image following ARM64 instructions |

> [!NOTE]
> DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
> With many applications still updating to take advantage of UMA, you may encounter memory issues even when within
> the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
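
To see whether the flush had any effect, you can compare available memory before and after. This is a minimal sketch that reads the `MemAvailable` field from `/proc/meminfo`; the helper names `mem_available_kb` and `flush_and_report` are hypothetical.

```bash
# mem_available_kb [MEMINFO_FILE]
# Prints the MemAvailable value (in kB) from a meminfo-style file.
mem_available_kb() {
    awk '/^MemAvailable:/ {print $2}' "${1:-/proc/meminfo}"
}

# flush_and_report: show available memory before and after dropping caches.
flush_and_report() {
    before=$(mem_available_kb)
    sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
    after=$(mem_available_kb)
    echo "MemAvailable: ${before} kB -> ${after} kB"
}
```

Call `flush_and_report` on the node that reports memory issues; it requires `sudo` because it runs the same flush command as above.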