mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git
synced 2026-04-25 19:33:53 +00:00

chore: Regenerate all playbooks

This commit is contained in:
parent 49bdd1d7d1
commit 5472c97a8c
@@ -117,8 +117,8 @@ Reminder: not all model architectures are supported for NVFP4 quantization.
 * **Duration**: 45-60 minutes for setup and API server deployment
 * **Risk level**: Medium - container pulls and model downloads may fail due to network issues
 * **Rollback**: Stop inference servers and remove downloaded models to free resources.
-* **Last Updated:** 10/18/2025
-* Fix broken links
+* **Last Updated:** 12/11/2025
+* Improve TRT-LLM Run on Two Sparks workflow
 
 ## Single Spark
 
@@ -52,8 +52,9 @@ support for ARM64.
 * **Duration:** 30 minutes for Docker approach
 * **Risks:** Container registry access requires internal credentials
 * **Rollback:** Container approach is non-destructive.
-* **Last Updated:** 10/18/2025
-* Minor copyedits
+* **Last Updated:** 12/11/2025
+* Upgrade vLLM container
+* Improve cluster setup instructions
 
 ## Instructions
 
@@ -246,9 +247,9 @@ Start the vLLM inference server with tensor parallelism across both nodes.
 ```bash
 ## On Node 1, enter container and start server
 export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
-docker exec -it $VLLM_CONTAINER /bin/bash
+docker exec -it $VLLM_CONTAINER /bin/bash -c '
 vllm serve meta-llama/Llama-3.3-70B-Instruct \
---tensor-parallel-size 2 --max_model_len 2048
+--tensor-parallel-size 2 --max_model_len 2048'
 ```
 
 ## Step 9. Test 70B model inference
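The serve command loads a 70B model across both nodes, which can take several minutes; firing the test request too early just gets a connection error. A minimal readiness-polling sketch (assumptions: the default vLLM port 8000 and its `/health` endpoint; `wait_for_url` is an illustrative helper, not part of the playbook):

```bash
# wait_for_url: poll a URL until curl succeeds or the attempt budget runs out.
# Illustrative helper only, not part of the playbook itself.
wait_for_url() {
  url=$1
  attempts=$2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -sf "$url" > /dev/null 2>&1; then
      return 0    # server answered
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1        # gave up
}

## Usage once the container is up (assumes vLLM's default port):
## wait_for_url http://localhost:8000/health 120 && echo "server ready"
```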
@@ -258,13 +259,13 @@ Verify the deployment with a sample inference request.
 ```bash
 ## Test from Node 1 or external client
 curl http://localhost:8000/v1/completions \
 -H "Content-Type: application/json" \
 -d '{
 "model": "meta-llama/Llama-3.3-70B-Instruct",
 "prompt": "Write a haiku about a GPU",
 "max_tokens": 32,
 "temperature": 0.7
 }'
 ```
 
 Expected output includes a generated haiku response.
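The response comes back as JSON in the OpenAI-compatible shape vLLM serves, with the generated text under `choices[0].text`. A sketch of extracting just that field with `python3` (the `RESPONSE` value here is a hypothetical example body, not real server output):

```bash
## Hypothetical body in the shape /v1/completions returns (assumption, for
## illustration only).
RESPONSE='{"choices":[{"text":"Cores hum in the dark,\nparallel rivers of math,\nan answer blooms fast."}]}'

## Pull out just the generated text from choices[0].text.
COMPLETION=$(printf '%s' "$RESPONSE" | python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["text"])')
echo "$COMPLETION"
```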
@@ -288,10 +289,10 @@ Start the server with memory-constrained parameters for the large model.
 ```bash
 ## On Node 1, launch with restricted parameters
 export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
-docker exec -it $VLLM_CONTAINER /bin/bash
+docker exec -it $VLLM_CONTAINER /bin/bash -c '
 vllm serve hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4 \
 --tensor-parallel-size 2 --max-model-len 256 --gpu-memory-utilization 1.0 \
---max-num-seqs 1 --max_num_batched_tokens 256
+--max-num-seqs 1 --max_num_batched_tokens 256'
 ```
 
 ## Step 12. (Optional) Test 405B model inference
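The restricted flags (`--max-model-len 256`, `--max-num-seqs 1`) exist because the 405B model barely fits. A back-of-the-envelope sketch of the weight footprint (an assumption-laden estimate, not a measured figure: it ignores activations, KV cache, and quantization overhead):

```bash
## Rough weight-memory arithmetic: 405B parameters at ~4 bits (0.5 bytes) per
## weight with AWQ INT4, divided across 2 nodes by tensor parallelism.
python3 - <<'EOF'
params = 405e9
bytes_per_weight = 0.5   # AWQ INT4 quantization
nodes = 2
per_node_gb = params * bytes_per_weight / nodes / 1e9
print(f"~{per_node_gb:.0f} GB of weights per node")
EOF
```

At roughly 100 GB of weights per node before any KV cache, there is little headroom, which is why the launch caps context length and batch size so aggressively.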
@@ -300,13 +301,13 @@ Verify the 405B deployment with constrained parameters.
 
 ```bash
 curl http://localhost:8000/v1/completions \
 -H "Content-Type: application/json" \
 -d '{
 "model": "hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4",
 "prompt": "Write a haiku about a GPU",
 "max_tokens": 32,
 "temperature": 0.7
 }'
 ```
 
 ## Step 13. Validate deployment