chore: Regenerate all playbooks

This commit is contained in:
GitLab CI 2025-12-12 05:31:37 +00:00
parent 49bdd1d7d1
commit 5472c97a8c
2 changed files with 26 additions and 25 deletions


@@ -117,8 +117,8 @@ Reminder: not all model architectures are supported for NVFP4 quantization.
 * **Duration**: 45-60 minutes for setup and API server deployment
 * **Risk level**: Medium - container pulls and model downloads may fail due to network issues
 * **Rollback**: Stop inference servers and remove downloaded models to free resources.
-* **Last Updated:** 10/18/2025
-* Fix broken links
+* **Last Updated:** 12/11/2025
+* Improve TRT-LLM Run on Two Sparks workflow
 ## Single Spark


@@ -52,8 +52,9 @@ support for ARM64.
 * **Duration:** 30 minutes for Docker approach
 * **Risks:** Container registry access requires internal credentials
 * **Rollback:** Container approach is non-destructive.
-* **Last Updated:** 10/18/2025
-* Minor copyedits
+* **Last Updated:** 12/11/2025
+* Upgrade vLLM container
+* Improve cluster setup instructions
 ## Instructions
@ -246,9 +247,9 @@ Start the vLLM inference server with tensor parallelism across both nodes.
```bash ```bash
## On Node 1, enter container and start server ## On Node 1, enter container and start server
export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$') export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
docker exec -it $VLLM_CONTAINER /bin/bash docker exec -it $VLLM_CONTAINER /bin/bash -c '
vllm serve meta-llama/Llama-3.3-70B-Instruct \ vllm serve meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 2 --max_model_len 2048 --tensor-parallel-size 2 --max_model_len 2048'
``` ```
## Step 9. Test 70B model inference ## Step 9. Test 70B model inference
@@ -258,13 +259,13 @@ Verify the deployment with a sample inference request.
 ```bash
 ## Test from Node 1 or external client
 curl http://localhost:8000/v1/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "meta-llama/Llama-3.3-70B-Instruct",
     "prompt": "Write a haiku about a GPU",
     "max_tokens": 32,
     "temperature": 0.7
   }'
 ```
 Expected output includes a generated haiku response.
@@ -288,10 +289,10 @@ Start the server with memory-constrained parameters for the large model.
 ```bash
 ## On Node 1, launch with restricted parameters
 export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
-docker exec -it $VLLM_CONTAINER /bin/bash
+docker exec -it $VLLM_CONTAINER /bin/bash -c '
 vllm serve hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4 \
     --tensor-parallel-size 2 --max-model-len 256 --gpu-memory-utilization 1.0 \
-    --max-num-seqs 1 --max_num_batched_tokens 256
+    --max-num-seqs 1 --max_num_batched_tokens 256'
 ```
 ## Step 12. (Optional) Test 405B model inference
## Step 12. (Optional) Test 405B model inference ## Step 12. (Optional) Test 405B model inference
@@ -300,13 +301,13 @@ Verify the 405B deployment with constrained parameters.
 ```bash
 curl http://localhost:8000/v1/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4",
     "prompt": "Write a haiku about a GPU",
     "max_tokens": 32,
     "temperature": 0.7
   }'
 ```
 ## Step 13. Validate deployment