mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git
synced 2026-04-22 01:53:53 +00:00

chore: Regenerate all playbooks

This commit is contained in:
parent 35c96dce16
commit f08aacc290
@@ -12,9 +12,7 @@
- [Step 4. Validate TensorRT-LLM installation](#step-4-validate-tensorrt-llm-installation)
- [Step 5. Create cache directory](#step-5-create-cache-directory)
- [Step 6. Validate setup with quickstart_advanced](#step-6-validate-setup-with-quickstartadvanced)
- [LLM quickstart example](#llm-quickstart-example)
- [Step 7. Validate setup with quickstart_multimodal](#step-7-validate-setup-with-quickstartmultimodal)
- [VLM quickstart example](#vlm-quickstart-example)
- [Step 8. Serve LLM with OpenAI-compatible API](#step-8-serve-llm-with-openai-compatible-api)
- [Step 9. Troubleshooting](#step-9-troubleshooting)
- [Step 10. Cleanup and rollback](#step-10-cleanup-and-rollback)
@@ -39,6 +37,15 @@
## Overview
## Basic idea
**NVIDIA TensorRT-LLM (TRT-LLM)** is an open-source library for optimizing and accelerating large language model (LLM) inference on NVIDIA GPUs.
It provides highly efficient kernels, memory management, and parallelism strategies—like tensor, pipeline, and sequence parallelism—so developers can serve LLMs with lower latency and higher throughput.
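The idea behind one of those strategies, tensor parallelism, can be sketched in a few lines of plain Python (a conceptual illustration only, not TRT-LLM's implementation; the sharding helper and toy weight matrix are made up for the example):

```python
# Conceptual sketch of tensor parallelism: a linear layer y = W @ x is
# sharded so each "device" holds a slice of W's output rows, computes its
# part of y independently, and the slices are concatenated (the gather
# step that runs as a GPU collective on real hardware).

def matvec(W, x):
    """Toy matrix-vector product: W is a list of rows, x a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def shard_rows(W, n):
    """Split W's rows into n near-equal shards, one per 'device'."""
    k, r = divmod(len(W), n)
    shards, start = [], 0
    for i in range(n):
        end = start + k + (1 if i < r else 0)
        shards.append(W[start:end])
        start = end
    return shards

W = [[1, 0], [0, 2], [3, 1], [1, 1]]  # toy 4x2 weight matrix
x = [2, 5]

full = matvec(W, x)  # single-device result
parallel = [y for shard in shard_rows(W, 2) for y in matvec(shard, x)]
assert parallel == full  # sharded result matches the unsharded one
print(parallel)
```

On real hardware each shard lives on a different GPU and the concatenation is a collective operation; TRT-LLM manages that plumbing for you.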
TRT-LLM integrates with frameworks like Hugging Face and PyTorch, making it easier to deploy state-of-the-art models at scale.
## What you'll accomplish
You'll set up TensorRT-LLM to optimize and deploy large language models on NVIDIA Spark with
@@ -89,13 +96,17 @@ The following models are supported with TensorRT-LLM on Spark. All listed models
| **Llama-4-Scout-17B-16E-Instruct** | NVFP4 | ✅ | `nvidia/Llama-4-Scout-17B-16E-Instruct-FP4` |
| **Qwen3-235B-A22B (two Sparks only)** | NVFP4 | ✅ | `nvidia/Qwen3-235B-A22B-FP4` |
**Note:** You can use the NVFP4 Quantization documentation to generate your own NVFP4-quantized checkpoints for your favorite models. This enables you to take advantage of the performance and memory benefits of NVFP4 quantization even for models not already published by NVIDIA.
Reminder: not all model architectures are supported for NVFP4 quantization.
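To make the quantization trade-off concrete, here is a toy sketch of block-scaled 4-bit quantization in the spirit of NVFP4 (assumptions: the value grid below is the FP4 e2m1 magnitude set, and the per-block scale is taken from the block maximum; the real format also quantizes the scale factors themselves and runs on tensor cores, not in Python):

```python
# Toy sketch of block-scaled 4-bit quantization in the spirit of NVFP4.
# Values are snapped to the FP4 (e2m1) magnitude grid after dividing by a
# per-block scale; real NVFP4 also stores quantized per-block scales.

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # e2m1 magnitudes

def quantize_block(block):
    # Scale so the largest magnitude maps to the grid maximum (6.0).
    scale = max(abs(v) for v in block) / 6.0 or 1.0
    q = []
    for v in block:
        mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        q.append(mag if v >= 0 else -mag)
    return scale, q

def dequantize(scale, q):
    return [scale * v for v in q]

block = [0.02, -0.5, 1.3, 0.75]
scale, q = quantize_block(block)
restored = dequantize(scale, q)
# Each restored value approximates the original at ~4 bits per weight.
print([round(v, 3) for v in restored])
```

The memory saving is the point: 4-bit values plus one scale per small block, versus 16 bits per value for FP16/BF16 weights.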
## Time & risk
**Duration**: 45-60 minutes for setup and API server deployment
**Risk level**: Medium - container pulls and model downloads may fail due to network issues
**Rollback**: Stop inference servers and remove downloaded models to free resources.
## Single Spark
@@ -170,7 +181,7 @@ mkdir -p $HOME/.cache/huggingface/
This quickstart validates your TensorRT-LLM setup end-to-end by testing model loading, inference engine initialization, and GPU execution with real text generation. It's the fastest way to confirm everything works before starting the inference API server.
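Once the inference API server is running (Step 8), requests follow the OpenAI chat-completions shape. The sketch below builds, but does not send, such a request; the endpoint `http://localhost:8000/v1/chat/completions` and the model id are assumptions, so substitute whatever host, port, and model your deployment actually exposes:

```python
import json
import urllib.request

# Hedged sketch: build (but do not send) an OpenAI-compatible chat
# completions request. The URL and model id below are assumptions --
# substitute the values your TensorRT-LLM server actually exposes.
url = "http://localhost:8000/v1/chat/completions"  # assumed local endpoint
payload = {
    "model": "nvidia/Llama-3.1-8B-Instruct-FP4",   # assumed model id
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Against a running server: resp = urllib.request.urlopen(req)
print(req.get_method(), req.full_url)
```

Any OpenAI-compatible client library can be pointed at the same base URL instead of hand-building requests.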
### LLM quickstart example
#### Llama 3.1 8B Instruct
```bash
@@ -241,7 +252,7 @@ docker run \
```

### Step 7. Validate setup with quickstart_multimodal
### VLM quickstart example
This demonstrates vision-language model capabilities by running inference with image understanding. The example uses multimodal inputs to validate both text and vision processing pipelines.
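For OpenAI-style frontends, multimodal requests differ from text-only ones in the message schema: `content` becomes a list mixing text and image parts. A minimal sketch, assuming that schema is accepted by your serving frontend (the image URL is a placeholder):

```python
import json

# Hedged sketch of an OpenAI-style multimodal chat message: `content` is a
# list of typed parts rather than a plain string. The image URL is a
# placeholder -- point it at an image your server can fetch.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
    ],
}
body = json.dumps({"messages": [message], "max_tokens": 64})
print(sorted(json.loads(body)))  # top-level keys of the request body
```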
@@ -405,9 +416,7 @@ docker rmi nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev
### Step 1. Review Spark clustering documentation
Go to the official DGX Spark clustering documentation to understand the networking requirements and setup procedures: [DGX Spark Clustering Documentation](https://docs.nvidia.com/dgx/dgx-spark/spark-clustering.html)
Review the networking configuration options and choose the appropriate setup method for your environment.