**Note:** You can use the NVFP4 Quantization documentation to generate NVFP4-quantized checkpoints for your favorite models, giving you the performance and memory benefits of NVFP4 even for models NVIDIA has not already published. Be aware that not all model architectures are supported for NVFP4 quantization.
## Time & risk
**Duration**: 45-60 minutes for setup and API server deployment
**Risk level**: Medium - container pulls and model downloads may fail due to network issues
**Rollback**: Stop inference servers and remove downloaded models to free resources
## Single Spark
### Step 1. Verify environment prerequisites
Confirm your Spark device has the required GPU access and network connectivity for downloading
models and containers.
```bash
# Check GPU visibility and driver
nvidia-smi
# Verify Docker GPU support
docker run --rm --gpus all nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev nvidia-smi
```
### Step 2. Set environment variables
Set `HF_TOKEN` for model access.
```bash
export HF_TOKEN=<your-huggingface-token>
```
### Step 3. Validate TensorRT-LLM installation
After confirming GPU access, verify that TensorRT-LLM can be imported inside the container.
Set up local caching to avoid re-downloading models on subsequent runs.
```bash
# Create Hugging Face cache directory
mkdir -p $HOME/.cache/huggingface/
```
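The import check described above can be run with a one-liner inside the container. This is a sketch: the image tag is the one used in Step 1, and mounting the host cache into `/root/.cache/huggingface` assumes the container's default Hugging Face cache location — adjust if your image differs.

```shell
# Verify TensorRT-LLM imports inside the container (image tag from Step 1)
docker run --rm --gpus all \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
  python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```

If the import succeeds, the command prints the installed TensorRT-LLM version.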
### Step 5. Validate setup with quickstart_advanced
This quickstart validates your TensorRT-LLM setup end-to-end by testing model loading, inference engine initialization, and GPU execution with real text generation. It's the fastest way to confirm everything works before starting the inference API server.
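A typical invocation looks like the sketch below. The example script path inside the container and the flag names are assumptions based on the TensorRT-LLM `llm-api` examples — adjust them to your image layout, and replace the model placeholder with the checkpoint you want to test.

```shell
# Sketch: run the quickstart_advanced example inside the container.
# The script path and model placeholder are assumptions -- adjust to your setup.
docker run --rm --gpus all \
  -e HF_TOKEN=$HF_TOKEN \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
  python3 /app/tensorrt_llm/examples/llm-api/quickstart_advanced.py \
    --model_dir <model-name-or-path> \
    --prompt "What is artificial intelligence?"
```

Generated text in the output confirms that model loading, engine initialization, and GPU execution all work.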
### Step 6. Validate setup with quickstart_multimodal
### VLM quickstart example
This demonstrates vision-language model capabilities by running inference with image understanding. The example uses multimodal inputs to validate both text and vision processing pipelines.
This model requires LoRA (Low-Rank Adaptation) configuration as it uses parameter-efficient fine-tuning. The `--load_lora` flag enables loading the LoRA weights that adapt the base model for multimodal instruction following.
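A run might look like the following sketch; the script path, modality flag, and model placeholder are assumptions based on the TensorRT-LLM multimodal example — only the `--load_lora` flag is taken from the description above.

```shell
# Sketch: run the multimodal quickstart with LoRA weights enabled.
# The script path and model placeholder are assumptions -- adjust to your setup.
docker run --rm --gpus all \
  -e HF_TOKEN=$HF_TOKEN \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
  python3 /app/tensorrt_llm/examples/llm-api/quickstart_multimodal.py \
    --model_dir <multimodal-model-name-or-path> \
    --modality image \
    --load_lora
```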
Review the networking configuration options and choose the appropriate setup method for your environment.
### Step 2. Verify connectivity and SSH setup
Verify that the two Spark nodes can communicate with each other using ping and that SSH passwordless authentication is properly configured.
```bash
# Test network connectivity between nodes (replace with your actual node IPs)
ping -c 3 <other-node-ip>
```
```bash
# Test SSH passwordless authentication (replace with your actual node IP)
ssh nvidia@<other-node-ip> hostname
```
**Expected results:**
- Ping should show successful packet transmission with 0% packet loss
- SSH command should execute without prompting for a password and return the remote hostname
### Step 3. Install NVIDIA Container Toolkit
Ensure the NVIDIA drivers and the NVIDIA Container Toolkit are installed on each node (both manager and workers) that will provide GPU resources. This package enables Docker containers to access the host's GPU hardware. Ensure you complete the [installation steps](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html), including the [Docker configuration](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-docker) for NVIDIA Container Toolkit.
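On Ubuntu-based systems the installation typically reduces to the commands below; this sketch assumes the NVIDIA package repository is already configured per the linked guide.

```shell
# Install the NVIDIA Container Toolkit (assumes the NVIDIA apt repository is set up)
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# Register the NVIDIA runtime with Docker, then restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```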
### Step 4. Enable resource advertising
Modify the NVIDIA Container Runtime to advertise the GPUs to the Swarm by uncommenting the `swarm-resource` line in `/etc/nvidia-container-runtime/config.toml`. You can do this either with your preferred text editor (e.g., vim or nano) or with the following command:
```bash
sudo sed -i 's/^#\s*\(swarm-resource\s*=\s*".*"\)/\1/' /etc/nvidia-container-runtime/config.toml
```
To apply the changes, restart the Docker daemon:
```bash
sudo systemctl restart docker
```
### Step 5. Initialize Docker Swarm
On whichever node you want to use as the primary (Swarm manager), run the following swarm initialization command.
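A minimal initialization looks like the following; `<manager-ip>` is a placeholder for the address the other Spark node will use to reach this one.

```shell
# Initialize the swarm, advertising this node's address to workers
docker swarm init --advertise-addr <manager-ip>
```

The command prints a `docker swarm join` command containing a join token; run that command on the other Spark node to add it to the swarm as a worker.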
This will start the TensorRT-LLM server on port 8000. You can then make inference requests to `http://localhost:8000` using the OpenAI-compatible API format.
**Expected output:** Server startup logs and ready message.
### Step 12. Validate API server
Verify successful deployment by checking container status and testing the API endpoint.
```bash
docker stack ps trtllm-multinode
```
**Expected output:** Two running containers in the stack across different nodes.
Once the server is running, you can test it with a curl request. Make sure the request is run on the primary node where you previously ran Step 11.
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Qwen3-235B-A22B-FP4",
    "messages": [{"role": "user", "content": "What is artificial intelligence?"}],
    "max_tokens": 100,
    "temperature": 0.7,
    "stream": false
  }'
```
**Expected output:** JSON response with the generated reply in `choices[0].message.content`.
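To pull just the generated text out of the response, you can pipe the JSON through a short Python one-liner. The `RESPONSE` value below is a trimmed, illustrative sample; in practice, capture the curl output instead.

```shell
# Illustrative: extract the assistant's reply from a chat-completions response.
# RESPONSE here is a trimmed sample; in practice, capture the curl output.
RESPONSE='{"choices":[{"message":{"content":"Artificial intelligence is ..."}}]}'
echo "$RESPONSE" | python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
```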
### Step 13. Troubleshooting
| Symptom | Cause | Fix |
|---------|-------|-----|
| MPI hostname test returns single hostname | Network connectivity issues | Verify both nodes are on reachable IP addresses |
| "Permission denied" on HuggingFace download | Invalid or missing HF_TOKEN | Set valid token: `export HF_TOKEN=<TOKEN>` |
| "CUDA out of memory" errors | Insufficient GPU memory | Reduce `--max_batch_size` or `--max_num_tokens` |
| Container exits immediately | Missing entrypoint script | Ensure `trtllm-mn-entrypoint.sh` download succeeded and has executable permissions |
### Step 14. Cleanup and rollback
Stop and remove the containers by running the following command on the primary (manager) node:
```bash
docker stack rm trtllm-multinode
```
> **Warning:** This removes all inference data and performance reports. Copy `/opt/*perf-report.json` files before cleanup if needed.
Compare performance metrics between speculative decoding and baseline reports to quantify speed improvements. Use the multi-node setup as a foundation for deploying other large models requiring tensor parallelism, or scale to additional nodes for higher throughput workloads.
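The comparison can be as simple as a throughput ratio between the two reports. This sketch assumes a `tokens_per_sec` field in the report JSON, which is a guess — check your actual report schema, and load the saved files in place of the illustrative values.

```shell
# Sketch: compute the speculative-decoding speedup from two perf reports.
# The "tokens_per_sec" field is an assumed schema -- check your actual reports.
python3 - <<'EOF'
import json

# Illustrative values; in practice load the saved reports, e.g.:
# baseline = json.load(open("/opt/baseline-perf-report.json"))
baseline = {"tokens_per_sec": 40.0}
spec_decode = {"tokens_per_sec": 58.0}

speedup = spec_decode["tokens_per_sec"] / baseline["tokens_per_sec"]
print(f"Speedup: {speedup:.2f}x")
EOF
```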