# TensorRT-LLM on Stacked Spark Instructions

## Step 1. Set up networking between nodes

Configure network interfaces using netplan on both DGX Spark nodes:

```bash
# On both nodes, create the netplan configuration file (also available in cx7-netplan.yaml in this repository)
sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<'EOF'
# (contents of cx7-netplan.yaml from this repository)
EOF

# Apply the configuration
sudo netplan apply
```

### Substep A: Initialize the Docker swarm

On your primary node, initialize the swarm:

```bash
docker swarm init
```

The output includes the join command for worker nodes:

```
To add a worker to this swarm, run the following command:

    docker swarm join --token <token> <manager-ip>:<port>

To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.
```

### Substep B: Join worker nodes and deploy

Now we can proceed with setting up the other nodes of your cluster:

```bash
# Run the join command suggested by docker swarm init on each worker node to join the Docker swarm
docker swarm join --token <token> <manager-ip>:<port>

# On your primary node, deploy the stack using the following command
docker stack deploy -c docker-compose.yml trtllm-multinode

# You can verify the status of your worker nodes using the following
docker stack ps trtllm-multinode

# In case you see any errors reported by docker stack ps for any node, check the service logs
docker service logs trtllm-multinode_trtllm
```

If everything is healthy, you should see output similar to the following:

```
nvidia@spark-1b3b:~/draft-playbooks/trt-llm-on-stacked-spark$ docker stack ps trtllm-multinode
ID             NAME                        IMAGE                                          NODE         DESIRED STATE   CURRENT STATE           ERROR   PORTS
oe9k5o6w41le   trtllm-multinode_trtllm.1   nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc3   spark-1d84   Running         Running 2 minutes ago
phszqzk97p83   trtllm-multinode_trtllm.2   nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc3   spark-1b3b   Running         Running 2 minutes ago
```
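The health check above can be scripted. This is a minimal sketch that flags any stack task not in the `Running` state; it uses a canned sample so the snippet runs anywhere, but in real use you would pipe `docker stack ps trtllm-multinode --format '{{.Name}} {{.CurrentState}}'` into the same awk filter.

```shell
# Canned sample of: docker stack ps trtllm-multinode --format '{{.Name}} {{.CurrentState}}'
sample='trtllm-multinode_trtllm.1 Running 2 minutes ago
trtllm-multinode_trtllm.2 Shutdown 1 minute ago'

# Print any task whose current state is not Running
echo "$sample" | awk '$2 != "Running" { print "not running: " $1 }'
# prints: not running: trtllm-multinode_trtllm.2
```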
### Substep C. Create hosts file

You can check the available nodes using `docker node ls`:

```
nvidia@spark-1b3b:~$ docker node ls
ID                            HOSTNAME     STATUS    AVAILABILITY   MANAGER STATUS   ENGINE VERSION
hza2b7yisatqiezo33zx4in4i *   spark-1b3b   Ready     Active         Leader           28.3.3
m1k22g3ktgnx36qz4jg5fzhr4     spark-1d84   Ready     Active                          28.3.3
```

Generate a file containing all Docker Swarm node addresses for MPI operations, and then copy it into your container:

```bash
docker node ls --format '{{.ID}}' | xargs -n1 docker node inspect --format '{{ .Status.Addr }}' > ~/openmpi-hostfile
docker cp ~/openmpi-hostfile $(docker ps -q -f name=trtllm-multinode):/etc/openmpi-hostfile
```

### Substep D. Find your Docker container ID

You can use `docker ps` to find your Docker container ID. Alternatively, you can save the container ID in a variable:

```bash
export TRTLLM_MN_CONTAINER=$(docker ps -q -f name=trtllm-multinode)
```

### Substep E. Generate configuration file

```bash
docker exec $TRTLLM_MN_CONTAINER bash -c 'cat <<EOF > /tmp/extra-llm-api-config.yml
print_iter_log: false
kv_cache_config:
  dtype: "fp8"
  free_gpu_memory_fraction: 0.9
cuda_graph_config:
  enable_padding: true
EOF'
```

### Substep F. Download model

```bash
docker exec \
  -e MODEL="nvidia/Qwen3-235B-A22B-FP4" \
  -e HF_TOKEN="hf_..." \
  -it $TRTLLM_MN_CONTAINER bash -c 'mpirun -x HF_TOKEN bash -c "huggingface-cli download $MODEL"'
```
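The download step fails with "Permission denied" when `HF_TOKEN` is missing or invalid (see the troubleshooting table in Step 6). A small guard, sketched here outside the container for illustration, can fail fast before launching `mpirun` across both nodes:

```shell
# Sketch: fail fast if HF_TOKEN is unset before attempting a gated-model download.
MODEL="nvidia/Qwen3-235B-A22B-FP4"
if [ -z "${HF_TOKEN:-}" ]; then
  echo "HF_TOKEN is not set; download of $MODEL would fail with permission denied" >&2
else
  echo "HF_TOKEN present; proceeding with download of $MODEL"
fi
```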
### Substep G. Prepare dataset and benchmark

```bash
docker exec \
  -e ISL=128 -e OSL=128 \
  -e MODEL="nvidia/Qwen3-235B-A22B-FP4" \
  -e HF_TOKEN="" \
  -it $TRTLLM_MN_CONTAINER bash -c '
mpirun -x HF_TOKEN bash -c "python benchmarks/cpp/prepare_dataset.py --tokenizer=$MODEL --stdout token-norm-dist --num-requests=1 --input-mean=$ISL --output-mean=$OSL --input-stdev=0 --output-stdev=0 > /tmp/dataset.txt" && \
mpirun -x HF_TOKEN trtllm-llmapi-launch trtllm-bench -m $MODEL throughput \
  --tp 2 \
  --dataset /tmp/dataset.txt \
  --backend pytorch \
  --max_num_tokens 4096 \
  --concurrency 1 \
  --max_batch_size 4 \
  --extra_llm_api_options /tmp/extra-llm-api-config.yml \
  --streaming'
```

### Substep H. Serve the model

```bash
docker exec \
  -e MODEL="nvidia/Qwen3-235B-A22B-FP4" \
  -e HF_TOKEN="" \
  -it $TRTLLM_MN_CONTAINER bash -c '
mpirun -x HF_TOKEN trtllm-llmapi-launch trtllm-serve $MODEL \
  --tp_size 2 \
  --backend pytorch \
  --max_num_tokens 32768 \
  --max_batch_size 4 \
  --extra_llm_api_options /tmp/extra-llm-api-config.yml \
  --port 8000'
```

This starts the TensorRT-LLM server on port 8000. You can then make inference requests to `http://localhost:8000` using the OpenAI-compatible API format.

**Expected output:** Server startup logs and a ready message.

### Example inference request

Once the server is running, you can test it with a curl request. Make sure the request is run on the primary node, where you previously ran Substep H.

```bash
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Qwen3-235B-A22B-FP4",
    "prompt": "What is artificial intelligence?",
    "max_tokens": 100,
    "temperature": 0.7,
    "stream": false
  }'
```
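The completions response is JSON following the OpenAI completions schema, so the generated text sits under `choices[0].text`. A quick way to pull it out, using a canned response here so the snippet runs without the server (in practice, pipe the curl output instead):

```shell
# Canned /v1/completions-style response; replace with the curl output in practice
response='{"choices":[{"text":" Artificial intelligence is the simulation of human intelligence by machines."}]}'

# Extract just the generated text from the JSON body
echo "$response" | python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["text"])'
```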
## Step 6. Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| MPI hostname test returns a single hostname | Network connectivity issues | Verify both nodes are on the 192.168.100.0/24 subnet |
| "Permission denied" on Hugging Face download | Invalid or missing HF_TOKEN | Set a valid token: `export HF_TOKEN=<your-token>` |
| "CUDA out of memory" errors | Insufficient GPU memory | Reduce `--max_batch_size` or `--max_num_tokens` |
| Container exits immediately | Missing entrypoint script | Ensure the `trtllm-mn-entrypoint.sh` download succeeded and the file has executable permissions |

## Step 7. Cleanup and rollback

Stop and remove the containers by running the following command on the leader node:

```bash
docker stack rm trtllm-multinode
```

> **Warning:** This removes all inference data and performance reports. Copy the `/opt/*perf-report.json` files before cleanup if needed.

Remove downloaded models to free disk space:

```bash
rm -rf $HOME/.cache/huggingface/hub/models--nvidia--Qwen3*
```

## Step 8. Next steps

Compare performance metrics between the speculative decoding and baseline reports to quantify speed improvements. Use the multi-node setup as a foundation for deploying other large models that require tensor parallelism, or scale to additional nodes for higher-throughput workloads.
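The report comparison suggested above can be sketched as follows. The file names and the `tokens_per_sec` field are assumptions for illustration; check the actual keys in your `perf-report.json` files before adapting this.

```shell
# Sketch: compare token throughput between two trtllm-bench reports.
# Canned inputs here; in practice point these at the copies of
# /opt/*perf-report.json saved before cleanup.
cat > /tmp/baseline.json <<'EOF'
{"tokens_per_sec": 40.0}
EOF
cat > /tmp/spec-decode.json <<'EOF'
{"tokens_per_sec": 58.0}
EOF

python3 - <<'EOF'
import json
base = json.load(open("/tmp/baseline.json"))["tokens_per_sec"]
spec = json.load(open("/tmp/spec-decode.json"))["tokens_per_sec"]
print(f"speedup: {spec / base:.2f}x")
EOF
# prints: speedup: 1.45x
```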