# Speculative Decoding
> Learn how to set up speculative decoding for fast inference on Spark
## Table of Contents
- [Overview](#overview)
- [Instructions](#instructions)
- [Step 1. Configure Docker permissions](#step-1-configure-docker-permissions)
- [Step 2. Run draft-target speculative decoding](#step-2-run-draft-target-speculative-decoding)
- [Step 3. Test the draft-target setup](#step-3-test-the-draft-target-setup)
- [Step 4. Troubleshooting](#step-4-troubleshooting)
- [Step 5. Cleanup](#step-5-cleanup)
- [Step 6. Next Steps](#step-6-next-steps)
---
## Overview
### Basic idea
Speculative decoding speeds up text generation by using a **small, fast model** to draft several tokens ahead, then having the **larger model** quickly verify or adjust them.
This way, the big model doesn't need to predict every token step-by-step, reducing latency while keeping output quality.
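The draft/verify loop can be sketched with toy deterministic "models" (illustrative stand-ins for the real draft and target LLMs, not TensorRT-LLM code). With greedy decoding, speculative decoding produces exactly the tokens the target model alone would, just in fewer target-model passes:

```python
# Toy sketch of greedy draft-target speculative decoding over integer
# "tokens". target_next and draft_next are deterministic stand-ins for
# the large and small models (illustrative only, not TensorRT-LLM code).

def target_next(ctx):
    # "Expensive" model: looks at the last two tokens.
    return (ctx[-1] * 3 + (ctx[-2] if len(ctx) > 1 else 0)) % 7

def draft_next(ctx):
    # "Cheap" model: a rough approximation that only sees the last token.
    return (ctx[-1] * 3) % 7

def speculative_decode(prompt, n_tokens, max_draft_len=4):
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_tokens:
        # 1. The draft model proposes max_draft_len tokens autoregressively.
        ctx = list(out)
        draft = []
        for _ in range(max_draft_len):
            draft.append(draft_next(ctx))
            ctx.append(draft[-1])
        # 2. One target "pass" verifies the draft: keep the longest prefix
        #    the target agrees with, then append the target's own token.
        target_passes += 1
        ctx = list(out)
        for tok in draft:
            if target_next(ctx) == tok and len(out) - len(prompt) < n_tokens:
                out.append(tok)
                ctx.append(tok)
            else:
                break
        if len(out) - len(prompt) < n_tokens:
            out.append(target_next(out))
    return out[len(prompt):], target_passes

tokens, passes = speculative_decode([1, 2], 12)
print(tokens)                    # identical to pure greedy target decoding
print("target passes:", passes)  # fewer passes than tokens generated
```

The better the draft model approximates the target, the longer the accepted prefixes and the fewer target passes per generated token.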
### What you'll accomplish

You'll explore speculative decoding with TensorRT-LLM on NVIDIA Spark using the traditional draft-target approach.
These examples demonstrate how to accelerate large language model inference while maintaining output quality.
### What to know before starting
- Experience with Docker and containerized applications
- Understanding of speculative decoding concepts
- Familiarity with TensorRT-LLM serving and API endpoints
- Knowledge of GPU memory management for large language models
### Prerequisites
- NVIDIA Spark device with sufficient GPU memory available
- Docker with GPU support enabled
```bash
docker run --rm --gpus all nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev nvidia-smi
```
- HuggingFace authentication configured (if needed for model downloads)
```bash
huggingface-cli login
```
- Network connectivity for model downloads
### Time & risk
* **Duration:** 10-20 minutes for setup, additional time for model downloads (varies by network speed)
* **Risks:** GPU memory exhaustion with large models, container registry access issues, network timeouts during downloads
* **Rollback:** Stop Docker containers and optionally clean up downloaded model cache.
* DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. With many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
## Instructions
### Step 1. Configure Docker permissions
2025-10-03 20:46:11 +00:00
2025-10-06 15:32:36 +00:00
To easily manage containers without sudo, you must be in the `docker` group. If you choose to skip this step, you will need to run Docker commands with sudo.
2025-10-03 20:46:11 +00:00
2025-10-06 15:32:36 +00:00
Open a new terminal and test Docker access:
```bash
docker ps
```
If you see a permission denied error (something like `permission denied while trying to connect to the Docker daemon socket`), add your user to the `docker` group:
```bash
sudo usermod -aG docker $USER
```
> **Warning**: After running usermod, you must log out and log back in to start a new
> session with updated group permissions.
2025-10-03 20:46:11 +00:00
2025-10-07 21:12:07 +00:00
### Step 2. Run draft-target speculative decoding
2025-10-03 20:46:11 +00:00
Execute the following command to set up and run traditional speculative decoding:
```bash
docker run \
  -v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \
  --rm -it --ulimit memlock=-1 --ulimit stack=67108864 \
  --gpus=all --ipc=host --network host nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
  bash -c "
# Download the target and draft models
hf download nvidia/Llama-3.3-70B-Instruct-FP4 && \
hf download nvidia/Llama-3.1-8B-Instruct-FP4 \
  --local-dir /opt/Llama-3.1-8B-Instruct-FP4/ && \
# Create the speculative decoding configuration file
cat << EOF > extra-llm-api-config.yml
print_iter_log: false
disable_overlap_scheduler: true
speculative_config:
  decoding_type: DraftTarget
  max_draft_len: 4
  speculative_model_dir: /opt/Llama-3.1-8B-Instruct-FP4/
kv_cache_config:
  enable_block_reuse: false
EOF
# Start the TensorRT-LLM server with the 70B target model
trtllm-serve nvidia/Llama-3.3-70B-Instruct-FP4 \
  --backend pytorch --tp_size 1 \
  --max_batch_size 1 \
  --kv_cache_free_gpu_memory_fraction 0.9 \
  --extra_llm_api_options ./extra-llm-api-config.yml
"
```
### Step 3. Test the draft-target setup

Once the server is running, test it by making an API call from another terminal:
```bash
# Test completion endpoint
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/Llama-3.3-70B-Instruct-FP4",
"prompt": "Explain the benefits of speculative decoding:",
"max_tokens": 150,
"temperature": 0.7
}'
```
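The same request can be made from Python with only the standard library; the endpoint and payload below mirror the curl call above (adjust host and port for your setup):

```python
# Minimal client for the OpenAI-compatible completions endpoint exposed by
# trtllm-serve, using only the Python standard library.
import json
import urllib.request

def build_request(prompt, base_url="http://localhost:8000"):
    payload = {
        "model": "nvidia/Llama-3.3-70B-Instruct-FP4",
        "prompt": prompt,
        "max_tokens": 150,
        "temperature": 0.7,
    }
    return urllib.request.Request(
        base_url + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_request("Explain the benefits of speculative decoding:")
try:
    # Requires the server from Step 2 to be running.
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.load(resp)
    print(body["choices"][0]["text"])
except (OSError, ValueError, KeyError) as exc:
    print("server not reachable or unexpected response:", exc)
```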
**Key features of draft-target:**
- **Efficient resource usage**: 8B draft model accelerates 70B target model
- **Flexible configuration**: Adjustable draft token length for optimization
- **Memory efficient**: Uses FP4 quantized models for reduced memory footprint
- **Compatible models**: Uses Llama family models with consistent tokenization
### Step 4. Troubleshooting

Common issues and solutions:
| Symptom | Cause | Fix |
|---------|--------|-----|
| "CUDA out of memory" error | Insufficient GPU memory | Lower `kv_cache_free_gpu_memory_fraction` (e.g., from 0.9 to 0.7) or flush the buffer cache as described under Time & risk |
| Container fails to start | Docker GPU support issues | Verify `nvidia-docker` is installed and `--gpus=all` flag is supported |
| Model download fails | Network or authentication issues | Check HuggingFace authentication and network connectivity |
| Server doesn't respond | Port conflicts or firewall | Check if port 8000 is available and not blocked |
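When the server seems unresponsive, a quick port probe helps distinguish a port conflict from a model that is still loading. This is a generic stdlib check, not part of TensorRT-LLM:

```python
# Probe a TCP port to see whether anything is listening there.
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# trtllm-serve listens on port 8000 by default.
print("server reachable:", port_open("127.0.0.1", 8000))
```

If the port is open but requests hang, the model is likely still loading; if it is closed, check for another process bound to 8000 or a blocked firewall rule.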
### Step 5. Cleanup

Stop the Docker container when finished:
```bash
# Find and stop the container
docker ps
docker stop <container_id>

# Optional: clean up the downloaded models from the cache
# rm -rf $HOME/.cache/huggingface/hub/models--nvidia--Llama*
```
### Step 6. Next Steps

- Experiment with different `max_draft_len` values (1, 2, 3, 4, 8)
- Monitor token acceptance rates and throughput improvements
- Test with different prompt lengths and generation parameters
- Read more about speculative decoding in the [TensorRT-LLM documentation](https://nvidia.github.io/TensorRT-LLM/advanced/speculative-decoding.html).
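When tuning `max_draft_len`, a common back-of-envelope model assumes each draft token is accepted independently with probability alpha; real acceptance is correlated across positions, so treat this as a rough guide, not a prediction:

```python
# Expected tokens generated per target-model pass with draft length k,
# assuming each draft token is accepted independently with probability
# alpha (geometric-series estimate; the "+1" is the target's own token).
def expected_tokens_per_pass(alpha, k):
    if alpha >= 1.0:
        return k + 1  # every draft token accepted, plus the bonus token
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Compare the candidate max_draft_len values at a 70% acceptance rate.
for k in (1, 2, 3, 4, 8):
    print(k, round(expected_tokens_per_pass(0.7, k), 2))
```

The returns diminish quickly: past the point where `alpha ** k` is small, longer drafts mostly add wasted draft-model work.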