# Speculative Decoding

> Learn how to set up speculative decoding for fast inference on Spark

## Table of Contents
- [Overview](#overview)
- [Instructions](#instructions)
- [Step 1. Configure Docker permissions](#step-1-configure-docker-permissions)
- [Step 2. Run draft-target speculative decoding](#step-2-run-draft-target-speculative-decoding)
- [Step 3. Test the draft-target setup](#step-3-test-the-draft-target-setup)
- [Step 4. Cleanup](#step-4-cleanup)
- [Step 5. Next steps](#step-5-next-steps)
- [Troubleshooting](#troubleshooting)
---
## Overview
## Basic idea
Speculative decoding speeds up text generation by using a **small, fast model** to draft several tokens ahead, then having the **larger model** quickly verify or adjust them.
This way, the big model doesn't need to predict every token step-by-step, reducing latency while keeping output quality.
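The draft-and-verify loop can be sketched in a few lines. Both models below are hypothetical stand-ins (simple deterministic functions, not real LLMs), with the draft model agreeing with the target about 80% of the time:

```python
import random

random.seed(0)

# Hypothetical stand-ins for the two models: each maps a token context to
# the next token. The "target" is the expensive model whose output we want;
# the "draft" is cheap but only agrees with the target ~80% of the time.
def target_model(context):
    return (sum(context) * 31 + 7) % 50

def draft_model(context):
    tok = target_model(context)
    return tok if random.random() < 0.8 else (tok + 1) % 50

def speculative_decode(prompt, n_tokens, max_draft_len=4):
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # 1) Draft phase: the small model proposes max_draft_len tokens ahead.
        ctx, draft = list(out), []
        for _ in range(max_draft_len):
            tok = draft_model(ctx)
            draft.append(tok)
            ctx.append(tok)
        # 2) Verify phase: the target checks the drafts left to right; the
        #    first mismatch is replaced by the target's token, rest discarded.
        for tok in draft:
            correct = target_model(out)
            if tok == correct:
                out.append(tok)      # draft accepted: one cheap token
            else:
                out.append(correct)  # draft rejected: fall back to target
                break
    return out[len(prompt):len(prompt) + n_tokens]

generated = speculative_decode([1, 2, 3], n_tokens=10)
```

Because verification falls back to the target's token at the first mismatch, the generated sequence is identical to plain greedy decoding with the target model alone; the saving is that several tokens can be accepted per expensive verification pass.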
## What you'll accomplish
You'll explore speculative decoding with TensorRT-LLM on NVIDIA Spark using the traditional draft-target approach.
These examples demonstrate how to accelerate large language model inference while maintaining output quality.
## What to know before starting
- Experience with Docker and containerized applications
- Understanding of speculative decoding concepts
- Familiarity with TensorRT-LLM serving and API endpoints
- Knowledge of GPU memory management for large language models
## Prerequisites
- NVIDIA Spark device with sufficient GPU memory available
- Docker with GPU support enabled

```bash
docker run --gpus all nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev nvidia-smi
```
- HuggingFace authentication configured (if needed for model downloads)

```bash
huggingface-cli login
```
- Network connectivity for model downloads
## Time & risk
* **Duration:** 10-20 minutes for setup, additional time for model downloads (varies by network speed)
* **Risks:** GPU memory exhaustion with large models, container registry access issues, network timeouts during downloads
* **Rollback:** Stop Docker containers and optionally clean up downloaded model cache.

## Instructions

### Step 1. Configure Docker permissions

To easily manage containers without sudo, you must be in the `docker` group. If you skip this step, you will need to run Docker commands with `sudo`.

Open a new terminal and test Docker access:

```bash
docker ps
```
If you see a permission denied error (something like `permission denied while trying to connect to the Docker daemon socket`), add your user to the docker group:
```bash
sudo usermod -aG docker $USER
```
> **Warning**: After running usermod, you must log out and log back in to start a new
> session with updated group permissions.

### Step 2. Run draft-target speculative decoding

Execute the following command to set up and run traditional speculative decoding:
```bash
docker run \
  -v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \
  --rm -it --ulimit memlock=-1 --ulimit stack=67108864 \
  --gpus=all --ipc=host --network host \
  nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
  bash -c "
  # Download the target and draft models
  hf download nvidia/Llama-3.3-70B-Instruct-FP4 && \
  hf download nvidia/Llama-3.1-8B-Instruct-FP4 \
    --local-dir /opt/Llama-3.1-8B-Instruct-FP4/ && \
  # Create the speculative decoding configuration file
  cat <<EOF > extra-llm-api-config.yml
print_iter_log: false
disable_overlap_scheduler: true
speculative_config:
  decoding_type: DraftTarget
  max_draft_len: 4
  speculative_model_dir: /opt/Llama-3.1-8B-Instruct-FP4/
kv_cache_config:
  enable_block_reuse: false
EOF
  # Start the TensorRT-LLM server
  trtllm-serve nvidia/Llama-3.3-70B-Instruct-FP4 \
    --backend pytorch --tp_size 1 \
    --max_batch_size 1 \
    --kv_cache_free_gpu_memory_fraction 0.9 \
    --extra_llm_api_options ./extra-llm-api-config.yml
  "
```
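Model downloads and engine initialization can take several minutes, so it helps to wait for the server programmatically before sending requests. The sketch below assumes the default port 8000 and a `/health` route; adjust both if your deployment differs.

```python
import time
import urllib.error
import urllib.request

# Poll the (assumed) health endpoint until the server answers or we give up.
def wait_for_server(url="http://localhost:8000/health",
                    timeout_s=600, interval_s=10):
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not up yet; retry after a short pause
        time.sleep(interval_s)
    return False
```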

### Step 3. Test the draft-target setup

Once the server is running, test it by making an API call from another terminal:
```bash
# Test completion endpoint
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/Llama-3.3-70B-Instruct-FP4",
"prompt": "Explain the benefits of speculative decoding:",
"max_tokens": 150,
"temperature": 0.7
}'
```
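The same request can be issued from Python using only the standard library; the payload mirrors the curl example above, and the server from Step 2 must already be running.

```python
import json
import urllib.request

def build_payload(prompt, model="nvidia/Llama-3.3-70B-Instruct-FP4",
                  max_tokens=150, temperature=0.7):
    # Mirrors the JSON body of the curl example above.
    return {"model": model, "prompt": prompt,
            "max_tokens": max_tokens, "temperature": temperature}

def complete(prompt, url="http://localhost:8000/v1/completions"):
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    # Requires the trtllm-serve container from Step 2 to be up.
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["text"]
```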
**Key features of draft-target:**
- **Efficient resource usage**: 8B draft model accelerates 70B target model
- **Flexible configuration**: Adjustable draft token length for optimization
- **Memory efficient**: Uses FP4 quantized models for reduced memory footprint
- **Compatible models**: Uses Llama family models with consistent tokenization
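A back-of-envelope model shows why the draft length matters. If each draft token is accepted independently with probability alpha (an idealization, not a measured Spark number), a draft of length k yields an expected (1 - alpha^(k+1)) / (1 - alpha) tokens per target verification pass, which saturates at 1 / (1 - alpha):

```python
# Expected tokens emitted per target-model pass, assuming each draft token
# is accepted independently with probability alpha (an idealization).
def expected_tokens_per_pass(alpha, k):
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# With 80% acceptance, longer drafts help less and less (ceiling: 5.0).
gains = {k: round(expected_tokens_per_pass(0.8, k), 2) for k in (1, 2, 4, 8)}
```

This diminishing return is why modest values such as `max_draft_len: 4` are a common starting point: past that, extra draft tokens are increasingly likely to be thrown away.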

### Step 4. Cleanup

Stop the Docker container when finished:
```bash
# Find and stop the container
docker ps
docker stop <container_id>
# Optional: clean up the downloaded Llama models from the cache
# rm -rf $HOME/.cache/huggingface/hub/models--nvidia--Llama*
```

### Step 5. Next steps

- Experiment with different `max_draft_len` values (1, 2, 3, 4, 8)
- Monitor token acceptance rates and throughput improvements
- Test with different prompt lengths and generation parameters
- Read more about [speculative decoding in TensorRT-LLM](https://nvidia.github.io/TensorRT-LLM/advanced/speculative-decoding.html).
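For comparing `max_draft_len` settings, a simple client-side probe is enough: time one completion and divide the number of generated tokens by the elapsed time. `measure` below is a hypothetical helper; it assumes the OpenAI-style `usage.completion_tokens` field in the response and requires restarting the server with a new config between runs.

```python
import json
import time
import urllib.request

def tokens_per_second(usage, elapsed_s):
    # usage is the "usage" object of an OpenAI-style completion response.
    return usage["completion_tokens"] / elapsed_s

def measure(prompt, url="http://localhost:8000/v1/completions", max_tokens=128):
    payload = {"model": "nvidia/Llama-3.3-70B-Instruct-FP4", "prompt": prompt,
               "max_tokens": max_tokens, "temperature": 0.0}
    req = urllib.request.Request(url, data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    t0 = time.monotonic()
    with urllib.request.urlopen(req) as resp:  # server must be running
        body = json.loads(resp.read())
    return tokens_per_second(body["usage"], time.monotonic() - t0)
```

Temperature 0 keeps runs comparable; averaging several calls per setting smooths out network and scheduling noise.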
## Troubleshooting
| Symptom | Cause | Fix |
|---------|--------|-----|
| "CUDA out of memory" error | Insufficient GPU memory | Reduce `kv_cache_free_gpu_memory_fraction` to 0.9 or use a device with more VRAM |
| Container fails to start | Docker GPU support issues | Verify the NVIDIA Container Toolkit is installed and the `--gpus=all` flag is supported |
| Model download fails | Network or authentication issues | Check HuggingFace authentication and network connectivity |
| Cannot access gated repo for URL | Some HuggingFace models restrict access | Request access to the gated model in your web browser, then regenerate your HuggingFace token |
| Server doesn't respond | Port conflicts or firewall | Check if port 8000 is available and not blocked |
> **Note:** DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
> With many applications still updating to take advantage of UMA, you may encounter memory issues even when within
> the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```