# Speculative Decoding

> Learn how to set up speculative decoding for fast inference on Spark

## Table of Contents
- [Overview](#overview)
- [Instructions](#instructions)
- [Step 1. Configure Docker permissions](#step-1-configure-docker-permissions)
- [Step 2. Run draft-target speculative decoding](#step-2-run-draft-target-speculative-decoding)
- [Step 3. Test the draft-target setup](#step-3-test-the-draft-target-setup)
- [Step 4. Cleanup](#step-4-cleanup)
- [Step 5. Next steps](#step-5-next-steps)
- [Troubleshooting](#troubleshooting)
---
## Overview
## Basic idea
Speculative decoding speeds up text generation by using a **small, fast model** to draft several tokens ahead, then having the **larger model** quickly verify or adjust them.
This way, the big model doesn't need to predict every token step-by-step, reducing latency while keeping output quality.
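The draft-and-verify loop can be sketched in a few lines. Both models below are hypothetical stand-ins (simple deterministic functions, not real LLMs), with the draft model agreeing with the target about 80% of the time:

```python
import random

random.seed(0)

# Hypothetical stand-ins for the two models: each maps a token context to
# the next token. The "target" is the expensive model whose output we want;
# the "draft" is cheap but only agrees with the target ~80% of the time.
def target_model(context):
    return (sum(context) * 31 + 7) % 50

def draft_model(context):
    tok = target_model(context)
    return tok if random.random() < 0.8 else (tok + 1) % 50

def speculative_decode(prompt, n_tokens, max_draft_len=4):
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # 1) Draft phase: the small model proposes max_draft_len tokens ahead.
        ctx, draft = list(out), []
        for _ in range(max_draft_len):
            tok = draft_model(ctx)
            draft.append(tok)
            ctx.append(tok)
        # 2) Verify phase: the target checks the drafts left to right; the
        #    first mismatch is replaced by the target's token, rest discarded.
        for tok in draft:
            correct = target_model(out)
            if tok == correct:
                out.append(tok)      # draft accepted: one cheap token
            else:
                out.append(correct)  # draft rejected: fall back to target
                break
    return out[len(prompt):len(prompt) + n_tokens]

generated = speculative_decode([1, 2, 3], n_tokens=10)
```

Because verification falls back to the target's token at the first mismatch, the generated sequence is identical to plain greedy decoding with the target model alone; the saving is that several tokens can be accepted per expensive verification pass.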
## What you'll accomplish
You'll explore speculative decoding with TensorRT-LLM on NVIDIA Spark using the traditional draft-target approach.
These examples demonstrate how to accelerate large language model inference while maintaining output quality.
## What to know before starting
- Experience with Docker and containerized applications
- Understanding of speculative decoding concepts
- Familiarity with TensorRT-LLM serving and API endpoints
- Knowledge of GPU memory management for large language models
## Prerequisites
- NVIDIA Spark device with sufficient GPU memory available
- Docker with GPU support enabled

```bash
docker run --gpus all nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev nvidia-smi
```
- HuggingFace authentication configured (if needed for model downloads)

```bash
huggingface-cli login
```
- Network connectivity for model downloads
## Time & risk
* **Duration:** 10-20 minutes for setup, additional time for model downloads (varies by network speed)
* **Risks:** GPU memory exhaustion with large models, container registry access issues, network timeouts during downloads
* **Rollback:** Stop Docker containers and optionally clean up downloaded model cache.

## Instructions

### Step 1. Configure Docker permissions

To easily manage containers without sudo, you must be in the `docker` group. If you skip this step, you will need to run Docker commands with `sudo`.

Open a new terminal and test Docker access:

```bash
docker ps
```
If you see a permission denied error (something like `permission denied while trying to connect to the Docker daemon socket`), add your user to the docker group:
```bash
sudo usermod -aG docker $USER
```
> **Warning**: After running usermod, you must log out and log back in to start a new
> session with updated group permissions.

### Step 2. Run draft-target speculative decoding

Execute the following command to set up and run traditional speculative decoding:
```bash
docker run \
  -v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \
  --rm -it --ulimit memlock=-1 --ulimit stack=67108864 \
  --gpus=all --ipc=host --network host \
  nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
  bash -c "
  # Download the target and draft models
  hf download nvidia/Llama-3.3-70B-Instruct-FP4 && \
  hf download nvidia/Llama-3.1-8B-Instruct-FP4 \
    --local-dir /opt/Llama-3.1-8B-Instruct-FP4/ && \
  # Create the speculative decoding configuration file
  cat <<EOF > extra-llm-api-config.yml
print_iter_log: false
disable_overlap_scheduler: true
speculative_config:
  decoding_type: DraftTarget
  max_draft_len: 4
  speculative_model_dir: /opt/Llama-3.1-8B-Instruct-FP4/
kv_cache_config:
  enable_block_reuse: false
EOF
  # Start the TensorRT-LLM server
  trtllm-serve nvidia/Llama-3.3-70B-Instruct-FP4 \
    --backend pytorch --tp_size 1 \
    --max_batch_size 1 \
    --kv_cache_free_gpu_memory_fraction 0.9 \
    --extra_llm_api_options ./extra-llm-api-config.yml
  "
```
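Model downloads and engine initialization can take several minutes, so it helps to wait for the server programmatically before sending requests. The sketch below assumes the default port 8000 and a `/health` route; adjust both if your deployment differs.

```python
import time
import urllib.error
import urllib.request

# Poll the (assumed) health endpoint until the server answers or we give up.
def wait_for_server(url="http://localhost:8000/health",
                    timeout_s=600, interval_s=10):
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not up yet; retry after a short pause
        time.sleep(interval_s)
    return False
```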

### Step 3. Test the draft-target setup

Once the server is running, test it by making an API call from another terminal:
```bash
# Test completion endpoint
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/Llama-3.3-70B-Instruct-FP4",
"prompt": "Explain the benefits of speculative decoding:",
"max_tokens": 150,
"temperature": 0.7
}'
```
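The same request can be issued from Python using only the standard library; the payload mirrors the curl example above, and the server from Step 2 must already be running.

```python
import json
import urllib.request

def build_payload(prompt, model="nvidia/Llama-3.3-70B-Instruct-FP4",
                  max_tokens=150, temperature=0.7):
    # Mirrors the JSON body of the curl example above.
    return {"model": model, "prompt": prompt,
            "max_tokens": max_tokens, "temperature": temperature}

def complete(prompt, url="http://localhost:8000/v1/completions"):
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    # Requires the trtllm-serve container from Step 2 to be up.
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["text"]
```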
**Key features of draft-target:**
- **Efficient resource usage**: 8B draft model accelerates 70B target model
- **Flexible configuration**: Adjustable draft token length for optimization
- **Memory efficient**: Uses FP4 quantized models for reduced memory footprint
- **Compatible models**: Uses Llama family models with consistent tokenization
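A back-of-envelope model shows why the draft length matters. If each draft token is accepted independently with probability alpha (an idealization, not a measured Spark number), a draft of length k yields an expected (1 - alpha^(k+1)) / (1 - alpha) tokens per target verification pass, which saturates at 1 / (1 - alpha):

```python
# Expected tokens emitted per target-model pass, assuming each draft token
# is accepted independently with probability alpha (an idealization).
def expected_tokens_per_pass(alpha, k):
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# With 80% acceptance, longer drafts help less and less (ceiling: 5.0).
gains = {k: round(expected_tokens_per_pass(0.8, k), 2) for k in (1, 2, 4, 8)}
```

This diminishing return is why modest values such as `max_draft_len: 4` are a common starting point: past that, extra draft tokens are increasingly likely to be thrown away.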

### Step 4. Cleanup

Stop the Docker container when finished:
```bash
# Find and stop the container
docker ps
docker stop <container_id>
# Optional: clean up the downloaded Llama models from the cache
# rm -rf $HOME/.cache/huggingface/hub/models--nvidia--Llama*
```

### Step 5. Next steps

- Experiment with different `max_draft_len` values (1, 2, 3, 4, 8)
- Monitor token acceptance rates and throughput improvements
- Test with different prompt lengths and generation parameters
- Read more about [speculative decoding in TensorRT-LLM](https://nvidia.github.io/TensorRT-LLM/advanced/speculative-decoding.html).
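For comparing `max_draft_len` settings, a simple client-side probe is enough: time one completion and divide the number of generated tokens by the elapsed time. `measure` below is a hypothetical helper; it assumes the OpenAI-style `usage.completion_tokens` field in the response and requires restarting the server with a new config between runs.

```python
import json
import time
import urllib.request

def tokens_per_second(usage, elapsed_s):
    # usage is the "usage" object of an OpenAI-style completion response.
    return usage["completion_tokens"] / elapsed_s

def measure(prompt, url="http://localhost:8000/v1/completions", max_tokens=128):
    payload = {"model": "nvidia/Llama-3.3-70B-Instruct-FP4", "prompt": prompt,
               "max_tokens": max_tokens, "temperature": 0.0}
    req = urllib.request.Request(url, data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    t0 = time.monotonic()
    with urllib.request.urlopen(req) as resp:  # server must be running
        body = json.loads(resp.read())
    return tokens_per_second(body["usage"], time.monotonic() - t0)
```

Temperature 0 keeps runs comparable; averaging several calls per setting smooths out network and scheduling noise.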
## Troubleshooting
| Symptom | Cause | Fix |
|---------|--------|-----|
| "CUDA out of memory" error | Insufficient GPU memory | Reduce `kv_cache_free_gpu_memory_fraction` to 0.9 or use a device with more VRAM |
| Container fails to start | Docker GPU support issues | Verify the NVIDIA Container Toolkit is installed and the `--gpus=all` flag is supported |
| Model download fails | Network or authentication issues | Check HuggingFace authentication and network connectivity |
| Cannot access gated repo for URL | Some HuggingFace models restrict access | Request access to the gated model in your web browser, then regenerate your HuggingFace token |
| Server doesn't respond | Port conflicts or firewall | Check if port 8000 is available and not blocked |
> **Note:** DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
> With many applications still updating to take advantage of UMA, you may encounter memory issues even when within
> the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```