
Speculative Decoding

Learn how to set up speculative decoding for fast inference on Spark

Overview

Basic idea

Speculative decoding speeds up text generation by using a small, fast model to draft several tokens ahead, then having the larger model quickly verify or adjust them. This way, the big model doesn't need to predict every token step-by-step, reducing latency while keeping output quality.
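
As a rough illustration of why this helps: if the target model accepts each drafted token independently with probability a, and the draft model proposes k tokens per round, a commonly used approximation from the speculative decoding literature says the expected number of tokens committed per target-model pass is (1 - a^(k+1)) / (1 - a), where the "+1" accounts for the token the target model contributes itself. The numbers below are hypothetical, not measurements from Spark.

## Hypothetical: acceptance rate a = 0.7 per drafted token, max_draft_len k = 4
awk 'BEGIN { a = 0.7; k = 4; print (1 - a^(k+1)) / (1 - a) }'
## prints ~2.77, i.e. roughly 2-3 tokens per target-model forward pass instead of 1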

What you'll accomplish

You'll explore speculative decoding with TensorRT-LLM on NVIDIA Spark using the traditional Draft-Target approach.

These examples demonstrate how to accelerate large language model inference while maintaining output quality.

What to know before starting

  • Experience with Docker and containerized applications
  • Understanding of speculative decoding concepts
  • Familiarity with TensorRT-LLM serving and API endpoints
  • Knowledge of GPU memory management for large language models

Prerequisites

  • NVIDIA Spark device with sufficient GPU memory available (a quick check is shown after this list)
  • Docker with GPU support enabled
    docker run --gpus all nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev nvidia-smi
    
  • HuggingFace authentication configured (if needed for model downloads)
    huggingface-cli login
    
  • Network connectivity for model downloads
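
A quick sanity check of headroom before pulling the models (plain nvidia-smi and df; even FP4-quantized, the checkpoints need tens of GB of disk space and GPU memory):

## Check free GPU memory and disk space for the model cache
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv
df -h $HOME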

Time & risk

Duration: 10-20 minutes for setup, additional time for model downloads (varies by network speed)

Risks: GPU memory exhaustion with large models, container registry access issues, network timeouts during downloads

Rollback: Stop Docker containers and optionally clean up downloaded model cache

How to run inference with speculative decoding

Traditional Draft-Target Speculative Decoding

This example demonstrates traditional speculative decoding using a smaller draft model to accelerate a larger target model.

Step 1. Configure Docker permissions

To manage containers without sudo, your user must be in the docker group. If you skip this step, you will need to run every Docker command with sudo.

Open a new terminal and test Docker access. In the terminal, run:

docker ps

If you see a permission denied error (something like permission denied while trying to connect to the Docker daemon socket), add your user to the docker group:

sudo usermod -aG docker $USER

Warning: After running usermod, you must log out and log back in to start a new session with updated group permissions.
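
If logging out is inconvenient right now, a common alternative is to start a subshell that already has the docker group applied (this affects only the current terminal; the permanent fix is still the logout/login above):

newgrp docker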

Step 2. Run Draft-Target Speculative Decoding

Execute the following command to set up and run traditional speculative decoding:

docker run \
  -v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \
  --rm -it --ulimit memlock=-1 --ulimit stack=67108864 \
  --gpus=all --ipc=host --network host nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
  bash -c "
    # Download models
    hf download nvidia/Llama-3.3-70B-Instruct-FP4 && \
    hf download nvidia/Llama-3.1-8B-Instruct-FP4 \
    --local-dir /opt/Llama-3.1-8B-Instruct-FP4/ && \

    # Create configuration file
    cat <<EOF > extra-llm-api-config.yml
print_iter_log: false
disable_overlap_scheduler: true
speculative_config:
  decoding_type: DraftTarget
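  # number of tokens the draft model proposes per verification step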
  max_draft_len: 4
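  # path to the 8B draft model downloaded above with --local-dir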
  speculative_model_dir: /opt/Llama-3.1-8B-Instruct-FP4/
kv_cache_config:
  enable_block_reuse: false
EOF

    # Start TensorRT-LLM server
    trtllm-serve nvidia/Llama-3.3-70B-Instruct-FP4 \
      --backend pytorch --tp_size 1 \
      --max_batch_size 1 \
      --kv_cache_free_gpu_memory_fraction 0.9 \
      --extra_llm_api_options ./extra-llm-api-config.yml
  "

Step 3. Test the Draft-Target setup

Once the server is running, test it by making an API call from another terminal:

## Test completion endpoint
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Llama-3.3-70B-Instruct-FP4",
    "prompt": "Explain the benefits of speculative decoding:",
    "max_tokens": 150,
    "temperature": 0.7
  }'
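
The same server also accepts OpenAI-style chat requests, which is often closer to how the endpoint is used in practice. This assumes the standard /v1/chat/completions route that trtllm-serve mirrors; jq is optional and only pretty-prints the response:

## Test chat endpoint
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Llama-3.3-70B-Instruct-FP4",
    "messages": [{"role": "user", "content": "Summarize speculative decoding in two sentences."}],
    "max_tokens": 100,
    "temperature": 0.7
  }' | jq .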

Key Features of Draft-Target:

  • Efficient resource usage: 8B draft model accelerates 70B target model
  • Flexible configuration: Adjustable draft token length for optimization
  • Memory efficient: Uses FP4 quantized models for reduced memory footprint
  • Compatible models: Uses Llama family models with consistent tokenization

Troubleshooting

Common issues and solutions:

  • "CUDA out of memory" error
    Cause: insufficient GPU memory. Fix: lower kv_cache_free_gpu_memory_fraction (for example from 0.9 to 0.8) or use a device with more VRAM.
  • Container fails to start
    Cause: Docker GPU support issues. Fix: verify the NVIDIA Container Toolkit is installed and the --gpus=all flag is supported.
  • Model download fails
    Cause: network or authentication issues. Fix: check HuggingFace authentication and network connectivity.
  • Server doesn't respond
    Cause: port conflict or firewall. Fix: check that port 8000 is available and not blocked (see the port check after this list).
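
For the "server doesn't respond" case, a quick check with standard Linux tooling shows whether something else is already bound to port 8000:

## Check whether port 8000 is already in use
ss -ltnp | grep ':8000' || echo "port 8000 is free"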

Cleanup

Stop the Docker container when finished:

## Find and stop the container
docker ps
docker stop <container_id>

## Optional: Clean up downloaded models from cache
## rm -rf $HOME/.cache/huggingface/hub/models--nvidia--Llama-3.3-70B-Instruct-FP4
## rm -rf $HOME/.cache/huggingface/hub/models--nvidia--Llama-3.1-8B-Instruct-FP4

Next Steps

  • Experiment with different max_draft_len values (1, 2, 3, 4, 8)
  • Monitor token acceptance rates and throughput improvements (a rough timing sketch follows this list)
  • Test with different prompt lengths and generation parameters
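
One minimal way to gauge the effect, as a rough sketch rather than a proper benchmark: time the same request against the server started with the speculative_config block, then restart the server with that block removed from extra-llm-api-config.yml and time it again.

## Rough wall-clock timing of a single request
time curl -s -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Llama-3.3-70B-Instruct-FP4",
    "prompt": "List five practical uses for small language models.",
    "max_tokens": 200,
    "temperature": 0
  }' > /dev/null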