From 41b629f82bfa227d67fca68826a765586f588b40 Mon Sep 17 00:00:00 2001
From: GitLab CI
Date: Wed, 8 Oct 2025 16:27:42 +0000
Subject: [PATCH] chore: Regenerate all playbooks

---
 nvidia/speculative-decoding/README.md | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/nvidia/speculative-decoding/README.md b/nvidia/speculative-decoding/README.md
index 747c76e..3f7e689 100644
--- a/nvidia/speculative-decoding/README.md
+++ b/nvidia/speculative-decoding/README.md
@@ -9,9 +9,9 @@
   - [Step 1. Configure Docker permissions](#step-1-configure-docker-permissions)
   - [Step 2. Run draft-target speculative decoding](#step-2-run-draft-target-speculative-decoding)
   - [Step 3. Test the draft-target setup](#step-3-test-the-draft-target-setup)
-  - [Troubleshooting](#troubleshooting)
-  - [Cleanup](#cleanup)
-  - [Next Steps](#next-steps)
+  - [Step 4. Troubleshooting](#step-4-troubleshooting)
+  - [Step 5. Cleanup](#step-5-cleanup)
+  - [Step 6. Next Steps](#step-6-next-steps)
 
 ---
 
@@ -25,7 +25,6 @@ This way, the big model doesn't need to predict every token step-by-step, reduci
 ## What you'll accomplish
 
 You'll explore speculative decoding using TensorRT-LLM on NVIDIA Spark using the traditional Draft-Target approach.
-
 These examples demonstrate how to accelerate large language model inference while maintaining output quality.
 
 ## What to know before starting
@@ -132,13 +131,14 @@ curl -X POST http://localhost:8000/v1/completions \
   }'
 ```
 
-#### Key features of draft-target:
+**Key features of draft-target:**
+
 - **Efficient resource usage**: 8B draft model accelerates 70B target model
 - **Flexible configuration**: Adjustable draft token length for optimization
 - **Memory efficient**: Uses FP4 quantized models for reduced memory footprint
 - **Compatible models**: Uses Llama family models with consistent tokenization
 
-### Troubleshooting
+### Step 4. Troubleshooting
 
 Common issues and solutions:
 
@@ -149,7 +149,7 @@ Common issues and solutions:
 | Model download fails | Network or authentication issues | Check HuggingFace authentication and network connectivity |
 | Server doesn't respond | Port conflicts or firewall | Check if port 8000 is available and not blocked |
 
-### Cleanup
+### Step 5. Cleanup
 
 Stop the Docker container when finished:
 
@@ -162,7 +162,7 @@ docker stop
 ## rm -rf $HOME/.cache/huggingface/hub/models--*gpt-oss*
 ```
 
-### Next Steps
+### Step 6. Next Steps
 
 - Experiment with different `max_draft_len` values (1, 2, 3, 4, 8)
 - Monitor token acceptance rates and throughput improvements