chore: Regenerate all playbooks

2026-04-22 01:53:53 +00:00 · 2025-10-08 16:27:42 +00:00 · 41b629f82b
commit 41b629f82b
parent 2176f83be0
1 changed file with 8 additions and 8 deletions
--- a/nvidia/speculative-decoding/README.md
+++ b/nvidia/speculative-decoding/README.md
@@ -9,9 +9,9 @@
  - [Step 1. Configure Docker permissions](#step-1-configure-docker-permissions)
  - [Step 2. Run draft-target speculative decoding](#step-2-run-draft-target-speculative-decoding)
  - [Step 3. Test the draft-target setup](#step-3-test-the-draft-target-setup)
-  - [Troubleshooting](#troubleshooting)
-  - [Cleanup](#cleanup)
-  - [Next Steps](#next-steps)
+  - [Step 4. Troubleshooting](#step-4-troubleshooting)
+  - [Step 5. Cleanup](#step-5-cleanup)
+  - [Step 6. Next Steps](#step-6-next-steps)

 ---

@@ -25,7 +25,6 @@ This way, the big model doesn't need to predict every token step-by-step, reduci
 ## What you'll accomplish

 You'll explore speculative decoding using TensorRT-LLM on NVIDIA Spark using the traditional Draft-Target approach.
-
 These examples demonstrate how to accelerate large language model inference while maintaining output quality.

 ## What to know before starting
@@ -132,13 +131,14 @@ curl -X POST http://localhost:8000/v1/completions \
  }'
 ```

-#### Key features of draft-target:
+**Key features of draft-target:**
+
 - **Efficient resource usage**: 8B draft model accelerates 70B target model
 - **Flexible configuration**: Adjustable draft token length for optimization
 - **Memory efficient**: Uses FP4 quantized models for reduced memory footprint
 - **Compatible models**: Uses Llama family models with consistent tokenization

-### Troubleshooting
+### Step 4. Troubleshooting

 Common issues and solutions:

@@ -149,7 +149,7 @@ Common issues and solutions:
 | Model download fails | Network or authentication issues | Check HuggingFace authentication and network connectivity |
 | Server doesn't respond | Port conflicts or firewall | Check if port 8000 is available and not blocked |

-### Cleanup
+### Step 5. Cleanup

 Stop the Docker container when finished:

@@ -162,7 +162,7 @@ docker stop <container_id>
 ## rm -rf $HOME/.cache/huggingface/hub/models--*gpt-oss*
 ```

-### Next Steps
+### Step 6. Next Steps

 - Experiment with different `max_draft_len` values (1, 2, 3, 4, 8)
 - Monitor token acceptance rates and throughput improvements
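
As a rough guide for the `max_draft_len` sweep in the Next Steps list, the sketch below estimates how many tokens each target-model pass yields for a given per-token acceptance rate. It uses the usual geometric-series estimate for draft-target speculative decoding and assumes every draft token is accepted independently with the same probability, which real workloads only approximate. The function name `expected_tokens_per_step` and the 0.7 acceptance rate are illustrative assumptions, not part of the playbook or of TensorRT-LLM.

```python
def expected_tokens_per_step(acceptance_rate: float, max_draft_len: int) -> float:
    """Estimate tokens emitted per target-model pass.

    Assumes each draft token is accepted independently with probability
    `acceptance_rate` (a simplification), and that the target model always
    contributes one token of its own after the accepted prefix.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if max_draft_len < 0:
        raise ValueError("max_draft_len must be non-negative")
    if acceptance_rate == 1.0:
        # Every draft token is accepted, plus one token from the target model.
        return float(max_draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^k = (1 - a^(k+1)) / (1 - a)
    return (1.0 - acceptance_rate ** (max_draft_len + 1)) / (1.0 - acceptance_rate)


if __name__ == "__main__":
    assumed_acceptance_rate = 0.7  # hypothetical value; measure your own
    for draft_len in (1, 2, 3, 4, 8):
        estimate = expected_tokens_per_step(assumed_acceptance_rate, draft_len)
        print(f"max_draft_len={draft_len}: ~{estimate:.2f} tokens per target pass")
```

The estimate flattens as `max_draft_len` grows, while the draft model's cost keeps rising with each extra token, so measured throughput typically peaks at a moderate draft length rather than the largest one.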