From 316b9a41fa03571fa98662a12fc5bf0ce5142faf Mon Sep 17 00:00:00 2001
From: GitLab CI
Date: Tue, 7 Oct 2025 17:40:52 +0000
Subject: [PATCH] chore: Regenerate all playbooks

---
 nvidia/speculative-decoding/README.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/nvidia/speculative-decoding/README.md b/nvidia/speculative-decoding/README.md
index 494dd39..6da7f6f 100644
--- a/nvidia/speculative-decoding/README.md
+++ b/nvidia/speculative-decoding/README.md
@@ -5,7 +5,7 @@
 ## Table of Contents
 
 - [Overview](#overview)
-- [How to run inference with speculative decoding](#how-to-run-inference-with-speculative-decoding)
+- [Instructions](#instructions)
 - [Step 1. Configure Docker permissions](#step-1-configure-docker-permissions)
 - [Step 2. Run Draft-Target Speculative Decoding](#step-2-run-draft-target-speculative-decoding)
 - [Step 3. Test the Draft-Target setup](#step-3-test-the-draft-target-setup)
@@ -57,7 +57,7 @@ These examples demonstrate how to accelerate large language model inference whil
 
 **Rollback:** Stop Docker containers and optionally clean up downloaded model cache
 
-## How to run inference with speculative decoding
+## Instructions
 
 ## Traditional Draft-Target Speculative Decoding
 
@@ -169,3 +169,4 @@ docker stop
 - Experiment with different `max_draft_len` values (1, 2, 3, 4, 8)
 - Monitor token acceptance rates and throughput improvements
 - Test with different prompt lengths and generation parameters
+- Read more on Speculative Decoding [here](https://nvidia.github.io/TensorRT-LLM/advanced/speculative-decoding.html)
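
For reviewers unfamiliar with the README being regenerated: its closing tips mention tuning `max_draft_len` and monitoring token acceptance rates. The draft-target loop those tips refer to can be sketched with toy stand-ins for the two models. Both functions below are illustrative assumptions (deterministic integer "models", not TensorRT-LLM APIs); only the accept-or-correct control flow matches the technique.

```python
def draft_model(prefix, max_draft_len):
    """Cheap model: propose up to max_draft_len next tokens.
    This toy draft counts upward but erroneously skips multiples of 7."""
    guess, proposed = prefix[-1], []
    for _ in range(max_draft_len):
        guess += 1
        if guess % 7 == 0:  # deliberate draft error to force a rejection
            guess += 1
        proposed.append(guess)
    return proposed

def target_model(prefix, candidates):
    """Expensive model: verify drafted tokens in one pass.
    Accepts the longest matching prefix, then emits one corrected token,
    so every call produces at least one token (the standard guarantee)."""
    accepted, last = [], prefix[-1]
    for tok in candidates:
        expected = last + 1  # the target's "true" next token
        if tok != expected:
            return accepted + [expected]  # reject tok, substitute correction
        accepted.append(tok)
        last = tok
    return accepted + [last + 1]  # all drafts accepted: bonus token

def speculative_decode(prompt, max_new_tokens, max_draft_len):
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        drafted = draft_model(out, max_draft_len)
        out += target_model(out, drafted)
    return out[:len(prompt) + max_new_tokens]

print(speculative_decode([0], 8, max_draft_len=3))  # → [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

A larger `max_draft_len` amortizes more target-model calls per accepted run but wastes draft work when acceptance rates are low, which is why the README suggests sweeping values like 1-8 while watching acceptance rate and throughput.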