From b1999f0f3fef083e4a1329b3b8b3ab9baf78d432 Mon Sep 17 00:00:00 2001
From: GitLab CI
Date: Tue, 7 Oct 2025 21:12:07 +0000
Subject: [PATCH] chore: Regenerate all playbooks

---
 nvidia/speculative-decoding/README.md | 20 +++++++++-----------
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/nvidia/speculative-decoding/README.md b/nvidia/speculative-decoding/README.md
index 6da7f6f..747c76e 100644
--- a/nvidia/speculative-decoding/README.md
+++ b/nvidia/speculative-decoding/README.md
@@ -7,8 +7,8 @@
 - [Overview](#overview)
 - [Instructions](#instructions)
   - [Step 1. Configure Docker permissions](#step-1-configure-docker-permissions)
-  - [Step 2. Run Draft-Target Speculative Decoding](#step-2-run-draft-target-speculative-decoding)
-  - [Step 3. Test the Draft-Target setup](#step-3-test-the-draft-target-setup)
+  - [Step 2. Run draft-target speculative decoding](#step-2-run-draft-target-speculative-decoding)
+  - [Step 3. Test the draft-target setup](#step-3-test-the-draft-target-setup)
 - [Troubleshooting](#troubleshooting)
 - [Cleanup](#cleanup)
 - [Next Steps](#next-steps)
@@ -39,10 +39,12 @@ These examples demonstrate how to accelerate large language model inference whil
 
 - NVIDIA Spark device with sufficient GPU memory available
 - Docker with GPU support enabled
+
   ```bash
   docker run --gpus all nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev nvidia-smi
   ```
 - HuggingFace authentication configured (if needed for model downloads)
+
   ```bash
   huggingface-cli login
   ```
@@ -55,14 +57,10 @@ These examples demonstrate how to accelerate large language model inference whil
 
 **Risks:** GPU memory exhaustion with large models, container registry access issues, network timeouts during downloads
 
-**Rollback:** Stop Docker containers and optionally clean up downloaded model cache
+**Rollback:** Stop Docker containers and optionally clean up downloaded model cache.
 
 ## Instructions
 
-## Traditional Draft-Target Speculative Decoding
-
-This example demonstrates traditional speculative decoding using a smaller draft model to accelerate a larger target model.
-
 ### Step 1. Configure Docker permissions
 
 To easily manage containers without sudo, you must be in the `docker` group. If you choose to skip this step, you will need to run Docker commands with sudo.
@@ -82,7 +80,7 @@ sudo usermod -aG docker $USER
 > **Warning**: After running usermod, you must log out and log back in to start a new
 > session with updated group permissions.
 
-### Step 2. Run Draft-Target Speculative Decoding
+### Step 2. Run draft-target speculative decoding
 
 Execute the following command to set up and run traditional speculative decoding:
 
@@ -118,7 +116,7 @@ EOF
 "
 ```
 
-### Step 3. Test the Draft-Target setup
+### Step 3. Test the draft-target setup
 
 Once the server is running, test it by making an API call from another terminal:
 
@@ -134,7 +132,7 @@ curl -X POST http://localhost:8000/v1/completions \
   }'
 ```
 
-#### Key Features of Draft-Target:
+#### Key features of draft-target:
 - **Efficient resource usage**: 8B draft model accelerates 70B target model
 - **Flexible configuration**: Adjustable draft token length for optimization
 - **Memory efficient**: Uses FP4 quantized models for reduced memory footprint
@@ -169,4 +167,4 @@ docker stop
 - Experiment with different `max_draft_len` values (1, 2, 3, 4, 8)
 - Monitor token acceptance rates and throughput improvements
 - Test with different prompt lengths and generation parameters
-- Read more on Speculative Decoding [here](https://nvidia.github.io/TensorRT-LLM/advanced/speculative-decoding.html)
+- Read more on Speculative Decoding [here](https://nvidia.github.io/TensorRT-LLM/advanced/speculative-decoding.html).
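
Note on the Step 3 context above: the hunks carry only the first and last lines of the README's test request (`curl -X POST http://localhost:8000/v1/completions \` and the closing `}'`); the body of that call lies outside the diff context and is not shown. A minimal sketch of such an OpenAI-compatible completions call follows for reference — the model name, prompt, and token limit are illustrative assumptions, not values taken from the patched README.

```bash
# Illustrative sketch only -- not part of the patch above.
# Assumes the Step 2 server is listening on localhost:8000 and exposes an
# OpenAI-compatible /v1/completions endpoint; replace the placeholder model
# name with whatever model the server actually loaded.
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<served-model-name>",
    "prompt": "Explain speculative decoding in one sentence.",
    "max_tokens": 64
  }'
```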