Mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git, synced 2026-04-24 19:03:54 +00:00.

Commit 41b629f82b (parent 2176f83be0): chore: Regenerate all playbooks
- [Step 1. Configure Docker permissions](#step-1-configure-docker-permissions)
- [Step 2. Run draft-target speculative decoding](#step-2-run-draft-target-speculative-decoding)
- [Step 3. Test the draft-target setup](#step-3-test-the-draft-target-setup)
- [Step 4. Troubleshooting](#step-4-troubleshooting)
- [Step 5. Cleanup](#step-5-cleanup)
- [Step 6. Next Steps](#step-6-next-steps)

---
This way, the big model doesn't need to predict every token step-by-step, reducing…

## What you'll accomplish

You'll explore speculative decoding with TensorRT-LLM on NVIDIA Spark, using the traditional Draft-Target approach.

These examples demonstrate how to accelerate large language model inference while maintaining output quality.

## What to know before starting
```
curl -X POST http://localhost:8000/v1/completions \
  …
}'
```
**Key features of draft-target:**

- **Efficient resource usage**: 8B draft model accelerates 70B target model
- **Flexible configuration**: Adjustable draft token length for optimization
- **Memory efficient**: Uses FP4 quantized models for reduced memory footprint
- **Compatible models**: Uses Llama family models with consistent tokenization
### Step 4. Troubleshooting

Common issues and solutions:
| Issue | Cause | Solution |
|---|---|---|
| Model download fails | Network or authentication issues | Check HuggingFace authentication and network connectivity |
| Server doesn't respond | Port conflicts or firewall | Check if port 8000 is available and not blocked |
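To diagnose the "Server doesn't respond" row, it helps to know whether *anything* is listening on the port. A small sketch, assuming the default host and port from this playbook:

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        # connect_ex returns 0 only when the connection is accepted
        return s.connect_ex((host, port)) == 0

if port_in_use(8000):
    print("Port 8000 is taken: another service may be holding it, or the server is up")
else:
    print("Port 8000 is free: the server is not listening (check the container logs)")
```

If the port is free but the container is running, the server process likely failed during startup, so the container logs are the next place to look.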
### Step 5. Cleanup

Stop the Docker container when finished:
```
docker stop <container_id>
## rm -rf $HOME/.cache/huggingface/hub/models--*gpt-oss*
```
### Step 6. Next Steps

- Experiment with different `max_draft_len` values (1, 2, 3, 4, 8)
- Monitor token acceptance rates and throughput improvements
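As a rough guide for that sweep, the standard speculative-decoding analysis (a simplification, not a measurement from this setup) says that with an independent per-token acceptance rate α and draft length k, each verification step yields (1 − α^(k+1)) / (1 − α) target tokens on average:

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target verification step for draft length k,
    assuming each draft token is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

# Hypothetical acceptance rate of 0.8; measure the real rate from your runs.
for k in (1, 2, 3, 4, 8):
    print(f"max_draft_len={k}: ~{expected_tokens_per_step(0.8, k):.2f} tokens/step")
```

Larger `max_draft_len` raises tokens per step with diminishing returns, while the draft-model cost per step grows linearly, which is why moderate values often win in practice.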