Mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git, synced 2026-04-25 19:33:53 +00:00.

Commit 4a14a6d298 (parent 4eff49eab8): chore: Regenerate all playbooks
# Quantize to NVFP4

> Quantize a model to NVFP4 to run on Spark using TensorRT Model Optimizer

## Table of Contents
You'll quantize the DeepSeek-R1-Distill-Llama-8B model using NVIDIA's TensorRT Model Optimizer
inside a TensorRT-LLM container, producing an NVFP4 quantized model for deployment on NVIDIA DGX Spark.

The examples use NVIDIA FP4 (NVFP4) quantized models, which reduce model size by approximately 2x by lowering the precision of model layers.

This quantization approach aims to preserve accuracy while providing significant throughput improvements. However, quantization can still degrade model accuracy, so we recommend running evaluations to verify that the quantized model performs acceptably for your use case.
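As a rough back-of-envelope check (not part of the playbook), weight storage scales with the number of bits stored per parameter; the ~4.5 effective bits assumed below for NVFP4 (4-bit values plus per-block scale factors) is an illustrative assumption, not a measured figure:

```python
def weight_size_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight-storage size in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

# 8B-parameter model: BF16 baseline vs. an assumed ~4.5 effective
# bits per weight for NVFP4 (4-bit values plus block scale factors).
bf16 = weight_size_gb(8e9, 16)    # 16.0 GB
nvfp4 = weight_size_gb(8e9, 4.5)  # 4.5 GB
print(f"BF16: {bf16:.1f} GB, NVFP4: {nvfp4:.1f} GB")
```

Actual on-disk size also includes embeddings and any layers left in higher precision, so measure the output directory for the real number.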
## What to know before starting
You should see model weight files, configuration files, and tokenizer files in the output directory.

## Step 7. Test model loading

Verify that the quantized model can be loaded properly using a simple Python test.

First, set the path to your quantized model:

```bash
# Set path to quantized model directory
export MODEL_PATH="./output_models/saved_models_DeepSeek-R1-Distill-Llama-8B_nvfp4_hf/"
```

Now run the load test inside the container:

```bash
docker run \
  -e HF_TOKEN=$HF_TOKEN \
  -v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \
  # … (remaining arguments and the inline Python test are elided in this hunk)
'
```
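The inline Python test itself is elided in this hunk. As a lightweight stand-in, the sketch below only checks that the quantized output directory contains the files a Hugging Face-style loader expects; the file names (`config.json`, `*.safetensors`, `tokenizer.json`/`tokenizer.model`) are assumptions based on the standard HF checkpoint layout, not taken from the playbook:

```python
import glob
import os

def check_checkpoint_dir(path: str) -> list[str]:
    """Return a list of problems found in a Hugging Face-style checkpoint dir."""
    problems = []
    if not os.path.isdir(path):
        return [f"not a directory: {path}"]
    if not os.path.isfile(os.path.join(path, "config.json")):
        problems.append("missing config.json")
    if not glob.glob(os.path.join(path, "*.safetensors")):
        problems.append("no *.safetensors weight shards")
    if not any(os.path.isfile(os.path.join(path, f))
               for f in ("tokenizer.json", "tokenizer.model")):
        problems.append("missing tokenizer files")
    return problems

# Check the directory exported as MODEL_PATH (falls back to the CWD).
problems = check_checkpoint_dir(os.environ.get("MODEL_PATH", "."))
print("OK" if not problems else "\n".join(problems))
```

This is a pre-check only; it confirms the expected files exist but does not exercise the actual model-loading path the container test performs.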
## Step 8. Serve the model with an OpenAI-compatible API

Start the TensorRT-LLM OpenAI-compatible API server with the quantized model.

First, set the path to your quantized model:

```bash
# Set path to quantized model directory
export MODEL_PATH="./output_models/saved_models_DeepSeek-R1-Distill-Llama-8B_nvfp4_hf/"

docker run \
  -e HF_TOKEN=$HF_TOKEN \
  -v "$MODEL_PATH:/workspace/model" \
  --rm -it --ulimit memlock=-1 --ulimit stack=67108864 \
  --gpus=all --ipc=host --network host \
  nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
  trtllm-serve /workspace/model \
  --backend pytorch \
  --max_batch_size 4 \
  --port 8000
```

Run the following to test the server with a client curl request:

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "messages": [{"role": "user", "content": "What is artificial intelligence?"}],
    "max_tokens": 100,
    "temperature": 0.7,
    "stream": false
  }'
```
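The same request can be issued from Python with only the standard library. This is a sketch, not part of the playbook: it assumes the server is listening on localhost:8000 and that the served model name matches the one used in the curl example (uncomment the final lines only once the server is up):

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str, max_tokens: int = 100,
                       temperature: float = 0.7) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": False,
    }
    return urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
                         "What is artificial intelligence?")
# With the server running:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```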
## Step 9. Troubleshooting

| Symptom | Cause | Fix |
|---------|--------|-----|
| Git clone fails inside container | Network connectivity issues | Check internet connection and retry |
| Quantization process hangs | Container resource limits | Increase Docker memory limits or use `--ulimit` flags |

## Step 10. Cleanup and rollback

To clean up the environment and remove generated files:
```bash
rm -rf ~/.cache/huggingface
docker rmi nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev
```

## Step 11. Next steps
The quantized model is now ready for deployment. Common next steps include:

- Benchmarking inference performance compared to the original model.