# Quantize to NVFP4
> Quantize a model to NVFP4 to run on Spark
## Table of Contents
- [Overview](#overview)
- [NVFP4 on Blackwell](#nvfp4-on-blackwell)
- [Instructions](#instructions)
---
## Overview
## Basic idea
### NVFP4 on Blackwell
- **What it is:** A new 4-bit floating-point format for NVIDIA Blackwell GPUs
- **How it works:** Uses two levels of scaling (local per-block + global tensor) to keep accuracy while using fewer bits
- **Why it matters:**
  - Cuts memory use ~3.5x vs FP16 and ~1.8x vs FP8
  - Keeps accuracy close to FP8 (usually <1% loss)
  - Improves speed and energy efficiency for inference
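
The memory ratios above follow from simple storage arithmetic. The sketch below assumes the layout NVIDIA describes for NVFP4 (4-bit values in blocks of 16, each block sharing one 8-bit FP8 scale) and ignores the single per-tensor scale. The block size and scale width are not stated in this playbook, and the function names are illustrative.

```bash
# Effective bits per weight for a block-scaled format.
# Args: value bits, scale bits, values per block.
bits_per_weight() {
  awk -v v="$1" -v s="$2" -v b="$3" 'BEGIN { printf "%.2f\n", v + s / b }'
}

# Compression ratio of a baseline format against the block-scaled one.
# Args: baseline bits per weight, effective bits per weight.
compression_ratio() {
  awk -v base="$1" -v eff="$2" 'BEGIN { printf "%.2f\n", base / eff }'
}

nvfp4_bits=$(bits_per_weight 4 8 16)   # 4.50 under the assumed layout
compression_ratio 16 "$nvfp4_bits"     # vs FP16: prints 3.56
compression_ratio 8 "$nvfp4_bits"      # vs FP8:  prints 1.78
```

Under these assumptions the shared scale adds half a bit per weight, which is why the savings land slightly below a clean 4x and 2x.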
## What you'll accomplish
You'll quantize the DeepSeek-R1-Distill-Llama-8B model using NVIDIA's TensorRT Model Optimizer
inside a TensorRT-LLM container, producing an NVFP4 quantized model for deployment on NVIDIA DGX Spark.
## What to know before starting
- Working with Docker containers and GPU-accelerated workloads
- Understanding of model quantization concepts and their impact on inference performance
- Experience with NVIDIA TensorRT and CUDA toolkit environments
- Familiarity with Hugging Face model repositories and authentication
## Prerequisites
- NVIDIA DGX Spark device with a Blackwell architecture GPU
- Docker installed with GPU support
- NVIDIA Container Toolkit configured
- Available storage for model files and outputs
- Hugging Face account with access to the target model
Verify your setup:
```bash
## Check Docker GPU access
docker run --rm --gpus all nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev nvidia-smi
## Verify sufficient disk space
df -h .
```
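
To turn the `df` check into a pass/fail test, a small helper can compare free space against a threshold. This is a minimal sketch: the function name and the 40 GB figure are placeholders for illustration, not documented requirements, so adjust the threshold to your model size.

```bash
# Succeeds if the filesystem holding DIR has at least MIN_GB gigabytes free.
# Args: directory, minimum free space in GB (1 GB = 1024 * 1024 KB here).
has_free_gb() {
  local dir="$1" min_gb="$2" avail_kb
  # -P gives one POSIX-format line per filesystem, -k reports 1024-byte blocks
  avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}

# 40 is a placeholder threshold, not a documented requirement.
if has_free_gb . 40; then
  echo "Enough free space"
else
  echo "Less than 40 GB free in $(pwd)"
fi
```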
## Time & risk
**Estimated duration**: 45-90 minutes depending on network speed and model size
**Risks**:
- Model download may fail due to network issues or Hugging Face authentication problems
- Quantization process is memory-intensive and may fail on systems with insufficient GPU memory
- Output files are large (several GB) and require adequate storage space
**Rollback**: Remove the output directory and any pulled Docker images to restore original state.
## Instructions
## Step 1. Configure Docker permissions
To manage containers without `sudo`, your user must be in the `docker` group. If you skip this step, prefix every Docker command with `sudo`.
Open a new terminal and test Docker access:
```bash
docker ps
```
If you see a permission denied error (something like `permission denied while trying to connect to the Docker daemon socket`), add your user to the docker group:
```bash
sudo usermod -aG docker $USER
```
> **Warning**: After running usermod, you must log out and log back in to start a new
> session with updated group permissions.
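
To confirm the group change before retrying `docker ps`, you can check membership directly. This is a minimal sketch with an illustrative function name. When `id` is given a user name it reads the group database, so it may report the new group even if your current session has not picked it up yet.

```bash
# Succeeds if USER_NAME is a member of GROUP_NAME according to `id`.
in_group() {
  local user_name="$1" group_name="$2"
  # One group per line, then match the whole line exactly
  id -nG "$user_name" | tr ' ' '\n' | grep -Fxq -- "$group_name"
}

current_user=$(id -un)
if in_group "$current_user" docker; then
  echo "$current_user is in the docker group"
else
  echo "$current_user is not in the docker group yet"
fi
```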
## Step 2. Prepare the environment
Create a local output directory where the quantized model files will be stored. This directory will be mounted into the container to persist results after the container exits.
```bash
mkdir -p ./output_models
chmod 755 ./output_models
```
## Step 3. Authenticate with Hugging Face
Ensure you have access to the DeepSeek model by setting your Hugging Face authentication token.
```bash
## Export your Hugging Face token as an environment variable
## Get your token from: https://huggingface.co/settings/tokens
export HF_TOKEN="your_token_here"
```
The `docker run` commands in the following steps pass this token into the container with `-e HF_TOKEN`, where it is used for model downloads.
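
Because an empty or placeholder token only fails later, during the model download, it can help to check the variable before launching the container. This is a minimal sketch with an illustrative function name. It only tests that the variable is set and is not the placeholder text; it does not verify the token with Hugging Face.

```bash
# Succeeds only if HF_TOKEN is set, non-empty, and not the placeholder text.
check_hf_token() {
  if [ -z "${HF_TOKEN:-}" ] || [ "$HF_TOKEN" = "your_token_here" ]; then
    echo "HF_TOKEN is missing or still the placeholder" >&2
    return 1
  fi
}

check_hf_token || echo "Set HF_TOKEN before starting the container"
```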
## Step 4. Run the TensorRT Model Optimizer container
Launch the TensorRT-LLM container with GPU access, host IPC and raised `ulimit` values for shared memory, and volume mounts for model caching and output persistence.
```bash
## Use an absolute host path for the output bind mount
docker run --rm -it --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -v "$(pwd)/output_models:/workspace/output_models" \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -e HF_TOKEN="$HF_TOKEN" \
  nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
  bash -c "
    git clone -b 0.35.0 --single-branch https://github.com/NVIDIA/TensorRT-Model-Optimizer.git /app/TensorRT-Model-Optimizer && \
    cd /app/TensorRT-Model-Optimizer && pip install -e '.[dev]' && \
    export ROOT_SAVE_PATH='/workspace/output_models' && \
    /app/TensorRT-Model-Optimizer/examples/llm_ptq/scripts/huggingface_example.sh \
    --model 'deepseek-ai/DeepSeek-R1-Distill-Llama-8B' \
    --quant nvfp4 \
    --tp 1 \
    --export_fmt hf
  "
```
Note: You may see the error `pynvml.NVMLError_NotSupported: Not Supported`. This is expected in some environments, does not affect results, and will be fixed in an upcoming release.

Note: If the model is too large for the available memory, quantization fails with an out-of-memory error. In that case, try quantizing a smaller model.

This command:
- Runs the container with full GPU access and optimized shared memory settings
- Mounts your output directory to persist quantized model files
- Mounts your Hugging Face cache to avoid re-downloading the model
- Clones and installs the TensorRT Model Optimizer from source
- Executes the quantization script with NVFP4 quantization parameters
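
To gauge the out-of-memory risk mentioned in the note above, a rough lower bound is the size of the model weights alone. The sketch below assumes 2 bytes per parameter for the unquantized 16-bit weights and ignores activations and calibration overhead, so actual usage will be higher. The function name is illustrative.

```bash
# Approximate weight size in GiB.
# Args: parameter count in billions, bytes per parameter.
weights_gib() {
  awk -v p="$1" -v bytes="$2" 'BEGIN { printf "%.1f\n", p * 1e9 * bytes / (1024 ^ 3) }'
}

weights_gib 8 2     # 8B parameters at 16 bits: prints 14.9
weights_gib 8 0.5   # same model at 4 bits per weight, scales ignored: prints 3.7
```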
## Step 5. Monitor the quantization process
The quantization process will display progress information including:
- Model download progress from Hugging Face
- Quantization calibration steps
- Model export and validation phases
## Step 6. Validate the quantized model
After the container completes, verify that the quantized model files were created successfully.
```bash
## Check output directory contents
ls -la ./output_models/
## Verify model files are present
find ./output_models/ -name "*.bin" -o -name "*.safetensors" -o -name "config.json"
```
You should see model weight files, configuration files, and tokenizer files in the output directory.
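
The same check can be scripted so that it fails loudly when the export is incomplete. This is a minimal sketch with an illustrative function name. It looks only for the file names used in the `find` command above and does not inspect file contents.

```bash
# Succeeds if DIR contains a config.json and at least one weight file.
validate_export() {
  local dir="$1" configs weights
  configs=$(find "$dir" -name "config.json" 2>/dev/null | wc -l)
  weights=$(find "$dir" \( -name "*.safetensors" -o -name "*.bin" \) 2>/dev/null | wc -l)
  [ "$configs" -ge 1 ] && [ "$weights" -ge 1 ]
}

if validate_export ./output_models; then
  echo "Export looks complete"
else
  echo "Export is missing config.json or weight files"
fi
```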
## Step 7. Test model loading
Verify the quantized model can be loaded properly using a simple Python test.
```bash
## Absolute host path to the exported model
export MODEL_PATH="$(pwd)/output_models/saved_models_DeepSeek-R1-Distill-Llama-8B_nvfp4_hf/"
docker run \
  -e HF_TOKEN="$HF_TOKEN" \
  -v "$HOME/.cache/huggingface/:/root/.cache/huggingface/" \
  -v "$MODEL_PATH:/workspace/model" \
  --rm -it --ulimit memlock=-1 --ulimit stack=67108864 \
  --gpus=all --ipc=host --network host \
  nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
  bash -c '
    python examples/llm-api/quickstart_advanced.py \
    --model_dir /workspace/model/ \
    --prompt "Paris is great because" \
    --max_tokens 64
  '
```
## Step 8. Troubleshooting
| Symptom | Cause | Fix |
|---------|--------|-----|
| "Permission denied" when accessing Hugging Face | Missing or invalid HF token | Export a valid `HF_TOKEN` before `docker run`, or run `huggingface-cli login` with a valid token |
| Container exits with CUDA out of memory | Insufficient GPU memory | Quantize a smaller model, reduce batch size, or use a machine with more GPU memory |
| Model files not found in output directory | Volume mount failed or wrong path | Verify `$(pwd)/output_models` resolves correctly |
| Git clone fails inside container | Network connectivity issues | Check internet connection and retry |
| Quantization process hangs | Container resource limits | Increase Docker memory limits or use `--ulimit` flags |
## Step 9. Cleanup and rollback
To clean up the environment and remove generated files:
> **Warning:** This will permanently delete all quantized model files and cached data.
```bash
## Remove output directory and all quantized models
rm -rf ./output_models
## Remove Hugging Face cache (optional)
rm -rf ~/.cache/huggingface
## Remove Docker image (optional)
docker rmi nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev
```
## Step 10. Next steps
The quantized model is now ready for deployment. Common next steps include:
- Benchmarking inference performance compared to the original model.
- Integrating the quantized model into your inference pipeline.
- Deploying to NVIDIA Triton Inference Server for production serving.
- Running additional validation tests on your specific use cases.