- **What it is:** A new 4-bit floating-point format for NVIDIA Blackwell GPUs.
- **How it works:** Uses two levels of scaling (local per-block + global tensor) to keep accuracy while using fewer bits.
- **Why it matters:**
  - Cuts memory use ~3.5x vs FP16 and ~1.8x vs FP8 (the 4-bit values plus per-block scale metadata average out to roughly 4.5 bits per weight)
  - Keeps accuracy close to FP8 (usually <1% loss)
  - Improves speed and energy efficiency for inference
- **Ecosystem:** Supported in NVIDIA tools (TensorRT, LLM Compressor, vLLM) and Hugging Face models.
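
NVFP4 requires Blackwell-generation hardware. Before going further, you can optionally confirm what the driver reports for your GPU (a quick host-side check; the `compute_cap` query field needs a reasonably recent driver, and on DGX Spark this should list the GB10 GPU):

```bash
# Optional: report the GPU name and compute capability seen by the driver.
nvidia-smi --query-gpu=name,compute_cap --format=csv
```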

Verify your setup:

```bash
## Check Docker GPU access
docker run --rm --gpus all nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev nvidia-smi

## Verify sufficient disk space
df -h .

## Check Hugging Face authentication
huggingface-cli whoami
```

## Time & risk

**Estimated duration**: 45-90 minutes depending on network speed and model size

## Instructions

## Step 1. Configure Docker permissions

To manage containers without sudo, your user must be in the `docker` group. If you skip this step, run the Docker commands below with sudo.

Open a new terminal and test Docker access:

```bash
docker ps
```

If you see a permission denied error (something like `permission denied while trying to connect to the Docker daemon socket`), add your user to the docker group:

```bash
sudo usermod -aG docker $USER
```

> **Warning**: After running usermod, you must log out and log back in to start a new session with updated group permissions.
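
If you would rather not end your current session right away, `newgrp` can pick up the new group membership in the current terminal (a convenience workaround; logging out and back in, as above, remains the cleanest fix):

```bash
# Start a subshell whose active group is docker, then re-test access.
newgrp docker
docker ps
```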

## Step 2. Prepare the environment

Create a local output directory where the quantized model files will be stored. This directory will be mounted into the container to persist results after the container exits.

```bash
mkdir -p ./output_models
chmod 755 ./output_models
```
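
Optionally confirm the directory is in place and that the filesystem behind it has free space; both the original model download and the quantized export consume space here and in the Hugging Face cache (a suggested sanity check):

```bash
# Confirm the output directory exists and check free space on its filesystem.
ls -ld ./output_models
df -h ./output_models
```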

## Step 3. Authenticate with Hugging Face

Make sure your Hugging Face account has access to the DeepSeek model, then set your authentication token so the container can download it.

```bash
## Export your Hugging Face token as an environment variable
## Get your token from: https://huggingface.co/settings/tokens
export HF_TOKEN="your_token_here"
```

The token is passed into the container in the next step (via `-e HF_TOKEN`) and used automatically for model downloads.
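
Before launching the long-running container, it can save time to check that the token actually resolves to your account (optional; this assumes the `huggingface-cli` from the setup check above is installed on the host, and that it reads `HF_TOKEN` from the environment, as recent versions do):

```bash
# Optional: verify the exported token maps to your Hugging Face account.
huggingface-cli whoami
```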

## Step 4. Run the TensorRT Model Optimizer container

Launch the TensorRT-LLM container with GPU access, IPC settings optimized for multi-GPU workloads, and volume mounts for model caching and output persistence.

```bash
docker run --rm -it --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -v "$(pwd)/output_models:/workspace/output_models" \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -e HF_TOKEN=$HF_TOKEN \
  nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
  bash -c "
    git clone -b 0.35.0 --single-branch https://github.com/NVIDIA/TensorRT-Model-Optimizer.git /app/TensorRT-Model-Optimizer && \
    cd /app/TensorRT-Model-Optimizer && pip install -e '.[dev]' && \
    export ROOT_SAVE_PATH='/workspace/output_models' && \
    /app/TensorRT-Model-Optimizer/examples/llm_ptq/scripts/huggingface_example.sh \
      --model 'deepseek-ai/DeepSeek-R1-Distill-Llama-8B' \
      --quant nvfp4 \
      --tp 1 \
      --export_fmt hf
  "
```

> **Warning**: If the model is too large for available memory, quantization may fail with an out-of-memory error. Try quantizing a smaller model instead.

This command:
- Runs the container with full GPU access and optimized shared memory settings
- Mounts your output directory to persist quantized model files
- Clones and installs the TensorRT Model Optimizer from source
- Executes the quantization script with NVFP4 quantization parameters
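
While the quantization job runs, you can optionally watch GPU memory and utilization from a second terminal on the host (a convenience suggestion; `watch` ships with most Linux distributions):

```bash
# Refresh nvidia-smi every 5 seconds while the job runs in the other terminal.
watch -n 5 nvidia-smi
```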

## Step 5. Monitor the quantization process

The quantization process will display progress information including:
- Model download progress from Hugging Face
- Quantization calibration steps
- Model export and validation phases
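
If you want a rough sense of progress during the export phase, you can also watch the output directory grow from another terminal (optional):

```bash
# Re-check the size of the output directory every 30 seconds.
watch -n 30 du -sh ./output_models
```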

## Step 6. Validate the quantized model

After the container completes, verify that the quantized model files were created successfully.

```bash
## List generated model files
find ./output_models/ -name "*.bin" -o -name "*.safetensors" -o -name "config.json"
```

You should see model weight files, configuration files, and tokenizer files in the output directory.
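
To additionally confirm that the export carries NVFP4 quantization metadata, you can search the exported JSON configs (optional; the subdirectory name below is taken from the test step that follows and may differ for other models):

```bash
# List the export directory and find JSON files that mention quantization.
ls ./output_models/saved_models_DeepSeek-R1-Distill-Llama-8B_nvfp4_hf/
grep -il "quant" ./output_models/saved_models_DeepSeek-R1-Distill-Llama-8B_nvfp4_hf/*.json
```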

## Step 7. Test model loading

Verify the quantized model loads and generates text by running the TensorRT-LLM quickstart example inside the container.

```bash
export MODEL_PATH="$(pwd)/output_models/saved_models_DeepSeek-R1-Distill-Llama-8B_nvfp4_hf/"

docker run \
  -e HF_TOKEN=$HF_TOKEN \
  -v "$HOME/.cache/huggingface/:/root/.cache/huggingface/" \
  -v "$MODEL_PATH:/workspace/model" \
  --rm -it --ulimit memlock=-1 --ulimit stack=67108864 \
  --gpus=all --ipc=host --network host \
  nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
  bash -c '
    python examples/llm-api/quickstart_advanced.py \
      --model_dir /workspace/model/ \
      --prompt "Paris is great because" \
      --max_tokens 64
  '
```

## Step 8. Troubleshooting

| Symptom | Cause | Fix |
|---------|--------|-----|

```bash
## Remove the local output directory and the cached Hugging Face downloads (optional)
rm -rf ./output_models
rm -rf ~/.cache/huggingface

## Remove Docker image (optional)
docker rmi nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev
```
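
If you want to confirm the image is gone and the space has been reclaimed (optional):

```bash
# The grep should print nothing once the image has been removed.
docker images | grep tensorrt-llm
df -h .
```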

## Step 9. Next steps