diff --git a/nvidia/nvfp4-quantization/README.md b/nvidia/nvfp4-quantization/README.md
index 6ad35b4..0d6c996 100644
--- a/nvidia/nvfp4-quantization/README.md
+++ b/nvidia/nvfp4-quantization/README.md
@@ -12,12 +12,12 @@
 
 ## Overview
 
-## Basic Idea
+## Basic idea
 
 ### NVFP4 on Blackwell
 
-- **What it is:** A new 4-bit floating-point format for NVIDIA Blackwell GPUs.
-- **How it works:** Uses two levels of scaling (local per-block + global tensor) to keep accuracy while using fewer bits.
+- **What it is:** A new 4-bit floating-point format for NVIDIA Blackwell GPUs
+- **How it works:** Uses two levels of scaling (local per-block + global tensor) to keep accuracy while using fewer bits
 - **Why it matters:**
   - Cuts memory use ~3.5x vs FP16 and ~1.8x vs FP8
   - Keeps accuracy close to FP8 (usually <1% loss)
@@ -130,7 +130,7 @@ docker run --rm -it --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=671
 ```
 
 Note: You may encounter this `pynvml.NVMLError_NotSupported: Not Supported`. This is expected in some environments, does not affect results, and will be fixed in an upcoming release.
-Note: If your model is too large, you may encounter an out of memory error. You can try quantizing a smaller model instead.
+Note: Please be aware that if your model is too large, you may encounter an out of memory error. You can try quantizing a smaller model instead.
 
 This command:
 - Runs the container with full GPU access and optimized shared memory settings
@@ -191,9 +191,9 @@ docker run \
 | Container exits with CUDA out of memory | Insufficient GPU memory | Reduce batch size or use a machine with more GPU memory |
 | Model files not found in output directory | Volume mount failed or wrong path | Verify `$(pwd)/output_models` resolves correctly |
 | Git clone fails inside container | Network connectivity issues | Check internet connection and retry |
-| Quantization process hangs | Container resource limits | Increase Docker memory limits or use --ulimit flags |
+| Quantization process hangs | Container resource limits | Increase Docker memory limits or use `--ulimit` flags |
 
-## Step 8. Cleanup and rollback
+## Step 9. Cleanup and rollback
 
 To clean up the environment and remove generated files:
 
@@ -210,10 +210,10 @@ rm -rf ~/.cache/huggingface
 docker rmi nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev
 ```
 
-## Step 9. Next steps
+## Step 10. Next steps
 
 The quantized model is now ready for deployment. Common next steps include:
-- Benchmarking inference performance compared to the original model
-- Integrating the quantized model into your inference pipeline
-- Deploying to NVIDIA Triton Inference Server for production serving
-- Running additional validation tests on your specific use cases
+- Benchmarking inference performance compared to the original model.
+- Integrating the quantized model into your inference pipeline.
+- Deploying to NVIDIA Triton Inference Server for production serving.
+- Running additional validation tests on your specific use cases.
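
The "How it works" bullet in the patched README describes NVFP4's two levels of scaling in a single line. As a rough illustration of that idea, here is a minimal NumPy sketch of a two-level (per-block plus per-tensor) scaling round trip. The block size of 16, the E2M1 value grid, and the E4M3 range assumed for the block scales are details not stated in this diff; the code keeps everything in float32 rather than packing real 4-bit codes or FP8 scales, and it is not the TensorRT Model Optimizer implementation.

```python
import numpy as np

# Magnitudes representable by an E2M1 (FP4) value; the largest finite value is 6.0.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
FP4_MAX = 6.0
E4M3_MAX = 448.0  # largest finite FP8 (E4M3) value, assumed range for block scales


def fake_quant_two_level(x, block_size=16):
    """Round-trip a 1-D tensor through a two-level (block + tensor) scaling scheme.

    Assumes len(x) is a multiple of block_size; purely an emulation in float32.
    """
    x = np.asarray(x, dtype=np.float32).reshape(-1, block_size)

    # Global (per-tensor) scale: one FP32 value, sized so the per-block scales
    # below fall inside the FP8 E4M3 range.
    global_scale = max(np.abs(x).max() / (FP4_MAX * E4M3_MAX), np.finfo(np.float32).tiny)

    # Local (per-block) scales: one value per block, expressed relative to the
    # global scale (these are the values that would be stored in FP8).
    block_amax = np.abs(x).max(axis=1, keepdims=True)
    block_scale = np.maximum(block_amax / FP4_MAX / global_scale, 1.0)

    # Quantize: divide by both scales, then snap to the nearest FP4 magnitude.
    scaled = x / (block_scale * global_scale)
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_GRID[idx]

    # Dequantize: apply the two scales in reverse order.
    return (q * block_scale * global_scale).reshape(-1)


if __name__ == "__main__":
    x = np.random.randn(4096).astype(np.float32)
    x_hat = fake_quant_two_level(x)
    print("mean abs error:", np.abs(x - x_hat).mean())
```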