diff --git a/nvidia/nvfp4-quantization/README.md b/nvidia/nvfp4-quantization/README.md
index 0d6c996..abd23ae 100644
--- a/nvidia/nvfp4-quantization/README.md
+++ b/nvidia/nvfp4-quantization/README.md
@@ -1,6 +1,6 @@
 # Quantize to NVFP4
 
-> Quantize a model to NVFP4 to run on Spark
+> Quantize a model to NVFP4 to run on Spark using TensorRT Model Optimizer
 
 ## Table of Contents
 
@@ -29,6 +29,8 @@
 You'll quantize the DeepSeek-R1-Distill-Llama-8B model using NVIDIA's TensorRT Model Optimizer inside a TensorRT-LLM container, producing an NVFP4 quantized model for deployment on NVIDIA DGX Spark.
 
+NVFP4 quantization reduces model size by approximately 2x by lowering the precision of the model's weights.
+It aims to preserve accuracy while delivering significant throughput improvements, but quantization can still degrade accuracy, so run evaluations to verify that the quantized model performs acceptably for your use case.
+
 ## What to know before starting
 
@@ -162,12 +164,16 @@
 You should see model weight files, configuration files, and tokenizer files in the output directory.
 
 ## Step 7. Test model loading
 
-Verify the quantized model can be loaded properly using a simple Python test.
+First, set the path to your quantized model:
 
 ```bash
-
+# Set the path to the quantized model directory
 export MODEL_PATH="./output_models/saved_models_DeepSeek-R1-Distill-Llama-8B_nvfp4_hf/"
+```
+
+Now verify that the quantized model can be loaded properly using a simple test:
+
+```bash
 docker run \
   -e HF_TOKEN=$HF_TOKEN \
   -v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \
@@ -183,7 +189,41 @@ docker run \
   '
 ```
 
-## Step 8. Troubleshooting
+## Step 8. Serve the model with OpenAI-compatible API
+
+Start the TensorRT-LLM OpenAI-compatible API server with the quantized model.
+First, set the path to your quantized model:
+
+```bash
+# Set the path to the quantized model directory (must be absolute for the bind mount)
+export MODEL_PATH="$(pwd)/output_models/saved_models_DeepSeek-R1-Distill-Llama-8B_nvfp4_hf/"
+
+docker run \
+  -e HF_TOKEN=$HF_TOKEN \
+  -v "$MODEL_PATH:/workspace/model" \
+  --rm -it --ulimit memlock=-1 --ulimit stack=67108864 \
+  --gpus=all --ipc=host --network host \
+  nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
+  trtllm-serve /workspace/model \
+  --backend pytorch \
+  --max_batch_size 4 \
+  --port 8000
+```
+
+From a separate terminal, test the server with a curl request:
+
+```bash
+curl -X POST http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
+    "messages": [{"role": "user", "content": "What is artificial intelligence?"}],
+    "max_tokens": 100,
+    "temperature": 0.7,
+    "stream": false
+  }'
+```
+
+## Step 9. Troubleshooting
 
 | Symptom | Cause | Fix |
 |---------|--------|-----|
@@ -193,7 +233,7 @@ docker run \
 | Git clone fails inside container | Network connectivity issues | Check internet connection and retry |
 | Quantization process hangs | Container resource limits | Increase Docker memory limits or use `--ulimit` flags |
 
-## Step 9. Cleanup and rollback
+## Step 10. Cleanup and rollback
 
 To clean up the environment and remove generated files:
 
@@ -210,7 +250,7 @@
 rm -rf ~/.cache/huggingface
 docker rmi nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev
 ```
 
-## Step 10. Next steps
+## Step 11. Next steps
 
 The quantized model is now ready for deployment. Common next steps include:
 
 - Benchmarking inference performance compared to the original model.
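+
+  For example, here is a minimal single-request timing check against the server from Step 8. This is an illustrative sketch, not an official benchmark; it assumes the server is still running on port 8000, that it reports token usage, and that the `openai` Python package is installed on the host:
+
+  ```python
+  # Rough single-request latency/throughput check against the local trtllm-serve endpoint.
+  import time
+
+  from openai import OpenAI
+
+  # Placeholder key; adjust if your server enforces authentication.
+  client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
+
+  start = time.time()
+  resp = client.chat.completions.create(
+      model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
+      messages=[{"role": "user", "content": "What is artificial intelligence?"}],
+      max_tokens=100,
+  )
+  elapsed = time.time() - start
+
+  generated = resp.usage.completion_tokens  # assumes the server returns usage statistics
+  print(f"{generated} tokens in {elapsed:.2f} s ({generated / elapsed:.1f} tokens/s)")
+  ```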