dgx-spark-playbooks/nvidia/station-nvfp4-quantization/endpoint-test.yaml

kind: Playbook
metadata:
  name: station-nvfp4-quantization
  displayName: NVFP4 Quantization
  shortDescription: Quantize a model to NVFP4 to run on DGX Station using TensorRT Model Optimizer
  publisher: nvidia
  description: |
    # REPLACE THIS WITH YOUR MODEL CARD
    https://gitlab-master.nvidia.com/api-catalog/examples/-/blob/main/modelcard-example-mixtral8x7b.md?ref_type=heads

  labelsV2:
  - gpuType:playbook:gpu_type_station
  - DGX
  - Station

  attributes:
  - key: DURATION
    value: 1 HR

spec:
  artifactName: station-nvfp4-quantization
  nvcfFunctionId: None
  attributes:

    showUnavailableBanner: false
    apiDocsUrl: None
    termsOfUse: |

    cta:
      text: View on GitHub
      url: https://nvidia.github.io/TensorRT-Model-Optimizer/


    tabs:
    -
      id: overview

      label: Overview
      content: |
        # Basic idea

        NVFP4 is a 4-bit floating-point format introduced with NVIDIA Blackwell GPUs to maintain model accuracy while reducing memory bandwidth and storage requirements for inference workloads.
        Unlike uniform INT4 quantization, NVFP4 retains floating-point semantics with a shared exponent and a compact mantissa, allowing higher dynamic range and more stable convergence.
        NVIDIA Blackwell Tensor Cores natively support mixed-precision execution across FP16, FP8, and FP4, enabling models to use FP4 for weights and activations while accumulating in higher precision (typically FP16).
        This design minimizes quantization error during matrix multiplications and supports efficient conversion pipelines in TensorRT-LLM for fine-tuned layer-wise quantization.

        Immediate benefits are:

        - Cut memory use ~3.5x vs FP16 and ~1.8x vs FP8
        - Maintain accuracy close to FP8 (usually <1% loss)
        - Improve speed and energy efficiency for inference

        # What you'll accomplish

        You'll quantize the DeepSeek-R1-Distill-Llama-8B model using NVIDIA's TensorRT Model Optimizer
        inside a TensorRT-LLM container, producing an NVFP4 quantized model for deployment on NVIDIA DGX Station.

        The examples use NVIDIA FP4 quantized models which help reduce model size by approximately 2x by reducing the precision of model layers.
        This quantization approach aims to preserve accuracy while providing significant throughput improvements. However, it's important to note that quantization can potentially impact model accuracy - we recommend running evaluations to verify if the quantized model maintains acceptable performance for your use case.

        # What to know before starting

        - Working with Docker containers and GPU-accelerated workloads
        - Understanding of model quantization concepts and their impact on inference performance
        - Experience with NVIDIA TensorRT and CUDA toolkit environments
        - Familiarity with Hugging Face model repositories and authentication

        # Prerequisites

        - NVIDIA DGX Station with Blackwell architecture GPU (GB300)
        - Docker installed with GPU support
        - NVIDIA Container Toolkit configured
        - Available storage for model files and outputs
        - Hugging Face account with access to the target model

        Verify your setup:
        ```bash
        # Check available GPUs
        nvidia-smi

        # Verify sufficient disk space
        df -h .
        ```

        # Time & risk

        * **Estimated duration**: 45-90 minutes depending on network speed and model size
        * **Risks**:
          * Model download may fail due to network issues or Hugging Face authentication problems
          * Quantization process is memory-intensive and may fail on systems with insufficient GPU memory
          * Output files are large (several GB) and require adequate storage space
        * **Rollback**: Remove the output directory and any pulled Docker images to restore original state.


    -
      id: instructions

      label: Instructions
      content: |
        # Step 1. Configure Docker permissions

        To easily manage containers without sudo, you must be in the `docker` group. If you choose to skip this step, you will need to run Docker commands with sudo.

        Open a new terminal and test Docker access. In the terminal, run:

        ```bash
        docker ps
        ```

        If you see a permission denied error (something like permission denied while trying to connect to the Docker daemon socket), add your user to the docker group so that you don't need to run the command with sudo.

        ```bash
        sudo usermod -aG docker $USER
        newgrp docker
        ```

        # Step 2. Prepare the environment

        Create a local output directory where the quantized model files will be stored. This directory will be mounted into the container to persist results after the container exits.

        ```bash
        mkdir -p ./output_models
        chmod 755 ./output_models
        ```

        # Step 3. Authenticate with Hugging Face

        Ensure you have access to the DeepSeek model by setting your Hugging Face authentication token.

        ```bash
        # Export your Hugging Face token as an environment variable
        # Get your token from: https://huggingface.co/settings/tokens
        export HF_TOKEN="your_token_here"
        ```

        The token will be automatically used by the container for model downloads.

        # Step 4. Identify your GB300 GPU

        If your system has multiple GPUs, you need to identify the device ID of your GB300 GPU. Run `nvidia-smi` to list all available GPUs:

        ```bash
        nvidia-smi
        ```

        Example output on a system with multiple GPUs:

        ```text
        +-----------------------------------------------------------------------------------------+
        | NVIDIA-SMI 590.35                 Driver Version: 590.35         CUDA Version: 13.1     |
        +-----------------------------------------+------------------------+----------------------+
        | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
        |=========================================+========================+======================|
        |   0  NVIDIA RTX 6000                On  |   00000004:01:00.0 Off |                  N/A |
        +-----------------------------------------+------------------------+----------------------+
        |   1  NVIDIA GB300                   On  |   00000009:06:00.0 Off |                    0 |
        +-----------------------------------------+------------------------+----------------------+
        ```

        In this example, the GB300 is device **1**. Note this number for use in Docker commands.

        > [!NOTE]
        > The examples below assume the GB300 is device 1. If your GPU has a different ID, adjust the `--gpus '"device=X"'` parameter in the Docker commands accordingly.

        # Step 5. Run the quantization process using TensorRT Model Optimizer

        Launch the vLLM container with GPU access, IPC settings optimized for multi-GPU workloads, and volume mounts for model caching and output persistence.

        ```bash
        docker run --rm -it --gpus "device=1" --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
          -v "./output_models:/workspace/output_models" \
          -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
          -e HF_TOKEN=$HF_TOKEN \
          nvcr.io/nvidia/vllm:25.12.post1-py3 \
          bash -c "
            git clone -b 0.41.0 --single-branch https://github.com/NVIDIA/TensorRT-Model-Optimizer.git /app/TensorRT-Model-Optimizer && \
            cd /app/TensorRT-Model-Optimizer && pip install -e '.[dev]' && \
            export ROOT_SAVE_PATH='/workspace/output_models' && \
            /app/TensorRT-Model-Optimizer/examples/llm_ptq/scripts/huggingface_example.sh \
            --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
            --quant nvfp4 \
            --tasks quant
          "
        ```

        > [!NOTE]
        > * You can safely ignore the `No module named 'mpi4py'` error. It does not affect the quantization process.
        > * You may encounter this `pynvml.NVMLError_NotSupported: Not Supported`. This is expected in some environments, does not affect results, and will be fixed in an upcoming release.
        > * If your model is too large, you may encounter an out of memory error. You can try quantizing a smaller model instead.

        This command:

        - Runs the container with full GPU access and optimized shared memory settings
        - Mounts your output directory to persist quantized model files
        - Mounts your Hugging Face cache to avoid re-downloading the model
        - Clones and installs the TensorRT Model Optimizer from source
        - Executes the quantization script with NVFP4 quantization parameters

        # Step 6. Monitor the quantization process

        The quantization process will display progress information including:

        - Model download progress from Hugging Face
        - Quantization calibration steps
        - Model export and validation phases

        # Step 7. Validate the quantized model

        After the container completes, verify that the quantized model files were created successfully.

        ```bash
        # Check output directory contents
        ls -la ./output_models/

        # Verify model files are present
        find ./output_models/ -name "*.bin" -o -name "*.safetensors" -o -name "config.json"
        ```

        You should see model weight files, configuration files, and tokenizer files in the output directory.

        Now verify the quantized model can be loaded properly using a simple test:

        ```bash
        # Set path to quantized model directory
        export MODEL_PATH="./output_models/saved_models_DeepSeek-R1-Distill-Llama-8B_nvfp4/"

        docker run \
          -e HF_TOKEN=$HF_TOKEN \
          -v "$MODEL_PATH:/workspace/model" \
          --rm -it --ulimit memlock=-1 --ulimit stack=67108864 \
          --gpus '"device=1"' --ipc=host --network host \
          nvcr.io/nvidia/vllm:25.12.post1-py3 \
          vllm serve /workspace/model \
            --max-model-len 4096 \
            --port 8000
        ```

        > [!NOTE]
        > This starts a vLLM server on port 8000. When you are done validating, stop it with **Ctrl+C** (or exit the container) before starting Step 8, which also uses port 8000.

        # Step 8. Serve the model with OpenAI-compatible API

        Start the vLLM OpenAI-compatible API server with the quantized model.
        First, set the path to your quantized model:

        ```bash
        # Set path to quantized model directory
        export MODEL_PATH="./output_models/saved_models_DeepSeek-R1-Distill-Llama-8B_nvfp4/"

        docker run \
          -e HF_TOKEN=$HF_TOKEN \
          -v "$MODEL_PATH:/workspace/model" \
          --rm -it --ulimit memlock=-1 --ulimit stack=67108864 \
          --gpus '"device=1"' --ipc=host --network host \
          nvcr.io/nvidia/vllm:25.12.post1-py3 \
          vllm serve /workspace/model \
            --backend pytorch \
            --max-num-seqs 4 \
            --max-model-len 8192 \
            --port 8000
        ```

        Run the following to test the server with a client CURL request:

        ```bash
        curl -X POST http://localhost:8000/v1/chat/completions \
          -H "Content-Type: application/json" \
          -d '{
            "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
            "messages": [{"role": "user", "content": "What is artificial intelligence?"}],
            "max_tokens": 100,
            "temperature": 0.7,
            "stream": false
          }'
        ```

        Try changing knobs such as `--max-model-len` to find the right serving configuration for your use case.

        # Step 9. Cleanup and rollback

        To clean up the environment and remove generated files:

        > [!WARNING]
        > This will permanently delete all quantized model files and cached data.

        ```bash
        # Remove output directory and all quantized models
        rm -rf ./output_models

        # Remove Hugging Face cache (optional)
        rm -rf ~/.cache/huggingface

        # Remove Docker image (optional)
        docker rmi nvcr.io/nvidia/vllm:25.12.post1-py3
        ```

        # Step 10. Next steps

        The quantized model is now ready for deployment. Common next steps include:
        - Benchmarking inference performance compared to the original model.
        - Integrating the quantized model into your inference pipeline.
        - Deploying to NVIDIA Triton Inference Server for production serving.
        - Running additional validation tests on your specific use cases.


    -
      id: troubleshooting

      label: Troubleshooting
      content: |
        | Symptom | Cause | Fix |
        |---------|--------|-----|
        | "Permission denied" when accessing Hugging Face | Missing or invalid HF token | Run `huggingface-cli login` with valid token |
        | Container exits with CUDA out of memory | Insufficient GPU memory | Reduce batch size or use a machine with more GPU memory |
        | Model files not found in output directory | Volume mount failed or wrong path | Verify `$(pwd)/output_models` resolves correctly |
        | Git clone fails inside container | Network connectivity issues | Check internet connection and retry |
        | Quantization process hangs | Container resource limits | Increase Docker memory limits or use `--ulimit` flags |
        | Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser |
        | Log ends with MPI or `ModuleNotFoundError: No module named 'mpi4py'` | TensorRT-LLM / runner step uses MPI; quantization may have already succeeded | Check that the quantization output (e.g. encoder config, saved model under `output_models/`) was produced. The final runner step can fail with an MPI error even when NVFP4 quantization completed successfully. Install `mpi4py` or use a container that includes it if you need the full pipeline. |


    resources:
    - name: TensorRT Model Optimizer Documentation
      url: https://nvidia.github.io/TensorRT-Model-Optimizer/


    - name: TensorRT-LLM Documentation
      url: https://nvidia.github.io/TensorRT-LLM/