chore: Regenerate all playbooks

commit 8499e486ff
parent f96690e73d
GitLab CI, 2025-10-12 18:25:34 +00:00


@@ -5,7 +5,7 @@
 ## Table of Contents
 - [Overview](#overview)
-- [NVFP4 on Blackwell](#nvfp4-on-blackwell)
+- [Basic Idea](#basic-idea)
 - [Instructions](#instructions)
 - [Troubleshooting](#troubleshooting)
@@ -14,15 +14,17 @@
 ## Overview
 ## Basic idea
-### NVFP4 on Blackwell
-- **What it is:** A new 4-bit floating-point format for NVIDIA Blackwell GPUs
-- **How it works:** Uses two levels of scaling (local per-block + global tensor) to keep accuracy while using fewer bits
-- **Why it matters:**
-  - Cuts memory use ~3.5x vs FP16 and ~1.8x vs FP8
-  - Keeps accuracy close to FP8 (usually <1% loss)
-  - Improves speed and energy efficiency for inference
+### Basic Idea
+NVFP4 is a 4-bit floating-point format introduced with NVIDIA Blackwell GPUs to maintain model accuracy while reducing memory bandwidth and storage requirements for inference workloads.
+Unlike uniform INT4 quantization, NVFP4 retains floating-point semantics with a shared exponent and a compact mantissa, allowing higher dynamic range and more stable convergence.
+NVIDIA Blackwell Tensor Cores natively support mixed-precision execution across FP16, FP8, and FP4, enabling models to use FP4 for weights and activations while accumulating in higher precision (typically FP16).
+This design minimizes quantization error during matrix multiplications and supports efficient conversion pipelines in TensorRT-LLM for fine-tuned layer-wise quantization.
+Immediate benefits are:
+- Cut memory use ~3.5x vs FP16 and ~1.8x vs FP8
+- Maintain accuracy close to FP8 (usually <1% loss)
+- Improve speed and energy efficiency for inference
 ## What you'll accomplish
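The two-level scaling the diff describes (a local per-block scale plus a global tensor scale, with values snapped to the FP4 E2M1 grid) can be sketched in NumPy. This is a minimal illustrative simulation, not TensorRT-LLM's or NVIDIA's implementation: the 16-element block size, the function names, and the scale computation are assumptions made for the example.

```python
import numpy as np

# Representable magnitudes of FP4 E2M1 (sign is handled separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_nvfp4(x, block=16):
    """Simulate NVFP4-style quantization: per-block scale + global scale.

    `x` must have a total size divisible by `block`.
    """
    xb = x.reshape(-1, block)
    # Global tensor scale maps the overall max magnitude near the grid max (6).
    g = max(np.abs(xb).max() / 6.0, 1e-12)
    # Local per-block scale maps each block's max magnitude onto the grid max.
    s = np.abs(xb).max(axis=1, keepdims=True) / (6.0 * g)
    s[s == 0] = 1.0  # avoid division by zero for all-zero blocks
    scaled = xb / (s * g)
    # Snap each value to the nearest representable E2M1 magnitude.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(-1)
    q = np.sign(scaled) * E2M1_GRID[idx]
    return q, s, g

def dequantize_nvfp4(q, s, g):
    # Reverse both scaling levels to recover approximate original values.
    return q * s * g

# Round-trip a random weight tensor and inspect the reconstruction error.
x = np.random.default_rng(0).normal(size=(4, 16)).astype(np.float32)
q, s, g = quantize_nvfp4(x)
xr = dequantize_nvfp4(q, s, g).reshape(x.shape)
err = np.abs(x - xr).max()
```

Because each block is rescaled to its own maximum before snapping to the grid, the worst-case error is bounded by the block's magnitude rather than the whole tensor's, which is the accuracy advantage the two-level scheme buys over a single global scale.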