NVFP4 is a 4-bit floating-point format natively supported by NVIDIA Blackwell Tensor Cores.
When applied during **pretraining**, NVFP4 reduces memory bandwidth and compute cost for matrix multiplications while preserving model quality by accumulating in higher precision (BF16/FP32).
Megatron-Bridge is NVIDIA's library for large-scale distributed training built on top of Megatron-Core.
It provides composable recipe configs for models, optimizers, and mixed-precision strategies — including the first-class `bf16_with_nvfp4_mixed` recipe used in this playbook.
Combining the two lets you pretrain LLMs at lower memory cost and higher throughput compared to BF16-only training, with minimal accuracy trade-off.
Key benefits:
- **~2× higher training throughput vs BF16**: higher TFLOP/s at minimal loss in model quality
- **Native Blackwell NVFP4 GEMMs**: FP4 matmuls run as a single Tensor Core instruction, with no software emulation overhead
- **Recipe-based configuration**: swap between `bf16_mixed`, `bf16_with_fp8_current_scaling_mixed`, and `bf16_with_nvfp4_mixed` with a single line
- **Stability controls**: pin the first/last N transformer layers in BF16 (this playbook keeps the last 4 layers in BF16 via `first_last_layers_bf16`)
- **~2× memory reduction**: for inference weight storage vs FP8, ~3.5× vs FP16
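
The memory ratios in the last bullet follow from how NVFP4 stores weights. Below is a minimal sketch of that arithmetic, assuming the published NVFP4 layout of 4-bit values with one 8-bit (E4M3) scale per 16-element block, ignoring the single per-tensor FP32 scale, and assuming a round 8 billion parameters for an 8B model. The helper names are illustrative and not part of any library.

```python
# Effective storage cost of NVFP4 weights versus FP8 and FP16.
# Assumed layout: 4-bit values plus one 8-bit scale per 16-element block.
NVFP4_VALUE_BITS = 4
NVFP4_SCALE_BITS = 8
NVFP4_BLOCK_SIZE = 16


def nvfp4_bits_per_weight() -> float:
    """Average bits per weight once the per-block scale is amortized."""
    return NVFP4_VALUE_BITS + NVFP4_SCALE_BITS / NVFP4_BLOCK_SIZE


def weight_gigabytes(num_params: float, bits_per_weight: float) -> float:
    """Weight storage in GB (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


bits = nvfp4_bits_per_weight()   # 4.5 bits per weight
ratio_vs_fp16 = 16 / bits        # about 3.56
ratio_vs_fp8 = 8 / bits          # about 1.78

params = 8e9  # assumption: round parameter count for an 8B model
print(f"NVFP4: {weight_gigabytes(params, bits):.1f} GB")
print(f"FP8:   {weight_gigabytes(params, 8):.1f} GB")
print(f"FP16:  {weight_gigabytes(params, 16):.1f} GB")
```

Under these assumptions the ~2× figure versus FP8 is the raw 8-bit to 4-bit ratio; with block scales included it comes out closer to 1.8×, while the FP16 ratio lands at about 3.5×.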
# What you'll accomplish
Pretrain a **Llama 3.1 8B** model using Megatron-Bridge with NVFP4 mixed precision on an NVIDIA DGX Station.
You'll run a short training loop with mock data to verify the full pipeline end-to-end, compare it against a plain BF16 baseline via the `--disable-fp4` flag, and then learn how to point it at real data if required.
In this setup, NVFP4 is **1.68× faster** than BF16 (≈68% higher throughput) and uses ≈13.8 GB (≈6%) less peak VRAM. This is the regime NVFP4 was designed for: matmul FLOPs dominate each step, and quantization overhead is amortized over wide linear projections.
# What to know before starting
- Basic Python and PyTorch usage
- Familiarity with distributed training concepts (`torchrun`)
- Understanding of mixed precision training (FP16/BF16/FP8)
# Prerequisites
- NVIDIA DGX Station with Blackwell architecture GPU (GB300 chip)
The recommended way to run Megatron-Bridge on DGX Station is through the [NeMo Framework container](https://github.com/NVIDIA-NeMo/Megatron-Bridge#-nemo-framework-container), which includes Megatron-Bridge, Megatron-Core, Transformer Engine, and all CUDA dependencies pre-installed. Running outside the container is not supported in this playbook — the NVFP4 kernels rely on the exact Transformer Engine / CUDA versions shipped inside the image.
The pretraining script can be found at `pretrain_llama.py`. The key piece is the NVFP4 precision config, built on top of Megatron-Bridge's prebuilt `bf16_with_nvfp4_mixed` recipe:
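```python
from megatron.bridge.training.mixed_precision import bf16_with_nvfp4_mixed


def nvfp4_mixed_precision():
    """Return the NVFP4 recipe with the last 4 layers pinned to BF16."""
    cfg = bf16_with_nvfp4_mixed()
    cfg.first_last_layers_bf16 = True    # enable first/last-layer pinning
    cfg.num_layers_at_start_in_bf16 = 0  # no layers pinned at the start
    cfg.num_layers_at_end_in_bf16 = 4    # keep the last 4 layers in BF16
    return cfg
```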
`bf16_with_nvfp4_mixed()` already sets `fp8="e4m3"` and `fp8_recipe="nvfp4"` under the hood; we just toggle the layer-pinning knobs on top:
- **Last 4 layers in BF16** (`num_layers_at_end_in_bf16=4`) for training stability (adjustable per model)
- **No start-layer pinning** (`num_layers_at_start_in_bf16=0`): last-layer stability is usually enough
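
To make the effect of the two knobs concrete, here is a small illustrative sketch of which layer indices end up pinned. It assumes the counts refer to whole transformer layers taken from each end of the stack; `bf16_layer_indices` is a hypothetical helper written for this explanation, not a Megatron-Bridge function.

```python
def bf16_layer_indices(num_layers: int, at_start: int, at_end: int) -> list[int]:
    """Return 0-based indices of layers kept in BF16 (hypothetical helper)."""
    if at_start < 0 or at_end < 0 or at_start + at_end > num_layers:
        raise ValueError("pinned layer counts must fit within num_layers")
    head = list(range(at_start))
    tail = list(range(num_layers - at_end, num_layers))
    return head + tail


# Llama 3.1 8B has 32 transformer layers.
pinned = bf16_layer_indices(num_layers=32, at_start=0, at_end=4)
print(pinned)  # [28, 29, 30, 31]
```

With the settings used here, 28 of the 32 layers run their GEMMs in NVFP4 and only the final four stay in BF16.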
> [!NOTE]
> The script uses `llama3_8b_pretrain_config()` which defaults to `context_parallel_size=2`. The script overrides this to `context_parallel_size=1` for single-GPU runs. If you swap in a larger recipe (e.g. `nemotron_3_nano_pretrain_config`, which defaults to TP=4), you **must** either launch `torchrun --nproc_per_node=4` on a 4-GPU node or override `config.model.tensor_model_parallel_size = 1` before calling `pretrain(...)`, or you will hit:
> `AssertionError: world size (1) is not divisible by total_model_size (...tensor_model_parallel_size=4 * ...)`.
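
The rule behind that assertion can be sketched as a simple divisibility check. This is an illustration of the constraint only, assuming the model-parallel footprint is the product of the tensor, pipeline, and context parallel sizes; `check_world_size` is a hypothetical name and this is not Megatron-Core's actual code.

```python
def check_world_size(world_size: int, tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    """Return the data-parallel size, or raise if the layout does not fit.

    Illustrative only: mirrors the divisibility rule, not Megatron-Core's code.
    """
    model_size = tp * pp * cp
    if world_size % model_size != 0:
        raise ValueError(
            f"world size ({world_size}) is not divisible by "
            f"tp*pp*cp ({tp}*{pp}*{cp}={model_size})"
        )
    return world_size // model_size


print(check_world_size(1, cp=1))  # 1: single GPU with the override applied
print(check_world_size(4, tp=4))  # 1: a TP=4 recipe on a 4-GPU node
# check_world_size(1, cp=2) raises: one GPU cannot host context_parallel_size=2
```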
# Step 3. Launch NVFP4 pre-training
Launch a short training run with mock data and tee the output to a log file so you can inspect VRAM and per-iteration latency afterwards:
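The launch can be sketched in Python as follows. This is a sketch under stated assumptions, not the playbook's exact command: it assumes a single-GPU `torchrun` launch of `pretrain_llama.py` from the current directory, uses only flags listed in the script's argument table, and writes to the `nvfp4.log` and `bf16.log` names used in the later steps. `build_command` and `run_and_tee` are hypothetical helper names.

```python
import subprocess
import sys


def build_command(disable_fp4: bool = False, train_iters: int = 50) -> list[str]:
    """Build the torchrun argv for a single-GPU run (assumed launch form)."""
    cmd = [
        "torchrun", "--nproc_per_node=1",
        "pretrain_llama.py",
        "--train-iters", str(train_iters),
    ]
    if disable_fp4:
        cmd.append("--disable-fp4")  # plain BF16 baseline
    return cmd


def run_and_tee(cmd: list[str], log_path: str) -> int:
    """Run cmd, echoing output to the terminal and to log_path (like `tee`)."""
    with open(log_path, "w") as log:
        proc = subprocess.Popen(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
        )
        for line in proc.stdout:
            sys.stdout.write(line)
            log.write(line)
        return proc.wait()


# Inside the container, from the directory holding pretrain_llama.py:
# run_and_tee(build_command(), "nvfp4.log")                  # NVFP4 run
# run_and_tee(build_command(disable_fp4=True), "bf16.log")   # BF16 baseline
```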
To compare the two runs on **latency** and **throughput**, grep the per-iteration lines out of each log:
```bash
grep -E "elapsed time per iteration|MODEL_TFLOP" nvfp4.log
grep -E "elapsed time per iteration|MODEL_TFLOP" bf16.log
```
Each step prints two lines:
- `Step Time : 5.39s GPU utilization: 2347.0MODEL_TFLOP/s/GPU`: step latency and throughput
- `iteration 10/50 | ... elapsed time per iteration (ms): 5390 | ... lm loss: ...`: same latency in ms, plus loss
Iteration 10 includes one-time CUDA-graph/compile overhead, so average iterations 20–50 for a fair per-step latency number.
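The averaging described above can be automated with a short parser. This is a minimal sketch that assumes the per-iteration line format shown in this step; the regular expression and the helper names `mean_step_ms` and `speedup` are illustrative, so adjust the pattern if your log lines differ.

```python
import re

# Matches e.g. "iteration 10/50 | ... elapsed time per iteration (ms): 5390 | ..."
ITER_RE = re.compile(
    r"iteration\s+(\d+)\s*/\s*\d+.*?elapsed time per iteration \(ms\):\s*([\d.]+)"
)


def mean_step_ms(log_text: str, first_iter: int = 20, last_iter: int = 50) -> float:
    """Average 'elapsed time per iteration (ms)' over logged iterations in range."""
    times = [
        float(ms)
        for it, ms in ITER_RE.findall(log_text)
        if first_iter <= int(it) <= last_iter
    ]
    if not times:
        raise ValueError("no matching iteration lines found")
    return sum(times) / len(times)


def speedup(bf16_ms: float, nvfp4_ms: float) -> float:
    """How many times faster NVFP4 is than BF16 per step."""
    return bf16_ms / nvfp4_ms


# Usage, once both logs exist:
# nvfp4 = mean_step_ms(open("nvfp4.log").read())
# bf16 = mean_step_ms(open("bf16.log").read())
# print(f"NVFP4 is {speedup(bf16, nvfp4):.2f}x faster per step")
```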
### Measuring peak VRAM (from `nvidia-smi`)
Megatron's in-log memory numbers (`mem-max-reserved-gigabytes`) reflect PyTorch's caching-allocator reservation, which can drift from what the device actually holds. For an accurate read, watch `nvidia-smi` live from a second shell while training runs:
```bash
watch -n 1 nvidia-smi
```
See the measured numbers in `overview.md` for expected VRAM and latency on 1× GB300 with Llama 3.1 8B.
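If you would rather record the peak than watch it, the same reading can be polled from Python. This is a minimal sketch that assumes `nvidia-smi` is on the `PATH` and reports `memory.used` in MiB through its CSV query mode; `parse_used_mib` and `poll_peak_mib` are hypothetical helper names. Sampling once per second can miss short-lived spikes, and the value covers every process on the GPU, so treat the result as an approximation.

```python
import subprocess
import time

QUERY = ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]


def parse_used_mib(output: str) -> list[int]:
    """Parse one memory.used value (MiB) per GPU, skipping non-numeric rows."""
    return [int(line.strip()) for line in output.splitlines() if line.strip().isdigit()]


def poll_peak_mib(duration_s: float, interval_s: float = 1.0) -> int:
    """Sample nvidia-smi for duration_s seconds; return the peak MiB seen on any GPU."""
    peak = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
        peak = max([peak, *parse_used_mib(out)])
        time.sleep(interval_s)
    return peak


# From a second shell while training runs, e.g. for ten minutes:
# print(f"peak: {poll_peak_mib(600) / 1024:.1f} GiB")
```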
# Step 5. Script arguments
`pretrain_llama.py` accepts the following arguments:
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--disable-fp4` | flag | off | Disable NVFP4; use plain BF16 mixed precision as a baseline |
| `--train-iters` | int | 50 | Number of training iterations |
| `--warmup-iters` | int | 2 | Number of warmup iterations |
| `--global-batch-size` | int | 64 | Global batch size |
| `--micro-batch-size` | int | 4 | Micro batch size (drives peak VRAM; increase to use more memory) |
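
The two batch-size arguments are linked: the global batch is split into micro-batches that are accumulated before each optimizer step. A small sketch of that relationship, assuming the usual Megatron convention that the global batch equals micro-batch size × data-parallel size × accumulation steps (`grad_accum_steps` is a hypothetical helper, not a script function):

```python
def grad_accum_steps(global_batch: int, micro_batch: int, data_parallel: int = 1) -> int:
    """Micro-batches each rank processes per optimizer step."""
    per_pass = micro_batch * data_parallel
    if global_batch % per_pass != 0:
        raise ValueError("global batch must be divisible by micro_batch * data_parallel")
    return global_batch // per_pass


print(grad_accum_steps(64, 4))  # 16 with the defaults on one GPU
print(grad_accum_steps(64, 8))  # 8: fewer, larger micro-batches, higher peak VRAM
```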
Megatron-Bridge expects preprocessed data in Megatron format. See the [Megatron-Bridge data preparation guide](https://docs.nvidia.com/nemo/megatron-bridge/latest/) for details.
# Step 7. Cleanup
Remove checkpoints and log files generated by the runs:
```bash
rm -rf nemo_experiments/ nvfp4.log bf16.log
```
Then exit the container shell (`exit`) — the `--rm` flag in Step 1 deletes it automatically.
# Troubleshooting
| Symptom | Cause | Fix |
|---------|-------|-----|
| `RuntimeError: NVFP4 is not supported on this GPU` or similar FP4 error | GPU is not Blackwell architecture | NVFP4 requires Blackwell GPUs (GB200, GB300). Check the GPU model with `nvidia-smi` |
| `ModuleNotFoundError: No module named 'megatron.bridge'` | Megatron-Bridge not installed | Use the NeMo Framework container (recommended), or run `pip install megatron-bridge` |
| `CUDA out of memory` during model init | Insufficient GPU memory for Llama 3.1 8B + optimizer states | Reduce `--micro-batch-size`, or spread the model across more GPUs (launch with a larger `torchrun --nproc_per_node` and a matching model-parallel size) |
| `torchrun` hangs or times out | NCCL communication failure between GPUs | Rerun with `NCCL_DEBUG=INFO torchrun ...` for details; verify all GPUs are visible |
| Training loss is NaN | Precision instability | Increase `num_layers_at_end_in_bf16` (e.g., from 4 to 8) or reduce the learning rate |
| Slow training throughput | Not using Tensor Cores efficiently | Ensure batch dimensions are multiples of 8; check that `nvidia-smi` shows high GPU utilization |
| Permission denied on Docker | User not in the `docker` group | Run `sudo usermod -aG docker $USER && newgrp docker` |