sarman/dgx-spark-playbooks

Fork 0

mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git synced 2026-06-18 04:22:21 +00:00

GitLab CI a9383bb067 chore: Regenerate all playbooks

2026-05-29 00:08:55 +00:00

12 KiB

Raw Permalink Blame History

NVFP4 Pretraining with Megatron Bridge

Pretrain Llama 3.1 8B with NVFP4 mixed precision on DGX Station using Megatron Bridge

Overview
Pretrain with NVFP4
Troubleshooting

Overview

NVFP4 training

NVFP4 is a 4-bit floating-point format natively supported by NVIDIA Blackwell Tensor Cores. When applied during pretraining, NVFP4 reduces memory bandwidth and compute cost for matrix multiplications while preserving model quality through mixed-precision accumulation in higher precision (BF16/FP32).

Megatron-Bridge is NVIDIA's library for large-scale distributed training built on top of Megatron-Core. It provides composable recipe configs for models, optimizers, and mixed-precision strategies — including the first-class bf16_with_nvfp4_mixed recipe used in this playbook.

Combining the two lets you pretrain LLMs at lower memory cost and higher throughput compared to BF16-only training, with minimal accuracy trade-off.

Key benefits:

~2× higher training throughput vs BF16 - Higher TFLOPs at minimal loss in model quality
Native Blackwell NVFP4 GEMMs — FP4 matmuls run as a single Tensor Core instruction, no software emulation overhead
Recipe-based configuration — swap between bf16_mixed, bf16_with_fp8_current_scaling_mixed, and bf16_with_nvfp4_mixed with a single line
Stability controls — pin the first/last N transformer layers in BF16 (this playbook keeps the last 4 layers in BF16 via first_last_layers_bf16)
~2× memory reduction - For inference weight storage vs FP8, ~3.5× vs FP16

What you'll accomplish

Pretrain a Llama 3.1 8B model using Megatron-Bridge with NVFP4 mixed precision on NVIDIA DGX Station. You'll run a short training loop with mock data to verify the full pipeline end-to-end, compare against a plain BF16 baseline via the --disable-fp4 flag and then learn how to point it at real data if required.

Measured results

Run settings:

Model: Llama 3.1 8B (llama3_8b_pretrain_config())
50 iterations, 2 warmup
Global batch size 64, micro batch size 4, sequence length 4096
Dummy data (Megatron-Core's built-in MockGPTDataset — synthetic random token IDs, no real corpus)
Single GB300 GPU, nvcr.io/nvidia/nemo:26.04 container
Latency: average of iterations 20–50 (iter 10 includes one-time CUDA-graph/compile overhead)
VRAM: peak of nvidia-smi --query-compute-apps=used_memory sampled every 2 s during the run

Precision	Recipe	Avg step time	Throughput (Model TFLOP/s/GPU)	Peak VRAM
BF16 baseline	`bf16_mixed()`	9.05 s	~1399	221.6 GB
NVFP4 (last-4 BF16)	`bf16_with_nvfp4_mixed()` + `first_last_layers_bf16=True`, `num_layers_at_end_in_bf16=4`	5.39 s	~2347	207.8 GB

NVFP4 is 1.68× faster than BF16 (≈68% higher throughput) with ≈13.8 GB (≈6%) less peak VRAM — the regime NVFP4 was designed for, where matmul FLOPs dominate each step and quantization overhead is amortized over wide linear projections.

What to know before starting

Basic Python and PyTorch usage
Familiarity with distributed training concepts (torchrun)
Understanding of mixed precision training (FP16/BF16/FP8)

Prerequisites

NVIDIA DGX Station with Blackwell architecture GPU (GB300 chip)
Docker installed with GPU support
NVIDIA Container Toolkit configured
Megatron-Bridge installed (via the NeMo Framework NGC container)

Verify your setup:

## Check GPU availability and architecture
nvidia-smi

## Verify Python and torch
python3 -c "import torch; print(torch.cuda.get_device_name(0))"

Time & risk

Estimated duration: 20-30 minutes (quick test loop with default --train-iters 50); longer for real data
Risks:
- NVFP4 requires Blackwell GPUs — will fail on Hopper or older
- Mock data is used by default (eval_iters=0); real data requires a preprocessed Megatron-format dataset
Rollback: Stop the torchrun process and remove any checkpoint directories
Last Updated: 05/26/2026
- First Publication

Pretrain with NVFP4

Step 1. Set up the environment

The recommended way to run Megatron-Bridge on DGX Station is through the NeMo Framework container, which includes Megatron-Bridge, Megatron-Core, Transformer Engine, and all CUDA dependencies pre-installed. Running outside the container is not supported in this playbook — the NVFP4 kernels rely on the exact Transformer Engine / CUDA versions shipped inside the image.

git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-spark-playbooks/nvidia/station-nvfp4-pretraining/assets

## Use the latest nemo tag
export TAG=26.04

docker run --rm -it \
  --gpus all \
  --ipc host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -v "$(pwd):/workdir" \
  -w /workdir \
  --entrypoint bash \
  nvcr.io/nvidia/nemo:${TAG}

All subsequent torchrun / python commands in this playbook are meant to be executed from the shell inside this container.

Step 2. Review the pretraining script

The pretraining script can be found at pretrain_llama.py. The key piece is the NVFP4 precision config, built on top of Megatron-Bridge's prebuilt bf16_with_nvfp4_mixed recipe:

from megatron.bridge.training.mixed_precision import bf16_with_nvfp4_mixed

def nvfp4_mixed_precision():
    cfg = bf16_with_nvfp4_mixed()
    cfg.first_last_layers_bf16 = True
    cfg.num_layers_at_start_in_bf16 = 0
    cfg.num_layers_at_end_in_bf16 = 4
    return cfg

bf16_with_nvfp4_mixed() already sets fp8="e4m3" and fp8_recipe="nvfp4" under the hood; we just toggle the layer-pinning knobs on top:

Last 4 layers in BF16 (num_layers_at_end_in_bf16=4) for training stability (adjustable per model)
No start-layer pinning (num_layers_at_start_in_bf16=0) — last-layer stability is usually enough

Note

The script uses llama3_8b_pretrain_config() which defaults to context_parallel_size=2. The script overrides this to context_parallel_size=1 for single-GPU runs. If you swap in a larger recipe (e.g. nemotron_3_nano_pretrain_config, which defaults to TP=4), you must either launch torchrun --nproc_per_node=4 on a 4-GPU node or override config.model.tensor_model_parallel_size = 1 before calling pretrain(...), or you will hit: AssertionError: world size (1) is not divisible by total_model_size (...tensor_model_parallel_size=4 * ...).

Step 3. Launch NVFP4 pre-training

Launch a short training run with mock data and tee the output to a log file so you can inspect VRAM and per-iteration latency afterwards:

torchrun --nproc_per_node=1 pretrain_llama.py > nvfp4.log 2>&1

Expected output (see nvfp4.log):

Model initialization logs and a Theoretical memory footprints: weight and optimizer=... line
Iteration progress printed every step (log_interval=1), e.g. iteration 10/50 | ... elapsed time per iteration (ms): ... | lm loss: ...
A [Rank 0] ... memory (GB) | mem-max-reserved-gigabytes: ... line — this is your peak VRAM
A checkpoint saved to /workdir/nemo_experiments/default/checkpoints

If the run finishes with EXIT=0 (or no traceback), your NVFP4 pretraining setup is working.

Step 4. Compare with BF16 baseline

Run the same script with --disable-fp4 to establish a BF16 baseline, again logging to a file:

## Remove the prior checkpoint directory so the two runs don't interfere
rm -rf nemo_experiments

torchrun --nproc_per_node=1 pretrain_llama.py --disable-fp4 > bf16.log 2>&1

To compare the two runs on latency and throughput, grep the per-iteration lines out of each log:

grep -E "elapsed time per iteration|MODEL_TFLOP" nvfp4.log
grep -E "elapsed time per iteration|MODEL_TFLOP" bf16.log

Each step prints two lines:

Step Time : 5.39s GPU utilization: 2347.0MODEL_TFLOP/s/GPU — step latency and throughput
iteration 10/50 | ... elapsed time per iteration (ms): 5390 | ... lm loss: ... — same latency in ms plus loss

Iteration 10 includes one-time CUDA-graph/compile overhead, so average iterations 20–50 for a fair per-step latency number.

Measuring peak VRAM (from `nvidia-smi`)

Megatron's in-log memory numbers (mem-max-reserved-gigabytes) reflect PyTorch's caching-allocator reservation, which can drift from what the device actually holds. For an accurate read, watch nvidia-smi live from a second shell while training runs:

watch -n 1 nvidia-smi

See the measured numbers in overview.md for expected VRAM and latency on 1× GB300 with Llama 3.1 8B.

Step 5. Script arguments

pretrain_llama.py accepts the following arguments:

Argument	Type	Default	Description
`--disable-fp4`	flag	off	Disable NVFP4; use plain BF16 mixed precision as a baseline
`--train-iters`	int	50	Number of training iterations
`--warmup-iters`	int	2	Number of warmup iterations
`--global-batch-size`	int	64	Global batch size
`--micro-batch-size`	int	4	Micro batch size (drives peak VRAM; increase to use more memory)
`--seq-length`	int	4096	Sequence length

Example combining several flags:

torchrun --nproc_per_node=1 pretrain_llama.py \
    --train-iters 50 --warmup-iters 2 \
    --global-batch-size 64 --micro-batch-size 4 --seq-length 4096

Step 6. Point to real data

To train on your own dataset, modify the config in the script:

config = llama3_8b_pretrain_config()
config.data.data_path = "/path/to/your/preprocessed/dataset"
config.train.train_iters = 5000
config.train.global_batch_size = 256
config.train.micro_batch_size = 2

Megatron-Bridge expects preprocessed data in Megatron format. See the Megatron-Bridge data preparation guide for details.

Step 7. Cleanup

Remove checkpoints and log files generated by the runs:

rm -rf nemo_experiments/ nvfp4.log bf16.log

Then exit the container shell (exit) — the --rm flag in Step 1 deletes it automatically.

References

Troubleshooting

Symptom	Cause	Fix
`RuntimeError: NVFP4 is not supported on this GPU` or similar FP4 error	GPU is not Blackwell architecture	NVFP4 requires Blackwell GPUs (GB200, GB300). Check with `nvidia-smi`
`ModuleNotFoundError: No module named 'megatron.bridge'`	Megatron Bridge not installed	Run `pip install megatron-bridge` or use the NGC container
`CUDA out of memory` during model init	Insufficient GPU memory for Llama 3.1 8B + optimizer states	Reduce `micro_batch_size` or use `--nproc_per_node` for model parallelism
`torchrun` hangs or times out	NCCL communication failure between GPUs	Check `NCCL_DEBUG=INFO torchrun ...` for details; verify all GPUs are visible
Training loss is NaN	Precision instability	Increase `num_layers_at_end_in_bf16` (e.g., from 4 to 8) or reduce learning rate
`--disable-fp4` works but NVFP4 crashes	Transformer Engine version mismatch	Ensure Transformer Engine supports NVFP4; update with `pip install --upgrade transformer-engine`
Slow training throughput	Not using Tensor Cores efficiently	Ensure batch dimensions are multiples of 8; check that `nvidia-smi` shows high GPU utilization
Permission denied on Docker	User not in docker group	Run `sudo usermod -aG docker $USER && newgrp docker`

12 KiB Raw Permalink Blame History Unescape Escape