sarman/dgx-spark-playbooks

mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git synced 2026-04-22 01:53:53 +00:00

TharunGaneshram 02b1d1d36b

Merge e542e522c5 into 6e98abc3b0

2026-04-14 09:38:46 +02:00

Fine-tune with NeMo

Use NVIDIA NeMo to fine-tune models locally

Overview
Instructions
Run on two Sparks
Troubleshooting

Overview

Basic idea

This playbook guides you through setting up and using NVIDIA NeMo AutoModel to fine-tune large language models and vision-language models on NVIDIA Spark devices. NeMo AutoModel provides GPU-accelerated, end-to-end training for Hugging Face models with native PyTorch support, so you can fine-tune checkpoints directly without a format-conversion step. The framework supports distributed training from a single GPU up to multi-node clusters, with optimized kernels and memory-efficient recipes designed for the ARM64 architecture and Blackwell GPU systems.

What you'll accomplish

You'll establish a complete fine-tuning environment for large language models (1-70B parameters) and vision-language models using NeMo AutoModel on your NVIDIA Spark device. By the end, you'll have a working installation that supports parameter-efficient fine-tuning (PEFT), supervised fine-tuning (SFT), and distributed training capabilities with FP8 precision optimizations, all while maintaining compatibility with the Hugging Face ecosystem.

What to know before starting

Working in Linux terminal environments and SSH connections
Basic understanding of Python virtual environments and package management
Familiarity with GPU computing concepts and CUDA toolkit usage
Experience with containerized workflows and Docker/Podman operations
Understanding of machine learning model training concepts and fine-tuning workflows

Prerequisites

NVIDIA Spark device with Blackwell architecture GPU access
CUDA toolkit 12.0+ installed and configured: nvcc --version
Python 3.10+ environment available: python3 --version
Minimum 32GB system RAM for efficient model loading and training
Active internet connection for downloading models and packages
Git installed for repository cloning: git --version
SSH access to your NVIDIA Spark device configured

Ancillary files

All necessary files for the playbook can be found here on GitHub

Time & risk

Duration: 45-90 minutes for complete setup and initial model fine-tuning
Risks: Model downloads can be large (several GB), ARM64 package compatibility issues may require troubleshooting, distributed training setup complexity increases with multi-node configurations
Rollback: Virtual environments can be completely removed; no system-level changes are made to the host system beyond package installations.
Last Updated: 03/04/2026
- Recommend running the NeMo fine-tuning workflow via Docker

Instructions

Step 1. Verify system requirements

Check that your NVIDIA Spark device meets the prerequisites for NeMo AutoModel installation. This step runs on the host system to confirm CUDA toolkit availability and Python version compatibility.

## Verify CUDA installation
nvcc --version

## Check Python version (3.10+ required)
python3 --version

## Verify GPU accessibility
nvidia-smi

## Check available system memory
free -h

## Check Docker permissions
docker ps

## If you see a permission error (e.g., permission denied while trying to connect to the Docker daemon socket), run:
sudo usermod -aG docker $USER
newgrp docker
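
The checks above can also be collected into one script. The following is a minimal sketch and not part of the playbook: the function names `missing_tools` and `python_ok` are hypothetical, and the tool list simply mirrors the commands used in this step.

```python
import shutil
import sys

# Command-line tools this step invokes; adjust the tuple to your setup.
REQUIRED_TOOLS = ("nvcc", "nvidia-smi", "docker", "git")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_ok(version_info=sys.version_info, minimum=(3, 10)):
    """Return True when the interpreter is at least `minimum` (major, minor)."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    print("Missing tools:", missing_tools() or "none")
    print("Python 3.10+:", python_ok())
```

A tool reported as missing only means it is not on PATH for the current user; it does not replace running `nvidia-smi` or `docker ps` to confirm that the GPU and the Docker daemon are actually reachable.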

Step 2. Configure Docker permissions

To easily manage containers without sudo, you must be in the docker group. If you choose to skip this step, you will need to run Docker commands with sudo.

Open a new terminal and test Docker access. In the terminal, run:

docker ps

If you see a permission denied error (for example, permission denied while trying to connect to the Docker daemon socket), add your user to the docker group so that you don't need to run Docker commands with sudo.

sudo usermod -aG docker $USER
newgrp docker

Step 3. Get the container image with NeMo AutoModel

docker pull nvcr.io/nvidia/nemo-automodel:26.02

Step 4. Launch Docker

Launch an interactive container with GPU access. The --rm flag ensures the container is removed when you exit.

docker run \
  --gpus all \
  --ulimit memlock=-1 \
  -it --ulimit stack=67108864 \
  --entrypoint /usr/bin/bash \
  --rm nvcr.io/nvidia/nemo-automodel:26.02

Step 5. Explore available examples

Review the pre-configured training recipes available for different model types and training scenarios. These recipes provide optimized configurations for ARM64 and Blackwell architecture.

## Navigate to /opt/Automodel
cd /opt/Automodel

## List LLM fine-tuning examples
ls examples/llm_finetune/

## View the first lines of the fine-tuning entry-point script
head -20 examples/llm_finetune/finetune.py

Step 6. Run sample fine-tuning

The following commands show how to perform full fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) with LoRA and QLoRA.

First, export your HF_TOKEN so that gated models can be downloaded.

## Export your Hugging Face access token
export HF_TOKEN=<your_huggingface_token>

Note

Replace <your_huggingface_token> with your personal Hugging Face access token. A valid token is required to download any gated model.

Generate a token: Hugging Face tokens, guide available here.

Request and receive access on each model's page (and accept license/terms) before attempting downloads.

Llama-3.1-8B: meta-llama/Llama-3.1-8B

Qwen3-8B: Qwen/Qwen3-8B

Meta-Llama-3-70B: meta-llama/Meta-Llama-3-70B

The same steps apply for any other gated model you use: visit its model card on Hugging Face, request access, accept the license, and wait for approval.

LoRA fine-tuning example:

Execute a basic fine-tuning example to validate the complete setup. This demonstrates parameter-efficient fine-tuning with LoRA. The examples below use YAML files for configuration, and parameter overrides are passed as command-line arguments.

## Run basic LLM fine-tuning example
cd /opt/Automodel
python3 examples/llm_finetune/finetune.py \
-c examples/llm_finetune/llama3_2/llama3_2_1b_squad_peft.yaml \
--model.pretrained_model_name_or_path meta-llama/Llama-3.1-8B \
--packed_sequence.packed_sequence_size 1024 \
--step_scheduler.max_steps 20

These overrides ensure the Llama-3.1-8B LoRA run behaves as expected:

--model.pretrained_model_name_or_path: selects the Llama-3.1-8B model to fine-tune from the Hugging Face model hub (weights fetched via your Hugging Face token).
--packed_sequence.packed_sequence_size: sets the packed sequence size to 1024 to enable packed sequence training.
--step_scheduler.max_steps: sets the maximum number of training steps. It is set to 20 here for demonstration purposes; adjust it to your needs.
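
To make the override mechanism concrete, the sketch below shows how a dotted command-line key can be mapped onto a nested configuration such as one loaded from a recipe YAML. It is an illustration only: `apply_override` is a hypothetical helper, the starting values in `recipe` are placeholders, and NeMo AutoModel's own argument parser may work differently.

```python
def apply_override(config, dotted_key, value):
    """Set config[a][b]... = value for a key like "a.b", creating levels as needed."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return config


# Placeholder values standing in for what a recipe YAML might contain.
recipe = {
    "model": {"pretrained_model_name_or_path": "placeholder/base-model"},
    "step_scheduler": {"max_steps": 1000},
}

# The three overrides from the LoRA command above.
apply_override(recipe, "model.pretrained_model_name_or_path", "meta-llama/Llama-3.1-8B")
apply_override(recipe, "packed_sequence.packed_sequence_size", 1024)
apply_override(recipe, "step_scheduler.max_steps", 20)
print(recipe)
```

Each override replaces one leaf value and leaves the rest of the recipe untouched, which is why the same YAML can be reused with a different model name.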

Note

The recipe YAML llama3_2_1b_squad_peft.yaml defines training hyperparameters (LoRA rank, learning rate, etc.) that are reusable across Llama model sizes. The --model.pretrained_model_name_or_path override determines which model weights are actually loaded.

QLoRA fine-tuning example:

We can use QLoRA to fine-tune large models in a memory-efficient manner.

cd /opt/Automodel
python3 examples/llm_finetune/finetune.py \
-c examples/llm_finetune/llama3_1/llama3_1_8b_squad_qlora.yaml \
--model.pretrained_model_name_or_path meta-llama/Meta-Llama-3-70B \
--loss_fn._target_ nemo_automodel.components.loss.te_parallel_ce.TEParallelCrossEntropy \
--step_scheduler.local_batch_size 1 \
--packed_sequence.packed_sequence_size 1024 \
--step_scheduler.max_steps 20

These overrides ensure the 70B QLoRA run behaves as expected:

--model.pretrained_model_name_or_path: selects the 70B base model to fine-tune (weights fetched via your Hugging Face token).
--loss_fn._target_: uses the TransformerEngine-parallel cross-entropy loss variant compatible with tensor-parallel training for large LLMs.
--step_scheduler.local_batch_size: sets the per-GPU micro-batch size to 1 to fit 70B in memory; overall effective batch size is still driven by gradient accumulation and data/tensor parallel settings from the recipe.
--step_scheduler.max_steps: sets the maximum number of training steps. It is set to 20 here for demonstration purposes; adjust it to your needs.
--packed_sequence.packed_sequence_size: sets the packed sequence size to 1024 to enable packed sequence training.
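
The relationship between the per-GPU micro-batch size and the effective batch size can be written out as simple arithmetic. This is a sketch of the usual data-parallel relation, not code from NeMo AutoModel; the parameter names `grad_accum_steps` and `data_parallel_size` are hypothetical and the recipe may express the same quantities differently.

```python
def effective_batch_size(local_batch_size, grad_accum_steps, data_parallel_size):
    """Samples contributing to one optimizer step across all data-parallel ranks."""
    for name, value in (
        ("local_batch_size", local_batch_size),
        ("grad_accum_steps", grad_accum_steps),
        ("data_parallel_size", data_parallel_size),
    ):
        if value < 1:
            raise ValueError(f"{name} must be at least 1, got {value}")
    return local_batch_size * grad_accum_steps * data_parallel_size


# One Spark: micro-batch of 1, accumulated over 8 steps (illustrative numbers).
print(effective_batch_size(1, 8, 1))
# Two Sparks with the same settings, one data-parallel rank per node.
print(effective_batch_size(1, 8, 2))
```

Lowering the micro-batch size to 1 therefore reduces memory use per step without necessarily changing the effective batch size, as long as accumulation makes up the difference.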

Full Fine-tuning example:

Run the following command to perform full (SFT) fine-tuning:

cd /opt/Automodel
python3 examples/llm_finetune/finetune.py \
-c examples/llm_finetune/qwen/qwen3_8b_squad_spark.yaml \
--model.pretrained_model_name_or_path Qwen/Qwen3-8B \
--step_scheduler.local_batch_size 1 \
--step_scheduler.max_steps 20 \
--packed_sequence.packed_sequence_size 1024

These overrides ensure the Qwen3-8B SFT run behaves as expected:

--model.pretrained_model_name_or_path: selects the Qwen/Qwen3-8B model to fine-tune from the Hugging Face model hub (weights fetched via your Hugging Face token). Adjust this if you want to fine-tune a different model.
--step_scheduler.max_steps: sets the maximum number of training steps. It is set to 20 here for demonstration purposes; adjust it to your needs.
--step_scheduler.local_batch_size: sets the per-GPU micro-batch size to 1 to fit in memory; overall effective batch size is still driven by gradient accumulation and data/tensor parallel settings from the recipe.
--packed_sequence.packed_sequence_size: sets the packed sequence size to 1024 to enable packed sequence training.

Step 7. Validate successful training completion

Validate the fine-tuned model by inspecting artifacts contained in the checkpoint directory.

## Inspect logs and checkpoint output.
## LATEST is a symlink pointing to the most recent checkpoint saved during training.
## Below is an example of the expected output (username and domain-users are placeholders).
ls -lah checkpoints/LATEST/

## $ ls -lah checkpoints/LATEST/
## total 32K
## drwxr-xr-x 6 username domain-users 4.0K Oct 16 22:33 .
## drwxr-xr-x 4 username domain-users 4.0K Oct 16 22:33 ..
## -rw-r--r-- 1 username domain-users 1.6K Oct 16 22:33 config.yaml
## drwxr-xr-x 2 username domain-users 4.0K Oct 16 22:33 dataloader
## drwxr-xr-x 2 username domain-users 4.0K Oct 16 22:33 model
## drwxr-xr-x 2 username domain-users 4.0K Oct 16 22:33 optim
## drwxr-xr-x 2 username domain-users 4.0K Oct 16 22:33 rng
## -rw-r--r-- 1 username domain-users 1.3K Oct 16 22:33 step_scheduler.pt
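
The inspection can also be automated. The sketch below compares a checkpoint directory against the entries shown in the example listing above; the helper names are hypothetical, and the expected list is taken only from that listing, so other recipes or versions may write additional files.

```python
from pathlib import Path

# Entries that appear in the example listing of checkpoints/LATEST/.
EXPECTED_ENTRIES = (
    "config.yaml",
    "dataloader",
    "model",
    "optim",
    "rng",
    "step_scheduler.pt",
)


def missing_entries(present, expected=EXPECTED_ENTRIES):
    """Return the expected names that are absent from `present`."""
    present = set(present)
    return [name for name in expected if name not in present]


def check_checkpoint(path="checkpoints/LATEST"):
    """Return missing entries for a checkpoint directory (all of them if it does not exist)."""
    root = Path(path)
    if not root.is_dir():
        return list(EXPECTED_ENTRIES)
    return missing_entries(entry.name for entry in root.iterdir())


if __name__ == "__main__":
    missing = check_checkpoint()
    print("Checkpoint looks complete" if not missing else f"Missing: {missing}")
```

An empty result only confirms that the files exist; it says nothing about the quality of the fine-tuned weights.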

Step 8. Cleanup (Optional)

The container was launched with the --rm flag, so it is automatically removed when you exit. To reclaim disk space used by the Docker image, run:

Warning

This will remove the NeMo AutoModel image. You will need to pull it again if you want to use it later.

docker rmi nvcr.io/nvidia/nemo-automodel:26.02

Step 9. Optional: Publish your fine-tuned model checkpoint on Hugging Face Hub

Publish your fine-tuned model checkpoint on Hugging Face Hub.

Note

This step is optional and is not required for using the fine-tuned model. It is useful if you want to share your fine-tuned model with others or use it in other projects. You can also use the fine-tuned model in other projects by cloning the repository and using the checkpoint. Uploading requires the Hugging Face CLI, which you can install by running pip install huggingface_hub. For more information, refer to the Hugging Face CLI documentation.

Tip

You can use the hf command to upload the fine-tuned model checkpoint to Hugging Face Hub. For more information, please refer to the Hugging Face CLI documentation.

## Publish the fine-tuned model checkpoint to Hugging Face Hub
## will be published under the namespace <your_huggingface_username>/my-cool-model, adjust name as needed.
hf upload my-cool-model checkpoints/LATEST/model

Tip

The above command can fail if the HF_TOKEN you used does not have write permission on the Hugging Face Hub. Sample error message:
user@host:/opt/Automodel$ hf upload my-cool-model checkpoints/LATEST/model
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 409, in hf_raise_for_status
    response.raise_for_status()
  File "/home/user/.local/lib/python3.10/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://huggingface.co/api/repos/create
To fix this, create an access token with write permissions. See the Hugging Face guide here for instructions.

Step 10. Next steps

Begin using NeMo AutoModel for your specific fine-tuning tasks. Start with provided recipes and customize based on your model requirements and dataset.

## Copy a recipe for customization
cp examples/llm_finetune/finetune.py my_custom_training.py

## Edit configuration for your specific model and data, then run:
python3 my_custom_training.py

Explore the NeMo AutoModel GitHub repository for more recipes, documentation, and community examples. Consider setting up custom datasets, experimenting with different model architectures, and scaling to multi-node distributed training for larger models.

Run on two Sparks

Step 1. Configure network connectivity

Follow the network setup instructions from the Connect two Sparks playbook to establish connectivity between your DGX Spark nodes.

This includes:

Physical QSFP cable connection
Network interface configuration (automatic or manual IP assignment)
Passwordless SSH setup
Network connectivity verification

Note

Steps 2 to 8 must be conducted on each node.

Step 2. Configure Docker permissions

To easily manage containers without sudo, you must be in the docker group. If you choose to skip this step, you will need to run Docker commands with sudo.

Open a new terminal and test Docker access. In the terminal, run:

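docker ps

If you see a permission denied error (for example, permission denied while trying to connect to the Docker daemon socket), add your user to the docker group:

sudo usermod -aG docker $USER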
newgrp docker

Step 3. Install NVIDIA Container Toolkit & setup Docker environment

Ensure the NVIDIA drivers and the NVIDIA Container Toolkit are installed on each node (both master and worker) that will provide GPU resources. This package enables Docker containers to access the host's GPU hardware. Ensure you complete the installation steps, including the Docker configuration for NVIDIA Container Toolkit.

Step 4. Deploy Docker Containers

Download the pytorch-ft-entrypoint.sh script into your home directory and run the following command to make it executable:

chmod +x $HOME/pytorch-ft-entrypoint.sh

Deploy the docker container by running the following command:

docker run -d \
  --name automodel-node \
  --gpus all \
  --network host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --device=/dev/infiniband \
  -v "$HOME/pytorch-ft-entrypoint.sh":/opt/pytorch-ft-entrypoint.sh \
  -v "$HOME/.cache/huggingface/":/root/.cache/huggingface/ \
  -v "$HOME/.ssh":/tmp/.ssh:ro \
  -e UCX_NET_DEVICES=enp1s0f0np0,enp1s0f1np1 \
  -e NCCL_SOCKET_IFNAME=enp1s0f0np0,enp1s0f1np1 \
  -e GLOO_SOCKET_IFNAME=enp1s0f0np0,enp1s0f1np1 \
  -e NCCL_DEBUG=INFO \
  -e TORCH_NCCL_ASYNC_ERROR_HANDLING=1 \
  -e TORCH_DISTRIBUTED_DEBUG=INFO \
  -e CUDA_DEVICE_MAX_CONNECTIONS=1 \
  -e CUDA_VISIBLE_DEVICES=0 \
  nvcr.io/nvidia/pytorch:25.10-py3 \
  /opt/pytorch-ft-entrypoint.sh

Step 5. Install package management tools

Launch a terminal into your docker container on the node.

docker exec -it automodel-node bash

Note

All subsequent steps and commands, other than "Cleanup and rollback", should be run from within the docker container terminal.

Install uv for efficient package management and virtual environment isolation. NeMo AutoModel uses uv for dependency management and automatic environment handling.

## Install uv package manager
pip3 install uv

## Verify installation
uv --version

Step 6. Clone NeMo AutoModel repository

Clone the official NeMo AutoModel repository to access recipes and examples. This provides ready-to-use training configurations for various model types and training scenarios.

## Clone the repository
git clone https://github.com/NVIDIA-NeMo/Automodel.git

## Navigate to the repository
cd Automodel

Step 7. Install NeMo AutoModel

Set up the virtual environment and install NeMo AutoModel. Choose between wheel package installation for stability or source installation for latest features.

Install from wheel package (recommended):

## Initialize virtual environment
uv venv --system-site-packages

## Install packages with uv
uv sync --inexact --frozen --all-extras \
  --no-install-package torch \
  --no-install-package torchvision \
  --no-install-package triton \
  --no-install-package nvidia-cublas-cu12 \
  --no-install-package nvidia-cuda-cupti-cu12 \
  --no-install-package nvidia-cuda-nvrtc-cu12 \
  --no-install-package nvidia-cuda-runtime-cu12 \
  --no-install-package nvidia-cudnn-cu12 \
  --no-install-package nvidia-cufft-cu12 \
  --no-install-package nvidia-cufile-cu12 \
  --no-install-package nvidia-curand-cu12 \
  --no-install-package nvidia-cusolver-cu12 \
  --no-install-package nvidia-cusparse-cu12 \
  --no-install-package nvidia-cusparselt-cu12 \
  --no-install-package nvidia-nccl-cu12 \
  --no-install-package transformer-engine \
  --no-install-package nvidia-modelopt \
  --no-install-package nvidia-modelopt-core \
  --no-install-package flash-attn \
  --no-install-package transformer-engine-cu12 \
  --no-install-package transformer-engine-torch

## Install bitsandbytes
CMAKE_ARGS="-DCOMPUTE_BACKEND=cuda -DCOMPUTE_CAPABILITY=80;86;87;89;90" \
CMAKE_BUILD_PARALLEL_LEVEL=8 \
uv pip install --no-deps git+https://github.com/bitsandbytes-foundation/bitsandbytes.git@50be19c39698e038a1604daf3e1b939c9ac1c342

Step 8. Verify installation

Confirm NeMo AutoModel is properly installed and accessible. This step validates the installation and checks for any missing dependencies.

## Test NeMo AutoModel import
uv run --frozen --no-sync python -c "import nemo_automodel; print('✅ NeMo AutoModel ready')"

Note

You might see a warning stating grouped_gemm is not available. You can ignore this warning if you see '✅ NeMo AutoModel ready'.
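
If you want a slightly broader check than the single import above, the following sketch reports which modules cannot be located without importing them. It is not part of the playbook: `missing_modules` is a hypothetical helper, and the module list (`nemo_automodel`, `torch`, `bitsandbytes`) is an assumption based on the packages this section installs or relies on.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that the current environment cannot locate."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Assumed list; extend it with any other package your recipe needs.
    missing = missing_modules(["nemo_automodel", "torch", "bitsandbytes"])
    print("All modules found" if not missing else f"Missing modules: {missing}")
```

Run it with the same interpreter that the training commands use so that it inspects the virtual environment rather than the system Python.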

Note

Ensure steps 2 to 8 were conducted on all nodes for correct setup.

Step 9. Run sample multi-node fine-tuning

The following commands show how to perform full fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) with LoRA across both Spark devices using torch.distributed.run.

First, export your HF_TOKEN on both nodes so that gated models can be downloaded.

export HF_TOKEN=<your_huggingface_token>

Note

Replace <your_huggingface_token> with your personal Hugging Face access token. A valid token is required to download any gated model.

Generate a token: Hugging Face tokens, guide available here.

Request and receive access on each model's page (and accept license/terms) before attempting downloads.

Llama-3.1-8B: meta-llama/Llama-3.1-8B

Qwen3-8B: Qwen/Qwen3-8B

Mixtral-8x7B: mistralai/Mixtral-8x7B

The same steps apply for any other gated model you use: visit its model card on Hugging Face, request access, accept the license, and wait for approval.

Next, export a few multi-node PyTorch configuration environment variables.

MASTER_ADDR: IP address of your master node as set in Connect two Sparks, for example 192.168.100.10
MASTER_PORT: a free port on your master node, for example 12345
NODE_RANK: 0 on the Master node and 1 on the Worker node

Run this on the Master node

export MASTER_ADDR=<TODO: specify IP>
export MASTER_PORT=<TODO: specify port>
export NODE_RANK=0

Run this on the Worker node

export MASTER_ADDR=<TODO: specify IP>
export MASTER_PORT=<TODO: specify port>
export NODE_RANK=1
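
A mistyped address or rank usually shows up only as a hang at startup, so it can help to validate the three variables first. The sketch below is a hypothetical pre-flight check, not part of the playbook; it assumes MASTER_ADDR is an IP literal (as in this playbook) and treats 1024 to 65535 as the acceptable port range, which is a choice made for this example.

```python
import ipaddress
import os


def rendezvous_problems(env, nnodes=2):
    """Return a list of human-readable problems with the rendezvous variables in `env`."""
    problems = []

    addr = env.get("MASTER_ADDR", "")
    try:
        ipaddress.ip_address(addr)
    except ValueError:
        problems.append(f"MASTER_ADDR is not a valid IP address: {addr!r}")

    port = env.get("MASTER_PORT", "")
    if not (port.isdecimal() and 1024 <= int(port) <= 65535):
        problems.append(f"MASTER_PORT should be a number from 1024 to 65535: {port!r}")

    rank = env.get("NODE_RANK", "")
    if not (rank.isdecimal() and int(rank) < nnodes):
        problems.append(f"NODE_RANK should be a number from 0 to {nnodes - 1}: {rank!r}")

    return problems


if __name__ == "__main__":
    for problem in rendezvous_problems(os.environ):
        print("Problem:", problem)
```

The check does not verify that the port is actually free or that the nodes can reach each other; it only catches missing or malformed values.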

LoRA fine-tuning example:

Run this on all nodes:

uv run --frozen --no-sync python -m torch.distributed.run \
  --nnodes=2 \
  --nproc_per_node=1 \
  --node_rank=${NODE_RANK} \
  --rdzv_backend=static \
  --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
  examples/llm_finetune/finetune.py \
  -c examples/llm_finetune/llama3_2/llama3_2_1b_squad_peft.yaml \
  --model.pretrained_model_name_or_path meta-llama/Llama-3.1-8B \
  --packed_sequence.packed_sequence_size 1024 \
  --step_scheduler.max_steps 100

The following torch.distributed.run parameters configure our dual-node distributed PyTorch workload and communication:

--nnodes: sets the total number of nodes participating in the distributed training. This is 2 for our dual-node case.
--nproc_per_node: sets the number of processes to be executed on each node. 1 fine-tuning process will occur on each node in our example.
--node_rank: sets the rank of the current node. Again, Master rank is set to 0 and Worker rank is set to 1.
--rdzv_backend: sets the backend used for the rendezvous mechanism. The rendezvous mechanism allows nodes to discover each other and establish communication channels before beginning the distributed workload. We use static for a pre-configured rendezvous setup.
--rdzv_endpoint: sets the endpoint on which the rendezvous is expected to occur. This will be the Master node IP address and port specified earlier.
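
The launcher parameters above determine how many processes take part in training and which rank each one receives. The sketch below spells out that arithmetic, assuming the usual convention in which ranks are numbered contiguously node by node; the function names are hypothetical.

```python
def world_size(nnodes, nproc_per_node):
    """Total number of training processes across all nodes."""
    return nnodes * nproc_per_node


def global_rank(node_rank, nproc_per_node, local_rank):
    """Rank of one process, assuming contiguous numbering per node."""
    return node_rank * nproc_per_node + local_rank


# The dual-node setup from this playbook: 2 nodes, 1 process per node.
NNODES, NPROC_PER_NODE = 2, 1
for node in range(NNODES):
    for local in range(NPROC_PER_NODE):
        rank = global_rank(node, NPROC_PER_NODE, local)
        print(f"node_rank={node} local_rank={local} -> rank {rank} of {world_size(NNODES, NPROC_PER_NODE)}")
```

With one process per node, the global rank equals the node rank, which is why setting NODE_RANK to 0 on the Master node and 1 on the Worker node is sufficient here.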

These config overrides ensure the Llama-3.1-8B LoRA run behaves as expected:

--model.pretrained_model_name_or_path: selects the Llama-3.1-8B model to fine-tune from the Hugging Face model hub (weights fetched via your Hugging Face token).
--packed_sequence.packed_sequence_size: sets the packed sequence size to 1024 to enable packed sequence training.
--step_scheduler.max_steps: sets the maximum number of training steps. It is set to 100 here for demonstration purposes; adjust it to your needs.

Note

NCCL WARN NET/IB : roceP2p1s0f1:1 unknown event type (18) logs during multi-node workloads can be ignored and are a sign that RoCE is functional.

Full Fine-tuning example:

Run this on all nodes:

uv run --frozen --no-sync python -m torch.distributed.run \
  --nnodes=2 \
  --nproc_per_node=1 \
  --node_rank=${NODE_RANK} \
  --rdzv_backend=static \
  --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
  examples/llm_finetune/finetune.py \
  -c examples/llm_finetune/qwen/qwen3_8b_squad_spark.yaml \
  --model.pretrained_model_name_or_path Qwen/Qwen3-8B \
  --step_scheduler.local_batch_size 1 \
  --step_scheduler.max_steps 100 \
  --packed_sequence.packed_sequence_size 1024

These config overrides ensure the Qwen3-8B SFT run behaves as expected:

--model.pretrained_model_name_or_path: selects the Qwen/Qwen3-8B model to fine-tune from the Hugging Face model hub (weights fetched via your Hugging Face token). Adjust this if you want to fine-tune a different model.
--step_scheduler.max_steps: sets the maximum number of training steps. It is set to 100 here for demonstration purposes; adjust it to your needs.
--step_scheduler.local_batch_size: sets the per-GPU micro-batch size to 1 to fit in memory; overall effective batch size is still driven by gradient accumulation and data/tensor parallel settings from the recipe.
--packed_sequence.packed_sequence_size: sets the packed sequence size to 1024 to enable packed sequence training.

Step 10. Validate successful training completion

Validate the fine-tuned model by inspecting artifacts contained in the checkpoint directory on your Master node.

## Inspect logs and checkpoint output.
## The LATEST is a symlink pointing to the latest checkpoint.
## The checkpoint is the one that was saved during training.
## below is an example of the expected output (username and domain-users are placeholders).
ls -lah checkpoints/LATEST/

## root@gx10-f154:/workspace/Automodel# ls -lah checkpoints/LATEST/
## total 36K
## drwxr-xr-x 6 username domain-users 4.0K Dec  8 20:16 .
## drwxr-xr-x 3 username domain-users 4.0K Dec  8 20:16 ..
## -rw-r--r-- 1 username domain-users 1.6K Dec  8 20:16 config.yaml
## drwxr-xr-x 2 username domain-users 4.0K Dec  8 20:16 dataloader
## -rw-r--r-- 1 username domain-users   66 Dec  8 20:16 losses.json
## drwxr-xr-x 3 username domain-users 4.0K Dec  8 20:16 model
## drwxr-xr-x 2 username domain-users 4.0K Dec  8 20:16 optim
## drwxr-xr-x 2 username domain-users 4.0K Dec  8 20:16 rng
## -rw-r--r-- 1 username domain-users 1.3K Dec  8 20:16 step_scheduler.pt

Step 11. Cleanup and rollback

Stop and remove containers by using the following command on all nodes:

docker stop automodel-node
docker rm automodel-node

Warning

This removes all training data and performance reports. Copy checkpoints/ out of the container in advance if you want to keep it.

Troubleshooting

Common issues for running on a single Spark

Symptom	Cause	Fix
`nvcc: command not found`	CUDA toolkit not in PATH	Add CUDA toolkit to PATH: `export PATH=/usr/local/cuda/bin:$PATH`
`pip install uv` permission denied	System-level pip restrictions	Use `pip3 install --user uv` and update PATH
GPU not detected in training	CUDA driver/runtime mismatch	Verify driver compatibility: `nvidia-smi` and reinstall CUDA if needed
Out of memory during training	Model too large for available GPU memory	Reduce batch size, enable gradient checkpointing, or use model parallelism
ARM64 package compatibility issues	Package not available for ARM architecture	Use source installation or build from source with ARM64 flags
Cannot access gated repo for URL	Certain Hugging Face models have restricted access	Regenerate your Hugging Face token and request access to the gated model in your web browser

Common issues for running on two Sparks

Symptom	Cause	Fix
`nvcc: command not found`	CUDA toolkit not in PATH	Add CUDA toolkit to PATH: `export PATH=/usr/local/cuda/bin:$PATH`
Container exits immediately	Missing entrypoint script	Ensure `pytorch-ft-entrypoint.sh` download succeeded and has executable permissions
`The container name "/automodel-node" is already in use`	Another docker container of the same name is in use on the node (likely forgotten during clean up)	Remove (or rename) the old container or rename the new one
GPU not detected in training	CUDA driver/runtime mismatch	Verify driver compatibility: `nvidia-smi` and reinstall CUDA if needed
Out of memory during training	Model too large for available GPU memory	Reduce batch size, enable gradient checkpointing, or use model parallelism
Cannot access gated repo for URL	Certain Hugging Face models have restricted access	Regenerate your Hugging Face token and request access to the gated model in your web browser
Checkpoint loading failure when running fine-tuning examples consecutively: `No such file or directory: 'checkpoints/epoch_0_step_/'`	Fine-tuning script attempts to load old checkpoints unsuccessfully	Remove the `checkpoints/` directory before running again
`Unable to find address for: enp1s0f0np0` when attempting single node fine-tuning run on multi-node container	`enp1s0f0np0` is not configured with an IP	Verify network configuration or, if you configured the devices on `enp1s0f1np1`, set `NCCL_SOCKET_IFNAME` and `GLOO_SOCKET_IFNAME` to only `enp1s0f1np1`

Note

DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. With many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
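
To see whether flushing the cache made a difference, you can compare available memory before and after. The sketch below parses the standard Linux `/proc/meminfo` format; `parse_meminfo` is a hypothetical helper, and reading `MemAvailable` is only a rough indicator of what a training process can allocate on a unified-memory system.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of name -> integer value.

    Most entries are reported in kB; a few (such as HugePages_Total) are plain counts.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdecimal():
            values[name.strip()] = int(fields[0])
    return values


if __name__ == "__main__":
    with open("/proc/meminfo") as handle:
        info = parse_meminfo(handle.read())
    available_gib = info.get("MemAvailable", 0) / (1024 * 1024)
    print(f"MemAvailable: {available_gib:.1f} GiB")
```

Run it once before and once after the flush command; a noticeable increase suggests the buffer cache was holding memory that the workload needed.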
