dgx-spark-playbooks/nvidia/pytorch-fine-tune/README.md

# Fine-tune with Pytorch

> Use Pytorch to fine-tune models locally

## Table of Contents

- [Overview](#overview)
- [Instructions](#instructions)
- [Run on two Sparks](#run-on-two-sparks)
  - [Step 1. Configure network connectivity](#step-1-configure-network-connectivity)
  - [Step 2. Configure Docker permissions](#step-2-configure-docker-permissions)
  - [Step 3. Install NVIDIA Container Toolkit & setup Docker environment](#step-3-install-nvidia-container-toolkit-setup-docker-environment)
  - [Step 4. Enable resource advertising](#step-4-enable-resource-advertising)
  - [Step 5. Initialize Docker Swarm](#step-5-initialize-docker-swarm)
  - [Step 6. Join worker nodes and deploy](#step-6-join-worker-nodes-and-deploy)
  - [Step 7. Find your Docker container ID](#step-7-find-your-docker-container-id)
  - [Step 9. Adapt the configuration files](#step-9-adapt-the-configuration-files)
  - [Step 10. Run finetuning scripts](#step-10-run-finetuning-scripts)
  - [Step 14. Cleanup and rollback](#step-14-cleanup-and-rollback)
- [Troubleshooting](#troubleshooting)

---

## Overview

## Basic idea

This playbook guides you through setting up and using Pytorch for fine-tuning large language models on NVIDIA Spark devices.

## What you'll accomplish

You'll establish a complete fine-tuning environment for large language models (1-70B parameters) on your NVIDIA Spark device. 
By the end, you'll have a working installation that supports parameter-efficient fine-tuning (PEFT) and supervised fine-tuning (SFT).

## What to know before starting

- Previous experience with fine-tuning in Pytorch
- Working with Docker


## Prerequisites
Recipes are specifically for DIGITS SPARK. Please make sure that OS and drivers are latest.


## Ancillary files

ALl files required for fine-tuning are included in the folder in [the GitHub repository here](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/pytorch-fine-tune).

## Time & risk

* **Time estimate:** 30-45 mins for setup and runing fine-tuning. Fine-tuning run time varies depending on model size 
* **Risks:** Model downloads can be large (several GB), ARM64 package compatibility issues may require troubleshooting.
* **Last Updated:** 12/16/2025
  * Fix broken commands to access files from GitHub
  * Upgrade to latest pytorch container version nvcr.io/nvidia/pytorch:25.11-py3
  * Temporarily remove Llama3_70B_qLoRA_finetuning command to investigate slowness.

## Instructions

## Step 1. Configure Docker permissions

To easily manage containers without sudo, you must be in the `docker` group. If you choose to skip this step, you will need to run Docker commands with sudo.

Open a new terminal and test Docker access. In the terminal, run:

```bash
docker ps
```

If you see a permission denied error (something like permission denied while trying to connect to the Docker daemon socket), add your user to the docker group so that you don't need to run the command with sudo .

```bash
sudo usermod -aG docker $USER
newgrp docker
```

## Step 2.  Pull the latest Pytorch container

```bash
docker pull nvcr.io/nvidia/pytorch:25.11-py3
```

## Step 3. Launch Docker

```bash
docker run --gpus all -it --rm --ipc=host \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
-v ${PWD}:/workspace -w /workspace \
nvcr.io/nvidia/pytorch:25.11-py3
```

## Step 4. Install dependencies inside the container

```bash
pip install transformers peft datasets trl bitsandbytes
```

## Step 5: Authenticate with Huggingface

```bash
hf auth login
##<input your huggingface token.
##<Enter n for git credential>
```

## Step 6:  Clone the git repo with fine-tuning recipes

```bash
git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-spark-playbooks/nvidia/pytorch-fine-tune/assets
```

## Step7: Run the fine-tuning recipes

To run LoRA on Llama3-8B use the following command:
```bash
python Llama3_8B_LoRA_finetuning.py
```

To run full fine-tuning on llama3-3B use the following command:
```bash
python Llama3_3B_full_finetuning.py
```

## Run on two Sparks

### Step 1. Configure network connectivity

Follow the network setup instructions from the [Connect two Sparks](https://build.nvidia.com/spark/connect-two-sparks/stacked-sparks) playbook to establish connectivity between your DGX Spark nodes.

This includes:
- Physical QSFP cable connection
- Network interface configuration (automatic or manual IP assignment)
- Passwordless SSH setup
- Network connectivity verification

### Step 2. Configure Docker permissions

To easily manage containers without sudo, you must be in the `docker` group. If you choose to skip this step, you will need to run Docker commands with sudo.

Open a new terminal and test Docker access. In the terminal, run:

```bash
docker ps
```

If you see a permission denied error (something like permission denied while trying to connect to the Docker daemon socket), add your user to the docker group so that you don't need to run the command with sudo .

```bash
sudo usermod -aG docker $USER
newgrp docker
```
### Step 3. Install NVIDIA Container Toolkit & setup Docker environment

Ensure the NVIDIA drivers and the NVIDIA Container Toolkit are installed on each node (both manager and workers) that will provide GPU resources. This package enables Docker containers to access the host's GPU hardware. Ensure you complete the [installation steps](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html), including the [Docker configuration](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-docker) for NVIDIA Container Toolkit.

### Step 4. Enable resource advertising

First, find your GPU UUID by running:
```bash
nvidia-smi -a | grep UUID
```

Next, modify the Docker daemon configuration to advertise the GPU to Swarm. Edit **/etc/docker/daemon.json**:

```bash
sudo nano /etc/docker/daemon.json
```

Add or modify the file to include the nvidia runtime and GPU UUID (replace **GPU-45cbf7b3-f919-7228-7a26-b06628ebefa1** with your actual GPU UUID):

```json
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia",
  "node-generic-resources": [
    "NVIDIA_GPU=GPU-45cbf7b3-f919-7228-7a26-b06628ebefa1"
    ]
}
```

Modify the NVIDIA Container Runtime to advertise the GPUs to the Swarm by uncommenting the swarm-resource line in the **config.toml** file. You can do this either with your preferred text editor (e.g., vim, nano...) or with the following command:
```bash
sudo sed -i 's/^#\s*\(swarm-resource\s*=\s*".*"\)/\1/' /etc/nvidia-container-runtime/config.toml
```

Finally, restart the Docker daemon to apply all changes:
```bash
sudo systemctl restart docker
```

Repeat these steps on all nodes.

### Step 5. Initialize Docker Swarm

On whichever node you want to use as primary, run the following swarm initialization command
```bash
docker swarm init --advertise-addr $(ip -o -4 addr show enp1s0f0np0 | awk '{print $4}' | cut -d/ -f1) $(ip -o -4 addr show enp1s0f1np1 | awk '{print $4}' | cut -d/ -f1)
```

The typical output of the above would be similar to the following:
```
Swarm initialized: current node (node-id) is now a manager.

To add a worker to this swarm, run the following command:

    docker swarm join --token <worker-token> <advertise-addr>:<port>

To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.
```

### Step 6. Join worker nodes and deploy

Now we can proceed with setting up the worker nodes of your cluster. Repeat these steps on all worker nodes.

Run the command suggested by the docker swarm init on each worker node to join the Docker swarm
```bash
docker swarm join --token <worker-token> <advertise-addr>:<port>
```

On both nodes, download the [**pytorch-ft-entrypoint.sh**](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/pytorch-fine-tune/assets/pytorch-ft-entrypoint.sh) script into the directory containing your finetuning scripts and configuration files and run the following command to make it executable:

```bash
chmod +x $PWD/pytorch-ft-entrypoint.sh
```

On your primary node, deploy the Finetuning multi-node stack by downloading the [**docker-compose.yml**](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/pytorch-fine-tune/assets/docker-compose.yml) file into the same directory as in the previous step and running the following command:
```bash
docker stack deploy -c $PWD/docker-compose.yml finetuning-multinode
```
> [!NOTE]
> Ensure you download both files into the same directory from which you are running the command.

You can verify the status of your worker nodes using the following
```bash
docker stack ps finetuning-multinode
```

If everything is healthy, you should see a similar output to the following:
```
nvidia@spark-1b3b:~$ docker stack ps finetuning-multinode
ID             NAME                                IMAGE                              NODE         DESIRED STATE   CURRENT STATE            ERROR     PORTS
vlun7z9cacf9   finetuning-multinode_finetunine.1   nvcr.io/nvidia/pytorch:25.10-py3   spark-1d84   Running         Starting 2 seconds ago             
tjl49zicvxoi   finetuning-multinode_finetunine.2   nvcr.io/nvidia/pytorch:25.10-py3   spark-1b3b   Running         Starting 2 seconds ago             

```

> [!NOTE]
> If your "Current state" is not "Running", see troubleshooting section for more information.

### Step 7. Find your Docker container ID

You can use `docker ps` to find your Docker container ID. You can save the container ID in a variable as shown below. Run this command on both nodes.
```bash
export FINETUNING_CONTAINER=$(docker ps -q -f name=finetuning-multinode)
```

### Step 9. Adapt the configuration files

For multi-node runs, we provide 2 configuration files:
- [**config_finetuning.yaml**](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/pytorch-fine-tune/assets/configs/config_finetuning.yaml) used for full finetuning of Llama3 3B.
- [**config_fsdp_lora.yaml**](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/pytorch-fine-tune/assets/configs/config_fsdp_lora.yaml) used for finetuning with LoRa and FSDP of Llama3 8B and Llama3 70B.

These configuration files need to be adapted:
- Set `machine_rank` on each of your nodes according to its rank. Your master node should have a rank `0`. The second node has a rank `1`.
- Set `main_process_ip` using the IP address of your master node. Ensure that both configuration files have the same value. Use `ifconfig` on your main node to find the correct value for the CX-7 IP address on this node.
- Set a port number that can be used on your main node.

The fields that need to be filled in your YAML files:

```bash
machine_rank: 0
main_process_ip: < TODO: specify IP >
main_process_port: < TODO: specify port >
```

All the scripts and configuration files are available in this [**repository**](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/pytorch-fine-tune/assets).

### Step 10. Run finetuning scripts

Once you successfully run the previous steps, you can use one of the `run-multi-llama_*` scripts for finetuning. Here is an example for Llama3 70B using LoRa for finetuning and FSDP2.

```bash
## Need to specify huggingface token for model download.
export HF_TOKEN=<your-huggingface-token>

docker exec \
  -e HF_TOKEN=$HF_TOKEN \
  -it $FINETUNING_CONTAINER bash -c '
  bash /workspace/install-requirements;
  accelerate launch --config_file=/workspace/configs/config_fsdp_lora.yaml /workspace/Llama3_70B_LoRA_finetuning.py'
```

During the run, the progress bar of the finetuning will appear on your main node's stdout only. This is an expected behavior as `accelerate` uses a wrapper around `tqdm` to display the progress on the main process only as explained [**here**](https://github.com/huggingface/accelerate/blob/main/src/accelerate/utils/tqdm.py#L25). Using `nvidia-smi` on the worker node should show that the GPU is used.

### Step 14. Cleanup and rollback

Stop and remove containers by using the following command on the leader node:

```bash
docker stack rm finetuning-multinode
```

Remove downloaded models to free disk space:

```bash
rm -rf $HOME/.cache/huggingface/hub/models--meta-llama* $HOME/.cache/huggingface/hub/datasets*
```

## Troubleshooting

| Symptom | Cause | Fix |
|---------|--------|-----|
| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser |

> [!NOTE]
> DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. 
> With many applications still updating to take advantage of UMA, you may encounter memory issues even when within 
> the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
chore: Regenerate all playbooks 2025-10-28 14:35:31 +00:00			`# Fine-tune with Pytorch`
chore: Regenerate all playbooks 2025-10-05 17:01:34 +00:00
			`> Use Pytorch to fine-tune models locally`

			`## Table of Contents`

			`- [Overview](#overview)`
			`- [Instructions](#instructions)`
chore: Regenerate all playbooks 2025-11-21 15:47:41 +00:00			`- [Run on two Sparks](#run-on-two-sparks)`
			`- [Step 1. Configure network connectivity](#step-1-configure-network-connectivity)`
			`- [Step 2. Configure Docker permissions](#step-2-configure-docker-permissions)`
			`- [Step 3. Install NVIDIA Container Toolkit & setup Docker environment](#step-3-install-nvidia-container-toolkit-setup-docker-environment)`
			`- [Step 4. Enable resource advertising](#step-4-enable-resource-advertising)`
			`- [Step 5. Initialize Docker Swarm](#step-5-initialize-docker-swarm)`
			`- [Step 6. Join worker nodes and deploy](#step-6-join-worker-nodes-and-deploy)`
			`- [Step 7. Find your Docker container ID](#step-7-find-your-docker-container-id)`
			`- [Step 9. Adapt the configuration files](#step-9-adapt-the-configuration-files)`
			`- [Step 10. Run finetuning scripts](#step-10-run-finetuning-scripts)`
			`- [Step 14. Cleanup and rollback](#step-14-cleanup-and-rollback)`
chore: Regenerate all playbooks 2025-10-10 00:11:49 +00:00			`- [Troubleshooting](#troubleshooting)`
chore: Regenerate all playbooks 2025-10-05 17:01:34 +00:00
			`---`

			`## Overview`

chore: Regenerate all playbooks 2025-10-07 18:54:16 +00:00			`## Basic idea`
chore: Regenerate all playbooks 2025-10-05 17:01:34 +00:00
chore: Regenerate all playbooks 2025-10-06 15:35:14 +00:00			`This playbook guides you through setting up and using Pytorch for fine-tuning large language models on NVIDIA Spark devices.`
chore: Regenerate all playbooks 2025-10-05 17:01:34 +00:00
			`## What you'll accomplish`

chore: Regenerate all playbooks 2025-10-07 18:54:16 +00:00			`You'll establish a complete fine-tuning environment for large language models (1-70B parameters) on your NVIDIA Spark device.`
			`By the end, you'll have a working installation that supports parameter-efficient fine-tuning (PEFT) and supervised fine-tuning (SFT).`

chore: Regenerate all playbooks 2025-10-05 17:01:34 +00:00			`## What to know before starting`

chore: Regenerate all playbooks 2025-10-07 19:26:54 +00:00			`- Previous experience with fine-tuning in Pytorch`
			`- Working with Docker`

chore: Regenerate all playbooks 2025-10-05 17:01:34 +00:00

			`## Prerequisites`
chore: Regenerate all playbooks 2025-10-07 18:54:16 +00:00			`Recipes are specifically for DIGITS SPARK. Please make sure that OS and drivers are latest.`
chore: Regenerate all playbooks 2025-10-05 17:01:34 +00:00

			`## Ancillary files`

chore: Regenerate all playbooks 2025-10-13 17:36:51 +00:00			`ALl files required for fine-tuning are included in the folder in [the GitHub repository here](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/pytorch-fine-tune).`
chore: Regenerate all playbooks 2025-10-05 17:01:34 +00:00
			`## Time & risk`

chore: Regenerate all playbooks 2025-10-08 22:00:07 +00:00			`* Time estimate: 30-45 mins for setup and runing fine-tuning. Fine-tuning run time varies depending on model size`
			`* Risks: Model downloads can be large (several GB), ARM64 package compatibility issues may require troubleshooting.`
chore: Regenerate all playbooks 2025-12-17 15:56:41 +00:00			`* Last Updated: 12/16/2025`
chore: Regenerate all playbooks 2025-12-11 20:20:28 +00:00			`* Fix broken commands to access files from GitHub`
chore: Regenerate all playbooks 2025-12-16 03:14:04 +00:00			`* Upgrade to latest pytorch container version nvcr.io/nvidia/pytorch:25.11-py3`
chore: Regenerate all playbooks 2025-12-17 15:56:41 +00:00			`* Temporarily remove Llama3_70B_qLoRA_finetuning command to investigate slowness.`
chore: Regenerate all playbooks 2025-10-05 17:01:34 +00:00
			`## Instructions`

chore: Regenerate all playbooks 2025-10-07 13:16:58 +00:00			`## Step 1. Configure Docker permissions`

			To easily manage containers without sudo, you must be in the `docker` group. If you choose to skip this step, you will need to run Docker commands with sudo.

			`Open a new terminal and test Docker access. In the terminal, run:`

			```bash
			`docker ps`
			```

chore: Regenerate all playbooks 2025-10-12 18:54:50 +00:00			`If you see a permission denied error (something like permission denied while trying to connect to the Docker daemon socket), add your user to the docker group so that you don't need to run the command with sudo .`
chore: Regenerate all playbooks 2025-10-07 13:16:58 +00:00
			```bash
			`sudo usermod -aG docker $USER`
chore: Regenerate all playbooks 2025-10-12 18:54:50 +00:00			`newgrp docker`
chore: Regenerate all playbooks 2025-10-07 13:16:58 +00:00			```

			`## Step 2. Pull the latest Pytorch container`
chore: Regenerate all playbooks 2025-10-05 17:01:34 +00:00
			```bash
chore: Regenerate all playbooks 2025-12-16 03:14:04 +00:00			`docker pull nvcr.io/nvidia/pytorch:25.11-py3`
chore: Regenerate all playbooks 2025-10-05 17:01:34 +00:00			```

chore: Regenerate all playbooks 2025-10-07 13:16:58 +00:00			`## Step 3. Launch Docker`
chore: Regenerate all playbooks 2025-10-05 17:01:34 +00:00
			```bash
chore: Regenerate all playbooks 2025-10-06 15:35:14 +00:00			`docker run --gpus all -it --rm --ipc=host \`
			`-v $HOME/.cache/huggingface:/root/.cache/huggingface \`
			`-v ${PWD}:/workspace -w /workspace \`
chore: Regenerate all playbooks 2025-12-16 03:14:04 +00:00			`nvcr.io/nvidia/pytorch:25.11-py3`
chore: Regenerate all playbooks 2025-10-05 17:01:34 +00:00			```

chore: Regenerate all playbooks 2025-10-07 18:54:16 +00:00			`## Step 4. Install dependencies inside the container`
chore: Regenerate all playbooks 2025-10-05 17:01:34 +00:00
			```bash
chore: Regenerate all playbooks 2025-12-16 03:14:04 +00:00			`pip install transformers peft datasets trl bitsandbytes`
chore: Regenerate all playbooks 2025-10-05 17:01:34 +00:00			```

chore: Regenerate all playbooks 2025-10-07 18:54:16 +00:00			`## Step 5: Authenticate with Huggingface`
chore: Regenerate all playbooks 2025-10-06 15:32:36 +00:00
chore: Regenerate all playbooks 2025-10-06 14:46:10 +00:00			```bash
chore: Regenerate all playbooks 2025-12-16 03:14:04 +00:00			`hf auth login`
chore: Regenerate all playbooks 2025-10-06 15:35:14 +00:00			`##<input your huggingface token.`
			`##<Enter n for git credential>`
chore: Regenerate all playbooks 2025-10-07 17:45:25 +00:00			```

chore: Regenerate all playbooks 2025-10-07 19:03:50 +00:00			`## Step 6: Clone the git repo with fine-tuning recipes`
chore: Regenerate all playbooks 2025-10-06 15:32:36 +00:00
chore: Regenerate all playbooks 2025-10-07 17:45:25 +00:00			```bash
chore: Regenerate all playbooks 2025-10-13 17:36:51 +00:00			`git clone https://github.com/NVIDIA/dgx-spark-playbooks`
chore: Regenerate all playbooks 2025-11-07 15:04:17 +00:00			`cd dgx-spark-playbooks/nvidia/pytorch-fine-tune/assets`
chore: Regenerate all playbooks 2025-10-06 15:35:14 +00:00			```
chore: Regenerate all playbooks 2025-10-06 15:32:36 +00:00
chore: Regenerate all playbooks 2025-10-07 18:54:16 +00:00			`## Step7: Run the fine-tuning recipes`
chore: Regenerate all playbooks 2025-10-07 17:45:25 +00:00
			`To run LoRA on Llama3-8B use the following command:`
chore: Regenerate all playbooks 2025-10-06 15:35:14 +00:00			```bash
			`python Llama3_8B_LoRA_finetuning.py`
chore: Regenerate all playbooks 2025-10-06 14:46:10 +00:00			```
chore: Regenerate all playbooks 2025-10-06 15:32:36 +00:00
chore: Regenerate all playbooks 2025-10-07 18:54:16 +00:00			`To run full fine-tuning on llama3-3B use the following command:`
chore: Regenerate all playbooks 2025-10-06 15:35:14 +00:00			```bash
			`python Llama3_3B_full_finetuning.py`
			```
chore: Regenerate all playbooks 2025-10-10 00:11:49 +00:00
chore: Regenerate all playbooks 2025-11-21 15:47:41 +00:00			`## Run on two Sparks`

			`### Step 1. Configure network connectivity`

			`Follow the network setup instructions from the [Connect two Sparks](https://build.nvidia.com/spark/connect-two-sparks/stacked-sparks) playbook to establish connectivity between your DGX Spark nodes.`

			`This includes:`
			`- Physical QSFP cable connection`
			`- Network interface configuration (automatic or manual IP assignment)`
			`- Passwordless SSH setup`
			`- Network connectivity verification`

			`### Step 2. Configure Docker permissions`

			To easily manage containers without sudo, you must be in the `docker` group. If you choose to skip this step, you will need to run Docker commands with sudo.

			`Open a new terminal and test Docker access. In the terminal, run:`

			```bash
			`docker ps`
			```

			`If you see a permission denied error (something like permission denied while trying to connect to the Docker daemon socket), add your user to the docker group so that you don't need to run the command with sudo .`

			```bash
			`sudo usermod -aG docker $USER`
			`newgrp docker`
			```
			`### Step 3. Install NVIDIA Container Toolkit & setup Docker environment`

			Ensure the NVIDIA drivers and the NVIDIA Container Toolkit are installed on each node (both manager and workers) that will provide GPU resources. This package enables Docker containers to access the host's GPU hardware. Ensure you complete the [installation steps](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html), including the [Docker configuration](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-docker) for NVIDIA Container Toolkit.

			`### Step 4. Enable resource advertising`

			`First, find your GPU UUID by running:`
			```bash
			`nvidia-smi -a \| grep UUID`
			```

			`Next, modify the Docker daemon configuration to advertise the GPU to Swarm. Edit /etc/docker/daemon.json:`

			```bash
			`sudo nano /etc/docker/daemon.json`
			```

			`Add or modify the file to include the nvidia runtime and GPU UUID (replace GPU-45cbf7b3-f919-7228-7a26-b06628ebefa1 with your actual GPU UUID):`

			```json
			`{`
			`"runtimes": {`
			`"nvidia": {`
			`"path": "nvidia-container-runtime",`
			`"runtimeArgs": []`
			`}`
			`},`
			`"default-runtime": "nvidia",`
			`"node-generic-resources": [`
			`"NVIDIA_GPU=GPU-45cbf7b3-f919-7228-7a26-b06628ebefa1"`
			`]`
			`}`
			```

			`Modify the NVIDIA Container Runtime to advertise the GPUs to the Swarm by uncommenting the swarm-resource line in the config.toml file. You can do this either with your preferred text editor (e.g., vim, nano...) or with the following command:`
			```bash
			`sudo sed -i 's/^#\s\(swarm-resource\s=\s"."\)/\1/' /etc/nvidia-container-runtime/config.toml`
			```

			`Finally, restart the Docker daemon to apply all changes:`
			```bash
			`sudo systemctl restart docker`
			```

			`Repeat these steps on all nodes.`

			`### Step 5. Initialize Docker Swarm`

			`On whichever node you want to use as primary, run the following swarm initialization command`
			```bash
			`docker swarm init --advertise-addr $(ip -o -4 addr show enp1s0f0np0 \| awk '{print $4}' \| cut -d/ -f1) $(ip -o -4 addr show enp1s0f1np1 \| awk '{print $4}' \| cut -d/ -f1)`
			```

			`The typical output of the above would be similar to the following:`
			```
			`Swarm initialized: current node (node-id) is now a manager.`

			`To add a worker to this swarm, run the following command:`

			`docker swarm join --token <worker-token> <advertise-addr>:<port>`

			`To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.`
			```

			`### Step 6. Join worker nodes and deploy`

			`Now we can proceed with setting up the worker nodes of your cluster. Repeat these steps on all worker nodes.`

			`Run the command suggested by the docker swarm init on each worker node to join the Docker swarm`
			```bash
			`docker swarm join --token <worker-token> <advertise-addr>:<port>`
			```

			`On both nodes, download the [pytorch-ft-entrypoint.sh](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/pytorch-fine-tune/assets/pytorch-ft-entrypoint.sh) script into the directory containing your finetuning scripts and configuration files and run the following command to make it executable:`

			```bash
			`chmod +x $PWD/pytorch-ft-entrypoint.sh`
			```

			`On your primary node, deploy the Finetuning multi-node stack by downloading the [docker-compose.yml](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/pytorch-fine-tune/assets/docker-compose.yml) file into the same directory as in the previous step and running the following command:`
			```bash
			`docker stack deploy -c $PWD/docker-compose.yml finetuning-multinode`
			```
			`> [!NOTE]`
			`> Ensure you download both files into the same directory from which you are running the command.`

			`You can verify the status of your worker nodes using the following`
			```bash
			`docker stack ps finetuning-multinode`
			```

			`If everything is healthy, you should see a similar output to the following:`
			```
			`nvidia@spark-1b3b:~$ docker stack ps finetuning-multinode`
			`ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS`
			`vlun7z9cacf9 finetuning-multinode_finetunine.1 nvcr.io/nvidia/pytorch:25.10-py3 spark-1d84 Running Starting 2 seconds ago`
			`tjl49zicvxoi finetuning-multinode_finetunine.2 nvcr.io/nvidia/pytorch:25.10-py3 spark-1b3b Running Starting 2 seconds ago`

			```

			`> [!NOTE]`
			`> If your "Current state" is not "Running", see troubleshooting section for more information.`

			`### Step 7. Find your Docker container ID`

			You can use `docker ps` to find your Docker container ID. You can save the container ID in a variable as shown below. Run this command on both nodes.
			```bash
			`export FINETUNING_CONTAINER=$(docker ps -q -f name=finetuning-multinode)`
			```

			`### Step 9. Adapt the configuration files`

			`For multi-node runs, we provide 2 configuration files:`
			`- [config_finetuning.yaml](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/pytorch-fine-tune/assets/configs/config_finetuning.yaml) used for full finetuning of Llama3 3B.`
			`- [config_fsdp_lora.yaml](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/pytorch-fine-tune/assets/configs/config_fsdp_lora.yaml) used for finetuning with LoRa and FSDP of Llama3 8B and Llama3 70B.`

			`These configuration files need to be adapted:`
			- Set `machine_rank` on each of your nodes according to its rank. Your master node should have a rank `0`. The second node has a rank `1`.
chore: Regenerate all playbooks 2025-12-18 00:18:05 +00:00			- Set `main_process_ip` using the IP address of your master node. Ensure that both configuration files have the same value. Use `ifconfig` on your main node to find the correct value for the CX-7 IP address on this node.
chore: Regenerate all playbooks 2025-11-21 15:47:41 +00:00			`- Set a port number that can be used on your main node.`

			`The fields that need to be filled in your YAML files:`

			```bash
			`machine_rank: 0`
			`main_process_ip: < TODO: specify IP >`
			`main_process_port: < TODO: specify port >`
			```

chore: Regenerate all playbooks 2025-12-18 00:18:05 +00:00			`All the scripts and configuration files are available in this [repository](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/pytorch-fine-tune/assets).`

chore: Regenerate all playbooks 2025-11-21 15:47:41 +00:00			`### Step 10. Run finetuning scripts`

			Once you successfully run the previous steps, you can use one of the `run-multi-llama_*` scripts for finetuning. Here is an example for Llama3 70B using LoRa for finetuning and FSDP2.

			```bash
			`## Need to specify huggingface token for model download.`
			`export HF_TOKEN=<your-huggingface-token>`

			`docker exec \`
			`-e HF_TOKEN=$HF_TOKEN \`
			`-it $FINETUNING_CONTAINER bash -c '`
			`bash /workspace/install-requirements;`
			`accelerate launch --config_file=/workspace/configs/config_fsdp_lora.yaml /workspace/Llama3_70B_LoRA_finetuning.py'`
			```

chore: Regenerate all playbooks 2025-12-18 00:18:05 +00:00			During the run, the progress bar of the finetuning will appear on your main node's stdout only. This is an expected behavior as `accelerate` uses a wrapper around `tqdm` to display the progress on the main process only as explained [here](https://github.com/huggingface/accelerate/blob/main/src/accelerate/utils/tqdm.py#L25). Using `nvidia-smi` on the worker node should show that the GPU is used.

chore: Regenerate all playbooks 2025-11-21 15:47:41 +00:00			`### Step 14. Cleanup and rollback`

			`Stop and remove containers by using the following command on the leader node:`

			```bash
			`docker stack rm finetuning-multinode`
			```

			`Remove downloaded models to free disk space:`

			```bash
			`rm -rf $HOME/.cache/huggingface/hub/models--meta-llama* $HOME/.cache/huggingface/hub/datasets*`
			```

chore: Regenerate all playbooks 2025-10-10 00:11:49 +00:00			`## Troubleshooting`

chore: Regenerate all playbooks 2025-10-10 20:59:55 +00:00			`\| Symptom \| Cause \| Fix \|`
			`\|---------\|--------\|-----\|`
			`\| Cannot access gated repo for URL \| Certain HuggingFace models have restricted access \| Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser \|`

chore: Regenerate all playbooks 2025-10-12 20:53:42 +00:00			`> [!NOTE]`
			`> DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.`
chore: Regenerate all playbooks 2025-10-10 00:11:49 +00:00			`> With many applications still updating to take advantage of UMA, you may encounter memory issues even when within`
			`> the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:`
			```bash
			`sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'`
			```