2025-10-05 17:01:34 +00:00
# Fine-tune with PyTorch
> Use PyTorch to fine-tune models locally
## Table of Contents
- [Overview](#overview)
- [Instructions](#instructions)
---
## Overview
This playbook guides you through setting up and using PyTorch for fine-tuning large language models on NVIDIA DGX Spark devices.
## What you'll accomplish
You'll establish a complete fine-tuning environment for large language models (1B-70B parameters) on your NVIDIA DGX Spark device. By the end, you'll have a working installation that supports parameter-efficient fine-tuning (PEFT) and supervised fine-tuning (SFT).
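For intuition on why PEFT methods such as LoRA are feasible on a single device, compare trainable parameter counts: a rank-r adapter trains two thin matrices instead of the full weight. A back-of-envelope sketch (the hidden size and rank are illustrative assumptions, not values from the recipes):

```python
# Trainable parameters for one d_in x d_out weight matrix:
# full fine-tuning touches every entry; a rank-r LoRA adapter
# trains only the two low-rank factors A (r x d_in) and B (d_out x r).
def full_params(d_in, d_out):
    return d_in * d_out

def lora_params(d_in, d_out, r):
    return r * (d_in + d_out)

d = 4096   # typical hidden size for an 8B-class model (assumption)
r = 16     # a common LoRA rank (assumption)
print(full_params(d, d))      # 16,777,216 entries
print(lora_params(d, d, r))   # 131,072 entries, under 1% of the full matrix
```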
## What to know before starting
These recipes are specifically for the NVIDIA DGX Spark. Make sure the OS and NVIDIA drivers are up to date.
## Ancillary files
All files required for fine-tuning are included.
## Time & risk
**Time estimate:** 30-45 minutes for setup and launching fine-tuning. Fine-tuning run time varies with model size.
**Risks:** Model downloads can be large (several GB), ARM64 package compatibility issues may require troubleshooting.
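To gauge the download risk, weight files scale with parameter count times bytes per parameter; a rough estimate (illustrative only, ignoring tokenizer, config, and optimizer files):

```python
# Rough checkpoint size: parameters x bytes per parameter.
# bf16/fp16 weights use 2 bytes per parameter (assumes no quantization).
def checkpoint_gb(params_billions, bytes_per_param=2):
    return params_billions * bytes_per_param  # 1e9 params x bytes / 1e9 = GB

print(checkpoint_gb(8))    # ~16 GB for an 8B model in bf16
print(checkpoint_gb(70))   # ~140 GB for a 70B model in bf16
```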
**Rollback:**
## Instructions
## Step 1. Configure Docker permissions
To manage containers without `sudo`, your user must be in the `docker` group. If you skip this step, prefix every Docker command with `sudo`.
Open a new terminal and test Docker access. In the terminal, run:
```bash
docker ps
```
If you see a permission denied error (something like `permission denied while trying to connect to the Docker daemon socket`), add your user to the `docker` group:
```bash
sudo usermod -aG docker $USER
```
> **Warning**: After running usermod, you must log out and log back in to start a new
> session with updated group permissions.
## Step 2. Pull the latest PyTorch container
```bash
docker pull nvcr.io/nvidia/pytorch:25.09-py3
```
2025-10-07 13:16:58 +00:00
## Step 3. Launch the container
```bash
docker run --gpus all -it --rm --ipc=host \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
-v ${PWD}:/workspace -w /workspace \
nvcr.io/nvidia/pytorch:25.09-py3
```
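Once the container starts, it is worth confirming that PyTorch sees the GPU before installing anything else. A quick sanity check (run inside the container):

```python
# Verify the container's PyTorch build can see the GPU.
import torch

print(torch.__version__)
print(torch.cuda.is_available())   # expect True inside the container
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```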
## Step 4. Install dependencies inside the container
```bash
pip install transformers peft datasets "trl==0.19.1" "bitsandbytes==0.48"
```
## Step 5. Authenticate with Hugging Face
```bash
huggingface-cli login
# Paste your Hugging Face token when prompted.
# Enter "n" when asked about git credentials.
```
## Step 6. Clone the Git repo with fine-tuning recipes
```bash
git clone <github link>
cd assets
```
## Step 7. Run the fine-tuning recipes
To run LoRA fine-tuning on Llama3-8B, use the following command:
```bash
python Llama3_8B_LoRA_finetuning.py
```
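What a LoRA recipe does to each targeted layer can be sketched in a few lines: the frozen weight W is left untouched, and a scaled low-rank product B·A is added to its output. A minimal NumPy illustration of the forward pass (shapes and scaling follow the standard LoRA formulation; this is not the recipe's actual code, and the sizes are toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 8, 16             # hidden size, LoRA rank, scaling (toy values)

W = rng.normal(size=(d, d))         # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01  # trainable, small random init
B = np.zeros((d, r))                # trainable, zero init => no change at start

def lora_forward(x):
    # y = W x + (alpha / r) * B (A x); only A and B receive gradients.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d)
# With B initialized to zero, the adapted layer matches the base layer exactly.
print(np.allclose(lora_forward(x), W @ x))  # True
```

Zero-initializing B is what lets training start from the pretrained model's exact behavior and drift only as the adapter learns.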
To run QLoRA fine-tuning on Llama3-70B, use the following command:
```bash
python Llama3_70B_qLoRA_finetuning.py
```
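QLoRA makes the 70B run feasible by holding the frozen base weights in 4-bit precision while training small higher-precision adapters; the memory arithmetic is the core of the trick (rough figures, ignoring activations, KV cache, and optimizer state):

```python
# Approximate memory for the frozen base weights at different precisions.
def weights_gb(params_billions, bits):
    return params_billions * bits / 8   # 1e9 params x (bits/8) bytes / 1e9 = GB

print(weights_gb(70, 16))  # ~140 GB in bf16: impractical on one device
print(weights_gb(70, 4))   # ~35 GB in 4-bit: fits alongside the LoRA adapters
```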
To run full fine-tuning on Llama3-3B, use the following command:
```bash
python Llama3_3B_full_finetuning.py
```