# Fine-tune with PyTorch
> Use PyTorch to fine-tune models locally
## Table of Contents
- [Overview](#overview)
- [Instructions](#instructions)
---
## Overview
## Basic idea
This playbook guides you through setting up and using PyTorch to fine-tune large language models on NVIDIA Spark devices.
## What you'll accomplish
You'll establish a complete fine-tuning environment for large language models (1-70B parameters) on your NVIDIA Spark device.
By the end, you'll have a working installation that supports parameter-efficient fine-tuning (PEFT) and supervised fine-tuning (SFT).
## What to know before starting
- Previous experience with fine-tuning in PyTorch
- Experience working with Docker
## Prerequisites
These recipes are written specifically for DGX Spark. Make sure the OS and drivers are up to date.
## Ancillary files
All files required for fine-tuning are included in [this folder of the GitLab repository](https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}).
## Time & risk
**Time estimate:** 30-45 minutes for setup and a fine-tuning run; the run time varies with model size.

**Risks:** Model downloads can be large (several GB), and ARM64 package compatibility issues may require troubleshooting.
## Instructions
## Step 1. Configure Docker permissions
To easily manage containers without sudo, you must be in the `docker` group. If you choose to skip this step, you will need to run Docker commands with sudo.
Open a new terminal and test Docker access. In the terminal, run:
```bash
docker ps
```
If you see a permission denied error (something like `permission denied while trying to connect to the Docker daemon socket`), add your user to the docker group:
```bash
sudo usermod -aG docker $USER
```
> **Warning**: After running usermod, you must log out and log back in to start a new
> session with updated group permissions.
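
The warning above matters because a session reads its group membership when it starts. The following is a minimal sketch for checking this, assuming a Linux system with coreutils `id`; the `has_group` helper is hypothetical and introduced only for illustration. `id -nG` with no argument reports the groups of the current session, while `id -nG <user>` reports the groups recorded for the account.

```bash
# Succeeds if the space-separated group list in $1 contains the group named in $2.
has_group() {
  printf '%s\n' "$1" | tr ' ' '\n' | grep -qxF -- "$2"
}

session_groups="$(id -nG)"               # groups active in the current session
account_groups="$(id -nG "$(id -un)")"   # groups recorded for the account

if has_group "$session_groups" docker; then
  echo "docker group is active in this session"
elif has_group "$account_groups" docker; then
  echo "account is in the docker group, but this session predates the change: log out and back in"
else
  echo "not in the docker group: run the usermod command above"
fi
```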
## Step 2. Pull the latest PyTorch container
```bash
docker pull nvcr.io/nvidia/pytorch:25.09-py3
```
## Step 3. Launch Docker
```bash
docker run --gpus all -it --rm --ipc=host \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -v "${PWD}:/workspace" -w /workspace \
  nvcr.io/nvidia/pytorch:25.09-py3
```
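
The launch command bind-mounts the Hugging Face cache from the host. When the source directory of a `-v` bind mount does not exist, Docker creates it for you, and the new directory is typically owned by root. As a small optional precaution (not part of the original recipe), you can create the directory as your own user on the host before launching:

```bash
# Create the Hugging Face cache directory as the current user so that
# Docker does not have to create the missing bind-mount source itself.
mkdir -p "$HOME/.cache/huggingface"

# Show the owner and permissions of the directory.
ls -ld "$HOME/.cache/huggingface"
```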
## Step 4. Install dependencies inside the container
```bash
pip install transformers peft datasets "trl==0.19.1" "bitsandbytes==0.48"
```
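
Because `trl` and `bitsandbytes` are pinned above, it can help to confirm which versions were actually installed. The following is a minimal sketch using `pip show`; the `report_versions` helper name is hypothetical. Packages that are not installed are reported as such rather than causing an error.

```bash
# Print one line per package: the name followed by its installed version.
report_versions() {
  local pkg version
  for pkg in transformers peft datasets trl bitsandbytes; do
    version="$(pip show "$pkg" 2>/dev/null | awk '/^Version:/ {print $2}')"
    echo "$pkg ${version:-not installed}"
  done
}

report_versions
```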
## Step 5. Authenticate with Hugging Face
```bash
huggingface-cli login
# Enter your Hugging Face token when prompted.
# Enter n when asked whether to add the token as a git credential.
```
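
If you prefer to avoid the interactive prompt, `huggingface-cli login` also accepts the token through its `--token` option. The sketch below assumes you have exported the token in an environment variable named `HF_TOKEN`; the `hf_login_noninteractive` wrapper is hypothetical and only guards against an empty token.

```bash
# Log in with the token from $HF_TOKEN; fail early if the variable is unset or empty.
hf_login_noninteractive() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set" >&2
    return 1
  fi
  huggingface-cli login --token "$HF_TOKEN"
}

# Usage (inside the container):
#   export HF_TOKEN=<your token>
#   hf_login_noninteractive
```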
## Step 6. Clone the git repo with fine-tuning recipes
```bash
# Clone the repository root; git cannot clone a /-/blob/ page URL.
git clone https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets.git
# ${MODEL} is a placeholder for the recipe folder under the repository root.
cd dgx-spark-playbook-assets/${MODEL}/assets
```
## Step 7. Run the fine-tuning recipes
To run LoRA fine-tuning on Llama3-8B, use the following command:
```bash
python Llama3_8B_LoRA_finetuning.py
```
To run QLoRA fine-tuning on Llama3-70B, use the following command:
```bash
python Llama3_70B_qLoRA_finetuning.py
```
To run full fine-tuning on Llama3-3B, use the following command:
```bash
python Llama3_3B_full_finetuning.py
```
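
The three recipes pair larger models with lighter-weight methods. A rough way to see why is to estimate the memory needed just to hold the model weights: billions of parameters times bits per parameter, divided by 8, gives gigabytes. The sketch below assumes 16-bit weights for the LoRA and full fine-tuning recipes and a 4-bit quantized base model for QLoRA; the scripts themselves may use different settings, and the `weights_gb` helper is hypothetical. The estimate ignores gradients, optimizer state, activations, and quantization overhead, so actual usage is higher.

```bash
# Approximate weight memory in GB (10^9 bytes):
# billions of parameters x bits per parameter / 8.
weights_gb() {
  local params_billion="$1" bits_per_param="$2"
  echo $(( params_billion * bits_per_param / 8 ))
}

echo "3B  at 16-bit: $(weights_gb 3 16) GB"
echo "8B  at 16-bit: $(weights_gb 8 16) GB"
echo "70B at 16-bit: $(weights_gb 70 16) GB"
echo "70B at 4-bit:  $(weights_gb 70 4) GB"
```

Compare these figures against the memory available on your device when choosing a recipe.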