
# Unsloth on DGX Spark

> Optimized fine-tuning with Unsloth

## Table of Contents

- [Overview](#overview)
- [Instructions](#instructions)

---

## Overview

## Basic idea

- **Performance-first**: Unsloth claims to speed up training (e.g. 2× faster on a single GPU, up to 30× in multi-GPU setups) and to reduce memory usage compared to standard methods.
- **Kernel-level optimizations**: Core compute is built with custom kernels (e.g. with Triton) and hand-optimized math to boost throughput and efficiency.  
- **Quantization & model formats**: Supports dynamic quantization (4-bit, 16-bit) and GGUF formats to reduce footprint, while aiming to retain accuracy.    
- **Broad model support**: Works with many LLMs (LLaMA, Mistral, Qwen, DeepSeek, etc.) and allows training, fine-tuning, exporting to formats like Ollama, vLLM, GGUF, Hugging Face.   
- **Simplified interface**: Provides easy-to-use notebooks and tools so users can fine-tune models with minimal boilerplate.  
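
To make the footprint reduction from quantization concrete, here is a minimal back-of-the-envelope sketch. It covers the weights only and ignores quantization metadata, activations, optimizer state, and LoRA adapters; the 8-billion-parameter size is an illustrative assumption, not a measurement from this playbook.

```python
def weight_footprint_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


# Hypothetical 8-billion-parameter model, used only for illustration.
params = 8e9
fp16_gib = weight_footprint_gib(params, 16)
int4_gib = weight_footprint_gib(params, 4)

print(f"16-bit: {fp16_gib:.1f} GiB, 4-bit: {int4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights take a quarter of the 16-bit footprint; real usage will be higher once the ignored terms are included.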

## What you'll accomplish

You'll set up Unsloth for optimized fine-tuning of large language models on NVIDIA Spark devices,
targeting up to 2x faster training with reduced memory usage through
parameter-efficient fine-tuning methods like LoRA and QLoRA.

## What to know before starting

- Python package management with pip and virtual environments
- Hugging Face Transformers library basics (loading models, tokenizers, datasets)
- GPU fundamentals (CUDA/GPU vs CPU, VRAM constraints, device availability)
- Basic understanding of LLM training concepts (loss functions, checkpoints)
- Familiarity with prompt engineering and base model interaction
- Optional: LoRA/QLoRA parameter-efficient fine-tuning knowledge

## Prerequisites

- NVIDIA Spark device with Blackwell GPU architecture
- `nvidia-smi` shows a summary of GPU information
- CUDA 13.0 installed: `nvcc --version`
- Internet access for downloading models and datasets

## Ancillary files

The Python test script can be found [here on GitLab](https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}/assets/test_unsloth.py).


## Time & risk

* **Duration**: 30-60 minutes for initial setup and test run
* **Risks**: 
  * Triton compiler version mismatches may cause compilation errors
  * CUDA toolkit configuration issues may prevent kernel compilation
  * Memory constraints may require batch size adjustments, especially with larger models
* **Rollback**: Uninstall packages with `pip uninstall unsloth torch torchvision`.
* DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. With many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
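
Before flushing, it can help to see how much memory the page cache is holding. The following is a minimal sketch that parses `/proc/meminfo`; treating `Cached` plus `Buffers` as the reclaimable amount is a simplifying assumption, and the helper names are hypothetical.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field_name: value}, usually in kB."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def cache_gib(fields: dict) -> float:
    """Rough size of the buffer and page cache in GiB (values are in kB)."""
    return (fields.get("Cached", 0) + fields.get("Buffers", 0)) / 2**20


meminfo_path = Path("/proc/meminfo")
if meminfo_path.exists():  # only present on Linux
    info = parse_meminfo(meminfo_path.read_text())
    print(f"Available: {info.get('MemAvailable', 0) / 2**20:.1f} GiB")
    print(f"Buffers + cache: {cache_gib(info):.1f} GiB")
```

If the cache figure is large while available memory is low, dropping the caches as shown above is more likely to help.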

## Instructions

## Step 1. Verify prerequisites

Confirm your NVIDIA Spark device has the required CUDA toolkit and GPU resources available.

```bash
nvcc --version
```
The output should show CUDA 13.0.

```bash
nvidia-smi
```
The output should show a summary of GPU information.
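
The CUDA check can also be scripted. This is a minimal sketch that assumes `nvcc --version` prints a line containing `release <major>.<minor>`; the function names are hypothetical.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_nvcc_release(output: str) -> Optional[str]:
    """Extract the CUDA release (e.g. '13.0') from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", output)
    return match.group(1) if match else None


def installed_cuda_release() -> Optional[str]:
    """Return the CUDA release reported by nvcc, or None if nvcc is unavailable."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True, check=False
    )
    return parse_nvcc_release(result.stdout)


release = installed_cuda_release()
print("CUDA release:", release if release else "nvcc not found")
```

Compare the printed release against the 13.0 requirement listed in the prerequisites.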

## Step 2. Get the container image
```bash
docker pull nvcr.io/nvidia/pytorch:25.09-py3
```

## Step 3. Launch Docker
```bash
docker run --gpus all --ulimit memlock=-1 -it --ulimit stack=67108864 --entrypoint /usr/bin/bash --rm nvcr.io/nvidia/pytorch:25.09-py3
```

## Step 4. Install dependencies inside Docker

```bash
pip install transformers peft datasets "trl==0.19.1"
pip install --no-deps unsloth unsloth_zoo
```

## Step 5. Install bitsandbytes inside Docker
```bash
pip install --no-deps bitsandbytes
```
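
Because several packages are installed with `--no-deps`, it may be worth confirming that everything from Steps 4 and 5 is actually present. A minimal sketch using the standard library's `importlib.metadata` follows; the package list simply mirrors the `pip install` commands above.

```python
from importlib import metadata

# Packages installed in Steps 4 and 5.
PACKAGES = [
    "transformers", "peft", "datasets", "trl",
    "unsloth", "unsloth_zoo", "bitsandbytes",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name:15} {version or 'NOT INSTALLED'}")
```

The `trl` line should report 0.19.1, matching the pin in Step 4.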

## Step 6. Create Python test script

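Download the test script from [GitLab](https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}/assets/test_unsloth.py) into the container. Use the `/-/raw/` form of the URL so that `curl` saves the file itself rather than GitLab's HTML view of it. The `${MODEL}` segment is a placeholder: set `MODEL` in your shell to the matching asset directory before running the command.

```bash
curl -fLO "https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/raw/main/${MODEL}/assets/test_unsloth.py"
```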

We will use this test script to validate the installation with a simple fine-tuning task.
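
A download can silently save an HTML page (for example a GitLab `blob` view or a sign-in page) instead of the script. The following minimal sketch checks that the saved file parses as Python; the function name is hypothetical.

```python
import ast
from pathlib import Path


def looks_like_python(source: str) -> bool:
    """Return True if the text parses as Python and is not an HTML page."""
    head = source.lstrip()[:200].lower()
    if head.startswith("<!doctype html") or head.startswith("<html"):
        return False
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return True


script = Path("test_unsloth.py")
if script.exists():
    ok = looks_like_python(script.read_text())
    print("valid Python" if ok else "not Python: check the download URL")
```

This only validates syntax; it does not verify that the script's contents are the expected ones.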


## Step 7. Run the validation test

Execute the test script to verify Unsloth is working correctly.

```bash
python test_unsloth.py
```

Expected output in the terminal window:
- "Unsloth: Will patch your computer to enable 2x faster free finetuning"
- Training progress bars showing loss decreasing over 60 steps
- Final training metrics showing completion
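
To check the "loss decreasing" expectation without reading the log by eye, here is a minimal sketch. It assumes the trainer prints one dict-like line per logging step containing a `'loss'` key, and that the output was saved to a file named `train.log` (for example with `python test_unsloth.py 2>&1 | tee train.log`); both the log format and the filename are assumptions.

```python
import re
from pathlib import Path
from statistics import mean

# Assumed log line format, e.g.:
# {'loss': 2.31, 'grad_norm': 0.8, 'learning_rate': 0.0002, 'epoch': 0.01}
LOSS_PATTERN = re.compile(r"'loss':\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)")


def extract_losses(log_text: str) -> list:
    """Collect every per-step 'loss' value found in the log text."""
    return [float(value) for value in LOSS_PATTERN.findall(log_text)]


def loss_decreased(losses: list, window: int = 5) -> bool:
    """Compare the mean of the last `window` losses with the first `window`."""
    if len(losses) < 2:
        return False
    window = max(1, min(window, len(losses) // 2))
    return mean(losses[-window:]) < mean(losses[:window])


log_file = Path("train.log")  # hypothetical filename
if log_file.exists():
    losses = extract_losses(log_file.read_text())
    print(f"{len(losses)} loss values, decreasing: {loss_decreased(losses)}")
```

Averaging over a window rather than comparing single steps makes the check less sensitive to step-to-step noise.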

## Step 8. Next steps

Test with your own model and dataset by updating the `test_unsloth.py` file:

```python
# Replace line 32 with your model choice
model_name = "unsloth/Meta-Llama-3.1-8B-bnb-4bit"

# Load your custom dataset in line 8
dataset = load_dataset("your_dataset_name")

# Adjust training parameter args at line 61
per_device_train_batch_size = 4
max_steps = 1000
```
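
When adjusting these values, it helps to know how many training examples a run will consume. The sketch below does that arithmetic; `gradient_accumulation_steps` and `num_devices` are assumptions that do not appear in the snippet above, and both default to 1.

```python
def training_budget(
    per_device_train_batch_size: int,
    max_steps: int,
    gradient_accumulation_steps: int = 1,  # assumed, not set in the snippet above
    num_devices: int = 1,                  # assumed single-GPU run
):
    """Return (effective batch size, total examples consumed by the run)."""
    effective_batch = (
        per_device_train_batch_size * gradient_accumulation_steps * num_devices
    )
    return effective_batch, effective_batch * max_steps


effective, examples = training_budget(per_device_train_batch_size=4, max_steps=1000)
print(f"effective batch size: {effective}, examples seen: {examples}")
```

If the dataset is smaller than the number of examples seen, the run will make more than one pass over the data.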

Visit https://github.com/unslothai/unsloth/wiki
for advanced usage instructions, including:
- [Saving models in GGUF format for vLLM](https://github.com/unslothai/unsloth/wiki#saving-to-gguf)
- [Continued training from checkpoints](https://github.com/unslothai/unsloth/wiki#loading-lora-adapters-for-continued-finetuning)
- [Using custom chat templates](https://github.com/unslothai/unsloth/wiki#chat-templates)
- [Running evaluation loops](https://github.com/unslothai/unsloth/wiki#evaluation-loop---also-fixes-oom-or-crashing)