Unsloth on DGX Spark
Optimized fine-tuning with Unsloth
Overview
Basic idea
- Performance-first: It claims to speed up training (e.g. 2× faster on single GPU, up to 30× in multi-GPU setups) and reduce memory usage compared to standard methods.
- Kernel-level optimizations: Core compute is built with custom kernels (e.g. with Triton) and hand-optimized math to boost throughput and efficiency.
- Quantization & model formats: Supports dynamic quantization (4-bit, 16-bit) and GGUF formats to reduce footprint, while aiming to retain accuracy.
- Broad model support: Works with many LLMs (LLaMA, Mistral, Qwen, DeepSeek, etc.) and supports training, fine-tuning, and exporting to targets such as Ollama, vLLM, GGUF, and Hugging Face.
- Simplified interface: Provides easy-to-use notebooks and tools so users can fine-tune models with minimal boilerplate.
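To give a flavor of that interface, here is a minimal sketch of loading a pre-quantized model and attaching LoRA adapters; the model name and hyperparameters are illustrative choices, not requirements.

```python
# Minimal sketch of Unsloth's high-level interface.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",  # pre-quantized 4-bit checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small fraction of the weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
)
```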
What you'll accomplish
You'll set up Unsloth for optimized fine-tuning of large language models on the NVIDIA DGX Spark, achieving up to 2x faster training with reduced memory usage through parameter-efficient fine-tuning methods such as LoRA and QLoRA.
What to know before starting
- Python package management with pip and virtual environments
- Hugging Face Transformers library basics (loading models, tokenizers, datasets)
- GPU fundamentals (CUDA/GPU vs CPU, VRAM constraints, device availability)
- Basic understanding of LLM training concepts (loss functions, checkpoints)
- Familiarity with prompt engineering and base model interaction
- Optional: LoRA/QLoRA parameter-efficient fine-tuning knowledge
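If that last item is unfamiliar, the core LoRA idea fits in a few lines: freeze the base weight matrix and train a small low-rank additive update. A toy sketch (shapes are illustrative):

```python
import torch

# LoRA in miniature: instead of updating a frozen d x k weight W,
# train a low-rank product B @ A added on top of it.
d, k, r = 4096, 4096, 16
W = torch.randn(d, k)        # frozen base weight
A = torch.randn(r, k)        # trainable (standard LoRA init: A random, B zero)
B = torch.zeros(d, r)        # trainable

W_eff = W + B @ A            # effective weight (often scaled by alpha / r)

# Trainable parameters shrink from d*k to r*(d+k):
print(d * k, r * (d + k))    # 16777216 vs 131072
```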
Prerequisites
- NVIDIA DGX Spark device with Blackwell GPU architecture (`nvidia-smi` shows a summary of GPU information)
- CUDA 13.0 installed (verify with `nvcc --version`)
- Internet access for downloading models and datasets
Ancillary files
The Python test script (`test_unsloth.py`) can be found in the playbook's assets repository; Step 6 below downloads it with `curl`.
Time & risk
- Duration: 30-60 minutes for initial setup and test run
- Risks:
- Triton compiler version mismatches may cause compilation errors
- CUDA toolkit configuration issues may prevent kernel compilation
- Memory constraints on smaller models require batch size adjustments
- Rollback: uninstall packages with `pip uninstall unsloth torch torchvision`.
- DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. Because many applications are still being updated to take advantage of UMA, you may encounter memory issues even when within the memory capacity of the DGX Spark. If that happens, manually flush the buffer cache with:
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
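To confirm a flush actually freed memory, you can compare available memory before and after; a minimal Python sketch reading `/proc/meminfo` (Linux-only, which is what DGX Spark runs):

```python
# Report available system memory; on DGX Spark this reflects the unified
# memory pool shared by CPU and GPU. Run before and after flushing the cache.
def mem_available_gib() -> float:
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / (1024 ** 2)  # kB -> GiB
    raise RuntimeError("MemAvailable not found in /proc/meminfo")

print(f"{mem_available_gib():.1f} GiB available")
```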
Instructions
Step 1. Verify prerequisites
Confirm your NVIDIA DGX Spark has the required CUDA toolkit and GPU resources available.
nvcc --version
The output should show CUDA 13.0.
nvidia-smi
The output should show a summary of GPU information.
Step 2. Get the container image
docker pull nvcr.io/nvidia/pytorch:25.09-py3
Step 3. Launch Docker
docker run --gpus all --ulimit memlock=-1 -it --ulimit stack=67108864 --entrypoint /usr/bin/bash --rm nvcr.io/nvidia/pytorch:25.09-py3
Step 4. Install dependencies inside Docker
pip install transformers peft datasets "trl==0.19.1"
pip install --no-deps unsloth unsloth_zoo
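Before moving on, a quick import check catches most installation problems. A minimal sketch to run with `python` inside the container:

```python
# Sanity check: Unsloth and its companions import, and PyTorch sees the GPU.
# Unsloth recommends importing it before transformers so its patches apply.
import unsloth  # noqa: F401
import torch
import transformers
import trl

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers", transformers.__version__, "| trl", trl.__version__)
```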
Step 5. Install bitsandbytes inside Docker
pip install --no-deps bitsandbytes
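Because bitsandbytes is installed with `--no-deps`, it is worth confirming it loads against the container's CUDA build (bitsandbytes also ships a diagnostic entry point, `python -m bitsandbytes`). A minimal sketch:

```python
# Confirm bitsandbytes imports cleanly and can run a tiny 4-bit linear layer.
import torch
import bitsandbytes as bnb

print("bitsandbytes", bnb.__version__)

layer = bnb.nn.Linear4bit(64, 64, compute_dtype=torch.float16).cuda()  # weights quantized on .cuda()
x = torch.randn(1, 64, dtype=torch.float16, device="cuda")
print(layer(x).shape)  # expect torch.Size([1, 64])
```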
Step 6. Create Python test script
Download the test script into the container with `curl` (the URL contains a `${MODEL}` shell variable, which must be set in your environment first):
curl -O https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}/assets/test_unsloth.py
We will use this test script to validate the installation with a simple fine-tuning task.
Step 7. Run the validation test
Execute the test script to verify Unsloth is working correctly.
python test_unsloth.py
Expected output in the terminal window:
- "Unsloth: Will patch your computer to enable 2x faster free finetuning"
- Training progress bars showing loss decreasing over 60 steps
- Final training metrics showing completion
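The exact contents of `test_unsloth.py` live in the assets repository, but conceptually a run of this kind looks roughly like the sketch below; the model, dataset, and hyperparameters here are placeholders, not the script's actual values.

```python
# Rough shape of a short Unsloth SFT run (placeholder model and dataset).
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=16, target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]
)

dataset = load_dataset("stanfordnlp/imdb", split="train[:1%]")  # placeholder dataset

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        dataset_text_field="text",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=60,  # matches the 60 steps noted above
        learning_rate=2e-4,
        logging_steps=1,
        output_dir="outputs",
    ),
)
trainer.train()
```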
Step 8. Next steps
Test with your own model and dataset by updating the test_unsloth.py file:
```python
# Replace line 32 with your model choice
model_name = "unsloth/Meta-Llama-3.1-8B-bnb-4bit"

# Load your custom dataset at line 8
dataset = load_dataset("your_dataset_name")

# Adjust the training arguments at line 61
per_device_train_batch_size = 4
max_steps = 1000
```
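If your dataset doesn't already expose a single text column, map it into one before training. A hedged sketch with hypothetical column names (`question`, `answer`); adapt to your own schema:

```python
from datasets import load_dataset

dataset = load_dataset("your_dataset_name", split="train")

def to_text(example):
    # Collapse structured fields into the single text column the trainer consumes.
    return {
        "text": f"### Question:\n{example['question']}\n\n### Answer:\n{example['answer']}"
    }

dataset = dataset.map(to_text)
```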
Visit https://github.com/unslothai/unsloth/wiki for advanced usage instructions.