Unsloth on DGX Spark
Optimized fine-tuning with Unsloth
Overview
Basic idea
- Performance-first: It claims to speed up training (e.g. 2× faster on single GPU, up to 30× in multi-GPU setups) and reduce memory usage compared to standard methods.
- Kernel-level optimizations: Core compute is built with custom kernels (e.g. with Triton) and hand-optimized math to boost throughput and efficiency.
- Quantization & model formats: Supports dynamic quantization (4-bit, 16-bit) and GGUF formats to reduce footprint, while aiming to retain accuracy.
- Broad model support: Works with many LLMs (LLaMA, Mistral, Qwen, DeepSeek, etc.) and supports training, fine-tuning, and exporting to targets such as Ollama, vLLM, GGUF, and Hugging Face.
- Simplified interface: Provides easy-to-use notebooks and tools so users can fine-tune models with minimal boilerplate.
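To give a flavor of that interface, here is a minimal sketch of loading a pre-quantized model and attaching LoRA adapters; the model name and hyperparameters are illustrative choices, not requirements.

```python
# Minimal sketch of Unsloth's high-level interface.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",  # pre-quantized 4-bit checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small fraction of the weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
)
```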
What you'll accomplish
You'll set up Unsloth for optimized fine-tuning of large language models on the NVIDIA DGX Spark, achieving up to 2x faster training with reduced memory usage through parameter-efficient fine-tuning methods such as LoRA and QLoRA.
What to know before starting
- Python package management with pip and virtual environments
- Hugging Face Transformers library basics (loading models, tokenizers, datasets)
- GPU fundamentals (CUDA/GPU vs CPU, VRAM constraints, device availability)
- Basic understanding of LLM training concepts (loss functions, checkpoints)
- Familiarity with prompt engineering and base model interaction
- Optional: LoRA/QLoRA parameter-efficient fine-tuning knowledge
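If that last item is unfamiliar, the core LoRA idea fits in a few lines: freeze the base weight matrix and train a small low-rank additive update. A toy sketch (shapes are illustrative):

```python
import torch

# LoRA in miniature: instead of updating a frozen d x k weight W,
# train a low-rank product B @ A added on top of it.
d, k, r = 4096, 4096, 16
W = torch.randn(d, k)        # frozen base weight
A = torch.randn(r, k)        # trainable (standard LoRA init: A random, B zero)
B = torch.zeros(d, r)        # trainable

W_eff = W + B @ A            # effective weight (often scaled by alpha / r)

# Trainable parameters shrink from d*k to r*(d+k):
print(d * k, r * (d + k))    # 16777216 vs 131072
```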
Prerequisites
- NVIDIA DGX Spark device with Blackwell GPU architecture (`nvidia-smi` shows a summary of GPU information)
- CUDA 13.0 installed (verify with `nvcc --version`)
- Internet access for downloading models and datasets
Ancillary files
The Python test script (`test_unsloth.py`) can be found in the playbook's assets repository; Step 6 below downloads it with `curl`.
Time & risk
- Duration: 30-60 minutes for initial setup and test run
- Risks:
- Triton compiler version mismatches may cause compilation errors
- CUDA toolkit configuration issues may prevent kernel compilation
- Memory constraints on smaller models require batch size adjustments
- Rollback: uninstall packages with `pip uninstall unsloth torch torchvision`.
- DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. Because many applications are still being updated to take advantage of UMA, you may encounter memory issues even when within the memory capacity of the DGX Spark. If that happens, manually flush the buffer cache with:
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
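To confirm a flush actually freed memory, you can compare available memory before and after; a minimal Python sketch reading `/proc/meminfo` (Linux-only, which is what DGX Spark runs):

```python
# Report available system memory; on DGX Spark this reflects the unified
# memory pool shared by CPU and GPU. Run before and after flushing the cache.
def mem_available_gib() -> float:
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / (1024 ** 2)  # kB -> GiB
    raise RuntimeError("MemAvailable not found in /proc/meminfo")

print(f"{mem_available_gib():.1f} GiB available")
```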
Instructions
Step 1. Verify prerequisites
Confirm your NVIDIA DGX Spark has the required CUDA toolkit and GPU resources available.
nvcc --version
The output should show CUDA 13.0.
nvidia-smi
The output should show a summary of GPU information.
Step 2. Get the container image
docker pull nvcr.io/nvidia/pytorch:25.09-py3
Step 3. Launch Docker
docker run --gpus all --ulimit memlock=-1 -it --ulimit stack=67108864 --entrypoint /usr/bin/bash --rm nvcr.io/nvidia/pytorch:25.09-py3
Step 4. Install dependencies inside Docker
pip install transformers peft datasets "trl==0.19.1"
pip install --no-deps unsloth unsloth_zoo
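Before moving on, a quick import check catches most installation problems. A minimal sketch to run with `python` inside the container:

```python
# Sanity check: Unsloth and its companions import, and PyTorch sees the GPU.
# Unsloth recommends importing it before transformers so its patches apply.
import unsloth  # noqa: F401
import torch
import transformers
import trl

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers", transformers.__version__, "| trl", trl.__version__)
```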
Step 5. Install bitsandbytes inside Docker
pip install --no-deps bitsandbytes
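Because bitsandbytes is installed with `--no-deps`, it is worth confirming it loads against the container's CUDA build (bitsandbytes also ships a diagnostic entry point, `python -m bitsandbytes`). A minimal sketch:

```python
# Confirm bitsandbytes imports cleanly and can run a tiny 4-bit linear layer.
import torch
import bitsandbytes as bnb

print("bitsandbytes", bnb.__version__)

layer = bnb.nn.Linear4bit(64, 64, compute_dtype=torch.float16).cuda()  # weights quantized on .cuda()
x = torch.randn(1, 64, dtype=torch.float16, device="cuda")
print(layer(x).shape)  # expect torch.Size([1, 64])
```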
Step 6. Create Python test script
Download the test script into the container with `curl` (the URL contains a `${MODEL}` shell variable, which must be set in your environment first):
curl -O https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}/assets/test_unsloth.py
We will use this test script to validate the installation with a simple fine-tuning task.
Step 7. Run the validation test
Execute the test script to verify Unsloth is working correctly.
python test_unsloth.py
Expected output in the terminal window:
- "Unsloth: Will patch your computer to enable 2x faster free finetuning"
- Training progress bars showing loss decreasing over 60 steps
- Final training metrics showing completion
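The exact contents of `test_unsloth.py` live in the assets repository, but conceptually a run of this kind looks roughly like the sketch below; the model, dataset, and hyperparameters here are placeholders, not the script's actual values.

```python
# Rough shape of a short Unsloth SFT run (placeholder model and dataset).
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=16, target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]
)

dataset = load_dataset("stanfordnlp/imdb", split="train[:1%]")  # placeholder dataset

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        dataset_text_field="text",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=60,  # matches the 60 steps noted above
        learning_rate=2e-4,
        logging_steps=1,
        output_dir="outputs",
    ),
)
trainer.train()
```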
Step 8. Next steps
Test with your own model and dataset by updating the test_unsloth.py file:
```python
# Replace line 32 with your model choice
model_name = "unsloth/Meta-Llama-3.1-8B-bnb-4bit"

# Load your custom dataset at line 8
dataset = load_dataset("your_dataset_name")

# Adjust the training arguments at line 61
per_device_train_batch_size = 4
max_steps = 1000
```
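If your dataset doesn't already expose a single text column, map it into one before training. A hedged sketch with hypothetical column names (`question`, `answer`); adapt to your own schema:

```python
from datasets import load_dataset

dataset = load_dataset("your_dataset_name", split="train")

def to_text(example):
    # Collapse structured fields into the single text column the trainer consumes.
    return {
        "text": f"### Question:\n{example['question']}\n\n### Answer:\n{example['answer']}"
    }

dataset = dataset.map(to_text)
```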
Visit https://github.com/unslothai/unsloth/wiki for advanced usage instructions.