- **Performance-first**: Claims to speed up training (e.g. 2× faster on a single GPU, up to 30× in multi-GPU setups) and reduce memory usage compared to standard methods.
- **Kernel-level optimizations**: Core compute is built with custom kernels (e.g. in Triton) and hand-optimized math to boost throughput and efficiency.
- **Quantization & model formats**: Supports dynamic quantization (4-bit, 16-bit) and GGUF formats to reduce footprint while aiming to retain accuracy.
- **Broad model support**: Works with many LLMs (LLaMA, Mistral, Qwen, DeepSeek, etc.) and supports training, fine-tuning, and exporting to formats such as Ollama, vLLM, GGUF, and Hugging Face.
- **Simplified interface**: Provides easy-to-use notebooks and tools so users can fine-tune models with minimal boilerplate.
## What you'll accomplish
You'll set up Unsloth for optimized fine-tuning of large language models on NVIDIA Spark devices,
achieving up to 2x faster training speeds with reduced memory usage through efficient
parameter-efficient fine-tuning methods like LoRA and QLoRA.
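To see why LoRA is parameter-efficient, note that each targeted linear layer gains only two small low-rank matrices instead of being retrained in full. The sketch below computes the adapter size for a hypothetical Llama-style 7B configuration (32 layers, hidden size 4096, rank 16 on four attention projections, treated as square matrices for simplicity); all of these numbers are illustrative assumptions, not measurements of a specific model.

```python
def lora_trainable_params(num_layers, d_in, d_out, num_target_modules, rank):
    """Parameters added by LoRA adapters: each targeted d_out x d_in
    linear layer gains two low-rank matrices, A (rank x d_in) and
    B (d_out x rank), i.e. rank * (d_in + d_out) new weights."""
    return num_layers * num_target_modules * rank * (d_in + d_out)

# Hypothetical Llama-style config: 32 layers, hidden size 4096,
# LoRA rank 16 on the four attention projections (q, k, v, o).
trainable = lora_trainable_params(32, 4096, 4096, 4, 16)
total = 7_000_000_000
print(f"{trainable:,} trainable ({trainable / total:.2%} of {total:,})")
# → 16,777,216 trainable (0.24% of 7,000,000,000)
```

Training well under 1% of the weights is what lets fine-tuning fit in far less VRAM than full training, since optimizer state is only kept for the adapter parameters.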
## What to know before starting
- Python package management with pip and virtual environments
- Hugging Face Transformers library basics (loading models, tokenizers, datasets)
- GPU fundamentals (CUDA/GPU vs CPU, VRAM constraints, device availability)
- Basic understanding of LLM training concepts (loss functions, checkpoints)
- Familiarity with prompt engineering and base model interaction
The Python test script can be found [here on GitLab](https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}/assets/test_unsloth.py).
## Time & risk
- **Duration**: 30-60 minutes for initial setup and test run
- **Risks**:
- Triton compiler version mismatches may cause compilation errors
- CUDA toolkit configuration issues may prevent kernel compilation
- Memory constraints on smaller models require batch size adjustments
- **Rollback**: Uninstall packages with `pip uninstall unsloth torch torchvision`
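The batch-size adjustment mentioned under risks usually means trading per-device batch size for gradient accumulation: shrink the batch until it fits in VRAM, then accumulate gradients over several steps to keep the effective batch size constant. A minimal sketch of the arithmetic (the example numbers are assumptions, not recommended settings):

```python
def grad_accum_steps(target_effective_batch, per_device_batch):
    """Gradient-accumulation steps needed so that
    per_device_batch * steps >= target_effective_batch."""
    return -(-target_effective_batch // per_device_batch)  # ceiling division

# Hypothetical values: to keep an effective batch of 32 while only
# 4 samples per step fit in VRAM, accumulate gradients over 8 steps.
print(grad_accum_steps(32, 4))  # → 8
```

The loss is averaged across accumulated steps before the optimizer update, so the result approximates training with the larger batch at the cost of more forward/backward passes per update.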
## Instructions
## Step 1. Verify prerequisites
Confirm your NVIDIA Spark device has the required CUDA toolkit and GPU resources available.
```bash
nvcc --version
```
The output should show CUDA 13.0.
```bash
nvidia-smi
```
The output should show a summary of GPU information.
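If you want the GPU name and total VRAM in a machine-readable form, `nvidia-smi` supports CSV queries (`--query-gpu=name,memory.total --format=csv`). The sketch below parses that output; the sample line is hypothetical so the snippet runs without a GPU, and on a real system you would feed it the command's stdout instead.

```python
import csv
import io

def parse_gpu_info(csv_text):
    """Parse `nvidia-smi --query-gpu=name,memory.total --format=csv` output
    into (name, memory) tuples, skipping the header row."""
    rows = csv.reader(io.StringIO(csv_text))
    next(rows)  # skip "name, memory.total [MiB]" header
    return [(name.strip(), mem.strip()) for name, mem in rows]

# On a live system you could obtain csv_text via subprocess, e.g.:
# subprocess.run(["nvidia-smi", "--query-gpu=name,memory.total",
#                 "--format=csv"], capture_output=True, text=True).stdout

# Sample output (values hypothetical) so the snippet runs anywhere:
sample = "name, memory.total [MiB]\nNVIDIA GB10, 131072 MiB\n"
print(parse_gpu_info(sample))  # → [('NVIDIA GB10', '131072 MiB')]
```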
## Step 2. Get the container image
```bash
docker pull nvcr.io/nvidia/pytorch:25.08-py3
```
## Step 3. Launch Docker
```bash
docker run --gpus all --ulimit memlock=-1 -it --ulimit stack=67108864 --entrypoint /usr/bin/bash --rm nvcr.io/nvidia/pytorch:25.08-py3
```
Then download the test script ([link](https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}/assets/test_unsloth.py)) into the container, e.g. with `curl`.