LLaMA Factory
Install and fine-tune models with LLaMA Factory
Overview
Basic idea
LLaMA Factory is an open-source framework that simplifies training and fine-tuning large language models. It offers a unified interface for a variety of cutting-edge methods, including supervised fine-tuning (SFT), RLHF, and QLoRA, and supports a wide range of LLM architectures such as LLaMA, Mistral, and Qwen. This playbook demonstrates how to fine-tune large language models using the LLaMA Factory CLI on your NVIDIA Spark device.
What you'll accomplish
You'll set up LLaMA Factory on NVIDIA Spark with Blackwell architecture to fine-tune large language models using LoRA, QLoRA, and full fine-tuning methods. This enables efficient model adaptation for specialized domains while leveraging hardware-specific optimizations.
What to know before starting
- Basic Python knowledge for editing config files and troubleshooting
- Command line usage for running shell commands and managing environments
- Familiarity with PyTorch and Hugging Face Transformers ecosystem
- GPU environment setup including CUDA/cuDNN installation and VRAM management
- Fine-tuning concepts: understanding tradeoffs between LoRA, QLoRA, and full fine-tuning
- Dataset preparation: formatting text data into the JSON structure used for instruction tuning (see the sketch after this list)
- Resource management: adjusting batch size and memory settings for GPU constraints
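For reference, LLaMA Factory's Alpaca-style instruction data is a JSON array of instruction/input/output records. A minimal sketch follows; the file name is illustrative, and custom datasets must also be registered in `data/dataset_info.json` as described in the data preparation docs linked under Ancillary files:

```bash
# Minimal Alpaca-style dataset sketch (file name is illustrative).
# Each record: an instruction, optional input context, and the target output.
cat > data/my_dataset.json <<'EOF'
[
  {
    "instruction": "Summarize the following text.",
    "input": "LLaMA Factory is a framework for fine-tuning large language models.",
    "output": "LLaMA Factory makes LLM fine-tuning easier."
  }
]
EOF
```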
Prerequisites
- NVIDIA Spark device with Blackwell architecture
- CUDA 12.9 or newer installed: `nvcc --version`
- Git installed: `git --version`
- Python 3 with venv and pip: `python3 --version && pip3 --version`
- Sufficient storage space (>50 GB for models and checkpoints): `df -h`
- Internet connection for downloading models from Hugging Face Hub
Ancillary files
- Official LLaMA Factory repository: https://github.com/hiyouga/LLaMA-Factory
- PyTorch with CUDA 13: install via `pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130`
- Example training configuration: `examples/train_lora/qwen3_lora_sft.yaml` (from the repository)
- Documentation: https://llamafactory.readthedocs.io/en/latest/getting_started/data_preparation.html
Time & risk
- Duration: 30-60 minutes for initial setup, 1-7 hours for training depending on model size and dataset.
- Risks: Model downloads require significant bandwidth and storage. Training may consume substantial GPU memory and require parameter tuning for hardware constraints.
- Rollback: Deactivate the virtual environment and remove the `factoryEnv` and `LLaMA-Factory` directories. Training checkpoints are saved locally and can be deleted to reclaim storage space.
- Last Updated: 02/18/2026
- Updated to venv-based setup with PyTorch CUDA 13 (no Docker). Qwen3 LoRA fine-tuning workflow.
Instructions
Step 1. Verify system prerequisites
Check that your NVIDIA Spark system has the required components installed and accessible.
nvcc --version
nvidia-smi
python3 --version
git --version
Step 2. Create and activate a Python virtual environment
Create a virtual environment and activate it for the LLaMA Factory installation.
python3 -m venv factoryEnv
source ./factoryEnv/bin/activate
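Optionally upgrade pip inside the new environment so large wheels such as PyTorch resolve cleanly:

```bash
pip install --upgrade pip
```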
Step 3. Install PyTorch with CUDA 13 support
Install PyTorch, torchvision, and torchaudio with CUDA 13.0 support from the official PyTorch index.
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
Step 4. Verify PyTorch CUDA support
Confirm that PyTorch can see the GPU.
python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"
Step 5. Clone LLaMA Factory repository
Download the LLaMA Factory source code from the official repository.
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
Step 6. Install LLaMA Factory with dependencies
Install LLaMA Factory in editable mode with metrics support.
pip install -e ".[metrics]"
Step 7. Prepare training configuration
Examine the provided LoRA fine-tuning configuration for Qwen3.
cat examples/train_lora/qwen3_lora_sft.yaml
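The exact contents change between releases, but a LoRA SFT config generally combines model, method, dataset, and training sections. A representative sketch with illustrative values, not necessarily the repository's:

```yaml
# Representative LoRA SFT config keys (illustrative values)
model_name_or_path: Qwen/Qwen3-4B      # Hugging Face model ID or local path
stage: sft                             # supervised fine-tuning
do_train: true
finetuning_type: lora                  # swap for QLoRA/full fine-tuning variants
lora_target: all                       # attach LoRA adapters to all linear layers
dataset: identity,alpaca_en_demo       # names registered in data/dataset_info.json
template: qwen3                        # chat template; must match the model family
cutoff_len: 2048
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
bf16: true
output_dir: saves/qwen3-4b/lora/sft
```

Editing a copy of this file is the main way to adapt training to your hardware and dataset.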
Step 8. Launch fine-tuning training
Note
Log in to the Hugging Face Hub first if the model you want to download is gated.
Execute the training process using the pre-configured LoRA setup.
hf auth login # if the model is gated
llamafactory-cli train examples/train_lora/qwen3_lora_sft.yaml
Example output:
***** train metrics *****
epoch = 3.0
total_flos = 11076559GF
train_loss = 0.9993
train_runtime = 0:14:32.12
train_samples_per_second = 3.749
train_steps_per_second = 0.471
Figure saved at: saves/qwen3-4b/lora/sft/training_loss.png
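Training can run for hours on larger models; it can help to watch GPU utilization and memory from a second terminal while it runs:

```bash
# Refresh GPU utilization and memory every 2 seconds
watch -n 2 nvidia-smi
```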
Step 9. Validate training completion
Verify that training completed successfully and checkpoints were saved.
ls -la saves/qwen3-4b/lora/sft/
Expected output should show:
- Final checkpoint directory (`checkpoint-411` or similar)
- Model configuration files (`adapter_config.json`)
- Training metrics showing decreasing loss values
- Training loss plot saved as a PNG file
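As a quick scripted check, you can confirm the adapter files exist. The weights filename below follows the PEFT convention; adjust it if your version writes a different name:

```bash
OUT=saves/qwen3-4b/lora/sft
test -f "$OUT/adapter_config.json" && \
  test -f "$OUT/adapter_model.safetensors" && \
  echo "Adapter checkpoint looks complete."
```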
Step 10. Test inference with fine-tuned model
Test your fine-tuned model with custom prompts:
llamafactory-cli chat examples/inference/qwen3_lora_sft.yaml
## Type: "Hello, how can you help me today?"
## Expect: Response showing fine-tuned behavior
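Beyond interactive chat, LLaMA Factory can also serve the model through an OpenAI-compatible API via `llamafactory-cli api`. The port and request payload below are assumptions for illustration; check the documentation for your version:

```bash
# Terminal 1: serve the fine-tuned model (OpenAI-compatible endpoints)
llamafactory-cli api examples/inference/qwen3_lora_sft.yaml

# Terminal 2: send a chat completion request (default port assumed to be 8000)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-4b", "messages": [{"role": "user", "content": "Hello, how can you help me today?"}]}'
```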
Step 11. Export your model for production deployment
Merge the LoRA adapter into the base model to produce standalone weights for deployment.
llamafactory-cli export examples/merge_lora/qwen3_lora_sft.yaml
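The merge config pairs the base model with your adapter and writes merged weights to an export directory. The exact file varies by release; a representative sketch with illustrative values:

```yaml
# Representative merge/export config (values are illustrative)
model_name_or_path: Qwen/Qwen3-4B             # base model used during training
adapter_name_or_path: saves/qwen3-4b/lora/sft # LoRA checkpoint from Step 8
template: qwen3                               # must match the training template
finetuning_type: lora
export_dir: output/qwen3_lora_merged          # destination for merged weights
```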
Step 12. Cleanup and rollback
Warning
This will delete all training progress and checkpoints.
To remove the virtual environment and cloned repository:
deactivate
cd ..
rm -rf LLaMA-Factory/
rm -rf factoryEnv/
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| CUDA out of memory during training | Batch size too large for GPU VRAM | Reduce `per_device_train_batch_size` or increase `gradient_accumulation_steps` (see the sketch below the table) |
| Cannot access gated repo | Some Hugging Face models are gated and require approval | Request access to the model on the Hugging Face website, then log in again with a valid token |
| Model download fails or is slow | Network connectivity or Hugging Face Hub issues | Check internet connection, try using HF_HUB_OFFLINE=1 for cached models |
| Training loss not decreasing | Learning rate too high/low or insufficient data | Adjust learning_rate parameter or check dataset quality |
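For the out-of-memory case, the usual fix is to shrink the per-device batch and raise gradient accumulation so the effective batch size is unchanged (1 × 8 = 8 here, matching a 2 × 4 setup):

```yaml
# In your training YAML: same effective batch size, lower peak VRAM
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
```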
Note
DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. With many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'