# LLaMA Factory
> Install and fine-tune models with LLaMA Factory
## Table of Contents
- [Overview](#overview)
- [Instructions](#instructions)
- [Step 4. Install LLaMA Factory with dependencies](#step-4-install-llama-factory-with-dependencies)
- [Troubleshooting](#troubleshooting)
---
## Overview
## Basic idea
LLaMA Factory is an open-source framework that simplifies training and fine-tuning large
language models. It offers a unified interface for a variety of cutting-edge methods such as
SFT, RLHF, and QLoRA, and supports a wide range of LLM architectures, including LLaMA,
Mistral, and Qwen. This playbook demonstrates how to fine-tune large language models with
the LLaMA Factory CLI on your NVIDIA Spark device.
## What you'll accomplish
You'll set up LLaMA Factory on NVIDIA Spark with Blackwell architecture to fine-tune large
language models using LoRA, QLoRA, and full fine-tuning methods. This enables efficient
model adaptation for specialized domains while leveraging hardware-specific optimizations.
## What to know before starting
- Basic Python knowledge for editing config files and troubleshooting
- Command line usage for running shell commands and managing environments
- Familiarity with PyTorch and Hugging Face Transformers ecosystem
- GPU environment setup including CUDA/cuDNN installation and VRAM management
- Fine-tuning concepts: understanding tradeoffs between LoRA, QLoRA, and full fine-tuning
- Dataset preparation: formatting text data into JSON structure for instruction tuning
- Resource management: adjusting batch size and memory settings for GPU constraints
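To make the dataset-preparation point concrete: LLaMA Factory's Alpaca-style instruction format is a JSON array of objects with `instruction`, `input`, and `output` fields. A minimal sketch (the filename `my_dataset.json` and its content are illustrative, not required names):

```shell
# Create a tiny Alpaca-style instruction dataset (illustrative content)
mkdir -p data
cat > data/my_dataset.json <<'EOF'
[
  {
    "instruction": "Summarize the following sentence.",
    "input": "LLaMA Factory unifies several fine-tuning methods behind one CLI.",
    "output": "LLaMA Factory provides a single CLI for multiple fine-tuning methods."
  }
]
EOF
```

Custom datasets also need an entry in `data/dataset_info.json` so the trainer can locate them; see the data-preparation documentation linked under "Ancillary files" below.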
## Prerequisites
- NVIDIA Spark device with Blackwell architecture
- CUDA 12.9 or newer version installed: `nvcc --version`
- Docker installed and configured for GPU access: `docker run --gpus all nvcr.io/nvidia/pytorch:25.11-py3 nvidia-smi`
- Git installed: `git --version`
- Python environment with pip: `python --version && pip --version`
- Sufficient storage space (>50GB for models and checkpoints): `df -h`
- Internet connection for downloading models from Hugging Face Hub
## Ancillary files
- Official LLaMA Factory repository: https://github.com/hiyouga/LLaMA-Factory
- NVIDIA PyTorch container: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch
- Example training configuration: `examples/train_lora/llama3_lora_sft.yaml` (from repository)
- Documentation: https://llamafactory.readthedocs.io/en/latest/getting_started/data_preparation.html
## Time & risk
* **Duration:** 30-60 minutes for initial setup, 1-7 hours for training depending on model size and dataset.
* **Risks:** Model downloads require significant bandwidth and storage. Training may consume substantial GPU memory and require parameter tuning for hardware constraints.
* **Rollback:** Remove Docker containers and cloned repositories. Training checkpoints are saved locally and can be deleted to reclaim storage space.
* **Last updated:** 12/15/2025
* **Changes:** Upgraded to the latest PyTorch container, `nvcr.io/nvidia/pytorch:25.11-py3`
## Instructions
## Step 1. Verify system prerequisites
Check that your NVIDIA Spark system has the required components installed and accessible.
```bash
nvcc --version
docker --version
nvidia-smi
python --version
git --version
```
## Step 2. Launch PyTorch container with GPU support
Start the NVIDIA PyTorch container with GPU access and mount your workspace directory.
> [!NOTE]
> This NVIDIA PyTorch container supports CUDA 13.
```bash
docker run --gpus all --ipc=host --ulimit memlock=-1 -it --ulimit stack=67108864 --rm -v "$PWD":/workspace nvcr.io/nvidia/pytorch:25.11-py3 bash
```
## Step 3. Clone LLaMA Factory repository
Download the LLaMA Factory source code from the official repository.
```bash
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
```
## Step 4. Install LLaMA Factory with dependencies
Install the package in editable mode with metrics support for training evaluation.
```bash
pip install -e ".[metrics]"
```
## Step 5. Verify PyTorch CUDA support
PyTorch comes pre-installed in the container with CUDA support. To verify the installation:
```bash
python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"
```
## Step 6. Prepare training configuration
Examine the provided LoRA fine-tuning configuration for Llama-3.
```bash
cat examples/train_lora/llama3_lora_sft.yaml
```
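For orientation, the key fields in a LoRA SFT configuration typically look like the following. The values here are illustrative, not a verbatim copy of the repository file; check the output of the `cat` command above for the actual settings:

```yaml
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct

stage: sft                      # supervised fine-tuning
do_train: true
finetuning_type: lora
lora_target: all                # attach LoRA adapters to all linear layers

dataset: identity,alpaca_en_demo
template: llama3
cutoff_len: 2048

output_dir: saves/llama3-8b/lora/sft
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
```

The `output_dir` value here matches the checkpoint path used in the validation step later in this playbook.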
## Step 7. Launch fine-tuning training
> [!NOTE]
> Log in to the Hugging Face Hub first if the model you want to download is gated.
Execute the training process using the pre-configured LoRA setup.
```bash
huggingface-cli login # if the model is gated
llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
```
Example output:
```bash
***** train metrics *****
epoch = 3.0
total_flos = 22851591GF
train_loss = 0.9113
train_runtime = 0:22:21.99
train_samples_per_second = 2.437
train_steps_per_second = 0.306
Figure saved at: saves/llama3-8b/lora/sft/training_loss.png
```
## Step 8. Validate training completion
Verify that training completed successfully and checkpoints were saved.
```bash
ls -la saves/llama3-8b/lora/sft/
```
Expected output should show:
- Final checkpoint directory (`checkpoint-21` or similar)
- Model configuration files (`config.json`, `adapter_config.json`)
- Training metrics showing decreasing loss values
- Training loss plot saved as PNG file
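For a programmatic check rather than eyeballing the directory, the Hugging Face Trainer that LLaMA Factory builds on writes a `trainer_state.json` whose `log_history` list records the loss at each logging step. A small helper sketch (the path in the example comment assumes the default `output_dir` from the sample config; adjust as needed):

```python
import json

def final_train_loss(trainer_state_path):
    """Return the last logged training loss from a Hugging Face Trainer state file."""
    with open(trainer_state_path) as f:
        state = json.load(f)
    # log_history is a list of dicts; training steps carry a "loss" key,
    # while evaluation entries carry "eval_loss" instead.
    losses = [entry["loss"] for entry in state["log_history"] if "loss" in entry]
    return losses[-1] if losses else None

# Example (run from the repository root after training):
# print(final_train_loss("saves/llama3-8b/lora/sft/trainer_state.json"))
```

A final loss noticeably below the first logged value is a quick sanity check that training actually converged.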
## Step 9. Test inference with fine-tuned model
Test your fine-tuned model with custom prompts:
```bash
llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
## Type: "Hello, how can you help me today?"
## Expect: Response showing fine-tuned behavior
```
## Step 10. Export your model for production deployment
Merge the trained LoRA adapter into the base model weights so the result can be served as a standalone model.
```bash
llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml
```
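The merge config referenced above typically combines the base model path with the trained adapter and an export directory. An illustrative sketch of the fields (not a verbatim copy of the repository file; inspect `examples/merge_lora/llama3_lora_sft.yaml` for the actual values):

```yaml
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
adapter_name_or_path: saves/llama3-8b/lora/sft   # LoRA checkpoint from training
template: llama3
finetuning_type: lora

export_dir: output/llama3_lora_merged            # merged weights land here
```

The exported directory contains standalone merged weights, so downstream serving no longer needs the adapter files.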
## Step 11. Cleanup and rollback
> [!WARNING]
> This will delete all training progress and checkpoints.
To remove all generated files and free up storage space:
```bash
cd /workspace
rm -rf LLaMA-Factory/
docker system prune -f
```
To rollback Docker container changes:
```bash
exit # Exit container
docker container prune -f
```
## Troubleshooting
| Symptom | Cause | Fix |
|---------|--------|-----|
| CUDA out of memory during training | Batch size too large for GPU VRAM | Reduce `per_device_train_batch_size` or increase `gradient_accumulation_steps` |
| Cannot access gated repo | The model has restricted access on Hugging Face | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens) and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) in your web browser |
| Model download fails or is slow | Network connectivity or Hugging Face Hub issues | Check internet connection, try using `HF_HUB_OFFLINE=1` for cached models |
| Training loss not decreasing | Learning rate too high/low or insufficient data | Adjust `learning_rate` parameter or check dataset quality |
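For the out-of-memory row above, the relevant knobs live in the training YAML. One example of a memory-conserving combination (values illustrative; note that lowering the per-device batch size while raising accumulation keeps the effective batch size unchanged):

```yaml
per_device_train_batch_size: 1   # smaller per-step memory footprint
gradient_accumulation_steps: 8   # effective batch size stays at 1 x 8 = 8
```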
2025-10-12 20:13:25 +00:00
> [!NOTE]
> DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
> With many applications still updating to take advantage of UMA, you may encounter memory issues even when within
> the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```