# LLaMA Factory
> Install and fine-tune models with LLaMA Factory
## Table of Contents
- [Overview](#overview)
- [Instructions](#instructions)
- [Step 4. Install LLaMA Factory with dependencies](#step-4-install-llama-factory-with-dependencies)
- [Troubleshooting](#troubleshooting)
---
## Overview
## Basic idea
LLaMA Factory is an open-source framework that simplifies training and fine-tuning large
language models. It offers a unified interface for a variety of cutting-edge methods such as
SFT, RLHF, and QLoRA, and supports a wide range of LLM architectures, including LLaMA,
Mistral, and Qwen. This playbook demonstrates how to fine-tune large language models with
the LLaMA Factory CLI on your NVIDIA Spark device.
## What you'll accomplish
You'll set up LLaMA Factory on NVIDIA Spark with Blackwell architecture to fine-tune large
language models using LoRA, QLoRA, and full fine-tuning methods. This enables efficient
model adaptation for specialized domains while leveraging hardware-specific optimizations.
## What to know before starting
- Basic Python knowledge for editing config files and troubleshooting
- Command line usage for running shell commands and managing environments
- Familiarity with PyTorch and Hugging Face Transformers ecosystem
- GPU environment setup including CUDA/cuDNN installation and VRAM management
- Fine-tuning concepts: understanding tradeoffs between LoRA, QLoRA, and full fine-tuning
- Dataset preparation: formatting text data into JSON structure for instruction tuning
- Resource management: adjusting batch size and memory settings for GPU constraints
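To make the dataset-preparation point concrete: LLaMA Factory's Alpaca-style instruction format is a JSON array of objects with `instruction`, `input`, and `output` fields. A minimal sketch (the filename `my_dataset.json` and its content are illustrative, not required names):

```shell
# Create a tiny Alpaca-style instruction dataset (illustrative content)
mkdir -p data
cat > data/my_dataset.json <<'EOF'
[
  {
    "instruction": "Summarize the following sentence.",
    "input": "LLaMA Factory unifies several fine-tuning methods behind one CLI.",
    "output": "LLaMA Factory provides a single CLI for multiple fine-tuning methods."
  }
]
EOF
```

Custom datasets also need an entry in `data/dataset_info.json` so the trainer can locate them; see the data-preparation documentation linked under "Ancillary files" below.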
## Prerequisites
- NVIDIA Spark device with Blackwell architecture
- CUDA 12.9 or newer version installed: `nvcc --version`
- Docker installed and configured for GPU access: `docker run --gpus all nvcr.io/nvidia/pytorch:25.11-py3 nvidia-smi`
- Git installed: `git --version`
- Python environment with pip: `python --version && pip --version`
- Sufficient storage space (>50GB for models and checkpoints): `df -h`
- Internet connection for downloading models from Hugging Face Hub
## Ancillary files
- Official LLaMA Factory repository: https://github.com/hiyouga/LLaMA-Factory
- NVIDIA PyTorch container: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch
- Example training configuration: `examples/train_lora/llama3_lora_sft.yaml` (from repository)
- Documentation: https://llamafactory.readthedocs.io/en/latest/getting_started/data_preparation.html
## Time & risk
* **Duration:** 30-60 minutes for initial setup, 1-7 hours for training depending on model size and dataset.
* **Risks:** Model downloads require significant bandwidth and storage. Training may consume substantial GPU memory and require parameter tuning for hardware constraints.
* **Rollback:** Remove Docker containers and cloned repositories. Training checkpoints are saved locally and can be deleted to reclaim storage space.
* **Last updated:** 12/15/2025
* **Changes:** Upgraded to the latest PyTorch container, `nvcr.io/nvidia/pytorch:25.11-py3`
## Instructions
## Step 1. Verify system prerequisites
Check that your NVIDIA Spark system has the required components installed and accessible.
```bash
nvcc --version
docker --version
nvidia-smi
python --version
git --version
```
## Step 2. Launch PyTorch container with GPU support
Start the NVIDIA PyTorch container with GPU access and mount your workspace directory.
> [!NOTE]
> This NVIDIA PyTorch container supports CUDA 13.
```bash
docker run --gpus all --ipc=host --ulimit memlock=-1 -it --ulimit stack=67108864 --rm -v "$PWD":/workspace nvcr.io/nvidia/pytorch:25.11-py3 bash
```
## Step 3. Clone LLaMA Factory repository
Download the LLaMA Factory source code from the official repository.
```bash
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
```
## Step 4. Install LLaMA Factory with dependencies
Install the package in editable mode with metrics support for training evaluation.
```bash
pip install -e ".[metrics]"
```
## Step 5. Verify PyTorch CUDA support
PyTorch comes pre-installed in the container with CUDA support. To verify the installation:
```bash
python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"
```
## Step 6. Prepare training configuration
Examine the provided LoRA fine-tuning configuration for Llama-3.
```bash
cat examples/train_lora/llama3_lora_sft.yaml
```
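For orientation, the key fields in a LoRA SFT configuration typically look like the following. The values here are illustrative, not a verbatim copy of the repository file; check the output of the `cat` command above for the actual settings:

```yaml
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct

stage: sft                      # supervised fine-tuning
do_train: true
finetuning_type: lora
lora_target: all                # attach LoRA adapters to all linear layers

dataset: identity,alpaca_en_demo
template: llama3
cutoff_len: 2048

output_dir: saves/llama3-8b/lora/sft
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
```

The `output_dir` value here matches the checkpoint path used in the validation step later in this playbook.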
## Step 7. Launch fine-tuning training
> [!NOTE]
> Log in to the Hugging Face Hub first if the model you want to download is gated.
Execute the training process using the pre-configured LoRA setup.
```bash
huggingface-cli login # if the model is gated
llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
```
Example output:
```bash
***** train metrics *****
epoch = 3.0
total_flos = 22851591GF
train_loss = 0.9113
train_runtime = 0:22:21.99
train_samples_per_second = 2.437
train_steps_per_second = 0.306
Figure saved at: saves/llama3-8b/lora/sft/training_loss.png
```
## Step 8. Validate training completion
Verify that training completed successfully and checkpoints were saved.
```bash
ls -la saves/llama3-8b/lora/sft/
```
Expected output should show:
- Final checkpoint directory (`checkpoint-21` or similar)
- Model configuration files (`config.json`, `adapter_config.json`)
- Training metrics showing decreasing loss values
- Training loss plot saved as PNG file
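For a programmatic check rather than eyeballing the directory, the Hugging Face Trainer that LLaMA Factory builds on writes a `trainer_state.json` whose `log_history` list records the loss at each logging step. A small helper sketch (the path in the example comment assumes the default `output_dir` from the sample config; adjust as needed):

```python
import json

def final_train_loss(trainer_state_path):
    """Return the last logged training loss from a Hugging Face Trainer state file."""
    with open(trainer_state_path) as f:
        state = json.load(f)
    # log_history is a list of dicts; training steps carry a "loss" key,
    # while evaluation entries carry "eval_loss" instead.
    losses = [entry["loss"] for entry in state["log_history"] if "loss" in entry]
    return losses[-1] if losses else None

# Example (run from the repository root after training):
# print(final_train_loss("saves/llama3-8b/lora/sft/trainer_state.json"))
```

A final loss noticeably below the first logged value is a quick sanity check that training actually converged.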
## Step 9. Test inference with fine-tuned model
Test your fine-tuned model with custom prompts:
```bash
llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
## Type: "Hello, how can you help me today?"
## Expect: Response showing fine-tuned behavior
```
## Step 10. Export your model for production deployment
Merge the trained LoRA adapter into the base model weights so the result can be served as a standalone model.
```bash
llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml
```
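The merge config referenced above typically combines the base model path with the trained adapter and an export directory. An illustrative sketch of the fields (not a verbatim copy of the repository file; inspect `examples/merge_lora/llama3_lora_sft.yaml` for the actual values):

```yaml
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
adapter_name_or_path: saves/llama3-8b/lora/sft   # LoRA checkpoint from training
template: llama3
finetuning_type: lora

export_dir: output/llama3_lora_merged            # merged weights land here
```

The exported directory contains standalone merged weights, so downstream serving no longer needs the adapter files.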
## Step 11. Cleanup and rollback
> [!WARNING]
> This will delete all training progress and checkpoints.
To remove all generated files and free up storage space:
```bash
cd /workspace
rm -rf LLaMA-Factory/
docker system prune -f
```
To rollback Docker container changes:
```bash
exit # Exit container
docker container prune -f
```
## Troubleshooting
| Symptom | Cause | Fix |
|---------|--------|-----|
| CUDA out of memory during training | Batch size too large for GPU VRAM | Reduce `per_device_train_batch_size` or increase `gradient_accumulation_steps` |
| Cannot access gated repo | The model has restricted access on Hugging Face | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens) and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) in your web browser |
| Model download fails or is slow | Network connectivity or Hugging Face Hub issues | Check internet connection, try using `HF_HUB_OFFLINE=1` for cached models |
| Training loss not decreasing | Learning rate too high/low or insufficient data | Adjust `learning_rate` parameter or check dataset quality |
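For the out-of-memory row above, the relevant knobs live in the training YAML. One example of a memory-conserving combination (values illustrative; note that lowering the per-device batch size while raising accumulation keeps the effective batch size unchanged):

```yaml
per_device_train_batch_size: 1   # smaller per-step memory footprint
gradient_accumulation_steps: 8   # effective batch size stays at 1 x 8 = 8
```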
2025-10-12 20:13:25 +00:00
> [!NOTE]
> DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
> With many applications still updating to take advantage of UMA, you may encounter memory issues even when within
> the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```