Mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git, synced 2026-04-25 03:13:53 +00:00

chore: Regenerate all playbooks

This commit is contained in:
parent 81b4535418
commit bfde041ae0
@@ -12,6 +12,13 @@
 
 ## Overview
 
+## Basic idea
+
+LLaMA Factory is an open-source framework that simplifies the process of training and fine
+tuning large language models. It offers a unified interface for a variety of cutting edge
+methods such as SFT, RLHF, and QLoRA techniques. It also supports a wide range of LLM
+architectures such as LLaMA, Mistral and Qwen. This playbook demonstrates how to fine-tune
+large language models using LLaMA Factory CLI on your NVIDIA Spark device.
 
 ## What you'll accomplish
 
 You'll set up LLaMA Factory on NVIDIA Spark with Blackwell architecture to fine-tune large
@@ -107,7 +114,8 @@ pip install -e ".[metrics]"
 
 ## Step 5. Verify Pytorch CUDA support.
 
-PyTorch is pre-installed with CUDA support. Verify installation:
+PyTorch is pre-installed with CUDA support.
+To verify installation:
 
 ```bash
 python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"
@@ -123,7 +131,7 @@ cat examples/train_lora/llama3_lora_sft.yaml
 
 ## Step 7. Launch fine-tuning training
 
-> **Note:** Login to your hugging face hub to download the model if the model is gated
+> **Note:** Login to your hugging face hub to download the model if the model is gated.
 Execute the training process using the pre-configured LoRA setup.
 
 ```bash
@@ -6,18 +6,13 @@
 
 - [Overview](#overview)
 - [Instructions](#instructions)
-  - [If system installation fails](#if-system-installation-fails)
-  - [Install from wheel package (recommended)](#install-from-wheel-package-recommended)
-  - [Full Fine-tuning example:](#full-fine-tuning-example)
-  - [LoRA fine-tuning example:](#lora-fine-tuning-example)
-  - [QLoRA fine-tuning example:](#qlora-fine-tuning-example)
 - [Step 9. Configure distributed training (optional)](#step-9-configure-distributed-training-optional)
 
 ---
 
 ## Overview
 
-## Basic Idea
+## Basic idea
 
 This playbook guides you through setting up and using NVIDIA NeMo AutoModel for fine-tuning large language models and vision-language models on NVIDIA Spark devices. NeMo AutoModel provides GPU-accelerated, end-to-end training for Hugging Face models with native PyTorch support, enabling instant fine-tuning without conversion delays. The framework supports distributed training across single GPU to multi-node clusters, with optimized kernels and memory-efficient recipes specifically designed for ARM64 architecture and Blackwell GPU systems.
 
@@ -36,20 +31,11 @@ You'll establish a complete fine-tuning environment for large language models (1
 ## Prerequisites
 
 - NVIDIA Spark device with Blackwell architecture GPU access
-- CUDA toolkit 12.0+ installed and configured
-```bash
-nvcc --version
-```
-- Python 3.10+ environment available
-```bash
-python3 --version
-```
+- CUDA toolkit 12.0+ installed and configured: `nvcc --version`
+- Python 3.10+ environment available: `python3 --version`
 - Minimum 32GB system RAM for efficient model loading and training
 - Active internet connection for downloading models and packages
-- Git installed for repository cloning
-```bash
-git --version
-```
+- Git installed for repository cloning: `git --version`
 - SSH access to your NVIDIA Spark device configured
 
 ## Ancillary files
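The per-tool checks this hunk inlines can also be run from one small script. A stdlib-only sketch, not part of the playbook; the tool names and the Python 3.10+ threshold come from the prerequisite list above.

```python
import shutil
import sys

def check_prerequisites() -> list[str]:
    """Return a list of problems; an empty list means the inlined checks pass."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append(f"Python 3.10+ required, found {sys.version.split()[0]}")
    # Mirror the command-line checks from the prerequisite bullets above.
    for tool, reason in [("nvcc", "CUDA toolkit 12.0+"), ("git", "repository cloning")]:
        if shutil.which(tool) is None:
            problems.append(f"'{tool}' not found on PATH ({reason})")
    return problems

if __name__ == "__main__":
    issues = check_prerequisites()
    print("All prerequisite commands found" if not issues else "\n".join(issues))
```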
@@ -58,11 +44,11 @@ All necessary files for the playbook can be found [here on GitHub](https://githu
 
 ## Time & risk
 
-**Time estimate:** 45-90 minutes for complete setup and initial model fine-tuning
+**Duration:** 45-90 minutes for complete setup and initial model fine-tuning
 
 **Risks:** Model downloads can be large (several GB), ARM64 package compatibility issues may require troubleshooting, distributed training setup complexity increases with multi-node configurations
 
-**Rollback:** Virtual environments can be completely removed; no system-level changes are made to the host system beyond package installations
+**Rollback:** Virtual environments can be completely removed; no system-level changes are made to the host system beyond package installations.
 
 ## Instructions
 
@@ -113,7 +99,7 @@ pip3 install uv
 uv --version
 ```
 
-### If system installation fails
+#### If system installation fails
 
 ```bash
 ## Install for current user only
@@ -139,7 +125,7 @@ cd Automodel
 
 Set up the virtual environment and install NeMo AutoModel. Choose between wheel package installation for stability or source installation for latest features.
 
-### Install from wheel package (recommended)
+#### Install from wheel package (recommended)
 
 ```bash
 ## Initialize virtual environment
@@ -209,7 +195,7 @@ export HF_TOKEN=<your_huggingface_token>
 ```
 > **Note:** Please Replace `<your_huggingface_token>` with your Hugging Face access token to access gated models (e.g., Llama).
 
-### Full Fine-tuning example:
+#### Full Fine-tuning example:
 Once inside the `Automodel` directory you git cloned from github, run:
 ```bash
 uv run --frozen --no-sync \
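Before launching the run, it can help to confirm that `HF_TOKEN` is actually visible to the process that will download the gated model. A minimal stdlib sketch, not part of the playbook; only the `HF_TOKEN` variable name comes from the `export HF_TOKEN=...` step above.

```python
import os

def hf_token_status(env: dict[str, str]) -> str:
    """Report whether HF_TOKEN is set, without ever printing the token itself."""
    token = env.get("HF_TOKEN", "")
    if not token:
        return "HF_TOKEN is not set; gated models (e.g., Llama) will fail to download"
    return f"HF_TOKEN is set ({len(token)} characters); gated downloads should authenticate"

if __name__ == "__main__":
    print(hf_token_status(dict(os.environ)))
```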
@@ -224,7 +210,7 @@ These overrides ensure the Qwen3-8B SFT run behaves as expected:
 - `--loss_fn._target_`: uses the TransformerEngine-parallel cross-entropy loss variant compatible with tensor-parallel training for large LLMs.
 - `--step_scheduler.local_batch_size`: sets the per-GPU micro-batch size to 1 to fit in memory; overall effective batch size is still driven by gradient accumulation and data/tensor parallel settings from the recipe.
 
-### LoRA fine-tuning example:
+#### LoRA fine-tuning example:
 Execute a basic fine-tuning example to validate the complete setup. This demonstrates parameter-efficient fine-tuning using a small model suitable for testing.
 
 ```bash
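The `--step_scheduler.local_batch_size` note above can be made concrete: the effective global batch size is the per-GPU micro-batch multiplied by the gradient-accumulation steps and the data-parallel degree. A small illustration; the accumulation and parallelism numbers below are hypothetical, not taken from the recipe.

```python
def effective_batch_size(local_batch_size: int, grad_accum_steps: int, data_parallel: int) -> int:
    """Global batch size as driven by accumulation and data-parallel settings."""
    return local_batch_size * grad_accum_steps * data_parallel

# With the micro-batch pinned to 1 as in the override above, a hypothetical
# recipe using 8 accumulation steps across 4 data-parallel ranks still trains
# with an effective batch of 32.
print(effective_batch_size(1, 8, 4))  # 32
```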
@@ -234,7 +220,7 @@ examples/llm_finetune/finetune.py \
 -c examples/llm_finetune/llama3_2/llama3_2_1b_squad_peft.yaml \
 --model.pretrained_model_name_or_path meta-llama/Llama-3.1-8B
 ```
-### QLoRA fine-tuning example:
+#### QLoRA fine-tuning example:
 We can use QLoRA to fine-tune large models in a memory-efficient manner.
 ```bash
 uv run --frozen --no-sync \
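The memory-efficiency claim for QLoRA can be sketched with back-of-the-envelope arithmetic: 4-bit quantization stores each base-model weight in half a byte versus two bytes for 16-bit formats. The parameter count below is illustrative, not tied to any model in this diff, and the estimate covers weights only.

```python
def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate memory for model weights alone (ignores optimizer state and activations)."""
    return num_params * bits_per_param / 8 / 2**30

params = 8_000_000_000  # illustrative 8B-parameter model
print(f"16-bit: {weight_memory_gib(params, 16):.1f} GiB")  # ~14.9 GiB
print(f"4-bit:  {weight_memory_gib(params, 4):.1f} GiB")   # ~3.7 GiB
```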
@@ -485,7 +485,7 @@ rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
 rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
 ```
 
-Use an interface that shows as "(Up)" in your output. In this example, we'll use enP2p1s0f0np0.
+Use an interface that shows as "(Up)" in your output. In this example, we'll use **enp1s0f0np0**.
 
 On Node 1:
 ```bash