chore: Regenerate all playbooks

GitLab CI 2025-10-08 15:33:48 +00:00
parent e2189781ad
commit 3c1b873c69
2 changed files with 21 additions and 35 deletions

View File

@@ -6,7 +6,6 @@
 - [Overview](#overview)
 - [Instructions](#instructions)
-- [Step 9. Configure distributed training (optional)](#step-9-configure-distributed-training-optional)
 ---
@@ -99,7 +98,7 @@ pip3 install uv
 uv --version
 ```
-#### If system installation fails
+**If system installation fails:**
 ```bash
 ## Install for current user only
@@ -125,7 +124,7 @@ cd Automodel
 Set up the virtual environment and install NeMo AutoModel. Choose between wheel package installation for stability or source installation for the latest features.
-#### Install from wheel package (recommended)
+**Install from wheel package (recommended):**
 ```bash
 ## Initialize virtual environment
@@ -173,7 +172,7 @@ uv run --frozen --no-sync python -c "import nemo_automodel; print('✅ NeMo Auto
 ls -la examples/
 ```
-## Step 6. Explore available examples
+## Step 8. Explore available examples
 Review the pre-configured training recipes available for different model types and training scenarios. These recipes provide optimized configurations for the ARM64 and Blackwell architectures.
@@ -185,18 +184,21 @@ ls examples/llm_finetune/
 cat examples/llm_finetune/finetune.py | head -20
 ```
-## Step 7. Run sample fine-tuning
+## Step 9. Run sample fine-tuning
 The following commands show how to perform full fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) with LoRA and QLoRA.
-First, you need to export your HF_TOKEN so that gated models can be downloaded.
+First, export your HF_TOKEN so that gated models can be downloaded.
 ```bash
 ## Run basic LLM fine-tuning example
 export HF_TOKEN=<your_huggingface_token>
 ```
 > **Note:** Replace `<your_huggingface_token>` with your Hugging Face access token so that gated models (e.g., Llama) can be downloaded.
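As a quick sanity check (illustrative, not part of this commit): confirm the token is visible in the shell and, if `huggingface_hub` happens to be among the project dependencies, that it authenticates.

```bash
# Verify the token is set in the current shell
echo "${HF_TOKEN:+HF_TOKEN is set}"

# If huggingface_hub is installed in the project env, confirm the token works
uv run --frozen --no-sync huggingface-cli whoami
```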
-#### Full Fine-tuning example:
+**Full Fine-tuning example:**
-Once inside the `Automodel` directory you git cloned from github, run:
+Once inside the `Automodel` directory you cloned from GitHub, run:
 ```bash
 uv run --frozen --no-sync \
 examples/llm_finetune/finetune.py \
@@ -210,7 +212,8 @@ These overrides ensure the Qwen3-8B SFT run behaves as expected:
 - `--loss_fn._target_`: uses the TransformerEngine-parallel cross-entropy loss variant compatible with tensor-parallel training for large LLMs.
 - `--step_scheduler.local_batch_size`: sets the per-GPU micro-batch size to 1 to fit in memory; overall effective batch size is still driven by gradient accumulation and data/tensor parallel settings from the recipe.
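For illustration only (not part of this commit's diff): the overrides above are dotted-path CLI flags passed after the recipe file. The recipe path and the loss-function target below are placeholders, not values taken from this diff.

```bash
uv run --frozen --no-sync \
  examples/llm_finetune/finetune.py \
  -c <path/to/qwen3_8b_sft_recipe.yaml> \
  --loss_fn._target_ <te_parallel_cross_entropy_import_path> \
  --step_scheduler.local_batch_size 1
```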
-#### LoRA fine-tuning example:
+**LoRA fine-tuning example:**
 Execute a basic fine-tuning example to validate the complete setup. This demonstrates parameter-efficient fine-tuning using a small model suitable for testing.
 ```bash
@@ -220,8 +223,10 @@ examples/llm_finetune/finetune.py \
 -c examples/llm_finetune/llama3_2/llama3_2_1b_squad_peft.yaml \
 --model.pretrained_model_name_or_path meta-llama/Llama-3.1-8B
 ```
-#### QLoRA fine-tuning example:
+**QLoRA fine-tuning example:**
 We can use QLoRA to fine-tune large models in a memory-efficient manner.
 ```bash
 uv run --frozen --no-sync \
 examples/llm_finetune/finetune.py \
@@ -236,7 +241,7 @@ These overrides ensure the 70B QLoRA run behaves as expected:
 - `--loss_fn._target_`: uses the TransformerEngine-parallel cross-entropy loss variant compatible with tensor-parallel training for large LLMs.
 - `--step_scheduler.local_batch_size`: sets the per-GPU micro-batch size to 1 to fit 70B in memory; overall effective batch size is still driven by gradient accumulation and data/tensor parallel settings from the recipe.
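To make the batch-size note concrete (illustrative numbers only, not values read from any recipe): the effective global batch size is the per-GPU micro-batch multiplied by the gradient-accumulation steps and the number of data-parallel ranks.

```bash
# Illustrative arithmetic only; the accumulation and parallelism values
# below are assumptions, not settings from the recipe YAML
LOCAL_BATCH_SIZE=1       # per-GPU micro-batch (the CLI override)
GRAD_ACCUM_STEPS=8       # assumed gradient-accumulation setting
DATA_PARALLEL_RANKS=2    # assumed number of data-parallel GPUs

# effective batch = micro-batch x accumulation steps x data-parallel ranks
echo $(( LOCAL_BATCH_SIZE * GRAD_ACCUM_STEPS * DATA_PARALLEL_RANKS ))   # prints 16
```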
-## Step 8. Validate training output
+## Step 10. Validate training output
 Check that fine-tuning completed successfully and inspect the generated model artifacts. This confirms the training pipeline works correctly on your Spark device.
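As an illustrative check (the `checkpoints/` directory name is an assumption; recipes may write their outputs elsewhere):

```bash
# Look for checkpoints under an assumed output directory name
ls -la checkpoints/ 2>/dev/null || echo "no checkpoints/ directory in the current path"

# Search nearby for saved weights regardless of directory layout
find . -maxdepth 3 \( -name "*.safetensors" -o -name "*.pt" \) | head -5
```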
@@ -254,26 +259,8 @@ print('GPU available:', torch.cuda.is_available())
 print('GPU count:', torch.cuda.device_count())
 "
 ```
-<!--
-### Step 9. Configure distributed training (optional)
-Set up multi-GPU training configuration for larger models. This step is optional but recommended for models requiring more computational resources.
-```bash
-## Check available GPUs
-nvidia-smi -L
-## Configure distributed training environment
-export CUDA_VISIBLE_DEVICES=0,1
-## Run distributed training example
-uv run torchrun --nproc_per_node=2 \
-recipes/llm_finetune/finetune.py \
---model_id meta-llama/Llama-2-7b-hf \
---distributed
-``` -->
-## Step 9. Validate complete setup
+## Step 11. Validate complete setup
 Perform final validation to ensure all components are working correctly. This comprehensive check confirms the environment is ready for production fine-tuning workflows.
@@ -289,7 +276,7 @@ print('✅ Setup complete')
 "
 ```
-## Step 10. Troubleshooting
+## Step 12. Troubleshooting
 Common issues and solutions for NeMo AutoModel setup on NVIDIA Spark devices.
@@ -301,7 +288,7 @@ Common issues and solutions for NeMo AutoModel setup on NVIDIA Spark devices.
 | Out of memory during training | Model too large for available GPU memory | Reduce batch size, enable gradient checkpointing, or use model parallelism |
 | ARM64 package compatibility issues | Package not available for ARM architecture | Use source installation or build from source with ARM64 flags |
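For the out-of-memory row, a minimal illustration of the first remedy using the override syntax shown earlier (the recipe path is a placeholder, not taken from this diff):

```bash
# Re-run with the per-GPU micro-batch reduced to 1
uv run --frozen --no-sync \
  examples/llm_finetune/finetune.py \
  -c <path/to/recipe.yaml> \
  --step_scheduler.local_batch_size 1
```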
-## Step 11. Cleanup and rollback
+## Step 13. Cleanup and rollback
 Remove the installation and restore the original environment if needed. These commands safely remove all installed components.
@@ -322,7 +309,7 @@ pip3 uninstall uv
 rm -rf ~/.cache/pip
 ```
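If you also want to remove the project's virtual environment and uv's caches (directory name assumed from uv's default `.venv` location):

```bash
# Remove the per-project virtual environment created by uv
rm -rf Automodel/.venv

# Clear uv's download cache with its built-in cleaner
uv cache clean
```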
-## Step 12. Next steps
+## Step 14. Next steps
 Begin using NeMo AutoModel for your specific fine-tuning tasks. Start with provided recipes and customize based on your model requirements and dataset.
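A common first customization (illustrative; the copied filename is arbitrary) is to clone one of the recipe YAMLs referenced above and point the run at it:

```bash
# Copy an existing recipe as a starting point (the new filename is arbitrary)
cp examples/llm_finetune/llama3_2/llama3_2_1b_squad_peft.yaml my_recipe.yaml

# After editing model, dataset, and batch settings, run with the custom recipe
uv run --frozen --no-sync \
  examples/llm_finetune/finetune.py \
  -c my_recipe.yaml
```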

View File

@@ -99,8 +99,7 @@ Expected response should contain `"content": "204"` or similar mathematical calc
 | CUDA version mismatch errors | Wrong CUDA toolkit version | Reinstall CUDA 12.9 using exact installer |
 | Container registry authentication fails | Invalid or expired GitLab token | Generate new auth token |
 | SM_121a architecture not recognized | Missing LLVM patches | Verify SM_121a patches applied to LLVM source |
-| Reduce MAX_JOBS to 1-2, add swap space |
-| Environment variables not set |
 ## Step 4. Cleanup and rollback