Mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git, synced 2026-04-25 19:33:53 +00:00

chore: Regenerate all playbooks

parent a79c14d8f5, commit 002501ec63
@@ -37,6 +37,33 @@ SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 -----------------------------------------------------------------------------------------
 
+== Triton (OpenAI)
+
+Copyright 2018-2020 Philippe Tillet
+Copyright 2020-2022 OpenAI
+
+MIT License
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+
+-----------------------------------------------------------------------------------------
+
 == Transformers (Hugging Face)
 
 Copyright 2018- The Hugging Face team. All rights reserved.
@@ -94,17 +94,15 @@ sudo usermod -aG docker $USER
 newgrp docker
 ```
 
-## Step 3. Get the container image and clone the repository for mounting
+## Step 3. Get the container image with NeMo AutoModel
 
 ```bash
 docker pull nvcr.io/nvidia/nemo-automodel:26.02
-
-git clone https://github.com/NVIDIA-NeMo/Automodel.git
 ```
 
 ## Step 4. Launch Docker
 
-Replace `<local-path-to-Automodel>` with the absolute path to the Automodel directory you cloned in Step 3.
+Launch an interactive container with GPU access. The `--rm` flag ensures the container is removed when you exit.
 
 ```bash
 docker run \
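As a side note on Step 4, the `docker run` invocation in this diff is split across two hunks, which makes it easy to mis-copy. Below is a hedged sketch (not part of the playbook) that collects the flags visible in the diff into an array and prints the command before running it; the diff elides at least one line between hunks, so treat the flag list as incomplete and add any missing flags by hand:

```shell
# Hedged helper, not from the playbook: build the Step 4 `docker run`
# command from the flags visible in this diff, print it for review, and
# only execute when RUN=1 is set in the environment.
IMAGE="nvcr.io/nvidia/nemo-automodel:26.02"

args=(
  --ulimit memlock=-1
  -it --ulimit stack=67108864
  --entrypoint /usr/bin/bash
  --rm "$IMAGE"
)

# Dry-run by default so the command can be inspected first.
echo "docker run ${args[*]}"
if [ "${RUN:-0}" = "1" ]; then
  docker run "${args[@]}"
fi
```

Printing before executing is a cheap safeguard when a long multi-line command has been reassembled from a diff.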
@@ -112,54 +110,17 @@ docker run \
 --ulimit memlock=-1 \
 -it --ulimit stack=67108864 \
 --entrypoint /usr/bin/bash \
--v <local-path-to-Automodel>:/opt/Automodel \
 --rm nvcr.io/nvidia/nemo-automodel:26.02
 ```
 
-## Step 5. Install NeMo Automodel with latest features
+## Step 5. Explore available examples
 
-First `cd` into the NeMo Automodel directory
-```bash
-cd /opt/Automodel
-```
-
-Next, run the following two commands to sync the environment packages
-```bash
-bash docker/common/update_pyproject_pytorch.sh /opt/Automodel
-
-uv sync --locked --extra all --all-groups
-```
-
-## Step 6. Verify installation
-
-Confirm NeMo AutoModel is properly installed and accessible. This step validates the installation and checks for any missing dependencies.
-
-```bash
-## Test NeMo AutoModel import
-uv run --frozen --no-sync python -c "import nemo_automodel; print('✅ NeMo AutoModel ready')"
-
-## Check available examples
-ls -la examples/
-
-## Below is an example of the expected output (username and domain-users are placeholders).
-## $ ls -la examples/
-## total 36
-## drwxr-xr-x 9 username domain-users 4096 Oct 16 14:52 .
-## drwxr-xr-x 16 username domain-users 4096 Oct 16 14:52 ..
-## drwxr-xr-x 3 username domain-users 4096 Oct 16 14:52 benchmark
-## drwxr-xr-x 3 username domain-users 4096 Oct 16 14:52 diffusion
-## drwxr-xr-x 20 username domain-users 4096 Oct 16 14:52 llm_finetune
-## drwxr-xr-x 3 username domain-users 4096 Oct 14 09:27 llm_kd
-## drwxr-xr-x 2 username domain-users 4096 Oct 16 14:52 llm_pretrain
-## drwxr-xr-x 6 username domain-users 4096 Oct 14 09:27 vlm_finetune
-## drwxr-xr-x 2 username domain-users 4096 Oct 14 09:27 vlm_generate
-```
-
-## Step 7. Explore available examples
-
 Review the pre-configured training recipes available for different model types and training scenarios. These recipes provide optimized configurations for ARM64 and Blackwell architecture.
 
 ```bash
+## Navigate to /opt/Automodel
+cd /opt/Automodel
+
 ## List LLM fine-tuning examples
 ls examples/llm_finetune/
 
@@ -167,7 +128,7 @@ ls examples/llm_finetune/
 cat examples/llm_finetune/finetune.py | head -20
 ```
 
-## Step 8. Run sample fine-tuning
+## Step 6. Run sample fine-tuning
 The following commands show how to perform full fine-tuning (SFT), parameter-efficient fine-tuning (PEFT) with LoRA and QLoRA.
 
 First, export your HF_TOKEN so that gated models can be downloaded.
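When `HF_TOKEN` is missing, gated downloads tend to fail late with a confusing error, so a small guard can fail fast before a run starts. This is a hedged sketch, not part of the playbook, and the function name `require_hf_token` is made up here:

```shell
# Hedged helper (hypothetical, not from the playbook): abort early if
# HF_TOKEN is unset or empty, before finetune.py tries to download
# gated weights and fails mid-run.
require_hf_token() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set; run: export HF_TOKEN=<your token>" >&2
    return 1
  fi
}

# Usage: require_hf_token && python3 examples/llm_finetune/finetune.py ...
```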
@@ -194,8 +155,8 @@ For the examples below, we are using YAML for configuration, and parameter overrides
 
 ```bash
 ## Run basic LLM fine-tuning example
-uv run --frozen --no-sync \
-examples/llm_finetune/finetune.py \
+cd /opt/Automodel
+python3 examples/llm_finetune/finetune.py \
 -c examples/llm_finetune/llama3_2/llama3_2_1b_squad_peft.yaml \
 --model.pretrained_model_name_or_path meta-llama/Llama-3.1-8B \
 --packed_sequence.packed_sequence_size 1024 \
@@ -205,16 +166,18 @@ examples/llm_finetune/finetune.py \
 These overrides ensure the Llama-3.1-8B LoRA run behaves as expected:
 - `--model.pretrained_model_name_or_path`: selects the Llama-3.1-8B model to fine-tune from the Hugging Face model hub (weights fetched via your Hugging Face token).
 - `--packed_sequence.packed_sequence_size`: sets the packed sequence size to 1024 to enable packed sequence training.
-- `--step_scheduler.max_steps`: sets the maximum number of training steps. We set it to 20 for demonstation purposes, please adjust this based on your needs.
+- `--step_scheduler.max_steps`: sets the maximum number of training steps. We set it to 20 for demonstration purposes, please adjust this based on your needs.
 
+> [!NOTE]
+> The recipe YAML `llama3_2_1b_squad_peft.yaml` defines training hyperparameters (LoRA rank, learning rate, etc.) that are reusable across Llama model sizes. The `--model.pretrained_model_name_or_path` override determines which model weights are actually loaded.
 
 **QLoRA fine-tuning example:**
 
 We can use QLoRA to fine-tune large models in a memory-efficient manner.
 
 ```bash
-uv run --frozen --no-sync \
-examples/llm_finetune/finetune.py \
+cd /opt/Automodel
+python3 examples/llm_finetune/finetune.py \
 -c examples/llm_finetune/llama3_1/llama3_1_8b_squad_qlora.yaml \
 --model.pretrained_model_name_or_path meta-llama/Meta-Llama-3-70B \
 --loss_fn._target_ nemo_automodel.components.loss.te_parallel_ce.TEParallelCrossEntropy \
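The pattern in these commands, a base `-c` YAML recipe plus dotted `--section.key value` overrides, lends itself to scripting when sweeping a single setting. Below is a hedged sketch, not part of the playbook; the dotted flag names are copied from the commands shown, while the sweep loop itself is purely illustrative:

```shell
# Hedged sketch: collect the dotted overrides passed to finetune.py in an
# array, so a sweep over max_steps changes only one place. This prints the
# commands; drop the leading `echo` to launch real runs.
CONFIG="examples/llm_finetune/llama3_1/llama3_1_8b_squad_qlora.yaml"
MODEL="meta-llama/Meta-Llama-3-70B"

for steps in 20 50; do
  overrides=(
    --model.pretrained_model_name_or_path "$MODEL"
    --step_scheduler.local_batch_size 1
    --step_scheduler.max_steps "$steps"
    --packed_sequence.packed_sequence_size 1024
  )
  echo python3 examples/llm_finetune/finetune.py -c "$CONFIG" "${overrides[@]}"
done
```

Keeping overrides in an array also avoids the line-continuation breakage that long `\`-joined commands are prone to when copied from rendered pages.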
@@ -227,29 +190,31 @@ These overrides ensure the 70B QLoRA run behaves as expected:
 - `--model.pretrained_model_name_or_path`: selects the 70B base model to fine-tune (weights fetched via your Hugging Face token).
 - `--loss_fn._target_`: uses the TransformerEngine-parallel cross-entropy loss variant compatible with tensor-parallel training for large LLMs.
 - `--step_scheduler.local_batch_size`: sets the per-GPU micro-batch size to 1 to fit 70B in memory; overall effective batch size is still driven by gradient accumulation and data/tensor parallel settings from the recipe.
-- `--step_scheduler.max_steps`: sets the maximum number of training steps. We set it to 20 for demonstation purposes, please adjust this based on your needs.
+- `--step_scheduler.max_steps`: sets the maximum number of training steps. We set it to 20 for demonstration purposes, please adjust this based on your needs.
 - `--packed_sequence.packed_sequence_size`: sets the packed sequence size to 1024 to enable packed sequence training.
 
 **Full Fine-tuning example:**
 
-Once inside the `Automodel` directory you cloned from GitHub, run:
+Run the following command to perform full (SFT) fine-tuning:
 
 ```bash
-uv run --frozen --no-sync \
-examples/llm_finetune/finetune.py \
+cd /opt/Automodel
+python3 examples/llm_finetune/finetune.py \
 -c examples/llm_finetune/qwen/qwen3_8b_squad_spark.yaml \
 --model.pretrained_model_name_or_path Qwen/Qwen3-8B \
 --step_scheduler.local_batch_size 1 \
 --step_scheduler.max_steps 20 \
 --packed_sequence.packed_sequence_size 1024
 ```
 
 These overrides ensure the Qwen3-8B SFT run behaves as expected:
 - `--model.pretrained_model_name_or_path`: selects the Qwen/Qwen3-8B model to fine-tune from the Hugging Face model hub (weights fetched via your Hugging Face token). Adjust this if you want to fine-tune a different model.
-- `--step_scheduler.max_steps`: sets the maximum number of training steps. We set it to 20 for demonstation purposes, please adjust this based on your needs.
+- `--step_scheduler.max_steps`: sets the maximum number of training steps. We set it to 20 for demonstration purposes, please adjust this based on your needs.
 - `--step_scheduler.local_batch_size`: sets the per-GPU micro-batch size to 1 to fit in memory; overall effective batch size is still driven by gradient accumulation and data/tensor parallel settings from the recipe.
+- `--packed_sequence.packed_sequence_size`: sets the packed sequence size to 1024 to enable packed sequence training.
+
 
-## Step 9. Validate successful training completion
+## Step 7. Validate successful training completion
 
 Validate the fine-tuned model by inspecting artifacts contained in the checkpoint directory.
 
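Beyond eyeballing `ls -lah`, the artifact check can be scripted. Below is a hedged sketch, not part of the playbook; the expected names (`step_scheduler.pt` and a `model/` directory) are inferred from the sample listing and the `hf upload` command elsewhere in this playbook, and real checkpoint layouts may differ by recipe:

```shell
# Hedged helper (hypothetical): verify a checkpoint directory contains the
# artifacts this playbook's sample output shows. Adjust the expected names
# to whatever your recipe actually writes.
check_checkpoint() {
  local dir="$1" missing=0
  [ -f "$dir/step_scheduler.pt" ] || { echo "missing: step_scheduler.pt" >&2; missing=1; }
  [ -d "$dir/model" ] || { echo "missing: model/" >&2; missing=1; }
  return "$missing"
}

# Usage: check_checkpoint checkpoints/LATEST && echo "checkpoint looks complete"
```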
@@ -272,28 +237,17 @@ ls -lah checkpoints/LATEST/
 ## -rw-r--r-- 1 username domain-users 1.3K Oct 16 22:33 step_scheduler.pt
 ```
 
-## Step 10. Cleanup and rollback (Optional)
+## Step 8. Cleanup (Optional)
 
-Remove the installation and restore the original environment if needed. These commands safely remove all installed components.
+The container was launched with the `--rm` flag, so it is automatically removed when you exit. To reclaim disk space used by the Docker image, run:
 
 > [!WARNING]
-> This will delete all virtual environments and downloaded models. Ensure you have backed up any important training checkpoints.
+> This will remove the NeMo AutoModel image. You will need to pull it again if you want to use it later.
 
 ```bash
-## Remove virtual environment
-rm -rf .venv
-
-## Remove cloned repository
-cd ..
-rm -rf Automodel
-
-## Remove uv (if installed with --user)
-pip3 uninstall uv
-
-## Clear Python cache
-rm -rf ~/.cache/pip
+docker rmi nvcr.io/nvidia/nemo-automodel:26.02
 ```
-## Step 11. Optional: Publish your fine-tuned model checkpoint on Hugging Face Hub
+## Step 9. Optional: Publish your fine-tuned model checkpoint on Hugging Face Hub
 
 Publish your fine-tuned model checkpoint on Hugging Face Hub.
 > [!NOTE]
@@ -301,7 +255,7 @@ Publish your fine-tuned model checkpoint on Hugging Face Hub.
 > It is useful if you want to share your fine-tuned model with others or use it in other projects.
 > You can also use the fine-tuned model in other projects by cloning the repository and using the checkpoint.
 > To use the fine-tuned model in other projects, you need to have the Hugging Face CLI installed.
-> You can install the Hugging Face CLI by running `pip install huggingface-cli`.
+> You can install the Hugging Face CLI by running `pip install huggingface_hub`.
 > For more information, please refer to the [Hugging Face CLI documentation](https://huggingface.co/docs/huggingface_hub/en/guides/cli).
 
 > [!TIP]
@@ -318,26 +272,26 @@ hf upload my-cool-model checkpoints/LATEST/model
 > The above command can fail if you don't have write permissions to the Hugging Face Hub, with the HF_TOKEN you used.
 > Sample error message:
 > ```bash
-> akoumparouli@1604ab7-lcedt:/mnt/4tb/auto/Automodel8$ hf upload my-cool-model checkpoints/LATEST/model
+> user@host:/opt/Automodel$ hf upload my-cool-model checkpoints/LATEST/model
 > Traceback (most recent call last):
-> File "/home/akoumparouli/.local/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 409, in hf_raise_for_status
+> File "/home/user/.local/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 409, in hf_raise_for_status
 > response.raise_for_status()
-> File "/home/akoumparouli/.local/lib/python3.10/site-packages/requests/models.py", line 1024, in raise_for_status
+> File "/home/user/.local/lib/python3.10/site-packages/requests/models.py", line 1024, in raise_for_status
 > raise HTTPError(http_error_msg, response=self)
 > requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://huggingface.co/api/repos/create
 > ```
 > To fix this, you need to create an access token with *write* permissions, please see the Hugging Face guide [here](https://huggingface.co/docs/hub/en/security-tokens) for instructions.
 
-## Step 12. Next steps
+## Step 10. Next steps
 
 Begin using NeMo AutoModel for your specific fine-tuning tasks. Start with provided recipes and customize based on your model requirements and dataset.
 
 ```bash
 ## Copy a recipe for customization
-cp recipes/llm_finetune/finetune.py my_custom_training.py
+cp examples/llm_finetune/finetune.py my_custom_training.py
 
-## Edit configuration for your specific model and data
-## Then run: uv run my_custom_training.py
+## Edit configuration for your specific model and data, then run:
+python3 my_custom_training.py
 ```
 
 Explore the [NeMo AutoModel GitHub repository](https://github.com/NVIDIA-NeMo/Automodel) for more recipes, documentation, and community examples. Consider setting up custom datasets, experimenting with different model architectures, and scaling to multi-node distributed training for larger models.