mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git
synced 2026-04-25 03:13:53 +00:00

chore: Regenerate all playbooks

parent 7874a7b269, commit e07330f8dc
@@ -85,7 +85,7 @@ Start the NVIDIA PyTorch container with GPU access and mount your workspace directory.
 > **Note:** This NVIDIA PyTorch container supports CUDA 13
 
 ```bash
-docker run --gpus all --ipc=host --ulimit memlock=-1 -it --ulimit stack=67108864 --rm -v "$PWD":/workspace nvcr.io/nvidia/pytorch:25.08-py3 bash
+docker run --gpus all --ipc=host --ulimit memlock=-1 -it --ulimit stack=67108864 --rm -v "$PWD":/workspace nvcr.io/nvidia/pytorch:25.09-py3 bash
 ```
 
 ## Step 3. Clone LLaMA Factory repository
@@ -105,16 +105,7 @@ Install the package in editable mode with metrics support for training evaluation.
 pip install -e ".[metrics]"
 ```
 
-## Step 5. Configure PyTorch for CUDA 12.9 (skip if using Docker container from Step 2)
-
-In a Python virtual environment, uninstall existing PyTorch and reinstall with CUDA 12.9 support for ARM64 architecture.
-
-```bash
-pip uninstall torch torchvision torchaudio
-pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu129
-```
-
-*If using Docker container*
+## Step 5. Verify PyTorch CUDA support
 
 PyTorch is pre-installed with CUDA support. Verify installation:
 
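The CUDA verification described in this step can be done from Python; a minimal sketch (the `cuda_ready` helper is illustrative, not part of the playbook):

```python
import importlib.util


def cuda_ready():
    """Return True if PyTorch is importable and sees a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False  # PyTorch not installed in this environment
    import torch
    return torch.cuda.is_available()


print(cuda_ready())
```

Inside the container this should print `True`; in an environment without PyTorch it returns `False` instead of raising `ImportError`.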
@@ -158,7 +149,6 @@ Verify that training completed successfully and checkpoints were saved.
 
 ```bash
 ls -la saves/llama3-8b/lora/sft/
-cat saves/llama3-8b/lora/sft/training_loss.png
 ```
 
 
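The checkpoint check above can be scripted when verifying several runs; a small sketch (the `checkpoint_summary` helper is illustrative, only the save path comes from the playbook):

```python
from pathlib import Path


def checkpoint_summary(save_dir):
    """Return sorted file names in a checkpoint directory, or None if it is missing."""
    p = Path(save_dir)
    if not p.is_dir():
        return None
    return sorted(entry.name for entry in p.iterdir())


# e.g. checkpoint_summary("saves/llama3-8b/lora/sft")
```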
@@ -170,13 +160,20 @@ Expected output should show:
 
 ## Step 9. Test inference with fine-tuned model
 
-Run a simple inference test to verify the fine-tuned model loads correctly.
+Test your fine-tuned model with custom prompts:
 
 ```bash
 llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
+## Type: "Hello, how can you help me today?"
+## Expect: Response showing fine-tuned behavior
 ```
 
-## Step 10. Troubleshooting
+## Step 10. For production deployment, export your model
+
+```bash
+llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml
+```
+
+## Step 11. Troubleshooting
 
 | Symptom | Cause | Fix |
 |---------|--------|-----|
@@ -184,7 +181,7 @@ llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
 | Model download fails or is slow | Network connectivity or Hugging Face Hub issues | Check internet connection, try using `HF_HUB_OFFLINE=1` for cached models |
 | Training loss not decreasing | Learning rate too high/low or insufficient data | Adjust `learning_rate` parameter or check dataset quality |
 
-## Step 11. Cleanup and rollback
+## Step 12. Cleanup and rollback
 
 > **Warning:** This will delete all training progress and checkpoints.
 
@@ -201,18 +198,3 @@ To rollback Docker container changes:
 exit  # Exit container
 docker container prune -f
 ```
-
-## Step 12. Next steps
-
-Test your fine-tuned model with custom prompts:
-
-```bash
-llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
-## Type: "Hello, how can you help me today?"
-## Expect: Response showing fine-tuned behavior
-```
-
-For production deployment, export your model:
-
-```bash
-llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml
-```
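As a rough sense of why the LoRA fine-tuning in this playbook fits in modest GPU memory: a rank-r adapter on a d_out x d_in weight trains only r * (d_out + d_in) parameters instead of d_out * d_in. A back-of-the-envelope sketch (the 4096 x 4096 shape and rank 8 are illustrative, not from the playbook):

```python
def lora_param_count(d_out, d_in, rank):
    """Trainable parameters added by a rank-`rank` LoRA adapter.

    The adapter factorizes the update as B @ A, with B of shape
    (d_out, rank) and A of shape (rank, d_in).
    """
    return rank * (d_out + d_in)


full = 4096 * 4096                          # one full projection matrix
adapter = lora_param_count(4096, 4096, 8)   # rank-8 adapter on the same matrix
print(adapter, adapter / full)              # the adapter is well under 1% of the full matrix
```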
@@ -76,12 +76,12 @@ The output should show a summary of GPU information.
 
 ## Step 2. Get the container image
 ```bash
-docker pull nvcr.io/nvidia/pytorch:25.08-py3
+docker pull nvcr.io/nvidia/pytorch:25.09-py3
 ```
 
 ## Step 3. Launch Docker
 ```bash
-docker run --gpus all --ulimit memlock=-1 -it --ulimit stack=67108864 --entrypoint /usr/bin/bash --rm nvcr.io/nvidia/pytorch:25.08-py3
+docker run --gpus all --ulimit memlock=-1 -it --ulimit stack=67108864 --entrypoint /usr/bin/bash --rm nvcr.io/nvidia/pytorch:25.09-py3
 ```
 
 ## Step 4. Install dependencies inside Docker
@@ -93,13 +93,7 @@ pip install --no-deps unsloth unsloth_zoo
 
 ## Step 5. Build and install bitsandbytes inside Docker
 ```bash
-git clone https://github.com/bitsandbytes-foundation/bitsandbytes.git
-cd bitsandbytes
-cmake -S . -B build -DCOMPUTE_BACKEND=cuda -DCOMPUTE_CAPABILITY="80;86;87;89;90"
-cd build
-make -j
-cd ..
-pip install .
+pip install --no-deps bitsandbytes
 ```
 
 ## Step 6. Create Python test script
@@ -107,8 +101,8 @@ pip install .
 Curl the test script [here](https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}/assets/test_unsloth.py) into the container.
 
 ```bash
 curl -O https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}/assets/test_unsloth.py
-
+```
 
 We will use this test script to validate the installation with a simple fine-tuning task.
 
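A validation script like the one curled in this step typically begins by confirming that the expected packages import at all; a minimal sketch (the `check_install` helper and its default package list are illustrative, not the playbook's actual script):

```python
import importlib.util


def check_install(packages=("torch", "bitsandbytes", "unsloth")):
    """Map each expected package name to whether it is importable."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


print(check_install())
```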