chore: Regenerate all playbooks

GitLab CI 2025-12-15 20:19:01 +00:00
parent 5472c97a8c
commit b4e7892d2c
2 changed files with 15 additions and 22 deletions

View File

@@ -44,14 +44,14 @@ and proper GPU topology detection.
* **Duration**: 30 minutes for setup and validation
* **Risk level**: Medium - involves network configuration changes
* **Rollback**: The NCCL & NCCL Tests repositories can be deleted from DGX Spark
-* **Last Updated:** 10/12/2025
-* First publication
+* **Last Updated:** 12/15/2025
+* Use the latest NCCL version, v2.28.9-1
## Run on two Sparks
## Step 1. Configure network connectivity
-Follow the network setup instructions from the Connect two Sparks playbook to establish connectivity between your DGX Spark nodes.
+Follow the network setup instructions from the [Connect two Sparks](https://build.nvidia.com/spark/connect-two-sparks/stacked-sparks) playbook to establish connectivity between your DGX Spark nodes.
This includes:
- Physical QSFP cable connection
@@ -67,7 +67,7 @@ architecture support:
```bash
## Install dependencies and build NCCL
sudo apt-get update && sudo apt-get install -y libopenmpi-dev
-git clone -b v2.28.3-1 https://github.com/NVIDIA/nccl.git ~/nccl/
+git clone -b v2.28.9-1 https://github.com/NVIDIA/nccl.git ~/nccl/
cd ~/nccl/
make -j src.build NVCC_GENCODE="-gencode=arch=compute_121,code=sm_121"
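# The compute_121/sm_121 pair above encodes CUDA compute capability 12.1.
# As a sketch, the -gencode flag for another capability can be derived by
# dropping the dot from the major.minor string:
cc="12.1"                                   # substitute your GPU's capability
arch=$(printf '%s' "$cc" | tr -d '.')       # "12.1" -> "121"
echo "-gencode=arch=compute_${arch},code=sm_${arch}"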
@@ -80,7 +80,7 @@ export LD_LIBRARY_PATH="$NCCL_HOME/lib:$CUDA_HOME/lib64/:$MPI_HOME/lib:$LD_LIBRA
## Step 3. Build NCCL test suite
-Compile the NCCL test suite to validate communication performance:
+Compile the NCCL test suite on **both nodes**:
```bash
## Clone and build NCCL tests
@@ -91,7 +91,7 @@ make MPI=1
## Step 4. Find the active network interface and IP addresses
-Execute multi-node NCCL performance test using the active network interface. First, identify which network ports are available and up:
+First, identify which network ports are available and up:
```bash
## Check network port status
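# One approach (a sketch; interface names vary per system) is to list links
# in brief form and keep those whose state column reads UP:
command -v ip >/dev/null && ip -br link show | awk '$2 == "UP" {print $1}' || true
# The same awk filter applied to illustrative sample output:
printf 'lo          UNKNOWN  00:00:00:00:00:00\nenP2p1s0f0  UP       aa:bb:cc:dd:ee:ff\n' \
  | awk '$2 == "UP" {print $1}'             # prints: enP2p1s0f0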

View File

@@ -55,9 +55,8 @@ The Python test script can be found [here on GitHub](https://github.com/NVIDIA/d
* CUDA toolkit configuration issues may prevent kernel compilation
* Memory constraints on smaller models require batch size adjustments
* **Rollback**: Uninstall packages with `pip uninstall unsloth torch torchvision`.
-* **Last Updated:** 11/07/2025
-* Add required python dependencies
-* Fix broken commands to access files on GitHub
+* **Last Updated:** 12/15/2025
+* Upgrade the PyTorch container and Python dependencies to the latest versions
## Instructions
@@ -77,28 +76,22 @@ The output should show a summary of GPU information.
## Step 2. Get the container image
```bash
-docker pull nvcr.io/nvidia/pytorch:25.09-py3
+docker pull nvcr.io/nvidia/pytorch:25.11-py3
```
## Step 3. Launch Docker
```bash
-docker run --gpus all --ulimit memlock=-1 -it --ulimit stack=67108864 --entrypoint /usr/bin/bash --rm nvcr.io/nvidia/pytorch:25.09-py3
+docker run --gpus all --ulimit memlock=-1 -it --ulimit stack=67108864 --entrypoint /usr/bin/bash --rm nvcr.io/nvidia/pytorch:25.11-py3
```
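In the `docker run` line above, `--ulimit memlock=-1` removes the locked-memory limit inside the container, and `--ulimit stack=67108864` is simply 64 MiB expressed in bytes:

```shell
# 64 MiB in bytes -- matches the stack ulimit in the docker run command
echo $((64 * 1024 * 1024))   # prints: 67108864
```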
## Step 4. Install dependencies inside Docker
```bash
-pip install transformers peft "datasets==4.3.0" "trl==0.19.1"
-pip install --no-deps unsloth unsloth_zoo
-pip install hf_transfer
+pip install transformers peft hf_transfer "datasets==4.3.0" "trl==0.26.1"
+pip install --no-deps unsloth unsloth_zoo bitsandbytes
```
-## Step 5. Build and install bitsandbytes inside Docker
-```bash
-pip install --no-deps bitsandbytes
-```
-## Step 6. Create Python test script
+## Step 5. Create Python test script
Curl the test script [here](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/unsloth/assets/test_unsloth.py) into the container.
@@ -109,7 +102,7 @@ curl -O https://raw.githubusercontent.com/NVIDIA/dgx-spark-playbooks/refs/heads/
We will use this test script to validate the installation with a simple fine-tuning task.
-## Step 7. Run the validation test
+## Step 6. Run the validation test
Execute the test script to verify Unsloth is working correctly.
@@ -122,7 +115,7 @@ Expected output in the terminal window:
- Training progress bars showing loss decreasing over 60 steps
- Final training metrics showing completion
-## Step 8. Next steps
+## Step 7. Next steps
Test with your own model and dataset by updating the `test_unsloth.py` file:
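For example, a model swap can be scripted with `sed`. This is a minimal sketch: the `model_name` variable, the stand-in file contents, and both model identifiers are illustrative assumptions, not taken from the actual script.

```shell
# Hypothetical stand-in for test_unsloth.py (contents are illustrative only)
printf 'model_name = "unsloth/Llama-3.2-1B-Instruct"\nmax_steps = 60\n' > /tmp/test_unsloth_demo.py
# Swap in your own model identifier ("my-org/my-model" is a placeholder)
sed -i 's|unsloth/Llama-3.2-1B-Instruct|my-org/my-model|' /tmp/test_unsloth_demo.py
cat /tmp/test_unsloth_demo.py
```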