mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git
synced 2026-04-23 10:33:51 +00:00

chore: Regenerate all playbooks

parent 5472c97a8c
commit b4e7892d2c
@@ -44,14 +44,14 @@ and proper GPU topology detection.
 * **Duration**: 30 minutes for setup and validation
 * **Risk level**: Medium - involves network configuration changes
 * **Rollback**: The NCCL & NCCL Tests repositories can be deleted from DGX Spark
-* **Last Updated:** 10/12/2025
-* First publication
+* **Last Updated:** 12/15/2025
+* Use nccl latest version v2.28.9-1
 
 ## Run on two Sparks
 
 ## Step 1. Configure network connectivity
 
-Follow the network setup instructions from the Connect two Sparks playbook to establish connectivity between your DGX Spark nodes.
+Follow the network setup instructions from the [Connect two Sparks](https://build.nvidia.com/spark/connect-two-sparks/stacked-sparks) playbook to establish connectivity between your DGX Spark nodes.
 
 This includes:
 - Physical QSFP cable connection
@@ -67,7 +67,7 @@ architecture support:
 ```bash
 ## Install dependencies and build NCCL
 sudo apt-get update && sudo apt-get install -y libopenmpi-dev
-git clone -b v2.28.3-1 https://github.com/NVIDIA/nccl.git ~/nccl/
+git clone -b v2.28.9-1 https://github.com/NVIDIA/nccl.git ~/nccl/
 cd ~/nccl/
 make -j src.build NVCC_GENCODE="-gencode=arch=compute_121,code=sm_121"
 
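The next hunk's context line references `$NCCL_HOME`, `$CUDA_HOME`, and `$MPI_HOME`. A minimal sketch of how those variables could be set before that export; the paths are assumptions based on the build above and a typical Ubuntu arm64 install, so verify them on your system:

```bash
# Assumed install locations (verify before use): NCCL was built in ~/nccl,
# CUDA is in the default toolkit path, and OpenMPI came from libopenmpi-dev
export NCCL_HOME="$HOME/nccl/build"
export CUDA_HOME="/usr/local/cuda"
export MPI_HOME="/usr/lib/aarch64-linux-gnu/openmpi"
# Matches the export shown in the hunk context below
export LD_LIBRARY_PATH="$NCCL_HOME/lib:$CUDA_HOME/lib64/:$MPI_HOME/lib:$LD_LIBRARY_PATH"
```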
@@ -80,7 +80,7 @@ export LD_LIBRARY_PATH="$NCCL_HOME/lib:$CUDA_HOME/lib64/:$MPI_HOME/lib:$LD_LIBRA
 
 ## Step 3. Build NCCL test suite
 
-Compile the NCCL test suite to validate communication performance:
+Compile the NCCL test suite on **both nodes**:
 
 ```bash
 ## Clone and build NCCL tests
@@ -91,7 +91,7 @@ make MPI=1
 
 ## Step 4. Find the active network interface and IP addresses
 
-Execute multi-node NCCL performance test using the active network interface. First, identify which network ports are available and up:
+First, identify which network ports are available and up:
 
 ```bash
 ## Check network port status
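As a quick sanity check for this step, interfaces that are UP can be listed one per line; interface names vary by system:

```bash
# Brief one-line-per-interface view of interfaces in the UP state,
# with their assigned addresses
ip -br addr show up
```

The interface identified here is typically what gets passed to NCCL via `NCCL_SOCKET_IFNAME` when launching the multi-node test.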
@@ -55,9 +55,8 @@ The Python test script can be found [here on GitHub](https://github.com/NVIDIA/d
 * CUDA toolkit configuration issues may prevent kernel compilation
 * Memory constraints on smaller models require batch size adjustments
 * **Rollback**: Uninstall packages with `pip uninstall unsloth torch torchvision`.
-* **Last Updated:** 11/07/2025
-* Add required python dependencies
-* Fix broken commands to access files on GitHub
+* **Last Updated:** 12/15/2025
+* Upgrade pytorch container and python dependencies to the latest version
 
 ## Instructions
 
@@ -77,28 +76,22 @@ The output should show a summary of GPU information.
 
 ## Step 2. Get the container image
 ```bash
-docker pull nvcr.io/nvidia/pytorch:25.09-py3
+docker pull nvcr.io/nvidia/pytorch:25.11-py3
 ```
 
 ## Step 3. Launch Docker
 ```bash
-docker run --gpus all --ulimit memlock=-1 -it --ulimit stack=67108864 --entrypoint /usr/bin/bash --rm nvcr.io/nvidia/pytorch:25.09-py3
+docker run --gpus all --ulimit memlock=-1 -it --ulimit stack=67108864 --entrypoint /usr/bin/bash --rm nvcr.io/nvidia/pytorch:25.11-py3
 ```
 
 ## Step 4. Install dependencies inside Docker
 
 ```bash
-pip install transformers peft "datasets==4.3.0" "trl==0.19.1"
-pip install --no-deps unsloth unsloth_zoo
-pip install hf_transfer
+pip install transformers peft hf_transfer "datasets==4.3.0" "trl==0.26.1"
+pip install --no-deps unsloth unsloth_zoo bitsandbytes
 ```
 
-## Step 5. Build and install bitsandbytes inside Docker
-```bash
-pip install --no-deps bitsandbytes
-```
-
-## Step 6. Create Python test script
+## Step 5. Create Python test script
 
 Curl the test script [here](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/unsloth/assets/test_unsloth.py) into the container.
 
@@ -109,7 +102,7 @@ curl -O https://raw.githubusercontent.com/NVIDIA/dgx-spark-playbooks/refs/heads/
 We will use this test script to validate the installation with a simple fine-tuning task.
 
 
-## Step 7. Run the validation test
+## Step 6. Run the validation test
 
 Execute the test script to verify Unsloth is working correctly.
 
@@ -122,7 +115,7 @@ Expected output in the terminal window:
 - Training progress bars showing loss decreasing over 60 steps
 - Final training metrics showing completion
 
-## Step 8. Next steps
+## Step 7. Next steps
 
 Test with your own model and dataset by updating the `test_unsloth.py` file:
 