Mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git (synced 2026-04-21 17:43:52 +00:00)

chore: Regenerate all playbooks

This commit is contained in:
parent 5472c97a8c
commit b4e7892d2c
@@ -44,14 +44,14 @@ and proper GPU topology detection.
 * **Duration**: 30 minutes for setup and validation
 * **Risk level**: Medium - involves network configuration changes
 * **Rollback**: The NCCL & NCCL Tests repositories can be deleted from DGX Spark
-* **Last Updated:** 10/12/2025
-* First publication
+* **Last Updated:** 12/15/2025
+* Use nccl latest version v2.28.9-1
 ## Run on two Sparks

 ## Step 1. Configure network connectivity

-Follow the network setup instructions from the Connect two Sparks playbook to establish connectivity between your DGX Spark nodes.
+Follow the network setup instructions from the [Connect two Sparks](https://build.nvidia.com/spark/connect-two-sparks/stacked-sparks) playbook to establish connectivity between your DGX Spark nodes.

 This includes:

 - Physical QSFP cable connection
@@ -67,7 +67,7 @@ architecture support:
 ```bash
 ## Install dependencies and build NCCL
 sudo apt-get update && sudo apt-get install -y libopenmpi-dev
-git clone -b v2.28.3-1 https://github.com/NVIDIA/nccl.git ~/nccl/
+git clone -b v2.28.9-1 https://github.com/NVIDIA/nccl.git ~/nccl/
 cd ~/nccl/
 make -j src.build NVCC_GENCODE="-gencode=arch=compute_121,code=sm_121"
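The `NVCC_GENCODE` value in the hunk above targets compute capability 12.1 (`sm_121`, the DGX Spark GPU). As a minimal sketch of how that flag string is composed for a given capability (the helper name is ours, not part of the playbook):

```python
def nvcc_gencode(major: int, minor: int) -> str:
    """Compose the NVCC -gencode flag for one compute capability."""
    cc = f"{major}{minor}"
    return f"-gencode=arch=compute_{cc},code=sm_{cc}"

# Compute capability 12.1 yields the flag used in the playbook's make line.
print(nvcc_gencode(12, 1))  # -gencode=arch=compute_121,code=sm_121
```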
@@ -80,7 +80,7 @@ export LD_LIBRARY_PATH="$NCCL_HOME/lib:$CUDA_HOME/lib64/:$MPI_HOME/lib:$LD_LIBRA

 ## Step 3. Build NCCL test suite

-Compile the NCCL test suite to validate communication performance:
+Compile the NCCL test suite on **both nodes**:

 ```bash
 ## Clone and build NCCL tests
@@ -91,7 +91,7 @@ make MPI=1

 ## Step 4. Find the active network interface and IP addresses

-Execute multi-node NCCL performance test using the active network interface. First, identify which network ports are available and up:
+First, identify which network ports are available and up:

 ```bash
 ## Check network port status
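The port check in Step 4 above is typically done with `ip` in brief mode. If you want to pick out the UP interface and its IPv4 address programmatically, a small sketch (assuming the standard three-column `ip -br addr` output format; the interface names below are illustrative, not from the playbook):

```python
def up_interfaces(ip_br_addr: str):
    """Return (interface, ipv4) pairs for interfaces reported UP by `ip -br addr`."""
    pairs = []
    for line in ip_br_addr.splitlines():
        fields = line.split()
        # Brief format: IFACE  STATE  ADDR/PREFIX [more addrs...]
        if len(fields) >= 3 and fields[1] == "UP":
            pairs.append((fields[0], fields[2].split("/")[0]))
    return pairs

sample = (
    "lo               UNKNOWN        127.0.0.1/8\n"
    "enP2p1s0f0np0    UP             192.168.100.10/24"
)
print(up_interfaces(sample))  # [('enP2p1s0f0np0', '192.168.100.10')]
```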
@@ -55,9 +55,8 @@ The Python test script can be found [here on GitHub](https://github.com/NVIDIA/d
 * CUDA toolkit configuration issues may prevent kernel compilation
 * Memory constraints on smaller models require batch size adjustments
 * **Rollback**: Uninstall packages with `pip uninstall unsloth torch torchvision`.
-* **Last Updated:** 11/07/2025
-* Add required python dependencies
-* Fix broken commands to access files on GitHub
+* **Last Updated:** 12/15/2025
+* Upgrade pytorch container and python dependencies to the latest version

 ## Instructions

@@ -77,28 +76,22 @@ The output should show a summary of GPU information.

 ## Step 2. Get the container image
 ```bash
-docker pull nvcr.io/nvidia/pytorch:25.09-py3
+docker pull nvcr.io/nvidia/pytorch:25.11-py3
 ```

 ## Step 3. Launch Docker
 ```bash
-docker run --gpus all --ulimit memlock=-1 -it --ulimit stack=67108864 --entrypoint /usr/bin/bash --rm nvcr.io/nvidia/pytorch:25.09-py3
+docker run --gpus all --ulimit memlock=-1 -it --ulimit stack=67108864 --entrypoint /usr/bin/bash --rm nvcr.io/nvidia/pytorch:25.11-py3
 ```
 ## Step 4. Install dependencies inside Docker

 ```bash
-pip install transformers peft "datasets==4.3.0" "trl==0.19.1"
-pip install --no-deps unsloth unsloth_zoo
-pip install hf_transfer
+pip install transformers peft hf_transfer "datasets==4.3.0" "trl==0.26.1"
+pip install --no-deps unsloth unsloth_zoo bitsandbytes
 ```

-## Step 5. Build and install bitsandbytes inside Docker
-```bash
-pip install --no-deps bitsandbytes
-```
-
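After the pip installs in the new Step 4, the pinned versions can be sanity-checked from inside the container. A hedged helper sketch using only the standard library (the function is ours, not part of the playbook; the pins are taken from the install line above):

```python
from importlib import metadata

def version_mismatches(pins: dict) -> dict:
    """Map each pinned package to its installed version when it differs (None if absent)."""
    bad = {}
    for pkg, want in pins.items():
        try:
            have = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            have = None
        if have != want:
            bad[pkg] = have
    return bad

# An empty result means every pin matches what pip installed.
print(version_mismatches({"datasets": "4.3.0", "trl": "0.26.1"}))
```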
-## Step 6. Create Python test script
+## Step 5. Create Python test script

 Curl the test script [here](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/unsloth/assets/test_unsloth.py) into the container.

@@ -109,7 +102,7 @@ curl -O https://raw.githubusercontent.com/NVIDIA/dgx-spark-playbooks/refs/heads/
 We will use this test script to validate the installation with a simple fine-tuning task.

-## Step 7. Run the validation test
+## Step 6. Run the validation test

 Execute the test script to verify Unsloth is working correctly.

@@ -122,7 +115,7 @@ Expected output in the terminal window:
 - Training progress bars showing loss decreasing over 60 steps
 - Final training metrics showing completion

-## Step 8. Next steps
+## Step 7. Next steps

 Test with your own model and dataset by updating the `test_unsloth.py` file:
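Pointing the test at your own model is usually a one-string edit to `test_unsloth.py`. A sketch of making that edit programmatically (the regex and both model names are illustrative assumptions, not taken from the playbook's script):

```python
import re

def set_model_name(script_text: str, new_model: str) -> str:
    """Rewrite the model_name="..." keyword argument in a training script."""
    return re.sub(r'model_name\s*=\s*"[^"]*"',
                  f'model_name="{new_model}"', script_text)

# Illustrative line of the kind test_unsloth.py contains.
line = 'model, tokenizer = FastLanguageModel.from_pretrained(model_name="some/base-model")'
print(set_model_name(line, "my-org/my-model"))
```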