diff --git a/nvidia/nccl/README.md b/nvidia/nccl/README.md
index 3dbf90a..0524adf 100644
--- a/nvidia/nccl/README.md
+++ b/nvidia/nccl/README.md
@@ -44,14 +44,14 @@ and proper GPU topology detection.
 * **Duration**: 30 minutes for setup and validation
 * **Risk level**: Medium - involves network configuration changes
 * **Rollback**: The NCCL & NCCL Tests repositories can be deleted from DGX Spark
-* **Last Updated:** 10/12/2025
- * First publication
+* **Last Updated:** 12/15/2025
+ * Use nccl latest version v2.28.9-1

 ## Run on two Sparks

 ## Step 1. Configure network connectivity

-Follow the network setup instructions from the Connect two Sparks playbook to establish connectivity between your DGX Spark nodes.
+Follow the network setup instructions from the [Connect two Sparks](https://build.nvidia.com/spark/connect-two-sparks/stacked-sparks) playbook to establish connectivity between your DGX Spark nodes.

 This includes:
 - Physical QSFP cable connection
@@ -67,7 +67,7 @@ architecture support:
 ```bash
 ## Install dependencies and build NCCL
 sudo apt-get update && sudo apt-get install -y libopenmpi-dev
-git clone -b v2.28.3-1 https://github.com/NVIDIA/nccl.git ~/nccl/
+git clone -b v2.28.9-1 https://github.com/NVIDIA/nccl.git ~/nccl/
 cd ~/nccl/
 make -j src.build NVCC_GENCODE="-gencode=arch=compute_121,code=sm_121"

@@ -80,7 +80,7 @@ export LD_LIBRARY_PATH="$NCCL_HOME/lib:$CUDA_HOME/lib64/:$MPI_HOME/lib:$LD_LIBRA

 ## Step 3. Build NCCL test suite

-Compile the NCCL test suite to validate communication performance:
+Compile the NCCL test suite on **both nodes**:

 ```bash
 ## Clone and build NCCL tests
@@ -91,7 +91,7 @@ make MPI=1

 ## Step 4. Find the active network interface and IP addresses

-Execute multi-node NCCL performance test using the active network interface. First, identify which network ports are available and up:
+First, identify which network ports are available and up:

 ```bash
 ## Check network port status
diff --git a/nvidia/unsloth/README.md b/nvidia/unsloth/README.md
index 0dfc457..2e47c73 100644
--- a/nvidia/unsloth/README.md
+++ b/nvidia/unsloth/README.md
@@ -55,9 +55,8 @@ The Python test script can be found [here on GitHub](https://github.com/NVIDIA/d
 * CUDA toolkit configuration issues may prevent kernel compilation
 * Memory constraints on smaller models require batch size adjustments
 * **Rollback**: Uninstall packages with `pip uninstall unsloth torch torchvision`.
-* **Last Updated:** 11/07/2025
- * Add required python dependencies
- * Fix broken commands to access files on GitHub
+* **Last Updated:** 12/15/2025
+ * Upgrade pytorch container and python dependencies to the latest version

 ## Instructions

@@ -77,28 +76,22 @@ The output should show a summary of GPU information.

 ## Step 2. Get the container image
 ```bash
-docker pull nvcr.io/nvidia/pytorch:25.09-py3
+docker pull nvcr.io/nvidia/pytorch:25.11-py3
 ```

 ## Step 3. Launch Docker
 ```bash
-docker run --gpus all --ulimit memlock=-1 -it --ulimit stack=67108864 --entrypoint /usr/bin/bash --rm nvcr.io/nvidia/pytorch:25.09-py3
+docker run --gpus all --ulimit memlock=-1 -it --ulimit stack=67108864 --entrypoint /usr/bin/bash --rm nvcr.io/nvidia/pytorch:25.11-py3
 ```

 ## Step 4. Install dependencies inside Docker

 ```bash
-pip install transformers peft "datasets==4.3.0" "trl==0.19.1"
-pip install --no-deps unsloth unsloth_zoo
-pip install hf_transfer
+pip install transformers peft hf_transfer "datasets==4.3.0" "trl==0.26.1"
+pip install --no-deps unsloth unsloth_zoo bitsandbytes
 ```

-## Step 5. Build and install bitsandbytes inside Docker
-```bash
-pip install --no-deps bitsandbytes
-```
-
-## Step 6. Create Python test script
+## Step 5. Create Python test script

 Curl the test script [here](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/unsloth/assets/test_unsloth.py) into the container.

@@ -109,7 +102,7 @@ curl -O https://raw.githubusercontent.com/NVIDIA/dgx-spark-playbooks/refs/heads/

 We will use this test script to validate the installation with a simple fine-tuning task.

-## Step 7. Run the validation test
+## Step 6. Run the validation test

 Execute the test script to verify Unsloth is working correctly.

@@ -122,7 +115,7 @@ Expected output in the terminal window:
 - Training progress bars showing loss decreasing over 60 steps
 - Final training metrics showing completion

-## Step 8. Next steps
+## Step 7. Next steps

 Test with your own model and dataset by updating the `test_unsloth.py` file: